LLM applications can be challenging to evaluate because they often generate conversational text with no single correct answer. This guide shows you how to define an LLM-as-a-judge evaluator for offline evaluation using either the LangSmith SDK or the UI.

Note: To run evaluations in real time on your production traces, refer to setting up online evaluations.
Pre-built evaluators are a useful starting point for setting up evaluations. Refer to pre-built evaluators for how to use them with LangSmith.
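As a sketch of what that can look like, the open-source openevals package (which the LangSmith docs point to for pre-built evaluators) ships ready-made LLM-as-a-judge evaluators. The snippet below is a minimal sketch assuming openevals is installed; the example inputs, outputs, and model choice are illustrative:

```python
# Minimal sketch of a pre-built LLM-as-a-judge evaluator, assuming the
# open-source `openevals` package is installed (pip install openevals).
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

# Build a ready-made correctness judge backed by an OpenAI model.
correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:gpt-4o-mini",
    feedback_key="correctness",
)

# Evaluators like this can be called directly or passed to `evaluate(...)`.
result = correctness_evaluator(
    inputs={"question": "How many moons does Earth have?"},
    outputs={"answer": "One."},
    reference_outputs={"answer": "Earth has one moon."},
)
```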
For complete control of evaluator logic, create your own LLM-as-a-judge evaluator and run it using the LangSmith SDK (Python / TypeScript).

Requires langsmith>=0.2.0
```python
from langsmith import evaluate, traceable, wrappers, Client
from openai import OpenAI

# Assumes you've installed pydantic
from pydantic import BaseModel

# Optionally wrap the OpenAI client to trace all model calls.
oai_client = wrappers.wrap_openai(OpenAI())

def valid_reasoning(inputs: dict, outputs: dict) -> bool:
    """Use an LLM to judge if the reasoning and the answer are consistent."""
    instructions = """\
Given the following question, answer, and reasoning, determine if the reasoning \
for the answer is logically valid and consistent with the question and the answer.\
"""

    class Response(BaseModel):
        reasoning_is_valid: bool

    msg = f"Question: {inputs['question']}\nAnswer: {outputs['answer']}\nReasoning: {outputs['reasoning']}"
    response = oai_client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": msg},
        ],
        response_format=Response,
    )
    return response.choices[0].message.parsed.reasoning_is_valid

# Optionally add the 'traceable' decorator to trace the inputs/outputs of this function.
@traceable
def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

ls_client = Client()
dataset = ls_client.create_dataset("big questions")
examples = [
    {"inputs": {"question": "how will the universe end"}},
    {"inputs": {"question": "are we alone"}},
]
ls_client.create_examples(dataset_id=dataset.id, examples=examples)

results = evaluate(
    dummy_app,
    data=dataset,
    evaluators=[valid_reasoning],
)
```
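Each call to evaluate creates a new experiment on the dataset. If you want more control over how experiments are named and how many examples run in parallel, evaluate also accepts optional experiment_prefix and max_concurrency arguments; a minimal sketch (the prefix string is illustrative):

```python
# Optionally name the experiment and run examples concurrently.
results = evaluate(
    dummy_app,
    data=dataset,
    evaluators=[valid_reasoning],
    experiment_prefix="valid-reasoning-judge",  # illustrative name; a unique suffix is appended
    max_concurrency=4,  # evaluate up to 4 examples at a time
)
```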
See here for more on how to write a custom evaluator.
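As a quick illustration of that flexibility: instead of a bare bool, a custom evaluator can return a dict with a feedback key, a score, and an optional comment, and its signature can request reference_outputs when your dataset examples include reference answers. The heuristic below is hypothetical (not part of this guide) and assumes each example has a reference "answer" field:

```python
def concise_enough(outputs: dict, reference_outputs: dict) -> dict:
    """Hypothetical heuristic evaluator: passes if the app's answer is at
    most twice as long as the reference answer."""
    score = len(outputs["answer"]) <= 2 * len(reference_outputs["answer"])
    # Returning a dict lets you name the feedback key and attach a comment.
    return {
        "key": "concise_enough",
        "score": float(score),
        "comment": f"answer is {len(outputs['answer'])} chars vs {len(reference_outputs['answer'])} reference chars",
    }
```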