- `key`: The name of the metric.
- `score` | `value`: The value of the metric. Use `score` if it’s a numerical metric and `value` if it’s categorical.
- `comment` (optional): The reasoning or additional string information justifying the score.
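For example, a custom evaluator can return this shape directly. Below is a minimal sketch, assuming a recent LangSmith Python SDK that accepts evaluator functions declaring `outputs` and `reference_outputs` arguments; the `answer` field name is an illustrative assumption about your app's schema.

```python
# Minimal sketch of a custom evaluator that returns the metric shape described above.
def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    matched = outputs.get("answer") == reference_outputs.get("answer")
    return {
        "key": "exact_match",          # name of the metric
        "score": 1 if matched else 0,  # numerical metric -> use "score"
        "comment": "Output matches the reference."
        if matched
        else "Output differs from the reference.",
    }
```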
Evaluations can also be written with testing frameworks such as `pytest` or Vitest/Jest out of convenience. LangSmith integrates with both `pytest` and Vitest/Jest, which makes it easy to define evaluations as tests.
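As a rough sketch of what a test-based evaluation can look like, assuming a recent `langsmith` Python SDK that ships the `pytest` plugin (`my_app` and the logged field names are placeholders):

```python
import pytest
from langsmith import testing as t  # test-logging helpers from the pytest plugin

@pytest.mark.langsmith  # log this test's inputs/outputs/results to LangSmith
def test_answers_capital_question():
    question = "What is the capital of France?"
    t.log_inputs({"question": question})
    answer = my_app(question)           # placeholder for your target function
    t.log_outputs({"answer": answer})
    assert "Paris" in answer            # a plain assertion doubles as the evaluation
```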
The assistant node is an LLM that determines whether to invoke a tool based upon the input. The tool condition checks whether a tool was selected by the assistant node and, if so, routes to the tool node. The tool node executes the tool and returns the output as a tool message to the assistant node. This loop continues as long as the assistant node selects a tool. If no tool is selected, the agent returns the LLM response directly.
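This loop can be wired up in a few lines with LangGraph's prebuilt helpers. The sketch below assumes `langgraph` and `langchain-openai` are installed; the model name and the `get_weather` tool are illustrative.

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import START, MessagesState, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition

@tool
def get_weather(city: str) -> str:
    """Return a canned weather report (illustrative tool)."""
    return f"It is sunny in {city}."

llm_with_tools = ChatOpenAI(model="gpt-4o-mini").bind_tools([get_weather])

def assistant(state: MessagesState):
    # Assistant node: the LLM decides whether to call a tool.
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

builder = StateGraph(MessagesState)
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode([get_weather]))           # tool node
builder.add_edge(START, "assistant")
builder.add_conditional_edges("assistant", tools_condition)  # tool condition
builder.add_edge("tools", "assistant")                       # loop back to the assistant
graph = builder.compile()
```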
- Final response: Evaluate the agent’s final response.
- Single step: Evaluate any agent step in isolation (e.g., whether it selects the appropriate tool).
- Trajectory: Evaluate whether the agent took the expected path (e.g., of tool calls) to arrive at the final answer, as sketched below.
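A trajectory evaluator can be as simple as comparing the ordered list of tool calls against an expected list. This sketch assumes the target's outputs expose the tool names it called under a `tool_calls` field, which is an assumption about your app's schema rather than a LangSmith requirement.

```python
# Sketch of a trajectory evaluator: exact match on the sequence of tool calls.
def trajectory_match(outputs: dict, reference_outputs: dict) -> dict:
    actual = outputs.get("tool_calls", [])                       # e.g. ["search", "calculator"]
    expected = reference_outputs.get("expected_tool_calls", [])
    return {
        "key": "trajectory_match",
        "score": 1 if actual == expected else 0,
        "comment": f"expected {expected}, got {actual}",
    }
```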
For more background on RAG, see the RAG From Scratch series. LLM-as-judge is a commonly used evaluator for RAG because it’s an effective way to evaluate factual accuracy or consistency between texts.
- Offline evaluation: Use offline evaluation for any prompts that rely on a reference answer. This is most commonly used for RAG answer correctness evaluation, where the reference is a ground truth (correct) answer.
- Online evaluation: Employ online evaluation for any reference-free prompts. This allows you to assess the RAG application’s performance in real-time scenarios.
- Pairwise evaluation: Utilize pairwise evaluation to compare answers produced by different RAG chains. This evaluation focuses on user-specified criteria (e.g., answer format or style) rather than correctness, which can be evaluated using self-consistency or a ground truth reference.
| Evaluator | Detail | Needs reference output | LLM-as-judge? | Pairwise relevant |
|---|---|---|---|---|
| Document relevance | Are documents relevant to the question? | No | Yes - prompt | No |
| Answer faithfulness | Is the answer grounded in the documents? | No | Yes - prompt | No |
| Answer helpfulness | Does the answer help address the question? | No | Yes - prompt | No |
| Answer correctness | Is the answer consistent with a reference answer? | Yes | Yes - prompt | No |
| Pairwise comparison | How do multiple answer versions compare? | No | Yes - prompt | Yes |
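As a concrete example of the reference-based case, an answer correctness evaluator can be an LLM-as-judge prompt with structured output. The sketch below assumes `langchain-openai` is installed; the model name, prompt wording, and the `question`/`answer` field names are illustrative assumptions.

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class CorrectnessGrade(BaseModel):
    correct: bool = Field(description="Is the answer consistent with the reference answer?")
    reasoning: str = Field(description="One-sentence justification for the grade.")

correctness_judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(CorrectnessGrade)

def answer_correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    grade = correctness_judge.invoke(
        f"Question: {inputs['question']}\n"
        f"Reference answer: {reference_outputs['answer']}\n"
        f"Candidate answer: {outputs['answer']}\n"
        "Is the candidate answer factually consistent with the reference answer?"
    )
    return {"key": "answer_correctness", "score": int(grade.correct), "comment": grade.reasoning}
```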
Developer-curated examples of texts to summarize are commonly used for evaluation (see a dataset example here). However, user logs from a production (summarization) app can be used for online evaluation with any of the reference-free evaluation prompts below.

LLM-as-judge is typically used for evaluation of summarization (as well as other types of writing) with reference-free prompts that grade a summary against provided criteria. It is less common to provide a particular reference summary, because summarization is a creative task and there are many possible correct answers.

Online or offline evaluation is feasible because the prompts are reference-free. Pairwise evaluation is also a powerful way to compare different summarization chains (e.g., different summarization prompts or LLMs):
| Use Case | Detail | Needs reference output | LLM-as-judge? | Pairwise relevant |
|---|---|---|---|---|
| Factual accuracy | Is the summary accurate relative to the source documents? | No | Yes - prompt | Yes |
| Faithfulness | Is the summary grounded in the source documents (e.g., no hallucinations)? | No | Yes - prompt | Yes |
| Helpfulness | Is the summary helpful relative to the user need? | No | Yes - prompt | Yes |
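Because these prompts are reference-free, the evaluator only needs the inputs and outputs, which is what makes it usable online as well as offline. A sketch, again assuming `langchain-openai`; the `document`/`summary` field names and prompt wording are illustrative.

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class FaithfulnessGrade(BaseModel):
    faithful: bool = Field(description="Is every claim in the summary supported by the source?")
    reasoning: str = Field(description="Brief justification for the grade.")

faithfulness_judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(FaithfulnessGrade)

# Note: no reference_outputs argument -- the grade depends only on inputs and outputs.
def summary_faithfulness(inputs: dict, outputs: dict) -> dict:
    grade = faithfulness_judge.invoke(
        f"Source document:\n{inputs['document']}\n\n"
        f"Summary:\n{outputs['summary']}\n\n"
        "Is every claim in the summary supported by the source document?"
    )
    return {"key": "faithfulness", "score": int(grade.faithful), "comment": grade.reasoning}
```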
Classification/tagging evaluation depends on whether you have ground truth reference labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity) to an input (e.g., text or a user question). However, if ground truth class labels are provided, then the evaluation objective is to score a classification/tagging chain against the ground truth class labels (e.g., using metrics such as precision and recall).

If ground truth reference labels are provided, it’s common to simply define a custom heuristic evaluator that compares the ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use LLM-as-judge to perform the classification/tagging of an input based upon specified criteria (without a ground truth reference).
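When ground truth labels exist, the heuristic evaluator can be a one-liner comparing the predicted label to the reference label. The `label` field name below is an illustrative assumption about your chain's output schema.

```python
# Sketch of a heuristic classification evaluator: exact match on the predicted label.
def label_match(outputs: dict, reference_outputs: dict) -> dict:
    return {
        "key": "label_match",
        "score": int(outputs.get("label") == reference_outputs.get("label")),
    }
```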
Online or offline evaluation is feasible when using LLM-as-judge with a reference-free prompt. In particular, this is well suited to online evaluation when a user wants to tag/classify application inputs (e.g., for toxicity).
| Use Case | Detail | Needs reference output | LLM-as-judge? | Pairwise relevant |
|---|---|---|---|---|
| Accuracy | Standard definition | Yes | No | No |
| Precision | Standard definition | Yes | No | No |
| Recall | Standard definition | Yes | No | No |
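Metrics like precision and recall are aggregates over the whole experiment rather than per-example scores. A sketch of the aggregation for a binary "toxic" label is shown below; LangSmith exposes experiment-level metrics via `summary_evaluators` on `evaluate`, but the argument shape accepted by your SDK version is an assumption here, so treat the signature as illustrative.

```python
# Sketch of computing precision across all examples in an experiment.
def precision_summary(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    tp = sum(
        1
        for o, r in zip(outputs, reference_outputs)
        if o.get("label") == "toxic" and r.get("label") == "toxic"
    )
    fp = sum(
        1
        for o, r in zip(outputs, reference_outputs)
        if o.get("label") == "toxic" and r.get("label") != "toxic"
    )
    return {"key": "precision", "score": tp / (tp + fp) if (tp + fp) else 0.0}
```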
Experiments can be repeated by passing the `num_repetitions` argument to `evaluate` / `aevaluate` (Python, TypeScript). Repeating the experiment involves both re-running the target function to generate outputs and re-running the evaluators.

To learn more about running repetitions on experiments, read the how-to guide.
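For example, a repeated run might look like the following sketch; the dataset name, the `my_app` target, and the `exact_match` evaluator are placeholders.

```python
from langsmith import evaluate

results = evaluate(
    my_app,                    # target function under test (placeholder)
    data="my-dataset",         # name of an existing LangSmith dataset (placeholder)
    evaluators=[exact_match],  # e.g., the custom evaluator sketched earlier
    num_repetitions=3,         # run the target and evaluators 3x per example
)
```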
By passing the `max_concurrency` argument to `evaluate` / `aevaluate`, you can specify the concurrency of your experiment. The `max_concurrency` argument has slightly different semantics depending on whether you are using `evaluate` or `aevaluate`.
For `evaluate`, the `max_concurrency` argument specifies the maximum number of concurrent threads to use when running the experiment. This applies both to running your target function and to running your evaluators.
For `aevaluate`, the `max_concurrency` argument is fairly similar, but it instead uses a semaphore to limit the number of concurrent tasks that can run at once. `aevaluate` works by creating a task for each example in the dataset. Each task consists of running the target function as well as all of the evaluators on that specific example. The `max_concurrency` argument specifies the maximum number of concurrent tasks (in other words, examples) to run at once.
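A sketch of capping concurrency with `aevaluate`; the async target `my_async_app`, the dataset name, and the evaluator are placeholders.

```python
import asyncio
from langsmith import aevaluate

async def main():
    await aevaluate(
        my_async_app,          # async target function (placeholder)
        data="my-dataset",     # existing dataset name (placeholder)
        evaluators=[exact_match],
        max_concurrency=4,     # at most 4 examples (tasks) in flight at once
    )

asyncio.run(main())
```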
Set `LANGSMITH_TEST_CACHE` to a valid folder on your device with write access. This will cause the API calls made in your experiment to be cached to disk, meaning future experiments that make the same API calls will be greatly sped up.
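For example, this could be configured from Python before running the experiment; the cache directory path is illustrative and just needs to be writable.

```python
import os

# Cache API calls made during experiments to disk so repeat runs are faster.
os.environ["LANGSMITH_TEST_CACHE"] = "tests/cassettes"
```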