This guide uses the `openevals` package, which contains prebuilt evaluators and other convenient resources for evaluating your AI apps. It will also use OpenAI models, though you can use other providers as well.

If you are using `yarn` as your package manager, you will also need to manually install `@langchain/core` as a peer dependency of `openevals`. This is not required for LangSmith evals in general.

The simulation requires two components:

- `app`: Your application, or a function wrapping it. Must accept a single chat message (a dict with “role” and “content” keys) as an input arg and a `thread_id` as a kwarg. Should accept other kwargs, as more may be added in future releases. Returns a chat message as output with at least “role” and “content” keys.
- `user`: The simulated user. In this guide, we will use an imported prebuilt function named `create_llm_simulated_user`, which uses an LLM to generate user responses, though you can create your own too.

`openevals` passes a single chat message from the `user` to your `app` for each turn, so you should statefully track the current conversation history internally, keyed by `thread_id`, if needed.
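For instance, a minimal `app` that wraps a single OpenAI chat completions call and tracks history per `thread_id` might look like the following sketch (the model name, system prompt, and in-memory `history` dict are illustrative choices, not requirements):

```python
from openai import OpenAI

client = OpenAI()

# Simple in-memory store of conversation history, keyed by thread_id
history = {}

def app(inputs: dict, *, thread_id: str, **kwargs) -> dict:
    # The simulator only sends the latest user message, so track the
    # full conversation for this thread ourselves.
    messages = history.setdefault(thread_id, [])
    messages.append(inputs)

    res = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works here
        messages=[
            {
                "role": "system",
                "content": "You are a patient and understanding customer support agent.",
            },
            *messages,
        ],
    )

    # Return a chat message with at least "role" and "content" keys
    response = {
        "role": "assistant",
        "content": res.choices[0].message.content,
    }
    messages.append(response)
    return response
```

Keeping history keyed by `thread_id` lets the same `app` serve multiple concurrent simulations without mixing conversations.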
Here’s an example that simulates a multi-turn customer support interaction using the simple chat app above, which wraps a single call to the OpenAI chat completions API; in your own code, this is where you would call your application or agent. In this example, our simulated user plays the role of a particularly aggressive customer:
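Below is a sketch of creating the simulated user and running the simulation, assuming openevals’ `run_multiturn_simulation` helper and its provider-prefixed `model` string convention; the persona and model name are examples only:

```python
from openevals.simulators import create_llm_simulated_user, run_multiturn_simulation

# The simulated user: an LLM playing a particularly aggressive customer
user = create_llm_simulated_user(
    system="You are an angry customer who demands a refund and is not easily satisfied.",
    model="openai:gpt-4o-mini",
)

# Run the simulation against the `app` defined above
result = run_multiturn_simulation(
    app=app,
    user=user,
    max_turns=5,
)

print(result)
```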
The simulation starts with the `user`, then passes response chat messages back and forth until it reaches `max_turns` (you can alternatively pass a `stopping_condition` that takes the current trajectory and returns `True` or `False`; see the OpenEvals README for more information). The return value is the final list of chat messages that make up the conversation’s trajectory.
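For example, a stopping condition could end the run once the app appears to have resolved the issue. Here is a sketch that assumes the condition receives the current trajectory as a `trajectory` argument (check the OpenEvals README for the exact signature), with a hypothetical resolution check:

```python
def issue_resolved(trajectory, **kwargs):
    # End the simulation once the app's latest reply mentions a refund.
    # This check is hypothetical; adapt it to your own app's behavior.
    last_message = trajectory[-1]
    content = (
        last_message["content"]
        if isinstance(last_message, dict)
        else getattr(last_message, "content", "")
    )
    return "refund" in (content or "").lower()
```

You would then pass `stopping_condition=issue_resolved` when running the simulation.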
This trajectory contains chat messages from `app` and `user` interleaved:
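The result might look something like the following (contents abridged and purely illustrative):

```python
[
    {"role": "user", "content": "I want a refund for my order, and I want it NOW."},
    {"role": "assistant", "content": "I'm sorry to hear that. Let me pull up your order..."},
    {"role": "user", "content": "That's not good enough. I want my money back today."},
    {"role": "assistant", "content": "I completely understand your frustration. Here's what I can do..."},
    # ...and so on, until max_turns or a stopping condition is reached
]
```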
You can run these simulations as part of your evals with LangSmith’s `pytest` (Python-only), `Vitest`/`Jest` (JS only), or `evaluate` runners.
When using the `pytest` or `Vitest`/`Jest` runners, pass your evaluators via the `trajectory_evaluators` param when running the simulation. These evaluators will run at the end of the simulation, taking the final list of chat messages as an `outputs` kwarg. Your passed `trajectory_evaluator` must therefore accept this kwarg.
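For example, a `pytest` test might look like the following sketch. It reuses the `app` defined above, creates an LLM-as-judge trajectory evaluator with openevals’ `create_llm_as_judge`, and uses LangSmith’s `pytest` integration (the `@pytest.mark.langsmith` decorator and `langsmith.testing` logging helpers); the persona, opening message, model names, and feedback key are all illustrative:

```python
import pytest
from langsmith import testing as t

from openevals.llm import create_llm_as_judge
from openevals.simulators import create_llm_simulated_user, run_multiturn_simulation

# An LLM-as-judge evaluator; it receives the final list of chat messages
# as the `outputs` kwarg when the simulation ends.
satisfaction_evaluator = create_llm_as_judge(
    model="openai:gpt-4o-mini",
    prompt="Based on the conversation below, was the user ultimately satisfied?\n{outputs}",
    feedback_key="satisfaction",
)

@pytest.mark.langsmith
def test_aggressive_customer_interaction():
    # Start the conversation with a fixed opening message so the test
    # input is deterministic and can be logged to your dataset.
    opening_message = {"role": "user", "content": "I demand a refund for my order immediately!"}
    user = create_llm_simulated_user(
        system="You are an angry customer who demands a refund and is not easily satisfied.",
        model="openai:gpt-4o-mini",
        fixed_responses=[opening_message],
    )

    t.log_inputs({"opening_message": opening_message})

    result = run_multiturn_simulation(
        app=app,  # the app defined earlier in this guide
        user=user,
        max_turns=5,
        trajectory_evaluators=[satisfaction_evaluator],
    )

    t.log_outputs({"result": result})
```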
Feedback from your `trajectory_evaluators` will be logged against the test run, adding it to the experiment. Note also that the test case uses the `fixed_responses` param on the simulated user to start the conversation with a specific input, which you can log and make part of your stored dataset. You may also find it convenient to make the simulated user’s system prompt part of your logged dataset as well.
You can also use the `evaluate` runner to evaluate simulated multi-turn interactions. This will be a little bit different from the `pytest`/`Vitest`/`Jest` example in the following ways:
- You should run the simulation within your `target` function, and your target function should return the final trajectory.
- The returned trajectory will be the `outputs` that LangSmith will pass to your evaluators.
- Instead of a `trajectory_evaluators` param, you should pass your evaluators as a param into the `evaluate()` method.
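Here is a sketch, assuming your dataset examples store the conversation’s opening message in a field named `first_message` (an illustrative name) and that the simulator’s return value is the final trajectory as described above:

```python
from langsmith import Client

from openevals.llm import create_llm_as_judge
from openevals.simulators import create_llm_simulated_user, run_multiturn_simulation

client = Client()

satisfaction_evaluator = create_llm_as_judge(
    model="openai:gpt-4o-mini",
    prompt="Based on the conversation below, was the user ultimately satisfied?\n{outputs}",
    feedback_key="satisfaction",
)

def target(inputs: dict):
    user = create_llm_simulated_user(
        system="You are an angry customer who demands a refund and is not easily satisfied.",
        model="openai:gpt-4o-mini",
        # Start each conversation from the dataset example's opening message
        fixed_responses=[{"role": "user", "content": inputs["first_message"]}],
    )
    trajectory = run_multiturn_simulation(
        app=app,  # the app defined earlier in this guide
        user=user,
        max_turns=5,
    )
    # Return the final trajectory; LangSmith passes it to your evaluators
    # as `outputs`.
    return trajectory

client.evaluate(
    target,
    data="Multi-turn support conversations",  # your dataset name
    evaluators=[satisfaction_evaluator],
)
```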
The simulated user’s persona is set by the `system` parameter passed into `create_llm_simulated_user`. If you would like to use a different persona for specific items in your dataset, you can update your dataset examples to also contain an extra field with the desired `system` prompt, then pass that field in when creating your simulated user like this:
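For example, if each dataset example also has a `persona` field (an illustrative name) containing the desired system prompt, the target function could read it when constructing the simulated user:

```python
def target(inputs: dict):
    user = create_llm_simulated_user(
        # "persona" is an illustrative field name on your dataset examples
        system=inputs["persona"],
        model="openai:gpt-4o-mini",
        fixed_responses=[{"role": "user", "content": inputs["first_message"]}],
    )
    trajectory = run_multiturn_simulation(app=app, user=user, max_turns=5)
    return trajectory
```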