Before reading this guide, it may help to read the guides on versioning datasets and on fetching examples.

Using list_examples

evaluate / aevaluate accept an iterable of examples as the data argument, so you can evaluate on a particular version of a dataset. Use list_examples / listExamples with as_of / asOf to fetch the examples at a particular version tag, then pass the result to the data argument.
from langsmith import Client

ls_client = Client()

# Assumes actual outputs have a 'class' key.
# Assumes example outputs have a 'label' key.
def correct(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["class"] == reference_outputs["label"]

results = ls_client.evaluate(
    lambda inputs: {"class": "Not toxic"},
    # Pass in filtered data here:
    data=ls_client.list_examples(
        dataset_name="Toxic Queries",
        as_of="latest",  # specify version here
    ),
    evaluators=[correct],
)
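The example above pins the version with a tag. If the version comes from configuration or a command-line flag, a small helper can decide what to pass as as_of. The sketch below is a minimal illustration, assuming as_of also accepts a timestamp in addition to a tag (check the list_examples reference for your SDK version). resolve_as_of is a hypothetical helper and is not part of the LangSmith SDK.

```python
from datetime import datetime, timezone
from typing import Union


def resolve_as_of(version: str) -> Union[str, datetime]:
    """Hypothetical helper: turn a user-supplied version string into an as_of value.

    ISO 8601 timestamps become timezone-aware datetimes; anything else is
    returned unchanged and treated as a version tag such as "latest".
    """
    try:
        parsed = datetime.fromisoformat(version)
    except ValueError:
        # Not a timestamp, so assume it is a tag name.
        return version
    # Assumption: timestamps without an offset are meant as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


# Usage sketch (not executed here, requires a LangSmith client and API key):
#   data = ls_client.list_examples(
#       dataset_name="Toxic Queries",
#       as_of=resolve_as_of("latest"),
#   )
```

Keeping the tag-or-timestamp decision in one place lets the rest of the evaluation code stay the same regardless of how the version was specified.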
  • Learn more about how to fetch views of a dataset here