We strongly recommend running evals with the Python or TypeScript SDKs, which include many optimizations and features that improve the performance and reliability of your evals. If you are unable to use the SDKs — for example, because you are working in a different language or in a restricted environment — you can call the REST API directly. This guide shows how to run evals against the REST API, using Python's requests library for the examples; the same principles apply in any language.

1. Create a target function endpoint

First, expose the application you want to evaluate (your target function) as an HTTP endpoint. The endpoint must:
  • Accept POST requests
  • Accept JSON input with the example inputs
  • Return JSON output with the results
Here’s an example using FastAPI:
from fastapi import FastAPI
from pydantic import BaseModel
from typing import Dict, Any

app = FastAPI()

class EvaluationInput(BaseModel):
    inputs: Dict[str, Any]

class EvaluationOutput(BaseModel):
    outputs: Dict[str, Any]

@app.post("/evaluate", response_model=EvaluationOutput)
async def evaluate(inputs: EvaluationInput):
    # Your evaluation logic here
    # This is just an example
    result = {"output": f"Processed: {inputs.inputs}"}
    return EvaluationOutput(outputs=result)
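
Before wiring the endpoint into an evaluation, you may want to sanity-check it locally. The module name, host, and port below are assumptions — adjust them to match your setup:
# Assumes the app above lives in main.py and is served with:
#   uvicorn main:app --host 0.0.0.0 --port 8000
import requests

resp = requests.post(
    "http://localhost:8000/evaluate",
    json={"inputs": {"question": "What is 2 + 2?"}},
)
print(resp.json())  # {"outputs": {"output": "Processed: {'question': 'What is 2 + 2?'}"}}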

2. Start the evaluation

With your endpoint running, start the evaluation by sending a POST request that references your dataset, the URL of your target function, and the evaluators you want to run.
import requests
import json

# Your API key
api_key = "your-api-key"

# The dataset ID
dataset_id = "your-dataset-id"

# The URL of your target function
target_url = "https://your-function-url.com/evaluate"

# Start the evaluation
response = requests.post(
    "https://api.smith.langchain.com/evaluations",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    },
    json={
        "dataset_id": dataset_id,
        "target_url": target_url,
        "evaluators": ["your-evaluator-name"],
        "experiment_prefix": "api-evaluation"
    }
)

print(response.json())
The response includes the ID of the newly created evaluation, which you can use to check its status:
evaluation_id = response.json()["id"]

status_response = requests.get(
    f"https://api.smith.langchain.com/evaluations/{evaluation_id}",
    headers={"Authorization": f"Bearer {api_key}"}
)

print(status_response.json())
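
If you want to block until the run finishes, you can poll the status endpoint. This sketch reuses api_key and evaluation_id from above; the "status" field name and its terminal values are assumptions — inspect the response from your deployment for the exact schema:
import time

# Poll until the evaluation reaches a terminal state.
# NOTE: "status", "completed", and "failed" are assumed names/values;
# check status_response.json() to confirm the actual schema.
while True:
    status_response = requests.get(
        f"https://api.smith.langchain.com/evaluations/{evaluation_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    status = status_response.json().get("status")
    if status in ("completed", "failed"):
        break
    time.sleep(5)  # wait a few seconds between polls

print(f"Evaluation finished with status: {status}")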

3. Fetch the results

Once the evaluation has finished, you can fetch its results:
results_response = requests.get(
    f"https://api.smith.langchain.com/evaluations/{evaluation_id}/results",
    headers={"Authorization": f"Bearer {api_key}"}
)

print(results_response.json())
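
Finally, it can be handy to persist the raw payload for later inspection or downstream analysis. This makes no assumptions about the shape of the results beyond it being JSON:
import json

# Write the raw results payload to disk for later inspection.
with open("evaluation_results.json", "w") as f:
    json.dump(results_response.json(), f, indent=2)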