`evaluate` method as follows:
- `inputs: list[dict]`: A list of the inputs corresponding to a single example in a dataset.
- `outputs: list[dict]`: A list of the dict outputs produced by each experiment on the given inputs.
- `reference_outputs/referenceOutputs: list[dict]`: A list of the reference outputs associated with the example, if available.
- `runs: list[Run]`: A list of the full Run objects generated by the two experiments on the given example. Use this if you need access to intermediate steps or metadata about each run.
- `examples: list[Example]`: All of the dataset Example objects, including the example inputs, outputs (if available), and metadata (if available).
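
For illustration, here is a minimal sketch of such an evaluator, assuming it may declare just the arguments it needs (`outputs` and `reference_outputs` here); the function name and the `"answer"` key it compares are assumptions made for this example, not part of the API:

```python
def exact_match_rate(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    # Compare each experiment output against the reference output for the
    # same example; the "answer" key is a hypothetical field in both dicts.
    matches = sum(
        out.get("answer") == ref.get("answer")
        for out, ref in zip(outputs, reference_outputs)
    )
    # Dict return form (described below): an explicit score plus a metric name.
    score = matches / len(outputs) if outputs else 0.0
    return {"score": score, "name": "exact_match_rate"}
```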
The evaluator can return either of the following types:

- `dict`: dicts of the form `{"score": ..., "name": ...}` allow you to pass a numeric or boolean score and a metric name.
- `int | float | bool`: this is interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
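
As a sketch of the scalar return form wired into an experiment, assuming the `summary_evaluators` parameter of `Client.evaluate`; the target function, dataset name, and `"question"`/`"answer"` keys are hypothetical and will differ in your setup:

```python
from langsmith import Client

client = Client()


def target(inputs: dict) -> dict:
    # Hypothetical system under test: echoes the question back as the answer.
    return {"answer": inputs.get("question", "")}


def avg_answer_length(outputs: list[dict]) -> float:
    # Scalar return form: the float is treated as a continuous metric, and the
    # function name ("avg_answer_length") is used as the metric name.
    lengths = [len(out.get("answer", "")) for out in outputs]
    return sum(lengths) / len(lengths) if lengths else 0.0


results = client.evaluate(
    target,
    data="my-dataset",  # assumed: a dataset with this name already exists
    summary_evaluators=[avg_answer_length],  # the exact_match_rate sketch above could be added too
)
```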