Evaluation

Evaluation and evaluator endpoints.

List evaluators

get

List evaluators for the authenticated project, optionally filtered by container.

Authorizations
Authorization (string, required)

API key in the Authorization header, using the Bearer scheme.

Query parameters
container (string | null, optional)
limit (integer, optional) · min: 1 · max: 100 · default: 20
offset (integer, optional) · default: 0
Responses
200: Successful Response (application/json)

object (const: list, optional) · default: list
Object type identifier.

total (integer, required)
Total number of items available for this resource.

GET /v1/evaluators
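As a sketch of paging through this endpoint, the helper below builds the request URL with the documented query parameters. The base URL is a hypothetical placeholder, not part of the API reference above:

```python
import urllib.parse

BASE_URL = "https://api.example.com"  # hypothetical host; substitute your deployment

def list_evaluators_url(container=None, limit=20, offset=0):
    """Build the GET /v1/evaluators URL. `container` filters by
    container, and limit/offset page through results (the API caps
    limit at 100)."""
    params = {"limit": limit, "offset": offset}
    if container is not None:
        params["container"] = container
    return f"{BASE_URL}/v1/evaluators?{urllib.parse.urlencode(params)}"

# Second page of 50 results for one container (label is a placeholder):
url = list_evaluators_url(container="prod", limit=50, offset=50)
```

Send the URL with your usual HTTP client, passing the Bearer Authorization header described above.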

Create an evaluator

post

Create a new evaluator scoped to a project or container.

Authorizations
Authorization (string, required)

API key in the Authorization header, using the Bearer scheme.

Body
container (string | null, optional)
Container label.

type (string enum, required)
Evaluator type. Possible values:

model (string | null, optional)
Judge model slug.

prompt (string | null, optional)
Judge prompt.

source (string | null, optional)
Code evaluator source.

requirements (string[] | null, optional)
Optional requirements list.

pass_threshold (number | null, optional)
Optional pass threshold.

mode (string enum, optional) · default: pointwise
Evaluation mode. Possible values:

metadata (nullable, optional)
Optional metadata.

name (string | null, optional)
Evaluator name.

description (string | null, optional)
Evaluator description.

api (string | null, optional)
API type.

inference_parameters (nullable, optional)
Optional inference parameters for judge evaluators (e.g. response_format, temperature).
Responses
POST /v1/evaluators
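A minimal sketch of building the request body. Only `type` is required; the `"judge"` type value and the model/prompt strings below are stand-ins for illustration, not values documented above (the allowed enum values are not reproduced here):

```python
import json

def create_evaluator_body(type_, **fields):
    """Build a POST /v1/evaluators body: `type` is required, every
    other field is optional, and fields set to None are dropped
    rather than sent as null."""
    body = {"type": type_}
    body.update({k: v for k, v in fields.items() if v is not None})
    return json.dumps(body)

# Hypothetical judge-style evaluator; all values are placeholders.
payload = create_evaluator_body(
    "judge",
    model="judge-model-slug",
    prompt="Score the answer for factual accuracy from 0 to 1.",
    pass_threshold=0.7,
    mode="pointwise",  # the documented default
)
```

POST the payload to /v1/evaluators with a JSON content type and the Bearer Authorization header.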

Get an evaluator

get

Fetch a single evaluator by id or name within the authenticated project.

Authorizations
Authorization (string, required)

API key in the Authorization header, using the Bearer scheme.

Path parameters
evaluator (string, required)
Responses
200: Successful Response (application/json)
GET /v1/evaluators/{evaluator}

Update an evaluator

patch

Update an existing evaluator by id or name.

Authorizations
Authorization (string, required)

API key in the Authorization header, using the Bearer scheme.

Path parameters
evaluator (string, required)
Body

Patchable fields for an evaluator.

container (string | null, optional)
Container label.

model (string | null, optional)
Judge model slug.

prompt (string | null, optional)
Judge prompt.

source (string | null, optional)
Code evaluator source.

requirements (string[] | null, optional)
Optional requirements list.

pass_threshold (number | null, optional)
Optional pass threshold.

mode (string enum, nullable, optional)
Evaluation mode. Possible values:

metadata (nullable, optional)
Optional metadata.

name (string | null, optional)
Evaluator name.

description (string | null, optional)
Evaluator description.

api (string | null, optional)
API type.

inference_parameters (nullable, optional)
Optional inference parameters for judge evaluators (e.g. response_format, temperature).
Responses
200: Successful Response (application/json)
PATCH /v1/evaluators/{evaluator}
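Since every body field is optional, a PATCH carries only the fields you want to change. The field values below are hypothetical, and whether an explicit null clears a nullable field is an assumption, not something the schema above states:

```python
import json

# Only the fields present in the body are modified; omitted fields
# are left untouched on the evaluator.
patch = {
    "name": "strict-judge",  # hypothetical new name
    "pass_threshold": 0.9,   # tighten the pass threshold
}
payload = json.dumps(patch)
```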

Delete an evaluator

delete

Delete an evaluator by id.

Authorizations
Authorization (string, required)

API key in the Authorization header, using the Bearer scheme.

Path parameters
evaluator_id (string, required)
Responses
200: Successful Response (application/json)

object (const: evaluator, optional) · default: evaluator
Object type.

id (string, required)
Evaluator id.

deleted (const, optional) · default: true
Deletion status.
DELETE /v1/evaluators/{evaluator_id}
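A sketch of interpreting the deletion confirmation; the evaluator id in the sample response is a made-up placeholder:

```python
import json

# Shape of a successful DELETE response, following the schema above.
response_body = '{"object": "evaluator", "id": "ev_abc123", "deleted": true}'
confirmation = json.loads(response_body)

def was_deleted(confirmation):
    """True when the API confirmed deletion of the evaluator."""
    return (confirmation.get("object") == "evaluator"
            and confirmation.get("deleted") is True)
```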

List evaluation runs

get

List evaluation runs for the authenticated project. Optionally filter by container and status.

Authorizations
Authorization (string, required)

API key in the Authorization header, using the Bearer scheme.

Query parameters
container (string | null, optional)
Container ID or label to filter by.

status (string | null, optional)
Filter by run status (e.g. 'running', 'completed', 'error').

limit (integer, optional) · min: 1 · max: 100 · default: 20
offset (integer, optional) · default: 0
Responses
200: Successful Response (application/json)

object (const: list, optional) · default: list
Object type identifier.

total (integer, required)
Total number of items available for this resource.

GET /v1/evaluation/runs

Create an evaluation run

post

Launch a standalone evaluation run. Validates access to the specified container, evaluators, data sources, and models, then dispatches the evaluation to the Modal backend.

Authorizations
Authorization (string, required)

API key in the Authorization header, using the Bearer scheme.

Body

Request body for creating an evaluation run.

Each side of the evaluation (sample and ground_truth) is described by a single data-source object whose type discriminator determines how data is obtained:

  • "dataset" — pull from a dataset.
  • "container" — pull from the container's task logs.
  • "generate" — generate completions using one or more models.

Both fields are optional, but at least one must be provided. When a side is omitted it defaults to the top-level container's task logs. At least one resolved side must not be type='generate' so there is seed input to evaluate against.

container (string, required)
Container id or label.

evaluators (string[], required) · min: 1
Evaluator ids or labels.

sample (any of, nullable, optional)
Sample-side data source. Omit to default to the container's task logs. Use type='dataset' to pull from a dataset, type='container' to pull from task logs, or type='generate' to generate completions with the specified models.

ground_truth (any of, nullable, optional)
Ground-truth-side data source. Omit to default to the container's task logs. Use type='dataset' to pull from a dataset, type='container' to pull from task logs, or type='generate' to generate completions with a model.

metadata (nullable, optional)
Optional metadata.

environment (string | null, optional) · default: main
Execution environment name (maps to Modal app suffix).
Responses
POST /v1/evaluation/runs
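The side-resolution rules in the request description can be sanity-checked client-side before submitting. This sketch mirrors them; the data-source objects carry only the `type` discriminator because their remaining fields are not reproduced in the schema above:

```python
def check_sides(sample=None, ground_truth=None):
    """Mirror the request rules: at least one side must be provided,
    an omitted side resolves to the container's task logs (a
    'container' source), and at least one resolved side must not be
    type='generate', so there is seed input to evaluate against."""
    if sample is None and ground_truth is None:
        raise ValueError("provide at least one of sample / ground_truth")
    resolved = [s if s is not None else {"type": "container"}
                for s in (sample, ground_truth)]
    if all(s.get("type") == "generate" for s in resolved):
        raise ValueError("at least one side must not be type='generate'")

# OK: ground_truth is omitted, so it resolves to container task logs.
check_sides(sample={"type": "generate"})
```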

Get an evaluation run

get

Retrieve a single evaluation run by ID within the authenticated project.

Authorizations
Authorization (string, required)

API key in the Authorization header, using the Bearer scheme.

Path parameters
run_id (string, required)
Responses
200: Successful Response (application/json)

Response model for an evaluation run.

created_at (string, required)
finished_at (string | null, optional)
error_at (string | null, optional)
status (string, required)
error (any | null, optional)

object (const: evaluation.run, optional) · default: evaluation.run
Object type.

id (string, required)
Evaluation run id (run group id).

process_id (string | null, optional)
Process id for lifecycle tracking.

evaluators (string[] | null, optional)
Evaluator ids used in this run.

container (string | null, optional)
Container id.

dataset_id (string | null, optional)
Dataset id (if a dataset was used).

sample (any of, nullable, optional)
Resolved sample-side data source.

ground_truth (any of, nullable, optional)
Resolved ground-truth-side data source.

results (nullable, optional)
Evaluation results (populated on completion).

metrics (nullable, optional)
Evaluation metrics (populated on completion).

config (nullable, optional)
Run configuration as submitted.

spend (number | null, optional)
Estimated spend.

metadata (nullable, optional)
Optional metadata.
GET /v1/evaluation/runs/{run_id}
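A sketch of reading the run object, e.g. when polling until a run finishes. The id and timestamps are placeholders, and the status values come from the list-runs filter examples ('running', 'completed', 'error'):

```python
import json

# Hypothetical response body trimmed to a few fields from the schema.
run = json.loads("""{
  "object": "evaluation.run",
  "id": "run_abc123",
  "status": "completed",
  "created_at": "2024-01-01T00:00:00Z",
  "finished_at": "2024-01-01T00:05:00Z",
  "error": null,
  "results": {},
  "metrics": {}
}""")

def is_terminal(run):
    """Completed and errored runs are final; `results` and `metrics`
    are populated only once a run completes."""
    return run["status"] in ("completed", "error")
```

A polling loop would GET /v1/evaluation/runs/{run_id} until `is_terminal` returns True, then read `results` and `metrics`.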
