Creating Evaluations

Code and judge evals for your Maniac models.

Evaluations (“evals”) define the criteria Maniac uses to guide optimization.

Maniac supports two types of evaluations:

  1. Judge prompt evaluations

  2. Code evaluations

Judge Prompt Evals

Judge prompt evals compare two candidate outputs—Response A and Response B—for the same input. Maniac runs these comparisons in a tournament-style setup to determine which model performs better.

In Maniac’s tournament setup:

  • Response A is produced by the candidate model being evaluated

  • Response B is produced by a reference source, such as a frontier LLM or labeled data uploaded into your container

The judge returns TRUE if Response A is at least as good as Response B, and FALSE otherwise.

For example:

Is response A at least as good as response B in generating a headline for the provided news snippet? Return either True or False.


Code Evals

Code-based evals give you full control over how outputs are scored. These evals are written as Python functions and operate on a single sample at a time.

A code eval must:

  • Accept a single item argument

  • Return a dictionary with a numeric score

  • Use item["sample"] and item["ground_truth"] as inputs

Item Structure
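
As a rough illustration, an item can be pictured as a dictionary carrying the two fields named above. The sketch below assumes both fields are strings and that the returned dictionary uses a "score" key; the function name contains_answer and the sample values are hypothetical, and any additional fields Maniac may pass are not shown.

```python
# Hypothetical item: only "sample" and "ground_truth" come from the docs above.
example_item = {
    "sample": "Paris is the capital of France.",  # model output being scored
    "ground_truth": "Paris",                      # reference answer
}


def contains_answer(item: dict) -> dict:
    """Score 1.0 when the reference answer appears in the model output."""
    sample = str(item["sample"]).lower()
    ground_truth = str(item["ground_truth"]).lower()
    # The "score" key name is an assumption; the docs only require a numeric score.
    return {"score": 1.0 if ground_truth in sample else 0.0}


result = contains_answer(example_item)
```

The function takes a single item, reads both fields, and returns a dictionary with a numeric score, which matches the three requirements listed above.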

Dependencies

Code evaluations can depend on any external Python packages you need. Just list them in the requirements.txt field when creating the evaluation.

Examples

Example: Exact Match
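
A minimal sketch of an exact-match eval, assuming item["sample"] and item["ground_truth"] are strings and that the score is returned under a "score" key. The function name exact_match and the whitespace trimming are illustrative choices, not confirmed Maniac behavior.

```python
def exact_match(item: dict) -> dict:
    """Return 1.0 when the output equals the reference, else 0.0."""
    # Trim surrounding whitespace so trailing newlines do not cause mismatches.
    sample = str(item["sample"]).strip()
    ground_truth = str(item["ground_truth"]).strip()
    # "score" as the key name is an assumption.
    return {"score": 1.0 if sample == ground_truth else 0.0}
```

The comparison is case-sensitive; lowercasing both sides first would make it more lenient if that suits the task.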

Example: Multi-Label IoU (Jaccard Similarity)
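
One way to sketch this eval: treat the predicted and reference labels as sets and score their intersection over union, |A ∩ B| / |A ∪ B|. The code assumes labels arrive either as a list or as a comma-separated string; the helper to_label_set, the function name multilabel_iou, the "score" key, and the choice to score two empty label sets as 1.0 are all assumptions.

```python
def to_label_set(value) -> set:
    """Normalize a list or comma-separated string into a set of labels."""
    if isinstance(value, str):
        value = value.split(",")
    # Lowercase and trim each label, dropping empty entries.
    return {str(label).strip().lower() for label in value if str(label).strip()}


def multilabel_iou(item: dict) -> dict:
    """Jaccard similarity between predicted and reference label sets."""
    predicted = to_label_set(item["sample"])
    expected = to_label_set(item["ground_truth"])
    union = predicted | expected
    if not union:
        # Assumption: no labels predicted and none expected counts as a match.
        return {"score": 1.0}
    return {"score": len(predicted & expected) / len(union)}
```

For a prediction of "sports, politics" against a reference of ["politics", "economy"], one label is shared out of three distinct labels, so the score is 1/3.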

Example: JSON Schema Validation
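
The sketch below is a simplified stand-in for full JSON Schema validation that uses only the standard library: it checks that the output parses as a JSON object and that a few required fields have the expected types. REQUIRED_FIELDS, the field names in it, the function name json_structure_eval, and the "score" key are hypothetical. For real schema validation, a package such as jsonschema could presumably be listed in the requirements.txt field instead.

```python
import json

# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {"title": str, "tags": list}


def json_structure_eval(item: dict) -> dict:
    """Score 1.0 when the output is a JSON object with the required fields."""
    try:
        parsed = json.loads(item["sample"])
    except (TypeError, ValueError):
        # Not a string, or not parseable as JSON.
        return {"score": 0.0}
    if not isinstance(parsed, dict):
        return {"score": 0.0}
    # A missing field yields None, which fails the isinstance check.
    valid = all(
        isinstance(parsed.get(name), expected)
        for name, expected in REQUIRED_FIELDS.items()
    )
    return {"score": 1.0 if valid else 0.0}
```

This variant ignores item["ground_truth"] because the expected structure is fixed in the eval itself; a per-item schema would need to be read from the item instead.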

Example: Semantic Similarity Evaluation
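
A true semantic similarity eval would normally compare embeddings from a model installed through the requirements.txt field. As a dependency-free stand-in, the sketch below computes cosine similarity over bag-of-words token counts, which captures lexical overlap rather than meaning. The function names, the tokenization rule, and the "score" key are assumptions.

```python
import math
import re
from collections import Counter


def token_counts(text) -> Counter:
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", str(text).lower()))


def cosine_similarity_eval(item: dict) -> dict:
    """Cosine similarity between token-count vectors of output and reference."""
    a = token_counts(item["sample"])
    b = token_counts(item["ground_truth"])
    # Dot product over tokens that appear in both texts.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    # An empty text has zero norm; treat that case as no similarity.
    return {"score": dot / norm if norm else 0.0}
```

Swapping token_counts for an embedding lookup would keep the same structure while making the score reflect meaning rather than shared words.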
