Creating Evaluations

Code and judge evals for your Maniac models.

Evaluations (“evals”) define the criteria Maniac uses to guide optimization.

Maniac supports two types of evaluations:

  1. Judge prompt evaluations

  2. Code evaluations

Judge Prompt Evals

Judge prompt evals compare two candidate outputs—Response A and Response B—for the same input. Maniac runs these comparisons in a tournament-style setup to determine which model performs better.

In Maniac’s tournament setup:

  • Response A is produced by the candidate model being evaluated

  • Response B is produced by a reference source, such as a frontier LLM or labeled data uploaded into your container.

The judge returns TRUE if Response A is at least as good as Response B, and FALSE otherwise.

For example:

Is response A at least as good as response B in generating a headline for the provided news snippet? Return either True or False.


Code Evals

Code-based evals give you full control over how outputs are scored. These evals are written as Python functions and operate on a single sample at a time.

A code eval must (see the minimal sketch after this list):

  • Accept a single item argument

  • Return a dictionary with a numeric score

  • Use item["sample"] and item["ground_truth"] as inputs
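
Putting those requirements together, a minimal code eval might look like the sketch below. The function name evaluate and the "score" key are assumptions based on the requirements above, and the scoring logic is only a placeholder to replace with your own criteria.

```python
def evaluate(item):
    # item["sample"] is the candidate model's output; item["ground_truth"] is the reference.
    prediction = item["sample"]
    reference = item["ground_truth"]

    # Placeholder scoring logic: full credit when the reference appears in the output.
    score = 1.0 if reference in prediction else 0.0
    return {"score": score}
```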

Item Structure
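
Each item is a dictionary that carries, at minimum, the two fields referenced above. The sketch below is illustrative; the values shown are made up, and your dataset may include additional fields.

```python
# Illustrative item -- actual values come from your dataset and the candidate model.
item = {
    "sample": "Tech Giant Unveils Faster, Cheaper Chip",     # output produced by the model
    "ground_truth": "New Chip Promises Speed at Lower Cost"  # reference output or label
}
```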

Dependencies

Code evaluations can depend on any external Python packages you need. Just list them in the requirements.txt field when creating the evaluation.
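
For example, the JSON schema and semantic similarity evals sketched below would need a requirements.txt along these lines (package names are illustrative of those examples):

```
jsonschema
sentence-transformers
```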

Examples

Example: Exact Match
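
A minimal sketch of an exact-match eval, using the same evaluate/score shape as the skeleton above and assuming that surrounding whitespace should not count against the model:

```python
def evaluate(item):
    # Compare the model output to the reference after trimming surrounding whitespace.
    prediction = item["sample"].strip()
    reference = item["ground_truth"].strip()

    # 1.0 for an exact match, 0.0 otherwise.
    return {"score": 1.0 if prediction == reference else 0.0}
```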

Example: Multi-Label IoU (Jaccard Similarity)
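
One way to score multi-label outputs, assuming both the sample and the ground truth arrive as comma-separated label strings; adjust the parsing to match your own data format:

```python
def evaluate(item):
    # Assumes labels are comma-separated strings, e.g. "sports, politics".
    predicted = {label.strip().lower() for label in item["sample"].split(",") if label.strip()}
    expected = {label.strip().lower() for label in item["ground_truth"].split(",") if label.strip()}

    # Jaccard similarity: size of the intersection over size of the union.
    if not predicted and not expected:
        return {"score": 1.0}
    score = len(predicted & expected) / len(predicted | expected)
    return {"score": score}
```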

Example: JSON Schema Validation
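
A sketch that checks whether the output parses as JSON and conforms to a schema, assuming the jsonschema package is listed in requirements.txt and that the expected schema is stored in item["ground_truth"] as a JSON string (your setup may keep the schema elsewhere):

```python
import json

from jsonschema import ValidationError, validate


def evaluate(item):
    # Assumes item["ground_truth"] holds the target JSON schema as a string.
    schema = json.loads(item["ground_truth"])

    try:
        output = json.loads(item["sample"])
        validate(instance=output, schema=schema)
    except (json.JSONDecodeError, ValidationError):
        # Output is not valid JSON, or it violates the schema.
        return {"score": 0.0}

    return {"score": 1.0}
```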

Example: Semantic Similarity Evaluation
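
A sketch that scores outputs by embedding similarity, assuming sentence-transformers is listed in requirements.txt; the model name shown is illustrative:

```python
from sentence_transformers import SentenceTransformer, util

# Load the embedding model once so it is not reloaded for every item.
_model = SentenceTransformer("all-MiniLM-L6-v2")


def evaluate(item):
    # Embed the model output and the reference, then compare with cosine similarity.
    embeddings = _model.encode([item["sample"], item["ground_truth"]])
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

    # Cosine similarity lies in [-1, 1]; clamp to [0, 1] to use it as a score.
    return {"score": max(0.0, similarity)}
```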
