Upload Datasets

Adding pre-existing data to a container.

Maniac lets you direct existing LLM traffic from a different inference provider into a container or upload a static dataset. Once uploaded, these logs can be used for optimization and evaluation.

Example: Uploading a HuggingFace Dataset

Let's walk through an example of uploading a static dataset using LEDGAR (Tuggener et al. 2020), a corpus of labeled legal contract clauses that is well suited for training and testing legal classification models.

Prerequisites

export MANIAC_API_KEY=...
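
The examples below use the Hugging Face datasets library; the HTTP sketches further down also use requests:

pip install datasets requests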

Load the HuggingFace Dataset

from datasets import load_dataset

DATASET = "coastalcph/ledgar"

# Load the LEDGAR dataset from the Hugging Face Hub
dataset = load_dataset(DATASET)
train_split = dataset["train"]

# Extract inputs and labels
clauses = train_split["text"]                       # contract clause texts
label_ids = train_split["label"]                    # integer label IDs
label_names = train_split.features["label"].names   # human-readable label names

Define the System Prompt
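
As a minimal sketch, a system prompt for this task might instruct the model to pick exactly one LEDGAR label per clause. The exact wording below is an assumption, not a requirement:

SYSTEM_PROMPT = (
    "You are a legal clause classifier. Given a contract clause, "
    "respond with exactly one label from the following list:\n"
    + "\n".join(label_names)
)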

Create a container

You can also skip this step and upload a dataset to an existing container, where it will be combined with any existing inference logs.
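
If you do create a new container, the call might look like the following sketch. The endpoint, payload fields, and response shape here are assumptions; consult the Maniac API reference for the real ones.

import os
import requests

# Hypothetical endpoint and payload -- the real route, field names, and
# response shape may differ; check the Maniac API reference.
resp = requests.post(
    "https://api.maniac.example/v1/containers",  # placeholder URL
    headers={"Authorization": f"Bearer {os.environ['MANIAC_API_KEY']}"},
    json={"name": "ledgar-classifier", "system_prompt": SYSTEM_PROMPT},
)
resp.raise_for_status()
container_id = resp.json()["id"]  # assumed response field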

Upload Data in Batches

For large datasets, we recommend uploading in batches to avoid timeouts. Each dataset entry consists of an input and output in chat completions format, plus optional metadata.

Note: Unlike generating completions inside a container—where the container’s system prompt is automatically applied—registering (logging) existing completions requires the system prompt to be included explicitly with each messages object. Registered completions do not inherit the container-level system prompt.
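
Putting this together, a batched upload might look like the sketch below, continuing from the snippets above (it reuses SYSTEM_PROMPT, container_id, and the placeholder URL). The upload route and entry schema are assumptions; note that the system prompt is included explicitly in every entry's messages.

BATCH_SIZE = 500  # smaller batches reduce timeout risk

def to_entry(clause: str, label_id: int) -> dict:
    """Build one dataset entry in chat completions format."""
    return {
        "input": [
            # The system prompt must be included explicitly (see note above)
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": clause},
        ],
        "output": {"role": "assistant", "content": label_names[label_id]},
        "metadata": {"source": "ledgar", "split": "train"},  # optional
    }

entries = [to_entry(c, l) for c, l in zip(clauses, label_ids)]

for start in range(0, len(entries), BATCH_SIZE):
    batch = entries[start : start + BATCH_SIZE]
    resp = requests.post(
        # Placeholder route -- check the API reference for the real one
        f"https://api.maniac.example/v1/containers/{container_id}/datasets",
        headers={"Authorization": f"Bearer {os.environ['MANIAC_API_KEY']}"},
        json={"entries": batch},
    )
    resp.raise_for_status()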
