Helicone’s experiment system lets you compare prompt versions side-by-side, run A/B tests, and measure which variations perform best before deploying to production.
What Are Experiments?
Experiments allow you to:
- Compare multiple prompts or versions against the same inputs
- Test different models (GPT-4 vs Claude vs Gemini)
- Measure performance with automated evaluators
- Validate improvements before production deployment
- Build datasets from production traffic or manual inputs
Side-by-Side Comparison
Run multiple prompt versions simultaneously and compare outputs
Automated Scoring
Use evaluators to score responses on quality, relevance, and custom metrics
Production Testing
Test variations with real production inputs before deployment
Dataset Building
Create reusable test datasets from experiments
Creating an Experiment
From the UI
Choose creation method
Create an experiment:
- From scratch: Enter input rows manually
- From prompt: Use an existing prompt as the baseline
- From dataset: Use a saved dataset
Add hypotheses
Create columns for each variation you want to test:
- Different prompt versions
- Different models
- Different parameters (temperature, max_tokens)
From a Prompt
Start an experiment directly from a prompt:
Select inputs
Choose inputs from:
- Production request history
- Manual input creation
- Existing datasets
Experiment Structure
An experiment consists of:
Inputs (Rows)
Each row represents a test case with variables:

| userName | topic | context |
|---|---|---|
| Alice | machine learning | beginner |
| Bob | neural networks | advanced |
| Carol | prompt engineering | intermediate |
Hypotheses (Columns)
Each column represents a variation to test:

| Hypothesis | Prompt Version | Model | Temperature |
|---|---|---|---|
| Baseline | 1.0 | gpt-4o | 0.7 |
| New Version | 2.0 | gpt-4o | 0.7 |
| Claude Alt | 1.0 | claude-3-5 | 0.7 |
| Higher Temp | 2.0 | gpt-4o | 0.9 |
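For reference, the same structure can be expressed as plain data. The shapes below are illustrative only; the field names are assumptions for this sketch, not Helicone's actual schema:

```typescript
// Illustrative shapes only -- field names are assumptions, not Helicone's schema.
interface ExperimentInput {
  userName: string;
  topic: string;
  context: string;
}

interface Hypothesis {
  name: string;
  promptVersion: string;
  model: string;
  temperature: number;
}

const inputs: ExperimentInput[] = [
  { userName: "Alice", topic: "machine learning", context: "beginner" },
  { userName: "Bob", topic: "neural networks", context: "advanced" },
];

const hypotheses: Hypothesis[] = [
  { name: "Baseline", promptVersion: "1.0", model: "gpt-4o", temperature: 0.7 },
  { name: "Higher Temp", promptVersion: "2.0", model: "gpt-4o", temperature: 0.9 },
];

// The experiment grid is the cross product: one output per input x hypothesis.
```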
Results
The experiment grid shows outputs for each input × hypothesis combination, along with:
- Response text
- Token usage
- Latency
- Cost
- Evaluator scores
Running Experiments
Execute All Tests
Run the experiment to generate outputs for all combinations, as sketched below.
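Conceptually, "run all" executes one call per grid cell. In this sketch, `runHypothesis` is a hypothetical stand-in for a call to your model provider, and the types are assumptions, not Helicone's schema:

```typescript
// Conceptual sketch: execute every input x hypothesis combination.
type Input = Record<string, string>; // e.g. { userName: "Alice", topic: "..." }
type Hypothesis = { name: string; promptVersion: string; model: string; temperature: number };

// Hypothetical: stands in for a call to your model provider.
declare function runHypothesis(h: Hypothesis, input: Input): Promise<string>;

async function runAll(inputs: Input[], hypotheses: Hypothesis[]) {
  const grid: Record<string, string> = {};
  for (const input of inputs) {
    for (const h of hypotheses) {
      // One LLM call per cell; results fill the input x hypothesis table.
      grid[`${JSON.stringify(input)}::${h.name}`] = await runHypothesis(h, input);
    }
  }
  return grid;
}
```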
View Results
Results appear in a table format with:
- Input variables
- Output for each hypothesis
- Token counts and costs
- Latency measurements
- Evaluator scores
Compare Side-by-Side
The experiment table lets you:
- Scroll horizontally to compare outputs
- Sort by evaluator scores
- Filter by input variables
- Highlight differences between versions
Evaluators
Evaluators automatically score experiment outputs based on criteria you define.
Adding Evaluators
Create or select evaluator
Choose from:
- LLM-as-Judge: Use GPT-4 or Claude to score responses
- Regex: Pattern matching for specific content
- Custom: API-based evaluator with your own logic
Evaluator Types
LLM-as-Judge
Use another LLM to score responses, as in the sketch below.
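A minimal LLM-as-judge sketch using the official openai SDK; the rubric wording and 1-5 scale are illustrative, not Helicone's built-in template:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Score a response on relevance from 1-5 using another model as the judge.
async function judgeRelevance(input: string, output: string): Promise<number> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    temperature: 0, // deterministic judging
    messages: [
      {
        role: "system",
        content:
          "You are an evaluator. Rate how relevant the response is to the input " +
          "on a scale of 1-5. Reply with the number only.",
      },
      { role: "user", content: `Input: ${input}\n\nResponse: ${output}` },
    ],
  });
  return Number(completion.choices[0].message.content);
}
```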
Pattern Matching
Score based on content patterns, as sketched below.
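A regex evaluator typically returns a binary pass/fail. This sketch is illustrative, not Helicone's evaluator syntax:

```typescript
// Binary pass/fail: does the output mention the topic and avoid refusals?
function regexEvaluator(output: string, topic: string): number {
  const mentionsTopic = new RegExp(topic, "i").test(output);
  const noRefusal = !/\b(cannot|can't) help\b/i.test(output);
  return mentionsTopic && noRefusal ? 1 : 0;
}
```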
Custom Evaluator
Call your own API for scoring, as in the sketch below.
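A custom evaluator posts each input/output pair to your own scoring service. The URL and payload shape here are hypothetical:

```typescript
// POST the pair to your own scoring service; endpoint and payload are hypothetical.
async function customEvaluator(input: string, output: string): Promise<number> {
  const res = await fetch("https://evaluator.example.com/score", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ input, output }),
  });
  const { score } = (await res.json()) as { score: number };
  return score;
}
```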
Viewing Scores
Evaluator scores appear as columns in the experiment table:

| Input | Baseline Output | New Version Output | Relevance (Baseline) | Relevance (New) | Winner |
|---|---|---|---|---|---|
| … | … | … | 3.5 | 4.2 | New |
| … | … | … | 4.0 | 3.8 | Baseline |
Datasets
Datasets are reusable collections of test inputs.
Creating Datasets
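One way to think about a dataset is as a named, saved set of input rows. The shape below is an illustrative sketch, not Helicone's actual schema:

```typescript
// Illustrative only: a dataset as a named, reusable collection of input rows.
type DatasetRow = Record<string, string>;

interface Dataset {
  name: string;
  rows: DatasetRow[];
}

const onboardingDataset: Dataset = {
  name: "onboarding-questions",
  rows: [
    { userName: "Alice", topic: "machine learning", context: "beginner" },
    { userName: "Bob", topic: "neural networks", context: "advanced" },
  ],
};
```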
Using Datasets
Reuse datasets across multiple experiments, as in the sketch below.
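Because rows are stored independently of any one experiment, the same dataset can seed several runs. A hypothetical usage sketch (the `runAll` helper and the data shapes are assumptions from the earlier sketches, not Helicone's SDK):

```typescript
// Hypothetical usage: the same dataset rows seed multiple experiments.
type DatasetRow = Record<string, string>;
type Hypothesis = { name: string; promptVersion: string; model: string; temperature: number };

declare const onboardingRows: DatasetRow[];
declare function runAll(rows: DatasetRow[], hypotheses: Hypothesis[]): Promise<void>;

// Experiment 1: prompt version comparison on the shared rows.
await runAll(onboardingRows, [
  { name: "Baseline", promptVersion: "1.0", model: "gpt-4o", temperature: 0.7 },
  { name: "New Version", promptVersion: "2.0", model: "gpt-4o", temperature: 0.7 },
]);

// Experiment 2: model comparison on the exact same rows.
await runAll(onboardingRows, [
  { name: "GPT-4o", promptVersion: "1.0", model: "gpt-4o", temperature: 0.7 },
  { name: "Claude Alt", promptVersion: "1.0", model: "claude-3-5", temperature: 0.7 },
]);
```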
Experiment Workflows
A/B Testing Workflow
Analyze results
Compare scores, costs, and latency:
- Version 2.0: Avg score 4.3, $0.02/request
- Version 1.0: Avg score 3.8, $0.03/request
Model Comparison
Hold the prompt version and parameters fixed and vary only the model across hypothesis columns (see the sketch below).
Parameter Tuning
Keep the prompt and model fixed and sweep a single parameter, such as temperature, across columns.
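A hedged sketch of hypothesis columns for both workflows; the `Hypothesis` shape is illustrative, not Helicone's schema:

```typescript
type Hypothesis = { name: string; promptVersion: string; model: string; temperature: number };

// Model comparison: same prompt version, different models.
const modelComparison: Hypothesis[] = [
  { name: "GPT-4o", promptVersion: "1.0", model: "gpt-4o", temperature: 0.7 },
  { name: "Claude", promptVersion: "1.0", model: "claude-3-5", temperature: 0.7 },
];

// Parameter tuning: same prompt and model, temperature sweep.
const temperatureSweep: Hypothesis[] = [0.0, 0.3, 0.7, 1.0].map((t) => ({
  name: `temp-${t}`,
  promptVersion: "1.0",
  model: "gpt-4o",
  temperature: t,
}));
```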
Experiment API
Create Experiment
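A hypothetical request sketch; the route and payload fields are assumptions, not confirmed endpoints, so check Helicone's API reference for the real routes:

```typescript
// Route and fields are assumptions -- check the API reference.
const res = await fetch("https://api.helicone.ai/v1/experiment", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ name: "greeting-v2-test" /* hypotheses, inputs, ... */ }),
});
const experiment = await res.json();
```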
List Experiments
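Again purely illustrative; the route is an assumption:

```typescript
// Route is an assumption -- check the API reference.
const res = await fetch("https://api.helicone.ai/v1/experiment", {
  headers: { Authorization: `Bearer ${process.env.HELICONE_API_KEY}` },
});
const experiments = await res.json();
```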
Get Experiment Details
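Illustrative sketch; the route and the placeholder id are assumptions:

```typescript
// Route is an assumption -- check the API reference.
const experimentId = "exp_123"; // placeholder id
const res = await fetch(`https://api.helicone.ai/v1/experiment/${experimentId}`, {
  headers: { Authorization: `Bearer ${process.env.HELICONE_API_KEY}` },
});
const details = await res.json();
```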
Delete Experiment
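Illustrative sketch; the route and the placeholder id are assumptions:

```typescript
// Route is an assumption -- check the API reference.
const experimentId = "exp_123"; // placeholder id
await fetch(`https://api.helicone.ai/v1/experiment/${experimentId}`, {
  method: "DELETE",
  headers: { Authorization: `Bearer ${process.env.HELICONE_API_KEY}` },
});
```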
Best Practices
Use production data: Test with real inputs your users will provide
Run evaluators consistently: Define clear scoring criteria and use them across all experiments
Test multiple dimensions: Compare prompts, models, and parameters separately
Build reusable datasets: Save high-quality test cases for future experiments
Document findings: Add notes to experiments explaining results and decisions
Analyzing Results
Statistical Significance
Consider:
- Sample size: Test with at least 20-50 inputs for meaningful results
- Variance: Look at score distribution, not just averages
- Edge cases: Ensure new versions handle corner cases well
Cost Analysis
Compare total cost per hypothesis, as in the sketch below.
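A small sketch of aggregating per-request cost into a per-hypothesis total; the result-row shape here is an assumption, not Helicone's export format:

```typescript
// Assumed result-row shape, not Helicone's export format.
interface ResultRow {
  hypothesis: string;
  costUsd: number;
}

// Sum per-request cost for each hypothesis column.
function totalCostByHypothesis(rows: ResultRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    totals.set(row.hypothesis, (totals.get(row.hypothesis) ?? 0) + row.costUsd);
  }
  return totals;
}
```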
Latency Impact
Monitor response times:
- P50 latency
- P95 latency
- Max latency
Next Steps
Versioning
Learn more about managing prompt versions
Deployment
Deploy winning variations to production
Observability
Monitor deployed prompts in production
Evaluation
Deep dive into evaluation strategies
