Helicone’s experiment system lets you compare prompt versions side-by-side, run A/B tests, and measure which variations perform best before deploying to production.

What Are Experiments?

Experiments allow you to:
  • Compare multiple prompts or versions against the same inputs
  • Test different models (GPT-4 vs Claude vs Gemini)
  • Measure performance with automated evaluators
  • Validate improvements before production deployment
  • Build datasets from production traffic or manual inputs

Side-by-Side Comparison

Run multiple prompt versions simultaneously and compare outputs

Automated Scoring

Use evaluators to score responses on quality, relevance, and custom metrics

Production Testing

Test variations with real production inputs before deployment

Dataset Building

Create reusable test datasets from experiments

Creating an Experiment

From the UI

1. Navigate to Experiments: Go to Experiments in your Helicone dashboard.
2. Choose a creation method:
  • From scratch: Manual input rows
  • From prompt: Use an existing prompt as the baseline
  • From dataset: Use a saved dataset
3. Configure inputs: Add test cases with variables that will be passed to prompts:
{
  "userName": "Alice",
  "topic": "machine learning"
}
4. Add hypotheses: Create a column for each variation you want to test:
  • Different prompt versions
  • Different models
  • Different parameters (temperature, max_tokens)
5. Run the experiment: Execute all hypothesis combinations and view results in a table.

From a Prompt

Start an experiment directly from a prompt:
1. Open your prompt: Navigate to the prompt you want to test.
2. Click Experiment: Find the Experiment button in the prompt editor.
3. Select inputs: Choose from:
  • Production request history
  • Manual input creation
  • Existing datasets
4. Add variations: Create hypothesis columns with different versions or configurations.

Experiment Structure

An experiment consists of:

Inputs (Rows)

Each row represents a test case with variables:
| userName | topic              | context      |
|----------|--------------------|--------------|
| Alice    | machine learning   | beginner     |
| Bob      | neural networks    | advanced     |
| Carol    | prompt engineering | intermediate |

Hypotheses (Columns)

Each column represents a variation to test:
| Hypothesis  | Prompt Version | Model      | Temperature |
|-------------|----------------|------------|-------------|
| Baseline    | 1.0            | gpt-4o     | 0.7         |
| New Version | 2.0            | gpt-4o     | 0.7         |
| Claude Alt  | 1.0            | claude-3-5 | 0.7         |
| Higher Temp | 2.0            | gpt-4o     | 0.9         |

Results

The experiment grid shows outputs for each input × hypothesis combination, along with:
  • Response text
  • Token usage
  • Latency
  • Cost
  • Evaluator scores

Running Experiments

Execute All Tests

Run the experiment to generate outputs for all combinations:
// Experiments automatically execute when created via API
const response = await fetch("https://api.helicone.ai/v2/experiment/create/empty", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
});

const { experimentId } = await response.json();

View Results

Results appear in a table format with:
  • Input variables
  • Output for each hypothesis
  • Token counts and costs
  • Latency measurements
  • Evaluator scores
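You can pull the same results programmatically with the query endpoint documented under Experiment API below. A minimal TypeScript sketch, reusing the experimentId returned by the create call above:

// Fetch inputs and response bodies for one experiment
// via the query endpoint (see Experiment API below).
const results = await fetch("https://api.helicone.ai/v1/experiment/query", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    filter: { experiment: { id: { equals: experimentId } } },
    include: { inputs: true, responseBodies: true },
  }),
});

const experiment = await results.json();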

Compare Side-by-Side

The experiment table lets you:
  • Scroll horizontally to compare outputs
  • Sort by evaluator scores
  • Filter by input variables
  • Highlight differences between versions

Evaluators

Evaluators automatically score experiment outputs based on criteria you define.

Adding Evaluators

1. Open experiment settings: Click the Evaluators button in your experiment.
2. Create or select an evaluator: Choose from:
  • LLM-as-Judge: Use GPT-4 or Claude to score responses
  • Regex: Pattern matching for specific content
  • Custom: API-based evaluator with your own logic
3. Configure scoring: Define the scoring criteria:
{
  "name": "Relevance",
  "description": "Score 1-5 based on how relevant the response is to the question",
  "type": "llm-judge",
  "model": "gpt-4o",
  "prompt": "Rate the relevance of this response..."
}
4. Run evaluators: Execute evaluators across all experiment outputs.

Evaluator Types

LLM-as-Judge

Use another LLM to score responses:
{
  "type": "llm-judge",
  "model": "gpt-4o",
  "criteria": [
    "Accuracy",
    "Clarity",
    "Helpfulness"
  ],
  "scale": "1-5"
}

Pattern Matching

Score based on content patterns:
{
  "type": "regex",
  "patterns": [
    {
      "pattern": "\\b(sorry|apologize)\\b",
      "score": -1,
      "reason": "Unnecessary apology"
    },
    {
      "pattern": "\\b(step-by-step|walkthrough)\\b",
      "score": 1,
      "reason": "Structured response"
    }
  ]
}
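The semantics this config implies are simple: every pattern that matches the output contributes its score to the total. A small TypeScript sketch of that logic (the case-insensitive flag is an assumption, not documented behavior):

// Apply pattern-based scoring: each matching pattern contributes its score.
interface Pattern { pattern: string; score: number; reason: string; }

function scoreByPatterns(output: string, patterns: Pattern[]): number {
  return patterns.reduce((total, p) => {
    // "i" makes matching case-insensitive; adjust flags to taste.
    return new RegExp(p.pattern, "i").test(output) ? total + p.score : total;
  }, 0);
}

// Using the patterns from the config above:
const score = scoreByPatterns("Here is a step-by-step walkthrough...", [
  { pattern: "\\b(sorry|apologize)\\b", score: -1, reason: "Unnecessary apology" },
  { pattern: "\\b(step-by-step|walkthrough)\\b", score: 1, reason: "Structured response" },
]); // => 1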

Custom Evaluator

Call your own API for scoring:
{
  "type": "custom",
  "endpoint": "https://api.example.com/evaluate",
  "method": "POST",
  "headers": {
    "Authorization": "Bearer YOUR_TOKEN"
  }
}
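On your side, a custom evaluator is just an HTTP endpoint that receives an output and returns a score. The sketch below is illustrative only; the request and response field names (output, score) are assumptions, not Helicone's documented schema:

// Illustrative custom-evaluator endpoint (Express).
// The payload shape here is an assumption, not the documented contract.
import express from "express";

const app = express();
app.use(express.json());

app.post("/evaluate", (req, res) => {
  const { output } = req.body; // assumed field name
  const score = output && output.length > 50 ? 1 : 0; // your own logic here
  res.json({ score }); // assumed response shape
});

app.listen(3000);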

Viewing Scores

Evaluator scores appear as columns in the experiment table:
| Input | Baseline Output | New Version Output | Relevance (Baseline) | Relevance (New) | Winner   |
|-------|-----------------|--------------------|----------------------|-----------------|----------|
| …     | …               | …                  | 3.5                  | 4.2             | New      |
| …     | …               | …                  | 4.0                  | 3.8             | Baseline |

Datasets

Datasets are reusable collections of test inputs.

Creating Datasets

1. From production traffic: Select requests from your logs and add them to a dataset.
2. From experiments: Save successful experiment inputs as a dataset for future tests.
3. Manual creation: Create a dataset by defining test cases by hand.
4. CSV import: Upload a CSV file with input variables.

Using Datasets

Reuse datasets across multiple experiments:
// Create experiment from dataset
const response = await fetch("https://api.helicone.ai/v2/experiment", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    datasetId: "dataset-123",
    hypotheses: [
      {
        promptVersionId: "version-abc",
        model: "gpt-4o",
      },
      {
        promptVersionId: "version-xyz",
        model: "gpt-4o",
      },
    ],
  }),
});
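Each hypothesis object pairs a promptVersionId with a model, mirroring the hypothesis columns described above; the IDs in this example are placeholders for your own dataset and prompt versions.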

Experiment Workflows

A/B Testing Workflow

1. Create a baseline: Run an experiment with the current production prompt (version 1.0).
2. Add a variation: Create a hypothesis column with the new prompt version (version 2.0).
3. Run on production inputs: Use recent production requests as test inputs.
4. Score with evaluators: Run automated evaluators to measure quality.
5. Analyze results: Compare scores, costs, and latency:
  • Version 2.0: avg score 4.3, $0.02/request
  • Version 1.0: avg score 3.8, $0.03/request
6. Deploy the winner: If version 2.0 performs better, deploy it to production (a tiny decision helper is sketched below).
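If you want the decision rule to be explicit rather than eyeballed, here is a small helper using the numbers above. The score-then-cost tiebreak is one reasonable policy, not the only one:

// Decide the winner given avg score and cost per request for each version.
interface Result { name: string; avgScore: number; costPerRequest: number; }

function pickWinner(a: Result, b: Result): Result {
  // Prefer the higher score; break ties on lower cost. Tune to your tradeoffs.
  if (a.avgScore !== b.avgScore) return a.avgScore > b.avgScore ? a : b;
  return a.costPerRequest <= b.costPerRequest ? a : b;
}

console.log(pickWinner(
  { name: "v1.0", avgScore: 3.8, costPerRequest: 0.03 },
  { name: "v2.0", avgScore: 4.3, costPerRequest: 0.02 },
).name); // "v2.0"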

Model Comparison

1. Create a hypothesis for each model:
  • GPT-4o
  • Claude 3.5 Sonnet
  • Gemini 1.5 Pro
2. Use the same prompt across all: Keep the prompt constant and vary only the model.
3. Run the experiment: Execute all combinations.
4. Compare cost vs. quality:
  • GPT-4o: score 4.2, $0.02
  • Claude: score 4.5, $0.025
  • Gemini: score 3.9, $0.015

Parameter Tuning

1. Create hypotheses with different parameters:
  • Temperature: 0.3, 0.7, 0.9
  • Max tokens: 500, 1000, 2000
2. Run the experiment: Test all combinations.
3. Find optimal settings: Identify the sweet spot between quality, cost, and latency (the sketch below enumerates the full grid).
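Enumerating the grid is simple combinatorics; a quick TypeScript sketch of the nine temperature × max-token combinations above:

// Enumerate every temperature × max_tokens combination as a hypothesis.
const temperatures = [0.3, 0.7, 0.9];
const maxTokens = [500, 1000, 2000];

const hypotheses = temperatures.flatMap((temperature) =>
  maxTokens.map((max_tokens) => ({ temperature, max_tokens }))
);

console.log(hypotheses.length); // 9 combinations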

Experiment API

Create Experiment

curl -X POST https://api.helicone.ai/v2/experiment/create/empty \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"

List Experiments

curl -X GET https://api.helicone.ai/v2/experiment \
  -H "Authorization: Bearer YOUR_API_KEY"

Get Experiment Details

curl -X POST https://api.helicone.ai/v1/experiment/query \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "experiment": {
        "id": {
          "equals": "experiment-123"
        }
      }
    },
    "include": {
      "inputs": true,
      "responseBodies": true
    }
  }'

Delete Experiment

curl -X DELETE https://api.helicone.ai/v2/experiment/{experimentId} \
  -H "Authorization: Bearer YOUR_API_KEY"

Best Practices

  • Use production data: Test with real inputs your users will provide.
  • Run evaluators consistently: Define clear scoring criteria and use them across all experiments.
  • Test multiple dimensions: Compare prompts, models, and parameters separately.
  • Build reusable datasets: Save high-quality test cases for future experiments.
  • Document findings: Add notes to experiments explaining results and decisions.
  • Don’t over-optimize: A small score difference may not matter in production.
  • Consider cost: A higher-scoring version may not justify significantly higher costs.

Analyzing Results

Statistical Significance

Consider:
  • Sample size: Test with at least 20-50 inputs for meaningful results.
  • Variance: Look at the score distribution, not just the averages (see the sketch below).
  • Edge cases: Ensure new versions handle corner cases well.
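Checking the distribution is easy to do yourself once you export the scores; a small sketch computing mean and standard deviation:

// Mean and standard deviation of evaluator scores for one hypothesis.
function summarize(scores: number[]): { mean: number; stdDev: number } {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / scores.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Two hypotheses can share a mean but differ wildly in spread:
console.log(summarize([4, 4, 4, 4])); // { mean: 4, stdDev: 0 }
console.log(summarize([2, 6, 2, 6])); // { mean: 4, stdDev: 2 }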

Cost Analysis

Compare total cost per hypothesis:
Hypothesis A: 50 tests × $0.02 = $1.00 (avg score: 4.2)
Hypothesis B: 50 tests × $0.03 = $1.50 (avg score: 4.3)

Cost increase: 50%
Score increase: 2.4%
Verdict: Hypothesis A is more cost-effective
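The same comparison expressed as a tiny helper, using the numbers above:

// Compare cost increase vs. score increase between two hypotheses.
function percentChange(from: number, to: number): number {
  return ((to - from) / from) * 100;
}

console.log(percentChange(1.0, 1.5).toFixed(1)); // "50.0" (cost, $1.00 -> $1.50)
console.log(percentChange(4.2, 4.3).toFixed(1)); // "2.4"  (avg score, 4.2 -> 4.3)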

Latency Impact

Monitor response times:
  • P50 latency
  • P95 latency
  • Max latency
A better prompt isn’t worth it if it makes your app feel slow.
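Percentiles are straightforward to compute from raw latency samples. A minimal sketch using the nearest-rank method (the sample values are made up):

// Compute a latency percentile (e.g. P50, P95) from raw samples in ms.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const index = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, index)];
}

const latencies = [120, 95, 340, 180, 210, 150, 980, 130];
console.log(percentile(latencies, 50)); // P50
console.log(percentile(latencies, 95)); // P95
console.log(Math.max(...latencies));    // max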

Next Steps

Versioning

Learn more about managing prompt versions

Deployment

Deploy winning variations to production

Observability

Monitor deployed prompts in production

Evaluation

Deep dive into evaluation strategies