
We are deprecating the Experiments feature and it will be removed from the platform on September 1st, 2025. We recommend using Prompt Versions for testing prompt changes.
Experiments let you test changes to prompts, models, and parameters before deploying to production. Run experiments on historical data to validate improvements and prevent quality regressions.

Why Run Experiments?

Test Safely

Validate changes without impacting production users

Data-Driven Decisions

Compare outputs side-by-side with metrics

Save Costs

Test on specific datasets instead of full production traffic

Prevent Regressions

Catch quality issues before they reach users

Experiment Workflow

1. Navigate to Prompts

Go to the Prompts tab in your Helicone dashboard and select the prompt you want to test.
To experiment on your production prompt, look for the production tag.

2. Start a New Experiment

Click “Start Experiment” in the top right corner.
[Image: Start Experiment button in the Prompts tab]

3. Select Base Prompt

Choose the baseline prompt to compare against. This is typically your current production prompt.
[Image: Prompt selection interface for experiments]

4. Edit the Prompt

Make your changes. This creates a new version without affecting the original prompt.
[Image: Prompt editor showing the variant being tested]
Common changes to test:
  • Different instructions or tone
  • More/fewer examples
  • System prompt modifications
  • Response format changes

5. Configure Experiment Settings

Set up your test parameters:
  • Dataset - Select existing dataset or generate random samples
  • Model - Same as baseline or test a different model
  • Provider Keys - Which API keys to use
Click “Generate random dataset” to create a dataset from up to 10 random production requests. Great for quick tests!
[Image: Configuration interface for experiment parameters]

6. Review and Run

The Diff Viewer shows exactly what changed between your base prompt and experiment variant.
[Image: Side-by-side diff of base and experiment prompts]
Review the changes and click “Run Experiment” to start.

7. Analyze Results

Once complete, view side-by-side comparisons of outputs:
[Image: Results comparison table with base and experiment outputs]
For each input, you can:
  • Compare outputs side-by-side
  • Review response quality
  • Check token usage and cost
  • Identify where variants perform better

What to Test

Prompt Variations

Test different approaches to the same task:
You are a customer support assistant. 
Answer user questions professionally.

User: {question}
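
For example, a hypothetical variant might tighten the tone and output format while keeping the same {question} variable:
You are a friendly customer support assistant.
Answer user questions in two short paragraphs or fewer, and end
with a clarifying question if the request is ambiguous.

User: {question}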

Model Comparison

Compare different models on the same prompt:
  • GPT-4o vs GPT-4o-mini - Does the cheaper model work as well?
  • GPT-4o vs Claude 3.5 Sonnet - Which provider is better for your use case?
  • GPT-4o vs GPT-4o-2024-08-06 - Test new model versions
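As a rough sketch of driving such a comparison outside the Experiments UI (assuming requests are routed through the Helicone OpenAI proxy; the Model-Candidate property name is just an illustration):
import OpenAI from "openai";

// Client routed through the Helicone proxy (assumed already configured)
const client = new OpenAI({
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

const messages = [
  { role: "system", content: "You are a customer support assistant." },
  { role: "user", content: "How do I reset my password?" },
];

// Run the identical prompt against both candidate models, tagging each
// request so results can be filtered in the dashboard afterwards
for (const model of ["gpt-4o", "gpt-4o-mini"]) {
  const completion = await client.chat.completions.create(
    { model, messages },
    { headers: { "Helicone-Property-Model-Candidate": model } }
  );
  console.log(model, completion.choices[0].message.content);
}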

Parameter Tuning

Experiment with model parameters:
// Base configuration
{
  temperature: 0.7,
  max_tokens: 500
}

// Variant A: More deterministic
{
  temperature: 0.3,
  max_tokens: 500
}

// Variant B: More creative
{
  temperature: 0.9,
  max_tokens: 500
}
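To try a variant outside the UI, pass the parameters straight through the completion call. A minimal sketch, reusing the client and messages from the model-comparison example above and tagging the request with an illustrative Variant property:
// Variant A: more deterministic settings, tagged for later comparison
const response = await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages,
    temperature: 0.3,
    max_tokens: 500,
  },
  { headers: { "Helicone-Property-Variant": "low-temp" } }
);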

Creating Effective Datasets

Use Representative Data

Your test dataset should cover:
  • Common queries (80% of traffic)
  • Edge cases (10%)
  • Error-prone scenarios (10%)
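A hypothetical helper for assembling that mix, assuming you have already collected candidate examples into per-category pools:
// Assemble a test dataset matching the 80/10/10 mix described above
function buildDataset(common, edge, errorProne, size) {
  return [
    ...common.slice(0, Math.round(size * 0.8)),
    ...edge.slice(0, Math.round(size * 0.1)),
    ...errorProne.slice(0, Math.round(size * 0.1)),
  ];
}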

Dataset Size Guidelines

  • Quick validation: 10-20 examples
  • Thorough testing: 50-100 examples
  • Statistical significance: 200+ examples
Start small! Even 10 examples can reveal obvious issues. Scale up for more confidence.

Building Datasets Programmatically

Create datasets from production data:
// Read the Helicone API key from the environment
const HELICONE_API_KEY = process.env.HELICONE_API_KEY;

const response = await fetch("https://api.helicone.ai/v1/request/query", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    filter: {
      properties: {
        "Feature": "customer-support",
      },
      response_status: { equals: 200 },
    },
    limit: 50,
  }),
});

const requests = await response.json();

// Extract prompts for dataset
const dataset = requests.data.map(req => ({
  input: req.request.messages,
  metadata: {
    original_request_id: req.request_id,
    user_tier: req.properties.UserTier,
  }
}));

Evaluating Results

Manual Review

For each output pair, ask:
  • Is the variant more helpful?
  • Is it more accurate?
  • Is the tone appropriate?
  • Is it more concise?
  • Does it follow instructions better?

Automated Metrics

Track quantitative improvements:
const results = {
  base: {
    avg_tokens: 250,
    avg_latency: 1200,
    avg_cost: 0.003,
  },
  variant: {
    avg_tokens: 180,  // 28% reduction
    avg_latency: 900,  // 25% faster
    avg_cost: 0.002,   // 33% cheaper
  },
};
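A small helper (hypothetical, not part of any Helicone SDK) reproduces those percentage deltas:
// Percentage change from base to variant; negative means a reduction
const pctChange = (base, variant) => ((variant - base) / base) * 100;

console.log(pctChange(250, 180).toFixed(0));     // "-28" (tokens)
console.log(pctChange(1200, 900).toFixed(0));    // "-25" (latency)
console.log(pctChange(0.003, 0.002).toFixed(0)); // "-33" (cost)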

Using External Evaluators

Integrate with evaluation frameworks:
from ragas import evaluate
from ragas.metrics import answer_correctness

# Load experiment results from Helicone
# (helicone_client is a placeholder for your own API wrapper)
results = helicone_client.get_experiment_results(experiment_id)

# Evaluate with RAGAS
scores = evaluate(
    dataset=results.to_dataset(),
    metrics=[answer_correctness]
)

# Post scores back to Helicone
for request_id, score in scores.items():
    helicone_client.add_score(request_id, {
        "ragas_correctness": score
    })
See the RAGAS Evaluations tutorial for a complete example.

Best Practices

  • Change one variable at a time - change only the prompt, the model, or the parameters. Testing multiple changes at once makes it impossible to know what caused an improvement.
  • Write a hypothesis first - before running an experiment, write down what you’re changing, why you think it will improve results, and how you’ll measure success.
  • Test on difficult inputs - focus on cases where your current prompt struggles; these reveal whether your changes fix real problems.
  • Weigh quality against cost - a 5% quality improvement might not justify 200% higher costs. Factor economics into your decisions.
  • Re-test periodically - as models improve, old experiment results become outdated. Re-test important decisions quarterly.

Migration Path

Since Experiments are being deprecated, we recommend migrating to Prompt Versions for ongoing testing.
Prompt Versions offer:
  • Version control for all prompt changes
  • A/B testing in production with traffic splitting
  • Real-time comparison of version performance
  • Automatic rollback if quality degrades

Alternative Approaches

Production A/B Testing

Test in production with traffic splitting:
const variant = Math.random() < 0.5 ? "A" : "B";

const prompt = variant === "A" 
  ? basePrompt 
  : experimentPrompt;

const response = await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [{ role: "system", content: prompt }, ...],
  },
  {
    headers: {
      "Helicone-Property-Variant": variant,
      "Helicone-Property-Experiment": "prompt-comparison-v3",
    },
  }
);
Then filter by Variant property to compare real-world performance.
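Note that Math.random() can give the same user a different variant on every request. For session-consistent assignment, a common pattern (sketched here; not Helicone-specific) is to bucket deterministically by user ID:
import { createHash } from "crypto";

// Hash the user ID so the same user always gets the same variant
function assignVariant(userId) {
  const hash = createHash("sha256").update(userId).digest();
  return hash[0] % 2 === 0 ? "A" : "B";
}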

Shadow Deployments

Run experiments on every production request:
// Production request
const productionResponse = await client.chat.completions.create(
  params,
  { headers: { "Helicone-Property-Type": "production" } }
);

// Shadow experiment (async, not returned to the user)
setImmediate(async () => {
  try {
    await client.chat.completions.create(
      experimentParams,
      { headers: { "Helicone-Property-Type": "shadow" } }
    );
  } catch {
    // Swallow shadow failures so they can never affect the production path
  }
});

return productionResponse;
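Keep in mind that shadow requests double your inference spend for the mirrored traffic, so it is common to shadow only a sampled fraction of production requests.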

Next Steps

RAGAS Evaluations

Automate experiment evaluation with RAGAS metrics

Fine-Tuning

Move beyond prompt engineering to model fine-tuning

Prompt Management

Version control and deploy prompts systematically

Custom Properties

Track experiment variants in production