
We are deprecating the Experiments feature and it will be removed from the platform on September 1st, 2025. We recommend using Prompt Versions for testing prompt changes.
Experiments let you test changes to prompts, models, and parameters before deploying to production. Run experiments on historical data to validate improvements and prevent quality regressions.

Why Run Experiments?

Test Safely

Validate changes without impacting production users

Data-Driven Decisions

Compare outputs side-by-side with metrics

Save Costs

Test on specific datasets instead of full production traffic

Prevent Regressions

Catch quality issues before they reach users

Experiment Workflow

1. Navigate to Prompts

Go to the Prompts tab in your Helicone dashboard and select the prompt you want to test.
To experiment on your production prompt, look for the production tag.

2. Start a New Experiment

Click “Start Experiment” in the top right corner.
[Image: Start Experiment button in the Prompts tab]

3. Select Base Prompt

Choose the baseline prompt to compare against. This is typically your current production prompt.
[Image: Prompt selection interface for experiments]

4. Edit the Prompt

Make your changes. This creates a new version without affecting the original prompt.
[Image: Prompt editor showing the variant being tested]
Common changes to test:
  • Different instructions or tone
  • More/fewer examples
  • System prompt modifications
  • Response format changes

5. Configure Experiment Settings

Set up your test parameters:
  • Dataset - Select existing dataset or generate random samples
  • Model - Same as baseline or test a different model
  • Provider Keys - Which API keys to use
Click “Generate random dataset” to create a dataset from up to 10 random production requests. Great for quick tests!
[Image: Configuration interface for experiment parameters]

6. Review and Run

The Diff Viewer shows exactly what changed between your base prompt and experiment variant.
[Image: Side-by-side diff of base and experiment prompts]
Review the changes and click “Run Experiment” to start.

7. Analyze Results

Once complete, view side-by-side comparisons of outputs:
[Image: Results comparison table with base and experiment outputs]
For each input, you can:
  • Compare outputs side-by-side
  • Review response quality
  • Check token usage and cost
  • Identify where variants perform better

What to Test

Prompt Variations

Test different approaches to the same task:
You are a customer support assistant. 
Answer user questions professionally.

User: {question}
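
For example, a hypothetical variant might tighten the tone and output format while keeping the same {question} variable:
You are a friendly customer support assistant.
Answer user questions in two short paragraphs or fewer, and end
with a clarifying question if the request is ambiguous.

User: {question}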

Model Comparison

Compare different models on the same prompt:
  • GPT-4o vs GPT-4o-mini - Does the cheaper model work as well?
  • GPT-4o vs Claude 3.5 Sonnet - Which provider is better for your use case?
  • GPT-4o vs GPT-4o-2024-08-06 - Test new model versions
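As a rough sketch of driving such a comparison outside the Experiments UI (assuming requests are routed through the Helicone OpenAI proxy; the Model-Candidate property name is just an illustration):
import OpenAI from "openai";

// Client routed through the Helicone proxy (assumed already configured)
const client = new OpenAI({
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

const messages = [
  { role: "system", content: "You are a customer support assistant." },
  { role: "user", content: "How do I reset my password?" },
];

// Run the identical prompt against both candidate models, tagging each
// request so results can be filtered in the dashboard afterwards
for (const model of ["gpt-4o", "gpt-4o-mini"]) {
  const completion = await client.chat.completions.create(
    { model, messages },
    { headers: { "Helicone-Property-Model-Candidate": model } }
  );
  console.log(model, completion.choices[0].message.content);
}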

Parameter Tuning

Experiment with model parameters:
// Base configuration
{
  temperature: 0.7,
  max_tokens: 500
}

// Variant A: More deterministic
{
  temperature: 0.3,
  max_tokens: 500
}

// Variant B: More creative
{
  temperature: 0.9,
  max_tokens: 500
}
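To try a variant outside the UI, pass the parameters straight through the completion call. A minimal sketch, reusing the client and messages from the model-comparison example above and tagging the request with an illustrative Variant property:
// Variant A: more deterministic settings, tagged for later comparison
const response = await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages,
    temperature: 0.3,
    max_tokens: 500,
  },
  { headers: { "Helicone-Property-Variant": "low-temp" } }
);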

Creating Effective Datasets

Use Representative Data

Your test dataset should cover:
  • Common queries (80% of traffic)
  • Edge cases (10%)
  • Error-prone scenarios (10%)
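A hypothetical helper for assembling that mix, assuming you have already collected candidate examples into per-category pools:
// Assemble a test dataset matching the 80/10/10 mix described above
function buildDataset(common, edge, errorProne, size) {
  return [
    ...common.slice(0, Math.round(size * 0.8)),
    ...edge.slice(0, Math.round(size * 0.1)),
    ...errorProne.slice(0, Math.round(size * 0.1)),
  ];
}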

Dataset Size Guidelines

  • Quick validation: 10-20 examples
  • Thorough testing: 50-100 examples
  • Statistical significance: 200+ examples
Start small! Even 10 examples can reveal obvious issues. Scale up for more confidence.

Building Datasets Programmatically

Create datasets from production data:
// Read the Helicone API key from the environment
const HELICONE_API_KEY = process.env.HELICONE_API_KEY;

const response = await fetch("https://api.helicone.ai/v1/request/query", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    filter: {
      properties: {
        "Feature": "customer-support",
      },
      response_status: { equals: 200 },
    },
    limit: 50,
  }),
});

const requests = await response.json();

// Extract prompts for dataset
const dataset = requests.data.map(req => ({
  input: req.request.messages,
  metadata: {
    original_request_id: req.request_id,
    user_tier: req.properties.UserTier,
  }
}));

Evaluating Results

Manual Review

For each output pair, ask:
  • Is the variant more helpful?
  • Is it more accurate?
  • Is the tone appropriate?
  • Is it more concise?
  • Does it follow instructions better?

Automated Metrics

Track quantitative improvements:
const results = {
  base: {
    avg_tokens: 250,
    avg_latency: 1200,
    avg_cost: 0.003,
  },
  variant: {
    avg_tokens: 180,  // 28% reduction
    avg_latency: 900,  // 25% faster
    avg_cost: 0.002,   // 33% cheaper
  },
};
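A small helper (hypothetical, not part of any Helicone SDK) reproduces those percentage deltas:
// Percentage change from base to variant; negative means a reduction
const pctChange = (base, variant) => ((variant - base) / base) * 100;

console.log(pctChange(250, 180).toFixed(0));     // "-28" (tokens)
console.log(pctChange(1200, 900).toFixed(0));    // "-25" (latency)
console.log(pctChange(0.003, 0.002).toFixed(0)); // "-33" (cost)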

Using External Evaluators

Integrate with evaluation frameworks:
from ragas import evaluate
from ragas.metrics import answer_correctness

# Load experiment results from Helicone
# (helicone_client is a placeholder for your own API wrapper)
results = helicone_client.get_experiment_results(experiment_id)

# Evaluate with RAGAS
scores = evaluate(
    dataset=results.to_dataset(),
    metrics=[answer_correctness]
)

# Post scores back to Helicone
for request_id, score in scores.items():
    helicone_client.add_score(request_id, {
        "ragas_correctness": score
    })
See the RAGAS Evaluations tutorial for a complete example.

Best Practices

  • Change one variable at a time - change only the prompt, the model, or the parameters. Testing multiple changes at once makes it impossible to know what caused an improvement.
  • Write a hypothesis first - before running an experiment, write down what you’re changing, why you think it will improve results, and how you’ll measure success.
  • Test on difficult inputs - focus on cases where your current prompt struggles; these reveal whether your changes fix real problems.
  • Weigh quality against cost - a 5% quality improvement might not justify 200% higher costs. Factor economics into your decisions.
  • Re-test periodically - as models improve, old experiment results become outdated. Re-test important decisions quarterly.

Migration Path

Since Experiments are being deprecated, we recommend migrating to Prompt Versions for ongoing testing.
Prompt Versions offer:
  • Version control for all prompt changes
  • A/B testing in production with traffic splitting
  • Real-time comparison of version performance
  • Automatic rollback if quality degrades

Alternative Approaches

Production A/B Testing

Test in production with traffic splitting:
const variant = Math.random() < 0.5 ? "A" : "B";

const prompt = variant === "A" 
  ? basePrompt 
  : experimentPrompt;

const response = await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [{ role: "system", content: prompt }, ...],
  },
  {
    headers: {
      "Helicone-Property-Variant": variant,
      "Helicone-Property-Experiment": "prompt-comparison-v3",
    },
  }
);
Then filter by Variant property to compare real-world performance.
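Note that Math.random() can give the same user a different variant on every request. For session-consistent assignment, a common pattern (sketched here; not Helicone-specific) is to bucket deterministically by user ID:
import { createHash } from "crypto";

// Hash the user ID so the same user always gets the same variant
function assignVariant(userId) {
  const hash = createHash("sha256").update(userId).digest();
  return hash[0] % 2 === 0 ? "A" : "B";
}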

Shadow Deployments

Run experiments on every production request:
// Production request
const productionResponse = await client.chat.completions.create(
  params,
  { headers: { "Helicone-Property-Type": "production" } }
);

// Shadow experiment (async, not returned to the user)
setImmediate(async () => {
  try {
    await client.chat.completions.create(
      experimentParams,
      { headers: { "Helicone-Property-Type": "shadow" } }
    );
  } catch {
    // Swallow shadow failures so they can never affect the production path
  }
});

return productionResponse;
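Keep in mind that shadow requests double your inference spend for the mirrored traffic, so it is common to shadow only a sampled fraction of production requests.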

Next Steps

RAGAS Evaluations

Automate experiment evaluation with RAGAS metrics

Fine-Tuning

Move beyond prompt engineering to model fine-tuning

Prompt Management

Version control and deploy prompts systematically

Custom Properties

Track experiment variants in production