Helicone’s experiment system lets you compare prompt versions side-by-side, run A/B tests, and measure which variations perform best before deploying to production.
What Are Experiments?
Experiments allow you to:
- Compare multiple prompts or versions against the same inputs
- Test different models (GPT-4 vs Claude vs Gemini)
- Measure performance with automated evaluators
- Validate improvements before production deployment
- Build datasets from production traffic or manual inputs
Side-by-Side Comparison
Run multiple prompt versions simultaneously and compare outputs
Automated Scoring
Use evaluators to score responses on quality, relevance, and custom metrics
Production Testing
Test variations with real production inputs before deployment
Dataset Building
Create reusable test datasets from experiments
Creating an Experiment
From the UI
Choose creation method
Create an experiment:
- From scratch: Enter input rows manually
- From prompt: Use an existing prompt as the baseline
- From dataset: Use a saved dataset
Add hypotheses
Create columns for each variation you want to test:
- Different prompt versions
- Different models
- Different parameters (temperature, max_tokens)
From a Prompt
Start an experiment directly from a prompt:
Select inputs
Choose inputs from:
- Production request history
- Manual input creation
- Existing datasets
Experiment Structure
An experiment consists of:
Inputs (Rows)
Each row represents a test case with variables:

| userName | topic | context |
|---|---|---|
| Alice | machine learning | beginner |
| Bob | neural networks | advanced |
| Carol | prompt engineering | intermediate |
Hypotheses (Columns)
Each column represents a variation to test:

| Hypothesis | Prompt Version | Model | Temperature |
|---|---|---|---|
| Baseline | 1.0 | gpt-4o | 0.7 |
| New Version | 2.0 | gpt-4o | 0.7 |
| Claude Alt | 1.0 | claude-3-5 | 0.7 |
| Higher Temp | 2.0 | gpt-4o | 0.9 |
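For reference, the same structure can be expressed as plain data. The shapes below are illustrative only; the field names are assumptions for this sketch, not Helicone's actual schema:

```typescript
// Illustrative shapes only -- field names are assumptions, not Helicone's schema.
interface ExperimentInput {
  userName: string;
  topic: string;
  context: string;
}

interface Hypothesis {
  name: string;
  promptVersion: string;
  model: string;
  temperature: number;
}

const inputs: ExperimentInput[] = [
  { userName: "Alice", topic: "machine learning", context: "beginner" },
  { userName: "Bob", topic: "neural networks", context: "advanced" },
];

const hypotheses: Hypothesis[] = [
  { name: "Baseline", promptVersion: "1.0", model: "gpt-4o", temperature: 0.7 },
  { name: "Higher Temp", promptVersion: "2.0", model: "gpt-4o", temperature: 0.9 },
];

// The experiment grid is the cross product: one output per input x hypothesis.
```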
Results
The experiment grid shows outputs for each input × hypothesis combination, along with:
- Response text
- Token usage
- Latency
- Cost
- Evaluator scores
Running Experiments
Execute All Tests
Run the experiment to generate outputs for all combinations, as sketched below.
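Conceptually, "run all" executes one call per grid cell. In this sketch, `runHypothesis` is a hypothetical stand-in for a call to your model provider, and the types are assumptions, not Helicone's schema:

```typescript
// Conceptual sketch: execute every input x hypothesis combination.
type Input = Record<string, string>; // e.g. { userName: "Alice", topic: "..." }
type Hypothesis = { name: string; promptVersion: string; model: string; temperature: number };

// Hypothetical: stands in for a call to your model provider.
declare function runHypothesis(h: Hypothesis, input: Input): Promise<string>;

async function runAll(inputs: Input[], hypotheses: Hypothesis[]) {
  const grid: Record<string, string> = {};
  for (const input of inputs) {
    for (const h of hypotheses) {
      // One LLM call per cell; results fill the input x hypothesis table.
      grid[`${JSON.stringify(input)}::${h.name}`] = await runHypothesis(h, input);
    }
  }
  return grid;
}
```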
View Results
Results appear in a table format with:
- Input variables
- Output for each hypothesis
- Token counts and costs
- Latency measurements
- Evaluator scores
Compare Side-by-Side
The experiment table lets you:
- Scroll horizontally to compare outputs
- Sort by evaluator scores
- Filter by input variables
- Highlight differences between versions
Evaluators
Evaluators automatically score experiment outputs based on criteria you define.
Adding Evaluators
Create or select evaluator
Choose from:
- LLM-as-Judge: Use GPT-4 or Claude to score responses
- Regex: Pattern matching for specific content
- Custom: API-based evaluator with your own logic
Evaluator Types
LLM-as-Judge
Use another LLM to score responses, as in the sketch below.
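A minimal LLM-as-judge sketch using the official openai SDK; the rubric wording and 1-5 scale are illustrative, not Helicone's built-in template:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Score a response on relevance from 1-5 using another model as the judge.
async function judgeRelevance(input: string, output: string): Promise<number> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    temperature: 0, // deterministic judging
    messages: [
      {
        role: "system",
        content:
          "You are an evaluator. Rate how relevant the response is to the input " +
          "on a scale of 1-5. Reply with the number only.",
      },
      { role: "user", content: `Input: ${input}\n\nResponse: ${output}` },
    ],
  });
  return Number(completion.choices[0].message.content);
}
```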
Pattern Matching
Score based on content patterns, as sketched below.
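A regex evaluator typically returns a binary pass/fail. This sketch is illustrative, not Helicone's evaluator syntax:

```typescript
// Binary pass/fail: does the output mention the topic and avoid refusals?
function regexEvaluator(output: string, topic: string): number {
  const mentionsTopic = new RegExp(topic, "i").test(output);
  const noRefusal = !/\b(cannot|can't) help\b/i.test(output);
  return mentionsTopic && noRefusal ? 1 : 0;
}
```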
Custom Evaluator
Call your own API for scoring, as in the sketch below.
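A custom evaluator posts each input/output pair to your own scoring service. The URL and payload shape here are hypothetical:

```typescript
// POST the pair to your own scoring service; endpoint and payload are hypothetical.
async function customEvaluator(input: string, output: string): Promise<number> {
  const res = await fetch("https://evaluator.example.com/score", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ input, output }),
  });
  const { score } = (await res.json()) as { score: number };
  return score;
}
```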
Viewing Scores
Evaluator scores appear as columns in the experiment table:

| Input | Baseline Output | New Version Output | Relevance (Baseline) | Relevance (New) | Winner |
|---|---|---|---|---|---|
| … | … | … | 3.5 | 4.2 | New |
| … | … | … | 4.0 | 3.8 | Baseline |
Datasets
Datasets are reusable collections of test inputs.
Creating Datasets
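One way to think about a dataset is as a named, saved set of input rows. The shape below is an illustrative sketch, not Helicone's actual schema:

```typescript
// Illustrative only: a dataset as a named, reusable collection of input rows.
type DatasetRow = Record<string, string>;

interface Dataset {
  name: string;
  rows: DatasetRow[];
}

const onboardingDataset: Dataset = {
  name: "onboarding-questions",
  rows: [
    { userName: "Alice", topic: "machine learning", context: "beginner" },
    { userName: "Bob", topic: "neural networks", context: "advanced" },
  ],
};
```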
Using Datasets
Reuse datasets across multiple experiments, as in the sketch below.
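Because rows are stored independently of any one experiment, the same dataset can seed several runs. A hypothetical usage sketch (the `runAll` helper and the data shapes are assumptions from the earlier sketches, not Helicone's SDK):

```typescript
// Hypothetical usage: the same dataset rows seed multiple experiments.
type DatasetRow = Record<string, string>;
type Hypothesis = { name: string; promptVersion: string; model: string; temperature: number };

declare const onboardingRows: DatasetRow[];
declare function runAll(rows: DatasetRow[], hypotheses: Hypothesis[]): Promise<void>;

// Experiment 1: prompt version comparison on the shared rows.
await runAll(onboardingRows, [
  { name: "Baseline", promptVersion: "1.0", model: "gpt-4o", temperature: 0.7 },
  { name: "New Version", promptVersion: "2.0", model: "gpt-4o", temperature: 0.7 },
]);

// Experiment 2: model comparison on the exact same rows.
await runAll(onboardingRows, [
  { name: "GPT-4o", promptVersion: "1.0", model: "gpt-4o", temperature: 0.7 },
  { name: "Claude Alt", promptVersion: "1.0", model: "claude-3-5", temperature: 0.7 },
]);
```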
Experiment Workflows
A/B Testing Workflow
Analyze results
Compare scores, costs, and latency:
- Version 2.0: Avg score 4.3, $0.02/request
- Version 1.0: Avg score 3.8, $0.03/request
Model Comparison
Hold the prompt version and parameters fixed and vary only the model across hypothesis columns (see the sketch below).
Parameter Tuning
Keep the prompt and model fixed and sweep a single parameter, such as temperature, across columns.
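A hedged sketch of hypothesis columns for both workflows; the `Hypothesis` shape is illustrative, not Helicone's schema:

```typescript
type Hypothesis = { name: string; promptVersion: string; model: string; temperature: number };

// Model comparison: same prompt version, different models.
const modelComparison: Hypothesis[] = [
  { name: "GPT-4o", promptVersion: "1.0", model: "gpt-4o", temperature: 0.7 },
  { name: "Claude", promptVersion: "1.0", model: "claude-3-5", temperature: 0.7 },
];

// Parameter tuning: same prompt and model, temperature sweep.
const temperatureSweep: Hypothesis[] = [0.0, 0.3, 0.7, 1.0].map((t) => ({
  name: `temp-${t}`,
  promptVersion: "1.0",
  model: "gpt-4o",
  temperature: t,
}));
```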
Experiment API
Create Experiment
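A hypothetical request sketch; the route and payload fields are assumptions, not confirmed endpoints, so check Helicone's API reference for the real routes:

```typescript
// Route and fields are assumptions -- check the API reference.
const res = await fetch("https://api.helicone.ai/v1/experiment", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ name: "greeting-v2-test" /* hypotheses, inputs, ... */ }),
});
const experiment = await res.json();
```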
List Experiments
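Again purely illustrative; the route is an assumption:

```typescript
// Route is an assumption -- check the API reference.
const res = await fetch("https://api.helicone.ai/v1/experiment", {
  headers: { Authorization: `Bearer ${process.env.HELICONE_API_KEY}` },
});
const experiments = await res.json();
```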
Get Experiment Details
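Illustrative sketch; the route and the placeholder id are assumptions:

```typescript
// Route is an assumption -- check the API reference.
const experimentId = "exp_123"; // placeholder id
const res = await fetch(`https://api.helicone.ai/v1/experiment/${experimentId}`, {
  headers: { Authorization: `Bearer ${process.env.HELICONE_API_KEY}` },
});
const details = await res.json();
```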
Delete Experiment
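Illustrative sketch; the route and the placeholder id are assumptions:

```typescript
// Route is an assumption -- check the API reference.
const experimentId = "exp_123"; // placeholder id
await fetch(`https://api.helicone.ai/v1/experiment/${experimentId}`, {
  method: "DELETE",
  headers: { Authorization: `Bearer ${process.env.HELICONE_API_KEY}` },
});
```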
Best Practices
Use production data: Test with real inputs your users will provide
Run evaluators consistently: Define clear scoring criteria and use them across all experiments
Test multiple dimensions: Compare prompts, models, and parameters separately
Build reusable datasets: Save high-quality test cases for future experiments
Document findings: Add notes to experiments explaining results and decisions
Analyzing Results
Statistical Significance
Consider:
- Sample size: Test with at least 20-50 inputs for meaningful results
- Variance: Look at score distribution, not just averages
- Edge cases: Ensure new versions handle corner cases well
Cost Analysis
Compare total cost per hypothesis, as in the sketch below.
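A small sketch of aggregating per-request cost into a per-hypothesis total; the result-row shape here is an assumption, not Helicone's export format:

```typescript
// Assumed result-row shape, not Helicone's export format.
interface ResultRow {
  hypothesis: string;
  costUsd: number;
}

// Sum per-request cost for each hypothesis column.
function totalCostByHypothesis(rows: ResultRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    totals.set(row.hypothesis, (totals.get(row.hypothesis) ?? 0) + row.costUsd);
  }
  return totals;
}
```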
Latency Impact
Monitor response times:
- P50 latency
- P95 latency
- Max latency
Next Steps
Versioning
Learn more about managing prompt versions
Deployment
Deploy winning variations to production
Observability
Monitor deployed prompts in production
Evaluation
Deep dive into evaluation strategies
