Experiments let you test changes to prompts, models, and parameters before deploying to production. Run experiments on historical data to validate improvements and prevent quality regressions.
Why Run Experiments?
Test Safely
Validate changes without impacting production users
Data-Driven Decisions
Compare outputs side-by-side with metrics
Save Costs
Test on specific datasets instead of full production traffic
Prevent Regressions
Catch quality issues before they reach users
Experiment Workflow
Navigate to Prompts
Go to the Prompts tab in your Helicone dashboard and select the prompt you want to test.
Select Base Prompt
Choose the baseline prompt to compare against. This is typically your current production prompt.
Edit the Prompt
Make your changes. This creates a new version without affecting the original prompt.
Common changes to test (an example follows this list):
- Different instructions or tone
- More/fewer examples
- System prompt modifications
- Response format changes
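For instance, a system prompt modification might tighten the tone and add an output-format constraint. The prompts below are purely illustrative:

```python
# Illustrative only: a baseline system prompt and an edited variant to test.
BASE_PROMPT = (
    "You are a helpful support assistant. Answer the customer's question."
)

VARIANT_PROMPT = (
    "You are a concise support assistant. Answer the customer's question "
    "in at most three sentences, then list next steps as bullet points."
)
```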
Configure Experiment Settings
Set up your test parameters:
- Dataset - Select existing dataset or generate random samples
- Model - Same as baseline or test a different model
- Provider Keys - Which API keys to use
Review and Run
The Diff Viewer shows exactly what changed between your base prompt and experiment variant.
Review the changes and click “Run Experiment” to start.
What to Test
Prompt Variations
Test different approaches to the same task (sketched after this list):
- Base Prompt
- Variant A: More Context
- Variant B: Step-by-Step
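As a sketch of what those variants might look like for a summarization task (the prompt text is hypothetical):

```python
# Hypothetical prompt variants for the same summarization task.
variants = {
    "base": "Summarize the following article in one paragraph.",
    "more_context": (
        "You are an editor for a news digest read by busy executives. "
        "Summarize the following article in one paragraph."
    ),
    "step_by_step": (
        "Summarize the following article in one paragraph. "
        "First identify the key claims, then the supporting evidence, "
        "then combine them into a single coherent summary."
    ),
}
```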
Model Comparison
Compare different models on the same prompt (a sketch follows this list):
- GPT-4o vs GPT-4o-mini - Does the cheaper model work as well?
- GPT-4o vs Claude 3.5 Sonnet - Which provider is better for your use case?
- GPT-4o vs GPT-4o-2024-08-06 - Test new model versions
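A minimal side-by-side comparison, assuming the standard Helicone OpenAI proxy integration (base URL plus the Helicone-Auth header); swap in whichever models you are comparing:

```python
import os
from openai import OpenAI

# Route both runs through the Helicone proxy so they are logged side by side
# (standard Helicone OpenAI integration; adjust if you integrate differently).
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

prompt = "Summarize this in one sentence: Helicone lets you log, monitor, and experiment with LLM requests."

for model in ["gpt-4o", "gpt-4o-mini"]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"{model}: {response.choices[0].message.content}")
```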
Parameter Tuning
Experiment with model parameters such as temperature, top_p, and maximum output tokens; a small sweep is sketched below:
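For example, a temperature sweep on a single prompt (again assuming the Helicone OpenAI proxy integration; the prompt and values are placeholders):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

prompt = "Write a one-line product tagline for a note-taking app."

# Same prompt and model, different sampling parameters.
for temperature in [0.0, 0.7, 1.2]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=60,
    )
    print(f"temperature={temperature}: {response.choices[0].message.content}")
```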
Creating Effective Datasets
Use Representative Data
Your test dataset should cover:
- Common queries (80% of traffic)
- Edge cases (10%)
- Error-prone scenarios (10%)
Dataset Size Guidelines
- Quick validation: 10-20 examples
- Thorough testing: 50-100 examples
- Statistical significance: 200+ examples
Start small! Even 10 examples can reveal obvious issues. Scale up for more confidence.
Building Datasets Programmatically
Create datasets from production data:
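A minimal sketch, assuming Helicone's request query endpoint (POST /v1/request/query); the exact filter schema and response field names may differ between API versions, so treat this as a starting point and check the API reference:

```python
import os
import requests

# Pull recent production requests from Helicone and turn them into a small
# experiment dataset. The filter/limit payload and the field names on each
# record are assumptions -- confirm them against the Helicone API reference.
resp = requests.post(
    "https://api.helicone.ai/v1/request/query",
    headers={"authorization": f"Bearer {os.environ['HELICONE_API_KEY']}"},
    json={"filter": "all", "limit": 100, "offset": 0},
)
resp.raise_for_status()

dataset = []
for item in resp.json().get("data", []):
    dataset.append({
        "input": item.get("request_body"),
        "output": item.get("response_body"),
    })

print(f"Collected {len(dataset)} examples")
```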
Evaluating Results
Manual Review
For each output pair, ask:
- Is the variant more helpful?
- Is it more accurate?
- Is the tone appropriate?
- Is it more concise?
- Does it follow instructions better?
Automated Metrics
Track quantitative improvements such as latency, cost per request, and token usage; a simple aggregation is sketched below:
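One way to compare runs is to aggregate per-request metrics yourself. The result structure below is a hypothetical sketch of data you might export from Helicone or collect while re-running the dataset:

```python
from statistics import mean

# Hypothetical per-request results for a baseline run and a variant run.
results = {
    "baseline": [{"latency_ms": 820, "cost_usd": 0.0042, "tokens": 310}],
    "variant":  [{"latency_ms": 610, "cost_usd": 0.0018, "tokens": 240}],
}

for name, runs in results.items():
    print(
        f"{name}: "
        f"avg latency {mean(r['latency_ms'] for r in runs):.0f} ms, "
        f"avg cost ${mean(r['cost_usd'] for r in runs):.4f}, "
        f"avg tokens {mean(r['tokens'] for r in runs):.0f}"
    )
```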
Using External Evaluators
Integrate with evaluation frameworks or your own scoring logic (a custom-scoring example follows this list):
- RAGAS Evaluation
- Custom Scoring
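As an illustration of custom scoring: the checks and weights below are hypothetical; frameworks like RAGAS provide ready-made metrics such as faithfulness and answer relevancy if you prefer not to roll your own.

```python
# Hypothetical custom scorer: combine a few simple checks into one score per
# output, then compare the average score of the baseline and the variant.
def score_output(output: str, expected_keywords: list[str]) -> float:
    keyword_hits = sum(kw.lower() in output.lower() for kw in expected_keywords)
    coverage = keyword_hits / max(len(expected_keywords), 1)
    brevity = 1.0 if len(output.split()) <= 150 else 0.5
    return 0.7 * coverage + 0.3 * brevity

example = {
    "output": "Helicone logs requests and lets you run experiments on prompts.",
    "expected_keywords": ["experiments", "prompts"],
}
print(score_output(example["output"], example["expected_keywords"]))
```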
Best Practices
Test One Thing at a Time
Change only the prompt OR the model OR the parameters. Testing multiple changes makes it impossible to know what caused improvements.
Document Your Hypothesis
Before running an experiment, write down:
- What you’re changing
- Why you think it will improve results
- How you’ll measure success
Include Edge Cases
Test on difficult inputs where your current prompt struggles. These reveal whether your changes fix real problems.
Consider Cost vs Quality Trade-offs
A 5% quality improvement might not justify 200% higher costs. Factor economics into your decisions.
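A quick back-of-the-envelope check makes the trade-off concrete (the numbers are hypothetical):

```python
# Hypothetical comparison: the variant scores 5 points better but costs 3x more.
baseline = {"quality": 0.80, "cost_per_1k_requests": 10.00}
variant = {"quality": 0.85, "cost_per_1k_requests": 30.00}

quality_gain = variant["quality"] - baseline["quality"]
cost_multiplier = variant["cost_per_1k_requests"] / baseline["cost_per_1k_requests"]

print(f"Quality gain: {quality_gain:.0%}, cost multiplier: {cost_multiplier:.1f}x")
# Whether +5% quality is worth 3x the cost depends on your product and margins.
```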
Re-run Experiments Periodically
As models improve, old experiment results become outdated. Re-test important decisions quarterly.
Migration Path
Since Experiments are being deprecated, we recommend migrating to Prompt Versions for ongoing testing. Prompt Versions give you:
- Version control for all prompt changes
- A/B testing in production with traffic splitting
- Real-time comparison of version performance
- Automatic rollback if quality degrades
Alternative Approaches
Production A/B Testing
Test in production with traffic splitting, tagging each request with a custom Variant property to compare real-world performance:
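A sketch of a simple traffic split, assuming the Helicone OpenAI proxy and Helicone's Helicone-Property-&lt;Name&gt; custom property headers; the split ratio and prompts are placeholders:

```python
import os
import random
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

# Hypothetical prompt variants; 10% of traffic goes to the new version.
prompts = {
    "prompt-v1": "You are a helpful support assistant.",
    "prompt-v2": "You are a concise support assistant. Answer in three sentences or fewer.",
}
variant = "prompt-v2" if random.random() < 0.10 else "prompt-v1"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": prompts[variant]},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    # Custom properties are sent as Helicone-Property-<Name> headers and show
    # up as filterable properties on each request in the dashboard.
    extra_headers={"Helicone-Property-Variant": variant},
)
print(variant, "->", response.choices[0].message.content)
```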
Shadow Deployments
Run experiments on every production request by serving the baseline response to users while generating and logging the variant response in the background:
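A minimal sketch of the idea, reusing the same Helicone proxy setup and a hypothetical Variant property; the shadow call is only logged, never shown to the user:

```python
import os
import threading
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

def call(system_prompt: str, user_message: str, variant: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        # Tag each request so baseline and shadow runs are separable in Helicone.
        extra_headers={"Helicone-Property-Variant": variant},
    )
    return response.choices[0].message.content

def handle_request(user_message: str) -> str:
    # Serve the baseline answer to the user...
    answer = call("You are a helpful support assistant.", user_message, "baseline")
    # ...and fire the experimental prompt in the background; its output is only
    # logged via Helicone, never returned to the user.
    threading.Thread(
        target=call,
        args=("You are a concise support assistant.", user_message, "shadow"),
        daemon=True,
    ).start()
    return answer

print(handle_request("How do I reset my password?"))
```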
Next Steps
RAGAS Evaluations
Automate experiment evaluation with RAGAS metrics
Fine-Tuning
Move beyond prompt engineering to model fine-tuning
Prompt Management
Version control and deploy prompts systematically
Custom Properties
Track experiment variants in production
