# Prompt Experiments

> Run A/B tests and experiments to compare prompt versions and measure performance

Helicone's experiment system lets you compare prompt versions side-by-side, run A/B tests, and measure which variations perform best before deploying to production.

## What Are Experiments?

Experiments allow you to:

* **Compare multiple prompts or versions** against the same inputs
* **Test different models** (GPT-4 vs Claude vs Gemini)
* **Measure performance** with automated evaluators
* **Validate improvements** before production deployment
* **Build datasets** from production traffic or manual inputs

In practice, you can:

* Run multiple prompt versions simultaneously and compare outputs
* Use evaluators to score responses on quality, relevance, and custom metrics
* Test variations with real production inputs before deployment
* Create reusable test datasets from experiments

## Creating an Experiment

### From the UI

1. Go to **Experiments** in your Helicone dashboard.
2. Create an experiment:
   * **From scratch**: Manual input rows
   * **From prompt**: Use an existing prompt as the baseline
   * **From dataset**: Use a saved dataset
3. Add test cases with variables that will be passed to prompts:

   ```json
   {
     "userName": "Alice",
     "topic": "machine learning"
   }
   ```
4. Create columns for each variation you want to test:
   * Different prompt versions
   * Different models
   * Different parameters (temperature, max\_tokens)
5. Execute all hypothesis combinations and view results in a table.

### From a Prompt

Start an experiment directly from a prompt:

1. Navigate to the prompt you want to test.
2. Find the **Experiment** button in the prompt editor.
3. Choose inputs from:
   * Production request history
   * Manual input creation
   * Existing datasets
4. Create hypothesis columns with different versions or configurations.

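
Conceptually, running an experiment pairs every input row with every hypothesis column. The sketch below illustrates that expansion only. The `InputRow`, `Hypothesis`, and `Cell` types and the `expandGrid` function are hypothetical names made up for this example; they are not part of Helicone's API or SDK.

```typescript
// Hypothetical types for illustration; not Helicone SDK types.
type InputRow = Record<string, string>;

interface Hypothesis {
  label: string;
  promptVersion: string;
  model: string;
  temperature: number;
}

interface Cell {
  rowIndex: number;
  hypothesis: string;
  inputs: InputRow;
}

// Pair every input row with every hypothesis column.
function expandGrid(rows: InputRow[], hypotheses: Hypothesis[]): Cell[] {
  const cells: Cell[] = [];
  rows.forEach((inputs, rowIndex) => {
    for (const h of hypotheses) {
      cells.push({ rowIndex, hypothesis: h.label, inputs });
    }
  });
  return cells;
}

const rows: InputRow[] = [
  { userName: "Alice", topic: "machine learning" },
  { userName: "Bob", topic: "neural networks" },
];

const hypotheses: Hypothesis[] = [
  { label: "Baseline", promptVersion: "1.0", model: "gpt-4o", temperature: 0.7 },
  { label: "New Version", promptVersion: "2.0", model: "gpt-4o", temperature: 0.7 },
];

const cells = expandGrid(rows, hypotheses);
console.log(cells.length); // 2 rows × 2 hypotheses = 4 cells
```

Each cell corresponds to one model call, so the number of requests (and the cost) grows with rows × hypotheses.
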
## Experiment Structure

An experiment consists of:

### Inputs (Rows)

Each row represents a test case with variables:

| userName | topic              | context      |
| -------- | ------------------ | ------------ |
| Alice    | machine learning   | beginner     |
| Bob      | neural networks    | advanced     |
| Carol    | prompt engineering | intermediate |

### Hypotheses (Columns)

Each column represents a variation to test:

| Hypothesis  | Prompt Version | Model      | Temperature |
| ----------- | -------------- | ---------- | ----------- |
| Baseline    | 1.0            | gpt-4o     | 0.7         |
| New Version | 2.0            | gpt-4o     | 0.7         |
| Claude Alt  | 1.0            | claude-3-5 | 0.7         |
| Higher Temp | 2.0            | gpt-4o     | 0.9         |

### Results

The experiment grid shows outputs for each input × hypothesis combination, along with:

* Response text
* Token usage
* Latency
* Cost
* Evaluator scores

## Running Experiments

### Execute All Tests

Run the experiment to generate outputs for all combinations:

```typescript
// Experiments automatically execute when created via API
const response = await fetch("https://api.helicone.ai/v2/experiment/create/empty", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
});

const { experimentId } = await response.json();
```

### View Results

Results appear in a table format with:

* Input variables
* Output for each hypothesis
* Token counts and costs
* Latency measurements
* Evaluator scores

### Compare Side-by-Side

The experiment table lets you:

* Scroll horizontally to compare outputs
* Sort by evaluator scores
* Filter by input variables
* Highlight differences between versions

## Evaluators

Evaluators automatically score experiment outputs based on criteria you define.

### Adding Evaluators

1. Click the **Evaluators** button in your experiment.
2. Choose from:
   * **LLM-as-Judge**: Use GPT-4 or Claude to score responses
   * **Regex**: Pattern matching for specific content
   * **Custom**: API-based evaluator with your own logic
3. Define scoring criteria:

   ```json
   {
     "name": "Relevance",
     "description": "Score 1-5 based on how relevant the response is to the question",
     "type": "llm-judge",
     "model": "gpt-4o",
     "prompt": "Rate the relevance of this response..."
   }
   ```
4. Execute evaluators across all experiment outputs.

### Evaluator Types

#### LLM-as-Judge

Use another LLM to score responses:

```json
{
  "type": "llm-judge",
  "model": "gpt-4o",
  "criteria": [
    "Accuracy",
    "Clarity",
    "Helpfulness"
  ],
  "scale": "1-5"
}
```

#### Pattern Matching

Score based on content patterns:

```json
{
  "type": "regex",
  "patterns": [
    {
      "pattern": "\\b(sorry|apologize)\\b",
      "score": -1,
      "reason": "Unnecessary apology"
    },
    {
      "pattern": "\\b(step-by-step|walkthrough)\\b",
      "score": 1,
      "reason": "Structured response"
    }
  ]
}
```

#### Custom Evaluator

Call your own API for scoring:

```json
{
  "type": "custom",
  "endpoint": "https://api.example.com/evaluate",
  "method": "POST",
  "headers": {
    "Authorization": "Bearer YOUR_TOKEN"
  }
}
```

### Viewing Scores

Evaluator scores appear as columns in the experiment table:

| Input | Baseline Output | New Version Output | Relevance (Baseline) | Relevance (New) | Winner   |
| ----- | --------------- | ------------------ | -------------------- | --------------- | -------- |
| ...   | ...             | ...                | 3.5                  | 4.2             | New      |
| ...   | ...             | ...                | 4.0                  | 3.8             | Baseline |

## Datasets

Datasets are reusable collections of test inputs.

### Creating Datasets

You can create a dataset in several ways:

* Select requests from your logs and add them to a dataset.
* Save successful experiment inputs as a dataset for future tests.
* Create a dataset by manually defining test cases.
* Upload a CSV file with input variables.

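
For the CSV route, it can help to check locally that a file maps cleanly onto input variables before uploading it. The following is a minimal sketch under stated assumptions: the first line is a header naming the variables, and no value contains a quoted comma. The `DatasetRow` type and `parseInputsCsv` function are hypothetical helpers, not part of Helicone, and the exact CSV layout Helicone expects is not specified here.

```typescript
// Hypothetical helper for illustration; not a Helicone API.
// Assumes a header row and no quoted commas. Real-world CSV needs a proper parser.
type DatasetRow = Record<string, string>;

function parseInputsCsv(csvText: string): DatasetRow[] {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];

  // The first line names the input variables.
  const headers = lines[0].split(",").map((h) => h.trim());

  // Every following line becomes one row keyed by those names.
  return lines.slice(1).map((line) => {
    const values = line.split(",").map((v) => v.trim());
    const row: DatasetRow = {};
    headers.forEach((header, i) => {
      row[header] = values[i] ?? ""; // missing trailing values become empty strings
    });
    return row;
  });
}

const csvText = `userName,topic,context
Alice,machine learning,beginner
Bob,neural networks,advanced`;

const datasetRows = parseInputsCsv(csvText);
console.log(datasetRows);
```

Each resulting object has the same shape as the JSON test cases used for experiment inputs.
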
### Using Datasets

Reuse datasets across multiple experiments:

```typescript
// Create experiment from dataset
const response = await fetch("https://api.helicone.ai/v2/experiment", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    datasetId: "dataset-123",
    hypotheses: [
      {
        promptVersionId: "version-abc",
        model: "gpt-4o",
      },
      {
        promptVersionId: "version-xyz",
        model: "gpt-4o",
      },
    ],
  }),
});
```

## Experiment Workflows

### A/B Testing Workflow

1. Run the experiment with the current production prompt (version 1.0).
2. Create a hypothesis column with the new prompt version (version 2.0).
3. Use recent production requests as test inputs.
4. Run automated evaluators to measure quality.
5. Compare scores, costs, and latency:
   * Version 2.0: Avg score 4.3, \$0.02/request
   * Version 1.0: Avg score 3.8, \$0.03/request
6. If version 2.0 performs better, deploy it to production.

### Model Comparison

1. Choose the models to compare:
   * GPT-4o
   * Claude 3.5 Sonnet
   * Gemini 1.5 Pro
2. Keep the prompt constant and vary only the model.
3. Execute all combinations.
4. Compare the results:
   * GPT-4o: Score 4.2, \$0.02
   * Claude: Score 4.5, \$0.025
   * Gemini: Score 3.9, \$0.015

### Parameter Tuning

1. Choose the parameter values to test:
   * Temperature: 0.3, 0.7, 0.9
   * Max tokens: 500, 1000, 2000
2. Test all combinations.
3. Identify the sweet spot for quality, cost, and latency.

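
"All combinations" of the parameter values listed for tuning is a Cartesian product. The sketch below enumerates it so you can see how many hypothesis columns a sweep implies. `ParameterSet` and `buildParameterGrid` are hypothetical names for illustration, not Helicone types or functions.

```typescript
// Hypothetical shape for illustration; not a Helicone API type.
interface ParameterSet {
  temperature: number;
  maxTokens: number;
}

// Build every temperature × max-token combination.
function buildParameterGrid(
  temperatures: number[],
  maxTokenLimits: number[]
): ParameterSet[] {
  const grid: ParameterSet[] = [];
  for (const temperature of temperatures) {
    for (const maxTokens of maxTokenLimits) {
      grid.push({ temperature, maxTokens });
    }
  }
  return grid;
}

const parameterGrid = buildParameterGrid([0.3, 0.7, 0.9], [500, 1000, 2000]);
console.log(parameterGrid.length); // 3 temperatures × 3 limits = 9 combinations
```

Nine combinations mean nine hypothesis columns, and every input row is run once per column, so a sweep over 50 inputs would issue 450 requests. Trimming the grid before running it keeps cost and run time manageable.
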
## Experiment API

### Create Experiment

```bash
curl -X POST https://api.helicone.ai/v2/experiment/create/empty \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"
```

### List Experiments

```bash
curl -X GET https://api.helicone.ai/v2/experiment \
  -H "Authorization: Bearer YOUR_API_KEY"
```

### Get Experiment Details

```bash
curl -X POST https://api.helicone.ai/v1/experiment/query \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "experiment": {
        "id": { "equals": "experiment-123" }
      }
    },
    "include": {
      "inputs": true,
      "responseBodies": true
    }
  }'
```

### Delete Experiment

```bash
curl -X DELETE https://api.helicone.ai/v2/experiment/{experimentId} \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Best Practices

* **Use production data**: Test with real inputs your users will provide
* **Run evaluators consistently**: Define clear scoring criteria and use them across all experiments
* **Test multiple dimensions**: Compare prompts, models, and parameters separately
* **Build reusable datasets**: Save high-quality test cases for future experiments
* **Document findings**: Add notes to experiments explaining results and decisions
* **Don't over-optimize**: A small score difference may not matter in production
* **Consider cost**: A higher-scoring version may not justify significantly higher costs

## Analyzing Results

### Statistical Significance

Consider:

* **Sample size**: Test with at least 20-50 inputs for meaningful results
* **Variance**: Look at score distribution, not just averages
* **Edge cases**: Ensure new versions handle corner cases well

### Cost Analysis

Compare total cost per hypothesis:

```
Hypothesis A: 50 tests × $0.02 = $1.00 (avg score: 4.2)
Hypothesis B: 50 tests × $0.03 = $1.50 (avg score: 4.3)

Cost increase: 50%
Score increase: 2.4%
Verdict: Hypothesis A is more cost-effective
```

### Latency Impact

Monitor response times:

* P50 latency
* P95 latency
* Max latency

A better prompt isn't worth it if it makes your app feel slow.

## Next Steps

* Learn more about managing prompt versions
* Deploy winning variations to production
* Monitor deployed prompts in production
* Deep dive into evaluation strategies