Fine-tuning adapts pre-trained models to your specific use case, improving quality and reducing costs for specialized tasks. Helicone integrates with OpenPipe to streamline the entire fine-tuning workflow.

When to Fine-Tune

Fine-tuning is ideal when:
  • Specialized Domain: Your domain requires specialized knowledge (medical, legal, technical)
  • Consistent Format: You need consistent output formatting that prompting can’t achieve
  • Cost Optimization: High volume makes a smaller fine-tuned model more economical
  • Latency Requirements: You need faster responses than larger models provide

Don’t fine-tune if: You’re just starting out, need flexibility to change behavior frequently, or have fewer than 50 high-quality examples. Start with prompt engineering instead.

Fine-Tuning Workflow

Step 1: Set Up OpenPipe Integration

Connect your Helicone account to OpenPipe:
  1. Navigate to Settings → Integrations in your Helicone dashboard
  2. Find the OpenPipe integration
  3. Click Connect and authorize the integration
[Screenshot: OpenPipe integration configuration in the Helicone dashboard]
This allows you to manage fine-tuning datasets and jobs directly from Helicone.
Step 2: Collect Training Data

Fine-tuning requires high-quality training examples. You can:

Option 1: Use Production Data

Select successful requests from your production traffic:
// Tag high-quality responses
const response = await client.chat.completions.create(
  params,
  {
    headers: {
      "Helicone-Property-Quality": "high",
      "Helicone-Property-Use-For-Training": "true",
    },
  }
);
Then filter by these properties in Helicone to export training data.

Option 2: Create Synthetic Data

Generate examples programmatically:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {os.getenv('HELICONE_API_KEY')}"
    }
)

# Generate training examples from a list of scenario descriptions
scenarios = ["..."]  # your scenario prompts

training_examples = []
for scenario in scenarios:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Generate a training example..."},
            {"role": "user", "content": scenario}
        ],
        extra_headers={
            "Helicone-Property-Type": "synthetic-training-data"
        }
    )
    training_examples.append({
        "input": scenario,
        "output": response.choices[0].message.content
    })
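If you export the collected examples yourself, OpenAI's chat fine-tuning format expects one JSON object per line. This is a minimal sketch, continuing from the loop above; the SYSTEM_PROMPT here is a placeholder, and OpenPipe may accept other schemas, so check the format your target platform expects:

import json

# A minimal sketch: convert the collected examples into OpenAI's
# chat fine-tuning JSONL format. SYSTEM_PROMPT is a placeholder;
# use the same system prompt your production traffic uses.
SYSTEM_PROMPT = "You are a helpful assistant."

with open("training_data.jsonl", "w") as f:
    for example in training_examples:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": example["input"]},
                {"role": "assistant", "content": example["output"]},
            ]
        }
        f.write(json.dumps(record) + "\n")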
Step 3: Create a Training Dataset

In the Helicone dashboard:
  1. Go to Datasets → Create New Dataset
  2. Select requests to include (filter by your training properties)
  3. Review and clean the data
  4. Export to OpenPipe
[Screenshot: Dataset creation interface showing request selection]
Quality over quantity: Start with 50-200 high-quality examples. More data doesn’t always mean better results.
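If you prefer to pull candidate requests programmatically rather than through the dashboard, the same /v1/request/query endpoint used in the iteration section below can filter by the training properties set in Step 2. A hedged sketch; verify the exact filter shape and the "limit" parameter against the Helicone API reference:

import os

import requests

# Sketch: pull training candidates by the custom properties tagged in
# Step 2. The filter shape follows the query example later in this
# guide; "limit" is an assumed parameter.
response = requests.post(
    "https://api.helicone.ai/v1/request/query",
    headers={"Authorization": f"Bearer {os.getenv('HELICONE_API_KEY')}"},
    json={
        "filter": {
            "properties": {
                "Quality": "high",
                "Use-For-Training": "true",
            }
        },
        "limit": 500,
    },
)
candidates = response.json()["data"]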

Dataset Best Practices

Include variety in your training data:
  • Different input lengths
  • Various edge cases
  • Multiple query types
  • Representative of production distribution
Ensure all examples follow the same structure (a quick consistency check is sketched after these lists):
  • Identical system prompts
  • Consistent output format
  • Same level of detail
Every example should be:
  • Factually correct
  • Following your desired style
  • Representative of ideal behavior
  • Free of errors or inconsistencies
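A short script can catch structural drift before export. This is a minimal sketch assuming the chat-format JSONL records from the export sketch in Step 2:

import json

# Minimal consistency check over a chat-format JSONL training file.
# Flags malformed records and system prompts that differ from the
# first one seen (the dataset should use one identical system prompt).
def check_dataset(path: str) -> None:
    system_prompt = None
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            messages = json.loads(line).get("messages", [])
            if not messages or messages[0]["role"] != "system" \
                    or messages[-1]["role"] != "assistant":
                print(f"line {i}: unexpected message structure")
                continue
            if system_prompt is None:
                system_prompt = messages[0]["content"]
            elif messages[0]["content"] != system_prompt:
                print(f"line {i}: system prompt differs from the rest")

check_dataset("training_data.jsonl")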
Step 4: Configure Fine-Tuning Job

Click Start Fine-Tuning and configure:
  • Base Model: Start with a model family (GPT-4o, GPT-3.5, etc.)
  • Training Epochs: Usually 3-5 (more risks overfitting)
  • Learning Rate: Use automatic or adjust based on results
  • Validation Split: Hold out 10-20% for validation
[Screenshot: Configuration options for fine-tuning job parameters]
Fine-tuning jobs typically take 10 minutes to a few hours depending on dataset size and model.
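If you run the job through OpenAI's fine-tuning API directly instead of the OpenPipe UI (see "OpenAI Fine-Tuning API" under Resources below), the same knobs map onto the API. A sketch with placeholder file IDs; upload your JSONL files first with client.files.create(..., purpose="fine-tune"):

from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

# Sketch of the same configuration via OpenAI's fine-tuning API.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file="file-abc123",      # placeholder file ID
    validation_file="file-def456",    # placeholder ID (your 10-20% holdout)
    hyperparameters={"n_epochs": 3},  # 3-5 epochs; more risks overfitting
)
print(job.id, job.status)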
Step 5: Monitor Training Progress

Track your fine-tuning job in real-time:
  • Training loss: Should decrease steadily
  • Validation loss: Should decrease without diverging from training
  • Estimated completion time
[Screenshot: Training progress dashboard showing loss curves]
If validation loss starts increasing while training loss decreases, you’re overfitting; stop training early.
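When the job runs through OpenAI's API directly, you can also poll its event stream for progress messages. A sketch with a placeholder job ID, continuing the Step 4 sketch:

from openai import OpenAI

client = OpenAI()

# Poll the fine-tuning job's events; the dashboard shows the same
# information rendered as loss curves.
events = client.fine_tuning.jobs.list_events(
    fine_tuning_job_id="ftjob-abc123",  # placeholder from jobs.create
    limit=20,
)
for event in events.data:
    print(event.created_at, event.message)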
Step 6: Evaluate the Fine-Tuned Model

Once training completes, test your model:
import { OpenAI } from "openai";

const client = new OpenAI({
  baseURL: "https://oai.helicone.ai/v1",
  apiKey: process.env.OPENAI_API_KEY,
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// Use your fine-tuned model
const response = await client.chat.completions.create(
  {
    model: "ft:gpt-4o-mini-2024-07-18:your-org:model-name:xyz123",
    messages: [
      { role: "user", content: "Test input" }
    ],
  },
  {
    headers: {
      "Helicone-Property-Model-Type": "fine-tuned",
      "Helicone-Property-Base-Model": "gpt-4o-mini",
    },
  }
);
Compare outputs against:
  • Base model performance
  • Your validation set expectations
  • Production requirements
Step 7: Deploy and Monitor

Deploy your fine-tuned model to production:
# client is configured as in the earlier Python examples
# (Helicone base_url and auth header)
def get_model(use_finetuned: bool = True):
    if use_finetuned:
        return "ft:gpt-4o-mini-2024-07-18:your-org:model:xyz123"
    return "gpt-4o-mini"  # Fallback to the base model

response = client.chat.completions.create(
    model=get_model(),
    messages=messages,
    extra_headers={
        "Helicone-Property-Model-Version": "fine-tuned-v1",
        "Helicone-Property-Deployment": "production"
    }
)
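For a safer rollout, you might route only a fraction of traffic to the fine-tuned model at first. A hypothetical variant of get_model() that tags each cohort so the two can be compared in Helicone (client and messages come from the snippet above):

import random

# Hypothetical gradual rollout: send a fraction of traffic to the
# fine-tuned model and tag each request with its cohort.
ROLLOUT_FRACTION = 0.1  # start with 10% of traffic

def get_model_with_rollout() -> tuple[str, str]:
    if random.random() < ROLLOUT_FRACTION:
        return "ft:gpt-4o-mini-2024-07-18:your-org:model:xyz123", "fine-tuned"
    return "gpt-4o-mini", "base"

model, cohort = get_model_with_rollout()
response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_headers={"Helicone-Property-Model-Type": cohort},
)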
Track performance metrics:
  • Response quality vs. base model
  • Cost per request
  • Latency improvements
  • User satisfaction scores

Comparing Fine-Tuned vs Base Models

Run side-by-side comparisons:
async function compareModels(input: string) {
  const sessionId = `comparison-${Date.now()}`;

  // Base model
  const baseResponse = await client.chat.completions.create(
    {
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: input }],
    },
    {
      headers: {
        "Helicone-Session-Id": sessionId,
        "Helicone-Property-Model-Type": "base",
      },
    }
  );

  // Fine-tuned model
  const fineTunedResponse = await client.chat.completions.create(
    {
      model: "ft:gpt-4o-mini-2024-07-18:org:model:id",
      messages: [{ role: "user", content: input }],
    },
    {
      headers: {
        "Helicone-Session-Id": sessionId,
        "Helicone-Property-Model-Type": "fine-tuned",
      },
    }
  );

  return {
    base: baseResponse.choices[0].message.content,
    fineTuned: fineTunedResponse.choices[0].message.content,
  };
}
View both responses in the same session to compare quality, cost, and latency.

Cost Analysis

Fine-tuning economics depend on volume:

Example Calculation

Scenario: 100,000 requests/month, 500 input + 200 output tokens each
Input:  100k * 500 tokens * $2.50/1M = $125
Output: 100k * 200 tokens * $10.00/1M = $200
Total: $325/month
These are example calculations. Actual costs depend on your provider, model, and usage patterns. Always test with your own data.
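To rerun the arithmetic for your own volumes and rates, a small helper; the prices passed in the usage line are the illustrative figures from the example above:

# Reproduces the calculation above for arbitrary volumes and rates.
def monthly_cost(requests: int, input_tokens: int, output_tokens: int,
                 input_price_per_1m: float, output_price_per_1m: float) -> float:
    input_cost = requests * input_tokens / 1_000_000 * input_price_per_1m
    output_cost = requests * output_tokens / 1_000_000 * output_price_per_1m
    return input_cost + output_cost

# Example scenario: 100k requests/month, 500 input + 200 output tokens each
print(monthly_cost(100_000, 500, 200, 2.50, 10.00))  # 325.0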

Iterating on Fine-Tuned Models

Improve your model over time:

Collect Feedback

// Track which responses need improvement
await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    scores: {
      "user-satisfaction": userRating,
      "needs-retraining": userRating < 3 ? 1 : 0,
    },
  }),
});

Create New Training Data

Filter for low-scoring responses and correct them:
import os

import requests

HELICONE_API_KEY = os.getenv("HELICONE_API_KEY")

# Query requests needing correction
response = requests.post(
    "https://api.helicone.ai/v1/request/query",
    headers={"Authorization": f"Bearer {HELICONE_API_KEY}"},
    json={
        "filter": {
            "properties": {
                "Model-Type": "fine-tuned"
            },
            "scores": {
                "needs-retraining": {"gte": 1}
            }
        }
    }
)

# Export and manually correct these examples
problematic_requests = response.json()["data"]

Retrain Periodically

Create new versions as you collect more data:
  • Monthly: Add new high-quality examples
  • Quarterly: Major updates with improved examples
  • Annually: Evaluate if a newer base model would perform better

Troubleshooting

Poor Performance After Fine-Tuning

Symptoms: Great on training data, poor on new inputs

Solutions:
  • Reduce training epochs (try 2-3 instead of 5+)
  • Add more diverse training examples
  • Use a larger validation set (20%)

Symptoms: Model behavior is inconsistent

Solutions:
  • Collect 2-3x more examples
  • Focus on quality over quantity
  • Use data augmentation to increase variety

Symptoms: No improvement over base model

Solutions:
  • Try a different base model family
  • Ensure task matches model capabilities
  • Verify training data format is correct

Fine-Tuning Resources

  • Training Data Best Practices: Deep dive into creating effective training datasets
  • Model Selection Guide: Choosing the right base model for fine-tuning
  • RAG vs Fine-Tuning: When to use each approach
  • OpenAI Fine-Tuning API: Direct API usage without OpenPipe

Next Steps

  • Cost Tracking: Monitor fine-tuned model economics
  • Experiments: A/B test fine-tuned vs base models
  • RAGAS Evaluations: Evaluate fine-tuned model quality systematically
  • OpenPipe Integration: Learn more about the OpenPipe platform