LLM Fine-Tuning for Enterprise: When the OpenAI API Isn't Enough
A practical guide to enterprise LLM fine-tuning — when to fine-tune vs prompt, how to prepare training data, which technique to use, and how to evaluate results.
Most enterprise AI projects should start with the OpenAI API or another foundation model provider. Prompt engineering is fast, cheap, and more powerful than most teams realise. Fine-tuning is expensive, time-consuming, and unnecessary for the majority of use cases.
That said, there are real scenarios where fine-tuning is the right answer. This guide explains when fine-tuning is genuinely justified, how to choose the right technique, what data preparation actually involves, and what to expect from deployment.
When Fine-Tuning Is Worth It
Fine-tuning is justified in four specific scenarios:
1. Consistent output format is business-critical
Foundation models generate variable output formats unless constrained by complex system prompts. If your use case requires structured JSON, specific medical terminology, legal citation formats, or industry-specific notation — and prompt engineering alone cannot make the output reliable — fine-tuning for format consistency is worth exploring.
2. Domain knowledge is genuinely specialised
Foundation models know a lot. If your domain is covered in their training data (general software development, common business processes, standard legal concepts), you probably do not need fine-tuning. If your domain is genuinely narrow and specialised — a proprietary product line, an internal taxonomy, a specific regulatory framework — fine-tuning can meaningfully improve accuracy on domain-specific queries.
3. Latency requirements eliminate large models
GPT-4-class models take 2–8 seconds to generate a typical response. For latency-sensitive applications (real-time customer support, interactive tools, high-frequency batch processing), a fine-tuned 7B or 13B model running on dedicated hardware can achieve 100–500ms response times at 80–90% of the larger model's capability.
4. Volume makes API costs prohibitive
At sufficient query volume, the economics of running your own infrastructure beat API pricing. The break-even point varies but is typically around 500K–2M API calls per month depending on query length and complexity. Below this volume, managed APIs almost always win on total cost when engineering and infrastructure overhead are accounted for.
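The break-even arithmetic is worth doing explicitly. A minimal sketch — all prices and volumes here are illustrative assumptions, not quotes:

```python
# Rough break-even sketch: monthly managed-API spend vs self-hosted GPU spend.
# Every number below is an illustrative assumption - substitute your own.

def monthly_api_cost(calls: int, tokens_per_call: int, price_per_1k_tokens: float) -> float:
    """Total monthly spend on a managed API at a per-1K-token price."""
    return calls * tokens_per_call / 1000 * price_per_1k_tokens

def monthly_selfhost_cost(gpu_hourly: float, gpus: int, overhead: float) -> float:
    """Dedicated GPUs running 24/7, plus an engineering-overhead multiplier."""
    return gpu_hourly * gpus * 24 * 30 * (1 + overhead)

api = monthly_api_cost(calls=1_000_000, tokens_per_call=1_500, price_per_1k_tokens=0.002)
selfhost = monthly_selfhost_cost(gpu_hourly=2.0, gpus=1, overhead=0.5)
print(f"API: ${api:,.0f}/mo  self-host: ${selfhost:,.0f}/mo")
```

With these assumed numbers, self-hosting edges ahead at around 1M calls per month — consistent with the 500K–2M range above, but the crossover moves substantially with token counts and GPU pricing.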
Fine-tuning is NOT a solution for:
- Knowledge injection (use RAG instead — it is cheaper, more updatable, and auditable)
- Fixing a bad base model (a better base model is almost always faster and cheaper)
- Making a model "smarter" in general (fine-tuning specialises; it does not generalise)
The Three Fine-Tuning Techniques
Full Fine-Tuning
All model parameters are updated during training. Produces the highest potential accuracy improvement but requires significant GPU compute and has the highest risk of catastrophic forgetting (degrading the model's general capabilities).
When to use: Large proprietary datasets (100K+ examples), significant budget, and a clear need for maximum domain specialisation.
Hardware requirements: Multi-GPU setup (8× A100 80GB for a 70B parameter model). Training a 7B model requires at minimum 2× A100 40GB.
LoRA (Low-Rank Adaptation)
Adds small trainable matrices to frozen model layers. Only the LoRA parameters are updated — typically 1–5% of total model parameters. Preserves the base model's general capabilities while adding domain-specific behaviour.
When to use: Most enterprise fine-tuning scenarios. Well-tested, predictable, and produces good results with 1,000–100,000 training examples.
Hardware requirements: A single A100 40GB or equivalent handles 7B–13B parameter models. 70B models require 4–8× A100 with model parallelism.
QLoRA (Quantized LoRA)
LoRA training on a quantised (4-bit or 8-bit) version of the model. Dramatically reduces memory requirements — a 70B model can be trained on 2× A100 40GB. Minor accuracy tradeoff compared to standard LoRA.
When to use: When you need to fine-tune large models on limited hardware budgets. The accuracy difference versus full LoRA is usually small in practice.
Hardware requirements: A single A100 40GB can fine-tune a 13B model, 2× A100 for 70B models.
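The memory savings come from loading the frozen base model in 4-bit. A sketch of the loading step, assuming the `transformers`, `bitsandbytes`, and `peft` libraries are installed (the model name is a placeholder):

```python
# QLoRA setup sketch: load the frozen base model 4-bit quantised so LoRA
# adapters can be trained on top of it with a fraction of the memory.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantisation of frozen weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
    bnb_4bit_use_double_quant=True,         # second quantisation pass on constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",              # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
```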
Dataset Preparation
Dataset quality dominates training outcomes. An excellent dataset with 1,000 examples will outperform a poor dataset with 100,000 examples.
What makes a good fine-tuning dataset
Format: Instruction-response pairs that exactly represent the task you want the model to learn. For a customer support fine-tune: {"instruction": "Customer says: [query]", "response": "[ideal support response]"}.
Consistency: The labels (expected responses) must be internally consistent. Inconsistent labelling is the primary cause of fine-tuning failures. If two similar inputs have contradictory expected outputs, the model cannot learn a coherent behaviour.
Coverage: The dataset must cover the full distribution of inputs the model will see in production. A dataset covering only simple cases will produce a model that fails on edge cases.
Size: For LoRA fine-tuning, 1,000 high-quality examples is often sufficient for format and style adaptation. For knowledge injection or significant behavioural change, 10,000–50,000 examples are typically needed.
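In practice these instruction-response pairs are usually stored as JSONL, one record per line. A minimal example — the field names are an assumption; match whatever your training framework expects:

```python
import json

# One instruction-response record in the shape described above.
example = {
    "instruction": "Customer says: My invoice shows a duplicate charge for March.",
    "response": "I'm sorry about that. I've flagged the duplicate March charge "
                "for a refund; you should see it reversed within 3-5 business days.",
}

# JSONL means exactly one JSON object per line of the training file.
line = json.dumps(example)
record = json.loads(line)
print(sorted(record.keys()))
```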
Data sources for enterprise fine-tuning
- Human-labelled examples: Most reliable but expensive. Use for the most critical training examples.
- Existing logs: If your team has answered similar queries manually (support tickets, emails, Q&A databases), this is valuable training data. Requires cleaning and standardisation.
- Synthetic data: GPT-4 or Claude generating training examples is a common and effective approach. Must be validated — synthetic data amplifies biases in the prompt used to generate it.
- Hybrid: Generate synthetic data at scale, then have human annotators review and correct the most important or edge-case examples.
Data quality checks
Before training, run these checks on your dataset:
- Duplication: Remove near-duplicate examples — they inflate training loss without adding information
- Length distribution: Check that instruction and response lengths are reasonable; outliers often indicate parsing errors
- Format consistency: Validate that all examples match the expected format
- Label consistency: Sample 100–200 examples and manually verify the expected outputs are correct and consistent
Training Setup
Base model selection
Start with the best available base model for your target size:
- 7B class: Llama 3.1 8B, Mistral 7B, Phi-3 Mini — good balance of speed and capability
- 13B class: Llama 2 13B — solid general capability with manageable hardware requirements
- 70B class: Llama 3.1 70B, Mixtral 8×7B — near-GPT-4 capability for complex tasks
For most enterprise fine-tuning, start with the 7B–13B range. If results are insufficient, move up.
Hyperparameters to tune
The most impactful hyperparameters for LoRA fine-tuning:
- Learning rate: 2e-4 to 2e-5. Too high causes instability; too low means slow learning. Start at 2e-4 and reduce if training loss is noisy.
- LoRA rank (r): 8–64. Higher rank = more trainable parameters = more capacity but more risk of overfitting. Start at 16.
- LoRA alpha: Typically 2× the rank value.
- Batch size: Maximize within GPU memory constraints. Gradient accumulation allows effective large batch sizes.
- Number of epochs: 1–3 epochs typically. More epochs risks overfitting on small datasets.
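The starting points above translate into a small configuration object. A sketch using the Hugging Face PEFT library (the target modules are an assumption for Llama-style architectures):

```python
from peft import LoraConfig, TaskType

# Starting-point LoRA hyperparameters from the list above.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,             # rank: start at 16, raise only if the model underfits
    lora_alpha=32,    # convention: 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
```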
Training framework
Use Hugging Face PEFT + TRL for most LoRA/QLoRA training — well-maintained, GPU-efficient, and integrated with the Hugging Face ecosystem. Use DeepSpeed ZeRO for multi-GPU training.
Track experiments with MLflow or W&B: log all hyperparameters, training loss curves, and validation metrics from the start. You will want this when debugging underperformance.
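A minimal training-loop sketch with TRL — note that TRL's API has shifted between versions, and the model name, dataset path, and LoRA settings here are placeholders:

```python
# Minimal PEFT + TRL supervised fine-tuning sketch (placeholder paths/names).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",        # placeholder base model
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32),
    args=SFTConfig(
        output_dir="./finetune-out",
        num_train_epochs=2,
        learning_rate=2e-4,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,      # effective large batch on one GPU
        report_to="wandb",                  # or "mlflow" - log from the start
    ),
)
trainer.train()
```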
Evaluation
Task-specific evaluation
Define 50–200 evaluation examples that represent your actual use case. These must be held out from training. Evaluate:
- Accuracy on expected outputs: Does the model generate the right answer?
- Format compliance: Does the output match the required format?
- Edge case handling: Does the model handle unusual inputs gracefully?
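The first two checks can be scored automatically over the held-out set. A sketch where `generate` stands in for any callable that maps an instruction to model output, and "valid JSON" is used as an example format check — substitute your own format validator:

```python
import json

def evaluate(examples, generate):
    """Score held-out examples for exact-match accuracy and format compliance.
    `generate` is any callable mapping an instruction string to model output."""
    correct = fmt_ok = 0
    for ex in examples:
        out = generate(ex["instruction"])
        # Format compliance: "output must be valid JSON" as an example check.
        try:
            json.loads(out)
            fmt_ok += 1
        except ValueError:
            pass
        # Accuracy: exact match is the strictest option; relax as the task needs.
        if out.strip() == ex["response"].strip():
            correct += 1
    n = len(examples)
    return {"accuracy": correct / n, "format_compliance": fmt_ok / n}
```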
Regression testing
Fine-tuning can degrade the model's general capabilities (catastrophic forgetting with full fine-tuning, less common with LoRA). Test the fine-tuned model on a general capability benchmark to confirm no significant regression.
Human evaluation
For tasks with no single "correct" answer (writing style adaptation, customer tone), human evaluation of blinded A/B comparisons between base model and fine-tuned model is the most reliable signal.
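Blinding is the easy part to get wrong: raters must not be able to infer which side is which. A small helper sketch that randomises presentation order and keeps an answer key for un-blinding afterwards:

```python
import random

def blind_pairs(base_outputs, tuned_outputs, seed=0):
    """Randomise left/right order per pair so raters can't tell which model
    produced which output. Returns shuffled pairs plus an answer key."""
    rng = random.Random(seed)  # fixed seed makes the blinding reproducible
    pairs, key = [], []
    for base, tuned in zip(base_outputs, tuned_outputs):
        if rng.random() < 0.5:
            pairs.append((base, tuned))
            key.append("base_left")
        else:
            pairs.append((tuned, base))
            key.append("tuned_left")
    return pairs, key
```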
Deployment Considerations
Serving fine-tuned models
Options in order of operational complexity:
- OpenAI fine-tuning API: If you fine-tuned using OpenAI, you can serve directly via their API. Simplest but no data sovereignty.
- vLLM: Open-source, high-throughput serving for Llama, Mistral, and other architectures. GPU-efficient with paged attention. Best for self-hosted production.
- llama.cpp: CPU-runnable (slower), or GPU-accelerated. Good for edge deployment or limited GPU budget.
- Managed inference services: AWS Bedrock, Azure ML, Hugging Face Inference Endpoints — simplify infrastructure but add cost.
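One practical consequence of choosing vLLM: it exposes an OpenAI-compatible HTTP endpoint (via `vllm serve <model>`), so existing client code needs little change. A sketch of building a chat-completion request for such a server — the model name and endpoint path are illustrative:

```python
import json

# Request payload for a vLLM OpenAI-compatible server started with
# `vllm serve <model>`. Model name here is a hypothetical merged fine-tune.
payload = {
    "model": "my-org/support-7b-merged",
    "messages": [{"role": "user", "content": "Where is my refund?"}],
    "max_tokens": 256,
    "temperature": 0.2,   # low temperature for consistent output format
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with any HTTP client.
```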
Merging LoRA adapters
For production deployment, merge the LoRA adapters back into the base model weights to eliminate the adapter overhead at inference time. This produces a single self-contained checkpoint, in the base model's precision, that any standard inference server can load.
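With the PEFT library the merge is a few lines. A sketch — the model name and adapter/output paths are placeholders:

```python
# Sketch: fold trained LoRA adapters into the base weights so the result
# serves as an ordinary checkpoint with no adapter overhead.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base, "./finetune-out")   # adapter checkpoint
merged = model.merge_and_unload()                           # plain model, adapters folded in
merged.save_pretrained("./support-7b-merged")
```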
The Build vs Buy Decision
One more consideration before embarking on fine-tuning: the total cost of ownership.
Fine-tuning costs include:
- Data preparation: 2–6 weeks of engineering time for data cleaning, formatting, and quality checks
- Training compute: $500–$5,000 for a LoRA run on 7B–13B models (on cloud GPUs); more for large models or many iterations
- Evaluation infrastructure: Building evaluation pipelines and golden datasets
- Ongoing maintenance: Retraining as your data distribution changes, monitoring for performance drift
For many use cases, OpenAI's managed fine-tuning API or a better-prompted foundation model is cheaper over 12 months than self-hosted fine-tuning. Do the math for your specific volume and requirements.
Fine-tuning is a powerful tool. It is not the default tool.
Our team has fine-tuned LLMs for manufacturing, education, and enterprise clients — ranging from 7B models for real-time applications to 70B models for complex reasoning tasks. If you are evaluating whether fine-tuning makes sense for your use case, talk to us.
