LLMs in Translation Quality Assessment: Capabilities and Limitations in 2025
Two years ago, translation quality assessment was a purely human job. A linguist read the source, read the target, marked errors, assigned severity levels, and moved on to the next segment. It was accurate but slow, expensive, and limited to small sample sizes.
LLMs changed that equation. Models like GPT-4, Claude, and Gemini can now identify translation errors, explain quality issues, and produce MQM-compliant evaluations at scale. They're not perfect — but they're good enough to rethink how QA works.
This guide covers what LLMs can and can't do for translation QA, with practical guidance on making them work in production.
How LLMs Evaluate Translation Quality
Traditional MTQE models output a single score. An LLM does something fundamentally different: it reads both texts, reasons about them, and explains what it finds in plain language.
| Capability | Description |
|---|---|
| Natural language output | Explains errors in understandable terms |
| Zero-shot learning | Works without domain-specific training |
| Contextual understanding | Considers document-level context |
| Multilingual | Supports 100+ language pairs |
| Flexible instructions | Adapts to custom quality criteria via prompts |
Basic LLM Evaluation Flow
```
┌──────────────────────────────────────────────────────────┐
│ Input                                                    │
│ ┌──────────────┐  ┌──────────────┐  ┌────────────────┐   │
│ │ Source Text  │  │ Translation  │  │ Instructions   │   │
│ │ (English)    │  │ (German)     │  │ (MQM criteria) │   │
│ └──────┬───────┘  └──────┬───────┘  └───────┬────────┘   │
│        └─────────────────┼──────────────────┘            │
│                 ┌────────▼────────┐                      │
│                 │       LLM       │                      │
│                 │ (GPT-4/Claude)  │                      │
│                 └────────┬────────┘                      │
│ Output                   ▼                               │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ • Error annotations with categories                  │ │
│ │ • Severity levels (Critical/Major/Minor)             │ │
│ │ • Explanations for each issue                        │ │
│ │ • Overall quality score                              │ │
│ │ • Improvement suggestions                            │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
```

Current LLM Capabilities for Translation QA (2025)
What LLMs Do Well
Based on benchmarks and production deployments:
1. Error Detection
LLMs effectively identify:
- Mistranslations and meaning changes (85-90% accuracy)
- Omissions and additions (85-90% accuracy)
- Grammar and spelling errors (95%+ accuracy)
- Terminology inconsistencies (90%+ with glossary)
- Style and register issues (80-85% accuracy)
Those numbers are good enough for a first pass. Not good enough to skip human review on a pharmaceutical insert.
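The MQM score used throughout this guide is derived from weighted severity penalties. A minimal sketch, assuming the common 1/5/25 penalty weights for Minor/Major/Critical errors (real projects calibrate their own weights and often normalize per word count):

```python
# Illustrative MQM-style scoring; the 1/5/25 weights are a common
# convention, not a fixed standard.
SEVERITY_WEIGHTS = {"Minor": 1, "Major": 5, "Critical": 25}

def mqm_score(errors):
    """Compute an MQM-style score as 100 minus weighted penalties."""
    penalty = sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)
    return max(0, 100 - penalty)

errors = [
    {"severity": "Major"},   # e.g. a mistranslation
    {"severity": "Minor"},   # e.g. a terminology preference
]
print(mqm_score(errors))  # → 94
```

This is why a single Major error yields the score of 95 seen in the example below: 100 minus one Major penalty of 5.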
2. Error Categorization
LLMs can classify errors according to MQM taxonomy:
Example Output:

```
Source: "The server will restart automatically."
Translation: "Der Server wird manuell neu gestartet."

LLM Analysis:
{
  "errors": [
    {
      "type": "Accuracy/Mistranslation",
      "source_span": "automatically",
      "target_span": "manuell",
      "severity": "Major",
      "explanation": "The translation says 'manually' but the source says 'automatically' - this reverses the meaning."
    }
  ],
  "score": 95,
  "overall_assessment": "One major accuracy error that changes operational meaning."
}
```

3. Contextual Evaluation
This is where LLMs genuinely outperform older approaches. They consider:
- Document-level consistency
- Term usage across the text
- Tone and style coherence
- Reference to previously translated content
A traditional QA check looks at one segment in isolation. An LLM can notice that "Dashboard" was translated as "Armaturenbrett" in segment 12 but "Übersicht" in segment 47.
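A cross-segment consistency check of this kind can also be approximated deterministically before involving an LLM. A minimal sketch, using hypothetical segment data where each segment records how a tracked source term was rendered:

```python
from collections import defaultdict

def find_inconsistent_terms(segments, source_term):
    """Map each target rendering of a source term to the segments using it.

    Returns an empty dict when the term is rendered consistently.
    """
    renderings = defaultdict(list)
    for seg in segments:
        if source_term.lower() in seg["source"].lower():
            renderings[seg["target_term"]].append(seg["id"])
    return dict(renderings) if len(renderings) > 1 else {}

# Hypothetical example data mirroring the Dashboard case above
segments = [
    {"id": 12, "source": "Open the Dashboard", "target_term": "Armaturenbrett"},
    {"id": 47, "source": "Dashboard settings", "target_term": "Übersicht"},
]
print(find_inconsistent_terms(segments, "Dashboard"))
# → {'Armaturenbrett': [12], 'Übersicht': [47]}
```

In practice this kind of pre-check catches the mechanical inconsistencies cheaply, leaving the LLM to judge which rendering is actually correct in context.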
4. Explanation Generation
Unlike black-box models, LLMs explain their reasoning:
> "The translation uses the informal 'du' form, but the source text and the formal business context suggest the formal 'Sie' should be used. This is a Style/Register error with Minor severity as it doesn't affect meaning but impacts brand voice consistency."

This kind of feedback is actually useful to translators. A score of 94 tells them nothing. This tells them what to fix.
Benchmark Performance (2025)
| Model | Error Detection | Severity Accuracy | MQM Alignment | Speed |
|---|---|---|---|---|
| GPT-4 Turbo | 87% | 82% | High | 2-4s |
| Claude 3.5 Sonnet | 86% | 84% | High | 2-3s |
| Gemini 1.5 Pro | 84% | 80% | Medium | 2-4s |
| GPT-4o | 85% | 81% | High | 1-2s |
| Claude 3 Haiku | 78% | 75% | Medium | 0.5-1s |
Based on MQM-annotated test sets across EN-DE, EN-FR, EN-ZH language pairs
LLM Limitations for Translation QA
Here's where honesty matters. LLMs have real limitations, and ignoring them leads to bad outcomes.
1. Hallucination Risk
LLMs sometimes flag errors that don't exist, or miss real ones:
False Positive Example:

```
Source: "The quick brown fox"
Translation: "Der schnelle braune Fuchs"
LLM (incorrectly): "Minor fluency issue - consider 'flinke' instead of 'schnelle'"
Reality: Both translations are perfectly valid.
```

Mitigation: Implement confidence thresholds and human review for critical content.
2. Inconsistent Severity Assessment
The same error may receive different severity ratings across runs:
```
Run 1: "Terminology error - Major severity"
Run 2: "Terminology error - Minor severity"
```

This is a real problem for any process that needs repeatability.
Mitigation: Use temperature=0, structured outputs, and calibration prompts.
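Beyond temperature=0, one practical guard (not mentioned above, but a common pattern) is to run the evaluation several times and keep the modal severity. A sketch, assuming a hypothetical `evaluate` callable that returns a severity string:

```python
from collections import Counter

def stable_severity(evaluate, segment, runs=3):
    """Call the evaluator multiple times and return the most common severity."""
    votes = Counter(evaluate(segment) for _ in range(runs))
    return votes.most_common(1)[0][0]

# Hypothetical evaluator that is not perfectly repeatable
responses = iter(["Major", "Minor", "Major"])
print(stable_severity(lambda seg: next(responses), "segment text"))  # → Major
```

The trade-off is obvious: three runs cost three times as much, so this is best reserved for segments where severity drives a routing decision.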
3. Domain Knowledge Gaps
General-purpose LLMs don't know that "contra-indicated" has a specific meaning in pharmacology, or that "force majeure" needs careful handling in French legal contexts.
- Medical terminology nuances
- Legal jurisdiction-specific terms
- Industry-specific jargon
- Cultural references
Mitigation: Provide domain context, glossaries, and reference materials in prompts.
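Injecting that context is mostly a matter of prompt assembly. A minimal sketch; the template and `build_domain_prompt` helper are illustrative, not a fixed API:

```python
def build_domain_prompt(source, translation, domain, glossary):
    """Assemble an evaluation prompt that carries domain context and glossary."""
    glossary_lines = "\n".join(f"- {src} => {tgt}" for src, tgt in glossary.items())
    return (
        f"You are evaluating a {domain} translation.\n"
        f"Glossary (must be followed):\n{glossary_lines}\n\n"
        f'Source: "{source}"\n'
        f'Translation: "{translation}"\n'
        "Flag any deviation from the glossary as a Terminology error."
    )

prompt = build_domain_prompt(
    "The drug is contra-indicated in pregnancy.",
    "Das Medikament ist in der Schwangerschaft kontraindiziert.",
    domain="pharmaceutical",
    glossary={"contra-indicated": "kontraindiziert"},
)
```

Keeping the glossary in a structured store and rendering it per request also means terminology updates take effect immediately, without retraining anything.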
4. Language Pair Variability
Performance varies significantly by language:
| Language Pair | Relative Performance |
|---|---|
| EN ↔ DE/FR/ES | High (benchmark languages) |
| EN ↔ ZH/JA/KO | Medium-High |
| EN ↔ AR/HE | Medium |
| Low-resource pairs | Lower |
Mitigation: Calibrate thresholds per language pair; consider human review for lower-performing pairs.
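Per-pair calibration can be as simple as a lookup table that routes lower-performing pairs to human review more aggressively. A sketch with illustrative threshold values (tune these against your own validation data):

```python
# Illustrative auto-approval confidence thresholds per language pair;
# weaker pairs get stricter thresholds, so more content reaches humans.
PAIR_THRESHOLDS = {
    ("en", "de"): 0.90,
    ("en", "zh"): 0.93,
    ("en", "ar"): 0.96,
}
DEFAULT_THRESHOLD = 0.98  # low-resource pairs: almost always human review

def needs_human_review(pair, llm_confidence):
    threshold = PAIR_THRESHOLDS.get(pair, DEFAULT_THRESHOLD)
    return llm_confidence < threshold

print(needs_human_review(("en", "de"), 0.92))  # → False
print(needs_human_review(("en", "sw"), 0.92))  # → True
```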
5. No Guaranteed Consistency
LLMs may evaluate identical segments differently depending on position in the batch:
```
Segment A at position 10: "No errors found"
Same segment at position 50: "Minor style issue flagged"
```

Mitigation: Batch processing with consistent context, deterministic settings.
Implementing LLM-Based QA
Prompt Engineering for Translation QA
Your prompts are everything. A bad prompt turns a great model into a mediocre evaluator.
Basic Prompt Structure:
```
You are a professional translation quality evaluator. Analyze the following
translation according to MQM (Multidimensional Quality Metrics) standards.

Source Language: {source_lang}
Target Language: {target_lang}
Domain: {domain}

Source Text: "{source_text}"
Translation: "{translation}"

Additional Context:
- Glossary terms: {glossary}
- Style requirements: {style_guide}

Evaluate the translation and provide:
1. List of errors with:
   - Error type (Accuracy, Fluency, Terminology, Style, Locale, Design)
   - Specific subtype
   - Severity (Critical, Major, Minor)
   - Source span and target span
   - Explanation
2. Overall MQM score (100 - weighted penalties)
3. Brief quality summary

Respond in JSON format.
```

Advanced Prompt with Calibration:
```
You are an expert LQA evaluator. Before evaluation, review these calibration
examples that show correct severity assignments for this project:

Example 1 - Major Error:
Source: "Do not exceed 10mg daily"
Translation: "Nehmen Sie täglich 10mg ein"
Issue: Omission of "Do not exceed" - safety-critical information missing
Severity: Major (would be Critical in medical/pharma context)

Example 2 - Minor Error:
Source: "Click the button"
Translation: "Klicken Sie auf den Button"
Issue: "Button" could be "Schaltfläche" per glossary
Severity: Minor (meaning preserved, terminology preference)

Now evaluate: [...]
```

Calibration examples in prompts make a huge difference. In our testing, they reduce severity disagreement with human evaluators by about 30%.
Structured Output for Reliability
Use JSON schema or function calling for consistent outputs:
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[...],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "translation_evaluation",
            "schema": {
                "type": "object",
                "properties": {
                    "errors": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "error_type": {"type": "string"},
                                "subtype": {"type": "string"},
                                "severity": {"enum": ["Critical", "Major", "Minor"]},
                                "source_span": {"type": "string"},
                                "target_span": {"type": "string"},
                                "explanation": {"type": "string"}
                            },
                            "required": ["error_type", "severity", "explanation"]
                        }
                    },
                    "score": {"type": "number", "minimum": 0, "maximum": 100},
                    "summary": {"type": "string"}
                },
                "required": ["errors", "score", "summary"]
            }
        }
    },
    temperature=0
)
```

Batch Processing Architecture
For production deployments:
```
┌───────────────────────────────────────────────────────┐
│                  Translation Batch                    │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐         │
│  │Seg 1 │ │Seg 2 │ │Seg 3 │ │Seg 4 │ │Seg 5 │ ...     │
│  └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘         │
│     └────────┴────────┼────────┴────────┘             │
│            ┌──────────▼──────────┐                    │
│            │   Batch Processor   │                    │
│            │ - Group by context  │                    │
│            │ - Include glossary  │                    │
│            │ - Add style guide   │                    │
│            └──────────┬──────────┘                    │
│       ┌───────────────┼───────────────┐               │
│       ▼               ▼               ▼               │
│   ┌──────┐        ┌──────┐        ┌──────┐            │
│   │LLM 1 │        │LLM 2 │        │LLM 3 │  Parallel  │
│   └──┬───┘        └──┬───┘        └──┬───┘            │
│      └───────────────┼───────────────┘                │
│           ┌──────────▼──────────┐                     │
│           │  Result Aggregator  │                     │
│           │ - Combine results   │                     │
│           │ - Calculate scores  │                     │
│           │ - Generate report   │                     │
│           └─────────────────────┘                     │
└───────────────────────────────────────────────────────┘
```

Cost Optimization
LLM-based QA costs more than MTQE. Here's how to keep it manageable:
1. Tiered Processing
```python
def evaluate_segment(segment, mtqe_score):
    if mtqe_score >= 0.95:
        return {"status": "auto_approve", "score": 98}
    elif mtqe_score >= 0.75:
        # Use faster, cheaper model
        return evaluate_with_llm(segment, model="gpt-4o-mini")
    else:
        # Use best model for problematic segments
        return evaluate_with_llm(segment, model="gpt-4-turbo")
```

Don't send obviously good segments to an expensive model. That's burning money.
2. Batch Segments
Instead of one segment per API call, batch related segments:
```python
# Instead of 100 API calls for 100 segments,
# send 10 batches of 10 segments each
batch_prompt = f"""
Evaluate these 10 segments from the same document:

Segment 1:
Source: "{seg1_source}"
Translation: "{seg1_target}"

Segment 2:
...
"""
```

3. Cache Common Evaluations
```python
import hashlib

evaluation_cache = {}  # in production, back this with Redis or a database

def get_cached_evaluation(source, target):
    cache_key = hashlib.md5(f"{source}||{target}".encode()).hexdigest()
    if cache_key in evaluation_cache:
        return evaluation_cache[cache_key]
    return None
```

Comparing LLM Providers for Translation QA
OpenAI GPT-4 Family
| Model | Best For | Pricing (Dec 2024) |
|---|---|---|
| GPT-4 Turbo | Highest accuracy | $10/1M input, $30/1M output |
| GPT-4o | Balance of speed/quality | $2.50/1M input, $10/1M output |
| GPT-4o-mini | High volume, lower stakes | $0.15/1M input, $0.60/1M output |
Best overall accuracy and reliable JSON output. The main concern is cost at scale.
Anthropic Claude
| Model | Best For | Pricing |
|---|---|---|
| Claude 3.5 Sonnet | Production QA | $3/1M input, $15/1M output |
| Claude 3 Haiku | Fast screening | $0.25/1M input, $1.25/1M output |
Strong reasoning and particularly good at following complex evaluation instructions. Severity accuracy is slightly higher than GPT-4 in our testing.
Google Gemini
| Model | Best For | Pricing |
|---|---|---|
| Gemini 1.5 Pro | Long documents | $1.25/1M input, $5/1M output |
| Gemini 1.5 Flash | Fast processing | $0.075/1M input, $0.30/1M output |
The 1M+ token context window is genuinely useful for document-level QA. JSON output is less reliable than OpenAI — budget extra time for prompt engineering.
Hybrid LLM + Human Workflow
The best results come from combining LLM speed with human judgment. Neither alone is enough.
Workflow Design
```
┌──────────────────────────────────────────────────────────┐
│                    Translation Input                     │
└────────────────────────────┬─────────────────────────────┘
                 ┌───────────▼──────────────┐
                 │      LLM Evaluation      │
                 │ (All segments, parallel) │
                 └───────────┬──────────────┘
       ┌─────────────────────┼─────────────────────┐
   No errors           Minor errors          Major/Critical
       │                     │                     │
       ▼                     ▼                     ▼
  ┌─────────┐         ┌───────────┐       ┌───────────────┐
  │ Accept  │         │Sample 10% │       │  100% Human   │
  │         │         │ Human QC  │       │    Review     │
  └─────────┘         └─────┬─────┘       └───────┬───────┘
                            ▼                     ▼
               ┌─────────────────────────────────┐
               │      LLM Accuracy Tracking      │
               │     (Compare LLM vs Human)      │
               │  - Update confidence scores     │
               │  - Adjust thresholds            │
               │  - Improve prompts              │
               └─────────────────────────────────┘
```

Confidence Calibration
Track LLM performance over time:
```python
# After human review
def update_confidence(llm_result, human_result):
    agreement = compare_evaluations(llm_result, human_result)

    # Update running statistics
    update_stats(
        language_pair=llm_result.lang_pair,
        error_type=llm_result.error_type,
        severity=llm_result.severity,
        human_agreed=agreement
    )

    # Adjust thresholds if accuracy drops
    if get_recent_accuracy() < 0.85:
        increase_human_review_rate()
```

This feedback loop is what separates a proof-of-concept from a production system. Without it, you're flying blind.
FAQ
Can LLMs replace human translators for quality assessment?
LLMs can handle 70-80% of routine QA tasks effectively, but they can't fully replace human evaluators. They're good at catching objective errors — spelling, grammar, obvious mistranslations — but struggle with cultural appropriateness, creative content, and context-dependent meaning. The optimal approach is hybrid: LLMs for initial evaluation and flagging, humans for verification and edge cases.
Which LLM is best for translation quality assessment?
As of 2025, GPT-4 Turbo and Claude 3.5 Sonnet offer the best accuracy for translation QA. For high-volume, lower-stakes content, GPT-4o-mini or Claude Haiku provide good cost-performance balance. The best choice depends on your specific language pairs, domain, and budget. We recommend benchmarking 2-3 models on your actual content before committing.
How much does LLM-based translation QA cost?
Costs vary by volume and model. For GPT-4o at typical translation QA prompt sizes:
- 1,000 segments: ~$0.50-1.00
- 10,000 segments: ~$5-10
- 100,000 segments: ~$50-100
Using tiered approaches (MTQE filtering + cheaper models for easy cases) can reduce costs by 50-70% while maintaining quality.
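These figures can be sanity-checked with a back-of-the-envelope token estimate. A sketch assuming roughly 100 input and 60 output tokens per segment (plausible for short segments evaluated in batches, where prompt instructions are amortized; your sizes will differ) at the GPT-4o list prices from the table above:

```python
def estimate_cost(segments, in_tokens_per_seg, out_tokens_per_seg,
                  in_price_per_m, out_price_per_m):
    """Rough API cost estimate in USD for a QA run."""
    input_cost = segments * in_tokens_per_seg * in_price_per_m / 1_000_000
    output_cost = segments * out_tokens_per_seg * out_price_per_m / 1_000_000
    return input_cost + output_cost

# GPT-4o list prices; the token counts per segment are assumptions
cost = estimate_cost(10_000, in_tokens_per_seg=100, out_tokens_per_seg=60,
                     in_price_per_m=2.50, out_price_per_m=10.00)
print(f"${cost:.2f}")  # → $8.50
```

Long segments, verbose explanations, or unbatched prompts can easily multiply these numbers, so measure actual token usage early in a pilot.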
How do I validate LLM QA accuracy for my content?
Create a test set of 200-500 segments with human MQM annotations. Run LLM evaluation and compare:
- Error detection rate (does LLM find the same errors?)
- Severity alignment (does LLM assign similar severity?)
- False positive rate (how often does LLM flag non-errors?)
Target 85%+ agreement for production use. Re-validate quarterly as models update.
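The three comparison metrics above reduce to set arithmetic over error annotations. A minimal sketch that treats each error as a (span, severity) pair and matches spans exactly (production comparisons usually allow overlap-based fuzzy matching):

```python
def agreement_metrics(llm_errors, human_errors):
    """Compare LLM and human error annotations for one validation set.

    Each error is a (span, severity) tuple. Returns
    (detection_rate, false_positive_rate, severity_agreement).
    """
    llm_spans = {span for span, _ in llm_errors}
    human_spans = {span for span, _ in human_errors}
    matched = llm_spans & human_spans

    detection_rate = len(matched) / len(human_spans) if human_spans else 1.0
    false_positive_rate = (
        len(llm_spans - human_spans) / len(llm_spans) if llm_spans else 0.0
    )
    llm_sev, human_sev = dict(llm_errors), dict(human_errors)
    severity_agreement = (
        sum(1 for s in matched if llm_sev[s] == human_sev[s]) / len(matched)
        if matched else 1.0
    )
    return detection_rate, false_positive_rate, severity_agreement

llm = [("automatically", "Major"), ("Button", "Minor")]   # one extra flag
human = [("automatically", "Major")]
print(agreement_metrics(llm, human))  # → (1.0, 0.5, 1.0)
```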
Can LLMs handle specialized domains like medical or legal translation?
Yes, but with extra setup. For specialized domains:
- Provide domain-specific glossaries in prompts
- Include example errors from your domain in calibration
- Use domain context ("This is a pharmaceutical product insert")
- Increase human review percentage for high-risk content
- Consider fine-tuning or RAG approaches for very specialized terminology
LLMs won't replace human quality evaluators anytime soon. But they've made it possible to check 100% of translated content instead of 5%, catch errors before they reach customers, and give translators feedback they can actually use. The organizations getting the most out of LLM QA aren't the ones who trust it blindly — they're the ones who've built proper calibration loops and know exactly where the model needs human backup.
Ready to implement LLM-powered translation QA? Try KTTC for production-ready AI LQA with MQM compliance and hybrid human-AI workflows.
