
LLMs in Translation Quality Assessment: Capabilities and Limitations in 2025

alex-chen · 1/12/2025 · 12 min read

Tags: llm, gpt-4, claude, translation-quality, ai-lqa, machine-learning

Two years ago, translation quality assessment was a purely human job. A linguist read the source, read the target, marked errors, assigned severity levels, and moved on to the next segment. It was accurate but slow, expensive, and limited to small sample sizes.

LLMs changed that equation. Models like GPT-4, Claude, and Gemini can now identify translation errors, explain quality issues, and produce MQM-compliant evaluations at scale. They're not perfect — but they're good enough to rethink how QA works.

This guide covers what LLMs can and can't do for translation QA, with practical guidance on making them work in production.

How LLMs Evaluate Translation Quality

Traditional MTQE models output a single score. An LLM does something fundamentally different: it reads both texts, reasons about them, and explains what it finds in plain language.

| Capability | Description |
| --- | --- |
| Natural language output | Explains errors in understandable terms |
| Zero-shot learning | Works without domain-specific training |
| Contextual understanding | Considers document-level context |
| Multilingual | Supports 100+ language pairs |
| Flexible instructions | Adapts to custom quality criteria via prompts |

Basic LLM Evaluation Flow

```
Input:
  Source Text (English) + Translation (German) + Instructions (MQM criteria)
                              │
                              ▼
                     LLM (GPT-4 / Claude)
                              │
                              ▼
Output:
  • Error annotations with categories
  • Severity levels (Critical/Major/Minor)
  • Explanations for each issue
  • Overall quality score
  • Improvement suggestions
```

Current LLM Capabilities for Translation QA (2025)

What LLMs Do Well

Based on benchmarks and production deployments:

1. Error Detection

LLMs effectively identify:

  • Mistranslations and meaning changes (85-90% accuracy)
  • Omissions and additions (85-90% accuracy)
  • Grammar and spelling errors (95%+ accuracy)
  • Terminology inconsistencies (90%+ with glossary)
  • Style and register issues (80-85% accuracy)

Those numbers are good enough for a first pass. Not good enough to skip human review on a pharmaceutical insert.

2. Error Categorization

LLMs can classify errors according to MQM taxonomy:

Example Output:

```
Source: "The server will restart automatically."
Translation: "Der Server wird manuell neu gestartet."
```

LLM Analysis:

```json
{
  "errors": [
    {
      "type": "Accuracy/Mistranslation",
      "source_span": "automatically",
      "target_span": "manuell",
      "severity": "Major",
      "explanation": "The translation says 'manually' but the source says 'automatically' - this reverses the meaning."
    }
  ],
  "score": 95,
  "overall_assessment": "One major accuracy error that changes operational meaning."
}
```
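The "score": 95 above follows the usual MQM arithmetic. A minimal sketch, assuming the common default penalty weights (Minor = 1, Major = 5, Critical = 10) — check your project's MQM profile before relying on these:

```python
# Common MQM default penalty weights -- an assumption here;
# real projects often normalize by word count as well.
SEVERITY_WEIGHTS = {"Minor": 1, "Major": 5, "Critical": 10}

def mqm_score(errors):
    """Score = 100 minus the sum of severity penalties, floored at 0."""
    penalty = sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)
    return max(0, 100 - penalty)

errors = [{"type": "Accuracy/Mistranslation", "severity": "Major"}]
print(mqm_score(errors))  # one Major error -> 95
```

With one Major error the penalty is 5, which is exactly how the example output arrives at 95.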

3. Contextual Evaluation

This is where LLMs genuinely outperform older approaches. They consider:

  • Document-level consistency
  • Term usage across the text
  • Tone and style coherence
  • Reference to previously translated content

A traditional QA check looks at one segment in isolation. An LLM can notice that "Dashboard" was translated as "Armaturenbrett" in segment 12 but "Übersicht" in segment 47.
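That cross-segment check can also be approximated deterministically, outside the LLM. A sketch, assuming you have already extracted `(segment_id, source_term, target_term)` hits by matching glossary terms against each segment — the function name and tuple shape are illustrative:

```python
from collections import defaultdict

def find_term_inconsistencies(segment_terms):
    """Group target translations per source term; flag terms with more
    than one distinct translation across the document."""
    translations = defaultdict(lambda: defaultdict(list))
    for seg_id, src, tgt in segment_terms:
        translations[src][tgt].append(seg_id)
    return {
        src: dict(tgts)
        for src, tgts in translations.items()
        if len(tgts) > 1
    }

hits = [
    (12, "Dashboard", "Armaturenbrett"),
    (30, "Dashboard", "Armaturenbrett"),
    (47, "Dashboard", "Übersicht"),
    (5, "Server", "Server"),
]
print(find_term_inconsistencies(hits))
# {'Dashboard': {'Armaturenbrett': [12, 30], 'Übersicht': [47]}}
```

Running this first and feeding the findings into the LLM prompt tends to be cheaper than asking the model to rediscover them.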

4. Explanation Generation

Unlike black-box models, LLMs explain their reasoning:

"The translation uses the informal 'du' form, but the source text and the formal business context suggest the formal 'Sie' should be used. This is a Style/Register error with Minor severity as it doesn't affect meaning but impacts brand voice consistency." 

This kind of feedback is actually useful to translators. A score of 94 tells them nothing. This tells them what to fix.

Benchmark Performance (2025)

| Model | Error Detection | Severity Accuracy | MQM Alignment | Speed |
| --- | --- | --- | --- | --- |
| GPT-4 Turbo | 87% | 82% | High | 2-4s |
| Claude 3.5 Sonnet | 86% | 84% | High | 2-3s |
| Gemini 1.5 Pro | 84% | 80% | Medium | 2-4s |
| GPT-4o | 85% | 81% | High | 1-2s |
| Claude 3 Haiku | 78% | 75% | Medium | 0.5-1s |

Based on MQM-annotated test sets across EN-DE, EN-FR, EN-ZH language pairs

LLM Limitations for Translation QA

Here's where honesty matters. LLMs have real limitations, and ignoring them leads to bad outcomes.

1. Hallucination Risk

LLMs sometimes flag errors that don't exist, or miss real ones:

False Positive Example:

```
Source: "The quick brown fox"
Translation: "Der schnelle braune Fuchs"
LLM (incorrectly): "Minor fluency issue - consider 'flinke' instead of 'schnelle'"
Reality: Both translations are perfectly valid.
```

Mitigation: Implement confidence thresholds and human review for critical content.

2. Inconsistent Severity Assessment

The same error may receive different severity ratings across runs:

```
Run 1: "Terminology error - Major severity"
Run 2: "Terminology error - Minor severity"
```

This is a real problem for any process that needs repeatability.

Mitigation: Use temperature=0, structured outputs, and calibration prompts.
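One way to tame run-to-run severity drift is to evaluate each segment several times and take a consensus. A sketch; the tie-breaking rule (escalate to the more severe label) is our assumption, not part of any standard:

```python
from collections import Counter

# Ordering used only for tie-breaking; higher rank = more severe.
RANK = {"Minor": 0, "Major": 1, "Critical": 2}

def consensus_severity(severities):
    """Majority vote across N evaluation runs; ties escalate to the
    more severe label as a conservative default."""
    counts = Counter(severities)
    top = max(counts.values())
    tied = [s for s, c in counts.items() if c == top]
    return max(tied, key=RANK.__getitem__)

print(consensus_severity(["Major", "Minor", "Major"]))  # Major
print(consensus_severity(["Minor", "Major"]))           # tie -> Major
```

Three runs triple the cost, so in practice this is worth doing only for segments that already failed a cheaper screen.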

3. Domain Knowledge Gaps

General-purpose LLMs don't know that "contra-indicated" has a specific meaning in pharmacology, or that "force majeure" needs careful handling in French legal contexts. Common blind spots:

  • Medical terminology nuances
  • Legal jurisdiction-specific terms
  • Industry-specific jargon
  • Cultural references

Mitigation: Provide domain context, glossaries, and reference materials in prompts.

4. Language Pair Variability

Performance varies significantly by language:

| Language Pair | Relative Performance |
| --- | --- |
| EN ↔ DE/FR/ES | High (benchmark languages) |
| EN ↔ ZH/JA/KO | Medium-High |
| EN ↔ AR/HE | Medium |
| Low-resource pairs | Lower |

Mitigation: Calibrate thresholds per language pair; consider human review for lower-performing pairs.
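Per-pair calibration can be as simple as a lookup of review thresholds. The values below are illustrative placeholders, not benchmarks — derive yours from the validation process described later:

```python
# Segments scoring below their pair's threshold go to human review.
# All numbers here are hypothetical examples.
REVIEW_THRESHOLDS = {
    ("en", "de"): 90,
    ("en", "zh"): 93,
    ("en", "ar"): 95,
}
DEFAULT_THRESHOLD = 97  # conservative fallback for uncalibrated/low-resource pairs

def needs_human_review(score, source_lang, target_lang):
    threshold = REVIEW_THRESHOLDS.get((source_lang, target_lang), DEFAULT_THRESHOLD)
    return score < threshold

print(needs_human_review(92, "en", "de"))  # False
print(needs_human_review(92, "en", "ar"))  # True
```

Note the fallback errs toward more human review, which is the safe direction when you have no data for a pair.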

5. No Guaranteed Consistency

LLMs may evaluate identical segments differently depending on position in the batch:

```
Segment A at position 10: "No errors found"
Same segment at position 50: "Minor style issue flagged"
```

Mitigation: Batch processing with consistent context, deterministic settings.

Implementing LLM-Based QA

Prompt Engineering for Translation QA

Your prompts are everything. A bad prompt turns a great model into a mediocre evaluator.

Basic Prompt Structure:

```
You are a professional translation quality evaluator. Analyze the following
translation according to MQM (Multidimensional Quality Metrics) standards.

Source Language: {source_lang}
Target Language: {target_lang}
Domain: {domain}

Source Text: "{source_text}"
Translation: "{translation}"

Additional Context:
- Glossary terms: {glossary}
- Style requirements: {style_guide}

Evaluate the translation and provide:
1. List of errors with:
   - Error type (Accuracy, Fluency, Terminology, Style, Locale, Design)
   - Specific subtype
   - Severity (Critical, Major, Minor)
   - Source span and target span
   - Explanation
2. Overall MQM score (100 - weighted penalties)
3. Brief quality summary

Respond in JSON format.
```

Advanced Prompt with Calibration:

```
You are an expert LQA evaluator. Before evaluation, review these calibration
examples that show correct severity assignments for this project:

Example 1 - Major Error:
Source: "Do not exceed 10mg daily"
Translation: "Nehmen Sie täglich 10mg ein"
Issue: Omission of "Do not exceed" - safety-critical information missing
Severity: Major (would be Critical in medical/pharma context)

Example 2 - Minor Error:
Source: "Click the button"
Translation: "Klicken Sie auf den Button"
Issue: "Button" could be "Schaltfläche" per glossary
Severity: Minor (meaning preserved, terminology preference)

Now evaluate:
[...]
```

Calibration examples in prompts make a huge difference. In our testing, they reduce severity disagreement with human evaluators by about 30%.

Structured Output for Reliability

Use JSON schema or function calling for consistent outputs:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # structured outputs (json_schema) require a gpt-4o-family model
    messages=[...],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "translation_evaluation",
            "schema": {
                "type": "object",
                "properties": {
                    "errors": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "error_type": {"type": "string"},
                                "subtype": {"type": "string"},
                                "severity": {"enum": ["Critical", "Major", "Minor"]},
                                "source_span": {"type": "string"},
                                "target_span": {"type": "string"},
                                "explanation": {"type": "string"},
                            },
                            "required": ["error_type", "severity", "explanation"],
                        },
                    },
                    "score": {"type": "number", "minimum": 0, "maximum": 100},
                    "summary": {"type": "string"},
                },
                "required": ["errors", "score", "summary"],
            },
        },
    },
    temperature=0,
)
```

Batch Processing Architecture

For production deployments:

```
Translation Batch (Seg 1, Seg 2, Seg 3, Seg 4, Seg 5, ...)
                       │
                       ▼
              Batch Processor
              - Group by context
              - Include glossary
              - Add style guide
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
        LLM 1        LLM 2        LLM 3     (parallel)
          └────────────┼────────────┘
                       ▼
             Result Aggregator
             - Combine results
             - Calculate scores
             - Generate report
```

Cost Optimization

LLM-based QA costs more than MTQE. Here's how to keep it manageable:

1. Tiered Processing

```python
def evaluate_segment(segment, mtqe_score):
    if mtqe_score >= 0.95:
        return {"status": "auto_approve", "score": 98}
    elif mtqe_score >= 0.75:
        # Use faster, cheaper model
        return evaluate_with_llm(segment, model="gpt-4o-mini")
    else:
        # Use best model for problematic segments
        return evaluate_with_llm(segment, model="gpt-4-turbo")
```

Don't send obviously good segments to an expensive model. That's burning money.

2. Batch Segments

Instead of one segment per API call, batch related segments:

```python
# Instead of 100 API calls for 100 segments,
# send 10 batches of 10 segments each.
batch_prompt = f"""
Evaluate these 10 segments from the same document:

Segment 1:
Source: "{seg1_source}"
Translation: "{seg1_target}"

Segment 2:
...
"""
```

3. Cache Common Evaluations

```python
import hashlib

evaluation_cache = {}  # in production, back this with Redis or a database

def get_cached_evaluation(source, target):
    cache_key = hashlib.md5(f"{source}||{target}".encode()).hexdigest()
    if cache_key in evaluation_cache:
        return evaluation_cache[cache_key]
    return None
```

Comparing LLM Providers for Translation QA

OpenAI GPT-4 Family

| Model | Best For | Pricing (Dec 2024) |
| --- | --- | --- |
| GPT-4 Turbo | Highest accuracy | $10/1M input, $30/1M output |
| GPT-4o | Balance of speed/quality | $2.50/1M input, $10/1M output |
| GPT-4o-mini | High volume, lower stakes | $0.15/1M input, $0.60/1M output |

Best overall accuracy and reliable JSON output. The main concern is cost at scale.

Anthropic Claude

| Model | Best For | Pricing |
| --- | --- | --- |
| Claude 3.5 Sonnet | Production QA | $3/1M input, $15/1M output |
| Claude 3 Haiku | Fast screening | $0.25/1M input, $1.25/1M output |

Strong reasoning and particularly good at following complex evaluation instructions. Severity accuracy is slightly higher than GPT-4 in our testing.

Google Gemini

| Model | Best For | Pricing |
| --- | --- | --- |
| Gemini 1.5 Pro | Long documents | $1.25/1M input, $5/1M output |
| Gemini 1.5 Flash | Fast processing | $0.075/1M input, $0.30/1M output |

The 1M+ token context window is genuinely useful for document-level QA. JSON output is less reliable than OpenAI's — budget extra time for prompt engineering.

Hybrid LLM + Human Workflow

The best results come from combining LLM speed with human judgment. Neither alone is enough.

Workflow Design

```
                    Translation Input
                           │
                           ▼
                     LLM Evaluation
                (all segments, parallel)
                           │
      ┌────────────────────┼────────────────────┐
      ▼                    ▼                    ▼
  No errors           Minor errors        Major/Critical
      │                    │                    │
      ▼                    ▼                    ▼
   Accept             Sample 10%           100% Human
                       Human QC              Review
                           │                    │
                           └─────────┬──────────┘
                                     ▼
                       LLM Accuracy Tracking
                       (compare LLM vs human)
                       - Update confidence scores
                       - Adjust thresholds
                       - Improve prompts
```
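The routing step in this workflow reduces to a few lines once the LLM returns structured errors. A sketch using the severity labels from earlier sections; the route names are placeholders:

```python
def route_segment(errors):
    """Route a segment based on the worst severity in its LLM evaluation:
    clean segments are accepted, minor-only segments get sampled QC,
    anything Major or Critical gets full human review."""
    severities = {e["severity"] for e in errors}
    if {"Critical", "Major"} & severities:
        return "full_human_review"
    if "Minor" in severities:
        return "sample_10pct_qc"
    return "accept"

print(route_segment([]))                       # accept
print(route_segment([{"severity": "Minor"}]))  # sample_10pct_qc
print(route_segment([{"severity": "Major"}]))  # full_human_review
```

The 10% sampling rate for minor-only segments is a starting point; the feedback loop below is what tells you whether to raise or lower it.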

Confidence Calibration

Track LLM performance over time:

```python
# After human review
def update_confidence(llm_result, human_result):
    agreement = compare_evaluations(llm_result, human_result)

    # Update running statistics
    update_stats(
        language_pair=llm_result.lang_pair,
        error_type=llm_result.error_type,
        severity=llm_result.severity,
        human_agreed=agreement,
    )

    # Adjust thresholds if accuracy drops
    if get_recent_accuracy() < 0.85:
        increase_human_review_rate()
```

This feedback loop is what separates a proof-of-concept from a production system. Without it, you're flying blind.

FAQ

Can LLMs replace human translators for quality assessment?

LLMs can handle 70-80% of routine QA tasks effectively, but they can't fully replace human evaluators. They're good at catching objective errors — spelling, grammar, obvious mistranslations — but struggle with cultural appropriateness, creative content, and context-dependent meaning. The optimal approach is hybrid: LLMs for initial evaluation and flagging, humans for verification and edge cases.

Which LLM is best for translation quality assessment?

As of 2025, GPT-4 Turbo and Claude 3.5 Sonnet offer the best accuracy for translation QA. For high-volume, lower-stakes content, GPT-4o-mini or Claude Haiku provide good cost-performance balance. The best choice depends on your specific language pairs, domain, and budget. We recommend benchmarking 2-3 models on your actual content before committing.

How much does LLM-based translation QA cost?

Costs vary by volume and model. For GPT-4o at typical translation QA prompt sizes:

  • 1,000 segments: ~$0.50-1.00
  • 10,000 segments: ~$5-10
  • 100,000 segments: ~$50-100

Using tiered approaches (MTQE filtering + cheaper models for easy cases) can reduce costs by 50-70% while maintaining quality.
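A back-of-the-envelope estimator for those figures — the per-segment token counts are assumptions you should replace with measurements from your own prompts:

```python
def estimate_cost(num_segments, tokens_in_per_seg, tokens_out_per_seg,
                  price_in_per_m, price_out_per_m):
    """Total USD cost given per-segment token counts and per-million-token
    prices. Input tokens usually dominate because the instructions and
    glossary ride along with every segment."""
    tokens_in = num_segments * tokens_in_per_seg
    tokens_out = num_segments * tokens_out_per_seg
    return (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1_000_000

# GPT-4o prices from the provider table ($2.50 in / $10 out per 1M tokens),
# assuming ~200 input and ~40 output tokens per segment:
print(estimate_cost(10_000, 200, 40, 2.50, 10.00))  # 9.0
```

Under those assumptions, 10,000 segments land at $9 — inside the $5-10 range quoted above, and sensitive mainly to how much glossary and context you attach per call.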

How do I validate LLM QA accuracy for my content?

Create a test set of 200-500 segments with human MQM annotations. Run LLM evaluation and compare:

  • Error detection rate (does LLM find the same errors?)
  • Severity alignment (does LLM assign similar severity?)
  • False positive rate (how often does LLM flag non-errors?)

Target 85%+ agreement for production use. Re-validate quarterly as models update.
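Error-detection agreement can be computed as set overlap between LLM and human annotations. A sketch that treats each (segment, error type) pair as one flag and reports F1 — one reasonable definition among several; span-level matching is stricter:

```python
def agreement_rate(llm_flags, human_flags):
    """F1 between LLM and human error flags, where each flag is a
    (segment_id, error_type) pair. Precision penalizes false positives,
    recall penalizes missed errors."""
    llm, human = set(llm_flags), set(human_flags)
    if not llm and not human:
        return 1.0  # both found nothing: perfect agreement
    tp = len(llm & human)
    precision = tp / len(llm) if llm else 0.0
    recall = tp / len(human) if human else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

llm = {(1, "Accuracy"), (2, "Fluency"), (3, "Style")}
human = {(1, "Accuracy"), (2, "Fluency"), (4, "Terminology")}
print(round(agreement_rate(llm, human), 2))  # 0.67
```

Tracking precision and recall separately is often more actionable than the combined number: high false-positive rates annoy reviewers, while low recall is the dangerous failure mode.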

Can LLMs evaluate specialized domain content?

Yes, but with extra setup. For specialized domains:

  1. Provide domain-specific glossaries in prompts
  2. Include example errors from your domain in calibration
  3. Use domain context ("This is a pharmaceutical product insert")
  4. Increase human review percentage for high-risk content
  5. Consider fine-tuning or RAG approaches for very specialized terminology

LLMs won't replace human quality evaluators anytime soon. But they've made it possible to check 100% of translated content instead of 5%, catch errors before they reach customers, and give translators feedback they can actually use. The organizations getting the most out of LLM QA aren't the ones who trust it blindly — they're the ones who've built proper calibration loops and know exactly where the model needs human backup.

Ready to implement LLM-powered translation QA? Try KTTC for production-ready AI LQA with MQM compliance and hybrid human-AI workflows.
