
MQM in the LLM Era: Why Fluent Translations Still Need Human Quality Frameworks

alex-chen · 3/16/2026 · 11 min read
mqm · translation-quality · llm-evaluation · comet · ai-translation-quality

The Fluency Trap

LLMs have changed translation. Outputs are grammatically polished, stylistically consistent, and natural-sounding. That's the problem.

When a translation reads beautifully but says the wrong thing, automated metrics often can't tell. COMET scores stay high. MetricX flags nothing. The error ships to production — and a pharmaceutical dosage gets mistranslated, a legal clause gets inverted, or a product spec gets fabricated from thin air.

This is the fluency trap. LLMs optimize for how a translation sounds, not for what it means. And as organizations scale LLM translation pipelines to millions of segments per month, the gap between fluency and accuracy has become the central quality problem.

MQM — the Multidimensional Quality Metrics framework — was built long before LLMs took over translation. But its structured, error-taxonomy approach is more relevant now than it's ever been. Here's why, and how to put it to work.

Why Automated Metrics Miss LLM Hallucinations

The COMET Paradox

COMET (and its reference-free variant COMET-KIWI) is the gold standard for automated translation evaluation. It correlates well with human judgments — on average. But averages hide the failure modes that matter most.

Look at this example:

```text
Source (DE):  Die Dosierung beträgt 5 mg zweimal täglich.
Reference:    The dosage is 5 mg twice daily.
LLM output:   The recommended dosage is 5 mg three times daily for optimal results.
COMET score:  0.87 (Good)
```

The LLM output is fluent, uses appropriate medical register, and even adds a plausible-sounding qualifier. COMET rates it highly because the embedding similarity is strong. But the translation is factually wrong — "twice daily" became "three times daily," and "recommended... for optimal results" was invented entirely.
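You can reproduce this kind of segment-level check yourself with the open-source unbabel-comet package. A minimal scoring sketch, assuming the publicly released Unbabel/wmt22-comet-da checkpoint; the exact number you get will vary with checkpoint and package version:

```python
# Minimal COMET scoring sketch (pip install unbabel-comet).
# Checkpoint name and printed scores are illustrative; gated models need a HF token.
from comet import download_model, load_from_checkpoint

data = [{
    "src": "Die Dosierung beträgt 5 mg zweimal täglich.",
    "mt":  "The recommended dosage is 5 mg three times daily for optimal results.",
    "ref": "The dosage is 5 mg twice daily.",
}]

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
prediction = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(prediction.scores)        # per-segment scores
print(prediction.system_score)  # corpus-level average
```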

Why This Happens

Automated metrics like COMET and MetricX were trained on human quality judgments dominated by fluency signals. When most translations came from phrase-based or early NMT systems, fluency was the differentiator — accurate translations were often awkward. The metrics learned to weight fluency heavily.

LLMs flipped this. Nearly all LLM outputs are fluent. The variance is now in accuracy — and the metrics haven't caught up.

| Metric | Fluency Detection | Accuracy Detection | Hallucination Detection |
|---|---|---|---|
| COMET | Excellent | Moderate | Poor |
| MetricX | Excellent | Moderate | Poor |
| COMET-KIWI (QE) | Good | Low | Very Poor |
| COMETKiwi-XL | Good | Moderate | Low |
| Human MQM | Excellent | Excellent | Excellent |

The Hallucination Taxonomy

LLM translation hallucinations come in distinct flavors:

  1. Numeric distortion — changing quantities, dates, percentages, dosages
  2. Semantic addition — inserting plausible-sounding information that isn't in the source
  3. Semantic omission — dropping clauses or qualifiers that constrain meaning
  4. Entity substitution — swapping one named entity for another in the same category
  5. Polarity inversion — flipping negations or comparative directions

Every one of these can produce translations that score well on automated metrics while being materially incorrect.

MQM Error Taxonomy for LLM Outputs

MQM defines a hierarchical error taxonomy. For LLM-era translation, the accuracy dimension needs more attention, while fluency errors have become relatively rare.

Error Distribution Shift: NMT vs LLM

| Error Category | NMT (2020) | LLM (2026) | Change |
|---|---|---|---|
| Accuracy — Mistranslation | 28% | 22% | ↓ |
| Accuracy — Addition | 3% | 18% | ↑↑↑ |
| Accuracy — Omission | 15% | 12% | ↓ |
| Fluency — Grammar | 22% | 3% | ↓↓↓ |
| Fluency — Register | 8% | 5% | ↓ |
| Terminology | 18% | 25% | ↑ |
| Style | 6% | 15% | ↑ |

The standout number: Accuracy — Addition jumped from 3% to 18%. LLMs add information that wasn't in the source at 6x the rate of traditional NMT. That's hallucination, and it's exactly what automated metrics are worst at catching.

MQM Severity Levels for LLM Review

Severity calibration matters when applying MQM to LLM outputs:

| Severity | Definition | Penalty Points | Example |
|---|---|---|---|
| Critical | Changes meaning in safety/legal/financial context | 25 | Dosage "twice" → "three times" |
| Major | Incorrect meaning but lower-risk context | 5 | Adding unsourced marketing claim |
| Minor | Subtle deviation, meaning preserved | 1 | Slight register mismatch |
| Neutral | Preference, not error | 0 | Synonym choice |

A quality score is calculated as:

Score = 100 - (Total Penalty Points / Word Count) × 100

Industry thresholds:

  • 95+: Publication-ready
  • 90-95: Acceptable with light review
  • 85-90: Requires post-editing
  • <85: Reject and retranslate
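
Putting the formula and thresholds together, a minimal scoring helper might look like the sketch below; the function names are hypothetical, but the arithmetic is exactly the MQM calculation above.

```python
# Hypothetical helpers; penalty weights and thresholds mirror the tables above.
SEVERITY_PENALTIES = {"critical": 25, "major": 5, "minor": 1, "neutral": 0}

def mqm_score(severities: list[str], word_count: int) -> float:
    total_penalty = sum(SEVERITY_PENALTIES[s] for s in severities)
    return 100 - (total_penalty / word_count) * 100

def verdict(score: float) -> str:
    if score >= 95:
        return "publication-ready"
    if score >= 90:
        return "acceptable with light review"
    if score >= 85:
        return "requires post-editing"
    return "reject and retranslate"

# One critical and two minor errors in a 500-word document:
score = mqm_score(["critical", "minor", "minor"], word_count=500)
print(round(score, 1), "->", verdict(score))  # 94.6 -> acceptable with light review
```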

Human MQM + Automated Metrics = Complete Picture

Neither approach works alone. The best evaluation pipeline combines both.

The Two-Layer Evaluation Architecture

Layer 1: Automated screening (100% of segments)

Run COMET-KIWI on all translated segments. This catches gross fluency errors and filters out obviously bad translations. Fast, cheap, scalable.

Layer 2: Human MQM review (10-20% of segments)

Sample segments for human review, biased toward:

  • Segments with COMET scores in the ambiguous range (0.80-0.90)
  • Segments from high-risk domains (medical, legal, financial)
  • Segments with high source complexity (long sentences, embedded lists, conditionals)
  • A random sample for calibration

This two-layer approach catches 92-95% of all errors while requiring human review on only a fraction of the volume.
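
As a rough sketch of what the sampling step could look like in code — the field names (`comet_kiwi`, `domain`, `source_tokens`) and the complexity proxy are assumptions, not a prescribed schema:

```python
import random

HIGH_RISK_DOMAINS = {"medical", "legal", "financial"}

def select_for_human_review(segments, review_fraction=0.15, calibration_rate=0.02):
    """Bias the MQM sample toward risky segments; thresholds are illustrative."""
    selected = []
    for seg in segments:
        risky = (
            0.80 <= seg["comet_kiwi"] <= 0.90       # ambiguous automated score
            or seg["domain"] in HIGH_RISK_DOMAINS   # high-risk content
            or seg["source_tokens"] > 40            # crude source-complexity proxy
        )
        if risky or random.random() < calibration_rate:  # random calibration slice
            selected.append(seg)
    # If the sample exceeds the review budget, prioritise the lowest automated scores.
    budget = max(1, int(len(segments) * review_fraction))
    selected.sort(key=lambda s: s["comet_kiwi"])
    return selected[:budget]
```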

Cost Comparison

For a pipeline processing 1 million words/month:

| Approach | Monthly Cost | Error Detection Rate | Time to Results |
|---|---|---|---|
| COMET only | $50 | 65% | Minutes |
| Human MQM only | $15,000 | 98% | 5-7 days |
| Hybrid (COMET + 15% MQM) | $2,300 | 94% | 1-2 days |

The hybrid approach delivers near-human detection rates at 85% less cost than full human review. That's not a marginal improvement — it's a different cost structure entirely.
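
The hybrid figure is easy to sanity-check. A back-of-the-envelope cost model, assuming a human MQM rate of roughly $0.015 per word (implied by the $15,000 / 1M-word row above) and a flat automated-scoring cost:

```python
# Back-of-the-envelope cost model; the per-word rate is an assumption from the table.
WORDS_PER_MONTH = 1_000_000
HUMAN_MQM_COST_PER_WORD = 0.015   # ~$15,000 for full human review of 1M words
AUTOMATED_SCREENING_COST = 50     # flat monthly cost to score every segment

def hybrid_monthly_cost(review_fraction: float) -> float:
    human = WORDS_PER_MONTH * review_fraction * HUMAN_MQM_COST_PER_WORD
    return AUTOMATED_SCREENING_COST + human

print(hybrid_monthly_cost(0.15))  # 2300.0 -- the ~$2,300 hybrid figure above
```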

How KTTC Implements MQM-Based Evaluation

KTTC makes the two-layer approach described above practical. Here's how.

Segment-Level Annotation

Reviewers work in a side-by-side interface showing source and target. They highlight error spans directly in the translation and assign:

  1. Error category from the MQM taxonomy (accuracy, fluency, terminology, style, locale convention)
  2. Subcategory (e.g., accuracy → addition, accuracy → omission)
  3. Severity (critical, major, minor)

The platform calculates MQM scores automatically at the segment, document, and project level.
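
If you were modelling this yourself, each annotation reduces to a small record plus a per-segment aggregate. The schema below is a hypothetical sketch, not KTTC's internal data model:

```python
from dataclasses import dataclass, field

SEVERITY_PENALTIES = {"critical": 25, "major": 5, "minor": 1, "neutral": 0}

@dataclass
class MQMAnnotation:
    error_span: tuple[int, int]   # character offsets of the highlighted span in the target
    category: str                 # "accuracy", "fluency", "terminology", "style", "locale"
    subcategory: str              # e.g. "addition", "omission", "register"
    severity: str                 # "critical" | "major" | "minor" | "neutral"

@dataclass
class SegmentReview:
    segment_id: str
    word_count: int
    annotations: list[MQMAnnotation] = field(default_factory=list)

    def penalty_points(self) -> int:
        return sum(SEVERITY_PENALTIES[a.severity] for a in self.annotations)
```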

Multi-Model Comparison Mode

When evaluating outputs from different LLMs, KTTC presents anonymized translations side by side. Reviewers annotate each version independently — no model-name bias. After annotation, the platform reveals which model produced each output and aggregates quality profiles.

This is especially useful for routing optimization: over time, you build an empirical map of which model performs best for which content type and language pair.
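
A rough sketch of how such a routing map could be built from accumulated reviews; the record shape and the selection rule (best mean MQM score per content type and language pair) are illustrative assumptions:

```python
from collections import defaultdict
from statistics import mean

def build_routing_map(reviews):
    """reviews: iterable of (model, content_type, language_pair, mqm_score) tuples."""
    scores = defaultdict(lambda: defaultdict(list))
    for model, content_type, lang_pair, mqm in reviews:
        scores[(content_type, lang_pair)][model].append(mqm)
    # For each content type / language pair, route to the model with the best mean score.
    return {
        key: max(per_model, key=lambda m: mean(per_model[m]))
        for key, per_model in scores.items()
    }

# Example: route German->English legal content to whichever model has scored best so far.
routing = build_routing_map([
    ("model-a", "legal", "de-en", 93.5),
    ("model-b", "legal", "de-en", 96.2),
])
print(routing[("legal", "de-en")])  # model-b
```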

Automated Flagging

KTTC's automated layer flags segments for priority human review based on:

  • Numeric divergence detection: Source and target numbers are compared; mismatches trigger review
  • Length ratio anomalies: Translations much longer than expected may contain additions
  • Terminology violations: Segments using non-approved terms are flagged
  • Low-confidence scores: Segments where COMET-KIWI returns scores below a configurable threshold

These flags increase the hit rate of human review by 3-4x compared to random sampling. Your evaluation budget goes further.
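
For intuition, here is a heuristic sketch of those four flags. The thresholds and function names are illustrative, not KTTC's actual rules:

```python
import re

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")

def screening_flags(source: str, target: str, comet_kiwi: float,
                    forbidden_terms: frozenset = frozenset(),
                    qe_threshold: float = 0.85) -> list[str]:
    """Heuristic pre-screening; cutoffs are illustrative, not production defaults."""
    out = []
    # Numeric divergence: every number in the source should reappear in the target.
    if set(NUMBER_RE.findall(source)) != set(NUMBER_RE.findall(target)):
        out.append("numeric-divergence")
    # Length-ratio anomaly: targets far longer than the source may contain additions.
    if len(target.split()) > 1.8 * max(len(source.split()), 1):
        out.append("length-ratio")
    # Terminology violation: a non-approved term appears in the target.
    if any(term in target.lower() for term in forbidden_terms):
        out.append("terminology")
    # Low automated confidence.
    if comet_kiwi < qe_threshold:
        out.append("low-confidence")
    return out
```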

Practical Scoring Examples

Example 1: Hallucinated Addition (Critical)

```text
Source (FR):  Le contrat prend effet le 1er mars 2026.
Reference:    The contract takes effect on March 1, 2026.
LLM output:   The contract takes effect on March 1, 2026, and remains valid for a period of 12 months.
```

MQM annotation:

  • Error span: "and remains valid for a period of 12 months"
  • Category: Accuracy → Addition
  • Severity: Critical (legal context, fabricated contract term)
  • Penalty: 25 points

COMET score for this output: 0.91 — the metric sees a fluent, well-formed sentence and rates it highly. Only human MQM review catches the hallucinated clause.

Example 2: Numeric Distortion (Critical)

```text
Source (JA):  投与量は1日2回、各10mgです。
Reference:    The dosage is 10 mg twice daily.
LLM output:   The dosage is 10 mg once daily.
```

MQM annotation:

  • Error span: "once daily"
  • Category: Accuracy → Mistranslation
  • Severity: Critical (medical dosage)
  • Penalty: 25 points

Example 3: Register Mismatch (Minor)

```text
Source (DE):  Bitte wenden Sie sich an unseren Kundendienst.
Reference:    Please contact our customer service department.
LLM output:   Hit up our customer support team if you need anything!
```

MQM annotation:

  • Error span: "Hit up... if you need anything!"
  • Category: Fluency → Register
  • Severity: Minor (informal tone for formal source)
  • Penalty: 1 point

Building Your MQM Practice

Step 1: Define Your Error Priorities

Not all MQM categories matter equally for every organization. A medical device company should weight accuracy errors heavily; a gaming company might prioritize style and locale convention. Define your severity multipliers before you start reviewing.

Step 2: Calibrate Your Reviewers

Inter-annotator agreement is the foundation of reliable MQM scores. Run calibration sessions where multiple reviewers annotate the same 50-100 segments, then discuss disagreements. Target a Cohen's kappa of 0.7+ before trusting scores for production decisions. Skip this step and your data is noise.
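
Cohen's kappa is straightforward to compute once both reviewers have labelled the same calibration set. A minimal check using scikit-learn; the label data below is invented for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Severity labels from two reviewers on the same calibration segments (invented data).
reviewer_a = ["critical", "minor", "neutral", "major", "minor", "neutral", "major"]
reviewer_b = ["critical", "minor", "neutral", "minor", "minor", "neutral", "major"]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa = {kappa:.2f}")  # target 0.7+ before trusting production scores
```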

Step 3: Integrate with Your Translation Pipeline

MQM is most valuable when scores feed back into the pipeline:

  • Tune routing rules in a multi-model setup
  • Identify systematic errors for prompt engineering
  • Set quality gates (e.g., reject batches scoring below 90)
  • Benchmark new models before production deployment

A single MQM evaluation is a snapshot. The real value comes from tracking over time: are accuracy errors trending up after a model update? Is a specific language pair degrading? KTTC's analytics dashboard surfaces these trends automatically.

Practical Recommendations

  1. Don't rely solely on COMET for LLM translation quality. It was built for a pre-LLM error distribution and systematically underweights accuracy issues.
  2. Use the two-layer evaluation model. Automated screening on 100% of segments, human MQM on 10-20%, with smart sampling to maximize coverage.
  3. Prioritize accuracy categories in MQM. For LLM outputs, addition and omission errors are the highest-risk categories. Weight them accordingly.
  4. Use MQM data to improve your LLMs. Error patterns from MQM review can refine prompts, adjust glossaries, and retrain routing classifiers.
  5. Invest in reviewer calibration. Inconsistent MQM scores are worse than no scores at all. This isn't optional.

FAQ

Is MQM too slow and expensive for high-volume LLM translation?

Not when you use a hybrid pipeline. You don't MQM-review every segment. The two-layer approach (automated screening + sampled human review) keeps costs at $2-3 per thousand words while catching 94%+ of errors. Platforms like KTTC make this practical by automating the sampling, annotation interface, and score aggregation.

Can I use an LLM to perform MQM evaluation instead of human reviewers?

LLM-as-judge approaches are improving but still unreliable for the exact error types that matter most — hallucinated additions and subtle semantic distortions. Using an LLM to evaluate another LLM's translation creates a shared blind spot: both models may find the hallucinated content plausible. Use LLMs for pre-screening, but keep humans in the loop for MQM annotation.

How does MQM relate to COMET and MetricX? Are they competing standards?

They're complementary, not competing. COMET and MetricX are automated metrics — fast, scalable, useful for screening. MQM is a human annotation framework — slower, more expensive, but far more accurate for catching critical errors. The best pipelines use both: automated metrics for breadth, MQM for depth.

What is the minimum sample size for reliable MQM evaluation?

For a statistically meaningful quality assessment of a single document or batch, review at least 200-300 segments (or 2,000-3,000 words). For comparing two models, double that — 300+ segments per model, ideally the same source segments translated by both. KTTC's multi-model comparison mode is built for exactly this use case.
