
AI LQA vs MTQE: What to Choose for Translation Quality in 2025

alex-chen · 1/8/2025 · 11 min read

Tags: ai-lqa, mtqe, translation-quality, machine-translation, llm, quality-estimation

AI LQA and MTQE both use AI to evaluate translations. They're often discussed as if they're interchangeable. They're not.

MTQE (Machine Translation Quality Estimation) gives you a number — fast and cheap. AI LQA (AI-powered Linguistic Quality Assurance) gives you specific errors — slower and pricier, but actionable. Choosing the wrong one for your use case either wastes money or leaves you with data you can't act on.

Here's when to use each, and why the best answer is usually both.

What is MTQE?

MTQE predicts the quality of machine translation output without needing a human reference translation. It takes a source-target pair and produces a quality score:

```
Source: "The server is temporarily unavailable."
MT Output: "服务器暂时不可用。"
MTQE Score: 0.92 (high confidence, likely acceptable)
```

The model learns from examples of human-rated translations during training. Common architectures include:

| Architecture | Description | Example |
| --- | --- | --- |
| COMET | Crosslingual Optimized Metric for Evaluation of Translation | State-of-the-art neural metric |
| BLEURT | BERT-based Learned Evaluation Metric | Google's trained quality estimator |
| Quality Estimation | Direct prediction without references | Used in production MT systems |

MTQE Strengths

  1. Speed - Scores in milliseconds
  2. Scale - Millions of segments per hour
  3. Cost - Near-zero per segment after model training
  4. Integration - Drops right into MT pipelines
  5. Triage - Quickly finds segments needing review

MTQE Limitations

  1. No error details - Just a score, no explanation
  2. Training dependency - Only as good as its training data
  3. Domain sensitivity - May underperform on unseen domains
  4. Binary decisions - A score of 0.78 doesn't tell you what to do
  5. No MQM alignment - Scores don't map to error types

That last point matters. If a client asks "what types of errors are we seeing?" — MTQE can't answer that.

What is AI LQA?

AI LQA uses large language models to perform detailed translation quality evaluation, similar to what a human LQA evaluator does:

```
Source: "The annual report is due by December 31."
Translation: "Der Jahresbericht muss bis zum 31. Januar vorgelegt werden."
AI LQA Output:
- Error 1: Mistranslation (Accuracy)
  - "December" translated as "Januar" (January)
  - Severity: Major
  - Penalty: 5 points
- MQM Score: 95
```

AI LQA Strengths

  1. Error details - Specific errors with categories and severity
  2. MQM alignment - Uses industry-standard error typology
  3. Explainability - Says why something is wrong
  4. Flexibility - Adapts to different quality requirements
  5. Actionable - Feedback translators can use

AI LQA Limitations

  1. Slower - Seconds per segment vs. milliseconds for MTQE
  2. Higher cost - LLM inference costs per segment
  3. Hallucination risk - May flag non-errors or miss real errors
  4. Calibration needed - Requires tuning for specific use cases
  5. Not deterministic - Results may vary slightly between runs

AI LQA vs MTQE: Detailed Comparison

Purpose & Output

| Aspect | MTQE | AI LQA |
| --- | --- | --- |
| Primary purpose | Predict overall quality | Identify specific errors |
| Output type | Numeric score (0-1 or 0-100) | Error annotations + score |
| Error details | None | Full MQM categorization |
| Explainability | Low (black box) | High (natural language) |

Performance Characteristics

| Aspect | MTQE | AI LQA |
| --- | --- | --- |
| Speed | ~1ms per segment | ~2-5s per segment |
| Throughput | Millions/hour | Thousands/hour |
| Cost per segment | ~$0.00001 | ~$0.001-0.01 |
| Scalability | Excellent | Moderate |

That's a 100-1000x difference in cost per segment. At 10 million segments per month, that gap is the entire business case.

Quality Assessment

| Aspect | MTQE | AI LQA |
| --- | --- | --- |
| Accuracy | Good for ranking | Good for error detection |
| Granularity | Segment-level only | Error-level detail |
| Calibration | Domain-specific training | Prompt engineering |
| Human correlation | High (with good training) | High (with good prompts) |

Use Case Fit

| Use Case | MTQE | AI LQA |
| --- | --- | --- |
| MT output triage | Excellent | Overkill |
| Vendor comparison | Limited | Excellent |
| Translator feedback | Poor | Excellent |
| SLA verification | Limited | Excellent |
| Real-time filtering | Excellent | Too slow |
| Post-editing guidance | Limited | Excellent |

When to Use MTQE

1. Real-Time Quality Filtering

Filter MT output in production pipelines:

```python
# Pseudocode
for segment in mt_output:
    score = mtqe_model.predict(source, target)
    if score >= 0.85:
        publish(segment)                  # Auto-approve
    elif score >= 0.60:
        queue_for_review(segment)         # Human review
    else:
        queue_for_retranslation(segment)  # Redo
```

When you need a decision in milliseconds, MTQE is the only option. AI LQA at 2-5 seconds per segment simply doesn't fit.

2. MT Engine Selection

Compare multiple MT engines at scale:

| Engine | Avg MTQE Score | Cost | Recommendation |
| --- | --- | --- | --- |
| DeepL | 0.89 | $25/M chars | Best quality |
| Google | 0.85 | $20/M chars | Good balance |
| Custom NMT | 0.82 | $5/M chars | Budget option |

3. Volume Optimization

Prioritize human review effort:

  • High MTQE scores → Skip review
  • Medium scores → Sample review
  • Low scores → Full review

4. Adaptive MT

Route content to appropriate translation methods:

  • MTQE ≥ 0.90 → Raw MT acceptable
  • MTQE 0.70-0.90 → Light post-editing
  • MTQE < 0.70 → Full post-editing or human translation
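The tiers above amount to a three-way routing decision; a minimal sketch (the function name and the returned workflow labels are illustrative, not from any particular library):

```python
def route_mt(mtqe_score: float) -> str:
    """Map an MTQE score to a translation workflow (illustrative thresholds)."""
    if mtqe_score >= 0.90:
        return "raw_mt"            # Raw MT acceptable
    if mtqe_score >= 0.70:
        return "light_post_edit"   # Light post-editing
    return "full_post_edit"        # Full post-editing or human translation
```

In practice the two cutoffs would come from a per-content-type threshold table rather than being hard-coded.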

When to Use AI LQA

1. Detailed Error Reporting

When you need to tell a translator what went wrong, not just that something went wrong:

```
Segment 47:
- Error: Terminology inconsistency
  - "Dashboard" translated as "Armaturenbrett" in segment 12, but "Übersicht" here
- Action: Use consistent terminology per glossary
- Severity: Minor
```

2. MQM-Based Quality Scoring

Generate ISO 5060-compliant quality reports:

| Category | Critical | Major | Minor | Penalty |
| --- | --- | --- | --- | --- |
| Accuracy | 0 | 2 | 3 | 13 |
| Fluency | 0 | 1 | 5 | 10 |
| Terminology | 0 | 0 | 4 | 4 |
| **Total** | 0 | 3 | 12 | 27 |
| **MQM Score** | | | | 97.3 |
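For illustration, the table's totals can be reproduced with the severity weights this article uses elsewhere (critical 25, major 5, minor 1), assuming the common normalization of penalty points against the evaluated word count — here 1,000 words. The function name and formula framing are a sketch, not a specific tool's API:

```python
SEVERITY_WEIGHTS = {"critical": 25, "major": 5, "minor": 1}

def mqm_score(error_counts: dict, word_count: int) -> float:
    # Total penalty: each error weighted by its severity
    penalty = sum(SEVERITY_WEIGHTS[sev] * n for sev, n in error_counts.items())
    # Normalize penalty against the evaluated word count (assumed formula)
    return round((1 - penalty / word_count) * 100, 1)

# The table above: 3 major + 12 minor errors over 1,000 words
score = mqm_score({"critical": 0, "major": 3, "minor": 12}, word_count=1000)
```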

3. Vendor Performance Tracking

Compare translator or agency quality over time:

| Vendor | Q4 2024 | Q1 2025 | Trend | Issues |
| --- | --- | --- | --- | --- |
| Agency A | 96.2 | 97.1 | Up | Terminology improved |
| Agency B | 94.8 | 93.5 | Down | Accuracy declining |
| Freelancer C | 97.5 | 97.8 | Stable | Consistent quality |

This kind of data is what separates "we think Agency B is getting worse" from "Agency B's accuracy score dropped 1.3 points this quarter, driven by 40% more omission errors in legal content."

4. Training Data Generation

Identify patterns for translator training: most common error types, specific segments with issues, before/after comparisons, improvement tracking.

5. Compliance Verification

Verify translations meet quality SLAs:

```
Contract requirement: MQM Score ≥ 95
Batch evaluation result: 96.3
Status: PASS
Detailed report: [attached]
```

Building a Hybrid Workflow

The real answer in 2025 is: use both. MTQE for speed and triage, AI LQA for depth and detail.

Hybrid Architecture

```
              ┌─────────────────┐
              │    MT Output    │
              └────────┬────────┘
                       │
              ┌────────▼────────┐
              │      MTQE       │
              │  (Fast Filter)  │
              └────────┬────────┘
                       │
   ┌───────────────────┼───────────────────┐
   │                   │                   │
Score ≥ 0.90       0.70-0.90         Score < 0.70
   │                   │                   │
   ▼                   ▼                   ▼
┌─────────┐      ┌───────────┐      ┌───────────┐
│ Publish │      │  AI LQA   │      │   Human   │
│  as-is  │      │  Review   │      │ Translate │
└─────────┘      └─────┬─────┘      └───────────┘
                       │
         ┌─────────────┼─────────────┐
         │             │             │
     No errors    Minor only   Major/Critical
         │             │             │
         ▼             ▼             ▼
   ┌─────────┐    ┌─────────┐   ┌───────────┐
   │ Publish │    │  Auto-  │   │   Human   │
   │         │    │   fix   │   │  Review   │
   └─────────┘    └─────────┘   └───────────┘
```

Implementation Steps

Step 1: Configure MTQE Thresholds

Based on your quality requirements and content type:

```python
THRESHOLDS = {
    "marketing": {"high": 0.92, "low": 0.75},
    "technical": {"high": 0.88, "low": 0.70},
    "legal":     {"high": 0.95, "low": 0.85},
}
```

Step 2: Set Up AI LQA Pipeline

Configure error categories and severity weights:

```python
AI_LQA_CONFIG = {
    "error_categories": ["Accuracy", "Fluency", "Terminology", "Style"],
    "severity_weights": {"critical": 25, "major": 5, "minor": 1},
    "pass_threshold": 95,
}
```
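A minimal sketch of how such a config could drive an SLA pass/fail check. The config is repeated so the snippet is self-contained, `passes_sla` is a hypothetical helper (not a library API), and the penalty-per-word normalization is an assumption:

```python
AI_LQA_CONFIG = {
    "error_categories": ["Accuracy", "Fluency", "Terminology", "Style"],
    "severity_weights": {"critical": 25, "major": 5, "minor": 1},
    "pass_threshold": 95,
}

def passes_sla(errors, word_count, config=AI_LQA_CONFIG):
    """errors: (category, severity) annotations from the AI LQA pass."""
    weights = config["severity_weights"]
    penalty = sum(weights[sev] for _category, sev in errors)
    score = (1 - penalty / word_count) * 100  # assumed normalization
    return score >= config["pass_threshold"], round(score, 1)

# One major + two minor errors over a 500-word batch
ok, score = passes_sla(
    [("Accuracy", "major"), ("Fluency", "minor"), ("Terminology", "minor")], 500
)
```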

Step 3: Define Routing Rules

| MTQE Score | AI LQA Result | Action |
| --- | --- | --- |
| ≥ 0.90 | N/A | Auto-publish |
| 0.70-0.90 | No errors | Publish |
| 0.70-0.90 | Minor only | Auto-fix if possible |
| 0.70-0.90 | Major/Critical | Human review |
| < 0.70 | N/A | Human translation |
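The routing table translates directly into a small decision function; a sketch with illustrative labels (the thresholds are the ones used throughout this article):

```python
def route(mtqe_score, lqa_result=None):
    """Combine the fast MTQE filter with the AI LQA result to pick an action."""
    if mtqe_score >= 0.90:
        return "auto_publish"        # MTQE alone is enough
    if mtqe_score < 0.70:
        return "human_translation"   # Not worth an AI LQA pass
    # Middle band: the AI LQA result decides
    if lqa_result == "no_errors":
        return "publish"
    if lqa_result == "minor_only":
        return "auto_fix"
    return "human_review"            # Major/Critical errors
```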

Step 4: Monitor and Adjust

Track these metrics to optimize thresholds:

  • False positive rate (good translations flagged)
  • False negative rate (bad translations missed)
  • Human review volume
  • Average quality score of published content

Cost-Benefit Analysis

Scenario: 1 Million Segments/Month

Traditional Approach (Human LQA on all)

  • Sample rate: 5% = 50,000 segments
  • Human LQA cost: $0.10/segment = $5,000
  • Coverage: 5%

MTQE Only

  • All segments scored: $10 (near-free)
  • No error details for improvement
  • Coverage: 100% (quality scores only)

AI LQA Only

  • All segments: 1M × $0.005 = $5,000
  • Full error details
  • Coverage: 100%

Hybrid Approach

  • MTQE on all: $10
  • AI LQA on medium scores (30%): 300K × $0.005 = $1,500
  • Human review on flagged (2%): 20K × $0.10 = $2,000
  • Total: $3,510
  • Coverage: 100% with full error details where needed
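The arithmetic behind the hybrid total, as a quick sanity check. Per-segment rates and routing shares are taken from the scenario above; the function itself is just a worked example:

```python
SEGMENTS = 1_000_000  # monthly volume from the scenario

def hybrid_cost(mtqe_rate=0.00001, lqa_rate=0.005, human_rate=0.10,
                lqa_share=0.30, human_share=0.02):
    mtqe  = SEGMENTS * mtqe_rate                  # score everything: $10
    lqa   = SEGMENTS * lqa_share * lqa_rate       # AI LQA on the middle band: $1,500
    human = SEGMENTS * human_share * human_rate   # humans on flagged segments: $2,000
    return mtqe + lqa + human
```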

ROI Summary

| Approach | Cost | Coverage | Error Details |
| --- | --- | --- | --- |
| Human LQA | $5,000 | 5% | Full |
| MTQE only | $10 | 100% | None |
| AI LQA only | $5,000 | 100% | Full |
| Hybrid | $3,510 | 100% | Where needed |

Roughly 30% cheaper than the sampled human approach, with 100% coverage instead of 5%. That's the math that sells this to executives.

Tools and Platforms

MTQE Tools

| Tool | Type | Strengths |
| --- | --- | --- |
| COMET | Open-source | State-of-the-art accuracy |
| ModernMT QE | Commercial | Production-ready |
| Google AutoML | Cloud | Easy training |
| Amazon Translate QE | Cloud | AWS integration |

AI LQA Tools

| Tool | Type | Strengths |
| --- | --- | --- |
| KTTC | SaaS | Full MQM, ISO 5060 compliant |
| Phrase Auto LQA | Enterprise | TMS integration |
| ContentQuo | Specialized | Vendor-agnostic |
| Custom GPT-4 | DIY | Flexible, requires engineering |

FAQ

What's the difference between MTQE and AI LQA?

MTQE (Machine Translation Quality Estimation) predicts a single quality score for translations without explaining why. AI LQA (AI-powered Linguistic Quality Assurance) identifies specific errors, categorizes them by type and severity, and provides detailed feedback. MTQE is faster and cheaper; AI LQA is more informative and actionable.

Which is more accurate: MTQE or AI LQA?

It depends on your goal. MTQE is highly accurate at ranking translations by overall quality and correlates well with human judgments for that purpose. AI LQA is better at identifying specific errors that humans would flag. For error detection accuracy, AI LQA currently outperforms MTQE, but MTQE is more reliable for binary "good enough" decisions at scale.

Can MTQE replace human quality evaluation?

MTQE can replace human evaluation for low-stakes triage decisions (which segments need review) but not for detailed quality assessment. It can't provide the error-specific feedback needed for translator training or SLA compliance reporting. For those use cases, AI LQA or human evaluation is still required.

How do MTQE scores relate to MQM scores?

There's no direct mapping. MTQE scores (typically 0-1 or 0-100) represent predicted quality but don't correspond to MQM penalty points. A segment with MTQE 0.85 might have MQM score 92 or 98 depending on error types. If you need MQM-compatible scoring, use AI LQA which outputs error annotations that can be converted to MQM scores.

Should I train my own MTQE model?

Train your own model if: you have domain-specific content (medical, legal), you have labeled data from your own evaluations, and you need maximum accuracy for your specific use case. Use off-the-shelf models (COMET, BLEURT) if: you're working with general content, you don't have labeled training data, or you need to get started quickly.

The MTQE vs AI LQA debate misses the point. They solve different problems. Treat them as layers in a pipeline, not as alternatives, and you get better quality at lower cost than either approach alone.

Ready to implement AI-powered quality assessment? Try KTTC for hybrid MTQE and AI LQA with MQM-based error categorization.
