Fixing Broken AI Video Translation with an Eval-Driven System (MFA, WSOLA, Visemes, and Evals)

Why current AI video translation pipelines fail structurally and how to build a production-grade system with forced alignment, syllable budgeting, viseme-driven lip sync, and eval systems.

Overview

On paper, AI video translation looks straightforward:

Speech-to-Text → Translation → Text-to-Speech → Replace original audio

In practice, the result is usually unwatchable. Lips don’t match the words. The voice sounds rushed or unnaturally stretched. The entire performance feels off, killing immersion.

After working on several dubbing systems, I’ve seen the same structural failures repeat. This post breaks down exactly where things go wrong and lays out a battle-tested pipeline that actually ships in production.

Lip sync gap demo

Notice the slight delay between the captions without lip sync


The Core Problem: Meaning and Time Don’t Align

Translation preserves meaning, but video sync lives in time.

Here’s a simple example:

| Language | Sentence | Approx. Syllables | Typical Duration |
| --- | --- | --- | --- |
| English | What are you doing right now? | ~7 | ~2.0s |
| Hindi | तुम अभी क्या कर रहे हो? | ~8 | ~2.4s |
| Mandarin | 你现在在做什么? | ~7 | ~2.2s |

Even when the meaning is identical, the temporal footprint changes — different syllable counts, phoneme lengths, and natural speaking rhythms. Your translated audio will almost never match the original clip’s duration. And lip sync is fundamentally about aligning mouth movements with audio over time.

If you ignore this mismatch, no amount of fancy TTS or lip-sync models will save you.


Why the Standard Pipeline Collapses

Original Audio → STT (Whisper) → Translation → TTS → Dubbed Video

It fails at every major stage:

  1. STT only gives text, not timing
    Whisper is fantastic at transcription, but basic word timestamps aren’t enough. You need phoneme-level precision.

  2. Translation ignores duration completely
    LLMs optimize for fluency and accuracy — not syllable count or speech rate. A 2.0s English clip easily becomes 2.6s in another language.

  3. TTS prioritizes naturalness over constraints
    Tools like ElevenLabs produce beautiful voices that add pauses, stretch vowels, and vary rhythm naturally. They have zero awareness of your original timing budget.

  4. No feedback loop
    There’s nothing telling the system, “This segment is 30% too long — compress it.” Errors compound across the video.


The Fixed Pipeline

Here’s the production-grade approach:

Step 1: Forced Alignment (Mandatory)

Use Montreal Forced Aligner (MFA) or Gentle (another forced aligner) on the original audio.

This gives you:

  • Accurate word-level timestamps
  • Phoneme-level timestamps

Example output for the word “doing”:

Word: "doing"
Start: 0.82s → End: 1.10s

Phonemes:
D   → 0.82–0.88s
UW  → 0.88–0.96s
IH  → 0.96–1.03s
NG  → 1.03–1.10s

You now have the temporal skeleton of the original speech. Everything else builds on this.


Step 2: Syllable Budgeting

Treat every segment as a strict budget.

Example:

  • Original segment duration: 0.8 seconds
  • Speech rate: ~5 syllables/second
  • Syllable budget: ~3–4 syllables

Your translation must fit inside this constraint. Natural phrasing often has to be sacrificed for timing.

| Intended Meaning | Natural Hindi | Timing-Constrained Hindi |
| --- | --- | --- |
| What are you doing? | तुम क्या कर रहे हो? | क्या कर रहे हो? |
| I am going now | मैं अब जा रहा हूँ | मैं जा रहा |

This is timing-constrained semantic approximation, not perfect translation.


Step 3: Duration-Aware Translation

Feed constraints directly into your translation prompt:

Translate to Hindi.
Target duration: 0.8 seconds
Maximum syllables: 4
Preserve core meaning.
Prefer shorter, natural phrasing. Avoid filler words.

This one change dramatically reduces downstream fixes.


Step 4: Duration-Controlled TTS + Post-Processing

Even with better translation, small mismatches remain.

Techniques:

  • Control speaking rate during TTS generation
  • Use WSOLA (Waveform Similarity Overlap-Add) or a phase vocoder for time-stretching without pitch distortion

Success metric:
|generated_duration - original_duration| / original_duration < 5–8%


Step 5: Viseme-Driven Lip Sync

Matching duration gets you most of the way. Matching mouth shapes gets you the rest.

Phoneme → Viseme Mapping (simplified):

| Phoneme Group | Viseme (Mouth Shape) |
| --- | --- |
| P, B, M | Lips closed |
| F, V | Upper teeth on lower lip |
| AA, AH, AE | Open mouth |
| S, Z, SH, ZH | Narrow teeth gap |

Pipeline:

  1. Generate speech
  2. Extract phonemes from the new audio
  3. Map to visemes
  4. Drive lip animation using Wav2Lip, SadTalker, or similar models

Timing fixes when the mouth moves. Visemes fix how it moves.


Step 6: The Evaluation System (Your Real Moat)

If you’re not measuring quality rigorously, you’re flying blind.

Key Metrics:

  1. Duration Alignment Score
    duration_diff = |gen - original| / original

    • <5% → Excellent
    • 5–10% → Acceptable
    • > 10% → Fail

  2. Syllable Rate Consistency
    Compare syllables per second between original and dubbed.

  3. Translation Quality (LLM-as-Judge)
    Score meaning preservation, omissions, and hallucinations.

  4. Lip Sync Score

    • SyncNet-style sync confidence (the kind of score used to evaluate Wav2Lip), or
    • Frame-level mouth curve distance between original and generated

  5. Human Perception Score
    Simple question: “Does this feel natural and in sync?”

Composite Score Example:

Final Score =
  0.25 × Duration Alignment +
  0.25 × Lip Sync +
  0.20 × Translation Quality +
  0.20 × Speech Rate Naturalness +
  0.10 × Human Perception
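
A minimal sketch of how the weighted sum above could be computed, assuming every metric has already been normalized to the range 0 to 1. The function name `composite_score` and the metric keys are illustrative choices, not part of any library.

```python
# Weights copied from the composite formula above; they sum to 1.0.
WEIGHTS = {
    "duration_alignment": 0.25,
    "lip_sync": 0.25,
    "translation_quality": 0.20,
    "speech_rate_naturalness": 0.20,
    "human_perception": 0.10,
}

def composite_score(metrics):
    """Weighted average of per-metric scores, each expected in [0, 1]."""
    missing = set(WEIGHTS) - set(metrics)
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    # Clamp each metric so a single out-of-range value cannot dominate
    return sum(
        weight * min(1.0, max(0.0, metrics[name]))
        for name, weight in WEIGHTS.items()
    )

# Hypothetical scores for one dubbed segment
example = {
    "duration_alignment": 0.9,
    "lip_sync": 0.8,
    "translation_quality": 0.85,
    "speech_rate_naturalness": 0.7,
    "human_perception": 1.0,
}
print(round(composite_score(example), 3))
```

Raising on a missing metric is deliberate: silently treating an absent score as zero would hide a broken eval stage behind a merely low number.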

Stop doing:
“Translate first, then try to fix sync.”

Start doing:
“Lock the timing first, then fit the best possible meaning inside it.”

This inversion, treating time as the primary constraint, is what separates prototype dubs from production-grade systems.


Building a Lip-Sync Dubbing Pipeline: Phonemes, Visemes, and the Art of Matching Mouths to Words

When viewers say a dub feels "off", what they're sensing is temporal incoherence between phoneme events and face geometry. In plain English: the mouth isn't doing what the audio says it should be doing, and it's not doing it when it should.

Let's go phase by phase.


The High-Level Architecture

Before diving into any individual component, it helps to see the whole shape of the problem:

Input Video
     │
     ▼
Audio Extract          ← ffmpeg
     │
     ▼
STT + Word Timestamps  ← Whisper
     │
     ▼
Forced Alignment       ← MFA (Montreal Forced Aligner)
     │
     ├─────────────────────────┐
     ▼                         ▼
Segment Builder         Syllable Budgeter
     │                         │
     └──────────┬──────────────┘
                ▼
     Constrained Translation   ← LLM with syllable budget
                │
                ▼
         TTS Generation
                │
                ▼
         Time Adjustment       ← WSOLA / rubberband
                │
                ▼
       Phoneme Extraction      ← MFA re-run on TTS audio
                │
                ▼
        Viseme Mapping
                │
                ▼
       Frame-Level Alignment
                │
                ▼
     Lip Renderer (Wav2Lip + viseme guidance)
                │
                ▼
          Final Video

Each of these stages feeds information into the next. The key insight is that you're building a timing-first pipeline, not a translation-first one. Duration is the constraint everything else bends around.


Phase 1 — Forced Alignment: Establishing Timing Ground Truth

This is the foundation. If you get this wrong, nothing downstream can compensate.

What Whisper gives you (and what it doesn't)

Whisper with word_timestamps=True is tempting because it's dead simple to use:

import whisper
 
model = whisper.load_model("large")
result = model.transcribe("audio.wav", word_timestamps=True)
 
for segment in result["segments"]:
    for word in segment["words"]:
        print(word["word"], word["start"], word["end"])

You get word-level timing. That's useful for segmentation — figuring out where sentence boundaries fall. But it's not enough for lip sync. The gap between "word-level" and "phoneme-level" precision is the gap between passable and convincing. For segmentation, Whisper is fine. For anything that touches the mouth renderer, you need phonemes.

Montreal Forced Aligner: the right tool for timing truth

MFA gives you phoneme-level timestamps by aligning a known transcript against the audio waveform. The setup:

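# MFA depends on Kaldi binaries, so install it from conda-forge
conda install -c conda-forge montreal-forced-aligner

# Download the pretrained dictionary and acoustic model
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa

# Prepare your directory (the transcript shares the audio file's base name):
# data/
#   audio.wav
#   audio.txt

mfa align data/ english_us_arpa english_us_arpa output/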

The output is a TextGrid file — a Praat-format annotation with two tiers: words and phones. Parse it like so:

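from textgrid import TextGrid

# MFA writes one TextGrid per audio file, named after it
tg = TextGrid.fromFile("output/audio.TextGrid")

phonemes = []
for interval in tg.getFirst("phones"):
    if not interval.mark:  # skip unlabeled (silent) intervals
        continue
    phonemes.append({
        "phoneme": interval.mark,
        "start": interval.minTime,
        "end": interval.maxTime
    })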

What you end up with is something like:

{
  "word": "doing",
  "start": 0.82,
  "end": 1.1,
  "phonemes": [
    { "phoneme": "D", "start": 0.82, "end": 0.87 },
    { "phoneme": "UW", "start": 0.87, "end": 0.99 },
    { "phoneme": "IH", "start": 0.99, "end": 1.04 },
    { "phoneme": "NG", "start": 1.04, "end": 1.1 }
  ]
}

This is your timing ground truth. Everything — translation budget, TTS duration targets, viseme frame assignment — anchors to these timestamps.
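
The parsing loop above yields a flat phoneme list, while the JSON shows phonemes nested under their word. A small sketch of that grouping step, assuming the `words` tier has been parsed into dicts with `word`, `start`, and `end` keys in the same way; `nest_phonemes` is a hypothetical helper name.

```python
def nest_phonemes(words, phonemes, eps=1e-6):
    """Attach each phoneme to the word whose time span contains it."""
    nested = []
    for w in words:
        inside = [
            p for p in phonemes
            # eps absorbs float noise at shared word/phone boundaries
            if p["start"] >= w["start"] - eps and p["end"] <= w["end"] + eps
        ]
        nested.append({**w, "phonemes": inside})
    return nested

words = [{"word": "doing", "start": 0.82, "end": 1.10}]
phonemes = [
    {"phoneme": "D", "start": 0.82, "end": 0.87},
    {"phoneme": "UW", "start": 0.87, "end": 0.99},
    {"phoneme": "IH", "start": 0.99, "end": 1.04},
    {"phoneme": "NG", "start": 1.04, "end": 1.10},
    {"phoneme": "R", "start": 1.10, "end": 1.18},  # belongs to the next word
]

for entry in nest_phonemes(words, phonemes):
    print(entry["word"], [p["phoneme"] for p in entry["phonemes"]])
```

The nested scan is quadratic, which is fine for a single clip; for long recordings a single pass over both sorted lists would be the obvious refinement.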


Phase 2 — Segmentation and Syllable Budgeting

Now that you have word timestamps, you need to break the audio into manageable chunks and figure out how many syllables fit into each chunk.

Building segments

The goal is chunks of roughly 1–3 seconds that respect natural speech boundaries. Don't cut in the middle of a word:

def build_segments(words, max_duration=2.0):
    """Group consecutive words into segments no longer than max_duration seconds."""
    segments = []
    current = []

    for w in words:
        # Close the current segment if adding this word would exceed the limit
        if current and w["end"] - current[0]["start"] > max_duration:
            segments.append(current)
            current = []
        current.append(w)

    if current:
        segments.append(current)

    return segments

Computing syllable budget

Each segment gets a syllable budget — the maximum number of syllables the translated text should contain to fit within the time window. This is what makes constrained translation possible.

First, estimate syllables per word:

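import pyphen

dic = pyphen.Pyphen(lang="en")

def count_syllables(word):
    """Approximate syllables from hyphenation points (English dictionary only)."""
    word = word.strip().strip(".,!?;:\"'")  # Whisper tokens carry spaces and punctuation
    if not word:
        return 0
    return len(dic.inserted(word).split("-"))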

Then compute the budget for a segment:

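def compute_budget(segment):
    """Derive the timing box for a segment from its original words."""
    duration = segment[-1]["end"] - segment[0]["start"]
    total_syllables = sum(count_syllables(w["word"]) for w in segment)

    return {
        "duration": round(duration, 3),
        # Source syllables per second, useful for later rate checks
        "speech_rate": round(total_syllables / duration, 2) if duration > 0 else 0.0,
        # Budget = the syllables the source speaker fit into this window
        "syllable_budget": total_syllables
    }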

A typical segment might look like:

{
  "duration": 0.8,
  "syllable_budget": 4
}

Four syllables, 0.8 seconds. That's the box the translator needs to fit into.


Phase 3 — Constrained Translation

This is where most people make a critical mistake. They translate the whole document for accuracy, then try to force-fit the timing. That almost never works. You need the constraint to be part of the translation prompt.

Prompt design

def build_prompt(text, duration, syllables):
    return f"""
Translate the following text to Hindi.
 
Hard constraints:
- The translation must fit within {duration:.2f} seconds of speech
- Maximum syllable count: {syllables}
- Preserve core meaning; compress aggressively if needed
- Avoid filler words and padding phrases
- Choose shorter synonyms where available
 
Source text:
{text}
 
Return only the translated text, nothing else.
"""

Running multiple candidates

Don't just take the first output. Generate two or three candidates and pick the best one based on syllable count proximity to budget:

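# translate() is assumed to wrap the LLM call with build_prompt()
candidates = [translate(text, duration, syllables) for _ in range(3)]

def score_candidate(text, budget):
    # count_syllables must suit the target language; the English
    # pyphen dictionary above will miscount Hindi text
    actual = sum(count_syllables(w) for w in text.split())
    return abs(actual - budget["syllable_budget"])

best = min(candidates, key=lambda t: score_candidate(t, budget))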

You can also use an LLM judge to score on meaning preservation, but syllable count proximity to budget is the primary filter.
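
One caveat for the scoring step: a counter built on an English hyphenation dictionary cannot count Hindi text sensibly. Below is a rough, dependency-free sketch of a Devanagari counter. `count_syllables_hi` is a hypothetical helper, and the heuristic is an assumption: it counts aksharas (consonant clusters plus vowel) and drops a word-final inherent vowel. It ignores word-internal schwa deletion, so it will overcount some words.

```python
VIRAMA = "\u094d"  # halant: joins consonants into a conjunct

def _is_consonant(ch):
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095f"

def _is_independent_vowel(ch):
    return "\u0904" <= ch <= "\u0914"

def _is_sign(ch):
    # Vowel signs (matras) and virama, plus chandrabindu / anusvara / visarga
    return "\u093e" <= ch <= "\u094d" or "\u0900" <= ch <= "\u0903"

def count_syllables_hi(word):
    """Estimate the syllable count of one Devanagari word."""
    # Drop punctuation, nukta, and anything outside the classes above
    chars = [
        c for c in word
        if _is_consonant(c) or _is_independent_vowel(c) or _is_sign(c)
    ]
    count = 0
    for i, ch in enumerate(chars):
        nxt = chars[i + 1] if i + 1 < len(chars) else ""
        if _is_independent_vowel(ch):
            count += 1
        elif _is_consonant(ch) and nxt != VIRAMA:
            # A consonant followed by virama is part of a conjunct, not a nucleus
            count += 1
    # A bare word-final consonant usually loses its inherent vowel (तुम is "tum")
    if count > 1 and chars and _is_consonant(chars[-1]):
        count -= 1
    return count

def count_syllables_hi_text(text):
    return sum(count_syllables_hi(w) for w in text.split())

print(count_syllables_hi_text("क्या कर रहे हो?"))
```

Swapping this in for the English counter inside the candidate scorer keeps the rest of the selection logic unchanged. For production use, a proper grapheme-to-phoneme step for the target language would be more reliable than any script-level heuristic.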


Phase 4 — TTS Generation and Duration Matching

Generating audio

Coqui TTS gives you local, controllable voice synthesis:

from TTS.api import TTS
 
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")
 
tts.tts_to_file(
    text=translated_text,
    speaker_wav="reference_voice.wav",  # for voice cloning
    language="hi",
    file_path="segment_tts.wav"
)

ElevenLabs is the alternative if you want higher quality at the cost of API dependency.

Measuring and matching duration

import librosa
import soundfile as sf
 
def get_duration(file):
    # sr=None keeps the file's native sample rate instead of resampling to 22.05 kHz
    y, sr = librosa.load(file, sr=None)
    return librosa.get_duration(y=y, sr=sr)

def match_duration(input_path, target_duration, output_path):
    y, sr = librosa.load(input_path, sr=None)
    current_duration = librosa.get_duration(y=y, sr=sr)
    # rate > 1 shortens the clip, rate < 1 lengthens it
    rate = current_duration / target_duration
    y_stretched = librosa.effects.time_stretch(y, rate=rate)
    sf.write(output_path, y_stretched, sr)

Only apply time-stretching when the deviation is outside a tolerance threshold:

gen_dur = get_duration("segment_tts.wav")
 
if abs(gen_dur - target_dur) / target_dur > 0.05:
    match_duration("segment_tts.wav", target_dur, "segment_adjusted.wav")

How WSOLA works internally

WSOLA diagram

When you call librosa.effects.time_stretch, it uses a Phase Vocoder under the hood. But for speech, you often want WSOLA behavior instead — it handles transients better. Let's look at how WSOLA actually works internally, because understanding this changes how you think about acceptable stretch ratios.

The core problem: you want to change duration without changing pitch. The naive approach — speed up/slow down the sample rate — changes pitch proportionally. That's wrong.

WSOLA's approach is conceptually elegant:

Step 1 — Overlap-add framing

WSOLA overlap-add principle

The input audio is split into short frames, typically 20–40 ms with 50% overlap:

[Frame₀][Frame₁][Frame₂][Frame₃]...

Formally, each frame is extracted as:

$$x_n(t) = x(t + nH_a)\,w(t)$$

  • $x(t)$: input audio signal
  • $w(t)$: window function (e.g., Hann)
  • $H_a$: analysis hop size (controls overlap)
  • $n$: frame index

This equation defines each frame as a windowed slice of the original signal, shifted by $nH_a$.


Step 2 — Synthesis spacing

For time-stretching, synthesis positions are laid out with different spacing than analysis positions. To slow down, you spread synthesis frames further apart (inserting "extra" frames). To speed up, you bring them closer (skipping frames).

The relationship is captured by:

$$\alpha = \frac{H_s}{H_a}$$

  • $H_s$: synthesis hop size
  • $H_a$: analysis hop size
  • $\alpha$: time-stretch factor
  • $\alpha > 1$ → slow down (expand time)
  • $\alpha < 1$ → speed up (compress time)

Synthesis positions are placed at:

$$t_n^{(s)} = nH_s$$

This is what actually reshapes time without altering pitch.

WSOLA_FRAME diagram


Step 3 — Cross-correlation for best match

Here's the key difference from naive overlap-add. Instead of blindly stitching whatever frames fall at the synthesis positions, WSOLA does something smarter:

“For the next frame I need to add to the output, search in a local neighborhood of the input signal for the frame that most closely matches the end of what I've already written.”

It uses cross-correlation to find the best matching window:

$$R(\tau) = \sum_t y_{\text{tail}}(t)\,x(t + \tau)$$

  • $y_{\text{tail}}(t)$: end portion of the current output
  • $x(t + \tau)$: candidate input segment shifted by $\tau$
  • $R(\tau)$: similarity score

This measures how well a candidate frame aligns with the existing output waveform.

The best match is selected as:

$$\tau^* = \arg\max_{\tau} R(\tau)$$

This step is critical—it prevents phase discontinuities, which are the root cause of metallic/robotic artifacts in naive overlap-add.


Step 4 — Window and add

The selected frame gets multiplied by a Hann window (smooth taper at both edges) and added to the output buffer:

$$\text{output}[t] \mathrel{+}= \text{hann\_window}(t) \times \text{selected\_frame}[t]$$

The Hann window itself is defined as:

$$w(t) = 0.5 \left(1 - \cos\left(\frac{2\pi t}{N - 1}\right)\right)$$

This smooth taper:

  • avoids sharp discontinuities at frame edges
  • ensures energy blends naturally between frames

The full reconstruction can be written as:

$$y(t) = \sum_n w(t - nH_s)\,x(t - nH_s + nH_a + \tau_n)$$

  • $\tau_n$: alignment shift found via correlation for frame $n$
  • $w(\cdot)$: synthesis window
  • $H_s$, $H_a$: synthesis and analysis hops

Frame $n$ is read from the input at $nH_a + \tau_n$ and written to the output at $nH_s$. This equation captures the entire WSOLA pipeline: aligned frames + windowing + overlap-add.
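
The four steps can be sketched in plain Python. This is an illustrative toy, not a production implementation: it uses lists instead of NumPy to stay dependency-free, small default frame sizes to stay fast, and the function names `hann` and `wsola` are made up here. Two deliberate deviations from the description above: the similarity search compares candidates against the natural continuation of the previously chosen input frame (a common WSOLA formulation) rather than the output tail, and the window is the periodic Hann (divide by N rather than N − 1) so that copies at 50% overlap sum exactly to 1.

```python
import math

def hann(n):
    """Periodic Hann window; copies at 50% overlap sum to 1."""
    return [0.5 * (1 - math.cos(2 * math.pi * i / n)) for i in range(n)]

def wsola(x, alpha, frame_len=256, tol=64):
    """Stretch x to roughly alpha times its length (alpha > 1 slows down)."""
    hs = frame_len // 2                  # synthesis hop H_s (50% overlap)
    ha = max(1, round(hs / alpha))       # analysis hop H_a = H_s / alpha
    win = hann(frame_len)
    n_frames = max(1, (len(x) - frame_len) // ha + 1)

    # Zero-pad so every candidate frame stays inside the buffer
    xp = [0.0] * tol + list(x) + [0.0] * (frame_len + tol + hs)
    out = [0.0] * ((n_frames - 1) * hs + frame_len)

    prev = tol                           # start (in xp) of the last chosen frame
    for n in range(n_frames):
        nominal = n * ha + tol           # where frame n would sit without alignment
        pos = nominal
        if n > 0:
            # What would naturally follow the previous frame in the input
            target = xp[prev + hs: prev + hs + frame_len]
            best = -math.inf
            # Search +/- tol samples around the nominal position
            for tau in range(-tol, tol + 1):
                start = nominal + tau
                r = sum(a * b for a, b in zip(target, xp[start:start + frame_len]))
                if r > best:
                    best, pos = r, start
        # Window the chosen frame and overlap-add it at the synthesis position
        base = n * hs
        for i in range(frame_len):
            out[base + i] += win[i] * xp[pos + i]
        prev = pos
    return out

# Usage: stretch a 200 Hz test tone sampled at 8 kHz to ~1.2x its length
sr = 8000
tone = [math.sin(2 * math.pi * 200 * i / sr + 0.3) for i in range(4000)]
stretched = wsola(tone, 1.2)
print(len(tone), len(stretched))
```

Because each frame is chosen to continue the waveform of the previous one, the stretched tone keeps its original frequency; only the duration changes. The output length is quantized to whole frames, so it lands near, not exactly at, `alpha` times the input length.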


Why WSOLA holds up well for speech

  • It preserves waveform continuity at the local level
  • Sharp consonant transients (plosives like p, t, k) stay sharp because WSOLA selects frames that match, rather than averaging across spectral bins
  • The perception of speech naturalness depends heavily on these transient events being intact

Phase Vocoder comparison

The Phase Vocoder (what librosa actually uses) works in the frequency domain — convert via STFT, adjust frame spacing, fix up the phase, invert. It handles large stretch ratios better (>1.3×) but introduces a characteristic "smearing" artifact that blurs consonants.


Practical implication for the pipeline

Keep stretch ratios within a narrow band:

$$0.85 < \alpha < 1.15$$

Outside this range, you're better off regenerating the translation with a tighter syllable budget than trying to stretch your way to timing.

The quality degradation compounds with distance from 1.0.

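For transient-preserving stretching from the command line, use rubberband (Python wrappers such as pyrubberband exist). Rubber Band is phase-vocoder based rather than a WSOLA implementation, but it treats transients specially, so consonants stay crisper than with a plain phase vocoder. The -t flag is the time ratio, so 1.2 stretches the clip to 1.2× its original duration:

rubberband -t 1.2 input.wav output.wav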

And this is a critical philosophical point about the whole pipeline: time-stretching is a fine-adjustment tool, not a primary solution. The correct priority ordering is:

  1. Translation fits the syllable budget
  2. TTS output lands close to the target duration naturally
  3. Time-stretch to correct the residual 5–10%
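
The priority ordering above can be expressed as a small decision function. This is a sketch under stated assumptions: the name `plan_adjustment` and its return shape are invented for illustration, the 5% tolerance and the 0.85 to 1.15 band are taken from the text, and `alpha` follows the convention used earlier (greater than 1 means the clip must be lengthened).

```python
def plan_adjustment(gen_dur, target_dur, tolerance=0.05, min_alpha=0.85, max_alpha=1.15):
    """Decide how to reconcile a generated clip with its target duration."""
    if gen_dur <= 0 or target_dur <= 0:
        raise ValueError("durations must be positive")

    alpha = target_dur / gen_dur               # stretch factor needed to hit the target
    deviation = abs(gen_dur - target_dur) / target_dur

    if deviation <= tolerance:
        action = "accept"                      # close enough, leave the audio alone
    elif min_alpha < alpha < max_alpha:
        action = "stretch"                     # small residual, safe to time-stretch
    else:
        action = "retranslate"                 # too far off, tighten the syllable budget
    return {"action": action, "alpha": round(alpha, 3)}

for gen in (0.82, 0.90, 1.20):
    print(gen, plan_adjustment(gen, target_dur=0.8))
```

Note that a stretcher parameterized by playback rate (such as librosa's `time_stretch`) expects the reciprocal, `1 / alpha`.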

Phase 5 — The Viseme Pipeline: Making the Mouth Match

This is the section most tutorials handwave. "Use Wav2Lip" is not an architecture, it's a shortcut. Let's build the real thing.

What you're actually solving

Lip sync is not "match audio with video." That framing leads you to think of it as a signal alignment problem, which causes you to reach for correlation-based approaches that don't work.

What you're actually solving is this:

Map phoneme events to mouth geometry, frame by frame, with correct timing for each phoneme class.

That's a structured rendering problem with domain-specific rules about how human faces move.

Phonemes and visemes

A phoneme is the minimal unit of sound. A viseme is the corresponding mouth shape. The mapping is many-to-one — multiple phonemes look identical at the lip level:

| Phonemes | Viseme | Description |
| --- | --- | --- |
| P, B, M | closed | Lips pressed together |
| F, V | teeth_lip | Upper teeth on lower lip |
| AA, AH, AW | open | Jaw dropped, open mouth |
| EH, AE | mid_open | Partially open, slightly spread |
| T, D, N | tongue_contact | Subtle, nearly closed |
| S, Z | teeth_close | Teeth together, narrow aperture |
| SH, ZH | rounded | Slightly pursed |
| W, UW | rounded_close | Rounded, fairly closed |
| Silence | neutral | Relaxed, slightly parted |

This reduction is the key insight: you don't need to model all 40+ English phonemes visually. You only need to model ~8–10 viseme shapes, because that's all the human face actually produces.

Extracting phonemes from TTS output

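Run MFA again, this time on your generated TTS audio. Use an acoustic model and dictionary for the target language; the English models shown here only apply when the dubbed audio is in English: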

mfa align tts_segments/ english_us_arpa english_us_arpa tts_output/

Or, if your TTS system can output phoneme sequences with timings directly, use those. It saves a realignment step.

The result is a phoneme timeline for the generated audio:

[
  { "phoneme": "M", "start": 0.0, "end": 0.08 },
  { "phoneme": "AE", "start": 0.08, "end": 0.2 },
  { "phoneme": "N", "start": 0.2, "end": 0.28 }
]

Converting phonemes to visemes

PHONEME_TO_VISEME = {
    "P": "closed", "B": "closed", "M": "closed",
    "F": "teeth_lip", "V": "teeth_lip",
    "AA": "open", "AH": "open", "AW": "open",
    "EH": "mid_open", "AE": "mid_open",
    "T": "tongue_contact", "D": "tongue_contact", "N": "tongue_contact",
    "S": "teeth_close", "Z": "teeth_close",
    "SH": "rounded", "ZH": "rounded",
    "W": "rounded_close", "UW": "rounded_close",
}
 
def phoneme_to_viseme(phonemes):
    return [
        {
            "viseme": PHONEME_TO_VISEME.get(p["phoneme"], "neutral"),
            "start": p["start"],
            "end": p["end"],
            "phoneme": p["phoneme"]
        }
        for p in phonemes
    ]

Aligning to video frames

Video runs at a fixed frame rate, so you need to convert the continuous time domain viseme timeline into a discrete per-frame assignment:

def viseme_to_frames(visemes, fps=25):
    frames = []
    for v in visemes:
        # round() avoids off-by-one frames from float error (e.g. 0.28 * 25)
        start_frame = round(v["start"] * fps)
        end_frame = round(v["end"] * fps)
        for f in range(start_frame, end_frame):
            frames.append({
                "frame": f,
                "viseme": v["viseme"],
                "phoneme": v["phoneme"]
            })
    return frames

Now you have:

frame 0  → closed  (M)
frame 1  → closed  (M)
frame 2  → mid_open (AE)
frame 3  → mid_open (AE)
frame 4  → mid_open (AE)
frame 5  → tongue_contact (N)
frame 6  → tongue_contact (N)

This is the per-frame instruction set for your mouth renderer.


Internals of the Viseme Pipeline: Coarticulation, Interpolation, and Plosive Handling

Here's where the engineering gets genuinely interesting. The naive approach — assign one viseme per frame and snap between them — produces something that looks distinctly robotic. To understand why, you need to understand how mouths actually move.

Coarticulation: why discrete visemes are a lie

Speech is not a sequence of discrete mouth shapes. The human vocal tract is a continuous physical system, and it starts transitioning toward the next phoneme while it's still completing the current one.

This is called coarticulation, and it has two flavors:

  • Anticipatory coarticulation: your mouth is already shaping itself for a future phoneme. Say "soon" — your lips are already rounding for the /u/ before you've finished the /s/.
  • Carry-over coarticulation: the previous phoneme's articulation bleeds into the current one. After a nasal like /m/, the velum doesn't snap shut instantly.

If you ignore this and render discrete visemes:

frame 10 → closed
frame 11 → open   ← instant jump
frame 12 → open

Every viseme boundary becomes a discontinuity that viewers register as wrong, even if they can't articulate why.

The fix: viseme vectors and interpolation

Instead of treating visemes as discrete states, represent them as continuous vectors in a small "mouth shape space":

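# Component order: [closed, open, teeth_lip, rounded]
VISEME_VECTORS = {
    "closed":           [1.0, 0.0, 0.0, 0.0],
    "open":             [0.0, 1.0, 0.0, 0.0],
    "teeth_lip":        [0.0, 0.0, 1.0, 0.0],
    "mid_open":         [0.0, 0.5, 0.0, 0.5],
    "rounded":          [0.3, 0.0, 0.0, 0.7],
    "neutral":          [0.2, 0.0, 0.0, 0.8],
    # tongue_contact, teeth_close, and rounded_close still need vectors
    # of their own; tune all of these weights against your renderer
}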

Each dimension corresponds to a mouth shape degree of freedom. Now you can linearly interpolate between any two viseme vectors:

def interpolate(v1, v2, alpha):
    return [(1 - alpha) * a + alpha * b for a, b in zip(v1, v2)]

Apply over the transition window between two visemes:

TRANSITION_FRAMES = 3  # ~120ms at 25fps
frame_vectors = []

def vec(name):
    # Visemes without a vector yet fall back to neutral
    return VISEME_VECTORS.get(name, VISEME_VECTORS["neutral"])

# Each entry carries start_frame / end_frame (its times multiplied by fps)
for i, current_viseme in enumerate(viseme_timeline):
    next_viseme = viseme_timeline[i + 1] if i + 1 < len(viseme_timeline) else current_viseme

    for f in range(current_viseme["start_frame"], current_viseme["end_frame"]):
        frames_from_end = current_viseme["end_frame"] - f

        if frames_from_end <= TRANSITION_FRAMES:
            # Blend toward the next viseme over the last few frames
            alpha = 1.0 - (frames_from_end / TRANSITION_FRAMES)
            blended = interpolate(
                vec(current_viseme["viseme"]),
                vec(next_viseme["viseme"]),
                alpha
            )
        else:
            blended = vec(current_viseme["viseme"])

        frame_vectors.append(blended)

Instead of:

😐 → 😮  (instant jump)

You get:

😐 → 😗 → 😮  (gradual morph)

This is approximately how real human faces work, and it's perceptually the difference between "robotic" and "plausible."

Plosives: the exception that breaks the rule

Plosives (P, B, T, D, K, G) have a completely different timing structure from all other phonemes. (M shares the lip closure of P and B but, as a nasal, has no burst.) Plosives are not gradual transitions; they're events with internal temporal structure:

1. Closure  — lips snap shut (or articulators make contact)
2. Hold     — pressure builds behind the closure (~40–100ms)
3. Burst    — rapid release
4. Transition — into the following vowel

The hold-and-burst is what makes plosives look crisp and natural. If you apply the same smooth interpolation to plosives that you use for vowels, you blur the closure and the release: the audio still carries a sharp burst, but viewers see lips that never quite close.

You need to handle plosives as a special case:

PLOSIVES = {"P", "B", "T", "D", "K", "G"}

def apply_plosive_timing(viseme_entry):
    """Override interpolation for plosives: snap to closure, hold, then release."""
    duration_frames = viseme_entry["end_frame"] - viseme_entry["start_frame"]
    if duration_frames <= 0:
        return []

    # Clamp each phase so the three always add up to duration_frames
    closure_frames = min(duration_frames, max(1, int(duration_frames * 0.2)))
    hold_frames    = min(duration_frames - closure_frames, max(1, int(duration_frames * 0.5)))
    burst_frames   = duration_frames - closure_frames - hold_frames

    # Contact shape for this plosive (lips for P/B, tongue for T/D);
    # falls back to "closed" when the viseme has no vector yet
    contact = VISEME_VECTORS.get(viseme_entry["viseme"], VISEME_VECTORS["closed"])

    phases = []
    # Closure: rapid transition to full contact
    for i in range(closure_frames):
        alpha = (i + 1) / closure_frames
        phases.append(interpolate(VISEME_VECTORS["neutral"], contact, alpha))

    # Hold: full contact, no interpolation
    for _ in range(hold_frames):
        phases.append(contact)

    # Burst: rapid transition to next viseme
    next_vec = viseme_entry.get("next_vector", VISEME_VECTORS["neutral"])
    for i in range(burst_frames):
        alpha = (i + 1) / burst_frames
        phases.append(interpolate(contact, next_vec, alpha))

    return phases

The general rule: plosives snap, vowels flow.

| Phoneme class | Transition style |
| --- | --- |
| Vowels (AA, EH, IY...) | Smooth linear interpolation |
| Fricatives (S, F, SH...) | Semi-smooth, slight hold |
| Nasals (M, N, NG) | Gradual, sustained |
| Plosives (P, B, T, K...) | Snap → hold → burst |

Ignore this and your plosives will be the tell that something is wrong. Every "ba", "pa", "ta" in your dubbed video will feel slightly floaty and unconvincing.
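
To show how the two behaviors can coexist in one pass, here is a self-contained sketch of a per-frame builder. `build_frame_vectors` is a hypothetical helper, the vector values are placeholders, and the plosive handling is deliberately simplified: a plosive holds its contact shape for its whole duration and neighbors never pre-blend into it, so closure and release are instantaneous rather than modeled as separate phases.

```python
# Placeholder weights in the order [closed, open, teeth_lip, rounded]
VECTORS = {
    "closed": [1.0, 0.0, 0.0, 0.0],
    "open": [0.0, 1.0, 0.0, 0.0],
    "neutral": [0.2, 0.0, 0.0, 0.8],
}
PLOSIVES = {"P", "B", "T", "D", "K", "G"}
TRANSITION_FRAMES = 3

def lerp(v1, v2, a):
    return [(1 - a) * x + a * y for x, y in zip(v1, v2)]

def build_frame_vectors(visemes, fps=25):
    """Return one mouth-shape vector per video frame."""
    frames = []
    for i, v in enumerate(visemes):
        start, end = round(v["start"] * fps), round(v["end"] * fps)
        nxt = visemes[i + 1] if i + 1 < len(visemes) else v
        cur_vec = VECTORS.get(v["viseme"], VECTORS["neutral"])
        nxt_vec = VECTORS.get(nxt["viseme"], VECTORS["neutral"])
        # Blend only when neither side of the boundary is a plosive
        smooth = v["phoneme"] not in PLOSIVES and nxt["phoneme"] not in PLOSIVES
        for f in range(start, end):
            left = end - f
            if smooth and left <= TRANSITION_FRAMES:
                frames.append(lerp(cur_vec, nxt_vec, 1 - left / TRANSITION_FRAMES))
            else:
                frames.append(list(cur_vec))   # hold the shape (plosives snap)
    return frames

timeline = [
    {"phoneme": "P", "viseme": "closed", "start": 0.00, "end": 0.12},
    {"phoneme": "AA", "viseme": "open", "start": 0.12, "end": 0.40},
    {"phoneme": "", "viseme": "neutral", "start": 0.40, "end": 0.52},
]
for idx, vector in enumerate(build_frame_vectors(timeline)):
    print(idx, [round(c, 2) for c in vector])
```

In the printed timeline the lips stay fully closed through the P, jump to open at the release, and only the vowel's last few frames drift toward neutral.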

Common rendering problems and fixes

Lag (perceptually the most disruptive)

Audio and visual lip movement have a perceptual tolerance window of roughly ±80ms. Above that threshold, viewers consciously notice the mismatch. Audio-leading-video feels like a recording error. Video-leading-audio (the more common failure mode in synthetic pipelines) feels like bad dubbing from the 1970s.

The fix is simple — shift your entire viseme timeline by a constant offset calibrated against your render path:

AUDIO_VISUAL_OFFSET = -0.05  # 50ms, tune empirically

# Clamp at zero so shifted visemes never land before the first frame
adjusted_visemes = [
    {**v, "start": max(0.0, v["start"] + AUDIO_VISUAL_OFFSET),
          "end": max(0.0, v["end"] + AUDIO_VISUAL_OFFSET)}
    for v in visemes
]

Over-smoothing and under-smoothing

Too much interpolation kills transient information and makes everything look like lips moving through molasses. Too little makes the face look like a Flash animation from 2003 — one where someone replaced every intermediate frame with a hard keyframe.

The calibration point is: vowel transitions should feel natural and gradual; plosive events should feel crisp. If both are happening in the same word (like "pub" or "tab"), you need both behaviors in the same utterance.

Fricative buzzing

For S, Z, SH, F, V — the fricatives — the visual correlate is less pronounced than for plosives and vowels. Teeth visibility, subtle lip shape changes. Don't over-model these; slight weight on the teeth_close or teeth_lip viseme is sufficient. The audio carries more information than the face for fricatives anyway.


Phase 6 — Rendering

Option A: Wav2Lip (black box approach)

Wav2Lip is the fastest path to something that looks okay:

python inference.py \
  --checkpoint_path wav2lip.pth \
  --face video.mp4 \
  --audio generated_audio.wav \
  --outfile output.mp4

It implicitly learns the phoneme-to-mouth-shape mapping from training data and aligns the mouth region in each frame to the audio. The quality is decent, and it handles the model inference for you.

The problem: it's completely opaque. You can't inject your carefully computed viseme timing. If the audio timing drifts, Wav2Lip drifts with it. You have no control over the plosive handling or interpolation behavior.

Option B: Guided rendering (the right approach)

The better architecture uses your viseme timeline to guide the renderer, rather than handing it raw audio and hoping:

def render_frame(video_frame, viseme_vector, renderer):
    """
    Apply viseme blendshapes to video frame.
    viseme_vector: [closed, open, teeth_lip, rounded] weights
    renderer: your own object exposing apply_blend(); not a library API
    """
    return renderer.apply_blend(video_frame, viseme_vector)

output_frames = []
# frame_vectors holds one blended weight list per video frame
for frame, vector in zip(video_frames, frame_vectors):
    output_frames.append(render_frame(frame, vector, renderer))

For 2D video, this requires either a face mesh (MediaPipe gives you 468 landmarks for free), or a warping approach where you parameterize mouth region deformations by viseme vector.

For 3D avatar systems, viseme vectors map directly to blendshape weights — this is the "correct" form of the problem and is how game engines like Unreal handle it natively.

Hybrid strategy (best in practice)

Use your viseme timeline for timing correctness and use Wav2Lip (or a learned renderer) for visual realism:

Audio → phonemes → visemes → timing map
                                │
                    Wav2Lip guided by timing constraints

The timing map tells you when to apply which shape. Wav2Lip fills in the visual detail of how that shape looks on this particular face. You get the correctness of an explicit phoneme-to-viseme pipeline with the realism of a trained renderer.


Putting It All Together

The complete loop across all segments:

# best_translation, generate_tts, extract_phonemes, build_frame_vectors,
# render_segment, and save_segment stand in for the steps from the phases above
for segment in segments:
    text = " ".join(w["word"].strip() for w in segment)
    budget = compute_budget(segment)

    # Translation
    translated = best_translation(text, budget)

    # TTS
    tts_path = generate_tts(translated)

    # Duration matching
    gen_dur = get_duration(tts_path)
    if abs(gen_dur - budget["duration"]) / budget["duration"] > 0.05:
        # match_duration writes the stretched audio to the path it is given
        adjusted_path = tts_path.replace(".wav", "_adjusted.wav")
        match_duration(tts_path, budget["duration"], adjusted_path)
    else:
        adjusted_path = tts_path

    # Phoneme extraction on TTS output
    phonemes = extract_phonemes(adjusted_path)

    # Viseme pipeline
    visemes = phoneme_to_viseme(phonemes)
    frame_vectors = build_frame_vectors(visemes, fps=25)

    # Render
    rendered_frames = render_segment(video_frames, frame_vectors, renderer)

    save_segment(adjusted_path, rendered_frames)

Phase 7 — Evaluation: Closing the Loop

Shipping a dubbing pipeline without evals is guessing. Every component — alignment, translation, TTS, stretching, visemes — introduces error. Evals tell you where the error lives and whether your fixes are actually working.

The eval system has two layers: per-segment automated metrics that catch structural failures fast, and holistic quality scores that catch perceptual failures that automation misses.


Segment-Level Automated Evals

These run on every segment, every time. They're your CI for the pipeline.

1. Duration Alignment Score

def duration_alignment_score(generated_path, target_duration):
    gen_dur = get_duration(generated_path)
    diff = abs(gen_dur - target_duration) / target_duration
    return {
        "generated": round(gen_dur, 3),
        "target": round(target_duration, 3),
        "diff_pct": round(diff * 100, 1),
        "score": max(0.0, 1.0 - (diff / 0.15)),
        "grade": "excellent" if diff < 0.05 else "acceptable" if diff < 0.10 else "fail"
    }

Diff %    Grade
< 5%      Excellent
5–10%     Acceptable
≥ 10%     Fail

Segments that fail this check should be flagged for retranslation before anything downstream runs. Don't let a duration failure cascade into a broken viseme timeline.
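One way to wire that gate in, as a minimal sketch. The function name `gate_on_duration` and the action strings are hypothetical; the 10% cutoff mirrors the `grade` field returned by `duration_alignment_score`.

```python
def gate_on_duration(gen_dur, target_duration, fail_threshold=0.10):
    """Decide whether a segment may continue downstream.

    Hypothetical helper: the action names are placeholders for whatever
    your orchestrator expects.
    """
    if target_duration <= 0:
        raise ValueError("target_duration must be positive")

    diff = abs(gen_dur - target_duration) / target_duration
    action = "retranslate" if diff >= fail_threshold else "continue"
    return {"action": action, "diff_pct": round(diff * 100, 1)}


# A 2.6 s dub against a 2.0 s slot is sent back; 2.05 s passes through.
print(gate_on_duration(2.6, 2.0)["action"])
print(gate_on_duration(2.05, 2.0)["action"])
```

Running the gate before phoneme extraction keeps the viseme stage from spending time on audio that will be regenerated anyway.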


2. Syllable Rate Consistency

Duration alignment measures absolute length. Syllable rate measures rhythm — whether the dubbed speech feels paced like the original.

def syllable_rate_score(original_segment, translated_text, gen_duration):
    orig_duration = original_segment[-1]["end"] - original_segment[0]["start"]
    orig_syllables = sum(count_syllables(w["word"]) for w in original_segment)
    orig_rate = orig_syllables / orig_duration
 
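    # Whitespace splitting assumes a space-delimited language; scripts such as
    # Mandarin need a tokenizer or per-character counting instead.
    gen_syllables = sum(count_syllables(w) for w in translated_text.split())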
    gen_rate = gen_syllables / gen_duration
 
    rate_diff = abs(gen_rate - orig_rate) / orig_rate
    return {
        "original_rate": round(orig_rate, 2),
        "generated_rate": round(gen_rate, 2),
        "rate_diff_pct": round(rate_diff * 100, 1),
        "score": max(0.0, 1.0 - (rate_diff / 0.20))
    }

A segment that hits the duration target but does so with half the syllable count has been over-stretched. The score catches this where duration alignment alone wouldn't.
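To make that concrete, here is a small worked example with made-up numbers (10 original syllables, 5 dubbed syllables, both in exactly 2.0 s). It reuses the same formulas as the two scoring functions above.

```python
# Original: 10 syllables spoken in 2.0 s
orig_rate = 10 / 2.0          # 5.0 syllables per second

# Dub: hits the 2.0 s target exactly, but with only 5 syllables
gen_rate = 5 / 2.0            # 2.5 syllables per second

# Duration alignment sees a perfect match
duration_diff = abs(2.0 - 2.0) / 2.0
duration_score = max(0.0, 1.0 - duration_diff / 0.15)

# Syllable rate sees a 50% deviation, well past the 20% band
rate_diff = abs(gen_rate - orig_rate) / orig_rate
rate_score = max(0.0, 1.0 - rate_diff / 0.20)

print(duration_score)  # 1.0
print(rate_score)      # 0.0
```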


3. Viseme Coverage Score

Viseme coverage tracks how completely phonemes map across the generated audio. A low coverage score typically means MFA alignment failed silently, the TTS output contains unexpected silence, or the audio has been clipped.

def viseme_coverage_score(phonemes, audio_duration):
    if not phonemes:
        return {"score": 0.0, "reason": "no phonemes extracted"}
 
    covered = sum(p["end"] - p["start"] for p in phonemes)
    coverage = covered / audio_duration
    unmapped = [p for p in phonemes if p["phoneme"] not in PHONEME_TO_VISEME]
 
    return {
        "coverage_pct": round(coverage * 100, 1),
        "unmapped_phonemes": unmapped,
        "score": min(1.0, coverage)
    }

A coverage score below 0.85 is a signal to re-run MFA alignment before passing to the viseme renderer.
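When coverage is low, it can help to see where the uncovered time sits before re-running alignment. The sketch below is an assumption-laden helper, not part of MFA: `find_uncovered_gaps` and the `min_gap` threshold are hypothetical, and phonemes are assumed to be dicts with `start` and `end` in seconds, as in `viseme_coverage_score`.

```python
def find_uncovered_gaps(phonemes, audio_duration, min_gap=0.15):
    """Return (start, end) intervals that no aligned phoneme covers.

    min_gap is an illustrative threshold in seconds; shorter gaps are
    treated as normal pauses and ignored.
    """
    gaps = []
    cursor = 0.0
    for p in sorted(phonemes, key=lambda p: p["start"]):
        if p["start"] - cursor >= min_gap:
            gaps.append((round(cursor, 3), round(p["start"], 3)))
        cursor = max(cursor, p["end"])
    # Trailing gap between the last phoneme and the end of the audio
    if audio_duration - cursor >= min_gap:
        gaps.append((round(cursor, 3), round(audio_duration, 3)))
    return gaps


phonemes = [
    {"phoneme": "HH", "start": 0.00, "end": 0.10},
    {"phoneme": "AH", "start": 0.10, "end": 0.25},
    {"phoneme": "L",  "start": 0.80, "end": 0.90},
]
print(find_uncovered_gaps(phonemes, audio_duration=1.5))
# [(0.25, 0.8), (0.9, 1.5)]
```

A gap in the middle of a segment may point to unexpected silence in the TTS output, while a long gap at the end may point to alignment stopping early or to clipped audio.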


4. Stretch Ratio Guard

Time-stretching degrades quality as the ratio moves away from 1.0. Log the applied ratio for every segment so you can identify where the pipeline is leaning on stretching as a crutch rather than fixing the upstream translation.

def stretch_ratio_eval(original_duration, generated_duration):
    ratio = generated_duration / original_duration
    quality = (
        "clean"     if 0.90 <= ratio <= 1.10 else
        "acceptable" if 0.82 <= ratio <= 1.20 else
        "degraded"
    )
    return {
        "ratio": round(ratio, 3),
        "quality": quality,
        "flag": ratio < 0.82 or ratio > 1.20
    }

When flag is True, the correct fix is to retranslate with a tighter syllable budget — not to accept the degraded audio.
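A rough way to pick the tighter budget is to scale the current syllable count by the inverse of the ratio. This is only a sketch: `tightened_syllable_target` is a hypothetical helper, and it assumes duration grows roughly linearly with syllable count at a fixed speaking rate.

```python
def tightened_syllable_target(translated_syllables, ratio, floor=1):
    """Estimate a syllable count that would bring the ratio near 1.0.

    ratio is generated_duration / original_duration, as in
    stretch_ratio_eval. The linear scaling is an approximation.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return max(floor, round(translated_syllables / ratio))


# A 15-syllable translation that ran 25% long suggests aiming for about 12.
print(tightened_syllable_target(15, 1.25))  # 12
```

Treat the result as a starting point for the retranslation prompt rather than a guarantee, since syllable durations vary by language and by TTS voice.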


Translation Quality — LLM-as-Judge

Duration metrics tell you whether the audio fits. They say nothing about whether the translated words preserve the original meaning. For this, use an LLM judge.

import json

def translation_quality_eval(source_text, translated_text, client):
    prompt = f"""
You are evaluating machine translation quality for a dubbed video segment.
Score the translation on each dimension from 0.0 to 1.0.
 
Source text: {source_text}
Translation: {translated_text}
 
Respond in JSON with exactly these fields:
{{
  "meaning_preservation": float,
  "omissions": float,         // 1.0 = nothing omitted
  "hallucinations": float,    // 1.0 = no hallucinated content
  "naturalness": float,
  "explanation": string       // one sentence
}}
"""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
 
    result = json.loads(response.choices[0].message.content)
    result["composite"] = (
        result["meaning_preservation"] * 0.4 +
        result["omissions"] * 0.25 +
        result["hallucinations"] * 0.25 +
        result["naturalness"] * 0.10
    )
    return result

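The composite weights meaning preservation most heavily because losing meaning is worse than dropping words. A weighted average only penalizes hallucinated content, though; to treat hallucinations as a hard failure regardless of timing, gate on the hallucinations score separately.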
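A judge can occasionally return a missing field or a value outside 0.0 to 1.0, so it is worth checking the parsed JSON before computing the composite. The validator below is a minimal sketch; `validate_judge_result` and `REQUIRED_FIELDS` are hypothetical names, and the field list simply follows the prompt above.

```python
REQUIRED_FIELDS = (
    "meaning_preservation",
    "omissions",
    "hallucinations",
    "naturalness",
)


def validate_judge_result(result):
    """Check the judge's parsed JSON before it feeds the composite.

    Raises ValueError on missing or non-numeric fields and clamps
    scores into [0.0, 1.0]. Returns a new dict; the input is untouched.
    """
    cleaned = dict(result)
    for field in REQUIRED_FIELDS:
        value = result.get(field)
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"judge field {field!r} missing or not numeric")
        cleaned[field] = min(1.0, max(0.0, float(value)))
    return cleaned


raw = {
    "meaning_preservation": 0.9,
    "omissions": 1.0,
    "hallucinations": 1.0,
    "naturalness": 1.4,   # out of range, will be clamped
    "explanation": "Faithful but slightly stiff.",
}
print(validate_judge_result(raw)["naturalness"])  # 1.0
```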


Lip Sync Score

Lip sync quality has two tractable proxies: Wav2Lip confidence when you're using a learned renderer, and frame-level mouth contour distance when you have access to the original face.

def lip_sync_score_contour(original_frames, generated_frames, face_mesh):
    """
    Compute mean distance between mouth landmark contours,
    frame by frame, after applying generated visemes.
    """
    # Mouth landmark indices, defined once outside the frame loop
    MOUTH_IDX = [61, 185, 40, 39, 37, 0, 267, 269, 270, 409]

    scores = []
    for orig, gen in zip(original_frames, generated_frames):
        orig_landmarks = face_mesh.process(orig).multi_face_landmarks
        gen_landmarks  = face_mesh.process(gen).multi_face_landmarks

        if not orig_landmarks or not gen_landmarks:
            continue
 
        orig_pts = [(orig_landmarks[0].landmark[i].x,
                     orig_landmarks[0].landmark[i].y) for i in MOUTH_IDX]
        gen_pts  = [(gen_landmarks[0].landmark[i].x,
                     gen_landmarks[0].landmark[i].y) for i in MOUTH_IDX]
 
        dist = sum(
            ((a[0]-b[0])**2 + (a[1]-b[1])**2)**0.5
            for a, b in zip(orig_pts, gen_pts)
        ) / len(MOUTH_IDX)
 
        scores.append(dist)
 
    mean_dist = sum(scores) / len(scores) if scores else 1.0
    return {
        "mean_contour_distance": round(mean_dist, 4),
        "score": max(0.0, 1.0 - (mean_dist / 0.05))
    }

Lower contour distance = better lip sync. A score above 0.80 is acceptable for production; below 0.60 is a visible sync failure.
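Inverting the score formula shows what those thresholds mean as distances. The helper name `distance_for_score` is hypothetical, and the units assume landmark coordinates normalized to image size, as MediaPipe Face Mesh returns them.

```python
def distance_for_score(score, scale=0.05):
    """Invert score = 1 - mean_dist / scale to get the matching distance."""
    return scale * (1.0 - score)


print(round(distance_for_score(0.80), 4))  # 0.01
print(round(distance_for_score(0.60), 4))  # 0.02
```

So the production bar corresponds to a mean landmark offset of about 1% of the frame, and a visible failure to about 2%. Because the coordinates are relative to the whole frame, the `0.05` scale may need retuning for close-ups versus wide shots.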


Composite Segment Score

Every segment gets a single composite score that collapses all the above:

def composite_segment_score(
    duration_score,
    syllable_rate_score,
    translation_composite,
    lip_sync_score,
    viseme_coverage_score
):
    return round(
        0.25 * duration_score +
        0.20 * syllable_rate_score +
        0.25 * translation_composite +
        0.20 * lip_sync_score +
        0.10 * viseme_coverage_score,
        3
    )

Score Range    Verdict
≥ 0.85         Ship it
0.70–0.84      Review
< 0.70         Retranslate

Segments below 0.70 go back to the constrained translation step. Segments in the 0.70–0.84 range get flagged for human review before final render.


Pipeline-Level Reporting

Don't just track individual segments — track pipeline health across a full video run.

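def pipeline_eval_report(segment_results):
    if not segment_results:
        return {"total_segments": 0}

    scores = [r["composite"] for r in segment_results]
    failed = [r for r in segment_results if r["composite"] < 0.70]
    reviewed = [r for r in segment_results if 0.70 <= r["composite"] < 0.85]

    # Count failing segments per component; each result is assumed to
    # carry "<component>_score" keys alongside "composite".
    components = ["duration", "translation", "lip_sync", "syllable_rate"]
    fail_counts = {
        k: sum(1 for r in segment_results if r[k + "_score"] < 0.70)
        for k in components
    }
    bottleneck = max(fail_counts, key=fail_counts.get)
    if fail_counts[bottleneck] == 0:
        bottleneck = None  # no component is failing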
 
    return {
        "total_segments": len(segment_results),
        "mean_composite": round(sum(scores) / len(scores), 3),
        "pass_rate": round(len([s for s in scores if s >= 0.85]) / len(scores), 2),
        "review_queue": len(reviewed),
        "fail_queue": len(failed),
        "worst_bottleneck": bottleneck,
        "p10_score": sorted(scores)[len(scores) // 10]
    }

The worst_bottleneck field is the most actionable output. If lip sync is the top failure mode, the fix is in your viseme pipeline. If translation quality is the bottleneck, tighten your prompts or your syllable budget constraints. If duration is failing consistently, your TTS speaking rate parameters need tuning.


Human Perception Eval

Automated metrics can't fully substitute for what a human viewer catches. Build a lightweight annotation flow alongside your automated evals.

For each reviewed segment, collect three binary signals:

1. Does the lip movement feel natural? [Yes / No]
2. Does the audio pacing feel natural, neither rushed nor stretched? [Yes / No]
3. Does the translation feel like it preserves the original meaning? [Yes / No]

Map these to a 0–1 perception score:

def perception_score(responses):
    return sum(responses.values()) / len(responses)

Even sampling 10–15% of segments through human review will surface systematic failures that automated evals miss — particularly around naturalness, coarticulation quality, and translation register.
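Drawing that sample reproducibly makes it easier to compare runs. The sketch below uses only the standard library; `sample_for_review`, the default fraction, and the seed are illustrative choices.

```python
import random


def sample_for_review(segment_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of segments for human review."""
    ids = list(segment_ids)
    if not ids:
        return []
    # Always review at least one segment, even for very short videos
    k = max(1, round(len(ids) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(ids, k))


picked = sample_for_review(range(200), fraction=0.10)
print(len(picked))  # 20
```

Uniform sampling is the simplest option. If most failures cluster in a few speakers or scenes, stratifying by those attributes may surface them faster.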


Closing Thought

Each metric above targets a different, independent point of failure. Duration alignment catches timing failures. Syllable rate catches rhythm failures. Translation quality catches meaning failures. Lip sync catches visual failures. The composite score then routes each segment to the right queue, and the worst sub-score selects the remediation step automatically.

The target workflow:

Segment processed

  Composite ≥ 0.85? ──Yes──▶ Final render queue

       No

  Composite ≥ 0.70? ──Yes──▶ Human review queue

       No

  Worst sub-score? ──duration/syllables──▶ Retranslate with tighter budget
                   ──translation quality──▶ Regenerate with stricter prompt
                   ──lip sync / visemes──▶ Re-run phoneme extraction + renderer
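The workflow above can be sketched as a single routing function. The key names (`composite`, `<name>_score`) and the queue labels are assumptions for illustration; adapt them to however your segment results are stored.

```python
REMEDIATION = {
    "duration_score": "retranslate_tighter_budget",
    "syllable_rate_score": "retranslate_tighter_budget",
    "translation_score": "regenerate_stricter_prompt",
    "lip_sync_score": "rerun_phonemes_and_renderer",
    "viseme_coverage_score": "rerun_phonemes_and_renderer",
}


def route_segment(result):
    """Route one scored segment: render, review, or a remediation step."""
    if result["composite"] >= 0.85:
        return "final_render"
    if result["composite"] >= 0.70:
        return "human_review"

    # Below 0.70: pick the remediation tied to the lowest sub-score
    present = {k: result[k] for k in REMEDIATION if k in result}
    if not present:
        return "human_review"  # nothing to diagnose from, escalate
    worst = min(present, key=present.get)
    return REMEDIATION[worst]


print(route_segment({
    "composite": 0.62,
    "duration_score": 0.90,
    "syllable_rate_score": 0.80,
    "translation_score": 0.35,
    "lip_sync_score": 0.70,
}))
# regenerate_stricter_prompt
```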

Viseme synchronised output video