Fixing Broken AI Video Translation with an Eval-Driven System (MFA, WSOLA, Visemes, and Evals)
Why current AI video translation pipelines fail structurally and how to build a production-grade system with forced alignment, syllable budgeting, viseme-driven lip sync, and eval systems.
Overview
On paper, AI video translation looks straightforward:
Speech-to-Text → Translation → Text-to-Speech → Replace original audio
In practice, the result is usually unwatchable. Lips don’t match the words. The voice sounds rushed or unnaturally stretched. The entire performance feels off, killing immersion.
After working on several dubbing systems, I’ve seen the same structural failures repeat. This post breaks down exactly where things go wrong and lays out a battle-tested pipeline that actually ships in production.

Notice the slight delay between the captions without lip sync
The Core Problem: Meaning and Time Don’t Align
Translation preserves meaning, but video sync lives in time.
Here’s a simple example:
| Language | Sentence | Approx. Syllables | Typical Duration |
|---|---|---|---|
| English | What are you doing right now? | ~7 | ~2.0s |
| Hindi | तुम अभी क्या कर रहे हो? | ~8 | ~2.4s |
| Mandarin | 你现在在做什么? | ~7 | ~2.2s |
Even when the meaning is identical, the temporal footprint changes — different syllable counts, phoneme lengths, and natural speaking rhythms. Your translated audio will almost never match the original clip’s duration. And lip sync is fundamentally about aligning mouth movements with audio over time.
If you ignore this mismatch, no amount of fancy TTS or lip-sync models will save you.
Why the Standard Pipeline Collapses
Original Audio → STT (Whisper) → Translation → TTS → Dubbed Video
It fails at every major stage:
- STT only gives text, not timing
  Whisper is fantastic at transcription, but basic word timestamps aren’t enough. You need phoneme-level precision.
- Translation ignores duration completely
  LLMs optimize for fluency and accuracy, not syllable count or speech rate. A 2.0s English clip easily becomes 2.6s in another language.
- TTS prioritizes naturalness over constraints
  Tools like ElevenLabs produce beautiful voices that add pauses, stretch vowels, and vary rhythm naturally. They have zero awareness of your original timing budget.
- No feedback loop
  There’s nothing telling the system, “This segment is 30% too long, compress it.” Errors compound across the video.
The Fixed Pipeline
Here’s the production-grade approach:
Step 1: Forced Alignment (Mandatory)
Use Montreal Forced Aligner (MFA) or Gentle on the original audio, together with its transcript.
This gives you:
- Accurate word-level timestamps
- Phoneme-level timestamps
Example output for the word “doing”:
Word: "doing"
Start: 0.82s → End: 1.10s
Phonemes:
D → 0.82–0.88s
UW → 0.88–0.96s
IH → 0.96–1.03s
NG → 1.03–1.10s
You now have the temporal skeleton of the original speech. Everything else builds on this.
Step 2: Syllable Budgeting
Treat every segment as a strict budget.
Example:
- Original segment duration: 0.8 seconds
- Speech rate: ~5 syllables/second
- Syllable budget: ~4 syllables (0.8 s × 5 syllables/s)
Your translation must fit inside this constraint. Natural phrasing often has to be sacrificed for timing.
| Intended Meaning | Natural Hindi | Timing-Constrained Hindi |
|---|---|---|
| What are you doing? | तुम क्या कर रहे हो? | क्या कर रहे हो? |
| I am going now | मैं अब जा रहा हूँ | मैं जा रहा |
This is timing-constrained semantic approximation, not perfect translation.
Step 3: Duration-Aware Translation
Feed constraints directly into your translation prompt:
Translate to Hindi.
Target duration: 0.8 seconds
Maximum syllables: 4
Preserve core meaning.
Prefer shorter, natural phrasing. Avoid filler words.
This one change dramatically reduces downstream fixes.
Step 4: Duration-Controlled TTS + Post-Processing
Even with better translation, small mismatches remain.
Techniques:
- Control speaking rate during TTS generation
- Use WSOLA (Waveform Similarity-based Overlap-Add) or phase vocoder for time-stretching without pitch distortion
Success metric:
|generated_duration - original_duration| / original_duration < 5–8%
Step 5: Viseme-Driven Lip Sync
Matching duration gets you most of the way. Matching mouth shapes gets you the rest.
Phoneme → Viseme Mapping (simplified):
| Phoneme Group | Viseme (Mouth Shape) |
|---|---|
| P, B, M | Lips closed |
| F, V | Upper teeth on lower lip |
| AA, AH, AE | Open mouth |
| S, Z, SH, ZH | Narrow teeth gap |
Pipeline:
- Generate speech
- Extract phonemes from the new audio
- Map to visemes
- Drive lip animation using Wav2Lip, SadTalker, or similar models
Timing fixes when the mouth moves. Visemes fix how it moves.
Step 6: The Evaluation System (Your Real Moat)
If you’re not measuring quality rigorously, you’re flying blind.
Key Metrics:
- Duration Alignment Score
  duration_diff = |gen - original| / original
  - < 5% → Excellent
  - 5–10% → Acceptable
  - > 10% → Fail
- Syllable Rate Consistency
  Compare syllables per second between original and dubbed.
- Translation Quality (LLM-as-Judge)
  Score meaning preservation, omissions, and hallucinations.
- Lip Sync Score
  - SyncNet-style sync confidence (the kind of score used to evaluate Wav2Lip), or
  - Frame-level mouth curve distance between original and generated
- Human Perception Score
  Simple question: “Does this feel natural and in sync?”
Composite Score Example:
Final Score =
0.25 × Duration Alignment +
0.25 × Lip Sync +
0.20 × Translation Quality +
0.20 × Speech Rate Naturalness +
0.10 × Human Perception
Stop doing:
“Translate first, then try to fix sync.”
Start doing:
“Lock the timing first, then fit the best possible meaning inside it.”
This inversion, treating time as the primary constraint, is what separates prototype dubs from production-grade systems.
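The composite score from Step 6 can be sketched in a few lines. This is a minimal illustration, assuming each component has already been normalized to the 0–1 range; the dictionary keys and the function name `composite_score` are hypothetical names chosen for this example, not part of any library.

```python
# Weights taken from the composite score example above.
WEIGHTS = {
    "duration_alignment": 0.25,
    "lip_sync": 0.25,
    "translation_quality": 0.20,
    "speech_rate": 0.20,
    "human_perception": 0.10,
}


def composite_score(scores, weights=WEIGHTS):
    """Weighted sum of per-metric scores, each expected in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        # Refuse to silently score a segment with a metric left out
        raise ValueError(f"missing metrics: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


segment_scores = {
    "duration_alignment": 0.9,
    "lip_sync": 0.8,
    "translation_quality": 1.0,
    "speech_rate": 0.7,
    "human_perception": 0.5,
}
print(round(composite_score(segment_scores), 3))
```

Raising on a missing metric is a deliberate choice here: a segment that skipped its lip-sync check should not look better than one that failed it.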
Building a Lip-Sync Dubbing Pipeline: Phonemes, Visemes, and the Art of Matching Mouths to Words
What viewers sense in a bad dub is temporal incoherence between phoneme events and face geometry. In plain English: the mouth isn't doing what the audio says it should be doing, and it's not doing it when it should.
Let's go phase by phase.
The High-Level Architecture
Before diving into any individual component, it helps to see the whole shape of the problem:
Input Video
│
▼
Audio Extract ← ffmpeg
│
▼
STT + Word Timestamps ← Whisper
│
▼
Forced Alignment ← MFA (Montreal Forced Aligner)
│
├─────────────────────────┐
▼ ▼
Segment Builder Syllable Budgeter
│ │
└──────────┬──────────────┘
▼
Constrained Translation ← LLM with syllable budget
│
▼
TTS Generation
│
▼
Time Adjustment ← WSOLA / rubberband
│
▼
Phoneme Extraction ← MFA re-run on TTS audio
│
▼
Viseme Mapping
│
▼
Frame-Level Alignment
│
▼
Lip Renderer (Wav2Lip + viseme guidance)
│
▼
Final Video
Each of these stages feeds information into the next. The key insight is that you're building a timing-first pipeline, not a translation-first one. Duration is the constraint everything else bends around.
Phase 1 — Forced Alignment: Establishing Timing Ground Truth
This is the foundation. If you get this wrong, nothing downstream can compensate.
What Whisper gives you (and what it doesn't)
Whisper with word_timestamps=True is tempting because it's dead simple to use:
import whisper

model = whisper.load_model("large")
result = model.transcribe("audio.wav", word_timestamps=True)

# Each segment carries a list of words with start/end times in seconds
for segment in result["segments"]:
    for word in segment["words"]:
        print(word["word"], word["start"], word["end"])
You get word-level timing. That's useful for segmentation: figuring out where sentence boundaries fall. But it's not enough for lip sync. The gap between "word-level" and "phoneme-level" precision is the gap between passable and convincing. For segmentation, Whisper is fine. For anything that touches the mouth renderer, you need phonemes.
Montreal Forced Aligner: the right tool for timing truth
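MFA gives you phoneme-level timestamps by aligning a known transcript against the audio waveform. The setup (MFA depends on Kaldi, so the conda-forge package is the usual install route):
conda install -c conda-forge montreal-forced-aligner
# Download the pretrained dictionary and acoustic model once:
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa
# Prepare your directory; the transcript must share the audio file's base name:
# data/
#   audio.wav
#   audio.txt
mfa align data/ english_us_arpa english_us_arpa output/
The output is a TextGrid file, a Praat-format annotation with two tiers: words and phones. Parse it like so: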
from textgrid import TextGrid

# MFA writes one TextGrid per audio file, named after it
tg = TextGrid.fromFile("output/audio.TextGrid")
phonemes = []
for interval in tg.getFirst("phones"):
    if not interval.mark:
        continue  # skip empty (silence) intervals
    phonemes.append({
        "phoneme": interval.mark,
        "start": interval.minTime,
        "end": interval.maxTime
    })
What you end up with is something like:
{
"word": "doing",
"start": 0.82,
"end": 1.1,
"phonemes": [
{ "phoneme": "D", "start": 0.82, "end": 0.87 },
{ "phoneme": "UW", "start": 0.87, "end": 0.99 },
{ "phoneme": "IH", "start": 0.99, "end": 1.04 },
{ "phoneme": "NG", "start": 1.04, "end": 1.1 }
]
}
This is your timing ground truth. Everything (translation budget, TTS duration targets, viseme frame assignment) anchors to these timestamps.
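The parsing code above yields a flat list of phones, while the JSON example nests them under their word. One way to bridge the two is to group phones by time containment. This is a small sketch under that assumption; `attach_phonemes` and its `eps` tolerance are hypothetical names introduced for this example, and it expects word and phone dictionaries shaped like the ones shown above.

```python
def attach_phonemes(words, phonemes, eps=1e-6):
    """Nest each phoneme interval under the word interval that contains it.

    words:    [{"word": str, "start": float, "end": float}, ...]
    phonemes: [{"phoneme": str, "start": float, "end": float}, ...]
    eps absorbs tiny floating-point differences at interval edges.
    """
    nested = []
    for w in words:
        inside = [
            p for p in phonemes
            if p["start"] >= w["start"] - eps and p["end"] <= w["end"] + eps
        ]
        nested.append({**w, "phonemes": inside})
    return nested


words = [{"word": "doing", "start": 0.82, "end": 1.10}]
phones = [
    {"phoneme": "D", "start": 0.82, "end": 0.87},
    {"phoneme": "UW", "start": 0.87, "end": 0.99},
    {"phoneme": "IH", "start": 0.99, "end": 1.04},
    {"phoneme": "NG", "start": 1.04, "end": 1.10},
]
for entry in attach_phonemes(words, phones):
    print(entry["word"], [p["phoneme"] for p in entry["phonemes"]])
```

Because both tiers come from the same alignment, phone boundaries should coincide with word boundaries, which is why a simple containment test is enough here.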
Phase 2 — Segmentation and Syllable Budgeting
Now that you have word timestamps, you need to break the audio into manageable chunks and figure out how many syllables fit into each chunk.
Building segments
The goal is chunks of roughly 1–3 seconds that respect natural speech boundaries. Don't cut in the middle of a word:
def build_segments(words, max_duration=2.0):
    """Group aligned words into segments of at most max_duration seconds."""
    segments = []
    current = []
    for w in words:
        # Close the current segment before adding a word that would overflow it
        if current and w["end"] - current[0]["start"] > max_duration:
            segments.append(current)
            current = []
        current.append(w)
    if current:
        segments.append(current)
    return segments
Computing syllable budget
Each segment gets a syllable budget — the maximum number of syllables the translated text should contain to fit within the time window. This is what makes constrained translation possible.
First, estimate syllables per word:
import pyphen

dic = pyphen.Pyphen(lang='en')

def count_syllables(word):
    # Hyphenation points are a rough proxy for syllable boundaries
    return len(dic.inserted(word).split('-'))
Then compute the budget for a segment:
def compute_budget(segment):
    duration = segment[-1]["end"] - segment[0]["start"]
    total_syllables = sum(count_syllables(w["word"]) for w in segment)
    return {
        "duration": round(duration, 3),
        # speech_rate * duration is just the source syllable count, so use it
        # directly and avoid float truncation
        "syllable_budget": total_syllables
    }
A typical segment might look like:
{
"duration": 0.8,
"syllable_budget": 4
}
Four syllables, 0.8 seconds. That's the box the translator needs to fit into.
Phase 3 — Constrained Translation
This is where most people make a critical mistake. They translate the whole document for accuracy, then try to force-fit the timing. That almost never works. You need the constraint to be part of the translation prompt.
Prompt design
def build_prompt(text, duration, syllables):
    return f"""
Translate the following text to Hindi.
Hard constraints:
- The translation must fit within {duration:.2f} seconds of speech
- Maximum syllable count: {syllables}
- Preserve core meaning; compress aggressively if needed
- Avoid filler words and padding phrases
- Choose shorter synonyms where available
Source text:
{text}
Return only the translated text, nothing else.
"""
Running multiple candidates
Don't just take the first output. Generate two or three candidates and pick the best one based on syllable count proximity to budget:
# translate() is assumed to wrap build_prompt() and your LLM call
candidates = [translate(text, duration, syllables) for _ in range(3)]

def score_candidate(text, budget):
    # count_syllables must understand the target language: the English
    # pyphen dictionary above will miscount Hindi text
    actual = sum(count_syllables(w) for w in text.split())
    return abs(actual - budget["syllable_budget"])

best = min(candidates, key=lambda t: score_candidate(t, budget))
You can also use an LLM judge to score on meaning preservation, but syllable count proximity to budget is the primary filter.
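Scoring Hindi candidates needs a syllable counter that reads Devanagari, which an English hyphenation dictionary cannot do. The sketch below is one rough heuristic, not a linguistic reference: it counts vowel nuclei from the script and drops the word-final inherent vowel, as spoken Hindi usually does. It does not model word-internal schwa deletion, so it can overcount some longer words. The name `count_syllables_hi` and the helper functions are hypothetical.

```python
VIRAMA = "\u094d"  # suppresses a consonant's inherent vowel


def _is_consonant(ch):
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095f"


def _is_independent_vowel(ch):
    return "\u0904" <= ch <= "\u0914"


def _is_matra(ch):
    # Dependent vowel signs attached to a consonant
    return "\u093e" <= ch <= "\u094c"


def count_syllables_hi(word):
    """Estimate the syllable count of one Devanagari word."""
    # Keep only characters that affect syllable structure
    chars = [c for c in word
             if _is_consonant(c) or _is_independent_vowel(c)
             or _is_matra(c) or c == VIRAMA]
    count = 0
    for i, ch in enumerate(chars):
        if _is_independent_vowel(ch) or _is_matra(ch):
            count += 1
        elif _is_consonant(ch):
            nxt = chars[i + 1] if i + 1 < len(chars) else None
            if nxt is None:
                # Word-final inherent vowel is normally silent; count it
                # only when the word would otherwise have no nucleus
                if count == 0:
                    count += 1
            elif not (_is_matra(nxt) or nxt == VIRAMA):
                count += 1  # consonant carrying its inherent vowel
    return count


sentence = "तुम अभी क्या कर रहे हो?"
print(sum(count_syllables_hi(w) for w in sentence.split()))
```

Passing a function like this into the candidate scorer in place of the English counter keeps the budget comparison meaningful for the target language.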
Phase 4 — TTS Generation and Duration Matching
Generating audio
Coqui TTS gives you local, controllable voice synthesis:
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text=translated_text,
    speaker_wav="reference_voice.wav",  # for voice cloning
    language="hi",
    file_path="segment_tts.wav"
)
ElevenLabs is the alternative if you want higher quality at the cost of API dependency.
Measuring and matching duration
import librosa
import soundfile as sf

def get_duration(file):
    # sr=None keeps the file's native sample rate instead of resampling
    y, sr = librosa.load(file, sr=None)
    return librosa.get_duration(y=y, sr=sr)

def match_duration(input_path, target_duration, output_path):
    y, sr = librosa.load(input_path, sr=None)
    current_duration = librosa.get_duration(y=y, sr=sr)
    # rate > 1 shortens the audio, rate < 1 lengthens it
    rate = current_duration / target_duration
    y_stretched = librosa.effects.time_stretch(y, rate=rate)
    sf.write(output_path, y_stretched, sr)
Only apply time-stretching when the deviation is outside a tolerance threshold:
gen_dur = get_duration("segment_tts.wav")
if abs(gen_dur - target_dur) / target_dur > 0.05:
    match_duration("segment_tts.wav", target_dur, "segment_adjusted.wav")
How WSOLA works internally

When you call librosa.effects.time_stretch, it uses a Phase Vocoder under the hood. But for speech, you often want WSOLA behavior instead — it handles transients better. Let's look at how WSOLA actually works internally, because understanding this changes how you think about acceptable stretch ratios.
The core problem: you want to change duration without changing pitch. The naive approach — speed up/slow down the sample rate — changes pitch proportionally. That's wrong.
WSOLA's approach is conceptually elegant:
Step 1 — Overlap-add framing

The input audio is split into short frames, typically 20–40 ms with 50% overlap:
[Frame₀][Frame₁][Frame₂][Frame₃]...
Formally, each frame is extracted as:
$$x_m(n) = x(n + m H_a)\, w(n)$$
- $x$: input audio signal
- $w$: window function (e.g., Hann)
- $H_a$: analysis hop size (controls overlap)
- $m$: frame index
This equation defines how each frame is just a windowed slice of the original signal, shifted by $m H_a$.
Step 2 — Synthesis spacing
For time-stretching, synthesis positions are laid out with different spacing than analysis positions. To slow down, you spread synthesis frames further apart (inserting "extra" frames). To speed up, you bring them closer (skipping frames).
The relationship is captured by:
$$H_s = \alpha H_a$$
- $H_s$: synthesis hop size
- $H_a$: analysis hop size
- $\alpha$: time-stretch factor
- $\alpha > 1$ → slow down (expand time)
- $\alpha < 1$ → speed up (compress time)
Synthesis positions are placed at:
$$n_m = m H_s$$
This is what actually reshapes time without altering pitch.

Step 3 — Cross-correlation for best match
Here's the key difference from naive overlap-add. Instead of blindly stitching whatever frames fall at the synthesis positions, WSOLA does something smarter:
“For the next frame I need to add to the output, search in a local neighborhood of the input signal for the frame that most closely matches the end of what I've already written.”
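It uses cross-correlation to find the best matching window:
$$c_m(\delta) = \sum_{n} \tilde{y}_m(n)\, x(n + m H_a + \delta)$$
- $\tilde{y}_m(n)$: end portion of the current output
- $x(n + m H_a + \delta)$: candidate input segment shifted by $\delta$
- $c_m(\delta)$: similarity score
This measures how well a candidate frame aligns with the existing output waveform.
The best match, searched within a tolerance $\Delta_{\max}$ around the nominal position, is selected as:
$$\delta_m = \arg\max_{|\delta| \le \Delta_{\max}} c_m(\delta)$$
This step is critical: it prevents phase discontinuities, which are the root cause of metallic/robotic artifacts in naive overlap-add.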
Step 4 — Window and add
The selected frame gets multiplied by a Hann window (smooth taper at both edges) and added to the output buffer:
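$$y(n + m H_s) \leftarrow y(n + m H_s) + w(n)\, x(n + m H_a + \delta_m)$$
The Hann window itself is defined, for a frame of $N$ samples, as:
$$w(n) = \tfrac{1}{2}\left(1 - \cos\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1$$
This smooth taper:
- avoids sharp discontinuities at frame edges
- ensures energy blends naturally between frames
The full reconstruction can be written as:
$$y(n) = \sum_{m} w(n - m H_s)\, x(n - m H_s + m H_a + \delta_m)$$
- $\delta_m$: alignment shift found via correlation
- $w$: synthesis window
- $H_s$: synthesis hop
This equation captures the entire WSOLA pipeline: aligned frames + windowing + overlap-add.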
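The four steps above can be sketched in NumPy. This is a minimal illustration rather than a production stretcher: the function name `wsola` and the parameters `frame_len` and `tolerance` are assumptions made for this example, the similarity search uses plain (unnormalized) cross-correlation, and the input is assumed to be a mono float signal. It follows the convention used above, where a factor above 1 lengthens the audio.

```python
import numpy as np


def wsola(x, alpha, frame_len=1024, tolerance=256):
    """Time-stretch mono signal x by factor alpha (alpha > 1 = longer)."""
    x = np.asarray(x, dtype=np.float64)
    hop_s = frame_len // 2            # synthesis hop (50% overlap)
    hop_a = hop_s / alpha             # analysis hop, from H_s = alpha * H_a
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / frame_len)  # periodic Hann

    out_len = int(round(len(x) * alpha))
    num_frames = int(np.ceil(out_len / hop_s))
    # Zero-pad so every search region and frame slice stays in bounds
    padded = np.concatenate([
        np.zeros(tolerance), x, np.zeros(2 * frame_len + 2 * tolerance)
    ])

    y = np.zeros((num_frames - 1) * hop_s + frame_len)
    norm = np.zeros_like(y)           # summed window weights per sample
    prev_pos = None
    for m in range(num_frames):
        nominal = int(round(m * hop_a)) + tolerance
        if prev_pos is None:
            pos = nominal
        else:
            # What would naturally follow the previously copied frame
            target = padded[prev_pos + hop_s: prev_pos + hop_s + frame_len]
            # Candidates within +/- tolerance of the nominal position
            region = padded[nominal - tolerance: nominal + tolerance + frame_len]
            corr = np.correlate(region, target, mode="valid")
            pos = nominal - tolerance + int(np.argmax(corr))
        start = m * hop_s
        y[start:start + frame_len] += window * padded[pos:pos + frame_len]
        norm[start:start + frame_len] += window
        prev_pos = pos

    y /= np.maximum(norm, 1e-8)       # undo window overlap gain
    return y[:out_len]


sr = 16000
tone = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
stretched = wsola(tone, 1.25)
print(len(tone), len(stretched))
```

The search target is the natural continuation of the previously copied frame, so the frame chosen next is the one that keeps the waveform locally continuous, which is the property the cross-correlation step is there to protect.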
Why WSOLA holds up well for speech
- It preserves waveform continuity at the local level
- Sharp consonant transients (plosives like p, t, k) stay sharp because WSOLA selects frames that match, rather than averaging across spectral bins
- The perception of speech naturalness depends heavily on these transient events being intact
Phase Vocoder comparison
The Phase Vocoder (what librosa actually uses) works in the frequency domain — convert via STFT, adjust frame spacing, fix up the phase, invert. It handles large stretch ratios better (>1.3×) but introduces a characteristic "smearing" artifact that blurs consonants.
Practical implication for the pipeline
Keep stretch ratios within a narrow band around 1.0 (the stretch ratio guard in the evaluation phase flags anything outside 0.82–1.20).
Outside this range, you're better off regenerating the translation with a tighter syllable budget than trying to stretch your way to timing.
The quality degradation compounds with distance from 1.0.
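For transient-friendly stretching outside librosa, the Rubber Band command-line tool is a common choice. Strictly speaking it is a phase-vocoder-based stretcher with transient handling rather than WSOLA; -t sets the time ratio, so 1.2 makes the output 20% longer:
rubberband -t 1.2 input.wav output.wav
And this is a critical philosophical point about the whole pipeline: time-stretching is a fine-adjustment tool, not a primary solution. The correct priority ordering is: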
- Translation fits the syllable budget
- TTS output lands close to the target duration naturally
- Time-stretch to correct the residual 5–10%
Phase 5 — The Viseme Pipeline: Making the Mouth Match
This is the section most tutorials handwave. "Use Wav2Lip" is not an architecture, it's a shortcut. Let's build the real thing.
What you're actually solving
Lip sync is not "match audio with video." That framing leads you to think of it as a signal alignment problem, which causes you to reach for correlation-based approaches that don't work.
What you're actually solving is this:
Map phoneme events to mouth geometry, frame by frame, with correct timing for each phoneme class.
That's a structured rendering problem with domain-specific rules about how human faces move.
Phonemes and visemes
A phoneme is the minimal unit of sound. A viseme is the corresponding mouth shape. The mapping is many-to-one — multiple phonemes look identical at the lip level:
| Phonemes | Viseme | Description |
|---|---|---|
| P, B, M | closed | Lips pressed together |
| F, V | teeth_lip | Upper teeth on lower lip |
| AA, AH, AW | open | Jaw dropped, open mouth |
| EH, AE | mid_open | Partially open, slightly spread |
| T, D, N | tongue_contact | Subtle, nearly closed |
| S, Z | teeth_close | Teeth together, narrow aperture |
| SH, ZH | rounded | Slightly pursed |
| W, UW | rounded_close | Rounded, fairly closed |
| Silence | neutral | Relaxed, slightly parted |
This reduction is the key insight: you don't need to model all 40+ English phonemes visually. You only need to model ~8–10 viseme shapes, because many phonemes are visually indistinguishable at the lips.
Extracting phonemes from TTS output
Run MFA again, this time on your generated TTS audio, using a dictionary and acoustic model for the target language (english_us_arpa below only fits an English dub):
mfa align tts_segments/ english_us_arpa english_us_arpa tts_output/
Or, if your TTS system exposes phoneme sequences and timings directly, use those. It saves a realignment step.
The result is a phoneme timeline for the generated audio:
[
{ "phoneme": "M", "start": 0.0, "end": 0.08 },
{ "phoneme": "AE", "start": 0.08, "end": 0.2 },
{ "phoneme": "N", "start": 0.2, "end": 0.28 }
]
Converting phonemes to visemes
PHONEME_TO_VISEME = {
    "P": "closed", "B": "closed", "M": "closed",
    "F": "teeth_lip", "V": "teeth_lip",
    "AA": "open", "AH": "open", "AW": "open",
    "EH": "mid_open", "AE": "mid_open",
    "T": "tongue_contact", "D": "tongue_contact", "N": "tongue_contact",
    "S": "teeth_close", "Z": "teeth_close",
    "SH": "rounded", "ZH": "rounded",
    "W": "rounded_close", "UW": "rounded_close",
}

def phoneme_to_viseme(phonemes):
    return [
        {
            # MFA's ARPAbet vowels carry stress digits (e.g. "AE1");
            # strip them before the lookup. Unknown phonemes map to neutral.
            "viseme": PHONEME_TO_VISEME.get(p["phoneme"].rstrip("012"), "neutral"),
            "start": p["start"],
            "end": p["end"],
            "phoneme": p["phoneme"]
        }
        for p in phonemes
    ]
Aligning to video frames
Video runs at a fixed frame rate, so you need to convert the continuous time domain viseme timeline into a discrete per-frame assignment:
def viseme_to_frames(visemes, fps=25):
    frames = []
    for v in visemes:
        # round() avoids off-by-one frames caused by float truncation
        start_frame = round(v["start"] * fps)
        end_frame = round(v["end"] * fps)
        for f in range(start_frame, end_frame):
            frames.append({
                "frame": f,
                "viseme": v["viseme"],
                "phoneme": v["phoneme"]
            })
    return frames
For the three-phoneme timeline above at 25 fps, you now have:
frame 0 → closed (M)
frame 1 → closed (M)
frame 2 → mid_open (AE)
frame 3 → mid_open (AE)
frame 4 → mid_open (AE)
frame 5 → tongue_contact (N)
frame 6 → tongue_contact (N)
This is the per-frame instruction set for your mouth renderer.
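Converting interval boundaries to frame indices can leave a phoneme shorter than one frame with no frames at all. An alternative sketch, shown here only as one possible design, samples the timeline at each frame's midpoint so that every video frame receives exactly one viseme. The function name `visemes_at_frame_centers` is hypothetical, and the input is assumed to be the list produced by the phoneme-to-viseme step.

```python
def visemes_at_frame_centers(visemes, num_frames, fps=25):
    """Assign one viseme per video frame by sampling at the frame midpoint.

    visemes: [{"viseme": str, "start": float, "end": float, ...}, ...]
    Frames not covered by any interval fall back to "neutral".
    """
    frames = []
    for f in range(num_frames):
        t = (f + 0.5) / fps  # midpoint of frame f, in seconds
        active = next(
            (v for v in visemes if v["start"] <= t < v["end"]), None
        )
        frames.append({
            "frame": f,
            "viseme": active["viseme"] if active else "neutral",
        })
    return frames


timeline = [
    {"viseme": "closed", "start": 0.0, "end": 0.08},
    {"viseme": "mid_open", "start": 0.08, "end": 0.2},
    {"viseme": "tongue_contact", "start": 0.2, "end": 0.28},
]
for entry in visemes_at_frame_centers(timeline, num_frames=7):
    print(entry["frame"], entry["viseme"])
```

The trade-off is that very short phonemes can still be skipped when no frame midpoint falls inside them, but the output always has exactly one entry per frame, which simplifies the renderer loop.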
Internals of the Viseme Pipeline: Coarticulation, Interpolation, and Plosive Handling
Here's where the engineering gets genuinely interesting. The naive approach — assign one viseme per frame and snap between them — produces something that looks distinctly robotic. To understand why, you need to understand how mouths actually move.
Coarticulation: why discrete visemes are a lie
Speech is not a sequence of discrete mouth shapes. The human vocal tract is a continuous physical system, and it starts transitioning toward the next phoneme while it's still completing the current one.
This is called coarticulation, and it has two flavors:
- Anticipatory coarticulation: your mouth is already shaping itself for a future phoneme. Say "soon" — your lips are already rounding for the /u/ before you've finished the /s/.
- Carry-over coarticulation: the previous phoneme's articulation bleeds into the current one. After a nasal like /m/, the velum doesn't snap shut instantly.
If you ignore this and render discrete visemes:
frame 10 → closed
frame 11 → open ← instant jump
frame 12 → open
Every viseme boundary becomes a discontinuity that viewers register as wrong, even if they can't articulate why.
The fix: viseme vectors and interpolation
Instead of treating visemes as discrete states, represent them as continuous vectors in a small "mouth shape space":
VISEME_VECTORS = {
    "closed": [1.0, 0.0, 0.0, 0.0],
    "open": [0.0, 1.0, 0.0, 0.0],
    "teeth_lip": [0.0, 0.0, 1.0, 0.0],
    "mid_open": [0.0, 0.5, 0.0, 0.5],
    "rounded": [0.3, 0.0, 0.0, 0.7],
    "neutral": [0.2, 0.0, 0.0, 0.8],
}
Each dimension corresponds to a mouth shape degree of freedom. The visemes in PHONEME_TO_VISEME that are not listed here (tongue_contact, teeth_close, rounded_close) need vectors of their own; until then they should fall back to neutral. Now you can linearly interpolate between any two viseme vectors:
def interpolate(v1, v2, alpha):
    return [(1 - alpha) * a + alpha * b for a, b in zip(v1, v2)]
Apply over the transition window between two visemes:
TRANSITION_FRAMES = 3  # ~120ms at 25fps

def vector_for(name):
    # Fall back to neutral for visemes that have no vector yet
    return VISEME_VECTORS.get(name, VISEME_VECTORS["neutral"])

# viseme_timeline entries are assumed to carry start_frame / end_frame
frame_vectors = []
for i, current_viseme in enumerate(viseme_timeline):
    next_viseme = viseme_timeline[i + 1] if i + 1 < len(viseme_timeline) else current_viseme
    current_vec = vector_for(current_viseme["viseme"])
    next_vec = vector_for(next_viseme["viseme"])
    for f in range(current_viseme["start_frame"], current_viseme["end_frame"]):
        frames_from_end = current_viseme["end_frame"] - f
        if frames_from_end <= TRANSITION_FRAMES:
            # Blend toward the next viseme over the last few frames
            alpha = 1.0 - (frames_from_end / TRANSITION_FRAMES)
            blended = interpolate(current_vec, next_vec, alpha)
        else:
            blended = current_vec
        frame_vectors.append(blended)
Instead of:
😐 → 😮 (instant jump)
You get:
😐 → 😗 → 😮 (gradual morph)
This is approximately how real human faces work, and it's perceptually the difference between "robotic" and "plausible."
Plosives: the exception that breaks the rule
Plosives (P, B, T, D, K, G) have a completely different timing structure than all other phonemes. They're not gradual transitions but events with internal temporal structure (the nasal M shares the lip closure of P and B, but it is sustained rather than released in a burst):
1. Closure — lips snap shut (or articulators make contact)
2. Hold — pressure builds behind the closure (~40–100ms)
3. Burst — rapid release
4. Transition — into the following vowel
The hold-and-burst is what makes speech look crisp and natural. If you apply the same smooth interpolation to plosives that you use for vowels, you blur the closure and the burst, and viewers see lips that never quite close.
You need to handle plosives as a special case:
PLOSIVES = {"P", "B", "T", "D", "K", "G"}

def apply_plosive_timing(viseme_entry):
    """Override interpolation for plosives: snap to closure, hold, then release."""
    duration_frames = viseme_entry["end_frame"] - viseme_entry["start_frame"]
    if duration_frames <= 0:
        return []
    # Split into closure / hold / burst without exceeding the phoneme's length
    closure_frames = max(1, int(duration_frames * 0.2))
    hold_frames = min(max(1, int(duration_frames * 0.5)), duration_frames - closure_frames)
    burst_frames = duration_frames - closure_frames - hold_frames
    # Closure shape of this plosive; falls back to "closed" if it has no vector
    closed_vec = VISEME_VECTORS.get(viseme_entry["viseme"], VISEME_VECTORS["closed"])
    phases = []
    # Closure: rapid transition into the closure shape
    for i in range(closure_frames):
        alpha = (i + 1) / closure_frames
        phases.append(interpolate(VISEME_VECTORS["neutral"], closed_vec, alpha))
    # Hold: closure shape held, no interpolation
    for _ in range(hold_frames):
        phases.append(closed_vec)
    # Burst: rapid transition to next viseme
    next_vec = viseme_entry.get("next_vector", VISEME_VECTORS["neutral"])
    for i in range(burst_frames):
        alpha = (i + 1) / burst_frames
        phases.append(interpolate(closed_vec, next_vec, alpha))
    return phases
The general rule: plosives snap, vowels flow.
| Phoneme class | Transition style |
|---|---|
| Vowels (AA, EH, IY...) | Smooth linear interpolation |
| Fricatives (S, F, SH...) | Semi-smooth, slight hold |
| Nasals (M, N, NG) | Gradual, sustained |
| Plosives (P, B, T, K...) | Snap → hold → burst |
Ignore this and your plosives will be the tell that something is wrong. Every "ba", "pa", "ta" in your dubbed video will feel slightly floaty and unconvincing.
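To apply the table above, the renderer needs to know which class a phoneme belongs to before it picks a transition style. The sketch below is one way to do that dispatch for ARPAbet symbols. The names `PHONEME_CLASSES`, `TRANSITION_STYLE`, `classify_phoneme`, and `transition_style` are hypothetical, and the style labels are simply shorthand for the rows of the table.

```python
# ARPAbet symbols grouped by manner of articulation
PHONEME_CLASSES = {
    "plosive": {"P", "B", "T", "D", "K", "G"},
    "nasal": {"M", "N", "NG"},
    "fricative": {"F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH"},
    "vowel": {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
              "IH", "IY", "OW", "OY", "UH", "UW"},
}

# Shorthand labels for the transition styles in the table above
TRANSITION_STYLE = {
    "vowel": "smooth",
    "fricative": "semi_smooth",
    "nasal": "sustained",
    "plosive": "snap_hold_burst",
}


def classify_phoneme(symbol):
    """Return the class of an ARPAbet symbol, ignoring stress digits."""
    base = symbol.rstrip("012").upper()
    for name, members in PHONEME_CLASSES.items():
        if base in members:
            return name
    return "other"  # affricates, glides, liquids, silence markers


def transition_style(symbol):
    """Pick a transition style; unclassified phonemes default to smooth."""
    return TRANSITION_STYLE.get(classify_phoneme(symbol), "smooth")


for sym in ["P", "AE1", "M", "S", "L"]:
    print(sym, classify_phoneme(sym), transition_style(sym))
```

A per-phoneme lookup like this lets the frame builder route plosives to the snap-hold-burst path and everything else to interpolation without hard-coding the decision in the render loop.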
Common rendering problems and fixes
Lag (perceptually the most disruptive)
Audio and visual lip movement have a perceptual tolerance window of roughly ±80ms, and the window is asymmetric: viewers tolerate sound that arrives slightly late more readily than sound that arrives early. Beyond that threshold, viewers consciously notice the mismatch. Audio-leading-video feels like a recording error. Video-leading-audio (the more common failure mode in synthetic pipelines) feels like bad dubbing from the 1970s.
The fix is simple — shift your entire viseme timeline by a constant offset calibrated against your render path:
AUDIO_VISUAL_OFFSET = -0.05  # shift visemes 50 ms earlier; tune empirically
adjusted_visemes = [
    {**v, "start": v["start"] + AUDIO_VISUAL_OFFSET,
     "end": v["end"] + AUDIO_VISUAL_OFFSET}
    for v in visemes
]
Over-smoothing and under-smoothing
Too much interpolation kills transient information and makes everything look like lips moving through molasses. Too little makes the face look like a Flash animation from 2003 — one where someone replaced every intermediate frame with a hard keyframe.
The calibration point is: vowel transitions should feel natural and gradual; plosive events should feel crisp. If both are happening in the same word (like "pub" or "tab"), you need both behaviors in the same utterance.
Fricative buzzing
For S, Z, SH, F, V — the fricatives — the visual correlate is less pronounced than for plosives and vowels. Teeth visibility, subtle lip shape changes. Don't over-model these; slight weight on the teeth_close or teeth_lip viseme is sufficient. The audio carries more information than the face for fricatives anyway.
Phase 6 — Rendering
Option A: Wav2Lip (black box approach)
Wav2Lip is the fastest path to something that looks okay:
python inference.py \
    --checkpoint_path wav2lip.pth \
    --face video.mp4 \
    --audio generated_audio.wav \
    --outfile output.mp4
It implicitly learns the phoneme-to-mouth-shape mapping from training data and aligns the mouth region in each frame to the audio. The quality is decent, and it handles the model inference for you.
The problem: it's completely opaque. You can't inject your carefully computed viseme timing. If the audio timing drifts, Wav2Lip drifts with it. You have no control over the plosive handling or interpolation behavior.
Option B: Guided rendering (the right approach)
The better architecture uses your viseme timeline to guide the renderer, rather than handing it raw audio and hoping:
def render_frame(video_frame, viseme_vector, renderer):
    """
    Apply viseme blendshapes to video frame.
    viseme_vector: [closed, open, teeth_lip, rounded] weights
    """
    # renderer.apply_blend is a placeholder for your own face-warping backend
    return renderer.apply_blend(video_frame, viseme_vector)

output_frames = []
# frame_vectors holds one blended viseme vector per video frame
for frame, vector in zip(video_frames, frame_vectors):
    output_frames.append(render_frame(frame, vector, renderer))
For 2D video, this requires either a face mesh (MediaPipe gives you 468 landmarks for free), or a warping approach where you parameterize mouth region deformations by viseme vector.
For 3D avatar systems, viseme vectors map directly to blendshape weights — this is the "correct" form of the problem and is how game engines like Unreal handle it natively.
Hybrid strategy (best in practice)
Use your viseme timeline for timing correctness and use Wav2Lip (or a learned renderer) for visual realism:
Audio → phonemes → visemes → timing map
│
Wav2Lip guided by timing constraints
The timing map tells you when to apply which shape. Wav2Lip fills in the visual detail of how that shape looks on this particular face. You get the correctness of an explicit phoneme-to-viseme pipeline with the realism of a trained renderer.
Putting It All Together
The complete loop across all segments:
# Helpers not defined earlier (best_translation, generate_tts, extract_phonemes,
# build_frame_vectors, render_segment, save_segment) stand for the steps in each phase
for i, segment in enumerate(segments):
    text = " ".join(w["word"] for w in segment)
    budget = compute_budget(segment)
    # Translation
    translated = best_translation(text, budget)
    # TTS
    tts_path = generate_tts(translated)
    # Duration matching
    gen_dur = get_duration(tts_path)
    if abs(gen_dur - budget["duration"]) / budget["duration"] > 0.05:
        # match_duration writes the stretched audio to the path it is given
        adjusted_path = f"segment_{i}_adjusted.wav"
        match_duration(tts_path, budget["duration"], adjusted_path)
    else:
        adjusted_path = tts_path
    # Phoneme extraction on TTS output
    phonemes = extract_phonemes(adjusted_path)
    # Viseme pipeline
    visemes = phoneme_to_viseme(phonemes)
    frame_vectors = build_frame_vectors(visemes, fps=25)
    # Render
    rendered_frames = render_segment(video_frames, frame_vectors, renderer)
    save_segment(adjusted_path, rendered_frames)
Phase 7 — Evaluation: Closing the Loop
Shipping a dubbing pipeline without evals is guessing. Every component — alignment, translation, TTS, stretching, visemes — introduces error. Evals tell you where the error lives and whether your fixes are actually working.
The eval system has two layers: per-segment automated metrics that catch structural failures fast, and holistic quality scores that catch perceptual failures that automation misses.
Segment-Level Automated Evals
These run on every segment, every time. They're your CI for the pipeline.
1. Duration Alignment Score
def duration_alignment_score(generated_path, target_duration):
    gen_dur = get_duration(generated_path)
    diff = abs(gen_dur - target_duration) / target_duration
    return {
        "generated": round(gen_dur, 3),
        "target": round(target_duration, 3),
        "diff_pct": round(diff * 100, 1),
        # Score falls linearly from 1.0 at a perfect match to 0.0 at 15% off
        "score": max(0.0, 1.0 - (diff / 0.15)),
        "grade": "excellent" if diff < 0.05 else "acceptable" if diff < 0.10 else "fail"
    }
| Diff % | Grade |
|---|---|
| < 5% | Excellent |
| 5–10% | Acceptable |
| > 10% | Fail |
Segments that fail this check should be flagged for retranslation before anything downstream runs. Don't let a duration failure cascade into a broken viseme timeline.
2. Syllable Rate Consistency
Duration alignment measures absolute length. Syllable rate measures rhythm — whether the dubbed speech feels paced like the original.
def syllable_rate_score(original_segment, translated_text, gen_duration):
    orig_duration = original_segment[-1]["end"] - original_segment[0]["start"]
    orig_syllables = sum(count_syllables(w["word"]) for w in original_segment)
    orig_rate = orig_syllables / orig_duration
    # Use a syllable counter that understands the target language here
    gen_syllables = sum(count_syllables(w) for w in translated_text.split())
    gen_rate = gen_syllables / gen_duration
    rate_diff = abs(gen_rate - orig_rate) / orig_rate
    return {
        "original_rate": round(orig_rate, 2),
        "generated_rate": round(gen_rate, 2),
        "rate_diff_pct": round(rate_diff * 100, 1),
        "score": max(0.0, 1.0 - (rate_diff / 0.20))
    }
A segment that hits the duration target but does so with half the syllable count has been over-stretched. The score catches this where duration alignment alone wouldn't.
3. Viseme Coverage Score
Viseme coverage tracks how completely phonemes map across the generated audio. A low coverage score typically means MFA alignment failed silently, the TTS output contains unexpected silence, or the audio has been clipped.
def viseme_coverage_score(phonemes, audio_duration):
    if not phonemes:
        return {"score": 0.0, "reason": "no phonemes extracted"}
    covered = sum(p["end"] - p["start"] for p in phonemes)
    coverage = covered / audio_duration
    # Strip ARPAbet stress digits so "AE1" is recognized as "AE"
    unmapped = [p for p in phonemes
                if p["phoneme"].rstrip("012") not in PHONEME_TO_VISEME]
    return {
        "coverage_pct": round(coverage * 100, 1),
        "unmapped_phonemes": unmapped,
        "score": min(1.0, coverage)
    }
A coverage score below 0.85 is a signal to re-run MFA alignment before passing to the viseme renderer.
4. Stretch Ratio Guard
Time-stretching degrades quality as the ratio moves away from 1.0. Log the applied ratio for every segment so you can identify where the pipeline is leaning on stretching as a crutch rather than fixing the upstream translation.
```python
def stretch_ratio_eval(original_duration, generated_duration):
    ratio = generated_duration / original_duration
    quality = (
        "clean" if 0.90 <= ratio <= 1.10 else
        "acceptable" if 0.82 <= ratio <= 1.20 else
        "degraded"
    )
    return {
        "ratio": round(ratio, 3),
        "quality": quality,
        "flag": ratio < 0.82 or ratio > 1.20
    }
```

When flag is True, the correct fix is to retranslate with a tighter syllable budget, not to accept the degraded audio.
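One way to turn a flagged ratio into a concrete retranslation target is to scale the syllable count down. The sketch below assumes duration grows roughly linearly with syllable count at a fixed speaking rate, and it only covers the overlong case (ratio above 1.0). The function name and the `max_ratio` default, taken from the upper edge of the "clean" band, are assumptions.

```python
import math


def tightened_syllable_budget(generated_syllables, stretch_ratio, max_ratio=1.10):
    """Estimate a syllable cap that would bring an overlong segment
    back under `max_ratio`. Hypothetical helper; assumes duration is
    proportional to syllable count at the same speaking rate."""
    if generated_syllables <= 0 or stretch_ratio <= 0:
        raise ValueError("syllable count and ratio must be positive")
    if stretch_ratio <= max_ratio:
        return generated_syllables  # already inside the target band
    return max(1, math.floor(generated_syllables * max_ratio / stretch_ratio))


# A 14-syllable translation that ran 30% long
budget = tightened_syllable_budget(14, stretch_ratio=1.30)
```

The resulting number is only a starting point for the constrained translation prompt, since real speech rate varies with phoneme length and pauses.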
Translation Quality — LLM-as-Judge
Duration metrics tell you whether the audio fits. They say nothing about whether the translated words preserve the original meaning. For this, use an LLM judge.
```python
import json

def translation_quality_eval(source_text, translated_text, client):
    # client: an OpenAI-style client exposing chat.completions.create
    prompt = f"""
    You are evaluating machine translation quality for a dubbed video segment.
    Score the translation on each dimension from 0.0 to 1.0.
    Source text: {source_text}
    Translation: {translated_text}
    Respond in JSON with exactly these fields:
    {{
      "meaning_preservation": float,
      "omissions": float,        // 1.0 = nothing omitted
      "hallucinations": float,   // 1.0 = no hallucinated content
      "naturalness": float,
      "explanation": string      // one sentence
    }}
    """
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    # Weighted blend of the four judge scores (weights sum to 1.0)
    result["composite"] = (
        result["meaning_preservation"] * 0.4 +
        result["omissions"] * 0.25 +
        result["hallucinations"] * 0.25 +
        result["naturalness"] * 0.10
    )
    return result
```

The composite weights meaning preservation heavily because omitting meaning is worse than omitting words, and hallucinating new meaning is a hard failure regardless of timing.
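Judge output is still model output, so it is worth validating before it feeds a weighted score. The sketch below is one possible guard: it checks that the four score fields exist and are numeric, then clamps them to [0.0, 1.0]. The name `validate_judge_result` is an assumption, and whether to clamp or reject out-of-range values is a design choice.

```python
REQUIRED_FIELDS = ("meaning_preservation", "omissions",
                   "hallucinations", "naturalness")


def validate_judge_result(result):
    """Return a cleaned copy with every score as a float in [0.0, 1.0].

    Hypothetical guard: raises ValueError when a field is missing
    or is not a number, so the segment can be re-judged."""
    cleaned = dict(result)
    for field in REQUIRED_FIELDS:
        if field not in result:
            raise ValueError(f"judge response is missing '{field}'")
        value = result[field]
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"'{field}' is not numeric: {value!r}")
        cleaned[field] = min(1.0, max(0.0, float(value)))
    return cleaned


cleaned = validate_judge_result({
    "meaning_preservation": 1.2,   # out of range, clamped to 1.0
    "omissions": 0.9,
    "hallucinations": 1,
    "naturalness": -0.1,           # out of range, clamped to 0.0
    "explanation": "Minor register shift.",
})
```

Running this between `json.loads` and the composite calculation keeps a single malformed response from skewing the segment score.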
Lip Sync Score
Lip sync quality has two tractable proxies: Wav2Lip confidence when you're using a learned renderer, and frame-level mouth contour distance when you have access to the original face.
```python
# Mouth landmark indices (MediaPipe Face Mesh; these trace the outer upper lip)
MOUTH_IDX = [61, 185, 40, 39, 37, 0, 267, 269, 270, 409]

def lip_sync_score_contour(original_frames, generated_frames, face_mesh):
    """
    Compute mean distance between mouth landmark contours,
    frame by frame, after applying generated visemes.
    face_mesh: a MediaPipe FaceMesh instance; frames must be RGB arrays.
    """
    scores = []
    for orig, gen in zip(original_frames, generated_frames):
        orig_landmarks = face_mesh.process(orig).multi_face_landmarks
        gen_landmarks = face_mesh.process(gen).multi_face_landmarks
        # Skip frames where a face was not detected in either video
        if not orig_landmarks or not gen_landmarks:
            continue
        orig_pts = [(orig_landmarks[0].landmark[i].x,
                     orig_landmarks[0].landmark[i].y) for i in MOUTH_IDX]
        gen_pts = [(gen_landmarks[0].landmark[i].x,
                    gen_landmarks[0].landmark[i].y) for i in MOUTH_IDX]
        # Mean Euclidean distance between matching landmarks
        dist = sum(
            ((a[0]-b[0])**2 + (a[1]-b[1])**2)**0.5
            for a, b in zip(orig_pts, gen_pts)
        ) / len(MOUTH_IDX)
        scores.append(dist)
    # No comparable frames counts as the worst case
    mean_dist = sum(scores) / len(scores) if scores else 1.0
    return {
        "mean_contour_distance": round(mean_dist, 4),
        "score": max(0.0, 1.0 - (mean_dist / 0.05))
    }
```

Lower contour distance means better lip sync. A score above 0.80 is acceptable for production; below 0.60 is a visible sync failure.
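To see what those thresholds mean in landmark units, the sketch below inverts the scoring formula. The helper names and the "borderline" label for the unnamed middle band are assumptions. The 0.05 ceiling is the one used in the scoring function.

```python
def contour_score(mean_dist, max_dist=0.05):
    """Same linear mapping as the lip sync score: 0.0 at max_dist or beyond."""
    return max(0.0, 1.0 - mean_dist / max_dist)


def distance_for_score(score, max_dist=0.05):
    """Mean contour distance that produces a given score (inverse mapping)."""
    return (1.0 - score) * max_dist


def sync_verdict(score):
    """Bands from the text; 'borderline' is a hypothetical label for the gap."""
    if score > 0.80:
        return "acceptable"
    if score < 0.60:
        return "visible sync failure"
    return "borderline"


acceptable_limit = distance_for_score(0.80)  # about 0.01
failure_limit = distance_for_score(0.60)     # about 0.02
```

MediaPipe reports landmark x and y normalised to the frame width and height, so a mean distance of 0.01 is about 1% of the frame. That makes the thresholds sensitive to how large the face appears, and close-ups may need a looser ceiling than wide shots.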
Composite Segment Score
Every segment gets a single composite score that collapses all the above:
```python
def composite_segment_score(
    duration_score,
    syllable_rate_score,
    translation_composite,
    lip_sync_score,
    viseme_coverage_score
):
    # Weights sum to 1.0, so the result stays in [0, 1] when every input does
    return round(
        0.25 * duration_score +
        0.20 * syllable_rate_score +
        0.25 * translation_composite +
        0.20 * lip_sync_score +
        0.10 * viseme_coverage_score,
        3
    )
```

| Score Range | Verdict |
|---|---|
| ≥ 0.85 | Ship it |
| 0.70 to < 0.85 | Review |
| < 0.70 | Retranslate |

Segments below 0.70 go back to the constrained translation step. Segments scoring at least 0.70 but below 0.85 get flagged for human review before final render.
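A worked example shows how the weights and verdict bands interact. The sketch below uses the same weights in dictionary form; the names `WEIGHTS`, `composite`, and `verdict` are illustrative, and the sample sub-scores are made up for the arithmetic.

```python
WEIGHTS = {
    "duration": 0.25,
    "syllable_rate": 0.20,
    "translation": 0.25,
    "lip_sync": 0.20,
    "viseme_coverage": 0.10,
}


def composite(sub_scores):
    """Weighted sum of the five sub-scores, rounded to 3 decimals."""
    return round(sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS), 3)


def verdict(score):
    """Map a composite score onto the three verdict bands."""
    if score >= 0.85:
        return "ship"
    if score >= 0.70:
        return "review"
    return "retranslate"


# Illustrative segment: strong everywhere except lip sync
segment = {
    "duration": 0.90,
    "syllable_rate": 0.80,
    "translation": 0.96,
    "lip_sync": 0.50,
    "viseme_coverage": 0.90,
}
score = composite(segment)
outcome = verdict(score)
```

Here a lip sync score of 0.50, which on its own counts as a visible sync failure, still yields a composite of about 0.815 and lands in the review band. A weighted average can mask a single hard failure, so it may be worth checking individual sub-score floors alongside the composite.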
Pipeline-Level Reporting
Don't just track individual segments — track pipeline health across a full video run.
```python
def pipeline_eval_report(segment_results):
    if not segment_results:
        return {"total_segments": 0}
    scores = [r["composite"] for r in segment_results]
    failed = [r for r in segment_results if r["composite"] < 0.70]
    reviewed = [r for r in segment_results if 0.70 <= r["composite"] < 0.85]
    # Count failing sub-scores per metric; the most frequent failure is the bottleneck
    metrics = ["duration", "translation", "lip_sync", "syllable_rate"]
    failures = {
        k: sum(1 for r in segment_results if r[k + "_score"] < 0.70)
        for k in metrics
    }
    bottleneck = max(failures, key=failures.get) if any(failures.values()) else None
    return {
        "total_segments": len(segment_results),
        "mean_composite": round(sum(scores) / len(scores), 3),
        "pass_rate": round(len([s for s in scores if s >= 0.85]) / len(scores), 2),
        "review_queue": len(reviewed),
        "fail_queue": len(failed),
        "worst_bottleneck": bottleneck,
        # Approximate 10th percentile (the minimum for fewer than 10 segments)
        "p10_score": sorted(scores)[len(scores) // 10]
    }
```

The worst_bottleneck field is the most actionable output. If lip sync is the top failure mode, the fix is in your viseme pipeline. If translation quality is the bottleneck, tighten your prompts or your syllable budget constraints. If duration is failing consistently, your TTS speaking rate parameters need tuning.
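A single bottleneck name hides how close the runner-up is. As a complement, the sketch below ranks every metric by failure count. It assumes the same `<metric>_score` keys on each segment result; the function name `failure_counts` and the sample data are illustrative.

```python
from collections import Counter


def failure_counts(segment_results, metrics, threshold=0.70):
    """Rank metrics by how many segments fall below `threshold`.

    Returns a list of (metric, count) pairs, most failures first.
    Hypothetical helper; metrics with zero failures are kept in the list."""
    counts = Counter({m: 0 for m in metrics})
    for r in segment_results:
        for m in metrics:
            if r[f"{m}_score"] < threshold:
                counts[m] += 1
    return counts.most_common()


# Made-up results for three segments
segments = [
    {"duration_score": 0.90, "translation_score": 0.60,
     "lip_sync_score": 0.50, "syllable_rate_score": 0.80},
    {"duration_score": 0.65, "translation_score": 0.90,
     "lip_sync_score": 0.55, "syllable_rate_score": 0.90},
    {"duration_score": 0.95, "translation_score": 0.92,
     "lip_sync_score": 0.90, "syllable_rate_score": 0.85},
]
ranked = failure_counts(
    segments, ["duration", "translation", "lip_sync", "syllable_rate"]
)
```

When two metrics fail at similar rates, fixing only the top one is unlikely to move the pass rate much, which is why the full ranking is useful next to the single field.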
Human Perception Eval
Automated metrics can't fully substitute for what a human ear catches. Build a lightweight annotation flow alongside your automated evals.
For each reviewed segment, collect three binary signals:
1. Does the lip movement feel natural? [Yes / No]
2. Does the pacing feel natural, neither rushed nor stretched? [Yes / No]
3. Does the translation feel like it preserves the original meaning? [Yes / No]
Map these to a 0–1 perception score:
```python
def perception_score(responses):
    # responses: question -> bool, where True is the favourable answer
    return sum(responses.values()) / len(responses)
```

Even sampling 10–15% of segments through human review will surface systematic failures that automated evals miss, particularly around naturalness, coarticulation quality, and translation register.
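Picking the review sample reproducibly makes it easier to compare runs. The sketch below draws a fixed fraction of segment IDs with a seeded random generator. The function name, the 10% default, and the seed handling are assumptions, and a real flow might oversample segments in the review band instead of sampling uniformly.

```python
import random


def sample_for_review(segment_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of segments for human review.

    Hypothetical helper: always returns at least one ID when any exist."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(segment_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * fraction))
    # A dedicated Random instance keeps the global RNG state untouched
    return sorted(random.Random(seed).sample(ids, k))


picked = sample_for_review(range(200), fraction=0.10)
```

Fixing the seed per video means the same segments come back when the pipeline is re-run, so annotator scores can be compared before and after a change.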
Closing Thought
Each metric above targets a different, independent point of failure. Duration alignment catches timing failures. Syllable rate catches rhythm failures. Translation quality catches meaning failures. Lip sync catches visual failures. And the composite score routes segments to the right remediation step automatically.
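For segments that fail outright, the remediation step follows from the weakest sub-score. The sketch below is one way to express that lookup. The metric names, the `REMEDIATION` table, and the function name are assumptions chosen to mirror the sub-scores used in this post.

```python
# Hypothetical mapping from weakest sub-score to remediation step
REMEDIATION = {
    "duration": "retranslate with tighter budget",
    "syllable_rate": "retranslate with tighter budget",
    "translation": "regenerate with stricter prompt",
    "lip_sync": "re-run phoneme extraction + renderer",
    "viseme_coverage": "re-run phoneme extraction + renderer",
}


def remediation_for(sub_scores):
    """Return (worst_metric, action) for a failed segment.

    Ties go to the first metric in the dict's insertion order."""
    if not sub_scores:
        raise ValueError("no sub-scores provided")
    unknown = set(sub_scores) - set(REMEDIATION)
    if unknown:
        raise ValueError(f"unknown metrics: {sorted(unknown)}")
    worst = min(sub_scores, key=sub_scores.get)
    return worst, REMEDIATION[worst]


worst, action = remediation_for({
    "duration": 0.90,
    "syllable_rate": 0.80,
    "translation": 0.40,
    "lip_sync": 0.70,
    "viseme_coverage": 0.90,
})
```

Routing on the single weakest metric keeps the loop simple, though a segment with two weak sub-scores may need more than one pass.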
The target workflow:
```
Segment processed
       │
Composite ≥ 0.85? ──Yes──▶ Final render queue
       │
       No
       │
Composite ≥ 0.70? ──Yes──▶ Human review queue
       │
       No
       │
Worst sub-score? ──duration/syllables──▶ Retranslate with tighter budget
                 ──translation quality──▶ Regenerate with stricter prompt
                 ──lip sync / visemes──▶ Re-run phoneme extraction + renderer
```

Viseme synchronised output video