
Building Auto Reframe: Remotion + Mediapipe with interpolation

Building a production auto-reframe system using Remotion, MediaPipe, and real-time interpolation for dynamic face tracking across aspect ratios.

The Problem

You have a 16:9 video: a podcast, an interview, a talk. You want to post it as a 9:16 vertical reel, a 1:1 square clip, or a 4:5 Instagram post. If you just letterbox it, black bars take up roughly 44% (1:1), 55% (4:5), or 68% (9:16) of the frame. If you center-crop it, the speaker drifts out of frame every time they lean or move.

The real solution: auto-reframe — dynamically pan the crop window to follow the speaker's face. This post explains every piece of the system we built, starting from the deepest layer — how MediaPipe actually works — and building up to the full render pipeline.


Part 0: Understanding MediaPipe — From the Ground Up

Before we write a single line of application code, it's worth deeply understanding what MediaPipe is and how it executes inference in the browser. Most tutorials treat it as a black box. We won't.

What is MediaPipe?

MediaPipe is Google's open-source, cross-platform ML pipeline framework, originally developed for real-time perception on mobile devices (Android, iOS). Its core design goal is to run ML models on-device — no round-trips to a cloud API, no GPU cluster, just local CPU or GPU inference at interactive frame rates.

The framework provides:

  1. A graph-based pipeline model — inference is expressed as a DAG (directed acyclic graph) of "calculators" that pass typed packets through edges. This is why it's called *MediaPipe* — data flows through pipes.

  2. Pre-built, optimized model bundles — BlazeFace, BlazePose, Face Mesh, Hands, etc. These aren't raw TensorFlow SavedModels; they're TFLite flatbuffers compiled and tuned for specific hardware delegates.

  3. The Tasks API — a higher-level abstraction introduced in 2022 that wraps the graph-based API into simpler FaceDetector, PoseLandmarker, etc. classes. This is what we use via @mediapipe/tasks-vision.

The WASM Runtime: How ML Runs in a Browser

Browsers can't run native code directly. So MediaPipe's inference engine is compiled to WebAssembly (WASM) — a binary instruction format that runs in a sandboxed virtual machine inside the browser at near-native speed.

Here's what happens when you call FilesetResolver.forVisionTasks(cdnUrl):

Browser                              CDN (jsDelivr)
  │                                       │
  │── fetch vision_wasm_internal.wasm ───▶│
  │◀─ ~5 MB binary ──────────────────────│
  │                                       │
  │── fetch vision_wasm_internal.js  ────▶│
  │◀─ JS glue code ──────────────────────│
  │                                       │
  │  [Browser WASM VM instantiates]
  │  [Linear memory allocated: ~64 MB]
  │  [Function table populated]
  │  [Imports wired: canvas2D, WebGL]

The vision_wasm_internal.wasm binary contains the entire TFLite interpreter — the inference engine that can load and execute .tflite model files. It's compiled from C++ using Emscripten. The JS glue code (vision_wasm_internal.js) bridges the WASM module to browser APIs that WASM can't access directly (like <canvas>, WebGLRenderingContext, and fetch).

Once instantiated, the WASM module has a linear memory space — a single contiguous ArrayBuffer that it uses as its "heap." Tensors, model weights, intermediate activations — everything lives in this buffer. JavaScript communicates with it by writing to and reading from specific offsets in this buffer.
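To make the "offsets into one buffer" idea concrete, here is a minimal sketch that uses a plain ArrayBuffer as a stand-in for the WASM heap. The buffer size, the offset, and the function name are illustrative assumptions, not values taken from MediaPipe.

```typescript
// Illustrative stand-in for WASM linear memory: one contiguous buffer.
// The 64 KiB size and the byte offset below are made-up values.
function demoLinearMemory(): { sum: number; sharedBytes: boolean } {
  const heap = new ArrayBuffer(64 * 1024);
  const TENSOR_OFFSET = 1024; // hypothetical address handed out by an allocator
  const TENSOR_LENGTH = 4;

  // "JavaScript side": write four floats at the agreed offset.
  const writer = new Float32Array(heap, TENSOR_OFFSET, TENSOR_LENGTH);
  writer.set([0.25, 0.5, 0.75, 1.0]);

  // "Module side": a second view over the same bytes sees the same values,
  // because typed-array views share the buffer instead of copying it.
  const reader = new Float32Array(heap, TENSOR_OFFSET, TENSOR_LENGTH);
  const sum = reader.reduce((acc, v) => acc + v, 0);

  // Byte views alias the same memory as float views.
  const bytes = new Uint8Array(heap);
  const sharedBytes = bytes
    .slice(TENSOR_OFFSET, TENSOR_OFFSET + 16)
    .some((b) => b !== 0);

  return { sum, sharedBytes };
}
```

A real module hands back offsets from its own allocator; the point here is only that both sides read and write the same bytes.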

GPU Delegation: How WebGL Accelerates Inference

When you pass delegate: "GPU", the TFLite runtime doesn't run matrix multiplications in WASM — it offloads them to the GPU via WebGL shaders.

Here's the data flow for a single inference pass with GPU delegation:

1. JavaScript draws video frame to <canvas>
         │
         ▼
2. GPU delegate reads pixels from canvas via WebGL texture upload
   (gl.texImage2D — frame goes from CPU RAM → GPU VRAM)
         │
         ▼
3. Each TFLite Conv2D / DepthwiseConv2D op is compiled to a GLSL fragment shader
   GPU executes hundreds of shader programs in parallel (one per output pixel)
         │
         ▼
4. Output tensor pixels read back from GPU → CPU via gl.readPixels()
   (This is the expensive step — GPU→CPU roundtrip)
         │
         ▼
5. WASM post-processing: NMS, decode bounding boxes, return detections

Why is this ~3–5× faster than CPU? Because neural network layers are dominated by tensor multiplications — operations that are embarrassingly parallel. A GPU has thousands of shader cores that can multiply matrix elements simultaneously, whereas WASM runs on a few CPU cores.

The catch: gl.readPixels() (step 4) is a synchronous GPU→CPU transfer that stalls the pipeline. The GPU delegate minimizes this by keeping intermediate tensors on the GPU between layers, only reading back the final output. For BlazeFace's shallow architecture (the final output is a small set of bounding box predictions, not a full feature map), this readback is cheap.

The BlazeFace Model: Architecture Deep Dive

BlazeFace is a single-shot face detector — it produces bounding boxes directly from a single forward pass, without region proposal stages (unlike two-stage detectors like Faster R-CNN). Here's its architecture:

Input: 128×128 RGB image
         │
         ▼
┌─────────────────────────────────────────────────────┐
│  Stem: Conv2D(24, 5×5, stride=2) + BN + ReLU6       │  → 64×64×24
│  5× BlazeBlock (depthwise separable convolutions)    │  → 32×32×48
│  6× BlazeBlock (double BlazeBlock with residual)     │  → 16×16×96
└─────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────┐
│  Prediction heads (applied to 16×16 and 8×8 grids)  │
│    Box regression:    → 16×16×2×4 + 8×8×6×4 anchors │
│    Classification:    → 16×16×2×1 + 8×8×6×1 scores  │
└─────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────┐
│  Non-Maximum Suppression (NMS)                       │
│  Score threshold: 0.5, IoU threshold: 0.3            │
└─────────────────────────────────────────────────────┘
         │
         ▼
Output: [{boundingBox, keypoints, score}, ...]

BlazeBlocks are the key innovation. Instead of standard 3×3 convolutions, BlazeFace uses depthwise separable convolutions: a 3×3 depthwise conv (one filter per channel) followed by a 1×1 pointwise conv (linear combination of channels). This reduces parameters by ~8–9× compared to a standard Conv2D while achieving similar accuracy.

The anchor-based detection means the model doesn't predict absolute coordinates. It predicts offsets from pre-defined anchor boxes tiled across the 16×16 and 8×8 feature map grids. At the 16×16 grid, there are 2 anchors per cell (512 total anchors). At the 8×8 grid, there are 6 anchors per cell (384 total anchors). That's 896 candidate boxes for a single image; NMS reduces these to the top predictions.
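The suppression step can be sketched as plain greedy NMS. This is a generic textbook version using the score and IoU thresholds from the diagram above as defaults; MediaPipe's own post-processing may use a different variant, and the `CandidateBox` type and function names are assumptions for illustration.

```typescript
interface CandidateBox {
  x: number; // top-left x
  y: number; // top-left y
  w: number;
  h: number;
  score: number;
}

// Intersection-over-union of two axis-aligned boxes.
function iou(a: CandidateBox, b: CandidateBox): number {
  const x1 = Math.max(a.x, b.x);
  const y1 = Math.max(a.y, b.y);
  const x2 = Math.min(a.x + a.w, b.x + b.w);
  const y2 = Math.min(a.y + a.h, b.y + b.h);
  const inter = Math.max(0, x2 - x1) * Math.max(0, y2 - y1);
  const union = a.w * a.h + b.w * b.h - inter;
  return union > 0 ? inter / union : 0;
}

// Greedy NMS: drop low scores, then keep the best box and discard
// every remaining box that overlaps a kept box too much.
function greedyNms(
  boxes: CandidateBox[],
  scoreThreshold = 0.5,
  iouThreshold = 0.3
): CandidateBox[] {
  const candidates = boxes
    .filter((b) => b.score >= scoreThreshold)
    .sort((a, b) => b.score - a.score);

  const kept: CandidateBox[] = [];
  for (const c of candidates) {
    if (kept.every((k) => iou(k, c) <= iouThreshold)) kept.push(c);
  }
  return kept;
}
```

Two heavily overlapping candidates for the same face collapse into the higher-scoring one, while a box elsewhere in the frame survives as a separate detection.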

Why does this matter for our use case? The model's input is 128×128, but we feed it a 640×360 canvas. The Tasks API internally resizes and pads the input to 128×128 before inference. Our canvas resolution only affects how much pre-processing WebGL does; the model itself always sees 128×128. Detecting at 1920×1080 vs 640×360 therefore produces essentially the same results (small differences can come from the extra downscale step), and the 640×360 choice purely saves texture upload time.
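A small sketch of the resize-and-pad arithmetic shows why the source resolution drops out. It assumes an aspect-preserving fit with symmetric padding, as the paragraph above describes; the exact preprocessing inside the Tasks API is not shown in this post, and `letterboxToSquare` is a hypothetical helper.

```typescript
interface LetterboxParams {
  scale: number;
  contentWidth: number;
  contentHeight: number;
  padX: number; // padding on each of the left and right sides
  padY: number; // padding on each of the top and bottom sides
}

// Fit a srcWidth × srcHeight frame inside a target × target square,
// preserving aspect ratio and splitting the leftover space evenly.
function letterboxToSquare(
  srcWidth: number,
  srcHeight: number,
  target = 128
): LetterboxParams {
  const scale = Math.min(target / srcWidth, target / srcHeight);
  const contentWidth = Math.round(srcWidth * scale);
  const contentHeight = Math.round(srcHeight * scale);
  return {
    scale,
    contentWidth,
    contentHeight,
    padX: (target - contentWidth) / 2,
    padY: (target - contentHeight) / 2,
  };
}
```

Under these assumptions both a 640×360 and a 1920×1080 frame end up as a 128×72 image with 28 rows of padding above and below; only the scale factor differs.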

The .tflite File: What's Inside

The model file (blaze_face_short_range.tflite) is a FlatBuffer, a binary serialization format (like Protocol Buffers but with zero-copy random access). It contains:

  • The computation graph: a list of operators (Conv2D, DepthwiseConv2D, Reshape, etc.) with their connectivity
  • All pre-trained weights, quantized to float16 (~200 KB total)
  • Operator-specific metadata (padding, stride, activation function)

When TFLite loads this model, it allocates tensor memory in the WASM linear buffer, copies weights in, and builds an execution plan. For GPU delegation, it additionally compiles each supported op to a GLSL shader at initialization time — this is why the first call to FaceDetector.createFromOptions() takes a second or two, even though subsequent detect() calls are fast (the shaders are cached in the GPU driver).

Tasks API: The Abstraction Layer

The @mediapipe/tasks-vision package is MediaPipe's high-level TypeScript SDK. It wraps the low-level WASM API with a clean interface, handling:

  • WASM module lifecycle (download, compile, instantiate)
  • Model download and loading into WASM memory
  • Input preprocessing (resize, normalize, format conversion)
  • Output postprocessing (decode anchors, NMS, coordinate denormalization)
  • GPU delegate setup and WebGL context management

When you call detector.detect(canvas), the Tasks API:

  1. Calls gl.texImage2D() to upload the canvas as a WebGL texture
  2. Calls into WASM to run inference with the GPU delegate
  3. Reads the raw output tensors (box offsets + scores)
  4. Runs NMS in WASM
  5. Denormalizes coordinates from [0,1] back to pixel space
  6. Returns a typed FaceDetectorResult JavaScript object

You never touch any of this directly — but understanding it explains why the first inference is slow (shader compilation), why GPU is faster (parallel texture ops), and why using runningMode: "IMAGE" with explicit seeking gives us more control than "VIDEO" mode (which has its own internal buffering).


Architecture: The Four-Stage Pipeline

Now that we understand the engine, here's how the full system is structured:

┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  1. DETECTION   │────▶│  2. SMOOTHING    │────▶│ 3. INTERPOLATION │────▶│  4. RENDER LOOP  │
│                 │     │                  │     │                  │     │                  │
│ MediaPipe WASM  │     │  EMA (offline)   │     │ Linear lerp      │     │ Cosine kernel    │
│ blaze_face model│     │  α = 0.35        │     │ any frame→pos    │     │ ±12 frame blend  │
│ every 10th frame│     │  stored in JSON  │     │ clamp to [0,1]   │     │ → CSS objectPos  │
└─────────────────┘     └──────────────────┘     └──────────────────┘     └──────────────────┘

Stages 1–2 run once in a browser tool (/detect) and produce a JSON file. Stages 3–4 run every frame inside the Remotion render loop.


Stage 1: MediaPipe Face Detection

Initializing the Runtime

With the internals understood, the initialization code reads differently now:

import { FaceDetector, FilesetResolver } from "@mediapipe/tasks-vision";
 
// Step 1: Download and compile the WASM module + JS glue
// This fetches vision_wasm_internal.wasm (~5 MB) and instantiates it in the browser's WASM VM.
// The resulting `vision` object is a handle to the initialized WASM module instance.
const vision = await FilesetResolver.forVisionTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm"
);
 
// Step 2: Download the TFLite model, load it into WASM linear memory,
// and compile all GPU-delegated ops to GLSL shaders (one-time cost).
// After this call, `detector` holds a handle to a fully initialized inference session.
const detector = await FaceDetector.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath:
      "https://storage.googleapis.com/mediapipe-models/face_detector/blaze_face_short_range/float16/1/blaze_face_short_range.tflite",
    delegate: "GPU", // Use WebGL shader execution for Conv2D ops
  },
  runningMode: "IMAGE", // Process one frame at a time; no internal video buffering
});

Why runningMode: "IMAGE" and not "VIDEO"?

MediaPipe's "VIDEO" mode accepts a timestamp and maintains an internal temporal filter — it expects frames arriving in chronological order at real-time speed. Our seek-and-detect approach jumps arbitrarily across the video timeline (frame 0, 10, 20, 30...). "VIDEO" mode's temporal state would corrupt with out-of-order seeks. "IMAGE" mode is stateless — each call is completely independent, which is exactly what we need.

What MediaPipe Returns

Each call to detector.detect(canvas) produces:

interface Detection {
  boundingBox?: {
    originX: number; // top-left x in pixels (in canvas space)
    originY: number; // top-left y in pixels (in canvas space)
    width: number; // bounding box width in pixels
    height: number; // bounding box height in pixels
  };
  categories: {
    categoryName: string; // Always "Face" for face detection
    score: number; // Confidence score in [0, 1]
  }[];
  keypoints?: {
    // 6 facial keypoints: both eyes, nose tip, mouth center, both ear tragions
    x: number; // normalized [0,1]
    y: number; // normalized [0,1]
    label?: string;
  }[];
}

The boundingBox coordinates are in the canvas's pixel space (0 to canvas.width, 0 to canvas.height). The Tasks API has already denormalized them from BlazeFace's internal [0,1] output space. We immediately renormalize:

const detections = detector.detect(canvas).detections;
 
if (detections.length > 0) {
  const bb = detections[0]!.boundingBox!;
  // Re-normalize to [0,1]: makes coordinates resolution-independent
  const x = (bb.originX + bb.width / 2) / canvas.width;
  const y = (bb.originY + bb.height / 2) / canvas.height;
}

Why normalize? We detect on a 640×360 canvas for speed, but our outputs will be rendered at 1080×1920, 1080×1080, or 1080×1350. Normalized coordinates (0.5, 0.35) always mean "centered horizontally, upper-third vertically" regardless of which canvas they're applied to. If we stored raw pixel coordinates, we'd need a conversion factor for every output format.
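As a small illustration of that resolution independence, the sketch below normalizes a bounding-box center on the detection canvas and maps it onto an output canvas. The helper names and the `PixelBox` type are assumptions for this example; only the field names mirror the boundingBox shape shown above.

```typescript
interface PixelBox {
  originX: number;
  originY: number;
  width: number;
  height: number;
}

interface Point2D {
  x: number;
  y: number;
}

// Center of the box as a fraction of the canvas it was detected on.
function normalizedCenter(
  bb: PixelBox,
  canvasWidth: number,
  canvasHeight: number
): Point2D {
  return {
    x: (bb.originX + bb.width / 2) / canvasWidth,
    y: (bb.originY + bb.height / 2) / canvasHeight,
  };
}

// Map a normalized point onto any target canvas.
function toPixels(p: Point2D, width: number, height: number): Point2D {
  return { x: p.x * width, y: p.y * height };
}
```

The same normalized point can be passed to `toPixels` with 1080×1920, 1080×1080, or 1080×1350 without storing a conversion factor per format.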

The Seek-and-Detect Loop

The core detection loop is a careful balance between coverage and speed:

const SOURCE_FPS = 30;
const SAMPLE_INTERVAL = 10; // Detect every 10th frame (~3 Hz sampling rate)
const CANVAS_WIDTH = 640;
const CANVAS_HEIGHT = 360;
 
const canvas = document.createElement("canvas");
canvas.width = CANVAS_WIDTH;
canvas.height = CANVAS_HEIGHT;
const ctx = canvas.getContext("2d")!;
 
const totalFrames = Math.floor(video.duration * SOURCE_FPS);
const numSamples = Math.floor(totalFrames / SAMPLE_INTERVAL);
 
const rawSamples: { frame: number; x: number; y: number }[] = [];
let lastX = 0.5; // Default: center
let lastY = 0.35; // Default: upper-center (typical face position)
 
for (let i = 0; i < numSamples; i++) {
  const frameNum = i * SAMPLE_INTERVAL;
  const time = frameNum / SOURCE_FPS;
 
  // Seek the video element to the target timestamp.
  // Setting HTMLVideoElement.currentTime starts an asynchronous seek in the
  // browser's media pipeline; the 'seeked' event fires when the frame is ready.
  // Register the listener before assigning currentTime so the event cannot be missed.
  await new Promise<void>((resolve) => {
    video.addEventListener("seeked", () => resolve(), { once: true });
    video.currentTime = time;
  });
 
  // Draw the current video frame to our offscreen canvas.
  // ctx.drawImage() scales the decoded frame down to 640×360 in one call;
  // whether that path is GPU-accelerated depends on the browser.
  ctx.drawImage(video, 0, 0, CANVAS_WIDTH, CANVAS_HEIGHT);
 
  // Run inference. Internally: canvas → WebGL texture upload →
  // GLSL shader inference → NMS → coordinate decode → JS object.
  const result = detector.detect(canvas);
 
  let x: number, y: number;
  if (result.detections.length > 0) {
    const bb = result.detections[0]!.boundingBox!;
    x = (bb.originX + bb.width / 2) / CANVAS_WIDTH;
    y = (bb.originY + bb.height / 2) / CANVAS_HEIGHT;
  } else {
    // Carry forward: prevents snapping to center on missed frames
    x = lastX;
    y = lastY;
  }
  lastX = x;
  lastY = y;
 
  rawSamples.push({ frame: frameNum, x, y });
 
  // Yield to the browser's event loop every 50 samples.
  // Without this, the long-running loop blocks all UI repaints and
  // event handling, making the browser appear frozen.
  if (i % 50 === 0) {
    await new Promise((r) => setTimeout(r, 0));
    updateProgressBar(i / numSamples);
  }
}

Design decisions, explained:

Sampling every 10th frame (3 Hz): Human head movement in conversation maxes out at roughly 60°/second for fast gestures, but in a talking-head video it's typically 10–20°/second. At 3 Hz sampling, even a fast 0.5-second head swing gets 1–2 detection samples. The linear interpolation in Stage 3 fills in the rest convincingly because the motion is approximately linear over such short intervals.

640×360 canvas: BlazeFace always processes 128×128 internally, so our canvas resolution is irrelevant to model accuracy. The difference is in texture upload time. At 1920×1080, uploading a frame to a WebGL texture takes ~8–12ms. At 640×360 (1/9 the pixels), it's ~1–2ms. Across 4,000 samples, this saves roughly 30–40 seconds total.

Carry-forward on detection miss: When the model fails to detect a face (occlusion, strong motion blur, person looked away), we hold the last known position rather than defaulting to (0.5, 0.5). A jump to center would create a jarring visual "snap" in the final video. Holding position is almost always the right call — if someone briefly turns away, we want the crop to stay where it was, not reset.

Detection time for a 22-minute video: At ~5ms per inference (GPU) + ~3ms seek overhead + ~2ms canvas draw, each sample costs about 10ms. A 22-minute video has 22 × 60 × 30 / 10 = 3,960 samples, so these figures give 3,960 × 10ms ≈ 40 seconds of work. Real runs take longer, because seek latency depends on the distance to the nearest keyframe and is often well above 3ms. Either way this is acceptable as a one-time preprocessing step: the result is a ~100 KB JSON file that never needs regeneration unless the video changes.


Stage 2: EMA Smoothing

Why Raw Detections Are Noisy

MediaPipe's bounding box outputs have intrinsic frame-to-frame jitter even on a completely stationary face. This noise has several sources:

  1. Anchor snapping: BlazeFace predicts offsets from pre-defined anchors. At the boundary between two anchor cells, small input changes can flip which anchor "wins," causing the bounding box to jump by one anchor spacing (~16 pixels at 128×128 resolution, ~80 pixels at 640×360 scale).

  2. Video compression artifacts: H.264/H.265 inter-frame prediction introduces block artifacts and ringing that change the apparent pixel content of the face region frame-to-frame, even when nothing is actually moving.

  3. Lighting gradients: Subtle shifts in ambient light between frames change the activation magnitudes in early Conv2D layers, propagating small perturbations to the final bounding box prediction.

The result: on a stationary head, raw detections might jitter ±10–15 pixels at 640×360 resolution (±1.5–2.3% in normalized coordinates). That sounds small, but at 9:16 aspect ratio, a 2% horizontal shift moves the crop window ~45 pixels — clearly visible.

Exponential Moving Average

We smooth the raw samples with an Exponential Moving Average (EMA) — the simplest and most computationally efficient smoothing filter:

function applyEMA(samples: FaceSample[], alpha: number): FaceSample[] {
  // Initialize with the first sample (no lag at the start)
  let smoothX = samples[0]!.x;
  let smoothY = samples[0]!.y;
 
  return samples.map((s) => {
    // EMA recurrence: s_t = α × x_t + (1-α) × s_{t-1}
    // α controls the trade-off between responsiveness and smoothness:
    //   High α (→ 1): follows current sample closely, less smoothing
    //   Low α  (→ 0): heavy smoothing, slow response to real movement
    smoothX = alpha * s.x + (1 - alpha) * smoothX;
    smoothY = alpha * s.y + (1 - alpha) * smoothY;
    return { frame: s.frame, x: smoothX, y: smoothY };
  });
}
 
const smoothed = applyEMA(rawSamples, 0.35);

Understanding Alpha = 0.35

The EMA formula $s_t = \alpha \cdot x_t + (1-\alpha) \cdot s_{t-1}$ expands to an infinite weighted sum of past observations:

$$s_t = \alpha \cdot x_t + \alpha(1-\alpha) \cdot x_{t-1} + \alpha(1-\alpha)^2 \cdot x_{t-2} + \alpha(1-\alpha)^3 \cdot x_{t-3} + \dots$$

With $\alpha = 0.35$, each past sample's effective weight:

Lag (samples back)   Formula           Weight   Cumulative %
0 (current)          0.35              0.3500   35.0%
1                    0.35 × 0.65       0.2275   57.8%
2                    0.35 × 0.65^2     0.1479   72.5%
3                    0.35 × 0.65^3     0.0961   82.1%
5                    0.35 × 0.65^5     0.0406   92.5%
10                   0.35 × 0.65^10    0.0047   99.1%

The "effective window" is roughly 6 samples, about 2 seconds of video at our 3 Hz sampling rate. This was tuned empirically: $\alpha = 0.2$ made pans feel sluggish and laggy during real head movements, while $\alpha = 0.5$ left too much jitter in slow scenes.
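The weights and the window size can be checked with a few lines of arithmetic. In this sketch the cumulative weight through lag k uses the closed form 1 − (1 − α)^(k+1), and the "effective window" is taken to be the smallest number of samples that carries at least 90% of the total weight. That 90% cutoff is an assumption chosen for illustration, not a definition from this post.

```typescript
// Weight the EMA gives to the sample `lag` steps in the past.
function emaWeight(alpha: number, lag: number): number {
  return alpha * Math.pow(1 - alpha, lag);
}

// Sum of the weights for lags 0..lag (geometric series in closed form).
function emaCumulativeWeight(alpha: number, lag: number): number {
  return 1 - Math.pow(1 - alpha, lag + 1);
}

// Smallest sample count n such that the n most recent samples carry at
// least `coverage` of the total weight: 1 - (1 - alpha)^n >= coverage.
function effectiveWindow(alpha: number, coverage = 0.9): number {
  return Math.ceil(Math.log(1 - coverage) / Math.log(1 - alpha));
}
```

With α = 0.35 and the 90% cutoff, `effectiveWindow` returns 6, in line with the 6-sample figure above. Lowering α to 0.2 widens the window, which matches the sluggish feel described for that setting.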

Why EMA over a simple moving average (SMA)? An SMA over $N$ samples is a box filter in the time domain: every sample inside the window gets equal weight, and a sample drops to zero weight the moment it leaves. Its frequency response is a sinc shape with nulls at multiples of $1/(N \cdot \Delta t)$ Hz and sidelobes in between, so some high-frequency jitter leaks through, and an outlier causes a visible step both when it enters and when it leaves the window. EMA is a single-pole low-pass filter whose response rolls off gradually and monotonically, producing a smoother, more natural motion curve. It also needs $O(1)$ memory vs $O(N)$ for SMA.

The Output: v1_faces.json

{
  "sampleInterval": 10,
  "fps": 30,
  "samples": [
    { "frame": 0,    "x": 0.4773, "y": 0.3208 },
    { "frame": 10,   "x": 0.4795, "y": 0.3227 },
    { "frame": 20,   "x": 0.4775, "y": 0.3298 },
    { "frame": 30,   "x": 0.4815, "y": 0.3327 },
    ...
  ]
}

For a 22-minute video: 22 × 60 × 30 / 10 = 3,960 samples, each a frame index and two floats. The JSON file is ~100 KB — negligible to fetch and small enough to hold entirely in a JS array.
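The code in this post refers to `FaceSample` and `FaceTrackingData` without defining them in this section. The sketch below infers both shapes from the JSON above and adds a hypothetical `parseFaceTrackingData` loader. Its main check is that sample i sits at frame i × sampleInterval, since the lookup in Stage 3 computes array indices directly from the frame number.

```typescript
// Shapes inferred from the JSON example; the real project types may differ.
interface FaceSample {
  frame: number;
  x: number;
  y: number;
}

interface FaceTrackingData {
  sampleInterval: number;
  fps: number;
  samples: FaceSample[];
}

// Parse and sanity-check a tracking file before handing it to the renderer.
function parseFaceTrackingData(json: string): FaceTrackingData {
  const data = JSON.parse(json) as FaceTrackingData;

  if (!Number.isInteger(data.sampleInterval) || data.sampleInterval <= 0) {
    throw new Error("sampleInterval must be a positive integer");
  }
  if (!Number.isFinite(data.fps) || data.fps <= 0) {
    throw new Error("fps must be a positive number");
  }
  if (!Array.isArray(data.samples) || data.samples.length === 0) {
    throw new Error("samples must be a non-empty array");
  }

  data.samples.forEach((s, i) => {
    // Index-based lookup only works if samples are evenly spaced from frame 0.
    if (s.frame !== i * data.sampleInterval) {
      throw new Error(`sample ${i} is not at frame ${i * data.sampleInterval}`);
    }
    if (!Number.isFinite(s.x) || !Number.isFinite(s.y)) {
      throw new Error(`sample ${i} has a non-numeric position`);
    }
  });

  return data;
}
```

Failing fast here is cheaper than discovering a gap in the samples as a sudden jump halfway through a long render.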


Stage 3: Runtime Interpolation

At render time, Remotion calls our component once per frame (frames 0 through 39,599 for a 22-minute video at 30 fps). We have face data only every 10 frames, so at frame 15 we need to estimate where the face is between the detections at frame 10 and frame 20.

Linear Interpolation (Lerp)

export function getFacePositionAtFrame(
  data: FaceTrackingData,
  frame: number
): { x: number; y: number } {
  const { samples, sampleInterval } = data;
 
  // Convert the target frame to a floating-point index into the samples array.
  // frame=15, sampleInterval=10 → idx=1.5 (halfway between samples[1] and samples[2])
  const idx = frame / sampleInterval;
  const lo = Math.floor(idx); // Lower sample index
  const hi = Math.ceil(idx); // Upper sample index
 
  // Boundary clamping: extrapolate as constant at the edges
  if (lo < 0) return { x: samples[0]!.x, y: samples[0]!.y };
  if (lo >= samples.length - 1) {
    const last = samples[samples.length - 1]!;
    return { x: last.x, y: last.y };
  }
 
  const sLo = samples[lo]!; // Face position at the earlier keyframe
  const sHi = samples[hi]!; // Face position at the later keyframe
 
  // Exact hit (frame is a multiple of sampleInterval)
  if (lo === hi) return { x: sLo.x, y: sLo.y };
 
  // t is the fractional position between the two samples: 0.0 at lo, 1.0 at hi
  const t = idx - lo; // For frame=15: t = 1.5 - 1 = 0.5
 
  // Linear interpolation: blend between the two detected positions
  return {
    x: sLo.x + (sHi.x - sLo.x) * t,
    y: sLo.y + (sHi.y - sLo.y) * t,
  };
}

Visualizing What Lerp Does

Sample[1] at frame 10     Sample[2] at frame 20
     x = 0.48 ─────●                    ●───── x = 0.56
                     \                 /
                      \               /
                       ●─●─●─●─●─●─●        ← frames 11–19, linearly interpolated
                       11 12 13 14 15 16 17 18 19

getFacePositionAtFrame(data, 15):
  idx  = 15 / 10 = 1.5
  lo   = 1      → samples[1].x = 0.48
  hi   = 2      → samples[2].x = 0.56
  t    = 1.5 - 1 = 0.5
  x    = 0.48 + (0.56 - 0.48) × 0.5 = 0.48 + 0.04 = 0.52  ✓

Why Not Higher-Order Interpolation?

Cubic splines or Catmull-Rom interpolation would produce smoother curves through the sample points. We explicitly chose not to use them because:

Head motion is not polynomial. A cubic spline "overshoots" between samples when the underlying motion is a piecewise-constant velocity (e.g., the speaker is still, then moves suddenly). The overshoot creates visual artifacts where the crop window swings past the face before correcting. Linear interpolation can't overshoot by construction.

The cosine kernel in Stage 5 provides the perceptual smoothness. Even if the interpolated path has slight kinks at each 10-frame boundary, the 25-tap cosine blend masks them. Adding spline smoothing on top would be double-smoothing with no visible benefit.

O(1) per frame. Lerp is two array lookups and one multiply-add per axis. A spline needs four neighboring samples and a cubic evaluation per axis. Both are cheap, but the render-time kernel calls this function 25 times per frame, so the simplest option is the natural default.
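The overshoot argument is easy to verify numerically. The sketch below uses the standard uniform Catmull-Rom formula on a made-up "still, then moves" sequence (0.5, 0.5, 0.5, 0.6) and compares it with lerp on the segment where the speaker has not moved yet. The sample values are invented for illustration.

```typescript
// Uniform Catmull-Rom spline: value between p1 and p2 at t in [0, 1].
function catmullRom(
  p0: number,
  p1: number,
  p2: number,
  p3: number,
  t: number
): number {
  const t2 = t * t;
  const t3 = t2 * t;
  return (
    0.5 *
    (2 * p1 +
      (-p0 + p2) * t +
      (2 * p0 - 5 * p1 + 4 * p2 - p3) * t2 +
      (-p0 + 3 * p1 - 3 * p2 + p3) * t3)
  );
}

function lerp(a: number, b: number, t: number): number {
  return a + (b - a) * t;
}

// Speaker is still at x = 0.5 for three samples, then jumps to 0.6.
// Evaluate the segment between the 2nd and 3rd samples (both 0.5).
const splineMid = catmullRom(0.5, 0.5, 0.5, 0.6, 2 / 3);
const lerpMid = lerp(0.5, 0.5, 2 / 3);
```

`lerpMid` stays exactly at 0.5, while `splineMid` dips slightly below 0.5: the spline moves the crop away from the face before the face has moved, because the upcoming jump bends the curve in the preceding segment.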

Clamping to [0, 1]

export function clampPosition(
  pos: { x: number; y: number },
  min = 0,
  max = 1
): { x: number; y: number } {
  return {
    x: Math.max(min, Math.min(max, pos.x)),
    y: Math.max(min, Math.min(max, pos.y)),
  };
}

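Clamping is a safety net. A normalized weighted average of values in [0,1] stays in [0,1], so the kernel blend in Stage 5 cannot leave the range on its own. An input sample can, though: a bounding box that extends well past the frame edge, or hand-edited tracking data, may put a center slightly outside [0,1]. Without clamping, objectPosition: "102%" would shift the video past the container edge and expose an uncovered strip along one side.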


Stage 4: The CSS Trick — How Panning Actually Works

The actual panning requires zero canvas manipulation, no WebGL compositing, and no manual pixel math. We use two CSS properties: objectFit: "cover" and objectPosition.

objectFit: cover — The Zoom Foundation

objectFit: cover instructs the browser's video rendering engine to scale the video such that it completely fills the container element, preserving aspect ratio and cropping the overflow. For a 16:9 video in a 9:16 container (1080×1920):

Source video:          1920 × 1080  (aspect ratio = 1.778)
Container:             1080 × 1920  (aspect ratio = 0.5625)

Cover scaling rules:
  To fill width:   scale = 1080 / 1920 = 0.5625  → video becomes 1080 × 607  (too short)
  To fill height:  scale = 1920 / 1080 = 1.7778  → video becomes 3413 × 1920  ✓

Browser picks the larger scale (1.7778) to ensure no uncovered area.
Result: a 3413×1920 virtual canvas, with a 1080×1920 viewport window.

The browser's video compositor handles this scaling entirely in GPU hardware — it's a single matrix transform on the video texture. No pixel data is moved in JavaScript.
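The cover arithmetic above generalizes to any source and container size. The following sketch reproduces it as a pure function; `coverGeometry` is a hypothetical helper written for this explanation, not a browser or Remotion API.

```typescript
interface CoverGeometry {
  scale: number;
  virtualWidth: number;
  virtualHeight: number;
  overflowX: number; // horizontal pan range in px
  overflowY: number; // vertical pan range in px
}

// objectFit: cover picks the larger of the two fit scales so that
// the scaled video covers the container on both axes.
function coverGeometry(
  srcWidth: number,
  srcHeight: number,
  boxWidth: number,
  boxHeight: number
): CoverGeometry {
  const scale = Math.max(boxWidth / srcWidth, boxHeight / srcHeight);
  const virtualWidth = srcWidth * scale;
  const virtualHeight = srcHeight * scale;
  return {
    scale,
    virtualWidth,
    virtualHeight,
    overflowX: virtualWidth - boxWidth,
    overflowY: virtualHeight - boxHeight,
  };
}
```

For a 1920×1080 source in a 1080×1920 container this gives a virtual width of about 3413 px and a horizontal overflow of about 2333 px, with no vertical overflow.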

objectPosition — The Pan Control

With objectFit: cover active, objectPosition: "X% Y%" specifies which part of the virtual canvas is visible through the viewport window.

The X% value maps like this: 0% shows the leftmost 1080 pixels, 100% shows the rightmost 1080 pixels, and 50% shows the center 1080 pixels of the 3413-pixel-wide virtual canvas.

Virtual video width:   3413 px
Viewport width:        1080 px
Overflow (pan range):  3413 - 1080 = 2333 px

objectPosition X%:  left offset = X/100 × 2333 px
  0%   → offset = 0    → shows pixels [0,    1080)  (leftmost)
  50%  → offset = 1167 → shows pixels [1167, 2247)  (center)
  100% → offset = 2333 → shows pixels [2333, 3413)  (rightmost)

So our normalized face position x ∈ [0,1] maps directly to objectPosition: "${x * 100}%". When the face is at x=0.3, the viewport slides to show the left-center portion of the virtual canvas, keeping the face in frame.

┌──────────────────────────────────────────────┐  Virtual: 3413×1920
│                                              │
│   ┌──────────┐                               │
│   │ Face at  │      ┌──────────┐             │
│   │ x=0.3   │      │ Face at  │             │
│   │          │      │ x=0.7   │             │
│   │   ◎      │      │    ◎     │             │
│   └──────────┘      └──────────┘             │
│   Viewport:                                  │
│   objectPosition: "30%"  → "70%"             │
└──────────────────────────────────────────────┘
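One consequence of this mapping is worth spelling out. A percentage in objectPosition lines up the point X% across the video with the point X% across the container, so a face at x = 0.3 lands 30% of the way across the viewport rather than dead center. The sketch below works through that arithmetic; the helper names are assumptions for illustration.

```typescript
// Left edge of the visible window inside the virtual (scaled) video,
// for a given objectPosition percentage.
function cropOffsetPx(positionPct: number, overflowPx: number): number {
  return (positionPct / 100) * overflowPx;
}

// Where a face at normalized x ends up inside the viewport (in px)
// when objectPosition is set to x * 100%.
function faceXInViewport(
  faceX: number,
  virtualWidth: number,
  viewportWidth: number
): number {
  const overflow = virtualWidth - viewportWidth;
  const faceInVirtual = faceX * virtualWidth;
  const windowLeft = cropOffsetPx(faceX * 100, overflow);
  return faceInVirtual - windowLeft; // simplifies to faceX * viewportWidth
}
```

The face is always inside the viewport and sits at the same relative position as in the source frame. That keeps a face near the edge of the source near the edge of the crop, which avoids panning beyond the available overflow.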

Pan Ranges by Aspect Ratio

Output Format   Canvas       Virtual Video Size   Horizontal Pan Range
9:16            1080×1920    3413×1920            2333 px (±34%)
1:1             1080×1080    1920×1080            840 px (±22%)
4:5             1080×1350    2400×1350            1320 px (±28%)

(The ± figure is half the pan range as a share of the virtual video width.)

For 9:16 content there's enormous pan range — more than enough to track a speaker walking across a stage. For 1:1, the range is tighter, but talking-head content rarely needs more than ±15% horizontal movement.

The CSS in Practice

const objX = `${(x * 100).toFixed(2)}%`;
const objY = `${(y * 100).toFixed(2)}%`;
 
<Video
  src={staticFile("v1.mp4")}
  style={{
    width: "100%",
    height: "100%",
    objectFit: "cover",
    objectPosition: `${objX} ${objY}`,
  }}
/>;

The .toFixed(2) keeps the CSS value to 2 decimal places. Percentages in objectPosition are taken relative to the overflow, so one step of 0.01% moves the 9:16 crop by 0.01% × 2333px ≈ 0.23px, well below any perceptible threshold.


Stage 5: Cosine-Weighted Kernel Smoothing (Runtime)

After EMA smoothing (Stage 2) and linear interpolation (Stage 3), the positional data is mathematically smooth but can still feel "sticky" — the crop window tracks the face's every micro-movement. What we want is the feel of a camera operator on a fluid-head tripod: anticipatory, not reactive; the crop leads slightly, doesn't chase.

We achieve this with a second smoothing pass at render time: a cosine-weighted temporal kernel.

Why a Second Smoothing Pass?

EMA smoothing (Stage 2) operates on the raw detection noise: it removes jitter between adjacent 10-frame samples. The cosine kernel (Stage 5) addresses a different problem: the interpolated frame-by-frame position path still has "perfect tracking" fidelity, following every micromovement. The kernel is symmetric around the current frame, so it adds no net lag. It averages equal amounts of lookbehind and lookahead, which makes the crop feel camera-operated rather than computer-driven.

These two smoothing operations are complementary, not redundant.

Building the Cosine Kernel

const SMOOTH_WINDOW = 12; // ±12 frames = ~400ms lookahead/lookbehind at 30fps
const kernelSize = SMOOTH_WINDOW * 2 + 1; // 25 total taps
 
const kernel = useMemo(() => {
  const weights: number[] = [];
 
  for (let i = -SMOOTH_WINDOW; i <= SMOOTH_WINDOW; i++) {
    // cos(0) = 1.0 at center (i=0), cos(π/2) = 0.0 at edges (i=±12)
    // This creates a smooth bell shape that naturally reaches zero at the boundaries
    weights.push(Math.cos((i / SMOOTH_WINDOW) * (Math.PI / 2)));
  }
 
  // Normalize: divide by sum so weights sum to exactly 1.0
  // This ensures the blended position is a true weighted average,
  // not scaled up or down by the kernel's total magnitude.
  const sum = weights.reduce((a, b) => a + b, 0);
  return weights.map((w) => w / sum);
}, []); // Empty deps: kernel shape never changes for the entire video

The resulting kernel (normalized weights for 25 taps, rounded to 3 decimals):

Tap offset:    -12   -11   -10    -9    -8    -7    -6    -5    -4    -3    -2    -1     0
Weight:      0.000 0.009 0.017 0.025 0.033 0.040 0.046 0.052 0.057 0.061 0.063 0.065 0.066
               ↑                                                                       ↑
            edge=0                                                               peak at center

Tap offset:     +1    +2    +3    +4    +5    +6    +7    +8    +9   +10   +11   +12
Weight:      0.065 0.063 0.061 0.057 0.052 0.046 0.040 0.033 0.025 0.017 0.009 0.000

This is a half-cosine window (also called the cosine or sine window), a close relative of the "Hann window" used in signal processing for spectral analysis; Hann is the square of this shape. It suppresses high-frequency content (fast position changes) while preserving low-frequency content (slow, intentional movement), with much lower sidelobes than a plain box average because the weights taper to zero at ±12 instead of cutting off abruptly.

Applying the Kernel

const { x, y } = useMemo(() => {
  if (reframeTransition === "instant") {
    // Skip kernel: use raw interpolated position directly
    return clampPosition(getFacePositionAtFrame(faceTrackingData, sourceFrame));
  }
 
  // Blend: for each kernel tap, sample the face position at that offset frame,
  // weight it by the kernel coefficient, and accumulate.
  let sumX = 0;
  let sumY = 0;
 
  for (let i = 0; i < kernelSize; i++) {
    const tapFrame = sourceFrame + (i - SMOOTH_WINDOW); // −12 to +12 offset
    const pos = getFacePositionAtFrame(faceTrackingData, tapFrame);
    sumX += pos.x * kernel[i]!;
    sumY += pos.y * kernel[i]!;
  }
 
  return clampPosition({ x: sumX, y: sumY });
}, [sourceFrame, faceTrackingData, kernel, reframeTransition]);

At frame 300, this blends positions from frames 288 through 312 — a ±400ms window around the current moment. The result: the crop window "knows" where the face is heading (future taps) and where it came from (past taps), producing smooth anticipatory tracking without explicit velocity prediction.
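The lookahead is easiest to see on an abrupt move. In the sketch below, faceXAt is a hypothetical stand-in for getFacePositionAtFrame(): the face sits at x = 0.3 and is at x = 0.7 from frame 300 onward. Everything else follows the blend loop above.

```typescript
const SMOOTH_WINDOW = 12;
const kernelSize = 2 * SMOOTH_WINDOW + 1;

// Same normalized cosine weights as in the article, built without React.
const rawWeights = Array.from({ length: kernelSize }, (_, i) =>
  Math.cos(((i - SMOOTH_WINDOW) / SMOOTH_WINDOW) * (Math.PI / 2))
);
const weightSum = rawWeights.reduce((a, b) => a + b, 0);
const kernel = rawWeights.map((w) => w / weightSum);

// Hypothetical position source: a step from 0.3 to 0.7 at frame 300.
const faceXAt = (frame: number): number => (frame < 300 ? 0.3 : 0.7);

// Blend the 25 taps centered on sourceFrame, exactly like the loop above.
function smoothedX(sourceFrame: number): number {
  let sum = 0;
  for (let i = 0; i < kernelSize; i++) {
    sum += faceXAt(sourceFrame + (i - SMOOTH_WINDOW)) * kernel[i]!;
  }
  return sum;
}

const before = smoothedX(287); // every tap is before frame 300, so still 0.3
const early = smoothedX(294);  // future taps already pull the crop toward 0.7
const mid = smoothedX(300);    // roughly halfway through the move
const after = smoothedX(312);  // every tap is at or after frame 300, so 0.7

console.log({ before, early, mid, after });
```

The crop starts moving several frames before the face does and settles several frames after, which is the anticipatory feel described above. A one-sided filter such as the EMA can only react after the fact.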

Performance: 25 calls to getFacePositionAtFrame() per frame, each O(1). At 30 fps, that's 750 interpolation lookups per second of video, a negligible cost next to decoding the video frame itself. useMemo recomputes the blend only when one of its dependencies changes (in practice, sourceFrame), not on every React re-render.

Smooth vs. Instant

Mode      Description                           Best For
smooth    25-tap cosine blend, ±400ms window    Dialogue, interviews, presentations
instant   Raw linear interpolation, no kernel   Fast cuts, action clips, music videos

The instant mode is useful when auto-reframe needs to "snap" between cuts. The cosine kernel would otherwise blend face positions from both sides of a hard cut, so the crop would start drifting toward the next shot's framing before the cut and keep settling after it.


Stage 6: The Full Component Architecture

Video Bounds Helper

Before assembling components, we need getVideoBounds() — a utility that computes where the video sits within the composition canvas. This matters for positioning captions and the cinematic letterbox overlay relative to the video area, not the raw canvas.

const SOURCE_ASPECT = 16 / 9;
 
export function getVideoBounds(
  compWidth: number,
  compHeight: number,
  coverMode: boolean = false
): {
  videoWidth: number;
  videoHeight: number;
  offsetX: number;
  offsetY: number;
} {
  if (coverMode) {
    // In cover mode, the video fills the entire canvas — no offsets, no bars
    return {
      videoWidth: compWidth,
      videoHeight: compHeight,
      offsetX: 0,
      offsetY: 0,
    };
  }
 
  // Contain mode: fit video within canvas with letterbox or pillarbox bars
  const compAspect = compWidth / compHeight;
  let videoWidth: number, videoHeight: number;
 
  if (compAspect > SOURCE_ASPECT) {
    // Canvas is wider than 16:9 → pillarbox (bars on left and right)
    videoHeight = compHeight;
    videoWidth = compHeight * SOURCE_ASPECT;
  } else {
    // Canvas is narrower/taller than 16:9 → letterbox (bars on top and bottom)
    videoWidth = compWidth;
    videoHeight = compWidth / SOURCE_ASPECT;
  }
 
  return {
    videoWidth,
    videoHeight,
    offsetX: (compWidth - videoWidth) / 2,
    offsetY: (compHeight - videoHeight) / 2,
  };
}
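As a quick sanity check, the sketch below evaluates the contain-mode math for three of the canvas sizes used later in this post. containBounds is a condensed restatement of the contain branch written for this example, so the snippet runs on its own. It is not a replacement for getVideoBounds().

```typescript
const SOURCE_ASPECT = 16 / 9;

interface Bounds {
  videoWidth: number;
  videoHeight: number;
  offsetX: number;
  offsetY: number;
}

// Condensed restatement of the contain branch of getVideoBounds().
function containBounds(compWidth: number, compHeight: number): Bounds {
  const canvasIsWider = compWidth / compHeight > SOURCE_ASPECT;
  const videoWidth = canvasIsWider ? compHeight * SOURCE_ASPECT : compWidth;
  const videoHeight = canvasIsWider ? compHeight : compWidth / SOURCE_ASPECT;
  return {
    videoWidth,
    videoHeight,
    offsetX: (compWidth - videoWidth) / 2,
    offsetY: (compHeight - videoHeight) / 2,
  };
}

// 9:16 canvas: the video is 1080 x 607.5 with 656.25 px bars above and below.
const portrait = containBounds(1080, 1920);

// 1:1 canvas: same 1080 x 607.5 video, 236.25 px bars above and below.
const square = containBounds(1080, 1080);

// 16:9 canvas: the video fills the canvas, so both offsets are 0.
const landscape = containBounds(1920, 1080);

console.log({ portrait, square, landscape });
```

In the portrait case the video occupies less than a third of the canvas height, which is the wasted space that cover mode plus auto-reframe reclaims.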

VideoComposition — The Gate

const VideoComposition: React.FC<VideoCompositionProps> = ({
  autoReframe,
  reframeTransition,
  faceTrackingData,
  trimStartFrame,
  trimEndFrame,
}) => {
  // Auto-reframe is active only when BOTH conditions are met:
  //   1. User toggled it on in settings
  //   2. Face tracking data was successfully loaded (null = /detect not run yet)
  const reframeActive = autoReframe && faceTrackingData != null;
 
  return (
    <AbsoluteFill style={{ backgroundColor: "black" }}>
      {reframeActive ? (
        // Cover mode: video fills canvas, panned by objectPosition
        <AutoReframeVideo
          faceTrackingData={faceTrackingData!}
          reframeTransition={reframeTransition}
          trimStartFrame={trimStartFrame}
        />
      ) : (
        // Contain mode: video letterboxed/pillarboxed, centered
        <Video
          src={staticFile("v1.mp4")}
          style={{ width: "100%", height: "100%", objectFit: "contain" }}
        />
      )}
      {/* Overlay components receive coverMode to adjust their positioning */}
      <CinematicFrameOverlay coverMode={reframeActive} />
      <CaptionRenderer coverMode={reframeActive} />
    </AbsoluteFill>
  );
};

The gate pattern autoReframe && faceTrackingData != null provides graceful degradation: if the user hasn't run the /detect tool, or if v1_faces.json fails to load, the video renders as a normal centered clip with no errors.

AutoReframeVideo — The Engine

const AutoReframeVideo: React.FC<AutoReframeVideoProps> = ({
  faceTrackingData,
  reframeTransition = "smooth",
  trimStartFrame = 0,
}) => {
  const frame = useCurrentFrame(); // Remotion: current frame in the composition
  // Map composition frame → source video frame (accounting for trim)
  const sourceFrame = frame + trimStartFrame;
 
  const kernel = useMemo(() => {
    /* cosine weights, computed once */
  }, []);
 
  const { x, y } = useMemo(() => {
    if (reframeTransition === "instant") {
      return clampPosition(
        getFacePositionAtFrame(faceTrackingData, sourceFrame)
      );
    }
    let sx = 0,
      sy = 0;
    for (let i = 0; i < kernelSize; i++) {
      const pos = getFacePositionAtFrame(
        faceTrackingData,
        sourceFrame + (i - SMOOTH_WINDOW)
      );
      sx += pos.x * kernel[i]!;
      sy += pos.y * kernel[i]!;
    }
    return clampPosition({ x: sx, y: sy });
  }, [sourceFrame, faceTrackingData, kernel, reframeTransition]);
 
  const objX = `${(x * 100).toFixed(2)}%`;
  const objY = `${(y * 100).toFixed(2)}%`;
 
  return (
    // Sequence with negative offset: tells Remotion that frame 0 of the composition
    // corresponds to trimStartFrame in the source video
    <Sequence from={-trimStartFrame} layout="none">
      <Video
        src={staticFile("v1.mp4")}
        style={{
          width: "100%",
          height: "100%",
          objectFit: "cover",
          objectPosition: `${objX} ${objY}`,
        }}
      />
    </Sequence>
  );
};
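It can help to see what the browser does with objectFit: "cover" and a percentage objectPosition. Per the CSS rule for percentage positions, the offset along an axis is (box size − scaled video size) × percentage. The sketch below reproduces that arithmetic. The function names coverSize and coverOffset are made up for this illustration.

```typescript
interface Size {
  width: number;
  height: number;
}

// Size of the source after object-fit: cover scales it to fill the box.
function coverSize(source: Size, box: Size): Size {
  const scale = Math.max(box.width / source.width, box.height / source.height);
  return { width: source.width * scale, height: source.height * scale };
}

// Pixel offset of the scaled video's top-left corner for an objectPosition
// of (x, y), both given as fractions in [0, 1].
function coverOffset(source: Size, box: Size, x: number, y: number) {
  const scaled = coverSize(source, box);
  return {
    scaled,
    left: (box.width - scaled.width) * x,
    top: (box.height - scaled.height) * y,
  };
}

const source: Size = { width: 1920, height: 1080 };
const portraitBox: Size = { width: 1080, height: 1920 };

// Centered: the scaled video is about 3413 x 1920 and overflows equally on both sides.
const centered = coverOffset(source, portraitBox, 0.5, 0.5);

// x = 0: the left edge of the video is flush with the left edge of the canvas.
const panLeft = coverOffset(source, portraitBox, 0, 0.5);

// Where a face at normalized source position x = 0.3 lands on the canvas (in px).
const faceX = 0.3;
const framed = coverOffset(source, portraitBox, faceX, 0.5);
const faceOnCanvas = framed.left + faceX * framed.scaled.width;

console.log({ centered, panLeft, faceOnCanvas });
```

One consequence of this mapping, before any clamping: when objectPosition is set to the face's own normalized position, a face at x in the source ends up x of the way across the canvas. It therefore always stays inside the crop, and it is exactly centered when x = 0.5.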

Aspect Ratio Compositions

In Root.tsx, we register four Remotion compositions — one per output format. All share the same VideoComposition component; the canvas dimensions are what change:

<Composition id="VideoWithSubtitles" width={1920} height={1080} component={VideoComposition} ... /> // 16:9
<Composition id="VideoPortrait"      width={1080} height={1920} component={VideoComposition} ... /> // 9:16
<Composition id="VideoSquare"        width={1080} height={1080} component={VideoComposition} ... /> // 1:1
<Composition id="Video4x5"           width={1080} height={1350} component={VideoComposition} ... /> // 4:5

Remotion calls calculateMetadata before the composition renders, which makes it a convenient place to fetch the face-tracking data:

const calculateMetadata: CalculateMetadataFunction<
  VideoCompositionProps
> = async ({ props }) => {
  let faceTrackingData = null;
  try {
    const faceRes = await fetch(staticFile("v1_faces.json"));
    if (faceRes.ok) faceTrackingData = await faceRes.json();
  } catch {
    /* gracefully omit if not present */
  }
 
  return { props: { ...props, faceTrackingData } };
};
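Since the JSON comes from disk and is cast straight into props, a runtime shape check is a cheap safeguard. The sketch below is one possible guard, assuming the file shape shown in the data-flow diagram (sampleInterval, fps, and samples of { frame, x, y }). The type names and the rule that an empty samples array counts as invalid are assumptions for illustration.

```typescript
interface FaceSample {
  frame: number;
  x: number;
  y: number;
}

interface FaceTrackingData {
  sampleInterval: number;
  fps: number;
  samples: FaceSample[];
}

const isFiniteNumber = (v: unknown): v is number =>
  typeof v === "number" && Number.isFinite(v);

function isFaceSample(value: unknown): value is FaceSample {
  if (typeof value !== "object" || value === null) return false;
  const { frame, x, y } = value as Record<string, unknown>;
  return isFiniteNumber(frame) && isFiniteNumber(x) && isFiniteNumber(y);
}

// Returns true only for objects that look like a complete tracking file.
function isFaceTrackingData(value: unknown): value is FaceTrackingData {
  if (typeof value !== "object" || value === null) return false;
  const { sampleInterval, fps, samples } = value as Record<string, unknown>;
  if (!isFiniteNumber(sampleInterval) || sampleInterval <= 0) return false;
  if (!isFiniteNumber(fps) || fps <= 0) return false;
  // Assumption: an empty samples array is treated as "no data".
  if (!Array.isArray(samples) || samples.length === 0) return false;
  return samples.every(isFaceSample);
}

const good: unknown = {
  sampleInterval: 10,
  fps: 30,
  samples: [{ frame: 0, x: 0.5, y: 0.4 }],
};
const truncated: unknown = { sampleInterval: 10, fps: 30, samples: [{ frame: 0 }] };

console.log(isFaceTrackingData(good), isFaceTrackingData(truncated));
```

Inside calculateMetadata, the parsed JSON would be kept only when the guard passes and left as null otherwise, so a malformed file falls back to the same centered rendering as a missing one.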

State Management with Zustand

export const usePlayerStore = create<PlayerState>((set) => ({
  aspectRatio: "16:9",
  autoReframe: false,
  reframeTransition: "smooth",
}));
 
export const ASPECT_DIMENSIONS = {
  "16:9": { width: 1920, height: 1080 },
  "9:16": { width: 1080, height: 1920 },
  "1:1": { width: 1080, height: 1080 },
  "4:5": { width: 1080, height: 1350 },
} as const;

The SettingsPanel UI:

  • Disables the auto-reframe toggle when aspectRatio === "16:9" (no crop needed for native format)
  • Shows the "Smooth / Instant" mode selector only when autoReframe === true
  • Disables settings when faceTrackingData === null with a tooltip: "Run /detect first"
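The three rules above can be expressed as one pure function, which keeps the panel component free of branching and makes the rules easy to unit test. This is a sketch: deriveSettingsUi and its field names are hypothetical, and hiding the mode selector whenever the toggle is disabled is an assumption on top of the rules listed.

```typescript
type AspectRatio = "16:9" | "9:16" | "1:1" | "4:5";

interface SettingsInput {
  aspectRatio: AspectRatio;
  autoReframe: boolean;
  hasFaceData: boolean; // true when faceTrackingData !== null
}

interface SettingsUi {
  toggleDisabled: boolean;
  toggleTooltip: string | null;
  showTransitionSelector: boolean;
}

function deriveSettingsUi(input: SettingsInput): SettingsUi {
  const { aspectRatio, autoReframe, hasFaceData } = input;

  // No tracking data yet: everything is disabled and the tooltip explains why.
  if (!hasFaceData) {
    return {
      toggleDisabled: true,
      toggleTooltip: "Run /detect first",
      showTransitionSelector: false,
    };
  }

  // Native 16:9 output needs no crop, so the toggle is disabled.
  if (aspectRatio === "16:9") {
    return { toggleDisabled: true, toggleTooltip: null, showTransitionSelector: false };
  }

  // Otherwise the toggle is live and the mode selector follows it.
  return {
    toggleDisabled: false,
    toggleTooltip: null,
    showTransitionSelector: autoReframe,
  };
}

const noData = deriveSettingsUi({ aspectRatio: "9:16", autoReframe: true, hasFaceData: false });
const native = deriveSettingsUi({ aspectRatio: "16:9", autoReframe: true, hasFaceData: true });
const active = deriveSettingsUi({ aspectRatio: "9:16", autoReframe: true, hasFaceData: true });

console.log({ noData, native, active });
```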

Edge Cases and Design Decisions

Why Not runningMode: "VIDEO"?

MediaPipe's video mode accepts a timestamp alongside each frame and internally maintains temporal state between frames. In theory, this could provide smoother detections by filtering across frames at the model level. In practice, it assumes frames arrive chronologically and at real-time speed. Our seek-and-detect loop jumps from frame 0 to frame 10 to frame 20 instantaneously, so the video mode's temporal filter would accumulate stale state and produce corrupted bounding boxes. "IMAGE" mode's statelessness is a feature here, not a limitation.

What Happens When the Face Leaves Frame?

If the speaker walks off-screen or turns completely away, result.detections.length === 0. Our carry-forward logic holds the last known position. In practice, the carry-forward is almost never needed for more than 2–3 samples (roughly 0.7–1 second at 3 Hz) before the face returns or a cut occurs. The worst case, someone standing up or walking away permanently, still degrades gracefully: the crop stays where the face last was, which is usually an acceptable framing even without a face.

What About Multiple Faces?

We track detections[0] — the highest-confidence detection. For multi-speaker content, the right approach would be:

  1. Run speaker diarization (e.g., pyannote-audio) to get per-speaker timestamps
  2. During each detected segment, choose the bounding box closest to the known speaker's last position

This is noted in our arch.md as a future enhancement.
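The "closest bounding box" step could look roughly like the sketch below. It covers only the geometric selection, not diarization. The Box fields mirror the boundingBox shape MediaPipe returns (originX, originY, width, height), while pickClosestBox and the sample coordinates are hypothetical.

```typescript
interface Box {
  originX: number;
  originY: number;
  width: number;
  height: number;
}

interface Point {
  x: number;
  y: number;
}

const boxCenter = (b: Box): Point => ({
  x: b.originX + b.width / 2,
  y: b.originY + b.height / 2,
});

// Return the box whose center is nearest to `last`, or null when there are no boxes.
function pickClosestBox(boxes: Box[], last: Point): Box | null {
  let best: Box | null = null;
  let bestDist = Infinity;
  for (const box of boxes) {
    const c = boxCenter(box);
    const dist = Math.hypot(c.x - last.x, c.y - last.y);
    if (dist < bestDist) {
      best = box;
      bestDist = dist;
    }
  }
  return best;
}

// Two faces in a 640x360 detection canvas; the tracked speaker was last seen on the right.
const host: Box = { originX: 60, originY: 80, width: 80, height: 80 };   // center (100, 120)
const guest: Box = { originX: 460, originY: 90, width: 80, height: 80 }; // center (500, 130)
const lastSpeakerPosition: Point = { x: 480, y: 125 };

const chosen = pickClosestBox([host, guest], lastSpeakerPosition);
console.log(chosen === guest); // true: the guest's box is nearer to the last position
```

Returning null for an empty list lets the caller reuse the existing carry-forward behavior when no face is detected.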

GPU Fallback

delegate: "GPU" will throw if the browser has no WebGL support or if the WebGL context fails. For production, add:

let detector: FaceDetector;
try {
  detector = await FaceDetector.createFromOptions(vision, {
    baseOptions: { modelAssetPath, delegate: "GPU" },
    runningMode: "IMAGE",
  });
} catch {
  // CPU fallback: ~3-5x slower, works in headless environments
  detector = await FaceDetector.createFromOptions(vision, {
    baseOptions: { modelAssetPath, delegate: "CPU" },
    runningMode: "IMAGE",
  });
}

Server-Side Detection?

We considered running detection server-side with tfjs-node + node-canvas and the TFLite delegate. Rejected because:

  • The browser's WebGL pipeline is faster than node-canvas CPU inference for this model
  • Keeping detection client-side means the server is stateless — no GPU instance needed
  • Users get real-time progress feedback and can abort mid-video
  • The ~100 KB output JSON file is trivially small to persist anywhere

The Full Data Flow

┌─────────────────────────────────────────────────────────────────────┐
│                    /detect (browser, one-time)                       │
│                                                                     │
│  v1.mp4                                                             │
│    │                                                                │
│    ├─ seek to frame 0, 10, 20, ... N                                │
│    │                                                                │
│    └─▶ ctx.drawImage(video, 0, 0, 640, 360)                         │
│              │                                                      │
│              └─▶ MediaPipe FaceDetector.detect(canvas)              │
│                    │                                                │
│                    ├─ WebGL texture upload (640×360)                │
│                    ├─ GLSL shader inference (BlazeFace 128×128)     │
│                    ├─ NMS in WASM                                   │
│                    └─▶ boundingBox {originX, originY, width, height}│
│                              │                                      │
│                    normalize to [0,1] → FaceSample{frame, x, y}    │
│                              │                                      │
│                         applyEMA(α=0.35)                            │
│                              │                                      │
│                         POST /api/save-faces                        │
└─────────────────────────────────────────┬───────────────────────────┘
                                          │
                                          ▼
                               public/v1_faces.json
                          { sampleInterval:10, fps:30,
                            samples:[{frame, x, y}, ...] }
                                          │
                               fetchFaceTrackingData()
                                          │
                                          ▼
┌─────────────────────────────────────────────────────────────────────┐
│                   Remotion render loop (every frame)                 │
│                                                                     │
│  useCurrentFrame() → frame                                          │
│     │                                                               │
│     └─ sourceFrame = frame + trimStartFrame                         │
│              │                                                      │
│              └─ [for i in -12..+12]:                                │
│                   getFacePositionAtFrame(data, sourceFrame + i)     │
│                     → floor/ceil sample indices                     │
│                     → linear lerp with t = fractional offset        │
│                   × kernel[i]  (cosine weight)                      │
│                   accumulate sumX, sumY                             │
│              │                                                      │
│              └─ clampPosition({x: sumX, y: sumY}, 0, 1)            │
│                              │                                      │
│                 objectPosition: `${x*100}% ${y*100}%`              │
│                              │                                      │
│                 <Video objectFit="cover" objectPosition=... />      │
│                 → GPU video compositor pans viewport across         │
│                   the 3413×1920 virtual canvas                      │
└─────────────────────────────────────────────────────────────────────┘

Conclusion

The auto-reframe system achieves cinematic face tracking across arbitrary aspect ratios with a small, composable surface area. Every layer serves a distinct, non-overlapping purpose:

MediaPipe + WASM: On-device, GPU-accelerated face detection with no API costs or server round-trips. Understanding the WASM linear memory model, the BlazeFace anchor architecture, and WebGL delegation explains every performance characteristic and failure mode.

Sparse sampling (every 10th frame): MediaPipe is the bottleneck — interpolation is free. 3 Hz sampling captures all meaningful head movement while reducing detection time from hours to minutes.

EMA smoothing (α=0.35): Removes detection noise from the discrete sample set before storage. Targets the jitter source, not the signal.

Linear interpolation: Converts sparse samples to a dense per-frame position stream. O(1), no overshoot, sufficient for smooth head motion at 10-frame intervals.

Cosine kernel (±12 frames): Adds the cinematic lookahead-lookbehind feel at render time. Targets perceptual smoothness, not mathematical noise. Separating this from EMA means you can tune each independently.

CSS objectFit: cover + objectPosition: The GPU video compositor does all the panning math. No pixel manipulation, no additional canvas elements, no manual crop math. The browser's built-in hardware acceleration handles it all.

Graceful degradation: Every layer has a fallback. No face data → centered video. Detection miss → carry-forward position. GPU unavailable → CPU fallback. The feature is additive throughout.

The entire system adds less than 200 KB of runtime code and has no ongoing inference costs after the one-time /detect run. The render loop is pure arithmetic — array lookups, lerps, and one CSS property update per frame.