Hermes Dub

Hermes Dub is built to be the best dubbing system in the world. This is version one, released May 16, 2026. It already surpasses every open-source system and nearly all proprietary ones; the only proprietary system that edges it out in certain cases is HeyGen. A public GitHub repository is coming soon, with rapid development ahead. The pipeline runs mostly on open-source models connected by an intricate architecture.

Original

Hermes Dub Open Source Models

ElevenLabs Closed Source

Other Open-Source Systems

System A

System B

System C

System D

System E

How It Works

Voice Generation

The voice generation side of the pipeline solves a problem that the literature has consistently treated as unsolved: producing high-quality cross-language voice clones from a short reference clip. Most research and tooling in this space either require long reference recordings or degrade heavily when the source and target languages differ, because the model has to separate speaker identity from phoneme patterns that simply do not exist in the target language.

Reference clip — original speaker

The original speaker's voice, extracted from the vocal track of the source video. This is the identity target: a short English clip that the voice conversion step tries to match as closely as possible.

The approach is a two-stage process. The first stage runs Qwen TTS with only a base embedding to generate a generic, accent-free voice in the target language: no speaker identity, just clean phonemes. The second stage runs that output through SeedVC for voice conversion, using the original speaker clip as the target identity. The source for the conversion is the generated target-language speech, not the original recording, which sidesteps the cross-language degradation problem almost entirely.
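The data flow of the two stages can be sketched as below. This is only an illustration of the ordering, not the actual Hermes Dub code: `TwoStageDubber`, `synthesize_neutral`, and `convert_voice` are hypothetical names, and the two callables stand in for whatever Qwen TTS and SeedVC inference wrappers are actually used.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in waveform type: mono samples in [-1.0, 1.0].
Audio = List[float]


@dataclass
class TwoStageDubber:
    # Hypothetical wrappers. Neither signature comes from Qwen TTS or SeedVC;
    # plug in your own inference functions here.
    synthesize_neutral: Callable[[str, str], Audio]  # (text, target_lang) -> generic speech
    convert_voice: Callable[[Audio, Audio], Audio]   # (source_speech, reference_clip) -> converted

    def dub(self, translated_text: str, target_lang: str, reference_clip: Audio) -> Audio:
        # Stage 1: identity-free speech that is phoneme-correct in the target language.
        neutral = self.synthesize_neutral(translated_text, target_lang)
        # Stage 2: voice conversion. The conversion source is the generated
        # target-language speech, and the original speaker clip is only the
        # identity target, so the model never has to map source-language
        # phonemes into the target language.
        return self.convert_voice(neutral, reference_clip)
```

The point of the structure is that the original recording never appears as the conversion source; it only supplies the identity.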

First TTS pass — no identity

Generic voice, phoneme-correct in the target language. No speaker identity yet — this is the clean foundation fed into SeedVC.

After SeedVC voice conversion

The same 8-second window after the speaker's identity is mapped across: a cross-language clone from a short clip, with no English accent.

To prepare the target embedding, the reference audio is cleaned before inference: it is passed through noise reduction, high-pass and low-pass filtering, and loudness normalization, and SeedVC then runs on the cleaned clip. This tightens the identity match considerably. The result is a voice that carries the original speaker's characteristics without the original accent, achievable with only a few seconds of reference audio. When longer reference audio is available, RVC produces a more stable and higher-fidelity clone, since it has more material to anchor the speaker identity.
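A minimal sketch of the filtering and normalization part of that cleanup follows, in pure Python so it stays self-contained. Everything specific here is an assumption: the first-order filters, the 80 Hz and 7500 Hz cutoffs, and the -20 dBFS RMS target are illustrative choices, not the values the pipeline uses. RMS normalization stands in for a proper loudness measure, and the noise reduction step is left out because it depends on the denoiser chosen.

```python
import math
from typing import List


def one_pole_highpass(x: List[float], cutoff_hz: float, sr: int) -> List[float]:
    """First-order high-pass: removes DC offset and low-frequency rumble."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    a = rc / (rc + 1.0 / sr)
    out, prev_x, prev_y = [], 0.0, 0.0
    for s in x:
        prev_y = a * (prev_y + s - prev_x)
        prev_x = s
        out.append(prev_y)
    return out


def one_pole_lowpass(x: List[float], cutoff_hz: float, sr: int) -> List[float]:
    """First-order low-pass: attenuates hiss above the cutoff."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sr
    a = dt / (rc + dt)
    out, prev = [], 0.0
    for s in x:
        prev = prev + a * (s - prev)
        out.append(prev)
    return out


def normalize_rms(x: List[float], target_dbfs: float = -20.0) -> List[float]:
    """Scale the clip so its RMS level hits target_dbfs, clipping to [-1, 1]."""
    if not x:
        return []
    rms = math.sqrt(sum(s * s for s in x) / len(x))
    if rms == 0.0:
        return list(x)  # silence: nothing to scale
    gain = 10.0 ** (target_dbfs / 20.0) / rms
    return [max(-1.0, min(1.0, s * gain)) for s in x]


def clean_reference(x: List[float], sr: int,
                    highpass_hz: float = 80.0,    # assumed cutoff
                    lowpass_hz: float = 7500.0,   # assumed cutoff
                    target_dbfs: float = -20.0) -> List[float]:
    """Band-limit the reference clip, then normalize its level."""
    y = one_pole_highpass(x, highpass_hz, sr)
    y = one_pole_lowpass(y, lowpass_hz, sr)
    return normalize_rms(y, target_dbfs)
```

In practice steeper filters and a perceptual loudness measure would likely be preferable; the sketch only shows the order of operations applied to the reference clip before it reaches voice conversion.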

Final dubbed segment

The final dubbed output using the refined speaker embedding. This is exactly what goes into the finished video.

Lip Sync

The lip sync problem is fundamentally a multi-person scene problem, and most off-the-shelf tools cannot handle it. They treat the full video frame as a single face target, which either breaks immediately when two speakers are visible at once or fails silently by latching onto the wrong face. The pipeline solves this by treating each speaker as an independent rendering unit.

It starts with full-frame body tracking using YOLO with BoT-SORT re-identification to get stable person trajectories across the video, then runs face detection and embedding extraction using InsightFace combined with DeepFace in a blended scoring model. The identity matching weights spatial proximity heavily alongside deep embedding similarity, and keeps a rolling gallery of confirmed embeddings per speaker so that one speaker's face is not reassigned to another as people move around the frame.
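The blended scoring idea can be illustrated with a small sketch. The weights (0.6 spatial, 0.4 embedding), the acceptance threshold, the gallery size, and the "head sits near the top centre of the body box" heuristic are all assumptions made for the example, and `SpeakerGallery`, `blended_score`, and `match_face` are hypothetical names rather than parts of the actual pipeline.

```python
import math
from collections import deque
from typing import Dict, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2 of a tracked body


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0


class SpeakerGallery:
    """Rolling store of confirmed face embeddings for one speaker."""

    def __init__(self, max_size: int = 20):
        self.embeddings = deque(maxlen=max_size)  # oldest entries fall off

    def add(self, emb: Sequence[float]) -> None:
        self.embeddings.append(list(emb))

    def best_similarity(self, emb: Sequence[float]) -> float:
        return max((cosine(e, emb) for e in self.embeddings), default=0.0)


def spatial_score(face_center: Tuple[float, float], body_box: Box) -> float:
    """1.0 when the face sits at the expected head position, falling to 0."""
    x1, y1, x2, y2 = body_box
    height = max(y2 - y1, 1e-6)
    head = ((x1 + x2) / 2.0, y1 + 0.1 * height)  # assumed head location
    dist = math.hypot(face_center[0] - head[0], face_center[1] - head[1])
    return max(0.0, 1.0 - dist / height)


def blended_score(face_center, face_emb, body_box: Box, gallery: SpeakerGallery,
                  w_spatial: float = 0.6, w_embed: float = 0.4) -> float:
    return (w_spatial * spatial_score(face_center, body_box)
            + w_embed * gallery.best_similarity(face_emb))


def match_face(face_center, face_emb,
               tracks: Dict[str, Tuple[Box, SpeakerGallery]],
               min_score: float = 0.5) -> Optional[str]:
    """Return the best-scoring track id, or None if nothing clears min_score."""
    best_id, best = None, min_score
    for track_id, (box, gallery) in tracks.items():
        score = blended_score(face_center, face_emb, box, gallery)
        if score >= best:
            best_id, best = track_id, score
    return best_id
```

With the spatial term weighted more heavily, a face that resembles Speaker 1 but sits on Speaker 2's body track still resolves to Speaker 2. The caller would add an embedding to a gallery only after the match is confirmed.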

Body tracking + face identity

Live bounding boxes per speaker across every frame. Yellow = Speaker 1, Orange = Speaker 2. Each person gets a stable ID for the full duration.

Once the per-person tracks are established, active speaker recognition resolves which speaker is producing which audio segment at every moment. It measures the temporal overlap between the diarized speech segments from WhisperX and the face observation windows, accumulates overlap-weighted scores per speaker-face pair, and then greedily assigns each speaker a unique face from the ranked candidates. Each speaker's cropped video is then lip-synced independently against that speaker's own dubbed audio track and composited back into the original full frame, with alpha blending to smooth the edges.
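The overlap scoring and greedy assignment can be sketched as follows. The input shapes are assumptions: speech segments as `(speaker_id, start, end)` tuples and face observation windows as `(face_id, start, end)` tuples, all in seconds. `assign_faces` is a hypothetical name, and the score here is simply accumulated overlap in seconds.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Segment = Tuple[str, float, float]  # (id, start_seconds, end_seconds)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, 0 if disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_faces(speech_segments: List[Segment],
                 face_windows: List[Segment]) -> Dict[str, str]:
    # Accumulate total overlap for every (speaker, face) pair.
    scores: Dict[Tuple[str, str], float] = defaultdict(float)
    for spk, s0, s1 in speech_segments:
        for face, f0, f1 in face_windows:
            ov = overlap(s0, s1, f0, f1)
            if ov > 0.0:
                scores[(spk, face)] += ov

    # Greedy pass over candidates ranked by score: each speaker gets at most
    # one face, and each face goes to at most one speaker.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    assignment: Dict[str, str] = {}
    used_faces = set()
    for (spk, face), _ in ranked:
        if spk in assignment or face in used_faces:
            continue
        assignment[spk] = face
        used_faces.add(face)
    return assignment
```

Greedy assignment is not guaranteed to maximize the total score, but with a handful of speakers and clearly separated scores it is simple and predictable.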

Face crop — original

512×512 input to the lip sync model before dubbing is applied.

Face crop — dubbed

Same crop after lip sync. This gets composited back into the full scene.

On-Screen Text

The on-screen text translation is a separate experimental module that handles persistent text overlays such as title cards and captions. The detection pass samples the video at two frames per second, runs Tesseract OCR on each sampled frame with word-level confidence scoring, and clusters nearby words into blocks using gap-based spatial grouping. Only regions that persist for at least 1.5 seconds are kept as candidates. Their frame boundaries are then refined by a linear search that pins the exact first and last frames where the text is present.
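The persistence filter and boundary refinement can be sketched for a single region as below. `has_text` is a hypothetical callback standing in for "OCR finds this region's text in frame f". Measuring persistence from the first to the last sampled hit is an assumption of the sketch; the real pipeline may measure it differently.

```python
from typing import Callable, List, Tuple


def refine_boundaries(first_hit: int, last_hit: int, step: int,
                      total_frames: int,
                      has_text: Callable[[int], bool]) -> Tuple[int, int]:
    """Linear search outward from the sampled hits to the exact frame range."""
    # The sample one step before first_hit was a miss, so stop just after it.
    start, lo = first_hit, max(0, first_hit - step + 1)
    while start > lo and has_text(start - 1):
        start -= 1
    # Likewise, the sample one step after last_hit was a miss.
    end, hi = last_hit, min(total_frames - 1, last_hit + step - 1)
    while end < hi and has_text(end + 1):
        end += 1
    return start, end


def find_persistent_text(total_frames: int, fps: float,
                         has_text: Callable[[int], bool],
                         sample_fps: float = 2.0,
                         min_seconds: float = 1.5) -> List[Tuple[int, int]]:
    step = max(1, round(fps / sample_fps))  # e.g. every 15th frame at 30 fps
    hits = [f for f in range(0, total_frames, step) if has_text(f)]

    # Group hits into runs of consecutive samples.
    runs: List[List[int]] = []
    for f in hits:
        if runs and f - runs[-1][1] == step:
            runs[-1][1] = f
        else:
            runs.append([f, f])

    results = []
    for first, last in runs:
        # Coarse persistence check on the sampled span only.
        if (last - first) / fps < min_seconds:
            continue
        results.append(refine_boundaries(first, last, step, total_frames, has_text))
    return results
```

Sampling keeps the OCR cost low, and the frame-by-frame search only runs within one sampling step on each side of a candidate that has already passed the persistence check.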

Text detection pass

Colored boxes show every persistent on-screen text region — with detected content and the exact frame range it was tracked across.

For each detected region, the system runs three sequential steps. The first is an image model call that inpaints the region with all text removed, reconstructing the natural background as if the text had never been there. The second is an image model call that generates the translated text on a pure magenta background, using the same font, weight, and placement as the original. The third is a chroma key pass that converts the magenta pixels to transparency, producing a clean RGBA text layer. At compositing time, each frame receives a two-layer blend: the clean background is applied only to the pixels where the original text was, and the translated RGBA layer is alpha-composited on top. The surrounding region is left completely untouched.
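The chroma key and the two-layer blend can be sketched per pixel in plain Python. This is an illustration under assumptions: images are lists of rows of tuples, the key is a hard threshold on distance from pure magenta (the `tolerance` value is arbitrary), and a real implementation would more likely use array operations and a soft edge.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def chroma_key_magenta(image: List[List[RGB]], tolerance: int = 60) -> List[List[RGBA]]:
    """Turn pixels close to pure magenta (255, 0, 255) fully transparent."""
    limit = tolerance * tolerance
    out = []
    for row in image:
        keyed = []
        for r, g, b in row:
            dist_sq = (r - 255) ** 2 + g ** 2 + (b - 255) ** 2
            keyed.append((r, g, b, 0 if dist_sq <= limit else 255))
        out.append(keyed)
    return out


def composite_region(frame: List[List[RGB]],
                     clean_bg: List[List[RGB]],
                     text_mask: List[List[bool]],
                     text_layer: List[List[RGBA]]) -> List[List[RGB]]:
    """Two-layer blend: masked clean background, then translated text on top."""
    out = []
    for y, row in enumerate(frame):
        new_row = []
        for x, px in enumerate(row):
            # Layer 1: use the inpainted background only where the original
            # text was; every other pixel keeps the original frame.
            base = clean_bg[y][x] if text_mask[y][x] else px
            # Layer 2: alpha-composite the translated text layer.
            r, g, b, a = text_layer[y][x]
            t = a / 255.0
            new_row.append(tuple(round(t * c + (1.0 - t) * bc)
                                 for c, bc in zip((r, g, b), base)))
        out.append(new_row)
    return out
```

Restricting the clean background to the masked pixels is what keeps the surrounding region identical to the source frame.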