
Workflow · Podcast

Measure the talking, not the music bed.

Podcast platforms specify a dialog-gated loudness target, not a whole-file one. Measure the episode as a single number and the intro bed, the stinger, and the room tone all drag it off. Specula runs a neural VAD over the file, gates a parallel integrated LUFS to the speech blocks, and reads that against the Apple Podcasts and Spotify Podcasts targets, with the speech regions hand-correctable when the detector guesses wrong.

macOS · Speech-gated LUFS · Silero neural VAD · Dialogue mode · Level Dialogue · Noise floor

7-day trial · one-time purchase · macOS 14+

Specula in Dialogue mode with detected speech regions over the waveform, one region being hand-edited.
Dialogue, detected and correctable. A neural VAD marks speech; you can hand-correct any region before it drives the loudness read.

Why whole-file LUFS is the wrong number

Dialog loudness measured over the whole file is wrong the moment there's music or SFX in it. The platforms know that, so they specify dialog-gated targets.

An integrated LUFS reading over the entire episode mixes the talking with everything that isn't talking. A loud cold-open music bed, a sponsor stinger, and a long quiet pause all land in the same average, so the file can read on-target while the actual voice sits a couple of LU away from where a listener hears it. Specula computes loudness only over the speech blocks, so the number you check against Apple Podcasts or Spotify Podcasts is the dialogue, not the dialogue blended with the bed.
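To put rough numbers on that blend, here is a minimal sketch of the arithmetic. It assumes loudness averages as power (energy) across segments and ignores K-weighting and gating; the segment durations and levels are made up for illustration and are not Specula output.

```python
import math

def blended_loudness(segments):
    """Energy-average (duration_s, loudness_lufs) segments into one level.

    Simplification: no K-weighting, no gating, just mean power over time.
    """
    total = sum(duration for duration, _ in segments)
    mean_power = sum(d * 10 ** (lufs / 10) for d, lufs in segments) / total
    return 10 * math.log10(mean_power)

# Hypothetical episode: 2 min of loud music bed, 28 min of dialogue.
episode = [(120, -10.0), (1680, -18.0)]
whole_file = blended_loudness(episode)
speech_only = blended_loudness([(1680, -18.0)])
print(f"whole file {whole_file:.1f}, speech only {speech_only:.1f}")
```

With these made-up figures the whole-file value lands near −16.7 while the dialogue itself sits at −18, so a short loud bed is enough to move the reading by more than 1 LU.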

The same separation matters at the quiet end. The room-tone noise floor between phrases is invisible on a whole-file loudness meter and inaudible on headphones, but it's measured over every non-speech sample here, so a noisy capture shows up as a number before it ships.
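A noise-floor figure of this kind can be sketched as an RMS level over the samples a speech mask excludes. This is an illustrative reading of "dB RMS over every non-speech sample", assuming float samples in the range −1 to 1 and a per-sample boolean mask; the function name and the mask format are hypothetical.

```python
import math

def noise_floor_db(samples, speech_mask):
    """RMS level in dB (re full scale) over samples flagged as non-speech."""
    quiet = [s for s, is_speech in zip(samples, speech_mask) if not is_speech]
    if not quiet:
        return None  # nothing but speech: no floor to report
    mean_square = sum(s * s for s in quiet) / len(quiet)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square)

# Two loud speech samples, two samples of low-level room tone.
floor = noise_floor_db([0.5, -0.5, 0.001, -0.001], [True, True, False, False])
```

The speech samples never enter the average, which is why the floor can be read even when the dialogue is loud.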

Apple Podcasts: −16 LKFS, dialog-gated, true peak at or below −1 dBTP.
Spotify Podcasts: −14 LKFS, dialog-gated, true peak at or below −1 dBTP.
ACX: −20.5 dB RMS ±2.5 dB (judged on plain RMS, not LUFS), noise floor ≤ −60 dB RMS, true peak ≤ −3 dBTP. This is for audiobook work and is covered in the audiobook workflow.
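The three specs above can be written down as a small lookup and check. This is a sketch only: the numbers come from the list above, while the function names, the report shape, and the idea of reporting a signed offset in LU are assumptions for illustration, not Specula's internals.

```python
# Targets from the list above: (dialog-gated loudness target, true-peak ceiling).
PODCAST_TARGETS = {
    "Apple Podcasts": (-16.0, -1.0),
    "Spotify Podcasts": (-14.0, -1.0),
}

def platform_report(speech_lufs, true_peak_dbtp):
    """Signed offset from each platform target, plus a true-peak flag."""
    return {
        name: {
            "offset_lu": round(speech_lufs - target, 1),
            "tp_breach": true_peak_dbtp > ceiling,
        }
        for name, (target, ceiling) in PODCAST_TARGETS.items()
    }

def acx_ok(rms_db, noise_floor_db, true_peak_dbtp):
    """ACX window: RMS within 2.5 dB of -20.5, floor <= -60, peak <= -3."""
    return (
        abs(rms_db - (-20.5)) <= 2.5
        and noise_floor_db <= -60.0
        and true_peak_dbtp <= -3.0
    )

report = platform_report(speech_lufs=-17.2, true_peak_dbtp=-0.5)
```

Here the hypothetical episode is 1.2 LU under the Apple Podcasts target and 3.2 LU under Spotify's, and its −0.5 dBTP peak breaches both ceilings.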

The workflow

Load to verdict in four steps. The detector does the first pass; you correct it; the targets read off the result.

1 · Mark speech on load
Drop the episode; the Silero neural VAD runs on load, downsampling the file to 16 kHz mono and classifying every 100 ms block as speech or non-speech. Detected speech shades teal on the waveform (the SG toggle), so you can eyeball the result before relying on the numbers.
2 · Hand-correct in Dialogue mode
Dialogue mode (⌘4) lets you fix what the VAD missed: drag a region's body to move it or its edges to resize, mark I / O the DAW way, S to split at the playhead, Delete to remove. 50-step undo / redo, including drags. Edits auto-save to a versioned .dlg.json sidecar next to the file (debounced 500 ms), so re-opening the episode re-applies them.
3 · Read speech-gated LUFS
The Loudness section shows Speech-Gated Integrated LUFS and Speech % alongside the full-file integrated value, plus a live Noise Floor readout (dB RMS over every non-speech sample) in Podcast mode. Same BS.1770 K-weighting and dual gating as the full measurement, accumulated only over the speech blocks.
4 · Check against the platform
In Podcast mode the Loudness Targets panel shows the penalty against Apple Podcasts (−16 LKFS) and Spotify Podcasts (−14 LKFS), both dialog-gated, with a triangle if the true peak breaches −1 dBTP. Set the TP threshold to that −1 dBTP ceiling and every spot the episode breaches it lights up on the waveform, so an intro sting or a clipped laugh shows you where to seek rather than just that it happened somewhere. Run offline analysis (⌘Return) over the whole file for the settled number, then export the receipt: ⌥⌘E for PDF, ⇧⌘E for JSON.
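Step 3's "dual gating, accumulated only over the speech blocks" can be sketched as follows. The gate values (an absolute gate at −70 LKFS and a relative gate 10 LU below the absolute-gated level) and the −0.691 offset are from BS.1770; everything else is an assumption for illustration: the input is a list of already K-weighted per-block loudness values, and the way a speech flag attaches to each block is hypothetical.

```python
import math

ABS_GATE_LUFS = -70.0   # BS.1770 absolute gate
REL_GATE_LU = -10.0     # BS.1770 relative gate, below the abs-gated level

def _to_power(lufs):
    return 10 ** ((lufs + 0.691) / 10)

def _to_lufs(power):
    return -0.691 + 10 * math.log10(power)

def gated_loudness(block_lufs, speech=None):
    """Integrated loudness with dual gating, optionally over speech blocks only."""
    if speech is None:
        speech = [True] * len(block_lufs)
    # Stage 1: keep speech blocks above the absolute gate.
    powers = [_to_power(l) for l, s in zip(block_lufs, speech)
              if s and l > ABS_GATE_LUFS]
    if not powers:
        return None
    # Stage 2: drop blocks more than 10 LU below the stage-1 average.
    rel_gate = _to_lufs(sum(powers) / len(powers)) + REL_GATE_LU
    kept = [p for p in powers if _to_lufs(p) > rel_gate]
    return _to_lufs(sum(kept) / len(kept))

# Hypothetical episode: loud bed, dialogue, then near-silent room tone.
blocks = [-10.0] * 4 + [-18.0] * 12 + [-80.0] * 4
flags = [False] * 4 + [True] * 12 + [False] * 4
full_file = gated_loudness(blocks)
speech_gated = gated_loudness(blocks, flags)
```

On this toy input the full-file figure is pulled several LU up by the bed, while the speech-gated figure stays at the dialogue's own level.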

Two-tier speech detection, and where your corrections go

Tier 1 is a Silero neural VAD (MIT-licensed, run via FluidAudio). On load the file is downsampled to 16 kHz mono and classified per 100 ms block. Tier 2 is a spectral fallback that covers the rare case where Silero is unavailable. It classifies on four spectral features (an HF gate, a 300 to 3,400 Hz band-energy ratio, spectral flatness, and a harmonicity-plus-flux test), so the speech-gated path still produces a result.
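Two of the four fallback features are easy to illustrate. The sketch below computes a 300 to 3,400 Hz band-energy ratio and a spectral flatness for one frame, using a naive DFT so it needs nothing beyond the standard library. It is not Specula's implementation: the frame length, the absence of a window function, and the function names are assumptions, and the thresholds and the way the four features combine are not stated in the text, so none are shown.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT power spectrum, bins 0..N/2 (fine for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]

def band_energy_ratio(spec, sample_rate, n, lo=300.0, hi=3400.0):
    """Share of total power that falls inside the lo..hi Hz band."""
    total = sum(spec)
    if total == 0:
        return 0.0
    band = sum(p for k, p in enumerate(spec) if lo <= k * sample_rate / n <= hi)
    return band / total

def spectral_flatness(spec, eps=1e-12):
    """Geometric mean over arithmetic mean: near 0 tonal, near 1 noise-like."""
    geometric = math.exp(sum(math.log(p + eps) for p in spec) / len(spec))
    return geometric / (sum(spec) / len(spec) + eps)

RATE, N = 16000, 256
tone = [math.sin(2 * math.pi * 1000 * t / RATE) for t in range(N)]  # 1 kHz
click = [1.0] + [0.0] * (N - 1)                                     # impulse
tone_spec, click_spec = power_spectrum(tone), power_spectrum(click)
```

A 1 kHz tone puts nearly all its energy in the voice band with flatness near 0, while an impulse (a click) has a flat spectrum with flatness near 1, which is the kind of separation these features are meant to give.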

The corrected regions feed straight into the speech-gated path. When you fix regions in Dialogue mode, the dialog-gated LUFS, the noise-floor reading, and the Apple Podcasts / Spotify Podcasts verdicts all recompute against the speech and silence you confirmed, not what the raw VAD guessed. Region tint shows provenance: teal for pristine Silero output, amber for VAD regions you've edited, blue for ones loaded from a sidecar.

Bias the detector to your material in Settings → Speech. Three sliders re-run detection in about half a second: Threshold (0.1 to 0.9, default 0.5, lower picks up quieter speech), Minimum region duration (0.05 to 2.0 s, default 0.10, raise to reject clicks and short interjections), and Merge gap (0.05 to 2.0 s, default 0.10, larger values fuse nearby regions through breath pauses). Reset to VAD discards your edits, deletes the sidecar, and re-runs Silero with the current tuning.
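How the three sliders interact can be sketched as post-processing over per-block speech probabilities. The defaults below match the settings described above; the order of operations (merge first, then drop short regions), the 100 ms block length as the time step, and the function name are assumptions for illustration.

```python
def detect_regions(probs, threshold=0.5, min_duration=0.10,
                   merge_gap=0.10, block_s=0.1):
    """Turn per-block speech probabilities into (start_s, end_s) regions."""
    tol = 1e-9  # absorbs float error when comparing block-multiple times
    # 1. Threshold: collect runs of consecutive blocks at or above threshold.
    regions, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            regions.append([start * block_s, i * block_s])
            start = None
    if start is not None:
        regions.append([start * block_s, len(probs) * block_s])
    # 2. Merge gap: fuse neighbours separated by at most merge_gap seconds.
    merged = []
    for region in regions:
        if merged and region[0] - merged[-1][1] <= merge_gap + tol:
            merged[-1][1] = region[1]
        else:
            merged.append(region)
    # 3. Minimum duration: drop regions shorter than min_duration seconds.
    return [(round(a, 3), round(b, 3)) for a, b in merged
            if b - a >= min_duration - tol]

probs = [0.9, 0.9, 0.2, 0.8, 0.1, 0.1, 0.1, 0.7, 0.1]
default_regions = detect_regions(probs)
strict_regions = detect_regions(probs, min_duration=0.2)
fused_regions = detect_regions(probs, merge_gap=0.3)
```

With the defaults, the one-block dip at 0.2 s is bridged and the isolated block at 0.7 s survives as its own region; raising the minimum duration rejects that block, and widening the merge gap fuses everything into one region.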

Level the dialogue, in the same app

Where the dialogue is off level or the room tone sits high, the fix is in Edit mode. Level Dialogue works off the same speech regions the measurement uses (the ones you confirmed in Dialogue mode): it ducks the room tone between phrases with a fast attack and slow release, so word onsets and tails stay clean. Tick Lift dialogue to target and it also brings the voice to a target in the same pass, capping inter-sample peaks at a true-peak ceiling. Gating the silence before the makeup gain is what lets the dialogue come up without the floor riding up with it, the difference between a clean episode and one where every pause hisses. It's a gate, not a denoiser, so it rescues a marginal floor (a quiet room tone just under the gate); a genuinely hissy recording still wants a better capture.
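The fast-attack, slow-release behaviour can be sketched as a one-pole smoother on a gain target that follows the speech regions. Every number here is a hypothetical placeholder (duck depth, time constants, a 1 ms control step), smoothing in the dB domain is a simplification, and "attack" is taken to mean the gate opening when speech starts; none of this is confirmed as how Level Dialogue works internally.

```python
import math

def duck_gains(speech_mask, duck_db=-12.0, attack_ms=5.0,
               release_ms=200.0, step_ms=1.0):
    """Per-step gain in dB: 0 dB over speech, duck_db between phrases."""
    attack = math.exp(-step_ms / attack_ms)     # fast: gain rises to 0 dB
    release = math.exp(-step_ms / release_ms)   # slow: gain falls to duck_db
    gain = duck_db                              # start ducked
    out = []
    for is_speech in speech_mask:
        target = 0.0 if is_speech else duck_db
        coeff = attack if target > gain else release
        gain = target + coeff * (gain - target)  # one-pole step toward target
        out.append(gain)
    return out

# 50 ms of room tone, 50 ms of speech, 50 ms of room tone (1 ms steps).
mask = [False] * 50 + [True] * 50 + [False] * 50
gains = duck_gains(mask)
```

The gain is back at essentially 0 dB within the first few milliseconds of the phrase, so the onset is not clipped, and it has only fallen a few dB 50 ms after the phrase ends, so the tail is not chopped. Makeup gain applied after this stage raises the dialogue while the ducked stretches stay below where they started.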

For a straight level move, Normalize on a dialog-gated basis shifts the whole episode by the gain that lands speech-gated LUFS on Apple Podcasts' −16 or Spotify's −14, and a true-peak triangle is cleared by one Limit TP pass that caps the peaks without touching loudness. Each is one undoable edit that saves a new file. See the editing workflow →
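The normalize move is plain arithmetic: the gain is the target minus the measured speech-gated loudness, and the true peak moves by the same amount. A small sketch, with hypothetical function and field names:

```python
def normalize_plan(speech_lufs, true_peak_dbtp,
                   target_lufs=-16.0, ceiling_dbtp=-1.0):
    """Static gain that lands speech-gated loudness on the target."""
    gain_db = target_lufs - speech_lufs
    peak_after = true_peak_dbtp + gain_db   # peaks shift with the gain
    return {
        "gain_db": round(gain_db, 2),
        "peak_after_dbtp": round(peak_after, 2),
        "needs_limit": peak_after > ceiling_dbtp,
    }

apple = normalize_plan(speech_lufs=-19.3, true_peak_dbtp=-4.0)
spotify = normalize_plan(speech_lufs=-19.3, true_peak_dbtp=-4.0,
                         target_lufs=-14.0)
```

For this made-up episode, reaching −16 takes +3.3 dB and pushes the peak to −0.7 dBTP, past the −1 dBTP ceiling, which is the situation where a limiting pass follows the level move.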

See it in Specula

Teal speech regions overlaid on a spoken-word waveform immediately after load.
Speech, found on load. The Silero neural model marks spoken regions automatically, fully offline.
The Level Dialogue popover with a duck-room-tone field and a lift-dialogue-to-target toggle for LKFS and dBTP.
Level the dialogue without raising the floor. It ducks room tone before the makeup gain, so the floor drops instead of riding up.
Speech-gated integrated LUFS and a live noise-floor readout in the loudness sidebar.
Speech-gated loudness. The dialog-gated LUFS is measured over speech only, and the noise floor over the non-speech samples between phrases.
Podcast mode targets for Apple Podcasts and Spotify, with the ACX audiobook row and its noise-floor criterion.
Podcast and audiobook specs. Apple, Spotify, and the ACX RMS window with its noise-floor requirement, side by side.

Ship the episode on target

Specula is the pass between "the edit is locked" and "it's published." It tells you what the dialogue actually measures against each platform's spec, levels the dialogue and ducks the room tone with Level Dialogue, lands the episode on a target with Normalize, and caps the peaks with Limit TP, all without leaving the file you loaded.