Updated Jun 11, 2026 · Affirmology_AudioMastering_Production_Research_v1.md
Production-craft research for the Affirmology pipeline (ElevenLabs voice synthesis + FFmpeg music mixing). Question being answered: does the current chain produce "feels professional" audio that holds up next to Calm, Headspace, Insight Timer, and CHANI? And what specifically would close the gap?
Compiled June 2026. Specific tools, settings, and reference points throughout.
The consensus in 2026 across mastering forums, the iZotope education library, and the Descript / SONE / Resound podcast standards docs is that spoken-word and meditation content lives in the -16 to -18 LUFS integrated range with a true peak of -1 dBTP. Apple's official podcast specification is -16 LUFS integrated, -1 dBTP, +/- 1 LU. This is the safest default for any voice-led product.
The reason it sits below pop music's -14 LUFS is twofold. First, the dynamic range of spoken voice (and especially whispered meditation voice) is naturally wider, so you need headroom for naturalistic micro-dynamics. Second, the nervous-system rule: meditation that's too loud activates sympathetic arousal. Loud meditation is a contradiction in terms.
If you master at -14 LUFS, Apple turns you down by 2 LU. If you master at -16 LUFS, Spotify can turn you up by as much as 2 LU (positive gain is applied only as far as its normalization headroom allows), while YouTube only attenuates and plays quieter material as delivered. Either is acceptable. The 2026 consensus from horiamc.com, Soundplate, and UpTrack is: one master at -14 LUFS, -1 dBTP works everywhere for music, but -16 LUFS for spoken word / meditation because the dynamic feel matters more than competitive loudness.
Recommend -16 LUFS integrated, -1 dBTP, LRA 7-11. The loudness and peak targets match the Apple Podcasts spec. It survives every platform's normalization without sounding squashed. Informal measurements of CHANI and Calm content on streaming land in roughly this range. Spotify may raise it by up to 2 dB where peak headroom allows, which is fine.
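The normalization arithmetic above can be sketched as a small calculation. This is a minimal illustration, not any platform's actual algorithm: the function name, the `allow_boost` flag, and the -1 dB peak ceiling used to cap positive gain are assumptions made for the example.

```python
def playback_gain_db(master_lufs, master_tp_db, target_lufs,
                     allow_boost=True, peak_ceiling_db=-1.0):
    """Estimate the gain (dB) a normalizing player applies to a master.

    Assumptions (hypothetical model): attenuation is always applied in full;
    positive gain is capped so the true peak stays under peak_ceiling_db.
    """
    gain = target_lufs - master_lufs
    if gain <= 0:
        return gain            # master is at or above target: turned down
    if not allow_boost:
        return 0.0             # player only attenuates
    headroom = peak_ceiling_db - master_tp_db
    return max(0.0, min(gain, headroom))


# A -14 LUFS master played against a -16 LUFS target.
print(playback_gain_db(-14.0, -1.0, -16.0))
# A -16 LUFS master peaking at -1 dBTP against a -14 LUFS target.
print(playback_gain_db(-16.0, -1.0, -14.0))
# The same loudness with peaks at -3 dBTP.
print(playback_gain_db(-16.0, -3.0, -14.0))
```

Under this model, a master that already peaks at -1 dBTP leaves no room for positive gain, so how much a quiet master is raised depends on its true peak as well as its integrated loudness.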
Single-pass is good enough for batch automation, but produces +/- 1 LU drift. Two-pass is what professional services run:
# Pass 1: measure
ffmpeg -i input.wav -af loudnorm=I=-16:TP=-1.0:LRA=11:print_format=json -f null -
# Pass 2: apply, feeding measured values back in
ffmpeg -i input.wav -af loudnorm=I=-16:TP=-1.0:LRA=11:measured_I=-22.3:measured_TP=-7.1:measured_LRA=8.2:measured_thresh=-32.5:offset=-0.4:linear=true \
-ar 48000 -c:a pcm_s24le output_master.wav
linear=true applies a single gain value instead of dynamic AGC, which preserves the original mix dynamics. For meditation this matters - you do not want loudnorm "auto-leveling" the soft whisper passages back up to match the louder ones.
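For batch automation, the measured values from pass 1 have to be carried into pass 2 by script. A minimal sketch of that hand-off, assuming the JSON object that `print_format=json` writes at the end of FFmpeg's stderr; the helper names and the trimmed sample text are illustrative, and the sample numbers are simply the ones used in the command above.

```python
import json

# Trimmed, illustrative stand-in for the tail of ffmpeg's stderr after pass 1.
SAMPLE_STDERR = """
[Parsed_loudnorm_0 @ 0x0]
{
    "input_i" : "-22.30",
    "input_tp" : "-7.10",
    "input_lra" : "8.20",
    "input_thresh" : "-32.50",
    "target_offset" : "-0.40"
}
"""


def parse_loudnorm_json(stderr_text):
    """Return the last JSON object found in the stderr text as a dict."""
    start = stderr_text.rindex("{")
    end = stderr_text.rindex("}") + 1
    return json.loads(stderr_text[start:end])


def pass2_filter(stats, target_i=-16.0, target_tp=-1.0, target_lra=11.0):
    """Build the pass-2 loudnorm filter string from pass-1 measurements."""
    return (
        f"loudnorm=I={target_i}:TP={target_tp}:LRA={target_lra}"
        f":measured_I={stats['input_i']}"
        f":measured_TP={stats['input_tp']}"
        f":measured_LRA={stats['input_lra']}"
        f":measured_thresh={stats['input_thresh']}"
        f":offset={stats['target_offset']}"
        ":linear=true"
    )


print(pass2_filter(parse_loudnorm_json(SAMPLE_STDERR)))
```

In a real run, the stderr text would come from something like `subprocess.run([...], capture_output=True, text=True).stderr`. loudnorm can fall back to dynamic mode when the linear targets cannot be met, so it is worth checking the `normalization_type` field that pass 2 reports.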
Stephen Porges's polyvagal research established that prosody - the melodic, warm, low-mid quality of a safe voice - directly activates the ventral vagal complex (the safety/social-engagement branch of the autonomic nervous system). The fundamental of a male meditation voice typically lives at 100-150 Hz, female at 180-250 Hz, with the first formant in the 400-800 Hz range. The "warm body" of the voice is the 200-400 Hz region.
Standard EQ moves engineers use for meditation voice:
Sibilance for male voices centers at 5-6 kHz; female voices at 7-8 kHz. ElevenLabs voices in particular over-pronounce /s/ and /sh/ phonemes because the model was trained on broadcast-clean speech. Common moves:
FFmpeg's deesser filter exists but is crude. A better approach is a compressor keyed by a sidechained bandpass filter centered at 6500 Hz (in FFmpeg, sidechaincompress fed a band-passed copy of the voice, since acompressor has no sidechain input), with ratio 4:1, threshold around -22 dB, attack 1 ms, release 100 ms.
The professional standard: music should sit -18 to -24 dB below voice peak during voice passages. During voice gaps (intros, outros, pause breaths), the music can come up to -6 to -9 dB below voice peak. Calm's mix sits closer to -22 dB during voice, which is why their voice feels so dominant and the music feels supportive rather than competitive.
Standard signal flow: voice track sends to the music compressor's sidechain input. When voice exceeds threshold, music gets pulled down. Typical settings:
In FFmpeg, this is the sidechaincompress filter:
ffmpeg -i voice.wav -i music.wav -filter_complex \
"[0:a]asplit=2[v1][v2]; \
[1:a][v2]sidechaincompress=threshold=0.05:ratio=4:attack=30:release=500:makeup=1[ducked]; \
[v1][ducked]amix=inputs=2:duration=longest:weights='1 0.6'[out]" \
-map "[out]" -ar 48000 -c:a pcm_s24le mix.wav
This is materially better than static volume automation, which is what most pipelines default to.
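The sidechaincompress threshold and the amix weights in the command above are linear amplitude values, while the rest of this document talks in dB. A small conversion sketch (the helper names are illustrative) makes the two comparable.

```python
import math


def db_to_linear(db):
    """Convert a level in dB to a linear amplitude factor."""
    return 10 ** (db / 20)


def linear_to_db(gain):
    """Convert a linear amplitude factor (> 0) to dB."""
    return 20 * math.log10(gain)


# threshold=0.05 in the sidechaincompress call
print(round(linear_to_db(0.05), 1))   # -26.0
# the 0.6 music weight in amix
print(round(linear_to_db(0.6), 1))    # -4.4
# a -22 dB threshold expressed as a linear value
print(round(db_to_linear(-22), 3))    # 0.079
```

So the ducking starts once the voice passes roughly -26 dBFS, and the static music weight sits about 4.4 dB under the voice before any ducking is applied.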
The "spectral hole" or carved-EQ technique: take the music bed and cut a wide bell -3 to -5 dB at 1 kHz with Q around 1.0 (sometimes called the "vocal pocket"). This creates room in the voice's intelligibility band without the listener consciously perceiving the music as quieter.
Music bed lives in:
- Sub-bass: 30-80 Hz (felt, not heard, in meditation)
- Bass body: 80-250 Hz
- High-mid sparkle: 4-10 kHz
- Air: 10-16 kHz
Voice owns:
- Fundamental: 100-300 Hz
- Body/warmth: 200-500 Hz
- Intelligibility: 1-4 kHz
- Presence: 3-6 kHz
The cleanest sound comes from sculpting both: high-pass the music at 100 Hz (don't let it fight the voice fundamental), notch at 1-3 kHz (vocal pocket), and let it bloom above 5 kHz, where the voice carries less energy. If the bed's sub-bass matters for a given track, lower the high-pass corner rather than removing it.
ElevenLabs voices are dry. They were trained on close-mic broadcast recordings with minimal natural reverberation. When played back through headphones with no spatial cue, the brain interprets them as "right inside my head" rather than "in a contemplative space." This is the uncanny TTS giveaway as much as any phonetic artifact.
The fix: a touch of intentional reverb that places the voice in a small, warm, intimate space.
Headspace uses less reverb on the voice itself but layers a near-subliminal ambient pad at -30 to -36 dB underneath everything. The pad is usually a sustained drone in the same key as the music bed. Effect: voice feels intimate but the whole scene feels "produced."
For meditation voice, algorithmic usually wins because real-room IRs include problematic resonances (HVAC, floor reflections) that fight the calm aesthetic. Specific recommended plugins:
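FFmpeg can run the convolution itself through the afir filter:
ffmpeg -i voice.wav -i bricasti_smallroom_IR.wav -filter_complex \
"[0:a]asplit=2[dry][send];[send][1:a]afir=dry=1:wet=1:length=1[wet]; \
[dry][wet]amix=inputs=2:duration=longest:weights='0.88 0.12':normalize=0[out]" \
-map "[out]" output.wav
In afir, dry and wet are linear gain factors (input gain and output gain, range 0 to 10), not dB values, and the filter outputs only the convolved signal. The dry/wet balance is therefore set here by the amix weights on a parallel copy of the voice; confirm the option behavior against the documentation for your FFmpeg build. The cleanest setup: pre-process voice through EQ + de-ess + this convolution step, then send to the mix stage.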
Order matters. A typical meditation master chain:
For the FFmpeg pipeline, the final step is the two-pass loudnorm shown in section 1. The pipeline already has this - what's missing is the pre-processing on voice and music separately.
This contrast - narrow voice, wide bed - is what creates the perceptual "inside your head / outside your head" split that makes meditation feel like a place rather than a recording.
In any DAW (Reaper, Logic, Pro Tools, Ableton):
# Force voice to mono center
[0:a]pan=stereo|c0=c0|c1=c0[voice_mono]
# Widen the music bed with mid/side levels (lower the mid, raise the side)
[1:a]stereotools=mlev=0.8:slev=1.4[music_wide]
This delivers the Calm-signature voice-narrow / bed-wide split in a single pass.
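To see why the voice stays centered while the bed widens, here is a conceptual sketch of mid/side scaling on a single stereo sample. It illustrates the general M/S idea behind the `mlev` and `slev` settings; it is not a reproduction of how stereotools is implemented internally, and the function name is made up for the example.

```python
def ms_scale(left, right, mlev=1.0, slev=1.0):
    """Scale the mid and side components of one stereo sample."""
    mid = (left + right) / 2    # what both channels share
    side = (left - right) / 2   # what differs between channels
    mid *= mlev
    side *= slev
    return mid + side, mid - side


# A mono-centered signal has no side component, so slev cannot widen it.
print(ms_scale(0.5, 0.5, mlev=0.8, slev=1.4))
# A hard-left signal gains side energy and spreads further apart.
left, right = ms_scale(1.0, 0.0, mlev=0.8, slev=1.4)
print(round(left, 3), round(right, 3))
```

Because the mono voice carries only mid content, it stays centered whatever the side level is, while any left/right difference in the music bed is exaggerated.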
Add breath sounds. Splice in real breath samples between paragraphs. Free libraries:
- Filmstro free breath pack
- Splice "Vocal Breaths" packs (free with trial)
- ElevenLabs sound-effects library has its own breath SFX now
- Record your own: 10 minutes with any USB mic gives 50+ usable breaths
Place breaths at -18 to -24 dB below voice peak. Ideally micro-pan very slightly (5-10 degrees off center) so they don't feel pasted on.
Add room tone. Generate or record 60 seconds of "silence" with the same noise floor as a real recording (use a real mic in a real room, or use iZotope RX's Ambience Match). Layer this at -42 dB underneath the entire voice track. The brain perceives the speech as "in the room" instead of "in the void."
Micro-timing variation. Vary playback speed by +/- 1-2% across sections using FFmpeg's atempo filter. This breaks the metronomic feel. Some pipelines do this per-sentence with subtle random variation.
De-clicker pass. ElevenLabs sometimes inserts micro-clicks at sentence boundaries. iZotope RX 11 De-click handles these in one pass. For automation, FFmpeg's adeclick filter is the more direct tool to try first; an aresample + compand chain can only mask the clicks rather than remove them.
Variable speed selection (already in your pipeline per the task list, "Build auto-tune speed selector"). This is the right move.
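The micro-timing variation described above can be driven by a seeded random generator so that renders stay reproducible. A minimal sketch, assuming one tempo factor per section and a +/- 2% limit; the function names and the seeding scheme are illustrative, not part of the existing pipeline.

```python
import random


def section_tempos(n_sections, max_dev=0.02, seed=0):
    """Return one tempo factor per section, each within 1 +/- max_dev."""
    rng = random.Random(seed)   # seeded so the same session renders the same way
    return [round(1.0 + rng.uniform(-max_dev, max_dev), 4)
            for _ in range(n_sections)]


def atempo_filter(tempo):
    """Format a tempo factor as an FFmpeg atempo filter string."""
    return f"atempo={tempo}"


for tempo in section_tempos(4, seed=42):
    print(atempo_filter(tempo))
```

Each string would be applied to its own section with `-af` before the sections are concatenated. atempo changes tempo without shifting pitch, so the voice character stays the same.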
Most modern meditation apps now use AI-generated music for at least the long-tail content. The economics: an hour-long custom ambient track costs $500-3000 from a composer, $0.50-3 from Suno or Udio.
Suno (as of late 2025/2026):
- Pro plan ($10/mo) and Premier plan ($30/mo) grant commercial use rights for tracks generated during active subscription.
- Suno takes 0% of streaming royalties.
- Following the Warner Music partnership (Nov 2025), Suno is moving toward licensed training data. Existing Pro/Premier generations remain commercially usable.
- Important caveat: Suno's policy says "you may be granted commercial use rights" but "generally are not considered the owner." This is operationally fine for an app's internal bed music; might not be fine if you ever wanted to register the track with a PRO.
Udio:
- Following its own licensing deal, Udio is becoming a "walled garden" where tracks may not leave the platform. Commercial use outside Udio's environment is becoming restricted.
- Less safe for an app's use case than Suno as of mid-2026.
Spotify and Apple Music AI disclosure: starting late 2025, both platforms require disclosure of AI-generated audio on uploaded tracks. This is for streaming-platform uploads, not for embedded use inside a meditation app. Your app's audio is not subject to these rules unless you also distribute the music separately.
Most wellness apps employ internal audio teams that are notoriously hard to recruit out of. The realistic paths:
If Jeff wants to ramp internally instead:
The right line for Affirmology:
Do not chase competitive loudness. Past -14 LUFS for meditation content, the nervous system reads it as "intrusive" and the dropout rate spikes. Many meditation creators have explicitly mentioned this in app-store reviews of competitors: "the voice is too in-your-face." -16 LUFS is the right floor for nervous-system-aware audio.
Apple's Spatial Audio with Dolby Atmos is becoming the premium-tier expectation, especially after AirPods adoption became ubiquitous. Apple Music gives spatial-audio tracks up to 10% higher royalty share - not relevant for an app, but it indicates platform priority.
For meditation specifically:
- The case for Atmos: head-tracked spatial audio creates a genuinely immersive "container" feel; competitors will move here.
- The case against: Atmos production requires specialized monitoring (Logic Pro + 7.1.4 monitoring setup or Dolby renderer license, ~$300+). Production time per track goes up 3-5x. The audience that can actually hear spatial audio is narrower than it seems.
Recommendation: ship stereo for v1. Plan a spatial-audio premium tier in a future release once the stereo product proves out. The Calm-quality stereo product is a 12-month goal; Atmos is a 24-month goal.
Based on informal measurement of representative tracks (Calm's "Loving Kindness" introductory meditation, Headspace's "Basics 1" Andy Puddicombe, Insight Timer's top-creator Sarah Blondin and Tara Brach):
A Calm-meets-CHANI sonic signature: warm low-mids, intentional reverb space, generous music bed presence, slightly more dynamics than Headspace allows. Personality forward - this is not generic guided meditation, it's a personalized invocation. The reverb and the warmth do that work.
The current pipeline mixes voice + music with basic loudnorm. The professional chain has more stages. Concrete proposal:
Stage 1: Voice pre-processing (per-segment ElevenLabs output)
ffmpeg -i raw_voice.wav -af "\
highpass=f=80:p=2, \
equalizer=f=250:width_type=q:width=1.0:g=2.5, \
equalizer=f=400:width_type=q:width=2.0:g=-2.0, \
equalizer=f=3000:width_type=q:width=1.2:g=1.5, \
lowpass=f=15000:p=2, \
acompressor=threshold=-25dB:ratio=3:attack=5:release=80:makeup=2 \
" -ar 48000 -c:a pcm_s24le voice_processed.wav
This is HPF 80Hz, +2.5 dB at 250 Hz (warmth), -2 dB at 400 Hz (boxiness cut), +1.5 dB at 3 kHz (presence), LPF 15 kHz, and a gentle compressor for consistency.
Stage 2: Voice de-essing
ffmpeg -i voice_processed.wav -af "\
deesser=i=0.4:m=0.5:f=0.5:s=o \
" -ar 48000 -c:a pcm_s24le voice_deessed.wav
The FFmpeg deesser is crude; for higher quality, swap to a Python-side iZotope RX batch or a TDR Nova plugin host running in a CLI wrapper.
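Staying inside FFmpeg, the sidechained-compressor de-essing approach described earlier (bandpass around 6500 Hz, ratio 4:1, threshold about -22 dB, attack 1 ms, release 100 ms) can be expressed as a filtergraph. The sketch below only builds the graph string. The label names and the bandpass Q of 2 are assumptions, and note that this is a wideband de-esser: the whole voice is turned down while sibilance is detected, not just the sibilant band.

```python
def deesser_graph(center_hz=6500, threshold_db=-22.0, ratio=4,
                  attack_ms=1, release_ms=100, q=2):
    """Build a filter_complex string for a sidechain-keyed de-esser."""
    # sidechaincompress takes its threshold as a linear amplitude value
    threshold = round(10 ** (threshold_db / 20), 4)
    return (
        "[0:a]asplit=2[main][key];"
        # the key signal is a band-passed copy that isolates sibilance
        f"[key]bandpass=f={center_hz}:width_type=q:w={q}[sib];"
        f"[main][sib]sidechaincompress=threshold={threshold}:ratio={ratio}"
        f":attack={attack_ms}:release={release_ms}[out]"
    )


print(deesser_graph())
```

The printed graph would be passed to `ffmpeg -i voice_processed.wav -filter_complex "<graph>" -map "[out]" ...` in place of the deesser call, then compared by ear against the built-in filter.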
Stage 3: Voice reverb (the key "produced feeling" step)
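ffmpeg -i voice_deessed.wav -i bricasti_smallroom_IR.wav \
-filter_complex "[0:a]asplit=2[dry][send];[send][1:a]afir=dry=1:wet=1:length=1[wet]; \
[dry][wet]amix=inputs=2:duration=longest:weights='0.88 0.12':normalize=0[out]" \
-map "[out]" -ar 48000 -c:a pcm_s24le voice_reverbed.wav
Use a free Samplicity Bricasti IR. afir's dry and wet options are linear input and output gains rather than a dB mix control, and the filter outputs only the convolved signal, so the blend is set by the amix weights: 0.88 dry / 0.12 wet gives the ~12% wet feel - the Calm signature.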
Stage 4: Mix voice + music with sidechain ducking
ffmpeg -i voice_reverbed.wav -i music_bed.wav \
-filter_complex "\
[1:a]highpass=f=100, \
equalizer=f=1500:width_type=q:width=1.0:g=-3.5, \
stereotools=mlev=0.8:slev=1.3[music_carved]; \
[0:a]asplit=2[v1][v2]; \
[music_carved][v2]sidechaincompress=threshold=0.04:ratio=4:attack=30:release=600:makeup=1[ducked]; \
[v1]pan=stereo|c0=c0|c1=c0[voice_mono]; \
[voice_mono][ducked]amix=inputs=2:duration=longest:weights='1.0 0.55'[mix] \
" -map "[mix]" -ar 48000 -c:a pcm_s24le full_mix.wav
This high-passes the music at 100 Hz, carves a -3.5 dB hole at 1.5 kHz (vocal pocket), widens the music's stereo image, sidechain-ducks it under voice, mono-centers the voice, and mixes at a 1.0 voice / 0.55 music ratio (~ -5 dB music below voice).
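How far the music actually ends up below the voice depends on both the static weight and the ducking. A rough estimate, assuming an idealized hard-knee compressor curve (FFmpeg's sidechaincompress has a soft knee and attack/release smoothing, so real figures will differ) and a hypothetical voice level of -12 dBFS:

```python
import math


def duck_reduction_db(key_level_db, threshold_linear, ratio):
    """Static gain reduction (dB) of a hard-knee compressor for a given key level."""
    threshold_db = 20 * math.log10(threshold_linear)
    over = key_level_db - threshold_db
    if over <= 0:
        return 0.0                      # key is below threshold: no ducking
    return over * (1 - 1 / ratio)


voice_level_db = -12.0                  # assumed voice level, for illustration only
reduction = duck_reduction_db(voice_level_db, 0.04, 4)
static_offset = 20 * math.log10(0.55)   # the 0.55 amix weight expressed in dB
print(round(reduction, 1))
print(round(static_offset - reduction, 1))
```

With these assumptions the bed sits a little over 17 dB under its unducked, unweighted level during speech, just short of the -18 to -24 dB range quoted earlier for voice passages, and returns to about -5 dB in the gaps.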
Stage 5: Master loudnorm pass (two-pass)
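# Pass 1 (measure through the same high-pass that pass 2 applies, so the measured values match)
ffmpeg -i full_mix.wav -af "highpass=f=30,loudnorm=I=-16:TP=-1.0:LRA=11:print_format=json" -f null -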
# Pass 2 (using values from pass 1)
ffmpeg -i full_mix.wav -af "\
highpass=f=30, \
loudnorm=I=-16:TP=-1.0:LRA=11:measured_I=...:measured_TP=...:measured_LRA=...:measured_thresh=...:offset=...:linear=true \
" -ar 48000 -c:a pcm_s24le final_master.wav
Stage 6: Deliverable encode
# AAC 128 kbps for iOS / web
ffmpeg -i final_master.wav -c:a aac -b:a 128k final.m4a
# MP3 192 kbps CBR for compatibility (-q:a selects VBR quality and conflicts with -b:a, so only one is used)
ffmpeg -i final_master.wav -c:a libmp3lame -b:a 192k final.mp3
A pragmatic priority order:
Build the FFmpeg chain above first. Free. Gets you 75-80% of Calm-quality. The breath-sound layer, the reverb on voice, and the sidechain ducking are the three biggest perceptual upgrades.
Add a paid Suno plan for music bed variety (Premier at $30/mo; both Pro and Premier grant commercial use rights for tracks generated during an active subscription). Solves the music problem for personalized content.
Hire a wellness-tagged mastering engineer once (~$500-1500, one-time). Have them design the chain spec, A/B against your FFmpeg output, codify their settings into the pipeline. Their job is not to master every session - it's to design the algorithm.
Buy iZotope Ozone 11 Standard (~$249 on sale). Use it on the flagship/hero content (homepage demo, investor pitch, brand-defining 10 tracks). Its Master Assistant will get you most of the way; the Imager, Multiband, and Maximizer modules are defensible production.
Skip Audiomovers for now. It's a remote-collaboration tool (real-time audio between studios). Not relevant until you have multiple producers working together.
Skip the $100/track engineer service for batch output. Once the chain is good, paying per-session for a generative product breaks the unit economics.
The current Affirmology pipeline (ElevenLabs + FFmpeg loudnorm) is producing audio at maybe 60-65% of Calm-quality. The three highest-leverage upgrades are, in order:
- The breath-sound layer.
- Reverb on the voice, via the afir filter.
- Sidechain ducking of the music bed, via sidechaincompress in FFmpeg.
Together those three changes take you from "good demo audio" to "indistinguishable from Calm in a blind A/B" for the vast majority of listeners. The remaining 5-10% is the difference a hired engineer can specify - and once specified, can be baked into the pipeline permanently.
The pipeline does not need to become more expensive. It needs to become more deliberate.
LUFS targets and loudness standards:
- Podcast Loudness Standards 2026: Spotify, Apple, YouTube (SONE)
- The Only LUFS Guide You Need in 2026 (Horia Stan)
- LUFS Targets for Every Streaming Platform 2026 (UpTrack)
- The Ultimate Guide to Streaming Loudness LUFS Table 2026 (Soundplate)
- Podcast Loudness Standard: Perfecting Your Sound in 2026 (Descript)
FFmpeg loudnorm:
- FFmpeg Audio Normalization: The Complete loudnorm Guide (32blog)
- How to Use ffmpeg loudnorm: LUFS Normalization and 2-Pass Settings
- loudnorm filter documentation (k.ylo.ph)
- FFmpeg sidechain ducking (FFmpeg-user list)
Voice EQ for spoken word:
- EQ: Warm a Voice and Improve Clarity (Larry Jordan)
- Voice EQ - The Best Settings (Music Guy Mixing)
- How to EQ Vocals (iZotope)
- How to EQ Speech for Maximum Intelligibility (Behind The Mixer)
- The Complete Guide to Mixing Voice: EQ (Pro Audio Files)
Polyvagal and prosody:
- Talk Time Featuring Dr. Stephen Porges (Dr. Rebecca Jorgensen)
De-essing:
- De-essing (Wikipedia)
- Techniques For Vocal De-essing (Sound on Sound)
- Advanced Sibilance Control: Beyond Simple De-Essing (Mike's Mix and Master)
- Vocal Sibilance (Pro Audio Files)
Sidechain ducking:
- Side Chain Compression in Reaper, Ducking for Voice Overs (iBlindTech)
- What is Sidechain Compression? (Sweetwater)
- Ducking music volume for voice narration (VI-CONTROL)
Spectral separation:
- 7 Tips for Using Subtractive EQ (iZotope)
- Frequency Masking Guide (The Producer School)
- How To Create Separation In Your Mixes Using EQ (Audio Issues)
Convolution reverb and IRs:
- Bricasti M7 impulse response files (Samplicity)
- Convolution Reverb: The Hidden Secret to Realistic Spaces (EDMProd)
- Free Impulse Responses: 4 Reverb Packs (Resound Sound)
- Best Reverb Plugins (Musiversal)
ElevenLabs and TTS:
- Can you make voices produce the sound of breathing? (ElevenLabs Help)
- How to make Text to Speech sound less robotic (ElevenLabs Blog)
- ElevenLabs Best Practices
Music licensing:
- Artlist vs Epidemic Sound 2026 (CC Hound)
- Suno Commercial Use: Free vs Pro Rights 2026 (Dynamoi)
- Suno adjusts AI music ownership terms (Music In Africa / Warner deal)
- What Suno and Udio Licensing Deals Mean (Billboard)
- The 2026 Suno AI Legal Guide (Sonic Analytics)
Mastering chain and tools:
- Pro Mastering Chain: The Building Blocks (mastering.com)
- Mastering Chain: 7 Stages That Shape Your Master (LANDR)
- iZotope Ozone 12 vs FabFilter Pro-L 2 2026 (PluginDrop)
- FabFilter Pro-L 2 vs popular limiters (Gearshoot)
- Mastering Audio (Bob Katz book review, Sound on Sound)
- Mastering Audio: The Art and the Science (Routledge)
Hire vs DIY rates:
- Mastering Rates in 2026 (Alexander Wright Mastering)
- Mastering Engineer Hourly Rates (Twine)
- How can you determine a fair rate for audio mastering (LinkedIn)
Mono vs stereo positioning:
- Mono vs Stereo for Podcasting (The Podcast Host)
- Why mono is better than stereo for vocals and dialogue (Audio Masterclass)
- Should You Podcast in Mono or Stereo? (Audacity to Podcast)
Format and delivery:
- Audio Bitrate Guide (AudioUtils)
- Audio Bitrate Complete Guide 2026 (Fyletools)
Apple Spatial Audio:
- About Spatial Audio with Dolby Atmos (Apple)
- What to know about Spatial Audio (Apple Music for Artists)
- Apple unveils new spatial audio format ASAF (TechRadar)
High-pass filter for spoken word:
- How To Use a High-pass Filter for Voice Clarity (Podcast Engineering School)
- Mastering Dialogue for Podcasts (Sage Audio)
Insight Timer creator standards:
- Recording Tips (Insight Timer Support)
- Best Practices for Content (Insight Timer Support)