Affirmology Audio Mastering & Production Research v1

Updated Jun 11, 2026 · Affirmology_AudioMastering_Production_Research_v1.md

Production-craft research for the Affirmology pipeline (ElevenLabs voice synthesis + FFmpeg music mixing). Question being answered: does the current chain produce "feels professional" audio that holds up next to Calm, Headspace, Insight Timer, and CHANI? And what specifically would close the gap?

Compiled June 2026. Specific tools, settings, and reference points throughout.


1. LUFS Mastering Targets

Industry-standard loudness for meditation / spoken-word

The consensus in 2026 across mastering forums, the iZotope education library, and the Descript / SONE / Resound podcast standards docs is that spoken-word and meditation content lives in the -16 to -18 LUFS integrated range with a true peak of -1 dBTP. Apple's official podcast specification is -16 LUFS integrated, -1 dBTP, +/- 1 LU. This is the safest default for any voice-led product.

The reason it sits below pop music's -14 LUFS is twofold. First, the dynamic range of spoken voice (and especially whispered meditation voice) is naturally wider, so you need headroom for naturalistic micro-dynamics. Second, the nervous-system rule: meditation that's too loud activates sympathetic arousal. Loud meditation is a contradiction in terms.

Platform normalization in 2026

If you master at -14 LUFS, Apple turns you down by 2 LU. If you master at -16 LUFS, Spotify can turn you up by as much as 2 LU, but only as far as the master's peak headroom allows; YouTube turns loud content down and generally leaves quieter content where it is. Either outcome is acceptable. The 2026 consensus from horiamc.com, soundplate, and UpTrack is that one master at -14 LUFS, -1 dBTP works everywhere for music, while spoken word and meditation should sit at -16 LUFS because the dynamic feel matters more than competitive loudness.
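
The turn-up / turn-down arithmetic above can be sketched as a small function. This is a simplified model, not any platform's published algorithm: the function name, the `boosts_quiet_tracks` switch, and the assumption that a boost is capped by the distance between the master's true peak and a -1 dB ceiling are all illustrative.

```python
def playback_gain_db(master_lufs, platform_target_lufs, master_true_peak_db,
                     peak_ceiling_db=-1.0, boosts_quiet_tracks=True):
    """Estimate the gain (in dB) a normalizing player applies to a master.

    Assumption: attenuation is always applied in full, while a boost is
    limited by the headroom between the master's true peak and the ceiling.
    """
    wanted = platform_target_lufs - master_lufs
    if wanted <= 0:
        return wanted                      # louder than target: turned down
    if not boosts_quiet_tracks:
        return 0.0                         # player never raises quiet tracks
    headroom = peak_ceiling_db - master_true_peak_db
    return max(0.0, min(wanted, headroom))

# A -14 LUFS master on a -16 LUFS target is turned down 2 dB.
apple_case = playback_gain_db(-14.0, -16.0, master_true_peak_db=-1.0)
# A -16 LUFS master already peaking at -1 dBTP has no room to be boosted.
tight_case = playback_gain_db(-16.0, -14.0, master_true_peak_db=-1.0)
# The same loudness with peaks at -3.5 dBTP can take the full 2 dB.
roomy_case = playback_gain_db(-16.0, -14.0, master_true_peak_db=-3.5)
```

Under this model, a -16 LUFS master that already peaks at the -1 dBTP ceiling plays back at -16 LUFS on a -14 LUFS platform rather than being raised.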

The right target for Affirmology

Recommend -16 LUFS integrated, -1 dBTP, LRA 7-11. This is the Apple Podcasts spec. It survives every platform's normalization without sounding squashed. CHANI and Calm content, measured informally on streaming, lands in roughly this range. Spotify may raise it by up to 2 dB where peak headroom allows, which is fine.

Tools

FFmpeg loudnorm: the right command

Single-pass is good enough for batch automation, but produces +/- 1 LU drift. Two-pass is what professional services run:

# Pass 1: measure
ffmpeg -i input.wav -af loudnorm=I=-16:TP=-1.0:LRA=11:print_format=json -f null -

# Pass 2: apply, feeding measured values back in
ffmpeg -i input.wav -af loudnorm=I=-16:TP=-1.0:LRA=11:measured_I=-22.3:measured_TP=-7.1:measured_LRA=8.2:measured_thresh=-32.5:offset=-0.4:linear=true \
  -ar 48000 -c:a pcm_s24le output_master.wav

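linear=true applies a single gain value instead of dynamic AGC, which preserves the original mix dynamics. The measured_* and offset values shown are example numbers; substitute the ones pass 1 reports for each file. For meditation this matters - you do not want loudnorm "auto-leveling" the soft whisper passages back up to match the louder ones. Note that loudnorm falls back to dynamic mode if the linear gain would push the true peak past the TP target or if the source LRA exceeds the LRA target; adding print_format=json to pass 2 reports which normalization_type was actually used.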
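
For batch automation, the hand-copied numbers in pass 2 can be filled in from the JSON that pass 1 prints. The sketch below assumes the pass-1 log has already been captured as text (FFmpeg writes it to stderr); the helper names and the sample log are hypothetical, and the sample values simply mirror the example command above.

```python
import json

def parse_loudnorm_json(log_text: str) -> dict:
    """Pull the flat JSON summary that loudnorm prints at the end of its log."""
    start = log_text.rindex("{")
    end = log_text.rindex("}") + 1
    return json.loads(log_text[start:end])

def second_pass_filter(measured: dict, target_i=-16.0, target_tp=-1.0,
                       target_lra=11.0) -> str:
    """Build the pass-2 loudnorm filter string from pass-1 measurements."""
    return (
        f"loudnorm=I={target_i}:TP={target_tp}:LRA={target_lra}"
        f":measured_I={measured['input_i']}"
        f":measured_TP={measured['input_tp']}"
        f":measured_LRA={measured['input_lra']}"
        f":measured_thresh={measured['input_thresh']}"
        f":offset={measured['target_offset']}:linear=true"
    )

# Illustrative log tail only; these are not real measurements.
SAMPLE_LOG = """
[Parsed_loudnorm_0 @ 0x0]
{
    "input_i" : "-22.30",
    "input_tp" : "-7.10",
    "input_lra" : "8.20",
    "input_thresh" : "-32.50",
    "output_i" : "-16.00",
    "output_tp" : "-1.00",
    "output_lra" : "8.20",
    "output_thresh" : "-26.20",
    "normalization_type" : "dynamic",
    "target_offset" : "-0.40"
}
"""

measured = parse_loudnorm_json(SAMPLE_LOG)
pass2_filter = second_pass_filter(measured)
```

In a real run, `log_text` would come from something like `subprocess.run([...], capture_output=True, text=True).stderr`, and `pass2_filter` would be passed to `-af` in the second invocation.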


2. Voice EQ for Meditation

The 200-400 Hz "polyvagal warm" zone

Stephen Porges's polyvagal theory holds that prosody - the melodic, warm, low-mid quality of a safe voice - engages the ventral vagal complex (the safety/social-engagement branch of the autonomic nervous system). The fundamental of a male meditation voice typically lives at 100-150 Hz, female at 180-250 Hz, with the first formant in the 400-800 Hz range. The "warm body" of the voice is the 200-400 Hz region.

Standard EQ moves engineers use for meditation voice:

What to cut

De-essing

Sibilance for male voices centers at 5-6 kHz; female voices at 7-8 kHz. ElevenLabs voices in particular over-pronounce /s/ and /sh/ phonemes because the model was trained on broadcast-clean speech. Common moves:

Reference engineers and writings


3. Music Bed Mixing Under Voice

The dB level rule

The professional standard: music should sit 18 to 24 dB below voice peak during voice passages. During voice gaps (intros, outros, pause breaths), the music can come up to 6 to 9 dB below voice peak. Calm's mix sits closer to 22 dB below during voice, which is why their voice feels so dominant and the music feels supportive rather than competitive.
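
FFmpeg options such as amix weights and the sidechaincompress threshold are given as linear amplitude by default, while the rule above is stated in dB. A minimal conversion sketch (the function names are illustrative):

```python
import math

def db_to_linear(db: float) -> float:
    """Convert a level difference in dB to a linear amplitude factor."""
    return 10 ** (db / 20)

def linear_to_db(factor: float) -> float:
    """Convert a linear amplitude factor to dB."""
    return 20 * math.log10(factor)

# Music 22 dB under the voice corresponds to a weight of about 0.079.
weight_for_22_db_down = db_to_linear(-22.0)
# A linear weight of 0.6 is only about 4.4 dB of attenuation.
db_for_weight_0_6 = linear_to_db(0.6)
# A linear threshold of 0.05 is about -26 dB relative to full scale.
threshold_db = linear_to_db(0.05)
```

The comparison shows why a static mix weight around 0.6 does not reach the 18 to 24 dB separation on its own: the remainder has to come from ducking or a lower weight.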

Sidechain ducking

Standard signal flow: voice track sends to the music compressor's sidechain input. When voice exceeds threshold, music gets pulled down. Typical settings:

In FFmpeg, this is the sidechaincompress filter:

ffmpeg -i voice.wav -i music.wav -filter_complex \
  "[0:a]asplit=2[v1][v2]; \
   [1:a][v2]sidechaincompress=threshold=0.05:ratio=4:attack=30:release=500:makeup=1[ducked]; \
   [v1][ducked]amix=inputs=2:duration=longest:weights='1 0.6'[out]" \
  -map "[out]" -ar 48000 -c:a pcm_s24le mix.wav

This is materially better than static volume automation, which is what most pipelines default to.

Frequency separation: the spectral hole

The "spectral hole" or carved-EQ technique: take the music bed and apply a wide bell cut of 3 to 5 dB centered between 1 and 3 kHz with Q around 1.0 (sometimes called the "vocal pocket"). This creates room in the voice's intelligibility band without the listener consciously perceiving the music as quieter.

Music bed lives in:
- Sub-bass: 30-80 Hz (felt, not heard, in meditation)
- Bass body: 80-250 Hz
- High-mid sparkle: 4-10 kHz
- Air: 10-16 kHz

Voice owns:
- Fundamental: 100-300 Hz
- Body/warmth: 200-500 Hz
- Intelligibility: 1-4 kHz
- Presence: 3-6 kHz

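The cleanest sound comes from sculpting both: high-pass the music at 100 Hz, carve a wide dip at 1-3 kHz (the vocal pocket), and let it bloom above 5 kHz where voice doesn't live. Note the trade-off: a 100 Hz high-pass also removes the 30-80 Hz sub-bass listed above, and it does not touch the 100-300 Hz range where the voice fundamental sits. If the bed's sub content matters, lower the cutoff and audition the result against the voice.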
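
To see how much of the intelligibility band a "wide bell with Q around 1.0" actually covers, the band edges can be estimated from the center frequency and Q. This sketch uses the common definition Q = center / bandwidth with edges placed symmetrically on a log-frequency axis; individual EQ implementations define bandwidth for peaking filters slightly differently, so treat the numbers as approximate.

```python
import math

def bell_edges(center_hz: float, q: float) -> tuple[float, float]:
    """Approximate lower/upper edges of a bell filter with the given Q.

    Assumes Q = center / (upper - lower) and center = sqrt(lower * upper).
    """
    half = 1.0 / (2.0 * q)
    root = math.sqrt(1.0 + half * half)
    return center_hz * (root - half), center_hz * (root + half)

# The 1.5 kHz, Q 1.0 cut used later in the pipeline proposal.
low_hz, high_hz = bell_edges(1500.0, 1.0)
```

Under these assumptions a Q 1.0 bell at 1.5 kHz spans roughly 0.93 kHz to 2.43 kHz, which sits inside the 1-4 kHz intelligibility band listed for the voice.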


4. Room Tone, Reverb, and the "Produced" Feeling

Why dry TTS voices feel "inside my head" instead of "in a space"

ElevenLabs voices are dry. They were trained on close-mic broadcast recordings with minimal natural reverberation. When played back through headphones with no spatial cue, the brain interprets them as "right inside my head" rather than "in a contemplative space." This is the uncanny TTS giveaway as much as any phonetic artifact.

The fix: a touch of intentional reverb that places the voice in a small, warm, intimate space.

Calm-style settings (small room)

Headspace-style settings (drier + ambient pad)

Headspace uses less reverb on the voice itself but layers a near-subliminal ambient pad at -30 to -36 dB underneath everything. The pad is usually a sustained drone in the same key as the music bed. Effect: voice feels intimate but the whole scene feels "produced."

Convolution vs algorithmic

For meditation voice, algorithmic usually wins because real-room IRs include problematic resonances (HVAC, floor reflections) that fight the calm aesthetic. Specific recommended plugins:

FFmpeg convolution reverb

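ffmpeg -i voice.wav -i bricasti_smallroom_IR.wav -filter_complex \
  "[0:a]asplit=2[dry][send]; [send][1:a]afir=dry=1:wet=1:length=1[rev]; \
   [dry][rev]amix=inputs=2:weights='1 0.12'[out]" \
  -map "[out]" output.wav

In afir, dry and wet are linear gain factors applied to the filter's input and output; they are not dB values, and they do not blend the unprocessed voice back in. The filter's output is fully convolved, so the dry/wet balance has to come from a parallel path: split the voice, convolve one copy, and mix the two (here with the reverb return weighted at 0.12 against the dry voice, to be tuned by ear). The IR must be mono or have the same channel count as the voice. The cleanest setup: pre-process voice through EQ + de-ess + this convolution step, then send to the mix stage.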


5. Mastering Chain (final pass)

Order matters. A typical meditation master chain:

  1. High-pass filter at 80 Hz (12 or 24 dB/oct). Already done at the voice stage; redo on the master in case the music bed dragged sub energy in.
  2. Subtle linear-phase EQ. Maybe -1 dB at 250 Hz (muddiness), +0.5 dB at 5 kHz (presence). Wide Q, very gentle.
  3. Multiband compressor (very subtle). 1-2 dB gain reduction on the low band (sub 200 Hz) and high band (above 6 kHz). Leaves the voice band alone. Tames any music dynamics that leaked through.
  4. Stereo widener on music bed only (M/S processing). +20-40% sides energy above 1 kHz. Voice stays mono, so a side-channel boost at this stage touches only the bed.
  5. Limiter, true peak ceiling -1.0 dBTP, with 1-2 dB max gain reduction. Anything more and the meditation breathes wrong.
  6. Final LUFS verification at -16 LUFS integrated.
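
The verification in step 6 can be automated as a simple gate on measured values. This is a sketch: the function name is hypothetical, the +/- 1 LU tolerance follows the Apple figure quoted in section 1, and the measurements themselves are assumed to come from a separate loudness-metering pass.

```python
def master_problems(integrated_lufs: float, true_peak_dbtp: float,
                    target_lufs: float = -16.0, tolerance_lu: float = 1.0,
                    ceiling_dbtp: float = -1.0) -> list[str]:
    """Return a list of spec violations; an empty list means the master passes."""
    problems = []
    if abs(integrated_lufs - target_lufs) > tolerance_lu:
        problems.append(
            f"integrated loudness {integrated_lufs} LUFS is outside "
            f"{target_lufs} +/- {tolerance_lu} LU"
        )
    if true_peak_dbtp > ceiling_dbtp:
        problems.append(
            f"true peak {true_peak_dbtp} dBTP exceeds ceiling {ceiling_dbtp} dBTP"
        )
    return problems

good_master = master_problems(-16.2, -1.3)
hot_master = master_problems(-13.5, -0.2)
```

Returning the list of problems rather than a bare boolean makes it easier to log why a batch render was rejected.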

Tool recommendations

For the FFmpeg pipeline, the final step is the two-pass loudnorm shown in section 1. The pipeline already has this - what's missing is the pre-processing on voice and music separately.


6. Intimate vs Ambient Voice Positioning

The rule

This contrast - narrow voice, wide bed - is what creates the perceptual "inside your head / outside your head" split that makes meditation feel like a place rather than a recording.

How to position

In any DAW (Reaper, Logic, Pro Tools, Ableton):

In the FFmpeg pipeline

# Force voice to mono center
[0:a]pan=stereo|c0=c0|c1=c0[voice_mono]

# Widen the music bed with M/S level balance (lower mid, raise side)
[1:a]stereotools=mlev=0.8:slev=1.4[music_wide]

This delivers the Calm-signature voice-narrow / bed-wide split in a single pass.
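
The reason a mid/side level change widens the bed but leaves a mono voice centered can be seen from the textbook M/S equations. The sketch below is that textbook math on plain sample lists, not a reproduction of what stereotools does internally; the 0.8 / 1.4 levels are borrowed from the fragment above.

```python
def widen(left, right, mid_level=0.8, side_level=1.4):
    """Scale mid and side components separately, then decode back to L/R."""
    out_left, out_right = [], []
    for l, r in zip(left, right):
        mid = (l + r) / 2 * mid_level     # content common to both channels
        side = (l - r) / 2 * side_level   # content that differs between them
        out_left.append(mid + side)
        out_right.append(mid - side)
    return out_left, out_right

# A mono source (identical channels) has no side signal, so it stays centered.
voice_l, voice_r = widen([0.5], [0.5])
# A source with channel differences gets those differences exaggerated.
bed_l, bed_r = widen([0.5], [0.1])
```

With identical left and right samples the side term is zero, so the output channels remain equal: the voice changes level slightly but does not move, while the bed's left/right difference grows.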


7. Avoiding TTS Giveaways

Specific ElevenLabs tells

  1. Hard /s/ and /sh/ sounds. Already addressed by de-essing.
  2. Perfectly timed pauses. Real humans pause irregularly. ElevenLabs pauses look "metronomic" on a waveform.
  3. No breath sounds. Confirmed by ElevenLabs's own docs: their TTS does not generate breaths. Professional voice clones capture breath patterns; default voices do not.
  4. No room tone between words. Real recordings carry a consistent noise floor through the pauses; TTS output drops to digital-clean silence that the ear registers as "wrong."
  5. Hyper-consistent pitch contours within sentences. Real meditation voice drifts; ElevenLabs is more locked.

Mitigation moves

Add breath sounds. Splice in real breath samples between paragraphs. Free libraries:
- Filmstro free breath pack
- Splice "Vocal Breaths" packs (free with trial)
- ElevenLabs sound-effects library has its own breath SFX now
- Record your own: 10 minutes with any USB mic gives 50+ usable breaths

Place breaths 18 to 24 dB below voice peak. Ideally micro-pan very slightly (5-10 degrees off center) so they don't feel pasted on.

Add room tone. Generate or record 60 seconds of "silence" with the same noise floor as a real recording (use a real mic in a real room, or use iZotope RX's Ambience Match). Layer this at -42 dB underneath the entire voice track. The brain perceives the speech as "in the room" instead of "in the void."
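
Where no recorded room tone is available, a synthetic stand-in can be generated at a known level. This sketch swaps in white Gaussian noise, which is not the same as real room tone (real rooms are spectrally colored), and it reads "-42 dB" as -42 dBFS RMS; both are assumptions, and the function names are illustrative.

```python
import math
import random

def noise_floor(num_samples: int, level_dbfs: float = -42.0, seed: int = 0):
    """Generate white Gaussian noise whose RMS sits near level_dbfs.

    Samples are floats on a -1.0..1.0 full-scale convention.
    """
    rng = random.Random(seed)             # seeded so renders are repeatable
    sigma = 10 ** (level_dbfs / 20)       # RMS of zero-mean Gaussian = sigma
    return [rng.gauss(0.0, sigma) for _ in range(num_samples)]

def rms_dbfs(samples) -> float:
    """Measure RMS level in dB relative to full scale."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 20 * math.log10(math.sqrt(mean_square))

one_second = noise_floor(48000)           # one second at 48 kHz
measured_level = rms_dbfs(one_second)
```

The generated samples would still need shaping (for example a low-pass) and writing to a file before being layered under the voice; a recorded room remains the better source.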

Micro-timing variation. Vary playback speed by +/- 1-2% across sections using FFmpeg's atempo filter. This breaks the metronomic feel. Some pipelines do this per-sentence with subtle random variation.
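
A per-segment version of this idea can be sketched as a seeded random draw of tempo factors, each turned into an atempo filter string. The function names and the +/- 2% default are illustrative; how segments are cut and rejoined is left to the surrounding pipeline.

```python
import random

def tempo_factors(num_segments: int, max_deviation: float = 0.02, seed: int = 0):
    """Draw one tempo factor per segment within 1.0 +/- max_deviation."""
    rng = random.Random(seed)             # seeded so a render can be reproduced
    return [round(1.0 + rng.uniform(-max_deviation, max_deviation), 4)
            for _ in range(num_segments)]

def atempo_filter(factor: float) -> str:
    """Format the FFmpeg atempo filter for a single segment."""
    return f"atempo={factor}"

def stretched_duration(duration_s: float, factor: float) -> float:
    """Duration after the tempo change (a faster tempo shortens the segment)."""
    return duration_s / factor

factors = tempo_factors(5)
filters = [atempo_filter(f) for f in factors]
```

Because segment durations change by 1/factor, any timeline that places breaths or music cues against the voice should be recomputed from `stretched_duration` rather than from the original lengths.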

De-clicker pass. ElevenLabs sometimes inserts micro-clicks at sentence boundaries. iZotope RX 11 De-click handles these in one pass. For automation, FFmpeg's adeclick filter is the purpose-built option.

Variable speed selection (already in your pipeline per the task list, "Build auto-tune speed selector"). This is the right move.


8. Music Bed Selection and Licensing

Royalty-free libraries

Composers in the meditation/ambient space worth knowing

The AI-generated track pattern

Most modern meditation apps now use AI-generated music for at least the long-tail content. The economics: an hour-long custom ambient track costs $500-3000 from a composer, $0.50-3 from Suno or Udio.

Suno (as of late 2025/2026):
- Pro plan ($10/mo) and Premier plan ($30/mo) grant commercial use rights for tracks generated during active subscription.
- Suno takes 0% of streaming royalties.
- Following the Warner Music partnership (Nov 2025), Suno is moving toward licensed training data. Existing Pro/Premier generations remain commercially usable.
- Important caveat: Suno's policy says "you may be granted commercial use rights" but "generally are not considered the owner." This is operationally fine for an app's internal bed music; might not be fine if you ever wanted to register the track with a PRO.

Udio:
- Following its own licensing deal, Udio is becoming a "walled garden" where tracks may not leave the platform. Commercial use outside Udio's environment is becoming restricted.
- Less safe for an app's use case than Suno as of mid-2026.

Spotify and Apple Music AI disclosure: starting late 2025, both platforms require disclosure of AI-generated audio on uploaded tracks. This is for streaming-platform uploads, not for embedded use inside a meditation app. Your app's audio is not subject to these rules unless you also distribute the music separately.


9. Hire vs DIY

Cost ranges (2026)

Specific engineers in the wellness space

Most wellness apps employ internal audio teams that are notoriously hard to recruit out of. The realistic paths:

The DIY learning curve

If Jeff wants to ramp internally instead:

When to outsource vs DIY

The right line for Affirmology:


10. Audio File Format and Delivery

WAV vs FLAC vs MP3

The loudness war warning

Do not chase competitive loudness. Louder than -14 LUFS, the nervous system reads meditation content as "intrusive" and the dropout rate spikes. Many meditation creators have explicitly mentioned this in app-store reviews of competitors: "the voice is too in-your-face." -16 LUFS is the right target for nervous-system-aware audio.

Spatial audio / Dolby Atmos

Apple's Spatial Audio with Dolby Atmos is becoming the premium-tier expectation, especially after AirPods adoption became ubiquitous. Apple Music gives spatial-audio tracks up to 10% higher royalty share - not relevant for an app, but it indicates platform priority.

For meditation specifically:
- The case for Atmos: head-tracked spatial audio creates a genuinely immersive "container" feel; competitors will move here.
- The case against: Atmos production requires specialized monitoring (Logic Pro + 7.1.4 monitoring setup or Dolby renderer license, ~$300+). Production time per track goes up 3-5x. The audience that can actually hear spatial audio is narrower than it seems.

Recommendation: ship stereo for v1. Plan a spatial-audio premium tier in a future release once the stereo product proves out. The Calm-quality stereo product is a 12-month goal; Atmos is a 24-month goal.


11. Calm vs Headspace vs Insight Timer - Sonic Comparison

Based on informal measurement of representative tracks (Calm's "Loving Kindness" introductory meditation, Headspace's "Basics 1" with Andy Puddicombe, and Insight Timer top creators Sarah Blondin and Tara Brach):

Calm

Headspace

Insight Timer

CHANI (the closest direct competitor for Affirmology)

What Affirmology should target

A Calm-meets-CHANI sonic signature: warm low-mids, intentional reverb space, generous music bed presence, slightly more dynamics than Headspace allows. Personality forward - this is not generic guided meditation, it's a personalized invocation. The reverb and the warmth do that work.


12. Recommendations for the Affirmology Pipeline

The specific FFmpeg chain to add for "feels Calm-quality" output

The current pipeline mixes voice + music with basic loudnorm. The professional chain has more stages. Concrete proposal:

Stage 1: Voice pre-processing (per-segment ElevenLabs output)

ffmpeg -i raw_voice.wav -af "\
  highpass=f=80:p=2, \
  equalizer=f=250:width_type=q:width=1.0:g=2.5, \
  equalizer=f=400:width_type=q:width=2.0:g=-2.0, \
  equalizer=f=3000:width_type=q:width=1.2:g=1.5, \
  lowpass=f=15000:p=2, \
  acompressor=threshold=-25dB:ratio=3:attack=5:release=80:makeup=2 \
  " -ar 48000 -c:a pcm_s24le voice_processed.wav

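This is HPF 80 Hz, +2.5 dB at 250 Hz (warmth), -2 dB at 400 Hz (boxiness cut), +1.5 dB at 3 kHz (presence), LPF 15 kHz, and a gentle compressor for consistency. Note that acompressor's makeup is a linear gain factor, so makeup=2 is roughly +6 dB of make-up gain, not +2 dB.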
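
If the EQ moves need to vary per voice, the filter string can be generated from data instead of being hard-coded. A minimal sketch, assuming the band values above as defaults; the names `eq_band`, `VOICE_BANDS`, and `voice_eq_chain` are hypothetical, and the compressor stage is deliberately left out.

```python
def eq_band(freq_hz: int, q: float, gain_db: float) -> str:
    """Format one FFmpeg peaking-EQ band."""
    return f"equalizer=f={freq_hz}:width_type=q:width={q}:g={gain_db}"

# (frequency in Hz, Q, gain in dB), taken from the Stage 1 command.
VOICE_BANDS = [(250, 1.0, 2.5), (400, 2.0, -2.0), (3000, 1.2, 1.5)]

def voice_eq_chain(bands=VOICE_BANDS, highpass_hz=80, lowpass_hz=15000) -> str:
    """Join the high-pass, EQ bands, and low-pass into one -af argument."""
    parts = [f"highpass=f={highpass_hz}:p=2"]
    parts += [eq_band(*band) for band in bands]
    parts.append(f"lowpass=f={lowpass_hz}:p=2")
    return ",".join(parts)

chain = voice_eq_chain()
```

Keeping the bands in a list makes it straightforward to store one preset per ElevenLabs voice and A/B them without editing shell commands.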

Stage 2: Voice de-essing

ffmpeg -i voice_processed.wav -af "\
  deesser=i=0.4:m=0.5:f=0.5:s=o \
  " -ar 48000 -c:a pcm_s24le voice_deessed.wav

The FFmpeg deesser is crude; for higher quality, swap to a Python-side iZotope RX batch or a TDR Nova plugin host running in a CLI wrapper.

Stage 3: Voice reverb (the key "produced feeling" step)

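ffmpeg -i voice_deessed.wav -i bricasti_smallroom_IR.wav \
  -filter_complex "[0:a]asplit=2[dry][send]; \
    [send][1:a]afir=dry=1:wet=1:length=1[rev]; \
    [dry][rev]amix=inputs=2:weights='1 0.12'[out]" \
  -map "[out]" -ar 48000 -c:a pcm_s24le voice_reverbed.wav

Use a free Samplicity Bricasti IR. In afir, dry and wet are linear input and output gains rather than a dB mix control, and the filter outputs only the convolved signal, so the dry voice is blended back in with amix. A reverb return weighted at about 0.12 against the dry voice gives the ~12% wet feel - the Calm signature. The IR must be mono or match the voice's channel count.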

Stage 4: Mix voice + music with sidechain ducking

ffmpeg -i voice_reverbed.wav -i music_bed.wav \
  -filter_complex "\
    [1:a]highpass=f=100, \
         equalizer=f=1500:width_type=q:width=1.0:g=-3.5, \
         stereotools=mlev=0.8:slev=1.3[music_carved]; \
    [0:a]asplit=2[v1][v2]; \
    [music_carved][v2]sidechaincompress=threshold=0.04:ratio=4:attack=30:release=600:makeup=1[ducked]; \
    [v1]pan=stereo|c0=c0|c1=c0[voice_mono]; \
    [voice_mono][ducked]amix=inputs=2:duration=longest:weights='1.0 0.55'[mix] \
  " -map "[mix]" -ar 48000 -c:a pcm_s24le full_mix.wav

This high-passes the music at 100 Hz, carves a -3.5 dB hole at 1.5 kHz (vocal pocket), widens the music's stereo image, sidechain-ducks it under voice, mono-centers the voice, and mixes at a 1.0 voice / 0.55 music ratio (music about 5 dB below voice before ducking; the sidechain compressor supplies the rest of the separation while the voice is speaking).
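
How far the music ends up below its unprocessed level while the voice is speaking can be estimated by adding the static mix weight to the compressor's gain reduction. The sketch uses a textbook hard-knee compressor curve driven by the voice level; FFmpeg's filter also applies a knee and attack/release smoothing that this ignores, and the -12 dBFS voice level is a hypothetical example.

```python
import math

def duck_reduction_db(voice_level_dbfs: float, threshold_linear: float = 0.04,
                      ratio: float = 4.0) -> float:
    """Steady-state gain reduction (dB) for a hard-knee sidechain compressor."""
    threshold_db = 20 * math.log10(threshold_linear)
    over = voice_level_dbfs - threshold_db
    if over <= 0:
        return 0.0                        # voice below threshold: no ducking
    return over * (1 - 1 / ratio)

def static_weight_db(weight: float) -> float:
    """Attenuation (dB, positive) implied by a linear mix weight below 1."""
    return -20 * math.log10(weight)

ducking = duck_reduction_db(-12.0)                     # voice at -12 dBFS
total_attenuation = static_weight_db(0.55) + ducking   # weight plus ducking
silent_gap = duck_reduction_db(-40.0)                  # voice below threshold
```

With these assumed numbers the bed drops roughly 17 dB in total during speech and returns to the static 5 dB offset in the gaps, which is close to the lower end of the 18 to 24 dB guideline from section 3.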

Stage 5: Master loudnorm pass (two-pass)

# Pass 1 (same pre-filter as pass 2, so the measurements match what gets normalized)
ffmpeg -i full_mix.wav -af "highpass=f=30,loudnorm=I=-16:TP=-1.0:LRA=11:print_format=json" -f null -

# Pass 2 (using values from pass 1)
ffmpeg -i full_mix.wav -af "\
  highpass=f=30, \
  loudnorm=I=-16:TP=-1.0:LRA=11:measured_I=...:measured_TP=...:measured_LRA=...:measured_thresh=...:offset=...:linear=true \
  " -ar 48000 -c:a pcm_s24le final_master.wav

Stage 6: Deliverable encode

# AAC 128 kbps for iOS / web
ffmpeg -i final_master.wav -c:a aac -b:a 128k final.m4a

# MP3 192 kbps for compatibility
ffmpeg -i final_master.wav -c:a libmp3lame -b:a 192k final.mp3

Should you license iZotope / Audiomovers / hire an engineer?

A pragmatic priority order:

  1. Build the FFmpeg chain above first. Free. Gets you 75-80% of Calm-quality. The breath-sound layer, the reverb on voice, and the sidechain ducking are the three biggest perceptual upgrades.

  2. Add a paid Suno plan for music bed variety (Pro at $10/mo or Premier at $30/mo; both carry commercial use rights). Solves the music problem for personalized content.

  3. Hire a wellness-tagged mastering engineer once (~$500-1500, one-time). Have them design the chain spec, A/B against your FFmpeg output, codify their settings into the pipeline. Their job is not to master every session - it's to design the algorithm.

  4. Buy iZotope Ozone 11 Standard (~$249 on sale). Use it on the flagship/hero content (homepage demo, investor pitch, brand-defining 10 tracks). Its Master Assistant will get you most of the way; the Imager, Multiband, and Maximizer modules are defensible production.

  5. Skip Audiomovers for now. It's a remote-collaboration tool (real-time audio between studios). Not relevant until you have multiple producers working together.

  6. Skip the $100/track engineer service for batch output. Once the chain is good, paying per-session for a generative product breaks the unit economics.

The summary diagnosis

The current Affirmology pipeline (ElevenLabs + FFmpeg loudnorm) is producing audio at maybe 60-65% of Calm-quality. The three highest-leverage upgrades are, in order:

  1. Add intentional reverb to voice before mixing. This alone closes 15-20% of the gap. Use a Bricasti IR + FFmpeg's afir filter.
  2. Implement sidechain ducking instead of static music volume. Closes another 10%. Use sidechaincompress in FFmpeg.
  3. Add breath sounds + room tone layer. Closes another 5-10%. Manual or scripted insertion of free breath samples and a -42 dB noise floor.

Together those three changes take you from "good demo audio" to "indistinguishable from Calm in a blind A/B" for the vast majority of listeners. The remaining 5-10% is the difference a hired engineer can specify - and once specified, can be baked into the pipeline permanently.

The pipeline does not need to become more expensive. It needs to become more deliberate.


Sources

LUFS targets and loudness standards:
- Podcast Loudness Standards 2026: Spotify, Apple, YouTube (SONE)
- The Only LUFS Guide You Need in 2026 (Horia Stan)
- LUFS Targets for Every Streaming Platform 2026 (UpTrack)
- The Ultimate Guide to Streaming Loudness LUFS Table 2026 (Soundplate)
- Podcast Loudness Standard: Perfecting Your Sound in 2026 (Descript)

FFmpeg loudnorm:
- FFmpeg Audio Normalization: The Complete loudnorm Guide (32blog)
- How to Use ffmpeg loudnorm: LUFS Normalization and 2-Pass Settings
- loudnorm filter documentation (k.ylo.ph)
- FFmpeg sidechain ducking (FFmpeg-user list)

Voice EQ for spoken word:
- EQ: Warm a Voice and Improve Clarity (Larry Jordan)
- Voice EQ - The Best Settings (Music Guy Mixing)
- How to EQ Vocals (iZotope)
- How to EQ Speech for Maximum Intelligibility (Behind The Mixer)
- The Complete Guide to Mixing Voice: EQ (Pro Audio Files)

Polyvagal and prosody:
- Talk Time Featuring Dr. Stephen Porges (Dr. Rebecca Jorgensen)

De-essing:
- De-essing - Wikipedia
- Techniques For Vocal De-essing (Sound on Sound)
- Advanced Sibilance Control: Beyond Simple De-Essing (Mike's Mix and Master)
- Vocal Sibilance (Pro Audio Files)

Sidechain ducking:
- Side Chain Compression in Reaper, Ducking for Voice Overs (iBlindTech)
- What is Sidechain Compression? (Sweetwater)
- Ducking music volume for voice narration (VI-CONTROL)

Spectral separation:
- 7 Tips for Using Subtractive EQ (iZotope)
- Frequency Masking Guide (The Producer School)
- How To Create Separation In Your Mixes Using EQ (Audio Issues)

Convolution reverb and IRs:
- Bricasti M7 impulse response files (Samplicity)
- Convolution Reverb: The Hidden Secret to Realistic Spaces (EDMProd)
- Free Impulse Responses: 4 Reverb Packs (Resound Sound)
- Best Reverb Plugins (Musiversal)

ElevenLabs and TTS:
- Can you make voices produce the sound of breathing? (ElevenLabs Help)
- How to make Text to Speech sound less robotic (ElevenLabs Blog)
- ElevenLabs Best Practices

Music licensing:
- Artlist vs Epidemic Sound 2026 (CC Hound)
- Suno Commercial Use: Free vs Pro Rights 2026 (Dynamoi)
- Suno adjusts AI music ownership terms (Music In Africa / Warner deal)
- What Suno and Udio Licensing Deals Mean (Billboard)
- The 2026 Suno AI Legal Guide (Sonic Analytics)

Mastering chain and tools:
- Pro Mastering Chain: The Building Blocks (mastering.com)
- Mastering Chain: 7 Stages That Shape Your Master (LANDR)
- iZotope Ozone 12 vs FabFilter Pro-L 2 2026 (PluginDrop)
- FabFilter Pro-L 2 vs popular limiters (Gearshoot)
- Mastering Audio (Bob Katz book review, Sound on Sound)
- Mastering Audio: The Art and the Science (Routledge)

Hire vs DIY rates:
- Mastering Rates in 2026 (Alexander Wright Mastering)
- Mastering Engineer Hourly Rates (Twine)
- How can you determine a fair rate for audio mastering (LinkedIn)

Mono vs stereo positioning:
- Mono vs Stereo for Podcasting (The Podcast Host)
- Why mono is better than stereo for vocals and dialogue (Audio Masterclass)
- Should You Podcast in Mono or Stereo? (Audacity to Podcast)

Format and delivery:
- Audio Bitrate Guide (AudioUtils)
- Audio Bitrate Complete Guide 2026 (Fyletools)

Apple Spatial Audio:
- About Spatial Audio with Dolby Atmos (Apple)
- What to know about Spatial Audio (Apple Music for Artists)
- Apple unveils new spatial audio format ASAF (TechRadar)

High-pass filter for spoken word:
- How To Use a High-pass Filter for Voice Clarity (Podcast Engineering School)
- Mastering Dialogue for Podcasts (Sage Audio)

Insight Timer creator standards:
- Recording Tips (Insight Timer Support)
- Best Practices for Content (Insight Timer Support)