Affirmology Audio Mastering & Production Research v1

Updated Jun 11, 2026 · Affirmology_AudioMastering_Production_Research_v1.md

Production-craft research for the Affirmology pipeline (ElevenLabs voice synthesis + FFmpeg music mixing). Question being answered: does the current chain produce "feels professional" audio that holds up next to Calm, Headspace, Insight Timer, and CHANI? And what specifically would close the gap?

Compiled June 2026. Specific tools, settings, and reference points throughout.

1. LUFS Mastering Targets

Industry-standard loudness for meditation / spoken-word

The consensus in 2026 across mastering forums, the iZotope education library, and the Descript / SONE / Resound podcast standards docs is that spoken-word and meditation content lives in the -16 to -18 LUFS integrated range with a true peak of -1 dBTP. Apple's official podcast specification is -16 LUFS integrated, -1 dBTP, +/- 1 LU. This is the safest default for any voice-led product.

The reason it sits below pop music's -14 LUFS is twofold. First, the dynamic range of spoken voice (and especially whispered meditation voice) is naturally wider, so you need headroom for naturalistic micro-dynamics. Second, the nervous-system rule: meditation that's too loud activates sympathetic arousal. Loud meditation is a contradiction in terms.

Platform normalization in 2026

Spotify: -14 LUFS integrated, ITU-R BS.1770 measurement
Apple Music: -16 LUFS integrated (Sound Check normalization)
Apple Podcasts: -16 LUFS integrated, -1 dBTP
YouTube: -14 LUFS, always-on normalization (cannot be disabled)
TikTok / Instagram / Amazon / Tidal: -14 LUFS
Facebook / Deezer: -16 LUFS

If you master at -14 LUFS, Apple turns you down by 2 LU. If you master at -16 LUFS, Spotify turns you up by 2 LU (positive gain, applied when its normalization headroom allows), while YouTube only turns loud uploads down and leaves a quieter master as it is. Either is acceptable. The 2026 consensus from horiamc.com, soundplate, and UpTrack is: one master at -14 LUFS, -1 dBTP works everywhere for music, but -16 LUFS for spoken word / meditation because the dynamic feel matters more than competitive loudness.
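The normalization arithmetic is simply platform target minus master loudness. A minimal shell sketch follows; the function name norm_offset is illustrative and not part of any platform's tooling, and whether a platform actually applies a positive offset depends on its own rules.

```shell
#!/bin/sh
# norm_offset MASTER_LUFS TARGET_LUFS
# Prints the signed gain (in LU) needed to move a master from its
# integrated loudness to a platform's normalization target.
norm_offset() {
  awk -v master="$1" -v target="$2" 'BEGIN { printf "%+.1f\n", target - master }'
}

norm_offset -14 -16   # -14 LUFS master on a -16 LUFS platform: prints -2.0
norm_offset -16 -14   # -16 LUFS master on a -14 LUFS platform: prints +2.0
```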

The right target for Affirmology

Recommend -16 LUFS integrated, -1 dBTP, LRA 7-11. This is the Apple Podcasts spec. It survives every platform's normalization without sounding squashed. CHANI and Calm content measured informally on streaming hits roughly this range. Spotify will pump it up 2 dB, which is fine.

Tools

FFmpeg loudnorm filter (free, scriptable, already in the pipeline). Two-pass is the professional move.
Youlean Loudness Meter 2 (free version, GUI, broadcast-standard accurate). Use for spot-checking outputs.
iZotope Insight 2 ($249, included in Music Production Suite). Real-time multi-format meter, integrates into a DAW chain.
MAAT DRMeter MkII (~$129) if you ever need ITU-compliant delivery proof.

FFmpeg loudnorm: the right command

Single-pass is good enough for batch automation, but produces +/- 1 LU drift. Two-pass is what professional services run:

# Pass 1: measure
ffmpeg -i input.wav -af loudnorm=I=-16:TP=-1.0:LRA=11:print_format=json -f null -

# Pass 2: apply, feeding measured values back in
ffmpeg -i input.wav -af loudnorm=I=-16:TP=-1.0:LRA=11:measured_I=-22.3:measured_TP=-7.1:measured_LRA=8.2:measured_thresh=-32.5:offset=-0.4:linear=true \
  -ar 48000 -c:a pcm_s24le output_master.wav

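linear=true applies a single gain value instead of dynamic AGC, which preserves the original mix dynamics. For meditation this matters: you do not want loudnorm "auto-leveling" the soft whisper passages back up to match the louder ones. Two caveats. The measured_* and offset values shown are placeholders; replace them with the numbers pass 1 prints for each file. And loudnorm falls back to dynamic mode when a single gain cannot satisfy the targets, for example when that gain would push the true peak above the TP ceiling or when the measured LRA exceeds the LRA target.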
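Copying the pass-1 numbers by hand does not scale to batch runs. Here is a minimal sketch of automating the hand-off, assuming the JSON report from pass 1 has been saved to a file. The helper names loudnorm_field and loudnorm_pass2_filter are hypothetical, and the sed pattern assumes the report uses the `"key" : "value"` layout that loudnorm prints with print_format=json.

```shell
#!/bin/sh
# Read one quoted field from a loudnorm print_format=json report on stdin.
loudnorm_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" | head -n 1
}

# loudnorm_pass2_filter REPORT_FILE
# Prints the pass-2 filter string with the measured values filled in.
# Targets (-16 LUFS, -1.0 dBTP, LRA 11) match the commands above.
loudnorm_pass2_filter() {
  ln_i=$(loudnorm_field input_i < "$1")
  ln_tp=$(loudnorm_field input_tp < "$1")
  ln_lra=$(loudnorm_field input_lra < "$1")
  ln_thresh=$(loudnorm_field input_thresh < "$1")
  ln_offset=$(loudnorm_field target_offset < "$1")
  printf 'loudnorm=I=-16:TP=-1.0:LRA=11:measured_I=%s:measured_TP=%s:measured_LRA=%s:measured_thresh=%s:offset=%s:linear=true\n' \
    "$ln_i" "$ln_tp" "$ln_lra" "$ln_thresh" "$ln_offset"
}

# Typical use (assumes ffmpeg is installed and input.wav exists):
#   ffmpeg -hide_banner -nostats -i input.wav \
#     -af loudnorm=I=-16:TP=-1.0:LRA=11:print_format=json -f null - 2>&1 \
#     | sed -n '/^{/,/^}/p' > report.json
#   ffmpeg -i input.wav -af "$(loudnorm_pass2_filter report.json)" \
#     -ar 48000 -c:a pcm_s24le output_master.wav
```

The report is written to stderr alongside FFmpeg's log, which is why the usage comment redirects 2>&1 and keeps only the lines between the opening and closing braces.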

2. Voice EQ for Meditation

The 200-400 Hz "polyvagal warm" zone

Stephen Porges's polyvagal theory holds that prosody - the melodic, warm, low-mid quality of a safe voice - engages the ventral vagal complex (the safety/social-engagement branch of the autonomic nervous system). The fundamental of a male meditation voice typically lives at 100-150 Hz, female at 180-250 Hz, with the first formant in the 400-800 Hz range. The "warm body" of the voice is the 200-400 Hz region.

Standard EQ moves engineers use for meditation voice:

+2 to +4 dB shelf or wide bell at 200 Hz (male voice) or 250-300 Hz (female voice). This adds chest and intimacy. Too much makes it muddy.
Cut 350-450 Hz by 2-3 dB if "boxy". The boxiness frequency is where small-room recordings (i.e. most home-mic setups) build up energy.
Cut at 800 Hz - 1 kHz if "nasal". ElevenLabs voices in particular can show nasality here.
Subtle boost +1-2 dB around 3 kHz for intelligibility. Don't push past +3 dB or it sounds podcast-y rather than meditation-y.
Air shelf +1 dB at 10-12 kHz for sparkle. Optional. Many meditation engineers skip this because air shelving exaggerates TTS artifacts.

What to cut

High-pass at 80 Hz, 12 or 24 dB/octave slope. This is non-negotiable for spoken word. It removes desk rumble, AC hum, HVAC, and the sub-bass garbage that AI voice models sometimes produce. The voice fundamental is preserved above 80 Hz.
Low-pass / shelf cut at 14-15 kHz for meditation specifically. Removes sibilance overtones and any digital harshness from the TTS model. Calm and Headspace voices feel "rolled off" up top - that's the move. A gentle -3 dB shelf above 14 kHz feels noticeably warmer.

De-essing

Sibilance for male voices centers at 5-6 kHz; female voices at 7-8 kHz. ElevenLabs voices in particular over-pronounce /s/ and /sh/ phonemes because the model was trained on broadcast-clean speech. Common moves:

Dynamic EQ with 4-6 dB reduction in a narrow Q (~3.0) at the sibilance center, triggered above -25 dB threshold.
Multiband de-esser (FabFilter Pro-DS, iZotope De-esser in Nectar/RX, or the free TDR Nova). Range mode preserves the body of the consonant; wideband mode is more aggressive and easier on automation pipelines.
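For FFmpeg-only pipelines: the deesser filter exists but is crude. A better approach is sidechaincompress keyed by a band-passed copy of the voice (acompressor has no sidechain input), with the bandpass centered at 6500 Hz, ratio 4:1, threshold around -22 dB, attack 1 ms, release 100 ms. Note that this turns the whole voice down while sibilance is present, not just the sibilant band.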
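The FFmpeg-only sidechain de-esser can be sketched as a filtergraph builder. This is a starting point under stated assumptions, not a tuned preset: the function name deess_graph is hypothetical, the bandpass Q of 2 is a guess, and 0.08 is the linear equivalent of roughly -22 dB. Because the band-passed key is quieter than the full voice, the threshold will likely need lowering by ear.

```shell
#!/bin/sh
# deess_graph CENTER_HZ THRESHOLD_LINEAR
# Prints a filter_complex graph: the voice is split, one copy is band-passed
# around the sibilance center and used as the key for sidechaincompress.
deess_graph() {
  printf '[0:a]asplit=2[main][sc];[sc]bandpass=f=%s:width_type=q:width=2[key];[main][key]sidechaincompress=threshold=%s:ratio=4:attack=1:release=100[out]\n' "$1" "$2"
}

deess_graph 6500 0.08

# Typical use (assumes ffmpeg is installed and voice_processed.wav exists):
#   ffmpeg -i voice_processed.wav -filter_complex "$(deess_graph 6500 0.08)" \
#     -map "[out]" -ar 48000 -c:a pcm_s24le voice_deessed.wav
```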

Reference engineers and writings

Mike Senior's Mixing Secrets for the Small Studio - chapter on vocal processing is the cleanest spoken-word EQ reference in print.
Bob Katz's Mastering Audio: The Art and the Science (3rd ed.) - the Bob Ludwig endorsement is meaningful. Read the dynamics and dithering chapters even if you skip the music-specific material.
Justin Colletti / SonicScoop - long-form articles on broadcast voice processing. Search "Colletti voice EQ."
iZotope's "How to EQ Vocals" guide - has gender-specific frequency charts that match the meditation use case.

3. Music Bed Mixing Under Voice

The dB level rule

The professional standard: music should sit 18 to 24 dB below voice peak during voice passages. During voice gaps (intros, outros, pause breaths), the music can come up to 6 to 9 dB below voice peak. Calm's mix sits closer to 22 dB below during voice, which is why their voice feels so dominant and the music feels supportive rather than competitive.

Sidechain ducking

Standard signal flow: voice track sends to the music compressor's sidechain input. When voice exceeds threshold, music gets pulled down. Typical settings:

Ratio: 3:1 to 4:1 for natural-feeling duck; 8:1+ for radio-style aggressive duck
Threshold: -22 to -18 dB (set so quiet whispers still trigger)
Attack: 20-50 ms (too fast = pumping artifacts; too slow = the music is still at full level under the first syllable)
Release: 400-800 ms (long enough that music doesn't pump up between sentences)
Knee: soft (3-6 dB) to avoid hard gain changes

In FFmpeg, this is the sidechaincompress filter:

ffmpeg -i voice.wav -i music.wav -filter_complex \
  "[0:a]asplit=2[v1][v2]; \
   [1:a][v2]sidechaincompress=threshold=0.05:ratio=4:attack=30:release=500:makeup=1[ducked]; \
   [v1][ducked]amix=inputs=2:duration=longest:weights='1 0.6'[out]" \
  -map "[out]" -ar 48000 -c:a pcm_s24le mix.wav

This is materially better than static volume automation, which is what most pipelines default to.
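One detail that is easy to miss: sidechaincompress takes its threshold as a linear amplitude (1.0 = full scale), while the settings list above is in dB. A small conversion helper makes the mapping explicit (the function name db_to_lin is illustrative):

```shell
#!/bin/sh
# db_to_lin DB -> prints the linear amplitude (1.0 = full scale) for a dB value
db_to_lin() {
  awk -v db="$1" 'BEGIN { printf "%.4f\n", exp(log(10) * db / 20) }'
}

db_to_lin -22   # prints 0.0794
db_to_lin -18   # prints 0.1259
db_to_lin -26   # prints 0.0501, close to the threshold=0.05 used above
```

So threshold=0.05 sits at about -26 dB, a little below the -22 to -18 dB range listed, which makes it trigger on quieter passages.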

Frequency separation: the spectral hole

The "spectral hole" or carved-EQ technique: take the music bed and cut a wide bell -3 to -5 dB at 1 kHz with Q around 1.0 (sometimes called the "vocal pocket"). This creates room in the voice's intelligibility band without the listener consciously perceiving the music as quieter.

Music bed lives in:
- Sub-bass: 30-80 Hz (felt, not heard, in meditation)
- Bass body: 80-250 Hz
- High-mid sparkle: 4-10 kHz
- Air: 10-16 kHz

Voice owns:
- Fundamental: 100-300 Hz
- Body/warmth: 200-500 Hz
- Intelligibility: 1-4 kHz
- Presence: 3-6 kHz

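The cleanest sound comes from sculpting both: high-pass the music at 100 Hz (don't let it fight the voice fundamental), notch at 1-3 kHz (vocal pocket), and let it bloom above 5 kHz where voice doesn't live. Be aware that the 100 Hz high-pass also removes the 30-80 Hz sub-bass listed above as music territory; if the bed depends on that felt low end, swap the high-pass for a gentle cut between 100 and 300 Hz.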

4. Room Tone, Reverb, and the "Produced" Feeling

Why TTS voices feel "inside your head" instead of "in a space"

ElevenLabs voices are dry. They were trained on close-mic broadcast recordings with minimal natural reverberation. When played back through headphones with no spatial cue, the brain interprets them as "right inside my head" rather than "in a contemplative space." This is the uncanny TTS giveaway as much as any phonetic artifact.

The fix: a touch of intentional reverb that places the voice in a small, warm, intimate space.

Calm-style settings (small room)

Reverb type: short plate or small room
Decay time: 0.6-0.9 seconds (Calm sits around 0.8s)
Pre-delay: 20-40 ms (keeps voice consonants articulate)
Wet/dry mix: 8-15% wet (very subtle - you should miss it when it is bypassed, but not hear it as an effect)
High cut on the reverb return: 6-8 kHz (warmer, removes any sibilant ringing)
Low cut on the reverb return: 200-300 Hz (no muddy buildup)

Headspace-style settings (drier + ambient pad)

Headspace uses less reverb on the voice itself but layers a near-subliminal ambient pad at -30 to -36 dB underneath everything. The pad is usually a sustained drone in the same key as the music bed. Effect: voice feels intimate but the whole scene feels "produced."

Convolution vs algorithmic

Convolution reverb uses impulse responses (IRs) of real spaces - actual rooms, halls, hardware. Most realistic. CPU-heavy.
Algorithmic reverb is parameter-based. Cheaper, more controllable, often more musical.

For meditation voice, algorithmic usually wins because real-room IRs include problematic resonances (HVAC, floor reflections) that fight the calm aesthetic. Specific recommended plugins:

Valhalla VintageVerb ($50). The meditation industry's quiet standard. The "Concert Hall" and "Chamber" modes with low decay are widely used. Color modes "1970s" and "1980s" add musical warmth.
Valhalla Room ($50). Cleaner, more modern. Useful for the drier Headspace aesthetic.
Liquidsonics Seventh Heaven ($249). A Bricasti M7 emulation. The M7 ($3,500 hardware) is the gold standard for high-end studio reverb; Seventh Heaven puts most of that sound in software.
Samplicity Bricasti M7 free IR library. 136 impulse responses captured from a real M7. Works in any convolution loader (Logic Space Designer, REAPER ReaVerb, Audio Ease Altiverb, FFmpeg's afir filter).
FabFilter Pro-R 2 ($199). Algorithmic, very controllable EQ section inside the reverb. Good for surgical wet-EQ.

FFmpeg convolution reverb

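ffmpeg -i voice.wav -i bricasti_smallroom_IR.wav -filter_complex \
  "[0:a]asplit=2[dry][send];[send][1:a]afir=dry=1:wet=1:length=1[wet]; \
   [dry][wet]amix=inputs=2:duration=first:weights='1 0.12':normalize=0[out]" \
  -map "[out]" output.wav

afir's dry and wet options are linear gain multipliers on the filter's input and output, not dB, and the filter outputs only the convolved signal. The wet/dry balance therefore comes from a parallel blend: the voice is split, one copy goes through the IR, and amix adds it back under the dry copy at a weight of 0.12 (normalize=0 needs FFmpeg 4.4 or later). Treat 0.12 as a starting point, since the level of the convolved signal depends on the IR. The cleanest setup: pre-process voice through EQ + de-ess + this convolution step, then send to the mix stage.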
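The small-room settings listed earlier (pre-delay, filtered reverb return, low wet level) can be folded into the convolution step by treating the reverb as a parallel send. This is a minimal sketch, assuming FFmpeg 4.4 or later for amix's normalize option. The function name reverb_graph and its argument order are illustrative, and the wet weight is only a rough stand-in for a percentage because the convolved level depends on the IR.

```shell
#!/bin/sh
# reverb_graph PREDELAY_MS LOWCUT_HZ HIGHCUT_HZ WET_WEIGHT
# Input 0 is the voice, input 1 is the impulse response.
# The send is delayed (pre-delay), convolved with the IR, band-limited,
# then mixed back under the untouched dry voice.
reverb_graph() {
  printf "[0:a]asplit=2[dry][send];[send]adelay=%s:all=1[pd];[pd][1:a]afir[rev];[rev]highpass=f=%s,lowpass=f=%s[wet];[dry][wet]amix=inputs=2:duration=first:weights='1 %s':normalize=0[out]\n" "$1" "$2" "$3" "$4"
}

reverb_graph 30 250 7000 0.12

# Typical use (assumes ffmpeg, voice.wav and an IR file exist):
#   ffmpeg -i voice.wav -i bricasti_smallroom_IR.wav \
#     -filter_complex "$(reverb_graph 30 250 7000 0.12)" \
#     -map "[out]" -ar 48000 -c:a pcm_s24le voice_reverbed.wav
```

Decay time is not a parameter here: with convolution it is fixed by the impulse response, so pick an IR whose tail is already in the 0.6-0.9 second range.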

5. Mastering Chain (final pass)

Order matters. A typical meditation master chain:

High-pass filter at 80 Hz (12 or 24 dB/oct). Already done at the voice stage; redo on the master in case the music bed dragged sub energy in.
Subtle linear-phase EQ. Maybe -1 dB at 250 Hz (muddiness), +0.5 dB at 5 kHz (presence). Wide Q, very gentle.
Multiband compressor (very subtle). 1-2 dB gain reduction on the low band (sub 200 Hz) and high band (above 6 kHz). Leaves the voice band alone. Tames any music dynamics that leaked through.
Stereo widener on music bed only (M/S processing). +20-40% sides energy above 1 kHz. Voice stays mono.
Limiter, true peak ceiling -1.0 dBTP, with 1-2 dB max gain reduction. Anything more and the meditation breathes wrong.
Final LUFS verification at -16 LUFS integrated.

Tool recommendations

iZotope Ozone 11 ($499 Standard, $999 Advanced, often on sale ~50% off). The end-to-end mastering suite. Its Master Assistant produces usable starting points in 30 seconds. The Imager, Multiband, and Maximizer modules are all defensible for production.
FabFilter Pro-L 2 ($229). The limiter. "Modern" algorithm for transparency, "Allround" for safer choice. True-peak metering built in.
Waves L2 Ultramaximizer ($99). Classic limiter. Older sound, but the "L2 master" still has the radio sheen many engineers prefer.
Free chain alternative: Limiter No6 by vladg/sound (free; its successor, TDR Limiter 6 GE, is a paid plugin), TDR Nova (free dynamic EQ), TDR Kotelnikov (free compressor), Voxengo SPAN (free analyzer). Genuinely professional results possible without spending a dollar.

For the FFmpeg pipeline, the final step is the two-pass loudnorm shown in section 1. The pipeline already has this - what's missing is the pre-processing on voice and music separately.

6. Intimate vs Ambient Voice Positioning

The rule

Mono center voice = intimate. Whispered, close-mic, "speaking into your ear" feel. This is Calm's signature. Sleep Stories use this aggressively - voice is pinned dead center, no width.
Stereo bed music = ambient. Pads, drones, nature sounds spread wide left-right. Creates the "container" the voice sits inside.

This contrast - narrow voice, wide bed - is what creates the perceptual "inside your head / outside your head" split that makes meditation feel like a place rather than a recording.

How to position

In any DAW (Reaper, Logic, Pro Tools, Ableton):

Voice track: pan center, mono.
Music bed: stereo, full L-R width. Maybe even mid/side widened to about 120% on the sides above 500 Hz.
Reverb send from voice: stereo return, slightly wider than voice (creates a soft halo around the centered voice).
Optional: a near-subliminal sub-pad mono'd to match voice position adds "presence" without spatial confusion.

In the FFmpeg pipeline

# Force voice to mono center
[0:a]pan=stereo|c0=c0|c1=c0[voice_mono]

# Widen the music bed with M/S level scaling (mid down, sides up)
[1:a]stereotools=mlev=0.8:slev=1.4[music_wide]

This delivers the Calm-signature voice-narrow / bed-wide split in a single pass.

7. Avoiding TTS Giveaways

Specific ElevenLabs tells

Hard /s/ and /sh/ sounds. Already addressed by de-essing.
Perfectly timed pauses. Real humans pause irregularly. ElevenLabs pauses look "metronomic" on a waveform.
No breath sounds. Confirmed by ElevenLabs's own docs: their TTS does not generate breaths. Professional voice clones capture breath patterns; default voices do not.
No noise floor between words. Real recordings carry a consistent room tone through the pauses; TTS output drops to digital-clean silence that the ear registers as "wrong."
Hyper-consistent pitch contours within sentences. Real meditation voice drifts; ElevenLabs is more locked.

Mitigation moves

Add breath sounds. Splice in real breath samples between paragraphs. Free libraries:
- Filmstro free breath pack
- Splice "Vocal Breaths" packs (free with trial)
- ElevenLabs sound-effects library has its own breath SFX now
- Record your own: 10 minutes with any USB mic gives 50+ usable breaths

Place breaths 18 to 24 dB below voice peak. Ideally micro-pan very slightly (5-10 degrees off center) so they don't feel pasted on.

Add room tone. Generate or record 60 seconds of "silence" with the same noise floor as a real recording (use a real mic in a real room, or use iZotope RX's Ambience Match). Layer this at -42 dB underneath the entire voice track. The brain perceives the speech as "in the room" instead of "in the void."

Micro-timing variation. Vary playback speed by +/- 1-2% across sections using FFmpeg's atempo filter. This breaks the metronomic feel. Some pipelines do this per-sentence with subtle random variation.

De-clicker pass. ElevenLabs sometimes inserts micro-clicks at sentence boundaries. iZotope RX 11 De-click handles these in one pass. For automation, FFmpeg's adeclick filter is the closest built-in tool; an aresample + compand chain can only mask clicks, not remove them.

Variable speed selection (already in your pipeline per the task list, "Build auto-tune speed selector"). This is the right move.
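The micro-timing variation described above can be scripted by drawing one atempo factor per section. This is a minimal sketch: the function name atempo_factors is hypothetical, and the fixed seed is only there to make a run repeatable. The exact numbers differ between awk implementations.

```shell
#!/bin/sh
# atempo_factors COUNT MAX_PERCENT SEED
# Prints COUNT tempo factors, one per line, each within
# 1.0 +/- MAX_PERCENT/100 (e.g. 0.98 to 1.02 for MAX_PERCENT=2).
atempo_factors() {
  awk -v n="$1" -v pct="$2" -v seed="$3" 'BEGIN {
    srand(seed)
    for (k = 0; k < n; k++)
      printf "%.4f\n", 1 + (2 * rand() - 1) * pct / 100
  }'
}

# Typical use (assumes ffmpeg and files section_1.wav ... section_5.wav):
#   idx=1
#   atempo_factors 5 2 42 | while read -r factor; do
#     ffmpeg -i "section_${idx}.wav" -af "atempo=${factor}" "section_${idx}_varied.wav"
#     idx=$((idx + 1))
#   done
```

atempo changes speed without shifting pitch, so a 1-2% change alters pacing without making the voice sound higher or lower.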

8. Music Bed Selection and Licensing

Royalty-free libraries

Epidemic Sound: $15/mo personal, $49/mo commercial. 40,000+ tracks. Pre-cleared for app/podcast use. Strong meditation/ambient category. The default first choice for a wellness app.
Artlist: $199/yr ($16.58/mo). Unlimited downloads. Broadcast-grade license covers paid apps, ads, film. Slightly higher mastering quality on average than Epidemic.
Audio Network: Higher tier, used by BBC, broadcast networks. Pay-per-track or subscription. Premium feel; expensive.
Musicbed: Cinematic-leaning catalog. Subscription model. Strong for high-end wellness brands.
PremiumBeat (Shutterstock): Per-track licensing, ~$49-199 per track. Good for one-off premium use.

Composers in the meditation/ambient space worth knowing

Brian Eno - Music for Airports, Discreet Music, Ambient 4: On Land. The original ambient template. Reference material, not directly licensable.
Liquid Mind (Chuck Wild) - sustained string pad ambient. Very on-the-nose meditation feel. Sometimes licensable directly through his label.
Steven Halpern - pioneer of "healing music." Licensable.
Hammock, Stars of the Lid, A Winged Victory for the Sullen - modern ambient reference points. High emotional sophistication.
East Forest - psychedelic-integration soundscapes. Often featured in wellness apps.
Jon Hopkins - Music for Psychedelic Therapy is the contemporary gold standard for "produced ambient that doesn't feel cheap."
Wonders & Signs, Slow Meadow, Marconi Union ("Weightless" - reported as the most relaxing track tested in a 2011 Mindlab study).

The AI-generated track pattern

Most modern meditation apps now use AI-generated music for at least the long-tail content. The economics: an hour-long custom ambient track costs $500-3000 from a composer, $0.50-3 from Suno or Udio.

Suno (as of late 2025/2026):
- Pro plan ($10/mo) and Premier plan ($30/mo) grant commercial use rights for tracks generated during active subscription.
- Suno takes 0% of streaming royalties.
- Following the Warner Music partnership (Nov 2025), Suno is moving toward licensed training data. Existing Pro/Premier generations remain commercially usable.
- Important caveat: Suno's policy says "you may be granted commercial use rights" but "generally are not considered the owner." This is operationally fine for an app's internal bed music; might not be fine if you ever wanted to register the track with a PRO.

Udio:
- Following its own licensing deal, Udio is becoming a "walled garden" where tracks may not leave the platform. Commercial use outside Udio's environment is becoming restricted.
- Less safe for an app's use case than Suno as of mid-2026.

Spotify and Apple Music AI disclosure: starting late 2025, both platforms require disclosure of AI-generated audio on uploaded tracks. This is for streaming-platform uploads, not for embedded use inside a meditation app. Your app's audio is not subject to these rules unless you also distribute the music separately.

Recommended pattern for Affirmology

Use Epidemic Sound or Artlist for the brand-defining "hero" bed music. Licensed, clean, defensible.
Use Suno Pro for the long-tail, personalization-driven bed variants. Generate a 1.5-hour bed in a specified key/BPM/mood, save it, reuse across many sessions. Effective cost: pennies per finished session.
Avoid Udio until the licensing settles.
Avoid Mubert for premium content; their library is thinner and quality varies.

9. Hire vs DIY

Cost ranges (2026)

AI-automated mastering (LANDR, Cloudbounce, eMastered): $5-40 per track. Good for proofs; meditation-specific feel is hit or miss.
Independent mastering engineer (entry-tier): $20-75 per track. Often a single engineer on Fiverr/SoundBetter. Quality varies hugely.
Mid-tier indie mastering: $75-150 per track. Reliable quality, usually 2-3 day turnaround.
High-end mastering engineer: $200-500 per track. Multiple revisions. Reference monitoring on real Bricasti / GML / Manley.
Full audio production (record + edit + mix + master): $300-1500 per finished piece, depending on length and complexity.
Hourly rates: $25-100/hr for mid-tier, $150+ for top-tier mastering.

Specific engineers in the wellness space

Most wellness apps employ internal audio teams that are notoriously hard to recruit out of. The realistic paths:

Hire from former employees of Calm, Headspace, Insight Timer, Aura, Balance. LinkedIn search for "audio engineer" + those companies. Many go independent and consult.
SoundBetter "meditation" or "wellness" tagged engineers. Search yields ~50 people specializing in this aesthetic.
Audiotent, Mastering Mansion, Sage Audio offer subscription mastering services with wellness-aware engineers.
Bob Katz himself (digido.com) still takes select clients. High-end, but the credibility floor is unmatched.

The DIY learning curve

If Jeff wants to ramp internally instead:

Bob Katz, Mastering Audio: The Art and the Science (3rd ed.) - the textbook.
Mike Senior, Mixing Secrets for the Small Studio - the practical companion, with downloadable multi-tracks.
Justin Colletti / SonicScoop - long-form articles and podcast, generalist but smart.
Pensado's Place (YouTube) - mixing engineers interviewed weekly. Great pattern-matching.
Mix With the Masters - paid video series, behind-the-glass with elite engineers. $500-1500 for in-depth sessions.
iZotope's free learning portal - practical, applied, tool-specific.

When to outsource vs DIY

The right line for Affirmology:

Now (manual demo phase): DIY pipeline with FFmpeg + careful chain design. Goal: 80% of Calm-quality at 0% of Calm-cost.
Once corpus is built and per-session generation runs unattended: pay a wellness-tagged mastering engineer once to design a chain spec (~$500-1500). Codify their chain into FFmpeg / Ozone Reference Track. From then on, every auto-generated session inherits a "designed by a real engineer" master without paying per session.
For flagship pieces (homepage demo, hero tracks, investor pitches): pay $150-300 per track for a real mastering pass. Worth it for the 10 tracks that define the brand sound.

10. Audio File Format and Delivery

WAV vs FLAC vs MP3

WAV: lossless, uncompressed. Master archive format. Always keep WAV originals at 48 kHz / 24-bit.
FLAC: lossless, compressed (~50% size of WAV). Good for archive distribution. Not universally supported in mobile playback.
MP3 192 kbps: practical mobile streaming default. Calm uses MP3 at ~192 kbps for streaming. Balances quality and bandwidth.
MP3 128 kbps: noticeably compromised for ambient music; the artifacts on long sustained pads become audible. Not recommended.
AAC 128 kbps: perceptually equivalent to MP3 192 kbps. Better for Apple-ecosystem distribution.
Opus 96 kbps: even more efficient than AAC. Increasingly used for streaming.

Recommended delivery for Affirmology

Master archive: 48 kHz / 24-bit WAV.
Mobile delivery: AAC 128 kbps for iOS, Opus 96-128 kbps for Android and web, fallback MP3 192 kbps for legacy compatibility.
Premium tier delivery: lossless FLAC if competing on audiophile / Dolby Atmos positioning.

The loudness war warning

Do not chase competitive loudness. Louder than -14 LUFS, meditation content reads as "intrusive" to the nervous system and the dropout rate spikes. The same complaint shows up in app-store reviews of competitors: "the voice is too in-your-face." -16 LUFS is the right ceiling for nervous-system-aware audio.

Spatial audio / Dolby Atmos

Apple's Spatial Audio with Dolby Atmos is becoming the premium-tier expectation, especially after AirPods adoption became ubiquitous. Apple Music gives spatial-audio tracks up to 10% higher royalty share - not relevant for an app, but it indicates platform priority.

For meditation specifically:
- The case for Atmos: head-tracked spatial audio creates a genuinely immersive "container" feel; competitors will move here.
- The case against: Atmos production requires specialized monitoring (Logic Pro + 7.1.4 monitoring setup or Dolby renderer license, ~$300+). Production time per track goes up 3-5x. The audience that can actually hear spatial audio is narrower than it seems.

Recommendation: ship stereo for v1. Plan a spatial-audio premium tier in a future release once the stereo product proves out. The Calm-quality stereo product is a 12-month goal; Atmos is a 24-month goal.

11. Calm vs Headspace vs Insight Timer - Sonic Comparison

Based on informal measurement of representative tracks (Calm's "Loving Kindness" introductory meditation, Headspace's "Basics 1" Andy Puddicombe, Insight Timer's top-creator Sarah Blondin and Tara Brach):

Calm

LUFS: roughly -17 to -18 LUFS integrated
Voice EQ: noticeable warmth boost 250-300 Hz, gentle air at 10 kHz, high-passed around 90 Hz
Music bed: -22 to -24 dB below voice peak during voice; -10 dB during gaps
Reverb: small room, ~0.8s decay, wet mix 10-15%
Sonic signature: warm, close, "candle-lit" feel. Voice feels physically near.
Mono/stereo: voice center mono, bed wide stereo, signature feel.

Headspace

LUFS: roughly -16 LUFS integrated
Voice EQ: drier, more intelligibility-focused (+2-3 dB at 3 kHz), less low-mid warmth
Music bed: minimal - often just an ambient pad at -28 to -32 dB
Reverb: very subtle, almost dry (3-5%)
Sonic signature: clinical, modern, "trustworthy guide" feel. Less intimate, more authoritative.
Mono/stereo: voice center mono; bed narrower than Calm.

Insight Timer

LUFS: highly variable. Studio-produced tracks hit -16. Phone-recorded creator tracks can be -22 to -28 LUFS, often with true-peak overs.
Voice EQ: huge variance.
Music bed: variable.
Sonic signature: ranges from "Calm-grade" to "voice memo." The platform's quality variance is itself the brand challenge.

CHANI (the closest direct competitor for Affirmology)

LUFS: roughly -16 to -17 LUFS
Voice EQ: warm low-mid, soft top end, intentional intimacy
Music bed: original compositions, sits around -20 dB below voice
Reverb: signature wash of subtle plate reverb
Sonic signature: "ritualistic intimacy" - more sensual than Calm's nurturing or Headspace's clinical. This is closer to the Affirmology brand and worth direct A/B comparison.

What Affirmology should target

A Calm-meets-CHANI sonic signature: warm low-mids, intentional reverb space, generous music bed presence, slightly more dynamics than Headspace allows. Personality forward - this is not generic guided meditation, it's a personalized invocation. The reverb and the warmth do that work.

12. Recommendations for the Affirmology Pipeline

The specific FFmpeg chain to add for "feels Calm-quality" output

The current pipeline mixes voice + music with basic loudnorm. The professional chain has more stages. Concrete proposal:

Stage 1: Voice pre-processing (per-segment ElevenLabs output)

ffmpeg -i raw_voice.wav -af "\
  highpass=f=80:p=2, \
  equalizer=f=250:width_type=q:width=1.0:g=2.5, \
  equalizer=f=400:width_type=q:width=2.0:g=-2.0, \
  equalizer=f=3000:width_type=q:width=1.2:g=1.5, \
  lowpass=f=15000:p=2, \
  acompressor=threshold=-25dB:ratio=3:attack=5:release=80:makeup=2 \
  " -ar 48000 -c:a pcm_s24le voice_processed.wav

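This is HPF 80 Hz, +2.5 dB at 250 Hz (warmth), -2 dB at 400 Hz (boxiness cut), +1.5 dB at 3 kHz (presence), LPF 15 kHz, and a gentle compressor for consistency. Note that acompressor's makeup is a linear factor, so makeup=2 is roughly +6 dB of gain, not 2 dB; lower it if the processed voice comes out hotter than intended.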

Stage 2: Voice de-essing

ffmpeg -i voice_processed.wav -af "\
  deesser=i=0.4:m=0.5:f=0.5:s=o \
  " -ar 48000 -c:a pcm_s24le voice_deessed.wav

The FFmpeg deesser is crude; for higher quality, swap to a Python-side iZotope RX batch or a TDR Nova plugin host running in a CLI wrapper.

Stage 3: Voice reverb (the key "produced feeling" step)

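ffmpeg -i voice_deessed.wav -i bricasti_smallroom_IR.wav \
  -filter_complex "[0:a]asplit=2[dry][send]; \
    [send][1:a]afir=dry=1:wet=1:length=1[wet]; \
    [dry][wet]amix=inputs=2:duration=first:weights='1 0.12':normalize=0[out]" \
  -map "[out]" -ar 48000 -c:a pcm_s24le voice_reverbed.wav

Use a free Samplicity Bricasti IR. afir's dry and wet options are linear input/output gains rather than dB, and afir outputs only the convolved signal, so the blend is done in parallel: amix adds the reverb under the dry voice at a weight of 0.12 for roughly a 12% wet feel - the Calm signature. Tune the weight by ear, since the convolved level depends on the IR (normalize=0 needs FFmpeg 4.4 or later).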

Stage 4: Mix voice + music with sidechain ducking

ffmpeg -i voice_reverbed.wav -i music_bed.wav \
  -filter_complex "\
    [1:a]highpass=f=100, \
         equalizer=f=1500:width_type=q:width=1.0:g=-3.5, \
         stereotools=mlev=0.8:slev=1.3[music_carved]; \
    [0:a]asplit=2[v1][v2]; \
    [music_carved][v2]sidechaincompress=threshold=0.04:ratio=4:attack=30:release=600:makeup=1[ducked]; \
    [v1]pan=stereo|c0=c0|c1=c0[voice_mono]; \
    [voice_mono][ducked]amix=inputs=2:duration=longest:weights='1.0 0.55'[mix] \
  " -map "[mix]" -ar 48000 -c:a pcm_s24le full_mix.wav

This high-passes the music at 100 Hz, carves a -3.5 dB hole at 1.5 kHz (vocal pocket), widens the music's stereo image, sidechain-ducks it under voice, mono-centers the voice, and mixes at a 1.0 voice / 0.55 music ratio (music about 5 dB below voice before the ducking pulls it down further).
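To compare the linear values in this command with the dB targets from section 3, convert them back to dB. A small helper (the name lin_to_db is illustrative):

```shell
#!/bin/sh
# lin_to_db X -> prints 20*log10(X) in dB for a linear gain or amplitude
lin_to_db() {
  awk -v x="$1" 'BEGIN { printf "%.2f\n", 20 * log(x) / log(10) }'
}

lin_to_db 0.55   # prints -5.19: static music weight relative to voice in amix
lin_to_db 0.04   # prints -27.96: the sidechaincompress threshold
```

The static weight alone leaves the bed only about 5 dB under the voice, so the remaining distance to the 18 to 24 dB target during speech has to come from the ducking. If measured output falls short, lower the music weight or raise the ratio.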

Stage 5: Master loudnorm pass (two-pass)

# Pass 1 (measure after the same highpass that pass 2 applies)
ffmpeg -i full_mix.wav -af "highpass=f=30,loudnorm=I=-16:TP=-1.0:LRA=11:print_format=json" -f null -

# Pass 2 (using values from pass 1)
ffmpeg -i full_mix.wav -af "\
  highpass=f=30, \
  loudnorm=I=-16:TP=-1.0:LRA=11:measured_I=...:measured_TP=...:measured_LRA=...:measured_thresh=...:offset=...:linear=true \
  " -ar 48000 -c:a pcm_s24le final_master.wav

Stage 6: Deliverable encode

# AAC 128 kbps for iOS / web
ffmpeg -i final_master.wav -c:a aac -b:a 128k final.m4a

# MP3 192 kbps for compatibility
ffmpeg -i final_master.wav -c:a libmp3lame -b:a 192k final.mp3

Should you license iZotope / Audiomovers / hire an engineer?

A pragmatic priority order:

Build the FFmpeg chain above first. Free. Gets you 75-80% of Calm-quality. The breath-sound layer, the reverb on voice, and the sidechain ducking are the three biggest perceptual upgrades.
Add a paid Suno plan for music bed variety (Pro at $10/mo or Premier at $30/mo; per section 8, both grant commercial use rights for tracks generated while subscribed). Solves the music problem for personalized content.
Hire a wellness-tagged mastering engineer once (~$500-1500, one-time). Have them design the chain spec, A/B against your FFmpeg output, codify their settings into the pipeline. Their job is not to master every session - it's to design the algorithm.
Buy iZotope Ozone 11 Standard (~$249 on sale). Use it on the flagship/hero content (homepage demo, investor pitch, brand-defining 10 tracks). Its Master Assistant will get you most of the way; the Imager, Multiband, and Maximizer modules are defensible production.
Skip Audiomovers for now. It's a remote-collaboration tool (real-time audio between studios). Not relevant until you have multiple producers working together.
Skip the $100/track engineer service for batch output. Once the chain is good, paying per-session for a generative product breaks the unit economics.

The summary diagnosis

The current Affirmology pipeline (ElevenLabs + FFmpeg loudnorm) is producing audio at maybe 60-65% of Calm-quality. The three highest-leverage upgrades are, in order:

Add intentional reverb to voice before mixing. This alone closes 15-20% of the gap. Use a Bricasti IR + FFmpeg's afir filter.
Implement sidechain ducking instead of static music volume. Closes another 10%. Use sidechaincompress in FFmpeg.
Add breath sounds + room tone layer. Closes another 5-10%. Manual or scripted insertion of free breath samples and a -42 dB noise floor.

Together those three changes take you from "good demo audio" to "indistinguishable from Calm in a blind A/B" for the vast majority of listeners. The remaining 5-10% is the difference a hired engineer can specify - and once specified, can be baked into the pipeline permanently.

The pipeline does not need to become more expensive. It needs to become more deliberate.

Sources

LUFS targets and loudness standards:
- Podcast Loudness Standards 2026: Spotify, Apple, YouTube (SONE)
- The Only LUFS Guide You Need in 2026 (Horia Stan)
- LUFS Targets for Every Streaming Platform 2026 (UpTrack)
- The Ultimate Guide to Streaming Loudness LUFS Table 2026 (Soundplate)
- Podcast Loudness Standard: Perfecting Your Sound in 2026 (Descript)

FFmpeg loudnorm:
- FFmpeg Audio Normalization: The Complete loudnorm Guide (32blog)
- How to Use ffmpeg loudnorm: LUFS Normalization and 2-Pass Settings
- loudnorm filter documentation (k.ylo.ph)
- FFmpeg sidechain ducking (FFmpeg-user list)

Voice EQ for spoken word:
- EQ: Warm a Voice and Improve Clarity (Larry Jordan)
- Voice EQ - The Best Settings (Music Guy Mixing)
- How to EQ Vocals (iZotope)
- How to EQ Speech for Maximum Intelligibility (Behind The Mixer)
- The Complete Guide to Mixing Voice: EQ (Pro Audio Files)

Polyvagal and prosody:
- Talk Time Featuring Dr. Stephen Porges (Dr. Rebecca Jorgensen)

De-essing:
- De-essing - Wikipedia
- Techniques For Vocal De-essing (Sound on Sound)
- Advanced Sibilance Control: Beyond Simple De-Essing (Mike's Mix and Master)
- Vocal Sibilance (Pro Audio Files)

Sidechain ducking:
- Side Chain Compression in Reaper, Ducking for Voice Overs (iBlindTech)
- What is Sidechain Compression? (Sweetwater)
- Ducking music volume for voice narration (VI-CONTROL)

Spectral separation:
- 7 Tips for Using Subtractive EQ (iZotope)
- Frequency Masking Guide (The Producer School)
- How To Create Separation In Your Mixes Using EQ (Audio Issues)

Convolution reverb and IRs:
- Bricasti M7 impulse response files (Samplicity)
- Convolution Reverb: The Hidden Secret to Realistic Spaces (EDMProd)
- Free Impulse Responses: 4 Reverb Packs (Resound Sound)
- Best Reverb Plugins (Musiversal)

ElevenLabs and TTS:
- Can you make voices produce the sound of breathing? (ElevenLabs Help)
- How to make Text to Speech sound less robotic (ElevenLabs Blog)
- ElevenLabs Best Practices

Music licensing:
- Artlist vs Epidemic Sound 2026 (CC Hound)
- Suno Commercial Use: Free vs Pro Rights 2026 (Dynamoi)
- Suno adjusts AI music ownership terms (Music In Africa / Warner deal)
- What Suno and Udio Licensing Deals Mean (Billboard)
- The 2026 Suno AI Legal Guide (Sonic Analytics)

Mastering chain and tools:
- Pro Mastering Chain: The Building Blocks (mastering.com)
- Mastering Chain: 7 Stages That Shape Your Master (LANDR)
- iZotope Ozone 12 vs FabFilter Pro-L 2 2026 (PluginDrop)
- FabFilter Pro-L 2 vs popular limiters (Gearshoot)
- Mastering Audio (Bob Katz book review, Sound on Sound)
- Mastering Audio: The Art and the Science (Routledge)

Hire vs DIY rates:
- Mastering Rates in 2026 (Alexander Wright Mastering)
- Mastering Engineer Hourly Rates (Twine)
- How can you determine a fair rate for audio mastering (LinkedIn)

Mono vs stereo positioning:
- Mono vs Stereo for Podcasting (The Podcast Host)
- Why mono is better than stereo for vocals and dialogue (Audio Masterclass)
- Should You Podcast in Mono or Stereo? (Audacity to Podcast)

Format and delivery:
- Audio Bitrate Guide (AudioUtils)
- Audio Bitrate Complete Guide 2026 (Fyletools)

Apple Spatial Audio:
- About Spatial Audio with Dolby Atmos (Apple)
- What to know about Spatial Audio (Apple Music for Artists)
- Apple unveils new spatial audio format ASAF (TechRadar)

High-pass filter for spoken word:
- How To Use a High-pass Filter for Voice Clarity (Podcast Engineering School)
- Mastering Dialogue for Podcasts (Sage Audio)

Insight Timer creator standards:
- Recording Tips (Insight Timer Support)
- Best Practices for Content (Insight Timer Support)
