Updated Jun 25, 2026 · Affirmology_FishVoice_Markers_Reference_v1.md
Authoritative, sourced. Fish does NOT shape voice with sliders the way ElevenLabs does. It shapes voice with markers and controls embedded in the input text, plus a small set of API params. This is the full public catalog (sources at the bottom).
- S2 syntax: [marker]. Crucially, S2 is open-domain: it is NOT limited to the fixed list. You can write free-form descriptions and modifiers and it interprets them: [whispers sweetly], [laughing nervously], [slightly sad], [very excited], [extremely gentle].
- S1 syntax: (marker), fixed set only, no custom tags.
- Example of stacked markers: [soft tone][calm][grateful].

Emotions (S2 [ ] / S1 ( ))
Basic (24): happy, sad, angry, excited, calm, nervous, confident, surprised, satisfied, delighted, scared, worried, upset, frustrated, depressed, empathetic, embarrassed, disgusted, moved, proud, relaxed, grateful, curious, sarcastic
Advanced (25): disdainful, unhappy, anxious, hysterical, indifferent, uncertain, doubtful, confused, disappointed, regretful, guilty, ashamed, jealous, envious, hopeful, optimistic, pessimistic, nostalgic, lonely, bored, contemptuous, sympathetic, compassionate, determined, resigned
(S2 also accepts modifiers on any of these: [slightly nostalgic], [very calm].)
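
The bracket-versus-parenthesis convention can be captured in a small helper. This is a minimal sketch, not part of any Fish SDK: the function names `marker` and `tag_line` are hypothetical, and placing the markers at the start of the line is an assumption about where they take effect. It does not check S1 markers against the fixed list.

```python
def marker(description: str, model: str = "s2") -> str:
    """Wrap a description in the delimiter style used by the given model."""
    text = description.strip()
    if model == "s2":
        return f"[{text}]"   # S2: square brackets, free-form text allowed
    if model == "s1":
        return f"({text})"   # S1: parentheses, fixed set only (not validated here)
    raise ValueError(f"unknown model: {model!r}")


def tag_line(line: str, *descriptions: str, model: str = "s2") -> str:
    """Prefix a line of script with one or more stacked markers."""
    prefix = "".join(marker(d, model) for d in descriptions)
    return f"{prefix} {line}" if prefix else line


print(tag_line("Thank you for this day.", "soft tone", "slightly nostalgic"))
```

Keeping the delimiter choice in one place means a script written for S2 can be re-rendered for S1 by changing a single argument, provided every marker used is in the S1 fixed set.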
[in a hurry tone] [shouting] [screaming] [whispering] [soft tone]
[laughing] [chuckling] [sobbing] [crying loudly] [sighing] [groaning] [panting] [gasping] [yawning] [snoring]
Plus crowd effects: [audience laughing] [background laughter] [crowd laughing]
- [break] - short pause. Empirically confirmed on s2: one [break] adds a real silence; two [break][break] produced a ~0.94s gap in our test (not spoken aloud).
- [long-break] - extended pause.
- [breath] - audible inhale. ([inhale] / [exhale] also referenced.)

Inserting filler words like "um", "uh" (or natural laughter written as "Ha,ha,ha") controls rhythm/realism without any tag. Use sparingly for a meditation voice.
Force exact pronunciation with: <|phoneme_start|>PHONEMES<|phoneme_end|>
- English: CMU Arpabet (per word)
- Chinese: tone-number pinyin
- Japanese: OpenJTalk romaji with pitch-accent digits
This is how we make sure a person's name (e.g. an unusual spelling) is voiced correctly in their Soul Song.
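
As a rough illustration of the phoneme wrapper, the sketch below swaps listed words for tagged phoneme strings in one pass. The helper names are hypothetical, the Arpabet string for the sample name is illustrative rather than a verified transcription, and it assumes the tag replaces the written word in the input text.

```python
import re

PHONEME_START = "<|phoneme_start|>"
PHONEME_END = "<|phoneme_end|>"


def phoneme_tag(phonemes: str) -> str:
    """Wrap a phoneme string in the start/end control tokens."""
    return f"{PHONEME_START}{phonemes.strip()}{PHONEME_END}"


def force_pronunciation(text: str, overrides: dict[str, str]) -> str:
    """Replace each whole-word match of an override key with its phoneme tag."""
    if not overrides:
        return text
    # Longest keys first so a longer name wins over a shorter one it contains.
    words = sorted(overrides, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(lambda m: phoneme_tag(overrides[m.group(1)]), text)


# Illustrative Arpabet only; check real names against a pronunciation source.
print(force_pronunciation("Welcome, Saoirse.", {"Saoirse": "S ER1 SH AH0"}))
```

Doing the substitution in a single regex pass avoids re-matching text that an earlier replacement has already inserted.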
prosody: { "speed": <float>, "volume": <int> }. We render at speed 0.80 - 0.88.
Caveat (measured): the speed knob is coarse - 0.88 vs 0.93 produced near-identical length. For real pacing change, move in bigger steps or add [break]/[long-break] between sentences (which we now do).
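
The pause-based pacing described in the caveat can be sketched as a text preprocessing step. The function name `add_pauses` is hypothetical and the sentence splitter is deliberately naive (it splits after ".", "!" or "?" followed by whitespace, so abbreviations such as "Dr." would be split too).

```python
import re


def add_pauses(sections: list[str]) -> str:
    """Join sentences with [break] and sections with [long-break]."""
    paced_sections = []
    for section in sections:
        # Split after sentence-ending punctuation followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", section.strip()) if s]
        paced_sections.append(" [break] ".join(sentences))
    # Skip empty sections so no stray [long-break] is emitted.
    return " [long-break] ".join(s for s in paced_sections if s)


script = add_pauses(["Breathe in. Breathe out.", "You are safe."])
print(script)
```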
Model is chosen by the request model header: s2 (we use this) or s1.
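
Putting the prosody object and the model header together, a request could be assembled roughly as follows. This is a sketch under stated assumptions: the header key `model` and the body field `text` are guesses at the wire format, the default volume of 0 is assumed, and nothing is sent over the network. Only the `prosody` shape and the `s1` / `s2` values come from the notes above.

```python
import json


def build_request(text: str, speed: float = 0.84, volume: int = 0,
                  model: str = "s2") -> tuple[dict, dict]:
    """Return (headers, body) for a TTS call; field names are assumptions."""
    if model not in ("s1", "s2"):
        raise ValueError(f"unknown model: {model!r}")
    headers = {"model": model}  # assumed header name
    body = {
        "text": text,  # assumed field name
        "prosody": {"speed": float(speed), "volume": int(volume)},
    }
    return headers, body


headers, body = build_request("[calm] You are exactly where you need to be.")
print(headers)
print(json.dumps(body, indent=2))
```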
- [calm], [relaxed], [grateful], [hopeful], [compassionate], [nostalgic].
- [soft tone] / [whispering] (sparingly).
- [break] between sentences and [long-break] at section breaks.
- <|phoneme_start|>...<|phoneme_end|>.

Verified in-engine 2026-06-25: Fish s2 honors [break] as inserted silence; all 1-min A/B renders passed audio_qc.py.