
Fish Audio - complete in-text control reference (the "[ ]" system)

Updated Jun 25, 2026 · Affirmology_FishVoice_Markers_Reference_v1.md

Summary. Authoritative, sourced. Fish does NOT shape voice with sliders the way ElevenLabs does. It shapes voice with markers and controls embedded in the input text, plus a small set of API params. This is the full public catalog (sources at the bottom).

1. Two marker dialects

2. Placement & combining rules

3. Emotion markers (S2 [ ] / S1 ( ))

Basic (24): happy, sad, angry, excited, calm, nervous, confident, surprised, satisfied, delighted, scared, worried, upset, frustrated, depressed, empathetic, embarrassed, disgusted, moved, proud, relaxed, grateful, curious, sarcastic

Advanced (25): disdainful, unhappy, anxious, hysterical, indifferent, uncertain, doubtful, confused, disappointed, regretful, guilty, ashamed, jealous, envious, hopeful, optimistic, pessimistic, nostalgic, lonely, bored, contemptuous, sympathetic, compassionate, determined, resigned

S2 also accepts modifiers on any of these: [slightly nostalgic], [very calm].
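As a minimal sketch of the two dialects named in the heading (S2 uses square brackets, S1 uses parentheses), the helper below formats an emotion marker. The function name `emotion_marker` is hypothetical, and treating modifiers as S2-only is an assumption drawn from the note above, which documents them only for S2.

```python
from typing import Optional


def emotion_marker(emotion: str, dialect: str = "s2",
                   modifier: Optional[str] = None) -> str:
    """Format an emotion marker for the given dialect (hypothetical helper)."""
    if dialect == "s2":
        # S2: square brackets, optional modifier such as "slightly" or "very".
        label = f"{modifier} {emotion}" if modifier else emotion
        return f"[{label}]"
    if dialect == "s1":
        # Assumption: modifiers are rejected here because the reference
        # only documents them for S2.
        if modifier:
            raise ValueError("modifiers are documented for S2 only")
        return f"({emotion})"
    raise ValueError(f"unknown dialect: {dialect!r}")


line = emotion_marker("calm", modifier="very") + " Let your shoulders drop."
```

Here `line` holds the S2-style text with the marker placed in front of the sentence; where a marker may legally sit is governed by the placement rules in section 2.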

4. Tone markers (5)

[in a hurry tone] [shouting] [screaming] [whispering] [soft tone]

5. Sound / audio effects (10)

[laughing] [chuckling] [sobbing] [crying loudly] [sighing] [groaning] [panting] [gasping] [yawning] [snoring]

Plus crowd effects: [audience laughing] [background laughter] [crowd laughing]
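When reviewing a script, it can help to see which bracketed markers fall into the tone and sound-effect catalogs listed here. The sketch below is a hypothetical review helper (`classify_markers` is not part of any Fish tooling); anything outside these three sets, such as emotion markers or pause tags, is simply labeled "other" and not treated as an error.

```python
import re

TONE = {"in a hurry tone", "shouting", "screaming", "whispering", "soft tone"}
EFFECTS = {"laughing", "chuckling", "sobbing", "crying loudly", "sighing",
           "groaning", "panting", "gasping", "yawning", "snoring"}
CROWD = {"audience laughing", "background laughter", "crowd laughing"}


def classify_markers(text):
    """Return (label, kind) pairs for every [ ] marker, in order of appearance."""
    found = []
    for label in re.findall(r"\[([^\[\]]+)\]", text):
        if label in TONE:
            kind = "tone"
        elif label in EFFECTS:
            kind = "effect"
        elif label in CROWD:
            kind = "crowd"
        else:
            kind = "other"  # emotions, pause tags, or typos
        found.append((label, kind))
    return found


report = classify_markers("[whispering] Breathe in. [sighing] Let go. [break]")
```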

6. Pauses & breathing (the "delay tags")

7. Pause words (natural rhythm)

Inserting filler words like "um", "uh" (or natural laughter written as "Ha,ha,ha") controls rhythm/realism without any tag. Use sparingly for a meditation voice.

8. Phoneme / pronunciation control ← important for names

Force exact pronunciation with: <|phoneme_start|>PHONEMES<|phoneme_end|>
- English: CMU Arpabet (per word)
- Chinese: tone-number pinyin
- Japanese: OpenJTalk romaji with pitch-accent digits

This is how we make sure a person's name (e.g. an unusual spelling) is voiced correctly in their Soul Song.
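A minimal sketch of wrapping a name in the phoneme tags follows. It assumes the tagged block replaces the written word in place and that Arpabet symbols are separated by spaces; neither detail is confirmed by this reference. The helper name `with_phonemes` and the Arpabet string for "Saoirse" are illustrative only.

```python
PHONEME_START = "<|phoneme_start|>"
PHONEME_END = "<|phoneme_end|>"


def with_phonemes(text: str, word: str, phonemes: str) -> str:
    """Swap the first occurrence of `word` for a phoneme-tagged block."""
    tagged = f"{PHONEME_START}{phonemes}{PHONEME_END}"
    # Assumption: the tagged phonemes stand in for the word itself.
    return text.replace(word, tagged, 1)


# Illustrative Arpabet spelling; check it against a CMU dictionary lookup.
script = with_phonemes("Welcome, Saoirse.", "Saoirse", "S ER1 SH AH0")
```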

9. API params (not in-text)

prosody: { "speed": <float>, "volume": <int> }. We render at speeds from 0.80 to 0.88.

Caveat (measured): the speed knob is coarse. Renders at 0.88 and 0.93 came out at nearly identical length. For a real pacing change, move in bigger steps or add [break]/[long-break] between sentences (which we now do).

The model is chosen by the request model header: s2 (we use this) or s1.
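The pacing approach above (a prosody setting plus [break] tags between sentences) can be sketched as a payload builder. This makes no network call. The body field name `text`, the header key `model`, the default volume of 0, and the function name `build_request` are assumptions for illustration; only the `prosody` shape, the s2/s1 values, and the break tags come from this reference.

```python
import re


def build_request(script, speed=0.85, volume=0, model="s2", pause="[break]"):
    """Return (headers, body) dicts for a TTS request (hypothetical shape)."""
    # Split after sentence-ending punctuation, then rejoin with the pause tag.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s]
    text = f" {pause} ".join(sentences)
    headers = {"model": model}  # assumed header key
    body = {
        "text": text,  # assumed field name
        "prosody": {"speed": speed, "volume": volume},
    }
    return headers, body


headers, body = build_request("Breathe in. Breathe out.")
```

Passing `pause="[long-break]"` gives the longer gap; because the speed knob is coarse, changing the pause tag is likely the more reliable lever for pacing.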

Meditation / Soul Song cheat-sheet

Sources

Verified in-engine 2026-06-25: Fish s2 honors [break] as inserted silence; all 1-min A/B renders passed audio_qc.py.