Fish Audio for Affirmology - evaluation (v1)

Updated Jun 18, 2026 · Affirmology_VoiceTools_FishAudio_Eval_v1.md

Deep-dive on whether to experiment with Fish Audio (the platform) / OpenAudio / Fish Speech (the open-weight models), measured against the current ElevenLabs setup (demo voice Charlotte, trailer voice Lily), for the two use cases Jeff raised: (1) real-time conversational voice for in-person "talk to Sophia" avatar installations, and (2) cloning a person's voice to render the Sacred Audios in their own voice. Researched June 2026, multi-source and verified. Bottom line up top.

Verdict in one paragraph

Yes, worth experimenting, but in a specific, bounded way. Fish Audio's newest model (S2 Pro) rivals or beats ElevenLabs on raw naturalness in Fish's own blind preference test, gives finer breathy/whisper emotion control, and is far cheaper (~$1.25/hour of audio vs ElevenLabs' tiered credits). For the AVATAR INSTALLATION it is a strong, legitimate candidate as the voice layer (HeyGen's live-avatar product even lists Fish Audio as a supported TTS) and is the only top option you can self-host on a kiosk for offline reliability. For CLONING A REAL PERSON INTO A SOLD PRODUCT it is the weaker fit right now: ElevenLabs edges it on long-slow-script stability (exactly where you already fight dropouts) and has a built-in consent check, and although both vendors grant commercial rights on paid plans, Fish's open weights are non-commercial, so the only license-clean path is Fish's paid hosted service. Net: keep ElevenLabs as the production voice, spin up a cheap Fish paid plan, and run a real head-to-head on one actual Sacred Audio script plus a small avatar prototype before changing anything. The locked demo does not change regardless.


1. What Fish Audio actually is (naming clarified)

"Fish Audio" is the hosted platform (company Hanabi AI / 39 AI). "OpenAudio" is the research brand. "Fish Speech / OpenAudio S1 / S2 Pro" are the underlying open-weight models. The current flagship is S2 Pro (4B), released ~March 2026, trained on 10M+ hours across 80+ languages, with a streaming WebSocket API, ~100ms time-to-first-audio on server hardware, and 64+ inline emotion tags (including [calm], [soft tone], [whispering], [break]) that are tailor-made for meditation delivery.
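
A minimal sketch of how a meditation script might be marked up with those inline tags before it is sent for synthesis. The tag strings are the ones named above; the helper name `tag_script`, the one-tone-tag-per-paragraph layout, and the use of [break] between paragraphs are assumptions for illustration, not Fish's documented usage.

```python
# Subset of the inline tags named in this evaluation (the full set is larger).
ALLOWED_TAGS = {"calm", "soft tone", "whispering", "break"}


def tag_script(paragraphs, tone="calm"):
    """Prefix each non-empty paragraph with a tone tag and join them with [break].

    Hypothetical helper: the placement convention is an assumption, not Fish's spec.
    """
    if tone not in ALLOWED_TAGS:
        raise ValueError(f"unknown tag: {tone!r}")
    tagged = [f"[{tone}] {p.strip()}" for p in paragraphs if p.strip()]
    return " [break] ".join(tagged)


if __name__ == "__main__":
    print(tag_script(["I am safe.", "I am enough."], tone="soft tone"))
```

Keeping the markup in one helper means the same plain script can be rendered with different tones during a head-to-head test without editing the script text itself.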

2. Quality: how good is it really

Fish's own blind A/B test (5,098 preference pairs, March-April 2026) ranked S2 Pro #1, beating ElevenLabs v3 60/40. Independent reviews echo "the ElevenLabs killer" for short-to-medium content. Two honest caveats: that test was vendor-run on Fish's own platform, and an independent leaderboard (Artificial Analysis Speech Arena) does NOT place Fish in its top five as of late May 2026, where Gemini, Cartesia Sonic, and Inworld lead. So treat "#1 / SOTA" as vendor-claimed but real enough to take seriously. The consistent independent finding: Fish wins on raw warmth and naturalness; ElevenLabs still edges it on long-form, multi-hour consistency, which is the meditation case.

3. Pricing comparison

| | Fish Audio | ElevenLabs |
|---|---|---|
| Entry paid (commercial OK) | Plus ~$11/mo (~200 min) | Starter $6/mo (~30 min), Creator $22/mo (~121 min) |
| Pro tier | $75/mo (~1,620 min) | Pro $99/mo (~600 min) |
| API rate | ~$15 per 1M bytes ≈ ~$1.25/hour of audio | ~$0.17 to $0.18 per extra minute (Flash/Turbo 50% less) |
| Per ~7 min Sacred Audio | a few cents | ~$1 to $2 of quota |
| Free tier | non-commercial only | non-commercial only, requires attribution |

Fish is materially cheaper at volume. Both grant commercial use and perpetual rights to output generated while on a paid plan. Neither charges per-output royalties on the hosted service.
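
The per-audio figures can be reproduced with a few lines of arithmetic. This is a rough sketch using only the approximate rates quoted above; the function names are hypothetical, and real invoices depend on plan quotas and on how many bytes a given script actually contains (a slow meditation script has fewer bytes per minute than ordinary narration, so it should cost less than the hourly figure implies).

```python
# Approximate rates quoted in this evaluation (June 2026); treat as estimates.
FISH_USD_PER_MILLION_BYTES = 15.0
FISH_USD_PER_AUDIO_HOUR = 1.25
ELEVENLABS_USD_PER_EXTRA_MINUTE = (0.17, 0.18)  # low, high


def fish_cost_from_text(script: str) -> float:
    """Fish API cost estimated from the UTF-8 byte length of the script."""
    n_bytes = len(script.encode("utf-8"))
    return n_bytes / 1_000_000 * FISH_USD_PER_MILLION_BYTES


def fish_cost_from_minutes(minutes: float) -> float:
    """Fish API cost estimated from audio duration at the ~$1.25/hour figure."""
    return minutes / 60 * FISH_USD_PER_AUDIO_HOUR


def elevenlabs_overage_cost(minutes: float) -> tuple:
    """(low, high) ElevenLabs cost for minutes billed at the extra-minute rate."""
    low, high = ELEVENLABS_USD_PER_EXTRA_MINUTE
    return minutes * low, minutes * high


if __name__ == "__main__":
    minutes = 7  # one Sacred Audio
    print(f"Fish, duration-based:  ${fish_cost_from_minutes(minutes):.2f}")
    low, high = elevenlabs_overage_cost(minutes)
    print(f"ElevenLabs overage:    ${low:.2f} to ${high:.2f}")
```

The duration-based Fish estimate is an upper-end figure; running `fish_cost_from_text` on the actual script gives the number that matters for billing.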

4. The licensing trap (important, and a correction)

Fish Speech / OpenAudio weights are open-weight but NOT open-commercial. The current S2 weights are under the "Fish Audio Research License" (non-commercial; commercial use requires a separate written agreement); older S1/S1-mini are CC-BY-NC-SA-4.0 (also non-commercial). A common claim online that the code is "MIT" or "Apache" is STALE for the current repo, verified by reading the raw LICENSE file. Practical meaning:

- Self-hosting the free weights on your own GPU to produce SOLD Sacred Audios is a license violation. Tempting on cost (24GB GPU, ~$0 per minute) but not allowed without a negotiated commercial license (business@fish.audio).
- The license-clean commercial path is Fish's PAID hosted service / API, whose terms expressly grant paid users commercial use "to the fullest extent possible." So if you use Fish in production, use their paid API, not a self-hosted box, unless you sign a commercial deal.
- Like Suno and ElevenLabs, Fish gives no IP indemnification and caps its liability at ~$100; you carry the legal risk. (ElevenLabs is the same: paid users are indemnitors, not indemnified.)

5. Use case 1 - real-time "talk to Sophia" avatar installation

This is a genuinely new build and a fun one. The standard stack is speech-to-text -> LLM -> TTS -> avatar lip-sync over WebRTC, orchestrated by something like Pipecat or LiveKit, or handled end-to-end by HeyGen's LiveAvatar.

Where Fish fits and how it stacks up:

- Fish supports low-latency streaming TTS (a latency: low mode and a flush event built for interactive use) and, crucially, is self-hostable on the kiosk for offline reliability, the one thing the premium options can't do. For an in-person installation where internet can hiccup, a local voice layer is a real advantage.
- For pure speed, Cartesia Sonic (~40ms) and ElevenLabs Flash v2.5 (~75ms) lead; Fish is mid-tier on independent latency/quality. Google Gemini Live and Hume EVI are all-in-one speech-to-speech engines (no separate TTS) worth knowing about.
- The animated avatar itself is effectively cloud-only today (HeyGen LiveAvatar, Tavus, D-ID, Simli all stream rendered video from their servers). HeyGen LiveAvatar explicitly accepts Fish Audio as the TTS provider and can run the whole ASR+LLM+TTS+avatar pipeline, so "talk to Sophia, animated Sophia answers" is buildable today with Fish as the voice. Tavus has the best sub-1-second conversational avatar latency if realism is the priority.
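
The stack described above can be sketched as a chain of swappable stages, which is what makes "try Fish as the voice" a one-line change in a prototype. Everything here is a stand-in: the stage functions are hypothetical stubs that pass text around instead of audio and video, and none of this is the Pipecat, LiveKit, HeyGen, or Fish API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    """One visitor utterance moving through the pipeline."""
    audio_in: bytes
    transcript: str = ""
    reply_text: str = ""
    audio_out: bytes = b""
    video_frames: int = 0


Stage = Callable[[Turn], Turn]


def fake_stt(turn: Turn) -> Turn:
    # Stub speech-to-text: pretend the audio bytes are already text.
    turn.transcript = turn.audio_in.decode("utf-8")
    return turn


def fake_llm(turn: Turn) -> Turn:
    # Stub language model: produce a canned reply from the transcript.
    turn.reply_text = f"Sophia hears: {turn.transcript}"
    return turn


def make_fake_tts(voice: str) -> Stage:
    # Stub TTS factory: the voice vendor is the only thing that changes.
    def tts(turn: Turn) -> Turn:
        turn.audio_out = f"<{voice}>{turn.reply_text}".encode("utf-8")
        return turn
    return tts


def fake_avatar(turn: Turn) -> Turn:
    # Stub lip-sync: one "frame" per byte of synthesized audio.
    turn.video_frames = len(turn.audio_out)
    return turn


def run_pipeline(turn: Turn, stages: List[Stage]) -> Turn:
    """Run the stages in order: STT -> LLM -> TTS -> avatar."""
    for stage in stages:
        turn = stage(turn)
    return turn


if __name__ == "__main__":
    stages = [fake_stt, fake_llm, make_fake_tts("fish"), fake_avatar]
    print(run_pipeline(Turn(audio_in=b"hello"), stages))
```

Swapping `make_fake_tts("fish")` for `make_fake_tts("elevenlabs")` leaves the other stages untouched, which mirrors how an orchestrator lets the voice vendor be compared in isolation.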

Recommendation for use case 1: prototype it. Fastest path to a working demo is HeyGen LiveAvatar (pick the voice) or ElevenLabs Agents + HeyGen. Try Fish Audio as the voice in that prototype, and separately test a self-hosted Fish setup if true offline kiosk reliability matters. This is R&D, not production, so the licensing concern is lower; just use a paid Fish plan if any of it is shown publicly or commercially.

6. Use case 2 - clone a person's voice into the sold Sacred Audios

This is the higher-stakes one because it touches a sold product and a real person's identity.

Quality and reliability: For a calm, slow, breathy, long first-person narration, ElevenLabs Professional Voice Cloning (30+ min of pristine source, fine-tuned, not Instant cloning, not Turbo) is the more robust choice. Reviews consistently give ElevenLabs the edge on long-slow-script consistency, which is exactly the axis where you already fight Turbo dropouts. The fix for your dropout problem is also here: PVC on a stable model (Multilingual v2 or v3) at moderate stability (~35 to 40%, never below 30%) and capped similarity (≤80%); aggressive settings are what cause the dropouts. Fish S2 clones from as little as 10 to 30 seconds and sounds excellent, but breathy/gravelly source voices "can lose some character," and long-form consistency is its known weak spot.
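
The setting ranges above are easy to drift away from during tuning, so a small pre-render guard may help. This is a sketch under stated assumptions: the names `CloneSettings` and `check_settings` are hypothetical, values are expressed as fractions (0.35 for 35%), and it checks numbers only, without calling the ElevenLabs API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CloneSettings:
    stability: float   # 0.0 to 1.0
    similarity: float  # 0.0 to 1.0


def check_settings(s: CloneSettings) -> List[str]:
    """Return warnings for settings outside the ranges recommended above."""
    warnings = []
    if s.stability < 0.30:
        warnings.append("stability below 0.30: dropout risk")
    elif not 0.35 <= s.stability <= 0.40:
        warnings.append("stability outside the ~0.35 to 0.40 band")
    if s.similarity > 0.80:
        warnings.append("similarity above the 0.80 cap")
    return warnings


if __name__ == "__main__":
    for settings in (CloneSettings(0.38, 0.75), CloneSettings(0.25, 0.90)):
        print(settings, check_settings(settings) or "ok")
```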

Consent and law (this matters more than the audio): cloning a real person into something you SELL is the highest-risk category under 2026 law. The Tennessee ELVIS Act and similar state laws protect voice likeness; the federal NO FAKES Act is not yet law (state patchwork); and the EU AI Act Article 50 requires AI-generated audio to be disclosed to listeners from August 2, 2026 if you have any EU users. Regardless of vendor you need (a) a signed written voice-clone-and-commercial-use release per person covering purpose, channels, duration, territory, reuse, compensation, and deletion/revocation, and (b) an AI-generated disclosure to the listener. Vendor difference: ElevenLabs enforces a consent voice-captcha (the person reads a prompt to prove presence) on every clone; Fish Audio does not pre-verify at all and pushes 100% of the consent burden onto you contractually. For a product built on customers' or founders' voices, ElevenLabs' built-in check is a meaningful safety rail.

Recommendation for use case 2: stay on ElevenLabs PVC as the production path for any real-person clone, and treat Fish as an A/B challenger you test on actual scripts, not a switch you flip. Get an attorney to bless a consent-release template before any sold audio uses a cloned real voice.

7. So should Affirmology experiment with Fish now

Yes, cheaply and in parallel, without touching production:

1. Buy the Fish Plus plan (~$11/mo, commercial-OK) and clone a test voice, then render ONE real Sacred Audio script with S2 Pro using the [calm] / [soft tone] / [break] tags and run it through your existing audio_qc.py. Compare head-to-head against the Charlotte render for warmth, calm, and artifact-free long-form delivery.
2. Stand up a small avatar prototype (HeyGen LiveAvatar with Fish as the TTS, or a Pipecat pipeline) for the "talk to Sophia" installation, as an R&D track separate from the Studio and demo.
3. Keep ElevenLabs as the production voice meanwhile. The locked demo does not change.
4. Before any sold product uses a cloned real voice (Fish or ElevenLabs), put the written consent release and the AI-disclosure in place.

Where Fish is the stronger fit: cheap, highly natural speech beds, the avatar installation voice, and offline/kiosk self-hosting (with a commercial license). Where ElevenLabs stays stronger: long-slow-script stability for the sold Sacred Audios, built-in consent verification, and the mature cloning workflow.


Sources: Fish Audio blog + docs + GitHub LICENSE (fish.audio, docs.fish.audio, github.com/fishaudio/fish-speech), HuggingFace model cards, Fish Audio Terms (fish.audio/terms); ElevenLabs pricing/models/cloning docs + ToS (elevenlabs.io); Cartesia, Hume, Google Vertex Gemini Live, HeyGen LiveAvatar, Tavus, Pipecat/LiveKit docs; Artificial Analysis Speech Arena; EU AI Act Article 50; Tennessee ELVIS Act; NO FAKES Act tracking. Full URL list retained in the research run.