Updated Jun 08, 2026 · Affirmology_CorpusBuild_Report_v1.md
Prepared for: Jeff Parker

Period covered: Thursday June 4 evening through Monday morning (your Camp Brotherhood window)

Status: Scrape pass completed cleanly. Structured records: zero (Gemini path blocked by org policy chaos). 4.3 million words of raw text now sitting on the SSD, ready to be structured this week.
The system worked. The scraping run completed cleanly in 10 minutes 44 seconds on Thursday evening - not 84 hours. The runner exhausted its source list far faster than expected, even though several Internet Archive collections were unexpectedly deep. By the time you closed your laptop and left, the corpus build had already finished its main pass and was sitting in "completed" state. The 84-hour budget was never the binding constraint; the source list was.
What that means in practical terms: the weekend wasn't wasted; the actual work happened in the first 11 minutes. For the remaining 83 hours your Mac sat idle (the scraper had exited cleanly, with caffeinate still preventing sleep). No harm done.
What landed on the SSD:
This breaks the corpus down by where the words live now.
| Tradition | Documents | Words | Notes |
|---|---|---|---|
| Western astrology | 50 | 3,524,466 | Massively overrepresented because Internet Archive's Alan Leo and John Gadbury collections were deeper than expected |
| Vedic astrology | 10 | 395,861 | BPHS Sanskrit alone is 180K words. Solid Tier A foundation. |
| Numerology | 7 | 158,860 | Cheiro + Westcott + Sepharial give clean PD coverage of Pythagorean and Chaldean systems |
| Gene Keys | 4 | 131,738 | Wilhelm I Ching GitHub dataset = full hexagram-by-hexagram base layer for both GK and HD |
| Human Design | 5 | 87,583 | Light, because most independent-practitioner blogs blocked our bot |
| TOTAL | 76 | 4,302,924 | |
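To put a number on "massively overrepresented", here is a small sketch that turns the table's word counts into percentage shares. The figures are copied from the table; the function name is illustrative only. One thing it surfaces: the five tradition rows sum to 4,298,508 words, which is 4,416 fewer than the TOTAL row. The table does not say where the difference comes from, so the shares below are computed against the row sum.

```python
# Word counts copied from the table above.
WORDS_BY_TRADITION = {
    "Western astrology": 3_524_466,
    "Vedic astrology": 395_861,
    "Numerology": 158_860,
    "Gene Keys": 131_738,
    "Human Design": 87_583,
}
REPORTED_TOTAL = 4_302_924  # the TOTAL row exactly as reported


def word_shares(counts):
    """Return each tradition's share of the summed word count, in percent."""
    total = sum(counts.values())
    return {name: 100 * words / total for name, words in counts.items()}


row_sum = sum(WORDS_BY_TRADITION.values())
shares = word_shares(WORDS_BY_TRADITION)

# Print the largest share first.
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<18} {pct:5.1f}%")
print(f"rows sum to {row_sum:,}; reported total is {REPORTED_TOTAL:,}")
```

Western astrology comes out at roughly 82% of the words, so any retrieval that does not filter or weight by tradition will lean heavily classical-Western.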
The top 15 documents by word count are dominated by classical Western astrology:
The Alan Leo Internet Archive collection alone delivered eight full books - a far richer pull than I'd estimated. The single search query "alan leo astrology" returned and downloaded eight pre-1928 books in one go.
Three categories.
1. The Gemini authentication hellscape. Your Google account is in a Workspace organization with a security policy that blocks AI Studio API keys at the standard surface. The Agent Platform settings page (where you ended up) issues AQ-prefix keys that authenticate but are tied to projects without billing. The $300 free-trial credit exists but is not linked to the project that owns the AQ key. The ADC path (Application Default Credentials) is the correct workaround but requires gcloud CLI installed, which the bash setup script you ran did not actually install. Net result: we couldn't enable structured-record extraction during the run. The pipeline ran scrape-only, which still produces raw extracted text - but no typed JSON records yet.
Why this matters: the structured records are what generation agents (script generator, Sacred Audio Report PDF, etc.) query at runtime to compose interpretations. Raw extracted text is the input to structuring; structured records are the output. We have the input, not yet the output.
2. Robots.txt and bot-detection blocks on 22 sources. Several of the highest-priority Tier B sources rejected our scraper:
These weren't the kind of failures where we throttle and retry. They were polite hard-refusals from the sites' robots.txt files. Our scraper respects them, by design. The sites would let a browser through but won't let our identified-as-bot user agent through. We have two options for next week: switch to a browser-style User-Agent header on the next pass (some of these may unblock), or pull the same authors' content from podcast transcripts and YouTube channels, which use different infrastructure.
3. Source list exhausted faster than predicted. The 67-source registry I built proved to be light for an 84-hour budget. With Internet Archive's per-search enumeration returning multiple books per source, the actual work-units were maybe 110-140, finished in about 11 minutes. We were under-prepared for a long-running scrape. The fix is straightforward: add more sources to the registry, particularly YouTube channels (which take longer per source because of per-video transcript pulls) and per-author blog crawls with depth-2 link following.
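A rough back-of-the-envelope sketch of how light the registry was for the budget. It assumes every work-unit takes the same time as the average unit in Thursday's run, which will not hold for slower sources such as per-video transcript pulls, so treat the result as an order of magnitude only. The function name is illustrative.

```python
RUN_SECONDS = 10 * 60 + 44      # the run took 10 minutes 44 seconds
BUDGET_SECONDS = 84 * 3600      # the 84-hour budget
WORK_UNITS_RANGE = (110, 140)   # estimated work-units actually processed


def units_that_fit(units_done, run_seconds, budget_seconds):
    """Work-units the budget could hold at the observed average pace.

    Uses integer arithmetic: budget / (run_seconds / units_done), floored.
    """
    return budget_seconds * units_done // run_seconds


low, high = (units_that_fit(u, RUN_SECONDS, BUDGET_SECONDS) for u in WORK_UNITS_RANGE)
print(f"seconds per unit: {RUN_SECONDS / WORK_UNITS_RANGE[1]:.1f} to "
      f"{RUN_SECONDS / WORK_UNITS_RANGE[0]:.1f}")
print(f"units that would fit in the budget: {low:,} to {high:,}")
```

At Thursday's pace the budget had room for tens of thousands of work-units, against the 110-140 the registry actually produced.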
In order of how much they matter for getting to a working script generator.
1. Zero structured records. This is the gating issue. Generation agents need typed records to compose from. Without structured records, the next script generator improvements have nothing new to draw from. Solution: fix Gemini auth this week, run a structure-only pass over the existing 76 documents (no scraping needed, the text is already on disk).
2. Tier B coverage is uneven across traditions. Western astrology has 5 Tier B docs; Human Design has 3 Tier B docs; Gene Keys has 0 Tier B docs (the Ashley Mosaic blog we counted on was blocked by robots.txt). Whatever generation agent draws from this corpus right now will be heavily biased toward classical voice (Lilly, Gadbury, Leo, Cheiro) rather than modern practitioner voice. Solution: retry blocked sources with a browser UA. If that fails, add YouTube transcript ingestion for those same practitioners (Jenna Zoe, Karen Curry Parker, etc. all have substantial YouTube content with auto-captions we can pull).
3. The Gemini wall is real and not solved. Until you have a working API key in a billing-linked project, OR ADC properly set up via installed gcloud CLI, the structured records pass cannot run. We tried four different paths Thursday evening, all blocked by some combination of org policy, missing billing link, or uninstalled tooling. Solution: Monday afternoon, with fresh eyes, install gcloud properly via Homebrew, run gcloud auth application-default login, link your trial billing to the project you point at. About 30 minutes total. Or, alternatively, switch the structurer to use Anthropic Claude Haiku (we already have that API key working) - roughly 13x more expensive than Gemini Flash on input tokens, but immediately available.
4. No connection between corpus and generation agents yet. The script generator (affirmology/agents/script_generator.py) still reads from Claude's training-time knowledge plus the chart JSON. It does not query the corpus database. Solution: add a corpus_lookup(chart_element, n=10) helper that retrieves the top-N records for any chart element from the corpus DB, then update the script generator's prompt template to include those records as context. Maybe 90 minutes of work once structured records exist.
5. The robots.txt blocks need triage. Some sites we can unblock with a browser-style header; some we can't and shouldn't try. We should split the 22 blocked sources into "retry with new UA" and "drop, find alternative." Solution: an afternoon of source-list curation.
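One possible way to make the triage in item 5 mechanical, sketched with Python's standard `urllib.robotparser`. The bot name `AffirmologyBot` and the sample robots.txt bodies are hypothetical stand-ins, not the scraper's real user agent or any real site's rules. The heuristic is an assumption of this sketch: if the rules only refuse our bot by name, the source goes in the "retry with new UA" pile; if they refuse every agent, a browser-style header will not change what robots.txt says, so it goes in "drop, find alternative". It only reads robots.txt, so bot-detection blocks (403s, JavaScript challenges) still need a manual look.

```python
from urllib.robotparser import RobotFileParser

BOT_UA = "AffirmologyBot/1.0"   # hypothetical name for our scraper
BROWSER_UA = "Mozilla/5.0"      # stand-in for a browser-style User-Agent


def triage(robots_txt, url, bot_ua=BOT_UA, browser_ua=BROWSER_UA):
    """Classify a source from the text of its robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(bot_ua, url):
        return "allowed"                  # robots.txt is not what blocked us
    if parser.can_fetch(browser_ua, url):
        return "retry with new UA"        # only our named bot is refused
    return "drop, find alternative"       # every agent is refused


# Hypothetical robots.txt bodies for illustration.
BLOCKS_OUR_BOT_ONLY = "User-agent: AffirmologyBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
BLOCKS_EVERYONE = "User-agent: *\nDisallow: /\n"

url = "https://example.com/blog/post"
print(triage(BLOCKS_OUR_BOT_ONLY, url))
print(triage(BLOCKS_EVERYONE, url))
print(triage("", url))
```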
Worth pausing to note what didn't break.
The archive_org_search URL gave us 17 pre-1928 books from the Alan Leo collection in one shot. That pattern is replicable for other classical authors (Bonatti, Lilly, Cheiro have similar collections).

Concrete, sequenced.
Day 1 (today, Monday). Get Gemini working OR switch to Claude Haiku as the structurer.
The path of least resistance is Claude Haiku. You already have the Anthropic API key working (it ran every Sacred Audio Report PDF and every script generation last week). Switching the structurer to Haiku is a 30-minute code change on my side, no Google Cloud politics. Cost difference: Gemini Flash at $0.075/M input vs Haiku at $1.00/M input - roughly 13x more expensive. For our 4.3M words of corpus text, structuring with Haiku is about $40-60 instead of Gemini's $3-5. Worth it to unblock the pipeline today.
If you want to make the Gemini path work properly anyway (for future cost savings), the right sequence is: brew install --cask google-cloud-sdk, then gcloud auth application-default login, then link your trial billing to the project. About 30 minutes. We can do both: Haiku for the immediate unblock, Gemini for future passes when set up.
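Before retrying the Gemini path, a quick preflight can confirm that the pieces the ADC route depends on are actually present, since the missing gcloud install was what caught us out. This is a minimal sketch using only the standard library. It checks that `gcloud` is on the PATH and that a credentials file exists at the default location `gcloud auth application-default login` writes to on macOS and Linux (or at the path in `GOOGLE_APPLICATION_CREDENTIALS`). It does not check billing linkage or org policy, and the function name is illustrative.

```python
import os
import shutil
from pathlib import Path


def adc_preflight(env=None, home=None):
    """Report whether the local prerequisites for ADC appear to be in place."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Default ADC location on macOS/Linux (assumes CLOUDSDK_CONFIG is not overridden).
    well_known = home / ".config" / "gcloud" / "application_default_credentials.json"
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    return {
        "gcloud_on_path": shutil.which("gcloud") is not None,
        "adc_file_present": well_known.is_file(),
        "explicit_credentials_file": bool(explicit) and Path(explicit).is_file(),
    }


if __name__ == "__main__":
    for check, ok in adc_preflight().items():
        print(f"{check}: {'ok' if ok else 'missing'}")
```

If `gcloud_on_path` comes back missing, the Homebrew install step has not taken effect yet and the login command will fail as it did on Thursday.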
Day 2. Run a structure-only pass over the existing 76 documents. The runner currently supports scrape+structure; I'll wire up a proper --mode structure-only. 4.3M words in chunks of ~3,500 tokens (counting a word as roughly one token) = roughly 1,200 LLM calls. With Haiku, about $40. Output: ~15,000-25,000 typed JSON records. After this, the corpus is queryable.
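The call-count estimate can be reproduced with a simple chunker. This sketch splits on whitespace and treats the ~3,500 chunk size as a word count, which is an assumption: real token counts depend on the model's tokenizer and usually run higher than word counts, so the true number of calls may be larger. Function names are illustrative, not the runner's actual API.

```python
import math

CHUNK_SIZE = 3500  # words per chunk (assumed stand-in for ~3,500 tokens)


def chunk_words(text, chunk_size=CHUNK_SIZE):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def estimated_calls(total_words, chunk_size=CHUNK_SIZE):
    """One LLM call per chunk if the corpus were one continuous text."""
    return math.ceil(total_words / chunk_size)


print(estimated_calls(4_302_924))          # corpus-wide estimate
print(chunk_words("one two three four five", chunk_size=2))
```

Chunking per document rather than corpus-wide adds at most one partial chunk per document, so up to 76 extra calls on top of the estimate.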
Day 3. Wire the script generator to draw from the corpus instead of relying on Claude's training-time knowledge. This is the moat-building step. After this, every script ships referencing your proprietary corpus, not generic LLM knowledge.
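A minimal sketch of what the lookup helper for this wiring step could look like. Everything about the storage layer here is assumed for illustration: a SQLite database, a `records` table with `chart_element`, `tradition`, `tier`, `weight` and `text` columns, and ranking by a `weight` column. The real corpus DB schema and ranking rule are not settled yet, and the extra `conn` parameter is added so the example is self-contained. The rows inserted below are placeholders, not real corpus content.

```python
import sqlite3


def corpus_lookup(conn, chart_element, n=10):
    """Return up to n records for a chart element, highest weight first."""
    rows = conn.execute(
        "SELECT tradition, tier, text FROM records "
        "WHERE chart_element = ? ORDER BY weight DESC, id ASC LIMIT ?",
        (chart_element, n),
    ).fetchall()
    return [{"tradition": t, "tier": tier, "text": text} for t, tier, text in rows]


def build_context(records):
    """Format records as a plain-text block to splice into a prompt template."""
    return "\n".join(
        f"[{r['tradition']} / Tier {r['tier']}] {r['text']}" for r in records
    )


# Demo against an in-memory database with placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, tradition TEXT, "
    "chart_element TEXT, tier TEXT, weight REAL, text TEXT)"
)
conn.executemany(
    "INSERT INTO records (tradition, chart_element, tier, weight, text) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Western astrology", "sun_in_aries", "A", 0.9, "placeholder record 1"),
        ("Western astrology", "sun_in_aries", "B", 0.4, "placeholder record 2"),
        ("Vedic astrology", "sun_in_aries", "A", 0.7, "placeholder record 3"),
        ("Numerology", "life_path_7", "A", 0.8, "placeholder record 4"),
    ],
)

top = corpus_lookup(conn, "sun_in_aries", n=2)
print(build_context(top))
```

The script generator's prompt template would then receive the output of `build_context` alongside the chart JSON.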
Day 4-5. Triage the 22 blocked sources. Re-run scraping with a browser UA. Add YouTube transcript ingestion for the practitioners whose blogs are blocked. Refresh the corpus with the new Tier B content. Re-run the structure pass for the new documents only (cheap, incremental).
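One way to keep the re-run incremental is a manifest of content hashes: a document is sent to the structurer only if its hash is missing from the manifest or has changed. This is a sketch under assumptions, not the runner's current behavior. It assumes raw text lives in `.txt` files under one corpus directory and that the manifest is a JSON file; all names are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file's bytes so edits to an existing document are detected too."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def load_manifest(manifest_path):
    """Return the {relative path: hash} map, or an empty map on first run."""
    path = Path(manifest_path)
    return json.loads(path.read_text()) if path.exists() else {}


def documents_needing_structure(corpus_dir, manifest_path):
    """List documents that are new or changed since they were last structured."""
    seen = load_manifest(manifest_path)
    corpus_dir = Path(corpus_dir)
    return [
        p for p in sorted(corpus_dir.rglob("*.txt"))
        if seen.get(str(p.relative_to(corpus_dir))) != sha256_of(p)
    ]


def mark_structured(paths, corpus_dir, manifest_path):
    """Record the given documents as structured at their current content."""
    seen = load_manifest(manifest_path)
    for p in paths:
        seen[str(Path(p).relative_to(corpus_dir))] = sha256_of(p)
    Path(manifest_path).write_text(json.dumps(seen, indent=2))


# Demo in a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    corpus.mkdir()
    manifest = Path(tmp) / "structured_manifest.json"
    (corpus / "doc_a.txt").write_text("placeholder text A")
    first_pass = [p.name for p in documents_needing_structure(corpus, manifest)]
    mark_structured(documents_needing_structure(corpus, manifest), corpus, manifest)
    (corpus / "doc_b.txt").write_text("placeholder text B")
    second_pass = [p.name for p in documents_needing_structure(corpus, manifest)]

print(first_pass, second_pass)
```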
Mac mini setup. Plug SSD into Mac mini, install Python deps, point Tailscale at it. The corpus continues running on the Mac mini as the always-on background process. Your laptop is freed up.
Continuous improvement pass. Set up a weekly cron on the Mac mini that re-scrapes the same sources to pick up new blog posts and podcast episodes. The corpus stays fresh.
Script generator specialists. With a working corpus, start building the audio use-case specialists from your script-types list (walking meditation, Joe Dispenza journey, Gene Keys deep dive, HD walkthrough, astrology walkthrough). Each one is a prompt-template variation on the existing script generator, plus tradition-specific corpus queries.
If you do Haiku-based structuring today:

- 4.3M words → roughly $40 in Haiku spend
- One afternoon of running (~3-4 hours of Claude API calls at our rate limits)
- Outcome: ~15,000-25,000 typed records ready to feed generation agents

If you wait for Gemini setup first:

- Same 4.3M words → roughly $3-5 in Gemini spend
- Same outcome
- Tradeoff: setup time and Gemini configuration friction
My recommendation: do both. Use Haiku today to unblock the script-generator wiring. Configure Gemini properly this week so the NEXT structuring pass (after we add more sources) uses the cheaper provider.
When you have a moment, three answers will scope the next steps.
End of report. The corpus is real, the foundation is solid, the next pass unblocks generation.