Updated Jun 18, 2026 · Affirmology_CorpusAcquisition_Plan_v1.md
Hand this to Claude Code and let it run unattended within budget. Jeff is tired of babysitting the corpus. This reframes the work so we stop grinding junk and start acquiring real material.
We have been spending structuring credits RE-MINING an existing scrape that is full of archive.org search?query=... pages, index pages, and junk. That is the wrong layer. The "6 Western big books" all turned out to be search-result pages, not books, so they yielded nothing. The leverage is ACQUISITION: go fetch real, high-quality, legally-clean full text and parse THAT. The books are fine; we were fetching the wrong URL. Expect far higher yield per dollar from real sources than from re-grinding the old pile.
Real public-domain (Tier A) Western astrology full text is freely and legally available. Fetch the actual text endpoint, never the search URL:
- Project Gutenberg: plain-text / HTML. E.g. Ptolemy, Tetrabiblos (#70850); Sepharial, "Astrology: How to Make and Read Your Own Horoscope" (#46963) and his other titles.
- archive.org: use the *_djvu.txt full-text URL of a book's details page (NOT /search). E.g. Raphael, "A Manual of Astrology"; Alan Leo titles (Astrology for All, Esoteric Astrology, How to Judge a Nativity); William Lilly, "Christian Astrology."
- Forgotten Books and sacred-texts.com astrology sections (verify public-domain status per item).
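The "text endpoint, never the search URL" rule can be enforced before any fetch. The following is a minimal sketch, not the project's actual code: the function names are hypothetical, and both URL patterns are assumptions that should be verified per title (on archive.org in particular, the full-text file name inside an item does not always match the item identifier).

```python
from urllib.parse import urlparse


def is_search_or_index_url(url: str) -> bool:
    """Return True for archive.org search pages, which carry no book text."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.lower()
    on_archive = host == "archive.org" or host.endswith(".archive.org")
    return on_archive and (
        path.startswith("/search") or path.startswith("/advancedsearch")
    )


def gutenberg_text_url(ebook_id: int) -> str:
    # ASSUMPTION: plain-text UTF-8 URL pattern; confirm it resolves per ebook.
    return f"https://www.gutenberg.org/ebooks/{ebook_id}.txt.utf-8"


def archive_fulltext_url(identifier: str) -> str:
    # ASSUMPTION: the full-text file is named <identifier>_djvu.txt.
    # Check the item's file listing when this returns a 404.
    return f"https://archive.org/download/{identifier}/{identifier}_djvu.txt"


def assert_fetchable(url: str) -> str:
    """Refuse to queue a URL that is a search page rather than a text file."""
    if is_search_or_index_url(url):
        raise ValueError(f"search/index page, not book text: {url}")
    return url
```

Rejecting bad URLs at queue time means no structuring credit is spent on a page that cannot contain book text.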
These are big, dense, on-topic books. Chunk them deeply (--chunk-chars, high --max-chunks-per-doc) and structure. This is the highest-yield Western work available and it is legal.
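To make "chunk them deeply" concrete, here is a rough sketch of character-budgeted chunking. It is illustrative only: `chunk_text`, `chunk_chars`, and `max_chunks` are hypothetical names that mirror the `--chunk-chars` and `--max-chunks-per-doc` flags, and the real structurer may split differently.

```python
def chunk_text(text: str, chunk_chars: int = 4000, max_chunks: int = 200) -> list[str]:
    """Pack paragraphs into chunks of at most chunk_chars characters.

    Paragraphs are kept whole where possible; a single paragraph longer
    than chunk_chars is hard-split. Output is truncated to max_chunks.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Hard-split oversized paragraphs so every piece fits the budget.
        pieces = [para[i:i + chunk_chars] for i in range(0, len(para), chunk_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= chunk_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks[:max_chunks]
```

For a long book, a high `max_chunks` keeps the later chapters from being silently dropped, which is the point of raising the per-document limit.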
VEDIC (Tier A public domain): sacred-texts.com Hindu/Jyotish section; clearly public-domain Parashara/Jyotish translations on archive.org _djvu.txt. AVOID modern in-copyright authors (e.g. recent B.V. Raman editions) - stick to clearly public-domain texts.
GENE KEYS - KEEP THE TIER WALL. Richard Rudd's official Gene Keys and Ra Uru Hu / Jovian Archive are TIER C: benchmark only, NEVER scraped. Gene Keys A/B material = the public-domain I Ching roots of the 64 gates (Legge's translation is public domain; the 1950 Wilhelm translation is NOT) and genuinely A/B-licensed secondary/community commentary. Do not let Tier C text into the retrievable corpus.
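One way to hold the tier wall in code is to filter on tier before anything reaches the retrievable corpus. This is a sketch under assumptions: the `Source` type and `retrievable` function are hypothetical, and the source names in the usage note are placeholders.

```python
from dataclasses import dataclass

KNOWN_TIERS = {"A", "B", "C"}
RETRIEVABLE_TIERS = {"A", "B"}  # Tier C is benchmark-only per the plan


@dataclass(frozen=True)
class Source:
    name: str
    tier: str  # "A", "B", or "C"


def retrievable(sources: list[Source]) -> list[Source]:
    """Keep only Tier A/B sources; fail closed on an unrecognized tier."""
    allowed = []
    for source in sources:
        if source.tier not in KNOWN_TIERS:
            raise ValueError(f"unknown tier {source.tier!r} for {source.name}")
        if source.tier in RETRIEVABLE_TIERS:
            allowed.append(source)
    return allowed
```

Passing `[Source("Legge I Ching", "A"), Source("Official Gene Keys", "C")]` returns only the Legge entry. Raising on an unknown tier, rather than skipping it, means a mislabeled source stops the run instead of slipping through.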
The Gemini API key works now, so re-open the structurer model choice (the old "HOLD on Haiku" was when Gemini billing was blocked). Jeff wants it robust: use the cheaper model whenever it works, escalate only when it doesn't. Make this AUTOMATIC, not a decision Jeff has to make:
- DEFAULT to the cheapest model (Gemini Flash) for every batch.
- On each new source/tradition, the pilot doubles as a check: if Flash's pilot yields well and the records look clean on a quick spot-check, run the whole batch on Flash.
- ESCALATE to Haiku (or Nemotron) ONLY for that batch if Flash's pilot underperforms (low yield or junky records). Then keep going.
- One-time at the start: a 10-doc bake-off (Flash vs Haiku vs Nemotron) on the same recovered docs to set the baseline; after that it's automatic per the rule above.
Jeff has Gemini, Nemotron, and Haiku credit. Optimize on cost-per-good-record, not on raw model preference.
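The escalation rule above could be sketched roughly as follows. Everything here is an assumption for illustration: the model strings are plain labels rather than real API model IDs, `MIN_YIELD_PER_DOC` is a placeholder threshold to be replaced by the bake-off baseline, and `PilotResult` is a hypothetical record type.

```python
from dataclasses import dataclass

DEFAULT_MODEL = "gemini-flash"   # label only, not an API model ID
ESCALATION_MODEL = "haiku"       # label only; Nemotron is the alternative
MIN_YIELD_PER_DOC = 1.0          # ASSUMPTION: placeholder threshold


@dataclass
class PilotResult:
    model: str
    docs: int
    good_records: int
    cost_usd: float

    @property
    def yield_per_doc(self) -> float:
        return self.good_records / self.docs if self.docs else 0.0

    @property
    def cost_per_good_record(self) -> float:
        # A pilot with zero good records is infinitely expensive per record.
        return self.cost_usd / self.good_records if self.good_records else float("inf")


def choose_batch_model(pilot: PilotResult, spot_check_clean: bool) -> str:
    """Stay on the cheap default unless its pilot underperforms."""
    if pilot.yield_per_doc >= MIN_YIELD_PER_DOC and spot_check_clean:
        return DEFAULT_MODEL
    return ESCALATION_MODEL


def bakeoff_winner(results: list[PilotResult]) -> str:
    """Pick the model with the lowest cost per good record."""
    return min(results, key=lambda r: r.cost_per_good_record).model
```

Ranking on `cost_per_good_record` rather than raw yield keeps the decision tied to the stated goal: a pricier model only wins if its extra good records pay for themselves.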
- Set a --max-cost-usd cap on every batch. Never start a batch you cannot finish within remaining credit.
- Push corpus.db to R2 after EACH significant piece (each tradition batch, each big-book set). Each push overwrites the same R2 object.
- RUN THE WHOLE LOOP ON YOUR OWN. Do not stop to ask Jeff between steps or batches. Acquire -> pilot -> batch (adaptive model) -> push -> next source/tradition, repeating until you hit the ~$3 floor or run out of priority sources. Budget now: ~$20 fresh credits added 2026-06-18 on top of whatever remained. Only surface to Jeff at the floor, on a hard error you cannot route around, or at the end with the summary.
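The budget rules can be expressed as a small guard plus a driver loop. This is a sketch, not the agent's real control flow: `run_batch` and `push` are hypothetical callables standing in for the structuring batch and the R2 upload, the per-source cost estimate is assumed to come from the pilot, and the 25% headroom is an arbitrary illustrative value.

```python
FLOOR_USD = 3.00  # the "~$3 floor" from the plan


def batch_cap(remaining_usd, estimated_cost_usd, floor_usd=FLOOR_USD, headroom=1.25):
    """Return a value for --max-cost-usd, or None if the batch must not start."""
    spendable = remaining_usd - floor_usd
    if estimated_cost_usd <= 0 or estimated_cost_usd > spendable:
        return None  # cannot finish within remaining credit
    # ASSUMPTION: allow some headroom over the estimate, never past the floor.
    return min(spendable, estimated_cost_usd * headroom)


def run_loop(sources, remaining_usd, run_batch, push):
    """Process sources in priority order until the floor or the list runs out."""
    for source in sources:
        if remaining_usd <= FLOOR_USD:
            break
        cap = batch_cap(remaining_usd, source["estimated_cost_usd"])
        if cap is None:
            continue  # skip what does not fit; a cheaper source may still fit
        spent = run_batch(source, cap)  # hypothetical: returns dollars spent
        remaining_usd -= spent
        push(source["name"])  # hypothetical: upload corpus.db after each piece
    return remaining_usd
```

Skipping a source that does not fit, instead of stopping, lets the loop keep spending on cheaper sources further down the priority list before it surfaces at the floor.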
_djvu.txt), pilot-then-batch with caps, push after.
affirmology-agent corpus code only. Do NOT touch the affirmology-studio repo or the engine files numerology.py / council.py / script_generator.py (being edited in parallel).
Real record growth from real interpretation text (book chapters, article bodies), not re-tagged scraps. If a batch is not yielding real records cheaply, stop and re-aim at a better source rather than grinding.