
Corpus acquisition plan - the smart standing order (2026-06-18)

Updated Jun 18, 2026 · Affirmology_CorpusAcquisition_Plan_v1.md

Summary. Hand this to Claude Code and let it run unattended within budget. Jeff is tired of babysitting the corpus. This reframes the work so we stop grinding junk and start acquiring real material.

The core insight (why the corpus felt weak)

We have been spending structuring credits RE-MINING an existing scrape that is full of archive.org search?query=... pages, index pages, and junk. That is the wrong layer. The "6 Western big books" all turned out to be search-result pages, not books, so they yielded nothing. The leverage is ACQUISITION: go fetch real, high-quality, legally-clean full text and parse THAT. The books are fine; we were fetching the wrong URL. Expect far higher yield per dollar from real sources than from re-grinding the old pile.

Priorities (Jeff, 2026-06-18)

WESTERN astrology - highest priority, wants more real records.
GENE KEYS and VEDIC - both important.
NUMEROLOGY - shelved / back burner. Lightly used in our product. Do NOT spend credits on it now.
Spend the budget where it yields. Stop the moment a source proves to be noise.

The big-books fix (the unlock) - fetch FULL TEXT, not search pages

Real public-domain (Tier A) Western astrology full text is freely and legally available. Fetch the actual text endpoint, never the search URL:

- Project Gutenberg: plain-text / HTML. E.g. Ptolemy, Tetrabiblos (#70850); Sepharial, "Astrology: How to Make and Read Your Own Horoscope" (#46963) and his other titles.
- archive.org: use the *_djvu.txt full-text URL of a book's details page (NOT /search). E.g. Raphael, "A Manual of Astrology"; Alan Leo titles (Astrology for All, Esoteric Astrology, How to Judge a Nativity); William Lilly, "Christian Astrology."
- Forgotten Books and sacred-texts.com astrology sections (verify public-domain status per item).

These are big, dense, on-topic books. Chunk them deeply (--chunk-chars, high --max-chunks-per-doc) and structure. This is the highest-yield Western work available and it is legal.
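The "text endpoint, never the search URL" rule can be checked mechanically before anything is fetched. The sketch below is illustrative only: the function names are hypothetical, and both URL patterns are assumptions about how Project Gutenberg and archive.org usually lay out their files (archive.org items do not always name the OCR file after the item identifier), so each URL should be verified per item.

```python
from urllib.parse import urlparse


def is_search_or_index_url(url: str) -> bool:
    """Return True for URLs that look like search-result pages, not book text."""
    parsed = urlparse(url)
    path = parsed.path.lower()
    # Search pages typically live under /search or /advancedsearch...
    if path.startswith("/search") or path.startswith("/advancedsearch"):
        return True
    # ...or carry a "query=" parameter, as in the junk pages in the old scrape.
    return "query=" in parsed.query.lower()


def gutenberg_text_url(ebook_id: int) -> str:
    # ASSUMPTION: common plain-text location for a Gutenberg ebook number.
    return f"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt"


def archive_djvu_text_url(identifier: str) -> str:
    # ASSUMPTION: the OCR text is usually <identifier>_djvu.txt under /download/.
    return f"https://archive.org/download/{identifier}/{identifier}_djvu.txt"
```

A fetcher could refuse any URL for which `is_search_or_index_url` returns True and send it straight to quarantine, so a search page never reaches the structurer.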

VEDIC (Tier A public domain): sacred-texts.com Hindu/Jyotish section; clearly public-domain Parashara/Jyotish translations on archive.org _djvu.txt. AVOID modern in-copyright authors (e.g. recent B.V. Raman editions) - stick to clearly public-domain texts.

GENE KEYS - KEEP THE TIER WALL. Richard Rudd's official Gene Keys and Ra Uru Hu / Jovian Archive are TIER C: benchmark only, NEVER scraped. Gene Keys A/B material = the public-domain I Ching roots of the 64 gates (Legge's translation is public domain; the 1950 Wilhelm translation is NOT) and genuinely A/B-licensed secondary/community commentary. Do not let Tier C text into the retrievable corpus.
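One way to keep the tier wall from depending on memory is to filter on an explicit tier label at ingest time. This is a minimal sketch, assuming each candidate document already carries a tier label; `SourceDoc`, `RETRIEVABLE_TIERS`, and `enforce_tier_wall` are hypothetical names, not the existing corpus code.

```python
from dataclasses import dataclass

# Only Tier A and Tier B may enter the retrievable corpus.
RETRIEVABLE_TIERS = {"A", "B"}


@dataclass(frozen=True)
class SourceDoc:
    url: str   # hypothetical field: where the text came from
    tier: str  # "A", "B", or "C"


def enforce_tier_wall(docs):
    """Split docs into (retrievable, benchmark_only) by tier label."""
    retrievable, benchmark_only = [], []
    for doc in docs:
        if doc.tier in RETRIEVABLE_TIERS:
            retrievable.append(doc)
        else:
            # Tier C, and anything with an unknown label, stays out.
            benchmark_only.append(doc)
    return retrievable, benchmark_only
```

Treating unknown labels as blocked is a deliberate choice in this sketch: a document has to be positively classified A or B before it can be retrieved.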

Model / cost optimization - ADAPTIVE, automatic (the other half of "10x for 10%")

The Gemini API key works now, so re-open the structurer model choice (the old "HOLD on Haiku" was when Gemini billing was blocked). Jeff wants it robust: use the cheaper model whenever it works, escalate only when it doesn't. Make this AUTOMATIC, not a decision Jeff has to make:

- DEFAULT to the cheapest model (Gemini Flash) for every batch.
- On each new source/tradition, the pilot doubles as a check: if Flash's pilot yields well and the records look clean on a quick spot-check, run the whole batch on Flash.
- ESCALATE to Haiku (or Nemotron) ONLY for that batch if Flash's pilot underperforms (low yield or junky records). Then keep going.
- One-time at the start: a 10-doc bake-off (Flash vs Haiku vs Nemotron) on the same recovered docs to set the baseline; after that it's automatic per the rule above.

Jeff has Gemini, Nemotron, and Haiku credit. Optimize on cost-per-good-record, not on raw model preference.
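The escalation rule and the cost-per-good-record metric can be written down as two small functions. This is a sketch under assumptions: the `PilotResult` fields, the model name strings, and the `min_yield_per_doc` threshold are all hypothetical placeholders to be tuned from the bake-off, not values taken from the existing structurer.

```python
from dataclasses import dataclass


@dataclass
class PilotResult:
    model: str
    docs: int           # documents structured in the pilot
    good_records: int   # records that passed the quality check
    cost_usd: float

    @property
    def yield_per_doc(self) -> float:
        return self.good_records / self.docs if self.docs else 0.0

    @property
    def cost_per_good_record(self) -> float:
        # A pilot with no good records is infinitely expensive per record.
        return self.cost_usd / self.good_records if self.good_records else float("inf")


def pick_baseline(results):
    """Bake-off: the model with the lowest cost per good record wins."""
    return min(results, key=lambda r: r.cost_per_good_record).model


def model_for_batch(cheap_pilot, spot_check_clean=True,
                    min_yield_per_doc=1.0, fallback="haiku"):
    """Stay on the cheap model unless its pilot underperforms."""
    if spot_check_clean and cheap_pilot.yield_per_doc >= min_yield_per_doc:
        return cheap_pilot.model
    return fallback  # escalate for this batch only
```

Because `model_for_batch` is called per batch, an escalation never becomes sticky: the next source starts again on the cheap model.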

Budget governance (so Jeff can set and forget)

Jeff is adding ~$15-20 now and will re-up when the work is proven useful and valid. He does NOT want credits drained on broken/noise records.
PILOT before every full batch: structure a small sample (e.g. 10 docs) and check yield. If the pilot yields ~0, QUARANTINE that source/doc set and skip it. Never run a full batch on unproven material.
Hard --max-cost-usd cap on every batch. Never start a batch you cannot finish within remaining credit.
QUARANTINE noise permanently (search pages, index pages, zero-yield, sub-floor) so it never gets retried and stops polluting counts. Keep the tag-validation that correctly moved forecast pages to transits.
Stop at ~$3 remaining and report; Jeff re-ups if it's earning.
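The governance rules above reduce to a few checks that can run before every batch. The sketch below is a hypothetical `BudgetLedger`, not existing code; it assumes "cannot finish within remaining credit" means the batch's hard cap must fit above the ~$3 floor, and it treats a pilot with fewer than `min_records` records as zero-yield.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetLedger:
    remaining_usd: float
    floor_usd: float = 3.0                       # stop-and-report floor
    quarantined: set = field(default_factory=set)

    def at_floor(self) -> bool:
        return self.remaining_usd <= self.floor_usd

    def can_start(self, max_cost_usd: float) -> bool:
        # ASSUMPTION: a batch may start only if its hard cap fits above the floor.
        return self.remaining_usd - max_cost_usd >= self.floor_usd

    def charge(self, cost_usd: float) -> None:
        self.remaining_usd -= cost_usd

    def gate_pilot(self, source: str, pilot_records: int, min_records: int = 1) -> bool:
        """Return True if the full batch may run; quarantine the source otherwise."""
        if pilot_records < min_records:
            self.quarantined.add(source)  # permanent: never retried
            return False
        return True
```

For example, with $20.00 remaining, a batch capped at $5.00 may start, while one capped at $18.00 may not, because it could leave less than the floor.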

Push + restart discipline (Jeff's ask: push after every significant piece)

Push the updated corpus.db to R2 after EACH significant piece (each tradition batch, each big-book set). Each push overwrites the same R2 object.
ONE Render restart at the end (or when Jeff wants to see it live); the service re-pulls on boot when the R2 size differs. Don't ask Jeff to restart after every push.
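The "re-pull on boot when the R2 size differs" behaviour amounts to a size comparison. The sketch below only illustrates that check; `needs_repull` is a hypothetical name, and how the service obtains the remote object size is left out because the plan does not describe it.

```python
import os


def needs_repull(local_path: str, remote_size_bytes: int) -> bool:
    """True when there is no local copy or its size differs from the R2 object."""
    if not os.path.exists(local_path):
        return True
    return os.path.getsize(local_path) != remote_size_bytes
```

One consequence worth keeping in mind: a size-only check would not notice a push that changes content without changing the byte count, although a growing corpus.db should rarely hit that case.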

Standing order to run unattended

RUN THE WHOLE LOOP ON YOUR OWN. Do not stop to ask Jeff between steps or batches. Acquire -> pilot -> batch (adaptive model) -> push -> next source/tradition, repeating until you hit the ~$3 floor or run out of priority sources. Budget now: ~$20 fresh credits added 2026-06-18 on top of whatever remained. Only surface to Jeff at the floor, on a hard error you cannot route around, or at the end with the summary.

WESTERN first: acquire and parse the public-domain full-text books (Gutenberg + archive.org _djvu.txt), pilot-then-batch with caps, push after.
Then GENE KEYS (A/B only, tier wall enforced), then VEDIC (public-domain Jyotish). Pilot-then-batch-then-push each.
SKIP numerology.
Do the model bake-off early and switch the bulk to the cheapest good model.
Quarantine noise; never re-grind junk.
Stay in the affirmology-agent corpus code only. Do NOT touch the affirmology-studio repo or the engine files numerology.py / council.py / script_generator.py (being edited in parallel).
Keep a budget ledger and log counts + what was acquired to PROJECT_STATE as you go. Stop at the floor and report. At the very end: one summary + the single Render restart to run.
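Put together, the standing order is a loop over traditions in priority order. The sketch below shows only the control flow; `pilot`, `batch`, and `push` are injected callables standing in for the real corpus tooling, and every name, signature, and default amount here is a hypothetical illustration rather than the existing code.

```python
# Priority order from the plan; numerology is deliberately absent.
TRADITION_ORDER = ["western", "gene_keys", "vedic"]


def run_standing_order(sources_by_tradition, pilot, batch, push,
                       remaining_usd, floor_usd=3.0, batch_cap_usd=2.0):
    """Acquire -> pilot -> batch -> push, until the floor or the sources run out.

    pilot(source) and batch(source, cap) each return (records, cost_usd).
    Returns (remaining_usd, log).
    """
    log = []
    for tradition in TRADITION_ORDER:
        batches_run = 0
        for source in sources_by_tradition.get(tradition, []):
            # Never start work that could take the budget below the floor.
            if remaining_usd - batch_cap_usd < floor_usd:
                log.append(("stop_at_floor", tradition, source))
                return remaining_usd, log
            records, cost = pilot(source)
            remaining_usd -= cost
            if records == 0:
                log.append(("quarantine", tradition, source))
                continue
            records, cost = batch(source, batch_cap_usd)
            remaining_usd -= cost
            batches_run += 1
            log.append(("batch", tradition, source, records))
        if batches_run:
            push(tradition)  # one push per significant piece
            log.append(("push", tradition))
    return remaining_usd, log
```

The returned log doubles as raw material for the budget ledger and the end-of-run summary; the single Render restart stays outside the loop.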

What "good" looks like

Real record growth from real interpretation text (book chapters, article bodies), not re-tagged scraps. If a batch is not yielding real records cheaply, stop and re-aim at a better source rather than grinding.
