Affirmology Corpus Status - 2026-06-13

Updated Jun 13, 2026 · Affirmology_CorpusStatus_2026-06-13_v1.md

Snapshot after the overnight chained run (overnight_corpus.sh, run 03:36).

Headline

Total structured records: 10,658 (up from ~3,760 the night before). The scrape-plus-structure chain ran unattended end to end and nearly tripled the record count (about 2.8x).

Scorecard

tradition	docs	words	structured docs	structured %
western_astrology	550	3,610,433	34	6%
human_design	1,487	1,571,994	706	47%
transits	382	1,082,484	369	97%
vedic_astrology	527	878,743	42	8%
gene_keys	413	397,906	186	45%
numerology	507	194,975	7	1%
total	3,866	7,736,545	1,344	35%
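
The structured % column and the docs totals can be re-derived from the raw counts. The sketch below is a minimal check, assuming the figures are copied by hand from the scorecard; `SCORECARD`, `structured_pct`, and `totals` are illustrative names, not part of the corpus tooling.

```python
# Scorecard figures copied from the table above: (docs, structured docs).
# Names here are hypothetical helpers for illustration only.
SCORECARD = {
    "western_astrology": (550, 34),
    "human_design": (1487, 706),
    "transits": (382, 369),
    "vedic_astrology": (527, 42),
    "gene_keys": (413, 186),
    "numerology": (507, 7),
}


def structured_pct(docs: int, structured: int) -> int:
    """Share of docs that are structured, rounded to a whole percent."""
    return round(100 * structured / docs)


def totals(scorecard: dict) -> tuple:
    """Sum docs and structured docs across all traditions."""
    docs = sum(d for d, _ in scorecard.values())
    structured = sum(s for _, s in scorecard.values())
    return docs, structured


if __name__ == "__main__":
    for tradition, (docs, structured) in SCORECARD.items():
        print(f"{tradition:18} {structured_pct(docs, structured):3d}%")
    total_docs, total_structured = totals(SCORECARD)
    print(f"{'total':18} {structured_pct(total_docs, total_structured):3d}%")
```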

What went well

Human design grew from 825 to 1,487 docs. That delta is the uncapped YouTube transcript haul landing, and 706 docs are now structured.
Gene keys went from 4 structured docs to 186, on the back of the four new independent sources plus YouTube.
Transits is effectively complete at 97% structured.
The chunking fix is confirmed working: it reads whole documents instead of just front pages.

The real bottleneck (corrected from run-table data)

The per-run costs tell the true story, which is different from a first glance at the scorecard:

Western structuring (run #1013) completed with 0 records for $0.14. It did not hit a cap: the Western queue was already empty because earlier runs had structured its real content. Western's low percentage comes from 505 of its 550 docs being sub-100-word pages that are correctly skipped.
The only step that hit a cost ceiling was gene keys + human design (run #1015): $4.01 against a $4 cap, 3,453 records. That is the genuine backlog: human design has 1,487 docs (the YouTube haul) and only 706 are structured. About 780 are still waiting, cut off by the small cap.
Vedic + numerology (run #1016) finished under budget at $1.95.

Total spend across all runs is about $19, so roughly $9-10 of credit remains (confirm in the Console).

So the highest-value, in-budget move is finishing the human design backlog, not pouring money into Western.
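
The arithmetic behind that conclusion can be sketched as follows. This is a rough check using only the figures quoted above; the helper names are hypothetical, and the per-record cost from run #1015 is only an observed average, not a guaranteed rate for future runs.

```python
# Figures quoted in this report; helper names are illustrative assumptions.
HD_DOCS = 1487            # human design docs in the corpus
HD_STRUCTURED = 706       # human design docs already structured
RUN_1015_COST_USD = 4.01  # gene keys + human design run
RUN_1015_CAP_USD = 4.00
RUN_1015_RECORDS = 3453
CREDIT_LEFT_LOW_USD = 9.0  # lower bound of the "$9-10" estimate
PROPOSED_CAP_USD = 9.0     # the --max-cost-usd value in Next steps


def backlog(docs: int, structured: int) -> int:
    """Docs still waiting to be structured."""
    return docs - structured


def hit_cap(cost_usd: float, cap_usd: float) -> bool:
    """A run counts as capped when its spend reached the ceiling."""
    return cost_usd >= cap_usd


def cost_per_record(cost_usd: float, records: int) -> float:
    """Observed average spend per structured record."""
    return cost_usd / records


if __name__ == "__main__":
    print("human design backlog:", backlog(HD_DOCS, HD_STRUCTURED), "docs")
    print("run #1015 hit cap:", hit_cap(RUN_1015_COST_USD, RUN_1015_CAP_USD))
    rate = cost_per_record(RUN_1015_COST_USD, RUN_1015_RECORDS)
    print(f"run #1015 cost per record: ${rate:.5f}")
    print("proposed cap fits remaining credit:",
          PROPOSED_CAP_USD <= CREDIT_LEFT_LOW_USD)
```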

The subtler Western point

Western's big archive.org books are structured only to a depth of 8 chunks each (~88K chars), so roughly the first 30% of each book is mined and the rest is untouched. This is a chunk-depth setting (--max-chunks-per-doc), not a budget wall. Mining the books deeper is a quality refinement to do later, with more credit.
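
To see what a deeper setting would buy, the coverage figures above can be extrapolated. The sketch below assumes chunks are roughly equal in size and that the "first 30%" estimate holds for a typical book; both are simplifications, and `coverage` is an illustrative helper, not part of the pipeline.

```python
# Figures from the paragraph above; everything derived here is an estimate.
CHUNKS_MINED = 8        # current --max-chunks-per-doc depth
CHARS_MINED = 88_000    # ~88K chars covered at that depth
FRACTION_MINED = 0.30   # roughly the first 30% of each book

# Assumption: chunks are about the same size.
CHARS_PER_CHUNK = CHARS_MINED / CHUNKS_MINED

# Assumption: a "typical" book, back-derived from the 30% estimate.
EST_BOOK_CHARS = CHARS_MINED / FRACTION_MINED


def coverage(max_chunks: int) -> float:
    """Estimated fraction of a typical book mined at a given chunk depth."""
    return min(1.0, max_chunks * CHARS_PER_CHUNK / EST_BOOK_CHARS)


if __name__ == "__main__":
    print(f"chars per chunk: {CHARS_PER_CHUNK:,.0f}")
    print(f"estimated book length: {EST_BOOK_CHARS:,.0f} chars")
    for depth in (8, 20):
        print(f"--max-chunks-per-doc {depth}: ~{coverage(depth):.0%} of a book")
```

Under these assumptions, a depth of 20 would reach roughly three quarters of a typical book, with structuring cost expected to grow along with the extra chunks processed.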

Estimated quality

The lagging dimension is structuring_progress (35% of docs structured), dragged mostly by the human design backlog. Tradition balance, voice diversity, and coverage all improved. Overall corpus quality is roughly mid-7 out of 10, up from about 6.5.

Next steps

Finish the human design + gene keys backlog (in budget, ~$5-9):
    caffeinate -i env PYTHONPATH=src python3 -m affirmology.corpus.run \
      --data-dir /Volumes/Affirmology/corpus \
      --traditions gene_keys,human_design --mode structure-only --backend anthropic \
      --max-cost-usd 9
Later, with more credit: mine the Western/Vedic/numerology books deeper by raising --max-chunks-per-doc (e.g. to 20).
Build the nightly launchd watchdog so this runs and reports automatically.
