Updated Jun 04, 2026 · Affirmology_CorpusKickoff_Tonight.md
Plan: You kick this off Thursday evening (June 4) before leaving for Camp Brotherhood. The system runs unattended for ~84 hours through Monday morning (June 8). Zero check-ins from you. The runner is hardened to survive single-source failures, hangs, cost spikes, and brief network outages. By the time you're back, you have a corpus database that is days deep across all five traditions, ready to feed the script-generator agents on Monday.
v3 changes: Single run plan (no overnight test). Pre-flight checklist for true unattended operation. Hard cost ceiling on Gemini Flash so the LLM bill cannot run away. macOS Software Update deferral guidance (auto-restarts kill long runs).
For ~84 hours, the runner will:
pkill it remotely.
Estimated end-state Monday morning: 15,000-30,000 structured records, $20-50 of Gemini spend, several GB of raw cache on the SSD.
Hardware:
Affirmology, formatted as APFS.
macOS settings:
Software:
pip install trafilatura tenacity pypdf google-generativeai youtube-transcript-api yt-dlp httpx (already done if you ran the earlier kickoff doc - re-run is idempotent).
.env. See Step 1 below. This is the critical missing piece.
Storage:
df -h /Volumes/Affirmology shows at least 50GB free. (You have 895GB available, so this is a sanity check, not a worry.)
Network:
You need this to enable structuring. Without it, the runner falls back to scrape-only (which still produces value, but no structured records).
AIzaSy...)
echo "GEMINI_API_KEY=PASTE_YOUR_KEY_HERE" >> "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent/.env"
Replace PASTE_YOUR_KEY_HERE with the actual key. Verify it landed:
grep GEMINI "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent/.env"
Should print one line showing your key. If it prints nothing, the echo command didn't work - re-do step 6 (the echo command above).
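If you'd rather check the key programmatically than eyeball the grep output, here is a minimal sketch. It assumes the .env file is plain KEY=VALUE lines and that the last assignment wins (the echo above appends, so re-running it leaves duplicates). How the runner actually loads .env is not shown in this doc, so treat both as assumptions. The helper names are hypothetical.

```python
from pathlib import Path


def read_env_value(env_path, name):
    """Return the last value assigned to `name` in a KEY=VALUE file, or None."""
    value = None
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        if key.strip() == name:
            # Assumption: later assignments override earlier ones.
            value = raw.strip().strip('"').strip("'")
    return value


def key_looks_set(value):
    """True if a value exists and is not the placeholder text from this doc."""
    return bool(value) and value != "PASTE_YOUR_KEY_HERE"
```

Calling key_looks_set(read_env_value(path_to_env, "GEMINI_API_KEY")) should return True once the real key is in place. It only checks that something was pasted, not that Google accepts the key.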
Gemini free tier: Gives you a generous rate limit for testing. Paid tier is pay-as-you-go at ~$0.075/M input tokens, ~$0.30/M output tokens. The estimated $20-50 weekend spend is paid-tier territory - you'll need a billing card on file in your Google Cloud account.
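To see how token volume maps to dollars at those paid-tier rates, here is a rough sketch. The rates come from the paragraph above. The token volumes in the usage note are made-up illustration numbers, not a measurement of what this run will consume.

```python
# Approximate paid-tier rates quoted above, in USD per million tokens.
INPUT_USD_PER_M = 0.075
OUTPUT_USD_PER_M = 0.30


def gemini_cost_usd(input_tokens, output_tokens):
    """Estimate spend for a given number of input and output tokens."""
    input_cost = (input_tokens / 1_000_000) * INPUT_USD_PER_M
    output_cost = (output_tokens / 1_000_000) * OUTPUT_USD_PER_M
    return input_cost + output_cost
```

For example, a hypothetical 200 million input tokens plus 50 million output tokens comes to about $15 + $15 = $30, which sits inside the $20-50 estimate.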
Run this once before the kickoff:
cd "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent"
source .venv/bin/activate
pip install trafilatura tenacity pypdf google-generativeai youtube-transcript-api yt-dlp httpx
mkdir -p /Volumes/Affirmology/corpus/{raw,extracted,structured,logs,cache}
ls /Volumes/Affirmology/corpus/
Should show: cache extracted logs raw structured.
If pip install errors on lxml (Trafilatura dependency): run xcode-select --install, accept the dialog, wait for it to finish (5-10 min), retry the pip install.
Copy and paste this whole block into Terminal:
cd "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent"
source .venv/bin/activate
caffeinate -i -m -s nohup env PYTHONPATH=src python -m affirmology.corpus.run \
--data-dir /Volumes/Affirmology/corpus \
--traditions all \
--mode scrape+structure \
--max-sources-per-tradition 100 \
--max-priority 4 \
--per-source-timeout-seconds 2700 \
--max-cost-usd 75 \
--stop-after-seconds 302400 \
> /Volumes/Affirmology/corpus/logs/run_weekend.log 2>&1 &
echo "Started corpus build PID $!"
disown
What every flag does:
- caffeinate -i -m -s: prevents idle, disk, and system sleep for the duration of the wrapped process
- nohup: survives the parent terminal closing
- env PYTHONPATH=src: points Python at the source tree
- --mode scrape+structure: runs scraping AND Gemini Flash structuring in one pass
- --max-sources-per-tradition 100: caps each tradition at 100 sources (we have ~67 specs today; this leaves headroom for the additions Monday)
- --max-priority 4: ingests priority 1-4 sources (skips exploratory 5)
- --per-source-timeout-seconds 2700: 45-min soft cap per source. Slightly more than the overnight setting because Archive.org bulk downloads can take a while.
- --max-cost-usd 75: HARD KILL SWITCH on Gemini spend. If cumulative cost crosses $75, structuring stops and the run continues in scrape-only mode for the remaining time. You will not wake up Monday to a surprise $500 bill.
- --stop-after-seconds 302400: 84-hour time budget (302,400 seconds). Run will exit cleanly at that mark even if not done.
- & and disown: detach from your terminal so closing Terminal.app doesn't kill it.
Note the PID it prints. Write it down somewhere just in case.
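The cost ceiling and time budget described above can be pictured as one small guard object. This is an illustrative sketch only: RunBudget and its methods are hypothetical names, not the runner's actual code. It also treats reaching exactly $75 as crossing the ceiling, which is an assumption.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RunBudget:
    """Hypothetical guard mirroring --max-cost-usd and --stop-after-seconds."""
    max_cost_usd: float = 75.0
    stop_after_seconds: float = 84 * 3600  # 84 hours = 302,400 seconds
    spent_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def add_cost(self, usd):
        """Record the cost of one structuring call."""
        self.spent_usd += usd

    def structuring_allowed(self):
        """False once cumulative spend reaches the ceiling (scrape-only from then on)."""
        return self.spent_usd < self.max_cost_usd

    def time_remaining(self, now=None):
        """Seconds left in the time budget, never negative."""
        now = time.monotonic() if now is None else now
        return max(0.0, self.stop_after_seconds - (now - self.started_at))

    def should_stop(self, now=None):
        """True once the whole time budget is used up."""
        return self.time_remaining(now) <= 0
```

The point of the design is that the two limits behave differently: hitting the cost ceiling only disables structuring, while hitting the time limit ends the run.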
Three checks. Together they take about 30 seconds. Do not leave the house until all three pass.
ps -ef | grep "affirmology.corpus.run" | grep -v grep
Should print one line with the python process. If empty, the process died - check the log:
tail -50 /Volumes/Affirmology/corpus/logs/run_weekend.log
Wait one minute after kickoff, then:
cat /Volumes/Affirmology/corpus/logs/heartbeat.json
The heartbeat_at timestamp should be recent (within the last minute or two). The current_source field tells you what it's working on right now. If you cat this again after 30 seconds and the source changed (or the elapsed_s grew), the runner is healthy.
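If you'd rather have a script judge "recent" for you, here is a minimal sketch of the heartbeat check. It assumes heartbeat_at is an ISO 8601 string with a UTC offset (for example 2026-06-04T19:00:30+00:00). The real file format is not shown in this doc, so adjust the parsing if yours differs. The function names and the 120-second threshold are assumptions too.

```python
import json
from datetime import datetime, timezone


def heartbeat_age_seconds(heartbeat_json, now=None):
    """Seconds between the heartbeat_at timestamp and now."""
    data = json.loads(heartbeat_json)
    beat = datetime.fromisoformat(data["heartbeat_at"])
    if beat.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        beat = beat.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - beat).total_seconds()


def is_healthy(heartbeat_json, max_age_seconds=120, now=None):
    """True if the heartbeat is at most max_age_seconds old."""
    return heartbeat_age_seconds(heartbeat_json, now) <= max_age_seconds
```

To use it, pass in the file contents, e.g. is_healthy(open("/Volumes/Affirmology/corpus/logs/heartbeat.json").read()).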
ls /Volumes/Affirmology/corpus/raw/
PYTHONPATH=src python -m affirmology.corpus.status --data-dir /Volumes/Affirmology/corpus
After ~3-5 minutes you should see at least one tradition with documents AND non-zero structured records. If documents are landing but records=0, the Gemini integration is failing - likely a bad API key. Stop, fix, restart.
You can leave. The run is committed. Close your laptop's lid only if you have an external monitor connected; otherwise leave the lid open and the screen will turn off via Display sleep (allowed) while the system itself stays awake (the thing we care about).
The runner is designed to NOT need you. Here's what each failure mode does:
| Failure | What the runner does | Damage |
|---|---|---|
| Single source 403s | Logs error, skips, continues | None |
| Single source hangs >45 min | Soft timeout warning, accepts whatever results landed, continues | None |
| Network drops for an hour | Tenacity retries (3 attempts, exponential backoff), eventually marks source as failed, continues | One source lost; can re-fetch on Monday |
| Network drops for 12+ hours | Many sources fail, runner logs many errors, continues until 84-hour limit | Lots of errors to triage Monday; data we did fetch is intact |
| Gemini API rate-limited | Structurer records that the call failed; the document is still saved; structurer moves on to the next document | One document loses its structuring pass; can re-structure Monday |
| Gemini cost crosses $75 | Hard kill switch: structuring stops, scraping continues in scrape-only mode for remaining time | No more LLM spend; partial structured corpus + full raw corpus |
| Mac kernel panic / reboot | Python process dies. Everything fetched is on disk. No more progress. | Hours of remaining work not done; corpus intact |
| Disk fills up | SQLite write errors, runner exits | Unlikely with 895GB free |
| Process OOMs | Python crashes, log captures traceback | Unlikely; the runner is light on memory |
The pattern in all cases: what's already on disk stays on disk. The corpus database is durable. Worst case, you come home Monday to a partial corpus and an error log. You will not come home to a destroyed system, a surprise bill, or lost work.
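The "retry, then mark failed and continue" behavior in the table can be sketched as follows. The runner itself uses Tenacity. This stand-in uses only the standard library so it runs anywhere, and the function names and the 1-second base delay are assumptions, not the runner's real settings.

```python
import time


def fetch_with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # waits 1s, then 2s, ...
    raise last_error


def run_sources(sources, fetch_one, sleep=time.sleep):
    """Fetch every source; a source that exhausts its retries is recorded, not fatal."""
    done, failed = [], []
    for source in sources:
        try:
            fetch_with_retries(lambda: fetch_one(source), sleep=sleep)
            done.append(source)
        except Exception:
            failed.append(source)  # move on to the next source
    return done, failed
```

One bad source only ever costs that source: the loop catches the final error and keeps going, which is why most rows in the table list the damage as "None" or "One source lost".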
Sometime Monday morning (84 hours after kickoff), the runner will exit cleanly with a final summary line in the log:
DONE run #1: pages=X extracted=Y records=Z cost=$N errors=M sources_done=K skipped=J elapsed=302400.0s
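If you want those numbers in a script rather than reading them off the log, the summary line can be pulled apart roughly like this. The sketch assumes every field is key=number, with an optional leading $ on cost and a trailing s on elapsed, as in the template above. parse_summary is a hypothetical helper, not part of the runner.

```python
import re

# Matches the template: "DONE run #<n>: key=value key=value ..."
SUMMARY_RE = re.compile(r"^DONE run #(?P<run>\d+): (?P<fields>.*)$")


def parse_summary(line):
    """Parse a DONE summary line into a dict of numbers, or return None."""
    match = SUMMARY_RE.match(line.strip())
    if not match:
        return None
    result = {"run": int(match.group("run"))}
    for pair in match.group("fields").split():
        key, _, raw = pair.partition("=")
        # Drop the "$" on cost and the "s" on elapsed before converting.
        result[key] = float(raw.lstrip("$").rstrip("s"))
    return result
```

Feed it the last line of run_weekend.log. A None result means the run has not written its summary yet, or exited without one.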
To get the full picture:
cd "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent"
source .venv/bin/activate
PYTHONPATH=src python -m affirmology.corpus.status --data-dir /Volumes/Affirmology/corpus --errors 50
You'll see:
- Per-tradition document counts and word counts
- Per-element-type structured record counts (this is the new thing - Gemini's structured output)
- The 50 most recent errors (mostly robots.txt and 403s - expected, not a problem)
- Total cumulative cost
Expected numbers (rough):
- 1,000-3,000 documents
- 5-15 million words of extracted text
- 15,000-30,000 structured records
- $20-50 of Gemini spend
- 50-200 errors logged (most non-fatal)
We triage which sources erred and which were skipped, decide which to retry vs drop, then start standing up the script-generator specialists (walking meditation, Joe Dispenza journey, Gene Keys deep dive, etc.) that pull from the freshly-built corpus instead of inferring from Claude's training-time knowledge. That's the unlock that makes the rest of the audio types possible.
# Was it running when I last checked
ps -ef | grep affirmology.corpus.run | grep -v grep
# What was it doing when it finished
cat /Volumes/Affirmology/corpus/logs/heartbeat.json
# Final status report
cd ~/CLAUDE/AFFIRMOLOGY/affirmology-agent && source .venv/bin/activate
PYTHONPATH=src python -m affirmology.corpus.status --data-dir /Volumes/Affirmology/corpus --errors 50
# Storage used
du -sh /Volumes/Affirmology/corpus/raw/ /Volumes/Affirmology/corpus/extracted/ /Volumes/Affirmology/corpus/corpus.db
# Free space remaining
df -h /Volumes/Affirmology
# Force stop (if it's still running somehow)
pkill -f "affirmology.corpus.run"
# Re-start if you want another pass (resume mode skips done sources)
# Same kickoff command from Step 3, but you can change --stop-after-seconds
The two things that absolutely have to happen before you leave: Step 1 (Gemini key) and Step 4 (all three health checks pass after kickoff). Everything else is recoverable. These two are not.