Corpus Build - Weekend Run

Updated Jun 04, 2026 · Affirmology_CorpusKickoff_Tonight.md

Plan: You kick this off Thursday evening (June 4) before leaving for Camp Brotherhood. The system runs unattended for ~84 hours through Monday morning (June 8). Zero check-ins from you. The runner is hardened to survive single-source failures, hangs, cost spikes, and brief network outages. By the time you're back, you have a corpus database that is days deep across all five traditions, ready to feed the script-generator agents on Monday.

v3 changes: Single run plan (no overnight test). Pre-flight checklist for true unattended operation. Hard cost ceiling on Gemini Flash so the LLM bill cannot run away. macOS Software Update deferral guidance (auto-restarts kill long runs).


What runs while you're gone

For ~84 hours, the runner will:

  1. Scrape priority 1-4 sources across Western astrology, Vedic astrology, Gene Keys, Human Design, and numerology. Roughly 130-200 sources total.
  2. Extract clean text from every fetched HTML page, PDF, GitHub dataset, and YouTube transcript.
  3. Structure the extracted text via Gemini Flash into typed JSON records (planet-in-sign interpretations, gene key gifts, life path meanings, etc.) tagged with provenance and license tier.
  4. Heartbeat to disk after every source so a status check is always one command away.
  5. Stop cleanly if any of three guardrails trip: 84-hour time budget, $75 cost ceiling, or you pkill it remotely.
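
The three guardrails in item 5 could be checked once per source along these lines. This is a minimal sketch, not the runner's actual code: the function name tripped_guardrail, the stop_requested flag, and the returned labels are hypothetical. The two limits mirror the --stop-after-seconds and --max-cost-usd values in the Step 3 kickoff command.

```python
from typing import Optional

STOP_AFTER_SECONDS = 84 * 3600   # 302400, the --stop-after-seconds value
MAX_COST_USD = 75.0              # the --max-cost-usd value


def tripped_guardrail(elapsed_s: float, cost_usd: float,
                      stop_requested: bool = False) -> Optional[str]:
    """Return a label for the first guardrail that has tripped, or None."""
    if stop_requested:                    # e.g. set by a SIGTERM handler after pkill
        return "manual_stop"
    if elapsed_s >= STOP_AFTER_SECONDS:   # time budget used up
        return "time_budget"
    if cost_usd >= MAX_COST_USD:          # LLM spend reached the ceiling
        return "cost_ceiling"
    return None
```

The sketch only reports which guardrail tripped; what the runner does in response (exit, or fall back to scrape-only) is left to the caller.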

Estimated end-state Monday morning: 15,000-30,000 structured records, $20-50 of Gemini spend, several GB of raw cache on the SSD.
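
For orientation, one of those structured records might look roughly like the sketch below. The class name, every field name, and the example values in the comments are hypothetical illustrations; this document only states that records are typed JSON tagged with provenance and license tier.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class StructuredRecord:
    # All field names are assumptions, not the runner's real schema.
    tradition: str       # e.g. "western_astrology"
    element_type: str    # e.g. "planet_in_sign"
    key: str             # e.g. "mars_in_aries"
    text: str            # the interpretation produced by the structuring pass
    source_url: str      # provenance: where the text was fetched from
    license_tier: str    # e.g. "public_domain"


def to_json_line(record: StructuredRecord) -> str:
    """Serialize one record as a single line of JSON."""
    return json.dumps(asdict(record), sort_keys=True)
```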


Pre-flight checklist (DO THIS BEFORE LEAVING)

Hardware:

macOS settings:

Software:

Storage:

Network:


Step 1: Get the Gemini API key (do this NOW, before anything else)

You need this to enable structuring. Without it, the runner falls back to scrape-only (which still produces value, but no structured records).

  1. Open https://aistudio.google.com in your browser
  2. Sign in with any Google account
  3. Click "Get API key" in the left sidebar
  4. Click "Create API key" → "Create API key in new project"
  5. Copy the key (looks like AIzaSy...)
  6. Add to your .env:
echo "GEMINI_API_KEY=PASTE_YOUR_KEY_HERE" >> "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent/.env"

Replace PASTE_YOUR_KEY_HERE with the actual key. Verify it landed:

grep GEMINI "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent/.env"

Should print one line showing your key. If empty, the echo command didn't work - re-do step 6.
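
If you want a stricter check than grep, the sketch below inspects the text of the .env file. It is an illustration under simple assumptions: the function name gemini_key_problems is hypothetical, and the file is assumed to hold plain KEY=value lines.

```python
from typing import List


def gemini_key_problems(env_text: str) -> List[str]:
    """Return problems found with the GEMINI_API_KEY entry (empty list if fine)."""
    values = [
        line.split("=", 1)[1].strip()
        for line in env_text.splitlines()
        if line.strip().startswith("GEMINI_API_KEY=")
    ]
    problems = []
    if not values:
        # Also triggers when echo appended onto a previous line with no newline.
        problems.append("no GEMINI_API_KEY line found")
    if len(values) > 1:
        problems.append("GEMINI_API_KEY appears more than once")
    if any(v in ("", "PASTE_YOUR_KEY_HERE") for v in values):
        problems.append("GEMINI_API_KEY is empty or still the placeholder")
    return problems
```

Usage: pass it the file contents, for example gemini_key_problems(open(env_path).read()), where env_path is the .env path from step 6.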

Gemini free tier: Gives you a generous rate limit for testing. Paid tier is pay-as-you-go at ~$0.075/M input tokens, ~$0.30/M output tokens. The estimated $20-50 weekend spend is paid-tier territory - you'll need a billing card on file in your Google Cloud account.
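
To see how token volume maps to dollars at those approximate prices, here is a small worked sketch. The function name and the token counts in the usage note are hypothetical; only the two per-million prices come from the paragraph above.

```python
INPUT_USD_PER_M = 0.075    # approximate paid-tier input price quoted above
OUTPUT_USD_PER_M = 0.30    # approximate paid-tier output price quoted above


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate spend in USD from input and output token counts."""
    input_cost = (input_tokens / 1_000_000) * INPUT_USD_PER_M
    output_cost = (output_tokens / 1_000_000) * OUTPUT_USD_PER_M
    return input_cost + output_cost
```

For example, an assumed 200 million input tokens and 50 million output tokens works out to about $15 + $15 = $30.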


Step 2: Verify dependencies and folder structure

Run this once before the kickoff:

cd "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent"
source .venv/bin/activate
pip install trafilatura tenacity pypdf google-generativeai youtube-transcript-api yt-dlp httpx
mkdir -p /Volumes/Affirmology/corpus/{raw,extracted,structured,logs,cache}
ls /Volumes/Affirmology/corpus/

Should show: cache extracted logs raw structured.

If pip install errors on lxml (Trafilatura dependency): run xcode-select --install, accept the dialog, wait for it to finish (5-10 min), retry the pip install.
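
To confirm the packages from the pip install line are importable from the active virtualenv, a check like the following can help. It is a sketch: the function name missing_modules is hypothetical, and the list uses the import names, which differ from the pip names for some packages (for example google-generativeai imports as google.generativeai).

```python
import importlib.util
from typing import Iterable, List

# Import names for the packages in the pip install line.
REQUIRED_MODULES = [
    "trafilatura", "tenacity", "pypdf", "google.generativeai",
    "youtube_transcript_api", "yt_dlp", "httpx",
]


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the module names that cannot be found in the active environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:   # parent package (e.g. "google") is absent
            found = False
        if not found:
            missing.append(name)
    return missing
```

Run missing_modules() with the virtualenv activated; an empty list means every dependency resolved.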


Step 3: THE KICKOFF COMMAND (Thursday evening, before you leave)

Copy and paste this whole block into Terminal:

cd "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent"
source .venv/bin/activate
caffeinate -i -m -s nohup env PYTHONPATH=src python -m affirmology.corpus.run \
  --data-dir /Volumes/Affirmology/corpus \
  --traditions all \
  --mode scrape+structure \
  --max-sources-per-tradition 100 \
  --max-priority 4 \
  --per-source-timeout-seconds 2700 \
  --max-cost-usd 75 \
  --stop-after-seconds 302400 \
  > /Volumes/Affirmology/corpus/logs/run_weekend.log 2>&1 &
echo "Started corpus build PID $!"
disown

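What every flag does:

  - caffeinate -i -m -s: keeps macOS from idle-sleeping (-i), from idle-sleeping the disk (-m), and from sleeping the system (-s). The -s assertion only holds on AC power, so keep the Mac plugged in.
  - nohup ... &: runs the job in the background and lets it survive the Terminal window closing.
  - --data-dir: the corpus folder created in Step 2.
  - --traditions all: build all five traditions.
  - --mode scrape+structure: scrape, then structure via Gemini Flash.
  - --max-sources-per-tradition 100: cap of 100 sources per tradition.
  - --max-priority 4: include priority 1-4 sources.
  - --per-source-timeout-seconds 2700: 45-minute timeout per source.
  - --max-cost-usd 75: the $75 Gemini cost ceiling.
  - --stop-after-seconds 302400: the 84-hour time budget (84 × 3600).
  - > .../run_weekend.log 2>&1: sends both stdout and stderr to the log file.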

Note the PID it prints. Write it down somewhere just in case.


Step 4: Verify the run is healthy BEFORE you leave

Three checks. Together they take about 30 seconds. Do not leave the house until all three pass.

Check 1: process is alive

ps -ef | grep "affirmology.corpus.run" | grep -v grep

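Should print the python process (the caffeinate wrapper that launched it will usually show up as a second line, since its arguments contain the same module name). If empty, the process died - check the log: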

tail -50 /Volumes/Affirmology/corpus/logs/run_weekend.log

Check 2: heartbeat is updating

Wait one minute after kickoff, then:

cat /Volumes/Affirmology/corpus/logs/heartbeat.json

The heartbeat_at timestamp should be recent (within the last minute or two). The current_source field tells you what it's working on right now. If you cat this again after 30 seconds and the source changed (or the elapsed_s grew), the runner is healthy.
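
The freshness check can also be scripted. The sketch below reads the heartbeat_at field named above; the function names are hypothetical, and it assumes the timestamp is written in ISO 8601 form (and treats a timestamp without a timezone as UTC), which this document does not confirm.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def heartbeat_age_seconds(heartbeat_json: str,
                          now: Optional[datetime] = None) -> float:
    """Seconds elapsed since the heartbeat_at timestamp."""
    data = json.loads(heartbeat_json)
    # Assumption: ISO 8601, e.g. "2026-06-04T19:05:00+00:00".
    stamp = datetime.fromisoformat(data["heartbeat_at"])
    if stamp.tzinfo is None:           # assumption: naive timestamps are UTC
        stamp = stamp.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return (current - stamp).total_seconds()


def is_fresh(heartbeat_json: str, max_age_s: float = 120.0,
             now: Optional[datetime] = None) -> bool:
    """True if the heartbeat is no older than max_age_s seconds."""
    return heartbeat_age_seconds(heartbeat_json, now) <= max_age_s
```

Usage: is_fresh(open("/Volumes/Affirmology/corpus/logs/heartbeat.json").read()) should return True shortly after kickoff if the assumptions above hold.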

Check 3: data is landing AND Gemini is working

ls /Volumes/Affirmology/corpus/raw/
PYTHONPATH=src python -m affirmology.corpus.status --data-dir /Volumes/Affirmology/corpus

After ~3-5 minutes you should see at least one tradition with documents AND non-zero structured records. If documents are landing but records=0, the Gemini integration is failing - likely a bad API key. Stop, fix, restart.

If all three pass

You can leave. The run is committed. Close your laptop's lid only if you have an external monitor connected; otherwise leave the lid open and the screen will turn off via Display sleep (allowed) while the system itself stays awake (the thing we care about).


What happens if something goes wrong while you're gone

The runner is designed to NOT need you. Here's what each failure mode does:

| Failure | What the runner does | Damage |
| --- | --- | --- |
| Single source 403s | Logs error, skips, continues | None |
| Single source hangs >45 min | Soft timeout warning, accepts whatever results landed, continues | None |
| Network drops for an hour | Tenacity retries (3 attempts, exponential backoff), eventually marks source as failed, continues | One source lost; can re-fetch on Monday |
| Network drops for 12+ hours | Many sources fail, runner logs many errors, continues until 84-hour limit | Lots of errors to triage Monday; data we did fetch is intact |
| Gemini API rate-limited | Structurer logs the failed call; document still saved; structurer moves on to the next document | One document loses its records pass; can re-structure Monday |
| Gemini cost crosses $75 | Hard kill switch: structuring stops, scraping continues in scrape-only mode for remaining time | No more LLM spend; partial structured corpus + full raw corpus |
| Mac kernel panic / reboot | Python process dies. Everything fetched is on disk. No more progress. | Hours of remaining work not done; corpus intact |
| Disk fills up | SQLite write errors, runner exits | Unlikely with 895GB free |
| Process OOMs | Python crashes, log captures traceback | Unlikely; the runner is light on memory |
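
The retry behavior in the network-drop row (3 attempts, exponential backoff) follows a common pattern, sketched here with the standard library only as a stand-in for Tenacity. The function name, the 2-second base delay, and the injectable sleep parameter are assumptions for illustration; the runner's real retry settings beyond "3 attempts, exponential backoff" are not stated in this document.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(fetch: Callable[[], T], attempts: int = 3,
                       base_delay_s: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise    # out of attempts: the caller marks the source as failed
            sleep(base_delay_s * 2 ** (attempt - 1))   # waits 2s, then 4s, ...
    raise AssertionError("unreachable")
```

Once the last attempt fails, the exception propagates, which is the point where a caller would log the error and move on to the next source.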

The pattern in all cases: what's already on disk stays on disk. The corpus database is durable. Worst case, you come home Monday to a partial corpus and an error log. You will not come home to a destroyed system, a surprise bill, or lost work.


What to expect Monday morning

Sometime Monday morning (84 hours after kickoff), the runner will exit cleanly with a final summary line in the log:

DONE run #1: pages=X extracted=Y records=Z cost=$N errors=M sources_done=K skipped=J elapsed=302400.0s
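
If you want to pull the numbers out of that summary line programmatically, a parser along these lines would work. It is a sketch that assumes the line follows the key=value layout shown above with numeric values; the function name parse_done_line is hypothetical.

```python
import re
from typing import Dict

_DONE_RE = re.compile(r"^DONE run #(?P<run>\d+): (?P<fields>.+)$")


def parse_done_line(line: str) -> Dict[str, float]:
    """Parse the final summary line into a dict of numeric fields."""
    match = _DONE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a DONE summary line")
    result = {"run": float(match.group("run"))}
    for pair in match.group("fields").split():
        key, _, value = pair.partition("=")
        # Strip the "$" prefix on cost and the "s" suffix on elapsed.
        result[key] = float(value.lstrip("$").rstrip("s"))
    return result
```

Usage: feed it the last line of run_weekend.log, then read fields such as result["records"] or result["cost"].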

To get the full picture:

cd "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent"
source .venv/bin/activate
PYTHONPATH=src python -m affirmology.corpus.status --data-dir /Volumes/Affirmology/corpus --errors 50

You'll see:

  - Per-tradition document counts and word counts
  - Per-element-type structured record counts (this is the new thing - Gemini's structured output)
  - The 50 most recent errors (mostly robots.txt and 403s - expected, not a problem)
  - Total cumulative cost

Expected numbers (rough):

  - 1,000-3,000 documents
  - 5-15 million words of extracted text
  - 15,000-30,000 structured records
  - $20-50 of Gemini spend
  - 50-200 errors logged (most non-fatal)


What happens on Monday after you check in

We triage which sources erred and which were skipped, decide which to retry vs drop, then start standing up the script-generator specialists (walking meditation, Joe Dispenza journey, Gene Keys deep dive, etc.) that pull from the freshly-built corpus instead of inferring from Claude's training-time knowledge. That's the unlock that makes the rest of the audio types possible.


Reference: commands you might run when you're back

# Is it still running?
ps -ef | grep affirmology.corpus.run | grep -v grep

# What was it doing when it finished
cat /Volumes/Affirmology/corpus/logs/heartbeat.json

# Final status report
cd ~/CLAUDE/AFFIRMOLOGY/affirmology-agent && source .venv/bin/activate
PYTHONPATH=src python -m affirmology.corpus.status --data-dir /Volumes/Affirmology/corpus --errors 50

# Storage used
du -sh /Volumes/Affirmology/corpus/raw/ /Volumes/Affirmology/corpus/extracted/ /Volumes/Affirmology/corpus/corpus.db

# Free space remaining
df -h /Volumes/Affirmology

# Force stop (if it's still running somehow)
pkill -f "affirmology.corpus.run"

# Re-start if you want another pass (resume mode skips done sources)
# Same kickoff command from Step 3, but you can change --stop-after-seconds

The two things that absolutely have to happen before you leave: Step 1 (Gemini key) and Step 4 (all three health checks pass after kickoff). Everything else is recoverable. These two are not.