Corpus Build - Weekend Run

Updated Jun 04, 2026 · Affirmology_CorpusKickoff_Tonight.md

Plan: You kick this off Thursday evening (June 4) before leaving for Camp Brotherhood. The system runs unattended for ~84 hours through Monday morning (June 8). Zero check-ins from you. The runner is hardened to survive single-source failures, hangs, cost spikes, and brief network outages. By the time you're back, you have a corpus database that is days deep across all five traditions, ready to feed the script-generator agents on Monday.

v3 changes: Single run plan (no overnight test). Pre-flight checklist for true unattended operation. Hard cost ceiling on Gemini Flash so the LLM bill cannot run away. macOS Software Update deferral guidance (auto-restarts kill long runs).

What runs while you're gone

For ~84 hours, the runner will:

Scrape priority 1-4 sources across Western astrology, Vedic astrology, Gene Keys, Human Design, and numerology. Roughly 130-200 sources total.
Extract clean text from every fetched HTML page, PDF, GitHub dataset, and YouTube transcript.
Structure the extracted text via Gemini Flash into typed JSON records (planet-in-sign interpretations, gene key gifts, life path meanings, etc.) tagged with provenance and license tier.
Heartbeat to disk after every source so a status check is always one command away.
Wind down if any of three guardrails trips: the 84-hour time budget (clean exit), the $75 Gemini cost ceiling (structuring stops; scraping continues in scrape-only mode), or you pkill it remotely.
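
To make "typed JSON records tagged with provenance and license tier" concrete, here is a minimal sketch of what one record could look like. Every field name and every example value below is an illustrative assumption, not the runner's actual schema.

```python
import json
from dataclasses import dataclass, asdict


# NOTE: all field names here are assumptions for illustration only.
@dataclass(frozen=True)
class StructuredRecord:
    tradition: str      # e.g. "western_astrology"
    element_type: str   # e.g. "planet_in_sign"
    key: str            # e.g. "mars_in_aries"
    text: str           # interpretation text extracted from the source
    source_url: str     # provenance: where the text was fetched from
    license_tier: str   # hypothetical tier label, e.g. "public_domain"


def to_json_line(record: StructuredRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder values only; not real corpus data.
example = StructuredRecord(
    tradition="western_astrology",
    element_type="planet_in_sign",
    key="mars_in_aries",
    text="Placeholder interpretation text.",
    source_url="https://example.org/mars-in-aries",
    license_tier="public_domain",
)
```

Keeping provenance and license tier on every record means a downstream agent can filter by license without going back to the raw cache.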

Estimated end-state Monday morning: 15,000-30,000 structured records, $20-50 of Gemini spend, several GB of raw cache on the SSD.

Pre-flight checklist (DO THIS BEFORE LEAVING)

Hardware:

[ ] SSD plugged in, volume named exactly Affirmology, formatted as APFS.
[ ] MacBook plugged into AC power. Keep it plugged in for the full duration. On battery, macOS may throttle or sleep regardless of settings.
[ ] External monitor optional but useful - you can close the laptop lid if you have one connected and "Prevent automatic sleeping" is on. Without an external monitor, leave the lid OPEN.

macOS settings:

[ ] System Settings → Lock Screen → "Turn display off on power adapter when inactive" = Never (or 30+ min).
[ ] System Settings → Battery / Energy Saver → "Prevent automatic sleeping on power adapter when the display is off" = ON.
[ ] System Settings → Battery → "Wake for network access" = ON.
[ ] System Settings → Battery → "Start up automatically after a power failure" = ON.
[ ] System Settings → General → Software Update → Automatic Updates → click "i" → turn OFF "Install macOS updates" and "Install application updates from the App Store" for the duration. (You can re-enable Monday.) An auto-restart for an OS update is the single most common cause of long-running jobs dying.
[ ] Notifications → Focus / Do Not Disturb → optional, but reduces background activity.

Software:

[ ] Dependencies installed: pip install trafilatura tenacity pypdf google-generativeai youtube-transcript-api yt-dlp httpx (already done if you ran the earlier kickoff doc - re-run is idempotent).
[ ] Gemini API key in .env. See Step 1 below. This is the critical missing piece.
[ ] Quit any heavy apps you don't need running (Slack, Chrome with 50 tabs, etc.). The corpus build is light on CPU/RAM but free resources never hurt.

Storage:

[ ] df -h /Volumes/Affirmology shows at least 50GB free. (You have 895GB available, so this is a sanity check, not a worry.)
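
If you would rather script the free-space check than eyeball df output, a minimal sketch using only the standard library could look like this. The 50 threshold mirrors the checklist item and is treated here as binary gigabytes; the function names are hypothetical.

```python
import shutil

GIB = 1024 ** 3  # bytes per binary gigabyte


def free_gib(path: str) -> float:
    """Return the free space on the volume containing `path`, in GiB."""
    return shutil.disk_usage(path).free / GIB


def has_headroom(path: str, minimum_gib: float = 50.0) -> bool:
    """True if the volume has at least `minimum_gib` GiB free."""
    return free_gib(path) >= minimum_gib
```

Usage: has_headroom("/Volumes/Affirmology") should return True before kickoff.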

Network:

[ ] Mac is on Wi-Fi or Ethernet with stable internet. If your home connection drops briefly overnight, tenacity retries handle it. If it drops for 12 hours, many sources will fail and be logged as errors while the runner keeps going - re-running the failed sources on Monday is fine.

Step 1: Get the Gemini API key (do this NOW, before anything else)

You need this to enable structuring. Without it, the runner falls back to scrape-only (which still produces value, but no structured records).

Open https://aistudio.google.com in your browser
Sign in with any Google account
Click "Get API key" in the left sidebar
Click "Create API key" → "Create API key in new project"
Copy the key (looks like AIzaSy...)
Add to your .env:

echo "GEMINI_API_KEY=PASTE_YOUR_KEY_HERE" >> "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent/.env"

Replace PASTE_YOUR_KEY_HERE with the actual key. Verify it landed:

grep GEMINI "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent/.env"

Should print one line showing your key. If empty, the echo command didn't work - re-do step 6.
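
The grep only shows that a line exists. A slightly stricter check, sketched below, also catches the case where the placeholder was pasted instead of a real key. It assumes a plain KEY=VALUE file in which a later assignment overrides an earlier one (relevant because >> appends); the function names are hypothetical.

```python
from pathlib import Path

PLACEHOLDER = "PASTE_YOUR_KEY_HERE"


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines; later assignments overwrite earlier ones."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, val = line.partition("=")
        values[key.strip()] = val.strip().strip("\"'")
    return values


def gemini_key_is_set(text: str) -> bool:
    """True if GEMINI_API_KEY is present, non-empty, and not the placeholder."""
    key = parse_env(text).get("GEMINI_API_KEY", "")
    return bool(key) and key != PLACEHOLDER


def check_env_file(path: str) -> bool:
    """Convenience wrapper that reads the .env file from disk."""
    return gemini_key_is_set(Path(path).read_text())
```

This validates only that something key-shaped is present; whether the key actually works is what Check 3 in Step 4 confirms.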

Gemini free tier: Gives you a generous rate limit for testing. Paid tier is pay-as-you-go at ~$0.075/M input tokens, ~$0.30/M output tokens. The estimated $20-50 weekend spend is paid-tier territory - you'll need a billing card on file in your Google Cloud account.
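
For a rough feel of how token volume maps to dollars at the rates quoted above, here is a small worked calculation. The token counts in the example are hypothetical, chosen only to show the arithmetic; actual volume depends on how much text the sources yield.

```python
INPUT_USD_PER_M = 0.075   # USD per million input tokens (rate quoted above)
OUTPUT_USD_PER_M = 0.30   # USD per million output tokens (rate quoted above)


def gemini_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate pay-as-you-go cost for a given token volume."""
    input_cost = (input_tokens / 1_000_000) * INPUT_USD_PER_M
    output_cost = (output_tokens / 1_000_000) * OUTPUT_USD_PER_M
    return input_cost + output_cost


# Hypothetical volume: 200M input tokens and 50M output tokens.
# 200 * 0.075 + 50 * 0.30 = 15 + 15, so roughly $30.
estimate = gemini_cost_usd(200_000_000, 50_000_000)
```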

Step 2: Verify dependencies and folder structure

Run this once before the kickoff:

cd "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent"
source .venv/bin/activate
pip install trafilatura tenacity pypdf google-generativeai youtube-transcript-api yt-dlp httpx
mkdir -p /Volumes/Affirmology/corpus/{raw,extracted,structured,logs,cache}
ls /Volumes/Affirmology/corpus/

Should show: cache extracted logs raw structured.

If pip install errors on lxml (Trafilatura dependency): run xcode-select --install, accept the dialog, wait for it to finish (5-10 min), retry the pip install.

Step 3: THE KICKOFF COMMAND (Thursday evening, before you leave)

Copy and paste this whole block into Terminal:

cd "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent"
source .venv/bin/activate
caffeinate -i -m -s nohup env PYTHONPATH=src python -m affirmology.corpus.run \
  --data-dir /Volumes/Affirmology/corpus \
  --traditions all \
  --mode scrape+structure \
  --max-sources-per-tradition 100 \
  --max-priority 4 \
  --per-source-timeout-seconds 2700 \
  --max-cost-usd 75 \
  --stop-after-seconds 302400 \
  > /Volumes/Affirmology/corpus/logs/run_weekend.log 2>&1 &
echo "Started corpus build PID $!"
disown

What every flag does:

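caffeinate -i -m -s - prevents idle sleep, disk idle sleep, and system sleep for the duration of the wrapped process. The -s assertion only holds while the Mac is on AC power, which is another reason to stay plugged in.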
nohup - survives the parent terminal closing
env PYTHONPATH=src - points Python at the source tree
--mode scrape+structure - runs scraping AND Gemini Flash structuring in one pass
--max-sources-per-tradition 100 - caps each tradition at 100 sources (we have ~67 specs today; this leaves headroom for the additions Monday)
--max-priority 4 - ingests priority 1-4 sources (skips exploratory 5)
--per-source-timeout-seconds 2700 - 45-min soft cap per source. Slightly more than the overnight setting because Archive.org bulk downloads can take a while.
--max-cost-usd 75 - HARD KILL SWITCH on Gemini spend. If cumulative cost crosses $75, structuring stops and the run continues in scrape-only mode for the remaining time. You will not wake up Monday to a surprise $500 bill.
--stop-after-seconds 302400 - 84-hour time budget (302,400 seconds). Run will exit cleanly at that mark even if not done.
& and disown - detach from your terminal so closing Terminal.app doesn't kill it.
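
The interaction between the cost ceiling and the time budget can be sketched as a small decision function. This is an illustration of the behavior described above, not the runner's actual code: the class name, the "stop" and "scrape_only" labels, and the use of >= for "crosses" are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Guardrails:
    max_cost_usd: float = 75.0              # mirrors --max-cost-usd 75
    stop_after_seconds: float = 84 * 3600   # mirrors --stop-after-seconds 302400

    def decide(self, elapsed_s: float, cost_usd: float) -> str:
        """Return "stop", "scrape_only", or "scrape+structure"."""
        if elapsed_s >= self.stop_after_seconds:
            return "stop"          # time budget spent: exit cleanly
        if cost_usd >= self.max_cost_usd:
            return "scrape_only"   # cost ceiling hit: no more LLM calls
        return "scrape+structure"  # both guardrails still have headroom
```

The time check comes first because it ends the run outright, while the cost check only downgrades the mode.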

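Note the PID it prints. Write it down somewhere just in case. It belongs to the caffeinate wrapper, not the Python child, so to stop the run use the pkill -f command from the reference section rather than kill on that PID.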

Step 4: Verify the run is healthy BEFORE you leave

Three checks. Together they take about five minutes, most of it waiting for the first data to land. Do not leave the house until all three pass.

Check 1: process is alive

ps -ef | grep "affirmology.corpus.run" | grep -v grep

Should print the python process (plus the caffeinate wrapper that launched it). If empty, the process died - check the log:

tail -50 /Volumes/Affirmology/corpus/logs/run_weekend.log

Check 2: heartbeat is updating

Wait one minute after kickoff, then:

cat /Volumes/Affirmology/corpus/logs/heartbeat.json

The heartbeat_at timestamp should be recent (within the last minute or two). The current_source field tells you what it's working on right now. If you cat this again after 30 seconds and the source changed (or the elapsed_s grew), the runner is healthy.
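
The same two judgments (is the heartbeat recent, and did anything move between two reads) can be sketched in code. The field names heartbeat_at, current_source, and elapsed_s come from the description above; the ISO 8601 timestamp format and the UTC fallback for naive timestamps are assumptions, and the function names are hypothetical.

```python
import json
from datetime import datetime, timezone


def heartbeat_age_seconds(heartbeat_json: str, now: datetime) -> float:
    """Seconds between `now` (timezone-aware) and heartbeat_at (assumed ISO 8601)."""
    data = json.loads(heartbeat_json)
    beat = datetime.fromisoformat(data["heartbeat_at"])
    if beat.tzinfo is None:
        # Assumption: a timestamp without an offset is UTC.
        beat = beat.replace(tzinfo=timezone.utc)
    return (now - beat).total_seconds()


def is_progressing(first_json: str, second_json: str) -> bool:
    """True if the source changed or elapsed_s grew between two reads."""
    a, b = json.loads(first_json), json.loads(second_json)
    return (
        b["current_source"] != a["current_source"]
        or b["elapsed_s"] > a["elapsed_s"]
    )
```

Usage: read heartbeat.json twice about 30 seconds apart, pass both strings to is_progressing, and pass the newer one to heartbeat_age_seconds with datetime.now(timezone.utc).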

Check 3: data is landing AND Gemini is working

ls /Volumes/Affirmology/corpus/raw/
PYTHONPATH=src python -m affirmology.corpus.status --data-dir /Volumes/Affirmology/corpus

After ~3-5 minutes you should see at least one tradition with documents AND non-zero structured records. If documents are landing but records=0, the Gemini integration is failing - likely a bad API key. Stop, fix, restart.

If all three pass

You can leave. The run is committed. Close your laptop's lid only if you have an external monitor connected; otherwise leave the lid open and the screen will turn off via Display sleep (allowed) while the system itself stays awake (the thing we care about).

What happens if something goes wrong while you're gone

The runner is designed to NOT need you. Here's what each failure mode does:

Failure	What the runner does	Damage
Single source 403s	Logs error, skips, continues	None
Single source hangs >45 min	Soft timeout warning, accepts whatever results landed, continues	None
Network drops for an hour	Tenacity retries (3 attempts, exponential backoff), eventually marks source as failed, continues	One source lost; can re-fetch on Monday
Network drops for 12+ hours	Many sources fail, runner logs many errors, continues until 84-hour limit	Lots of errors to triage Monday; data we did fetch is intact
Gemini API rate-limited	Structurer logs the failed call; document still saved; structurer moves on to the next document	One document loses its records pass; can re-structure Monday
Gemini cost crosses $75	Hard kill switch: structuring stops, scraping continues in scrape-only mode for remaining time	No more LLM spend; partial structured corpus + full raw corpus
Mac kernel panic / reboot	Python process dies. Everything fetched is on disk. No more progress.	Hours of remaining work not done; corpus intact
Disk fills up	SQLite write errors, runner exits	Unlikely with 895GB free
Process OOMs	Python crashes, log captures traceback	Unlikely; the runner is light on memory

The pattern in all cases: what's already on disk stays on disk. The corpus database is durable. Worst case, you come home Monday to a partial corpus and an error log. You will not come home to a destroyed system, a surprise bill, or lost work.
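
The "3 attempts, exponential backoff" behavior in the table can be illustrated with a hand-rolled, standard-library stand-in for what tenacity provides. This is a sketch of the retry pattern only, not the runner's code; the function name, the 1-second base delay, and the doubling factor are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: the caller marks the source as failed
            sleep(base_delay_s * (2 ** attempt))  # waits 1s, then 2s, ...
    raise ValueError("attempts must be at least 1")
```

The caller is expected to catch the final exception, log it, and move on to the next source, which is what keeps a single bad source from stopping the run.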

What to expect Monday morning

Sometime Monday morning (84 hours after kickoff), the runner will exit cleanly with a final summary line in the log:

DONE run #1: pages=X extracted=Y records=Z cost=$N errors=M sources_done=K skipped=J elapsed=302400.0s
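
If you want those numbers in a script rather than reading them off the log, the summary line is regular enough to parse. The sketch below assumes the line has exactly the key=value layout shown above; the function name is hypothetical.

```python
import re

DONE_PATTERN = re.compile(r"^DONE run #(?P<run>\d+): (?P<fields>.+)$")


def parse_done_line(line: str) -> dict:
    """Parse the final summary line into a dict of numbers."""
    match = DONE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a DONE summary line")
    summary = {"run": int(match.group("run"))}
    for pair in match.group("fields").split():
        key, _, value = pair.partition("=")
        # Strip the "$" on cost and the trailing "s" on elapsed before converting.
        summary[key] = float(value.lstrip("$").rstrip("s"))
    return summary
```

Usage: take the last line of run_weekend.log, pass it to parse_done_line, and read fields such as summary["records"] or summary["cost"].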

To get the full picture:

cd "/Users/jeffreyparker/CLAUDE/AFFIRMOLOGY/affirmology-agent"
source .venv/bin/activate
PYTHONPATH=src python -m affirmology.corpus.status --data-dir /Volumes/Affirmology/corpus --errors 50

You'll see:

Per-tradition document counts and word counts
Per-element-type structured record counts (this is the new thing - Gemini's structured output)
The 50 most recent errors (mostly robots.txt and 403s - expected, not a problem)
Total cumulative cost

Expected numbers (rough):

1,000-3,000 documents
5-15 million words of extracted text
15,000-30,000 structured records
$20-50 of Gemini spend
50-200 errors logged (most non-fatal)

What happens on Monday after you check in

We triage which sources erred and which were skipped, decide which to retry vs drop, then start standing up the script-generator specialists (walking meditation, Joe Dispenza journey, Gene Keys deep dive, etc.) that pull from the freshly-built corpus instead of inferring from Claude's training-time knowledge. That's the unlock that makes the rest of the audio types possible.

Reference: commands you might run when you're back

# Is it still running?
ps -ef | grep affirmology.corpus.run | grep -v grep

# What was it doing when it finished
cat /Volumes/Affirmology/corpus/logs/heartbeat.json

# Final status report
cd ~/CLAUDE/AFFIRMOLOGY/affirmology-agent && source .venv/bin/activate
PYTHONPATH=src python -m affirmology.corpus.status --data-dir /Volumes/Affirmology/corpus --errors 50

# Storage used
du -sh /Volumes/Affirmology/corpus/raw/ /Volumes/Affirmology/corpus/extracted/ /Volumes/Affirmology/corpus/corpus.db

# Free space remaining
df -h /Volumes/Affirmology

# Force stop (if it's still running somehow)
pkill -f "affirmology.corpus.run"

# Re-start if you want another pass (resume mode skips done sources)
# Same kickoff command from Step 3, but you can change --stop-after-seconds

The two things that absolutely have to happen before you leave: Step 1 (Gemini key) and Step 4 (all three health checks pass after kickoff). Everything else is recoverable. These two are not.
