Corpus Bootstrap Plan

Updated Jun 03, 2026 · Affirmology_CorpusBootstrap_v1.md

Prepared for Jeff Parker

Locked dates: Sept 11 debut (Jeff's birthday party in Miami). Sept 25 Ultimate Wellness Conference (group-headset stress test). Nov 11 official curated Miami launch.

Goal of this document: Get the proprietary knowledge corpus build running on hardware Jeff owns, starting today or tomorrow, scraping unattended while Jeff is at Camp Brotherhood (June 6 to June 8 approx). Migrate the work product seamlessly to a Mac mini production box when it arrives. By the time Jeff returns from Camp, the corpus is days deep into its first pass across all five traditions.

Part Zero: What I've Internalized From Your Voice Answers

The Miami arc has three checkpoints: Sept 11 debut, Sept 25 Ultimate Wellness stress-test with headsets, Nov 11 curated official launch. Inner-circle testing in late June and July. Ads start August. Practitioner tier is an August build, possibly shipped before Sept 11, marketed publicly only after the consumer flow is locked. Spanish in August, English-only until then. The cost ceiling is "a few hundred a month" and you want me to call out cost surprises before they happen. Privacy policy and TOS we draft together. The seven script-type specialists are all on the roadmap for the next ninety days, ordered roughly: walking meditation, full-Gene-Keys journey, sankalpa-style identity statement, Human Design walkthrough, astrology walkthrough (as affirmations, not info dump), Joe-Dispenza-style 20-to-30-minute heart-coherence journey, EFT last because of the state-of-belief branching. We build the full corpus across all five traditions in parallel from the start, not sequentially.

That is the frame. The rest of this document is how we execute it.

Part One: On The Copyright Question (Your Direct Ask)

You asked my take on YouTube videos and blogs where people speak about Gene Keys in their own words, even if their original sourcing was Rudd's copyrighted material.

My read, with the disclaimer that you're the actual lawyer and your judgment overrides mine: YouTube videos and blog posts where someone speaks or writes in their own words about Gene Keys concepts are almost certainly fair game. Three points.

The line copyright protects is the specific expression, not the underlying idea. A blogger writing "Gene Key 50 is about harmony" in their own words is not infringing Rudd's specific expression. They are restating the concept. Concepts are not copyrightable.

Cumulative aggregation does create a marginal risk. If you scrape five hundred bloggers who all derive from Rudd, and your synthesized output reads suspiciously like Rudd's specific phrasing emerged from the synthesis, a thoughtful plaintiff's lawyer could argue you laundered the copyright through intermediaries. The defense is: your generation step explicitly rewrites in your own voice, your stored records cite the blogger not Rudd, and your test/scoring layer (the one you proposed where we benchmark against a separate copyrighted database) gives evidence that your output is independent.

The cleanest IP posture is a three-tier corpus.

Tier A: Truly clean sources. Public domain texts (Wilhelm's 1923 I Ching translation is PD in the US, Lilly's 1647 Christian Astrology is PD, Brihat Parashara Hora Shastra translations from the 1800s are PD, Pythagorean numerology is PD). Academic papers under fair-use education. Creative Commons-licensed content with attribution. This is what generation agents draw from.

Tier B: Fair-use commentary. YouTube videos and blogs where individuals speak in their own words, with provenance (URL, creator name, date) stored per record. Used for generation, attributed in the methodology PDF if asked.

Tier C: Isolated copyrighted reference set. Stored entirely separately, never queried by generation agents, used only by a "benchmark scoring agent" that periodically asks "how close is our output to the canonical copyrighted reference, and where are we weakest." This gives you the defensibility narrative ("we built this without using copyrighted material") plus a quality signal for where to deepen Tier A/B sourcing. Your idea, and it's a good one.

Engineering note: the database schema needs a license_tier enum on every record. Generation agents query only WHERE license_tier IN ('A', 'B'). Benchmark agent queries only WHERE license_tier = 'C'. The wall between the two is enforced at the query layer.
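One way the query-layer wall could look, as a minimal sketch. SQLite has no native enum type, so this assumes a CHECK constraint stands in for the license_tier enum. The view names generation_record and benchmark_record and the helper open_corpus are hypothetical, not part of the existing codebase.

```python
import sqlite3

# Assumption: a CHECK constraint emulates the license_tier enum in SQLite.
# The two views are hypothetical names for the "wall" between agents.
SCHEMA = """
CREATE TABLE IF NOT EXISTS record (
    id INTEGER PRIMARY KEY,
    source_url TEXT NOT NULL,
    interpretation_text TEXT NOT NULL,
    license_tier TEXT NOT NULL CHECK (license_tier IN ('A', 'B', 'C'))
);
CREATE VIEW IF NOT EXISTS generation_record AS
    SELECT * FROM record WHERE license_tier IN ('A', 'B');
CREATE VIEW IF NOT EXISTS benchmark_record AS
    SELECT * FROM record WHERE license_tier = 'C';
"""


def open_corpus(path=":memory:"):
    """Open (or create) a corpus database with the tier constraint and views."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_record(conn, source_url, text, tier):
    """Insert one record; an unknown tier raises sqlite3.IntegrityError."""
    conn.execute(
        "INSERT INTO record (source_url, interpretation_text, license_tier) "
        "VALUES (?, ?, ?)",
        (source_url, text, tier),
    )
```

Under this sketch, generation agents would read only from generation_record and the benchmark agent only from benchmark_record, so neither needs to remember the WHERE clause.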

Worth saying explicitly: the Wilhelm I Ching translation alone gives you 64 hexagram interpretations in deeply respected language that you can build entire Gene-Key-adjacent material from without ever touching Rudd. Combined with academic I Ching scholarship and your own interpretive layer (which is what makes Affirmology Affirmology, not the source citations), you have a foundation that does not depend on Rudd at all.

Part Two: Hardware Recommendations

Mac mini. Buy the M4 base model with 16GB unified memory and 512GB internal SSD. $799. This is enough for everything the corpus build, the agent runtime, and the production audio pipeline will need for the next twelve months. The 24GB and 32GB upgrades are not worth the money for this workload because most heavy compute (the LLM calls) happens on Anthropic's servers, not yours. You only need local RAM for the scraping and database work, which is light.

If you want to future-proof for running local LLMs (Gemma, Llama 3, or similar via Ollama) for the bulk-scraping summarization step, bump to 24GB. That's $200 more. Worth it if you want to cut Anthropic costs by 50% on the corpus build by routing first-pass cleanup through a local model.

External SSD. Samsung T9 Portable SSD, 2TB, around $200. It connects over USB-C (USB 3.2 Gen 2x2, not Thunderbolt) and is rated faster than the older T7, though a Mac's USB-C ports may not reach its full rated speed. Plug it into your laptop tonight, then into the Mac mini when it arrives. The path stays the same (/Volumes/T9 or whatever you name it), so nothing in the code changes during migration.

If you want headroom for years, the 4TB model is around $350. The corpus across all five traditions, including raw scraped text and processed records, should fit comfortably under 200GB (Part Eight estimates 50-150GB for the raw cache and 5-10GB for processed records). The remaining space is for audio masters, video assets, render archives, and the eventual Postgres database backups. 2TB is more than enough for the next eighteen months.

Order both today. Mac mini and SSD both ship same-day or next-day from Apple Store, Best Buy, or Amazon. If you can pick up the Mac mini in person tomorrow, do it. If not, Wednesday delivery is fine.

Part Three: The SSD-First Bootstrap

You don't have to wait for the Mac mini to start. Tonight, plug the SSD into your laptop. The corpus build kicks off on your laptop using SSD storage. Wednesday or Thursday, when the Mac mini arrives, unplug the SSD from the laptop, plug it into the Mac mini, restart the corpus agents. The path is identical, the data is intact, the work continues.

This buys you 4 to 5 days of head start, including the days you are at Camp Brotherhood when nothing else is happening.

The architecture for that:

The SSD becomes the canonical home for everything: the SQLite database file (let's call it affirmology_corpus.db), the raw scraped HTML/PDF cache directory, the processed-record files, the render archive, and eventually the Postgres data directory if/when we migrate from SQLite to Postgres. The affirmology-agent code stays on the laptop (and later on the Mac mini), and points at /Volumes/T9/affirmology as the data root via an environment variable.
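A minimal sketch of how the code could resolve the data root from the environment. AFFIRMOLOGY_DATA_DIR, affirmology_corpus.db, and the /Volumes/T9/affirmology default come from this plan. The function names and the fallback-to-default behavior are assumptions.

```python
import os
from pathlib import Path

# Default taken from this plan; override it with AFFIRMOLOGY_DATA_DIR.
DEFAULT_DATA_DIR = "/Volumes/T9/affirmology"


def data_root(env=None):
    """Return the data root, preferring the AFFIRMOLOGY_DATA_DIR variable."""
    env = os.environ if env is None else env
    return Path(env.get("AFFIRMOLOGY_DATA_DIR", DEFAULT_DATA_DIR))


def corpus_paths(root):
    """Derive the database file and cache directory from one root."""
    root = Path(root)
    return {
        "db": root / "affirmology_corpus.db",
        "cache": root / "cache",
    }
```

Because every path derives from one root, moving the SSD from the laptop to the Mac mini changes nothing as long as the volume name stays the same.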

When the agents run, they:
1. Fetch source pages (HTTP requests, polite rate-limiting, robots.txt respect).
2. Cache the raw page to /Volumes/T9/affirmology/cache/{source}/{hash}.html.
3. Extract clean text using Trafilatura.
4. Call Claude (Haiku for cheap structuring, Sonnet only when the text needs deeper interpretation) to produce a structured record.
5. Insert the record into affirmology_corpus.db.
6. Log the run to a runs table for observability.
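Steps 1 and 2 above could be sketched with the standard library alone. The plan does not say what {hash} is computed from, so this assumes a SHA-256 digest of the URL. The helper names cache_path and allowed are hypothetical, and the bot token follows the identifier given in the crawl policy in Part Six.

```python
import hashlib
from pathlib import Path
from urllib.robotparser import RobotFileParser

BOT_TOKEN = "Affirmology-corpus-bot"  # product token of the crawler's user agent


def cache_path(data_root, source, url):
    """Map a URL to cache/{source}/{hash}.html (assumes SHA-256 of the URL)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(data_root) / "cache" / source / f"{digest}.html"


def allowed(robots_lines, url, user_agent=BOT_TOKEN):
    """Check a URL against already-downloaded robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

Hashing the URL keeps the cache filename stable across runs, so a restart after the SSD moves to the Mac mini finds the same files.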

On the Mac mini, the agents run on a launchd schedule (macOS's native cron equivalent) every 30 minutes during off-peak hours. On your laptop tonight, you run them manually with a single python -m affirmology.corpus.run_all command, and they keep running until you stop them.
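The launchd job could be generated rather than hand-written. This is a sketch under stated assumptions: the label com.affirmology.corpus, the interpreter path, the log paths, and the 22:00 to 06:00 off-peak window are all placeholders. Label, ProgramArguments, StartInterval, StandardOutPath, and StandardErrorPath are standard launchd property-list keys.

```python
import plistlib

LABEL = "com.affirmology.corpus"  # hypothetical job label


def build_plist(python_path="/usr/bin/python3", interval_seconds=30 * 60):
    """Return launchd job XML (bytes) that starts the runner every 30 minutes."""
    job = {
        "Label": LABEL,
        # Interpreter path is a placeholder; point it at the project's Python.
        "ProgramArguments": [python_path, "-m", "affirmology.corpus.run_all"],
        "StartInterval": interval_seconds,
        "StandardOutPath": "/tmp/affirmology-corpus.log",
        "StandardErrorPath": "/tmp/affirmology-corpus.err",
    }
    return plistlib.dumps(job)


def is_off_peak(hour, start=22, end=6):
    """Assumed off-peak window that wraps midnight: 22:00 up to 06:00."""
    return hour >= start or hour < end
```

The generated file would go in ~/Library/LaunchAgents and be loaded with launchctl. StartInterval fires around the clock, so in this sketch the runner itself would call is_off_peak and exit early outside the window.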

Part Four: The Software Install List

This is the order to install on a fresh macOS machine. Do it on both your laptop (if not already done) and the Mac mini when it arrives. The whole sequence takes about 20 minutes if you copy and paste, longer if you read each step.

Xcode Command Line Tools. xcode-select --install in Terminal. Required for almost everything else.
Homebrew. Paste the install command from brew.sh. This is the foundation package manager for macOS development.
Core packages via Homebrew. brew install python@3.11 git ffmpeg sqlite postgresql@16 node tailscale gh plus brew install --cask iterm2 visual-studio-code.
uv (faster Python package manager). brew install uv. Replaces pip and venv with a much faster tool. Especially helpful for the corpus build because we install lots of scraping dependencies.
Claude Code. Install via the official path. Once installed, you can run Claude from any terminal window on the Mac mini, including via SSH from your laptop later.
Tailscale account + node. Sign up at tailscale.com (free for personal), install the macOS app on both your laptop and the Mac mini, log in. Both machines now have stable private IP addresses you can SSH between.
Repository. Clone the affirmology-agent repo (or copy via the SSD if it's not in Git yet - but it really should go in a private GitHub repo this week, see the Sprint Plan).
Dependencies. cd affirmology-agent && uv pip install -e ".[dev]" plus the new corpus dependencies (Trafilatura, BeautifulSoup4, httpx, tenacity for retries, sqlite-utils).
Environment. cp .env.example .env and fill in API keys: Anthropic, ElevenLabs, and (new) the data-root path AFFIRMOLOGY_DATA_DIR=/Volumes/T9/affirmology.

That's the full machine setup. About 20 minutes of mostly waiting for downloads.

Part Five: Networking And Remote Access

You said "open claw not happening" so let's skip Open Code (the OSS Claude alternative) for now and use Anthropic's Claude Code on the Mac mini directly.

The pattern: Mac mini sits in your office, always on, plugged into power and Ethernet (Ethernet better than wifi for scraping reliability). Tailscale gives it a stable private hostname like affirmology-mini.tailfee.ts.net (or whatever you name it). From your laptop, anywhere on the planet with internet, you do ssh jeff@affirmology-mini. You land in the Mac mini's shell. You can run Claude Code there (claude command), edit files via vim or VS Code's Remote-SSH extension, kick off corpus agents, monitor logs.

VS Code Remote-SSH is the friendliest experience: open VS Code on your laptop, "Connect to Host," pick affirmology-mini, and VS Code opens a window that is editing files on the Mac mini as if they were local. Terminal in VS Code is also remote. Claude Code can be invoked from that terminal.

Cowork (this current chat mode you're in) lives separately. Cowork is a desktop-app experience tied to your laptop. The Mac mini does not run Cowork. The Mac mini runs Claude Code (the CLI). Both Cowork and Claude Code are first-class Anthropic surfaces; they just live in different layers. You'd use Cowork on your laptop for high-level conversation, planning, and document work; you'd use Claude Code on the Mac mini for the heavy hands-on engineering and the agent runtimes.

Sol and Colin connect the same way once invited. You add them as Tailscale users on your tailnet, they install Tailscale and the SSH client (built into macOS), they have access. Authentication is per-user SSH keys, so revoking access is one config change.

Part Six: The Corpus Build Architecture

A new module in the codebase: affirmology-agent/src/affirmology/corpus/.

Inside it:

db.py: SQLite handler with three core tables (source, record, runs) and the license_tier enum.
extractors/: one extractor per source type (HTML pages, PDFs, YouTube transcripts, Wikipedia API).
traditions/: one builder per tradition. western_astrology.py, vedic_astrology.py, gene_keys.py, human_design.py, numerology.py, somatic.py.
structurer.py: takes raw cleaned text plus a tradition context, calls Claude (Haiku by default, Sonnet on request) to produce a typed record JSON.
bench.py: the benchmark agent that compares Tier A/B output against Tier C reference and writes a coverage-and-quality scorecard.
run_all.py: the orchestrator that runs all tradition builders on a polite schedule.

Each tradition builder gets:
- A starter list of canonical source URLs (the public-domain texts, the academic papers, the well-known free interpretation sites, the YouTube channels with transcripts).
- A crawl policy (depth, page-per-day limit, respect robots.txt, identify as Affirmology-corpus-bot/1.0; contact@affirmology.com).
- A schema for what a "record" looks like in that tradition: for astrology, (planet, sign, house, interpretation_text, source_url, license_tier); for Gene Keys, (gate, line, sphere, shadow, gift, siddhi, interpretation_text, source_url, license_tier); etc.
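The crawl policy and the two record shapes named above could be expressed as dataclasses. The field names come from this plan. The field types, the class names, and the pages_per_day value are assumptions for illustration.

```python
from dataclasses import dataclass

VALID_TIERS = frozenset({"A", "B", "C"})


def check_tier(tier):
    """Reject any license tier outside A, B, C."""
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown license_tier: {tier!r}")


@dataclass(frozen=True)
class CrawlPolicy:
    max_depth: int = 2             # matches the depth cap in Part Seven
    pages_per_day: int = 500       # hypothetical limit, not from the plan
    respect_robots_txt: bool = True
    user_agent: str = "Affirmology-corpus-bot/1.0; contact@affirmology.com"


@dataclass(frozen=True)
class AstrologyRecord:
    planet: str
    sign: str
    house: int
    interpretation_text: str
    source_url: str
    license_tier: str

    def __post_init__(self):
        check_tier(self.license_tier)


@dataclass(frozen=True)
class GeneKeysRecord:
    gate: int
    line: int
    sphere: str
    shadow: str
    gift: str
    siddhi: str
    interpretation_text: str
    source_url: str
    license_tier: str

    def __post_init__(self):
        check_tier(self.license_tier)
```

Validating the tier at construction time means a malformed record fails before it ever reaches the database.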

The corpus is queryable via a single corpus_lookup(chart_element, n=10) helper that returns the top n records for any chart element. Generation agents use this helper instead of relying on Claude's training-time knowledge.
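A sketch of what corpus_lookup might look like against SQLite. The plan does not define how "top n" is ranked, so this assumes Tier A before Tier B, then insertion order. The chart_element column, the keyword-only conn parameter, and the demo_connection helper are hypothetical.

```python
import sqlite3

COLUMNS = ("interpretation_text", "source_url", "license_tier")


def corpus_lookup(chart_element, n=10, *, conn):
    """Return up to n Tier A/B records for one chart element as dicts."""
    rows = conn.execute(
        """
        SELECT interpretation_text, source_url, license_tier
        FROM record
        WHERE chart_element = ? AND license_tier IN ('A', 'B')
        ORDER BY license_tier, id  -- placeholder ranking: Tier A first
        LIMIT ?
        """,
        (chart_element, n),
    ).fetchall()
    return [dict(zip(COLUMNS, row)) for row in rows]


def demo_connection():
    """In-memory database with a minimal record table, for trying the helper."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE record (id INTEGER PRIMARY KEY, chart_element TEXT, "
        "interpretation_text TEXT, source_url TEXT, license_tier TEXT)"
    )
    return conn
```

Tier C is excluded inside the helper itself, so a generation agent cannot reach the reference set through this path.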

Part Seven: Source Lists By Tradition (Day-One Targets)

These are the seed URLs each builder starts with. The builders crawl from these, follow internal links (capped depth 2), and build records.

Western Astrology. Project Gutenberg has Lilly's Christian Astrology in three volumes (PD). Sacred Texts has Ptolemy's Tetrabiblos in the Robbins translation (PD). Internet Archive has dozens of pre-1928 astrology books (PD). Wikipedia astrology categories cover signs, planets, houses, aspects at survey depth. Cafe Astrology has interpretation articles (robots.txt-checked, fair use). Astro-Seek interpretation pages (fair use, attributed). Skyscript.co.uk for classical astrology essays (mostly CC). Aeon (the magazine) for academic astrology criticism. About 8,000 to 15,000 pages of raw material.

Vedic Astrology. Sacred Texts has Brihat Parashara Hora Shastra (Santhanam translation, PD), Jaimini Sutras translations, Sarvartha Chintamani fragments. Vedic Astrology Lessons by B.V. Raman (some PD, some fair use). Internet Archive has Vedic texts from 19th and early 20th century English translations. Academic Vedic astrology papers from JSTOR (educational fair use for excerpts). Vedicology.com and similar interpretation sites. About 3,000 to 6,000 pages.

Gene Keys (non-Rudd). Wilhelm's I Ching translation (1923, PD) is the foundational text and covers all 64 hexagrams with line-by-line commentary. Karcher's I Ching scholarship (some PD, some fair use). Academic papers on the King Wen sequence and the I Ching as a divinatory system. YouTube creators speaking about Gene Keys in their own words (your fair-use position). Personal blog interpretations (Tier B). The "Rudd reference" goes to Tier C, used only by the benchmark. About 2,000 to 5,000 pages total Tier A/B.

Human Design (non-Ra Uru Hu). The 64 hexagrams again, this time mapped through the Rave Mandala. Public domain I Ching forms the base. Academic papers on the bodygraph as a synthesis of I Ching, Kabbalah, and Hindu chakras. Community blog interpretations of profiles, centers, channels. YouTube creators in their own words. Jovian Archive has some freely-readable content (we honor their terms). Ra Uru Hu copyrighted material goes to Tier C. About 2,000 to 4,000 pages Tier A/B.

Numerology. The smallest tradition by volume but the densest in clean PD material. Pythagorean numerology has been public domain forever. Florence Campbell's 1931 work (likely PD in US). Cheiro's Book of Numbers (1907, PD). Modern academic papers on numerology as cultural practice. Personal blogs. About 1,000 to 2,000 pages.

Somatic/EFT/Breathwork. Gary Craig's original EFT manual is freely distributed by Craig himself (Tier B with permission). HeartMath Institute research papers are partially public (Tier A for the PD-licensed ones). Stanley Rosenberg's vagal protocols summarized in academic literature (Tier A). NIH and CDC public health materials on breathwork and somatic regulation (PD). Wim Hof method overviews (fair use). About 1,500 to 3,000 pages.

Audio research and formational prompts. Joe Dispenza's published research papers (educational fair use). Bihar School of Yoga Yoga Nidra texts (some PD). Sankalpa tradition academic sources. Research on affirmation efficacy in clinical psychology. Hypnosis induction patterns in NLP literature. About 1,000 to 2,000 pages.

Total target corpus, all traditions: roughly 18,000 to 37,000 pages of raw text, processed into roughly 50,000 to 100,000 structured records.

Part Eight: Costs And Storage

One-time hardware costs.
- Mac mini M4 16GB/512GB: $799.
- Samsung T9 SSD 2TB: $200.
- Total: about $1,000 upfront. If you upgrade to 24GB Mac mini for local LLM headroom: $1,200.

One-time corpus build costs. This is what I want to call out so you're not surprised. At the high end (37,000 pages scraped, each averaging 3,000 tokens of raw text, structured by Haiku; the estimate below uses 3,500 input tokens per page, which leaves room for the structuring prompt):
- Haiku cost: 37,000 pages × 3,500 tokens input × $0.80/1M = about $103 input + $50 output = around $150.
- If we use Sonnet instead for richer structuring: 5x that, so around $750.
- My recommendation: Haiku for first-pass structuring of 95% of records, Sonnet only for the highest-importance Tier A canonical interpretations (maybe 5,000 records). Blended cost: about $250 to $400 to build the entire corpus from scratch.
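The arithmetic behind the Haiku and Sonnet figures can be checked with a few lines. All inputs are the numbers quoted above: the $50 output cost and the 5x Sonnet multiplier are taken as given, not derived from a price sheet. The function names are hypothetical.

```python
PAGES = 37_000
INPUT_TOKENS_PER_PAGE = 3_500
HAIKU_INPUT_USD_PER_MILLION = 0.80   # rate quoted in this plan
HAIKU_OUTPUT_USD = 50.0              # output estimate quoted in this plan
SONNET_MULTIPLIER = 5                # "5x that", as stated in this plan


def haiku_input_cost(pages=PAGES, tokens=INPUT_TOKENS_PER_PAGE,
                     rate=HAIKU_INPUT_USD_PER_MILLION):
    """Input cost in USD: pages x tokens per page x rate per million tokens."""
    return pages * tokens * rate / 1_000_000


def haiku_total():
    """Input plus the quoted output estimate."""
    return haiku_input_cost() + HAIKU_OUTPUT_USD


def sonnet_total():
    """All-Sonnet build, using the plan's 5x multiplier."""
    return haiku_total() * SONNET_MULTIPLIER


if __name__ == "__main__":
    print(f"Haiku input:  ${haiku_input_cost():.2f}")
    print(f"Haiku total:  ${haiku_total():.2f}")
    print(f"Sonnet total: ${sonnet_total():.2f}")
```

The Haiku input cost comes to about $103.60, which rounds to the "about $103" and "around $150" figures. The all-Sonnet case lands near $768, in line with "around $750".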

This is one-time. After the corpus exists, the per-render economics drop because generation pulls from corpus retrieval instead of fresh inference.

Monthly recurring costs.
- ElevenLabs subscription (your existing): ~$22-$99/mo depending on tier.
- Anthropic API for audio renders + ongoing corpus refresh: $30-$100/mo at expected volume.
- Tailscale: free for personal use.
- Supabase: free tier (500MB DB, plenty for v1) or $25/mo Pro when we outgrow.
- Vercel landing/form hosting: free hobby tier or $20/mo Pro.
- Domain: $12/year for affirmology.com if not owned.
- Total monthly recurring: $80 to $250. Comfortably inside your "few hundred a month" ceiling, with room for spike testing.

Storage estimates.
- Raw scraped HTML/PDF cache: 50-150GB across all traditions.
- Processed records in SQLite: 5-10GB.
- Audio render archive: 1GB per 100 renders.
- Backend audit files: trivial (kilobytes per render).
- Total: comfortably under 200GB after the first year. The 2TB SSD has decade-of-growth headroom.

Internet bandwidth. Web scraping is bursty but small. A page is typically 100KB to 2MB of HTML; a PDF maybe 5MB. Even at 10,000 pages a day (very aggressive), you're using at most about 20GB/day of inbound bandwidth (every page at the 2MB upper bound), which is nothing for any modern home internet. Production traffic from the web form will also be small (small JSON requests; audio downloads happen from the CDN). You are fine on any reasonable home connection. The reason to host locally on the Mac mini is cost (free vs paid cloud), control (your data, your hardware), and durability (Mac minis run for years without intervention), not bandwidth.
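A quick check of the bandwidth figure, using only the page sizes and page count quoted above. The function name is hypothetical and the units are decimal (1GB = 1,000,000KB).

```python
def daily_gb(pages_per_day, avg_page_kb):
    """Decimal gigabytes per day for a page count and average page size in KB."""
    return pages_per_day * avg_page_kb / 1_000_000


# Bounds from the plan: 10,000 pages a day at 100KB to 2MB per page.
low = daily_gb(10_000, 100)
high = daily_gb(10_000, 2_000)

if __name__ == "__main__":
    print(f"Daily inbound traffic: {low:.0f} to {high:.0f} GB")
```

The range works out to 1GB at the small end and 20GB at the large end, so even the most aggressive crawl stays well within a home connection.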

Part Nine: Cost Reduction Strategy

You raised this. Four real levers.

Use Haiku, not Sonnet, for structuring. Haiku is several times cheaper than Sonnet (the Part Eight estimate uses about 5x) and is great at "take this unstructured paragraph and extract these fields." Default the corpus structurer to Haiku. Bump to Sonnet only for the Tier A canonical-source records where rich interpretation matters.

Use a local model for first-pass cleanup. If you upgrade Mac mini to 24GB, Ollama can run Llama 3 8B or Mistral 7B locally. A pre-processing step that strips boilerplate, deduplicates, and rewrites for clarity locally before any Anthropic call can cut corpus-build cost by 30-50%. Tradeoff is one extra step in the pipeline and slightly worse quality on the cleanup. Worth it for bulk scraping.

Cache aggressively. Once a page is scraped and structured, it does not need to be re-fetched or re-structured for that record's lifetime. The cache layer already protects you from this, but worth calling out: don't accidentally re-run the same builder against the same sources from scratch.
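The cache guard could be as small as the sketch below. The name fetch_once and the injected fetcher callable are hypothetical. The real fetch step would pass in its HTTP client call instead.

```python
from pathlib import Path


def fetch_once(url, cache_file, fetcher):
    """Return (html, fetched); hit the network only when no cached copy exists."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        # Cached copy wins: no network call and no new structuring cost.
        return cache_file.read_text(encoding="utf-8"), False
    html = fetcher(url)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html, True
```

The second element of the return value tells the caller whether the page is new, which is the signal for whether the structuring call needs to run at all.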

Use Gemini Flash for the very high-volume preprocessing. Gemini Flash is even cheaper than Haiku for some tasks. If we have a step where we just need to summarize 50,000 raw pages down to 5,000 candidate records, Gemini Flash can do the screening pass for pennies. Worth considering if the corpus build budget needs to be tighter than $250.

For Camp Brotherhood specifically: I would set Haiku as the default and let it rip. $250 burned over five days of corpus building is fine, and you come back to a meaningfully complete corpus.

Part Ten: Tonight's Action Plan

Concretely, in priority order.

Order the Mac mini and the SSD. Apple Store online for the Mac mini (M4 base, 16GB, 512GB, $799), Amazon for the Samsung T9 2TB ($200). Both arrive Wednesday at latest. Total: $1,000.
Plug the SSD into your laptop tonight. Format as APFS, name it T9 or AffirmologyData.
Tell me to build the corpus module. I write the affirmology-agent/src/affirmology/corpus/ code with the SQLite schema, the six tradition builders, the orchestrator, the cost-tracker, and the wrapper script. Maybe 2 hours of my work on my next turn.
Kick off the first run on your laptop tonight or tomorrow morning. Single command: python -m affirmology.corpus.run_all --data-dir /Volumes/T9/affirmology --providers haiku --traditions all. Leave it running. Check on it before bed.
By tomorrow morning, the corpus is hundreds of records deep across all five traditions. You can leave for Camp Brotherhood knowing the build is running.

When the Mac mini arrives mid-week, you do the 20-minute software install, plug the SSD into the Mac mini, restart the corpus runner there, and your laptop is free. Same command, same data, picks up where it left off.

Part Eleven: While You Are At Camp Brotherhood

Assuming you depart Friday June 6, return Monday June 9, here's what runs unattended.

On the laptop (or Mac mini if it arrived): The corpus orchestrator continues working through tradition builders, page by page. Hourly checkpoint logs let you see progress when you check in. SQLite is single-writer-safe; nothing corrupts.

Cost meter: I'll build a tally counter into the orchestrator. You can SSH in from your phone (if you have Termius or Blink) and see "corpus build day 3: 18,000 records ingested, $87 spent, 3 errors logged, projected completion date X." You don't have to babysit, but you can peek.
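A minimal sketch of the tally counter, producing the status line quoted above. The class name CostMeter and its methods are hypothetical, and the projected completion date is left out because the plan does not say how it would be estimated.

```python
from dataclasses import dataclass


@dataclass
class CostMeter:
    records: int = 0
    spent_usd: float = 0.0
    errors: int = 0

    def add(self, records=0, spent_usd=0.0, errors=0):
        """Accumulate the results of one batch."""
        self.records += records
        self.spent_usd += spent_usd
        self.errors += errors

    def summary(self, day):
        """One-line status suitable for a quick look over SSH."""
        return (
            f"corpus build day {day}: {self.records:,} records ingested, "
            f"${self.spent_usd:,.0f} spent, {self.errors} errors logged"
        )
```

The orchestrator would call add after each batch and write summary to the hourly checkpoint log.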

By the time you're back Monday: Corpus is days-deep, well into the meaningful-content phase. Generation agents can start drawing from it in Sprint B (June 9 onward). You're ahead of schedule.

Part Twelve: What I Need From You To Start

Quick decisions:
1. Mac mini config: 16GB ($799) or 24GB ($999, gives you local LLM headroom)?
2. SSD: Samsung T9 2TB ($200) or 4TB ($350)?
3. Greenlight on Haiku-default corpus build (~$250 estimated cost) and the source list above? Anything to remove or add?
4. Should I write the corpus code in my next turn so you can start it tonight? Or do you want to read this first and decide?

Things I can start without waiting:
- I'll add the cost-tracker module so we have spend visibility from minute one.
- I'll update the .env.example with the new AFFIRMOLOGY_DATA_DIR variable.
- I'll add the data-tier license enum to the architecture doc.

Tell me "go build it" and I write the corpus module immediately. Tell me "let me read this and think" and I wait.

End of bootstrap plan. The corpus is the moat. Let's start digging.
