Updated Jun 18, 2026 · Affirmology_CorpusGrowth_Plan_v1.md
Goal: turn the corpus pipeline from a one-time scrape into a nightly, self-directed agent. The agent grows the corpus deeper over time (especially the starving traditions), learns from its own failures, works around obstacles, and gathers competitive and cultural intelligence, all within hard legal and tier-wall boundaries. Haiku stays the structurer (the bake-off settled that). The plan extends the existing corpus agents (discovery, quality, reporter) rather than starting from scratch.
12,905 records, lopsided: human_design 7,863 · transits 2,525 · gene_keys 1,524 · western 544 · vedic 362 · numerology 87. The thin traditions (numerology, vedic, western) are the priority. The "707-doc backlog" was non-structurable residue (failed PDF extractions, books, raw HTML, papers), which is why it produced 0 records. So growth = NEW quality sources, not reprocessing residue.
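A minimal sketch of how the Review step could rank the thin traditions from these counts. The counts are the ones quoted above; the function name `coverage_gaps` and the 5% share threshold are illustrative assumptions, not part of the existing pipeline.

```python
# Record counts per tradition, copied from the status line above.
COUNTS = {
    "human_design": 7863,
    "transits": 2525,
    "gene_keys": 1524,
    "western": 544,
    "vedic": 362,
    "numerology": 87,
}


def coverage_gaps(counts, thin_share=0.05):
    """Return (tradition, count, share) for traditions below thin_share, thinnest first.

    thin_share is an assumed cutoff; tune it to whatever "thin" should mean.
    """
    total = sum(counts.values())
    rows = [(name, n, n / total) for name, n in counts.items()]
    thin = [row for row in rows if row[2] < thin_share]
    return sorted(thin, key=lambda row: row[1])


for name, n, share in coverage_gaps(COUNTS):
    print(f"{name:<12} {n:>6} {share:6.1%}")
```

With a 5% cutoff this flags numerology, vedic, and western, in that order, which matches the priority list above.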
Each night the crawler runs a critique-and-adapt cycle, not a fixed script:
1. Review. Read last night's yield by tradition + the running coverage map. Identify the biggest gaps (thin traditions, thin elements, e.g. missing numerology master numbers, sparse Vedic nakshatras).
2. Target. Pick the night's objectives from the gaps (e.g. "deepen numerology life-path + personal-year interpretations; fill missing HD gate-lines").
3. Discover. The discovery agent finds candidate sources for those targets (search, follow links from productive seeds, expand outward from sources that already yielded good records). Goes deeper down productive veins rather than re-skimming the same pages.
4. Acquire (multi-method, obstacle-handling). Try the cheapest extractor first (trafilatura/readability), escalate to a headless browser (Playwright) for JavaScript-rendered pages, handle PDFs and varied formats, back off politely on rate limits. If one method fails, try the next before giving up.
5. Structure. Haiku turns clean extracted text into per-element interpretation records (the proven path).
6. Quality gate + quarantine. The quality agent scores records; anything below bar or non-structurable residue is quarantined, never inserted, so the corpus stays clean and the status counts only real content.
7. Self-critique + adapt. The crawler grades its own night: which sources yielded records vs junk, which extractor worked, where it got blocked. It updates its strategy, demotes junk-yielding source types, promotes productive ones, queues retries with a different method, and logs what it learned so next night is smarter. This is the "challenges and fixes itself" loop.
8. Sync + report. Re-upload the enriched corpus.db to R2 and ping the Render deploy hook so the cloud reloads (per the nightly-automation design). Write a morning report: records added by tradition, new sources found, obstacles hit and how it adapted, remaining gaps.
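The Acquire step's "cheapest first, then escalate" behavior can be sketched as an ordered chain of extractors. This is a sketch under stated assumptions, not the pipeline's actual code: `AcquireResult`, `acquire`, the `min_chars` cutoff, and the two stand-in extractors are all hypothetical. In practice the first slot would wrap trafilatura or readability and the second a Playwright-rendered fetch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# An extractor takes a URL and returns clean text, or None if it found nothing usable.
Extractor = Callable[[str], Optional[str]]


@dataclass
class AcquireResult:
    url: str
    text: Optional[str]    # None means every method failed
    method: Optional[str]  # name of the extractor that succeeded
    failures: List[str] = field(default_factory=list)  # one "method: reason" per failed attempt


def acquire(url: str, chain: List[Tuple[str, Extractor]], min_chars: int = 200) -> AcquireResult:
    """Try extractors in order (cheapest first) and stop at the first usable result."""
    failures: List[str] = []
    for name, extract in chain:
        try:
            text = extract(url)
        except Exception as exc:  # a blocked or crashing method must not end the chain
            failures.append(f"{name}: {type(exc).__name__}")
            continue
        if text and len(text.strip()) >= min_chars:
            return AcquireResult(url, text.strip(), name, failures)
        failures.append(f"{name}: too little text")
    return AcquireResult(url, None, None, failures)


# Hypothetical stand-ins so the sketch runs without network access.
def static_extractor(url: str) -> Optional[str]:
    return None  # simulates a JavaScript-rendered page whose static HTML is empty


def browser_extractor(url: str) -> Optional[str]:
    return "Rendered interpretation text. " * 20  # simulates a successful headless render


result = acquire(
    "https://example.com/some-page",
    [("static", static_extractor), ("browser", browser_extractor)],
)
print(result.method, result.failures)
```

The `failures` list records which method failed and why, which is the kind of per-source evidence the self-critique step needs when deciding what to demote or retry with a different method.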
As the crawler moves through the web, it catalogs the players it finds: competitors (CoStar, The Pattern, CHANI, Sanctuary, Nebula), paywalled and premium sources, the big paid Instagram/TikTok accounts, individual practitioners creating transit reports, readings, and guides, and the wider cosmic-blueprint and subconscious/affirmation-audio spaces, especially YouTube channels (the formats, titles, lengths, music styles, thumbnails, and channels that actually win views and subscribers). For each it records who they are, what they offer, positioning, pricing, tiers, features, audience and reach, and the language that resonates. This is a standing intelligence asset that grows over time.
Cataloging means recording that a source exists and how it operates. That is entirely distinct from circumventing it: we do not bypass paywalls or logins, we only note the source on the map. (Separate and later: if we ever wanted a premium source's actual content inside the corpus, we'd acquire it legitimately and keep it tier-walled; Tier C IP stays benchmark-only. That's a different question from cataloging and not what this lane is for.)
The catalog is worth building for more than one reason:
- Partner and creator pipeline: the people already making transit reports and readings are exactly the Affirmologists/affiliates we'd want; cataloging them is lead-gen for the lab and affiliate program.
- Competitive and pricing intelligence: features, tiers, what they charge, where they're thin.
- Distribution scouting: which channels and accounts hold the audiences we want to reach.
- Content-funnel R&D (big one): study what wins in the subconscious-audio and cosmic-blueprint space (YouTube especially) to power an automated curated-track funnel. We cheaply produce on-brand curated audio tracks, post them where that audience already searches, and funnel viewers into the app. The same engine that personalizes Sacred Audios can mass-produce curated tracks. This is a distinct growth initiative the catalog feeds, and likely deserves its own plan.
- Cultural learning (Echo): the framing and language people actually respond to.
- Collaboration / acquisition radar: who's worth partnering with or absorbing later.
- Market proof for investors: willingness-to-pay, pricing, and the scale of the space (one use among several, not the headline).
The catalog is kept in its own intelligence store, separate from the A/B interpretation corpus.
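One way the separate store could look, as a rough sketch: a catalog record carrying the fields listed earlier (who they are, what they offer, positioning, pricing, tiers, features, audience and reach, resonant language), written to its own SQLite file rather than to corpus.db. The file name `intelligence.db`, the table layout, and the names `CatalogEntry`, `open_store`, and `upsert` are assumptions for illustration only.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    name: str                # who they are
    kind: str                # e.g. "competitor", "practitioner", "youtube_channel"
    offer: str = ""          # what they offer
    positioning: str = ""
    pricing: str = ""
    tiers: List[str] = field(default_factory=list)
    features: List[str] = field(default_factory=list)
    audience_reach: str = ""
    resonant_language: List[str] = field(default_factory=list)
    access: str = "public"   # "public" or "paywalled"; noted only, never bypassed


def open_store(path: str = "intelligence.db") -> sqlite3.Connection:
    """Open the intelligence store (hypothetical file, deliberately not corpus.db)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS catalog ("
        "name TEXT PRIMARY KEY, kind TEXT NOT NULL, data TEXT NOT NULL)"
    )
    return conn


def upsert(conn: sqlite3.Connection, entry: CatalogEntry) -> None:
    """Insert the entry, or replace the existing row with the same name."""
    conn.execute(
        "INSERT OR REPLACE INTO catalog (name, kind, data) VALUES (?, ?, ?)",
        (entry.name, entry.kind, json.dumps(asdict(entry))),
    )
    conn.commit()


# In-memory demo; the detail fields stay empty until the crawler actually observes them.
conn = open_store(":memory:")
upsert(conn, CatalogEntry(name="CHANI", kind="competitor"))
print(conn.execute("SELECT name, kind FROM catalog").fetchall())
```

Keeping the catalog in a separate file means a sync of corpus.db never carries market intelligence along with interpretation records, and the two stores can follow different retention and access rules.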
Not a content-theft machine and not a paywall cracker. It's a disciplined, self-improving librarian that grows clean, well-sourced, tradition-balanced interpretive depth, and separately keeps an honest read on the market, without ever putting the company at legal risk.