
Affirmology Corpus Growth Plan v1 - the self-improving nightly crawler

Updated Jun 18, 2026 · Affirmology_CorpusGrowth_Plan_v1.md

Goal: turn the corpus pipeline from a one-time scrape into a nightly, self-directed agent that grows the corpus deeper over time, especially the starving traditions, learns from its own failures, overcomes obstacles, and also gathers competitive and cultural intelligence, all within hard legal and tier-wall boundaries. Haiku stays the structurer (the bake-off settled that). This extends the existing corpus agents (discovery, quality, reporter) rather than starting from scratch.

Where the corpus stands (the honest baseline)

12,905 records, lopsided: human_design 7,863 · transits 2,525 · gene_keys 1,524 · western 544 · vedic 362 · numerology 87. The thin traditions (numerology, vedic, western) are the priority. The "707-doc backlog" was non-structurable residue (failed PDF extractions, books, raw HTML, papers), which is why it produced 0 records. So growth = NEW quality sources, not reprocessing residue.
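To make the imbalance concrete, here is a small sketch that ranks traditions from thinnest to thickest using the counts quoted above. The function name `rank_gaps` is illustrative and not part of the existing pipeline.

```python
# Record counts copied from the baseline above.
COUNTS = {
    "human_design": 7863,
    "transits": 2525,
    "gene_keys": 1524,
    "western": 544,
    "vedic": 362,
    "numerology": 87,
}


def rank_gaps(counts):
    """Return (tradition, count, share_of_corpus) tuples, thinnest first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: kv[1])
    return [(name, n, n / total) for name, n in ordered]


for name, n, share in rank_gaps(COUNTS):
    print(f"{name:<13} {n:>6}  {share:6.1%}")
```

On these numbers human_design is roughly 61% of the corpus and numerology is under 1%, which is why the thin traditions lead the target list.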

The nightly loop (agentic, self-improving)

Each night the crawler runs a critique-and-adapt cycle, not a fixed script:
1. Review. Read last night's yield by tradition + the running coverage map. Identify the biggest gaps (thin traditions, thin elements, e.g. missing numerology master numbers, sparse Vedic nakshatras).
2. Target. Pick the night's objectives from the gaps (e.g. "deepen numerology life-path + personal-year interpretations; fill missing HD gate-lines").
3. Discover. The discovery agent finds candidate sources for those targets (search, follow links from productive seeds, expand outward from sources that already yielded good records). Goes deeper down productive veins rather than re-skimming the same pages.
4. Acquire (multi-method, obstacle-handling). Try the cheapest extractor first (trafilatura/readability), escalate to a headless browser (Playwright) for JavaScript-rendered pages, handle PDFs and varied formats, back off politely on rate limits. If one method fails, try the next before giving up.
5. Structure. Haiku turns clean extracted text into per-element interpretation records (the proven path).
6. Quality gate + quarantine. The quality agent scores records; anything below bar or non-structurable residue is quarantined, never inserted, so the corpus stays clean and the status counts only real content.
7. Self-critique + adapt. The crawler grades its own night: which sources yielded records vs junk, which extractor worked, where it got blocked. It updates its strategy, demotes junk-yielding source types, promotes productive ones, queues retries with a different method, and logs what it learned so next night is smarter. This is the "challenges and fixes itself" loop.
8. Sync + report. Re-upload the enriched corpus.db to R2 and ping the Render deploy hook so the cloud reloads (per the nightly-automation design). Write a morning report: records added by tradition, new sources found, obstacles hit and how it adapted, remaining gaps.
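The Acquire step's "cheapest first, then escalate" behaviour could be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `acquire` function, the `min_chars` threshold, and the two stand-in extractors are assumptions. Real extractors would wrap trafilatura/readability and Playwright behind the same URL-in, text-out signature.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical extractor signature: URL in, clean text (or None) out.
Extractor = Callable[[str], Optional[str]]


def acquire(
    url: str,
    extractors: List[Tuple[str, Extractor]],
    min_chars: int = 200,
) -> Tuple[Optional[str], Optional[str], List[Tuple[str, str]]]:
    """Try each extractor in order (cheapest first).

    Returns (method_name, text, attempts). `attempts` lists every method
    tried and its outcome, so the self-critique step can learn from it.
    """
    attempts: List[Tuple[str, str]] = []
    for name, extract in extractors:
        try:
            text = extract(url)
        except Exception as exc:  # one failed method must not end the night
            attempts.append((name, f"error: {exc}"))
            continue
        if text and len(text) >= min_chars:
            attempts.append((name, "ok"))
            return name, text, attempts
        attempts.append((name, "too little text"))
    return None, None, attempts


# Stand-ins for illustration only; real ones would wrap the actual tools.
def static_extractor(url: str) -> Optional[str]:
    return None  # pretend the page is JavaScript-rendered


def browser_extractor(url: str) -> Optional[str]:
    return "interpretation text " * 20  # pretend the rendered page had content


method, text, attempts = acquire(
    "https://example.com/life-path-7",
    [("static", static_extractor), ("browser", browser_extractor)],
)
print(method, attempts)
```

Returning the attempt log alongside the text keeps acquisition and self-critique decoupled: the loop only needs to persist `attempts` to know which method to promote or skip next night.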

Going deeper over time

Seed expansion: start from the sources that already produced good records and follow their links, citations, and related pages, deeper each night.
Gap-driven targeting: always aim the night at the current thinnest elements, so coverage evens out instead of piling more onto human_design.
Obstacle playbook: JS pages → Playwright; PDFs → robust extraction with a fallback; rate limits → backoff + scheduling; format variety → method-per-format. Blocked sources get logged and routed around, not retried blindly.
Memory: a persistent crawl-state (sources tried, yield, blocks, what worked) so it compounds knowledge rather than rediscovering the same dead ends.
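One way the persistent crawl-state could look, sketched with the standard-library `sqlite3` module. The table name `source_runs`, its columns, the method names, and the helper functions are all assumptions for illustration; the plan does not specify a schema.

```python
import sqlite3
from typing import List

METHODS = ["static", "browser", "pdf"]  # hypothetical method names


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the crawl-state store; pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS source_runs (
               url TEXT NOT NULL,
               method TEXT NOT NULL,
               records INTEGER NOT NULL,
               blocked INTEGER NOT NULL,
               run_date TEXT NOT NULL
           )"""
    )
    return conn


def log_run(conn, url, method, records, blocked, run_date) -> None:
    """Record one acquisition attempt and what it yielded."""
    conn.execute(
        "INSERT INTO source_runs VALUES (?, ?, ?, ?, ?)",
        (url, method, records, int(blocked), run_date),
    )
    conn.commit()


def productive_seeds(conn) -> List[str]:
    """Sources that have yielded records, best first: candidates for seed expansion."""
    rows = conn.execute(
        """SELECT url FROM source_runs
           GROUP BY url
           HAVING SUM(records) > 0
           ORDER BY SUM(records) DESC"""
    )
    return [row[0] for row in rows]


def untried_methods(conn, url: str) -> List[str]:
    """Methods not yet attempted on this source, so retries are never blind."""
    tried = {
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT method FROM source_runs WHERE url = ?", (url,)
        )
    }
    return [m for m in METHODS if m not in tried]


conn = open_state()
log_run(conn, "https://example.com/a", "static", 12, False, "2026-06-18")
log_run(conn, "https://example.com/b", "static", 0, True, "2026-06-18")
print(productive_seeds(conn))
print(untried_methods(conn, "https://example.com/b"))
```

Keeping one row per attempt, rather than one row per source, preserves the history needed to tell "never worked" apart from "worked until it got blocked".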

Market & creator intelligence catalog (a deliberate asset, multiple uses)

As the crawler moves through the web, it catalogs the players it finds: competitors (CoStar, The Pattern, CHANI, Sanctuary, Nebula), paywalled and premium sources, the big paid Instagram/TikTok accounts, individual practitioners creating transit reports, readings, and guides, and the wider cosmic-blueprint and subconscious/affirmation-audio spaces, especially YouTube channels (the formats, titles, lengths, music styles, thumbnails, and channels that actually win views and subscribers). For each it records who they are, what they offer, positioning, pricing, tiers, features, audience and reach, and the language that resonates. This is a standing intelligence asset that grows over time.

Cataloging = recording that a source exists and how it operates. That is completely distinct from circumventing it: we are not bypassing paywalls or logins, just noting the source on the map. (Separate and later: if we ever wanted a premium source's actual content inside the corpus, we'd acquire it legitimately and keep it tier-walled; Tier C IP stays benchmark-only. That's a different question from cataloging and not what this lane is for.)

Why the catalog is worth building, more than one reason:
- Partner and creator pipeline: the people already making transit reports and readings are exactly the Affirmologists/affiliates we'd want; cataloging them is lead-gen for the lab and affiliate program.
- Competitive and pricing intelligence: features, tiers, what they charge, where they're thin.
- Distribution scouting: which channels and accounts hold the audiences we want to reach.
- Content-funnel R&D (big one): study what wins in the subconscious-audio and cosmic-blueprint space (YouTube especially) to power an automated curated-track funnel. We cheaply produce on-brand curated audio tracks, post them where that audience already searches, and funnel viewers into the app. The same engine that personalizes Sacred Audios can mass-produce curated tracks. This is a distinct growth initiative the catalog feeds, and likely deserves its own plan.
- Cultural learning (Echo): the framing and language people actually respond to.
- Collaboration / acquisition radar: who's worth partnering with or absorbing later.
- Market proof for investors: willingness-to-pay, pricing, and the scale of the space (one use among several, not the headline).

Kept in its own intelligence store, separate from the A/B interpretation corpus.
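A catalog entry might be shaped like the sketch below. The field names simply mirror the attributes listed above (who they are, what they offer, positioning, pricing, tiers, features, audience and reach, resonant language); the class name, the `access` field, and the JSON Lines serialization are assumptions, not a defined schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CatalogEntry:
    name: str
    kind: str  # e.g. "competitor", "practitioner", "youtube_channel"
    offering: str = ""
    positioning: str = ""
    pricing: Optional[str] = None  # left empty until actually observed
    tiers: List[str] = field(default_factory=list)
    features: List[str] = field(default_factory=list)
    audience_reach: Optional[str] = None
    resonant_language: List[str] = field(default_factory=list)
    access: str = "public"  # "public" or "paywalled": noted, never bypassed


def to_jsonl(entry: CatalogEntry) -> str:
    """Serialize one entry as a single JSON line for the intelligence store."""
    return json.dumps(asdict(entry), ensure_ascii=False)


# Placeholder entry; no real account or pricing is implied.
entry = CatalogEntry(name="Example Practitioner", kind="practitioner", access="paywalled")
print(to_jsonl(entry))
```

Because the entry carries no interpretation text, nothing in this store can leak into the A/B interpretation corpus by accident.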

Guardrails

Cost caps on every nightly run (Haiku structuring is cheap, but cap it).
No PII collection.
Honest status: quarantine residue so "pending/structured" reflects real content (fixes the recurring fake-backlog problem).
Everything chart/tradition-tagged and tier-walled; generation never crosses a person's content with another's.
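The cost cap and the quarantine rule can be sketched together. Everything named here is hypothetical: `NightlyBudget`, `gate`, `status`, the `score` field, and the threshold values are illustrative, and the real quality agent's scoring is not specified in this plan.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class NightlyBudget:
    cap_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Reserve spend; return False (and spend nothing) if it would exceed the cap."""
        if self.spent_usd + cost_usd > self.cap_usd:
            return False
        self.spent_usd += cost_usd
        return True


def gate(records: List[Dict], min_score: float) -> Tuple[List[Dict], List[Dict]]:
    """Split records into (accepted, quarantined); quarantined are never inserted."""
    accepted: List[Dict] = []
    quarantined: List[Dict] = []
    for rec in records:
        (accepted if rec["score"] >= min_score else quarantined).append(rec)
    return accepted, quarantined


def status(accepted: List[Dict], quarantined: List[Dict]) -> Dict[str, int]:
    """Honest status: 'structured' counts only records that passed the gate."""
    return {"structured": len(accepted), "quarantined": len(quarantined)}


budget = NightlyBudget(cap_usd=1.0)
print([budget.charge(c) for c in (0.5, 0.5, 0.25)])

records = [
    {"id": "r1", "tradition": "numerology", "score": 0.9},
    {"id": "r2", "tradition": "vedic", "score": 0.2},
]
accepted, quarantined = gate(records, min_score=0.7)
print(status(accepted, quarantined))
```

Checking `charge` before each structuring call means a runaway night stops at the cap instead of being discovered on the invoice.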

Staged build (so it's incremental, not a moonshot)

Fix the books first: honest status + residue quarantine (stops the recurring fake backlogs). Small, do now.
Self-review nightly loop over existing source types: review → target gaps → structure → quality → quarantine → report. The compounding core.
Discovery for new sources aimed at the thin traditions (numerology, vedic, western).
Obstacle handling: add Playwright for JS pages + robust PDF/format extraction + backoff.
Competitive/cultural intelligence lane (Echo): the watch list + positioning/language capture, separate store.
Self-critique + crawl memory: the part that makes it genuinely learn and go deeper each night.

What this is NOT

Not a content-theft machine and not a paywall cracker. It's a disciplined, self-improving librarian that grows clean, well-sourced, tradition-balanced interpretive depth, and separately keeps an honest read on the market, without ever putting the company at legal risk.
