Internal · The whole thing, honestly

How we got to a product call — and what's still unsettled.

A faithful record of the CORA listening project: every assumption we made, every decision we forced, where the data agreed with us, and the two places it told us we were wrong.

For Julien · Dan · Rafaela  ·  Status: living document · all 3 analysis layers read  ·  One open decision: cohort × format (founders' call)
Corpus: 24,450 raw / 13,025 triaged  ·  Total spend to date ≈ $35  ·  ES-Spain share: 36.3% (gate cleared)

00 What this is

CORA is a Spain-first women's daily-nutrition brand. The deck had already converged on a single-SKU launch — a complete-protein coffee creamer (Option 4) — off a thin survey (n≈93, badly Brazilian-skewed; the de-biased Spanish core was n≈12). This project pointed a proven social-listening engine at the category to test that bet against thousands of real conversations rather than re-confirm it.

The whole exercise was run under one rule: answer CORA's open questions, don't re-derive the deck. That meant being willing to have the data contradict three sessions of our own thinking — which, twice, it did.

01 The engine we re-pointed

We did not build from scratch. A listening + analysis pipeline was already built and proven on the GLP-1 category (13,886 items → a real product thesis). It scrapes four platforms (Reddit, YouTube, TikTok, Instagram via Apify), triages every item with a cheap model (the "census" — tagging concern, journey phase, emotion, geography), escalates the highest-intent items, and stores everything in a re-queryable Postgres corpus where the raw text is immutable and the tags can be re-run.

Re-pointing it at CORA was a configuration, not a rebuild: a new database (nm_cora_datalake), a new concern taxonomy, a new life_stage cohort axis, a new geo logic that separates Spain from LatAm-Spanish from Brazilian-Portuguese, and a connector re-weighting toward Instagram/TikTok (where Spanish women's conversation actually lives) over Reddit (which is English-dominated).

02 The pivot: velocity → durability

The original brief specified a velocity-first analysis — rank concerns by growth, because CORA wanted to ride the fastest-growing opportunity it could still catch. We changed that, deliberately, when the founder chose to optimise for a durable need rather than the steepest riser (decision #01 below).

The reasoning: in a category that's filling fast (creatine-for-women, the gummy/stick explosion, Spanish brands already entering), the steepest slope can simply mean you're late. A large, stable, emotionally-hot need you can own for a decade beats a fad you catch at its peak. So the headline ranking became a durability composite — size × emotion intensity × concentration at a buying moment — and velocity was demoted to a health-check column: "is the durable thing we want to anchor on rising, flat, or fading?"
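
A minimal sketch of how such a composite could be computed, assuming the three factors are simply multiplied as the formula above reads. The field names, the 0..1 scaling of emotion and buying-moment share, and the input numbers are placeholders for illustration, not corpus figures or the engine's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ConcernStats:
    name: str
    size: int             # number of items tagged with this concern
    emotion: float        # mean emotion intensity, assumed scaled to 0..1
    buying_moment: float  # assumed share of items at a buying moment, 0..1


def durability(c: ConcernStats) -> float:
    """Plain product of the three factors named in the text."""
    return c.size * c.emotion * c.buying_moment


def rank_by_durability(concerns):
    """Sort concerns from most to least durable."""
    return sorted(concerns, key=durability, reverse=True)


# Placeholder inputs for illustration only, not corpus figures.
example = [
    ConcernStats("concern_a", size=400, emotion=0.5, buying_moment=0.2),
    ConcernStats("concern_b", size=250, emotion=0.6, buying_moment=0.6),
    ConcernStats("concern_c", size=300, emotion=0.4, buying_moment=0.1),
]

for c in rank_by_durability(example):
    print(f"{c.name}: {durability(c):.1f}")
```

With these placeholder values the smaller concern_b ranks first, which shows how a concern with fewer mentions can outrank a larger one once buying-moment concentration is weighed in.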

03 The 21 assumptions we put on the table

Before spending a euro, we inventoried everything we were quietly treating as settled, so we could challenge it. LB marks load-bearing — wrong here and the run points the wrong way, not just noisier.

Strategic framing
  1. Velocity = opportunity. LB Resolved by choosing durability (decision #01).
  2. Public conversation predicts purchase. LB Listening measures what's posted, not felt or bought; private concerns (stress, hormonal) are under-posted. Mitigated by requiring a search cross-check (decision #02).
  3. More research is the right move (vs. a dressed-up delay). LB Resolved: treated as a genuine input (decision #03 = A).
Hypothesis & geography
  1. Perimenopause/creatine/protein is the relevant frontier — circular if we only seed around it.
  2. Spain-first is fixed. LB The BR-Madrid skew might be the signal, not the noise. Held as Spain-first (decision #04 = A), BR quarantined not deleted.
  3. English concern-structure transfers to Spanish women — they may weight stress/aesthetics/hormonal differently.
Velocity method
  1. posted_at is reliable across platforms (TikTok/IG reshares can corrupt true dates).
  2. 12–18 months separates trend from seasonality — barely one annual cycle.
  3. Share-of-month removes the sampling artifact — reduces, doesn't kill it (deleted old posts deflate the past).
  4. Tag-based slopes are stable — a slope is a derivative; ~10% tag error hurts it more than a rank.
  5. Rising conversation leads, not lags, purchase.
Cohort & sub-40
  1. Life-stage proxies stand in for age. LB They only fire when someone declares a life stage — a small, self-selecting slice. (Confirmed: ~85% unknown.)
  2. Rafa's sub-40 access is a channel, not a one-time list.
  3. Sub-40 access = monetisable demand — the sub-40 core was the price-sensitive, low-spend group.
Seed vocabulary & taxonomy
  1. new_pain_proposed backstops our blind spots. LB It only sees what we seeded — a seed blind spot is invisible to the tool meant to catch blind spots.
  2. The taxonomy tags cleanly in Spanish/French — it was designed in English.
  3. Brands are good listening anchors — in a nascent ES market people may not name brands.
Platform, volume & guardrails
  1. IG/TikTok return substantive, taggable text. LB Spanish meaning may live on Reddit while IG/TikTok give volume but thin captions. (Partly bit us — IG scraped thin.)
  2. One scrape yields enough per cohort × month to compute a slope — sub-40 ES by month may be n<10.
  3. The 25% ES+FR gate is the right threshold — arbitrary, but pre-registered.
  4. Supply-side and demand-side are separable — influencer-as-consumer blurs the line. (This became the key debias.)
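
The share-of-month idea in the velocity-method assumptions can be sketched as below. This is an illustration under stated assumptions: `posted_at` is taken to be an ISO date string and `concerns` is a hypothetical field holding an item's tags. The engine's real schema may differ.

```python
from collections import Counter


def share_of_month(items, concern):
    """Per-month share of items tagged with `concern`, keyed by 'YYYY-MM'."""
    totals = Counter()
    hits = Counter()
    for it in items:
        month = it["posted_at"][:7]  # 'YYYY-MM' prefix of an ISO date
        totals[month] += 1
        if concern in it["concerns"]:
            hits[month] += 1
    return {m: hits[m] / totals[m] for m in sorted(totals)}


# Hand-made sample: the raw count doubles in February, the share does not.
sample = (
    [{"posted_at": "2025-01-10", "concerns": ["stress_cortisol"]}]
    + [{"posted_at": "2025-01-15", "concerns": []}] * 3
    + [{"posted_at": "2025-02-03", "concerns": ["stress_cortisol"]}] * 2
    + [{"posted_at": "2025-02-20", "concerns": []}] * 6
)

shares = share_of_month(sample, "stress_cortisol")
print(shares)  # {'2025-01': 0.25, '2025-02': 0.25}
```

Dividing by the month's total corrects for months where the scrape simply returned more items. It does not correct for old posts that have since been deleted, which is the residual artifact the assumption flags.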

04 The four decisions we forced before scraping

Four strategic calls were settled in writing first, so the data couldn't be cherry-picked to fit whichever answer we secretly wanted.

| # | Question | Call | Consequence |
|---|---|---|---|
| 01 | Optimise for momentum or a durable need? | B · Durable | Headline became a durability composite; velocity demoted to health-check. |
| 02 | Require a search-volume cross-check before trusting any ranking? | YES | Dan supplies ES Google Trends / Keyword Planner; conversation alone never decides. |
| 03 | Real decision input, or de-risking a call already made? | A · Open | Full run; Option 4 is genuinely contestable by the data. |
| 04 | Spain-broad first, or Brazilian-Madrid / LatAm first? | A · Spain | ES is the primary corpus; BR & LatAm-ES quarantined as reference, not deleted. |

Decision #01 = B reshaped the whole extraction. Combined with #04 = A, it also committed CORA to the slower category-builder road: launching into the cold, price-sensitive Spanish core and consciously demoting Rafa's warm BR network from launch market to seeding asset. A defensible bet, made knowingly.

05 The scrape, step by step

The engine was cloned, the new schema applied, the read-only role proven SELECT-only, and — the non-negotiable lesson from GLP-1 — nm_cora_datalake was added to the backup script and a real test dump run before any scraping. Five Spanish creator anchors were verified (Boticaria García, Marta Marcé, Cristina Mitre, Ismael Galancho, brand Woments) and three French ones; no handles were fabricated.

What went right and what bit us

  • Reddit comment-tree blowout. We set Reddit to maxItems=15 to keep it a modest English reference. The actor caps posts, not comments, so fetchPostComments pulled full trees → Reddit ballooned to 13,954 items (69% of the corpus). We trimmed it back to ~2,738 (all posts + top threads by engagement) so it stayed reference, not driver.
  • Instagram scraped thin. The hashtag-search input format was wrong for the actor; it returned ~385 items. IG — our intended primary Spanish channel — underperformed. TikTok (2,200) carried the social load instead.
  • Cost discipline held. First run ≈ $23 combined, under target, after the Reddit trim.
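
The Reddit trim described above (all posts, plus comments only from the most-engaged threads) could look roughly like this. It is a sketch, not the script that was run: the `kind`, `thread_id` and `score` fields and the `top_threads` parameter are hypothetical names chosen for illustration.

```python
def trim_reddit(items, top_threads=2):
    """Keep every post, plus comments only from the most-engaged threads."""
    posts = [it for it in items if it["kind"] == "post"]
    # Rank threads by the engagement score of their root post.
    ranked = sorted(posts, key=lambda p: p["score"], reverse=True)
    keep_threads = {p["thread_id"] for p in ranked[:top_threads]}
    comments = [
        it for it in items
        if it["kind"] == "comment" and it["thread_id"] in keep_threads
    ]
    return posts + comments


# Hand-made sample: three threads, comments kept only for the top two.
sample = [
    {"kind": "post", "thread_id": "t1", "score": 50},
    {"kind": "post", "thread_id": "t2", "score": 5},
    {"kind": "post", "thread_id": "t3", "score": 20},
    {"kind": "comment", "thread_id": "t1", "score": 3},
    {"kind": "comment", "thread_id": "t1", "score": 1},
    {"kind": "comment", "thread_id": "t2", "score": 2},
    {"kind": "comment", "thread_id": "t3", "score": 4},
]

trimmed = trim_reddit(sample)
print(len(trimmed))  # 6
```

Trimming after the fact keeps Reddit as a reference tier, but the scrape cost has already been paid by then, so a comment cap at scrape time is the cheaper fix.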

06 The geo gate: honoured, not waved through

The pre-registered rule: don't draw a Spain conclusion unless ES-Spain clears 25% of geo-identified items. First run came in at 23.4% — just under. We did not wave it through despite a favourable early ranking. Instead we ran a cheap, surgical Spain top-up: comment-scraping the five Spanish anchors' posts (where Spain-Spanish conversation concentrates) plus a Spain-only YouTube re-query.

Gate cleared, honestly

The top-up added 1,149 genuine ES items. ES-Spain rose 1,559 → 3,732 (36.3% of geo-identified) — cleared with room. It cleared because we found real Spanish conversation, not because we reclassified our way past the threshold.

Final geo: ES 3,732 · LatAm-ES 3,485 · unknown 2,735 · US 2,440 · FR 277 · UK 140 · other 134 · BR 82 (quarantined). LatAm-ES is held as a Spanish-language reference tier — a far better proxy for Spain than English Reddit, because concern structure transfers across dialects better than across languages.
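
The gate arithmetic can be reproduced from the final geo counts. The sketch below assumes "geo-identified" means every item except those tagged unknown, with the quarantined BR items still counted in the denominator. That reading reproduces the 36.3% quoted above.

```python
GATE = 0.25  # pre-registered ES-Spain threshold

# Final geo counts as listed above.
geo_counts = {
    "ES": 3732, "LatAm-ES": 3485, "unknown": 2735, "US": 2440,
    "FR": 277, "UK": 140, "other": 134, "BR": 82,
}


def es_share(counts):
    """ES share of geo-identified items (everything except 'unknown')."""
    identified = sum(n for geo, n in counts.items() if geo != "unknown")
    return counts["ES"] / identified


share = es_share(geo_counts)
print(f"total triaged: {sum(geo_counts.values())}")            # 13025
print(f"ES share: {share:.1%}, gate cleared: {share >= GATE}")  # 36.3%, True
```

The counts sum to the 13,025 triaged items, and 3,732 of the 10,290 geo-identified items gives the 36.3% figure.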

07 What the corpus actually says

The Spanish read (ES only), in durability order, with the velocity health-check beside it:

| Concern (ES) | Signal | Velocity | Read |
|---|---|---|---|
| stress_cortisol | #1 (~402) | Large & rising | Bedrock. Robust across both reads + survey. |
| creatine_curiosity_safety | top-2 (~275) | Flat / dipping (0.74×) | Organic, not seller-echo. See §08. |
| energy_fatigue | ~308 | Rising | Reads as the same woman as stress. |
| perimenopause_symptoms | ~303 (18.7%) | Rising | Spain-strong, but likely anchor-inflated; 40+. See §09. |
| sleep · muscle_maintenance · strength_identity | 169–201 | mixed | Supporting cluster. |
| calm_energy (new tag) | ~106 ES / 343 all | Flat | Real but a framing, not a spine. |

The model also proposed concerns we hadn't seeded — calm_energy (promoted to a real tag), plus caffeine_non_responder, task_paralysis, hormonal_rage, hair_loss_adverse_event — the whitespace worth watching.

08 Two reversals: where the data told us we were wrong

Reversal 1 · The creatine premise was false

We pulled creatine out of Option 4 three sessions ago on one premise: that Spanish women read creatina as gym-bro and won't buy it (the deck's n=12 said 17% uptake, "cosa de hombres", i.e. "a men's thing"). The debias falsifies it: only 2.9% of ES creatine mentions are anchor comments (267 of 275 are organic), which makes creatine less seller-driven than the stress/energy control (8.5%). Creatine is top-2 in Spain, organically. The decision to remove it rested on a belief this corpus disproves, so it is re-opened.
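
A minimal sketch of the debias check, assuming each item carries a boolean flag for "comment on a seeded anchor post". The `from_anchor` field name is hypothetical. The item list is rebuilt from the counts quoted above rather than read from the corpus.

```python
def anchor_share(items):
    """Share of items that are comments on seeded anchor posts."""
    if not items:
        raise ValueError("no items to debias")
    flagged = sum(1 for it in items if it["from_anchor"])
    return flagged / len(items)


# Rebuild the ES creatine split reported above: 275 mentions, 267 organic.
creatine = [{"from_anchor": False}] * 267 + [{"from_anchor": True}] * 8
CONTROL = 0.085  # stress/energy control share quoted in the text

share = anchor_share(creatine)
print(f"creatine anchor share: {share:.1%}")  # 2.9%
print(f"below control: {share < CONTROL}")    # True
```

Comparing against a control concern matters because some anchor share is expected for any topic. The signal is whether a concern sits well above or below that baseline.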

Reversal 2 · The single-SKU "calm cluster" is weaker than we bet

We had drifted toward "stress + energy + calm + sleep are one woman, one coffee." The cluster test doesn't support the bundle: of 3,732 ES items, only 216 carry ≥2 of those four and just 12 carry all four. They read as mostly distinct concerns. The "one coffee fixes the whole knot" is a ~6% niche, not the mass frame — and calm_energy is a positioning framing, not a product spine.
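
The cluster test amounts to a co-occurrence count. The sketch below assumes the four concerns map to the tags stress_cortisol, energy_fatigue, calm_energy and sleep, and that each item is represented by its set of tags. The sample is hand-made, and only the final ratio uses figures from the text.

```python
CLUSTER = {"stress_cortisol", "energy_fatigue", "calm_energy", "sleep"}


def cluster_overlap(items, cluster=CLUSTER):
    """Count items carrying at least two cluster tags, and items carrying all."""
    at_least_two = 0
    all_tags = 0
    for tags in items:
        hits = len(cluster & set(tags))
        if hits >= 2:
            at_least_two += 1
        if hits == len(cluster):
            all_tags += 1
    return at_least_two, all_tags


# Tiny hand-made sample, not corpus data.
sample = [
    {"stress_cortisol"},
    {"stress_cortisol", "energy_fatigue"},
    {"stress_cortisol", "energy_fatigue", "calm_energy", "sleep"},
    {"creatine_curiosity_safety"},
]
two, four = cluster_overlap(sample)
print(two, four)  # 2 1

# Reported ES figures from the text: 216 of 3,732 items carry two or more.
print(f"{216 / 3732:.1%}")  # 5.8%
```

The 5.8% result is the source of the "~6% niche" reading above.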

Net: the corpus splits into two coherent territories, not one fused need. The original Option 4 straddled both; pulling creatine narrowed the product onto the harder-to-defend emotional side.

Manage how I feel

stress · energy · calm_energy

Biggest, rising, emotionally hot. But venting-risk, and "calm" is the most commoditised lane there is (and our no-ashwagandha rule narrows the toolkit).

Build & keep my body

creatine · muscle · strength · protein

Product-shaped, organic, real in Spain. But gym-adjacent and velocity-flat. The clearer reason-to-believe.

09 The reads closed, and reading the language corrected us · done

The three diagnostics and a full read of all three analysis layers (skeleton, verbatims, threads) closed the open questions and corrected three things the numbers alone overstated:

  1. Buyable vs. ventable settled it. Creatine 0.95 buyable (171 of 275 seeking) — the substance. Stress 0.47, mostly venting (157 of 402) — the wrapper. Stress is not a hero.
  2. Creatine demand is anxious, not confident. The verbatims show the "seeking" is safety-seeking: "¿no engorda, no hincha?" (doesn't it make you gain weight or bloat?), "¿se puede tomar en café caliente?" (can you take it in hot coffee?), "me empezó a dejar calva" (it started making me go bald), plus drug-interaction and breastfeeding fears. The hero job is removing fear, not selling strength.
  3. Perimenopause debiased out. 30.4% of ES peri mentions are anchor comments (~10× creatine's 2.9%). Its Spain-strength was a seeding artifact. Discount it.

The language also surfaced the fork the numbers hid: creatine demand skews 40+; the 40+ cohort is quitting coffee ("café no tomo", i.e. "I don't drink coffee"; matcha "sin tembladera ni bajón", i.e. "without jitters or a crash"); Rafa reaches sub-40. Substance, format and channel point three ways. And one external check remains: Dan's ES search cross-check, which is the only clean growth read, since the velocity slopes are untrustworthy (sources only switched on mid-2025).

10 What's settled, and the one decision left · cohort × format

Most of the product is settled by evidence; one strategic fork is deliberately left to the founders. Settled: creatine + protein is the buyable substance (reassurance-led demand); stress is the emotional wrapper, not a SKU; brain fog / "claridad mental" (mental clarity) is the through-line benefit; single SKU, not a range; perimenopause discounted; the deck's creatine-out decision reversed (false premise).

The open decision — not ours to make

Creatine demand skews 40+; coffee is something the 40+ cohort is actively quitting; Rafa reaches sub-40. Choosing the cohort resolves format, positioning and channel at once. There are three coherent paths: A sub-40 / coffee / energy-clarity; B 40+ / non-coffee / transition; C format-flex. The Strategic Report §07 lays out all three with evidence and the five questions that decide it. We inform the choice; we don't make it.

Full reasoning is in the Strategic Report v2; the live presentation version is in the Dashboard v2. All three documents are consistent on what's settled and on leaving §07 open.

11 Next steps

  1. Close the three open reads (§09) + Dan's search cross-check.
  2. Lock the product call: single SKU vs. straddle, hero substance, hero emotion, format.
  3. Build the strategy report and the dashboard (GLP-1 shape) on the settled call — deliberately held until the call is locked, so the two artifacts never contradict each other.
  4. Re-open the deck's creatine-out decision with the founders, given Reversal 1.
  5. Run-2 hygiene: a clean perimenopause read without peri-creator over-weighting; more Spain-targeted TikTok; bump the Apify plan off the $29 Starter cap; add the DNS A record for the dashboard.

12 Lessons (carry into the next category)

  • Anchor-seeding inflates whatever the anchors talk about. Seeding creator comments to get Spanish volume imports those creators' agendas. Always debias by anchor flag. We caught creatine; we nearly trusted an inflated peri number. This is the new general lesson.
  • Reddit's maxItems caps posts, not comments. "Modest Reddit" needs a comment cap or fetchPostComments turns a reference into the whole corpus.
  • Instagram hashtag-search is unreliable; comment-scraping known anchors is the high-yield Spanish lever.
  • LatAm-ES is a real reference tier. A better Spain proxy than English — same language, closer culture.
  • Honour the gate when the answer is the one you wanted. We were just under 25% with a favourable ranking — exactly when a rule becomes a decoration if you wave it through. We didn't.
  • A scrape ranking is a starting point, not a verdict. The two findings that mattered most (creatine debias, weak cluster) came from interrogating the ranking, not reading it.

13 Infrastructure & cost

| Item | Detail |
|---|---|
| Database | nm_cora_datalake · in pg_backup.sh (both loops) · test dump verified |
| Service | cora-listening · port 8085 · cora-listening.nextmomentum.io (DNS A record pending) |
| Read-only role | cora_readonly · proven SELECT-only |
| Corpus | 24,450 raw / 13,025 triaged · ~12 months dense coverage |
| Analysis exports | /opt/backups/cora-analysis/ · 25 files |
| Spend to date | ≈ $35 total (Apify ~$14 + Anthropic ~$21) across two runs |
| Watch | Apify Starter $29/mo cap, hit mid-run; bump before next scrape |