Evidence-Graded Audience Research: Units Instead of Avatars
TL;DR: Most audience research is persona theater: demographic avatars assembled from imagination that look rigorous and test poorly. The alternative: classify every input by who can actually know it (ask-only / file-preferred / researchable), grade every claim `provided | researched | hypothesis` with a source, build audiences from verbatim customer language, and output a unit table of [audience × job-to-be-done × angle] rows with stable IDs and transparent priority sub-scores instead of avatar documents. The unit, not the persona, is what creative production and feedback loops can actually consume.
The problem: persona theater
The traditional deliverable — “Marketing Mary, 34, drives a Volvo, loves yoga” — has a published critique trail going back two decades. Chapman & Milham (2006, the canonical academic paper) showed personas can’t be verified or falsified and that it’s impossible to know how many real customers one represents. Adele Revella (Buyer Persona Institute) calls demographic-first personas “fictional avatars of dubious merit”; an Audiense survey she amplifies found 77% of marketers never refer to their buyer personas after creating them (vendor survey — label it as such, but the direction is damning).
The AI era made this worse, not better. LLMs generate plausible-sounding audience research instantly — and the synthetic-users literature (NN/g 2023 and Aug 2025; IDEO Jan 2025) converges on the failure mode: AI-generated users are designed to please, validate concepts real users would kill, average away marginal segments, and answer confidently with no traceability. NN/g’s 2025 data carries the constructive version: simulations grounded in real interview data beat demographic-prompted ones (interview-grounded twins >80% accuracy vs persona-based models). The lesson isn’t “don’t use AI for research” — it’s that ungraded research, human or AI, is indistinguishable from fabrication.
Discipline 1: classify inputs by who can know them
Before researching anything, classify every brief field into three source classes:
| Class | Definition | If missing |
|---|---|---|
| ASK-ONLY | Only the client knows it (margins, CAC targets, compliance history, asset rights, capacity). Researching it = hallucination. | Ask once; a skip becomes a labeled hypothesis with a recorded consequence |
| FILE-PREFERRED | Should exist as a document (brand kit, past test results) | One ask, then fallback |
| RESEARCHABLE | Public (competitor angles, customer language, pricing norms). Asking the client is lazy. | Fetch it live, with access dates |

The questions go to the client in one batch, each carrying its skip-consequence (“unit economics not provided → economics scoring will be hypothesis-graded”). Skips are legal and recorded: the point isn’t to force answers, it’s to make the cost of every gap visible instead of silently inventing the missing value. For health and regulated products, the highest-stakes ask-only field is compliance constraints: unverified compliance is the #1 rework cause downstream.
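The three source classes and the single question batch can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual schema: `BriefField`, `question_batch`, `record_skips`, and the file name `brand-kit.pdf` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceClass(Enum):
    ASK_ONLY = "ask-only"
    FILE_PREFERRED = "file-preferred"
    RESEARCHABLE = "researchable"


@dataclass
class BriefField:
    name: str
    source_class: SourceClass
    value: Optional[str] = None   # None means the brief left it empty
    skip_consequence: str = ""    # what degrades downstream if the client skips


def question_batch(fields):
    """One batch of client questions: every empty field that research cannot fill."""
    return [
        f"{f.name}? (if skipped: {f.skip_consequence})"
        for f in fields
        if f.value is None and f.source_class is not SourceClass.RESEARCHABLE
    ]


def record_skips(fields, skipped_names):
    """Turn each skipped field into a labeled hypothesis instead of an invented value."""
    return [
        {"field": f.name, "grade": "hypothesis", "consequence": f.skip_consequence}
        for f in fields
        if f.name in skipped_names
    ]


# Hypothetical brief: one ask-only gap, one file already supplied, one researchable gap.
brief = [
    BriefField("unit economics", SourceClass.ASK_ONLY,
               skip_consequence="economics scoring will be hypothesis-graded"),
    BriefField("brand kit", SourceClass.FILE_PREFERRED, value="brand-kit.pdf"),
    BriefField("competitor angles", SourceClass.RESEARCHABLE),
]
questions = question_batch(brief)          # only "unit economics" is asked
skips = record_skips(brief, {"unit economics"})
```

Researchable fields never enter the batch, which is the point of the classification: the client is asked only what the client alone can know, and a skip leaves a recorded consequence rather than a silent guess.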
Discipline 2: grade every claim
Every pain, desire, audience, job, and angle carries an evidence grade plus a source reference:
- `provided`: from client inputs/answers
- `researched`: fetched live this run (web, ad libraries, review platforms), with access date
- `hypothesis`: no source; explicitly labeled (“likely competitors include X, verify”)
Supporting rules: triangulate load-bearing claims (single-source claims that shape an audience get flagged); model training knowledge may generate search queries but nothing survives into the output unless fetched this run; verbatim quotes are stored with source links because paraphrase destroys the value — the customer’s own words are the asset.
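One way to make the grading rules checkable is to attach the grade and sources to each claim and validate them mechanically. The sketch below is an assumption about how that could look: the `Claim` and `Source` shapes, the `load_bearing` flag, and the "fewer than two distinct sources" threshold for triangulation are all illustrative choices, not the methodology's published schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Grade(Enum):
    PROVIDED = "provided"
    RESEARCHED = "researched"
    HYPOTHESIS = "hypothesis"


@dataclass
class Source:
    url: str
    accessed: str    # ISO date the page was fetched this run
    quote: str = ""  # stored verbatim, never paraphrased


@dataclass
class Claim:
    text: str
    grade: Grade
    sources: list = field(default_factory=list)
    load_bearing: bool = False   # True when the claim shapes an audience


def validate(claim):
    """Return a list of problems; an empty list means the claim passes."""
    problems = []
    if claim.grade is Grade.RESEARCHED:
        if not claim.sources:
            problems.append("researched claim has no source")
        elif any(not s.accessed for s in claim.sources):
            problems.append("researched source is missing its access date")
    # Assumed triangulation rule: load-bearing claims need two distinct sources.
    if claim.load_bearing and len({s.url for s in claim.sources}) < 2:
        problems.append("load-bearing claim has fewer than two distinct sources")
    return problems


# Placeholder claim with a single source, to show the triangulation flag.
single_source = Claim(
    text="example pain statement",
    grade=Grade.RESEARCHED,
    sources=[Source("https://example.com/thread/1", "2025-01-01", "example quote")],
    load_bearing=True,
)
flags = validate(single_source)
```

Here `flags` contains the triangulation warning only, because the claim has a dated source but just one of them.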
This imports a discipline that’s a century old elsewhere: intelligence work grades source reliability separately from item credibility (the NATO/Admiralty system; ICD 203 requires analysts to state confidence based on source quality), and medicine grades evidence certainty (GRADE). Marketing research, as far as we can find, has no published equivalent — which is why ungraded decks all look equally confident.
Discipline 3: build from verbatim language, not imagination
Audiences and hooks get assembled from harvested customer language — review mining and community mining (tools/reddit-thread-analyzer territory), with every quote kept verbatim plus source link. This is established practitioner doctrine (Copyhackers’ review-mining method, 2014): the canonical case mined 500+ Amazon reviews to produce the headline “If you think you need rehab, you do.” — +400% CTA clicks over the agency’s polished control (a single self-reported A/B test, but the method has a decade of practitioner consensus behind it).
Verbatim language does two jobs at once: it grounds the research (a pain backed by a 57-point Reddit comment with a permalink is researched, not hypothesis) and it pre-writes the creative (hooks adapted from real phrasing carry the customer’s emotional shape into the ad).
The output: a unit table, not an avatar deck
The deliverable is a table of units — [TA × JTBD × Angle + content type] — where:
- TA (target audience) carries an awareness level per glossary/awareness-levels, not a demographic sketch
- JTBD carries a core promise plus proof requirements — what the creative must demonstrate for the promise to be believed (the targeting unit is the job, per Christensen: “the fact that you’re 18 to 35 with a college degree does not cause you to buy”)
- Angle carries its mechanism, sophistication move, persuasion lever (glossary/persuasion-principles), and a saturation classification from the competitive sweep
- Stable IDs (`TA2.J1.A1`): frozen after sign-off, append-only afterward, so downstream production and feedback can reference units unambiguously
Priority is scored with transparent sub-scores: evidence (0–3) + whitespace (0–3) + feasibility (0–2) + economics (0–2), summed to a 0–10 total rather than blended opaquely. A hypothesis-graded angle can’t quietly outrank a researched one; the scoring shows why each unit ranks where it does. The ranking is a testing queue, not a production order (marketing/discovery-before-scale logic).
Tournaments with visible rejects
Jobs and angles are generated in slates (3–6 candidates per slot), then culled to 1–3 — with rejects listed per slot alongside the specific test that killed them (duplicate proof requirements, no evidence, mechanism mismatch, awareness mismatch, feasibility). Visible rejects are half the value: the client sees what was considered, and a candidate that loses in one slot can legitimately win in another.
One counter-bias rule is load-bearing: always include one saturated-but-winnable candidate per slot. Generating only from the whitespace map optimizes differentiation but silently skips crowded angles where the client holds structurally better proof. The five checks decide, not the map alone.
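The slate-and-cull step, including the counter-bias rule, could be sketched as follows. Everything here is illustrative: the `Candidate` shape, the per-candidate `score`, and the "outscored within slot" reason for candidates that pass every check but exceed the cap are assumptions made for the example, not part of the described method.

```python
from dataclasses import dataclass

# The five kill tests named in the text.
CHECKS = (
    "duplicate proof requirements",
    "no evidence",
    "mechanism mismatch",
    "awareness mismatch",
    "feasibility",
)


@dataclass
class Candidate:
    name: str
    failed_checks: tuple = ()            # subset of CHECKS, recorded by the reviewer
    saturated_but_winnable: bool = False
    score: int = 0                       # hypothetical tiebreak among survivors


def cull(slate, keep=3):
    """Return (survivors, rejects); each reject carries the test that killed it."""
    if not 3 <= len(slate) <= 6:
        raise ValueError("a slate holds 3 to 6 candidates")
    if not any(c.saturated_but_winnable for c in slate):
        raise ValueError("slate needs one saturated-but-winnable candidate")
    passed, rejects = [], []
    for c in slate:
        unknown = [name for name in c.failed_checks if name not in CHECKS]
        if unknown:
            raise ValueError(f"unknown check(s): {unknown}")
        if c.failed_checks:
            rejects.append((c.name, c.failed_checks[0]))
        else:
            passed.append(c)
    passed.sort(key=lambda c: -c.score)
    # Assumed rule: candidates beyond the cap are rejected with a visible reason.
    rejects += [(c.name, "outscored within slot") for c in passed[keep:]]
    return passed[:keep], rejects


slate = [
    Candidate("A", score=5),
    Candidate("B", failed_checks=("no evidence",)),
    Candidate("C", saturated_but_winnable=True, score=7),
    Candidate("D", score=6),
    Candidate("E", score=4),
]
survivors, rejects = cull(slate)
```

Raising an error when the saturated-but-winnable candidate is missing makes the counter-bias rule a hard precondition of the cull instead of a convention someone has to remember.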
Human gates at the points of irreversibility
Three hard stops: research scope approval (before heavy fetching), audience confirmation (a wrong TA poisons every downstream artifact), and final sign-off (after which IDs freeze). Everything between gates can be machine-fast; the gates are where judgment is non-delegable.
Feedback merges back at the right level
Because ad names encode unit IDs, performance results re-enter the research with three-level failure attribution: hook-level (creative execution failed; the unit may still be alive), angle-level (the angle is falsified), TA-level (multiple units of one audience failed the same way → revisit the audience definition). Without the levels, one bad video kills a good angle — or a dead angle survives behind a lucky hook.
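A minimal sketch of the three-level attribution follows, under loudly stated assumptions: ad names are assumed to embed the unit ID as in `TA2.J1.A1_hook1` (a hypothetical naming convention), a result is reduced to a won/lost boolean, an angle is only declared falsified once at least `min_hooks` ads for it have all lost, and "failed the same way" is simplified to "two or more falsified units under one TA". A real pipeline would need richer failure signals than this.

```python
import re
from collections import defaultdict

# Unit IDs follow the TA<n>.J<n>.A<n> pattern used in the unit table.
UNIT_ID = re.compile(r"TA\d+\.J\d+\.A\d+")


def attribute(results, min_hooks=2):
    """Map each failing unit (or TA) to a failure level.

    results: iterable of (ad_name, won) pairs.
    """
    outcomes_by_unit = defaultdict(list)
    for ad_name, won in results:
        match = UNIT_ID.search(ad_name)
        if match is None:
            continue  # no unit ID in the name, so the result cannot be attributed
        outcomes_by_unit[match.group(0)].append(won)

    verdicts = {}
    dead_units_by_ta = defaultdict(list)
    for unit_id, outcomes in outcomes_by_unit.items():
        if all(outcomes):
            continue  # nothing failed for this unit
        if any(outcomes) or len(outcomes) < min_hooks:
            # Some hook worked, or too few hooks were tried: blame the execution.
            verdicts[unit_id] = "hook-level"
        else:
            verdicts[unit_id] = "angle-level"
            dead_units_by_ta[unit_id.split(".")[0]].append(unit_id)

    for ta, dead in dead_units_by_ta.items():
        if len(dead) >= 2:
            verdicts[ta] = "TA-level"  # revisit the audience definition
    return verdicts


results = [
    ("TA1.J1.A1_hookA", False), ("TA1.J1.A1_hookB", True),
    ("TA2.J1.A1_hook1", False), ("TA2.J1.A1_hook2", False),
    ("TA2.J2.A1_hook1", False), ("TA2.J2.A1_hook2", False),
    ("TA3.J1.A1_hook1", False),
]
verdicts = attribute(results)
```

With these inputs, `TA3.J1.A1` stays at hook level because a single failed ad is not enough to falsify the angle, while the two dead units under `TA2` escalate to a TA-level review.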
Honest limits
- The methodology has one full engagement behind it (a DTC genomics-skincare brand, US market — 5 TAs, 11 signed-off units) plus the published prior art above. The components are well-anchored; the assembled pipeline is N=1.
- No independent published evidence exists that JTBD targeting outperforms demographic targeting in paid media specifically (Strategyn’s 86%-vs-17% innovation success figure is vendor-claimed). The case rests on the persona critique + the verbatim-language evidence + mechanism.
- Evidence grading in marketing research appears to be genuinely uncodified territory — we’re importing intelligence-community discipline by analogy, not citing a marketing study.
Key Takeaways
- Classify inputs by who can know them: ask-only (client), file-preferred, researchable — researching an ask-only field is hallucination with extra steps.
- Grade every claim `provided | researched | hypothesis` with a source; skips become labeled hypotheses, never silent assumptions.
- Build audiences and hooks from verbatim customer language: it grounds the research and pre-writes the creative in one move.
- Output units ([TA × JTBD × angle], stable IDs, transparent sub-scores), not avatar decks — units are what production and feedback loops can consume.
- Run tournaments with visible rejects, keep one saturated-but-winnable candidate per slot, and gate the irreversible steps with humans.
Related
- competitor-analysis/dna-beauty-paid-social-whitespace — a public market note produced by this methodology’s competitive sweep
- marketing/prescriptive-production-briefs — the downstream consumer: how a signed-off unit becomes a production package
- automation/staged-compiler-pattern — the architecture this methodology runs on (staged compiler with JSON contracts and gates)
- glossary/awareness-levels — Schwartz’s framework; every TA carries an awareness level
- glossary/persuasion-principles — every angle names its Cialdini lever
- marketing/discovery-before-scale — why the unit ranking is a testing queue, not a production order
- glossary/honest-assessment — the same calibration discipline, applied to research claims instead of product claims
- glossary/hallucination — the failure mode evidence grading exists to prevent
- tools/reddit-thread-analyzer — the community-mining capability behind verbatim harvesting
- tools/target-audience-research — the Primores skill implementing this methodology end-to-end (tool review)
Sources
- Chapman & Milham 2006 — The Personas’ New Clothes — the canonical academic persona critique
- Audiense survey via Adele Revella — 77% of marketers never refer to their personas — vendor survey, labeled accordingly
- Copyhackers — review mining (2014) — the verbatim-language method + the Beachway case
- HBS Working Knowledge — Clay Christensen’s Milkshake Marketing (2011) — JTBD against demographic segmentation
- ICD 203 — Analytic Standards (ODNI) + the NATO/Admiralty grading system — the evidence-grading prior art
- NN/g — Evaluating AI-Simulated Behavior (Aug 2025) — interview-grounded beats demographic-prompted simulation
- IDEO — The Case Against AI-Generated Users (Jan 2025) — “the solution isn’t to make up fake people”
- Engagement artifacts, June 2026 (internal; client anonymized) — the full pipeline run this page codifies