Evidence-Graded Audience Research: Units Instead of Avatars

TL;DR: Most audience research is persona theater: demographic avatars assembled from imagination that look rigorous and test poorly. The alternative: classify every input by who can actually know it (ask-only / file-preferred / researchable), grade every claim as provided | researched | hypothesis with a source, build audiences from verbatim customer language, and output a unit table of [audience × job-to-be-done × angle] rows with stable IDs and transparent priority sub-scores instead of avatar documents. The unit, not the persona, is what creative production and feedback loops can actually consume.

The problem: persona theater

The traditional deliverable (“Marketing Mary, 34, drives a Volvo, loves yoga”) has a published critique trail going back two decades. Chapman & Milham (2006, the canonical academic paper) argued that personas can’t be verified or falsified and that it’s impossible to know how many real customers a given persona represents. Adele Revella (Buyer Persona Institute) calls demographic-first personas “fictional avatars of dubious merit”; an Audiense survey she amplifies found 77% of marketers never refer to their buyer personas after creating them (a vendor survey, so treat the figure with caution, but the direction is damning).

The AI era made this worse, not better. LLMs generate plausible-sounding audience research instantly — and the synthetic-users literature (NN/g 2023 and Aug 2025; IDEO Jan 2025) converges on the failure mode: AI-generated users are designed to please, validate concepts real users would kill, average away marginal segments, and answer confidently with no traceability. NN/g’s 2025 data carries the constructive version: simulations grounded in real interview data beat demographic-prompted ones (interview-grounded twins >80% accuracy vs persona-based models). The lesson isn’t “don’t use AI for research” — it’s that ungraded research, human or AI, is indistinguishable from fabrication.

Discipline 1: classify inputs by who can know them

Before researching anything, classify every brief field into three source classes:

Class	Definition	If missing
ASK-ONLY	Only the client knows it (margins, CAC targets, compliance history, asset rights, capacity). Researching it = hallucination.	Ask once; a skip becomes a labeled hypothesis with a recorded consequence
FILE-PREFERRED	Should exist as a document (brand kit, past test results)	One ask, then fallback
RESEARCHABLE	Public (competitor angles, customer language, pricing norms). Asking the client is lazy.	Fetch it live, with access dates
The questions go to the client in one batch, each carrying its skip-consequence (“unit economics not provided → economics scoring will be hypothesis-graded”). Skips are legal and recorded — the point isn’t to force answers, it’s to make the cost of every gap visible instead of silently inventing the missing value. For health and regulated products, the highest-stakes ask-only field is compliance constraints: unverified compliance is the #1 rework cause downstream.
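
The classification and the one-batch ask can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the names `SourceClass`, `BriefField`, `batch_questions`, and `resolve_skip` are hypothetical, and the brand-kit skip consequence is an invented placeholder.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceClass(Enum):
    ASK_ONLY = "ask-only"
    FILE_PREFERRED = "file-preferred"
    RESEARCHABLE = "researchable"


@dataclass
class BriefField:
    name: str
    source_class: SourceClass
    skip_consequence: str          # shown to the client next to the question
    value: Optional[str] = None    # None means not yet provided


def batch_questions(fields):
    """Build one batch of client questions; researchable fields are never asked."""
    return [
        f"{f.name}? (if skipped: {f.skip_consequence})"
        for f in fields
        if f.value is None and f.source_class is not SourceClass.RESEARCHABLE
    ]


def resolve_skip(brief_field):
    """Record a skipped field as a labeled hypothesis instead of inventing a value."""
    return {
        "field": brief_field.name,
        "grade": "hypothesis",
        "consequence": brief_field.skip_consequence,
    }


fields = [
    BriefField("unit economics", SourceClass.ASK_ONLY,
               "economics scoring will be hypothesis-graded"),
    # Placeholder consequence, invented for illustration only.
    BriefField("brand kit", SourceClass.FILE_PREFERRED,
               "visual direction falls back to public assets"),
    BriefField("competitor angles", SourceClass.RESEARCHABLE,
               "not asked; fetched live"),
]

questions = batch_questions(fields)   # two questions, sent together
skipped = resolve_skip(fields[0])     # the client skipped unit economics
```

The design point is that a skip produces a record rather than a value: the gap stays visible to everything downstream.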

Discipline 2: grade every claim

Every pain, desire, audience, job, and angle carries an evidence grade plus a source reference:

provided — from client inputs/answers
researched — fetched live this run (web, ad libraries, review platforms), with access date
hypothesis — no source; explicitly labeled (“likely competitors include X — verify”)

Supporting rules: triangulate load-bearing claims (single-source claims that shape an audience get flagged); model training knowledge may generate search queries but nothing survives into the output unless fetched this run; verbatim quotes are stored with source links because paraphrase destroys the value — the customer’s own words are the asset.
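
One way to make these rules checkable is to attach them to the claim record itself. The sketch below is an assumption about how that could look: `Claim`, `Source`, and `problems` are hypothetical names, the URL and quote are placeholders, and "fewer than two sources" is one possible reading of the triangulation rule.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

GRADES = ("provided", "researched", "hypothesis")


@dataclass
class Source:
    ref: str                               # URL or client document reference
    accessed: Optional[date] = None        # required for researched claims
    verbatim_quote: Optional[str] = None   # stored word for word, never paraphrased


@dataclass
class Claim:
    text: str
    grade: str
    sources: List[Source] = field(default_factory=list)
    load_bearing: bool = False             # True if the claim shapes an audience

    def problems(self):
        """Return the grading rules this claim currently violates."""
        issues = []
        if self.grade not in GRADES:
            issues.append("unknown grade")
        if self.grade != "hypothesis" and not self.sources:
            issues.append("graded claim has no source")
        if self.grade == "researched" and any(s.accessed is None for s in self.sources):
            issues.append("researched source missing access date")
        if self.load_bearing and len(self.sources) < 2:
            issues.append("load-bearing claim needs triangulation")
        return issues


# Placeholder data: the URL and quote are illustrative, not real findings.
pain = Claim(
    text="Customers distrust one-size-fits-all routines",
    grade="researched",
    sources=[Source("https://example.com/review/123",
                    accessed=date(2025, 1, 15),
                    verbatim_quote="placeholder quote text")],
    load_bearing=True,
)
guess = Claim("likely competitors include X (verify)", "hypothesis")
```

Here `pain.problems()` flags the single-source claim for triangulation, while `guess.problems()` is empty because an explicitly labeled hypothesis needs no source.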

This imports a discipline that is long established elsewhere: intelligence work grades source reliability separately from item credibility (the NATO/Admiralty system; ICD 203 requires analysts to state confidence based on source quality), and medicine grades evidence certainty (GRADE). Marketing research, as far as we can find, has no published equivalent, which is why ungraded decks all look equally confident.

Discipline 3: build from verbatim language, not imagination

Audiences and hooks get assembled from harvested customer language — review mining and community mining (tools/reddit-thread-analyzer territory), with every quote kept verbatim plus source link. This is established practitioner doctrine (Copyhackers’ review-mining method, 2014): the canonical case mined 500+ Amazon reviews to produce the headline “If you think you need rehab, you do.” — +400% CTA clicks over the agency’s polished control (a single self-reported A/B test, but the method has a decade of practitioner consensus behind it).

Verbatim language does two jobs at once: it grounds the research (a pain backed by a 57-point Reddit comment with a permalink is researched, not hypothesis) and it pre-writes the creative (hooks adapted from real phrasing carry the customer’s emotional shape into the ad).

The output: a unit table, not an avatar deck

The deliverable is a table of units — [TA × JTBD × Angle + content type] — where:

TA (target audience) carries an awareness level per glossary/awareness-levels, not a demographic sketch
JTBD carries a core promise plus proof requirements — what the creative must demonstrate for the promise to be believed (the targeting unit is the job, per Christensen: “the fact that you’re 18 to 35 with a college degree does not cause you to buy”)
Angle carries its mechanism, sophistication move, persuasion lever (glossary/persuasion-principles), and a saturation classification from the competitive sweep
Stable IDs (TA2.J1.A1) — frozen after sign-off, append-only afterward, so downstream production and feedback can reference units unambiguously
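
The ID scheme above can be sketched as a small registry. This is an illustration under assumptions: `UnitRegistry` and `parse_unit_id` are hypothetical names, and "frozen after sign-off, append-only afterward" is read here as "existing rows become immutable, new rows may still be added".

```python
import re

# Matches IDs shaped like TA2.J1.A1 (audience, job, angle).
UNIT_ID = re.compile(r"^TA(\d+)\.J(\d+)\.A(\d+)$")


def parse_unit_id(unit_id):
    """Split a unit ID into (audience, job, angle) numbers."""
    m = UNIT_ID.match(unit_id)
    if m is None:
        raise ValueError(f"not a unit ID: {unit_id!r}")
    ta, job, angle = (int(g) for g in m.groups())
    return ta, job, angle


class UnitRegistry:
    """Unit table whose existing rows become immutable after sign-off."""

    def __init__(self):
        self._units = {}
        self._frozen = False

    def add(self, unit_id, row):
        parse_unit_id(unit_id)  # validates the ID shape
        if unit_id in self._units:
            raise KeyError(f"{unit_id} already exists; IDs are never reused")
        self._units[unit_id] = dict(row)

    def update(self, unit_id, row):
        if self._frozen:
            raise PermissionError("units are frozen after sign-off; append a new unit")
        if unit_id not in self._units:
            raise KeyError(unit_id)
        self._units[unit_id] = dict(row)

    def sign_off(self):
        self._frozen = True

    def ids(self):
        return list(self._units)


registry = UnitRegistry()
registry.add("TA2.J1.A1", {"content_type": "video"})
registry.sign_off()
registry.add("TA2.J1.A2", {"content_type": "static"})  # appending is still allowed
```

After `sign_off()`, calling `registry.update("TA2.J1.A1", ...)` raises `PermissionError`, so a reference to `TA2.J1.A1` in an ad name or a feedback report always means the same row.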

Priority is scored with transparent sub-scores: evidence (0–3) + whitespace (0–3) + feasibility (0–2) + economics (0–2), summed to a 0–10 total rather than blended opaquely. A hypothesis-graded angle can’t quietly outrank a researched one, because the breakdown shows why each unit ranks where it does. The ranking is a testing queue, not a production order (marketing/discovery-before-scale logic).

Tournaments with visible rejects

Jobs and angles are generated in slates (3–6 candidates per slot), then culled to 1–3 — with rejects listed per slot alongside the specific test that killed them (duplicate proof requirements, no evidence, mechanism mismatch, awareness mismatch, feasibility). Visible rejects are half the value: the client sees what was considered, and a candidate that loses in one slot can legitimately win in another.

One counter-bias rule is load-bearing: always include one saturated-but-winnable candidate per slot. Generating only from the whitespace map optimizes differentiation but silently skips crowded angles where the client holds structurally better proof. The five checks decide, not the map alone.
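
The slate rules can be expressed as two small checks. This is a hedged sketch: the candidate dictionaries, the `saturation` and `killed_by` keys, and the function names are hypothetical, and the five checks themselves are assumed to be judged upstream by an analyst, with only their outcome recorded here.

```python
# The five culling checks named in the text.
CHECKS = (
    "duplicate proof requirements",
    "no evidence",
    "mechanism mismatch",
    "awareness mismatch",
    "feasibility",
)


def validate_slate(slate):
    """A slate holds 3-6 candidates and at least one saturated-but-winnable one."""
    if not 3 <= len(slate) <= 6:
        raise ValueError("a slate needs 3-6 candidates")
    if not any(c["saturation"] == "saturated" for c in slate):
        raise ValueError("slate needs one saturated-but-winnable candidate")


def cull(slate):
    """Split a slate into survivors and rejects, keeping the reason for each reject."""
    validate_slate(slate)
    survivors = [c["name"] for c in slate if c["killed_by"] is None]
    rejects = {c["name"]: c["killed_by"] for c in slate if c["killed_by"] is not None}
    for name, reason in rejects.items():
        if reason not in CHECKS:
            raise ValueError(f"{name}: unknown check {reason!r}")
    if not 1 <= len(survivors) <= 3:
        raise ValueError("cull each slot to 1-3 survivors")
    return survivors, rejects


# Invented candidates for one slot.
slate = [
    {"name": "angle-a", "saturation": "whitespace", "killed_by": None},
    {"name": "angle-b", "saturation": "whitespace", "killed_by": "no evidence"},
    {"name": "angle-c", "saturation": "saturated", "killed_by": None},
    {"name": "angle-d", "saturation": "whitespace", "killed_by": "awareness mismatch"},
]
survivors, rejects = cull(slate)
```

Returning the rejects with their reasons, rather than dropping them, is what lets the client see what was considered and lets a losing candidate be re-entered in another slot.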

Human gates at the points of irreversibility

Three hard stops: research scope approval (before heavy fetching), audience confirmation (a wrong TA poisons every downstream artifact), and final sign-off (after which IDs freeze). Everything between gates can be machine-fast; the gates are where judgment is non-delegable.

Feedback merges back at the right level

Because ad names encode unit IDs, performance results re-enter the research with three-level failure attribution: hook-level (creative execution failed; the unit may still be alive), angle-level (the angle is falsified), TA-level (multiple units of one audience failed the same way → revisit the audience definition). Without the levels, one bad video kills a good angle — or a dead angle survives behind a lucky hook.
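
A sketch of the three-level attribution, under stated assumptions: the ad-name format `TA2.J1.A1_hook03`, the pass/fail input, the `min_hooks` threshold, and the rule "two or more dead units in one audience means TA-level" are all illustrative choices. The text's "failed the same way" condition is not modeled here.

```python
import re
from collections import defaultdict

# Hypothetical ad-name convention: <unit ID>_<hook label>, e.g. TA2.J1.A1_hook03
AD_NAME = re.compile(r"^(TA\d+)\.(J\d+)\.(A\d+)_(\w+)$")


def attribute_failures(results, min_hooks=2):
    """Map each unit ID to 'alive', 'hook-level', 'angle-level' or 'TA-level'.

    results: dict of ad name -> True if the ad met its target.
    """
    by_unit = defaultdict(list)
    for ad_name, passed in results.items():
        m = AD_NAME.match(ad_name)
        if m is None:
            raise ValueError(f"ad name does not encode a unit ID: {ad_name!r}")
        ta, job, angle, _hook = m.groups()
        by_unit[(ta, f"{ta}.{job}.{angle}")].append(passed)

    verdicts = {}
    dead_units_per_ta = defaultdict(int)
    for (ta, unit_id), outcomes in by_unit.items():
        failed = outcomes.count(False)
        if failed == 0:
            verdicts[unit_id] = "alive"
        elif any(outcomes) or failed < min_hooks:
            # Some hook worked, or too few hooks were tested to judge the angle.
            verdicts[unit_id] = "hook-level"
        else:
            verdicts[unit_id] = "angle-level"
            dead_units_per_ta[ta] += 1

    # Several dead units in one audience point at the audience definition.
    for ta, unit_id in by_unit:
        if verdicts[unit_id] == "angle-level" and dead_units_per_ta[ta] >= 2:
            verdicts[unit_id] = "TA-level"
    return verdicts


# Invented results for illustration.
results = {
    "TA1.J1.A1_h1": False, "TA1.J1.A1_h2": True,
    "TA2.J1.A1_h1": False, "TA2.J1.A1_h2": False,
    "TA2.J2.A1_h1": False, "TA2.J2.A1_h2": False,
    "TA3.J1.A1_h1": False, "TA3.J1.A1_h2": False,
}
verdicts = attribute_failures(results)
```

In this example `TA1.J1.A1` is a hook-level failure (one hook worked), `TA3.J1.A1` is angle-level, and both `TA2` units escalate to TA-level. The `min_hooks` guard keeps a single failed video from falsifying an angle on its own.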

Honest limits

The methodology has one full engagement behind it (a DTC genomics-skincare brand, US market — 5 TAs, 11 signed-off units) plus the published prior art above. The components are well-anchored; the assembled pipeline is N=1.
No independent published evidence exists that JTBD targeting outperforms demographic targeting in paid media specifically (Strategyn’s 86%-vs-17% innovation success figure is vendor-claimed). The case rests on the persona critique + the verbatim-language evidence + mechanism.
Evidence grading in marketing research appears to be genuinely uncodified territory — we’re importing intelligence-community discipline by analogy, not citing a marketing study.

Key Takeaways

Classify inputs by who can know them: ask-only (client), file-preferred, researchable — researching an ask-only field is hallucination with extra steps.
Grade every claim as provided | researched | hypothesis, with a source; skips become labeled hypotheses, never silent assumptions.
Build audiences and hooks from verbatim customer language — it grounds the research and pre-writes the creative in one move.
Output units ([TA × JTBD × angle], stable IDs, transparent sub-scores), not avatar decks — units are what production and feedback loops can consume.
Run tournaments with visible rejects, keep one saturated-but-winnable candidate per slot, and gate the irreversible steps with humans.

Related

competitor-analysis/dna-beauty-paid-social-whitespace — a public market note produced by this methodology’s competitive sweep
marketing/prescriptive-production-briefs — the downstream consumer: how a signed-off unit becomes a production package
automation/staged-compiler-pattern — the architecture this methodology runs on (staged compiler with JSON contracts and gates)
glossary/awareness-levels — Schwartz’s framework; every TA carries an awareness level
glossary/persuasion-principles — every angle names its Cialdini lever
marketing/discovery-before-scale — why the unit ranking is a testing queue, not a production order
glossary/honest-assessment — the same calibration discipline, applied to research claims instead of product claims
glossary/hallucination — the failure mode evidence grading exists to prevent
tools/reddit-thread-analyzer — the community-mining capability behind verbatim harvesting
tools/target-audience-research — the Primores skill implementing this methodology end-to-end (tool review)

Sources

Chapman & Milham 2006 — The Personas’ New Clothes — the canonical academic persona critique
Audiense survey via Adele Revella — 77% of marketers never refer to their personas — vendor survey, labeled accordingly
Copyhackers — review mining (2014) — the verbatim-language method + the Beachway case
HBS Working Knowledge — Clay Christensen’s Milkshake Marketing (2011) — JTBD against demographic segmentation
ICD 203 — Analytic Standards (ODNI) + the NATO/Admiralty grading system — the evidence-grading prior art
NN/g — Evaluating AI-Simulated Behavior (Aug 2025) — interview-grounded beats demographic-prompted simulation
IDEO — The Case Against AI-Generated Users (Jan 2025) — “the solution isn’t to make up fake people”
Engagement artifacts, June 2026 (internal; client anonymized) — the full pipeline run this page codifies
