Evidence-Graded Audience Research: Units Instead of Avatars

TL;DR: Most audience research is persona theater — demographic avatars assembled from imagination that look rigorous and test poorly. The alternative: classify every input by who can actually know it (ask-only / file-preferred / researchable), grade every claim provided | researched | hypothesis with a source, build audiences from verbatim customer language, and output a unit table — [audience × job-to-be-done × angle] rows with stable IDs and transparent priority sub-scores — instead of avatar documents. The unit, not the persona, is what creative production and feedback loops can actually consume.

The problem: persona theater

The traditional deliverable (“Marketing Mary, 34, drives a Volvo, loves yoga”) has a published critique trail going back two decades. Chapman & Milham (2006, the canonical academic paper) showed personas can’t be verified or falsified and that it’s impossible to know how many real customers one represents. Adele Revella (Buyer Persona Institute) calls demographic-first personas “fictional avatars of dubious merit”; an Audiense survey she amplifies found 77% of marketers never refer to their buyer personas after creating them (a vendor survey, so treat the exact figure with caution, but the direction is damning).

The AI era made this worse, not better. LLMs generate plausible-sounding audience research instantly, and the synthetic-users literature (NN/g 2023 and Aug 2025; IDEO Jan 2025) converges on the failure mode: AI-generated users are designed to please, validate concepts real users would kill, average away marginal segments, and answer confidently with no traceability. NN/g’s 2025 data carries the constructive version: simulations grounded in real interview data beat demographic-prompted ones (interview-grounded twins reached >80% accuracy, ahead of persona-based models). The lesson isn’t “don’t use AI for research”; it’s that ungraded research, human or AI, is indistinguishable from fabrication.

Discipline 1: classify inputs by who can know them

Before researching anything, classify every brief field into three source classes:

| Class | Definition | If missing |
| --- | --- | --- |
| ASK-ONLY | Only the client knows it (margins, CAC targets, compliance history, asset rights, capacity). Researching it = hallucination. | Ask once; a skip becomes a labeled hypothesis with a recorded consequence |
| FILE-PREFERRED | Should exist as a document (brand kit, past test results) | One ask, then fallback |
| RESEARCHABLE | Public (competitor angles, customer language, pricing norms). Asking the client is lazy. | Fetch it live, with access dates |

The questions go to the client in one batch, each carrying its skip-consequence (“unit economics not provided → economics scoring will be hypothesis-graded”). Skips are legal and recorded — the point isn’t to force answers, it’s to make the cost of every gap visible instead of silently inventing the missing value. For health and regulated products, the highest-stakes ask-only field is compliance constraints: unverified compliance is the #1 rework cause downstream.
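
The classification and skip handling described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the names `SourceClass`, `BriefField`, `build_question_batch`, and `resolve` are hypothetical, and treating a skipped file-preferred field as a hypothesis is an assumption, since the text only says "one ask, then fallback".

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class SourceClass(Enum):
    ASK_ONLY = "ask-only"
    FILE_PREFERRED = "file-preferred"
    RESEARCHABLE = "researchable"


@dataclass
class BriefField:
    name: str
    source_class: SourceClass
    skip_consequence: str          # shown to the client next to the question
    value: Optional[str] = None    # None means not answered (yet) or skipped


def build_question_batch(fields: List[BriefField]) -> List[str]:
    """Collect every unanswered client-side field into one batch of questions."""
    return [
        f"{f.name}? (if skipped: {f.skip_consequence})"
        for f in fields
        if f.value is None and f.source_class is not SourceClass.RESEARCHABLE
    ]


def resolve(field: BriefField) -> Tuple[str, str]:
    """Return (status, note) for a field once the batch has come back."""
    if field.source_class is SourceClass.RESEARCHABLE:
        # Never ask the client for public information; fetch it instead.
        return ("to-research", "fetch live and record the access date")
    if field.value is not None:
        return ("provided", field.value)
    # A skip is legal, but it is recorded with its cost, never silently filled in.
    # Assumption: file-preferred skips fall back to a hypothesis as well.
    return ("hypothesis", f"skipped: {field.skip_consequence}")
```

The point of the design is that `resolve` has no branch that invents a value: a missing ask-only field can only come out as a labeled hypothesis carrying its consequence.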

Discipline 2: grade every claim

Every pain, desire, audience, job, and angle carries an evidence grade plus a source reference:

  • provided — from client inputs/answers
  • researched — fetched live this run (web, ad libraries, review platforms), with access date
  • hypothesis — no source; explicitly labeled (“likely competitors include X — verify”)

Supporting rules: triangulate load-bearing claims (single-source claims that shape an audience get flagged); model training knowledge may generate search queries but nothing survives into the output unless fetched this run; verbatim quotes are stored with source links because paraphrase destroys the value — the customer’s own words are the asset.

This imports a discipline that’s a century old elsewhere: intelligence work grades source reliability separately from item credibility (the NATO/Admiralty system; ICD 203 requires analysts to state confidence based on source quality), and medicine grades evidence certainty (GRADE). Marketing research, as far as we can find, has no published equivalent — which is why ungraded decks all look equally confident.
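
The grading rules in this section lend themselves to a mechanical audit. The sketch below is one possible shape for it, under stated assumptions: `Source`, `Claim`, and `audit` are hypothetical names, and "triangulated" is simplified to "at least two distinct source references", which is a cruder test than true source independence.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

GRADES = ("provided", "researched", "hypothesis")


@dataclass
class Source:
    ref: str                          # URL, file name, or client-answer ID
    accessed: Optional[date] = None   # required for anything fetched this run


@dataclass
class Claim:
    text: str
    grade: str
    sources: List[Source] = field(default_factory=list)
    load_bearing: bool = False        # True if the claim shapes an audience


def audit(claim: Claim) -> List[str]:
    """Return a list of problems; an empty list means the claim passes."""
    problems = []
    if claim.grade not in GRADES:
        problems.append("unknown grade")
    if claim.grade != "hypothesis" and not claim.sources:
        problems.append("graded claim has no source")
    if claim.grade == "researched" and any(s.accessed is None for s in claim.sources):
        problems.append("researched source missing access date")
    # Assumption: two distinct refs stand in for triangulation.
    if claim.load_bearing and len({s.ref for s in claim.sources}) < 2:
        problems.append("load-bearing claim has fewer than two distinct sources: triangulate")
    return problems
```

A claim graded `hypothesis` passes with no source at all, which is intentional: the label itself is the disclosure. What the audit refuses is a confident grade with nothing behind it.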

Discipline 3: build from verbatim language, not imagination

Audiences and hooks get assembled from harvested customer language: review mining and community mining (tools/reddit-thread-analyzer territory), with every quote kept verbatim plus source link. This is established practitioner doctrine (Copyhackers’ review-mining method, 2014): the canonical case mined 500+ Amazon reviews to produce the headline “If you think you need rehab, you do.” That headline drew +400% CTA clicks over the agency’s polished control (a single self-reported A/B test, but the method has a decade of practitioner consensus behind it).

Verbatim language does two jobs at once: it grounds the research (a pain backed by a 57-point Reddit comment with a permalink is researched, not hypothesis) and it pre-writes the creative (hooks adapted from real phrasing carry the customer’s emotional shape into the ad).

The output: a unit table, not an avatar deck

The deliverable is a table of units — [TA × JTBD × Angle + content type] — where:

  • TA (target audience) carries an awareness level per glossary/awareness-levels, not a demographic sketch
  • JTBD carries a core promise plus proof requirements — what the creative must demonstrate for the promise to be believed (the targeting unit is the job, per Christensen: “the fact that you’re 18 to 35 with a college degree does not cause you to buy”)
  • Angle carries its mechanism, sophistication move, persuasion lever (glossary/persuasion-principles), and a saturation classification from the competitive sweep
  • Stable IDs (TA2.J1.A1) — frozen after sign-off, append-only afterward, so downstream production and feedback can reference units unambiguously

Priority is scored with transparent sub-scores: evidence (0–3) + whitespace (0–3) + feasibility (0–2) + economics (0–2), summed to a 0–10 total rather than blended opaquely. A hypothesis-graded angle can’t quietly outrank a researched one; the scoring shows why each unit ranks where it does. The ranking is a testing queue, not a production order (marketing/discovery-before-scale logic).

Tournaments with visible rejects

Jobs and angles are generated in slates (3–6 candidates per slot), then culled to 1–3 — with rejects listed per slot alongside the specific test that killed them (duplicate proof requirements, no evidence, mechanism mismatch, awareness mismatch, feasibility). Visible rejects are half the value: the client sees what was considered, and a candidate that loses in one slot can legitimately win in another.

One counter-bias rule is load-bearing: always include one saturated-but-winnable candidate per slot. Generating only from the whitespace map optimizes differentiation but silently skips crowded angles where the client holds structurally better proof. The five checks decide, not the map alone.
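
A slot tournament with visible rejects might look like the following sketch. It implements only two of the five checks named above, and every identifier (`Candidate`, `check_evidence`, `check_awareness`, `run_slot`) is hypothetical. The final narrowing to 1–3 survivors is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    name: str
    evidence_count: int       # sourced quotes or data points behind the candidate
    awareness_fit: bool       # matches the target audience's awareness level
    saturated: bool = False   # crowded according to the competitive sweep


# A check returns None on pass, or the rejection reason on fail.
Check = Callable[[Candidate], Optional[str]]


def check_evidence(c: Candidate) -> Optional[str]:
    return None if c.evidence_count > 0 else "no evidence"


def check_awareness(c: Candidate) -> Optional[str]:
    return None if c.awareness_fit else "awareness mismatch"


def run_slot(slate: List[Candidate], checks: List[Check]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (survivors, rejects); each reject carries the check that killed it."""
    # Counter-bias rule: the slate itself must contain a saturated candidate.
    if not any(c.saturated for c in slate):
        raise ValueError("slate needs one saturated-but-winnable candidate")
    survivors, rejects = [], []
    for c in slate:
        # The first failing check is recorded as the reason.
        reason = next((r for r in (check(c) for check in checks) if r is not None), None)
        if reason is None:
            survivors.append(c.name)
        else:
            rejects.append((c.name, reason))
    return survivors, rejects
```

Note that the saturated candidate gets no special treatment once it is in the slate: it passes or fails the same checks as everything else, which matches the rule that the checks decide, not the map alone.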

Human gates at the points of irreversibility

Three hard stops: research scope approval (before heavy fetching), audience confirmation (a wrong TA poisons every downstream artifact), and final sign-off (after which IDs freeze). Everything between gates can be machine-fast; the gates are where judgment is non-delegable.

Feedback merges back at the right level

Because ad names encode unit IDs, performance results re-enter the research with three-level failure attribution: hook-level (creative execution failed; the unit may still be alive), angle-level (the angle is falsified), TA-level (multiple units of one audience failed the same way → revisit the audience definition). Without the levels, one bad video kills a good angle — or a dead angle survives behind a lucky hook.
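
The three attribution levels can be approximated with a small parser over ad names. Everything specific in this sketch is an assumption: the ad naming pattern (a unit ID followed by a hook suffix such as `TA1.J1.A1_h2`), the pass/fail input, and the thresholds `min_hooks` and `min_units`. It also ignores the "failed the same way" condition, which needs qualitative judgment.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

UNIT_ID = re.compile(r"TA\d+\.J\d+\.A\d+")


def attribute(results: List[Tuple[str, bool]], min_hooks: int = 2, min_units: int = 2) -> Dict[str, List[str]]:
    """results holds (ad_name, passed) pairs; returns IDs grouped by failure level."""
    by_unit = defaultdict(list)
    for ad_name, passed in results:
        match = UNIT_ID.search(ad_name)
        if match is None:
            continue  # no unit ID in the ad name, so it cannot be attributed
        by_unit[match.group(0)].append(passed)

    report = {"hook": [], "angle": [], "ta": []}
    dead_angles_per_ta = defaultdict(int)
    for unit_id, outcomes in sorted(by_unit.items()):
        if all(outcomes):
            continue  # nothing failed for this unit
        if not any(outcomes) and len(outcomes) >= min_hooks:
            # Every hook failed across enough attempts: treat the angle as falsified.
            report["angle"].append(unit_id)
            dead_angles_per_ta[unit_id.split(".")[0]] += 1
        else:
            # Mixed results or too few attempts: an execution problem, unit still alive.
            report["hook"].append(unit_id)
    # Several dead units under one audience point at the audience definition.
    report["ta"] = sorted(ta for ta, n in dead_angles_per_ta.items() if n >= min_units)
    return report
```

With this shape, a single failed video lands in the `hook` bucket and leaves its angle alive, which is the failure mode the levels exist to prevent.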

Honest limits

  • The methodology has one full engagement behind it (a DTC genomics-skincare brand, US market — 5 TAs, 11 signed-off units) plus the published prior art above. The components are well-anchored; the assembled pipeline is N=1.
  • No independent published evidence exists that JTBD targeting outperforms demographic targeting in paid media specifically (Strategyn’s 86%-vs-17% innovation success figure is vendor-claimed). The case rests on the persona critique + the verbatim-language evidence + mechanism.
  • Evidence grading in marketing research appears to be genuinely uncodified territory — we’re importing intelligence-community discipline by analogy, not citing a marketing study.

Key Takeaways

  • Classify inputs by who can know them: ask-only (client), file-preferred, researchable — researching an ask-only field is hallucination with extra steps.
  • Grade every claim provided | researched | hypothesis with a source; skips become labeled hypotheses, never silent assumptions.
  • Build audiences and hooks from verbatim customer language — it grounds the research and pre-writes the creative in one move.
  • Output units ([TA × JTBD × angle], stable IDs, transparent sub-scores), not avatar decks — units are what production and feedback loops can consume.
  • Run tournaments with visible rejects, keep one saturated-but-winnable candidate per slot, and gate the irreversible steps with humans.

Sources