Question: What AI Tools Actually Deliver ROI for Small Businesses?
Status: 🔍 Exploring
The Question
With hundreds of AI tools launching every week, which ones actually save time/money or generate revenue for small businesses? What’s real vs. hype?
Why This Matters
- Small businesses can’t afford to waste time on tools that don’t work
- AI marketing is full of hype — need to cut through it
- Finding genuinely useful tools = competitive advantage
- This is core to Primores.org’s value proposition
What We Know So Far
From peer-reviewed studies (added May 2026)
Three randomized or quasi-experimental studies from 2023, plus one 2025 field RCT, now provide direct ROI numbers rather than vendor claims or testimonials:
| Domain | Study | N | Headline ROI |
|---|---|---|---|
| Customer support | Brynjolfsson, Li & Raymond (2023) NBER WP 31161 | 5,179 | +14% issues/hour avg; +34% for novices; ~0% for experts. Real production setting, staggered rollout. |
| Writing tasks | Noy & Zhang (2023) Science 381(6654) | 444 | −40% time, +18% quality. Low-ability workers gained most. Preregistered experiment. |
| Knowledge work | Dell’Acqua et al. (2023) HBS WP 24-013 | 758 | Inside frontier: +12.2% tasks, +25.1% faster, +40% quality. Outside frontier: −19pp accuracy. |
| Knowledge work (field) | Dillon, Jaffe, Immorlica & Stanton (2025) NBER WP 33795 | 7,137 | Microsoft Copilot RCT across 66 firms: users saved ~2 hrs/week on email but showed no change in output (same email threads, meetings, documents completed). The time saved did not translate into more output. |
The unified finding across the three 2023 studies: AI delivers meaningful productivity gains inside its capability frontier, with low-skill workers gaining most. ROI is not uniformly distributed across tasks or workers: it depends on whether the specific task is inside the frontier (see glossary/jagged-frontier) and whether the user was skill-bottlenecked on a phase AI can absorb (see glossary/ai-task-restructuring).
The 2025 reality check (Stanford HAI AI Index 2025): generative-AI use in at least one business function jumped from 33% to 71% of organizations in a single year, yet financial returns stay modest — most organizations report cost savings under 10% and revenue gains under 5%. Adoption raced ahead of measured ROI. Paired with the Copilot RCT above (time saved ≠ output), the lesson is that ROI depends on redeploying the freed time, not just deploying the tool. See glossary/appropriate-reliance — uncritical adoption captures the cost but not the value.
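The "time saved ≠ output" point can be made concrete with a little arithmetic. The sketch below is an illustration, not a measured model: it assumes only redeployed hours count as value, and the function names, the hourly value ($40) and the tool price ($30/month) are hypothetical placeholders. Only the ~2 hrs/week figure comes from the Copilot RCT above.

```python
WEEKS_PER_MONTH = 52 / 12  # average number of weeks in a month


def monthly_net_value(hours_saved_per_week, redeployed_fraction,
                      hourly_value, tool_cost_per_month):
    """Net monthly value of a tool, counting only redeployed hours.

    redeployed_fraction is the share of freed time that actually goes
    into additional valuable work (0.0 = none, 1.0 = all of it).
    """
    if not 0.0 <= redeployed_fraction <= 1.0:
        raise ValueError("redeployed_fraction must be between 0 and 1")
    redeployed_hours = hours_saved_per_week * WEEKS_PER_MONTH * redeployed_fraction
    return redeployed_hours * hourly_value - tool_cost_per_month


def break_even_fraction(hours_saved_per_week, hourly_value, tool_cost_per_month):
    """Share of freed time that must be redeployed for the tool to pay for itself."""
    gross = hours_saved_per_week * WEEKS_PER_MONTH * hourly_value
    if gross <= 0:
        raise ValueError("hours saved and hourly value must be positive")
    return tool_cost_per_month / gross


if __name__ == "__main__":
    # Hypothetical inputs: $40/hour of owner time, $30/month subscription.
    for fraction in (0.0, 0.25, 1.0):
        net = monthly_net_value(2, fraction, 40, 30)
        print(f"redeployed {fraction:.0%}: net ${net:,.2f}/month")
    print(f"break-even share: {break_even_fraction(2, 40, 30):.1%}")
```

With nothing redeployed, the result is simply the subscription cost as a loss, which is the "captures the cost but not the value" case. The break-even share is small under these placeholder numbers, so the hard part is usually the redeployment habit rather than the price.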
This sharpens the original framing: the question isn’t “does AI deliver ROI?” — it’s “for which task × worker combinations?”
From Experience
- ChatGPT/Claude for drafting content: High ROI — saves hours per week. Consistent with Noy-Zhang’s 40% time savings on writing tasks.
- AI image generation for social: Medium ROI — saves money vs. stock photos
- AI transcription (Otter, etc.): High ROI — saves hours on meeting notes. Inside-the-frontier task with abundant training data.
- AI “agents” for complex tasks: Low ROI so far — promising but unreliable. Consistent with Dell’Acqua’s outside-frontier finding (−19pp accuracy when capability frontier is exceeded).
Initial Hypotheses
- “Boring” productivity AI (writing, transcription) delivers more ROI than “exciting” AI — partially supported: drafting and transcription are inside-frontier; agent orchestration is at or beyond the current frontier.
- Tools that augment humans beat tools that try to replace humans — supported: Brynjolfsson’s finding that the AI distributes top performers’ best practices to less-experienced workers is exactly this pattern (the “centaur” label itself comes from Dell’Acqua et al.).
- Free tiers often provide 80% of the value — partially reasoned (see “Free vs. paid” below): not directly addressed by the academic literature, but the frontier/skill-leveling findings let us reason about it.
ROI by business function — the two-axis model (synthesis, May 2026)
The “for which task × worker combinations?” reframing above can be sharpened into a model that predicts ROI by function. Synthesizing the three academic anchors with the wiki’s glossary/jagged-frontier and the severity gate from glossary/guardrails / glossary/customer-perception-moments, ROI is a function of two axes:
- Frontier position — is the task inside the AI’s capability frontier (abundant training data, pattern-matchable, verifiable output) or outside it (novel, judgment-heavy, no clear right answer)? (Dell’Acqua 2023: +40% quality inside; −19pp accuracy outside.)
- Error cost — what does a confident-but-wrong output cost? Low for a discardable first draft; catastrophic for an unsupervised legal, medical, or financial decision.
| | Low error cost | High error cost |
|---|---|---|
| Inside frontier | 🟢 Highest ROI — deploy now. Content drafting, transcription, tier-1 support, image generation, summarization-for-humans. (Noy-Zhang −40% time / +18% quality; Brynjolfsson +14%, novices +34%.) | 🟡 Conditional ROI — ROI realized only with a verification layer. Code generation (review before merge), data analysis (check the numbers), customer-facing copy (human edit). The verification cost is real; net ROI is positive but smaller than the headline. |
| Outside frontier | 🟠 Low ROI, low risk — cheap to try, often disappointing. Open-ended ideation, “be creative” tasks. Discard the misses; the cost is just the tokens. | 🔴 Negative ROI — the −19pp zone. Unsupervised agents on novel multi-step tasks, autonomous decisions in regulated/high-severity domains. The cost of wrong-with-high-confidence output exceeds any labor saved. |
The organizing insight: the highest-ROI AI work is not the most impressive-sounding. It clusters in the top-left cell — boring, inside-frontier, low-error-cost tasks where AI augments a human who still owns the outcome. This is the empirical backbone of the wiki’s glossary/automation-eats-execution pattern: AI compresses the execution layer (top-left) first, while strategy and judgment (bottom-right) stay human.
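As a quick triage aid, the two-by-two table above can be written as a small lookup. This is a minimal sketch: the `Cell` and `classify` names are made up for illustration, and reducing each axis to a yes/no answer is a simplification, since real tasks sit on a spectrum (the frontier is jagged).

```python
from enum import Enum


class Cell(Enum):
    """The four cells of the two-axis model (labels follow the table)."""
    DEPLOY_NOW = "🟢 Highest ROI: deploy now"
    CONDITIONAL = "🟡 Conditional ROI: needs a verification layer"
    LOW_ROI_LOW_RISK = "🟠 Low ROI, low risk: cheap to try"
    NEGATIVE = "🔴 Negative ROI: avoid unsupervised use"


def classify(inside_frontier: bool, high_error_cost: bool) -> Cell:
    """Map a task's position on the two axes to its ROI cell."""
    if inside_frontier:
        return Cell.CONDITIONAL if high_error_cost else Cell.DEPLOY_NOW
    return Cell.NEGATIVE if high_error_cost else Cell.LOW_ROI_LOW_RISK


if __name__ == "__main__":
    # Example tasks taken from the table; the boolean judgments are the table's.
    examples = {
        "meeting transcription": (True, False),
        "code generation (review before merge)": (True, True),
        "open-ended ideation": (False, False),
        "unsupervised agent in a regulated domain": (False, True),
    }
    for task, (inside, costly) in examples.items():
        print(f"{task}: {classify(inside, costly).value}")
```

The value of writing it down this way is that it forces the two questions to be answered separately for each task before a tool is chosen.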
Function-by-function map
| Function | Typical cell | Wiki evidence |
|---|---|---|
| Content / copywriting | 🟢 / 🟡 (edit before publish) | marketing/ai-marketing-case-studies, marketing/ai-human-voice-prompting |
| Customer support (tier-1) | 🟢 / 🟡 | automation/ai-customer-service-cases, cases/intercom-fin-support — and the perception caveat: glossary/customer-perception-moments |
| Transcription / meeting notes | 🟢 | High-ROI, deep inside frontier |
| Software development | 🟡 (review-gated) | automation/ai-developer-tools-cases, glossary/vibe-coding — inside for boilerplate, jagged at architecture |
| Data analysis / BI | 🟡 | Inside for summarization; outside for causal reasoning |
| Legal / healthcare / finance | 🔴 without supervision | automation/ai-legal-cases, automation/ai-healthcare-cases, automation/ai-finance-banking-cases — high error cost dominates |
| Autonomous multi-step agents | 🔴 / 🟠 today | questions/managed-agents-break-even, glossary/jagged-frontier |
Free vs. paid tiers
The academic literature doesn’t test this directly, but the frontier model predicts the answer. Free tiers capture ~80% of the value for tasks that are clearly inside the frontier — drafting, transcription, summarization — because the task is well within reach of even a mid-tier model. The marginal value of a paid tier concentrates at the frontier edge: a stronger model (and higher rate limits, integrations, longer context) is exactly what pulls a borderline task from “outside” to “inside” the frontier. So the paid-tier decision rule: pay when your highest-value tasks sit at the frontier edge or you run high volume; stay free when your tasks are comfortably inside the frontier and intermittent. This refines hypothesis 3 from “untested” to “true for inside-frontier tasks; the exception is frontier-edge work where the better model changes the answer.”
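The paid-tier decision rule above could be sketched as a check over a business's highest-value tasks. Everything here is illustrative: the `Task` fields, the `should_pay` name, and especially the 50-runs-per-week volume threshold are assumptions, not tested values.

```python
from dataclasses import dataclass

# Hypothetical cut-off for "high volume"; tune it to the tool's free-tier limits.
HIGH_VOLUME_RUNS_PER_WEEK = 50


@dataclass
class Task:
    name: str
    at_frontier_edge: bool  # borderline task a stronger model might pull "inside"
    runs_per_week: int      # how often the task is run


def should_pay(tasks, high_volume_threshold=HIGH_VOLUME_RUNS_PER_WEEK):
    """Pay if any high-value task is at the frontier edge or runs at high volume."""
    return any(
        t.at_frontier_edge or t.runs_per_week >= high_volume_threshold
        for t in tasks
    )


if __name__ == "__main__":
    occasional_drafting = [Task("blog drafts", at_frontier_edge=False, runs_per_week=3)]
    edge_work = [Task("contract clause comparison", at_frontier_edge=True, runs_per_week=2)]
    print(should_pay(occasional_drafting))  # comfortably inside, intermittent: stay free
    print(should_pay(edge_work))            # frontier-edge task: paid tier is worth testing
```

The rule is deliberately an "or": either condition alone is enough to justify a paid trial, while comfortably-inside, intermittent work stays on the free tier.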
Current Thinking
The highest-ROI AI tools seem to be those that:
- Automate genuinely tedious tasks (not interesting work)
- Require minimal setup/learning
- Integrate with existing workflows
- Have clear, measurable outputs
Open Threads
Things still to explore:
- Survey actual small business owners on what they use
- Test 10 popular AI tools and document real time savings
- Compare free vs. paid tiers — is paid worth it?
- Look at AI tools by business function — done via the two-axis model + function map above (May 2026)
- Find case studies with actual numbers
Sources to Review
- Academic/research studies on AI productivity — three core anchors ingested (Brynjolfsson 2023, Noy-Zhang 2023, Dell’Acqua 2023). See glossary/ai-skill-leveling, glossary/ai-task-restructuring, glossary/jagged-frontier.
- AI tool comparison sites (need to evaluate their credibility)
- Small business forums/Reddit for real user experiences
- Vendor case studies (with skepticism)
Related
- automation/finding-ai-use-cases — TRIPS framework for prioritizing AI opportunities
- automation/ai-enablement-levels — Understanding the 5 levels of AI adoption
- glossary/llm-evals — How to evaluate if AI is actually working
- questions/ai-as-personal-advisor — Related exploration on AI for productivity
- glossary/jagged-frontier — AI is asymmetric: helps inside frontier, hurts outside (Dell’Acqua 2023, n=758)
- glossary/ai-skill-leveling — Three studies confirming AI raises low-performer productivity disproportionately
- glossary/ai-task-restructuring — Where the time savings come from (drafting compresses; framing and editing remain)
- comparisons/strategy-vs-execution-ai — Synthesis: what kinds of work get the ROI
- glossary/automation-eats-execution — The cross-domain pattern these findings anchor
- glossary/guardrails — The error-cost axis operationalized: pair high-error-cost AI use with a verification gate
- glossary/customer-perception-moments — Where the customer-support ROI cell intersects perception (forgiveness, recovery copy) — ROI isn’t just resolution rate
Last updated: 2026-05-29 (added the two-axis ROI-by-function model + free-vs-paid reasoning; upgraded 🌱→🌿). Prior: 2026-05-05 academic ingest wave (Brynjolfsson, Noy-Zhang, Dell’Acqua).