AI & Machine Learning

How to Evaluate AI Chatbot Accuracy: The Strawberry Letter Test and Beyond

2026-04-30 20:23:54

Introduction

AI chatbots like ChatGPT have made impressive strides, but they still suffer from a persistent flaw: confidently delivering incorrect information. A classic example is the infamous question: "How many 'R's are in the word 'strawberry'?" For a long time, ChatGPT would answer incorrectly—often saying two or three—with absolute certainty. Recently, OpenAI touted an improvement where ChatGPT finally gets it right (three 'R's). But as the victory lap was taken, users quickly pointed out other confident mistakes that remain. This guide will show you how to systematically test an AI chatbot's accuracy, using the strawberry letter-count as a starting point, so you can identify and handle these confident errors.

Source: 9to5google.com

What You Need

- Access to the AI chatbot you want to evaluate (e.g., ChatGPT)
- A fixed list of benchmark questions with known correct answers
- A way to record responses verbatim (a spreadsheet, text file, or simple script)
- Trusted reference sources, such as a dictionary or encyclopedia

Step-by-Step Guide

Step 1: Prepare Your Benchmark Questions

Start with a small set of questions that test basic factual knowledge and simple counting. Include the classic "How many 'R's are in 'strawberry'?" as a control question. Add a few other common pitfalls, such as:

- Letter counts in other words (e.g., how many 'S's are in "Mississippi"?)
- Capital cities that are often confused (e.g., the capital of Australia)
- Simple tallies, such as the number of continents
- Straightforward historical dates

These questions are simple, but they often trip up AI models because of training-data biases or shallow reasoning. Record the exact wording you will use for each question to ensure consistency across runs.
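As a sketch, the benchmark from Step 1 can be kept as a small structured list so the exact wording is fixed. The IDs, questions, and expected answers below are drawn from this guide; extend the list with your own pitfalls:

```python
# A minimal benchmark set: each entry fixes the exact wording and the
# verified correct answer, so every test run asks the same thing.
BENCHMARK = [
    {"id": "strawberry", "question": "How many 'R's are in the word 'strawberry'?", "expected": "3"},
    {"id": "au-capital", "question": "What is the capital of Australia?", "expected": "Canberra"},
    {"id": "continents", "question": "How many continents are there?", "expected": "7"},
]

def exact_wording(qid):
    """Return the fixed wording for a question ID, ensuring consistency across runs."""
    for item in BENCHMARK:
        if item["id"] == qid:
            return item["question"]
    raise KeyError(qid)
```

Storing the expected answer alongside each question also makes the cross-verification in Step 4 mechanical.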

Step 2: Ask the Chatbot and Record Responses

One at a time, present each question to the chatbot. Copy the chatbot's response exactly as it appears, including any confident language like "definitely", "absolutely", or "without a doubt". Note the date and time of the interaction. For the strawberry question, observe not only the count but also any explanation the bot provides. In OpenAI's recent update, when asked about 'strawberry,' ChatGPT now responds correctly: "There are three 'R's in 'strawberry'." However, it may still be overly confident in its reasoning.
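The record-keeping above can be automated with a small helper. This is a sketch: `record_response` and the log filename `chatbot_log.jsonl` are assumed names, and the chatbot's reply is pasted in verbatim (or fetched through whatever interface you use):

```python
import datetime
import json

def record_response(question, response, log_path="chatbot_log.jsonl"):
    """Append one question/response pair, with a timestamp, to a JSONL log."""
    entry = {
        "timestamp": datetime.datetime.now().isoformat(timespec="seconds"),
        "question": question,
        "response": response,  # copied exactly, including confident phrasing
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

An append-only log keeps every interaction with its date and time, which Step 5 relies on when comparing runs across model updates.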

Step 3: Analyze Confidence Markers

Review the chatbot's responses for words or phrases that indicate a high degree of certainty. These are red flags when the answer might be wrong. Confident mistakes occur because LLMs are trained to predict plausible text, not to verify facts. For example, if the bot says "I am certain that the capital of Australia is Sydney", that's a confident mistake. Compare the confidence level with the actual accuracy: in the strawberry case, the correct answer (three) warrants confidence, but incorrect answers often arrive with exactly the same unwarranted assurance.
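Scanning for certainty phrases can be done with a simple case-insensitive search. The marker list below includes the phrases named in this guide and is an assumption, not exhaustive:

```python
# Phrases that signal high certainty; a non-exhaustive starting list.
CONFIDENCE_MARKERS = ["definitely", "absolutely", "without a doubt", "i am certain"]

def find_confidence_markers(response):
    """Return the confidence phrases present in a response (case-insensitive)."""
    lowered = response.lower()
    return [m for m in CONFIDENCE_MARKERS if m in lowered]
```

A response that is both flagged here and wrong in Step 4 is a confident mistake worth reporting.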

Step 4: Cross-Verify with Reliable Sources

For each answer the chatbot gave, check it against a trusted external source. Use a dictionary for spelling questions, an encyclopedia for factual queries, or a reputable website. For the strawberry example, you can quickly verify by counting the letters yourself or using a dictionary entry. Note any discrepancies. If the chatbot says something with high confidence that is false, mark it as a confident mistake.
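For counting questions, you do not even need an external source: a couple of lines of code settle the matter. A minimal check of the strawberry claim, counting case-insensitively:

```python
def letter_count(word, letter):
    """Count case-insensitive occurrences of a letter in a word."""
    return word.lower().count(letter.lower())

# Verify the strawberry claim yourself rather than trusting the bot.
assert letter_count("strawberry", "r") == 3
```

For factual questions like capitals or dates, fall back to a dictionary, encyclopedia, or other reputable reference as described above.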


Step 5: Repeat Periodically to Track Improvements

AI models are updated frequently. Re-run the same test questions every few weeks or after a major update. This helps you see if the strawberry-level improvements are spreading to other areas. In the recent OpenAI announcement, while the strawberry letter-count improved, users highlighted other mistakes—such as misidentifying the number of continents or confusing historical dates—that remained confidently wrong. Regular testing lets you gauge the model's overall progress.
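Tracking progress across runs is easier if each run is reduced to a simple pass/fail map per question. As a sketch (the function name and result format are assumptions), a diff between two runs shows what improved, what regressed, and what is still wrong:

```python
def compare_runs(old_results, new_results):
    """Diff two benchmark runs, given as dicts mapping question ID -> bool (correct?).

    Returns question IDs that regressed, improved, or remain wrong.
    """
    regressed = [q for q in old_results if old_results[q] and not new_results.get(q, False)]
    improved = [q for q in old_results if not old_results[q] and new_results.get(q, False)]
    still_wrong = [q for q in new_results if not new_results[q]]
    return {"regressed": regressed, "improved": improved, "still_wrong": still_wrong}
```

In the scenario the article describes, a diff would show "strawberry" under improved while questions like the continent count stay under still_wrong.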

Step 6: Report Discrepancies to Developers

When you encounter a confident mistake, consider reporting it through the chatbot's feedback mechanism. Provide the exact question, the bot's response, and the correct answer. User reports help developers identify lingering weaknesses. For example, after the strawberry fix, many users pointed out other counting errors, pushing OpenAI to continue refining the model.
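The three pieces a useful report needs (exact question, the bot's response, and the correct answer) can be assembled with a small helper; the function name and output layout here are illustrative:

```python
def format_report(question, bot_answer, correct_answer):
    """Assemble a feedback report with the details developers need to reproduce the error."""
    return (
        f"Question: {question}\n"
        f"Chatbot's answer: {bot_answer}\n"
        f"Correct answer: {correct_answer}"
    )
```

Paste the result into the chatbot's feedback form so the mistake can be reproduced exactly.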

Step 7: Apply Critical Thinking When Using AI

Whenever you rely on an AI chatbot for important tasks—like fact-checking, writing code, or providing advice—always treat its output as a starting point, not a final answer. Apply the same skepticism you would to any unverified source. Remember that a confident tone does not equal accuracy. Use the steps above to build a habit of verification.

Tips for Success

- Use the exact same wording for every question on every run; small phrasing changes can change the answer.
- Log the date and time of each test so you can tie results to specific model updates.
- Treat confident language as a signal to verify, not as evidence of accuracy.
- Report confirmed mistakes through the chatbot's feedback mechanism so developers can address them.
