Research Project
AI Medicare Advice Evaluator
Based on Accuracy of Medicare Information Provided by State Health Insurance Assistance Programs — Dugan et al., JAMA Network Open, 2025

Overview: Evaluating AI responses about Medicare

AI tools have the potential to help people navigate complex systems, but they also have the potential to cause harm by providing incorrect or biased information. This project evaluates how well AI models answer a standardized set of Medicare questions, using a published research rubric.

Initial results indicate that AI responses are more accurate than those of trained counselors: models gave an ideal answer to 65% of questions, versus 37% for counselors. This is a *preliminary* result from a non-expert as part of a personal project.

Acknowledgement: This project essentially re-implements the original SHIP study using AI agents. The vast majority of the work was done by the original authors of the study.

What This System Does

This project implements that evaluation pipeline as an automated harness. It can run any AI model through the SHIP question sequence and score the result against the published rubric. Models are tested under identical conditions — the same questions, the same grading criteria — so results are directly comparable to each other and to the published human counselor data.

The system supports major AI providers (OpenAI, Anthropic, Google, and others) and can evaluate multiple models in a single run. Results are stored with full audit trails: every response, every extracted claim, every judge verdict.
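A run under identical conditions can be sketched as a loop over models and a shared question list. All names here (`EvalResult`, `run_suite`, `ask`) are hypothetical illustrations, not the project's actual API:

```python
# Minimal sketch of a provider-agnostic evaluation run.
# Every model sees the same question sequence, so results are comparable.
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    model: str
    question_id: str
    response: str
    claims: list = field(default_factory=list)    # extracted atomic claims
    verdicts: list = field(default_factory=list)  # per-claim judge verdicts


def run_suite(models, questions, ask):
    """Run every model through the same question sequence.

    `ask(model, text)` stands in for a provider-specific API call;
    keeping one shared question list holds test conditions constant.
    """
    results = []
    for model in models:
        for q in questions:
            results.append(EvalResult(model=model,
                                      question_id=q["id"],
                                      response=ask(model, q["text"])))
    return results


# Usage with a stubbed provider call:
questions = [{"id": "q1", "text": "When can I first enroll in Medicare?"}]
stub = lambda model, text: f"[{model}] answer to: {text}"
runs = run_suite(["model-a", "model-b"], questions, stub)
```

Storing one `EvalResult` per (model, question) pair is what makes the full audit trail possible: claims and verdicts attach to the same record as the raw response.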

Findings and Caveats

This section summarizes what the evaluation has found. Detailed results are available in the data reports linked below.

Accuracy (**Preliminary**)

AI models are consistently more accurate

From the preliminary results, it appears that AI models provide more accurate responses than trained counselors, giving responses rated Accurate and Complete (the best rating under this rubric) to 65% of questions, compared with 37% for the human experts.

Access

AI dramatically expands who can get Medicare guidance

The SHIP study's own data shows that 47% of outreach calls went unanswered — meaning nearly half of people seeking help couldn't reach a counselor at all. AI systems are available immediately, at any time, without hold times or geographic constraints, and in dozens of languages. For the many Medicare beneficiaries who can't access in-person or phone assistance, AI represents a meaningful expansion of who can get help.

Caveats

Important limitations of this evaluation

  • The accuracy data is preliminary.
  • This is a personal project by a non-expert. The approach has not been validated with any experts.
  • The test questions are not exhaustive and do not cover all the questions that people may have about Medicare.

The SHIP Study as a Benchmark

In the SHIP mystery-shopper study (Dugan et al., JAMA Network Open, 2025), researchers posed as Medicare beneficiaries and called SHIP counselors with a standardized script of questions. They then evaluated responses against a detailed rubric specifying which facts a correct, complete answer must contain.

This methodology turns Medicare advice into something measurable. The rubric defines, for each question, which information points are required, which are optional, and what constitutes an error. That structure makes it possible to score any response — human or AI — on the same scale.
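The required/optional/error structure can be sketched as a simple scoring rule. The category names other than "Accurate and Complete" (which the text names) are illustrative, and the field names are not taken from the published rubric:

```python
# Hedged sketch: map claim-level results for one question to a rubric
# category. Any error disqualifies; otherwise the rating depends on how
# many required information points the response covered.

def score_response(hit_required, hit_errors, required_total):
    """Return a rubric category for one response.

    hit_required:   required information points the response covered
    hit_errors:     claims judged incorrect
    required_total: required points the rubric defines for this question
    """
    if hit_errors > 0:
        return "Inaccurate"                # illustrative label
    if hit_required == required_total:
        return "Accurate and Complete"     # best rating, per the text
    if hit_required > 0:
        return "Accurate but Incomplete"   # illustrative label
    return "Nonresponsive"                 # illustrative label


assert score_response(3, 0, 3) == "Accurate and Complete"
assert score_response(2, 0, 3) == "Accurate but Incomplete"
assert score_response(3, 1, 3) == "Inaccurate"
```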

It also maps naturally onto an agent-based architecture, where each role in the original study becomes a software agent:

  • 👤 Questioner: poses scripted beneficiary questions
  • 🤖 AI Model: responds as it would to a real user
  • 🔍 Extractor: breaks the response into atomic, verifiable claims
  • ⚖️ Judge: scores each claim against the SHIP rubric
Multiple independent judge instances evaluate each claim and resolve disagreements, mirroring the multi-rater reliability methods used in the original study. Every step — the raw AI response, extracted claims, judge verdicts, and final scores — is stored for full auditability.
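One plausible resolution rule for disagreeing judges is a majority vote, with ties flagged for manual review; this is a sketch of that rule, not necessarily the rule the project uses:

```python
# Resolve one claim's verdicts from multiple independent judge instances.
from collections import Counter


def resolve(verdicts):
    """verdicts: list of labels from independent judges for one claim."""
    counts = Counter(verdicts)
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:   # tie between the top two labels
        return "needs-review"
    return top


assert resolve(["correct", "correct", "incorrect"]) == "correct"
assert resolve(["correct", "incorrect"]) == "needs-review"
```

Using an odd number of judges makes ties rare, which keeps the manual-review queue small.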

What's Next: Improving AI Responses

Establishing a baseline is phase one. Phase two is using that measurement infrastructure to ask a more actionable question: how can we make AI responses better?

Because every run is scored against the same rubric, the system can detect whether a change actually improves performance, which makes it possible to test candidate improvements rigorously.

The goal is not just to measure AI performance, but to understand how to make AI genuinely useful for people navigating one of the most consequential decisions of their lives.

Explore the Data

The following report shows detailed evaluation results from AI models run against the SHIP rubric.

Research context: This system is for research purposes only. It does not provide medical, legal, or insurance advice, and results should not be used to make healthcare decisions. The SHIP program provides free, expert Medicare counseling — find your local counselor at shiphelp.org.