AI tools have the potential to help people navigate complex systems, but also the potential to cause harm by providing incorrect or biased information. This project evaluates how well AI models answer a standardized set of Medicare questions, using a published research rubric.
Initial results indicate that AI responses are more accurate than those of trained counselors, providing an ideal answer to 65% of questions versus 37% for the counselors. This is a *preliminary* result from a non-expert as part of a personal project.
Acknowledgement: This project essentially re-implements the original SHIP study using AI agents. The vast majority of the work was done by the original authors of the study.
This project implements that evaluation pipeline as an automated harness. It can run any AI model through the SHIP question sequence and score the result against the published rubric. Models are tested under identical conditions — the same questions, the same grading criteria — so results are directly comparable to each other and to the published human counselor data.
The system supports major AI providers (OpenAI, Anthropic, Google, and others) and can evaluate multiple models in a single run. Results are stored with full audit trails: every response, every extracted claim, every judge verdict.
This section summarizes what the evaluation has found. Detailed results are available in the data reports linked below.
From the preliminary results, it appears that AI models provide more accurate responses than trained counselors, giving responses rated Accurate and Complete (the best rating under this rubric) to 65% of questions, compared with 37% for the counselors.
The SHIP study's own data shows that 47% of outreach calls went unanswered — meaning nearly half of people seeking help couldn't reach a counselor at all. AI systems are available immediately, at any time, without hold times or geographic constraints, and in dozens of languages. For the many Medicare beneficiaries who can't access in-person or phone assistance, AI represents a meaningful expansion of who can get help.
In the SHIP mystery-shopper study (Dugan et al., JAMA Network Open, 2025), researchers posed as Medicare beneficiaries and called SHIP counselors with a standardized script of questions. They then evaluated responses against a detailed rubric specifying which facts a correct, complete answer must contain.
This methodology turns Medicare advice into something measurable. The rubric defines, for each question, which information points are required, which are optional, and what constitutes an error. That structure makes it possible to score any response — human or AI — on the same scale.
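A minimal sketch of that rubric structure, assuming per-question lists of required and optional information points. The class names and the lower two rating labels are hypothetical; only "Accurate and Complete" appears in the study summary above:

```python
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    """One fact the rubric checks for (illustrative field names)."""
    text: str
    required: bool = True  # required vs. optional information point

@dataclass
class QuestionRubric:
    question: str
    items: list[RubricItem] = field(default_factory=list)

def score(rubric: QuestionRubric, facts_present: set[str], errors: int) -> str:
    """Map a response onto ordinal rating categories (assumed labels)."""
    required = {i.text for i in rubric.items if i.required}
    if errors > 0:
        return "Inaccurate"
    if required <= facts_present:
        return "Accurate and Complete"
    if facts_present & required:
        return "Accurate but Incomplete"
    return "Inaccurate"
```

The key property is that the scoring function only sees which facts a response contained, so it applies identically to a human counselor transcript and an AI transcript.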
It also maps naturally onto an agent-based architecture, where each role in the original study becomes a software agent:
Multiple independent judge instances evaluate each claim and resolve disagreements, mirroring the multi-rater reliability methods used in the original study. Every step — the raw AI response, extracted claims, judge verdicts, and final scores — is stored for full auditability.
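The disagreement-resolution step described above can be sketched as a majority vote over independent judge verdicts. The verdict labels and the conservative tie-breaking policy here are assumptions for illustration, not the project's exact rule:

```python
from collections import Counter

def resolve_verdicts(verdicts: list[str]) -> str:
    """Resolve independent judge verdicts on a single claim by majority
    vote; ties fall back to a conservative 'unsupported' label."""
    counts = Counter(verdicts)
    top, n = counts.most_common(1)[0]
    if list(counts.values()).count(n) > 1:  # tie among leading verdicts
        return "unsupported"
    return top
```

Running an odd number of judges makes ties rare, but the fallback keeps the pipeline deterministic when they do occur.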
Establishing a baseline is phase one. Phase two is using that measurement infrastructure to ask a more actionable question: how can we make AI responses better?
Because every run is scored against the same rubric, the system can detect whether a change actually improves performance. That makes it possible to rigorously test:
The goal is not just to measure AI performance, but to understand how to make AI genuinely useful for people navigating one of the most consequential decisions of their lives.
The following report shows detailed evaluation results from AI models run against the SHIP rubric.
Research context: This system is for research purposes only. It does not provide medical, legal, or insurance advice, and results should not be used to make healthcare decisions. The SHIP program provides free, expert Medicare counseling — find your local counselor at shiphelp.org.