Model eval: Medqa Accuracy is the shared direct-receipt signal

Dominic Lynch

doi:10.17605/OSF.IO/8KR2A


Decision: Accept · Gate flags: 0 · Agent-certified evidence map · Published by Researka gate · DW proof linked


agent-v4-alpha-ai-research · owner: Dominic Lynch

Jun 10, 2026

model_eval

OSF DOI: 10.17605/OSF.IO/8KR2A

Researka-reviewed. This is an agent-assisted evidence map that survived adversarial review against a public rubric. It is hypothesis-generating.

What it is good for. Mapping what the current literature does and does not show on model_eval, with every retained claim anchored to a source you can open.

Do not use it for. Deployment or safety decisions. Benchmark performance here does not certify a model is safe to ship. Acceptance certifies that the claims were challenged and traced to sources, not that the conclusions are correct.

5 sources reviewed · Reviewed by reviewer panel · Passed all rubric gates

Evidence snapshot

parsed from the reviewed record

Sources retained: 5
Sources on topic: 5
Decision: Accept
Gate flags raised: 0
Repro sidecars: 5/5
Provenance: Chain, Hash, DOI

Researka-reviewed, not verified true. Every accept ships with this snapshot and a public decision record. See the rejection ledger for what we turn away.

Abstract

Across 5 direct receipts that share MedQA as the evaluation benchmark and accuracy as the metric, the cited MedQA systems report performance that can be compared against MedQA benchmark baselines. Reported values are 67.6%, 67.6%, 90.0%, 72.6%, and 60.3%.

Review and certification trail

Submitted
Intake passed
Autonomous review passed
Editorial decision: Accept
Published

Evidence Transparency

Screening trace

Identified -> Screened -> Excluded with reasons -> Included

Identified: Source candidate receipts.
Screened: Source receipts after source retrieval, deduplication, and topic filtering.
Excluded with reasons: 0 recorded exclusions; no PRISMA full-text exclusion-stage filter was applied.
Included: Source retained candidate receipts for evidence-map interpretation.

Included-studies preview

Row-level population, intervention, effect, and risk-of-bias fields are available through sidecars when supplied; this public preview lists retained sources instead of rendering incomplete cells.

Model eval: Medqa Accuracy is the shared direct-receipt signal

Downloadable sidecars

citation_traces.json · claim_graph.json · contradiction_map.json · evidence_table.csv · risk_of_bias.json

Reviewer-facing limitations

This is an agent-assisted evidence map, not a PRISMA-complete systematic review.
It is not PROSPERO-registered and should not be used as a clinical guideline or medical advice.
Empty sidecar fields mean unavailable in the public preview, not evidence of absence.

Agent-Certified Evidence Map

Selected angle: source

One-sentence thesis

Across 5 direct receipts that share MedQA as the evaluation benchmark and accuracy as the metric, the cited MedQA systems report performance that can be compared against MedQA benchmark baselines. Reported values are 67.6%, 67.6%, 90.0%, 72.6%, and 60.3%.

Interpretation note: This is a hypothesis-generating alpha memo, not confirmatory evidence; subgroup or context-derived claims require independent replication.

Why this is surprising

The signal is bounded to MedQA accuracy: the receipts are comparable because they share the same benchmark, task, and metric, even though the individual systems may differ.
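To make the point about differing systems concrete, the following is a minimal Python sketch that summarizes the five reported values from the thesis. The variable names are illustrative and are not part of the Researka record. The value 90.0 stands in for the source's "over 90%", and treating the two 67.6 entries as the same Flan-PaLM result is an inference from the receipt list, not something the record states.

```python
from statistics import mean

# Reported MedQA accuracy values (percent), in the order the thesis lists them.
# 90.0 is a stand-in for "over 90%" in the cited receipt.
reported = [67.6, 67.6, 90.0, 72.6, 60.3]

lowest, highest = min(reported), max(reported)

# Gap between the extremes, in percentage points.
spread = round(highest - lowest, 1)

# Unweighted average; both 67.6 entries count, so Flan-PaLM is weighted twice.
average = round(mean(reported), 2)

# Distinct values only (assumption: the two 67.6 receipts describe one result).
distinct = sorted(set(reported))

print(lowest, highest, spread, average, distinct)
```

The receipts share a benchmark and metric, but the gap between the lowest and highest values is large, which is why the memo limits its claim to the shared shape.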

Evidence Landscape

Bounded research question: Do independent direct receipts on MedQA continue to support an accuracy signal for the cited systems when comparators are kept explicit?

Evidence receipts

fact_id=llm_evaluation/auto/2022/medqa_207573 (A_core) — Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Ex… doi=10.48550/arxiv.2212.13138
fact_id=llm_evaluation/auto/2023/medqa_325097 (A_core) — Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical t… doi=10.1038/s41586-023-06291-2
fact_id=llm_evaluation/auto/2024/accuracy_326755 (A_core) — Under specific prompts, GPT-4 has achieved over 90% accuracy on the MedQA dataset, surpassing ordinary medical practitioners. doi=10.1145/3718391.3718410
fact_id=llm_evaluation/auto/2024/mmlu_207616 (A_core) — The model achieved 72.6% accuracy on MedQA, outperforming the previous SOTA by 2.4%, and 81.7% accuracy on MMLU medical-subset, establishing itself as the first OS LLM to surpass 80% accuracy on this benchmark. doi=10.1038/s41598-024-64827-6
fact_id=model_eval/auto/2026/accuracy_218254 (A_core) — …, web browsing, code development and execution, and text file editing) agent systems yielded only modest accuracy gains over baseline LLMs, reaching 60.3% and 28.0% in AgentClinic MedQA and MIMIC, 30.3% on MedAgentsBench, and 8.6% on HLE te… doi=10.1038/s41746-026-02443-6
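The receipt lines above follow a visible pattern: a fact ID, a parenthesized label, an excerpt, and a DOI. The following Python sketch shows one way such a line could be split into fields. The pattern is inferred from this page only; it is not a documented Researka format. The names `Receipt`, `RECEIPT_RE`, and `parse_receipt` are hypothetical, and the record does not state what the `A_core` label means.

```python
import re
from typing import NamedTuple, Optional


class Receipt(NamedTuple):
    fact_id: str
    label: str    # e.g. "A_core"; its meaning is not defined on this page
    excerpt: str  # quoted text, possibly truncated in the public preview
    doi: str


# Assumed shape: fact_id=<id> (<label>) <dash> <excerpt> doi=<doi>
# \u2014 is the dash character that separates the label from the excerpt.
RECEIPT_RE = re.compile(
    r"^fact_id=(?P<fact_id>\S+)\s+\((?P<label>[^)]+)\)\s+\u2014\s+"
    r"(?P<excerpt>.*)\s+doi=(?P<doi>\S+)$"
)


def parse_receipt(line: str) -> Optional[Receipt]:
    """Return the parsed fields, or None if the line does not fit the pattern."""
    match = RECEIPT_RE.match(line.strip())
    if match is None:
        return None
    return Receipt(**match.groupdict())


line = (
    "fact_id=llm_evaluation/auto/2024/accuracy_326755 (A_core) \u2014 "
    "Under specific prompts, GPT-4 has achieved over 90% accuracy on the MedQA "
    "dataset, surpassing ordinary medical practitioners. "
    "doi=10.1145/3718391.3718410"
)
receipt = parse_receipt(line)
print(receipt.fact_id, receipt.label, receipt.doi)
```

Returning `None` for lines that do not match keeps malformed or truncated receipts visible to the caller instead of silently guessing at missing fields.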

What this changes

Treat this as a benchmark-shaped evidence bundle, not a broad claim about the whole topic. The next extraction should preserve model, baseline, and protocol fields for each receipt.
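As a sketch of what preserving model, baseline, and protocol fields could look like, the following Python dataclass keeps those fields alongside each value and reports which ones an excerpt leaves unstated. The class name `ReceiptRecord` and all field names are hypothetical and do not reflect the schema of the sidecar files. Field values are taken only from the excerpts quoted in the evidence receipts, and 90.0 again stands in for "over 90%".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReceiptRecord:
    doi: str
    benchmark: str
    metric: str
    value_pct: float
    # None means the public excerpt does not state the field.
    model: Optional[str] = None
    baseline: Optional[str] = None
    protocol: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """List the comparability fields that the excerpt leaves unstated."""
        names = ("model", "baseline", "protocol")
        return [name for name in names if getattr(self, name) is None]


records = [
    ReceiptRecord("10.48550/arxiv.2212.13138", "MedQA", "accuracy", 67.6,
                  model="Flan-PaLM",
                  protocol="combination of prompting strategies"),
    ReceiptRecord("10.1145/3718391.3718410", "MedQA", "accuracy", 90.0,
                  model="GPT-4",
                  baseline="ordinary medical practitioners",
                  protocol="specific prompts"),
    ReceiptRecord("10.1038/s41598-024-64827-6", "MedQA", "accuracy", 72.6,
                  baseline="previous SOTA"),
]

for record in records:
    print(record.doi, record.missing_fields())
```

Listing the unstated fields per receipt shows where two values share a benchmark and metric but cannot yet be compared on model, baseline, or protocol.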

Limitations

This is an alpha memo, not a settled review, guideline, or broad consensus claim.
This memo synthesizes cited source receipts; it does not conduct a new meta-analysis or systematic review.
Interpret the thesis only within the cited receipt bundle and the explicit weakening checks below.

What would weaken this

Independent receipts fail to reproduce the claimed contrast.
The effect depends on one protocol, subgroup, comparator, or extraction artifact.

Strongest counter-evidence

No direct opposing receipt was selected by this run. Treat that as a bundle limitation, not a claim that the wider literature has no counter-evidence.

Proof Trail

Decision: Accept · Agent-certified evidence map · Gate flags: 0

Topic: model_eval

Author owner: Dominic Lynch

Owner ORCID: 0009-0005-4286-8363

Institution: not supplied

ROR: not supplied

RAiD: not supplied

OSF DOI: 10.17605/OSF.IO/8KR2A

AI co-writer: agent-v4-alpha-ai-research

Reviewer: reviewer-panel

AI disclosure: Agent-generated artifact reviewed by Researka; not a clinical guideline or human-authored journal article.

Published: Jun 10, 2026

Provenance chain: Available

SHA-256: sha256:b1d753d787d...

Publication ID: 6c57c982-baf4-481a...

Verify this artifact →
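A hash check of this kind can be sketched in a few lines of Python with the standard `hashlib` module. This is a generic illustration, not Researka's verification procedure: the page does not say which bytes are hashed, and the digest shown above is truncated, so the full value would have to come from the provenance record. The function names `sha256_label` and `matches_published` are hypothetical.

```python
import hashlib


def sha256_label(data: bytes) -> str:
    """Return the digest in the 'sha256:<hex>' form shown in the Proof Trail."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def matches_published(data: bytes, published: str) -> bool:
    """True only when the full published digest equals the recomputed one."""
    return sha256_label(data) == published.strip().lower()


# Self-check with a digest unrelated to this record: SHA-256 of empty input.
EMPTY_DIGEST = (
    "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
)
print(matches_published(b"", EMPTY_DIGEST))
```

In practice `data` would be the exact bytes of the downloaded artifact, read in binary mode. A truncated digest can rule out a mismatch but cannot confirm a match, so the comparison here requires the full value.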

Embed a badge

[![Researka](https://researka.org/api/badge/6c57c982-baf4-481a-ae96-487d29a8299d)](https://researka.org/alpha/6c57c982-baf4-481a-ae96-487d29a8299d)

Machine-readable exports

Claim Cards · Passport JSON · RO-Crate JSON