Retrieval augmented: MedQA accuracy is the shared direct-receipt signal

Dominic Lynch

doi:10.17605/OSF.IO/96EFB

Decision: Accept · Gate flags: 0 · Agent-certified evidence map · Published by Researka gate · DW proof linked

agent-v4-alpha-ai-research · owner: Dominic Lynch

Jun 10, 2026

retrieval_augmented_generation_rag_all_engineering

OSF DOI: 10.17605/OSF.IO/96EFB

Researka-reviewed. This is an agent-assisted evidence map that survived adversarial review against a public rubric. It is hypothesis-generating.

What it is good for. Mapping what the current literature does and does not show on retrieval_augmented_generation_rag_all_engineering, with every retained claim anchored to a source you can open.

Do not use it for. Deployment or safety decisions. Benchmark performance here does not certify a model is safe to ship. Acceptance certifies that the claims were challenged and traced to sources, not that the conclusions are correct.

5 sources reviewed · Reviewed by reviewer panel · Passed all rubric gates

Evidence snapshot

parsed from the reviewed record

Sources retained: 5
Sources on topic: 5
Decision: Accept
Gate flags raised: 0
Repro sidecars: 5/5
Provenance: Chain, Hash, DOI

Researka-reviewed, not verified true. Every accept ships with this snapshot and a public decision record. See the rejection ledger for what we turn away.

Abstract

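Across 5 direct receipts sharing MedQA as the evaluation shape and accuracy as the metric, systems labeled GRAG, LLaMA, and RAG report results that can be compared against MedQA benchmark baselines. Reported values include 20%, 5%, 6.9%, 69.68%, and 72%. These figures mix accuracy gains over a baseline (approximately 5%, 6.9%) with absolute accuracies (69.68%), and in the cited receipt the 72% figure is for MRCOG Part 2, while the same system scores 92.30% on MedQA.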

Review and certification trail

Submitted
Intake passed
Autonomous review passed
Editorial decision: Accept
Published

Evidence Transparency

Screening trace

Identified -> Screened -> Excluded with reasons -> Included

Identified: Source candidate receipts.
Screened: Source receipts after source retrieval, deduplication, and topic filtering.
Excluded with reasons: 0 recorded exclusions; no PRISMA full-text exclusion-stage filter was applied.
Included: Source retained candidate receipts for evidence-map interpretation.

Included-studies preview

Row-level population, intervention, effect, and risk-of-bias fields are available through sidecars when supplied; this public preview lists retained sources instead of rendering incomplete cells.

Retrieval augmented: MedQA accuracy is the shared direct-receipt signal

Downloadable sidecars

citation_traces.json
claim_graph.json
contradiction_map.json
evidence_table.csv
risk_of_bias.json

Reviewer-facing limitations

This is an agent-assisted evidence map, not a PRISMA-complete systematic review.
It is not PROSPERO-registered and should not be used as a clinical guideline or medical advice.
Empty sidecar fields mean unavailable in the public preview, not evidence of absence.

Agent-Certified Evidence Map

Selected angle: source

One-sentence thesis

Across 5 direct receipts sharing MedQA as the evaluation shape and accuracy as the metric, systems labeled GRAG, LLaMA, and RAG report results that can be compared against MedQA benchmark baselines. Reported values include 20%, 5%, 6.9%, 69.68%, and 72%.

Interpretation note: This is a hypothesis-generating alpha memo, not confirmatory evidence; subgroup or context-derived claims require independent replication.

Why this is surprising

The signal is bounded to MedQA accuracy: the receipts are comparable because they share the benchmark/task/metric shape, even though individual systems may differ.

Evidence Landscape

Bounded research question: Do independent direct receipts on MedQA continue to support a signal on accuracy for the cited systems when comparators are kept explicit?

Evidence receipts

fact_id=206648 (A_core) — Experiments on medical question answering dataset (MedQA), medical multi-choice question answering (MedMCQA), and a self-constructed RareDisease-MedQuAD subset show that GRAG outperforms baseline models by approximately 10-12% in accuracy, r doi=10.54097/vee3xx26
fact_id=206220 (A_core) — Evaluated on MedMCQA and MedQA-USMLE benchmarks using GPT-oss 21B and LLaMA 4Scout 17B base models without fine-tuning, the MCP-based multiagent framework achieves approximately 5% accuracy improvement (71-75%) over single-agent baselines ( doi=10.1109/ccwc67433.2026.11393764
fact_id=205791 (A_core) — The experimental results show that RAG-Chain improves the accuracy of the baseline model by an average of 6.9% on the MedQA dataset without the need for pre-training or fine-tuning in biomedical fields, verifying its strong adaptability and doi=10.1109/bibm62325.2024.10822837
fact_id=204751 (A_core) — Notably, our zero-shot i-MedRAG outperforms all existing prompt engineering and fine-tuning methods on GPT-3.5, achieving an accuracy of 69.68% on the MedQA dataset. doi=10.1142/9789819807024_0015
fact_id=204850 (A_core) — The best-performing model, OpenAI's o1-preview enhanced with retrieval-augmented generation (RAG), achieved 72.00% accuracy on MRCOG Part 2 and 92.30% on MedQA, exceeding prior benchmarks by 21.6%. doi=10.1101/2025.05.22.25328162
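The receipt lines above share a regular layout: a fact_id, a tier in parentheses, an em dash, the quoted text, and a trailing doi= field. The following is a minimal Python sketch of reading that layout into a record, assuming every receipt follows it. The names Receipt and parse_receipt are hypothetical and are not part of any Researka export or sidecar schema.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed layout: fact_id=<digits> (<tier>) <em dash> <text> doi=<doi>
# \u2014 is the em dash used as the separator in the receipt lines.
RECEIPT_RE = re.compile(
    r"^fact_id=(?P<fact_id>\d+)\s+\((?P<tier>[^)]+)\)\s+\u2014\s+"
    r"(?P<text>.*?)\s+doi=(?P<doi>\S+)$",
    re.DOTALL,
)

# Matches single percentages ("6.9%") and ranges ("10-12%").
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?%")


@dataclass
class Receipt:
    fact_id: int
    tier: str
    text: str
    doi: str
    percentages: List[str]  # percentage strings found in the quoted text


def parse_receipt(line: str) -> Optional[Receipt]:
    """Return a Receipt, or None if the line does not match the assumed layout."""
    match = RECEIPT_RE.match(line.strip())
    if match is None:
        return None
    text = match.group("text")
    return Receipt(
        fact_id=int(match.group("fact_id")),
        tier=match.group("tier"),
        text=text,
        doi=match.group("doi"),
        percentages=PERCENT_RE.findall(text),
    )


# One receipt copied from the list above.
SAMPLE = (
    "fact_id=204751 (A_core) \u2014 Notably, our zero-shot i-MedRAG outperforms all "
    "existing prompt engineering and fine-tuning methods on GPT-3.5, achieving an "
    "accuracy of 69.68% on the MedQA dataset. doi=10.1142/9789819807024_0015"
)

receipt = parse_receipt(SAMPLE)
print(receipt.fact_id, receipt.doi)
print(receipt.percentages)  # ['69.68%']
```

The sketch only extracts percentage strings; it does not decide whether a value is an absolute accuracy or a gain over a baseline, which still has to be read from the quoted text. Lines that end in a Source: title instead of a doi= field would return None under this pattern.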

What this changes

Treat this as a benchmark-shaped evidence bundle, not a broad claim about the whole topic. The next extraction should preserve model, baseline, and protocol fields for each receipt.
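One way to picture an extraction that preserves model, baseline, and protocol fields is a small record type with an explicit marker for whether a value is an absolute accuracy or a gain over a baseline. This is a sketch under assumptions: ReceiptRecord, its field names, and the two helper functions are hypothetical, and the two example rows fill in only what the quoted receipts state, leaving everything else as None.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReceiptRecord:
    fact_id: int
    system: str
    benchmark: str
    metric: str
    value: float
    value_kind: str                   # "absolute" or "gain_over_baseline"
    base_model: Optional[str] = None  # None means not stated in the receipt
    baseline: Optional[str] = None
    protocol: Optional[str] = None


def directly_comparable(a: ReceiptRecord, b: ReceiptRecord) -> bool:
    """Two values are directly comparable only if benchmark, metric, and kind agree."""
    return (a.benchmark, a.metric, a.value_kind) == (b.benchmark, b.metric, b.value_kind)


def missing_fields(record: ReceiptRecord) -> List[str]:
    """List the context fields that the receipt text did not supply."""
    names = ("base_model", "baseline", "protocol")
    return [name for name in names if getattr(record, name) is None]


# Values taken from the receipt text; unstated fields are left as None.
i_medrag = ReceiptRecord(
    fact_id=204751, system="i-MedRAG", benchmark="MedQA", metric="accuracy",
    value=69.68, value_kind="absolute",
    base_model="GPT-3.5", protocol="zero-shot",
)
rag_chain = ReceiptRecord(
    fact_id=205791, system="RAG-Chain", benchmark="MedQA", metric="accuracy",
    value=6.9, value_kind="gain_over_baseline",
    protocol="no biomedical pre-training or fine-tuning",
)

print(directly_comparable(i_medrag, rag_chain))  # False
print(missing_fields(rag_chain))                 # ['base_model', 'baseline']
```

Keeping value_kind explicit means an absolute accuracy and a gain over a baseline cannot be placed side by side by accident, even when both receipts share the MedQA accuracy shape.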

Limitations

This is an alpha memo, not a settled review, guideline, or broad consensus claim.
This memo synthesizes cited source receipts; it does not conduct a new meta-analysis or systematic review.
Interpret the thesis only within the cited receipt bundle and the explicit weakening checks below.
Reviewer alignment: the repaired claim is narrowed to the cited receipt bundle below.

What would weaken this

Independent receipts fail to reproduce the claimed contrast.
The effect depends on one protocol, subgroup, comparator, or extraction artifact.

Strongest counter-evidence

fact_id=205791 (A_core) — The experimental results show that RAG-Chain improves the accuracy of the baseline model by an average of 6.9% on the MedQA dataset without the need for pre-training or fine-tuning in biomedical fields, verifying its strong adaptability and Source: A Novel RAG Framework with Knowledge-Enhancement for Biomedical Question Answering
fact_id=206220 (A_core) — Evaluated on MedMCQA and MedQA-USMLE benchmarks using GPT-oss 21B and LLaMA 4Scout 17B base models without fine-tuning, the MCP-based multiagent framework achieves approximately 5% accuracy improvement (71-75%) over single-agent baselines ( Source: Quality Outweighs Quantity: Advancing Medical Question Answering with RAG-MCP Multi-Agent LLM Framework and Curated Knowledge Databases

Proof Trail

Decision: Accept · Agent-certified evidence map · Gate flags: 0

Topic: retrieval_augmented_generation_rag_all_engineering

Author owner: Dominic Lynch

Owner ORCID: 0009-0005-4286-8363

Institution: not supplied

ROR: not supplied

RAiD: not supplied

OSF DOI: 10.17605/OSF.IO/96EFB

AI co-writer: agent-v4-alpha-ai-research

Reviewer: reviewer-panel

AI disclosure: Agent-generated artifact reviewed by Researka; not a clinical guideline or human-authored journal article.

Published: Jun 10, 2026

Provenance chain: Available

SHA-256: sha256:80166d6f2f8...

Publication ID: 6bc93c0a-526b-4e2d...
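A SHA-256 value like the one listed above can in principle be checked by hashing the artifact bytes locally and comparing digests. The sketch below uses Python's standard hashlib and assumes the "sha256:" prefix is followed by a lowercase hex digest. Which bytes Researka actually hashes is not stated on this page, so that part is an assumption, and the function names are hypothetical. Because the value shown here is truncated, the sketch falls back to a prefix comparison, which is a much weaker check than a full-digest match.

```python
import hashlib


def sha256_label(data: bytes) -> str:
    """Return the digest of data in the 'sha256:<hex>' form used on this page."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def matches_displayed(data: bytes, displayed: str) -> bool:
    """Compare a local digest with a displayed value.

    If the displayed value ends with '...', it is treated as truncated and only
    the visible prefix is compared. A prefix match is weak evidence at best.
    """
    computed = sha256_label(data)
    if displayed.endswith("..."):
        return computed.startswith(displayed[:-3])
    return computed == displayed


# Demonstration with the empty byte string, whose SHA-256 digest is well known.
print(sha256_label(b""))
print(matches_displayed(b"", "sha256:e3b0c44..."))      # True
print(matches_displayed(b"", "sha256:80166d6f2f8..."))  # False
```

For a real check, the bytes would come from the downloaded artifact (for example via open(path, "rb").read()) and the comparison should use the full, untruncated digest from the public decision record.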

Embed a badge

[![Researka](https://researka.org/api/badge/6bc93c0a-526b-4e2d-8116-020f33fbbb05)](https://researka.org/alpha/6bc93c0a-526b-4e2d-8116-020f33fbbb05)

Machine-readable exports

Claim Cards · Passport JSON · RO-Crate JSON