{"publication_id":"6c57c982-baf4-481a-ae96-487d29a8299d","traces":[{"claim_id":"claim_1","claim":"Interpretation note: This is a hypothesis-generating alpha memo, not confirmatory evidence; subgroup or context-derived claims require independent replication.","candidate_sources":[{"study":"Large Language Models Encode Clinical Knowledge","doi":"10.48550/arxiv.2212.13138","url":null},{"study":"Large language models encode clinical knowledge","doi":"10.1038/s41586-023-06291-2","url":null},{"study":"FUO_ED: A Dataset for Evaluating the Performance of Large Language Models in Diagnosing Complex Cases of Fever of Unknown Origin","doi":"10.1145/3718391.3718410","url":null},{"study":"OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models","doi":"10.1038/s41598-024-64827-6","url":null},{"study":"Benchmarking large language model-based agent systems for clinical decision tasks.","doi":"10.1038/s41746-026-02443-6","url":null}]},{"claim_id":"claim_2","claim":"Bounded research question: Do independent direct receipts on MedQA continue to support a signal on accuracy for the cited systems when comparators are kept explicit?","candidate_sources":[{"study":"Large Language Models Encode Clinical Knowledge","doi":"10.48550/arxiv.2212.13138","url":null},{"study":"Large language models encode clinical knowledge","doi":"10.1038/s41586-023-06291-2","url":null},{"study":"FUO_ED: A Dataset for Evaluating the Performance of Large Language Models in Diagnosing Complex Cases of Fever of Unknown Origin","doi":"10.1145/3718391.3718410","url":null},{"study":"OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models","doi":"10.1038/s41598-024-64827-6","url":null},{"study":"Benchmarking large language model-based agent systems for clinical decision tasks.","doi":"10.1038/s41746-026-02443-6","url":null}]},{"claim_id":"claim_3","claim":"Treat this as a benchmark-shaped evidence bundle, not a broad claim about the whole topic. The next extraction should preserve model, baseline, and protocol fields for each receipt.","candidate_sources":[{"study":"Large Language Models Encode Clinical Knowledge","doi":"10.48550/arxiv.2212.13138","url":null},{"study":"Large language models encode clinical knowledge","doi":"10.1038/s41586-023-06291-2","url":null},{"study":"FUO_ED: A Dataset for Evaluating the Performance of Large Language Models in Diagnosing Complex Cases of Fever of Unknown Origin","doi":"10.1145/3718391.3718410","url":null},{"study":"OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models","doi":"10.1038/s41598-024-64827-6","url":null},{"study":"Benchmarking large language model-based agent systems for clinical decision tasks.","doi":"10.1038/s41746-026-02443-6","url":null}]},{"claim_id":"claim_4","claim":"No direct opposing receipt was selected by this run. Treat that as a bundle limitation, not a claim that the wider literature has no counter-evidence.","candidate_sources":[{"study":"Large Language Models Encode Clinical Knowledge","doi":"10.48550/arxiv.2212.13138","url":null},{"study":"Large language models encode clinical knowledge","doi":"10.1038/s41586-023-06291-2","url":null},{"study":"FUO_ED: A Dataset for Evaluating the Performance of Large Language Models in Diagnosing Complex Cases of Fever of Unknown Origin","doi":"10.1145/3718391.3718410","url":null},{"study":"OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models","doi":"10.1038/s41598-024-64827-6","url":null},{"study":"Benchmarking large language model-based agent systems for clinical decision tasks.","doi":"10.1038/s41746-026-02443-6","url":null}]}]}