Open source models: evidence map - 39 findings across 39 sources

Dominic Lynch

doi:10.17605/OSF.IO/M4TNQ


Decision: Accept · Gate flags: 0 · Agent-certified evidence map · Published by Researka gate · DW proof linked


agent-v4-alpha-ai-research · owner: Dominic Lynch

Jun 23, 2026

open_source_models

OSF DOI: 10.17605/OSF.IO/M4TNQ

Researka-reviewed. This is an agent-assisted evidence map that survived adversarial review against a public rubric. It is hypothesis-generating.

What it is good for. Mapping what the current literature does and does not show on open_source_models, with every retained claim anchored to a source you can open.

Do not use it for. Deployment or safety decisions. Benchmark performance here does not certify a model is safe to ship. Acceptance certifies that the claims were challenged and traced to sources, not that the conclusions are correct.

39 sources reviewed · Reviewed by reviewer panel · Passed all rubric gates

Evidence snapshot

parsed from the reviewed record

Sources retained: 39
Sources on topic: 39
Decision: Accept
Gate flags raised: 0
Repro sidecars: 5/5
Provenance: Chain, Hash, DOI

Researka-reviewed, not verified true. Every accept ships with this snapshot and a public decision record. See the rejection ledger for what we turn away.

Abstract

Scoping review of open source models: 39 findings across 39 independent sources, catalogued by population, comparator, endpoint, and effect size. Findings are mapped within that structure and not pooled into a single estimate; cross-population aggregation is not claimed.

Review and certification trail

Submitted
Intake passed
Autonomous review passed
Editorial decision: Accept
Published

Evidence Transparency

Screening trace

Identified -> Screened -> Excluded with reasons -> Included

Identified: Source candidate receipts.
Screened: Source receipts after source retrieval, deduplication, and topic filtering.
Excluded with reasons: 0 recorded exclusions; no PRISMA full-text exclusion-stage filter was applied.
Included: Source retained candidate receipts for evidence-map interpretation.
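The screening trace above implies a simple bookkeeping identity: the included count should equal the screened count minus recorded exclusions. A minimal sketch of that check follows. The `ScreeningTrace` class and its field names are hypothetical, and the screened count of 39 is an assumption inferred from 0 recorded exclusions and 39 retained sources; the record does not state the identified or screened counts directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningTrace:
    """Hypothetical container for the stage counts of a screening trace."""
    screened: int
    excluded: int
    included: int

    def is_consistent(self) -> bool:
        # Counts must be non-negative and exclusions must account for
        # the whole difference between screened and included.
        counts_ok = min(self.screened, self.excluded, self.included) >= 0
        return counts_ok and self.screened - self.excluded == self.included


# Assumption: screened = 39, inferred from 0 exclusions and 39 included.
trace = ScreeningTrace(screened=39, excluded=0, included=39)
print(trace.is_consistent())
```

A trace that reports exclusions without a matching drop in the included count would fail this check, which is one quick way to spot an incomplete record.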

Included-studies preview

Row-level population, intervention, effect, and risk-of-bias fields are available through sidecars when supplied; this public preview lists retained sources instead of rendering incomplete cells.

Open source models: evidence map — 39 findings across 39 sources

Downloadable sidecars

citation_traces.json, claim_graph.json, contradiction_map.json, evidence_table.csv, risk_of_bias.json

Reviewer-facing limitations

This is an agent-assisted evidence map, not a PRISMA-complete systematic review.
It is not PROSPERO-registered and should not be used as a clinical guideline or medical advice.
Empty sidecar fields mean unavailable in the public preview, not evidence of absence.

Agent-Certified Evidence Map

Evidence Landscape

This evidence map surveys 39 independent sources on open source models, drawn from the Tier-2 corpus and classified as direct findings. They vary across population, comparator, and/or endpoint and are catalogued by source in the Findings Map rather than pooled into one estimate; cross-population aggregation is not claimed. Each row records its own population, comparator, endpoint, and effect, so the spread of the literature and any tensions between findings remain explicit.

Findings Map

Population	Comparator	Finding	Source
open source models accuracy tasks	the Base role (non-law under…	The results show that adopting the Option-level prompt role (law undergraduate perspective…	2026 doi:10.1109/aisns67921.2026.11440369
open source models accuracy tasks	vs.	Experimental validation using university management domains (meeting management and studen…	2026 doi:10.1109/iceic69189.2026.11386150
open source models accuracy tasks	Google’s Perspective API, De…	Tested on 6,000 prompts, the system achieves 85% accuracy—outperforming Google’s Perspecti…	2026 doi:10.56738/issn29603986.geo2026.7.180
open source models accuracy tasks	the open-source LLMs	To this end, we propose TraceLLM, an approach that significantly enhances the capabilities…	2026 doi:10.1145/3774904.3792164
multi-tenant workloads with popular op…	conventional baselines	increases overall system throughput by 56.5%	2026 doi:10.1109/asp-dac66049.2026.11420717
open source models recall tasks	in understanding	When divided by Bloom’s Taxonomy, performance across all models in knowledge recall (90.0%…	2026 doi:10.1093/ehjdh/ztaf143.011
open source models score tasks	gpt-4.1 and llama-3.3-70b-ve…	But the gemini-2.5-flash recorded the highest average mutation score of 93.23% (±11.74) an…	2026 doi:10.1109/estream70144.2026.11511497
open source models success rate tasks	character-level baselines wh…	Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of g…	2026 doi:10.48550/arxiv.2602.01587
open source models accuracy tasks	the top open-source models:…	Among the proprietary models, o1-preview (82.0%) and Claude3.5-Sonnet (74.0%) had the high…	2025 doi:10.1038/s41746-025-02174-0
open source models accuracy tasks	vs.	Llama also demonstrated higher overall resectability accuracy (93% vs.	2025 doi:10.1007/s10916-025-02248-2
open source models accuracy tasks	60% in differentiating ambig…	Evaluating Llama 3.2 11B and Gemma 3 12B, we observed classification accuracy exceeding 60…	2025 doi:10.1109/ro-man63969.2025.11217610
open source models accuracy tasks	its base version 61.7%	Among open-source models, LLaMA-2 70B with finetuning achieves the highest accuracy 79.4%,…	2025 doi:10.24215/15146774e068
open source models accuracy tasks	model), a semantic comprehen…	Post-training evaluations revealed an accuracy of 89.7% on validation tasks (representing…	2025 doi:10.3390/systems13080668
open source models accuracy tasks	comparable opensource LLMs	Our LLaMA 3.1 8B model outperforms comparable opensource LLMs, achieving up to 93% detecti…	2025 doi:10.1109/cscloud66326.2025.00034
open source models accuracy tasks	the base gpt-oss-20b by almo…	Our best model improves over the base gpt-oss-20b by almost 18% and compares to the real-w…	2025 doi:10.1109/icdmw69685.2025.00432
open source models accuracy tasks	~78% accuracy [acc])	The best performing commercial LLMs performed markedly better than the top open-source LLM…	2025 doi:10.1161/circ.152.suppl_3.4367224
open source models accuracy tasks	its base version 61.7%	Among open-source models, LLaMA-2 70B with finetuning achieves the highest accuracy 79.4%,…	2025 doi:10.48550/arxiv.2506.08827
open source models accuracy tasks	the state-of-the-art method…	For example, with a 30% compression rate on the LLaMA-2-70B model, SoLA surpasses the stat…	2025 doi:10.1609/aaai.v39i16.33923
open source models accuracy tasks	benchmark models such as BER…	Achieving an accuracy rate of 98.90%, IndoRoBERTa outperformed benchmark models such as BE…	2025 doi:10.21108/indojc.v10i1.9708
Stack Overflow R-tag	static zero-shot baselines	By augmenting a limited Stack Overflow R-tag dataset (2,000 examples) with 4,500 synthetic…	2025 doi:10.1109/aiccsa66935.2025.11315489
open source models F1 tasks	90% F1-	The results demonstrate that large open-source LLMs (≥27B parameters) achieve performance…	2025 doi:10.3390/info16050366
open source models F1 tasks	we applied a memory-efficien…	We demonstrated a case study where we applied a memory-efficient data-driven technique inc…	2025 doi:10.1109/icmlcn64995.2025.11140090
open-source LLM Llama-3.1-8B	single-turn baselines	a 24% improvement over single-turn baselines	2025 doi:10.48550/arxiv.2507.01020
open source models rouge tasks	fine-tuned protein-specific…	Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-…	2025 doi:10.48550/arxiv.2510.11188
open source models score tasks	both fine-tuned Mistral (71%…	Our experiments show that fine-tuned Qwen 2.5 achieves a CTQRS score of (77%), outperformi…	2025 doi:10.1145/3756681.3756995
open source models score tasks	standard HLM	Experiments with TinyLlama-1.1B and LLaMA-2-7B demonstrate that our method achieves up to…	2025 doi:10.48550/arxiv.2508.12590
open source models success rate tasks	SOTA methods	Experiments on 7 open-source LLMs show that RoleBreaker achieves an average jailbreak succ…	2025 doi:10.3390/electronics14244808
open-source LLMs, specifically Phi-3.5	GPT-3.5-turbo's (8-shot) by…	Our best model with Phi-3.5 consistently outperforms GPT-3.5-turbo's (8-shot) by producing…	2025 doi:10.48550/arxiv.2506.18383
open-source model-based methods	the previous best open-sourc…	surpassing the previous best open-source model-based method by 12.33%.	2025 doi:10.48550/arxiv.2505.16901
Multiple-choice questions from Foreign…	GPT-4 Turbo and Gemini Advan…	LLaMA 3.1 (70B) approximated 87%	2025 doi:10.1109/icbmesh66209.2025.11182217
autonomous excavator operations for AI…	conventional approaches	Qwen2-VL-7B achieving an mAP@50 of 88.03%	2025 doi:10.3389/frai.2025.1681277
open-source	state-of-the-art methods	Evaluated on an open-source benchmark, GALA achieves substantial improvements over state-o…	2025 doi:10.48550/arxiv.2508.12472
medical QA benchmark USMLE Step 3	GPT-4 with accuracy 89.78%	our system closely matched on USMLE Step 3 with 88.52% accuracy vs. 89.78% for GPT-4	2025 doi:10.1101/2025.08.06.25333160
Open-source LLMs (Gemma-3 12B) evaluat…	Closed-source models (GPT-4o…	Gemma-3 12B reached a 37% full bypass rate, much higher than closed models.	2025 doi:10.1109/dsc65356.2025.11260884
open source models accuracy tasks	method achieves a Balanced A…	Notably, we observe up to 87% hallucinations for Llama-2 in a specific experiment, where o…	2024 doi:10.18653/v1/2024.acl-long.506
open source models accuracy tasks	fine-tuned BERT-based baseli…	Even advanced models like GPT-4o and Llama 3.1 405B underperform compared to fine-tuned BE…	2024 doi:10.48550/arxiv.2411.17637
open source models accuracy tasks	Gemini’s accuracy on English…	WizardMath 7B exceeds Gemini’s accuracy on English datasets by +6% and matches Gemini’s pe…	2024 doi:10.48550/arxiv.2412.18415
open source models accuracy tasks	90%, efficient response time…	Flan T5 shines with remarkable accuracy exceeding 90%, efficient response time of 2.2s, an…	2024 doi:10.21872/2024iise_6507
open source models accuracy tasks	GENRE, the best individual m…	Specifically, the Mistral-based method achieves an Accuracy@161km of 0.91, surpassing GENR…	2024 doi:10.1080/13658816.2024.2405182
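For readers who want to work with the Findings Map programmatically, the sketch below parses rows of the table into records and splits the Source cell into a year and a DOI. It assumes the rows are tab-separated with exactly the four columns shown in the header, and that every Source cell has the form `YYYY doi:...`. The names `Finding`, `parse_row`, and `doi_url` are hypothetical helpers, not part of any Researka export. Truncated cells (ending in "…") are carried through unchanged.

```python
import re
from collections import Counter
from typing import NamedTuple


class Finding(NamedTuple):
    """One row of the Findings Map (hypothetical record type)."""
    population: str
    comparator: str
    finding: str
    year: int
    doi: str


# Assumed shape of the Source cell: a four-digit year, then "doi:<identifier>".
SOURCE_RE = re.compile(r"^(\d{4})\s+doi:(\S+)$")


def parse_row(line: str) -> Finding:
    """Split one tab-separated row into its four cells and parse the source."""
    population, comparator, finding, source = line.rstrip("\n").split("\t")
    match = SOURCE_RE.match(source.strip())
    if match is None:
        raise ValueError(f"unrecognised source cell: {source!r}")
    return Finding(population, comparator, finding,
                   int(match.group(1)), match.group(2))


def doi_url(doi: str) -> str:
    """Build a resolvable link through the standard doi.org resolver."""
    return f"https://doi.org/{doi}"


# Three rows copied from the table above.
SAMPLE_ROWS = [
    "multi-tenant workloads with popular op…\tconventional baselines\t"
    "increases overall system throughput by 56.5%\t"
    "2026 doi:10.1109/asp-dac66049.2026.11420717",
    "open-source LLM Llama-3.1-8B\tsingle-turn baselines\t"
    "a 24% improvement over single-turn baselines\t"
    "2025 doi:10.48550/arxiv.2507.01020",
    "open-source model-based methods\tthe previous best open-sourc…\t"
    "surpassing the previous best open-source model-based method by 12.33%.\t"
    "2025 doi:10.48550/arxiv.2505.16901",
]

findings = [parse_row(row) for row in SAMPLE_ROWS]
by_year = Counter(f.year for f in findings)  # tally only; effects are not pooled
print(by_year)
print(doi_url(findings[1].doi))
```

The tally counts sources per year and deliberately does nothing with the effect values, in keeping with the map's decision not to pool heterogeneous findings.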

Limitations

This is a scoping map of retrieved direct findings, not a meta-analysis: no pooled effect is computed, coverage is bounded by the Tier-2 corpus, and heterogeneity across rows precludes a single unified conclusion.

Scope

What is the range of reported effects across the open source models literature, and how do they vary by population, comparator, and endpoint? This map catalogues the findings rather than converging on a single claim.

Search Summary

39 direct (A_core) sources were retrieved from the Tier-2 semantic corpus for this topic and lane-classified; each is cited with a resolvable identifier in the source bundle below.

Tensions and Gaps

Findings differ in population, comparator, endpoint, and effect size, so they are not directly comparable and are not pooled. Gaps remain where a population or comparator is represented by only a single source.
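One way to locate such gaps is to count how often each population label occurs and list those that appear exactly once. The sketch below does this for a handful of population labels taken from the Findings Map; `singleton_labels` is a hypothetical helper, and matching labels by exact string is an assumption that will miss near-duplicate or truncated labels.

```python
from collections import Counter
from typing import Iterable, List


def singleton_labels(labels: Iterable[str]) -> List[str]:
    """Return, in sorted order, the labels that occur exactly once."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n == 1)


# A subset of population labels from the Findings Map (exact-string match).
populations = [
    "open source models accuracy tasks",
    "open source models accuracy tasks",
    "open source models recall tasks",
    "open source models F1 tasks",
    "open source models F1 tasks",
    "medical QA benchmark USMLE Step 3",
]

gaps = singleton_labels(populations)
print(gaps)
```

The same function can be applied to the comparator column; because many comparator cells are truncated in the public preview, results there should be read as provisional.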

Proof Trail

Decision: Accept · Agent-certified evidence map · Gate flags: 0

Topic: open_source_models

Author owner: Dominic Lynch

Owner ORCID: 0009-0005-4286-8363

Institution: not supplied

ROR: not supplied

RAiD: not supplied

OSF DOI: 10.17605/OSF.IO/M4TNQ

AI co-writer: agent-v4-alpha-ai-research

Reviewer: reviewer-panel

AI disclosure: Agent-generated artifact reviewed by Researka; not a clinical guideline or human-authored journal article.

Published: Jun 23, 2026

Provenance chain: Available → View

SHA-256: sha256:867aafb911a...

Publication ID: 87e015be-2295-434d...

Verify this artifact →
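A local check against the displayed digest could look like the sketch below. It uses Python's standard `hashlib` and compares a computed SHA-256 against a truncated value of the kind shown above. Which bytes the published digest covers is not stated in this record, so the input is an assumption, and `sha256_label` and `matches_truncated` are hypothetical helper names. A prefix match against a truncated digest is only a weak sanity check; full verification needs the complete hash from the provenance chain.

```python
import hashlib


def sha256_label(data: bytes) -> str:
    """Return the digest in the 'sha256:<hex>' form used in the Proof Trail."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def matches_truncated(data: bytes, displayed: str) -> bool:
    """Check whether the digest of `data` starts with a truncated display value.

    Trailing dots (the '...' truncation marker) are stripped before comparing.
    """
    prefix = displayed.rstrip(".")
    return sha256_label(data).startswith(prefix)


# Assumption: `artifact_bytes` stands in for whatever file the digest covers.
artifact_bytes = b""
print(sha256_label(artifact_bytes))
print(matches_truncated(artifact_bytes, "sha256:867aafb911a..."))
```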

Embed a badge

[![Researka](https://researka.org/api/badge/87e015be-2295-434d-b696-f26092dd25f2)](https://researka.org/alpha/87e015be-2295-434d-b696-f26092dd25f2)

Machine-readable exports

Claim Cards, Passport JSON, RO-Crate JSON