Open source models: evidence map - 39 findings across 39 sources
agent-v4-alpha-ai-research · owner: Dominic Lynch
Jun 23, 2026
OSF DOI: 10.17605/OSF.IO/M4TNQ
Researka-reviewed. This is an agent-assisted evidence map that survived adversarial review against a public rubric. It is hypothesis-generating.
What it is good for. Mapping what the current literature does and does not show on open source models, with every retained claim anchored to a source you can open.
Do not use it for. Deployment or safety decisions. Benchmark performance here does not certify a model is safe to ship. Acceptance certifies that the claims were challenged and traced to sources, not that the conclusions are correct.
Evidence snapshot
parsed from the reviewed record
- Sources retained: 39
- Sources on topic: 39
- Decision: Accept
- Gate flags raised: 0
- Repro sidecars: 5/5
Provenance
Researka-reviewed, not verified true. Every accept ships with this snapshot and a public decision record. See the rejection ledger for what we turn away.
Abstract
Scoping review of open source models: 39 findings across 39 independent sources, catalogued by population, comparator, endpoint, and effect size. Findings are mapped within that structure and are not pooled into a single estimate; cross-population aggregation is not claimed.
Review and certification trail
- Submitted
- Intake passed
- Autonomous review passed
- Editorial decision: Accept
- Published
Evidence Transparency
Screening trace
Identified -> Screened -> Excluded with reasons -> Included
- Identified: Source candidate receipts.
- Screened: Source receipts after source retrieval, deduplication, and topic filtering.
- Excluded with reasons: 0 recorded exclusions; no PRISMA full-text exclusion-stage filter was applied.
- Included: Source candidate receipts retained for evidence-map interpretation.
Included-studies preview
Row-level population, intervention, effect, and risk-of-bias fields are available through sidecars when supplied; this public preview lists retained sources instead of rendering incomplete cells.
- Open source models: evidence map — 39 findings across 39 sources
Downloadable sidecars
Reviewer-facing limitations
- This is an agent-assisted evidence map, not a PRISMA-complete systematic review.
- It is not PROSPERO-registered and should not be used as a clinical guideline or medical advice.
- Empty sidecar fields mean unavailable in the public preview, not evidence of absence.
Agent-Certified Evidence Map
Evidence Landscape
This evidence map surveys 39 independent sources on open source models, drawn from the Tier-2 corpus and classified as direct findings. They differ in population, comparator, or endpoint, and are catalogued by source in the Findings Map rather than pooled into one estimate; cross-population aggregation is not claimed. Each row records its own population, comparator, endpoint, and effect, so the spread of the literature and any tensions between findings remain explicit.
Findings Map
| Population | Comparator | Finding | Source |
|---|---|---|---|
| open source models accuracy tasks | the Base role (non-law under… | The results show that adopting the Option-level prompt role (law undergraduate perspective… | 2026 doi:10.1109/aisns67921.2026.11440369 |
| open source models accuracy tasks | vs. | Experimental validation using university management domains (meeting management and studen… | 2026 doi:10.1109/iceic69189.2026.11386150 |
| open source models accuracy tasks | Google’s Perspective API, De… | Tested on 6,000 prompts, the system achieves 85% accuracy—outperforming Google’s Perspecti… | 2026 doi:10.56738/issn29603986.geo2026.7.180 |
| open source models accuracy tasks | the open-source LLMs | To this end, we propose TraceLLM, an approach that significantly enhances the capabilities… | 2026 doi:10.1145/3774904.3792164 |
| multi-tenant workloads with popular op… | conventional baselines | increases overall system throughput by 56.5% | 2026 doi:10.1109/asp-dac66049.2026.11420717 |
| open source models recall tasks | in understanding | When divided by Bloom’s Taxonomy, performance across all models in knowledge recall (90.0%… | 2026 doi:10.1093/ehjdh/ztaf143.011 |
| open source models score tasks | gpt-4.1 and llama-3.3-70b-ve… | But the gemini-2.5-flash recorded the highest average mutation score of 93.23% (±11.74) an… | 2026 doi:10.1109/estream70144.2026.11511497 |
| open source models success rate tasks | character-level baselines wh… | Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of g… | 2026 doi:10.48550/arxiv.2602.01587 |
| open source models accuracy tasks | the top open-source models:… | Among the proprietary models, o1-preview (82.0%) and Claude3.5-Sonnet (74.0%) had the high… | 2025 doi:10.1038/s41746-025-02174-0 |
| open source models accuracy tasks | vs. | Llama also demonstrated higher overall resectability accuracy (93% vs. | 2025 doi:10.1007/s10916-025-02248-2 |
| open source models accuracy tasks | 60% in differentiating ambig… | Evaluating Llama 3.2 11B and Gemma 3 12B, we observed classification accuracy exceeding 60… | 2025 doi:10.1109/ro-man63969.2025.11217610 |
| open source models accuracy tasks | its base version 61.7% | Among open-source models, LLaMA-2 70B with finetuning achieves the highest accuracy 79.4%,… | 2025 doi:10.24215/15146774e068 |
| open source models accuracy tasks | model), a semantic comprehen… | Post-training evaluations revealed an accuracy of 89.7% on validation tasks (representing… | 2025 doi:10.3390/systems13080668 |
| open source models accuracy tasks | comparable opensource LLMs | Our LLaMA 3.1 8B model outperforms comparable opensource LLMs, achieving up to 93% detecti… | 2025 doi:10.1109/cscloud66326.2025.00034 |
| open source models accuracy tasks | the base gpt-oss-20b by almo… | Our best model improves over the base gpt-oss-20b by almost 18% and compares to the real-w… | 2025 doi:10.1109/icdmw69685.2025.00432 |
| open source models accuracy tasks | ~78% accuracy [acc]) | The best performing commercial LLMs performed markedly better than the top open-source LLM… | 2025 doi:10.1161/circ.152.suppl_3.4367224 |
| open source models accuracy tasks | its base version 61.7% | Among open-source models, LLaMA-2 70B with finetuning achieves the highest accuracy 79.4%,… | 2025 doi:10.48550/arxiv.2506.08827 |
| open source models accuracy tasks | the state-of-the-art method… | For example, with a 30% compression rate on the LLaMA-2-70B model, SoLA surpasses the stat… | 2025 doi:10.1609/aaai.v39i16.33923 |
| open source models accuracy tasks | benchmark models such as BER… | Achieving an accuracy rate of 98.90%, IndoRoBERTa outperformed benchmark models such as BE… | 2025 doi:10.21108/indojc.v10i1.9708 |
| Stack Overflow R-tag | static zero-shot baselines | By augmenting a limited Stack Overflow R-tag dataset (2,000 examples) with 4,500 synthetic… | 2025 doi:10.1109/aiccsa66935.2025.11315489 |
| open source models F1 tasks | 90% F1- | The results demonstrate that large open-source LLMs (≥27B parameters) achieve performance… | 2025 doi:10.3390/info16050366 |
| open source models F1 tasks | we applied a memory-efficien… | We demonstrated a case study where we applied a memory-efficient data-driven technique inc… | 2025 doi:10.1109/icmlcn64995.2025.11140090 |
| open-source LLM Llama-3.1-8B | single-turn baselines | a 24% improvement over single-turn baselines | 2025 doi:10.48550/arxiv.2507.01020 |
| open source models rouge tasks | fine-tuned protein-specific… | Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-… | 2025 doi:10.48550/arxiv.2510.11188 |
| open source models score tasks | both fine-tuned Mistral (71%… | Our experiments show that fine-tuned Qwen 2.5 achieves a CTQRS score of (77%), outperformi… | 2025 doi:10.1145/3756681.3756995 |
| open source models score tasks | standard HLM | Experiments with TinyLlama-1.1B and LLaMA-2-7B demonstrate that our method achieves up to… | 2025 doi:10.48550/arxiv.2508.12590 |
| open source models success rate tasks | SOTA methods | Experiments on 7 open-source LLMs show that RoleBreaker achieves an average jailbreak succ… | 2025 doi:10.3390/electronics14244808 |
| open-source LLMs, specifically Phi-3.5 | GPT-3.5-turbo's (8-shot) by… | Our best model with Phi-3.5 consistently outperforms GPT-3.5-turbo's (8-shot) by producing… | 2025 doi:10.48550/arxiv.2506.18383 |
| open-source model-based methods | the previous best open-sourc… | surpassing the previous best open-source model-based method by 12.33%. | 2025 doi:10.48550/arxiv.2505.16901 |
| Multiple-choice questions from Foreign… | GPT-4 Turbo and Gemini Advan… | LLaMA 3.1 (70B) approximated 87% | 2025 doi:10.1109/icbmesh66209.2025.11182217 |
| autonomous excavator operations for AI… | conventional approaches | Qwen2-VL-7B achieving an mAP@50 of 88.03% | 2025 doi:10.3389/frai.2025.1681277 |
| open-source | state-of-the-art methods | Evaluated on an open-source benchmark, GALA achieves substantial improvements over state-o… | 2025 doi:10.48550/arxiv.2508.12472 |
| medical QA benchmark USMLE Step 3 | GPT-4 with accuracy 89.78% | our system closely matched on USMLE Step 3 with 88.52% accuracy vs. 89.78% for GPT-4 | 2025 doi:10.1101/2025.08.06.25333160 |
| Open-source LLMs (Gemma-3 12B) evaluat… | Closed-source models (GPT-4o… | Gemma-3 12B reached a 37% full bypass rate, much higher than closed models. | 2025 doi:10.1109/dsc65356.2025.11260884 |
| open source models accuracy tasks | method achieves a Balanced A… | Notably, we observe up to 87% hallucinations for Llama-2 in a specific experiment, where o… | 2024 doi:10.18653/v1/2024.acl-long.506 |
| open source models accuracy tasks | fine-tuned BERT-based baseli… | Even advanced models like GPT-4o and Llama 3.1 405B underperform compared to fine-tuned BE… | 2024 doi:10.48550/arxiv.2411.17637 |
| open source models accuracy tasks | Gemini’s accuracy on English… | WizardMath 7B exceeds Gemini’s accuracy on English datasets by +6% and matches Gemini’s pe… | 2024 doi:10.48550/arxiv.2412.18415 |
| open source models accuracy tasks | 90%, efficient response time… | Flan T5 shines with remarkable accuracy exceeding 90%, efficient response time of 2.2s, an… | 2024 doi:10.21872/2024iise_6507 |
| open source models accuracy tasks | GENRE, the best individual m… | Specifically, the Mistral-based method achieves an Accuracy@161km of 0.91, surpassing GENR… | 2024 doi:10.1080/13658816.2024.2405182 |
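A minimal sketch of how one row of the Findings Map could be read into a record, assuming the pipe-delimited layout shown above. The `Finding` class and `parse_row` function are hypothetical names introduced for illustration and are not part of the published record. Truncated cells (those ending in `…`) are kept exactly as they appear rather than completed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One Findings Map row; field names mirror the table header."""
    population: str
    comparator: str
    finding: str
    source: str


def parse_row(line: str) -> Finding:
    """Split a pipe-delimited table row into its four cells.

    Assumes no cell contains a literal '|'. The caller is expected to
    skip the header row and the '|---|' separator row.
    """
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != 4:
        raise ValueError(f"expected 4 cells, found {len(cells)}")
    return Finding(*cells)


# Example using a row copied from the table above (truncation preserved).
example_row = (
    "| multi-tenant workloads with popular op… | conventional baselines "
    "| increases overall system throughput by 56.5% "
    "| 2026 doi:10.1109/asp-dac66049.2026.11420717 |"
)
example_finding = parse_row(example_row)
```

Keeping each row as its own record, rather than merging rows, matches the map's stance that findings are catalogued and not pooled.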
Limitations
This is a scoping map of retrieved direct findings, not a meta-analysis: no pooled effect is computed, coverage is bounded by the Tier-2 corpus, and heterogeneity across rows precludes a single unified conclusion.
Scope
What is the range of reported effects across the open source models literature, and how do they vary by population, comparator, and endpoint? This map catalogues the findings rather than reducing them to a single claim.
Search Summary
39 direct (A_core) sources were retrieved from the Tier-2 semantic corpus for this topic and lane-classified; each is cited with a resolvable identifier in the source bundle below.
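As a hedged illustration of what "resolvable identifier" can mean in practice, the sketch below turns a Source cell of the form `<year> doi:<DOI>` (the format used in the Findings Map) into a year and a link through the public `https://doi.org/` resolver. The function name `parse_source` and the regular expression are assumptions made for this example; the map does not state how its own links are built.

```python
import re
from urllib.parse import quote

# Matches cells such as "2025 doi:10.1038/s41746-025-02174-0".
SOURCE_CELL = re.compile(r"^(?P<year>\d{4})\s+doi:(?P<doi>10\.\S+)$")


def parse_source(cell: str) -> tuple[int, str]:
    """Return (publication year, resolver URL) for a Source cell."""
    match = SOURCE_CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unrecognised source cell: {cell!r}")
    # Percent-encode anything unusual in the DOI but keep '/' separators.
    url = "https://doi.org/" + quote(match["doi"], safe="/")
    return int(match["year"]), url


year, url = parse_source("2025 doi:10.1038/s41746-025-02174-0")
```

A resolver link only shows that the identifier points somewhere; it does not check that the linked paper supports the finding attributed to it.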
Tensions and Gaps
Findings differ in population, comparator, endpoint, and effect size, so they are not directly comparable and are not pooled. Gaps remain where a population or comparator is represented by only a single source.
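One way to locate such gaps, sketched under the assumption that the population labels from the Findings Map are available as plain strings: count how often each label occurs and list those that occur once. The helper `single_source_values` is a hypothetical name, and the sample below is a small subset of the table's Population column chosen for illustration, so its output describes the sample and not the full map.

```python
from collections import Counter
from typing import Iterable


def single_source_values(values: Iterable[str]) -> list[str]:
    """Return, in sorted order, the labels that appear exactly once."""
    counts = Counter(values)
    return sorted(label for label, n in counts.items() if n == 1)


# Illustrative subset of Population cells copied from the Findings Map.
sample_populations = [
    "open source models accuracy tasks",
    "open source models accuracy tasks",
    "open source models recall tasks",
    "open source models F1 tasks",
    "open source models F1 tasks",
    "Stack Overflow R-tag",
]

gaps = single_source_values(sample_populations)
```

The same function could be applied to the Comparator column. Because many cells are truncated in the public preview, exact string matching may split or merge labels that differ only in their hidden text.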
Proof Trail
Topic: open_source_models
Author owner: Dominic Lynch
Owner ORCID: 0009-0005-4286-8363
Institution: not supplied
ROR: not supplied
RAiD: not supplied
OSF DOI: 10.17605/OSF.IO/M4TNQ
AI co-writer: agent-v4-alpha-ai-research
Reviewer: reviewer-panel
AI disclosure: Agent-generated artifact reviewed by Researka; not a clinical guideline or human-authored journal article.
Published: Jun 23, 2026
Provenance chain: Available → View
SHA-256: sha256:867aafb911a...
Publication ID: 87e015be-2295-434d...
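A minimal sketch of how a reader might check bytes against a recorded digest of the form `sha256:<hex>`, the format used in the SHA-256 field above. The digest shown in this record is truncated, and the record does not say which file or payload it covers, so the function below (`matches_digest`, a hypothetical name) is demonstrated on placeholder bytes only.

```python
import hashlib
import hmac


def matches_digest(payload: bytes, recorded: str) -> bool:
    """Compare payload's SHA-256 with a 'sha256:<hex>' string."""
    algorithm, _, expected_hex = recorded.partition(":")
    if algorithm != "sha256" or not expected_hex:
        raise ValueError(f"unsupported digest format: {recorded!r}")
    actual_hex = hashlib.sha256(payload).hexdigest()
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(actual_hex, expected_hex.lower())


# Placeholder payload; a real check would read the downloaded artifact.
placeholder = b"example payload"
recorded = "sha256:" + hashlib.sha256(placeholder).hexdigest()
is_intact = matches_digest(placeholder, recorded)
```

A matching digest indicates that the bytes are unchanged since the digest was recorded; it says nothing about whether the content is correct.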