Local LLM Guide

The best local LLMs for software engineering

Ranked by SWE-bench Verified, the benchmark that tests real GitHub issues, not toy problems. Opinionated picks for every VRAM tier. Not sure if local models are ready? Read our honest assessment.

50 models evaluated · 9 research passes · SWE-bench · LCB · HumanEval+ · Updated monthly

What should I run?

Pick your hardware. Get a model recommendation in seconds.

≤ 4 GB: Edge, integrated GPU, older laptops
Model · Bench · VRAM
StarCoder2-3B · ~2 GB
BigCode · 3B · BigCode RAIL-M

The de facto standard for inline code completion. Continue.dev recommends it as the default FIM model. At 2 GB VRAM, it runs on almost any GPU, including integrated graphics with enough shared memory. Use the base model, not instruct: the fine-tuning actually hurts FIM quality. Not for chat; only for autocomplete. A minimal FIM prompt sketch follows this entry.

Fill-in-the-middle (FIM) champion; not an instruct model

16K context · Inline autocomplete (FIM) · Ollama supported
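For readers wiring this into their own tooling, here is a minimal sketch of a raw FIM request against a local Ollama server. It assumes Ollama's default port, a model pulled under the tag starcoder2:3b, and the StarCoder-family FIM special tokens; verify the tag and tokens for the exact build you run.

```python
import requests

# Sketch: raw fill-in-the-middle (FIM) completion against a local Ollama server.
# Assumptions: Ollama on its default port, model pulled as "starcoder2:3b",
# and the StarCoder-family FIM tokens (<fim_prefix>/<fim_suffix>/<fim_middle>).
OLLAMA_URL = "http://localhost:11434/api/generate"

prefix = "def parse_config(path: str) -> dict:\n    "
suffix = "\n    return config\n"

# FIM prompt: give the model the code before and after the cursor,
# then ask it to generate the missing middle.
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

resp = requests.post(
    OLLAMA_URL,
    json={
        "model": "starcoder2:3b",  # assumed tag; check `ollama list`
        "prompt": fim_prompt,
        "raw": True,               # bypass chat templating, send tokens as-is
        "stream": False,
        "options": {"num_predict": 64, "temperature": 0.2},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])     # the generated middle snippet
```

Editors like Continue.dev build this prompt for you; the point here is only to show why a base FIM model, rather than an instruct model, is the right fit for autocomplete.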
Granite 4.0 Tiny (MoE) · 82.41% HE · ~4 GB
DeepSeek-R1-Distill-Qwen-1.5B · 16.9% LCB · ~1.2 GB
SmolLM3-3B · 30% LCB · ~2 GB
Qwen3.5-4B · 55.8% LCB · ~3 GB
Phi-4-mini · 19.9% LCB · ~2.3 GB
Bonsai 8B · ~1.2 GB
Bonsai 1.7B · ~0.24 GB
Gemma 4 E2B · ~3 GB
5–8 GB: RTX 3050/4050, MacBook M1/M2 base
Model · Bench · VRAM
GLM-4.7-Flash (MoE) · 59.2% SWE · ~6 GB
Zhipu AI (Z.ai) · 30B / 3B active (MoE) · MIT

The single most surprising model in this guide. 84.9% LCB and 94.2% HumanEval from a MoE that runs on a 6GB laptop GPU. For competitive programming and pure code generation, it outperforms most 32B dense models. SWE-bench is 59.2% (the 73.8% score belongs to its full 355B parent model — not Flash). MIT license. Available on Ollama. The sleeper pick of 2026.

SWE-bench for Flash distill (NOT the full 355B parent's 73.8%)

200K context · Competitive programming & code gen · Ollama supported
DeepSeek-R1-0528-Qwen3-8B · 60.5% LCB · ~5 GB
DeepSeek · 8B · MIT

60.5% LCB at 8B and 5GB VRAM. That's what 32B models scored in 2024. The jump from older R1-Distill-Qwen-7B (37.6% LCB) happened because this was distilled from a fundamentally stronger teacher — R1-0528, which itself jumped +9.8 LCB over the original R1. Best reasoning model at the laptop GPU tier. MIT license.

Distilled from the updated R1-0528 teacher — dramatically better than older R1 distills

128K context · Reasoning + coding at 8B
Qwen3-30B-A3B (MoE) · ~6 GB
Granite 3.3 8B Instruct · 89.73% HE · ~5 GB
IBM · 8B · Apache 2.0

A massive jump from IBM's old code-specific Granite models. 89.73% HumanEval at 8B and 5GB VRAM. Apache 2.0 with IBM's full data provenance tracking — every training token is documented. The cleanest enterprise license in this tier. If your legal team needs to audit the training data, this is the only sub-8GB model that can satisfy that requirement.

128K context · Enterprise code generation
Seed-Coder-8B Instruct · 84.8% HE · ~5 GB
Yi-Coder-9B · 85.4% HE · ~6 GB
MiniCPM4-8B (Hybrid) · ~5 GB
InternLM3-8B · 17.8% LCB · ~5 GB
Llama 3.1 8B Instruct · 72.6% HE · ~5 GB
Qwen2.5-Coder-7B · 84.1% HE+ · ~5 GB
Qwen3-8B · ~5 GB
Qwen3.5-9B · 65.6% LCB · ~6.6 GB
Gemma 4 E4B · 52% LCB · ~5 GB
8–12 GB: RTX 3060 / 4060, MacBook M2 Pro
Model · Bench · VRAM
Phi-4-Reasoning · 53.8% LCB · ~9 GB
Microsoft · 14B · MIT

Microsoft's hidden gem. 92.9% HumanEval+ — the same score as o1-mini — at 14B and 9GB VRAM. 53.8% LCB is strong for the size. The key differentiator is chain-of-thought: it reasons through problems before answering, catching logical errors that direct-generation models miss. MIT license. Trained on high-density synthetic reasoning data. Not on Ollama but GGUF available.

HumanEval+ ties o1-mini

32K context · Reasoning-heavy coding
DeepSeek-R1-Distill-Qwen-14B · 53.1% LCB · ~9 GB
Qwen3-14B · ~9 GB
Gemma 3 12B · 85.4% HE · ~8 GB
Qwen2.5-Coder-14B · 89.1% HE · ~10 GB
Alibaba · 14B · Apache 2.0

The top code model in the 8–12 GB VRAM range. 89.1% HumanEval, 128K context, Apache 2.0, Ollama support, FIM for autocomplete. The sweet spot of the Qwen2.5-Coder family: the 14B hits 89%+ on HumanEval while the 32B adds roughly 4 more points at 2x the VRAM. If you have mid-tier hardware and want the best pure coding performance available, this is it.

Best HumanEval score in the 8–12 GB tier

128K context · Code generation, best in tier · Ollama supported
12–16 GB: RTX 3080 / 4070, MacBook M2 Pro 16GB
Model · Bench · VRAM
Devstral Small 2 · 68% SWE · ~16 GB
Mistral AI · 24B · Apache 2.0

The best Apache 2.0 model that runs on a single consumer GPU. 68% SWE-bench — up from 46.8% at v1. That's a 21-point jump in one release, the largest single-model improvement of 2025. 256K context. Available on Ollama. Purpose-built for agentic workflows: fixing real GitHub issues, multi-file edits, running tests. If you need commercial-clean agentic coding on one card, this is the pick.

256K context · Agentic coding (Apache 2.0) · Ollama supported
Codestral 22B · 86.6% HE · ~14 GB
Mistral Small 3.2 · 92.9% HE · ~15 GB
GPT-OSS-20B (MoE) · ~14 GB
Gemma 4 26B (MoE) · 70% LCB · ~14 GB
16–24 GB: RTX 3090 / 4090, Mac M2 Max 32GB
Model · Bench · VRAM
Gemma 4 31B · 80% LCB · ~19 GB
Google DeepMind · 31B · Apache 2.0

The top-of-stack Gemma 4 and the most significant open-weight release of early 2026. 80% LCB and 89.2% AIME; the jump from Gemma 3's 20.8% AIME is the largest single-generation reasoning improvement in open-source model history. Ranks #3 among all open models on the Arena AI text leaderboard. 256K context. Apache 2.0, commercially clean with no usage caps. Fits on a single RTX 4090 or Mac M2 Max at Q4_K_M (~19 GB). Multimodal: reads images, diagrams, and audio alongside code. For serious agentic workflows on a single consumer GPU, this is the new benchmark.

#3 open model on Arena AI (April 2026). AIME 2026: 89.2% (Gemma 3 was 20.8%).

256K context · Agentic coding + reasoning · Ollama supported
Qwen3.5-27B · 72.4% SWE · ~17 GB
Alibaba · 27B · Apache 2.0

The best dense-model SWE-bench score that fits on a single consumer GPU (the MoE Qwen3.6-35B-A3B below edges past it at 73.4%). 72.4% SWE-bench and 80.7% LCB, outperforming its own 122B-A10B MoE sibling on coding tasks: dense beats MoE when complex multi-file reasoning needs full parameter engagement. GatedDeltaNet hybrid architecture keeps 256K context at practical speed. Apache 2.0. Ollama support. The single-GPU dense pick of 2026.

Outperforms the 122B-A10B MoE sibling on coding — dense beats MoE here

256K context · Agentic coding, best single-GPU dense model · Ollama supported
Qwen3.6-35B-A3B (MoE) · 73.4% SWE · ~20 GB
Alibaba · 35B / 3B active (MoE) · Apache 2.0

73.4% SWE-bench Verified is the highest score of any locally-runnable model as of April 2026 — beating Qwen3.5-27B (72.4%) and every other single-GPU option. MoE architecture: 35B total parameters, 3B active per inference step. Runs at 3B inference speed while drawing on 35B parameter knowledge. Apache 2.0. Ollama support. 262K native context, extensible to 1M. "Thinking preservation" retains reasoning context across conversation turns — less re-derivation overhead in long agentic sessions. 92.7% AIME 2026 puts it at frontier-level math reasoning for a locally runnable model. If you have an RTX 4090 or a Mac with 24+ GB unified memory, this is the agentic coding model to run right now.

73.4% SWE-bench Verified — highest of any locally-runnable model as of April 2026. 92.7% AIME 2026. 86.0% GPQA Diamond.

262K context · Agentic coding, best single-GPU SWE-bench · Ollama supported
KAT-Dev-32B · 62.4% SWE · ~20 GB
Qwen3-32B · 72.05% HE · ~20 GB
OLMo 3.1 32B Think · 83.3% LCB · ~20 GB
EXAONE Deep 32B · 59.5% LCB · ~18 GB
Hermes 4.3 36B · ~22 GB
Qwen2.5-Coder-32B · 86.2% HE+ · ~20 GB
Alibaba · 32B · Apache 2.0

The 2025 community gold standard for local coding agents. 92.7% HumanEval and 86.2% HumanEval+, among the highest code generation scores of any model that fits on a single consumer GPU. Apache 2.0. Ollama support. 128K context. Purpose-built for software engineering, and the 32B hit the sweet spot: strong enough for real production code, small enough for an RTX 3090 or Mac M2 Max. Newer Qwen releases (Qwen3.5-27B at 72.4%, Qwen3.6-35B-A3B at 73.4%) have since taken the SWE-bench crown, but for raw code generation, Qwen2.5-Coder-32B is still a top-tier pick.

2025 community gold standard for local coding agents

128K context · Code generation, community gold standard · Ollama supported
Qwen3.5-35B-A3B (MoE) · 69.2% SWE · ~22 GB
Nemotron 3 Nano 30B-A3B (Hybrid) · 68.3% LCB · ~24 GB
40–48 GB: Dual RTX 3090, A6000, Mac Pro 192GB
Model · Bench · VRAM
Mistral Small 4 (MoE) · ~67 GB
Qwen3-Coder-Next (MoE) · 71.3% SWE · ~45–49 GB
Alibaba · 80B / 3B active (MoE) · Qwen License

71.3% SWE-bench on an 80B MoE that activates only 3B parameters per token. Purpose-built for software engineering. The catch: it needs ~45-49GB VRAM at Q4_K_M, so a single RTX 4090 won't cut it. You need a 48GB workstation card (A6000, RTX 6000), a dual-GPU setup, or Apple Silicon with 64GB+ unified memory. If you have the hardware, it is one of the strongest locally-runnable coding models available; in this tier, only KAT-Dev-72B-Exp below scores higher on SWE-bench.

Needs 48GB+ card or dual GPU — NOT runnable on single RTX 4090

256K context · Frontier agentic coding
KAT-Dev-72B-Exp · 74.6% SWE · ~40 GB
Kwaipilot (Kuaishou AI) · 72B · Apache 2.0

74.6% SWE-bench at release made this the highest-scoring open-weight coding model in the world for several weeks. Dense 72B — same VRAM tier as Llama 3.3 70B and Kimi-Dev-72B. Apache 2.0. Community GGUF available. Built by Kwaipilot, Kuaishou's developer tooling team. Still almost entirely unknown outside Chinese research circles. If you have dual 3090s and want top-tier SWE-bench performance with a clean license, this is worth serious consideration.

Was #1 SWE-bench at release (Jan 2026)

128K context · Agentic coding, Apache 2.0 frontier
Kimi-Dev-72B · 60.4% SWE · ~40 GB
DeepSeek-R1-Distill-Llama-70B · 49.2% SWE · ~40 GB
Llama 4 Scout (MoE) · 47.3% SWE · ~67 GB
Llama 3.3 70B · 88.4% HE · ~40 GB

All these models work in Bodega One

No config files. No YAML. Pick a model, connect a provider, start coding. One-time purchase.

Join the waitlist →

Best local LLMs by use case

SWE-bench tells you which models write code. These picks cover everything else: reasoning, research, writing, and math. All run locally, all work in Bodega One.

Reasoning

Chain-of-thought analysis, logical problem solving, and extended thinking for complex multi-step tasks.

  • ~5 GB

    DeepSeek-R1-0528-Qwen3-8B

    60.5% LCB at 5GB. Best reasoning per watt available.

  • ~10 GB

    Phi-4-Reasoning

    Distilled from o3-mini. Math and logic specialist from Microsoft.

  • ~5 GB

    OLMo 3.1 Think

    Fully open Apache 2.0 thinking model. No license restrictions.

Long-context research

Document analysis, knowledge synthesis, and multi-source research requiring large context windows.

  • ~24 GB

    Hermes 4.3 36B

    512K context window. Reads entire codebases or document sets.

  • ~18 GB

    Qwen3.5-27B

    Best dense model at this weight. Strong on long-context and reasoning.

  • Server

    Llama 3.3 70B

    128K context. Meta flagship, top open-weight instruction follower.

Writing & editing

Prose, documentation, structured output, and natural instruction following for content tasks.

  • ~5 GB

    Qwen3-8B

    Punches above its weight. Excellent at structured writing at 5GB.

  • Server

    Llama 3.3 70B

    Best open-weight instruction follower at any size class.

  • ~8 GB

    Mistral Nemo 12B

    Strong multilingual writing. Apache 2.0, runs on 8GB cards.

Math & science

Symbolic computation, step-by-step proofs, competition math, and STEM reasoning tasks.

  • ~10 GB

    Phi-4-Reasoning

    Purpose-built for mathematical reasoning. Top performer at 10GB.

  • ~5 GB

    DeepSeek-R1-0528-Qwen3-8B

    Extended thinking mode. Strong on competition-level math.

  • ~20 GB

    QwQ-32B

    72.9% MATH-500. Qwen reasoning model, math specialist.

Local AI that actually works

Every model on this page runs inside a full IDE with AI chat and an autonomous coding agent. Your data stays on your machine.

Join the waitlist →

What the benchmarks actually tell you

HumanEval is saturated

GLM-4.7-Flash scores 94.2% HumanEval on a 6GB laptop GPU. The benchmark is done. SWE-bench Verified and LiveCodeBench are the only meaningful signals for 2026.

Dense beats MoE on hard tasks

Qwen3.5-27B (dense, 27B params) outperforms Qwen3.5-122B-A10B (MoE, 10B active) on coding. When complex multi-file reasoning needs full parameter engagement, dense wins.

The 8B tier is now actually good

DeepSeek-R1-0528-Qwen3-8B scores 60.5% LCB at 5GB VRAM. That's what 32B models scored in 2024. Entry-level hardware is now competitive.

Devstral's 21-point jump

Devstral Small went from 46.8% to 68% SWE-bench between v1 and v2. The largest single-model improvement of the year. Best Apache 2.0 coding model on a single GPU.

Quantization matters sub-8B

Q4_K_M causes ~8-10% variance on coding tasks at 7B. Use Q6_K or Q8_0 for models under 8B. Q4 is fine at 14B and above.
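As a rough illustration of that rule of thumb, here is a tiny helper that maps a model's parameter count to a GGUF quant level. The 8-14B middle ground is our own assumption (the guidance above only pins down the two ends), so treat it as a starting point rather than a rule.

```python
def recommended_quant(param_count_b: float, vram_headroom: bool = False) -> str:
    """Map parameter count to a GGUF quant level per this guide's rule of thumb.

    Below 8B, Q4_K_M costs roughly 8-10% on coding tasks, so prefer Q6_K
    (or Q8_0 when VRAM allows). At 14B and above, Q4_K_M is fine.
    The 8-14B branch is an assumption, not something the guide specifies.
    """
    if param_count_b < 8:
        return "Q8_0" if vram_headroom else "Q6_K"
    if param_count_b < 14:
        return "Q5_K_M"  # assumed middle ground
    return "Q4_K_M"


# A 7B coder should run at Q6_K; a 32B coder is fine at Q4_K_M.
assert recommended_quant(7) == "Q6_K"
assert recommended_quant(32) == "Q4_K_M"
```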

Context floor: 32K minimum

8K context disqualifies a model for repo-level work. 32K is the minimum. 64K-128K is the sweet spot. Larger than 128K can hurt via 'lost in the middle' degradation.
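If you serve models through Ollama, the context window is something you raise per request rather than something the model negotiates for you. A minimal sketch, assuming the default port and a placeholder model tag; 32768 matches the 32K floor above, provided the model itself supports it.

```python
import requests

# Sketch: requesting a 32K context window through Ollama's chat API.
# Ollama's default num_ctx is small, so long repo-level prompts get silently
# truncated unless you raise it. Model tag and prompt are placeholders.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5-coder:14b",  # assumed tag; use whatever you actually run
        "messages": [
            {"role": "user", "content": "Review these files for dead code: ..."}
        ],
        "stream": False,
        "options": {"num_ctx": 32768},  # 32K floor for repo-level work
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```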

Beyond consumer hardware

These models require server infrastructure or multi-GPU setups. They set the ceiling for what open-weight models can achieve.

Model · SWE-bench · Min VRAM
MiniMax M2.5 (MiniMax, 229B / 10B active, Commercial OK) · 80.2% · ~128 GB (3-bit)
GLM-5 (Zhipu AI, 744B / 40B active, MIT) · 77.8% · ~180 GB (2-bit)
GLM-5.1 (Z.ai, 754B / 40B active DSA MoE, MIT) · 58.4% · ~640 GB (8x H100)
Note: the 58.4% is SWE-bench Pro, a harder benchmark (#1 open-source at release). Novel GLM_MOE_DSA hybrid architecture. Trained on Huawei Ascend chips.
Kimi K2.5 (Moonshot AI, 1T / 32B active, Modified MIT) · 76.8% · ~375 GB (2-bit)
Kimi K2.6 (Moonshot AI, 1T / 32B active MoE, Modified MIT) · 80.2% · ~250 GB
Note: Apr 21 2026 release. 300 sub-agent orchestration, 13-hour autonomous coding. The 80.2% SWE-bench Verified is Moonshot self-reported (in-house framework, not the canonical harness). MIT with a branding clause above 100M MAU / $20M revenue.
Qwen3.5-397B-A17B (Alibaba, 397B / 17B active, Apache 2.0) · 76.4% · ~220 GB
KAT-Dev-72B-Exp (Kwaipilot, 72B, Apache 2.0) · 74.6% · ~40 GB (dual GPU)
Note: borderline consumer hardware.
DeepSeek V3.2 (DeepSeek, 685B / 37B active, MIT) · 74.1% · Server
GLM-4.7 (full) (Zhipu AI, 355B / 9B active, MIT) · 73.8% · Server
MiMo-V2-Flash (Xiaomi, 309B / 15B active, Apache 2.0) · 73.4% · Multi-GPU
Note: 150 tok/s via MTP.
Devstral 2 Large (Mistral AI, 123B, Apache 2.0) · 72.2% · Multi-GPU
Qwen3.5-122B-A10B (Alibaba, 122B / 10B active, Apache 2.0) · 72% · ~70–81 GB (multi-GPU)
Note: LCB 78.9%; the 27B dense actually beats it on coding at 1/4 the VRAM.
Nemotron 3 Super 120B-A12B (NVIDIA, 120.6B / 12.7B active, NVIDIA Nemotron license) · 60.47% · ~87 GB Q4 (64 GB+ unified)
Note: LCB 81.19%. Hybrid Mamba-2 MoE, 1M context, 7.5x faster than Qwen3.5-122B.

Benchmark glossary

SWE-bench Verified
% of real GitHub issues resolved autonomously. A human validated each issue. The most practical benchmark: it tests actual software engineering, not toy problems. Frontier models top out around 80%.
LiveCodeBench (LCB)
Contamination-free competitive programming problems collected after the training cutoffs of the models being tested. Harder to game than HumanEval. Updated continuously.
HumanEval / HumanEval+
Code generation at function level. HumanEval is largely saturated. Multiple 6GB models score above 90%. Use LCB and SWE-bench for real discrimination. HumanEval+ has stricter tests than the original.
VRAM figures
All VRAM numbers are at Q4_K_M quantization unless noted. For models under 8B, use Q6_K or Q8_0. Q4 causes ~8-10% variance on coding tasks at that scale.

Running local models efficiently also depends on KV cache reuse and observation masking, which together can cut token waste by 40-70% in long agentic sessions.
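As a rough illustration of the second technique, here is what observation masking can look like inside an agent loop: older tool outputs are collapsed to short stubs while the most recent ones stay verbatim, so long sessions stop re-paying for stale observations. The message schema and function name are illustrative, not any particular framework's API.

```python
def mask_old_observations(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Collapse all but the newest tool observations to short stubs.

    Illustrative sketch of observation masking: the conversation keeps the
    user turns and the latest tool results, while older bulky outputs
    (test logs, file dumps) are replaced with placeholders.
    """
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    protected = set(tool_indices[-keep_last:])  # newest observations stay intact
    masked = []
    for i, msg in enumerate(messages):
        if msg["role"] == "tool" and i not in protected:
            masked.append({
                "role": "tool",
                "content": f"[observation elided: {len(msg['content'])} chars]",
            })
        else:
            masked.append(msg)
    return masked


# Example: only the two most recent tool results survive in full.
history = [
    {"role": "user", "content": "Fix the failing test."},
    {"role": "tool", "content": "pytest output ... 400 lines ..."},
    {"role": "tool", "content": "file listing ... 900 lines ..."},
    {"role": "tool", "content": "diff applied, tests now pass"},
]
print(mask_old_observations(history, keep_last=2))
```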

Run these models in a full IDE.

Bodega One supports every model on this page. One-time purchase. Your data never leaves.