The best local LLMs for software engineering
Ranked by SWE-bench Verified, the benchmark that tests real GitHub issues, not toy problems. Opinionated picks for every VRAM tier. Not sure if local models are ready? Read our honest assessment.
What should I run?
Pick your hardware. Get a model recommendation in seconds.
Up to 4 GB VRAM

| Model | Org | Params | Bench | VRAM | License |
|---|---|---|---|---|---|
| ★ StarCoder2-3B | BigCode | 3B | — | ~2 GB | BigCode OpenRAIL-M |
| Granite 4.0 Tiny | IBM | 7B / 1B active (MoE) | 82.41% HE | ~4 GB | Apache 2.0 |
| DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek | 1.5B | 16.9% LCB | ~1.2 GB | MIT |
| SmolLM3-3B | HuggingFace | 3B | 30% LCB | ~2 GB | Apache 2.0 |
| Qwen3.5-4B | Alibaba | 4B | 55.8% LCB | ~3 GB | Apache 2.0 |
| Phi-4-mini | Microsoft | 3.8B | 19.9% LCB | ~2.3 GB | MIT |
| Bonsai 8B | PrismML | 8B (1-bit) | — | ~1.2 GB | Apache 2.0 |
| Bonsai 1.7B | PrismML | 1.7B (1-bit) | — | ~0.24 GB | Apache 2.0 |
| Gemma 4 E2B | Google DeepMind | ~5.1B total / ~2B effective (PLE) | — | ~3 GB | Apache 2.0 |
8 GB VRAM

| Model | Org | Params | Bench | VRAM | License |
|---|---|---|---|---|---|
| ★ GLM-4.7-Flash | Zhipu AI (Z.ai) | 30B / 3B active (MoE) | 59.2% SWE | ~6 GB | MIT |
| ★ DeepSeek-R1-0528-Qwen3-8B | DeepSeek | 8B | 60.5% LCB | ~5 GB | MIT |
| Qwen3-30B-A3B | Alibaba | 30B / 3.3B active (MoE) | — | ~6 GB | Apache 2.0 |
| ★ Granite 3.3 8B Instruct | IBM | 8B | 89.73% HE | ~5 GB | Apache 2.0 |
| Seed-Coder-8B Instruct | ByteDance Seed | 8B | 84.8% HE | ~5 GB | MIT |
| Yi-Coder-9B | 01.ai | 9B | 85.4% HE | ~6 GB | Apache 2.0 |
| MiniCPM4-8B | Tsinghua / OpenBMB | 8B (hybrid) | — | ~5 GB | Apache 2.0 |
| InternLM3-8B | Shanghai AI Lab | 8B | 17.8% LCB | ~5 GB | Apache 2.0 |
| Llama 3.1 8B Instruct | Meta | 8B | 72.6% HE | ~5 GB | Llama 3.1 |
| Qwen2.5-Coder-7B | Alibaba | 7B | 84.1% HE+ | ~5 GB | Apache 2.0 |
| Qwen3-8B | Alibaba | 8B | — | ~5 GB | Apache 2.0 |
| Qwen3.5-9B | Alibaba | 9B | 65.6% LCB | ~6.6 GB | Apache 2.0 |
| Gemma 4 E4B | Google DeepMind | ~11B total / 4B effective | 52% LCB | ~5 GB | Apache 2.0 |
12 GB VRAM

| Model | Org | Params | Bench | VRAM | License |
|---|---|---|---|---|---|
| ★ Phi-4-Reasoning | Microsoft | 14B | 53.8% LCB | ~9 GB | MIT |
| DeepSeek-R1-Distill-Qwen-14B | DeepSeek | 14B | 53.1% LCB | ~9 GB | MIT |
| Qwen3-14B | Alibaba | 14B | — | ~9 GB | Apache 2.0 |
| Gemma 3 12B | Google DeepMind | 12B | 85.4% HE | ~8 GB | Gemma Terms |
| ★ Qwen2.5-Coder-14B | Alibaba | 14B | 89.1% HE | ~10 GB | Apache 2.0 |
16 GB VRAM

| Model | Org | Params | Bench | VRAM | License |
|---|---|---|---|---|---|
| ★ Devstral Small 2 | Mistral AI | 24B | 68% SWE | ~16 GB | Apache 2.0 |
| Codestral 22B | Mistral AI | 22B | 86.6% HE | ~14 GB | Mistral CL |
| Mistral Small 3.2 | Mistral AI | 24B | 92.9% HE | ~15 GB | Apache 2.0 |
| GPT-OSS-20B | OpenAI | 20B / 3.6B active (MoE) | — | ~14 GB | Apache 2.0 |
| Gemma 4 26B | Google DeepMind | 26B / 4B active (MoE) | 70% LCB | ~14 GB | Apache 2.0 |
24 GB VRAM

| Model | Org | Params | Bench | VRAM | License |
|---|---|---|---|---|---|
| ★ Gemma 4 31B | Google DeepMind | 31B | 80% LCB | ~19 GB | Apache 2.0 |
| ★ Qwen3.5-27B | Alibaba | 27B | 72.4% SWE | ~17 GB | Apache 2.0 |
| ★ Qwen3.6-35B-A3B | Alibaba | 35B / 3B active (MoE) | 73.4% SWE | ~20 GB | Apache 2.0 |
| KAT-Dev-32B | Kwaipilot (Kuaishou AI) | 32B | 62.4% SWE | ~20 GB | Apache 2.0 |
| Qwen3-32B | Alibaba | 32B | 72.05% HE | ~20 GB | Apache 2.0 |
| OLMo 3.1 32B Think | Allen AI | 32B | 83.3% LCB | ~20 GB | Apache 2.0 |
| EXAONE Deep 32B | LG AI Research | 32B | 59.5% LCB | ~18 GB | Research NC |
| Hermes 4.3 36B | NousResearch | 36B | — | ~22 GB | Llama 3.1 |
| ★ Qwen2.5-Coder-32B | Alibaba | 32B | 86.2% HE+ | ~20 GB | Apache 2.0 |
| Qwen3.5-35B-A3B | Alibaba | 35B / 3B active (MoE) | 69.2% SWE | ~22 GB | Apache 2.0 |
| Nemotron 3 Nano 30B-A3B | NVIDIA | 31.6B / 3.2B active (hybrid MoE) | 68.3% LCB | ~24 GB | NVIDIA Nemotron License |
40 GB+ (multi-GPU or workstation)

| Model | Org | Params | Bench | VRAM | License |
|---|---|---|---|---|---|
| Mistral Small 4 | Mistral AI | 119B / 6B active (MoE) | — | ~67 GB | Apache 2.0 |
| ★ Qwen3-Coder-Next | Alibaba | 80B / 3B active (MoE) | 71.3% SWE | ~45–49 GB | Qwen License |
| ★ KAT-Dev-72B-Exp | Kwaipilot (Kuaishou AI) | 72B | 74.6% SWE | ~40 GB | Apache 2.0 |
| Kimi-Dev-72B | Moonshot AI | 72B | 60.4% SWE | ~40 GB | Apache 2.0 |
| DeepSeek-R1-Distill-Llama-70B | DeepSeek | 70B | 49.2% SWE | ~40 GB | MIT |
| Llama 4 Scout | Meta | 109B / 17B active (MoE, 16 experts) | 47.3% SWE | ~67 GB | Llama 4 License |
| Llama 3.3 70B | Meta | 70B | 88.4% HE | ~40 GB | Llama 3.3 |
All these models work in Bodega One
No config files. No YAML. Pick a model, connect a provider, start coding. One-time purchase.
Best local LLMs by use case
SWE-bench tells you which models write code. These picks cover everything else: reasoning, research, writing, and math. All run locally, all work in Bodega One.
Reasoning
Chain-of-thought analysis, logical problem solving, and extended thinking for complex multi-step tasks.
- DeepSeek-R1-0528-Qwen3-8B (~5 GB): 60.5% LCB at 5 GB. Best reasoning per watt available.
- Phi-4-Reasoning (~10 GB): Distilled from o3-mini. Math and logic specialist from Microsoft.
- OLMo 3.1 Think (~5 GB): Fully open Apache 2.0 thinking model. No license restrictions.
Long-context research
Document analysis, knowledge synthesis, and multi-source research requiring large context windows.
- Hermes 4.3 36B (~24 GB): 512K context window. Reads entire codebases or document sets.
- Qwen3.5-27B (~18 GB): Best dense model in this weight class. Strong on long-context work and reasoning.
- Llama 3.3 70B (server): 128K context. Meta's flagship and a top open-weight instruction follower.
Writing & editing
Prose, documentation, structured output, and natural instruction following for content tasks.
- Qwen3-8B (~5 GB): Punches above its weight; excellent structured writing at 5 GB.
- Llama 3.3 70B (server): Best open-weight instruction follower at any size class.
- Mistral Nemo 12B (~8 GB): Strong multilingual writing. Apache 2.0, runs on 8 GB cards.
Math & science
Symbolic computation, step-by-step proofs, competition math, and STEM reasoning tasks.
- Phi-4-Reasoning (~10 GB): Purpose-built for mathematical reasoning. Top performer at 10 GB.
- DeepSeek-R1-0528-Qwen3-8B (~5 GB): Extended thinking mode. Strong on competition-level math.
- QwQ-32B (~20 GB): 72.9% on MATH-500. Qwen's reasoning model and math specialist.
Local AI that actually works
Every model on this page runs inside a full IDE with AI chat and an autonomous coding agent. Your data stays on your machine.
What the benchmarks actually tell you
HumanEval is saturated
GLM-4.7-Flash scores 94.2% HumanEval on a 6GB laptop GPU. The benchmark is done. SWE-bench Verified and LiveCodeBench are the only meaningful signals for 2026.
Dense beats MoE on hard tasks
Qwen3.5-27B (dense, 27B params) outperforms Qwen3.5-122B-A10B (MoE, 10B active) on coding. When complex multi-file reasoning needs full parameter engagement, dense wins.
The 8B tier is now actually good
DeepSeek-R1-0528-Qwen3-8B scores 60.5% LCB at 5GB VRAM. That's what 32B models scored in 2024. Entry-level hardware is now competitive.
Devstral's 21-point jump
Devstral Small went from 46.8% to 68% SWE-bench between v1 and v2. The largest single-model improvement of the year. Best Apache 2.0 coding model on a single GPU.
Quantization matters sub-8B
Q4_K_M causes ~8-10% variance on coding tasks at 7B. Use Q6_K or Q8_0 for models under 8B. Q4 is fine at 14B and above.
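Encoded as a rule of thumb, here is a minimal sketch of that guidance. The thresholds come straight from the numbers above; the 8-14B middle band isn't pinned down by them, so that branch is a judgment call:

```python
def pick_quant(params_b: float) -> str:
    """Suggest a GGUF quant level per the guidance above."""
    if params_b < 8:
        return "Q6_K"    # or Q8_0; Q4_K_M costs ~8-10% on coding at this scale
    if params_b < 14:
        return "Q5_K_M"  # assumption: a middle ground the text doesn't specify
    return "Q4_K_M"      # negligible quality loss at 14B and above

# pick_quant(7)  -> "Q6_K"
# pick_quant(32) -> "Q4_K_M"
```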
Context floor: 32K minimum
8K context disqualifies a model for repo-level work. 32K is the minimum. 64K-128K is the sweet spot. Larger than 128K can hurt via 'lost in the middle' degradation.
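Context is a load-time setting in llama.cpp-based runtimes, and defaults often sit far below a model's trained window. A minimal llama-cpp-python sketch (the GGUF path and prompt are placeholders):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-coder-14b-q4_k_m.gguf",  # placeholder path
    n_ctx=32768,      # the 32K floor for repo-level work
    n_gpu_layers=-1,  # offload all layers that fit in VRAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain this function: ..."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```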
Beyond consumer hardware
These models require server infrastructure or multi-GPU setups. They set the ceiling for what open-weight models can achieve.
| Model | Org | Params | SWE-bench | Min VRAM | License | Notes |
|---|---|---|---|---|---|---|
| MiniMax M2.5 | MiniMax | 229B / 10B active | 80.2% | ~128 GB (3-bit) | Commercial OK | |
| GLM-5 | Zhipu AI | 744B / 40B active | 77.8% | ~180 GB (2-bit) | MIT | |
| GLM-5.1 | Z.ai | 754B / 40B active (DSA MoE) | 58.4% (SWE-bench Pro) | ~640 GB (8x H100) | MIT | Scored on the harder SWE-bench Pro; #1 open-source at release. Novel GLM_MOE_DSA hybrid architecture, trained on Huawei Ascend chips. |
| Kimi K2.5 | Moonshot AI | 1T / 32B active | 76.8% | ~375 GB (2-bit) | Modified MIT | |
| Kimi K2.6 | Moonshot AI | 1T / 32B active (MoE) | 80.2% (self-reported) | ~250 GB | Modified MIT | Apr 21 2026. 300 sub-agent orchestration; 13-hour autonomous coding runs. The 80.2% SWE-bench Verified is Moonshot self-reported (in-house framework, not the canonical harness). MIT with a branding clause above 100M MAU / $20M revenue. |
| Qwen3.5-397B-A17B | Alibaba | 397B / 17B active | 76.4% | ~220 GB | Apache 2.0 | |
| KAT-Dev-72B-Exp | Kwaipilot | 72B | 74.6% | ~40 GB (dual GPU) | Apache 2.0 | Borderline consumer. |
| DeepSeek V3.2 | DeepSeek | 685B / 37B active | 74.1% | Server | MIT | |
| GLM-4.7 (full) | Zhipu AI | 355B / 9B active | 73.8% | Server | MIT | |
| MiMo-V2-Flash | Xiaomi | 309B / 15B active | 73.4% | Multi-GPU | Apache 2.0 | 150 tok/s via MTP. |
| Devstral 2 Large | Mistral AI | 123B | 72.2% | Multi-GPU | Apache 2.0 | |
| Qwen3.5-122B-A10B | Alibaba | 122B / 10B active | 72% | ~70–81 GB (multi-GPU) | Apache 2.0 | 78.9% LCB; the 27B dense model beats it on coding at a quarter of the VRAM. |
| Nemotron 3 Super 120B-A12B | NVIDIA | 120.6B / 12.7B active | 60.47% | ~87 GB Q4 (64 GB+ unified) | NVIDIA Nemotron | 81.19% LCB; hybrid Mamba-2 MoE, 1M context, 7.5x faster than Qwen3.5-122B. |
Benchmark glossary
- SWE-bench Verified: The percentage of real GitHub issues a model resolves autonomously; a human validated each issue. The most practical benchmark: it tests actual software engineering, not toy problems. Frontier models top out around 80%.
- LiveCodeBench (LCB): Contamination-free competitive programming problems, collected after the training cutoffs of the models being tested. Harder to game than HumanEval. Updated continuously.
- HumanEval / HumanEval+: Function-level code generation. HumanEval is largely saturated; multiple 6 GB models score above 90%. HumanEval+ adds stricter tests to the original. Use LCB and SWE-bench for real discrimination.
- VRAM figures: All VRAM numbers assume Q4_K_M quantization unless noted. For models under 8B, use Q6_K or Q8_0; Q4 causes ~8-10% variance on coding tasks at that scale.
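For a back-of-envelope check on these figures: weight memory is roughly parameter count times bits-per-weight divided by 8, plus KV cache and runtime buffers. A sketch, where the bits-per-weight values are approximate GGUF averages and the 20% overhead factor is an assumption:

```python
# Approximate average bits per weight for common GGUF quants (K-quants mix precisions)
BPW = {"Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}

def est_vram_gb(params_b: float, quant: str = "Q4_K_M", overhead: float = 1.2) -> float:
    """Weights-only footprint times an assumed ~20% factor for KV cache and buffers."""
    return params_b * BPW[quant] / 8 * overhead

# est_vram_gb(7)          -> ~5.0 GB, matching the ~5 GB entries above
# est_vram_gb(14)         -> ~10.1 GB
# est_vram_gb(7, "Q8_0")  -> ~8.9 GB: why sub-8B models want extra headroom
```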
Runtime efficiency matters as much as model choice: KV cache reuse and observation masking can cut token waste by 40-70% in agentic workflows.
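Both are straightforward at the runtime level. KV cache reuse means keeping the prompt prefix stable across calls so the runtime can skip re-prefilling it; observation masking means collapsing stale tool output in the transcript to a stub. A sketch using llama-cpp-python's cache object; the masking policy is purely illustrative:

```python
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="model.gguf", n_ctx=32768, n_gpu_layers=-1)  # placeholder path
llm.set_cache(LlamaCache())  # reuse KV state across calls that share a prefix

def mask_observations(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Collapse all but the newest tool outputs so they stop burning context."""
    tool_turns = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_turns[:-keep_last] if keep_last else tool_turns)
    return [
        {**m, "content": "[output elided]"} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

Note the trade-off: editing earlier messages invalidates the cached prefix from that point on, so agent loops typically mask in stable batches rather than rewriting the transcript every turn.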
Run these models in a full IDE.
Bodega One supports every model on this page. One-time purchase. Your data never leaves your machine.