Factorio as an AI Benchmark
Why automating a Factorio factory might be one of the most interesting unsolved challenges for LLM agents.
The best AI models today can barely automate early-game smelting in Factorio.
Meanwhile, experienced human players build megabases processing millions of items per minute — with perfectly ratioed production lines, optimized train networks, and nuclear power grids running at scale.
That gap is interesting.
Why Factorio Is Hard for AI
Factorio is not just a game. It is a real-time systems optimization problem.
It requires:
- Long-horizon planning
- Resource dependency tracking
- Spatial reasoning
- Throughput optimization
- Incremental refactoring of live systems
The game punishes short-term thinking.
Every early design decision propagates downstream.
This makes it fundamentally different from most existing AI benchmarks, which are:
- Static
- Text-only
- Short-horizon
Factorio's simulation is deterministic, but its world is dynamic and persistent, and it punishes naive planning.
The Factorio Learning Environment
There is already an open-source research effort exploring this space: the Factorio Learning Environment (FLE), published at NeurIPS 2025.
It exposes a live Factorio server to LLM agents.
Agents write Python code to interact with the environment and attempt to build automated factories.
The results so far highlight how difficult the problem really is.
Even strong language models struggle with:
- Maintaining state over long sessions
- Correctly sequencing multi-step production chains
- Recovering from partial failures
- Designing scalable layouts
This is not a prompt engineering problem.
It is a systems reasoning problem.
Why This Is an Interesting Benchmark
Factorio introduces properties that resemble real-world infrastructure engineering:
- Graph-based dependency trees
- Constrained resource allocation
- Throughput bottlenecks
- Distributed logistics (trains, belts, bots)
- Continuous optimization under growth
It is closer to distributed systems design than to puzzle solving.
That makes it a compelling unsaturated benchmark for autonomous agents.
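To make the throughput point concrete, here is a minimal sketch of the ratio arithmetic a player (or an agent) has to do before placing a single machine. The `Recipe` struct and both functions are hypothetical names introduced for illustration. The figures used in the usage note (3.2 seconds per smelted plate, 15 items per second on a basic belt) are base-game values as I recall them and should be checked against the game's own data.

```rust
/// A recipe reduced to what a ratio calculation needs.
#[derive(Debug, Clone, Copy)]
struct Recipe {
    /// Seconds per craft at crafting speed 1.0.
    craft_seconds: f64,
    /// Items produced per craft.
    output_count: f64,
}

/// Items per second from one machine running `recipe` at `crafting_speed`.
fn machine_rate(recipe: Recipe, crafting_speed: f64) -> f64 {
    recipe.output_count * crafting_speed / recipe.craft_seconds
}

/// Smallest whole number of machines whose combined output reaches
/// `target_per_second`. A tiny epsilon keeps floating-point noise from
/// rounding an exact ratio such as 48.0 up to 49.
fn machines_needed(recipe: Recipe, crafting_speed: f64, target_per_second: f64) -> u32 {
    let exact = target_per_second / machine_rate(recipe, crafting_speed);
    (exact - 1e-9).ceil().max(0.0) as u32
}
```

With `Recipe { craft_seconds: 3.2, output_count: 1.0 }` and a crafting speed of 1.0, `machines_needed(recipe, 1.0, 15.0)` comes out at 48 furnaces to fill one belt. Doubling the crafting speed halves that to 24. An agent that skips this arithmetic tends to build lines that starve or back up.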
A Rust-Based Approach
I’m currently experimenting with a Rust rewrite of the agent layer using Rig, a Rust library for building LLM-powered applications.
The design is deliberate and rests on three pieces.
1. Typed Tools
Every game action becomes a strongly typed tool:
- Place entity
- Connect belts
- Query inventory
- Inspect recipes
- Read map state
The domain is highly structured.
Rust’s type system allows encoding that structure directly into the interface.
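As a rough sketch of what "typed tools" could look like, the actions listed above can be modeled as one enum with a variant per tool. Everything here is an assumption made for illustration: the type names, the fields, and especially the `/agent ...` command strings, which stand in for whatever a server-side mod would actually accept. This is not Rig's tool API and not a Factorio console command set.

```rust
/// Tile coordinates on the map grid.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Position {
    x: i32,
    y: i32,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Direction {
    North,
    East,
    South,
    West,
}

impl Direction {
    fn as_str(self) -> &'static str {
        match self {
            Direction::North => "north",
            Direction::East => "east",
            Direction::South => "south",
            Direction::West => "west",
        }
    }
}

/// One variant per tool the agent may call.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    PlaceEntity { name: String, at: Position, facing: Direction },
    ConnectBelts { from: Position, to: Position },
    QueryInventory,
    InspectRecipe { item: String },
    ReadMapState { center: Position, radius: u32 },
}

impl Action {
    /// Render the action as a command string.
    /// The "/agent" prefix and argument order are hypothetical.
    fn to_command(&self) -> String {
        match self {
            Action::PlaceEntity { name, at, facing } => {
                format!("/agent place {} {} {} {}", name, at.x, at.y, facing.as_str())
            }
            Action::ConnectBelts { from, to } => {
                format!("/agent connect {} {} {} {}", from.x, from.y, to.x, to.y)
            }
            Action::QueryInventory => "/agent inventory".to_string(),
            Action::InspectRecipe { item } => format!("/agent recipe {}", item),
            Action::ReadMapState { center, radius } => {
                format!("/agent map {} {} {}", center.x, center.y, radius)
            }
        }
    }
}
```

The benefit is that a malformed call, such as a placement without a direction, fails to compile instead of failing on the server. The `match` is exhaustive, so adding a new tool forces every consumer to handle it.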
2. Multi-Turn Agent Loops over RCON
Instead of single-shot execution, the agent operates in iterative loops over RCON, the game server's remote console:
- Observe world state
- Plan next action
- Execute via RCON
- Re-evaluate
This creates a feedback-driven control system rather than a stateless command generator.
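The loop above can be sketched in a few lines if the network and the model are hidden behind traits. `Console` and `Planner` are hypothetical seams invented for this example: a real `Console` would wrap an RCON connection and a real `Planner` would call an LLM. `FakeConsole` and `GoalPlanner` are toy stand-ins so the control flow can run without a server.

```rust
/// Runs a console command and returns the server's reply (hypothetical seam).
trait Console {
    fn execute(&mut self, command: &str) -> Result<String, String>;
}

/// Chooses the next command from the latest observation and the result of
/// the previous command. Returning `None` ends the loop.
trait Planner {
    fn next_command(&mut self, observation: &str, last_result: Option<&str>) -> Option<String>;
}

/// Observe, plan, execute, then feed the outcome into the next iteration.
fn run_loop(console: &mut dyn Console, planner: &mut dyn Planner, max_steps: usize) -> Vec<String> {
    let mut transcript = Vec::new();
    let mut last_result: Option<String> = None;
    for _ in 0..max_steps {
        // Observe: errors become text so the planner can react to them.
        let observation = match console.execute("/agent observe") {
            Ok(text) => text,
            Err(err) => format!("observe failed: {}", err),
        };
        // Plan.
        let command = match planner.next_command(&observation, last_result.as_deref()) {
            Some(command) => command,
            None => break,
        };
        // Execute.
        let outcome = match console.execute(&command) {
            Ok(reply) => reply,
            Err(err) => format!("error: {}", err),
        };
        transcript.push(format!("{} -> {}", command, outcome));
        last_result = Some(outcome);
    }
    transcript
}

/// Toy world that only counts furnaces.
struct FakeConsole {
    furnaces: u32,
}

impl Console for FakeConsole {
    fn execute(&mut self, command: &str) -> Result<String, String> {
        match command {
            "/agent observe" => Ok(format!("furnaces={}", self.furnaces)),
            "/agent place stone-furnace" => {
                self.furnaces += 1;
                Ok("placed".to_string())
            }
            other => Err(format!("unknown command: {}", other)),
        }
    }
}

/// Toy planner that keeps placing furnaces until the goal is observed.
struct GoalPlanner {
    goal: u32,
}

impl Planner for GoalPlanner {
    fn next_command(&mut self, observation: &str, _last_result: Option<&str>) -> Option<String> {
        if observation == format!("furnaces={}", self.goal) {
            None
        } else {
            Some("/agent place stone-furnace".to_string())
        }
    }
}
```

The step budget matters in practice: without `max_steps`, a planner that never reaches its goal would loop forever against a live server.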
3. RAG over the Recipe Graph
Factorio’s crafting tree is a dependency graph.
Using retrieval over:
- The recipe tree
- Wiki documentation
- Item production chains
allows grounding decisions in structured domain knowledge instead of relying purely on model memory.
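One simple form of retrieval over the recipe tree is a plain graph traversal: given a target item, collect every recipe it depends on and hand those lines to the model as context. The sketch below does that with a depth-first walk. It is a structural lookup rather than embedding-based search, and the four-recipe graph is a hand-written excerpt for illustration, not data loaded from the game.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Maps each craftable item to its ingredients and amounts per craft.
type RecipeGraph = BTreeMap<&'static str, Vec<(&'static str, u32)>>;

/// A tiny hand-written excerpt of the recipe tree (illustrative only).
fn sample_graph() -> RecipeGraph {
    let mut graph = RecipeGraph::new();
    graph.insert(
        "automation-science-pack",
        vec![("copper-plate", 1), ("iron-gear-wheel", 1)],
    );
    graph.insert("iron-gear-wheel", vec![("iron-plate", 2)]);
    graph.insert("iron-plate", vec![("iron-ore", 1)]);
    graph.insert("copper-plate", vec![("copper-ore", 1)]);
    graph
}

/// Depth-first walk that emits a recipe only after all of its ingredients.
/// Items without a recipe (raw resources) produce no line.
fn visit(graph: &RecipeGraph, item: &str, seen: &mut BTreeSet<String>, out: &mut Vec<String>) {
    // Skip items already handled; this also guards against cycles.
    if !seen.insert(item.to_string()) {
        return;
    }
    if let Some(ingredients) = graph.get(item) {
        for (name, _) in ingredients {
            visit(graph, name, seen, out);
        }
        let parts: Vec<String> = ingredients
            .iter()
            .map(|(name, count)| format!("{} x{}", name, count))
            .collect();
        out.push(format!("{} <- {}", item, parts.join(", ")));
    }
}

/// Returns the recipes needed for `target`, ingredients before products,
/// as text lines that could be placed into a prompt.
fn retrieve_chain(graph: &RecipeGraph, target: &str) -> Vec<String> {
    let mut seen = BTreeSet::new();
    let mut out = Vec::new();
    visit(graph, target, &mut seen, &mut out);
    out
}
```

Calling `retrieve_chain(&sample_graph(), "automation-science-pack")` yields four lines, ending with the science pack itself. Because ingredients come before products, the output doubles as a valid build order.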
Why Rust Fits
Factorio is deterministic and rule-based.
The action space is structured. The state transitions are explicit. The constraints are mechanical.
Rust feels like a natural fit for:
- Modeling state transitions
- Enforcing invariants
- Building typed agent tooling
- Keeping orchestration predictable
When the domain itself is a graph of dependencies, types become leverage.
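As one illustration of "modeling state transitions" and "enforcing invariants", crafting can be written as a pure function from one inventory to the next. The `Inventory` and `CraftError` types are hypothetical and model only the agent's local view of its items, not the game's actual inventory rules. The invariant is that a failed craft never leaves a half-applied state behind.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum CraftError {
    /// Not enough of `item` to pay for the craft.
    Missing { item: String, needed: u32, have: u32 },
}

/// The agent's local model of what it is carrying.
#[derive(Debug, Clone, Default, PartialEq)]
struct Inventory {
    items: BTreeMap<String, u32>,
}

impl Inventory {
    /// Number of `item` held; absent items count as zero.
    fn count(&self, item: &str) -> u32 {
        self.items.get(item).copied().unwrap_or(0)
    }

    /// Returns a new inventory with `amount` more of `item`.
    fn add(mut self, item: &str, amount: u32) -> Inventory {
        *self.items.entry(item.to_string()).or_insert(0) += amount;
        self
    }

    /// Pure transition: consumes `inputs` and adds `output` on a copy.
    /// On error the original inventory is untouched, because all changes
    /// are made to `next` and discarded if any ingredient is short.
    fn craft(&self, inputs: &[(&str, u32)], output: (&str, u32)) -> Result<Inventory, CraftError> {
        let mut next = self.clone();
        for (item, needed) in inputs {
            let have = next.count(item);
            if have < *needed {
                return Err(CraftError::Missing {
                    item: item.to_string(),
                    needed: *needed,
                    have,
                });
            }
            next.items.insert(item.to_string(), have - *needed);
        }
        let (out_item, out_count) = output;
        Ok(next.add(out_item, out_count))
    }
}
```

Since counts are unsigned and every subtraction is checked first, a negative stock level cannot be represented at all. The error value also tells the planner exactly which ingredient is short and by how much.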
The Gap
Humans build megabases.
AI struggles to build a stable smelting line.
That gap is not just amusing — it’s informative.
It exposes the limits of current reasoning systems when faced with:
- Long-horizon planning
- Structural optimization
- Persistent world interaction
Factorio may quietly become one of the most revealing AI benchmarks available.
The factory must grow — for both humans and AI.