Self-Operating Computer Framework
Framework that lets a multimodal model view the screen and control mouse and keyboard to complete tasks on a computer.
The Self-Operating Computer Framework lets a multimodal model operate a real computer the way a person does: it captures screenshots, decides what to do, and issues mouse and keyboard actions to accomplish a stated objective. It's a compact, readable reference implementation of the "computer use" pattern that works across multiple vision-capable models.
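The capture, decide, act cycle described above can be sketched in a few lines. This is a minimal illustration, not the framework's actual code: `Action`, `run_objective`, and the `capture`, `decide`, and `execute` callables are hypothetical names, and the model and the screen are replaced by scripted stand-ins so the sketch runs without any desktop access.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical action record: kind is "click", "type", or "done"."""
    kind: str
    payload: dict = field(default_factory=dict)


def run_objective(
    objective: str,
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Loop: take a screenshot, ask the model for the next action, perform it."""
    history: List[Action] = []
    for _ in range(max_steps):  # hard cap so a confused model cannot loop forever
        screenshot = capture()
        action = decide(objective, screenshot, history)
        history.append(action)
        if action.kind == "done":
            break
        execute(action)
    return history


# Scripted stand-ins (assumptions for illustration only).
_script = [
    Action("click", {"x": 10, "y": 20}),
    Action("type", {"text": "hello"}),
    Action("done"),
]


def scripted_decide(objective: str, screenshot: bytes, history: List[Action]) -> Action:
    # A real backend would send the screenshot and objective to a vision model.
    return _script[len(history)]


performed: List[Action] = []
steps = run_objective("say hello", lambda: b"fake-screenshot", scripted_decide, performed.append)
print([a.kind for a in steps])  # ['click', 'type', 'done']
```

Passing the three stages in as callables keeps the loop independent of any particular model or input library, which is one plausible way to keep such a loop small and testable.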
Key features
- Screen-perceive → reason → click/type loop driven by a vision LLM
- Pluggable backends (GPT-4o, Gemini, Claude, LLaVA, and others)
- Natural-language objectives: "open a browser and search for..."
- Cross-platform desktop control (macOS, Windows, Linux)
- Small, hackable codebase for building your own computer-use agent
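One way to picture the pluggable-backend idea from the list above is a small registry that maps a model name to a function with a common signature. The sketch below is an assumption about how such a design could look, not the project's real interface: `register_backend`, `get_backend`, and the `echo` backend are hypothetical, and no real model API is called.

```python
from typing import Callable, Dict

# A backend takes the objective and a screenshot and returns the next operation.
Backend = Callable[[str, bytes], dict]

_BACKENDS: Dict[str, Backend] = {}


def register_backend(name: str) -> Callable[[Backend], Backend]:
    """Decorator that stores a backend function under a model name."""
    def wrap(fn: Backend) -> Backend:
        _BACKENDS[name] = fn
        return fn
    return wrap


def get_backend(name: str) -> Backend:
    """Look up a backend, listing the known names if the lookup fails."""
    try:
        return _BACKENDS[name]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS)) or "none"
        raise ValueError(f"unknown backend {name!r}; available: {known}") from None


@register_backend("echo")
def echo_backend(objective: str, screenshot: bytes) -> dict:
    # Placeholder backend: a real one would call a vision model here.
    return {"operation": "done", "summary": objective}


backend = get_backend("echo")
print(backend("open a browser and search for cats", b""))
```

With this shape, adding another model means writing one function and decorating it; the control loop only ever sees the shared `Backend` signature.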
The framework is a concrete starting point for GUI automation and computer-use experiments. It is useful for testing, for automating repetitive desktop workflows, and for research into agents that operate arbitrary software.
Curated mirror of the open-source Self-Operating Computer Framework (MIT). Get it from the source.