Self-Operating Computer Framework

The Self-Operating Computer Framework lets a multimodal model operate a real computer the way a person does: it captures screenshots, decides what to do, and issues mouse and keyboard actions to accomplish a stated objective. It's a compact, readable reference implementation of the "computer use" pattern that works across multiple vision-capable models.

Key features

Screen-perceive → reason → click/type loop driven by a vision LLM
Pluggable backends (GPT-4o, Gemini, Claude, LLaVA, and others)
Natural-language objectives: "open a browser and search for..."
Cross-platform desktop control (macOS, Windows, Linux)
Small, hackable codebase for building your own computer-use agent
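
The perceive, reason, act loop and the pluggable-backend idea from the list above can be sketched in a few lines. This is a minimal illustration, not the framework's actual code: the names Action, Backend, ScriptedBackend, and run are hypothetical, and the model is replaced by a scripted stand-in so the example runs without an API key or a real screen.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Action:
    """One step the agent wants to take (hypothetical schema)."""
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class Backend(Protocol):
    """Anything that can pick the next action from a screenshot."""

    def decide(self, objective: str, screenshot: bytes,
               history: list[Action]) -> Action: ...


class ScriptedBackend:
    """Stand-in for a vision model: replays a fixed plan, then stops."""

    def __init__(self, plan: list[Action]) -> None:
        self._plan = list(plan)

    def decide(self, objective: str, screenshot: bytes,
               history: list[Action]) -> Action:
        step = len(history)
        if step < len(self._plan):
            return self._plan[step]
        return Action("done")


def run(objective: str, backend: Backend,
        capture: Callable[[], bytes],
        act: Callable[[Action], None],
        max_steps: int = 10) -> list[Action]:
    """Capture the screen, ask the backend, perform the action, repeat."""
    history: list[Action] = []
    for _ in range(max_steps):       # hard cap so a confused model cannot loop forever
        screenshot = capture()       # perceive
        action = backend.decide(objective, screenshot, history)  # reason
        if action.kind == "done":
            break
        act(action)                  # act: click or type
        history.append(action)
    return history


if __name__ == "__main__":
    performed: list[Action] = []
    plan = [Action("click", x=100, y=200), Action("type", text="hello")]
    run("demo objective", ScriptedBackend(plan),
        capture=lambda: b"", act=performed.append)
    for a in performed:
        print(a)
```

Swapping models then means writing another class with the same decide method. On a real desktop, capture and act would be wired to a screenshot and input library (for example pyautogui's screenshot, click, and write functions); that wiring is an assumption here and is left out so the sketch stays side-effect free.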

It is a concrete starting point for GUI automation and computer-use experiments, useful for testing, automating repetitive desktop workflows, and research into agents that operate arbitrary software.

Curated mirror of the open-source Self-Operating Computer Framework (MIT). Get it from the source.
