Run LLMs Locally on Apple Silicon

OpenAI-compatible API server powered by MLX. Zero cloud. Full privacy. Native Metal performance.

Get Started Download Vesta 0.9.7 View on GitHub

Terminal

$ brew install scouzi1966/afm/afm

$ brew install --cask scouzi1966/afm/vesta-mac

$ afm mlx -m mlx-community/Qwen3-Coder-0.6B-4bit --port 9999

Server running on http://localhost:9999

Built for Local Inference

Everything you need to run LLMs on your Mac, with zero compromise.

⚡

Metal GPU Acceleration

Native Apple Metal for blazing fast inference. Fully utilizes your Mac's GPU for maximum throughput.

🔄

OpenAI Compatible

Drop-in replacement for the OpenAI API. Works with any client—Python, Node.js, curl, or your favorite IDE.

🛠️

Tool Calling

Full function calling support with streaming detection. Qwen, Llama, Mistral, and more—all formats covered.

💾

Prompt Caching

Server-level KV cache reuse for faster responses. Token-level prefix matching keeps context hot.

🧠

Thinking / Reasoning

Extract model reasoning with <think> tag support. Stream reasoning content alongside responses.

🔀

Multiple Backends

MLX models from Hugging Face and Apple Foundation Models on macOS 26+. Choose the backend that fits your workflow.

Quick Start

Up and running in under a minute.

Homebrew pip Build from Source

 $ brew install scouzi1966/afm/afm 
 $ brew install --cask scouzi1966/afm/vesta-mac 

Click to copy

$ pip install afm

Click to copy

 $ git clone https://github.com/scouzi1966/maclocal-api.git 
 $ cd maclocal-api 
 $ ./Scripts/build-from-scratch.sh 

Click to copy

Run your first model

$ afm mlx -m mlx-community/Qwen3-Coder-0.6B-4bit --port 9999

The server exposes /v1/chat/completions and /v1/models endpoints, compatible with any OpenAI client.

Vesta for macOS

Vesta 0.9.7 is available as a notarized Apple Silicon DMG with Qwen 3.6 and Gemma 4 MLX support.

$ brew install --cask scouzi1966/afm/vesta-mac

Direct download: Vesta-0.9.7.dmg

Supported Models

Run any MLX-format model from Hugging Face Hub.

Qwen3

Qwen3-Coder-0.6B-4bitQwen3-4B-4bit

Llama

Llama-3.2-3B-4bitCodeLlama-7B-4bit

Gemma

Gemma-2-2B-4bitGemma-3-4B-4bit

Mistral

Mistral-7B-v0.3-4bitCodestral-22B-4bit

Phi

Phi-4-mini-4bitPhi-3.5-mini-4bit

DeepSeek

DeepSeek-R1-0528-Qwen3-8B-4bitDeepSeek-Coder-V2-4bit

SmolLM

SmolLM2-1.7B-4bit

Starcoder2

Starcoder2-3B-4bit

Performance Benchmarks

Native Metal GPU inference on Apple Silicon.

Model	Device	Tokens/sec	Memory
Qwen3-Coder-0.6B-4bit	M1 MacBook Air (8 GB)	82 tok/s	1.2 GB
Llama-3.2-3B-4bit	M2 Pro Mac Mini (16 GB)	48 tok/s	3.1 GB
Qwen3-4B-4bit	M3 Max MacBook Pro (36 GB)	71 tok/s	3.8 GB
Mistral-7B-v0.3-4bit	M3 Max MacBook Pro (36 GB)	42 tok/s	5.6 GB
Phi-4-mini-4bit	M4 Pro Mac Mini (24 GB)	63 tok/s	4.2 GB

Benchmarks measured with default sampling parameters. Results may vary by system configuration and prompt length.

Works With Your Tools

Drop-in compatible with the tools you already use.

💻

OpenCode

AI coding assistant with local provider support. Point it at your AFM server for fully private code generation.

Learn more →

🪝

OpenClaw

CLI tool for LLM interactions. Use afm mlx --openclaw-config to generate the provider configuration automatically.

Learn more →

▶️

Continue

VS Code and JetBrains extension for AI-assisted development. Configure AFM as a local OpenAI-compatible provider.

Learn more →

🔌

Any OpenAI Client

Python openai library, Node.js SDK, curl, or any tool that speaks the OpenAI API—they all work out of the box.

Nightly Test Dashboard

Automated testing across models and configurations.

24 Models Tested

186 Test Cases

97.3% Pass Rate

Nightly test reports and model compatibility matrix

Join the Community

Open source. Built together.

⭐

GitHub

Star the repo, report issues, and contribute code. AFM is open source and community-driven.

View Repository →

💬

Discussions

Ask questions, share configurations, and show off what you've built with AFM.

Join Discussions →

🤝

Contributing

Read the contributor guide, pick up an issue, and submit a pull request.

Read the Guide →