Overview
MLX LM is a Python package and CLI for running, generating from, and fine-tuning large language models (LLMs) on Apple Silicon using the MLX stack. It focuses on practical workflows: loading models from the Hugging Face Hub, quantizing models (e.g., producing 4-bit versions), performing LoRA and full-model fine-tuning (including on quantized checkpoints), running distributed inference and training, and streaming generation output.
Key features
- Hugging Face Hub integration: load thousands of models with simple commands and link converted/quantized uploads back to the Hub.
- Quantization and conversion tooling: convert popular models into lower-bit formats (4-bit) and optionally upload the result to a specified Hugging Face repo.
- Fine-tuning: supports low-rank adapters (LoRA) and full-model fine-tuning, with explicit support for working on quantized models (see the CLI sketch after this list).
- Distributed workflows: mx.distributed support for distributed inference and fine-tuning.
- CLI + Python API: complete command-line tools (e.g., mlx_lm.generate, mlx_lm.chat, mlx_lm.convert) and a Python API for scripted use.
- Streaming generation: stream_generate yields token/response objects for real-time output.
- Sampling & logits processing: accepts custom samplers and logits processors to customize generation behavior.
- Long-context optimizations: rotating fixed-size key-value cache and prompt caching utilities to scale to long prompts and repeated contexts efficiently.
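Fine-tuning runs through the mlx_lm.lora command. Here is a minimal sketch; the model name, data path, and iteration count are illustrative placeholders, and the data directory is assumed to contain the JSONL files the LoRA tooling expects:
# Train LoRA adapters on a (possibly quantized) model; all values below are placeholders.
mlx_lm.lora --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --train --data ./data --iters 600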
Installation
Install via pip or conda:
pip install mlx-lm
# or
conda install -c conda-forge mlx-lm
Quick usage examples
- CLI generation:
mlx_lm.generate --prompt "How tall is Mt Everest?"
- Chat REPL:
mlx_lm.chat
- Python API (load + generate):
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Convert & quantize a model and upload to HF:
from mlx_lm import convert
repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"
convert(repo, quantize=True, upload_repo=upload_repo)
Long prompts & caching
MLX LM provides a rotating fixed-size KV cache (configurable via --max-kv-size) and prompt caching utilities. Prompt caching lets you precompute and reuse a large prefix across multiple queries, which is useful for multi-turn or repeated-context workloads.
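As a minimal sketch of in-process prompt caching, assuming the make_prompt_cache helper in mlx_lm.models.cache and the prompt_cache argument to generate, a single cache object can carry shared context across calls:
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# One cache object shared across calls; processed tokens stay cached in it.
prompt_cache = make_prompt_cache(model)

messages = [{"role": "user", "content": "Summarize the plot of Hamlet."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
generate(model, tokenizer, prompt=prompt, prompt_cache=prompt_cache, verbose=True)

# The follow-up passes only the new turn; the earlier context is reused from the cache.
messages = [{"role": "user", "content": "Now condense that to one sentence."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
generate(model, tokenizer, prompt=prompt, prompt_cache=prompt_cache, verbose=True)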
Streaming & custom sampling
Use stream_generate to stream generation outputs incrementally. Both generate and stream_generate accept sampler and logits_processors so you can plug in custom sampling algorithms and logits-level filters.
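A minimal streaming sketch, assuming the make_sampler helper in mlx_lm.sample_utils and a text field on the yielded response objects:
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

messages = [{"role": "user", "content": "Write a haiku about the ocean."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Temperature / nucleus sampling built with the package's sampler helper.
sampler = make_sampler(temp=0.7, top_p=0.9)

# stream_generate yields a response object per decoded segment; print incrementally.
for response in stream_generate(model, tokenizer, prompt, sampler=sampler, max_tokens=256):
    print(response.text, end="", flush=True)
print()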
Notes on large models and macOS
Some memory optimizations depend on macOS features: running models that are large relative to system RAM can be slow unless the memory-wiring support in macOS 15+ is available. The project provides guidance on raising the wired memory limit via the iogpu.wired_limit_mb sysctl when necessary.
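For example (the value below is illustrative; choose a limit larger than the model's size in MB but below the machine's total memory):
sudo sysctl iogpu.wired_limit_mb=32768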
Who is it for
MLX LM targets developers and researchers who want a practical, command-line-first and scriptable toolkit to run and adapt LLMs locally (especially on Apple Silicon), convert/quantize models for efficient inference, and integrate with the Hugging Face ecosystem.
Links & ecosystem
The project works closely with the MLX/Hugging Face community (many compatible models live under mlx-community on Hugging Face). It is distributed as a Python package and maintained on GitHub.
