MLX LM

MLX LM is a Python package for running, generating text with, and fine-tuning large language models on Apple Silicon using MLX. It integrates with the Hugging Face Hub and supports model quantization and uploading, low-rank (LoRA) and full-model fine-tuning (including of quantized models), distributed inference and training, streaming generation, custom samplers and logits processors, prompt caching, and a convenient CLI and Python API.

Introduction

Overview

MLX LM is a Python package and CLI for running, generating from, and fine-tuning large language models (LLMs) on Apple Silicon using the MLX stack. It focuses on practical workflows: making it easy to load models from the Hugging Face Hub, quantize models (e.g., produce 4-bit versions), perform LoRA and full-model fine-tuning (including for quantized checkpoints), run distributed inference/training, and stream generation outputs.

Key features
  • Hugging Face Hub integration: load thousands of models with a single command and upload converted or quantized models back to the Hub.
  • Quantization and conversion tooling: convert popular models to lower-bit formats (e.g., 4-bit) and optionally upload the result to a specified Hugging Face repo.
  • Fine-tuning: low-rank adapters (LoRA) and full-model fine-tuning, with explicit support for quantized models (see the sketch after this list).
  • Distributed workflows: distributed inference and fine-tuning via mx.distributed.
  • CLI + Python API: complete command-line tools (e.g., mlx_lm.generate, mlx_lm.chat, mlx_lm.convert) and a Python API for scripted use.
  • Streaming generation: stream_generate yields response objects incrementally for real-time output.
  • Sampling & logits processing: accepts custom samplers and logits processors to customize generation behavior.
  • Long-context optimizations: a rotating fixed-size key-value cache and prompt caching utilities scale efficiently to long prompts and repeated contexts.
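
As referenced in the fine-tuning bullet above, a minimal LoRA run from the command line might look like the following sketch; the data path is a placeholder and the flags (--train, --data, --iters) are assumed from current releases of the mlx_lm.lora command:

mlx_lm.lora --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --train --data ./data --iters 600

The data directory is typically expected to hold train.jsonl and valid.jsonl files; see the project's LoRA documentation for the exact format.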

Installation

Install via pip or conda:

pip install mlx-lm
# or
conda install -c conda-forge mlx-lm

Quick usage examples

  • CLI generation:
mlx_lm.generate --prompt "How tall is Mt Everest?"
  • Chat REPL:
mlx_lm.chat
  • Python API (load + generate):
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
text = generate(model, tokenizer, prompt=prompt, verbose=True)
  • Convert & quantize a model and upload it to the Hugging Face Hub (CLI equivalent shown after this list):
from mlx_lm import convert
repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"
convert(repo, quantize=True, upload_repo=upload_repo)
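
The same conversion is also available from the command line; this sketch mirrors the Python call above, with flags assumed from current releases of mlx_lm.convert:

mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q --upload-repo mlx-community/My-Mistral-7B-Instruct-v0.3-4bit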

Long prompts & caching

MLX LM provides a rotating fixed-size KV cache (configurable via --max-kv-size) and prompt caching utilities. Prompt caching lets you precompute and reuse a large prefix across multiple queries, which is useful for multi-turn or repeated-context workloads.
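
A minimal sketch of reusing a prompt cache across turns, assuming the make_prompt_cache helper lives in mlx_lm.models.cache as in current releases (the model name is just an example):

from mlx_lm import generate, load
from mlx_lm.models.cache import make_prompt_cache  # assumed helper location

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Create a reusable KV cache; it fills during the first generation and
# carries that context into later calls, skipping prefix reprocessing.
prompt_cache = make_prompt_cache(model)

messages = [{"role": "user", "content": "Summarize the plot of Hamlet."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, prompt_cache=prompt_cache)

# The follow-up turn reuses the cached context instead of recomputing it.
messages = [{"role": "user", "content": "Now compress that into one line."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, prompt_cache=prompt_cache)

Recent releases also include an mlx_lm.cache_prompt command for precomputing a prompt cache and saving it to disk for later reuse by mlx_lm.generate.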

Streaming & custom sampling

Use stream_generate to stream generation outputs incrementally. Both generate and stream_generate accept sampler and logits_processors so you can plug in custom sampling algorithms and logits-level filters.
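
A short sketch of streaming with a custom sampler; the make_sampler helper and its temp/top_p arguments are assumed from mlx_lm.sample_utils in current releases, and the model name is just an example:

from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler  # assumed helper location

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
messages = [{"role": "user", "content": "Write a haiku about the ocean."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Temperature plus nucleus sampling in place of the default sampler.
sampler = make_sampler(temp=0.7, top_p=0.9)

# Each yielded response object carries the newly decoded text segment.
for response in stream_generate(model, tokenizer, prompt, max_tokens=256, sampler=sampler):
    print(response.text, end="", flush=True)
print()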

Notes on large models and macOS

Some memory optimizations depend on the operating system: models that are large relative to system RAM can run slowly, and MLX LM mitigates this by wiring the memory occupied by the model and cache, which requires macOS 15 or newer. The project also provides guidance on raising the wired memory limit via the iogpu.wired_limit_mb sysctl when the default is too low.
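
For reference, the documented incantation for raising the wired limit looks like this, where N is machine-dependent (it should exceed the model size in megabytes while staying below total RAM):

sudo sysctl iogpu.wired_limit_mb=N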

Who is it for

MLX LM targets developers and researchers who want a practical, command-line-first and scriptable toolkit to run and adapt LLMs locally (especially on Apple Silicon), convert/quantize models for efficient inference, and integrate with the Hugging Face ecosystem.

The project works closely with the MLX/Hugging Face community (many compatible models live under mlx-community on Hugging Face). It is distributed as a Python package and maintained on GitHub.

Information

  • Website: github.com
  • Authors: ml-explore (GitHub organization)
  • Published date: 2025/03/11
