Overview
MLX LM is a Python package and CLI for running, generating from, and fine-tuning large language models (LLMs) on Apple Silicon using the MLX stack. It focuses on practical workflows: loading models from the Hugging Face Hub, quantizing models (e.g., producing 4-bit versions), performing LoRA and full-model fine-tuning (including on quantized checkpoints), running distributed inference and training, and streaming generation output.
Key features
- Hugging Face Hub integration: load thousands of models with simple commands and link converted/quantized uploads back to the Hub.
- Quantization and conversion tooling: convert popular models into lower-bit formats (4-bit) and optionally upload the result to a specified Hugging Face repo.
- Fine-tuning: supports low-rank adapters (LoRA) and full-model fine-tuning, with explicit support for working on quantized models (see the CLI sketch after this list).
- Distributed workflows: mx.distributed support for distributed inference and fine-tuning.
- CLI + Python API: complete command-line tools (e.g., mlx_lm.generate, mlx_lm.chat, mlx_lm.convert) and a Python API for scripted use.
- Streaming generation: stream_generate yields token/response objects for real-time output.
- Sampling & logits processing: accepts custom samplers and logits processors to customize generation behavior.
- Long-context optimizations: rotating fixed-size key-value cache and prompt caching utilities to scale to long prompts and repeated contexts efficiently.
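Fine-tuning runs through the mlx_lm.lora command. Here is a minimal sketch; the model name, data path, and iteration count are illustrative placeholders, and the data directory is assumed to contain the JSONL files the LoRA tooling expects:
# Train LoRA adapters on a (possibly quantized) model; all values below are placeholders.
mlx_lm.lora --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --train --data ./data --iters 600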
Installation
Install via pip or conda:
pip install mlx-lm
# or
conda install -c conda-forge mlx-lm
Quick usage examples
- CLI generation:
mlx_lm.generate --prompt "How tall is Mt Everest?"
- Chat REPL:
mlx_lm.chat
- Python API (load + generate):
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Convert & quantize a model and upload to HF:
from mlx_lm import convert
repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"
convert(repo, quantize=True, upload_repo=upload_repo)
Long prompts & caching
MLX LM provides a rotating fixed-size KV cache (configurable via --max-kv-size) and prompt caching utilities. Prompt caching lets you precompute and reuse a large prefix across multiple queries, which is useful for multi-turn or repeated-context workloads.
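As a minimal sketch of in-process prompt caching, assuming the make_prompt_cache helper in mlx_lm.models.cache and the prompt_cache argument to generate, a single cache object can carry shared context across calls:
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# One cache object shared across calls; processed tokens stay cached in it.
prompt_cache = make_prompt_cache(model)

messages = [{"role": "user", "content": "Summarize the plot of Hamlet."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
generate(model, tokenizer, prompt=prompt, prompt_cache=prompt_cache, verbose=True)

# The follow-up passes only the new turn; the earlier context is reused from the cache.
messages = [{"role": "user", "content": "Now condense that to one sentence."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
generate(model, tokenizer, prompt=prompt, prompt_cache=prompt_cache, verbose=True)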
Streaming & custom sampling
Use stream_generate to stream generation outputs incrementally. Both generate and stream_generate accept sampler and logits_processors so you can plug in custom sampling algorithms and logits-level filters.
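A minimal streaming sketch, assuming the make_sampler helper in mlx_lm.sample_utils and a text field on the yielded response objects:
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

messages = [{"role": "user", "content": "Write a haiku about the ocean."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Temperature / nucleus sampling built with the package's sampler helper.
sampler = make_sampler(temp=0.7, top_p=0.9)

# stream_generate yields a response object per decoded segment; print incrementally.
for response in stream_generate(model, tokenizer, prompt, sampler=sampler, max_tokens=256):
    print(response.text, end="", flush=True)
print()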
Notes on large models and macOS
Some memory optimizations depend on macOS features: running models that are large relative to system RAM can be slow unless the memory-wiring support in macOS 15+ is available. The project provides guidance on raising the wired memory limit via the iogpu.wired_limit_mb sysctl when necessary.
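For example (the value below is illustrative; choose a limit larger than the model's size in MB but below the machine's total memory):
sudo sysctl iogpu.wired_limit_mb=32768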
Who is it for
MLX LM targets developers and researchers who want a practical, command-line-first and scriptable toolkit to run and adapt LLMs locally (especially on Apple Silicon), convert/quantize models for efficient inference, and integrate with the Hugging Face ecosystem.
Links & ecosystem
The project works closely with the MLX/Hugging Face community (many compatible models live under mlx-community on Hugging Face). It is distributed as a Python package and maintained on GitHub.
