Overview
SAM-Audio (Segment Anything in Audio) is a foundation audio model from Meta / Facebook Research designed to separate or isolate specific sounds from complex audio mixtures. It accepts three kinds of prompts — natural-language text descriptions, visual cues (video frames + masks), and temporal spans — and produces separated target audio and residual audio. The project pairs the separation model with ranking/judging modules (e.g., Judge, CLAP, ImageBind similarity) to select or score the best separation candidate.
Key features
- Multi-modal prompting: text, visual, and temporal span prompts so users can specify the target sound by description, by pointing to an object in a video frame, or by indicating time ranges.
- Multiple model sizes and variants: sam-audio-small / sam-audio-base / sam-audio-large (and *_tv variants optimized for visual prompts).
- Built on the Perception-Encoder Audio-Visual (PE-AV) backbone for audio-visual perception, with additional reranking/judge models used for candidate selection and quality assessment.
- Open-source inference code and example notebooks in the GitHub repo; pretrained checkpoints are hosted on Hugging Face (access requires requesting permission).
Typical workflows
- Authenticate to Hugging Face to download checkpoints (repo: facebook/sam-audio-large); see the login sketch after this list.
- Load the model and processor (PyTorch / torchaudio), prepare audio or video inputs.
- Provide a prompt (e.g., "man speaking", masked video region, or time anchors) and call model.separate(...).
- Optionally enable span prediction and candidate reranking (predict_spans=True, reranking_candidates=k) to improve separation at higher compute cost.
- Use Judge / CLAP / ImageBind to score or select the best output among candidates.
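The Hugging Face login in the first step is usually a one-time setup. A minimal sketch using the huggingface_hub client (assuming a standard access token for the gated facebook/sam-audio-* repos; running `huggingface-cli login` in a terminal works equally well):

from huggingface_hub import login

# Authenticate so that from_pretrained() can download the gated
# facebook/sam-audio-* checkpoints; the token is cached locally.
login(token="hf_...")  # replace with your own access token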
Example (Python):
from sam_audio import SAMAudio, SAMAudioProcessor
import torchaudio
import torch
# Load the pretrained model and matching processor (requires Hugging Face access)
model = SAMAudio.from_pretrained("facebook/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
model = model.eval().cuda()
# Prepare the input audio and a text prompt describing the target sound
batch = processor(audios=["path/to.wav"], descriptions=["man speaking"]).to("cuda")
# Separate; span prediction and reranking are disabled here for speed
with torch.inference_mode():
    result = model.separate(batch, predict_spans=False, reranking_candidates=1)
# Save outputs
sample_rate = processor.audio_sampling_rate
torchaudio.save("target.wav", result.target.cpu(), sample_rate)
torchaudio.save("residual.wav", result.residual.cpu(), sample_rate)Models & evaluation
The repo documents subjective evaluation scores across categories (general SFX, speech, speaker, music, instruments) for each released size. It also provides evaluation scripts in the eval directory so researchers can reproduce paper results.
Dependencies & requirements
- Python >= 3.11
- CUDA-compatible GPU recommended for reasonable performance
- PyTorch / torchaudio ecosystem
- Hugging Face authentication required to download official checkpoints
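A quick local sanity check for these requirements (a minimal sketch that only verifies the interpreter version and whether PyTorch can see a CUDA device):

import sys
import torch
import torchaudio

# SAM-Audio targets Python >= 3.11; a CUDA GPU is recommended but not strictly required.
assert sys.version_info >= (3, 11), "Python >= 3.11 is required"
print("torch:", torch.__version__, "| torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())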
Reranking / judging
To score candidate separations and pick the best output, SAM-Audio integrates or recommends:
- CLAP: scores the similarity between the separated audio and the text description.
- Judge model (facebook/sam-audio-judge): provides precision/recall/faithfulness scores.
- ImageBind: for visual prompts, scores candidates by embedding similarity with the masked video region.
The repo supports generating multiple candidate separations and selecting the top candidate using these scorers.
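As an illustration of text-based reranking, the sketch below scores candidate separations against the prompt with an off-the-shelf CLAP model from the transformers library and keeps the best one. The checkpoint name (laion/clap-htsat-unfused), the 48 kHz input rate, and the `candidates` list of mono waveform tensors are assumptions for the example, not part of the SAM-Audio API:

import torch
import torchaudio
from transformers import ClapModel, ClapProcessor

clap = ClapModel.from_pretrained("laion/clap-htsat-unfused").eval()
clap_processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def clap_rank(candidates, description, sample_rate):
    """Return (best_index, scores) for a list of 1-D mono candidate waveforms."""
    # CLAP expects 48 kHz audio, so resample each candidate first.
    audios = [torchaudio.functional.resample(c, sample_rate, 48_000).numpy()
              for c in candidates]
    inputs = clap_processor(text=[description], audios=audios,
                            sampling_rate=48_000, return_tensors="pt")
    with torch.inference_mode():
        text_emb = clap.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        audio_emb = clap.get_audio_features(input_features=inputs["input_features"])
    # Cosine similarity between each candidate and the text prompt; highest wins.
    scores = torch.nn.functional.cosine_similarity(audio_emb, text_emb)
    return int(scores.argmax()), scores

# Example usage (candidates at the separation model's output sample rate):
# best_idx, scores = clap_rank(candidates, "man speaking", sample_rate)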
License, citation and resources
- License: SAM License (see LICENSE in repo).
- Paper: "SAM Audio: Segment Anything in Audio" (arXiv / Meta research pages).
- Blog & demo: Meta AI blog post and an online demo are linked from the repo README.
- Citation: the repo includes a BibTeX entry for the paper and author list.
Use cases
- Source separation for speech, music, sound effects, and environmental sounds.
- Audio editing or post-production: isolate and remove or enhance sounds.
- Multimodal research combining audio and vision for audiovisual source separation.
- Downstream tasks needing targeted isolation (ASR preprocessing, sound event detection, audio forensics).
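For the ASR-preprocessing case in the last bullet, a minimal sketch that feeds the separated target into a generic transformers ASR pipeline (the openai/whisper-small checkpoint is an arbitrary illustrative choice, not something the repo prescribes):

import torchaudio
from transformers import pipeline

# Load the isolated speech produced by the separation example earlier.
waveform, sr = torchaudio.load("target.wav")

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
# The pipeline accepts raw mono audio plus its sampling rate.
text = asr({"raw": waveform.mean(dim=0).numpy(), "sampling_rate": sr})["text"]
print(text)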
Limitations & considerations
- Large models require GPU memory and compute; span prediction and candidate reranking increase latency and memory usage.
- Checkpoint access currently requires Hugging Face authentication and repo access.
- As with many foundation models, performance varies by sound type and mixing conditions; the repo includes evaluation scores to guide model choice.
