Overview
SAM-Audio (Segment Anything in Audio) is a foundation audio model from Meta / Facebook Research designed to separate or isolate specific sounds from complex audio mixtures. It accepts three kinds of prompts — natural-language text descriptions, visual cues (video frames + masks), and temporal spans — and produces separated target audio and residual audio. The project pairs the separation model with ranking/judging modules (e.g., Judge, CLAP, ImageBind similarity) to select or score the best separation candidate.
Key features
- Multi-modal prompting: text, visual, and temporal span prompts so users can specify the target sound by description, by pointing to an object in a video frame, or by indicating time ranges.
- Multiple model sizes and variants: sam-audio-small / sam-audio-base / sam-audio-large (and *_tv variants optimized for visual prompts).
- Built on the Perception-Encoder Audio-Visual (PE-AV) backbone for audio-visual perception, with additional reranking/judge models used for candidate selection and quality assessment.
- Open-source inference code and example notebooks in the GitHub repo; pretrained checkpoints are hosted on Hugging Face (access requires requesting permission).
Typical workflows
- Authenticate to Hugging Face to download checkpoints (repo: facebook/sam-audio-large); see the login sketch after this list.
- Load the model and processor (PyTorch / torchaudio), prepare audio or video inputs.
- Provide a prompt (e.g., "man speaking", masked video region, or time anchors) and call model.separate(...).
- Optionally enable span prediction and candidate reranking (predict_spans=True, reranking_candidates=k) to improve separation at higher compute cost.
- Use Judge / CLAP / ImageBind to score or select the best output among candidates.
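The Hugging Face login in the first step is usually a one-time setup. A minimal sketch using the huggingface_hub client (assuming a standard access token for the gated facebook/sam-audio-* repos; running `huggingface-cli login` in a terminal works equally well):

from huggingface_hub import login

# Authenticate so that from_pretrained() can download the gated
# facebook/sam-audio-* checkpoints; the token is cached locally.
login(token="hf_...")  # replace with your own access token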
Example (Python):
from sam_audio import SAMAudio, SAMAudioProcessor
import torchaudio
import torch
# Load the pretrained model and matching processor (requires Hugging Face access)
model = SAMAudio.from_pretrained("facebook/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
model = model.eval().cuda()
# Prepare the input audio and a text prompt describing the target sound
batch = processor(audios=["path/to.wav"], descriptions=["man speaking"]).to("cuda")
# Separate; span prediction and reranking are disabled here for speed
with torch.inference_mode():
    result = model.separate(batch, predict_spans=False, reranking_candidates=1)
# Save outputs
sample_rate = processor.audio_sampling_rate
torchaudio.save("target.wav", result.target.cpu(), sample_rate)
torchaudio.save("residual.wav", result.residual.cpu(), sample_rate)Models & evaluation
The repo documents subjective evaluation scores across categories (general SFX, speech, speaker, music, instruments) for each released size. It also provides evaluation scripts in the eval directory so researchers can reproduce paper results.
Dependencies & requirements
- Python >= 3.11
- CUDA-compatible GPU recommended for reasonable performance
- PyTorch / torchaudio ecosystem
- Hugging Face authentication required to download official checkpoints
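A quick local sanity check for these requirements (a minimal sketch that only verifies the interpreter version and whether PyTorch can see a CUDA device):

import sys
import torch
import torchaudio

# SAM-Audio targets Python >= 3.11; a CUDA GPU is recommended but not strictly required.
assert sys.version_info >= (3, 11), "Python >= 3.11 is required"
print("torch:", torch.__version__, "| torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())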
Reranking / judging
To score candidate separations and pick the best output, SAM-Audio integrates or recommends:
- CLAP: scores the similarity between the separated audio and the text description.
- Judge model (facebook/sam-audio-judge): provides precision/recall/faithfulness scores.
- ImageBind: for visual prompts, scores candidates by embedding similarity with the masked video region.
The repo supports generating multiple candidate separations and selecting the top candidate using these scorers.
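As an illustration of text-based reranking, the sketch below scores candidate separations against the prompt with an off-the-shelf CLAP model from the transformers library and keeps the best one. The checkpoint name (laion/clap-htsat-unfused), the 48 kHz input rate, and the `candidates` list of mono waveform tensors are assumptions for the example, not part of the SAM-Audio API:

import torch
import torchaudio
from transformers import ClapModel, ClapProcessor

clap = ClapModel.from_pretrained("laion/clap-htsat-unfused").eval()
clap_processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def clap_rank(candidates, description, sample_rate):
    """Return (best_index, scores) for a list of 1-D mono candidate waveforms."""
    # CLAP expects 48 kHz audio, so resample each candidate first.
    audios = [torchaudio.functional.resample(c, sample_rate, 48_000).numpy()
              for c in candidates]
    inputs = clap_processor(text=[description], audios=audios,
                            sampling_rate=48_000, return_tensors="pt")
    with torch.inference_mode():
        text_emb = clap.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        audio_emb = clap.get_audio_features(input_features=inputs["input_features"])
    # Cosine similarity between each candidate and the text prompt; highest wins.
    scores = torch.nn.functional.cosine_similarity(audio_emb, text_emb)
    return int(scores.argmax()), scores

# Example usage (candidates at the separation model's output sample rate):
# best_idx, scores = clap_rank(candidates, "man speaking", sample_rate)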
License, citation and resources
- License: SAM License (see LICENSE in repo).
- Paper: "SAM Audio: Segment Anything in Audio" (arXiv / Meta research pages).
- Blog & demo: Meta AI blog post and an online demo are linked from the repo README.
- Citation: the repo includes a BibTeX entry for the paper and author list.
Use cases
- Source separation for speech, music, sound effects, and environmental sounds.
- Audio editing or post-production: isolate and remove or enhance sounds.
- Multimodal research combining audio and vision for audiovisual source separation.
- Downstream tasks needing targeted isolation (ASR preprocessing, sound event detection, audio forensics).
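For the ASR-preprocessing case in the last bullet, a minimal sketch that feeds the separated target into a generic transformers ASR pipeline (the openai/whisper-small checkpoint is an arbitrary illustrative choice, not something the repo prescribes):

import torchaudio
from transformers import pipeline

# Load the isolated speech produced by the separation example earlier.
waveform, sr = torchaudio.load("target.wav")

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
# The pipeline accepts raw mono audio plus its sampling rate.
text = asr({"raw": waveform.mean(dim=0).numpy(), "sampling_rate": sr})["text"]
print(text)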
Limitations & considerations
- Large models require GPU memory and compute; span prediction and candidate reranking increase latency and memory usage.
- Checkpoint access currently requires Hugging Face authentication and repo access.
- As with many foundation models, performance varies by sound type and mixing conditions; the repo includes evaluation scores to guide model choice.
