AIAny

Curated AI Resources for Everyone

Nemotron-Labs-TwoTower-30B-A3B-Base-BF16

Generates text by iteratively denoising blocks of tokens with a two-tower design: a frozen autoregressive context tower and a trainable diffusion denoiser tower. It accepts a small quality loss in exchange for higher wall-clock throughput.

Introduction

Why this matters

Parallel block denoising lets large pretrained autoregressive models keep their learned context representations while switching to an iterative, multi-token-per-step decoding scheme. The core insight is that a frozen AR context tower can supply rich per-layer KV and Mamba states to a separate, trainable diffusion denoiser, enabling block-wise mask-diffusion generation that commits multiple high-confidence tokens per iteration and substantially increases wall-clock throughput with limited quality loss.
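The commit-by-confidence loop described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: `toy_denoiser` is a hypothetical stand-in that returns random confidences instead of running the denoiser tower, and the fallback that commits the single most confident proposal when nothing clears the threshold is an assumption added to guarantee progress.

```python
import random

MASK = None  # placeholder for a position that is still masked


def toy_denoiser(block, rng):
    """Hypothetical stand-in for the denoiser tower: proposes a
    (token, confidence) pair for every still-masked position."""
    return {
        i: (f"tok{i}", rng.random())
        for i, tok in enumerate(block)
        if tok is MASK
    }


def denoise_block(block_size=16, threshold=0.8, max_steps=64, seed=0):
    """Fill one block, committing every proposal whose confidence
    clears the threshold at each denoising step."""
    rng = random.Random(seed)
    block = [MASK] * block_size
    committed_per_step = []
    for _ in range(max_steps):
        proposals = toy_denoiser(block, rng)
        if not proposals:
            break  # nothing left to unmask
        accepted = {
            i: tok for i, (tok, conf) in proposals.items() if conf >= threshold
        }
        if not accepted:
            # Assumed fallback: commit the most confident proposal
            # so that every step makes progress.
            best = max(proposals, key=lambda i: proposals[i][1])
            accepted = {best: proposals[best][0]}
        for i, tok in accepted.items():
            block[i] = tok
        committed_per_step.append(len(accepted))
    return block, committed_per_step


block, steps = denoise_block()
print("tokens committed per step:", steps)
```

Any step that commits more than one token is where the scheme can gain over one-token-at-a-time decoding; the real model's confidences come from the denoiser's predictions rather than a random number generator.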

Key Capabilities
  • Near-AR quality with iterative decoding: preserves most of the backbone’s capabilities (reported ~98.7% of the autoregressive baseline on aggregate benchmarks) while shifting to block-wise generation.
  • Higher wall-clock throughput: commits multiple tokens per denoising step and achieves a reported ~2.42× generation speedup at the default operating point (confidence threshold γ=0.8, block_size=16).
  • Architectural separation of concerns: the frozen AR/context tower supplies layer-aligned KV and Mamba states; the denoiser tower uses bidirectional in-block attention and time-conditioned adaLN to refine noisy blocks without re-pretraining the full backbone.
  • Adaptation-light training: the denoiser was trained on ~2.1T tokens starting from a 25T-token-pretrained backbone, showing adaptation can recover most AR performance with a fraction of pretraining compute.
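As a rough illustration of the time-conditioned adaLN mentioned in the list above, the sketch below uses the common formulation in which a normalized activation is rescaled and shifted by values derived from the diffusion timestep. Whether this model uses exactly this form is an assumption; `time_conditioning` is a hypothetical closed-form stand-in for what would normally be a learned projection of a timestep embedding.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer normalization over one feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def time_conditioning(t, dim):
    """Hypothetical conditioning: maps a noise level t in [0, 1] to
    per-feature (scale, shift). A real model would learn this mapping."""
    scale = [0.1 * t * math.cos(i) for i in range(dim)]
    shift = [0.1 * t * math.sin(i) for i in range(dim)]
    return scale, shift


def ada_layer_norm(x, t):
    """Adaptive layer norm: normalize, then modulate by the timestep."""
    scale, shift = time_conditioning(t, len(x))
    normed = layer_norm(x)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


x = [0.5, -1.0, 2.0, 0.25]
print(ada_layer_norm(x, t=0.0))  # no modulation: same as layer_norm(x)
print(ada_layer_norm(x, t=1.0))  # fully modulated
```

With this toy conditioning, t=0 leaves the normalized activations unchanged, which makes the effect of the timestep easy to see in isolation.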
Who it’s for and trade-offs

Great fit if you need higher inference throughput from a pretrained autoregressive backbone and can provision multi-GPU NVIDIA hardware (two-tower diffusion inference typically uses 2× A100/H100 GPUs with BF16). The model is useful for text-generation workloads that tolerate occasional small quality drops in exchange for lower wall-clock latency.

Look elsewhere if you require strictly sequential, token-by-token deterministic decoding, need a low-memory single-GPU deployment of the full two-tower diffusion pipeline, or must avoid the constraints of the NVIDIA Nemotron Open Model License. Practical trade-offs include extra runtime complexity (placing the towers on separate devices and tuning mask-diffusion hyperparameters such as the confidence threshold and steps_per_block) and quality-throughput tuning: lowering the confidence threshold increases throughput at the cost of accuracy.
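The threshold trade-off can be illustrated with a small sweep. The confidences below are made-up random values for a single 16-token block, not model outputs, and accuracy itself is not modeled; the sketch only shows that a lower threshold admits more tokens per step, including less certain ones.

```python
import random


def committed(confidences, threshold):
    """Return the confidences of proposals that clear the threshold."""
    return [c for c in confidences if c >= threshold]


# Hypothetical per-position confidences for one 16-token block.
rng = random.Random(0)
confidences = [rng.random() for _ in range(16)]

sweep = {}
for gamma in (0.5, 0.8, 0.95):
    accepted = committed(confidences, gamma)
    sweep[gamma] = len(accepted)
    weakest = min(accepted) if accepted else None
    print(f"gamma={gamma}: committed {len(accepted)}/16, weakest accepted={weakest}")
```

The count of committed tokens can only stay the same or shrink as the threshold rises, which is the throughput side of the trade-off; the weakest accepted confidence is a loose proxy for the accuracy side.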

Practical notes
  • Default operating point: confidence-threshold unmasking with γ=0.8, block_size=16, steps_per_block tuned to balance quality and speed.
  • Backbone: derived from a 30B hybrid Mamba-2 / attention / MoE Nemotron-3-Nano model; the released checkpoint contains both towers (≈60B total params, BF16 weights).
  • License & runtime: governed by the NVIDIA Nemotron Open Model License; optimized for NVIDIA GPU stacks and HuggingFace Transformers with trust_remote_code.
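A back-of-the-envelope estimate helps explain the two-GPU guidance. The sketch below counts weight memory only, assuming 2 bytes per BF16 parameter, roughly 30B parameters per tower (inferred from the ≈60B total), and 80 GB cards; the GPU memory size is an assumption, and activations and KV/Mamba state caches are ignored.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each parameter in 16 bits


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_BF16):
    """Weight-only memory footprint in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


total_gb = weight_memory_gb(60e9)      # both towers together
per_tower_gb = weight_memory_gb(30e9)  # assumed even split between towers
gpu_gb = 80                            # assumed memory of one A100/H100 card

print(f"both towers: {total_gb:.0f} GB, one tower: {per_tower_gb:.0f} GB")
print("fits on one GPU:", total_gb <= gpu_gb)
print("one tower per GPU fits:", per_tower_gb <= gpu_gb)
```

Under these assumptions the full checkpoint exceeds a single 80 GB card, while one tower per card leaves some headroom for runtime state.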

This design is a concrete example of adapting large AR LLMs to iterative parallel decoding without full re-pretraining, useful when you can accept modest accuracy trade-offs to materially improve generation throughput.

Information

  • Website: huggingface.co
  • Organizations: NVIDIA Corporation
  • Authors: Fitsum Reda, John Kamalu, Roger Waleffe, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
  • Published date: 2026/04/11

Categories

  • AI Model

Tags

  • nvidia
  • huggingface
  • transformers
  • pytorch
  • llm
  • diffusers
  • ai-inference
  • ai-serving
  • ai-train

More Items


Rampart

2026
National Design Studio

Detects and redacts personally identifiable information (PII) in user-typed text on-device, replacing sensitive values with stable placeholders before any data leaves the browser. Uses a small quantized ONNX token-classification model plus deterministic recognizers for structured identifiers, and applies a policy-driven keep-set for coarse geography.

privacy, transformers, huggingface, nlp, typescript, +2

TabFM 1.0.0 (PyTorch)

2026
Google Research, Google
Weihao Kong, Abhimanyu Das

Performs zero-shot classification and regression on mixed numerical and categorical tabular data by treating training rows as in-context examples and predicting in a single forward pass. Uses alternating row/column attention and row compression; limited to 10 classes and model weights are non-commercial.

foundation-model, pytorch, huggingface, google, ai, +6

BugTraceAI-CORE-Ultra-27B-Q6

2026
BugTraceAI

Generates production-ready offensive-security artifacts from prompts—Nuclei templates, CVE PoCs, exploit scripts and pentest tooling—fine-tuned on bug-bounty reports and CVE writeups and quantized for consumer/server GPU deployment.

qwen, security, huggingface, llm, ai-tools, +4