Nemotron-Pretraining-Code-v3

Metadata-only index of 146.3M new GitHub source-code files (one row per file: commit_id, rel_path, language), released as an incremental update to Nemotron v1/v2 for LLM code pretraining. It is CC-BY-4.0 licensed and designed to be used jointly with the older versions.

Introduction

Large-scale code pretraining benefits from up-to-date indexes as much as from raw file content. This dataset supplies a compact, structured metadata layer — commit id, file path, and detected language — covering 146.3M newly-added GitHub files (≈173B tokens in the raw update) with a cutoff date of 2025-09-30. It is explicitly an incremental metadata update and is intended to be used alongside the v1/v2 corpora rather than as a standalone raw-code corpus.
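
As a rough sanity check, the per-file averages implied by the figures quoted in this listing can be computed directly. This is back-of-the-envelope arithmetic on rounded numbers (146.3M files, ≈173B tokens, ≈8.2 GB of metadata); the variable names are illustrative, and the byte figure assumes decimal gigabytes.

```python
# Rounded figures quoted in this listing; treat the results as estimates.
FILES = 146.3e6          # newly added files (one metadata row each)
RAW_TOKENS = 173e9       # approximate tokens in the raw update
METADATA_BYTES = 8.2e9   # approximate metadata size, assuming 1 GB = 10**9 bytes

tokens_per_file = RAW_TOKENS / FILES
metadata_bytes_per_row = METADATA_BYTES / FILES

print(f"average tokens per file: ~{tokens_per_file:,.0f}")
print(f"metadata bytes per row:  ~{metadata_bytes_per_row:.0f}")
```

The result is on the order of a thousand tokens per file and a few tens of bytes of metadata per row, which is consistent with the index being far smaller than the raw code it points to.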

What Sets It Apart
  • Incremental, metadata-first design: adds 146.3M records to Nemotron v1/v2, so you can identify and fetch only genuinely new files without reprocessing earlier corpora, which saves bandwidth and deduplication effort. The dataset itself is ≈8.2 GB of Parquet metadata, not raw source text.
  • Minimal, high-utility schema: each row contains commit_id (7 chars), rel_path, and detected language — enough to filter by repo, language, or commit provenance and to join with raw-content mirrors or internal archives.
  • Production-focused provenance: records come from automated GitHub harvesting with a clear cutoff and are traceable back to commits, which helps reproducibility, licensing checks, and selective re-training pipelines.
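
The three-field schema described above is small enough to filter with plain Python. The following is a minimal sketch, not the dataset's own tooling: the `FileRecord` type, the helper functions, the sample rows, and the language label spellings (`"Python"`, `"Go"`) are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, NamedTuple


class FileRecord(NamedTuple):
    """One metadata row with the three fields named in the listing."""
    commit_id: str  # short commit identifier (7 characters per the listing)
    rel_path: str   # file path (assumed relative to the repository root)
    language: str   # detected language label (exact spelling is an assumption)


# Hypothetical rows invented for illustration; real values will differ.
SAMPLE_ROWS = [
    FileRecord("a1b2c3d", "src/app/main.py", "Python"),
    FileRecord("a1b2c3d", "src/app/util.py", "Python"),
    FileRecord("9f8e7d6", "cmd/server/main.go", "Go"),
    FileRecord("0123abc", "vendor/lib/helper.py", "Python"),
]


def select_records(
    records: Iterable[FileRecord],
    languages: Iterable[str],
    exclude_prefixes: Iterable[str] = (),
) -> Iterator[FileRecord]:
    """Yield rows whose language is wanted and whose path is not excluded."""
    wanted = {name.lower() for name in languages}
    excluded = tuple(exclude_prefixes)
    for record in records:
        if record.language.lower() not in wanted:
            continue
        if excluded and record.rel_path.startswith(excluded):
            continue
        yield record


def group_by_commit(records: Iterable[FileRecord]) -> Dict[str, List[str]]:
    """Map each commit_id to the paths selected from it, for provenance checks."""
    groups: Dict[str, List[str]] = {}
    for record in records:
        groups.setdefault(record.commit_id, []).append(record.rel_path)
    return groups


selected = list(select_records(SAMPLE_ROWS, ["python"], exclude_prefixes=["vendor/"]))
language_counts = Counter(record.language for record in SAMPLE_ROWS)
paths_by_commit = group_by_commit(selected)
```

Grouping by `commit_id` keeps each selected path attached to the commit it came from, which is the hook for the reproducibility and licensing checks mentioned in the list above.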

Who It's For and Tradeoffs

Great fit if you run large-scale LLM code pretraining or maintenance pipelines and need an efficient way to locate and filter newly added GitHub files before downloading full sources. It helps teams that want incremental updates, targeted re-sampling, or provenance-aware deduplication.
Not appropriate if you need raw file contents out of the box — this dataset contains metadata only. Also, licensing and downstream use require attention (dataset licensed CC-BY-4.0), and users who cannot fetch raw files from GitHub or mirrors will need an additional ingestion step.

Where It Fits

Use this as the indexing layer in a code-pretraining workflow: filter and sample with the metadata, retrieve raw files from repositories or vendor mirrors, then merge with Nemotron v1/v2 to build a complete training corpus. It complements raw-code corpora rather than replacing them.
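
The filter-and-sample part of that workflow can be sketched as below. This is an illustration under stated assumptions rather than a documented procedure: it assumes the earlier versions can be reduced to the same `(commit_id, rel_path)` key, the function names and sample keys are hypothetical, and the retrieval step is left out because the listing does not specify where raw files are fetched from.

```python
import random
from typing import Iterable, List, Tuple

FileKey = Tuple[str, str]  # (commit_id, rel_path); assumed join key


def keep_unseen(update_keys: Iterable[FileKey], known_keys: Iterable[FileKey]) -> List[FileKey]:
    """Return update keys absent from the known set, preserving order
    and dropping repeats within the update itself."""
    seen = set(known_keys)
    fresh: List[FileKey] = []
    for key in update_keys:
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh


def sample_for_fetch(keys: List[FileKey], k: int, seed: int = 0) -> List[FileKey]:
    """Draw up to k keys without replacement; seeded so runs are repeatable."""
    rng = random.Random(seed)
    return rng.sample(keys, min(k, len(keys)))


# Hypothetical keys standing in for an earlier index and the new update.
known = {("1111111", "lib/old.py")}
update = [
    ("1111111", "lib/old.py"),
    ("2222222", "lib/new.py"),
    ("2222222", "lib/new.py"),
    ("3333333", "README.md"),
]

fresh = keep_unseen(update, known)
to_fetch = sample_for_fetch(fresh, k=1)
```

Only the keys in `to_fetch` would then be passed to whatever downloader or mirror lookup the pipeline uses, so raw content is retrieved for a chosen subset rather than for the whole update.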

Information

  • Website: huggingface.co
  • Authors: NVIDIA Corporation
  • Published date: 2026/05/28