Why This Matters
LLM-driven workflows can propose many low-level CUDA kernel variants, but verifying and optimizing those kernels on hardware at scale requires large, labeled corpora that link code to measured performance. This dataset fills that gap by pairing thousands of generated CUDA kernels with PyTorch references, per-kernel runtimes, and profiler outputs, so researchers can train, evaluate, or reward-model kernel-generation and optimization strategies.
What Sets It Apart
- Size and structure: roughly 30,000 kernels across three Parquet splits (level_1, level_2, level_3). Each row carries fields such as CUDA_Code, the PyTorch reference code, CUDA_Runtime, PyTorch_Native_Runtime, CUDA_Speedup_Native, CUDA_Speedup_Compile, Correct, Max_Diff, and any recorded error messages, all machine-readable for large-scale training.
- Hardware-aware labels: includes NVIDIA Nsight Compute (NCU) profiles, Torch profiles, and Clang-Tidy outputs alongside measured runtimes and speedup scores, enabling models to learn correlations between code patterns and hardware/runtime behavior.
- Multiple proposals per task: contains multiple kernel variants for the same kernel task plus verification outcomes, supporting preference modeling, ranking, and offline RL approaches to optimize for both correctness and runtime.
- Open licensing and portability: released under CC-BY-4.0 and packaged as parquet files for immediate use with datasets/pandas/polars tooling.
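The multiple-proposals structure described above lends itself to a simple preference-pair construction. What follows is a minimal sketch, not the dataset authors' pipeline. It works on plain dictionaries so it runs without downloading anything. The `Task_ID` grouping key, the `min_gap` threshold, and the sample rows are hypothetical; only `CUDA_Code`, `Correct`, and `CUDA_Speedup_Native` are taken from the field list.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows for illustration; in practice they might come from
# reading one of the Parquet splits into records with pandas or polars.
rows = [
    {"Task_ID": 1, "CUDA_Code": "// variant A", "Correct": True,  "CUDA_Speedup_Native": 1.8},
    {"Task_ID": 1, "CUDA_Code": "// variant B", "Correct": True,  "CUDA_Speedup_Native": 0.9},
    {"Task_ID": 1, "CUDA_Code": "// variant C", "Correct": False, "CUDA_Speedup_Native": 3.0},
    {"Task_ID": 2, "CUDA_Code": "// variant D", "Correct": True,  "CUDA_Speedup_Native": 1.2},
]


def preference_pairs(rows, task_key="Task_ID", min_gap=0.1):
    """Return (chosen, rejected) code pairs among correct variants of each task."""
    by_task = defaultdict(list)
    for row in rows:
        if row["Correct"]:  # discard variants that failed verification
            by_task[row[task_key]].append(row)

    pairs = []
    for variants in by_task.values():
        for a, b in combinations(variants, 2):
            gap = a["CUDA_Speedup_Native"] - b["CUDA_Speedup_Native"]
            if abs(gap) < min_gap:  # skip near-ties, which may be timing noise
                continue
            chosen, rejected = (a, b) if gap > 0 else (b, a)
            pairs.append((chosen["CUDA_Code"], rejected["CUDA_Code"]))
    return pairs


pairs = preference_pairs(rows)
```

Filtering on `Correct` before ranking keeps a fast but wrong kernel (variant C here) from being preferred over a slower correct one. Whether incorrect variants should instead appear as rejected examples is a design choice this sketch leaves open.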
Who It's For and Tradeoffs
Great fit if you want to fine-tune or evaluate models that generate or optimize GPU kernels, build reward models for performance-aware code synthesis, or study LLM-driven low-level code optimization at scale. The dataset is practical for offline RL, supervised fine-tuning, and preference-ranking experiments where measured runtime and profiler traces are useful signals.
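For the reward-modeling use case, one possible way to turn a row into a scalar signal is sketched below. This is an illustrative assumption rather than a scheme the dataset prescribes. The function name `kernel_reward`, the penalty and clipping values, and the definition of speedup as `PyTorch_Native_Runtime / CUDA_Runtime` are all hypothetical choices; only the field names come from the dataset description.

```python
import math


def kernel_reward(row, incorrect_penalty=-1.0, floor=-0.5, cap=3.0):
    """Hypothetical reward: fixed penalty if incorrect, else clipped log2 speedup."""
    cuda_time = row.get("CUDA_Runtime")
    ref_time = row.get("PyTorch_Native_Runtime")

    # Treat failed verification or unusable timings as incorrect.
    if not row.get("Correct") or not cuda_time or not ref_time:
        return incorrect_penalty
    if cuda_time <= 0 or ref_time <= 0:
        return incorrect_penalty

    # Assumed definition: speedup > 1 means the CUDA kernel beat the reference.
    speedup = ref_time / cuda_time

    # Log scale makes 2x faster and 2x slower symmetric; clipping bounds outliers
    # and keeps every correct kernel above the incorrect penalty.
    return max(floor, min(math.log2(speedup), cap))


fast = {"Correct": True, "CUDA_Runtime": 1.0, "PyTorch_Native_Runtime": 4.0}
slow = {"Correct": True, "CUDA_Runtime": 4.0, "PyTorch_Native_Runtime": 1.0}
wrong = {"Correct": False, "CUDA_Runtime": 0.5, "PyTorch_Native_Runtime": 4.0}

rewards = [kernel_reward(r) for r in (fast, slow, wrong)]
```

Because the ratio of two runtimes is unit-free, the sketch does not depend on the time unit used in the runtime columns, as long as both columns share it.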
Look elsewhere if you need hand-validated, production-ready kernels that can be deployed without further verification: generated kernels may still require hardware-specific validation and engineering, and their performance characteristics are tied to the benchmarked hardware and environment recorded in the profiles. The dataset also focuses on CUDA/PyTorch translation and optimization tasks and is not a general-purpose code corpus.
Where It Fits
This archive complements benchmarks like KernelBench by providing a large, ML-friendly corpus of model-generated kernel proposals annotated with runtime and profiler data. Use it to close the loop between generation, verification, and performance-driven optimization when developing or fine-tuning models that target GPU kernel synthesis.
