Most 3D scene datasets label whole objects, but agents need the small interactive parts and how to manipulate them. SceneFun3D closes that gap by pairing point-accurate 3D masks of interactive elements with affordance labels, motion parameters, and natural-language task descriptions across high-fidelity scans.
What Sets It Apart
- Fine-grained, manipulation-focused annotations: 14,867 interactive-element annotations across 710 Faro laser-scan scenes, annotated as point-index masks and exported as 3D detections.
- Motion + affordance + language: each element carries a Gibsonian affordance (9 classes), motion type (translational/rotational), axis/origin vectors, and free-form task descriptions (10,913 elements with descriptions; 17,133 with rephrasings).
- Multimodal alignment: each scene comes with high-resolution iPad RGB-D recordings (RGB, depth, camera poses, intrinsics), and the 3D elements are projected into the video frames. Everything is packaged as a FiftyOne FO3D grouped dataset for visualization and benchmarking.
- Benchmark focus: introduces three tasks (functionality segmentation, task-driven affordance grounding, and 3D motion estimation) aimed at robotics and embodied-AI manipulation research.
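
To make the motion parameters concrete, here is a minimal sketch of how a motion type, axis vector, and origin vector could be applied to the points of an element's mask. The function name `apply_motion`, its argument layout, and the string values `"translational"` and `"rotational"` are assumptions for illustration, not the dataset's actual field names or loader API. Rotation uses Rodrigues' formula about an axis passing through the origin point.

```python
import math


def apply_motion(points, motion_type, axis, origin, amount):
    """Move 3D points according to a motion annotation (illustrative only).

    points      -- iterable of (x, y, z) tuples
    motion_type -- "translational" or "rotational" (assumed labels)
    axis        -- motion axis direction, any non-zero length
    origin      -- a point on the rotation axis (unused for translation)
    amount      -- distance in scene units, or angle in radians
    """
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    k = [a / norm for a in axis]  # unit axis

    if motion_type == "translational":
        # Slide every point along the axis by the given distance.
        return [tuple(p[i] + amount * k[i] for i in range(3)) for p in points]

    if motion_type == "rotational":
        c, s = math.cos(amount), math.sin(amount)
        moved = []
        for p in points:
            # Express the point relative to the axis origin.
            v = [p[i] - origin[i] for i in range(3)]
            k_cross_v = (
                k[1] * v[2] - k[2] * v[1],
                k[2] * v[0] - k[0] * v[2],
                k[0] * v[1] - k[1] * v[0],
            )
            k_dot_v = sum(k[i] * v[i] for i in range(3))
            # Rodrigues' rotation formula.
            r = [
                v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1.0 - c)
                for i in range(3)
            ]
            moved.append(tuple(r[i] + origin[i] for i in range(3)))
        return moved

    raise ValueError(f"unknown motion type: {motion_type!r}")


# A hypothetical door-handle point rotated 90 degrees about a vertical hinge.
hinge_origin = (1.0, 1.0, 0.0)
rotated = apply_motion([(2.0, 1.0, 0.0)], "rotational",
                       (0.0, 0.0, 1.0), hinge_origin, math.pi / 2)

# A hypothetical drawer point pulled 0.5 units along its opening direction.
pulled = apply_motion([(0.0, 0.0, 0.0)], "translational",
                      (0.0, 0.0, 2.0), (0.0, 0.0, 0.0), 0.5)
```

Plain tuples keep the sketch dependency-free. For full laser-scan point clouds, a vectorized array library would be the more practical choice.
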
Who It's For and Tradeoffs
Great fit if you build or evaluate robotics, embodied-AI, or vision models that must localize tiny interactive parts and predict how to act on them (e.g., pick-and-place, manipulation target selection, action grounding from language). Look elsewhere if you only need large-scale object-level semantics, need commercial-use rights (SceneFun3D inherits CC BY-NC-SA), or cannot handle high-resolution Faro scans and cross-modal registration workflows. Practical limitations: test annotations are withheld for benchmark use, poorly captured reflective elements are excluded, 3D boxes are axis-aligned (no oriented box estimates), and the laser-scan and iPad assets carry nontrivial storage and registration costs.
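
The axis-aligned box limitation is easiest to see by deriving a box from a point-index mask. The sketch below assumes a mask is a list of indices into the scene's point list and that a box is reported as a center and a size. The helper name `mask_to_aabb` and this representation are hypothetical and may differ from how the dataset's detections are actually exported.

```python
def mask_to_aabb(points, mask_indices):
    """Return (center, size) of the axis-aligned box around the masked points.

    points       -- sequence of (x, y, z) tuples for the whole scene
    mask_indices -- indices of the points belonging to one element
    """
    if not mask_indices:
        raise ValueError("mask must contain at least one point index")
    selected = [points[i] for i in mask_indices]
    # Per-axis extremes define the tightest box aligned with the scene axes.
    mins = tuple(min(p[d] for p in selected) for d in range(3))
    maxs = tuple(max(p[d] for p in selected) for d in range(3))
    center = tuple((lo + hi) / 2.0 for lo, hi in zip(mins, maxs))
    size = tuple(hi - lo for lo, hi in zip(mins, maxs))
    return center, size


# Toy scene: points 1 and 3 form a thin element lying diagonally in the xy-plane.
scene_points = [
    (5.0, 5.0, 5.0),   # unrelated scene point
    (0.0, 0.0, 0.0),   # one end of the element
    (9.0, 9.0, 9.0),   # unrelated scene point
    (1.0, 1.0, 0.0),   # other end of the element
]
center, size = mask_to_aabb(scene_points, [1, 3])
```

Here the element is a thin diagonal segment, yet its axis-aligned box spans a full 1 by 1 footprint in x and y. An oriented box would hug the segment, which is why models that need tight part geometry may prefer to work from the masks directly.
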
