SceneFun3D

Provides point-accurate annotations of interactive parts in high-resolution indoor laser-scan point clouds, plus affordance labels, motion axes and natural-language task descriptions; includes aligned iPad RGB-D video slices with 2D projections for multimodal research.

Introduction

Most 3D scene datasets label whole objects, but agents need the small interactive parts and how to manipulate them. SceneFun3D closes that gap by pairing point-accurate 3D masks of interactive elements with affordance labels, motion parameters, and natural-language task descriptions across high-fidelity scans.

What Sets It Apart
  • Fine-grained, manipulation-focused annotations: 14,867 interactive-element annotations across 710 Faro laser-scan scenes, annotated as point-index masks and exported as 3D detections.
  • Motion + affordance + language: each element carries a Gibsonian affordance (9 classes), motion type (translational/rotational), axis/origin vectors, and free-form task descriptions (10,913 elements with descriptions; 17,133 with rephrasings).
  • Multimodal alignment: per-scene high-res iPad RGB-D recordings (hi-res RGB, depth, poses, intrinsics) with the 3D elements projected into video frames; provided as a FiftyOne FO3D grouped dataset for visualization and benchmarking.
  • Benchmark focus: introduces three tasks—functionality segmentation, task-driven affordance grounding, and 3D motion estimation—targeting robotics and embodied-AI manipulation research.
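
As a rough illustration of how the motion annotations described above could be consumed, the sketch below moves only the points selected by a point-index mask, either along the motion axis (translational) or around it through the origin (rotational, via Rodrigues' formula). The function `apply_motion`, its argument names, and the choice of radians and scene units are assumptions made for this example. They are not the dataset's actual schema or toolkit API.

```python
import math

def _normalize(v):
    """Return v scaled to unit length (assumption: axes may not be pre-normalized)."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0:
        raise ValueError("motion axis must be non-zero")
    return tuple(c / n for c in v)

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def apply_motion(points, mask, motion_type, axis, origin, amount):
    """Return a copy of `points` with the masked points moved.

    points      : list of (x, y, z) tuples
    mask        : indices into `points` (hypothetical point-index mask)
    motion_type : "translational" or "rotational"
    axis        : direction vector of the motion axis
    origin      : a point on the axis (only used for rotation)
    amount      : distance in scene units, or angle in radians (assumed units)
    """
    k = _normalize(axis)
    moved = [tuple(p) for p in points]
    for i in mask:
        p = points[i]
        if motion_type == "translational":
            # Slide the point along the unit axis.
            moved[i] = tuple(p[j] + amount * k[j] for j in range(3))
        elif motion_type == "rotational":
            # Rodrigues' rotation of (p - origin) about k, then shift back.
            r = tuple(p[j] - origin[j] for j in range(3))
            c, s = math.cos(amount), math.sin(amount)
            kxr, kdr = _cross(k, r), _dot(k, r)
            moved[i] = tuple(
                r[j] * c + kxr[j] * s + k[j] * kdr * (1 - c) + origin[j]
                for j in range(3)
            )
        else:
            raise ValueError(f"unknown motion type: {motion_type}")
    return moved

# Toy example: swing one "door handle" point 90 degrees about a vertical hinge.
scene = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
opened = apply_motion(scene, [0], "rotational", (0, 0, 1), (0, 0, 0), math.pi / 2)
```

Returning a new list rather than editing in place keeps the original scan untouched, which may be convenient when comparing several candidate motions for the same element.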

Who It's For and Tradeoffs

Great fit if you build or evaluate robotics, embodied-AI, or vision models that must localize tiny interactive parts and predict how to act on them (e.g., pick-and-place, manipulation target selection, action grounding from language). Look elsewhere if you only need large-scale object-level semantics, need a license that permits commercial use (SceneFun3D inherits CC BY-NC-SA), or cannot handle high-resolution Faro scans and cross-modal registration workflows. Practical limitations: test annotations are withheld for benchmark use, poorly captured reflective elements are excluded, 3D boxes are axis-aligned (no oriented box estimates), and storing and registering the laser-scan and iPad assets is nontrivial.
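
To make the axis-aligned box limitation concrete, here is a minimal sketch that derives an axis-aligned bounding box from a point-index mask. The helper names `aabb_from_mask` and `aabb_extent` are hypothetical and the data is a toy example. This is not how the dataset's own export is implemented, only an illustration of what an axis-aligned box captures.

```python
def aabb_from_mask(points, mask):
    """Return (min_corner, max_corner) of the points selected by `mask`.

    points : list of (x, y, z) tuples
    mask   : indices into `points` (hypothetical point-index mask)
    """
    if not mask:
        raise ValueError("mask must select at least one point")
    selected = [points[i] for i in mask]
    lo = tuple(min(p[j] for p in selected) for j in range(3))
    hi = tuple(max(p[j] for p in selected) for j in range(3))
    return lo, hi

def aabb_extent(lo, hi):
    """Side lengths of the box along x, y and z."""
    return tuple(hi[j] - lo[j] for j in range(3))

# Toy example: a thin handle lying diagonally in the xy-plane,
# plus one unrelated point that the mask leaves out.
scene = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
lo, hi = aabb_from_mask(scene, [0, 1, 2])
extent = aabb_extent(lo, hi)
```

For the diagonal handle the box spans a full square in x and y even though the element itself is a thin line, which is the kind of slack an oriented box would avoid.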

Information

  • Website: huggingface.co
  • Organizations: ETH Zurich, Google, Technical University of Munich, Microsoft
  • Authors: Alexandros Delitzas, Ayca Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, Francis Engelmann
  • Published date: 2024/10/10

Categories