3D dense captioning

9 papers with code • 0 benchmarks • 1 dataset

Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. Beyond the coarse semantic class prediction and bounding box regression of traditional 3D object detection, 3D dense captioning produces a finer, instance-level natural language description of the visual appearance and spatial relations of each scene object of interest.
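The task interface can be sketched as follows. This is a toy illustration of the expected input/output structure only, not any of the methods listed below; the function name and box encoding are hypothetical.

```python
import numpy as np

def dense_caption(point_cloud):
    """Toy stand-in for a 3D dense captioning model (not a real method).

    point_cloud: (N, 6) array of xyz coordinates plus rgb colors.
    Returns a list of predicted instances, each a 3D box with a description.
    """
    # A real model would detect individual objects; here we emit one dummy
    # instance covering the whole cloud to show the output structure.
    center = point_cloud[:, :3].mean(axis=0)
    size = point_cloud[:, :3].max(axis=0) - point_cloud[:, :3].min(axis=0)
    return [{
        "box": np.concatenate([center, size]),  # (cx, cy, cz, dx, dy, dz)
        "caption": "a brown wooden chair next to the table",
    }]

points = np.random.rand(1024, 6)  # toy scene: 1024 colored points
outputs = dense_caption(points)
print(outputs[0]["box"].shape, outputs[0]["caption"])
```

Each predicted instance thus pairs a localized 3D bounding box with a free-form sentence, which is what distinguishes the task from plain 3D object detection.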

Most implemented papers

X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

curryyuan/x-trans2cap CVPR 2022

Thus, a more faithful caption can be generated using only point clouds during inference.

MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes

SxJyJay/MORE 10 Mar 2022

3D dense captioning is a recently proposed task in which point clouds provide richer geometric information than their 2D counterparts.

Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

heng-hw/spacap3d 22 Apr 2022

Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.

Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training

leolyj/3d-vlp CVPR 2023

Current approaches to 3D visual reasoning are task-specific and lack pre-training methods to learn generic representations that can transfer across various tasks.

End-to-End 3D Dense Captioning with Vote2Cap-DETR

ch3cook-fdu/vote2cap-detr CVPR 2023

Compared with prior art, our framework has several appealing advantages: 1) without resorting to numerous hand-crafted components, our method builds on a full transformer encoder-decoder architecture with a learnable vote-query-driven object decoder and a caption decoder that produces dense captions in a set-prediction manner.
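The set-prediction design above can be sketched in a few lines: a fixed set of queries cross-attends to encoded scene features, and two parallel heads decode each query into a box and a caption. This is a heavily simplified shape-level sketch with random weights, not the Vote2Cap-DETR implementation; all dimensions and weight names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in dimensions; weights are random placeholders for learned parameters.
num_queries, d_model, vocab = 256, 256, 4
scene_tokens = rng.normal(size=(1024, d_model))       # encoded point features
vote_queries = rng.normal(size=(num_queries, d_model))  # learnable queries

# Single-head cross-attention of queries over the scene tokens.
attn = np.exp(vote_queries @ scene_tokens.T / np.sqrt(d_model))
attn /= attn.sum(axis=1, keepdims=True)
decoded = attn @ scene_tokens                         # (num_queries, d_model)

# Two parallel heads share the same decoded queries: one regresses a box
# per query, the other produces word logits for that query's caption
# (a real caption decoder is autoregressive; this is just the shape idea).
W_box = rng.normal(size=(d_model, 6))
W_cap = rng.normal(size=(d_model, vocab))
boxes = decoded @ W_box                               # one box per query
word_logits = decoded @ W_cap                         # caption logits per query
print(boxes.shape, word_logits.shape)
```

The point of the set-prediction formulation is that every query yields a (box, caption) pair in one parallel pass, so no hand-crafted grouping or post-hoc matching between detections and captions is needed.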

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning

ch3cook-fdu/vote2cap-detr 6 Sep 2023

Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture.

An Embodied Generalist Agent in 3D World

embodied-generalist/embodied-generalist 18 Nov 2023

However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore the tasks inherently defined in the 3D world, e.g., 3D grounding, embodied reasoning, and acting.

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

open3da/ll3da 30 Nov 2023

However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially given the demand for understanding permutation-invariant point cloud representations of 3D scenes.
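Permutation invariance here means the scene representation must not depend on the (arbitrary) order in which points are stored. A minimal PointNet-style sketch, assuming a single random linear layer as the per-point feature map, shows why a symmetric pooling operation achieves this:

```python
import numpy as np

rng = np.random.default_rng(42)
points = rng.normal(size=(100, 3))  # toy point cloud, arbitrary point order

def pointnet_feature(pts, W):
    # Per-point transform (one linear layer + ReLU) followed by a symmetric
    # max-pool over points: the result is identical under any reordering.
    return np.maximum(pts @ W, 0.0).max(axis=0)

W = rng.normal(size=(3, 16))  # random stand-in for learned weights
f1 = pointnet_feature(points, W)
f2 = pointnet_feature(points[rng.permutation(100)], W)
print(np.allclose(f1, f2))  # True: same feature for any point ordering
```

Models that consume point clouds (rather than ordered 2D pixel grids) must build invariance like this into the architecture, which is part of what makes 3D scene understanding harder for LMMs designed around image inputs.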

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

jxbbb/tod3cap 28 Mar 2024

However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the lack of data with comprehensive box-caption pair annotations specifically tailored for outdoor scenes.