Coarse Correspondences Elicit 3D Spacetime Understanding in Multimodal Language Model

1University of Washington  2Tsinghua University  3Tencent 
4Allen Institute for AI  5Cornell University  *Equal Contribution 

Abstract

We introduce Coarse Correspondences, a simple, training-free, effective, and general-purpose visual prompting method that elicits 3D and temporal understanding in multimodal LLMs. Our method uses a lightweight tracking model to find object correspondences between frames of a video or between sets of image viewpoints. It selects the most frequently appearing object instances and visualizes them in the images with markers carrying unique IDs. With this simple approach, we achieve state-of-the-art results on 3D understanding benchmarks including ScanQA (+20.5%) and a subset of OpenEQA (+9.7%), and on long-form video benchmarks such as EgoSchema (+6.0%). We also curate a small diagnostic dataset to evaluate whether MLLMs can reason about space from a described viewpoint other than the camera viewpoint. Coarse Correspondences again improves spatial perspective-taking, but we highlight that MLLMs still struggle with this task. Together, these results demonstrate that our simple prompting method can significantly aid downstream tasks that require 3D or temporal reasoning.

Coarse Correspondences


Coarse Correspondences is a simple, training-free, effective, and general-purpose visual prompting method to elicit 3D and temporal understanding in multimodal LLMs. Our method uses a lightweight tracking model to find object correspondences between frames in a video or between sets of image viewpoints.


Fig 1: Overall pipeline of Coarse Correspondences. We combine a lightweight video tracking model with multimodal LLMs to achieve a better understanding of 3D spacetime. (a) We run a tracking model at a high frame rate to obtain instance segmentation masks for each frame. (b) We then sparsify the input frames, select the most prominent coarse correspondences, and visualize them on the images. (c) Finally, the prompted images enable MLLMs to better understand 3D spacetime.
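The sparsify-select-visualize steps in (b) can be sketched in a few lines. This is a minimal illustration of the idea, not the authors' released code; the function names and the choice of placing each ID marker at its mask centroid are our own assumptions.

```python
import numpy as np
from collections import Counter

def select_prominent(frames, k=3):
    """Keep the k instance IDs that are tracked across the most frames."""
    counts = Counter(iid for masks in frames for iid in masks)
    return [iid for iid, _ in counts.most_common(k)]

def marker_positions(frames, keep):
    """Place each kept instance's ID marker at the centroid of its mask in
    every frame where the tracker found it; a renderer would draw the ID there."""
    placed = []
    for masks in frames:
        pos = {}
        for iid in keep:
            if iid in masks:
                ys, xs = np.nonzero(masks[iid])
                pos[iid] = (int(xs.mean()), int(ys.mean()))
        placed.append(pos)
    return placed

# Toy example: two 4x4 frames with binary instance masks from a tracker.
m = np.zeros((4, 4), bool); m[1:3, 1:3] = True
frames = [{1: m, 2: m}, {1: m}]
keep = select_prominent(frames, k=1)  # instance 1 appears in both frames
```

A real pipeline would take the masks from the tracking model and draw the retained IDs onto the frames before passing them to the MLLM.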

Space Understanding with Coarse Correspondences

We evaluate the spatial understanding of Coarse Correspondences on ScanQA and OpenEQA, both of which require reasoning about 3D space. We augment GPT-4V and GPT-4o with Coarse Correspondences and evaluate their zero-shot performance. Coarse Correspondences not only significantly improves the base GPT models; the improvements also establish new state-of-the-art results across all of our evaluations.

Tab 1: Comparison on the ScanQA validation set. Following 3D-LLM, we conduct experiments on the ScanQA validation set to demonstrate the effectiveness of Coarse Correspondences with different MLLMs. With our method, GPT-4V and GPT-4o surpass all finetuned models in a zero-shot manner.

Temporal Understanding with Coarse Correspondences

For temporal understanding, we choose EgoSchema, a widely used benchmark of long-form video understanding. We limit this evaluation to 500 questions from the validation set. As input to the GPT models, we sample 8 frames uniformly from each video.
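Uniform sampling amounts to spacing 8 indices evenly over the clip. A small sketch, with a helper of our own naming rather than code from the paper:

```python
def sample_uniform(num_frames, n=8):
    """Return n frame indices evenly spaced over a video of num_frames frames."""
    if num_frames <= n:
        return list(range(num_frames))
    return [round(i * (num_frames - 1) / (n - 1)) for i in range(n)]

# e.g. for a 100-frame clip:
indices = sample_uniform(100)  # [0, 14, 28, 42, 57, 71, 85, 99]
```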

Tab 2: Comparisons on EM-EQA setting of OpenEQA.

Tab 3: Comparisons on EgoSchema validation set.

SOT Benchmark

In cognitive science, the spatial orientation test (SOT) is a widely adopted examination of spatial intelligence in children. The SOT assesses spatial perspective-taking: the ability to imagine how an object or scene would appear from a perspective other than the current camera viewpoint.

Fig 2: Illustration of our SOT dataset. It contains two types of questions: observer-perspective understanding and spatial perspective-taking. Coarse Correspondences demonstrates superior effectiveness on the dataset.

We manually captured ten real-world scenes, both indoor and outdoor, using different mobile devices at various viewpoints. For each scene, we designed five carefully crafted questions, each asking the model to determine whether one object is to the left or the right of another from a specific viewpoint. In total, across the 10 scenes, SOT contains a modest 50 questions.
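Each SOT item described above can be represented as a small record. The field names here are our own illustration of the question format, not the released schema:

```python
from dataclasses import dataclass

@dataclass
class SOTQuestion:
    scene_id: str   # one of the ten captured scenes
    viewpoint: str  # the described perspective the model must adopt
    object_a: str
    object_b: str
    answer: str     # "left" or "right": where object_a sits relative to object_b

# Hypothetical example item:
q = SOTQuestion("scene_01", "standing at the doorway facing the sofa",
                "lamp", "television", "left")
```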


Tab 4: Comparisons on SOT. Coarse Correspondences shows a strong capability to enhance the 3D spatial understanding of MLLMs. It also mitigates the striking camera-motion bias we find in current MLLMs.

Fig 3: Comparisons on SOT's spatial perspective-taking questions. Coarse Correspondences improves performance, but GPT-4o still performs below random chance.

Examples of Coarse Correspondences


We show more qualitative results of Coarse Correspondences on ScanQA and SOT. We visualize the constructed coarse correspondences on the images and show how they help multimodal LLMs better understand 3D spacetime. We also provide a case study comparing different prompting methods on ScanQA.

BibTeX


        @article{liu2024coarse,
          title={Coarse Correspondences Elicit 3D Spacetime Understanding in Multimodal Language Model},
          author={Liu, Benlin and Dong, Yuhao and Wang, Yiqin and Rao, Yongming and Tang, Yansong and Ma, Wei-Chiu and Krishna, Ranjay},
          journal={arXiv preprint arXiv:2408.00754},
          year={2024}
        }