
Searching the Physical World: Bridging 3D Models and LMMs

By Conrad Koziol, PhD, and Thomas Charlton, PhD

Introduction

Capturing images and video is the first step in digitizing the physical world. Today, techniques such as photogrammetry[1], Neural Radiance Fields (NeRFs)[2], and Gaussian Splatting[3] allow us to transform this raw data into immersive, high-fidelity 3D environments.

High-fidelity 3D reconstruction solves the problem of visualization, but not of understanding. The missing layer is semantics. By integrating Large Multimodal Models (LMMs)—which reason jointly over text and images—we connect language to spatial geometry, enabling natural-language search and localization within 3D environments.


Background: The Semantic Gap in 3D Reconstruction

Generating digital models has become routine. With a camera and modern reconstruction pipelines, complex environments can be captured and explored in 3D with remarkable realism. Yet despite their visual quality, these models contain minimal semantic information.

Figure 1: 3D reconstruction of the International Space Station.

3D models are unstructured visual records. While a human can recognize objects in the scene, there is no underlying index to locate equipment, audit inventories, or answer operational questions programmatically. As a result, the practical value of the model is limited.

Today, semantic meaning is typically added through manual tagging. A user must visually inspect the model, identify each object of interest, and place an annotation by hand.

This workflow does not scale. It is time-consuming, error-prone, and static: missed objects remain unsearchable, and any change on-site requires repeating the process. As environments evolve, maintaining an accurate digital record becomes an operational burden.

LMMs provide a structured approach to interpreting physical environments. They support analysis of visual scenes that extends beyond object recognition to include spatial relationships and context, allowing objects to be interpreted in relation to their surroundings.[4][5]

Figure 2: Automated object recognition and annotation of a single image.

This capability includes precise object localization, identifying the pixel coordinates for each detected element. By recognizing the same object across multiple viewpoints, these models maintain cross-view consistency, distinguishing unique instances from redundant observations. This consistency provides a direct basis for mapping image-level understanding into 3D space.

By leveraging LMMs, we can move beyond manual "point-and-click" labeling. Instead of forcing humans to annotate the model, we allow AI to interpret the site automatically—transforming a passive visual reconstruction into an actionable, searchable asset index.

Innovation: Connecting Language to 3D Space

Our approach decouples semantic reasoning from 3D representation. Rather than attempting to embed meaning directly into 3D geometry, we use 2D imagery as the semantic interface and project the results back into 3D space.

This workflow connects natural language queries to precise 3D locations.

1. Pose Estimation and 3D Reconstruction

We begin by establishing spatial context. Using Structure from Motion (SfM) or Simultaneous Localization and Mapping (SLAM), the system processes raw imagery to recover the position and orientation of every camera frame in a shared coordinate system.
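As a minimal sketch of this stage, the open-source COLMAP pipeline (via its pycolmap bindings) can recover camera poses from a folder of frames. The paths below are hypothetical, and the exact function signatures vary between pycolmap releases.

    import pycolmap

    image_dir = "capture/images"          # hypothetical folder of input frames
    database_path = "capture/colmap.db"   # feature and match database
    output_dir = "capture/sparse"         # sparse reconstruction output

    # Detect keypoints and descriptors in every frame.
    pycolmap.extract_features(database_path, image_dir)

    # Match features across frames (exhaustive matching suits small captures).
    pycolmap.match_exhaustive(database_path)

    # Incremental SfM: recovers camera poses and a sparse point cloud
    # in a shared coordinate system.
    reconstructions = pycolmap.incremental_mapping(database_path, image_dir, output_dir)

    # Every registered frame now carries a pose (rotation and translation).
    for image_id, image in reconstructions[0].images.items():
        print(image_id, image.name)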

Figure 3: Sparse point cloud and camera pose visualization.

With camera poses defined, we generate the 3D model for visualization. Each image is anchored to a precise location and viewpoint in the model, ensuring that every image pixel can be accurately projected into the reconstructed scene.
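The anchoring works through the standard pinhole model: the intrinsics and pose recovered above relate 3D points in the reconstruction to pixels in each frame. A minimal NumPy sketch, with K, R, and t assumed to come from the SfM step:

    import numpy as np

    def project_point(X_world, K, R, t):
        """Project a 3D world point into pixel coordinates for one posed camera."""
        X_cam = R @ X_world + t          # world -> camera coordinates
        u, v, w = K @ X_cam              # homogeneous pixel coordinates
        return np.array([u / w, v / w])  # perspective divide -> (x, y) pixels

The same relationship, run in reverse, is what later allows a pixel identified by the LMM to be lifted back into the scene.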

Figure 4: Gaussian Splatting representation derived from the sparse point cloud and camera poses.

2. Semantic Filtering via Vector Embeddings

To make the image collection searchable, every frame is processed through an embedding model, producing a high-dimensional vector representation. This creates a semantic index over the entire image dataset.
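A minimal sketch of building this index, using the open-source CLIP model through sentence-transformers as a stand-in for whichever embedding model is actually deployed; the file paths are hypothetical.

    from pathlib import Path

    import numpy as np
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")

    image_paths = sorted(Path("capture/images").glob("*.jpg"))
    images = [Image.open(p) for p in image_paths]

    # One vector per frame, normalized so dot products equal cosine similarity.
    image_embeddings = model.encode(images, normalize_embeddings=True)
    np.save("capture/image_embeddings.npy", image_embeddings)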

When a user issues a query (such as "apple"), the system computes similarity between the text embedding and the image embeddings. This immediately narrows the search to a small subset of frames where the object is likely visible, avoiding the complexity of reasoning directly in 3D.

By extending the query from a single phrase to a list of search terms or a predefined taxonomy, the system evaluates multiple asset classes in parallel. This enables a site-wide semantic scan in a single pass, rather than a sequential, object-by-object inspection.

Figure 5: Utilizing multimodal analysis to locate unique instances across the image set.

Similarity between image frames and search terms is computed as a cosine similarity matrix over their embeddings. Candidate frames are selected by applying a similarity threshold to reduce the search space. This filtering stage ensures that only frames with a high likelihood of containing relevant assets are forwarded to the multimodal model for analysis, improving computational efficiency and overall system robustness.
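Continuing the sketch above, the filtering stage embeds the search terms, forms the cosine-similarity matrix against the frame embeddings, and keeps only frames that clear a threshold. The terms and threshold value here are illustrative.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")
    image_embeddings = np.load("capture/image_embeddings.npy")   # (n_frames, d)

    search_terms = ["apple", "fire extinguisher", "access panel"]
    text_embeddings = model.encode(search_terms, normalize_embeddings=True)

    # Rows are search terms, columns are frames.
    similarity = text_embeddings @ image_embeddings.T

    threshold = 0.25   # illustrative; tuned per embedding model and dataset
    for term, scores in zip(search_terms, similarity):
        candidate_frames = np.flatnonzero(scores >= threshold)
        print(term, "->", candidate_frames.tolist())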

3. Multimodal Reasoning and Grounding

The candidate frames are passed to an LMM along with the user's query. By reasoning jointly over text and visual context, the model identifies unique object instances across views, filtering out redundant perspectives and occluded views.

For each instance, the model selects a canonical frame (the view with the clearest visual evidence) and produces 2D grounding coordinates corresponding to the object's centroid. At this stage, a natural language request has been translated into a pixel-level reference.

Table 1: LMM-determined unique object instances and associated information.

Search Term   Object    Image      Pixel Coordinates   Canonical
"apple"       apple_1   image_02   (324, 187)          True
"apple"       apple_1   image_03   (412, 203)          False
"apple"       apple_1   image_04   (367, 245)          False
"apple"       apple_2   image_03   (198, 178)          False
"apple"       apple_2   image_04   (289, 156)          True
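A minimal sketch of this grounding call, assuming access to a Gemini model through the google-genai SDK (the family of models referenced in [4][5]). The model name, file names, and free-form JSON prompt are illustrative; a production system would constrain the response with a structured-output schema.

    from google import genai
    from PIL import Image

    client = genai.Client()   # assumes an API key in the environment

    prompt = (
        "Find every apple across these frames. Group sightings of the same "
        "physical apple into a single instance, choose the clearest view as "
        "its canonical frame, and return JSON with: instance id, image name, "
        "(x, y) pixel coordinates of the centroid, and a canonical flag."
    )

    # Candidate frames from the filtering stage (names are illustrative).
    frame_names = ["image_02.jpg", "image_03.jpg", "image_04.jpg"]
    frames = [Image.open(name) for name in frame_names]

    response = client.models.generate_content(
        model="gemini-2.5-flash",      # illustrative model name
        contents=frames + [prompt],
    )
    print(response.text)               # detections in the shape of Table 1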

4. Spatial Localization in 3D

Finally, the 2D coordinates are mapped into 3D space. Using known camera parameters, the system casts rays from the camera center through each identified pixel and computes their intersection with the reconstructed geometry.

This yields a 3D position for each object instance. The system then generates persistent annotations at those locations, creating a spatially grounded asset record within the model.
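A minimal back-projection sketch using trimesh for the ray-mesh intersection; the mesh path is hypothetical, and K, R_wc, t_wc (intrinsics plus a camera-to-world pose) are assumed to come from the reconstruction.

    import numpy as np
    import trimesh

    # Reconstructed geometry as a triangle mesh (path is hypothetical).
    mesh = trimesh.load("capture/scene_mesh.ply", force="mesh")

    def pixel_to_3d(u, v, K, R_wc, t_wc):
        """Cast a ray through pixel (u, v) and return its first hit on the mesh."""
        d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray in camera coords
        d_world = R_wc @ d_cam                             # rotate into world coords
        d_world /= np.linalg.norm(d_world)

        origin = t_wc                                      # camera center in world coords
        hits, _, _ = mesh.ray.intersects_location(
            ray_origins=[origin], ray_directions=[d_world]
        )
        if len(hits) == 0:
            return None
        # Keep the intersection closest to the camera.
        return hits[np.argmin(np.linalg.norm(hits - origin, axis=1))]

    # Example: the canonical detection of apple_1 from Table 1.
    # position = pixel_to_3d(324, 187, K, R_wc, t_wc)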

Figure 6: Ray projection from canonical images establishes object positions in 3D space. Explore this model interactively.

Impact: From Static Models to Actionable Systems

By coupling 3D reconstruction with semantic reasoning, raw imagery becomes a structured, machine-readable representation of the physical world. This architecture enables several powerful capabilities.

Rapid Inventory and Auditing

What once required days of manual tagging can now be done in minutes. By combining semantic image search with automated localization, a 3D reconstruction becomes an instantly queryable inventory. As new imagery is captured, the process can be rerun to refresh the index, keeping the digital twin synchronized with the real site.

Relational and Contextual Intelligence

Unlike traditional computer vision systems that label objects in isolation, LMMs understand context and relationships. This enables higher-level queries such as identifying assets based on condition, proximity, or orientation. Questions like "find all access panels left open" or "locate equipment within two meters of the primary cooling line" become possible without training bespoke models for each new scenario.

A Spatial Interface for Documentation

Once assets are automatically localized, they become natural anchors for documentation. Maintenance records, inspection histories, and operating manuals can be attached directly to objects in their spatial context, turning the 3D model into a navigable interface for operational knowledge.

Conclusion

Our approach succeeds by utilizing 3D geometry as a spatial index rather than a semantic layer. By maintaining the 2D image dataset as the primary reference for reasoning, we leverage the full density of LMM intelligence where it is most effective. This creates a link between natural language and 3D space, ensuring that our digital twins are not just visually accurate, but programmatically queryable. We have moved from simply digitizing the appearance of a site to indexing its entire spatial reality.


Citations

[1] Hartley, Richard, and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.

[2] Mildenhall, Ben, et al. "NeRF: Representing scenes as neural radiance fields for view synthesis." Communications of the ACM 65.1 (2021): 99-106.

[3] Kerbl, Bernhard, et al. "3D Gaussian splatting for real-time radiance field rendering." ACM Trans. Graph. 42.4 (2023): 139-1.

[4] Doshi, Rohan. "Gemini 3 Pro: The Frontier of Vision AI." The Keyword, Google, 5 Dec. 2025, blog.google/technology/developers/gemini-3-pro-vision/.

[5] "Pointing and 3D Spatial Understanding with Gemini." Google Gemini Cookbook, Google, 2024, colab.research.google.com/github/google-gemini/cookbook/blob/main/examples/Spatial_understanding_3d.ipynb.