From Chaos to Code: Transforming Sparse Point Clouds into Structured 3D Buildings with ArcPro
Imagine flying a drone over a city to map it. The drone captures thousands of images, and through photogrammetry, you generate a 3D representation of the scene. What you get back, however, is rarely a pristine, CAD-ready model. Instead, you get a “point cloud”—a chaotic swarm of millions of floating dots.
If the scan is high-quality, the dots are dense, and you can see the surfaces clearly. But in the real world, data is often messy. Aerial scans can be sparse (containing very few points), noisy (points are in the wrong place), or incomplete (entire walls might be missing due to occlusion).
For urban planners, game developers, and engineers creating Digital Twins, turning these messy, sparse dots into clean, structured 3D meshes is a massive headache. Traditional algorithms struggle to “connect the dots” when the dots are too far apart. Neural networks often produce “blobby” shapes that lack the sharp edges characteristic of architecture.
Enter ArcPro.

As shown in Figure 1, researchers have developed a new framework that takes a radically different approach. Instead of trying to mesh the points directly, ArcPro treats the 3D reconstruction problem as a language translation problem. It looks at a cloud of points and writes a computer program that describes the building. When that program is executed, it generates a clean, structured 3D model.
In this deep dive, we will explore how ArcPro bridges the gap between unstructured data and structured code, enabling the reconstruction of complex buildings from as few as 200 points.
The Challenge: Why is “Connecting the Dots” So Hard?
To understand why ArcPro is a breakthrough, we first need to look at why existing methods fail.
The Problem with Primitives
Traditional reconstruction methods (like RANSAC or PolyFit) look for mathematical patterns in the point cloud. They try to fit planes, cylinders, or boxes to the data. This works beautifully when you have dense data—if you have 10,000 points on a wall, it’s easy to mathematically fit a plane to them.
But what if you only have 50 points scattered across that same wall? Or what if the points are noisy? Traditional algorithms fall apart. They start detecting planes that don’t exist or fail to detect the ones that do.
The Problem with Deep Learning
Recent deep learning approaches (like BSP-Net) try to learn the shape of buildings directly. However, they often struggle with topology (the connectivity of the building). They might produce a mesh that looks fine from a distance but is actually a collection of disconnected faces, or a blobby solid that doesn’t respect the sharp, hierarchical logic of human architecture.
Architectural scenes are not random organic shapes; they follow rules. Floors are stacked. Walls are vertical. A footprint on the second floor usually relates to the footprint on the first floor. ArcPro succeeds because it enforces these rules by design.
The Core Concept: Inverse Procedural Modeling
The central insight of ArcPro is that buildings can be described by Architectural Programs.
Think of a building not as a list of triangles (a mesh), but as a recipe.
- Start at ground level Z.
- Create a base layer with a rectangular shape.
- On top of the base layer, create a second layer that is slightly smaller.
- On top of that, create a tower.
If you can predict this “recipe” from the point cloud, you don’t need to worry about surface noise. You just run the recipe (the program), and you get a geometrically perfect building.
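As a concrete illustration, the recipe above could be written down as data. This is a hypothetical encoding whose command names mirror the prose, not the paper's exact DSL syntax:

```python
# Hypothetical encoding of the "recipe" as a command list; names and
# parameter values are illustrative, not the paper's exact DSL.
program = [
    ("SetGround", {"z": 0.0}),
    ("CreateLayer", {"parent": "ground", "h": 10.0,   # base layer
                     "c": [(0, 0), (20, 0), (20, 15), (0, 15)]}),
    ("CreateLayer", {"parent": "L1", "h": 8.0,        # slightly smaller layer
                     "c": [(2, 2), (18, 2), (18, 13), (2, 13)]}),
    ("CreateLayer", {"parent": "L2", "h": 20.0,       # tower on top
                     "c": [(6, 5), (12, 5), (12, 10), (6, 10)]}),
]
```

Running such a program deterministically produces geometry, so the noisy points never touch the final surface.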
This process is called Inverse Procedural Modeling. Standard procedural modeling goes Program \(\rightarrow\) Mesh. ArcPro learns to go Point Cloud \(\rightarrow\) Program.

Figure 2 illustrates the entire pipeline. It is a closed loop of synthesis and inference:
- DSL Definition: The researchers defined a language to describe buildings.
- Data Synthesis: They used this language to generate thousands of random building “recipes” and their corresponding 3D meshes/point clouds.
- Training: A neural network learns to look at the point clouds and predict the original recipe.
- Inference: On new, real-world data, the network predicts a program, which is then compiled into a 3D mesh.
The Language of Buildings: Domain-Specific Language (DSL)
To teach a computer to write building recipes, the researchers created a Domain-Specific Language (DSL) for architectural programs.
The DSL treats a building as an Architectural Tree. The root of the tree is the ground. Children nodes are layers (blocks of the building). A layer can be a parent to other layers (e.g., a tower sitting on a podium).
Key Commands
The language is surprisingly simple, relying primarily on two statements:
1. SetGround
This establishes where the building sits in the world.
The command takes a single height \(z\) and defines the ground plane \(\Phi\) at that height.
2. CreateLayer
This is the workhorse of the language. It creates a 3D volume (a prism).

- parent: Which layer does this sit on?
- h: How tall is this layer?
- c: What is the 2D contour (the footprint) of this layer?
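The three parameters above can be sketched as a small data structure. This is a minimal illustration, assuming a layer's base height is simply its parent's top (not the paper's implementation):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Contour = List[Tuple[float, float]]  # 2D footprint polygon

@dataclass
class Layer:
    parent: Optional["Layer"]  # which layer this one sits on (None = ground)
    h: float                   # how tall this layer is
    c: Contour                 # 2D contour (footprint)

    def base_z(self) -> float:
        """A layer starts where its parent's top ends."""
        return 0.0 if self.parent is None else self.parent.base_z() + self.parent.h

# a podium with a tower on top
base = Layer(parent=None, h=10.0, c=[(0.0, 0.0), (20.0, 0.0), (20.0, 15.0), (0.0, 15.0)])
tower = Layer(parent=base, h=25.0, c=[(5.0, 4.0), (15.0, 4.0), (15.0, 11.0), (5.0, 11.0)])
```

Because heights chain through the parent pointer, floors stack by construction; the tower cannot float.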
From Trees to Code
To feed a tree structure into a neural network (which prefers linear sequences), the tree is flattened using a breadth-first search.

This creates a linear string of commands that the neural network can output one by one.
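The flattening step can be sketched in a few lines. A breadth-first traversal visits siblings before grandchildren, which is what linearizes the tree into a command string (the node format here is illustrative):

```python
from collections import deque

def flatten_bfs(root):
    """Flatten an architectural tree into a linear command sequence via BFS."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node["cmd"])
        queue.extend(node.get("children", []))
    return order

# ground -> two stacked layers on one branch, plus a sibling layer
tree = {"cmd": "SetGround", "children": [
    {"cmd": "CreateLayer L1", "children": [
        {"cmd": "CreateLayer L2"},
        {"cmd": "CreateLayer L3"},
    ]},
    {"cmd": "CreateLayer L4"},
]}
sequence = flatten_bfs(tree)
# siblings (L1, L4) come before grandchildren (L2, L3)
```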
Visualizing the compilation
How does code become a building? Figure 3 visualizes this “Compilation” process beautifully.

- Panel (a): Shows the pseudocode. You see layers \(L_1\) through \(L_4\) being defined, with parent relationships.
- Panel (b): Shows the 2D “footprints” of these layers.
- Panel (c): Shows the final 3D result.
Notice how clean the geometry is. Because the output is generated from code, the walls are perfectly vertical, and the floors are perfectly parallel. There is no surface noise.
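The core of that compilation step is a simple extrusion. This sketch returns only the two vertex rings of a prism (face construction omitted), but it shows why the output is exact: walls are vertical and floors parallel by construction:

```python
def extrude(contour, base_z, h):
    """Extrude a 2D footprint into a prism: a bottom and a top vertex ring.
    Faces are omitted; this only illustrates why the geometry is exact."""
    bottom = [(x, y, base_z) for x, y in contour]
    top = [(x, y, base_z + h) for x, y in contour]
    return bottom, top

footprint = [(0.0, 0.0), (20.0, 0.0), (20.0, 15.0), (0.0, 15.0)]
bottom, top = extrude(footprint, 0.0, 10.0)
```

Every top vertex sits exactly `h` above its bottom counterpart, so no surface noise can enter the mesh.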
Handling Complexity: Child Contours
Real buildings aren’t just stacked boxes; they have complex setbacks and multiple towers. The DSL handles this by defining how a child layer’s shape (\(c_{L'}\)) relates to its parent’s shape (\(c_L\)).

As shown in Figure 4, a child layer is generated by modifying the parent’s footprint. This might involve shrinking the edges (top row) or splitting the parent’s footprint into a grid and selecting specific cells to extrude upwards (bottom row). This hierarchical dependency ensures that the resulting building makes structural sense—you rarely see a building where the 10th floor floats in the air completely disconnected from the 9th floor.
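The "shrinking" case can be illustrated with a toy contour operation. This handles only axis-aligned rectangles, a deliberate simplification of the paper's general contour edits:

```python
def shrink_rect(contour, inset):
    """Shrink an axis-aligned rectangular footprint by a uniform inset.
    A toy stand-in for the DSL's general child-contour operations."""
    xs = [p[0] for p in contour]
    ys = [p[1] for p in contour]
    x0, x1 = min(xs) + inset, max(xs) - inset
    y0, y1 = min(ys) + inset, max(ys) - inset
    assert x0 < x1 and y0 < y1, "inset too large for this footprint"
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

parent = [(0.0, 0.0), (20.0, 0.0), (20.0, 15.0), (0.0, 15.0)]
child = shrink_rect(parent, 2.0)
```

Because the child footprint is derived from the parent's, it always lies inside it, which is exactly the structural guarantee the DSL enforces.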
The Neural Architecture: Reading Dots, Writing Tokens
Now that we have a language, how do we train a machine to speak it? The researchers employed an Encoder-Decoder architecture, similar to the systems used for language translation (like Google Translate), but adapted for 3D data.
1. The Encoder: 3D Sparse Convolutions
The input is a sparse point cloud. Standard Convolutional Neural Networks (CNNs) built for images translate poorly to 3D: a dense voxel grid is mostly empty air, so dense convolutions waste nearly all of their computation on nothing.
ArcPro uses a Sparse 3D Convolutional Network. This network only performs calculations where there are actually points, making it highly efficient. It processes the voxelized point cloud and extracts a dense feature vector—essentially a mathematical summary of the building’s shape.
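The preprocessing that makes this possible is voxelization: points are binned into a list of occupied voxel coordinates, which is the sparse input format libraries such as MinkowskiEngine or spconv consume. A minimal pure-Python sketch:

```python
def voxelize(points, voxel_size=0.5):
    """Map each 3D point to a voxel index and keep only occupied voxels.
    The resulting sparse coordinate list is what sparse 3D conv networks
    operate on, so no compute is spent on empty space."""
    occupied = {tuple(int(c // voxel_size) for c in p) for p in points}
    return sorted(occupied)

points = [(0.1, 0.2, 0.0), (0.3, 0.1, 0.1), (5.0, 5.0, 5.0)]
voxels = voxelize(points)  # two nearby points collapse into one voxel
```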
2. The Decoder: Transformer
The feature vector is passed to a Transformer Decoder. Transformers are the architecture behind Large Language Models (LLMs) like GPT. In this case, instead of predicting the next word in a sentence, the Transformer predicts the next token in the architectural program.
3. Tokenization
Just as English text is broken into tokens, the ArcPro DSL is broken into discrete units.

As seen in Table 1, the system uses specific tokens for commands (`<CreateLayer>`, `<SetGround>`) and brackets for parameters. Crucially, continuous values like coordinates and heights are discretized (binned) into numeric tokens. This turns the regression problem (predicting a float value like 12.45m) into a classification problem (predicting token ID #45), which Transformers handle very well.
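Discretization itself is a two-line mapping. This sketch assumes a uniform binning over a known value range (bin count and range are illustrative):

```python
def quantize(value, lo, hi, n_bins=256):
    """Bin a continuous value into a discrete token id (a classification target)."""
    value = max(lo, min(hi, value))
    return min(n_bins - 1, int((value - lo) / (hi - lo) * n_bins))

def dequantize(token, lo, hi, n_bins=256):
    """Map a token id back to the centre of its bin."""
    return lo + (token + 0.5) * (hi - lo) / n_bins

tok = quantize(12.45, 0.0, 100.0)  # e.g. a height in metres -> a token id
```

The round trip loses at most half a bin width of precision, a small price for letting the Transformer predict a class instead of a float.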
4. Syntax-Constrained Sampling
A common problem with generating code via AI is syntax errors. The model might predict a CreateLayer command but forget to specify the height.
To prevent this, ArcPro uses a Finite State Machine (FSM) during inference. The FSM acts as a “grammar police.” If the model just predicted a SetGround token, the FSM knows the next token must be a number (the Z-height). It masks out all non-numeric tokens, forcing the network to pick a valid option. This ensures that every program generated by ArcPro is syntactically valid and can be compiled into a mesh.
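The masking idea can be shown with a toy two-rule grammar. A real FSM tracks full parser state rather than just the last token; this simplification only illustrates the mechanism:

```python
def legal_next(prev_token):
    """Toy transition table: which token *classes* may follow prev_token.
    The real FSM tracks full parser state, not just the last token."""
    if prev_token == "<SetGround>":
        return {"number"}            # must emit the ground height z next
    if prev_token == "<CreateLayer>":
        return {"layer_id"}          # must name the parent layer next
    return {"<SetGround>", "<CreateLayer>", "number", "layer_id", "<eos>"}

def masked_argmax(logits, token_class, prev_token):
    """Mask out illegal tokens, then pick the highest-scoring legal one."""
    allowed = legal_next(prev_token)
    legal = [tok for tok in logits if token_class[tok] in allowed]
    return max(legal, key=lambda t: logits[t])

token_class = {"<SetGround>": "<SetGround>",
               "<CreateLayer>": "<CreateLayer>",
               "<num_12>": "number"}
logits = {"<SetGround>": 0.9, "<CreateLayer>": 0.8, "<num_12>": 0.1}
# the raw argmax would repeat <SetGround>, but the FSM forces a number token
chosen = masked_argmax(logits, token_class, "<SetGround>")
```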
The Training Data Engine
Deep learning needs massive amounts of data. There simply isn’t a dataset of millions of buildings paired with their “source code.” So, the researchers built their own data factory.
They created a feedforward procedural generator. By randomizing the parameters of their DSL (heights, number of splits, setback distances), they could generate an infinite number of unique “Architectural Trees.”

Figure 10 shows the variety of shapes produced. From simple towers to complex, multi-tiered structures, the generator creates the “Ground Truth” programs. These programs are compiled into meshes, and then points are sampled from the surface to create the input point clouds.
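A procedural generator of this kind is conceptually tiny. This sketch randomizes heights and branching only; the parameter ranges are illustrative, not the paper's:

```python
import random

def random_tree(rng, depth=0, max_depth=3):
    """Sample a random architectural tree: each layer gets a random height
    and 0-2 child layers. Ranges here are illustrative placeholders."""
    node = {"h": rng.uniform(3.0, 30.0), "children": []}
    if depth < max_depth:
        for _ in range(rng.randint(0, 2)):
            node["children"].append(random_tree(rng, depth + 1, max_depth))
    return node

tree = random_tree(random.Random(42))  # seeded, so reproducible
```

Each sampled tree is a ground-truth program for free, which is what makes the training data effectively unlimited.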
To make the network robust, they intentionally degrade these inputs:
- Downsampling: Reducing the number of points.
- Noise: Jittering the points.
- Cropping: Removing chunks of the building to simulate occlusion.
This forces the network to learn to “hallucinate” the missing structure based on architectural logic.
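The three degradations above can be sketched as one augmentation function. The parameters and the height-cut style of cropping are assumptions for illustration:

```python
import random

def degrade(points, rng, keep_ratio=0.3, sigma=0.05, crop_frac=0.2):
    """Degrade a clean sampled cloud: downsample, jitter, then crop a chunk.
    Parameter values and the height-based crop are illustrative choices."""
    # downsampling: keep only a random subset of the points
    n_keep = max(1, int(len(points) * keep_ratio))
    points = rng.sample(points, n_keep)
    # noise: jitter every coordinate with Gaussian noise
    points = [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]
    # cropping: drop everything above a random height cut (mimics occlusion)
    cut = rng.uniform(1.0 - crop_frac, 1.0) * max(p[2] for p in points)
    return [p for p in points if p[2] <= cut]

rng = random.Random(0)
cloud = [(rng.uniform(0, 10), rng.uniform(0, 10), rng.uniform(0, 10))
         for _ in range(1000)]
degraded = degrade(cloud, rng)
```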
Experimental Results
So, does it work? The comparisons against state-of-the-art methods are striking.
Visual Comparison

In Figure 5, look at the “SfM Point Cloud” column. The input is noisy and messy.
- PolyFit (a traditional optimization method) often fails to close the mesh or creates jagged, chaotic geometry.
- BSP-Net (a learning method) creates a mesh, but it often looks “melted” or overly simplified, missing the distinct steps of the building.
- ArcPro (Ours) captures the distinct tiered structure (the “wedding cake” shape) perfectly. It balances geometric accuracy with structural simplicity.
Extreme Sparsity
The real test of ArcPro is low-quality data. Figure 11 showcases the method’s resilience.

Look at the bottom-right example in Figure 11. The input has only 200 points. To a human eye, it barely looks like a building. Yet, ArcPro recovers a plausible multi-story structure. Because the model has “learned” what buildings look like (via the DSL priors), it knows that those few floating points likely correspond to specific corners of a rectangular layer.
Why Traditional Methods Fail
The paper provides a specific look at why methods like RANSAC (Plane Fitting) fail on this data.

As shown in Figure 12, RANSAC (Row 2) attempts to find planes but gets confused by the sparsity, often missing entire walls or detecting diagonal planes that don’t exist. ArcPro (Row 3), by contrast, imposes the “Manhattan-like” constraints of the DSL, resulting in a clean model that matches the Reference (Row 4).
Beyond Reconstruction: Language and Speed
ArcPro isn’t just about making pretty meshes. The “Program” representation opens up fascinating new capabilities.
1. Semantic Retrieval
Because the output is code, you can query it. You can’t ask a standard 3D mesh, “Show me all buildings with a tower taller than the base.” But you can ask an ArcPro program that.

Figure 14 shows how a Large Language Model (like ChatGPT) can be prompted with the DSL definition to convert natural language queries (e.g., “buildings with two branching structures”) into Python code that checks the ArcPro programs. This turns a city-scale database of 3D models into a searchable text database.
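The kind of check such generated code performs is plain program analysis. A minimal sketch over a hypothetical layer-list representation (field names are illustrative):

```python
def tower_taller_than_base(layers):
    """Query a recovered program: is any layer taller than the layer it sits on?
    The layer-dict format here is a hypothetical stand-in for a parsed program."""
    by_id = {layer["id"]: layer for layer in layers}
    return any(layer["parent"] is not None
               and layer["h"] > by_id[layer["parent"]]["h"]
               for layer in layers)

program = [
    {"id": "L1", "parent": None, "h": 10.0},   # base
    {"id": "L2", "parent": "L1", "h": 25.0},   # tower, taller than its base
]
found = tower_taller_than_base(program)
```

No geometry is touched at query time; the search runs over text-sized programs, which is why it scales to a whole city.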
2. Blazing Fast Processing
ArcPro is incredibly efficient.

In Figure 13, the authors compare ArcPro to a traditional Multi-View Stereo (MVS) pipeline.
- MVS: Takes 739 seconds. Produces a heavy mesh with 450,000 faces.
- ArcPro: Takes 0.034 seconds. Produces a lightweight mesh with 46 faces.
While the MVS mesh has textures and fine details, the ArcPro mesh provides the structural abstraction instantly. For applications like real-time navigation or urban planning simulations, this speed and low data footprint are game-changing.
Conclusion
ArcPro represents a shift in how we think about 3D Deep Learning. Rather than treating 3D reconstruction as a signal processing task (filtering noise to find a surface), it treats it as a cognitive task—understanding the logical structure that generated the data.
By defining a Domain-Specific Language for architecture and training a model to “speak” it, ArcPro achieves three things:
- Robustness: It works on data that breaks other algorithms.
- Structure: It guarantees clean, watertight, hierarchical models.
- Interpretability: The output isn’t a black-box mesh; it’s readable, editable code.
As we move toward larger digital twins and more autonomous spatial computing, methods like ArcPro that bridge the gap between messy reality and structured logic will be essential tools in the computer vision toolkit.