Imagine you are in a foreign city. You open a map app, looking for a coffee shop that is open right now, within a 10-minute walk, and has a rating above 4.0. You also need to spot it on a map filled with icons and street names. For a human, this is a standard navigational task involving visual scanning, spatial reasoning, and reading comprehension.
Now, imagine asking an AI to do the same. While Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated incredible prowess in coding, creative writing, and general reasoning, their ability to navigate the physical world—represented through maps—remains a significant blind spot.
In this post, we are diving deep into MapEval, a comprehensive research paper that benchmarks how well foundation models perform geo-spatial reasoning. The researchers put 30 major models (including GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro) to the test across 700 complex questions covering 180 cities. The results expose a fascinating gap between “intelligence” and “spatial awareness.”
The Problem: AI Doesn’t “Get” Geography
Recent advancements in AI have focused heavily on autonomous tool usage. We have agents that can browse the web or use calculators. However, map-based reasoning is distinct. It requires heterogeneous context processing—combining unstructured text (reviews), structured data (coordinates), and visual information (pins on a map).
Prior benchmarks in this space have been somewhat limited. They often relied on:
- Template-based queries: Asking simple questions like “What is the capital of France?”, which rely on memorized facts rather than reasoning.
- Simple API lookups: Asking for the distance between two points without requiring complex route planning.
- Remote sensing: Analyzing satellite imagery for land cover, which is a different skill from reading a navigational map.
MapEval changes this by testing geo-spatial reasoning. This involves understanding spatial relationships (North of X), navigation (pathfinding), and travel planning (multi-stop itineraries).
Introducing MapEval
MapEval is a multi-modal benchmark designed to assess foundation models across three distinct evaluation modes: Textual, API-based, and Visual.

As shown in Figure 1, the framework is built on real-world data derived from Google Maps. The researchers didn’t just scrape data; they created a sophisticated pipeline to ensure the questions reflect how humans actually interact with maps.
The Three Evaluation Tasks
To truly understand where models fail, MapEval breaks down the challenge into three specific tasks.
1. MapEval-Textual
In this task, the model is provided with a rich textual context containing information about places, opening hours, coordinates, and reviews. The model must answer multiple-choice questions (MCQs) based only on this text. This tests the model’s ability to filter relevant information from long, complex descriptions and perform reasoning (e.g., “Is this restaurant open on Mondays based on the schedule provided?”).
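To make the setup concrete, here is a minimal sketch (in Python) of how a textual context and a multiple-choice question might be packed into a single prompt. The place data, question, and prompt template below are invented for illustration; the paper's exact prompt format may differ.

```python
# Hypothetical example: packing map context plus an MCQ into one prompt.
# The place data and question are illustrative, not taken from MapEval.
context = """
Place: Cafe Aroma
Coordinates: (48.8584, 2.2945)
Opening hours: Mon-Fri 08:00-18:00, Sat-Sun 09:00-14:00
Rating: 4.3 (1,204 reviews)
Review: "Great espresso, but gets crowded on weekend mornings."
"""

question = "Is Cafe Aroma open on Monday at 07:30?"
options = ["A) Yes", "B) No", "C) Unanswerable from the given context"]

prompt = (
    "Answer the question using ONLY the context below.\n\n"
    f"Context:\n{context}\n"
    f"Question: {question}\n"
    "Options:\n" + "\n".join(options) + "\n"
    "Reply with the letter of the correct option."
)
print(prompt)
```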
2. MapEval-API
This is a more dynamic task. Here, the model acts as an agent. It is not given the context upfront. Instead, it must use tools (like PlaceSearch or Directions) to query a database and find the answer. This mimics a real-world scenario where an AI assistant must actively fetch information to help a user plan a trip.
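The paper builds its agents on real map APIs; the sketch below is only a schematic of the general pattern — a tool-calling loop with a hard iteration cap — using hypothetical PlaceSearch/Directions stubs and a made-up `ask_model` interface rather than MapEval's actual tooling.

```python
# Schematic agent loop with hypothetical tool stubs; not the paper's implementation.
MAX_ITERATIONS = 10  # hard cap so a confused agent cannot loop forever

def place_search(query: str, near: str) -> list[dict]:
    """Stub for a PlaceSearch-style tool; a real agent would hit a maps API here."""
    return [{"name": "Cafe Aroma", "rating": 4.3, "open_now": True}]

def directions(origin: str, destination: str) -> dict:
    """Stub for a Directions-style tool."""
    return {"distance_km": 0.7, "walk_minutes": 9}

TOOLS = {"PlaceSearch": place_search, "Directions": directions}

def run_agent(ask_model, question: str) -> str:
    """ask_model(question, history) returns ("call", tool_name, kwargs) or ("answer", text)."""
    history = []
    for _ in range(MAX_ITERATIONS):
        action = ask_model(question, history)
        if action[0] == "answer":
            return action[1]
        _, tool_name, kwargs = action
        result = TOOLS[tool_name](**kwargs)           # execute the requested tool
        history.append((tool_name, kwargs, result))   # feed the observation back
    return "Iteration limit reached without an answer."
```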
3. MapEval-Visual
This is perhaps the most challenging task. The model is given a screenshot of a digital map (like you would see on your phone) and must answer questions based on visual cues. This involves reading labels, interpreting icons, understanding road networks, and estimating spatial relationships visually.
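For a sense of what this looks like in practice, here is a minimal sketch of sending a map screenshot plus a question to a vision-language model via the OpenAI Python SDK. The model name, file path, and question are placeholders, and MapEval's own evaluation harness may be structured differently.

```python
# Hypothetical sketch: posing a MapEval-Visual-style MCQ to a VLM.
# Requires the `openai` package and an API key; model name and paths are placeholders.
import base64
from openai import OpenAI

client = OpenAI()

with open("map_snapshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "How many hospitals are visible in this map snapshot? "
                     "Options: A) 1  B) 2  C) 3  D) Unanswerable. Reply with one letter."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```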
Dataset Diversity and Realism
A benchmark is only as good as its data. MapEval covers 700 unique questions spanning 54 countries and 180 cities. This geographic diversity ensures that models aren’t just memorizing popular locations in New York or London but must actually reason about the data presented.

The questions fall into five core categories, plus a sixth that applies only to the visual task (Figure 2):
- Place Info: Details about a specific POI (Point of Interest).
- Nearby: Finding what is around a specific location.
- Routing: Navigating from A to B.
- Trip: Planning complex schedules (e.g., “Visit the museum for 2 hours, then grab coffee”).
- Unanswerable: Questions where the provided context is insufficient, testing the model’s ability to admit ignorance (a critical safety feature).
- Counting: Specific to visual tasks (e.g., “How many hospitals are visible?”).
The researchers used a custom tool called MapQaTor to annotate this data efficiently, ensuring high-quality ground truth.

To ensure global representation, the dataset draws contexts from all over the world. As seen in the heatmaps below, both the textual and visual contexts are widely distributed, mitigating Western-centric bias.

The Core Challenge: Visual Map Understanding
One of the most unique aspects of MapEval is the visual component. Digital maps are complex information artifacts. They contain text (street names), symbols (icons for hospitals, parks), and geometric structures (roads, rivers).
The researchers included map snapshots at various zoom levels, ranging from broad city views to detailed street levels.

Why does zoom matter? At high zoom levels (detailed views), the model must read dense text labels (an OCR-style skill) and distinguish between individual buildings. At low zoom levels, it must understand broader spatial relationships.
Let’s look at what these visual questions actually look like. Below are examples where models were asked to reason about map screenshots.
Example 1: Landmark Identification
In this example, the model must identify a museum near a specific flagpole.

Example 2: Route Analysis
Here, the model is shown two routes and must determine the distance. This requires reading the small data labels on the map interface (“23 km”).

Example 3: Spatial Relationships
This example asks which golf club is at a specific intersection. It requires the model to trace the roads labeled “Springfield” and “Houdaille Quarry” and find the point where they cross.

These visual tasks are incredibly hard for current Vision-Language Models because they require precise localization and symbol grounding, not just general image captioning.
Experiments & Results: The Reality Check
The researchers evaluated 30 foundation models, including proprietary giants (Claude-3.5-Sonnet, GPT-4o, Gemini-1.5-Pro) and open-source models (Llama-3, Qwen, Mistral).
1. Textual Reasoning Results
In the textual task, models were given all the necessary information in text form. You might expect them to score nearly 100% since the answer is “in the text.”

Key Takeaways:
- The Ceiling is Low: The best model (Claude-3.5-Sonnet) only reached 66.33% accuracy. Compare this to human performance, which stands at 86.67%.
- Trip Planning is Hard: Look at the “Trip” category in Figure 16 above. Performance drops significantly (below 50% for most models). Planning a schedule requires temporal arithmetic (adding durations) and constraint checking, which LLMs struggle with; see the small example after this list.
- Unanswerable Questions: Many models struggle to say “I don’t know,” often hallucinating answers instead of selecting the “Unanswerable” option.
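To see why trip planning is harder than it sounds, here is the kind of temporal arithmetic a “Trip” question implies. The itinerary below is invented, but answering it requires exactly the duration-adding and deadline-checking that trips models up.

```python
# Invented itinerary illustrating the duration arithmetic behind a "Trip" question.
from datetime import datetime, timedelta

start = datetime(2024, 6, 1, 10, 0)           # leave the hotel at 10:00
museum_visit = timedelta(hours=2)             # "visit the museum for 2 hours"
walk_to_cafe = timedelta(minutes=25)
cafe_closes = datetime(2024, 6, 1, 13, 0)     # the cafe closes at 13:00

arrival_at_cafe = start + museum_visit + walk_to_cafe
print(arrival_at_cafe.strftime("%H:%M"))                            # 12:25
print("Makes it before closing:", arrival_at_cafe <= cafe_closes)   # True
```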
2. API-Based Reasoning Results
When models had to act as agents and query the data themselves, performance dropped further.

As Figure 3 shows, there is a significant performance gap between having the text provided (pink bars) and having to search for it (yellow bars).
- Why the drop? Agents often get stuck. They might query for “restaurants” but fail to parse the results, or they get caught in infinite loops trying to find parameters that don’t exist.
- The “Loop” Problem: Open-source models, in particular, struggled with tool usage. Figure 19 highlights how often agents hit the iteration limit (basically timing out) because they couldn’t figure out the next step.

The categorical breakdown for API tasks shows that “Nearby” queries were particularly brutal for the models.

3. Visual Reasoning Results
The visual results confirm that reading maps is distinct from reading natural images.

- Counting is a weakness: As seen in the “Counting” category (Figure 20), models struggle to answer questions like “How many hospitals are visible?” They often hallucinate icons or miss them entirely.
- Zoom Sensitivity: Interestingly, model performance fluctuates with zoom levels. As the map becomes more detailed (higher zoom), the visual clutter increases, making reasoning harder.

Why Do Models Fail? The “Math” Problem
One of the most insightful parts of the MapEval paper is the error analysis. Why are models so bad at “Trip” planning or “Routing”?
It turns out, a lot of it comes down to spatial math.
- Cardinal Directions: Determining whether Point B is “North-West” of Point A requires computing a bearing from the coordinates.
- Distances: Calculating the straight-line distance between two lat/long points requires the Haversine formula, and LLMs are notoriously unreliable at multi-step arithmetic (see the sketch after this list).
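For reference, here is what those two computations look like when spelled out explicitly. These are the standard great-circle formulas, not code from the paper, and the example coordinates are arbitrary landmarks.

```python
# Haversine distance and compass bearing between two lat/long points (standard formulas).
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Straight-line (great-circle) distance in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def cardinal_direction(lat1, lon1, lat2, lon2):
    """Rough compass direction of point 2 as seen from point 1."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    bearing = (math.degrees(math.atan2(y, x)) + 360) % 360
    dirs = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    return dirs[round(bearing / 45) % 8]

# Eiffel Tower -> Notre-Dame: roughly 4 km, essentially due east.
print(haversine_km(48.8584, 2.2945, 48.8530, 2.3499))        # ~4.1
print(cardinal_direction(48.8584, 2.2945, 48.8530, 2.3499))  # "E"
```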
To prove this, the researchers isolated questions requiring straight-line distance calculations.

As shown above, accuracy was abysmal. However, the researchers proposed a solution: Give the model a calculator.
By integrating a calculator tool that the LLM could call to perform the math, accuracy skyrocketed.

Figure 14 shows that providing a calculator improved Claude-3.5-Sonnet’s performance on distance tasks from ~51% to ~85%. This suggests the reasoning capability is there, but raw calculation is the bottleneck. A similar trend was observed for determining cardinal directions.
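The general pattern is simple: expose the geometry as a tool and let the model delegate the numbers. The sketch below is only an illustration of that pattern, with a hypothetical tool schema and harness; it is not MapEval's actual calculator interface.

```python
# Hypothetical sketch of a "calculator" tool the model can delegate to;
# the tool schema and call format below are illustrative, not MapEval's interface.
import math

def distance_calculator(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Exact great-circle distance in km, so the LLM never does the arithmetic itself."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# A JSON-style tool description the model would see (exact format varies by provider):
CALCULATOR_TOOL = {
    "name": "distance_calculator",
    "description": "Compute the straight-line distance in km between two lat/long points.",
    "parameters": {"lat1": "float", "lon1": "float", "lat2": "float", "lon2": "float"},
}

# When the model emits a call like {"tool": "distance_calculator", "args": {...}},
# the harness executes it and feeds the numeric result back into the conversation.
call_args = {"lat1": 48.8584, "lon1": 2.2945, "lat2": 48.8530, "lon2": 2.3499}
print(f"{distance_calculator(**call_args):.1f} km")  # ~4.1 km
```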

Conclusion & Implications
MapEval serves as a reality check for the AI community. While we often speak of “Artificial General Intelligence,” the inability of state-of-the-art models to reliably navigate a map—a task billions of humans do daily—highlights a significant gap in spatio-temporal reasoning.
Key Takeaways:
- Humans are still superior: Across all tasks, human baselines (80%+) far exceed the best AI models (~60-65%).
- Modality Matters: Models perform differently when reading text vs. looking at images. Visual map understanding is particularly immature.
- Agents need help: API-based agents struggle with parameter management and loop detection.
- Tools are essential: We shouldn’t expect LLMs to do complex geometry in their “heads.” Integrating tools like calculators or specialized routing engines is the path forward.
The release of MapEval provides a rigorous standard for future models. For AI to truly assist in the physical world—whether it’s autonomous driving, logistics planning, or just helping a tourist find a coffee shop—it needs to master the map.