MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

Ziqiao Shang1,2†, Lingyue Ge1,2†, Yang Chen1,2, Shi-Yu Tian1,2, Zhenyu Huang1,2,
Wenbo Fu1,2, Yu-Feng Li1,2, Lan-Zhe Guo1,2*
2026
1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China 2School of Intelligence Science and Technology, Nanjing University, Suzhou, China
†Equal contribution. *Corresponding author: guolz@lamda.nju.edu.cn
MapTab overview
MapTab is a comprehensive benchmark designed to evaluate the map understanding and spatial reasoning capabilities of Multimodal Large Language Models (MLLMs). The benchmark covers two core tasks, route planning and map-based question answering, over both metro maps and travel maps.

Abstract

Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artificial General Intelligence (AGI). However, existing benchmarks remain insufficient for rigorously assessing their reasoning capabilities under multi-criteria constraints. To bridge this gap, we introduce MapTab, a multimodal benchmark specifically designed to evaluate holistic multi-criteria reasoning in MLLMs via route planning tasks. MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data. The benchmark encompasses two scenarios: Metromap, covering metro networks in 160 cities across 52 countries, and Travelmap, depicting 168 representative tourist attractions from 19 countries. In total, MapTab comprises 328 images, 196,800 route planning queries, and 3,936 QA queries, all incorporating 4 key criteria: Time, Price, Comfort, and Reliability. Extensive evaluations across 15 representative MLLMs reveal that current models face substantial challenges in multi-criteria multimodal reasoning. Notably, under conditions of limited visual perception, multimodal collaboration often underperforms compared to unimodal approaches. We believe MapTab provides a challenging and realistic testbed to advance the systematic evaluation of MLLMs.
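To make the route-planning setting concrete, the sketch below finds the best route between two stations when each edge carries several criteria and a query fixes one optimization target (e.g., Time or Price). The toy graph, its attribute values, and the `best_route` helper are invented for illustration; they are not drawn from MapTab's data or evaluation code.

```python
import heapq

def best_route(graph, start, goal, criterion):
    """Dijkstra's algorithm over a single edge attribute.

    graph: {node: [(neighbor, {'time': ..., 'price': ...}), ...]}
    Returns (total_cost, route) or (float('inf'), []) if unreachable.
    """
    frontier = [(0, start, [start])]
    seen = set()
    while frontier:
        cost, node, route = heapq.heappop(frontier)
        if node == goal:
            return cost, route
        if node in seen:
            continue
        seen.add(node)
        for nbr, attrs in graph.get(node, []):
            if nbr not in seen:
                heapq.heappush(frontier, (cost + attrs[criterion], nbr, route + [nbr]))
    return float('inf'), []

# Toy metro graph (invented values): two ways to get from A to C.
toy = {
    'A': [('B', {'time': 2, 'price': 5}), ('C', {'time': 9, 'price': 1})],
    'B': [('C', {'time': 2, 'price': 5})],
}
```

Optimizing for 'time' routes via B (cost 4), while optimizing for 'price' takes the direct edge (cost 1), illustrating how different criteria select different gold routes for the same origin-destination pair.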

Route planning leaderboard

The following leaderboard presents the evaluation results of various Multimodal Large Language Models (MLLMs) across different input modalities on the MapTab route planning task. Performance is measured with three key metrics: Exact Match Accuracy (EMA), Partial Match Accuracy (PMA), and Difficulty-aware Score (DS). Models were tested with varying combinations of map, edge, and vertex data, as detailed below:
·Map-only: Only map data used
·Edge-only: Only edge data used
·Map+Edge: Map and edge data combined
·Map+Edge+Vertex: Map, edge, and vertex data combined
·Map+Vertex2: Map and merged vertex data (Vertex2_tab)
For clarity, comparisons involving Edge_tab + Vertex_tab are omitted, as they yielded results similar to the Map-only and Edge-only groups without adding new insights. The best-performing results within the open-source and closed-source model groups are highlighted in bold.
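For intuition about how the route metrics behave, here is a minimal sketch. `exact_match` scores 1 only for a perfect station-sequence match (the natural reading of EMA); `partial_match` is a plausible stand-in for PMA, taken here as the longest common subsequence of the predicted and gold routes normalized by the gold length. MapTab's exact scoring formulas, including the Difficulty-aware Score, are not reproduced here.

```python
def exact_match(pred, gold):
    # EMA-style check: 1 only when the predicted station sequence
    # is identical to the gold route.
    return int(pred == gold)

def partial_match(pred, gold):
    # PMA stand-in: longest common subsequence (order-preserving overlap)
    # of the two routes, normalized by the gold route length.
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if pred[i] == gold[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / n if n else 0.0
```

Under this reading, a prediction that skips one intermediate station scores 0 on exact match but still earns partial credit proportional to the stations it recovers in order.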

Model | Type | Map-only | Edge-only | Map+Edge | Map+Edge+Vertex | Map+Vertex2
(each cell: EMA / PMA / DS)

Scenario: MetroMap

Open-source Models
Qwen3-VL-8B-Instruct | Instruct | 2.75 / 17.58 / 67 | 25.69 / 46.44 / 1018 | 21.25 / 41.30 / 785 | 19.31 / 39.31 / 728 | 4.69 / 21.87 / 137
Qwen3-VL-8B-Thinking | Thinking | 5.12 / 20.99 / 132 | 31.69 / 49.76 / 1276 | 38.00 / 57.06 / 1669 | 23.75 / 41.69 / 948 | 6.38 / 22.93 / 194
Qwen3-VL-2B-Instruct | Instruct | 0.94 / 15.14 / 26 | 9.88 / 27.61 / 371 | 6.63 / 23.85 / 232 | 7.00 / 26.91 / 289 | 2.00 / 17.82 / 58
Qwen2.5-VL-7B-Instruct | Instruct | 0.94 / 15.02 / 21 | 14.00 / 31.20 / 535 | 11.69 / 28.32 / 441 | 7.94 / 20.77 / 318 | 3.38 / 18.09 / 101
Phi-3.5-Vision-Instruct-4B | Instruct | 0.06 / 10.40 / 1 | 10.87 / 27.92 / 402 | 6.63 / 22.14 / 208 | 2.75 / 12.27 / 99 | 0.81 / 12.94 / 13
Phi-4-Multimodal-Instruct-6B | Instruct | 0.00 / 9.75 / 0 | 2.13 / 12.52 / 66 | 2.13 / 11.78 / 85 | 1.75 / 9.51 / 52 | 0.44 / 9.02 / 7
InternVL3-8B-Instruct | Instruct | 0.13 / 13.98 / 2 | 10.50 / 29.57 / 414 | 12.81 / 31.83 / 488 | 9.00 / 24.73 / 377 | 1.75 / 17.00 / 68
Qwen3-VL-30B-A3B-Instruct | Instruct | 3.31 / 19.26 / 102 | 23.69 / 44.33 / 961 | 22.56 / 43.58 / 914 | 19.00 / 40.03 / 724 | 6.75 / 26.22 / 218
Qwen3-VL-32B-Instruct | Instruct | 6.31 / 22.23 / 181 | 31.87 / 54.45 / 1270 | 32.12 / 54.54 / 1339 | 28.50 / 50.06 / 1181 | 6.56 / 24.43 / 187
Qwen3-VL-32B-Thinking | Thinking | 13.31 / 29.43 / 437 | 31.81 / 54.94 / 1276 | 44.12 / 62.77 / 2078 | 26.56 / 51.48 / 1060 | 9.19 / 28.89 / 278

Closed-source Models
GPT-4o | Instruct | 6.63 / 25.61 / 205 | 42.38 / 64.07 / 2112 | 40.69 / 62.40 / 1944 | 35.63 / 55.51 / 1630 | 11.31 / 31.11 / 398
GPT-4.1 | Instruct | 7.94 / 25.52 / 235 | 48.56 / 67.07 / 2523 | 46.81 / 65.18 / 2413 | 41.81 / 62.88 / 2038 | 14.06 / 35.98 / 515
Gemini-3-Flash-Preview | Instruct | 37.06 / 57.15 / 2046 | 74.75 / 84.99 / 5345 | 73.06 / 83.37 / 5171 | 69.19 / 76.14 / 4765 | 53.87 / 65.84 / 3294
Doubao-Seed-1-6-251015-w/o | No-Thinking | 8.13 / 24.60 / 233 | 46.94 / 66.98 / 2394 | 48.06 / 66.95 / 2533 | 40.56 / 62.11 / 2088 | 13.81 / 35.61 / 494
Doubao-Seed-1-6-251015-Thinking | Thinking | 12.06 / 30.49 / 461 | 74.38 / 86.23 / 4996 | 74.00 / 85.68 / 4964 | 76.06 / 83.41 / 5029 | 22.03 / 42.48 / 984
Qwen-VL-Plus-w/o | No-Thinking | 4.81 / 21.83 / 133 | 36.88 / 58.69 / 1643 | 38.25 / 58.59 / 1706 | 31.62 / 52.92 / 1355 | 6.94 / 27.69 / 229
Qwen-VL-Plus-Thinking | Thinking | 10.75 / 29.11 / 349 | 61.50 / 76.62 / 3576 | 62.19 / 76.42 / 3648 | 45.75 / 64.46 / 2318 | 16.38 / 37.44 / 582

Scenario: TravelMap

Open-source Models
Qwen3-VL-8B-Instruct | Instruct | 19.29 / 42.50 / 1190 | 44.05 / 61.66 / 3051 | 43.33 / 61.39 / 3008 | 34.52 / 55.56 / 2330 | 15.65 / 40.97 / 869
Qwen3-VL-8B-Thinking | Thinking | 22.62 / 45.94 / 1345 | 74.17 / 82.41 / 5319 | 82.68 / 88.54 / 6268 | 33.15 / 55.60 / 2088 | 12.74 / 38.10 / 705
Qwen3-VL-2B-Instruct | Instruct | 8.45 / 34.30 / 500 | 11.25 / 32.35 / 763 | 19.17 / 45.68 / 1210 | 12.14 / 40.47 / 787 | 3.15 / 30.69 / 164
Qwen2.5-VL-7B-Instruct | Instruct | 7.68 / 30.48 / 431 | 21.07 / 38.15 / 1322 | 24.82 / 42.02 / 1508 | 15.60 / 37.47 / 902 | 4.70 / 28.84 / 235
Phi-3.5-Vision-Instruct-4B | Instruct | 0.12 / 20.00 / 8 | 12.20 / 34.81 / 778 | 9.82 / 31.87 / 620 | 4.46 / 23.21 / 263 | 1.31 / 22.68 / 81
Phi-4-Multimodal-Instruct-6B | Instruct | 0.42 / 19.26 / 21 | 7.20 / 17.63 / 479 | 5.30 / 15.93 / 318 | 1.73 / 9.36 / 115 | 1.43 / 18.96 / 63
InternVL3-8B-Instruct | Instruct | 6.61 / 29.21 / 309 | 29.58 / 49.69 / 1821 | 29.40 / 50.16 / 1865 | 13.57 / 36.78 / 933 | 2.50 / 24.28 / 136
Qwen3-VL-30B-A3B-Instruct | Instruct | 17.86 / 44.15 / 1098 | 50.95 / 65.36 / 3458 | 53.75 / 67.71 / 3747 | 38.45 / 58.02 / 2738 | 9.70 / 37.93 / 578
Qwen3-VL-32B-Instruct | Instruct | 36.90 / 57.44 / 2431 | 64.52 / 76.16 / 4704 | 68.39 / 78.99 / 5184 | 52.56 / 69.18 / 3770 | 21.67 / 47.34 / 1299
Qwen3-VL-32B-Thinking | Thinking | 39.17 / 58.84 / 2650 | 69.76 / 79.60 / 5149 | 91.79 / 94.55 / 7287 | 42.32 / 62.99 / 2931 | 19.94 / 46.73 / 1201

Closed-source Models
GPT-4o | Instruct | 16.85 / 40.98 / 930 | 65.06 / 75.84 / 4651 | 62.74 / 74.11 / 4467 | 46.07 / 63.07 / 3069 | 12.08 / 38.07 / 675
GPT-4.1 | Instruct | 20.30 / 43.24 / 1226 | 74.82 / 82.98 / 5571 | 70.89 / 79.84 / 5211 | 54.70 / 69.59 / 3917 | 15.06 / 40.67 / 862
Gemini-3-Flash-Preview | Instruct | 60.00 / 73.20 / 4469 | 98.27 / 98.38 / 8190 | 94.40 / 94.87 / 7757 | 78.51 / 82.40 / 6459 | 43.51 / 60.11 / 3250
Doubao-Seed-1-6-251015-w/o | No-Thinking | 33.04 / 54.15 / 2193 | 73.51 / 82.16 / 5425 | 76.85 / 84.04 / 5812 | 56.25 / 71.46 / 4031 | 25.48 / 49.54 / 1610
Doubao-Seed-1-6-251015-Thinking | Thinking | 38.45 / 58.46 / 2735 | 98.39 / 98.87 / 8178 | 97.86 / 98.47 / 8127 | 83.15 / 89.08 / 6672 | 25.30 / 48.90 / 1678
Qwen-VL-Plus-w/o | No-Thinking | 30.60 / 52.64 / 1935 | 64.23 / 76.45 / 4656 | 69.64 / 79.78 / 5133 | 53.99 / 70.07 / 3842 | 22.92 / 47.65 / 1417
Qwen-VL-Plus-Thinking | Thinking | 38.27 / 58.94 / 2539 | 64.35 / 76.53 / 4670 | 94.23 / 96.04 / 7570 | 56.19 / 70.84 / 4042 | 23.21 / 47.18 / 1481

QA leaderboard

This leaderboard summarizes the performance of various Multimodal Large Language Models (MLLMs) on QA tasks across the MetroMap and TravelMap scenarios. Input modalities are represented as:
·M for Map
·E for Edge_tab
·V for Vertex_tab
In the MetroMap scenario, the Vertex_tab combined with the Map input excludes the Line column to minimize unnecessary table details, ensuring the evaluation focuses on map-table coordination.
Tasks are categorized into three distinct types:
·Global Perception-based Reasoning Tasks (GP)
·Local Perception-based Reasoning Tasks (LP)
·Spatial Relationship Judgment Tasks (SR)
Bold values in the table indicate the best performance for open-source and closed-source models, respectively.

Model | Type | Map (M) | Edge (E) | Vertex (V) | Map+Vertex (M+V)
(each cell: GP / LP / SR)

Scenario: MetroMap

Open-source Models
Qwen3-VL-8B-Instruct | Instruct | 55.00 / 17.50 / 73.12 | 22.50 / 100.00 / 7.50 | 57.50 / 51.88 / 86.88 | 0.63 / 22.50 / 38.75
Qwen3-VL-8B-Thinking | Thinking | 53.12 / 28.12 / 51.25 | 56.87 / 99.38 / 56.87 | 79.37 / 77.50 / 98.12 | 7.50 / 9.38 / 35.63
Qwen3-VL-2B-Instruct | Instruct | 8.13 / 5.00 / 63.12 | 3.75 / 87.50 / 3.12 | 26.25 / 11.25 / 64.38 | 0.00 / 8.13 / 26.87
Qwen2.5-VL-7B-Instruct | Instruct | 48.75 / 15.62 / 66.25 | 15.62 / 100.00 / 10.00 | 44.37 / 60.62 / 87.50 | 1.25 / 25.62 / 33.12
Phi-3.5-Vision-Instruct-4B | Instruct | 58.13 / 18.75 / 78.12 | 22.50 / 100.00 / 21.88 | 53.75 / 66.87 / 98.12 | 0.63 / 21.25 / 40.00
Phi-4-Multimodal-Instruct-6B | Instruct | 60.62 / 40.62 / 80.00 | 41.25 / 93.13 / 42.50 | 58.75 / 92.50 / 99.38 | 5.63 / 35.63 / 49.38
InternVL3-8B-Instruct | Instruct | 35.00 / 20.62 / 65.00 | 7.50 / 83.13 / 3.12 | 33.75 / 10.00 / 70.63 | 0.00 / 12.50 / 28.12
Qwen3-VL-30B-A3B-Instruct | Instruct | 20.62 / 16.25 / 60.62 | 0.63 / 68.75 / 1.25 | 0.00 / 1.25 / 68.75 | 0.00 / 11.25 / 18.12
Qwen3-VL-32B-Instruct | Instruct | 5.00 / 0.00 / 38.12 | 10.62 / 78.12 / 3.12 | 16.25 / 4.37 / 73.12 | 0.00 / 7.50 / 23.75
Qwen3-VL-32B-Thinking | Thinking | 26.87 / 19.38 / 50.00 | 5.00 / 99.38 / 5.00 | 21.25 / 25.00 / 85.62 | 0.00 / 6.88 / 60.62

Closed-source Models
GPT-4o | Instruct | 62.50 / 13.75 / 78.75 | 31.87 / 100.00 / 28.12 | 55.63 / 78.12 / 100.00 | 3.75 / 29.38 / 45.00
GPT-4.1 | Instruct | 61.88 / 26.87 / 76.25 | 50.62 / 99.38 / 38.12 | 64.38 / 83.75 / 100.00 | 3.12 / 25.62 / 49.38
Gemini-3-Flash-Preview | Instruct | 59.38 / 82.50 / 93.13 | 91.25 / 98.12 / 75.62 | 88.75 / 94.37 / 100.00 | 48.13 / 80.00 / 94.37
Doubao-Seed-1-6-251015-w/o_Thinking | No-Thinking | 55.63 / 20.62 / 76.25 | 41.88 / 100.00 / 59.38 | 58.13 / 86.25 / 99.38 | 3.75 / 49.38 / 50.62
Doubao-Seed-1-6-251015-Thinking | Thinking | 54.37 / 40.62 / 77.50 | 72.50 / 100.00 / 69.37 | 96.25 / 98.75 / 100.00 | 27.50 / 50.00 / 53.12
Qwen-VL-Plus-w/o_Thinking | No-Thinking | 60.00 / 21.88 / 78.75 | 40.62 / 100.00 / 40.00 | 64.38 / 77.50 / 97.50 | 1.25 / 25.62 / 40.62
Qwen-VL-Plus-Thinking | Thinking | 57.50 / 45.00 / 81.87 | 68.75 / 100.00 / 71.25 | 90.62 / 95.63 / 100.00 | 13.75 / 46.25 / 55.63

Scenario: TravelMap

Open-source Models
Qwen3-VL-8B-Instruct | Instruct | 7.14 / 60.12 / 52.98 | 17.86 / 99.40 / 45.24 | 38.69 / 50.60 / 61.31 | 75.60 / 70.24 / 14.29
Qwen3-VL-8B-Thinking | Thinking | 39.29 / 70.83 / 52.38 | 87.50 / 100.00 / 39.88 | 100.00 / 100.00 / 100.00 | 63.10 / 69.05 / 13.69
Qwen3-VL-2B-Instruct | Instruct | 12.50 / 58.93 / 9.52 | 6.00 / 94.64 / 64.88 | 1.19 / 46.43 / 38.10 | 64.29 / 67.86 / 4.17
Qwen2.5-VL-7B-Instruct | Instruct | 4.76 / 65.48 / 54.17 | 38.10 / 99.40 / 47.62 | 17.26 / 87.50 / 79.76 | 33.93 / 68.45 / 16.07
Phi-3.5-Vision-Instruct-4B | Instruct | 13.10 / 48.21 / 59.52 | 39.88 / 99.40 / 41.07 | 50.00 / 75.60 / 77.38 | 72.02 / 74.40 / 10.71
Phi-4-Multimodal-Instruct-6B | Instruct | 44.05 / 70.24 / 58.93 | 78.57 / 98.81 / 33.93 | 96.43 / 98.21 / 100.00 | 73.21 / 69.05 / 17.86
InternVL3-8B-Instruct | Instruct | 12.50 / 59.52 / 35.71 | 8.93 / 97.62 / 49.40 | 10.71 / 50.00 / 51.79 | 70.83 / 67.86 / 4.76
Qwen3-VL-30B-A3B-Instruct | Instruct | 5.95 / 50.60 / 6.55 | 7.74 / 86.31 / 52.38 | 8.93 / 44.64 / 45.24 | 54.76 / 35.12 / 2.98
Qwen3-VL-32B-Instruct | Instruct | 0.00 / 42.26 / 14.88 | 11.31 / 63.69 / 38.31 | 18.45 / 45.83 / 47.62 | 27.98 / 63.69 / 5.95
Qwen3-VL-32B-Thinking | Thinking | 8.33 / 60.12 / 23.21 | 1.19 / 97.02 / 44.64 | 10.12 / 38.69 / 56.55 | 69.05 / 66.67 / 5.95

Closed-source Models
GPT-4o | Instruct | 11.31 / 63.69 / 49.40 | 47.02 / 100.00 / 36.31 | 53.57 / 99.40 / 67.26 | 71.43 / 73.21 / 11.31
GPT-4.1 | Instruct | 3.57 / 69.64 / 55.95 | 47.02 / 100.00 / 42.86 | 66.07 / 100.00 / 69.64 | 73.21 / 77.38 / 17.86
Gemini-3-Flash-Preview | Instruct | 45.83 / 85.12 / 77.98 | 97.62 / 99.40 / 86.31 | 100.00 / 99.40 / 100.00 | 85.71 / 81.55 / 26.19
Doubao-Seed-1-6-251015-w/o_Thinking | No-Thinking | 22.62 / 58.93 / 50.00 | 63.10 / 98.81 / 51.79 | 54.76 / 100.00 / 98.81 | 66.07 / 76.79 / 25.60
Doubao-Seed-1-6-251015-Thinking | Thinking | 24.40 / 71.43 / 55.95 | 95.83 / 84.52 / 71.43 | 97.62 / 100.00 / 100.00 | 78.57 / 72.02 / 22.02
Qwen-VL-Plus-w/o_Thinking | No-Thinking | 19.64 / 72.02 / 52.98 | 48.81 / 100.00 / 35.71 | 57.74 / 98.81 / 82.74 | 54.17 / 70.24 / 19.64
Qwen-VL-Plus-Thinking | Thinking | 24.40 / 67.86 / 62.50 | 98.81 / 100.00 / 34.52 | 100.00 / 100.00 / 100.00 | 69.05 / 72.62 / 24.40

BibTeX

@article{shang2026maptab,
  title={MapTab: Can MLLMs Master Constrained Route Planning?},
  author={Shang, Ziqiao and Ge, Lingyue and Chen, Yang and Tian, Shi-Yu and Huang, Zhenyu and Fu, Wenbo and Li, Yu-Feng and Guo, Lan-Zhe},
  journal={arXiv preprint arXiv:2602.18600},
  year={2026}
}