MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

Ziqiao Shang1,2†, Lingyue Ge1,2†, Yang Chen1,2, Shi-Yu Tian1,2, Zhenyu Huang1,2,
Wenbo Fu1,2, Yu-Feng Li1,2, Lan-Zhe Guo1,2*
2026
1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China 2School of Intelligence Science and Technology, Nanjing University, Suzhou, China
†Equal contribution. *Corresponding author: guolz@lamda.nju.edu.cn
MapTab overview
MapTab is a comprehensive benchmark designed to evaluate the map understanding and spatial reasoning capabilities of Multimodal Large Language Models (MLLMs). The benchmark covers two core tasks, route planning and map-based question answering, over both metro maps and travel maps.

Abstract

Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artificial General Intelligence (AGI). However, existing benchmarks remain insufficient for rigorously assessing their reasoning capabilities under multi-criteria constraints. To bridge this gap, we introduce MapTab, a multimodal benchmark specifically designed to evaluate holistic multi-criteria reasoning in MLLMs via route planning tasks. MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data. The benchmark encompasses two scenarios: Metromap, covering metro networks in 160 cities across 52 countries, and Travelmap, depicting 168 representative tourist attractions from 19 countries. In total, MapTab comprises 328 images, 196,800 route planning queries, and 3,936 QA queries, all incorporating 4 key criteria: Time, Price, Comfort, and Reliability. Extensive evaluations across 15 representative MLLMs reveal that current models face substantial challenges in multi-criteria multimodal reasoning. Notably, under conditions of limited visual perception, multimodal collaboration often underperforms compared to unimodal approaches. We believe MapTab provides a challenging and realistic testbed to advance the systematic evaluation of MLLMs.
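To make the route-planning setting concrete, the sketch below finds the best route between two stations when each edge carries several criteria and a query fixes one optimization target (e.g., Time or Price). The toy graph, its attribute values, and the `best_route` helper are invented for illustration; they are not drawn from MapTab's data or evaluation code.

```python
import heapq

def best_route(graph, start, goal, criterion):
    """Dijkstra's algorithm over a single edge attribute.

    graph: {node: [(neighbor, {'time': ..., 'price': ...}), ...]}
    Returns (total_cost, route) or (float('inf'), []) if unreachable.
    """
    frontier = [(0, start, [start])]
    seen = set()
    while frontier:
        cost, node, route = heapq.heappop(frontier)
        if node == goal:
            return cost, route
        if node in seen:
            continue
        seen.add(node)
        for nbr, attrs in graph.get(node, []):
            if nbr not in seen:
                heapq.heappush(frontier, (cost + attrs[criterion], nbr, route + [nbr]))
    return float('inf'), []

# Toy metro graph (invented values): two ways to get from A to C.
toy = {
    'A': [('B', {'time': 2, 'price': 5}), ('C', {'time': 9, 'price': 1})],
    'B': [('C', {'time': 2, 'price': 5})],
}
```

Optimizing for 'time' routes via B (cost 4), while optimizing for 'price' takes the direct edge (cost 1), illustrating how different criteria select different gold routes for the same origin-destination pair.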

Route planning leaderboard

The following leaderboard presents the evaluation results of various Multimodal Large Language Models (MLLMs) across different input modalities on the MapTab route planning task. Performance is measured with three key metrics: Exact Match Accuracy (EMA), Partial Match Accuracy (PMA), and Difficulty-aware Score (DS). Models were tested with varying combinations of map, edge, and vertex data, as detailed below:
·Map-only: Only map data used
·Edge-only: Only edge data used
·Map+Edge: Map and edge data combined
·Map+Edge+Vertex: Map, edge, and vertex data combined
·Map+Vertex2: Map and merged vertex data (Vertex2_tab)
For clarity, comparisons involving Edge_tab + Vertex_tab are omitted, as they yielded results similar to the Map-only and Edge-only groups without adding new insights. The best-performing results within the open-source and closed-source model groups are highlighted in bold.
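For intuition about how the route metrics behave, here is a minimal sketch. `exact_match` scores 1 only for a perfect station-sequence match (the natural reading of EMA); `partial_match` is a plausible stand-in for PMA, taken here as the longest common subsequence of the predicted and gold routes normalized by the gold length. MapTab's exact scoring formulas, including the Difficulty-aware Score, are not reproduced here.

```python
def exact_match(pred, gold):
    # EMA-style check: 1 only when the predicted station sequence
    # is identical to the gold route.
    return int(pred == gold)

def partial_match(pred, gold):
    # PMA stand-in: longest common subsequence (order-preserving overlap)
    # of the two routes, normalized by the gold route length.
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if pred[i] == gold[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / n if n else 0.0
```

Under this reading, a prediction that skips one intermediate station scores 0 on exact match but still earns partial credit proportional to the stations it recovers in order.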

Model | Type | Map-only | Edge-only | Map+Edge | Map+Edge+Vertex | Map+Vertex2
(each cell: EMA / PMA / DS)

Scenario: MetroMap

Open-source Models
Qwen3-VL-8B-Instruct | Instruct | 2.75 / 17.58 / 67 | 25.69 / 46.44 / 1018 | 21.25 / 41.30 / 785 | 19.31 / 39.31 / 728 | 4.69 / 21.87 / 137
Qwen3-VL-8B-Thinking | Thinking | 5.12 / 20.99 / 132 | 31.69 / 49.76 / 1276 | 38.00 / 57.06 / 1669 | 23.75 / 41.69 / 948 | 6.38 / 22.93 / 194
Qwen3-VL-2B-Instruct | Instruct | 0.94 / 15.14 / 26 | 9.88 / 27.61 / 371 | 6.63 / 23.85 / 232 | 7.00 / 26.91 / 289 | 2.00 / 17.82 / 58
Qwen2.5-VL-7B-Instruct | Instruct | 0.94 / 15.02 / 21 | 14.00 / 31.20 / 535 | 11.69 / 28.32 / 441 | 7.94 / 20.77 / 318 | 3.38 / 18.09 / 101
Phi-3.5-Vision-Instruct-4B | Instruct | 0.06 / 10.40 / 1 | 10.87 / 27.92 / 402 | 6.63 / 22.14 / 208 | 2.75 / 12.27 / 99 | 0.81 / 12.94 / 13
Phi-4-Multimodal-Instruct-6B | Instruct | 0.00 / 9.75 / 0 | 2.13 / 12.52 / 66 | 2.13 / 11.78 / 85 | 1.75 / 9.51 / 52 | 0.44 / 9.02 / 7
InternVL3-8B-Instruct | Instruct | 0.13 / 13.98 / 2 | 10.50 / 29.57 / 414 | 12.81 / 31.83 / 488 | 9.00 / 24.73 / 377 | 1.75 / 17.00 / 68
Qwen3-VL-30B-A3B-Instruct | Instruct | 3.31 / 19.26 / 102 | 23.69 / 44.33 / 961 | 22.56 / 43.58 / 914 | 19.00 / 40.03 / 724 | 6.75 / 26.22 / 218
Qwen3-VL-32B-Instruct | Instruct | 6.31 / 22.23 / 181 | 31.87 / 54.45 / 1270 | 32.12 / 54.54 / 1339 | 28.50 / 50.06 / 1181 | 6.56 / 24.43 / 187
Qwen3-VL-32B-Thinking | Thinking | 13.31 / 29.43 / 437 | 31.81 / 54.94 / 1276 | 44.12 / 62.77 / 2078 | 26.56 / 51.48 / 1060 | 9.19 / 28.89 / 278

Closed-source Models
GPT-4o | Instruct | 6.63 / 25.61 / 205 | 42.38 / 64.07 / 2112 | 40.69 / 62.40 / 1944 | 35.63 / 55.51 / 1630 | 11.31 / 31.11 / 398
GPT-4.1 | Instruct | 7.94 / 25.52 / 235 | 48.56 / 67.07 / 2523 | 46.81 / 65.18 / 2413 | 41.81 / 62.88 / 2038 | 14.06 / 35.98 / 515
Gemini-3-Flash-Preview | Instruct | 37.06 / 57.15 / 2046 | 74.75 / 84.99 / 5345 | 73.06 / 83.37 / 5171 | 69.19 / 76.14 / 4765 | 53.87 / 65.84 / 3294
Doubao-Seed-1-6-251015-w/o | No-Thinking | 8.13 / 24.60 / 233 | 46.94 / 66.98 / 2394 | 48.06 / 66.95 / 2533 | 40.56 / 62.11 / 2088 | 13.81 / 35.61 / 494
Doubao-Seed-1-6-251015-Thinking | Thinking | 12.06 / 30.49 / 461 | 74.38 / 86.23 / 4996 | 74.00 / 85.68 / 4964 | 76.06 / 83.41 / 5029 | 22.03 / 42.48 / 984
Qwen-VL-Plus-w/o | No-Thinking | 4.81 / 21.83 / 133 | 36.88 / 58.69 / 1643 | 38.25 / 58.59 / 1706 | 31.62 / 52.92 / 1355 | 6.94 / 27.69 / 229
Qwen-VL-Plus-Thinking | Thinking | 10.75 / 29.11 / 349 | 61.50 / 76.62 / 3576 | 62.19 / 76.42 / 3648 | 45.75 / 64.46 / 2318 | 16.38 / 37.44 / 582

Scenario: TravelMap

Open-source Models
Qwen3-VL-8B-Instruct | Instruct | 19.29 / 42.50 / 1190 | 44.05 / 61.66 / 3051 | 43.33 / 61.39 / 3008 | 34.52 / 55.56 / 2330 | 15.65 / 40.97 / 869
Qwen3-VL-8B-Thinking | Thinking | 22.62 / 45.94 / 1345 | 74.17 / 82.41 / 5319 | 82.68 / 88.54 / 6268 | 33.15 / 55.60 / 2088 | 12.74 / 38.10 / 705
Qwen3-VL-2B-Instruct | Instruct | 8.45 / 34.30 / 500 | 11.25 / 32.35 / 763 | 19.17 / 45.68 / 1210 | 12.14 / 40.47 / 787 | 3.15 / 30.69 / 164
Qwen2.5-VL-7B-Instruct | Instruct | 7.68 / 30.48 / 431 | 21.07 / 38.15 / 1322 | 24.82 / 42.02 / 1508 | 15.60 / 37.47 / 902 | 4.70 / 28.84 / 235
Phi-3.5-Vision-Instruct-4B | Instruct | 0.12 / 20.00 / 8 | 12.20 / 34.81 / 778 | 9.82 / 31.87 / 620 | 4.46 / 23.21 / 263 | 1.31 / 22.68 / 81
Phi-4-Multimodal-Instruct-6B | Instruct | 0.42 / 19.26 / 21 | 7.20 / 17.63 / 479 | 5.30 / 15.93 / 318 | 1.73 / 9.36 / 115 | 1.43 / 18.96 / 63
InternVL3-8B-Instruct | Instruct | 6.61 / 29.21 / 309 | 29.58 / 49.69 / 1821 | 29.40 / 50.16 / 1865 | 13.57 / 36.78 / 933 | 2.50 / 24.28 / 136
Qwen3-VL-30B-A3B-Instruct | Instruct | 17.86 / 44.15 / 1098 | 50.95 / 65.36 / 3458 | 53.75 / 67.71 / 3747 | 38.45 / 58.02 / 2738 | 9.70 / 37.93 / 578
Qwen3-VL-32B-Instruct | Instruct | 36.90 / 57.44 / 2431 | 64.52 / 76.16 / 4704 | 68.39 / 78.99 / 5184 | 52.56 / 69.18 / 3770 | 21.67 / 47.34 / 1299
Qwen3-VL-32B-Thinking | Thinking | 39.17 / 58.84 / 2650 | 69.76 / 79.60 / 5149 | 91.79 / 94.55 / 7287 | 42.32 / 62.99 / 2931 | 19.94 / 46.73 / 1201

Closed-source Models
GPT-4o | Instruct | 16.85 / 40.98 / 930 | 65.06 / 75.84 / 4651 | 62.74 / 74.11 / 4467 | 46.07 / 63.07 / 3069 | 12.08 / 38.07 / 675
GPT-4.1 | Instruct | 20.30 / 43.24 / 1226 | 74.82 / 82.98 / 5571 | 70.89 / 79.84 / 5211 | 54.70 / 69.59 / 3917 | 15.06 / 40.67 / 862
Gemini-3-Flash-Preview | Instruct | 60.00 / 73.20 / 4469 | 98.27 / 98.38 / 8190 | 94.40 / 94.87 / 7757 | 78.51 / 82.40 / 6459 | 43.51 / 60.11 / 3250
Doubao-Seed-1-6-251015-w/o | No-Thinking | 33.04 / 54.15 / 2193 | 73.51 / 82.16 / 5425 | 76.85 / 84.04 / 5812 | 56.25 / 71.46 / 4031 | 25.48 / 49.54 / 1610
Doubao-Seed-1-6-251015-Thinking | Thinking | 38.45 / 58.46 / 2735 | 98.39 / 98.87 / 8178 | 97.86 / 98.47 / 8127 | 83.15 / 89.08 / 6672 | 25.30 / 48.90 / 1678
Qwen-VL-Plus-w/o | No-Thinking | 30.60 / 52.64 / 1935 | 64.23 / 76.45 / 4656 | 69.64 / 79.78 / 5133 | 53.99 / 70.07 / 3842 | 22.92 / 47.65 / 1417
Qwen-VL-Plus-Thinking | Thinking | 38.27 / 58.94 / 2539 | 64.35 / 76.53 / 4670 | 94.23 / 96.04 / 7570 | 56.19 / 70.84 / 4042 | 23.21 / 47.18 / 1481

QA leaderboard

This leaderboard summarizes the performance of various Multimodal Large Language Models (MLLMs) on QA tasks across the MetroMap and TravelMap scenarios. Input modalities are represented as:
·M for Map
·E for Edge_tab
·V for Vertex_tab
In the MetroMap scenario, the Vertex_tab combined with the Map input excludes the Line column to minimize unnecessary table details, ensuring the evaluation focuses on map-table coordination.
Tasks are categorized into three distinct types:
·Global Perception-based Reasoning Tasks (GP)
·Local Perception-based Reasoning Tasks (LP)
·Spatial Relationship Judgment Tasks (SR)
Bold values in the table indicate the best performance for open-source and closed-source models, respectively.

Model | Type | Map (M) | Edge (E) | Vertex (V) | Map+Vertex (M+V)
(each cell: GP / LP / SR)

Scenario: MetroMap

Open-source Models
Qwen3-VL-8B-Instruct | Instruct | 55.00 / 17.50 / 73.12 | 22.50 / 100.00 / 7.50 | 57.50 / 51.88 / 86.88 | 0.63 / 22.50 / 38.75
Qwen3-VL-8B-Thinking | Thinking | 53.12 / 28.12 / 51.25 | 56.87 / 99.38 / 56.87 | 79.37 / 77.50 / 98.12 | 7.50 / 9.38 / 35.63
Qwen3-VL-2B-Instruct | Instruct | 8.13 / 5.00 / 63.12 | 3.75 / 87.50 / 3.12 | 26.25 / 11.25 / 64.38 | 0.00 / 8.13 / 26.87
Qwen2.5-VL-7B-Instruct | Instruct | 48.75 / 15.62 / 66.25 | 15.62 / 100.00 / 10.00 | 44.37 / 60.62 / 87.50 | 1.25 / 25.62 / 33.12
Phi-3.5-Vision-Instruct-4B | Instruct | 58.13 / 18.75 / 78.12 | 22.50 / 100.00 / 21.88 | 53.75 / 66.87 / 98.12 | 0.63 / 21.25 / 40.00
Phi-4-Multimodal-Instruct-6B | Instruct | 60.62 / 40.62 / 80.00 | 41.25 / 93.13 / 42.50 | 58.75 / 92.50 / 99.38 | 5.63 / 35.63 / 49.38
InternVL3-8B-Instruct | Instruct | 35.00 / 20.62 / 65.00 | 7.50 / 83.13 / 3.12 | 33.75 / 10.00 / 70.63 | 0.00 / 12.50 / 28.12
Qwen3-VL-30B-A3B-Instruct | Instruct | 20.62 / 16.25 / 60.62 | 0.63 / 68.75 / 1.25 | 0.00 / 1.25 / 68.75 | 0.00 / 11.25 / 18.12
Qwen3-VL-32B-Instruct | Instruct | 5.00 / 0.00 / 38.12 | 10.62 / 78.12 / 3.12 | 16.25 / 4.37 / 73.12 | 0.00 / 7.50 / 23.75
Qwen3-VL-32B-Thinking | Thinking | 26.87 / 19.38 / 50.00 | 5.00 / 99.38 / 5.00 | 21.25 / 25.00 / 85.62 | 0.00 / 6.88 / 60.62

Closed-source Models
GPT-4o | Instruct | 62.50 / 13.75 / 78.75 | 31.87 / 100.00 / 28.12 | 55.63 / 78.12 / 100.00 | 3.75 / 29.38 / 45.00
GPT-4.1 | Instruct | 61.88 / 26.87 / 76.25 | 50.62 / 99.38 / 38.12 | 64.38 / 83.75 / 100.00 | 3.12 / 25.62 / 49.38
Gemini-3-Flash-Preview | Instruct | 59.38 / 82.50 / 93.13 | 91.25 / 98.12 / 75.62 | 88.75 / 94.37 / 100.00 | 48.13 / 80.00 / 94.37
Doubao-Seed-1-6-251015-w/o_Thinking | No-Thinking | 55.63 / 20.62 / 76.25 | 41.88 / 100.00 / 59.38 | 58.13 / 86.25 / 99.38 | 3.75 / 49.38 / 50.62
Doubao-Seed-1-6-251015-Thinking | Thinking | 54.37 / 40.62 / 77.50 | 72.50 / 100.00 / 69.37 | 96.25 / 98.75 / 100.00 | 27.50 / 50.00 / 53.12
Qwen-VL-Plus-w/o_Thinking | No-Thinking | 60.00 / 21.88 / 78.75 | 40.62 / 100.00 / 40.00 | 64.38 / 77.50 / 97.50 | 1.25 / 25.62 / 40.62
Qwen-VL-Plus-Thinking | Thinking | 57.50 / 45.00 / 81.87 | 68.75 / 100.00 / 71.25 | 90.62 / 95.63 / 100.00 | 13.75 / 46.25 / 55.63

Scenario: TravelMap

Open-source Models
Qwen3-VL-8B-Instruct | Instruct | 7.14 / 60.12 / 52.98 | 17.86 / 99.40 / 45.24 | 38.69 / 50.60 / 61.31 | 75.60 / 70.24 / 14.29
Qwen3-VL-8B-Thinking | Thinking | 39.29 / 70.83 / 52.38 | 87.50 / 100.00 / 39.88 | 100.00 / 100.00 / 100.00 | 63.10 / 69.05 / 13.69
Qwen3-VL-2B-Instruct | Instruct | 12.50 / 58.93 / 9.52 | 6.00 / 94.64 / 64.88 | 1.19 / 46.43 / 38.10 | 64.29 / 67.86 / 4.17
Qwen2.5-VL-7B-Instruct | Instruct | 4.76 / 65.48 / 54.17 | 38.10 / 99.40 / 47.62 | 17.26 / 87.50 / 79.76 | 33.93 / 68.45 / 16.07
Phi-3.5-Vision-Instruct-4B | Instruct | 13.10 / 48.21 / 59.52 | 39.88 / 99.40 / 41.07 | 50.00 / 75.60 / 77.38 | 72.02 / 74.40 / 10.71
Phi-4-Multimodal-Instruct-6B | Instruct | 44.05 / 70.24 / 58.93 | 78.57 / 98.81 / 33.93 | 96.43 / 98.21 / 100.00 | 73.21 / 69.05 / 17.86
InternVL3-8B-Instruct | Instruct | 12.50 / 59.52 / 35.71 | 8.93 / 97.62 / 49.40 | 10.71 / 50.00 / 51.79 | 70.83 / 67.86 / 4.76
Qwen3-VL-30B-A3B-Instruct | Instruct | 5.95 / 50.60 / 6.55 | 7.74 / 86.31 / 52.38 | 8.93 / 44.64 / 45.24 | 54.76 / 35.12 / 2.98
Qwen3-VL-32B-Instruct | Instruct | 0.00 / 42.26 / 14.88 | 11.31 / 63.69 / 38.31 | 18.45 / 45.83 / 47.62 | 27.98 / 63.69 / 5.95
Qwen3-VL-32B-Thinking | Thinking | 8.33 / 60.12 / 23.21 | 1.19 / 97.02 / 44.64 | 10.12 / 38.69 / 56.55 | 69.05 / 66.67 / 5.95

Closed-source Models
GPT-4o | Instruct | 11.31 / 63.69 / 49.40 | 47.02 / 100.00 / 36.31 | 53.57 / 99.40 / 67.26 | 71.43 / 73.21 / 11.31
GPT-4.1 | Instruct | 3.57 / 69.64 / 55.95 | 47.02 / 100.00 / 42.86 | 66.07 / 100.00 / 69.64 | 73.21 / 77.38 / 17.86
Gemini-3-Flash-Preview | Instruct | 45.83 / 85.12 / 77.98 | 97.62 / 99.40 / 86.31 | 100.00 / 99.40 / 100.00 | 85.71 / 81.55 / 26.19
Doubao-Seed-1-6-251015-w/o_Thinking | No-Thinking | 22.62 / 58.93 / 50.00 | 63.10 / 98.81 / 51.79 | 54.76 / 100.00 / 98.81 | 66.07 / 76.79 / 25.60
Doubao-Seed-1-6-251015-Thinking | Thinking | 24.40 / 71.43 / 55.95 | 95.83 / 84.52 / 71.43 | 97.62 / 100.00 / 100.00 | 78.57 / 72.02 / 22.02
Qwen-VL-Plus-w/o_Thinking | No-Thinking | 19.64 / 72.02 / 52.98 | 48.81 / 100.00 / 35.71 | 57.74 / 98.81 / 82.74 | 54.17 / 70.24 / 19.64
Qwen-VL-Plus-Thinking | Thinking | 24.40 / 67.86 / 62.50 | 98.81 / 100.00 / 34.52 | 100.00 / 100.00 / 100.00 | 69.05 / 72.62 / 24.40

BibTeX

@article{shang2026maptab,
  title={MapTab: Can MLLMs Master Constrained Route Planning?},
  author={Shang, Ziqiao and Ge, Lingyue and Chen, Yang and Tian, Shi-Yu and Huang, Zhenyu and Fu, Wenbo and Li, Yu-Feng and Guo, Lan-Zhe},
  journal={arXiv preprint arXiv:2602.18600},
  year={2026}
}