Rafi Ibn Sultan - WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation

Computer Vision - Cohere Labs Open Science Community

Google Meet

Past Event

About Event

Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision–Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. Sultan et al. introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. They also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance.

Rafi Ibn Sultan is a Ph.D. candidate in Computer Science at Wayne State University and a researcher in the Trustworthy AI Lab. His work focuses on computer vision and vision–language models, studying how multimodal systems can better understand spatial relationships in visual scenes. He develops models that combine segmentation, depth, and language to enable more grounded reasoning about the physical world, with applications ranging from medical image analysis to pedestrian navigation and accessibility technologies. His broader research goal is to build vision–language systems that move beyond surface-level description toward deeper spatial understanding and practical real-world use.

Presented by

Computer Vision - Cohere Labs Open Science Community

Led by Mayank Bhaskar and Benedict Emoe-Kabu. Part of the Cohere Labs Open Science initiative: https://cohere.com/research/open-science

Hosted By