


TAI AAI #13 - Embodied AI: From Seeing to Imagining to Doing
How Modern Robots Connect Perception to Action: Seeing, Imagining, and Doing
From self-driving cars that explain their choices to robots that plan and act in the physical world, the frontier of embodied AI is where perception meets purposeful action. This event explores how modern robots and intelligent agents bridge the gap between understanding the world and acting within it—linking vision, language, and behavior into a unified system of intelligence.
We’ll follow a simple but powerful arc:
🔹 Seeing (VLMs): language as an interface between humans and embodied AI (Roland Meertens).
🔹 Imagining (World Models): a world model as a predictive “world representation or embedding” of the physical world (Alisher Abdulkhaev).
🔹 Doing (VLAs): mapping vision-language inputs into actionable skills and policies (Motonari Kambara).
Want to gain insight into embodied AI, from conceptual introductions all the way to technical discussions? Join us!
Agenda
18:00 - 18:30 Doors Open
18:30 - 18:40 Introduction
18:40 - 19:10 Talk 1
19:10 - 19:40 Talk 2
19:40 - 20:10 Talk 3
20:10 - 21:00 Networking
Speakers
Talk 1: Seeing (VLMs): language as an interface between humans and embodied AI
Speaker: Roland Meertens (ML Engineer, Wayve)
Abstract: Understanding what your car wants to do. It's one thing to build a vehicle that drives itself autonomously through the streets of Tokyo; it's another to understand why it drives the way it does. You will learn what end-to-end self-driving cars are and how to make such a car explain the decisions it makes. Last but not least, we will see whether we can use language to probe what the car would do in hypothetical scenarios.
Bio: Roland works as a machine learning engineer for Wayve in London. This year, he helped set up Wayve's first operations in Japan and bring up the Wayve driver on the Nissan Ariya. He is also good at baking pizza.
Talk 2: Imagining (World Models): a world model as a predictive “world representation or embedding” of the physical world
Speaker: Alisher Abdulkhaev (Co-founder, Kanaria Tech)
Abstract: A world model is a predictive “world representation or embedding” of the physical world that lets AI models comprehend the current world state and imagine future states. In his talk, Alisher will cover the essential concepts in world modelling, including how world models handle uncertainty and plan ahead rather than reacting moment to moment.
Bio: Alisher Abdulkhaev is the Co-Founder and CTO of Kanaria Tech, where he develops the Kanaria Robotic Model (KRM), a world model-driven foundation model for social navigation in autonomous mobile robots. His work focuses on bridging embodied intelligence, world modeling, and goal-directed reasoning to enable robots to navigate and interact naturally in complex real-world environments. Alisher frequently shares insights on robotics, AI systems, and startup building through his writings on Medium and thoughts on X.
Talk 3: Doing (VLAs): mapping vision-language inputs into actionable skills and policies
Speaker: Motonari Kambara (JSPS Research Fellow, Keio University)
Abstract: This talk introduces the current capabilities and future directions of Vision-Language-Action (VLA) models that integrate perception, reasoning, and control for embodied intelligence. I will discuss how vision, language, and actions serve as complementary features enabling grounded understanding and purposeful behavior. The talk also highlights explainability—how VLAs enhance transparency and interpretability by aligning visual and linguistic representations of a robot’s reasoning, bridging the gap between autonomous control and human understanding.
Bio: Motonari Kambara is a JSPS Research Fellow at Keio University. He received his B.E., M.S., and Ph.D. in Engineering from Keio University in 2021, 2023, and 2025, respectively. From 2023 to 2025, he was also a JSPS research fellow (DC1). His research interests include vision and language, as well as robot learning.
Tokyo AI (TAI) information
TAI is the biggest AI community in Japan, with 2,700+ members mainly based in Tokyo (engineers, researchers, investors, product managers, and corporate innovation managers). Web: https://www.tokyoai.jp/
Event Supporters
DEEPCORE is a VC firm supporting AI Salon Tokyo. It operates a fund for seed- and early-stage startups and runs KERNEL, a community supporting early-stage entrepreneurs.
Hosts
Alisher Abdulkhaev: Alisher Abdulkhaev is the Co-Founder and CTO of Kanaria Tech, where he develops the Kanaria Robotic Model (KRM), a world model-driven foundation model for social navigation in autonomous mobile robots.
Ilya Kulyatin: Fintech and AI entrepreneur with work and academic experience in the US, Netherlands, Singapore, UK, and Japan, with an MSc in Machine Learning from UCL.