


TAI AAI #13 - Embodied AI: From Seeing to Imagining to Doing
How Modern Robots Connect Perception to Action: Seeing, Imagining, and Doing
From self-driving cars that explain their choices to robots that plan and act in the physical world, the frontier of embodied AI is where perception meets purposeful action. This event explores how modern robots and intelligent agents bridge the gap between understanding the world and acting within it—linking vision, language, and behavior into a unified system of intelligence.
We’ll follow a simple but powerful arc:
🔹 Seeing (VLMs): language as an interface between humans and embodied AI (Roland Meertens).
🔹 Imagining (World Models): a world model as a predictive “world representation or embedding” of the physical world (Alisher Abdulkhaev).
🔹 Doing (VLAs): mapping vision-language inputs into actionable skills and policies (Motonari Kambara).
Want to gain insight into embodied AI, from conceptual introductions all the way to technical discussions? Join us!
Agenda
18:00 - 18:30 Doors Open
18:30 - 18:40 Introduction
18:40 - 19:10 Talk 1
19:10 - 19:40 Talk 2
19:40 - 20:10 Talk 3
20:10 - 21:00 Networking
Speakers
Talk 1: Seeing (VLMs): language as an interface between humans and embodied AI
Speaker: Roland Meertens (ML Engineer, Wayve)
Abstract: Understanding what your car wants to do. It's one thing to build a vehicle that drives itself autonomously through the streets of Tokyo; it's another to understand why it drives the way it does. You will learn what end-to-end self-driving cars are and how to make such a car explain the decisions it makes. Last but not least, we will see whether we can use language to probe what the car would do in hypothetical scenarios.
Bio: Roland works as a machine learning engineer for Wayve in London. This year, he helped set up Wayve's first operations in Japan and bring up the Wayve driver on the Nissan Ariya. He is also good at baking pizza.
Talk 2: Imagining (World Models): a world model as a predictive “world representation or embedding” of the physical world
Speaker: Alisher Abdulkhaev (Co-founder, Kanaria Tech)
Abstract: A world model is a predictive “world representation or embedding” of the physical world that lets AI models comprehend the current world state and imagine future states. In his talk, Alisher will cover the essential concepts in world modelling, including how world models handle uncertainty and plan ahead rather than reacting moment to moment.
Bio: Alisher Abdulkhaev is the Co-Founder and CTO of Kanaria Tech, where he develops the Kanaria Robotic Model (KRM), a world model-driven foundation model for social navigation in autonomous mobile robots. His work focuses on bridging embodied intelligence, world modeling, and goal-directed reasoning to enable robots to navigate and interact naturally in complex real-world environments. Alisher frequently shares insights on robotics, AI systems, and startup building through his writings on Medium and thoughts on X.
Talk 3: Doing (VLAs): mapping vision-language inputs into actionable skills and policies
Speaker: Motonari Kambara (JSPS Research Fellow, Keio University)
Abstract: This talk introduces the current capabilities and future directions of Vision-Language-Action (VLA) models that integrate perception, reasoning, and control for embodied intelligence. I will discuss how vision, language, and actions serve as complementary features enabling grounded understanding and purposeful behavior. The talk also highlights explainability—how VLAs enhance transparency and interpretability by aligning visual and linguistic representations of a robot’s reasoning, bridging the gap between autonomous control and human understanding.
Bio: Motonari Kambara is a JSPS Research Fellow at Keio University. He received his B.E., M.S., and Ph.D. in Engineering from Keio University in 2021, 2023, and 2025, respectively. From 2023 to 2025, he was also a JSPS research fellow (DC1). His research interests include vision and language, as well as robot learning.
Tokyo AI (TAI) information
TAI is the biggest AI community in Japan, with 2,700+ members mainly based in Tokyo (engineers, researchers, investors, product managers, and corporate innovation managers). Web: https://www.tokyoai.jp/
Event Supporters
DEEPCORE is a VC firm supporting AI Salon Tokyo. It operates a fund for seed- and early-stage startups and runs KERNEL, a community supporting early-stage entrepreneurs.
Hosts
Alisher Abdulkhaev: Alisher Abdulkhaev is the Co-Founder and CTO of Kanaria Tech, where he develops the Kanaria Robotic Model (KRM), a world model-driven foundation model for social navigation in autonomous mobile robots.
Ilya Kulyatin: Fintech and AI entrepreneur with work and academic experience in the US, Netherlands, Singapore, UK, and Japan, with an MSc in Machine Learning from UCL.