Designing, Deploying, and Animating Multimodal AI Agents
Presented by Tokyo AI (TAI)

About Event

Human dialogue is shaped by multimodal signals such as gaze, facial expressions, prosody, and timing, yet many conversational AI systems still rely on only a narrow subset of these modalities. This event features three talks on advancing multimodal and interactive dialogue agents: modeling subtle non-verbal cues, ensuring quality and reliability in high-stakes deployments such as InteLLA, and designing expressive digital characters that feel alive. Together, the talks connect foundational research with real-world system design for the next generation of conversational and interactive AI.

Who is this for?

This event is intended for researchers, engineers, and practitioners working in conversational AI, multimodal machine learning, human–computer interaction, and dialogue systems, as well as those involved in deploying AI in high-stakes or user-facing applications. It will be especially relevant to attendees interested in bridging foundational research with real-world system design, evaluation, and operations.


Agenda

18:00 Doors open

18:30 - 19:00 Beyond Words: Understanding subtle multimodal cues for AI agent interaction (Mao Saeki)

19:00 - 19:30 Towards Full-Duplex Dialogue Quality Assurance for High-Stakes Assessment Agents (Sadahiro Yoshikawa)

19:30 - 20:00 Toward Interactive Intelligence for Digital Characters (Bo Zheng)

20:00 - 21:00 Networking

21:00 Doors close

Speakers:

Talk 1 - Beyond Words: Understanding subtle multimodal cues for AI agent interaction

Speaker: Mao Saeki (Research Scientist, Equmenopolis)

Abstract: Natural human conversation is shaped by subtle non-verbal signals that are largely overlooked by today’s dialogue systems—gaze shifts, head movements, prosodic patterns, and facial expressions. In this talk, I present a body of research on leveraging such multimodal cues to enable AI agents to interact in more human-like and engaging ways. I will cover three complementary directions: predicting conversational turn-taking using visual signals such as gaze and head motion; detecting user confusion from multimodal behavioral patterns to drive adaptive conversational strategies; and eliciting active user participation through incremental confirmation of user understanding. Together, these techniques underpin InteLLA, a multimodal dialogue agent deployed at scale, and demonstrate how fine-grained multimodal cue understanding can transform passive system interactions into collaborative, natural conversations.

Bio: Mao Saeki is a founding member and Research Scientist at Equmenopolis Inc., where he leads the development of InteLLA, a multimodal virtual agent for language proficiency assessment. He is currently pursuing a Ph.D. at Waseda University. His research focuses on multimodal conversational AI, particularly the understanding and generation of non-verbal signals—including gaze, facial expressions, and prosody—to achieve natural human-agent interaction.

Talk 2 - Towards Full-Duplex Dialogue Quality Assurance for High-Stakes Assessment Agents

Speaker: Sadahiro Yoshikawa (R&D Lead, Equmenopolis)

Abstract: Equmenopolis is a Waseda University spinout startup that researches, develops, and operates InteLLA, a conversational AI agent for assessing English speaking proficiency, used by schools and other educational institutions. This talk frames the challenges unique to such multimodal agents through the lens of DevOps and MLOps and shares practical lessons learned. It also outlines key requirements for high-stakes assessment agents and introduces parts of the research frameworks we use to meet them.

Bio: Sadahiro Yoshikawa is a Research and Development Group Lead at Equmenopolis, where he leads DialOps (Dialogue System Operations). Previously, he worked as a freelance Data Engineer. His research focuses on the interaction quality of multimodal dialogue systems from the perspective of interlocutors. He is particularly interested in developing frameworks and statistical methods for measuring and ensuring reliable dialogue quality.

Talk 3 - Toward Interactive Intelligence for Digital Characters

Speaker: Bo Zheng

Abstract: Recent advances in multimodal foundation models are rapidly transforming how interactive characters are created and experienced. In this talk, I will share our work on building next-generation digital characters powered by what we call Interactive Intelligence: systems that integrate a thinker, a talker, a face animator, a body animator, and a renderer into a unified architecture. I will introduce our research platform for digital characters, including multimodal interaction, personalized text-to-speech, expressive motion, and diffusion-based rendering. Beyond system design, I will also explore a deeper and more difficult question: what does it mean for an artificial character to feel “alive” to humans? I will discuss the technical and conceptual challenges of giving interactive agents something resembling a “soul”, including personality coherence over time, emotional continuity, self-evolution, and long-term memory. These challenges sit at the intersection of AI architecture, cognitive modeling, and interactive storytelling, and may define the next frontier of digital character research.

Bio: Bo Zheng is Chief Scientist at Shanda AI Research Tokyo, where he leads research on Interactive & Spatial Intelligence for next-generation game AI and digital humans. His work spans multimodal AI, real-time character animation, conversational agents, AI-driven interactive experiences, and world models. Before joining Shanda, he held research scientist and associate professor positions at industrial and academic institutions, including the Institute of Industrial Science at the University of Tokyo and Huawei Digital Human Lab. He received a Ph.D. in Computer Vision and Graphics from the University of Tokyo and was also a visiting scholar at UCLA. His research interests include computer vision, graphics, digital humans, and human-centric interaction with AI.

Organizers

Ilya Kulyatin: Fintech and AI entrepreneur with work and academic experience in the US, the Netherlands, Singapore, the UK, and Japan, and an MSc in Machine Learning from UCL.

Supporters

Tokyo AI (TAI) is the biggest AI community in Japan, with 4,000+ members mainly based in Tokyo (engineers, researchers, investors, product managers, and corporate innovation managers).

Value Create is a management advisory and corporate value design firm offering services such as business consulting, education, corporate communications, and investment support to help companies and individuals unlock their full potential and drive sustainable growth.

Privacy Policy

We will process your email address for event-related communications and for ongoing newsletter communications. You may unsubscribe from the newsletter at any time. Further details on how we process personal data are available in our Privacy Policy.

Location

Bunkyo City, Tokyo