Mozi — a voice companion that teaches children English through play

A child-focused voice AI: kids talk naturally, practice English, hear stories, invent adventures, and play word games — with a warm, patient personality instead of a cold gadget.

9 months · Concept → App Store

Submitted · Distribution

3 · Cooperating services

Inside the product

Four screens from the shipped prototype.

01 · Companion launch: a warm, scientist-style character anchors the brand.

02 · Home: greeting, then tap-to-talk. The lowest-friction entry into a conversation we could design.

03 · Family settings: child profile, age band, and the language milestones to aim for.

04 · Language: a secondary language the AI understands while still speaking English back.

The challenge

A partner in China imagined English-language learning living inside toys — children talking naturally to a companion that reads with them, makes up silly tales, and turns practice into a game. They needed someone to take that vision from a whiteboard sketch to a shippable product that could pass App Store review.

Our approach

We separated the experience into three cooperating services so each could evolve independently: a Flutter client for the family-facing surface; a FastAPI + Firestore control plane for accounts, conversations, and prompt configuration; and a LiveKit Agents worker running streaming speech-to-text, Vertex-backed LLM reasoning, and text-to-speech. The worker shares the same profile and conversation state as the API, so behavior stays consistent across text and voice.
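The shared-state idea can be illustrated with a minimal sketch. Everything named here (`ChildProfile`, `Message`, `InMemoryStore`, `api_post_text`, `agent_load_context`) is hypothetical, and the in-memory store only stands in for Firestore so the example runs without cloud access. It is not the project's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChildProfile:
    # Hypothetical fields, loosely based on the family settings screen.
    user_id: str
    name: str
    age_band: str
    secondary_language: Optional[str] = None


@dataclass
class Message:
    role: str  # "child" or "companion"
    text: str


class InMemoryStore:
    """Stand-in for the Firestore system of record (assumption)."""

    def __init__(self) -> None:
        self._profiles: dict[str, ChildProfile] = {}
        self._history: dict[str, list[Message]] = {}

    def put_profile(self, profile: ChildProfile) -> None:
        self._profiles[profile.user_id] = profile

    def get_profile(self, user_id: str) -> ChildProfile:
        return self._profiles[user_id]

    def append_message(self, user_id: str, message: Message) -> None:
        self._history.setdefault(user_id, []).append(message)

    def get_history(self, user_id: str, limit: int = 20) -> list[Message]:
        # Return only the most recent `limit` messages.
        return self._history.get(user_id, [])[-limit:]


def api_post_text(store: InMemoryStore, user_id: str, text: str) -> None:
    """What a text-chat API route might do: persist the child's message."""
    store.append_message(user_id, Message(role="child", text=text))


def agent_load_context(store: InMemoryStore, user_id: str, limit: int = 20):
    """What the voice worker might do on joining a room: load shared state."""
    return store.get_profile(user_id), store.get_history(user_id, limit)


store = InMemoryStore()
store.put_profile(ChildProfile("u1", "Mia", "5-7", secondary_language="zh"))
api_post_text(store, "u1", "Tell me a story about a dragon")
profile, history = agent_load_context(store, "u1")
print(profile.name, [m.text for m in history])
```

Because both code paths read and write the same store, a message typed in the app is visible to the voice worker the next time it loads context, which is the consistency property described above.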

The outcome

After nine months of design, prototyping, and hardening, Mozi was submitted to Apple's App Store — a working voice companion that families can hold in their hands. The voice layer is hardware-portable: when the partner is ready to put it inside toys, the same agent attaches to a LiveKit client or a thin bridge without a rewrite.

Product

For families

English practice that feels like chatting, not drilling.
Books, made-up adventures, and "what happens next?" moments.
Simple, repeatable word games that build confidence.
A companion built around how children actually play and speak.

Platform

For partners & builders

A mobile app, a secure cloud control plane, and real-time voice tech.
The same family of tools as modern voice assistants — adapted for kids.
Voice layer is portable: attaches to hardware or OEM toys when ready.
No throwaway work: the prototype is the path to production.

Three cooperating services, not one monolith.

Each part owns a piece of the experience and can evolve independently — a pattern that scales toward toys and OEM hardware without rewriting the product.

01 · Client

Mobile (Flutter)

Flutter (Dart 3.x) with responsive layout, GetX + Provider for state, and get_it for DI. Firebase Auth with Google Sign-In and Sign in with Apple. HTTP to the hosted API; web_socket_channel for the experimental real-time audio flow.

One codebase, iOS-ready
Firebase Auth · Google / Apple sign-in
Audio via record, flutter_sound, audioplayers

02 · Control plane

Backend (Python · FastAPI)

Python 3.12 + FastAPI on Cloud Run. Firestore via firebase-admin as the system of record. Vertex AI (Gemini) through LangChain with a provider abstraction. Short-lived JWTs minted server-side for LiveKit rooms; bearer-authenticated WebSocket for streaming audio.

Modular CRUD: users, chats, prompts, config
Per-user LiveKit credentials, never client-baked
Firebase ID-token verification on protected routes
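A rough sketch of what minting a short-lived room token server-side can look like, using only the Python standard library. The claim layout (`iss` as the API key, `sub` as the participant identity, a `video` grant with `room` and `roomJoin`) reflects our understanding of LiveKit's access-token format and should be treated as an assumption. A production service would more likely call the LiveKit server SDK than sign tokens by hand. The function names and the 300-second lifetime are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_room_token(api_key: str, api_secret: str, identity: str, room: str,
                    ttl_seconds: int = 300, now: Optional[int] = None) -> str:
    """Sign an HS256 JWT that lets one identity join one room for a short time."""
    issued_at = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,                   # assumed: API key as issuer
        "sub": identity,                  # assumed: participant identity
        "nbf": issued_at,
        "exp": issued_at + ttl_seconds,   # short-lived by construction
        "video": {"room": room, "roomJoin": True},
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(api_secret.encode("utf-8"),
                         signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def decode_claims(token: str, api_secret: str) -> dict:
    """Verify the signature and return the claims (expiry is not checked here)."""
    header_b64, claims_b64, sig_b64 = token.split(".")
    expected = _b64url(hmac.new(api_secret.encode("utf-8"),
                                f"{header_b64}.{claims_b64}".encode("ascii"),
                                hashlib.sha256).digest())
    if not hmac.compare_digest(expected, sig_b64):
        raise ValueError("bad signature")
    padded = claims_b64 + "=" * (-len(claims_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


token = mint_room_token("demo-key", "demo-secret", "child-123", "room-abc")
print(decode_claims(token, "demo-secret")["video"])
```

Keeping the secret on the server and handing the client only a token scoped to one identity and one room is what makes "never client-baked" possible: a leaked token expires within minutes and opens nothing else.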

03 · Voice agent

Real-time (LiveKit Agents)

A Python worker that joins LiveKit rooms. STT (Deepgram), VAD (Silero), TTS (ElevenLabs), noise cancellation. LangChain agents with tools and LangGraph-style routing (general assistant vs. onboarding subgraphs). Loads the user profile and prior messages so behavior stays in sync with the API.

Low-latency turn-taking
Onboarding vs. ongoing room routing
Shared Firestore profile + history with the API
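The onboarding-versus-ongoing split can be pictured as a small routing function. This is a plain-Python sketch under stated assumptions, not the LangGraph API and not the project's actual routing logic: the `Profile` fields, the route names, and the handler functions are all hypothetical. In LangGraph, a function like `route` would typically serve as the condition on a conditional edge.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Profile:
    # Hypothetical profile fields; the real schema is not shown in the source.
    child_name: Optional[str] = None
    age_band: Optional[str] = None
    onboarding_complete: bool = False


def needs_onboarding(profile: Optional[Profile]) -> bool:
    """A profile that is missing, unfinished, or incomplete goes to onboarding."""
    if profile is None or not profile.onboarding_complete:
        return True
    return not (profile.child_name and profile.age_band)


def route(profile: Optional[Profile]) -> str:
    """Pick which subgraph should handle this room."""
    return "onboarding" if needs_onboarding(profile) else "general_assistant"


def handle_onboarding(utterance: str) -> str:
    # Placeholder for the onboarding subgraph.
    return "onboarding:" + utterance


def handle_general(utterance: str) -> str:
    # Placeholder for the general assistant subgraph.
    return "general:" + utterance


HANDLERS: dict[str, Callable[[str], str]] = {
    "onboarding": handle_onboarding,
    "general_assistant": handle_general,
}


def dispatch(profile: Optional[Profile], utterance: str) -> str:
    """Route first, then hand the utterance to the chosen handler."""
    return HANDLERS[route(profile)](utterance)


print(route(None))
print(route(Profile("Mia", "5-7", onboarding_complete=True)))
```

Making the routing decision a pure function of the loaded profile keeps it easy to test and means the voice worker and the API can agree on which flow a child is in.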

Engineering summary

We delivered an end-to-end vertical slice: a Flutter client on the App Store path, a FastAPI + Firestore control plane for identity and conversational state, and a LiveKit-based voice agent combining streaming STT/TTS with LangChain / LangGraph orchestration and Vertex-backed LLMs. The design separates mobile UX from low-latency voice so each can evolve independently — a pattern that scales toward toy OEM integration without rewriting the core product.