Back to Projects
AI

Daunt

Sensory layer for embodied agents in virtual worlds. Real-time runtime connecting worlds and engines to AI for vision, speech, and action.

Project Details

Daunt is an engine-agnostic runtime for embodied agents in virtual worlds. It acts as the bridge between simulated reality and agent intelligence, translating engine state into model context and model intent back into validated action. With support for low-latency streaming voice (under 300ms) and agent-perspective vision-language perception, Daunt enables NPCs to see, hear, and interact with the world with human-like presence and responsiveness.

dauntheader
The sensory layer for embodied agents in virtual worlds.

For more than thirty years, the most advanced way to interact with a character in a 3D world has been a menu.

Walk up. Press a button. Choose a line. Trade. Ask about rumors. Accept quest. Goodbye.

That is still the shape of the interaction. Even inside worlds with ray-traced lighting, physics simulation, open cities, dynamic weather, facial animation, crowd systems, and billion-dollar art pipelines, the moment you try to speak to the world, it usually collapses back into a list of options.

Not because designers lacked ambition. Because nothing else could reliably work.

Open-ended interaction required too much at once: language understanding, memory, speech, reasoning, world-state awareness, animation, valid actions, latency low enough to feel alive, and a runtime that could connect all of it to an engine without breaking the game. So we got menus. Menus became the compromise between the fantasy of a living world and the reality of brittle systems.

Now that compromise is starting to expire.

AI Dungeon showed that generative story could be played instead of merely read. Modern language models made open-ended characters plausible. ElevenLabs-style voice cloning and streaming speech made characters sound like characters instead of text-to-speech placeholders. Vision-language models made it possible to look at a scene and describe what matters. Local and hybrid models made it possible to move intelligence closer to the engine. And in Daunt’s current prototype, cloned and streamed voice can reach under 300ms time-to-first-audio: the threshold where a character starts to feel responsive instead of like a server loop waiting for a reply.

AI Dungeon ElevenLabs

Left: AI Dungeon, Right: ElevenLabs

The pieces are finally coalescing.

But most AI character systems still stop at the easiest layer. They make the NPC more talkative, not more embodied. Early 3D AI character products are still mostly chat wrappers: dialogue in, dialogue out, with only thin awareness of physics, state, space, other agents, or the player’s actual situation.

Daunt is built for the part they miss.

It is not a chatbot in a game.

It is the runtime that lets agents see, hear, speak, remember, move, and act from inside the world.

Daunt connects engine state, agent-perspective vision, real-time voice, spatial memory, and executable actions into one loop.

Inspiration

Role-playing games have always promised more than better dialogue.

The fantasy is consequence. A guard should react because he saw you draw a weapon. A companion should remember where you left them. A shopkeeper should overhear a rumor from another customer. A creature should follow sound. A crowd should shift because the people inside it perceive one another. A character should not wait for the correct quest flag before noticing that the room is on fire.

Games have spent decades simulating worlds that are far richer than the characters inside them can understand.

Promise

The promise of role-playing games: meaningful interaction.

The engine knows where the player is. It knows what collided. It knows which door opened, which object moved, which sound played, which path is blocked, which item is equipped, which animation is active, and which agents are nearby. But that knowledge usually stays trapped in the engine. The model outside the engine receives a prompt, a chat history, maybe a few manually injected variables, and is asked to pretend it lives there.

Daunt is the bridge between the simulated world and the agent mind.

Not the character. Not the model. The sensory and action layer between them.

Overview

Daunt is an engine-agnostic runtime for embodied agents in virtual worlds.

It translates world state into model context, and translates model intent back into validated world action. If a game or simulation exposes structured state, Daunt uses it directly. If the state layer is thin, inaccessible, or missing, Daunt can use a vision-language model from the agent’s own point of view. When both are available, Daunt fuses them: engine facts for precision, VLM perception for visual context, ambiguity, and scene detail.

That distinction matters.

An engine can tell an agent that object chair_014 exists at a coordinate. A VLM can see that the chair is overturned, visually blocking the doorway, and is probably the thing the player meant when they said “that mess by the exit.” Engine state gives the agent reliable structure. Agent-perspective vision gives it situated understanding.

Daunt combines both into a compact context stream: what this agent can see, hear, remember, reach, and do right now.

Daunt architecture diagram

World state becomes model context. Model intent becomes executable action.

The Old Interface Was a Menu

A menu is not just a UI pattern. It is a confession.

It says the system cannot handle arbitrary speech. It cannot handle open-ended intent. It cannot bind your words to the current world. It cannot safely let a character improvise an action. It cannot understand what you are pointing at unless that object was pre-authored into a branch.

So the game narrows the world into four choices.

fallout4tldrhd2

An interaction with a character from Fallout 4, which netted more than $750 million less than 24 hours after launch.

That narrowing has defined interactive characters for an entire medium. The player can explore a massive world, but conversation still becomes a vending machine: insert interaction key, receive authored options. Even when the world around the character is physically simulated, socially the character is frozen.

Daunt begins from a different assumption: the interface should not be the menu. The interface should be the world.

You should be able to speak, point, interrupt, walk away, return, overhear, ask another agent, move an object, change the room, and have those changes enter the interaction without everything being pre-scripted.

Not every response needs to be brilliant. But the loop needs to be alive.

Seeing From Inside the Agent

Most virtual agents do not need human-like vision. They need world-native perception.

a game engine already contains symbolic truth: object IDs, positions, tags, raycasts, navmeshes, collision layers, line-of-sight checks, animation states, faction data, ownership, inventory, and event history. Daunt turns that into agent-readable context.

But structured state is not always enough. Some worlds do not expose it cleanly. Some modding environments are partial. Some browser worlds are visually rich but semantically thin. Some scenes contain details that are obvious to a viewer but missing from metadata. This is where VLM perception becomes important.

fallout4tldrhd2

A peek into Daunt's agent view with an NPC from Call of Duty Black Ops III.

Daunt can render or capture the scene from the agent’s own perspective and pass it through a VLM to produce additional context. The agent is not just told abstractly that the player is nearby. It can be given an interpreted view of what it would plausibly notice: the player facing away, the object in the crosshair, the clutter on the table, the open doorway, the other agent standing close enough to hear.

The goal is not omniscience. Omniscience breaks believability.

The goal is situated perception: a bounded slice of the world from the agent’s position, attention, and available senses.

From NPCs to 3D Generative Agents

Daunt is not only about making one NPC talk faster.

The larger move is taking generative agents out of the 2D sandbox and putting them into real-time 3D space.

The Stanford Generative Agents work showed how language-model agents with memory, reflection, planning, and observation could produce believable social behavior. Agents remembered events, formed relationships, coordinated plans, spread information, and produced emergent social dynamics from simple initial conditions.

Daunt extends that idea into embodied worlds.

Agents are not just text processes exchanging messages. They have positions. They have voices. They have proximity. They can see one another. They can hear one another. They can miss things. They can overhear. They can interrupt. They can walk into a room at the wrong time and change the conversation because they are physically present.

A conversation in a 3D world is not just a transcript. It is a spatial event.

Someone stood near the door. Someone arrived late. Someone was facing away. Someone else was close enough to overhear. A rumor spread because two agents crossed paths. A plan changed because a third agent entered the room. A conflict escalated because the wrong person heard the right sentence.

That is the difference between dialogue generation and social simulation.

Daunt multi-agent spatial awareness

Agents can see, hear, speak to, overhear, remember, and react to one another inside the same 3D world.

This is where Daunt becomes more than an NPC runtime. It becomes infrastructure for inhabited simulations: environments where behavior emerges from perception, memory, voice, proximity, and action rather than from hand-authored dialogue branches.

Voice Is the First Wedge

Voice is where embodiment becomes obvious.

Try Daunt's custom NPC voices with <300ms time-to-first-audio, real-time voice interaction fast enough to feel playable.

A text response can be slow and still feel acceptable. A voice response cannot. The moment a character pauses too long, the illusion breaks. It stops feeling like conversation and starts feeling like a request waiting on a server.

That is why Daunt starts with the latency-critical voice runtime.

The first target is simple: custom NPC voices fast enough to feel playable, with under 300ms time-to-first-audio in current prototype conditions. Below that threshold, the interaction starts to feel interruptible, reactive, and socially present. Above it, even a good model feels like a call-and-response API.

But voice alone is not Daunt's goal.

A fast character that ignores the world is just a faster wrapper. Daunt’s voice loop is built to sit on top of world-state grounding: who is speaking, where they are, what the agent can see, who else is nearby, what just changed, whether the agent was interrupted, and which actions are valid right now.

The voice is the surface. The runtime is the body.

The Runtime Layer

Daunt sits in the hardest part of the stack: the last mile between AI models and world engines.

Python models do not naturally live inside C++ frame loops. Game engines do not want blocking HTTP calls. Audio systems care about milliseconds. Modding APIs are brittle. Browser worlds expose state differently than commercial engines.

That mess is the point.

Across modded games, commercial engines, browser-native worlds, and custom simulations, the same problem keeps appearing: translate a changing world into agent context, then translate agent intent back into valid action.

d_runtime

The Daunt runtime manages the last mile between AI models and world engines.

The runtime handles the boring problems that decide whether the agent feels alive: sampling state without flooding the model, using VLMs when state is missing, keeping audio responsive, representing nearby agents, validating actions, cancelling stale responses, preserving memory, and preventing the model from inventing things the engine cannot execute.

Users do not experience architecture diagrams. They experience timing, attention, interruption, and reaction. If a character answers instantly but ignores the sword in your hand, it feels fake. If it sees the sword but waits three seconds to speak, it feels dead.

The magic depends on the machinery.

Action and Affordance

An embodied agent should not hallucinate actions into existence.

If a model says “follow the player,” the runtime has to know whether following is possible. Is there a navmesh? Is the player reachable? Is the door closed? Is the agent allowed to move? Is there an animation? Is the target visible? Should the agent speak first, turn first, or stop because another agent is blocking the way?

d_constrain

Action catalogs ensure model intent is grounded in valid world affordances.

Daunt uses action catalogs to constrain behavior to what the world can actually support. These catalogs expose valid actions to the model and validate the result before execution. The model can improvise inside the boundaries of the world instead of pretending the world will obey anything it says.

This is what turns language into behavior.

Not “the agent says it will follow you.”

The agent follows you.

Blind Spots

Daunt does not magically solve embodiment.

Some worlds expose rich state. Others expose almost none. VLMs can infer missing visual context, but they can also misread a scene. Engine state can be precise, but semantically thin. Memory can preserve continuity, but it can also retrieve the wrong event.

d_goldilocks

The hard problem: deciding what an agent should know and when it should stay silent.

More context is not always better. Too little context and the agent becomes a chatbot. Too much context and the agent becomes slow, noisy, or unnervingly omniscient. The hard problem is deciding what this agent should know, what it could plausibly perceive, what it should remember, what it is allowed to do, and when it should stay silent.

Embodiment is not just adding sensors.

It is giving perception boundaries.

The Road Ahead

Daunt starts with voice because voice is the first place players feel latency. It expands into world-state grounding because speech without perception is hollow. It adds VLM scene understanding because many worlds do not expose the state agents need. It adds multi-agent spatial awareness because believable worlds are not made of isolated NPCs waiting for the player to activate them.

The near-term path is practical: make speech fast, make perception situated, make actions valid, make agents aware of each other, and make the runtime portable enough to work across real engines instead of only in a clean demo.

The long-term path is much bigger.

Examples of Daunt integrated within various games showcasing realtime voice, perception, and action.

World models are beginning to generate environments. Game engines already simulate them. Robots are trained in them. People socialize in them. Researchers use them as laboratories for behavior. Designers use them as stages for narrative. The missing layer is not more geometry. It is inhabitation.

A future virtual world should not be empty space waiting for a user. It should be full of agents with senses, memory, voice, social awareness, and the ability to act. Not omniscient gods. Not scripted mannequins. Agents that notice, misunderstand, recover, coordinate, teach, follow, argue, help, organize, and change because they are embedded in the same world you are.

That is the shift Daunt is built for.

From menus to speech.
From speech to perception.
From perception to action.
From isolated characters to inhabited worlds.

Daunt is the sensory layer for that future.

pressxtosayhello

Daunt was presented live to an audience of 600+ people at Cornell Tech on May 14, 2026. Friends joined me on stage to help deliver the pitch and perform the demo.

Daunt presentation
Daunt detail 1
Daunt detail 2
Daunt detail 3
Daunt detail 4


Technologies Used:

AI
Robotics
VLM
LLM
Real-Time
Multi-Agent
Game Development