From Brain Dump to Done: How Todoist’s Ramble Captures Tasks in Real Time with AI

Turning a rambling stream of consciousness into a clean task list while someone is still talking has been a longtime product dream of mine. With Ramble, Todoist brought that dream to life by using live audio AI to capture tasks in real time—no transcription step required. The result is a voice-to-task flow that feels natural, fast, and surprisingly disciplined.

As I listened to the Doist team—Ernesto Garcia (Front-end Product Engineer), Thomas Jost (Backend Software Engineer), and Hugo Fauquenoi (Product Manager)—walk through their approach, I heard a blueprint for building pragmatic GenAI features. What began as a two-to-three month AI exploration became one of their most technically deliberate releases: a “Gemini-powered pipeline that makes tool calls while the user is still speaking, surfacing tasks on screen in real time without any text output from the model.”

The breakthrough started with user research. People weren’t merely dictating tasks; they were doing a “brain dump” first—often into pen and paper or even ChatGPT voice—and only then committing items to Todoist. Meeting users where they already are reframed the problem: don’t force structure upfront; capture fluid thought and translate it into actionable tasks instantly.

That insight led to a bold architectural choice: skip transcription entirely and process raw audio directly with a Gemini live audio model. By removing the brittle middleman of text, the team reduced latency and kept the model focused on one job—turning intent into structured actions. It’s a crisp example of AI workflows designed for reliability over novelty.

The real magic is in the real-time “tool calls.” As the user speaks, the model triggers add task, edit task, and delete task operations immediately. For high-friction contexts like driving, they paired visual task cards with subtle sound effects as confirmation cues. It’s thoughtful conversation design that respects attention and safety without sacrificing speed.
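To make the tool-call idea concrete, here is a minimal sketch of how a client might apply add, edit, and delete calls to an on-screen task list as they arrive. It is not Doist's implementation: the `TaskBoard` class, the `apply_tool_call` function, and the shape of each call (a dict with `name` and `args`) are assumptions made for illustration, and the model itself is left out entirely.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class TaskBoard:
    """Hypothetical in-memory stand-in for the task cards shown on screen."""
    tasks: dict = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def add_task(self, content, due=None):
        task_id = next(self._ids)
        self.tasks[task_id] = {"content": content, "due": due}
        return task_id

    def edit_task(self, task_id, **changes):
        self.tasks[task_id].update(changes)
        return task_id

    def delete_task(self, task_id):
        del self.tasks[task_id]
        return task_id


def apply_tool_call(board, call):
    """Route one tool call (assumed shape: {'name': ..., 'args': {...}})."""
    handlers = {
        "add_task": board.add_task,
        "edit_task": board.edit_task,
        "delete_task": board.delete_task,
    }
    handler = handlers.get(call["name"])
    if handler is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return handler(**call["args"])


# Simulated stream of calls, as if emitted while the user is still talking.
board = TaskBoard()
calls = [
    {"name": "add_task", "args": {"content": "Buy milk"}},
    {"name": "add_task", "args": {"content": "Call the dentist", "due": "tomorrow"}},
    {"name": "edit_task", "args": {"task_id": 1, "content": "Buy oat milk"}},
    {"name": "delete_task", "args": {"task_id": 2}},
]
for call in calls:
    apply_tool_call(board, call)

print(board.tasks)  # {1: {'content': 'Buy oat milk', 'due': None}}
```

Because each call is applied the moment it arrives, the user sees the list change mid-sentence, and a later correction is just another edit or delete call.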

Teaching the model to capture tasks literally, without over-interpreting them or trying to complete the work itself, required careful prompt engineering for voice input as well as temperature tuning. Drawing a bright line between “capture” and “do” kept the experience trustworthy. In my own AI Strategy work, I’ve found that establishing explicit agentic guardrails early prevents unintended autonomy later.

Dates were the sleeper challenge. The team had to inject the current date, normalize to days vs. months, and always output dates in English for the natural language parser—while preserving the user’s original language for everything else. If you’ve ever shipped date handling across locales, you’ll appreciate how many edge cases hide in “Taming Dates and Time.”
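As a rough illustration of the date handling described above, the sketch below builds a prompt fragment that states the current date and asks for English due dates while leaving task titles in the speaker's language. The function name `build_date_instructions` and the exact wording of the instructions are assumptions, not the team's actual prompt.

```python
from datetime import date

# English weekday names, hard-coded so the output does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def build_date_instructions(today: date) -> str:
    """Return prompt text that anchors relative dates and fixes the date language."""
    weekday = DAY_NAMES[today.weekday()]
    return (
        f"Today is {weekday}, {today.isoformat()}.\n"
        "Resolve relative dates such as 'tomorrow' against this date.\n"
        "Write every due date in English, even when the user speaks another language.\n"
        "Keep task titles in the language the user spoke."
    )


print(build_date_instructions(date(2025, 1, 6)))
```

Stating the date explicitly matters because the model has no clock of its own, so a phrase like "next Friday" cannot be resolved without it.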

Quality didn’t hinge on intuition alone. They built an LLM-judge eval system using real employee recordings from 100+ people across 35 countries in 20+ languages to catch prompt regressions. That’s eval-driven development done right: representative data, repeatable scoring, and tight feedback loops as models and prompts evolve.
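The shape of such a regression check can be sketched in a few lines. Everything here is hypothetical: the `EvalCase` fields, the `pass_rate` and `has_regressed` helpers, and the 2% tolerance are illustrative choices, and the judge is a plain exact-match placeholder where a real system would ask an LLM to grade each case.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    recording_id: str          # identifier of one recorded voice sample
    expected_tasks: List[str]  # tasks a human labelled for that recording
    actual_tasks: List[str]    # tasks the pipeline produced under the prompt being tested


def pass_rate(cases: List[EvalCase], judge: Callable[[EvalCase], bool]) -> float:
    """Fraction of cases the judge accepts; 0.0 for an empty suite."""
    if not cases:
        return 0.0
    return sum(1 for case in cases if judge(case)) / len(cases)


def has_regressed(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Flag a prompt change whose pass rate drops more than `tolerance` below baseline."""
    return candidate < baseline - tolerance


def exact_match_judge(case: EvalCase) -> bool:
    """Placeholder judge; a real one would call an LLM with a grading rubric."""
    return case.expected_tasks == case.actual_tasks


cases = [
    EvalCase("rec-1", ["Buy milk"], ["Buy milk"]),
    EvalCase("rec-2", ["Call mom"], ["Call mom"]),
    EvalCase("rec-3", ["Book flight"], ["Book a hotel"]),
]
rate = pass_rate(cases, exact_match_judge)
print(has_regressed(baseline=0.9, candidate=rate))  # True
```

Keeping the judge behind a plain function signature means the same suite can be rerun unchanged whenever the prompt or the model version changes.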

For project and label matching, they chose direct context injection over RAG. Instead of building a retrieval pipeline, they injected the full project/label list into the system prompt. With smart context window management and a sharply constrained task schema, this was both simpler and more accurate. Sometimes the fastest path to product-market fit is removing moving parts, not adding them.
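A minimal sketch of direct context injection, under stated assumptions, might look like the following. The function name `build_system_prompt`, the prompt wording, and the character budget standing in for context window management are all hypothetical; a real system would more likely count tokens than characters.

```python
from typing import List


def build_system_prompt(projects: List[str], labels: List[str], max_chars: int = 20000) -> str:
    """Inline every project and label name so the model can match against them directly."""
    lines = [
        "Assign each task to one of these projects and labels, using the exact names.",
        "Projects:",
    ]
    lines += [f"- {name}" for name in projects]
    lines.append("Labels:")
    lines += [f"- {name}" for name in labels]
    prompt = "\n".join(lines)
    # Crude size guard: fail loudly rather than silently truncating the list.
    if len(prompt) > max_chars:
        raise ValueError(f"prompt is {len(prompt)} characters, over the {max_chars} budget")
    return prompt


print(build_system_prompt(["Work", "Home"], ["urgent", "errand"]))
```

With no retrieval step there is no index to keep in sync and no chance of the right project failing to be retrieved; the cost is a longer prompt on every request.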

One product principle stood out: easy correction beats perfect first-time accuracy. Natural language interfaces earn trust when users can fix misfires in a tap or two. That bias toward quick recovery over false precision is how you ship AI that feels useful from day one.

Looking ahead, the roadmap is compelling: multimodal task capture from images and text blobs, Apple Watch support, and automation integrations. As voice AI agent patterns mature, this “tool-only architecture” sets a solid foundation for going from capture to coordinated execution—without losing the simplicity that makes Ramble shine.

If you want to hear the full conversation, you can listen on Spotify or Apple Podcasts. It’s a masterclass in building focused GenAI features that trade cleverness for clarity—and still delight.

Resources & Links: Todoist • Doist • Google Vertex AI (Gemini)

Inspired by this post on Product Talk.

What is Ramble and how does it work in real time?

Ramble captures tasks in real time by turning voice notes directly into structured tasks, skipping transcription steps. It uses a Gemini live audio model to process raw audio and trigger task actions as you speak.

What happens with tool calls during voice input?

As you speak, the model triggers add, edit, and delete task operations immediately. Tasks appear on screen as visual cards, and in high-friction contexts like driving, subtle sound effects confirm each action.

How are dates handled across locales?

The current date is injected into the prompt, and dates are normalized to days versus months. Dates are always output in English for the natural language parser, while the user’s original language is preserved for everything else.

What is the LLM-judge eval system?

Todoist built an LLM-judge eval system using recordings from 100+ people across 35 countries in 20+ languages to catch prompt regressions. This demonstrates eval-driven development with representative data, repeatable scoring, and fast feedback loops as prompts evolve.

Why direct context injection over RAG?

Instead of building a retrieval pipeline, they loaded the full project/label list into the system prompt and managed the context window carefully. Combined with a sharply constrained task schema, this kept the solution simpler and more accurate.

What does the roadmap look like for Ramble?

The roadmap includes multimodal task capture from images and text blobs, Apple Watch support, and automation integrations. These milestones aim to extend Ramble from capture to coordinated execution while maintaining simplicity.
