When Typing Became the Bottleneck: How to Build a Voice-First AI System That Thinks With You
GenAI 30 Project Challenge - 10. Integrating voice input into every part of your workflow
Do you use AI voice input when working with your projects, or are you still typing everything?
I used to be firmly in the typing camp, until I realized I was sabotaging my own productivity.
Here's what happened: I picked up a specific way of interacting with AI that actually works. I don't just ask AI to "fix this problem"; I give full context: explain what I tried before, share relevant examples, describe what went wrong, and lay out the bigger picture.
This approach gets great results. The more context I provide, the better the AI response. But there's a cruel irony: by the time I finish typing all that context, half my original insights have evaporated. I'd start with a clear vision, then spend five minutes typing it out, only to lose the thread of my thinking.
ChatGPT's voice input seemed like the obvious solution. Until I discovered its special torture:
Speak for two minutes, watch it process, then watch it fail and lose everything.
Back to typing.
I tried other voice-to-text tools, but they all required copy-pasting between apps. The constant context-switching was just as disruptive as typing.
I'd been optimizing every other part of my workflow, but voice input remained this glaring gap. So I decided to fix it properly, not with another standalone app, but with something that actually integrates into how I work.
1. Built-in Dictation Failed, And What I Wanted Instead
If you've used Apple’s built-in dictation, you know it can work really well sometimes. I actually have dictation shortcuts set up on both my phone and laptop. I just tap the Shortcuts icon, and it transcribes my voice and places the result in my clipboard.
But it doesn't work very well for people like me. As a non-native speaker, my pronunciation isn't great, and the built-in dictation just doesn't recognize technical terms or professional vocabulary correctly.
In this example, I wanted to say, “No, I already have a virtual environment existing in my repo, now find that venv and activate it.” But the built-in dictation completely missed “repo” and “activate it.”
This became a real problem because my pain point was pretty urgent. So I set three specific goals:
Real-time voice-to-text flow that actually works for my accent and vocabulary.
Local hosting using Ollama and Whisper, so I don't have to deal with API limits or connectivity issues. Plus, with local processing, I could easily chain Whisper's raw output through other local models to transform the text into any format I wanted.
True workflow integration. Not just another voice-to-text app where I have to copy, paste, and juggle between different tabs. I was really bothered by that discontinuous feeling of switching contexts all the time.
Why some voice systems work better than others
As I researched solutions, I learned that voice-to-text is basically AI listening to audio patterns and predicting what words match those sounds. Modern systems like Whisper use neural networks trained on massive datasets of human speech, with different accents, languages, background noise, technical jargon.
This explained why built-in dictation struggled with my accent and technical vocabulary, while Whisper handles both pretty well. The training data makes a huge difference.
I knew there were multiple open-source projects that already solved the hard technical problems: real-time audio processing, model integration, and WebSocket handling. So it would have been silly for me to build everything from scratch. I finally settled on one that provided solid scaffolding, with the fundamental infrastructure already in place.
The beauty was that I could focus on the AI enhancement layer instead of reinventing the wheel.
2. The AI Helpfulness Problem, And How I Went Local
I wanted local hosting for personal reasons:
It's often faster, and it keeps my data private instead of sending everything off to the cloud.
But as I played with the open-source project, I hit another fundamental problem: the AI helpfulness dilemma.
When I said "What should I eat for dinner?" I wanted it transcribed exactly as spoken. Instead, OpenAI's Realtime API would try to help: "For dinner, I recommend trying a Mediterranean salad..." even when I prompted it not to answer any questions.
AI trying to be helpful is actually not helpful when you just want transcription. This deepened my preference for local hosting where I could control each step.
The Technical Setup
My architecture looked like this:
Audio Input → Whisper (transcription) → Local LLM (text processing) → Final Output
For Whisper, I started with the tiny model (39M parameters) for speed, but found the accuracy wasn't good enough for technical terms. I settled on the base model, which gave me the right balance of speed and accuracy for my use case.
The bigger challenge was the local LLM. I initially tried llama3.2:3b but it couldn't follow basic instructions. When I told it to "just clean up text without answering questions," it would still try to answer everything. After testing several models, I settled on qwen2.5:7b. It’s smart enough to follow complex prompts but fast enough for real-time processing.
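Putting those pieces together, the core pipeline is only a few lines. Here's a minimal sketch, assuming the openai-whisper and ollama Python packages; the file name, function names, and the cleanup prompt are illustrative, not my exact setup:

```python
# Minimal sketch of the Audio → Whisper → Local LLM pipeline.
# Assumes: pip install openai-whisper ollama, and an Ollama server
# running locally with qwen2.5:7b pulled.
import whisper
import ollama

stt_model = whisper.load_model("base")  # "tiny" was faster but missed technical terms

def transcribe(audio_path: str) -> str:
    """Raw Whisper transcription of a recorded clip."""
    return stt_model.transcribe(audio_path)["text"].strip()

def clean_up(raw_text: str) -> str:
    """Light cleanup by the local LLM: fix grammar, never answer questions."""
    response = ollama.chat(
        model="qwen2.5:7b",
        messages=[
            {"role": "system",
             "content": "Fix grammar and remove filler words. "
                        "Do NOT answer questions or add new content."},
            {"role": "user", "content": raw_text},
        ],
    )
    return response["message"]["content"]

if __name__ == "__main__":
    print(clean_up(transcribe("recording.wav")))  # "recording.wav" is a placeholder
```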
Multiple Processing Modes
This is where the local setup really shined. I added different processing modes for different contexts:
Raw mode: Output exactly what Whisper transcribed, no changes
Clean mode: Fix just the basic grammar and remove filler words ("um," "uh," "like").
Polished mode: Transform the speech into well-structured text while preserving the meaning
Assistant mode: Actually engage with the content, answer questions, provide suggestions, or help with whatever I'm working on
The same voice input "Um, what should I, uh, eat for dinner tonight?" would produce completely different outputs depending on the mode.
Raw gives me the exact transcription; Clean removes the filler words; Polished might restructure it as "What dinner options should I consider for tonight?"; and Assistant mode would actually suggest meal ideas.
This separation eliminates the helpfulness problem entirely.
Plus I have zero API costs and complete privacy.
The prompt engineering became much simpler too: each model focuses on what it does best instead of trying to juggle multiple responsibilities.
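Under the hood, the modes are just different system prompts in front of the same local model. The prompt wording below is illustrative rather than my exact prompts, but the routing looks roughly like this:

```python
# Illustrative mode routing: same model, different system prompts.
import ollama

MODE_PROMPTS = {
    "raw": None,  # skip the LLM entirely and return Whisper's output as-is
    "clean": "Fix grammar and remove filler words (um, uh, like). "
             "Keep the wording otherwise unchanged. Never answer questions.",
    "polished": "Rewrite the text as well-structured prose while preserving "
                "its meaning. Never answer questions or add new content.",
    "assistant": "Engage with the content: answer questions, make suggestions, "
                 "and help with whatever the speaker is working on.",
}

def process(raw_text: str, mode: str = "clean") -> str:
    """Route Whisper's raw transcript through the selected mode."""
    prompt = MODE_PROMPTS[mode]
    if prompt is None:
        return raw_text
    response = ollama.chat(
        model="qwen2.5:7b",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": raw_text},
        ],
    )
    return response["message"]["content"]
```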
3. Enable Voice Input Everywhere
It’s very interesting that we have all these fancy voice-to-text apps that work brilliantly, but somehow none of them integrate into actual workflows.
My second-biggest pain point was constantly switching between tabs and hunting for the right window.
If you're like me, with more than 10 tabs and several browser windows open, you know that tab-switching is mentally exhausting. If I could just place my cursor where I want the output and invoke voice input right there, it would save so much mental overhead.
So here's what I did.
Global Hotkey Implementation
I added custom keyboard shortcuts so I can place my cursor anywhere on my system and start voice processing, and the output appears exactly where my cursor is positioned.
The technical implementation was surprisingly straightforward using Python's pynput library for global hotkey detection and pyaudio for system-wide audio capture (of course, with the superpower from Cursor).
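Here's a minimal sketch of that glue; the fixed five-second recording window is a simplification to keep the example short:

```python
# Sketch: global hotkey + microphone capture with pynput and pyaudio.
import wave
import pyaudio
from pynput import keyboard

RATE, CHUNK, SECONDS = 16000, 1024, 5  # Whisper works best with 16 kHz mono audio

def record_clip(path: str = "clip.wav") -> str:
    """Record a short clip from the default microphone and save it as WAV."""
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
    stream.stop_stream(); stream.close(); pa.terminate()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))
    return path

def on_hotkey():
    clip = record_clip()
    print(f"Recorded {clip}; next step: send it to the local server")

# Ctrl+Option+Cmd+D on macOS (pynput calls the Option key "alt")
with keyboard.GlobalHotKeys({"<ctrl>+<alt>+<cmd>+d": on_hotkey}) as listener:
    listener.join()
```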
The Architecture Behind It
The workflow looks like this:
Press global hotkey → Python script captures audio
Audio gets sent to my local server via HTTP
Server processes it through Whisper → Local LLM → Enhanced text
Python script automatically types the result where my cursor is
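The client half of that loop is small: POST the recorded clip to the local server, then "type" whatever comes back at wherever the cursor is. The endpoint URL and response shape below are placeholders for whatever your server exposes, not any particular project's API:

```python
# Sketch: send the clip to the local Whisper + LLM server, then type the
# result at the active cursor position. URL and JSON shape are placeholders.
import requests
from pynput.keyboard import Controller

kb = Controller()

def transcribe_and_type(clip_path: str, mode: str = "clean") -> None:
    with open(clip_path, "rb") as f:
        resp = requests.post(
            "http://localhost:8000/transcribe",  # hypothetical local endpoint
            files={"audio": f},
            data={"mode": mode},
            timeout=30,
        )
    resp.raise_for_status()
    kb.type(resp.json()["text"])  # lands in whatever text field has focus

# e.g. called from the hotkey handler above:
# transcribe_and_type(record_clip())
```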
The most satisfying part? This works everywhere: Notion tables, Substack compose boxes, WhatsApp texts, email drafts, code comments.
Any text field becomes voice-enabled without changing my workflow.
In this example, I invoked the process in the Cursor chatbox so you can compare the terminal logs with the chatbox output. My input, full of filler words ("Hmm, I think, well, uh, maybe, I can have some chocolate"), is cleaned up to "I believe I could use some chocolate."
Integration Challenges
You might laugh if I say the first real challenge was choosing the right hotkey combination that wouldn't conflict with existing shortcuts.
With so many built-in and custom shortcuts already taken, I finally settled on the complex combo of Ctrl+Option+Cmd+D: specific enough to avoid conflicts but still manageable to press with one hand.
Getting this to work required handling platform-specific permissions. On macOS, I needed to grant accessibility permissions for global input monitoring and microphone access for background recording. The Python script runs as a background service, always listening for the hotkey combination.
Working with AI on this project, what struck me most was that the hardest part was no longer the technical implementation; it was making the experience feel natural. The latency had to be low enough that it didn't break my thinking flow.
4. What the Complete Workflow Looks Like
The transformation is simple but profound: every time I need to communicate something, I invoke the hotkey and start speaking instead of scrambling to type or copy-paste from elsewhere.
Voice input removed the bottleneck between thinking and communicating.
Different Contexts, Different Modes
Here's how I actually use it:
For technical work (Cursor, code comments, git commits): Raw mode gives me exact transcription. Cursor understands my instructions whether they're perfectly polished or not, so I just speak naturally and let it parse the meaning.
For public communication (Substack responses, social media): Clean or polished mode depending on the context. Clean mode removes filler words but keeps my voice. Polished mode makes everything sound more intentional.
For writing and brainstorming (like this article): Raw mode lets me get ideas down as fast as I think them. I'm literally dictating this draft right now… speak into the document, stop talking, and text appears within seconds. The raw transcription becomes the foundation that I then edit and polish.
Now I’m happily putting out my comments whenever I want. (The note in the screenshot is from a friend who isn't in the AI space but is always great to interact with.)
The Mental Shift
The real change isn't technical, it's psychological. I went from "how do I type this complex instruction to AI?" to "what exactly do I want it to do?" The friction between having an idea and expressing it was minimized.
Whether I'm responding to comments, explaining bugs to AI, or brainstorming article sections, the question shifted from "how do I phrase this?" to "what do I actually want to say?"
5. The Failures and What I Gave Up Optimizing
Hopefully this looks like a valid generative AI project to you, but I had bigger expectations.
The Streaming Dream
In the beginning, I really wanted real-time streaming transcription: text appearing as I spoke, just like built-in dictation but with better accuracy. It seemed like the obvious next step: why wait for me to finish speaking when the AI could process and display words in real time?
I experimented with processing small audio chunks (1-2 seconds) for immediate feedback, but the output became nonsensical because tiny chunks lose context.
Then I tried accumulating chunks over time and continuously replacing the output with improved transcriptions. This created a repetitive nightmare. The system kept reprocessing the same audio and couldn't stop adding to itself.
I happened to be testing it while confirming with her dad that our baby girl was still sleeping. Here’s what it kept repeating: “She is still sleeping yes She is still sleeping yes…”
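For the curious, the accumulating version looked roughly like this (a simplified sketch, not my actual loop): keep appending audio to a buffer and re-transcribe the whole thing every couple of seconds, replacing the previous output. That constant re-processing of overlapping audio is what kept producing the repetition.

```python
# Simplified sketch of the failed accumulating-buffer experiment.
import numpy as np
import whisper

stt_model = whisper.load_model("base")

def streaming_attempt(chunk_iter):
    """chunk_iter yields ~2-second float32 mono chunks sampled at 16 kHz."""
    buffer = np.zeros(0, dtype=np.float32)
    for chunk in chunk_iter:
        buffer = np.concatenate([buffer, chunk])
        # Re-transcribe everything heard so far and overwrite the old text.
        text = stt_model.transcribe(buffer, fp16=False)["text"]
        print("\r" + text, end="", flush=True)
```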
The Industry Reality
Apparently, this is a common problem. It explains why OpenAI's voice input processes complete batches instead of streaming in real-time, and why Perplexity's streaming output loses accuracy compared to batch processing.
I eventually accepted the batch processing approach. Sometimes the 2-3 second delay is just the price you pay for accuracy and reliability.
Don't quote me on the industry technical details; if you know more about why streaming Automatic Speech Recognition is so challenging, I'd love to hear your thoughts.
Your Move: Start Where You Are
Building this voice-to-text system transformed how I work with AI, but you don't need to replicate my entire setup to get value. Different situations call for different tools.
Here's how to start, wherever you are:
Stage 1: Try ChatGPT's voice input for your next AI conversation. It's instant value with zero setup.
Stage 2 (Apple users): Set up built-in dictation shortcuts, then chain them with GPT. This combination is surprisingly powerful for quick inputs.
Stage 3: Experiment with local Whisper. This is where you gain control over accuracy and privacy.
Stage 4: Integrate with global hotkeys. When voice works anywhere in your system, that's when the magic happens.
There's no one superior approach. I still use ChatGPT voice input for brainstorming, basic dictation on my phone for quick notes and ideas, and local Whisper for heavier voice processing... which one I reach for depends on the context.
The goal isn't the perfect voice system; it's removing that friction between thinking and communicating.
What's your biggest typing bottleneck, and which of these approaches might solve it?
Another amazing article, Jenny!
Many years ago, I experimented with decoders for HF radio transmissions. I built and trained different models, but the chunking that you describe was a big problem. I had to use a buffer up to 8 seconds to produce an acceptable error rate. Smaller chunking produced a lot of garbage.
It's amazing that you are able to do this with a 2 to 3-second buffer size. I want to check out this Whisper model that you are using here.
A great use case for local LLMs, and I like your brilliant idea of a global hotkey and pasting text to the current cursor location.
I'm building something similar for Linux! The optimization I ended up giving up on was using an LLM at output to refine the voice typing. Too high latency, too much memory, and often incorrect despite those two.