Another amazing article, Jenny!
Many years ago, I experimented with decoders for HF radio transmissions. I built and trained different models, but the chunking that you describe was a big problem. I had to use a buffer of up to 8 seconds to produce an acceptable error rate. Smaller chunks produced a lot of garbage.
It's amazing that you are able to do this with a 2 to 3-second buffer size. I want to check out this Whisper model that you are using here.
A great use case for local LLMs, and I like your brilliant idea of a global hotkey and pasting text to the current cursor location.
Thank you so much, Finn! Really appreciate the comment, especially coming from you!
That example you shared is amazing. I can only imagine how much domain knowledge you need in order to decode and chunk HF radio transmissions in a sensible way.
Definitely give Whisper a try. It's been super helpful now that I can speak whenever I want!
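In case anyone wants to experiment with the short-buffer idea Finn mentions, here is a minimal sketch of what it can look like. This is not the exact setup from the post; it assumes the open-source `openai-whisper` and `sounddevice` packages, and the chunk length and model size are just placeholders.

```python
# Minimal sketch: transcribe the microphone in ~3-second chunks with the
# open-source Whisper model. Not the code from the post.
# Assumes: pip install openai-whisper sounddevice
import sounddevice as sd
import whisper

SAMPLE_RATE = 16_000   # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 3      # the short buffer discussed above

model = whisper.load_model("base")  # model size is a placeholder

while True:
    # Record one fixed-size chunk of audio as float32 samples.
    audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()

    # Whisper's transcribe() accepts a 1-D float32 NumPy array directly.
    result = model.transcribe(audio.flatten(), fp16=False)
    text = result["text"].strip()
    if text:
        print(text)
```

Note that naive fixed-size chunks can cut words in half at the boundaries, which is part of why very short buffers tend to produce garbage; practical setups usually add voice-activity detection or some overlap between chunks.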
I'm building something similar for Linux! The optimization I ended up giving up on was using an LLM at output to refine the voice typing. Too high latency, too much memory, and often incorrect despite those two.
Thanks for sharing your experience, Logan! That's totally fair: using an LLM at the output to refine the voice typing is indeed unnecessary most of the time. I usually just use the raw output.
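For context, the refinement pass Logan describes giving up on would be something like the sketch below: one extra cleanup prompt sent to a locally hosted model after transcription. Everything here (the local Ollama endpoint, the model name, the prompt) is an assumption for illustration, not the setup from the post, and that extra round trip is exactly where the latency comes from.

```python
# Hypothetical sketch of an optional LLM "cleanup" pass on raw dictation.
# Assumes a local Ollama server on its default port; the model name and
# prompt are placeholders, not the setup described in the post.
import requests

def refine_transcript(raw_text: str) -> str:
    prompt = (
        "Fix punctuation and obvious transcription errors in the text below. "
        "Do not change the meaning or add content.\n\n" + raw_text
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
```

Since the raw Whisper output is usually good enough, this pass can simply be skipped, which saves the extra latency and memory.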
How worried should creators be about AI voice cloning?
Thanks Jens, that's a fair concern.
Since I'm hosting everything, including the LLM, locally, there is no way for my voice to be cloned. I think the same applies to built-in dictation.
But if I were using OpenAI's API to handle the voice-to-text conversion, and OpenAI kept all the voice submissions, then yeah, it would be a bit concerning.
Loved how practical this was, and how you shared the streaming limitations and the trade-offs too.
I really relate to the frustration of losing half my ideas while typing them out :)) your solution of integrating voice input everywhere feels like a real productivity unlock.
Thank you, Daria!
I have a feeling this might really resonate with you, because you probably have tons of ideas flowing all the time, and being able to speak them out freely could be a real boost :)
Also, hope you have a great vacation!!!
Ahh, the thought of you thinking that I’d enjoy this… thank you & you were right!! Vacation starts next week, right now it’s just the terror before it haha.
Keep doing amazing work🤗
This was really helpful insight Jenny. Thank you! 😊
Thank you Stuart! Glad you found it useful.
I do better typing than dictating. I had tried to talk to my AI; she is not the issue, it's me: I try to say 10 things at the same time. So yes for more empathy, Norah. Copilot is frustrating for me, so typing is my best option.
Totally fair. It's all about choosing whatever feels most frictionless for your flow, and if that’s typing, then that’s the right move.
Really enjoyed the audio and screen sharing examples!
Awww, thank you Joel, for reading and for “enjoying” my accent :)
It adds more depth and more Jenny to the post :)
fwiw I use piper for tts and it works very well for speech production. I also use a FOSS for dictation but I don't do much dictation. https://github.com/rhasspy/piper
Thanks for sharing it.
No GitHub repo or sample code alongside the article? 🥲
Oh sorry, I should make the linked GitHub repo more obvious! This is the repo; I made minimal modifications to get it working locally and added the hotkey: https://github.com/grapeot/brainwave/
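For anyone curious about the hotkey part before diving into the repo, the general technique is a global listener that grabs the transcript, puts it on the clipboard, and pastes it at the current cursor position. The sketch below illustrates the idea only, not the brainwave implementation; the package names and the key combination are assumptions.

```python
# Rough sketch of the "global hotkey, paste at the cursor" idea;
# not the brainwave implementation. Assumes: pip install pynput pyperclip
import pyperclip
from pynput import keyboard

def transcribe_once() -> str:
    # Placeholder: record a chunk and run Whisper here (see the earlier sketch).
    return "transcribed text goes here"

def on_hotkey():
    text = transcribe_once()
    pyperclip.copy(text)                  # put the transcript on the clipboard
    controller = keyboard.Controller()
    with controller.pressed(keyboard.Key.cmd):   # Cmd+V on macOS; Ctrl elsewhere
        controller.press("v")
        controller.release("v")

# <cmd>+<shift>+v is an arbitrary example combination.
with keyboard.GlobalHotKeys({"<cmd>+<shift>+v": on_hotkey}) as hotkeys:
    hotkeys.join()
```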
This is a really fascinating and practical solution. I switch to voice when I'm inputting long, complex prompts to ChatGPT as well. I haven't had any problems with it, though. Just to be clear, is the reason you went through all the steps you did there because:
a. Native speech-to-text solutions on AI apps couldn't decode your non-native English accent, and
b. Bringing a more powerful AI-powered app into what you're working with (CGPT, Cursor, etc.) introduced a lot of friction to your workflow?
So essentially, if I want a more powerful and intuitive way to convert speech to text, this is a good method to follow. Am I understanding it right? I'm afraid I'm not technical enough for many of the things you mentioned 😆
Thanks for those questions, James!
To answer them:
a. Commercial AI apps like ChatGPT, Whisper, etc., recognize my voice just fine. The issue only arises with Apple’s built-in dictation.
b. Sort of. The friction isn’t from using AI itself, but rather from context switching, jumping between different tools like ChatGPT, Substack, private messages, and writing docs. That constant back-and-forth can be disruptive.
Not sure if that clears up your confusion?
That said, if you’re not feeling any friction, then it’s not a problem at all! Nothing wrong with sticking to what’s already working for you.
And honestly, in some ways, no one is ever “technical enough” 😄. You can pick up what you need just by asking the AI. If you ever do want to switch things up, there are also some open-source apps out there that might help streamline the process.
I seeeee!!! Yeah that definitely clears it up. Thanks for answering! So you made an AI powered voice system that’s “global” to your workspace, so you eliminate all the context switching!! I can better appreciate how cool that is now.
Yes! Making the AI voice system global in the workspace is the perfect way to put it!
Thanks for the appreciation :)
I am sure this is more than enough for a prototype.
There are definitely ways to improve...
Thanks for sharing, Jenny.
Yes, there are definitely ways to improve, this just scratches the itch for now.
Appreciate your comment, Sharyph!
Ty! Great post.
Thank you Robert! Glad you enjoyed it.
Thanks for sharing this! I definitely feel your pain with the built-in dictation struggles as a non-native English speaker too.
You are welcome, Luan! Yeah, speaking with an accent is hard :)
thanks for the mention, Jenny ;)
You are welcome Harrison :)
I still type everything because of my accent. I just know something will go wrong. I'm currently trying to set up the built-in voice dictation and chain it with GPT.
Yeah, right? Built-in dictation still works most of the time. Let me know how chaining with GPT goes for you :)
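In case it helps with that setup, the "chain it with GPT" step usually amounts to sending the dictated text through one cleanup prompt, roughly like the sketch below. This is a guess at the general approach, not a specific recipe; it uses the official `openai` Python client, and the model name and prompt are placeholders.

```python
# Hypothetical sketch: post-process built-in dictation output with GPT.
# Assumes: pip install openai, with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def clean_up_dictation(dictated_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Fix punctuation and misrecognized words in dictated "
                        "text without changing its meaning."},
            {"role": "user", "content": dictated_text},
        ],
    )
    return response.choices[0].message.content.strip()
```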
I still type, and though it's short prompts, I do lose the thread a little. And in time the complexity of the work could increase, so this is a solid insight into reducing the gap between thought and action.
Yeah, I actually enjoy typing, but sometimes I'm just much more eager to see how GPT answers me.
I can understand that. Yeah, it's exciting to see what it'll do with your request.