Another amazing article, Jenny!
Many years ago, I experimented with decoders for HF radio transmissions. I built and trained different models, but the chunking that you describe was a big problem. I had to use a buffer of up to 8 seconds to produce an acceptable error rate. Smaller chunks produced a lot of garbage.
It's amazing that you are able to do this with a 2 to 3-second buffer size. I want to check out this Whisper model that you are using here.
A great use case for local LLMs, and I like your brilliant idea of a global hotkey and pasting text to the current cursor location.
Thank you so much, Finn! Really appreciate the comment, especially coming from you!
That example you shared is amazing. I can only imagine how much domain knowledge you need in order to decode and chunk HF radio transmissions in a sensible way.
Definitely give Whisper a try. It's been super helpful now that I can speak whenever I want!
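In case anyone wants to experiment with the short-buffer idea Finn mentions, here is a minimal sketch of what it can look like. This is not the exact setup from the post; it assumes the open-source `openai-whisper` and `sounddevice` packages, and the chunk length and model size are just placeholders.

```python
# Minimal sketch: transcribe the microphone in ~3-second chunks with the
# open-source Whisper model. Not the code from the post.
# Assumes: pip install openai-whisper sounddevice
import sounddevice as sd
import whisper

SAMPLE_RATE = 16_000   # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 3      # the short buffer discussed above

model = whisper.load_model("base")  # model size is a placeholder

while True:
    # Record one fixed-size chunk of audio as float32 samples.
    audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()

    # Whisper's transcribe() accepts a 1-D float32 NumPy array directly.
    result = model.transcribe(audio.flatten(), fp16=False)
    text = result["text"].strip()
    if text:
        print(text)
```

Note that naive fixed-size chunks can cut words in half at the boundaries, which is part of why very short buffers tend to produce garbage; practical setups usually add voice-activity detection or some overlap between chunks.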
I'm building something similar for Linux! The optimization I ended up giving up on was using an LLM at output to refine the voice typing. Too high latency, too much memory, and often incorrect despite those two.
Thanks for sharing your experience, Logan! That's totally fair: using an LLM at the output to refine the voice typing is indeed unnecessary most of the time. I usually just use the raw output.
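For context, the refinement pass Logan describes giving up on would be something like the sketch below: one extra cleanup prompt sent to a locally hosted model after transcription. Everything here (the local Ollama endpoint, the model name, the prompt) is an assumption for illustration, not the setup from the post, and that extra round trip is exactly where the latency comes from.

```python
# Hypothetical sketch of an optional LLM "cleanup" pass on raw dictation.
# Assumes a local Ollama server on its default port; the model name and
# prompt are placeholders, not the setup described in the post.
import requests

def refine_transcript(raw_text: str) -> str:
    prompt = (
        "Fix punctuation and obvious transcription errors in the text below. "
        "Do not change the meaning or add content.\n\n" + raw_text
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
```

Since the raw Whisper output is usually good enough, this pass can simply be skipped, which saves the extra latency and memory.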
How worried should creators be about AI voice cloning?
Thanks Jens, that's a fair concern.
Since I'm hosting everything, including the LLM, locally, there is no way for my voice to be cloned. I think the same applies to built-in dictation.
But if I were using OpenAI's API to handle the voice-to-text conversion, and OpenAI kept all the voice submissions, then yeah, it would be a bit concerning.
Loved how practical this was, and how you shared the streaming limitations and the trade-offs too.
I really relate to the frustration of losing half my ideas while typing them out :)) your solution of integrating voice input everywhere feels like a real productivity unlock.
Thank you, Daria!
I have a feeling this might really resonate with you, because you probably have tons of ideas flowing all the time, and being able to speak them out freely could be a real boost :)
Also, hope you have a great vacation!!!
Ahh, the thought of you thinking that I’d enjoy this… thank you & you were right!! Vacation starts next week, right now it’s just the terror before it haha.
Keep doing amazing work🤗
This was really helpful insight Jenny. Thank you! 😊
Thank you Stuart! Glad you found it useful.
I do better typing than dictating. I had tried to talk to my AI; she is not the issue, it's me: I try to say 10 things at the same time. So yes for more empathy, Norah. Copilot is frustrating for me, so typing is my best option.
Totally fair. It's all about choosing whatever feels most frictionless for your flow, and if that’s typing, then that’s the right move.
Really enjoyed the audio and screen sharing examples!
Awww, thank you Joel, for reading and for “enjoying” my accent :)
It adds more depth and more Jenny to the post :)
fwiw I use piper for tts and it works very well for speech production. I also use a FOSS for dictation but I don't do much dictation. https://github.com/rhasspy/piper
Thanks for sharing it.
No GitHub repo or sample code alongside the article? 🥲
Oh sorry, I should make the linked GitHub repo more obvious! This is the repo; I made minimal modifications to get it working locally and added the hotkey: https://github.com/grapeot/brainwave/
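For anyone curious about the hotkey part before diving into the repo, the general technique is a global listener that grabs the transcript, puts it on the clipboard, and pastes it at the current cursor position. The sketch below illustrates the idea only, not the brainwave implementation; the package names and the key combination are assumptions.

```python
# Rough sketch of the "global hotkey, paste at the cursor" idea;
# not the brainwave implementation. Assumes: pip install pynput pyperclip
import pyperclip
from pynput import keyboard

def transcribe_once() -> str:
    # Placeholder: record a chunk and run Whisper here (see the earlier sketch).
    return "transcribed text goes here"

def on_hotkey():
    text = transcribe_once()
    pyperclip.copy(text)                  # put the transcript on the clipboard
    controller = keyboard.Controller()
    with controller.pressed(keyboard.Key.cmd):   # Cmd+V on macOS; Ctrl elsewhere
        controller.press("v")
        controller.release("v")

# <cmd>+<shift>+v is an arbitrary example combination.
with keyboard.GlobalHotKeys({"<cmd>+<shift>+v": on_hotkey}) as hotkeys:
    hotkeys.join()
```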
This is a really fascinating and practical solution. I switch to voice when I'm inputting long, complex prompts to ChatGPT as well. I haven't had any problems with it, though. Just to be clear, is the reason you went through all the steps you did there because:
a. Native speech-to-text solutions on AI apps couldn't decode your non-native English accent, and
b. Bringing a more powerful AI-powered app into what you're working with (CGPT, Cursor, etc.) introduced a lot of friction to your workflow?
So essentially, if I want a more powerful and intuitive way to convert speech to text, this is a good method to follow. Am I understanding it right? I'm afraid I'm not technical enough for many of the things you mentioned 😆
Thanks for those questions, James!
To answer them:
a. Commercial AI apps like ChatGPT, Whisper, etc., recognize my voice just fine. The issue only arises with Apple’s built-in dictation.
b. Sort of. The friction isn’t from using AI itself, but rather from context switching, jumping between different tools like ChatGPT, Substack, private messages, and writing docs. That constant back-and-forth can be disruptive.
Not sure if that clears up your confusion?
That said, if you’re not feeling any friction, then it’s not a problem at all! Nothing wrong with sticking to what’s already working for you.
And honestly, in some ways, no one is ever “technical enough” 😄. You can pick up what you need just by asking the AI. If you ever do want to switch things up, there are also some open-source apps out there that might help streamline the process.
I seeeee!!! Yeah that definitely clears it up. Thanks for answering! So you made an AI powered voice system that’s “global” to your workspace, so you eliminate all the context switching!! I can better appreciate how cool that is now.
Yes! Making the AI voice system global in the workspace is the perfect way to put it!
Thanks for the appreciation :)
I am sure this is more than enough for a prototype.
There are definitely ways to improve...
Thanks for sharing, Jenny.
Yes, there are definitely ways to improve, this just scratches the itch for now.
Appreciate your comment, Sharyph!
Ty! Great post.
Thank you Robert! Glad you enjoyed it.
Thanks for sharing this! I definitely feel your pain with the built-in dictation struggles as a non-native English speaker too.
You are welcome, Luan! Yeah, speaking with an accent is hard :)
thanks for the mention, Jenny ;)
You are welcome Harrison :)
I still type everything because of my accent. I just know something will go wrong. I'm currently trying to set up the built-in voice dictation and chain it with GPT.
Yeah, right? Built-in dictation still works most of the time. Let me know how chaining with GPT goes for you :)
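In case it helps with that setup, the "chain it with GPT" step usually amounts to sending the dictated text through one cleanup prompt, roughly like the sketch below. This is a guess at the general approach, not a specific recipe; it uses the official `openai` Python client, and the model name and prompt are placeholders.

```python
# Hypothetical sketch: post-process built-in dictation output with GPT.
# Assumes: pip install openai, with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def clean_up_dictation(dictated_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Fix punctuation and misrecognized words in dictated "
                        "text without changing its meaning."},
            {"role": "user", "content": dictated_text},
        ],
    )
    return response.choices[0].message.content.strip()
```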
I still type, and though it's short prompts, I do lose the thread a little. And in time the complexity of the work could increase, so this is a solid insight into reducing the gap between thought and action.
Yeah, I actually enjoy typing, but sometimes I'm just much more eager to see how GPT answers me.
I can understand that. Yeah, it's exciting to see what it'll do with your request.