39 Comments
Finn Tropy:

Another amazing article, Jenny!

Many years ago, I experimented with decoders for HF radio transmissions. I built and trained different models, but the chunking you describe was a big problem: I had to use a buffer of up to 8 seconds to produce an acceptable error rate, and smaller chunks produced a lot of garbage.

It's amazing that you are able to do this with a 2-to-3-second buffer. I want to check out the Whisper model you are using here.

A great use case for local LLMs, and I like your brilliant idea of a global hotkey and pasting text to the current cursor location.
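The buffered chunking described above can be sketched in a few lines of Python. This is a hedged illustration, not code from the article: the 16 kHz sample rate, the frame layout, and the stubbed transcription step are all assumptions, and a real pipeline would hand each completed chunk to a transcriber such as openai-whisper's `model.transcribe()`.

```python
# Minimal sketch of fixed-size audio buffering for chunked transcription.
# Assumption: 16 kHz mono samples arrive in small streamed frames; the
# actual Whisper call is omitted and each chunk is simply yielded.

SAMPLE_RATE = 16_000  # Hz; the input rate Whisper models expect

def chunk_audio(frames, buffer_seconds=3.0, sample_rate=SAMPLE_RATE):
    """Accumulate streamed frames and yield fixed-length chunks.

    A larger buffer_seconds (e.g. 8.0, as with the HF decoder above)
    gives the model more context per chunk at the cost of latency.
    """
    chunk_len = int(buffer_seconds * sample_rate)
    buffer = []
    for frame in frames:
        buffer.extend(frame)
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]       # a chunk ready for transcription
            buffer = buffer[chunk_len:]
    if buffer:                             # flush the trailing partial chunk
        yield buffer

# 10 s of silence delivered in 0.5 s frames -> three 3 s chunks + remainder
frames = [[0.0] * (SAMPLE_RATE // 2) for _ in range(20)]
chunk_sizes = [len(c) for c in chunk_audio(frames)]
# chunk_sizes == [48000, 48000, 48000, 16000]
```

The trade-off the thread discusses lives entirely in `buffer_seconds`: larger chunks give the model more acoustic context (lower error rate), smaller chunks give lower latency.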

Jenny Ouyang:

Thank you so much, Finn! Really appreciate the comment, especially coming from you!

That example you shared is amazing; I can only imagine how much domain knowledge it takes to decode and chunk HF radio transmissions in a sensible way.

Definitely give Whisper a try. It's super helpful now that I can speak whenever I prefer!

Logan Thorneloe:

I'm building something similar for Linux! The optimization I ended up giving up on was using an LLM at the output to refine the voice typing. Too high latency, too much memory, and often incorrect despite all that.

Jenny Ouyang:

Thanks for sharing your experience, Logan! That's totally fair; using an LLM at the output to refine the voice typing is indeed unnecessary most of the time. I usually just use the raw output.

Jens Stark:

How worried should creators be about AI voice cloning?

Jenny Ouyang:

Thanks Jens, that's a fair concern.

Since I'm hosting everything, including the LLM, locally, there's no way for my voice to be cloned. I think the same applies to built-in dictation.

But if I used OpenAI's API to handle the voice-to-text conversion, and if OpenAI retained all the voice submissions, then yeah, it would be a bit concerning.

Daria Cupareanu:

Loved how practical this was, and how you shared the streaming limitations and the trade-offs too.

I really relate to the frustration of losing half my ideas while typing them out :)) your solution of integrating voice input everywhere feels like a real productivity unlock.

Jenny Ouyang:

Thank you, Daria!

I have a feeling this might really resonate with you, because you probably have tons of ideas flowing all the time, and being able to speak them out freely could be a real boost :)

Also, hope you have a great vacation!!!

Daria Cupareanu:

Ahh, the thought of you thinking that I’d enjoy this… thank you & you were right!! Vacation starts next week, right now it’s just the terror before it haha.

Keep doing amazing work🤗

Stuart Miller:

This was really helpful insight Jenny. Thank you! 😊

Jenny Ouyang:

Thank you, Stuart! Glad you found it useful.

Patri Aguilar and Norah:

I do better typing than dictating. I had tried talking to my AI; she is not the issue, it's me: I try to say 10 things at the same time. So yes to more empathy, Norah. Copilot shows how frustrating it is for me, so typing is my best option.

Jenny Ouyang:

Totally fair. It's all about choosing whatever feels most frictionless for your flow, and if that’s typing, then that’s the right move.

Joel Salinas:

Really enjoyed the audio and screen sharing examples!

Jenny Ouyang:

Awww, thank you Joel, for reading and for "enjoying" my accent :)

Joel Salinas:

It adds more depth and more Jenny to the post :)

Eric Engle:

FWIW, I use Piper for TTS and it works very well for speech production. I also use a FOSS tool for dictation, but I don't do much dictation. https://github.com/rhasspy/piper

Jenny Ouyang:

Thanks for sharing it.

HipsterTech:

No GitHub repo or sample code alongside the article? 🥲

Jenny Ouyang:

Oh sorry, I should make the linked GitHub repo more obvious! This is the repo; I made minimal modifications to get it working locally and added the hotkey: https://github.com/grapeot/brainwave/

James Presbitero:

This is a really fascinating and practical solution. I switch to voice when I'm inputting long, complex prompts to ChatGPT as well. I haven't had any problems with it, though. Just to be clear, is the reason you went through all the steps you did there because:

a. Native speech-to-text (dictation) in AI apps couldn't decode your non-native English accent, and

b. Using a more powerful AI-powered app alongside what you're working with (ChatGPT, Cursor, etc.) introduced a lot of friction to your workflow?

So essentially, if I want a more powerful and intuitive way to convert speech to text, this is a good method to follow. Am I understanding it right? I'm afraid I'm not technical enough for many of the things you mentioned 😆

Jenny Ouyang:

Thanks for those questions, James!

To answer them:

a. Commercial AI apps like ChatGPT, Whisper, etc., recognize my voice just fine. The issue only arises with Apple’s built-in dictation.

b. Sort of. The friction isn't from using AI itself, but from context switching: jumping between different tools like ChatGPT, Substack, private messages, and writing docs. That constant back-and-forth can be disruptive.

Not sure if that clears up your confusion?

That said, if you’re not feeling any friction, then it’s not a problem at all! Nothing wrong with sticking to what’s already working for you.

And honestly, in some ways, no one is ever “technical enough” 😄. You can pick up what you need just by asking the AI. If you ever do want to switch things up, there are also some open-source apps out there that might help streamline the process.

James Presbitero:

I seeeee!!! Yeah that definitely clears it up. Thanks for answering! So you made an AI powered voice system that’s “global” to your workspace, so you eliminate all the context switching!! I can better appreciate how cool that is now.

Jenny Ouyang:

Yes! "Making the AI voice system global in the workspace" is the perfect way to put it!

Thanks for the appreciation :)

Sharyph:

I am sure this is more than enough for a prototype.

There are definitely ways to improve...

Thanks for sharing, Jenny.

Jenny Ouyang:

Yes, there are definitely ways to improve; this just scratches the itch for now.

Appreciate your comment, Sharyph!

Robert Oliva:

Ty! Great post.

Jenny Ouyang:

Thank you, Robert! Glad you enjoyed it.

Luan Doan:

Thanks for sharing this! I definitely feel your pain with the built-in dictation struggles as a non-native English speaker too.

Jenny Ouyang:

You're welcome, Luan! Yeah, speaking with an accent is hard :)

Harrison:

thanks for the mention, Jenny ;)

Jenny Ouyang:

You are welcome Harrison :)

Tope Olofin:

I still type everything because of my accent; I just know something will go wrong. I'm currently trying to set up the built-in voice dictation and chain it with GPT.

Jenny Ouyang:

Yeah, right? Built-in dictation still works most of the time. Let me know how chaining it with GPT goes for you :)

@mindset&mythos:

I still type, and though it's short prompts, I do lose the thread a little. And in time the complexity of the work could increase, so this is a solid insight into reducing the gap between thought and action.

Jenny Ouyang:

Yeah, I actually enjoy typing, but sometimes I'm just much more eager to see what GPT has to say back.

@mindset&mythos:

I can understand that. Yeah, it's exciting to see what it'll do with your request.
