Discussion about this post

Finn Tropy

Another amazing article, Jenny!

Many years ago, I experimented with decoders for HF radio transmissions. I built and trained different models, but the chunking you describe was a big problem. I had to use a buffer of up to 8 seconds to produce an acceptable error rate; smaller chunks produced a lot of garbage.

It's amazing that you are able to do this with a 2- to 3-second buffer. I want to check out the Whisper model you are using here.
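For anyone curious, a minimal sketch of that rolling-buffer setup might look like the following, assuming the openai-whisper package and sounddevice for microphone capture; the 3-second buffer and "base" model are illustrative choices, not necessarily the article's exact settings.

```python
# Minimal sketch: transcribe short rolling buffers of microphone audio.
# Assumes: pip install openai-whisper sounddevice numpy
import numpy as np
import sounddevice as sd
import whisper

SAMPLE_RATE = 16_000     # Whisper expects 16 kHz mono audio
BUFFER_SECONDS = 3       # ~2-3 s buffer, per the comment above

model = whisper.load_model("base")  # model size is an assumption

def transcribe_next_chunk() -> str:
    # Record one buffer's worth of audio and block until it is full.
    audio = sd.rec(int(BUFFER_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    # Whisper accepts a 1-D float32 array directly; fp16=False avoids
    # a warning when running on CPU.
    result = model.transcribe(audio.flatten(), fp16=False)
    return result["text"].strip()

while True:
    print(transcribe_next_chunk())
```

Each chunk is transcribed independently here; a real tool would also need to handle words that straddle chunk boundaries, which is where larger buffers help.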

A great use case for local LLMs, and I like your brilliant idea of a global hotkey that pastes the text at the current cursor location.
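A minimal sketch of that hotkey-and-paste idea, assuming pynput for the global hotkey and pyperclip for the clipboard; transcribe() is a hypothetical stand-in for whatever produces the dictated text, and the key binding is illustrative.

```python
# Minimal sketch: global hotkey that pastes dictated text at the cursor.
# Assumes: pip install pynput pyperclip
import pyperclip
from pynput import keyboard

def transcribe() -> str:
    # Hypothetical placeholder for the actual speech-to-text step.
    return "dictated text goes here"

def paste_at_cursor(text: str) -> None:
    # Put the text on the clipboard, then synthesize Ctrl+V so it
    # lands wherever the cursor currently is.
    pyperclip.copy(text)
    kb = keyboard.Controller()
    with kb.pressed(keyboard.Key.ctrl):
        kb.press('v')
        kb.release('v')

def on_hotkey() -> None:
    paste_at_cursor(transcribe())

# Ctrl+Alt+Space is an illustrative binding, not the article's.
with keyboard.GlobalHotKeys({'<ctrl>+<alt>+<space>': on_hotkey}) as listener:
    listener.join()
```

On macOS the paste shortcut would be Cmd+V (keyboard.Key.cmd) instead of Ctrl+V.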

Logan Thorneloe

I'm building something similar for Linux! The optimization I ended up giving up on was using an LLM on the output to refine the voice typing: the latency was too high, the memory use too heavy, and the result was often still incorrect despite both.
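A minimal sketch of that refinement step, assuming a local Ollama server on its default port; the model name and prompt are illustrative. Timing the call makes the latency cost easy to measure.

```python
# Minimal sketch: refine a raw transcript with a local LLM via Ollama.
# Assumes: an Ollama server at localhost:11434 with a model pulled.
import time
import requests

def refine(transcript: str) -> str:
    prompt = ("Fix punctuation and obvious transcription errors, "
              "changing nothing else:\n" + transcript)
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "llama3",   # model name is an assumption
                            "prompt": prompt,
                            "stream": False},
                      timeout=60)
    r.raise_for_status()
    return r.json()["response"].strip()

start = time.perf_counter()
print(refine("this is a test of voice typing on linux"))
print(f"refinement took {time.perf_counter() - start:.2f}s")
```

Even a small local model typically adds a noticeable per-chunk delay on CPU, which matches the latency complaint above.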
