It's really cool how people are taking these tiny cheap MCUs and making them do fun things for hobbyists. There's nothing better than a project with zero real-world use-case but that's done just because it was a challenge.
The general concept has a real world use, but this specific implementation probably doesn't perform as well as most of the already available ones on more powerful hardware.
Maybe it could have an IRL use for efficient wake word spotting or the like though.
Minor nitpick/clarification. As it stands this is doing detection of a fixed, small vocabulary of words - not open-ended speech to text covering an entire language. This is also called speech command recognition or keyword spotting, which is already impressive and useful. General STT on this grade of hardware would be an amazing feat!
this is exciting! it's still at prototype stage: 'getting about 90% accuracy [distinguishing between the spoken digits 'zero' to 'nine',] with the code as it stands.'
i wonder if modern continuous optimization algorithms could yield a neural network that would do better than this mfcc approach at, perhaps, even lower computational cost
If you search a part number in Google and the only datasheet result is from an unpronounceable Chinese website, there's a very good chance it's not going to be on digikey. LCSC or AliExpress will be your only options. Even when designing boards, you have to consider whether you want to pick parts from the LCSC library or Digikey because they don't carry all the same parts and even parts that you would think are jellybean don't have the exact same package on both sites (especially SOT packages, similarly sized but not the exact same).
Replacing the codebook approach with a statistical/DNN is more likely to give higher accuracy than getting rid of mfccs as spectral representation (at least in general ASR). (Arguably, using Mel spectra was the least controversial design choice made for Whisper.)
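For anyone unfamiliar with what a codebook approach looks like, here is a minimal sketch. I don't know exactly how this project uses its codebook, so this assumes one common scheme: each word gets a small set of stored feature vectors (codewords), and an utterance is assigned to the word whose codewords give the lowest total distance over all frames. The function names and the 2-dimensional toy vectors are made up for illustration; real frames would be MFCC vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(frames, codebooks):
    """Return the label whose codebook has the lowest total distortion.

    frames: list of feature vectors for one utterance.
    codebooks: dict mapping label -> list of codeword vectors.
    """
    best_label, best_cost = None, float("inf")
    for label, codewords in codebooks.items():
        # Each frame is charged the distance to its nearest codeword.
        cost = sum(
            min(squared_distance(frame, cw) for cw in codewords)
            for frame in frames
        )
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label


# Toy data (hypothetical numbers, not real MFCCs).
codebooks = {
    "zero": [(0, 0), (1, 1)],
    "nine": [(5, 5), (6, 4)],
}
utterance = [(0, 1), (1, 0), (1, 1)]
print(classify(utterance, codebooks))
```

A statistical model or small DNN would replace the nearest-codeword lookup while still consuming the same spectral features, which is the swap suggested above.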
thank you! those are good points. i was thinking that maybe you could get by with some relatively sparse convolutional layers over the raw sound samples and save yourself the expense of doing a real fourier transform, but maybe that's a dumb idea
It is a good idea that is worth trying out! Like anything there are tradeoffs though, so it is not guaranteed to be better for this particular circumstance. The ability to use low bit-depth integer operations (which is easy for a neural net) should be beneficial for a CPU without a floating point unit. But weights need to be stored - and it can be difficult to match FFT efficiency - depending on what resolution is actually needed/utilized.
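As a rough illustration of the integer-only idea, here is a sketch of a strided 1-D convolution over raw samples using nothing but integer multiply, add and shift. Everything here is an assumption for illustration: the helper names, the 3-tap smoothing kernel, and the power-of-two scale of 64 (chosen so that rescaling is a right shift by 6). On an MCU this would be written in C, but the arithmetic is the same.

```python
def quantize(weights, scale):
    """Map float weights to int8 values with a shared scale (w ~= q / scale)."""
    return [max(-128, min(127, round(w * scale))) for w in weights]


def conv1d_int(samples, kernel_q, stride, shift):
    """Strided 1-D convolution using only integer multiply, add and shift.

    samples: list of integer audio samples.
    kernel_q: quantized integer weights.
    stride: step between output positions (larger stride = fewer outputs).
    shift: right shift that undoes the weight scale (scale == 2 ** shift).
    """
    out = []
    for start in range(0, len(samples) - len(kernel_q) + 1, stride):
        acc = 0
        for k, w in enumerate(kernel_q):
            acc += samples[start + k] * w
        out.append(acc >> shift)
    return out


kernel_q = quantize([0.25, 0.5, 0.25], 64)  # three weights, one byte each
samples = [0, 64, 128, 64, 0, -64, -128, -64, 0]
print(kernel_q)
print(conv1d_int(samples, kernel_q, stride=2, shift=6))
```

The cost per output is one multiply-accumulate per kernel tap, and every tap is a stored byte, so whether this beats an FFT front end depends on how many filters and taps are needed to get comparable resolution.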
They also don't list any HBM memory or GDDR7, which is frustrating as I'm trying to use KiCad to design a cheaper PCI card... but finding any decent documentation on those chips is impossible at this time.
Really nice project! Great care is taken in optimized audio feature extraction, very cool to see. I am working on a very similar project[1], using the Puya PY32. I opted for that chip over CH32 since it has DMA (simplifies efficient ADC input at audio rates), and 1 kB more RAM. For a couple of cents more.
I have written about some of the hardware constraints on low cost audio already, and am getting to the audio DSP/ML in the next months.
It would also be interesting to know whether Voice Control Products ever had a real design win.
I gather the VCP200 was a mask-programmed M6804 microcontroller. The M6804 was a strange and obscure beast, apparently a cost-reduced, internally serial ("1-bit"), partial reimplementation of the M6805, which was one of the first Motorola 8-bit microcontrollers based on the 6800. Max bus speed of 2.75MHz, with an instruction cycle time of 44 microseconds. 32 bytes of RAM and 1K mask-programmed ROM. No ADC.
http://www.bitsavers.org/components/motorola/6804/M6804_MCU_...
One should be able to do better with about any modern microcontroller. Then again, for all I know the VCP200 was not fit to even the modest tasks (looks like toy/novelty/hobbyist) it was marketed for back then.
Well, consider that everything outside/before Whisper was less than 40% accurate on my voice (I don't know the reason), while Whisper is now close to 100%, even with tech stuff/abbreviations. Things like Siri, Google, Alexa, Dragon etc. never understood anything I said (I stopped trying, so they might have improved, but I did try not long ago). When I ask for the weather, something like Siri searches Google for what the border is, etc. I am not a native English speaker, but I am fluent (I work in English full-time) and humans never have any issues. In my own language, none of them work either, except Whisper, even running locally (though, like I said, the others might've improved recently).
So it would be interesting to hear how articulately you would need to speak, and how it does with different people, accents and such.
I experience exactly the same. For me it’s an “accent” caused by profound hearing loss. No issues in everyday conversation, but almost zero success with any speech to text tool.
Among other things, it had limited speech recognition -- you could say "Call" followed by a name, and it would match that against the address book on device.
My 2006 Infiniti had voice commands for calling people in your address book. Road noise trashed the microphone quality so it only really worked well when you were at a stop.
Handsfree mics in cars still suck and Bluetooth handsfree audio quality sucks too, not sure why this is still a problem. I get backwards compatibility issues but is good compression that difficult in newer devices?
i had a sprint samsung sch-6100 flip phone with a similar voice recognition feature at the end of last millennium, but it would only match the name you told it to call against names you'd previously made training voice recordings of. that is, it wasn't trying to do speech-to-text or text-to-speech; it was just trying to discriminate among the particular recordings you had made previously
i didn't use the feature very much because to activate it, iirc, you had to either flip the phone open or press a button on a hands-free headset. but obviously this wasn't a bluetooth headset, and the phone couldn't play music, so you wouldn't walk around with it in your ears all the time; unless you'd just gotten off a different call, you'd have to get it out, put it in your ears, plug it in, and then you could use the speech recognition feature
so unless you were a secretary or something, making one phone call after another for hours (to a small number of people), you might as well just use speed dial
Yeah, even the 24-year-old Nokia 3310 had some form of voice dialling [1].
OS/2 Warp 4.0 (1996) came with speech recognition and dictation software [2]. The CPUs it supported back then weren't much better than a 10-year old phone.
In addition, way back in 1993, Apple released speech recognition with the original AV Macs (which were outfitted with 55 MHz AT&T 3210 DSPs in addition to their 25+ MHz Motorola 68040s) which was then also supported on the PowerPC Macs released the following year (that started at 60 MHz)
If you uploaded some training data somewhere, perhaps with some links to simulators, you might get a crowd of people code-golfing this to maximize accuracy.
Eg:
Making the CH32V003 programmable via USB: https://www.youtube.com/watch?v=j-QazXghkLY
CH32V003 "Super-Cluster": https://www.youtube.com/watch?v=lh93FayWHqw
Powering a Nixie Tube from USB with a CH32V003: https://www.youtube.com/watch?v=-4d3PgEXhdY
(A good rule in life in general is to just always watch CNLohr and Bitluni if you're into "useless but amazing hardware projects")