Today’s article will be very short, as I’ve spent the time I usually use to write it checking out various kinds of speech-to-text models. It’s amazing to see how much the space has exploded in the last two years. I remember not too long ago how terrible most of them were, and now they’re practically perfect, and a large number of them are open source.
Firstly, I took a look at Hugging Face’s collection of open source speech-to-text models in their Transformers library. They have a lot of them, covering a variety of different tasks - sentiment analysis, speaker differentiation, and of course, transcription - and there are many variations that purport to have a wide range of different effects. From there, I found SpeechBrain, an open source end-to-end toolkit for running a variety of speech models, which includes a sophisticated set of pretrained, ready-to-run automatic speech recognition models. The Colab notebooks they provide are also great to read before trying to understand the other models, as they give a high-level overview of the current state of the art in speech processing, which they present for speech recognition as:
Basically: preprocess the sound wave, extract features, run them through a transformer, recurrent neural network, or similar model to get posterior probabilities over output tokens, and then finally pass those probabilities to a beam searcher that explores different alternatives and outputs the best one. It’s quite interesting to read.
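If you just want to see that whole pipeline run end to end, both libraries hide it behind a few lines. Here’s a minimal sketch - the checkpoint names are real ones from the Hugging Face Hub and SpeechBrain, but the audio file path is a placeholder, and the exact import paths may differ between library versions:

```python
# Transcribing a local audio file with a pretrained model from the Transformers library.
# Feature extraction, the acoustic model, and decoding all happen under the hood.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",  # example checkpoint; any ASR model from the Hub works the same way
)
print(asr("speech_sample.wav")["text"])   # "speech_sample.wav" is a placeholder path

# The SpeechBrain equivalent, using one of their pretrained encoder-decoder models
# (beam search over the output tokens is part of the packaged recipe).
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr_model.transcribe_file("speech_sample.wav"))
```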
However, a lot of these models were too slow - waiting around for 15 minutes is no fun - so I went searching for faster ones, and found the following GitHub repo claiming that by using “non-autoregressive models”, i.e. models that don’t output words one after another but can produce a bunch of words at once, you can achieve a significant speedup. However, none of these papers had released an open source implementation or a pretrained model, so I couldn’t verify whether that was true or not.
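To make the distinction concrete, here’s a toy sketch of why non-autoregressive decoding can be so much faster - no real model involved, just stub functions standing in for the decoder: the autoregressive loop needs one forward pass per output token, while the non-autoregressive version predicts every position in a single pass.

```python
# Toy illustration only - the "model" calls below are stubs, not real networks.

def ar_forward_pass(prefix):
    """Stand-in for one forward pass of an autoregressive decoder."""
    return f"tok{len(prefix)}"

def nar_forward_pass(length):
    """Stand-in for one forward pass of a non-autoregressive decoder."""
    return [f"tok{i}" for i in range(length)]

def autoregressive_decode(length):
    tokens = []
    for _ in range(length):            # one model call per token -> slow for long outputs
        tokens.append(ar_forward_pass(tokens))
    return tokens

def non_autoregressive_decode(length):
    return nar_forward_pass(length)    # a single model call for the whole sequence

print(autoregressive_decode(5))        # 5 sequential "forward passes"
print(non_autoregressive_decode(5))    # 1 "forward pass"
```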
Then I decided to look at some more commercial tools to see what was out there, and I found Deepgram, which is probably by far the most impressive tool I looked at. It could transcribe a 15 minute speech in 10 seconds for about 30c, which was far better than a lot of the other models I could find, and probably the best on the market today. With a $200 initial credit and a cost of 1.5c per minute, it was reasonably priced too. Definitely the best option if you don’t like waiting around.
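For reference, using it is basically one HTTP request. The sketch below is how I’d expect a call to their hosted API to look, based on my reading of their docs at the time - the endpoint, auth header, and response shape should be double-checked against the current API reference, and the API key and file path are placeholders:

```python
# Rough sketch of transcribing a local file with Deepgram's hosted API.
# Endpoint, auth scheme, and response layout as I recall them - verify against their docs.
import requests

DEEPGRAM_API_KEY = "YOUR_API_KEY"      # placeholder
AUDIO_FILE = "speech_sample.wav"       # placeholder

with open(AUDIO_FILE, "rb") as audio:
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/wav",
        },
        data=audio,
    )

response.raise_for_status()
result = response.json()
# The transcript typically sits under results -> channels -> alternatives,
# but print the whole payload if the shape has changed.
print(result["results"]["channels"][0]["alternatives"][0]["transcript"])
```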
I was surprised, though, at the lack of variety in training data. Most of these models seemed to use the same datasets over and over again, probably because they’re easy to access and use. In addition, the sentiment analysis, as far as I could see, was mostly a simplistic “positive” or “negative”. I think there’s a definite gap for a more business-inclined sentiment analysis tool that outputs more nuanced labels like “investible”, “seriousness”, or the like.
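To illustrate what I mean, the default sentiment pipeline in Transformers (which, last I checked, falls back to a DistilBERT model fine-tuned on SST-2) only ever gives you one of two labels - the example sentence here is just made up:

```python
from transformers import pipeline

# Default sentiment model - a binary classifier, so every input maps to POSITIVE or NEGATIVE.
sentiment = pipeline("sentiment-analysis")

print(sentiment("Revenue grew 40% year over year, but the CEO just resigned."))
# -> something like [{'label': 'POSITIVE', 'score': 0.98}] - no notion of
#    "investible", "serious", or any other business-flavoured dimension.
```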
There are also so many platforms at the moment. A lot of models are scattered across a variety of platforms and frameworks, most of which are poorly documented and require a variety of extra tools to get working. While PyTorch or Keras might be the basis, a lot of the code requires additional tooling which isn’t as readily available. Of course this is always the issue with new and immature research, but it does make finding interesting models hard - Hugging Face has 347 pretrained models for the espnet2 framework alone, and there’s no easy way to see how they differ and what those differences mean. In addition, a lot of papers have no code, which makes it quite difficult to understand what exactly they mean.
Of course, that is an issue with most of AI and its models. I read an article recently, which I can’t find again (of course), that talked about the relative absence of research into “Transfer Learning”, or applying general models to specific problems. A lot of AI models seem to try to solve everything at once, but most applications are very specific, and this mismatch might be why uptake by business is slow - businesses deal a lot more in specifics than generalities. I think that more specific datasets, for example datasets of investor briefs or stock reports, could be an interesting domain of research in the future, rather than just endless variations on model architectures.
In 2084, this might be extremely commonplace - for everything you do, you select a super general model, and then a dataset for it to apply to. I think that the technology just needs a couple more years to mature.