Since Microsoft recently invested $10 billion in OpenAI, it’s only fair that they adopt the same naming scheme that OpenAI uses, and thus the announcement of their new research model, VALL-E.
Now VALL-E is essentially a really, really good speech duplicator. From a short snippet of audio, along with some example text, it can generate a readout of the text in the tone, emotion and voice of the snippet. It can also modify the speech to sound angrier, sadder or more melancholy, as desired.
Basically, what it does is learn a codec representation of the audio. A codec is a compact numerical representation of an audio signal, usually used to make audio files significantly smaller. It then generates the codec tokens for the output, conditioned on the text and the input audio, using a transformer (of course) architecture.
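To make the codec idea concrete, here's a toy sketch (in plain Python, nothing like Microsoft's actual neural codec): a "codec" just turns a waveform into a short sequence of discrete integer tokens and back, and VALL-E-style generation means predicting those tokens with a language model instead of predicting raw audio samples. The uniform quantizer below is an illustrative assumption, chosen for simplicity.

```python
def encode(samples, n_levels=256):
    """Toy codec encoder: uniformly quantize samples in [-1, 1]
    into discrete integer tokens 0..n_levels-1."""
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip to the valid range
        tokens.append(int((x + 1.0) / 2.0 * (n_levels - 1)))
    return tokens

def decode(tokens, n_levels=256):
    """Toy codec decoder: map tokens back to approximate samples."""
    return [t / (n_levels - 1) * 2.0 - 1.0 for t in tokens]

# A tiny "waveform": the codec turns it into tokens a transformer
# could model, and decoding recovers a close approximation.
wave = [0.0, 0.5, -0.5, 1.0, -1.0]
toks = encode(wave)
approx = decode(toks)
```

A real neural codec (like the one VALL-E builds on) learns this mapping instead of hard-coding it, so the tokens capture far more audio per token, but the pipeline shape is the same: audio in, discrete tokens out, tokens back to audio.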
Now the speech it generates isn’t the best - in fact it sounds subtly alien - but it is still pretty good for the amount of data it’s trained on, which is only 60,000 hours of speech. Scale it up massively, make the dataset even bigger, and we could get even better audio generation. Of course, the old constraint of needing an absurdly large number of data samples is still present, as for all audio-generating AI, and the codec is a bit of a kludge around this fact, but there’s no reason to think the token count won’t keep getting larger and larger, one or two or ten papers down the line. In 2084, generating the millions of tokens needed for genuinely lifelike sound might not even be considered a challenge.