Microsoft AI Generates Talking Heads from Noisy Audio Samples - benoitcabol2001

Researchers at Microsoft have developed a technique that aims to improve the accuracy and quality of talking head generation by focusing on the audio stream. Current talking head generation techniques require a clean, noise-free audio input with a neutral tone, but the researchers claim that their method "disentangles audio sequences" into factors like phonetic content, emotional tone, and background noise in order to work with any given audio sample.

"As we all know, speech is riddled with variations. Different people utter the same word in different contexts with varying duration, amplitude, tone, etc. In addition to linguistic (phonetic) content, speech carries abundant information revealing about the speaker's emotional state, identity (gender, age, ethnicity) and personality to name a few," wrote the researchers in a paper titled "Animating Face using Disentangled Audio Representations".

The researchers' proposed method works in two stages. First, a variational autoencoder (VAE) learns disentangled representations from the audio source. Once the disentangling is done, a GAN-based video generator produces talking heads from the categorized audio input, conditioned on the input face image.
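To make the two-stage idea more concrete, here is a minimal sketch in plain Python. It is not the researchers' implementation: the factor names, the latent sizes, and the helper names (`split_posterior`, `generator_conditioning`) are assumptions made up for illustration, and no real encoder or GAN is trained. It only shows how a VAE-style latent might be partitioned into separate factors, and how a downstream generator could be fed just the factors it needs.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Hypothetical factor names and latent sizes, chosen only for illustration.
FACTOR_DIMS: Dict[str, int] = {"content": 4, "emotion": 2, "noise": 2}


@dataclass
class Gaussian:
    """Diagonal Gaussian posterior over one latent factor."""
    mean: List[float]
    log_var: List[float]

    def sample(self, rng: random.Random) -> List[float]:
        # Reparameterization trick: z = mean + std * eps, with eps ~ N(0, 1).
        return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
                for m, lv in zip(self.mean, self.log_var)]

    def kl_to_standard_normal(self) -> float:
        # KL(N(mean, var) || N(0, 1)), summed over the factor's dimensions.
        return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                         for m, lv in zip(self.mean, self.log_var))


def split_posterior(means: Sequence[float],
                    log_vars: Sequence[float]) -> Dict[str, Gaussian]:
    """Slice one flat (stand-in) encoder output into per-factor posteriors."""
    factors: Dict[str, Gaussian] = {}
    start = 0
    for name, dim in FACTOR_DIMS.items():
        end = start + dim
        factors[name] = Gaussian(list(means[start:end]), list(log_vars[start:end]))
        start = end
    return factors


def generator_conditioning(factors: Dict[str, Gaussian],
                           rng: random.Random,
                           keep: Sequence[str] = ("content",)) -> List[float]:
    """Stage-two input: sample only the factors chosen to drive the face.

    Keeping only "content" by default is an assumption of this sketch.
    """
    z: List[float] = []
    for name in keep:
        z.extend(factors[name].sample(rng))
    return z
```

In a real system the means and log-variances would come from a trained audio encoder, and the sampled vector would be passed to the video generator together with the face image. The point here is only that noise and emotion live in their own slices of the latent, so they can be left out or handled separately.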

Microsoft researchers used three different datasets to train and test the VAE, namely GRID, CREMA-D, and LRS3. GRID is an audiovisual sentence corpus that contains 1,000 recordings from 34 people – 18 male, 16 female. CREMA-D is an audio dataset consisting of 7,442 clips from 91 ethnically diverse actors – 48 male, 43 female. LRS3 is a dataset with over 100,000 spoken sentences from TED videos.

Based on their analysis of the test results, the researchers say that their method is capable of performing consistently across the entire emotional spectrum. "We validate our model by testing on noisy and emotional audio samples, and show that our approach significantly outperforms the current state-of-the-art in the presence of such audio variations."

The researchers have also mentioned that their project can be expanded in the future to identify other speech factors, such as a person's identity and gender. So, what are your thoughts on this audio-driven talking head generation technique? Let us know in the comments.