Microsoft researchers have begun producing human animations with artificial intelligence, claiming that almost anyone's facial movements can be synchronized with speech and sound clips. This is hardly surprising, because "deepfake" videos now appear in every area of our lives. While some of these videos are merely entertaining, others distort the statements of politicians.
The pace of development suggests that deepfakes will never disappear, especially now that a company as important as Microsoft has stepped into the field with a new technique. Moreover, Microsoft is far from alone in this area.
Last June, Samsung researchers described an end-to-end model that can render a person's eyebrows, mouth, eyelashes, and cheeks precisely. A few weeks later, Udacity introduced a system that automatically produces lesson videos from audio narration. Two years ago, Carnegie Mellon researchers published work describing an approach for transferring facial movements from one person to another.
Building on these and other studies, the Microsoft Research team proposed a technique they claim improves the quality of audio-driven talking-head animation. Previous approaches to generating talking heads required clean, relatively noise-free audio spoken in a neutral tone. The researchers say, however, that their new method, which factors audio sequences into components such as phonetic content and background noise, can generalize to noisy and emotionally rich audio samples.
Human speech is full of variation: different people can utter the same word at different times, with different intonation, and in different contexts. Beyond its phonetic content, speech also conveys abundant information about the speaker's emotional state, identity (gender, age, ethnicity), and personality. Microsoft describes its new research as the first approach to improve talking-head performance from the perspective of learning audio representations.
Underlying the proposed technique is a variational autoencoder (VAE) that learns latent representations. The VAE converts input audio sequences into distinct representations that encode content, emotion, and other factors of variation. From the input audio, a sequence of content representations is sampled from this distribution and fed, together with input face images, to a video generator that animates the face, so that the generated facial motion matches the audio.
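The pipeline described above can be sketched roughly as follows. This is a minimal toy illustration, not Microsoft's implementation: the network weights are random stand-ins, the dimensions are invented, and the encoder/generator names are hypothetical. It only shows the data flow — encoding audio frames into separate content and emotion latent distributions, sampling with the VAE reparameterization trick, and combining the content latents with a face embedding in a generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(audio_frames, w_content, w_emotion):
    """Map each audio frame to (mean, log-variance) of two latent
    distributions: one for phonetic content, one for emotion and other
    variation factors. Weights are random stand-ins for a trained net."""
    h = np.tanh(audio_frames @ w_content["hidden"])
    g = np.tanh(audio_frames @ w_emotion["hidden"])
    return ((h @ w_content["mu"], h @ w_content["logvar"]),
            (g @ w_emotion["mu"], g @ w_emotion["logvar"]))

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (the VAE reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def generate_frames(content_z, face_embedding, w_gen):
    """Combine per-frame content latents with a fixed face embedding and
    decode each pair into one flattened video frame."""
    face = np.tile(face_embedding, (content_z.shape[0], 1))
    return np.tanh(np.concatenate([content_z, face], axis=1) @ w_gen)

# Toy dimensions: frames, audio features, hidden, latent, face emb., pixels.
T, A, H, Z, F, P = 25, 13, 32, 8, 16, 64
mk = lambda shape: rng.standard_normal(shape) * 0.1

w_content = {"hidden": mk((A, H)), "mu": mk((H, Z)), "logvar": mk((H, Z))}
w_emotion = {"hidden": mk((A, H)), "mu": mk((H, Z)), "logvar": mk((H, Z))}
w_gen = mk((Z + F, P))

audio = rng.standard_normal((T, A))            # e.g. per-frame MFCC features
(mu_c, lv_c), (mu_e, lv_e) = encode(audio, w_content, w_emotion)
content_z = reparameterize(mu_c, lv_c)         # content drives lip motion
emotion_z = reparameterize(mu_e, lv_e)         # emotion modeled separately
frames = generate_frames(content_z, mk((F,)), w_gen)
print(frames.shape)                            # one flat frame per audio frame
```

In the real system each of these maps is a trained neural network and the generator produces images rather than flat vectors, but the separation of content from other variation factors before generation is the key idea the paper attributes to the approach.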
The team says their approach matches other methods on all benchmarks for clean, neutral speech. Moreover, it performs consistently across the entire emotional spectrum and is compatible with all current state-of-the-art approaches to talking-head generation.