How much can we infer about a person's appearance by simply listening to the sound of their voice?
This is an interesting question. Some of us may be able to recognize the gender and age of the speaker and, in some cases, with the help of their accent, the ethnicity as well.
We also often wonder what a person looks like while we are speaking to them. Even though we can detect gender, age, and ethnicity with some accuracy, the picture we form in our minds rarely comes close to what we see when we meet the person for the first time. Sometimes we say to ourselves, or to someone we have just met, "How young you look", probably because the preceding telephone conversation had led us to expect an older person, or vice versa.
Voice biometrics has been able to predict age and gender for decades, and it does so more accurately than our own estimations. It also recognizes a person's voice based on the physical characteristics of the speaker that are responsible for producing sound.
More recently, researchers from MIT's Computer Science and Artificial Intelligence Laboratory published an article, “Speech2Face: Learning the Face Behind a Voice”, showing that artificial intelligence can reconstruct people's faces in relatively impressive detail, using short audio clips of their voices as a reference.
Artificial intelligence continues to advance rapidly, and voice biometric accuracy has improved impressively alongside it, thanks to fine-tuned algorithms and better-quality audio made possible by the many players contributing noise reduction techniques. Corsound is the first commercial entity to stretch these limits with artificial intelligence and its voice biometric technology. It is training an algorithm that draws on 200 patents to create a Deep Neural Network, using millions of data points, that generates an estimated image of an individual simply by listening to their voice.
With the help of complex models, the system learns from and analyzes audio, then produces an image reflecting the speaker's physical characteristics, age, gender, and ethnicity. The picture is not a photograph but an AI-generated image that bears a strong resemblance to the actual person. When compared against a group of images, the generated image would yield the highest match probability for the actual speaker. Corsound is very encouraged by and optimistic about the initial results and is focused on fine-tuning the accuracy of this ground-breaking technology.
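To make the idea of "highest probability of a match" concrete, here is a minimal sketch of how a generated image could be ranked against a group of candidate images. It is an illustration only and not Corsound's method: it assumes each image has already been converted to a numeric embedding by some face-recognition model, and the function names, the `temperature` parameter, and all numbers are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_probabilities(query, gallery, temperature=0.1):
    """Turn similarity scores into probabilities with a softmax.

    `temperature` is an assumed tuning knob: smaller values make the
    distribution more peaked around the best match.
    """
    scores = {
        name: cosine_similarity(query, emb) / temperature
        for name, emb in gallery.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores.values())
    exp_scores = {name: math.exp(s - max_score) for name, s in scores.items()}
    total = sum(exp_scores.values())
    return {name: v / total for name, v in exp_scores.items()}

# Made-up 3-dimensional embeddings; real face embeddings are much longer.
generated = [0.9, 0.1, 0.3]          # embedding of the AI-generated image
gallery = {
    "speaker":  [0.8, 0.2, 0.4],     # photo of the actual speaker
    "person_b": [0.1, 0.9, 0.2],
    "person_c": [-0.5, 0.3, 0.7],
}

probs = match_probabilities(generated, gallery)
best_match = max(probs, key=probs.get)
print(best_match, probs)
```

In this toy setup the generated image's embedding lies closest to the actual speaker's, so that entry receives the largest share of the probability. How close real generated images come to real photographs depends entirely on the models involved.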
The potential use cases for government security are numerous and need little elaboration. There are also many use cases in civilian, commercial, social, financial, and healthcare settings, from authentication and fraud detection to better customer engagement and sentiment made possible by looking at an image and not just listening to a voice.