Once More With Feeling

Originally published on the Clarify.io blog. View archived copy.

While we’re usually concerned with extracting text from audio and video, at least one member of our team has some background in doing the exact opposite: turning text into audio and video.

Earlier this week, our language scientist BalaKrishna Kolluru was granted his first patent in the United Kingdom. This patent is unique in that it begins to address the dreaded Uncanny Valley in computer science. In the simplest terms, the Uncanny Valley is that odd sense of revulsion you get from a robot or computer animation that is very lifelike but not quite right. It is believed to have a number of causes, and a lack of body language, facial expressions, and vocal variation all play a part.

BalaKrishna and his team set out to address a portion of this problem by “training” a computer-animated head to express and display emotions as it speaks. Here is a portion of the abstract:

A method of animating a head and displaying text of an electronic book, whereby the head has a mouth which moves in synch with an audio sound of the text being displayed thus lip-syncing the words, the method comprises steps of inputting the text of said book; dividing said input text into a sequence of acoustic units; determining expression characteristics for the inputted text; calculating a duration for each acoustic unit using a duration model; converting said acoustic units to image vectors using a statistical model, which define a face of said head;
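For readers who think in code, the steps listed in the abstract can be read as a small pipeline. What follows is only a toy sketch of that flow, not the patented method: every name in it (split_into_units, detect_expression, duration_ms, image_vector, animate) is hypothetical, letters stand in for real acoustic units, and simple hand-written rules stand in for the duration model and the statistical model the abstract mentions.

```python
from dataclasses import dataclass
from typing import List

# Assumption for illustration only: vowels open the mouth wider and last longer.
VOWELS = set("aeiou")

@dataclass
class Frame:
    unit: str                  # the "acoustic unit" (here just a letter)
    expression: str            # expression label for the whole input text
    duration_ms: int           # how long this unit is held
    image_vector: List[float]  # [mouth_open, brow_raise], a toy face description

def split_into_units(text: str) -> List[str]:
    """Toy stand-in for dividing text into acoustic units: keep letters only."""
    return [ch for ch in text.lower() if ch.isalpha()]

def detect_expression(text: str) -> str:
    """Toy stand-in for determining expression characteristics from punctuation."""
    if "!" in text:
        return "excited"
    if "?" in text:
        return "curious"
    return "neutral"

def duration_ms(unit: str) -> int:
    """Toy stand-in for a duration model: fixed lengths, not learned ones."""
    return 120 if unit in VOWELS else 60

def image_vector(unit: str, expression: str) -> List[float]:
    """Toy stand-in for the statistical model that maps units to face parameters."""
    mouth_open = 1.0 if unit in VOWELS else 0.2
    brow_raise = {"neutral": 0.0, "curious": 0.5, "excited": 1.0}[expression]
    return [mouth_open, brow_raise]

def animate(text: str) -> List[Frame]:
    """Run the steps in the order the abstract lists them."""
    expression = detect_expression(text)
    return [
        Frame(u, expression, duration_ms(u), image_vector(u, expression))
        for u in split_into_units(text)
    ]

if __name__ == "__main__":
    for frame in animate("Hi!"):
        print(frame)
```

Each Frame pairs one unit with a duration and a face description, which is roughly the information a renderer would need to keep the mouth in sync with the audio while the expression colors the rest of the face.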

And this is what the patented method becomes in the real world:

If you’re interested in learning more about this patent, the underlying science, or speech analysis in general, don’t hesitate to drop us a note via email or Twitter.