Facial animation and lip-syncing can be the difference between a successful animation and a ropey one. It requires a massive amount of effort and incredible skill to get it right.
Lip-syncing is the act of getting your character’s mouth to move along with an audio track of spoken dialog in order to create a convincing performance. While simply moving the jaw up and down may suffice for quick character animation work, it’s not going to be enough if you intend your character to be engaging and believable, or at least not distracting or weird-looking.
Just as with non-verbal facial animation, lip-syncing seems to get exponentially more difficult the more accurate you try to be. The brain is hardwired to decode human facial movements to the nth degree of subtlety, so it can be very difficult to convince your audience that what they are seeing is a natural and realistic performance from a digital character.
First, in order to make a character talk you must animate the face so that the mouth shapes match the words in the audio file. The basic unit of speech in this case is not a word but a phoneme. Phonemes are the fundamental sounds that make up words, such as the ‘th’ sound in ‘that’ or the ‘oo’ in ‘spooky’.
When lip-syncing in 3D, you don’t need to create custom mouth shapes for each and every word. Instead, you create a library of phoneme shapes and cobble together the performance by keyframing those shapes on your mesh, making sure the timing matches their occurrence in the audio file. Despite the relatively small pool of phonemes, it’s still a complex and time-consuming business. You do, however, get to decide how large a phoneme set to use, depending on the quality you need from the final animation. For good basic results you can get away with around nine phoneme shapes, while for more accurate results you may need to use as many as 40 or more.
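As a rough illustration of the idea, here is a minimal Python sketch of a small shape library and the timing step. The nine shape names, the grouping of phonemes into them, the 25fps default and the function name `phonemes_to_frames` are all assumptions made for the example; they are not taken from any particular application.

```python
# Hypothetical nine-shape library: eight mouth shapes plus a rest pose.
# The grouping below is illustrative only, not a standard set.
REST = "REST"
PHONEME_TO_SHAPE = {
    "m": "MBP", "b": "MBP", "p": "MBP",
    "f": "FV", "v": "FV",
    "th": "TH",
    "l": "L",
    "oo": "OO", "w": "OO",
    "ah": "AH",
    "ee": "EE",
    "oh": "OH",
}


def phonemes_to_frames(timed_phonemes, fps=25):
    """Convert (phoneme, time_in_seconds) pairs into (shape, frame) pairs.

    Phonemes missing from the library fall back to the rest pose.
    """
    keys = []
    for phoneme, seconds in timed_phonemes:
        shape = PHONEME_TO_SHAPE.get(phoneme, REST)
        frame = round(seconds * fps)  # nearest whole frame on the timeline
        keys.append((shape, frame))
    return keys


if __name__ == "__main__":
    # Timings here are made up; in practice they come from the audio track.
    print(phonemes_to_frames([("th", 0.0), ("ah", 0.12), ("zz", 0.4)]))
```

Moving to a larger, more accurate set would mean adding entries to the table rather than changing the timing logic.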
The phoneme shapes, called visemes, are generally created before you animate, and as each phoneme occurs you set a keyframe that applies the corresponding viseme to the mesh. How this is done will vary from application to application. In Maya, for instance, you might create a series of Blend Shapes or bone poses for each viseme, while in LightWave you would probably use Morph maps. Either way, you deform the mesh to match the sound file.
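Whatever the application, the keyframing step boils down to driving one weight per viseme over time. The sketch below is application-neutral Python, not Maya or LightWave API code: the function name `build_weight_keys` and the simple rule it uses (a viseme is fully on at its own frame and fully off at the frames of its neighbours) are assumptions chosen to keep the example short.

```python
def build_weight_keys(viseme_keys):
    """Build per-viseme weight keyframes from (viseme, frame) pairs.

    viseme_keys must be sorted by frame. Returns a dict mapping each
    viseme name to a sorted list of (frame, weight) keys. Each viseme is
    keyed to 1.0 on its own frame and 0.0 on the neighbouring visemes'
    frames, so that shapes blend in and out between keys.
    """
    tracks = {}

    def set_key(viseme, frame, weight):
        track = tracks.setdefault(viseme, {})
        # If two keys land on the same frame, keep the stronger one.
        track[frame] = max(track.get(frame, 0.0), weight)

    for i, (viseme, frame) in enumerate(viseme_keys):
        if i > 0:
            set_key(viseme, viseme_keys[i - 1][1], 0.0)
        set_key(viseme, frame, 1.0)
        if i + 1 < len(viseme_keys):
            set_key(viseme, viseme_keys[i + 1][1], 0.0)

    return {name: sorted(track.items()) for name, track in tracks.items()}


if __name__ == "__main__":
    # Hypothetical viseme names and frame numbers, for illustration only.
    for name, keys in build_weight_keys([("TH", 0), ("AH", 3), ("TH", 6)]).items():
        print(name, keys)
```

In a real package these (frame, weight) pairs would be applied to whatever the viseme is built from, such as a Blend Shape or Morph map weight channel. Fully on and fully off keys are only a starting point, and the result would usually need softening by hand.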
Most 3D applications allow you to perform basic lip-syncing using their standard tools, though being able to view and hear the audio waveform directly in the timeline is a great help if it’s supported. Without this ability you will be forced to plot out the timing of the phonemes with a pencil and paper and use this as a reference.
However, there are a number of dedicated lip-syncing tools available that help speed up the process and even automate some of it. In this round-up we have six of the best lip-sync tools and utilities on test, and it’s interesting how differently each handles the process.
One of the most laborious aspects of lip-syncing is ascertaining which phoneme is where in an audio clip. It’s no surprise then that some of the applications focus on this problem by analyzing the audio and extracting phoneme animation from it automatically. Some of the applications let you apply the animation to a model in a 3D view, while others only export animation files that are then applied to control objects or used for keyframing morph targets in your main 3D application. While their functionality and methods vary, they should all help speed up what is a time-consuming task.