A digital human is a 3D model based on an actual person that looks and acts as naturally as the person, even in a virtual space.
In recent years, many companies have been developing digital humans for a variety of applications, including advertising, agents, and games.
The degree of fidelity depends on the purpose. Digital humans are already widely used in movies, for example in stunt scenes, and many moviegoers may not even realize they are watching CG. Quality used to be a challenge, but the technology has moved beyond the level of the “uncanny valley.”
Our goal is to produce a digital human with such a realistic presence that it makes us feel “This person is here with us.” Ultimately, we are trying to reproduce a person so perfectly that, if you had to interact with them in a virtual space, you would be unable to tell whether they are an avatar or the actual person.
To reproduce a person’s characteristics, our focus was on the expression of individuality. Each individual has unique facial movements. We believe that if we can reproduce the wrinkles that appear when speaking and the characteristic changes in facial expressions, avatars in the metaverse could look exactly like the people they are based on. Accurately reproducing the individuality of facial expressions is what makes a person feel unique.
From the perspective of natural interaction, we focused on the way the avatar’s gaze meets the user. For example, if the avatar simply stares at the user continuously, the user will feel uncomfortable, so we reproduced more natural behavior by having the avatar emulate human eye and head movements. Ultimately, we believe that the approach and interaction should be based on cognitive science methods, taking into account the user’s mental state.
A typical CG animation workflow consists of the “rigging” process, which involves creating a “skeleton” (rig) for the model, and the “animation” process, which assigns movement by changing the rig parameters on a time axis.
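As a minimal illustration of that workflow (the control names and keyframe values below are hypothetical), a rig can be thought of as a set of named parameters, and an animation as keyframes for those parameters along a time axis:

```python
def interpolate(keyframes, t):
    """Linearly interpolate a rig parameter between (time, value) keyframes."""
    keyframes = sorted(keyframes)
    if t <= keyframes[0][0]:
        return keyframes[0][1]
    if t >= keyframes[-1][0]:
        return keyframes[-1][1]
    for (t0, v0), (t1, v1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return (1 - w) * v0 + w * v1

# Hypothetical facial-rig controls keyframed for a short smile, values in [0, 1].
animation = {
    "jaw_open":    [(0.0, 0.0), (0.5, 0.2), (1.0, 0.0)],
    "smile_left":  [(0.0, 0.0), (0.5, 0.8), (1.0, 0.6)],
    "smile_right": [(0.0, 0.0), (0.5, 0.8), (1.0, 0.6)],
}

pose_at_half_second = {name: interpolate(keys, 0.5) for name, keys in animation.items()}
```

A production facial rig exposes hundreds of such controls, which is what makes hand-animating a faithful likeness so labor-intensive.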
The same goes for facial expression animation, which uses a facial rig. Since real human faces move in complex ways, the facial rigs used to reproduce them are usually complex as well, and it takes a high level of skill to construct a facial rig that faithfully reproduces a real person. In addition, complex facial rigs can be difficult to control, and the process of animating them also requires a high level of skill. We took on the challenge of solving these issues with technology.
The first problem that we tried to solve was to significantly reduce the workload of building complex facial rigs and animating them. To do so, we came up with a new workflow that eliminates the need to build a conventional facial rig.
Specifically, the system uses an actual person as an actor, captures the person’s facial movements, and generates a CG model of their face directly from the captured information. To achieve this, we worked on developing technologies for facial motion capture, which captures the actor’s facial movements, and facial deformation, which accurately reproduces this movement information in CG. In addition, since a CG character does not consist of only a face, we are also taking on the technical challenge of face & body integration to create a full model.
Facial motion capture is a technology that accurately tracks the movement of each part of a performer’s face to obtain the information necessary to make a CG representation of the face move naturally. Facial deformation technology requires the collection of very detailed three-dimensional information to accurately reproduce minute movements such as muscle stretching and contraction and the resulting wrinkles.
First, we analyzed various facial expression patterns and extracted the areas necessary to reproduce those expressions. As a result, it was determined that three-dimensional information on approximately 100 areas and information on the contours of the eyes and lips were needed, so we placed markers on the necessary locations.
To obtain three-dimensional information, images must be captured with multiple synchronized cameras so that the markers’ 3D motion can be reconstructed from the different viewpoints. In actual shooting, multiple cameras are set up at various angles to ensure that the markers’ movements are captured without omission. However, some changes in facial expression can cause marker detection failures or occlusion, so the system uses optical flow and camera information from other angles to keep the acquired information stable.
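The core of this step is standard multi-view triangulation. As a rough sketch, assuming calibrated, synchronized cameras (this is not the production pipeline, which additionally uses optical flow and views from other angles to bridge occlusions), a marker seen by two or more cameras can be triangulated like this:

```python
import numpy as np

def triangulate_marker(projection_matrices, pixel_observations):
    """Estimate a marker's 3D position from synchronized views by linear (DLT) triangulation.

    projection_matrices: list of 3x4 calibrated camera matrices P_i.
    pixel_observations:  list of (u, v) marker detections, one per camera;
                         cameras in which the marker is occluded are simply omitted.
    """
    rows = []
    for P, (u, v) in zip(projection_matrices, pixel_observations):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # homogeneous -> Euclidean
```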
In addition, highly accurate contour detection of the eyes and lips is achieved by using video footage of the performer as training data. The accuracy of the information captured in this process determines the accuracy of the movements in the next process, so we are establishing the technology through a process of trial and error, and it continues to evolve.
Starting this year, we have been using head-mounted cameras (HMCs) for filming. In conventional filming with multiple fixed cameras, the actor must perform while keeping their face in front of the cameras. With HMCs, however, facial expressions can always be captured, meaning that the actor can move freely and body movements can be captured at the same time. A digital human is only complete when the entire body can move naturally, so HMCs are an indispensable tool for obtaining natural movement data.
Since HMCs are worn on the head, only a small number of cameras can be mounted. With fewer cameras, however, it becomes even more important to estimate the three-dimensional positions of the markers and acquire accurate data. Therefore, some of the performer’s facial expressions are captured in advance using both conventional fixed cameras and HMCs, and this data is used to create training data specific to the performer. This training data pairs the performer’s facial expressions with marker positions, and by learning the correlation between them, the 3D marker positions can be captured using HMCs alone while maintaining the same level of accuracy as conventional fixed cameras.
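The article does not say what kind of model learns this correlation; as an illustrative sketch only, a performer-specific regressor trained on a short session captured with both camera setups could look like the following (ridge regression and the array shapes are assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical shapes for one performer: N calibration frames, M markers.
# hmc_2d:   (N, M*2) marker pixel coordinates seen by the head-mounted cameras
# fixed_3d: (N, M*3) 3D marker positions triangulated from the fixed cameras
# Both are recorded simultaneously during a short calibration session.

def fit_performer_model(hmc_2d, fixed_3d):
    """Learn a performer-specific mapping from HMC observations to 3D marker positions."""
    model = Ridge(alpha=1.0)
    model.fit(hmc_2d, fixed_3d)
    return model

def predict_markers_3d(model, hmc_2d_frame):
    """At capture time, recover 3D marker positions from HMC footage alone."""
    return model.predict(hmc_2d_frame.reshape(1, -1)).reshape(-1, 3)
```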
Facial deformation is a technology that deforms CG models to give them human-like expressions based on 3D marker information and the contour information of eyelids and lips obtained through facial motion capture.
The figure shows the processing pipeline for facial deformation.
The first processing block deforms the entire CG model of the face based on 3D marker positions and eyelid contour information obtained through facial motion capture. Here, the markers and eyelid contours must be accurately positioned and the entire model must be plausibly deformed. We defined an energy function with geometric constraints and minimized it to achieve the overall deformation of the CG model.
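The exact energy function is not given in the article; as an illustrative sketch, a common form of such a marker-constrained deformation energy combines a marker-fitting term with a Laplacian smoothness term, and minimizing it reduces to a linear least-squares solve:

```python
import numpy as np

def deform_face(rest_vertices, laplacian, marker_selector, marker_targets, smoothness=1.0):
    """Sketch of a geometric deformation energy (not the authors' exact formulation):

        E(x) = ||S x - m||^2 + lambda * ||L (x - x0)||^2

    x  : deformed vertex positions           x0 : neutral face (rest_vertices, (V, 3))
    S  : marker_selector, (M, V) matrix picking the vertices nearest each tracked marker
    m  : marker_targets, (M, 3) captured 3D marker positions
    L  : laplacian, (V, V) mesh Laplacian keeping the rest of the surface plausibly smooth

    The minimum of E is found per coordinate axis with ordinary least squares.
    """
    x0 = rest_vertices
    A = np.vstack([marker_selector, np.sqrt(smoothness) * laplacian])
    deformed = np.empty_like(x0)
    for axis in range(3):
        b = np.concatenate([marker_targets[:, axis],
                            np.sqrt(smoothness) * (laplacian @ x0[:, axis])])
        deformed[:, axis], *_ = np.linalg.lstsq(A, b, rcond=None)
    return deformed
```

Eyelid contour constraints can be folded into the same system as additional rows, weighted according to how strictly they must be matched.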
The human face varies from individual to individual in terms of wrinkles caused by stretching and contracting the skin and muscular ridges caused by changes in facial expression. The deformation in the first stage can reproduce the position of the markers and the contour shape of the eyelids, but it cannot reproduce wrinkles or other individual characteristics.
To address this, we used more than a dozen previously acquired facial expression patterns to train a machine learning model unique to the individual, allowing individual characteristics to be reproduced in response to changes in facial expression. The model uses the degree of elongation or contraction of each region as its feature values and regresses the residuals between the geometrically deformed model and the ground truth. This allowed us to express deformations that reflect individual characteristics.
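As a sketch of that residual-learning idea (the regressor and the exact feature construction below are assumptions, not the published method):

```python
import numpy as np
from sklearn.linear_model import Ridge

# For each of the previously captured expression patterns we assume:
#   geometric[i] : (V, 3) face produced by the marker-constrained deformation above
#   scanned[i]   : (V, 3) ground-truth geometry of the same expression
#   strain[i]    : (R,)   per-region stretch/contraction ratios relative to the
#                         neutral face, used as the feature vector.

def fit_wrinkle_model(strain, geometric, scanned):
    """Regress per-vertex residuals (wrinkles, muscle ridges) from regional strain."""
    X = np.stack(strain)                                                # (n_expressions, R)
    Y = np.stack([(s - g).reshape(-1) for s, g in zip(scanned, geometric)])
    return Ridge(alpha=1e-2).fit(X, Y)

def add_individual_detail(model, strain_frame, geometric_frame):
    """At runtime, predict and add the person-specific residual detail."""
    residual = model.predict(strain_frame.reshape(1, -1)).reshape(-1, 3)
    return geometric_frame + residual
```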
The last step of facial deformation adds processing to suppress geometric breakdowns. For example, the eyelid is supposed to rest on the eyeball, but conventional deformation can leave a gap between the eyeball and the eyelid or cause the eyeball to break through the skin. In addition, the shape of the oral cavity cannot be observed with a camera, making it difficult to reproduce the correct shape in response to changes in facial expression.
To solve these problems, special deformations were applied to the eyelids, lips, and oral cavity. For the eyelids, a shape deformation algorithm was introduced that assumes the shape of the eyeball and deforms the eyelids so that they touch the eyeball while avoiding interference between the two, making the eyelid motion more realistic. For the lips, deformation processing with geometric constraints was implemented so that the lip contours acquired with facial motion capture match the lip shape of the CG model. For the oral cavity, deformation processing informed by anatomical knowledge is applied to reproduce more realistic behavior.
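To make the eyelid constraint concrete, here is a minimal sketch of one half of it, resolving interference with an assumed spherical eyeball (closing gaps so the lid stays in contact would need an additional pull toward the sphere); the article does not describe the actual algorithm, so the sphere model and contact margin are assumptions:

```python
import numpy as np

def fit_eyelid_to_eyeball(eyelid_vertices, eye_center, eye_radius, margin=5e-4):
    """Push any eyelid vertex that penetrates the (assumed spherical) eyeball back
    onto its surface, so the lid cannot pass through the eye."""
    offsets = eyelid_vertices - eye_center                   # (V, 3)
    dist = np.linalg.norm(offsets, axis=1, keepdims=True)    # (V, 1)
    min_dist = eye_radius + margin
    corrected = eye_center + offsets / np.maximum(dist, 1e-9) * min_dist
    return np.where(dist < min_dist, corrected, eyelid_vertices)
```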
By performing these processes, we were able to achieve realistic facial animation that reflects the individual’s characteristics.
The final step in achieving a realistic digital human is harmonizing facial expressions and body movements. If the face moves separately from the body, it will look unnatural, so face & body integration was developed to realize natural movements throughout the body.
Existing methods are used for deformation of the CG body model. Three-dimensional information on body movements is obtained from optical body motion capture, which attaches markers to the neck, legs, arms, etc. The data is used to estimate the posture of the skeleton, and the mesh associated with the skeleton is deformed to reproduce the shape of the body.
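The skeletal deformation described here is typically done with linear blend skinning; as a sketch of that standard technique (not necessarily the exact method used in this pipeline):

```python
import numpy as np

def linear_blend_skinning(rest_vertices, weights, bone_transforms):
    """Deform a body mesh by blending per-bone transforms.

    rest_vertices:   (V, 3) body mesh in the rest pose
    weights:         (V, B) skinning weights binding each vertex to bones (rows sum to 1)
    bone_transforms: (B, 4, 4) transforms mapping rest-pose positions to the current pose
                     (posed bone transforms composed with the inverse bind matrices,
                     with the pose estimated from the optical motion-capture markers)
    """
    V = rest_vertices.shape[0]
    homogeneous = np.hstack([rest_vertices, np.ones((V, 1))])           # (V, 4)
    per_bone = np.einsum('bij,vj->vbi', bone_transforms, homogeneous)   # (V, B, 4)
    blended = np.einsum('vb,vbi->vi', weights, per_bone)                # (V, 4)
    return blended[:, :3]
```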
Our focus was on the shape of the neck, which is affected by both facial and body movements. The shape of the neck depends on both the movement of the neck bones, which determines the direction of the face, and the movement of the lower jaw, which is linked to the movement of the mouth. Therefore, we built a system to synchronize HMCs and body motion capture that enables us to use 3D information from the movements of the face and body at the same time.
Based on this data, neck deformation technology is used to estimate the shape of the neck. Training data combining multiple mouth shapes and facial orientation patterns is created in advance, and the underlying neck shape bases are extracted from this data. The neck shape is reproduced by combining these bases, and how they are combined is inferred from the lower-jaw movement and neck orientation in the captured data. This reproduces natural facial and body movements. Currently, the neck shape is reproduced from a large amount of training data, but we are continuing development to reproduce it from less data in the future.
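As an illustrative sketch of that basis-combination idea (PCA for basis extraction and ridge regression for weight inference are stand-ins; the article does not specify either choice):

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_neck_model(jaw_and_neck_params, neck_shapes, n_bases=10):
    """Extract neck-shape bases from training captures and learn to predict their weights.

    jaw_and_neck_params: list of per-frame feature vectors (jaw pose, neck orientation)
    neck_shapes:         list of per-frame (V, 3) neck geometries from the training data
    """
    shapes = np.stack([s.reshape(-1) for s in neck_shapes])        # (n_frames, V*3)
    mean = shapes.mean(axis=0)
    # Principal components of the centered shapes serve as the deformation bases.
    _, _, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    bases = vt[:n_bases]                                            # (n_bases, V*3)
    weights = (shapes - mean) @ bases.T                             # per-frame coefficients
    regressor = Ridge(alpha=1.0).fit(np.stack(jaw_and_neck_params), weights)
    return mean, bases, regressor

def predict_neck_shape(mean, bases, regressor, jaw_and_neck_frame, n_vertices):
    """At runtime, infer basis weights from jaw and neck parameters and rebuild the shape."""
    w = regressor.predict(jaw_and_neck_frame.reshape(1, -1))[0]
    return (mean + w @ bases).reshape(n_vertices, 3)
```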
The development of this technology has automated the entire process of creating natural human facial expressions in video, greatly reducing the work that creators previously did by hand based on experience and knowledge, and enabling them to deliver higher-quality results.
To achieve more natural behavior, we would like to reproduce a person’s characteristics in their body movements as well as their facial expressions.
With the advent of the metaverse era, digital humans will become a familiar presence, and the need to easily create one’s own avatar will increase. In the future, we would like to provide digital humans that capture the individuality of the consumer.
The video at the beginning is of a digital human produced in collaboration with RikaRiko of Sony Music Artists. One of Sony’s strengths is the ability to conduct verification tests with expressive artists and creators through in-group collaboration. We will continue to take advantage of this environment to collaborate in areas such as films and games to advance digital human technology.
We not only want to represent people realistically as 3D models, but also to use these models to provide users with new experiences that they cannot have in reality. What began as a vague idea at the start of the project has gradually taken shape, and we are now conducting R&D to take it further, thanks to Sony’s culture of taking on challenges and the support of many people both inside and outside the company. We will do our best to support people with a strong passion to achieve their goals.
To express people means to know them more deeply. In the future, I would like to work on developing technology that can reproduce emotions through facial expressions and body movements, and conversely, technology that can read emotions from facial expressions and body movements. I think that people who are interested in CG, or who are already creating CG with game engines, can apply their skills in our field. This technology will ultimately have a diverse range of applications, so it is important to take an interest in a wide variety of fields.
I find it rewarding to conduct research that improves the world through the power of imaging technology. While technical skills are of course important for conducting research, I feel that the secret to success is to have fun with whatever you do. I regularly enjoy anime and video games, and this is what drives me in my research. For students in the sciences, I would like you to see your current research through to the end. That experience will be useful, no matter what job you end up getting.