AAAI-20

Image Processing Technology Area
(human recognition technology)

Jul 17, 2020

This section focuses on human recognition research grounded in the understanding of human figure and movement. It introduces 7 papers presented at AAAI-20 under 3 topics: 3D human pose estimation from monocular 2D images, search and tracking of specific persons, and action recognition.

Topic1:3D Human Pose Estimation from Monocular Images

2D human recognition, represented by OpenPose*1, has been extended to 3D, and many techniques for estimating 3D human skeletons have been proposed. Estimating a 3D human skeleton from a 2D image is difficult for two reasons. The first is data: high estimation accuracy requires a sufficient number of 2D images paired with the corresponding 3D human skeleton data, and such data is difficult to collect in large amounts. The second is occlusion: the method has to predict body parts that are hidden by other objects in the 2D image. The following papers proposed methods to address these problems.

[1]Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation

This research focused on the shortage of training data for 3D human skeleton recognition and proposed addressing it by improving on existing datasets and methods. Using only 2D images as the training dataset, it trains an auto-encoder to recover the original 2D image from the skeleton. Although the training data contains no human skeleton information, the method achieved the same level of performance as supervised learning that does use such information.

[2]Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations

This research also focused on the shortage of training data for 3D human skeleton recognition. The proposed method reduced the complexity of the problem by employing an intermediate representation of the human body*2 between the image and the 3D human skeleton.

[3]Deep Reinforcement Learning for Active Human Pose Estimation

This study proposed a unique idea for the problem of estimating the parts of the human body that are not visible. Assuming a movable camera, the camera itself learns through reinforcement learning to move to a position with a clear view when the subject is occluded. The camera successfully moved to a point where it could recognize a specific person while avoiding the occlusion.
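To make the idea of a camera that learns where to move more concrete, here is a toy sketch. It is not the paper's method, which uses deep reinforcement learning on real images. It assumes a hypothetical ring of 8 camera positions around the person, of which only one gives an unoccluded view, and it swaps in plain tabular Q-learning. All names and numbers (N_VIEWS, VISIBLE, the reward of 1.0) are assumptions made for illustration.

```python
import random

N_VIEWS = 8            # hypothetical: camera positions on a ring around the person
VISIBLE = {4}          # hypothetical: the only position with an unoccluded view
ACTIONS = (-1, 0, 1)   # move left, stay, move right


def step(view, action):
    """Move the camera and return (next_view, reward)."""
    nxt = (view + action) % N_VIEWS
    return nxt, (1.0 if nxt in VISIBLE else 0.0)


def train(steps=20000, gamma=0.9, seed=0):
    """Tabular Q-learning with uniform random exploration."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N_VIEWS)]
    view = rng.randrange(N_VIEWS)
    for _ in range(steps):
        a = rng.randrange(len(ACTIONS))
        nxt, reward = step(view, ACTIONS[a])
        # Learning rate 1 is acceptable here because the toy world is deterministic.
        q[view][a] = reward + gamma * max(q[nxt])
        view = nxt
    return q


def greedy_path(q, start, max_moves=N_VIEWS):
    """Follow the learned policy until the person is visible."""
    path = [start]
    view = start
    while view not in VISIBLE and len(path) <= max_moves:
        a = max(range(len(ACTIONS)), key=lambda i: q[view][i])
        view, _ = step(view, ACTIONS[a])
        path.append(view)
    return path
```

Calling greedy_path(train(), 0) returns the sequence of camera positions visited from an occluded start. In this toy world the learned policy simply walks around the ring toward the visible position.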

Topic2:Re-ID person search

Next, we report on studies that search for and track specific persons. This type of person recognition focuses on the whole body rather than on the face. Unlike general person recognition from images, differences in clothing and similar cues are powerful information for identifying a specific person. In addition, since people do not appear and disappear suddenly in videos, the frames before and after the target frame can also be used for estimation. In the following papers, the authors devised ways to learn these features explicitly for the specific tasks.

[4]Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search

This study proposed two network models that capture people at different levels of detail in order to retrieve a person from images based on a text description. A text describing a specific person often includes features associated with individual body parts, such as the arms. The Coarse Alignment Network captures the whole body roughly. The Fine-Grained Alignment Network captures 6 parts (head, upper torso, arm, hand, leg, foot) in detail. Combining these two networks improved search performance.
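The combination of a whole-body score and a part-level score can be sketched roughly as follows. This is only an illustration under strong assumptions: the embeddings are taken as given, cosine similarity stands in for the learned alignment, and the function names and the weight parameter are hypothetical rather than taken from the paper.

```python
import math

PARTS = ("head", "upper torso", "arm", "hand", "leg", "foot")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coarse_score(text_vec, image_vec):
    """Whole sentence against whole-body image feature."""
    return cosine(text_vec, image_vec)


def fine_score(phrase_vecs, part_vecs):
    """Each phrase is matched to its most similar body part, then averaged."""
    if not phrase_vecs:
        return 0.0
    best = [max(cosine(p, part_vecs[name]) for name in PARTS) for p in phrase_vecs]
    return sum(best) / len(best)


def combined_score(text_vec, image_vec, phrase_vecs, part_vecs, weight=0.5):
    """Hypothetical weighted sum of the coarse and fine-grained scores."""
    coarse = coarse_score(text_vec, image_vec)
    fine = fine_score(phrase_vecs, part_vecs)
    return (1.0 - weight) * coarse + weight * fine
```

Ranking gallery images by combined_score lets a phrase about one part, such as the arms, influence the result even when the whole-body features of two candidates look alike.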

[5]Hierarchical Online Instance Matching for Person Search

This study addresses the task of tracking a specific person in video. Video datasets in which IDs are attached to persons also include a considerable number of persons without an ID. The study improved performance by assigning each person without an ID a temporary label, kept separate from the regular IDs, and learning to estimate these labels at the same time as the regular IDs. In addition, the separation of the background and the person is learned at the same time by assigning temporary labels to the background as well. By combining recognition of a specific person with separation from the background, the method achieved greater accuracy than previous work.
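One way to picture the three kinds of labels described above is a small bookkeeping structure. The sketch below is an assumption-heavy illustration, not the paper's implementation: the class name LabelBank, the fixed-size queues, and their sizes are all hypothetical, and no learning is performed here.

```python
from collections import deque


class LabelBank:
    """Keeps regular IDs, temporary person labels, and background labels apart."""

    def __init__(self, num_ids, unlabeled_size, background_size):
        self.labeled = [None] * num_ids                   # one slot per regular ID
        self.unlabeled = deque(maxlen=unlabeled_size)     # persons without an ID
        self.background = deque(maxlen=background_size)   # non-person detections

    def add(self, feature, person_id=None, is_person=True):
        """Store a detection feature and report which kind of label it received."""
        if not is_person:
            self.background.append(feature)
            return "background"
        if person_id is None:
            self.unlabeled.append(feature)   # oldest temporary entry is dropped when full
            return "temporary"
        if not 0 <= person_id < len(self.labeled):
            raise ValueError("unknown person ID")
        self.labeled[person_id] = feature
        return "id"

    def num_classes(self):
        """Number of targets a classifier over all three groups would see."""
        return len(self.labeled) + len(self.unlabeled) + len(self.background)
```

Keeping the groups separate means a detection can be compared against regular IDs, unlabeled persons, and background at once, which is the joint learning the summary describes.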

[6]Rethinking Temporal Fusion for Video-Based Person Re-Identification on Semantic and Time Aspect

This study also addresses the task of tracking a specific person in video. When recognizing a person from a video, the features used for estimation vary along two axes, semantic and temporal: the semantic axis corresponds to the depth of the CNN layer, and the temporal axis corresponds to the frames of the video. The feature of each CNN layer is determined by attention across multiple frames, so that each feature draws on the frames that are important for estimation. A comparison of several methods that combine multiple features with attention showed that attention between frames was effective.
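The idea of weighting frames by attention, separately for each CNN layer, can be sketched as below. This is a simplified illustration and not the paper's architecture: it assumes per-frame feature vectors are already extracted, and it uses an unlearned dot-product score between frames where the paper would use learned attention. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fuse_frames(frame_feats):
    """Fuse the per-frame features of ONE layer into a single vector.

    A frame's score is its average dot product with all frames, so frames
    that agree with the rest of the clip receive larger weights.
    """
    count = len(frame_feats)
    scores = [sum(dot(f, g) for g in frame_feats) / count for f in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    fused = [sum(w * f[i] for w, f in zip(weights, frame_feats)) for i in range(dim)]
    return fused, weights


def fuse_video(layer_feats):
    """Apply the temporal fusion independently at every semantic level (layer)."""
    return {name: fuse_frames(feats)[0] for name, feats in layer_feats.items()}
```

With three frames where the last one differs from the other two, for example because the person is partly hidden, the last frame receives the smallest weight and contributes least to the fused feature.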

Topic3:Action recognition

Finally, I introduce research on action recognition, which focuses on human behavior. The task of action recognition is to classify people's movements. For example, NTU RGB+D*3, a dataset for action recognition, has 82 classes of daily activities to classify.

[7]Part-Level Graph Convolutional Network for Skeleton-Based Action Recognition

This research built on the approach of treating the human body as a graph structure and learning it with a Graph CNN. It proposed a Part Relation Block (PR), which learns the relationships between parts such as the arms, and a Part Attention Block (PA), which learns which parts to focus on. It achieved high accuracy in action recognition by learning the human body structure automatically from training data, instead of having people explicitly specify the relationships between parts and where to focus.
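For readers unfamiliar with treating the body as a graph, the following sketch shows a single, generic graph convolution step over a hand-written part graph. It is not the PR or PA block from the paper, whose point is precisely to learn such relations instead of fixing them by hand. The part names, edges, and weight matrix are hypothetical.

```python
# Hypothetical part-level body graph, fixed by hand for illustration only.
NODES = ("head", "torso", "left_arm", "right_arm", "left_leg", "right_leg")
EDGES = (
    ("head", "torso"),
    ("torso", "left_arm"),
    ("torso", "right_arm"),
    ("torso", "left_leg"),
    ("torso", "right_leg"),
)


def build_neighbors(nodes, edges):
    """Undirected neighbor sets; every node also includes itself (self-loop)."""
    adj = {n: {n} for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def graph_conv(features, adj, weight):
    """One step: average over neighbors, apply a linear map, then ReLU.

    features: dict mapping node name to a feature vector
    weight:   matrix given as a list of rows (in_dim x out_dim)
    """
    out = {}
    for node, nbrs in adj.items():
        in_dim = len(features[node])
        mean = [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(in_dim)]
        proj = [
            sum(mean[i] * weight[i][j] for i in range(in_dim))
            for j in range(len(weight[0]))
        ]
        out[node] = [max(0.0, x) for x in proj]
    return out
```

After one step, a signal placed on the head has spread to the torso but not yet to the arms or legs. Stacking several steps lets information travel between distant parts.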

Summary

Human recognition, including motion recognition, is becoming more and more effective through the adoption of methods proposed in other fields. On the other hand, as the accuracy of human recognition technology improves, privacy will need more consideration. I think it will be important to apply the technology appropriately, in light of future changes in the concept of privacy.

*1 OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
*2 Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation
*3 LSMB19: A Large-Scale Motion Benchmark for Searching and Annotating in Motion Data Streams

Reporter

Name : Tetsuro Sato
Education : Mechanical Systems Engineering
Current job : Machine learning model development
Job Category : Software, Signaling, Information Processing
