Human pose estimation 

When we develop an AI system, our goal is often to make it capable of either reproducing a task with the same accuracy as a human or assisting a human in the realization of a complex task. In both scenarios, we hope to build a system with some common sense. By common sense, we mean reasoning about its task and, more generally, the world or the environment if the AI is embedded in a robot. Yet, the world as we know it is mostly designed by humans, for humans. Therefore, to evolve common sense in this human world, an AI must understand how humans interact with the world. Robots should also be able to reproduce some human actions to interact with their environment. This necessity leads us to the problems of estimating both the posture of the humans present in a visual scene (human pose estimation) and their motion (human pose tracking).

Human pose estimation is a long-standing computer vision problem that has gained a lot of attention in the last decade. Solving it would unlock many important applications: health (e.g., detecting people falling), "remote-free" interaction in domotics (home automation), gaming, sports analytics, security, sport and dance AI assistants, etc. In fact, every application of AI to a domain involving human activity can generate multiple use cases requiring human pose estimation and/or tracking.

Yoga Demonstration

When developing our human pose module, we really wanted to test it on an interesting use case, and we finally chose yoga. This application is particularly exciting because it is directly linked to upcoming health applications, such as sport AI assistants. Moreover, yoga is a sport involving all body joints and limbs in a large diversity of poses. This makes it a challenging use case that forces us to respect constraints we might otherwise have ignored: the need to estimate a 3D skeleton in real-world units (meters) that is temporally stable (so that the movement is physically plausible). Working on this first application is also an excellent way to identify recurring sources of error and make our system more robust. Now that we have presented the use case, let us introduce the features of our new module.


It is necessary to recognize human actions to understand the interaction between humans and their environment. These actions are defined by the motion of the rigid parts of a human body: the limbs. Because we need a representation usable in an artificial neural network, the human pose is usually defined by a set of key points: the joints at each limb's extremities. This representation defines a skeleton when we connect the key points that belong to a common limb.
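The keypoint-plus-limbs representation above can be sketched as a small data structure. The joint names and limb connections below are illustrative, not the module's actual schema:

```python
# A minimal sketch of a keypoint-based skeleton representation.
# Joint names and limb connections are illustrative assumptions.
from dataclasses import dataclass

JOINTS = ["head", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow",
          "l_wrist", "r_wrist", "l_hip", "r_hip", "l_knee", "r_knee",
          "l_ankle", "r_ankle"]

# Each limb connects two joints; drawing these edges yields the skeleton.
LIMBS = [("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
         ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist"),
         ("l_hip", "l_knee"), ("l_knee", "l_ankle"),
         ("r_hip", "r_knee"), ("r_knee", "r_ankle")]

@dataclass
class Pose2D:
    # joint name -> (x, y) pixel coordinates; None if not detected
    keypoints: dict

    def limb_segments(self):
        """Pixel segments to draw, one per limb whose two joints were detected."""
        return [(self.keypoints[a], self.keypoints[b])
                for a, b in LIMBS
                if self.keypoints.get(a) and self.keypoints.get(b)]
```

Missing detections are handled naturally: a limb is only drawn when both of its endpoints were found.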

A first approach to obtain information about humans in a robot’s environment is to analyze each picture taken by a camera placed on the robot. With our human pose estimation module, we can estimate each person’s 2D pose in the scene. This 2D pose provides information about key points in the image (which pixels correspond to a body joint).

This approach can be satisfactory in some applications, such as basic gesture recognition. But many use cases, such as a robot moving through an environment, require a 3D skeleton expressed in real metric units (meters) in a 3D space consistent with the real world. Therefore, we included a 3D pose estimation module.

Because the final goal is to understand human motion, we want to analyze the human pose’s evolution through time. While a picture is enough to obtain a static human pose, we can do a lot better from a video (or sequence of pictures) and track the pose’s evolution. Therefore, we added a feature to track the 3D skeleton as it moves. This allows us to make better predictions about future poses based on past observations. Tracking also makes the estimation robust to the presence of noisy data and to sensor failures (lack of detection or false detection) at some time steps.

The skeleton’s estimation is usually a means towards a more applicative goal: recognizing an action or an intention, checking that the person has a correct position, etc. Having a robust and physically plausible 3D skeleton is very useful for these applications because we can then extract some important physical features: angle of articulations, alignment of keypoints, angles of a limb with respect to the detected floor. We coded the extraction of these physical features and integrated them into our human pose module.
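The two feature families named above, joint angles and keypoint alignment, reduce to simple vector geometry once the skeleton is expressed in meters. A minimal sketch (function names are ours, not the module's API):

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees), formed by 3D points a-b-c in meters,
    e.g. the elbow angle from (shoulder, elbow, wrist)."""
    u = np.asarray(a, float) - np.asarray(b, float)
    v = np.asarray(c, float) - np.asarray(b, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def alignment_error(points):
    """Max deviation (meters) of 3D keypoints from the best-fit line through
    them, to check e.g. that shoulders, hips and ankles stay aligned."""
    pts = np.asarray(points, float)
    centered = pts - pts.mean(axis=0)
    # Principal direction via SVD; residuals are distances to that line.
    _, _, vt = np.linalg.svd(centered)
    proj = (centered @ vt[0][:, None]) * vt[0][None, :]
    return float(np.linalg.norm(centered - proj, axis=1).max())
```

An angle against the detected floor works the same way, replacing one of the limb vectors by the floor plane's normal.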

Developing these features, from a 2D skeleton all the way to useful 3D features, was made easy by reusing previously developed modules from our generic visual system: depth estimation, object and stuff detection, and optical flow.

Practical application

With the help of our new human pose module, unlocking new use cases of AI involving humans becomes easier than ever.

AI systems that assist humans in their activities will make better decisions based on a more detailed description of human actions.

In health and sport applications, like yoga/workout AI assistants, our features help the practitioner perform safe and correct movements and track her improvements, for example by checking that a joint's angle does not put it at risk and that the body parts are correctly aligned when needed.

In warehouse logistics, a robot will be able to navigate more efficiently and safely in the presence of humans. The 3D skeleton reveals a person's orientation in space, which is an important predictor of their future motion direction. Probable future trajectories can also be predicted from accurate estimates of 3D velocity vectors, expressed in m/s.
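Because the skeleton is in meters, velocity in m/s falls out of a simple finite difference between tracked positions. A hedged sketch (our own helper names, constant-velocity assumption):

```python
import numpy as np

def velocity_mps(positions, timestamps):
    """Finite-difference 3D velocity (m/s) of a tracked keypoint.
    positions: (x, y, z) tuples in meters; timestamps: seconds."""
    p = np.asarray(positions, float)
    t = np.asarray(timestamps, float)
    return (p[-1] - p[0]) / (t[-1] - t[0])

def extrapolate(position, velocity, dt):
    """Constant-velocity prediction of the position dt seconds ahead."""
    return np.asarray(position, float) + np.asarray(velocity, float) * dt
```

A real tracker would smooth over more than two frames, but the unit analysis is the same: meters over seconds gives m/s directly.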

Humans will be able to communicate with AI systems in new meaningful ways.

In domotics, interacting with the system without a remote will become very intuitive. The accurate and stable 3D skeleton helps to recognize commands; for example, keypoint alignment allows the user to point at a smart appliance and interact with it. Gesture recognition will also be useful in unstructured environments where humans, such as workers, communicate visually with the system from afar.

In robotics, we will be able to teach robots to reproduce complex human gestures simply by performing the gestures in front of them, to be tracked and translated into a 3D skeleton.

In the long term, the ability to track a 3D skeleton will be a cornerstone for the emergence of common sense in AI systems. Indeed, common sense requires understanding how humans interact with the world and modify its state through their actions.

Technical details

2D Pose Estimation 

The most common human pose estimation task is 2D pose estimation. Given an RGB image, we want to find every person and, for each person, estimate the positions of their joints (keypoints) and limbs.

A significant difficulty in images with multiple people is the presence of multiple key points of the same type (for example, “left ankle”). We have to make sure that each keypoint is matched with the correct person.

In our application, we already have access to the output of our semantic detection model, which gives us a list of people with their associated bounding boxes (bboxes) and pixel-level masks. This output lets us easily associate keypoints with the correct person, and it is cheap to obtain: the detection model runs in real-time, and its computation is already needed to track other objects. Therefore, the overhead of using it for human pose estimation is very low.
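The mask-based association can be sketched as a lookup: a keypoint belongs to the person whose mask covers its pixel. This is an illustrative simplification; the module's actual matching logic and mask format may differ:

```python
import numpy as np

def assign_keypoints(keypoints, person_masks):
    """Group detected keypoints by person using pixel-level masks.
    keypoints: list of (x, y) pixel coords.
    person_masks: list of HxW boolean arrays, one per detected person.
    Returns {person_index: [keypoints]}, with -1 for unmatched keypoints."""
    groups = {i: [] for i in range(len(person_masks))}
    groups[-1] = []
    for (x, y) in keypoints:
        owner = -1
        for i, mask in enumerate(person_masks):
            if mask[int(y), int(x)]:
                owner = i
                break
        groups[owner].append((x, y))
    return groups
```

This sidesteps the classic multi-person ambiguity: two "left ankle" detections end up in different groups because they fall inside different masks.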

From 2D to 3D 

When a 3D shape (in our case, a human skeleton) is projected to a 2D image, a lot of information about its geometry is lost. This is a large source of ambiguity when reconstructing a 3D skeleton from a 2D image. Indeed, the same 2D image can correspond to multiple critically different 3D positions. This can lead to large errors in the estimation of the orientation of a person.


At the beginning of the animation, the 2D view could be enough to measure the shoulder angle (180°). But when the arm points toward the camera, a 2D measurement would mistakenly stay large (~180°), while the real angle is 90°, as shown in the 3D view.
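This projection ambiguity is easy to reproduce numerically. Below, an upper arm pointing almost straight at the camera still projects onto the image plane nearly aligned with the shoulder axis, so the 2D angle reads 180° while the 3D angle is close to the true 90° bend (vectors and the orthographic drop of the z coordinate are illustrative simplifications):

```python
import numpy as np

def angle_deg(u, v):
    """Angle (degrees) between two vectors, in any dimension."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

shoulder_axis_3d = (1.0, 0.0, 0.0)   # left -> right shoulder, in the image plane
arm_3d = (-0.1, 0.0, -1.0)           # upper arm pointing almost at the camera

# Drop z to simulate the camera projection (orthographic, for illustration).
shoulder_axis_2d, arm_2d = shoulder_axis_3d[:2], arm_3d[:2]

angle_2d = angle_deg(shoulder_axis_2d, arm_2d)   # 180 deg: looks fully extended
angle_3d = angle_deg(shoulder_axis_3d, arm_3d)   # ~96 deg: close to the true bend
```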

Our system relies on a stereo camera, i.e., two cameras aligned on a horizontal axis. From this stereo input, our depth estimation model allows us to accurately unproject the 2D skeleton back into the 3D world. As with the detection model, the depth estimation runs in real-time and is already used for other tasks, like object tracking. Therefore, using this input for 3D skeleton estimation is cheap, with low overhead.
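With per-pixel metric depth available, unprojecting a 2D keypoint is standard pinhole-camera geometry. A sketch under the pinhole model; the intrinsics (fx, fy, cx, cy) are placeholder values that would in practice come from the stereo camera's calibration:

```python
import numpy as np

def unproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into 3D camera coordinates
    (pinhole model). fx, fy: focal lengths in pixels; (cx, cy): principal point."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example: a keypoint at pixel (960, 540) with 2.0 m of depth,
# for a camera with fx = fy = 800 and principal point (640, 360).
p = unproject(960, 540, 2.0, 800, 800, 640, 360)
# p is the joint's position in meters in the camera frame
```

Applying this to every 2D keypoint yields the full 3D skeleton in meters.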

A significant advantage of our stereo approach is that the 3D skeleton is estimated in real metric units (meters) in a 3D space consistent with the real world.

Tracking the 3D skeleton in time

Our system makes tracking easy because we can use the optical flow model output to estimate objects’ motion between successive time frames. The flow is already computed in real-time for object tracking (see our previous demo) and can therefore be reused as input to our 3D skeleton tracking algorithm.
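Reusing the flow for tracking can be sketched as propagating each keypoint by the flow vector at its pixel. This is an illustrative simplification of the idea, assuming a dense HxWx2 flow field of (dx, dy) pixel displacements, not the module's actual tracking algorithm:

```python
import numpy as np

def propagate_keypoints(keypoints, flow):
    """Move 2D keypoints from frame t to frame t+1 using a dense optical
    flow field (HxWx2 array of (dx, dy) pixel displacements).
    The propagated positions serve as priors for the next frame's
    detections, keeping tracks stable through brief misdetections."""
    h, w = flow.shape[:2]
    out = []
    for (x, y) in keypoints:
        # Sample the flow at the keypoint's nearest pixel, clamped to the image.
        xi = int(np.clip(round(x), 0, w - 1))
        yi = int(np.clip(round(y), 0, h - 1))
        dx, dy = flow[yi, xi]
        out.append((x + dx, y + dy))
    return out
```

Matching propagated keypoints against the next frame's detections is what lets the tracker bridge a missed or spurious detection at a given time step.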


With this demonstration, we proposed a new approach to 3D skeleton tracking. What makes it so unique is that it can track multiple people's skeletons in real-time and in their 3D environment, thanks to our stereo-vision approach.

This human pose estimation module is integrated into our generic visual system, which performs a large panel of other visual tasks. This system allows combining the features of different modules efficiently – depth estimation, object detection (bounding boxes and masks), optical flow, object tracking, human pose estimation, and tracking – to solve complex visual tasks in real-time. This modular design will also help us to efficiently create many new features in the future. We will soon release our SDK and allow you to test it on your own use cases. Feel free to contact us for more information about our system.
