Today, robots already impact the world through a wide range of tasks, using sensor fusion, planning, and control in highly structured environments. However, except in a few specific cases, robots still lack the intelligence to understand the world through vision as humans do. Custom development, costly sensors (3D/2D lidar and radar), and massive infrastructure are usually deployed to compensate for this lack of understanding. To address this problem and tackle robotics challenges, Visual Behavior is developing an artificial visual cortex that understands complex visual scenes.

“Anyone relying on lidar is doomed.”

– Elon Musk

Robotics can be seen as the combination of three tasks: perceiving the environment, deciding which actions to take, and performing the actions. Robots’ capacity to perform complex physical actions has dramatically improved in the last decades, and impressive demonstrations continue to be showcased. The capacity to make good decisions in simulated or very controlled environments has also improved because we have good representations there. However, decision-making algorithms, whether learning-based (such as reinforcement learning) or handcrafted/engineered decision rules, do not perform well in complex environments, because making good decisions requires a proper visual representation of the world. Incomplete or imprecise data cannot always be compensated for with good decision algorithms. Today, robotics has reached a point where improving robot vision is the key to reaching robot autonomy.

We believe that common sense in robots’ perception can only emerge from a system based on physics-first principles like spatial and temporal consistency and basic notions of objects’ physical properties. For example, this allows the understanding of object permanence and occlusion. Children around seven months old understand that an object can be temporarily hidden but continues to exist even when it cannot be seen. Such abilities are often missing from current robot vision systems because those systems are not designed to let them emerge. We think they are necessary, fundamental pieces for designing more evolved vision systems and for aspiring to human-level vision abilities.

Consequently, benchmarks in vision should allow researchers to assess such abilities and check that some robotics constraints are met: real-time analysis, low-computation hardware, and the need for consistent predictions. These considerations have led us to propose a new test that challenges robotics vision systems under real-world analysis constraints.

The Three Cup Monte Test

The Three Cup Monte Test is difficult for current vision systems, as it requires common sense to understand complex and sometimes ambiguous object interactions.
The goal of the game is to follow a ball hidden under one of three cups. The complexity arises when someone shuffles the cups. Additionally, the ball can be switched from one cup to another. Because all cups look identical, the program’s success relies on its ability to understand complex object dynamics.
As an additional challenge, we added two major components to make the test more realistic. First, the challenge had to be addressed in an online, real-time fashion. Second, to incorporate the constraints of most robotics environments, we limited the resources available at inference time to those offered by a gaming laptop.
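To make the tracking problem concrete, here is a toy sketch of what "following the ball" amounts to: maintaining a belief over which cup hides the ball as shuffle and transfer events are observed. The event names and update rules are our own illustration, not part of Visual Behavior's system.

```python
# Toy sketch: belief over which cup hides the ball. The event format
# ("swap"/"transfer") is hypothetical, chosen only to illustrate the game.

def update_belief(belief, event):
    """Update P(ball under cup i) after one observed event.

    belief : list of probabilities, one per cup.
    event  : ("swap", i, j) when cups i and j trade places, or
             ("transfer", i, j, p) when the ball moves i -> j with
             confidence p (ambiguous hand-offs keep some mass on i).
    """
    b = list(belief)
    if event[0] == "swap":
        _, i, j = event
        b[i], b[j] = b[j], b[i]      # positions trade, so belief trades too
    elif event[0] == "transfer":
        _, i, j, p = event
        moved = b[i] * p             # only part of the mass moves if unsure
        b[i] -= moved
        b[j] += moved
    return b

belief = [1.0, 0.0, 0.0]             # ball starts under cup 0
for ev in [("swap", 0, 1), ("transfer", 1, 2, 0.9), ("swap", 0, 2)]:
    belief = update_belief(belief, ev)
# the most likely cup is the argmax of the final belief
```

The vision system's real difficulty is producing those events reliably from pixels; once the events are known, the bookkeeping itself is trivial.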

Common sense

Common sense covers a large spectrum of (artificial) intelligence. However, in the context of robot perception, it can be as “simple” as grasping the idea that an object can be visible, in motion, or hidden by another object. Beyond its importance in this game, understanding such details is key to complex robot autonomy applications like self-driving cars or robot mobility. Consider a cyclist passing behind a car: it is easy to see why this ability matters in robotics applications.
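The visible/in-motion/hidden distinction can be sketched as a small per-object state machine. The states, inputs, and patience threshold below are a toy illustration of object permanence, not Visual Behavior's implementation.

```python
# Toy state machine for one tracked object: a detection miss while the
# object was entering another object's region is treated as occlusion
# (the object still exists), not as disappearance.

VISIBLE, OCCLUDED, LOST = "visible", "occluded", "lost"

def update_state(state, detected, entering_occluder, frames_hidden, patience=30):
    """One tracking step.

    detected          : did the detector fire on this object this frame?
    entering_occluder : was it last seen moving behind another object?
    frames_hidden     : how many frames it has been hidden so far.
    """
    if detected:
        return VISIBLE, 0                      # seen again: reset
    if state == VISIBLE and entering_occluder:
        return OCCLUDED, 1                     # hidden, but presumed to exist
    if state == OCCLUDED:
        if frames_hidden < patience:
            return OCCLUDED, frames_hidden + 1 # object permanence: keep the track
        return LOST, frames_hidden             # give up only after long silence
    return state, frames_hidden
```

A cyclist passing behind a car would traverse VISIBLE, then OCCLUDED for a few frames, then VISIBLE again, without the track ever being dropped.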

Real-time analysis

“Real-time analysis” could also be referred to as “streaming perception.” This term describes the ability of an agent to sense the world fast enough to react efficiently. In the context of robotics, such an ability is extremely important: by the time a vision model has finished processing a frame, the world may have changed radically. In the following video, we showcase how large latency can distort our belief about the state of the world.

Video: detection with the Mask R-CNN model’s latency.
Video: detection with our model’s latency, plus our optical flow network.

In fact, recent studies show that metrics like average precision (AP), used to evaluate semantic detectors, can drop from 38.0 to 6.2 under real-time constraints. Because all of the cups look identical, the Three Cup Monte Test challenges current algorithms to be accurate (precision), fast (inference time), and efficient (computation on limited hardware). In our case, we successfully addressed this problem by running three networks with four outputs (bounding boxes, masks, flow, depth) in real time on a gaming laptop.
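The reason AP collapses under real-time constraints is that a prediction computed on frame t only becomes available at t + latency, so it must be scored against the world at that later time. A minimal sketch of this pairing (our own simplification of the streaming-perception evaluation idea, not the exact published protocol):

```python
# Sketch: which (stale) prediction is actually available at each frame,
# given a fixed per-frame processing latency.

def streaming_pairs(frame_times, latency):
    """For each frame time t, return (input_frame, evaluated_at) where
    input_frame is the newest frame whose processing finished by t.
    None means no prediction is available yet."""
    pairs = []
    last_done = None
    for t in frame_times:
        for s in frame_times:
            if s + latency <= t:   # prediction on s has finished by now
                last_done = s
        pairs.append((last_done, t))
    return pairs
```

With a latency of 2.5 frame intervals, the system is still blind at frame 3 except for a prediction computed on frame 0; with a latency of 0.5, every frame is evaluated against a prediction only one frame old. Shrinking that gap is exactly what the real-time constraint rewards.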

Technical details

The technology we use is based on a modular architecture inspired by the areas of the human visual cortex. Indeed, decades of neuroscience research have revealed the factors involved in mammals’ visual perception. Like current semantic detectors (YOLO, Mask R-CNN, SSD), humans can detect and recognize specific entities. However, our biological vision is not limited to semantic detection; it includes other processes to estimate motion and depth. Beyond its practical usefulness, such estimation plays an important role in how babies develop an understanding of a scene. In the context of robotics applications, this architecture also brings value for the emergence of vision abilities and for practical, experimental applications such as the Three Cup Monte Test.

As an important rule, we refused to specifically adapt our networks and our technology’s architecture to this particular problem. We observed that benchmarks tend to bias proposed solutions toward score-focused optimization. We prefer to aim for the long-term goal behind a benchmark rather than for problem-specific solutions. This is the philosophy we followed while addressing this task.

“When a measure becomes a target, it ceases to be a good measure.”

– Marilyn Strathern


Semantic detection: To address this challenge, we used a semantic detector based on box and mask predictions. Following the rule described above, we did not specifically train our semantic detector on the cups shown in the video, nor on any cup-specific dataset. However, our semantic detector can detect and segment objects from the COCO dataset, including kitchen objects such as cups.

Following moving objects: Because the world is always moving, we use an optical flow network to estimate motion between frames. The network is trained on synthetic data as well as on real data in an unsupervised manner. Beyond the direct use of optical flow, we also use occlusion estimation to detect when one object passes behind another or when an object disappears.
Additionally, the semantic detector outputs spatial object embeddings, making it possible to detect when an object is lost during tracking.
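One standard way to derive an occlusion cue from optical flow is a forward-backward consistency check: a pixel whose forward motion is not undone by the backward flow at its landing point is likely occluded in the next frame. The sketch below illustrates that classic check; it is not Visual Behavior's trained occlusion estimator.

```python
# Forward-backward consistency as an occlusion cue: for a visible pixel,
# following the forward flow and then the backward flow should return
# (approximately) to the starting point.
import numpy as np

def occlusion_mask(flow_fw, flow_bw, tol=1.0):
    """flow_fw, flow_bw : (H, W, 2) arrays of (dx, dy) per pixel.
    Returns a boolean (H, W) mask, True where the flows are inconsistent."""
    h, w, _ = flow_fw.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # where each pixel lands under the forward flow (nearest-neighbor)
    xt = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, h - 1)
    # backward flow sampled at the landing point should cancel the forward flow
    residual = flow_fw + flow_bw[yt, xt]
    return np.linalg.norm(residual, axis=-1) > tol
```

For a uniformly translating scene the two flows cancel and no pixel is flagged; when the backward flow disagrees (as happens where an object has been covered), the residual grows and the pixel is marked occluded.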

Depth estimation: Human vision is stereoscopic, meaning that we use two eyes to get a sense of how far away each object in the scene is. Similarly, we train a stereo network to estimate depth from two cameras. This estimation is important for disambiguating object interactions as well as for predicting objects’ future positions.
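The geometry underlying any stereo system (learned or classical) is the pinhole relation between disparity and depth: depth = focal length × baseline / disparity. A minimal sketch, with illustrative numbers rather than our actual camera parameters:

```python
# Pinhole stereo geometry: the further an object, the smaller the
# horizontal shift (disparity) between the left and right images.

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """depth (m) = focal length (px) * baseline (m) / disparity (px)."""
    if disparity_px <= 0:
        return float("inf")   # zero disparity: point at infinity
    return focal_px * baseline_m / disparity_px

# e.g., a 700 px focal length, 12 cm baseline, and 20 px disparity
# place the object 4.2 m away
```

A stereo network learns to produce dense, robust disparities (including in textureless regions where classical matching fails), but converts them to metric depth with this same formula.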

Real-time analysis: We use attention mechanisms (not directly related to transformers in NLP) inspired by neural cognitive processes to optimize the inference speed of our models. While still a patent-pending technology, this process is based on new neural network architectures and methods for efficiently analyzing a scene in real time. We are thus able to run state-of-the-art neural networks faster on cheaper hardware. Each of the previous methods can therefore benefit from these improvements to run in real-time scenarios.

Tracking: On top of the modules introduced above, we plugged in a multi-object tracking (MOT) algorithm that combines each network’s estimates. Because each module already provides meaningful information such as detections, distances, motion, and occlusions, we did not need a wide range of post-processing to fuse the networks’ outputs. Indeed, the modules’ outputs by themselves explain most of what happens in the scene, limiting the need for heavy post-processing computation.
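The fusion step can be pictured as building one track-to-detection cost matrix from the per-module cues and solving an assignment on it. The weights, cue names, and greedy matcher below are a toy illustration (a real tracker would typically use the Hungarian algorithm and tuned weights), not the article's actual pipeline.

```python
# Toy fusion + assignment: combine appearance (IoU), embedding distance,
# and depth gap into one cost matrix, then match greedily.
import numpy as np

def fuse_costs(iou, embed_dist, depth_gap, w=(0.5, 0.3, 0.2)):
    """All inputs are (n_tracks, n_dets) arrays with values in [0, 1].
    The weights are illustrative, not tuned values."""
    return w[0] * (1.0 - iou) + w[1] * embed_dist + w[2] * depth_gap

def greedy_match(cost, max_cost=0.7):
    """Greedy lowest-cost-first assignment (a simple stand-in for the
    Hungarian algorithm). Returns a list of (track, detection) pairs."""
    matches, used_t, used_d = [], set(), set()
    order = zip(*np.unravel_index(np.argsort(cost, axis=None), cost.shape))
    for t, d in order:
        if t not in used_t and d not in used_d and cost[t, d] <= max_cost:
            matches.append((int(t), int(d)))
            used_t.add(t)
            used_d.add(d)
    return matches
```

Because the upstream modules already report occlusions and distances, the cost matrix is informative on its own, which is why little extra post-processing is needed.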

Upcoming milestones and improvements


Visual Behavior is working on the future of robotics. While this experiment was research-focused, it gave us excellent feedback on how well our system understands complex scenes. To stay informed about our next demonstration and the project’s release on GitHub, subscribe to our newsletter or follow us on social media. The Three Cup Monte Test was only a first step toward the ultimate goal of building the robotics systems of the future. We will soon release content for the community and showcase how such technology can be applied to real-world problems.

