Robotics n°2

Why is robotics different from other AI systems?

Part 2: When do you know what you know?

You said real time?

At Visual Behavior, we describe robots as real-world systems that take information from the outside world, observe, decide, and then act on that world through movement. This definition covers many systems, from self-driving cars to industrial robots, but it also includes others, such as a smart camera watching the world to decide whether to take a picture, or indeed any camera triggering an action. The process that turns observation into action isn't instantaneous; it induces a delay, the system's latency. As a consequence, the action doesn't occur in the observed world, but in a world slightly in its future, and potentially a little different.

In domains involving real-time constraints, we often say that "having information too late is the same as not having any information." To be concrete, imagine an automatic emergency brake that takes 5 seconds to detect an obstacle in front of a car: the decision and the action would occur long after the collision.
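A quick back-of-the-envelope sketch makes the point concrete. The function below (illustrative, with made-up figures) computes how far a car travels before a delayed perception pipeline can react:

```python
def distance_during_latency(speed_kmh: float, latency_s: float) -> float:
    """Meters traveled before the system can even start to react."""
    speed_ms = speed_kmh / 3.6  # convert km/h to m/s
    return speed_ms * latency_s

# At 50 km/h, a 5-second detection delay means the car covers:
print(distance_during_latency(50, 5))    # ~69.4 m, far past the obstacle
# versus a 100 ms pipeline:
print(distance_during_latency(50, 0.1))  # ~1.4 m
```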

Accuracy versus Speed

Mengtian Li, Yu-Xiong Wang, and Deva Ramanan from Carnegie Mellon University and Argo AI propose a smart solution to measure the best compromise between accuracy and speed. They call it "Streaming Perception." It consists of evaluating algorithms with delayed ground truth. Instead of using the ground truth associated with the input data (the observation), they use the ground truth corresponding to the time at which the algorithm outputs its prediction.
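A minimal sketch of this evaluation idea, heavily simplified: the real Streaming Perception benchmark uses full detection metrics such as AP, while here a single bounding box is scored with IoU against the ground truth at the time the prediction becomes available. The helper names are illustrative, not from the paper.

```python
import time

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter else 0.0

def streaming_eval(frames, ground_truth_at, model):
    """Score each prediction against the ground truth at the time the
    prediction comes out, not at observation time."""
    scores = []
    for t_obs, frame in frames:
        start = time.monotonic()
        box = model(frame)                     # predicted bounding box
        latency = time.monotonic() - start     # measured pipeline delay
        scores.append(iou(box, ground_truth_at(t_obs + latency)))
    return sum(scores) / len(scores)
```

A slow but accurate model is penalized here exactly as the paper intends: by the time its box comes out, the world has moved on and the IoU drops.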

There are two ways to perform well in this context. The first is to reduce latency to its minimum, so that the prediction is as temporally close as possible to the observation; this almost always means trading away some prediction quality. The best solution lies between small delay and good accuracy: the compromise between quality and latency.

The second solution is to compensate for the latency by predicting not the observed world but the forecasted near-future world. This implies developing algorithms that use not only still images but temporal series of images, a.k.a. video. A first step can come from traditional signal processing: a tracking layer, such as a Kalman filter, on top of classical still-image algorithms.
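The simplest form of such compensation can be sketched in a few lines, assuming a constant-velocity model (a bare-bones stand-in for a full Kalman filter). Positions here are bounding-box centers, and all names are illustrative:

```python
def forecast(prev, curr, dt_obs, latency):
    """Extrapolate the current position `latency` seconds ahead,
    using the velocity estimated from the last two observations."""
    vx = (curr[0] - prev[0]) / dt_obs
    vy = (curr[1] - prev[1]) / dt_obs
    return (curr[0] + vx * latency, curr[1] + vy * latency)

# A car seen at (0, 0) then (2, 0) over 0.1 s moves at 20 units/s;
# with 50 ms of pipeline latency, we predict it near (3, 0) at output time.
print(forecast((0, 0), (2, 0), dt_obs=0.1, latency=0.05))  # ~(3.0, 0.0)
```

A Kalman filter refines this by also estimating the uncertainty of the velocity, so the forecast degrades gracefully under noisy detections.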

The illustration above describes the Streaming Perception setup. At t=t1, the camera captures an image of the road while the car is in position A. From this image, an algorithm produces a prediction about the scene, in particular a mask and a bounding box for the detected car. This prediction is output at t=t2, when the car is actually in position B. Traditionally, algorithm evaluations don't consider latency and measure prediction performance against vehicle position A. Streaming Perception instead proposes evaluating against the vehicle's state at time t=t2, in position B.

Those considerations are relatively new in the deep learning field, which explains why traditional tracking algorithms are often used in conjunction with more modern single image deep learning algorithms instead of fully latency-aware deep learning processing pipelines.


Despite our sensation of simplicity and of time flowing constantly, the internal processes responsible for turning observation into action in humans aren't straightforward.

Depending on the task, our reaction time lies between 200 milliseconds and half a second. But this only measures reaction time; in real-world situations, we often have information about upcoming events. Our brain spends its time predicting the future. For this purpose, it uses more or less sophisticated techniques, ranging from a simple constant-velocity model for basic object motion to a full psychological model to predict a person's next action. Based on this information, our brain can compensate for its latency, or even anticipate by acting on potential future observations instead of the actual one. That's how we can catch a ball following a ballistic trajectory, or navigate through a dense crowd.

Why anticipation is important

Anticipation is a good way to compensate for latency. By predicting not the current state of the world but the next one, robots can act on up-to-date world information. Anticipation is essential in robotics not only because we need to compensate for latency, but also because rich interactions require it.

Imagine a factory where we want mobile robots to cooperate with workers, for example by delivering raw materials. The first level of autonomy is making sure the robot doesn't crash into humans. To do that, we need a 3D human detector that triggers an emergency stop if a human is in front of the robot. In this case, the robot only reacts to its environment, and the system acts as a safety component. To make robots truly collaborate with humans, we need to go further and design an algorithm that, instead of telling whether a human is in front of the robot, tells whether the human's trajectory will cross the robot's trajectory in the near future, and acts accordingly.
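The trajectory-crossing check above can be sketched as follows, again under simple constant-velocity assumptions; all names and thresholds are illustrative, not an actual safety implementation:

```python
def min_future_distance(p_h, v_h, p_r, v_r, horizon=3.0, dt=0.1):
    """Minimum human-robot distance over the next `horizon` seconds,
    sampling both constant-velocity trajectories every `dt` seconds."""
    best = float("inf")
    for k in range(int(horizon / dt) + 1):
        t = k * dt
        hx, hy = p_h[0] + v_h[0] * t, p_h[1] + v_h[1] * t  # human at t
        rx, ry = p_r[0] + v_r[0] * t, p_r[1] + v_r[1] * t  # robot at t
        best = min(best, ((hx - rx) ** 2 + (hy - ry) ** 2) ** 0.5)
    return best

# A human walking across the robot's path: the trajectories cross near t = 2 s,
# so the robot should slow down *before* anyone is actually in front of it.
d = min_future_distance(p_h=(2.0, -2.0), v_h=(0.0, 1.0),
                        p_r=(0.0, 0.0), v_r=(1.0, 0.0))
if d < 0.5:
    print("anticipate: slow down or replan")
```

A purely reactive detector would see nothing at t = 0, since the human is not yet in front of the robot; the anticipatory check flags the conflict ahead of time.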

The self-driving-car equivalent is needing to know not whether a human is in the middle of the road, but whether there is a high probability a human will be in the middle of the road when the car passes.


How we do that at Visual Behavior

By working on our generic scene understanding at Visual Behavior, we discovered that managing anticipation and temporally aware prediction is an opportunity rather than a constraint. Because we want to provide robots with systems that allow fine and complex behaviors, we already face the anticipation constraint. Putting it at the heart of the system lets us access rich interaction scenarios and opens the door to collaborative robotics (a.k.a. cobotics).


After discussing evaluation specificities in a previous article, we saw here that latency can make a big difference when evaluating algorithms.

We described the compromise between accuracy and latency, depicted the idea behind Streaming Perception and its latency-aware evaluation, and drew an analogy between artificial algorithms and humans. Finally, we discussed latency from the anticipation point of view and the broader necessity of good anticipation methods.