Observation and Autonomy

In this series of articles, we will look at the history and challenges of autonomous robotics with the aim of offering insights into the missing elements of large-scale autonomy deployment.

Part 2:

Observation is about learning

After the DARPA Robotics Challenge in 2015, many questions emerged about the magic recipe for achieving the expected level of autonomy. One question stood out:

Wouldn’t it be better to rethink the vision system and the use of sensors to put intelligence back at the heart of autonomous systems?

This question continues to drive the R&D departments of many robotics companies in search of autonomy. Some have turned to a multiplicity of sensors, with lidar still the favorite. This approach concentrates vision-system expertise on the information and precision delivered by the sensors rather than on understanding the scene and the behavior of the entities in it.

This is where current vision systems show their limits. The costs of acquiring, integrating, and fusing sensors, together with the lack of scene interpretation, reinforce these limits over the long term and encourage the paradigm shift launched by the race toward artificial intelligence and deep learning.

The challenge is to put understanding of the world first: to let the robot observe and learn by itself. It can then analyze its environment and make the right decisions, just as a child observes its surroundings as it grows in order to understand how to interact with them.

Recap of the magic recipe:

From sensor to model fusion

It was in 2015 that Tesla appropriated the magic recipe, taking on the challenge of using only camera sensors (just as humans use their eyes to move), machine learning, and model development to replace lidar-based perception. Tesla used the hundreds of hours of driving gathered from its customers' vehicles to train a set of basic ML models (detection of cars, pedestrians, signs, lane lines, etc.).

This recipe, as transformed by Tesla, relies on developing intelligence rather than sensors. The use of several cameras in a stereovision arrangement (similar to human binocular vision) allows a scene to be reconstructed in three dimensions and over 360 degrees. These stereo cameras provide both depth information and semantic understanding of the environment.
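As a rough illustration of how stereovision recovers depth, the classic pinhole-camera relation gives depth from disparity: Z = f·B/d. A minimal sketch follows; the focal length, baseline, and disparity values are illustrative assumptions, not parameters of any real system:

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Classic stereo relation: depth Z = f * B / d.

    focal_px     -- focal length in pixels (illustrative value)
    baseline_m   -- distance between the two cameras, in meters
    disparity_px -- horizontal pixel shift of a point between left/right images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# A point shifting 35 px between cameras 0.5 m apart, with a 700 px focal
# length, lies at 700 * 0.5 / 35 = 10 m.
print(depth_from_disparity(700.0, 0.5, 35.0))  # 10.0
```

The inverse relationship is why stereo depth is precise for nearby objects (large disparity) and noisy for distant ones (small disparity).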

The difference: lidar gives only depth information, while cameras provide the visual information needed to understand semantics (the meaning of signs, and especially of language).

Today, Tesla's solution remains among the most reliable and is able to bring cars toward level 3 autonomy. Tesla has nevertheless transformed a sensor fusion problem (aggregating camera data with lidar, radar, etc.) into a vision model fusion problem (manually aggregating the outputs of a multitude of separate but interdependent ML models).
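This "model fusion" step can be pictured as hand-written glue that merges the outputs of several independently trained detectors into one scene description. A toy sketch, where the model names and output format are purely illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str        # e.g. "car", "pedestrian", "stop_sign"
    confidence: float

# Stand-ins for separate, independently trained vision models.
def car_model(frame):        return [Detection("car", 0.92)]
def pedestrian_model(frame): return [Detection("pedestrian", 0.81)]
def sign_model(frame):       return [Detection("stop_sign", 0.77)]

def fuse_scene(frame, threshold: float = 0.5):
    """Manually aggregate the outputs of each specialist model into a
    single scene description -- the hand-written fusion layer that a
    single end-to-end model would replace."""
    detections = []
    for model in (car_model, pedestrian_model, sign_model):
        detections.extend(d for d in model(frame) if d.confidence >= threshold)
    return detections

scene = fuse_scene(frame=None)
print([d.label for d in scene])  # ['car', 'pedestrian', 'stop_sign']
```

Every new model added to the fleet means more such glue code, which is exactly the maintenance burden the question below points at.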

Could the manual fusion of these models be a barrier to Tesla’s development in its autonomous car race? 

It was in 2016 that George Hotz, the founder of Comma.ai, set out to propose an alternative to Tesla's recipe: turn any car into an autonomous car with a smartphone and a camera. His recipe: "end-to-end learning." A single model is responsible for both observing and acting on the car (accelerating, braking, turning the wheel). This model decides by itself what action to take based on its observation (the difference between an automaton and an autonomous robot; see the previous article in this series). Hotz's ambition is to fulfill the functions of a level 2 autonomous car far better than existing systems by prioritizing intelligent use of data over sheer quantity. This strategy has proven efficient and promising, outperforming hybrid or manual heuristics-based approaches.
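By contrast with the fusion of many specialist models, the end-to-end approach collapses perception and control into one function: pixels in, a control command out. A minimal NumPy sketch of the idea; the tiny linear policy and random weights are purely illustrative stand-ins, not how Comma.ai's trained network actually works:

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of weights maps raw pixels directly to three control outputs:
# [steering, throttle, brake]. In a real system this would be a deep
# network trained on driving data; here it is an untrained stand-in.
IMG_H, IMG_W = 32, 64
weights = rng.normal(scale=0.01, size=(3, IMG_H * IMG_W))

def end_to_end_policy(frame: np.ndarray) -> np.ndarray:
    """Observation -> action in a single step: no hand-written
    detection, fusion, or planning stages in between."""
    features = frame.reshape(-1) / 255.0   # flatten and normalize pixels
    return np.tanh(weights @ features)     # actions bounded in [-1, 1]

frame = rng.integers(0, 256, size=(IMG_H, IMG_W)).astype(np.float64)
action = end_to_end_policy(frame)
print(action.shape)  # (3,)
```

The appeal is architectural: all the hand-written glue between models disappears, and the burden shifts entirely to the quality of the training data.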

Today, this generic approach (a single network capable of learning and performing several tasks at once) remains little adopted by AI and robotics companies, which tend to favor the heavy but reliable magic recipe. Other fields, however, have fully adopted and exploited it, such as NLP (natural language processing), which relies on huge models trained to understand how language works in a generic way (called language models).

Many solutions exist for the autonomous car race, but few people are looking at the issues of interpretation and the place of observation.

Our opinion on observation and its integration into the autonomous car system: companies set out to teach a computer to drive a car. As a reminder, a human, with only two eyes and thus no very advanced sensors, can learn to drive in about twenty hours. But step back far enough and the human does not learn to drive in only 20 hours, but in 18 years plus 20 hours: over 18 years of perception, they have confronted the world and every case imaginable, and they merely specialize during those 20 hours behind the wheel. The case is similar for autonomous cars: it is through observation, and therefore through large databases of driving footage, that the car learns to respond to a situation in the most appropriate way.



The challenges of autonomy do not only concern autonomous cars, which are still in their early stages, but also the future of robotics in every field and on a larger scale.