The future of artificial vision & Transformers

Coupled with unsupervised learning techniques, the architectures known as “transformers” have permitted the strong progression of NLP (Natural Language Processing) observed in recent years.

Yoshua Bengio, a pioneer in deep learning, describes the attention mechanisms at the heart of transformers as the new operation that makes it possible to build a more generic deep learning system, one that performs well across different modalities (audio, games, optical flow, image or video). Several recent studies, such as Perceiver IO, have highlighted the multimodal nature of these new architectures and the variety of massively available data on which they can operate. Recently, Tesla adopted transformers to optimize its autonomous vision system, which remains the most advanced in production to date.

In this article, we will evaluate the potential of transformers applied to computer vision and their effects on the future of the field.

Symbolic reasoning about the world

In 2021, the Facebook group, renamed Meta, increased its publications on the application of transformers to image understanding. The publication of DETR (DEtection TRansformer), which combines attention mechanisms with convolutional networks (CNNs), demonstrated the simplicity of this approach for object detection, a major task in computer vision. This approach contrasts with the increasing complexity of current detection architectures based purely on CNNs.
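DETR's core idea fits in a few lines: a CNN produces a feature map, the map is flattened into a sequence of tokens, and a small set of learned "object queries" cross-attends to those tokens, with each query emitting one box and one class. The following shape-level sketch (NumPy; the random weights stand in for the real trained modules, and the sizes are illustrative) shows the data flow, not the actual DETR implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                                  # token dimension, as in DETR

feat = rng.standard_normal((7, 7, d))    # stand-in for the CNN backbone output
tokens = feat.reshape(-1, d)             # 49 image tokens
queries = rng.standard_normal((10, d))   # 10 learned object queries (DETR uses 100)

# Cross-attention: each query gathers evidence from all image tokens.
scores = queries @ tokens.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
slots = weights @ tokens                 # one feature vector per candidate object

# Per-query prediction heads (random here, trained in practice).
W_cls = rng.standard_normal((d, 92))     # e.g. 91 COCO classes + "no object"
W_box = rng.standard_normal((d, 4))      # (cx, cy, w, h)
classes = slots @ W_cls
boxes = 1 / (1 + np.exp(-(slots @ W_box)))  # sigmoid keeps boxes in [0, 1]
print(classes.shape, boxes.shape)  # (10, 92) (10, 4)
```

Because each query predicts at most one object directly, there is no need for anchor boxes or non-maximum suppression, which is where the simplicity over CNN-only detection pipelines comes from.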

To understand the benefits of transformers, let us first consider the difference between image analysis and NLP.

In language processing, information arrives already segmented into letters, syllables or words, leaving the model free to concentrate on more complex tasks: generating automatic responses, dialoguing, summarizing text, etc.

In image analysis, the information is not pre-segmented. The continuous nature of the information contained in an image complicates the extraction of entities. Complex high-level tasks (predicting the behavior of entities, evaluating a 3D position, predicting a trajectory, tracking an object) require the network to reason about entities rather than about the image, and are therefore less accessible. A CNN builds a spatially entangled representation of the image, which complicates its symbolic (feature-based) analysis of the scene and thus limits its progress towards results as strong as those in NLP.
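One way transformers sidestep the lack of pre-segmented entities, popularized by the Vision Transformer (ViT), is to cut the image into fixed-size patches and treat each patch as a "word". A minimal sketch of this patchification step (NumPy; sizes are illustrative):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patches,
    each patch acting as one token for a transformer."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "sketch assumes sizes divide evenly"
    # (H//p, p, W//p, p, C) -> (num_patches, p*p*C)
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * c)

img = np.zeros((224, 224, 3))   # a standard input resolution
tokens = patchify(img)          # 14 * 14 = 196 tokens of dimension 768
print(tokens.shape)  # (196, 768)
```

Once the image is a token sequence, the same attention machinery used for text applies unchanged, which is what makes these architectures multimodal.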

The human visual cortex analyzes its environment by summarizing space into a set of entities. When a human drives, the successive images captured by the retina are automatically summarized by the brain into the set of entities essential to the driving task: “a pedestrian crossing is in front of me, a pedestrian is crossing.” High-entropy information is compressed by the human brain into a finite set of symbols.

It is this stage of perception that transformers allow vision models to perform. They expose a disentangled representation that supports symbolic reasoning about the world. Transformers are therefore a first step towards high-level reasoning.

In the research world, work is underway to test the performance of transformers, notably by substituting them for CNNs on certain tasks. Meta recently published 3DETR, a transformer for 3D object detection. While the Ego4D project has emerged to test the understanding of human actions from a first-person point of view, Tesla has switched to transformers to improve its scene understanding.

From image analysis to scene understanding

The architectural limitations mentioned above have contributed to the tendency of current vision systems to rely on the analysis of a single image at a time. However, fusing information beyond a single image is necessary to perform high-level vision tasks. Augmented reality, social networking and autonomous robotics increasingly exploit video to provide spatial and temporal understanding of the scene. Increasingly, the exploited data is represented in a 4D space that combines the 2D image with the temporal and spatial dimensions of the analyzed scene.

As late as 2019, Tesla relied on CNNs with independent computation for each camera on board the vehicle. The high-level reasoning required for driving was therefore achieved through a time-consuming and costly manual fusion of camera detections (sensor fusion). In addition, this technique reduced the genericity and transferability of the vision system to other domains. Since Tesla AI Day, we have seen that Tesla has placed transformers at the heart of its multi-camera architecture to automatically fuse information from each sensor.
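A common way to let a transformer fuse multi-camera information, rather than hand-writing fusion rules, is simply to concatenate each camera's tokens into one sequence: self-attention can then relate any observation to any other, regardless of which camera it came from. A shape-level sketch of this idea (NumPy; the camera count and token sizes are illustrative, not Tesla's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
# 8 cameras, each contributing 49 feature tokens (e.g. a flattened 7x7 map)
cameras = [rng.standard_normal((49, d)) for _ in range(8)]

# One joint sequence: attention can now mix tokens across cameras.
tokens = np.concatenate(cameras, axis=0)        # (392, d)
scores = tokens @ tokens.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # each row is a distribution
fused = weights @ tokens                        # every token now sees all cameras
print(fused.shape)  # (392, 128)
```

The fusion rule is learned from data instead of being engineered per camera pair, which is what makes the approach more generic and transferable.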

Yoshua Bengio’s explanation of the evolution of deep learning helps to understand Tesla’s recent technological direction. Following Daniel Kahneman’s dual-system theory of cognition, humans use two systems of thought:

System 1: what humans do intuitively, without any possible verbal explanation (e.g. driving "automatically" on a known road while talking with a passenger). This is today's deep learning, as used by Tesla for its autonomous driving.

System 2: what humans do with a conscious analysis of the environment (e.g. driving in an unfamiliar city and having to find one's way). Working memory is actively solicited. According to Y. Bengio, it is at the heart of this system that transformers and attention mechanisms reside.

“We can think of attention as a mechanism that creates a dynamic connection between two layers, whereas in a traditional network the connections are fixed. Here we can choose which input will be sent to the module we are using with the help of an attention mechanism.” Y. Bengio, From System 1 Deep Learning to System 2 Deep Learning, NeurIPS. 
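Bengio's "dynamic connection" can be made concrete with a minimal sketch of scaled dot-product attention, the operation at the core of transformers (NumPy only; the names and sizes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention: the mixing weights are computed
    from the inputs themselves, so the 'connections' change per input,
    unlike the fixed weights of a traditional layer."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # similarity of each query to each key
    weights = softmax(scores, axis=-1)      # each row sums to 1: dynamic routing
    return weights @ values                 # weighted mix of the values

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 8))   # 2 query tokens of dimension 8
k = rng.standard_normal((5, 8))   # 5 key tokens
v = rng.standard_normal((5, 8))   # 5 value tokens
out = attention(q, k, v)
print(out.shape)  # (2, 8)
```

The fixed weight matrices of a classical network are replaced here by `weights`, a matrix recomputed for every input, which is exactly the dynamic layer-to-layer connection the quote describes.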

This connection with cognitive science suggests that attention mechanisms are biologically plausible and allow for deeper reasoning. Moreover, the modularity of this system addresses the problem of scene understanding in general. This analysis partly explains Tesla’s technological choices, particularly with the Tesla Bot: the potential of its software technology lies in its transferability to the vast market of autonomous robotics, understood in a broad sense.

Considering the systemic change in needs linked to the rise of autonomous robotics, we can therefore presume that companies mastering this technology, like those mentioned above and Visual Behavior, will be positioned to supply the vision processors for tomorrow’s autonomous robots.
