AI armed with multiple senses could acquire more flexible intelligence
At the end of 2012, AI scientists first figured out how to get neural networks to “see”. They proved that software designed to loosely mimic the human brain could dramatically improve existing computer vision systems. The field has since learned how to make neural networks mimic the way we reason, hear, speak and write.
But while AI has become remarkably humanlike, even superhuman, at accomplishing a specific task, it still doesn’t capture the flexibility of the human brain. We can learn skills in one context and apply them in another. By contrast, although DeepMind’s game-playing algorithm AlphaGo can beat the best Go masters in the world, it can’t extend that strategy beyond the board. In other words, deep learning algorithms are masters of pattern detection, but they cannot understand and adapt to a changing world.
Researchers have many hypotheses about how this problem could be overcome, but one in particular has gained ground. Children learn about the world by sensing it and talking about it. The combination seems key. As children begin to associate words with images, sounds and other sensory information, they are able to describe increasingly complicated phenomena and dynamics, to distinguish what is causal from what merely reflects correlation, and to build a sophisticated model of the world. This model then helps them navigate unfamiliar environments and put new knowledge and experiences into context.
AI systems, on the other hand, are designed to do only one of these things at a time. Computer vision and audio recognition algorithms can detect things but cannot use language to describe them. A natural language model can manipulate words, but the words are detached from any sensory reality. If senses and language were combined to give an AI a more human way to collect and process new information, could it finally develop something like an understanding of the world?
The hope is that these “multimodal” systems, with access to both the sensory and linguistic “modes” of human intelligence, will give rise to a more robust type of AI that can adapt more easily to new situations or problems. Such algorithms could then help us solve more complex problems, or be built into robots capable of communicating and collaborating with us in our daily lives.
New advances in language processing algorithms like OpenAI’s GPT-3 have helped. Researchers now understand how to reproduce language manipulation well enough to make combining it with sensing capabilities potentially more fruitful. For starters, they are using the very first sensing capability the field achieved: computer vision. The results are simple bimodal models, or visual-language AI.
Over the past year, several interesting results have been obtained in this area. In September, researchers at the Allen Institute for Artificial Intelligence (AI2) created a model that can generate an image from a text caption, demonstrating the algorithm’s ability to associate words with visual information. In November, researchers at the University of North Carolina, Chapel Hill, developed a method that incorporates images into existing language models, which improved the models’ reading comprehension.
OpenAI then used these ideas to extend GPT-3. At the start of 2021, the lab published two visual-language models. One links the objects in an image to the words that describe them in a caption. The other generates images based on a combination of the concepts it has learned. You can prompt it, for example, to produce “a painting of a capybara sitting in a field at sunrise”. While it may never have seen this before, it can mix and match what it knows about paintings, capybaras, fields, and sunrises to imagine dozens of examples.
More sophisticated multimodal systems will also make possible more advanced robotic assistants (think robot butlers, not just Alexa). The current generation of AI-powered robots primarily uses visual data to navigate and interact with its environment. That is good for doing simple tasks in constrained environments, like fulfilling orders in a warehouse. But labs like AI2 are working to add language and incorporate more sensory inputs, such as audio and touch data, so that machines can understand commands and perform more complex operations, like opening a door when someone knocks.
In the long run, multimodal breakthroughs could help overcome some of AI’s biggest limitations. Experts argue, for example, that its inability to understand the world is also the reason it can easily fail or be deceived. (An image can be altered in ways that are imperceptible to humans but that cause an AI to identify it as something completely different.) Getting more flexible intelligence wouldn’t just unlock new AI applications: it would also make them safer. Algorithms that screen CVs would not treat irrelevant characteristics like gender and race as signs of ability. Self-driving cars would not lose their bearings in unfamiliar surroundings and would not crash in the dark or in snowy weather. Multimodal systems could become the first AIs we can really trust.