(Inside Science) -- Given a still image, a new artificial intelligence system can generate videos that simulate the future of that scene to predict what might happen next. Currently, these videos are less than two seconds long and can make people look like blobs. But researchers hope that in the future, more powerful versions of this system could help robots navigate homes and offices and also lead to safer self-driving cars.
Computers have grown steadily better at recognizing faces and other items within images. However, they still have major problems envisioning how the scenes they see might change, given the virtually limitless number of ways that items within images can interact.
To confront this challenge, computer scientist Carl Vondrick at the Massachusetts Institute of Technology's Computer Science and Artificial Intelligence Lab in Cambridge and his colleagues explore machine learning, a branch of artificial intelligence devoted to developing computers that can improve with experience. Specifically, they research "deep learning," where machine learning algorithms are run on advanced artificial neural networks designed to mimic the human brain.
In an artificial neural network, software or hardware components known as artificial neurons receive data, then cooperate to solve a problem such as reading handwriting or recognizing an image. The network can then alter the pattern of connections between those neurons to change the way they interact, after which the network attempts to solve the problem again. Over time, the network learns which patterns are best at computing solutions.
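As a rough illustration of that try-adjust-retry loop, the sketch below trains a single artificial neuron on a toy task, the logical OR of two inputs. It is not the researchers' system: the task, the learning rate, the number of passes and the function names are all assumptions chosen for illustration, and this toy adjusts the strength of its connections rather than rewiring a large network.

```python
# Toy example only: one artificial neuron learns the logical OR function.
# The data, learning rate and epoch count are illustrative assumptions.

def predict(weights, bias, inputs):
    """Output 1 if the weighted sum of the inputs exceeds zero, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def train(examples, epochs=20, learning_rate=0.1):
    """Attempt the task repeatedly, nudging connection strengths after each mistake."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)  # 0 when correct
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

OR_EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(OR_EXAMPLES)
learned = [predict(weights, bias, inputs) for inputs, _ in OR_EXAMPLES]
```

After a few passes over the four examples the neuron stops making mistakes, and `learned` matches the target outputs. Deep networks apply the same idea across millions of connections.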
The scientists first trained their system to generate videos by having it analyze more than 2 million videos downloaded from the image and video hosting website Flickr. Next, they took images of beaches, train stations, hospitals and golf courses and had their system generate videos predicting what the next few seconds of each scene might look like. For instance, beach scenes had crashing waves, while golf scenes had people walking on grass.
Vondrick and his colleagues used a deep-learning technique called "adversarial learning" that involves two competing neural networks. One network generates videos, while the other attempts to discriminate between real videos and the fakes its rival creates. Over time, the generator learns to fool the discriminator. A key trick for generating more realistic videos involved simulating moving foregrounds and stationary backgrounds.
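One plausible way to combine a moving foreground with a stationary background is to blend the two with a per-pixel mask. The sketch below shows only that blending step, not the adversarial training. The function name `compose_video`, the mask formula and the tiny one-row "scene" are hypothetical, chosen for illustration rather than taken from the researchers' code.

```python
def compose_video(foreground_frames, masks, background):
    """Blend moving foreground frames over one static background image.

    Each output pixel is mask * foreground + (1 - mask) * background.
    A mask value of 1 keeps the foreground; a value of 0 keeps the background.
    This blending rule is an assumption made for illustration.
    """
    video = []
    for frame, mask in zip(foreground_frames, masks):
        blended = [
            [m * f + (1.0 - m) * b for f, m, b in zip(f_row, m_row, b_row)]
            for f_row, m_row, b_row in zip(frame, mask, background)
        ]
        video.append(blended)
    return video

# Hypothetical 1x3 grayscale scene: a bright object moves left to right
# across a dim background that never changes.
background = [[0.2, 0.2, 0.2]]
foreground_frames = [[[1.0, 1.0, 1.0]] for _ in range(3)]
masks = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]]
video = compose_video(foreground_frames, masks, background)
```

Because the background is a single image reused in every frame, a generator built this way only has to model the parts of the scene that actually move.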
The scientists then used Amazon's Mechanical Turk, an online crowdsourcing marketplace, to hire 150 workers to compare their videos with snippets from real videos. In more than 13,000 comparisons, the system's videos were deemed as convincing as real videos about 20 percent of the time. "Our algorithm can generate a reasonably realistic video of what it thinks the future will look like," Vondrick said.
Although there were prior efforts to extrapolate videos from scenes, previous research tended to build up scenes frame by frame, which created "a large margin for error," Vondrick said. "It's kind of like a big game of telephone, where the message falls apart by the time you go around the whole room."
Instead of building up scenes sequentially, this new system generates all of a video's frames at the same time, "the 'telephone' equivalent of talking to everyone in the room at once," Vondrick said.
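The telephone analogy can be put in rough numbers. The toy model below assumes each prediction step retains a fixed fraction of the original signal. That retention figure is invented for illustration and is not a measurement from the study. When frames are chained, the losses multiply. When every frame is predicted directly from the input image, each frame is only one step removed from the source.

```python
def chained_fidelity(retained_per_step, num_frames):
    """Signal left in each frame when every frame is predicted from the
    previous one, so losses multiply along the chain."""
    return [retained_per_step ** step for step in range(1, num_frames + 1)]

def joint_fidelity(retained_per_step, num_frames):
    """Signal left when every frame is predicted directly from the input
    image, so each frame is a single step from the source."""
    return [retained_per_step for _ in range(num_frames)]

# Assumed figure: each prediction step keeps 90 percent of the signal.
chained = chained_fidelity(0.9, 10)
joint = joint_fidelity(0.9, 10)
```

Under these assumptions, the tenth chained frame keeps only about 35 percent of the original signal, while every jointly generated frame keeps 90 percent.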
"There has been previous work on video generation," said computer vision expert Kris Kitani from Carnegie Mellon University in Pittsburgh. "What's interesting here is the ability of the deep neural network to memorize large amounts of data, in this case video, in such a way that it preserves the essential structure of the data."
While the new strategy is more accurate than previous efforts, it makes long videos harder to create. Videos longer than 1.5 seconds will require a more complex model, Vondrick said. The system faces other challenges as well. Moving objects in the videos often appear at low resolution -- for example, the people on the beaches and golf courses are frequently pixelated lumps of color. Moreover, the system can leave out objects that belong in a scene, or "hallucinate" objects that are not there.
Although the system is currently a long way from practical applications, "technologies like this have the potential to improve the abilities of robots and [artificial intelligence] systems to be able to navigate unpredictable environments and even interact with humans," Vondrick said. "My pipe dream is to be able to develop a version of this algorithm that can actually generate fully-formed feature-length movies."
Kitani echoed this idea. "Taken to the extreme, maybe you could make a whole movie from just a script."
Vondrick and his colleagues will detail their findings on Dec. 7 at the Neural Information Processing Systems conference in Barcelona.