Computers at Google now have a machine-learning system that can analyze images like the one above and generate captions for them. The phrase used to caption this image? “A person riding a motorcycle on a dirt road.” It might not seem like much, but it’s actually one hell of an accomplishment.

First, let’s get you up to speed on the challenges of computer vision. Perhaps you’re familiar with the following XKCD comic:

The upshot is pretty straightforward: Automatically identifying objects in photographs is deceptively difficult for computers. How deceptive are we talking? The comic’s hover text summarizes the story of artificial intelligence pioneer Marvin Minsky and his now-infamous summer assignment:


In the 60s, Marvin Minsky assigned a couple of undergrads to spend the summer programming a computer to use a camera to identify objects in a scene. He figured they’d have the problem solved by the end of the summer. Half a century later, we’re still working on it.

That’s the gist. Here are a few extra details: In 1966, Minsky asked some of his MIT undergrads to “spend the summer linking a camera to a computer and getting the computer to describe what it saw.” Minsky’s colleague and longtime collaborator Seymour Papert drafted a plan of attack, which you can read here. In that plan, Papert explains that the task was chosen “because it can be segmented into sub-problems which will allow individuals to work independently and yet participate in the construction of a system complex enough to be a real landmark in the development of ‘pattern recognition’.” The task before them, in other words, seemed challenging but doable. The ill-fated “Summer Vision Project” was born.

Nearly half a century later, college courses on computer vision are still structured around roadblocks Minsky’s students first encountered that summer in 1966. We’re still wrestling with many of those challenges; others remain entirely unsolved. “It is difficult to say exactly what makes vision hard,” reads the introduction to this MIT course on fundamental and advanced topics in computer vision, “as we do not have a solution yet.”


That said, two of the broader challenges facing computer vision are clear. “First is the structure of the input,” reads the introduction to the MIT course, “and second is the structure of the desired output.”

Turning Pictures Into Words

In a recent blog post at the Google Research Blog, Google Research scientists Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan describe their approach to the input/output conundrum:

Many efforts to construct computer-generated natural descriptions of images propose combining current state-of-the-art techniques in both computer vision and natural language processing to form a complete image description approach. But what if we instead merged recent computer vision and language models into a single jointly trained system, taking an image and directly producing a human readable sequence of words to describe it?

The upshot is that Vinyals and his colleagues are using cutting-edge machine translation techniques to turn digital images (the input) into natural-sounding language (the output).

Computer caption: “Two pizzas sitting on top of a stove top oven.” /// Human caption: “Three different types of pizza on top of a stove.”

What’s impressive about that output is how descriptive it is. It does more than identify the object (or objects) in an image, something that’s been done in the past (two years ago, for example, Google researchers developed image-recognition software that could train itself to recognize photos of cats). Instead, the output describes the relationships between the objects. It provides a holistic description of what’s actually happening in the scene. The result is a caption that can wind up being surprisingly accurate, even next to captions provided by humans:

Computer caption: “A group of young people playing a game of frisbee.” /// Human caption: “A group of men playing frisbee in the park.”

How is this possible? The team’s system relies on recent advances in two types of neural networks. The first is structured to make sense of images. The second is designed to generate language.

As their name suggests, neural networks take their design inspiration from the organizational structure of neurons in the brain. The image-identifying, “deep” Convolutional Neural Network (CNN) used by Vinyals and his team relies on multiple layers of pattern identification. The first layer looks directly at the image and picks out low-level features like the orientation of lines, or patterns of light and dark. Above each layer is another layer that attempts to make sense of patterns from the layer beneath it. As you move further up the stack, the network begins to make sense of increasingly abstract patterns. The line orientations identified in the first layer might be recognized in a higher layer as a curved line. Higher still, another layer might recognize the curve as the shape of a cat’s ear. Eventually, you get to a layer that effectively says “this seems to be an image of a cat.”
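To make the first rung of that ladder concrete, here is a toy illustration (not Google’s model, and with a made-up image and filter): a single convolutional filter sliding across a grayscale image and responding to one low-level feature, a vertical edge, which is the kind of pattern a CNN’s bottom layer learns to detect.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation of a grayscale image with a small kernel."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Response at (i, j): how well this patch matches the kernel.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A tiny image that is dark on the left, bright on the right: one vertical edge.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# A simple vertical-edge filter: negative on the left, positive on the right.
edge_kernel = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])

response = convolve2d(image, edge_kernel)
print(response)  # The response peaks exactly where the dark-to-bright edge sits.
```

A real CNN stacks many such filters, with each layer filtering the responses of the layer below, which is what produces the progression from edges to curves to cat ears described above.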

What Vinyals and his team have done is combine the image-identifying powers of a deep CNN with the linguistic abilities of language-generating Recurrent Neural Networks (RNNs). Consider, for example, word2vec, a tool used in machine translation that transforms words, phrases, and sentences into “high dimensional vectors” — vectors whose characteristics are defined by a large number of components. If you find this tough to wrap your head around, computer scientists John Hopcroft and Ravi Kannan describe a scenario involving vectors in high-dimensional space that you might find helpful:

Consider representing a document by a vector each component of which corresponds to the number of occurrences of a particular word in the document. The English language has on the order of 25,000 words. Thus, such a document is represented by a 25,000-dimensional vector.
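Here is a miniature version of the scheme Hopcroft and Kannan describe, shrunk from 25,000 dimensions down to six (the vocabulary and example sentences are made up for illustration): each document becomes a vector of word counts, and documents that use similar words end up as nearby vectors.

```python
import math
from collections import Counter

# A six-word vocabulary standing in for the ~25,000 words of English.
vocabulary = ["cat", "dog", "rides", "motorcycle", "dirt", "road"]

def to_vector(document):
    """Represent a document as a vector of per-word occurrence counts."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]  # missing words count as 0

doc_a = "cat rides motorcycle on dirt road"
doc_b = "dog rides motorcycle"

vec_a = to_vector(doc_a)  # [1, 0, 1, 1, 1, 1]
vec_b = to_vector(doc_b)  # [0, 1, 1, 1, 0, 0]

def cosine(u, v):
    """Cosine similarity: closer to 1 means more similar word usage."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

print(round(cosine(vec_a, vec_b), 3))
```

The two sentences share “rides” and “motorcycle,” so their vectors point in broadly similar directions; a sentence with no overlapping words would score zero.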

A language-generating RNN can transform, say, a French sentence into a vector representation in “French Space.” Draw these vectors in a high-enough-dimensional space, and the system can represent how words, phrases, and sentences are similar to and different from one another. Feed that vector representation into a second RNN, and you can generate a sentence in German, then subject its constituent words and phrases to similar comparative analyses.
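The encoder-decoder arrangement described above can be sketched in a few lines. This toy is untrained, and its sizes, weights, and token IDs are arbitrary placeholders rather than any real translation model; the point is only the shape of the computation: one RNN folds an input sequence into a single fixed-size vector, and a second RNN unrolls that vector back into a sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, vocab_size = 8, 5
W_in  = rng.normal(size=(hidden_size, vocab_size))   # token -> hidden
W_rec = rng.normal(size=(hidden_size, hidden_size))  # previous hidden -> hidden
W_out = rng.normal(size=(vocab_size, hidden_size))   # hidden -> token scores

def one_hot(token_id):
    v = np.zeros(vocab_size)
    v[token_id] = 1.0
    return v

def encode(token_ids):
    """Encoder RNN: fold the whole input sequence into one hidden vector."""
    h = np.zeros(hidden_size)
    for t in token_ids:
        h = np.tanh(W_in @ one_hot(t) + W_rec @ h)
    return h  # the "sentence vector" living in, say, French Space

def decode(h, steps):
    """Decoder RNN: greedily emit output tokens from the sentence vector."""
    out = []
    for _ in range(steps):
        token = int(np.argmax(W_out @ h))
        out.append(token)
        h = np.tanh(W_in @ one_hot(token) + W_rec @ h)
    return out

sentence_vector = encode([0, 3, 1])  # a 3-token "French" sentence
print(decode(sentence_vector, 4))    # a 4-token "German" sequence (untrained, so arbitrary)
```

In a trained system the weights would be learned from paired sentences, so the decoder’s output would be a genuine translation rather than noise.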

What Vinyals and his team do is replace the first RNN (the French Space RNN) and its input words with a deep CNN trained to classify objects in images:

Normally, the CNN’s last layer... [assigns] a probability that each object might be in the image. But if we remove that final layer, we can instead feed the CNN’s rich encoding of the image into a RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that descriptions it produces best match the training descriptions for each image.
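The swap the quote describes can be sketched as follows. Again, this is an illustrative toy, not the published model: the feature vector stands in for what a CNN’s penultimate layer would produce for an image, and the weights are random placeholders. The one structural idea it shows is that the image encoding, rather than an encoded sentence, initializes the caption-generating RNN.

```python
import numpy as np

rng = np.random.default_rng(1)

feature_size, hidden_size, vocab_size = 16, 8, 6
W_img = rng.normal(size=(hidden_size, feature_size))  # image encoding -> initial hidden state
W_in  = rng.normal(size=(hidden_size, vocab_size))    # previous word -> hidden
W_rec = rng.normal(size=(hidden_size, hidden_size))   # previous hidden -> hidden
W_out = rng.normal(size=(vocab_size, hidden_size))    # hidden -> word scores

def caption(image_features, max_words=5):
    """Generate word IDs from an image feature vector, one word at a time."""
    # "Remove the final layer, feed the CNN's rich encoding into the RNN":
    h = np.tanh(W_img @ image_features)
    words = []
    for _ in range(max_words):
        word = int(np.argmax(W_out @ h))
        words.append(word)
        x = np.zeros(vocab_size)
        x[word] = 1.0
        h = np.tanh(W_in @ x + W_rec @ h)
    return words

fake_cnn_features = rng.normal(size=feature_size)  # stand-in for a real CNN encoding
print(caption(fake_cnn_features))
```

Training the whole pipeline end to end on image-caption pairs, as the quote describes, is what turns these arbitrary word IDs into sentences like “a person riding a motorcycle on a dirt road.”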

The result is a software program that can learn to identify patterns in pictures. Vinyals and his team trained their system with datasets of digital images that had previously been annotated by humans with descriptive sentences. Then they asked their system to describe images it had never seen before.

The descriptions aren’t always 100% accurate, as this selection of human-rated evaluation results clearly illustrates, but the system manages to be impressive, even when it falters:

See what I mean? Sure, many of the model’s mistakes are funny, but they’re also kind of endearing. Its missteps are charming, in the same almost-right-but-still-laughably-wrong way that a toddler’s observations often are (e.g. describing a pink scooter as a “red motorcycle,” or an obviously static dog as “jumping to catch a frisbee”). It’s like observing the machine at an intermediate stage of its intellectual development – and in a very real sense, you are.

These results, so far, look promising, and you can read about them in the team’s full research paper over on arXiv. Measured quantitatively, the team’s program was able to describe objects and their relationships at more than twice the accuracy of previous technologies. In the near future, this kind of technology could be a boon to the visually impaired, or a descriptive aid to people in remote regions who can’t always download large images over low-bandwidth connections. And, of course, there’s Google search. When you search for images today, you probably do so not with natural-sounding sentences, but with key words. The system from Vinyals and his team could change that. Imagine searching not for “cat shaq wiggle,” but, more accurately and descriptively, “Shaq and a cat wiggle with excitement.”