Imagine you're dropped in the middle of an unknown city and you need to find a restaurant in a safe area using only the visual cues around you. Sounds hard, right? Well, researchers at MIT have developed an algorithm that can do it, and on some of these tasks it consistently outperforms humans.
Looking at the image above, can you rank them by their distance to the closest McDonald's? What about ranking them based on the crime rate in the area? While not directly visible (i.e. we do not see any McDonald's or crime in action), we can predict the possible actions or the type of surrounding establishments from just a small glimpse of our surroundings. (Image and caption credit: Khosla et al.)
When we make inferences about our surroundings, we often look beyond the immediate "visual scene" and weigh a wide range of variables while making real-time judgment calls. Inspired by this, MIT researchers wanted to teach computers to "see" in the same way. Their resulting algorithm, once trained, can analyze a pair of photos and determine things such as which scene has a higher crime rate, which is doing better economically, and which is closer to a McDonald's or a hospital.
Results showed that the computer did better than humans on certain tasks. For example, when assessing the crime rate of a neighborhood based on visual cues alone, humans were accurate nearly 60% of the time, while the computer achieved an accuracy rate of 72.5%. As the researchers note in their paper, "[An] interesting thing to note is that computers significantly outperform humans, better being able to pick up on the visual cues that enable the prediction of crime rate in a given area."
In support of the project, the researchers set up an online demo that puts you in the middle of a Google Street View scene. You have four directional options, and the goal is to navigate to the nearest McDonald's in the fewest possible steps.
Interestingly, humans tend to be better at this specific task than the algorithm, but the researchers discovered that the computer consistently outperformed humans at a variation of the task in which users were shown two photos and asked which scene was closer to a McDonald's.
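To make the two-photo task concrete, here is a minimal sketch of how accuracy on such a comparison could be scored. It is not the researchers' code: the function names, the photo labels, and the distance values are all made up for illustration, and the "predicted" distances stand in for whatever a trained model (or a human guesser) would output.

```python
from typing import Dict, List, Tuple


def pick_closer(distances: Dict[str, float], a: str, b: str) -> str:
    """Return whichever of the two photos has the smaller distance."""
    return a if distances[a] <= distances[b] else b


def pairwise_accuracy(
    predicted: Dict[str, float],
    actual: Dict[str, float],
    pairs: List[Tuple[str, str]],
) -> float:
    """Fraction of pairs where the predicted 'closer' photo matches reality."""
    if not pairs:
        raise ValueError("need at least one pair to score")
    correct = sum(
        pick_closer(predicted, a, b) == pick_closer(actual, a, b)
        for a, b in pairs
    )
    return correct / len(pairs)


# Hypothetical distances in metres to the nearest McDonald's.
# These numbers are invented for the example, not taken from the study.
actual = {"photo_a": 200.0, "photo_b": 900.0, "photo_c": 1500.0}
predicted = {"photo_a": 350.0, "photo_b": 1200.0, "photo_c": 800.0}

pairs = [("photo_a", "photo_b"), ("photo_a", "photo_c"), ("photo_b", "photo_c")]
accuracy = pairwise_accuracy(predicted, actual, pairs)
print(f"pairwise accuracy: {accuracy:.1%}")
```

In this toy example the guesser gets two of the three comparisons right. Figures like the 72.5% reported for the crime-rate task are the same kind of fraction, computed over a much larger set of pairs.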
MIT News explains more:
To create the algorithm, the team — which included PhD students Aditya Khosla, Byoungkwon An, and Joseph Lim, as well as CSAIL principal investigator Antonio Torralba — trained the computer on a set of 8 million Google images from eight major U.S. cities that were embedded with GPS data on crime rates and McDonald's locations. They then used deep-learning techniques to help the program teach itself how different qualities of the photos correlate. For example, the algorithm independently discovered that some things you often find near McDonald's franchises include taxis, police vans, and prisons. (Things you don't find: cliffs, suspension bridges, and sandbars.)
"These sorts of algorithms have been applied to all sorts of content, like inferring the memorability of faces from headshots," said Khosla. "But before this, there hadn't really been research that's taken such a large set of photos and used it to predict qualities of the specific locations the photos represent."
So it's a kind of proof-of-concept that computer algorithms are capable of advanced scene understanding. One idea that Khosla has for the program is to create a navigation app that avoids high-crime areas.
Image: Khosla et al/MIT Press