How did this game bot score higher than humans on a Turing Test?


Anyone who plays video games knows that game bots, artificially intelligent virtual gamers, can be spotted a mile away on account of their mindless predictability and utter lack of behavioral realism. Looking to change this, 2K Games recently launched the BotPrize competition, a kind of Turing Test for nonplayer characters (NPCs). And remarkably, this year's co-winner, a team from The University of Texas at Austin, created an NPC so realistic that it appeared to be more human than the human players, which is kind of a problem when you think about it.



To create their super-realistic game bots, the software developers, a team led by Risto Miikkulainen, programmed their NPCs with pre-existing models of human behavior and fed them through a Darwinian weeding-out process called neuroevolution. Essentially, the only bots that survived into successive generations were the ones that appeared to be the most human, a process the developers and competition organizers likened to passing the classic Turing Test, in which a judge tries to distinguish an AI from an actual human.


With each passing generation, the developers re-inserted exact copies of the surviving NPCs, along with slightly modified (or mutated) versions, thus allowing for ongoing variation and selection. The simulation was run over and over again until the developers were satisfied that their game bot had evolved the desired characteristics and behavior. And in fact, Miikkulainen and his team have been refining their virtual player over the past five years.
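The generational loop described above can be sketched in a few lines of Python. To be clear, this is a toy illustration and not the team's actual code: real neuroevolution evolves neural-network controllers and scores candidates on observed in-game behavior, whereas here "humanness" is a stand-in scoring function and each bot is just a list of numbers.

```python
import random

POP_SIZE = 20       # bots per generation
SURVIVORS = 5       # how many "most human" bots survive
MUTATION_STD = 0.1  # size of the random tweaks applied to copies
TARGET = [0.3, 0.7, 0.5]  # hypothetical "human-like" behavior profile

def humanness(genome):
    # Higher is better: negative squared distance from the target profile.
    # In the real competition this score came from human judges.
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))

def mutate(genome):
    # A slightly modified ("mutated") copy of a surviving genome.
    return [g + random.gauss(0, MUTATION_STD) for g in genome]

def evolve(generations=50):
    population = [[random.random() for _ in TARGET] for _ in range(POP_SIZE)]
    for _ in range(generations):
        # Only the bots that appear most human survive this generation.
        population.sort(key=humanness, reverse=True)
        survivors = population[:SURVIVORS]
        # Re-insert exact copies of the survivors, then fill the rest of
        # the population with mutated variants, allowing ongoing
        # variation and selection.
        population = [list(g) for g in survivors]
        while len(population) < POP_SIZE:
            population.append(mutate(random.choice(survivors)))
    return max(population, key=humanness)

best = evolve()
```

Because exact copies of the survivors are carried forward each generation, the best score never gets worse; the mutated copies are what let the population keep exploring new behavior.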


The final manifestation of their efforts was dubbed UT^2 — and it was this NPC that went head-to-head against human opponents and other game bots at the 2K Games tournament.


And the game of choice? Unreal Tournament 2004, of course. The game was selected on account of its complex gameplay and 3D environments — a challenge that would require humans and bots to move around in 3D space, engage in chaotic combat against multiple opponents, and reason about the best strategy at crucial moments. The game also tends to elicit some telltale human behaviors, including irrationality, anger, and impulsivity.

As each player (human or otherwise) worked to eliminate their opponents, they were also assessed for their "humanness." By the end of the tournament, there were two clear winners: UT^2 and MirrorBot (developed by Romanian computer scientist Mihai Polceanu). Both NPCs scored a humanness rating of 52%, which would be all well and good except that the human players scored only 40%.


In other words, the game bots appeared to be more human than human.

Limits of the Turing Test

Now, this is a serious problem. Human players should have been assessed with a humanness rating of 100%, not 40%. Clearly, the judges utterly failed to identify true human characteristics among the human players. Consequently, UT^2 and MirrorBot essentially achieved a rating better than 100% — which is impossible. How can an imitation score as more human than the humans it is imitating?


And indeed, this experiment is a good showcase for the limits of the Turing Test. Admittedly, the 2K Games tournament wasn't meant to be a true Turing Test, merely one that measured the humanness of NPCs in a very specific gaming setting. That said, the results demonstrated that human behavior is much more complex and difficult to quantify than we tend to think. Human idiosyncrasy, plus our ability to adapt and counter-adapt to attempts at identification, will likely keep human behavior forever beyond the reach of a simple Turing Test.

For example, given the implications of the 2K Games tournament, how are we supposed to assess something like a chatbot for its humanness now that we know something can apparently appear to be more human-like than humans? Moreover, given all the subjectivity involved in the evaluation, how accurate is any of this?


Perhaps it's time to retire the Turing Test and come up with something a bit more... scientific.

Inset image via Jacob Schrum/University of Texas at Austin.




This doesn't mean that the Turing Test is flawed... it just means that an FPS game, with no other input, is not an adequate window into the "humanness" of an entity.

In reality, the best human players WILL seem robot-like: the very nature of FPS games requires that to be the best, you eliminate unnecessary motion. You learn where to hide. You learn where the ammo and weapon drops are. You learn the best possible way to get around a map and wipe the floor with your enemies.

Based solely on another player's actions, I don't think I could tell a good player from even an ordinary bot. They'd be doing exactly the same things.