You may have heard of the "Turing Test." It was proposed by Alan Turing in 1950 as a way to determine whether an Artificial Intelligence system "can think." There's extensive debate about whether it's even possible in principle to build a device which can properly be spoken of as "thinking"; but at this point AI is so far from that goal that we can put off the question for some time.
(One exception is a program known as PARRY, which was designed to imitate the conversation of paranoid schizophrenics. Psychiatrists reviewing transcripts of conversations that other psychiatrists had held with PARRY and with actual paranoid schizophrenic patients did no better than chance at telling them apart. I'm not sure if that amounts to a stunning triumph for AI, or not.)
Nevertheless, every year a contest is held for the "Loebner Prize," which is sort of for passing the Turing Test, but more realistically for the best "conversational AI" program or "chatbot" available so far (regardless of whether it's anywhere near passing or not).
In short, the Turing Test says to have a human (the "tester") converse with a human and with a program, and then have the tester declare which was which. If they can't tell any better than chance, then the program at the other end passes (though perhaps we should instead say that the human at the other end fails?).
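To make "no better than chance" concrete, here is a minimal sketch of an exact one-sided binomial test. It assumes the tester judges several independent sessions and that pure guessing is right half the time; the function name and the 7-of-10 figure are made up for illustration and do not come from any real contest.

```python
from math import comb

def p_value_better_than_chance(correct: int, trials: int, chance: float = 0.5) -> float:
    """Probability of at least `correct` right answers in `trials` by guessing alone."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Hypothetical numbers: the tester picks out the program in 7 of 10 sessions.
p = p_value_better_than_chance(7, 10)
print(f"p = {p:.3f}")  # prints "p = 0.172"
```

A guesser would do at least that well about 17% of the time, so 7 out of 10 is weak evidence that the tester can really tell the two apart. With only a handful of sessions per tester, "better than chance" is hard to establish either way.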
Most people don't think that making a computer *look* human or have a good human accent are the "hard" parts, so those features are usually excluded from the test. Typically the tester types messages, which either go to a computer program somewhere or to a display in front of a human in another room. Responses come back the same way.
The tester can use any criteria they like for deciding; and in a real Turing Test they can also say or discuss anything they like. For example:
* The tester can notice whether answers come back too fast or too slow
* The tester can ask about common knowledge ("are koalas bigger than DeLoreans?"), or about really uncommon knowledge ("what's the 249th digit of pi?")
* The tester can try jokes, puns, or poor taste
* The tester can try to get their interlocutor mad, make a pass at them, or bore them
* The tester can use symbolic language, obscure metaphors, bad or misleading spellings, or other linguistic behaviors to see how the other side reacts
In practice the tester is commonly limited to 5 minutes of questioning, must remain within certain subjects, and/or is chosen to have no relevant expertise. It turns out that naive testers can be *really* naive -- an early program that had no pretensions to intelligence at all won because it generated human-like typographical errors. A recent winner was very up-front about being a computer -- something testers might have presumed a computer imitating a human would never admit.
You can see transcripts of the 2008 conversations at http://loebner.net/Prizef/2008_Contest/loebner-prize-2008.html (other information is also available on the same site, though not all the transcripts are legible). Perhaps unfortunately, nothing is said about goals that the non-testers are to pursue -- is each supposed to try to convince the tester that it's human (or that it's a computer)?
Even a quick look shows that the 2008 winner ("Elbot") usually answers with a non sequitur ("Have you any plans for later in the day?" "This is confusing. Why bring the waking hours into this?"). Humans seldom do that. Elbot also makes suspiciously few spelling errors.
So far as I know, no one has yet used natural language processing to analyze the transcripts. I wonder what we would get if we ran each participant's text through OpenAmplify and did some statistics on the results? Do the chatbots use vocabulary or parts-of-speech in unusual ways? Do they almost always have exactly one topic in common with the preceding tester sentence? Do they show different patterns of "sentiment" or other measures?
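As a rough starting point for that kind of analysis, here is a minimal sketch in plain Python. It does not use OpenAmplify (whose API is not shown here); it only counts content words shared between a tester's sentence and the reply, plus a simple vocabulary-variety score. The stopword list and function names are assumptions invented for this example.

```python
import re

# Tiny hand-picked stopword list; an assumption for illustration only.
STOPWORDS = {
    "a", "an", "the", "is", "are", "you", "i", "of", "in", "for", "to",
    "this", "any", "have", "why", "into", "what", "do", "it",
}

def content_words(text: str) -> list[str]:
    """Lowercase the text and keep only words that are not stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def shared_topics(prompt: str, reply: str) -> set[str]:
    """Content words that appear in both the tester's prompt and the reply."""
    return set(content_words(prompt)) & set(content_words(reply))

def type_token_ratio(texts: list[str]) -> float:
    """Distinct content words divided by total content words (0.0 if none)."""
    tokens = [w for t in texts for w in content_words(t)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# The Elbot exchange quoted above.
prompt = "Have you any plans for later in the day?"
reply = "This is confusing. Why bring the waking hours into this?"
print(shared_topics(prompt, reply))  # prints "set()"
```

Exact word matching finds nothing in common here, even though "day" and "waking hours" are plainly related, which is the sort of gap a real topic or sentiment analyzer would be needed to close. Run over whole transcripts, per-participant averages of these numbers could then be compared between humans and chatbots.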
I wonder if AI can beat itself at its own game?
Posted 26 Jan 2010 1:54 PM by sderose