Who is taller, Prince William or his baby son Prince George? Can you make a salad out of a polyester shirt? If you stick a pin into a carrot, does it make a hole in the carrot or in the pin? These types of questions may seem silly, but many intelligent tasks, such as understanding texts, computer vision, planning, and scientific reasoning, require the same kinds of real-world knowledge and reasoning abilities. For instance, if you see a six-foot-tall person holding a two-foot-tall person in his arms, and you are told they are father and son, you do not have to ask which is which. If you need to make a salad for dinner and are out of lettuce, you do not waste time considering improvising by taking a shirt out of the closet and cutting it up. If you read the text, "I stuck a pin in a carrot; when I pulled the pin out, it had a hole," you need not consider the possibility that "it" refers to the pin.
To take another example, consider what happens when we watch a movie, putting together information about the motivations of fictional characters we have met only moments before. Anyone who has seen the unforgettable horse's head scene in The Godfather immediately realizes what is going on. It is not just that it is unusual to see a severed horse head; it is clear that Tom Hagen is sending Jack Woltz a message: if I can decapitate your horse, I can decapitate you; cooperate, or else. For now, such inferences lie far beyond anything in artificial intelligence.
In this article, we argue that commonsense reasoning is important in many AI tasks, from text understanding to computer vision, planning and reasoning, and discuss four specific problems where substantial progress has been made. We consider why the problem in its general form is so difficult and why progress has been so slow, and survey various techniques that have been attempted.
The importance of real-world knowledge for natural language processing, and in particular for disambiguation of all kinds, was discussed as early as 1960, by Bar-Hillel,3 in the context of machine translation. Although some ambiguities can be resolved using simple rules that are comparatively easy to acquire, a substantial fraction can only be resolved using a rich understanding of the world. A well-known example from Terry Winograd48 is the pair of sentences "The city council refused the demonstrators a permit because they feared violence," vs. "... because they advocated violence." To determine that "they" refers to the council if the verb is "feared," but refers to the demonstrators if the verb is "advocated," demands knowledge about the characteristic relations of city councils and demonstrators to violence; no purely linguistic clue suffices.a
Machine translation likewise often involves problems of ambiguity that can only be resolved by achieving an actual understanding of the text and bringing real-world knowledge to bear. Google Translate often does a fine job of resolving ambiguities by using nearby words; for instance, in translating the two sentences "The electrician is working" and "The telephone is working" into German, it correctly translates "working" as meaning "laboring" in the first sentence and as meaning "functioning correctly" in the second, because in the corpus of texts Google has seen, the German words for "electrician" and "laboring" are often found close together, as are the German words for "telephone" and "function correctly."b However, if you give it the sentences "The electrician who came to fix the telephone is working," and "The telephone on the desk is working," interspersing several words between the critical elements (for example, between electrician and working), the translations of the longer sentences say the electrician is functioning properly and the telephone is laboring (Table 1). A statistical proxy for commonsense that worked in the simple case fails in the more complex case.
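To make the failure mode concrete, here is a minimal sketch of a nearby-word heuristic. It is a toy under stated assumptions, not a description of how Google Translate or any real system works: the cue table `CUES`, the function `pick_sense`, and the three-token window are all hypothetical names and values chosen for illustration.

```python
# Toy illustration only: a hypothetical window-based sense picker.
# CUES, pick_sense, and the window size are assumptions made for this
# sketch; they do not describe any real translation system.

# Hypothetical cue table: which nearby noun suggests which sense of "working".
CUES = {
    "electrician": "laboring",
    "telephone": "functioning",
}


def pick_sense(sentence, target="working", window=3):
    """Return the sense suggested by the cue word nearest to `target`
    among the `window` tokens before it, or None if no cue is in range."""
    tokens = sentence.lower().rstrip(".").split()
    idx = tokens.index(target)
    # Scan backward from the target, nearest token first.
    for i in range(idx - 1, max(idx - window, 0) - 1, -1):
        if tokens[i] in CUES:
            return CUES[tokens[i]]
    return None


print(pick_sense("The electrician is working"))
# laboring
print(pick_sense("The telephone is working"))
# functioning
print(pick_sense("The electrician who came to fix the telephone is working"))
# functioning  (wrong: the nearest cue is "telephone", not the subject)
print(pick_sense("The telephone on the desk is working"))
# None  (the only cue word lies outside the window)
```

In this toy, the short sentences come out right because the cue word sits next to the verb. Once other words intervene, the heuristic either latches onto the wrong noun or finds no cue at all. Nothing in it represents who is doing the working.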
Almost without exception, current computer programs that carry out language tasks succeed to the extent that the tasks can be carried out purely by manipulating individual words or short phrases, without attempting any deeper understanding. Commonsense is evaded in order to focus on short-term results, but it is difficult to see how human-level understanding can be achieved without greater attention to commonsense.
Watson, the "Jeopardy"-playing program, is an exception to the above rule only to a small degree. As described in Kalyanpur,27 commonsense knowledge and reasoning, particularly taxonomic reasoning, geographic reasoning, and temporal reasoning, played some role in Watson's operations but only a quite limited one, and they made only a small contribution to Watson's success. The key techniques in Watson are mostly of the same flavor as those used in programs like Web search engines: there is a large collection of extremely sophisticated and highly tuned rules for matching words and phrases in the question with snippets of Web documents such as Wikipedia; for reformulating the snippets as an answer in proper form; and for evaluating the quality of proposed possible answers. There is no evidence that Watson is anything like a general-purpose solution to the commonsense problem.
Computer vision. Similar issues arise in computer vision. Consider the photograph of Julia Child's kitchen (Figure 1): Many of the objects that are small or partially seen, such as the metal bowls in the shelf on the left, the cold water knob for the faucet, the round metal knobs on the cabinets, the dishwasher, and the chairs at the table seen from the side, are only recognizable in context; seen in isolation, each would be difficult to identify. The top of the chair on the far side of the table is only identifiable because it matches the partial view of the chair on the near side of the table.
The viewer infers the existence of objects that are not in the image at all. There is a table under the yellow tablecloth. The scissors and other items hanging on the board in the back are presumably supported by pegs or hooks. There is presumably also a hot water knob for the faucet occluded by the dish rack. The viewer also infers how the objects can be used (sometimes called their "affordances"); for example, the cabinets and shelves can be opened by pulling on the handles. (Cabinets, which rotate on joints, have the handle on one side; shelves, which pull out straight, have the handle in the center.)
Movies would prove even more difficult; few AI programs have even tried. The Godfather scene mentioned earlier is one example, but almost any movie contains dozens or hundreds of moments that cannot be understood simply by matching still images to memorized templates. Understanding a movie requires a viewer to make numerous inferences about the intentions of characters, the nature of physical objects, and so forth. In the current state of the art, it is not feasible even to attempt to build a program that will be able to do this reasoning; the most that can be done is to track characters and identify basic actions like standing up, sitting down, and opening a door.4
Robotic manipulation. The need for commonsense reasoning in autonomous robots working in an uncontrolled environment is self-evident, most conspicuously in the need to have the robot react appropriately to unanticipated events. If a guest asks a waiter-robot for a glass of wine at a party, and the robot sees that the glass it has picked up is cracked, or has a dead cockroach at the bottom, the robot should not simply pour the wine into the glass and serve it. If a cat runs in front of a house-cleaning robot, the robot should neither run it over nor sweep it up nor put it away on a shelf. These things seem obvious, but ensuring that a robot avoids mistakes of this kind is very challenging.