We want to build a robot that can help a disabled user eat an ordinary meal at an ordinary dining table. The user tells the robot “get me a grape” or “get me some hummus,” and the robot uses its fingers or a utensil to acquire the food and convey it to the user’s mouth. This simple task offers a microcosm of problems in AI, perception, and robotics.
Food violates the usual assumptions found in machine vision and robotic systems. Consider a plate of hummus. Hummus is not an object, and it is not a texture. It is a kind of stuff with certain optical and mechanical properties. Its optical properties are quite complex: not at all like the uniform opaque Lambertian surfaces favored in computer vision. Its geometry is irregular and unpredictable. The mechanical properties of hummus are maddeningly complex. It is neither rigid nor elastic. It can flow, but it is highly non-Newtonian. No one has a good parametric model of hummus.
What does the robot need to know about hummus? It doesn’t need an accurate physical model. It needs to understand, qualitatively and (to some coarse fidelity) quantitatively, how the hummus will react to a spoon or fork impinging upon it. The robot needs an intuitive physics. More to the point, it needs to understand the affordances of hummus, i.e., the various useful ways that the robot can physically interact with the hummus in pursuit of a goal.
Since we want our robot to deal with a large variety of actions taken on a wide variety of foods, it makes sense to use machine learning. The robot, through its own experience manipulating food, or by watching others do so, will learn the consequences of various actions taken on various foods. If the robot has seen lots of examples of spoons interacting with hummus, it can build up a model of the interaction process.
The problem is: how can this kind of knowledge be represented? It seems impractical to develop a mathematically correct model of hummus and run simulations. Another approach would be to code the prior experiences with symbolic descriptors (e.g., spoon approaching hummus, spoon entering hummus, hummus locally deforming) and to learn rules connecting those descriptors. This seems daunting, since we don’t have a language to describe the way in which utensils and hummus interact.
Instead, we plan to use a data-driven approach. For the robot to develop a data-driven model of hummus, it must begin by watching and storing vast numbers of hummus interactions. It will accumulate a library of video clips, each augmented with information about the motions and forces of the arm and hand along the recorded trajectory, collected using instrumentation that we will design and implement. Now suppose that the robot confronts a new tabletop scene. Chances are it has seen similar scenes before. It can also play back what happened when it took various actions, and thus it can predict what will happen when it takes similar actions now.
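As a purely illustrative sketch of this retrieval idea (all class and field names here are hypothetical, not a finalized design), each stored clip could be summarized by a feature vector describing the scene and action, and prediction could amount to looking up the outcomes of the most similar stored experiences:

```python
# Illustrative sketch: data-driven outcome prediction by nearest-neighbor
# retrieval over a library of recorded interaction clips.
import numpy as np

class InteractionLibrary:
    def __init__(self):
        self.features = []   # one feature vector per stored clip (scene + action)
        self.outcomes = []   # the recorded result of each interaction

    def add_clip(self, feature_vec, outcome):
        """Store a clip's feature summary and what happened in it."""
        self.features.append(np.asarray(feature_vec, dtype=float))
        self.outcomes.append(outcome)

    def predict(self, query_vec, k=3):
        """Predict by returning outcomes of the k most similar stored clips."""
        q = np.asarray(query_vec, dtype=float)
        dists = [np.linalg.norm(f - q) for f in self.features]
        nearest = np.argsort(dists)[:k]
        return [self.outcomes[i] for i in nearest]
```

The point of the sketch is that prediction stays close to the raw data: no parametric model of hummus is ever fit, only a similarity metric over remembered experiences.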
A key to success is finding a representation for expressing similarity. Freeman, Torralba, and colleagues have had success in developing good similarity metrics within other databases of images and image sequences, which gives us confidence that a similar approach can be used here. In their previous work, they showed that it is possible to specify a starting state and an ending state (e.g., a starting image and an ending image) and to find a smooth trajectory from one to the other. This has been done with sequences and gestures as well as with images.
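One toy way to picture trajectory-finding in a feature space (this is a simplified stand-in, not the actual method of that prior work) is to interpolate between the start and end states and snap each waypoint to its nearest stored example, so the path stays on data the system has actually seen:

```python
# Toy sketch: a "trajectory" from a start state to an end state,
# obtained by linear interpolation in feature space with each
# waypoint snapped to the nearest example in the library.
import numpy as np

def snap_trajectory(start, end, library, n_steps=5):
    """Interpolate start -> end; snap each waypoint to the closest
    feature vector in the library of stored examples."""
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    lib = np.asarray(library, dtype=float)
    path = []
    for t in np.linspace(0.0, 1.0, n_steps):
        waypoint = (1.0 - t) * start + t * end
        nearest = lib[np.argmin(np.linalg.norm(lib - waypoint, axis=1))]
        path.append(nearest)
    return np.array(path)
```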
For our project, the goal is to build up a library that is quite rich in data. For example, in addition to having video sequences, we can augment the 2D video with 3D from a Kinect depth camera or a motion-capture system, as well as with joint angles, forces, and torques from the arm itself. Finally, we plan to build advanced tactile sensors using our novel GelSight technology, to provide unprecedented sensory detail about the interaction with the food.
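To make the richness of each library entry concrete, a single annotated clip might bundle all of these modalities together; the field names and shapes below are illustrative only, not a finalized schema:

```python
# Illustrative sketch of one richly annotated interaction clip,
# bundling video, depth, proprioception, force, and tactile data.
from dataclasses import dataclass
import numpy as np

@dataclass
class InteractionClip:
    video: np.ndarray          # (T, H, W, 3) RGB frames
    depth: np.ndarray          # (T, H, W) depth maps, e.g. from a Kinect
    joint_angles: np.ndarray   # (T, n_joints) arm configuration over time
    forces: np.ndarray         # (T, 6) force/torque readings from the arm
    tactile: np.ndarray        # (T, Ht, Wt) tactile images, e.g. from GelSight
    label: str = ""            # e.g. "spoon scooping hummus"

    def duration(self, fps=30.0):
        """Clip length in seconds, given the video frame rate."""
        return self.video.shape[0] / fps
```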
We will use the PR2 robot that is already in place in CSAIL. We will also gather training data by equipping a human operator with an instrumented glove in a motion-capture system. Thus our library will contain both human actions and robot actions, each annotated with rich sensory data.
The end result will be a new way of thinking about and representing affordances for a class of materials that has not previously received much attention. Instead of trying to capture the affordances of hummus in words, or in parametric physical models, we will build a large library that will stay relatively close to the original data. This has the advantage of great flexibility, allowing us to work with multiple actions on multiple types of foods.
A second end result will be a robot that can feed a disabled person. We feel that this project is exciting on both the theoretical and the practical level.