Hi, I'm Robert. I'm employed by a new media agency called Nascom where I look after the information architecture and user experience of things.
Project Natal, how it could work
Tuesday, 02 June 2009

Project Natal was introduced yesterday at the E3 conference. The demo was impressive, but a critical viewer might wonder about the input, and more specifically about the seamless conversion of camera images into body movement. If you’re interested in my wild guess on the technology side, read on, but take it with a grain of salt.

The movement tracking

When I saw it, the first thing that came to mind was the research of Tokyo University engineer Tsuyoshi Horo, who controlled his robot with gestures. What I found particularly interesting was the way he processed the input from his cameras. The output was based on Human Volume Reconstruction, which means the system perceived the user as a virtual object composed of little virtual cubes.

![]()

Technically, an array of cameras detected body movement. This resulted in a real-time 3D volumetric model of himself built from these small virtual cubes. By analyzing the cubes as data, gestures could be extracted. From there on it’s all math and probability calculations mapped onto a model of the human body. The video below shows the ropes:

The fact is that Human Volume Reconstruction requires a whole lot of processing power, but that might explain the extra processor in the Natal hardware. The key to volume calculations is not so much image quality as having objects that are physically separated. When they are separated, they can be calculated and processed fairly easily. This might be a limitation for game input, since people tend to sit close together on a couch.

Another limitation for Natal with Horo’s setup would be the need for an array of cameras, a big no-no for the average Joe’s living room. My bet is that Microsoft has that covered by strongly refining the technology and using the RGB camera for just height (y) and width (x) perception. A depth sensor would cover the Z-dimension.
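To make the guess a bit more concrete, here is a minimal sketch of how x and y from a camera image plus z from a depth sensor could be turned into the little virtual cubes described above. It is purely illustrative and says nothing about how Natal or Horo’s system actually works: the pinhole camera model, the function names `pixel_to_point` and `voxelize`, and all calibration numbers are my own assumptions.

```python
import math
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def pixel_to_point(u: int, v: int, depth_m: float,
                   fx: float, fy: float, cx: float, cy: float) -> Point:
    """Back-project image pixel (u, v) with a depth reading into 3D space.

    Assumes a simple pinhole camera: (cx, cy) is the image centre and
    (fx, fy) the focal length in pixels. These are made-up values here.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def voxelize(points: Iterable[Point], cube_size_m: float) -> Set[Voxel]:
    """Snap 3D points onto a grid of cubes; each occupied cube is kept once."""
    return {
        (math.floor(x / cube_size_m),
         math.floor(y / cube_size_m),
         math.floor(z / cube_size_m))
        for x, y, z in points
    }


# Hypothetical calibration and three (u, v, depth in metres) samples.
FX = FY = 500.0
CX, CY = 320.0, 240.0
samples = [(320, 240, 2.0), (330, 240, 2.0), (420, 240, 2.0)]

points = [pixel_to_point(u, v, d, FX, FY, CX, CY) for u, v, d in samples]
cubes = voxelize(points, cube_size_m=0.25)
print(sorted(cubes))
```

Neighbouring pixels at the same depth fall into the same cube, which is why a coarse cube model can be enough to recognise gestures even when the image quality is modest.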
Splitting the work between an RGB camera and a depth sensor sounds logical, but the big problem with all depth sensors (infrared, ultrasound, etc.) is that they generally work only along a very fixed line of sight, which makes them pretty useless for Human Volume Reconstruction. So that kind of leaves this theory up in the air again. Or maybe, just maybe, they are making use of improved hardware like the kind developed by Elliptic Labs. This Norwegian company has been creating touchless 3D interfaces (based on depth) for a while now. The only problem with their depth perception is that the user can’t provide input if he is more than 1,5 away from the perceiver. So, however it works, it will be an interesting revelation.

The eye tracking

First thoughts: the red-eye effect. Using the same camera that measures X and Y, the users’ eyes could be tracked within the image, even in dark environments. This would be done by constantly sending out infrared light. The users’ pupils would reflect this light (the red-eye effect) and show up in Natal as bright dots.

![]()

There is a company that uses this concept to measure ad exposure based on how many times people look at an ad. Above you can see an image of the hardware they put around the ad, but this is more expensive than it looks.

The speech recognition

Last year I did a project for the university based on speech interfaces. We prototyped some interfaces for cars (controlling the radio, etc.) and for a pizza-ordering telephone hotline. Since our software was open source, I’m quite sure that it wasn’t state of the art, but still.

![]()

The setup worked with flowcharts, and actions were triggered by keywords. We user-tested it with people with different English accents, and that worked quite well for simple things with limited answers. So keyword-based speech recognition is something that would be possible. I do have serious doubts about natural language recognition, but I don’t think that is needed for interacting with games.
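A flowchart driven by keywords boils down to a small state machine. The sketch below is not the software from that university project. It is a hypothetical pizza hotline flow I made up for illustration, and it assumes some speech recognizer has already turned the audio into text.

```python
from typing import Dict, Iterable, Optional

# Hypothetical flowchart: state -> {keyword: next state}.
FLOW: Dict[str, Dict[str, str]] = {
    "start": {"order": "choose_size", "cancel": "goodbye"},
    "choose_size": {"small": "confirm", "large": "confirm", "cancel": "goodbye"},
    "confirm": {"yes": "goodbye", "no": "choose_size"},
}


def next_state(state: str, transcript: str) -> Optional[str]:
    """Return the state triggered by the first known keyword, or None."""
    keywords = FLOW.get(state, {})
    for word in transcript.lower().split():
        word = word.strip(".,!?")  # drop punctuation glued to the word
        if word in keywords:
            return keywords[word]
    return None  # nothing recognised: the caller should ask again


def run_dialogue(utterances: Iterable[str]) -> str:
    """Walk the flowchart, staying in place when no keyword is heard."""
    state = "start"
    for text in utterances:
        state = next_state(state, text) or state
    return state


final = run_dialogue(["I want to order", "uh, what?", "large please", "Yes."])
print(final)
```

Only the words listed for the current state matter, so everything else in the sentence is ignored. That keeps recognition simple, but it also means the system only reacts to the vocabulary it was given.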
The user will probably have to be trained a bit in the use of keywords.

Conclusion

The most exciting part will be seeing how the movement perception works, but it is also the part whose accuracy I am most optimistic about, given recent research. If I have any concerns, they are about the speech recognition and the overall cost of a system that combines all this technology. I can already tell: if it uses the above technology, it won’t come cheap. It should not come as a surprise if the Natal input device eventually costs more than the Xbox itself.
02-06-2009 12:49 Great post! I'm curious to see whether real-life usage will be as smooth an experience as the demo video. 03-06-2009 23:06