What makes object representation and recognition hard? - computer-vision

Intuitively, it would seem that given a dozen or so 2d images from different angles of almost any object, it should be easy to construct a 3d representation of that object. Subsequently a library of 3d representations attained in this way could be used to identify new 2d images.
What literature is there along these lines, and why has it not yet produced strong object recognition?

It is your word "intuitively" that is causing you trouble there. Your brain is not designed to be very good at certain tasks, like multiplying thousands of numbers in an instant. However, for raw computational power your brain makes the fastest computer look like mere tiddlywinks (neural response time is only about 10 milliseconds, but roughly 10^11 neurons with some 10^14 synapses, all working in parallel, totally beat any modern machine). It's just that your brain is designed to solve problems that are far more computationally complex, like recognizing objects in a picture, parsing sound data and picking out individual speakers amidst background noise, or learning to classify and deal with tens of thousands of types of objects.
The incredibly computationally intense things your brain is designed to do really well are the things that, to a person, seem "intuitive". The things it isn't designed to do really well seem "unintuitive" or difficult. But the raw computation needed for strong object recognition (because there are just so MANY kinds of objects, many of which really have subobjects, and multiple classifications, and non-rigid forms, e.g. "trousers", "water", "dog") is WAY more than what is needed to accomplish things one considers only possible for a computer. Things like using "common sense" to solve an everyday problem are similarly trivial for a person, but computationally incredibly complex.

What you want to do is indeed possible, but (there are quite a few buts)
for the 3D reconstruction:
For anything but the simplest shapes you need more than just a few dozen images.
The shape you are reconstructing needs to have a lot of recognizable features that look similar enough from different angles so that you can match them.
Lighting needs to be fairly constant over your entire set of images, otherwise shadows will throw you off (or you need even more images).
Even with very feature-rich objects (i.e. a lot of variation in colour and shape), 3D reconstruction accuracy from any matched pair of features is going to be terrible if you do not have full knowledge of the parameters (position, view direction and opening angle) of the camera used to take each picture.
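To illustrate why the camera parameters matter so much, here is a minimal pinhole projection sketch. The names (Vec3, Camera, project) are made up for this example, and the camera is assumed to look straight down its +Z axis with no rotation and no lens distortion; a real reconstruction pipeline would need both.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };
struct Pixel2 { double u, v; };

// Hypothetical simplified camera: located at `position`, looking along +Z,
// with a horizontal opening angle (field of view) given in radians.
struct Camera {
    Vec3 position;
    double fovRadians;
    int imageWidth;
    int imageHeight;
};

// Projects a world-space point into pixel coordinates (origin top-left).
// Precondition: the point lies in front of the camera.
Pixel2 project(const Camera& cam, const Vec3& p) {
    const double x = p.x - cam.position.x;
    const double y = p.y - cam.position.y;
    const double z = p.z - cam.position.z;
    assert(z > 0.0);
    // The focal length in pixels follows directly from the opening angle.
    const double f = (cam.imageWidth / 2.0) / std::tan(cam.fovRadians / 2.0);
    return { cam.imageWidth / 2.0 + f * x / z,
             cam.imageHeight / 2.0 - f * y / z };
}
```

Every one of position, direction and opening angle feeds into where a feature lands in the image, so a wrong guess for any of them shifts the rays you triangulate with.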
These are all problems that can be solved, so suppose you did, and now you have a new picture of the object that you want to match to your 3D shape.
You could of course try to find a 2D projection of your shape that fits the new picture, but the search space there is enormous. It would probably be a lot easier and faster to use the feature finding and matching system you built for the initial 3D reconstruction to directly match the new picture to the existing set, and find where it fits on the object that way.
So once you've solved the problem of creating the initial 3D reconstruction your second step is basically done as well.
Photosynth is a brilliant example of these two steps. Browse the site, try to find some of the references they have there.
As for your final step, strong object recognition, just imagine the search space! What you need for strong object recognition, apart from a good representation of the objects you want to recognize, is a good way to search the space of objects you know, and a good way to represent your new object (the image of an object in this case) in that space. This is something I know nearly nothing about.
For just matching the same object in different 2D images there are SIFT features. But I don't think this translates well to 3D.

Note that what you're describing is instance recognition. Computers can indeed do a good job of instance recognition these days. For example, Google Goggles is very good at recognizing landmarks like the Golden Gate Bridge and Eiffel Tower.
However, computers are less good at doing category recognition and classification. Creating dozens of 2D snapshots for all possible objects under all types of lighting conditions etc. becomes intractable very quickly. The fact that certain objects such as a dog can move around makes the space of possibilities even bigger. Computers become much worse at this.
Also, from the biological standpoint, our visual field is around 100 million pixels. Graphics cards have only now started to become capable of rendering that much data in real-time. Making sense of that much data is even more computationally intensive.
One often talks about having a machine reach a 5 year old's ability to process information. But let's think about how much data that is. 100 million pixels with 3 color channels and 1 byte per channel = 300 MB per frame. Now multiply that by 30 frames per second, 31,556,926 seconds per year, and 5 years, and you end up with roughly 1.4 exabytes (1.4x10^18 bytes).
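The arithmetic above can be checked with a few lines of C++. The function names are invented for this sketch, and the inputs are just the figures assumed in the estimate (100 million pixels, 3 channels, 1 byte per channel, 30 frames per second).

```cpp
#include <cassert>
#include <cstdint>

// Bytes in a single frame: pixels * channels * bytes per channel.
inline std::uint64_t bytesPerFrame(std::uint64_t pixels,
                                   std::uint64_t channels,
                                   std::uint64_t bytesPerChannel) {
    return pixels * channels * bytesPerChannel;
}

// Total bytes seen over a number of years at a fixed frame rate.
// A double is used because the result is an order-of-magnitude estimate.
inline double totalBytes(std::uint64_t frameBytes, double framesPerSecond,
                         double secondsPerYear, double years) {
    assert(framesPerSecond >= 0.0 && secondsPerYear >= 0.0 && years >= 0.0);
    return static_cast<double>(frameBytes) * framesPerSecond *
           secondsPerYear * years;
}
```

Calling bytesPerFrame(100000000, 3, 1) gives 300,000,000 bytes per frame, and totalBytes(300000000, 30.0, 31556926.0, 5.0) comes out at about 1.42x10^18 bytes.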

Related

C++ - fastest sorting algorithm for objects based on distance

I'm trying to make a game or 3D application using OpenGL. The game/program will have many objects drawn to the screen (around 7000 of them). When I render them, I need to calculate the distance between the camera and each object and sort the objects by it in order to render the scene correctly. Knowing this, what is the best way to sort them? I really want the sorting to be fast, but I've heard there are trade-offs between algorithms, so which one should I use to get the best performance?
Any help would be greatly appreciated.
Edit: a lot of people are talking about the z-buffer/depth buffer. This doesn't work in some cases like a few people talked about. This is why I asked this question.
Sorting by distance doesn't solve the transparency problem perfectly. Consider the situation where two transparent surfaces intersect and each has a part which is closer to you. Perhaps rare in games, but still something to consider if you don't want an occasional glitched look to your renderer.
The better solution is order-independent transparency. With the latest graphics hardware supporting atomic operations, you can use an A-buffer to do this with little memory overhead and in a single pass so it is pretty efficient. See for example this article.
The issue of sorting your scene is still a valid one, though, even if it isn't for transparency: it is still useful to sort opaque objects front to back to allow depth testing to discard unseen fragments. For this, Vaughn provided the great solution of BSP trees, which have been used for this purpose for as long as 3D games have been around.
Use insertion sort (http://en.wikipedia.org/wiki/Insertion_sort), which has O(n) complexity for nearly sorted arrays.
In your case, by exploiting temporal coherence (the order barely changes from one frame to the next), insertion sort gives the fastest results.
It is also used for sweep and prune (http://en.wikipedia.org/wiki/Sweep_and_prune).
From link above:
In many applications, the configuration of physical bodies from one time step to the next changes very little. Many of the objects may not move at all. Algorithms have been designed so that the calculations done in a preceding time step can be reused in the current time step, resulting in faster completion of the calculation.
So in such cases insertion sort is best (or a similar sort with an O(n) best case).
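A minimal sketch of that idea, assuming the object list is kept between frames so that it is already nearly sorted when the next frame starts. Object, distSq and sortBackToFront are hypothetical names, and back-to-front order is chosen here because that is what blending transparent objects needs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical scene object; distSq caches the squared camera distance.
struct Object {
    int id;
    Vec3 position;
    float distSq;
};

// Squared distance avoids a sqrt and preserves the ordering.
inline float squaredDistance(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Sorts back to front (farthest first) with insertion sort and returns the
// number of element shifts performed, which stays small when the input is
// already nearly sorted from the previous frame.
std::size_t sortBackToFront(std::vector<Object>& objects, const Vec3& camera) {
    for (Object& o : objects) o.distSq = squaredDistance(o.position, camera);

    std::size_t shifts = 0;
    for (std::size_t i = 1; i < objects.size(); ++i) {
        const Object current = objects[i];
        std::size_t j = i;
        while (j > 0 && objects[j - 1].distSq < current.distSq) {
            objects[j] = objects[j - 1];  // shift the nearer object one slot back
            --j;
            ++shifts;
        }
        objects[j] = current;
    }
    assert(std::is_sorted(objects.begin(), objects.end(),
        [](const Object& a, const Object& b) { return a.distSq > b.distSq; }));
    return shifts;
}
```

Call it once per frame on the same vector. The first call on unsorted data can cost O(n^2) shifts, so it may be worth running std::sort once up front and using insertion sort only for the per-frame updates.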

Which library for voxel data structure?

I'm working in C++ with large voxel grids in a scientific context and I'm trying to decide which library to use. Only a fraction of the voxel grid holds values, but there might be several per voxel (e.g. a struct), which are determined by raytracing. I'm not trying to render anything, but I have to determine the potential number of rays passing through the entire target area, so an awful lot of ray-box computations will have to be calculated, and preferably very fast...
So far, I found
OpenVDB http://www.openvdb.org/
Field3d http://sites.google.com/site/field3d/
The latter appeals a bit more, because it seems simpler/easier to use.
My question is: Which of them would be more suited if put to use in tasks, which are not aimed at rendering/visualization? Which one is faster/better when computing a lot of ray-box-intersections (no viewpoint-dependent culling possible)? Suggestions, anyone?
In any case, I want to use an existing C++ library and not write a kd-tree/octree etc. myself. I don't have the time to reinvent the wheel.
I would advise
OpenSceneGraph
Ogre3D
VTK
I have personally used the first two. However, VTK is also a popular alternative. All three of them support voxel based rendering.
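Whichever library ends up holding the voxels, the ray-box computation the question asks about is usually done with the slab method. Below is a library-independent sketch, not the API of OpenVDB, Field3D or any of the libraries named here; Box, Ray, makeRay and intersects are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

struct Vec3 { double x, y, z; };
struct Box { Vec3 min, max; };             // axis-aligned box
struct Ray { Vec3 origin; Vec3 invDir; };  // reciprocal direction, precomputed

// Reciprocal of a direction component; a zero component maps to +infinity.
inline double safeInverse(double d) {
    return d != 0.0 ? 1.0 / d : std::numeric_limits<double>::infinity();
}

inline Ray makeRay(const Vec3& origin, const Vec3& dir) {
    return { origin, { safeInverse(dir.x), safeInverse(dir.y), safeInverse(dir.z) } };
}

// Slab test: intersect the ray with the three pairs of parallel planes and
// check that the resulting parameter intervals overlap for t >= 0.
// Not handled: a ray lying exactly in a box face (0 * infinity gives NaN).
bool intersects(const Ray& r, const Box& b) {
    assert(b.min.x <= b.max.x && b.min.y <= b.max.y && b.min.z <= b.max.z);
    const double o[3]   = { r.origin.x, r.origin.y, r.origin.z };
    const double inv[3] = { r.invDir.x, r.invDir.y, r.invDir.z };
    const double lo[3]  = { b.min.x, b.min.y, b.min.z };
    const double hi[3]  = { b.max.x, b.max.y, b.max.z };

    double tmin = 0.0;
    double tmax = std::numeric_limits<double>::infinity();
    for (int axis = 0; axis < 3; ++axis) {
        const double t1 = (lo[axis] - o[axis]) * inv[axis];
        const double t2 = (hi[axis] - o[axis]) * inv[axis];
        tmin = std::max(tmin, std::min(t1, t2));
        tmax = std::min(tmax, std::max(t1, t2));
    }
    return tmin <= tmax;
}
```

Precomputing the reciprocal direction once per ray keeps the inner loop free of divisions, which matters when the same ray is tested against many boxes.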

Render 1000+ shapes in opengl

How can I render a bunch of hand-drawn shapes in OpenGL 1.x? I know about instancing, but how is it possible in old OpenGL? Could I get examples of some sort? This is for a game; I'm expecting a thousand or so shapes, all of which will need to be updated every frame.
Assuming that (at least most of) the shapes remain unchanged from one frame to the next, so most of the update is just moving them around, you could at least consider building a display list for each shape, then rendering the display lists during an update.
The amount of good you'll get from this varies widely depending on the hardware (and possibly driver) in use though. Some hardware supports display lists directly, and gains a lot from it. With other hardware, you'll be hard put to find any difference at all.
The good points are that at worst this won't do any harm, and building/using display lists is pretty quick and easy. So, in the worst case you don't lose much, and in the best case you might gain quite a bit.

How should I go about implementing algorithms to be used with black/white and color images?

I thought about:
1) Implement everything for the b/w images, then make wrappers for the methods that check if it's a color image. If it is, split the channels, make the operations on each individually and then merge them.
2) Use functors to correctly update the values depending on what I'm dealing with. Problem is that the compiler errors would be really complicated and I'm not used to it, and I think I may end up needing quite a few of them. Not sure if this is a good idea tbh.
There might be a correct design pattern here I'm not seeing too. There could also be a way to do this that's channel/color agnostic in OpenCV though I haven't found it yet, and so far the book I'm reading (OpenCV 2 Computer Vision Application Programming Cookbook) hasn't shown me such a possibility yet.
If speed is important, don't.
It sounds like you're trying to encapsulate or abstract away the type of pixel using OO techniques or the like. This could add an extra level of indirection for every pixel access, killing your performance.
If you're calling straight to a function rather than through a pointer to one (e.g., a delegate, overridden method, or functor), it can still be faster for the CPU, but if you're making function calls per pixel at all, reconsider: they're still extra work. If you can nest everything in the outer FOR loop, it will look ugly and functional programming snobs will sneer at you, but remember, this isn't a big LOB app that will get hard to maintain. That's why engineers can still perfectly maintain 30-year-old QuickBASIC code: the problem space doesn't need anything smarter (however, their problems themselves usually need someone a lot smarter than I!)
It's best to implement simple things (e.g., a threshold op or resizing) optimized for each kind of image if you want speed. You can also research transformation matrices and see if you can accomplish your work like that. That way you only have to write 2 transformer algorithms (b&w) and, using a similar (or the same) matrix, do the same thing for both types of pictures.
That accomplishes a major goal of abstraction anyway: seamless reuse and separation of concerns. And speed to boot (but hopefully not reboot!). Good luck.
Splitting the channels could work well with algorithms that work with the channels independently; not all of them do, so this will be quite limiting. You'll also spend a bit of time and space making all those copies.
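For reference, the split/process/merge wrapper could look roughly like this in plain C++ (no OpenCV; Plane, thresholdPlane and thresholdInterleaved are hypothetical names). A threshold is used only because it is short; the per-plane copies are exactly the time and space cost mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Plane = std::vector<std::uint8_t>;

// Single-channel algorithm: binary threshold, in place.
void thresholdPlane(Plane& plane, std::uint8_t level) {
    for (std::uint8_t& v : plane) v = (v >= level) ? 255 : 0;
}

// Wrapper for interleaved pixel data with `channels` values per pixel.
// Greyscale images go straight to the single-channel routine; colour images
// are split into planes, processed one plane at a time, and merged back.
void thresholdInterleaved(std::vector<std::uint8_t>& pixels,
                          std::size_t channels, std::uint8_t level) {
    assert(channels > 0 && pixels.size() % channels == 0);
    if (channels == 1) {
        thresholdPlane(pixels, level);
        return;
    }
    const std::size_t count = pixels.size() / channels;
    for (std::size_t c = 0; c < channels; ++c) {
        Plane plane(count);
        for (std::size_t i = 0; i < count; ++i)   // split: copy one channel out
            plane[i] = pixels[i * channels + c];
        thresholdPlane(plane, level);
        for (std::size_t i = 0; i < count; ++i)   // merge: copy it back in
            pixels[i * channels + c] = plane[i];
    }
}
```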
By functors I presume you mean making templates out of your algorithm functions, with a pixel type as the template parameter. That could work also, but it means defining your basic pixel operations in a way that they could be implemented as functions or operators on a generic pixel type. This is harder than it looks and should be done after you've had some experience in implementing the algorithms.
A third option not mentioned is to promote the b/w images to full color, process them, and convert back to b/w. This optimizes the full color processing at the expense of the b/w.
For most algorithms it is not necessary to worry about monochrome vs. colour images. You either use the grey value of the monochrome image or you calculate the luminance/intensity/whatever of the colour and use that. You choose the measure (luminance etc.) by looking at which colour space will give you the result you want.
When you have calculated how you are going to modify your images, you use some pixel-aware processing. For example, blending two pixels might be pixel_a*0.5 + pixel_b*0.5; your pixel class then sorts out how to apply that to the different colour channels, i.e. Pixel::operator+(const Pixel &), Pixel::operator*(float) and so on.
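A small sketch of that pixel-aware approach, assuming float channels and two hypothetical pixel types, Gray and Rgb. The blend template is written once and works for either type because both provide operator+ and operator*(float).

```cpp
#include <cassert>

// Hypothetical pixel types with float channels.
struct Gray { float v; };
struct Rgb  { float r, g, b; };

inline Gray operator+(Gray a, Gray b) { return { a.v + b.v }; }
inline Gray operator*(Gray a, float s) { return { a.v * s }; }

inline Rgb operator+(Rgb a, Rgb b) { return { a.r + b.r, a.g + b.g, a.b + b.b }; }
inline Rgb operator*(Rgb a, float s) { return { a.r * s, a.g * s, a.b * s }; }

// Channel-agnostic blend: t = 0 gives a, t = 1 gives b.
template <typename Pixel>
Pixel blend(Pixel a, Pixel b, float t) {
    assert(t >= 0.0f && t <= 1.0f);
    return a * (1.0f - t) + b * t;
}
```

With t = 0.5f this is the pixel_a*0.5 + pixel_b*0.5 blend from the text. Because the operators are resolved at compile time, there is no per-pixel virtual call.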
There are algorithms that are applied individually to each colour channel but they are not as common and often there is some correlation between the spatiotemporal changes in the colours so you wouldn't do something as basic as process each channel totally independently of each other.
My own Image class uses a planar structure (that is, color channels are separate) instead of an interleaved structure. However this is VERY limiting when it comes to image quantization and other joint color processing tasks.
I am planning to rewrite it to use the other approach, to simply be a two-dimensional array of pixels. At the moment I am not sure how I will implement it exactly (template pixel class, Pixel base class or a simple three-dimensional array).
I also plan to write a planar wrapper for this interleaved image structure to ease any disadvantage I might encounter. One thing is sure: this wrapper will be much more efficient than a pixel wrapper would be for planar images.
Frankly I believe splitting planes is rather inefficient, since you calculate various overheads several times. For example, if you want to resize an image, calculation of the various filter coefficients is very expensive, and it would be MUCH better to just calculate them once and apply Pixel::operator * and + instead of the same with the underlying subpixel components.

Directx 9 Terrain collision

I searched and found some tutorials on how to do terrain collision, but they were using .raw files and I'm using .x. Still, I think I can do the same thing they did: they took the x, y, z values of an object and checked them against every single triangle in the terrain. It makes sense, but it looks like it will be slow, just as picking by checking against every single triangle is slow.
Is there a faster way that still works well?
UPDATE
My terrain is not flat; if it were, I would use bounding boxes.
Last time I did this, I used the Bullet library, and it worked great. It has various collision shapes to choose from, optimised for different scenarios, including general triangle meshes and heightfields. You can use the library's collision routines without the physics.
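To show why a heightfield is so much cheaper than testing every triangle, here is a hand-rolled sketch. It is not Bullet's API, and it assumes the terrain mesh can be resampled onto a regular grid; HeightField, heightAt and isBelowTerrain are hypothetical names. The lookup touches only the four samples around the query point, whatever the size of the terrain.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical regular-grid terrain: heights[z * width + x], spacing cellSize.
struct HeightField {
    std::size_t width;   // samples along x
    std::size_t depth;   // samples along z
    float cellSize;      // world units between neighbouring samples
    std::vector<float> heights;
};

// Terrain height at world position (x, z), bilinearly interpolated.
// Positions outside the grid are clamped to its border.
float heightAt(const HeightField& hf, float x, float z) {
    assert(hf.width >= 2 && hf.depth >= 2 && hf.cellSize > 0.0f);
    assert(hf.heights.size() == hf.width * hf.depth);

    const float gx = std::max(0.0f, std::min(x / hf.cellSize,
                                             static_cast<float>(hf.width - 1)));
    const float gz = std::max(0.0f, std::min(z / hf.cellSize,
                                             static_cast<float>(hf.depth - 1)));
    const std::size_t ix = std::min(static_cast<std::size_t>(gx), hf.width - 2);
    const std::size_t iz = std::min(static_cast<std::size_t>(gz), hf.depth - 2);
    const float fx = gx - static_cast<float>(ix);
    const float fz = gz - static_cast<float>(iz);

    const float h00 = hf.heights[iz * hf.width + ix];
    const float h10 = hf.heights[iz * hf.width + ix + 1];
    const float h01 = hf.heights[(iz + 1) * hf.width + ix];
    const float h11 = hf.heights[(iz + 1) * hf.width + ix + 1];

    const float nearRow = h00 + (h10 - h00) * fx;   // interpolate along x at iz
    const float farRow  = h01 + (h11 - h01) * fx;   // interpolate along x at iz + 1
    return nearRow + (farRow - nearRow) * fz;       // then along z
}

// A point collides with the terrain when it is below the surface.
bool isBelowTerrain(const HeightField& hf, float x, float y, float z) {
    return y < heightAt(hf, x, z);
}
```

If the terrain has overhangs or caves, a heightfield cannot represent it, and a triangle mesh with a spatial structure is the safer route.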
One common way to significantly reduce the time it takes to detect collisions is to organize the space into an octree, which will allow you to very quickly determine whether or not a collision could occur in a particular node. Generally speaking, it's easier to accomplish these sorts of tasks with a game engine.