Detect if images are different in real-time - C++

I am working on a microscope that streams live images via a built-in video camera to a PC, where further image processing can be performed on the streamed image. Any processing done on the streamed image must be done in "real-time" (minimal frames dropped).
We take the average of a series of static images to counter random noise from the camera to improve the output of some of our image processing routines.
My question is: how do I know that the image is no longer static - either the sample under inspection has moved or rotated, or the camera has zoomed in or out - so I can reset the image series used for averaging?
I looked through some of the existing threads, and here are some ideas that seemed interesting:
Note: using Windows, C++ and Intel IPP. With IPP the image is a byte array (Ipp8u).
1. Hash the images, and compare the hashes (normal hash or perceptual hash?)
2. Use normalized cross correlation (IPP has many variations - which to use?)
Which do you guys think is suitable for my situation (speed)?

If your camera doesn't shake, you can, as inVader said, subtract images. Then the sum of the absolute values of all pixels in the difference image is sometimes enough to tell whether the images are the same or different. However, if your noise, lighting level, etc. varies, this will not give you a good enough S/N ratio.
And in noisy conditions, normal hashes are even more useless.
The best would be to identify that some feature of your object has changed, like its boundary (if it's regular) or its mass center (if it's irregular). If you have a boundary position, you'll need to analyze just one line of pixels, perpendicular to that boundary, to tell that the boundary has moved.
Mass center position may be subject to frequent false negatives, but adding the total mass and/or moment of inertia may help.
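A minimal sketch of this mass-center + total-mass check on a raw 8-bit grayscale buffer (the same layout as the question's Ipp8u images); the helper names and thresholds are illustrative assumptions, not a recommended tuning:

    #include <cstdint>
    #include <cmath>

    struct FrameStats { double cx, cy, mass; };

    // Intensity-weighted centroid ("mass center") and total mass of one frame.
    FrameStats frameStats(const uint8_t* img, int width, int height, int step)
    {
        double sx = 0, sy = 0, mass = 0;
        for (int y = 0; y < height; ++y)
        {
            const uint8_t* row = img + y * step;
            for (int x = 0; x < width; ++x)
            {
                mass += row[x];
                sx   += double(row[x]) * x;
                sy   += double(row[x]) * y;
            }
        }
        if (mass == 0.0)
            return { 0.0, 0.0, 0.0 };              // completely black frame
        return { sx / mass, sy / mass, mass };     // centroid in pixels, total intensity
    }

    // Declare the scene "changed" if the centroid shifts by more than a pixel or
    // the total mass changes by more than ~1%; both thresholds are guesses to tune.
    bool sceneChanged(const FrameStats& a, const FrameStats& b)
    {
        double shift = std::hypot(a.cx - b.cx, a.cy - b.cy);
        return shift > 1.0 || std::fabs(a.mass - b.mass) > 0.01 * a.mass;
    }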
If the camera shakes, you may have to align images before comparing (depending on comparison method and required accuracy, a single pixel misalignment might be huge), and that's where cross-correlation helps.
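One way to estimate that misalignment, sketched here with OpenCV's phase correlation rather than the IPP cross-correlation primitives mentioned in the question (purely illustrative):

    #include <opencv2/opencv.hpp>

    // Estimate the (dx, dy) shift between two consecutive 8-bit grayscale frames.
    // phaseCorrelate needs floating-point input, hence the conversion.
    cv::Point2d estimateShift(const cv::Mat& prev8u, const cv::Mat& curr8u)
    {
        cv::Mat a, b;
        prev8u.convertTo(a, CV_32F);
        curr8u.convertTo(b, CV_32F);
        return cv::phaseCorrelate(a, b);   // sub-pixel translation estimate
    }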
Furthermore, you don't have to analyze every image. You can skip one, and if the next differs, discard both of them. That gives you twice as much time to analyze each image.
And if you are averaging images, you might just define an optimal number of images you need and compare only the first and the last image in the sequence.

So, the simplest thing to try would be to take subsequent images, subtract them from each other and have a look at the difference. Then define some rules, including local and global thresholds for the difference, under which two images are considered equal. Simple subtraction of bitmap/array data, looking for maxima and calculating the average difference across the whole thing, should be no problem to do in real time.
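A minimal sketch of that rule on raw 8-bit grayscale buffers; both threshold values are placeholders you would tune for your camera's noise level:

    #include <cstdint>
    #include <cstdlib>
    #include <algorithm>

    // Flag a change if either the mean difference (global) or the largest
    // single-pixel difference (local) exceeds its threshold.
    bool framesDiffer(const uint8_t* a, const uint8_t* b, int width, int height,
                      int step, double meanThresh = 2.0, int maxThresh = 60)
    {
        long long sum = 0;
        int peak = 0;
        for (int y = 0; y < height; ++y)
        {
            const uint8_t* rowA = a + y * step;
            const uint8_t* rowB = b + y * step;
            for (int x = 0; x < width; ++x)
            {
                int d = std::abs(int(rowA[x]) - int(rowB[x]));
                sum  += d;
                peak  = std::max(peak, d);
            }
        }
        double mean = double(sum) / (double(width) * height);
        return mean > meanThresh || peak > maxThresh;
    }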

If there are varying light conditions or something moving in a predictable way (like a door opening and closing), then something more powerful, albeit slower, like Gaussian mixture models for background modeling, might be worth looking into. It is quite compute intensive, but can be parallelized pretty easily.
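A rough sketch using OpenCV's MOG2 background subtractor as the Gaussian mixture model (the parameters, the frame source, and the 2% trigger are guesses, not tuned values):

    #include <opencv2/opencv.hpp>

    int main()
    {
        cv::VideoCapture cap(0);                              // placeholder frame source
        auto mog2 = cv::createBackgroundSubtractorMOG2(500, 16.0, false);

        cv::Mat frame, foregroundMask;
        while (cap.read(frame))
        {
            mog2->apply(frame, foregroundMask);               // per-pixel Gaussian mixture update
            double changed = cv::countNonZero(foregroundMask) /
                             double(foregroundMask.total());
            if (changed > 0.02)                               // >2% of pixels marked foreground
            {
                // scene is no longer static: reset the averaging buffer here
            }
        }
    }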

Motion detection algorithms are what is used for this kind of problem.
http://www.codeproject.com/Articles/10248/Motion-Detection-Algorithms
http://www.codeproject.com/Articles/22243/Real-Time-Object-Tracker-in-C

First of all, I would take a series of images at a slow fps rate and downsample them to make them smaller, not too much but enough to speed up the process.
Now you have several options:
You could compute the sum of absolute differences (SAD) of the two images by subtracting them, and use a threshold to decide whether the image has changed.
If you want to speed it up even further, I would suggest doing a progressive SAD using a small kernel and moving from the top of the image to the bottom. You can track the cumulative amount of difference as you go and stop early once you are satisfied.
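A sketch of such a progressive SAD with early exit, assuming row-major 8-bit grayscale buffers like the question's Ipp8u images; the strip height and threshold are placeholders:

    #include <cstdint>
    #include <cstdlib>

    // Accumulate the absolute difference strip by strip, bail out as soon as
    // the running total says "changed".
    bool imageChanged(const uint8_t* a, const uint8_t* b, int width, int height,
                      int step, long long threshold, int stripRows = 16)
    {
        long long sad = 0;
        for (int y0 = 0; y0 < height; y0 += stripRows)
        {
            int y1 = (y0 + stripRows < height) ? y0 + stripRows : height;
            for (int y = y0; y < y1; ++y)
            {
                const uint8_t* rowA = a + y * step;
                const uint8_t* rowB = b + y * step;
                for (int x = 0; x < width; ++x)
                    sad += std::abs(int(rowA[x]) - int(rowB[x]));
            }
            if (sad > threshold)          // early exit: enough evidence of change
                return true;
        }
        return false;
    }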

Related

Why, in CNNs for image recognition tasks, are the filters always chosen to be extremely localized?

In CNNs, the filters are usually 3x3 or 5x5 spatially. Can the sizes be comparable to the image size? One reason for small filters is to reduce the number of parameters to be learnt. Apart from this, are there any other key reasons? For example, do people want to detect edges first?
You answer part of the question yourself. Another reason is that most of these useful features may be found in more than one place in an image, so it makes sense to slide a single kernel all over the image in the hope of extracting that feature in different parts of the image with the same kernel. If you use a big kernel, the features could be interleaved and not concretely detected.
In addition to your own answer, the reduction in computational cost is a key point. Since we use the same kernel for different sets of pixels in an image, the same weights are shared across these pixel sets as we convolve over them. And as the number of weights is smaller than in a fully connected layer, we have fewer weights to back-propagate over.
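To put a number on the parameter savings (the 224x224 size is just an example, and this counts a single input/output channel pair):

    #include <cstdio>

    int main()
    {
        const long long H = 224, W = 224;                 // input feature map size (example)
        const long long convWeights  = 3 * 3;             // one 3x3 kernel, shared over all positions
        const long long denseWeights = (H * W) * (H * W); // fully connected map of the same size

        std::printf("3x3 conv: %lld weights, fully connected: %lld weights\n",
                    convWeights, denseWeights);           // 9 vs ~2.5 billion
    }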

Which deconvolution algorithm is best suited for removing motion blur from text?

I'm using OpenCV to process pictures taken with a mobile phone. The pictures contain text, and they have small amounts of motion blur, which I need to remove.
What would be the most viable algorithm to use? So far I have tested Lucy-Richardson and Wiener deconvolution, but they did not yield satisfactory results.
I agree with #TheJuice: your problem lies in the PSF estimation. Usually, to be able to do this from a single frame, several assumptions need to be made about the factors leading to the blur (motion of the object, type of motion of the sensor, etc.).
You can find some pointers, especially for the one-dimensional case, in Yitzhaky and Kopeika's work. They use a filtering method that leaves mostly the correlation introduced by the blur, discarding the spatial correlation of the original image, and use this to deduce the motion direction and thence the PSF. For small blurs you might be able to treat the motion as constant; otherwise you will have to use a more complex accelerated-motion model.
Unfortunately, mobile phone blur is often a compound of CCD integration and non-linear motion (translation perpendicular to the line of sight, yaw from wrist motion, and rotation around the wrist), so Yitzhaky and Kopeika's method will probably only yield acceptable results in a minority of cases. I know there are methods to deal with that ("depth awareness" and others) but I have never had occasion to deal with them.
You can preview the results using photo recovery software such as Focus Magic; while it does not employ the YK estimator (motion description is left to you), the remaining workflow is necessarily very similar. If your pictures are amenable to Focus Magic recovery, then the YK method will probably work. If they are not (or not enough, or not enough of them to be worthwhile), then there's no point even trying to implement it.
Motion blur is a difficult problem to overcome. The best results are gained when:
1. The speed of the camera relative to the scene is known.
2. You have many pictures of the blurred object which you can correlate.
You do have one major advantage in that you are looking at text (which normally constitutes high-contrast features). If you only apply deconvolution to high-contrast areas of your image (I know that the theory often says to exclude high contrast), you should get results which may enable you to better recognise characters. Also, a combination of sharpening/blurring filters as pre-/post-processing may help.
I remember being impressed by this paper previously. Perhaps an adaptation of their implementation would be worth a go.
I think the estimation of your point-spread function is likely to be more important than the algorithm used. It depends on the kind of motion blur you're trying to remove; linear motion is likely to be the easiest, but it is unlikely to be the kind you're trying to remove: I imagine it's non-linear blur caused by hand movement during the exposure.
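Since the question already uses OpenCV, here is a rough sketch of frequency-domain Wiener deconvolution with an assumed linear-motion PSF. The PSF length/angle and the noise-to-signal ratio nsr are things you would have to estimate (e.g. with the Yitzhaky-Kopeika approach above); the helper names are mine, and this is a sketch rather than a drop-in solution:

    #include <opencv2/opencv.hpp>
    #include <cmath>

    // Build a normalized linear-motion PSF of a given length (pixels) and angle
    // (degrees), drawn as a short line in an image-sized kernel.
    cv::Mat motionPSF(cv::Size size, int length, double angleDeg)
    {
        cv::Mat psf = cv::Mat::zeros(size, CV_32F);
        cv::Point center(size.width / 2, size.height / 2);
        double rad = angleDeg * CV_PI / 180.0;
        for (int i = 0; i < length; ++i)
        {
            int x = center.x + cvRound((i - length / 2) * std::cos(rad));
            int y = center.y + cvRound((i - length / 2) * std::sin(rad));
            if (x >= 0 && x < size.width && y >= 0 && y < size.height)
                psf.at<float>(y, x) = 1.0f;
        }
        return psf / cv::sum(psf)[0];                      // unit energy
    }

    // Frequency-domain Wiener filter: F_hat = Y * conj(H) / (|H|^2 + nsr).
    cv::Mat wienerDeconvolve(const cv::Mat& blurred8u, const cv::Mat& psf, double nsr)
    {
        cv::Mat img;
        blurred8u.convertTo(img, CV_32F, 1.0 / 255.0);     // assumes 8-bit grayscale input

        // Circularly shift the PSF so its center sits at (0,0); otherwise the
        // restored image comes out translated.
        cv::Mat tiled;
        cv::repeat(psf, 2, 2, tiled);
        cv::Mat psfShifted = tiled(cv::Rect(psf.cols / 2, psf.rows / 2,
                                            psf.cols, psf.rows)).clone();

        cv::Mat Y, H;
        cv::dft(img, Y, cv::DFT_COMPLEX_OUTPUT);
        cv::dft(psfShifted, H, cv::DFT_COMPLEX_OUTPUT);

        cv::Mat Hp[2];
        cv::split(H, Hp);
        cv::Mat denom = Hp[0].mul(Hp[0]) + Hp[1].mul(Hp[1]) + nsr;   // |H|^2 + nsr

        cv::Mat num;                                        // Y .* conj(H)
        cv::mulSpectrums(Y, H, num, 0, true);

        cv::Mat numP[2];
        cv::split(num, numP);
        cv::divide(numP[0], denom, numP[0]);
        cv::divide(numP[1], denom, numP[1]);
        cv::merge(numP, 2, num);

        cv::Mat restored;
        cv::idft(num, restored, cv::DFT_SCALE | cv::DFT_REAL_OUTPUT);
        restored.convertTo(restored, CV_8U, 255.0);         // clamp back to 8 bits
        return restored;
    }

Usage would be something like wienerDeconvolve(gray, motionPSF(gray.size(), 15, 0.0), 0.01), sweeping the length, angle and nsr until the text becomes legible.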
You cannot eliminate motion blur. The information is lost forever. What you are dealing with is a CCD that is recording multiple real objects to a single pixel, smearing them together. In other words if the pixel reads 56, you cannot magically determine that the actual reading should have been 37 at time 1, and 62 at time 2, and 43 at time 3.
Another way to look at this: imagine you have 5 pictures. You then use photoshop to blend the pictures together, averaging the value of each pixel. Can you now somehow from the blended picture tell what the original 5 pictures were? No, you cannot, because you do not have the information to do that.

How to combine two images with different gains in Matlab/ C++

I have two images, both taken at the same time from the same detector.
Both images have 11-bit resolution (yes, it's odd, but that is the case here). The difference between the two images is that one image has been amplified by a factor of 1 and the other has been amplified by a factor of 10.
How can I take these two 11 bit images, and combine their pixel values to get a single 16 bit image? Basically, this increases the dynamic range of the final image.
I am fairly new to image processing. I know there is a solution for this, since other systems do this on the fly pixel-by-pixel in an FPGA. I was just hoping to be able to do this in Matlab post processing instead of live. I know doing bitwise operations in Matlab can be kinda difficult, but we do have an educational license with every toolbox available.
As mentioned below, this looks an awful lot like HDR processing. The goal isn't artistic, but rather data preservation. This is eventually going to be put in C++ and flown on an autonomous flight computer, and running standard bloated HDR software on the fly would kill our timing requirements.
Thanks for the help!
As a side note, I'd like to be able to do this for any combination of gains, e.g. 2x and 30x, 4x and 8x, etc. In my gut I feel like this is a deceptively simple algorithm or interpolation, but I just don't know where to start.
Gains
Since there is some confusion on what the gains mean, I'll try to explain. The image sensor (CMOS) being used on our custom camera has the capability to simultaneously output two separate images, both taken from the same exposure. It can do this because the sensor has 2 different electrical amplifiers along its data path.
In photography terms, it would be like your DSLR being able to take a picture using 2 different ISO values at the same time.
Sorry for the confusion
The problem you pose is known as "High Dynamic Range Imaging" and "Tone Mapping". I suggest you start with those Wikipedia articles, then drill down to the bibliography cited therein.
You don't provide enough details about your imagery to give a more specific answer. What is the "gain" you mention? Did you crank up the sensor's gain (to what ISO-equivalent number?), or did you use a longer exposure time? Are the 11-bit pixel values linear or already gamma-compressed?
To upscale an 11-bit range to a 16-bit range, multiply by (2^16-1)/(2^11-1). (Assuming you want linear scaling, which is reasonable when scaling up.)
If the gain was discrete (applied in the 11-bit range), then you have two 11-bit images which may have some saturated values.
If the gain was applied in a continuous (analog) or floating-point range, then your values can go beyond the original 11 bits. Also, in that case the values were probably scaled to another range first, e.g. [0,1] (by dividing by (2^11-1)).
If the values were scaled to another range, you will have to divide by the maximum of the new range instead of by (2^11-1).
Either way (whether the gain was in the 11-bit range or not), due to the gain and due to the addition, the resulting values may be larger than the original range. In this case, you need to decide how you want to scale them:
Do you want to scale the original 11-bit range to 16 bits (possibly causing saturation)?
If so, multiply by (2^16-1)/(2^11-1).
Do you want to scale the maximum possible value to 2^16-1?
If so, multiply by (2^16-1)/( (2^11-1) * (G1+G2) ).
Do you want to scale the actual maximum value to 2^16-1?
If so, multiply by (2^16-1)/max(I1+I2).
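To make the options above concrete, here is a small C++ sketch (the question says the final target is C++). This variant prefers the high-gain sample and falls back to the rescaled low-gain one where the high-gain pixel is saturated; the function name, the saturation handling, and the assumption that highGain really carries the larger gain are mine, not part of the answer above:

    #include <cstdint>
    #include <vector>
    #include <algorithm>

    std::vector<uint16_t> combineGains(const std::vector<uint16_t>& lowGain,   // gain G1 (e.g. 1x)
                                       const std::vector<uint16_t>& highGain,  // gain G2 (e.g. 10x)
                                       double G1, double G2)
    {
        const double maxIn  = 2047.0;   // 2^11 - 1
        const double maxOut = 65535.0;  // 2^16 - 1
        std::vector<uint16_t> out(lowGain.size());

        for (size_t i = 0; i < lowGain.size(); ++i)
        {
            // If the high-gain pixel is saturated, fall back to the low-gain sample
            // rescaled into high-gain units; otherwise trust the less noisy sample.
            double hdr = (highGain[i] >= maxIn)
                       ? lowGain[i] * (G2 / G1)
                       : static_cast<double>(highGain[i]);

            // Scale the maximum possible (high-gain-equivalent) value to 16 bits.
            double scaled = hdr * maxOut / (maxIn * (G2 / G1));
            out[i] = static_cast<uint16_t>(std::min(scaled, maxOut));
        }
        return out;
    }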
Edit:
Since you do not want to add the images, but rather use the different details in them, perhaps this article will help you:
Digital Photography with Flash and No-Flash Image Pairs

Why does JPEG compression process the image in 8x8 blocks?

Why does JPEG compression process the image in 8x8 blocks instead of applying the Discrete Cosine Transform to the whole image?
8x8 was chosen after numerous experiments with other sizes.
The conclusions of the experiments are:
1. Matrices larger than 8x8 are harder to perform mathematical operations on (transforms, etc.), are not well supported by hardware, or take longer to process.
2. Matrices smaller than 8x8 don't carry enough information to continue along the pipeline, which results in poor quality of the compressed image.
Because that would take "forever" to decode. I don't remember fully now, but I think you need at least as many coefficients as there are pixels in the block. If you code the whole image as a single block, I think you need to iterate through all the DCT coefficients for every pixel.
I'm not very good at big O calculations but I guess the complexity would be O("forever"). ;-)
For modern video codecs I think they've started using 16x16 blocks instead.
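To put rough numbers on that "forever", assuming a naive DCT where each coefficient sums over every pixel of its block (real encoders use fast factorizations, but the scaling is the point; the 1080p size is just an example):

    #include <cstdio>

    int main()
    {
        const double block8   = 8.0 * 8.0 * 8.0 * 8.0;      // ~4.1e3 ops per 8x8 block
        const double fullHD   = 1920.0 * 1080.0;            // pixels in one 1080p frame
        const double perFrame = (fullHD / 64.0) * block8;   // all 8x8 blocks: ~1.3e8 ops
        const double oneBlock = fullHD * fullHD;            // whole image as one block: ~4.3e12 ops
        std::printf("8x8 tiling: %.2e ops, single block: %.2e ops\n", perFrame, oneBlock);
    }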
One good reason is that images (or at least the kind of images humans like to look at) have a high degree of information correlation locally, but not globally.
Every relatively smooth patch of skin, or piece of sky or grass or wall eventually ends in a sharp edge and is replaced by something entirely different. This means you still need a high frequency cutoff in order to represent the image adequately rather than just blur it out.
Now, because Fourier-like transforms such as the DCT "jumble" all the spatial information, you wouldn't be able to throw away any intermediate coefficients either, nor the high-frequency components "you don't like".
There are of course other ways to try to discard visual noise and reconstruct edges at the same time, by preserving high-frequency components only when needed, or by doing some iterative reconstruction of the image at finer levels of detail. You might want to look into scale-space representations and wavelet transforms.

Tips for background subtraction in the face of noise

Background subtraction is an important primitive in computer vision. I'm looking at different methods that have been developed, and I've begun thinking about how to perform background subtraction in the face of random, salt and pepper noise.
In a system such as the Microsoft Kinect, the infrared camera will give off random noise pretty consistently. If you are trying to background subtract from the depth view, how can you avoid an issue with this random noise while reliably subtracting the background?
As you already said, noise and other unsteady parts of your background (lighting changes or other moving things in the background) can cause problems in segmentation.
But if you're working on an indoor project this shouldn't be too big of an issue, except of course for the noise.
Besides subtracting the background from an image to segment the objects in it, you could also try to subtract two (or in some methods even three) consecutive frames from each other. If the camera is steady, this should leave the parts that have changed, basically the objects that have moved. So this is an easy method for detecting moving objects.
But with most operations you might use, you will probably still have the noise you described. The easiest way to get rid of it is by using a median filter or morphological operators (opening) on the segmented binary image. This should effectively remove small parts and leave the nice big blobs of the objects.
Hope that helps...
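A minimal OpenCV sketch of this frame-differencing + clean-up pipeline, assuming a steady camera and 8-bit grayscale frames; the kernel sizes and the threshold are illustrative guesses:

    #include <opencv2/opencv.hpp>

    // Difference two consecutive frames, clean up salt-and-pepper speckle with a
    // median filter, threshold, then apply a morphological opening to keep only
    // the big blobs of moving objects.
    cv::Mat movingObjectMask(const cv::Mat& prevGray, const cv::Mat& currGray)
    {
        cv::Mat diff, denoised, mask;
        cv::absdiff(prevGray, currGray, diff);           // |frame_t - frame_{t-1}|
        cv::medianBlur(diff, denoised, 5);               // kill isolated noisy pixels first
        cv::threshold(denoised, mask, 25, 255, cv::THRESH_BINARY);

        // Opening (erode then dilate) removes remaining small specks.
        cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
        cv::morphologyEx(mask, mask, cv::MORPH_OPEN, kernel);
        return mask;
    }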
Typically you do connected components (CC) in disparity space and then kill any CC that has a small size. The thresholds for size and for connectedness (e.g. what disparity difference between two adjacent pixels still counts as connected) are your two parameters to play with (ivlad#lab126.com).
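A sketch of the "kill small components" step using OpenCV's connectedComponentsWithStats on a binary mask (a simplification of the disparity-space version described above; the area threshold is a placeholder):

    #include <opencv2/opencv.hpp>

    cv::Mat dropSmallComponents(const cv::Mat& binaryMask, int minArea = 200)
    {
        cv::Mat labels, stats, centroids;
        int n = cv::connectedComponentsWithStats(binaryMask, labels, stats,
                                                 centroids, 8, CV_32S);

        cv::Mat cleaned = cv::Mat::zeros(binaryMask.size(), CV_8U);
        for (int label = 1; label < n; ++label)           // label 0 is the background
            if (stats.at<int>(label, cv::CC_STAT_AREA) >= minArea)
                cleaned.setTo(255, labels == label);      // keep only large blobs
        return cleaned;
    }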
As #evident mentioned, the median filter is your ticket. That's the standard operator for getting rid of salt-and-pepper noise while preserving edges.
That said, I disagree with his suggestion that this occur on the segmented binary image. Median filtering is very low-level and should be applied on the raw data before any subsequent processing.