Introduction: What I am working on.
Hello everyone! I am working on a demosaicing algorithm that I use to transform images with a Bayer pattern into images representing the red, green and blue channels. I would like the algorithm to have the following properties:
It preserves as much raw information as possible.
It does not obscure details in the image, even if that means absence of denoising.
It produces as few artifacts as possible.
If the size of the mosaic image is N x N, the three color images should each have size N/2 x N/2.
The algorithm should be fast. To put "fast" into context: I will settle for something that is at least twice as fast as OpenCV's algorithm that uses bilinear interpolation.
What I have achieved so far.
So far, I've come up with an algorithm that uses bilinear interpolation and produces three images, each half the size of the mosaic image. The algorithm is approximately 3-4 times faster than OpenCV's cvtColor performing the CV_BayerBG2BGR conversion (bilinear interpolation).
See the sketch of the Bayer pattern below to get an idea of how it works. I perform the interpolation at the points marked by circles. The numbers represent the coefficients by which I multiply the underlying pixels to get the interpolated value at the point marked by the black circle.
You can observe the results of my algorithm below. I've also added the results of both demosaicing algorithms available in OpenCV (bilinear interpolation and variable number of gradients). Please note that while the results of my algorithm look rather poor in comparison, OpenCV's bilinear interpolation results look almost exactly the same if I downsample them. This is of course expected, as the underlying algorithm is the same.
... so finally: the question.
My current solution gives acceptable results for my project, and it is also acceptably fast. However, I would be willing to use an algorithm up to twice as slow if that brought improvements to any of the five criteria listed above. The question then is: how do I improve my algorithm without significantly hurting performance?
I have enough programming experience for this task, so I am not specifically asking for code snippets - answers of any kind (code, links, suggestions - especially ones based on past experience) are welcome.
Some additional information:
I am working in C++.
The algorithm is highly optimized: it uses SSE instructions, and it is not parallelized.
I work with large images (a few MB in size); cache awareness and avoiding multiple passes through the image are very important.
I am not looking for general programming advice (optimization in general, etc.), but task-specific answers are more than welcome. Thank you in advance.
High quality results are obtained by filtering the samples to their Nyquist frequency. Usually that's half the sample rate, but in this case, since your red and blue samples come at only half the pixel rate, Nyquist will be 1/4 of the pixel rate. When you resize, you need to filter to the lower of the input and output Nyquist rates; since your output is half the size of your input, you again need to filter to 1/4.
The perfect filter is the Sinc function; it delivers 100% of the signal below the cutoff frequency and none above the cutoff. Unfortunately it's completely impractical, extending as it does to infinity. For practical applications a windowed Sinc is used instead, the most common of these being the Lanczos kernel. The window size is chosen on the basis of quality vs. speed, with higher orders being closer to the ideal Sinc function. In your case since speed is paramount I will suggest Lanczos2.
The cutoff frequency of the filter is inversely proportional to its width. Since in this case we want the cutoff to be half of the normal cutoff, the filter will be stretched to twice its normal width. Ordinarily a Lanczos2 filter will require inputs up to but not including +/-2 pixels from the center point; stretching it by 2 requires inputs up to +/-4 from the center.
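As a rough sketch (the helper name is mine, and only to show where the tap values in the diagram come from), the stretched kernel is the normalized sinc times its window, both widened by 2; sampling it at the pixel offsets and normalizing each color's taps to sum to 1.0 gives the coefficients used below.

#include <cmath>

// Lanczos2 kernel stretched by a factor of 2 (cutoff at 1/4 of the sample
// rate): sinc(x/2) * sinc(x/4) for |x| < 4, and 0 elsewhere.
double lanczos2_stretched(double x)
{
    if (x == 0.0) return 1.0;
    if (std::fabs(x) >= 4.0) return 0.0;
    const double pi = 3.14159265358979323846;
    const double a = pi * x / 2.0;   // sinc term, cutoff halved
    const double b = pi * x / 4.0;   // window term, also stretched by 2
    return (std::sin(a) / a) * (std::sin(b) / b);
}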
The choice of a center point is completely arbitrary once you have a good cutoff filter. In your case you chose a point that was midway between 4 sample points. If we choose instead a point that is exactly on one of our input samples, some interesting things happen. Many of the filter coefficients become zero, which means those pixels don't have to be included in the calculations. In the example below I've centered on the Red pixel, and we find that red pixels need no filtering at all! Following is a diagram with Lanczos2 filter values scaled so that they total to 1.0 for each color, followed by the formulas that result.
red = p[x][y]
green = (p[x][y-3] + p[x-3][y] + p[x+3][y] + p[x][y+3]) * -0.03125 +
(p[x][y-1] + p[x-1][y] + p[x+1][y] + p[x][y+1]) * 0.28125
blue = (p[x-3][y-3] + p[x+3][y-3] + p[x-3][y+3] + p[x+3][y+3]) * 0.00391 +
(p[x-1][y-3] + p[x+1][y-3] + p[x-3][y-1] + p[x+3][y-1] + p[x-3][y+1] + p[x+3][y+1] + p[x-1][y+3] + p[x+1][y+3]) * -0.03516 +
(p[x-1][y-1] + p[x+1][y-1] + p[x-1][y+1] + p[x+1][y+1]) * 0.31641
If you'd prefer to keep everything in the integer domain this works very well with fixed point numbers too.
red = p[x][y]
green = ((p[x][y-3] + p[x-3][y] + p[x+3][y] + p[x][y+3]) * -32 +
(p[x][y-1] + p[x-1][y] + p[x+1][y] + p[x][y+1]) * 288) >> 10
blue = ((p[x-3][y-3] + p[x+3][y-3] + p[x-3][y+3] + p[x+3][y+3]) * 4 +
(p[x-1][y-3] + p[x+1][y-3] + p[x-3][y-1] + p[x+3][y-1] + p[x-3][y+1] + p[x+3][y+1] + p[x-1][y+3] + p[x+1][y+3]) * -36 +
(p[x-1][y-1] + p[x+1][y-1] + p[x-1][y+1] + p[x+1][y+1]) * 324) >> 10
The green and blue pixel values may end up outside of the range 0 to max, so you'll need to clamp them when the calculation is complete.
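If it helps, here is a minimal C++ sketch of the fixed-point version for one output pixel, including the clamping step. It assumes p[x][y] indexing into the mosaic (as in the formulas), that (x, y) lies on a red sample, that the 3-pixel border is handled elsewhere, and that samples lie in [0, maxVal]; the function name is just for illustration.

#include <algorithm>
#include <cstdint>

// Fixed-point demosaic at a red site, following the formulas above.
// (>> of a negative intermediate is an arithmetic shift on mainstream compilers.)
inline void demosaic_at_red(const uint16_t* const* p, int x, int y, int maxVal,
                            uint16_t& red, uint16_t& green, uint16_t& blue)
{
    red = p[x][y];

    int g = ((p[x][y-3] + p[x-3][y] + p[x+3][y] + p[x][y+3]) * -32 +
             (p[x][y-1] + p[x-1][y] + p[x+1][y] + p[x][y+1]) * 288) >> 10;

    int b = ((p[x-3][y-3] + p[x+3][y-3] + p[x-3][y+3] + p[x+3][y+3]) * 4 +
             (p[x-1][y-3] + p[x+1][y-3] + p[x-3][y-1] + p[x+3][y-1] +
              p[x-3][y+1] + p[x+3][y+1] + p[x-1][y+3] + p[x+1][y+3]) * -36 +
             (p[x-1][y-1] + p[x+1][y-1] + p[x-1][y+1] + p[x+1][y+1]) * 324) >> 10;

    // Clamp: the weighted sums can fall slightly outside [0, maxVal].
    green = (uint16_t)std::min(std::max(g, 0), maxVal);
    blue  = (uint16_t)std::min(std::max(b, 0), maxVal);
}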
I'm a bit puzzled by your algorithm and won't comment on it... but to put some things into perspective...
OpenCV is a library containing a lot of generic functionality to get the job done, and it is sometimes deliberately not optimized to the hilt for performance; there is a cost/maintainability tradeoff, and "good enough is better than better".
There are plenty of vendors selling performance-optimized libraries that implement some of OpenCV's features, sometimes with the exact same API.
I have not used it, but OpenCV has cv::gpu::cvtColor(), which might achieve your goals out of the box, assuming it is implemented for demosaicing and that you have a suitable GPU.
Considering bilinear demosaicing, a less-maintainable but more heavily optimized CPU implementation can run much faster than the one in OpenCV; I'd estimate above 250 Mpx/s on one mainstream CPU core.
Now to elaborate on the optimization path...
First, because demosaicing is a local operation, cache awareness is really not a significant problem.
A performance-optimized implementation will have different code paths depending on the image dimensions, the Bayer pattern type, and the instruction sets supported by the CPU (and their speed/latency); for such a simple algorithm, that quickly becomes a lot of code.
There are SIMD instructions for shuffling, arithmetic (including averaging), and streaming memory writes, all of which you'd find useful. Intel's intrinsics reference is not too bad to navigate, and Agner Fog's site is also valuable for any kind of low-level optimization. AVX and AVX2 provide several interesting instructions for pixel processing.
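As a tiny illustration (not tied to any particular demosaic layout), the averaging instruction mentioned above (PAVGB) processes 16 8-bit pixels per iteration; the function name and the multiple-of-16 width are assumptions for the sketch.

#include <emmintrin.h>   // SSE2
#include <cstdint>

// Average two rows of 8-bit pixels with rounding, 16 pixels at a time.
void average_rows(const uint8_t* row0, const uint8_t* row1, uint8_t* dst, int width)
{
    for (int x = 0; x < width; x += 16) {
        __m128i a = _mm_loadu_si128((const __m128i*)(row0 + x));
        __m128i b = _mm_loadu_si128((const __m128i*)(row1 + x));
        _mm_storeu_si128((__m128i*)(dst + x), _mm_avg_epu8(a, b));   // PAVGB
    }
}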
If you are more the 80/20 kind of person (good for you!), you'll appreciate working with a tool like Halide, which can generate optimized stencil code with ease (modulo the learning curve, which will set you back a few days compared to a one-hour naive implementation or the ten minutes it takes with OpenCV) and which, in particular, handles the boundary conditions (image borders) for you.
You can get a little further (or take an alternative road) by using compiler intrinsics to access specific CPU instructions; at this point your code is roughly 4x costlier in development terms, and will probably get you 99% as far as hand-crafted assembly (which costs another 4x).
If you want to squeeze out the last drop (not generally advisable), you will definitely have to spend days on implementation benchmarks to see which sequence of instructions gives you the best performance.
And then there are GPUs... you can use your integrated GPU to perform demosaicing; it may be a little faster than the CPU, and it has access to main memory, though of course you'd have to take care of pre-allocating shared buffers. A discrete GPU would have a more significant transfer overhead at these fill rates.
I know I'm late to the discussion, but I want to contribute my ideas, as this is a problem I have been thinking about myself. I see how you determined your coefficients from a linear interpolation of the 4 closest red or blue pixels. This would give you the best possible result if the intensity per color channel varied linearly.
De-bayering artifacts, however, are most significant at color edges. In your case you would interpolate across a color edge, which gives worse results than simply picking the closest red or blue pixel.
This is why, for my combined de-bayering and downsampling, I average the green pixels and take the closest red and blue pixel (a sketch is shown below). I believe this should work better at color edges, but less well in image areas with gradually varying color.
I haven't had an idea yet of what the optimal way to do this would be.
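For what it's worth, a minimal sketch of the scheme described above, assuming an RGGB layout, 8-bit samples, and output planes of size (w/2) x (h/2); the function is illustrative, not production code.

#include <cstdint>

// For each 2x2 Bayer cell: take the red and blue samples directly and
// average the two greens.  `stride` is in pixels.
void debayer_half(const uint8_t* bayer, int w, int h, int stride,
                  uint8_t* R, uint8_t* G, uint8_t* B)
{
    for (int y = 0; y < h; y += 2) {
        const uint8_t* top = bayer + y * stride;
        const uint8_t* bot = top + stride;
        for (int x = 0; x < w; x += 2) {
            int i = (y / 2) * (w / 2) + (x / 2);
            R[i] = top[x];
            G[i] = (uint8_t)((top[x + 1] + bot[x] + 1) / 2);   // rounded average
            B[i] = bot[x + 1];
        }
    }
}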
I actually implemented exactly what you're talking about here, in Halide.
You should read the paper by Morgan McGuire that I used as a reference... it's less about how much the neighbors factor into the output pixel and more about which pixels you look at to take a straight average.
How can I calculate the blurriness and sharpness of a given image using OpenCV? Are there any functions in OpenCV to do it? If not, how can I implement it? Any ideas would be great.
The input will be an image, and the output should be the blurriness and sharpness of the image.
I recommend performing a frequency analysis of the image. Energy in the high band tells you that the image is quite sharp, while energy concentrated in the low band usually means the image is blurry. For computing the spectrum, you can use the FFTW library.
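As a rough, uncalibrated sketch of that idea (assuming a row-major grayscale image of doubles and FFTW's real-to-complex transform; the 0.25 cutoff is an arbitrary choice):

#include <fftw3.h>
#include <cmath>

// Fraction of spectral energy above a normalized frequency radius
// (0.5 = Nyquist).  Low values suggest a blurry image.  Link with -lfftw3.
double high_freq_energy_ratio(double* gray, int w, int h, double cutoff = 0.25)
{
    const int nc = w / 2 + 1;                                 // r2c output width
    fftw_complex* spec =
        (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * h * nc);
    fftw_plan plan = fftw_plan_dft_r2c_2d(h, w, gray, spec, FFTW_ESTIMATE);
    fftw_execute(plan);

    double total = 0.0, high = 0.0;
    for (int v = 0; v < h; ++v) {
        double fy = ((v <= h / 2) ? v : v - h) / (double)h;   // [-0.5, 0.5)
        for (int u = 0; u < nc; ++u) {
            double fx = u / (double)w;                        // [0, 0.5]
            double e = spec[v * nc + u][0] * spec[v * nc + u][0] +
                       spec[v * nc + u][1] * spec[v * nc + u][1];
            total += e;
            if (std::sqrt(fx * fx + fy * fy) > cutoff)
                high += e;
        }
    }
    fftw_destroy_plan(plan);
    fftw_free(spec);
    return total > 0.0 ? high / total : 0.0;
}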
I don't know about OpenCV.
If I were trying to get an approximate measurement of where an image sits on the sharp-to-blurry spectrum, I'd start from the observation that the sharpness of parts of an image shows up as contrast between adjacent pixels - something like max(c1 * abs(r1 - r2), c2 * abs(g1 - g2), c3 * abs(b1 - b2)), where c1-c3 weight the perceptual importance of the red, green and blue channels, and the two pixels are (r1,g1,b1) and (r2,g2,b2).
Many tweaks are possible, such as raising each colour's contribution to a power to emphasise changes at the dark (power < 1) or bright (power > 1) end of the brightness scale. Note that the max() approach considers sharpness for each colour channel separately: a change from, say, (255,255,255) to (0,255,255) is very dramatic despite only one channel changing.
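A minimal sketch of that measure (the weights here are rough luma-style guesses rather than tuned values, and only horizontal neighbours are compared):

#include <algorithm>
#include <cstdint>
#include <cstdlib>

struct RGB { uint8_t r, g, b; };

// Contrast between two adjacent pixels: per-channel difference, weighted,
// with the maximum taken so a big change in any one channel counts.
double pair_contrast(const RGB& a, const RGB& b)
{
    const double c1 = 0.30, c2 = 0.59, c3 = 0.11;
    return std::max({ c1 * std::abs((int)a.r - b.r),
                      c2 * std::abs((int)a.g - b.g),
                      c3 * std::abs((int)a.b - b.b) });
}

// Image-level sharpness: maximum contrast over horizontally adjacent pixels.
double sharpness(const RGB* img, int w, int h)
{
    double best = 0.0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x + 1 < w; ++x)
            best = std::max(best, pair_contrast(img[y * w + x], img[y * w + x + 1]));
    return best;
}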
You may find it better to convert from RGB to another colour representation, such as Hue/Saturation/Value (there are lots of sites online explaining the HSV space and the formulas for conversion).
Photographically, we're usually interested in knowing that the in-focus part of the image is sharp (foreground/background blur/bokeh due to shallow depth of field is a normal and frequently desirable quality) - the clearest indication of that is high contrast in some part of the image, suggesting you want the maximum of the adjacent-pixel contrasts. That said, some in-focus pictures can still have very low local contrast (e.g. a picture of a solid-coloured surface). Further, damaged pixel elements on the sensor, dirt on the lens or sensor, and high-ISO / long-exposure noise may all show up as spots of extremely high contrast. So the validity of your result is always going to be questionable, but it might be ball-park right a useful percentage of the time.
I am working on a microscope that streams live images via a built-in video camera to a PC, where further image processing can be performed on the streamed image. Any processing done on the streamed image must be done in "real-time" (minimal frames dropped).
We take the average of a series of static images to counter random noise from the camera to improve the output of some of our image processing routines.
My question is: how do I know when the image is no longer static - either the sample under inspection has moved or rotated, or the camera has zoomed in or out - so that I can reset the image series used for averaging?
I looked through some of the existing threads and found some ideas that seemed interesting:
Note: using Windows, C++ and Intel IPP. With IPP the image is a byte array (Ipp8u).
1. Hash the images, and compare the hashes (normal hash or perceptual hash?)
2. Use normalized cross correlation (IPP has many variations - which to use?)
Which do you guys think is suitable for my situation (speed)?
If your camera doesn't shake, you can, as inVader said, subtract the images. Then the sum of the absolute values of all pixels of the difference image is sometimes enough to tell whether the images are the same or different. However, if your noise, lighting level, etc. vary, this will not give you a good enough S/N ratio.
And in noisy conditions, normal hashes are even more useless.
The best approach would be to identify that some feature of your object has changed, like its boundary (if it's regular) or its center of mass (if it's irregular). If you have a boundary position, you'll only need to analyze a single line of pixels, perpendicular to that boundary, to tell that the boundary has moved.
The center-of-mass position may be subject to frequent false-negative responses, but adding the total mass and/or the moment of inertia may help.
If the camera shakes, you may have to align the images before comparing them (depending on the comparison method and the required accuracy, even a single-pixel misalignment can be huge), and that's where cross-correlation helps.
Furthermore, you don't have to analyze every image. You can skip one, and if the next one differs, discard both of them. That gives you twice as much time to analyze each image.
And if you are averaging images, you might just define the optimal number of images you need and compare only the first and the last image in the sequence.
So, the simplest thing to try would be to take subsequent images, subtract them from each other and look at the difference. Then define some rules, including local and global thresholds for the difference, under which two images are considered equal. Simple subtraction of the bitmap/array data, looking for maxima and computing the average difference across the whole image, should be no problem to do in real time.
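A plain-C++ sketch of that (deliberately without IPP calls, so the names and thresholds are placeholders you'd tune for your noise level):

#include <cstdint>
#include <cstdlib>

// Compare two frames of n bytes: report a change if either the largest
// per-pixel difference or the average difference exceeds its threshold.
bool images_differ(const uint8_t* a, const uint8_t* b, int n,
                   int maxDiffThreshold, double avgDiffThreshold)
{
    long long sum = 0;
    int maxDiff = 0;
    for (int i = 0; i < n; ++i) {
        int d = std::abs((int)a[i] - (int)b[i]);
        sum += d;
        if (d > maxDiff) maxDiff = d;
    }
    return maxDiff > maxDiffThreshold || (double)sum / n > avgDiffThreshold;
}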
If there are varying light conditions or something moving in a predictable way (like a door opening and closing), then something more powerful, albeit slower, like Gaussian mixture models for background modeling, might be worth looking into. It is quite compute-intensive, but can be parallelized fairly easily.
Motion detection algorithms are what is used here:
http://www.codeproject.com/Articles/10248/Motion-Detection-Algorithms
http://www.codeproject.com/Articles/22243/Real-Time-Object-Tracker-in-C
First of all, I would take a series of images at a low frame rate and downsample them to make them smaller - not too much, but enough to speed up the process.
Now you have several options:
You could compute the sum of absolute differences of the two images by subtracting them, and use a threshold to decide whether the image has changed.
If you want to speed it up even further, I would suggest doing a progressive SAD using a small kernel and moving from the top of the image to the bottom (a sketch follows). You can accumulate the total difference as you go and stop early once you are satisfied.
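A sketch of that early-exit idea (the block size and names are arbitrary choices for illustration):

#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Scan row blocks from top to bottom and bail out as soon as the accumulated
// absolute difference exceeds the threshold.
bool changed_early_exit(const uint8_t* a, const uint8_t* b,
                        int w, int h, long long threshold, int rowsPerStep = 16)
{
    long long sum = 0;
    for (int y0 = 0; y0 < h; y0 += rowsPerStep) {
        int y1 = std::min(y0 + rowsPerStep, h);
        for (int i = y0 * w; i < y1 * w; ++i)
            sum += std::abs((int)a[i] - (int)b[i]);
        if (sum > threshold)
            return true;   // enough evidence of a change; stop early
    }
    return false;
}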
I have two images, both taken at the same time from the same detector.
Both images have 11-bit resolution (yes, it's odd, but that is the case here). The difference between the two images is that one has been amplified by a factor of 1 and the other by a factor of 10.
How can I take these two 11 bit images, and combine their pixel values to get a single 16 bit image? Basically, this increases the dynamic range of the final image.
I am fairly new to image processing. I know there is a solution for this, since other systems do it on the fly, pixel by pixel, in an FPGA. I was just hoping to be able to do this in Matlab as post-processing rather than live. I know doing bitwise operations in Matlab can be a bit awkward, but we do have an educational license with every toolbox available.
As mentioned below, this looks an awful lot like HDR processing; the goal isn't artistic, though, but data preservation. This is eventually going to be ported to C++ and flown on an autonomous flight computer, and running standard, bloated HDR software on the fly would kill our timing requirements.
Thanks for the help!
As a side note, I'd like to be able to do this for any combination of gains, e.g. 2x and 30x, 4x and 8x, etc. In my gut I feel like this is a deceptively simple algorithm or interpolation, but I just don't know where to start.
Gains
Since there is some confusion about what the gains mean, I'll try to explain. The image sensor (CMOS) used in our custom camera can simultaneously output two separate images, both taken from the same exposure. It can do this because the sensor has two different electrical amplifiers along its data path.
In photography terms, it would be like your DSLR being able to take a picture using 2 different ISO values at the same time.
Sorry for the confusion
The problem you pose is known as "High Dynamic Range Imaging" and "Tone Mapping". I suggest you start with those Wikipedia articles, then drill down into the bibliography cited therein.
You don't provide enough details about your imagery to give a more specific answer. What is the "gain" you mention? Did you crank up the sensor's gain (to what ISO-equivalent number?), or did you use a longer exposure time? Are the 11-bit pixel values linear or already gamma-compressed?
To upscale an 11-bit range to a 16-bit range, multiply by (2^16-1)/(2^11-1).
(This assumes you want a linear scaling, which is reasonable when scaling up.)
If the gain was discrete (applied within the 11-bit range), then you have two 11-bit images, which may have some saturated values.
If the gain was applied in a continuous (analog) or floating-point range, then your values can go beyond the original 11 bits. In that case the values were probably also scaled to another range first, e.g. [0,1] (by dividing by 2^11-1).
If the values were scaled to another range, you will have to divide by the maximum of the new range instead of by 2^11-1.
Either way (whether the gain was applied within the 11-bit range or not), due to the gain and the addition, the resulting values may be larger than the original range. In this case, you need to decide how you want to scale them (a sketch of the second option follows the list):
Do you want to scale the original 11-bit range to 16 bits (possibly causing saturation)?
If so, multiply by (2^16-1)/(2^11-1).
Do you want to scale the maximum possible value to 2^16-1?
If so, multiply by (2^16-1)/((2^11-1) * (G1+G2)).
Do you want to scale the actual maximum value to 2^16-1?
If so, multiply by (2^16-1)/max(I1+I2).
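To illustrate the arithmetic of the second option, here is a hedged C++ sketch; it assumes the two inputs already include their gains (as in the continuous-gain case above) and simply adds them before scaling, with the names chosen for illustration only.

#include <algorithm>
#include <cstdint>

// out = (i1 + i2) * (2^16-1) / ((2^11-1) * (G1 + G2)), clamped to 16 bits.
void combine_to_16bit(const double* i1, const double* i2, uint16_t* out,
                      int n, double g1, double g2)
{
    const double scale = 65535.0 / (2047.0 * (g1 + g2));
    for (int k = 0; k < n; ++k) {
        double v = (i1[k] + i2[k]) * scale;
        out[k] = (uint16_t)std::min(std::max(v, 0.0), 65535.0);
    }
}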
Edit:
Since you do not want to add the images, but rather use the different details in them, perhaps this article will help you:
Digital Photography with Flash and No-Flash Image Pairs
Why does JPEG compression process the image in 8x8 blocks instead of applying the Discrete Cosine Transform to the whole image?
8x8 was chosen after numerous experiments with other sizes.
The conclusions of those experiments were:
1. Matrices larger than 8x8 are harder to do mathematical operations on (like transforms - see the sketch below), are not supported by hardware, or take longer.
2. Matrices smaller than 8x8 don't carry enough information to continue along the pipeline, which results in poor quality of the compressed image.
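For reference, a naive 2-D DCT-II of a single 8x8 block looks like this; real codecs use fast factorizations, but the small block size keeps even this brute-force form cheap and hardware-friendly.

#include <cmath>

// 64 output coefficients, each an inner product over the 64 input pixels.
void dct8x8(const double in[8][8], double out[8][8])
{
    const double pi = 3.14159265358979323846;
    for (int u = 0; u < 8; ++u) {
        for (int v = 0; v < 8; ++v) {
            double sum = 0.0;
            for (int x = 0; x < 8; ++x)
                for (int y = 0; y < 8; ++y)
                    sum += in[x][y]
                         * std::cos((2 * x + 1) * u * pi / 16.0)
                         * std::cos((2 * y + 1) * v * pi / 16.0);
            double cu = (u == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
            out[u][v] = 0.25 * cu * cv * sum;
        }
    }
}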
Because that would take "forever" to decode. I don't remember exactly, but I think you need at least as many coefficients as there are pixels in the block. If you code the whole image as a single block, then for every pixel you need to iterate through all the DCT coefficients.
I'm not very good at big-O calculations, but I guess the complexity would be O("forever"). ;-)
For modern video codecs I think they've started using 16x16 blocks instead.
One good reason is that images (or at least the kind of images humans like to look at) have a high degree of information correlation locally, but not globally.
Every relatively smooth patch of skin, or piece of sky or grass or wall, eventually ends at a sharp edge and is replaced by something entirely different. This means you still need the high-frequency content in order to represent the image adequately rather than just blurring it out.
Now, because Fourier-like transforms such as the DCT "jumble" all the spatial information, you wouldn't be able to throw away any intermediate coefficients either, nor the high-frequency components "you don't like".
There are of course other ways to discard visual noise and reconstruct edges at the same time, by preserving high-frequency components only where needed, or by doing an iterative reconstruction of the image at finer levels of detail. You might want to look into scale-space representations and wavelet transforms.
What algorithms should be used for image downsizing?
Which is faster?
What algorithm is used for image resizing (especially downsizing from a big 600x600 to a super small 6x6, for example) by giants such as Flash, Silverlight and HTML5?
Bilinear is the most widely used method and can be made to run about as fast as the nearest neighbor down-sampling algorithm, which is the fastest but least accurate.
The trouble with a naive implementation of bilinear sampling is that if you use it to reduce an image by more than half, you can run into aliasing artifacts similar to what you would encounter with nearest neighbor. The solution is to use a pyramid-based approach: if you want to reduce 600x600 to 30x30, you first reduce to 300x300, then 150x150, then 75x75, then 38x38, and only then use bilinear to reduce to 30x30.
When reducing an image by half, the bilinear sampling algorithm becomes much simpler. Basically for each alternating row and column of pixels:
y[i/2][j/2] = (x[i][j] + x[i+1][j] + x[i][j+1] + x[i+1][j+1]) / 4;
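A sketch of one such pyramid step for a single 8-bit channel (assuming even dimensions; repeat until you are within 2x of the target size, then finish with ordinary bilinear):

#include <cstdint>
#include <vector>

// Halve width and height with a rounded 2x2 box average, as in the formula above.
std::vector<uint8_t> halve(const std::vector<uint8_t>& src, int w, int h)
{
    std::vector<uint8_t> dst((w / 2) * (h / 2));
    for (int y = 0; y < h; y += 2)
        for (int x = 0; x < w; x += 2)
            dst[(y / 2) * (w / 2) + (x / 2)] =
                (uint8_t)((src[y * w + x] + src[y * w + x + 1] +
                           src[(y + 1) * w + x] + src[(y + 1) * w + x + 1] + 2) / 4);
    return dst;
}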
There is an excellent article at The Code Project showing the effects of various image filters.
For shrinking an image I suggest the bicubic algorithm; this has a natural sharpening effect, so detail in the image is retained at smaller sizes.
There's one special case: downsizing JPEGs by more than a factor of 8. A direct factor-of-8 rescale can be done on the raw JPEG data without fully decompressing it, because JPEGs are stored as compressed 8x8 blocks whose first (DC) coefficient is the block's average pixel value. As a result, it typically takes more time to read the file from disk or the network than it takes to downscale it.
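If you happen to be using libjpeg, its scaled decoding exploits exactly this property; a sketch (no error handling, assuming an ordinary baseline JPEG file):

#include <stdio.h>
#include <vector>
#include <jpeglib.h>

// Decode a JPEG at 1/8 of its size: with scale_denom = 8 the library
// essentially returns the per-block (DC) averages without a full-size IDCT.
std::vector<unsigned char> decode_eighth(const char* path, int& w, int& h, int& channels)
{
    jpeg_decompress_struct cinfo;
    jpeg_error_mgr jerr;
    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);

    FILE* f = fopen(path, "rb");
    jpeg_stdio_src(&cinfo, f);
    jpeg_read_header(&cinfo, TRUE);
    cinfo.scale_num = 1;
    cinfo.scale_denom = 8;                       // decode at 1/8 size
    jpeg_start_decompress(&cinfo);

    w = cinfo.output_width;
    h = cinfo.output_height;
    channels = cinfo.output_components;
    std::vector<unsigned char> img((size_t)w * h * channels);
    while (cinfo.output_scanline < cinfo.output_height) {
        JSAMPROW row = &img[(size_t)cinfo.output_scanline * w * channels];
        jpeg_read_scanlines(&cinfo, &row, 1);
    }
    jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
    fclose(f);
    return img;
}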
Normally I would stick to a bilinear filter for scaling down. For resizing images to tiny sizes, though, you may be out of luck. Most icons are pixel-edited by hand to make them look their best.
Here is a good resource which explains the concepts quite well.