Optimizing C++ code to match reference run time - c++

I have an assignment to optimize some C++ code. I'm bad at coding, but I made some attempts. The original is:
#include "stdafx.h"
#include "HistogramStretching.h"
void CHistogramStretching::HistogramStretching(BYTE** pImage, int nW, int nH)
{
//find minimal value
int nMin = pImage[0][0];
for(int j = 0; j < nW; j++)
for(int i = 0; i < nH; i++)
if(pImage[i][j] < nMin)
nMin = pImage[i][j];
//find maximal value
int nMax = pImage[0][0];
for(int j = 0; j < nW; j++)
for(int i = 0; i < nH; i++)
if(pImage[i][j] > nMax)
nMax = pImage[i][j];
//stretches histogram
for(int j = 0; j < nW; j++)
for(int i = 0; i < nH; i++)
{
if(nMax != nMin)
{
float fScale = (nMax - nMin)/100.0;//calculates scale
float fVal = (pImage[i][j] - nMin)/fScale;//scales pixel value
int nVal = (int)(fVal + 0.5);//rounds floating point number to integer
//checks BYTE range (must be 0-255)
if(nVal < 0)
nVal = 0;
if(nVal > 255)
nVal = 255;
pImage[i][j] = nVal;
}
else
pImage[i][j] = 0;//if all pixel values are the same, the image is changed to black
}
}
And my version is:
#include "stdafx.h"
#include "HistogramStretching.h"
void CHistogramStretching::HistogramStretching(BYTE** pImage, int nW, int nH)
{
//find minimal value
int nMin = pImage[0][0];
int nMax = pImage[0][0];
for (int j = 0; j < nW; j++) {
for (int i = 0; i < nH; i++) {
if (pImage[i][j] < nMin)
nMin = pImage[i][j];
if (pImage[i][j] > nMax)
nMax = pImage[i][j];
}
}
if (nMax != nMin) {
float fScale = (nMax - nMin) / 100.0;//calculates scale
fScale = 1 / fScale;
//stretches histogram
for (int j = 0; j < nW; j++)
for (int i = 0; i < nH; i++)
{
float fVal = (pImage[i][j] - nMin) * fScale;//scales pixel value
int nVal = (int)(fVal + 0.5);//rounds floating point number to integer
//checks BYTE range (must be 0-255)
if (nVal < 0)
nVal = 0;
if (nVal > 255)
nVal = 255;
pImage[i][j] = nVal;
}
//if all pixel values are the same, the image is changed to black
}
else {
pImage[0][0] = 0;
}
}
So I merged the first two loops into one, but the first if still takes ~15% of CPU time. The next step was to pull the if statement outside the loops and to change the division into a multiplication; now the division takes ~8% of CPU time and the float-to-int cast takes ~5%, and I don't think I can do much about the cast. With these "corrections" my code is still something like 6-7 times slower than the reference code. I test both versions on the same machine. Can you point me to something I can do better?

I think tadman gave you the correct answer.
Replace
for (int j = 0; j < nW; j++) {
    for (int i = 0; i < nH; i++) {
        if (pImage[i][j] < nMin)
            ...
    }
}
with
for (int i = 0; i < nH; i++) {
    for (int j = 0; j < nW; j++) {
        if (pImage[i][j] < nMin)
            ...
    }
}
This way your data access becomes sequential in memory and therefore cache-friendly, which should be way faster.
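Putting both suggestions together, a minimal sketch of the merged min/max scan with the swapped loop order might look like this (BYTE comes from the question's signature; the helper name is my own):

// Hypothetical sketch: single pass over the image, rows in the outer loop,
// so that pImage[i][0..nW-1] is read sequentially from memory.
void findMinMax(BYTE** pImage, int nW, int nH, int& nMin, int& nMax)
{
    nMin = pImage[0][0];
    nMax = pImage[0][0];
    for (int i = 0; i < nH; i++) {     // row index in the outer loop
        const BYTE* pRow = pImage[i];  // one row is contiguous
        for (int j = 0; j < nW; j++) {
            if (pRow[j] < nMin) nMin = pRow[j];
            if (pRow[j] > nMax) nMax = pRow[j];
        }
    }
}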

All modern compilers can vectorize this nicely, when compiled at full optimization (/O2 for MSVC, -O3 for gcc and clang).
The idea is to give the compiler some help so that it can see that the code can in fact be vectorized:
Let the inner loop operate on a single pointer, not on indices, and without accessing anything but the pointed-to value.
Perform the scaling as an integer operation - and don't forget rounding :)
Try to set up operations such that additional range checks are unnecessary, e.g. your checks for BYTE being less than 0. By having the offset and scale set up properly, the result will be guaranteed to fall into the desired range.
The inner loops will get unrolled, and will be vectorized to process 4 bytes at a time. I've tried the recent gcc, clang and MSVC releases and they produce pretty fast code for this.
You're doing something "weird" in that you purposefully scale the results to a 0-99 range. Thus you lose the resolution of the data - you've got a full byte to work with, so why not scale it to 255?
But if you want to scale to 100 values, it's fine. Note that 100(dec) = 0x64. We can make the outputSpan flexible - it will work for any value <= 255.
Thus:
/* Code Part 1 */
#include <cstdint>

constexpr uint32_t outputSpan = 100;

// scale factor in 16.16 fixed point unsigned format
static constexpr uint32_t scale_16(uint8_t min, uint8_t max)
{
    return (outputSpan * 0x10000) / (1 + max - min);
}

// empty histogram produces scale = outputSpan
static_assert(scale_16(10, 10) == outputSpan * 0x10000, "Scale calculation is wrong");

static constexpr uint8_t scale_pixel(uint8_t const pixel, uint8_t min, uint32_t const scale)
{
    uint32_t px = (pixel - min) * scale; // result in 16.16 fixed point format
    return (px + 0x8080u) >> 16;         // round to an integer value
}
We work with fixed-point numbers (instead of floating-point). The scale is in 16.16 format: 16 bits in the integer part and 16 bits in the fractional part, e.g. 0x1234.5678. The value 1.0(dec) would be 0x1.0000.
The pixel scaling simply multiplies the pixel by the scale, rounds it, and returns the truncated integer part.
The rounding is "interesting". You'd think that it'd suffice to add 0.5(dec) = 0x0.8 to the result to round it. That's not the case. The value needs to be a bit larger than that, and 0x0.8080 does the job: it pre-biases the value so that the error range around the exact value has a zero mean. In all cases, the error is at most ±0.5 - thus the result, rounded to an integer, does not lose accuracy.
We use scale_16 and scale_pixel functions to implement the stretcher:
/* Code Part 2 */
void stretchHistogram(uint8_t **pImage, int const nW, int const nH)
{
    uint8_t nMin = 255, nMax = 0;
    for (uint8_t **row = pImage, **rowEnd = pImage + nH; row != rowEnd; ++row)
        for (const uint8_t *p = *row, *pEnd = p + nW; p != pEnd; ++p)
        {
            auto const px = *p;
            if (px < nMin) nMin = px;
            if (px > nMax) nMax = px;
        }

    auto const scale = scale_16(nMin, nMax);
    for (uint8_t **row = pImage, **rowEnd = pImage + nH; row != rowEnd; ++row)
        for (uint8_t *p = *row, *pEnd = p + nW; p != pEnd; ++p)
            *p = scale_pixel(*p, nMin, scale);
}
This also produces decent code on architectures without an FPU, such as FPU-less ARM cores and AVR.
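For reference, a minimal, hypothetical usage sketch (the tiny test image is my own invention; any row-pointer layout works):

#include <cstdint>

int main()
{
    // A tiny 2x4 test image as an array of row pointers.
    uint8_t row0[] = {16, 50, 100, 150};
    uint8_t row1[] = {200, 239, 16, 239};
    uint8_t* image[] = {row0, row1};

    stretchHistogram(image, 4, 2); // nW = 4, nH = 2
    // image now holds the values stretched onto the output span.
}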
We can also do some manual checks. Suppose that min = 0x10, max = 0xEF, and pixel = 0x32. Let's remember that the scale is in 16.16 format:
scale = 0x64.0000 / (1 + max - min)
      = 0x64.0000 / (1 + 0xEF - 0x10)
      = 0x64.0000 / (1 + 0xDF)
      = 0x64.0000 / 0xE0
Long division:

        0x0.7249
0x64.0000 / 0xE0
----------------
  64.0
- 62.0      (0x7 * 0xE0 = 0x620)
------
   2.00
 - 1.C0     (0x2 * 0xE0 = 0x1C0)
 ------
    .400
  - .380    (0x4 * 0xE0 = 0x380)
  ------
    .0800
  - .07E0   (0x9 * 0xE0 = 0x7E0)
  -------
    .0020
So, we have scale = 0x0.7249. It's less than one (0x1.0), and also a bit less than 1/2 (0x0.8), since we map 224 values onto 100 values - a bit less than half as many.
Now
px = (pixel - min) * scale
   = (0x32 - 0x10) * 0x0.7249
   = 0x22 * 0x0.7249
Long multiplication:

   0x0.7249
 *     0x22
-----------
     .E492    (0x0.7249 * 0x2)
 +  E.492     (0x0.7249 * 0x20)
-----------
  0xF.2DB2
Thus, px = 0xF.2DB2 ≈ 0xF. We have to round it to an integer:

return = (px + 0x0.8080u) >> 16
       = (0xF.2DB2 + 0x0.8080) >> 16
       = 0xF.AE32 >> 16
       ≈ 0xF
Let's check in the decimal system:

100 / (max - min + 1) * (pixel - min) =
  = 100 / (239 - 16 + 1) * (50 - 16)
  = 100 / 224 * 34
  = 100 * 34 / 224
  = 3400 / 224
  ≈ 15.17
  ≈ 15
  = 0xF
Here's a test case that ensures there's no rounding bias for any combination of min, max, and input pixel value, and that the error is bounded to [-0.5, 0.5]. Just append it to the code above; it should compile, run, and produce the following output:
-0.5 0.5 1
For scaling to outputSpan = 256 values (instead of 100), it'd output:
-0.498039 0.498039 0.996078
/* Code Part 3 */
#include <cassert>
#include <cmath>
#include <iostream>

int main()
{
    double errMin = 0, errMax = 0;
    for (uint16_t min = 0; min <= 255; ++min)
        for (uint16_t max = min; max <= 255; ++max)
            for (uint16_t val = min; val <= max; ++val)
            {
                uint8_t const nMin = min, nMax = max;
                uint8_t const span = nMax - nMin;
                uint8_t const val_src = val;
                uint8_t p_val = val_src;
                uint8_t *const p = &p_val;
                assert(nMin <= nMax);
                assert(val >= nMin && val <= nMax);
                auto const scale = scale_16(nMin, nMax);
                *p = scale_pixel(*p, nMin, scale);
                auto pValTarget = (val_src - nMin) * double(outputSpan) / (1.0 + span);
                auto error = pValTarget - *p;
                if (error < errMin) errMin = error;
                if (error > errMax) errMax = error;
            }
    std::cout << '\n' << errMin << ' ' << errMax << ' ' << errMax - errMin << std::endl;
    assert((errMax - errMin) <= 1.0);         // constrain the error
    assert(std::abs(errMax + errMin) == 0.0); // constrain the error average
}

Related

How to reorder raw image color data to achieve a specific 2 by 2 format from four images? (C++)

I have the raw color data for four images, let's call them 1, 2, 3, and 4. I am storing the data in an unsigned char * with allocated memory. Individually I can manipulate or encode the images but when trying to concatenate or order them into a single image it works but takes more time than I would like.
I would like to create a 2 by 2 of the raw image data to encode as a single image.
1 2
3 4
For my example each image is 400 by 225 with RGBA (360,000 bytes). I'm doing a for loop with memcpy where
for (int j = 0; j < 225; j++)
{
    std::memcpy(dest + (j * (400 + 400) * 4), src + (j * 400 * 4), 400 * 4);
}
for each image with an offset for the starting position added in (the example above would only work for the top left of course).
This works but I'm wondering if this is a solved problem with a better solution, either in an algorithm described somewhere or a small library.
#include <iostream>

const int width = 6;
const int height = 4;
constexpr int n = width * height;

int main()
{
    unsigned char a[n], b[n], c[n], d[n];
    unsigned char dst[n * 4];
    int i = 0, j = 0;

    /* init data */
    for (; i < n; i++) {
        a[i] = 'a';
        b[i] = 'b';
        c[i] = 'c';
        d[i] = 'd';
    }

    /* re-order */
    i = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++, i++, j++) {
            dst[i]                 = a[j];
            dst[i + width]         = b[j];
            dst[i + n * 2]         = c[j];
            dst[i + n * 2 + width] = d[j];
        }
        i += width;
    }

    /* print result */
    i = 0;
    for (int y = 0; y < height * 2; y++) {
        for (int x = 0; x < width * 2; x++, i++)
            std::cout << dst[i];
        std::cout << '\n';
    }
    return 0;
}
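If you'd rather stay with the memcpy approach from the question, a hypothetical generalization (the function name and parameters are mine) copies each source row to the right offset of the 2x2 canvas:

#include <cstring>

// Hypothetical helper: copy one w-by-h RGBA image into a 2w-by-2h canvas
// at tile position (tx, ty), where tx and ty are each 0 or 1.
void blitTile(unsigned char* dst, const unsigned char* src,
              int w, int h, int tx, int ty)
{
    const int dstStride = 2 * w * 4; // canvas row size in bytes
    const int srcStride = w * 4;     // tile row size in bytes
    unsigned char* base = dst + ty * h * dstStride + tx * srcStride;
    for (int j = 0; j < h; j++)
        std::memcpy(base + j * dstStride, src + j * srcStride, srcStride);
}

// Usage for the 400x225 example:
// blitTile(dst, img1, 400, 225, 0, 0); blitTile(dst, img2, 400, 225, 1, 0);
// blitTile(dst, img3, 400, 225, 0, 1); blitTile(dst, img4, 400, 225, 1, 1);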

How to read binary files properly?

I have a problem with the NIST/Diehard Binary Matrix test. It's about dividing a binary sequence into 32x32 matrices and calculating their ranks. After calculating the ranks I need to compute a xi^2 value and then calculate the p-value (which must be from 0 to 1). I'm getting an extremely small p-value even for a random sequence.
I've tried to hardcode some small examples and got the p-value right, so I think my problem is in reading the binary sequence file and getting bits from it.
This is reading from a file and converting to a bit sequence:
ifstream fin("seq1.bin", ios::binary);
fin.seekg(0, ios::end);
int n = fin.tellg();
unsigned int start, end;
char *buf = new char[n];
fin.seekg(0, ios::beg);
fin.read(buf, n);

n *= 8;
bool *s = new bool[n];
for (int i = 0; i < n / 8; i++) {
    for (int j = 7; j >= 0; j--) {
        s[i * 8 + 7 - j] = (bool)((buf[i] >> j) & 1);
    }
}
Then I form my matrices and calculate their ranks:
int *ranks = new int[N];
for (int i = 0; i < N; i++) {
    bool *arr = new bool[m*q];
    copy(s + i * m*q, s + (i * m*q) + (m * q), arr);
    ranks[i] = binary_rank(arr, m, q);
}
Checking occurrences in the ranks:
int count_occurrences(int arr[], int n, int x) {
    int result = 0;
    for (int i = 0; i < n; i++)
        if (x == arr[i])
            result++;
    return result;
}
Calculating xi^2 and the p-value:
double calculate_xi(int fm, int fm_1, int remaining, int N) {
    double N1 = 0.2888*N;
    double N2 = 0.5776*N;
    double N3 = 0.1336*N;
    double x1 = (fm - N1)*(fm - N1) / N1;
    double x2 = (fm_1 - N2)*(fm_1 - N2) / N2;
    double x3 = (remaining - N3)*(remaining - N3) / N3;
    return x1 + x2 + x3;
}

double calculate_pvalue(double xi2) {
    return exp(-(xi2 / 2));
}
I expect a p-value between 0 and 1 but get 0 every time. It's because of the extremely big xi^2 value, and I couldn't find what I've done wrong. Could you please help me get things right?
For this part:
for (int i = 0; i < n / 8; i++) {
    for (int j = 7; j >= 0; j--) {
        s[(i) * 8 + 7 - j] = (bool)((buf[i] >> j) & 1);
    }
}
when you add elements to the s array, it looks like you reverse the bit order inside each byte: since the shift starts at 7, you take the most significant bit of the char from buf[], but you store it at offset 0 in s[], so the bit order within each byte ends up swapped relative to what the test may expect. It is easy to verify with a debugger, though, as it is not so obvious from the code.
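To make the two bit orderings concrete, here is a small, hypothetical helper (the name is mine); which variant is correct depends on the bit order the test suite expects:

// Unpack n bytes into n*8 bits, either MSB-first (what the question's
// loop does: bit 7 of each byte lands at offset 0) or LSB-first.
void unpackBits(const unsigned char* buf, int n, bool* s, bool msbFirst)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < 8; j++)
            s[i * 8 + j] = msbFirst ? ((buf[i] >> (7 - j)) & 1)  // bit 7 first
                                    : ((buf[i] >> j) & 1);       // bit 0 first
}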

grayscale Laplace sharpening implementation

I am trying to implement Laplace sharpening in C++. Here's my code so far:
img = imread("cow.png", 0);

Mat convoSharp() {
    //creating new image
    Mat res = img.clone();
    for (int y = 0; y < res.rows; y++) {
        for (int x = 0; x < res.cols; x++) {
            res.at<uchar>(y, x) = 0.0;
        }
    }

    //variable declaration
    int filter[3][3] = { {0,1,0},{1,-4,1},{0,1,0} };
    //int filter[3][3] = { {-1,-2,-1},{0,0,0},{1,2,1} };
    int height = img.rows;
    int width = img.cols;
    int filterHeight = 3;
    int filterWidth = 3;
    int newImageHeight = height - filterHeight + 1;
    int newImageWidth = width - filterWidth + 1;
    int i, j, h, w;

    //convolution
    for (i = 0; i < newImageHeight; i++) {
        for (j = 0; j < newImageWidth; j++) {
            for (h = i; h < i + filterHeight; h++) {
                for (w = j; w < j + filterWidth; w++) {
                    res.at<uchar>(i,j) += filter[h - i][w - j] * img.at<uchar>(h,w);
                }
            }
        }
    }

    //img - laplace
    for (int y = 0; y < res.rows; y++) {
        for (int x = 0; x < res.cols; x++) {
            res.at<uchar>(y, x) = img.at<uchar>(y, x) - res.at<uchar>(y, x);
        }
    }
    return res;
}
I don't really know what went wrong. I also tried a different filter, (1,1,1),(1,-8,1),(1,1,1), and the result is more or less the same. I don't think I need to normalize the result because it is in the range 0-255. Can anyone explain what really went wrong in my code?
Problem: uchar is too small to hold the partial results of the filtering operation.
You should create a temporary variable, add all the filtered positions to it, and then check whether the value of temp is in the range <0,255>; if not, you need to clamp the end result to fit <0,255>.
By executing the line below
res.at<uchar>(i,j) += filter[h - i][w - j] * img.at<uchar>(h,w);
the partial result may be greater than 255 (the max value of uchar) or negative (your filter contains -4 or -8). temp has to be a signed integer type to handle the case when the partial result is negative.
Fix:
for (i = 0; i < newImageHeight; i++) {
    for (j = 0; j < newImageWidth; j++) {
        int temp = res.at<uchar>(i,j); // added
        for (h = i; h < i + filterHeight; h++) {
            for (w = j; w < j + filterWidth; w++) {
                temp += filter[h - i][w - j] * img.at<uchar>(h,w); // add to temp
            }
        }
        // clamp temp to <0,255>
        if (temp < 0)   temp = 0;
        if (temp > 255) temp = 255;
        res.at<uchar>(i,j) = temp;
    }
}
You should also clamp values to the <0,255> range when you do the subtraction of the images.
The problem is partially that you're overflowing your uchar, as rafix07 suggested, but that is not the full problem.
The Laplace of an image contains negative values. It has to. And you can't clamp those to 0, you need to preserve the negative values. Also, it can reach values up to 4*255 given your version of the filter. This means you need a signed 16-bit type to store this intermediate output.
But there is a simpler and more efficient approach!
You are computing img - laplace(img). In terms of convolutions (*), this is 1 * img - laplace_kernel * img = (1 - laplace_kernel) * img. That is to say, you can combine both operations into a single convolution. The 1 kernel that doesn’t change the image is [(0,0,0),(0,1,0),(0,0,0)]. Subtract your Laplace kernel from that and you obtain [(0,-1,0),(-1,5,-1),(0,-1,0)].
So, simply compute the convolution with that kernel, and do it using int as intermediate type, which you then clamp to the uchar output range as shown by rafix07.
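A minimal sketch of that single-pass version, assuming the same OpenCV grayscale Mat/uchar setup as the question (the function name is mine):

#include <algorithm>
#include <opencv2/opencv.hpp>
using cv::Mat;

// Convolve with the combined (1 - laplace) kernel, accumulate in int,
// then clamp to the uchar range. Border pixels are left at zero here.
Mat sharpen(const Mat& img) {
    static const int kernel[3][3] = { {0,-1,0}, {-1,5,-1}, {0,-1,0} };
    Mat res = Mat::zeros(img.size(), CV_8UC1);
    for (int i = 1; i < img.rows - 1; i++)
        for (int j = 1; j < img.cols - 1; j++) {
            int acc = 0;
            for (int h = -1; h <= 1; h++)
                for (int w = -1; w <= 1; w++)
                    acc += kernel[h + 1][w + 1] * img.at<uchar>(i + h, j + w);
            res.at<uchar>(i, j) = (uchar)std::min(255, std::max(0, acc));
        }
    return res;
}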

C++ Pattern Matching with FFT cross-correlation (Images)

Everyone, I am trying to implement pattern matching with FFT, but I am not sure what the result should be (I think I am missing something, even though I read a lot about the problem and tried a lot of different implementations; this one is the best so far). Here is my FFT correlation function.
void fft2d(fftw_complex**& a, int rows, int cols, bool forward = true)
{
    fftw_plan p;
    for (int i = 0; i < rows; ++i)
    {
        p = fftw_plan_dft_1d(cols, a[i], a[i], forward ? FFTW_FORWARD : FFTW_BACKWARD, FFTW_ESTIMATE);
        fftw_execute(p);
    }

    fftw_complex* t = (fftw_complex*)fftw_malloc(rows * sizeof(fftw_complex));
    for (int j = 0; j < cols; ++j)
    {
        for (int i = 0; i < rows; ++i)
        {
            t[i][0] = a[i][j][0];
            t[i][1] = a[i][j][1];
        }
        p = fftw_plan_dft_1d(rows, t, t, forward ? FFTW_FORWARD : FFTW_BACKWARD, FFTW_ESTIMATE);
        fftw_execute(p);
        for (int i = 0; i < rows; ++i)
        {
            a[i][j][0] = t[i][0];
            a[i][j][1] = t[i][1];
        }
    }
    fftw_free(t);
}
int findCorrelation(int argc, char* argv[])
{
    BMP bigImage;
    BMP keyImage;
    BMP result;
    RGBApixel blackPixel = { 0, 0, 0, 1 };
    const bool swapQuadrants = (argc == 4);

    if (argc < 3 || argc > 4) {
        cout << "correlation img1.bmp img2.bmp" << endl;
        return 1;
    }
    if (!keyImage.ReadFromFile(argv[1])) {
        return 1;
    }
    if (!bigImage.ReadFromFile(argv[2])) {
        return 1;
    }

    //Preparations
    const int maxWidth = std::max(bigImage.TellWidth(), keyImage.TellWidth());
    const int maxHeight = std::max(bigImage.TellHeight(), keyImage.TellHeight());
    const int rowsCount = maxHeight;
    const int colsCount = maxWidth;

    BMP bigTemp = bigImage;
    BMP keyTemp = keyImage;
    keyImage.SetSize(maxWidth, maxHeight);
    bigImage.SetSize(maxWidth, maxHeight);

    for (int i = 0; i < rowsCount; ++i)
        for (int j = 0; j < colsCount; ++j) {
            RGBApixel p1;
            if (i < bigTemp.TellHeight() && j < bigTemp.TellWidth()) {
                p1 = bigTemp.GetPixel(j, i);
            } else {
                p1 = blackPixel;
            }
            bigImage.SetPixel(j, i, p1);

            RGBApixel p2;
            if (i < keyTemp.TellHeight() && j < keyTemp.TellWidth()) {
                p2 = keyTemp.GetPixel(j, i);
            } else {
                p2 = blackPixel;
            }
            keyImage.SetPixel(j, i, p2);
        }

    //Here is where the transforms begin
    fftw_complex **a = (fftw_complex**)fftw_malloc(rowsCount * sizeof(fftw_complex*));
    fftw_complex **b = (fftw_complex**)fftw_malloc(rowsCount * sizeof(fftw_complex*));
    fftw_complex **c = (fftw_complex**)fftw_malloc(rowsCount * sizeof(fftw_complex*));
    for (int i = 0; i < rowsCount; ++i) {
        a[i] = (fftw_complex*)fftw_malloc(colsCount * sizeof(fftw_complex));
        b[i] = (fftw_complex*)fftw_malloc(colsCount * sizeof(fftw_complex));
        c[i] = (fftw_complex*)fftw_malloc(colsCount * sizeof(fftw_complex));
        for (int j = 0; j < colsCount; ++j) {
            RGBApixel p1;
            p1 = bigImage.GetPixel(j, i);
            a[i][j][0] = (0.299*p1.Red + 0.587*p1.Green + 0.114*p1.Blue);
            a[i][j][1] = 0.0;

            RGBApixel p2;
            p2 = keyImage.GetPixel(j, i);
            b[i][j][0] = (0.299*p2.Red + 0.587*p2.Green + 0.114*p2.Blue);
            b[i][j][1] = 0.0;
        }
    }

    fft2d(a, rowsCount, colsCount);
    fft2d(b, rowsCount, colsCount);

    result.SetSize(maxWidth, maxHeight);
    for (int i = 0; i < rowsCount; ++i)
        for (int j = 0; j < colsCount; ++j) {
            fftw_complex& y = a[i][j];
            fftw_complex& x = b[i][j];
            double u = x[0], v = x[1];
            double m = y[0], n = y[1];
            c[i][j][0] = u*m + n*v;
            c[i][j][1] = v*m - u*n;

            int fx = j;
            if (fx > (colsCount / 2)) fx -= colsCount;
            int fy = i;
            if (fy > (rowsCount / 2)) fy -= rowsCount;
            float r2 = (fx*fx + fy*fy);
            const double cuttoffCoef = (maxWidth * maxHeight) / 37992.;
            if (r2 < 128 * 128 * cuttoffCoef)
                c[i][j][0] = c[i][j][1] = 0;
        }

    fft2d(c, rowsCount, colsCount, false);

    const int halfCols = colsCount / 2;
    const int halfRows = rowsCount / 2;
    if (swapQuadrants) {
        for (int i = 0; i < halfRows; ++i)
            for (int j = 0; j < halfCols; ++j) {
                std::swap(c[i][j][0], c[i + halfRows][j + halfCols][0]);
                std::swap(c[i][j][1], c[i + halfRows][j + halfCols][1]);
            }
        for (int i = halfRows; i < rowsCount; ++i)
            for (int j = 0; j < halfCols; ++j) {
                std::swap(c[i][j][0], c[i - halfRows][j + halfCols][0]);
                std::swap(c[i][j][1], c[i - halfRows][j + halfCols][1]);
            }
    }

    for (int i = 0; i < rowsCount; ++i)
        for (int j = 0; j < colsCount; ++j) {
            const double& g = c[i][j][0];
            RGBApixel pixel;
            pixel.Alpha = 0;
            int gInt = 255 - static_cast<int>(std::floor(g + 0.5));
            pixel.Red = gInt;
            pixel.Green = gInt;
            pixel.Blue = gInt;
            result.SetPixel(j, i, pixel);
        }

    BMP res;
    res.SetSize(maxWidth, maxHeight);
    result.WriteToFile("result.bmp");
    return 0;
}
Sample output
This question would probably be more appropriately posted on another site like Cross Validated (metaoptimize.com used to be a good one too, but it appears to be gone).
That said:
There are two similar operations you can perform with an FFT: convolution and correlation. Convolution is used for determining how two signals interact with each other, whereas correlation can be used to express how similar two signals are to each other. Make sure you're doing the right operation, as they're both commonly implemented through a DFT.
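As a reference, the two operations differ only by a complex conjugate in the frequency domain (this is the standard identity, not code from the question):

conv(f, g) = IDFT( DFT(f) * DFT(g) )
corr(f, g) = IDFT( DFT(f) * conj(DFT(g)) )

The question's code multiplies one spectrum by the conjugate of the other, so it is indeed computing a correlation.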
For this type of application of DFTs you usually wouldn't extract any useful information from the Fourier spectrum itself, unless you were looking for frequencies common to both data sources (e.g., if you were comparing two bridges to see if their supports are spaced similarly).
Your 3rd image looks a lot like the power domain; normally I see the correlation output entirely grey except where overlap occurred. Your code definitely appears to be computing the inverse DFT, so unless I'm missing something, the only other explanation I've come up with for the fuzzy look is some of the "fudge factor" code in there, like:
if (r2 < 128 * 128 * cuttoffCoef)
    c[i][j][0] = c[i][j][1] = 0;
As for what you should expect: wherever there are common elements between the two images you'll see a peak. The larger the peak, the more similar the two images are near that region.
Some comments and/or recommended changes:
1) Convolution & correlation are not scale invariant operations. In other words, the size of your pattern image can make a significant difference in your output.
2) Normalize your images before correlation.
When you get the image data ready for the forward DFT pass:
a[i][j][0] = (0.299*p1.Red + 0.587*p1.Green + 0.114*p1.Blue);
a[i][j][1] = 0.0;
/* ... */
How you grayscale the image is your business (though I would've picked something like sqrt( r*r + b*b + g*g )). However, I don't see you doing anything to normalize the image.
The word "normalize" can take on a few different meanings in this context. Two common types:
normalize the range of values between 0.0 and 1.0
normalize the "whiteness" of the images
3) Run your pattern image through an edge enhancement filter. I've personally made use of Canny, Sobel, and I think I messed with a few others. As I recall, Canny was "quick'n dirty", Sobel was more expensive, but I got comparable results when it came time to do correlation. See chapter 24 of the "dsp guide" book that's freely available online. The whole book is worth your time, but if you're low on time, then at a minimum chapter 24 will help a lot.
4) Re-scale the output image between [0, 255]; if you want to implement thresholds, do it after this step because the thresholding step is lossy.
My memory on this one is hazy, but as I recall (edited for clarity):
- You can scale the final image pixels (before rescaling) between [-1.0, 1.0] by dividing off the largest power spectrum value from the entire power spectrum.
- The largest power spectrum value is, conveniently enough, the center-most value in the power spectrum (corresponding to the lowest frequency).
- If you divide it off the power spectrum, you'll end up doing twice the work; since FFTs are linear, you can delay the division until after the inverse DFT pass, when you're re-scaling the pixels to [0..255].
If after rescaling most of your values end up so black you can't see them, you can apply a solution of the ODE y' = y(1 - y) (one example is the sigmoid f(x) = 1 / (1 + exp(-c*x)), for some scaling factor c that gives better gradations). This has more to do with improving your ability to interpret the results visually than anything you might use to programmatically find peaks.
Edit: I said [0, 255] above. I suggest you rescale to [128, 255] or some other lower bound that is gray rather than black.
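A minimal sketch of that final rescaling step (the bounds and names are mine; it assumes the real part of the inverse transform has been collected into a flat array):

#include <algorithm>
#include <cstdint>

// Linearly map correlation values to [lo, 255]; lo = 128 makes
// "no correlation" show up as gray rather than black.
void rescalePlane(const double* g, int count, uint8_t* out, uint8_t lo = 128)
{
    const double mn = *std::min_element(g, g + count);
    const double mx = *std::max_element(g, g + count);
    const double span = (mx > mn) ? (mx - mn) : 1.0;
    for (int i = 0; i < count; ++i)
        out[i] = lo + (uint8_t)((g[i] - mn) / span * (255 - lo) + 0.5);
}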

Implementing Gaussian Blur - How to calculate convolution matrix (kernel)

My question is very close to this question: How do I gaussian blur an image without using any in-built gaussian functions?
The answer to this question is very good, but it doesn't give an example of actually calculating a real Gaussian filter kernel. The answer gives an arbitrary kernel and shows how to apply the filter using that kernel but not how to calculate a real kernel itself. I am trying to implement a Gaussian blur in C++ or Matlab from scratch, so I need to know how to calculate the kernel from scratch.
I'd appreciate it if someone could calculate a real Gaussian filter kernel using any small example image matrix.
You can create a Gaussian kernel from scratch as noted in the MATLAB documentation of fspecial. Please read the Gaussian kernel creation formula in the algorithms part of that page and follow the code below. The code creates an m-by-n matrix with sigma = 1.
m = 5; n = 5;
sigma = 1;
[h1, h2] = meshgrid(-(m-1)/2:(m-1)/2, -(n-1)/2:(n-1)/2);
hg = exp(- (h1.^2+h2.^2) / (2*sigma^2));
h = hg ./ sum(hg(:));
h =
0.0030 0.0133 0.0219 0.0133 0.0030
0.0133 0.0596 0.0983 0.0596 0.0133
0.0219 0.0983 0.1621 0.0983 0.0219
0.0133 0.0596 0.0983 0.0596 0.0133
0.0030 0.0133 0.0219 0.0133 0.0030
Observe that this can be done by the built-in fspecial as follows:
fspecial('gaussian', [m n], sigma)
ans =
0.0030 0.0133 0.0219 0.0133 0.0030
0.0133 0.0596 0.0983 0.0596 0.0133
0.0219 0.0983 0.1621 0.0983 0.0219
0.0133 0.0596 0.0983 0.0596 0.0133
0.0030 0.0133 0.0219 0.0133 0.0030
I think it is straightforward to implement this in any language you like.
EDIT: Let me also add the values of h1 and h2 for the given case, since you may be unfamiliar with meshgrid if you code in C++.
h1 =
-2 -1 0 1 2
-2 -1 0 1 2
-2 -1 0 1 2
-2 -1 0 1 2
-2 -1 0 1 2
h2 =
-2 -2 -2 -2 -2
-1 -1 -1 -1 -1
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
It's as simple as it sounds:
#include <cmath> // for exp, pow, M_PI

const double sigma = 1;
const int W = 5; // const so the array size is a compile-time constant
double kernel[W][W];
double mean = W/2;
double sum = 0.0; // For accumulating the kernel values
for (int x = 0; x < W; ++x)
    for (int y = 0; y < W; ++y) {
        kernel[x][y] = exp(-0.5 * (pow((x - mean) / sigma, 2.0) + pow((y - mean) / sigma, 2.0)))
                       / (2 * M_PI * sigma * sigma);
        // Accumulate the kernel values
        sum += kernel[x][y];
    }

// Normalize the kernel
for (int x = 0; x < W; ++x)
    for (int y = 0; y < W; ++y)
        kernel[x][y] /= sum;
To implement the gaussian blur you simply take the gaussian function and compute one value for each of the elements in your kernel.
Usually you want to assign the maximum weight to the central element in your kernel and values close to zero for the elements at the kernel borders.
This implies that the kernel should have an odd height (resp. width) to ensure that there actually is a central element.
To compute the actual kernel elements you may scale the gaussian bell to the kernel grid (choose an arbitrary e.g. sigma = 1 and an arbitrary range e.g. -2*sigma ... 2*sigma) and normalize it, s.t. the elements sum to one.
To achieve this, if you want to support arbitrary kernel sizes, you might want to adapt the sigma to the required kernel size.
Here's a C++ example:
#include <cmath>
#include <vector>
#include <iostream>
#include <iomanip>
double gaussian( double x, double mu, double sigma ) {
const double a = ( x - mu ) / sigma;
return std::exp( -0.5 * a * a );
}
typedef std::vector<double> kernel_row;
typedef std::vector<kernel_row> kernel_type;
kernel_type produce2dGaussianKernel (int kernelRadius) {
double sigma = kernelRadius/2.;
kernel_type kernel2d(2*kernelRadius+1, kernel_row(2*kernelRadius+1));
double sum = 0;
// compute values
for (int row = 0; row < kernel2d.size(); row++)
for (int col = 0; col < kernel2d[row].size(); col++) {
double x = gaussian(row, kernelRadius, sigma)
* gaussian(col, kernelRadius, sigma);
kernel2d[row][col] = x;
sum += x;
}
// normalize
for (int row = 0; row < kernel2d.size(); row++)
for (int col = 0; col < kernel2d[row].size(); col++)
kernel2d[row][col] /= sum;
return kernel2d;
}
int main() {
kernel_type kernel2d = produce2dGaussianKernel(3);
std::cout << std::setprecision(5) << std::fixed;
for (int row = 0; row < kernel2d.size(); row++) {
for (int col = 0; col < kernel2d[row].size(); col++)
std::cout << kernel2d[row][col] << ' ';
std::cout << '\n';
}
}
The output is:
$ g++ test.cc && ./a.out
0.00134 0.00408 0.00794 0.00992 0.00794 0.00408 0.00134
0.00408 0.01238 0.02412 0.03012 0.02412 0.01238 0.00408
0.00794 0.02412 0.04698 0.05867 0.04698 0.02412 0.00794
0.00992 0.03012 0.05867 0.07327 0.05867 0.03012 0.00992
0.00794 0.02412 0.04698 0.05867 0.04698 0.02412 0.00794
0.00408 0.01238 0.02412 0.03012 0.02412 0.01238 0.00408
0.00134 0.00408 0.00794 0.00992 0.00794 0.00408 0.00134
As a simplification, you don't need to use a 2D kernel. It is easier to implement, and also more efficient to compute, two orthogonal 1D kernels applied one after the other. This is possible due to the associativity of this type of linear convolution (linear separability).
You may also want to see this section of the corresponding wikipedia article.
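A rough sketch of that separable approach, under my own assumptions about the image layout (row-major grayscale doubles; edges handled by clamping):

#include <vector>

// Blur by applying the same normalized 1D kernel horizontally, then vertically.
std::vector<double> separableBlur(const std::vector<double>& img,
                                  int width, int height,
                                  const std::vector<double>& k1d)
{
    const int r = (int)k1d.size() / 2;
    std::vector<double> tmp(img.size(), 0.0), out(img.size(), 0.0);
    auto clampi = [](int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); };
    for (int y = 0; y < height; y++)       // horizontal pass
        for (int x = 0; x < width; x++)
            for (int t = -r; t <= r; t++)
                tmp[y * width + x] += k1d[t + r] * img[y * width + clampi(x + t, 0, width - 1)];
    for (int y = 0; y < height; y++)       // vertical pass
        for (int x = 0; x < width; x++)
            for (int t = -r; t <= r; t++)
                out[y * width + x] += k1d[t + r] * tmp[clampi(y + t, 0, height - 1) * width + x];
    return out;
}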
Here's the same in Python (in the hope someone might find it useful):
from math import exp

def gaussian(x, mu, sigma):
    return exp(-(((x - mu) / sigma)**2) / 2.0)

#kernel_height, kernel_width = 7, 7
kernel_radius = 3  # for a 7x7 filter
sigma = kernel_radius / 2.  # for [-2*sigma, 2*sigma]

# compute the actual kernel elements
hkernel = [gaussian(x, kernel_radius, sigma) for x in range(2 * kernel_radius + 1)]
vkernel = [x for x in hkernel]
kernel2d = [[xh * xv for xh in hkernel] for xv in vkernel]

# normalize the kernel elements
kernelsum = sum([sum(row) for row in kernel2d])
kernel2d = [[x / kernelsum for x in row] for row in kernel2d]

for line in kernel2d:
    print(["%.3f" % x for x in line])
produces the kernel:
['0.001', '0.004', '0.008', '0.010', '0.008', '0.004', '0.001']
['0.004', '0.012', '0.024', '0.030', '0.024', '0.012', '0.004']
['0.008', '0.024', '0.047', '0.059', '0.047', '0.024', '0.008']
['0.010', '0.030', '0.059', '0.073', '0.059', '0.030', '0.010']
['0.008', '0.024', '0.047', '0.059', '0.047', '0.024', '0.008']
['0.004', '0.012', '0.024', '0.030', '0.024', '0.012', '0.004']
['0.001', '0.004', '0.008', '0.010', '0.008', '0.004', '0.001']
OK, a late answer, but just in case...
Using @moooeeeep's answer, but with numpy:
import numpy as np

radius = 3
sigma = radius / 2.
k = np.arange(2 * radius + 1)
row = np.exp(-(((k - radius) / sigma)**2) / 2.)
col = row.transpose()
out = np.outer(row, col)
out = out / np.sum(out)
for line in out:
    print(["%.3f" % x for x in line])
Just a few lines fewer.
Gaussian blur in python using PIL image library. For more info read this: http://blog.ivank.net/fastest-gaussian-blur.html
from PIL import Image
import math

# img = Image.open('input.jpg').convert('L')
# r = radius
def gauss_blur(img, r):
    imgData = list(img.getdata())
    bluredImg = Image.new(img.mode, img.size)
    bluredImgData = list(bluredImg.getdata())
    rs = int(math.ceil(r * 2.57))
    for i in range(0, img.height):
        for j in range(0, img.width):
            val = 0
            wsum = 0
            for iy in range(i - rs, i + rs + 1):
                for ix in range(j - rs, j + rs + 1):
                    x = min(img.width - 1, max(0, ix))
                    y = min(img.height - 1, max(0, iy))
                    dsq = (ix - j) * (ix - j) + (iy - i) * (iy - i)
                    weight = math.exp(-dsq / (2 * r * r)) / (math.pi * 2 * r * r)
                    val += imgData[y * img.width + x] * weight
                    wsum += weight
            bluredImgData[i * img.width + j] = round(val / wsum)
    bluredImg.putdata(bluredImgData)
    return bluredImg
// my_test.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <cmath>
#include <vector>
#include <iostream>
#include <iomanip>
#include <string>

//https://stackoverflow.com/questions/8204645/implementing-gaussian-blur-how-to-calculate-convolution-matrix-kernel
//https://docs.opencv.org/2.4/modules/imgproc/doc/filtering.html#getgaussiankernel
//http://dev.theomader.com/gaussian-kernel-calculator/

double gaussian(double x, double mu, double sigma) {
    const double a = (x - mu) / sigma;
    return std::exp(-0.5 * a * a);
}

typedef std::vector<double> kernel_row;
typedef std::vector<kernel_row> kernel_type;

kernel_type produce2dGaussianKernel(int kernelRadius, double sigma) {
    kernel_type kernel2d(2 * kernelRadius + 1, kernel_row(2 * kernelRadius + 1));
    double sum = 0;
    // compute values
    for (int row = 0; row < kernel2d.size(); row++)
        for (int col = 0; col < kernel2d[row].size(); col++) {
            double x = gaussian(row, kernelRadius, sigma)
                     * gaussian(col, kernelRadius, sigma);
            kernel2d[row][col] = x;
            sum += x;
        }
    // normalize
    for (int row = 0; row < kernel2d.size(); row++)
        for (int col = 0; col < kernel2d[row].size(); col++)
            kernel2d[row][col] /= sum;
    return kernel2d;
}

char* gMatChar[10] = {
    " ",
    " ",
    " ",
    " ",
    " ",
    " ",
    " ",
    " ",
    " ",
    " "
};

static int countSpace(float aValue)
{
    int count = 0;
    int value = (int)aValue;
    while (value > 9)
    {
        count++;
        value /= 10;
    }
    return count;
}

int main() {
    while (1)
    {
        char str1[80]; // window size
        char str2[80]; // sigma
        char str3[80]; // coefficient
        int space;
        int i, ch;

        printf("\n-----------------------------------------------------------------------------\n");
        printf("Start generating Gaussian matrix\n");
        printf("-----------------------------------------------------------------------------\n");

        // input window size
        printf("\nPlease enter window size (from 3 to 10). It should be odd (ksize mod 2 = 1) and positive. To exit enter q\n");
        for (i = 0; (i < 80) && ((ch = getchar()) != EOF) && (ch != '\n'); i++)
        {
            str1[i] = (char)ch;
        }
        // Terminate string with a null character
        str1[i] = '\0';
        if (str1[0] == 'q')
        {
            break;
        }
        int input1 = atoi(str1);
        int window_size = input1 / 2;
        printf("Input window_size was: %d\n", input1);

        // input sigma
        printf("Please enter sigma. To use the default press Enter. To exit enter q\n");
        str2[0] = '0';
        for (i = 0; (i < 80) && ((ch = getchar()) != EOF) && (ch != '\n'); i++)
        {
            str2[i] = (char)ch;
        }
        // Terminate string with a null character
        str2[i] = '\0';
        if (str2[0] == 'q')
        {
            break;
        }
        float input2 = atof(str2);
        float sigma;
        if (input2 == 0)
        {
            // OpenCV: sigma - Gaussian standard deviation. If it is non-positive,
            // it is computed from ksize as sigma = 0.3*((ksize-1)*0.5 - 1) + 0.8
            sigma = 0.3*((input1 - 1)*0.5 - 1) + 0.8;
        }
        else
        {
            sigma = input2;
        }
        printf("Input sigma was: %f\n", sigma);

        // input Coefficient K
        printf("Please enter Coefficient K. To use the default press Enter. To exit enter q\n");
        str3[0] = '0';
        for (i = 0; (i < 80) && ((ch = getchar()) != EOF) && (ch != '\n'); i++)
        {
            str3[i] = (char)ch;
        }
        // Terminate string with a null character
        str3[i] = '\0';
        if (str3[0] == 'q')
        {
            break;
        }
        int input3 = atoi(str3);
        int cK;
        if (input3 == 0)
        {
            cK = 1;
        }
        else
        {
            cK = input3;
        }

        float sum_f = 0;
        float temp_f;
        int sum = 0;
        int temp;
        printf("Input Coefficient K was: %d\n", cK);
        printf("\nwindow size=%d | Sigma = %f Coefficient K = %d\n\n\n", input1, sigma, cK);

        kernel_type kernel2d = produce2dGaussianKernel(window_size, sigma);
        std::cout << std::setprecision(input1) << std::fixed;
        for (int row = 0; row < kernel2d.size(); row++) {
            for (int col = 0; col < kernel2d[row].size(); col++)
            {
                temp_f = cK * kernel2d[row][col];
                sum_f += temp_f;
                space = countSpace(temp_f);
                std::cout << gMatChar[space] << temp_f << ' ';
            }
            std::cout << '\n';
        }
        printf("\n Sum array = %f | delta = %f", sum_f, sum_f - cK);

        // rounding
        printf("\nRecommend use round(): window size=%d | Sigma = %f Coefficient K = %d\n\n\n", input1, sigma, cK);
        sum = 0;
        std::cout << std::setprecision(0) << std::fixed;
        for (int row = 0; row < kernel2d.size(); row++) {
            for (int col = 0; col < kernel2d[row].size(); col++)
            {
                temp = round(cK * kernel2d[row][col]);
                sum += temp;
                space = countSpace((float)temp);
                std::cout << gMatChar[space] << temp << ' ';
            }
            std::cout << '\n';
        }
        printf("\n Sum array = %d | delta = %d", sum, sum - cK);

        // recommended
        sum_f = 0;
        int cK_d = 1 / kernel2d[0][0];
        cK_d = cK_d / 2 * 2;
        printf("\nRecommend: window size=%d | Sigma = %f Coefficient K = %d\n\n\n", input1, sigma, cK_d);
        std::cout << std::setprecision(input1) << std::fixed;
        for (int row = 0; row < kernel2d.size(); row++) {
            for (int col = 0; col < kernel2d[row].size(); col++)
            {
                temp_f = cK_d * kernel2d[row][col];
                sum_f += temp_f;
                space = countSpace(temp_f);
                std::cout << gMatChar[space] << temp_f << ' ';
            }
            std::cout << '\n';
        }
        printf("\n Sum array = %f | delta = %f", sum_f, sum_f - cK_d);

        // rounding
        printf("\nRecommend use round(): window size=%d | Sigma = %f Coefficient K = %d\n\n\n", input1, sigma, cK_d);
        sum = 0;
        std::cout << std::setprecision(0) << std::fixed;
        for (int row = 0; row < kernel2d.size(); row++) {
            for (int col = 0; col < kernel2d[row].size(); col++)
            {
                temp = round(cK_d * kernel2d[row][col]);
                sum += temp;
                space = countSpace((float)temp);
                std::cout << gMatChar[space] << temp << ' ';
            }
            std::cout << '\n';
        }
        printf("\n Sum array = %d | delta = %d", sum, sum - cK_d);
    }
}
function kernel = gauss_kernel(m, n, sigma)
% Generating Gauss Kernel
x = -(m-1)/2 : (m-1)/2;
y = -(n-1)/2 : (n-1)/2;
for i = 1:m
    for j = 1:n
        xx(i,j) = x(i);
        yy(i,j) = y(j);
    end
end
kernel = exp(-(xx.*xx + yy.*yy) / (2*sigma*sigma));
% Normalize the kernel
kernel = kernel / sum(kernel(:));
% Corresponding function in MATLAB
% fspecial('gaussian', [m n], sigma)
Here's a calculation in C# which does not take single samples from the Gaussian (or another kernel) function, but instead calculates a large number of samples in a small grid and integrates the samples over the desired number of sections.
The calculation is for 1D, but it may easily be extended to 2D.
This calculation uses some other functions, which I did not add here, but I have added the function signatures so that you will know what they do.
This calculation produces the following discrete values for the limits +/- 3 (sum areaSum of integral is 0.997300):
kernel size: normalized kernel values, rounded to 6 decimals
3: 0.157731, 0.684538, 0.157731
5: 0.034674, 0.238968, 0.452716, 0.238968, 0.034674
7: 0.014752, 0.083434, 0.235482, 0.332663, 0.235482, 0.083434, 0.014752
This calculation produces the following discrete values for the limits +/- 2 (sum areaSum of integral is 0.954500):
kernel size: normalized kernel values, rounded to 6 decimals
3: 0.240694, 0.518612, 0.240694
5: 0.096720, 0.240449, 0.325661, 0.240449, 0.096720
7: 0.056379, 0.124798, 0.201012, 0.235624, 0.201012, 0.124798, 0.056379
Code:
using System.Linq;
private static void Main ()
{
int positionCount = 1024; // Number of samples in the range 0..1.
double positionStepSize = 1.0 / positionCount;
double limit = 3; // The calculation range of the kernel. +/- 3 is sufficient for gaussian.
int sectionCount = 3; // The number of elements in the kernel. May be 1, 3, 5, 7, ... (n*2+1)
// calculate the x positions for each kernel value to calculate.
double[] positions = CreateSeries (-limit, positionStepSize, (int)(limit * 2 * positionCount + 1));
// calculate the gaussian function value for each position
double[] values = positions.Select (pos => Gaussian (pos)).ToArray ();
// split the values into equal-sized sections and calculate the integral of each section.
double[] areas = IntegrateInSections (values, positionStepSize, sectionCount);
double areaSum = areas.Sum ();
// normalize to 1
double[] areas1 = areas.Select (a => a / areaSum).ToArray ();
double area1Sum = areas1.Sum (); // just to check it's 1 now
}
///-------------------------------------------------------------------
/// <summary>
/// Create a series of <paramref name="i_count"/> numbers, starting at <paramref name="i_start"/> and increasing by <paramref name="i_stepSize"/>.
/// </summary>
/// <param name="i_start">The start value of the series.</param>
/// <param name="i_stepSize">The step size between values in the series.</param>
/// <param name="i_count">The number of elements in the series.</param>
///-------------------------------------------------------------------
public static double[] CreateSeries (double i_start,
double i_stepSize,
int i_count)
{ ... }
private static readonly double s_gaussian_Divisor = Math.Sqrt (Math.PI * 2.0);
/// ------------------------------------------------------------------
/// <summary>
/// Calculate the value for the given position in a Gaussian kernel.
/// </summary>
/// <param name="i_position"> The position in the kernel for which the value will be calculated. </param>
/// <param name="i_bandwidth"> The width factor of the kernel. </param>
/// <returns> The value for the given position in a Gaussian kernel. </returns>
/// ------------------------------------------------------------------
public static double Gaussian (double i_position,
double i_bandwidth = 1)
{
double position = i_position / i_bandwidth;
return Math.Pow (Math.E, -0.5 * position * position) / s_gaussian_Divisor / i_bandwidth;
}
/// ------------------------------------------------------------------
/// <summary>
/// Calculate the integrals in the given number of sections of all given values with the given distance between the values.
/// </summary>
/// <param name="i_values"> The values for which the integral will be calculated. </param>
/// <param name="i_distance"> The distance between the values. </param>
/// <param name="i_sectionCount"> The number of sections in the integration. </param>
/// ------------------------------------------------------------------
public static double[] IntegrateInSections (IReadOnlyCollection<double> i_values,
double i_distance,
int i_sectionCount)
{ ... }