I want to create a batch for clFFT to calculate 3 FFTs of length 256, where the FFT input windows overlap (FFT overlap processing).
Input: a 1D array of 276 complex numbers
Task: Calculate FFTs for [0..255], [10..265], [20..275]
Output: 3x 256 FFTs = 768 values.
If I were to write a loop, it would look like this:
std::complex<float> *input;
for (int i = 0; i < 3; ++i) {
    calcFFT(input, input + 256);
    input += 10;
}
In other words: the FFT transforms 256 input values, then advances by 10 values and transforms the next 256 values.
How do I set up a clFFT plan, so that this happens in one call?
clfftSetPlanInStride/clfftSetPlanOutStride specify the distance between the individual values, so those are the wrong parameters.
It looks as if clfftSetPlanDistance might be what I need. Doc says:
CLFFTAPI clfftStatus clfftSetPlanDistance( clfftPlanHandle plHandle, size_t iDist, size_t oDist );
Pitch is the distance between each discrete array object in an FFT array. This is only used for 'array' dimensions in clfftDim; see clfftSetPlanDimension (units are in terms of clfftPrecision)
which I find very confusing.
Yes, clfftSetPlanDistance is the right API to use. In the example I would have to use
clfftSetPlanDistance(plan, 10, 256);
to calculate FFTs with an input step of 10 and an output distance of 256.
This will generate OpenCL code where the global offset of the first FFT index is calculated like this:
// Inside the generated fft_fwd OpenCL function
iOffset = (batch/32)*10 + (batch%32)*8;
where batch is the batch number of the FFT to calculate.
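For reference, here is a minimal sketch of how the whole plan setup could look with these settings. It assumes the OpenCL context, queue and the input/output buffers (276 interleaved complex floats in, 3x256 complex values out) already exist, that clfftSetup() was called once at program start, and it omits error checking; enqueueOverlappingFFTs is an illustrative name.

#include <clFFT.h>

// Batch of 3 overlapping 256-point FFTs with an input step of 10.
// ctx, queue, inBuf and outBuf are assumed to be created by the caller;
// clfftSetup() is assumed to have been called once already.
void enqueueOverlappingFFTs(cl_context ctx, cl_command_queue queue,
                            cl_mem inBuf, cl_mem outBuf)
{
    clfftPlanHandle plan;
    size_t length = 256;

    clfftCreateDefaultPlan(&plan, ctx, CLFFT_1D, &length);
    clfftSetPlanPrecision(plan, CLFFT_SINGLE);
    clfftSetLayout(plan, CLFFT_COMPLEX_INTERLEAVED, CLFFT_COMPLEX_INTERLEAVED);
    clfftSetResultLocation(plan, CLFFT_OUTOFPLACE); // overlapping inputs, so do not transform in place
    clfftSetPlanBatchSize(plan, 3);                 // three transforms in one call
    clfftSetPlanDistance(plan, 10, 256);            // batch-to-batch distance: 10 in, 256 out

    clfftBakePlan(plan, 1, &queue, NULL, NULL);
    clfftEnqueueTransform(plan, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL,
                          &inBuf, &outBuf, NULL);
    clFinish(queue);
    clfftDestroyPlan(&plan);
}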
How can I assign a frequency scale to my array from KissFFT? The sampling frequency is 44100 Hz and I need to map it onto my array realPartFFT. I have no idea how this works. I need to plot my spectrum chart to check whether it is computed correctly. When I plot it now, the x axis still only shows the 513 bin indices, not the corresponding frequencies.
int windowCount = 1024;
float floatArray[windowCount], realPartFFT[(windowCount / 2) + 1];
kiss_fftr_cfg cfg = kiss_fftr_alloc(windowCount, 0, NULL, NULL);
kiss_fft_cpx cpx[(windowCount / 2) + 1];
kiss_fftr(cfg, floatArray, cpx);
for (int i = 0; i < (windowCount / 2) + 1; ++i)
realPartFFT[i] = sqrtf(powf(cpx[i].r, 2.0) + powf(cpx[i].i, 2.0));
First of all: KissFFT doesn't know anything about the source of the data. You pass it an array of real numbers of a given size N, and you get in return an array of complex values of size N/2+1. The input array may be the weather forecast for the next N hours or the number of sunspots over the past N days. KissFFT doesn't care.
The mapping back to the real world needs to be done by you, so you have to interpret the data. As for your code snippet, you are passing in 1024 floats (I assume floatArray contains the input data). You then get back an array of 513 (= 1024/2 + 1) complex values, i.e. pairs of floats.
If you are sampling at 44.1 kHz and pass KissFFT chunks of 1024 samples (your window size), the highest frequency you get is 22.05 kHz and the frequency resolution, i.e. the lowest non-zero bin, is about 43 Hz (44,100 / 1024). You can go lower by passing bigger chunks to KissFFT, but keep in mind that processing time grows as well (roughly as N log N for an FFT)!
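A minimal sketch of that bin-to-frequency mapping, assuming the 44.1 kHz sample rate and the 1024-sample window from the question (the printf is only there for illustration; you would normally plot realPartFFT[i] against binFrequencyHz instead of against i):

#include <cstdio>

int main()
{
    const int   windowCount = 1024;
    const float sampleRate  = 44100.0f;
    const int   numBins     = windowCount / 2 + 1;   // what kiss_fftr returns

    for (int i = 0; i < numBins; ++i)
    {
        // bin i corresponds to this frequency, from 0 Hz up to 22050 Hz
        float binFrequencyHz = i * sampleRate / windowCount;
        std::printf("bin %4d -> %8.1f Hz\n", i, binFrequencyHz);
    }
    return 0;
}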
Btw: you may consider making your windowCount variable const, to allow the compiler to do some optimizations. Optimizations are very valuable when doing number crunching. In this case the effect may be negligible, but it's a good starting point.
I have a stack of images in which I want to calculate the mean of each pixel down the stack.
For example, let (x_n,y_n) be the (x,y) pixel in the nth image. Thus, the mean of pixel (x,y) for three images in the image stack is:
mean-of-(x,y) = (1/3) * ((x_1,y_1) + (x_2,y_2) + (x_3,y_3))
My first thought was to load all pixel intensities from each image into a data structure with a single linear buffer like so:
|All pixels from image 1| All pixels from image 2| All pixels from image 3|
To find the sum of a pixel down the image stack, I perform a series of nested for loops like so:
for (int col = 0; col < img_cols; col++)
{
    for (int row = 0; row < img_rows; row++)
    {
        for (int img = 0; img < num_of_images; img++)
        {
            sum_of_px += px_buffer[(img * img_rows * img_cols) + col * img_rows + row];
        }
    }
}
Basically, img*img_rows*img_cols gives the buffer index of the first pixel in the nth image, and col*img_rows+row gives the offset of the (x,y) pixel I want within each image in the stack.
Is there a data structure or algorithm that will help me sum up pixel intensities down an image stack that is faster and more organized than my current implementation?
I am aiming for portability, so I will not be using OpenCV; I am using C++ on Linux.
The problem with the nested loop in the question is that it's not very cache friendly. You go skipping through memory with a long stride, effectively rendering your data cache useless. You're going to spend a lot of time just accessing the memory.
If you can spare the memory, you can create an extra image-sized buffer to accumulate totals for each pixel as you walk through all the pixels in all the images in memory order. Then you do a single pass through the buffer for the division.
Your accumulation buffer may need to use a larger type than you use for individual pixel values, since it has to accumulate many of them. If your pixel values are, say, 8-bit integers, then your accumulation buffer might need 32-bit integers or floats.
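A minimal sketch of that idea, assuming 8-bit pixels stored image after image in a single linear buffer as in the question (the function and parameter names are illustrative):

#include <cstdint>
#include <vector>

// Walk the buffer in memory order, image after image, so the data cache
// sees a purely sequential access pattern; accumulate into a wider type.
std::vector<float> mean_stack(const uint8_t* px_buffer,
                              int img_rows, int img_cols, int num_of_images)
{
    const size_t px_per_img = static_cast<size_t>(img_rows) * img_cols;
    std::vector<uint32_t> acc(px_per_img, 0);

    for (int img = 0; img < num_of_images; ++img)
    {
        const uint8_t* src = px_buffer + img * px_per_img;
        for (size_t i = 0; i < px_per_img; ++i)
            acc[i] += src[i];
    }

    // single pass for the division
    std::vector<float> mean(px_per_img);
    for (size_t i = 0; i < px_per_img; ++i)
        mean[i] = static_cast<float>(acc[i]) / num_of_images;
    return mean;
}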
Usually, a stack of pixels
(x_1,y_1),...,(x_n,y_n)
is conditionally independent from a stack
(a_1,b_1),...,(a_n,b_n)
And even if they weren't (for a particular dataset), modeling their interactions would be a complex task and would give you only an estimate of the mean. So, if you want to compute the exact mean for each stack, you have no choice but to iterate through the three loops that you supply. Languages such as MATLAB/Octave and libraries such as Theano (Python) or Torch7 (Lua) all parallelize these iterations. If you are using C++, what you are doing is well suited to CUDA or OpenMP. As for portability, I think OpenMP is the easier solution.
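A minimal OpenMP sketch of that suggestion, assuming the linear buffer layout from the question (8-bit pixels, images stored back to back); each output pixel is independent, so the loop needs no locks or reductions:

#include <cstdint>
#include <vector>

// Compile with e.g. g++ -fopenmp. Each (row, col) mean is independent,
// so the pixel loop parallelizes trivially.
std::vector<float> mean_stack_omp(const uint8_t* px_buffer,
                                  int img_rows, int img_cols, int num_of_images)
{
    const long px_per_img = static_cast<long>(img_rows) * img_cols;
    std::vector<float> mean(px_per_img);

    #pragma omp parallel for
    for (long p = 0; p < px_per_img; ++p)
    {
        unsigned sum = 0;
        for (int img = 0; img < num_of_images; ++img)
            sum += px_buffer[img * px_per_img + p];
        mean[p] = static_cast<float>(sum) / num_of_images;
    }
    return mean;
}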
A portable, fast data structure specifically for the average calculation could be:
std::vector<std::vector<std::vector<sometype> > > VoVoV;
VoVoV.resize(img_cols);
int i, j;
for (i = 0; i < img_cols; ++i)
{
    VoVoV[i].resize(img_rows);
    for (j = 0; j < img_rows; ++j)
    {
        VoVoV[i][j].resize(num_of_images);
        // The values of all images at this pixel are stored contiguously,
        // therefore they should be fast to access.
    }
}
VoVoV[col][row][img] = foo;
As a side note, 1/3 in your example will evaluate to 0, which is not what you want.
For fast summation/averaging you can now do:
sometype sum = 0;
std::vector<sometype>::iterator it = VoVoV[col][row].begin();
std::vector<sometype>::iterator it_end = VoVoV[col][row].end();
for ( ; it != it_end ; ++it)
sum += *it;
sometype avg = sum / num_of_images; // or similar for integers; check for num_of_images==0
Basically, you should not rely on the compiler to optimize away the repeated calculation of the same offsets.
I'm trying to convert the following code from MATLAB to C++:
function data = process(data)
data = medfilt2(data, [7 7], 'symmetric');
mask = fspecial('gaussian', [35 35], 12);
data = imfilter(data, mask, 'replicate', 'same');
maximum = max(data(:));
data = 1 ./ (data/maximum);
data(data > 10) = 16;
end
My problem is with medfilt2, which is a 2D median filter. I need it to support images with 10 bits per pixel and more.
I have looked into OpenCV; it has a 5x5 median filter that supports 16 bits, but the 7x7 filter only supports bytes:
medianBlur
I have also looked into Intel IPP, but I can see only a 1D median filter.
https://software.intel.com/en-us/node/502283
Is there a fast implementation of such a 2D median filter?
I am looking for something like:
Fast Median Search: An ANSI C Implementation using parallel programming and vectorized (AVX/SSE) operations...
Two Dimensional Digital Signal Processing II. Transforms and median filters.
Edited by T.S.Huang. Springer-Verlag. 1981.
There are more code examples in Fast median filtering with implementations in C/C++/C#/VB.NET/Delphi.
I also found Median Filtering in Constant Time.
Motivated by the fact that OpenCV does not implement a 16-bit median filter for large kernel sizes (larger than 5), I tried three different strategies.
All of them are based on Huang's [2] sliding window algorithm. That is, the histogram is updated by removing and inserting pixel entries as the window slides from left to right. This is quite straightforward for 8-bit images and is already implemented in OpenCV. However, a large 65536-bin histogram makes the computation a bit difficult.
...The algorithm still remains O(log r), but storage considerations render it impractical for 16-bit images and impossible for floating-point images. [3]
I used the C++ standard library where applicable, and did not implement Weiss' additional optimization strategies.
1) A naive sorting implementation. I think this is the best starting point for arbitrary pixel types (particularly floats).
// copy pixels in the sliding window to a temporary vec and
// compute the median value (size is always odd);
// v is a preallocated std::vector< _Type > of the same size as window
memcpy( &v[0], &window[0], window.size() * sizeof(_Type) );
std::vector< _Type >::iterator it = v.begin() + v.size()/2;
std::nth_element( v.begin(), it, v.end() );
return *it;
2) A sparse histogram. We wouldn't want to step over 65536 bins to find the median of each pixel, so how about storing a sparse histogram instead? Again, this is suitable for all pixel types, but it doesn't make sense if all pixels in the window are different (e.g. floats).
typedef std::map< _Type, int > Map;
//...
// inside the sliding window, update the histogram as follows
for ( /* pixels to remove */ )
{
    // _Type px
    Map::iterator it = map.find( px );
    if ( it->second > 1 )
        it->second -= 1;
    else
        map.erase( it );
}
// ...
for ( /* pixels to add */ )
{
    // _Type px
    Map::iterator lower = map.lower_bound( px );
    if ( lower != map.end() && lower->first == px )
        lower->second += 1;
    else
        map.insert( lower, std::pair<_Type,int>( px, 1 ) );
}
//... and compute the median by integrating from one end
// until the appropriate sum is reached
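A sketch of that last step (hypothetical helper; medianPos would be windowArea/2 for an odd-sized window, e.g. 24 for 7x7): walk the sorted map from the low end and accumulate counts until the middle position is passed.

#include <map>

template <typename _Type>
_Type median_from_sparse_histogram( const std::map<_Type, int>& hist, int medianPos )
{
    int count = 0;
    for ( typename std::map<_Type, int>::const_iterator it = hist.begin();
          it != hist.end(); ++it )
    {
        count += it->second;          // accumulate bin counts in sorted order
        if ( count > medianPos )
            return it->first;         // this bin contains the middle element
    }
    return _Type();                   // not reached for a correctly filled window
}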
3) A dense histogram. This is the plain dense histogram, but instead of a single flat 65536-entry array, we make searching a little easier by dividing it into levels of sub-bins, e.g.:
[0...65535] <- px
[0...4095] <- px / 16
[0...255] <- px / 256
[0...15] <- px / 4096
This makes insertion a bit slower (by a constant amount), but searching a lot faster. I found 16 to be a good divisor.
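A sketch of two of those tiers (a fine 65536-bin histogram plus a coarse 4096-bin one at px/16, the divisor mentioned above; the remaining levels would follow the same pattern): insertion and removal update both levels, and the search first locates the coarse bin, then scans at most 16 fine bins inside it.

#include <cstdint>
#include <vector>

struct TieredHistogram
{
    std::vector<int> fine   = std::vector<int>(65536, 0);
    std::vector<int> coarse = std::vector<int>(4096, 0);    // px / 16

    void add(uint16_t px)    { ++fine[px]; ++coarse[px >> 4]; }
    void remove(uint16_t px) { --fine[px]; --coarse[px >> 4]; }

    // medianPos is windowArea/2 for an odd-sized window, e.g. 24 for 7x7;
    // assumes the histogram currently holds more than medianPos entries.
    uint16_t median(int medianPos) const
    {
        int count = 0, c = 0;
        while (count + coarse[c] <= medianPos)   // coarse search over 4096 bins
            count += coarse[c++];
        int f = c << 4;
        while (count + fine[f] <= medianPos)     // fine search over at most 16 bins
            count += fine[f++];
        return static_cast<uint16_t>(f);
    }
};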
[Figure: timing comparison] I tested methods (1) red, (2) blue and (3) black against each other and against 8-bpp OpenCV (green). For all but OpenCV, the input image is 16-bpp grayscale. The dotted lines are for data truncated to the dynamic range [0, 255] and the smooth lines for data truncated to [0, 8020] (via multiplication by 16 and smoothing, to add more variance to the pixel values).
Interestingly, the sparse histogram diverges as the variance of the pixel values increases. nth_element is always a safe bet, OpenCV is the fastest (if 8 bpp is OK), and the dense histogram trails behind.
I used Windows 7, an 8 x 3.4 GHz machine and Visual Studio 2010. My implementations ran multithreaded; the OpenCV implementation is single-threaded. Input image size: 2136x3201 (http://i.imgur.com/gg9Z2aB.jpg, from Vogue).
[2]: Huang, T.: "Two-Dimensional Signal Processing II: Transforms and Median Filters", 1981.
[3]: Weiss, B.: "Fast Median and Bilateral Filtering", 2006.
I just implemented, in DIPlib, an efficient algorithm for computing the median filter (and the more generic percentile filter). This algorithm works for integer images of any bit depth as well as floating-point images, works for images of any number of dimensions, and works for kernels of any shape.
The algorithm is similar to the binary search tree implementation suggested by #mainactual in their answer to this question (as method #2), but uses a more appropriate order statistic tree. #mainactual's implementation needs O(n) to find the median in the search tree, for a tree with n nodes, because it iterates through half the nodes in the tree. This is only efficient if there are many fewer nodes than pixels in the kernel, which is typically only true for integer images with a small bit depth. In contrast, the order statistic tree can find the median value in O(log n), by storing an additional value in each node: the size of the subtree rooted at that node. The filter has a cost of O(k log k) for a compact 2D kernel with a height of k pixels (independent of the width).
I wrote down a more detailed description of the algorithm in my blog.
The C++ code is available on GitHub.
Here is a timing comparison for square kernels, comparing:
the new implementation in DIPlib (blue),
the naive implementation in scikit-image (which computes the median for each pixel's neighborhood independently, method #1 in #mainactual's answer, with a quadratic cost) (green), and
the O(1) implementation in OpenCV that only works for 8-bit images and square kernels (red).
"SFLOAT" stands for single-precision floating-point, "UINT8" stands for 8-bit unsigned integer, and "0-10" is also 8-bit unsigned integer, but containing only pixel values between 0 and 10 (this one tests what happens when there are many repeated values in each neighborhood).
The new implementation in DIPlib kicks in at k = 13; the lower part of the graph is the naive, quadratic-cost algorithm.
I found this online. It is the same algorithm that OpenCV uses, but extended to 16 bits and optimized with SSE.
medianFilter.c
I happened to find (my) solution online as open source (image-quality-and-characterization-utilities, in include/Teisko/Image/Algorithm.hpp).
The algorithm finds the Kth element of any set of size M<=64 in N steps, where N is the number of bits in the elements.
This is a radix-2 sort algorithm, which needs the original bit pattern int16_t data[7][7]; to be transposed into N planes of uint64_t bits[N] (10 for 10-bit images), with the MSB first.
// Runs N iterations for pixel data as bit planes in `bits`
// to recover the K(th) largest item (as set by initial threshold)
// The parameter `mask` must be initialized to contain a set bit for all those bits
// of interest in the corresponding pixel data bits[0..N-1]
template <int N> inline
uint64_t median8x8_iteration(uint64_t(&bits)[N], uint64_t mask, uint64_t threshold)
{
    uint64_t result = 0;
    int i = 0;
    do
    {
        uint64_t ones = mask & bits[i];
        uint64_t ones_size = popcount(ones);
        uint64_t mask_size = popcount(mask);
        auto zero_size = mask_size - ones_size;
        int new_bit = 0;
        if (zero_size < threshold)
        {
            // the Kth element lies among the pixels whose current bit is 1
            new_bit = 1;
            threshold -= zero_size;
            mask = 0;            // combined with the xor below, mask becomes 'ones'
        }
        result = result * 2 + new_bit;
        mask ^= ones;            // keeps the zeros if new_bit == 0, or just the ones if mask was cleared
    } while (++i < N);
    return result;
}
Use threshold = 25 to get the median of 49 elements, and mask = 0xfefefefefefefe00ull when the planes bits[] contain the support of 8x8 adjacent pixels (that mask selects a 7x7 subset).
By toggling the MSB plane, the same inner loop can be used for signed integers; by conditionally toggling the MSB and the other planes, the algorithm can be used for floating-point values as well.
Well after 2016, Ice Lake with AVX-512 introduced _mm256_mask_popcnt_epi64 even on consumer machines, allowing the inner loop to be almost trivially vectorized for all four submatrices in the common 8x8 support; the masks would be 0xfefefefefefefe00ull >> {0, 1, 8, 9}.
The idea here is that the mask marks the set of pixels under inspection. By counting the number of ones (or zeros) in that set and comparing it to the threshold, we can determine at each step whether the Kth element belongs to the set of ones or the set of zeros, which also produces one correct output bit.
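A sketch of the transposition step this assumes, packing an 8x8 tile of up-to-16-bit pixels into N bit planes with the MSB first (pack_bit_planes is an illustrative name; the mask above then selects the 7x7 subset of interest):

#include <cstdint>

// Bit (row*8 + col) of bits[n] holds bit (N-1-n) of data[row][col], MSB first.
template <int N>
void pack_bit_planes(const int16_t (&data)[8][8], uint64_t (&bits)[N])
{
    for (int n = 0; n < N; ++n)
    {
        uint64_t plane = 0;
        for (int row = 0; row < 8; ++row)
            for (int col = 0; col < 8; ++col)
                plane |= static_cast<uint64_t>((data[row][col] >> (N - 1 - n)) & 1)
                         << (row * 8 + col);
        bits[n] = plane;
    }
}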
EDIT
Another method I tried was an SSE2 version, where one keeps a window of size Width*(7 + 1) sorted by rows:
original sorted
1 2 1 3 .... 1 1 0
2 1 3 4 .... 2 2 1
5 2 0 1 .... -> 5 2 3
. . . . .... . . .
Sorting 7 rows is efficiently done by a sorting network using 16 primitive sorting operations (32 instructions with 3-parameter VEX encoding + 14 instructions for memory access).
One can also incrementally remove the element input[row-1][column] from a presorted SSE2 register and add the element input[row+7][column] to the register (which takes about 12 instructions per sorted column).
Having 7 sorted columns in 7 SSE2 registers, one can now implement bitonic merge sort of three different widths,
which at first column will sort in groups of
(r0),(r1,r2), ((r3,r4), (r5,r6))
<merge> <merge> <merge> // level #0
<--merge---> <---- merge* ----> // level #1
<----- merge + take middle ----> // partial level #2
At column 1, one needs to sort columns
(r1,r2), ((r3,r4),(r5,r6)), (r7)
******* **** <-- merge (r1,r2,r7)
<----- merge + take middle ----> <-- partial level #2
At column 2, with
r2, ((r3,r4),(r5,r6)), (r7,r8)
<merge> // level #0
** ******* // level #1
<----- merge + take middle ----> // level #2
This takes advantage of memorizing partially sorted substrings (and does so better than e.g. a heap-based priority queue). The final merging of the 3x7- and 4x7-element substrings does not need to compute every element correctly, since we are only interested in element #24.
Still, while this was better than my implementations of heap-based priority queues and several hierarchical/flat histogram-based methods, the overall winner was the popcount method (with a 64-bit popcount instruction).
I'm writing some code in OpenCV and want to find the median value of a very large matrix (single channel grayscale, float).
I tried several methods, such as sorting the array (using std::sort) and picking the middle entry, but it is extremely slow compared with the median function in MATLAB. To be precise: what takes 0.25 seconds in MATLAB takes over 19 seconds in OpenCV.
My input image is originally a 12-bit greyscale image with the dimensions 3840x2748 (~10.5 megapixels), converted to float (CV_32FC1) where all the values are now mapped to the range [0,1] and at some point in the code I request the median value by calling:
double myMedianValue = medianMat(Input);
Where the function medianMat is:
double medianMat(cv::Mat Input){
    Input = Input.reshape(0, 1);   // spread Input Mat to a single row
    std::vector<double> vecFromMat;
    Input.copyTo(vecFromMat);      // copy Input Mat to vector vecFromMat
    std::sort( vecFromMat.begin(), vecFromMat.end() );   // sort vecFromMat
    if (vecFromMat.size() % 2 == 0)   // even number of elements
        return (vecFromMat[vecFromMat.size()/2 - 1] + vecFromMat[vecFromMat.size()/2]) / 2;
    return vecFromMat[(vecFromMat.size() - 1)/2];         // odd number of elements
}
I timed the function medianMat by itself and also its various parts - as expected, the bottleneck is in:
std::sort( vecFromMat.begin(), vecFromMat.end() ); // sort vecFromMat
Does anyone here have an efficient solution?
Thanks!
EDIT
I have tried using std::nth_element as given in Adi Shavit's answer.
The function medianMat now reads as:
double medianMat(cv::Mat Input){
    Input = Input.reshape(0, 1);   // spread Input Mat to a single row
    std::vector<double> vecFromMat;
    Input.copyTo(vecFromMat);      // copy Input Mat to vector vecFromMat
    std::nth_element(vecFromMat.begin(), vecFromMat.begin() + vecFromMat.size() / 2, vecFromMat.end());
    return vecFromMat[vecFromMat.size() / 2];
}
The runtime has dropped from over 19 seconds to 3.5 seconds. This is still nowhere near the 0.25 seconds in MATLAB using the median function...
Sorting and taking the middle element is not the most efficient way to find a median. It requires O(n log n) operations.
With C++ you should use std::nth_element() and take the middle iterator. This is an O(n) operation:
nth_element is a partial sorting algorithm that rearranges elements in [first, last) such that:
The element pointed at by nth is changed to whatever element would occur in that position if [first, last) was sorted.
All of the elements before this new nth element are less than or equal to the elements after the new nth element.
Also, your original data is 12-bit integers. Your implementation does a few things that make the comparison to MATLAB problematic:
You converted to floating point (CV_32FC1, or double, or both); this is costly and takes time.
The code makes an extra copy into a vector<double>.
Operations on floats, and especially on doubles, cost more than operations on integers.
Assuming your image is continuous in memory, as is the default for OpenCV, you should use CV_16UC1 and work directly on the data array after reshape() (a sketch follows).
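A minimal sketch of that last point, assuming the data fits in a continuous CV_16UC1 Mat; it copies the raw 16-bit values once and uses std::nth_element, with no conversion to double (medianMat16 is an illustrative name):

#include <opencv2/core.hpp>
#include <algorithm>
#include <cstdint>
#include <vector>

// Median of a continuous CV_16UC1 Mat, working on the raw pixel values.
double medianMat16(const cv::Mat& input)
{
    std::vector<uint16_t> v(input.begin<uint16_t>(), input.end<uint16_t>());
    std::nth_element(v.begin(), v.begin() + v.size() / 2, v.end());
    return v[v.size() / 2];
}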
Another option which should be very fast is to simply build a histogram of the image - this is a single pass on the image. Then, working on the histogram, find the bin that corresponds to half the pixels on each side - this is at most a single pass over the bins.
The OpenCV docs have several tutorials on how to build histograms. Once you have the histogram, accumulate the bin values until you get past 3840x2748/2. That bin is your median.
OK.
I actually tried this before posting the question, and due to some silly mistakes I had disqualified it as a solution... anyway, here it is:
I basically create a histogram of the values of my original input with 2^12 = 4096 bins, compute the CDF, normalize it so it is mapped from 0 to 1, and find the smallest index in the CDF that is equal to or larger than 0.5. I then divide this index by 4096 (= 2^12) to obtain the requested median value. It now runs in 0.11 seconds (and that's in debug mode without heavy optimizations), which is less than half the time required in MATLAB.
Here's the function (nVals = 4096 in my case, corresponding to 12 bits of values):
double medianMat(cv::Mat Input, int nVals){
    // COMPUTE HISTOGRAM OF SINGLE CHANNEL MATRIX
    float range[] = { 0, nVals };
    const float* histRange = { range };
    bool uniform = true; bool accumulate = false;
    cv::Mat hist;
    calcHist(&Input, 1, 0, cv::Mat(), hist, 1, &nVals, &histRange, uniform, accumulate);

    // COMPUTE CUMULATIVE DISTRIBUTION FUNCTION (CDF)
    cv::Mat cdf;
    hist.copyTo(cdf);
    for (int i = 1; i <= nVals - 1; i++){
        cdf.at<float>(i) += cdf.at<float>(i - 1);
    }
    cdf /= Input.total();

    // COMPUTE MEDIAN
    double medianVal;
    for (int i = 0; i <= nVals - 1; i++){
        if (cdf.at<float>(i) >= 0.5) { medianVal = i; break; }
    }
    return medianVal / nVals;
}
It's probably faster to find it from the original data.
Since the original data has 12-bit values, there are only
4096 different possible values. That's a nice and small table!
Go through all the data in one pass, and count how many of each value you have. That is an O(n) operation. Then it's easy to find the median: just count size/2 items from either end of the table.
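A minimal sketch of that counting approach, assuming the original 12-bit samples are available as a contiguous array of 16-bit values in the range 0..4095 (median12bit is an illustrative name):

#include <cstddef>
#include <cstdint>

// One pass to count the 4096 possible values, then walk the table
// until half of the samples have been passed.
int median12bit(const uint16_t* data, size_t n)
{
    unsigned counts[4096] = { 0 };
    for (size_t i = 0; i < n; ++i)
        ++counts[data[i]];

    const size_t half = n / 2;
    size_t seen = 0;
    for (int v = 0; v < 4096; ++v)
    {
        seen += counts[v];
        if (seen > half)
            return v;            // this value contains the middle sample
    }
    return 0;                    // only reached if n == 0
}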
The basic problem is this:
I have a cv::Mat of type CV_8UC1, which is mostly filled in with integers (well, chars, actually, but whatever) between 1 and 100 inclusive. The remaining elements are zeros.
In this case, 0 basically means "unknown". I want to fill in the unknown elements with, essentially, the average of its nearest neighbors... i.e. if this matrix were representing a 3d surface with a bunch of holes in it, I want to smoothly fill in the holes.
Keeping in mind, of course, that it's possible there are some rather big holes.
Efficiency isn't super important, as this operation is only going to be happening once, and the matrix in question isn't bigger than around 1000x1000.
Here's the code I need to finish:
for (int x = 0; x < heightMatrix.cols; x++) {
    for (int y = 0; y < heightMatrix.rows; y++) {
        if (heightMatrix.at<char>(x, y) == 0) {
            // ???
        }
    }
}
Thanks!!
How about this instead:
put your data in an image and use morphological closing with a large kernel (or with a lot of iterations):
http://opencv.willowgarage.com/documentation/image_filtering.html#morphologyex
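A minimal sketch of that suggestion using the current OpenCV C++ API (the kernel size and iteration count are illustrative and would need tuning to the hole sizes; fillHolesByClosing is an illustrative name):

#include <opencv2/imgproc.hpp>

// Closing = dilation followed by erosion: the known (non-zero) regions grow
// into the holes and are then shrunk back, leaving small holes filled.
void fillHolesByClosing(const cv::Mat& heightMatrix, cv::Mat& closed)
{
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(15, 15));
    cv::morphologyEx(heightMatrix, closed, cv::MORPH_CLOSE, kernel,
                     cv::Point(-1, -1), /*iterations=*/2);
}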
What about this?
int sum = 0;
... paste the following part inside the loop ...
sum += heightMatrix.at<char>(x - 1,y);
sum += heightMatrix.at<char>(x + 1,y);
sum += heightMatrix.at<char>(x,y - 1);
sum += heightMatrix.at<char>(x,y + 1);
heightMatrix.at<char>(x,y) = sum / 4;
Since you deal with a CV_8UC1 Mat you have in practice a 2d array and each pixel has just 4 nearest neighbors.
There are some caveats however:
1) put your averaged pixel in a Mat of floats to avoid round off!
2) filling the whole Mat with this average may not be what you are looking for if the non-zero pixels are quite sparse: when there are a lot of empty pixels and really few non-zero pixels, the further you move away from a non-zero pixel, the more the average converges to 0. And this may happen in as few as 3-4 iterations (another good reason not to store the values in a Mat of integers).