How to set the frequency band of my array after fft - c++

How can I set the frequency band for my array from KissFFT? The sampling frequency is 44100 Hz, and I need to map it onto my array realPartFFT. I have no idea how it works. I need to plot my spectrum chart to see whether it computes correctly. When I plot it now, the x axis still only shows 513 bin numbers, without the corresponding frequencies.
int windowCount = 1024;
float floatArray[windowCount], realPartFFT[(windowCount / 2) + 1];
kiss_fftr_cfg cfg = kiss_fftr_alloc(windowCount, 0, NULL, NULL);
kiss_fft_cpx cpx[(windowCount / 2) + 1];
kiss_fftr(cfg, floatArray, cpx);
for (int i = 0; i < (windowCount / 2) + 1; ++i)
    realPartFFT[i] = sqrtf(powf(cpx[i].r, 2.0f) + powf(cpx[i].i, 2.0f));

First of all: KissFFT doesn't know anything about the source of the data. You pass it an array of real numbers of a given size N, and you get back an array of complex values of size N/2+1. The input array may be the weather forecast for the next N hours or the number of sunspots over the past N days. KissFFT doesn't care.
The mapping back to the real world needs to be done by you, so you have to interpret the data. As for your code snippet, you are passing in 1024 floats (I assume floatArray contains the input data). You then get back an array of 513 (= 1024/2 + 1) pairs of floats, i.e. complex values.
If you are sampling at 44.1 kHz and pass KissFFT chunks of 1024 samples (your window size), the highest frequency you get is 22.05 kHz (the Nyquist frequency), and the lowest non-zero bin corresponds to about 43 Hz (44,100 / 1024). You can get even lower by passing bigger chunks to KissFFT, but keep in mind that processing time will grow (as O(N log N) for an FFT of size N)!
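A minimal sketch of the mapping (my addition): output bin i corresponds to the frequency i * sampleRate / N, so you can build the x axis for your plot like this:

const int windowCount = 1024;        // FFT size N
const float sampleRate = 44100.0f;   // sampling frequency in Hz

// frequency in Hz of output bin i, for i in [0, N/2]
float frequencies[(windowCount / 2) + 1];
for (int i = 0; i < (windowCount / 2) + 1; ++i)
    frequencies[i] = i * sampleRate / windowCount;
// frequencies[0] == 0 Hz (DC), frequencies[512] == 22050 Hz (Nyquist)

Plot realPartFFT[i] against frequencies[i] and the x axis will show Hz instead of bin numbers.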
Btw: you may consider making your windowCount variable const, to allow the compiler to do some optimizations. Optimizations are very valuable when doing number crunching. In this case the effect may be negligible, but it's a good starting point.

Related

clFFT: Calculating overlapped FFTs

I want to create a batch for clFFT to calculate 3 FFTs of length 256, where the FFT input values overlap (FFT overlap processing).
Input: a 1D array of 276 complex numbers
Task: Calculate FFTs for [0..255], [10..265], [20..275]
Output: 3x 256 FFTs = 768 values.
If I were to write a loop, it would look like this:
std::complex<float> *input;
for (int i = 0; i < 3; ++i) {
    calcFFT(input, input + 256);
    input += 10;
}
In other words: the FFT processes 256 input values, then advances 10 values and processes the next 256 values.
How do I set up a clFFT plan, so that this happens in one call?
clfftSetPlanIn/OutStride specifies the distance between the individual values, so that is the wrong parameter.
It looks as if clfftSetPlanDistance might be what I need. Doc says:
CLFFTAPI clfftStatus clfftSetPlanDistance( clfftPlanHandle plHandle, size_t iDist, size_t oDist );
Pitch is the distance between each discrete array object in an FFT array. This is only used for 'array' dimensions in clfftDim; see clfftSetPlanDimension (units are in terms of clfftPrecision)
which I find very confusing.
Yes, clfftSetPlanDistance is the right API to use. In the example I would have to use
clfftSetPlanDistance(plan, 10, 256);
to calculate FFTs with a step of 10.
This will generate OpenCL code where the global offset of the first FFT index is calculated like this:
// Inside the generated fft_fwd OpenCL function
iOffset = (batch/32)*10 + (batch%32)*8;
where batch is the batch number of the FFT to calculate.
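Putting it together, a minimal host-side sketch (my addition, untested; it assumes the standard clFFT plan API and an already created context, queue, and cl_mem buffers):

// 3 overlapping 256-point FFTs in one batched call; error checking omitted
size_t length = 256;
clfftPlanHandle plan;
clfftCreateDefaultPlan(&plan, context, CLFFT_1D, &length);
clfftSetPlanPrecision(plan, CLFFT_SINGLE);
clfftSetLayout(plan, CLFFT_COMPLEX_INTERLEAVED, CLFFT_COMPLEX_INTERLEAVED);
clfftSetPlanBatchSize(plan, 3);       // three transforms per call
clfftSetPlanDistance(plan, 10, 256);  // inputs start 10 values apart, outputs are packed
clfftBakePlan(plan, 1, &queue, NULL, NULL);
clfftEnqueueTransform(plan, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL,
                      &inputBuffer, &outputBuffer, NULL);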

Fast 7x7 2D median filter in C and C++

I'm trying to convert the following code from MATLAB to C++:
function data = process(data)
data = medfilt2(data, [7 7], 'symmetric');
mask = fspecial('gaussian', [35 35], 12);
data = imfilter(data, mask, 'replicate', 'same');
maximum = max(data(:));
data = 1 ./ (data/maximum);
data(data > 10) = 16;
end
My problem is with medfilt2, which is a 2D median filter. I need it to support images with 10 bits per pixel and more.
I have looked into OpenCV: its medianBlur has a 5x5 median filter that supports 16 bits, but at 7x7 it only supports 8-bit images.
I have also looked into Intel IPP, but I can see only a 1D median filter.
https://software.intel.com/en-us/node/502283
Is there a fast implementation for a 2D filter?
I am looking for something like:
Fast Median Search: An ANSI C Implementation using parallel programming and vectorized (AVX/SSE) operations...
Two Dimensional Digital Signal Processing II. Transforms and median filters.
Edited by T.S.Huang. Springer-Verlag. 1981.
There are more code examples in Fast median filtering with implementations in C/C++/C#/VB.NET/Delphi.
I also found Median Filtering in Constant Time.
Motivated by the fact that OpenCV does not implement 16-bit median filter for large kernel sizes (larger than 5), I tried three different strategies.
All of them are based on Huang's [2] sliding window algorithm. That is, the histogram is updated by removing and inserting pixel entries as the window slides from left to right. This is quite straightforward for 8-bit image and already implemented in OpenCV. However, a large 65536 bin histogram makes computation a bit difficult.
...The algorithm still remains O(log r), but storage considerations render it impractical for 16-bit images and impossible for floating-point images. [3]
I used the C++ standard library where applicable, and did not implement Weiss' additional optimization strategies.
1) A naive sorting implementation. I think this is the best starting point for an arbitrary pixel type (floats particularly).
// copy pixels in the sliding window to a temporary vector and
// compute the median value (the window size is always odd)
std::vector<_Type> v(window.size());
memcpy(&v[0], &window[0], window.size() * sizeof(_Type));
typename std::vector<_Type>::iterator it = v.begin() + v.size() / 2;
std::nth_element(v.begin(), it, v.end());
return *it;
2) A sparse histogram. We wouldn't want to step over 65536 bins to find the median of each pixel, so how about storing a sparse histogram instead? Again, this is suitable for all pixel types, but it doesn't make sense if all pixels in the window are different (e.g. floats).
typedef std::map< _Type, int > Map;
//...
// inside the sliding window, update the histogram as follows
for ( /* pixels to remove */ )
{
    // _Type px
    Map::iterator it = map.find( px );
    if ( it->second > 1 )
        it->second -= 1;
    else
        map.erase( it );
}
// ...
for ( /* pixels to add */ )
{
    // _Type px
    Map::iterator lower = map.lower_bound( px );
    if ( lower != map.end() && lower->first == px )
        lower->second += 1;
    else
        map.insert( lower, std::pair<_Type,int>( px, 1 ) );
}
// ... and compute the median by integrating from one end
// until the appropriate sum is reached ...
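To make that last step concrete, a minimal sketch (my addition) of the median extraction over the ordered map:

// rank of the median in the window, e.g. 25 for a 7x7 window
int rank = (int)window.size() / 2 + 1;
int sum = 0;
for ( Map::iterator it = map.begin(); it != map.end(); ++it )
{
    sum += it->second;      // accumulate counts from the lowest value up
    if ( sum >= rank )
        return it->first;   // this pixel value is the median
}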
3) A dense histogram. This is the dense histogram, but instead of a simple 65536-entry array, we make searching a little easier by dividing it into sub-bins, e.g.:
[0...65535] <- px
[0...4095] <- px / 16
[0...255] <- px / 256
[0...15] <- px / 4096
This makes insertion a bit slower (by a constant factor), but search a lot faster. I found 16 a good number.
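A hedged sketch of the idea (my addition; a two-level cut of the multi-level scheme above, for 16-bit pixels):

#include <cstdint>

// coarse[i] counts pixels with value >> 12 == i; fine is the full histogram
struct HierHist {
    int fine[65536] = { 0 };
    int coarse[16] = { 0 };
    void add(uint16_t px)    { ++fine[px]; ++coarse[px >> 12]; }
    void remove(uint16_t px) { --fine[px]; --coarse[px >> 12]; }
    // find the value of the given 1-based rank: scan the 16 coarse bins
    // first, then only the 4096 fine bins under the selected coarse bin
    uint16_t select(int rank) const {
        int bin = 0;
        while (rank > coarse[bin]) rank -= coarse[bin++];
        int v = bin << 12;
        while (rank > fine[v]) rank -= fine[v++];
        return (uint16_t)v;
    }
};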
Figure: methods (1) red, (2) blue and (3) black, tested against each other and against 8 bpp OpenCV (green). For all but OpenCV, the input image is 16-bpp grayscale. The dotted lines are for data truncated to the dynamic range [0, 255] and the smooth lines for [0, 8020] (via multiplication by 16 and smoothing, to add more variance to the pixel values).
What is interesting is the divergence of the sparse histogram as the variance of the pixel values increases. nth_element is always a safe bet, OpenCV is the fastest (if 8 bpp is OK), and the dense histogram trails behind.
I used Windows 7, 8 cores at 3.4 GHz, and Visual Studio 2010. My implementations ran multithreaded; the OpenCV implementation is single-threaded. The input image size was 2136x3201 (http://i.imgur.com/gg9Z2aB.jpg, from Vogue).
[2]: Huang, T.: "Two-Dimensional Signal Processing II: Transforms and Median Filters", 1981.
[3]: Weiss, B.: "Fast Median and Bilateral Filtering", 2006.
I just implemented, in DIPlib, an efficient algorithm for computing the median filter (and the more generic percentile filter). This algorithm works for integer images of any bit depth as well as floating-point images, works for images of any number of dimensions, and works for kernels of any shape.
The algorithm is similar to the binary search tree implementation suggested by @mainactual in their answer to this question (as method #2), but uses a more appropriate order statistic tree. @mainactual's implementation needs O(n) to find the median in the search tree, for a tree with n nodes, because it iterates through half the nodes in the tree. This is only efficient if there are many fewer nodes than pixels in the kernel, which is typically only true for integer images with a small bit depth. In contrast, the order statistic tree can find the median value in O(log n), by storing an additional value in each node: the size of the subtree rooted at that node. The filter has a cost of O(k log k) for a compact 2D kernel with a height of k pixels (independent of the width).
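To illustrate the difference, a bare-bones sketch (my addition, not the DIPlib code) of an order statistic tree node and the rank query it enables:

// Each node stores the size of its subtree, so the element with a given
// rank is found in O(log n) by walking down from the root once.
struct Node {
    float value;        // pixel value (one node per distinct value)
    int count;          // multiplicity of this value in the kernel
    int size;           // total count of the subtree rooted here
    Node *left, *right;
};

// Return the value with the given 1-based rank.
float select(const Node* n, int rank) {
    int leftSize = n->left ? n->left->size : 0;
    if (rank <= leftSize)
        return select(n->left, rank);
    if (rank <= leftSize + n->count)
        return n->value;    // the rank falls on this node
    return select(n->right, rank - leftSize - n->count);
}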
I wrote down a more detailed description of the algorithm in my blog.
The C++ code is available on GitHub.
Here is a timing comparison for square kernels, comparing:
the new implementation in DIPlib (blue),
the naive implementation in scikit-image (which computes the median for each pixel's neighborhood independently, method #1 in @mainactual's answer, with a quadratic cost) (green), and
the O(1) implementation in OpenCV that only works for 8-bit images and square kernels (red).
"SFLOAT" stands for single-precision floating-point, "UINT8" stands for 8-bit unsigned integer, and "0-10" is also 8-bit unsigned integer, but containing only pixel values between 0 and 10 (this one tests what happens when there are many repeated values in each neighborhood).
The new implementation in DIPlib kicks in at k = 13; the lower part of the graph is the naive, quadratic-cost algorithm.
I found this online. It is the same algorithm that OpenCV has, but extended to 16 bits and optimized with SSE: medianFilter.c
I happened to find (my) solution online as open source (image-quality-and-characterization-utilities, in include/Teisko/Image/Algorithm.hpp).
The algorithm finds the Kth element of any set of size M <= 64 in N steps, where N is the number of bits in the elements.
This is a radix-2 selection, which needs the original bit pattern int16_t data[7][7] to be transposed into N planes of uint64_t bits[N] (N = 10 for 10-bit images), with the MSB first.
// Runs N iterations over the pixel data stored as bit planes in `bits`
// to recover the K(th) largest item (as set by the initial threshold).
// The parameter `mask` must be initialized to contain a set bit for all those
// bits of interest in the corresponding pixel data bits[0..N-1].
// popcount(x) counts the set bits of x, e.g. std::popcount from <bit> in C++20.
template <int N> inline
uint64_t median8x8_iteration(uint64_t (&bits)[N], uint64_t mask, uint64_t threshold)
{
    uint64_t result = 0;
    int i = 0;
    do
    {
        uint64_t ones = mask & bits[i];
        uint64_t ones_size = popcount(ones);
        uint64_t mask_size = popcount(mask);
        auto zero_size = mask_size - ones_size;
        int new_bit = 0;
        if (zero_size < threshold)
        {
            new_bit = 1;            // the Kth element has this bit set
            threshold -= zero_size; // rank within the `ones` subset
            mask = 0;               // so that mask ^= ones keeps just `ones`
        }
        result = result * 2 + new_bit;
        mask ^= ones;               // restrict the mask to the surviving subset
    } while (++i < N);
    return result;
}
Use threshold = 25 to get the median of 49 elements, and mask = 0xfefefefefefefe00ull in the case where the planes bits[] contain the 7x7 support embedded in 8x8 adjacent bits.
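For example, a hypothetical call (my sketch; the plane-filling code is omitted):

uint64_t planes[10];   // 10 bit planes of a 7x7 window in an 8x8 layout, MSB first
// ... transpose the 49 pixels' bits into planes[0..9] ...
uint64_t median = median8x8_iteration(planes, 0xfefefefefefefe00ull, 25);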
By toggling the MSB plane, one can use the same inner loop for signed integers; by conditionally toggling the MSB and the other planes, one can use the algorithm for floating-point values as well.
Well after 2016, Ice Lake with AVX-512 introduced _mm256_mask_popcnt_epi64 even on consumer machines, allowing the inner loop to be almost trivially vectorized for all four submatrices in the common 8x8 support; the masks would be 0xfefefefefefefe00ull >> {0, 1, 8, 9}.
The idea here is that the mask marks the set of pixels under inspection. Counting the number of ones (or zeros) in that set and comparing to a threshold, we can determine at each step if the Kth element belongs to the set of ones or zeros, producing also one correct output bit.
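A hedged sketch of that vectorization (my addition, untested; requires AVX512VL + AVX512VPOPCNTDQ; `bits_i` stands for the current plane bits[i]):

#include <immintrin.h>

// four shifted 7x7 supports inside the 8x8 layout, processed at once
__m256i masks = _mm256_set_epi64x((long long)(0xfefefefefefefe00ull >> 9),
                                  (long long)(0xfefefefefefefe00ull >> 8),
                                  (long long)(0xfefefefefefefe00ull >> 1),
                                  (long long)(0xfefefefefefefe00ull));
__m256i plane = _mm256_set1_epi64x((long long)bits_i); // current bit plane
__m256i ones = _mm256_and_si256(masks, plane);
__m256i ones_size = _mm256_popcnt_epi64(ones);         // four popcounts at once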
EDIT
Another method I tried was an SSE2 version where one keeps a window of size Width*(7 + 1) sorted by rows:
original sorted
1 2 1 3 .... 1 1 0
2 1 3 4 .... 2 2 1
5 2 0 1 .... -> 5 2 3
. . . . .... . . .
Sorting 7 rows is efficiently done by a sorting network using 16 primitive sorting operations (32 instructions with 3-parameter VEX encoding + 14 instructions for memory access).
One can also incrementally remove the element input[row-1][column] from a presorted SSE2 register and add the element input[row+7][column] to the register (which takes about 12 instructions per sorted column).
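Each primitive sorting operation is a compare-exchange of two registers holding corresponding elements of two rows; a minimal SSE2 sketch (my addition, using signed 16-bit min/max for simplicity):

#include <emmintrin.h>

// after this, a holds the element-wise minima and b the maxima
static inline void sort2(__m128i &a, __m128i &b) {
    __m128i lo = _mm_min_epi16(a, b);
    __m128i hi = _mm_max_epi16(a, b);
    a = lo;
    b = hi;
}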
Having 7 sorted columns in 7 SSE2 registers, one can now implement bitonic merge sort of three different widths,
which at first column will sort in groups of
(r0),(r1,r2), ((r3,r4), (r5,r6))
<merge> <merge> <merge> // level #0
<--merge---> <---- merge* ----> // level #1
<----- merge + take middle ----> // partial level #2
At column 1, one needs to sort columns
(r1,r2), ((r3,r4),(r5,r6)), (r7)
******* **** <-- merge (r1,r2,r7)
<----- merge + take middle ----> <-- partial level #2
At column 2, with
r2, ((r3,r4),(r5,r6)), (r7,r8)
<merge> // level #0
** ******* // level #1
<----- merge + take middle ----> // level #2
This takes advantage of memorizing partially sorted substrings (and does it better than e.g. a heap based priority queue). The final merging of the 3x7 and 4x7 element substrings does not need to compute every element correctly, since we are only interested in element #24.
Still, while being better than (my implementations of) heap based priority queues and several hierarchical / flat histogram based methods, the overall winner was the popcount method (with the 64-bit popcount instruction).

Viola Jones AdaBoost running out of memory before it even starts

I'm implementing the Viola Jones algorithm for face detection. I'm having issues with the first part of the AdaBoost learning part of the algorithm.
The original paper states
The weak classifier selection algorithm proceeds as follows. For each feature, the examples are sorted based on feature value.
I'm currently working with a relatively small training set of 2000 positive images and 1000 negative images. The paper describes having data sets as large as 10,000.
The main purpose of AdaBoost is to decrease the number of features in a 24x24 window, which totals 160,000+. The algorithm works on these features and selects the best ones.
The paper describes that for each feature, it calculates its value on each image, and then sorts them based on value. What this means is I need to make a container for each feature and store the values of all the samples.
My problem is that my program runs out of memory after evaluating only 10,000 of the features (only 6% of them). The overall size of all the containers will end up being 160,000 * 3000 = 480 million entries, which is billions of bytes once object overhead is counted. How am I supposed to implement this algorithm without running out of memory? I've increased the heap size, which got me from 3% to 6%, but I don't think increasing it much more will work.
The paper implies that these sorted values are needed throughout the algorithm, so I can't discard them after each feature.
Here's my code so far
public static List<WeakClassifier> train(List<Image> positiveSamples, List<Image> negativeSamples, List<Feature> allFeatures, int T) {
    List<WeakClassifier> solution = new LinkedList<WeakClassifier>();
    // Initialize weights for each sample, whether positive or negative
    float[] positiveWeights = new float[positiveSamples.size()];
    float[] negativeWeights = new float[negativeSamples.size()];
    float initialPositiveWeight = 0.5f / positiveWeights.length;
    float initialNegativeWeight = 0.5f / negativeWeights.length;
    for (int i = 0; i < positiveWeights.length; ++i) {
        positiveWeights[i] = initialPositiveWeight;
    }
    for (int i = 0; i < negativeWeights.length; ++i) {
        negativeWeights[i] = initialNegativeWeight;
    }
    // Each feature's value for each image
    List<List<FeatureValue>> featureValues = new LinkedList<List<FeatureValue>>();
    // For each feature get the values for each image, and sort them based on the value
    int currentFeature = 0;
    for (Feature feature : allFeatures) {
        List<FeatureValue> thisFeaturesValues = new LinkedList<FeatureValue>();
        int index = 0;
        for (Image positive : positiveSamples) {
            int value = positive.applyFeature(feature);
            thisFeaturesValues.add(new FeatureValue(index, value, true));
            ++index;
        }
        index = 0;
        for (Image negative : negativeSamples) {
            int value = negative.applyFeature(feature);
            thisFeaturesValues.add(new FeatureValue(index, value, false));
            ++index;
        }
        Collections.sort(thisFeaturesValues);
        // Add this feature to the list
        featureValues.add(thisFeaturesValues);
        ++currentFeature;
    }
    ... rest of code
This should be the pseudocode for the selection of one of the weak classifiers:
normalize the per-example weights  // one float per example
for feature j from 1 to 45,396:
    // Training a weak classifier based on feature j.
    - Extract the feature's response from each training image (1 float per example)
    // This threshold selection and error computation is where sorting the
    // examples by feature response comes in.
    - Choose a threshold to best separate the positive from negative examples
    - Record the threshold and weighted error for this weak classifier
choose the best feature j and threshold (lowest error)
update the per-example weights
Nowhere do you need to store billions of feature values. Just extract the feature responses on the fly in each iteration. You're using integral images, so extraction is fast. The integral images are the main memory bottleneck, and they're not that much: just one integer for every pixel in every image, basically the same amount of storage as your images required.
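For illustration, a sketch (my addition, in C++; the names are hypothetical) of the standard integral-image lookup that makes on-the-fly extraction cheap:

#include <cstdint>
#include <vector>

// ii[y][x] = sum of all pixels in the rectangle (0,0)..(x,y) inclusive.
// Any rectangle sum is then 4 lookups, so each Haar feature response can be
// recomputed on demand instead of being stored per feature.
int64_t rectSum(const std::vector<std::vector<int64_t>> &ii,
                int x, int y, int w, int h) {
    int64_t a = (x > 0 && y > 0) ? ii[y - 1][x - 1] : 0;
    int64_t b = (y > 0) ? ii[y - 1][x + w - 1] : 0;
    int64_t c = (x > 0) ? ii[y + h - 1][x - 1] : 0;
    int64_t d = ii[y + h - 1][x + w - 1];
    return d - b - c + a;
}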
Even if you did just compute all the feature responses for all images and save them all so you don't have to recompute them on each iteration, that's still only:
45396 * 3000 * 4 bytes =~ 520 MB, or if you're convinced there are 160000 possible features,
160000 * 3000 * 4 bytes =~ 1.78 GB, or if you use 10000 training images,
160000 * 10000 * 4 bytes =~ 5.96 GB
Basically, you shouldn't be running out of memory even if you do store all the feature values.

out of memory and vector of vectors

I'm implementing a distance matrix that holds the distance between each point and all the other points, and I have 100,000 points, so my matrix size will be 100,000 x 100,000. I implemented it using vector<vector<double> > dist. However, for this large data size it gives an out-of-memory error. The following is my code, and any help will be really appreciated.
vector<vector<double> > dist(dat.size(), vector<double>(dat.size()));
size_t p, j;
ptrdiff_t i;
// c is the number of coordinates per point
#pragma omp parallel for private(p, j, i) default(shared)
for (p = 0; p < dat.size(); ++p)
{
    // #pragma omp parallel for private(j, i) default(shared)
    for (j = p + 1; j < dat.size(); ++j)
    {
        double ecl = 0.0;
        for (i = 0; i < c; ++i)
        {
            ecl += (dat[p][i] - dat[j][i]) * (dat[p][i] - dat[j][i]);
        }
        ecl = sqrt(ecl);
        dist[p][j] = ecl;
        dist[j][p] = ecl;
    }
}
A 100000 x 100000 matrix? A quick calculation shows why this is never going to work:
100000 x 100000 x 8 (bytes) / (1024 * 1024 * 1024) = 74.5 gigabytes...
Even if it was possible to allocate this much memory I doubt very much whether this would be an efficient approach for a real problem.
If you're looking to do some kind of geometric processing on large data sets you may be interested in some kind of spatial tree structure: kd-trees, quadtrees, r-trees maybe?
100,000 * 100,000 = 10,000,000,000 ~= 2^33
It is easy to see that on a 32-bit system an out-of-memory failure is guaranteed for such a large data set, and that is before accounting for the fact that we counted the number of elements, not the number of bytes used.
Even on 64-bit systems, it is highly unlikely that the OS will allow you that much memory [also note that you actually need much more memory, since each element you allocate is much more than a byte].
Did you know that 100,000 times 100,000 is 10 billion? If you're storing the distances as 32-bit integers, that would be 40 billion bytes, or about 37.5 GB. That is probably more RAM than you have, so this will not be feasible.
100,000 x 100,000 x sizeof( double ) = roughly 80 GB (with 8-byte doubles), without the overhead of the vectors.
That's not likely to happen unless you're on a really big machine.
Look at using a database of some sort, or one of the C/C++ collection libraries that spill large data to disk.
Rogue Wave's SourcePro class library has a few disk-based collection classes, but it is not free.

C++: Counting total frames in a game

Not a very good title, but I didn't know what to name it.
Anyway, I am counting the total frames (so I can calculate an average FPS) in my game with a long int. Just in case the game went on really long, what should I do to make sure my long int doesn't get incremented past its limit? And what would happen if it did go past its limit?
Thanks.
This problem is present for any kind of counter.
For your specific problem, I wouldn't worry.
A long int counts up to 2 billion and change in the worst common case (32 bits, on 32-bit computers/consoles). Supposing your game is doing 1000 frames per second (which is a lot!), it would take about 2.1 million seconds to overflow your counter: nearly 600 hours, almost 25 days of uninterrupted play.
I'm pretty sure something else would cause your game to stop, if you try to run it for that long!
I would instead consider using an exponentially-weighted moving average. That approach will kill two birds with one stone: it will avoid the problem of accumulating a large number, and it will also adapt to recent behavior so that an accumulated average of 100fps in the year 2010 would not skew the average so that a 2fps rate would seem acceptable for a month or so in 2011 :).
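A minimal sketch of that approach (my addition):

// Exponentially weighted moving average of the frame rate: old frames decay
// geometrically, so no counter grows without bound and nothing can overflow.
float avgFps = 0.0f;
const float alpha = 0.05f;  // smoothing factor; larger reacts faster

void onFrame(float frameSeconds) {
    float fps = 1.0f / frameSeconds;
    avgFps = alpha * fps + (1.0f - alpha) * avgFps;
}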
Average FPS throughout the length of an entire game doesn't seem to be a very useful statistic. Typically you will wish to measure peaks and valleys, such as highest fps / lowest fps and amount of frames spent below and above threshold values.
In reality though, I would not worry. Even if you were to just use a 32 bit unsigned int, your game could run at 60fps for 19884 hours before it would overflow. You should be fine.
EDIT:
The best way to detect overflow in this case is to check whether the integer decreased in value after being incremented; this works for unsigned types, which wrap around to zero (signed overflow is undefined behavior in C++, so use an unsigned counter for this). If it did, you could just keep another counter around holding the number of times you have overflowed.
You could actively check for overflow in your arithmetic operations. E.g. SafeInt can do that for you. Of course, the performance is worse than for a plain i++.
However, it is unlikely that a 32 bit integer will overflow if you always increment by one.
Note that if long int is 32 bits, the maximum value is 2^31 - 1, so with 1000 increments per second it will overflow in 24.9 days [2^31 / 1000 / 60 / 60 / 24].
Hopefully not too off-topic: for games this may not really be an issue, but it is for other applications. A common mistake to be careful of is doing something like
extern volatile uint32_t counter;
uint32_t one_second_elapsed = counter + 1000;
while ( counter < one_second_elapsed ) do_something();
If counter + 1000 overflows, the comparison fails immediately and do_something() will never be called. The robust way is to compare elapsed ticks, which stays correct across wraparound because unsigned subtraction is modular:
uint32_t start = counter;
while ( counter - start < 1000 ) do_something();
It's probably better to use an average over a small number of frames. You mentioned that you want to calculate an average, but there is really no reason to keep such a large number of samples around to do so. Just keep a running total of frametimes over some small period of time (where small could be anything between 10 and 50 frames; we typically use 16). You can then use that total to calculate an average frames per second. This method also helps smooth out frame time reports so that the numbers don't jump all over the place. One thing to watch out for, though, is that if you average over too long a time period, frame rate spikes become more "hidden": it might be tougher to spot frames which cause the framerate to drop if those frames only happen every so often.
Something like this should be totally sufficient, I think (untested code to follow):
// set up some variables once
const int Max_samples = 16;      // keep at most 16 frametime samples
int FPS_Samples = 0;             // (unused in this snippet)
int Current_sample = 0;
float Total_frametime = 0.0f;    // running sum of the stored frametimes
float Frametimes[Max_samples];
for ( int i = 0; i < Max_samples; i++ ) {
    Frametimes[i] = 0.0f;
}
Then when you calculate your frametime, you could do something like this:
// current_frametime is the new frame time for this frame
Total_frametime -= Frametimes[Current_sample];        // drop the oldest sample
Total_frametime += current_frametime;                 // add the newest
Frametimes[Current_sample] = current_frametime;
Current_sample = ( Current_sample + 1 ) % Max_samples; // move to next element in array
float Frames_per_second = Max_samples / Total_frametime;
It's a rough cut and could probably use some error checking, but it gives the general idea.