Intel integrated performance primitives Fourier Transform magnitudes


When I use Intel IPP's ippsFFTFwd_RToCCS_64f followed by ippsMagnitude_64fc, I get a massive peak at index zero of the magnitudes array.
My sine wave is long, and the main component I am interested in lies somewhere between 0.15 Hz and 0.25 Hz. I sample at 500 Hz. If I subtract the mean from the signal before the FFT, the zero component becomes really small and the peak disappears.
[Figure: head of the magnitudes array]
Also, the magnitudes seem to be 10 times the amplitude I see in the time series of the signal, e.g. if the amplitude is 29, the magnitude is 290.
I am not sure why this is so. My questions are: 1. Do I really need to address the zero-index peak by removing the mean? 2. Where does this factor of 10 come from?
void CalculateForwardTransform(array<double> ^signal, array<double> ^transformedSignal, array<double> ^magnitudes)
{
// source signal
pin_ptr<double> pinnedSignal = &signal[0];
double *pSignal = pinnedSignal;
int order = (int)Math::Round(Math::Log(signal->Length, 2));
// get sizes
int sizeSpec = 0, sizeInit = 0, sizeBuf = 0;
int status = ippsFFTGetSize_R_64f(order, IPP_FFT_DIV_INV_BY_N, ippAlgHintNone, &sizeSpec, &sizeInit, &sizeBuf);
// memory allocation
IppsFFTSpec_R_64f* pSpec;
Ipp8u *pSpecMem = (Ipp8u*)ippMalloc(sizeSpec);
Ipp8u *pMemInit = (Ipp8u*)ippMalloc(sizeInit);
// FFT specification structure initialized
status = ippsFFTInit_R_64f(&pSpec, order, IPP_FFT_DIV_INV_BY_N, ippAlgHintNone, pSpecMem, pMemInit);
// transform
pin_ptr<double> pinnedTransformedSignal = &transformedSignal[0];
double *pDst = pinnedTransformedSignal;
Ipp8u *pBuffer = (Ipp8u*)ippMalloc(sizeBuf);
status = ippsFFTFwd_RToCCS_64f(pSignal, pDst, pSpec, pBuffer);
// get magnitudes
pin_ptr<double> pinnedMagnitudes = &magnitudes[0];
double *pMagn = pinnedMagnitudes;
status = ippsMagnitude_64fc((Ipp64fc*)pDst, pMagn, magnitudes->Length); // magnitudes is half of signal len
// free memory
ippFree(pSpecMem);
ippFree(pMemInit);
ippFree(pBuffer);
}

Do I really need to address the zero index peak with mean reduction?
For low-frequency signal analysis, even a small bias can really interfere (especially because of spectral leakage). To illustrate, consider the following low-frequency signal tone and a second signal tone_with_bias that adds a constant bias to it:
fs = 1;
f0 = 0.15;
for (int i = 0; i < N; i++)
{
tone[i] = 0.001*cos(2*M_PI*i*f0/fs);
tone_with_bias[i] = 1 + tone[i];
}
If we plot the frequency spectrum of a 100-sample window of these signals, the spectrum of tone_with_bias completely drowns out the spectrum of tone:
[Figure: spectra of tone and tone_with_bias]
So yes, it is better to remove that bias if you can. Note, however, that this only works if you know the nature of the bias. If the bias is indeed a constant, removing it will reveal the low-frequency component. If the "bias" is actually a very low-frequency component, subtracting the mean from the signal may not achieve the desired effect.
Where does this scale of 10 come from?
Scaling of the magnitude by the FFT should be expected: as described in this answer, the peak of a sinusoid is scaled by approximately 0.5*N, where N is the FFT size. If you were processing a small chunk of 20 samples, you would get exactly such a factor of 10. If you scale the output of the FFT by 2/N (or equivalently scale by 2 while also using the IPP_FFT_DIV_FWD_BY_N flag), you should get magnitudes similar to those of the time-domain signal.
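The 0.5*N factor can be checked without IPP by evaluating one DFT bin directly from its definition. The sketch below is only an illustration: dftBinMagnitude and makeTone are hypothetical helpers, and the tone is chosen to fall exactly on a bin so that there is no leakage.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Magnitude of DFT bin k of a real signal, computed from the definition
// with no normalization (like a forward FFT without a DIV_BY_N flag).
double dftBinMagnitude(const std::vector<double>& x, int k)
{
    const double pi = std::acos(-1.0);
    const std::size_t n = x.size();
    std::complex<double> sum(0.0, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        const double angle = -2.0 * pi * k * static_cast<double>(i) / n;
        sum += x[i] * std::complex<double>(std::cos(angle), std::sin(angle));
    }
    return std::abs(sum);
}

// n samples of amplitude * cos(2*pi*cycles*i/n): exactly `cycles` periods.
std::vector<double> makeTone(std::size_t n, double amplitude, int cycles)
{
    const double pi = std::acos(-1.0);
    std::vector<double> x(n);
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = amplitude * std::cos(2.0 * pi * cycles * static_cast<double>(i) / n);
    }
    return x;
}
```

With makeTone(20, 29.0, 3), dftBinMagnitude(tone, 3) comes out as 29 * 20 / 2 = 290, and multiplying by 2.0 / 20 recovers the time-domain amplitude of 29.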

Related

Fast, good quality pixel interpolation for extreme image downscaling

In my program, I am downscaling an image of 500px or larger to an extreme level of approx 16px-32px. The source image is user-specified so I do not have control over its size. As you can imagine, few pixel interpolations hold up and inevitably the result is heavily aliased.
I've tried bilinear, bicubic and square average sampling. The square average sampling actually provides the most decent results but the smaller it gets, the larger the sampling radius has to be. As a result, it gets quite slow - slower than the other interpolation methods.
I have also tried an adaptive square average sampling so that the smaller it gets the greater the sampling radius, while the closer it is to its original size, the smaller the sampling radius. However, it produces problems and I am not convinced this is the best approach.
So the question is: What is the recommended type of pixel interpolation that is fast and works well on such extreme levels of downscaling?
I do not wish to use a library so I will need something that I can code by hand and isn't too complex. I am working in C++ with VS 2012.
Here's some example code I've tried as requested (hopefully without errors from my pseudo-code cut and paste). This performs a 7x7 average downscale and although it's a better result than bilinear or bicubic interpolation, it also takes quite a hit:
// Sizing control
ctl(0): "Resize",Range=(0,800),Val=100
// Variables
float fracx,fracy;
int Xnew,Ynew,p,q,Calc;
int x,y,p1,q1,i,j;
//New image dimensions
Xnew=image->width*ctl(0)/100;
Ynew=image->height*ctl(0)/100;
for (y=0; y<image->height; y++){ // rows
for (x=0; x<image->width; x++){ // columns
p1=(int)x*image->width/Xnew;
q1=(int)y*image->height/Ynew;
for (z=0; z<3; z++){ // channels
for (i=-3;i<=3;i++) {
for (j=-3;j<=3;j++) {
Calc += (int)(src(p1-i,q1-j,z));
} //j
} //i
Calc /= 49;
pset(x, y, z, Calc);
} // channels
} // columns
} // rows
Thanks!

The first point is to use pointers to your data. Never use indexes at every pixel. When you write src(p1-i,q1-j,z) or pset(x, y, z, Calc), how much computation is being done behind the scenes? Use pointers to the data and manipulate those.
Second: your algorithm is wrong. You don't want an average filter. Instead, lay a grid over your source image, compute the average of every grid cell, and put it in the corresponding pixel of the output image.
The specific solution should be tailored to your data representation, but it could be something like this:
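// Sketch for a single-channel, 8-bit image stored row by row without padding.
std::vector<uint32_t> accum(Xnew);
std::vector<uint32_t> count(Xnew);
uint32_t *paccum, *pcount;
uint8_t* pin = /*pointer to input data*/;
uint8_t* pout = /*pointer to output data*/;
for (int dr = 0, sr = 0, w = image->width, h = image->height; sr < h; ++dr) {
    // reset the per-row accumulators
    memset(paccum = accum.data(), 0, Xnew*sizeof(uint32_t));
    memset(pcount = count.data(), 0, Xnew*sizeof(uint32_t));
    // add up every source row that maps to output row dr
    while (sr < h && sr * Ynew / h == dr) {
        paccum = accum.data();
        pcount = count.data();
        for (int dc = 0, sc = 0; sc < w; ++sc) {
            *paccum += *pin;
            *pcount += 1;
            ++pin;
            // advance when the next source column belongs to the next cell
            if ((sc + 1) * Xnew / w > dc) {
                ++dc;
                ++paccum;
                ++pcount;
            }
        }
        sr++;
    }
    // cell sum / cell count = cell average, written to the output row
    std::transform(begin(accum), end(accum), begin(count), pout, std::divides<uint32_t>());
    pout += Xnew;
}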
This was written using my own library (still in development) and it seems to work, but later I changed the variables names in order to make it simpler here, so I don't guarantee anything!
The idea is to have a local buffer of 32 bit ints which can hold the partial sum of all pixels in the rows which fall in a row of the output image. Then you divide by the cell count and save the output to the final image.
The first thing you should do is to set up a performance evaluation system to measure how much any change impacts on the performance.

As said previously, you should use pointers rather than indexes for a (probably) substantial speed-up, and you should not simply average, because plain averaging of pixels is essentially a blur filter.
I would strongly advise you to rework your code to use "kernels". A kernel is the matrix holding the weight applied to each pixel of the neighborhood. That way, you will be able to test different strategies and optimize quality.
Example of kernels:
https://en.wikipedia.org/wiki/Kernel_(image_processing)
Upsampling/downsampling kernel:
http://www.johncostella.com/magic/
Note: the posted code averages over a square neighborhood (7x7 as written, since i and j run from -3 to 3). The equivalent 3x3 averaging kernel would be:
[1 1 1]
[1 1 1] * 1/9
[1 1 1]

FFTW3 compute cross-correlation in the same signal

I am currently creating a C code, which takes as an input a wav file (specifically just one channel of the original wav file), and it performs the short-time Fourier transform.
The main part of the code is this one:
stft_data = (fftw_complex*)(fftw_malloc(sizeof(fftw_complex)*windowSize));
fft_result= (fftw_complex*)(fftw_malloc(sizeof(fftw_complex)*windowSize));
storage = (fftw_complex*)(fftw_malloc(sizeof(fftw_complex)*storage_capacity));
//define the fftw plane
fftw_plan plan_forward;
plan_forward = fftw_plan_dft_1d(windowSize, stft_data, fft_result, FFTW_FORWARD, FFTW_ESTIMATE);
//integer indexes
int i,counter ;
counter = 0 ;
//create a Hamming window
double hamming_result[windowSize];
hamming(windowSize, hamming_result);
//implement the stft position indexes
int chunkPosition = 0; //actual chunk position
int readIndex ; //read the index of the wav file
while (chunkPosition < wav_length ){
//read the window
for(i=0; i<windowSize; i++){
readIndex = chunkPosition + i;
if (readIndex < wav_length){
stft_data[i] = wav_data[readIndex]*hamming_result[i]*_Complex_I + 0.0*I;
}
else{
//if we are beyond the wav_length
stft_data[i] = 0.0*_Complex_I + 0.0*I;//padding
break;
}
}
//compute the fft
fftw_execute(plan_forward);
//store the stft in a data structure
for (i=0; i<windowSize;i++)
{
//printf("RE: %.2f IM: %.2f\n", creal(fft_result[i]),cimag(fft_result[i]));
storage[counter] = creal(fft_result[i]) + cimag(fft_result[i]);
counter+=1;
}
//update indexes
chunkPosition += hop_size;
printf("Chunk Position %d\n", chunkPosition);
printf("Counter position %d\n", counter);
printf("Fourier transform done\n");
}
Once the FFT has been computed onto the selected window, I am storing the FFT real and imaginary part into a storage variable.
After that I would like to compute the cross correlation among the data points in each of the N windows I have in the end.
As an example, I would like to compute the correlation between the first data point of the first window ( storage[0] ) with the first element of the second window (storage[windowSize+1]).
However, I am facing some problems and do not get reasonable values. According to what I studied, correlation in Fourier space is just the complex multiplication of two Fourier terms. Thus,
what I am doing is something like:
correlation = storage[0]*conj(storage[windowSize+1]);
However, I got very huge values, which makes me wonder if I am really computing a correlation.
Where am I wrong?
How should I scale my correlation results?
How should I compute the correlation with the Fourier values?
And then, how should I plot the Fourier values I get from the FFTW3 calculations? Should I shift all the values, or are they already shifted?
Thanks very much

The line storage[counter] = creal(fft_result[i]) + cimag(fft_result[i]); makes storage purely real. Since computing correlation = storage[0]*conj(storage[windowSize+1]); is the next step in the computation of the cross correlation, there is a problem. Indeed, there is no point in conjugating a real number.
Trying storage[counter] = fft_result[i]; could partly resolve the issue.
In addition, correlation = storage[0]*conj(storage[windowSize+1]); should be modified to correlation = storage[0]*conj(storage[windowSize]);
Performing correlation = storage[0]*conj(storage[windowSize]); yields element [0] of the DFT of the correlation. Indeed, storage[0] corresponds to the average of the first frame, while storage[windowSize] corresponds to the average of the second frame. These values are not equal to the averages but much larger, since they are scaled by the frame length windowSize.
To compute the correlation between the two signals, the next step should be:
for (i=0; i<windowSize;i++)
{
dftofcorrelation[i]=storage[i]*conj(storage[i+windowSize]);
}
Then, the inverse DFT must be applied to the array dftofcorrelation to get the correlation as an array. It must be kept in mind that neither the forward nor the backward DFT of FFTW include any scaling, see what FFTW really computes:
fftw_execute(plan_backward);
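If only two scalars are to be kept from this correlation array, they should be its maximum (high if the signals are similar up to a delay) and the index of that maximum, which is the estimated time offset between the two signals.
The overall scaling induced by FFTW is a power of windowSize. Because neither transform is normalized, the forward, multiply, backward sequence returns windowSize times the circular correlation sum (or windowSize^2 times the correlation if it is defined as an average over the frame). This can be checked by computing the autocorrelation of a uniform signal, which is itself uniform.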
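To make the scaling concrete, here is a sketch of the whole chain with a naive O(n^2) DFT standing in for FFTW (same sign convention, no normalization). The names cvec, dft and circularCrossCorrelation are made up for this illustration; with FFTW you would run the forward and backward plans instead of dft.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Unnormalized DFT; sign = -1 for forward, +1 for backward.
cvec dft(const cvec& in, int sign)
{
    const double pi = std::acos(-1.0);
    const std::size_t n = in.size();
    cvec out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t m = 0; m < n; ++m) {
            const double angle = sign * 2.0 * pi * static_cast<double>(k * m) / n;
            sum += in[m] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Circular cross-correlation r[d] = sum over m of x[(m + d) % n] * conj(y[m]).
// x and y must have the same length.
cvec circularCrossCorrelation(const cvec& x, const cvec& y)
{
    const std::size_t n = x.size();
    const cvec X = dft(x, -1);
    const cvec Y = dft(y, -1);
    cvec product(n);
    for (std::size_t k = 0; k < n; ++k) {
        product[k] = X[k] * std::conj(Y[k]);
    }
    cvec r = dft(product, +1);
    for (auto& v : r) {
        v /= static_cast<double>(n);  // undo the factor n left by the unnormalized transforms
    }
    return r;
}
```

For a uniform signal of ones with length 8, every element of the result is 8, which is the plain correlation sum. Without the final division each element would be 64.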

FFT window causing unequal amplification across frequency spectrum

I am using FFTW to create a spectrum analyzer in C++.
After applying any window function to an input signal, the output amplitude suddenly seems to scale with frequency.
[Figure: spectrum with the rectangular window]
[Figure: spectrum with the Exact-Blackman window]
Graphs are scaled logarithmically with a sampling frequency of 44100 Hz. All harmonics are generated at the same level, peaking at 0 dB as seen in the rectangular case. The Exact-Blackman result was amplified by 7.35 dB to attempt to make up for processing gain.
Here is my code for generating the input table...
freq = 1378.125f;
for (int i = 0; i < FFT_LOGICAL_SIZE; i++)
{
float term = 2 * PI * i / FFT_ORDER;
for (int h = 1; freq * h < FREQ_NYQST; h+=1) // Harmonics up to Nyquist
{
fftInput[i] += sinf(freq * h * K_PI * i / K_SAMPLE_RATE); // Generate sine
fftInput[i] *= (7938 / 18608.f) - ((9240 / 18608.f) * cosf(term)) + ((1430 / 18608.f) * cosf(term * 2)); // Exact-Blackman window
}
}
fftwf_execute(fftwR2CPlan);
Increasing or decreasing the window size changes nothing. I tested with the Hamming window as well, same problem.
Here is my code for grabbing the output.
float val; // Used elsewhere
for (int i = 1; i < K_FFT_COMPLEX_BINS_NOLAST; i++) // Skips the DC and Nyquist bins
{
real = fftOutput[i][0];
complex = fftOutput[i][1];
// Grabs the values and scales based on the window size
val = sqrtf(real * real + complex * complex) / FFT_LOGICAL_SIZE_OVER_2;
val *= powf(20, 7.35f / 20); // Only applied during Exact-Blackman test
}
Curiously, I attempted the following to try to flatten out the response in the Exact-Blackman case. This scaling back down resulted in a nearly, but still not perfectly flat response. Neat, but still doesn't explain to me why this is happening.
float x = (float)(FFT_COMPLEX_BINS - i) / FFT_COMPLEX_BINS; // Linear from 0 to 1
x = log10f((x * 9) + 1.3591409f); // Now logarithmic from 0 to 1, offset by half of Euler's constant
val = sqrt(real * real + complex * complex) / (FFT_LOGICAL_SIZE_OVER_2 / x); // Division by x added to this line

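Might be a bug. You seem to be applying your window function multiple times per sample: the multiplication sits inside the harmonics loop, so the window is applied once per harmonic, and the harmonics added first (the lowest ones) are attenuated many more times than the ones added last. Any windowing should be removed from your input compositing loop and applied to the input vector just once, right before the FFT.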
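A sketch of what that restructuring could look like, with the harmonic sum built first and the window applied once per sample. The function names are hypothetical, the window coefficients are the ones from the question, and the phase uses 2*pi and the frame length n explicitly, which is an assumption about what K_PI and FFT_ORDER stand for in the original code.

```cpp
#include <cmath>
#include <vector>

// Exact-Blackman coefficient for sample i of an n-point frame (periodic form).
float exactBlackman(int i, int n)
{
    const float twoPi = 2.0f * std::acos(-1.0f);
    const float term = twoPi * i / n;
    return (7938.0f / 18608.0f)
         - (9240.0f / 18608.0f) * std::cos(term)
         + (1430.0f / 18608.0f) * std::cos(2.0f * term);
}

// Sum all harmonics of freq below Nyquist, then window each sample once.
std::vector<float> makeWindowedHarmonics(int n, float freq, float sampleRate)
{
    const float twoPi = 2.0f * std::acos(-1.0f);
    std::vector<float> out(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (int h = 1; freq * h < sampleRate / 2.0f; ++h) {
            sum += std::sin(twoPi * freq * h * i / sampleRate);
        }
        out[i] = sum * exactBlackman(i, n);  // window applied exactly once
    }
    return out;
}
```

The returned vector can then be copied into the FFT input buffer before executing the plan.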

I was not able to reproduce this because I do not have the library on hand. However, it may be a consequence of spectral leakage. https://en.wikipedia.org/wiki/Spectral_leakage
This is an inevitable consequence of window functions as well as of sampling. If you look at the trade-offs section of that article, a window can be chosen to suit a wide range of frequencies or to focus on a particular one. Since the frequency of your signal is increasing, perhaps the lower-frequency content outside your target is more subject to spectral leakage.

Cufft set frequency?

I am using CUDA's Cufft to process data i receive from a hydrophone(500,000 integers a second at 250hertz, high and low channels). Now as a basic example of how Cufft works is here...
void runTest(int argc, char** argv)
{
printf("[1DCUFFT] is starting...\n");
cufftComplex* h_signal = (cufftComplex*)malloc(sizeof(cufftComplex)* SIGNAL_SIZE);
// Allocate host memory for the signal
//Complex* h_signal = (Complex*)malloc(sizeof(Complex) * SIGNAL_SIZE);
// Initalize the memory for the signal
for (unsigned int i = 0; i < SIGNAL_SIZE; ++i) {
h_signal[i].x = rand() / (float)RAND_MAX;
h_signal[i].y = 0;
}
int mem_size = sizeof(cufftComplex)* SIGNAL_SIZE;
// Allocate device memory for signal
cufftComplex* d_signal;
cudaMalloc((void**)&d_signal, mem_size);
// Copy host memory to device
cudaMemcpy(d_signal, h_signal, mem_size,
cudaMemcpyHostToDevice);
// CUFFT plan
cufftHandle plan;
cufftPlan1d(&plan, SIGNAL_SIZE, CUFFT_C2C, 1); // transform size is in elements, not bytes
// Transform signal
printf("Transforming signal cufftExecC2C\n");
cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD);
// Transform signal back
printf("Transforming signal back cufftExecC2C\n");
cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_INVERSE);
// Copy device memory to host
cufftComplex* h_inverse_signal = (cufftComplex*)malloc(sizeof(cufftComplex)* SIGNAL_SIZE);;
cudaMemcpy(h_inverse_signal, d_signal, mem_size,
cudaMemcpyDeviceToHost);
for (int i = 0; i < SIGNAL_SIZE; i++){
h_inverse_signal[i].x = h_inverse_signal[i].x / (float)SIGNAL_SIZE;
h_inverse_signal[i].y = h_inverse_signal[i].y / (float)SIGNAL_SIZE;
printf("first : %f %f after %f %f \n", h_signal[i].x, h_signal[i].y, h_inverse_signal[i].x, h_inverse_signal[i].y);
}
//Destroy CUFFT context
cufftDestroy(plan);
// cleanup memory
free(h_signal);
free(h_inverse_signal);
cudaFree(d_signal);
cudaDeviceReset();
}
Now all I want to know is, how do i set the frequency of the FFT (cufft) to be 250hertz?
Thanks
James

You don't. The FFT of N points is the same, regardless of the rate at which those N points were sampled.
Also, 500,000 integers per second is a 500,000 Hz sample rate, i.e. 500 kHz. That gives you a Nyquist limit of 250 kHz.

If I understand you right, you just need to know which element in the output vector is 250Hz.
The FFT gives you all the frequencies that are justified to be calculated based on the length and time resolution of your time vector.
The simple rule to calculate is :
- frequency range = 1/time resolution.
- frequency resolution = 1/time length.
In addition, one has to know that the FFT of a real function (a time vector with no imaginary part) yields a symmetric spectrum with redundancy. The spectrum reaches from -1/2 of the frequency range to +1/2 of the frequency range, and the negative-frequency data can be discarded in the case of a real time vector. It's a little more complicated, though: the standard FFT output gives you the positive frequencies first, then the negative frequencies. Since you are only interested in the positive frequencies, the second half of the FFT vector can be discarded. In your case, just ignore data above index 250k.
In your case, assuming you transform one second of data (500,000 points), the frequencies span from -250 kHz to 250 kHz with a resolution of 1 Hz, but because of the above, the first 250k points are actually the positive frequencies, at a separation of 1 Hz.
So take the point at index 250 in the (unshifted, i.e. raw) FFT and you have the signal at 250 Hz. I would plot the data from 0 to around 500 to see how broad the peak around 250 Hz is. The signal strength is the integral over those non-zero frequencies (non-zero applied loosely here to mean everything above the noise). The signal width indicates the modulation that is being applied to the signal (which could include other measurement artifacts). If the signal is shifted from 250 Hz, you might have a Doppler shift (either your source or you are moving).
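For transform lengths other than one second of data, the mapping between bin index and frequency follows from the two rules above: bin k sits at k * sampleRate / N. A small sketch with hypothetical helper names:

```cpp
#include <cmath>
#include <cstddef>

// Center frequency in Hz of bin k of an n-point FFT.
double binFrequencyHz(std::size_t k, std::size_t n, double sampleRateHz)
{
    return k * sampleRateHz / n;
}

// Index of the bin whose center frequency is closest to frequencyHz
// (valid for 0 <= frequencyHz <= sampleRateHz / 2).
std::size_t nearestBin(double frequencyHz, std::size_t n, double sampleRateHz)
{
    return static_cast<std::size_t>(std::llround(frequencyHz * n / sampleRateHz));
}
```

With n = 500000 and a 500 kHz sample rate, nearestBin(250.0, 500000, 500000.0) is 250. With a shorter 65536-point transform the bins are about 7.6 Hz apart and 250 Hz lands in bin 33.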
If you are only interested in a finite frequency range, it might be faster to calculate the Fourier integral (O(n^2)) just for those few frequency points. Generally people use the FFT because it is O(n*log(n)), but if you need only say 10 frequency points then O(10*n) is not much different.
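Evaluating the transform at a single frequency is just one pass over the samples. The following is a plain CPU sketch of that idea, not cuFFT code; dftAtFrequency is a hypothetical helper and the input is assumed to be real-valued.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// DFT of a real signal evaluated at one frequency, straight from the
// definition (unnormalized). Cost is O(n) per frequency of interest.
std::complex<double> dftAtFrequency(const std::vector<double>& x,
                                    double frequencyHz, double sampleRateHz)
{
    const double pi = std::acos(-1.0);
    const double step = -2.0 * pi * frequencyHz / sampleRateHz;
    std::complex<double> sum(0.0, 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * std::polar(1.0, step * static_cast<double>(i));
    }
    return sum;
}
```

Calling this for, say, ten frequencies around 250 Hz costs ten passes over the data, and the magnitude of each result plays the same role as the magnitude of the corresponding FFT bin.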

How to efficiently determine the minimum necessary size of a pre-rendered sine wave audio buffer for looping?

I've written a program that generates a sine-wave at a user-specified frequency, and plays it on a 96kHz audio channel. To save a few CPU cycles I employ the old trick of pre-rendering a short section of audio into a buffer, and then playing back the buffer in a loop, so that I can avoid calling the sin() function 96000 times per second for the duration of the program and just do simple memory-copying instead.
My problem is efficiently determining what the minimum usable size of this pre-rendered buffer would be. For some frequencies it is easy -- for example, an 8kHz sine wave can be perfectly represented by generating a 12-sample buffer and playing it in a looping, because (8000*12 == 96000). For other frequencies, however, a single cycle of the sine wave requires a non-integral number of samples to represent, and therefore looping a single cycle's worth of samples would cause unacceptable glitching.
For some of those frequencies, however, it's possible to get around that problem by pre-rendering more than one cycle of the sine wave and looping that -- if I can figure out how many cycles are required so that the number of cycles present in the buffer will be integral, while also guaranteeing that the number of samples in the buffer are integral. For example, a sine-wave frequency of 12.8kHz translates to a single-cycle buffer-size of 7.5 samples, which won't loop cleanly, but if I render two consecutive cycles of the sine wave into a 15-sample buffer, then I can cleanly loop the result.
My current approach to solving this issue is brute force: I try all possible cycle-counts and see if any of them result in a buffer size with an integral number of samples in it. I think that approach is unsatisfactory for the following reasons:
1) It's very inefficient. For example, the program shown below (which prints buffer-size results for 480,000 possible frequency values between 0Hz and 48kHz) takes 35 minutes to complete on my 2.7GHz machine. I think there must be a much faster way to do this.
2) I suspect that the results are not 100% accurate, due to floating-point errors.
3) The algorithm gives up if it can't find an acceptable buffer size less than 10 seconds long. (I could make the limit higher, but of course that would make the algorithm even slower).
So, is there any way to calculate the minimum-usable-buffer-size analytically, preferably in O(1) time? It seems like it should be easy, but I haven't been able to figure out what kind of math I should use.
Thanks in advance for any advice!
#include <stdio.h>
#include <math.h>
static const long long SAMPLES_PER_SECOND = 96000;
static const long long MAX_ALLOWED_BUFFER_SIZE_SAMPLES = (SAMPLES_PER_SECOND * 10);
// Returns the length of the pre-render buffer needed to properly
// loop a sine wave at the given frequence, or -1 on failure.
static int GetNumCyclesNeededForPreRenderedBuffer(float freqHz)
{
double oneCycleLengthSamples = SAMPLES_PER_SECOND/freqHz;
for (int count=1; (count*oneCycleLengthSamples) < MAX_ALLOWED_BUFFER_SIZE_SAMPLES; count++)
{
double remainder = fmod(oneCycleLengthSamples*count, 1.0);
if (remainder > 0.5) remainder = 1.0-remainder;
if (remainder <= 0.0) return count;
}
return -1;
}
int main(int, char **)
{
for (int i=0; i<48000*10; i++)
{
double freqHz = ((double)i)/10.0f;
int numCyclesNeeded = GetNumCyclesNeededForPreRenderedBuffer(freqHz);
if (numCyclesNeeded >= 0)
{
double oneCycleLengthSamples = SAMPLES_PER_SECOND/freqHz;
printf("For %.1fHz, use a pre-render-buffer size of %f samples (%i cycles, %f samples/cycle)\n", freqHz, (numCyclesNeeded*oneCycleLengthSamples), numCyclesNeeded, oneCycleLengthSamples);
}
else printf("For %.1fHz, there was no suitable pre-render-buffer size under the allowed limit!\n", freqHz);
}
return 0;
}

number_of_cycles/size_of_buffer = frequency/samples_per_second
This implies that if you can simplify your frequency/samples_per_second fraction, you can find the size of your buffer and the number of cycles in the buffer. If frequency and samples_per_second are integers, you can simplify the fraction by finding the greatest common divisor, otherwise you can use the method of continued fractions.
Example:
Say your frequency is 1234.5, and your samples_per_second is 96000. We can make these into two integers by multiplying by 10, so we get the ratio:
frequency/samples_per_second = 12345/960000
The greatest common divisor is 15, so it can be reduced to 823/64000.
So you would need 823 cycles in a 64000 sample buffer to reproduce the frequency exactly.
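The same reduction can be written in a few lines of integer arithmetic. This sketch assumes the frequency is given in tenths of a hertz (the 0.1 Hz step used by the question's test loop); LoopSize, gcd64 and minimalLoop are names invented for the example.

```cpp
#include <cstdint>

struct LoopSize {
    std::int64_t cycles;   // whole sine cycles in the buffer
    std::int64_t samples;  // buffer length in samples
};

// Euclid's algorithm.
std::int64_t gcd64(std::int64_t a, std::int64_t b)
{
    while (b != 0) {
        const std::int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Reduce frequency / samplesPerSecond to lowest terms.
// frequencyTenthsHz must be positive, e.g. 12345 means 1234.5 Hz.
LoopSize minimalLoop(std::int64_t frequencyTenthsHz, std::int64_t samplesPerSecond)
{
    const std::int64_t denominator = samplesPerSecond * 10;
    const std::int64_t g = gcd64(frequencyTenthsHz, denominator);
    return { frequencyTenthsHz / g, denominator / g };
}
```

minimalLoop(12345, 96000) gives 823 cycles in 64000 samples, matching the example above, and minimalLoop(128000, 96000) gives the 2 cycles in 15 samples mentioned in the question. Working in integers also avoids the floating-point concerns raised there.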
