I am using a C++ function (from FFTW) to compute an inverse Fourier transform.
int inYSize = 170; int inXSize = 2280;
float* outData = new float[inYSize*inXSize];
fftwf_plan mReverse = fftwf_plan_dft_c2r_2d(inYSize, inXSize, (fftwf_complex*)temp, outData,
FFTW_ESTIMATE);
fftwf_execute(mReverse);
My input is a 2D array temp of complex numbers. Every element has real part 1 and imaginary part 0.
So I expect the inverse FFT of such an array to be a 2D array of real values, with a spike at (0,0) and every other value 0. But I am getting all different values in the output array, even after normalizing by the total size of the array. What could be the reason?
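The expectation stated in the question can be checked on a small grid. The following is only a sketch: it uses a naive full-complex inverse DFT written in plain Python (not FFTW, and not the half-size C2R layout), and the helper name `inverse_dft_2d` and the 4x6 grid are made up for illustration.

```python
import cmath

def inverse_dft_2d(spec):
    """Naive unnormalized inverse 2D DFT (positive exponent, like FFTW_BACKWARD)."""
    rows, cols = len(spec), len(spec[0])
    out = [[0j] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            out[y][x] = sum(
                spec[v][u] * cmath.exp(2j * cmath.pi * (v * y / rows + u * x / cols))
                for v in range(rows) for u in range(cols))
    return out

rows, cols = 4, 6
ones = [[1 + 0j] * cols for _ in range(rows)]   # every bin is 1 + 0i
result = inverse_dft_2d(ones)

# Normalize by the total size, as the question does
normalized = [[value.real / (rows * cols) for value in row] for row in result]
print(normalized[0][0])
```

On this small case the normalized output is close to 1 at (0,0) and close to 0 everywhere else, so a very different result from the library call points at how the arrays are laid out or passed rather than at the mathematics.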
FFTW is not that trivial to deal with when it comes to multidimensional DFT and Complex to Real transform.
When doing a C2R transform of an MxN row-major array, the second dimension of the complex input is cut roughly in half (N/2+1 columns) because of the Hermitian symmetry of the spectrum of real data, while the real output is still MxN. Your outData therefore has the expected size, so this is not the reason for your problem.
More info about this tortuous matter : http://www.fftw.org/doc/One_002dDimensional-DFTs-of-Real-Data.html
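As a quick sanity check of the array sizes involved, here is a small sketch of the shapes a 2D C2R transform works with. The helper name `c2r_2d_shapes` is hypothetical; the N/2+1 rule is the one described in the FFTW manual for real-data transforms.

```python
def c2r_2d_shapes(n0, n1):
    """Return (complex_input_shape, real_output_shape) for an n0 x n1 C2R transform."""
    # Only the non-redundant half of the last dimension is stored,
    # because the spectrum of real data is Hermitian-symmetric.
    return (n0, n1 // 2 + 1), (n0, n1)

complex_shape, real_shape = c2r_2d_shapes(170, 2280)
print(complex_shape, real_shape)
```

For the sizes in the question this gives a 170 x 1141 complex input and a 170 x 2280 real output.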
"Good Guy Advice" : Use only the C2C "easier" way of doing things, take the modulus of the output if you don't know how to process the results, but don't waste your time on n-D Complex to Real transforms.
Because of limited precision, because of the numerical implementation of the DFT, because of unsubordinated drunk bits, you can get values that are not exactly 0, even if they are very small. This is the normal behavior of an FFT algorithm.
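To get a feel for how small these residual values usually are, one can run a round trip on a tiny signal. This sketch uses a naive DFT in plain Python rather than FFTW, so the exact residuals will differ from what the library produces; the function name `dft` and the 16-sample spike are illustrative choices.

```python
import cmath

def dft(x, sign):
    """Naive O(n^2) DFT; sign=-1 is forward, sign=+1 is the unnormalized inverse."""
    n = len(x)
    return [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

signal = [0.0] * 16
signal[3] = 1.0                      # a single spike; every other sample is exactly 0

spectrum = dft(signal, -1)
restored = [v / len(signal) for v in dft(spectrum, +1)]   # normalize by n

# Largest deviation from the original after the round trip
max_err = max(abs(r - s) for r, s in zip(restored, signal))
print(max_err)
```

When comparing against an expected result, it is therefore safer to test against a small tolerance than against exact zero.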
Besides carefully reading the user manual (http://www.fftw.org/doc/), even if it's a real pain (I lost a few days on this library just getting a 3D transform to work and understanding how the data was scaled), you should try a 1D C2C transform before going to 2D C2C and 2D C2R, just to be sure you have some idea of what you're doing.
What is the inverse FFT of a plane where every bin of the "frequency plane" is filled with a one? For the discrete transform it is a single spike at (0,0), as you described. Keep in mind that FFTW computes unnormalized transforms, so the spike comes out as inYSize*inXSize and only becomes 1 after you divide by the total size.
Do not hesitate to add details to your question, and good luck with FFTW.
With this little information it is hard to tell. What I could imagine is that you get spectral leakage due to the window selection (see the Wikipedia article on spectral leakage for details).
What you could do is try another windowing function to reduce leakage, or redefine your window size.
I am working on a C++ project that needs to perform FFT on a large 2D raster data (10 to 100 GB). In particular, the performance is quite bad when applying FFT for each column, whose elements are not contiguous in memory (placed with a stride of the width of the data).
Currently, I'm doing this. Since the data does not fit in the memory, I read several columns, namely n columns, into the memory with its orientation transposed (so that a column in the file becomes a row in the memory) and apply FFT with an external library (MKL). I read (fread) n pixels, move on to the next row (fseek as much as width - n), read n pixels, jump to the next row, and so on. When the operation (FFT) is done with the column chunk, I write it back to the file in the same manner. I write n pixels, jump to the next row, and so on. This way of reading and writing file takes too much time, so I want to find some way of boosting it.
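For reference, the access pattern described above (read n values, seek to the next row, repeat, and store the result transposed) can be sketched as follows. This is only an illustration: it assumes a headerless row-major file of 64-bit floats, and the function name `read_column_block` and the tiny 3x4 test raster are made up.

```python
import array
import os
import tempfile

def read_column_block(path, width, height, first_col, n_cols):
    """Read columns [first_col, first_col + n_cols) of a row-major float64 raster.

    The block is returned transposed: one contiguous array per file column.
    """
    item = array.array("d").itemsize
    block = [array.array("d") for _ in range(n_cols)]
    with open(path, "rb") as f:
        for row in range(height):
            f.seek((row * width + first_col) * item)   # jump to the chunk in this row
            chunk = array.array("d")
            chunk.fromfile(f, n_cols)                  # read n_cols values
            for c in range(n_cols):
                block[c].append(chunk[c])              # file column becomes memory row
    return block

# Tiny 3x4 raster holding 0..11, written to a temporary file
width, height = 4, 3
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(array.array("d", [float(i) for i in range(width * height)]).tobytes())
    path = tmp.name
try:
    block = read_column_block(path, width, height, first_col=1, n_cols=2)
finally:
    os.remove(path)
print([list(col) for col in block])
```

Each row costs one seek and one short read, which is why a larger n (wider column chunks) tends to reduce the I/O overhead per value.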
I have considered transposing the whole file beforehand, but the entire process includes both row-major and column-major FFT operations and transposing will not benefit.
I'd like to hear any experiences or idea about this kind of column-major operations on a large data. Any suggestions related particularly to FFT or MKL will help as well.
Why not work with both transposed and non-transposed data at the same time? That doubles the memory requirement, but it may be worth it.
Consider switching to a Hadamard Transformation. As a complete IPS, the transform offers no multiplications, since all of the coefficients in the transform are plus or minus one. If you need the resultant transform in a fourier basis, a matrix multiplication will change bases.
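A minimal sketch of the fast Walsh-Hadamard transform shows why no multiplications are needed: each butterfly only adds and subtracts. The function name `fwht` is chosen for this example, and the transform is left unnormalized.

```python
def fwht(values):
    """Fast Walsh-Hadamard transform (unnormalized); length must be a power of two."""
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = data[i], data[i + h]
                data[i], data[i + h] = a + b, a - b   # butterfly: additions only
        h *= 2
    return data

sample = [1, 0, 1, 0, 0, 1, 1, 0]
transformed = fwht(sample)
print(transformed)  # → [4, 2, 0, -2, 0, 2, 0, 2]
```

Applying the transform twice returns the input scaled by its length, which gives a cheap way to test an implementation.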
I am working on implementing Image convolution in C++, and I already have a naive working code based on the given pseudo code:
for each image row in input image:
    for each pixel in image row:
        set accumulator to zero
        for each kernel row in kernel:
            for each element in kernel row:
                if element position corresponding* to pixel position then
                    multiply element value corresponding* to pixel value
                    add result to accumulator
                endif
        set output image pixel to accumulator
As this can be a big bottleneck with big images and kernels, I was wondering whether there is another approach that makes things faster, even with additional input information such as a sparse image or kernel, an already known kernel, etc.
I know this can be parallelized, but it's not doable in my case.
if element position corresponding* to pixel position then
I presume this test is meant to avoid a multiplication by 0. Skip the test! Multiplying by 0 is way faster than the delays caused by a conditional jump.
The other alternative (and it's always better to post actual code rather than pseudo-code, here you have me guessing at what you implemented!) is that you're testing for out-of-bounds access. That is terribly expensive also. It is best to break up your loops so that you don't need to do this testing for the majority of the pixels:
for (row = 0; row < k/2; ++row) {
    // inner loop over kernel rows is adjusted so it only loops over part of the kernel
}
for (row = k/2; row < nrows-k/2; ++row) {
    // inner loop over kernel rows is unrestricted
}
for (row = nrows-k/2; row < nrows; ++row) {
    // inner loop over kernel rows is adjusted
}
Of course, the same applies to loops over columns, leading to 9 repetitions of the inner loop over kernel values. It's ugly but way faster.
To avoid the code repetition you can create a larger image, copy the image data over, padded with zeros on all sides. The loops now do not need to worry about accessing out-of-bounds, you have much simpler code.
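The padding idea could look roughly like the sketch below. It is written in plain Python for readability rather than speed, assumes odd kernel dimensions, and the function name `convolve2d_padded` is made up for this example.

```python
def convolve2d_padded(image, kernel):
    """'Same'-size 2D convolution on a zero-padded copy; odd kernel sizes assumed."""
    rows, cols = len(image), len(image[0])
    krows, kcols = len(kernel), len(kernel[0])
    pr, pc = krows // 2, kcols // 2
    # Larger image with a zero border of half the kernel size on every side
    padded = [[0.0] * (cols + 2 * pc) for _ in range(rows + 2 * pr)]
    for r in range(rows):
        padded[r + pr][pc:pc + cols] = image[r]
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(krows):
                for j in range(kcols):
                    # No bounds test needed: the padding absorbs edge accesses.
                    # The kernel is flipped (true convolution, not correlation).
                    acc += kernel[krows - 1 - i][kcols - 1 - j] * padded[r + i][c + j]
            out[r][c] = acc
    return out

# A single bright pixel reproduces the kernel around its position
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
out = convolve2d_padded(impulse, [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(out[1][1], out[2][2], out[3][3])
```

The inner loops contain no conditionals at all; the price is one extra copy of the image.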
Next, a certain class of kernels can be decomposed into 1D kernels. For example, the well-known Sobel kernel results from the convolution of [1,2,1] and [1,0,-1]T. For a 3x3 kernel this is not a huge deal, but for larger kernels it is. In general, for an NxN kernel, you go from N^2 to 2N operations per pixel.
In particular, the Gaussian kernel is separable. This is a very important smoothing filter that can also be used for computing derivatives.
Besides the obvious computational cost saving, the code is also much simpler for these 1D convolutions. The 9 repeated blocks of code we had earlier become 3 for a 1D filter. The same code for the horizontal filter can be re-used for the vertical one.
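A sketch of a separable filter, reusing one 1D routine for both passes, might look like this. The bounds test inside the loop is kept only for brevity (the loop-splitting or padding discussed earlier would remove it), and all function names are illustrative.

```python
def convolve_rows(image, k):
    """Convolve every row with the 1D kernel k (odd length), zero beyond the edges."""
    half = len(k) // 2
    cols = len(image[0])
    out = []
    for row in image:
        new_row = []
        for c in range(cols):
            acc = 0.0
            for j, weight in enumerate(k):
                src = c + half - j          # flipped kernel index: true convolution
                if 0 <= src < cols:
                    acc += weight * row[src]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def convolve_separable(image, col_kernel, row_kernel):
    """Horizontal pass, then the same routine reused for the vertical pass."""
    horizontal = convolve_rows(image, row_kernel)
    return transpose(convolve_rows(transpose(horizontal), col_kernel))

# An impulse image reveals the equivalent 2D kernel
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
result = convolve_separable(impulse, [1, 2, 1], [1, 0, -1])
center = [row[1:4] for row in result[1:4]]
print(center)
```

With a smoothing kernel [1,2,1] in one direction and a difference kernel [1,0,-1] in the other, the 3x3 block around the impulse is the outer product of the two 1D kernels.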
Finally, as already mentioned in MBo's answer, you can compute the convolution through the DFT. The DFT can be computed using the FFT in O(MN log MN) (for an image of size MxN). This requires padding the kernel to the size of the image, transforming both to the Fourier domain, multiplying them together, and inverse-transforming the result. 3 transforms in total. Whether this is more efficient than the direct computation depends on the size of the kernel and whether it is separable or not.
For small kernel sizes the simple method might be faster. Also note that separable kernels (for example, the Gaussian kernel), as mentioned, allow filtering by rows and then by columns, resulting in O(N^2 * M) complexity.
For other cases, there is fast convolution based on the FFT (Fast Fourier Transform). Its complexity is O(N^2*logN) (where N is the size of the image), compared to O(N^2*M^2) for the naive implementation.
Of course, there are some peculiarities in applying these techniques, for example edge effects, but one needs to account for them in the naive implementation too (to a lesser degree, though).
FI = FFT(Image)
FK = FFT(Kernel)
Prod = FI * FK (element-by-element complex multiplication)
Conv(I, K) = InverseFFT(Prod)
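The four steps above can be tried out in one dimension. The sketch below substitutes a naive O(n^2) DFT for a real FFT so that it stays self-contained; an actual implementation would call an FFT library and work on 2D arrays. The names `dft` and `convolve_via_dft` are made up for this example.

```python
import cmath

def dft(x, sign=-1):
    """Naive DFT; sign=-1 is forward, sign=+1 is the unnormalized inverse."""
    n = len(x)
    return [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def convolve_via_dft(signal, kernel):
    """Linear convolution through the frequency domain (1D for brevity)."""
    n = len(signal) + len(kernel) - 1           # pad to avoid circular wrap-around
    fs = dft(list(signal) + [0.0] * (n - len(signal)))
    fk = dft(list(kernel) + [0.0] * (n - len(kernel)))
    prod = [a * b for a, b in zip(fs, fk)]      # element-by-element complex product
    return [v.real / n for v in dft(prod, +1)]  # inverse transform, normalized by n

result = convolve_via_dft([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0])
print([round(v, 6) for v in result])
```

The zero padding to len(signal) + len(kernel) - 1 is what turns the circular convolution implied by the DFT into an ordinary linear one; this is one of the edge-effect details mentioned above.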
Note that you can use a fast library intended for image filtering; for example, OpenCV can apply a kernel to a 1024x1024 image in 5-30 milliseconds.
One way to speed this up, depending on the target platform, is to collect the distinct values in the kernel and store one copy of the image in memory for each of them. Multiply each copy by its kernel value, then shift the copies to the matching kernel offsets, sum them, and divide to obtain a single image. This could be done on a graphics processor, for example, where memory is ample and which is better suited to this kind of tight, repetitive processing. The image copies need to tolerate pixel overflow, or you could use floating-point values.
I'm trying to use armadillo to do linear regression as in the following function:
void compute_weights()
{
    printf("transpose\n");
    const mat &xt(X.t());
    printf("inverse\n");
    mat xd;
    printf("mul\n");
    xd = (xt * X);
    printf("inv\n");
    xd = xd.i();
    printf("mul2\n");
    xd = xd * xt;
    printf("mul3\n");
    W = xd * Y;
}
I've split this up so I could see what was going on with the program getting so huge. The matrix X has 64 columns and over 23 million rows. The transpose isn't too bad, but that first multiply causes the memory footprint to completely blow up. Now, as I understand it, if I multiply X.t() * X, each element of the matrix product will be the dot product of a column of X and a row of X.t(), and the result should be a 64x64 matrix.
Sure, it should take a long time, but why would the memory suddenly blow up to nearly 30 gigabytes?
Then it seems to hang on to that memory, and then when it gets to the second multiply, it's just too much, and the OS kills it for getting so huge.
Is there a way to compute products without so much memory usage? Can that memory be reclaimed? Is there a better way to represent these calculations?
You don't stand a chance doing this whole multiplication in one shot unless you use a huge workstation. Like hbrerkere said, your initial consumption is about 22 GB. So either be ready for that, or find another way.
If you don't have such a workstation, another way is to do the multiplication yourself, and parallelize it. Here's how you do it:
Don't load the whole matrix into memory, but load parts of it.
Load like a million rows of X, and store it somewhere.
Load a million columns of Y
Use std::transform with the binary operator std::multiplies to multiply the parts you loaded (this will utilize your processor's vectorization, and make it fast), and fill in the partial result you calculated.
Load the next part of your matrices, and repeat
This won't be as efficient, but it will work. Also another option is to consider using Armadillo after decomposing your matrix to smaller matrices, whose multiplication will yield sub-results.
Both methods are much slower than the full multiplication for 2 reasons:
The overhead of loading and deleting data from memory
Matrix multiplication is already an O(N^3) problem, and splitting it into chunks adds loop and bookkeeping overhead on top without reducing the number of arithmetic operations.
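One way to do the product in parts relies on the fact that X'X and X'Y are sums over rows, so they can be accumulated chunk by chunk while only a 64x64 matrix stays in memory. The following sketch uses plain Python lists instead of Armadillo, assumes Y is a single column, and every name in it is hypothetical.

```python
def accumulate_normal_equations(chunks, n_features):
    """Sum X'X (n_features x n_features) and X'Y over chunks of rows."""
    xtx = [[0.0] * n_features for _ in range(n_features)]
    xty = [0.0] * n_features
    for rows, targets in chunks:            # each chunk: list of rows, list of targets
        for row, y in zip(rows, targets):
            for i in range(n_features):
                xty[i] += row[i] * y
                for j in range(n_features):
                    xtx[i][j] += row[i] * row[j]
    return xtx, xty

def solve(a, b):
    """Gaussian elimination with partial pivoting on a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

# Two chunks of data generated from y = 2*x0 + 3*x1
chunks = [([[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0]),
          ([[1.0, 1.0], [2.0, 1.0]], [5.0, 7.0])]
xtx, xty = accumulate_normal_equations(chunks, 2)
weights = solve(xtx, xty)
print(weights)
```

In a real program `chunks` would be a generator that reads a block of rows from disk at a time, so X never has to be held in memory together with its transpose.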
Good luck!
You can compute the weights using far less memory with the QR decomposition (you might want to look up 'least squares QR').
Briefly:
Use Householder transformations to (implicitly) find an orthogonal Q so that
Q'*X = R, where R is upper triangular,
and at the same time transform Y:
Q'*Y = y
Solve
R*W = y for W, using only the top 64 rows of R and y.
If you are willing to overwrite X and Y, then this requires no extra memory; otherwise you will need a copy of X and a copy of Y.
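The steps above can be sketched in a few lines. This version works on copies of X and Y for clarity (an in-place variant would overwrite them), treats Y as a single column, and uses plain Python lists; the function name `least_squares_qr` is made up for this example.

```python
import math

def least_squares_qr(x_rows, y):
    """Solve min ||X w - y|| by Householder QR; works on copies of X and y."""
    a = [row[:] for row in x_rows]
    b = list(y)
    m, n = len(a), len(a[0])
    for k in range(n):
        # Householder vector v that zeroes column k below the diagonal
        norm = math.sqrt(sum(a[i][k] ** 2 for i in range(k, m)))
        if norm == 0.0:
            raise ValueError("rank-deficient matrix")
        alpha = -norm if a[k][k] >= 0 else norm
        v = [a[i][k] for i in range(k, m)]
        v[0] -= alpha
        vtv = sum(t * t for t in v)
        # Apply the reflection I - 2 v v'/(v'v) to the remaining columns and to y
        for j in range(k, n):
            s = 2.0 * sum(v[i] * a[k + i][j] for i in range(m - k)) / vtv
            for i in range(m - k):
                a[k + i][j] -= s * v[i]
        s = 2.0 * sum(v[i] * b[k + i] for i in range(m - k)) / vtv
        for i in range(m - k):
            b[k + i] -= s * v[i]
    # Back-substitution on the top n rows: R w = Q'y
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(a[i][j] * w[j] for j in range(i + 1, n))) / a[i][i]
    return w

# Data generated from y = 2*x0 + 3*x1
weights = least_squares_qr([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]],
                           [2.0, 3.0, 5.0, 7.0])
print(weights)
```

Q is never formed explicitly; each reflection is applied directly to the remaining columns and to the right-hand side, which is what keeps the extra memory small.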
I have 5+ million records to predict people's race. One textual feature gives rise to tens of thousands more. For example, the name 'Smith' gives rise to 'sm', 'mi', 'it', etc. I then need to transform it into a sparse matrix:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
X2= vec.fit_transform(measurements)
Because of the tens of thousands of generated features, I can't use the following to give me an array, otherwise I am getting an out of memory error.
X = vec.fit_transform(measurements).toarray()
As far as I can tell, a lot of other functions/modules in scikit-learn only allow data in array format to be fitted. For example: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA and http://scikit-learn.org/stable/modules/feature_selection.html for dimensionality reduction and feature selection.
pca = PCA(n_components=2)
pca.fit(X) # X works but not X2, though I can't get X with my big data set because of out-of-memory error
I am not certain that this will help, but you can try to slice your X2 into smaller parts (but still as big as possible), and use IncrementalPCA on them.
from sklearn.utils import gen_batches
from sklearn.decomposition import IncrementalPCA
pca = IncrementalPCA()
n_samples, n_features = X2.shape
batchsize = n_features*5
for batch in gen_batches(n_samples, batchsize):
    # Densify only one slice of rows at a time
    pca.partial_fit(X2[batch].toarray())
You may change the constant 5 to a bigger number if your RAM size allows it.
As you noticed you probably won't be able to convert your text features into a numpy array.
So you'll need to focus on techniques that can handle sparse data.
PCA is not one of them.
The reason is that PCA performs centering of the data, which makes the data dense (picture a sparse matrix, then subtract 0.5 from every element).
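The effect of centering on sparsity is easy to see on a toy matrix. This sketch uses plain nested lists rather than a scipy sparse matrix, and the helper names are made up.

```python
def count_nonzeros(matrix):
    return sum(1 for row in matrix for value in row if value != 0)

def center_columns(matrix):
    """Subtract each column's mean, as PCA does before the SVD step."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]

sparse_like = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
centered = center_columns(sparse_like)
print(count_nonzeros(sparse_like), count_nonzeros(centered))  # → 5 20
```

Every zero turns into minus the column mean, so a matrix that stored 5 values now has to store all 20.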
This SO answer provides more explanation and an alternative:
To clarify: PCA is mathematically defined as centering the data (removing the mean value from each feature) and then applying truncated SVD on the centered data.
As centering the data would destroy the sparsity and force a dense representation that often does not fit in memory any more, it is common to directly do truncated SVD on sparse data (without centering). This resembles PCA but it's not exactly the same.
In the context of text data performing SVD after a TfidfVectorizer or a CountVectorizer is actually a famous technique called latent semantic analysis.
As for the feature selection part, you'll probably have to modify the source code of your scoring function (e.g. chi2) so that it handles sparse matrices without making them dense.
It is possible, this is mostly a trade-off between keeping the sparsity of matrices and using efficient array operations.
In your case though I'd try and throw this at a classifier first to see if the extra work is worth your time.
I have a matrix stored in
GLdouble m[16];
Now, using glMultMatrixd(m), I multiplied this matrix with the 3D coordinates of a point. I want to know the new coordinates that result from multiplying by the matrix m. Is there any command in OpenGL that can do this?
No, there isn't any useful way.
In modern GL, the integrated matrix stack has been completely removed, for good reasons. Application programmers are required to write their own matrix functions, or use an existing library like glm (a header-only C++ library which implements all of the matrix functionality that used to be in old OpenGL). It is worth noting in this context that operations like glMultMatrix were never GPU-accelerated and were always carried out directly on the CPU by the GL implementation, so nothing is lost by removing them from the GL.
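Doing the multiplication yourself on the CPU takes only a few lines. The sketch below assumes the 16 values are stored in column-major order, as glMultMatrixd expects, and that the point is extended with w = 1; the function name `transform_point` is made up for this example.

```python
def transform_point(m, point):
    """Multiply a column-major 4x4 matrix (16 values, OpenGL layout) by (x, y, z, 1)."""
    x, y, z = point
    v = (x, y, z, 1.0)
    # Element (row, col) of a column-major matrix lives at m[col * 4 + row]
    out = [sum(m[col * 4 + row] * v[col] for col in range(4)) for row in range(4)]
    return tuple(out)

# Translation by (10, 20, 30): the offsets sit in elements 12, 13, 14
translate = [1, 0, 0, 0,
             0, 1, 0, 0,
             0, 0, 1, 0,
             10, 20, 30, 1]
moved = transform_point(translate, (1.0, 2.0, 3.0))
print(moved)  # → (11.0, 22.0, 33.0, 1.0)
```

For a projective matrix the first three components would still need to be divided by the fourth one to get ordinary 3D coordinates.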
I'm not saying it would be impossible to somehow let OpenGL do that matrix * vector multiplication for you and to read back the result. The most direct approach would be to use transform feedback to capture the results of the vertex shader in a buffer object. However, for transforming a single point, the overhead would be exorbitant.
Another, totally cumbersome, approach to get old GL with its builtin matrix functions to calculate that product for you would be to put your point into the first column of a matrix, multiply the matrix you set by it using glMultMatrix, read back the current matrix, and find the transformed point in the first column of the result.