MPI_SCATTER Fortran Matrices by Rows

What is the best way to scatter a Fortran 90 matrix by its rows rather than its columns? That is, let's say I have a matrix a(4,50) and I want to MPI_SCATTER it onto two processes, where each part is alocal(2,50): rank 0 has rows 1 and 2, and rank 1 has rows 3 and 4. Now, in C this is simple, since arrays are row-major, but in Fortran 90 they are column-major.
I'm trying to avoid using TRANSPOSE to flip a before scattering (i.e., doubling the memory use), and I figure there must be a way in MPI to do this. Would it be MPI_TYPE_VECTOR? MPI_TYPE_CREATE_SUBARRAY?
Likewise, what if I have a 3D array b(4,50,3) and I want two scattered pieces blocal(2,50,3), distributed as above?

Yes, MPI_TYPE_VECTOR and MPI_TYPE_CREATE_SUBARRAY are what you want: the former for your first problem, the latter for your second. Comment if you want me to write the calls for you!
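For reference, here is an untested sketch of one way the 2-D case could look: MPI_TYPE_VECTOR describes a block of two rows, and MPI_TYPE_CREATE_RESIZED shrinks its extent to two array elements so that MPI_SCATTER hands out consecutive row blocks. The sizes are hard-coded to the 4 x 50 example with two ranks.

program scatter_rows
   use mpi
   implicit none

   integer, parameter :: nrows = 4, ncols = 50, lrows = 2   ! run with 2 ranks: 2 rows each
   double precision :: a(nrows, ncols), alocal(lrows, ncols)
   integer :: rowblock, rowblock_resized
   integer(kind=MPI_ADDRESS_KIND) :: lb, extent
   integer :: rank, ierr, i, j

   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

   if (rank == 0) then
      do j = 1, ncols                      ! a(i,j) = row index, so the result is easy to check
         do i = 1, nrows
            a(i,j) = dble(i)
         end do
      end do
   end if

   ! One block of lrows consecutive rows: in each of the ncols columns take
   ! lrows contiguous elements; successive columns are nrows elements apart.
   call MPI_TYPE_VECTOR(ncols, lrows, nrows, MPI_DOUBLE_PRECISION, rowblock, ierr)

   ! Shrink the extent to lrows elements so that the consecutive blocks handed
   ! out by MPI_SCATTER start at a(1,1), a(3,1), ...
   call MPI_TYPE_GET_EXTENT(MPI_DOUBLE_PRECISION, lb, extent, ierr)
   call MPI_TYPE_CREATE_RESIZED(rowblock, 0_MPI_ADDRESS_KIND, lrows*extent, &
                                rowblock_resized, ierr)
   call MPI_TYPE_COMMIT(rowblock_resized, ierr)

   ! Each rank receives its rows as a plain contiguous lrows x ncols block.
   call MPI_SCATTER(a, 1, rowblock_resized, &
                    alocal, lrows*ncols, MPI_DOUBLE_PRECISION, &
                    0, MPI_COMM_WORLD, ierr)

   print *, 'rank', rank, 'got rows', int(alocal(1,1)), 'to', int(alocal(lrows,1))

   call MPI_TYPE_FREE(rowblock_resized, ierr)
   call MPI_TYPE_FREE(rowblock, ierr)
   call MPI_FINALIZE(ierr)
end program scatter_rows

The 3-D case b(4,50,3) works the same way, except that MPI_TYPE_CREATE_SUBARRAY (full sizes (/4,50,3/), subsizes (/2,50,3/), starts (/0,0,0/), MPI_ORDER_FORTRAN) describes each slab, followed by the same resize to a two-element extent.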

Didn't most of the MPI data transfer calls have a stride argument? Set it to the size of the data type times the height of the matrix and there you go...
I've taken a look at the MPI reference and there isn't an explicit argument for that, but if you go to example 5.12, they show how to send strided ints with MPI_Scatterv and MPI_Gatherv.
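If you want that explicit control, here is a hedged sketch of the MPI_Scatterv variant, reusing the resized row-block type and the variables from the sketch above (counts are in units of the datatype, displacements in units of its extent, i.e. two array elements here):

   integer :: counts(2), displs(2)

   counts = 1                        ! one two-row block per rank
   displs = (/ 0, 1 /)               ! measured in extents of rowblock_resized
   call MPI_SCATTERV(a, counts, displs, rowblock_resized, &
                     alocal, lrows*ncols, MPI_DOUBLE_PRECISION, &
                     0, MPI_COMM_WORLD, ierr)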

Related

Small matrices in ScaLAPACK: How to deal with empty blocks

I have a large matrix describing a physical system. The last two rows are fundamentally different from the others and therefore need to be set up separately. Furthermore, it makes no sense to distribute each of these rows over different processes. I want to set up these two rows on the 0th process and then copy them to the global matrix.
What do I have? - A distributed M x N matrix where the upper (M-2) x N block is already filled.
What do I want to do? - Calculate the last 2 x N elements on the 0th process and then copy them with PDGEMR2D
What is the problem? - I need to call PDGEMR2D on all processes. The to-be-copied matrix (I think it's usually called a) therefore needs to be allocated and needs a ScaLAPACK descriptor on all processes. On the 0th process the local matrix is 2 x N; on all other processes it is 0 x N.
How do I deal with the empty submatrices?
Usually, to get the ScaLAPACK descriptors I would call descinit with the local number of rows as LLD. However, this number needs to be >= 1, but on the processes with the empty matrices it is 0.
(Note that Fortran lets you allocate arrays with 0 elements - this is purely a ScaLAPACK issue.)
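For concreteness, here is an untested sketch (variable names hypothetical; the BLACS grid, the context ictxt, the global column count n, and the target matrix b_loc/desc_b/m_glob are assumed to come from your existing setup) of a descriptor for the 2 x N helper matrix that lives entirely on process (0,0). DESCINIT only requires LLD >= MAX(1, LOCr), so clamping the leading dimension with MAX(1, ...) keeps the call legal on the processes whose local part is empty.

   integer              :: desc_a(9), info, ictxt          ! ictxt: your BLACS context
   integer              :: nprow, npcol, myrow, mycol
   integer              :: n                               ! global number of columns
   integer              :: loc_r, loc_c, lld
   integer, parameter   :: m = 2                           ! the two special rows
   integer, external    :: numroc
   double precision, allocatable :: a_loc(:,:)

   call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)
   loc_r = numroc(m, m, myrow, 0, nprow)    ! 2 on process row 0, 0 elsewhere
   loc_c = numroc(n, n, mycol, 0, npcol)    ! N on process column 0, 0 elsewhere
   allocate(a_loc(loc_r, loc_c))            ! a 0 x 0 (or 0 x N) allocation is legal Fortran
   lld = max(1, loc_r)                      ! DESCINIT wants LLD >= MAX(1, LOCr)
   call descinit(desc_a, m, n, m, n, 0, 0, ictxt, lld, info)

   ! fill a_loc on process (0,0), then copy it into the last two rows of the
   ! distributed M x N matrix (b_loc / desc_b / m_glob being your existing matrix):
   call pdgemr2d(m, n, a_loc, 1, 1, desc_a, b_loc, m_glob - 1, 1, desc_b, ictxt)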

What is the fastest way to perform FFT on a large file?

I am working on a C++ project that needs to perform FFTs on large 2D raster data (10 to 100 GB). In particular, the performance is quite bad when applying an FFT to each column, whose elements are not contiguous in memory (they are placed with a stride equal to the width of the data).
Currently, I'm doing this: since the data does not fit in memory, I read several columns, namely n columns, into memory with their orientation transposed (so that a column in the file becomes a row in memory) and apply the FFT with an external library (MKL). I read (fread) n pixels, move on to the next row (fseek by width - n), read n pixels, jump to the next row, and so on. When the operation (FFT) is done with the column chunk, I write it back to the file in the same manner: I write n pixels, jump to the next row, and so on. This way of reading and writing the file takes too much time, so I want to find some way of speeding it up.
I have considered transposing the whole file beforehand, but the entire process includes both row-major and column-major FFT operations, so transposing would not help.
I'd like to hear about any experience or ideas concerning this kind of column-major operation on large data. Any suggestions relating particularly to FFT or MKL would help as well.
Why not work with both transposed and non-transposed data at the same time? That will double the memory requirement, but it may be worth it.
Consider switching to a Hadamard transform. As a complete IPS, the transform requires no multiplications, since all of the coefficients in the transform are plus or minus one. If you need the resulting transform in a Fourier basis, a matrix multiplication will change bases.

What is the most efficient matrix representation in C++?

I hope that this question is not OT.
I'm implementing a VLAD encoder using the VLFeat implementation and SIFT descriptors from different implementations to compare them (OpenCV, VLFeat, OpenSIFT).
This is supposed to be a high-performance application in C++ (I know that SIFT is very inefficient; I'm implementing a parallel version of it).
Now, VLAD wants as input a pointer to a set of contiguous descriptors (mathematical vectors). The point is that these SIFT descriptors are usually represented as a matrix, since that makes them easier to manage.
So suppose that we have a matrix of 3 descriptors in 3 dimensions (I'm using these numbers for the sake of simplicity; in practice it's thousands of descriptors in 128 dimensions):
1 2 3
4 5 6
7 8 9
I need to feed vl_vlad_encode with a pointer to:
1 2 3 4 5 6 7 8 9
A straightforward solution is to save the descriptors in a cv::Mat m object and then pass m.data to vl_vlad_encode.
However, I don't know whether cv::Mat is an efficient matrix representation. For example, Eigen::Matrix is an alternative (I think it's easy to obtain the representation above using this object), but I don't know which implementation is faster/more efficient, or whether there is any other reason why I should prefer one over the other.
Another possible alternative is using std::vector<std::vector<float>> v, but I don't know whether v.data() would give me the representation above, rather than:
1 2 3 *something* 4 5 6 *something* 7 8 9
Obviously *something* would mess up vl_vlad_encode.
Any other suggestion is more than welcome!
Unless you do some weird stuff (see here for details), data in a Mat is guaranteed to be continuous. You can think of a Mat as a lightweight wrapper over a float* (or another type) that allows easier access to the data. So it's as efficient as a raw pointer, but with a few nice-to-have abstractions.
If you need to efficiently load/save from/to file, you can save the Mat in binary format using matread and matwrite.
std::vector<std::vector<float>> v is not going to perform very well without some effort, since the memory will not be contiguous.
Once you have your memory contiguous, be it float[], float[][] or std::array/std::vector, how well it performs will depend on how you iterate over your matrix. If access is random, it makes little difference; if your loops iterate over all columns, then it's better to have your data grouped by column rather than by row.

inverse Fourier transform (FFTW)

I am using a C++ function to compute an inverse Fourier transform.
#include <fftw3.h>

int inYSize = 170; int inXSize = 2280;
// c2r input is inYSize x (inXSize/2 + 1) complex values (the Hermitian half-spectrum)
fftwf_complex* temp = fftwf_alloc_complex(inYSize * (inXSize/2 + 1));
float* outData = new float[inYSize*inXSize];
fftwf_plan mReverse = fftwf_plan_dft_c2r_2d(inYSize, inXSize, temp, outData,
                                            FFTW_ESTIMATE);
fftwf_execute(mReverse);
My input is a 2D array temp of complex numbers. All the elements have real part 1 and imaginary part 0.
So I am expecting the inverse FFT of such an array to be a 2D array of real values. The output array should have a spike at (0,0) and all other values 0. But I am getting completely different values in the output array, even after normalizing by the total size of the array. What could be the reason?
FFTW is not that trivial to deal with when it comes to multidimensional DFTs and complex-to-real transforms.
When doing a C2R transform of an M x N row-major array, the last dimension of the complex input is cut roughly in half (N/2 + 1 values) because of the Hermitian symmetry of real-data transforms, so a full M x N complex temp is twice as big as FFTW expects. That alone may not be the reason for your problem, but it does mean the input layout is not what you are assuming.
More info about this tortuous matter: http://www.fftw.org/doc/One_002dDimensional-DFTs-of-Real-Data.html
"Good Guy Advice": use only the "easier" C2C way of doing things, and take the modulus of the output if you don't know how to process the results, but don't waste your time on n-D complex-to-real transforms.
Because of limited precision, because of the numerical implementation of the DFT, and because of insubordinate drunk bits, you can get values that are not exactly 0 even though they are very small. This is normal behavior for an FFT algorithm.
Besides carefully reading the user manual (http://www.fftw.org/doc/), even if it's a real pain (I lost a few days with this library just to get a 3D transform working and to understand how the data was scaled), you should try a 1D C2C transform before going to 2D C2C and 2D C2R, just to be sure you have some idea of what you're doing.
What's the inverse FFT of a constant plane, where every bin of the "frequency plane" is filled with a one? Are you looking for a new way to define +inf or -inf? In that case I would rather start with the easier division by 0 ^^. The direct FFT should be as you described, with the spike correctly scaled to 1; I'm pretty sure the inverse is not.
Do not hesitate to add more detail to your question, and good luck with FFTW.
With this little information it is hard to tell. What I could imagine is that you are getting spectral leakage due to the window selection (see this Wikipedia article for details about leakage).
What you could do is try another window function to reduce the leakage, or redefine your window size.

expand a dense matrix

What would be the most efficient way to expand a dense matrix with new columns in Fortran?
Say T is a dense matrix m by n
and I would like to make it m by n+1.
One strategy I can think of is to reallocate at each step and assign the last column. Or would there be some better way, such as allocating extra space beforehand, checking whether it is sufficient, and only reallocating when it is not? Any ideas?
Assuming m and n are in some sense not exceedingly large, so that your matrices fit into memory and what you're after is runtime performance, what I'd do is allocate a large matrix and store the actual size separately. This is what, for example, BLAS libraries call the 'leading dimension'. Then, when you need to add a column, you check whether your actual size is still smaller than the maximum size, and reallocate memory only if necessary.
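A minimal sketch of that strategy (untested, names hypothetical): T is allocated with spare capacity up front and the number of columns actually in use is tracked separately, so appending a column is normally just a copy.

   program grow_matrix
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, parameter :: m = 1000            ! number of rows, fixed
      integer :: n_used, capacity
      real(dp), allocatable :: T(:,:)
      real(dp) :: newcol(m)

      capacity = 64                             ! over-allocate: room for 64 columns
      n_used   = 0
      allocate(T(m, capacity))

      newcol = 1.0_dp
      if (n_used < capacity) then               ! spare room left: appending is just a copy
         n_used = n_used + 1
         T(:, n_used) = newcol
      end if
      ! only when n_used reaches capacity do you reallocate, e.g. double the
      ! capacity and copy the first n_used columns once

      print *, 'columns in use:', n_used, 'of', capacity
   end program grow_matrix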
If you have a Fortran 2003 compiler, you might make use of move_alloc: http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/lref_for/source_files/rfmvallo.htm
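A minimal sketch (untested) of extending an m x n matrix by one column this way; move_alloc transfers tmp's storage to T, so there is one copy into tmp but no copy back.

   program extend_matrix
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      real(dp), allocatable :: T(:,:), tmp(:,:)
      integer :: m, n

      m = 4; n = 3
      allocate(T(m, n)); T = 1.0_dp

      allocate(tmp(m, n + 1))                   ! new shape
      tmp(:, 1:n)   = T                         ! one copy of the old contents
      tmp(:, n + 1) = 0.0_dp                    ! the new column
      call move_alloc(tmp, T)                   ! T takes over tmp's storage; tmp is deallocated
      n = n + 1

      print *, 'T is now', size(T, 1), 'x', size(T, 2)
   end program extend_matrix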