I have a PETSc-matrix, and would like to apply a 1d-FFT on each row of that matrix, preferably while keeping the possibility having the matrix distributed over several nodes. Based on the documentation and examples (such as here: https://www.mcs.anl.gov/petsc/petsc-current/src/mat/tests/ex143.c.html) I have to create an FFT-object ("FFT-matrix") and use this object then to create/initialize the vectors used for the FFT itself:
MatCreateFFT(PETSC_COMM_WORLD,DIM,dim,MATFFTW,&A);//Create FFT object
MatCreateVecsFFTW(A,&x,&y,&z); //Initialize Vectors
MatMult(A,x,y); //Apply FFT
Nevertheless, as far as I can see this only will execute a 1d-FFT over a vector, and not a 1d-FFT over each row in my matrix. Of course, I can just iterate over the rows in my matrix and copy them into the vector (and retrieve the result afterwards), but that would slow the process down quite a bit. Or do I have to resort to FFTW without using the interface from PETSc (as described in the same example as above) if I would like to execute the process described above?
You should be able to call MatDenseGetColumnVec and then feed that to your FFT.
Related
I have a (kind of) performance problem in my code, that roots in the chosen architecture.
I will use multidimensional tensors (basically matrices with more dimensions) in the form of cubes to store my data.
Since the dimension is not known at compile-time, I can't use Boost's MultidimensionalArray (IIRC), but have to come up, with my own solution.
Right now, I save each dimension, on it's own. I have a Tensor of dimension (let's say 3), that holds a lot of tensors of dimension 2 (in an std::vector), that each have a std::vector with tensors of dimension 1, that each holds a std::vector of (numerical) data. I use an abstract base-class for my tensor, so everything in there is a pointer to the abstract class, while beeing (secretly) multi- or one-dimensional.
I extract a single numerical data-point by giving a std::list of indices to a tensor, that get's the first element, searches for the according tensor and passes the rest of the list to that tensor in a (kind of) recursive call.
I now have to do a multi-dimensional Fast-Fourier Transformation on that data. I use a Threadpool and Job-Objects, that works on copying data from an Tensor along one dimension, doing an FFT and writes that data back.
I already have logic to implement ThreadPool and organize the dimensions to FFT along, but there is one problem:
My data-structure is the cache-unfriendliest beast, one can think of... While the Data-Copying along the first dimension (that, with it's data in a single 1D-Tensor) is reasonable fast, but in other directions, I need to copy my data from all over the place.
Since there are no race-conditions (I make sure every concurrent FFT is on distinct data-points), I thought, I would not use a Mutex-Guard to let everybody copy at the same time. However this heavily slows down the process ("I copy my data now!" - "No, I copy my data now!"- "But it's my turn now!"...)
Guarding the copy-Process with a mutex, does not increase speed. The FFT of a vector with 1024 elements is way faster, then the copy-process to get these elements, resulting in nearly all of my threads waiting, while one is copying.
Long story short:
Is there any kind of multi-dimensional data-structure, that does not need to set the dimension at compile-time, that allows me to traverse fast along all axis? I searched for a while now, by nothing came up besides Boost MultiArray. Vectorization also does not work since the indices would grow too fast to hold in usual int-types.
I can't think of how to present code-examples here, since most of that code is rather simple, but If needed, I can get that in.
Eigen has multi-dimensional tensor support (nominally unsupported, but written by the DeepMind people, so "somewhat" supported?), and FFTW has 1d to 3d FFTs. Using external libraries with a set of 1D to 3D FFTs would outsource most of the hard work.
Edit: Actually, FFTW has support for threaded n-dimensional FFTs
Lets assume we have a huge symmetric diagonal matrix. What is the efficient way to implement this?
The only way that i could think of is that by using the symmetric property where Xij = Xji, we can reduce the size of this matrix by half. But then representing this matrix using a 2D array would be inefficient, since we cant reduce the matrix size by using arrays.
Another thing representing this matrix using adjacency list also would be inefficient, because relating this matrix to a graph. It would be a density graph. And the operation of adj list takes lots of time such as removing, inserting and searching.
But what about using heaps?
There is no one answer until you decide what you are going to do with this matrix (or maybe matrices?).
If you are just going to store and remember it, then just store it sequentially, leaving out the redundant entries. (Your code knows how to access it, because that is all it does, right?)
More probably, you want to do normal matrix operations on it. In that case, are you trying to make the storage efficient, or the execution? In the later case, I don't see many opportunities based on it being symmetric--the multiplies are the expensive thing and you probably still need all of those. If it is the storage, then are you limiting yourself to operations that only take symmetric in and symmetric out? Sounds awfully specific. If so, then you only need to do the calculations for the part you are storing, because, by definition the other entries are symmetric, so just write your code to generate that part of the matrix and you are done.
I'm trying to write a driver in C++ to calculate the eigenvalues for an asymmetric, real-valued sparse matrix using the fortran functions offered by ARPACK, but I am having a bit of trouble with the reverse communication approach.
Generally, I am trying to solve the normal eigenvalue equation:
A*v = lambda*v
and any interaction with the matrix A is done in ARPACK via a function 'av':
av(n, workd[ipntr[0]], workd[ipntr[1]])
which multiplies the vector held in the array 'workd' beginning at location 'ipntr[0]' and inserts the result into the array 'workd' beginning at location 'ipntr[1]'. Examples of this approach are given in the manual at http://www.caam.rice.edu/software/ARPACK/ and also in the ARPACK/EXAMPLES/SIMPLE/dnsimp.f code.
What I would like to know is how do I actually involve the matrix A? If it is not passed to the routine then how is it possible to find its action on the vector provided?
In the example code dnsimp.f their matrix A is calculated within the function 'av', and is 'derived from the standard central difference discretisation of the 2 dimensional convection-diffusion operator'. However, I believe this is problem specific? It also doesn't seem too useful to have to code the derivation of the matrix A into the function. I can't find much information on this from the manual either.
It doesn't seem to be too much of a problem, since as it is a user defined function I am able to just change the definition of 'av' to include the matrix A as a parameter. However I would like to know how it is done properly in case of any potential compatibility issues.
Thank you!
You don't have to supply the matrix to ARPACK.
All you have to do, is to multiply the matrix with the returned vectors (thus, reverse communication) till the desired convergence is reached.
For information on the algorithms, you should take a look at the users guide and especially on the chapter about the underlying algorithms.
Response to comment: The underlying algorithm is a form of Arnoldi Iteration. The basic algorithm is shown in wikipedia and shows, that the matrix A won't be accessed. Neither directly, nor indirectly.
In particular, the algorithms starts with an arbitrary normalized vector q_1. This vector is returned to the user. The user multiplies this vector with the matrix A using their favourite routine (usually some efficient sparse matrix-vector-multiplication) and returns the result to the Arnoldi Iteration to calculate a part of the Hessenberg matrix H (whose eigenvalues typically converge to the extreme eigenvalues of A) and the next vector q_2. This has to be iterated, till your results are converged.
How to do element wise exponential for a matrix in Cuda programming?
for example:
A = [1 3 4; 6 5 2];
I want to compute:
B = [exp(1),exp(3),exp(4); exp(6);exp(5);(2)]
Is there a way of doing it efficiently and doing it in place (i.e, B replaces A)?
It seems cublas does not provide element wise operation on matrix.
I don't know if libraries exist that do element wise operations on matrices, but you could easily set up a CUDA kernel to do this job. You could for instance give one element of the A matrix to each thread, and they could perform the exponential and write the answer in B. You then call your CUDA kernel as usual. Take a look at this to get an idea of how to implement your kernel and how to call it (but instead of multiplying two vectors like they do in gpuMM you would do an exponential).
EDIT: It looks like you can do element wise operations using Thrust and the set of macros Newton, as is shown in this SO question.
I want to use the Sparse Blas in Fortran95 just for the creation of the matrices and I am using the point entry construction. After creation of the matrix using the command
call duscr_begin(n,n,a,istat)
here a is the handle to the matrix n by n. After inserting value in it, how can I see the final matrix using its handles a ? As I want to use the matrix for some other operation, so I want to see the matrix in three vectors (sparse) form (row_index, Col_index, Value).
detail about this Sparse Blas is given in Chapter 3 and can be seen here
http://www.netlib.org/blas/blast-forum/
actually what i have asked is before 16 days and it is not just writing of a variable to thee screen. I was using some library known as Sparse Blas for creation of the Sparse matrices. Later on by digging in to the library i found the solution to my problem that using the handles how can we get the three vectors row, col and Val. The commands are something like
call accessdata_dsp(mat,a_handle,ierr)
call get_infoa(mat%INFOA,'n',nnz,ierr)
allocate(K0_row(nnz),K0_col(nnz),K0_A(nnz))
K0_row=mat%IA1; K0_col=mat%IA2; K0_A=mat%A
so here nnz are the non zeros entries in the sparse matrix while K0_row, K0_col and K0_A are our required three vectors, which can be used in further calculation.