Dimensioning in MPI Scattering - fortran

So I am working on a simple matrix multiplication code using MPI. One of the problems I am facing is scattering one of the matrices to all the processors, since the dimension of my matrix might not be divisible by the number of processors.
I have also used a variable col_id, which holds the number of columns allotted to each processor, computed with the mod function. For example, if we have 9 columns and 6 processors, the first 3 processors will have a col_id value of 2 while the other three will have a col_id value of 1.
So I used a basic scattering operation.
call MPI_Scatter(b, dim2*col_id, MPI_Integer, b1, dim2*col_id, MPI_Integer, 0, MPI_COMM_WORLD, ierr)
col_id will be different for different processors. Are we allowed to use this variable to specify dimensions in MPI_scatter?

The function MPI_Scatterv is made for exactly that purpose. Instead of a single sendcount you specify sendcounts, an integer array with one entry per rank in the communicator, together with a displs array giving the offset of each rank's chunk in the send buffer.
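As an illustration of that bookkeeping, here is a minimal mpi4py sketch (my own, not the poster's Fortran code; the sizes dim2 = 4 and ncols = 9 are made up). Run it with e.g. mpiexec -n 6 python scatterv_demo.py:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

dim2, ncols = 4, 9                       # rows per column, total columns
base, extra = divmod(ncols, nprocs)      # first `extra` ranks get one extra column
counts = np.array([(base + (p < extra)) * dim2 for p in range(nprocs)])
displs = np.insert(np.cumsum(counts), 0, 0)[:-1]

sendbuf = None
if rank == 0:
    sendbuf = np.arange(dim2 * ncols, dtype='i')   # dummy data for the full matrix
recvbuf = np.empty(counts[rank], dtype='i')        # sized with this rank's own count

comm.Scatterv([sendbuf, counts, displs, MPI.INT], recvbuf, root=0)
print(rank, recvbuf)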

Related

Parallelizing numpy.linalg.matrix_power does not increase performance

I need to parallelize the function numpy.linalg.matrix_power, and I use the following code to test how fast the parallel version can be:
import numpy as np
import multiprocessing as mp
import time as t

def aux_matrix_arg3(A):
    # the argument is ignored; each task raises a random 199x199 matrix to the 100th power
    aaa = np.linalg.matrix_power(np.random.randn(199, 199), 100)
    return 1

N = 10000
processes = 4
chunksize = N // processes
poolWorkers = mp.Pool(processes=processes)
ti = t.time()
A = poolWorkers.map(aux_matrix_arg3, range(N), chunksize=chunksize)
print('t_iteration 3', t.time() - ti)
I have tried with 1 and 4 processes on my laptop and got very similar performance:
4 processes: t_iteration 3 40.7985408306
1 processes: t_iteration 3 40.6538720131
Any clue why I do not get any improvement with parallel processes?
The docs say:
For positive integers n, the power is computed by repeated matrix squarings and matrix multiplications. If n == 0, the identity matrix of the same shape as M is returned. If n < 0, the inverse is computed and then raised to the abs(n).
If your system is set up correctly, BLAS will be used to parallelize the matrix multiplications and LAPACK (maybe SuperLU, the latter probably only in the sparse case) for solving systems of linear equations (used to calculate the inverse). So with a very high probability, your naive single-process code is already heavily optimized!
Despite that, you should be careful: naive parallelization copies all the data to each worker, which can hurt. (Normally one would use memory-mapped arrays to share data instead of copying it.)
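To illustrate that last point, here is a minimal sketch of my own (not the original benchmark; the file name and sizes are made up) in which every worker reads the same memory-mapped array instead of receiving a pickled copy per task:
import numpy as np
import multiprocessing as mp

MMAP_PATH = "shared_data.dat"            # hypothetical file name
SHAPE = (199, 199)

def make_shared_array():
    # create the file-backed array once, in the parent process
    arr = np.memmap(MMAP_PATH, dtype=np.float64, mode="w+", shape=SHAPE)
    arr[:] = np.random.randn(*SHAPE)
    arr.flush()

def worker(_):
    # each worker re-opens the same file read-only; nothing large is pickled per task
    arr = np.memmap(MMAP_PATH, dtype=np.float64, mode="r", shape=SHAPE)
    return float(np.linalg.matrix_power(arr, 10).sum())

if __name__ == "__main__":
    make_shared_array()
    with mp.Pool(processes=4) as pool:
        results = pool.map(worker, range(8))
    print(results[:3])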

Replace do loop with array notation in fortran

I would like to replace the following do loop with FORTRAN's intrinsic functions and array notations.
do i=2, n
do j=2, n
a=b(j)-b(j-1)
c(i,j)=a*c(i-1,j)+d(i,j)
end do
end do
However, as c(i,j) depends on c(i-1,j), none of the following attempts worked, because they do not use the updated c(i-1,j):
!FORALL(i = 2:n , j = 2:n ) c(i,j)=c(i-1,j)*(b(j)-b(j-1))+d(i,j)
!FORALL(i = 2:n) c(i,2:n)=c(i-1,2:n)*(b(2:n)-b(1:n-1))+d(i,2:n)
!c(2:n,2:n)=RESHAPE( (/(c(i-1,2:n)*(b(2:n)-b(1:n-1))+d(i,2:n),i=2,n)/), (/n-1, n-1/))
!c(2:n,2:n)=RESHAPE((/(((b(j)-b(j-1)) *c(i-1,j)+d(i,j) ,j=2,n),i=2,n)/), (/n-1, n-1/))
!c(2:n,2:n)=spread(b(2:n)-b(1:n-1),ncopies = n-1,dim=1) * c(1:n-1,2:n) +d(2:n,2:n)
This is the best I could get, but it still has a do loop:
do i=2, n
c(i,2:n)=c(i-1,2:n)*(b(2:n)-b(1:n-1))+d(i,2:n)
end do
Can all do loops be replaced by intrinsic functions and array notation, or can this one be replaced somehow?
In my experience, nothing beats the traditional do loop. All the array intrinsics create memory and CPU overhead by copying stuff to temporary space (usually on the stack), reshaping and the like. If you're manipulating large arrays, you may encounter out-of-memory issues with the intrinsic functions.
Your best option is to stick to a 2-d loop that has the indexes correctly laid out:
do i=2, n
e = c(1:n,i-1)
do j=2, n
a=b(j)-b(j-1)
c(j,i)=a*e(j)+d(j,i)
end do
end do
By replacing the indexes (and making sure your dimension declarations follow), you are saving on the memory paging. c(j,i) and d(j,i) references travel column-wise inside memory while c(j,i-1) would have cut across columns (and produced paging overhead). So we copy it to a temporary e array.
I think this will be the fastest....
Starting with
do i=2, n
do j=2, n
a=b(j)-b(j-1)
c(i,j)=a*c(i-1,j)+d(i,j)
end do
end do
we can quickly eliminate the do loops with some nifty use of SIZE, SPREAD and EOSHIFT:
res = SPREAD(b - EOSHIFT(b,-1),2,SIZE(c,2))*EOSHIFT(c,-1) + d
Aha, turns out that the error I was receiving (V1) was due to my using RESHAPE rather than SPREAD. I fixed this in the current version (V2) and it compiles & works with both ifort and gfortran.

Multiple matrix-vector calls with CUBLAS

I currently have to perform 128 independent sequential matrix-vector CUBLAS operations. All the matrices and vectors are different. Each independent matrix is stored right after the next in memory and the vectors are likewise stored contiguously in memory (all in row-major form).
A bit more context:
The matrices are (2048 X 8) and the vector is length 2048. The outputs are all independent. Because I have super matrices, I have the following:
matrix[(2048*128)x8]
vector[(2048*128)x1]
output[(8*128)x1]
With cublasSgemv I'm doing a transpose on each mini matrix first and then adding (rather than replacing) the result into memory with:
cublasSgemv(*handle, CUBLAS_OP_T, Bdim, Adim, scale1, d_matrix + offset1, Bdim, d_vector + offset2, 1, scale2, out + offset3, 1);
I am making 128 such calls which I would like to do in one.
The profiler shows significant performance degradation from making these multiple calls. What is the best way to do multiple matrix-vector operations? Is there a way to batch them together into one fast call?
Are streams the best way to go, or is there some way to make a call with relevant offsets (to index into my array of matrices and vectors)? The only other efficient option seemed to be to use a CUSPARSE call and stick all the matrices on the diagonal.
NOTE: I'm not interested in getting the transposes or row/column major ordering in the gemv call correct for this particular question.
Updated
In fact you have to pay special attention to the row/column-major ordering if you want to speed up your code in this case.
As shown in your revised question, you use row-major matrices. Then you have a super-matrix A[(2048*128)x8] and a super-vector V[(2048*128)x1]. And here I assume you want a col-major matrix output[8x128] (which can be seen as a super-vector [(8*128)x1]), where each column is the result of transpose( miniA[2048x8] ) * miniV[2048x1].
On the other hand, CUBLAS assumes that matrices are stored in column-major. So it may need some extra matrix transpose routines to change the ordering.
Since you need 128 independent [8x1] results, you should be able to calculate them in 4 CUDA API calls, which should be more efficient than your original 128 calls.
1. Row-major A[(2048*128)x8] can be seen as column-major AA[8x(2048*128)].
   B[8x(2048*128)] = AA[8x(2048*128)] * diag( V[(2048*128)x1] ) by 1 dgmm()
2. C[(2048*128)x8] = transpose( B[8x(2048*128)] ) by 1 geam()
3. Col-major C[(2048*128)x8] can be seen as col-major CC[2048x(8*128)].
   O[1x(8*128)] = ones[1x2048] * CC[2048x(8*128)] by 1 gemv()
4. Row vector O[1x(8*128)] can be seen as col-major matrix OO[128x8].
   output[8x128] = transpose( OO[128x8] ) by 1 geam()
This col-major output[8x128] is what you want.
Since you need adding rather than replacing, you may need one more call to add the original values to output.
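As a quick numpy check of my own (not cuBLAS code) that this 4-step reformulation produces the same numbers as 128 independent transpose(miniA) * miniV products; the order='F' reshapes mimic the column-major reinterpretations:
import numpy as np

nmat, rows, cols = 128, 2048, 8
A = np.random.randn(nmat * rows, cols)          # row-major super-matrix
V = np.random.randn(nmat * rows)                # super-vector

# reference: 128 independent gemv-style products, one column per mini matrix
ref = np.stack([A[k*rows:(k+1)*rows].T @ V[k*rows:(k+1)*rows]
                for k in range(nmat)], axis=1)  # shape (8, 128)

B = A.T * V                                     # step 1: dgmm (scale the columns of AA by V)
C = B.T                                         # step 2: geam (transpose)
CC = C.reshape(rows, cols * nmat, order='F')    # step 3: view col-major C as 2048 x (8*128)
O = np.ones(rows) @ CC                          #         gemv (column sums)
output = O.reshape(nmat, cols, order='F').T     # step 4: view O as col-major 128x8, then geam (transpose)

print(np.allclose(ref, output))                 # True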
I have done a very quick launch of the batchCUBLAS SDK example. I have considered 128 independent runs for matrices of size 2048x8 and 8x1. Here are the results on an NVIDIA GeForce GT 540M (compute capability 2.1) and on a Kepler K20c (compute capability 3.5).
For the NVIDIA GeForce GT 540M case, there is no relevant improvement for the "streamed" and "batched" versions against the "non-streamed" cuBLAS execution.
For the NVIDIA Kepler K20c, I have obtained
sgemm 1.87 GFlops (non-streamed); 3.08 GFlops (streamed); 6.58 GFlops (batched);
dgemm 1.00 GFlops (non-streamed); 1.43 GFlops (streamed); 6.67 GFlops (batched);
The streamed and batched cases seem to improve noticeably on the non-streamed case for single precision.
Disclaimers
I'm not accounting for transposition, as you do;
The SDK example considers matrix-matrix multiplications, whereas you need matrix-vector multiplications; streaming is possible for gemv, but not batching.
I hope these partial results provide you with some useful information.

Compute rank of Matrix

I need to calculate the rank of a 4096x4096 sparse matrix, and I use C/C++ code.
I found some libraries (like Armadillo) that do it, but they're too slow (almost 5 minutes).
I've also tried two open-source alternatives to Matlab (FreeMat and Octave), but both crashed when I tried to run a test script.
5 minutes isn't so much, but I have to compute the rank of something like a million matrices, so the faster the better.
Does anyone know a fast library for rank computation?
The Eigen library supports sparse matrices, try it out.
Computing the algebraic rank is O(n^3), where n is the matrix size, so it's inherently slow. You need, e.g., to perform pivoting, and this is slow and inaccurate if your matrix is not well conditioned (for n = 4096, a typical matrix is very ill conditioned).
Now, what is the rank? It is the dimension of the image. It is very difficult to compute when n is large, and it will be spoiled by any small numerical inaccuracy in the input. For n = 4096, unless you happen to have particularly well conditioned matrices, this will prevent you from doing anything useful with a pivoting algorithm.
The best way is in fact to fix a cutoff epsilon, compute the singular values s_1 > ... > s_n and take as the rank the lowest integer r such that sum(s_i^2, i > r) < epsilon^2 * sum(s_i^2).
You thus need a sparse SVD routine, eg. from there.
This may not be faster, but to the very least it will be correct.
You can ask for only as many singular values as you need, to speed things up. This is a tough problem, and with no info on the background and how you got these matrices, there is nothing more we can do.
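A rough sketch of that cutoff criterion (my own illustration; eps is an assumption to tune). The helper only needs the singular values, so for a 4096x4096 sparse matrix you would feed it the output of a sparse SVD routine such as scipy.sparse.linalg.svds; the demo below uses a small dense matrix of known rank just to show the criterion:
import numpy as np

def numerical_rank(s, eps=1e-6):
    # rank = smallest r such that sum(s_i^2, i > r) < eps^2 * sum(s_i^2)
    s2 = np.sort(np.asarray(s)**2)[::-1]
    tail = np.sum(s2) - np.cumsum(s2)           # tail[r-1] = sum of s_i^2 for i > r
    return int(np.argmax(tail < eps**2 * np.sum(s2))) + 1

# demo: a 200x200 matrix of rank 50 (for large sparse matrices, get s from
# a sparse SVD such as scipy.sparse.linalg.svds instead of np.linalg.svd)
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50)) @ rng.standard_normal((50, 200))
s = np.linalg.svd(A, compute_uv=False)
print(numerical_rank(s))                        # 50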
Try the following code (the documentation is here).
It is an example of calculating the rank of the matrix A with the Eigen library:
#include <Eigen/Dense>
using namespace Eigen;

MatrixXd A(2,2);
A << 1, 0, 1, 0;
FullPivLU<MatrixXd> luA(A);
int rank = luA.rank();

Higher exponent in pow reduces instructions in shader?

When I compile an HLSL shader with pow(foo, 6) or pow(foo, 8), the compiler creates assembly that has about 10 more instructions than if I create the same shader with pow(foo, 9) or pow(foo,10) or pow(foo,7).
Why is that?
Instructions or instruction slots?
The pow instruction takes three slots, whereas the mul instruction takes only 1.
(Reference: instruction sets for: vs_2_0, ps_2_0, vs_3_0, ps_3_0)
When you are writing a shader you generally want to keep the instruction slot count down, because you have a limited number of instruction slots as defined by the shader model. It is also a reasonable way to approximate the computational complexity of your shader (ie: how fast it will run).
A power of 1 is obviously a no-op. A power of 2 requires one mul instruction. Powers of 3 and 4 can be done with two mul instructions. Powers of 5, 6, and 8 can be done with three mul instructions.
(I imagine the math behind this optimisation is explained by the link that Jim Lewis posted.)
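As a quick cross-check of those counts (a sketch of my own; this is plain binary square-and-multiply, which is not always the optimal addition chain, but it matches the mul counts quoted above for small exponents):
def square_and_multiply_muls(n):
    # squarings = floor(log2(n)); extra multiplies = (number of set bits in n) - 1
    return (n.bit_length() - 1) + (bin(n).count("1") - 1)

for n in range(2, 11):
    print(n, square_and_multiply_muls(n))   # 8 needs 3 muls; 7, 9 and 10 need 4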
The likely reason the compiler is choosing three mul instructions over a single pow instruction (both use the same number of instruction slots) is that the pow instruction with a constant exponent will also require the allocation of a constant register to store that exponent. Obviously using up three instruction slots and no constant registers is better than using up three slots and one constant register.
(Why you are getting 10 more instructions, I am not sure; it will depend on your shader code. The HLSL compiler does many weird and wonderful things in the name of optimisation.)
If you use the shader compiler (fxc) in the DirectX SDK with the options /Cc /Fc output.html, it will give you a nice assembly listing you can examine, including a count of the number of instruction slots used.
It might be doing some sort of exponentiation by squaring optimization, where the number of operations depends on the number of bits set to 1 in the exponent, and their positions. (That doesn't quite match what you're describing, though: you'd expect powers of two to be more efficient than exponents with more bits set, in a pure square-and-multiply implementation.)