Which dgemm call is the fastest? - fortran
I need to do two matrix-matrix multiplications to evaluate some intermediates.
I can do this using several variants of dgemm. The set I had until now uses one 'N','N' multiplication and one 'T','N' ('N' meaning normal, 'T' meaning transposed):
call dgemm('N','N',Pdim,Kdim,pqdim, &
& 1.0d0,B,Pdim,D(1,1,iamthr),pqdim, &
& 0.0d0,Etilde(1,1,iamthr),Pdim )
call dgemm('T','N',pqdim,Kdim,Pdim, &
& 1.0d0,B,Pdim,Etilde(1,1,iamthr),Pdim, &
& 0.0d0,E(1,1,iamthr),pqdim )
Here Pdim is the dimension of P, Kdim the dimension of K, and pqdim the dimension of pq and ij. Kdim is the smallest dimension, Pdim ranges from a couple hundred to maybe 2000, and pqdim can be anything from 1000 to 100000.
Now, I tried 3 versions of dgemm calls:
a) 'N','N' + 'N','N'
call dgemm('N','N',Pdim,Kdim,pqdim, &
& 1.0d0,B,Pdim,D(1,1,iamthr),pqdim, &
& 0.0d0,Etilde(1,1,iamthr),Pdim )
call dgemm('N','N',pqdim,Kdim,Pdim, &
& 0.50d0,Bt,pqdim,Etilde(1,1,iamthr),Pdim, &
& 0.0d0,E(1,1,iamthr),pqdim )
b) 'T','N' + 'T','N'
call dgemm('T','N',Pdim,Kdim,pqdim, &
& 1.0d0,Bt,pqdim,D(1,1,iamthr),pqdim, &
& 0.0d0,Etilde(1,1,iamthr),Pdim )
call dgemm('T','N',Kdim,pqdim,Pdim, &
& 1.0d0,Etilde(1,1,iamthr),Pdim,B,Pdim, &
& 0.0d0,E(1,1,iamthr),Kdim )
c) 'N','N' + 'T','N' (see above)
I don't know why, but combination c) is the fastest, and this does not make sense to me. In combinations a) and b) I use a pre-transposed matrix Bt, and the leading dimension of E is different, of course.
It does not seem logical that c) should be the fastest: either 'N','N' is faster than 'T','N' or the other way around, so either way a) or b) would have to be the fastest.
So I'm left with a couple of possibilities:
Either the compiler (ifort 19) notices the two consecutive dgemm calls and somehow magically fuses them, or the dimensions are so vastly different that this alone makes a huge difference. In the latter case I would still guess combination b) to be the fastest, because there pqdim (the biggest dimension) is the leading dimension of both matrices...
Or maybe I just missed something essential?
Is there a way to set up an app to solve equations and then compare them in C++?
I am trying to write a piece of code for my old high-school teacher, for a game he had us play that is literally called the "Dice Game." The game takes two d12s and multiplies them together to get a number, D. Then you roll three d6s to get your A, B, and C variables. You then either add, subtract, multiply, divide, exponentiate, or take a root to get as close as you can to D. Those operations stand for x and y in the equation AxByC=D. I don't know how else to word this, but I am having trouble finding a way to solve these equations and then compare them. Maybe I am missing something simple. EDIT: I should probably be clearer about the question. I know how to set all the equations up. It is just a matter of finding a way to compare the answers to the D variable to see which one is closest. The closest number to D wins; that is the whole point of the dice game.
If you are just trying to compare the answers to the D variable, why not loop through each equation's result and compare it to D?

    for (int i = 0; i < equationResults.size(); i++) {
        if (equationResults[i] == D)
            return true;
    }

EDIT: If you are trying to find the closest to D, store the absolute difference between each answer and D, then return the minimum:

    for (int i = 0; i < equationResults.size(); i++)
        closeToD[i] = abs(D - equationResults[i]);
    return *min_element(closeToD.begin(), closeToD.end());
Since you can juggle the values around as well as picking operators, you actually have two problems: generating the permutations of the variables and generating the permutations of the operators. The first part is rather straightforward:

    std::array<int, 3> input; // the three d6 rolls
    std::sort(input.begin(), input.end());
    do {
        compute(input[0], input[1], input[2]);
    } while (std::next_permutation(input.begin(), input.end()));

The compute part could be a function that takes such an array of 3 values and finds the best value, the one closest to D, or just all values.

Generating all permutations of the operators is slightly more annoying, because next_permutation can't compare them, and also we accept duplicates. The easiest way is to just brute-force through them; I'll do it just for the slightly easier operators:

    std::array<int, 16> compute(int a, int b, int c) {
        return {
            a + b + c, a + b - c, a + b * c, a + b / c,
            a - b + c, a - b - c, a - b * c, a - b / c,
            a * b + c, a * b - c, a * b * c, a * b / c,
            a / b + c, a / b - c, a / b * c, a / b / c,
        };
    }

Generating such a list of operations programmatically is a bit more challenging; you can't simply do (a op b) op c because of operator precedence. Writing the expressions out literally, as above, guarantees that the results are actually achievable, because the precedence is built into the language. This will still do redundant computations - e.g. in the first case, the result will be the same regardless of the permutation of a/b/c. Eliminating those is perhaps a more interesting exercise for later. A small relief is the fact that if a == b or b == c, next_permutation will already take care of that for us, cutting the number of iterations from 6 to either 3 or 1.
Eigen virtually extend sparse matrix
I have a dense matrix A of size 2N*N that has to be multiplied by a matrix B of size N*2N. Matrix B is actually a horizontal concatenation of two sparse matrices, X and Y, and requires only read-only access. Unfortunately for me, there doesn't seem to be a concatenation operation for sparse matrices. Of course, I could simply create a matrix of size N*2N and populate it with the data, but this seems rather wasteful. It seems like there could be a way to group X and Y into some sort of matrix view. An additional simplification in my case is that either X or Y is a zero matrix.
For your specific case, it is sufficient to multiply A by either X or Y - depending on which one is nonzero. The result will be exactly the same as the multiplication by B (simple matrix algebra).
If your result matrix is column major (the default), you can assign partial results to vertical sub-blocks like so (if X or Y is structurally zero, the corresponding sub-product is calculated in O(1)):

    typedef Eigen::SparseMatrix<float> SM;
    void foo(SM& out, SM const& A, SM const& X, SM const& Y) {
        assert(X.rows() == Y.rows() && X.rows() == A.cols());
        out.resize(A.rows(), X.cols() + Y.cols());
        out.leftCols(X.cols())  = A * X;
        out.rightCols(Y.cols()) = A * Y;
    }

If you really want to, you could write a wrapper class which holds references to two sparse matrices (X and Y) and implement operator*(SparseMatrix, YourWrapper) -- but depending on how you use it, it is probably better to make an explicit function call.
Translating Matlab's bsxfun to Eigen
Say we have a matrix A of dimension MxN and a vector a of dimension Mx1. In Matlab, to multiply a with all columns of A, we can do

    bsxfun(@times, a, A)

Is there an equivalent approach in Eigen, without having to loop over the columns of the matrix? I'm trying to do

    M = bsxfun(@times, a, A) + bsxfun(@times, a2, A2)

and hoping that Eigen's lazy evaluation will make it more efficient. Thanks!
You can do:

    M = A.array().colwise() * a.array();

The .array() is needed to redefine the semantics of operator* to coefficient-wise products (not needed if A and a are Array<> objects). In this special case, it is probably better to write it as a scaling operation:

    M = a.asDiagonal() * A;

In both cases you won't get any temporary, thanks to lazy evaluation.
General method for constructing bitwise expressions satisfying constraints/with certain values?
Say I'm looking for a bitwise function f to have certain values, for instance:

    f(0b00, 0b00) != 0
    f(0b00, 0b10) == 0
    f(0b10, 0b10) != 0
    f(0b11, 0b10) != 0
    f(0b01, 0b10) == 0

Is there a general method for constructing a single bitwise expression f for such systems? (I don't know for sure, but I think there might be crappy solutions possible if you have gigantic expressions masking out one bit at a time, so let's say that the expressions have to work for all sizes of ints.) The best I've been able to do to convert the above is:

    f(int a, int b) {
        if (a == 0) {
            return b == 0;
        } else {
            return (a & b) != 0;
        }
    }

I have a suspicion that it's difficult to combine (x==0) conditions with (x!=0) conditions (given x, is there a bitwise function f such that x==0 <=> f(x)!=0?), but I don't know how much of an impediment that is here. Any answers would be pored over with great interest :)

Peace, S
The most general construction is an extended version of "minterms". Use bitwise operators to construct a predicate that is -1 iff the input matches a specific thing, AND the predicate with whatever you want the result to be, then OR all those things together. That leads to horrible expressions of course, possibly of exponential size.

Using arithmetic right shifts, you can construct a predicate p(x, c) = x == c:

    p(x, c) = ~(((x ^ c) >> 31) | (-(x ^ c) >> 31))

Replace 31 by the size of an int minus one. The only number such that it and its negation are both non-negative is zero, so the thing inside the final complement is only zero if x ^ c == 0, which is the same as saying that x == c.

So in this example, you would have:

    (p(a, 0b00) & p(b, 0b00)) | (p(a, 0b10) & p(b, 0b10)) | (p(a, 0b11) & p(b, 0b10))

Just expand it... into something horrible. Obviously this construction usually doesn't give you anything sensible. But it's general.

In the specific example, you could do:

    f(a, b) = (p(a, 0) & p(b, 0)) | ~p(a & b, 0)

which can be simplified a little again (obviously the xors go away if c == 0, and the two complements cancel each other out).
Using std.range.Lockstep as an input range
Duplicating http://forum.dlang.org/thread/arlokcqodltcazdqqlby#forum.dlang.org to compare answer speed :)

I basically want to be able to do stuff like this:

    auto result = map!( (a, b) => a + b )( lockstep(range1, range2) );

Are there any standard short ways to wrap an input range around a struct with opApply (which Lockstep is)? Also, what about redesigning Lockstep as a proper range? I could do a pull request but am not sure about current intentions.
And the prize goes to D.learn and Simen Kjaeraas: use std.range.zip instead:

    auto result = map!( (a, b) => a + b )( zip(range1, range2) );

The reason there are two ways is that lockstep works better with foreach:

    foreach (a, b; lockstep(A, B)) {
        // Use a and b here.
    }

Contrast with zip:

    foreach (a; zip(A, B)) {
        // Use a[0] and a[1] here.
    }

There have been suggestions to better integrate tuples into the language, so in the future zip may have all the advantages of lockstep (and vice versa), but don't hold your breath.