Eigen3/C++: MatrixXd multiply one row with another - c++

Using the Eigen3/C++ Library, given a MatrixXd
        / x0  ...  y0 \
        | x1  ...  y1 |
    M = | ..  ...  .. |
        \ xN  ...  yN /

what is the fastest method to achieve a modified version as shown below?

         / x0*y0  ...  y0 \
         | x1*y1  ...  y1 |
    M' = |   ..   ...  .. |
         \ xN*yN  ...  yN /
That is, one column (the one with the x-s) is replaced by itself
multiplied with another column (that with the y-s).

Do you mean how to coefficient-wise multiply-assign the first and last column vectors? There are many ways of doing it, but the easiest/fastest might be
Eigen::MatrixXd M2 = M;
M2.leftCols<1>().array() *= M2.rightCols<1>().array();
An alternative might be constructing an uninitialized matrix with a given number of rows/cols and then block-assigning, like
Eigen::MatrixXd M2( M.rows(), M.cols() ); // uninitialized, same size as M
M2.rightCols( M.cols() - 1 ) = M.rightCols( M.cols() - 1 );
M2.leftCols<1>() = M.leftCols<1>().cwiseProduct( M.rightCols<1>() );
Which is faster I don't know (but your preferred profiler does).
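For completeness, here is a minimal self-contained sketch of the first variant (the matrix values are made up purely for illustration):

#include <Eigen/Dense>
#include <iostream>

int main()
{
    Eigen::MatrixXd M(3, 3); // a small test matrix
    M << 1, 7, 2,
         3, 8, 4,
         5, 9, 6;

    Eigen::MatrixXd M2 = M;
    // coefficient-wise multiply-assign: first column *= last column
    M2.leftCols<1>().array() *= M2.rightCols<1>().array();

    std::cout << M2 << "\n"; // first column is now 2, 12, 30
}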
For future questions, here is the official Eigen quick reference ;)


Ignore missing values when generating new variable

I want to create a new variable in Stata, that is a function of 3 different variables, X, Y and Z, like:
gen new_var = (((X)*3) + ((Y)*2) + ((Z)*4))/7
All observations have missing values for one or two of the variables.
When I run the aforementioned command, all it generates are missing values, because no observation has values for all 3 of the variables. I would like Stata to complete the function ignoring the missing variables.
I tried the following commands without success:
gen new_var= (cond(missing(X*3),., X) + cond(missing(Y*2),., Y))/7
gen new_var= (!missing(X*3+Y*2+Z*4)/7)
gen new_var= (max(X , Y, Z)/7) if missing(X , Y, Z)
The egen command does not allow complicated functions; otherwise rowtotal() could work.
EDIT:
To clarify, "ignoring missing variables" means that if even one of the component variables is not missing, the function should be applied to that variable alone and produce a value for the new variable. The new variable should be missing only when all three component variables are missing.
I am going to guess that "ignoring missing values" means "treating them as zeros". If you have some other idea, you should make it explicit.
That could be
gen new_var = (cond(missing(X), 0, 3 * X) ///
             + cond(missing(Y), 0, 2 * Y) ///
             + cond(missing(Z), 0, 4 * Z)) / 7
Let's look at your solutions and explain why each is wrong, either in general or in most cases.
(cond(missing(X*3),., X) + cond(missing(Y*2),., Y))/7
It is sufficient to note that if X is missing, then X * 3 is missing too, so cond() yields missing. The same kind of remark applies to the terms involving Y and Z. So you are replacing missing values with missing values, which is no gain.
!missing(X*3+Y*2+Z*4)/7
Given the information that at least one of X, Y, Z is always missing, this always evaluates to 0/7, that is, 0. Even if X, Y, Z were all non-missing, it would evaluate to 1/7. That is a long way from the sum you want. missing() always yields 1 or 0, and its negation thus 0 or 1.
(max(X, Y, Z)/7) if missing(X , Y, Z)
The maximum of X, Y, Z will be the right answer if and only if one of the values is not missing and the other two are missing. max() ignores missings to the extent possible (even though in other contexts missings are treated as if arbitrarily large positive numbers).
If you just want to "ignore missing values" without "treating them as zeros", the following will work:
clear
set obs 10
generate X = rnormal(5, 2)
generate Y = rnormal(10, 5)
generate Z = rnormal(1, 10)
replace X = . in 2
replace Y = . in 5
replace Z = . in 9
generate new_var = (((X)*3) + ((Y)*2) + ((Z)*4)) / 7 if X != . | Y != . | Z != .
list
     +--------------------------------------------+
     |        X          Y          Z     new_var |
     |--------------------------------------------|
  1. | 3.651024    3.48609   -24.1695   -11.25039 |
  2. |        .   14.14995   8.232919           . |
  3. | 3.689442   9.812483   1.154064    5.044221 |
  4. | 2.500493   13.02909    5.25539    7.797317 |
  5. |  4.19431          .   6.584174           . |
  6. | 7.221717   13.92533   5.045283    9.956708 |
  7. | 5.746871   14.26329   3.828253    8.725744 |
  8. | 1.396223    16.2358   19.01479    16.10277 |
  9. | 4.633088   13.95751          .           . |
 10. | 2.521546   4.490258  -3.396854     .422534 |
     +--------------------------------------------+
Alternatively, you could also use the inlist() function:
generate new_var = (((X)*3) + ((Y)*2) + ((Z)*4)) / 7 if !inlist(., X, Y, Z)

Efficient parallelisation of a linear algebraic function in C++ OpenMP

I have little experience with parallel programming and was wondering if anyone could have a quick glance at a bit of code I've written and see if there are any obvious ways I can improve the efficiency of the computation.
The difficulty arises from the fact that I have multiple matrix operations of unequal dimensionality to compute, so I'm not sure of the most compact way of coding the computation.
Below is my code. Note this code DOES work. The matrices I am working with are of dimension approx 700x700 [see int s below] or 700x30 [int n].
Also, I am using the Armadillo library for my sequential code. It may be the case that parallelizing with OpenMP while retaining the Armadillo matrix classes is slower than defaulting to the standard library; does anyone have an opinion on this (before I spend hours overhauling!)?
double start, end, dif;
int i, j, k;    // iteration counters
int s, n;       // matrix dimensions
mat B; B.load(...location of stored s*n matrix...); // input objects loaded from file
mat I; I.load(...s*s matrix...);
mat R; R.load(...s*n matrix...);
mat D; D.load(...n*n matrix...);
double e = 0.1; // scalar parameter
s = B.n_rows; n = B.n_cols;
mat dBdt; dBdt.zeros(s, n); // object for storing output of function

// 100x sequential computation using Armadillo linear algebraic functionality
start = omp_get_wtime();
for (int r = 0; r < 100; r++) {
    dBdt = B % (R - (I * B)) + (B * D) - (B * e);
}
end = omp_get_wtime();
dif = end - start;
cout << "Seq computation: " << dBdt(0,0) << endl;
printf("relaxation time = %f", dif);
cout << endl;

// 100x parallel computation using OpenMP
omp_set_num_threads(8);
start = omp_get_wtime();
for (int r = 0; r < 100; r++) {

    // parallel computation of I * B
    #pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j, k) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < s; k++) {
                dBdt(i, j) += I(i, k) * B(k, j);
            }
        }
    }

    // parallel computation of B % (R - (I * B))
    #pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            dBdt(i, j) = R(i, j) - dBdt(i, j);
            dBdt(i, j) *= B(i, j);
            dBdt(i, j) -= B(i, j) * e;
        }
    }

    // parallel computation of B * D
    #pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j, k) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                dBdt(i, j) += B(i, k) * D(k, j);
            }
        }
    }
}
end = omp_get_wtime();
dif = end - start;
cout << "OMP computation: " << dBdt(0,0) << endl;
printf("relaxation time = %f", dif);
cout << endl;
If I hyper-thread 4 cores I get the following output:
Seq computation: 5.54926e-10
relaxation time = 0.130031
OMP computation: 5.54926e-10
relaxation time = 2.611040
This suggests that although both methods produce the same result, the parallel formulation is roughly 20 times slower than the sequential one.
It is possible that for matrices of this size, the overheads involved in this 'variable-dimension' problem outweigh the benefits of parallelizing. Any insights would be much appreciated.
Thanks in advance,
Jack
If you use a compiler which corrects your bad loop nests and fuses loops to improve memory locality for non-parallel builds, OpenMP will likely disable those optimizations. As recommended by others, you should consider an optimized library such as MKL or ACML. The default BLAS typically provided with distros is not multithreaded.
The Art of HPC is exactly this efficiency (poor grants never get HPC cluster quota), so the first hope is that your process never re-reads from file.
Why? Because this comment would be an HPC-killer:
I need to repeat this computation many thousands of times
Fair enough to say, this comment has increased the overall need to completely review the approach and to re-design the future solution so that it does not rely on a few tricks, but indeed gains from your case-specific arrangement.
Last but not least: [PARALLEL] scheduling is not needed here, as "just"-[CONCURRENT] process scheduling is quite enough. There is no need to orchestrate any explicit inter-process synchronisation or message-passing, and the processing can simply be arranged for the best performance possible.
No "...quick glance at a bit of code..." will help.
You need to first understand both your whole process and the hardware resources it will be executed on.
The CPU type will tell you the available instruction-set extensions for advanced tricks; the L3 / L2 / L1 cache sizes plus the cache-line size will help you decide on the best cache-friendly re-use of cheap data access (not paying hundreds of [ns] when one can operate smarter, on just a few [ns], from a not-yet-evicted NUMA-core-local copy).
The Maths first, implementation next:
As given dBdt = B % ( R - (I * B) ) + ( B * D ) - ( B * e )
On a closer look, anyone ought to be ready to realise the HPC/cache-alignment priorities and the wrong-looping traps:
dBdt[s,n] = B[s,n] % ( R[s,n] - I[s,s] * B[s,n] )  ... element-wise op over a row-by-column sum-product, consuming B[s,n] column-wise
          +            B[s,n] * D[n,n]             ... sum-product op: B[s,n] rows times D[n,n] columns
          -            B[s,n] * e                  ... element-wise op: B[s,n] rows scaled by a scalar
Having this in mind, efficient HPC loops will look much different.
Depending on the real CPU caches, the loop may very efficiently co-process the naturally B-row-aligned ( B * D ) - ( B * e ) in a single phase, and also the highest-re-use-efficiency part of the element-wise longest pipeline B % ( R - ( I * B ) ), here having a chance to re-use ~ 1000 x ( n - 1 ) cache hits of B-column-aligned data, which ought to fit quite well into the L1 data-cache footprint, so achieving savings on the order of seconds just from cache-aligned loops.
Only after this cache-friendly loop alignment is finished may distributed processing help, not before.
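As a rough illustration only (plain row-major C arrays standing in for the Armadillo matrices, which are column-major internally; the function name and layout are my assumptions, not the asker's code), such a fused loop nest could look like:

// dBdt = B % (R - I*B) + B*D - B*e, computed in one pass:
// two outer loops over the output element (i, j), two inner
// reduction loops, one write per element.
void dBdt_fused(const double *B, const double *I, const double *R,
                const double *D, double e, int s, int n, double *dBdt)
{
    for (int i = 0; i < s; ++i) {
        for (int j = 0; j < n; ++j) {
            double ib = 0.0;                        // (I * B)(i, j)
            for (int k = 0; k < s; ++k)
                ib += I[i * s + k] * B[k * n + j];
            double bd = 0.0;                        // (B * D)(i, j)
            for (int k = 0; k < n; ++k)
                bd += B[i * n + k] * D[k * n + j];
            dBdt[i * n + j] = B[i * n + j] * (R[i * n + j] - ib)
                            + bd - B[i * n + j] * e;
        }
    }
}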
So, an experimentation plan setup:
Step 0: The ground truth: ~ 0.13 [s] for dBdt[700,30] using Armadillo, over 100 test loops.
Step 1: The manual-serial version: test the rewards of the best cache-aligned code (not the posted one, but a math-equivalent, cache-line-re-use-optimised one), where there ought to be no more than four for(){...} code blocks, two nested on the outside and the remaining two inside, to meet the linear-algebra rules without devastating the benefits of cache-line alignment. There is some residual potential to gain a bit more in [PTIME] from a duplicated [PSPACE] data layout (keeping both a FORTRAN-order and a C-order copy, for the respective re-reading strategies), as the matrices are miniature in size and the L2 / L1 data caches available per CPU core have grown well in scale.
Step 2: The manual-omp( <= NUMA_cores - 1 ) version: test whether omp can indeed yield any "positive" Amdahl's-Law speedup (beyond the omp setup overhead costs). A careful process-to-CPU-core affinity mapping may help avoid cache eviction by non-HPC threads spoiling the cache-friendly layout: reserve a configuration-"reserved" set of ( NUMA_cores - 1 ) cores for the HPC process and affinity-map all other (non-HPC) processes onto the last (shared) CPU core, thus helping the HPC cores retain their cache lines un-evicted by kernel/scheduler-injected non-HPC threads.
(As seen in Step 2, there are arrangements, derived from HPC best practices, that no compiler (even a magic-wand-equipped one) would ever be able to implement, so do not hesitate to ask your PhD tutor for a helping hand if your thesis needs some HPC expertise; it is not easy to build this up by trial and error in such an expensive experimental domain when your primary domain is not linear algebra and/or CS-theoretic / HW-specific cache-strategy optimisation.)
Epilogue:
Using smart tools in an inappropriate way does not bring anything more than additional overheads (task splits/joins + memory translations; worse with atomic locking, worst with blocking / fences / barriers).

C++: why does an increase in one element of a multi-dimensional array appear to be increasing another?

This may not be elegant. Chiefly because I am relatively new to C++, but this little program I am putting together is stumbling here.
I don't get it. Have I misunderstood arrays? The edited code is:
int diceArray [6][3][1] = {};
...
} else if (y >= xSuccess || x >= xSuccess) {
    // from here...
    diceArray[2][1][0] = diceArray[2][1][0] + 1;
    diceArray[2][1][1] = diceArray[2][1][1] + 1;
    // ...to here, diceArray[2][2][0] increases by 1.
    // I am not referencing that part of the array at all. Or am I?
}
By using comments I tracked the culprit down to the second expression. If I comment out the first one, diceArray[2][2][0] does not change.
Why is diceArray[2][1][1] = diceArray[2][1][1] + 1 causing diceArray[2][2][0] to increment?
I tried..
c = diceArray[2][1][1] + 1;
diceArray[2][1][1] = c;
..as a workaround but it was just the same. It increased diceArray[2][2][0] by one.
You are indexing out of bounds. If I declare such an array
int data [3];
Then the valid indices are
data[0]
data[1]
data[2]
The analog to this is that you declare
int diceArray [6][3][1]
                     ^
But then try to assign to
diceArray[2][1][0]
                ^
diceArray[2][1][1] // This is out of range
                ^
Since you are assigning out of range, the pointer arithmetic behind array indexing means you are actually assigning to the next element in memory, which here belongs to the next slot of the middle dimension.
The variable is declared as:
int diceArray [6][3][1] = {};
This is how it looks like in memory:
+---+ -.
| | <- diceArray[0][0] \
+---+ \
| | <- diceArray[0][1] > diceArray[0]
+---+ /
| | <- diceArray[0][2] /
+---+ -'
| | <- diceArray[1][0] \
+---+ \
| | <- diceArray[1][1] > diceArray[1]
+---+ /
| | <- diceArray[1][2] /
+---+ -'
. . .
. . .
. . .
+---+ -.
| | <- diceArray[5][0] \
+---+ \
| | <- diceArray[5][1] > diceArray[5]
+---+ /
| | <- diceArray[5][2] /
+---+ -'
The innermost component of diceArray is an array of size 1.
C/C++ arrays are always indexed starting from 0, which means the only valid index in an array of size 1 is 0.
During compilation, a reference to diceArray[x][y][z] is converted via pointer arithmetic to the offset x*3*1 + y*1 + z (in int units) from the memory address of diceArray as base.
The code:
diceArray[2][1][1] = diceArray[2][1][1] + 1;
operates on offset 8 (= 2*3*1 + 1*1 + 1) inside diceArray. The same offset is computed for diceArray[2][2][0], which is a legal access inside the array.
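One quick way to see the aliasing for yourself (a hypothetical check, not part of the original program) is to print both addresses; &diceArray[2][1][1] is the one-past-the-end pointer of diceArray[2][1], which is exactly where diceArray[2][2][0] lives:

#include <iostream>

int main()
{
    int diceArray[6][3][1] = {};
    // both name offset 8 from the start: 2*3*1 + 1*1 + 1 == 2*3*1 + 2*1 + 0
    std::cout << static_cast<void*>(&diceArray[2][1][1]) << '\n'   // out of range
              << static_cast<void*>(&diceArray[2][2][0]) << '\n';  // legal, same address
}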
Modern compilers are usually able to detect this kind of error and warn you at compile time.

Stata: Permutations of string variables

I have three string variables of length 2, and I need to get (a) all possible permutations of the three variables (keeping the order of strings within each variable fixed), and (b) all possible variable pairs. The small number of variables allows me to do it manually, but I was wondering if there is a more elegant and concise way of solving this.
It is currently coded as:
egen perm1 = concat(x1 x5 x9)
egen perm2 = concat(x1 x9 x5)
egen perm3 = concat(x5 x1 x9)
egen perm4 = concat(x5 x9 x1)
egen perm5 = concat(x9 x5 x1)
egen perm6 = concat(x9 x1 x5)
gen tuple1 = substr(perm1,1,4)
gen tuple2 = substr(perm2,3,4)
gen tuple3 = substr(perm3,1,4)
gen tuple4 = substr(perm4,3,4)...
An abstract from a resulting table illustrates the desired outcome:
+----+----+----+--------+--------+--------+--------+--------+--------+--------+--------+
| x1 | x5 | x9 | perm1 | perm2 | perm3 | perm4 | perm5 | perm6 | tuple1 | tuple2 |
+----+----+----+--------+--------+--------+--------+--------+--------+--------+--------+
| 01 | 05 | 09 | 010509 | 010905 | 050109 | 050901 | 090501 | 090105 | 0105 | 0509 |
+----+----+----+--------+--------+--------+--------+--------+--------+--------+--------+
Neat question. I don't know if there's a "built in" way to do permutations, but the following should do it.
You want to loop over all your variables, but make sure you don't get duplicates. As the dimensions increase, this gets tricky. What I do is loop over the same list and each time remove the current counter from the counter space of the nested loop.
Unfortunately, this still requires you to write each loop structure, but this should be easy enough to cut-paste-find-replace.
clear
set obs 100
generate x1 = "01"
generate x5 = "05"
generate x9 = "09"
local vars x1 x5 x9
local i = 0
foreach a of varlist `vars' {
    local bs : list vars - a
    foreach b of varlist `bs' {
        local cs : list bs - b
        foreach c of varlist `cs' {
            local ++i
            egen perm`i' = concat(`a' `b' `c')
        }
    }
}
Edit: Re-reading the question, I'm not clear on what you want (since row1_1 isn't one of your concatenated variables). Note that if you really want the "drop one" permutations, then just remove one variable from the concat call. This works because the number of n-item permutations of n items equals the number of (n-1)-item permutations: there are 6 three-item permutations of 3 items, and likewise 6 two-item permutations of 3 items. So
egen perm`i' = concat(`a' `b')

Understanding OpenGL Matrices

I'm starting to learn about 3D rendering and I've been making good progress. I've picked up a lot regarding matrices and the general operations that can be performed on them.
One thing I'm still not quite following is OpenGL's use of matrices. I see this (and things like it) quite a lot:
x y z n
-------
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
So my best understanding is that it is a normalized (no magnitude), 4-dimensional, column-major matrix. Also, this matrix in particular is called the "identity matrix".
Some questions:
What is the "nth" dimension?
How and when are these applied?
My biggest confusion arises from how OpenGL makes use of this kind of data.
In most 3D graphics a point is represented by a 4-component vector (x, y, z, w), where w = 1. Usual operations applied on a point include translation, scaling, rotation, reflection, skewing and combination of these.
These transformations can be represented by a mathematical object called "matrix". A matrix applies on a vector like this:
[ a b c tx ] [ x ] [ a*x + b*y + c*z + tx*w ]
| d e f ty | | y | = | d*x + e*y + f*z + ty*w |
| g h i tz | | z | | g*x + h*y + i*z + tz*w |
[ p q r s ] [ w ] [ p*x + q*y + r*z + s*w ]
For example, scaling is represented as
[ 2 . . . ] [ x ] [ 2x ]
| . 2 . . | | y | = | 2y |
| . . 2 . | | z | | 2z |
[ . . . 1 ] [ 1 ] [ 1 ]
and translation as
[ 1 . . dx ] [ x ] [ x + dx ]
| . 1 . dy | | y | = | y + dy |
| . . 1 dz | | z | | z + dz |
[ . . . 1 ] [ 1 ] [ 1 ]
One of the reasons for the 4th component is to make a translation representable by a matrix.
The advantage of using a matrix is that multiple transformations can be combined into one via matrix multiplication.
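For example, multiplying the translation matrix above by the scaling matrix above (scaling is applied first, so its matrix stands on the right) collapses both into a single matrix:

[ 1 . . dx ] [ 2 . . . ]   [ 2 . . dx ]
| . 1 . dy | | . 2 . . | = | . 2 . dy |
| . . 1 dz | | . . 2 . |   | . . 2 dz |
[ . . .  1 ] [ . . . 1 ]   [ . . .  1 ]

Applying the combined matrix to a point scales it and then translates it, in one multiplication.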
Now, if the purpose were simply to bring translation to the table, then I'd say use (x, y, z, 1) instead of (x, y, z, w) and make the last row of the matrix always [0 0 0 1], as is usually done for 2D graphics. In fact, the 4-component vector is mapped back to the normal 3D vector via this formula:
[ x(3D) ]   [ x / w ]
| y(3D) | = | y / w |
[ z(3D) ]   [ z / w ]
This is called homogeneous coordinates. Allowing this makes the perspective projection expressible with a matrix too, which can again be combined with all the other transformations.
For example, since objects farther away should appear smaller on screen, we transform the 3D coordinates into 2D using the formula
x(2D) = x(3D) / (10 * z(3D))
y(2D) = y(3D) / (10 * z(3D))
Now if we apply the projection matrix
[ 1 . . . ] [ x ] [ x ]
| . 1 . . | | y | = | y |
| . . 1 . | | z | | z |
[ . . 10 . ] [ 1 ] [ 10*z ]
then the real 3D coordinates would become
x(3D) := x/w = x/10z
y(3D) := y/w = y/10z
z(3D) := z/w = 0.1
so we just need to chop the z-coordinate out to project to 2D.
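To make those mechanics concrete, here is a small self-contained sketch (plain C++, not actual OpenGL API code; the point value is made up) that applies the projection matrix above to a point and performs the homogeneous divide:

#include <array>
#include <cstdio>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;   // row-major 4x4

// r = M * v, a plain matrix-times-vector product
Vec4 apply(const Mat4& M, const Vec4& v)
{
    Vec4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            r[i] += M[i][j] * v[j];
    return r;
}

int main()
{
    Mat4 P{{{1, 0, 0, 0},        // the projection matrix from the example
            {0, 1, 0, 0},
            {0, 0, 1, 0},
            {0, 0, 10, 0}}};
    Vec4 p{3, 2, 5, 1};          // a point (x, y, z, w = 1)
    Vec4 q = apply(P, p);        // -> (3, 2, 5, 50)
    // homogeneous divide maps back to 3D: (x/w, y/w, z/w)
    std::printf("%g %g %g\n", q[0] / q[3], q[1] / q[3], q[2] / q[3]);
    // prints: 0.06 0.04 0.1, i.e. (x/10z, y/10z, 0.1)
}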
The short answer that might help you get started is that the 'nth' dimension, as you call it, does not represent any visualizable quantity. It is added as a practical tool to enable matrix multiplications that cause translation and perspective projection. An intuitive 3x3 matrix cannot do those things.
A 3D value representing a point in space always gets 1 appended as the fourth value to make this trick work. A 3D value representing a direction (i.e. a normal, if you are familiar with that term) gets 0 appended in the fourth spot.