Tensorflow cpp with Eigen element wise multiply - c++

I am trying to do an element wise multiply for my own op in tensorflow + Eigen. This is a simple version of what I am currently using:
// eg) temp_res_shape = [3, 8], temp_b_shape = [1, 8]
// allocate Tensorflow tensors
Tensor temp_res;
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DataTypeToEnum<complex64>::v(),
temp_res_shape, &temp_res));
Tensor temp_a;
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DataTypeToEnum<complex64>::v(),
temp_res_shape, &temp_a));
Tensor temp_b;
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DataTypeToEnum<complex64>::v(),
temp_b_shape, &temp_b));
// These actually come from different places but the type/shape is right
// ie) I want to do this within Eigen::TensorMap if possible
auto mult_res = Tensor(temp_res).flat_inner_dims<complex64, 2>();
auto a_in = Tensor(temp_a).flat_inner_dims<complex64, 2>();
auto b_in = Tensor(temp_b).flat_inner_dims<complex64, 2>();
// convert to an array
auto a_data = a_in.data();
auto b_data = b_in.data();
auto res = mult_res.data();
for ( int i = 0; i < 3; i++ ) {
for ( int j = 0; j < 8; j++ )
res[ i*8 + j ] = a_data[ i*3 + 8 ] * b_data[j];
}
This is obviously the wrong way to do it but I couldn't get anything else working. I feel like it should be something of the form:
mult_res.device( device ) = a_in * b_in;
But that does the matrix multiply. I couldn't figure out how to convert b_in to a diagonal matrix to multiply that way either :/
I feel like this should be trivial but I can't work it out (my cpp is not great). Thanks in advance!

Related

Add values serially in the same SIMD register

I'm trying to convert this to AVX2:
// parallel arrays
int16_t* Nums = ...
int16_t* Capacities = ...
int** Data = ...
int* freePointer = ...
for (int i = 0; i < n; i++)
{
if (Nums[i] == 0)
Capacities[i] = 0;
else
{
Data[i] = freePointer;
freePointer += Capacities[i];
}
}
But didn't get too far:
for (int i = 0; i < n; i += 4) // 4 as Data is 64 bits
{
const __m256i nums = _mm256_loadu_si256((__m256i*)&Nums[i]);
const __m256i bZeroes = _mm256_cmpeq_epi16(nums, ZEROES256);
const __m256i capacities = _mm256_loadu_si256((__m256i*)&Capacities[i]);
const __m256i zeroedCapacities = _mm256_andnot_si256(bZeroes, capacities);
_mm256_storeu_si256((__m256i*)&Capacities[i], zeroedCapacities);
}
Stuck at the else branch, not sure how to add (prefix sum?...) Capacities into freePointer and assign the "serial" results to Data in the same 256-bit SIMD register.
My terminology is probably off, I hope the code gets across what I'm trying to accomplish.
lane0: freePointer
lane1: freePointer + Capacities[i + 0]
lane2: freePointer + Capacities[i + 0] + Capacities[i + 1]
lane3: freePointer + Capacities[i + 0] + Capacities[i + 1] + Capacities[i + 2]
Basically this is what I want to do in as few instructions as possible, if at all possible. Target is AVX2.
You can find a lot of details here: https://stackoverflow.com/a/69452433/5021064
Here you can plug in any type instead of T and U see the resulting asm for x86 and arm

Diagonal matrix properly in armadillo

My code works but I'm just curious to see if someone knows how to do this but properly using Armadillo library.
Thanks for your time :)
arma::mat W = arma::mat(4, 4, arma::fill::ones);
arma::mat D = arma::mat(4, 4, arma::fill::zeros);
for (size_t i = 0; i < W.n_rows; i++)
{
for (size_t j = 0; j < W.n_cols; j++)
{
D(i, i) += W(i, j);
}
}
std::cout<< "W = \n"<< W <<std::endl;
std::cout<< "D = \n"<< D <<std::endl;
It seems you are summing the elements in each row in the W matrix and putting the result in the diagonal of the D matrix. That is, you are summing elements over the "columns" dimension. This is very easy to do in armadillo and does not require any manual loop.
Armadillo has a sum function with a few overloads. One of these overloads receives a second parameter that you can use to specify in which dimension you want to perform the sum. Just specify the second dimension (index 1) and you get the proper result.
However, the result you get from arma::sum(W, 1) will be a vector. It makes sense, since you are summing over one of the dimensions of the matrix. Just pass the result to arma::diagmat and you get the same D matrix as with you original code. Your code can then be replaced by
arma::mat W = arma::mat(4, 4, arma::fill::ones);
arma::mat D = arma::mat(4, 4, arma::fill::zeros);
W.print("W");
arma::diagmat(arma::sum(W, 1)).print("D");
Note: I have used the .print method to print the matrices, in case you don't know about it. It is easier to use than using std::cout;

How to make the following Halide code more efficient?

The code snippet below is running slower than expected. The authors of this paper http://www.cvlibs.net/publications/Geiger2010ACCV.pdf compute support_points of a 900x700 image in 118 ms. I have implemented their algorithm below in Halide.
In my algorithm, the nested for loops over length and width iterate over xi and yi, which are points in output_x and output_y (defined previously but not shown below). Over each iteration of the nested for loops, a vector top_k is computed and pushed_back into support_points.
Computing this pipeline even for left_buffer.width() == 20 and left_buffer.height() == 20 takes 500 ms. Thus this implementation is several orders of magnitude slower:
...
int k = 4; // # of support points
vector<pair<Expr, Expr>> support_points(k * left_buffer.width() * left_buffer.height());
// Calculate support pixel for each
Func support("support");
support(x, y) = Tuple(i32(0), i32(0), f32(0));
for (int yi = 0; yi < left_buffer.height(); yi++) {
for (int xi = 0; xi < left_buffer.width() - 2; xi++) {
bool left = xi < left_buffer.width() / 4;
bool center = (xi >= left_buffer.width() / 4 && xi < left_buffer.width() * 3 / 4);
bool right = xi >= left_buffer.width() * 3 / 4;
vector <pair<Expr, Expr>> scan_range;
pair <Expr, Expr> scan_height(0, (Expr) left_buffer.height());
pair <Expr, Expr> scan_width;
int which_pred = 0;
if (left) {
scan_width = make_pair((Expr) 0, (Expr) left_buffer.width() / 2);
which_pred = 0;
}
else if (center) {
scan_width = make_pair((Expr) xi - left_buffer.width() / 4, (Expr) left_buffer.width() / 2);
which_pred = 1;
}
else if (right) {
scan_width = make_pair((Expr) left_buffer.width() / 2, (Expr) left_buffer.width() / 2);
which_pred = 2;
}
else {
cout<<"Error"<<endl;
}
scan_range = {scan_width, scan_height};
// cout<<"xi "<<xi<<endl;
// cout<<"yi "<<yi<<endl;
// cout<<"scan_width= "<<scan_width.first<<" "<<scan_width.second<<endl;
// cout<<"scan_height= "<<scan_height.first<<" "<<scan_height.second<<endl;
RDom scanner(scan_range);
Expr predicate[3] = {scanner.x != xi && scanner.y != yi, scanner.x != 0 && scanner.y != 0, scanner.x != xi && scanner.y != yi};
scanner.where(predicate[which_pred]);
std::vector<Expr> top_k(k * 3);
for (int i = 0; i < k; i++) { // say we want top 4 support points.
top_k[3*i] = 10000.0f;
top_k[3*i+1] = 0;
top_k[3*i+2] = 0;
}
Func argmin("argmin");
argmin() = Tuple(top_k);
Expr next_val = abs(output_x(xi, yi) - output_x(scanner.x, scanner.y)) + abs(output_y(xi, yi) - output_y(scanner.x, scanner.y));
Expr next_x = scanner.x;
Expr next_y = scanner.y;
top_k = Tuple(argmin()).as_vector();
// Insert a single element into a sorted list without actually branching
top_k.push_back(next_val);
top_k.push_back(next_x);
top_k.push_back(next_y);
for (int i = k; i > 0; i--) {
Expr prev_val = top_k[(i-1)*3];
Expr prev_x = top_k[(i-1)*3 + 1];
Expr prev_y = top_k[(i-1)*3 + 2];
Expr should_swap = top_k[i*3] < prev_val;
top_k[(i-1)*3] = select(should_swap, top_k[i*3], prev_val);
top_k[(i-1)*3 + 1] = select(should_swap, top_k[i*3 + 1], prev_x);
top_k[(i-1)*3 + 2] = select(should_swap, top_k[i*3 + 2], prev_y);
top_k[i*3] = select(should_swap, prev_val, top_k[i*3]);
top_k[i*3 + 1] = select(should_swap, prev_x, top_k[i*3 + 1]);
top_k[i*3 + 2] = select(should_swap, prev_y, top_k[i*3 + 2]);
}
// Discard the k+1th element
top_k.pop_back(); top_k.pop_back(); top_k.pop_back();
bool cond = xi == 10 && yi == 10;
cout << xi << " "<< yi << " " << cond << endl;
Expr e = argmin()[0];
e = print_when(cond, e, "<- argmin() val");
argmin() = Tuple(top_k);
argmin.compute_root();
// argmin.trace_stores();
argmin.compile_to_lowered_stmt("argmin.html", {}, HTML);
Realization real = argmin.realize();
for (int i = 0; i < k; i++) {
pair<Expr, Expr> c(top_k[3*i+1], top_k[3*i+2]);
support_points.push_back(c);
}
}
}
double t2 = current_time();
cout<<(t2-t1)/100<<" ms"<<endl;
cout<<"executed"<<endl;
}
How can I improve efficiency?
It looks like you may be getting a bit confused between the stages of your program. With Halide, your C++ code that works with Exprs, Funcs, etc. is not actually evaluating anything, it is constructing a Halide program, which you can then compile and run. That means that the C++ for loops, std::vectors, etc. that you're using are all happening at program construction time (essentially compile time) of the Halide program. You might think of it like C++ templates, which evaluate at compile time, vs. the C++ code they construct, which evaluate at the run time of your program: the C++ code you're writing here is equivalent to template code with respect to the Halide program that you are building.
This gets a bit more confusing with the ability to JIT-compile and evaluate a Halide program inside of the same C++ program that builds it (realize).
As it is, I suspect the above program doesn't actually compute the results you expect it to. After the double for loop, what are you hoping to do with support_points? What you have built there is a big array of expressions (pieces of code), not concrete values. And you are JIT-compiling and running a new piece of Halide code each time around those loops (i.e., for every pixel).
I think you may have an easier time understanding what you are building if you stick to ahead-of-time compilation (compile_to_file or generators) for now. That makes the two stages—Halide code generation time, and the runtime of that code inside a separate program—very distinct.

c++ how to represent a vector as a matrix and transpose it?

I have a vector of size n; n is power of 2. I need to treat this vector as a matrix n = R*C. Then I need to transpose the matrix.
For example, I have vector: [1,2,3,4,5,6,7,8]
I need to find R and C. In this case it would be: 4,2. And treat vector as matrix:
[1,2]
[3,4]
[5,6]
[7,8]
Transpose it to:
[1, 3, 5, 7]
[2, 4, 6, 8]
After transposition vector should be: [1, 3, 5, 7, 2, 4, 6, 8]
Is there existing algorithms to perform in-place non-square matrix transposition? I don't want to reinvent a wheel.
My vector is very big so I don't want to create intermediate matrix. I need an in-place algorithm. Performance is very important.
All modofications should be done in oroginal vector. Ideally algorithm should work with chunks that will fit in CPU cache.
I can't use iterator because of memory locality. So I need real transposition.
It does not matter if matrix would be 2x4 or 4x2
The problem can be divided in two parts. First, find R and C and then, reshape the matrix. Here is something I would try to do:
Since n is a power of 2, i.e. n = 2^k then if k is even, we have: R=C=sqrt(n). And if k is odd, then R = 2^((k+1)/2) and C=2^((k-1)/2).
Note: Since you mentioned you want to avoid using extra memory, I have made some editions to my original answer.
The code to calculate R and C would be something like:
void getRandC(const size_t& n, size_t& R, size_t& C)
{
int k = (int)log2(double(n)),
i, j;
if (k & 1) // k is odd
i = (j = (k + 1) / 2) - 1;
else
i = j = k / 2;
R = (size_t)exp2(i);
C = (size_t)exp2(j);
}
Which needs C++11. For the second part, in case you want to keep the original vector:
void transposeVector(const std::vector<int>& vec, std::vector<int>& mat)
{
size_t R, C;
getRandC(vec.size(), R, C);
// first, reserve the memory
mat.resize(vec.size());
// now, do the transposition directly
for (size_t i = 0; i < R; i++)
{
for (size_t j = 0; j < C; j++)
{
mat[i * C + j] = vec[i + R * j];
}
}
}
And, if you want to modify the original vector and avoid using extra memory, you can write:
void transposeInPlace(std::vector<int>& vec)
{
size_t R, C;
getRandC(vec.size(), R, C);
for (size_t j = 0; R > 1; j += C, R--)
{
for (size_t i = j + R, k = j + 1; i < vec.size(); i += R)
{
vec.insert(vec.begin() + k++, vec[i]);
vec.erase(vec.begin() + i + 1);
}
}
}
See the live example
Since you haven't provided us with any of your code, can I suggest a different approach (that I don't know will work for your particular situation)?
I would use an algorithm based on your matrix to transpose your values into the new matrix yourself. Since performance is an issue this will help even more so since you don't have to create another matrix. If this is applicable for you.
Have a vector
[1, 2, 3, 4, 5, 6, 7, 8]
Create your matrix
[1, 2]
[3, 4]
[5, 6]
[7, 8]
Reorder vector without another matrix
[1, 3, 5, 7, 2, 4, 6, 8]
Overwrite the values in the current matrix (so you don't have to create a new one) and reorder the values based on your current matrix.
Add values in order
R1 and C1 to transposed_vector[0]
R2 and C1 to transposed_vector[1]
R3 and C1 to transposed_vector[2]
R4 and C1 to transposed_vector[3]
R1 and C2 to transposed_vector[4]
And so on.
For non square matrix representation, I think it may be tricky, and not worth the effort to make the transpose of your flat vector without creating another one. Here is a snippet of what I came up with:
chrono::steady_clock::time_point start = chrono::steady_clock::now();
int i, j, p, k;
vector<int> t_matrix(matrix.size());
for(k=0; k< R*C ;++k)
{
i = k/C;
j = k - i*C;
p = j*R + i;
t_matrix[p] = matrix[k];
}
cout << chrono::duration_cast<chrono::milliseconds> chrono::steady_clock::now() - start).count() << endl;
Here, matrix is your flat vector, t_matrix is the "transposed" flat vector, and R and C are, respectively rows and vector you found for your matrix representation.

Exponential operator in C++ loop

I wrote C++ codes and matlab codes to test speed. My C++ code is:
int nrow = dim[0], ncol = dim[1];
double tmp, ldot;
for (int k = ncol - 1; k >= 0; --k){
grad[k] = 0;
for (int j = nrow - 1; j >= 0; --j){
tmp = exp(eta[j + nrow * k]);
ldot = (-Z[j + nrow * k] + tmp / (1 + tmp));
grad[k] += A[j] * ldot;
}
}
My matlab code is:
prob = exp(eta);
prob = prob./(1+prob);
ldot = prob - Z;
grad=sum(repmat(A,1,nGWAS).*ldot);
I run each code 100 times, it took over 5 seconds for C++ and only 1.2 seconds for matlab.
Anyone can help my here? Thanks.
The folks at matlab know very well how to optimize matrix access.
You chose to access it column by column. My initial guess is that the matrix is laid out in memory row by row. This causes your code to run over the whole matrix ncol times. Cache misses all over the place.