Compiler optimization when chaining operators - c++

I overloaded the arithmetic/assignment operators on std::vector in order to be able to do some basic linear algebra operations. However, I'm having some performance trouble when chaining those operations.
Here's the content of my main.h:
#include <vector>
#include <stdlib.h>
using namespace std;
typedef vector<float> vec;
inline vec& operator+=(vec& lhs, const vec& rhs) {
for (size_t i = 0; i < lhs.size(); ++i) {
lhs[i] += rhs[i];
}
return lhs;
}
inline vec operator*(float lhs, vec rhs) {
for (size_t i = 0; i < rhs.size(); ++i) {
rhs[i] *= lhs;
}
return rhs;
}
Content of main1.cpp:
#include "main.h"
// gcc 4.9.2 (-O3): 0m5.965s
int main(int, char**) {
float x = rand();
vec v1(1000);
vec v2(1000);
for (size_t i = 0; i < v1.size(); ++i) {
v1[i] = rand();
v2[i] = rand();
}
for (int i = 0; i < 10000000; ++i) {
v1 += x * v2;
// same as:
//vec y = x * v2;
//v1 += y;
}
return 0;
}
Content of main2.cpp:
#include "main.h"
// gcc 4.9.2 (-O3): 0m2.400s
int main(int, char**) {
// same stuff
for (int i = 0; i < 10000000; ++i) {
for (size_t j = 0; j < v1.size(); ++j) {
v1[j] += x * v2[j];
}
}
return 0;
}
The second program runs much faster than the first. I do understand why this is the case: instead of just one loop, the first program does two loops, and it allocates a temporary vector.
But this is the kind of thing I'd expect the compiler to see and optimize. Or am I doing something wrong?
I don't recall having this problem with linear algebra libraries (e.g. Armadillo). How do they tackle this problem? Does this involve some complicated template programming, or is there some simple way to help the compiler optimize this?

There used to be some super ugly template meta-programming solutions to that problem (expression templates, the technique libraries such as Armadillo rely on). But then the standards committee introduced rvalue references and move semantics in C++11. Look those up and you will find many examples that avoid copying temporaries without the absurd levels of meta-programming.
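As a rough illustration of the move-semantics approach mentioned above, the sketch below adds an overload whose left operand is an rvalue, so a chain such as a + b + c reuses the temporary's buffer instead of allocating a second one. The alias vec mirrors the question; operator+ and the axpy helper are hypothetical additions and not part of the original code. Move semantics only removes extra allocations and copies; the loops are still separate, which is why a fused helper is shown as well.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using vec = std::vector<float>;

// Both operands are lvalues: one new buffer is allocated for the result.
inline vec operator+(const vec& lhs, const vec& rhs) {
    assert(lhs.size() == rhs.size());
    vec out(lhs);
    for (std::size_t i = 0; i < out.size(); ++i) out[i] += rhs[i];
    return out;
}

// Left operand is a temporary: reuse its buffer, no new allocation.
inline vec operator+(vec&& lhs, const vec& rhs) {
    assert(lhs.size() == rhs.size());
    for (std::size_t i = 0; i < lhs.size(); ++i) lhs[i] += rhs[i];
    return std::move(lhs);
}

// Hypothetical fused helper: y += a * x in a single pass, no temporary at all.
inline void axpy(float a, const vec& x, vec& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
}
```

In the question's loop, axpy(x, v2, v1) would replace v1 += x * v2. That corresponds to the hand-written loop in main2.cpp rather than to anything the compiler derives on its own.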

Related

Operator Overloading Matrix Multiplication

The issue I am having is how to get the correct number of columns to go through for the innermost loop over k.
An example is a 2x3 matrix and a 3x2 matrix being multiplied.
The result should be a 2x2 matrix, but currently I don't know how to send the value of 2 to the overloaded operator function.
It should be
int k = 0; k < columns of first matrix;k++
Matrix::Matrix(int row, int col)
{
rows = row;
cols = col;
cx = (float**)malloc(rows * sizeof(float*)); //initialize pointer to pointer matrix
for (int i = 0; i < rows; i++)
*(cx + i) = (float*)malloc(cols * sizeof(float));
}
Matrix Matrix::operator * (Matrix dx)
{
Matrix mult(rows, cols);
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < cols; j++)
{
mult.cx[i][j] = 0;
for (int k = 0; k < ?;k++) //?????????????
{
mult.cx[i][j] += cx[i][k] * dx.cx[k][j];
}
}
}
mult.print();
return mult;
}
//calling
Matrix mult(rowA, colB);
mult = mat1 * mat2;
Linear algebra rules say the result should have dimensions rows x dx.cols, and the inner index k runs over the columns of the first matrix (which must equal dx.rows):
Matrix Matrix::operator * (Matrix dx)
{
Matrix mult(rows, dx.cols);
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < dx.cols; j++) // columns of the result
{
mult.cx[i][j] = 0;
for (int k = 0; k < cols; k++) // cols of this matrix == rows of dx
{
mult.cx[i][j] += cx[i][k] * dx.cx[k][j];
}
}
}
mult.print();
return mult;
}
A few random hints:
Your code is basically C; it doesn’t use (e.g.) important memory-safety features from C++. (Operator overloading is the only C++-like feature in use.) I suggest that you take advantage of C++ a bit more.
Strictly avoid malloc() in C++. Use std::make_unique(...) or, if there is no other way, a raw new operator. (BTW, there is always another way.) In the latter case, make sure there is a destructor with a delete or delete[]. The use of malloc() in your snippet smells like a memory leak.
What can be const should be const. Initialize as many class members as possible in the constructor’s initializer list and make them const if appropriate. (For example, Matrix dimensions don’t change and should be const.)
When writing a container-like class (which a Matrix may be, in a sense), don’t restrict it to a single data type; your future self will thank you. (What if you need a double instead of a float? Is it going to be a one-liner edit or an all-nighter spent searching where a forgotten float eats away your precision?)
Here’s a quick and dirty runnable example showing matrix multiplication:
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <memory>
namespace matrix {
using std::size_t;
template<typename Element>
class Matrix {
class Accessor {
public:
Accessor(const Matrix& mat, size_t m) : data_(&mat.data_[m * mat.n_]) {}
Element& operator [](size_t n) { return data_[n]; }
const Element& operator [](size_t n) const { return data_[n]; }
private:
Element *const data_;
};
public:
Matrix(size_t m, size_t n) : m_(m), n_(n),
data_(std::make_unique<Element[]>(m * n)) {}
Matrix(Matrix &&rv) : m_(rv.m_), n_(rv.n_), data_(std::move(rv.data_)) {}
Matrix operator *(const Matrix& right) {
Matrix result(m_, right.n_);
for (size_t i = 0; i < m_; ++i)
for (size_t j = 0; j < right.n_; ++j) {
result[i][j] = Element{};
for (size_t k = 0; k < n_; ++k) result[i][j] +=
(*this)[i][k] * right[k][j];
}
return result;
}
Accessor operator [](size_t m) { return Accessor(*this, m); }
const Accessor operator [](size_t m) const { return Accessor(*this, m); }
size_t m() const { return m_; }
size_t n() const { return n_; }
private:
const size_t m_;
const size_t n_;
std::unique_ptr<Element[]> data_;
};
template<typename Element>
std::ostream& operator <<(std::ostream &out, const Matrix<Element> &mat) {
for (size_t i = 0; i < mat.m(); ++i) {
for (size_t j = 0; j < mat.n(); ++j) out << std::setw(4) << mat[i][j];
out << std::endl;
}
return out;
}
} // namespace matrix
int main() {
matrix::Matrix<int> m22{2, 2};
m22[0][0] = 0; // TODO: std::initializer_list
m22[0][1] = 1;
m22[1][0] = 2;
m22[1][1] = 3;
matrix::Matrix<int> m23{2, 3};
m23[0][0] = 0; // TODO: std::initializer_list
m23[0][1] = 1;
m23[0][2] = 2;
m23[1][0] = 3;
m23[1][1] = 4;
m23[1][2] = 5;
matrix::Matrix<int> m32{3, 2};
m32[0][0] = 5; // TODO: std::initializer_list
m32[0][1] = 4;
m32[1][0] = 3;
m32[1][1] = 2;
m32[2][0] = 1;
m32[2][1] = 0;
std::cout << "Original:\n\n";
std::cout << m22 << std::endl << m23 << std::endl << m32 << std::endl;
std::cout << "Multiplied:\n\n";
std::cout << m22 * m22 << std::endl
<< m22 * m23 << std::endl
<< m32 * m22 << std::endl
<< m23 * m32 << std::endl
<< m32 * m23 << std::endl;
}
Possible improvements and other recommendations:
Add consistency checks. throw, for example, a std::invalid_argument when dimensions don’t match on multiplication, i.e. when n_ != right.m_, and a std::out_of_range when the operator [] gets an out-of-bounds argument. (The checks may be optional, activated (e.g.) for debugging using an if constexpr.)
Use a std::initializer_list or the like for initialization, so that you can have (e.g.) a const Matrix initialized in-line.
Always check your code using valgrind. (Tip: Building with -g lets valgrind also print the line numbers where something went wrong (or where a relevant preceding (de)allocation happened).)
The code could be made shorter and more elegant (not necessarily more efficient; compiler optimizations are magic nowadays) by not using operator [] everywhere and having some fun with pointer arithmetic instead.
Make the type system better, so that (e.g.) Matrix instances with different types can play well with each other. Perhaps a Matrix<int> multiplied by a Matrix<double> could yield a Matrix<double> etc. One could also support multiplication between a scalar value and a Matrix. Or between a Matrix and a std::array, std::vector etc.
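The consistency checks from the first recommendation could look roughly like the sketch below. CheckedMatrix and its at() accessor are hypothetical names made up for this illustration, and it stores elements in a std::vector rather than the unique_ptr used in the example above. The product of an m x n and a p x q matrix is only defined when n == p, so that is the condition tested.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical minimal matrix (row-major storage), used only to show the checks.
template <typename Element>
class CheckedMatrix {
public:
    CheckedMatrix(std::size_t m, std::size_t n) : m_(m), n_(n), data_(m * n) {}

    // Bounds-checked element access.
    Element& at(std::size_t i, std::size_t j) {
        if (i >= m_ || j >= n_) throw std::out_of_range("CheckedMatrix::at");
        return data_[i * n_ + j];
    }
    const Element& at(std::size_t i, std::size_t j) const {
        if (i >= m_ || j >= n_) throw std::out_of_range("CheckedMatrix::at");
        return data_[i * n_ + j];
    }

    // (m x n) * (n x p) -> (m x p); any other combination is rejected.
    CheckedMatrix operator*(const CheckedMatrix& right) const {
        if (n_ != right.m_) throw std::invalid_argument("dimension mismatch");
        CheckedMatrix result(m_, right.n_);  // elements start value-initialized
        assert(result.m() == m_ && result.n() == right.n_);
        for (std::size_t i = 0; i < m_; ++i)
            for (std::size_t j = 0; j < right.n_; ++j)
                for (std::size_t k = 0; k < n_; ++k)
                    result.at(i, j) += at(i, k) * right.at(k, j);
        return result;
    }

    std::size_t m() const { return m_; }
    std::size_t n() const { return n_; }

private:
    std::size_t m_;
    std::size_t n_;
    std::vector<Element> data_;
};
```

Going through at() in the inner loop keeps the sketch short but pays for a bounds check on every access; a real implementation would probably index the storage directly once the dimensions have been validated.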

Copy Vector of many Vectors speed issue

I'm working on a file parser to import a specific type of JSON into R. The implementation in R requires me to have a set of vectors with the same length. It's working fine and the async makes it pretty fast, but a lot of time is "wasted" (2/3 of the total time!) when collecting the vectors of vectors returned by the async futures.
Do you have an idea how to speed this up?
For me, the problem is basically just appending the "array" of vectors of vectors.
I'm new to C++ and it would be awesome to learn some more!
#include <vector>
#include <iostream>
#include <future>
struct longStruct
{
std::vector<int> namevec;
std::vector<std::vector<int>> vectorOfVector;
};
longStruct parser()
{
longStruct myStruct;
//vectorsize is variable, just here fixed
//i need the zeros in the result "matrix"
std::vector<int> vector(10, 0);
std::vector<std::vector<int>> vectorOfVector;
for (size_t i = 0; i < 15; i++) //max amount of vectors is fixed
{
myStruct.vectorOfVector.push_back(vector);
}
for (size_t i = 0; i < 10; i++)
{
myStruct.namevec.push_back(1); //populated with strings usually
for (size_t j = 0; j < 5; j++)
{
//Just change value where it has to be changed.
//Keep initial zeros if there is no value (important)!
myStruct.vectorOfVector[i][j] = j;
}
}
return myStruct;
}
int main()
{
std::vector<std::future<longStruct>> futures;
longStruct results;
std::vector<int> v;
for (int i = 0; i < 15; ++i)
{
results.vectorOfVector.push_back(v);
}
for (size_t i = 0; i < 5; i++)
{
//Start async operations
futures.emplace_back(std::async(std::launch::async, parser));
}
for (auto &future : futures)
{
//Merge results of the async operations
auto result = future.get();
//For the non-int vectors
std::copy(result.namevec.begin(), result.namevec.end(), std::back_inserter(results.namevec));
//And the "nested": This takes so much time!!!
for (size_t i = 0; i < result.vectorOfVector.size(); i++)
{
std::copy(result.vectorOfVector[i].begin(), result.vectorOfVector[i].end(), std::back_inserter(results.vectorOfVector[i]));
}
}
return 0;
}
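One direction that might be worth measuring is sketched below: reserve the destination columns once up front, then append each column with a single range insert instead of std::back_inserter, which calls push_back for every element. The names Columns, reserveColumns and appendColumns are hypothetical. It is also only an assumption that the copy itself is the bottleneck, since future.get() includes the time spent waiting for the parser thread to finish.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Columns = std::vector<std::vector<int>>;

// Hypothetical helper: reserve room in every column once, before merging,
// so that later appends do not have to reallocate.
inline void reserveColumns(Columns& dst, std::size_t expectedTotal) {
    for (auto& column : dst) column.reserve(expectedTotal);
}

// Hypothetical helper: append each column of src to the matching column of dst.
// A range insert grows the destination at most once per call.
inline void appendColumns(Columns& dst, const Columns& src) {
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < dst.size(); ++i) {
        dst[i].insert(dst[i].end(), src[i].begin(), src[i].end());
    }
}
```

In the merge loop this would be called as appendColumns(results.vectorOfVector, result.vectorOfVector), with reserveColumns called once before the loop if the total number of rows can be estimated.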

Compiling code using personal C++ library breaks when used in conjunction with header files

I have some C++ utility functions I've been using across projects.
I want to make a library out of those utilities to service projects without having to copy/paste any changes I might make.
I can turn the single .cpp file into a library with:
$ g++ -c util.cpp
$ ar rcs libutil.a util.o
and made a util.h header with all of the functions.
This library works to compile and run a simple test.cpp, which prints a dot and the mean of a vector using the library functions: (I moved the header to ~/.local/include/ and the library to ~/.local/lib/)
$ g++ -o test test.cpp -L ~/.local/lib/ -lutil
$ ./test
.
4.5
However, when I try to compile (parts of) a project with the library, I get "{function} was not declared in this scope" errors.
$ g++ -c source/linreg.cpp -L ~/.local/lib/ -lutil
...
linreg.cpp:11:18: error: ‘vecMean’ was not declared in this scope
...
Trying to reproduce this behavior I wrote this:
// header.h
#ifndef HEADER_H
#define HEADER_H
void test();
#endif
// main.cpp
#include "header.h"
#include "util.h"
int main()
{
dot();
test();
return 0;
}
// test.cpp
#include <string>
#include <vector>
#include <iostream>
#include "util.h"
#include "header.h"
void test()
{
dot();
std::vector<double> x;
for(int i = 0; i < 10; ++i)
x.push_back(i * 1.0);
std::cout << vecMean(x) << std::endl;
}
Which does not compile.
Depending on which of the #includes precedes the other,
different errors are thrown.
The above throws "'dot' was not declared in this scope",
while the below throws "'test' was not declared in this scope"
// main.cpp
#include "util.h"
#include "header.h"
...
This is the same kind of behavior I see when I try to compile my actual project.
If I remove the dot() call from main.cpp the example compiles and runs fine, except when placing the util.h include statement before the header.h one (although I guess the util.h include is pointless). This leads to 'test' not being declared.
I feel like I'm missing something obvious,
even though the entire process of learning to set up a library has been a struggle.
Seeing as header files appear to be part of the problem I'm adding my util.h below,
as well as util.cpp, for good measure.
// util.h
#ifndef HEADER_H
#define HEADER_H
#include <vector>
#include <tuple>
#include <fstream>
#include <string>
/***** utils *****/
// smallest/largest value from a vector
int indexSmallest(const std::vector<double> &vec);
int indexLargest(const std::vector<double> &vec);
// some vector operations
std::vector<double> sclMult(const std::vector<double> &vec, double scl);
std::vector<double> sclAdd(const std::vector<double> &vec, double scl);
std::vector<double> vecAdd(const std::vector<double> &vec1, const std::vector<double> &vec2);
std::vector<double> vecSub(const std::vector<double> &vec1, const std::vector<double> &vec2);
std::vector<std::vector<double> > vecCat(const std::vector<double> &vec1,
const std::vector<double> &vec2,
const std::vector<double> &vec3);
double vecMean(const std::vector<double> &vec);
double vecSum(const std::vector<double> &vec);
// sort two vectors of length 3 by the elements in the err vector
std::tuple<std::vector<std::vector<double> >, std::vector<double> >
sort(const std::vector<std::vector<double> > &X, const std::vector<double> &err);
// return maximum and minimum values from vector
std::vector<double> topbot(std::vector<double> &vec);
// print a dot
void dot(std::string str = ".");
// print a vector of doubles
void printVec(std::vector<double> vec);
// print a matrix of doubles
void printMat(std::vector<std::vector<double> > mat);
#endif
// util.cpp
#include <vector>
#include <tuple>
#include <cmath>
#include <iostream>
#include <string>
#include "util.h"
int indexSmallest(const std::vector<double> &vec)
{
int index = 0;
for(int i = 1; i < vec.size(); i++)
{
if(vec[i] < vec[index])
index = i;
}
return index;
}
int indexLargest(const std::vector<double> &vec)
{
int index = 0;
for(int i = 1; i < vec.size(); i++)
{
if(vec[i] > vec[index])
index = i;
}
return index;
}
std::vector<double> sclMult(const std::vector<double> &vec, double scl)
{
std::vector<double> vvec(vec.size());
for(int i = 0; i < vec.size(); i++){
vvec[i] = vec[i] * scl;
}
//printVec(vvec);
return vvec;
}
std::vector<double> sclAdd(const std::vector<double> &vec, double scl)
{
std::vector<double> vvec(vec.size());
for(int i = 0; i < vec.size(); i++)
vvec[i] = vec[i] + scl;
return vvec;
}
std::vector<double> vecAdd(const std::vector<double> &vec1, const std::vector<double> &vec2)
{
std::vector<double> vvec(vec1.size());
//std::cout << "aaaa ";
//printVec(vec1);
for(int i = 0; i < vec1.size(); i++){
vvec[i] = (vec1[i] + vec2[i]);
}
return vvec;
}
std::vector<double> vecSub(const std::vector<double> &vec1, const std::vector<double> &vec2)
{
std::vector<double> vvec(vec1.size());
for(int i = 0; i < vec1.size(); i++)
vvec[i] = (vec1[i] - vec2[i]);
//vvec.push_back(vec1[i] - vec2[i]);
return vvec;
}
std::vector<std::vector<double> > vecCat(const std::vector<double> &vec1,
const std::vector<double> &vec2,
const std::vector<double> &vec3)
{
std::vector<std::vector<double> > vecCat(3);
vecCat[0] = vec1;
vecCat[1] = vec2;
vecCat[2] = vec3;
return vecCat;
}
std::tuple<std::vector<std::vector<double> >, std::vector<double> >
sort(const std::vector<std::vector<double> > &X, const std::vector<double> &err)
{
//std::cout << X.size() << ' ' << err.size() << std::endl;
std::vector<double> sortErr(3);
//std::vector<std::vector<double> > sortX;
int small = indexSmallest(err), large = indexLargest(err);
if(small == large)
return std::make_tuple(X,err);
int middle = fabs(small + large - 3);
//std::cout << small << ' ' << middle << ' ' << large << std::endl;
sortErr[0] = err[small];
sortErr[1] = err[middle];
sortErr[2] = err[large];
std::vector<std::vector<double> > sortX = vecCat(X[small],X[middle],X[large]);
/* sortX[0] = X[small];
sortX[1] = X[middle];
sortX[2] = X[large];*/
return std::make_tuple(sortX,sortErr);
}
double vecMean(const std::vector<double> &vec)
{
double sum = 0;
for(int i = 0;i < vec.size();i++){
sum += vec[i];
}
return sum / vec.size();
}
double vecSum(const std::vector<double> &vec)
{
double sum = 0;
for(int i = 0;i < vec.size();i++){
sum += vec[i];
}
return sum;
}
void dot(std::string str)
{
std::cout << str << std::endl;
}
std::vector<double> topbot(std::vector<double> &vec)
{
double top = vec[0];
double bot = vec[0];
for(int i = 1; i < vec.size(); ++i){
if(vec[i] > top)
top = vec[i];
if(vec[i] < bot)
bot = vec[i];
}
std::vector<double> topbot = {top,bot};
return topbot;
}
void printVec(std::vector<double> vec)
{
for(int i = 0; i < vec.size(); ++i){
std::cout << vec[i] << ',';
}
std::cout << std::endl;
}
void printMat(std::vector<std::vector<double> > mat)
{
for(int i = 0; i < mat.size(); ++i){
printVec(mat[i]);
}
}
std::vector<double> head(std::vector<double> vec, int n)
{
std::vector<double> head;
for(int i = 0; i < n; ++i)
head.push_back(vec[i]);
return head;
}
std::vector<double> tail(std::vector<double> vec, int n)
{
std::vector<double> tail;
for(int i = vec.size() - n; i < vec.size(); ++i)
tail.push_back(vec[i]);
return tail;
}
std::vector<double> normalize(std::vector<double> vec)
{
std::vector<double> tb = topbot(vec);
std::vector<double> norm;
for(int i = 0; i < vec.size(); ++i)
norm.push_back((vec[i] - tb[1]) / (tb[0] - tb[1]));
return norm;
}
std::vector<double> vecLog(std::vector<double> vec)
{
std::vector<double> logged;
for(int i = 0; i < vec.size(); ++i)
logged.push_back(std::log(vec[i]));
return logged;
}
std::vector<double> vecExp(std::vector<double> vec)
{
std::vector<double> logged;
for(int i = 0; i < vec.size(); ++i)
logged.push_back(std::exp(vec[i]));
return logged;
}
The problem is that you have the SAME include guard in both headers:
#ifndef HEADER_H
#define HEADER_H
So then one of the two files -- here, util.h -- is being skipped because that symbol has in fact already been defined.
May I also recommend from experience that you name the library something more unique -- say, verkutil -- with an include guard to match? My experience is that too many projects have their own UTIL_H symbol and similarly named files.
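To make the fix concrete, here is a single-file sketch that pastes two header bodies one after the other, the way the preprocessor sees them after both #include lines. The guard name VERKUTIL_H follows the naming suggestion above, and meanOfFirstTen is a hypothetical stand-in for test(). With distinct guards both declarations survive; if the first guard were also HEADER_H, the second header body would be skipped entirely.

```cpp
#include <cassert>
#include <vector>

// ---- contents of verkutil.h (hypothetical file name) ----
#ifndef VERKUTIL_H
#define VERKUTIL_H
double vecMean(const std::vector<double>& vec);
#endif  // VERKUTIL_H

// ---- contents of header.h ----
#ifndef HEADER_H
#define HEADER_H
double meanOfFirstTen();  // hypothetical stand-in for test()
#endif  // HEADER_H

// ---- definitions, normally living in the .cpp files ----
double vecMean(const std::vector<double>& vec) {
    assert(!vec.empty());
    double sum = 0;
    for (double v : vec) sum += v;
    return sum / vec.size();
}

double meanOfFirstTen() {
    std::vector<double> x;
    for (int i = 0; i < 10; ++i) x.push_back(i * 1.0);
    return vecMean(x);  // visible because verkutil.h was not skipped
}
```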

Multi threading inside a for loop - OpenMP

I am trying to add multi-threading to some C++ code. The target is the for loop inside the function below. The objective is to reduce the execution time of the program, which currently takes 3.83 seconds.
I tried adding #pragma omp parallel for reduction(+:sum) to the inner loop (before the j for-loop), but it was not enough: it took 1.98 seconds. The aim is to bring the time down to 0.5 seconds.
I did some research on improving the speedup, and some people recommend the strip mining method for vectorization for better results. However, I do not know how to implement it yet.
Does anyone know how to do it?
The code is:
void filter(const long n, const long m, float *data, const float threshold, std::vector &result_row_ind) {
for (long i = 0; i < n; i++) {
float sum = 0.0f;
for (long j = 0; j < m; j++) {
sum += data[i*m + j];
}
if (sum > threshold)
result_row_ind.push_back(i);
}
std::sort(result_row_ind.begin(),
result_row_ind.end());
}
Thank you very much
When possible, you likely want to parallelize the outer loop. The simplest way to go about this in OpenMP is to do this:
#pragma omp parallel for
for (long i = 0; i < n; i++) {
float sum = 0.0f;
for (long j = 0; j < m; j++) {
sum += data[i*m + j];
}
if (sum > threshold) {
#pragma omp critical
result_row_ind.push_back(i);
}
}
std::sort(result_row_ind.begin(),
result_row_ind.end());
This works, and is probably a great deal faster than parallelizing the inner loop (launching a parallel region is expensive), but it uses a critical section to prevent races on push_back. The race can also be avoided with a user-defined reduction over vectors on that loop. If the number of threads is very large and the number of matching rows is very small, this might be slower, but otherwise it is likely notably faster. The following is not quite right, since the vector's element type (T below) wasn't given in the question, but it should be pretty close:
#pragma omp declare \
reduction(CatVec: std::vector<T>: \
omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end())) \
initializer(omp_priv=std::vector<T>())
#pragma omp parallel for reduction(CatVec: result_row_ind)
for (long i = 0; i < n; i++) {
float sum = 0.0f;
for (long j = 0; j < m; j++) {
sum += data[i*m + j];
}
if (sum > threshold) {
result_row_ind.push_back(i);
}
}
std::sort(result_row_ind.begin(),
result_row_ind.end());
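A self-contained version of the reduction variant might look like the following sketch. It assumes the element type is long (the question does not say), and it collects into a local vector that is returned instead of writing through a reference parameter. The function name filter_rows is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// User-defined reduction that concatenates per-thread vectors.
// std::vector<long> is an assumption; the question omits the element type.
#pragma omp declare reduction(CatVec : std::vector<long> : \
    omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end())) \
    initializer(omp_priv = std::vector<long>())

// Returns the sorted indices of all rows whose sum exceeds the threshold.
std::vector<long> filter_rows(long n, long m, const float* data, float threshold) {
    assert(n >= 0 && m >= 0);
    std::vector<long> rows;
    #pragma omp parallel for reduction(CatVec : rows)
    for (long i = 0; i < n; i++) {
        float sum = 0.0f;
        for (long j = 0; j < m; j++) {
            sum += data[i * m + j];
        }
        if (sum > threshold) {
            rows.push_back(i);  // each thread appends to its private copy
        }
    }
    // The order of concatenation is not specified, so sort afterwards.
    std::sort(rows.begin(), rows.end());
    return rows;
}
```

With GCC or Clang this needs -fopenmp to run in parallel; without the flag the pragmas are ignored and the loop runs serially with the same result.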
If you have a C++ compiler with support for execution policies, you could try std::for_each with the execution policy std::execution::par to see if that helps. Example:
#include <iostream>
#include <vector>
#include <algorithm>
#include <cstddef>
#include <iterator>
#if __has_include(<execution>)
# include <execution>
#elif __has_include(<experimental/execution_policy>)
# include <experimental/execution_policy>
#endif
// counting iterator to use with std::for_each; dereferencing yields the row index
class iterator {
size_t val;
public:
using iterator_category = std::forward_iterator_tag;
using value_type = size_t;
using difference_type = std::ptrdiff_t; // must be a signed type
using pointer = size_t*;
using reference = size_t&;
iterator(size_t value=0) : val(value) {}
inline iterator& operator++() { ++val; return *this; }
inline iterator operator++(int) { iterator old(*this); ++val; return old; }
inline bool operator==(const iterator& rhs) const { return val == rhs.val; }
inline bool operator!=(const iterator& rhs) const { return val != rhs.val; }
inline reference operator*() { return val; }
};
std::vector<size_t> filter(const size_t rows, const size_t cols, const float* data, const float threshold) {
std::vector<size_t> result_row_ind;
std::vector<float> sums(rows);
iterator begin(0);
iterator end(rows);
std::for_each(std::execution::par, begin, end, [&](const size_t& row) {
const float* dataend = data + (row+1) * cols;
float& sum = sums[row];
for (const float* dataptr = data + row * cols; dataptr < dataend; ++dataptr) {
sum += *dataptr;
}
});
// pushing moved outside the threaded code to avoid using mutexes
for (size_t row = 0; row < rows; ++row) {
if (sums[row] > threshold)
result_row_ind.push_back(row);
}
std::sort(result_row_ind.begin(),
result_row_ind.end());
return result_row_ind;
}
int main() {
constexpr size_t rows = 1<<15, cols = 1<<18;
float* data = new float[rows*cols];
for (size_t i = 0; i < rows*cols; ++i) data[i] = (float)i / (float)100000000.; // size_t: rows*cols does not fit in an int
std::vector<size_t> res = filter(rows, cols, data, 10.);
std::cout << res.size() << "\n";
delete[] data;
}

Square root of all elements of Boost Ublas Matrix

I am trying to compute square root of all elements of a Boost Ublas matrix. So far, I have this, and it works.
#include <iostream>
#include <boost/numeric/ublas/matrix.hpp>
#include <Windows.h>
#include <math.h>
#include <cmath>
#include <algorithm>
typedef boost::numeric::ublas::matrix<float> matrix;
const size_t X_SIZE = 10;
const size_t Y_SIZE = 10;
void UblasExpr();
int main()
{
UblasExpr();
return 0;
}
void UblasExpr()
{
matrix m1, m2, m3;
m1.resize(X_SIZE, Y_SIZE);
m2.resize(X_SIZE, Y_SIZE);
m3.resize(X_SIZE, Y_SIZE);
for (int i = 0; i < X_SIZE; i++)
{
for (int j = 0; j < Y_SIZE; j++)
{
m1(i, j) = 2;
m2(i, j) = 10;
}
}
m3 = element_prod(m1, m2);
std::transform(m1.data().begin(), m1.data().end(), m3.data().begin(), std::sqrtf);
for (int i = 0; i < X_SIZE; i++)
{
for (int j = 0; j < Y_SIZE; j++)
{
std::cout << m3(i, j) << " ";
}
std::cout << std::endl;
}
}
But, I would like to not use the std::transform, and instead do something like this :
m3 = sqrtf(m1);
Is there a way to make it work? My application is very performance sensitive, so the alternative is only acceptable if it results in no loss of efficiency.
P.S. I would like to do this for a whole lot of other operations like log10f, cos, acos, sin, asin, pow. I need these all in my code.
You can define your own sqrt function with an appropriate signature:
typedef boost::numeric::ublas::matrix<float> matrix;
matrix sqrt_element(const matrix& a)
{
matrix result(a.size1(), a.size2());
std::transform(a.data().begin(), a.data().end(), result.data().begin(), std::sqrtf);
return result;
}
You could also define a general 'apply_elementwise' to take a callable object as an argument (untested/not compiled):
typedef boost::numeric::ublas::matrix<float> matrix;
template <typename CALLABLE>
matrix apply_elementwise(const CALLABLE& f, const matrix& a)
{
matrix result(a.size1(), a.size2());
std::transform(a.data().begin(), a.data().end(), result.data().begin(), f);
return result;
}
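Then you could call this with a lambda (std::sqrt and std::cos are overloaded, so passing them by name would leave CALLABLE undeducible):
matrix y(apply_elementwise([](float v) { return std::sqrt(v); }, x));
matrix z;
z = apply_elementwise([](float v) { return std::cos(v); }, x);
In these functions, we're returning a matrix by value. Ideally, you want to make sure that the matrix class you're using provides a move constructor and a move assignment operator (i.e. ones taking rvalue references) to minimize copying of data.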
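The same idea can be tried without Boost. In the sketch below a plain std::vector<float> stands in for the matrix's contiguous storage (what data() exposes in uBLAS); storage, apply_elementwise_into and sqrt_elements are hypothetical names for this illustration. The second function writes into a caller-provided buffer, which may matter for performance-sensitive code because it avoids allocating a result on every call.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Stand-in for a dense matrix's contiguous element storage.
using storage = std::vector<float>;

// Returns a new container holding f applied to every element of a.
template <typename Callable>
storage apply_elementwise(Callable f, const storage& a) {
    storage result(a.size());
    std::transform(a.begin(), a.end(), result.begin(), f);
    return result;
}

// Variant that writes into an existing buffer, so nothing is allocated.
template <typename Callable>
void apply_elementwise_into(Callable f, const storage& a, storage& out) {
    assert(out.size() == a.size());
    std::transform(a.begin(), a.end(), out.begin(), f);
}

// Named wrapper; the lambda selects the float overload of std::sqrt.
inline storage sqrt_elements(const storage& a) {
    return apply_elementwise([](float v) { return std::sqrt(v); }, a);
}
```

The other functions from the question (log10, cos, acos and so on) would each get a one-line wrapper of the same shape as sqrt_elements.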