At the moment I am trying to figure out why my naive matrix-matrix multiplication is slower (0.7 sec) when I use the overloaded parentheses operator (//first multiplication). If I don't use it (//second multiplication) and the multiplication accesses the class member array data_ directly, it is about twice as fast (0.35 sec). I use my own matrix class as defined in Matrix.h.
Why is there such a significant difference in speed? Is there something wrong with my copy constructor? Is there so much "overhead" in calling the overloaded operator function that it justifies that kind of performance penalty?
There is one more question / weird behavior: when you exchange the two innermost loops (x and inner) with each other, the multiplication gets (of course) really slow, but both multiplications now take almost the SAME time (7 sec). Why do they take the same time in this case, when before there was a ~50% performance difference?
edit: The program is compiled the following way: g++ -c -std=c++0x -O3 -DNDEBUG
Thank you so much for your help!
My main function looks like this:
#include "Matrix.h"
int main(){
Matrix m1(1024,1024, 2.0);
Matrix m2(1024,1024, 2.5);
Matrix m3(1024,1024);
//first multiplication
for(int y = 0; y < 1024; ++y){
for(int inner = 0; inner < 1024; ++inner){
for(int x = 0; x < 1024; ++x){
m3(y,x) += m1(y, inner) * m2(inner, x);
}
}
}
//second multiplication
for(int y = 0; y < 1024; ++y){
for(int inner = 0; inner < 1024; ++inner){
for(int x = 0; x < 1024; ++x){
m3.data_[y*1024+x] += m1.data_[y*1024+inner]*m2.data_[inner*1024+inner];
}
}
}
}
And here is the part of Matrix.h:
class Matrix{
public:
Matrix();
Matrix(int sizeY, int sizeX);
Matrix(int sizeY, int sizeX, double init);
Matrix(const Matrix & orig);
~Matrix(){delete[] data_;}
double & operator() (int y, int x);
double operator() (int y, int x) const;
double * data_;
private:
int sizeX_;
int sizeY_;
};
And here is the implementation of the class declared in Matrix.h:
Matrix::Matrix()
: sizeX_(0),
sizeY_(0),
data_(nullptr)
{ }
Matrix::Matrix(int sizeY, int sizeX)
: sizeX_(sizeX),
sizeY_(sizeY),
data_(new double[sizeX*sizeY]())
{
assert( sizeX > 0 );
assert( sizeY > 0 );
}
Matrix::Matrix(int sizeY, int sizeX, double init)
: sizeX_(sizeX),
sizeY_(sizeY)
{
assert( sizeX > 0 );
assert( sizeY > 0 );
data_ = new double[sizeX*sizeY];
std::fill(data_, data_+(sizeX_*sizeY_), init);
}
Matrix::Matrix(const Matrix & orig)
: sizeX_(orig.sizeX_),
sizeY_(orig.sizeY_)
{
data_ = new double[orig.sizeY_*orig.sizeX_];
std::copy(orig.data_, orig.data_+(sizeX_*sizeY_), data_);
}
double & Matrix::operator() (int y, int x){
assert( x >= 0 && x < sizeX_);
assert( y >= 0 && y < sizeY_);
return data_[y*sizeX_ + x];
}
double Matrix::operator() (int y, int x) const {
assert( x >= 0 && x < sizeX_);
assert( y >= 0 && y < sizeY_);
return data_[y*sizeX_ + x];
}
EDIT2: Turns out I used the wrong array access for the //second multiplication. I changed it to m3.data_[y*1024+x] += m1.data_[y*1024+inner]*m2.data_[inner*1024+x]; and now both multiplications take the same time.
Thank you very much for your help!
I think your two versions are not computing the same thing:
In the first you have:
m3(y,x) += m1(y, inner) * m2(inner, x);
But in the second you have
m3.data_[y*1024+x] += m1.data_[y*1024+inner]*m2.data_[inner*1024+inner];
The second one can factor inner out and instead do inner * (1024 + 1) which can be optimized a number of ways that the first can't.
What are the outputs of the two versions? Do they match?
Edit: Another answerer is quite right suggesting that the dimensions in the class not being constant will take some optimizations off the table; in the first version the compiler doesn't know that the size is a power of two so it uses general-purpose multiplication but in the second version it knows that one of the operands is 1024 (not just a constant but a compile time constant) so it can use fast multiplication (left shift by the power of two).
(Apologies for my earlier answer about NDEBUG: I had the page open for a while so didn't see your edit with the compilation line.)
I suspect the difference is that in the operator() version, sizeX_ is not const, and this may be preventing the compiler from optimizing something; for example, it may reload sizeX_ from memory on every access instead of keeping it in a register. Try declaring sizeX_ and sizeY_ const in the class definition.
That and you should inline the functions in the header, as has been suggested in the comments.
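As a rough sketch of both suggestions together, the class below defines operator() inside the class body (which makes it implicitly inline) and declares the dimensions const. The names InlineMatrix and multiply are hypothetical, and std::vector replaces the raw data_ pointer purely to keep the example short; whether this closes the timing gap still has to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical header-only variant of the Matrix class from the question.
class InlineMatrix {
public:
    InlineMatrix(int sizeY, int sizeX, double init = 0.0)
        : sizeX_(sizeX), sizeY_(sizeY),
          data_(static_cast<std::size_t>(sizeX) * sizeY, init) {
        assert(sizeX > 0 && sizeY > 0);
    }

    // Defined in the class body, so the compiler sees the definition
    // at every call site and can inline it.
    double& operator()(int y, int x) {
        assert(x >= 0 && x < sizeX_ && y >= 0 && y < sizeY_);
        return data_[static_cast<std::size_t>(y) * sizeX_ + x];
    }
    double operator()(int y, int x) const {
        assert(x >= 0 && x < sizeX_ && y >= 0 && y < sizeY_);
        return data_[static_cast<std::size_t>(y) * sizeX_ + x];
    }

    int sizeX() const { return sizeX_; }
    int sizeY() const { return sizeY_; }

private:
    const int sizeX_;   // const: cannot change after construction
    const int sizeY_;
    std::vector<double> data_;
};

// Same y / inner / x loop order as the first multiplication in the question.
inline void multiply(const InlineMatrix& a, const InlineMatrix& b, InlineMatrix& out) {
    for (int y = 0; y < a.sizeY(); ++y)
        for (int inner = 0; inner < a.sizeX(); ++inner)
            for (int x = 0; x < b.sizeX(); ++x)
                out(y, x) += a(y, inner) * b(inner, x);
}
```

Note that const data members disable copy assignment for the class, which may or may not be acceptable for a matrix type.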
What is the best practice in this case:
Should I get variables before running a for loop like this:
void Map::render(int layer, Camera* pCam)
{
int texture_index(m_tilesets[layer]->getTextureIndex());
int tile_width(m_size_of_a_tile.getX());
int tile_height(m_size_of_a_tile.getY());
int camera_x(pCam->getPosition().getX());
int camera_y(pCam->getPosition().getY());
int first_tile_x(pCam->getDrawableArea().getX());
int first_tile_y(pCam->getDrawableArea().getY());
int map_max_x( (640 / 16) + first_tile_x );
int map_max_y( (360 / 16) + first_tile_y );
if (map_max_x > 48) { map_max_x = 48; }
if (map_max_y > 28) { map_max_y = 28; }
Tile* t(nullptr);
for (int y(first_tile_y); y < map_max_y; ++y) {
for (int x(first_tile_x); x < map_max_x; ++x) {
// move map relative to camera
m_dst_rect.x = (x * tile_width) + camera_x;
m_dst_rect.y = (y * tile_height) + camera_y;
t = getTile(layer, x, y);
if (t) {
pTextureManager->draw(texture_index, getTile(layer, x, y)->src, m_dst_rect);
}
}
}
}
or is it better to get it directly in the loop like this (in this case the code is shorter but less readable):
void Map::render(int layer, Camera* pCam)
{
int first_tile_x(pCam->getDrawableArea().getX());
int first_tile_y(pCam->getDrawableArea().getY());
for (int y(first_tile_y); y < (360 / 16) + first_tile_y; ++y) {
for (int x(first_tile_x); x < (640 / 16) + first_tile_x; ++x) {
// move map relative to camera
m_dst_rect.x = (x * m_size_of_a_tile.getX()) + pCam->getPosition().getX();
m_dst_rect.y = (y * m_size_of_a_tile.getY()) + pCam->getPosition().getY();
Tile* t(getTile(layer, x, y));
if (t) {
pTextureManager->draw(m_tilesets[layer]->getTextureIndex(), getTile(layer, x, y)->src, m_dst_rect);
}
}
}
}
Is there an impact on performance using one method over another?
Stylistically the second version is to be preferred, since it keeps each object in the scope where it is used instead of leaking it into other contexts. Performance-wise you will need to profile, but I'd be surprised if there was any difference at all, because a compiler will often notice that the results don't change, at least for simple functions, and do this optimization for you.
For functions that are more complex or potentially dynamic, but you know they will not change their result during the for loop it makes sense to define them before the loop.
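A small sketch of that last case, with made-up names: the Camera struct here is a hypothetical stand-in whose getter counts its own invocations, so the compiler cannot silently drop the repeated calls. It only illustrates the number of calls, not actual timings.

```cpp
#include <vector>

// Hypothetical stand-in for a getter the compiler cannot treat as
// loop-invariant (here because it has a visible side effect).
struct Camera {
    mutable int calls = 0;   // counts how often the getter runs
    int offset = 3;
    int getOffset() const { ++calls; return offset; }
};

// Getter evaluated on every iteration.
inline long sumInLoop(const std::vector<int>& v, const Camera& cam) {
    long total = 0;
    for (int value : v) total += value + cam.getOffset();
    return total;
}

// Getter evaluated once, before the loop.
inline long sumHoisted(const std::vector<int>& v, const Camera& cam) {
    const int offset = cam.getOffset();
    long total = 0;
    for (int value : v) total += value + offset;
    return total;
}
```

Both functions return the same sum; the hoisted version calls the getter once regardless of the vector's length.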
I want to optimize my application using vectorization. More specifically, I want to vectorize the mathematical operations on the std::complex<double> type. However, this seems to be quite difficult. Consider the following example:
#define TEST_LEN 100
#include <algorithm>
#include <complex>
typedef std::complex<double> cmplx;
using namespace std::complex_literals;
#pragma omp declare simd
cmplx add(cmplx a, cmplx b)
{
return a + b;
}
#pragma omp declare simd
cmplx mult(cmplx a, cmplx b)
{
return a * b;
}
void k(cmplx *x, cmplx *&y, int i0, int N)
{
#pragma omp for simd
for (int i = i0; i < N; i++)
y[i] = add(mult(-(1i + 1.0), x[i]), 1i);
}
int main(int argc, char **argv)
{
cmplx *x = new cmplx[TEST_LEN];
cmplx *y = new cmplx[TEST_LEN];
for (int i = 0; i < TEST_LEN; i++)
x[i] = 0;
for (int i = 0; i < TEST_LEN; i++)
{
int N = std::min(4, TEST_LEN - i);
k(x, y, i, N);
}
delete[] x;
delete[] y;
return 1;
}
I am using the g++ compiler. For this code the compiler gives the following warning:
warning: unsupported return type 'cmplx' {aka 'std::complex<double>'} for simd
for the lines containing the mult and add function.
It seems like it is not possible to vectorize the std::complex<double> type like this.
Is there a different way this can be achieved?
Not easily. SIMD works quite well when you have values in the next N steps that behave the same way. So consider for example an array of 2D vectors:
X Y X Y X Y X Y
If we were to do a vector addition operation here,
X Y X Y X Y X Y
+ + + + + + + +
X Y X Y X Y X Y
The compiler will nicely vectorise that operation. If however we were to want to do something different for the X and Y values, the memory layout becomes problematic for SIMD:
X Y X Y X Y X Y
+ / + / + / + /
X Y X Y X Y X Y
If you consider for example the multiplication case:
(a + bi)(c + di) = (ac - bd) + (ad + bc)i
Suddenly the operations are jumping between SIMD lanes, which is pretty much going to kill any decent vectorization.
Take a quick look at this godbolt: https://godbolt.org/z/rnVVgl
Addition boils down to some vaddps instructions (working on 8 floats at a time).
Multiply ends up using vfmadd231ss and vmulss (which both work on 1 float at a time).
The only easy way to automatically vectorise your complex code would be to separate out the real and imaginary parts into 2 arrays:
struct ComplexArray {
float* real;
float* imaginary;
};
Within this godbolt you can see that the compiler is now using vfmadd213ps instructions (so again back to working on 8 floats at a time).
https://godbolt.org/z/Ostaax
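For reference, a minimal sketch of an element-wise multiply over such a split layout. The ComplexSoA type and multiplySoA function are hypothetical names, with std::vector standing in for the raw pointers; whether the loop is actually vectorised depends on the compiler and flags, so check the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure-of-arrays storage: all real parts are contiguous,
// and all imaginary parts are contiguous.
struct ComplexSoA {
    std::vector<float> real;
    std::vector<float> imag;
    explicit ComplexSoA(std::size_t n) : real(n), imag(n) {}
    std::size_t size() const { return real.size(); }
};

// out[i] = a[i] * b[i], using (a + bi)(c + di) = (ac - bd) + (ad + bc)i.
// Every iteration performs the same operations on neighbouring floats,
// which is the access pattern auto-vectorisers handle well.
inline void multiplySoA(const ComplexSoA& a, const ComplexSoA& b, ComplexSoA& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Compute both parts before storing, so out may alias a or b.
        const float re = a.real[i] * b.real[i] - a.imag[i] * b.imag[i];
        const float im = a.real[i] * b.imag[i] + a.imag[i] * b.real[i];
        out.real[i] = re;
        out.imag[i] = im;
    }
}
```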
I am trying to create a function to transpose a bitmap in place. But so far the result I get is all messed up, and I can't find what I'm doing wrong.
Source bitmaps are as a 1d pixel array in ARGB format.
void transpose(uint8_t* buffer, const uint32_t width, const uint32_t height)
{
const size_t stride = width * sizeof(uint32_t);
for (uint32_t i = 0; i < height; i++)
{
uint32_t* row = (uint32_t*)(buffer + (stride * i));
uint8_t* section = buffer + (i * sizeof(uint32_t));
for (uint32_t j = i + 1; j < height; j++)
{
const uint32_t tmp = row[j];
row[j] = *((uint32_t*)(section + (stride * j)));
*((uint32_t*)(section + (stride * j))) = tmp;
}
}
}
UPDATE:
To clarify and avoid confusion, as it seems some people think this is just a rotate-image question: transposing an image consists of 2 transformations: 1) flip horizontally, 2) rotate by 90 degrees CCW. (As shown in the image example, see the arrow directions.)
I think the problem is more complex than you realise and is not simply a case of swapping the pixels at x, y with the pixels at y, x. If you consider a 3*7 pixel image in which I've labelled the pixels a-u:
abcdefg
hijklmn
opqrstu
Rotating this image gives:
aho
bip
cjq
dkr
els
fmt
gnu
Turning both images into a 1D array gives:
abcdefghijklmnopqrstu
ahobipcjqdkrelsfmtgnu
Notice that b has moved to the position of d but has been replaced by h.
Rethink your algorithm, draw it out for a small image and make sure it works before attempting to implement it.
Due to the complexity of the task, it may actually end up being faster to create a temporary buffer, rotate into that buffer and then copy back, as that could end up with fewer copies (2 per pixel) than the in-place algorithm that you come up with.
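A minimal sketch of the temporary-buffer approach, assuming the pixels are held in a std::vector of 32-bit values rather than the raw uint8_t buffer from the question (the function name transposeViaCopy is made up). It swaps the buffers at the end instead of copying back, which saves the second copy; with a raw buffer you would copy tmp back instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Transposes a row-major image of 32-bit pixels with the given width and
// height. Afterwards the vector holds an image that is `height` pixels
// wide and `width` pixels high.
inline void transposeViaCopy(std::vector<std::uint32_t>& pixels,
                             std::uint32_t width, std::uint32_t height) {
    std::vector<std::uint32_t> tmp(pixels.size());
    for (std::uint32_t y = 0; y < height; ++y) {
        for (std::uint32_t x = 0; x < width; ++x) {
            // Source pixel (x, y) lands at column y of row x in the result.
            tmp[static_cast<std::size_t>(x) * height + y] =
                pixels[static_cast<std::size_t>(y) * width + x];
        }
    }
    pixels.swap(tmp);   // replaces the copy-back step
}
```

Applied to the 7 wide, 3 high example above, this turns abcdefghijklmnopqrstu into ahobipcjqdkrelsfmtgnu.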
Mostly equivalent code that should be easier to debug:
inline uint32_t * addr(uint8_t* buffer, const uint32_t width, uint32_t i, uint32_t j) {
uint32_t * tmp = reinterpret_cast<uint32_t*>(buffer); // explicit cast: uint8_t* does not convert implicitly
return tmp+i*width+j;
}
void transpose(uint8_t* buffer, const uint32_t width, const uint32_t height) {
for (uint32_t i = 0; i < std::min(width,height); i++) { // std::min needs <algorithm>
for (uint32_t j = 0; j < i; j++) {
uint32_t * a = addr(buffer, width, i, j);
uint32_t * b = addr(buffer, width, j, i);
const uint32_t tmp = *a;
*a = *b;
*b = tmp;
}
}
}
If this doesn't work right, it is possible that it needs to know not just the width of the picture, but also the width of the underlying buffer. This only flips the square portion at the top-left; more work would be needed for non-square bitmaps (or just pad everything to square before using...).
Note that transposing a matrix in place is not trivial when N!=M. See eg here for details.
The reason is that when N=M you can simply iterate through half of the matrix and swap elements. When N!=M this isn't the case.
For illustration, consider a simpler case:
First a 2d view on 1d data:
struct my2dview {
std::vector<int>& data;
int width,height;
my2dview(std::vector<int>& data,int width,int height):data(data),width(width),height(height){}
int operator()(int x,int y) const { return data[x*width + y]; }
int& operator()(int x,int y){ return data[x*width + y]; }
my2dview get_transposed() { return my2dview(data,height,width);}
};
std::ostream& operator<<(std::ostream& out, const my2dview& x){
for (int h=0;h<x.height;++h){
for (int w=0;w<x.width;++w){
out << x(h,w) << " ";
}
out << "\n";
}
return out;
}
Now a transpose that would work for N=M:
my2dview broken_transpose(my2dview x){
auto res = x.get_transposed();
for (int i=0;i<x.height;++i){
for (int j=0;j<x.width;++j){
res(j,i) = x(i,j);
}
}
return res;
}
Using it for some small matrix
int main() {
std::vector<int> x{1,2,3,4,5,6};
auto v = my2dview(x,2,3);
std::cout << v << '\n';
std::cout << v.get_transposed() << '\n';
auto v2 = broken_transpose(v);
std::cout << v2;
}
prints
1 2
3 4
5 6
1 2 3
4 5 6
1 3 2
2 2 6
Conclusion: The naive swapping elements approach does not work for non-square matrices.
Actually this answer just rephrases the one by @Alan Birtles. I felt challenged by his
Due to the complexity of the task it may actually end up being faster to create a temporary buffer [...]
just to come to the same conclusion ;).
EDIT: You can check out my implementation on Github: https://github.com/Sheljohn/WalshHadamard
I am looking for an implementation, or indications on how to implement, the sequency-ordered Fast Walsh Hadamard transform (see this and this).
I slightly adapted a very nice implementation found online:
// (a,b) -> (a+b,a-b) without overflow
void rotate( long& a, long& b )
{
static long t;
t = a;
a = a + b;
b = t - b;
}
// Bit length of x, i.e. floor(log2(x)) + 1 for x > 0 (callers subtract 1 to get log2)
long ilog2( long x )
{
long l2 = 0;
for (; x; x >>=1) ++l2;
return l2;
}
/**
* Fast Walsh-Hadamard transform
*/
void fwht( std::vector<long>& data )
{
const long l2 = ilog2(data.size()) - 1;
for (long i = 0; i < l2; ++i)
{
for (long j = 0; j < (1 << l2); j += 1 << (i+1))
for (long k = 0; k < (1 << i ); ++k)
rotate( data[j + k], data[j + k + (1<<i)] );
}
}
but it does not compute the WHT in sequency order (the natural Hadamard matrix is used implicitly). Note that in the code above (and if you try it), the size of data needs to be a power of 2.
My question is: is there a simple adaptation of this implementation that gives the sequency-ordered FWHT?
A possible solution would be to write a small function to compute dynamically the elements of Hn (the Hadamard matrix of order n), count the number of zero crossings, and create a ranking of the rows, but I am wondering whether there is a smarter way. Thanks in advance for any input! Cheers
As indicated here (linked from within your reference):
The sequency ordering of the rows of the Walsh matrix can be derived from the ordering of the Hadamard matrix by first applying the bit-reversal permutation and then the Gray code permutation.
There are various implementations of bit-reversal algorithm such as this:
// Bit-reversal
// adapted from http://www.idi.ntnu.no/~elster/pubs/elster-bit-rev-1989.pdf
void bitrev(int t, std::vector<long>& c)
{
long n = 1<<t;
long L = 1;
c[0] = 0;
for (int q=0; q<t; ++q)
{
n /= 2;
for (long j=0; j<L; ++j)
{
c[L+j] = c[j] + n;
}
L *= 2;
}
}
The gray code can be obtained from here:
/*
The purpose of this function is to convert an unsigned
binary number to reflected binary Gray code.
The operator >> is shift right. The operator ^ is exclusive or.
*/
unsigned int binaryToGray(unsigned int num)
{
return (num >> 1) ^ num;
}
These can be combined to yield the final permutation:
// Compute a permutation of size 2^order
// to reorder the Fast Walsh-Hadamard transform's output
// into Walsh (sequency) order
void sequency_permutation(long order, std::vector<long>& p)
{
long n = 1<<order;
std::vector<long> tmp(n);
bitrev(order, tmp);
p.resize(n);
for (long i=0; i<n; ++i)
{
p[i] = tmp[binaryToGray(i)];
}
}
All that's left to do is to apply the permutation to the normal Walsh-Hadamard Transform output.
void permuted_fwht(std::vector<long>& data, const std::vector<long>& permutation)
{
std::vector<long> tmp = data;
fwht(tmp);
for (long i=0; i<data.size(); ++i)
{
data[i] = tmp[permutation[i]];
}
}
Note that the permutation is fixed for a given data size, so it only needs to be computed once (assuming you are processing multiple blocks of data). So, putting it all together you would get something such as:
std::vector<long> p;
const long order = ilog2(data_block_size) - 1;
sequency_permutation(order, p);
permuted_fwht( data_block_1, p);
permuted_fwht( data_block_2, p);
//...
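One way to sanity-check the permutation is to count sign changes directly. The sketch below is independent of the code above and uses hypothetical helper names: it rebuilds the same permutation (bit-reversal of the Gray code of the index) and relies on the fact that entry (r, c) of the natural-order Hadamard matrix is +1 or -1 according to the parity of the set bits in r & c. Row p[k] should then have exactly k sign changes.

```cpp
#include <cstddef>
#include <vector>

// Entry (row, col) of the natural-order Hadamard matrix: +1 or -1,
// given by the parity of the number of set bits in row & col.
inline int hadamardEntry(unsigned row, unsigned col) {
    unsigned bits = row & col;
    int sign = 1;
    for (; bits; bits >>= 1)
        if (bits & 1u) sign = -sign;
    return sign;
}

// Number of sign changes (the sequency) along one row of the n x n matrix.
inline int signChanges(unsigned row, unsigned n) {
    int changes = 0;
    for (unsigned col = 1; col < n; ++col)
        if (hadamardEntry(row, col) != hadamardEntry(row, col - 1)) ++changes;
    return changes;
}

// Reverses the lowest `bits` bits of x.
inline unsigned reverseBits(unsigned x, int bits) {
    unsigned r = 0;
    for (int i = 0; i < bits; ++i, x >>= 1) r = (r << 1) | (x & 1u);
    return r;
}

// p[k] = natural-order row index that has sequency k:
// the bit-reversal of the Gray code of k.
inline std::vector<unsigned> sequencyPermutation(int order) {
    std::vector<unsigned> p(static_cast<std::size_t>(1) << order);
    for (unsigned i = 0; i < p.size(); ++i)
        p[i] = reverseBits((i >> 1) ^ i, order);
    return p;
}
```

For order 3 this gives the permutation 0 4 6 2 3 7 5 1, which is also what sequency_permutation(3, p) above produces.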
How would you solve the problem of finding the points of a (integer) grid within a circle centered on the origin of the axis, with the results ordered by norm, as in distance from the centre, in C++?
I wrote an implementation that works (yeah, I know, it is extremely inefficient, but for my problem anything more would be overkill). I'm extremely new to C++, so my biggest problem was finding a data structure capable of
being sort-able;
being able to save an array in one of its elements,
rather than the implementation of the algorithm. My code is as follows. Thanks in advance, everyone!
typedef std::pair<int, int[2]> norm_vec2d;
bool norm_vec2d_cmp (norm_vec2d a, norm_vec2d b)
{
bool bo;
bo = (a.first < b.first ? true: false);
return bo;
}
int energy_to_momenta_2D (int energy, std::list<norm_vec2d> *momenta)
{
int i, j, norm, n=0;
int energy_root = (int) std::sqrt(energy);
norm_vec2d temp;
for (i=-energy_root; i<=energy_root; i++)
{
for (j =-energy_root; j<=energy_root; j++)
{
norm = i*i + j*j;
if (norm <= energy)
{
temp.first = norm;
temp.second[0] = i;
temp.second[1] = j;
(*momenta).push_back (temp);
n++;
}
}
}
(*momenta).sort(norm_vec2d_cmp);
return n;
}
How would you solve the problem of finding the points of a (integer) grid within a circle centered on the origin of the axis, with the results ordered by norm, as in distance from the centre, in C++?
I wouldn't use a std::pair to hold the points. I'd create my own more descriptive type.
struct Point {
int x;
int y;
int square() const { return x*x + y*y; }
Point(int x = 0, int y = 0)
: x(x), y(y) {}
bool operator<(const Point& pt) const {
if( square() < pt.square() )
return true;
if( pt.square() < square() )
return false;
if( x < pt.x )
return true;
if( pt.x < x)
return false;
return y < pt.y;
}
friend std::ostream& operator<<(std::ostream& os, const Point& pt) {
return os << "(" << pt.x << "," << pt.y << ")";
}
};
This data structure is (probably) exactly the same size as two ints, it is less-than comparable, it is assignable, and it is easily printable.
The algorithm walks through all of the valid points that satisfy x=[0,radius] && y=[0,x] && (x,y) inside circle:
std::set<Point>
GetListOfPointsInsideCircle(double radius = 1) {
std::set<Point> result;
// Only examine bottom half of quadrant 1, then
// apply symmetry 8 ways
for(Point pt(0,0); pt.x <= radius; pt.x++, pt.y = 0) {
for(; pt.y <= pt.x && pt.square()<=radius*radius; pt.y++) {
result.insert(pt);
result.insert(Point(-pt.x, pt.y));
result.insert(Point(pt.x, -pt.y));
result.insert(Point(-pt.x, -pt.y));
result.insert(Point(pt.y, pt.x));
result.insert(Point(-pt.y, pt.x));
result.insert(Point(pt.y, -pt.x));
result.insert(Point(-pt.y, -pt.x));
}
}
return result;
}
I chose a std::set to hold the data for two reasons:
It is stored in sorted order, so I don't have to std::sort it, and
It rejects duplicates, so I don't have to worry about points whose reflections are identical
Finally, using this algorithm is dead simple:
int main () {
std::set<Point> vp = GetListOfPointsInsideCircle(2);
std::copy(vp.begin(), vp.end(),
std::ostream_iterator<Point>(std::cout, "\n"));
}
It's always worth adding a point class for such geometric problems, since usually you have more than one to solve. But I don't think it's a good idea to overload the 'less' operator to satisfy the first need encountered, because:
Specifying the comparator where you sort will make it clear what order you want there.
Specifying the comparator will allow to easily change it without affecting your generic point class.
Distance to origin is not a bad order, but for a grid it's probably better to use rows and columns (sort by x first, then y).
Such a comparator is slower and will thus slow down any other set of points where you don't even care about the norm.
Anyway, here is a simple solution using a specific comparator and trying to optimize a bit:
struct v2i{
int x,y;
v2i(int px, int py) : x(px), y(py) {}
int norm() const {return x*x+y*y;}
};
bool r_comp(const v2i& a, const v2i& b)
{ return a.norm() < b.norm(); }
std::vector<v2i> result;
for(int x = -r; x <= r; ++x) {
int my = r*r - x*x;
for(int y = 0; y*y <= my; ++y) {
result.push_back(v2i(x,y));
if(y > 0)
result.push_back(v2i(x,-y));
}
}
std::sort(result.begin(), result.end(), r_comp);
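Wrapped into a self-contained function, the same idea might look like the sketch below. The names GridPoint and pointsInCircle are made up for this example, r is assumed to be a non-negative integer radius, and std::stable_sort is swapped in for std::sort only so that points with equal norm keep their generation order.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct GridPoint {
    int x, y;
    int norm() const { return x * x + y * y; }   // squared distance from the origin
};

// All integer points with x*x + y*y <= r*r, ordered by squared norm.
inline std::vector<GridPoint> pointsInCircle(int r) {
    std::vector<GridPoint> result;
    for (int x = -r; x <= r; ++x) {
        const int maxY2 = r * r - x * x;          // largest allowed y*y in this column
        for (int y = 0; y * y <= maxY2; ++y) {
            result.push_back(GridPoint{x, y});
            if (y > 0) result.push_back(GridPoint{x, -y});   // mirror below the x axis
        }
    }
    // The comparator is given at the call site, so the ordering is explicit here
    // and GridPoint itself stays free of any built-in ordering.
    std::stable_sort(result.begin(), result.end(),
                     [](const GridPoint& a, const GridPoint& b) {
                         return a.norm() < b.norm();
                     });
    return result;
}
```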