Performance loss if function is not defined in class declaration

Performance loss if function is not defined in class declaration - c++

I have a high performance code which includes the class that is found at the bottom of this post. My problem is that as soon as I do not define the advecu function in the class declaration, but instead separate the declaration and the implementation (as I prefer), I lose a considerable amount of performance in both the Intel C++ and the Clang compilers. I do not, however, understand why. When I remove the templates, the performance is the same for both ways on all compilers.
template<bool dim3>
struct Advec4Kernel
{
static void advecu(double * restrict ut, double * restrict u, double * restrict v, double * restrict w, double * restrict dzi4, const Grid &grid)
{
int ijk,kstart,kend;
int ii1,ii2,ii3,jj1,jj2,jj3,kk1,kk2,kk3;
double dxi,dyi;
ii1 = 1;
ii2 = 2;
ii3 = 3;
jj1 = 1*grid.icells;
jj2 = 2*grid.icells;
jj3 = 3*grid.icells;
kk1 = 1*grid.ijcells;
kk2 = 2*grid.ijcells;
kk3 = 3*grid.ijcells;
kstart = grid.kstart;
kend = grid.kend;
dxi = 1./grid.dx;
dyi = 1./grid.dy;
for(int k=grid.kstart; k<grid.kend; k++)
for(int j=grid.jstart; j<grid.jend; j++)
for(int i=grid.istart; i<grid.iend; i++)
{
ijk = i + j*jj1 + k*kk1;
ut[ijk] -= ( cg0*((ci0*u[ijk-ii3] + ci1*u[ijk-ii2] + ci2*u[ijk-ii1] + ci3*u[ijk ]) * (ci0*u[ijk-ii3] + ci1*u[ijk-ii2] + ci2*u[ijk-ii1] + ci3*u[ijk ]))
+ cg1*((ci0*u[ijk-ii2] + ci1*u[ijk-ii1] + ci2*u[ijk ] + ci3*u[ijk+ii1]) * (ci0*u[ijk-ii2] + ci1*u[ijk-ii1] + ci2*u[ijk ] + ci3*u[ijk+ii1]))
+ cg2*((ci0*u[ijk-ii1] + ci1*u[ijk ] + ci2*u[ijk+ii1] + ci3*u[ijk+ii2]) * (ci0*u[ijk-ii1] + ci1*u[ijk ] + ci2*u[ijk+ii1] + ci3*u[ijk+ii2]))
+ cg3*((ci0*u[ijk ] + ci1*u[ijk+ii1] + ci2*u[ijk+ii2] + ci3*u[ijk+ii3]) * (ci0*u[ijk ] + ci1*u[ijk+ii1] + ci2*u[ijk+ii2] + ci3*u[ijk+ii3])) ) * cgi*dxi;
if(dim3)
{
ut[ijk] -= ( cg0*((ci0*v[ijk-ii2-jj1] + ci1*v[ijk-ii1-jj1] + ci2*v[ijk-jj1] + ci3*v[ijk+ii1-jj1]) * (ci0*u[ijk-jj3] + ci1*u[ijk-jj2] + ci2*u[ijk-jj1] + ci3*u[ijk ]))
+ cg1*((ci0*v[ijk-ii2 ] + ci1*v[ijk-ii1 ] + ci2*v[ijk ] + ci3*v[ijk+ii1 ]) * (ci0*u[ijk-jj2] + ci1*u[ijk-jj1] + ci2*u[ijk ] + ci3*u[ijk+jj1]))
+ cg2*((ci0*v[ijk-ii2+jj1] + ci1*v[ijk-ii1+jj1] + ci2*v[ijk+jj1] + ci3*v[ijk+ii1+jj1]) * (ci0*u[ijk-jj1] + ci1*u[ijk ] + ci2*u[ijk+jj1] + ci3*u[ijk+jj2]))
+ cg3*((ci0*v[ijk-ii2+jj2] + ci1*v[ijk-ii1+jj2] + ci2*v[ijk+jj2] + ci3*v[ijk+ii1+jj2]) * (ci0*u[ijk ] + ci1*u[ijk+jj1] + ci2*u[ijk+jj2] + ci3*u[ijk+jj3])) ) * cgi*dyi;
}
ut[ijk] -= ( cg0*((ci0*w[ijk-ii2-kk1] + ci1*w[ijk-ii1-kk1] + ci2*w[ijk-kk1] + ci3*w[ijk+ii1-kk1]) * (ci0*u[ijk-kk3] + ci1*u[ijk-kk2] + ci2*u[ijk-kk1] + ci3*u[ijk ]))
+ cg1*((ci0*w[ijk-ii2 ] + ci1*w[ijk-ii1 ] + ci2*w[ijk ] + ci3*w[ijk+ii1 ]) * (ci0*u[ijk-kk2] + ci1*u[ijk-kk1] + ci2*u[ijk ] + ci3*u[ijk+kk1]))
+ cg2*((ci0*w[ijk-ii2+kk1] + ci1*w[ijk-ii1+kk1] + ci2*w[ijk+kk1] + ci3*w[ijk+ii1+kk1]) * (ci0*u[ijk-kk1] + ci1*u[ijk ] + ci2*u[ijk+kk1] + ci3*u[ijk+kk2]))
+ cg3*((ci0*w[ijk-ii2+kk2] + ci1*w[ijk-ii1+kk2] + ci2*w[ijk+kk2] + ci3*w[ijk+ii1+kk2]) * (ci0*u[ijk ] + ci1*u[ijk+kk1] + ci2*u[ijk+kk2] + ci3*u[ijk+kk3])) )
* dzi4[k];
}
}
};
The separated version looks as:
// header
template<bool dim3>
struct Advec4Kernel
{
static void advecu(double *, double *, double *, double *, double *, const Grid &);
}
// source
template<bool dim3>
void Advec4Kernel<dim3>::advecu(double * restrict ut, double * restrict u, double * restrict v, double * restrict w, double * restrict dzi4, const Grid &grid)
{
//...
}

Apparently, the compiler performs some optimizations using the restrict keyword. To benefit from these optimizations, the function's declaration must contain the restrict keyword. This was determined empirically; I don't know whether it's a compiler deficiency or a law.
Code:
// header
template<bool dim3>
struct Advec4Kernel
{
static void advecu(double *restrict, double *restrict, double *restrict, double *restrict, double *restrict, const Grid &);
}

The functions that are defined inside the class declarations are inlined in most cases and thats why there is a performance degrade

Related

Converting 1-d array to 2d

I am trying to understand this code:
void stencil(const int nx, const int ny, const int width, const int height,
double* image, double* tmp_image)
{
for (int j = 1; j < ny + 1; ++j) {
for (int i = 1; i < nx + 1; ++i) {
tmp_image[j + i * height] = image[j + i * height] * 3.0 / 5.0;
tmp_image[j + i * height] += image[j + (i - 1) * height] * 0.5 / 5.0;
tmp_image[j + i * height] += image[j + (i + 1) * height] * 0.5 / 5.0;
tmp_image[j + i * height] += image[j - 1 + i * height] * 0.5 / 5.0;
tmp_image[j + i * height] += image[j + 1 + i * height] * 0.5 / 5.0;
}
}
}
The 1-d array notation is very confusing. I am trying to convert it to a 2-d notation (which I find easier to read). Could someone point me in the right direction as to how I can accomplish this?

All this code is doing is creating a new image from an original image by taking 60% from the corresponding pixel and 10% from each neighboring pixel.
When you see tmp_image[j + i * height], read it as tmp_image[i][j].
Changing the code to literally use 2D syntax may require knowing at least one of the dimensions at compile time, whereas now it is a runtime argument. So that might be a non-starter, unless you're using C++ and want to write or use a matrix class instead of plain arrays.

Xorshift1024* jump not commutative?

I've been porting Sebastiano Vigna's xorshift1024* PRNG to be compatible with the standard C++11 uniform random number generator contract and noticed some strange behavior with the jump() function he provides.
According to Vigna, a call to jump() should be equivalent to 2^512 calls to next(). Therefore a series of calls to jump() and next() should be commutative. For example, assuming the generator starts in some known state,
jump();
next();
should leave the generator in the same state as
next();
jump();
since both should be equivalent to
for (bigint i = 0; i < (bigint(1) << 512) + 1; ++i)
next();
assuming bigint is some integer type with an extremely large maximum value (and assuming you are a very, very, very patient person).
Unfortunately, this doesn't work with the reference implementation Vigna provides (which I will include at the end for posterity; in case the implementation linked above changes or is taken down in the future). When testing the first two options using the following test code:
memset(s, 0xFF, sizeof(s));
p = 0;
// jump() and/or next() calls...
std::cout << p << ';';
for (int i = 0; i < 16; ++i)
std::cout << ' ' << s[i];
calling jump() before next() outputs:
1; 9726214034378009495 13187905351877324975 10033047168458208082 990371716258730972 965585206446988056 74622805968655940 11468976784638207029 3005795712504439672 6792676950637600526 9275830639065898170 6762742930827334073 16862800599087838815 13481924545051381634 16436948992084179560 6906520316916502096 12790717607058950780
while calling next() first results in:
1; 13187905351877324975 10033047168458208082 990371716258730972 965585206446988056 74622805968655940 11468976784638207029 3005795712504439672 6792676950637600526 9275830639065898170 6762742930827334073 16862800599087838815 13481924545051381634 16436948992084179560 6906520316916502096 12790717607058950780 9726214034378009495
Clearly either my understanding of what jump() is doing is wrong, or there's a bug in the jump() function, or the jump polynomial data is wrong. Vigna claims that such a jump function can be calculated for any stride of the period, but doesn't elaborate on how to calculate it (including in his paper on xorshift* generators). How can I calculate the correct jump data to verify that there's not a typo somewhere in it?
Xorshift1024* reference implementation; http://xorshift.di.unimi.it/xorshift1024star.c
/* Written in 2014-2015 by Sebastiano Vigna (vigna#acm.org)
To the extent possible under law, the author has dedicated all copyright
and related and neighboring rights to this software to the public domain
worldwide. This software is distributed without any warranty.
See <http://creativecommons.org/publicdomain/zero/1.0/>. */
#include <stdint.h>
#include <string.h>
/* This is a fast, top-quality generator. If 1024 bits of state are too
much, try a xorshift128+ generator.
The state must be seeded so that it is not everywhere zero. If you have
a 64-bit seed, we suggest to seed a splitmix64 generator and use its
output to fill s. */
uint64_t s[16];
int p;
uint64_t next(void) {
const uint64_t s0 = s[p];
uint64_t s1 = s[p = (p + 1) & 15];
s1 ^= s1 << 31; // a
s[p] = s1 ^ s0 ^ (s1 >> 11) ^ (s0 >> 30); // b,c
return s[p] * UINT64_C(1181783497276652981);
}
/* This is the jump function for the generator. It is equivalent
to 2^512 calls to next(); it can be used to generate 2^512
non-overlapping subsequences for parallel computations. */
void jump() {
static const uint64_t JUMP[] = { 0x84242f96eca9c41dULL,
0xa3c65b8776f96855ULL, 0x5b34a39f070b5837ULL, 0x4489affce4f31a1eULL,
0x2ffeeb0a48316f40ULL, 0xdc2d9891fe68c022ULL, 0x3659132bb12fea70ULL,
0xaac17d8efa43cab8ULL, 0xc4cb815590989b13ULL, 0x5ee975283d71c93bULL,
0x691548c86c1bd540ULL, 0x7910c41d10a1e6a5ULL, 0x0b5fc64563b3e2a8ULL,
0x047f7684e9fc949dULL, 0xb99181f2d8f685caULL, 0x284600e3f30e38c3ULL
};
uint64_t t[16] = { 0 };
for(int i = 0; i < sizeof JUMP / sizeof *JUMP; i++)
for(int b = 0; b < 64; b++) {
if (JUMP[i] & 1ULL << b)
for(int j = 0; j < 16; j++)
t[j] ^= s[(j + p) & 15];
next();
}
memcpy(s, t, sizeof t);
}

OK, I'm sorry but sometimes this happens (I'm the author).
Originally the function had two memcpy(). Then I realised then a circular copy was needed. But I replaced just the first memcpy(). Stupid, stupid, stupid. All files on the site have been fixed. The arXiv copy is undergoing update. See http://xorshift.di.unimi.it/xorshift1024star.c
Incidentally: I didn't "publish" anything wrong in the scientific sense, as the jump() function is not part of the ACM Trans. Math. Soft. paper—it just has been added few weeks ago on the site and on the arXiv/WWW version. The fast publication path of the web and arXiv means that, sometimes, one distributes unpolished papers. I can only thank the reporter for reporting this bug (OK, technically StackOverflow is not reporting bugs, but I got an email, too).
Unfortunately, the unit test I had did not consider the case p ≠ 0. My main concern was that the correctness of the computed polynomial. The function, as noted above, is correct when p = 0.
As for the computation: to each generator corresponds a characteristic polynomial P(x). The jump polynomial for k is just x^k mod P(x). I use fermat to compute such powers, and then I have some scripts generating the C code.
Of course I can't test 2^512, but since my generation code works perfectly from 2 to 2^30 (the range you can easily test), I'm confident it works at 2^512, too. It's just fermat computing x^{2^512} instead of x^{2^30}. But independent verifications are more than welcome.
I have code working only for powers of the form x^{2^t}. This is what I need to compute useful jump functions. Computing polynomials modulo P(x) is not difficult, so one could conceivably have a completely generic jump function for any value, but frankly I find this totally overkill.
If anybody is interested in getting other jump polynomials, I can provide the scripts. They will be part, as it happens for all other code, of the next xorshift distribution, but I need to complete the documentation before giving them out.
For the record, the characteristic polynomial of xorshift1024* is x^1024 + x^974 + x^973 + x^972 + x^971 + x^966 + x^965 + x^964 + x^963 + x^960 + x^958 + x^957 + x^956 + x^955 + x^950 + x^949 + x^948 + x^947 + x^942 + x^941 + x^940 + x^939 + x^934 + x^933 + x^932 + x^931 + x^926 + x^925 + x^923 + x^922 + x^920 + x^917 + x^916 + x^915 + x^908 + x^906 + x^904 + x^902 + x^890 + x^886 + x^873 + x^870 + x^857 + x^856 + x^846 + x^845 + x^844 + x^843 + x^841 + x^840 + x^837 + x^835 + x^830 + x^828 + x^825 + x^824 + x^820 + x^816 + x^814 + x^813 + x^811 + x^810 + x^803 + x^798 + x^797 + x^790 + x^788 + x^787 + x^786 + x^783 + x^774 + x^772 + x^771 + x^770 + x^769 + x^768 + x^767 + x^765 + x^760 + x^758 + x^753 + x^749 + x^747 + x^746 + x^743 + x^741 + x^740 + x^738 + x^737 + x^736 + x^735 + x^728 + x^726 + x^723 + x^722 + x^721 + x^720 + x^718 + x^716 + x^715 + x^714 + x^710 + x^709 + x^707 + x^694 + x^687 + x^686 + x^685 + x^684 + x^679 + x^678 + x^677 + x^674 + x^670 + x^669 + x^667 + x^666 + x^665 + x^663 + x^658 + x^655 + x^651 + x^639 + x^638 + x^635 + x^634 + x^632 + x^630 + x^623 + x^621 + x^618 + x^617 + x^616 + x^615 + x^614 + x^613 + x^609 + x^606 + x^604 + x^601 + x^600 + x^598 + x^597 + x^596 + x^594 + x^593 + x^592 + x^590 + x^589 + x^588 + x^584 + x^583 + x^582 + x^581 + x^579 + x^577 + x^575 + x^573 + x^572 + x^571 + x^569 + x^567 + x^565 + x^564 + x^563 + x^561 + x^559 + x^557 + x^556 + x^553 + x^552 + x^550 + x^544 + x^543 + x^542 + x^541 + x^537 + x^534 + x^532 + x^530 + x^528 + x^526 + x^523 + x^521 + x^520 + x^518 + x^516 + x^515 + x^512 + x^511 + x^510 + x^508 + x^507 + x^506 + x^505 + x^504 + x^502 + x^501 + x^499 + x^497 + x^494 + x^493 + x^492 + x^491 + x^490 + x^487 + x^485 + x^483 + x^482 + x^480 + x^479 + x^477 + x^476 + x^475 + x^473 + x^469 + x^468 + x^465 + x^463 + x^461 + x^460 + x^459 + x^458 + x^455 + x^453 + x^451 + x^448 + x^447 + x^446 + x^445 + x^443 + x^438 + x^437 + x^431 + x^430 + x^429 + x^428 + x^423 + x^417 + x^416 + x^415 + x^414 + x^412 + x^410 + x^409 + x^408 + x^400 + x^398 + x^396 + x^395 + x^391 + x^390 + x^386 + x^385 + x^381 + x^380 + x^378 + x^375 + x^373 + x^372 + x^369 + x^368 + x^365 + x^360 + x^358 + x^357 + x^354 + x^350 + x^348 + x^346 + x^345 + x^344 + x^343 + x^342 + x^340 + x^338 + x^337 + x^336 + x^335 + x^333 + x^332 + x^325 + x^323 + x^318 + x^315 + x^313 + x^309 + x^308 + x^305 + x^303 + x^302 + x^300 + x^294 + x^290 + x^281 + x^279 + x^276 + x^275 + x^273 + x^272 + x^267 + x^263 + x^262 + x^261 + x^260 + x^258 + x^257 + x^256 + x^249 + x^248 + x^243 + x^242 + x^240 + x^238 + x^236 + x^233 + x^232 + x^230 + x^228 + x^225 + x^216 + x^214 + x^212 + x^210 + x^208 + x^206 + x^205 + x^200 + x^197 + x^196 + x^184 + x^180 + x^176 + x^175 + x^174 + x^173 + x^168 + x^167 + x^166 + x^157 + x^155 + x^153 + x^152 + x^151 + x^150 + x^144 + x^143 + x^136 + x^135 + x^125 + x^121 + x^111 + x^109 + x^107 + x^105 + x^92 + x^90 + x^79 + x^78 + x^77 + x^76 + x^60 + 1

tldr: I'm pretty sure there's a bug in the original code:
The memcpy in jump() must consider the p rotation too.
The author didn't test nearly as much as appropriate before publishing a paper...
Long version:
One next() call changes only one of the 16 s array elements, the one with index p. p starts at 0, gets increased each next() call, and after 15 it becomes 0 again. Let's call s[p] the "current" array element. Another (slower) possibility for implementing next() would be that the current element is always the first one, there is no p, and instead of incrementing p the whole s array is rotated (ie. the first element moves to the last position and the previous second element becomes the first).
Independent of the current p value, 16 calls to next() should result in the same p value as before, ie. the whole cycle is done and the current element is the same position as before the 16 calls. jump() should do 2^512 next(), 2^512 is a multiple of 16, so with one jump, the p value before and after it should be the same.
You probably noticed already that your two different results are only rotated one time, ie. one solution is "9726214034378009495 somethingelse" and one is "somethingelse 9726214034378009495"
...because you did one next() before/after the jump() and jump() can't handle p other than 0.
If you'd test it with 16 next() (or 32 or 0 or ...) before/after jump() instead of one, the two results are equal. The reason is, within jump, while for the s array the current element / p is handled as it is in next(), the t array is semantically rotated so that the current element is always the first one (t[j] ^= s[(j + p) & 15];). Then, right before the function terminates, memcpy(s, t, sizeof t); copies the new values from t back to s without considering the rotation at all. Just replace the memcpy with a proper loop including the p offset, then it should be fine.
(Well, but that doesn't mean jump() is really the same as 2^512 next(). But at least it could be.)

As Vigna himself said, that was actually a bug.
While working on a Java implementation, I found, if not mistaken, a small improvement on the correct implementation:
If you update t array also circularly from p to p-1, then you can just memcpy it back to the state and it will work correctly.
Moreover, the loop updating t gets tighter, as you do not need to add p + j every time. For instance:
int j = p;
do {
t[j] ^= s[j];
++j;
j &= 15;
} while (j != p);
Ok, as bcrist correctly noted, the previous code is wrong, as p changes for each bit in JUMP array. The best alternative I come up with is the following:
void jump() {
static const uint64_t JUMP[] = { 0x84242f96eca9c41dULL,
0xa3c65b8776f96855ULL, 0x5b34a39f070b5837ULL, 0x4489affce4f31a1eULL,
0x2ffeeb0a48316f40ULL, 0xdc2d9891fe68c022ULL, 0x3659132bb12fea70ULL,
0xaac17d8efa43cab8ULL, 0xc4cb815590989b13ULL, 0x5ee975283d71c93bULL,
0x691548c86c1bd540ULL, 0x7910c41d10a1e6a5ULL, 0x0b5fc64563b3e2a8ULL,
0x047f7684e9fc949dULL, 0xb99181f2d8f685caULL, 0x284600e3f30e38c3ULL
};
uint64_t t[16] = { 0 };
const int base = p;
int j = base;
for(int i = 0; i < sizeof JUMP / sizeof *JUMP; i++)
for(int b = 0; b < 64; b++) {
if (JUMP[i] & 1ULL << b) {
int k = p;
do {
t[j++] ^= s[k++];
j &= 15;
k &= 15;
} while (j != base);
}
next();
}
memcpy(s, t, sizeof t);
}
As p will have its original value in the end, this should work.
Not very sure whether it is actually an improvement in performance, as I am trading one addition for an increment and a bitwise AND.
I think it will not be slower, even if increment is as expensive as addition, due to the lack of data dependency between j and k updates. Hopefully, it may be slightly faster.
Opinions / corrections are more than welcome.

Which is more costly for the processor?

I am working on a small Math lib for 3D graphics.
I'm not sure what costs more to the CPU/GPU in terms of time.
Right now I am using this to multiply matrix (4x4)
tmpM.p[0][0] = matA.p[0][0] * matB.p[0][0] + matA.p[0][1] * matB.p[1][0] + matA.p[0][2] * matB.p[2][0] + matA.p[0][3] * matB.p[3][0];
tmpM.p[0][1] = matA.p[0][0] * matB.p[0][1] + matA.p[0][1] * matB.p[1][1] + matA.p[0][2] * matB.p[2][1] + matA.p[0][3] * matB.p[3][1];
tmpM.p[0][2] = matA.p[0][0] * matB.p[0][2] + matA.p[0][1] * matB.p[1][2] + matA.p[0][2] * matB.p[2][2] + matA.p[0][3] * matB.p[3][2];
tmpM.p[0][3] = matA.p[0][0] * matB.p[0][3] + matA.p[0][1] * matB.p[1][3] + matA.p[0][2] * matB.p[2][3] + matA.p[0][3] * matB.p[3][3];
tmpM.p[1][0] = matA.p[1][0] * matB.p[0][0] + matA.p[1][1] * matB.p[1][0] + matA.p[1][2] * matB.p[2][0] + matA.p[1][3] * matB.p[3][0];
tmpM.p[1][1] = matA.p[1][0] * matB.p[0][1] + matA.p[1][1] * matB.p[1][1] + matA.p[1][2] * matB.p[2][1] + matA.p[1][3] * matB.p[3][1];
tmpM.p[1][2] = matA.p[1][0] * matB.p[0][2] + matA.p[1][1] * matB.p[1][2] + matA.p[1][2] * matB.p[2][2] + matA.p[1][3] * matB.p[3][2];
tmpM.p[1][3] = matA.p[1][0] * matB.p[0][3] + matA.p[1][1] * matB.p[1][3] + matA.p[1][2] * matB.p[2][3] + matA.p[1][3] * matB.p[3][3];
tmpM.p[2][0] = matA.p[2][0] * matB.p[0][0] + matA.p[2][1] * matB.p[1][0] + matA.p[2][2] * matB.p[2][0] + matA.p[2][3] * matB.p[3][0];
tmpM.p[2][1] = matA.p[2][0] * matB.p[0][1] + matA.p[2][1] * matB.p[1][1] + matA.p[2][2] * matB.p[2][1] + matA.p[2][3] * matB.p[3][1];
tmpM.p[2][2] = matA.p[2][0] * matB.p[0][2] + matA.p[2][1] * matB.p[1][2] + matA.p[2][2] * matB.p[2][2] + matA.p[2][3] * matB.p[3][2];
tmpM.p[2][3] = matA.p[2][0] * matB.p[0][3] + matA.p[2][1] * matB.p[1][3] + matA.p[2][2] * matB.p[2][3] + matA.p[2][3] * matB.p[3][3];
tmpM.p[3][0] = matA.p[3][0] * matB.p[0][0] + matA.p[3][1] * matB.p[1][0] + matA.p[3][2] * matB.p[2][0] + matA.p[3][3] * matB.p[3][0];
tmpM.p[3][1] = matA.p[3][0] * matB.p[0][1] + matA.p[3][1] * matB.p[1][1] + matA.p[3][2] * matB.p[2][1] + matA.p[3][3] * matB.p[3][1];
tmpM.p[3][2] = matA.p[3][0] * matB.p[0][2] + matA.p[3][1] * matB.p[1][2] + matA.p[3][2] * matB.p[2][2] + matA.p[3][3] * matB.p[3][2];
tmpM.p[3][3] = matA.p[3][0] * matB.p[0][3] + matA.p[3][1] * matB.p[1][3] + matA.p[3][2] * matB.p[2][3] + matA.p[3][3] * matB.p[3][3];
Is this a bad/slow idea?
Will it be more efficient to use a loop?

It will mostly depend on what the compiler manages to figure out from it.
If you can't time the operation in a context similar or identical to use (which remains the best way to figure it out), then I would guess a loop with a functor (functor object or lambda) is likely the best bet for the compiler to be able to figure out a cache friendly unroll and a cheap access to the operation.
A half decent modern compiler will also most likely vectorize it on a CPU.

Why can the Intel C++ compiler not benefit from this template, while GNU can?

In our three dimensional CFD code, performance is crucial. In order to avoid all the calculations in the third dimension in case of a two dimensional simulation, I am experimenting with templates, to avoid double code. With the GCC compiler, this works perfectly, but in case of the Intel compiler this results in a massive performance loss.
The call to my templated function looks like:
if(jtot == 1) // If jtot equals 1, then the grid is 2-dimensional
advecu<2>(ut,u,v,w,dzi4);
else
advecu<3>(ut,u,v,w,dzi4);
My templated function looks like this:
template<int dim>
void Advec4::advecu(double * restrict ut, double * restrict u, double * restrict v, double * restrict w, double * restrict dzi4)
{
// Here comes the initialization of the constants...
// Loop
for(int k=grid->kstart; k<grid->kend; k++)
for(int j=grid->jstart; j<grid->jend; j++)
#pragma ivdep
for(int i=grid->istart; i<grid->iend; i++)
{
ijk = i + j*jj1 + k*kk1;
ut[ijk] -= ( cg0*((ci0*u[ijk-ii3] + ci1*u[ijk-ii2] + ci2*u[ijk-ii1] + ci3*u[ijk ]) * (ci0*u[ijk-ii3] + ci1*u[ijk-ii2] + ci2*u[ijk-ii1] + ci3*u[ijk ]))
+ cg1*((ci0*u[ijk-ii2] + ci1*u[ijk-ii1] + ci2*u[ijk ] + ci3*u[ijk+ii1]) * (ci0*u[ijk-ii2] + ci1*u[ijk-ii1] + ci2*u[ijk ] + ci3*u[ijk+ii1]))
+ cg2*((ci0*u[ijk-ii1] + ci1*u[ijk ] + ci2*u[ijk+ii1] + ci3*u[ijk+ii2]) * (ci0*u[ijk-ii1] + ci1*u[ijk ] + ci2*u[ijk+ii1] + ci3*u[ijk+ii2]))
+ cg3*((ci0*u[ijk ] + ci1*u[ijk+ii1] + ci2*u[ijk+ii2] + ci3*u[ijk+ii3]) * (ci0*u[ijk ] + ci1*u[ijk+ii1] + ci2*u[ijk+ii2] + ci3*u[ijk+ii3])) ) * cgi*dxi;
if(dim == 3)
{
ut[ijk] -= ( cg0*((ci0*v[ijk-ii2-jj1] + ci1*v[ijk-ii1-jj1] + ci2*v[ijk-jj1] + ci3*v[ijk+ii1-jj1]) * (ci0*u[ijk-jj3] + ci1*u[ijk-jj2] + ci2*u[ijk-jj1] + ci3*u[ijk ]))
+ cg1*((ci0*v[ijk-ii2 ] + ci1*v[ijk-ii1 ] + ci2*v[ijk ] + ci3*v[ijk+ii1 ]) * (ci0*u[ijk-jj2] + ci1*u[ijk-jj1] + ci2*u[ijk ] + ci3*u[ijk+jj1]))
+ cg2*((ci0*v[ijk-ii2+jj1] + ci1*v[ijk-ii1+jj1] + ci2*v[ijk+jj1] + ci3*v[ijk+ii1+jj1]) * (ci0*u[ijk-jj1] + ci1*u[ijk ] + ci2*u[ijk+jj1] + ci3*u[ijk+jj2]))
+ cg3*((ci0*v[ijk-ii2+jj2] + ci1*v[ijk-ii1+jj2] + ci2*v[ijk+jj2] + ci3*v[ijk+ii1+jj2]) * (ci0*u[ijk ] + ci1*u[ijk+jj1] + ci2*u[ijk+jj2] + ci3*u[ijk+jj3])) ) * cgi*dyi;
}
ut[ijk] -= ( cg0*((ci0*w[ijk-ii2-kk1] + ci1*w[ijk-ii1-kk1] + ci2*w[ijk-kk1] + ci3*w[ijk+ii1-kk1]) * (ci0*u[ijk-kk3] + ci1*u[ijk-kk2] + ci2*u[ijk-kk1] + ci3*u[ijk ]))
+ cg1*((ci0*w[ijk-ii2 ] + ci1*w[ijk-ii1 ] + ci2*w[ijk ] + ci3*w[ijk+ii1 ]) * (ci0*u[ijk-kk2] + ci1*u[ijk-kk1] + ci2*u[ijk ] + ci3*u[ijk+kk1]))
+ cg2*((ci0*w[ijk-ii2+kk1] + ci1*w[ijk-ii1+kk1] + ci2*w[ijk+kk1] + ci3*w[ijk+ii1+kk1]) * (ci0*u[ijk-kk1] + ci1*u[ijk ] + ci2*u[ijk+kk1] + ci3*u[ijk+kk2]))
+ cg3*((ci0*w[ijk-ii2+kk2] + ci1*w[ijk-ii1+kk2] + ci2*w[ijk+kk2] + ci3*w[ijk+ii1+kk2]) * (ci0*u[ijk ] + ci1*u[ijk+kk1] + ci2*u[ijk+kk2] + ci3*u[ijk+kk3])) )
* dzi4[k];
}
}
Why does the Intel compiler mess this up? It seems that it does not process the template before it starts optimizing, which results in a remaining if statement in an inner loop or a loss of performance for whatever other reason.

Discrete Wavelet Transform integer Daub 5/3 lifting issue

I'm trying to run an integer-to-integer lifting 5/3 on an image of lena. I've been following the paper "A low-power Low-memory system for wavelet-based image compression" by Walker, Nguyen, and Chen (Link active as of 7 Oct 2015).
I'm running into issues though. The image just doesn't seem to come out quite right. I appear to be overflowing slightly in the green and blue channels which means that subsequent passes of the wavelet function find high frequencies where there ought not to be any. I'm also pretty sure I'm getting something else wrong as I am seeing a line of the s0 image at the edges of the high frequency parts.
My function is as follows:
bool PerformHorizontal( Col24* pPixelsIn, Col24* pPixelsOut, int width, int pixelPitch, int height )
{
const int widthDiv2 = width / 2;
int y = 0;
while( y < height )
{
int x = 0;
while( x < width )
{
const int n = (x) + (y * pixelPitch);
const int n2 = (x / 2) + (y * pixelPitch);
const int s = n2;
const int d = n2 + widthDiv2;
// Non-lifting 5 / 3
/*pPixelsOut[n2 + widthDiv2].r = pPixelsIn[n + 2].r - ((pPixelsIn[n + 1].r + pPixelsIn[n + 3].r) / 2) + 128;
pPixelsOut[n2].r = ((4 * pPixelsIn[n + 2].r) + (2 * pPixelsIn[n + 2].r) + (2 * (pPixelsIn[n + 1].r + pPixelsIn[n + 3].r)) - (pPixelsIn[n + 0].r + pPixelsIn[n + 4].r)) / 8;
pPixelsOut[n2 + widthDiv2].g = pPixelsIn[n + 2].g - ((pPixelsIn[n + 1].g + pPixelsIn[n + 3].g) / 2) + 128;
pPixelsOut[n2].g = ((4 * pPixelsIn[n + 2].g) + (2 * pPixelsIn[n + 2].g) + (2 * (pPixelsIn[n + 1].g + pPixelsIn[n + 3].g)) - (pPixelsIn[n + 0].g + pPixelsIn[n + 4].g)) / 8;
pPixelsOut[n2 + widthDiv2].b = pPixelsIn[n + 2].b - ((pPixelsIn[n + 1].b + pPixelsIn[n + 3].b) / 2) + 128;
pPixelsOut[n2].b = ((4 * pPixelsIn[n + 2].b) + (2 * pPixelsIn[n + 2].b) + (2 * (pPixelsIn[n + 1].b + pPixelsIn[n + 3].b)) - (pPixelsIn[n + 0].b + pPixelsIn[n + 4].b)) / 8;*/
pPixelsOut[d].r = pPixelsIn[n + 1].r - (((pPixelsIn[n].r + pPixelsIn[n + 2].r) >> 1) + 127);
pPixelsOut[s].r = pPixelsIn[n].r + (((pPixelsOut[d - 1].r + pPixelsOut[d].r) >> 2) - 64);
pPixelsOut[d].g = pPixelsIn[n + 1].g - (((pPixelsIn[n].g + pPixelsIn[n + 2].g) >> 1) + 127);
pPixelsOut[s].g = pPixelsIn[n].g + (((pPixelsOut[d - 1].g + pPixelsOut[d].g) >> 2) - 64);
pPixelsOut[d].b = pPixelsIn[n + 1].b - (((pPixelsIn[n].b + pPixelsIn[n + 2].b) >> 1) + 127);
pPixelsOut[s].b = pPixelsIn[n].b + (((pPixelsOut[d - 1].b + pPixelsOut[d].b) >> 2) - 64);
x += 2;
}
y++;
}
return true;
}
There is definitely something wrong but I just can't figure it out. Can anyone with slightly more brain than me point out where I am going wrong? Its worth noting that you can see the un-lifted version of the Daub 5/3 above the working code and this, too, give me the same artifacts ... I'm very confused as I have had this working once before (It was over 2 years ago and I no longer have that code).
Any help would be much appreciated :)
Edit: I appear to have eliminated my overflow issues by clamping the low pass pixels to the 0 to 255 range. I'm slightly concerned this isn't the right solution though. Can anyone comment on this?

You can do some tests with extreme values to see the possibility of overflow. Example:
pPixelsOut[d].r = pPixelsIn[n + 1].r - (((pPixelsIn[n].r + pPixelsIn[n + 2].r) >> 1) + 127);
If:
pPixelsIn[n ].r == 255
pPixelsIn[n+1].r == 0
pPixelsIn[n+2].r == 255
Then:
pPixelsOut[d].r == -382
But if:
pPixelsIn[n ].r == 0
pPixelsIn[n+1].r == 255
pPixelsIn[n+2].r == 0
Then:
pPixelsOut[d].r == 128
You have a range of 511 possible values (-382 .. 128), so, in order to avoid overflow or clamping, you would need one extra bit, some quantization, or another encoding type!

I'm assuming the data have already been thresholded?
I also don't get why you're adding back in +127 and -64.

OK I can losslessly forward then inverse as long as I store my post forward transform data in a short. Obviously this takes up a little more space than I was hoping for but this does allow me a good starting point for going into the various compression algorithms. You can also, nicely, compress 2 4 component pixels at a time using SSE2 instructions. This is the standard C forward transform I came up with:
const int16_t dr = (int16_t)pPixelsIn[n + 1].r - ((((int16_t)pPixelsIn[n].r + (int16_t)pPixelsIn[n + 2].r) >> 1));
const int16_t sr = (int16_t)pPixelsIn[n].r + ((((int16_t)pPixelsOut[d - 1].r + dr) >> 2));
const int16_t dg = (int16_t)pPixelsIn[n + 1].g - ((((int16_t)pPixelsIn[n].g + (int16_t)pPixelsIn[n + 2].g) >> 1));
const int16_t sg = (int16_t)pPixelsIn[n].g + ((((int16_t)pPixelsOut[d - 1].g + dg) >> 2));
const int16_t db = (int16_t)pPixelsIn[n + 1].b - ((((int16_t)pPixelsIn[n].b + (int16_t)pPixelsIn[n + 2].b) >> 1));
const int16_t sb = (int16_t)pPixelsIn[n].b + ((((int16_t)pPixelsOut[d - 1].b + db) >> 2));
pPixelsOut[d].r = dr;
pPixelsOut[s].r = sr;
pPixelsOut[d].g = dg;
pPixelsOut[s].g = sg;
pPixelsOut[d].b = db;
pPixelsOut[s].b = sb;
It is trivial to create the inverse of this (A VERY simple bit of algebra). Its worth noting, btw, that you need to inverse the image from right to left bottom to top. I'll next see if I can shunt this data into uint8_ts and lost a bit or 2 of accuracy. For compression this really isn't a problem.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Performance loss if function is not defined in class declaration - c++

The functions that are defined inside the class declarations are inlined in most cases and thats why there is a performance degrade

Related

Converting 1-d array to 2d

Xorshift1024* jump not commutative?

Which is more costly for the processor?

Why can the Intel C++ compiler not benefit from this template, while GNU can?

Discrete Wavelet Transform integer Daub 5/3 lifting issue

Categories

Resources