Xorshift1024* jump not commutative? - C++

I've been porting Sebastiano Vigna's xorshift1024* PRNG to be compatible with the standard C++11 uniform random number generator contract and noticed some strange behavior with the jump() function he provides.
According to Vigna, a call to jump() should be equivalent to 2^512 calls to next(). Therefore a series of calls to jump() and next() should be commutative. For example, assuming the generator starts in some known state,
jump();
next();
should leave the generator in the same state as
next();
jump();
since both should be equivalent to
for (bigint i = 0; i < (bigint(1) << 512) + 1; ++i)
next();
assuming bigint is some integer type with an extremely large maximum value (and assuming you are a very, very, very patient person).
Unfortunately, this doesn't work with the reference implementation Vigna provides (which I will include at the end for posterity, in case the implementation linked above changes or is taken down in the future). When testing the first two options using the following test code:
memset(s, 0xFF, sizeof(s));
p = 0;
// jump() and/or next() calls...
std::cout << p << ';';
for (int i = 0; i < 16; ++i)
std::cout << ' ' << s[i];
calling jump() before next() outputs:
1; 9726214034378009495 13187905351877324975 10033047168458208082 990371716258730972 965585206446988056 74622805968655940 11468976784638207029 3005795712504439672 6792676950637600526 9275830639065898170 6762742930827334073 16862800599087838815 13481924545051381634 16436948992084179560 6906520316916502096 12790717607058950780
while calling next() first results in:
1; 13187905351877324975 10033047168458208082 990371716258730972 965585206446988056 74622805968655940 11468976784638207029 3005795712504439672 6792676950637600526 9275830639065898170 6762742930827334073 16862800599087838815 13481924545051381634 16436948992084179560 6906520316916502096 12790717607058950780 9726214034378009495
Clearly either my understanding of what jump() is doing is wrong, or there's a bug in the jump() function, or the jump polynomial data is wrong. Vigna claims that such a jump function can be calculated for any stride of the period, but doesn't elaborate on how to calculate it (including in his paper on xorshift* generators). How can I calculate the correct jump data to verify that there's not a typo somewhere in it?
Xorshift1024* reference implementation: http://xorshift.di.unimi.it/xorshift1024star.c
/* Written in 2014-2015 by Sebastiano Vigna (vigna@acm.org)
To the extent possible under law, the author has dedicated all copyright
and related and neighboring rights to this software to the public domain
worldwide. This software is distributed without any warranty.
See <http://creativecommons.org/publicdomain/zero/1.0/>. */
#include <stdint.h>
#include <string.h>
/* This is a fast, top-quality generator. If 1024 bits of state are too
much, try a xorshift128+ generator.
The state must be seeded so that it is not everywhere zero. If you have
a 64-bit seed, we suggest to seed a splitmix64 generator and use its
output to fill s. */
uint64_t s[16];
int p;
uint64_t next(void) {
const uint64_t s0 = s[p];
uint64_t s1 = s[p = (p + 1) & 15];
s1 ^= s1 << 31; // a
s[p] = s1 ^ s0 ^ (s1 >> 11) ^ (s0 >> 30); // b,c
return s[p] * UINT64_C(1181783497276652981);
}
/* This is the jump function for the generator. It is equivalent
to 2^512 calls to next(); it can be used to generate 2^512
non-overlapping subsequences for parallel computations. */
void jump() {
static const uint64_t JUMP[] = { 0x84242f96eca9c41dULL,
0xa3c65b8776f96855ULL, 0x5b34a39f070b5837ULL, 0x4489affce4f31a1eULL,
0x2ffeeb0a48316f40ULL, 0xdc2d9891fe68c022ULL, 0x3659132bb12fea70ULL,
0xaac17d8efa43cab8ULL, 0xc4cb815590989b13ULL, 0x5ee975283d71c93bULL,
0x691548c86c1bd540ULL, 0x7910c41d10a1e6a5ULL, 0x0b5fc64563b3e2a8ULL,
0x047f7684e9fc949dULL, 0xb99181f2d8f685caULL, 0x284600e3f30e38c3ULL
};
uint64_t t[16] = { 0 };
for(int i = 0; i < sizeof JUMP / sizeof *JUMP; i++)
for(int b = 0; b < 64; b++) {
if (JUMP[i] & 1ULL << b)
for(int j = 0; j < 16; j++)
t[j] ^= s[(j + p) & 15];
next();
}
memcpy(s, t, sizeof t);
}

OK, I'm sorry but sometimes this happens (I'm the author).
Originally the function had two memcpy(). Then I realised that a circular copy was needed, but I replaced just the first memcpy(). Stupid, stupid, stupid. All files on the site have been fixed. The arXiv copy is undergoing update. See http://xorshift.di.unimi.it/xorshift1024star.c
Incidentally: I didn't "publish" anything wrong in the scientific sense, as the jump() function is not part of the ACM Trans. Math. Soft. paper; it was added to the site and to the arXiv/WWW version just a few weeks ago. The fast publication path of the web and arXiv means that, sometimes, one distributes unpolished papers. I can only thank the reporter for reporting this bug (OK, technically StackOverflow is not a bug-reporting channel, but I got an email, too).
Unfortunately, the unit test I had did not consider the case p ≠ 0. My main concern was the correctness of the computed polynomial. The function, as noted above, is correct when p = 0.
As for the computation: to each generator corresponds a characteristic polynomial P(x). The jump polynomial for k is just x^k mod P(x). I use Fermat (a computer algebra system) to compute such powers, and then I have some scripts generating the C code.
Of course I can't test 2^512 steps directly, but since my generation code works perfectly from 2 to 2^30 (the range you can easily test), I'm confident it works at 2^512, too. It's just Fermat computing x^{2^512} instead of x^{2^30}. But independent verifications are more than welcome.
I have code working only for powers of the form x^{2^t}. This is what I need to compute useful jump functions. Computing polynomials modulo P(x) is not difficult, so one could conceivably have a completely generic jump function for any value, but frankly I find this totally overkill.
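For anyone who wants to verify the table independently: computing x^{2^t} mod P(x) is just repeated squaring over GF(2), since over GF(2) the square of a sum of monomials is the sum of their squares. A minimal sketch (illustrative only, not my Fermat scripts; Poly and the function names are made up):

#include <bitset>

// Bit i of a Poly is the coefficient of x^i; 2048 bits is enough to hold
// the square of a degree-1023 polynomial.
using Poly = std::bitset<2048>;

// Reduce q modulo P, where P has degree 1024 (bit 1024 set).
void reduce_mod(Poly& q, const Poly& P) {
    for (int d = 2046; d >= 1024; --d)
        if (q.test(d))
            q ^= P << (d - 1024);      // cancel the coefficient of x^d
}

// Squaring over GF(2): (sum c_i x^i)^2 = sum c_i x^(2i), then reduce.
Poly square_mod(const Poly& q, const Poly& P) {
    Poly r;
    for (int i = 0; i < 1024; ++i)
        if (q.test(i))
            r.set(2 * i);
    reduce_mod(r, P);
    return r;
}

// x^(2^t) mod P(x): start from x and square t times (t = 512 for this jump).
Poly x_pow_2_pow_t(int t, const Poly& P) {
    Poly q;
    q.set(1);                          // q = x
    for (int i = 0; i < t; ++i)
        q = square_mod(q, P);
    return q;
}

Packing the resulting 1024 coefficients low bit first into sixteen 64-bit words (which is how jump() consumes JUMP[]) should reproduce the table, so a mismatch would indicate a typo.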
If anybody is interested in getting other jump polynomials, I can provide the scripts. They will be part, as it happens for all other code, of the next xorshift distribution, but I need to complete the documentation before giving them out.
For the record, the characteristic polynomial of xorshift1024* is x^1024 + x^974 + x^973 + x^972 + x^971 + x^966 + x^965 + x^964 + x^963 + x^960 + x^958 + x^957 + x^956 + x^955 + x^950 + x^949 + x^948 + x^947 + x^942 + x^941 + x^940 + x^939 + x^934 + x^933 + x^932 + x^931 + x^926 + x^925 + x^923 + x^922 + x^920 + x^917 + x^916 + x^915 + x^908 + x^906 + x^904 + x^902 + x^890 + x^886 + x^873 + x^870 + x^857 + x^856 + x^846 + x^845 + x^844 + x^843 + x^841 + x^840 + x^837 + x^835 + x^830 + x^828 + x^825 + x^824 + x^820 + x^816 + x^814 + x^813 + x^811 + x^810 + x^803 + x^798 + x^797 + x^790 + x^788 + x^787 + x^786 + x^783 + x^774 + x^772 + x^771 + x^770 + x^769 + x^768 + x^767 + x^765 + x^760 + x^758 + x^753 + x^749 + x^747 + x^746 + x^743 + x^741 + x^740 + x^738 + x^737 + x^736 + x^735 + x^728 + x^726 + x^723 + x^722 + x^721 + x^720 + x^718 + x^716 + x^715 + x^714 + x^710 + x^709 + x^707 + x^694 + x^687 + x^686 + x^685 + x^684 + x^679 + x^678 + x^677 + x^674 + x^670 + x^669 + x^667 + x^666 + x^665 + x^663 + x^658 + x^655 + x^651 + x^639 + x^638 + x^635 + x^634 + x^632 + x^630 + x^623 + x^621 + x^618 + x^617 + x^616 + x^615 + x^614 + x^613 + x^609 + x^606 + x^604 + x^601 + x^600 + x^598 + x^597 + x^596 + x^594 + x^593 + x^592 + x^590 + x^589 + x^588 + x^584 + x^583 + x^582 + x^581 + x^579 + x^577 + x^575 + x^573 + x^572 + x^571 + x^569 + x^567 + x^565 + x^564 + x^563 + x^561 + x^559 + x^557 + x^556 + x^553 + x^552 + x^550 + x^544 + x^543 + x^542 + x^541 + x^537 + x^534 + x^532 + x^530 + x^528 + x^526 + x^523 + x^521 + x^520 + x^518 + x^516 + x^515 + x^512 + x^511 + x^510 + x^508 + x^507 + x^506 + x^505 + x^504 + x^502 + x^501 + x^499 + x^497 + x^494 + x^493 + x^492 + x^491 + x^490 + x^487 + x^485 + x^483 + x^482 + x^480 + x^479 + x^477 + x^476 + x^475 + x^473 + x^469 + x^468 + x^465 + x^463 + x^461 + x^460 + x^459 + x^458 + x^455 + x^453 + x^451 + x^448 + x^447 + x^446 + x^445 + x^443 + x^438 + x^437 + x^431 + x^430 + x^429 + x^428 + x^423 + x^417 + x^416 + x^415 + x^414 + x^412 + x^410 + x^409 + x^408 + x^400 + x^398 + x^396 + x^395 + x^391 + x^390 + x^386 + x^385 + x^381 + x^380 + x^378 + x^375 + x^373 + x^372 + x^369 + x^368 + x^365 + x^360 + x^358 + x^357 + x^354 + x^350 + x^348 + x^346 + x^345 + x^344 + x^343 + x^342 + x^340 + x^338 + x^337 + x^336 + x^335 + x^333 + x^332 + x^325 + x^323 + x^318 + x^315 + x^313 + x^309 + x^308 + x^305 + x^303 + x^302 + x^300 + x^294 + x^290 + x^281 + x^279 + x^276 + x^275 + x^273 + x^272 + x^267 + x^263 + x^262 + x^261 + x^260 + x^258 + x^257 + x^256 + x^249 + x^248 + x^243 + x^242 + x^240 + x^238 + x^236 + x^233 + x^232 + x^230 + x^228 + x^225 + x^216 + x^214 + x^212 + x^210 + x^208 + x^206 + x^205 + x^200 + x^197 + x^196 + x^184 + x^180 + x^176 + x^175 + x^174 + x^173 + x^168 + x^167 + x^166 + x^157 + x^155 + x^153 + x^152 + x^151 + x^150 + x^144 + x^143 + x^136 + x^135 + x^125 + x^121 + x^111 + x^109 + x^107 + x^105 + x^92 + x^90 + x^79 + x^78 + x^77 + x^76 + x^60 + 1

tldr: I'm pretty sure there's a bug in the original code:
The memcpy in jump() must consider the p rotation too.
The author didn't test nearly as much as appropriate before publishing a paper...
Long version:
One next() call changes only one of the 16 elements of the s array, the one with index p. p starts at 0, is incremented on each next() call, and wraps back to 0 after 15. Let's call s[p] the "current" array element. Another (slower) way to implement next() would be to make the current element always the first one: there is no p, and instead of incrementing p, the whole s array is rotated (i.e. the first element moves to the last position and the previous second element becomes the first).
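To make that concrete, here is a rough sketch of such a rotation-based variant (illustrative only; next_rotating is a made-up name, and it uses the same s array and multiplier as the reference code):

#include <stdint.h>

extern uint64_t s[16];   // same state array as the reference code, but no p

uint64_t next_rotating(void) {
    const uint64_t s0 = s[0];                               // the current element is always s[0]
    uint64_t s1 = s[1];
    s1 ^= s1 << 31;                                         // a
    const uint64_t nv = s1 ^ s0 ^ (s1 >> 11) ^ (s0 >> 30);  // b, c
    // Rotate left by one so the next element becomes current, store the new
    // value in the current slot and the old current element at the back;
    // this mirrors advancing p by one in the reference code.
    for (int i = 0; i < 15; ++i)
        s[i] = s[i + 1];
    s[15] = s0;
    s[0] = nv;
    return nv * UINT64_C(1181783497276652981);
}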
Independent of the current p value, 16 calls to next() should result in the same p value as before, i.e. the whole cycle is done and the current element is at the same position as before the 16 calls. jump() should be equivalent to 2^512 next() calls; 2^512 is a multiple of 16, so the p value before and after one jump should be the same.
You probably noticed already that your two different results are rotated by just one position, i.e. one result is "9726214034378009495 somethingelse" and the other is "somethingelse 9726214034378009495"
...because you did one next() before/after the jump(), and jump() can't handle any p other than 0.
If you test it with 16 next() calls (or 32, or 0, ...) before/after jump() instead of one, the two results are equal. The reason is that, within jump(), the s array is indexed relative to the current element / p just as in next(), but the t array is semantically rotated so that the current element is always the first one (t[j] ^= s[(j + p) & 15];). Then, right before the function returns, memcpy(s, t, sizeof t); copies the new values from t back to s without considering the rotation at all. Replace the memcpy with a proper loop that includes the p offset and it should be fine.
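For example, one way to write that rotation-aware copy, keeping the reference code's variable names (a sketch of the fix described above):

// Instead of memcpy(s, t, sizeof t): t[j] holds the new value for the element
// j positions after the current one, so write it back with the same p offset
// that was used when t was accumulated.
for (int j = 0; j < 16; j++)
    s[(j + p) & 15] = t[j];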
(Well, but that doesn't mean jump() is really the same as 2^512 next(). But at least it could be.)

As Vigna himself said, that was actually a bug.
While working on a Java implementation I found, if I'm not mistaken, a small improvement on the corrected implementation:
If you also update the t array circularly, from p to p-1, then you can just memcpy it back into the state and it will work correctly.
Moreover, the loop updating t gets tighter, as you do not need to compute p + j every time. For instance:
int j = p;
do {
t[j] ^= s[j];
++j;
j &= 15;
} while (j != p);
OK, as bcrist correctly noted, the previous code is wrong, since p changes for each bit of the JUMP array. The best alternative I came up with is the following:
void jump() {
static const uint64_t JUMP[] = { 0x84242f96eca9c41dULL,
0xa3c65b8776f96855ULL, 0x5b34a39f070b5837ULL, 0x4489affce4f31a1eULL,
0x2ffeeb0a48316f40ULL, 0xdc2d9891fe68c022ULL, 0x3659132bb12fea70ULL,
0xaac17d8efa43cab8ULL, 0xc4cb815590989b13ULL, 0x5ee975283d71c93bULL,
0x691548c86c1bd540ULL, 0x7910c41d10a1e6a5ULL, 0x0b5fc64563b3e2a8ULL,
0x047f7684e9fc949dULL, 0xb99181f2d8f685caULL, 0x284600e3f30e38c3ULL
};
uint64_t t[16] = { 0 };
const int base = p;
int j = base;
for(int i = 0; i < sizeof JUMP / sizeof *JUMP; i++)
for(int b = 0; b < 64; b++) {
if (JUMP[i] & 1ULL << b) {
int k = p;
do {
t[j++] ^= s[k++];
j &= 15;
k &= 15;
} while (j != base);
}
next();
}
memcpy(s, t, sizeof t);
}
As p will have its original value at the end, this should work.
I'm not sure whether it is actually an improvement in performance, as I am trading one addition for an increment and a bitwise AND.
I don't think it will be slower, even if an increment is as expensive as an addition, because there is no data dependency between the j and k updates. Hopefully it may be slightly faster.
Opinions / corrections are more than welcome.

Related

Cuda global to shared memory and constant memory

I just started learning CUDA and I'm having an issue converting some code to use shared memory and another version to use constant memory, for comparison purposes.
__global__ void CUDA(int *device_array_Image1, int *device_array_Image2,int *device_array_Image3, int *device_array_kernel, int *device_array_Result1,int *device_array_Result2,int *device_array_Result3){
int i = blockIdx.x;
int j = threadIdx.x;
int ArraySum1 = 0 ; // set sum = 0 initially
int ArraySum2 = 0 ;
int ArraySum3 = 0 ;
for (int N = -1 ; N <= 1 ; N++)
{
for (int M = -1 ; M <= 1 ; M++)
{
ArraySum1 = ArraySum1 + (device_array_Image1[(i + N) * Image_Size + (j + M)]* device_array_kernel[(N + 1) * 3 + (M + 1)]);
ArraySum2 = ArraySum2 + (device_array_Image2[(i + N) * Image_Size + (j + M)]* device_array_kernel[(N + 1) * 3 + (M + 1)]);
ArraySum3 = ArraySum3 + (device_array_Image3[(i + N) * Image_Size + (j + M)]* device_array_kernel[(N + 1) * 3 + (M + 1)]);
}
}
device_array_Result1[i * Image_Size + j] = ArraySum1;
device_array_Result2[i * Image_Size + j] = ArraySum2;
device_array_Result3[i * Image_Size + j] = ArraySum3;
}
This is what I have done so far, but I'm having trouble understanding shared and constant memory, so if anyone could help with the code or point me in the right direction I'd be really grateful.
Thanks for any help.
a) Shared memory: this memory is visible only to the threads within a block. It is useful when the threads of a block access the same data more than once; so it won't help for something like squaring a number, but it does help for matrix multiplication.
b) Constant memory: data is stored in device global memory and is read through the multiprocessor's constant cache. There are 64 KB of constant memory, with an 8 KB cache per multiprocessor. Data is broadcast to all threads in a warp, so if all the threads in the warp request the same value, that value is delivered to all of them in a single cycle.
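For the 3x3 kernel in your code, constant memory is a natural fit, since every thread reads the same nine coefficients. A rough sketch (illustrative names only; boundary handling is omitted just as in your original kernel, and it processes a single image):

// Kernel coefficients in constant memory; fill once from the host with
// cudaMemcpyToSymbol(c_kernel, host_kernel, 9 * sizeof(int));
__constant__ int c_kernel[9];

__global__ void ConvolveConstant(const int *in, int *out, int Image_Size)
{
    int i = blockIdx.x;
    int j = threadIdx.x;
    int sum = 0;
    for (int N = -1; N <= 1; N++)
        for (int M = -1; M <= 1; M++)
            sum += in[(i + N) * Image_Size + (j + M)] * c_kernel[(N + 1) * 3 + (M + 1)];
    out[i * Image_Size + j] = sum;
}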
The links below helped me understand constant and shared memory:
1) http://cuda-programming.blogspot.in/2013/01/what-is-constant-memory-in-cuda.html
2) http://cuda-programming.blogspot.in/2013/01/shared-memory-and-synchronization-in.html
3) https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
Please refer to these links.

SIMD/SSE : short dot product and short max value

I'm trying to optimize a dot product of two C-style arrays of constant, small size and of type short.
I've read quite a bit of documentation about SIMD intrinsics and many blog posts/articles about dot product optimization using these intrinsics.
However, I don't understand how a dot product on short arrays using these intrinsics can give the right result. When computing the dot product, the intermediate values can be (and, in my case, always are) greater than SHRT_MAX, and so is their sum. Hence, I store them in a variable of type double.
As I understand dot products using SIMD intrinsics, we use __m128i variables and the operations return __m128i. So what I don't understand is why it doesn't "overflow", and how the result can be transformed into a value type that can handle it?
Thanks for your advice.
Depending on the range of your data values you might use an intrinsic such as _mm_madd_epi16, which performs multiply/add on 16 bit data and generates 32 bit terms. You would then need to periodically accumulate your 32 bit terms to 64 bits. How often you need to do this depends on the range of your input data, e.g. if it's 12 bit greyscale image data then you can do 64 iterations at 8 elements per iteration (i.e. 512 input points) before there is the potential for overflow. In the worst case however, if your input data uses the full 16 bit range, then you would need to do the additional 64 bit accumulate on every iteration (i.e. every 8 points).
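To illustrate that pattern, here is a sketch (my own, untuned; dot16 and the 64-block bound are arbitrary choices under the assumptions above): the running sums stay in the 32-bit lanes produced by _mm_madd_epi16 and are folded into a 64-bit scalar before they can overflow.

#include <emmintrin.h>   // SSE2
#include <cstdint>
#include <cstddef>

int64_t dot16(const int16_t* a, const int16_t* b, size_t n)
{
    int64_t total = 0;
    size_t i = 0;
    while (i + 8 <= n) {
        __m128i acc = _mm_setzero_si128();
        // Up to 64 blocks of 8 products before folding to 64 bits; shrink
        // this bound if the inputs use the full 16-bit range.
        for (int block = 0; block < 64 && i + 8 <= n; ++block, i += 8) {
            __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
            __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
            acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));   // pairs of 16x16 -> 32-bit sums
        }
        int32_t lanes[4];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
        total += (int64_t)lanes[0] + lanes[1] + lanes[2] + lanes[3];  // fold to 64 bits
    }
    for (; i < n; ++i)                                          // scalar tail
        total += (int64_t)a[i] * b[i];
    return total;
}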
Just for the record, here is how I compute a dot product of two int16 arrays of size 36:
double dotprod(const int16_t* source, const int16_t* target, const int size){
#ifdef USE_SSE
int res[4];
__m128i* src = (__m128i *) source;
__m128i* t = (__m128i *) target;
__m128i s = _mm_madd_epi16(_mm_loadu_si128(src), _mm_loadu_si128(t));
++src;
++t;
s = _mm_add_epi32(s, _mm_madd_epi16(_mm_loadu_si128(src), _mm_loadu_si128(t)));
++src;
++t;
s = _mm_add_epi32(s, _mm_madd_epi16(_mm_loadu_si128(src), _mm_loadu_si128(t)));
++src;
++t;
s = _mm_add_epi32(s, _mm_madd_epi16(_mm_loadu_si128(src), _mm_loadu_si128(t)));
/* return the sum of the four 32-bit sub sums */
_mm_storeu_si128((__m128i*)&res, s);
return res[0] + res[1] + res[2] + res[3] + source[32] * target[32] + source[33] * target[33] + source[34] * target[34] + source[35] * target[35];
#elif USE_AVX
int res[8];
__m256i* src = (__m256i *) source;
__m256i* t = (__m256i *) target;
__m256i s = _mm256_madd_epi16(_mm256_loadu_si256(src), _mm256_loadu_si256(t));
++src;
++t;
s = _mm256_add_epi32(s, _mm256_madd_epi16(_mm256_loadu_si256(src), _mm256_loadu_si256(t)));
/* return the sum of the 8 32-bit sub sums */
_mm256_storeu_si256((__m256i*)&res, s);
return res[0] + res[1] + res[2] + res[3] + res[4] + res[5] + res[6] + res[7] + source[32] * target[32] + source[33] * target[33] + source[34] * target[34] + source[35] * target[35];
#else
return source[0] * target[0] + source[1] * target[1] + source[2] * target[2] + source[3] * target[3] + source[4] * target[4]+ source[5] * target[5] + source[6] * target[6] + source[7] * target[7] + source[8] * target[8] + source[9] * target[9] + source[10] * target[10] + source[11] * target[11] + source[12] * target[12] + source[13] * target[13] + source[14] * target[14] + source[15] * target[15] + source[16] * target[16] + source[17] * target[17] + source[18] * target[18] + source[19] * target[19] + source[20] * target[20] + source[21] * target[21] + source[22] * target[22] + source[23] * target[23] + source[24] * target[24] + source[25] * target[25] + source[26] * target[26] + source[27] * target[27] + source[28] * target[28] + source[29] * target[29] + source[30] * target[30] + source[31] * target[31] + source[32] * target[32] + source[33] * target[33] + source[34] * target[34] + source[35] * target[35];
#endif
}

Why can the Intel C++ compiler not benefit from this template, while GNU can?

In our three-dimensional CFD code, performance is crucial. In order to avoid all the calculations in the third dimension in the case of a two-dimensional simulation, I am experimenting with templates to avoid duplicating code. With the GCC compiler this works perfectly, but with the Intel compiler it results in a massive performance loss.
The call to my templated function looks like:
if(jtot == 1) // If jtot equals 1, then the grid is 2-dimensional
advecu<2>(ut,u,v,w,dzi4);
else
advecu<3>(ut,u,v,w,dzi4);
My templated function looks like this:
template<int dim>
void Advec4::advecu(double * restrict ut, double * restrict u, double * restrict v, double * restrict w, double * restrict dzi4)
{
// Here comes the initialization of the constants...
// Loop
for(int k=grid->kstart; k<grid->kend; k++)
for(int j=grid->jstart; j<grid->jend; j++)
#pragma ivdep
for(int i=grid->istart; i<grid->iend; i++)
{
ijk = i + j*jj1 + k*kk1;
ut[ijk] -= ( cg0*((ci0*u[ijk-ii3] + ci1*u[ijk-ii2] + ci2*u[ijk-ii1] + ci3*u[ijk ]) * (ci0*u[ijk-ii3] + ci1*u[ijk-ii2] + ci2*u[ijk-ii1] + ci3*u[ijk ]))
+ cg1*((ci0*u[ijk-ii2] + ci1*u[ijk-ii1] + ci2*u[ijk ] + ci3*u[ijk+ii1]) * (ci0*u[ijk-ii2] + ci1*u[ijk-ii1] + ci2*u[ijk ] + ci3*u[ijk+ii1]))
+ cg2*((ci0*u[ijk-ii1] + ci1*u[ijk ] + ci2*u[ijk+ii1] + ci3*u[ijk+ii2]) * (ci0*u[ijk-ii1] + ci1*u[ijk ] + ci2*u[ijk+ii1] + ci3*u[ijk+ii2]))
+ cg3*((ci0*u[ijk ] + ci1*u[ijk+ii1] + ci2*u[ijk+ii2] + ci3*u[ijk+ii3]) * (ci0*u[ijk ] + ci1*u[ijk+ii1] + ci2*u[ijk+ii2] + ci3*u[ijk+ii3])) ) * cgi*dxi;
if(dim == 3)
{
ut[ijk] -= ( cg0*((ci0*v[ijk-ii2-jj1] + ci1*v[ijk-ii1-jj1] + ci2*v[ijk-jj1] + ci3*v[ijk+ii1-jj1]) * (ci0*u[ijk-jj3] + ci1*u[ijk-jj2] + ci2*u[ijk-jj1] + ci3*u[ijk ]))
+ cg1*((ci0*v[ijk-ii2 ] + ci1*v[ijk-ii1 ] + ci2*v[ijk ] + ci3*v[ijk+ii1 ]) * (ci0*u[ijk-jj2] + ci1*u[ijk-jj1] + ci2*u[ijk ] + ci3*u[ijk+jj1]))
+ cg2*((ci0*v[ijk-ii2+jj1] + ci1*v[ijk-ii1+jj1] + ci2*v[ijk+jj1] + ci3*v[ijk+ii1+jj1]) * (ci0*u[ijk-jj1] + ci1*u[ijk ] + ci2*u[ijk+jj1] + ci3*u[ijk+jj2]))
+ cg3*((ci0*v[ijk-ii2+jj2] + ci1*v[ijk-ii1+jj2] + ci2*v[ijk+jj2] + ci3*v[ijk+ii1+jj2]) * (ci0*u[ijk ] + ci1*u[ijk+jj1] + ci2*u[ijk+jj2] + ci3*u[ijk+jj3])) ) * cgi*dyi;
}
ut[ijk] -= ( cg0*((ci0*w[ijk-ii2-kk1] + ci1*w[ijk-ii1-kk1] + ci2*w[ijk-kk1] + ci3*w[ijk+ii1-kk1]) * (ci0*u[ijk-kk3] + ci1*u[ijk-kk2] + ci2*u[ijk-kk1] + ci3*u[ijk ]))
+ cg1*((ci0*w[ijk-ii2 ] + ci1*w[ijk-ii1 ] + ci2*w[ijk ] + ci3*w[ijk+ii1 ]) * (ci0*u[ijk-kk2] + ci1*u[ijk-kk1] + ci2*u[ijk ] + ci3*u[ijk+kk1]))
+ cg2*((ci0*w[ijk-ii2+kk1] + ci1*w[ijk-ii1+kk1] + ci2*w[ijk+kk1] + ci3*w[ijk+ii1+kk1]) * (ci0*u[ijk-kk1] + ci1*u[ijk ] + ci2*u[ijk+kk1] + ci3*u[ijk+kk2]))
+ cg3*((ci0*w[ijk-ii2+kk2] + ci1*w[ijk-ii1+kk2] + ci2*w[ijk+kk2] + ci3*w[ijk+ii1+kk2]) * (ci0*u[ijk ] + ci1*u[ijk+kk1] + ci2*u[ijk+kk2] + ci3*u[ijk+kk3])) )
* dzi4[k];
}
}
Why does the Intel compiler mess this up? It seems that it does not resolve the template before it starts optimizing, which leaves the if statement in the inner loop, or loses performance for some other reason.

What does a + sign after a variable mean?

What will be the output of the following code?
int x,a=3;
x=+ +a+ + +a+ + +5;
printf("%d %d",x,a);
The output is: 11 3. I want to know how, and what the + sign after a means.
I think DrYap has it right.
x = + + a + + + a + + + 5;
is the same as:
x = + (+ a) + (+ (+ a)) + (+ (+ 5));
The key points here are:
1) C and C++ don't have + as a postfix operator, so we know we have to interpret it as a prefix operator.
2) Unary (monadic) + binds more tightly (has higher precedence) than binary (dyadic) +.
Funny, isn't it? If these were - signs it wouldn't look so strange. A monadic +/- is just a leading sign; to put it another way, "+x" is the same as "0+x".
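If you want to convince yourself, a tiny test program (mine, just for checking) that compares the original spacing with the fully parenthesized form prints the output quoted in the question:

#include <stdio.h>

int main(void)
{
    int a = 3;
    int x = + +a+ + +a+ + +5;               /* the original spacing */
    int y = (+(+a)) + (+(+a)) + (+(+5));    /* every + made explicit as unary */
    printf("%d %d %d\n", x, y, a);          /* prints: 11 11 3 */
    return 0;
}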
The + after a just gets seen as a + before the next value. If you use consistent spacing it is the same as:
x = + + a + + + a + + + 5;
But not all the +s are necessary so it will act the same as doing:
x = a + a + 5;
The value of a is unchanged because you never used the increment operator, which is ++ with no whitespace between the two + symbols. + and ++ are two separate operators.
Since no two + operators are ever adjacent, but are always separated by whitespace, the statement
x=+ +a+ + +a+ + +5; is read with each + applied as a unary sign to the value that follows it,
so the expression effectively reduces to
x = a + a + 5; and hence the result.
The code is equivalent to:
x = (+(+(a))) + (+(+(a))) + (+(+(5)));
i.e. x = a + a + 5, which is 11. You know that you can put a + or - sign before a number, right? Those + signs merely indicate the sign of the operand. Since the sign is +, the value remains unchanged, i.e. "+5" means "5", so "+a" means "a", and "+ +a" means "+(+a)", which again means "a". In the same fashion you could write x = + + + 3 + + + + 3 + + + + 5; or x = - + + - 3 + - + - 3 - - + 5;.
x=+ +a+ + +a+ + +5; : this can be written as
x = + (+ a) + (+ (+ a)) + (+ (+ 5)),
where the +'s only indicate signs, so it finally evaluates to
x = a + a + 5.

Discrete Wavelet Transform integer Daub 5/3 lifting issue

I'm trying to run an integer-to-integer lifting 5/3 transform on an image of Lena. I've been following the paper "A low-power, low-memory system for wavelet-based image compression" by Walker, Nguyen, and Chen (link active as of 7 Oct 2015).
I'm running into issues though. The image just doesn't seem to come out quite right. I appear to be overflowing slightly in the green and blue channels which means that subsequent passes of the wavelet function find high frequencies where there ought not to be any. I'm also pretty sure I'm getting something else wrong as I am seeing a line of the s0 image at the edges of the high frequency parts.
My function is as follows:
bool PerformHorizontal( Col24* pPixelsIn, Col24* pPixelsOut, int width, int pixelPitch, int height )
{
const int widthDiv2 = width / 2;
int y = 0;
while( y < height )
{
int x = 0;
while( x < width )
{
const int n = (x) + (y * pixelPitch);
const int n2 = (x / 2) + (y * pixelPitch);
const int s = n2;
const int d = n2 + widthDiv2;
// Non-lifting 5 / 3
/*pPixelsOut[n2 + widthDiv2].r = pPixelsIn[n + 2].r - ((pPixelsIn[n + 1].r + pPixelsIn[n + 3].r) / 2) + 128;
pPixelsOut[n2].r = ((4 * pPixelsIn[n + 2].r) + (2 * pPixelsIn[n + 2].r) + (2 * (pPixelsIn[n + 1].r + pPixelsIn[n + 3].r)) - (pPixelsIn[n + 0].r + pPixelsIn[n + 4].r)) / 8;
pPixelsOut[n2 + widthDiv2].g = pPixelsIn[n + 2].g - ((pPixelsIn[n + 1].g + pPixelsIn[n + 3].g) / 2) + 128;
pPixelsOut[n2].g = ((4 * pPixelsIn[n + 2].g) + (2 * pPixelsIn[n + 2].g) + (2 * (pPixelsIn[n + 1].g + pPixelsIn[n + 3].g)) - (pPixelsIn[n + 0].g + pPixelsIn[n + 4].g)) / 8;
pPixelsOut[n2 + widthDiv2].b = pPixelsIn[n + 2].b - ((pPixelsIn[n + 1].b + pPixelsIn[n + 3].b) / 2) + 128;
pPixelsOut[n2].b = ((4 * pPixelsIn[n + 2].b) + (2 * pPixelsIn[n + 2].b) + (2 * (pPixelsIn[n + 1].b + pPixelsIn[n + 3].b)) - (pPixelsIn[n + 0].b + pPixelsIn[n + 4].b)) / 8;*/
pPixelsOut[d].r = pPixelsIn[n + 1].r - (((pPixelsIn[n].r + pPixelsIn[n + 2].r) >> 1) + 127);
pPixelsOut[s].r = pPixelsIn[n].r + (((pPixelsOut[d - 1].r + pPixelsOut[d].r) >> 2) - 64);
pPixelsOut[d].g = pPixelsIn[n + 1].g - (((pPixelsIn[n].g + pPixelsIn[n + 2].g) >> 1) + 127);
pPixelsOut[s].g = pPixelsIn[n].g + (((pPixelsOut[d - 1].g + pPixelsOut[d].g) >> 2) - 64);
pPixelsOut[d].b = pPixelsIn[n + 1].b - (((pPixelsIn[n].b + pPixelsIn[n + 2].b) >> 1) + 127);
pPixelsOut[s].b = pPixelsIn[n].b + (((pPixelsOut[d - 1].b + pPixelsOut[d].b) >> 2) - 64);
x += 2;
}
y++;
}
return true;
}
There is definitely something wrong but I just can't figure it out. Can anyone with slightly more brain than me point out where I am going wrong? It's worth noting that you can see the un-lifted version of the Daub 5/3 above the working code, and this, too, gives me the same artifacts... I'm very confused, as I have had this working once before (it was over two years ago and I no longer have that code).
Any help would be much appreciated :)
Edit: I appear to have eliminated my overflow issues by clamping the low pass pixels to the 0 to 255 range. I'm slightly concerned this isn't the right solution though. Can anyone comment on this?
You can do some tests with extreme values to see the possibility of overflow. Example:
pPixelsOut[d].r = pPixelsIn[n + 1].r - (((pPixelsIn[n].r + pPixelsIn[n + 2].r) >> 1) + 127);
If:
pPixelsIn[n ].r == 255
pPixelsIn[n+1].r == 0
pPixelsIn[n+2].r == 255
Then:
pPixelsOut[d].r == -382
But if:
pPixelsIn[n ].r == 0
pPixelsIn[n+1].r == 255
pPixelsIn[n+2].r == 0
Then:
pPixelsOut[d].r == 128
You have a range of 511 possible values (-382 .. 128), so, in order to avoid overflow or clamping, you would need one extra bit, some quantization, or another encoding type!
I'm assuming the data have already been thresholded?
I also don't get why you're adding back in +127 and -64.
OK, I can losslessly forward- and inverse-transform as long as I store my post-forward-transform data in a short. Obviously this takes up a little more space than I was hoping for, but it gives me a good starting point for moving on to the various compression algorithms. You can also, nicely, process two 4-component pixels at a time using SSE2 instructions. This is the standard C forward transform I came up with:
const int16_t dr = (int16_t)pPixelsIn[n + 1].r - ((((int16_t)pPixelsIn[n].r + (int16_t)pPixelsIn[n + 2].r) >> 1));
const int16_t sr = (int16_t)pPixelsIn[n].r + ((((int16_t)pPixelsOut[d - 1].r + dr) >> 2));
const int16_t dg = (int16_t)pPixelsIn[n + 1].g - ((((int16_t)pPixelsIn[n].g + (int16_t)pPixelsIn[n + 2].g) >> 1));
const int16_t sg = (int16_t)pPixelsIn[n].g + ((((int16_t)pPixelsOut[d - 1].g + dg) >> 2));
const int16_t db = (int16_t)pPixelsIn[n + 1].b - ((((int16_t)pPixelsIn[n].b + (int16_t)pPixelsIn[n + 2].b) >> 1));
const int16_t sb = (int16_t)pPixelsIn[n].b + ((((int16_t)pPixelsOut[d - 1].b + db) >> 2));
pPixelsOut[d].r = dr;
pPixelsOut[s].r = sr;
pPixelsOut[d].g = dg;
pPixelsOut[s].g = sg;
pPixelsOut[d].b = db;
pPixelsOut[s].b = sb;
It is trivial to create the inverse of this (a VERY simple bit of algebra). It's worth noting, by the way, that you need to invert the image from right to left, bottom to top. I'll next see if I can shunt this data into uint8_ts and lose a bit or two of accuracy. For compression this really isn't a problem.
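For reference, here is a rough sketch of what that inverse looks like for a single row and a single channel (my own code, not from the paper; it runs in two passes, so the right-to-left constraint disappears, and the row edges are handled by simple mirroring rather than the pPixelsOut[d - 1] quirk the forward pass has at x == 0). coeff holds the s values in the first half and the d values in the second half, matching the layout of the forward transform above:

#include <stdint.h>

// Invert the forward lifting step above for one row of 'width' (even) samples.
void InverseHorizontalRow(const int16_t* coeff, int16_t* out, int width)
{
    const int half = width / 2;

    // Pass 1: recover the even samples.
    // Forward: s[k] = x[2k] + ((d[k-1] + d[k]) >> 2), so
    //          x[2k] = s[k] - ((d[k-1] + d[k]) >> 2).
    for (int k = 0; k < half; ++k) {
        const int16_t dPrev = (k > 0) ? coeff[half + k - 1] : coeff[half]; // mirrored edge
        out[2 * k] = coeff[k] - ((dPrev + coeff[half + k]) >> 2);
    }

    // Pass 2: recover the odd samples.
    // Forward: d[k] = x[2k+1] - ((x[2k] + x[2k+2]) >> 1), so
    //          x[2k+1] = d[k] + ((x[2k] + x[2k+2]) >> 1).
    for (int k = 0; k < half; ++k) {
        const int16_t xNext = (k + 1 < half) ? out[2 * k + 2] : out[2 * k]; // mirrored edge
        out[2 * k + 1] = coeff[half + k] + ((out[2 * k] + xNext) >> 1);
    }
}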