Discrete Wavelet Transform integer Daub 5/3 lifting issue - c++

I'm trying to run an integer-to-integer lifting 5/3 on an image of lena. I've been following the paper "A low-power Low-memory system for wavelet-based image compression" by Walker, Nguyen, and Chen (Link active as of 7 Oct 2015).
I'm running into issues though. The image just doesn't seem to come out quite right. I appear to be overflowing slightly in the green and blue channels which means that subsequent passes of the wavelet function find high frequencies where there ought not to be any. I'm also pretty sure I'm getting something else wrong as I am seeing a line of the s0 image at the edges of the high frequency parts.
My function is as follows:
bool PerformHorizontal( Col24* pPixelsIn, Col24* pPixelsOut, int width, int pixelPitch, int height )
{
    const int widthDiv2 = width / 2;
    int y = 0;
    while( y < height )
    {
        int x = 0;
        while( x < width )
        {
            const int n = (x) + (y * pixelPitch);
            const int n2 = (x / 2) + (y * pixelPitch);
            const int s = n2;
            const int d = n2 + widthDiv2;
            // Non-lifting 5 / 3
            /*pPixelsOut[n2 + widthDiv2].r = pPixelsIn[n + 2].r - ((pPixelsIn[n + 1].r + pPixelsIn[n + 3].r) / 2) + 128;
            pPixelsOut[n2].r = ((4 * pPixelsIn[n + 2].r) + (2 * pPixelsIn[n + 2].r) + (2 * (pPixelsIn[n + 1].r + pPixelsIn[n + 3].r)) - (pPixelsIn[n + 0].r + pPixelsIn[n + 4].r)) / 8;
            pPixelsOut[n2 + widthDiv2].g = pPixelsIn[n + 2].g - ((pPixelsIn[n + 1].g + pPixelsIn[n + 3].g) / 2) + 128;
            pPixelsOut[n2].g = ((4 * pPixelsIn[n + 2].g) + (2 * pPixelsIn[n + 2].g) + (2 * (pPixelsIn[n + 1].g + pPixelsIn[n + 3].g)) - (pPixelsIn[n + 0].g + pPixelsIn[n + 4].g)) / 8;
            pPixelsOut[n2 + widthDiv2].b = pPixelsIn[n + 2].b - ((pPixelsIn[n + 1].b + pPixelsIn[n + 3].b) / 2) + 128;
            pPixelsOut[n2].b = ((4 * pPixelsIn[n + 2].b) + (2 * pPixelsIn[n + 2].b) + (2 * (pPixelsIn[n + 1].b + pPixelsIn[n + 3].b)) - (pPixelsIn[n + 0].b + pPixelsIn[n + 4].b)) / 8;*/
            pPixelsOut[d].r = pPixelsIn[n + 1].r - (((pPixelsIn[n].r + pPixelsIn[n + 2].r) >> 1) + 127);
            pPixelsOut[s].r = pPixelsIn[n].r + (((pPixelsOut[d - 1].r + pPixelsOut[d].r) >> 2) - 64);
            pPixelsOut[d].g = pPixelsIn[n + 1].g - (((pPixelsIn[n].g + pPixelsIn[n + 2].g) >> 1) + 127);
            pPixelsOut[s].g = pPixelsIn[n].g + (((pPixelsOut[d - 1].g + pPixelsOut[d].g) >> 2) - 64);
            pPixelsOut[d].b = pPixelsIn[n + 1].b - (((pPixelsIn[n].b + pPixelsIn[n + 2].b) >> 1) + 127);
            pPixelsOut[s].b = pPixelsIn[n].b + (((pPixelsOut[d - 1].b + pPixelsOut[d].b) >> 2) - 64);
            x += 2;
        }
        y++; // advance to the next row
    }
    return true;
}
There is definitely something wrong but I just can't figure it out. Can anyone with slightly more brain than me point out where I am going wrong? It's worth noting that you can see the un-lifted version of the Daub 5/3 above the working code and this, too, gives me the same artifacts ... I'm very confused as I have had this working once before (it was over 2 years ago and I no longer have that code).
Any help would be much appreciated :)
Edit: I appear to have eliminated my overflow issues by clamping the low pass pixels to the 0 to 255 range. I'm slightly concerned this isn't the right solution though. Can anyone comment on this?

You can do some tests with extreme values to see the possibility of overflow. Example:
pPixelsOut[d].r = pPixelsIn[n + 1].r - (((pPixelsIn[n].r + pPixelsIn[n + 2].r) >> 1) + 127);
pPixelsIn[n ].r == 255
pPixelsIn[n+1].r == 0
pPixelsIn[n+2].r == 255
pPixelsOut[d].r == -382
But if:
pPixelsIn[n ].r == 0
pPixelsIn[n+1].r == 255
pPixelsIn[n+2].r == 0
pPixelsOut[d].r == 128
You have a range of 511 possible values (-382 .. 128), so, in order to avoid overflow or clamping, you would need one extra bit, some quantization, or another encoding type!
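The extreme-value check above can also be done by brute force. This is only a sketch: HighPassRange is a made-up helper name, and it assumes 8-bit channel values with the same predict step as the question (including the added offset).

```cpp
#include <climits>
#include <utility>

// Hypothetical helper: evaluates d = b - (((a + c) >> 1) + offset) for every
// combination of 8-bit neighbours a, c and centre b, and returns {min, max}.
std::pair<int, int> HighPassRange(int offset)
{
    int lo = INT_MAX;
    int hi = INT_MIN;
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b)
            for (int c = 0; c < 256; ++c)
            {
                const int d = b - (((a + c) >> 1) + offset);
                if (d < lo) lo = d;
                if (d > hi) hi = d;
            }
    return std::make_pair(lo, hi);
}
```

With offset 127 this gives the -382 .. 128 range from above. With offset 0 the range is -255 .. 255, which is still 511 distinct values, so the offset only shifts the range and a ninth bit is needed either way.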

I'm assuming the data have already been thresholded?
I also don't get why you're adding back in +127 and -64.

OK, I can losslessly run the forward and then the inverse transform as long as I store my post-forward-transform data in a short. Obviously this takes up a little more space than I was hoping for, but it gives me a good starting point for going into the various compression algorithms. You can also, nicely, transform two 4-component pixels at a time using SSE2 instructions. This is the standard C forward transform I came up with:
const int16_t dr = (int16_t)pPixelsIn[n + 1].r - ((((int16_t)pPixelsIn[n].r + (int16_t)pPixelsIn[n + 2].r) >> 1));
const int16_t sr = (int16_t)pPixelsIn[n].r + ((((int16_t)pPixelsOut[d - 1].r + dr) >> 2));
const int16_t dg = (int16_t)pPixelsIn[n + 1].g - ((((int16_t)pPixelsIn[n].g + (int16_t)pPixelsIn[n + 2].g) >> 1));
const int16_t sg = (int16_t)pPixelsIn[n].g + ((((int16_t)pPixelsOut[d - 1].g + dg) >> 2));
const int16_t db = (int16_t)pPixelsIn[n + 1].b - ((((int16_t)pPixelsIn[n].b + (int16_t)pPixelsIn[n + 2].b) >> 1));
const int16_t sb = (int16_t)pPixelsIn[n].b + ((((int16_t)pPixelsOut[d - 1].b + db) >> 2));
pPixelsOut[d].r = dr;
pPixelsOut[s].r = sr;
pPixelsOut[d].g = dg;
pPixelsOut[s].g = sg;
pPixelsOut[d].b = db;
pPixelsOut[s].b = sb;
It is trivial to create the inverse of this (a VERY simple bit of algebra). It's worth noting, btw, that you need to invert the image from right to left, bottom to top. Next I'll see if I can shunt this data into uint8_ts and lose a bit or 2 of accuracy. For compression this really isn't a problem.
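For reference, here is a rough single-channel, single-row sketch of the forward/inverse pair. Forward53 and Inverse53 are names I made up. The edges are mirrored, which is my assumption, since the code above does not show how the borders are handled. It also runs predict and update as two separate passes rather than interleaved, so the traversal order of the inverse does not matter in this version.

```cpp
#include <cstdint>
#include <vector>

// Forward integer 5/3 lifting on one row of even length.
// Output layout: low pass (s) in [0, half), high pass (d) in [half, n).
std::vector<int16_t> Forward53(const std::vector<uint8_t>& x)
{
    const int n = static_cast<int>(x.size());
    const int half = n / 2;
    std::vector<int16_t> out(n);
    // Predict: d[i] = odd sample - floor(average of the two even neighbours).
    for (int i = 0; i < half; ++i)
    {
        const int left = x[2 * i];
        const int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i]; // mirrored edge (assumption)
        out[half + i] = static_cast<int16_t>(x[2 * i + 1] - ((left + right) >> 1));
    }
    // Update: s[i] = even sample + floor((d[i-1] + d[i]) / 4).
    for (int i = 0; i < half; ++i)
    {
        const int dPrev = out[half + (i > 0 ? i - 1 : 0)]; // mirrored edge (assumption)
        const int dCur = out[half + i];
        out[i] = static_cast<int16_t>(x[2 * i] + ((dPrev + dCur) >> 2));
    }
    return out;
}

// Inverse: undo the update step first, then the predict step.
std::vector<uint8_t> Inverse53(const std::vector<int16_t>& c)
{
    const int n = static_cast<int>(c.size());
    const int half = n / 2;
    std::vector<uint8_t> x(n);
    for (int i = 0; i < half; ++i)
    {
        const int dPrev = c[half + (i > 0 ? i - 1 : 0)];
        const int dCur = c[half + i];
        x[2 * i] = static_cast<uint8_t>(c[i] - ((dPrev + dCur) >> 2));
    }
    for (int i = 0; i < half; ++i)
    {
        const int left = x[2 * i];
        const int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];
        x[2 * i + 1] = static_cast<uint8_t>(c[half + i] + ((left + right) >> 1));
    }
    return x;
}
```

Because the inverse evaluates exactly the same shifted expressions as the forward pass and only flips the sign, the round trip is exact as long as the coefficients are kept in 16 bits.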


Collecting the coefficients of a term in Sympy

I have the following expression in Sympy
s = e0*a01*d1**2*u0 - e0*a01*d1**2*u1 - e0*a11*d1**2*u0 - e0*a11*d1**2*u1 + e0*d0*a00*d1*u1 + e0*d0*a01*d1*u0 + e0*d0*a10*d1*u0 - e0*d0*a11*d1*u1 + e0*d0*b0*u0 - e0*d0*b1*u1 + e0*d1*a00*d1*u0 - e0*d1*a01*d1*u1 - e0*d1*a10*d1*u1 - e0*d1*a11*d1*u0 - e0*d1*b0*u1 - e0*d1*b1*u0 - e1*a00*d1**2*u0 + e1*a00*d1**2*u1 + e1*a10*d1**2*u0 + e1*a10*d1**2*u1 - e1*d0*a00*d1*u0 + e1*d0*a01*d1*u1 + e1*d0*a10*d1*u1 + e1*d0*a11*d1*u0 + e1*d0*b0*u1 + e1*d0*b1*u0 + e1*d1*a00*d1*u1 + e1*d1*a01*d1*u0 + e1*d1*a10*d1*u0 - e1*d1*a11*d1*u1 + e1*d1*b0*u0 - e1*d1*b1*u1
So first I sympify it:
s = sympify(s,locals=T)
(T contains all these symbols in the string, that are non commutative). And I want to get the coefficient of
after "factoring" it. So I did the following:
collected_expr = collect(s,e,exact=True)
coeff = collected_expr.coeff(e)
The result of collected_expr is ok:
d1**2*u0*(e0*a01 - e0*a11 - e1*a00 + e1*a10) - e0*a01*d1**2*u1 - e0*a11*d1**2*u1 + e0*d0*a00*d1*u1 + e0*d0*a01*d1*u0 + e0*d0*a10*d1*u0 - e0*d0*a11*d1*u1 + e0*d0*b0*u0 - e0*d0*b1*u1 + e0*d1*a00*d1*u0 - e0*d1*a01*d1*u1 - e0*d1*a10*d1*u1 - e0*d1*a11*d1*u0 - e0*d1*b0*u1 - e0*d1*b1*u0 + e1*a00*d1**2*u1 + e1*a10*d1**2*u1 - e1*d0*a00*d1*u0 + e1*d0*a01*d1*u1 + e1*d0*a10*d1*u1 + e1*d0*a11*d1*u0 + e1*d0*b0*u1 + e1*d0*b1*u0 + e1*d1*a00*d1*u1 + e1*d1*a01*d1*u0 + e1*d1*a10*d1*u0 - e1*d1*a11*d1*u1 + e1*d1*b0*u0 - e1*d1*b1*u1
But coeff is not ok, as it returns 1, but I really want
e0*a01 - e0*a11 - e1*a00 + e1*a10
EDIT: I also tried
coeff = collected_expr.coeff(u0).coeff(d1).coeff(d1)
coeff = collected_expr.coeff(u0).coeff(d1**2)
But both things returned 0
The docstring of Expr.coeff says
When x is noncommutative, the coefficient to the left (default) or
right of x can be returned. The keyword 'right' is ignored when
x is commutative.
collect does not seem to be noncommutative-aware, however, so the factors that were on the right may collect to the left.
>>> var("A B", commutative=False)
(A, B)
>>> collect(A*B+B*A**2,B)
B*(A + A**2)

Is there a chance to make the bilinear interpolation faster?

First I want to provide you with some context.
I have two kinds of images I need to merge. The first image is the background image with the format 8BppGrey and a resolution of 320x240. The second image is the foreground image with the format 32BppRGBA and a resolution of 64x48.
The github repo with an MVP is at the bottom of the question.
To do this, I resize the second image with bilinear interpolation to the same size as the first one and then use blending to merge both into one image. Blending only happens when the alpha value of the second image is greater than 0.
I need to do it as fast as possible so my idea was to combine the resize and merge / blend process.
To achieve this I used the resize function from the writeablebitmapex repository and added merging / blending.
Everything works as expected but I want to decrease the execution time.
These are the current debug timings:
// CPU: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 5 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 6 ms
MediaServer: Resizing took 6 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.
Do I have any chance to increase the performance and lower the execution time of the resize / merge / blend process?
Are there some parts I maybe can parallelize?
Do I maybe have a chance to use some processor features?
A huge performance hit is the nested loop but I have no idea how I could write it better.
I would like to reach 1 or 2 ms for the whole process. Is this even possible?
Here's the modified visual c++ function I use.
pd is the backbuffer of the writeable bitmap I use to display the
result in wpf. The format I use is the default 32BppRGBA.
pixels is the int[] array of the 64x48 32BppRGBA image
widthSource and heightSource is the size of the pixels image
width and height is the target size of the output image
baseImage is the int[] array of the 320x240 8BppGray image
VC++ code:
unsigned int Resize(int* pd, int* pixels, int widthSource, int heightSource, int width, int height, byte* baseImage)
{
    unsigned int start = clock();
    float xs = (float)widthSource / width;
    float ys = (float)heightSource / height;
    float fracx, fracy, ifracx, ifracy, sx, sy, l0, l1, rf, gf, bf;
    int c, x0, x1, y0, y1;
    byte c1a, c1r, c1g, c1b, c2a, c2r, c2g, c2b, c3a, c3r, c3g, c3b, c4a, c4r, c4g, c4b;
    byte a, r, g, b;
    // Bilinear
    int srcIdx = 0;
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            sx = x * xs;
            sy = y * ys;
            x0 = (int)sx;
            y0 = (int)sy;
            // Calculate coordinates of the 4 interpolation points
            fracx = sx - x0;
            fracy = sy - y0;
            ifracx = 1.0f - fracx;
            ifracy = 1.0f - fracy;
            x1 = x0 + 1;
            if (x1 >= widthSource)
                x1 = x0;
            y1 = y0 + 1;
            if (y1 >= heightSource)
                y1 = y0;
            // Read source color
            c = pixels[y0 * widthSource + x0];
            c1a = (byte)(c >> 24);
            c1r = (byte)(c >> 16);
            c1g = (byte)(c >> 8);
            c1b = (byte)(c);
            c = pixels[y0 * widthSource + x1];
            c2a = (byte)(c >> 24);
            c2r = (byte)(c >> 16);
            c2g = (byte)(c >> 8);
            c2b = (byte)(c);
            c = pixels[y1 * widthSource + x0];
            c3a = (byte)(c >> 24);
            c3r = (byte)(c >> 16);
            c3g = (byte)(c >> 8);
            c3b = (byte)(c);
            c = pixels[y1 * widthSource + x1];
            c4a = (byte)(c >> 24);
            c4r = (byte)(c >> 16);
            c4g = (byte)(c >> 8);
            c4b = (byte)(c);
            // Calculate colors
            // Alpha
            l0 = ifracx * c1a + fracx * c2a;
            l1 = ifracx * c3a + fracx * c4a;
            a = (byte)(ifracy * l0 + fracy * l1);
            // Write destination
            if (a > 0)
            {
                // Red
                l0 = ifracx * c1r + fracx * c2r;
                l1 = ifracx * c3r + fracx * c4r;
                rf = ifracy * l0 + fracy * l1;
                // Green
                l0 = ifracx * c1g + fracx * c2g;
                l1 = ifracx * c3g + fracx * c4g;
                gf = ifracy * l0 + fracy * l1;
                // Blue
                l0 = ifracx * c1b + fracx * c2b;
                l1 = ifracx * c3b + fracx * c4b;
                bf = ifracy * l0 + fracy * l1;
                // Cast to byte
                float alpha = a / 255.0f;
                r = (byte)((rf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
                g = (byte)((gf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
                b = (byte)((bf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
                pd[srcIdx++] = (255 << 24) | (r << 16) | (g << 8) | b;
            }
            else
            {
                // Fully transparent foreground: copy the grey background pixel
                // Alpha, Red, Green, Blue
                pd[srcIdx++] = (255 << 24) | (baseImage[srcIdx] << 16) | (baseImage[srcIdx] << 8) | baseImage[srcIdx];
            }
        }
    }
    unsigned int end = clock() - start;
    return end;
}
Github repo
One action that may speed up your code is to avoid type conversions from integer to float and vice versa. This can be achieved by having an int value in the suitable range instead of floats on range 0..1
Something like this:
for (int y = 0; y < height; y++)
for (int x = 0; x < width; x++)
int sx1 = x * widthSource ;
int x0 = sx1 / width;
int fracx = (sx1 % width) ; // range 0..width - 1
which turns into something like
l0 = (fracx * c2a + (width - fracx) * c1a) / width ;
And so on. A bit tricky but doable
Thank you for all the help, but the problem was the managed C++ project. I have now moved the function to my native C++ library and use the managed C++ part only as a wrapper for the C# application.
After compiler optimization, the function now finishes in 1 ms.
I will mark my own answer as the solution for now, because the optimization from @marom leads to a broken image.
The common way to speed up a resize operation with bilinear interpolation is to:
1. Exploit the fact that x0 and fracx are independent of the row and that y0 and fracy are independent of the column. Even though you haven't pulled the computation of y0 and fracy out of the x-loop, compiler optimization should take care of that. However, for x0 and fracx, one needs to pre-compute the values for all columns and store them in an array. The complexity of computing x0 and fracx becomes O(width) compared to O(width*height) without pre-computation.
2. Do the whole processing with integers by replacing floating-point arithmetic with integer arithmetic, using shift operations instead of integer divisions.
For better readability, I did not implement the pre-computation of x0 and fracx in the following code. Pre-computation is straight-forward anyways.
Note that FACTOR = 2048 is the max you can do with 32-bit signed integers here (2048 * 2048 * 255 is just fine). For higher precision, you should switch to int64_t and then increase FACTOR and SHIFT, respectively.
I placed the border check into the inner loop for better readability. For an optimized implementation one should remove it by iterating in both loops just before this case happens and add special handling for the border pixels.
In case someone is wondering what the + (FACTOR * FACTOR / 2) is for, it is for rounding in conjunction with the subsequent division.
Finally note that (FACTOR * FACTOR / 2) and 2 * SHIFT are evaluated at compile time.
#define FACTOR 2048
#define SHIFT 11
const int xs = (int) ((double) FACTOR * widthSource / width + 0.5);
const int ys = (int) ((double) FACTOR * heightSource / height + 0.5);
for (int y = 0; y < height; y++)
{
    const int sy = y * ys;
    const int y0 = sy >> SHIFT;
    const int fracy = sy - (y0 << SHIFT);
    for (int x = 0; x < width; x++)
    {
        const int sx = x * xs;
        const int x0 = sx >> SHIFT;
        const int fracx = sx - (x0 << SHIFT);
        if (x0 >= widthSource - 1 || y0 >= heightSource - 1)
        {
            // insert special handling here
        }
        else
        {
            const int offset = y0 * widthSource + x0;
            target[y * width + x] = (unsigned char)
                ((source[offset] * (FACTOR - fracx) * (FACTOR - fracy) +
                source[offset + 1] * fracx * (FACTOR - fracy) +
                source[offset + widthSource] * (FACTOR - fracx) * fracy +
                source[offset + widthSource + 1] * fracx * fracy +
                (FACTOR * FACTOR / 2)) >> (2 * SHIFT));
        }
    }
}
For clarification, to match the variables used by the OP, for instance, in the case of the alpha channel it is:
a = (unsigned char)
((c1a * (FACTOR - fracx) * (FACTOR - fracy) +
c2a * fracx * (FACTOR - fracy) +
c3a * (FACTOR - fracx) * fracy +
c4a * fracx * fracy +
(FACTOR * FACTOR / 2)) >> (2 * SHIFT));
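For completeness, here is a rough single-channel sketch with the per-column pre-computation actually in place. ResizeGray and the table names are mine, not from the code above. Instead of a separate border path it clamps the neighbour index at the edge, which is one possible way to handle the border, not necessarily the fastest.

```cpp
#include <cstdint>
#include <vector>

constexpr int kShift = 11;
constexpr int kFactor = 1 << kShift; // 2048, same limit as discussed above

// Hypothetical helper: fixed-point bilinear resize of an 8-bit grey image.
std::vector<uint8_t> ResizeGray(const std::vector<uint8_t>& src, int srcW, int srcH,
                                int dstW, int dstH)
{
    const int xs = static_cast<int>(static_cast<double>(kFactor) * srcW / dstW + 0.5);
    const int ys = static_cast<int>(static_cast<double>(kFactor) * srcH / dstH + 0.5);

    // Per-column tables: computed once, reused for every row.
    std::vector<int> x0s(dstW), x1s(dstW), fracxs(dstW);
    for (int x = 0; x < dstW; ++x)
    {
        const int sx = x * xs;
        int x0 = sx >> kShift;
        if (x0 > srcW - 1) x0 = srcW - 1;          // guard against rounding of xs
        x0s[x] = x0;
        x1s[x] = (x0 + 1 < srcW) ? x0 + 1 : x0;    // clamp right neighbour at the border
        fracxs[x] = sx & (kFactor - 1);
    }

    std::vector<uint8_t> dst(static_cast<std::size_t>(dstW) * dstH);
    for (int y = 0; y < dstH; ++y)
    {
        const int sy = y * ys;
        int y0 = sy >> kShift;
        if (y0 > srcH - 1) y0 = srcH - 1;
        const int y1 = (y0 + 1 < srcH) ? y0 + 1 : y0; // clamp lower neighbour at the border
        const int fy = sy & (kFactor - 1);
        const uint8_t* row0 = &src[static_cast<std::size_t>(y0) * srcW];
        const uint8_t* row1 = &src[static_cast<std::size_t>(y1) * srcW];
        for (int x = 0; x < dstW; ++x)
        {
            const int fx = fracxs[x];
            const int value =
                row0[x0s[x]] * (kFactor - fx) * (kFactor - fy) +
                row0[x1s[x]] * fx * (kFactor - fy) +
                row1[x0s[x]] * (kFactor - fx) * fy +
                row1[x1s[x]] * fx * fy +
                (kFactor * kFactor / 2);            // rounding term
            dst[static_cast<std::size_t>(y) * dstW + x] = static_cast<uint8_t>(value >> (2 * kShift));
        }
    }
    return dst;
}
```

The four weights always sum to kFactor * kFactor, so a flat input stays flat, which is a cheap sanity check. For the RGBA case the same tables can be reused for all four channels.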

Xorshift1024* jump not commutative?

I've been porting Sebastiano Vigna's xorshift1024* PRNG to be compatible with the standard C++11 uniform random number generator contract and noticed some strange behavior with the jump() function he provides.
According to Vigna, a call to jump() should be equivalent to 2^512 calls to next(). Therefore a series of calls to jump() and next() should be commutative. For example, assuming the generator starts in some known state,
jump();
next();
should leave the generator in the same state as
next();
jump();
since both should be equivalent to
for (bigint i = 0; i < (bigint(1) << 512) + 1; ++i)
    next();
assuming bigint is some integer type with an extremely large maximum value (and assuming you are a very, very, very patient person).
Unfortunately, this doesn't work with the reference implementation Vigna provides (which I will include at the end for posterity; in case the implementation linked above changes or is taken down in the future). When testing the first two options using the following test code:
memset(s, 0xFF, sizeof(s));
p = 0;
// jump() and/or next() calls...
std::cout << p << ';';
for (int i = 0; i < 16; ++i)
std::cout << ' ' << s[i];
calling jump() before next() outputs:
1; 9726214034378009495 13187905351877324975 10033047168458208082 990371716258730972 965585206446988056 74622805968655940 11468976784638207029 3005795712504439672 6792676950637600526 9275830639065898170 6762742930827334073 16862800599087838815 13481924545051381634 16436948992084179560 6906520316916502096 12790717607058950780
while calling next() first results in:
1; 13187905351877324975 10033047168458208082 990371716258730972 965585206446988056 74622805968655940 11468976784638207029 3005795712504439672 6792676950637600526 9275830639065898170 6762742930827334073 16862800599087838815 13481924545051381634 16436948992084179560 6906520316916502096 12790717607058950780 9726214034378009495
Clearly either my understanding of what jump() is doing is wrong, or there's a bug in the jump() function, or the jump polynomial data is wrong. Vigna claims that such a jump function can be calculated for any stride of the period, but doesn't elaborate on how to calculate it (including in his paper on xorshift* generators). How can I calculate the correct jump data to verify that there's not a typo somewhere in it?
Xorshift1024* reference implementation; http://xorshift.di.unimi.it/xorshift1024star.c
/* Written in 2014-2015 by Sebastiano Vigna (vigna@acm.org)
To the extent possible under law, the author has dedicated all copyright
and related and neighboring rights to this software to the public domain
worldwide. This software is distributed without any warranty.
See <http://creativecommons.org/publicdomain/zero/1.0/>. */
#include <stdint.h>
#include <string.h>
/* This is a fast, top-quality generator. If 1024 bits of state are too
much, try a xorshift128+ generator.
The state must be seeded so that it is not everywhere zero. If you have
a 64-bit seed, we suggest to seed a splitmix64 generator and use its
output to fill s. */
uint64_t s[16];
int p;
uint64_t next(void) {
    const uint64_t s0 = s[p];
    uint64_t s1 = s[p = (p + 1) & 15];
    s1 ^= s1 << 31; // a
    s[p] = s1 ^ s0 ^ (s1 >> 11) ^ (s0 >> 30); // b,c
    return s[p] * UINT64_C(1181783497276652981);
}
/* This is the jump function for the generator. It is equivalent
to 2^512 calls to next(); it can be used to generate 2^512
non-overlapping subsequences for parallel computations. */
void jump() {
    static const uint64_t JUMP[] = { 0x84242f96eca9c41dULL,
        0xa3c65b8776f96855ULL, 0x5b34a39f070b5837ULL, 0x4489affce4f31a1eULL,
        0x2ffeeb0a48316f40ULL, 0xdc2d9891fe68c022ULL, 0x3659132bb12fea70ULL,
        0xaac17d8efa43cab8ULL, 0xc4cb815590989b13ULL, 0x5ee975283d71c93bULL,
        0x691548c86c1bd540ULL, 0x7910c41d10a1e6a5ULL, 0x0b5fc64563b3e2a8ULL,
        0x047f7684e9fc949dULL, 0xb99181f2d8f685caULL, 0x284600e3f30e38c3ULL
    };
    uint64_t t[16] = { 0 };
    for(int i = 0; i < sizeof JUMP / sizeof *JUMP; i++)
        for(int b = 0; b < 64; b++) {
            if (JUMP[i] & 1ULL << b)
                for(int j = 0; j < 16; j++)
                    t[j] ^= s[(j + p) & 15];
            next();
        }
    memcpy(s, t, sizeof t);
}
OK, I'm sorry but sometimes this happens (I'm the author).
Originally the function had two memcpy(). Then I realised that a circular copy was needed. But I replaced just the first memcpy(). Stupid, stupid, stupid. All files on the site have been fixed. The arXiv copy is undergoing update. See http://xorshift.di.unimi.it/xorshift1024star.c
Incidentally: I didn't "publish" anything wrong in the scientific sense, as the jump() function is not part of the ACM Trans. Math. Soft. paper; it was just added a few weeks ago on the site and in the arXiv/WWW version. The fast publication path of the web and arXiv means that, sometimes, one distributes unpolished papers. I can only thank the reporter for reporting this bug (OK, technically StackOverflow is not for reporting bugs, but I got an email, too).
Unfortunately, the unit test I had did not consider the case p ≠ 0. My main concern was the correctness of the computed polynomial. The function, as noted above, is correct when p = 0.
As for the computation: to each generator corresponds a characteristic polynomial P(x). The jump polynomial for k is just x^k mod P(x). I use fermat to compute such powers, and then I have some scripts generating the C code.
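To make "x^k mod P(x)" concrete, here is a toy sketch over GF(2) for polynomials of degree 2 to 63, stored as bit masks (bit i is the coefficient of x^i). MulMod and XPowMod are illustrative names and this is not the fermat script mentioned above. The real P(x) has degree 1024, so it would need a multi-word representation, but the square-and-multiply idea is the same.

```cpp
#include <cstdint>

// Multiply a * b modulo p over GF(2). p must include its leading x^deg bit,
// a and b must have degree < deg, and 2 <= deg <= 63.
uint64_t MulMod(uint64_t a, uint64_t b, uint64_t p, int deg)
{
    uint64_t r = 0;
    while (b != 0)
    {
        if (b & 1)
            r ^= a;                 // addition in GF(2) is XOR
        b >>= 1;
        a <<= 1;                    // multiply a by x
        if ((a >> deg) & 1)
            a ^= p;                 // reduce once the degree reaches deg
    }
    return r;
}

// Compute x^k mod p by square-and-multiply.
uint64_t XPowMod(uint64_t k, uint64_t p, int deg)
{
    uint64_t result = 1;            // the polynomial 1
    uint64_t base = 2;              // the polynomial x
    while (k != 0)
    {
        if (k & 1)
            result = MulMod(result, base, p, deg);
        base = MulMod(base, base, p, deg);
        k >>= 1;
    }
    return result;
}
```

For example, with P(x) = x^4 + x + 1 (mask 0x13, deg 4), XPowMod(4, 0x13, 4) gives x + 1 (mask 3). Since that polynomial is primitive, XPowMod(15, 0x13, 4) gives 1.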
Of course I can't test 2^512, but since my generation code works perfectly from 2 to 2^30 (the range you can easily test), I'm confident it works at 2^512, too. It's just fermat computing x^{2^512} instead of x^{2^30}. But independent verifications are more than welcome.
I have code working only for powers of the form x^{2^t}. This is what I need to compute useful jump functions. Computing polynomials modulo P(x) is not difficult, so one could conceivably have a completely generic jump function for any value, but frankly I find this totally overkill.
If anybody is interested in getting other jump polynomials, I can provide the scripts. They will be part, as it happens for all other code, of the next xorshift distribution, but I need to complete the documentation before giving them out.
For the record, the characteristic polynomial of xorshift1024* is x^1024 + x^974 + x^973 + x^972 + x^971 + x^966 + x^965 + x^964 + x^963 + x^960 + x^958 + x^957 + x^956 + x^955 + x^950 + x^949 + x^948 + x^947 + x^942 + x^941 + x^940 + x^939 + x^934 + x^933 + x^932 + x^931 + x^926 + x^925 + x^923 + x^922 + x^920 + x^917 + x^916 + x^915 + x^908 + x^906 + x^904 + x^902 + x^890 + x^886 + x^873 + x^870 + x^857 + x^856 + x^846 + x^845 + x^844 + x^843 + x^841 + x^840 + x^837 + x^835 + x^830 + x^828 + x^825 + x^824 + x^820 + x^816 + x^814 + x^813 + x^811 + x^810 + x^803 + x^798 + x^797 + x^790 + x^788 + x^787 + x^786 + x^783 + x^774 + x^772 + x^771 + x^770 + x^769 + x^768 + x^767 + x^765 + x^760 + x^758 + x^753 + x^749 + x^747 + x^746 + x^743 + x^741 + x^740 + x^738 + x^737 + x^736 + x^735 + x^728 + x^726 + x^723 + x^722 + x^721 + x^720 + x^718 + x^716 + x^715 + x^714 + x^710 + x^709 + x^707 + x^694 + x^687 + x^686 + x^685 + x^684 + x^679 + x^678 + x^677 + x^674 + x^670 + x^669 + x^667 + x^666 + x^665 + x^663 + x^658 + x^655 + x^651 + x^639 + x^638 + x^635 + x^634 + x^632 + x^630 + x^623 + x^621 + x^618 + x^617 + x^616 + x^615 + x^614 + x^613 + x^609 + x^606 + x^604 + x^601 + x^600 + x^598 + x^597 + x^596 + x^594 + x^593 + x^592 + x^590 + x^589 + x^588 + x^584 + x^583 + x^582 + x^581 + x^579 + x^577 + x^575 + x^573 + x^572 + x^571 + x^569 + x^567 + x^565 + x^564 + x^563 + x^561 + x^559 + x^557 + x^556 + x^553 + x^552 + x^550 + x^544 + x^543 + x^542 + x^541 + x^537 + x^534 + x^532 + x^530 + x^528 + x^526 + x^523 + x^521 + x^520 + x^518 + x^516 + x^515 + x^512 + x^511 + x^510 + x^508 + x^507 + x^506 + x^505 + x^504 + x^502 + x^501 + x^499 + x^497 + x^494 + x^493 + x^492 + x^491 + x^490 + x^487 + x^485 + x^483 + x^482 + x^480 + x^479 + x^477 + x^476 + x^475 + x^473 + x^469 + x^468 + x^465 + x^463 + x^461 + x^460 + x^459 + x^458 + x^455 + x^453 + x^451 + x^448 + x^447 + x^446 + x^445 + x^443 + x^438 + x^437 + x^431 + x^430 + x^429 + x^428 + x^423 + x^417 + x^416 + x^415 + 
x^414 + x^412 + x^410 + x^409 + x^408 + x^400 + x^398 + x^396 + x^395 + x^391 + x^390 + x^386 + x^385 + x^381 + x^380 + x^378 + x^375 + x^373 + x^372 + x^369 + x^368 + x^365 + x^360 + x^358 + x^357 + x^354 + x^350 + x^348 + x^346 + x^345 + x^344 + x^343 + x^342 + x^340 + x^338 + x^337 + x^336 + x^335 + x^333 + x^332 + x^325 + x^323 + x^318 + x^315 + x^313 + x^309 + x^308 + x^305 + x^303 + x^302 + x^300 + x^294 + x^290 + x^281 + x^279 + x^276 + x^275 + x^273 + x^272 + x^267 + x^263 + x^262 + x^261 + x^260 + x^258 + x^257 + x^256 + x^249 + x^248 + x^243 + x^242 + x^240 + x^238 + x^236 + x^233 + x^232 + x^230 + x^228 + x^225 + x^216 + x^214 + x^212 + x^210 + x^208 + x^206 + x^205 + x^200 + x^197 + x^196 + x^184 + x^180 + x^176 + x^175 + x^174 + x^173 + x^168 + x^167 + x^166 + x^157 + x^155 + x^153 + x^152 + x^151 + x^150 + x^144 + x^143 + x^136 + x^135 + x^125 + x^121 + x^111 + x^109 + x^107 + x^105 + x^92 + x^90 + x^79 + x^78 + x^77 + x^76 + x^60 + 1
tldr: I'm pretty sure there's a bug in the original code:
The memcpy in jump() must consider the p rotation too.
The author didn't test nearly as much as appropriate before publishing a paper...
Long version:
One next() call changes only one of the 16 s array elements, the one with index p. p starts at 0, gets increased each next() call, and after 15 it becomes 0 again. Let's call s[p] the "current" array element. Another (slower) possibility for implementing next() would be that the current element is always the first one, there is no p, and instead of incrementing p the whole s array is rotated (ie. the first element moves to the last position and the previous second element becomes the first).
Independent of the current p value, 16 calls to next() should result in the same p value as before, ie. the whole cycle is done and the current element is the same position as before the 16 calls. jump() should do 2^512 next(), 2^512 is a multiple of 16, so with one jump, the p value before and after it should be the same.
You probably noticed already that your two different results are only rotated one time, ie. one solution is "9726214034378009495 somethingelse" and one is "somethingelse 9726214034378009495"
...because you did one next() before/after the jump() and jump() can't handle p other than 0.
If you'd test it with 16 next() (or 32 or 0 or ...) before/after jump() instead of one, the two results are equal. The reason is, within jump, while for the s array the current element / p is handled as it is in next(), the t array is semantically rotated so that the current element is always the first one (t[j] ^= s[(j + p) & 15];). Then, right before the function terminates, memcpy(s, t, sizeof t); copies the new values from t back to s without considering the rotation at all. Just replace the memcpy with a proper loop including the p offset, then it should be fine.
(Well, but that doesn't mean jump() is really the same as 2^512 next(). But at least it could be.)
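A sketch of what "a proper loop including the p offset" could look like. I wrapped the state in a struct (Xorshift1024Star and JumpCommutesWithNext are my names) so two copies can be compared. The generator and the JUMP table are taken from the question, and I have not verified the table itself.

```cpp
#include <cstdint>
#include <cstring>

struct Xorshift1024Star
{
    uint64_t s[16];
    int p;

    uint64_t next()
    {
        const uint64_t s0 = s[p];
        uint64_t s1 = s[p = (p + 1) & 15];
        s1 ^= s1 << 31;
        s[p] = s1 ^ s0 ^ (s1 >> 11) ^ (s0 >> 30);
        return s[p] * UINT64_C(1181783497276652981);
    }

    void jump()
    {
        static const uint64_t JUMP[16] = { 0x84242f96eca9c41dULL,
            0xa3c65b8776f96855ULL, 0x5b34a39f070b5837ULL, 0x4489affce4f31a1eULL,
            0x2ffeeb0a48316f40ULL, 0xdc2d9891fe68c022ULL, 0x3659132bb12fea70ULL,
            0xaac17d8efa43cab8ULL, 0xc4cb815590989b13ULL, 0x5ee975283d71c93bULL,
            0x691548c86c1bd540ULL, 0x7910c41d10a1e6a5ULL, 0x0b5fc64563b3e2a8ULL,
            0x047f7684e9fc949dULL, 0xb99181f2d8f685caULL, 0x284600e3f30e38c3ULL
        };
        uint64_t t[16] = { 0 };
        for (int i = 0; i < 16; i++)
            for (int b = 0; b < 64; b++)
            {
                if (JUMP[i] & (UINT64_C(1) << b))
                    for (int j = 0; j < 16; j++)
                        t[j] ^= s[(j + p) & 15];   // t is stored rotated by p
                next();
            }
        // Copy back with the same rotation instead of a plain memcpy.
        for (int j = 0; j < 16; j++)
            s[(j + p) & 15] = t[j];
    }
};

// Checks that jump();next(); and next();jump(); end in the same state.
bool JumpCommutesWithNext()
{
    Xorshift1024Star a;
    std::memset(a.s, 0xFF, sizeof a.s);
    a.p = 0;
    Xorshift1024Star b = a;
    a.jump(); a.next();
    b.next(); b.jump();
    return a.p == b.p && std::memcmp(a.s, b.s, sizeof a.s) == 0;
}
```

With the rotated copy-back, the check should pass for any starting p. It only shows that jump() now commutes with next(), not that the table really encodes 2^512 steps.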
As Vigna himself said, that was actually a bug.
While working on a Java implementation, I found, if not mistaken, a small improvement on the correct implementation:
If you update t array also circularly from p to p-1, then you can just memcpy it back to the state and it will work correctly.
Moreover, the loop updating t gets tighter, as you do not need to add p + j every time. For instance:
int j = p;
do {
t[j] ^= s[j];
j &= 15;
} while (j != p);
Ok, as bcrist correctly noted, the previous code is wrong, as p changes for each bit in the JUMP array. The best alternative I came up with is the following:
void jump() {
    static const uint64_t JUMP[] = { 0x84242f96eca9c41dULL,
        0xa3c65b8776f96855ULL, 0x5b34a39f070b5837ULL, 0x4489affce4f31a1eULL,
        0x2ffeeb0a48316f40ULL, 0xdc2d9891fe68c022ULL, 0x3659132bb12fea70ULL,
        0xaac17d8efa43cab8ULL, 0xc4cb815590989b13ULL, 0x5ee975283d71c93bULL,
        0x691548c86c1bd540ULL, 0x7910c41d10a1e6a5ULL, 0x0b5fc64563b3e2a8ULL,
        0x047f7684e9fc949dULL, 0xb99181f2d8f685caULL, 0x284600e3f30e38c3ULL
    };
    uint64_t t[16] = { 0 };
    const int base = p;
    int j = base;
    for(int i = 0; i < sizeof JUMP / sizeof *JUMP; i++)
        for(int b = 0; b < 64; b++) {
            if (JUMP[i] & 1ULL << b) {
                int k = p;
                do {
                    t[j++] ^= s[k++];
                    j &= 15;
                    k &= 15;
                } while (j != base);
            }
            next();
        }
    memcpy(s, t, sizeof t);
}
As p will have its original value in the end, this should work.
Not very sure whether it is actually an improvement in performance, as I am trading one addition for an increment and a bitwise AND.
I think it will not be slower, even if increment is as expensive as addition, due to the lack of data dependency between j and k updates. Hopefully, it may be slightly faster.
Opinions / corrections are more than welcome.

Why can the Intel C++ compiler not benefit from this template, while GNU can?

In our three-dimensional CFD code, performance is crucial. To skip all calculations in the third dimension in a two-dimensional simulation, I am experimenting with templates so that I don't have to duplicate code. With the GCC compiler, this works perfectly, but with the Intel compiler it results in a massive performance loss.
The call to my templated function looks like:
if(jtot == 1) // If jtot equals 1, then the grid is 2-dimensional
My templated function looks like this:
template<int dim>
void Advec4::advecu(double * restrict ut, double * restrict u, double * restrict v, double * restrict w, double * restrict dzi4)
// Here comes the initialization of the constants...
// Loop
for(int k=grid->kstart; k<grid->kend; k++)
for(int j=grid->jstart; j<grid->jend; j++)
#pragma ivdep
for(int i=grid->istart; i<grid->iend; i++)
ijk = i + j*jj1 + k*kk1;
ut[ijk] -= ( cg0*((ci0*u[ijk-ii3] + ci1*u[ijk-ii2] + ci2*u[ijk-ii1] + ci3*u[ijk ]) * (ci0*u[ijk-ii3] + ci1*u[ijk-ii2] + ci2*u[ijk-ii1] + ci3*u[ijk ]))
+ cg1*((ci0*u[ijk-ii2] + ci1*u[ijk-ii1] + ci2*u[ijk ] + ci3*u[ijk+ii1]) * (ci0*u[ijk-ii2] + ci1*u[ijk-ii1] + ci2*u[ijk ] + ci3*u[ijk+ii1]))
+ cg2*((ci0*u[ijk-ii1] + ci1*u[ijk ] + ci2*u[ijk+ii1] + ci3*u[ijk+ii2]) * (ci0*u[ijk-ii1] + ci1*u[ijk ] + ci2*u[ijk+ii1] + ci3*u[ijk+ii2]))
+ cg3*((ci0*u[ijk ] + ci1*u[ijk+ii1] + ci2*u[ijk+ii2] + ci3*u[ijk+ii3]) * (ci0*u[ijk ] + ci1*u[ijk+ii1] + ci2*u[ijk+ii2] + ci3*u[ijk+ii3])) ) * cgi*dxi;
if(dim == 3)
ut[ijk] -= ( cg0*((ci0*v[ijk-ii2-jj1] + ci1*v[ijk-ii1-jj1] + ci2*v[ijk-jj1] + ci3*v[ijk+ii1-jj1]) * (ci0*u[ijk-jj3] + ci1*u[ijk-jj2] + ci2*u[ijk-jj1] + ci3*u[ijk ]))
+ cg1*((ci0*v[ijk-ii2 ] + ci1*v[ijk-ii1 ] + ci2*v[ijk ] + ci3*v[ijk+ii1 ]) * (ci0*u[ijk-jj2] + ci1*u[ijk-jj1] + ci2*u[ijk ] + ci3*u[ijk+jj1]))
+ cg2*((ci0*v[ijk-ii2+jj1] + ci1*v[ijk-ii1+jj1] + ci2*v[ijk+jj1] + ci3*v[ijk+ii1+jj1]) * (ci0*u[ijk-jj1] + ci1*u[ijk ] + ci2*u[ijk+jj1] + ci3*u[ijk+jj2]))
+ cg3*((ci0*v[ijk-ii2+jj2] + ci1*v[ijk-ii1+jj2] + ci2*v[ijk+jj2] + ci3*v[ijk+ii1+jj2]) * (ci0*u[ijk ] + ci1*u[ijk+jj1] + ci2*u[ijk+jj2] + ci3*u[ijk+jj3])) ) * cgi*dyi;
ut[ijk] -= ( cg0*((ci0*w[ijk-ii2-kk1] + ci1*w[ijk-ii1-kk1] + ci2*w[ijk-kk1] + ci3*w[ijk+ii1-kk1]) * (ci0*u[ijk-kk3] + ci1*u[ijk-kk2] + ci2*u[ijk-kk1] + ci3*u[ijk ]))
+ cg1*((ci0*w[ijk-ii2 ] + ci1*w[ijk-ii1 ] + ci2*w[ijk ] + ci3*w[ijk+ii1 ]) * (ci0*u[ijk-kk2] + ci1*u[ijk-kk1] + ci2*u[ijk ] + ci3*u[ijk+kk1]))
+ cg2*((ci0*w[ijk-ii2+kk1] + ci1*w[ijk-ii1+kk1] + ci2*w[ijk+kk1] + ci3*w[ijk+ii1+kk1]) * (ci0*u[ijk-kk1] + ci1*u[ijk ] + ci2*u[ijk+kk1] + ci3*u[ijk+kk2]))
+ cg3*((ci0*w[ijk-ii2+kk2] + ci1*w[ijk-ii1+kk2] + ci2*w[ijk+kk2] + ci3*w[ijk+ii1+kk2]) * (ci0*u[ijk ] + ci1*u[ijk+kk1] + ci2*u[ijk+kk2] + ci3*u[ijk+kk3])) )
* dzi4[k];
Why does the Intel compiler mess this up? It seems that it does not process the template before it starts optimizing, which results in a remaining if statement in an inner loop or a loss of performance for whatever other reason.
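Stripped down to the bare pattern, the setup is roughly the following. This is a simplified sketch with made-up names (AdvecSketch, AdvecDispatch) and a trivial loop body instead of the real stencil.

```cpp
#include <cstddef>
#include <vector>

// dim is a compile-time constant, so each instantiation sees either
// "if (false)" or "if (true)" inside the loop.
template<int dim>
void AdvecSketch(std::vector<double>& ut, const std::vector<double>& u,
                 const std::vector<double>& v)
{
    for (std::size_t i = 0; i < ut.size(); ++i)
    {
        ut[i] -= u[i] * u[i];          // term that is always needed
        if (dim == 3)
            ut[i] -= v[i] * u[i];      // term only needed in the 3D case
    }
}

// The runtime check happens once, outside the loops.
void AdvecDispatch(std::vector<double>& ut, const std::vector<double>& u,
                   const std::vector<double>& v, int jtot)
{
    if (jtot == 1)
        AdvecSketch<2>(ut, u, v);
    else
        AdvecSketch<3>(ut, u, v);
}
```

Whether the dead branch is actually removed from the inner loop is up to the optimizer. With a C++17 compiler, writing the condition as if constexpr(dim == 3) would guarantee that the discarded branch is not instantiated.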

What are some tips to make a function with a lot of calculation clean? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
How can functions like this:
void Map::Display()
int hScrollPos = GetScrollPos(M_HWnd, SB_HORZ);
int vScrollPos = GetScrollPos(M_HWnd, SB_VERT);
D2D1_RECT_F tFRegion = {0,0,TILE_WIDTH,21}; // tile front's region
Coor coor;
int tileHeight;
RECT rect;
GetWindowRect(M_HWnd, &rect);
int HWndWidth = rect.right - rect.left;
int HWndHeight = rect.bottom - rect.top;
pRT->Clear(D2D1::ColorF(0.45f, 0.76f, 0.98f, 1.0f));
for(int x=0; x<nTiles; x++)
coor = ppTile[x]->Getcoor();
tileHeight = ppTile[x]->Getheight();
if((coor.GetX() - 1) * (TILE_WIDTH * 0.5) - hScrollPos > 0 - TILE_WIDTH &&
(coor.GetX() - 1) * (TILE_WIDTH * 0.5) - hScrollPos < HWndWidth &&
((coor.GetY() - 1) * (TILE_HEIGHT * 0.5) * 1.5f) + ((MAX_MAP_HEIGHT - tileHeight) * (TILE_PIXEL_PER_LAYER)) + TILE_HEIGHT - vScrollPos > 0 - (TILE_HEIGHT * 2.5) &&
((coor.GetY() - 1) * (TILE_HEIGHT * 0.5) * 1.5f) + ((MAX_MAP_HEIGHT - tileHeight) * (TILE_PIXEL_PER_LAYER)) + TILE_HEIGHT - vScrollPos < HWndHeight)
/* Draws tiles */
(coor.GetX() - 1) * (TILE_WIDTH * 0.5) - hScrollPos,
((coor.GetY() - 1) * (TILE_HEIGHT * 0.5) * 1.5f) + ((MAX_MAP_HEIGHT - tileHeight) * (TILE_PIXEL_PER_LAYER)) + TILE_HEIGHT - vScrollPos
pRT->FillRectangle( &region, pBmpTileBrush[ppTile[x]->GetType() + 1]);
/* Draws tiles' front */
if((coor.Y - 1) / 2 < mapSizeY - 1) // If we are not in the front row,
if(coor.X > 1)
for(int diffH = tileHeight - ppTile[x + mapSizeX - 1]->Getheight(); diffH == 0; diffH--)
(coor.GetX() - 1) * (TILE_WIDTH * 0.5) - hScrollPos,
((coor.GetY() - 1) * (TILE_HEIGHT * 0.5) * 1.5f) + ((MAX_MAP_HEIGHT - tileHeight) * (TILE_PIXEL_PER_LAYER)) + TILE_HEIGHT - vScrollPos + (TILE_HEIGHT * 0.75) + (diffH * TILE_PIXEL_PER_LAYER)
pRT->FillRectangle( &tFRegion, pBmpTileFrontBrush[ppTile[x]->GetType()]);
if(((coor.X -1) / 2) + 1 < mapSizeX)
for(int diffH = tileHeight - ppTile[x + mapSizeX]->Getheight(); diffH == 0; diffH--)
(coor.GetX() - 1) * (TILE_WIDTH * 0.5) - hScrollPos,
((coor.GetY() - 1) * (TILE_HEIGHT * 0.5) * 1.5f) + ((MAX_MAP_HEIGHT - tileHeight) * (TILE_PIXEL_PER_LAYER)) + TILE_HEIGHT - vScrollPos + (TILE_HEIGHT * 0.75) + (diffH * TILE_PIXEL_PER_LAYER)
pRT->FillRectangle( &tFRegion, pBmpTileFrontBrush[ppTile[x]->GetType()]);
if(coor.X == 1 || (coor.X - 1) / 2 == mapSizeY - 1) // If the tile is at either the left or right edge,
for(int n = ((TH * 1.5) / TPPL) - (ppTile[x + mapSizeY + mapSizeY - 1]->Getheight() - tileHeight); n>=0; n--)
(coor.X - 1) * (TILE_WIDTH * 0.5) - hScrollPos,
((coor.Y - 1) * (TILE_HEIGHT * 0.5) * 1.5f) + ((MAX_MAP_HEIGHT - tileHeight) * (TILE_PIXEL_PER_LAYER)) + TILE_HEIGHT - vScrollPos + (TILE_HEIGHT * 0.75) + (n * TILE_PIXEL_PER_LAYER)
pRT->FillRectangle( &tFRegion, pBmpTileFrontBrush[ppTile[x]->GetType()]);
else // If we are in the front row
for(int h = tileHeight; h >= 0; h--)
(coor.GetX() - 1) * (TILE_WIDTH * 0.5) - hScrollPos,
((coor.GetY() - 1) * (TILE_HEIGHT * 0.5) * 1.5f) + ((MAX_MAP_HEIGHT - tileHeight) * (TILE_PIXEL_PER_LAYER)) + TILE_HEIGHT - vScrollPos + (TILE_HEIGHT * 0.75) + (h * TILE_PIXEL_PER_LAYER)
pRT->FillRectangle( &tFRegion, pBmpTileFrontBrush[ppTile[x]->GetType()]);
hr = pRT->EndDraw();
Tile* Map::GetClickedTile(short xPos, short yPos)
Tile* pNoClickedTile = NULL;
int hScrollPos = GetScrollPos(M_HWnd, SB_HORZ);
int vScrollPos = GetScrollPos(M_HWnd, SB_VERT);
if(xPos < (mapSizeX * TILE_WIDTH) - hScrollPos) // If the click is within width of the map then...
Coor coor;
int height;
int currentTile;
int tileDistanceFromTop;
/* Checks if click is in an odd row of tiles */
int column = (xPos + hScrollPos) / TILE_WIDTH;
for (int y=mapSizeY-1; y>=0; y--)
currentTile = column + (y * (mapSizeX+mapSizeX-1));
coor = ppTile[currentTile]->Getcoor();
height = ppTile[currentTile]->Getheight();
tileDistanceFromTop = ((coor.Y / 2) * TILE_HEIGHT * 1.5f) + // Distance between two tiles
vScrollPos +
/*if (tileDistanceFromTop < 0) // If the tile is partially hidden,
tileDistanceFromTop = tileDistanceFromTop % TILE_HEIGHT; // then % TILE_HEIGHT*/
if( yPos > tileDistanceFromTop &&
yPos < tileDistanceFromTop + TILE_HEIGHT)
/* Get relative coordinates */
int rpx = xPos % TILE_WIDTH;
int rpy = ( (yPos - SPACE_LEFT_FOR_BACKGROUND) -
(y * (TILE_HEIGHT /2) ) -
vScrollPos) %
/* Checks if click is withing area of current tile */
if (rpy + (rpx / (TILE_WIDTH /16)) > TILE_HEIGHT * 0.25f && // if click is Down Right the Upper Left slope and,
rpy + (rpx / (TILE_WIDTH /16)) < TILE_HEIGHT * 1.25f && // it is UL the LR slope and,
rpy - (rpx / (TILE_WIDTH /16)) < TILE_HEIGHT * 0.75f && // it is UR the LL slope and,
rpy - (rpx / (TILE_WIDTH /16)) > TILE_HEIGHT * -0.25f) // it is DL the UR slope,
return ppTile[currentTile]; // Then return currentTile
/* Checks if click is in an even row of tiles */
column = (xPos + hScrollPos - (TILE_WIDTH/2)) / TILE_WIDTH;
for (int y=mapSizeY-2; y>=0; y--)
currentTile = column + (y * (mapSizeX+mapSizeX-1)) + mapSizeX;
coor = ppTile[currentTile]->Getcoor();
height = ppTile[currentTile]->Getheight();
tileDistanceFromTop = (((coor.Y - 1) / 2) * TILE_HEIGHT * 1.5f) + // Distance between two tiles
(TILE_HEIGHT * 0.75) -
vScrollPos +
/*if (tileDistanceFromTop < 0)
tileDistanceFromTop = tileDistanceFromTop % TILE_HEIGHT;*/
if( yPos > tileDistanceFromTop &&
yPos < tileDistanceFromTop + TILE_HEIGHT)
/* Get relative coordinates */
int rpx = xPos % TILE_WIDTH;
int rpy = (int)((yPos - SPACE_LEFT_FOR_BACKGROUND) -
(y * (TILE_HEIGHT /2) ) -
(TILE_HEIGHT * 0.675) +
vScrollPos) %
/* Checks if click is withing area of current tile */
if (rpy + (rpx / (TILE_WIDTH /16)) > TILE_HEIGHT * 0.25f && // if click is Down Right the Upper Left slope and,
rpy + (rpx / (TILE_WIDTH /16)) < TILE_HEIGHT * 1.25f && // it is UL the LR slope and,
rpy - (rpx / (TILE_WIDTH /16)) < TILE_HEIGHT * 0.75f && // it is UR the LL slope and,
rpy - (rpx / (TILE_WIDTH /16)) > TILE_HEIGHT * -0.25f) // it is DL the UR slope,
return ppTile[currentTile]; // Then return currentTile
return pNoClickedTile;
Or even this:
int Map::GetTileNByCoor(Coor coor)
return (coor.X / 2 + ((coor.Y - 1) * mapSizeY) - (coor.Y / 2));
be made easier to read? As my code grows bigger, I realize how important, if not at times necessary, it is to have clean, easy-to-read code. What are some tips to make code like the examples above cleaner?
My general refactoring practice is usually the following:
Pull out names for things that aren't apparent in the code. You can use local variables to give defining names to small pieces of code. So, in cases like your last example, what does (coor.X / 2 + ((coor.Y - 1) * mapSizeY) represent?
In most cases it's better to have things named well than to worry about storing local variables (they are released when the function returns, and you are usually not going to be too worried about memory space or speed at such a fine grain).
Pull out groups of executing code into methods. A good rule of thumb is that if your function is more than 6 lines of code, you can probably pull a smaller function out of it. Then your code will read closer to what it is actually doing.
A very common place to look for this is loops. You can almost always pull the code inside a loop into its own function, with a good descriptive name.
After you have pulled out methods, you can group common shared functionality into smaller objects. It's almost always better to have smaller objects working together to do the work than to have giant objects that do a lot of work. You want each of your objects to have a single responsibility.
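As a minimal sketch of the first two points, the screen-position expression that is repeated throughout Map::Display() can be given a name, and the four-line visibility condition can become its own function. The constant values, the ScreenPos struct, and the helper names TileScreenPos and IsTileVisible are assumptions made for illustration; none of them appear in the original code.

```cpp
// Hypothetical values; the question does not show how these are defined.
constexpr int TILE_WIDTH = 64;
constexpr int TILE_HEIGHT = 32;
constexpr int TILE_PIXEL_PER_LAYER = 4;
constexpr int MAX_MAP_HEIGHT = 10;

// Hypothetical helper type: where a tile's region starts on screen.
struct ScreenPos
{
    float x;
    float y;
};

// Names the expression that Map::Display() repeats for every rectangle.
constexpr ScreenPos TileScreenPos(int coorX, int coorY, int tileHeight,
                                  int hScrollPos, int vScrollPos)
{
    const float halfTileWidth = TILE_WIDTH * 0.5f;
    const float rowSpacing = TILE_HEIGHT * 0.5f * 1.5f;
    const float elevationOffset = (MAX_MAP_HEIGHT - tileHeight) * TILE_PIXEL_PER_LAYER;

    return { (coorX - 1) * halfTileWidth - hScrollPos,
             (coorY - 1) * rowSpacing + elevationOffset + TILE_HEIGHT - vScrollPos };
}

// Same comparisons as the condition at the top of the loop, with names.
constexpr bool IsTileVisible(const ScreenPos& pos, int windowWidth, int windowHeight)
{
    const bool insideHorizontally = pos.x > -TILE_WIDTH && pos.x < windowWidth;
    const bool insideVertically = pos.y > -(TILE_HEIGHT * 2.5f) && pos.y < windowHeight;
    return insideHorizontally && insideVertically;
}
```

With helpers like these, the loop body could start with something like `if (IsTileVisible(TileScreenPos(coor.GetX(), coor.GetY(), tileHeight, hScrollPos, vScrollPos), HWndWidth, HWndHeight))`, and the later FillRectangle calls could reuse the same position instead of repeating the arithmetic.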
Pretty solid code, well done. I would consider:
Comment the function itself at a high level, and then add better comments for all the significant blocks in the code, and for anything unusually tricky.
Use descriptive consts or #defines for all the magic numbers you're using. Why multiply by 0.675? What does 0.675 represent? Ditto 0.25, 1.25, -0.25 etc.
Turn things like the "Checks if click is withing area of current tile" test (and others) into a separate method that you call, for example isClickInsideTile(x,y,tile).
Add debug tracing so that the next person responsible for the code can enable it to get diagnostics.
PS: Good job with your variable names and method names.
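The suggestions about named constants and a separate hit-test method could be sketched roughly as follows. This is only an illustration: the values of TILE_WIDTH and TILE_HEIGHT are made up, the constant names are guesses based on the comments in the question ("Upper Left slope", "LR slope", and so on), and this isClickInsideTile takes the relative coordinates rpx and rpy rather than a tile, because the Tile class is not shown.

```cpp
// Hypothetical values; the question does not show how these are defined.
constexpr int TILE_WIDTH = 64;
constexpr int TILE_HEIGHT = 32;

// Names for the factors used in the hit test, taken from the comments
// next to each comparison in the question (assumed meaning).
constexpr float UPPER_LEFT_SLOPE_FACTOR = 0.25f;
constexpr float LOWER_RIGHT_SLOPE_FACTOR = 1.25f;
constexpr float LOWER_LEFT_SLOPE_FACTOR = 0.75f;
constexpr float UPPER_RIGHT_SLOPE_FACTOR = -0.25f;

// rpx and rpy are the click coordinates relative to the tile,
// as computed in Map::GetClickedTile().
constexpr bool isClickInsideTile(int rpx, int rpy)
{
    // Same integer division as in the original condition.
    const int slopeOffset = rpx / (TILE_WIDTH / 16);
    const int yPlusSlope = rpy + slopeOffset;
    const int yMinusSlope = rpy - slopeOffset;

    return yPlusSlope > TILE_HEIGHT * UPPER_LEFT_SLOPE_FACTOR &&
           yPlusSlope < TILE_HEIGHT * LOWER_RIGHT_SLOPE_FACTOR &&
           yMinusSlope < TILE_HEIGHT * LOWER_LEFT_SLOPE_FACTOR &&
           yMinusSlope > TILE_HEIGHT * UPPER_RIGHT_SLOPE_FACTOR;
}
```

Both the odd-row and even-row loops in GetClickedTile() could then call `isClickInsideTile(rpx, rpy)` instead of each carrying its own copy of the four comparisons.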