I want to speed up image processing code using OpenMP, and I found some strange behavior in my code. I'm using Visual Studio 2019, and I also tried the Intel C++ compiler with the same result.
I'm not sure why the code with OpenMP is in some situations much slower than in others. For example, the function divideImageDataWithParam(), or the difference between copyFirstPixelOnRow() and copyFirstPixelOnRowUsingTSize(), which uses the struct TSize as the parameter for the image data size. Why is the performance of boxFilterRow() and boxFilterRow_OpenMP() so different, and why doesn't that change with different radius sizes in the program?
I created github repository for this little testing project:
https://github.com/Tb45/OpenMP-Strange-Behavior
Here are all results summarized:
https://github.com/Tb45/OpenMP-Strange-Behavior/blob/master/resuts.txt
I didn't find any explanation for why this is happening or what I'm doing wrong.
Thanks for your help.
I'm working on a faster box filter and other image processing algorithms.
typedef intptr_t int_t;

struct TSize
{
    int_t width;
    int_t height;
};

void divideImageDataWithParam(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = src[y*srcStep + x]/param;
        }
    }
}

void divideImageDataWithParam_OpenMP(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
    #pragma omp parallel for if(parallel)
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = src[y*srcStep + x]/param;
        }
    }
}
Results of divideImageDataWithParam():
generateRandomImageData :: 3840x2160
numberOfIterations = 100

With Visual C++ 2019:

       32bit        64bit
   336.906ms    344.251ms   divideImageDataWithParam
  1832.120ms   6395.861ms   divideImageDataWithParam_OpenMP, single-threaded (parallel=false)
   387.152ms   1204.302ms   divideImageDataWithParam_OpenMP, multi-threaded (parallel=true)

With Intel C++ 19:

       32bit        64bit
    15.162ms      8.927ms   divideImageDataWithParam
   266.646ms    294.134ms   divideImageDataWithParam_OpenMP, single-threaded (parallel=false)
   239.564ms   1195.556ms   divideImageDataWithParam_OpenMP, multi-threaded (parallel=true)
Screenshot from Intel VTune Amplifier, where divideImageDataWithParam_OpenMP() with parallel=false spends most of its time in the mov instruction that stores to dst memory.
648trindade is right; it has to do with optimizations that cannot be done with OpenMP. But it's not loop unrolling or vectorization, it's inlining, which allows for a smart substitution.
Let me explain: integer divisions are incredibly slow (64-bit IDIV: ~40-100 cycles). So whenever possible, people (and compilers) try to avoid divisions. One trick is to substitute the division with a multiplication and a shift. That only works if the divisor is known at compile time, which is the case here because your function divideImageDataWithParam is inlined and param is known. You can verify this by prepending the function with __declspec(noinline); you will then get the timings that you expected.
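For illustration only (the constants here are mine, built with the usual floor(2^16/d)+1 construction, not the exact ones MSVC picks, and compilers use a more general signed variant), the multiply-and-shift substitution for unsigned 8-bit pixels looks roughly like this:

#include <cstdint>
#include <cassert>

// Precompute a fixed-point reciprocal so that x / d becomes a multiply and a shift.
// Valid for x in [0, 255] and d in [1, 255] (a sketch; real compilers pick tighter constants).
struct Reciprocal8
{
    uint32_t multiplier;
    explicit Reciprocal8(uint32_t d) : multiplier(65536u / d + 1u) {}
    uint8_t divide(uint8_t x) const { return static_cast<uint8_t>((x * multiplier) >> 16); }
};

int main()
{
    Reciprocal8 div7(7);
    for (uint32_t x = 0; x < 256; x++)
        assert(div7.divide(static_cast<uint8_t>(x)) == x / 7);   // matches plain division
}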
The OpenMP parallelization does not allow this trick because the function cannot be inlined, so param is not known at compile time and an expensive IDIV instruction is generated.
Compiler output of divideImageDataWithParam (WIN10, MSVC2017, x64):
0x7ff67d151480 <+ 336> movzx ecx,byte ptr [r10+r8]
0x7ff67d151485 <+ 341> mov rax,r12
0x7ff67d151488 <+ 344> mul rax,rcx <------- multiply
0x7ff67d15148b <+ 347> shr rdx,3 <------- shift
0x7ff67d15148f <+ 351> mov byte ptr [r8],dl
0x7ff67d151492 <+ 354> lea r8,[r8+1]
0x7ff67d151496 <+ 358> sub r9,1
0x7ff67d15149a <+ 362> jne test!main+0x150 (00007ff6`7d151480)
And the openmp-version:
0x7ff67d151210 <+ 192> movzx eax,byte ptr [r10+rcx]
0x7ff67d151215 <+ 197> lea rcx,[rcx+1]
0x7ff67d151219 <+ 201> cqo
0x7ff67d15121b <+ 203> idiv rax,rbp <------- idiv
0x7ff67d15121e <+ 206> mov byte ptr [rcx-1],al
0x7ff67d151221 <+ 209> lea rax,[r8+rcx]
0x7ff67d151225 <+ 213> mov rdx,qword ptr [rbx]
0x7ff67d151228 <+ 216> cmp rax,rdx
0x7ff67d15122b <+ 219> jl test!divideImageDataWithParam$omp$1+0xc0 (00007ff6`7d151210)
Note 1) If you try the Compiler Explorer (https://godbolt.org/), you will see that some compilers do the substitution for the OpenMP version too.
Note 2) As soon as the parameter is not known at compile time, this optimization cannot be done anyway. So if you put your function into a library, it will be slow. I'd do something like precomputing the division for all possible values and then doing a lookup. This is even faster because the lookup table fits into 4-5 cache lines and L1 latency is only 3-4 cycles.
void divideImageDataWithParam(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
    uint8_t tbl[256];
    for (int i = 0; i < 256; i++) {
        tbl[i] = i / param;
    }

    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = tbl[src[y*srcStep + x]];
        }
    }
}
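If the function has to stay in a library with param only known at run time, the lookup table combines naturally with the OpenMP loop as well; a sketch, reusing the question's TSize/int_t and not benchmarked on the question's setup:

void divideImageDataWithParam_OpenMP_LUT(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
    unsigned char tbl[256];
    for (int i = 0; i < 256; i++)
        tbl[i] = static_cast<unsigned char>(i / param);

    // The table is read-only inside the parallel region, so sharing it is safe.
    #pragma omp parallel for if(parallel)
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = tbl[src[y*srcStep + x]];
        }
    }
}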
Also thanks for the interesting question, I learned a thing or two along the way! ;-)
This behavior is explained by the use of compiler optimizations: when they are enabled, the sequential divideImageDataWithParam code is subjected to a series of optimizations (loop unrolling, vectorization, etc.) that the parallel divideImageDataWithParam_OpenMP code probably is not, since the compiler outlines the parallel region into a separate function and can no longer optimize it in the same way.
If you compile this same code without optimizations, you will find that the runtime of the sequential version is very similar to that of the parallel version with only one thread.
The maximum speedup of the parallel version in this case is therefore limited to dividing the original, unoptimized workload among the threads. Optimizations in this case need to be written manually.
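For example, one optimization you can write by hand is hoisting the per-row base pointers so the inner loop becomes a plain linear scan (a sketch only; the dominant cost, the division by a runtime param, still needs the reciprocal or lookup-table treatment described in the other answer):

void divideImageDataWithParam_OpenMP_manual(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
    #pragma omp parallel for if(parallel)
    for (int_t y = 0; y < size.height; y++)
    {
        // Hoist the row base pointers by hand so the inner loop is a flat scan.
        const unsigned char * s = src + y*srcStep;
        unsigned char * d = dst + y*dstStep;
        for (int_t x = 0; x < size.width; x++)
        {
            d[x] = static_cast<unsigned char>(s[x] / param);
        }
    }
}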
Related
I have code like this:
const uint64_t tsc = __rdtsc();
const __m128 res = computeSomethingExpensive();
const uint64_t elapsed = __rdtsc() - tsc;
printf( "%" PRIu64 " cycles", elapsed );
In release builds, this prints garbage like “38 cycles” because the VC++ compiler reordered my code:
const uint64_t tsc = __rdtsc();
00007FFF3D398D00 rdtsc
00007FFF3D398D02 shl rdx,20h
00007FFF3D398D06 or rax,rdx
00007FFF3D398D09 mov r9,rax
const uint64_t elapsed = __rdtsc() - tsc;
00007FFF3D398D0C rdtsc
00007FFF3D398D0E shl rdx,20h
00007FFF3D398D12 or rax,rdx
00007FFF3D398D15 mov rbx,rax
00007FFF3D398D18 sub rbx,r9
const __m128 res = …
00007FFF3D398D1B lea rdx,[rcx+98h]
00007FFF3D398D22 mov rcx,r10
00007FFF3D398D25 call computeSomethingExpensive (07FFF3D393E50h)
What’s the best way to fix this?
P.S. I’m aware rdtsc doesn’t count cycles; it measures time based on the CPU’s base frequency. I’m OK with that, I still want to measure that number.
Update: godbolt link
Adding a fake store
static bool save = false;
if (save)
{
static float res1[4];
_mm_store_ps(res1, res);
}
before the second __rdtsc seems to be enough to fool the compiler.
(I'm not adding a real store, to avoid contention if this function is called from multiple threads, though TLS could be used to avoid that.)
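Put together, a minimal sketch of the fix (MSVC-specific headers; computeSomethingExpensive is the question's placeholder, declared but not defined here):

#include <cstdint>
#include <cinttypes>
#include <cstdio>
#include <intrin.h>   // __rdtsc, __m128, _mm_store_ps on MSVC

__m128 computeSomethingExpensive();   // placeholder from the question

uint64_t timeIt()
{
    const uint64_t tsc = __rdtsc();
    const __m128 res = computeSomethingExpensive();

    // Fake store: never executed at runtime, but it makes `res` observable,
    // so the compiler can no longer sink the call below the second __rdtsc().
    static bool save = false;
    if (save)
    {
        static float res1[4];
        _mm_store_ps(res1, res);
    }

    const uint64_t elapsed = __rdtsc() - tsc;
    printf("%" PRIu64 " cycles\n", elapsed);
    return elapsed;
}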
I have posted some code below to test the performance (time-wise, in milliseconds) of calling a method in native C++ and in C# from C++/CLI, using Visual Studio 2010. I have a separate native C++ project which is compiled into a DLL. When I call into C++ from C++, I get the expected result, which is much faster (about 4x) than the managed counterparts. However, when I call into C++ from C++/CLI, the performance is 10x slower.
Is this expected behavior when calling into native C++ from C++/CLI? I was under the impression that there shouldn't be a significant difference, but this simple test is showing otherwise. Could this be an optimization difference between the C++ and C++/CLI compilers?
Update
I made some updates to the .cpp so that I'm not calling a method in a tight loop (as Reed Copsey pointed out), and it turns out that the difference in performance is insignificant or very small, depending on how the interop is being done, of course.
.h
#ifndef CPPOBJECT_H
#define CPPOBJECT_H
#ifdef CPLUSPLUSOBJECT_EXPORTING
#define CLASS_DECLSPEC __declspec(dllexport)
#else
#define CLASS_DECLSPEC __declspec(dllimport)
#endif
class CLASS_DECLSPEC CPlusPlusObject
{
public:
CPlusPlusObject(){}
~CPlusPlusObject(){}
void sayHello();
double getSqrt(double n);
// Update
double wasteSomeTimeWithSqrt(double n);
};
#endif
.cpp
#include "CPlusPlusObject.h"
#include <iostream>
void CPlusPlusObject::sayHello(){std::cout << "Hello";}
double CPlusPlusObject::getSqrt(double n) {return std::sqrt(n);}
double CPlusPlusObject::wasteSomeTimeWithSqrt(double n)
{
    double result = 0;
    for (int x = 0; x < 10000000; x++)
    {
        result += std::sqrt(n);
    }
    return result;
}
c++/cli
const unsigned set = 100;
const unsigned repetitions = 1000000;
double cppcliTocpp()
{
double n = 0;
System::Diagnostics::Stopwatch^ stopWatch = gcnew System::Diagnostics::Stopwatch();
stopWatch->Start();
while (stopWatch->ElapsedMilliseconds < 1200){n+=0.001;}
stopWatch->Reset();
for (int x = 0; x < set; x++)
{
stopWatch->Start();
CPlusPlusObject cplusplusObject;
n += cplusplusObject.wasteSomeTimeWithSqrt(123.456);
/*for (int i = 0; i < repetitions; i++)
{
n += cplusplusObject.getSqrt(123.456);
}*/
stopWatch->Stop();
System::Console::WriteLine("c++/cli call to native c++ took " + stopWatch->ElapsedMilliseconds + "ms.");
stopWatch->Reset();
}
return n;
}
double cppcliTocSharp()
{
double n = 0;
System::Diagnostics::Stopwatch^ stopWatch = gcnew System::Diagnostics::Stopwatch();
stopWatch->Start();
while (stopWatch->ElapsedMilliseconds < 1200){n+=0.001;}
stopWatch->Reset();
for (int x = 0; x < set; x++)
{
stopWatch->Start();
CSharp::CSharpObject^ cSharpObject = gcnew CSharp::CSharpObject();
for (int i = 0; i < repetitions; i++)
{
n += cSharpObject->GetSqrt(123.456);
}
stopWatch->Stop();
System::Console::WriteLine("c++/cli call to c# took " + stopWatch->ElapsedMilliseconds + "ms.");
stopWatch->Reset();
}
return n;
}
double cppcli()
{
double n = 0;
System::Diagnostics::Stopwatch^ stopWatch = gcnew System::Diagnostics::Stopwatch();
stopWatch->Start();
while (stopWatch->ElapsedMilliseconds < 1200){n+=0.001;}
stopWatch->Reset();
for (int x = 0; x < set; x++)
{
stopWatch->Start();
CPlusPlusCliObject cPlusPlusCliObject;
for (int i = 0; i < repetitions; i++)
{
n += cPlusPlusCliObject.getSqrt(123.456);
}
stopWatch->Stop();
System::Console::WriteLine("c++/cli took " + stopWatch->ElapsedMilliseconds + "ms.");
stopWatch->Reset();
}
return n;
}
int main()
{
double n = 0;
n += cppcliTocpp();
n += cppcliTocSharp();
n += cppcli();
System::Console::WriteLine(n);
System::Console::ReadKey();
}
However, when I call into c++ from c++/cli, the performance is 10x slower.
Bridging the CLR and native code requires marshaling, so there is always going to be some overhead on each call when going from C++/CLI into a native method.
The only reason the overhead (in this case) seems so large is that you're calling a very fast method in a tight loop. If you were to batch the calls, or call a method that was significantly longer in terms of runtime, you'd find that the overhead is quite small.
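Batching simply means moving the loop across the managed/native boundary once, which is what the question's update does with wasteSomeTimeWithSqrt; a hypothetical free-function variant that processes a whole buffer per call (my naming, needs <cmath>) would look like:

// Hypothetical helper (not in the question's code): one native call processes
// a whole batch, so the managed/native transition cost is amortized over
// `count` elements instead of being paid per element.
double nativeSqrtBatch(const double * values, double * results, int count)
{
    double sum = 0;
    for (int i = 0; i < count; i++)
    {
        results[i] = std::sqrt(values[i]);
        sum += results[i];
    }
    return sum;
}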
These micro-benchmarks are very dangerous. You made an effort to avoid the typical benchmark mistakes but still fell into a classic trap. Your intention was to measure method call overhead, but that's not what's actually happening. The jitter optimizer is capable of standard code optimization techniques, like code hoisting and method inlining. You can only really see that when you look at the generated machine code. Debug + Windows + Disassembly window.
I tested this with VS2012, 32-bit Release build with the jitter optimizer enabled. The C++/CLI code was the fastest, taking ~128 msec:
000000bf fld qword ptr ds:[01212078h]
000000c5 fsqrt
000000c7 fstp qword ptr [ebp-20h]
//
// stopWatch->Start() call elided...
//
n += cPlusPlusCliObject.getSqrt(123.456);
000000f5 fld qword ptr [ebp-20h]
000000f8 fadd qword ptr [ebp-14h]
000000fb fstp qword ptr [ebp-14h]
for (int i = 0; i < repetitions; i++)
000000fe dec eax
000000ff jne 000000F5
In other words, the std::sqrt() call got hoisted out of the loop and the inner loop simply adds the precomputed value. No method call. Also note how it didn't actually measure the time needed for the sqrt() call :)
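If the goal really is to time the call on every iteration, one common trick (my suggestion, not from this answer) is to make the argument depend on the running total so the value cannot be hoisted out of the loop:

for (int i = 0; i < repetitions; i++)
{
    // The argument now depends on n, so each iteration must actually recompute
    // the square root instead of reusing a hoisted value.
    n += cPlusPlusCliObject.getSqrt(123.456 + n * 1e-12);
}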
The loop with the C# method call was a bit slower, taking ~180 msec:
000000ea fld qword ptr ds:[01211EC0h]
000000f0 fsqrt
000000f2 fadd qword ptr [ebp-14h]
000000f5 fstp qword ptr [ebp-14h]
for (int i = 0; i < repetitions; i++)
000000f8 dec eax
000000f9 jne 000000EA
Just the inlined method call to Math::Sqrt(); it didn't get hoisted. I'm not actually sure why; the optimizations performed by the jitter optimizer do have a time budget.
And I won't post the code for the interop call. But yes, it takes ~380 msec due to the need to actually make a function call (unmanaged code cannot be inlined), plus the thunk that's required to prevent the garbage collector from blundering into the unmanaged stack frame. The thunk is pretty fast, it takes a handful of nanoseconds, but it just can't compete with the jitter optimizer directly inlining the fadd or fsqrt.
Benchmarking this class:
struct Sieve {
std::vector<bool> isPrime;
Sieve (int n = 1) {
isPrime.assign (n+1, true);
isPrime[0] = isPrime[1] = false;
for (int i = 2; i <= (int)sqrt((double)n); ++i)
if (isPrime[i])
for (int j = i*i; j <= n; j += i)
isPrime[j] = false;
}
};
I'm getting over 3 times worse performance (CPU time) with the 64-bit binary vs. the 32-bit version (release build) when calling the constructor with a large number, e.g.
Sieve s(100000000);
I tested sizeof(bool) and it is 1 for both versions.
When I substitute vector<bool> with vector<char> performance becomes the same for 64-bit and 32-bit versions. Why is that?
Here are the run times for Sieve s(100000000) (release mode; 32-bit first, 64-bit second):

                 32-bit    64-bit
vector<bool>      0.97s     3.12s
vector<char>      0.99s     0.99s
vector<int>       1.57s     1.59s
I also did a sanity test with VS2010 (prompted by Wouter Huysentruit's response), which produced 0.98s 0.88s. So there is something wrong with VS2012 implementation.
I submitted a bug report to Microsoft Connect
EDIT
Many answers below comment on the deficiencies of using int for indexing. This may be true, but even the Great Wizard himself uses a standard for (int i = 0; i < v.size(); ++i) in his books, so such a pattern should not incur a significant performance penalty. Additionally, this issue was raised at the Going Native 2013 conference, and the presiding group of C++ gurus commented that their early recommendation of using size_t for indexing, and as the return type of size(), was a mistake. They said: "we are sorry, we were young..."
The title of this question could be rephrased to: Over 3 times performance drop on this code when upgrading from VS2010 to VS2012.
EDIT
I made a crude attempt at finding the memory alignment of the indexes i and j and discovered that this instrumented version:
struct Sieve {
vector<bool> isPrime;
Sieve (int n = 1) {
isPrime.assign (n+1, true);
isPrime[0] = isPrime[1] = false;
for (int i = 2; i <= sqrt((double)n); ++i) {
if (i == 17) cout << ((int)&i)%16 << endl;
if (isPrime[i])
for (int j = i*i; j <= n; j += i) {
if (j == 4) cout << ((int)&j)%16 << endl;
isPrime[j] = false;
}
}
}
};
auto-magically runs fast now (only 10% slower than the 32-bit version). This, together with the VS2010 performance, makes it hard to accept the theory that the optimizer has inherent problems dealing with int indexes instead of size_t.
std::vector<bool> is not directly at fault here. The performance difference is ultimately caused by your use of the signed 32-bit int type in your loops and some rather poor register allocation by the compiler. Consider, for example, your innermost loop:
for (int j = i*i; j <= n; j += i)
isPrime[j] = false;
Here, j is a 32-bit signed integer. When it is used in isPrime[j], however, it must be promoted (and sign-extended) to a 64-bit integer, in order to perform the subscript computation. The compiler can't just treat j as a 64-bit value, because that would change the behavior of the loop (e.g. if n is negative). The compiler also can't perform the index computation using the 32-bit quantity j, because that would change the behavior of that expression (e.g. if j is negative).
So, the compiler needs to generate code for the loop using a 32-bit j then it must generate code to convert that j to a 64-bit integer for the subscript computation. It has to do the same thing for the outer loop with i. Unfortunately, it looks like the compiler allocates registers rather poorly in this case(*)--it starts spilling temporaries to the stack, causing the performance hit you see.
If you change your program to use size_t everywhere (which is 32-bit on x86 and 64-bit on x64), you will observe that the performance is on par with x86, because the generated code needs only work with values of one size:
Sieve (size_t n = 1)
{
isPrime.assign (n+1, true);
isPrime[0] = isPrime[1] = false;
for (size_t i = 2; i <= static_cast<size_t>(sqrt((double)n)); ++i)
if (isPrime[i])
for (size_t j = i*i; j <= n; j += i)
isPrime[j] = false;
}
You should do this anyway, because mixing signed and unsigned types, especially when those types are of different widths, is perilous and can lead to unexpected errors.
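A classic illustration of that peril (my example, not from the question): comparing a signed index against an unsigned size converts the signed side to unsigned, so a bound that looks obviously safe can underflow.

#include <vector>
#include <cstdio>

int main()
{
    std::vector<bool> v;                       // empty: v.size() == 0

    // BUG: v.size() is unsigned, so v.size() - 1 wraps around to SIZE_MAX, and
    // the signed i is converted to unsigned for the comparison; the loop that
    // "obviously" runs zero times instead runs, and indexes out of bounds,
    // until it crashes.
    //
    // for (int i = 0; i < v.size() - 1; ++i) v[i] = true;

    // One fix: stay in size_t and avoid the underflow.
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        v[i] = true;

    std::printf("done\n");
}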
Note that using std::vector<char> also "solves" the problem, but for a different reason: the subscript computation required for accessing an element of a std::vector<char> is substantially simpler than that for accessing an element of std::vector<bool>. The optimizer is able to generate better code for the simpler computations.
(*) I don't work on code generation, and I'm hardly an expert in either assembly or low-level performance optimization, but from looking at the generated code, and given that it is reported here that Visual C++ 2010 generates better code, I'd guess that there are opportunities for improvement in the compiler. I'll make sure the Connect bug you opened gets forwarded on to the compiler team so they can take a look.
I've tested this with vector<bool> in VS2010: 32-bit needs 1452ms while 64-bit needs 1264ms to complete on an i3.
The same test in VS2012 (on i7 this time) needs 700ms (32-bit) and 2730ms (64-bit), so there is something wrong with the compiler in VS2012. Maybe you can report this test case as a bug to Microsoft.
UPDATE
The problem is that the VS2012 compiler uses a temporary stack variable for part of the code in the inner for-loop when int is used as the iterator. The assembly parts listed below are part of the code inside <vector>, in the += operator of std::vector<bool>::iterator.
size_t as iterator
When using size_t as iterator, a part of the code looks like this:
or rax, -1
sub rax, rdx
shr rax, 5
lea rax, QWORD PTR [rax*4+4]
sub r8, rax
Here, all instructions use CPU registers which are very fast.
int as iterator
When using int as iterator, that same part looks like this:
or rcx, -1
sub rcx, r8
shr rcx, 5
shl rcx, 2
mov rax, -4
sub rax, rcx
mov rdx, QWORD PTR _Tmp$6[rsp]
add rdx, rax
Here you see the _Tmp$6 stack variable being used, which causes the slowdown.
Point the compiler in the right direction
The funny part is that you can point the compiler in the right direction by using the vector<bool>::iterator directly.
struct Sieve {
std::vector<bool> isPrime;
Sieve (int n = 1) {
isPrime.assign(n + 1, true);
std::vector<bool>::iterator it1 = isPrime.begin();
std::vector<bool>::iterator end = it1 + n;
*it1++ = false;
*it1++ = false;
for (int i = 2; i <= (int)sqrt((double)n); ++it1, ++i)
if (*it1)
for (std::vector<bool>::iterator it2 = isPrime.begin() + i * i; it2 <= end; it2 += i)
*it2 = false;
}
};
vector<bool> is a very special container that's specialized to use 1 bit per item rather than providing normal container semantics. I suspect that the bit manipulation logic is much more expensive when compiling for 64 bits (either it still uses 32-bit chunks to hold the bits, or there is some other reason). vector<char> behaves just like a normal vector, so there's no special logic.
You could also use deque<bool> which doesn't have the specialization.
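To make the "bit manipulation logic" concrete, this is roughly the extra work each element access implies with the packed representation versus a plain byte vector (a simplified sketch, not the actual library code, which additionally goes through a proxy reference type):

#include <cstdint>
#include <cstddef>

// vector<char>-style access: one address computation, one byte store.
inline void set_byte(unsigned char * data, std::size_t i, bool value)
{
    data[i] = value;
}

// vector<bool>-style access: find the word, build a mask, read-modify-write.
inline void set_bit(std::uint32_t * words, std::size_t i, bool value)
{
    std::uint32_t mask = 1u << (i % 32);
    if (value)
        words[i / 32] |= mask;
    else
        words[i / 32] &= ~mask;
}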
The code below (reduced from my larger code, after my astonishment at how its speed paled in comparison with that of std::vector) has two peculiar features:
It runs more than three times faster when I make a very tiny modification to the source code (always compiling it with /O2 with Visual C++ 2010).
Note: To make this a little more fun, I put a hint for the modification at the end, so you can spend some time figuring out the change yourself. The original code was ~500 lines, so it took me a heck of a lot longer to pin it down, since the fix looks pretty irrelevant to the performance.
It runs about 20% faster with /MTd than with /MT, even though the output loop looks the same!!!
The difference in the assembly code for the tiny-modification case is:
Loop without the modification (~300 ms):
00403383 mov esi,dword ptr [esp+10h]
00403387 mov edx,dword ptr [esp+0Ch]
0040338B mov dword ptr [edx+esi*4],eax
0040338E add dword ptr [esp+10h],ecx
00403392 add eax,ecx
00403394 cmp eax,4000000h
00403399 jl main+43h (403383h)
Loop with /MTd (looks identical! but ~270 ms):
00407D73 mov esi,dword ptr [esp+10h]
00407D77 mov edx,dword ptr [esp+0Ch]
00407D7B mov dword ptr [edx+esi*4],eax
00407D7E add dword ptr [esp+10h],ecx
00407D82 add eax,ecx
00407D84 cmp eax,4000000h
00407D89 jl main+43h (407D73h)
Loop with the modification (~100 ms!!):
00403361 mov dword ptr [esi+eax*4],eax
00403364 inc eax
00403365 cmp eax,4000000h
0040336A jl main+21h (403361h)
Now my question is, why should the changes above have the effects they do? It's completely bizarre!
Especially the first one -- it shouldn't affect anything at all (once you see the difference in the code), and yet it lowers the speed dramatically.
Is there an explanation for this?
#include <cstdio>
#include <ctime>
#include <algorithm>
#include <memory>
template<class T, class Allocator = std::allocator<T> >
struct vector : Allocator
{
T *p;
size_t n;
struct scoped
{
T *p_;
size_t n_;
Allocator &a_;
~scoped() { if (p_) { a_.deallocate(p_, n_); } }
scoped(Allocator &a, size_t n) : a_(a), n_(n), p_(a.allocate(n, 0)) { }
void swap(T *&p, size_t &n)
{
std::swap(p_, p);
std::swap(n_, n);
}
};
vector(size_t n) : n(0), p(0) { scoped(*this, n).swap(p, n); }
void push_back(T const &value) { p[n++] = value; }
};
int main()
{
int const COUNT = 1 << 26;
vector<int> vect(COUNT);
clock_t start = clock();
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
printf("time: %d\n", (clock() - start) * 1000 / CLOCKS_PER_SEC);
}
Hint (if you want to figure out the change yourself, stop reading here):
It has to do with the allocator.
Answer:
Change Allocator &a_ to Allocator a_.
For what it's worth, my speculation for the difference between /MT and /MTd is that the /MTd heap allocation will paint the heap memory for debugging purposes making it more likely to be paged in - that occurs before you start the clock.
If you 'pre-heat' the vector allocation, you get the same numbers for /MT and /MTd:
vector<int> vect(COUNT);
// make sure vect's memory is warmed up
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
vect.n = 0; // clear the vector
clock_t start = clock();
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
printf("time: %d\n", (clock() - start) * 1000 / CLOCKS_PER_SEC);
It's strange that Allocator& will break the alias chain while Allocator will not.
You can try
for(int i=vect.n; i<COUNT;++i){
...
}
to enforce that i and n stay synchronized.
This will make it much easier for VC to optimize.
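Dropped into the question's main(), that suggestion would look like this (the elided body is just the original push_back; starting i from vect.n is what lets the compiler see that the two values stay equal):

clock_t start = clock();
for (int i = (int)vect.n; i < COUNT; ++i)
{
    vect.push_back(i);   // same body as before; only the loop start changed
}
printf("time: %d\n", (clock() - start) * 1000 / CLOCKS_PER_SEC);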
emm... It seems that the "fastest" code
00403361 mov dword ptr [esi+eax*4],eax
00403364 inc eax
00403365 cmp eax,4000000h
0040336A jl main+21h (403361h)
is somewhat over-optimized. In this loop, vect.n is ignored entirely...
If an exception happened in the loop, vect.n would not be updated correctly.
So the answer might be: when you use Allocator, VC figures out that vect.n will never be used again,
so it can be ignored. It's amazing, but in general this is not very useful, and it's dangerous.
What's the reason behind applying two explicit type casts as below?
if (unlikely(val != (long)(char)val)) {
Code taken from lxml.etree.c source file from lxml's source package.
That's a cheap way to check whether there's any junk in the high bits. The char cast chops off the upper 8, 24 or 56 bits (depending on sizeof(val)) and then promotes it back. If char is signed, it will sign-extend as well.
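A quick worked example of that round trip (assuming a signed 8-bit char and a wider long):

#include <cstdio>

int main()
{
    long a = 0x1234;   // junk above the low byte
    long b = 0x34;     // fits in a byte
    long c = 0x80;     // fits in a byte, but the sign bit is set

    // (char) keeps only the low 8 bits (typically), and (long) widens it back,
    // sign-extending when char is signed.
    std::printf("%d\n", a != (long)(char)a);   // 1: 0x34 != 0x1234
    std::printf("%d\n", b != (long)(char)b);   // 0: survives the round trip
    std::printf("%d\n", c != (long)(char)c);   // 1 with signed char: -128 != 128
}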
A better test might be:
if (unlikely(val & ~0xff)) {
or
if (unlikely(val & ~0x7f)) {
depending on whether this test cares about bit 7.
Just for grins and completeness, I wrote the following test code:
void RegularTest(long val)
{
if (val != ((int)(char)val)) {
printf("Regular = not equal.");
}
else {
printf("Regular = equal.");
}
}
void MaskTest(long val)
{
if (val & ~0xff) {
printf("Mask = not equal.");
}
else {
printf("Mask = equal.");
}
}
And here's what the cast code turns into in a debug build in Visual Studio 2010:
movsx eax, BYTE PTR _val$[ebp]
cmp DWORD PTR _val$[ebp], eax
je SHORT $LN2#RegularTes
this is the mask code:
mov eax, DWORD PTR _val$[ebp]
and eax, -256 ; ffffff00H
je SHORT $LN2#MaskTest
In release, I get this for the cast code:
movsx ecx, al
cmp eax, ecx
je SHORT $LN2#RegularTes
In release, I get this for the mask code:
test DWORD PTR _val$[ebp], -256 ; ffffff00H
je SHORT $LN2#MaskTest
So what's going on? In the cast case it's doing a byte mov with sign extension (ha! bug - the code is not the same because chars are signed) and then a compare, and to be totally sneaky, the compiler/linker has also made this function use register passing for the argument. In the mask code in release, it has folded everything up into a single test instruction.
Which is faster? Beats me - and frankly unless you're running this kind of test on a VERY slow CPU or are running it several billion times, it won't matter. Not in the least.
So the answer in this case, is to write code that is clear about its intent. I would expect a C/C++ jockey to look at the mask code and understand its intent, but if you don't like that, you should opt for something like this instead:
#define BitsAbove8AreSet(x) ((x) & ~0xff)
#define BitsAbove7AreSet(x) ((x) & ~0x7f)
or:
inline bool BitsAbove8AreSet(long t) { return (t & ~0xff) != 0; } // make it a bool to be nice
inline bool BitsAbove7AreSet(long t) { return (t & ~0x7f) != 0; }
And use the predicates instead of the actual code.
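Used at the original call site (unlikely being the macro from the quoted lxml source), the intent-revealing version reads:

if (unlikely(BitsAbove8AreSet(val))) {
    /* reject or specially handle values with junk in the high bits */
}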
In general, I think "is it cheap?" is not a particularly good question to ask about this unless you're working in some very specific problem domains. For example, I work in image processing and when I have some kind of operation going from one image to another, I often have code that looks like this:
BYTE *srcPixel = PixelOffset(src, x, y, srcrowstride, srcdepth);
int srcAdvance = PixelAdvance(srcrowstride, right, srcdepth);
BYTE *dstPixel = PixelOffset(dst, x, y, dstrowstride, dstdepth);
int dstAdvance = PixelAdvance(dstrowstride, right, dstdepth);
for (y = top; y < bottom; y++) {
for (x=left; x < right; x++) {
ProcessOnePixel(srcPixel, srcdepth, dstPixel, dstdepth);
srcPixel += srcdepth;
dstPixel += dstdepth;
}
srcPixel += srcAdvance;
dstPixel += dstAdvance;
}
And in this case, assume that ProcessOnePixel() is actually a chunk of inline code that will be executed billions and billions of times. In this case, I care a whole lot about not doing function calls, not doing redundant work, not rechecking values, ensuring that the computational flow will translate into something that will use registers wisely, etc. But my actual primary concern is that the code can be read by the next poor schmuck (probably me) who has to look at it.
And in our current coding world, it is FAR FAR CHEAPER for nearly every problem domain to spend a little time up front ensuring that your code is easy to read and maintain than it is to worry about performance out of the gate.
Speculations:
cast to char: to mask the 8 low bits,
cast to long: to bring the value back to signed (if char is unsigned).
If val is a long then the (char) will strip off all but the bottom 8 bits. The (long) casts it back for the comparison.