First off, I know very little about multithreading and I am having trouble finding the best way to optimize this code, but multithreading seems to be the path I should be on.
double
applyFilter(struct Filter *filter, cs1300bmp *input, cs1300bmp *output)
{
  long long cycStart, cycStop;
  cycStart = rdtscll();

  output->width = input->width;
  output->height = input->height;
  int temp1 = output->width;
  int temp2 = output->height;
  int width = temp1 - 1;
  int height = temp2 - 1;
  int getDivisorVar = filter->getDivisor();

  int t0, t1, t2, t3, t4, t5, t6, t7, t8, t9;
  int keep0 = filter->get(0,0);
  int keep1 = filter->get(1,0);
  int keep2 = filter->get(2,0);
  int keep3 = filter->get(0,1);
  int keep4 = filter->get(1,1);
  int keep5 = filter->get(2,1);
  int keep6 = filter->get(0,2);
  int keep7 = filter->get(1,2);
  int keep8 = filter->get(2,2);

  // Declare variables before the loop
  int plane, row, col;
  for (plane = 0; plane < 3; plane++) {
    for (row = 1; row < height; row++) {
      for (col = 1; col < width; col++) {
        t0 = (input->color[plane][row - 1][col - 1]) * keep0;
        t1 = (input->color[plane][row][col - 1]) * keep1;
        t2 = (input->color[plane][row + 1][col - 1]) * keep2;
        t3 = (input->color[plane][row - 1][col]) * keep3;
        t4 = (input->color[plane][row][col]) * keep4;
        t5 = (input->color[plane][row + 1][col]) * keep5;
        t6 = (input->color[plane][row - 1][col + 1]) * keep6;
        t7 = (input->color[plane][row][col + 1]) * keep7;
        t8 = (input->color[plane][row + 1][col + 1]) * keep8;
        // NEW LINE HERE
        t9 = t0 + t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8;
        t9 = t9 / getDivisorVar;
        if ( t9 < 0 ) {
          t9 = 0;
        }
        if ( t9 > 255 ) {
          t9 = 255;
        }
        output->color[plane][row][col] = t9;
      } ....
All of this code most likely isn't necessary, but it provides some context. Because the first of the three for loops only goes from 0 to 2, I was hoping there was a way to thread the bottom two for loops so that they all run at the same time, each for a different plane value. Is this even possible? And if so, would it actually make my program faster?
I would also look into OpenMP. It is a great library that enables threading in a very simple manner using pragmas. OpenMP is supported on many platforms; you just have to make sure your compiler supports it!
I had a set of code with 8 levels of for loops, and it threaded very nicely.
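As a rough sketch of how that could look for the filter in the question (this reuses the question's variables height, width, getDivisorVar, keep0..keep8, input and output, and is not tested against that build; compile with the compiler's OpenMP flag, e.g. -fopenmp for GCC): collapse(2) spreads the plane and row iterations across threads, the temporaries are declared inside the loop so each thread gets private copies, and every thread writes to distinct output pixels, so no locking is needed.
#pragma omp parallel for collapse(2)
for (int plane = 0; plane < 3; plane++) {
  for (int row = 1; row < height; row++) {
    for (int col = 1; col < width; col++) {
      int t9 = (input->color[plane][row - 1][col - 1]) * keep0
             + (input->color[plane][row    ][col - 1]) * keep1
             + (input->color[plane][row + 1][col - 1]) * keep2
             + (input->color[plane][row - 1][col    ]) * keep3
             + (input->color[plane][row    ][col    ]) * keep4
             + (input->color[plane][row + 1][col    ]) * keep5
             + (input->color[plane][row - 1][col + 1]) * keep6
             + (input->color[plane][row    ][col + 1]) * keep7
             + (input->color[plane][row + 1][col + 1]) * keep8;
      t9 /= getDivisorVar;
      if (t9 < 0)   t9 = 0;
      if (t9 > 255) t9 = 255;
      output->color[plane][row][col] = t9;
    }
  }
}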
Yes, it's perfectly possible. In this case, you should even get away without worrying about access synchronisation (i.e. race conditions), as the threads would be operating on different sets of data.
This would definitely speed up your code on a multicore machine.
You might want to have a look at std::thread (if you're OK with C++11) for a cross-platform threading implementation (since you haven't specified your target platform), or more generally at the thread support library.
You may also think about detecting the number of cores and launching an appropriate number of threads, as in threadcount = min(planes, cores), giving each worker function access to a single plane's set of data.
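A minimal sketch of the one-thread-per-plane idea (filterPlane is a hypothetical helper holding the two inner loops from the question, restricted to one plane; std::thread::hardware_concurrency() can be used to detect the core count, though with only 3 planes simply launching three workers is usually fine):
#include <thread>
#include <vector>

// Hypothetical helper: the row/col loops from applyFilter, for a single plane.
void filterPlane(struct Filter *filter, cs1300bmp *input, cs1300bmp *output, int plane);

void applyFilterThreaded(struct Filter *filter, cs1300bmp *input, cs1300bmp *output)
{
  std::vector<std::thread> workers;
  for (int plane = 0; plane < 3; ++plane)                  // one worker per plane
    workers.emplace_back(filterPlane, filter, input, output, plane);
  for (std::thread &w : workers)
    w.join();                                              // wait for every plane to finish
}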
It certainly looks like you could break this into threads and you would probably see a good speed increase. However, your compiler will already be trying to unroll the loop for you and gain parallelism by vectorizing instructions. Your gains may not be as large as you expect, especially if you are saturating the memory bus with reads from disparate locations.
What you might also consider, if this is a 2D graphics operation, is using OpenGL or similar, as it will leverage hardware on your system that has parallelism built in.
A threaded version of the code will be slower than the simple implementation, because much time will be spent on synchronisation, and the threaded version will also suffer a cache-performance penalty.
It is also highly probable that the outer for loop with its 3 passes will be unrolled by the compiler and executed in parallel.
You can try making a threaded version and comparing performance. In any case it will be useful experience.
For a situation like this you could do worse than use a compiler that automatically turns for loops into threads.
With code like that the compiler can determine whether or not there is any inter-iteration data dependency. If not then it knows that it can safely split the for loop(s) across multiple threads, putting bog standard thread syncing at the end. Normally such a compiler is able to insert code that determines at run time whether the overhead of having the threads is going to be outweighed by the benefits.
The only thing is, do you have a compiler that does it? If so, then it's by far the easiest way to get the benefits of threads for straightforward, almost overt parallelism like this.
I know that Sun's C compiler does it (I think they were one of the earliest to do this; it might be only the Solaris version of their compiler). I think that Intel's compiler can too. I have doubts about GCC (though I'd be very happy to be corrected on that point), and I'm not too sure about Microsoft's compiler.
I'm new to OpenMP, but am trying to use it to accelerate some operations on entries of a 2D array with a large number of rows and a small number of columns. At the same time, I am using a reduction to calculate the sum of all the array values in each column. The code looks something like this (I will explain the weird bits in a moment):
template <unsigned int NColumns>
void Function(int n_rows, double** X, double* Y)
{
    #pragma omp parallel for reduction(+:Y[:NColumns])
    for (int r = 0; r < n_rows; ++r)
    {
        for (int c = 0; c < NColumns; ++c)
        {
            X[r][c] = some_complicated_stuff(X[r], X[r][c]);
            Y[c] += X[r][c];
        }
    }
}
To clarify, X is a n_rows x NColumns-sized 2D array allocated on the heap, and Y is a NColumns-sized 1D array. some_complicated_stuff isn't actually implemented as a separate function, but what I do to X[r][c] in that line only depends on X[r][c] and other values in the 1D array X[r].
The reason that NColumns is passed in as a template parameter rather than as a regular argument (like n_rows) is that when NColumns is known at compile-time, the compiler can more aggressively optimise the inner loop in the above function. I know that NColumns is going to be one of a small number of values when the program runs, so later on I have something like this code:
cin >> n_cols;
double** X;
double Y[n_cols];
// initialise X and Y, etc. . .
for (int i = 0; i < n_iterations; ++i)
{
    switch (n_cols)
    {
        case 2:  Function< 2>(n_rows, X, Y); break;
        case 10: Function<10>(n_rows, X, Y); break;
        case 60: Function<60>(n_rows, X, Y); break;
        default: throw "Unsupported n_cols."; break;
    }
    // . . .
    Report(i, Y); // see GDB output below
}
Through testing, I have found that having this NColumns "argument" to Function as a template parameter rather than a normal function parameter makes for an appreciable performance increase. However, I have also found that, once in a blue moon (say, about every 10^7 calls to Function), the program hangs, and even worse, its behaviour sometimes changes from one run of the program to the next. This happens rarely enough that I have had a lot of trouble isolating the bug, but I'm now wondering whether it's because I am using this NColumns template parameter in my OpenMP reduction.
I note that a similar StackOverflow question asks about using template types in reductions, which apparently causes unspecified behaviour - the OpenMP 3.0 spec says
If a variable referenced in a data-sharing attribute clause has a type
derived from a template, and there are no other references to that
variable in the program, then any behavior related to that variable is
unspecified.
In this case, it's not a template type per se that is being used, but I'm sort of in the same ballpark. Have I messed up here, or is the bug more likely to be in some other part of the code?
I am using GCC 6.3.0.
If it is more helpful, here's the real code from inside Function. X is actually a flattened 2D array; ww and min_x are defined elsewhere:
#pragma omp parallel for reduction(+:Y[:NColumns])
for (int i = 0; i < NColumns * n_rows; i += NColumns)
{
    double total = 0;
    for (int c = 0; c < NColumns; ++c)
        if (X[i + c] > 0)
            total += X[i + c] *= ww[c];
    if (total > 0)
        for (int c = 0; c < NColumns; ++c)
            if (X[i + c] > 0)
                Y[c] += X[i + c] = (X[i + c] < min_x * total ? 0 : X[i + c] / total);
}
Just to thicken the plot a bit, I attached gdb to a running process of the program that had hung, and here's what the backtrace shows me:
#0 0x00007fff8f62a136 in __psynch_cvwait () from /usr/lib/system/libsystem_kernel.dylib
#1 0x00007fff8e65b560 in _pthread_cond_wait () from /usr/lib/system/libsystem_pthread.dylib
#2 0x000000010a4caafb in omp_get_num_procs () from /opt/local/lib/libgcc/libgomp.1.dylib
#3 0x000000010a4cad05 in omp_get_num_procs () from /opt/local/lib/libgcc/libgomp.1.dylib
#4 0x000000010a4ca2a7 in omp_in_final () from /opt/local/lib/libgcc/libgomp.1.dylib
#5 0x000000010a31b4e9 in Report(int, double*) ()
#6 0x3030303030323100 in ?? ()
[snipped traces 7-129, which are all ?? ()]
#130 0x0000000000000000 in ?? ()
Report() is a function that gets called inside the program's main loop but not within Function() (I've added it to the middle code snippet above), and Report() does not contain any OpenMP pragmas. Does this illuminate what's happening at all?
Note that the executable changed between when the process started running and when I attached GDB to it, which required referring to the new (changed) executable. So that could mean that the symbol table is messed up.
I have managed to partly work this out.
One of the problems was with the program behaving nondeterministically. This is just because (1) OpenMP performs reductions in thread-completion order, which is non-deterministic, and (2) floating-point addition is non-associative. I assumed that the reductions would be performed in thread-number order, but this is not the case. So any OpenMP for construct that reduces using floating-point operations will be potentially non-deterministic even if the number of threads is the same from one run to the next, so long as the number of threads is greater than 2. Some relevant StackOverflow questions on this matter are here and here.
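For what it's worth, one way to get reproducible column sums (assuming the thread count itself is fixed between runs; this is only a sketch, not the code from the question) is to keep a per-thread buffer and combine the buffers in thread-number order afterwards:
#include <omp.h>
#include <vector>

template <unsigned int NColumns>
void DeterministicSums(int n_rows, double** X, double* Y)
{
    const int n_threads = omp_get_max_threads();
    std::vector<double> partial(n_threads * NColumns, 0.0);

    #pragma omp parallel
    {
        double* mine = &partial[omp_get_thread_num() * NColumns];
        #pragma omp for schedule(static)            // static schedule => fixed row partition
        for (int r = 0; r < n_rows; ++r)
            for (unsigned int c = 0; c < NColumns; ++c)
                mine[c] += X[r][c];
    }

    for (unsigned int c = 0; c < NColumns; ++c)
        Y[c] = 0.0;
    for (int t = 0; t < n_threads; ++t)             // combine in thread-number order,
        for (unsigned int c = 0; c < NColumns; ++c) // so the rounding order is fixed
            Y[c] += partial[t * NColumns + c];
}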
The other problem was that the program occasionally hangs. I have not been able to resolve this issue. Running gdb on the hung process always yields __psynch_cvwait () at the top of the stack trace. It hangs around every 10^8 executions of the parallelised for loop.
Hope this helps a little.
I have a portion of code that is called an enormous amount of times. How can I speed it up?
#define SUM(p0, p1, p2, p3, offset) \
    ((p0)[offset] - (p1)[offset] - (p2)[offset] + (p3)[offset])
inline int Calc::compute( int offset ) const
{
int b = SUM( p[5], p[6], p[9], p[10], offset );
int a1 = SUM(...);
int a2 = SUM(...);
....
return (uchar)(((a1 >= b) << 7) |
((a2 >= b) << 6) |
((a3 >= b) << 5) |
((a4 >= b) << 4) |
((a5 >= b) << 3) |
((a6 >= b) << 2) |
((a7 >= b) << 1) |
(a8 >= b));
}
Thank you.
I see 3 possible opportunities here:
Pack your p0, p1, p2 and p3 data contiguously in memory (basically, in a single array) in order to prevent cache misses:
#define SUM(array, offset) \
    (array[offset] - array[offset + 1] - array[offset + 2] + array[offset + 3])
....
//make sure pArray contains all the p0, p1, ... values.
int a1 = SUM(pArray, offset);
Replace the bit-shift expressions with an if structure that ORs together constant literals: if (a1 >= b) and so on. The values are going to be constant every time anyway:
uint8_t bitmask = 0;
if(a1 >= b)
bitmask |= 0x80;
if(a2 >= b)
bitmask |= 0x40;
...
Try to make sure that SIMD instructions are being used. This involves dumping the assembly and checking whether those kinds of instructions are being generated.
EDIT: in reaction to the comment below:
In order to prevent cache misses, everything revolves around accessing your data in a predictable manner. A problem with your original code is that you resolve the offset with your p-variables.
So you have something like:
int b = SUM( p[5], p[6], p[9], p[10], _offset );
int a1 = SUM( p[0], p[1], p[4], p[5], _offset );
int a2 = SUM( p[1], p[2], p[5], p[6], _offset );
You could create an array which contains these values in the order you use them.
So I'd try to create an array that looks like this at a certain offset:
p[5], p[6], p[9], p[10], p[0], p[1], p[4], p[5], p[1], p[2], p[5], p[6]
And now you can define your sum function like this:
#define SUM(array, offset, calculationOffset) \
    (array[offset + calculationOffset * 4] - array[offset + calculationOffset * 4 + 1] \
     - array[offset + calculationOffset * 4 + 2] + array[offset + calculationOffset * 4 + 3])
Your calls can be transformed into this:
int b = SUM(pArray, offset, 0);
int a1 = SUM(pArray, offset, 1);
int a2 = SUM(pArray, offset, 2);
There's only one problem with this: if you have to create the array and copy all the data every function call, this might remove any benefit of what we did, but you may be able to construct this kind of array before using this function and pass it as an argument.
There's still the possibility to change/improve the algorithm itself. We cannot help you with this unless the algorithm you use is known (I mean the algorithm which uses the code you have shown).
If there are any known properties of the variables used and their values, this could give room for improvement.
I don't know whether the cast to uchar impacts performance. Why do you need it? I think it should be changed, since you are adding ints and returning an int. But you'd have to measure performance to see whether, for example, using & 0xff gives better results.
Also, you should try the difference it makes when you replace the bit-OR operators ("|") with a simple addition (which works the same in this case because each summand has a single unique bit set to 1).
Then try the effect of altering ((a1 >= b) << 7) to ((a1 >= b) ? 128 : 0), etc.
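For example, combining both suggestions, the return statement might look like this (a sketch only; whether it is faster than the original has to be measured):
return (uchar)(((a1 >= b) ? 128 : 0) + ((a2 >= b) ? 64 : 0) +
               ((a3 >= b) ?  32 : 0) + ((a4 >= b) ? 16 : 0) +
               ((a5 >= b) ?   8 : 0) + ((a6 >= b) ?  4 : 0) +
               ((a7 >= b) ?   2 : 0) + ((a8 >= b) ?  1 : 0));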
All of the above might or might not affect performance, and you have to measure the effect with the compiler you use.
But most important: if your problem with analyzing all these images is the total amount of time, you should look into processing different images at the same time (if you are on a multiprocessor machine with sufficient RAM). You have several options:
1. Parallelizing the code that processes a single image (using something like OpenMP).
2. Using one thread per image (IMHO much easier, and I'd expect better overall throughput than 1); see the sketch after this list.
3. Moving the concurrency to where you start your program (i.e. a script).
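A sketch of option 2 (one worker thread per image, using C++11 std::thread; processImage is a hypothetical stand-in for your per-image code, and for a very large batch you would cap the number of threads running at once):
#include <string>
#include <thread>
#include <vector>

void processImage(const std::string &path);   // hypothetical per-image routine

void processAllImages(const std::vector<std::string> &paths)
{
    std::vector<std::thread> workers;
    for (const std::string &path : paths)
        workers.emplace_back(processImage, path);   // one thread per image
    for (std::thread &w : workers)
        w.join();                                   // wait for all images
}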
I have the following piece of simple code that is potentially going to be executed many hundreds of millions of times:
for (int i = 0; i < 8; i++)
if (((p[i].X >= x) && (p[i].X <= x + d))
&&((p[i].Y >= y) && (p[i].Y <= y + d))
&&((p[i].Z >= z) && (p[i].Z <= z + d)))
return 1;
Will the optimizer in the Visual C++ 2010 compiler unroll this loop for me, or am I better off doing it manually? I've looked at other similar questions but don't see any specific results.
The real question is, what do you gain from unrolling ?
Unrolling shaves off one branch (if i >= 8 stop) for every "unroll".
Your loop body already contains 6 branches (the if plus the five short-circuiting && operators), so is there much to gain by unrolling it?
It might be interesting to see how the code is optimized, but I am quite unsure whether unrolling should be your primary focus; I'd be more worried about how the complex condition is handled!
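For example, one way to cut the condition down to a single branch per iteration (a sketch only; measure before and after) is to evaluate all six comparisons with the non-short-circuiting & operator:
for (int i = 0; i < 8; i++)
{
    // All six comparisons are evaluated unconditionally; only one branch remains.
    const bool inside = (p[i].X >= x) & (p[i].X <= x + d)
                      & (p[i].Y >= y) & (p[i].Y <= y + d)
                      & (p[i].Z >= z) & (p[i].Z <= z + d);
    if (inside)
        return 1;
}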
I have an expression
x += y;
and, based on a boolean, I want to be able to change it to
x -= y;
Of course I could do
if(i){x+=y;} else{x-=y;}
//or
x+=(y*sign); //where sign is either 1 or -1
But if I have to do this iteratively, I want to avoid the extra computation. Is there a more efficient way? Is it possible to modulate the operator?
if (i) {x += y;} else {x -= y;}
is probably going to be as efficient as anything else you can do. y * sign is likely to be fairly expensive (unless the compiler can figure out that sign is guaranteed to be 1 or -1).
The most efficient way to do this iteratively is to precompute the data you need.
So, precomputation:
const YourNumberType increment = (i? y : -y);
Then in your loop:
x += increment;
EDIT: regarding the question in the comments about how to generate such code:
#include <stdio.h>
void display( int x ) { printf( "%d\n", x ); }
template< bool isSomething >
inline void advance( int& x, int y );
template<> inline void advance<true>( int& x, int y ) { x += y; }
template<> inline void advance<false>( int& x, int y ) { x -= y; }
template< bool isSomething >
void myFunc()
{
int x = 314;
int y = 271;
for( ;; )
{
advance< isSomething >( x, y ); // The nano-optimization.
display( x );
if( !( -10000 < x && x < 10000 ) ) { return; }
}
}
int main( int n, char*[] )
{
n > 1? myFunc<true>() : myFunc<false>();
}
E.g. with Visual C++ 10.0 this generates two versions of myFunc, one with an add instruction and the other with a sub instruction.
Cheers & hth.,
On a modern pipelined machine you want to avoid branching if at all possible in those cases where performance really does count. When the front of the pipeline hits a branch, the CPU guesses which branch to take and lets the pipeline work ahead based on that guess. Everything is fine if the guess was right. Everything is not so fine if the guess was wrong, particularly so if you're still using one of Intel's processors such as a Pentium 4 that suffered from pipeline bloat. Intel discovered that too much pipelining is not a good thing.
More modern processors still do use pipelining (the Core line has a pipeline length of 14 or so), so avoiding branching still remains one of those good things to do -- when it counts, that is. Don't make your code an ugly, prematurely optimized mess when it doesn't count.
The best thing to do is to first find out where your performance demons lie. It is not at all uncommon for a tiny fraction of one percent of the code base to be the cause of almost all of the CPU usage. Optimizing the 99.9% of the code that doesn't contribute to the CPU usage won't solve your performance problems but it will have a deleterious effect on maintenance.
You optimize once you have found the culprit code, and even then, maybe not. When performance doesn't matter, don't optimize. Performance as a metric runs counter to almost every other code quality metric out there.
So, getting off the soap box, let's suppose that little snippet of code is the performance culprit. Try both approaches and test. Try a third approach you haven't thought of yet and test. Sometimes the code that is the best performance-wise is surprisingly non-intuitive. Think Duff's device.
If i stays constant during the execution of the loop, and y doesn't, move the if outside of the loop.
So instead of...
your_loop {
y = ...;
if (i)
x += y;
else
x -= y;
}
...do the following....
if (i) {
your_loop {
y = ...;
x += y;
}
}
else {
your_loop {
y = ...;
x -= y;
}
}
BTW, a decent compiler will do that optimization for you, so you may not see the difference when actually benchmarking.
Sounds like you want to avoid branching and multiplication. Let's say the switch i is set to all 1 bits, same size as y, when you want to add, and to 0 when you want to subtract. Then:
x += (y & i) - (y & ~i)
Haven't tested it, this is just to give you the general idea. Bear in mind that this makes the code a lot harder to read in exchange for what would probably be a very small increase in efficiency.
Edit: or, as bdonlan points out in the comments, possibly even a decrease.
I put my suggestion in the comments to the test, and in a simple test bit-fiddling is faster than branching options on an Intel(R) Xeon(R) CPU L5520 @ 2.27GHz, but slower on my laptop Intel Core Duo.
If you are free to give i either the value 0 (for +) or ~0 (for -), these statements are equivalent:
// branching:
if ( i ) sum -= add; else sum += add;
sum += i?-add:add;
sum += (i?-1:1)*add;
// bit fiddling:
sum += (add^i)+(i&1);
sum += (add^i)+(!!i);
sum += (add&~i)-(add&i);
And as said, one method can beat the other by a factor of 2, depending on CPU and optimization level used.
Conclusion, as always, is that benchmarking is the only way to find out which is faster in your particular situation.
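If you want to reproduce such a comparison, a minimal harness could look like the sketch below (the alternating switch here is perfectly predictable, so feed it less predictable data for a realistic branch-misprediction test):
#include <chrono>
#include <cstdio>

int main()
{
    const unsigned n = 100000000u;
    unsigned sum1 = 0, sum2 = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (unsigned k = 0; k < n; ++k) {
        unsigned add = k & 0xff;
        unsigned i = (k & 1) ? ~0u : 0u;       // toy switch value
        if (i) sum1 -= add; else sum1 += add;  // branching version
    }
    auto t1 = std::chrono::steady_clock::now();
    for (unsigned k = 0; k < n; ++k) {
        unsigned add = k & 0xff;
        unsigned i = (k & 1) ? ~0u : 0u;
        sum2 += (add ^ i) + (i & 1);           // bit-fiddling version
    }
    auto t2 = std::chrono::steady_clock::now();

    std::printf("branching:    %lld ms (sum %u)\n",
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(), sum1);
    std::printf("bit fiddling: %lld ms (sum %u)\n",
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count(), sum2);
    return 0;
}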
My team needs the "Sobol quasi-random number generator" - a common RNG famous for good-quality results and speed of operation. I found what looks like a simple C implementation on the web. At home I was able to compile it almost instantaneously with my Linux GCC compiler.
The following day I tried it at work: If I compile in Visual Studio in debug mode it takes about 1 minute. If I were to compile it in release mode it takes about 40 minutes.
Why?
I know that "release" mode triggers some compiler optimization... but how on earth could a file this small take so long to optimize? It's mostly comments and static-data. There's hardly anything worth optimizing.
None of these PCs are particularly slow, and in any case I know that the compile time is consistent across a range of Windows computers. I've also heard that newer versions of Visual Studio have a faster compile time, however for now we are stuck with Visual Studio.Net 2003. Compiling on GCC (the one bundled with Ubuntu 8.04) always takes microseconds.
To be honest, I'm not really sure the code's that good. It's got a nasty smell to it. Namely, this function:
unsigned int i4_xor ( unsigned int i, unsigned int j )
//****************************************************************************80
//
// Purpose:
//
// I4_XOR calculates the exclusive OR of two integers.
//
// Modified:
//
// 16 February 2005
//
// Author:
//
// John Burkardt
//
// Parameters:
//
// Input, unsigned int I, J, two values whose exclusive OR is needed.
//
// Output, unsigned int I4_XOR, the exclusive OR of I and J.
//
{
unsigned int i2;
unsigned int j2;
unsigned int k;
unsigned int l;
k = 0;
l = 1;
while ( i != 0 || j != 0 )
{
i2 = i / 2;
j2 = j / 2;
if (
( ( i == 2 * i2 ) && ( j != 2 * j2 ) ) ||
( ( i != 2 * i2 ) && ( j == 2 * j2 ) ) )
{
k = k + l;
}
i = i2;
j = j2;
l = 2 * l;
}
return k;
}
There's an i8_xor too. And a couple of abs functions.
I think a post to the DailyWTF is in order.
EDIT: For the non-C programmers, here's a quick guide to what the above does:
function xor i:unsigned, j:unsigned
answer = 0
bit_position = 1
while i <> 0 or j <> 0
if least significant bit of i <> least significant bit of j
answer = answer + bit_position
end if
bit_position = bit_position * 2
i = i / 2
j = j / 2
end while
return answer
end function
To determine if the least significant bit is set or cleared, the following is used:
bit set if i <> (i / 2) * 2
bit clear if i == (i / 2) * 2
What makes the code extra WTFy is that C defines an XOR operator, namely '^'. So, instead of:
result = i4_xor (a, b);
you can have:
result = a ^ b; // no function call at all!
The original programmer really should have known about the XOR operator. But even if they didn't (and granted, it's another obfuscated C symbol), their implementation of an XOR function is unbelievably poor.
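For comparison, the entire function could simply have been:
unsigned int i4_xor ( unsigned int i, unsigned int j )
{
  return i ^ j;
}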
I'm using VC++ 2003 and it compiled instantly in both debug/release modes.
Edit:
Do you have the latest service pack installed on your systems?
I would recommend you download a trial edition of Visual Studio 2008 and try the compile there, just to see if the problem is inherent. Also, if it does happen on a current version, you would be able to report the problem, and Microsoft might fix it.
On the other hand, there is no chance that Microsoft will fix whatever bug is in VS2003.