Optimal place to call __syncthreads()

Optimal place to call __syncthreads() - c++

Given that the code is correct, is there some potential performance benefit in calling __syncthreads as late as possible, as early as possible, or does it not matter? Here's an example with comments that demonstrate the question:
__global__ void kernel(const float* data) {
__shared__ float shared_data[64];
if (threadIdx.x < 64) {
shared_data[threadIdx.x] = data[threadIdx.x];
}
// Option #1: Place the call to `__syncthreads()` here?
// Here is a lot of code that doesn't use `shared_data`.
// Option #2: Place the call to `__syncthreads()` here?
// Here is some code that uses `shared_data`.
}

What you are facing is a split between where the writes are made and where they should be visible to the entire block.
NVIDIA has recently introduced a mechanism for just that: arrive + wait.
You start with initializing a barrier:
void __mbarrier_init(__mbarrier_t* bar, uint32_t expected_count);
Then you arrive at your "option 1" position, with the bar token you initialized:
__mbarrier_token_t __mbarrier_arrive(__mbarrier_t* bar);
then you have your unrelated code, and then finally, wait for everyone to arrive at your "option 2" position:
bool __mbarrier_test_wait(__mbarrier_t* bar, __mbarrier_token_t token);
... but note that this call doesn't block, i.e you'll have to actively "wait".
Alternatively, you can use NVIDIA's C++ wrappers for this mechanism, presented here.
Note that this functionality is relatively new, with Compute Capability at least 7.0 required, and 8.0 or later recommended.

Related

Pointer passed to function changes unexpectedly

I'm designing a preloader-based lock tracing utility that attaches to Pthreads, and I've run into a weird issue. The program works by providing wrappers that replace relevant Pthreads functions at runtime; these do some logging, and then pass the args to the real Pthreads function to do the work. They do not modify the arguments passed to them, obviously. However, when testing, I discovered that the condition variable pointer passed to my pthread_cond_wait() wrapper does not match the one that gets passed to the underlying Pthreads function, which promptly crashes with "futex facility returned an unexpected error code," which, from what I've gathered, usually indicates an invalid sync object passed in. Relevant stack trace from GDB:
#8 __pthread_cond_wait (cond=0x7f1b14000d12, mutex=0x55a2b961eec0) at pthread_cond_wait.c:638
#9 0x00007f1b1a47b6ae in pthread_cond_wait (cond=0x55a2b961f290, lk=0x55a2b961eec0)
at pthread_trace.cpp:56
I'm pretty mystified. Here's the code for my pthread_cond_wait() wrapper:
int pthread_cond_wait(pthread_cond_t* cond, pthread_mutex_t* lk) {
// log arrival at wait
the_tracer.add_event(lktrace::event::COND_WAIT, (size_t) cond);
// run pthreads function
GET_REAL_FN(pthread_cond_wait, int, pthread_cond_t*, pthread_mutex_t*);
int e = REAL_FN(cond, lk);
if (e == 0) the_tracer.add_event(lktrace::event::COND_LEAVE, (size_t) cond);
else {
the_tracer.add_event(lktrace::event::COND_ERR, (size_t) cond);
}
return e;
}
// GET_REAL_FN is defined as:
#define GET_REAL_FN(name, rtn, params...) \
typedef rtn (*real_fn_t)(params); \
static const real_fn_t REAL_FN = (real_fn_t) dlsym(RTLD_NEXT, #name); \
assert(REAL_FN != NULL) // semicolon absence intentional
And here's the code for __pthread_cond_wait in glibc 2.31 (this is the function that gets called if you call pthread_cond_wait normally, it has a different name because of versioning stuff. The stack trace above confirms that this is the function that REAL_FN points to):
int
__pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mutex)
{
/* clockid is unused when abstime is NULL. */
return __pthread_cond_wait_common (cond, mutex, 0, NULL);
}
As you can see, neither of these functions modifies cond, yet it is not the same in the two frames. Examining the two different pointers in a core dump shows that they point to different contents, as well. I can also see in the core dump that cond does not appear to change in my wrapper function (i.e. it's still equal to 0x5... in frame 9 at the crash point, which is the call to REAL_FN). I can't really tell which pointer is correct by looking at their contents, but I'd assume it's the one passed in to my wrapper from the target application. Both pointers point to valid segments for program data (marked ALLOC, LOAD, HAS_CONTENTS).
My tool is definitely causing the error somehow, the target application runs fine if it is not attached. What am I missing?
UPDATE: Actually, this doesn't appear to be what's causing the error, because calls to my pthread_cond_wait() wrapper succeed many times before the error occurs, and exhibit similar behavior (pointer value changing between frames without explanation) each time. I'm leaving the question open, though, because I still don't understand what's going on here and I'd like to learn.
UPDATE 2: As requested, here's the code for tracer.add_event():
// add an event to the calling thread's history
// hist_entry ctor gets timestamp & stack trace
void tracer::add_event(event e, size_t obj_addr) {
size_t tid = get_tid();
hist_map::iterator hist = histories.contains(tid);
assert(hist != histories.end());
hist_entry ev (e, obj_addr);
hist->second.push_back(ev);
}
// hist_entry ctor:
hist_entry::hist_entry(event e, size_t obj_addr) :
ts(chrono::steady_clock::now()), ev(e), addr(obj_addr) {
// these are set in the tracer ctor
assert(start_addr && end_addr);
void* buf[TRACE_DEPTH];
int v = backtrace(buf, TRACE_DEPTH);
int a = 0;
// find first frame outside of our own code
while (a < v && start_addr < (size_t) buf[a] &&
end_addr > (size_t) buf[a]) ++a;
// skip requested amount of frames
a += TRACE_SKIP;
if (a >= v) a = v-1;
caller = buf[a];
}
histories is a lock-free concurrent hashmap from libcds (mapping tid->per-thread vectors of hist_entry), and its iterators are guaranteed to be thread-safe as well. GNU docs say backtrace() is thread-safe, and there's no data races mentioned in the CPP docs for steady_clock::now(). get_tid() just calls pthread_self() using the same method as the wrapper functions, and casts its result to size_t.

Hah, figured it out! The issue is that Glibc exposes multiple versions of pthread_cond_wait(), for backwards compatibility. The version I reproduce in my question is the current version, the one we want to call. The version that dlsym() was finding is the backwards-compatible version:
int
__pthread_cond_wait_2_0 (pthread_cond_2_0_t *cond, pthread_mutex_t *mutex)
{
if (cond->cond == NULL)
{
pthread_cond_t *newcond;
newcond = (pthread_cond_t *) calloc (sizeof (pthread_cond_t), 1);
if (newcond == NULL)
return ENOMEM;
if (atomic_compare_and_exchange_bool_acq (&cond->cond, newcond, NULL))
/* Somebody else just initialized the condvar. */
free (newcond);
}
return __pthread_cond_wait (cond->cond, mutex);
}
As you can see, this version tail-calls the current one, which is probably why this took so long to detect: GDB is normally pretty good at detecting frames elided by tail calls, but I'm guessing it didn't detect this one because the functions have the "same" name (and the error doesn't affect the mutex functions because they don't expose multiple versions). This blog post goes into much more detail, coincidentally specifically about pthread_cond_wait(). I stepped through this function many times while debugging and sort of tuned it out, because every call into glibc is wrapped in multiple layers of indirection; I only realized what was going on when I set a breakpoint on the pthread_cond_wait symbol, instead of a line number, and it stopped at this function.
Anyway, this explains the changing pointer phenomenon: what happens is that the old, incorrect function gets called, reinterprets the pthread_cond_t object as a struct containing a pointer to a pthread_cond_t object, allocates a new pthread_cond_t for that pointer, and then passes the newly allocated one to the new, correct function. The frame of the old function gets elided by the tail-call, and to a GDB backtrace after leaving the old function it looks like the correct function gets called directly from my wrapper, with a mysteriously changed argument.
The fix for this was simple: GNU provides the libdl extension dlvsym(), which is like dlsym() but also takes a version string. Looking for pthread_cond_wait with version string "GLIBC_2.3.2" solves the problem. Note that these versions do not usually correspond to the current version (i.e. pthread_create()/exit() have version string "GLIBC_2.2.5"), so they need to be looked up on a per-function basis. The correct string can be determined either by looking at the compat_symbol() or versioned_symbol() macros that are somewhere near the function definition in the glibc source, or by using readelf to see the names of the symbols in the compiled library (mine has "pthread_cond_wait##GLIBC_2.3.2" and "pthread_cond_wait##GLIBC_2.2.5").

strange proplem using two Threads and Boolean

(I hate having to put a title like this. but I just couldn't find anything better)
I have two classes with two threads. first one detects motion between two frames:
void Detector::run(){
isActive = true;
// will run forever
while (isActive){
//code to detect motion for every frame
//.........................................
if(isThereMotion)
{
if(number_of_sequence>0){
theRecorder.setRecording(true);
theRecorder.setup();
// cout << " motion was detected" << endl;
}
number_of_sequence++;
}
else
{
number_of_sequence = 0;
theRecorder.setRecording(false);
// cout << " there was no motion" << endl;
cvWaitKey (DELAY);
}
}
}
second one will record a video when started:
void Recorder::setup(){
if (!hasStarted){
this->start();
}
}
void Recorder::run(){
theVideoWriter.open(filename, CV_FOURCC('X','V','I','D'), 20, Size(1980,1080), true);
if (recording){
while(recording){
//++++++++++++++++++++++++++++++++++++++++++++++++
cout << recording << endl;
hasStarted=true;
webcamRecorder.read(matRecorder); // read a new frame from video
theVideoWriter.write(matRecorder); //writer the frame into the file
}
}
else{
hasStarted=false;
cout << "no recording??" << endl;
changeFilemamePlusOne();
}
hasStarted=false;
cout << "finished recording" << endl;
theVideoWriter.release();
}
The boolean recording gets changed by the function:
void Recorder::setRecording(bool x){
recording = x;
}
The goal is to start the recording once motion was detected while preventing the program from starting the recording twice.
The really strange problem, which honestly doesn't make any sense in my head, is that the code will only work if I cout the boolean recording ( marked with the "++++++"). Else recording never changes to false and the code in the else statment never gets called.
Does anyone have an idea on why this is happening. I'm still just begining with c++ but this problem seems really strange to me..

I suppose your variables isThereMotion and recording are simple class members of type bool.
Concurrent access to these members isn't thread safe by default, and you'll face race conditions, and all kinds of weird behaviors.
I'd recommend to declare these member variables like this (as long you can make use of the latest standard):
class Detector {
// ...
std::atomic<bool> isThereMotion;
};
class Recorder {
// ...
std::atomic<bool> hasStarted;
};
etc.
The reason behind the scenes is, that even reading/writing a simple boolean value splits up into several assembler instructions applied to the CPU, and those may be scheduled off in the middle for a thread execution path change of the process. Using std::atomic<> provides something like a critical section for read/write operations on this variable automatically.
In short: Make everything, that is purposed to be accessed concurrently from different threads, an atomic value, or use an appropriate synchronization mechanism like a std::mutex.
If you can't use the latest c++ standard, you can perhaps workaround using boost::thread to keep your code portable.
NOTE:
As from your comments, your question seems to be specific for the Qt framework, there's a number of mechanisms you can use for synchronization as e.g. the mentioned QMutex.
Why volatile doesn't help in multithreaded environments?
volatile prevents the compiler to optimize away actual read access just by assumptions of values set formerly in a sequential manner. It doesn't prevent threads to be interrupted in actually retrieving or writing values there.
volatile should be used for reading from addresses that can be changed independently of the sequential or threading execution model (e.g. bus addressed peripheral HW registers, where the HW changes values actively, e.g. a FPGA reporting current data throughput at a register inteface).
See more details about this misconception here:
Why is volatile not considered useful in multithreaded C or C++ programming?

You could use a pool of nodes with pointers to frame buffers as part of a linked list fifo messaging system using mutex and semaphore to coordinate the threads. A message for each frame to be recorded would be sent to the recording thread (appended to it's list and a semaphore released), otherwise the node would be returned (appended) back to the main thread's list.
Example code using Windows based synchronization to copy a file. The main thread reads into buffers, the spawned thread writes from buffers it receives. The setup code is lengthy, but the actual messaging functions and the two thread functions are simple and small.
mtcopy.zip

Could be a liveness issue. The compiler could be re-ordering instructions or hoisting isActive out of the loop. Try marking it as volatile.
From MSDN docs:
Objects that are declared as volatile are not used in certain optimizations because their values can change at any time. The system always reads the current value of a volatile object when it is requested, even if a previous instruction asked for a value from the same object. Also, the value of the object is written immediately on assignment.
Simple example:
#include <iostream>
using namespace std;
int main() {
bool active = true;
while(active) {
cout << "Still active" << endl;
}
}
Assemble it:
g++ -S test.cpp -o test1.a
Add volatile to active as in volatile bool active = true
Assemble it again g++ -S test.cpp -o test2.a and look at the difference diff test1.a test2.a
< testb $1, -5(%rbp)
---
> movb -5(%rbp), %al
> testb $1, %al
Notice the first one doesn't even bother to read the value of active before testing it since the loop body never modifies it. The second version does.

longjmp and RAII

So I have a library (not written by me) which unfortunately uses abort() to deal with certain errors. At the application level, these errors are recoverable so I would like to handle them instead of the user seeing a crash. So I end up writing code like this:
static jmp_buf abort_buffer;
static void abort_handler(int) {
longjmp(abort_buffer, 1); // perhaps siglongjmp if available..
}
int function(int x, int y) {
struct sigaction new_sa;
struct sigaction old_sa;
sigemptyset(&new_sa.sa_mask);
new_sa.sa_handler = abort_handler;
sigaction(SIGABRT, &new_sa, &old_sa);
if(setjmp(abort_buffer)) {
sigaction(SIGABRT, &old_sa, 0);
return -1
}
// attempt to do some work here
int result = f(x, y); // may call abort!
sigaction(SIGABRT, &old_sa, 0);
return result;
}
Not very elegant code. Since this pattern ends up having to be repeated in a few spots of the code, I would like to simplify it a little and possibly wrap it in a reusable object. My first attempt involves using RAII to handle the setup/teardown of the signal handler (needs to be done because each function needs different error handling). So I came up with this:
template <int N>
struct signal_guard {
signal_guard(void (*f)(int)) {
sigemptyset(&new_sa.sa_mask);
new_sa.sa_handler = f;
sigaction(N, &new_sa, &old_sa);
}
~signal_guard() {
sigaction(N, &old_sa, 0);
}
private:
struct sigaction new_sa;
struct sigaction old_sa;
};
static jmp_buf abort_buffer;
static void abort_handler(int) {
longjmp(abort_buffer, 1);
}
int function(int x, int y) {
signal_guard<SIGABRT> sig_guard(abort_handler);
if(setjmp(abort_buffer)) {
return -1;
}
return f(x, y);
}
Certainly the body of function is much simpler and more clear this way, but this morning a thought occurred to me. Is this guaranteed to work? Here's my thoughts:
No variables are volatile or change between calls to setjmp/longjmp.
I am longjmping to a location in the same stack frame as the setjmp and returning normally, so I am allowing the code to execute the cleanup code that the compiler emitted at the exit points of the function.
It appears to work as expected.
But I still get the feeling that this is likely undefined behavior. What do you guys think?

I assume that f is in a third party library/app, because otherwise you could just fix it to not call abort. Given that, and that RAII may or may not reliably produce the right results on all platforms/compilers, you have a few options.
Create a tiny shared object that defines abort and LD_PRELOAD it. Then you control what happens on abort, and NOT in a signal handler.
Run f within a subprocess. Then you just check the return code and if it failed try again with updated inputs.
Instead of using the RAII, just call your original function from multiple call points and let it manually do the setup/teardown explicitly. It still eliminates the copy-paste in that case.

I actually like your solution, and have coded something similar in test harnesses to check that a target function assert()s as expected.
I can't see any reason for this code to invoke undefined behaviour. The C Standard seems to bless it: handlers resulting from an abort() are exempted from the restriction on calling library functions from a handler. (Caveat: this is 7.14.1.1(5) of C99 - sadly, I don't have a copy of C90, the version referenced by the C++ Standard).
C++03 adds a further restriction: If any automatic objects would be destroyed by a thrown exception transferring control to another (destination) point in the program, then a call to longjmp(jbuf, val) at the throw point that transfers control to the same (destination) point has undefined behavior. I'm supposing that your statement that 'No variables are volatile or change between calls to setjmp/longjmp' includes instantiating any automatic C++ objects. (I guess this is some legacy C library?).
Nor is POSIX async signal safety (or lack thereof) an issue - abort() generates its SIGABRT synchronously with program execution.
The biggest concern would be corrupting the global state of the 3rd party code: it's unlikely that the author will take pains to get the state consistent before an abort(). But, if you're correct that no variables change, then this isn't a problem.
If someone with a better understanding of the standardese can prove me wrong, I'd appreciate the enlightenment.

Debugging instance of another thread altering my data

I have a huge global array of structures. Some regions of the array are tied to individual threads and those threads can modify their regions of the array without having to use critical sections. But there is one special region of the array which all threads may have access to. The code that accesses these parts of the array needs to carefully use critical sections (each array element has its own critical section) to prevent any possibility of two threads writing to the structure simultaneously.
Now I have a mysterious bug I am trying to chase, it is occurring unpredictably and very infrequently. It seems that one of the structures is being filled with some incorrect number. One obvious explanation is that another thread has accidentally been allowed to set this number when it should be excluded from doing so.
Unfortunately it seems close to impossible to track this bug. The array element in which the bad data appears is different each time. What I would love to be able to do is set some kind of trap for the bug as follows: I would enter a critical section for array element N, then I know that no other thread should be able to touch the data, then (until I exit the critical section) set some kind of flag to a debugging tool saying "if any other thread attempts to change the data here please break and show me the offending patch of source code"... but I suspect no such tool exists... or does it? Or is there some completely different debugging methodology that I should be employing.

How about wrapping your data with a transparent mutexed class? Then you could apply additional lock state checking.
class critical_section;
template < class T >
class element_wrapper
{
public:
element_wrapper(const T& v) : val(v) {}
element_wrapper() {}
const element_wrapper& operator = (const T& v) {
#ifdef _DEBUG_CONCURRENCY
if(!cs->is_locked())
_CrtDebugBreak();
#endif
val = v;
return *this;
}
operator T() { return val; }
critical_section* cs;
private:
T val;
};
As for critical section implementation:
class critical_section
{
public:
critical_section() : locked(FALSE) {
::InitializeCriticalSection(&cs);
}
~critical_section() {
_ASSERT(!locked);
::DeleteCriticalSection(&cs);
}
void lock() {
::EnterCriticalSection(&cs);
locked = TRUE;
}
void unlock() {
locked = FALSE;
::LeaveCriticalSection(&cs);
}
BOOL is_locked() {
return locked;
}
private:
CRITICAL_SECTION cs;
BOOL locked;
};
Actually, instead of custom critical_section::locked flag, one could use ::TryEnterCriticalSection (followed by ::LeaveCriticalSection if it succeeds) to determine if a critical section is owned. Though, the implementation above is almost as good.
So the appropriate usage would be:
typedef std::vector< element_wrapper<int> > cont_t;
void change(cont_t::reference x) { x.lock(); x = 1; x.unlock(); }
int main()
{
cont_t container(10, 0);
std::for_each(container.begin(), container.end(), &change);
}

I know two ways to handle such errors:
1) Read the code again and again, looking for possible errors. I can think about two errors that can cause this: unsynchronized access or writing by incorrect memory address. Maybe you have more ideas.
2) Logging, logging an logging. Add lot of optional traces (OutputDebugString or log file), in every critical place, which contain enough information - indexes, variable values etc. It is a good idea to add this tracing with some #ifdef. Reproduce the bug and try to understand from the log, what happens.

Your best (fastest) bet is still to revise the mutex code. As you said, it is the obvious explanation - why not trying to really find the explanation (by logic) instead of additional hints (by coding) that may come out inconclusive? If the code review doesn't turn out something useful you may still take the mutex code and use it for a test run. The first try should not be to reproduce the bug in your system but to ensure correct implementation of the mutex - implement threads (start from 2 upwards) that all try to access the same data structure again and again with a random small delay in each of them to have them jitter around on the time line. If this test results in a buggy mutex which you simply can't identify in the code then you have fallen victim to some architecture dependant effect (maybe intstruction reordering, multi-core cache incoherency, etc.) and need to find another mutex implementation. If OTOH you find an obvious bug in the mutex, try to exploit it in your real system (instrument your code so that the error should appear much more often) so that you can ensure that it really is the cause of your original problem.

I was thinking about this while pedaling to work. One possible way of handling this is to make portions of the memory in question be read-only when it is not actively being accessed and protected via critical section ownership. This is assuming that the problem is caused by a thread writing to the memory when it does not own the appropriate critical section.
There are quite a few limitations to this that prevent it from working. Most importantly is the fact that I think you can only set privileges on a page by page basis (4K I believe). So that would likely require some very specific changes to your allocation scheme so that you could narrow down the appropriate section to protect. The second problem is that it would not catch the rogue thread writing to the memory if another thread actively owned the critical section. But it would catch it and cause an immediate access violation if the critical section was not owned.
The idea would be to do to change your EnterCriticalSection calls to:
EnterCriticalSection()
VirtualProtect( … PAGE_READWRITE … );
And change the LeaveCriticalSection calls to:
VirtualProtect( … PAGE_READONLY … );
LeaveCriticalSection()
The following chunk of code shows a call to VirtualProtect
int main( int argc, char* argv[] 1
{
unsigned char *mem;
int i;
DWORD dwOld;
// this assume 4K page size
mem = malloc( 4096 * 10 );
for ( i = 0; i < 10; i++ )
mem[i * 4096] = i;
// set the second page to be readonly. The allocation from malloc is
// not necessarily on a page boundary, but this will definitely be in
// the second page.
printf( "VirtualProtect res = %d\n",
VirtualProtect( mem + 4096,
1, // ends up setting entire page
PAGE_READONLY, &dwOld ));
// can still read it
for ( i = 1; i < 10; i++ )
printf( "%d ", mem[i*4096] );
printf( "\n" );
// Can write to all but the second page
for ( i = 0; i < 10; i++ )
if ( i != 1 ) // avoid second page which we made readonly
mem[i] = 1;
// this causes an access violation
mem[4096] = 1;
}

optimizing branching by re-ordering

I have this sort of C function -- that is being called a zillion times:
void foo ()
{
if (/*condition*/)
{
}
else if(/*another_condition*/)
{
}
else if (/*another_condition_2*/)
{
}
/*And so on, I have 4 of them, but we can generalize it*/
else
{
}
}
I have a good test-case that calls this function, causing certain if-branches to be called more than the others.
My goal is to figure the best way to arrange the if statements to minimize the branching.
The only way I can think of is to do write to a file for every if condition branched to, thereby creating a histogram. This seems to be a tedious way. Is there a better way, better tools?
I am building it on AS3 Linux, using gcc 3.4; using oprofile (opcontrol) for profiling.

It's not portable, but many versions of GCC support a function called __builtin_expect() that can be used to tell the compiler what we expect a value to be:
if(__builtin_expect(condition, 0)) {
// We expect condition to be false (0), so we're less likely to get here
} else {
// We expect to get here more often, so GCC produces better code
}
The Linux kernel uses these as macros to make them more intuitive, cleaner, and more portable (i.e. redefine the macros on non-GCC systems):
#ifdef __GNUC__
# define likely(x) __builtin_expect((x), 1)
# define unlikely(x) __builtin_expect((x), 0)
#else
# define likely(x) (x)
# define unlikely(x) (x)
#endif
With this, we can rewrite the above:
if(unlikely(condition)) {
// we're less likely to get here
} else {
// we expect to get here more often
}
Of course, this is probably unnecessary unless you're aiming for raw speed and/or you've profiled and found that this is a problem.

Try a profiler (gprof?) - it will tell you how much time is spent. I don't recall if gprof counts branches, but if not, just call a separate empty method in each branch.

Running your program under Callgrind will give you branch information. Also I hope you profiled and actually determined this piece of code is problematic, as this seems like a microoptimization at best. The compiler is going to generate a branch table from the if/else if/else if it's able to which would require no branching (this is dependent on what the conditionals are, obviously)0, and even failing that the branch predictor on your processor (assuming this is not for embedded work, if it is feel free to ignore me) is pretty good at determining the target of branches.

It doesn't actually matter what order you change them round to, IMO. The branch predictor will store the most common branch and auto take it anyway.
That said, there are something you could try ... You could maintain a set of job queues and then, based on the if statements, assign them to the correct job queue before executing them one after another at the end.
This could further be optimised by using conditional moves and so forth (This does require assembler though, AFAIK). This could be done by conditionally moving a 1 into a register, that is initialised as 0, on condition a. Place the pointer valueat the end of the queue and then decide to increment the queue counter or not by adding that conditional 1 or 0 to the counter.
Suddenly you have eliminated all branches and it becomes immaterial how many branch mispredictions there are. Of course, as with any of these things, you are best off profiling because, though it seems like it would provide a win ... it may not.

We use a mechanism like this:
// pseudocode
class ProfileNode
{
public:
inline ProfileNode( const char * name ) : m_name(name)
{ }
inline ~ProfileNode()
{
s_ProfileDict.Find(name).Value() += 1; // as if Value returns a nonconst ref
}
static DictionaryOfNodesByName_t s_ProfileDict;
const char * m_name;
}
And then in your code
void foo ()
{
if (/*condition*/)
{
ProfileNode("Condition A");
// ...
}
else if(/*another_condition*/)
{
ProfileNode("Condition B");
// ...
} // etc..
else
{
ProfileNode("Condition C");
// ...
}
}
void dumpinfo()
{
ProfileNode::s_ProfileDict.PrintEverything();
}
And you can see how it's easy to put a stopwatch timer in those nodes too and see which branches are consuming the most time.

Some counter may help. After You see the counters, and there are large differences, You can sort the conditions in a decreasing order.
static int cond_1, cond_2, cond_3, ...
void foo (){
if (condition){
cond_1 ++;
...
}
else if(/*another_condition*/){
cond_2 ++;
...
}
else if (/*another_condtion*/){
cond_3 ++;
...
}
else{
cond_N ++;
...
}
}
EDIT: a "destructor" can print the counters at the end of a test run:
void cond_print(void) __attribute__((destructor));
void cond_print(void){
printf( "cond_1: %6i\n", cond_1 );
printf( "cond_2: %6i\n", cond_2 );
printf( "cond_3: %6i\n", cond_3 );
printf( "cond_4: %6i\n", cond_4 );
}
I think it is enough to modify only the file that contains the foo() function.

Wrap the code in each branch into a function and use a profiler to see how many times each function is called.

Line-by-line profiling gives you an idea which branches are called more often.
Using something like LLVM could make this optimization automatically.

As a profiling technique, this is what I rely on.
What you want to know is: Is the time spent in evaluating those conditions a significant fraction of execution time?
The samples will tell you that, and if not, it just doesn't matter.
If it does matter, for example if the conditions include function calls that are on the stack a significant part of the time, what you want to avoid is spending much time in comparisons that are false. The way you tell this is, if you often see it calling a comparison function from, say, the first or second if statement, then catch it in such a sample and step out of it to see if it returns false or true. If it typically returns false, it should probably go farther down the list.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js