I have a C++ program that processes images and tracks objects in them, using OpenCV. For the most part it works well; however, the results I get are inconsistent. That is, approximately 10% of the time, I get slightly different output values and I cannot figure out why. I do not have any calls to random; I have run valgrind to look for uninitialized memory; I have run clang-tools static analysis on it. No luck. The inconsistent runs have one of several different outputs, so they are not completely random.
Is there a tool that will show me where two runs diverge? If I run gprof or maybe cflow, can I compare them and see what was different? Is there some other tool or process I can use?
Edit: Thank you for the feedback. I believe that it is due to threading and a race condition; the suggestion was very helpful. I am currently using advice from: Ways to Find a Race Condition
Answering my own question, so that someone can see it. It was not a race condition:
The underlying problem was that we were using HOG descriptors with the incorrect parameters.
As the documentation says (https://docs.opencv.org/3.4.7/d5/d33/structcv_1_1HOGDescriptor.html), when you call cv::HOGDescriptor::compute and pass in a win_stride, it has to be a multiple of the block stride. Similarly, the block stride must be a multiple of the cell size. We did not set these properly. The end result was that sometimes (about 10% of the time) memory was being overwritten or otherwise corrupted. It didn't throw an error, and the results were almost correct all the time, but they were subtly different.
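For reference, here is a minimal sketch (the sizes are illustrative, not the ones from our project) of parameters that satisfy those constraints:

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    // blockStride (8x8) is a multiple of cellSize (8x8),
    // and the winStride passed to compute() is a multiple of blockStride.
    cv::HOGDescriptor hog(
        /*winSize=*/cv::Size(64, 128),
        /*blockSize=*/cv::Size(16, 16),
        /*blockStride=*/cv::Size(8, 8),
        /*cellSize=*/cv::Size(8, 8),
        /*nbins=*/9);

    cv::Mat img(128, 64, CV_8UC1, cv::Scalar(0)); // placeholder image, one detection window in size
    std::vector<float> descriptors;
    hog.compute(img, descriptors,
                /*winStride=*/cv::Size(8, 8),  // must be a multiple of blockStride
                /*padding=*/cv::Size(0, 0));
    return 0;
}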
In my experience, I have bumped into some odd behaviors when using printf (or any other stdout logging) for debugging.
Behavior 1:
One common scenario is using printf in a multithreaded application in order to find out why certain bugs are occurring, and having the printf suddenly "fix" the bugs (of course, the printfs were called aggressively, resulting in a huge amount of output).
In this scenario I assume that printf adds some delays, so there might be some low-priority threads that don't get CPU time, and I start looking in that direction.
Another direction I look at after the miracle printf fix is synchronization, because I speculate that calls to printf, although made from multiple threads, are synchronized behind the scenes by the system, so the different threads calling printf end up synchronized with each other by waiting for one another to finish writing to the I/O buffer.
Q1: Are my two suppositions regarding the first scenario correct?
Q2: Are there any other directions I should take into consideration when such a scenario occurs?
Behavior 2:
This scenario happens very rarely, but when you bump into it, it will make even senior developers question themselves, and I would really appreciate an explanation for it.
It goes something like this:
the code doesn't work... (clean, compile, run)
the code still doesn't work, so you add a printf to see why (clean, compile, run)
the code starts working fine... so you remove the previously added printf (clean, compile, run)
the code works fine now! (scratch head, stare in disbelief)
In practice I have used this approach more than once to fix the "CPU may be pegged" bug when it occurred: Android "cpu may be pegged" bug.
It actually worked so well that it became a known "fix" (and if it didn't work on the first try, you just repeated the process until the bug was gone).
Please note that the code was properly cleaned; it was never a problem of linking against stale object files.
One of the most popular speculations is that the compiled code somehow ends up different, for unknown reasons (do compilers use some randomness that depends on the lines of a given file, including whitespace?).
Q3: What can be the cause of this behavior (I'm open to speculation as well)? Can a compiler generate different assembly even though the code is the same?
Please note that the projects I'm talking about are quite large, with multiple static libraries, so these behaviors aren't replicable on small code snippets (although I've heard of scenario 2 happening on a single-file program as well).
Q1: You can tell if printf is synchronizing behind the scenes simply by looking for interleaved characters from two different printfs. If there's no interleaving, then printf is synchronizing. I expect this to be more likely to fix things than CPU hogging.
Q2: I would look for shared resources that aren't properly mutex protected.
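For example, here is a minimal sketch (the names are illustrative, not taken from the question) of what "properly mutex protected" means for a shared counter:

#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int counter = 0;          // shared resource
std::mutex counterMutex;  // protects counter

void worker() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(counterMutex); // without this, ++counter is a data race
        ++counter;
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker);
    for (auto& t : threads)
        t.join();
    std::printf("counter = %d\n", counter); // reliably 400000 with the lock, unpredictable without it
    return 0;
}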
Q3: Compilers may use random numbers during optimization. For example, if the compiler has 32 variables and 8 registers to put them in, it may "roll the dice" to determine which variables to put into registers. You can test this theory by disabling optimization: without optimization, the output should be consistent. You can also test it by comparing the binaries produced by two builds.
printf()'s thread safety is discussed in other questions (e.g. here for Linux), and it's also noteworthy that any out-of-line function call can be expected to cause writes back to memory - to quote David Butenhof:
"In practice, most compilers will not try to keep register copies of
global data across a call to an external function, because it's too
hard to know whether the routine might somehow have access to the
address of the data."
Either aspect can mean that undefined behaviour caused by failing to use synchronisation instructions properly may be "masked" (with varying degrees of reliability, depending on your architecture, the exact nature of the problem, etc.) by calling printf().
As you say, the time spent calling printf can also affect the frequency with which race conditions turn out adversely.
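As a concrete illustration (a hypothetical sketch, not code from the question), this is the sort of missing synchronisation that an added printf can accidentally mask:

#include <atomic>
#include <cstdio>
#include <thread>

// 'ready' should be std::atomic<bool>. If it were a plain bool, the spin loop in
// main() would be a data race: the compiler may keep the flag in a register and
// never re-read it from memory. Dropping a printf into the loop often "fixes"
// that by forcing a reload - exactly the masking effect described above.
std::atomic<bool> ready{false};
int payload = 0;

int main() {
    std::thread producer([] {
        payload = 42;                                 // written before the flag is set
        ready.store(true, std::memory_order_release); // publish
    });

    while (!ready.load(std::memory_order_acquire)) {
        // busy-wait; with a plain bool, this is where a stray printf hides the bug
    }
    std::printf("payload = %d\n", payload);           // guaranteed to print 42

    producer.join();
    return 0;
}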
Regarding re-compiling programs fixing bugs: firstly, if the bug is intermittent in the first place, observing it before some change and not afterwards doesn't necessarily prove any causal relationship. There could be other factors, such as less system load afterwards due to things like anti-virus sweeps, backup schedules, or other users. Secondly, it's possible the executable might not be the same: the compiler might inject something like an incrementing version number, a build timestamp, or some other system data - a change in, say, data length could have knock-on effects on the alignment of other data, with lots of subtle consequences. It's also possible that your compiler is using address randomisation or some other technique, which again could affect data alignment - in a way that might mask errors or change performance.
I have a genetic algorithm program; everything is allocated dynamically using vectors. Nowhere is the number of generations or individuals per generation set at compile time.
I tried it with 500, 1,000, and 2,000 generations, and it ran perfectly. Then I tried 10,000 generations and got "debug assertion failed, vector subscript out of range" at generation 4966.
I tried again twice with the same parameters (10,000 generations), and it ran fine.
I tried it once more, and I got the error at generation 7565.
It's strange that sometimes it works perfectly, sometimes I get the error. Especially considering that everything is done using vectors.
Any ideas on where the problem could come from? Maybe the debug mode is buggy for some reason?
The problem comes from stack corruption or, most probably, from an out-of-bounds index access. The fact that your code crashes in some runs indicates that something is wrong. If your code is multithreaded, the problem may be that, when actions are executed in a particular order, your code tries to access something out of bounds for a vector.
My advice is to run your code under valgrind and see what it says. It usually helps in resolving similar issues.
Also note that the fact that your code does not crash in some runs does not mean that it works correctly. You may still have stack corruption or something similar.
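As a cheap first step (a sketch with made-up data, not the asker's code), std::vector::at() turns a silent out-of-bounds access into a failure that occurs on every run, in release builds as well as debug builds:

#include <cstdio>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<double> population(100, 1.0); // stand-in for one generation of 100 individuals
    std::size_t i = 100;                      // deliberately one past the end

    try {
        // operator[] here would be undefined behaviour: it might crash, or only fail
        // on some runs. .at() performs a bounds check and throws std::out_of_range
        // every time, pointing straight at the bad index.
        double f = population.at(i);
        std::printf("%f\n", f);
    } catch (const std::out_of_range& e) {
        std::printf("out-of-bounds access caught: %s\n", e.what());
    }
    return 0;
}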
I'm debugging my CUDA 4.0/Thrust-based image reconstruction code on my Ubuntu 10.10 64-bit system, and I've been trying to figure out how to debug a run-time error in which my output images appear to contain random "noise." There is no random number generator output in my code, so I expect the output to be consistent between runs, even if it's wrong. However, it's not...
I was just wondering if anyone has a general procedure for debugging CUDA runtime errors such as these. I'm not using any shared memory in my CUDA kernels. I've taken pains to avoid race conditions involving global memory, but I could have missed something.
I've tried using GPU Ocelot, but it has problems recognizing some of my CUDA and CUSPARSE function calls.
Also, my code generally works. It's just when I change this one setting that I get these non-deterministic results. I've checked all code associated with that setting, but I can't figure out what I'm doing wrong. If I can distill it to something that I can post here, I might do that, but at this point it's too complicated to post here.
Are you sure all of your kernels have proper blocksize/remainder handling? The one place we have seen non-deterministic results occurred when we had data elements at the end of the array not being processed.
Our kernels were originally intended for data that was known to be an integer multiple of 256 elements, so we used a block size of 256 and did a simple division to get the number of blocks. When the data was later changed to be of any length, the leftover 255 or fewer elements never got processed, and those spots in the output then contained random data.
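A minimal sketch of the fix (the kernel and sizes are illustrative, not from the question): round the block count up and guard the index inside the kernel:

#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: surplus threads in the last, partially filled block do nothing
        data[i] *= factor;
}

void launchScale(float* d_data, int n, float factor) {
    const int blockSize = 256;
    // Ceiling division: a plain n / blockSize silently drops the last n % blockSize elements.
    const int numBlocks = (n + blockSize - 1) / blockSize;
    scale<<<numBlocks, blockSize>>>(d_data, n, factor);
}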
I am rewriting some rendering C code in C++. The old C code basically computes everything it needs and renders it at each frame. The new C++ code instead pre-computes what it needs and stores that as a linked list.
Now, actual rendering operations are translations, colour changes and calls to GL lists.
While executing the operations in the linked list should be pretty straightforward, it would appear that the resulting method call takes longer than the old version (which computes everything each time - I have of course made sure that the new version isn't recomputing).
The weird thing? It executes fewer OpenGL operations than the old version. But it gets weirder. When I added counters for each type of operation, and a good old printf at the end of the method, it got faster - both gprof and manual measurements confirm this.
I also bothered to take a look at the assembly code generated by G++ in both cases (with and without trace), and there is no major change (which was my initial suspicion) - the only differences are a few more stack words allocated for counters, increasing said counters, and preparing for printf followed by a jump to it.
Also, this holds true with both -O2 and -O3. I am using gcc 4.4.5 and gprof 2.20.51 on Ubuntu Maverick.
I guess my question is: what's happening? What am I doing wrong? Is something throwing off both my measurements and gprof?
By spending time in printf, you may be avoiding stalls in your next OpenGL call.
Without more information, it is difficult to know what is happening here, but here are a few hints:
Are you sure the OpenGL calls are the same? You can use some tool to compare the calls issued. Make sure no state change is introduced by the possibly different order in which things are done.
Have you tried to use a profiler at runtime? If you have many objects, the simple fact of chasing pointers while looping over the list could introduce cache misses.
Have you identified a particular bottleneck, either on the CPU side or GPU side?
Here is my own guess at what could be going wrong. The calls you send to your GPU take some time to complete: the previous code, by mixing CPU operations and GPU calls, made the CPU and GPU work in parallel; the new code, on the contrary, first makes the CPU compute many things while the GPU is idle, then feeds the GPU all the work to be done while the CPU has nothing left to do.
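One way to check that guess (a rough sketch; issueFrame() is a stand-in for whatever walks the linked list and issues the GL calls, and a current GL context is assumed) is to time the CPU-side submission separately from the GPU drain using glFinish():

#include <GL/gl.h>
#include <chrono>
#include <cstdio>

// Placeholder: in the real program this would walk the linked list and issue
// the translations, colour changes and glCallList()s described above.
void issueFrame() {
    glClear(GL_COLOR_BUFFER_BIT);
}

void timeFrame() {
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    issueFrame();               // CPU-side cost of issuing the calls
    auto t1 = clock::now();

    glFinish();                 // block until the GPU has finished all queued work
    auto t2 = clock::now();

    auto us = [](clock::duration d) {
        return (long long)std::chrono::duration_cast<std::chrono::microseconds>(d).count();
    };
    // If the second number dominates, the CPU is mostly waiting on the GPU,
    // which would be consistent with the serialisation guess above.
    std::printf("cpu issue: %lld us, gpu drain: %lld us\n", us(t1 - t0), us(t2 - t1));
}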
I have a performance issue where I suspect one standard C library function is taking too long and causing my entire system (a suite of processes) to basically "hiccup". Sure enough, if I comment out the library function call, the hiccup goes away. This prompted me to investigate what standard methods there are to prove this type of thing. What would be the best practice for testing a function to see whether it causes an entire system to hang for a second (causing other processes to be momentarily starved)?
I would at least like to definitively correlate the function being called and the visible freeze.
Thanks
The best way to determine this stuff is to use a profiling tool to get the information on how long is spent in each function call.
Failing that, set up a function that reserves a block of memory. Then, at various points in your code, write a string to that memory including the current time. (This avoids the delays associated with writing to the display.)
After you have run your code, pull out the memory and parse it to determine how long the various parts of your code are taking.
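A minimal sketch of that idea (the names and buffer size are made up for illustration):

#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Pre-reserved in-memory trace buffer: appending to it is cheap compared with
// writing to the console, so it perturbs the timing far less.
struct TraceBuffer {
    std::vector<std::string> entries;

    explicit TraceBuffer(std::size_t capacity) { entries.reserve(capacity); }

    void mark(const char* label) {
        auto now = std::chrono::steady_clock::now().time_since_epoch();
        auto us  = std::chrono::duration_cast<std::chrono::microseconds>(now).count();
        entries.push_back(std::to_string(us) + " us  " + label);
    }

    // Dump and parse after the run, when the timing no longer matters.
    void dump() const {
        for (const auto& e : entries)
            std::printf("%s\n", e.c_str());
    }
};

int main() {
    TraceBuffer trace(10000);

    trace.mark("before suspect library call");
    // ... the suspect standard library call would go here ...
    trace.mark("after suspect library call");

    trace.dump();
    return 0;
}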
I'm trying to figure out what you mean by "hiccup". I'm imagining your program does something like this:
while (...){
// 1. do some computing and/or file I/O
// 2. print something to the console or move something on the screen
}
and normally the printed or graphical output hums along in a subjectively continuous way, but sometimes it appears to freeze, while the computing part takes longer.
Is that what you meant?
If so, I suspect that in the running state it is almost always in step 2, but in the hiccup state it is spending its time in step 1.
I would comment out step 2, so it would spend nearly all its time in the hiccup state, and then just pause it under the debugger to see what it's doing.
That technique tells you exactly what the problem is with very little effort.