I am compiling Intel TBB community version tbb2017_20161128oss. While compiling, it runs a few test cases. In one of the test cases it gives me the warning:
./test_global_control.exe
TBB Warning: The number of workers is currently limited to 0. The request for 1 workers is ignored. Further requests for more workers will be silently ignored until the limit changes.
What does this warning mean for my platform? Should I refrain from using certain components of Intel TBB?
For the TBB tests, you can usually ignore run-time warnings that start with "TBB Warning". Generally, these warnings tell programmers that they may be using TBB sub-optimally or incorrectly. In the tests, however, the library is deliberately used in complicated ways, and so warnings are sometimes issued.
This particular warning says that the program first limited the number of worker threads it is allowed to use, and then requested more workers than the limit allows. For the test it's important to check that the behavior is correct in such corner cases; the warning itself is outside the test's control.
In real applications, these warnings can help diagnose unexpected situations, and so should not be ignored.
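For illustration, here is a minimal sketch (assumed, not the actual test code) of a program that provokes this kind of warning with tbb::global_control:

#include <tbb/global_control.h>
#include <tbb/task_arena.h>
#include <tbb/parallel_for.h>

int main() {
    // Allow only the main thread: that means 0 worker threads.
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 1);

    // An arena asking for 2 threads now exceeds the limit; TBB warns that the
    // request for 1 worker is ignored and the loop runs on the main thread.
    tbb::task_arena arena(2);
    arena.execute([] {
        tbb::parallel_for(0, 100, [](int) { /* some work */ });
    });
    return 0;
}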
Related
Can oversubscribing the number of OpenMP threads in a hybrid MPI / OpenMP program lead to incorrect execution of parallel C++ code? By incorrect I mean that it does not produce the expected output in a parallel test case.
I am trying to come up with an example of a case where oversubscription, on its own, causes execution of the code to fail. The only cause I can think of and find via research is when there are so many threads used in OpenMP that they cause a stack overflow.
My motivation for the question is I am working on a large project with hybrid OpenMP / MPI where the number of failed tests seems to depend on the number of cores used. I imagine this could be due to a number of issues outside the scope of the question, but I am interested to know whether solely oversubscription could cause correctness tests to fail.
No. A correct well-formed parallel program on functioning hardware does not become incorrect from being oversubscribed.
There is simply no correctness assumption that oversubscription violates. Imagine a non-pinned program: one of its threads could be migrated by the OS scheduler to a core that is already executing another thread. Locally, this is similar to oversubscription, and it must not produce incorrect results.
You may experience severe performance degradation or program termination due to lack of resources. Of course, an incorrect program that appeared to work before can reveal its flaws when run under oversubscription, and oversubscription could also exercise a pattern that exposes existing hardware issues.
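As a concrete, hedged illustration of the claim (compile with e.g. g++ -fopenmp; the thread count and sizes are arbitrary): the following deliberately oversubscribed loop must still produce the correct sum, no matter how many threads contend for the cores.

#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<int> data(n, 1);
    long long sum = 0;

    // Ask for far more threads than cores; correctness must not depend on it.
    #pragma omp parallel for reduction(+:sum) num_threads(64)
    for (int i = 0; i < n; ++i)
        sum += data[i];

    std::printf("sum = %lld (expected %d)\n", sum, n);
    return sum == n ? 0 : 1;
}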
I've had some rather unusual behavior from my OpenMP program, which I think must be caused by some inner workings of Linux processes that I am unaware of.
When I run a C benchmark binary which has been compiled with OpenMP support, the program executes successfully, no problems at all:
>:$ OMP_NUM_THREADS=4 GOMP_CPU_AFFINITY="0-3" /home/me/bin/benchmark
>:$ ...benchmark complete...
When I run the benchmark from a separate C++ program that I start, where the code to execute it (for example) looks like:
#include <cstdlib>  // for system()

int main(int argc, char* argv[]) {
    system("/home/me/bin/benchmark");
    return 0;
}
The output gives me warnings:
>:$ /home/me/bin/my_cpp_program
OMP: Warning #123: Ignoring invalid OS proc ID 1.
OMP: Warning #123: Ignoring invalid OS proc ID 2.
OMP: Warning #123: Ignoring invalid OS proc ID 3.
These warnings are the same warnings I get when I try to set CPU affinity to CPUs that don't exist, and run the OpenMP benchmark directly.
I therefore assume that the only CPU my_cpp_program knows about is processor ID 0. I also get the same error when running as root, so I don't think it is a permissions problem. I've also checked that the code executed by system() has the correct environment variables and the same linked libraries.
Does anyone know what is causing this to occur?
Following a suggestion by Zulan, I'm pretty sure this occurred because of how bin/my_cpp_program was compiled. Importantly, it was also compiled with -fopenmp. It turns out that bin/my_cpp_program was linked against the GNU OpenMP implementation, libgomp, whereas bin/benchmark was linked against the LLVM OpenMP implementation, libomp.
I am not certain, but I think what happened was the following. GOMP_CPU_AFFINITY is the environment variable that sets CPU affinity for libgomp. Because GOMP_CPU_AFFINITY was set, the single thread running in bin/my_cpp_program was bound to CPU 0. Any child process spawned by that thread then apparently sees only CPU 0 as a potential CPU when applying further GOMP_CPU_AFFINITY assignments, which produced the warnings when the spawned process tried to find CPUs 1-3.
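One way to check this diagnosis (a Linux-specific sketch I would run from inside the spawned process) is to print the affinity mask the process actually inherits; if only CPU 0 is listed, the libomp warnings above follow directly.

#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // Query the affinity mask of the current process (pid 0 = self).
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
            if (CPU_ISSET(cpu, &mask))
                std::printf("allowed CPU: %d\n", cpu);
    }
    return 0;
}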
To work around this I used KMP_AFFINITY, Intel's CPU affinity environment variable. The libomp runtime used by bin/benchmark gives KMP_AFFINITY higher precedence when both it and GOMP_CPU_AFFINITY are set, and for whatever reason this allows the spawned child process to bind to other CPUs. So I used:
KMP_AFFINITY=logical,granularity=fine ./bin/benchmark
With this, the program works as expected in both situations (each logical core is bound in ascending order from 0 to 3), since the CPU assignments of bin/my_cpp_program no longer interfere with those of bin/benchmark when one spawns the other. You can verify this by adding verbose to the comma-separated list.
As my GPU (a Quadro FX 3700) doesn't support arch > sm_11, I was not able to use relocatable device code (rdc). Hence I combined all the utilities needed into one large file (say x.cu).
To give an overview of x.cu: it contains 2 classes with 5 member functions each, 20 device functions, 1 global kernel, and 1 kernel-caller function.
Now, when I try to compile via Nsight, it just hangs, showing Build% as 3.
When I try compiling with
nvcc x.cu -o output -I"."
it shows the following messages and compiles only after a long time:
/tmp/tmpxft_0000236a_00000000-9_Kernel.cpp3.i(0): Warning: Olimit was exceeded on function _Z18optimalOrderKernelPdP18PrepositioningCUDAdi; will not perform function-scope optimization.
To still perform function-scope optimization, use -OPT:Olimit=0 (no limit) or -OPT:Olimit=45022
/tmp/tmpxft_0000236a_00000000-9_Kernel.cpp3.i(0): Warning: To override Olimit for all functions in file, use -OPT:Olimit=45022
(Compiler may run out of memory or run very slowly for large Olimit values)
Here optimalOrderKernel is the global kernel. Since compiling shouldn't take this much time, I want to understand the reason behind these messages, particularly Olimit.
Olimit is pretty clear, I think. It is a limit the compiler places on the amount of effort it will expend on optimizing code.
Most codes compile just fine using nvcc. However, no compiler is perfect, and some seemingly innocuous codes can cause the compiler to spend a long time at an optimization process that would normally be quick.
Since you haven't provided any code, I'm speaking in generalities.
Since there is the occasional case where the compiler spends a disproportionately long time in certain optimization phases, Olimit acts as a convenient watchdog: when it is exceeded, certain optimization steps are aborted and a "less optimized" version of your code is generated instead, and you at least have some idea of why the compile was taking so long.
I think the compiler messages you received are quite clear on how to modify the Olimit depending on your intentions. You can override it to increase the watchdog period, or disable it entirely (by setting it to zero). In that case, the compile process could take an arbitrarily long period of time, and/or run out of memory, as the messages indicate.
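If you do decide to raise or remove the limit, note that the -OPT:Olimit options in the message belong to the Open64 front end (nvopencc) that older nvcc versions use for sm_1x targets. As far as I know they can be forwarded with -Xopencc, but treat the exact spelling as an assumption for your toolchain:

nvcc -Xopencc -OPT:Olimit=0 x.cu -o output -I"."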
I enabled /RTCs in my application to detect stack corruption issues. The application has many components (DLLs), the total LOC is about 40K, and it has many threads.
Initially I was getting the crash after executing 18000 cycles, but after enabling the /RTCs option I get the crash within 100 cycles. The crash always occurs in a thread called the Receiver Thread, consistently at 3 or 4 locations. In some cases, when the crash occurs, almost all local variables look corrupted. I am not able to identify the root cause, as I cannot see any issues around the points at which the crash occurs.
What things can I do to narrow down the point where the stack is corrupting?
The code has try/catch statements; will they prevent identifying the cause?
Please help me
Thanks!
Edit: Are you using optimisations?
If you compile your program at the command line using any of the /RTC compiler options, any pragma optimize instructions in your code will silently fail. This is because run-time error checks are not valid in a release (optimized) build.

You should use /RTC for development builds; /RTC should not be used for a retail build. /RTC cannot be used with compiler optimizations (/O Options (Optimize Code)). A program image built with /RTC will be slightly larger and slightly slower than an image built with /Od (up to 5 percent slower than an /Od build).
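For context, here is a minimal made-up example (not your code) of the kind of off-by-one bug /RTCs is designed to flag:

// Writing one element past a local array corrupts the guard bytes /RTCs places
// around it; the check fires when the function returns, reporting
// "Run-Time Check Failure #2 - Stack around the variable 'buffer' was corrupted".
void handle_message()
{
    char buffer[16];
    for (int i = 0; i <= 16; ++i)   // off-by-one: i == 16 writes past the end
        buffer[i] = 0;
}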
Without you posting any code I can only suggest general tools.
I use valgrind --tool=helgrind on Linux for this kind of thing, but I am guessing from your question that you are on Windows.
You might find the answers to this question useful: Is there a good Valgrind substitute for Windows?
(It might help if you post the code where you are getting problems or indicate what methods you have used to protect the variables that seem to be corrupted (mutexes and the like...) )
This has never happened to me before. In Visual Studio, I have a piece of code that is executed 300 times; I time each iteration with the performance counter and then average the results.
If I run the code in the debugger I get an average of 1.01 ms; if I run it without the debugger I get 1.8 ms.
I closed all other apps, I rebooted, I tried it many times: always the same timing.
I'm trying to optimize my code, but before throwing myself into changing it, I want to be sure of my timings, so I have something to compare against.
What can cause this strange behaviour?
Edit:
Some clarification:
I'm running the same compiled piece of code, the release build. The only difference is how it is launched (F5 vs. Ctrl+F5).
So compiler optimization should not be involved.
Since each calculated time was very small, I changed the way I benchmark: I now time all 300 iterations and then divide by 300. I get the same result.
About caching: the code does some image cross-correlation, with different images at each iteration. The processing steps are not modified by the data in the images, so I don't think caching is the problem.
I think I figured it out.
If I add a Sleep(3000) before running the tests, they give the same result.
I think it has something to do with the loading of miscellaneous DLLs. In the debugger, the DLLs were loaded before any code was executed. Outside the debugger, the DLLs were loaded on demand, and one or more were loaded after the timer was started.
Thanks all.
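For reference, here is a sketch of the measurement pattern described above, using a warm-up call instead of Sleep(3000) so that DLL loading and other first-run costs fall outside the timed region (do_work is just a stand-in for the cross-correlation step):

#include <windows.h>
#include <cstdio>
#include <cmath>

// Placeholder for the image cross-correlation step being benchmarked.
static double do_work()
{
    double acc = 0.0;
    for (int i = 0; i < 100000; ++i)
        acc += std::sin(i * 0.001);
    return acc;
}

int main()
{
    do_work();   // warm-up: forces on-demand DLL loading / first-run costs now

    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);

    const int iterations = 300;
    double sink = 0.0;
    for (int i = 0; i < iterations; ++i)
        sink += do_work();          // keep the result so the work isn't elided

    QueryPerformanceCounter(&stop);
    double totalMs = (stop.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;
    std::printf("average per iteration: %.3f ms (sink=%f)\n",
                totalMs / iterations, sink);
    return 0;
}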
I don't think anyone has mentioned this yet, but the debug build may not only affect the way your code executes, but also the way the timer itself executes. This can lead to the timer being inaccurate / slower / definitely not reliable. I would recommend using a profiler as others have mentioned, and compare only similar configurations.
You are likely to get very erroneous results by doing it this way ... you should be using a profiler. You should read this article entitled The Perils of MicroBenchmarking:
http://blogs.msdn.com/shawnhar/archive/2009/07/14/the-perils-of-microbenchmarking.aspx
It's probably a compiler optimization that's actually making your code worse. This is extremely rare these days but if you're doing odd, odd stuff, this can happen.
Some debugger / IDEs like Visual Studio will automatically zero out memory for you in Debug mode; this may be a contributing factor.
Are you running the exact same code in the debugger and outside the debugger, or running a debug build in the debugger and a release build outside? If the latter, the code isn't the same. If you're comparing debug and release and seeing the difference, you could turn off optimization in the release build and see what that does, or run your code in a profiler in both debug and release and see what changes.
The debug version usually initializes variables for you (to zero or a recognizable fill pattern), while a release binary does not initialize variables unless the code does so explicitly. This may affect what the code is doing: the size of a loop, or a whole host of other possibilities.
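A hedged illustration of how this can change behaviour between builds (not your code, just an assumed example):

#include <cstdio>

int main()
{
    int limit;                       // never initialized: undefined behaviour
    int count = 0;
    for (int i = 0; i < limit; ++i)  // debug fill patterns vs. release garbage
        ++count;                     // can make this loop run very differently
    std::printf("count = %d\n", count);
    return 0;
}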
Set the warning level to the highest level (level 4, default 3).
Set the flag that says treat warnings as errors.
Recompile and re-test.
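With the Microsoft compiler this corresponds, as far as I know, to the /W4 and /WX switches (or the equivalent project properties), for example (the file name is just a placeholder):

cl /W4 /WX /Od /RTCs /EHsc receiver.cpp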
Before you dive into an optimization session get some facts:
does it make a difference? Does this application run twice as slow when measured over a reasonable length of time?
how are the debug and release builds configured?
what is the state of this project? Is it complete software, or are you profiling a single function?
how are you running the debug and release builds? Are you sure you are testing under the same conditions (e.g. process priority settings)?
suppose you do optimize the code, what do you have in mind?
Having read your additional data, a distant bell started to ring...
When running a program in the debugger, it will catch both C++ exceptions and structured exceptions (Windows SEH).
One event that will trigger a structured exception is a divide by zero. It is possible that the debugger quickly catches and dismisses this event (as first-chance exception handling), while outside the debugger the code takes a bit longer before doing something about it.
So if your code might be generating such or similar exceptions, it is worthwhile to look into it.
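If you suspect that is what is happening, a small assumed example of raising and catching such a structured exception (an integer divide by zero) with MSVC's __try/__except looks like this:

#include <windows.h>
#include <cstdio>

int main()
{
    __try {
        volatile int zero = 0;
        volatile int x = 1 / zero;   // raises EXCEPTION_INT_DIVIDE_BY_ZERO
        (void)x;
    }
    __except (EXCEPTION_EXECUTE_HANDLER) {
        // Outside a debugger this handler runs directly; under a debugger you
        // will see a first-chance exception notification first.
        std::printf("caught structured exception 0x%08lX\n", GetExceptionCode());
    }
    return 0;
}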