CUDA architecture sm_11 compile issue in Nsight - C++

Since my GPU (Quadro FX 3700) doesn't support architectures above sm_11, I was not able to use relocatable device code (rdc). Hence I combined all the utilities needed into one large file (say x.cu).
To give an overview of x.cu: it contains 2 classes with 5 member functions each, 20 device functions, 1 global kernel, and 1 kernel caller function.
Now, when I try to compile via Nsight it just hangs, showing Build% as 3.
When I try compiling using
nvcc x.cu -o output -I"."
it shows the following messages and compiles only after a long time:
/tmp/tmpxft_0000236a_00000000-9_Kernel.cpp3.i(0): Warning: Olimit was exceeded on function _Z18optimalOrderKernelPdP18PrepositioningCUDAdi; will not perform function-scope optimization.
To still perform function-scope optimization, use -OPT:Olimit=0 (no limit) or -OPT:Olimit=45022
/tmp/tmpxft_0000236a_00000000-9_Kernel.cpp3.i(0): Warning: To override Olimit for all functions in file, use -OPT:Olimit=45022
(Compiler may run out of memory or run very slowly for large Olimit values)
where optimalOrderKernel is the global kernel. Since compiling shouldn't take this much time, I want to understand the reason behind these messages, particularly Olimit.

Olimit is pretty clear, I think. It is a limit the compiler places on the amount of effort it will expend on optimizing code.
Most codes compile just fine using nvcc. However, no compiler is perfect, and some seemingly innocuous codes can cause the compiler to spend a long time at an optimization process that would normally be quick.
Since you haven't provided any code, I'm speaking in generalities.
Since there is the occasional case where the compiler spends a disproportionately long time in certain optimization phases, the Olimit acts as a watchdog on an optimization process that is taking too long: it gives you some idea of why the compile is slow, and when it is exceeded, certain optimization steps are aborted and a "less optimized" version of your code is generated instead.
I think the compiler messages you received are quite clear on how to modify the Olimit depending on your intentions. You can override it to increase the watchdog period, or disable it entirely (by setting it to zero). In that case, the compile process could take an arbitrarily long period of time, and/or run out of memory, as the messages indicate.
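For reference, on sm_1x targets these -OPT options belong to the Open64 front end (nvopencc), so they usually have to be forwarded through nvcc rather than passed on the nvcc command line directly. The exact forwarding switch depends on your toolkit version (check nvcc --help; -Xopencc is the one I would expect here), but the invocation would look roughly like:
nvcc x.cu -o output -I"." -Xopencc -OPT:Olimit=0
Setting Olimit=0 removes the limit entirely, with the caveats about memory use and compile time given in the warning.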

Related

C++ binary code JIT optimizations via recompiling parts based on run-time information

Performance is very critical for my C++ application (an in-memory database). I have tried using JIT at runtime, meaning compiling and optimizing parts of the vanilla C++ code at runtime. The compilation times from scratch are very non-deterministic and the overall gains are capped. It does give good performance, but I am hoping I can extract more.
My idea is to start with a release/optimized binary produced at compile time, and then optimize it at run time when data-based optimizations turn out to be possible. The performance of such a binary should be the same as the native binary when no JIT optimizations are done. If the data is too large, or if dynamic optimizations are possible, then I am willing to spend some time running optimization passes on the build. Conceptually, I would be saving the cost of the first compilation pass that turns the code into a runnable module.
Maybe I am talking at a high level, but this makes most sense when that first compilation pass is time consuming, because converting C++ source text to unoptimized machine code needs lots of memory allocations, while the time taken by the subsequent optimization passes is small in comparison. Do you have any pointers or resources?

Get the execution time of each line of code in C++

In my current project, some part of the code takes more than 30 minutes to complete. I found that the clock function would be a good choice for measuring a method's execution time, but is there any other way to find the line of code that takes the most time? Otherwise I would have to wrap every method with clock calls, which would be a complex process for me because this is a really gigantic project and it would take forever.
The proper way to do it is profiling. This will give you pretty useful information on a per-function basis - where the code spends most of its time, which function is called most often, etc. There are profilers available at the compiler level (gcc has an option to enable one) or you can use third-party ones. Unfortunately profiling itself affects the performance of the program, and you may see different timings with the profiler enabled than in the real program, but it is usually a good starting point.
As for measuring the execution time of every line of code, that is not practical. First of all, not every line produces executable code, especially after the optimizer has run. On the other hand, it is pretty useless to optimize code that was not compiled with optimization enabled.
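If you go the compiler-level route, a minimal sketch with GCC and gprof looks like this (the file and program names are placeholders; your real build will differ):
g++ -O2 -pg -o myprog main.cpp          # -pg adds profiling instrumentation
./myprog                                # run normally; this writes gmon.out
gprof ./myprog gmon.out > profile.txt   # per-function flat profile and call graph
The flat profile then shows which functions the program spends most of its time in, which is usually a much better lead than per-line timing.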

Can LD_BIND_NOW make the executable run slower?

I am curious: if an executable is so poorly written that it has a lot of dead code, referring to thousands of functions in external .so files while only hundreds of those functions are actually called at runtime, will LD_BIND_NOW=1 be worse than leaving LD_BIND_NOW unset? Because the Procedure Linkage Table will contain 900-odd useless function addresses? Worse in the sense of memory footprint and performance (as I don't know whether the lookup is O(n)).
I am trying to see whether setting LD_BIND_NOW to 1 will help (by comparing to LD_BIND_NOW not set):
1. a program that runs 24x5, in terms of latency;
2. saving 1 microsecond is considered big in my case, as the code paths executed during the lifetime of the program mainly process incoming messages from TCP/UDP/shared memory and then do some computation on them;
all these code paths take very little time (e.g. < 10 microseconds) and are run millions of times
Whether LD_BIND_NOW=1 helps the startup time doesn't matter to me.
"saving 1 microsecond is considered big in my case as the executions by the program are all short (e.g. < 10 micro)"
This is unlikely (or you mean something else). A typical call to execve(2) - the system call used to start programs - usually lasts several milliseconds. So it is rare (and practically impossible) for a program to execute (from execve to _exit(2)) in microseconds.
I suggest that your program should not be started more than a few times per second. If indeed the entire program is very short lived (so its process lasts only a fraction of a second), you could consider some other approach (perhaps making a server running those functions).
LD_BIND_NOW will affect (and slow down) the start-up time (e.g. the work done by the dynamic linker ld-linux(8)). It should not matter (except for cache effects) for the steady-state execution time of some event loop.
See also the references in this related answer (to a different question); they contain detailed explanations relevant to your question.
In short, the setting of LD_BIND_NOW will not significantly affect the time needed to handle each incoming message in a tight event loop.
Calling functions in shared libraries (containing position-independent code) might be slightly slower (by a few percent at most, and probably less on x86-64) in some cases. You could try static linking, and you might even consider link-time optimization (i.e. compiling and linking all your code - main programs and static libraries - with -flto -O2 if using GCC).
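A rough sketch of that LTO build with GCC (file names are placeholders; add -static at link time only if static versions of all the libraries you need are available):
g++ -flto -O2 -c main.cpp
g++ -flto -O2 -c util.cpp
g++ -flto -O2 -o myprog main.o util.o   # optimization runs again at link time, across translation units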
You might have accumulated technical debt, and you could need major code refactoring (which takes a lot of time and effort, that you should budget).

gcc/C++: If CPU load is low, then code optimization is of little use, true?

My colleague likes to build production binaries with gcc '-g -O0' because debugging is easy if a core dump happens. He says there is no need to use compiler optimization or to tweak the code because he finds that the process in production does not have a high CPU load, e.g. around 30%.
I asked him the reason behind that and he told me: if the CPU load is not high, the bottleneck must not be our code's performance, but some I/O (disk/network). So using gcc -O2 is of no use for improving latency or throughput. That also indicates we don't have much to improve in the code, because the CPU is not a bottleneck. Is that correct?
About CPU usage ~ optimisation
I would expect most optimisation problems in a program to correlate to higher-than-usual CPU load, because we say that a sub-optimal program does more than it theoretically needs to. But "usual" here is a complicated word. I don't think you can pick a hard value of system-wide CPU load percentage at which optimisation becomes useful.
If my program reallocates a char buffer in a loop, when it doesn't need to, my program might run ten times slower than it needs to, and my CPU usage may be ten times higher than it needs to be, and optimising the function may yield ten-fold increases in application performance … but the CPU usage may still only be 0.5% of the whole system capacity.
Even if we were to choose a CPU load threshold at which to begin profiling and optimising, on a general-purpose server I'd say that 30% is far too high. But it depends on the system, because if you're programming for an embedded device that only runs your program, and has been chosen and purchased because it has just enough power to run your program, then 30% could be relatively low in the grand scheme of things.
Further still, not all optimisation problems will indeed have anything to do with higher-than-usual CPU load. Perhaps you're just waiting in a sleep longer than you actually need to, causing message latency to increase but substantially reducing CPU usage.
tl;dr: Your colleague's view is simplistic, and probably doesn't match reality in any useful way.
About build optimisation levels
Relating to the real crux of your question, though, it's fairly unusual to deploy a release build with all compiler optimisations turned off. Compilers are designed to emit pretty naive code at -O0, and to do the sort of optimisations that are pretty much "standard" in 2016 at -O1 and -O2. You're generally expected to turn these on for production use, otherwise you're wasting a huge portion of a modern compiler's capability.
Many folks also tend not to use -g in a release build, so that the deployed binary is smaller and easier for your customers to handle. You can drop a 45MB executable to 1MB by doing this, which is no pocket change.
Does this make debugging more difficult? Yes, it can. Generally, if a bug is located, you want to receive reproduction steps that you can then repeat in a debug-friendly version of your application and analyse the stack trace that comes out of that.
But if the bug cannot be reproduced on demand, or it can only be reproduced in a release build, then you may have a problem. It may therefore seem reasonable to keep basic optimisations on (-O1) but also keep debug symbols in (-g); the optimisations themselves shouldn't vastly hinder your ability to analyse the core dump provided by your customer, and the debug symbols will allow you to correlate the information to source code.
That being said, you can have your cake and eat it too (a sketch of the commands is shown after this list):
Build your application with -O2 -g
Copy the resulting binary
Perform strip on one of those copies, to remove the debug symbols; the binaries will otherwise be identical
Store them both forever
Deploy the stripped version
When you have a core dump to analyse, debug it against your original, non-stripped version
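A hedged sketch of those steps with the usual GNU tools (the binary and file names are placeholders):
g++ -O2 -g -o myapp main.cpp   # one optimized build, with debug info
cp myapp myapp.debug           # keep the full copy somewhere safe, forever
strip myapp                    # deployable copy, identical apart from the symbols
# later, when a customer sends a core dump produced by the stripped binary:
gdb ./myapp.debug core         # analyse it against the unstripped copy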
You should also have sufficient logging in your application to be able to track down most bugs without needing any of this.
Under certain circumstances he could be correct, and under others mostly incorrect.
Assume the program runs for 1 s: the CPU is busy for 0.3 s and waiting for something else for 0.7 s. If you optimized the code and got, say, a 100% improvement (twice as fast), the CPU would complete in 0.15 s what previously took 0.3 s, and the task would complete in 0.85 s instead of 1 s (given that the wait for something else takes the same time).
However, in a multicore situation the CPU load is sometimes defined as the fraction of total processing power being used. So if one core runs at 100% and two others are idle, the overall CPU load is 33%; in such a scenario a 30% CPU load may simply mean the program is only able to make use of one core. In that case optimizing the code could improve performance drastically.
Note that sometimes what is thought to be an optimization is actually a pessimization - that's why it is important to measure. I've seen a few "optimizations" that reduced performance. Also, optimizations sometimes alter behaviour (in particular when you "improve" the source code), so you should make sure nothing breaks by having proper tests. After doing performance measurements you should decide whether it is worth trading debuggability for speed.
A possible improvement might be to compile with gcc -Og -g using a recent GCC. The -Og optimization is debugger-friendly.
Also, you can compile with gcc -O1 -g; you get many (simple) optimizations, so performance is usually 90% of -O2 (with of course some exceptions, where even -O3 matters). And the core dump is usually debuggable.
And it really depends upon the kind of software and the required reliability and ease of debugging. Numerical code (HPC) is quite different from small database post-processing.
Finally, using -g3 instead of -g might help (e.g. gcc -Wall -O1 -g3).
BTW, synchronization issues and deadlocks might be more likely to appear in optimized code than in non-optimized code.
It's really simple: CPU time is not free. We like to think that it is, but it's patently false. There are all sorts of magnification effects that make every cycle count in some scenarios.
Suppose that you develop an app that runs on a million mobile devices. Every second your code wastes adds up to 1-2 years' worth of continuous use of a 4-core device. Even at 0% CPU utilization, wall-time latency costs you backlight time, and that is not to be ignored either: the backlight uses about 30% of a device's power.
Suppose that you develop an app that runs in a data center. Every 10% of the core that you're using is what someone else won't be using. At the end of the day, you've only got so many cores on a server, and that server has power, cooling, maintenance and amortization costs. Every 1% of CPU usage has costs that are simple to determine, and they aren't zero!
On the other hand: developer time isn't free, and every second of a developer's attention requires commensurate energy and resource inputs just to keep him or her alive, fed, well and happy. Yet in this case all the developer needs to do is flip a compiler switch. I personally don't buy the "easier debugging" myth. Modern debugging information is expressive enough to capture register use, value liveness, code replication and so on. Optimizations don't really get in the way the way they did 15 years ago.
If your business has a single, underutilized server, then what the developer is doing might be OK, practically speaking. But all I see here really is an unwillingness to learn how to use the debugging tools or proper tools to begin with.

Gprof: specific function time [duplicate]

I want to find out the time spent in a particular function of my program. For that purpose I am using gprof. I used the following command to get the time for the specific function, but the log file still shows the results for all the functions in the program. Please help me in this regard.
gprof -F FunctionName Executable gmon.out>log
You are nearly repeating another question about function execution time.
As I answered there, it is difficult (due to the hardware!) to reliably get the execution time of some particular function, especially if that function takes little time (e.g. less than a millisecond). Your original question pointed to these methods.
I would suggest using clock_gettime(2) with CLOCK_REALTIME or perhaps CLOCK_THREAD_CPUTIME_ID.
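If you take that route, here is a minimal sketch on Linux (functionOfInterest is a stand-in for whatever you want to time; on older glibc you may need to link with -lrt):
#include <time.h>    // clock_gettime (POSIX)
#include <cstdio>

// Stand-in for the function whose execution time you want to measure.
void functionOfInterest()
{
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i)
        x += i * 0.5;
}

int main()
{
    timespec start, end;
    // CLOCK_THREAD_CPUTIME_ID measures CPU time consumed by this thread;
    // CLOCK_REALTIME (or CLOCK_MONOTONIC) gives wall-clock time instead.
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
    functionOfInterest();
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
    double seconds = double(end.tv_sec - start.tv_sec)
                   + double(end.tv_nsec - start.tv_nsec) / 1e9;
    std::printf("functionOfInterest took %.9f s\n", seconds);
}
The caveats discussed below still apply: for a very short function, caches, scheduling and the measurement itself can dominate the number you get.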
gprof(1) (after compilation with -pg) works with profil(3) and uses a sampling technique, based on a SIGPROF signal (see signal(7)) delivered at periodic intervals (e.g. every 10 milliseconds) from a timer set with setitimer(2) and ITIMER_PROF; so the program counter is sampled periodically. Read the Wikipedia page on gprof and note that profiling may significantly degrade the running time.
If your function gets executed in a short time (less than a millisecond) the profiling gives an imprecise measurement (read about heisenbugs).
In other words, profiling and measuring the time of a short-running function alters the behavior of the program (and this would happen on other OSes too!). You might have to give up the goal of measuring the timing of your function precisely, reliably and accurately without disturbing it. It might not even make precise sense, e.g. because of the CPU cache.
You could use gprof without any -F argument and, if needed, post-process the textual profile output (e.g. with GNU awk) to extract the information you want.
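For instance, something along these lines (the exact pattern depends on how the function name appears, mangled or demangled, in the flat profile):
gprof Executable gmon.out > full_profile.txt
awk '/FunctionName/' full_profile.txt > log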
BTW, the precise timing of a particular function might not be important. What is important is the benchmarking of the entire application.
You could also ask the compiler to optimize your program even more; if you are using link-time optimization, i.e. compiling and linking with g++ -flto -O2, the notion of the timing of a small function may even cease to exist (because the compiler and the linker could have inlined it without you knowing).
Consider also that current superscalar processors have such a complex micro-architecture - with instruction pipelines, caches, branch predictors, register renaming, speculative execution, out-of-order execution, etc. - that the very notion of timing a short function is ill-defined; you cannot reliably predict or measure it.