Measuring CUDA Allocation time - c++

I need to measure the time difference between allocating normal CPU memory with new and a call to cudaMallocManaged. We are working with unified memory and are trying to figure out the trade-offs of switching things to cudaMallocManaged. (The kernels seem to run a lot slower, likely due to a lack of caching or something.)
Anyway, I am not sure the best way to time these allocations. Would one of boost's process_real_cpu_clock, process_user_cpu_clock, or process_system_cpu_clock give me the best results? Or should I just use the regular system time call in C++11? Or should I use the cudaEvent stuff for timing?
I figure that I shouldn't use the CUDA events, because they are for timing GPU processes and would not be accurate for timing CPU calls (correct me if I am wrong there). If I could use the CUDA events on just the cudaMallocManaged one, what would be most accurate to compare against when timing the new call? I just don't know enough about memory allocation and timing. Everything I read seems to just make me more confused due to boost's and nvidia's shoddy documentation.

You can use CUDA events to measure the time of functions executed in the host.
cudaEventElapsedTime computes the elapsed time between two events (in milliseconds with a resolution of around 0.5 microseconds).
Read more at: http://docs.nvidia.com/cuda/cuda-runtime-api/index.html
In addition, if you are also interested in timing your kernel execution time, the CUDA event API covers that case too: cudaEventSynchronize blocks the host until the recorded event, and therefore any preceding asynchronous work such as a kernel call, has completed, so the elapsed time you read back is meaningful.
In any case, you should use the same metrics (always CUDA events, or boost, or your own timing) to ensure the same resolution and overhead.
The profiler `nvprof` shipped with the CUDA Toolkit may also help you understand and optimize the performance of your CUDA application.
Read more at: http://docs.nvidia.com/cuda/profiler-users-guide/index.html
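For instance, a minimal sketch of that approach (my own illustration, not from the linked docs; error checking omitted and the allocation size arbitrary), timing cudaMallocManaged with CUDA events and new with std::chrono:
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 24;                // arbitrary example size (~16M floats)

    // Time cudaMallocManaged (a host-side call) with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float* managed = nullptr;
    cudaEventRecord(start);
    cudaMallocManaged(&managed, n * sizeof(float));
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // block until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // milliseconds, ~0.5 us resolution
    printf("cudaMallocManaged: %f ms\n", ms);

    // Time plain new[] with std::chrono for comparison.
    auto t0 = std::chrono::high_resolution_clock::now();
    float* host = new float[n];
    auto t1 = std::chrono::high_resolution_clock::now();
    printf("new[]: %f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    delete[] host;
    cudaFree(managed);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
Following the advice above about keeping metrics consistent, for the actual comparison you would wrap both allocations with the same mechanism rather than mixing the two as this sketch does.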

I recommend:
auto t0 = std::chrono::high_resolution_clock::now();
// what you want to measure
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration<double>(t1-t0).count() << "s\n";
This will output the difference in seconds represented as a double.
Allocation algorithms usually optimize themselves as they go along. That is, the first allocation is often more expensive than the second because caches of memory are created during the first in anticipation of the second. So you may want to put the thing you're timing in a loop, and average the results.
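For instance, a minimal sketch of that loop-and-average idea (the allocation size and iteration count are arbitrary):
#include <chrono>
#include <iostream>

int main() {
    constexpr int iterations = 1000;           // arbitrary; pick enough to smooth out noise
    std::chrono::duration<double> total{0};

    for (int i = 0; i < iterations; ++i) {
        auto t0 = std::chrono::high_resolution_clock::now();
        char* p = new char[1 << 20];           // the thing you want to measure (1 MiB here)
        auto t1 = std::chrono::high_resolution_clock::now();
        delete[] p;
        total += t1 - t0;
    }

    std::cout << "average: " << total.count() / iterations << "s\n";
}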
Some implementations of std::chrono::high_resolution_clock have been less than spectacular, but are improving with time. You can assess your implementation with:
auto t0 = std::chrono::high_resolution_clock::now();
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration<double>(t1-t0).count() << "s\n";
That is, how fast can your implementation get the current time? If it is slow, two successive calls will demonstrate a large time in-between. On my system (at -O3) this outputs on the order of:
1.2e-07s
which means I can time something that takes on the order of 1 microsecond. To get a finer measurement than that I have to loop over many operations, and divide by the number of operations, subtracting out the loop overhead if that would be significant.
If your implementation of std::chrono::high_resolution_clock appears to be unsatisfactory, you may be able to build your own chrono clock along the lines of this. The disadvantage is obviously a bit of non-portable work. However you get the std::chrono duration and time_point infrastructure for free (time arithmetic and units conversion).
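If you do go down that path, the clock itself is mostly boilerplate. A sketch of what such a clock could look like, assuming Linux and clock_gettime(CLOCK_MONOTONIC_RAW) (the raw_monotonic_clock name is mine):
#include <chrono>
#include <cstdint>
#include <ctime>

// A chrono-compatible clock built on clock_gettime(CLOCK_MONOTONIC_RAW).
struct raw_monotonic_clock {
    using rep        = std::int64_t;
    using period     = std::nano;
    using duration   = std::chrono::duration<rep, period>;
    using time_point = std::chrono::time_point<raw_monotonic_clock>;
    static constexpr bool is_steady = true;

    static time_point now() noexcept {
        timespec ts;
        clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
        return time_point(duration(static_cast<rep>(ts.tv_sec) * 1000000000 + ts.tv_nsec));
    }
};
All the std::chrono duration and time_point arithmetic then works with raw_monotonic_clock::now() exactly as it does with the standard clocks.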

Related

Incorrect measurement of the code execution time inside OpenMP thread

So I need to measure the execution time of some code inside a for loop. Originally, I needed to measure several different activities, so I wrote a timer class to help me with that. After that I tried to speed things up by parallelising the for loop using OpenMP. The problem is that when running my code in parallel my time measurements become really different - the values increase by up to a factor of about 10. So to rule out a flaw inside the timer class I started to measure the execution time of the whole loop iteration, so structurally my code looks something like this:
#pragma omp parallel for num_threads(20)
for (size_t j = 0; j < entries.size(); ++j)
{
    auto t1 = std::chrono::steady_clock::now();
    // do stuff
    auto t2 = std::chrono::steady_clock::now();
    std::cout << "Execution time is "
              << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count()
              << std::endl;
}
Here are some examples of difference between measurements in parallel and measurements in single thread:
Single-threaded     Multi-threaded
11.363545868021     94.154685442
4.8963048650184     16.618173163
4.939025568         25.4751074
18.447368772        110.709813843
Even though these are only a couple of examples, this behaviour seems to hold for all loop iterations. I also tried boost's chrono library and thread_clock but got the same result. Am I misunderstanding something? What may be the cause of this? Maybe I'm getting the cumulative time of all threads?
Inside the for loop, during each iteration I read a different file. Based on this file I create and solve a multitude of mixed-integer optimisation models. I solve them with a MIP solver, which I set to run in one thread. The instance of the solver is created on each iteration. The only variables that are shared between iterations are constant strings which represent paths to some directories.
My machine has 32 threads (16 cores, 2 threads per core).
Also here are the timings of the whole application in single-threaded mode:
real 23m17.763s
user 21m46.284s
sys 1m28.187s
and in multi-threaded mode:
real 12m47.657s
user 156m20.479s
sys 2m34.311s
A few points here.
What you're measuring corresponds (roughly) to what time reports as the user time, that is, the total CPU time consumed by all threads. But when we look at the real time reported by time, we see that your multithreaded code is running close to twice as fast as the single-threaded code. So it is scaling to some degree, but not very well.
Reading a file in the parallel region may well be part of this. Even at best, the fastest NVMe SSDs can only support reading from a few (e.g., around three or four) threads concurrently before you're using the drive's entire available bandwidth (and if you're doing I/O efficiently, that number may well be closer to 2). If you're using an actual spinning hard drive, it's usually pretty easy for a single thread to saturate the drive's bandwidth. A PCIe 5 SSD should keep up with more threads, but I kind of doubt even it has the bandwidth to feed 20 threads.
Depending on what parts of the standard library you're using, it's pretty easy to have some "invisible" shared variables. For one common example, code that uses Monte Carlo methods will frequently have calls to rand(). Even though it looks like a normal function call, rand() will typically end up using a seed variable that's shared between threads, and every call to rand() not only reads but also writes to that shared variable, so the calls to rand() all end up serialized.
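If something like rand() does turn out to be the culprit, one common remedy (a sketch, not specific to the poster's solver code) is to give each thread its own generator so there is no shared state to serialize on:
#include <random>

// Each thread gets its own engine, so there is no hidden shared seed to contend for.
double uniform01() {
    thread_local std::mt19937_64 engine{std::random_device{}()};
    thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(engine);
}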
You mention your MIP solver running in a single thread and say there's a separate instance per iteration, which leaves it unclear whether the MIP-solving code is really one thread shared between the 20 OpenMP threads, or whether you have one single-threaded solver instance running in each of the 20 threads. I'd guess the latter, but if it's really the former, then its being a bottleneck wouldn't seem surprising at all.
Without code to look at, it's impossible to get really specific though.

A deterministic execution time measure

Some algorithms depend on a time measure. E.g., 10% of the time, follow approach A. If that does not work, follow B for 20% of the time. If that does not work, do C.
Measuring execution time in seconds is non-deterministic. Cache state, interleaving non-user tasks on a core, or even simply the dynamic boost of a modern processor's clock speed are external influences that alter the execution time of otherwise deterministic code. Hence, the algorithm might behave non-deterministically if classic execution time measures are used.
To keep the algorithm behaving deterministically, I'm looking for a deterministic way to measure execution time. This is possible, e.g., the CPLEX solver has a deterministic time measure called ticks.
I know this simple question does not have a simple answer. So let me narrow it down a little:
The determinism property is a hard constraint. I'd rather have a measure that only very weakly correlates with measured execution time, as long as it is deterministic.
Ideally, the deterministic time measure measures the whole program execution, including statically compiled libraries. But if this is not possible, then measuring the execution time of the source code I can modify is fine.
I'm willing to take a 100% performance hit, but not more. Less of a performance hit would be better though :)
It's ok if the compiled binary is no longer portable among different CPU models.
Some approaches I have considered, but don't know how hard they are to implement or how well they will work:
modifying a compiler to emit an instruction that increments a global counter after each instruction in the compiled code. This seems like the most principled approach, and may in theory even work for statically compiled libraries.
counting the number of memory accesses. No idea how to do this in practice. Probably also by modifying a compiler?
counting the number of if-statements and loop condition checks using a global counter in the source code. This can be done easily by, e.g., macros, but it will overlook many library calls (e.g., a simple call to sort a vector will not increase the counter), and hence may not correlate much with the actual execution time.
accessing hardware performance counters to, e.g., count the number of instructions of a process, perhaps through a library such as PAPI. The problem here is that I think these counters are non-deterministic as well?
So, how to deterministically measure the execution time of a program?
Edit: measuring CPU time (e.g. with the clock() function) is definitely better than my naive wall-clock-time examples. However, measuring CPU time is by no means deterministic: runs of the same deterministic program will yield different CPU times. I'm really looking for a deterministic measure (or a measure of "work done", as @mevets calls it).
You can access process CPU time (the processor time consumed by your process, counted in clock ticks) instead of wall clock time (time elapsed, including any other processes that context-switched in between) by calling the C standard library function clock(). There are CLOCKS_PER_SEC clock ticks in one second. Note that this may advance faster than wall clock time if your program is multithreaded, i.e., it accumulates the processor time consumed by the program over all cores; CLOCKS_PER_SEC ticks therefore correspond to one second of compute time on one core. To implement the switching between methods, you could use asynchronous I/O (such as with newfangled C++20 coroutines, or Boost coroutines) and check process time occasionally, or you could set up timed software interrupts that set a flag which is picked up by the main thread of execution, which then switches to a new method.
You probably don't want to increment a counter after each instruction. That creates enormous compute overhead, gums up your processor pipeline (every other instruction now depends on the counter value written two instructions earlier), and bloats your instruction cache.
Code example (POSIX):
#include <atomic>
#include <csignal>
#include <cstddef>
#include <ctime>

static /* possibly thread_local */ std::atomic<int> method;

void interrupt_handler(int /*signal_code*/) {
    method.fetch_add(1);
}

void calculation(/* input */) {
    auto prev_signal_handler = std::signal(SIGINT, &interrupt_handler);
    try {
        method.store(0);
        int prev_method = 0;
        // Schedule one CPU-time timer per switchover point; each raises SIGINT
        // after the given amount of thread CPU time has elapsed.
        for (size_t num_ns : /* list of times, in ns */) {
            timer_t t_id;
            sigevent ev{};
            ev.sigev_notify = SIGEV_SIGNAL;      // deliver a signal when the timer fires
            ev.sigev_signo = SIGINT;
            ev.sigev_value.sival_ptr = &t_id;
            timer_create(CLOCK_THREAD_CPUTIME_ID, &ev, &t_id);

            itimerspec t_spec{};
            t_spec.it_interval.tv_sec = t_spec.it_value.tv_sec = num_ns / 1000000000;
            t_spec.it_interval.tv_nsec = t_spec.it_value.tv_nsec = num_ns % 1000000000;
            timer_settime(t_id, 0, &t_spec, nullptr);
        }
        bool done = false;
        while (!done) {
            int current_method = method.load();
            if (current_method != prev_method) {
                // switch method
            }
            else {
                // continue using current method
            }
        }
    }
    catch (...) {
        std::signal(SIGINT, prev_signal_handler);
        throw;
    }
    std::signal(SIGINT, prev_signal_handler);
}
You're mired in detailed solutions that would change the code extensively, probably because those are the only approaches you're familiar with, but this is IMHO short-sighted. You cannot at this point know for sure that instrumenting the generated code in such an invasive way has merit. Let's step back for a minute.
Some algorithms depend on a time measure. E.g., 10% of the time, follow approach A. If that does not work, follow B for 20% of the time. If that does not work, do C.
I don't think that's true. It's an arbitrary constraint that's not general at all. The algorithms depend on "effort", and real time is often a very poor substitute for effort. As you have well stated, any sort of "time" is mired in architectural specifics.
Another problem is the assumption that the algorithms are the units of change. They generally are not, i.e. you don't have as much control here as you think you do, unless you code all the numerical parts in assembly or thoroughly audit the generated code. Each algorithm, if it succeeds, may produce slightly different results depending on numerical error stack-ups caused by the architecture-dependent code paths selected at runtime. It's a thing: compilers and/or their runtime libraries do plenty of that! So the assumption that running the same compiled floating-point code on various PCs will produce bit-identical results holds only until you actually rely on it; in reality it will prove incorrect at some later time, when you'll be too deep into the project to realistically implement the huge changes needed for a fix.
But inside the algorithm you should have plenty of natural points where you can increment a counter (not too often) and use the value of that counter as a measure of the effort your algorithm has expended. It doesn't matter much that such a measure has a different scaling factor to "real time" for each algorithm, because real time is not the true requirement here. All you really want is some deterministic way to carry out the decision to switch algorithms, and you can roughly calibrate these switchover points to real time once and keep that calibration frozen: the exact values don't really matter, only that you can clearly decide when to switch.
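A sketch of that counter idea (the EffortBudget name, the charge points, and the limits are all illustrative, not from the answer):
#include <cstdint>

// Deterministic "effort" budget: the algorithm charges itself at coarse,
// well-defined points instead of consulting a wall clock.
struct EffortBudget {
    std::uint64_t spent = 0;
    std::uint64_t limit;
    explicit EffortBudget(std::uint64_t l) : limit(l) {}
    void charge(std::uint64_t units = 1) { spent += units; }
    bool exhausted() const { return spent >= limit; }
};

bool approach_a(EffortBudget& budget) {
    while (!budget.exhausted()) {
        budget.charge();   // e.g. one unit per outer iteration
        // ... do one deterministic chunk of work; return true if solved ...
    }
    return false;          // budget exhausted, caller falls through to approach B
}
The limit for each approach would be calibrated once against real time (say, roughly 10% and 20% of the overall budget) and then frozen, as described above.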
Furthermore, some caution is needed when an algorithm produces a result ("converges") very close to the effort threshold. Due to architectural differences, the exact effort required to achieve "convergence" against a fixed floating-point threshold may vary slightly between CPU generations. So instead of a hard cutoff you need some way of expressing hysteresis: if convergence happens close to the effort cut-off, some alternative criterion is used for either the threshold or the convergence test, but you'd need to do proper statistical modeling to show that the alternates are sufficiently reliable.
A counter can handle units of work, but is each unit of equal value (i.e., time)? The service clock(3) provides an approximate virtual time of execution, that is, time elapsed while your process is actually running, as opposed to real-world (wall) time.
Similarly, timer_create may accept clock ids such as CLOCK_PROCESS_CPUTIME_ID, which permits you to raise a signal after a certain amount of CPU time has passed. Provided your app can be arbitrarily interrupted without entering an undefined state, you could use this to switch from method 1 -> 2 -> 3.
Although this is better than counting blocks of work, you will need to accept a certain inaccuracy around the exact time to account for system overhead, cache contention, etc.

C++11 Most accurate way to pause execution for a certain amount of time? [duplicate]

This question already has answers here:
How to guarantee exact thread sleep interval?
(4 answers)
accurate sampling in c++
(2 answers)
Closed 4 years ago.
I'm currently working on some C++ code that reads from a video file, parses the video/audio streams into its constituent units (such as an FLV tag) and sends it back out in order to "restream" it.
Because my input comes from file but I want to simulate a proper framerate when restreaming this data, I am considering the ways that I might sleep the thread that is performing the read on the file in order to attempt to extract a frame at the given rate that one might expect out of typical 30 or 60 FPS.
One solution is to use an obvious std::this_thread::sleep_for call and pass in the amount of milliseconds depending on what my FPS is. Another solution I'm considering is using a condition variable, and using std::condition_variable::wait_for with the same idea.
I'm a little stuck, because I know that the first solution doesn't guarantee exact precision -- the sleep will last around as long as the argument I pass in but may in theory be longer. And I know that the std::condition_variable::wait_for call will require lock reacquisition which will take some time too. Is there a better solution than what I'm considering? Otherwise, what's the best methodology to attempt to pause execution for as precise a granularity as possible?
C++11 Most accurate way to pause execution for a certain amount of time?
This:
auto start = now();
while(now() < start + wait_for);
now() is a placeholder for the most accurate time measuring method available for the system.
This is to sleep what a spinlock is to a mutex. Like a spinlock, it will consume all the CPU cycles while pausing, but it is what you asked for: the most accurate way to pause execution. There is a trade-off between accuracy and CPU-usage efficiency: you must choose which is more important for your program.
why is it more accurate than std::this_thread::sleep_for?
Because sleep_for yields execution of the thread. As a consequence, it can never have better granularity than the process scheduler of the operating system has (assuming there are other processes competing for time).
The live loop shown above which doesn't voluntarily give up its time slice will achieve the highest granularity provided by the clock that is used for measurement.
Of course, the time slice granted by the scheduler will eventually run out, and that might happen near the time we should be resuming. The only way to reduce that effect is to increase the priority of our thread, and there is no standard way of affecting the priority of a thread in C++. The only way to get rid of the effect completely is to run on a non-multi-tasking system.
On multi-CPU systems, one trick you might want to use is to set the thread affinity so that the OS thread won't be migrated to other hardware threads, which would introduce latency. Likewise, you might want to set the affinity of your other threads so they stay off the time-measuring thread's core. There is no standard tool for setting thread affinity either.
Let T be the time you wish to sleep for and let G be the maximum time that sleep_for could possibly overshoot.
If T is greater than G, then it will be more efficient to use sleep_for for T - G time units, and only use the live loop for the final G - O time units (where O is the time that sleep_for was observed to overshoot).
Figuring out what G is for your target system can be quite tricky however. There is no standard tool for that. If you over-estimate, you'll waste more cycles than necessary. If you under-estimate, your sleep may overshoot the target.
In case you're wondering what is a good choice for now(), the most appropriate tool provided by the standard library is std::chrono::steady_clock. However, that is not necessarily the most accurate tool available on your system. What tool is the most accurate depends on what system you're targeting.
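Putting the sleep_for/live-loop combination into code, a hedged sketch (the 2 ms margin standing in for G is a guess that you would have to measure on your own system):
#include <chrono>
#include <thread>

// Sleep for most of the interval, then spin for the remainder.
// 'margin' plays the role of G: the worst observed oversleep of sleep_for.
void precise_sleep(std::chrono::nanoseconds t,
                   std::chrono::nanoseconds margin = std::chrono::milliseconds(2)) {
    auto deadline = std::chrono::steady_clock::now() + t;
    if (t > margin)
        std::this_thread::sleep_for(t - margin);   // coarse, scheduler-dependent part
    while (std::chrono::steady_clock::now() < deadline)
        ;                                          // busy-wait for the final stretch
}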

Should I make a large function atomic in order to benchmark it accurately?

I would like to know how long it takes to execute some code. The code I am executing deals with openCV matrices and operations. The code will be run in a ROS environment on Linux. I don't want the code to be interrupted by system functions during my benchmarking.
Looking at this post about benchmarking, the answerer said the granularity of the result is 15ms. I would like to do much better than that and so I was considering to make the function atomic (just for benchmarking purposes). I'm not sure if it is a good idea for a few reasons, primarily because I don't have a deep understanding of processor architecture.
void atomic_wrapper_function(const object& A, const object& B) {
    static unsigned long running_sum = 0;
    unsigned long before, after;
    before = GetTimeMs64();
    function_to_benchmark(A, B);
    after = GetTimeMs64();
    running_sum += (after - before);
}
The function I am trying to benchmark is not a short function.
Will the result be accurate? For marking the time I'm considering to use this function by Andreas Bonini.
Will it do something horrible to my computer? Call me superstitious but I think it's good to ask this question.
I'm using C++11 on the Linux Kernel.
C++11 atomics are not atomic in the RTOS sense; they just provide guarantees when writing multithreaded code. Linux is not an RTOS: your code can and will always be interrupted. There are some ways to lessen the effects, though, but not without diving very deeply into Linux.
You can, for example, configure the niceness to get interrupted less by other userspace programs. You can tell the kernel which CPU core should process interrupts, then pin your program to a different CPU. You can increase the timer precision, etc., but:
There are many other things that might change the runtime of your algorithm like several layers of CPU caches, power saving features of your CPU, etc... If you are really only interested in benchmarking the execution time of your function for non-hard realtime problems, it is easier to just run the algorithm many many times and get a statistical estimate for the execution time.
Call the function a billion times during the benchmark and average (a sketch of this follows below). OR
Benchmark the function from 1 time to a billion times. The measure for execution time you are interested in should scale linearly. Then do some kind of linear regression to get an estimate of that.
OR: you say that you want to know what influence the algorithm has on your total program runtime? Then use profiling tools like callgrind (which can be integrated into QtCreator).
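For the run-it-many-times option, a minimal illustration (the helper name and iteration count are arbitrary, and std::chrono::steady_clock stands in for GetTimeMs64):
#include <chrono>
#include <cstddef>

// Run the function many times and report the mean; noise from preemption,
// caches, and frequency scaling averages out over enough repetitions.
template <class F>
double mean_runtime_seconds(F&& f, std::size_t runs = 1000000) {
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < runs; ++i)
        f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / runs;
}
You would call it as mean_runtime_seconds([&]{ function_to_benchmark(A, B); }); and multiply back up if you want the total rather than the mean.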

Correct way to logging elapsed time in C++

I'm writing an article about GPU speed-up in a cluster environment.
To do that, I'm programming in CUDA, which is basically a C++ extension.
But, as I'm a C# developer, I don't know the particularities of C++.
Is there anything I should be concerned about when logging elapsed time? Any suggestions or blogs to read?
My initial idea is to make a big loop and run the program several times, 50 to 100, and log every elapsed time so that I can make some speed graphs afterwards.
Depending on your needs, it can be as easy as:
time_t start = time(NULL);
// long running process
printf("time elapsed: %ld\n", (long)(time(NULL) - start));
I guess you need to tell how you plan this to be logged (file or console) and what is the precision you need (seconds, ms, us, etc). "time" gives it in seconds.
I would recommend using the Boost timer library. It is platform agnostic, and is as simple as:
#include <boost/timer.hpp> // classic boost::timer, whose elapsed() returns seconds as a double
boost::timer t;
// do some stuff, up until when you want to start timing
t.restart();
// do the stuff you want to time.
std::cout << t.elapsed() << std::endl;
Of course t.elapsed() returns a double that you can save to a variable.
Standard functions such as time often have a very low resolution. And yes, a good way to get around this is to run your test many times and take an average. Note that the first few times may be extra-slow because of hidden start-up costs - especially when using complex resources like GPUs.
For platform-specific calls, take a look at QueryPerformanceCounter on Windows and CFAbsoluteTimeGetCurrent on OS X. (I've not used the POSIX call clock_gettime, but that might be worth checking out.)
Measuring GPU performance is tricky because GPUs are remote processing units running separate instructions - often on many parallel units. You might want to visit Nvidia's CUDA Zone for a variety of resources and tools to help measure and optimize CUDA code. (Resources related to OpenCL are also highly relevant.)
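One GPU-specific pitfall worth illustrating: kernel launches return immediately, so a host-side timer wrapped around a launch measures only the launch overhead unless you synchronize. A hedged sketch using CUDA events (the kernel and sizes are placeholders, error checking omitted):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;      // placeholder work
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummy_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);               // wait for the asynchronous launch to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}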
Ultimately, you want to see how fast your results make it to the screen, right? For that reason, a call to time might well suffice for your needs.