OpenCL time measurement issues with AMD GPU - c++

I recently compared two ways of measuring kernel runtime and I see some confusing results.
I use an AMD Bobcat CPU (E-350) with integrated GPU and Ubuntu Linux (CL_PLATFORM_VERSION is OpenCL 1.2 AMD-APP (923.1)).
The basic gettimeofday idea looks like this:
clFinish(...);                  // make sure everything previously queued on the command queue has finished
gettimeofday(&starttime, 0x0);
clEnqueueNDRangeKernel(...);    // enqueue the kernel and get an event back
clFlush(...);                   // push the command to the device
clWaitForEvents(...);           // block until the kernel's event has completed
gettimeofday(&endtime, 0x0);
This says the kernel needs around 5466 ms.
The second time measurement I did with clGetEventProfilingInfo for QUEUED / SUBMIT / START / END.
With the 4 time values I can calculate the time spent in the different states:
time spent queued: 0.06 ms,
time spent submitted: 2733 ms,
time spent in execution: 2731 ms (actual execution time).
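For reference, a minimal sketch of how these four timestamps can be queried, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and event is the completed event returned by clEnqueueNDRangeKernel:

#include <CL/cl.h>
#include <cstdio>

// Break an event's lifetime into its queued / submitted / executing phases.
// The timestamps are in nanoseconds on the device's own timer.
void print_event_times(cl_event event) {
    cl_ulong t_queued = 0, t_submit = 0, t_start = 0, t_end = 0;
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_QUEUED, sizeof(t_queued), &t_queued, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_SUBMIT, sizeof(t_submit), &t_submit, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,  sizeof(t_start),  &t_start,  NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,    sizeof(t_end),    &t_end,    NULL);

    std::printf("queued:    %.3f ms\n", (t_submit - t_queued) / 1.0e6);
    std::printf("submitted: %.3f ms\n", (t_start  - t_submit) / 1.0e6);
    std::printf("executing: %.3f ms\n", (t_end    - t_start)  / 1.0e6);
}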
I see that it adds up to the 5466 ms, but why does it stay in the submitted state for half the time?
And the funny things are:
the submitted state is always half of the actual execution time, even for different kernels or different workloads (so it can't be a constant setup time),
for the CPU the time spent in the submitted state is 0 and the execution time equals the gettimeofday result,
I tested my kernels on an Intel Ivy Bridge under Windows using CPU and GPU and I didn't see the effect there.
Does anyone have a clue?
I suspect that either the GPU runs the kernel twice (which would make the gettimeofday result double the actual execution time) or that clGetEventProfilingInfo is not working correctly for the AMD GPU.

I posted the problem in an AMD forum. They say it's a bug in the AMD profiler.
http://devgurus.amd.com/thread/159809

Related

How do I measure GPU time on Metal?

I want to see programmatically how much GPU time a part of my application consumes on macOS and iOS. On OpenGL and D3D I can use GPU timer query objects. I searched and couldn't find anything similar for Metal. How do I measure GPU time on Metal without using Instruments etc.? I'm using Objective-C.
There are a few problems with this method:
1) Most of the time, what you really want to know is the GPU-side latency within a command buffer, not the round trip to the CPU. This is better measured as the time difference between running 20 instances of the shader and 10 instances of the shader. However, that approach can add noise, since the error is the sum of the errors associated with the two measurements.
2) Waiting for completion causes the GPU to clock down when it stops executing. When it starts back up again, the clock is in a low power state and may take quite a while to come up again, skewing your results. This can be a serious problem and may understate your performance in benchmark vs. actual by a factor of two or more.
3) If you start the clock on scheduled and stop on completed, but the GPU is busy running other work, then your elapsed time includes time spent on the other workload. If the GPU is not busy, then you get the clock-down problems described in (2).
This problem is considerably harder to do right than most benchmarking cases I've worked with, and I have done a lot of performance measurement.
The best way to measure these things is to use on device performance monitor counters, as it is a direct measure of what is going on, using the machine's own notion of time. I favor ones that report cycles over wall clock time because that tends to weed out clock slewing, but there is not universal agreement about that. (Not all parts of the hardware run at the same frequency, etc.) I would look to the developer tools for methods to measure based on PMCs and if you don't find them, ask for them.
You can add scheduled and completed handler blocks to a command buffer. You can take timestamps in each and compare. There's some latency, since the blocks are executed on the CPU, but it should get you close.
With Metal 2.1, Metal now provides "events", which are more like fences in other APIs. (The name MTLFence was already used for synchronizing shared heap stuff.) In particular, with MTLSharedEvent, you can encode commands to modify the event's value at particular points in the command buffer(s). Then, you can either wait for the event to reach that value or ask for a block to be executed asynchronously when the event reaches a target value.
That still has problems with latency, etc. (as Ian Ollmann described), but is more fine grained than command buffer scheduling and completion. In particular, as Klaas mentions in a comment, a command buffer being scheduled does not indicate that it has started executing. You could put commands to set an event's value at the beginning and (with a different value) at the end of a sequence of commands, and those would only notify at actual execution time.
Finally, on iOS 10.3+ but not macOS, MTLCommandBuffer has two properties, GPUStartTime and GPUEndTime, with which you can determine how much time a command buffer took to execute on the GPU. This should not be subject to latency in the same way as the other techniques.
As an addition to Ken's comment above, GPUStartTime and GPUEndTime are now available on macOS too (10.15+):
https://developer.apple.com/documentation/metal/mtlcommandbuffer/1639926-gpuendtime?language=objc

CPU utilization in poll mode

For our project, written in C++, we run the processor cores in poll mode to poll the driver (DPDK), but in poll mode the CPU utilization shows up as 100% in top/htop. Since we started seeing glitches of packet drops, we calculated the number of loops or polls executed per second on a core (which varies based on the processor speed and type).
Sample code used to calculate the polls/second, with and without the overhead of the driver poll function, is below.
#include <iostream>
#include <sys/time.h>

int main() {
    unsigned long long counter = 0;   // polls executed in the current one-second interval
    struct timeval tv1, tv2;
    gettimeofday(&tv1, NULL);
    gettimeofday(&tv2, NULL);
    while (1) {
        gettimeofday(&tv2, NULL);
        //Some function here to measure the overhead
        //Poll the driver
        // Once a full second has elapsed, report and reset the counter
        if ((double) (tv2.tv_usec - tv1.tv_usec) / 1000000 + (double) (tv2.tv_sec - tv1.tv_sec) > 1.0) {
            std::cout << std::dec << "Executions per second = " << counter << " per second" << std::endl;
            counter = 0;
            gettimeofday(&tv1, NULL);
        }
        counter++;
    }
}
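A minimal sketch of the same interval check using std::chrono::steady_clock, which is monotonic and therefore unaffected by wall-clock adjustments:

#include <chrono>
#include <iostream>

int main() {
    unsigned long long counter = 0;
    auto interval_start = std::chrono::steady_clock::now();
    while (true) {
        // Poll the driver here
        auto now = std::chrono::steady_clock::now();
        // Once a full second has elapsed, report and reset the counter
        if (now - interval_start > std::chrono::seconds(1)) {
            std::cout << "Executions per second = " << counter << std::endl;
            counter = 0;
            interval_start = now;
        }
        counter++;
    }
}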
The poll count results vary; sometimes we see a glitch and the number goes down to 50% or less of the regular count. We thought this could be a problem with Linux scheduling the task, so we isolated the cores on the kernel command line (isolcpus=...), set CPU affinity, and raised the priority of the process/thread to the highest nice value and realtime (RT) scheduling.
But it made no difference.
So the questions are:
Can we rely on the number of loops/polls per sec executed on a processor core in poll mode?
Is there a way to calculate the CPU occupancy in poll mode, since the core's CPU utilization shows up as 100% in top?
Is this the right approach for this problem?
Environment:
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
8 GB RAM
Ubuntu virtual machine on a VMware hypervisor.
Not sure if this was previously answered; any references would be helpful.
No, you cannot rely on "the number of loops/polls per sec executed on a processor core in poll mode".
This is a fundamental aspect of the execution environment in a traditional operating system, such as the one you are using: mainstream Linux.
At any time, a heavy-weight cron job can get kicked off that makes immediate demands on some resources, and the kernel's scheduler decides to preempt your application and do something else. That would be just one of hundreds of possible reasons why your process gets preempted.
Even if you're running as root, you won't be in full control of your process's resources.
The fact that you're seeing such a wide, occasional, disparity in your polling metrics should be a big, honking clue: multi-tasking operating systems don't work like this.
There are other "realtime" operating systems where userspace apps can have specific "service level" guarantees, i.e. minimum CPU or I/O resources available, which you can rely on for guaranteeing a floor on the number of times a particular code sequence can be executed, per second or some other metric.
On Linux, there are a few things that can be fiddled with, such as the process's nice level, and a few other things. But that still will not give you any absolute guarantees, whatsoever.
Especially since you're not even running on bare metal, but you're running inside a virtual hypervisor. So, your actual execution profile is affected not just by your host operating system, but your guest operating system as well!
The only way to guarantee the kind of metric that you're looking for is to use a realtime operating system instead of Linux. Some years ago I heard about realtime extensions to the Linux kernel (Google food: "linux rtos"), but I haven't heard much about them recently. I don't believe that mainstream Linux distributions include that kernel extension, so if you want to go that way, you'll be on your own.
Modern Linux on an Intel CPU does provide ways to make a poll loop fully occupy a CPU core at close to 100%. Things you haven't considered: remove system calls that cause context switches, turn off hyperthreading (or don't use the other hardware thread that shares the same core), turn off dynamic CPU frequency boost in the BIOS, and move interrupt handling off the isolated core.
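For completeness, a minimal sketch of how the affinity and realtime-priority steps mentioned in the question can be applied from inside the program on Linux with pthreads; the core number and SCHED_FIFO priority are example values, and both calls typically need root or CAP_SYS_NICE. Build with g++ and -pthread.

#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to one (ideally isolcpus-isolated) core and switch it to SCHED_FIFO.
static bool pin_and_prioritize(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        std::perror("pthread_setaffinity_np");
        return false;
    }

    struct sched_param param = {};
    param.sched_priority = 80;   // SCHED_FIFO priorities range from 1 to 99
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        std::perror("sched_setscheduler");
        return false;
    }
    return true;
}

int main() {
    if (!pin_and_prioritize(2))  // core 2 is only an example
        return 1;
    // ... poll loop goes here ...
}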

How to use .. QNX Momentics Application Profiler?

I'd like to profile my (multi-threaded) application in terms of timing. Certain threads are supposed to be re-activated frequently, i.e. a thread executes its main job once every fixed time interval. In other words, there's a fixed time slice in which all the threads are getting re-activated.
More precisely, I expect certain threads to get activated every 2ms (since this is the cycle period). I made some simplified measurements which confirmed the 2ms to be indeed effective.
For the purpose of profiling my app more accurately it seemed suitable to use Momentics' tool "Application Profiler".
However, when I do so, I fail to interpret the timing figures that I selected. I would be interested in the average as well as the min and max time it takes before a certain thread is re-activated. So far it seems the idea is only to be able to monitor the time certain functions occupy. However, even that does not really seem to be the case. E.g. I've got 2 lines of code that are put literally next to each other:
if (var1 && var2 && var3) var5=1; takes 1ms (avg)
if (var4) var5=0; takes 5ms (avg)
What is that supposed to tell me?
Another thing confuses me - the parent thread "takes" up 33 ms on avg, 2 ms on max and 1 ms on min. Aside from the fact that the avg shouldn't be bigger than the max (in fact I expect the avg to be no bigger than 2 ms, since that is the cycle time), it actually increases the longer I run the profiling tool. So, if I ran the tool for half an hour, the 33 ms would actually be something like 120 s. So, it seems that avg is actually the total amount of time the thread occupies the CPU.
If that is the case, I would assume I could offset it against the total time using the count figure, but that doesn't work either - mostly because the count figure is almost never available, i.e. it only shows up as a separate list entry (for every parent thread) which does not represent a specific process scope.
So, I read the QNX community wiki about the "Application Profiler", incl. the manual about "New IDE Application Profiler Enhancements", as well as the official manual articles about how to use the profiler tool... but I couldn't figure out how I would use the tool to serve my interest.
Bottom line: I'm pretty sure I'm misinterpreting and misusing the tool for what it was intended to be used. Thus my question - how would I interpret the numbers or use the tool's feedback properly to get my 2ms cycle time confirmed?
Additional information
CPU: single core
QNX SDP 6.5 / Momentics 4.7.0
Profiling Method: Sampling and Call Count Instrumentation
Profiling Scope: Single Application
I enabled "Build for Profiling (Sampling and Call Count Instrumentation)" in the Build Options1
The System Profiler should give you what you are looking for. It hooks into the microkernel and lets you see the state of all threads on the system. I used it in a similar setup to find out why our system was getting unexpected time-outs. (The cause turned out to be Page Waits on critical threads.)

Why is my program taking longer to execute with more threads allocated?

I wrote two programs that compute the area under a function with a certain number of rectangles using Riemann sums, one written in Go and the other in C++.
The objective was to measure execution time and see which language is faster at multithreading.
I ran the programs on a 32-core server (dual Intel Xeon) using a bash script to run them with 1, 2, 4, 8, 16 and 32 threads. The script uses time --format %U to get the execution time.
But as you can see in the results, running the Go version with 1 core takes 1.19 seconds and with 32 cores it takes 1.69 seconds! I thought using more cores would make the computation faster...
Did I make an error writing my programs? Is the time measurement accurate? Or maybe the results are correct, but how?
Thank you in advance for your answers!
Sources :
Go code : https://github.com/Mistermatt007/Benchmark-go-vs-cpp/blob/master/CalculGo/Calcul.go
C++ code : https://github.com/Mistermatt007/Benchmark-go-vs-cpp/blob/master/CalculCpp/Calcul.cpp
launch script : https://github.com/Mistermatt007/Benchmark-go-vs-cpp/blob/master/script/Launch.sh
Riemann sums : http://mathworld.wolfram.com/RiemannSum.html
first results : https://github.com/Mistermatt007/Benchmark-go-vs-cpp/blob/master/results/GoVSCpp.txt
According to man time:
U Total number of CPU-seconds that the process used directly (in user mode), in seconds.
You are measuring CPU-seconds, i.e. the time spent by each CPU cumulatively, not "wallclock" seconds. This measure won't go down with additional threads, because it is proportional to the amount of work, which is constant. On the other hand, it may go up with the number of threads, as every new thread incurs some additional bookkeeping.
If you want the "real" (wall-clock) time, use the %e specifier.
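A minimal sketch of the distinction: four busy threads run for about one second of wall time, while the process accumulates roughly four CPU-seconds, which is what %U counts; %e corresponds to the wall time. Build with -pthread.

#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>
#include <vector>

// Busy-spin for roughly one second of wall time.
static void burn() {
    auto end = std::chrono::steady_clock::now() + std::chrono::seconds(1);
    while (std::chrono::steady_clock::now() < end) { /* spin */ }
}

int main() {
    auto wall_start = std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();   // cumulative CPU time of the whole process

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) workers.emplace_back(burn);
    for (auto& t : workers) t.join();

    double wall = std::chrono::duration<double>(std::chrono::steady_clock::now() - wall_start).count();
    double cpu  = double(std::clock() - cpu_start) / CLOCKS_PER_SEC;

    // Wall time stays near 1 s, but CPU time is roughly 4 s (one second per busy thread).
    std::printf("wall: %.2f s, cpu: %.2f s\n", wall, cpu);
}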

How to interpret the output of boost::timer::cpu_timer on multicore machine?

I use boost::timer::cpu_timer to measure the performance of some algorithm in my application. The example output looks like this:
Algo1 duration: 6.755457s wall, 12.963683s user + 1.294808s system = 14.258491s CPU (211.1%)
From boost cpu_timer documentation:
The output of this program will look something like this:
5.713010s wall, 5.709637s user + 0.000000s system = 5.709637s CPU (99.9%)
In other words, this program ran in 5.713010 seconds as would be measured by a clock on the wall, the operating system charged it for 5.709637 seconds of user CPU time and 0 seconds of system CPU time, the total of these two was 5.709637, and that represented 99.9 percent of the wall clock time.
What does the value I obtained (211.1%) mean? Does it mean that more than two cores were involved in the execution of my algorithm?
What is the meaning of user CPU time and system CPU time?
What does the value I obtained (211.1%) mean? Does it mean that more than two cores were involved in the execution of my algorithm?
It means that the program used a little more than twice as much CPU time as wall time. For that to happen, it must have been running on at least three cores for some of that time.
What is the meaning of user CPU time and system CPU time?
User CPU time is time when the CPU is running user code. System CPU time is time when the CPU is running system code. When you call a system function, like the function to read from a file, you switch from running user code to running system code until that function returns.
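A minimal sketch that reproduces a percentage well above 100%: the main thread plus two extra threads each stay busy for about one second, so user + system CPU time is roughly three seconds against one second of wall time (assumes Boost.Timer is installed; build with -pthread and link against boost_timer).

#include <boost/timer/timer.hpp>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Busy-spin for roughly one second of wall time.
static void burn() {
    auto end = std::chrono::steady_clock::now() + std::chrono::seconds(1);
    while (std::chrono::steady_clock::now() < end) { /* spin */ }
}

int main() {
    boost::timer::cpu_timer timer;              // starts measuring immediately

    std::vector<std::thread> workers;
    for (int i = 0; i < 2; ++i) workers.emplace_back(burn);
    burn();                                     // the main thread works too
    for (auto& t : workers) t.join();

    // Prints something like: 1.0s wall, 3.0s user + 0.0s system = 3.0s CPU (~300%)
    std::cout << timer.format() << std::endl;
}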