OpenCL Profiling timestamps are not consistent in duration compared to CPU clock


I am creating a custom tool interface with my application to profile the performance of OpenCL kernels while also integrating CPU profiling points. I'm currently working with this code on Linux using Ubuntu, and am testing using the 3 OpenCL devices in my machine: Intel CPU, Intel IGP, and Nvidia Quadro.
I am using this code std::chrono::high_resolution_clock::now().time_since_epoch().count() to produce a timestamp on the CPU, and of course for the OpenCL profiling time points, they are 64-bit nanoseconds provided from the OpenCL profiling events API. The purpose of the tool I made is to consume log output (specially formatted and generated so as not to impact performance much) from the program and generate a timeline chart to aid performance analysis.
So far in my visualization interface I had made the assumption that nanoseconds are uniform. I've realized now after getting my visual interface working and checking a few assumptions that this condition more or less does hold to a standard deviation of 0.4 microsecond for the CPU OpenCL device (which indicates that the CPU device could be implemented using the same time counter, as it has no drift), but does not hold for the two GPU devices! This is perhaps not the most surprising thing in the world, but it affects the core design of my tool, so this was an unforeseen risk.
I'll provide some eye candy since it is very interesting and it does prove to me that this is indeed happening.
This is zoomed into the beginning of the profile where the GPU has the corresponding mapBuffer for poses happening around a millisecond before the CPU calls it (impossible!)
Toward the end of the profile we see the same shapes but reversed relationship, clearly showing that GPU seconds seem to count for a little bit less compared to CPU seconds.
Because I had assumed that a GPU nanosecond is a CPU nanosecond, the visualization currently aligns the two timelines using the average of the deltas between the CPU and GPU timestamps. Since I implemented this from the start, perhaps I was at least subconsciously expecting an issue like this one. To get each delta, I establish a sync point at kernel dispatch: I record a CPU timestamp immediately before calling clEnqueueNDRangeKernel and compare it against the CL_PROFILING_COMMAND_QUEUED time of the resulting profiling event. On closer inspection, this delta shows the time drift:
This screenshot from the chrome console shows me logging the array of delta values I collected from these two GPU devices; they are showing BigInts to avoid losing integer precision: in both cases the GPU reported timestamp deltas are trending down.
Compare with the numbers from the CPU:
My questions:
What might be a practical way to deal with this issue? I am currently leaning toward the use of sync points when dispatching OpenCL kernels, and these sync points could be used to either locally piecewise stretch the OpenCL Profiling timestamps, or to locally sync at the beginning of, say, a kernel dispatch and just ignore the discrepancy we have, assuming it will be insignificant during the period. In particular it is unclear whether it'd be a good idea to maximize granularity by implementing a sync point for every single profiling event I want to use.
What might be some other time measuring systems I can or should use on the CPU-side to see if maybe they will end up aligning better? I don't really have much hope in this at this point because I can imagine that the profiling times being provided to me are actually generated and timed on the GPU device itself. The fluctuations would then be affected by such things as dynamic GPU clock scaling, and there would be no hope of stumbling upon a different better timekeeping scheme on the CPU.
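The "piecewise stretch" idea from the first question could be sketched roughly as below. This is only an illustration under stated assumptions: `SyncPoint`, `gpu_to_cpu_ns` and `drift_ppm` are hypothetical names, each sync point pairs a host timestamp taken just before clEnqueueNDRangeKernel with the CL_PROFILING_COMMAND_QUEUED value of the same command, and the enqueue latency between the two is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <vector>

// One sync point (hypothetical type): host and device timestamps that are
// assumed to describe the same instant.
struct SyncPoint {
    std::int64_t cpu_ns;
    std::int64_t gpu_ns;
};

// Map a device timestamp onto the host timeline by linear interpolation
// between the two surrounding sync points. `sync` must hold at least two
// points with strictly increasing gpu_ns. Timestamps outside the covered
// range are extrapolated from the nearest segment.
std::int64_t gpu_to_cpu_ns(const std::vector<SyncPoint>& sync, std::int64_t gpu_ns) {
    assert(sync.size() >= 2);
    // First sync point whose device time is greater than the query.
    auto hi = std::upper_bound(
        sync.begin(), sync.end(), gpu_ns,
        [](std::int64_t t, const SyncPoint& p) { return t < p.gpu_ns; });
    if (hi == sync.begin()) hi = std::next(sync.begin());
    if (hi == sync.end()) hi = std::prev(sync.end());
    const auto lo = std::prev(hi);
    // Host nanoseconds per device nanosecond on this segment.
    const long double slope =
        static_cast<long double>(hi->cpu_ns - lo->cpu_ns) /
        static_cast<long double>(hi->gpu_ns - lo->gpu_ns);
    return lo->cpu_ns + std::llround(slope * (gpu_ns - lo->gpu_ns));
}

// Rate error of the device clock relative to the host clock over the whole
// capture, in parts per million (negative: the device clock counts slow).
double drift_ppm(const std::vector<SyncPoint>& sync) {
    assert(sync.size() >= 2);
    const double gpu_elapsed =
        static_cast<double>(sync.back().gpu_ns - sync.front().gpu_ns);
    const double cpu_elapsed =
        static_cast<double>(sync.back().cpu_ns - sync.front().cpu_ns);
    return (gpu_elapsed / cpu_elapsed - 1.0) * 1e6;
}
```

With one sync point per dispatch, every other profiling value of that event (SUBMIT, START, END) can go through the same mapping, so a separate sync point per profiling value may not be needed. Whether the residual error is acceptable would have to be checked against real captures.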

Related

What strategies and practices are used, when running very intense and long calculations, to ensure that hardware isn't damaged?

I have many large Fortran programs to run at work. I have access to several desktop computers, and the Fortran code takes several consecutive days to run. It essentially runs the same master module many times (let's say N times) with different parameters, something akin to Monte Carlo on steroids. In that sense the code is parallelizable, but I don't have access to a cluster.
Within the scientific computing community, what practices and strategies are used to minimise hardware damage from heat? The machines of course have their own cooling systems (fans and heat sinks), but even so, running intense calculations non-stop for half a week surely cannot be healthy for the life of the machines? Though maybe I'm over-thinking this.
I'm not aware of any intrinsic functions in Fortran that can pause the code to give components a break. Currently I've written a small module that keeps an eye on the system clock, with a do while loop that "wastes time" between consecutive runs of the master module in order to dissipate heat. Is this an acceptable way of doing this? The processor is, after all, still running a while loop.
Another way would be to use a shell script or Python code that wraps the Fortran code? Alternatively, are there any intrinsic routines in the compiler (gfortran) that could achieve this? What are the standard, effective and accepted practices for dealing with this?
Edit: I should mention that all machines run on Linux, specifically Ubuntu 12.04.

For an MS-DOS application I would consider the following:
- Reduce I/O operations with the HDD as much as possible, that is, keep data in memory as much as you can, or keep data on a RamDisk. A RamDisk driver is available on Microsoft's website. Let me know if you can't find it and I'll look in my CD archives.
- Try to use Extended Memory by using a DPMI (DOS Protected Mode Interface) driver.
- Set CPU affinity for a second CPU.
- Boost the priority to High, but I wouldn't recommend boosting it to Real-Time.

I think you need a hardware solution here, not a software solution. You need to increase the rate of heat exchange in the computers (new fans, water cooling, etc) and in the room (turn the thermostat way down, get some fans running, etc).
To answer the post more directly, you can use the Fortran SLEEP subroutine (a GNU extension available in gfortran) to pause a computation for a given number of seconds. You could use some system calls in Fortran to set the argument on the fly. But I wouldn't recommend it: you might as well just run your simulations on fewer computers.
To keep the advantages of the multiple computers, you need better heat exchange.
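If pauses between runs are wanted anyway, a small driver program can do the waiting instead of a busy loop. The sketch below is in C++ rather than Fortran and is only an illustration: `run_with_cooldown` is a hypothetical helper, and the command in the comment is a made-up placeholder. The point it shows is that a sleep call hands the core back to the OS, whereas a do while loop polling the clock keeps it busy.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Run `job` `runs` times, idling for `pause` between consecutive runs.
// sleep_for blocks the thread in the OS scheduler, so the core can drop to
// an idle state during the pause instead of spinning.
// Returns the number of completed runs.
int run_with_cooldown(int runs, std::chrono::milliseconds pause,
                      const std::function<void(int)>& job) {
    assert(runs >= 0);
    int completed = 0;
    for (int i = 0; i < runs; ++i) {
        // In a real driver this might launch the simulation, e.g.
        // std::system("./master_module run_i.in")  (hypothetical command).
        job(i);
        ++completed;
        if (i + 1 < runs) std::this_thread::sleep_for(pause);
    }
    return completed;
}
```

The same structure works as a shell loop with `sleep` between invocations; the choice of pause length is a guess unless it is tied to measured temperatures.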

As long as the hardware is adequately dissipating heat and components are not operating at or beyond their "safe" temperature limits, they should be fine.
*Some video cards were known to run very hot, i.e. 65-105°C. Electronic components typically have a maximum temperature rating in exactly this range. Beyond it, reliability degrades very quickly. Even though the manufacturer made these cards this way, they ended up with a reputation for failing (e.g. older nVidia FX and Quadro series).
*Ubuntu likely has a "Critical temperature reached" feature where the entire system will power off if it overheats, as explained here. Windows is "blissfully ignorant." :)
*Thermal stress (large, repeated temperature variations) may contribute to the failure of ICs, capacitors, and hard disks. In my experience over three decades of computing, adequate cooling and leaving the PC on 24/7 may actually reduce wear and tear. (A typical PC will cost around $200 USD/year in electricity, so it's a trade-off in terms of cost.)
*PCs must be cleaned twice a year (depending on airborne particulate constituency and concentration). Compressed air is nice for removing dust. Dust traps heat and causes failures. Operate a shop-vac while "dusting" to prevent the dust from going everywhere.
*The CPU should be "ok" with its stock cooler. Check its temperature at cold system boot-up, then again after running code for an hour or so. The fan is speed-controlled to limit temperature rise. The CPU temperature rise shouldn't be much more than about 40°C, and less would be better. An aftermarket, better-performing CPU cooler never hurts. CPUs rarely fail unless there is a manufacturing flaw or they operate near or beyond their rated temperatures for too long, so as long as they stay cool, long calculations are fine. Typically, they stop functioning and/or reset the PC if too hot.
*Capacitors tend to fail very rapidly when overheated. It is a known issue that some cap vendors are "junk" and will fail prematurely, regardless of other factors. "Re-capping" is the art of fixing these components. For a full run-down on this topic, see badcaps.net. It used to be possible to re-cap a motherboard, but today's 12+ layer and RoHS (no lead) motherboards make it very difficult without specialty hot-air tools.
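For the "check the temperature before and after an hour of running" advice, one way to log temperatures on Linux is to read a thermal zone file. The sketch below assumes the common sysfs layout where `/sys/class/thermal/thermal_zone0/temp` holds millidegrees Celsius; which zone (if any) corresponds to the CPU differs between machines, and `read_temp_celsius` is a hypothetical helper name.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Read a Linux thermal-zone file and convert millidegrees Celsius to degrees.
// Returns false if the file cannot be opened or does not start with a number.
bool read_temp_celsius(const std::string& path, double& celsius) {
    assert(!path.empty());
    std::ifstream in(path);
    long millidegrees = 0;
    if (!(in >> millidegrees)) return false;
    celsius = static_cast<double>(millidegrees) / 1000.0;
    return true;
}
```

Calling it with "/sys/class/thermal/thermal_zone0/temp" before a run and again an hour in gives the temperature rise discussed above, provided that zone is the CPU sensor on the machine in question.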

how to compute in game loop until the last possible moment

As part of optimizing my 3D game/simulation engine, I'm trying to make the engine self-optimizing.
Essentially, my plan is this. First, get the engine to measure the number of CPU cycles per frame. Then measure how many CPU cycles the various subsystems consume (min, average, max).
Given this information, at just a few specific points in the frame loop, the engine could estimate how many "extra CPU cycles" it has available to perform "optional processing" that is efficient to perform now (the relevant data is in the cache now), but could otherwise be delayed until some subsequent frame if the current frame is in danger of running short of CPU cycles.
The idea is to keep as far ahead of the game as possible on grunt work, so that every possible CPU cycle is available for "demanding frames" (like "many collisions during a single frame"), which can then be processed without failing to call glXSwapBuffers() in time to exchange back/front buffers before the latest possible moment for vsync.
The analysis above presumes swapping back/front buffers is a fundamental requirement to assure a constant frame rate. I've seen claims this is not the only approach, but I don't understand the logic.
I captured 64-bit CPU clock cycle times just before and after glXSwapBuffers(), and found frames vary by about 2,000,000 clock cycles! This appears to be due to the fact glXSwapBuffers() doesn't block until vsync (when it can exchange buffers), but instead returns immediately.
Then I added glFinish() immediately before glXSwapBuffers(), which reduced the variation to about 100,000 CPU clock cycles... but then glFinish() blocked for anywhere from 100,000 to 900,000 CPU clock cycles (presumably depending on how much work the nvidia driver had to complete before it could swap buffers). With that kind of variation in how long glXSwapBuffers() may take to complete processing and swap buffers, I wonder whether any "smart approach" has any hope.
The bottom line is, I'm not sure how to achieve my goal, which seems rather straightforward, and does not seem to ask too much of the underlying subsystems (the OpenGL driver for instance). However, I'm still seeing about 1,600,000 cycles variation in "frame time", even with glFinish() immediately before glXSwapBuffers(). I can average the measured "CPU clock cycles per frame" rates and assume the average yields the actual frame rate, but with that much variation my computations might actually cause my engine to skip frames by falsely assuming it can depend on these values.
I will appreciate any insight into the specifics of the various GLX/OpenGL functions involved, or in general approaches that might work better in practice than what I am attempting.
PS: The CPU clock rate of my CPU does not vary when cores are slowed-down or sped-up. Therefore, that's not the source of my problem.
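The bookkeeping described in this question (min/average/max per subsystem, then a check whether optional work still fits) might look roughly like the sketch below. `CostStats` and `fits_in_frame` are hypothetical names, the cycle counts are assumed to come from whatever counter the engine already uses, and the check deliberately uses worst-case costs, which is a conservative choice rather than the only possible one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Running min/average/max of one subsystem's per-frame cost in CPU cycles.
struct CostStats {
    std::uint64_t min = std::numeric_limits<std::uint64_t>::max();
    std::uint64_t max = 0;
    std::uint64_t total = 0;
    std::uint64_t samples = 0;

    void record(std::uint64_t cycles) {
        min = std::min(min, cycles);
        max = std::max(max, cycles);
        total += cycles;
        ++samples;
    }
    std::uint64_t average() const { return samples ? total / samples : 0; }
};

// True if an optional task can run now without endangering the frame:
// cycles already used, plus the worst observed cost of the mandatory work
// still to come, plus the worst observed cost of the optional task, must
// stay within the frame budget.
bool fits_in_frame(std::uint64_t frame_budget, std::uint64_t used,
                   const CostStats& remaining_mandatory,
                   const CostStats& optional_task) {
    assert(remaining_mandatory.samples > 0 && optional_task.samples > 0);
    const std::uint64_t needed =
        used + remaining_mandatory.max + optional_task.max;
    return needed <= frame_budget;
}
```

With the 1,600,000-cycle variation reported above, the frame budget itself is uncertain, so any such check would need a margin on `frame_budget` as well.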

This is my advice: at the end of rendering, just call the swap-buffers function and let it block if needed. Ideally, you should have one thread that performs all your OpenGL API calls, and only that. If there is other computation to perform (e.g. physics, game logic), use other threads; the operating system will keep those threads running while the rendering thread is waiting for vsync.
Furthermore, people who disable vsync usually want to see how many frames per second they can achieve. With your approach, it seems that disabling vsync would leave the fps around 60 anyway.

I'll try to re-interpret your problem (so that if I missed something you could tell me and I can update the answer):
Given that T is the time at your disposal before a vsync event happens, you want to build your frame using 1xT seconds (or something close to 1).
However, suppose you are able to code tasks so that they exploit cache locality and achieve fully deterministic timing (you know in advance how much time each task requires and how much time you have at your disposal), so that you can theoretically achieve times like:
0.96xT
0.84xT
0.99xT
You have to deal with some facts:
You don't know T (you tried to measure it and it seems to hiccup: that is driver dependent!)
Timings have errors
Different CPU architectures: you measure CPU cycles for a function, but on another CPU that function requires fewer or more cycles due to better/worse prefetching or pipelining.
Even when running on the same CPU, another task may pollute the prefetcher, so the same function does not necessarily take the same number of CPU cycles (it depends on the functions called before and on the prefetch algorithm!)
The operating system could interfere at any time by pausing your application to run some background process; that would increase the time of your "filling" tasks, effectively making you miss the vsync event (even if your "predicted" time is reasonable, like 0.85xT)
At some times you can still get a time of
1.3xT
while at the same time you didn't use all the available CPU power (when you miss a vsync event you have basically wasted your frame time, so it becomes wasted CPU power).
You can still work around this ;)
Buffering frames: you queue rendering calls for up to 2 or 3 frames (no more! You are already adding some latency, and certain GPU drivers will do a similar thing to improve parallelism and reduce power consumption!); after that you use the game loop to idle or to do late work.
With that approach it is reasonable to exceed 1xT, because you have some "buffer frames".
Let's see a simple example
You scheduled tasks for 0.95xT, but since the program is running on a machine with a different CPU architecture than the one you used to develop it, your frame takes 1.3xT.
No problem: you know there are some frames buffered, so you can still be happy, but now you have to launch a 1xT - 0.3xT task. Better to use a safety margin as well, so you launch tasks for 0.6xT instead of 0.7xT.
Oops, something really went wrong: the frame took 1.3xT again, and now you have exhausted your reserve of frames. You just do a simple update and submit GL calls; your program predicts 0.4xT.
Surprise: your program took 0.3xT for the following frames even though you scheduled work for more than 2xT, so you again have 3 frames queued in the rendering thread.
Since you have some frames in reserve and also have late work pending, you schedule an update for 1.5xT.
By introducing a little latency you can exploit the full CPU power. Of course, if you measure that most of the time your queue has more than 2 frames buffered, you can cut the pool down to 2 instead of 3 so that you save some latency.
Of course this assumes you do all work synchronously (apart from deferring GL calls). You can still use some extra threads where necessary (file loading or other heavy tasks) to improve performance (if required).
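One possible reading of the example above, written as a budgeting rule, is sketched below. Everything here is an assumption made for illustration: `BudgetPolicy` and `next_budget` are hypothetical names, all times are fractions of the vsync period T, and the default values are simply the numbers used in the example (0.1 margin, 0.4 for a bare update, 1.5 when the queue is full, 3 frames queued at most).

```cpp
#include <algorithm>
#include <cassert>

// Tuning values, as fractions of the vsync period T (assumed, not measured).
struct BudgetPolicy {
    double margin = 0.1;    // safety margin subtracted from the budget
    double minimal = 0.4;   // cost of a bare update plus GL submission
    double catch_up = 1.5;  // budget when the queue is full and late work waits
    int max_queued = 3;     // frames buffered ahead of display
};

// Work budget for the next frame, given how long the last frame took and
// how many rendered frames are still queued for display.
double next_budget(double last_frame, int queued_frames,
                   const BudgetPolicy& p = BudgetPolicy{}) {
    assert(p.max_queued > 0);
    if (queued_frames <= 0) return p.minimal;              // reserve exhausted
    if (queued_frames >= p.max_queued) return p.catch_up;  // spend the reserve
    const double overrun = std::max(0.0, last_frame - 1.0);
    return std::max(p.minimal, 1.0 - overrun - p.margin);
}
```

For the first step of the example, `next_budget(1.3, 2)` evaluates to roughly 0.6, matching the "0.6xT instead of 0.7xT" figure. A real engine would likely derive the policy values from measurements rather than fix them.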

Finding performance issue that may be due to thread locking (possibly)

I've spent a little time running valgrind/callgrind to profile a server that does a lot of TCP/IP communications using many threads. After some time improving the performance, I realised that in this particular test scenario, the process is not CPU bound so the performance "improvements" I'd looked at were of no use.
In theory, the CPU should be very busy. I know the TCP/IP device it connects to isn't the limitation as the server runs on two machines. One is a PC the other is an embedded device with an Arm processor. Even the embedded device only gets to about 2% CPU usage but it does far fewer transactions - about a tenth. Both systems only get up to about 2% even though we're trying to get data as fast as possible.
My guess is that some mutex is locked and is holding up a thread. This is a pure guess! There are a few threads in the system with common data. Perhaps there are other possibilities but how do I tell?
Is there any way to use a tool like valgrind/callgrind that might show the time spent in system calls? I can also run it on Windows with Visual Studio 2012 if that's better.
We might have to try walking through the code or something but not sure that we have time.
Any tips appreciated.
Thanks.

Callgrind is a great profiler but it does have some drawbacks. In particular, it assumes that the same instruction always executes in the same amount of time, and it assumes that instruction counts are the most important metric.
This is fine for getting (mostly) reproducible profiling results and for analyzing in detail what instructions are executed, but there are some types of performance problems which Callgrind doesn't detect:
time spent waiting for locks
time spent sleeping (eg. simple sleep()/usleep() calls will effectively slow down your application but won't show up in Callgrind)
time spent waiting for disk I/O or network I/O
time spent waiting for data that was swapped out
influences from CPU cache hits/misses (you can try to use Cachegrind for this particular topic)
influences from CPU pipeline stalls, branch prediction failures and all the other features of modern CPUs that can cause the same instruction to be executed faster or slower depending on the context
These problems can be detected quite well using a statistical (or sample-based) profiler. Examples would be Sysprof and OProfile, or any kind of "poor-man's sampling profiler" as described eg. at https://stackoverflow.com/a/378024. The VS2012 built-in profiler mentioned by WhozCraig appears to be a sampling profiler as well.
While statistical profilers are really useful because they provide "real-world" results instead of simple instruction counts, they have the possible drawback that you don't get reproducible results easily (the results will vary a little bit with every run), and that you need to gather a sufficient number of samples to get detailed results.
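Before reaching for a sampling profiler, a cheap first check for "waiting rather than computing" is to compare CPU time with wall-clock time around a suspect section. The sketch below is an illustration only: `TimeSplit`, `measure` and `blocked_work` are hypothetical names, and it relies on `std::clock` reporting process CPU time, which holds on POSIX systems but not with the Microsoft runtime, where `clock` tracks wall time.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <functional>
#include <thread>

struct TimeSplit {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // CPU time consumed by the whole process
    // Share of the elapsed time that was spent executing instructions.
    // Can exceed 1.0 when several threads are busy at once.
    double cpu_fraction() const {
        return wall_seconds > 0.0 ? cpu_seconds / wall_seconds : 0.0;
    }
};

// Measure wall-clock and CPU time consumed while `work` runs.
TimeSplit measure(const std::function<void()>& work) {
    assert(static_cast<bool>(work));
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_stop = std::clock();
    const auto wall_stop = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(wall_stop - wall_start).count(),
            static_cast<double>(cpu_stop - cpu_start) / CLOCKS_PER_SEC};
}

// Stand-in for a thread that is blocked on a lock or a socket.
void blocked_work() {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
```

A `cpu_fraction()` near zero for `measure(blocked_work)` is the pattern Callgrind cannot show: time passes, but almost no instructions run. It does not say what the code is waiting on; that still needs a sampling profiler or tracing.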

Reducing bandwidth between GPU and CPU (sending raw data or pre-calculating first)

OK, so I am just trying to work out the best way to reduce bandwidth between the GPU and CPU.
Particle Systems.
Should I be pre-calculating most things on the CPU and sending them to the GPU? This includes stuff like positions, rotations, velocity, calculations for alpha, random numbers, etc.
Or should I be doing as much as I can in the shaders and using the geometry shader as much as possible?
My problem is that the sort of app I have written has to send a good few variables to the shaders. For example, a user will select emitter positions and velocity at run time, plus a lot more. The sort of thing I am not sure how to tackle is this: if a user wants a random velocity and gives a min and max value for the random value to be selected from, should this random value be worked out on the CPU and sent as a single value to the GPU, or should both the min and max values be sent to the GPU and a random number generator on the GPU do it? Any comments on reducing bandwidth and optimization are much appreciated.

Should I be pre-calculating most things on the CPU and sending them to the GPU? This includes stuff like positions, rotations, velocity, calculations for alpha, random numbers, etc.
Or should I be doing as much as I can in the shaders and using the geometry shader as much as possible?
Impossible to answer. Spend too much CPU time and performance will drop. Spend too much GPU time and performance will drop too. Transfer too much data and performance will drop. So, instead of trying to guess (I don't know what app you're writing, what your target hardware is, etc. Hell, you didn't even specify your target API and platform), measure/profile and select the optimal method. PROFILE instead of trying to guess the performance. There are AQTime 7 Standard, gprof, and NVPerfKit for that (plus many other tools).
Do you actually have a performance problem in your application? If you don't have any performance problems, then don't do anything. Do you have, say, ten million particles per frame in real time? If not, there's little reason to worry, since a 600 MHz CPU was capable of handling thousands of them easily 7 years ago. On the other hand, if you have, say, a dynamic 3D environment and particles must interact with it (bounce), then doing it all on the GPU will be MUCH harder.
Anyway, to me it sounds like you don't have to optimize anything and there's no actual NEED to optimize. So the best idea would be to concentrate on some other things.
However, in any case, ensure that you're using the correct way to transfer "dynamic" data that is frequently updated. In DirectX that meant using dynamic write-only vertex buffers with D3DLOCK_DISCARD|D3DLOCK_NOOVERWRITE. With OpenGL that'll probably mean calling glBufferData with GL_STREAM_DRAW or GL_DYNAMIC_DRAW usage. That should be sufficient to avoid major performance hits.

There's no single right answer to this. Here are some things that might help you make up your mind:
Are you sure the volume of data going over the bus is high enough to be a problem? You might want to do the math and see how much data there is per second vs. what's available on the target hardware.
Is the application likely to be CPU bound or GPU bound? If it's already GPU bound there's no point loading it up further.
Particle systems are pretty easy to implement on the CPU and will run on any hardware. A GPU implementation that supports nontrivial particle systems will be more complex and limited to hardware that supports the required functionality (e.g. stream out and an API that gives access to it.)
Consider a mixed approach. Can you split the particle systems into low complexity, high bandwidth particle systems implemented on the GPU and high complexity, low bandwidth systems implemented on the CPU?
All that said, I think I would start with a CPU implementation and move some of the work to the GPU if it proves necessary and feasible.
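The "do the math" step from the first point can be as small as the sketch below. The helper names are hypothetical, and the figures in the usage note (32 bytes per particle, a 4 GB/s bus) are assumptions chosen for illustration, not measurements of any particular hardware.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per second sent to the GPU if every live particle is re-uploaded
// each frame.
std::uint64_t upload_bytes_per_second(std::uint64_t particles,
                                      std::uint64_t bytes_per_particle,
                                      std::uint64_t frames_per_second) {
    return particles * bytes_per_particle * frames_per_second;
}

// Share of the available bus bandwidth that this upload would consume.
double bus_fraction(std::uint64_t bytes_per_second,
                    std::uint64_t bus_bytes_per_second) {
    assert(bus_bytes_per_second > 0);
    return static_cast<double>(bytes_per_second) /
           static_cast<double>(bus_bytes_per_second);
}
```

For example, 100,000 particles at 32 bytes each (say, eight floats for position, velocity, alpha and size) and 60 frames per second come to 192,000,000 bytes per second, which is under 5% of an assumed 4 GB/s bus. Sending a min/max pair instead of one random value per particle changes this by a few bytes per emitter, which is unlikely to matter at that scale.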

Measuring running time of computational geometry algorithms

I am taking a course on computational geometry in the fall, where we will be implementing some algorithms in C or C++ and benchmarking them. Most of the students generate a few datasets and measure their programs with the time command, but I would like to be a bit more thorough.
I am thinking about writing a program to automatically generate different datasets, run my program with them and use R to test hypotheses and estimate parameters.
So... How do you measure program running time more accurately?
What might be relevant to measure?
What hypotheses might be interesting to test (variance, effects caused by caching, etc.)?
Should I test my code on more than one machine? How should these machines differ?
My overall goals are to learn how these algorithms perform in practice, which implementation techniques are better and how the hardware actually performs.

Profilers are great. Valgrind is pretty popular. Also, I'd suggest trying your code out on RISC machines if you can get access to some. Their performance characteristics are different from those of CISC machines in interesting ways.

You could use the Windows API timing functions (which are not that exact), or you can use the RDTSC instruction via inline assembly, which has sub-nanosecond resolution (don't forget that the instruction and the code around it add a small overhead of some hundreds of cycles, but this is not a big issue).

In order to get better accuracy with program metrics, you will have to run your program many times, such as 100 or 1000.
For more details, on metrics, search the web for metrics and profiling.
Beware that programs may differ in performance (time) measurements due to things running in the background such as virus scanners, music players, and other programs with timers in them.
You could test your program on different machines. Processor clock rates, L1 and L2 cache sizes, RAM sizes, and Disk speeds are all factors (as well as the number of other programs / tasks running concurrently). Floating point may also be a factor.
If you want, you can challenge your compiler by printing the assembly listings for various optimization settings. See which setting produces the shortest or most efficient assembly code.
Since you're processing data, look at data driven design: http://www.gamearchitect.net/Articles/DataDrivenDesign.html
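The "run it many times" advice from this answer could be automated along these lines. This is a minimal sketch with hypothetical names (`TimingSummary`, `summarize`, `time_runs`); it times an in-process callable with a monotonic clock and reports the minimum, mean and sample standard deviation, which can then be exported for analysis in R.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

struct TimingSummary {
    double min_s;
    double mean_s;
    double stddev_s;  // sample standard deviation (n - 1 in the denominator)
};

// Summary statistics over a non-empty set of timing samples in seconds.
TimingSummary summarize(const std::vector<double>& samples) {
    assert(!samples.empty());
    const double n = static_cast<double>(samples.size());
    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / n;
    double squared = 0.0;
    for (double s : samples) squared += (s - mean) * (s - mean);
    const double stddev =
        samples.size() > 1 ? std::sqrt(squared / (n - 1.0)) : 0.0;
    return {*std::min_element(samples.begin(), samples.end()), mean, stddev};
}

// Time `work` `runs` times and return the duration of each run in seconds.
std::vector<double> time_runs(int runs, const std::function<void()>& work) {
    assert(runs >= 0);
    std::vector<double> seconds;
    seconds.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}
```

The minimum is often the least noisy figure when background tasks interfere, while the standard deviation indicates how many repetitions are needed; keeping the raw samples rather than only the summary leaves both options open.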

You can use the Windows High Performance Counter to get nanosecond accuracy. Technically, afaik, the HPC can run at any speed, but you can query its counts per second, and as far as I know, most CPUs count at a very high rate.
What you should do is just get a professional profiler. That's what they're for.
More realistically, however: if you're only comparing between algorithms, as long as your machine doesn't happen to excel in one area (Pentium D, SSD sort of thing) it shouldn't matter too much to do it on just one machine. If you want to look at cache effects, try running the algorithm right after the machine starts up (make sure that you get a copy of Windows 7, should be free for CS students), then leave it doing something that can be plenty cache heavy, like image processing, for 24h or something to convince the OS to cache it. Then run the algorithm again. Compare.

You didn't specify your platform. If you are on a POSIX system (e.g. Linux), have a look at clock_gettime. It lets you access different kinds of clocks, e.g. wall-clock time or CPU time. You can also query the resolution of the clocks.
Since you are willing to do proper statistics on your numbers, you should repeat your experiments often enough that the statistical tests give you enough confidence.
If your measurements are not too fine-grained and your variance is low, about 10 runs are often enough. But if you go down to a small scale, a short function or so, you might need many more.
Also you would have to ensure reproducible experimental conditions, no other load on the machine, enough memory available etc.
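A minimal sketch of the clock_gettime approach on a POSIX system is shown below. `to_ns`, `read_clock_ns` and `clock_resolution_ns` are hypothetical wrapper names; `clock_gettime`, `clock_getres` and the clock ids are the standard POSIX ones.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Convert a timespec to integer nanoseconds.
std::int64_t to_ns(const timespec& ts) {
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Current value of a POSIX clock in nanoseconds, or -1 if it is unavailable.
std::int64_t read_clock_ns(clockid_t id) {
    timespec ts{};
    return clock_gettime(id, &ts) == 0 ? to_ns(ts) : -1;
}

// Advertised resolution of a POSIX clock in nanoseconds, or -1 on failure.
std::int64_t clock_resolution_ns(clockid_t id) {
    timespec ts{};
    return clock_getres(id, &ts) == 0 ? to_ns(ts) : -1;
}
```

Reading `CLOCK_MONOTONIC` before and after a run gives elapsed time that is unaffected by wall-clock adjustments, while `CLOCK_PROCESS_CPUTIME_ID` gives the CPU time consumed by the process. The reported resolution is what the system advertises and is not a guarantee of measurement accuracy.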
