How do I measure GPU time on Metal? - profiling

I want to see programmatically how much GPU time a part of my application consumes on macOS and iOS. On OpenGL and D3D I can use GPU timer query objects. I searched and couldn't find anything similar for Metal. How do I measure GPU time on Metal without using Instruments etc.? I'm using Objective-C.

There are a couple of problems with this method:
1) You really want to know what is the GPU side latency within a command buffer most of the time, not round trip to CPU. This is better measured as the time difference between running 20 instances of the shader and 10 instances of the shader. However, that approach can add noise since the error is the sum of the errors associated with the two measurements.
2) Waiting for completion causes the GPU to clock down when it stops executing. When it starts back up again, the clock is in a low power state and may take quite a while to come up again, skewing your results. This can be a serious problem and may understate your performance in benchmark vs. actual by a factor of two or more.
3) If you start the clock on scheduled and stop it on completed while the GPU is busy running other work, then your elapsed time includes time spent on the other workload. If the GPU is not busy, then you get the clock-down problems described in (2).
This problem is considerably harder to do right than most benchmarking cases I've worked with, and I have done a lot of performance measurement.
The best way to measure these things is to use on device performance monitor counters, as it is a direct measure of what is going on, using the machine's own notion of time. I favor ones that report cycles over wall clock time because that tends to weed out clock slewing, but there is not universal agreement about that. (Not all parts of the hardware run at the same frequency, etc.) I would look to the developer tools for methods to measure based on PMCs and if you don't find them, ask for them.
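The differencing idea from point 1 can be sketched as plain arithmetic. This is only an illustration, not Metal code: the function names are made up for this example, and the two timings are assumed to come from whichever timing source you end up using. Note that, as said above, the error of the result is the sum of the errors of both measurements.

```cpp
#include <cassert>
#include <cmath>

// Estimate the time of one shader instance from two batch timings.
// t_small and t_large are elapsed times (in seconds) for command buffers
// that encode n_small and n_large instances of the same shader. Any fixed
// overhead (round trip to the CPU, scheduling) cancels in the difference.
double per_instance_seconds(double t_small, int n_small,
                            double t_large, int n_large) {
    assert(n_large > n_small);
    return (t_large - t_small) / static_cast<double>(n_large - n_small);
}

// Fixed overhead implied by the same two measurements.
double fixed_overhead_seconds(double t_small, int n_small,
                              double t_large, int n_large) {
    return t_small -
           n_small * per_instance_seconds(t_small, n_small, t_large, n_large);
}
```

For example, if 10 instances take 2.5 ms and 20 instances take 4.5 ms, the estimate is 0.2 ms per instance with about 0.5 ms of fixed overhead.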

You can add scheduled and completed handler blocks to a command buffer. You can take timestamps in each and compare. There's some latency, since the blocks are executed on the CPU, but it should get you close.
With Metal 2.1, Metal now provides "events", which are more like fences in other APIs. (The name MTLFence was already used for synchronizing shared heap stuff.) In particular, with MTLSharedEvent, you can encode commands to modify the event's value at particular points in the command buffer(s). Then, you can either wait for the event to have that value or ask for a block to be executed asynchronously when the event reaches a target value.
That still has problems with latency, etc. (as Ian Ollmann described), but is more fine grained than command buffer scheduling and completion. In particular, as Klaas mentions in a comment, a command buffer being scheduled does not indicate that it has started executing. You could put commands to set an event's value at the beginning and (with a different value) at the end of a sequence of commands, and those would only notify at actual execution time.
Finally, on iOS 10.3+ but not macOS, MTLCommandBuffer has two properties, GPUStartTime and GPUEndTime, with which you can determine how much time a command buffer took to execute on the GPU. This should not be subject to latency in the same way as the other techniques.

As an addition to Ken's comment above, GPUStartTime and GPUEndTime are now available on macOS too (10.15+).


Slow cloud processing. How to use all CPU cores in grabber callback?

I am working with Kinect2Grabber and want to do some real-time processing, but the performance I get is around 1 fps. Ultimately I want to estimate a trajectory of a ball thrown in the air. I cannot do it with such slow processing :(
I am using Windows 10 64-bit, Visual Studio 2017, PCL 1.9.1 All-in-one Installer MSVC2017 x64 on an AMD Ryzen Threadripper 1900X 8-Core Processor. OpenMP is enabled in my project and so are optimizations. However, when I run my program, the CPU usage for it is around 12-13%. What am I doing wrong?
int main(int argc, char* argv[])
{
    boost::shared_ptr<visualization::PCLVisualizer> viewer(new visualization::PCLVisualizer("Point Cloud Viewer"));
    viewer->setCameraPosition(0.0, 0.0, -1.0, 0.0, 0.0, 0.0);
    PointCloud<PointType>::Ptr cloud(new PointCloud<PointType>);
    // Retrieved Point Cloud Callback Function
    boost::mutex mutex;
    boost::function<void(const PointCloud<PointType>::ConstPtr&)> function = [&cloud, &mutex](const PointCloud<PointType>::ConstPtr& ptr) {
        boost::mutex::scoped_lock lock(mutex);
        // Point Cloud Processing
        cloud = ptr->makeShared();
        std::vector<int> indices;
        removeNaNFromPointCloud<PointType>(*cloud, *cloud, indices);
        pass_filter(0.5, 0.90, cloud);
        outlier_removal(50, 1.0, cloud);
        downsampling_vox_grid(0.005f, cloud);
        normals(0.04, cloud, cloud_normals);
        segmentation(cloud, cloud_normals);
    };
    boost::shared_ptr<Grabber> grabber = boost::make_shared<Kinect2Grabber>();
    boost::signals2::connection connection = grabber->registerCallback(function);
    while (!viewer->wasStopped()) {
        // Update Viewer
        boost::mutex::scoped_try_lock lock(mutex);
        if (lock.owns_lock() && cloud) {
            // Update Point Cloud
            if (!viewer->updatePointCloud(cloud, "chmura")) {
                viewer->addPointCloud(cloud, "chmura");
            }
        }
    }
    // Disconnect Callback Function
    if (connection.connected()) {
        connection.disconnect();
    }
    return 0;
}
The omitted code for pass_filter, outlier_removal etc. is taken directly from the tutorials and it is working, but it is very slow starting from outlier_removal (inclusive).
Your help will be greatly appreciated.
I do not have to use Kinect2Grabber. Anything will be good to grab and process frames from Kinect2 on Windows.
I see a few mitigations for your issues. 12-13% usage for Threadripper sounds good (100/16 is ~6.25%) as this implies full use of 1 physical core (1 thread for IO and 1 for computation? Just my guess).
To get better performance, you need to profile in order to understand what's causing the bottleneck. Perf is a great tool for that. There's an awesome video about how to profile code by Chandler. This is not the best place for a perf tutorial, but TL;DW:
compile using -fno-omit-frame-pointer flag or equivalent
perf record -g <executable>
perf report -g to know which functions take most CPU cycles
perf report -g 'graph,0.5,caller' to know which call-paths take most CPU cycles
Most likely, the issues identified would be:
Repeated creation of single-use objects such as VoxelGrid: Instantiating objects just once is a better use of your CPU cycles
Lock and IO for the grabber
This will get you slightly more frame rate but still limit your CPU utilization to 12-13% aka the single-thread limit.
Since you are using a Threadripper, you might use threads to make use of the other cores and decouple IO, computation and visualization:
One thread for the grabber which grabs frames and pushes them into a Queue
Another thread to consume frames from the Queue based on CPU availability. This saves the computed data into yet another Queue
Visualization thread which takes data from the output Queue
This allows you to tune the Queue sizes to drop frames to improve latency. This might not be required based on the design of your custom Kinect2Grabber. This will be evident after you profile your code.
This has potential to dramatically reduce latency as well as improve frame-rates by perhaps increasing CPU utilization to 20% (because the grabber and visualization threads will be working at full throttle)
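A minimal sketch of such a Queue, assuming "drop frames" means discarding the oldest queued frame when the consumer falls behind. The class name DroppingQueue and its interface are made up for this example; they are not part of PCL or Boost.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Bounded queue between two threads. The producer never blocks: when the
// queue is full, the oldest item is discarded to keep latency low.
template <typename T>
class DroppingQueue {
public:
    explicit DroppingQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Producer side. Returns true if an old item had to be dropped.
    bool push(T item) {
        bool dropped = false;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (items_.size() == capacity_) {
                items_.pop_front();
                dropped = true;
            }
            items_.push_back(std::move(item));
        }
        not_empty_.notify_one();
        return dropped;
    }

    // Consumer side. Blocks until an item is available.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop_front();
        return item;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    const std::size_t capacity_;
    mutable std::mutex mutex_;
    std::condition_variable not_empty_;
    std::deque<T> items_;
};
```

The grabber callback would only push the incoming cloud pointer and return, the processing thread would pop from this queue and push its result into a second instance, and the viewer loop would read from that one. The capacity is the tuning knob: a small capacity favours latency, a larger one favours not losing frames.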
In order to fully utilize all the threads, the consumer thread can offload frames to other threads to allow the CPU to process multiple frames at once. For this, you can adopt the following options (they aren't mutually exclusive; you can use option 2 as the workhorse for option 1 for complete benefit):
1) For executing multiple independent functions in parallel (e.g. your workhorse lambda), look at boost::thread_pool or boost::thread_group
2) For a pipeline-based model (where every stage is run in a different thread), you can use frameworks like TaskFlow
Option 1 is suitable when you're not going for a specific metric like latency. It also requires least change to your code, but has trade-off between high latency (current design issue pointed above) or creating one copy of object per-thread for preventing setup every time.
Option 2 is suitable when the latency required is low and the frames need to be ordered. However, in order to maintain low latency in TaskFlow, you'd need to feed the frames at the rate your CPU cores can process them. Else, you can overload the CPU, leave no free RAM causing page thrashing, and actually reduce performance
Both of these options require you to ensure that the output arrives in the correct order. This can be done using promises or a Queue which drops out of order frames. By implementing part of these solutions or all, you can ensure that your CPU utilization remains high, maybe even 100%.
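The "drop out-of-order frames" variant could look roughly like this, assuming every frame is tagged with an increasing sequence number when it is grabbed. The sequence number and the class name InOrderFilter are assumptions for this sketch; the grabber does not provide them by itself.

```cpp
#include <cassert>
#include <cstdint>

// Forwards a frame only if it is newer than everything forwarded so far.
// A frame that finishes processing after a later frame was already
// forwarded is considered stale and is dropped.
class InOrderFilter {
public:
    // Returns true if the frame with this sequence number should be
    // forwarded, false if it arrived too late.
    bool accept(std::uint64_t sequence) {
        if (has_last_ && sequence <= last_) {
            return false;
        }
        last_ = sequence;
        has_last_ = true;
        return true;
    }

private:
    std::uint64_t last_ = 0;
    bool has_last_ = false;
};
```

The filter itself is not thread safe, so it should be called from the single thread that feeds the output Queue. If no frame may be lost, waiting on futures in submission order is the alternative mentioned above.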
In order to know what fits your situation the best, know what your target performance metrics are, profile the code intelligently and test without bias.
Without metrics, you can enter a deep hole of improving performance even if it's not required (eg: 5 fps might be sufficient instead of 60fps)
Without measuring, you can't know if you're getting the correct result or if you're indeed attacking the bottleneck.
Without unbiased testing, you can arrive at wrong conclusions.

How to use .. QNX Momentics Application Profiler?

I'd like to profile my (multi-threaded) application in terms of timing. Certain threads are supposed to be re-activated frequently, i.e. a thread executes its main job once every fixed time interval. In other words, there's a fixed time slice in which all the threads are getting re-activated.
More precisely, I expect certain threads to get activated every 2ms (since this is the cycle period). I made some simplified measurements which confirmed the 2ms to be indeed effective.
For the purpose of profiling my app more accurately it seemed suitable to use Momentics' tool "Application Profiler".
However, when I do so, I fail to interpret the timing figures that I selected. I would be interested in the average as well as the min and max time it takes before a certain thread is re-activated. So far it seems the idea is to be able to monitor only the time certain functions occupy. However, even that does not really seem to be the case. E.g. I've got 2 lines of code that are placed literally next to each other:
if (var1 && var2 && var3) var5=1; takes 1ms (avg)
if (var4) var5=0; takes 5ms (avg)
What is that supposed to tell me?
Another thing confuses me - the parent thread "takes" up 33ms on avg, 2ms on max and 1ms on min. Aside from the fact that the avg shouldn't be bigger than the max (and I would expect the avg to be no bigger than 2ms, since this is the cycle time), it's actually increasing the longer I run the profiling tool. So, if I ran the tool for half an hour, the 33ms would actually be something like 120s. So, it seems that avg is actually the total amount of time the thread occupies the CPU.
If that is the case, I would assume to be able to offset against the total time using the count figure which doesn't work either. Mostly due to the figure being almost never available - i.e. there is only as a separate list entry (for every parent thread) called which does not represent a specific process scope.
So, I read QNX community wiki about the "Application Profiler", incl. the manual about "New IDE Application Profiler Enhancements", as well as the official manual articles about how to use the profiler tool.. but I couldn't figure out how I would use the tool to serve my interest.
Bottom line: I'm pretty sure I'm misinterpreting and misusing the tool for what it was intended to be used. Thus my question - how would I interpret the numbers or use the tool's feedback properly to get my 2ms cycle time confirmed?
Additional information
CPU: single core
QNX SDP 6.5 / Momentics 4.7.0
Profiling Method: Sampling and Call Count Instrumentation
Profiling Scope: Single Application
I enabled "Build for Profiling (Sampling and Call Count Instrumentation)" in the Build Options
The System Profiler should give you what you are looking for. It hooks into the microkernel and lets you see the state of all threads on the system. I used it in a similar setup to find out why our system was getting unexpected time-outs. (The cause turned out to be Page Waits on critical threads.)
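As a cross-check of the 2ms cycle, the min/avg/max re-activation interval can also be computed from a list of activation timestamps, for example ones recorded at the top of the thread's loop or read out of a trace. A small sketch; the names IntervalStats and activation_intervals are made up here, and the timestamps are assumed to be in milliseconds and in increasing order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct IntervalStats {
    double min_ms;
    double max_ms;
    double avg_ms;
    std::size_t count;  // number of intervals, i.e. timestamps - 1
};

// Computes statistics over the gaps between consecutive activations.
IntervalStats activation_intervals(const std::vector<double>& timestamps_ms) {
    assert(timestamps_ms.size() >= 2);
    double min_ms = timestamps_ms[1] - timestamps_ms[0];
    double max_ms = min_ms;
    double sum_ms = 0.0;
    for (std::size_t i = 1; i < timestamps_ms.size(); ++i) {
        const double gap = timestamps_ms[i] - timestamps_ms[i - 1];
        min_ms = std::min(min_ms, gap);
        max_ms = std::max(max_ms, gap);
        sum_ms += gap;
    }
    const std::size_t count = timestamps_ms.size() - 1;
    return IntervalStats{min_ms, max_ms, sum_ms / static_cast<double>(count), count};
}
```

With a healthy 2ms cycle, min, avg and max should all be close to 2; a max well above 2 would point to the kind of delay the System Profiler can then explain.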

Identify the reason for a 200 ms freeze in a time-critical loop

New description of the problem:
I currently run our new data acquisition software in a test environment. The software has two main threads. One contains a fast loop which communicates with the hardware and pushes the data into a dual buffer. Every few seconds, this loop freezes for 200 ms. I did several tests but none of them let me figure out what the software is waiting for. Since the software is rather complex and the test environment could interfere too with the software, I need a tool/technique to test what the recorder thread is waiting for while it is blocked for 200 ms. What tool would be useful to achieve this?
Original question:
In our data acquisition software, we have two threads that provide the main functionality. One thread is responsible for collecting the data from the different sensors and a second thread saves the data to disc in big blocks. The data is collected in a double buffer. It typically contains 100000 bytes per item and collects up to 300 items per second. One buffer is used to write to in the data collection thread and one buffer is used to read the data and save it to disc in the second thread. If all the data has been read, the buffers are switched. The switch of the buffers seems to be a major performance problem. Each time the buffer switches, the data collection thread blocks for about 200 ms, which is far too long. However, once in a while the switching is much faster, taking nearly no time at all. (Test PC: Windows 7 64 bit, i5-4570 CPU @ 3.2 GHz (4 cores), 16 GB DDR3 (800 MHz)).
My guess is, that the performance problem is linked to the data being exchanged between cores. Only if the threads run on the same core by chance, the exchange would be much faster. I thought about setting the thread affinity mask in a way to force both threads to run on the same core, but this also means, that I lose real parallelism. Another idea was to let the buffers collect more data before switching, but this dramatically reduces the update frequency of the data display, since it has to wait for the buffer to switch before it can access the new data.
My question is: Is there a technique to move data from one thread to another which does not disturb the collection thread?
Edit: The double buffer is implemented as two std::vectors which are used as ring buffers. A bool (int) variable is used to tell which buffer is the active write buffer. Each time the double buffer is accessed, the bool value is checked to know which vector should be used. Switching the buffers in the double buffer just means toggling this bool value. Of course during the toggling all reading and writing is blocked by a mutex. I don't think that this mutex could possibly be blocking for 200 ms. By the way, the 200 ms are very reproducible for each switch event.
Locking and releasing a mutex just to switch one bool variable will not take 200ms.
Main problem is probably that two threads are blocking each other in some way.
This kind of blocking is called lock contention. Basically this occurs whenever one process or thread attempts to acquire a lock held by another process or thread. Instead of parallelism you have two threads waiting for each other to finish their part of the work, which has a similar effect to a single-threaded approach.
For further reading I recommend this article, which describes lock contention in more detail.
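To illustrate why the switch itself should be cheap, here is a simplified sketch of a double buffer in which the lock is held only for constant-time operations. It is not the asker's implementation (plain vectors are used instead of ring buffers, and the name DoubleBuffer is made up). If a switch like this still stalls for 200 ms, the time is being spent waiting for the lock or somewhere else entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

template <typename T>
class DoubleBuffer {
public:
    // Collection thread: append one item to the active write buffer.
    void write(T item) {
        std::lock_guard<std::mutex> lock(mutex_);
        buffers_[write_index_].push_back(std::move(item));
    }

    // Saving thread: toggle the buffers and take everything written so far.
    // Only a vector move and an index flip happen under the lock; the slow
    // work (writing to disc) is done by the caller after this returns.
    std::vector<T> swap_and_take() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<T> taken = std::move(buffers_[write_index_]);
        buffers_[write_index_].clear();  // leave the moved-from buffer empty
        write_index_ = 1 - write_index_;
        return taken;
    }

private:
    std::mutex mutex_;
    std::vector<T> buffers_[2];
    int write_index_ = 0;
};
```

The important design point is that nothing slow (file IO, compression, serial port access) runs while the mutex is held.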
Since you are running on Windows, maybe you use Visual Studio? If yes, I would resort to the VS profiler, which is quite good (IMHO) in such cases, as long as you don't need to check data/instruction caches (then Intel's VTune is a natural choice). From my experience VS is good enough to catch contention problems as well as CPU bottlenecks. You can run it directly from VS or as a standalone tool. You don't need VS installed on your test machine; you can just copy the tool and run it locally.
VSPerfCmd.exe /start:SAMPLE /attach:12345 /output:samples - attach to process 12345 and gather CPU sampling info
VSPerfCmd.exe /detach:12345 - detach from process
VSPerfCmd.exe /shutdown - shutdown the profiler, the samples.vsp is written (see first line)
Then you can open the file and inspect it in Visual Studio. If you don't see anything keeping your CPU busy, switch to contention profiling: just change the "start" argument from "SAMPLE" to "CONCURRENCY".
The tool is located under %YourVSInstallDir%\Team Tools\Performance Tools\; AFAIR it is available from VS2010.
Good luck
After discussing the problem in the chat, it turned out that the Windows Performance Analyser is a suitable tool to use. The software is part of the Windows SDK and can be opened using the command wprui in a command window. (Alois Kraus posted this useful link: in the chat). The following steps revealed what the software had been waiting on:
Record information with the WPR using the default settings and load the saved file in the WPA.
Identify the relevant thread. In this case, the recording thread and the saving thread obviously had the highest CPU load. The saving thread could be easily identified. Since it saves data to disc, it is the one with file access. (Look at Memory->Hard Faults)
Check out Computation->CPU usage (Precise) and select Utilization by Process, Thread. Select the process you are analysing. Best display the columns in the order: NewProcess, ReadyingProcess, ReadyingThreadId, NewThreadID, [yellow bar], Ready (µs) sum, Wait(µs) sum, Count...
Under ReadyingProcess, I looked for the process with the largest Wait (µs) since I expected this one to be responsible for the delays.
Under ReadyingThreadID I checked each line referring to the thread with the delays in the NewThreadId column. After a short search, I found a thread that showed frequent Waits of about 100 ms, which always showed up as a pair. In the column ReadyingThreadID, I was able to read the id of the thread the recording loop was waiting for.
According to its CPU usage, this thread did basically nothing. In our special case, this led me to the assumption that the serial port io command could cause this wait. After deactivating them, the delay was gone. The important discovery was that the 200 ms delay was in fact composed of two 100 ms delays.
Further analysis showed that the fetch data command via the virtual serial port pair gets sometimes lost. This might be linked to very high CPU load in the data saving and compression loop. If the fetch command gets lost, no data is received and the first as well as the second attempt to receive the data timed out with their 100 ms timeout time.

What could delay pre-emption of a VxWorks task?

In my current project, I have two levels of tasking, in a VxWorks system, a higher priority (100) task for number crunching and other work and then a lower priority (200) task for background data logging to on-board flash memory. Logging is done using the fwrite() call, to a file stored on a TFFS file system. The high priority task runs at a periodic rate and then sleeps to allow background logging to be done.
My expectation was that the background logging task would run when the high priority task sleeps and be preempted as soon as the high priority task wakes.
What appears to be happening is a significant delay in suspending the background logging task once the high priority task is ready to run again, when there is sufficient data to keep the logging task continuously occupied.
What could delay the pre-emption of a lower priority task under VxWorks 6.8 on a Power PC architecture?
You didn't quantify significant, so the following is just speculation...
You mention writing to flash. One issue is that writing to flash typically requires the driver to poll the status of the hardware to make sure the operation completes successfully.
It is possible that during certain operations, the file system temporarily disables preemption to ensure that no corruption occurs. Coupled with having to wait for the hardware to complete, this might account for the delay.
If you have access to the System Viewer tool, that would go a long way towards identifying the cause of the delay.
I second the suggestion of using the System Viewer. It'll show all the tasks involved in the TFFS stack, and you may be surprised how many layers there are. If you're making an fwrite with a large block of data, the flash access may be large (and slow, as Benoit said). You may try a bunch of smaller fwrites. I suggest doing a test to see how long fwrite() calls take for various sizes; you may see differences from test to test with the same size as you cross flash block boundaries.
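Such a test could be sketched as below. This is portable C++11 and only an illustration: the names WriteTiming and time_fwrites are made up, and on a target whose toolchain lacks <chrono> you would substitute the platform's own timer or timestamp source. The caller opens the file on the file system under test.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

struct WriteTiming {
    std::size_t bytes;  // requested block size
    double seconds;     // wall-clock time for fwrite() plus fflush()
    bool ok;            // true if the whole block was written and flushed
};

// Writes one block per requested size to the given stream and times each.
std::vector<WriteTiming> time_fwrites(std::FILE* file,
                                      const std::vector<std::size_t>& sizes) {
    std::vector<WriteTiming> timings;
    for (std::size_t size : sizes) {
        const std::vector<char> block(size, 'x');
        const auto start = std::chrono::steady_clock::now();
        const std::size_t written = std::fwrite(block.data(), 1, block.size(), file);
        const int flushed = std::fflush(file);  // push data out of the stdio buffer
        const auto stop = std::chrono::steady_clock::now();
        const std::chrono::duration<double> elapsed = stop - start;
        timings.push_back({size, elapsed.count(), written == size && flushed == 0});
    }
    return timings;
}
```

Running it repeatedly with sizes such as 512, 4096 and 65536 bytes, and comparing runs with the same size, should show whether the long delays line up with large writes or with particular positions in the file.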

Platform independent parallelization without changing the framework?

I hope the title did not mislead you.
My problem is the following: Currently I try to speed up a raytracer and this is done with the help of the graphics card. It works fine despite the fact that it got slower by this. :)
This is caused by the fact, that I trace one ray on the whole geometry at once on the graphics card(my "tracing server") and then fetch the results, which is awfully slow, so I have to gather some rays and calc them and fetch the results together to speed this up.
The next problem is, that I am not allowed to rewrite the surrounding framework that should know nothing or least possible about this parallelization.
So here is my approach:
I thought about using several threads, where each one gets a ray and requests my "tracing server" to calc the intersections. Then the thread is stopped until enough rays were gathered to calc the intersections on the graphics card and get the results back efficiently. This means that each thread will wait until the results were fetched.
As you can see, I already have a plan, but I do not know the following:
Which threading framework should I use to be platform-independent?
Should I use a threadpool of fixed size or create them as needed?
Can any given thread library handle at least 1000 waiting threads(because that would be the number that I need to gather for my fetch to be efficient)?
But I also could imagine doing this with one thread that
dumps its load (a new ray) to the "tracing server" and fetches the next load until
there is enough to fetch the results.
Then the thread would take the results one by one, do the further calculations until all results are processed and then goes back to step one until all rays are done.
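The single-threaded variant could be sketched as follows. The "tracing server" is represented here by a callback that takes a whole batch of rays and returns one result per ray; RayBatcher, TraceBatch and Consume are made-up names for this sketch, and the actual graphics card call is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

template <typename Ray, typename Hit>
class RayBatcher {
public:
    using TraceBatch = std::function<std::vector<Hit>(const std::vector<Ray>&)>;
    using Consume = std::function<void(const Ray&, const Hit&)>;

    RayBatcher(std::size_t batch_size, TraceBatch trace, Consume consume)
        : batch_size_(batch_size),
          trace_(std::move(trace)),
          consume_(std::move(consume)) {
        assert(batch_size_ > 0);
    }

    // Queue one ray; trace the whole batch once enough rays were gathered.
    void submit(Ray ray) {
        pending_.push_back(std::move(ray));
        if (pending_.size() >= batch_size_) {
            flush();
        }
    }

    // Trace whatever is pending. Call once at the end for the remainder.
    void flush() {
        if (pending_.empty()) {
            return;
        }
        // Swap first, so that consume_ may submit follow-up rays safely.
        std::vector<Ray> batch;
        batch.swap(pending_);
        const std::vector<Hit> hits = trace_(batch);  // one round trip per batch
        assert(hits.size() == batch.size());
        for (std::size_t i = 0; i < batch.size(); ++i) {
            consume_(batch[i], hits[i]);
        }
    }

private:
    std::size_t batch_size_;
    TraceBatch trace_;
    Consume consume_;
    std::vector<Ray> pending_;
};
```

The catch is that the surrounding framework has to accept results through a callback instead of a return value. If it insists on a blocking call per ray, the many-waiting-threads approach described above is the way to hide the batching.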
Also if you have some better idea how to parallelize this, tell me about it.
If you need this information: The two platforms I want to use are Linux and Windows.
Use either Threading Building Blocks or boost::thread.
As far as threadpool/on-demand-threads - threadpool is generally better idea as it avoids creation overhead.
Number of waiting threads is gonna depend on the underlying system more than anything else:
Maximum number of threads per process in Linux?