Getting current cpu usage in c++/windows for particular process - c++

I want to calculate current cpu usage for particular application in my code. I looked up on internet and found pdh library for windows. When I tried it I am getting overall cpu usage not cpu usage for one process.
PdhAddCounter(hquery, TEXT("\\Processor(_Total)\\% Processor Time"),0,&counter);
So what I do with this line to get cpu usage for particular process? I tried replacing _Total with process name(explorer). At that time I am getting 0 cpu usage. But I checked in resource monitor that opening many windows at a time increased cpu usage upto 20%. Still in log file cpu usage is showing 0.
Can anyone help me with this?
thanks in advance.

You need to use GetProcessTimes
And unfortunately, it won't give you the "CPU usage", it will give you the amount of CPU-time since the process started. So to get CPU usage, you will need to take one sample, store that, and then take another sample a known amount of time later, and then calculate the time (and if you want to know the total usage, you'll need to add the usertime and kerneltime together, of course).

You can check this for example. Explained everything in that project. It will give memory based on process id(same way shown in task manager)
Thanks,
Darshan

Related

How do I measure GPU time on Metal?

I want to see programmatically how much GPU time a part of my application consumes on macOS and iOS. On OpenGL and D3D I can use GPU timer query objects. I searched and couldn't find anything similar for Metal. How do I measure GPU time on Metal without using Instruments etc. I'm using Objective-C.
There are a couple of problems with this method:
1) You really want to know what is the GPU side latency within a command buffer most of the time, not round trip to CPU. This is better measured as the time difference between running 20 instances of the shader and 10 instances of the shader. However, that approach can add noise since the error is the sum of the errors associated with the two measurements.
2) Waiting for completion causes the GPU to clock down when it stops executing. When it starts back up again, the clock is in a low power state and may take quite a while to come up again, skewing your results. This can be a serious problem and may understate your performance in benchmark vs. actual by a factor of two or more.
3) if you start the clock on scheduled and stop on completed, but the GPU is busy running other work, then your elapsed time includes time spent on the other workload. If the GPU is not busy, then you get the clock down problems described in (2).
This problem is considerably harder to do right than most benchmarking cases I've worked with, and I have done a lot of performance measurement.
The best way to measure these things is to use on device performance monitor counters, as it is a direct measure of what is going on, using the machine's own notion of time. I favor ones that report cycles over wall clock time because that tends to weed out clock slewing, but there is not universal agreement about that. (Not all parts of the hardware run at the same frequency, etc.) I would look to the developer tools for methods to measure based on PMCs and if you don't find them, ask for them.
You can add scheduled and completed handler blocks to a command buffer. You can take timestamps in each and compare. There's some latency, since the blocks are executed on the CPU, but it should get you close.
With Metal 2.1, Metal now provides "events", which are more like fences in other APIs. (The name MTLFence was already used for synchronizing shared heap stuff.) In particular, with MTLSharedEvent, you can encode commands to modify the event's value at particular points in the command buffer(s). Then, you can either way for the event to have that value or ask for a block to be executed asynchronously when the event reaches a target value.
That still has problems with latency, etc. (as Ian Ollmann described), but is more fine grained than command buffer scheduling and completion. In particular, as Klaas mentions in a comment, a command buffer being scheduled does not indicate that it has started executing. You could put commands to set an event's value at the beginning and (with a different value) at the end of a sequence of commands, and those would only notify at actual execution time.
Finally, on iOS 10.3+ but not macOS, MTLCommandBuffer has two properties, GPUStartTime and GPUEndTime, with which you can determine how much time a command buffer took to execute on the GPU. This should not be subject to latency in the same way as the other techniques.
As an addition to Ken's comment above, GPUStartTime and GPUEndTime is now available on macOS too (10.15+):
https://developer.apple.com/documentation/metal/mtlcommandbuffer/1639926-gpuendtime?language=objc

High cpu utilization in c++ module

I have C++ module , for a particular flow the CPU utilization goes to 70% and
comes back.
Wanted to know is there a way to find out what is making the cpu utilization
high.
I checked through the program and there is a while loop , but it is executed
only once.
Any command or way to find out the culprit.
Thanks in advance
SKP

How to make TensorFlow use more available CPU

How can I fully utilize each of my EC2 cores?
I'm using a c4.4xlarge AWS Ubuntu EC2 instance and TensorFlow to build a large convoluted neural network. nproc says that my EC2 instance has 16 cores. When I run my convnet training code, the top utility says that I'm only using 400% CPU. I was expecting it to use 1600% CPU because of the 16 cores. The AWS EC2 monitoring tab confirms that I'm only using 25% of my CPU capacity. This is a huge network, and on my new Mac Pro it consumes about 600% CPU and takes a few hours to build, so I don't think the reason is because my network is too small.
I believe the line below ultimately determines CPU usage:
sess = tf.InteractiveSession(config=tf.ConfigProto())
I admit I don't fully understand the relationship between threads and cores, but I tried increasing the number of cores. It had the same effect as the line above: still 400% CPU.
NUM_THREADS = 16
sess = tf.InteractiveSession(config=tf.ConfigProto(intra_op_parallelism_threads=NUM_THREADS))
EDIT:
htop shows that shows that I am actually using all 16 of my EC2 cores, but each core is only at about 25%
top shows that my total CPU % is around 400%, but occasionally it will shoot up to 1300% and then almost immediately go back down to ~400%. This makes me think there could be a deadlock problem
Several things you can try:
Increase the number of threads
You already tried changing the intra_op_parallelism_threads. Depending on your network it can also make sense to increase the inter_op_parallelism_threads. From the doc:
inter_op_parallelism_threads:
Nodes that perform blocking operations are enqueued on a pool of
inter_op_parallelism_threads available in each process. 0 means the
system picks an appropriate number.
intra_op_parallelism_threads:
The execution of an individual op (for
some op types) can be parallelized on a pool of
intra_op_parallelism_threads. 0 means the system picks an appropriate
number.
(Side note: The values from the configuration file referenced above are not the actual default values tensorflow uses but just example values. You can see the actual default configuration by manually inspecting the object returned by tf.ConfigProto().)
Tensorflow uses 0 for the above options meaning it tries to choose appropriate values itself. I don't think tensorflow picked poor values that caused your problem but you can try out different values for the above option to be on the safe side.
Extract traces to see how well your code parallelizes
Have a look at
tensorflow code optimization strategy
It gives you something like this. In this picture you can see that the actual computation happens on far fewer threads than available. This could also be the case for your network. I marked potential synchronization points. There you can see that all threads are active for a short moment which potentially is the reason for the sporadic peaks in CPU utilization that you experience.
Miscellaneous
Make sure you are not running out of memory (htop)
Make sure you are not doing a lot of I/O or something similar

How to use .. QNX Momentics Application Profiler?

I'd like to profile my (multi-threaded) application in terms of timing. Certain threads are supposed to be re-activated frequently, i.e. a thread executes its main job once every fixed time interval. In other words, there's a fixed time slice in which all the threads a getting re-activated.
More precisely, I expect certain threads to get activated every 2ms (since this is the cycle period). I made some simplified measurements which confirmed the 2ms to be indeed effective.
For the purpose of profiling my app more accurately it seemed suitable to use Momentics' tool "Application Profiler".
However when I do so, I fail to interpret the timing figures that I selected. I would be interested in the average as well in the min and max time it takes before a certain thread is re-activated. So far it seems, the idea is to be only able to monitor the times certain functions occupy. However, even that does not really seem to be the case. E.g. I've got 2 lines of code that are put literally next to each other:
if (var1 && var2 && var3) var5=1; takes 1ms (avg)
if (var4) var5=0; takes 5ms (avg)
What is that supposed to tell me?
Another thing confuses me - the parent thread "takes" up 33ms on avg, 2ms on max and 1ms on min. Aside the fact that the avg shouldn't be bigger than max (i.e. even more I expect avg to be not bigger than 2ms - since this is the cycle time), it's actually increasing the longer I run the the profiling tool. So, if I would run the tool for half an hour the 33ms would actually be something like 120s. So, it seems that avg is actually the total amount of time the thread occupies the CPU.
If that is the case, I would assume to be able to offset against the total time using the count figure which doesn't work either. Mostly due to the figure being almost never available - i.e. there is only as a separate list entry (for every parent thread) called which does not represent a specific process scope.
So, I read QNX community wiki about the "Application Profiler", incl. the manual about "New IDE Application Profiler Enhancements", as well as the official manual articles about how to use the profiler tool.. but I couldn't figure out how I would use the tool to serve my interest.
Bottom line: I'm pretty sure I'm misinterpreting and misusing the tool for what it was intended to be used. Thus my question - how would I interpret the numbers or use the tool's feedback properly to get my 2ms cycle time confirmed?
Additional information
CPU: single core
QNX SDP 6.5 / Momentics 4.7.0
Profiling Method: Sampling and Call Count Instrumentation
Profiling Scope: Single Application
I enabled "Build for Profiling (Sampling and Call Count Instrumentation)" in the Build Options1
The System Profiler should give you what you are looking for. It hooks into the micro kernel and lets you see the state of all threads on the system. I used it in a similar setup to find out what our system was getting unexpected time-outs. (The cause turned out to be Page Waits on critical threads.)

CUDA: How many concurrent threads in total on a device?

How do I programatically find the maximum number of concurrent cuda threads or streaming multiprocessors on a device / nvidia graphics card? I know about warpSize, but there is no warpCount.
most answers on the internet concern themselves with looking up things from pdfs.
Have you tried checking their SDK samples , i think this sample is the one you want
Device Query
This does not only depend on the device but also on your code - e.g. things like the number of registers each thread uses or the amount of shared memory your block needs. I would suggest reading about occupancy.
Another thing I would note is that if your code relies on having a certain number of threads resident on the device (e.g. if you wait for several threads to reach some execution point) you are bound to face some race conditions and see your code hanging.