How limit cpu usage from specific process? - c++

How i can limit cpu usage to 10% for example for specific process in windows C++?

You could use Sleep(x) - will slow down your program execution but it will free up CPU cycles
Where x is the time in milliseconds

This is rarely needed and maybe thread priorities are better solution but since you asked, what you should is:
do a small fraction of your "solid" work i.e. calculations
measure how much time step 1) took, let's say it's twork milliseconds
Sleep() for (100/percent - 1)*twork milliseconds where percent is your desired load
go back to 1.
In order for this to work well, you have to be really careful in selecting how big a "fraction" of a calculation is and some tasks are hard to split up. A single fraction should take somewhere between 40 and 250 milliseconds or so, if it takes less, the overhead from sleeping and measuring might become significant, if it's more, the illusion of using 10% CPU will disappear and it will seem like your thread is oscillating between 0 and 100% CPU (which is what happens anyway, but if you do it fast enough then it looks like you only take whatever percent). Two additional things to note: first, as mentioned before me, this is on a thread level and not process level; second, your work must be real CPU work, disk/device/network I/O usually involves a lot of waiting and doesn't take as much CPU.

That is the job of OS, you can not control it.

You can't limit to exactly 10%, but you can reduce it's priority and restrict it to use only one CPU core.

Related

Is there any usage for letting a process "warm up"?

I recently did some digging into memory and how to use it properly. Of course, I also stumbled upon prefetching and how I can make life easier for the CPU.
I ran some benchmarks to see the actual benefits of proper storage/access of data and instructions. These benchmarks showed not only the expected benefits of helping your CPU prefetch, it also showed that prefetching also speeds up the process during runtime. After about 100 program cycles, the CPU seems to have figured it out and has optimized the cache accordingly. This saves me up to 200.000 ticks per cycle, the number drops from around 750.000 to 550.000. I got these Numbers using the qTestLib.
Now to the Question: Is there a safe way to use this runtime-speedup, letting it warm up, so to speak? Or should one not calculate this in at all and just build faster code from the start up?
First of all, there is generally no gain in trying to warm up a process prior to normal execution: That would only speed up the first 100 program cycles in your case, gaining a total of less than 20000 ticks. That's much less than the 75000 ticks you would have to invest in the warming up.
Second, all these gains from warming up a process/cache/whatever are rather brittle. There is a number of events that destroy the warming effect that you generally do not control. Mostly these come from your process not being alone on the system. A process switch can behave pretty much like an asynchronous cache flush, and whenever the kernel needs a page of memory, it may drop a page from the disk cache.
Since the factors make computing time pretty unpredictable, they need to be controlled when running benchmarks that are supposed to produce results of any reliability. Apart from that, these effects are mostly ignored.
It is important to note that keeping the CPU busy isn't necessarily a bad thing. Ideally you want your CPU to run anywhere from 60% to 100% because that means that your computer is actually doing "work". Granted, if there is a process that you are unaware of and that process is taking up CPU cycles, that isn't good.
In answer to your question, the machine usually takes care of this.

how to compute in game loop until the last possible moment

As part of optimizing my 3D game/simulation engine, I'm trying to make the engine self-optimizing.
Essentially, my plan is this. First, get the engine to measure the number of CPU cycles per frame. Then measure how many CPU cycles the various subsystems consume (min, average, max).
Given this information, at just a few specific points in the frame loop, the engine could estimate how many "extra CPU cycles" it has available to perform "optional processing" that is efficient to perform now (the relevant data is in the cache now), but could otherwise be delayed until some subsequent frame if the current frame is in danger of running short of CPU cycles.
The idea is to keep as far ahead of the game as possible on grunt work, so every possible CPU cycle is available to process "demanding frames" (like "many collisions during a single frame") can be processed without failing to call glXSwapBuffers() in time to exchange back/front buffers before the latest possible moment for vsync).
The analysis above presumes swapping back/front buffers is fundamental requirement to assure a constant frame rate. I've seen claims this is not the only approach, but I don't understand the logic.
I captured 64-bit CPU clock cycle times just before and after glXSwapBuffers(), and found frames vary by about 2,000,000 clock cycles! This appears to be due to the fact glXSwapBuffers() doesn't block until vsync (when it can exchange buffers), but instead returns immediately.
Then I added glFinish() immediately before glXSwapBuffers(), which reduced the variation to about 100,000 CPU clock cycles... but then glFinish() blocked for anywhere from 100,000 to 900,000 CPU clock cycles (presumably depending on how much work the nvidia driver had to complete before it could swap buffers). With that kind of variation in how long glXSwapBuffers() may take to complete processing and swap buffers, I wonder whether any "smart approach" has any hope.
The bottom line is, I'm not sure how to achieve my goal, which seems rather straightforward, and does not seem to ask too much of the underlying subsystems (the OpenGL driver for instance). However, I'm still seeing about 1,600,000 cycles variation in "frame time", even with glFinish() immediately before glXSwapBuffers(). I can average the measured "CPU clock cycles per frame" rates and assume the average yields the actual frame rate, but with that much variation my computations might actually cause my engine to skip frames by falsely assuming it can depend on these values.
I will appreciate any insight into the specifics of the various GLX/OpenGL functions involved, or in general approaches that might work better in practice than what I am attempting.
PS: The CPU clock rate of my CPU does not vary when cores are slowed-down or sped-up. Therefore, that's not the source of my problem.
This is my advice: at the end of the rendering just call the swap buffer function and let it block if needed. Actually, you should have a thread that perform all your OpenGL API calls, and only that. If there is another computation to perform (e.g. physics, game logic), use other threads and the operating system will let these threads running while the rendering thread is waiting for vsync.
Furthermore, if some people disable vsync, they would like to see how many frames per seconds they can achieve. But with your approach, it seems that disabling vsync would just let the fps around 60 anyway.
I'll try to re-interpret your problem (so that if I missed something you could tell me and I can update the answer):
Given T is the time you have at your disposal before a Vsync event happens, you want to make your frame using 1xT seconds (or something near to 1).
However, even if you are so able to code tasks so that they can exploit cache locality to achieve fully deterministic time behaviour (you know in advance how much time each tasks require and how much time you have at your disposal) and so you can theorically achieve times like:
0.96xT
0.84xT
0.99xT
You have to deal with some facts:
You don't know T (you tried to mesure it and it seems to hic-cup: those are drivers dependent!)
Timings have errors
Different CPU architectures: you measure CPU cycles for a function but on another CPU that function requires less or more cycles due to better/worse prefeteching or pipelining.
Even when running on the same CPU, another task may pollute the prefeteching algorithm so the same function does not necessarily results in same CPU cycles (depends on functions called before and prefetech algorihtm!)
Operative system could interfere at any time by pausing your application to run some background process, that would increase the time of your "filling" tasks effectively making you miss the Vsync event (even if your "predicted" time is reasonable like 0.85xT)
At some times you can still get a time of
1.3xT
while at the same time you didn't used all the possible CPU power (When you miss a Vsync event you basically wasted your frame time so it becomes wasted CPU power)
You can still workaround ;)
Buffering frames: you store Rendering calls up to 2/3 frames (no more! You already adding some latency, and certain GPU drivers will do a similiar thing to improve parallelism and reduce power consumption!), after that you use the game loop to idle or to do late works.
With that approach it is reasonable to exceed 1xT. because you have some "buffer frames".
Let's see a simple example
You scheduled tasks for 0.95xT but since the program is running on a machine with a different CPU than the one you used to develop the program due to different architecture your frame takes 1.3xT.
No problem you know there are some frames behind so you can still be happy, but now you have to launch a 1xT - 0.3xT task, better using also some security margin so you launch tasks for 0.6xT instead of 0.7xT.
Ops something really went wrong, the frame took again 1.3xT now you exausted your reserve of frames, you just do a simple update and submit GL calls, your program predict 0.4xT
surprise your program took 0.3xT for the following frames even if you scheduled work for more than 2xT, you have again 3 frames queued in the rendering thread.
Since you have some frames and also have late works you schedule a update for 1,5xT
By introducing a little latency you can exploit full CPU power, of course if you measure that most times your queue have more than 2 frames buffered you can just cut down the pool to 2 instead of 3 so that you save some latency.
Of course this assumes you do all work in a sync way (apart deferring GL cals). You can still use some extra threads where necessary (file loading or other heavy tasks) to improve performance (if required).

Multithreading crowds out other processes

I have added multithreading to a raytracer I am writing, and while it does run much faster now, when it's running, my computer is almost unusably slow. Obviously I want to use all my PC's compute power, but I don't want it to prevent any other application from getting access to the CPUs.
I thought about having the threads sleep, but unless they all sleep at the same time, then the other threads would just eat up the extra time. Also, I don't necessarily want to give up a certain percentage of available compute power if I'm not going to use it.
Also, (This is not my official question) I've noticed that for some reason the first thread launched does more work than the second, and the second more than the third, and so on until like the last 5 threads (out of 32) won't actually get a crack at any work, despite the fact that there's plenty to go a around (there's at least 0.5M work items for them to chew through). If someone would like to venture a guess in the comments, it would be appreciated.
If you use the standard threads, you could try to use thread::hardware_concurrency to find out an estimate of the maximul number of threads that are really supported by hardware, in order not to overload your cpu.
If it returns 0 the information is not available. In other cases you could limit yourself to this number or a little bit below (thinking that other processes might use these as well).
If limiting the number of threads does not improve responsiveness, you can also consider calling from time to time this_thread::yield() to give opportunity to reschedule threads. But depending on the kind of job and synchronisation you use, this second alternative might decrease performance.
As requested, my comment as an answer:
It sounds like you've oversubscribed your poor CPU. Try reducing the number of threads?
If there's significantly more threads than hardware cores, a lot of time is going to be wasted switching between threads, scheduling them in the OS, and in contention over shared variables. It would also cause the general slowdown of the other running programs, because they have to contend with the high number of threads from your program (which by default all have the same priority as the other programs' threads in the eyes of the OS scheduler).

What should I check: cpu time or wall time?

I have two algorithms to do the same task. To examine their performance, what should I check: cpu time or wall time? I think it is cpu time, right?
I am doing parallelism of my code. To check my parallelism performance, what should I check: cpu time or wall time? I think it is wall time, right?
Assume I have done an ideal parallelism using multi-threads. I think the cpu time for 1 thread will be same as 8 threads, and the wall time for 1 threas will be 8 times longer than the 8 threads. Is it right?
Also any easy way to check those times?
The answer depends on what you're really trying to measure.
If you have a couple small code sequences where each runs on a single CPU (i.e., it's basically single-threaded) and you want to know which is faster, you probably want CPU time. This will tell you the time taken to execute that code, without counting other things like I/O, task switches, time spent on other processes, interrupt handling, etc. [Note: although it attempts to ignore other facts, you'll still usually get the most accurate results with the system otherwise as quiescent as possible.]
If you're writing multi-threaded code and want to measure how well you're distributing your code across processors/cores, you'll probably measure both CPU time and wall time, and compare the two. If, for example, you have 4 cores available, your ideal would be that the wall time is 1/4th the CPU time.
So, for multithreaded code you'll often end up doing things in two phases: first you look at the time to execute on thread, using CPU time. You optimize to get that to (reasonable) minimum. Then in the second phase, you compare wall time to CPU time, to try to use multiple cores efficiently. Since changing one often affects the other, you may well iterate through the two a number of times (and often compromise between the two to some degree).
Just as a really general rule of thumb, you tend to use CPU time to measure microscopic benchmarks of individual bits of code, and wall time for larger (system-level) benchmarks. In other words, when you want to measure how fast one piece of code runs, and nothing else, CPU time generally make the most sense. When you want to include the effects of things like disk I/O time, caching, etc., then you're a lot more likely to care about wall time.
Wall time tells you how long your computer took. But it doesn't tell you anything about how long the execution of your code took as it depends on other things that were keeping your computer busy.
There are different mechanisms for measuring CPU time spent executing your code - I personally like getrusage()

CPU throttling in C++

I was just wondering if there is an elegant way to set the maximum CPU load for a particular thread doing intensive calculations.
Right now I have located the most time consuming loop in the thread (it does only compression) and use GetTickCount() and Sleep() with hardcoded values. It makes sure that the loop continues for a certain period and then sleeps for a certain minimum time. It more or less does the job, i.e. guarantees that the thread will not use more than 50% of CPU. However, behavior is dependent on the number of CPU cores (huge disadvantage) and simply ugly (smaller disadvantage :)). Any ideas?
I am not aware of any API to do get the OS's scheduler to do what you want (even if your thread is idle-priority, if there are no higher-priority ready threads, yours will run). However, I think you can improvise a fairly elegant throttling function based on what you are already doing. Essentially (I don't have a Windows dev machine handy):
Pick a default amount of time the thread will sleep each iteration. Then, on each iteration (or on every nth iteration, such that the throttling function doesn't itself become a significant CPU load),
Compute the amount of CPU time your thread used since the last time your throttling function was called (I'll call this dCPU). You can use the GetThreadTimes() API to get the amount of time your thread has been executing.
Compute the amount of real time elapsed since the last time your throttling function was called (I'll call this dClock).
dCPU / dClock is the percent CPU usage (of one CPU). If it is higher than you want, increase your sleep time, if lower, decrease the sleep time.
Have your thread sleep for the computed time.
Depending on how your watchdog computes CPU usage, you might want to use GetProcessAffinityMask() to find out how many CPUs the system has. dCPU / (dClock * CPUs) is the percentage of total CPU time available.
You will still have to pick some magic numbers for the initial sleep time and the increment/decrement amount, but I think this algorithm could be tuned to keep a thread running at fairly close to a determined percent of CPU.
On linux, you can change the scheduling priority of a thread with nice().
I can't think of any cross platform way of what you want (or any guaranteed way full stop) but as you are using GetTickCount perhaps you aren't interested in cross platform :)
I'd use interprocess communications and set the intensive processes nice levels to get what you require but I'm not sure that's appropriate for your situation.
EDIT:
I agree with Bernard which is why I think a process rather than a thread might be more appropriate but it just might not suit your purposes.
The problem is it's not normal to want to leave the CPU idle while you have work to do. Normally you set a background task to IDLE priority, and let the OS handle scheduling it all the CPU time that isn't used by interactive tasks.
It sound to me like the problem is the watchdog process.
If your background task is CPU-bound then you want it to take all the unused CPU time for its task.
Maybe you should look at fixing the watchdog program?
You may be able to change the priority of a thread, but changing the maximum utilization would either require polling and hacks to limit how many things are occurring, or using OS tools that can set the maximum utilization of a process.
However, I don't see any circumstance where you would want to do this.