I am currently writing an application in Windows using C++ and I would like to simulate CPU load.
I have the following code:
void task1(void *param) {
unsigned elapsed =0;
unsigned t0;
while(1){
if ((t0=clock())>=50+elapsed){//if time elapsed is 50ms
elapsed=t0;
Sleep(50);
}
}
}
int main(){
int ThreadNr;
for(int i=0; i < 4;i++){//for each core (i.e. 4 cores)
_beginthread( task1, 0, &ThreadNr );//create a new thread and run the "task1" function
}
while(1){}
}
I wrote this code using the same methodology as in the answers given in this thread: Simulate steady CPU load and spikes
My questions are:
Have I translated the C# code from the other post correctly over to C++?
Will this code generate an average CPU load of 50% on a quad-core processor?
How can I, within reasonable accuracy, find out the load percentage of the CPU? (is task manager my only option?)
EDIT: The reason I ask this question is that I eventually want to be able to generate CPU loads of 10, 20, 30, ..., 90% within a reasonable tolerance. This code seems to work well for generating loads above 70% but seems to be very inaccurate at any load below 70% (as measured by the Task Manager CPU load readings).
Would anyone have any ideas as to how I could generate said loads but still be able to use my program on different computers (i.e. with different CPUs)?
At first sight, this looks like not-pretty-but-correct C++ or C (an easy way to be sure is to compile it). Includes are missing (<windows.h>, <process.h>, and <time.h>) but otherwise it compiles fine.
Note that clock and Sleep are not terribly accurate, and Sleep is not terribly reliable either. On average, the thread function should kind of work as intended, though (give or take a few percent of variation).
However, regarding question 2), you should replace the last while(1){} with something that blocks rather than spins (e.g. WaitForSingleObject, or Sleep if you will). Otherwise the entire program will not have 50% load on a quad-core. You will have 100% load on one core due to the main thread, plus the 4x 50% from your four workers. This will obviously sum up to more than 50% per core (and will cause threads to bounce from one core to the other, resulting in nasty side effects).
Using Task Manager or a similar utility to verify whether you get the load you want is a good option (and since it's the easiest solution, it's also the best one).
Also do note that simulating load in such a way will probably kind of work, but is not 100% reliable.
There might be effects (memory, execution units) that are hard to predict. Assume for example that you're using 100% of the CPU's integer execution units with this loop (a reasonable assumption) but none of its floating point or SSE units. Modern CPUs may share resources between real or logical cores, and you might not be able to predict exactly what effects you get. Or, another thread may be memory bound or have significant page faults, so taking away CPU time won't affect it nearly as much as you think (it might in fact give it enough time to make prefetching work better). Or, it might block on AGP transfers. Or, something else you can't tell.
EDIT:
Improved version, shorter code that fixes a few issues and also works as intended:
Uses clock_t for the value returned by clock (which is technically "more correct" than using a not specially typedef'd integer). Incidentally, that's probably the very reason why the original code does not work as intended, since clock_t is a signed integer under Win32. The condition in if() always evaluates true, so the workers sleep almost all the time, consuming no CPU.
Less code, less complicated math when spinning. Computes a wakeup time 50 ticks in the future and spins until that time is reached.
Uses getchar to block the program at the end. This does not burn CPU time, and it allows you to end the program by pressing Enter. Threads are not properly ended as one would normally do, but in this simple case it's probably OK to just let the OS terminate them as the process exits.
Like the original code, this assumes that clock and Sleep use the same ticks. That is admittedly a bold assumption, but it holds true under Win32 which you used in the original code (both "ticks" are milliseconds). C++ doesn't have anything like Sleep (without boost::thread, or C++11 std::thread), so if non-Windows portability is intended, you'd have to rethink anyway.
Like the original code, it relies on functions (clock and Sleep) which are imprecise and unreliable. Sleep(50) equals Sleep(63) on my system without using timeBeginPeriod. Nevertheless, the program works "almost perfectly", resulting in a 50% +/- 0.5% load on my machine.
Like the original code, this does not take thread priorities into account. A process that has a higher than normal priority class will be entirely unimpressed by this throttling code, because that is how the Windows scheduler works.
#include <windows.h>
#include <process.h>
#include <time.h>
#include <stdio.h>
void task1(void *)
{
while(1)
{
clock_t wakeup = clock() + 50;
while(clock() < wakeup) {}
Sleep(50);
}
}
int main(int, char**)
{
int ThreadNr;
for(int i=0; i < 4; i++) _beginthread( task1, 0, &ThreadNr );
(void) getchar();
return 0;
}
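To target other load levels (the 10 to 90% range mentioned in the question), the same spin-then-sleep idea can be parameterized by a duty cycle. The following is only a sketch and not part of the original answer: it swaps clock and Sleep for std::chrono::steady_clock and std::this_thread::sleep_until (C++11), and the names busy_part and run_load are made up for illustration. Each worker thread would call run_load with the desired fraction.

```cpp
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Busy portion of one period for a target load, clamped to [0, 1].
std::chrono::microseconds busy_part(double load, std::chrono::microseconds period) {
    if (load < 0.0) load = 0.0;
    if (load > 1.0) load = 1.0;
    return std::chrono::microseconds(static_cast<long long>(period.count() * load));
}

// Spins for the busy part of each period and sleeps for the rest.
// Returns the wall-clock time that actually passed.
std::chrono::microseconds run_load(double load, std::chrono::microseconds period, int cycles) {
    const auto start = Clock::now();
    const auto busy = busy_part(load, period);
    for (int c = 0; c < cycles; ++c) {
        const auto cycle_start = Clock::now();
        while (Clock::now() - cycle_start < busy) {
            // spin: this is the part that shows up as CPU load
        }
        // Sleep until the end of the period, so a late wakeup in one
        // cycle is not added on top of the next cycle's busy time.
        std::this_thread::sleep_until(cycle_start + period);
    }
    return std::chrono::duration_cast<std::chrono::microseconds>(Clock::now() - start);
}
```

For example, run_load(0.3, std::chrono::milliseconds(100), 600) aims at roughly 30% load on one core for about a minute. The achieved load still depends on the scheduler's granularity, so it should be checked against a measurement rather than assumed.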
Here is a code sample that loaded my CPU to 100% on Windows.
#include "windows.h"
DWORD WINAPI thread_function(void* data)
{
float number = 1.5;
while(true)
{
number*=number;
}
return 0;
}
int main()
{
while (true)
{
CreateThread(NULL, 0, &thread_function, NULL, 0, NULL);
}
}
When you build the app and run it, push Ctrl-C to kill the app.
You can use the Windows perf counter API to get the CPU load. Either for the entire system or for your process.
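As a rough illustration of measuring system-wide load in code, here is a sketch that uses GetSystemTimes rather than the performance counter API mentioned above (a deliberate substitution, because it needs less setup). The struct CpuTimes and the functions load_between and sample are hypothetical names. The calculation relies on the documented behaviour that the kernel time reported by GetSystemTimes includes idle time.

```cpp
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// One snapshot of the system-wide counters, in 100 ns units.
struct CpuTimes {
    std::uint64_t idle;
    std::uint64_t kernel;  // includes idle time
    std::uint64_t user;
};

// Fraction of CPU time (0..1) spent non-idle between two snapshots.
double load_between(const CpuTimes& a, const CpuTimes& b) {
    const std::uint64_t idle = b.idle - a.idle;
    const std::uint64_t total = (b.kernel - a.kernel) + (b.user - a.user);
    if (total == 0) return 0.0;  // no time passed between the snapshots
    return 1.0 - static_cast<double>(idle) / static_cast<double>(total);
}

#ifdef _WIN32
static std::uint64_t to_u64(const FILETIME& ft) {
    return (static_cast<std::uint64_t>(ft.dwHighDateTime) << 32) | ft.dwLowDateTime;
}

// Takes a snapshot; returns false if the API call fails.
bool sample(CpuTimes& out) {
    FILETIME idle, kernel, user;
    if (!GetSystemTimes(&idle, &kernel, &user)) return false;
    out = CpuTimes{ to_u64(idle), to_u64(kernel), to_u64(user) };
    return true;
}
#endif
```

Taking one snapshot, sleeping for a second, taking another and passing both to load_between gives the average load over that second, summed over all cores.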
I have a periodic task in C++, running on an embedded Linux platform, which has to run at 5 ms intervals. It seems to be working as expected, but is my current solution good enough?
I have implemented the scheduler using sleep_until(), but some comments I have received say that setitimer() is better. As I would like the application to be at least somewhat portable, I would prefer standard C++... unless, of course, there are other problems.
I have found plenty of sites that show an implementation with each, but I have not found any arguments for why one solution is better than the other. As I see it, sleep_until() will use an "optimal" implementation on any (supported) platform, and I have a feeling the comments I have received are focused more on usleep() (which I do not use).
My implementation looks a little like this:
bool is_submilli_capable() {
return std::ratio_greater<std::milli,
std::chrono::system_clock::period>::value;
}
int main() {
if (not is_submilli_capable())
exit(1);
while (true) {
auto next_time = next_period_start();
do_the_magic();
std::this_thread::sleep_until(next_time);
}
}
A short summary of the issue:
I have an embedded Linux platform, built with Yocto and with RT capabilities
The application needs to read and process incoming data every 5 ms
Building with gcc 11.2.0
Using C++20
All the "hard work" is done in separate threads, so this question only concerns triggering the task periodically and with minimal jitter
Since the application is supposed to read and process the data every 5 ms, it is possible that a few times, it does not perform the required operations. What I mean to say is that in a time interval of 20 ms, do_the_magic() is supposed to be invoked 4 times... But if the time taken to execute do_the_magic() is 10 ms, it will get invoked only 2 times. If that is an acceptable outcome, the current implementation is good enough.
Since the application is reading data, it probably receives it from the network or disk. And adding the overhead of processing it, it likely takes more than 5 ms to do so (depending on the size of the data). If it is not acceptable to miss out on any invocation of do_the_magic, the current implementation is not good enough.
What you could probably do is create a few threads. Each thread executes the do_the_magic function and then goes to sleep. Every 5 ms, you wake a sleeping thread which will most likely take less than 5 ms to happen. This way no invocation of do_the_magic is missed. Also, the number of threads depends on how long will do_the_magic take to execute.
bool is_submilli_capable() {
return std::ratio_greater<std::milli,
std::chrono::system_clock::period>::value;
}
void wake_some_thread () {
static int i = 0;
release_semaphore (i); // Release semaphore associated with thread i
i++;
i = i % NUM_THREADS;
}
void * thread_func (void * args) {
while (true) {
// Wait for the semaphore associated with this thread
do_the_magic();
}
}
int main() {
if (not is_submilli_capable())
exit(1);
while (true) {
auto next_time = next_period_start();
wake_some_thread (); // Releases a semaphore to wake a thread
std::this_thread::sleep_until(next_time);
}
}
Create as many semaphores as the number of threads where thread i is waiting for semaphore i. wake_some_thread can then release a semaphore starting from index 0 till NUM_THREADS and start again.
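The question does not show next_period_start(), so the following is only a guess at what such a helper could look like, under the assumption that deadlines should stay on a fixed 5 ms grid. The name next_deadline and the use of std::chrono::steady_clock (instead of the system_clock checked in the question) are my assumptions. It also reports how many periods were skipped, which makes the overrun case described above detectable.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Returns the first deadline after `now` on the grid prev + k * period (k >= 1).
// If `missed` is non-null it receives the number of deadlines that had
// already passed. Assumes period > 0.
Clock::time_point next_deadline(Clock::time_point prev, Clock::duration period,
                                Clock::time_point now, int* missed = nullptr) {
    Clock::time_point next = prev + period;
    int skipped = 0;
    while (next <= now) {  // work overran one or more periods
        next += period;
        ++skipped;
    }
    if (missed) *missed = skipped;
    return next;
}

// Possible use in the main loop (do_the_magic is the question's function):
//   auto next = Clock::now();
//   for (;;) {
//       do_the_magic();
//       int missed = 0;
//       next = next_deadline(next, std::chrono::milliseconds(5), Clock::now(), &missed);
//       std::this_thread::sleep_until(next);
//   }
```

Because each deadline is derived from the previous deadline and not from the wakeup time, a late wakeup does not accumulate as drift.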
5ms is a pretty tight timing.
You can get a jitter-free 5ms tick only if you do the following:
Isolate a CPU for this thread. Configure it with nohz_full and rcu_nocbs
Pin your thread to this CPU, assign it a real-time scheduling policy (e.g., SCHED_FIFO)
Do not let any other threads run on this CPU core.
Do not allow any context switches in this thread. This includes avoiding system calls altogether. I.e., you cannot use std::this_thread::sleep_until(...) or anything else.
Do a busy wait in between processing (ensure 100% CPU utilisation)
Use lock-free communication to transfer data from this thread to other, non-real-time threads, e.g., for storing the data to files, accessing network, logging to console, etc.
Now, the question is how you're going to "read and process data" without system calls. It depends on your system. If you can do any user-space I/O (map the physical register addresses to your process address space, use DMA without interrupts, etc.) - you'll have a perfectly real-time processing. Otherwise, any system call will trigger a context switch, and latency of this context switch will be unpredictable.
For example, you can do this with certain Ethernet devices (SolarFlare, etc.), with 100% user-space drivers. For anything else you're likely to have to write your own user-space driver, or even implement your own interrupt-free device (e.g., if you're running on an FPGA SoC).
I'm using QueryPerformanceCounter to do some timing in my application. However, after running it for a few days the application seems to stop functioning properly. If I simply restart the application it starts working again. This makes me believe I have an overflow problem in my timing code.
// Author: Ryan M. Geiss
// http://www.geisswerks.com/ryan/FAQS/timing.html
class timer
{
public:
timer()
{
QueryPerformanceFrequency(&freq_);
QueryPerformanceCounter(&time_);
}
void tick(double interval)
{
LARGE_INTEGER t;
QueryPerformanceCounter(&t);
if (time_.QuadPart != 0)
{
int ticks_to_wait = static_cast<int>(static_cast<double>(freq_.QuadPart) * interval);
int done = 0;
do
{
QueryPerformanceCounter(&t);
int ticks_passed = static_cast<int>(static_cast<__int64>(t.QuadPart) - static_cast<__int64>(time_.QuadPart));
int ticks_left = ticks_to_wait - ticks_passed;
if (t.QuadPart < time_.QuadPart) // time wrap
done = 1;
if (ticks_passed >= ticks_to_wait)
done = 1;
if (!done)
{
// if > 0.002s left, do Sleep(1), which will actually sleep some
// steady amount, probably 1-2 ms,
// and do so in a nice way (cpu meter drops; laptop battery spared).
// otherwise, do a few Sleep(0)'s, which just give up the timeslice,
// but don't really save cpu or battery, but do pass a tiny
// amount of time.
if (ticks_left > static_cast<int>((freq_.QuadPart*2)/1000))
Sleep(1);
else
for (int i = 0; i < 10; ++i)
Sleep(0); // causes thread to give up its timeslice
}
}
while (!done);
}
time_ = t;
}
private:
LARGE_INTEGER freq_;
LARGE_INTEGER time_;
};
My question is whether the code above should work deterministically for weeks of running continuously?
And if not where the problem is? I thought the overflow was handled by
if (t.QuadPart < time_.QuadPart) // time wrap
done = 1;
But maybe that's not enough?
EDIT: Please observe that I did not write the original code, Ryan M. Geiss did, the link to the original source of the code is in the code.
QueryPerformanceCounter is notorious for its unreliability. It's fine to use for individual short-interval timing, if you're prepared to handle abnormal results. It is not exact: it's typically based on the PCI bus frequency, and a heavily loaded bus can lead to lost ticks.
GetTickCount is actually more stable, and can give you 1ms resolution if you've called timeBeginPeriod. It will eventually wrap, so you need to handle that.
__rdtsc should not be used, unless you're profiling and have control of which core you're running on and are prepared to handle variable CPU frequency.
GetSystemTime is decent for longer periods of measurements, but will jump when the system time is adjusted.
Also, Sleep(0) does not do what you think it does. It will yield the cpu if another context wants it - otherwise it'll return immediately.
In short, timing on windows is a mess. One would think that today it'd be possible to get accurate long-term timing from a computer without going through hoops - but this isn't the case. In our game framework we're using several time sources and corrections from the server to ensure all connected clients have the same game time, and there's a lot of bad clocks out there.
Your best bet would likely be to just use GetTickCount or GetSystemTime, wrap it into something that adjusts for time jumps/wrap arounds.
Also, you should convert your double interval to an int64 milliseconds and then use only integer math - this avoids problems due to floating point types' varying accuracy based on their contents.
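A small sketch of the last two points, assuming a 32-bit millisecond tick source such as GetTickCount (which wraps after roughly 49.7 days). The function names are hypothetical. Unsigned subtraction gives the correct elapsed time across a single wrap, and the interval is converted to integer milliseconds once, up front.

```cpp
#include <cmath>
#include <cstdint>

// Convert an interval in seconds to whole milliseconds, once.
std::int64_t interval_to_ms(double seconds) {
    return static_cast<std::int64_t>(std::llround(seconds * 1000.0));
}

// Elapsed milliseconds between two 32-bit tick values.
// Unsigned arithmetic is modulo 2^32, so one wrap-around is handled for free.
std::uint32_t elapsed_ms(std::uint32_t start, std::uint32_t now) {
    return now - start;
}

// True once at least interval_ms milliseconds have passed since start.
bool interval_elapsed(std::uint32_t start, std::uint32_t now, std::int64_t interval_ms) {
    return static_cast<std::int64_t>(elapsed_ms(start, now)) >= interval_ms;
}
```

This only stays correct if the two samples are less than one full wrap period apart.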
Based on your comment, you probably should be using Waitable Timers instead.
See the following examples:
Using Waitable Timer Objects
Using Waitable Timers with an Asynchronous Procedure Call
Performance counters are 64-bit, so they are large enough for years of running continuously. For example, if you assume the performance counter increments 2 billion times each second (some imaginary 2 GHz processor) it will overflow in about 290 years.
Using a nanosecond-scale timer to control something like Sleep(), which at best is precise to several milliseconds (and usually, several dozen milliseconds), is somewhat questionable anyway.
A different approach you might consider would be to use WaitForSingleObject or a similar function. This burns fewer CPU cycles, causes a trillion fewer context switches over the day, and is more reliable than Sleep(0), too.
You could for example create a semaphore and never touch it in normal operation. The semaphore exists only so you can wait on something, if you don't have anything better to wait on. Then you can specify a timeout in milliseconds up to 49 days long with a single syscall. And, it will not only be less work, it will be much more accurate too.
The advantage is that if "something happens", so you want to break up earlier than that, you only need to signal the semaphore. The wait call will return instantly, and you will know from the WAIT_OBJECT_0 return value that it was due to being signaled, not due to time running out. And all that without complicated logic and counting cycles.
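The same "wait with a timeout, but allow an early wakeup" pattern can be sketched in standard C++ with a condition variable. This is a portable analogue, not the WaitForSingleObject call itself, and the class name InterruptibleSleep is made up. The return value plays the role of distinguishing WAIT_OBJECT_0 from a timeout.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

class InterruptibleSleep {
public:
    // Blocks for up to `timeout`. Returns true if signal() was called
    // (before or during the wait), false if the timeout expired.
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        const bool signaled = cv_.wait_for(lock, timeout, [this] { return signaled_; });
        signaled_ = false;  // auto-reset so the object can be reused
        return signaled;
    }

    // Wakes a waiting thread early; safe to call from another thread.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        cv_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};
```

The thread blocks inside the wait and consumes no CPU until either the timeout expires or another thread calls signal().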
The problem you asked about most directly:
if (t.QuadPart < time_.QuadPart)
should instead be this:
if (t.QuadPart - time_.QuadPart < 0)
The reason for that is that you want to look for wrapping in relative time, not absolute time. Relative time will wrap (1ull<<63) time units after the reference call to QPC. Absolute time might wrap (1ull<<63) time units after reboot, but it could wrap at any other time it felt like it, that's undefined.
QPC is a little bugged on some systems (older RDTSC-based QPCs on early multicore CPUs, for instance) so it may be desirable to allow small negative time deltas like so:
if (t.QuadPart - time_.QuadPart < -1000000) //time wrap
An actual wrap will produce a very large negative time delta, so that's safe. It shouldn't be necessary on modern systems, but trusting Microsoft is rarely a good idea.
...
However, the bigger problem with time wrapping is that ticks_to_wait, ticks_passed, and ticks_left are all int, not LARGE_INTEGER or long long like they should be. This makes most of that code wrap if any significant time periods are involved, and "significant" in this context is platform dependent: it can be on the order of 1 second in a few (rare these days) cases, or even less on some hypothetical future system.
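To put numbers on this: a 32-bit int holds at most 2,147,483,647 ticks, which is about 214.7 seconds at a counter frequency of 10 MHz and about 0.7 seconds at 3 GHz (both frequencies are just example values). A sketch of the tick arithmetic in 64-bit integers, with hypothetical function names:

```cpp
#include <climits>
#include <cstdint>

// How long an int tick count lasts at a given counter frequency (Hz).
double seconds_until_int_overflow(std::int64_t freq) {
    return static_cast<double>(INT_MAX) / static_cast<double>(freq);
}

// Number of ticks to wait for an interval in seconds, kept in 64 bits.
std::int64_t ticks_for(std::int64_t freq, double seconds) {
    return static_cast<std::int64_t>(static_cast<double>(freq) * seconds);
}

// Compare the 64-bit difference; nothing is truncated to int.
bool ticks_elapsed(std::int64_t start, std::int64_t now, std::int64_t ticks_to_wait) {
    return now - start >= ticks_to_wait;
}
```

In the timer class, start and now would be the QuadPart members of the two LARGE_INTEGER values.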
Other issues:
if (time_.QuadPart != 0)
Zero is not a special value there, and should not be treated as such. My guess is that the code is conflating QPC returning a time of zero with QPC's return value being zero. The return value is not the 64-bit time passed by pointer, it's the BOOL that QPC actually returns.
Also, that loop of Sleep(0) is foolish - it appears to be tuned to behave correctly only on a particular level of contention and a particular per-thread CPU performance. If you need resolution that's a horrible idea, and if you don't need resolution then that entire function should have just been a single call to Sleep.
I've got a loop that looks like this:
while (elapsedTime < refreshRate)
{
timer.stopTimer();
elapsedTime=timer.getElapsedTime();
}
I read something similar to this elsewhere (C Main Loop without 100% cpu), but this loop is running a high resolution timer that must be accurate. So how am I supposed to not take up 100% CPU while still keeping it high resolution?
You shouldn't busy-wait but rather have the OS tell you when the time has passed.
http://msdn.microsoft.com/en-us/library/ms712704(VS.85).aspx
High resolution timers (Higher than 10 ms)
http://msdn.microsoft.com/en-us/magazine/cc163996.aspx
When you say that your timer must be "accurate", how accurate do you actually need to be? If you only need to be accurate to the nearest millisecond, then you can add a half-millisecond sleep inside the loop. You can also add a dynamically-changing sleep statement based off of how much time you have left to sleep. Think of something like (pseudocode):
int time_left = refreshRate - elapsedTime;
while (time_left > 0) {
if (time_left > threshold)
sleep_for_interval(time_left / 2);
update_timestamp(elapsedTime);
time_left = refreshRate - elapsedTime;
}
With that algorithm, your code will sleep in short bursts while it detects that you still have a while to wait. You would want to run some tests to find an optimal value for threshold that balances CPU usage savings against the risk of overshoot (caused by your app losing the CPU when it sleeps and not being scheduled again in time).
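A runnable rendering of that pseudocode, as a sketch only: it assumes std::chrono::steady_clock as the time source and std::this_thread::sleep_for as the sleep primitive, and hybrid_wait_until is a made-up name. While more than the threshold remains it sleeps for half the remaining time, and below the threshold it spins.

```cpp
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Waits until `deadline`: sleeps while plenty of time is left,
// busy-waits for the last stretch shorter than `threshold`.
void hybrid_wait_until(Clock::time_point deadline, Clock::duration threshold) {
    for (;;) {
        const auto left = deadline - Clock::now();
        if (left <= Clock::duration::zero()) return;  // deadline reached
        if (left > threshold) {
            std::this_thread::sleep_for(left / 2);    // cheap, but may overshoot
        }
        // else: fall through and re-check immediately (spin)
    }
}
```

A call such as hybrid_wait_until(Clock::now() + std::chrono::milliseconds(16), std::chrono::milliseconds(2)) spins only for roughly the last 2 ms; a suitable threshold depends on how much the platform's sleep overshoots.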
The other method for high-resolution timing is to use a hardware timer that triggers a periodic interrupt. Your interrupt handler would send a signal to some thread that it needs to wake up and do something, after which it goes back to sleep and waits for the next signal to come in.
Real-Time Operating Systems have ways to do this sort of things built into the OS. If you're doing Windows programming and need extremely precise timing, be aware that that's not the sort of thing that a general-purpose OS like Windows handles very well.
Look at some timers delivered by the OS, like POSIX usleep.
On the other hand, if you need hyper precision, your code will not work either, because the OS will interrupt this loop once the process has exhausted its time quantum and switch to kernel space to run system tasks. For that you would need a special OS with a preemptible kernel and the tools it provides; look for the keyword RTOS.
Typically, you yield to the OS in some fashion. This allows the OS to take a break from your program and do something else.
Obviously this is OS dependent, but:
#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif
void yield(void)
{
#ifdef _WIN32
Sleep(0);
#else
usleep(1);
#endif
}
Insert a call to yield before you stop the timer. The OS will report less time usage by your program.
Keep in mind, of course, this makes your timer "less accurate", because it might not update as frequently as possible. But you really shouldn't depend on extreme-accuracy, it's far too difficult. Approximations are okay.
I am running some profiling tests, and usleep is a useful function. But while my program is sleeping, this time does not appear in the profile.
E.g. if I have a function such as:
void f1() {
for (int i = 0; i < 1000; i++)
usleep(1000);
}
With profiling tools such as gprof, f1 does not seem to consume any time.
What I am looking for is a method nicer than an empty while loop for doing an active sleep, like:
while (1) {
if (gettime() == whatiwant)
break;
}
What kind of a system are you on? In UNIX-like systems you can use setitimer() to send a signal to a process after a specified period of time. This is the facility you would need to implement the type of "active sleep" you're looking for.
Set the timer, then loop until you receive the signal.
Because when you call usleep, the CPU is put to work on something else for the duration of the sleep (1 ms per call here, about 1 second for the whole loop). So the current thread does not use any processor resources, and that's a very clever thing to do.
An active sleep is something to absolutely avoid because it's a waste of resources (ultimately damaging the environment by converting electricity to heat ;) ).
Anyway if you really want to do that you must give some real work to do to the processor, something that will not be factored out by compiler optimizations. For example
for (i = 0; i < 1000; i++)
time(NULL);
I assume you want to find out the total amount of time (wall-clock time, real-world time, the time you are sitting watching your app run) f1() is taking, as opposed to CPU time. I'd investigate to see if gprof can give you a wall-clock-time instead of a processing-time.
I imagine it depends upon your OS, but the reason you aren't seeing usleep as taking any process time in the profile is because it technically isn't using any during that time - other running processes are (assuming this is running on a *nix platform).
for (int i = 0; i < SOME_BIG_NUMBER; ++i);
The entire point in "sleep" functions is that your application is not running. It is put in a sleep queue, and the OS transfers control to another process. If you want your application to run, but do nothing, an empty loop is a simple solution. But you lose all the benefits of sleep (letting other applications run, saving CPU usage/power consumption)
So what you're asking makes no sense. You can't have your application sleep, but still be running.
AFAIK the only option is to do a while loop. The operating system generally assumes that if you want to wait for a period of time that you will want to be yielding to the operating system.
Being able to get a microsecond-accurate timer is also a potential issue. AFAIK there isn't a cross-platform way of doing timing (please someone correct me on this, because I'd love a cross-platform sub-microsecond timer! :D). Under Win32, you could surround a loop with some QueryPerformanceCounter calls to work out when you have spent enough time in the loop and then exit.
E.g.
void USleepEatCycles( __int64 uSecs )
{
__int64 frequency;
QueryPerformanceFrequency( (LARGE_INTEGER*)&frequency );
__int64 counter;
QueryPerformanceCounter( (LARGE_INTEGER*)&counter );
double dStart = (double)counter / (double)frequency;
double dEnd = dStart;
while( (dEnd - dStart) * 1000000.0 < (double)uSecs ) // dStart and dEnd are in seconds, uSecs is in microseconds
{
QueryPerformanceCounter( (LARGE_INTEGER*)&counter );
dEnd = (double)counter / (double)frequency;
}
}
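Since C++11, std::chrono::steady_clock offers a portable monotonic time source, so the same cycle-eating wait can be sketched without QueryPerformanceCounter. The actual resolution is implementation dependent, and spin_for is a hypothetical name.

```cpp
#include <chrono>

// Busy-waits for at least `duration` and returns the time actually spent.
// The thread stays runnable the whole time, so it shows up in a CPU profile.
std::chrono::nanoseconds spin_for(std::chrono::nanoseconds duration) {
    const auto start = std::chrono::steady_clock::now();
    auto now = start;
    while (now - start < duration) {
        now = std::chrono::steady_clock::now();
    }
    return std::chrono::duration_cast<std::chrono::nanoseconds>(now - start);
}
```

For example, spin_for(std::chrono::microseconds(1000)) would stand in for usleep(1000) when the waiting time is supposed to count as CPU time.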
That's why it's important when profiling to look at the "Switched Out %" time. Basically, while your function's exclusive time may be little, if it performs e.g. I/O, DB, etc, waiting for external resources, then "Switched Out %" is the metric to watch out.
This is the kind of confusion you get with gprof, since what you care about is wall-clock time. I use this.
I am creating a test program to test the functionality of a program which calculates CPU utilization.
Now I want to test that program at different times, when CPU utilization is 100%, 50%, 0%, etc.
My question is how to make the CPU reach 100% utilization, or at least more than 80%.
I thought creating a while loop like this would suffice:
while(i++< 2000)
{
cout<<" in while "<< endl;
Sleep(10); // sleep for 10 ms.
}
After running this I don't get high CPU utilization.
What would be possible ways to make the program CPU intensive?
You're right to use a loop, but:
You've got IO
You've got a sleep
Basically nothing in that loop is going to take very much CPU time compared with the time it's sleeping or waiting for IO.
To kill a CPU you need to give it just CPU stuff. The only tricky bit really is making sure the C++ compiler doesn't optimise away the loop. Something like this should probably be okay:
// A bit like generating a hashcode. Pretty arbitrary choice,
// but simple code which would be hard for the compiler to
// optimise away.
int running_total = 23;
for (int i=0; i < some_large_number; i++)
{
running_total = 37 * running_total + i;
}
return running_total;
Note the fact that I'm returning the value out of the loop. That should stop the C++ compiler from noticing that the loop is useless (if you never used the value anywhere, the loop would have no purpose). You may want to disable inlining too, as otherwise I guess there's a possibility that a smart compiler would notice you calling the function without using the return value, and inline it to nothing. (As Suma points out in the answer, using volatile when calling the function should disable inlining.)
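One way to make the "result is used" property explicit is to store it in a volatile variable. The sketch below is a variation on the loop above, not the original code: it uses unsigned arithmetic, because the int version overflows for large iteration counts and signed overflow is undefined behaviour in C++. The names churn, g_sink and burn are made up.

```cpp
#include <cstdint>

// Same recurrence as above, but with unsigned wrap-around (well defined).
std::uint32_t churn(std::uint32_t iterations) {
    std::uint32_t running_total = 23;
    for (std::uint32_t i = 0; i < iterations; ++i) {
        running_total = 37u * running_total + i;
    }
    return running_total;
}

// Writing to a volatile object is an observable side effect,
// so the compiler has to produce the value that is stored.
volatile std::uint32_t g_sink = 0;

void burn(std::uint32_t iterations) {
    g_sink = churn(iterations);
}
```

This forces the result to be produced, but a sufficiently clever optimizer could still shorten the computation itself, so it is worth checking the generated load in practice.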
Your loop mostly sleeps, which means it has a very light CPU load. Besides Sleep, be sure to include a loop performing some computation, like this (the Factorial implementation is left as an exercise to the reader; you may replace it with any other non-trivial function).
while(i++< 2000)
{
int sleepBalance = 10; // increase this to reduce the CPU load
int computeBalance = 1000; // increase this to increase the CPU load
for (int j=0; j<computeBalance; j++) // j, so the outer loop counter i is not shadowed
{
/* both volatiles are important to prevent compiler */
/* optimizing out the function */
volatile int n = 30;
volatile int pretendWeNeedTheResult = Factorial(n);
}
Sleep(sleepBalance);
}
By adjusting sleepBalance / computeBalance you can adjust how much CPU this program takes. If you want to use this as a CPU load simulation, you might want to take a few additional steps:
on a multicore system be sure to either spawn the loop like this in multiple threads (one for each CPU), or execute the process multiple times, and to make the scheduling predictable assign thread/process affinity explicitly
sometimes you may also want to increase the thread/process priority to simulate the environment where CPU is heavily loaded with high priority applications.
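For completeness, one possible Factorial to plug into the loop above. This is my own sketch, not the answer author's code. It uses a 64-bit unsigned result so that large arguments such as 30 wrap around in a well-defined way instead of overflowing a signed int; only values up to 20! are mathematically exact, which does not matter for burning CPU.

```cpp
#include <cstdint>

// Iterative factorial. Results above 20! wrap modulo 2^64, which is
// well defined for unsigned types and irrelevant for a load generator.
std::uint64_t Factorial(std::uint64_t n) {
    std::uint64_t result = 1;
    for (std::uint64_t k = 2; k <= n; ++k) {
        result *= k;
    }
    return result;
}
```

A single call is very cheap, which is why the surrounding loop repeats it computeBalance times between sleeps.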
Use consume.exe in the Windows SDK.
Don't roll your own when someone else has already done the work and will give it to you for free.
If you call Sleep in your loop then most of the loop's time will be spent doing nothing (sleeping). This is why your CPU utilization is low: that 10 ms sleep is huge compared to the time the CPU will spend executing the rest of the code in each loop iteration. It is a non-trivial task to write code to accurately waste CPU time. Roger's suggestion of using CPU Burn-In is a good one.
I know the "yes" command on UNIX systems, when routed to /dev/null, will eat up 100% CPU on a single core (it doesn't thread). You can launch multiple instances of it to utilize each core. You could probably compile the "yes" code into your application and call it directly. You don't specify what C++ compiler you are using for Windows, but I am going to assume it has POSIX compatibility of some kind (à la Cygwin). If that's the case, "yes" should work fine.
To make a thread use a lot of CPU, make sure it doesn't block/wait. Your Sleep call will suspend the thread and not schedule it for at least the number of ms the Sleep call indicates, during which it will not use the CPU.
Get hold of a copy of CPU Burn-In.