sleep without system or IO calls - c++

I need a sleep that does not issue any system or IO calls for a scenario with Hardware Transactional Memory (these calls would lead to an abort). Sleeping for 1 microsecond as in usleep(1) would be just fine.
This question suggests implementing nested loops to keep the program busy and delay it for some time. However, I want to be able to compile with optimization, which would eliminate these loops.
One idea could be to evaluate some sophisticated math expression. Are there established approaches to this? The actual time waited does not have to be precise - it should, however, be roughly the same across multiple runs.

Try a nop loop with a volatile asm directive:
for (int i = 0; i < 1000; i++) {
    asm volatile ("nop");
}
The volatile should prevent the optimizer from getting rid of it. If that doesn't do it, then try __volatile__.

The tricky part here is the timing. Querying any sort of timer may well count as an I/O function, depending on the OS.
But if you just want a delay loop, when timing isn't that important, you should look to platform-specific code. For example, there is an Intel-specific intrinsic called _mm_pause that translates to a CPU pause instruction, which basically halts the pipeline until the next memory bus sync comes through. It was designed to be put into a spinlock loop (no point in spinning and requerying an atomic variable until there is a possibility of new information), but it might (might - read the documentation) inhibit the compiler from removing your delay loop as empty.
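A minimal sketch of that idea (assuming x86 and a compiler that provides the <immintrin.h> intrinsics; the iteration count is an arbitrary placeholder you would tune for your hardware):
#include <immintrin.h>   // _mm_pause

// Busy-wait for roughly "iterations" executions of the pause instruction.
// Mainstream compilers treat the intrinsic as having side effects, so the
// loop is unlikely to be removed, but check the generated assembly for
// your toolchain to be sure.
static inline void pause_delay(unsigned iterations)
{
    for (unsigned i = 0; i < iterations; ++i)
        _mm_pause();
}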

You can use this code:
#include <time.h>

void delay(int n)   // n is in milliseconds
{
    n *= CLOCKS_PER_SEC / 1000;
    clock_t t1 = clock();
    while (clock() <= t1 + n && clock() >= t1)
        ;   // the second comparison ends the wait if the clock counter wraps
}
Sometimes (not very often) this function will cause less delay than specified due to clock counter overflow.
Update
Another option is to use a busy loop with volatile counters, for example:
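A rough sketch of that approach (the count is an arbitrary placeholder; the volatile counter forces the compiler to perform every increment, but the actual delay depends on the CPU and its clock speed, so calibrate the count once against a real timer outside the transactional region):
// The volatile qualifier makes every increment an observable access,
// so the loop survives optimization even at -O2/-O3.
static void volatile_delay(unsigned long count)
{
    volatile unsigned long dummy = 0;
    for (unsigned long i = 0; i < count; ++i)
        ++dummy;
}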

Related

setting/pausing clock time for std::chrono clocks in C++

I wrote something to measure how long my code takes to run, and to print it out. The way I have it now supports nesting of these measurements.
The thing is that getting the time interval, converting it into a number, formatting it, building the output string, and printing it all takes a while (2-3 milliseconds); the I/O in particular seems expensive. I want the clock to "skip over" this process in a sense, since the things I'm measuring are in the microseconds. (And I think it'd be a good feature regardless, if there are other things I want to skip.)
std::chrono::high_resolution_clock clock;
std::chrono::time_point<std::chrono::steady_clock> first, second;
first = clock.now();
std::chrono::time_point<std::chrono::high_resolution_clock> paused_tp = clock.now();
std::cout << "Printing things \n";
clock.setTime(paused_tp); // Something like this is what I want
second = clock.now();
The idea is to make sure that first and second have minimal differences, ideally identical.
From what I see, the high_resolution_clock class (and all the other chrono clocks) keeps its time_point private, and you can only access it through clock.now()
I know there might be benchmarking libraries out there that do this, but I'd like to figure out how to do it myself (if only for the sake of knowing how to do it). Any information on how other libraries do it or insights on how chrono works would be appreciated as well. I might be misunderstanding how chrono internally keeps track.
(Is std::chrono::high_resolution_clock even accurate enough for something like this?)
(While I'm here any resources on making C++ programs more efficient would be great)
Edit: I actually do the printing after the section of code I'm trying to time; the problem only arises when, say, I want to time the entire program as well as individual functions. Then the printing of each function's time adds delay to the overall program time.
Edit 2: I figured I should have more of an example of what I'm doing.
I have a class that handles everything; let's say it's called tracker, and it takes care of all that clock nonsense.
tracker loop = TRACK(
    for (int i = 0; i != 100; ++i) {
        tracker b = TRACK(function_call());
        b.display();
    }
)
loop.display();
The macro is optional; it's just a quick thing that makes the code less cluttered and lets me display the name of the function being called.
Explicitly the macro expands to
tracker loop = "for(int i = 0; i != 100; ++i){ tracker b = TRACK(function_call()); b.display(); }"
loop.start()
for(int i = 0; i != 100; ++i){
tracker b = "function_call()"
b.start();
function_call();
b.end();
b.display();
}
loop.end();
loop.display();
In most situations the printing isn't an issue, it only keeps track of what's between start() and end(), but here the b.display() ends up interfering with the tracker loop.
A goal of mine with this was for the tracker to be as non-intrusive as possible, so I'd like most/all of it to be handled in the tracker class. But then I run into the problem of b.display() being a method of a different instance than the tracker loop. I've tried a few things with the static keyword but ran into a few issues with that (still trying a little). I might've coded myself into a dead end here, but there are still a lot of things left to try.
Just time the two intervals separately and add them, i.e. save 4 total timestamps. For nested intervals, you might just save timestamps into an array and sort everything out at the end. (Or inside an outer loop before timestamps get overwritten). Storing to an array is quite cheap.
Or better: defer printing until later.
If the timed interval is only milliseconds, just save what you were going to print and do it outside the timed interval.
If you have nested timed intervals, at least sink the printing out of the inner-most intervals to minimize the amount of stop/restart you have to do.
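A rough sketch of the deferred-printing idea (the names here - sample, record, flush_samples - are placeholders, not the asker's tracker class):
#include <chrono>
#include <cstdio>
#include <vector>

using clk = std::chrono::steady_clock;

struct sample { const char* label; clk::time_point start, stop; };

std::vector<sample> samples;   // filled inside timed regions, printed later

// Inside the timed region: just two now() calls and a cheap vector append.
inline void record(const char* label, clk::time_point start, clk::time_point stop)
{
    samples.push_back({label, start, stop});
}

// Outside all timed regions: do the expensive formatting and I/O in one go.
void flush_samples()
{
    for (const sample& s : samples) {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(s.stop - s.start);
        std::printf("%s: %lld us\n", s.label, static_cast<long long>(us.count()));
    }
    samples.clear();
}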
If you're manually instrumenting your code all over the place, maybe look at profiling tools like flamegraph, especially if your timed intervals break down on function boundaries. linux perf: how to interpret and find hotspots.
Not only does I/O take time itself, it will make later code run slower for a few hundreds or thousands of cycles. Making a system call touches a lot of code, so when you return to user-space it's likely that you'll get instruction-cache and data-cache misses. System calls that modify your page tables will also result in TLB misses.
(See "The (Real) Costs of System Calls" section in the FlexSC paper (Soares, Stumm), timed on an i7 Nehalem running Linux. (First-gen i7 from ~2008/9). The paper proposes a batched system-call mechanism for high-throughput web-servers and similar, but their baseline results for plain Linux are interesting and relevant outside of that.)
On a modern Intel CPU with Meltdown mitigation enabled, you'll usually get TLB misses. With Spectre mitigation enabled on recent x86, branch-prediction history will probably be wiped out, depending on the mitigation strategy. (Intel added a way for the kernel to request that higher-privileged branches after this point won't be affected by prediction history for lower-privileged branches. On current CPUs, I think this just flushes the branch-prediction cache.)
You can avoid the system-call overhead by letting iostream buffer for you. It's still significant work formatting and copying data around, but much cheaper than writing to a terminal. Redirecting your program's stdout to a file will make cout full-buffered by default, instead of line-buffered. i.e. run it like this:
./my_program > time_log.txt
The final output will match what you would have got on the terminal, but (as long as you don't do anything silly like using std::endl to force a flush) it will just be buffered up. The default buffer size is probably something like 4kiB. Use strace ./my_program or a similar tool to trace system calls, and make sure you're getting one big write() at the end, instead of lots of small write()s.
It would be nice to avoid buffered I/O inside (outer) timed regions, but it's very important to avoid system calls in places your "real" (non-instrumented) code wouldn't have them, if you're timing down to nanoseconds. And this is true even before timed intervals, not just inside.
A cout << foo that doesn't make a system call isn't "special" in terms of slowing down later code.
To overcome the overhead, the time lapse printing can be done by another thread. The main thread saves a start and end time into shared global variables, and notifies the condition variable the print thread is waiting on.
#include <iostream>
#include <thread>
#include <chrono>
#include <mutex>
#include <condition_variable>
#include <atomic>

std::condition_variable cv;
std::mutex mu;
std::atomic<bool> running {true};
std::atomic<bool> printnow {false};

// Shared timestamps, written under the mutex before notifying.
// Possible bug: the main thread can overwrite `now` before print_thread
// wakes up and reads the previous value.
std::chrono::high_resolution_clock::time_point start;
std::chrono::high_resolution_clock::time_point now;

void print_thread() {
    std::thread([]() {
        while (running) {
            std::unique_lock<std::mutex> lock(mu);
            cv.wait(lock, []() { return !running || printnow; });
            if (!running) return;
            std::chrono::milliseconds lapse_ms =
                std::chrono::duration_cast<std::chrono::milliseconds>(now - start);
            printnow = false;
            std::cout << " lapse time " << lapse_ms.count() << " ms\n";
        }
    }).detach();
}

void print_lapse(std::chrono::high_resolution_clock::time_point start1,
                 std::chrono::high_resolution_clock::time_point now1) {
    {
        std::lock_guard<std::mutex> lock(mu);   // protect the shared time_points
        start = start1;
        now = now1;
        printnow = true;
    }
    cv.notify_one();
}

int main()
{
    // launch the printing thread
    print_thread();

    // lapse 1
    std::chrono::high_resolution_clock::time_point start1 = std::chrono::high_resolution_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    std::chrono::high_resolution_clock::time_point now1 = std::chrono::high_resolution_clock::now();
    print_lapse(start1, now1);

    // lapse 2
    start1 = std::chrono::high_resolution_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    now1 = std::chrono::high_resolution_clock::now();
    print_lapse(start1, now1);

    // winding up
    std::this_thread::sleep_for(std::chrono::milliseconds(300));
    running = false;
    cv.notify_one();
}

Measuring CPU clock speed

I am trying to measure the speed of the CPU. I am not sure how accurate my method is. Basically, I tried an empty for loop with values like UINT_MAX, but the program terminated quickly, so I tried UINT_MAX * 3 and so on...
Then I realized that the compiler is optimizing away the loop, so I added a volatile variable to prevent the optimization. The following program takes approximately 1.5 seconds to finish. I want to know how accurate this algorithm is for measuring the clock speed. Also, how do I know how many cores are involved in the process?
#include <iostream>
#include <limits.h>
#include <time.h>
#include <stdint.h>   // for UINT32_MAX

using namespace std;

int main(void)
{
    volatile int v_obj = 0;
    unsigned long A, B = 0, C = UINT32_MAX;
    clock_t t1, t2;

    t1 = clock();
    for (A = 0; A < C; A++) {
        (void)v_obj;   // volatile access, intended to keep the loop from being optimized away
    }
    t2 = clock();

    std::cout << (double)(t2 - t1) / CLOCKS_PER_SEC << std::endl;
    double t = (double)(t2 - t1) / CLOCKS_PER_SEC;
    unsigned long clock_speed = (unsigned long)(C / t);
    std::cout << "Clock speed : " << clock_speed << std::endl;
    return 0;
}
This doesn't measure clock speed at all; it measures how many loop iterations can be done per second. There's no rule that says one iteration runs per clock cycle. It may be the case, and you may have actually found it to be the case - certainly with optimized code and a reasonable CPU, a useless loop shouldn't run much slower than that. It could run at half speed, though; some processors are not able to retire more than one taken branch every 2 cycles. And on esoteric targets, all bets are off.
So no, this doesn't measure clock cycles, except accidentally. In general it's extremely hard to get an empirical clock speed (you can ask your OS what it thinks the maximum clock speed and current clock speed are, see below), because
If you measure how much wall clock time a loop takes, you must know (at least approximately) the number of cycles per iteration. That's a bad enough problem in assembly, requiring fairly detailed knowledge of the expected microarchitectures (maybe a long chain of dependent instructions that each could only reasonably take 1 cycle, like add eax, 1 - a chain long enough that differences in the test/branch throughput become small enough to ignore), so obviously anything you do there is not portable and will have assumptions built into it that may become false (actually there is another answer on SO that does this and assumes that addps has a latency of 3, which it no longer has on Skylake and didn't have on old AMDs). In C? Give up now. The compiler might as well be rolling some random code generator, and relying on it to be reasonable is like doing the same with a bear. Guessing the number of cycles per iteration of code you neither control nor even know is just folly. If it's just on your own machine you can check the code, but then you could just check the clock speed manually too, so...
If you measure the number of clock cycles elapsed in a given amount of wall clock time.. but this is tricky. Because rdtsc doesn't measure clock cycles (not anymore), and nothing else gets any closer. You can measure something, but with frequency scaling and turbo, it generally won't be actual clock cycles. You can get actual clock cycles from a performance counter, but you can't do that from user mode. Obviously any way you try to do this is not portable, because you can't portably ask for the number of elapsed clock cycles.
So if you're doing this for actual information and not just to mess around, you should probably just ask the OS. For Windows, query WMI for CurrentClockSpeed or MaxClockSpeed, whichever one you want. On Linux there's stuff in /proc/cpuinfo. Still not portable, but then, no solution is.
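For example, on Linux one quick (still non-portable) option is to parse /proc/cpuinfo. A minimal sketch, assuming the usual "cpu MHz" field is present:
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // Reads the first "cpu MHz" entry; note this is the *current* frequency
    // of that core, which moves around with frequency scaling and turbo.
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.compare(0, 7, "cpu MHz") == 0) {
            std::cout << line << '\n';
            break;
        }
    }
}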
As for
how do I know how many core's are being involved in the process?
One. Of course your thread may migrate between cores, but since you only have one thread, it's on only one core at any time.
A good optimizer may remove the loop, since
for (A = 0; A < C; A++) {
    (void)v_obj;
}
has the same effect on the program state as:
A = C;
so the optimizer is entirely free to remove your loop.
You therefore cannot measure CPU speed this way, as it depends on the compiler as much as on the computer (not to mention the variable clock speed and multicore architecture already mentioned).

QueryPerformanceCounter and overflows

I'm using QueryPerformanceCounter to do some timing in my application. However, after running it for a few days the application seems to stop functioning properly. If I simply restart the application it starts working again. This makes me believe I have an overflow problem in my timing code.
// Author: Ryan M. Geiss
// http://www.geisswerks.com/ryan/FAQS/timing.html
class timer
{
public:
    timer()
    {
        QueryPerformanceFrequency(&freq_);
        QueryPerformanceCounter(&time_);
    }

    void tick(double interval)
    {
        LARGE_INTEGER t;
        QueryPerformanceCounter(&t);

        if (time_.QuadPart != 0)
        {
            int ticks_to_wait = static_cast<int>(static_cast<double>(freq_.QuadPart) * interval);
            int done = 0;
            do
            {
                QueryPerformanceCounter(&t);

                int ticks_passed = static_cast<int>(static_cast<__int64>(t.QuadPart) - static_cast<__int64>(time_.QuadPart));
                int ticks_left = ticks_to_wait - ticks_passed;

                if (t.QuadPart < time_.QuadPart)    // time wrap
                    done = 1;
                if (ticks_passed >= ticks_to_wait)
                    done = 1;

                if (!done)
                {
                    // if > 0.002s left, do Sleep(1), which will actually sleep some
                    // steady amount, probably 1-2 ms,
                    // and do so in a nice way (cpu meter drops; laptop battery spared).
                    // otherwise, do a few Sleep(0)'s, which just give up the timeslice,
                    // but don't really save cpu or battery, but do pass a tiny
                    // amount of time.
                    if (ticks_left > static_cast<int>((freq_.QuadPart * 2) / 1000))
                        Sleep(1);
                    else
                        for (int i = 0; i < 10; ++i)
                            Sleep(0); // causes thread to give up its timeslice
                }
            }
            while (!done);
        }

        time_ = t;
    }

private:
    LARGE_INTEGER freq_;
    LARGE_INTEGER time_;
};
My question is whether the code above should work deterministically for weeks of running continuously?
And if not where the problem is? I thought the overflow was handled by
if (t.QuadPart < time_.QuadPart) // time wrap
    done = 1;
But maybe that's not enough?
EDIT: Please observe that I did not write the original code, Ryan M. Geiss did; the link to the original source of the code is in the code.
QueryPerformanceCounter is notorious for its unreliability. It's fine to use for individual short-interval timing if you're prepared to handle abnormal results. It is not exact - it's typically based on the PCI bus frequency, and a heavily loaded bus can lead to lost ticks.
GetTickCount is actually more stable, and can give you 1ms resolution if you've called timeBeginPeriod. It will eventually wrap, so you need to handle that.
__rdtsc should not be used, unless you're profiling and have control of which core you're running on and are prepared to handle variable CPU frequency.
GetSystemTime is decent for longer periods of measurements, but will jump when the system time is adjusted.
Also, Sleep(0) does not do what you think it does. It will yield the cpu if another context wants it - otherwise it'll return immediately.
In short, timing on windows is a mess. One would think that today it'd be possible to get accurate long-term timing from a computer without going through hoops - but this isn't the case. In our game framework we're using several time sources and corrections from the server to ensure all connected clients have the same game time, and there's a lot of bad clocks out there.
Your best bet would likely be to just use GetTickCount or GetSystemTime, wrap it into something that adjusts for time jumps/wrap arounds.
Also, you should convert your double interval to an int64 number of milliseconds and then use only integer math - this avoids problems due to the varying accuracy of floating point types depending on their magnitude.
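A minimal sketch of the wrap-safe GetTickCount approach (Windows-specific; if GetTickCount64 is available on your target it avoids the wrap entirely):
#include <windows.h>

// Wait approximately interval_ms milliseconds using GetTickCount.
// Because DWORD arithmetic is unsigned, (now - start) stays correct even
// when the 32-bit tick counter wraps around (roughly every 49.7 days).
void wait_ms(DWORD interval_ms)
{
    DWORD start = GetTickCount();
    while (GetTickCount() - start < interval_ms)
        Sleep(1);   // ~1 ms granularity if timeBeginPeriod(1) was called
}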
Based on your comment, you probably should be using Waitable Timers instead.
See the following examples:
Using Waitable Timer Objects
Using Waitable Timers with an Asynchronous Procedure Call
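A minimal sketch of waiting on a waitable timer (error handling trimmed; the negative due time and the 100-nanosecond units are part of the SetWaitableTimer API):
#include <windows.h>

// Sleep for interval_ms using a waitable timer instead of Sleep/QPC spinning.
bool timer_wait_ms(DWORD interval_ms)
{
    HANDLE timer = CreateWaitableTimer(NULL, TRUE, NULL);   // manual-reset timer
    if (!timer)
        return false;

    LARGE_INTEGER due;
    due.QuadPart = -static_cast<LONGLONG>(interval_ms) * 10000;   // relative time, 100 ns units

    bool ok = SetWaitableTimer(timer, &due, 0, NULL, NULL, FALSE)
           && WaitForSingleObject(timer, INFINITE) == WAIT_OBJECT_0;

    CloseHandle(timer);
    return ok;
}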
Performance counters are 64-bit, so they are large enough for years of running continuously. For example, if you assume the performance counter increments 2 billion times each second (some imaginary 2 GHz processor) it will overflow in about 290 years.
Using a nanosecond-scale timer to control something like Sleep() that at best is precise to several milliseconds (and usually, several dozen milliseconds) is somewhat questionable anyway.
A different approach you might consider would be to use WaitForSingleObject or a similar function. This burns fewer CPU cycles, causes a trillion fewer context switches over the day, and is more reliable than Sleep(0), too.
You could for example create a semaphore and never touch it in normal operation. The semaphore exists only so you can wait on something, if you don't have anything better to wait on. Then you can specify a timeout in milliseconds up to 49 days long with a single syscall. And, it will not only be less work, it will be much more accurate too.
The advantage is that if "something happens", so you want to break up earlier than that, you only need to signal the semaphore. The wait call will return instantly, and you will know from the WAIT_OBJECT_0 return value that it was due to being signaled, not due to time running out. And all that without complicated logic and counting cycles.
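A sketch of that pattern (Win32; the names are placeholders):
#include <windows.h>

HANDLE wake_event;   // semaphore that is only ever signaled to interrupt a wait

void init()
{
    // Initial count 0, so a wait always blocks until signaled or timed out.
    wake_event = CreateSemaphore(NULL, 0, 1, NULL);
}

// Returns true if we were woken early by a signal, false if the timeout elapsed.
bool wait_or_wake(DWORD timeout_ms)
{
    return WaitForSingleObject(wake_event, timeout_ms) == WAIT_OBJECT_0;
}

// Called from another thread when "something happens".
void wake()
{
    ReleaseSemaphore(wake_event, 1, NULL);
}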
The problem you asked about most directly:
if (t.QuadPart < time_.QuadPart)
should instead be this:
if (t.QuadPart - time_.QuadPart < 0)
The reason for that is that you want to look for wrapping in relative time, not absolute time. Relative time will wrap (1ull<<63) time units after the reference call to QPC. Absolute time might wrap (1ull<<63) time units after reboot, but it could wrap at any other time it felt like it, that's undefined.
QPC is a little bugged on some systems (older RDTSC-based QPCs on early multicore CPUs, for instance) so it may be desirable to allow small negative time deltas like so:
if (t.QuadPart - time_.QuadPart < -1000000) //time wrap
An actual wrap will produce a very large negative time delta, so that's safe. It shouldn't be necessary on modern systems, but trusting Microsoft is rarely a good idea.
...
However, the bigger problem there with time wrapping is in the fact that ticks_to_wait, ticks_passed, and ticks_left are all int, not 64-bit types such as __int64 or long long like they should be. This makes most of that code wrap if any significant time periods are involved - and "significant" in this context is platform dependent; it can be on the order of 1 second in a few (rare these days) cases, or even less on some hypothetical future system.
Other issues:
if (time_.QuadPart != 0)
Zero is not a special value there, and should not be treated as such. My guess is that the code is conflating QPC returning a time of zero with QPC's return value being zero. The return value is not the 64-bit time passed by pointer; it's the BOOL that QPC actually returns.
Also, that loop of Sleep(0) is foolish - it appears to be tuned to behave correctly only on a particular level of contention and a particular per-thread CPU performance. If you need resolution that's a horrible idea, and if you don't need resolution then that entire function should have just been a single call to Sleep.

How to do an active sleep?

I am running some profiling tests, and usleep is a useful function. But while my program is sleeping, this time does not appear in the profile.
e.g. if I have a function like:
void f1() {
    for (i = 0; i < 1000; i++)
        usleep(1000);
}
With profiling tools such as gprof, f1 does not seem to consume any time.
What I am looking for is a method nicer than an empty while loop for doing an active sleep, something like:
while (1) {
    if (gettime() == whatiwant)
        break;
}
What kind of a system are you on? In UNIX-like systems you can use setitimer() to send a signal to a process after a specified period of time. This is the facility you would need to implement the type of "active sleep" you're looking for.
Set the timer, then loop until you receive the signal.
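A minimal POSIX sketch of that (the 1-second duration is just an example; the busy loop is what makes the time show up as CPU time in the profile):
#include <signal.h>
#include <sys/time.h>

static volatile sig_atomic_t expired = 0;

static void on_alarm(int) { expired = 1; }

// Spin (and therefore accumulate CPU time) until SIGALRM fires.
void active_sleep_1s()
{
    struct sigaction sa = {};
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval tv = {};
    tv.it_value.tv_sec = 1;            // one-shot timer, fires 1 second from now
    setitimer(ITIMER_REAL, &tv, NULL);

    expired = 0;
    volatile unsigned long spin = 0;   // volatile so the busy loop isn't optimized out
    while (!expired)
        ++spin;
}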
Because when you call usleep, the CPU is handed to something else for the duration of the sleep. So the current thread does not use any processor resources, and that's a very clever thing to do.
An active sleep is something to absolutely avoid because it's a waste of resources (ultimately damaging the environment by converting electricity to heat ;) ).
Anyway if you really want to do that you must give some real work to do to the processor, something that will not be factored out by compiler optimizations. For example
for (i = 0; i < 1000; i++)
    time(NULL);
I assume you want to find out the total amount of time (wall-clock time, real-world time, the time you sit watching your app run) f1() takes, as opposed to CPU time. I'd investigate whether gprof can give you wall-clock time instead of processing time.
I imagine it depends upon your OS, but the reason you aren't seeing usleep as taking any process time in the profile is because it technically isn't using any during that time - other running processes are (assuming this is running on a *nix platform).
for (int i = 0; i < SOME_BIG_NUMBER; ++i);
The entire point of "sleep" functions is that your application is not running. It is put in a sleep queue, and the OS transfers control to another process. If you want your application to run but do nothing, an empty loop is a simple solution. But you lose all the benefits of sleep (letting other applications run, saving CPU usage/power consumption).
So what you're asking makes no sense. You can't have your application sleep, but still be running.
AFAIK the only option is to do a while loop. The operating system generally assumes that if you want to wait for a period of time that you will want to be yielding to the operating system.
Being able to get a microsecond-accurate timer is also a potential issue. AFAIK there isn't a cross-platform way of doing timing (please correct me on this, because I'd love a cross-platform sub-microsecond timer! :D). Under Win32, you could surround a loop with some QueryPerformanceCounter calls to work out when you have spent enough time in the loop and then exit, e.g.:
void USleepEatCycles(__int64 uSecs)
{
    __int64 frequency;
    QueryPerformanceFrequency((LARGE_INTEGER*)&frequency);
    __int64 counter;
    QueryPerformanceCounter((LARGE_INTEGER*)&counter);

    double dStart = (double)counter / (double)frequency;   // seconds
    double dEnd = dStart;
    // dStart/dEnd are in seconds while uSecs is in microseconds,
    // so convert before comparing
    while ((dEnd - dStart) * 1000000.0 < (double)uSecs)
    {
        QueryPerformanceCounter((LARGE_INTEGER*)&counter);
        dEnd = (double)counter / (double)frequency;
    }
}
That's why it's important when profiling to look at the "Switched Out %" time. Basically, while your function's exclusive time may be small, if it waits on external resources (e.g. I/O, a database, etc.), then "Switched Out %" is the metric to watch.
This is the kind of confusion you get with gprof, since what you care about is wall-clock time. I use this.

Create thread with >70% CPU utilization

I am creating a test program to test the functionality of a program which calculates CPU utilization.
Now I want to test that program at different times, when CPU utilization is 100%, 50%, 0%, etc.
My question is how to drive the CPU to 100% utilization, or maybe > 80%.
I think creating a while loop like this will suffice:
while (i++ < 2000)
{
    cout << " in while " << endl;
    Sleep(10); // sleep for 10 ms.
}
After running this I don't get high CPU utilization.
What would be possible solutions to make it CPU intensive?
You're right to use a loop, but:
You've got IO
You've got a sleep
Basically nothing in that loop is going to take very much CPU time compared with the time it's sleeping or waiting for IO.
To kill a CPU you need to give it just CPU stuff. The only tricky bit really is making sure the C++ compiler doesn't optimise away the loop. Something like this should probably be okay:
// A bit like generating a hashcode. Pretty arbitrary choice,
// but simple code which would be hard for the compiler to
// optimise away.
int running_total = 23;
for (int i = 0; i < some_large_number; i++)
{
    running_total = 37 * running_total + i;
}
return running_total;
Note the fact that I'm returning the value out of the loop. That should stop the C++ compiler from noticing that the loop is useless (if you never used the value anywhere, the loop would have no purpose). You may want to disable inlining too, as otherwise I guess there's a possibility that a smart compiler would notice you calling the function without using the return value, and inline it to nothing. (As Suma points out in the answer, using volatile when calling the function should disable inlining.)
Your loop mostly sleeps, which means it has a very light CPU load. Besides the Sleep, be sure to include a loop performing some computation, like this (the Factorial implementation is left as an exercise to the reader; you may replace it with any other non-trivial function).
while (i++ < 2000)
{
    int sleepBalance = 10;     // increase this to reduce the CPU load
    int computeBalance = 1000; // increase this to increase the CPU load

    for (int i = 0; i < computeBalance; i++)
    {
        /* both volatiles are important to prevent compiler */
        /* optimizing out the function */
        volatile int n = 30;
        volatile int pretendWeNeedTheResult = Factorial(n);
    }

    Sleep(sleepBalance);
}
By adjusting sleepBalance / computeBalance you can adjust how much CPU this program takes. If you want to use this as a CPU load simulation, you might want to take a few additional steps:
on a multicore system be sure to either spawn the loop like this in multiple threads (one for each CPU), or execute the process multiple times, and to make the scheduling predictable assign thread/process affinity explicitly
sometimes you may also want to increase the thread/process priority to simulate the environment where CPU is heavily loaded with high priority applications.
Use consume.exe in the Windows SDK.
Don't roll your own when someone else has already done the work and will give it to you for free.
If you call Sleep in your loop then most of the loop's time will be spent doing nothing (sleeping). This is why your CPU utilization is low - that 10 ms sleep is huge compared to the time the CPU spends executing the rest of the code in each loop iteration. It is a non-trivial task to write code that accurately wastes CPU time. Roger's suggestion of using CPU Burn-In is a good one.
I know the "yes" command on UNIX systems, when routed to /dev/null will eat up 100% CPU on a single core (it doesn't thread). You can launch multiple instances of it to utilize each core. You could probably compile the "yes" code in your application and call it directly. You don't specify what C++ compiler you are using for Windows, but I am going to assume it has POSIX compatibility of some kind (ala Cygwin). If that's the case, "yes" should work fine.
To make a thread use a lot of CPU, make sure it doesn't block/wait. Your Sleep call will suspend the thread and not schedule it for at least the number of ms the Sleep call indicates, during which it will not use the CPU.
Get hold of a copy of CPU Burn-In.