C++ chrono timers go 1000x slower in a thread

My setup
iOS 13.2
Xcode 11.2
I'm implementing a simple timer to count seconds by calendar/wall time.
From many SO tips, I tried various std::chrono:: timers, such as
system_clock
steady_clock
They seem to behave correctly in a single-threaded program.
But in production code, when I use the timer in a background thread, it falls apart: the duration readings from the timer calls are way off.
m_thread = std::thread([&, this]() {
    // `start` is a persistent state initialized at the beginning of the run.
    auto start = std::chrono::system_clock::now();
    while (true) {
        auto end = std::chrono::system_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        auto isOutdated = duration > 1000;
        if (isOutdated) {
            print("we are outdated.");
        }
    }
});
duration seems to be almost always 0.
gettimeofday() works slightly better, i.e. it actually moves, but at a rate 1000x slower than wall time.
I had thought that the chrono classes count all kinds of time, but it seems I have the wrong expectation of how they work.
What am I missing?
UPDATE
Forgot to say, I have 2 more threads going at the same time. Could thread preemption affect this?
UPDATE 2
I tried a few things and now the program behaves as expected, but it drove me mad trying to work out how this happened in the first place.
Things I did
Gradually increased the timeout threshold from 1 to 3000, recompiling the whole program each time. I found that with a lower threshold, the program actually gets the duration right.
Tried gettimeofday() first, which consistently showed numbers 1000x slower, then switched back to system_clock.
Disabled some logging to avoid a performance hit. I use a third-party thread-safe logging lib, which writes to a log file and outputs to the device syslog at the same time.
Right now I can finally see the correct duration. NO change in the code logic. What a bizarre experience!
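For reference, a stripped-down, single-file version of the loop above, using steady_clock (generally the better choice for measuring elapsed time, since it can't be adjusted), can be used to sanity-check the clock in a background thread. This is only a sketch of the pattern, not the production code:
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    std::thread worker([]() {
        const auto start = std::chrono::steady_clock::now();
        while (true) {
            const auto now = std::chrono::steady_clock::now();
            const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(now - start).count();
            if (ms > 1000) {   // same 1000 ms threshold as above
                std::printf("we are outdated after %lld ms\n", static_cast<long long>(ms));
                break;
            }
        }
    });
    worker.join();
}
If this prints shortly after one second but the full application does not, the clock itself isn't the problem; something around it (here, apparently the logging) is.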

Related

setting/pausing clock time for std::chrono clocks in C++

I wrote something to measure how long my code takes to run, and to print it out. The way I have it now supports nesting of these measurements.
The thing is that getting the time interval, converting it into a number, formatting it, building the string, and then printing it all takes a while (2-3 milliseconds); the I/O especially seems expensive. I want the clock to "skip over" this process in a sense, since the stuff I'm measuring is in the microseconds. (And I think it'd be a good feature regardless, in case I have other things I want to skip.)
std::chrono::high_resolution_clock clock;
std::chrono::time_point<std::chrono::high_resolution_clock> first, second;
first = clock.now();
std::chrono::time_point<std::chrono::high_resolution_clock> paused_tp = clock.now();
std::cout << "Printing things \n";
clock.setTime(paused_tp); // Something like this is what I want (not a real member of the standard clocks)
second = clock.now();
The idea is to make sure that first and second have minimal differences, ideally identical.
From what I see, the high_resolution_clock class (and all the other chrono clocks) keeps its time_point private, and you can only access it through clock.now().
I know there might be benchmarking libraries out there that do this, but I'd like to figure out how to do it myself (if only for the sake of knowing how to do it). Any information on how other libraries do it or insights on how chrono works would be appreciated as well. I might be misunderstanding how chrono internally keeps track.
(Is std::chrono::high_resolution_clock even accurate enough for something like this?)
(While I'm here any resources on making C++ programs more efficient would be great)
Edit: I actually do the printing after the section of code I'm trying to time; the problem only arises when, say, I want to time the entire program as well as individual functions. Then the printing of each function's time adds a delay to the overall program time.
Edit 2: I figured I should have more of an example of what I'm doing.
I have a class that handles everything; let's say it's called tracker, and it takes care of all that clock nonsense.
tracker loop = TRACK(
    for(int i = 0; i != 100; ++i){
        tracker b = TRACK(function_call());
        b.display();
    }
)
loop.display();
The macro is optional; it's just a quick thing that makes it less cluttered and allows me to display the name of the function being called.
Explicitly, the macro expands to
tracker loop = "for(int i = 0; i != 100; ++i){ tracker b = TRACK(function_call()); b.display(); }";
loop.start();
for(int i = 0; i != 100; ++i){
    tracker b = "function_call()";
    b.start();
    function_call();
    b.end();
    b.display();
}
loop.end();
loop.display();
In most situations the printing isn't an issue, since a tracker only keeps track of what's between start() and end(), but here b.display() ends up interfering with the outer tracker loop.
A goal of mine with this was for the tracker to be as non-intrusive as possible, so I'd like most/all of it to be handled in the tracker class. But then I run into the problem of b.display() being a method of a different instance than the tracker loop. I've tried a few things with the static keyword but ran into a few issues with that (still trying a little). I might've coded myself into a dead end here, but there are still a lot of things left to try.
Just time the two intervals separately and add them, i.e. save 4 total timestamps. For nested intervals, you might just save timestamps into an array and sort everything out at the end. (Or inside an outer loop before timestamps get overwritten). Storing to an array is quite cheap.
Or better: defer printing until later.
If the timed interval is only milliseconds, just save what you were going to print and do it outside the timed interval.
If you have nested timed intervals, at least sink the printing out of the inner-most intervals to minimize the amount of stop/restart you have to do.
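As a concrete illustration of "save timestamps into an array and sort everything out at the end", here is a minimal sketch; the TimeRecord struct and g_records vector are made-up names for the example, not part of the asker's tracker class:
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical record: name of the timed region plus its start/end stamps.
struct TimeRecord {
    const char* label;
    std::chrono::steady_clock::time_point start, end;
};

static std::vector<TimeRecord> g_records;   // filled inside timed regions, printed afterwards

void function_call() { /* work being timed */ }

int main() {
    g_records.reserve(256);                 // avoid reallocating inside the timed region

    const auto outer_start = std::chrono::steady_clock::now();
    for (int i = 0; i != 100; ++i) {
        const auto s = std::chrono::steady_clock::now();
        function_call();
        const auto e = std::chrono::steady_clock::now();
        g_records.push_back({"function_call()", s, e});   // cheap: no formatting, no I/O
    }
    const auto outer_end = std::chrono::steady_clock::now();

    // All formatting and I/O happens outside the timed regions.
    for (const auto& r : g_records) {
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(r.end - r.start).count();
        std::printf("%s: %lld us\n", r.label, static_cast<long long>(us));
    }
    const auto total_us = std::chrono::duration_cast<std::chrono::microseconds>(outer_end - outer_start).count();
    std::printf("loop total: %lld us\n", static_cast<long long>(total_us));
}
The inner measurements cost only two clock reads and a push_back each, so the outer total is barely disturbed by the instrumentation.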
If you're manually instrumenting your code all over the place, maybe look at profiling tools like flamegraph, especially if your timed intervals break down on function boundaries. See also: Linux perf: how to interpret and find hotspots.
Not only does I/O take time itself, it will also make later code run slower for a few hundred or a few thousand cycles. Making a system call touches a lot of code, so when you return to user space you're likely to get instruction-cache and data-cache misses. System calls that modify your page tables will also cause TLB misses.
(See "The (Real) Costs of System Calls" section in the FlexSC paper (Soares, Stumm), timed on an i7 Nehalem running Linux. (First-gen i7 from ~2008/9). The paper proposes a batched system-call mechanism for high-throughput web-servers and similar, but their baseline results for plain Linux are interesting and relevant outside of that.)
On a modern Intel CPU with Meltdown mitigation enabled, you'll usually get TLB misses. With Spectre mitigation enabled on recent x86, branch-prediction history will probably be wiped out, depending on the mitigation strategy. (Intel added a way for the kernel to request that higher-privileged branches after this point won't be affected by prediction history for lower-privileged branches. On current CPUs, I think this just flushes the branch-prediction cache.)
You can avoid the system-call overhead by letting iostream buffer for you. It's still significant work formatting and copying data around, but much cheaper than writing to a terminal. Redirecting your program's stdout to a file will make cout full-buffered by default, instead of line-buffered. i.e. run it like this:
./my_program > time_log.txt
The final output will match what you would have got on the terminal, but (as long as you don't do anything silly like using std::endl to force a flush) it will just be buffered up. The default buffer size is probably something like 4kiB. Use strace ./my_program or a similar tool to trace system calls, and make sure you're getting one big write() at the end, instead of lots of small write()s.
It would be nice to avoid buffered I/O inside (outer) timed regions, but it's very important to avoid system calls in places where your "real" (non-instrumented) code wouldn't have them, if you're timing down to nanoseconds. And this is true even just before timed intervals, not only inside them.
cout << foo, if it doesn't make a system call, isn't "special" in terms of slowing down later code.
To overcome the overhead, the time lapse printing can be done by another thread. The main thread saves a start and end time into shared global variables, and notifies the condition variable the print thread is waiting on.
#include <iostream>
#include <thread>
#include <chrono>
#include <mutex>
#include <condition_variable>
#include <atomic>

std::condition_variable cv;
std::mutex mu;
std::atomic<bool> running {true};
std::atomic<bool> printnow {false};

// Shared but non-atomic: written and read only while holding the mutex.
// Remaining limitation: if the main thread calls print_lapse() again before the
// print thread wakes up, the previous start/now pair is overwritten and lost.
std::chrono::high_resolution_clock::time_point start;
std::chrono::high_resolution_clock::time_point now;

void print_thread() {
    std::thread([]() {
        while (running) {
            std::unique_lock<std::mutex> lock(mu);
            cv.wait(lock, []() { return !running || printnow; });
            if (!running) return;
            std::chrono::milliseconds lapse_ms =
                std::chrono::duration_cast<std::chrono::milliseconds>(now - start);
            printnow = false;
            std::cout << " lapse time " << lapse_ms.count() << " ms\n";
        }
    }).detach();
}

void print_lapse(std::chrono::high_resolution_clock::time_point start1,
                 std::chrono::high_resolution_clock::time_point now1) {
    {
        std::lock_guard<std::mutex> lock(mu);  // publish the time points under the mutex
        start = start1;
        now = now1;
        printnow = true;
    }
    cv.notify_one();
}

int main()
{
    // launch the print thread
    print_thread();

    // lapse 1
    std::chrono::high_resolution_clock::time_point start1 = std::chrono::high_resolution_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    std::chrono::high_resolution_clock::time_point now1 = std::chrono::high_resolution_clock::now();
    print_lapse(start1, now1);

    // lapse 2
    start1 = std::chrono::high_resolution_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    now1 = std::chrono::high_resolution_clock::now();
    print_lapse(start1, now1);

    // winding up
    std::this_thread::sleep_for(std::chrono::milliseconds(300));
    running = false;
    cv.notify_one();
}

Wait accurately for 20 milliseconds

I need to execute some function accurately 20 milliseconds after some event (for RTP packet sending). I have tried the following variants:
std::this_thread::sleep_for(std::chrono::milliseconds(20));
boost::this_thread::sleep_for(std::chrono::milliseconds(20));
Sleep(20);
Also various hacks such as:
auto a = GetTickCount();
while ((GetTickCount() - a) < 20) continue;
I also tried micro- and nanoseconds.
All these methods have an error in the range of -6 ms to +12 ms, but that's not acceptable. How can I make it work right?
In my opinion, ±1 ms is acceptable, but no more.
UPDATE 1: To measure elapsed time I use std::chrono::high_resolution_clock::now();
Briefly: because of how OS kernels manage time and threads, you won't get much better accuracy with that method. Also, you can't rely on sleeping alone with a static interval, or your stream will quickly drift off your intended send clock rate, because the thread could be interrupted or scheduled again well after your sleep time. For this reason you should check the system clock at each iteration to know how much to sleep for (i.e. somewhere between 0 ms and 20 ms). Without going into too much detail, this is also why there's a jitter buffer in RTP streams: to account for variations in packet reception (due to network jitter or send jitter). Because of this, you likely won't need ±1 ms accuracy anyway.
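A minimal sketch of that idea using an absolute deadline, so lateness in one iteration doesn't accumulate; send_rtp_packet() is a placeholder for whatever actually does the send:
#include <chrono>
#include <thread>

void send_rtp_packet() { /* placeholder: the real packet send goes here */ }

void sender_loop() {
    using clock = std::chrono::steady_clock;
    constexpr auto period = std::chrono::milliseconds(20);
    auto next_deadline = clock::now() + period;
    for (;;) {
        send_rtp_packet();
        // Sleep until an absolute deadline instead of sleeping for a fixed 20 ms.
        // Any lateness in this iteration shortens the next sleep, so the average
        // rate stays at 50 packets/second even though individual wake-ups jitter.
        std::this_thread::sleep_until(next_deadline);
        next_deadline += period;
    }
}
Individual wake-ups can still be a few milliseconds late depending on the scheduler, but the long-run packet rate stays correct because the deadline advances by exactly 20 ms each time.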
Using std::chrono::steady_clock, I got about 0.1 ms accuracy on Windows 7.
That is, simply:
const auto WAIT_TIME = std::chrono::milliseconds(20); // e.g. the 20 ms from the question
auto a = std::chrono::steady_clock::now();
while ((std::chrono::steady_clock::now() - a) < WAIT_TIME) continue;
This should give you accurate "waiting" (about 0.1ms, as I said), at least. We all know that this kind of waiting is "ugly" and should be avoided, but it's a hack that might still do the trick just fine.
You could use high_resolution_clock, which might give even better accuracy for some systems, but it is not guaranteed not to be adjusted by the OS, and you don't want that. steady_clock is supposed to be guaranteed not to be adjusted, and often has the same accuracy as high_resolution_clock.
As for "sleep()" functions that are very accurate, I don't know. Perhaps someone else knows more about that.
In C we have a nanosleep function in time.h.
The nanosleep() function causes the current thread to be suspended from execution until either the time interval specified by the rqtp argument has elapsed or a signal is delivered to the calling thread and its action is to invoke a signal-catching function or to terminate the process.
The program below sleeps for 20 milliseconds.
#include <stdio.h>
#include <time.h>

int main()
{
    struct timespec tim;
    tim.tv_sec = 0;
    tim.tv_nsec = 20000000; // 20 milliseconds converted to nanoseconds

    if (nanosleep(&tim, NULL) < 0)
    {
        printf("Nano sleep system call failed \n");
        return -1;
    }

    printf("Nano sleep successful \n");
    return 0;
}
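If clock_nanosleep() is available on your platform (it's POSIX; on older glibc it may need -lrt), it can also sleep until an absolute deadline on CLOCK_MONOTONIC, which avoids the drift you get from chaining relative sleeps. A sketch:
#include <stdio.h>
#include <time.h>

/* Sleep until an absolute deadline 20 ms from now on the monotonic clock. */
int main(void)
{
    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_nsec += 20 * 1000 * 1000;            /* add 20 ms */
    if (deadline.tv_nsec >= 1000000000L) {           /* carry into seconds */
        deadline.tv_nsec -= 1000000000L;
        deadline.tv_sec += 1;
    }
    if (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL) != 0) {
        printf("clock_nanosleep failed \n");
        return -1;
    }
    printf("woke up at (or just after) the deadline \n");
    return 0;
}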

gettimeofday on uLinux weird behaviour

Recently I've been trying to create a wait function that waits for 25 ms using the wall clock as reference. I looked around and found "gettimeofday", but I've been having problems with it. My code (simplified):
while(1)
{
    timeval start, end;
    double t_us;
    bool release = false;

    while (release == false)
    {
        gettimeofday(&start, NULL);
        DoStuff();
        {
            gettimeofday(&end, NULL);
            t_us = ((end.tv_sec - start.tv_sec) * 1000 * 1000) + (end.tv_usec - start.tv_usec);
            if (t_us >= 25000) // 25 ms
            {
                release = true;
            }
        }
    }
}
This code runs in a (POSIX) thread and, on its own, works fine: DoStuff() is called every 25 ms. It does, however, eat all the CPU it can (as you might expect), so obviously this isn't a good idea.
When I tried throttling it by adding a Sleep(1) in the wait loop after the if statement, the whole thing slowed down by about 50% (that is, it called DoStuff() every 37 ms or so). This makes no sense to me: assuming DoStuff() and any other threads complete their tasks in under (25 - 1) ms, the call rate of DoStuff() shouldn't be affected (allowing a 1 ms error margin).
I also tried Sleep(0), usleep(1000) and usleep(0) but the behaviour is the same.
The same behaviour occurs whenever another higher-priority thread needs CPU time (without the sleep). It's as if the clock stops counting when the thread relinquishes its runtime.
I'm aware that gettimeofday is vulnerable to things like NTP updates etc., so I tried using clock_gettime, but linking with -lrt on my system causes problems, so I don't think that is an option.
Does anyone know what I'm doing wrong?
The part that's missing here is how the kernel does thread scheduling based on time slices. In rough numbers, if you sleep at the beginning of your time slice for 1ms and the scheduling is done on a 35ms clock rate, your thread may not execute again for 35ms. If you sleep for 40ms, your thread may not execute again for 70ms. You can't really change that without changing the scheduling, but that's not recommended due to overall performance implications of the system. You could use a "high-resolution" timer, but often that's implemented in a tight cycle-wasting loop of "while it's not time yet, chew CPU" so that's not really desirable either.
If you used a high-resolution clock and queried it frequently inside of your DoStuff loop, you could potentially play some tricks like run for 30ms, then do a sleep(1) which could effectively relinquish your thread for the remainder of your timeslice (e.g. 5ms) to let other threads run. Kind of a cooperative/preemptive multitasking if you will. It's still possible you don't get back to work for an extended period of time though...
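Applied to the loop in the question, that suggestion might look roughly like this (an untested sketch; the 2 ms margin and the usleep(1000) granularity are guesses that would need tuning for the target kernel's tick rate):
#include <sys/time.h>
#include <unistd.h>

/* Microseconds elapsed between two gettimeofday() samples. */
static long elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000L * 1000L
         + (end->tv_usec - start->tv_usec);
}

/* Wait until 25 ms have passed since *start, yielding the CPU while
 * more than ~2 ms remain and spinning only for the final stretch. */
void wait_for_25ms(const struct timeval *start)
{
    struct timeval now;
    for (;;) {
        gettimeofday(&now, NULL);
        long remaining = 25000L - elapsed_us(start, &now);
        if (remaining <= 0)
            break;
        if (remaining > 2000)       /* plenty of time left: give up the CPU */
            usleep(1000);
        /* else: spin until the deadline to avoid oversleeping a whole timeslice */
    }
}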
All variants of sleep()/usleep() involve yielding the CPU to other runnable tasks. Your program can then run only after it is rescheduled by the kernel, which seems to take about 37 ms in your case.

Measuring the runtime of C++ code?

I want to measure the runtime of my C++ code. Executing my code takes about 12 hours, and I want to write out this time at the end of the run. How can I do it in my code?
Operating system: Linux
If you are using C++11 you can use system_clock::now():
auto start = std::chrono::system_clock::now();
/* do some work */
auto end = std::chrono::system_clock::now();
auto elapsed = end - start;
std::cout << elapsed.count() << '\n';
You can also specify the granularity to use for representing a duration:
// this constructs a duration object using milliseconds
auto elapsed =
std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
// this constructs a duration object using seconds
auto elapsed =
std::chrono::duration_cast<std::chrono::seconds>(end - start);
If you cannot use C++11, then have a look at chrono from Boost.
The best thing about using such a standard libraries is that their portability is really high (e.g., they both work in Linux and Windows). So you do not need to worry too much if you decide to port your application afterwards.
These libraries follow a modern C++ design too, as opposed to C-like approaches.
EDIT: The example above can be used to measure wall-clock time. That is not, however, the only way to measure the execution time of a program. First, we can distinguish between user and system time:
User time: The time spent by the program running in user space.
System time: The time spent by the program running in system (or kernel) space. A program enters kernel space for instance when executing a system call.
Depending on the objectives it may be necessary or not to consider system time as part of the execution time of a program. For instance, if the aim is to just measure a compiler optimization on the user code then it is probably better to leave out system time. On the other hand, if the user wants to determine whether system calls are a significant overhead, then it is necessary to measure system time as well.
Moreover, since most modern systems are time-shared, different programs may compete for several computing resources (e.g., CPU). In such a case, another distinction can be made:
Wall-clock time: By using wall-clock time the execution of the program is measured in the same way as if we were using an external (wall) clock. This approach does not consider the interaction between programs.
CPU time: In this case we only count the time that a program is actually running on the CPU. If a program (P1) is co-scheduled with another one (P2), and we want to get the CPU time for P1, this approach does not include the time while P2 is running and P1 is waiting for the CPU (as opposed to the wall-clock time approach).
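To see the wall-clock vs. CPU-time difference without any extra library, you can sample std::chrono::steady_clock alongside std::clock(), which on Linux reports the process's CPU time; a small sketch:
#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>

int main() {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    std::this_thread::sleep_for(std::chrono::seconds(1));   // sleeping costs wall time, not CPU time

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    const double wall_s = std::chrono::duration<double>(wall_end - wall_start).count();
    const double cpu_s = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    std::printf("wall: %.3f s, cpu: %.3f s\n", wall_s, cpu_s);   // cpu will be close to 0 here
}
A program that sleeps for one second shows roughly 1 s of wall time but almost no CPU time, while a busy loop shows both growing together.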
For measuring CPU time, Boost includes a set of extra clocks:
process_real_cpu_clock, captures wall clock CPU time spent by the current process.
process_user_cpu_clock, captures user-CPU time spent by the current process.
process_system_cpu_clock, captures system-CPU time spent by the current process.
process_cpu_clock, a tuple-like class that captures real, user-CPU, and system-CPU process times together.
thread_clock, a thread steady clock giving the time spent by the current thread (when supported by a platform).
Unfortunately, C++11 does not have such clocks. But Boost is a widely used library and, probably, these extra clocks will be incorporated into C++1x at some point. So, if you use Boost you will be ready when a new C++ standard adds them.
Finally, if you want to measure the time a program takes to execute from the command line (as opposed to adding some code into your program), you may have a look at the time command, just as #BЈовић suggests. This approach, however, would not let you measure individual parts of your program (e.g., the time it takes to execute a function).
Use std::chrono::steady_clock and not std::chrono::system_clock for measuring run time in C++11. The reason is (quoting system_clock's documentation):
on most systems, the system time can be adjusted at any moment
while steady_clock is monotonic and is better suited for measuring intervals:
Class std::chrono::steady_clock represents a monotonic clock. The time points of this clock cannot decrease as physical time moves forward. This clock is not related to wall clock time, and is best suitable for measuring intervals.
Here's an example:
auto start = std::chrono::steady_clock::now();
// do something
auto finish = std::chrono::steady_clock::now();
double elapsed_seconds = std::chrono::duration_cast<
std::chrono::duration<double> >(finish - start).count();
A small practical tip: if you are measuring run time and want to report seconds std::chrono::duration_cast<std::chrono::seconds> is rarely what you need because it gives you whole number of seconds. To get the time in seconds as a double use the example above.
You can use time to start your program. When the program ends, time prints nice statistics about the run. It is easy to configure what to print; by default, it prints the user and CPU times it took to execute the program.
EDIT: Note that any measurement taken from inside the code is not exact, because your application will get blocked by other programs, hence giving you skewed values*.
* By skewed values, I mean that it is easy to get the time it took to execute the program, but that time varies depending on the CPU load during the execution. To get a relatively stable measurement that doesn't depend on the CPU load, one can execute the application using time and use the reported CPU time as the measurement result.
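For example, with the bash built-in (the numbers here are purely illustrative):
$ time ./my_program

real    12m34.567s
user    11m58.003s
sys     0m12.420s
real is the wall-clock time; user and sys are the CPU time spent in user space and in the kernel, respectively.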
I used something like this in one of my projects:
#include <sys/time.h>
struct timeval start, end;
gettimeofday(&start, NULL);
//Compute
gettimeofday(&end, NULL);
double elapsed = ((end.tv_sec - start.tv_sec) * 1000)
+ (end.tv_usec / 1000 - start.tv_usec / 1000);
This is for milliseconds and it works both for C and C++.
This is the code I use:
const auto start = std::chrono::steady_clock::now();
// Your code here.
const auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed = end - start;
std::cout << "Time in seconds: " << elapsed.count() << '\n';
You don't want to use std::chrono::system_clock because it is not monotonic! If the user changes the time in the middle of your code your result will be wrong - it might even be negative. std::chrono::high_resolution_clock might be implemented using std::chrono::system_clock so I wouldn't recommend that either.
This code also avoids ugly casts.
If you wish to print the measured time with printf(), you can use this:
auto start = std::chrono::system_clock::now();
/* measured work */
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
printf("Time = %lld ms\n", static_cast<long long int>(elapsed.count()));
You could also try some timer classes that start and stop automatically, and gather statistics on the average, maximum and minimum time spent in any block of code, as well as the number of calls. These cxx-rtimer classes are available on GitHub, and offer support for using std::chrono, clock_gettime(), or boost::posix_time as a back-end clock source.
With these timers, you can do something like:
void timeCriticalFunction() {
    static rtimers::cxx11::DefaultTimer timer("expensive");
    auto scopedStartStop = timer.scopedStart();
    // Do something costly...
}
with timing stats written to std::cerr on program completion.

QueryPerformanceCounter and overflows

I'm using QueryPerformanceCounter to do some timing in my application. However, after running for a few days the application seems to stop functioning properly. If I simply restart the application, it starts working again. This makes me believe I have an overflow problem in my timing code.
// Author: Ryan M. Geiss
// http://www.geisswerks.com/ryan/FAQS/timing.html
class timer
{
public:
    timer()
    {
        QueryPerformanceFrequency(&freq_);
        QueryPerformanceCounter(&time_);
    }

    void tick(double interval)
    {
        LARGE_INTEGER t;
        QueryPerformanceCounter(&t);

        if (time_.QuadPart != 0)
        {
            int ticks_to_wait = static_cast<int>(static_cast<double>(freq_.QuadPart) * interval);
            int done = 0;
            do
            {
                QueryPerformanceCounter(&t);

                int ticks_passed = static_cast<int>(static_cast<__int64>(t.QuadPart) - static_cast<__int64>(time_.QuadPart));
                int ticks_left = ticks_to_wait - ticks_passed;

                if (t.QuadPart < time_.QuadPart)    // time wrap
                    done = 1;
                if (ticks_passed >= ticks_to_wait)
                    done = 1;

                if (!done)
                {
                    // if > 0.002s left, do Sleep(1), which will actually sleep some
                    //   steady amount, probably 1-2 ms,
                    //   and do so in a nice way (cpu meter drops; laptop battery spared).
                    // otherwise, do a few Sleep(0)'s, which just give up the timeslice,
                    //   but don't really save cpu or battery, but do pass a tiny
                    //   amount of time.
                    if (ticks_left > static_cast<int>((freq_.QuadPart*2)/1000))
                        Sleep(1);
                    else
                        for (int i = 0; i < 10; ++i)
                            Sleep(0); // causes thread to give up its timeslice
                }
            }
            while (!done);
        }

        time_ = t;
    }

private:
    LARGE_INTEGER freq_;
    LARGE_INTEGER time_;
};
My question is whether the code above should work deterministically when running continuously for weeks.
And if not, where is the problem? I thought the overflow was handled by
if (t.QuadPart < time_.QuadPart) // time wrap
    done = 1;
But maybe that's not enough?
EDIT: Please note that I did not write the original code; Ryan M. Geiss did. The link to the original source is in the code.
QueryPerformanceCounter is notorious for its unreliability. It's fine to use for individual short-interval timing, if you're prepared to handle abnormal results. It is not exact - it's typically based on the PCI bus frequency, and a heavily loaded bus can lead to lost ticks.
GetTickCount is actually more stable, and can give you 1ms resolution if you've called timeBeginPeriod. It will eventually wrap, so you need to handle that.
__rdtsc should not be used, unless you're profiling and have control of which core you're running on and are prepared to handle variable CPU frequency.
GetSystemTime is decent for longer periods of measurements, but will jump when the system time is adjusted.
Also, Sleep(0) does not do what you think it does. It will yield the cpu if another context wants it - otherwise it'll return immediately.
In short, timing on Windows is a mess. One would think that today it'd be possible to get accurate long-term timing from a computer without jumping through hoops - but this isn't the case. In our game framework we use several time sources and corrections from the server to ensure all connected clients have the same game time, and there are a lot of bad clocks out there.
Your best bet would likely be to just use GetTickCount or GetSystemTime, wrap it into something that adjusts for time jumps/wrap arounds.
Also, you should convert your double interval to an int64 milliseconds and then use only integer math - this avoids problems due to floating point types' varying accuracy based on their contents.
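A sketch of what that wrap-safe, integer-only arithmetic can look like with GetTickCount (the function names are made up for the example):
#include <windows.h>

// Wrap-safe elapsed-time check with GetTickCount: because DWORD is unsigned
// 32-bit, (now - start) stays correct across a single ~49.7-day wraparound.
bool interval_elapsed(DWORD start_ms, DWORD interval_ms) {
    DWORD elapsed = GetTickCount() - start_ms;   // unsigned subtraction handles the wrap
    return elapsed >= interval_ms;
}

// Usage sketch: the interval is passed as whole milliseconds (integer math only).
void tick(DWORD& last_ms, DWORD interval_ms) {
    while (!interval_elapsed(last_ms, interval_ms))
        Sleep(1);                                // coarse wait; see the caveats above
    last_ms += interval_ms;                      // advance by the period, not to "now", to avoid drift
}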
Based on your comment, you probably should be using Waitable Timers instead.
See the following examples:
Using Waitable Timer Objects
Using Waitable Timers with an Asynchronous Procedure Call
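A bare-bones version of the waitable-timer approach might look like this (most error handling trimmed; the 100-nanosecond due-time units and the negative-means-relative convention are part of the SetWaitableTimer API):
#include <windows.h>
#include <cstdio>

// Minimal waitable-timer sketch: wait ~20 ms without spinning.
int main() {
    HANDLE timer = CreateWaitableTimer(NULL, TRUE, NULL);   // manual-reset, unnamed
    if (!timer) return 1;

    LARGE_INTEGER due;
    due.QuadPart = -200000LL;          // relative due time in 100-ns units: 20 ms
    if (!SetWaitableTimer(timer, &due, 0, NULL, NULL, FALSE)) {
        CloseHandle(timer);
        return 1;
    }

    WaitForSingleObject(timer, INFINITE);   // blocks until the timer signals
    std::printf("timer fired\n");
    CloseHandle(timer);
    return 0;
}
For a periodic tick you can pass the period in milliseconds as the third argument to SetWaitableTimer instead of re-arming the timer each time.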
Performance counters are 64-bit, so they are large enough for years of running continuously. For example, if you assume the performance counter increments 2 billion times each second (some imaginary 2 GHz processor) it will overflow in about 290 years.
Using a nanosecond-scale timer to control something like Sleep(), which at best is precise to several milliseconds (and usually several dozen milliseconds), is somewhat questionable anyway.
A different approach you might consider would be to use WaitForSingleObject or a similar function. This burns less CPU cycles, causes a trillion fewer context switches over the day, and is more reliable than Sleep(0), too.
You could, for example, create a semaphore and never touch it in normal operation. The semaphore exists only so you can wait on something, if you don't have anything better to wait on. Then you can specify a timeout in milliseconds up to 49 days long with a single syscall. And it will not only be less work, it will be much more accurate too.
The advantage is that if "something happens", so you want to break up earlier than that, you only need to signal the semaphore. The wait call will return instantly, and you will know from the WAIT_OBJECT_0 return value that it was due to being signaled, not due to time running out. And all that without complicated logic and counting cycles.
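A sketch of that semaphore idea (names invented for the example; in real code the handle and the worker thread would live wherever your timer logic lives):
#include <windows.h>

// Wait on a semaphore that is normally never signalled: the timeout provides
// the delay, and signalling it provides an immediate, deliberate early wake-up.
HANDLE g_wake = NULL;

void wait_loop(DWORD interval_ms) {
    for (;;) {
        DWORD rc = WaitForSingleObject(g_wake, interval_ms);
        if (rc == WAIT_TIMEOUT) {
            // normal case: the interval elapsed, do the periodic work here
        } else if (rc == WAIT_OBJECT_0) {
            // another thread called ReleaseSemaphore(g_wake, 1, NULL):
            // we were woken early on purpose, e.g. to shut down
            break;
        } else {
            break;   // WAIT_FAILED
        }
    }
}

int main() {
    g_wake = CreateSemaphore(NULL, 0, 1, NULL);   // starts unsignalled
    // ... run wait_loop() on a worker thread; call ReleaseSemaphore(g_wake, 1, NULL) to break it out
    CloseHandle(g_wake);
}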
The problem you asked about most directly:
if (t.QuadPart < time_.QuadPart)
should instead be this:
if (t.QuadPart - time_.QuadPart < 0)
The reason for that is that you want to look for wrapping in relative time, not absolute time. Relative time will wrap (1ull<<63) time units after the reference call to QPC. Absolute time might wrap (1ull<<63) time units after reboot, but it could wrap at any other time it felt like it, that's undefined.
QPC is a little bugged on some systems (older RDTSC-based QPCs on early multicore CPUs, for instance) so it may be desirable to allow small negative time deltas like so:
if (t.QuadPart - time_.QuadPart < -1000000) //time wrap
An actual wrap will produce a very large negative time delta, so that's safe. It shouldn't be necessary on modern systems, but trusting Microsoft is rarely a good idea.
...
However, the bigger problem with time wrapping there is the fact that ticks_to_wait, ticks_passed, and ticks_left are all int, not 64-bit (e.g. long long) like they should be. This makes most of that code wrap if any significant time periods are involved - and "significant" in this context is platform dependent; it can be on the order of 1 second in a few (rare these days) cases, or even less on some hypothetical future system.
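In other words, the delta computation needs 64-bit intermediates, something along these lines (a sketch, keeping the original double-based interval conversion):
#include <windows.h>

// 64-bit version of the delta computation, avoiding the int overflow described above.
long long ticks_left_64(const LARGE_INTEGER& freq, const LARGE_INTEGER& start,
                        const LARGE_INTEGER& now, double interval_seconds) {
    const long long ticks_to_wait = static_cast<long long>(static_cast<double>(freq.QuadPart) * interval_seconds);
    const long long ticks_passed  = now.QuadPart - start.QuadPart;
    return ticks_to_wait - ticks_passed;
}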
Other issues:
if (time_.QuadPart != 0)
Zero is not a special value there, and should not be treated as such. My guess is that the code is conflating QPC returning a time of zero with QPC's return value being zero. The return value is not the 64-bit time passed back by pointer; it's the BOOL that QPC actually returns.
Also, that loop of Sleep(0) is foolish - it appears to be tuned to behave correctly only on a particular level of contention and a particular per-thread CPU performance. If you need resolution that's a horrible idea, and if you don't need resolution then that entire function should have just been a single call to Sleep.