I have a program that needs to execute every X seconds to write some output. The program will do some intermittent polling and processing between each output. So for example I may be outputting every 5 seconds, but I wake up to poll every 0.1 seconds until I hit the next 5 second mark. The program will in theory run for months between restarts, possibly even longer.
I need the execution at every X seconds to stay consistent to the wall clock. In other words, I can't allow clock drift to cause me to drift away from the X second mark. I don't need quite the same level of accuracy in the intermittent polling, but I would like to poll more often than once a second, so I need a structure that can represent sub-second precision.
I realize that by the very nature of running on an OS there will be a certain inconsistency/latency in the execution of any timer, such that I can't guarantee I'll execute at exactly every X seconds. That is fine so long as it stays a small normal distribution; what I don't want is for drift to let the timing consistently get further and further off.
I would also prefer to try to minimize the CPU cost of the polling as much as possible, but that's a secondary concern.
What timing constructs are available for Linux that can best provide this level of precision? I'm trying to avoid including Boost in the application due to hassles in distribution, but I can use it if I have to. Methods using the standard C++ libraries are preferred, but if Boost can do it better I would like to know that as well.
Thank you.
-ps, I can't use C++11. It's not an option, so I can't use any constructs from it.
clock_gettime and clock_nanosleep both operate on sub-second times, and use of CLOCK_MONOTONIC should prevent any skew due to adjustments to system time (such as NTP or adjtimex). For example,
long delay_ns = 250000000;
struct timespec next;

clock_gettime(CLOCK_MONOTONIC, &next);
while (1) {
    next.tv_nsec += delay_ns;
    next.tv_sec  += next.tv_nsec / 1000000000;
    next.tv_nsec %= 1000000000;
    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    /* do something after the wait */
}
In reality, you should check whether you've returned early due to a signal and whether you should skip an interval because too much time passed while you were sleeping.
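One way to handle both cases, sketched here as a variant of the loop above (same delay_ns and next; the EINTR retry relies on clock_nanosleep returning its error code directly rather than setting errno):
#include <errno.h>
#include <time.h>

while (1) {
    next.tv_nsec += delay_ns;
    next.tv_sec  += next.tv_nsec / 1000000000;
    next.tv_nsec %= 1000000000;

    /* If too much time passed (heavy load, long work), skip whole
       intervals so we stay aligned to the original schedule instead of
       firing in a burst to "catch up". */
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    while (now.tv_sec > next.tv_sec ||
           (now.tv_sec == next.tv_sec && now.tv_nsec >= next.tv_nsec)) {
        next.tv_nsec += delay_ns;
        next.tv_sec  += next.tv_nsec / 1000000000;
        next.tv_nsec %= 1000000000;
    }

    /* Retry if a signal interrupted the sleep; the absolute deadline
       is unchanged, so resuming is safe. */
    int rc;
    do {
        rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    } while (rc == EINTR);

    /* do something after the wait */
}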
Other methods of sleeping, such as nanosleep and select, only allow for specifying a time interval and use CLOCK_REALTIME, which is system time and may change.
If you need some exact timing without jitter or clock drift, make sure to have an external time source like a GPS receiver, an atomic clock or a radio clock.
See for example the NTP integration documented in this PDF.
All the NTP tricks work just as well on Linux as on NanoBSD (which is what the PDF covers).
I would record the time t_0 when I started. Then at any given time t, if i = (t - t_0)/X (integer division) is greater than it was the last time you executed, you execute again.
For example, suppose you are running every 5 seconds: if you started at time 23 you record i = 0, t_0 = 23. Then at time 27, (27 - 23)/5 = 0, so you don't execute yet. But the next time around, at time 28, (28 - 23)/5 = 1, which is greater than 0, so you execute.
This way you're not adding to a counter each tick, which could accumulate drift; you're computing the elapsed time since the start absolutely, so you know the exact times at which to execute.
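A minimal sketch of that scheme using clock_gettime(CLOCK_MONOTONIC) for t and t_0; do_output() and the short polling nap are placeholders for your own code:
#include <time.h>

const long X = 5;                         /* output period in seconds */
struct timespec t0, t;
long last_i = -1;

clock_gettime(CLOCK_MONOTONIC, &t0);
while (1) {
    clock_gettime(CLOCK_MONOTONIC, &t);
    long i = (t.tv_sec - t0.tv_sec) / X;  /* integer division, as above */
    if (i > last_i) {
        last_i = i;
        do_output();                      /* placeholder: the periodic work */
    }

    /* placeholder for the ~0.1 s polling nap between checks */
    struct timespec nap = { 0, 100000000L };
    nanosleep(&nap, NULL);
}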
Related
I have a hard time understanding processor time. The result of this program:
#include <iostream>
#include <chrono>
#include <ctime>

// the function f() does some time-consuming work
void f()
{
    volatile long double d;
    int size = 10000;
    for (int n = 0; n < size; ++n)
        for (int m = 0; m < size; ++m)
            d = n * m;
}

int main()
{
    std::clock_t start = std::clock();
    f();
    std::clock_t end = std::clock();
    std::cout << "CPU time used: "
              << (end - start)
              << "\n";
}
seems to randomly fluctuate between 210 000, 220 000 and 230 000. At first I was amazed: why these discrete values? Then I found out that std::clock() returns only an approximate processor time. So the value returned by std::clock() is probably rounded to a multiple of 10 000. This would also explain why the maximum difference between the CPU times is 20 000 (10 000 of rounding error from the first call to std::clock() and 10 000 from the second).
But if I change to int size = 40000; in the body of f(), I get fluctuations in the ranges of 3 400 000 to 3 500 000 which cannot be explained by rounding.
From what I read about the clock rate, on Wikipedia:
The CPU requires a fixed number of clock ticks (or clock cycles) to
execute each instruction. The faster the clock, the more instructions
the CPU can execute per second.
That is, if the program is deterministic (which I hope mine is), the CPU time needed to finish should be:
Always the same
Slightly higher than the number of instructions carried out
My experiments show neither, since my program needs to carry out at least 3 * size * size instructions. Could you please explain what I am doing wrong?
First, the statement you quote from Wikipedia is simply false. It might have been true 20 years ago (but not always, even then), but it is totally false today. There are many things which can affect your timings:

The first: if you're running on Windows, clock is broken, and totally unreliable. It returns the difference in elapsed time, not CPU time. And elapsed time depends on all sorts of other things the processor might be doing.

Beyond that: things like cache misses have a very significant impact on time. And whether a particular piece of data is in the cache or not can depend on whether your program was interrupted between the last access and this one.

In general, anything less than 10% can easily be due to the caching issues. And I've seen differences of a factor of 10 under Windows, depending on whether there was a build running or not.
You don't state what hardware you're running the binary on.
Does it have an interrupt-driven CPU?
Is it a multitasking operating system?
You're mistaking the cycle time of the CPU (the CPU clock, as Wikipedia calls it) for the time it takes to execute a particular piece of code from start to end, alongside all the other stuff the poor CPU has to do at the same time.
Also... is all your executing code in level 1 cache, or is some of it in level 2, in main memory, or on disk? And what about the next time you run it?
Your program is not deterministic, because it uses library and system functions which are not deterministic.
As a particular example, when you allocate memory this is virtual memory, which must be mapped to physical memory. Although this is a system call, running kernel code, it takes place on your thread and will count against your clock time. How long it takes to do this will depend on what the overall memory allocation situation is.
The CPU time is indeed "fixed" for a given set of circumstances. However, in a modern computer there are other things happening in the system which interfere with the execution of your code. Caches may be wiped out when your email software wakes up to check whether there is any new mail for you, when the HP printer software checks for updates, or when the antivirus software decides to run for a bit to check whether your memory contains any viruses, etc., etc.
Part of this is also caused by the fact that CPU time accounting in any system is not 100% accurate. It works on "clock ticks" and similar things, so the time used by, for example, an interrupt servicing an incoming network packet, the hard disk interrupt, or the timer interrupt saying "another millisecond ticked by" is all accounted to "the currently running process". Assuming this is Windows, there is a further "feature": for historical and other reasons, std::clock() simply returns the time now, not the time actually used by your process. So for example:
t = clock();
cin >> x;
t = clock() - t;
would leave t with a time of 10 seconds if it took ten seconds to input the value of x, even though 9.999 of those ten seconds were spent in the idle process, not your program.
The following piece of code gives 0 as runtime of the function. Can anybody point out the error?
struct timeval start,end;
long seconds,useconds;
gettimeofday(&start, NULL);
int optimalpfs=optimal(n,ref,count);
gettimeofday(&end, NULL);
seconds = end.tv_sec - start.tv_sec;
useconds = end.tv_usec - start.tv_usec;
long opt_runtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
cout<<"\nOptimal Runtime is "<<opt_runtime<<"\n";
I get both start and end times as the same. I get the following output:
Optimal Runtime is 0
Tell me the error please.
POSIX 1003.1b-1993 specifies interfaces for clock_gettime() (and clock_getres()), and offers that with the MON option there can be a type of clock with a clockid_t value of CLOCK_MONOTONIC (so that your timer isn't affected by system time adjustments). If available on your system then these functions return a structure which has potential resolution down to one nanosecond, though the latter function will tell you exactly what resolution the clock has.
struct timespec {
    time_t  tv_sec;     /* seconds */
    long    tv_nsec;    /* and nanoseconds */
};
You may still need to run your test function in a loop many times for the clock to register any time elapsed beyond its resolution, and perhaps you'll want to run your loop enough times to last at least an order of magnitude more time than the clock's resolution.
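A rough sketch of that kind of measurement loop (the iteration count and function_under_test() are purely illustrative):
#include <stdio.h>
#include <time.h>

struct timespec res, start, end;
long i;

clock_getres(CLOCK_MONOTONIC, &res);     /* what the clock can actually resolve */
printf("clock resolution: %ld ns\n", res.tv_nsec);

clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < 1000000; ++i)            /* repeat enough to dwarf the resolution */
    function_under_test();
clock_gettime(CLOCK_MONOTONIC, &end);

printf("per call: %lld ns\n",
       ((end.tv_sec - start.tv_sec) * 1000000000LL
        + (end.tv_nsec - start.tv_nsec)) / 1000000);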
Note though that apparently the Linux folks mis-read the POSIX.1b specifications and/or didn't understand the definition of a monotonically increasing time clock, and their CLOCK_MONOTONIC clock is affected by system time adjustments, so you have to use their invented non-standard CLOCK_MONOTONIC_RAW clock to get a real monotonic time clock.
Alternately one could use the related POSIX.1 timer_settime() call to set a timer running, a signal handler to catch the signal delivered by the timer, and timer_getoverrun() to find out how much time elapsed between the queuing of the signal and its final delivery, and then set your loop to run until the timer goes off, counting the number of iterations in the time interval that was set, plus the overrun.
Of course on a preemptive multi-tasking system these clocks and timers will run even while your process is not running, so they are not really very useful for benchmarking.
Slightly more rare is the optional POSIX.1-1999 clockid_t value of CLOCK_PROCESS_CPUTIME_ID, indicated by the presence of the _POSIX_CPUTIME from <time.h>, which represents the CPU-time clock of the calling process, giving values representing the amount of execution time of the invoking process. (Even more rare is the TCT option of clockid_t of CLOCK_THREAD_CPUTIME_ID, indicated by the _POSIX_THREAD_CPUTIME macro, which represents the CPU time clock, giving values representing the amount of execution time of the invoking thread.)
Unfortunately POSIX makes no mention of whether these so-called CPUTIME clocks count just user time, or both user and system (and interrupt) time, accumulated by the process or thread, so if your code under profiling makes any system calls then the amount of time spent in kernel mode may, or may not, be represented.
Even worse, on multi-processor systems, the values of the CPUTIME clocks may be completely bogus if your process happens to migrate from one CPU to another during its execution. The timers implementing these CPUTIME clocks may also run at different speeds on different CPU cores, and at different times, further complicating what they mean. I.e. they may not mean anything related to real wall-clock time, but only be an indication of the number of CPU cycles (which may still be useful for benchmarking so long as relative times are always used and the user is aware that execution time may vary depending on external factors). Even worse it has been reported that on Linux CPU TimeStampCounter-based CPUTIME clocks may even report the time that a process has slept.
If your system has a good working getrusage() system call then it will hopefully be able to give you a struct timeval for each of the actual user and system times separately consumed by your process while it was running. However, since this puts you back to a microsecond clock at best, you'll need to run your test code repeatedly enough times to get a more accurate timing, calling getrusage() once before the loop, and again afterwards, and then calculating the differences between the times given. For simple algorithms this might mean running them millions of times, or more. Note also that on many systems the division between user time and system time is done somewhat arbitrarily, and if examined separately in a repeated loop one or the other can even appear to run backwards. However, if your algorithm makes no system calls then summing the time deltas should still be a fair total time for your code execution.
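A small sketch of that pattern, using the timersub() macro shown just below (the loop count and function_under_test() are illustrative):
#include <sys/resource.h>
#include <sys/time.h>

struct rusage before, after;
struct timeval d_user, d_sys;
long i;

getrusage(RUSAGE_SELF, &before);
for (i = 0; i < 1000000; ++i)            /* enough repetitions to beat 1 us granularity */
    function_under_test();
getrusage(RUSAGE_SELF, &after);

timersub(&after.ru_utime, &before.ru_utime, &d_user);   /* user CPU time delta */
timersub(&after.ru_stime, &before.ru_stime, &d_sys);    /* system CPU time delta */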
BTW, take care when comparing time values such that you don't overflow or end up with a negative value in a field, either as #Nim suggests, or perhaps like this (from NetBSD's <sys/time.h>):
#define timersub(tvp, uvp, vvp)                                     \
        do {                                                        \
                (vvp)->tv_sec = (tvp)->tv_sec - (uvp)->tv_sec;      \
                (vvp)->tv_usec = (tvp)->tv_usec - (uvp)->tv_usec;   \
                if ((vvp)->tv_usec < 0) {                           \
                        (vvp)->tv_sec--;                            \
                        (vvp)->tv_usec += 1000000;                  \
                }                                                   \
        } while (0)
(you might even want to be more paranoid that tv_usec is in range)
One more important note about benchmarking: make sure your function is actually being called, ideally by examining the assembly output from your compiler. Compiling your function in a separate source module from the driver loop usually convinces the optimizer to keep the call. Another trick is to have it return a value that you assign inside the loop to a variable defined as volatile.
You've got a weird mix of floats and ints here:
long opt_runtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
Try using:
long opt_runtime = (long)(seconds * 1000 + (float)useconds/1000);
This way you'll get your results in milliseconds.
The execution time of optimal(...) is less than the granularity of gettimeofday(...). This likely happens on Windows. On Windows the typical granularity is up to 20 ms. I've answered a related gettimeofday(...) question here.
For Linux I asked How is the microsecond time of linux gettimeofday() obtained and what is its accuracy? and got a good result.
More information on how to obtain accurate timing is described in this SO answer.
I normally do such a calculation as:
long long ss = start.tv_sec * 1000000LL + start.tv_usec;
long long es = end.tv_sec * 1000000LL + end.tv_usec;
Then do a difference
long long microsec_diff = es - ss;
Now convert as required:
double seconds = microsec_diff / 1000000.;
Normally I don't bother with the last step and do all timings in microseconds.
I'm having trouble with something and couldn't find any answers about it, as I don't even know what to search for. I have written a timer class using QueryPerformanceCounter. From my application, I launch a second thread object that has its own instanced timer, and I just have an infinite loop getting delta time from the timer and using it to output the number of loop iterations per second.
I've noticed that it was giving me weird values so I started printing delta time and found out it was coming as 0 sometimes, so I went inside the method that returns delta time and did some testing. This is my deltaTime() method:
double MyTimer2::deltaTime()
{
    LARGE_INTEGER timenow;
    QueryPerformanceCounter(&timenow);
    //std::cout << "timenow=" << (double)timenow.QuadPart << " currentticks=" << (double)m_currentTicks.QuadPart << std::endl;
    double m_deltaTime = (double)(timenow.QuadPart - m_currentTicks.QuadPart) /* 1000.0*/ / (double)m_frequency.QuadPart;
    m_currentTicks = timenow;
    if (m_deltaTime < 0.000001)
        return 0.0;
    return m_deltaTime;
}
So, I put a breakpoint on "return 0.0;" and what happens is that it gets there most of the time, which is not correct. However, if I uncomment the printing code and run, I will never stop on the breakpoint. So in theory, my printing code is making it work correctly, whereas if I remove it, things stop working as they should! How is this possible, why is it happening and how can I fix it? I've tried _ReadWriteBarrier() unsuccessfully.
Thanks in advance!
EDIT: I need a high-resolution timer for physics simulation!
A couple processor generations ago, QueryPerformanceCounter() would read the CPU's cycle counter (e.g. rdtsc). Using this method, the number of ticks from successive reads would never be zero. The resolution was equal to the CPU clock rate, e.g. 3 GHz.
Modern processors have two characteristics which make the cycle counter useless for timing. First, you have multiple cores, which each have their own cycle counter. Threads can migrate between cores, and if you read the cycle counter from two different cores, the difference would not be related to elapsed time. It could even be negative. Secondly, you have dynamic clocking based on load (both underclocking to save power and overclocking for performance). Intel calls these "SpeedStep" and "Turbo Boost", respectively. When the cycle rate isn't fixed, there's no way to convert from ticks to time.
So, QueryPerformanceCounter now uses a dedicated piece of hardware called the High Precision Event Timer (HPET), with a resolution of several MHz. Importantly, there's only one regardless of how many cores you have, and it doesn't change speed dynamically. But, since the resolution is lower, it is now possible to read it twice between ticks, in which case you'll get an elapsed time reported as zero.
In practice, this isn't a problem. If you need timing more precise than what the HPET can provide, then a general purpose computer is not suitable for you. Timing in the nanosecond range will be severely affected by interrupts.
What could possibly be the purpose of this block?
if(m_deltaTime < 0.000001)
return 0.0;
It has no value, it simply screws with the results, telling you the time was zero when it actually wasn't.
First of all, your timer is wrong: it consumes your CPU intensively. On a single-core machine it will slow down the whole system. If you want to create a timer and target Windows, you can use timer functions.
Second, every non-negative value returned by your deltaTime() function is valid. Since you are not running on a real-time operating system, every operation can take an arbitrary amount of time. One iteration can take tens of processor cycles, or tens of years; there is no guarantee.
Third, about the experimental results: it seems that if the context is switched once between two consecutive time measurements, you get a value of about 0.016 s; if not, you get a value below 0.000001 s, which is floored to 0.
As was said, printing to the console is a relatively heavy operation, and you effectively always get a context switch when you enable it.
EDIT
While QueryPerformanceCounter seems to offer great resolution, it traps you. You will never actually get a high-resolution timer unless you work in a real-time OS.
I'm using QueryPerformanceCounter to do some timing in my application. However, after running it for a few days the application seems to stop functioning properly. If I simply restart the application it starts working again. This makes me believe I have an overflow problem in my timing code.
// Author: Ryan M. Geiss
// http://www.geisswerks.com/ryan/FAQS/timing.html
class timer
{
public:
    timer()
    {
        QueryPerformanceFrequency(&freq_);
        QueryPerformanceCounter(&time_);
    }

    void tick(double interval)
    {
        LARGE_INTEGER t;
        QueryPerformanceCounter(&t);

        if (time_.QuadPart != 0)
        {
            int ticks_to_wait = static_cast<int>(static_cast<double>(freq_.QuadPart) * interval);
            int done = 0;
            do
            {
                QueryPerformanceCounter(&t);

                int ticks_passed = static_cast<int>(static_cast<__int64>(t.QuadPart) - static_cast<__int64>(time_.QuadPart));
                int ticks_left = ticks_to_wait - ticks_passed;

                if (t.QuadPart < time_.QuadPart)    // time wrap
                    done = 1;
                if (ticks_passed >= ticks_to_wait)
                    done = 1;

                if (!done)
                {
                    // if > 0.002s left, do Sleep(1), which will actually sleep some
                    // steady amount, probably 1-2 ms,
                    // and do so in a nice way (cpu meter drops; laptop battery spared).
                    // otherwise, do a few Sleep(0)'s, which just give up the timeslice,
                    // but don't really save cpu or battery, but do pass a tiny
                    // amount of time.
                    if (ticks_left > static_cast<int>((freq_.QuadPart * 2) / 1000))
                        Sleep(1);
                    else
                        for (int i = 0; i < 10; ++i)
                            Sleep(0);   // causes thread to give up its timeslice
                }
            }
            while (!done);
        }

        time_ = t;
    }

private:
    LARGE_INTEGER freq_;
    LARGE_INTEGER time_;
};
My question is whether the code above should work deterministically for weeks of running continuously?
And if not, where is the problem? I thought the overflow was handled by
if (t.QuadPart < time_.QuadPart) // time wrap
done = 1;
But maybe that's not enough?
EDIT: Please observe that I did not write the original code, Ryan M. Geiss did, the link to the original source of the code is in the code.
QueryPerformanceCounter is notorious for its unreliability. It's fine to use for individual short-interval timing, if you're prepared to handle abnormal results. It is not exact - it's typically based on the PCI bus frequency, and a heavily loaded bus can lead to lost ticks.
GetTickCount is actually more stable, and can give you 1ms resolution if you've called timeBeginPeriod. It will eventually wrap, so you need to handle that.
__rdtsc should not be used, unless you're profiling and have control of which core you're running on and are prepared to handle variable CPU frequency.
GetSystemTime is decent for longer periods of measurements, but will jump when the system time is adjusted.
Also, Sleep(0) does not do what you think it does. It will yield the cpu if another context wants it - otherwise it'll return immediately.
In short, timing on windows is a mess. One would think that today it'd be possible to get accurate long-term timing from a computer without going through hoops - but this isn't the case. In our game framework we're using several time sources and corrections from the server to ensure all connected clients have the same game time, and there's a lot of bad clocks out there.
Your best bet would likely be to just use GetTickCount or GetSystemTime, wrap it into something that adjusts for time jumps/wrap arounds.
Also, you should convert your double interval to an int64 milliseconds and then use only integer math - this avoids problems due to floating point types' varying accuracy based on their contents.
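For what it's worth, a sketch of the wrap-safe pattern: unsigned 32-bit subtraction makes the GetTickCount wrap harmless as long as individual intervals stay under about 49.7 days (interval_ms is an illustrative name for your integer-millisecond period):
#include <windows.h>

DWORD start_ms = GetTickCount();
DWORD interval_ms = 5000;                       /* the period, in integer milliseconds */

/* ... later, in the timing loop ... */
DWORD elapsed_ms = GetTickCount() - start_ms;   /* correct even across a wrap */
if (elapsed_ms >= interval_ms) {
    start_ms += interval_ms;                    /* advance the schedule; don't reset to "now" */
    /* do the periodic work */
}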
Based on your comment, you probably should be using Waitable Timers instead.
See the following examples:
Using Waitable Timer Objects
Using Waitable Timers with an Asynchronous Procedure Call
Performance counters are 64-bit, so they are large enough for years of running continuously. For example, if you assume the performance counter increments 2 billion times each second (some imaginary 2 GHz processor) it will overflow in about 290 years.
Using a nanosecond-scale timer to control something like Sleep() that at best is precise to several milliseconds (and usually, several dozen milliseconds) is somewhat questionable anyway.
A different approach you might consider would be to use WaitForSingleObject or a similar function. This burns less CPU cycles, causes a trillion fewer context switches over the day, and is more reliable than Sleep(0), too.
You could for example create a semaphore and never touch it in normal operation. The semaphore exists only so you can wait on something, if you don't have anything better to wait on. Then you can specify a timeout in milliseconds up to 49 days long with a single syscall. And, it will not only be less work, it will be much more accurate too.
The advantage is that if "something happens", so you want to break up earlier than that, you only need to signal the semaphore. The wait call will return instantly, and you will know from the WAIT_OBJECT_0 return value that it was due to being signaled, not due to time running out. And all that without complicated logic and counting cycles.
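A minimal sketch of that arrangement (the period value and the early-wake caller are illustrative):
#include <windows.h>

/* Created once; never signaled in normal operation - it exists only as
   something to wait on. */
HANDLE wake = CreateSemaphore(NULL, 0, 1, NULL);
DWORD interval_ms = 5000;                          /* illustrative period */

/* In the timing loop: one syscall waits up to the whole interval. */
DWORD rc = WaitForSingleObject(wake, interval_ms);
if (rc == WAIT_OBJECT_0) {
    /* someone called ReleaseSemaphore(wake, 1, NULL): deliberate early wake-up */
} else if (rc == WAIT_TIMEOUT) {
    /* the interval elapsed normally */
}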
The problem you asked about most directly:
if (t.QuadPart < time_.QuadPart)
should instead be this:
if (t.QuadPart - time_.QuadPart < 0)
The reason for that is that you want to look for wrapping in relative time, not absolute time. Relative time will wrap (1ull<<63) time units after the reference call to QPC. Absolute time might wrap (1ull<<63) time units after reboot, but it could wrap at any other time it felt like it, that's undefined.
QPC is a little bugged on some systems (older RDTSC-based QPCs on early multicore CPUs, for instance) so it may be desirable to allow small negative time deltas like so:
if (t.QuadPart - time_.QuadPart < -1000000) //time wrap
An actual wrap will produce a very large negative time delta, so that's safe. It shouldn't be necessary on modern systems, but trusting Microsoft is rarely a good idea.
...
However, the bigger problem there with time wrapping is in the fact that ticks_to_wait, ticks_passed, and ticks_left are all int, not LARGE_INTEGER or long long like they should be. This makes most of that code wrap if any significant time periods are involved - and "significant" in this context is platform dependent; it can be on the order of 1 second in a few (rare these days) cases, or even less on some hypothetical future system.
Other issues:
if (time_.QuadPart != 0)
Zero is not a special value there, and should not be treated as such. My guess is that the code is conflating QPC returning a time of zero with QPC's return value being zero. The return value is not the 64-bit time passed by pointer, it's the BOOL that QPC actually returns.
Also, that loop of Sleep(0) is foolish - it appears to be tuned to behave correctly only on a particular level of contention and a particular per-thread CPU performance. If you need resolution that's a horrible idea, and if you don't need resolution then that entire function should have just been a single call to Sleep.
On Windows I have a problem I never encountered on Unix. That is how to get a thread to sleep for less than one millisecond. On Unix you typically have a number of choices (sleep, usleep and nanosleep) to fit your needs. On Windows, however, there is only Sleep with millisecond granularity.
On Unix, I can use the select system call to create a microsecond sleep, which is pretty straightforward:
int usleep(long usec)
{
    struct timeval tv;
    tv.tv_sec = usec / 1000000L;
    tv.tv_usec = usec % 1000000L;
    return select(0, 0, 0, 0, &tv);
}
How can I achieve the same on Windows?
This indicates a mis-understanding of sleep functions. The parameter you pass is a minimum time for sleeping. There's no guarantee that the thread will wake up after exactly the time specified. In fact, threads don't "wake up" at all, but are rather chosen for execution by the OS scheduler. The scheduler might choose to wait much longer than the requested sleep duration to activate a thread, especially if another thread is still active at that moment.
As Joel says, you can't meaningfully 'sleep' (i.e. relinquish your scheduled CPU) for such short periods. If you want to delay for some short time, then you need to spin, repeatedly checking a suitably high-resolution timer (e.g. the 'performance timer') and hoping that something of high priority doesn't pre-empt you anyway.
If you really care about accurate delays of such short times, you should not be using Windows.
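For illustration, a sketch of such a spin-delay using the performance counter; it burns a full core for the whole delay and still offers no guarantee against pre-emption:
#include <windows.h>

void spin_delay_us(__int64 microseconds)
{
    LARGE_INTEGER freq, start, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    __int64 ticks = (freq.QuadPart * microseconds) / 1000000;

    do {
        QueryPerformanceCounter(&now);           /* busy-poll the counter */
    } while (now.QuadPart - start.QuadPart < ticks);
}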
Use the high resolution multimedia timers available in winmm.lib. See this for an example.
#include <Windows.h>
#include <winternl.h>   /* provides the NTSTATUS typedef */

static NTSTATUS(__stdcall *NtDelayExecution)(BOOL Alertable, PLARGE_INTEGER DelayInterval) = (NTSTATUS(__stdcall*)(BOOL, PLARGE_INTEGER)) GetProcAddress(GetModuleHandle("ntdll.dll"), "NtDelayExecution");

static NTSTATUS(__stdcall *ZwSetTimerResolution)(IN ULONG RequestedResolution, IN BOOLEAN Set, OUT PULONG ActualResolution) = (NTSTATUS(__stdcall*)(ULONG, BOOLEAN, PULONG)) GetProcAddress(GetModuleHandle("ntdll.dll"), "ZwSetTimerResolution");

static void SleepShort(float milliseconds) {
    static bool once = true;
    if (once) {
        ULONG actualResolution;
        ZwSetTimerResolution(1, true, &actualResolution);
        once = false;
    }

    LARGE_INTEGER interval;
    interval.QuadPart = -1 * (int)(milliseconds * 10000.0f);
    NtDelayExecution(false, &interval);
}
Works very well for sleeping extremely short times. Remember though that at a certain point the actual delays will never be consistent because the system can't maintain consistent delays of such a short time.
Yes, you need to understand your OS' time quantums. On Windows, you won't even be getting 1ms resolution times unless you change the time quantum to 1ms. (Using for example timeBeginPeriod()/timeEndPeriod()) That still won't really guarantee anything. Even a little load or a single crappy device driver will throw everything off.
SetThreadPriority() helps, but is quite dangerous. Bad device drivers can still ruin you.
You need an ultra-controlled computing environment to make this ugly stuff work at all.
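For reference, a sketch of bracketing the timing-sensitive work with the multimedia timer calls mentioned above (winmm.lib must be linked; the 1 ms request is a system-wide setting while it is active):
#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

timeBeginPeriod(1);     /* ask for 1 ms timer/scheduler granularity */
Sleep(1);               /* now typically sleeps close to 1-2 ms instead of ~15 ms */
/* ... timing-sensitive work ... */
timeEndPeriod(1);       /* must be paired with the timeBeginPeriod(1) call */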
Generally, a sleep will last at least until the next system interrupt occurs. However, this depends on the settings of the multimedia timer resources. It may be set to something close to 1 ms; some hardware even allows interrupt periods of 0.9765625 ms (the ActualResolution provided by NtQueryTimerResolution will show 0.9766, but that's actually wrong - they just can't put the correct number into the ActualResolution format. It's 0.9765625 ms at 1024 interrupts per second).
There is one exception which allows us to escape from the fact that it may be impossible to sleep for less than the interrupt period: the famous Sleep(0). This is a very powerful tool, and it is not used as often as it should be! It relinquishes the remainder of the thread's time slice. This way the thread will stop until the scheduler forces it to get CPU service again. Sleep(0) is an asynchronous service; the call forces the scheduler to react independently of an interrupt.
A second way is the use of a waitable object. A wait function like WaitForSingleObject() can wait for an event. In order to have a thread sleep for any time, including times in the microsecond regime, the thread needs to set up a service thread which will generate an event at the desired delay. The "sleeping" thread sets up this thread and then pauses at the wait function until the service thread sets the event signaled.
This way any thread can "sleep" or wait for any time. The service thread can be of considerable complexity, and it may offer system-wide services such as timed events at microsecond resolution. However, microsecond resolution may force the service thread to spin on a high-resolution time service for at most one interrupt period (~1 ms). If care is taken, this can run very well, particularly on multi-processor or multi-core systems. A one-ms spin does not hurt considerably on a multi-core system when the affinity masks for the calling thread and the service thread are carefully handled.
Code, description, and testing can be visited at the Windows Timestamp Project
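A very much simplified sketch of that scheme (the names are illustrative; a real service thread would sleep for most of the interval and only spin near the end, as described above, and would be reused rather than created per call):
#include <windows.h>

typedef struct {
    HANDLE   event;      /* signaled when the deadline is reached */
    LONGLONG deadline;   /* absolute deadline, in QPC ticks */
} MicroDelay;

static DWORD WINAPI DelayThread(LPVOID arg)
{
    MicroDelay *p = (MicroDelay *)arg;
    LARGE_INTEGER now;
    do {
        QueryPerformanceCounter(&now);       /* spin until the deadline */
    } while (now.QuadPart < p->deadline);
    SetEvent(p->event);
    return 0;
}

static void MicroSleep(long microseconds)
{
    LARGE_INTEGER freq, start;
    MicroDelay p;
    HANDLE t;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    p.deadline = start.QuadPart + (freq.QuadPart * microseconds) / 1000000;
    p.event    = CreateEvent(NULL, FALSE, FALSE, NULL);

    t = CreateThread(NULL, 0, DelayThread, &p, 0, NULL);
    WaitForSingleObject(p.event, INFINITE);  /* the "sleeping" thread pauses here */
    WaitForSingleObject(t, INFINITE);
    CloseHandle(t);
    CloseHandle(p.event);
}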
As several people have pointed out, sleep and other related functions are by default dependent on the "system tick". This is the minimum unit of time between OS tasks; the scheduler, for instance, will not run faster than this. Even with a realtime OS, the system tick is not usually less than 1 ms. While it is tunable, this has implications for the entire system, not just your sleep functionality, because your scheduler will be running more frequently, and potentially increasing the overhead of your OS (amount of time for the scheduler to run, vs. amount of time a task can run).
The solution to this is to use an external, high-speed clock device. Most Unix systems will allow you to specify to your timers and such a different clock to use, as opposed to the default system clock.
What are you waiting for that requires such precision? In general if you need to specify that level of precision (e.g. because of a dependency on some external hardware) you are on the wrong platform and should look at a real time OS.
Otherwise you should be considering whether there is an event you can synchronize on, or in the worst case just busy-wait the CPU and use the high-performance counter API to measure the elapsed time.
If you want that much granularity you are in the wrong place (user space).
Remember that in user space your timing is not always precise.
The scheduler can start your thread (or app) and schedule it, so you are dependent on the OS scheduler.
If you are looking for something precise you have to either:
1) Go into kernel space (e.g. drivers), or
2) Choose an RTOS.
Anyway, if you are looking for some granularity (but remember the problem with user space), look at the QueryPerformanceCounter and QueryPerformanceFrequency functions in MSDN.
Actually, using this usleep function will cause a big memory/resource leak (depending on how often it is called).
Use this corrected version (sorry, I couldn't edit the original):
bool usleep(unsigned long usec)
{
    struct timeval tv;
    fd_set dummy;
    SOCKET s = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);

    FD_ZERO(&dummy);
    FD_SET(s, &dummy);

    tv.tv_sec = usec / 1000000ul;
    tv.tv_usec = usec % 1000000ul;

    bool success = (0 == select(0, 0, 0, &dummy, &tv));
    closesocket(s);
    return success;
}
I have the same problem and nothing seems to be faster than a ms, even the Sleep(0). My problem is the communication between a client and a server application where I use the _InterlockedExchange function to test and set a bit and then I Sleep(0).
I really need to perform thousands of operations per second this way and it doesn't work as fast as I planned.
Since I have a thin client dealing with the user, which in turn invokes an agent, which then talks to a thread, I will soon move to merging the thread with the agent so that no event interface will be required.
Just to give you guys an idea how slow this Sleep is, I ran a test for 10 seconds performing an empty loop (getting something like 18,000,000 loops) whereas with the event in place I only got 180,000 loops. That is, 100 times slower!
Try using SetWaitableTimer...
Like everybody mentioned, there are indeed no guarantees about the sleep time.
But nobody wants to admit that sometimes, on an idle system, the usleep command can be very precise. Especially with a tickless kernel. Windows Vista has one, and Linux has had one since 2.6.16.
Tickless kernels exist to help improve laptop battery life: cf. Intel's powertop utility.
Under those conditions, I happened to measure the Linux usleep call respecting the requested sleep time very closely, down to half a dozen microseconds.
So maybe the OP wants something that will roughly work most of the time on an idling system, and be able to ask for microsecond scheduling!
I actually would want that on Windows too.
Also, Sleep(0) sounds like boost::thread::yield(), whose terminology is clearer.
I wonder if Boost timed locks have better precision, because then you could just lock on a mutex that nobody ever releases, and when the timeout is reached, continue on...
Timeouts are set with boost::system_time + boost::milliseconds and co. (xtime is deprecated).
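If one wanted to try that idea, a hedged sketch with pre-C++11 Boost.Thread might look like this (the mutex must be locked once at startup by some other thread and never released; the function name is illustrative):
#include <boost/thread/mutex.hpp>
#include <boost/thread/thread_time.hpp>
#include <boost/date_time/posix_time/posix_time_types.hpp>

boost::timed_mutex g_never_released;   // locked once at startup, never unlocked

void sleep_via_timed_lock(long ms)
{
    boost::system_time deadline =
        boost::get_system_time() + boost::posix_time::milliseconds(ms);
    // Nobody ever releases the mutex, so this simply blocks until the
    // deadline and then returns false.
    g_never_released.timed_lock(deadline);
}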
If your goal is to "wait for a very short amount of time" because you are doing a spinwait, then there are increasing levels of waiting you can perform.
#include <windows.h>

void SpinOnce(int& spin)
{
    /*
       SpinOnce is called each time we need to wait.
       But the action it takes depends on how many times we've been spinning:

       1..12 spins:    busy-spin 2..4096 cycles
       12..32 spins:   call SwitchToThread (allow another thread ready to go on our core to execute)
       over 32 spins:  Sleep(0) (give up the remainder of our timeslice to any other thread
                       ready to run; also allows APC and I/O callbacks)
    */
    spin += 1;

    if (spin > 32)
        Sleep(0);           // give up the remainder of our timeslice
    else if (spin > 12)
        SwitchToThread();   // allow another thread on our CPU to have the remainder of our timeslice
    else
    {
        volatile int loops = (1 << spin);   // 1..12 ==> 2..4096
        while (loops > 0)
            loops -= 1;
    }
}
So if your goal is actually to wait only for a little bit, you can use something like:
int spin = 0;
while (!TryAcquireLock())
{
    SpinOnce(spin);
}
The virtue here is that we wait longer each time, eventually going completely to sleep.
Just use Sleep(0). 0 is clearly less than a millisecond. Now, that sounds funny, but I'm serious. Sleep(0) tells Windows that you don't have anything to do right now, but that you do want to be reconsidered as soon as the scheduler runs again. And since obviously the thread can't be scheduled to run before the scheduler itself runs, this is the shortest delay possible.
Note that you can pass a microsecond count to your usleep, but so does void usleep(__int64 t) { Sleep(t/1000); } - there is no guarantee of actually sleeping for that period.
Sleep function that is way less than a millisecond-maybe
I found that sleep(0) worked for me. On a system with a near 0% load on the cpu in task manager, I wrote a simple console program and the sleep(0) function slept for a consistent 1-3 microseconds, which is way less than a millisecond.
But from the above answers in this thread, I know that the amount sleep(0) sleeps can vary much more wildly than this on systems with a large cpu load.
But as I understand it, the sleep function should not be used as a timer. It should be used to make the program use the lowest percentage of the CPU possible and execute as frequently as possible. For my purposes, such as moving a projectile across the screen in a videogame much faster than one pixel a millisecond, sleep(0) works, I think.
You would just make sure the sleep interval is way smaller than the largest amount of time it would sleep. You don't use the sleep as a timer, but just to make the game use the minimum CPU percentage possible. You would use a separate function that has nothing to do with sleep to know when a particular amount of time has passed, and then move the projectile one pixel across the screen - at a time of, say, 1/10th of a millisecond or 100 microseconds.
The pseudo-code would go something like this.
while (timer1 < 100 microseconds) {
sleep(0);
}
if (timer2 >=100 microseconds) {
move projectile one pixel
}
//Rest of code in iteration here
I know the answer may not work for advanced issues or programs but may work for some or many programs.
If the machine is running Windows 10 version 1803 or later then you can use CreateWaitableTimerExW with the CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag.
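A brief sketch of how that might be used (the flag and function require a recent SDK; due times are in 100-nanosecond units, negative meaning relative):
#include <windows.h>

HANDLE timer = CreateWaitableTimerExW(NULL, NULL,
                   CREATE_WAITABLE_TIMER_HIGH_RESOLUTION, TIMER_ALL_ACCESS);
LARGE_INTEGER due;
due.QuadPart = -5000;                     /* 5000 * 100 ns = 0.5 ms, relative */
SetWaitableTimer(timer, &due, 0, NULL, NULL, FALSE);
WaitForSingleObject(timer, INFINITE);     /* returns after roughly 0.5 ms */
CloseHandle(timer);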
On Windows the use of select forces you to include the Winsock library which has to be initialized like this in your application:
WORD wVersionRequested = MAKEWORD(1,0);
WSADATA wsaData;
WSAStartup(wVersionRequested, &wsaData);
And then select won't let you call it without any socket, so you have to do a little more to create a microsleep method:
int usleep(long usec)
{
    struct timeval tv;
    fd_set dummy;
    SOCKET s = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
    FD_ZERO(&dummy);
    FD_SET(s, &dummy);
    tv.tv_sec = usec / 1000000L;
    tv.tv_usec = usec % 1000000L;
    return select(0, 0, 0, &dummy, &tv);
}
All these created usleep methods return zero when successful and non-zero for errors.