Any recommendations out there for Windows application tuning resources (books, web sites, etc.)?
I have a C++ console application that needs to feed a hardware device with a considerable amount of data at a fairly high rate (the buffer is 32 KB and gets consumed at ~800 KB per second).
It streams data without underruns, except when I perform file I/O such as opening a folder, etc. (it seems to be only marginally meeting its timing requirements).
Anyway, a good book or resource to brush up on real-time performance with Windows would be helpful.
Thanks!
The best you can hope for on commodity Windows is "usually meets timing requirements". If the system is running any processes other than your target app, it will occasionally miss deadlines due to scheduling inconsistencies. However, if your app/hardware can handle the occasional miss, there are a few things you can do to reduce the number of misses.
Set your process's priority to REALTIME_PRIORITY_CLASS
Change the scheduler's granularity to 1 ms resolution via the timeBeginPeriod() function (part of the Windows Multimedia libraries; see the sketch after this list)
Avoid as many system calls in your main loop as possible (this includes allocating memory). Each syscall is an opportunity for the OS to put the process to sleep and, consequently, is an opportunity for the non-deterministic scheduler to miss the next deadline
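A minimal Win32 sketch of the first two points (my illustration, not part of the original answer; error handling omitted, and the app is assumed to link against winmm.lib for the timer call):

#include <windows.h>
#include <mmsystem.h>                 // timeBeginPeriod / timeEndPeriod
#pragma comment(lib, "winmm.lib")

int main()
{
    // Raise the timer/scheduler granularity to 1 ms for the lifetime of the app.
    timeBeginPeriod(1);

    // Put the whole process into the real-time priority class.
    // Note: without sufficient privileges Windows may silently grant
    // HIGH_PRIORITY_CLASS instead.
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);

    // ... streaming loop: keep the 32K hardware buffer fed here ...

    timeEndPeriod(1);                 // restore the previous timer resolution
    return 0;
}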
If this doesn't get the job done for you, you might consider trying a Linux distribution with realtime kernel patches applied. I've found those to provide near-perfect timing (within 10s of microseconds accuracy over the course of several hours). That said, nothing short of a true-realtime OS will actually give you perfection but the realtime-linux distros are much closer than commodity Windows.
The first thing I would do is tune it until it's as lean as possible. I use this method, for these reasons. Since it's a console app, another option is to try out LTProf, which will show you whether there is anything you can fruitfully optimize. When that's done, you will be in the best position to look for buffer timing issues, as Hans suggested.
Optimizing software in C++ from agner.org is a great optimization manual.
As Rakis said, you will need to be very careful in the processing loop:
No memory allocation. Use the stack and preallocated memory instead.
No throws. Exceptions are quite expensive; on Win32 they have a cost even when nothing is thrown.
No polymorphism. You will save some indirections.
Use inline extensively.
No locks. Try lock-free approaches when possible (a minimal sketch follows this list).
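To illustrate the "no allocation, no locks" rules, here is a minimal sketch (my illustration, not from the original answer) of a preallocated single-producer/single-consumer ring buffer: an I/O thread can push blocks into it while the time-critical loop pops them, and nothing inside either call allocates, throws or locks.

#include <atomic>
#include <array>
#include <cstddef>

// Fixed-capacity single-producer/single-consumer ring buffer.
// All storage is preallocated up front; push/pop never allocate, throw or lock.
template <typename T, std::size_t Capacity>
class SpscRing
{
public:
    bool push(const T& item)                     // producer thread only
    {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                        // full; caller decides what to do
        buffer_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& item)                            // consumer thread only
    {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                        // empty
        item = buffer_[tail];
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    std::array<T, Capacity> buffer_{};           // preallocated, no new/delete
    std::atomic<std::size_t> head_{0};           // next slot to write
    std::atomic<std::size_t> tail_{0};           // next slot to read
};

In the original poster's scenario, a file-reading thread could push 32 KB blocks while the loop that feeds the hardware only ever pops; the remaining question is sizing the ring so that I/O hiccups are absorbed.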
The buffer will last for only 40 milliseconds. You can't guarantee zero underruns on Windows with such strict timing requirements. In user-mode land, you are looking at, potentially, hundreds of milliseconds when kernel threads do what they need to do. They run with higher priorities than you can ever gain. The thread quantum on the workstation version is 3 times the clock tick, already beyond 40 milliseconds (3 x 15.625 ms ≈ 47 ms). You can't even reliably compete with user-mode threads that boosted their priority and take their sweet old time.
If a bigger buffer is not an option then you are looking at a device driver to get this kind of service guarantee. Or something in between that can provide a larger buffer.
I have a set of C++ functions which do some image-processing related operations. Generally I see that the final output is delivered in the 5-6 ms range. I am measuring the time taken using the QueryPerformanceCounter Win32 API. But when running in a continuous loop with 100 images, I see that the time spikes up to 20 ms for some images. My question is how do I go about analyzing such issues. Basically I want to determine whether the spikes are caused by some delay in this code or whether some other task started running on the CPU, because of which this operation took longer. I have tried using the GetThreadTimes API to see how much time my thread spent on the CPU but am unable to conclude anything from those numbers. What is the standard way to go about troubleshooting these types of issues?
The reason behind sudden spikes during processing could be any of I/O, interrupts, scheduled processes, etc.
It is very common to see such spikes with such low-latency/short processing time operations. IMO you can attribute them to any of the above-mentioned reasons (there could be more). The simplest solution is to run the same experiment with more inputs multiple times and take the average as the final result.
To answer your question about checking/confirming the source of the spike, you can try the following:
Check variation in images - already ruled out as per your comment
Monitor resource utilization during processing. Check if any resource is choking (% utilization is the simplest way to check, and the sar/nmon utilities on Linux are best, with minimal overhead)
Reserve a few CPUs on the system (CPU affinity) for your experiment, dedicated only to your program so that no OS task will run on them. taskset is the simplest utility to try out. More details are here.
Run the experiment with this setting and check behavior.
That's a nasty thing you are trying to figure out; I wouldn't even attempt to, since coming to concrete conclusions is hard.
In general, one should run a loop of many iterations (100 just seems too small I think), and then take the average time for an image to be processed.
That will rule out any unexpected exterior events that may have hurt performance of your program.
A typical way to check whether "some other task started running inside the CPU" would be to run your program once and mark the images that produce the spikes. For example, images 2, 4, 5, and 67 take too long to be processed. Run your program again a few times, and mark again which images produce the spikes.
If the same images produce these spikes, then it's not something caused by another exterior task.
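As a rough sketch of that approach (my example; ProcessImage() is a hypothetical stand-in for the real routine, and the 10 ms threshold is arbitrary), you can time each iteration with QueryPerformanceCounter and record which image indices spike:

#include <windows.h>
#include <cstdio>

// Hypothetical stand-in for the real image-processing call.
static void ProcessImage(int /*index*/)
{
    volatile double x = 0;
    for (int k = 0; k < 100000; ++k) x += k * 0.5;    // dummy work
}

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);

    for (int i = 0; i < 100; ++i) {
        QueryPerformanceCounter(&start);
        ProcessImage(i);
        QueryPerformanceCounter(&stop);

        const double ms = 1000.0 * (stop.QuadPart - start.QuadPart) / freq.QuadPart;
        if (ms > 10.0)                                // arbitrary spike threshold
            std::printf("image %d spiked: %.2f ms\n", i, ms);
    }
    return 0;
}

If the same indices are reported across runs, the cause is in the data or the code path; if the slow indices move around, suspect external interference.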
What is the standard way to go about troubleshooting these types of issues?
There are Real Time Operating Systems (RTOS) which guarantee those kinds of delays. They are a totally different class of operating system from Windows or Linux.
But still, there is something you can do about your delays even on a general-purpose OS.
1. Avoid system calls
Once you ask your OS to read or write something to disk, there are no guarantees whatsoever about delays. So, avoid any system functions on your critical path:
even functions like gettimeofday() might cause unpredictable delays, so you should really avoid any system calls in time-critical code;
use another thread to perform IO and pass data via a shared buffer to your critical code.
If your code base is big, you can use tools like strace on Linux or Dr. Memory on Windows to trace system calls.
2. Avoid context switches
Multithreading on Windows is preemptive. That means there is a system scheduler which might stop your thread at any time and schedule another thread on your CPU. As before, there are RTOSes which allow you to avoid such context switches, but there is something you can do about it:
make sure there is at least one CPU core left for system and other tasks;
bind each of your threads to a dedicated CPU with SetThreadAffinityMask() (Windows) or sched_setaffinity() (Linux) -- this effectively hints the system scheduler to avoid scheduling other threads on this CPU (a short sketch follows at the end of this section);
make sure hardware interrupts go to another CPU; usually interrupts go to CPU 0, so the easiest way would be to bind your thread with CPU 1+;
increase your thread priority, so the scheduler is less likely to preempt your thread in favour of another one.
There are tools like perf (Linux) and Intel VTune (Windows) to confirm whether context switches are occurring.
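A minimal Win32 sketch of the affinity and priority suggestions above (my illustration; it assumes CPU 0 is left to the OS and its interrupts, and error handling is omitted):

#include <windows.h>

// Call this from the time-critical thread itself.
// On Linux, sched_setaffinity() and pthread_setschedprio() play similar roles.
void PinCriticalThread()
{
    // Bind this thread to CPU 1, leaving CPU 0 for the OS and hardware interrupts.
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << 1);

    // Raise only this thread's priority within the process.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
}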
3. Avoid other non-deterministic features
A few more sources of unexpected delays:
disable swap, so you know for sure your thread's memory will not be swapped out to a slow and unpredictable disk drive (a code-level alternative is sketched after this list);
disable CPU turbo boost -- after a CPU boosts, there is always a slowdown so that the CPU stays within its thermal design power (TDP);
disable hyper-threading -- from the scheduler's point of view hyper-threads look like independent CPUs, but in fact the performance of each one depends on what the other thread on the same core is doing at the moment.
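If disabling swap system-wide is not practical, a related code-level measure (my addition, not part of the original list) is to lock the critical buffers into physical RAM, e.g. with VirtualLock() on Windows or mlock() on Linux:

#include <windows.h>

// Pin a preallocated buffer into physical memory so it cannot be paged out.
// The default working-set quota is small, so enlarge it before locking.
bool LockBuffer(void* buffer, SIZE_T size)
{
    SetProcessWorkingSetSize(GetCurrentProcess(),
                             size + (1 << 20),        // minimum working set
                             size + (8 << 20));       // maximum working set
    return VirtualLock(buffer, size) != 0;
}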
Hope this helps.
I have a 3D application that needs to generate a new frame roughly every 6 ms or so. This frame rate needs to be constant in order not to cause stuttering. To make matters worse, the application has to perform several moderately heavy calculations (mostly preparing the 3D scene and copying data to the VRAM), so it consumes a fairly large amount of that ~6 ms doing its own stuff.
This has been a problem because Windows causes my application to stutter a bit when it tries to use the CPU for other things. Is there any way I could make Windows not "give away" timeslices to other processes? I'm not concerned about it negatively impacting background processes.
Windows will allow you to raise your application's priority. A process will normally only lose CPU time to other processes with the same or higher priority, so raising your priority can prevent CPU time from being "stolen".
Be aware, however, that if you go too far, you can render the system unstable, so if you're going to do this, you generally only want to raise the priority a little bit, so it is higher than other "normal" applications.
Also note that this won't make a huge difference. If you're running into a small problem once in a while, increasing the priority may take care of it. If it's a constant problem, chances are that a priority boost won't be sufficient to fix it.
If you decide to try this, see SetPriorityClass and SetThreadPriority.
It normally depends on the scheduling algorithm used by your OS. Windows XP, Vista, 7 and 8 use multilevel queue scheduling with round robin, so increasing the priority of your application, thread or process will do what you want. Which version of Windows are you currently using? I can help you accordingly once I know that.
You can raise your process priority, but I don’t think it will help much. Instead, you should optimize your code.
For a start, use the VS built-in profiler (Debug / Performance Profiler menu) to find out where your app spends most of its time, and optimize that.
Also, all modern CPUs are at least dual core (the last single-core Celeron is from 2013). Therefore, "it consumes a fairly large amount of that ~6 ms doing its own stuff" shouldn't be the case. Your own stuff should be running in a separate thread, not on the same thread you use to render. See this article for an idea of how to achieve that. You probably don't need that level of complexity; just 2 threads + 2 tasks (compute and render) will probably be enough, but that article should give you some ideas on how to re-design your app. This approach will, however, add one extra frame of input latency (for one frame the background thread computes, and only on the next frame does the renderer thread show the result), but with your 165 Hz rendering you can probably live with that.
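A minimal sketch of that split (my illustration: ComputeScene()/RenderScene() are hypothetical stand-ins, and a simple "latest frame" mailbox is used rather than a full job system):

#include <thread>
#include <mutex>
#include <condition_variable>
#include <atomic>
#include <chrono>

// Hypothetical scene data; in the real app this is what ends up in VRAM.
struct Frame { int dummy = 0; };

std::mutex m;
std::condition_variable cv;
Frame pending;                           // latest fully computed frame
bool hasPending = false;
std::atomic<bool> quit{false};

// Stand-ins for the real work (assumptions, not the asker's code).
Frame ComputeScene()            { Frame f; /* heavy CPU-side preparation */ return f; }
void  RenderScene(const Frame&) { /* VRAM upload + draw, within the ~6 ms budget */ }

void ComputeThread()
{
    while (!quit) {
        Frame f = ComputeScene();        // heavy work, off the render thread
        {
            std::lock_guard<std::mutex> lock(m);
            pending = f;                 // publish (overwrites a stale frame, if any)
            hasPending = true;
        }
        cv.notify_one();
    }
}

void RenderThread()
{
    while (!quit) {
        Frame f;
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return hasPending || quit.load(); });
            if (quit) break;
            f = pending;                 // take the latest frame
            hasPending = false;
        }
        RenderScene(f);                  // always one frame behind the compute thread
    }
}

int main()
{
    std::thread compute(ComputeThread), render(RenderThread);
    std::this_thread::sleep_for(std::chrono::seconds(1));   // run briefly for the demo
    quit = true;
    cv.notify_all();
    compute.join();
    render.join();
}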
I've spent a little time running valgrind/callgrind to profile a server that does a lot of TCP/IP communications using many threads. After some time improving the performance, I realised that in this particular test scenario, the process is not CPU bound so the performance "improvements" I'd looked at were of no use.
In theory, the CPU should be very busy. I know the TCP/IP device it connects to isn't the limitation, as the server runs on two machines. One is a PC, the other is an embedded device with an ARM processor. Even the embedded device only gets to about 2% CPU usage, but it does far fewer transactions - about a tenth. Both systems only get up to about 2% even though we're trying to get data as fast as possible.
My guess is that some mutex is locked and is holding up a thread. This is a pure guess! There are a few threads in the system with common data. Perhaps there are other possibilities but how do I tell?
Is there any way to use a tool like valgrind/callgrind that might show the time spent in system calls? I can also run it on Windows with Visual Studio 2012 if that's better.
We might have to try walking through the code or something but not sure that we have time.
Any tips appreciated.
Thanks.
Callgrind is a great profiler but it does have some drawbacks. In particular, it assumes that the same instruction always executes in the same amount of time, and it assumes that instruction counts are the most important metric.
This is fine for getting (mostly) reproducible profiling results and for analyzing in detail what instructions are executed, but there are some types of performance problems which Callgrind doesn't detect:
time spent waiting for locks
time spent sleeping (eg. simple sleep()/usleep() calls will effectively slow down your application but won't show up in Callgrind)
time spent waiting for disk I/O or network I/O
time spent waiting for data that was swapped out
influences from CPU cache hits/misses (you can try to use Cachegrind for this particular topic)
influences from CPU pipeline stalls, branch prediction failures and all the other features of modern CPUs that can cause the same instruction to be executed faster or slower depending on the context
These problems can be detected quite well using a statistical (or sample-based) profiler. Examples would be Sysprof and OProfile, or any kind of "poor-man's sampling profiler" as described eg. at https://stackoverflow.com/a/378024. The VS2012 built-in profiler mentioned by WhozCraig appears to be a sampling profiler as well.
While statistical profilers are really useful because they provide "real-world" results instead of simple instruction counts, they have the possible drawback that you don't get reproducible results easily (the results will vary a little bit with every run), and that you need to gather a sufficient number of samples to get detailed results.
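As a cheap first check before (or alongside) a sampling profiler, you can compare wall-clock time against the thread's CPU time around a suspect section; if wall time is much larger than CPU time, the thread was blocked (mutex, socket, disk) rather than computing. Below is a Windows sketch using the GetThreadTimes and QueryPerformanceCounter calls already mentioned in the question (my illustration; error handling omitted):

#include <windows.h>
#include <cstdio>

// Returns this thread's accumulated user + kernel CPU time in milliseconds.
static double ThreadCpuMs()
{
    FILETIME creation, exitTime, kernel, user;
    GetThreadTimes(GetCurrentThread(), &creation, &exitTime, &kernel, &user);
    ULARGE_INTEGER k, u;
    k.LowPart = kernel.dwLowDateTime;  k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;    u.HighPart = user.dwHighDateTime;
    return (k.QuadPart + u.QuadPart) / 10000.0;   // FILETIME ticks are 100 ns
}

void MeasureSuspectSection()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    const double cpu0 = ThreadCpuMs();
    QueryPerformanceCounter(&t0);

    // ... suspect code: socket reads, mutex-protected work, etc. ...

    QueryPerformanceCounter(&t1);
    const double cpu1 = ThreadCpuMs();

    const double wallMs = 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;
    std::printf("wall %.2f ms, cpu %.2f ms, blocked for roughly %.2f ms\n",
                wallMs, cpu1 - cpu0, wallMs - (cpu1 - cpu0));
}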
By task scheduler I mean any implementation of worker threads pool which distribute work to the threads according to whatever algorithm they are designed with. (like Intel TBB)
I know that "real-time" constraints imply that work gets done in predictable time (I'm not talking about speed). So my guess is that using a task scheduler, which, as far as I know, can't guarantee that some task will be executed before a given time, makes the application impossible to use in these constraints.
Or am I missing something? Is there a way to have both? Maybe by forcing assumptions on the quantity of data that can be processed? Or maybe there are predictable task schedulers?
I'm talking about "hard" real time constraints, not soft real time (like video games).
To clarify:
It is known that there are features in C++ that are not possible to use in this kind of context: new, delete, throw, dynamic_cast. They are not predictable (you don't know how much time can be spent on one of these operations; it depends on too many parameters that are not even known before execution).
You can't really use them in real time contexts.
What I am asking is: do task schedulers have the same unpredictability that would make them unusable in real-time applications?
Yes, it can be done, but no it's not trivial, and yes there are limits.
You can write the scheduler to guarantee (for example) that an interrupt handler, exception handler (etc.) is guaranteed to be invoked within a fixed period of time from when it occurs. You can guarantee that any given thread will (for example) get at least X milliseconds of CPU time out of any given second (or suitable fraction of a second).
To enforce the latter, you generally need admittance criteria -- an ability for the scheduler to say "sorry, but I can't schedule this as a real-time thread, because the CPU is already under too much load."
In other cases, all it does is guarantee that at least (say) 99% of CPU time will be given to the real-time tasks (if any exist), and it's up to whoever designs the system on top of that to schedule few enough real-time tasks that this will ensure they all finish quickly enough.
I feel obliged to add that the "hardness" of real-time requirements is almost entirely orthogonal to the response speed needed. Rather, it's almost entirely about the seriousness of the consequences of being late.
Just for example, consider a nuclear power plant. For a lot of what happens, you're dealing with time periods on the order of minutes, or in some cases even hours. Filling a particular chamber with, say, half a million gallons of water just isn't going to happen in microseconds or milliseconds.
At the same time, the consequences of a late answer can be huge -- quite possibly causing not just a few deaths like hospital equipment could, but potentially hundreds or even thousands of deaths, hundreds of millions in damage, etc. As such, it's about as "hard" as real-time requirements get, even though the deadlines are unusually "loose" by most typical standards.
In the other direction, digital audio playback has much tighter limits. Delays or dropouts can be quite audible down to a fraction of a millisecond in some cases. At the same time, unless you're providing sound processing for a large concert (or something on that order) the consequences of a dropout will generally be a moment's minor annoyance on the part of a user.
Of course, it's also possible to combine the two -- for an obvious example, in high-frequency trading, deadlines may well be in the order of microseconds (or so) and the loss from missing a deadline could easily be millions or tens of millions of (dollars|euros|pounds|etc.)
The term real-time is quite flexible. "Hard real-time" tends to mean things where a few tens of microseconds make the difference between "works right" and "doesn't work right". Not all "real-time" systems require that sort of real-time-ness.
I once worked on a radio-base-station for mobile phones. One of the devices on the board had an interrupt that fired every 2-something milliseconds. For correct operation (not losing calls), we had to deal with the interrupt, that is, do the work inside the interrupt and write the hardware registers with the new values, within 100 microseconds - if we missed, there would be dropped calls. If the interrupt wasn't taken after 160 microseconds, the system would reboot. That is "hard real-time", especially as the processor was just running at a few tens of MHz.
If you produce a video player, it requires real-time behaviour in the few-milliseconds range.
A "display stock prices" application can probably live within the 100 ms range.
For a web server it is probably acceptable to respond within 1-2 seconds without any big problems.
Also, there are systems where "worst case worse than X means failure" (like the case above with 100 microseconds and dropped calls - that's bad if it happens more than once every few weeks, and even a few times a year is really something that should be fixed). This is called "hard real-time".
In other systems, missing your deadline just means "oh well, we have to do that over again" or "a frame of video flickered a bit"; as long as it doesn't happen very often, it's probably OK. This is called "soft real-time".
A lot of modern hardware will make "hard real-time" (the tens to hundreds of microseconds range) difficult, because the graphics processor will simply stop the processor from accessing memory, or, if the processor gets hot, the stopclk pin is pulled for 100 microseconds...
Most modern OSes, such as Linux and Windows, aren't really meant to be "hard real-time". There are sections of code in these OSes that disable interrupts for longer than 100 microseconds.
You can almost certainly get some good "soft real-time" (that is, missing the deadline isn't a failure, just a minor annoyance) out of a mainstream modern OS with modern hardware. It'll probably require either modifications to the OS or a dedicated real-time OS (and perhaps suitable special hardware) to make the system do hard real-time.
But only a few things in the world requires that sort of hard real-time. Often the hard real-time requirements are dealt with by hardware - for example, the next generation of radio-base-stations that I described above, had more clever hardware, so you just needed to give it the new values within the next 2-something milliseconds, and you didn't have the "mad rush to get it done in a few tens of microseconds". In a modern mobile phone, the GSM or UMTS protocol is largely dealt with by a dedicated DSP (digital signal processor).
A "hard real-time" requirement is where the system is really failing if a particular deadline (or set of deadlines) can't be met, even if the failure to meet deadlines happens only once. However, different systems have different systems have different sensitivity to the actual time that the deadline is at (as Jerry Coffin mentions). It is almost certainly possible to find cases where a commercially available general purpose OS is perfectly adequate in dealing with the real-time requirements of a hard real-time system. It is also absolutely sure that there are other cases where such hard real-time requirements are NOT possible to meet without a specialized system.
I would say that if you want sub-millisecond guarantees from the OS, then Desktop Windows or Linux are not for you. This is really down to the overall philosophy of the OS and scheduler design, and to build a hard real-time OS requires a lot of thought about locking and potential for one thread to block another thread, from running, etc.
I don't think there is ONE answer that applies to your question. Yes, you can certainly use thread-pools in a system that has hard real-time requirements. You probably can't do it on a sub-millisecond basis unless there is specific support for this in the OS. And you may need to have dedicated threads and processes to deal with the highest priority real-time behaviour, which is not part of the thread-pool itself.
Sorry if this isn't saying "Yes" or "No" to your question, but I think you will need to do some research into the actual behaviour of the OS and see what sort of guarantees it can give (worst case). You will also have to decide what the worst-case scenario is and what happens if you miss a deadline - are (lots of) people dying (a plane falling out of the sky), is some banker going to lose millions, are the green lights going to come on at the same time in both directions at a road crossing, or is it some bad sound coming out of a speaker?
"Real time" doesn't just mean "fast", it means that the system can respond to meet deadlines in the real world. Those deadlines depend on what you're dealing with in the real world.
Whether or not a task finishes in a particular timeframe is a characteristic of the task, not the scheduler. The scheduler might decide which task gets resources, and if a task hasn't finished by a deadline it might be stopped or have its resource usage constrained so that other tasks can meet their deadlines.
So, the answer to your question is that you need to consider the workload, deadlines and the scheduler together, and construct your system to meet your requirements. There is no magic scheduler that will take arbitrary tasks and make them complete in a predictable time.
Update:
A task scheduler can be used in real-time systems if it provides the guarantees you need. As others have said, there are task schedulers that provide those guarantees.
On the comments: The issue is the upper bound on time taken.
You can use new and delete if you overload them to have the performance characteristics you are after; the problem isn't new and delete, it is dynamic memory allocation. There is no requirement that new and delete use a general-purpose dynamic allocator, you can use them to allocate out of a statically allocated pool sized appropriately for your workload with deterministic behaviour.
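A minimal sketch of that idea (my illustration; the pool size and the Message type are made up, and the pool is not thread-safe as written): class-specific operator new/delete drawing from a statically allocated free list, so the worst-case cost of an allocation is a couple of pointer operations.

#include <cstddef>
#include <new>

// Fixed-size pool: hands out blocks of BlockSize bytes from a static array.
template <std::size_t BlockSize, std::size_t BlockCount>
class FixedPool
{
public:
    FixedPool()
    {
        for (std::size_t i = 0; i + 1 < BlockCount; ++i)
            nodes[i].next = &nodes[i + 1];
        nodes[BlockCount - 1].next = nullptr;
        freeList = &nodes[0];
    }
    void* allocate()
    {
        if (!freeList) throw std::bad_alloc();   // size the pool so this never happens
        FreeNode* n = freeList;
        freeList = n->next;
        return n;
    }
    void release(void* p) noexcept
    {
        FreeNode* n = static_cast<FreeNode*>(p);
        n->next = freeList;
        freeList = n;
    }
private:
    union FreeNode {
        FreeNode* next;
        alignas(std::max_align_t) unsigned char storage[BlockSize];
    };
    FreeNode nodes[BlockCount];
    FreeNode* freeList = nullptr;
};

// Example type whose new/delete never touch the general-purpose heap.
struct Message
{
    int payload[16];
    static void* operator new(std::size_t);
    static void  operator delete(void*) noexcept;
};

static FixedPool<sizeof(Message), 64> messagePool;

void* Message::operator new(std::size_t)          { return messagePool.allocate(); }
void  Message::operator delete(void* p) noexcept  { messagePool.release(p); }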
On dynamic_cast: I tend not to use it, but I don't think its performance is non-deterministic to the extent that it should be banned in real-time code. This is an example of the same issue: understanding worst-case performance is important.
My understanding of the Sleep function is that it follows "at least" semantics, i.e. sleep(5) guarantees that the thread sleeps for at least 5 seconds, but it may remain blocked for more than 5 seconds depending on other factors. Is there a way to sleep for exactly a specified time period (without busy waiting)?
As others have said, you really need to use a real-time OS to try and achieve this. Precise software timing is quite tricky.
However... although not perfect, you can get a LOT better results than "normal" by simply boosting the priority of the process that needs better timing. In Windows you can achieve this with the SetPriorityClass function. If you set the priority to the highest level (REALTIME_PRIORITY_CLASS: 0x00000100) you'll get much better timing results. Again - this will not be perfect like you are asking for, though.
This is also likely possible on other platforms than Windows, but I've never had reason to do it so haven't tested it.
EDIT: As per the comment by Andy T, if your app is multi-threaded you also need to watch out for the priority assigned to the threads. For Windows this is documented here.
Some background...
A while back I used SetPriorityClass to boost the priority of an application where I was doing real-time analysis of high-speed video and could NOT miss a frame. Frames were arriving at the PC at a very regular frequency (driven by external frame-grabber hardware) of 300 frames per second (fps), which fired a hardware interrupt on every frame, which I then serviced. Since timing was very important, I collected a lot of stats on the interrupt timing (using QueryPerformanceCounter stuff) to see how bad the situation really was, and was appalled at the resulting distributions. I don't have the stats handy, but basically Windows was servicing the interrupt whenever it felt like it when run at normal priority. The histograms were very messy, with the standard deviation being wider than my ~3 ms period. Frequently I would have gigantic gaps of 200 ms or greater in the interrupt servicing (recall that the interrupt fired roughly every 3 ms)!! i.e.: HW interrupts are FAR from exact! You're stuck with what the OS decides to do for you.
However - when I discovered the REALTIME_PRIORITY_CLASS setting and benchmarked with that priority, it was significantly better and the service interval distribution was extremely tight. I could run 10 minutes of 300 fps and not miss a single frame. Measured interrupt servicing periods were pretty much exactly 1/300 s with a tight distribution.
Also - try to minimize the other things the OS is doing, to help improve the odds of your timing working better in the app where it matters. e.g.: no background video transcoding or disk defragmenting or anything else while you're trying to get precision timing with other code!!
In summary:
If you really need this, go with a real time OS
If you can't use a real-time OS (impossible or impractical), boosting your process priority will likely improve your timing by a lot, as it did for me
HW interrupts won't do it... the OS still needs to decide to service them!
Make sure that you don't have a lot of other processes running that are competing for OS attention
If timing is really important to you, do some testing. Although getting code to run exactly when you want it to is not very easy, measuring this deviation is quite easy. The high performance counters in PCs (what you get with QueryPerformanceCounter) are extremely good.
Since it may be helpful (although a bit off topic), here's a small class I wrote a long time ago for using the high performance counters on a Windows machine. It may be useful for your testing:
CHiResTimer.h
#pragma once
#include "stdafx.h"
#include <windows.h>
class CHiResTimer
{
private:
    LARGE_INTEGER frequency;
    LARGE_INTEGER startCounts;
    double ConvertCountsToSeconds(LONGLONG Counts);
public:
    CHiResTimer(); // constructor
    void ResetTimer(void);
    double GetElapsedTime_s(void);
};
CHiResTimer.cpp
#include "stdafx.h"
#include "CHiResTimer.h"
double CHiResTimer::ConvertCountsToSeconds(LONGLONG Counts)
{
    return (double)Counts / (double)frequency.QuadPart;
}
CHiResTimer::CHiResTimer()
{
    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&startCounts); // starts the timer right away
}
void CHiResTimer::ResetTimer()
{
    QueryPerformanceCounter(&startCounts); // reset the reference counter
}
double CHiResTimer::GetElapsedTime_s()
{
    LARGE_INTEGER countsNow;
    QueryPerformanceCounter(&countsNow);
    return ConvertCountsToSeconds(countsNow.QuadPart - startCounts.QuadPart);
}
No.
The reason it has "at least" semantics is that after those 5 seconds some other thread may be busy.
Every thread gets a time slice from the Operating System. The Operating System controls the order in which the threads are run.
When you put a thread to sleep, the OS puts the thread in a waiting list, and when the timer is over the operating system "Wakes" the thread.
This means that the thread is added back to the active threads list, but it isn't guaranteed that it will be scheduled first. (What if 100 threads need to be awakened at that specific moment? Which one goes first?)
While standard Linux is not a realtime operating system, the kernel developers pay close attention to how long a high priority process would remain starved while kernel locks are held. Thus, a stock Linux kernel is usually good enough for many soft-realtime applications.
You can schedule your process as a realtime task with the sched_setscheduler(2) call, using either SCHED_FIFO or SCHED_RR. The two have slight differences in semantics, but it may be enough to know that a SCHED_RR task will eventually relinquish the processor to another task of the same priority due to time slices, while a SCHED_FIFO task will only relinquish the CPU to another task of the same priority due to blocking I/O or an explicit call to sched_yield(2).
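A minimal sketch of that call (my illustration; the priority value of 50 is arbitrary, and the call needs root or CAP_SYS_NICE):

#include <sched.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

int main()
{
    sched_param sp{};
    sp.sched_priority = 50;                  // 1..99 for SCHED_FIFO

    // 0 = the calling process; SCHED_FIFO = run until we block or yield.
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        std::printf("sched_setscheduler failed: %s\n", std::strerror(errno));
        return 1;
    }

    // ... realtime work here; make sure it blocks or calls sched_yield() regularly ...
    return 0;
}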
Be careful when using realtime scheduled tasks; as they always take priority over standard tasks, you can easily find yourself coding an infinite loop that never relinquishes the CPU and blocks admins from using ssh to kill the process. So it might not hurt to run an sshd at a higher realtime priority, at least until you're sure you've fixed the worst bugs.
There are variants of Linux available that have been worked on to provide hard-realtime guarantees. RTLinux has commercial support; Xenomai and RTAI are competing implementations of realtime extensions for Linux, but I know nothing else about them.
As previous answerers said: there is no way to be exact (some suggested a real-time OS or hardware interrupts, and even those are not exact). I think what you are looking for is something that is just more precise than the sleep() function, and you can find that, depending on your OS, in e.g. the Windows Sleep() function or, under GNU libc, the nanosleep() function.
http://msdn.microsoft.com/en-us/library/ms686298%28VS.85%29.aspx
http://www.delorie.com/gnu/docs/glibc/libc_445.html
Both will give you precision within a few milliseconds.
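For illustration, a minimal nanosleep() sketch on the GNU/Linux side (my example; the 1.5 ms figure is arbitrary, and the actual sleep can still overshoot):

#include <time.h>    // nanosleep (POSIX)

int main()
{
    timespec req;
    req.tv_sec = 0;
    req.tv_nsec = 1500000;      // request 1.5 ms; the OS may still sleep longer
    nanosleep(&req, nullptr);   // second argument would receive the time remaining if interrupted
    return 0;
}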
Well, you are trying to tackle a difficult problem, and achieving exact timing is not feasible: the best you can do is to use hardware interrupts, and the implementation will depend on both your underlying hardware and your operating system (namely, you will need a real-time operating system, which most regular desktop OSes are not). What is your exact target platform?
No. Because you're always depending on the OS to handle waking up threads at the right time.
There is no way to sleep for an exact time period using standard C. You will need, at minimum, a 3rd-party library which provides greater granularity, and you might also need a special operating system kernel such as the real-time Linux kernels.
For instance, here is a discussion of how close you can come on Win32 systems.
This is not a C question.