setting a c++ application to use maximum CPU usage, in the code - c++

I developed a program in c++ and when I run it in windows XP it uses all the available CPU to 100% of usage but when I run the application in windows 7 the app could hardly makes it's way to 40% even by setting the task to real-time or high priority one in taskbar is there a way that I could force the OS to let my application use maximum available CPU like what was in winXP in my code. I mean something like APIs or a library.

This is more than likely due to you having more than one core. In order to use 100% of your CPU you may need to have multiple threads created.

If your app is using any kind of IO, and that IO is messed up in XP (bad driver and/or something else), that might be causing your app to spin the CPU entirely.
7 is maybe better optimized in such areas, so it frees the CPU until slow (disk, network) stuff is completed.

Also depending on what this thread is doing and how often it spends time off the processor (Sleep, object waits) can be a factor, but MK pretty much summed it up for you. You could also have a look here:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686277%28v=vs.85%29.aspx

Related

Analyzing spikes in performance measurement

I have a set of C++ functions which does some image processing related operation. Generally I see that the final output is delivered in 5-6ms time range. I am measuring the time taken using QueryPerformanceCounter Win32 API. But when running in a continuous loop with 100 images, I see that the performance spikes up to 20ms for some images. My question is how do I go about analyzing such issues. Basically I want to determine whether the spikes are caused due to some delay in this code or whether some other task started running inside the CPU because of which this operation took time. I have tried using GetThreadTimes API to see how much time my thread spent inside CPU but am unable to conclude based on those numbers. What is the standard way to go about troubleshooting these types of issues?
Reason behind sudden spikes during processing could be any of IO, interrupt, scheduled processes etc.
It is very common to see such spikes considering such low latency/processing time operations. IMO you can consider them because of any of the above mentioned reasons (There could be more). Simplest solution is run same experiment with more inputs multiple times and take the average for final consideration.
To answer your question about checking/confirming source of the spike you can try following,
Check variation in images - already ruled out as per your comment
Monitor resource utilization during processing. Check if any resource is choking (% util is simplest way to check and SAR/NMON utility on linux is best with minimal overhead)
Reserve few CPU's on system (CPU Affinity) for your experiment which are dedicated only for your program and no OS task will run on them. Taskset is simplest utility to try out. More details are here.
Run the experiment with this setting and check behavior.
That's a nasty thing you are trying to figure out, I wouldn'd even attempt to, since coming into concrete conlusions is hard.
In general, one should run a loop of many iterations (100 just seems too small I think), and then take the average time for an image to be processed.
That will rule out any unexpected exterior events that may have hurt performance of your program.
A typical way to check if "some other task started running inside the CPU" would be to run your program once and mark the images that produce that spike. Example, image 2, 4, 5, and 67 take too long to be processed. Run your program again some times, and mark again which images produce the spikes.
If the same images produce these spikes, then it's not something caused by another exterior task.
What is the standard way to go about troubleshooting these types of issues?
There are Real Time Operating Systems (RTOS) which guarantee those kind of delays. It is totally different class of operating systems than Windows or Linux.
But still, there are something you can do about your delays even on general purpose OS.
1. Avoid system calls
Once you ask your OS to read or write something to a disk -- there are no guarantees whatever about delays. So, avoid any system functions on you critical path:
even functions like gettimeofday() might cause unpredictable delays, so you should really avoid any system calls in time-critical code;
use another thread to perform IO and pass data via a shared buffer to your critical code.
If your code base is big, tools like strace on Linux or Dr Memory on Windows to trace system calls.
2. Avoid context switches
The multi threading on Windows is preemptive. It means, there is a system scheduler, which might stop your thread any time and schedule another thread on your CPU. As previously, there are RTOSes, which allow to avoid such context switches, but there is something you can do about it:
make sure there is at least one CPU core left for system and other tasks;
bind each of your threads to a dedicated CPU with SetThreadAffinityMask() (Windows) or sched_setaffinity() (Linux) -- this effectively hints system scheduler to avoid scheduling other threads on this CPU;
make sure hardware interrupts go to another CPU; usually interrupts go to CPU 0, so the easiest way would be to bind your thread with CPU 1+;
increase your thread priority, so scheduler less likely to switch your thread with another one.
There are tools like perf (Linux) and Intel VTune (Windows) to confirm there are context switches.
3. Avoid other non-deterministic features
Few more sources of unexpected delays:
disable swap, so you know for sure your thread memory will not be swapped on slow and unpredictable disk drive;
disable CPU turbo boost -- after a high-performance CPU boosts, there is always a slow down, so the CPU stays withing its thermal power (TDP);
disable hyper threading -- from scheduler point of view those are independent CPUs, but in fact performance of each hyper-thread CPU depend on what another thread is doing at the moment.
Hope this helps.

Android NDK - Multithreading is slowing down rendering

I have an Android app with a C++ library which uses pthreads to break down rendering tasks. This is for devices running Android 4+.
Lets say I have a 100 x 100 array of elements into which I repetitively do CPU-intensive processing. Currently I'm breaking the array up into four 25 x 100 element chunks and handing it off to four Posix threads (from a pool of stalled, pre-created threads). This gives an almost 4x speed increase on iOS and desktop Mac but slower results than single-threading under Android.
So the same code is used successfully to speed up the app on iOS or desktop Mac but in Android it often makes it even slower.
I have done some tests on it and only quite big junks of data speed up when using multi threading. If the whole process (all threads) takes around 2 seconds or more it will speed up in multi threading mode but if it is less (say only takes about 400ms) it will be either the same speed or slower than just calling the rendering function normally. Which could point to thread switching being really slow. The bigger the processing tasks, the more they profit from multithreading. My tasks are usually not as big, but not fast enough in single threading mode.
I have also noticed that on ARM builds the speed difference between slower multi threading and the faster single threading is quite significant (almost twice as fast in multi threading rather than single threading) whereas on x86 builds the multi and single threaded versions will run at about the same speed as single threading on ARM builds. So x86 builds do not get slower on multithreading but also not faster.
Has anyone else had the same behaviour or knows where the slowdown could come from? Are there any special requirements for multithreading on Android? Unfortunately I can't really post any code at the moment but it is all standard posix threading code which works fine on iOS and Mac in general and has been in use for years.
Android vendors aggressively optimize for battery life which includes keeping number of cores (hot-plugged) and their individual (if possible) frequency low.
Generic idea for managing number of cores online is to keep an eye on system load for a period of time (window). If load persists and is above a threshold, system will bring necessary additional available cores online. Such decision taking afaik always happens via a user-level daemon. This approach is generally very different from desktops since being able to bring cores online/offline and benefit of it is mostly SoC dependent.
Managing cpu frequency is also similar, if load persists cpu freq is increased but there is a more settled mechanism for this provided by Linux called cpu-freq and due to that it is similar between desktop and mobile.
So it is very possible that you are creating a cpu load pattern that's not triggering core bring up or freq increase. (as you also describe within your description)

Make a program run slowly

Is there any way to run a C++ program slower by changing any OS parameters in Linux? In this way I would like to simulate what will happen if that particular program happens to run on a real slower machine.
In other words, a faster machine should behave as a slower machine to that particular program.
Lower the priority using nice (and/or renice). You can also do it programmatically using nice() system call. This will not slow down the execution speed per se, but will make Linux scheduler allocate less (and possibly shorter) execution time frames, preempt more often, etc. See Process Scheduling (Chapter 10) of Understanding the Linux Kernel for more details on scheduling.
You may want to increase the timer interrupt frequency to put more load on the kernel, which will in turn slow everything down. This requires a kernel rebuild.
You can use CPU Frequency Scaling mechanism (requires kernel module) and control (slow down, speed up) the CPU using the cpufreq-set command.
Another possibility is to call sched_yield(), which will yield quantum to other processes, in performance critical parts of your program (requires code change).
You can hook common functions like malloc(), free(), clock_gettime() etc. using LD_PRELOAD, and do some silly stuff like burn a few million CPU cycles with rep; hop;, insert memory barriers etc. This will slow down the program for sure. (See this answer for an example of how to do some of this stuff).
As #Bill mentioned, you can always run Linux in a virtualization software which allows you to limit the amount of allocated CPU resources, memory, etc.
If you really want your program to be slow, run it under Valgrind (may also help you find some problems in your application like memory leaks, bad memory references, etc).
Some slowness can be achieved by recompiling your binary with disabled optimizations (i.e. -O0 and enable assertions (i.e. -DDEBUG).
You can always buy an old PC or a cheap netbook (like One Laptop Per Child, and don't forget to donate it to a child once you are done testing) with a slow CPU and run your program.
Hope it helps.
QEMU is a CPU emulator for Linux. Debian has packages for it (I imagine most distros will). You can run a program in an emulator and most of them should support slowing things down. For instance, Miroslav Novak has patches to slow down QEMU.
Alternatively, you could cross compile to another CPU-linux (arm-none-gnueabi-linux, etc) and then have QEMU translate that code to run.
The nice suggestion is simple and may work if you combine it with another process which will consume cpu.
nice -19 test &
while [ 1 ] ; do sha1sum /boot/vmlinuz*; done;
You did not say if you need graphics, file and/or network I/O? Do you know something about the class of error you are looking for? Is it a race condition, or does the code just perform poorly at a customer site?
Edit: You can also use signals like STOP and CONT to start and stop your program. A debugger can also do this. The issue is that the code runs a full speed and then gets stopped. Most solutions with the Linux scheduler will have this issue. There was some sort of thread analyzer from Intel afair. I see Vtune Release Notes. This is Vtune, but I was pretty sure there is another tool to analyze thread races. See: Intel Thread Checker, which can check for some thread race conditions. But we don't know if the app is multi-threaded?
Use cpulimit:
Cpulimit is a tool which limits the CPU usage of a process (expressed in percentage, not in CPU time). It is useful to control batch jobs, when you don't want them to eat too many CPU cycles. The goal is prevent a process from running for more than a specified time ratio. It does not change the nice value or other scheduling priority settings, but the real CPU usage. Also, it is able to adapt itself to the overall system load, dynamically and quickly.
The control of the used cpu amount is done sending SIGSTOP and SIGCONT POSIX signals to processes.
All the children processes and threads of the specified process will share the same percent of CPU.
It's in the Ubuntu repos. Just
apt-get install cpulimit
Here're some examples on how to use it on an already-running program:
Limit the process 'bigloop' by executable name to 40% CPU:
cpulimit --exe bigloop --limit 40
cpulimit --exe /usr/local/bin/bigloop --limit 40
Limit a process by PID to 55% CPU:
cpulimit --pid 2960 --limit 55
Get an old computer
VPS hosting packages tend to run slowly, have lots of interruptions, and wildly varying latencies. The cheaper you go the worse the hardware will be. Unlike truly old hardware, there is a good chance they will contain instruction sets (SSE4) that are not usually found on old hardware. Neverthless, if you want a system that walks slowly and shutters often, a cheap VPS host will be the quickest start.
If you just want to simulate your program to analyze its behavior on really slow machine, you can try making your whole program to run as a thread of some other main program.
In this manner you can prioritize same code with different priorities in few threads at once and collect data of your analysis. I have used this in game development for frame-processing analysis.
Use sleep or wait inside of your code. Its not the brightest way to do but acceptable in all kind of computer with different speeds.
The simplest possible way to do it would be to wrap your main runable code in a while loop with a sleep at the end of it.
For example:
void main()
{
while 1
{
// Logic
// ...
usleep(microseconds_to_sleep)
}
}
As people will mention, this isn't the most accurate way, since your logic code will still run at normal speed but with delays between runs. Also, it assumes that your logic code is something that runs in a loop.
But it is both simple and configurable.
You can increase load on CPUs via launching tools like stress or stress-ng

Schedule task on precise periods in Linux or Windows

I have this weird question.
I would like to know if it is possible to make a program in C/C++ that will run on Linux or Windows and will hook interrupt handler on a system timer set to specific period (2000 times per second, for example) and I want this interrupt to be with highest priority, meaning that it has to be executed every half millisecond and while executing it must not be interrupted.
This we have done with MS-DOS with Borland Turbo C 3.1. We have an interface card (our own) that runs on ISA slot. Every half millisecond, our program reads the state of electronics that is controlling an industrial process thru the interface. This has worked for us in the past 15 years, but we are running out of motherboards that have ISA slot, so we are looking for new solutions.
We also have solution based on PIC microcontrollers, but our horizons will be widened with general purpose processor.
My guess is that there are some customized Linux kernels for embedded applications, so I am looking for some sources with which we can start experimenting.
Yes, you can do that in MS-DOS because it is not a multi-user or multi-tasking operating system. However, the same thing will not work in Windows because it is a mult-user and multi-tasking operating system. It's also not real-time, which means there's no guarantee that your task will be executed exactly when you ask for it to be executed. Everything is pre-emptively scheduled, meaning that any number of other processes and tasks (either user-mode or system-level) could effectively "bump" your process down the priority list and force it to wait to be executed until those other tasks completed or were themselves interrupted to give your process a chance to run for a while.
I don't know about Linux, but I imagine most of the major distributions are written similarly to Windows.
You will need to find a real-time, single-user operating system to do this. A Unix-derivative is probably the best place to start looking, but I won't be the person able to suggest one.
Alternatively, you could continue using MS-DOS (or alternatives such as FreeDOS), but switch to a different interface technology that is available on newer boards. There's no reason to update something that works for you, especially if the updates are counter-productive to your goal.
A typical OS such as a standard Linux or Windows is not designed to, and will not be able to perform to that degree of real-time accuracy and availability.
It sounds to me like you need to be investigating Real-Time Linux, or similar.
RTLinux is a modified version of the Linux Kernel which is designed to perform in real-time, precicely for applications such as this.
Hope that helps.
Personal and affordable computing has increased in performance incredibly over the years, except in one area, low latency. Latency has actually increased in many use cases when you compare a 486 and a modern desktop CPU.
That said, have a look at this paper, where the authors come to the conclusion that sub-millisecond scheduling is possible in Linux on commodity hardware.

C++: Limiting CPU usage intentionally

At my company, we often test the performance of our USB and FireWire devices under CPU strain.
There is a test code we run that loads the CPU, and it is often used in really simple informal tests to see what happens to our device's performance.
I took a look at the code for this, and its a simple loop that increments a counter and does a calculation based on the new value, storing this result in another variable.
Running a single instance will use 1/X of the CPU, where X is the number of cores.
So, for instance, if we're on a 8-core PC and we want to see how our device runs under 50% CPU usage, we can open four instances of this at once, and so forth...
I'm wondering:
What decides how much of the CPU gets used up? does it just run everything as fast as it can on a single thread in a single threaded application?
Is there a way to voluntarily limit the maximum CPU usage your program can use? I can think of some "sloppy" ways (add sleep commands or something), but is there a way to limit to say, some specified percent of available CPU or something?
CPU quotas on Windows 7 and on Linux.
Also on QNX (i.e. Blackberry Tablet OS) and LynuxWorks
In case of broken links, the articles are named:
Windows -- "CPU rate limits in Windows Server 2008 R2 and Windows 7"
Linux -- "CPU Usage Limiter for Linux"
QNX -- "Adaptive Partitioning"
LynuxWorks - "Partitioning Operating Systems" and "ARINC 653"
The OS usually decides how to schedule processes and on which CPUs they should run. It basically keeps a ready queue for processes which are ready to run (not marked for termination and not blocked waiting for some I/O, event etc.). Whenever a process used up its timeslice or blocks it basically frees a processing core and the OS selects another process to run. Now if you have a process which is always ready to run and never blocks then this process essentially runs whenever it can thus pushing the utilization of a processing unit to a 100%. Of course this is a bit simplified description (there are things like process priorities for example).
There is usually no generic way to achieve this. The OS you are using might offer some mechanism to do this (some kind of CPU quota). You could try and measure how much time has passed vs. how much cpu time your process used up and then put your process to sleep for certain periods to achieve an approximation of desired CPU utilization.
You've essentially answered your own questions!
The key trait of code that burns a lot of CPU is that it never does anything that blocks (e.g. waiting for network or file I/O), and never voluntarily yields its time slice (e.g. sleep(), etc.).
The other trick is that the code must do something that the compiler cannot optimize away. So, most likely your CPU burn code outputs something based on the loop calculation at the end, or is simply compiled without optimization so that the optimizer isn't tempted to remove the useless loop. Since you're trying to load the CPU, there's no sense in optimizing anyways.
As you hypothesized, single threaded code that matches this description will saturate a CPU core unless the OS has more of these processes than it has cores to run them--then it will round-robin schedule them and the utilization of each will be some fraction of 100%.
The issue isn't how much time the CPU spends idle, but rather how long it takes for your code to start executing. Who cares if it's idle or doing low-priority busywork, as long as the latency is low?
Your problem is fundamentally a consequence of using a synthetic benchmark, presumably in an attempt to obtain reproducible results. But synthetic benchmarks tend to produce meaningless results, so reproducibility is moot.
Look at your bug database, find actual customer complaints, and use actual software and test hardware to reproduce a situation that actually made someone dissatisfied. Develop the performance test in parallel with hard, meaningful performance requirements.