Maximum Windows 8.1 CPU usage <= 30% - c++

I'm writing a C++ application using Visual Studio 2013. The application iterates through an image doing some complicated analysis. To test code efficiency I am running the analysis (say) 100 times and seeing how long it takes. Then I modify the code, re-run the test and see if there is an improvement (or degradation) in performance.
The problem is that while I have a powerful 4-core i5 (an i5-4200U @ 1.6 GHz, to be specific) and plenty of RAM, the overall CPU utilisation never exceeds about 30%. My process never seems to get beyond about 29.5%. I've tried setting the priority class of my application to "High" (using SetProcessPriority) and this doesn't help. There is zero disk and network access; everything is in memory (with about 5GB of memory to spare).
Is this some secret Windows 8.1 setting (to preserve performance)? Can I change this programmatically or through some Control Panel applet?

Well how do you expect your application to use 100% cpu when it is (most likely) only running on one core because you aren't using threads?
30% is slightly above the usage for one core (25%) so it is almost certain you aren't using threads here.
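
To illustrate the point about threads, here is a minimal sketch of splitting an image analysis across worker threads. The names `analyse_row` and `analyse_image` are hypothetical, the per-row "analysis" is just a pixel sum standing in for the real work, and the sketch assumes each row can be processed independently, which may not hold for the actual algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-row analysis; a pixel sum stands in for the real processing.
std::uint64_t analyse_row(const std::vector<std::uint8_t>& image,
                          std::size_t width, std::size_t row) {
    std::uint64_t sum = 0;
    for (std::size_t x = 0; x < width; ++x) sum += image[row * width + x];
    return sum;
}

// Splits the rows into contiguous bands and gives each band to its own thread.
std::uint64_t analyse_image(const std::vector<std::uint8_t>& image,
                            std::size_t width, std::size_t height,
                            unsigned workers = std::thread::hardware_concurrency()) {
    assert(image.size() == width * height);
    if (workers == 0) workers = 1;  // hardware_concurrency() may return 0
    std::vector<std::uint64_t> partial(workers, 0);
    std::vector<std::thread> pool;
    const std::size_t band = (height + workers - 1) / workers;  // rows per worker, rounded up
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t first = std::min(height, w * band);
        const std::size_t last = std::min(height, first + band);
        pool.emplace_back([&, w, first, last] {
            std::uint64_t local = 0;  // accumulate locally, publish once at the end
            for (std::size_t row = first; row < last; ++row)
                local += analyse_row(image, width, row);
            partial[w] = local;
        });
    }
    for (auto& t : pool) t.join();
    std::uint64_t total = 0;
    for (auto p : partial) total += p;
    return total;
}
```

With one worker per logical core, a CPU-bound analysis like this should be able to push overall utilisation well past the single-core share, provided the rows really are independent.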

Related

does building from a RAM drive truly yield speed increase?

I'm working on a project that has thousands of .cpp files plus thousands more .h and .hpp and the build takes 28min running from an SSD.
We inherited this project from a different company just weeks ago but perusing the makefiles, they explicitly disabled parallel builds via the .NOPARALLEL phony target; we're trying to find out if they have a good reason.
Worst case, the only way to speed this up is to use a RAM drive.
So I followed the instructions from Tekrevue and installed Imdisk and then ran benchmarks using CrystalDiskMark:
[Screenshots: CrystalDiskMark results for the SSD and for the RAM drive]
I also ran dd using Cygwin and there's a significant speedup (at least 3x) on the RAM drive compared to my SSD.
However, my build time changes not one minute!
So then I thought: maybe my proprietary compiler calls some Windows API and causes a huge slowdown so I built fftw from source on Cygwin.
What I expected is that my processor usage would increase to some max and stay there for the duration of the build. Instead, my usage was very spiky: one for each file compiled. I understand even Cygwin still has to interact with windows so the fact that I still got spiky proc usage makes me assume that it's not my compiler that's the issue.
Ok. New theory: invoking compiler for each source-file has some huge overhead in Windows so, I copy-pasted from my build-log and passed 45 files to my compiler and compared it to invoking the compiler 45 times separately. Invoking ONCE was faster but only by 4 secs total for the 45 files.
And I saw the same "spiky" processor usage as when invoking compiler once for each file.
Why can't I get the compiler to run faster even when running from RAM drive? What's the overhead?
UPDATE #1
Commenters have been saying, I think, that the RAM drive is kind of unnecessary because Windows will cache the input and output files in RAM anyway.
Plus, maybe the RAM drive implementation (ie drivers) is sub-optimal.
So, I'm not using the RAM drive anymore.
Also, people have said that I should run the 45-file build multiple times so as to remove the overhead for caching: I ran it 4 times and each time it was 52secs.
[Screenshots: CPU usage (taken 5 secs before compilation ended) and virtual memory usage]
When the compiler spits out stuff to disk, it's actually cached in RAM first, right?
Well then this screenshot indicates that IO is not an issue or rather, it's as fast as my RAM.
Question:
So since everything is in RAM, why isn't the CPU % higher more of the time?
Is there anything I can do to make a single-threaded (single-job) build go faster?
(Remember this is single-threaded build for now)
UPDATE 2
It was suggested below that I should set the affinity of my compile-45-files invocation to 1, so that Windows won't bounce the invocation around between multiple cores.
The result:
100% single-core usage! for the same 52secs
So it wasn't the hard drive, RAM or the cache but CPU that's the bottleneck.
THANK YOU ALL for your help!
========================================================================
My machine: Intel i7-4710MQ @ 2.5GHz, 16GB RAM
I don't see why you are blaming the operating system so much. Besides sequential, dumb IO (loading sources and saving intermediate output, which should be ruled out by seeing that an SSD and a ramdisk perform the same) and process startup (ruled out by compiling a single giant file), there's very little interaction between the compiler and the operating system.
Now, once you ruled out "disk" and processor, I expect the bottleneck to be the memory speed - not for the RAM-disk IO part (which probably was already mostly saturated by the SSD), but for the compilation process itself.
That's actually quite a common problem: at present, processors are usually faster than memory, which is often the bottleneck (that's why it's currently critical to write cache-friendly code). The processor is probably wasting significant time waiting for out-of-cache data to be fetched from main memory.
Anyway, this is all speculation. If you want a reliable answer, as usual you have to profile. Grab some sampling profiler from a list like this and go see where the compiler is wasting time. Personally, I expect to see a healthy dose of cache misses (or even page faults if you burned too much RAM for the ramdisk), but anything is possible.
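
As a side note on what "cache-friendly" means in practice, here is a small sketch that is not taken from the compiler in question: two hypothetical functions that sum the same row-major matrix, one walking memory sequentially and one striding across it. Both return the same result; on large matrices the strided version typically incurs far more cache misses, though how much slower it is depends on the hardware and should be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major traversal: consecutive iterations touch adjacent addresses,
// so each cache line fetched from main memory is fully used.
long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long sum = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            sum += m[r * cols + c];
    return sum;
}

// Column-first traversal of the same row-major buffer: every step jumps
// `cols` elements ahead, so most accesses land on a different cache line.
long long sum_col_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long sum = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            sum += m[r * cols + c];
    return sum;
}
```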
Reading your source code from the drive is a very, very small part of the overhead of compiling software. Your CPU speed is far more relevant, as parsing and generating binaries are the slowest part of the process.
Update
Your graphs show a very busy CPU, I am not sure what you expect to see. Unless the build is multithreaded AND your kernel stops scheduling other, less intensive threads, this is certainly the graph of a busy processor.
Your trace is showing 23% CPU usage. Your CPU has 4 actual cores (with hyperthreading to make it look like 8). So, you're using exactly one core to its absolute maximum (plus or minus 2%, which is probably better accuracy than you can really expect).
The obvious conclusion from this would be that your build process is CPU bound, so improving your disk speed is unlikely to make much difference.
If you want substantially faster builds, you need to either figure out what's wrong with your current makefiles or else write entirely new ones without the problems, so you can support both partial and parallel builds.
That can gain you a lot. Essentially anything else you do (speeding up disks, overclocking the CPU, etc.) is going to give minor gains at best (maybe 20% if you're really lucky, where a proper build environment will probably give at least a 20:1 improvement for most typical builds).

Android NDK - Multithreading is slowing down rendering

I have an Android app with a C++ library which uses pthreads to break down rendering tasks. This is for devices running Android 4+.
Let's say I have a 100 x 100 array of elements on which I repeatedly do CPU-intensive processing. Currently I'm breaking the array up into four 25 x 100 element chunks and handing them off to four POSIX threads (from a pool of stalled, pre-created threads). This gives an almost 4x speed increase on iOS and desktop Mac but slower results than single-threading under Android.
So the same code is used successfully to speed up the app on iOS or desktop Mac but in Android it often makes it even slower.
I have done some tests on it, and only quite big chunks of data speed up when using multithreading. If the whole process (all threads) takes around 2 seconds or more, it will speed up in multithreading mode, but if it takes less (say only about 400ms), it will be either the same speed or slower than just calling the rendering function normally. This could point to thread switching being really slow. The bigger the processing tasks, the more they profit from multithreading. My tasks are usually not that big, but they are not fast enough in single-threaded mode.
I have also noticed that on ARM builds the speed difference between the slower multithreaded mode and the faster single-threaded mode is quite significant (single threading is almost twice as fast as multithreading), whereas on x86 builds the multi- and single-threaded versions both run at about the same speed as the single-threaded ARM build. So x86 builds do not get slower with multithreading, but they do not get faster either.
Has anyone else had the same behaviour or knows where the slowdown could come from? Are there any special requirements for multithreading on Android? Unfortunately I can't really post any code at the moment but it is all standard posix threading code which works fine on iOS and Mac in general and has been in use for years.
Android vendors aggressively optimize for battery life, which includes keeping the number of online cores (they are hot-plugged) and, where possible, their individual frequencies low.
The generic idea for managing the number of online cores is to keep an eye on system load over a period of time (a window). If the load persists and is above a threshold, the system brings the necessary additional cores online. As far as I know, this decision is always made by a user-level daemon. This approach is generally very different from desktops, since the ability to bring cores online/offline, and the benefit of doing so, is mostly SoC dependent.
Managing CPU frequency is similar: if the load persists, the CPU frequency is increased. There is, however, a more settled mechanism for this provided by Linux, called cpufreq, and because of that the behaviour is similar between desktop and mobile.
So it is very possible that you are creating a cpu load pattern that's not triggering core bring up or freq increase. (as you also describe within your description)
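
One way to check whether core hot-plugging is involved is to log how many cores are online while the renderer runs. The sketch below assumes a POSIX-like platform where `sysconf` supports `_SC_NPROCESSORS_CONF` and `_SC_NPROCESSORS_ONLN` (Linux and Android do); the `CoreCount` struct and both function names are hypothetical.

```cpp
#include <cassert>
#include <unistd.h>

// Snapshot of the core counts reported by the OS (either value is -1 on failure).
struct CoreCount {
    long configured;  // cores the system knows about
    long online;      // cores currently available for scheduling
};

// Queries the counts via sysconf; meant to be called repeatedly while the
// workload runs, to see whether extra cores are being brought online.
CoreCount query_cores() {
    CoreCount c;
    c.configured = sysconf(_SC_NPROCESSORS_CONF);
    c.online = sysconf(_SC_NPROCESSORS_ONLN);
    return c;
}

// Number of configured cores that are currently offline (never negative).
long offline_cores(const CoreCount& c) {
    assert(c.configured >= 1 && c.online >= 1);  // both queries must have succeeded
    return c.configured > c.online ? c.configured - c.online : 0;
}
```

If the online count stays below the thread count during short rendering bursts but rises during the 2-second workloads, that would be consistent with the explanation above.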

How to optimize large data manipulation in parallel

I'm developing a C/C++ application to manipulate large quantities of data in a generic way (aggregation/selection/transformation).
I'm using an AMD Phenom II X4 965 Black Edition, so I have a decent amount of cache at the different levels.
I've developed both ST and MT versions of the functions that perform all the single operations and, not surprisingly, in the best case the MT versions are 2x faster than the ST ones, even when using 4 cores.
Given that I'm a fan of using 100% of the available resources, I was annoyed to get just 2x; I want 4x.
For this reason I've already spent a considerable amount of time with -pg and valgrind, using the cache simulator and the call graph. The program is working as expected: the cores share the input process data (i.e. the operations to apply to the data), and cache misses are reported (as expected) when the different threads load the data to be processed (millions of entities, or rows, if that gives you an idea of what I'm trying to do :-) ).
Eventually I've used different compilers, g++ and clang++, with -O3 both, and performance is identical.
My conclusion is that due to the large amount of data (GB of data) to process, given the fact the data has got to be loaded eventually in the CPU, this is real wait time.
Can I further improve my software? Have I hit a limit?
I'm using C/C++ on Linux x86-64, Ubuntu 11.10.
I'm all ears! :-)
What kind of application is it? Could you show us some code?
As I commented, you might have reached some hardware limit like RAM bandwidth. If you did, no software trick could improve it.
You might investigate using MPI, OpenMP, or OpenCL (on GPUs) but without an idea of your application we cannot help.
If compiling with GCC and if you want to help the processor cache prefetching, consider using with care and parsimony __builtin_prefetch (but using it too much or badly would decrease performance).
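
For reference, a minimal sketch of what a careful use of `__builtin_prefetch` might look like. It assumes GCC or Clang, and an access pattern (gathering through an index array) that the hardware prefetcher cannot predict on its own; the function name `gather_sum` and the prefetch distance of 8 are hypothetical and would need tuning by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums data[index[i]] for every i, asking the CPU to start loading the
// element needed a few iterations from now. Every index must be < data.size().
long long gather_sum(const std::vector<int>& data,
                     const std::vector<std::size_t>& index) {
    constexpr std::size_t kAhead = 8;  // assumed prefetch distance, tune by profiling
    long long sum = 0;
    for (std::size_t i = 0; i < index.size(); ++i) {
        if (i + kAhead < index.size()) {
            // Arguments: address, 0 = prefetch for reading, 1 = low temporal locality.
            __builtin_prefetch(&data[index[i + kAhead]], 0, 1);
        }
        assert(index[i] < data.size());
        sum += data[index[i]];
    }
    return sum;
}
```

For a plain sequential scan the hardware prefetcher usually does this already, so an explicit prefetch there tends to add overhead rather than help.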

setting a c++ application to use maximum CPU usage, in the code

I developed a program in C++. When I run it on Windows XP it uses all the available CPU (100% usage), but when I run it on Windows 7 it can hardly make its way to 40%, even if I set the task to real-time or high priority in Task Manager. Is there a way, from my code, to force the OS to let my application use the maximum available CPU, as it did on Windows XP? I mean something like an API or a library.
This is more than likely due to you having more than one core. In order to use 100% of your CPU you may need to have multiple threads created.
If your app is using any kind of IO, and that IO is messed up in XP (bad driver and/or something else), that might be causing your app to spin the CPU entirely.
Windows 7 may be better optimized in such areas, so it frees the CPU until the slow (disk, network) work has completed.
What this thread is doing, and how often it spends time off the processor (Sleep, object waits), can also be a factor, but MK pretty much summed it up for you. You could also have a look here:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686277%28v=vs.85%29.aspx

How to prepare a constant benchmark environment

When I am doing graphics benchmark performance test (C++), I find the application is sometimes a little faster or slower. And this is related with current operating system status/caches/memory usage, and graphics hardware status.
I am using Win7. I am wondering if there is some guideline to tell me how to get a stable/constant environment for benchmark performance testing?
There are many ways to do that - what I tend to do for my testing, is using WAIK (Windows Automated Installation Kit, available free of charge from Microsoft), to deploy a minimal windows 7 system on a separate workstation.
Then, the following configuration items need to be considered/changed (try not to deviate too much from a typical user machine, though, otherwise your benchmark would not be constructive):
Set Paging File to static 2x RAM
Disable Automatic Updates
Disable Drive Indexing
These represent a reasonably optimal environment for testing, that is still attainable by enthusiasts, and thus can be representative of a Power-User (even if I use Automatic Updates and Drive Indexing, I schedule them both for when I'm away/sleeping)
As for caches and memory usages - at least in Win7 Professional, you can script remote startup - so for instance, I would have a script run my benchmark overnight (for large regression tests), restarting the OS after each run. Or I would run the same benchmark 5-10 times without rebooting, to see if cache usage changes.
Finally, there are bootloader switches to control the number of processors and the amount of available RAM - my test machine is an AMD Phenom X6 with 16GB of RAM, but we need to test how performance changes with the number of cores (some users would have single-core systems, and some would have multi-core systems), and with the amount of RAM (from 1-16GB).
This is usually done prior to a checkpoint release, to see whether the recommended or minimum requirements need to be adjusted because of the extra features and additional optimizations added since the previous one.
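
For the "run the same benchmark 5-10 times" part, a small timing harness can make the run-to-run variation visible. This is only a sketch under the assumption that wall-clock time of a callable is what you want to measure; `time_runs` and `median` are hypothetical helper names, not part of any benchmarking library.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `work` the given number of times and returns each run's wall-clock
// duration in milliseconds, measured with a monotonic clock.
template <typename Work>
std::vector<double> time_runs(Work&& work, std::size_t runs) {
    assert(runs > 0);
    std::vector<double> ms;
    ms.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}

// The median is less sensitive than the mean to a single run that was
// disturbed by background activity.
double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 ? samples[mid]
                              : (samples[mid - 1] + samples[mid]) / 2.0;
}
```

Comparing the first sample against the median of the rest gives a rough idea of how much warm caches matter for a given benchmark.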