When I run graphics benchmark performance tests (C++), the application is sometimes a little faster or slower from run to run. This seems to be related to the current operating system state (caches, memory usage) and the state of the graphics hardware.
I am using Windows 7. Is there a guideline for how to get a stable, consistent environment for benchmark performance testing?
There are many ways to do that. What I tend to do for my testing is use WAIK (the Windows Automated Installation Kit, available free of charge from Microsoft) to deploy a minimal Windows 7 system on a separate workstation.
Then the following configuration items need to be considered/changed (try not to deviate too much from a typical user machine, though, otherwise your benchmark will not be representative):
Set Paging File to static 2x RAM
Disable Automatic Updates
Disable Drive Indexing
These settings give a reasonably optimal environment for testing that is still attainable by enthusiasts, and thus representative of a power user (I do use Automatic Updates and Drive Indexing myself, but I schedule both for when I'm away or asleep).
As for caches and memory usage: at least in Windows 7 Professional you can script remote startup, so, for instance, I have a script run my benchmark overnight (for large regression tests), restarting the OS after each run. Alternatively, I run the same benchmark 5-10 times without rebooting to see whether cache usage changes the results.
Finally, there are bootloader switches to control the number of processors and the amount of available RAM. My test machine is an AMD Phenom X6 with 16GB of RAM, but we need to test how performance changes with the number of cores (some users have single-core systems, others multi-core) and with the amount of RAM (from 1 to 16GB).
This is usually done prior to a checkpoint release, to see whether the recommended or minimum system requirements need to be adjusted because of the extra features and additional optimization added since the previous release.
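If it helps, here is a rough sketch of the kind of repeat-run measurement I mean; reporting min/median/max over several runs makes environment-induced jitter visible at a glance. `run_benchmark()` is just a placeholder for the real workload, not anything from the question:

```cpp
// Sketch: time an arbitrary workload several times so run-to-run variation
// (caches, background activity) becomes visible. run_benchmark() is a
// placeholder for whatever the real graphics benchmark does.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

static void run_benchmark() {
    // Placeholder workload; replace with the real benchmark.
    volatile double sink = 0.0;
    for (int i = 0; i < 10000000; ++i) sink += i * 0.5;
}

int main() {
    const int runs = 10;
    std::vector<double> times_ms;
    for (int i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        run_benchmark();
        auto stop = std::chrono::steady_clock::now();
        times_ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    std::sort(times_ms.begin(), times_ms.end());
    std::printf("min %.2f ms, median %.2f ms, max %.2f ms\n",
                times_ms.front(), times_ms[runs / 2], times_ms.back());
    return 0;
}
```

If the spread between min and max stays small across reboots, the environment is stable enough for meaningful comparisons.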
Related
I'm working on a project that has thousands of .cpp files plus thousands more .h and .hpp files, and the build takes 28 minutes running from an SSD.
We inherited this project from a different company just weeks ago, but perusing the makefiles we found that they explicitly disabled parallel builds via the .NOTPARALLEL special target; we're trying to find out whether they had a good reason.
Worst case, the only way to speed this up is to use a RAM drive.
So I followed the instructions from Tekrevue, installed ImDisk, and then ran benchmarks using CrystalDiskMark:
(CrystalDiskMark screenshots omitted: SSD and RAM drive results)
I also ran dd using Cygwin and there's a significant speedup (at least 3x) on the RAM drive compared to my SSD.
However, my build time didn't change by even a minute!
So then I thought: maybe my proprietary compiler calls some Windows API and causes a huge slowdown, so as a comparison I built fftw from source on Cygwin.
What I expected was that processor usage would climb to some maximum and stay there for the duration of the build. Instead, usage was very spiky: one spike for each file compiled. I understand that even Cygwin still has to interact with Windows, so the fact that I still got spiky processor usage makes me think my compiler isn't the issue.
OK, new theory: invoking the compiler once per source file has some huge overhead on Windows. So I copy-pasted from my build log, passed 45 files to my compiler in a single invocation, and compared that to invoking the compiler 45 times separately. Invoking it once was faster, but only by about 4 seconds total for the 45 files.
And I saw the same "spiky" processor usage as when invoking the compiler once per file.
Why can't I get the compiler to run faster, even when running from a RAM drive? What's the overhead?
UPDATE #1
Commenters have been saying, I think, that the RAM drive is somewhat unnecessary because Windows will cache the input and output files in RAM anyway.
Also, the RAM drive implementation (i.e. its drivers) may be sub-optimal.
So I'm not using the RAM drive anymore.
People have also said that I should run the 45-file build multiple times to factor out caching overhead: I ran it 4 times and each run took 52 seconds.
(Screenshots omitted: CPU usage, captured 5 seconds before compilation ended, and virtual memory usage)
When the compiler writes its output to disk, it's actually cached in RAM first, right?
Well then this screenshot indicates that I/O is not an issue, or rather, that it's as fast as my RAM.
Question:
So since everything is in RAM, why isn't the CPU % higher more of the time?
Is there anything I can do to make a single-threaded/single-job build go faster?
(Remember this is single-threaded build for now)
UPDATE #2
It was suggested below that I set the affinity of my compile-45-files invocation to a single core, so that Windows wouldn't bounce the process between cores.
The result:
100% single-core usage, for the same 52 seconds!
So it wasn't the hard drive, the RAM or the cache: the CPU is the bottleneck.
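For reference, the pinning can be done either from a command prompt (start /affinity takes a hex mask) or programmatically. A rough Win32 sketch, not the exact invocation used here, just an illustration:

```cpp
// Sketch: pin the current process to logical processor 0 using the Win32 API,
// so the scheduler cannot migrate it between cores during a measurement.
#include <windows.h>
#include <cstdio>

int main() {
    DWORD_PTR mask = 1; // bit 0 set -> logical processor 0 only
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        std::fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    // ... run the workload to be measured here ...
    return 0;
}
```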
**THANK YOU ALL!** for your help
========================================================================
My machine: Intel i7-4710MQ @ 2.5GHz, 16GB RAM
I don't see why you are blaming the operating system so much. Besides sequential, dumb I/O to load sources and save intermediate output (which should be ruled out by seeing that the SSD and the RAM disk perform the same) and process startup (ruled out by compiling a single giant file), there's very little interaction between the compiler and the operating system.
Now, once you have ruled out "disk" and processor, I expect the bottleneck to be memory speed - not for the RAM-disk I/O part (which was probably already mostly saturated by the SSD), but for the compilation process itself.
That's actually quite a common problem: these days processors are usually faster than memory, which is often the bottleneck (that's why it's currently critical to write cache-friendly code). The processor is probably wasting a significant amount of time waiting for out-of-cache data to be fetched from main memory.
Anyway, this is all speculation. If you want a reliable answer, as usual, you have to profile. Grab a sampling profiler from a list like this and see where the compiler is spending its time. Personally, I expect to see a healthy dose of cache misses (or even page faults if you gave too much RAM to the RAM disk), but anything is possible.
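As a toy illustration of what "cache-friendly" means here (this is not the compiler's code, just a made-up example): summing a large matrix row by row walks memory sequentially and stays in cache, while summing it column by column forces a cache miss on nearly every access, and is typically several times slower for the exact same arithmetic.

```cpp
// Toy illustration of cache effects: row-by-row (sequential, cache-friendly)
// vs column-by-column (strided, cache-hostile) traversal of the same matrix.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4096;
    std::vector<int> m(static_cast<size_t>(n) * n, 1);

    auto time_sum = [&](bool by_rows) {
        auto start = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                sum += by_rows ? m[static_cast<size_t>(i) * n + j]
                               : m[static_cast<size_t>(j) * n + i];
        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - start).count();
        std::printf("%s: sum=%lld, %.1f ms\n", by_rows ? "rows" : "cols", sum, ms);
    };

    time_sum(true);   // sequential traversal, usually much faster...
    time_sum(false);  // ...than this strided traversal
    return 0;
}
```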
Reading your source code from the drive is a very, very small part of the overhead of compiling software. Your CPU speed is far more relevant; parsing and generating binaries are the slowest parts of the process.
**Update**
Your graphs show a very busy CPU; I am not sure what you expect to see. Unless the build is multithreaded AND your kernel stops scheduling other, less intensive threads, this is certainly the graph of a busy processor.
Your trace is showing 23% CPU usage. Your CPU has 4 actual cores (with hyperthreading to make it look like 8). So, you're using exactly one core to its absolute maximum (plus or minus 2%, which is probably better accuracy than you can really expect).
The obvious conclusion from this would be that your build process is CPU bound, so improving your disk speed is unlikely to make much difference.
If you want substantially faster builds, you need to either figure out what's wrong with your current makefiles or else write entirely new ones without the problems, so you can support both partial and parallel builds.
That can gain you a lot. Essentially anything else you do (speeding up disks, overclocking the CPU, etc.) will give minor gains at best (maybe 20% if you're really lucky), whereas a proper build environment will probably give at least a 20:1 improvement for most typical builds.
I'm writing a C++ application using Visual Studio 2013. The application iterates through an image doing some complicated analysis. To test code efficiency I am running the analysis (say) 100 times and seeing how long it takes. Then I modify the code, re-run the test and see if there is an improvement (or degradation) in performance.
The problem is that while I have a powerful 4-core i5 (an i5-4200U @ 1.6 GHz, to be specific) and plenty of RAM, overall CPU utilisation never exceeds about 30%; my process never seems to get beyond about 29.5%. I've tried setting my application's priority class to "High" (using SetPriorityClass) and this doesn't help. There is zero disk and network access; everything is in memory (with about 5GB of memory to spare).
Is this some secret Windows 8.1 setting (to preserve performance)? Can I change this programmatically or through some Control Panel applet?
Well, how do you expect your application to use 100% of the CPU when it is (most likely) only running on one core, because you aren't using threads?
30% is slightly above the usage of one core (25%), so it is almost certain you aren't using threads here.
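A rough sketch of what using threads could look like, assuming the 100 analysis iterations are independent of each other; analyze_range() is a hypothetical stand-in for the actual image analysis, not code from the question:

```cpp
// Sketch: split N independent analysis iterations across one worker thread
// per hardware thread. analyze_range() is a placeholder for the real work.
#include <algorithm>
#include <thread>
#include <vector>

static void analyze_range(int begin, int end) {
    for (int i = begin; i < end; ++i) {
        // ... run one iteration of the analysis ...
    }
}

int main() {
    const int iterations = 100;
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    const int chunk =
        (iterations + static_cast<int>(workers) - 1) / static_cast<int>(workers);

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        int begin = static_cast<int>(w) * chunk;
        int end = std::min(iterations, begin + chunk);
        if (begin >= end) break;
        pool.emplace_back(analyze_range, begin, end);
    }
    for (auto& t : pool) t.join();  // CPU usage should now approach 100%
    return 0;
}
```

This only helps if the iterations don't share mutable state; otherwise the work has to be restructured first.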
I have written C++ code which I have to run on many low-spec computers, while my own PC has a very high specification. I am using Ubuntu 10.04, and I have set hard limits on some resources, i.e. on memory and virtual memory. Now my questions are:
1) How do I set a limit on the cache size and cache line size?
2) What other limits should I set to check whether my code is OK?
I am using these commands:
ulimit -H -m 1000000
ulimit -H -v 500000
You can't limit cache size, that's a (mostly transparent) hardware feature.
The good news is this shouldn't matter, since you can't run out of cache - it just spills and your program runs more slowly.
If your concern is avoiding spills, you could investigate valgrind --tool=cachegrind - it may be possible to examine the likely behaviour of the cache on your target hardware.
To PROPERLY simulate running on low-end machines (although not with low cache limits), you can run the code in a virtual machine rather than on your machine's real hardware. This shows you what happens on a machine with little memory far better than limiting with ulimit does, because ulimit only limits what YOUR application gets. So it shows that your application doesn't run out of memory when running a particular set of tests, but it doesn't show how the application and the system behave together when there isn't a huge amount of memory in the first place.
A machine with a small amount of physical memory behaves quite differently when it comes to, for example, swapping behaviour and filesystem caching, to mention just a couple of the things that differ between "large memory, but the application is limited" and "small memory in the first place".
I'm not sure whether Ubuntu comes with any flavour of virtual machine setup, but VirtualBox, for example, is pretty easy to configure and set up on any Linux/Windows machine, as long as you have a modern enough processor with hardware virtualization support.
As Useless not at all uselessly stated, cache memory will not "run out" or in any other way cause a failure. Things just run a little slower, but not massively so (about 10x for any given operation, and that is averaged out over a large number of other instructions in most cases, unless you are working really hard at proving how important cache is, for example with very large matrix multiplications).
One more tip: look around for some old hardware. Computers several years old are often for sale for next to nothing at a "computer recycling shop" or similar. Set such a system up, install your OS of choice, and see what happens.
I have a system (Linux & C++) doing intensive signal/image processing operations. I would like to use PGO to improve performance of our application.
Are there any risks / potential issues I should be aware of when using PGO?
Are unit tests + E2E tests enough to verify that PGO didn't break anything?
Microsoft has a system that modifies conditional jumps based on usage statistics, and it also condenses frequently used pieces of code into a smaller number of pages. This compacts the effective memory footprint several times over and reduces CPU consumption by 20-50%.
This system has been used extensively in both user mode and kernel mode, and its quality is very high: in 100% of cases it did its job correctly. I do not see even minor downsides.
It might be that some other, similar system is less reliable, but the one from Microsoft was extremely good.
At our company we have unit tests.
We are thinking of writing automated performance tests that will also be part of the test suite, so that both developers and the automated build will run them. Each test will do something and then fail if it took longer than some pre-estimated time.
The problem is that different computers have different CPU speeds, and processes running in the background can also slow down execution. So how should we go about these tests?
One strategy is to design your performance thresholds for the best machine the code will run on; as long as it runs fast enough on worse machines, you're guaranteed better performance in production. Basically, include a fudge factor, knowing that the code will have to run on slower machines, presumably during testing/development.
Another strategy is to do some benchmarking during your test setup and use that time as your "unit of time" instead of seconds. For example, calculate the 20th Fibonacci number using the dog-slow recursive algorithm, and then require that all tests run within 10 "20-fibs". The wall-clock time will be longer on slow machines, but you get a machine-independent metric for how well the code is running.
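A rough sketch of that calibration idea; the fib() baseline and the factor of 10 are just placeholders for whatever baseline and budget you actually pick:

```cpp
// Sketch of a "machine unit" calibration: time the deliberately slow
// recursive fib(20), then express test budgets in multiples of that unit.
#include <chrono>
#include <cstdio>

static long fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

static double one_fib20_ms() {
    auto start = std::chrono::steady_clock::now();
    volatile long result = fib(20);  // volatile: keep the call from being optimized out
    (void)result;
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

int main() {
    // In practice, average several calls (or use a larger n) for a stable unit.
    const double unit_ms = one_fib20_ms();    // the machine-dependent "20-fib"
    const double budget_ms = 10.0 * unit_ms;  // "must finish within 10 20-fibs"
    std::printf("1 unit = %.3f ms, test budget = %.3f ms\n", unit_ms, budget_ms);
    // A test would then compare its measured wall-clock time against budget_ms.
    return 0;
}
```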
Processes running in the background are harder. Obviously you usually don't want other things interfering with your tests, so one strategy is to eliminate that as much as possible: regular developers can kill some processes and re-run if there's a failure, and your continuous integration box should be kept relatively clear.
If that doesn't work, or isn't good enough, you could try the opposite approach: run a bunch of CPU/IO-intensive processes at the same time as your tests to mimic an overloaded system; if the tests pass in that environment, performance should be fine on a normal system.
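If you take that route, the background load itself can be generated reproducibly. A trivial sketch that just burns every core for a fixed time while the tests run, standing in for a dedicated load-generation tool:

```cpp
// Sketch of a reproducible CPU-load generator: spin one busy thread per
// hardware thread for a fixed number of seconds while the tests execute.
#include <algorithm>
#include <chrono>
#include <thread>
#include <vector>

int main() {
    using steady = std::chrono::steady_clock;
    const auto deadline = steady::now() + std::chrono::seconds(60);
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::thread> burners;
    for (unsigned i = 0; i < n; ++i)
        burners.emplace_back([deadline] {
            volatile unsigned long long x = 0;
            while (steady::now() < deadline) ++x;  // pure busy work
        });
    for (auto& t : burners) t.join();
    return 0;
}
```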
Depending on your program's limiting resource (I/O, CPU, memory), you can get good results by measuring the CPU time used and relating it to the system's speed. For example, the performance tests for my current program obtain the CPU time spent using time and read the CPU speed from /proc/cpuinfo to compute the number of cycles spent on a computation.
This approach has two caveats: firstly, it does not measure the degree of parallelism achieved, and secondly, it does not measure external performance factors like I/O usage.
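For illustration, a minimal sketch of the CPU-time-versus-wall-time part from inside a C++ test; on POSIX systems std::clock() reports CPU time consumed by the process, and work() is just a placeholder:

```cpp
// Sketch: measure CPU time consumed by a piece of work, independent of how
// long the wall clock says it took (e.g. while other processes were running).
#include <chrono>
#include <cstdio>
#include <ctime>

static void work() {
    volatile double sink = 0.0;
    for (int i = 0; i < 50000000; ++i) sink += i * 1e-9;
}

int main() {
    std::clock_t cpu0 = std::clock();
    auto wall0 = std::chrono::steady_clock::now();
    work();
    double cpu_s = double(std::clock() - cpu0) / CLOCKS_PER_SEC;
    double wall_s = std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - wall0).count();
    // Dividing cpu_s by the clock speed from /proc/cpuinfo gives approximate cycles.
    std::printf("CPU time: %.3f s, wall time: %.3f s\n", cpu_s, wall_s);
    return 0;
}
```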
If the idea is to understand how code changes affect performance and to ensure that performance is greater than or equal to previous builds, then you need to run the tests on a known hardware profile every time. The most accurate way to do this is to set up one or more machines that are used for this testing every single time the tests are executed. If many developers need to do this, sometimes simultaneously, it may be worthwhile to create a VM image that they can spin up and point the tests at.
You should not run these on the developers' own boxes, because, as you mentioned, all kinds of factors could affect the outcome of the tests there.
You should avoid trying to measure performance while under load or strain from outside the system being tested (low disk space, network bandwidth, memory, CPU, etc.), unless those conditions are specifically set up as part of the test case. For instance, you could have three different test runs: one while the machine is under no load, another under medium load (simulating other programs running in the background), and another under high load.
You can also run tests on various hardware profiles as part of your other stress/performance tests, but you probably won't get much value from running those against every build. Again, if you want, you could do a few different test runs against different hardware profiles; this requires more setup, though, since you would need additional machines and/or VM images, plus the infrastructure to kick off the tests against these machines, gather the results and report on them.
+1 for Sam's response. I've done this a number of times in the past and it's critical to lock down your performance test environment and ensure you're minimizing any potential flux.
Running the tests on devs' systems may be a useful flag for individual devs, but having a central system to run the tests on is critical. One caveat about doing this in VMs: ensure you understand the load on the VM host system because load there can impact performance in the hosted VMs.
I've had the best, most consistent and useful results when I ran these sorts of suites during a nightly smoke check build.
It is also a question of the tolerances (or acceptable capacity ranges) that make your tests valid. Ideally, as has been stated, you need a predictable, stable and consistent setup for any useful comparison. That said, if you understand the basic operational ranges of the SUT (CPU available, memory available, etc.), then early developer testing can be done on a mix of systems and conditions that fall within the known resource tolerances.