How to get an accurate performance measure?

How to get an accurate performance measure? - c++

In our project we're trying to automatically monitor the performance of test runs, to make sure that we don't have any significant changes in the performance of the program over time.
The problem is that there seems to be a consistent 5% variability in the measures we get. That is, on the same machine with the same program (no recompilation) running the same test we get values that differ by around 5% from run to run. This is way too much for what we want to use the numbers for.
We're already excluding setup costs from the timing considerations - that is, from within C++ code itself we're grabbing the time immediately before and after running the time-critical portions, rather than doing the timing of the whole program on the OS level. We are also doing averaging and outlier exclusion. The problem is that the variability looks to also have long-term trends, so we get tight clustering of times for replicates right after each other, but an hour or two later the times are substantially different. (Unfortunately, spreading the test out over several hours is not feasible.) The tests are also being run on a dedicated machine while "nothing else" is being run on it.
We're not quite sure where the timing variation is coming from, but it may have to do with the processor and the system - there's indications that the size of the variability depends on what machine the program is running on.
Does anyone have an idea where this variation is likely to be coming from, and how to remove it? The tests are running on a dedicated machine, so changing the operating system settings would be possible.
(As indicated by the tags, this is a C++ program running on a x86 Linux system, if that helps clarify things.)
Edit: Response to comments
Our current timing scheme is to use the clock() function from the C standard library, looking at the difference in the return value from before/after the functions we want to test.
The code we're testing should be deterministic, and shouldn't involve heavy IO.
I realize that the situation is a little hazy for a "silver bullet" answer. I guess I'm more looking for a "these are the factors that are important to consider, this is the order you probably should check them in, and here's how you go about checking each of them" type answer.

I'm amazed you got down to 5% variation.
Unless you can get rid of all the unnecessary things running on your system, you will be getting high variation. This is at the top level.
You OS needs to be deterministic. You need to know what other tasks and threads are running and their durations. For example, there is the clock interrupt. Now, how many other functions are chained to this interrupt? Do these other functions vary?
Is your system isolated? For example, your measurements may vary if your system is connected to a network.
Does your program use external resources? For example a hard drive. If the program writes to the hard drive, the drive will not be deterministic. Files and parts of files may move on the drive. The drive may become fragmented. This fragmentation may cause variance in your measurements.
The operating system memory may get fragmented. Also, the executable's memory may become fragmented. Fragmentation may add to the variance.

Related

Are there tools/ methods to objectively measure performance?

I'm writing a high performance application (a raytracer) in C++ using Visual Studio, and I just spent two days trying to root out a performance drop I witnessed after refactoring the code. The reason it took so long was because the performance drop was smaller than the normal variation in execution time I witnessed from run to run.
Not sure if this is normal, but sometimes the program may run at around 33fps pretty consistently, then if you close and rerun, it may run at 37fps. This means that in order to test any new change, I had to manually run and rerun until I witnessed peak performance (And this could require up to like 10 runs). Simply running it for some large number of frames, and measuring the time doesn't fix this variability. For example, if the program runs for 40 seconds on average, it will nevertheless vary by over 1-2 seconds, which makes this test nearly useless for detecting the 1 millisecond per frame performance loss I was dealing with.
Visual Studio's profiling tools also didn't help find this small of an issue, because they also were subject to variation, and in any case, its not necessarily going to tell me the exact offending line, so I have to test solutions, and the profiler is not very effective at confirming a proposed solution's efficacy.
I realize this all may sound like premature optimization, but I don't think it is because I'm optimizing only after finishing complete features; I'm just trying to monitor changes in performance regularly so that issues like the above don't slip in and just get added to the apparent cost of the new feature.
Anyways, my question is simply whether there's a way to objectively determine the "real" speed of an application, discounting the effect of variation. Or, failing that, how do developers deal with such issues? I doubt that my current process is the ideal one.

There are lots of profilers for both c++ and openGL. For those who just need the links, here are they.
OpenGL debugger-profiler
C++ profilers but I recommend Google orbit because it has dark theme.
My eyes stopped at
Objectively measure performance
As you mentioned the speed varies from run to run because it's too complex system. It helps if the scope is small and it only tests some key algorithms. It worth to automatize and collect some reference data. As every scientist say one test is not a test, you should rely on regular tests with controlled environments.
And here comes some tricks that can be used to measure performance.
In the comments others said, an average based on several runs may help you. It softens the noise from the outside.
Process priority or processor affinity could help you control the environment. By giving low priority to other processes your program gains more resource.
Measuring the whole execution of a test and compare it against processor time. As several processes runs at the same time, processor time may differs from execution time.
Update your reference values if you do a software update. Perhaps one update comes with performance boost while other with security patch.
Give a performance range for your program instead of one specific number. Perhaps the temperature messed up your measurement and the clock speed was decreased.
If a test runs too fast to measure, execute the most critical part several times in a test case. Too fast depend on how accurate you can measure. On ms basis it's really hard to decide if a test executed in 2 ms instead of 1 ms is a failure or not. However, if executed 1000 times - 1033 ms compared to 1000 ms gives you better insight.
Only test what is the critical section. Set up the environment and start the stopwatch when everything is ready. The system startup could be another test.

Profiling a multiprocess system

I have a system that i need to profile.
It is comprised of tens of processes, mostly c++, some comprised of several threads, that communicate to the network and to one another though various system calls.
I know there are performance bottlenecks sometimes, but no one has put in the time/effort to check where they are: they may be in userspace code, inefficient use of syscalls, or something else.
What would be the best way to approach profiling a system like this?
I have thought of the following strategy:
Manually logging the roundtrip times of various code sequences (for example processing an incoming packet or a cli command) and seeing which process takes the largest time. After that, profiling that process, fixing the problem and repeating.
This method seems sorta hacky and guess-worky. I dont like it.
How would you suggest to approach this problem?
Are there tools that would help me out (multi-process profiler?)?
What im looking for is more of a strategy than just specific tools.
Should i profile every process separately and look for problems? if so how do i approach this?
Do i try and isolate the problematic processes and go from there? if so, how do i isolate them?
Are there other options?

I don't think there is a single answer to this sort of question. And every type of issue has it's own problems and solutions.
Generally, the first step is to figure out WHERE in the big system is the time spent. Is it CPU-bound or I/O-bound?
If the problem is CPU-bound, a system-wide profiling tool can be useful to determine where in the system the time is spent - the next question is of course whether that time is actually necessary or not, and no automated tool can tell the difference between a badly written piece of code that does a million completely useless processing steps, and one that does a matrix multiplication with a million elements very efficiently - it takes the same amount of CPU-time to do both, but one isn't actually achieving anything. However, knowing which program takes most of the time in a multiprogram system can be a good starting point for figuring out IF that code is well written, or can be improved.
If the system is I/O bound, such as network or disk I/O, then there are tools for analysing disk and network traffic that can help. But again, expecting the tool to point out what packet response or disk access time you should expect is a different matter - if you contact google to search for "kerflerp", or if you contact your local webserver that is a meter away, will have a dramatic impact on the time for a reasonable response.
There are lots of other issues - running two pieces of code in parallel that uses LOTS of memory can cause both to run slower than if they are run in sequence - because the high memory usage causes swapping, or because the OS isn't able to use spare memory for caching file-I/O, for example.
On the other hand, two or more simple processes that use very little memory will benefit quite a lot from running in parallel on a multiprocessor system.
Adding logging to your applications such that you can see WHERE it is spending time is another method that works reasonably well. Particularly if you KNOW what the use-case is where it takes time.
If you have a use-case where you know "this should take no more than X seconds", running regular pre- or post-commit test to check that the code is behaving as expected, and no-one added a lot of code to slow it down would also be a useful thing.

What strategies and practices are used, when running very intense and long calculations, to ensure that hardware isn't damaged?

I have many large Fortran programs to run at work. I have access to several desktop computers and the Fortran code runs over takes several consecutive days. It's essentially running the same master module many times (lets say N times) with different parameters, something akin to Monte Carlo on steroids. In that sense the code is parallelizable, however I don't have access to a cluster.
With the scientific computing community, what practices and strategies are used to minimise hardware damaged from heat? The machines of course have their own cooling system (fans and heat sinks), but even so running intense calculations non stop for half a week cannot be healthy for the life of the machines? Though maybe I'm over-thinking this?
I'm not aware of any intrinsic functions in Fortran that can pause the code to give components a break? Current I've written a small module that keeps an eye on system clock, with a do while loop that "wastes time" in between consecutive runs of the master module in order to discharge heat. Is this an acceptable way of doing this? The processor is, after all, still running a while loop.
Another way would be to use a shell scripts or a python code to import Fortran? Alternatively are there any intrinsic routines in the compile (gfortran) that could achieve this? What are the standard, effective and accepted practices for dealing with this?
Edit: I should mention that all machines run on Linux, specifically Ubuntu 12.04.

For MS-DOS application I would consider the following:
Reduce as much as possible I/O operations withHDD, that is, keep data in memory as much as you can,
or keep data on a RamDisk.A RamDisk driver is available on Microsoft's website.
Let me know if you won't be able to find and I look at my CD archives
-Try to use Extended Memory by using aDPMI driver
DPMI - DOS Protected Mode Interface
-Set CPU affinity for a second CPU
Boost a priority to High, butI wouldn't recommend toboost toReal-Time

I think you need a hardware solution here, not a software solution. You need to increase the rate of heat exchange in the computers (new fans, water cooling, etc) and in the room (turn the thermostat way down, get some fans running, etc).
To answer the post more directly, you can use the fortran SLEEP command to pause a computation for a given number of seconds. You could use some system calls in Fortran to set the argument on the fly. But I wouldn't recommend it - you might as well just run your simulations on fewer computers.
To keep the advantages of the multiple computers, you need better heat exchange.

As long as the hardware is adequately dissipating heat and components are not operating at or beyond their "safe" temperature limits, they * should be fine.
*Some video cards were known to run very hot; i.e. 65-105°C. Typically, electronic components have a maximum temperature rating of exactly this. Beyond it, reliability degrades very quickly. Even though the manufacturer made these cards this way, they ended up with a reputation for failing (i.e. older nVidia FX, Quadro series.)
*Ubuntu likely has a "Critical temperature reached" feature where the entire system will power off if it overheats, as explained here. Windows is "blissfully ignorant." :)
*Thermal stress (large, repeated temperature variations) may contribute to component failure of IC's, capacitors, and hard disks. Over three decades of computing has taught that adequate cooling and leaving the PC on 24/7 actually may save wear-and-tear in my experience. (A typical PC will cost around $200 USD/year in electricity, so it's more like a trade-off in terms of cost.)
*PC's must be cleaned twice a year (depending on airborne particulate constituency and concentration.) Compressed air is nice for removing dust. Dust traps heat and causes failures. Operate a shop-vac while "dusting" to prevent the dust from going everywhere. Wanna see a really dusty computer?
*The CPU should be "ok" with it's stock cooler. Check it's temperature at cold system boot-up, then again after running code for an hour or so. The fan is speed-controlled to limit temperature rise. CPU temperature rise shouldn't be much warmer than about 40°C and less would be better. But an aftermarket, better-performing CPU cooler never hurts, such as these. CPU's rarely fail unless there is a manufacturing flaw or they operate near or beyond their rated temperatures for too long, so as long as they stay cool, long calculations are fine. Typically, they stop functioning and/or reset the PC if too hot.
*Capacitors tend to fail very rapidly when overheated. It is a known issue that some cap vendors are "junk" and will fail prematurely, regardless of other factors. "Re-capping" is the art of fixing these components. For a full run-down on this topic, see badcaps.net. It used to be possible to re-cap a motherboard, but today's 12+ layer and ROHS (no lead) motherboards make it very difficult without specialty hot-air tools.

running time of two programs run separately and then together

I was recently asked this question in an interview, and while I did alright on the first two parts [I am assuming] I struggled a bit on the third. Here's the question:
You have two Linux programs, A and B. When run separately, A and B each take one minute to complete on a system that has just been restarted. [ie: fresh system: you reboot it, log in, get a shell prompt, run the program.]
What can you tell me about the programs if:
a) when run together, they take 2 minutes
b) when run together, they take 1 minute
c) when run together, they take 30 seconds
I said for a) that if they take exactly double the time when run together, they share no mutual exclusion and are vying for all the same resources, probably don't share any sort of cache data or instructions [and thus don't help each other out from a cache perspective] and each program needs the full utilization of said resource to complete such that the OS cannot parallelize them.
For b), I said that if they can run just as fast together, they probably share some spacial/temporal locality in the cash, and may lend themselves to being properly pipelined in such a way that while program A is waiting on something, program B can run in between those stages, and vice versa-- effectively running them both in 1 minute.
For c), I was a bit stuck. In retrospect, I probably should have said that perhaps program A and B were both doing a common task, where two of them running at once could complete said task faster than one running alone-- such as a garbage collector. But the best that I could come up with was that perhaps they loaded out of the same sector on the hard disk, and that helped them both together run quickly.
I am just looking for some input from some of the smarties here on things I probably missed. The position was for a platforms/systems position that require a good understanding of hardware/software and operating systems, and namely interactions between them which is why [I'm assuming] the question was asked.
I was also trying to think of examples that I could apply to each part to help show my knowledge of the questions real life applications, but on the spot I was coming up short.

Together they take 2 minutes to complete
In this case, I think that each program is fully CPU-bound and can saturate 100% of the CPUs available on the machine. Therefore when the programs run together, each runs at half speed.
It's also possible that this would be the observed behavior if both programs were able and willing to saturate some other resource apart from the CPU, for example some I/O device. However, since in practice, usually the performance of I/O devices does not decrease linearly with the load applied to them if they are oversaturated, I would consider that a less likely scenario and go with CPU-bound as a first guess.
Together they take 1 minute to complete
The two programs do not contest the same resources, or there are ample resources in the system to satisfy the demands of both. Therefore, they end up not interfering with each other.
Together they take half a minute to complete
The programs operate on the same input, and both can tell when all input is used up, so each ends up doing half the work it would do if launched alone at half the running time. Also, the system obviously has the capacity to supply double the amount of whatever resource these programs are constrained by.
Since in this case the running time decreases linearly with the amount of processes (perfect scaling), it seems more likely that the resource constraining the programs is CPU for the same reasons explained in the "2 minutes" scenario. This also fits in well with the "common input" assumption, as the input would not be very likely to be coming from one source if there were e.g. different I/O devices supplying it.
Therefore, the first guess in this case is that each program is CPU-bound and written such that it consumes at most half the CPU resources in the system.

For A, They're programs that are in competition for a mutually exclusive resource.
For B, They're independent programs that don't really interact.
For C, which is the one you're struggling with, it seems they both have the same work to pick from. For example, there's a queue of tasks to do, both programs are capable of doing the tasks, and they know what tasks have been done. So if they both run at the same time (assuming multi core machine, but even then not necessarily, all that's important is that they don't have a resource bottleneck) they get the work done in half the time.

See Performance in multithreaded Java application for another possible reason why processes can run faster when you have more than one.
Although I admit that the queue of tasks that canbeperformed concurrently is a much simpler reason to explain this reduced running time.

Measuring running time of computational geometry algorithms

I am taking a course on computational geometry in the fall, where we will be implementing some algorithms in C or C++ and benchmarking them. Most of the students generate a few datasets and measure their programs with the time command, but I would like to be a bit more thorough.
I am thinking about writing a program to automatically generate different datasets, run my program with them and use R to test hypotheses and estimate parameters.
So... How do you measure program running time more accurately?
What might be relevant to measure?
What hypotheses might be interesting to test (variance, effects caused by caching, etc.)?
Should I test my code on more than one machine? How should these machines differ?
My overall goals are to learn how these algorithms perform in practice, which implementation techniques are better and how the hardware actually performs.

Profilers are great. Valgrind is pretty popular. Also, I'd suggest trying your code out on risc machines if you can get access to some. Their performance characteristics are different from those of cisc machines in interesting ways.

You could use the Windows API timing function (are not that exactly) and you can use the RDTSC inline assembler command which is sub-nanosecond exact(don't forget that the command and the instructions around it create a small overhead of some hundreds cycles but this is not an big issue).

In order to get better accuracy with program metrics, you will have to run your program many times, such as 100 or 1000.
For more details, on metrics, search the web for metrics and profiling.
Beware that programs may differ in performance (time) measurements due to things running in the background such as virus scanners, music players, and other programs with timers in them.
You could test your program on different machines. Processor clock rates, L1 and L2 cache sizes, RAM sizes, and Disk speeds are all factors (as well as the number of other programs / tasks running concurrently). Floating point may also be a factor.
If you want, you can challenge your compiler by printing the assembly language of the listings for various optimization settings. See which setting produces the fewest or most efficient assembly code.
Since your processing data, look at data driven design: http://www.gamearchitect.net/Articles/DataDrivenDesign.html

You can use the Windows High Performance Counter to get nanosecond accuracy. Technically, afaik, the HPC can be any speed, but you can query it's counts per second, and as far as I know, most CPUs do very very high performance counting.
What you should do is just get a professional profiler. That's what they're for. More realistically, however.
If you're only comparing between algorithms, as long as your machine doesn't happen to excel in one area (Pentium D, SSD sort of thing) it shouldn't matter too much to do it on just one machine. If you want to look at cache effects, try running the algorithm right after the machine starts up (make sure that you get a copy of Windows 7, should be free for CS students), then leave it doing something that can be plenty cache heavy, like image processing, for 24h or something to convince the OS to cache it. Then run algorithm again. Compare.

You didn't specify your platform. If you are on a POSIX system (eg linux) have a look into clock_gettime. This lets you access different kinds of clocks e.g wall clock time or cpu time. You also may get to know about the precision of the clocks.
Since you are willing to do good statistics on your numbers, you should repeat your experiments often enough such that the statistical test give you enough confidence.
If your measurements are not too fine grained and your variance is low this often is quite good for 10 probes or so. But if you go down to small scale, a short function or so, you might need to go much higher.
Also you would have to ensure reproducible experimental conditions, no other load on the machine, enough memory available etc.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js