How to reduce the overhead of a loop when measuring performance? - c++

When I try to measure the performance of a piece of code, I put it into a loop and iterate a million times:
for (int i = 0; i < 1000000; i++)
{
    // test code
}
But by using profiling tools, I found that the overhead of the loop itself is so large that it skews the result significantly, especially when the piece of code is small: say, 1.5 s of total elapsed time, of which 0.5 s is loop overhead.
So I'd like to know whether there is a better way to measure the performance. Or should I stick with this method, but put multiple copies of the same code inside the loop to increase its weight in the measurement?
for (int i = 0; i < 1000000; i++)
{
    // test code copy 1
    // test code copy 2
    // test code copy 3
    // test code copy 4
}
Or is it OK to simply subtract the loop overhead from the total time? Thanks a lot!

You will need to look at the assembly listing generated by the compiler and count the instructions that make up the overhead.
Usually, for an incrementing loop, the overhead consists of:
Incrementing the loop counter.
Comparing the counter to the limit.
Branching to the top of the loop.
On many processors, each of these is a single instruction or close to it. So find out the average time for an instruction to execute, multiply by the number of overhead instructions, and that becomes your overhead time for one iteration.
For example, on a processor that averages 100 ns per instruction, 3 overhead instructions cost 3 * 100 ns = 300 ns per iteration. Given 1.0E6 iterations, 3.0E8 nanoseconds of the total will be due to overhead. Subtract this quantity from your measurement for a more accurate picture of the loop body's cost.
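If you would rather not count instructions, you can also time an empty loop and subtract it. A minimal sketch of that approach (my own, using std::chrono); note that an optimizing compiler may delete an empty loop or unused results outright, so the volatile counter and sink below are blunt instruments to discourage that, and they add a little overhead of their own:

#include <chrono>
#include <iostream>

volatile int sink; // a volatile write keeps the test code from being optimized away

int main()
{
    using clock = std::chrono::steady_clock;
    const int N = 1000000;

    // Time the loop overhead alone. The volatile counter discourages the
    // compiler from deleting the otherwise empty loop.
    auto t0 = clock::now();
    for (volatile int i = 0; i < N; i = i + 1) { }
    auto t1 = clock::now();

    // Time the same loop with the test code inside (a stand-in assignment here).
    for (volatile int i = 0; i < N; i = i + 1) { sink = i; }
    auto t2 = clock::now();

    std::chrono::duration<double> overhead = t1 - t0;
    std::chrono::duration<double> total = t2 - t1;
    std::cout << "loop overhead:            " << overhead.count() << " s\n"
              << "test code minus overhead: " << (total - overhead).count() << " s\n";
}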

Related

Rough estimate of programme execution time

How can I get a rough idea of the time taken by a program on a normal PC, so that, based on the input size, I can tell whether my algorithm will get TLE (time limit exceeded) for a given time limit (2 sec etc.)?
Suppose I have to traverse an array of size 10^6, 10^7, etc.
I think it will take 1 sec to traverse a 10^6 array.
Can anyone explain this clearly?
Check the instructions per cycle for your processor, then look at the assembly code and calculate the number of cycles required.
Once you have the number of cycles, multiply by the cycle time.
Several factors need to be considered before reaching any conclusion here.
Every machine/assembly instruction takes one or more clock cycles to complete.
After obtaining the assembly code for your program, you can calculate the total time using the following formula:
Execution time = total number of cycles * clock cycle time = instruction count * cycles per instruction * clock cycle time
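As a hypothetical worked example (the numbers are illustrative, not measured): on a 2 GHz processor one clock cycle takes 0.5 ns; if traversing a 10^6-element array compiles to roughly 4 instructions per element (load, add, increment, branch) at 1 cycle per instruction, the estimate is 10^6 * 4 * 1 * 0.5 ns = 2 ms, far below 1 second.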
In general, you cannot directly estimate the total time to process an array of size 10^6 to be 1 second.
The time to execute a program may be dependent on the following factors:
Processor: To find the closest estimate, you can read the processor manual to get the cycles per instruction (different instructions take different numbers of cycles to retire) and use the formula above.
The data/operand: The size of the operand (in your case, the data in the array) has an effect on latency.
Caching: The time required to access data on the same cache line is the same, so the total time also depends on how many cache lines the CPU needs to access in total.
Compiler optimizations: Modern compilers are very good at optimising away code whose results are never used. In your case, you are just traversing the array and not performing any operation on its elements, so after optimisation it may take much less than 1 second to traverse the array.

Why is traversing more time-consuming than merging on two sorted std::list?

I am pretty amazed by the result that traversing two sorted std::lists takes around 12% more time than merging them. Merging can be implemented as repeated element comparison, list splicing, and iterator traversal through the two sorted linked lists, so traversing should not be slower than merging, especially when the two lists are large, since the proportion of elements that are simply iterated over increases.
However, the result does not match what I thought, and this is how I tested my idea:
#include <chrono>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <list>

int main()
{
    std::list<int> list1, list2;
    for (int cnt = 0; cnt < 1 << 22; cnt++)
        list1.push_back(rand());
    for (int cnt = 0; cnt < 1 << 23; cnt++)
        list2.push_back(rand());
    list1.sort();
    list2.sort();

    auto start = std::chrono::system_clock::now(); // C++ wall clock
    // Choose either one option below
    list1.merge(list2);          // Option 1
    // for (auto num : list1);   // Option 2
    // for (auto num : list2);   // Option 2
    std::chrono::duration<double> diff = std::chrono::system_clock::now() - start;
    std::cout << std::setprecision(9) << "\n "
              << diff.count() << " seconds (measured)" << std::endl; // show elapsed time
}
PS. icc is smart enough to eliminate Option 2. Add sum += num; inside the loops and print out sum to defeat that.
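For instance, a variant of Option 2 that the optimizer cannot remove (sum and the final print are my additions):

long long sum = 0;
for (auto num : list1) sum += num; // Option 2, now with an observable result
for (auto num : list2) sum += num;
std::cout << sum << std::endl;     // printing sum keeps the loops from being elided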
This is the output from perf: (the measured time remains the same without using perf)
Option 1: Merge
0.904575206 seconds (measured)
Performance counter stats for './option-1-merge':
33,395,981,671 cpu-cycles
149,371,004 cache-misses # 49.807 % of all cacherefs
299,898,436 cache-references
24,254,303,068 cycle-activity.stalls-ldm-pending
7.678166480 seconds time elapsed
Option 2: Traverse
1.01401903 seconds (measured)
Performance counter stats for './option-2-traverse':
33,844,645,296 cpu-cycles
138,723,898 cache-misses # 48.714 % of all cacherefs
284,770,796 cache-references
25,141,751,107 cycle-activity.stalls-ldm-pending
7.806018949 seconds time elapsed
Because of the horrible spatial locality of these linked lists, cache misses are the major reason the CPU stalls, and they occupy most of the CPU's resources. The strange point is that option 2 has fewer cache misses than option 1, yet it needs more CPU stalls and more CPU cycles to accomplish its task. What makes this anomaly happen?
As you know, it is memory that is taking all your time.
Cache misses are bad, but so are stalls.
From this paper:
Applications with irregular memory access patterns, e.g., dereferencing chains of pointers when traversing linked lists or trees, may not generate enough concurrently outstanding requests to fully utilize the data paths. Nevertheless, such applications are clearly limited by the performance of memory accesses as well. Therefore, considering the bandwidth utilization is not sufficient to detect all memory related performance issues.
Basically, randomly walking pointers can fail to saturate the memory bandwidth.
The tight loop over each list is blocked every iteration, waiting for the next pointer to be loaded. If it is not in cache, the CPU can do nothing: it stalls.
The merge, by contrast, tries to pull two pointer chains into the cache at once. While one load is pending, the CPU can sometimes advance on the other chain.
The result you measured is that the merge has fewer stalls than the plain back-to-back double traversal.
Or in other words,
24,254,303,068 cycle-activity.stalls-ldm-pending
is a big number, but it is smaller than:
25,141,751,107 cycle-activity.stalls-ldm-pending
I am surprised this is enough to make a 10% difference, but that is exactly what perf is for: measuring instead of guessing.
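To make the memory-level-parallelism point concrete, here is a minimal sketch (mine, not from the original post) contrasting one pointer chain at a time with two interleaved chains; the function names are invented for illustration:

#include <list>

// One chain at a time: each load depends on the previous node, so at most
// one cache miss is outstanding, and the CPU stalls on every miss.
long long sumSequential(const std::list<int>& a, const std::list<int>& b)
{
    long long sum = 0;
    for (int x : a) sum += x;
    for (int x : b) sum += x;
    return sum;
}

// Two independent chains interleaved: a miss on one list can overlap with
// progress (or another miss) on the other, which is what the merge gets for free.
long long sumInterleaved(const std::list<int>& a, const std::list<int>& b)
{
    long long sum = 0;
    auto ia = a.begin();
    auto ib = b.begin();
    while (ia != a.end() && ib != b.end())
        sum += *ia++ + *ib++;
    for (; ia != a.end(); ++ia) sum += *ia;
    for (; ib != b.end(); ++ib) sum += *ib;
    return sum;
}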

Do I need to prevent preemption while measuring performance

I want to measure the performance of a block of code using QueryPerformanceCounter on Windows. What I would like to know is whether, between different runs, I can do something to get consistent measurements for the same data (I want to measure the performance of different sorting algorithms on arrays of different sizes containing POD or some custom objects). I know the current process can be interrupted by interrupts or I/O operations; I am not doing any I/O, so only interrupts may affect my measurement. I also assume the kernel gives my process a limited time slice, so the scheduler will preempt my process as well.
How do people make accurate measurements through measuring the time of execution of a specific piece of code?
Time measurements are tricky because you need to find out why your algorithm is slower. That depends on the input data (e.g. presorted data, see Why is it faster to process a sorted array than an unsorted array?) and on the data set size (whether it fits into the L1, L2, or L3 cache, see http://igoro.com/archive/gallery-of-processor-cache-effects/).
That can hugely influence your measured times.
The order of measurements can also play a critical role. If you execute the sort algorithms in a loop and each of them allocates some memory, the first test will most likely lose. Not because that algorithm is inferior, but because the first time you access newly allocated memory it is soft-faulted into your process working set. After the memory is freed, the heap allocator returns pooled memory, which has entirely different access performance. This becomes very noticeable if you sort larger (many-MB) arrays.
Below are the touch times of a 2 GB array from different threads for the first and second time printed. Each page (4KB) of memory is only touched once.
Threads  Size_MB  Time_ms  us/Page  MB/s  Scenario
   1      2000      355     0.693   5634  Touch 1
   1      2000       11     0.021   N.a.  Touch 2
   2      2000      276     0.539   7246  Touch 1
   2      2000       12     0.023   N.a.  Touch 2
   3      2000      274     0.535   7299  Touch 1
   3      2000       13     0.025   N.a.  Touch 2
   4      2000      288     0.563   6944  Touch 1
   4      2000       11     0.021   N.a.  Touch 2
// Touch is, from the compiler's point of view, a nop with no observable side effect.
// That is true from a pure data-content point of view, but performance-wise there is
// a huge difference. Turn optimizations off to prevent the compiler from outsmarting us.
#pragma optimize( "", off )
void Program::Touch(void *p, size_t N)
{
    char *pB = (char *)p;
    char tmp;
    for (size_t i = 0; i < N; i += 4096) // touch one byte per 4 KB page
    {
        tmp = pB[i];
    }
}
#pragma optimize("", on)
To truly judge the performance of an algorithm it is not sufficient to perform time measurements; you also need a profiler (e.g. the Windows Performance Toolkit, which is free, or Intel VTune, which is not) to ensure that you have measured the right thing and not something entirely different.
I just went to a conference where Andrei Alexandrescu spoke on Fastware, and he addressed this exact issue of how to measure speed. Apparently taking the mean is a bad idea, BUT measuring many times is a great idea. So with that in mind, you measure a million times and remember the smallest measurement, because that is where you get the least amount of noise.
Means are awful because you are adding the noise's weight to the actual speed you are measuring. (These are not the only things to consider when evaluating code speed, but they are a good start; there is even more horrid stuff regarding which core the code executes on, and the overhead of starting execution on one core and finishing on another, but that is a different story, and I don't think it applies to my sort.)
A good joke was: if you put Bill Gates into a bus, on average everybody in that bus is a millionaire :))
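Here is a minimal sketch of the "repeat and keep the minimum" idea (my own illustration, not Alexandrescu's code), assuming std::chrono::steady_clock is precise enough for the block being measured:

#include <algorithm>
#include <chrono>
#include <limits>

// Run f() `runs` times and return the smallest observed duration in seconds.
// The minimum is the observation least polluted by interrupts and scheduling noise.
template <typename F>
double minTimeSeconds(F&& f, int runs = 1000)
{
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i)
    {
        auto t0 = std::chrono::steady_clock::now();
        f();
        std::chrono::duration<double> d = std::chrono::steady_clock::now() - t0;
        best = std::min(best, d.count());
    }
    return best;
}

One caveat if you use something like this for sorting benchmarks: the second run sees already-sorted input, so the callable must re-initialize its data on every run, or you end up measuring different work.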
Cheers and thanks to all who provided input.

Is looping faster than traversing one by one

Let us consider the following code snippet in C++ to print the first 10 positive integers:
for (int i = 1; i < 11; i++)
{
    cout << i;
}
Will this be faster or slower than printing each integer sequentially, one statement at a time, as follows?
int x = 1;
cout << x;
x++;
cout << x;
And so on...
Is there any reason as to why it should be faster or slower ? Does it vary from one language to another ?
This question is similar to this one; I've copied an excerpt of my answer to that question below... (the numbers are different; 11 vs. 50; the analysis is the same)
What you're considering doing is a manual form of loop unrolling. Loop unrolling is an optimization that compilers sometimes use for reducing the overhead involved in a loop. Compilers can do it only if the number of iterations of the loop can be known at compile time (i.e. the number of iterations is a constant, even if the constant involves computation based on other constants). In some cases, the compiler may determine that it is worthwhile to unroll the loop, but often it won't unroll it completely.

For instance, in your example, the compiler may determine that it would be a speed advantage to unroll the loop from 50 iterations down to only 10 iterations with 5 copies of the loop body. The loop variable would still be there, but instead of doing 50 comparisons of the loop counter, now the code only has to do the comparison 10 times.

It's a tradeoff, because the 5 copies of the loop body eat up 5 times as much space in the cache, which means that loading those extra copies of the same instructions forces the cache to evict (throw out) that many instructions that are already in the cache and which you might have wanted to stay in the cache. Also, loading those 4 extra copies of the loop body instructions from main memory takes much, much longer than simply grabbing the already-loaded instructions from the cache in the case where the loop isn't unrolled at all.
So all in all, it's often more advantageous to just use only one copy of the loop body and go ahead and leave the loop logic in place. (I.e. don't do any loop unrolling at all.)
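For illustration, here is roughly what unrolling a 50-iteration loop by a factor of 5 looks like; body() is a hypothetical stand-in for the loop body:

void body(int i); // hypothetical loop body

void rolled()
{
    // 50 iterations: 50 counter increments, compares, and branches.
    for (int i = 0; i < 50; i++)
        body(i);
}

void unrolledBy5()
{
    // 10 iterations: 10 compares and branches, but 5 copies of the body,
    // which occupy 5 times the instruction-cache space.
    for (int i = 0; i < 50; i += 5)
    {
        body(i);
        body(i + 1);
        body(i + 2);
        body(i + 3);
        body(i + 4);
    }
}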
In a loop, the machine-level instructions are the same on every iteration, and therefore at the same addresses. With explicit repeated statements, the instructions sit at different addresses. So for loops, the CPU's instruction cache can provide a performance boost that might not happen in the latter case.
For a really small range (10 iterations) the difference will most likely be negligible. For a loop of significant length it could show up more clearly.

Big-Oh complexity of multithreaded code

How is the running time specified for algorithms whose behavior is affected by multithreading?
For example, a CompareAndSet loop may never succeed (if you're very, very unlucky):
AtomicReference<ContainerOfItems> oldContainer;

void AddItem(Item aItem)
{
    ContainerOfItems newContainer;
    do
    {
        newContainer = null;
        newContainer = new ContainerOfItems();
        newContainer.CopyContents(oldContainer);
        newContainer.Add(aItem);
    }
    while (!CompareAndSet(oldContainer, newContainer));
    oldContainer = null;
}
In this example (which looks a lot like Java but really is pseudocode) the CopyContents operation could take so long that oldContainer is replaced by some other thread in the meantime, causing the CompareAndSet to fail. What's the running time of this code?
What's the running time of this code?
The overall runtime of your program depends heavily on how long copyContents(...) takes and on how often race conditions cause the compareAndSet(...) to fail, which in turn depends on how many threads are running at the same time.
However, I suspect that in terms of Big-O, the number of extra iterations caused by compareAndSet(...) failures does not matter. For example, if copyContents(...) takes O(N) time, the loop has to run 3 times on average before compareAndSet(...) succeeds, and you run AddItem N times to add all of the items, then it is O(N^2): the factor of 3 drops out because it is a constant.
Also, because you are implying concurrency, there will be some speedup from the algorithm being multithreaded, but that too is only a constant-factor improvement and does not affect the Big-O. So the Big-O can be calculated as the Big-O of copyContents(...) times (I assume) N.
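For reference, here is a rough C++ analogue of the pseudocode above (my own sketch, assuming C++20's std::atomic<std::shared_ptr> specialization; Item and Container are invented placeholder types):

#include <atomic>
#include <memory>
#include <vector>

struct Item { int value; };
using Container = std::vector<Item>;

std::atomic<std::shared_ptr<Container>> container{std::make_shared<Container>()};

void AddItem(const Item& aItem)
{
    std::shared_ptr<Container> oldC = container.load();
    std::shared_ptr<Container> newC;
    do
    {
        newC = std::make_shared<Container>(*oldC); // the O(N) CopyContents step
        newC->push_back(aItem);                    // the Add step
        // On failure, compare_exchange_weak reloads oldC with the current
        // value, i.e. another thread replaced the container while we copied.
    } while (!container.compare_exchange_weak(oldC, newC));
}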