Rough estimate of program execution time - C++

How can I get a rough idea of the time a program will take on a normal PC, so that, based on the input size, I can tell whether my algorithm will get TLE for a given time limit (2 sec etc.)?
Suppose I have to traverse an array of size 10^6, 10^7, etc.
I think traversing a 10^6-element array will take about 1 sec.
Can anyone explain this clearly?

Check the instructions per cycle for the processor in question, then look at the generated assembly and count the number of cycles required.
Once you have the number of cycles, multiply it by the cycle time.

Several factors need to be considered before reaching any conclusion here.
Every machine/assembly instruction takes one or more clock cycles to complete.
After obtaining the assembly code for your program, you can estimate the total time with the following formula:
Execution time = total number of cycles * clock cycle time = instruction count * cycles per instruction * clock cycle time.
For example, 10^6 instructions at 1 cycle per instruction on a 2 GHz CPU (0.5 ns per cycle) come to roughly 0.5 ms.
In general, you cannot simply assume that processing an array of size 10^6 takes 1 second.
The time to execute a program depends on factors such as:
Processor: For the closest estimate, read the processor manual to get the cycles per instruction (different instructions take different numbers of cycles to retire) and use the formula above.
The data/operands: The size of the operands (in your case, the data in the array) affects latency.
Caching: Accessing data that sits on the same cache line takes the same time, so the total time also depends on how many cache lines the CPU needs to touch in total.
Compiler optimisations: Modern compilers are very good at optimising away code whose results are never used. In your case you are only traversing the array and not performing any operations on it, so after optimisation the traversal may take far less than 1 second (or be removed entirely).
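As a rough sanity check, you can also simply time a representative traversal yourself. Below is a minimal sketch (my own illustration, not from the answer above) using std::chrono; the sum is printed so the optimiser cannot remove the loop:

#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> a(1'000'000, 1);                            // 10^6 elements
    auto t0 = std::chrono::steady_clock::now();
    long long sum = std::accumulate(a.begin(), a.end(), 0LL);    // the traversal
    auto t1 = std::chrono::steady_clock::now();
    std::printf("sum=%lld, time=%.3f ms\n", sum,
                std::chrono::duration<double, std::milli>(t1 - t0).count());
}

On a typical desktop this takes on the order of a millisecond or less, which shows why the "1 second for 10^6 elements" rule of thumb is very pessimistic for simple traversals.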

Related

How to test the problem size scaling performance of code

I'm running a simple kernel which adds two streams of double-precision complex values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container contains different indices for different threads.
for (const auto& index : slice_indices)
{
    auto* tens1_data_stream = tens1.get_slice_data(index);
    const auto* tens2_data_stream = tens2.get_slice_data(index);
    #pragma omp simd safelen(8)
    for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
    {
        tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
        tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
    }
}
The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70 GHz with 24 cores, 32 kB of L1 cache, 1 MB of L2 cache and 33 MB of L3 cache. The total memory bandwidth is 115 GB/s.
The following plot (not reproduced here) shows how my code scales with problem size S = N x N x N.
Can anybody tell me with the information I've provided if:
it's scaling well, and/or
how I could go about finding out if it's utilizing all the resources which are available to it?
Thanks in advance.
EDIT:
Now I've plotted the performance in GFLOP/s with 24 cores and 48 cores (two NUMA nodes of the same processor); the plot is not reproduced here.
The strong and weak scaling plots are likewise not shown.
Note: I've measured the bandwidth and it turns out to be 105 GB/s.
Question: the meaning of the odd peak at 6 threads / problem size 90x90x90x16 B in the weak scaling plot is not obvious to me. Can anybody explain it?
Your graph has roughly the right shape: tiny arrays should fit in the L1 cache, and therefore get very high performance. Arrays of a megabyte or so fit in L2 and get lower performance, beyond that you should stream from memory and get low performance. So the relation between problem size and runtime should indeed get steeper with increasing size. However, the resulting graph (btw, ops/sec is more common than mere runtime) should have a stepwise structure as you hit successive cache boundaries. I'd say you don't have enough data points to demonstrate this.
Also, typically you would repeat each "experiment" several times to 1. even out statistical hiccups and 2. make sure that data is indeed in cache.
Since you tagged this "openmp" you should also explore taking a given array size, and varying the core count. You should then get a more or less linear increase in performance, until the processor does not have enough bandwidth to sustain all the cores.
A commenter brought up the concepts of strong/weak scaling. Strong scaling means: given a certain problem size, use more and more cores. That should give you increasing performance, but with diminishing returns as overhead starts to dominate. Weak scaling means: keep the problem size per process/thread/whatever constant, and increase the number of processing elements. That should give you almost linear increasing performance, until -- as I indicated -- you run out of bandwidth. What you seem to do is actually neither of these: you're doing "optimistic scaling": increase the problem size, with everything else constant. That should give you better and better performance, except for cache effects as I indicated.
So if you want to say "this code scales" you have to decide under what scenario. For what it's worth, your figure of 200Gb/sec is plausible. It depends on details of your architecture, but for a fairly recent Intel node that sounds reasonable.
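To make the strong-scaling scenario concrete, here is a minimal sketch (my own illustration; the array size and the simple a[i] += b[i] kernel are stand-ins for your tensor slices) that fixes the problem size and sweeps the OpenMP thread count, reporting effective bandwidth at each step:

#include <chrono>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const std::ptrdiff_t n = std::ptrdiff_t{1} << 25;   // fixed problem size (illustrative)
    std::vector<double> a(n, 1.0), b(n, 2.0);

    for (int threads = 1; threads <= omp_get_max_threads(); threads *= 2) {
        omp_set_num_threads(threads);
        auto t0 = std::chrono::steady_clock::now();
        #pragma omp parallel for
        for (std::ptrdiff_t i = 0; i < n; ++i)
            a[i] += b[i];
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        // 2 reads + 1 write of 8 bytes per element
        std::printf("%2d threads: %.4f s, %.1f GB/s\n",
                    threads, s, 3.0 * double(n) * sizeof(double) / s / 1e9);
    }
}

The reported bandwidth should climb roughly linearly with the thread count and then flatten once you approach the memory bandwidth of the node, which is the saturation point described above.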

Do I need to prevent preemption while measuring performance

I want to measure the performance of a block of code using QueryPerformanceCounter on Windows. What I would like to know is whether, between different runs, I can do anything to get consistent measurements for the same data (I want to measure the performance of different sorting algorithms on arrays of various sizes containing POD or custom objects). I know the current process can be interrupted by interrupts or I/O operations; I'm not doing any I/O, so only interrupts may affect my measurements. I also assume the kernel only gives my process a limited time slice, so it will get scheduled away as well.
How do people make accurate measurements through measuring the time of execution of a specific piece of code?
Time measurements are tricky because you need to find out why your algorithm is slower. That depends on the input data (e.g. presorted data; see Why is it faster to process a sorted array than an unsorted array?) and on the data set size (whether it fits into the L1, L2 or L3 cache; see http://igoro.com/archive/gallery-of-processor-cache-effects/).
These factors can hugely influence your measured times.
The order of measurements can also play a critical role. If you execute the sort algorithms in a loop and each of them allocates some memory, the first test will most likely lose. Not because the algorithm is inferior, but because the first time you access newly allocated memory it is soft-faulted into your process working set. After the memory is freed, the heap allocator returns pooled memory, which has entirely different access performance. This becomes very noticeable if you sort larger (many-MB) arrays.
Below are the touch times for a 2 GB array, touched from different numbers of threads, for the first and second pass. Each 4 KB page of memory is touched only once per pass.
Threads  Size_MB  Time_ms  us/Page  MB/s   Scenario
1        2000     355      0.693    5634   Touch 1
1        2000     11       0.021    N.a.   Touch 2
2        2000     276      0.539    7246   Touch 1
2        2000     12       0.023    N.a.   Touch 2
3        2000     274      0.535    7299   Touch 1
3        2000     13       0.025    N.a.   Touch 2
4        2000     288      0.563    6944   Touch 1
4        2000     11       0.021    N.a.   Touch 2
// From the compiler's point of view, Touch is a no-op with no observable side effect.
// That is true as far as the data contents go, but performance-wise there is a huge
// difference. Turn optimizations off to prevent the compiler from outsmarting us.
#pragma optimize( "", off )
void Program::Touch(void *p, size_t N)
{
    char *pB = (char *)p;
    char tmp;
    for (size_t i = 0; i < N; i += 4096)   // touch one byte per 4 KB page
    {
        tmp = pB[i];
    }
}
#pragma optimize("", on)
To truly judge the performance of an algorithm, time measurements alone are not sufficient; you also need a profiler (e.g. the Windows Performance Toolkit, which is free, or Intel VTune, which is not) to make sure you measured the right thing and not something entirely different.
I just went to a conference where Andrei Alexandrescu spoke about Fastware, and he addressed this exact issue: how to measure speed. Apparently taking the mean is a bad idea, BUT measuring many times is a great idea. So with that in mind, you measure a million times and keep the smallest measurement, because that is where you get the least amount of noise.
Means are awful because you're adding the noise's weight on top of the actual speed you're measuring. (These are not the only things to consider when evaluating code speed, but they are a good start; there is even more horrid stuff regarding which core the code executes on, and the overhead of starting execution on one core and finishing on another, but that's a different story and I don't think it applies to my sort.)
A good joke was: if you put Bill Gates into a bus, on average everybody in that bus is a millionaire :))
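A minimal sketch of that take-the-minimum approach with std::chrono (steady_clock wraps QueryPerformanceCounter on Windows); the helper name, run count and the sort-a-reversed-vector workload are only illustrative, and note that the copy resetting the input is included in each timed run:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <limits>
#include <vector>

template <class F>
double min_time_ms(F&& work, int runs = 1000) {
    double best = std::numeric_limits<double>::max();
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best,
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return best;                      // smallest observed time = least noise
}

int main() {
    std::vector<int> data(100'000);
    for (std::size_t i = 0; i < data.size(); ++i) data[i] = int(data.size() - i);
    std::vector<int> work_buf;
    double ms = min_time_ms([&] { work_buf = data; std::sort(work_buf.begin(), work_buf.end()); });
    std::printf("best of 1000 runs: %.3f ms\n", ms);
}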
Cheers and thanks to all who provided input.

How to reduce the loop overhead when measuring performance?

When I try to measure the performance of a piece of code, I put it into a loop and iterate a million times.
for i: 1 -> 1000000
{
"test code"
}
But by using profiling tools, I found that the loop overhead is so large that it significantly skews the result, especially when the piece of code is small: say, 1.5 s of total elapsed time with 0.5 s of loop overhead.
So I'd like to know whether there is a better way to test the performance. Or should I stick with this method, but put multiple copies of the same code inside the loop to increase its weight in the measurement?
for i: 1 -> 1000000
{
"test code copy 1"
"test code copy 2"
"test code copy 3"
"test code copy 4"
}
Or is it OK to subtract the loop overhead from the total time? Thanks a lot!
You will need to look at the assembly listing generated by the compiler. Count the number of instructions in the overhead.
Usually, for an incrementing loop, the overhead consists of:
Incrementing loop counter.
Branching to the top of the loop.
Comparison of counter to limit.
On many processors, these are one processor instruction each, or close to that. So find the average time for an instruction to execute, multiply it by the number of overhead instructions, and that becomes your overhead time for one iteration.
For example, on a processor that averages 100 ns per instruction, with 3 overhead instructions each iteration spends 3 * 100 ns = 300 ns on overhead. Given 1.0E6 iterations, 3.0E8 nanoseconds (0.3 s) will be due to overhead. Subtract this quantity from your measurements for a more accurate measurement of the loop body itself.
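Another common option is to measure an empty loop with the same trip count and subtract it. A minimal sketch follows (the do_work function and repetition count are placeholders for your own code); be aware that an aggressively optimising compiler may delete the empty loop entirely, so check the generated assembly:

#include <chrono>
#include <cstdio>

volatile int sink = 0;                 // volatile keeps the work from being optimised away
static void do_work() { sink += 1; }   // stand-in for the code under test

int main() {
    const int reps = 1'000'000;
    auto time_ns = [](auto&& f, int n) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i) f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::nano>(t1 - t0).count();
    };
    double with_work = time_ns(do_work, reps);
    double empty     = time_ns([] {}, reps);     // loop-overhead baseline
    std::printf("per-iteration estimate: %.1f ns\n", (with_work - empty) / reps);
}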

How to find the L1 cache line size with IO timing measurements?

As a school assignment, I need to find a way to determine the L1 data cache line size without reading config files or using API calls. I'm supposed to use timings of memory reads/writes to analyse and extract this information. So how might I do that?
In an incomplete attempt at another part of the assignment (finding the levels and sizes of the caches), I have:
for (i = 0; i < steps; i++) {
    arr[(i * 4) & lengthMod]++;
}
I was thinking maybe I just need to vary line 2, the (i * 4) part? Once I exceed the cache line size, the line might need to be replaced, which takes some time? But is it that straightforward? The required block might already be in the cache somewhere? Or perhaps I can still count on the fact that, if steps is large enough, it will still work out quite accurately?
UPDATE
Here's an attempt on GitHub ... the main part is below:
// repeatedly access/modify data, varying the STRIDE
for (int s = 4; s <= MAX_STRIDE/sizeof(int); s *= 2) {
    start = wall_clock_time();
    for (unsigned int k = 0; k < REPS; k++) {
        data[(k * s) & lengthMod]++;
    }
    end = wall_clock_time();
    timeTaken = ((float)(end - start))/1000000000;
    printf("%d, %1.2f \n", s * sizeof(int), timeTaken);
}
The problem is there doesn't seem to be much difference between the timings. FYI, since this is for the L1 cache, I have SIZE = 32 K (the size of the array).
Allocate a BIG char array (make sure it is too big to fit in L1 or L2 cache). Fill it with random data.
Start walking over the array in steps of n bytes. Do something with the retrieved bytes, like summing them.
Benchmark and calculate how many bytes/second you can process with different values of n, starting from 1 and counting up to 1000 or so. Make sure that your benchmark prints out the calculated sum, so the compiler can't possibly optimize the benchmarked code away.
When n == your cache line size, each access will require reading a new line into the L1 cache. So the benchmark results should get slower quite sharply at that point.
If the array is big enough, by the time you reach the end, the data at the beginning of the array will already be out of cache again, which is what you want. So after you increment n and start again, the results will not be affected by having needed data already in the cache.
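A minimal sketch of that experiment (the buffer size, stride range and accesses-per-second metric are my own choices, not from the answer above); the throughput should drop noticeably once the stride reaches the cache line size, because every access then pulls in a fresh line:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const std::size_t size = 64 * 1024 * 1024;              // 64 MB, well beyond L2
    std::vector<unsigned char> buf(size);
    for (auto& b : buf) b = (unsigned char)std::rand();     // fill with random data

    for (std::size_t stride = 1; stride <= 1024; stride *= 2) {
        unsigned long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < size; i += stride) sum += buf[i];
        auto t1 = std::chrono::steady_clock::now();
        double sec = std::chrono::duration<double>(t1 - t0).count();
        // printing sum keeps the compiler from removing the loop
        std::printf("stride %4zu: %.0f accesses/s (sum=%llu)\n",
                    stride, (size / stride) / sec, sum);
    }
}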
Have a look at Calibrator. The work is copyrighted but the source code is freely available. The idea its documentation describes for calculating cache line sizes sounds much more educated than what has already been said here.
The idea underlying our calibrator tool is to have a micro benchmark whose performance only depends
on the frequency of cache misses that occur. Our calibrator is a simple C program, mainly a small loop
that executes a million memory reads. By changing the stride (i.e., the offset between two subsequent
memory accesses) and the size of the memory area, we force varying cache miss rates.
In principle, the occurrence of cache misses is determined by the array size. Array sizes that fit into
the L1 cache do not generate any cache misses once the data is loaded into the cache. Analogously,
arrays that exceed the L1 cache size but still fit into L2, will cause L1 misses but no L2 misses. Finally,
arrays larger than L2 cause both L1 and L2 misses.
The frequency of cache misses depends on the access stride and the cache line size. With strides
equal to or larger than the cache line size, a cache miss occurs with every iteration. With strides
smaller than the cache line size, a cache miss occurs only every n iterations (on average), where n is the ratio cache line size / stride.
Thus, we can calculate the latency for a cache miss by comparing the execution time without
misses to the execution time with exactly one miss per iteration. This approach only works, if
memory accesses are executed purely sequential, i.e., we have to ensure that neither two or more load
instructions nor memory access and pure CPU work can overlap. We use a simple pointer chasing
mechanism to achieve this: the memory area we access is initialized such that each load returns the
address for the subsequent load in the next iteration. Thus, super-scalar CPUs cannot benefit from
their ability to hide memory access latency by speculative execution.
To measure the cache characteristics, we run our experiment several times, varying the stride and
the array size. We make sure that the stride varies at least between 4 bytes and twice the maximal
expected cache line size, and that the array size varies from half the minimal expected cache size to
at least ten times the maximal expected cache size.
I had to comment out #include "math.h" to get it to compile; after that it found my laptop's cache values correctly. I also couldn't view the generated PostScript files.
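For reference, a minimal sketch of the pointer-chasing mechanism the quoted text describes (the array size and the single-cycle construction are my own choices); every load depends on the previous one, so the prefetcher and out-of-order execution cannot hide the miss latency:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t n = std::size_t{1} << 23;        // 8M indices, ~64 MB, beyond typical L3
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin() + 1, order.end(), std::mt19937_64{42});

    // next[i] holds the index to visit after i, forming one big cycle over the array
    std::vector<std::size_t> next(n);
    for (std::size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];

    std::size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) idx = next[idx];   // dependent loads
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("avg %.1f ns per dependent load (final index %zu)\n", ns / n, idx);
}

Varying the stride and working-set size as the Calibrator documentation describes, instead of using a fully random cycle, is what lets you separate the cache line size from the raw miss latency.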
You can use the CPUID instruction in assembler; although it is not portable, it will give you what you want.
For Intel Microprocessors, the Cache Line Size can be calculated by multiplying bh by 8 after calling cpuid function 0x1.
For AMD Microprocessors, the data Cache Line Size is in cl and the instruction Cache Line Size is in dl after calling cpuid function 0x80000005.
I took this from this article here.
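A sketch of the Intel variant using GCC/Clang's <cpuid.h> wrapper (MSVC users would use __cpuid from <intrin.h> instead); CPUID leaf 1 reports the line size in 8-byte units in bits 15:8 of EBX, which matches the "multiply bh by 8" description above:

#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        std::printf("cache line size: %u bytes\n", ((ebx >> 8) & 0xff) * 8);   // bits 15:8, in 8-byte units
    else
        std::printf("CPUID leaf 1 not supported\n");
}

Note that this defeats the point of the assignment (it is effectively an API call), so treat it only as a way to check the answer your timing experiment produces.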
I think you should write a program that walks through the array in random order instead of sequentially, because modern processors do hardware prefetching.
For example, make an array of int whose values are the indices of the next cells to visit.
I wrote a similar program a year ago: http://pastebin.com/9mFScs9Z
Sorry for my English, I am not a native speaker.
See how memtest86 is implemented. It measures and analyses the data transfer rate; the points where the rate changes correspond to the sizes of the L1, L2 and, if present, L3 caches.
If you get stuck in the mud and can't get out, look here.
There are manuals and code that explain how to do what you're asking. The code is pretty high quality as well. Look at "Subroutine library".
The code and manuals are based on X86 processors.
Just a note.
The cache line size varies across a few ARM Cortex families and can change during execution without any notification to the running program.
I think it should be enough to time an operation that uses some amount of memory, then progressively increase the amount of memory (the operands, for instance) used by the operation.
When the operation's performance drops severely, you have found the limit.
I would go with just reading a bunch of bytes without printing them (printing would hurt performance so badly that it would become the bottleneck). While reading, the timing should be directly proportional to the number of bytes read until the data no longer fits in L1; after that you will see the performance hit.
You should also allocate the memory once at the start of the program, before you start timing.

The overhead of an OpenCL or CUDA call?

I'm writing a function that does a lot of BLAS gemv operations.
I would like to be able to do this on the GPU, and I've tried with cuBLAS.
My problem is that my matrices and vectors are rather small: 100x100 matrices and vectors of length 100. cuBLAS takes ages compared to the CPU, and I see why: a mixture of fast caches on the CPU and the large overhead of making calls to the GPU.
Therefore I'm trying to figure out a smart way of measuring the time it takes to communicate the call to the GPU.
That is, the time it takes CUDA to set up the call and send it to the graphics processor, not counting the time the matrix-vector multiplication itself actually takes.
How would I go about doing this?
Update: The following results are for a hand-written FFT GPU algorithm on 2005-era hardware (nVidia 7800 GTX), but they show the principle of CPU-GPU transfer bottlenecks.
The overhead is not the call per se but the compilation of the GPU program and the transfer of data between the GPU and the host. The CPU is highly optimized for functions that can be performed entirely in cache, and the latency of DDR3 memory is far lower than that of the PCI-Express bus which services the GPU. I have experienced this myself when writing GPU FFT routines (prior to CUDA). Please see this related question.
N        FFTw (ms)  GPUFFT (ms)  GPUFFT MFLOPS  GPUFFT Speedup
8        0          0.06         3.352705       0.006881
16       0.001      0.065        7.882117       0.010217
32       0.001      0.075        17.10887       0.014695
64       0.002      0.085        36.080118      0.026744
128      0.004      0.093        76.724324      0.040122
256      0.007      0.107        153.739856     0.066754
512      0.015      0.115        320.200892     0.134614
1024     0.034      0.125        657.735381     0.270512
2048     0.076      0.156        1155.151507    0.484331
4096     0.173      0.215        1834.212989    0.804558
8192     0.483      0.32         2664.042421    1.510011
16384    1.363      0.605        3035.4551      2.255411
32768    3.168      1.14         3450.455808    2.780041
65536    8.694      2.464        3404.628083    3.528726
131072   15.363     5.027        3545.850483    3.05604
262144   33.223     12.513       3016.885246    2.655183
524288   72.918     25.879       3079.443664    2.817667
1048576  173.043    76.537       2192.056517    2.260904
2097152  331.553    157.427      2238.01491     2.106081
4194304  801.544    430.518      1715.573229    1.861814
The table above shows timings of a GPU FFT implementation vs CPU implementation based on kernel size. For smaller sizes, the transfer of data to/from the GPU dominates. Smaller kernels can be performed on the CPU, some implementations/sizes entirely in the cache. This makes the CPU the best choice for small operations.
If on the other hand you need to perform large batches of work on data with minimal moves to/from the GPU then the GPU will beat the CPU hands down.
In so far as measuring the effect in your example, I would suggest performing an experiment like the above. Try to work out the FLOPS computed for each size of matrix and run the test on the CPU and GPU for varying sizes of matrix. Output to a CSV file the size, time and FLOPS for GPU vs CPU. For any profiling ensure you run several hundred iterations of your code and time the whole thing, then divide the total time by iterations to get the loop time. Try different shaped matrices also if your algorithm allows (e.g. 10x100 rather than 100x10).
Using this data you can get a feel for what the overheads are. To find out exactly repeat the same experiment but replace the inner shader code executed on the GPU with no-operation (simply copy from input to output).
Hope this helps,
You can get the time in nanoseconds from the device when an event was queued, submitted, started, and finished by using clGetEventProfilingInfo on your buffer transfer event.
more info, and how to set it up here: http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html
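A small sketch of how that looks in host code (the function name is illustrative; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, otherwise the queries fail):

#include <CL/cl.h>
#include <cstdio>

// Prints how long the command behind 'ev' spent queued, submitted, and executing.
void print_event_times(cl_event ev) {
    cl_ulong queued = 0, submitted = 0, started = 0, ended = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, nullptr);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_SUBMIT, sizeof(submitted), &submitted, nullptr);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,  sizeof(started),  &started,  nullptr);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,    sizeof(ended),    &ended,    nullptr);
    std::printf("queue->submit: %llu ns, submit->start: %llu ns, start->end: %llu ns\n",
                (unsigned long long)(submitted - queued),
                (unsigned long long)(started - submitted),
                (unsigned long long)(ended - started));
}

The queue->submit and submit->start intervals are a reasonable proxy for the per-call overhead you are asking about, while start->end is the time the device actually spent on the work.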
I think that for 100x100 matrices you may be better off sticking to the CPU for the crunching. Unless you have many to multiply at the same time, the benefit of the GPU will be hardly noticeable due to the (small) transfer overhead and the usually much lower clock speeds. Make sure you tweak your kernel to use as much local data as possible; on my hardware there is 32 KB per work group, which should be plenty to hold two 100x100 matrices. The built-in dot product functions should also be very handy.
There was an awesome talk about this at AFDS last year (see sessionId: 2908):
http://developer.amd.com/afds/pages/OLD/sessions.aspx
They talk in detail about optimizing the kernel, and hard-coding the optimal sizes.
Are your matrices already on the GPU?
If not, CUBLAS might transfer them for you (known as thunking), which is an additional overhead.
Also, GPUs do not really shine for such small computations, i.e. it will probably be slower than CPUs since you have to transfer your result back.
If you can, use bigger matrices.
Otherwise you might want to use streams (cudaStream_t) to start multiple parallel computations on the GPU.
If you want to measure the execution time of a kernel in CUDA, you need to enclose that (or anything else that computes on the GPU) in events, like this when using the CUDA runtime API:
cudaEvent_t start, stop;
cudaEventCreate(&start);                       // events must be created before use
cudaEventCreate(&stop);
cudaEventRecord(start, 0);                     // record into the default stream

struct timeval cpuStart, cpuEnd;               // requires #include <sys/time.h>
gettimeofday(&cpuStart, 0);                    // get start time on CPU
// Do something with CUDA on the GPU, e.g. call kernels, transfer memory, ...
gettimeofday(&cpuEnd, 0);                      // get end time on CPU
double seconds = cpuEnd.tv_sec - cpuStart.tv_sec;
double microseconds = cpuEnd.tv_usec - cpuStart.tv_usec;
double cpuDuration = (seconds * 1.0e6 + microseconds) / 1.0e3; // in milliseconds

cudaEventRecord(stop, 0);
// Busy-wait until the stop event has occurred
cudaError_t eventResult;
do
{
    eventResult = cudaEventQuery(stop);
}
while (eventResult == cudaErrorNotReady);
// Assert there was no error; check the CUDA Toolkit Reference for further info
assert(cudaSuccess == eventResult);            // requires #include <assert.h> or <cassert>

// Retrieve the elapsed GPU time between the two events
float gpuDuration = 0.0f;                      // in milliseconds
cudaEventElapsedTime(&gpuDuration, start, stop);

// Release the event objects
cudaEventDestroy(stop);
cudaEventDestroy(start);
You might want to check the error code of every call to CUDA (at least with an assert), as you may get errors from previous calls, resulting in hours of debugging...
(Note: I mostly use the CUDA driver API, so this might not work out of the box. Sorry for that.)
EDIT: Just saw that you want to measure the call itself, not the duration of the kernel.
You can do that by simply measuring the time on the CPU for the call - see the updated code above.
This works only on Linux because gettimeofday is not available for Windows (AFAIK).
To find the call overhead, call a CUDA kernel that does as little as possible.
for (int i = 0; i < NLoops; i++) {
    gettimeofday(&cpuStart, 0);   // get start time on CPU
    // Call minimal CUDA kernel
    gettimeofday(&cpuEnd, 0);     // get end time on CPU
    // save elapsed time
}
Follow the code of Alex P. above.
The less processing you do in the kernel, the more the time difference will be only the call overhead.
Do a little experimenting to find a good value for NLoops (maybe 1,000,000). Be sure that the elapsed time is longer than the interval of your timer, or you'll end up with all zeros. If that happens, write some kernel code that executes in a fixed time interval that you can predict: (n loops of x cycles each).
It's hard to remove all the non-CUDA computations that might occur between cpuStart and cpuEnd (like interrupt processing), but making several runs and averaging can give good results.
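For completeness, here is a minimal sketch of the do-as-little-as-possible approach with the CUDA runtime API (the empty kernel, loop count and use of std::chrono are my own choices); it estimates the pure launch-plus-synchronization overhead with no data transfer and no real work:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}                    // kernel that does nothing

int main() {
    const int n_loops = 10000;
    noop<<<1, 1>>>();                        // warm-up launch: triggers lazy context setup
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n_loops; ++i) {
        noop<<<1, 1>>>();
        cudaDeviceSynchronize();             // include the wait in each measurement
    }
    auto t1 = std::chrono::steady_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    std::printf("avg launch + sync overhead: %.2f us\n", us / n_loops);
}

Compile with nvcc; subtracting this per-call figure from your gemv timings gives a rough idea of how much of the slowdown is overhead rather than computation.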