I have a loop like this:
start = __rdtsc();
unsigned long long count = 0;
for(int i = 0; i < N; i++)
for(int j = 0; j < M; j++)
count += tab[i][j];
stop = __rdtsc();
time = (stop - start) * 1/3;
I need to check how prefetching data influences efficiency. How can I force some values to be prefetched from memory into the cache before they are summed?
For GCC only:
__builtin_prefetch((const void*)(prefetch_address),0,0);
prefetch_address can be invalid; there will be no segfault. If the difference between prefetch_address and the current read location is too small, there may be no effect or even a slowdown. Try to set it at least 1 KB ahead.
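For example, applied to the loop above (a sketch assuming int tab[N][M]; the 1024-byte offset follows the "at least 1k ahead" rule, and an address past the end of tab is harmless as noted):

unsigned long long count = 0;
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
    {
        /* hint the line roughly 1 KB (256 ints) ahead of the current element */
        __builtin_prefetch((const char *)&tab[i][j] + 1024, 0, 0);
        count += tab[i][j];
    }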
First, I assume that tab is a large 2D array, either a static array (e.g., int tab[N][M] with large N and M) or a dynamically allocated array (e.g., int** tab plus the corresponding mallocs). Here, you want to prefetch some data from tab into the cache to reduce the execution time.
Simply put, I don't think you need to manually insert any prefetching into your code, where a simple reduction over a 2D array is performed. Modern CPUs will do automatic prefetching if it is necessary and profitable.
Two facts you should know for this problem:
(1) You are already exploiting the spatial locality of tab inside the innermost loop. Once tab[i][0] is read (after a cache miss, or a page fault), the data from tab[i][0] to tab[i][15] will be in your CPU caches, assuming that the cache line size is 64 bytes and sizeof(int) is 4.
(2) However, when the code crosses a row boundary, i.e., moves from tab[i][M-1] to tab[i+1][0], a cold cache miss is very likely, especially when tab is a dynamically allocated array whose rows may be allocated in a fragmented way. If the array is statically allocated, however, each row is located contiguously in memory.
So prefetching makes sense only for (1) the first item of the next row and (2) the element CACHE_LINE_SIZE/sizeof(tab[0][0]) positions ahead of j, fetched ahead of time.
You may do so by inserting a prefetch operation (e.g., __builtin_prefetch) in the outer loop. However, modern compilers may not always emit the corresponding prefetch instructions; if you really want them, you should check the generated machine code.
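A sketch of that placement, assuming int tab[N][M] (the hardware prefetcher may make the hint redundant):

for (int i = 0; i < N; i++)
{
    /* hint the first cache line of the next row while the current row is summed */
    if (i + 1 < N)
        __builtin_prefetch(&tab[i + 1][0], 0, 0);
    for (int j = 0; j < M; j++)
        count += tab[i][j];
}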
However, as I said, I do not recommend doing that, because modern CPUs mostly prefetch automatically, and that automatic prefetching will mostly outperform your manual code. For instance, on an Intel CPU such as an Ivy Bridge processor, there are multiple hardware data prefetchers that prefetch into the L1, L2, or L3 cache. (I don't think mobile processors have such fancy data prefetchers, though.) Some prefetchers also load adjacent cache lines.
If you do more expensive computations on large 2D arrays, there are many alternative algorithms that are more cache friendly. A notable example is blocked (tiled) matrix multiplication: a naive matrix multiplication suffers a lot of cache misses, but a blocked algorithm significantly reduces cache misses by working on small subsets that fit in the caches. See some references like this.
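For reference, a minimal sketch of a blocked multiply over row-major n×n matrices of double (the block size B is a tuning assumption, not a fixed rule):

#define B 64  /* block size in elements; tune so three B x B tiles fit in cache */

void matmul_blocked(int n, const double *A, const double *Bm, double *C)
{
    /* C must be zero-initialized; all matrices are stored row-major, n x n */
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int k = kk; k < kk + B && k < n; k++)
                    {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + B && j < n; j++)
                            C[i * n + j] += a * Bm[k * n + j];
                    }
}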
The easiest/most portable method is to simply read some data spaced one cache line (here 64 bytes) apart, a little ahead of where you are working. Assuming tab is a proper two-dimensional array, you could:
char *tptr = (char *)&tab[0][0];
tptr += 64;                      /* start one cache line ahead of the data being summed */
char temp = 0;
volatile char keep_temp_alive;
for(int i = 0; i < N; i++)
{
    temp += *tptr;               /* touch the next cache line so it is fetched early */
    tptr += 64;
    for(int j = 0; j < M; j++)
        count += tab[i][j];
}
keep_temp_alive = temp;          /* stop the compiler from discarding temp and its loads */
Something like that. However, it does depend on:
1. You don't end up reading outside the allocated memory [by too much].
2. The data covered by the inner j loop is not much more than 64 bytes per row. If it is, you may want to add more steps of temp += *tptr; tptr += 64; at the beginning of the outer loop.
The keep_temp_alive assignment after the loop is essential to prevent the compiler from removing temp, and with it the loads, as unnecessary.
Unfortunately, I'm too slow at writing generic code to suggest the builtin instructions; the points for that go to Leonid.
The __builtin_prefetch intrinsic is pretty helpful, but it is clang/gcc specific. If you are compiling with multiple compilers, I had luck using the x86 intrinsic _mm_prefetch with both clang and MSVC.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_prefetch
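For instance, a minimal sketch using _mm_prefetch (it requires <xmmintrin.h>; _MM_HINT_T0 requests a fetch into all cache levels, and the sum_row helper and its parameters are just illustrative names, not from the question):

#include <xmmintrin.h>

/* Sum a row while hinting the line ~1 KB ahead; a sketch, not a drop-in replacement. */
static long long sum_row(const int *row, int m)
{
    long long s = 0;
    for (int j = 0; j < m; j++)
    {
        _mm_prefetch((const char *)(row + j) + 1024, _MM_HINT_T0);
        s += row[j];
    }
    return s;
}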
I have an array, long matrix[8*1024][8*1024], and two functions sum1 and sum2:
long sum1(long m[ROWS][COLS]) {
long register sum = 0;
int i,j;
for (i=0; i < ROWS; i++) {
for (j=0; j < COLS; j++) {
sum += m[i][j];
}
}
return sum;
}
long sum2(long m[ROWS][COLS]) {
long register sum = 0;
int i,j;
for (j=0; j < COLS; j++) {
for (i=0; i < ROWS; i++) {
sum += m[i][j];
}
}
return sum;
}
When I execute the two functions with the given array, I get running times:
sum1: 0.19s
sum2: 1.25s
Can anyone explain why there is this huge difference?
C uses row-major ordering to store multidimensional arrays, as documented in § 6.5.2.1 Array subscripting, paragraph 3 of the C Standard:
Successive subscript operators designate an element of a multidimensional array object. If E is an n-dimensional array (n >= 2) with dimensions i x j x . . . x k, then E (used as other than an lvalue) is converted to a pointer to an (n - 1)-dimensional array with dimensions j x . . . x k. If the unary * operator is applied to this pointer explicitly, or implicitly as a result of subscripting, the result is the referenced (n - 1)-dimensional array, which itself is converted into a pointer if used as other than an lvalue. It follows from this that arrays are stored in row-major order (last subscript varies fastest).
Emphasis mine.
Wikipedia has an illustration comparing this storage technique with the other method for storing multidimensional arrays, column-major ordering.
The first function, sum1, accesses data consecutively, matching how the 2D array is actually laid out in memory, so the data it needs is usually already in the cache. sum2 jumps to a different row on each iteration of its inner loop, so the data it fetches is much less likely to be in the cache.
Some other languages use column-major ordering for multidimensional arrays; among them are R, Fortran and MATLAB. If you wrote equivalent code in those languages, you would observe faster execution with sum2.
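To make the row-major layout concrete, here is a small sketch (using the question's dimensions) of the address arithmetic behind m[i][j]:

#define ROWS (8 * 1024)
#define COLS (8 * 1024)

/* m[i][j] lives at offset (i * COLS + j) elements from &m[0][0]:
   incrementing j moves by sizeof(long) bytes, while incrementing i
   jumps by COLS * sizeof(long) bytes (64 KB here). */
long element(long m[ROWS][COLS], int i, int j)
{
    const long *base = &m[0][0];
    return *(base + (long)i * COLS + j);   /* same element as m[i][j] */
}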
Computers generally use cache to help speed up access to main memory.
The hardware usually used for main memory is relatively slow: it can take many processor cycles for data to come from main memory to the processor. So a computer generally includes a smaller amount of very fast but expensive memory called cache. Computers may have several levels of cache; some of it is built into the processor or the processor chip itself, and some of it is located outside the processor chip.
Since the cache is smaller, it cannot hold everything in main memory. It often cannot even hold everything that one program is using. So the processor has to make decisions about what is kept in cache.
The most frequent accesses of a program are to consecutive locations in memory. Very often, after a program reads element 237 of an array, it will soon read 238, then 239, and so on. It is less often that it reads 7024 just after reading 237.
So the operation of cache is designed to keep portions of main memory that are consecutive in cache. Your sum1 program works well with this because it changes the column index most rapidly, keeping the row index constant while all the columns are processed. The array elements it accesses are laid out consecutively in memory.
Your sum2 program does not work well with this because it changes the row index most rapidly. This skips around in memory, so many of the accesses it makes are not satisfied by cache and have to come from slower main memory.
Related Resource: Memory layout of multi-dimensional arrays
On a machine with a data cache (even a 68030 has one), reading/writing data in consecutive memory locations is way faster, because a block of memory (whose size depends on the processor) is fetched from main memory once and then served from the cache (for reads) or written back all at once (the cache line is flushed for writes).
By "skipping" data (reading far from the previous read), the CPU has to read the memory again.
That's why your first snippet is faster.
For more complex operations (a fast Fourier transform, for instance), where data is read more than once (unlike in your example), many libraries (FFTW, for instance) let you pass a stride to accommodate your data organization (in rows/in columns). Never use it: always transpose your data first and use a stride of 1; it will be faster than trying to work without the transposition.
To make sure your data is accessed consecutively, never use 2D notation in the inner loop. First position yourself on the selected row by setting a pointer to its start, then use an inner loop over that row:
for (i=0; i < ROWS; i++) {
const long *row = m[i];
for (j=0; j < COLS; j++) {
sum += row[j];
}
}
If you cannot do this, that means that your data is wrongly oriented.
This is an issue with the cache.
The cache will automatically read data that lies after the data you requested. So if you read the data row by row, the next data you request will already be in the cache.
A matrix, in memory, is laid out linearly, such that the items in a row are next to each other in memory (spatial locality). When you traverse the items in order, going through all of the columns in a row before moving on to the next one, then whenever the CPU comes across an entry that isn't loaded into its cache yet, it loads that value along with a whole block of other values close to it in physical memory, so the next several values are already cached by the time it needs to read them.
When you traverse them the other way, the other values loaded alongside each entry are not going to be the next ones read, so you wind up with a lot more cache misses, and the CPU has to sit and wait while the data is brought in from the next layer of the memory hierarchy.
By the time you swing back around to an entry that you had previously cached, it has more than likely been evicted from the cache in favor of all the other data you've loaded since, as it has not been recently used anymore (temporal locality).
To expand on the other answers: this is due to cache misses in the second program. Assuming you are using Linux, *BSD, or macOS, Cachegrind may give you enlightenment. It is part of Valgrind, and will run your program, unmodified, and print cache usage statistics. It does run very slowly, though.
http://valgrind.org/docs/manual/cg-manual.html
Which one of the 2 is faster (C++)?
for(i=0; i<n; i++)
{
sum_a = sum_a + a[i];
sum_b = sum_b + b[i];
}
Or
for(i=0; i<n; i++)
{
sum_a = sum_a + a[i];
}
for(i=0; i<n; i++)
{
sum_b = sum_b + b[i];
}
I am a beginner, so I don't know whether this makes sense, but in the first version array 'a' is accessed and then 'b', which might lead to many switches between memory regions, since arrays 'a' and 'b' are at different memory locations. In the second version, the whole of array 'a' is accessed first and then the whole of array 'b', which means contiguous memory locations are accessed instead of alternating between the two arrays.
Does this make any difference between the execution time of the two versions (even a very negligible one)?
I don't think there is a single correct answer to this question. In general, the second version runs twice as many loop iterations (CPU execution overhead), while the first interleaves accesses to two arrays (memory access overhead). Now imagine you run this code on a PC with a slow clock but an insanely good cache: the memory overhead is reduced, but since the clock is slow, running the loop machinery twice makes execution much longer. The other way around, with a fast clock but bad memory, running two loops is not a problem, so it's better to optimize for memory access.
Here is a cool example of how you can profile your app: Link
Which one of the 2 is faster (C++)?
Either. It depends on:
The implementation of operator+ and operator[] (in case they are overloaded)
Location of the arrays in memory (adjacent or not)
Size of the arrays
Size of the cpu caches
Associativity of caches
Cache speed in relation to memory speed
Possibly other factors
As Revolver_Ocelot mentioned in a comment, some compilers may even transform one written loop form into the other.
Does this make any difference between the execution time of the two versions (even a very negligible one)?
It can make a difference. The difference may be significant or negligible.
Your analysis is sound. Memory access is typically much slower than cache, and jumping between two memory locations may cause cache thrashing† in some situations. I would recommend using the separated approach by default, and only combining the loops if you have measured it to be faster on your target CPU.
† As MSalters points out, thrashing shouldn't be a problem on modern desktop processors (modern as in ~x86).
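If you do measure it, a minimal sketch of a harness follows (assuming a POSIX clock_gettime timer and an arbitrary array size n chosen for this sketch; the volatile sink keeps the compiler from discarding the sums):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    enum { n = 1 << 24 };
    int *a = (int *)malloc(n * sizeof *a), *b = (int *)malloc(n * sizeof *b);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

    long long sum_a = 0, sum_b = 0;
    double t0 = now_sec();
    for (int i = 0; i < n; i++) { sum_a += a[i]; sum_b += b[i]; }   /* fused version */
    double t1 = now_sec();
    long long sum_a2 = 0, sum_b2 = 0;
    for (int i = 0; i < n; i++) sum_a2 += a[i];                      /* split version */
    for (int i = 0; i < n; i++) sum_b2 += b[i];
    double t2 = now_sec();

    volatile long long sink = sum_a + sum_b + sum_a2 + sum_b2;       /* keep results alive */
    (void)sink;
    printf("fused: %.3f s, split: %.3f s\n", t1 - t0, t2 - t1);
    free(a); free(b);
    return 0;
}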
So I measured the cycles for accessing the L2 cache of the ARM Cortex-A15.
I did this by allocating one byte and
invalidate the address
read the PMCCNTR register
access the memory location of the allocated byte with ldr
read the PMCCNTR register again
subtract first measurement from second
I got about 240 cycles for cached access and about 350 for uncached access.
I also used ISB, DMB and DSB. Do these numbers sound accurate to you? I can't seem to find official resources to compare with. Maybe you can point me in the right direction.
You are not measuring the latency with your approach, you are measuring the overhead.
A standard approach to measuring latencies is a pointer-chasing test: you initialize a chain of pointers so that every access depends on the previous one, and you control their placement so that they fit (or don't fit) in a cache of a given size. The rest of the procedure is the same, except you don't invalidate anything.
Something like this (for illustration, not tested)
#include <stdint.h>
// prepare a chain of N pointers in a buffer
// (uintptr_t is guaranteed to be able to hold a pointer value)
uintptr_t Buffer[N];
// chain them, here in a simple backward fashion;
// you can also use a randomized sequence if you work in main memory
for (i = 1; i < N; i++) { Buffer[i] = (uintptr_t) &Buffer[i-1]; }
// close the chain
Buffer[0] = (uintptr_t) &Buffer[N-1];
// measure M dependent accesses
Start = PMCCNTR();
uintptr_t *p = &Buffer[0];
for (i = M; i > 0; i--) {
    p = (uintptr_t *) *p;
}
Stop = PMCCNTR();
Measuring a single access is subject to inaccuracy due to measurement overhead and random interference. You should measure the time over a large number of accesses to get an amortized latency that better reflects what you want. To measure the average access time you also need to make sure these accesses are not run in parallel (that would measure throughput, not latency), so add some false dependency, such as adding the content of the previously accessed byte to the next address (after initializing all these bytes to zero).
Also, you didn't say how you were invalidating the address, but I'm guessing that you also threw it out of the L2, and are actually measuring memory latency only.
I need a 2-bit array. I am not concerned with saving memory at all, but I am concerned with minimizing cache misses and maximizing cache efficiency. Using an array of bools would use 4 times more memory, which means that for every usable chunk of data in the cache there would be 3 that are not used. So, technically, I can pack four times as much useful data into the cache if I use bit fields.
The plan is to implement it as an array of bytes, each divided into 4 two-bit fields, and use the div function to get the integral quotient and remainder, possibly in a single instruction, and use those to access the right byte index and the right bit field.
The array I need is about 10000 elements long, so this makes for significantly denser packed data; using 2 actual bits allows the entire array to fit in the L1 cache, while with a byte array this would not be possible.
So my question is whether someone can tell me if this is a good idea in a performance-oriented task, so I know whether it is worth going ahead and implementing a 2-bit array. Surely the best way to know is profiling, but any information in advance would be useful and appreciated.
With 10000 elements, on a modern processor, it should fit nicely in the cache even as bytes (10 KB), so I wouldn't worry too much about it, unless you want this to run on some very tiny microprocessor with a cache much smaller than the typical 16-32 KB L1 caches that modern CPUs have.
Of course, you may well want to TEST the performance with different solutions, if you think this is an important part of your code from a performance perspective [as measured from your profiling that you've already done before you start optimising, right?].
It's not clear to me that this will result in a performance gain. Accessing each field will require several instructions ((data[i / 4] >> 2 * (i % 4)) & 0x03), and a lot of modern processors have an L3 cache which would hold the entire array with one byte per entry. Whether the extra cost in execution time will be greater or less than the difference in caching is hard to say; you'll have to profile to know exactly.
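For illustration, a minimal sketch of accessors built around that expression (the get2/set2 names and the fixed 10000-element size are assumptions of this sketch):

#include <stddef.h>
#include <stdint.h>

/* 10000 two-bit entries packed four per byte: 2500 bytes total */
static uint8_t data[(10000 + 3) / 4];

static unsigned get2(size_t i)
{
    return (data[i / 4] >> (2 * (i % 4))) & 0x03u;
}

static void set2(size_t i, unsigned v)
{
    unsigned shift = 2 * (i % 4);
    data[i / 4] = (uint8_t)((data[i / 4] & ~(0x03u << shift)) | ((v & 0x03u) << shift));
}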
If you can organize your algorithms to work a byte (or even a word) at a time, the cost of access may be much less. Iterating over the entire array, for example:
for ( int i = 0; i < 10000; i += 4 ) {
unsigned char w1 = data[ i / 4 ];
for ( int j = 0; j < 4; ++ j ) {
unsigned char w2 = w1 & 0x03;
// w2 is entry i + j...
w1 >>= 2;
}
}
could make a significant difference. Most compilers will be able to keep w1 and w2 in registers, meaning you'll only have 1/4 as many memory accesses. Packing with unsigned int would probably be even faster.
I was just wondering if this is expected behavior in C++. The code below runs at around 0.001 ms:
for(int l=0;l<100000;l++){
int total=0;
for( int i = 0; i < num_elements; i++)
{
total+=i;
}
}
However if the results are written to an array, the time of execution shoots up to 15 ms:
int *values=(int*)malloc(sizeof(int)*100000);
for(int l=0;l<100000;l++){
int total=0;
for( unsigned int i = 0; i < num_elements; i++)
{
total+=i;
}
values[l]=total;
}
I can appreciate that writing to the array takes time, but is the increase proportionate?
Cheers, everyone.
The first example can be implemented using just CPU registers, which can be accessed billions of times per second. The second example uses so much memory that it certainly overflows the L1 and possibly the L2 cache (depending on the CPU model). That will be slower. Still, 15 ms for 100,000 outer iterations comes out to 150 ns per iteration, and each iteration includes the whole inner summation loop, not just the single store, so that is not unreasonably slow.
It looks like the compiler is optimizing that loop out entirely in the first case.
The total effect of the loop is a no-op, so the compiler just removes it.
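One way to check this hypothesis is to force the result to be observable, for example with a volatile sink (a sketch reusing num_elements from the question; the compiler may still replace the inner loop with a closed-form expression, but it can no longer drop the work as unused):

volatile long long sink = 0;        /* a store the compiler must assume is observable */
for (int l = 0; l < 100000; l++) {
    int total = 0;
    for (int i = 0; i < num_elements; i++)
        total += i;
    sink += total;                  /* using total keeps the loops from being removed */
}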
It's very simple.
In the first case you have just 3 variables, which can easily be kept in GPRs (general purpose registers). That doesn't mean they are in registers all the time, but they are probably at least in L1 cache memory, which means they can be accessed very fast.
In the second case you have more than 100k values, and you need about 400 kB to store them. That is definitely too much for the registers and the L1 cache. In the best case it could all be in the L2 cache, but probably not all of it will be. If something is not in a register, L1, or L2 (I assume your processor doesn't have L3), you need to fetch it from RAM, and that takes much more time.
I would suspect that what you are seeing is an effect of virtual memory and possibly paging. The malloc call is going to allocate a decent-sized chunk of memory that is probably represented by a number of virtual pages. Each page is mapped into process memory separately.
You may also be measuring the cost of calling malloc depending on how you timed the loop. In either case, the performance is going to be very sensitive to compiler optimization options, threading options, compiler versions, runtime versions, and just about anything else. You cannot safely assume that the cost is linear with the size of the allocation. The only thing that you can do is measure it and figure out how to best optimize once it has been proven to be a problem.