I have the following tight loop that makes up the serial bottleneck of my code. Ideally I would parallelize the function that calls this, but that is not possible.
//n is about 60
for (int k = 0; k < n; k++)
{
    double fone = z[k*n + i + 1];
    double fzer = z[k*n + i];
    z[k*n + i + 1] = s*fzer + c*fone;
    z[k*n + i]     = c*fzer - s*fone;
}
Are there any optimizations that can be made, such as vectorization or some evil inline trick, that could help this code?
I am looking into finding eigen solutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html
Short answer: Change the memory layout of your matrix from row-major order to column-major order.
Long answer:
It seems you are accessing the (i)th and (i+1)th columns of a matrix stored in row-major order - probably a big matrix that doesn't as a whole fit into the CPU cache. Basically, on every loop iteration the CPU has to wait for RAM (on the order of a hundred cycles). After a few iterations, theoretically, the address prediction should kick in and the CPU should speculatively load the data items even before the loop accesses them. That should help with RAM latency. But that still leaves the problem that the code uses the memory bus inefficiently: CPU and memory never exchange single bytes, only cache-lines (64 bytes on current processors). Of every 64-byte cache-line loaded and stored, your code only touches 16 bytes (one quarter).
Transposing the matrix and accessing it in native major order would increase memory bus utilization four-fold. Since that is probably the bottleneck of your code, you can expect a speedup of about the same order.
Whether it is worth it, depends on the rest of your algorithm. Other parts may of course suffer because of the changed memory layout.
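For illustration, a minimal sketch assuming zt is the transposed (column-major) copy of z, i.e. zt[i*n + k] == z[k*n + i]; the k loop then walks contiguous memory and touches whole cache-lines:
for (int k = 0; k < n; k++)
{
    double fone = zt[(i+1)*n + k];
    double fzer = zt[i*n + k];
    zt[(i+1)*n + k] = s*fzer + c*fone;
    zt[i*n + k]     = c*fzer - s*fone;
}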
I take it you are rotating something (or rather, lots of things) by the same angle, with s being a sine and c a cosine?
Counting backwards is always good fun; it turns the per-iteration comparison into a compare against zero, and it should work here. Making the counter double as the index might save a bit of time too (it cuts out a bit of arithmetic, as others have said).
for (int k = (n - 1) * n + i; k >= 0; k -= n)
{
    double fone = z[k + 1];
    double fzer = z[k];
    z[k + 1] = s*fzer + c*fone;
    z[k]     = c*fzer - s*fone;
}
Nothing dramatic here, but it looks tidier if nothing else.
As a first move, I'd cache a pointer in this loop:
//n is about 60
double *cur_z = &z[0*n + i];
for (int k = 0; k < n; k++)
{
    double fone = *(cur_z + 1);
    double fzer = *cur_z;
    *(cur_z + 1) = s*fzer + c*fone;
    *cur_z       = c*fzer - s*fone;
    cur_z += n;
}
Second, I think it's better to make a templated version of this function. As a result, you can get a good performance benefit if your matrix holds integer values (since FPU operations are slower).
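A hedged sketch of such a templated version (the function name and signature are mine, not from the question, and whether integer math actually wins here is the answerer's claim):
template <typename T>
void rotate_columns(T *z, int n, int i, T s, T c)
{
    T *cur_z = &z[i];          // column i of row 0
    for (int k = 0; k < n; k++)
    {
        T fone = cur_z[1];
        T fzer = cur_z[0];
        cur_z[1] = s*fzer + c*fone;
        cur_z[0] = c*fzer - s*fone;
        cur_z += n;            // advance one row
    }
}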
People told me that adding padding can help performance because it uses the cache in a better way.
I don't understand how it is possible that making your data bigger gives you better performance.
Can someone explain why?
Array padding
The array padding technique consists of increasing the size of the array dimensions in order to reduce conflict misses when accessing a cache memory.
This type of miss can occur when the number of accessed elements mapping to the same set is greater than the degree of associativity of the cache.
Padding changes the data layout and can be applied (1) between variables (Inter-Variable Padding) or (2) within a single variable (Intra-Variable Padding):
1. Inter-Variable Padding
float x[LEN], padding[P], y[LEN];

float redsum() {
    float s = 0;
    for (int i = 0; i < LEN; i++)
        s = s + x[i] + y[i];
    return s;
}
If we have a direct mapped cache and the elements x[i] and y[i] are mapped into the same set, accesses to x will evict a block from y and vice versa, resulting in a high miss rate and low performance.
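For example (a hedged sketch; the cache geometry is assumed, not given): with a 32 KiB direct-mapped cache and 64-byte lines, making each array exactly 32 KiB means x[i] and y[i] always map to the same set, and one line's worth of padding is enough to separate them:
enum { LEN = 8192, P = 16 };      // 8192 floats = 32 KiB; 16 floats = one 64-byte line
float x[LEN], padding[P], y[LEN]; // y[i] now maps one set away from x[i]
                                  // (assumes the arrays are laid out adjacently)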
2. Intra-Variable Padding
float x[LEN][LEN+PAD], y[LEN][LEN];

void symmetrize() {
    for (int i = 0; i < LEN; i++) {
        for (int j = 0; j < LEN; j++)
            y[i][j] = 0.5 * (x[i][j] + x[j][i]);
    }
}
In this case, if the elements of a column map into a small number of sets, the sequence of accesses may lead to conflict misses, so that spatial locality is not exploited.
For example, suppose that during the first iteration of the outer loop, the block containing x[0][0] x[0][1] ... x[0][15] is evicted to make room for the block containing some element x[k][0], read by the column access x[j][i]. Then, at the start of the second iteration, the reference to x[0][1] would cause a cache miss.
This technical document analyses the performance of the Fast Fourier Transform (FFT) as a function of the size of the matrix used in the calculations:
https://www.intel.com/content/www/us/en/developer/articles/technical/fft-length-and-layout-advisor.html
References
Gabriel Rivera and Chau-Wen Tseng. Data transformations for eliminating conflict misses. PLDI 1998. DOI: https://doi.org/10.1145/277650.277661
Changwan Hong et al. Effective padding of multidimensional arrays to avoid cache conflict misses. PLDI 2016. DOI: https://doi.org/10.1145/2908080.2908123
I don't think it would matter in a simple loop.
Have a look at this answer: Does alignment really matter for performance in C++11?
The most interesting bit for you from that answer is probably that you could arrange your classes so that members used together are in one cache line and those used by different threads are not.
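A minimal sketch of that arrangement, assuming C++11 alignas and 64-byte cache lines (the member names are hypothetical):
#include <atomic>

struct SharedState {
    // members read together by one thread share a cache line
    int config_a;
    int config_b;
    // a counter hammered by another thread gets its own line, so its
    // writes don't keep invalidating the line holding config_a/config_b
    alignas(64) std::atomic<int> counter;
};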
Here's the basic algorithm to count the number of distinct values in a buffer:
#include <cstdint>

unsigned getCount(const uint8_t data[16])
{
    uint8_t pop[256] = { 0 }; // occurrence count per byte value
    unsigned count = 0;
    for (int i = 0; i < 16; ++i)
    {
        uint8_t b = data[i];
        if (0 == pop[b])
            count++;
        pop[b]++;
    }
    return count;
}
Can this be done efficiently in NEON somehow, by loading into a q-reg and doing some bit magic? Alternatively, can I efficiently say that data has all elements identical, or contains only two distinct values, or more than two?
For example, using vminvq_u8 and vmaxvq_u8 I can find the min and max elements, and if they are equal I know that data has identical elements. If not, then I can vceqq_u8 with the min value and vceqq_u8 with the max value, then vorrq_u8 those results and check that the result is all 1s. Basically, in NEON it can be done this way. Any ideas how to make it better?
#include <arm_neon.h>

unsigned getCountNeon(const uint8_t data[16])
{
    uint8x16_t s = vld1q_u8(data);
    uint8x16_t smin = vdupq_n_u8(vminvq_u8(s));
    uint8x16_t smax = vdupq_n_u8(vmaxvq_u8(s));
    uint8x16_t res = vdupq_n_u8(1);
    uint8x16_t one = vdupq_n_u8(1);
    for (int i = 0; i < 14; ++i) // this obviously needs to be unrolled
    {
        s = vbslq_u8(vceqq_u8(s, smax), smin, s); // replace max with min
        uint8x16_t smax1 = vdupq_n_u8(vmaxvq_u8(s));
        res = vaddq_u8(res, vaddq_u8(vceqq_u8(smax1, smax), one));
        smax = smax1;
    }
    res = vaddq_u8(res, vaddq_u8(vceqq_u8(smax, smin), one));
    return vgetq_lane_u8(res, 0);
}
With some optimizations and improvements, perhaps a 16-byte block can be processed in 32-48 NEON instructions. Can this be done better in scalar ARM code? Unlikely.
Some background on why I'm asking: I'm working on an algorithm and trying different approaches to processing the data, and I'm not sure yet exactly what I'll use in the end. Information that might be of use:
count of distinct elements per 16-byte block
value that repeats most per 16-byte block
average per block
median per block
speed of light?.. that's a joke, it cannot be computed in NEON from a 16-byte block :)
So I'm trying things out, and before committing to any approach I want to see whether it can be optimized well. For example, the average per block will basically run at memcpy speed on arm64.
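Indeed, the per-block average collapses to a couple of instructions (a sketch with AArch64 intrinsics; vaddlvq_u8 is the widening horizontal add, and the function name is mine):
#include <arm_neon.h>
#include <cstdint>

unsigned avgPerBlock(const uint8_t data[16])
{
    uint8x16_t v = vld1q_u8(data);
    return vaddlvq_u8(v) / 16; // widening horizontal sum of 16 bytes, then divide (truncating)
}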
If you're expecting a lot of duplicates, and can efficiently get a horizontal min with vminv_u8, this might be better than scalar. Or not, maybe NEON->ARM stalls for the loop condition kill it. >.< But it should be possible to mitigate that with unrolling (and saving some info in registers to figure out how far you overshot).
#include <arm_neon.h>

// AArch64 NEON intrinsics version (originally sketched as pseudo-code; edits welcome).
// I *think* ARM can do these things efficiently, except perhaps the loop
// condition. High latency could be ok, but stalling isn't.
int count_dups(uint8x16_t v)
{
    int dups = (0xFF == vmaxvq_u8(v)); // count=1 if any elements are 0xFF to start
    uint8_t hmin = vminvq_u8(v);
    while (hmin != 0xFF) {
        uint8x16_t min_bcast = vdupq_n_u8(hmin);       // broadcast the minimum
        uint8x16_t matches   = vceqq_u8(v, min_bcast); // 0xFF where v == min
        v = vorrq_u8(v, matches);                      // min and its dups become 0xFF
        hmin = vminvq_u8(v);                           // horizontal min again
        dups++;
    }
    return dups;
}
This turns unique values into 0xFF, one set of duplicates at a time.
The loop-carried dep chain through v / hmin stays in vector registers; it's only the loop branch that needs NEON->integer.
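Usage would look something like this (data being a const uint8_t* block, as in the question):
uint8x16_t v = vld1q_u8(data);
int distinct = count_dups(v); // each distinct byte value (and its dups) is counted once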
Minimizing / hiding NEON->integer/ARM penalties
Unroll by 8 with no branches on hmin, leaving results in 8 NEON registers. Then transfer those 8 values; back-to-back transfers of multiple NEON registers to ARM incur only one total stall (of 14 cycles on whatever hardware Jake tested). Out-of-order execution could also hide some of the penalty for this stall. Then check those 8 integer registers with a fully-unrolled integer loop.
Tune the unroll factor to be large enough that you usually don't need another round of SIMD operations for most input vectors. If almost all of your vectors have at most 5 unique values, then unroll by 5 instead of 8.
Instead of transferring multiple hmin results to integer, count them in NEON. If you can use ARM32 NEON partial-register tricks to put multiple hmin values in the same vector for free, it's only a bit more work to shuffle 8 of them into one vector and compare for not-equal to 0xFF. Then horizontally add that compare result to get a -count.
Or if you have values from different input vectors in different elements of a single vector, you can use vertical operations to add results for multiple input vectors at once without needing horizontal ops.
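A hedged sketch of the "count them in NEON" idea (on AArch64 for simplicity; the function name is mine, and it assumes 8 hmin results have already been packed into one 64-bit register):
#include <arm_neon.h>

int countLiveRounds(uint8x8_t hmins)
{
    uint8x8_t is_ff  = vceq_u8(hmins, vdup_n_u8(0xFF)); // 0xFF where a round saw only 0xFF
    uint8x8_t not_ff = vadd_u8(is_ff, vdup_n_u8(1));    // 0xFF(-1)+1 = 0; 0+1 = 1
    return vaddv_u8(not_ff); // one NEON->integer transfer covers all 8 results
}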
There's almost certainly room to optimize this, but I don't know ARM that well, or ARM performance details. NEON is hard to use for anything conditional because of the big performance penalty for NEON->integer, totally unlike x86. Glibc has a NEON memchr with NEON->integer inside the loop, but I don't know whether that path actually gets used, or whether it's faster than scalar.
Speeding up repeated calls to the scalar ARM version:
Zeroing the 256-byte buffer every time would be expensive, but we don't need to do that. Use a sequence number to avoid needing to reset:
Before every new set of elements: ++seq;
For each element in the set:
    sum += (histogram[i] == seq);
    histogram[i] = seq; // no data dependency on the load result, unlike ++
You might make the histogram an array of uint16_t or uint32_t to avoid needing to re-zero when a uint8_t seq wraps. But then it has a bigger cache footprint, so maybe just re-zeroing every 254 sequence numbers makes the most sense.
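A minimal sketch of the sequence-number variant with the uint16_t table just mentioned (function and parameter names are mine; hist and seq persist across calls):
#include <cstdint>

unsigned getCountSeq(const uint8_t data[16], uint16_t hist[256], uint16_t &seq)
{
    ++seq;           // new sequence number instead of re-zeroing the table
    if (seq == 0) {  // the uint16_t wrapped: re-zero once every 65535 blocks
        for (int i = 0; i < 256; ++i) hist[i] = 0;
        seq = 1;
    }
    unsigned count = 0;
    for (int i = 0; i < 16; ++i) {
        uint8_t b = data[i];
        count += (hist[b] != seq); // first sighting of b in this block
        hist[b] = seq;             // store: no dependency on the load result
    }
    return count;
}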
My application does some operations on matrices of large size.
I recently came across the concept of the CPU cache, and the performance effect it can have, through this answer.
I would like to know which of the following algorithms is more cache-friendly in my case.
Algorithm 1:
// K is a constant double; c is a 2-dimensional array of double
for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * c[j][j];
    }
}
Algorithm 2:
double *A = new double[size];
for (int n = 0; n < size; n++)
    A[n] = c[n][n]; // copy the diagonal out once

for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * A[j];
    }
}
The size of my array is more than 1000x1000.
Benchmarking on my laptop shows Algorithm 2 is better than 1 for size 5000x5000.
Please note that I have multi-threaded my application such that each thread operates on a set of rows.
For example: For array of size 1000x1000.
thread1 -> row 0 to row 249
thread2 -> row 250 to row 499
thread3 -> row 500 to row 749
thread4 -> row 750 to row 999
If your benchmarks show a significant improvement for the second case, then it most likely is the better choice. But of course, to know how it behaves on "an average CPU", we'd have to measure a large number of CPUs that could be called average - there is no other way. And it really depends on the definition of "average CPU": are we talking "any x86 (AMD + Intel) CPU", or "any random CPU that we can find in anything from a watch to the latest super-fast creation in the x86 range"?
The "copy the data in c[n][n]" method helps because it gets its own address, and doesn't get thrown out of the (L1) cache when the code walks its way over the larger matrix [and all the data you need for the multiplication is "close together". If you walk c[j][j], every j steps will jump sizeof(double) * (size * j + 1) bytes per iteration, so if size is anything more than 4, the next item needed wont be in the same cache-line, so another memory read is needed to get that data.
In other words, for anything that has a decent size cache (bigger than size * sizeof(double)), it's a definite benefit. Even with smaller cache, it's quite likely SOME benefit, but the chances are higher that the cached copy will be thrown out by some part of c[i][j].
In summary, the second algorithm is very likely better for nearly all options.
Algorithm 2 benefits from what's called "spatial locality": moving the diagonal into a single-dimension array makes it reside in consecutive memory addresses, and thereby:
Enjoys the benefit of fetching multiple useful elements per cache line (presumably 64 bytes, depending on your CPU), better utilizing cache and memory bandwidth (whereas reading c[j][j] in place also fetches a lot of useless data sharing the same lines).
Enjoys the benefit of the HW stream prefetchers (assuming your CPU has them), which aggressively run ahead of your code along the page and bring the data into the lower cache levels in advance, improving memory latency.
It should be pointed out that moving the data into A doesn't automatically improve cacheability, since A still competes against the stream of data constantly coming in from c and thrashing the cache. However, since it's used over and over, there's a high chance a good LRU algorithm will keep it in the cache anyway. You could help that by using streaming (non-temporal) memory operations for array c; note that these are very volatile performance tools, and in some scenarios may reduce performance if not used correctly.
Another potential benefit could come from issuing SW prefetches slightly ahead of reaching each new array line.
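A hedged sketch of that last idea for Algorithm 2, assuming GCC/Clang's __builtin_prefetch and that c is indexable as c[i][j] as in the question (the prefetch distance is a guess to be tuned):
for (int i = 0; i < size; i++)
{
    if (i + 1 < size)
        __builtin_prefetch(&c[i + 1][i + 2]); // hint the start of the next row early
    for (int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * A[j];
    }
}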
I am the author of an open source scientific code called vampire ( http://github.com/richard-evans/vampire ), and since it is compute-intensive, any improvement in code performance can significantly increase the amount of research that can be done. Typical runtimes of this code are hundreds of core-hours, so I am always looking for ways to improve the performance-critical sections. However, I have become stuck on the following relatively innocuous-looking bit of code, which makes up around 40% of the runtime:
for (int atom = start_index; atom < end_index; atom++){
    register double Hx = 0.0;
    register double Hy = 0.0;
    register double Hz = 0.0;
    const int start = atoms::neighbour_list_start_index[atom];
    const int end   = atoms::neighbour_list_end_index[atom] + 1;
    for (int nn = start; nn < end; nn++){
        const int natom = atoms::neighbour_list_array[nn];
        const double Jij = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
        Hx -= Jij * atoms::x_spin_array[natom];
        Hy -= Jij * atoms::y_spin_array[natom];
        Hz -= Jij * atoms::z_spin_array[natom];
    }
    atoms::x_total_spin_field_array[atom] += Hx;
    atoms::y_total_spin_field_array[atom] += Hy;
    atoms::z_total_spin_field_array[atom] += Hz;
}
The high-level overview of the function and variables is as follows: there is a 1D array of physical vectors (split into three 1D arrays, one per component x, y, z, for memory caching purposes: atoms::x_spin_array, etc.) called 'spin'. Each of these spins interacts with some other spins, and all the interactions are stored as a 1D neighbour list (atoms::neighbour_list_array). The relevant range of interactions for each atom is given by start and end indices into the neighbour list array, held in two separate arrays. At the end of the calculation, each atomic spin has an effective field which is the vector sum of its interactions.
Given the small amount of code and the sizable fraction of the runtime it occupies, I have done my best, but I feel there must be a way to optimize this further; as a physicist rather than a computer scientist, maybe I am missing something?
You've got a constant stream of multiplies, subtracts and adds on contiguous data. That seems like an ideal use case for SSE. If it's memory-bandwidth limited, then OpenCL/CUDA instead.
Try using this library if you aren't familiar with all the low-level instructions.
That inner loop could then potentially be restructured significantly, maybe leading to speed-ups.
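To make that concrete, a hedged SSE2 sketch (not a drop-in replacement: the pointer parameters and the flattened Jij table stand in for the question's atoms:: arrays, and the neighbour indirection forces a manual two-wide gather):
#include <emmintrin.h> // SSE2

void exchange_field_sse2(const int *nbr, const int *type, const double *Jtab,
                         const double *sx, const double *sy, const double *sz,
                         int start, int end, double &Hx, double &Hy, double &Hz)
{
    __m128d Hxy = _mm_setzero_pd(); // accumulate (Hx, Hy) packed in one register
    double hz = 0.0;
    for (int nn = start; nn < end; nn++) {
        const int natom  = nbr[nn];
        const double Jij = Jtab[type[nn]];
        __m128d J   = _mm_set1_pd(Jij);
        __m128d sxy = _mm_set_pd(sy[natom], sx[natom]); // manual gather: (y, x)
        Hxy = _mm_sub_pd(Hxy, _mm_mul_pd(J, sxy));
        hz -= Jij * sz[natom];
    }
    double tmp[2];
    _mm_storeu_pd(tmp, Hxy); // unaligned store avoids alignment assumptions
    Hx += tmp[0];
    Hy += tmp[1];
    Hz += hz;
}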
If the x, y, z components are indeed linked lists, doing x[i], y[i] and z[i] will cause the lists to be traversed multiple times, giving (n^2)/2 iterations. Using vectors makes each access an O(1) operation.
You mention that the three coordinates are split out for memory caching purposes, but this will affect the Level 1 and Level 2 cache locality as you are accessing 3 different areas in memory. The linked list is also impacting your cache locality.
Using something like:
#include <vector>

struct vector3d {
    double x;
    double y;
    double z;
};

std::vector<vector3d> spin;
std::vector<vector3d> total_spin;
This should improve the cache locality, as the x, y and z values are adjacent in memory and the spins occupy a linear block of memory.
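The inner loop from the question would then touch one contiguous struct per neighbour, something like:
for (int nn = start; nn < end; nn++) {
    const int natom = atoms::neighbour_list_array[nn];
    const double Jij = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
    Hx -= Jij * spin[natom].x; // x, y, z now come from the same cache line
    Hy -= Jij * spin[natom].y;
    Hz -= Jij * spin[natom].z;
}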
I feel the following suggestions can help you optimize the code a bit, if not completely:
Use initialization over assignment wherever possible.
Prefer pre-increment over post-increment for better speed (believe me, it does make a difference).
Apart from that, I think the code is just fine. Every data structure has its pros and cons... you've got to live with them.
Happy Coding!
I have the following C++ code snippet (the C++ part is the profiler class, which is omitted here), compiled with VS2010 (64-bit Intel machine). The code simply multiplies an array of floats (arr2) by a scalar and puts the result into another array (arr1):
int M = 150, N = 150;
int niter = 20000; // do many iterations to have a significant run-time
float *arr1 = (float *)calloc(M*N, sizeof(float));
float *arr2 = (float *)calloc(M*N, sizeof(float));

// Read data from file into arr2

float scale = float(6.6e-14);

// START_PROFILING
for (int iter = 0; iter < niter; ++iter) {
    for (int n = 0; n < M*N; ++n) {
        arr1[n] += scale * arr2[n];
    }
}
// END_PROFILING

free(arr1);
free(arr2);
The reading-from-file part and the profiling (i.e. run-time measurement) are omitted here for simplicity.
When arr2 is initialized to random numbers in the range [0, 1], the code runs about 10 times faster than when arr2 is initialized as a sparse array in which about 2/3 of the values are zeros. I have played with the compiler options /fp and /O, which changed the run-time a little, but the roughly 1:10 ratio held.
How come the performance depends on the actual values? What does the CPU do differently that makes the sparse data run ~10 times slower?
Is there a way to make the "slow data" run faster, or will any optimization (e.g. vectorizing the calculation) have the same effect on both arrays (i.e., will the "slow data" still run slower than the "fast data")?
EDIT
Complete code is here: https://gist.github.com/1676742, the command line for compiling is in a comment in test.cpp.
The data files are here:
https://ccrma.stanford.edu/~itakatz/tmp/I.bin
https://ccrma.stanford.edu/~itakatz/tmp/I0.bin
Probably that's because your "fast" data consists only of normal floating-point numbers, while your "slow" data contains lots of denormalized numbers.
As for your second question, you can try to improve speed with this (and treat all denormalized numbers as exact zeros):
#include <xmmintrin.h>

// 0x8040 sets both FTZ (flush-to-zero, bit 15) and DAZ (denormals-are-zero,
// bit 6) in the SSE control register MXCSR
_mm_setcsr(_mm_getcsr() | 0x8040);
I can think of two reasons for this.
First, the branch predictor may be making incorrect decisions. This is one potential cause of performance gaps when the data changes but the code does not. In this case, however, it seems very unlikely.
The second possible reason is that your "mostly zeros" data doesn't really consist of zeros, but rather of almost-zeros, or that you're keeping arr1 in the almost-zero range. See this Wikipedia link.
There is nothing strange about the data from I.bin taking longer to process: you have lots of numbers like '1.401e-045#DEN' or '2.214e-043#DEN', where #DEN means the number cannot be normalized to standard float precision. Given that you are going to multiply them by 6.6e-14, you will definitely have underflows, which significantly slow down the calculations.