How is performance dependent on the underlying data values - C++

I have the following C++ code snippet (the C++ part is the profiler class, which is omitted here), compiled with VS2010 on a 64-bit Intel machine. The code simply multiplies an array of floats (arr2) by a scalar and puts the result into another array (arr1):
int M = 150, N = 150;
int niter = 20000; // do many iterations to have a significant run-time
float *arr1 = (float *)calloc(M*N, sizeof(float));
float *arr2 = (float *)calloc(M*N, sizeof(float));
// Read data from file into arr2
float scale = float(6.6e-14);
// START_PROFILING
for (int iter = 0; iter < niter; ++iter) {
    for (int n = 0; n < M*N; ++n) {
        arr1[n] += scale * arr2[n];
    }
}
// END_PROFILING
free(arr1);
free(arr2);
The reading-from-file part and the profiling (i.e. run-time measurement) are omitted here for simplicity.
When arr2 is initialized to random numbers in the range [0, 1], the code runs about 10 times faster than when arr2 is initialized to a sparse array in which about 2/3 of the values are zeros. I have played with the compiler options /fp and /O, which changed the run-time a little, but the ratio of roughly 1:10 remained.
How come the performance depends on the actual values? What does the CPU do differently that makes the sparse data run ~10 times slower?
Is there a way to make the "slow data" run faster, or will any optimization (e.g. vectorizing the calculation) have the same effect on both arrays (i.e., the "slow data" will still run slower than the "fast data")?
EDIT
Complete code is here: https://gist.github.com/1676742, the command line for compiling is in a comment in test.cpp.
The data files are here:
https://ccrma.stanford.edu/~itakatz/tmp/I.bin
https://ccrma.stanford.edu/~itakatz/tmp/I0.bin

Probably that's because your "fast" data consists only of normal floating point numbers, but your "slow" data contains lots of denormalized numbers.
As for your second question, you can try to improve speed with this (and treat all denormalized numbers as exact zeros):
#include <xmmintrin.h>
// Set FTZ (flush-to-zero, bit 15) and DAZ (denormals-are-zero, bit 6) in the
// MXCSR register, so SSE code treats denormal inputs and results as zeros:
_mm_setcsr(_mm_getcsr() | 0x8040);

I can think of two reasons for this.
First, the branch predictor may be making incorrect decisions. This is one way a change in the data, with no change in the code, can cause a performance gap. However, in this case it seems very unlikely.
The second possible reason is that your "mostly zeros" data doesn't really consist of zeros, but rather of almost-zeros, or that you're keeping arr1 in the almost-zero range. See this Wikipedia link.

There is nothing strange about the data from I.bin taking longer to process: you have lots of numbers like '1.401e-045#DEN' or '2.214e-043#DEN', where #DEN means the number cannot be represented as a normalized float (i.e., it is a denormal). Given that you are going to multiply them by 6.6e-14, you'll definitely hit underflow, which significantly slows down the calculations.
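If you want to confirm this, a quick diagnostic sketch (not from the original post) that counts subnormal values in a buffer; calling it on the data loaded from I.bin vs. I0.bin should show the difference:
#include <cmath>   // std::fpclassify, FP_SUBNORMAL
#include <cstddef>

// Count denormal (subnormal) floats in a buffer, e.g. countSubnormals(arr2, M*N).
size_t countSubnormals(const float *arr, size_t count)
{
    size_t n = 0;
    for (size_t i = 0; i < count; ++i)
        if (std::fpclassify(arr[i]) == FP_SUBNORMAL)
            ++n;
    return n;
}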

Related

Strange behavior in matrix formation (C++, Armadillo)

I have a while loop that continues as long as an energy variable (of type double) has not converged to below a certain threshold. One of the variables needed to calculate this energy is an Armadillo matrix of doubles named f_mo. In the while loop, f_mo is updated iteratively, so I calculate f_mo at the beginning of each loop as:
arma::mat f_mo = h_core_mo; // h_core_mo is an Armadillo matrix of doubles
for (size_t p = 0; p < n_mo; p++) { // n_mo is of type size_t
    for (size_t q = 0; q < n_mo; q++) {
        double sum = 0.0;
        for (size_t i = 0; i < n_occ; i++) { // n_occ is of type size_t
            //f_mo(p,q) += 2.0*g_mat_full_qp1_qp1_mo(p*n_mo + q, i*n_mo + i) - g_mat_full_qp1_qp1_mo(p*n_mo + i, i*n_mo + q); // all g_mat_ are Armadillo matrices of doubles
            sum += 2.0*g_mat_full_qp1_qp1_mo(p*n_mo + q, i*n_mo + i) - g_mat_full_qp1_qp1_mo(p*n_mo + i, i*n_mo + q);
        }
        for (size_t i2 = 0; i2 < n_occ2; i2++) { // n_occ2 is of type size_t
            //f_mo(p,q) -= 1.0*g_mat_full_qp1_qp2_mo(p*n_mo + q, i2*n_mo2 + i2);
            sum -= 1.0*g_mat_full_qp1_qp2_mo(p*n_mo + q, i2*n_mo2 + i2);
        }
        f_mo(p,q) += sum;
    }
}
But say I replace the sum (which I add to f_mo(p,q) at the end) with addition to f_mo(p,q) directly (the commented-out code). The output f_mo matrices are identical to machine precision. Nothing about the code should change. The only variables affected in the loop are sum and f_mo. And YET, the code converges to a different energy and in a vastly different number of while-loop iterations. I am at a loss as to the cause of the difference. When I run the same code 2, 3, 4, 5 times, I get the same result every time. When I recompile with no optimization, I get the same issue. When I run on a different computer (controlling for environment), I yet again get a discrepancy in the number of while-loop iterations despite identical f_mo, and the total number of iterations for each method (sum += and f_mo(p,q) +=) differs.
It is worth noting that the point at which the code outputs differ is always g_mat_full_qp1_qp2_mo, which is recalculated later in the while loop. HOWEVER, every variable going into the calculation of g_mat_full_qp1_qp2_mo is identical between the two codes. This leads me to think there is something more profound about C++ that I do not understand. I welcome any ideas as to how you would proceed in debugging this behavior (I am all but certain it is not a typical bug, and I've controlled for environment and optimization).
I'm going to assume this is a Hartree-Fock, or some other kind of electronic structure calculation, where you are adding the two-electron integrals to the core Hamiltonian, and apply some domain knowledge.
Part of that assumption is that the individual elements of the two-electron integrals are very small, in particular compared to the core Hamiltonian. Hence, as 1201ProgramAlarm mentions in their comment, the order of addition will matter. You will get a more accurate result if you add smaller numbers together first, to avoid losing precision when adding two numbers many orders of magnitude apart. Because you iterate this process until the Fock matrix f_mo has tightly converged, you eventually converge to the same value.
In order to add up the numbers in a more accurate order, and hopefully converge faster, most electronic structure programs have a separate routine to calculate the two-electron integrals and then add them to the core Hamiltonian, which is what you are doing, element by element, in your example code.
Presentation on numerical computing.
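To see why the order of addition matters, here is a minimal self-contained illustration (not taken from the code above; it uses float so the effect shows up after few iterations, but the same thing happens with double at larger magnitude ratios):
#include <cstdio>

int main()
{
    const int n = 10000000;  // ten million tiny addends
    const float tiny = 1e-8f;

    float bigFirst = 1.0f;
    for (int i = 0; i < n; ++i)
        bigFirst += tiny;    // each tiny addend is below 1 ulp of 1.0f and is lost

    float smallFirst = 0.0f;
    for (int i = 0; i < n; ++i)
        smallFirst += tiny;  // tiny values accumulate first...
    smallFirst += 1.0f;      // ...then the large term is added once

    std::printf("%f vs %f\n", bigFirst, smallFirst); // roughly 1.0 vs 1.1
    return 0;
}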

Efficiently count number of distinct values in 16-byte buffer in arm neon

Here's the basic algorithm to count number of distinct values in a buffer:
unsigned getCount(const uint8_t data[16])
{
    uint8_t pop[256] = { 0 };
    unsigned count = 0;
    for (int i = 0; i < 16; ++i)
    {
        uint8_t b = data[i];
        if (0 == pop[b])
            count++;
        pop[b]++;
    }
    return count;
}
Can this be done efficiently in NEON somehow, by loading the data into a q-register and doing some bit magic? Alternatively, can I efficiently tell that data has all elements identical, or contains only two distinct values, or more than two?
For example, using vminv_u8 and vmaxv_u8 I can find the min and max elements, and if they are equal I know that data has identical elements. If not, then I can vceq_u8 with the min value and vceq_u8 with the max value, then vorr_u8 these results and check that the result is all 1s. Basically, in NEON it can be done this way. Any ideas how to make it better?
unsigned getCountNeon(const uint8_t data[16])
{
    uint8x16_t s = vld1q_u8(data);
    uint8x16_t smin = vdupq_n_u8(vminvq_u8(s));
    uint8x16_t smax = vdupq_n_u8(vmaxvq_u8(s));
    uint8x16_t res = vdupq_n_u8(1);
    uint8x16_t one = vdupq_n_u8(1);
    for (int i = 0; i < 14; ++i) // this obviously needs to be unrolled
    {
        s = vbslq_u8(vceqq_u8(s, smax), smin, s); // replace max with min
        uint8x16_t smax1 = vdupq_n_u8(vmaxvq_u8(s));
        res = vaddq_u8(res, vaddq_u8(vceqq_u8(smax1, smax), one));
        smax = smax1;
    }
    res = vaddq_u8(res, vaddq_u8(vceqq_u8(smax, smin), one));
    return vgetq_lane_u8(res, 0);
}
With some optimizations and improvements, perhaps a 16-byte block can be processed in 32-48 NEON instructions. Can this be done better in ARM? Unlikely.
Some background on why I ask this question. As I'm working on an algorithm I'm trying different approaches at processing data, and I'm not sure yet what exactly I'll use in the end. Information that might be of use:
- count of distinct elements per 16-byte block
- value that repeats most per 16-byte block
- average per block
- median per block
- speed of light?.. that's a joke, it cannot be computed in NEON from a 16-byte block :)
So I'm trying things, and before I use any approach I want to see if that approach can be well optimized. For example, average per block will basically be memcpy speed on arm64.
If you're expecting a lot of duplicates, and can efficiently get a horizontal min with vminv_u8, this might be better than scalar. Or not, maybe NEON->ARM stalls for the loop condition kill it. >.< But it should be possible to mitigate that with unrolling (and saving some info in registers to figure out how far you overshot).
// AArch64 NEON intrinsics version. I *think* ARM can do these things efficiently,
// except perhaps the loop condition. High latency could be ok, but stalling isn't.
#include <arm_neon.h>

int count_dups(uint8x16_t v)
{
    int dups = (0xFF == vmaxvq_u8(v)); // count = 1 up front if any element is 0xFF
    uint8_t hmin = vminvq_u8(v);       // horizontal minimum
    while (hmin != 0xFF) {
        uint8x16_t min_bcast = vdupq_n_u8(hmin);    // broadcast the minimum
        uint8x16_t matches = vceqq_u8(v, min_bcast);
        v = vorrq_u8(v, matches);                   // min and its dups become 0xFF
        hmin = vminvq_u8(v);
        dups++;
    }
    return dups;
}
This turns unique values into 0xFF, one set of duplicates at a time.
The loop-carried dep chain through v / hmin stays in vector registers; it's only the loop branch that needs NEON->integer.
Minimizing / hiding NEON->integer/ARM penalties
Unroll by 8 with no branches on hmin, leaving results in 8 NEON registers. Then transfer those 8 values; back-to-back transfers of multiple NEON registers to ARM only incur one total stall (of 14 cycles on whatever Jake tested on). Out-of-order execution could also hide some of the penalty for this stall. Then check those 8 integer registers with a fully-unrolled integer loop.
Tune the unroll factor to be large enough that you usually don't need another round of SIMD operations for most input vectors. If almost all of your vectors have at most 5 unique values, then unroll by 5 instead of 8.
Instead of transferring multiple hmin results to integer, count them in NEON. If you can use ARM32 NEON partial-register tricks to put multiple hmin values in the same vector for free, it's only a bit more work to shuffle 8 of them into one vector and compare for not-equal to 0xFF. Then horizontally add that compare result to get a -count.
Or if you have values from different input vectors in different elements of a single vector, you can use vertical operations to add results for multiple input vectors at once without needing horizontal ops.
There's almost certainly room to optimize this, but I don't know ARM that well, or ARM performance details. NEON's hard to use for anything conditional because of the big performance penalty for NEON->integer, totally unlike x86. Glibc has a NEON memchr with NEON->integer in the loop, but I don't know if it uses it or if it's faster than scalar.
Speeding up repeated calls to the scalar ARM version:
Zeroing the 256-byte buffer every time would be expensive, but we don't need to do that. Use a sequence number to avoid needing to reset:
Before every new set of elements: ++seq;
For each element in the set:
    sum += (histogram[i] == seq);
    histogram[i] = seq; // no data dependency on the load result, unlike ++
You might make the histogram an array of uint16_t or uint32_t to avoid needing to re-zero if a uint8_t seq wraps. But then it takes more cache footprint, so maybe just re-zeroing every 254 sequence numbers makes the most sense.
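A minimal sketch of that idea (counting first occurrences, like getCount above, rather than duplicates; the names seq and histogram are illustrative, and the caller is assumed to zero the table once and reset it before seq wraps):
#include <cstdint>

unsigned getCountSeq(const uint8_t data[16], uint16_t histogram[256], uint16_t &seq)
{
    ++seq;                              // new "epoch": all old table entries become stale
    unsigned count = 0;
    for (int i = 0; i < 16; ++i)
    {
        uint8_t b = data[i];
        count += (histogram[b] != seq); // first time we see b in this epoch
        histogram[b] = seq;             // store only; no dependency on the loaded value
    }
    return count;
}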

Pick a matrix cell according to its probability

I have a 2D matrix of positive real values, stored as follow:
vector<vector<double>> matrix;
Each cell can have a value equal to or greater than 0, and this value represents the relative likelihood of the cell being chosen. In particular, for example, a cell with a value of 3 is three times as likely to be chosen as a cell with value 1.
I need to select N cells of the matrix (0 <= N <= total number of cells) randomly, but according to their probability to be selected.
How can I do that?
The algorithm should be as fast as possible.
I describe two methods, A and B.
A works in time approximately N * number of cells, and uses space O(log number of cells). It is good when N is small.
B works in time approximately (number of cells + N) * O(log number of cells), and uses space O(number of cells). So it is good when N is large (or even 'medium'), but it uses a lot more memory; in practice it might be slower in some regimes for that reason.
Method A:
The first thing you need to do is normalize the entries. (It's not clear to me if you assume they are normalized or not.) That means, sum all the entries and divide by the sum. (This part is potentially slow, so it's better if you assume or require that it already happened.)
Then you sample like this:
1. Choose a random [i,j] entry of the matrix (by choosing i, j each uniformly at random from the range of integers 0 to n-1).
2. Choose a uniformly random real number p in the range [0, 1].
3. Check if matrix[i][j] > p. If so, return the pair [i,j]. If not, go back to step 1.
Why does this work? The probability that we end at step 3 with any particular output is equal to the probability that [i][j] was selected (which is the same for each entry) times the probability that the number p was small enough. This is proportional to the value matrix[i][j], so the sampling chooses each entry with the correct proportions.
It's also possible that at step 3 we go back to the start -- does that bias things? Basically, no. The reason is: suppose we arbitrarily choose a number k and then consider the distribution of the algorithm conditioned on stopping exactly after k rounds. No matter what value k we choose, that conditional distribution has to be exactly right by the above argument, since once we eliminate the case that p is too small, the other possibilities all have their proportions correct. Since the distribution is perfect for each value of k that we might condition on, and the overall distribution (not conditioned on k) is an average of the distributions for each value of k, the overall distribution is perfect as well.
If you want to analyze the number of rounds typically needed in a rigorous way, you can do it by analyzing the probability that we actually stop at step 3 in any particular round. Since the rounds are independent, this is the same for every round, which means the number of rounds is geometrically distributed. It is therefore tightly concentrated around its mean, and we can determine the mean from that stopping probability.
The probability that we stop at step 3 can be determined by considering the conditional probability of stopping at step 3 given that any particular entry [i][j] was chosen. By the law of total probability, you get that
Pr[ stop at step 3 ] = sum_{i,j} ( 1/(n^2) * Matrix[i,j] )
Since we assumed the matrix is normalized, this sum reduces to just 1/n^2. So, the expected number of rounds is about n^2 (that is, n^2 up to a constant factor) no matter what the entries in the matrix are. You can't hope to do a lot better than that I think -- that's about the same amount of time it takes to just read all the entries of the matrix, and it's hard to sample from a distribution that you cannot even read all of.
Note: What I described is a way to correctly sample a single element -- to get N elements from one matrix, you can just repeat it N times.
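A minimal sketch of Method A for a single sample, assuming the matrix is square (n x n) and already normalized so that every entry lies in [0, 1]:
#include <random>
#include <utility>
#include <vector>

std::pair<size_t, size_t> sampleCell(const std::vector<std::vector<double>>& matrix,
                                     std::mt19937& rng)
{
    const size_t n = matrix.size();
    std::uniform_int_distribution<size_t> pick(0, n - 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    for (;;)
    {
        size_t i = pick(rng);     // step 1: pick a cell uniformly
        size_t j = pick(rng);
        double p = unit(rng);     // step 2: uniform threshold in [0, 1]
        if (matrix[i][j] > p)     // step 3: accept with probability matrix[i][j]
            return { i, j };
    }
}

// To draw N cells, call sampleCell N times.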
Method B:
Basically you just want to compute a histogram and sample inversely from it, so that you know you get exactly the right distribution. Computing the histogram is expensive, but once you have it, getting samples is cheap and easy.
In C++ it might look like this:
#include <cstdlib> // rand, RAND_MAX
#include <map>
#include <utility>
#include <vector>

// Make histogram
typedef unsigned int uint;
typedef std::pair<uint, uint> upair;
typedef std::map<double, upair> histogram_type;

histogram_type histogram;
double cumulative = 0.0;
for (uint i = 0; i < Matrix.size(); ++i) {
    for (uint j = 0; j < Matrix[i].size(); ++j) {
        cumulative += Matrix[i][j];
        histogram[cumulative] = std::make_pair(i, j);
    }
}

std::vector<upair> result;
for (uint k = 0; k < N; ++k) {
    // Do a sample (this should never repeat... if it does not find a lower bound
    // you could also assert false quite reasonably, since it means something is
    // wrong with the rand() implementation)
    while (1) {
        // Or, for best results, use std::mt19937 or boost::mt19937 and sample a
        // real in the range [0, 1] here.
        double p = cumulative * (rand() / double(RAND_MAX));
        histogram_type::iterator it = histogram.lower_bound(p);
        if (it != histogram.end()) {
            result.push_back(it->second);
            break;
        }
    }
}
return result;
Here the time to make the histogram is something like number of cells * O(log number of cells) since inserting into the map takes time O(log n). You need an ordered data structure in order to get cheap lookup N * O(log number of cells) later when you do repeated sampling. Possibly you could choose a more specialized data structure to go faster, but I think there's only limited room for improvement.
Edit: As #Bob__ points out in the comments, in method (B) as written there is potentially going to be some error due to floating-point round-off if the matrices are quite large, even using type double, at this line:
cumulative += Matrix[i][j];
The problem is that, if cumulative is much larger than Matrix[i][j], beyond what the floating-point precision can handle, then each time this statement is executed you may observe significant errors, which accumulate and introduce significant inaccuracy.
As he suggests, if that happens, the most straightforward way to fix it is to sort the values Matrix[i][j] first. You could even do this in the general implementation to be safe -- sorting these values isn't going to take more time asymptotically than you already spend anyway.
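A sketch of that fix, reusing the names from the code above: build the cumulative sums from the cells sorted by value, so the small entries are accumulated before the large ones.
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

typedef unsigned int uint;
typedef std::pair<uint, uint> upair;

// Collect (value, (i, j)) for every cell, sort ascending by value, then
// accumulate in that order to limit round-off in the running sum.
std::vector<std::pair<double, upair>> cells;
for (uint i = 0; i < Matrix.size(); ++i)
    for (uint j = 0; j < Matrix[i].size(); ++j)
        cells.push_back(std::make_pair(Matrix[i][j], std::make_pair(i, j)));
std::sort(cells.begin(), cells.end()); // ascending by value

std::map<double, upair> histogram;
double cumulative = 0.0;
for (size_t k = 0; k < cells.size(); ++k) {
    cumulative += cells[k].first;      // small increments first
    histogram[cumulative] = cells[k].second;
}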

Where is the bottleneck in this code?

I have the following tight loop that makes up the serial bottleneck of my code. Ideally I would parallelize the function that calls this, but that is not possible.
// n is about 60
for (int k = 0; k < n; k++)
{
    double fone = z[k*n+i+1];
    double fzer = z[k*n+i];
    z[k*n+i+1] = s*fzer + c*fone;
    z[k*n+i]   = c*fzer - s*fone;
}
Are there any optimizations that can be made such as vectorization or some evil inline that can help this code?
I am looking into finding eigen solutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html
Short answer: Change the memory layout of your matrix from row-major order to column-major order.
Long answer:
It seems you are accessing the (i)th and (i+1)th column of a matrix stored in row-major order - probably a big matrix that doesn't fit into the CPU cache as a whole. Basically, on every loop iteration the CPU has to wait for RAM (on the order of a hundred cycles). After a few iterations, theoretically, the address prediction should kick in and the CPU should speculatively load the data items even before the loop accesses them. That should help with RAM latency. But that still leaves the problem that the code uses the memory bus inefficiently: the CPU and memory never exchange single bytes, only cache lines (64 bytes on current processors). Of every 64-byte cache line loaded and stored, your code only touches 16 bytes (a quarter).
Transposing the matrix and accessing it along its storage order would increase memory-bus utilization four-fold. Since that is probably the bottleneck of your code, you can expect a speedup of about the same order.
Whether it is worth it, depends on the rest of your algorithm. Other parts may of course suffer because of the changed memory layout.
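A sketch of what the loop could look like after that change; z_t is a hypothetical transposed copy of z (z_t[r*n + c] == z[c*n + r]), so the two columns touched per iteration become two contiguous rows:
double *row0 = &z_t[i * n];       // was column i of z, now contiguous
double *row1 = &z_t[(i + 1) * n]; // was column i+1 of z, now contiguous
for (int k = 0; k < n; ++k)
{
    double fone = row1[k];
    double fzer = row0[k];
    row1[k] = s*fzer + c*fone;
    row0[k] = c*fzer - s*fone;
}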
I take it you are rotating something (or rather, lots of things, by the same angle (s being a sin, c being a cos))?
Counting backwards is always good fun and cuts out one variable comparison per iteration, and it should work here. Making the counter the index might also save a bit of time (it cuts out a bit of arithmetic, as others have said).
for (int k = (n-1)*n + i; k >= 0; k -= n)
{
    double fone = z[k+1];
    double fzer = z[k];
    z[k+1] = s*fzer + c*fone;
    z[k]   = c*fzer - s*fone;
}
Nothing dramatic here, but it looks tidier if nothing else.
As a first move I'd cache pointers in this loop:
// n is about 60
double *cur_z = &z[0*n + i];
for (int k = 0; k < n; k++)
{
    double fone = *(cur_z + 1);
    double fzer = *cur_z;
    *(cur_z + 1) = s*fzer + c*fone;
    *cur_z       = c*fzer - s*fone;
    cur_z += n;
}
Second, I think it's better to make a templatized version of this function. As a result, you can get a good performance benefit if your matrix holds integer values (since FPU operations are slower).

Digital filter and std::inner_product optimization

In a digital filtering C++ application, I use std::inner_product (with std::vector<double> and std::deque<double>) to compute the dot product between the filter coefficients and the input data, for each data sample. After profiling my application, I figured out that no less than 85% of the execution time is spent in std::inner_product!
To what extent is std::inner_product optimized, in GCC for example?
Does it use SIMD instructions? Does it perform loop unrolling? How can I make sure of that?
Based on this, would it be worth it to implement custom dot-product function(s) (especially if the number of coefficients is low)? (I would like to keep the function as generic as possible, though.)
More specifically, this is the piece of code I use to apply a filter:
std::deque<double> in(filterNum.size(), 0.0);
std::deque<double> out(filterDenom.size() - 1, 0.0);
const double gain = filterDenom.back();
for (unsigned int s = 0, size = data.size(); s < size; ++s) {
    in.pop_front();
    in.push_back(data[s] / gain);
    data[s] = inner_product(in.begin(), in.end(), filterNum.begin(),
                            -inner_product(out.begin(), out.end(), filterDenom.begin(), 0.0));
    out.pop_front();
    out.push_back(data[s]);
}
Typically, I use second order bandpass IIR filters, which means that the size of filterNum and filterDenom (numerator and denominator coefficients of the filter) is 5. data is the vector containing the input samples.
Getting an additional factor of 2 out of this shouldn't be hard if you just write the code directly. Part of it might come from removing some of the generality of inner_product, but some would also come from such things as eliminating the use of deques - if you just keep a pointer into your input array you can index off it and off the filter array in the inner loop, and increment the pointer to the input array in the outer loop.
Each of those inner_products has to use iterators through deques, which are slower than plain pointers.
Most of the (coding) effort then becomes handling the edge conditions.
And take that division out of there - it should be a multiplication by a constant calculated outside the loop.
Inner product itself is pretty efficient (there's not much to do there), but it needs to increment two iterators on each pass through the inner loop. There is no explicit loop unrolling, but a good compiler can unroll a loop that simple. And a compiler is more likely to know how far to unroll a loop before running into instruction cache issues.
Deque iterators are not nearly as efficient as ++ on a pure pointer. There is at least a test on every ++, and there may be more than one assignment.
This is what a simple (FIR) filter can look like, without including the code for the edge conditions (which goes outside of the loop)
double norm = 1.0 / sum;
double *p = data.values();   // start of input data
double *q = output.values(); // start of output buffer
int width = data.size() - filter.size();
for (int i = 0; i < width; ++i)
{
    double *f = filter.values();
    double accumulator = f[0] * p[i];
    for (int j = 1; j < filter.size(); ++j)
    {
        accumulator += f[j] * p[i + j];
    }
    *q++ = accumulator * norm;
}
Note that there are messy details left out, and this is not the same as your filter, but it gives the idea. What's inside the outer loop easily fits in a modern instruction cache. The inner loop may be unrolled by the compiler. Most modern architectures can do the add and multiply in parallel.
You can ask GCC to compute most of the algorithms in <algorithm> and <numeric> in parallel mode; it may give a performance boost if your data set is very large (I think it really only uses OpenMP internally).
However, on small data sets it may cause a performance hit.
A comparison with the other solution would be more than welcome!
http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
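For reference, a minimal sketch of how that looks in practice: nothing in the source has to change, you only recompile with libstdc++'s parallel-mode flags (for a 5-tap filter the ranges are far too small for this to help, so this only pays off on large data sets).
// Compile with: g++ -O2 -fopenmp -D_GLIBCXX_PARALLEL filter.cpp
// With -D_GLIBCXX_PARALLEL, std::inner_product and other <numeric>/<algorithm>
// calls may dispatch to OpenMP-parallel implementations for large ranges.
#include <numeric>
#include <vector>

double dot(const std::vector<double>& a, const std::vector<double>& b)
{
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}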