Histogram approximation for streaming data - C++

This question is a slight extension of the one answered here. I am working on re-implementing a version of the histogram approximation found in Section 2.1 of this paper, and I would like to get all my ducks in a row before beginning this process again. Last time, I used boost::multi_index, but performance wasn't the greatest, and I would like to avoid the insert/find complexity of a std::set, which is logarithmic in the number of buckets. Because of the number of histograms I'm using (one per feature per class per leaf node of a random tree in a random forest), the computational complexity must be as close to constant as possible.
A standard technique used to implement a histogram involves mapping the input real value to a bin number. To accomplish this, one method is to:
initialize a standard C array of size N, where N = number of bins; and
multiply the input value (real number) by some factor and floor the result to get its index in the C array.
This works well for histograms with uniform bin size, and is quite efficient. However, Section 2.1 of the above-linked paper provides a histogram algorithm without uniform bin sizes.
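For concreteness, a minimal sketch of that uniform-bin scheme might look like this (the value range and bin count are arbitrary placeholders, not from the paper):

    #include <cmath>
    #include <vector>

    // Minimal sketch of a fixed-range histogram with uniform bins.
    // The range [min_value, max_value) is a hypothetical choice for illustration.
    class UniformHistogram {
    public:
        UniformHistogram(double min_value, double max_value, int num_bins)
            : min_(min_value),
              scale_(num_bins / (max_value - min_value)),
              counts_(num_bins, 0) {}

        void insert(double x) {
            // Multiply by the scale factor and floor to get the bin index.
            int bin = static_cast<int>(std::floor((x - min_) * scale_));
            if (bin >= 0 && bin < static_cast<int>(counts_.size()))
                ++counts_[bin];  // samples outside the range are simply dropped here
        }

    private:
        double min_, scale_;
        std::vector<int> counts_;
    };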
Another issue is that simply multiplying the input real value by a factor and using the resulting product as an index fails with negative numbers. To resolve this, I considered identifying a '0' bin somewhere in the array. This bin would be centered at 0.0; the bins above/below it could be calculated using the same multiply-and-floor method just explained, with the slight modification that the floored product be added to or subtracted from the index of the '0' bin as necessary.
This then raises the question of merges: the algorithm in the paper merges the two closest bins, as measured from center to center. In practice, this creates a 'jagged' histogram approximation, because some bins will have extremely large counts and others will not. Of course, this is due to the non-uniform bin sizes and doesn't cause any loss of precision. A loss of precision does, however, occur if we try to normalize the non-uniform bins to make them uniform. This is because of the assumption that m/2 samples fall to the left and right of the bin center, where m = bin count. We could model each bin as a Gaussian, but this will still result in a loss of precision (albeit minimal).
So that's where I'm stuck right now, leading to this major question: What's the best way to implement a histogram accepting streaming data and storing each sample in bins of uniform size?

Keep four variables.
int N; // assume for simplicity that N is even
int count[N];
double lower_bound;
double bin_size;
When a new sample x arrives, compute int i = (int)floor((x - lower_bound) / bin_size). If i >= 0 && i < N, then increment count[i]. If i >= N, then repeatedly double bin_size until x - lower_bound < N * bin_size. On every doubling, adjust the counts (optimize this by exploiting sparsity for multiple doublings).
for (int j = 0; j < N / 2; j++) count[j] = count[2 * j] + count[2 * j + 1];
for (int j = N / 2; j < N; j++) count[j] = 0;
The case i < 0 is trickier, since we need to decrease lower_bound as well as increase bin_size (again, optimize for sparsity or adjust the counts in one step).
while (lower_bound > x) {
lower_bound -= N * bin_size;
bin_size += bin_size;
for (int j = N - 1; j > N / 2 - 1; j--) count[j] = count[2 * j - N] + count[2 * j - N + 1];
for (int j = 0; j < N / 2; j++) count[j] = 0;
}
The exceptional cases are expensive, but they happen only a number of times that is logarithmic in the ratio of your data's range to the initial bin size.
If you implement this in floating-point, be mindful that floating-point numbers are not real numbers and that statements like lower_bound -= N * bin_size may misbehave (in this case, if N * bin_size is much smaller than lower_bound). I recommend that bin_size be a power of the radix (usually two) at all times.
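Putting those pieces together, a complete sketch of this growing-range histogram might look roughly like the following (the initial lower_bound and bin_size are arbitrary defaults, bin_size should be kept a power of two as recommended above, and the doubling loops are not optimized for sparsity):

    #include <cmath>
    #include <vector>

    // Sketch of the scheme described above: N stays fixed, and bin_size doubles
    // (staying a power of two) whenever a sample falls outside the current range.
    class StreamingHistogram {
    public:
        explicit StreamingHistogram(int N, double lower_bound = 0.0, double bin_size = 1.0)
            : count_(N, 0), lower_bound_(lower_bound), bin_size_(bin_size) {}

        void insert(double x) {
            const int N = static_cast<int>(count_.size());
            // Grow downward: decrease lower_bound and double bin_size.
            while (x < lower_bound_) {
                lower_bound_ -= N * bin_size_;
                bin_size_ += bin_size_;
                for (int j = N - 1; j >= N / 2; --j)
                    count_[j] = count_[2 * j - N] + count_[2 * j - N + 1];
                for (int j = 0; j < N / 2; ++j)
                    count_[j] = 0;
            }
            // Grow upward: double bin_size until x fits below the upper end.
            while (x - lower_bound_ >= N * bin_size_) {
                bin_size_ += bin_size_;
                for (int j = 0; j < N / 2; ++j)
                    count_[j] = count_[2 * j] + count_[2 * j + 1];
                for (int j = N / 2; j < N; ++j)
                    count_[j] = 0;
            }
            int i = static_cast<int>(std::floor((x - lower_bound_) / bin_size_));
            if (i >= N) i = N - 1;  // guard against floating-point edge cases
            ++count_[i];
        }

    private:
        std::vector<int> count_;
        double lower_bound_;
        double bin_size_;
    };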

Related

Finding median of a set of circular data

I would like to write a C++ function which finds the median of an array of circular data.
For example, consider the readings from a compass, which are assumed to be in [0, 360). Though 1 and 359 appear to be far apart, they are very close due to the circular nature of the reading.
Finding median of N-elements in ordinary data is as follows.
1. sort the data of N-elements (ascending or descending order)
2. If N is odd, median is the (N+1)/2 th element in the sorted array.
3. If N is even, median is the average of the N/2 th and N/2+1 th elements in the sorted array.
However, the wrap-around in circular data takes the problem to a different dimension and makes the solution non-trivial.
A similar question to find mean from circular data is explained here How do you calculate the average of a set of circular data?
The suggestion in the above link is to find the unit vector corresponding to each angle and average those. However, the median requires sorting the data, and sorting vectors doesn't make any sense in this context. Hence I don't think we can use the proposed scheme to find the median!
I've actually given this topic way more thought than is healthy so I'll share my thoughts and findings here. Maybe someone will have a similar problem and find this useful.
I haven't used C++ in many years so please forgive me if I write all the code in C#. I believe a fluent C++ speaker can pretty easily translate the algorithms.
Circular mean
First, let's define the circular mean. It's calculated by converting your points to radians, where your period (256, 360 or whatever - the value that is interpreted to be the same as zero) is scaled to 2*pi. You then calculate the sine and cosine of those radian values. Those are the y and x coordinates of your values on a unit circle. You then sum up all the sines and cosines and calculate atan2. This gives you the average angle, which can be easily converted back to your data point by dividing with the scaling factor.
var scalingFactor = 2 * Math.PI / period;
var sines = 0.0;
var cosines = 0.0;
foreach (var value in inputs)
{
var radians = value * scalingFactor;
sines += Math.Sin(radians);
cosines += Math.Cos(radians);
}
var circularMean = Math.Atan2(sines, cosines) / scalingFactor;
if (circularMean >= 0)
return circularMean;
else
return circularMean + period;
Marginal circular median
The simplest approach to a circular median is just a modified way of handling the circular mean.
The circular median can be calculated in a similar way, by just finding the median of the sines and cosines instead of the sums, and calculating the atan2 of that. This way, you are finding the marginal median of the circle points and taking its angle as a result.
var scalingFactor = 2 * Math.PI / period;
var sines = new List<double>();
var cosines = new List<double>();
foreach (var value in inputs)
{
var radians = value * scalingFactor;
sines.Add(Math.Sin(radians));
cosines.Add(Math.Cos(radians));
}
var circularMedian = Math.Atan2(Median(sines), Median(cosines)) / scalingFactor;
if (circularMedian >= 0)
return circularMedian;
else
return circularMedian + period;
This approach is O(n), robust to outliers and very simple to implement. It may suit your purposes well enough, but it has a problem: rotating the input points will give you different results. Depending on the distribution of your input data, it may or may not be a problem.
Circular arc median
To understand this other approach, you need to stop thinking of means and medians in terms of "this is how it's calculated", but in terms of what the resulting values actually represent.
For non-cyclic data, you get the mean by summing up all the values and dividing by the number of elements. What this number represents, though, is the value with the minimal sum of all squared distances to data elements. (I hear statisticians call this value the L2 estimate of location, but a statistician should probably confirm or deny this.)
Likewise for median. You get it by finding the data element that would end up in the middle if all data were sorted (ideally, using an O(n) selection algorithm, like nth_element in C++). What this number is, though, is a value that has the minimal sum of all absolute (non-squared!) distances to data elements. (Supposedly, this value is called an L1 estimate of location.)
Sorting circular data doesn't help you find a middle, so the usual way of thinking about medians doesn't work, but you can still find this point that minimizes the sum of absolute distances from all data points. Here's the algorithm that I came up with, that runs in O(n) time assuming the input data is normalized to >= 0 and < period, and then sorted. (If you need to do this sorting as part of your calculation, then the runtime is O(n log n).)
It works by going through all the data points and keeping track of the sum of distances. When you shift to the next data point to the right, by a distance D, the sum of distances to all the points on the left increases by D*LeftCount and the sum of distances to all the points on the right decreases by D*RightCount. Then, if some of the left points are now actually right points, because their left distance is larger than period/2, you subtract their previous distance and add the new, correct distance.
For comparing the current sum to the best sum, I added a bit of tolerance to guard against inexact floating point arithmetic.
There may be multiple, or infinitely many, points that satisfy the minimum-distance condition. With non-circular medians over an even number of values, the median can be any value between the two central values. It's usually taken to be the average of those two central values, so I took a similar approach with this median algorithm: I find all data points that minimize the distances and then just calculate the circular mean of those points.
// Requires a sorted list with values normalized to [0,period).
// Doing an initialization pass:
// * candidate is the lowest number
// * finding the index where the circle with this candidate starts
// * calculating the score for this candidate - the sum of absolute distances
// * counting the number of values to the left of the candidate
int i;
var candidate = list[0];
var distanceSum = 0.0;
for (i = 1; i < list.Count; ++i)
{
if (list[i] >= candidate + period / 2)
break;
distanceSum += list[i] - candidate;
}
var leftCount = list.Count - i;
var circleStart = i;
if (circleStart == list.Count)
circleStart = 0;
else
for (; i < list.Count; ++i)
distanceSum += candidate + period - list[i];
var previousCandidate = candidate;
var bestCandidates = new List<double> { candidate };
var bestDistanceSum = distanceSum;
var equalityTolerance = period * 1e-10;
for (i = 1; i < list.Count; ++i)
{
candidate = list[i];
// A formula for correcting the distance given the movement to the right.
// It doesn't take into account that some values may have wrapped to the other side of the circle.
++leftCount;
distanceSum += (2 * leftCount - list.Count) * (candidate - previousCandidate);
// Counting all the values that wrapped to the other side of the circle
// and correcting the sum of distances from the candidate.
if (i <= circleStart)
while (list[circleStart] < candidate + period / 2)
{
--leftCount;
distanceSum += 2 * (list[circleStart] - candidate) - period;
++circleStart;
if (circleStart == list.Count)
{
circleStart = 0;
break; // Letting the next loop continue.
}
}
if (i > circleStart)
while (list[circleStart] < candidate - period / 2)
{
--leftCount;
distanceSum += 2 * (list[circleStart] - candidate) + period;
++circleStart;
}
// Comparing current sum to the best one, using the given tolerance.
if (distanceSum <= bestDistanceSum + equalityTolerance)
{
if (distanceSum >= bestDistanceSum - equalityTolerance)
{
// The numbers are close, so using their average as the next best.
bestDistanceSum = (bestCandidates.Count * bestDistanceSum + distanceSum) / (bestCandidates.Count + 1);
}
else
{
// The new number is significantly better, clearing.
bestDistanceSum = distanceSum;
bestCandidates.Clear();
}
bestCandidates.Add(candidate);
}
previousCandidate = candidate;
}
if (bestCandidates.Count == 1)
return bestCandidates[0];
else
return CircularMean(bestCandidates, period);
Geometric circular median
There is an inconsistency in the previous algorithm, in the way the median is defined in relation to the circular mean. The circular mean minimizes the sum of squared Euclidean distances between points on a circle. In other words, it looks at the straight lines connecting points on the circle, cutting through its interior.
The arc median, as I calculate it above, looks at the arc distances: how far the points are to each other by moving on the perimeter of the circle, not by taking a straight line between them.
I have thought about how to address this issue, if it bothers you, but I haven't really done any experiments so I can't claim the following method works. In short, I believe you could use a modification of the Iteratively reweighted least squares algorithm (IRLS), which is what is usually used to calculate geometric medians.
The idea is to pick a starting value (for instance, the circular mean or the arc median presented above), and calculate the euclidean distance to each point: Di = sqrt(dxi^2 + dyi^2). Circular mean will minimize the squares of those distances, so the weights of each point should cancel out the square and reset to just D: Wi = Di / Di^2, which is just Wi = 1 / Di.
With these weights, calculate the weighted circular mean (same as the circular mean, but multiply each sine and cosine by the weight of that point before summing them up) and repeat the process. Repeat until enough iterations have passed or until the result stops changing much.
The problem with this algorithm is that it has a division by zero if the current solution falls exactly on a data point. Even if the distance isn't exactly zero, the solution will stop moving if you hit close enough to the point because the weight will become enormous compared to all the other ones. This can be fixed by adding a small fixed offset to the distance before dividing by it. This will make the solution suboptimal, but at least it won't stop on a wrong point.
It will still take some number of iterations to dig itself out of that wrong point unless the offset is relatively large, and the final solution is worse the bigger the offset is. So the best approach is probably to start with a fairly large offset and then progressively make it smaller on each subsequent iteration.
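For what it's worth, here is a rough sketch of that reweighting loop (in C++ rather than C#, and entirely unverified; the starting point, iteration count and offset schedule are arbitrary guesses on my part):

    #include <cmath>
    #include <vector>

    // Sketch of the iteratively reweighted circular mean described above.
    // 'period' is the wrap-around value (e.g. 360); 'start' could be the circular
    // mean or the arc median computed earlier.
    double geometricCircularMedian(const std::vector<double>& data, double period,
                                   double start, int iterations = 50) {
        const double PI = std::acos(-1.0);
        const double scale = 2.0 * PI / period;
        double estimate = start;
        double offset = period / 10.0;  // fairly large initial offset
        for (int it = 0; it < iterations; ++it) {
            const double ex = std::cos(estimate * scale);
            const double ey = std::sin(estimate * scale);
            double sumSin = 0.0, sumCos = 0.0;
            for (double v : data) {
                const double x = std::cos(v * scale), y = std::sin(v * scale);
                const double d = std::sqrt((x - ex) * (x - ex) + (y - ey) * (y - ey));
                const double w = 1.0 / (d + offset);  // offset guards against division by zero
                sumCos += w * x;
                sumSin += w * y;
            }
            estimate = std::atan2(sumSin, sumCos) / scale;
            if (estimate < 0.0) estimate += period;
            offset *= 0.5;  // progressively shrink the offset, as suggested above
        }
        return estimate;
    }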
Two properties of median allow inventing two distinct algorithms for median finding.
1) Median minimizes sum of absolute distance to all other elements -- O(n^2) algo:
for (i = 0; i < N; i++)
{
sum = 0;
for (j = 0; j < N; j++)
{
d = fabs(item[i] - item[j]); // assumes items are already normalized to [0, 360)
if (d > 180) d = 360 - d; // wrap-around (circular) distance
sum += d;
}
if (sum < best_so_far) { best_so_far = sum; index = i; }
}
2) Median satisfies that half of items are less and half are greater
sort the items
locate the first window of items (i = 0 ... I) satisfying either
I <= N/2, OR item[I] > item[i] + 180
if the condition for the median is not satisfied, advance either i or I.
requires O(N*log N) for sorting and O(N) for the next scan
Of course, in cyclical data all items (and all points in between data points) can be proper candidates for the median.
For definition and discussion of circular median see
N.I. Fisher's 'Statistical Analysis of Circular Data', Cambridge Univ. Press 1993
and the discussion surrounding equations 2.32 and 2.33. For multi-modal or isotropic data a unique median may not exist.
Find an axis that divides the data into 2 equal groups and choose the end of the axis at the smaller value of the angle. If the sample size is odd the median will be a data point, otherwise it will be the midpoint of 2 data points.
There are packages in other languages (e.g. R, MatLab) that would help provide test values for any function you write.
e.g.
https://www.rdocumentation.org/packages/circular/versions/0.4-93
See in particular median.circular and medianHL.circular
or
Berens, Philipp. ‘CircStat: A MATLAB Toolbox for Circular Statistics’. Journal of Statistical Software 31, no. 1 (23 September 2009): 1–21. https://doi.org/10.18637/jss.v031.i10.
and see circ_median
With your vector of angular data points (i.e. a vector of numbers from 0 to 359), create two new vectors; I'll call them x and y. These two new vectors are the cosine and sine, respectively, of your angular data points.
That is, x[n] = cos(data[n]) and y[n] = sin(data[n]) where data is your angular data vector and n is however many datapoints there are.
Next, add up all the values in the x vector to get a single value, call it say sum_x and add up all the values in the y vector to get a another single value, call it sum_y.
Now you can take the inverse tangent (e.g. atan(sum_y/sum_x)) to get a new value. And this value is very meaningful: it basically tells you which direction your data is "pointing", i.e. where the majority of your data lies. NOTE: You must be careful of dividing by 0 (when sum_x=0) and of the indeterminate form (when both sum_x=0 and sum_y=0). The indeterminate form just means your data is evenly distributed, in which case the median is meaningless; and when sum_x=0 but sum_y!=0, it is effectively atan(inf) or atan(-inf), both of which are known.
EDIT:
My previous answer needed some tweaking after this point.
From here, it is easy. Take the value you got in the previous step (atan(sum_y/sum_x)) and add 180 degrees to that value. This is your reference point of where your data starts and ends. From here, you can sort your angular data with this reference point as both the starting and ending point, and find the median of that data.
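A hedged C++ sketch of that procedure, assuming angles in degrees in [0, 360) and that sum_x and sum_y are not both zero:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Sketch of the reference-point approach described above: find the mean
    // direction, cut the circle on the opposite side, then take the ordinary
    // median of the data ordered from that cut point.
    double circularMedianViaReference(std::vector<double> data) {
        const double PI = std::acos(-1.0);
        double sum_x = 0.0, sum_y = 0.0;
        for (double a : data) {
            sum_x += std::cos(a * PI / 180.0);
            sum_y += std::sin(a * PI / 180.0);
        }
        double mean = std::atan2(sum_y, sum_x) * 180.0 / PI;  // in (-180, 180]
        if (mean < 0.0) mean += 360.0;
        const double cut = std::fmod(mean + 180.0, 360.0);    // reference point opposite the mean
        // Sort by angular offset from the cut point, then pick the middle.
        std::sort(data.begin(), data.end(), [cut](double a, double b) {
            return std::fmod(a - cut + 360.0, 360.0) < std::fmod(b - cut + 360.0, 360.0);
        });
        const size_t n = data.size();
        if (n % 2 == 1)
            return data[n / 2];
        // Even count: average the two middle elements along the circle.
        const double lo = std::fmod(data[n / 2 - 1] - cut + 360.0, 360.0);
        const double hi = std::fmod(data[n / 2] - cut + 360.0, 360.0);
        return std::fmod(cut + (lo + hi) / 2.0, 360.0);
    }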
It is not possible to canonically extend the concept of median to circular data. For the sake of simplicity, let's consider numbers in [0, 10) and, as an example, the (already ordered) set { 1 3 5 7 8 }. Depending on how you rotate the array, you get different values for the median:
1 3 5 7 8 -> 5
3 5 7 8 1 -> 7
5 7 8 1 3 -> 8
...etc...
and any is as good as the other.
I am not claiming that it is not possible to define a median on circular data. I am just claiming that the "normal" median cannot be extended to that case in a meaningful way without adding additional constraints or making an arbitrary choice.

How can I most efficiently map a kernel range for a hermitian (symmetric) matrix in OpenCL?

I'm working on an OpenCL project to generate very large hermitian (symmetric) matrices, and I am trying to determine the best way to generate the work IDs.
A hermitian matrix is symmetric along the diagonal, so that M(i,j) = M*(j,i).
In the brute force way, the for loop looks like:
for(int i = 0; i < N; i++)
{
for(int j = 0; j < N; j++)
{
complex<float> result = doSomeCalculation();
M(i,j) = result;
}
}
However, taking advantage of the hermitian property, the loop can be made to be twice as efficient by only calculating the upper triangular part of the matrix and duplicating the result in the lower triangular part:
for(int i = 0; i < N; i++)
{
for(int j = i; j < N; j++)
{
complex<float> result = doSomeCalculation();
M(i,j) = result;
M(j,i) = conj(result);
}
}
In both loops, doSomeCalculation() is an expensive operation, and each entry in the matrix is completely uncorrelated from every other entry (i.e. the problem is stupidly parallel).
My question is this:
How can I implement the second loop with doSomeCalculation as an OpenCL kernel so that the thread IDs are most efficiently used (i.e. so that the thread calculates both M(i,j) and M(j,i) without having to call doSomeCalculation() twice)?
You need to use a linear index, for example you can index every element of your matrix in this way:
0   1    2     ...   N-1
*   N    N+1   ...   2N-2
*   *    2N-1  ...
...
*   *    *     ...   N(N+1)/2 - 1
That is, the index K is given by:
k = i*N - i*(i+1)/2 + j
Where N is the size of the matrix and (i,j) are respectively the 0-based indices of the row and the column.
This relationship can be inverted; see the answer of this question, which I report here for completeness:
i = floor( ( 2*N + 1 - sqrt( (2*N + 1)*(2*N + 1) - 8*k ) ) / 2 );
j = k - N*i + i*(i + 1)/2;
So you need to enqueue a 1D kernel with N(N+1)/2 work items, and you can decide by yourself the size of the workgroup (usually 64 items per work group is a good choice).
Then in the OpenCL code you can retrieve the index K by using:
int k = get_group_id(0)*64 + get_local_id(0);
Then use the two relationships above to obtain the indices (i, j) of the matrix element you need to compute.
Moreover, notice that you can also save space by representing your hermitian matrix as a linear vector with N(N+1)/2 elements.
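For reference, a small host-side C++ sketch of the forward and inverse mappings (0-based indices; the sqrt-based inversion assumes k and N are small enough to be represented exactly in a double):

    #include <cmath>
    #include <cstdio>

    // Forward mapping: (i, j) with j >= i  ->  linear index k in the packed
    // upper triangle of an N x N matrix.
    inline int upperTriIndex(int i, int j, int N) {
        return i * N - i * (i + 1) / 2 + j;
    }

    // Inverse mapping: linear index k -> (i, j), using the formulas above.
    inline void upperTriCoords(int k, int N, int* i, int* j) {
        *i = (int)std::floor((2.0 * N + 1 - std::sqrt((2.0 * N + 1) * (2.0 * N + 1) - 8.0 * k)) / 2.0);
        *j = k - N * (*i) + (*i) * (*i + 1) / 2;
    }

    int main() {
        const int N = 5;
        // Round-trip check: every k in [0, N(N+1)/2) should map back to itself.
        for (int k = 0; k < N * (N + 1) / 2; ++k) {
            int i, j;
            upperTriCoords(k, N, &i, &j);
            std::printf("k=%2d -> (i=%d, j=%d), back to k=%d\n",
                        k, i, j, upperTriIndex(i, j, N));
        }
        return 0;
    }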
If your matrices are really big, then you can dice up your NxN matrix into (N/k)x(N/k) tiles, each of size kxk. Since you only need half of the data, you create a 1D NDRange of size roughly local_group_size * (N/k)*(N/k)/2.
Every tile of the matrix is processed by one work-group (the work-group size is your choice). The idea is that you create an array on the host side which contains the position of every work-group in the matrix. The kernel stub should look as follows:
__kernel void myKernel(
__global int* coords,
....)
{
int2 WorkGroupPositionInMatrix = vload2(get_group_id(0), coords);
...
DoCalculation();
...
WriteResultTwice();
...
return;
}
What you need to handle by hand are the work-groups that land on the matrix diagonal. If the matrix size is big, the overhead of the work-groups placed on the diagonal is negligible.
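A possible host-side sketch for building that coords array (covering only the upper-triangular tiles, diagonal included; names and types are placeholders of mine, not from the answer):

    #include <vector>

    // Builds the (tileRow, tileCol) position of every work-group so that only the
    // upper-triangular tiles (including the diagonal) of an N x N matrix are
    // covered. 'tilesPerSide' would be N / k for tile size k.
    std::vector<int> buildTileCoords(int tilesPerSide) {
        std::vector<int> coords;                              // packed as x0,y0,x1,y1,...
        coords.reserve(tilesPerSide * (tilesPerSide + 1));    // two ints per tile
        for (int ti = 0; ti < tilesPerSide; ++ti)
            for (int tj = ti; tj < tilesPerSide; ++tj) {
                coords.push_back(ti);                         // tile row
                coords.push_back(tj);                         // tile column
            }
        return coords;  // copy this to a device buffer and read it with vload2 in the kernel
    }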
A right triangle can be cut in half vertically and the smaller portion rotated to fit with the larger portion to form a rectangle of equal area. Therefore it is easy to make your triangular global work area into one that is rectangular, which fits OpenCL.
See my answer here: OpenCL efficient way to group a lower triangular matrix

Faster computation of (approximate) variance needed

I can see with the CPU profiler, that the compute_variances() is the bottleneck of my project.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
75.63      5.43      5.43       40   135.75   135.75  compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
19.08      6.80      1.37                             readDivisionSpace(Division_Euclidean_space&, char*)
...
Here is the body of the function:
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims) {
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
float delta, n;
for (size_t i = 0; i < points.size(); ++i) {
n = 1.0 + i;
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, points[0].dim(), t, split_dims);
}
where kthLargest() doesn't seem to be a problem, since I see that:
0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)
The compute_variances() function takes a vector of vectors of floats (i.e. a vector of Points, where Point is a class I have implemented) and computes the variance in each dimension (following Knuth's algorithm).
Here is how I call the function:
float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];
compute_variances(t, *points, avg, var, split_dims);
The question is, can I do better? I would be really happy to accept a trade-off between speed and an approximate computation of the variances. Or maybe I could make the code more cache-friendly, or something else?
I compiled like this:
g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg
Notice that before the edit I had used -o3, with a lowercase 'o', not a capital 'O'. Thanks to ypnos, I now compile with the optimization flag -O3. I am sure there was a difference between them, since I performed time measurements with one of these methods on my pseudo-site.
Note that now, compute_variances is dominating the overall project's time!
[EDIT]
compute_variances() is called 40 times.
Per 10 calls, the following hold true:
points.size() = 1000 and points[0].dim = 10000
points.size() = 10000 and points[0].dim = 100
points.size() = 10000 and points[0].dim = 10000
points.size() = 100000 and points[0].dim = 100
Each call handles different data.
Q: How fast is access to points[i][d]?
A: points[i] is just the i-th element of the std::vector, and the second [] is implemented as follows in the Point class.
const FT& operator [](const int i) const {
if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning
}
where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict correctly that the branch is always true? (I mean after the cold start). Moreover, the std::vector.at() is supposed to be constant time (as said in the ref). I changed this to have only .at() in the body of the function and the time measurements remained, pretty much, the same.
The division in compute_variances() is for sure heavy! However, Knuth's algorithm is numerically stable, and I was not able to find another algorithm that is both numerically stable and free of division.
Note that I am not interested in parallelism right now.
[EDIT.2]
Minimal example of the Point class (I don't think I've forgotten to show anything):
class Point {
public:
typedef float FT;
...
/**
* Get dimension of point.
*/
size_t dim() const {
return coords.size();
}
/**
* Operator that returns the coordinate at the given index.
* @param i - index of the coordinate
* @return the coordinate at index i
*/
FT& operator [](const int i) {
return coords.at(i);
//it's the same if I have the commented code below
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
/**
* Operator that returns the coordinate at the given index. (constant)
* @param i - index of the coordinate
* @return the coordinate at index i
*/
const FT& operator [](const int i) const {
return coords.at(i);
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
private:
std::vector<FT> coords;
};
1. SIMD
One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE, AVX instructions. Based on your word length and processor you can get speedups of about x4 or even more. This code here:
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
can be sped-up by doing the computation for four elements at once with SSE. As your code really only processes one single element in each loop iteration, there is no bottleneck. If you go down to 16bit short instead of 32bit float (an approximation then), you can fit eight elements in one instruction. With AVX it would be even more, but you need a recent processor for that.
It is not the whole solution to your performance problem, but it is one piece that can be combined with the others.
2. Micro-parallelism
The second easy speedup when you have that many loops is to use parallel processing. I typically use Intel TBB, others might suggest OpenMP instead. For this you would probably have to change the loop order. So parallelize over d in the outer loop, not over i.
You can combine both techniques, and if you do it right, on a quadcore with HT you might get a speed-up of 25-30 for the combination without any loss in accuracy.
3. Compiler optimization
First of all maybe it is just a typo here on SO, but it needs to be -O3, not -o3!
As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta and n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The argument to the latter depends on your CPU, but nowadays -march=core2 is typically fine (also for recent AMDs) and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).
The big problem with your data structure is that it's essentially a vector<vector<float> >. That's a pointer to an array of pointers to arrays of float with some bells and whistles attached. In particular, accessing consecutive Points in the vector doesn't correspond to accessing consecutive memory locations. I bet you see tons and tons of cache misses when you profile this code.
Fix this before horsing around with anything else.
Lower-order concerns include the floating-point division in the inner loop (compute 1/n in the outer loop instead) and the big load-store chain that is your inner loop. You can compute the means and variances of slices of your array using SIMD and combine them at the end, for instance.
The bounds-checking once per access probably doesn't help, either. Get rid of that too, or at least hoist it out of the inner loop; don't assume the compiler knows how to fix that on its own.
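As an illustration of the layout fix, one possible contiguous arrangement might look like this (a sketch, not a drop-in replacement for the existing Point class; avg and var are assumed to be zero-initialized and of size dim):

    #include <vector>

    // Sketch of a flat, cache-friendly layout: all coordinates in one contiguous
    // buffer, point i occupying elements [i*dim, (i+1)*dim).
    struct PointSet {
        size_t dim;
        std::vector<float> coords;              // size = num_points * dim

        const float* point(size_t i) const { return &coords[i * dim]; }
        size_t size() const { return coords.size() / dim; }
    };

    // The inner loop then walks consecutive memory, with the division hoisted out.
    void accumulate(const PointSet& ps, std::vector<float>& avg, std::vector<float>& var) {
        for (size_t i = 0; i < ps.size(); ++i) {
            const float inv_n = 1.0f / (1.0f + i);  // one division per point
            const float* p = ps.point(i);
            for (size_t d = 0; d < ps.dim; ++d) {
                const float delta = p[d] - avg[d];
                avg[d] += delta * inv_n;
                var[d] += delta * (p[d] - avg[d]);
            }
        }
    }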
Here's what I would do, in guesstimated order of importance:
Return the floating-point from the Point::operator[] by value, not by reference.
Use coords[i] instead of coords.at(i), since you already assert that it's within bounds. The at member checks the bounds. You only need to check it once.
Replace the home-baked error indication/checking in the Point::operator[] with an assert. That's what asserts are for. They are nominally no-ops in release mode - I doubt that you need to check it in release code.
Replace the repeated division with a single division and repeated multiplication.
Remove the need for wasted initialization by unrolling the first two iterations of the outer loop.
To lessen the impact of cache misses, run the inner loop alternately forwards and backwards. This at least gives you a chance at reusing some cached avg and var values. It may in fact remove all cache misses on avg and var if prefetching works on a reversed order of iteration, as it well should.
On modern C++ compilers, the std::fill and std::copy can leverage type alignment and have a chance at being faster than the C library memset and memcpy.
The Point::operator[] will have a chance of getting inlined in the release build and can reduce to two machine instructions (effective address computation and floating point load). That's what you want. Of course it must be defined in the header file, otherwise the inlining will only be performed if you enable link-time code generation (a.k.a. LTO).
Note that the Point::operator[]'s body is only equivalent to the single-line
return coords.at(i) in a debug build. In a release build the entire body is equivalent to return coords[i], not return coords.at(i).
FT Point::operator[](int i) const {
assert(i >= 0 && i < (int)coords.size());
return coords[i];
}
const FT * Point::constData() const {
return &coords[0];
}
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims)
{
assert(points.size() > 0);
const int D = points[0].dim();
// i = 0, i_n = 1
assert(D > 0);
#if __cplusplus >= 201103L
std::copy_n(points[0].constData(), D, avg);
#else
std::copy(points[0].constData(), points[0].constData() + D, avg);
#endif
// i = 1, i_n = 0.5
if (points.size() >= 2) {
assert(points[1].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[1][d] - avg[d];
avg[d] += delta * 0.5f;
var[d] = delta * (points[1][d] - avg[d]);
}
} else {
std::fill_n(var, D, 0.0f);
}
// i = 2, ...
for (size_t i = 2; i < points.size(); ) {
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = 0; d < D; ++d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
if (i >= points.size()) break;
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, D, t, split_dims);
}
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
This code could be optimized by simply using memset. The IEEE 754 representation of 0.0 in 32 bits is 0x00000000. If the dimension is big, it is worth it.
Something like:
memset((void*)avg, 0, points[0].dim() * sizeof(float));
In your code, you have a lot of calls to points[0].dim(). It would be better to call once at the beginning of the function and store in a variable. Likely, the compiler already does this (since you are using -O3).
The division operations are a lot more expensive (from clock-cycle POV) than other operations (addition, subtraction).
avg[d] += delta / n;
It could make sense to try to reduce the number of divisions: use a partial, non-cumulative average calculation, which would result in Dim division operations for N elements (instead of N x Dim), where N < points.size().
A huge speedup could be achieved using CUDA or OpenCL, since the calculation of avg and var could be done simultaneously for each dimension (consider using a GPU).
Another optimization is cache optimization including both data cache and instruction cache.
High level optimization techniques
Data Cache optimizations
Example of data cache optimization & unrolling
for (size_t d = 0; d < points[0].dim(); d += 4)
{
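// Note: this unrolled form assumes points[0].dim() is a multiple of 4.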
// Perform loading all at once.
register const float p1 = points[i][d + 0];
register const float p2 = points[i][d + 1];
register const float p3 = points[i][d + 2];
register const float p4 = points[i][d + 3];
register const float delta1 = p1 - avg[d+0];
register const float delta2 = p2 - avg[d+1];
register const float delta3 = p3 - avg[d+2];
register const float delta4 = p4 - avg[d+3];
// Perform calculations
avg[d + 0] += delta1 / n;
var[d + 0] += delta1 * ((p1) - avg[d + 0]);
avg[d + 1] += delta2 / n;
var[d + 1] += delta2 * ((p2) - avg[d + 1]);
avg[d + 2] += delta3 / n;
var[d + 2] += delta3 * ((p3) - avg[d + 2]);
avg[d + 3] += delta4 / n;
var[d + 3] += delta4 * ((p4) - avg[d + 3]);
}
This differs from classic loop unrolling in that loading from the matrix is performed as a group at the top of the loop.
Edit 1:
A subtle data optimization is to place avg and var into a structure. This will ensure that the two arrays are next to each other in memory, without padding between them. The data-fetching mechanism in processors likes data that are very close to each other: there is less chance of a data cache miss and a better chance of loading all of the data into the cache.
You could use Fixed Point math instead of floating point math as an optimization.
Optimization via Fixed Point
Processors love to manipulate integers (signed or unsigned). Floating point may take extra computing power due to extracting the parts, performing the math, then reassembling the parts. One mitigation is to use Fixed Point math.
Simple Example: meters
Given the unit of meters, one could express lengths smaller than a meter by using floating point, such as 3.14159 m. However, the same length can be expressed in a unit of finer detail like millimeters, e.g. 3141.59 mm. For finer resolution, a smaller unit is chosen and the value multiplied, e.g. 3,141,590 um (micrometers). The point is choosing a small enough unit to represent the floating point accuracy as an integer.
The floating point value is converted to Fixed Point at input. All data processing occurs in Fixed Point. The Fixed Point value is converted back to floating point before output.
Power of 2 Fixed Point Base
As with converting from floating point meters to fixed point millimeters, using 1000, one could use a power of 2 instead of 1000. Selecting a power of 2 allows the processor to use bit shifting instead of multiplication or division. Bit shifting by a power of 2 is usually faster than multiplication or division.
Keeping with the theme and accuracy of millimeters, we could use 1024 as the base instead of 1000. Similarly, for higher accuracy, use 65536 or 131072.
Summary
Changing the design or implementation to use Fixed Point math allows the processor to use more integer data-processing instructions than floating point ones. Floating point operations consume more processing power than integer operations in all but specialized processors. Using powers of 2 as the base (or denominator) allows code to use bit shifting instead of multiplication or division. Division and multiplication take more operations than shifting, and thus shifting is faster. So rather than optimizing code for execution (such as loop unrolling), one could try using Fixed Point notation rather than floating point.
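As a toy illustration of the power-of-two scaling (a sketch only; whether fixed point actually beats hardware floating point for this workload would need measuring):

    #include <cstdint>

    // Toy Q16.16 fixed-point sketch: 16 fractional bits, i.e. a scale of 65536.
    using fixed_t = std::int32_t;
    constexpr int FRAC_BITS = 16;

    inline fixed_t toFixed(float x)   { return static_cast<fixed_t>(x * (1 << FRAC_BITS)); }
    inline float   toFloat(fixed_t x) { return static_cast<float>(x) / (1 << FRAC_BITS); }

    inline fixed_t fixedMul(fixed_t a, fixed_t b) {
        // Widen to 64 bits, then shift back down by the number of fractional bits.
        return static_cast<fixed_t>((static_cast<std::int64_t>(a) * b) >> FRAC_BITS);
    }

    inline fixed_t fixedDiv(fixed_t a, fixed_t b) {
        return static_cast<fixed_t>((static_cast<std::int64_t>(a) << FRAC_BITS) / b);
    }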
Point 1.
You're computing the average and the variance at the same time.
Is that right?
Don't you have to calculate the average first, then once you know it, calculate the sum of squared differences from the average?
In addition to being right, it's more likely to help performance than hurt it.
Trying to do two things in one loop is not necessarily faster than two consecutive simple loops.
Point 2.
Are you aware that there is a way to calculate average and variance at the same time, like this:
double sumsq = 0, sum = 0;
for (i = 0; i < n; i++){
double xi = x[i];
sum += xi;
sumsq += xi * xi;
}
double avg = sum / n;
double avgsq = sumsq / n;
double variance = avgsq - avg*avg;
Point 3.
The inner loops are doing repetitive indexing.
The compiler might be able to optimize that to something minimal, but I wouldn't bet my socks on it.
Point 4.
You're using gprof or something like it.
The only reasonably reliable number to come out of it is self-time by function.
It won't tell you very well how time is spent inside the function.
I and many others rely on this method, which takes you straight to the heart of what takes time.

Calculate squared Euclidean distance matrix on GPU

Let p be a matrix of the first set of locations, where each row gives the coordinates of a particular point. Similarly, let q be a matrix of the second set of locations, where each row gives the coordinates of a particular point.
Then formula for pairwise squared Euclidean distance is:
k(i,j) = (p(i,:) - q(j,:))*(p(i,:) - q(j,:))',
where p(i,:) denotes i-th row of matrix p, and p' denotes the transpose of p.
I would like to compute matrix k on CUDA-enabled GPU (NVidia Tesla) in C++. I have OpenCV v.2.4.1 with GPU support but I'm open to other alternatives, like Thrust library. However, I'm not too familiar with GPU programming. Can you suggest an efficient way to accomplish this task? What C++ libraries should I use?
The problem looks simple enough to make a library overkill.
Without knowing the range of i and j, I'd suggest you partition k into blocks of a multiple of 32 threads each and in each block, compute
float sum, temp, myp[d]; // d must be a compile-time constant here
int i = blockIdx.x*blockDim.x + threadIdx.x;
for ( int kk = 0 ; kk < d ; kk++ )
myp[kk] = p(i,kk);
for ( int j = blockIdx.y*blockDim.y ; j < (blockIdx.y+1)*blockDim.y ; j++ ) {
sum = 0.0f;
#pragma unroll
for ( int kk = 0 ; kk < d ; kk++ ) {
temp = myp[kk] - q(j,kk);
sum += temp*temp;
}
k(i,j) = sum;
}
where I am assuming that your data has d dimensions, and I write p(i,k), q(j,k) and k(i,j) to mean an access to a two-dimensional array. I also took the liberty of assuming that your data is of type float.
Note that depending on how k is stored, e.g. row-major or column-major, you may want to loop over i per thread instead to get coalesced writes to k.

UBLAS Matrix Finding Surrounding Values of a Cell?

I am looking for an elegant way to implement this. Basically I have an m x n matrix, where each cell holds a pixel value and the rows and columns correspond to the pixel rows and pixel columns of the image.
Since I basically mapped points from an HDF file along with their corresponding pixel values, we have a lot of empty pixels, which are filled with 0.
Now what I need to do is take the average of the surrounding cells to estimate a pixel value for the missing cell.
I can brute-force this, but it becomes ugly fast. Is there any sort of elegant solution for this?
There's a well-known optimization to this filtering problem.
Integrate the cells in one direction (say horizontally)
Integrate the cells in the other direction (say vertically)
Take the difference between each cell and its N'th neighbor to the right.
Take the difference between each cell and its N'th neighbor below.
Like this:
for (int i = 0; i < h; ++i)
for (int j = 0; j < w-1; ++j)
A[i][j+1] += A[i][j];
for (int i = 0; i < h-1; ++i)
for (int j = 0; j < w; ++j)
A[i+1][j] += A[i][j];
for (int i = 0; i < h; ++i)
for (int j = 0; j < w-N; ++j)
A[i][j] -= A[i][j+N];
for (int i = 0; i < h-N; ++i)
for (int j = 0; j < w; ++j)
A[i][j] -= A[i+N][j];
What this does is:
The first pass makes each cell the sum of all of the cells on its row to its left, including itself.
After the 2nd pass, each cell is the sum of all of the cells in the rectangle above and to the left of itself (including its own row and column).
After the 3rd pass, each cell is the sum of a rectangle above and to the right of itself, N columns wide.
After the 4th pass each cell is the sum of an NxN rectangle below and to the right of itself.
This takes 4 operations per cell to compute the sum, as opposed to 8 for brute force (assuming you're doing a 3x3 averaging filter).
The cool thing is that if you use ordinary two's-complement arithmetic, you don't have to worry about any overflows in the first two passes; they cancel out in the last two passes.
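To make the read-out concrete: with the passes above and N = 3, each interior output can be turned into a neighbourhood average like this (a sketch; the index bounds are my own bookkeeping, so double-check them for your image sizes):

    // After the four passes with N = 3, A[i][j] holds the sum of the 3x3 block
    // whose top-left corner is (i+1, j+1), so the average centred on (r, c) is:
    float avg = A[r - 2][c - 2] / 9.0f;  // valid for 2 <= r <= h-2 and 2 <= c <= w-2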
The main issues here are utilizing all available cores and cache efficiency.
You might be interested in checking fast implementation of convolution.
However, since you are doing it with Boost, you can check how this is done in this Boost example.
I believe you only have to change the convolution kernel for your specialized task.