Why would this search method not be scalable?

I want to parallelize my search algorithm using OpenMP. vTree is a binary search tree, and I want to apply my search algorithm to each point in a point set. Below is a snippet of my code. The searches for two different points are completely independent and so can run in parallel. They do read the same tree, but once constructed, the tree is never modified again; it is read-only.
However, the code below shows terrible scalability: on my 32-core platform, only a 2x speedup is achieved. Is it because vTree is read by all threads? If so, how can I further optimize the code?
auto results = vector<vector<Point>>(particleNum);
auto t3 = high_resolution_clock::now();
double radius = 1.6;
#pragma omp parallel for
for (decltype(points.size()) i = 0; i < points.size(); i++)
{
    vTree.search(points[i], radius, results[i]);
}
auto t4 = high_resolution_clock::now();
double searchTime = duration_cast<duration<double>>(t4 - t3).count();
The type signature of search is:
void VPTree::search(const Point& p, double radius, vector<Point>& result) const
The search results are put into result.

My best guess would be that you are cache ping-ponging (false sharing) on the result vectors. I would assume that your search function uses the passed-in result vector as a place to put points, and that you use it throughout the algorithm to insert neighbors as you encounter them in the search tree. Whenever you add a point to that result vector, the internal data of that vector object (e.g., its size and end pointer) is modified. And because all of your result vectors are packed together in contiguous memory, it is likely that different result vectors occupy the same cache line. So, when the CPU maintains cache coherence, it will constantly lock the relevant cache lines.
The way to solve it is to use an internal, temporary vector that you only assign to the results vector once at the end (which can be done cheaply if you use move semantics). Something like this:
void VPTree::search(const Point& p, double radius, vector<Point>& result) const {
    vector<Point> tmp_result;
    // ... add results to "tmp_result"
    result = std::move(tmp_result);
}
Or, you could also just return the vector by value (which is implicitly using a move):
vector<Point> VPTree::search(const Point& p, double radius) const {
    vector<Point> result;
    // ... add results to "result"
    return result;
}
Welcome to the joyful world of move-semantics and how awesome it can be at solving these types of concurrency / cache-coherence issues.
It is also conceivable that you're experiencing problems related to accessing the same tree from all threads. But since these are all read-only operations, I'm pretty sure that even on a conservative architecture like x86 (and other Intel / AMD CPUs) this should not pose a significant issue, though I might be wrong (maybe a kind of "over-subscription" problem is at play, but that's dubious). Another possibility is OpenMP overhead: spawning threads and synchronizing them is not free, and that cost has to be weighed against the computational cost of the actual operations inside the parallel loop (and it's not always a favorable trade-off). And finally, if your VPTree (which I imagine stands for "vantage-point tree") does not have good locality of reference (e.g., you implemented it as a linked tree with individually allocated nodes), then performance is going to be terrible whichever way you use it (as I explain here).
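To make that last point concrete, here is a minimal sketch of an index-linked node pool; Node and PoolVPTree are illustrative names, not the asker's actual classes, and Point stands in for whatever the real point type is:
#include <vector>

struct Point { double x, y, z; };  // stand-in for the asker's Point type

struct Node {
    Point  vantage;
    double threshold;     // radius separating "inside" from "outside"
    int    inside  = -1;  // index into the pool, -1 means no child
    int    outside = -1;
};

struct PoolVPTree {
    std::vector<Node> pool;  // one allocation for the whole tree
    // build() and search() traverse pool[n].inside / pool[n].outside by
    // index, so a search walks one contiguous block of memory instead of
    // chasing pointers scattered across the heap; the tree stays
    // read-only after construction, as in the question.
};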

Related

SIMD optimisation for cross-pattern access

I'm trying to write a Monte Carlo simulation of the Ising model, and I was wondering if it is possible to use SIMD optimisations for accessing data in a cross pattern.
I basically want to know if there's any way of speeding up this function.
// up/down/left/right stencil accumulation
float lattice::compute_point_energy(int row, int col) {
    int accumulator = 0;
    accumulator += get(row ? row - 1 : size_ - 1, col);
    accumulator += get((row + 1) % size_, col);
    accumulator += get(row, col ? col - 1 : size_ - 1);
    accumulator += get(row, (col + 1) % size_);
    return -get(row, col) * (accumulator * J_ + H_);
}
get(i, j) is a method that accesses a flat std::vector of shorts. I see that there might be a few problems: the access involves a lot of ternary logic (for the periodic boundary conditions), and none of the vector elements are adjacent. Is it possible to make SIMD optimisations for this chunk, or should I keep digging? Re-implementing the adjacency matrix and/or using a different container (e.g., an array, or a vector of a different type) are an option.
SIMD is the last thing you'll want to try with this function.
I think you're trying to use an up/down/left/right 4-stencil for your computation. If so, your code should have a comment noting this.
You're losing a lot of speed in this function because of the potential for branching at your ternary operators and because modulus is relatively slow.
You'd do well to surround the two-dimensional space you're operating over with a ring of cells set to values appropriate for handling the edge effects (a sketch appears at the end of this answer). This allows you to eliminate the checks for edge effects.
For accessing your stencil, I find it often works to use something like the following:
const int width = 10;
const int height = 10;
const int offset[4] = {-1, 1, -width, width};
double accumulator = 0;
for (int i = 0; i < 4; i++)
    accumulator += get(current_loc + offset[i]);
Notice that the mini-array has precalculated offsets to the neighbouring cells in your domain. A good compiler will likely unroll the foregoing loop.
Once you've done all this, appropriate choice of optimization flags may lead to automatic vectorization.
As it is, the branching and mods in your code are likely preventing auto-vectorization. You can check this by enabling the compiler's vectorization reports. For the Intel C++ Compiler (icc), you'll want:
-qopt-report=5 -qopt-report-phase:vec
For GCC you'll want (if I recall correctly):
-fopt-info-vec -fopt-info-missed
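To make the ring-of-cells idea concrete, here is a minimal sketch, assuming a square lattice of shorts stored row-major with a one-cell halo; HaloLattice and its members are illustrative names, not the asker's class. The halo cells must be re-copied from the opposite edges after each update sweep, but the hot function itself is branch-free and modulus-free:
#include <vector>

struct HaloLattice {
    int stride_;                // interior size + 2 halo cells per row
    std::vector<short> cells_;  // (size+2) * (size+2), row-major
    float J_, H_;

    short get(int i) const { return cells_[i]; }

    float compute_point_energy(int row, int col) const {
        const int loc = (row + 1) * stride_ + (col + 1);  // skip the halo
        const int offset[4] = {-1, 1, -stride_, stride_};
        int accumulator = 0;
        for (int k = 0; k < 4; ++k)   // no ternaries, no %; easy to unroll
            accumulator += get(loc + offset[k]);
        return -get(loc) * (accumulator * J_ + H_);
    }
};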

Memory Efficiency - Eigen::VectorXd in a loop

I have a Measurement object that has two Eigen::VectorXd members -- one for position and the other velocity.
Measurements are arranged in a dataset by scans -- i.e., at each timestep, a new scan of measurements is added to the dataset. These types are defined as:
typedef std::shared_ptr<Measurement> MeasurementPtr;
typedef std::vector<MeasurementPtr> scan_t;
typedef std::vector<scan_t> dataset_t;
At the beginning of each iteration of my algorithm, I need to apply a new transformation to each measurement. Currently, I have:
for (auto scan = dataset_.begin(); scan != dataset_.end(); ++scan)
    for (auto meas = scan->begin(); meas != scan->end(); ++meas) {
        // Transform this measurement to bring it into the same
        // coordinate frame as the current scan
        if (scan != std::prev(dataset_.end())) {
            core::utils::perspective_transform(T_, (*meas)->pos);
            core::utils::perspective_transform(T_, (*meas)->vel);
        }
    }
Where perspective_transform is defined as
void perspective_transform(const Eigen::Projective2d& T, Eigen::VectorXd& pos) {
    pos = (T*pos.homogeneous()).hnormalized();
}
Adding this code increases computation time by 40x when I run the algorithm on a dataset whose scans each contain 50 measurements -- making it rather slow. I believe this is because I have 550 small objects, each requiring 2 Eigen memory writes. I removed the writing of the result to memory and my benchmark showed only a slight decrease -- suggesting that this is a memory-efficiency problem and not a computation bottleneck.
How can I speed up this computation? Is there a way to first loop through and create an Eigen::Matrix from an Eigen::Map, so that I could do the computation once and have it automatically update the two members of all the Measurement objects?
You might want to rework your data-structures.
Currently you have an array-of-struct (AOS), with a number of indirections.
A structure-of-arrays (SOA) is generally more efficient in memory access.
What about:
struct Scan_t
{
    Eigen::MatrixXd position;
    Eigen::MatrixXd velocity;
};
The .rowwise() and .colwise() operators might be powerful enough to do the homogeneous transform, which would save you writing the inner loop.
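A minimal sketch of that idea, assuming 2D positions and velocities stored one measurement per column; Scan and transform_scan are illustrative names, not the asker's types. Each statement transforms a whole scan in one Eigen expression (the matrix product evaluates into a temporary, so assigning back to the same member is safe):
#include <Eigen/Dense>
#include <Eigen/Geometry>

struct Scan {
    Eigen::Matrix2Xd position;  // one measurement per column
    Eigen::Matrix2Xd velocity;
};

// Homogenize all columns at once, apply the 3x3 projective matrix,
// then de-homogenize -- no per-measurement loop.
void transform_scan(const Eigen::Projective2d& T, Scan& scan) {
    scan.position = (T.matrix() * scan.position.colwise().homogeneous())
                        .colwise().hnormalized();
    scan.velocity = (T.matrix() * scan.velocity.colwise().homogeneous())
                        .colwise().hnormalized();
}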

Optimizing a quadruple nested "for" loop

I'm developing a 2D numerical model in c++, and I would like to speed up a specific member function that is slowing down my code. The function is required to loop over every i,j grid point in the model and then perform a double summation at every grid point over l and m. The function is as follows:
int Class::Function(void) {
    double loadingEta;
    int i, j, l, m;
    // etaLatLen = 64, etaLonLen = 2*64
    // l_max = 12
    for (i = 0; i < etaLatLen; i++) {
        for (j = 0; j < etaLonLen; j++) {
            loadingEta = 0.0;
            for (l = 0; l < l_max + 1; l++) {
                for (m = 0; m <= l; m++) {
                    loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);
                }
            }
            etaNewArray[i][j] = loadingEta;
        }
    }
    return 1;
}
I've been trying to change the loop order to speed things up, but to no avail. Any help would be much appreciated. Thank you!
EDIT 1:
All five arrays are allocated in the constructor of my class as follows:
etaLegendreArray = new double**[etaLatLen];
for (int i = 0; i < etaLatLen; i++) {
    etaLegendreArray[i] = new double*[l_max+1];
    for (int l = 0; l < l_max+1; l++) {
        etaLegendreArray[i][l] = new double[l_max+1];
    }
}

SH_C = new double*[l_max+1];
SH_S = new double*[l_max+1];
for (int i = 0; i < l_max+1; i++) {
    SH_C[i] = new double[l_max+1];
    SH_S[i] = new double[l_max+1];
}

etaCosMLon = new double*[etaLonLen];
etaSinMLon = new double*[etaLonLen];
for (int j = 0; j < etaLonLen; j++) {
    etaCosMLon[j] = new double[l_max+1];
    etaSinMLon[j] = new double[l_max+1];
}
Perhaps it would be better if these were 1D arrays instead of multidimensional?
Hopping off into X-Y territory here. Rather than speeding up the algorithm, let's try and speed up data access.
etaLegendreArray = new double**[etaLatLen];
for (int i = 0; i < etaLatLen; i++) {
    etaLegendreArray[i] = new double*[l_max+1];
    for (int l = 0; l < l_max+1; l++) {
        etaLegendreArray[i][l] = new double[l_max+1];
    }
}
Doesn't create a 3D array of doubles. It creates an array of pointers to arrays of pointers to arrays of doubles. Each array is its own block of memory, and who knows where it's going to sit in storage. The result is a data structure with what is called "poor spatial locality": all of the pieces of the structure may be scattered all over the place, and in the simulated 3D array you are hopping to three different places just to find out where your value is.
Because the many blocks of storage required to simulate the 3D array may be nowhere near each other, the CPU may not be able to effectively load the cache (high-speed memory) ahead of time, and it has to stop the useful work it's doing and wait to access slower storage, probably RAM, much more frequently. Here is a good, high-level article on how much this can hurt performance.
On the other hand, if the whole array is in one contiguous block of memory, the CPU can read larger chunks of the memory it needs, maybe all of it, into cache at once. Plus, if the compiler knows the memory the program will use is all in one big block, it can perform all sorts of groovy optimizations that will make your program even faster.
So how do we get a 3D array that's all one memory block? If the sizes are static, this is easy:
double etaLegendreArray[SIZE1][SIZE2][SIZE3];
This doesn't look to be your case, so what you want to do is allocate a 1D array, because it will be one contiguous block of memory:
double * etaLegendreArray= new double [SIZE1*SIZE2*SIZE3];
and do the array indexing math by hand:
etaLegendreArray[(x * SIZE2 + y) * SIZE3 + z] = data;
Looks like that ought to be slower with all the extra math, huh? Turns out the compiler is hiding math that looks a lot like that from you every time you use a []. You lose almost nothing, and certainly not as much as you lose with one unnecessary cache miss.
But it is insane to repeat that math all over the place; sooner or later you will screw it up, even if the drain on readability doesn't have you wishing for death first. So you really want to wrap the 1D array in a class that handles the indexing math for you. And once you do that, you might as well have the class handle the allocation and deallocation too, so you can take advantage of all that RAII goodness. No more for loops of news and deletes all over the place; it's all wrapped up and tied with a bow.
Here is an example of a 2D matrix class, easily extendable to 3D, that will take care of the basic functionality you probably need in a nice, predictable, cache-friendly manner.
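A minimal sketch of the 3D version of such a wrapper (the class name and row-major layout are illustrative choices):
#include <cstddef>
#include <vector>

class Array3D {
    std::size_t n2_, n3_;
    std::vector<double> data_;  // one contiguous block, RAII-managed
public:
    Array3D(std::size_t n1, std::size_t n2, std::size_t n3)
        : n2_(n2), n3_(n3), data_(n1 * n2 * n3, 0.0) {}

    double& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data_[(i * n2_ + j) * n3_ + k];  // the same math [] hides
    }
    double operator()(std::size_t i, std::size_t j, std::size_t k) const {
        return data_[(i * n2_ + j) * n3_ + k];
    }
};
With this, etaLegendreArray(i, l, m) replaces etaLegendreArray[i][l][m], and the whole structure lives in a single allocation.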
If the CPU supports it and the compiler is optimizing enough, you might get some small gain out of the C99 fma (fused multiply-add) function, converting some of your two-step operations (multiply, then add) into one-step operations. It would also improve accuracy, since you only suffer floating point rounding once per fused operation, not once for the multiplication and once for the addition.
Assuming I'm reading it right, you could change your innermost loop's expression from:
loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);
to (note no use of += now, it's incorporated in fma):
loadingEta = fma(etaLegendreArray[i][l][m], fma(SH_C[l][m], etaCosMLon[j][m], SH_S[l][m]*etaSinMLon[j][m]), loadingEta);
I wouldn't expect anything magical performance-wise, but it might help a little (again, only with optimizations turned up enough for the compiler to inline hardware instructions to do the work; if it ends up calling a library function, the call overhead will eat any improvement). And again, it should improve accuracy a bit, by avoiding the two rounding steps you were incurring.
Mind you, some compilers, with appropriate compilation flags, will convert your original code to hardware FMA instructions for you; if that's an option, I'd go with that, since (as you can see) the fma function tends to reduce code readability.
Your compiler may offer vectorized versions of floating point instructions as well, which might meaningfully improve performance (see previous link on automatic conversion to FMA).
Most other improvements would require more information about the goal, the nature of the input arrays being used, etc. Simple threading might gain you something; OpenMP pragmas might be something to look at as a way to simplify parallelizing the loop(s).
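For instance, since each (i, j) iteration appears to write only etaNewArray[i][j], a hedged sketch of the OpenMP route might look like this; note that the loop indices and the accumulator must be declared inside the loops so each thread gets its own private copies:
// compile with OpenMP enabled (e.g. -fopenmp)
#pragma omp parallel for collapse(2)
for (int i = 0; i < etaLatLen; i++) {
    for (int j = 0; j < etaLonLen; j++) {
        double loadingEta = 0.0;  // per-iteration, not shared
        for (int l = 0; l < l_max + 1; l++)
            for (int m = 0; m <= l; m++)
                loadingEta += etaLegendreArray[i][l][m] *
                    (SH_C[l][m] * etaCosMLon[j][m] + SH_S[l][m] * etaSinMLon[j][m]);
        etaNewArray[i][j] = loadingEta;
    }
}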

Filling unordered_set is too slow

We have a given 3D mesh and we are trying to eliminate identical vertices. For this we are using a self-defined struct containing the coordinates of a vertex and the corresponding normal.
struct vertice
{
    float p1, p2, p3, n1, n2, n3;

    bool operator == (const vertice& vert) const
    {
        return (p1 == vert.p1 && p2 == vert.p2 && p3 == vert.p3);
    }
};
After filling the vertex with data, it is added to an unordered_set to remove the duplicates.
struct hashVertice
{
    size_t operator () (const vertice& vert) const
    {
        return (7*vert.p1 + 13*vert.p2 + 11*vert.p3);
    }
};

std::unordered_set<vertice, hashVertice> verticesSet;
vertice vert;
unsigned int i = 0; // loop counter (not shown in the original snippet)
while (i < (scene->mMeshes[0]->mNumVertices)) {
    vert.p1 = (float)scene->mMeshes[0]->mVertices[i].x;
    vert.p2 = (float)scene->mMeshes[0]->mVertices[i].y;
    vert.p3 = (float)scene->mMeshes[0]->mVertices[i].z;
    vert.n1 = (float)scene->mMeshes[0]->mNormals[i].x;
    vert.n2 = (float)scene->mMeshes[0]->mNormals[i].y;
    vert.n3 = (float)scene->mMeshes[0]->mNormals[i].z;
    verticesSet.insert(vert);
    i = i + 1;
}
We discovered that it is too slow for data volumes like 3,000,000 vertices: even after 15 minutes the program wasn't finished. Is there a bottleneck we don't see, or is another data structure better suited for such a task?
What happens if you just remove verticesSet.insert(vert); from the loop?
If it speeds up dramatically (as I expect it would), your bottleneck is in the guts of the std::unordered_set, which is a hash table, and the main potential performance problem with hash tables is excessive hash collisions.
In your current implementation, if p1, p2 and p3 are small, the number of distinct hash codes will be small (since you "collapse" the float result to an integer), and there will be lots of collisions.
If the above assumptions turn out to be true, I'd try to implement the hash function differently (e.g. multiply with much larger coefficients).
Other than that, profile your code, as others have already suggested.
Hashing floating point can be tricky. In particular, your hash routine calculates the hash as a floating point value, then converts it to an unsigned integral type. This has serious problems if the vertices can be small: if all of the vertices are in the range [0...1.0), for example, your hash function will never return anything greater than 13. As an unsigned integer, that means there will be at most 13 different hash codes.

The usual way to hash floating point is to hash the binary image, checking for the special cases first. (0.0 and -0.0 have different binary images, but must hash the same. And it's an open question what you do with NaNs.) For float this is particularly simple, since it usually has the same size as int, and you can reinterpret_cast:
size_t
hash( float f )
{
    assert( /* not a NaN */ );
    return f == 0.0 ? 0 : reinterpret_cast<unsigned&>( f );
}
I know that, formally, this is undefined behavior. But if float and int have the same size, and unsigned has no trapping representations (the case on most general purpose machines today), then a compiler which gets this wrong is being intentionally obtuse.

You then use any combining algorithm to merge the three results; the one you use is as good as any other (it's not a good generic algorithm, but it's fine in this case).
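A minimal sketch of the binary-image hash plus a combiner, assuming the asker's vertice struct from above; hash_float and the boost-style 0x9e3779b9 mixing constant are illustrative choices, and std::memcpy sidesteps the formal undefined behavior of the cast:
#include <cstdint>
#include <cstring>

inline std::size_t hash_float(float f) {
    if (f == 0.0f) return 0;             // collapse -0.0 and +0.0
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits); // well-defined binary image
    return bits;
}

struct hashVertice {
    std::size_t operator()(const vertice& vert) const {
        std::size_t seed = hash_float(vert.p1);
        seed ^= hash_float(vert.p2) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        seed ^= hash_float(vert.p3) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};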
I might add that while some of the comments insist on profiling (and this is generally good advice), if you're taking 15 minutes for 3 million values, the problem can really only be a poor hash function, which results in lots of collisions. Nothing else will cause performance that bad. And unless you're familiar with the internal implementation of std::unordered_set, the usual profiler output will probably not give you much information.
On the other hand, std::unordered_set does have functions like bucket_count and bucket_size, which allow analysing the quality of the hash function. In your case, if you cannot create an unordered_set with 3 million entries, your first step should be to create a much smaller one, and use these functions to evaluate the quality of your hash code.
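For instance, a rough diagnostic along these lines (a sketch against the asker's verticesSet, best run after inserting a much smaller sample, say 100,000 vertices):
#include <algorithm>
#include <iostream>

std::size_t worst = 0;
for (std::size_t b = 0; b < verticesSet.bucket_count(); ++b)
    worst = std::max(worst, verticesSet.bucket_size(b));
std::cout << "buckets: " << verticesSet.bucket_count()
          << ", load factor: " << verticesSet.load_factor()
          << ", worst bucket: " << worst << '\n';
// A good hash keeps the worst bucket tiny; the original hash should show
// a few enormous buckets, confirming the collision diagnosis.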
If there is a bottleneck, you are definitely not seeing it, because you don't include any kind of timing measurements.
Measure the timing of your algorithm, either with a profiler or just manually. This will let you find the bottleneck, if there is one.
This is the correct way to proceed. Expecting yourself, or alternatively Stack Overflow users, to spot bottlenecks by eye inspection instead of actually measuring time in your program is, in my experience, the most common cause of failed attempts at optimization.

Calculate average using SSE with STL vectors

I'm trying to learn about vectorisation, and rather than reinvent the wheel I'm using Agner Fog's vector class library.
Here's my original C++/STL code:
#include <algorithm>
#include <vector>
#include <boost/assign/std/vector.hpp>
#include <vectorclass.h>

template<typename T>
double mean_v1(T begin, T end) {
    float mean = 0;
    std::for_each(begin, end, [&mean](const double& d) { mean += d; });
    return mean / std::distance(begin, end);
}

template<typename T>
double mean_v2(T begin, T end) {
    float mean = 0;
    const int distance = std::distance(begin, end); // This is expensive
    const int loop = (distance >> 2) + 1;           // divide by 4
    const int partial = distance & 3;               // remainder mod 4
    Vec4d vec;
    for (int i = 0; i < loop; ++i) {
        if (i == (loop - 1)) {
            vec.load_partial(partial, &*begin);
            mean += horizontal_add(vec);
        }
        else {
            vec.load(&*begin);
            mean += horizontal_add(vec);
            begin += 4; // This is expensive
        }
    }
    return mean / distance;
}

int main(int argc, char** argv) {
    using namespace boost::assign;
    std::vector<double> numbers; // double, so the Vec4d loads line up
    // Note 13 numbers, which won't fit into an SSE register perfectly
    numbers += 39.57, 39.57, 39.604, 39.58, 39.61, 31.669, 31.669, 31.669, 31.65, 32.09, 33.54, 32.46, 33.45;
    const float mean1 = mean_v1(numbers.begin(), numbers.end());
    const float mean2 = mean_v2(numbers.begin(), numbers.end());
    return 0;
}
Both v1 and v2 work correctly and they both take about the same time. However, profiling shows that std::distance() and moving the iterator along take almost 45% of the total time. The vector add itself is just 0.8%, which is significantly faster than v1.
Searching the web, all the examples seem to deal with a number of values that fits precisely into the SSE registers. How do people deal with odd numbers of values, e.g. in this example, where setting up the loop takes a lot longer than the calculation?
I'm thinking there must be best practices or ideas on how to deal with this scenario.
Assume I can't change the interface of mean() to take float[], but must use iterators
You're mixing float and double unnecessarily. In particular, since you don't let your accumulator be double, your precision is totally destroyed and won't be close to satisfactory for larger series.
As the arithmetic is super lightweight, what's destroying your performance here is most likely memory access; read up on memory cache lines and how they work. Basically, what you need to do is probe ahead: some processors have explicit instructions for pulling data into the cache, and otherwise you can perform a load from a memory location ahead of time. Create another level of nesting in your loop and, at regular intervals, prime the cache with data you know you will get to in a few iterations.
What people do to maximize performance is spend a lot of time actually designing their data layout. You shouldn't need to do an intermediate transformation on your data. So people allocate aligned memory (most SIMD instruction sets either require it or impose grave penalties for reading/writing unaligned memory), and then try to aggregate the data in a way that fits the instruction set. In fact, it's often a win to pad your data up to whatever register size the instruction set supports: if, say, you're going to process 3-dimensional vectors, padding with an extra unused element will almost always be a big win.
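As for the asker's main question (a count that isn't a multiple of the register width), the common pattern is a full-width loop over the bulk plus a short scalar loop over the 0-3 leftover elements. A minimal sketch, assuming the same vectorclass.h header as the question and the double accumulator advised above (shown over a contiguous vector for clarity; with random-access iterators, &*begin recovers the pointer as in the question):
#include <cstddef>
#include <vector>
#include <vectorclass.h> // Agner Fog's vector class library

double mean_sketch(const std::vector<double>& v) {
    const std::size_t n = v.size();
    const double* p = v.data();
    Vec4d sum(0.0);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) { // bulk: full 4-wide loads only
        Vec4d chunk;
        chunk.load(p + i);
        sum += chunk;            // keep the running sum vectorized
    }
    double total = horizontal_add(sum); // reduce once, at the end
    for (; i < n; ++i)           // tail: at most 3 scalar adds
        total += p[i];
    return n ? total / n : 0.0;
}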