I'm trying to learn about vectorisation, and rather than reinvent the wheel I'm using Agner Fog's vector class library.
Here's my original C++/STL code:
#include <algorithm>   // std::for_each
#include <iterator>    // std::distance
#include <vector>
#include <boost/assign/std/vector.hpp>
#include <vectorclass.h>

template<typename T>
double mean_v1(T begin, T end) {
    float mean = 0;
    std::for_each(begin, end, [&mean](const double& d) { mean += d; });
    return mean / std::distance(begin, end);
}
template<typename T>
double mean_v2(T begin, T end) {
    float mean = 0;
    const int distance = std::distance(begin, end); // This is expensive
    const int loop = (distance >> 2) + 1;           // divide by 4
    const int partial = distance & 3;               // remainder modulo 4
    Vec4d vec;
    for (int i = 0; i < loop; ++i) {
        if (i == (loop - 1)) {
            vec.load_partial(partial, &*begin);
            mean += horizontal_add(vec);
        }
        else {
            vec.load(&*begin);
            mean += horizontal_add(vec);
            begin += 4; // This is expensive
        }
    }
    return mean / distance;
}
int main(int argc, char** argv) {
    using namespace boost::assign;
    std::vector<float> numbers;
    // Note 13 numbers, which won't fit into an SSE register perfectly
    numbers += 39.57,39.57,39.604,39.58,39.61,31.669,31.669,31.669,31.65,32.09,33.54,32.46,33.45;
    const float mean1 = mean_v1(numbers.begin(), numbers.end());
    const float mean2 = mean_v2(numbers.begin(), numbers.end());
    return 0;
}
Both v1 and v2 work correctly and they both take about the same time. However, profiling shows that the std::distance() call and moving the iterator along take almost 45% of the total time. The vector adds are just 0.8%, which is significantly faster than v1.
Searching the web, all the examples seem to deal with a number of values that fits perfectly into the SSE registers. How do people deal with odd numbers of values, e.g. in this example where setting up the loop takes a lot longer than the calculation?
I'm thinking there must be best practices or ideas on how to deal with this scenario.
Assume I can't change the interface of mean() to take a float[]; it must use iterators.
You're mixing float and double unnecessarily. In particular, since you don't let your accumulator be a double, your precision is badly degraded and won't be close to satisfactory for larger series.
As the arithmetic is extremely lightweight, what's destroying your performance here is most likely memory access; read up on memory cache lines and how they work. Basically, what you need to do here is probe ahead: some processors have explicit instructions for pulling data into your cache, otherwise you can perform a load at a memory location ahead of time. Create another level of nesting in your loop and, at regular intervals, prime the cache with data you know you will get to in a few iterations.
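For example, on x86 the prefetch intrinsic can be used to pull data into the cache a few iterations ahead. A minimal sketch of the idea (the function name and the prefetch distance of 16 elements are mine; the distance needs tuning on the target CPU, and the pointer is assumed to come from contiguous storage such as std::vector):

#include <cstddef>      // std::size_t
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

double sum_with_prefetch(const double* p, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // Ask the CPU to start fetching data we will need a couple of cache lines from now.
        if (i + 16 < n)
            _mm_prefetch(reinterpret_cast<const char*>(p + i + 16), _MM_HINT_T0);
        sum += p[i];
    }
    return sum;
}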
What people do to maximize performance is spend a lot of time designing their data layout. You shouldn't need to do an intermediate transformation on your data. So people allocate aligned memory (most SIMD instruction sets either require aligned memory or impose severe penalties for reading/writing unaligned memory), and then they try to arrange the data so that it fits the instruction set. In fact, it's often a win to pad your data up to whatever register size the instruction set supports. So if, let's say, you're going to process 3-dimensional vectors, padding with an extra unused element will almost always be a big win.
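As an illustrative sketch of the padding/alignment idea (the names are made up; with pre-C++17 compilers an aligned allocator, _mm_malloc or posix_memalign would be needed for the vector's storage):

#include <vector>

// Hypothetical padded 3D vector: the 4th float is unused, but it makes each
// element exactly one 128-bit SSE register wide, and alignas keeps every
// element on a 16-byte boundary.
struct alignas(16) Vec3Padded {
    float x, y, z;
    float pad;   // unused padding element
};

// In C++17 std::vector honours the over-alignment of the element type.
std::vector<Vec3Padded> points(1024);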
Related
Basically, I have a collection std::vector<std::pair<std::vector<float>, unsigned int>> which contains pairs of templates std::vector<float> of size 512 (2048 bytes) and their corresponding identifier unsigned int.
I am writing a function in which I am provided with a template and I need to return the identifier of the most similar template in the collection. I am using dot product to compute the similarity.
My naive implementation looks as follows:
// Should return false if no match is found (i.e. similarity is 0 for all templates in the collection)
bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    bool found = false;
    similarity = 0.f;
    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data();
        float consinSimilarity = getSimilarity(data, candidateTemplate, length); // computes cosine similarity between two vectors; implementation depends on architecture.
        if (consinSimilarity > similarity) {
            found = true;
            similarity = consinSimilarity;
            label = collection[i].second;
        }
    }
    return found;
}
How can I speed this up using parallelization? My collection can potentially contain millions of templates. I have read that you can add #pragma omp parallel for reduction, but I am not entirely sure how to use it (or whether this is even the best option).
Also note:
For my dot product implementation, if the base architecture supports AVX & FMA, I am using this implementation.
Will this affect performance when we parallelize since there are only a limited number of SIMD registers?
Since we don't have access to an example that actually compiles (which would have been nice), I didn't actually try to compile the example below. Nevertheless, some minor typos (maybe) aside, the general idea should be clear.
The task is to find the highest similarity value and the corresponding label. For this we can indeed use a reduction, but since we need to find the maximum of one value and then store the corresponding label, we use a pair to hold both values at once, in order to implement this as a reduction in OpenMP.
I have slightly rewritten your code; keeping the original variable name (temp) possibly makes it a bit harder to read. Basically, we perform the search in parallel, so each thread finds an optimal value; we then ask OpenMP to find the optimal solution across the threads (the reduction) and we are done.
// Reduce by finding the maximum and also storing the corresponding label; this is why we use a std::pair.
void reduce_custom(std::pair<float, unsigned int>& output, std::pair<float, unsigned int>& input) {
    if (input.first > output.first) output = input;
}

// Declare an OpenMP reduction with our pair and our custom reduction function.
#pragma omp declare reduction(custom_reduction : \
    std::pair<float, unsigned int>: \
    reduce_custom(omp_out, omp_in)) \
    initializer(omp_priv(omp_orig))

bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    std::pair<float, unsigned int> temp(0.0, label); // Stores the thread-local best similarity and corresponding label.
    #pragma omp parallel for reduction(custom_reduction:temp)
    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data();
        float consinSimilarity = getSimilarity(data, candidateTemplate, length);
        if (consinSimilarity > temp.first) {
            temp.first = consinSimilarity;
            temp.second = collection[i].second;
        }
    }
    if (temp.first > 0.f) {
        similarity = temp.first;
        label = temp.second;
        return true;
    }
    return false;
}
Regarding your concern about the limited number of SIMD registers: their number depends on the specific CPU you are using. To the best of my understanding, each core has its own set of vector registers available, so as long as you were not using more than were available before, it should be fine now as well. Besides, AVX-512 for instance provides 32 vector registers and two arithmetic units for vector operations per core, so running out of compute resources is not trivial; you are more likely to suffer from poor memory locality (particularly in your case, with the vectors being scattered all over the place). I might of course be wrong; if so, please feel free to correct me in the comments.
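If memory locality is indeed the problem, one thing worth trying (a sketch with made-up names, not something from your code) is to store all templates back to back in one contiguous buffer, with a parallel array of labels, so that consecutive templates share cache lines and stream through the hardware prefetcher:

#include <cstddef>
#include <vector>

// Hypothetical flattened layout: every template occupies kTemplateSize
// consecutive floats in a single buffer; labels live in a parallel array.
constexpr std::size_t kTemplateSize = 512;   // as in the question

struct FlatCollection {
    std::vector<float> data;            // size() == count() * kTemplateSize
    std::vector<unsigned int> labels;   // size() == count()
    std::size_t count() const { return labels.size(); }
    const float* tmpl(std::size_t i) const { return data.data() + i * kTemplateSize; }
};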
I have a Measurement object that has two Eigen::VectorXd members -- one for position and the other for velocity.
Measurements are arranged in a dataset by scans -- i.e., at each timestep, a new scan of measurements is added to the dataset. These types are defined as:
typedef std::shared_ptr<Measurement> MeasurementPtr;
typedef std::vector<MeasurementPtr> scan_t;
typedef std::vector<scan_t> dataset_t;
At the beginning of each iteration of my algorithm, I need to apply a new transformation to each measurement. Currently, I have:
for (auto scan = dataset_.begin(); scan != dataset_.end(); ++scan)
  for (auto meas = scan->begin(); meas != scan->end(); ++meas) {
    // Transform this measurement to bring it into the same
    // coordinate frame as the current scan
    if (scan != std::prev(dataset_.end())) {
      core::utils::perspective_transform(T_, (*meas)->pos);
      core::utils::perspective_transform(T_, (*meas)->vel);
    }
  }
Where perspective_transform is defined as
void perspective_transform(const Eigen::Projective2d& T, Eigen::VectorXd& pos) {
    pos = (T * pos.homogeneous()).hnormalized();
}
Adding this code increases computation time by 40x when I run the algorithm with scans in the dataset with 50 measurements in each scan -- making it rather slow. I believe this is because I have 550 small objects, each with 2 Eigen memory writes. I removed the writing of the result to memory and my benchmark shows only a slight decrease -- suggesting that this is a memory-efficiency problem and not a computation bottleneck.
How can I speed up this computation? Is there a way to first loop through and create an Eigen::Matrix via Eigen::Map, so that I could do the computation once and have it automatically update the two members of all the Measurement objects?
You might want to rework your data-structures.
Currently you have an array-of-struct (AOS), with a number of indirections.
A structure-of-arrays (SOA) is generally more efficient in memory access.
What about this?
struct Scant_t
{
    Eigen::MatrixXd position;
    Eigen::MatrixXd velocity;
};
The .rowwise() and .colwise() operators might be powerful enough to do the homogeneous transform, which would save you writing the inner loop; a sketch is below.
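For example, something along these lines might work (an untested sketch with a made-up function name; it assumes 2D positions stored as the columns of an Eigen::Matrix2Xd member, and likewise for velocities):

#include <Eigen/Dense>
#include <Eigen/Geometry>

// Apply one projective transform to every column at once, so Eigen performs a
// single 3x3 * 3xN product instead of N tiny per-measurement products.
Eigen::Matrix2Xd transform_all(const Eigen::Projective2d& T,
                               const Eigen::Matrix2Xd& pos) {
    Eigen::Matrix3Xd h = T.matrix() * pos.colwise().homogeneous();
    return h.colwise().hnormalized();
}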
I want to parallelize my search algorithm using OpenMP. vTree is a binary search tree, and I want to apply my search algorithm to each point in the point set. Below is a snippet of my code. The search procedures for two different points are completely independent and so can run in parallel. They do need to read the same tree, but once constructed, the tree is never modified again; it is read-only.
However, the code below shows terrible scalability: on my 32-core platform, only a 2x speed-up is achieved. Is it because vTree is read by all threads? If so, how can I further optimize the code?
auto results = vector<vector<Point>>(particleNum);
auto t3 = high_resolution_clock::now();
double radius = 1.6;
#pragma omp parallel for
for (decltype(points.size()) i = 0; i < points.size(); i++)
{
    vTree.search(points[i], radius, results[i]);
}
auto t4 = high_resolution_clock::now();
double searchTime = duration_cast<duration<double>>(t4 - t3).count();
The type signature for search is:
void VPTree::search(const Point& p, double radius, vector<Point>& result) const
The search results are put into result.
My best guess would be that you are cache ping-pong'ing on the result vectors. I would assume that your "search" function uses the passed-in result vector as a place to put points and that you use it throughout the algorithm to insert neighbors as you encounter them in the search tree. Whenever you add a point to that result vector, the internal data of that vector object will be modified. And because all of your result vectors are packed together in contiguous memory, it is likely that different result vectors occupy the same cache lines. So, when the CPU maintains cache coherence, it will constantly lock the relevant cache lines.
The way to solve it is to use an internal, temporary vector that you only assign to the results vector once at the end (which can be done cheaply if you use move semantics). Something like this:
void VPTree::search(const Point& p, double radius, vector<Point>& result) const {
    vector<Point> tmp_result;
    // ... add results to "tmp_result"
    result = std::move(tmp_result);
    return;
}
Or, you could also just return the vector by value (which is implicitly using a move):
vector<Point> VPTree::search(const Point& p, double radius) const {
    vector<Point> result;
    // ... add results to "result"
    return result;
}
Welcome to the joyful world of move-semantics and how awesome it can be at solving these types of concurrency / cache-coherence issues.
It is also conceivable that you're experiencing problems related to accessing the same tree from all threads, but since these are all read-only operations, I'm pretty sure that even on a conservative architecture like x86 (and other Intel / AMD CPUs) this should not pose a significant issue, but I might be wrong (maybe a kind of "over-subscription" problem is at play, but it's dubious). Other problems might include the fact that OpenMP does incur quite a bit of overhead (spawning threads, synchronizing, etc.), which has to be weighed against the computational cost of the actual operations you are doing within those parallel loops (and it's not always a favorable trade-off). And also, if your VPTree (which I imagine stands for "Vantage-Point Tree") does not have good locality of reference (e.g., you implemented it as a linked tree), then the performance is going to be terrible whichever way you use it (as I explain here).
I'm currently trying to do, as efficiently as possible, an in-place multiplication of an array of complex numbers (memory-aligned the same way std::complex would be, but currently using our own ADT) by an array of scalar values that is the same size as the complex number array.
The algorithm is already parallelized, i.e. the calling object splits the work up into threads. This calculation is done on arrays in the 100s of millions - so, it can take some time to complete. CUDA is not a solution for this product, although I wish it was. I do have access to boost and thus have some potential to use BLAS/uBLAS.
I'm thinking, however, that SIMD might yield much better results, but I'm not familiar enough with how to do this with complex numbers. The code I have now is as follows (remember this is chunked up into threads which correspond to the number of cores on the target machine). The target machine is also unknown. So, a generic approach is probably best.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    for (register int idx = start; idx < end; ++idx)
    {
        values[idx].real *= scalar[idx];
        values[idx].imag *= scalar[idx];
    }
}
fcomplex is defined as follows:
struct fcomplex
{
    float real;
    float imag;
};
I've tried manually unrolling the loop, as my final loop count will always be a power of 2, but the compiler is already doing that for me (I've unrolled as far as 32). I've tried a const float reference to the scalar - thinking I'd save one access - and that proved to be equal to what the compiler was already doing. I've tried STL and transform, which gave close results, but still worse. I've also tried casting to std::complex and allowing it to use the overloaded operator for scalar * complex for the multiplication, but this ultimately produced the same results.
So, anyone with any ideas? Much appreciation is given for your time in considering this! Target platform is Windows. I'm using Visual Studio 2008. Product cannot contain GPL code as well! Thanks so much.
You can do this fairly easily with SSE, e.g.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    for (int idx = start; idx < end; idx += 2)
    {
        __m128 vc = _mm_load_ps((float *)&values[idx]);
        __m128 vk = _mm_set_ps(scalar[idx + 1], scalar[idx + 1], scalar[idx], scalar[idx]);
        vc = _mm_mul_ps(vc, vk);
        _mm_store_ps((float *)&values[idx], vc);
    }
}
Note that values and scalar need to be 16 byte aligned.
Or you could just use the Intel ICC compiler and let it do the hard work for you.
UPDATE
Here is an improved version which unrolls the loop by a factor of 2 and uses a single load instruction to get 4 scalar values which are then unpacked into two vectors:
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    for (int idx = start; idx < end; idx += 4)
    {
        __m128 vc0 = _mm_load_ps((float *)&values[idx]);
        __m128 vc1 = _mm_load_ps((float *)&values[idx + 2]);
        __m128 vk = _mm_load_ps(&scalar[idx]);
        __m128 vk0 = _mm_shuffle_ps(vk, vk, 0x50);
        __m128 vk1 = _mm_shuffle_ps(vk, vk, 0xfa);
        vc0 = _mm_mul_ps(vc0, vk0);
        vc1 = _mm_mul_ps(vc1, vk1);
        _mm_store_ps((float *)&values[idx], vc0);
        _mm_store_ps((float *)&values[idx + 2], vc1);
    }
}
Your best bet will be to use an optimised BLAS which will take advantage of whatever is available on your target platform.
One problem I see is that, inside the function, it's hard for the compiler to know that the scalar pointer isn't actually pointing into the middle of the complex array (scalar could in theory be pointing to the real or imaginary part of one of the complex numbers).
This actually forces the order of evaluation (one workaround is sketched below).
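One (non-standard, but widely supported) way to break that dependency is a restrict qualifier; here is a sketch of the original loop with it applied, presented as an option rather than a guaranteed fix:

// __restrict promises the compiler that 'values' and 'scalar' never alias,
// which removes the forced ordering described above. MSVC and GCC/Clang all
// accept __restrict on pointers, though it is not standard C++.
void cmult_scalar_inplace(fcomplex* __restrict values, const int start,
                          const int end, const float* __restrict scalar)
{
    for (int idx = start; idx < end; ++idx)
    {
        values[idx].real *= scalar[idx];
        values[idx].imag *= scalar[idx];
    }
}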
Another problem I see is that the computation here is so simple that other factors will influence the raw speed; therefore, if you really care about performance, the only solution in my opinion is to implement several variations and test them at runtime on the user's machine to discover which is the fastest.
What I'd consider is using different unrolling sizes, and also playing with the alignment of scalar and values (the memory access pattern can have a big influence on caching effects).
For the problem of the unwanted serialization, an option is to look at the generated code for something like:
float r0 = values[i].real, i0 = values[i].imag, s0 = scalar[i];
float r1 = values[i+1].real, i1 = values[i+1].imag, s1 = scalar[i+1];
float r2 = values[i+2].real, i2 = values[i+2].imag, s2 = scalar[i+2];
values[i].real = r0*s0; values[i].imag = i0*s0;
values[i+1].real = r1*s1; values[i+1].imag = i1*s1;
values[i+2].real = r2*s2; values[i+2].imag = i2*s2;
because here the optimizer has in theory a little bit more freedom.
Do you have access to Intel's Integrated Performance Primitives?
They have a number of functions that handle cases like this with pretty decent performance. You might have some success with your particular problem, but I would not be surprised if your compiler already does a decent job of optimizing the code.
I have a class containing a number of double values. This is stored in a vector where the indices for the classes are important (they are referenced from elsewhere). The class looks something like this:
Vector of classes
class A
{
    double count;
    double val;
    double sumA;
    double sumB;
    vector<double> sumVectorC;
    vector<double> sumVectorD;
};
vector<A> classes(10000);
The code that needs to run as fast as possible is something like this:
vector<double> result(classes.size());
for (int i = 0; i < classes.size(); i++)
{
    result[i] += classes[i].sumA;
    vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
    if (it != classes[i].sumVectorC.end())
        result[i] += *it;
}
The alternative, instead of one giant loop, is to split the computation into two separate loops, such as:
for (int i = 0; i < classes.size(); i++)
{
    result[i] += classes[i].sumA;
}
for (int i = 0; i < classes.size(); i++)
{
    vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
    if (it != classes[i].sumVectorC.end())
        result[i] += *it;
}
or to store each member of the class in a vector like so:
Class of vectors
vector<double> classCounts;
vector<double> classVal;
...
vector<vector<double> > classSumVectorC;
...
and then operate as:
for (int i = 0; i < classes.size(); i++)
{
    result[i] += classCounts[i];
    ...
}
Which way would usually be faster (across x86/x64 platforms and compilers)? Are look-ahead and cache lines the most important things to think about here?
Update
The reason I'm doing a linear search (i.e. find) here and not a hash map or binary search is because the sumVectors are very short, around 4 or 5 elements. Profiling showed a hash map was slower and a binary search was slightly slower.
As the implementation of both variants seems easy enough I would build both versions and profile them to find the fastest one.
Empirical data usually beats speculation.
As a side issue: Currently, the find() in your innermost loop does a linear scan through all elements of classes[i].sumVectorC until it finds a matching value. If that vector contains many values, and you have no reason to believe that testVal appears near the start of the vector, then this will be slow -- consider using a container type with faster lookup instead (e.g. std::map or one of the nonstandard but commonly implemented hash_map types).
As a general guideline: consider algorithmic improvements before low-level implementation optimisation.
As lothar says, you really should test it out. But to answer your last question, yes, cache misses will be a major concern here.
Also, it seems that your first implementation would run into load-hit-store stalls as coded, but I'm not sure how much of a problem that is on x86 (it's a big problem on XBox 360 and PS3).
It looks like optimizing the find() would be a big win (profile to know for sure). Depending on the various sizes, in addition to replacing the vector with another container, you could try sorting sumVectorC and using a binary search in the form of lower_bound. This will turn your O(n) linear search into O(log n); see the sketch below.
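Inside the existing loop, that would look roughly like this (a sketch only; it assumes classes[i].sumVectorC has been sorted beforehand and that <algorithm> is included):

// Binary search instead of a linear find(); sumVectorC must be kept sorted.
vector<double>::iterator it = lower_bound(classes[i].sumVectorC.begin(),
                                          classes[i].sumVectorC.end(), testval);
if (it != classes[i].sumVectorC.end() && *it == testval)
    result[i] += *it;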
If you can guarantee that std::numeric_limits<double>::infinity() is not a possible value, you can keep the arrays sorted with a dummy infinite entry at the end and then hand-code the find so that the loop condition is a single test:
array[i]<test_val
followed by an equality test.
Then you know that the average number of values examined is (size()+1)/2 in the not-found case. Of course, if the search array changes very frequently, keeping it sorted becomes an issue in itself. A sketch of the idea follows.
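Here is a minimal sketch with illustrative names of my own; it assumes the array is sorted, padded with an infinity sentinel, and that infinity never occurs as a real value:

#include <limits>
#include <vector>

// Append the sentinel once, whenever the sorted array changes.
inline void add_sentinel(std::vector<double>& sorted_values) {
    sorted_values.push_back(std::numeric_limits<double>::infinity());
}

// The scan needs a single comparison per element; the sentinel guarantees termination.
inline bool sentinel_find(const double* array, double test_val) {
    int i = 0;
    while (array[i] < test_val)
        ++i;
    return array[i] == test_val;  // the final equality test
}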
Of course, you don't tell us much about sumVectorC, or the rest of A for that matter, so it is hard to ascertain the situation and give really good advice. For example, if sumVectorC is never updated, then it is probably possible to find an EXTREMELY cheap hash (e.g. casting to an unsigned long long and extracting a few bits) that is perfect on the sumVectorC values and fits into a double[8]. Then the overhead is a bit extraction and one comparison, versus three or six; an illustrative sketch is below.
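Purely as an illustration of the "cast to ULL and bit extraction" idea (PerfectHash8 is a made-up name, and the shift value is a placeholder that would have to be searched for offline whenever sumVectorC changes):

#include <cstdint>
#include <cstring>

// A "perfect hash" over at most 8 stored doubles: reinterpret each double's bit
// pattern and extract 3 bits that happen to differ between all stored values.
// Lookup is then one shift, one mask and one comparison.
struct PerfectHash8 {
    double table[8];   // stored values, one per bucket (unused buckets hold a dummy)
    unsigned shift;    // chosen offline so all stored values land in distinct buckets

    bool contains(double v) const {
        std::uint64_t bits;
        std::memcpy(&bits, &v, sizeof bits);   // reinterpret the double's bits
        return table[(bits >> shift) & 7u] == v;
    }
};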
Also, if you have a reasonable bound on sumVectorC.size() (you mentioned 4 or 5, so this assumption seems not bad), you could consider using an aggregated array, or even just a boost::array<double> with your own dynamic size, e.g.:
class AggregatedArray : public boost::array<double, 8> {
    size_t _size;
public:
    size_t size() const {
        return _size;
    }
    // ...
    // push_back(..) {...}
    // pop() {...}
    // resize(...) {...}
};
This gets rid of the extra cache-line access to the heap-allocated array data for sumVectorC.
In the case where sumVectorC updates very infrequently, if finding a perfect hash (out of your class of hash algorithms) is relatively cheap, then you can profitably incur that cost whenever sumVectorC changes. These small lookups can be problematic, and algorithmic complexity is frequently irrelevant - it is the constants that dominate. It is an engineering problem and not a theoretical one.
Unless you can guarantee that the small maps are in cache, you can be almost guaranteed that using a std::map will yield well over 100% worse performance, as pretty much each node in the tree will be in a separate cache line.
With 5 values in a contiguous array, a linear scan touches (4*1 + 1*2)/5 = 1.2 cache lines per search on average (the first 4 values share the first cache line, the 5th spills into a second). With a std::map of 5 nodes you touch (1*1 + 2*2 + 2*3)/5 = 2.2 node cache lines on average (one access for the root, two for each of its children, three for each grandchild), plus one more for the map object itself, i.e. roughly 3.2 cache lines per search.
So I would predict a std::map to take about 3.2/1.2, i.e. roughly 2.7 times as long, for a sumVectorC with 5 entries.
This is what I meant when I said: "It is an engineering problem and not a theoretical one."