how to vectorize arrow::compute::Take? - llvm

I have a large array input_array and an array of offsets take_array. I want to return the elements at those offsets very fast. Can I vectorize this for the Arrow array? If so, how?
arrow::compute::Take(input_array, take_array)
Use Case:
I am taking a subset of a really large input_array. It is used in places where OpenMP-like and MPI-like parallelism are used. So vectorization seems to be the next low-hanging fruit.
Example:
https://arrow.apache.org/docs/python/generated/pyarrow.compute.take.html
The example above uses Apache Arrow. I am also open to Gandiva, Velox, LLVM or Intel MKL if there is a better way.
https://www.intel.com/content/www/us/en/developer/articles/technical/vectorization-llvm-gcc-cpus-gpus.html#gs.3vy461
https://llvm.org/docs/Vectorizers.html
https://www.dremio.com/blog/gandiva-performance-improvements-production-query/
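For reference, a minimal sketch of how I call it from C++ today (assuming the documented compute API; TakeSubset is just an illustrative wrapper name, and how the kernel gathers internally is up to the library/build):

#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <memory>

// Illustrative wrapper around arrow::compute::Take (not an Arrow API itself).
arrow::Result<std::shared_ptr<arrow::Array>> TakeSubset(
    const std::shared_ptr<arrow::Array>& input_array,
    const std::shared_ptr<arrow::Array>& take_array) {
  // Take dispatches to a precompiled kernel; the result comes back as a Datum.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum out,
                        arrow::compute::Take(input_array, take_array));
  return out.make_array();
}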

Related

High performance table structure for really small tables (<10 items usually) where once the table is created it doesn't change?

I am searching for a high performance C++ structure for a table. The table will have void* as keys and uint32 as values.
The table itself is very small and will not change after creation. The first idea that came to my mind is using something like ska::flat_hash_map<void*, int32_t> or std::unordered_map<void*, int32_t>. However, that would be overkill and would not give me the performance I want (those tables are also designed for a high number of items).
So I thought about using std::vector<std::pair<void*, int32_t>>, sorting it upon creation and probing it linearly. The next idea would be to use SIMD instructions, if that is possible with the current structure.
Another solution which I will shortly evaluate is like that:
struct Group
{
    void* keys[5];       // search these using SIMD
    int32_t values[5];
}; // fits in a cache line

struct Table
{
    Group* groups;
    size_t capacity;
};
Are there any better options? I need only 1 operation: finding values by keys, not modifying them, not anything. Thanks!
EDIT: another thing I think I should mention is the access pattern: suppose I have an array of these hash tables; each time I will look up in a random one from the array.
Linear probing is likely the fastest solution in this case on common mainstream architectures, especially since the number of elements is very small and bounded (i.e. <10). Sorting the items should not speed up the probing with so few items (it would only be useful for a binary search, which is much more expensive in this case).
If you want to use SIMD instructions, then you need to use a structure of arrays instead of an array of structures for the sake of performance. This means you should use std::pair<std::vector<void*>, std::vector<int32_t>> instead of std::vector<std::pair<void*, int32_t>> (which alternates void* keys and int32_t values in memory, with some padding overhead due to the alignment constraints of void* on 64-bit architectures). Having two std::vector is not great either because you pay their overhead twice. As mentioned by @JorgeBellon in the comments, you can simply use a std::array instead of std::vector, assuming the number of items is known or bounded.
A possible optimization with SIMD instructions is to compact the key pointers on 64-bit architectures by splitting them into a 32-bit lower and upper part. Indeed, it is very unlikely that two pointers have the same lower part (least significant bits) while having a different upper part. This trick lets you check twice as many pointers at a time.
Note that using SIMD instructions may not be so great in this case in practice. This is especially true if the number of items is smaller than the number that fits in a SIMD vector. For example, with AVX2 (on x86-64 processors), you can work on 4 64-bit values at a time (or 8 32-bit values), but if you have fewer than 8 values, then you need to mask the unwanted values (or not load them at all if the memory buffer does not contain some padding). This introduces an additional overhead. This is not much of a problem with AVX-512 and SVE (only available on a small fraction of processors yet) since they provide advanced masking operations. Moreover, some processors lower their frequency when they execute SIMD instructions (especially with AVX-512, although the down-clocking is not so strong with integer instructions). SIMD instructions also introduce some additional latency compared to the scalar version (which can be better pipelined), and modern processors tend to be able to execute more scalar instructions in parallel than SIMD ones. For all these reasons, it is certainly a good idea to try to write a scalar branchless implementation (possibly unrolled for better performance if the number of items is known at compile time).
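If you do go the SIMD route with an SoA layout, a lookup could look roughly like the following AVX2 sketch (illustrative only; it assumes the table is padded to 8 slots with a sentinel key such as nullptr that never appears as a real key, GCC/Clang builtins, and compilation with -mavx2):

#include <immintrin.h>   // AVX2
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-capacity SoA table; unused slots hold the sentinel nullptr.
struct SmallTable {
    alignas(32) const void* keys[8] = {};
    std::int32_t values[8] = {};
};

// Compares 4 pointer keys per iteration; returns -1 if the key is not present.
inline std::int32_t lookup(const SmallTable& t, const void* key) {
    const __m256i needle = _mm256_set1_epi64x(
        static_cast<long long>(reinterpret_cast<std::uintptr_t>(key)));
    for (std::size_t i = 0; i < 8; i += 4) {
        __m256i k  = _mm256_load_si256(reinterpret_cast<const __m256i*>(t.keys + i));
        __m256i eq = _mm256_cmpeq_epi64(k, needle);
        int mask = _mm256_movemask_pd(_mm256_castsi256_pd(eq));  // 1 bit per 64-bit lane
        if (mask)
            return t.values[i + static_cast<std::size_t>(__builtin_ctz(mask))];
    }
    return -1;
}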
You may want to look into perfect hashing -- not too difficult, and can provide simple constant time lookups. It can take technically unbounded time to create the table, though, and it's not as fast as a regular hash table when the regular hash table gets lucky.
I think a nice alternative is an optimization of your simple linear probing idea.
Your lookup procedure would look like this:
Slot *s = &table[hash(key)];
Slot *e = s + s->max_extent;
for (; s < e; ++s) {
    if (s->key == key) {
        return s->value;
    }
}
return NOT_FOUND;
table[h].max_extent is the maximum number of elements you may have to look at if you're looking for an element with hash code h. You would pre-calculate this when you generate the table, so your lookup doesn't have to iterate until it gets a null. This greatly reduces the amount of probing you have to do for misses.
Of course you want max_extent to be as small as possible. Pick a hash result size (at least 2n) to make it <= 1 in most cases, and try a few different hash functions before picking the one that produces the best results by whatever metric you like. Your hash can be as simple as key % P, where trying different hashes means trying different P values. Fill your hash table in hash(key) order to produce the best result.
NOTE that we do not wrap around from the end to the start of the table while probing. Just allocate however many extra slots you need to avoid it.
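For completeness, a sketch of how the table could be built so that max_extent comes precomputed (names and details are illustrative, not the answer's code; it assumes items are inserted in hash(key) order and that spare slots at the end avoid wrap-around):

#include <cstdint>
#include <utility>
#include <vector>

struct Slot {
    const void* key = nullptr;
    int32_t value = 0;
    uint8_t max_extent = 0;   // slots to examine when probing from this home bucket
};

struct FixedTable {
    std::vector<Slot> slots;
    size_t buckets = 0;

    size_t hash(const void* key) const {            // "key % P"-style hash
        return reinterpret_cast<std::uintptr_t>(key) % buckets;
    }

    void build(size_t nbuckets,
               const std::vector<std::pair<const void*, int32_t>>& items) {
        buckets = nbuckets;
        slots.assign(buckets + items.size(), Slot{});   // spare slots: no wrap-around
        for (const auto& [key, value] : items) {        // assumed sorted by hash(key)
            size_t h = hash(key);
            size_t s = h;
            while (slots[s].key != nullptr) ++s;         // linear probe to first free slot
            slots[s].key = key;
            slots[s].value = value;
            slots[h].max_extent = static_cast<uint8_t>(s - h + 1);
        }
    }
};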

Intrinsics: using __m128 registers

I am playing with SIMD and thinking of using it for vector operations in 3D math.
Instead of having
class Vec4f
{
    float val[4];
    // + operators here
};
I could have
class SimdVec4f
{
    __m128 val; // + operators
};
But since there are just 8 available registers for __m128, what will happen if I want to have more than 8 instances of this class? Does the compiler handle the loading from memory to registers and vice versa on its own, as for usual variables?
Thanks for your time and for giving me some insight into this.
It's exactly the same as when you have more int variables than there are integer registers: the compiler may have to spill them to memory if too many are live at the same time, and reload them later. Register allocation for vector registers is done pretty much the same way as register allocation for integer regs, analysing the data flow of a function and figuring out which variables are alive at the same time.
You should think of _mm_load_ps/loadu and store/storeu intrinsics as more describing the type-punning to/from vector types, not as being the only thing that can compile to a vector load or store instruction, or always compiling to a load/store.
And BTW, x86-64 has xmm0..15. Compile for 64-bit if you want code that needs several registers to be efficient.
SSE for 3D vectors:
Generally avoid keeping a single direction/geometry vector in a SIMD vector. You can add efficiently, but any cross- or dot-products or length calculations will require shuffling.
It's better if you can use a vector of 4 x values, a vector of 4 y values, etc., so you can compute 4 lengths in parallel. See https://stackoverflow.com/tags/sse/info for more, especially these slides:
SIMD at Insomniac Games (GDC 2015) which show how to lay out your data for efficient SIMD. (Struct of arrays, not array of structs).
See also Parallel programming using Haswell architecture
Sometimes you can get a minor benefit for a single vector in cases where you can't reorganize to compute lots of things in parallel. _mm_setr_ps() can be slow if the source data isn't contiguous, though.
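To make the SoA layout above concrete, here is a minimal sketch (illustrative names, not from the question) that computes 4 lengths at once with plain SSE and no shuffles:

#include <immintrin.h>

// SoA: x holds x0..x3, y holds y0..y3, z holds z0..z3.
struct Vec3x4 {
    __m128 x, y, z;
};

inline __m128 lengths(const Vec3x4& v) {
    __m128 sq = _mm_add_ps(_mm_add_ps(_mm_mul_ps(v.x, v.x),
                                      _mm_mul_ps(v.y, v.y)),
                           _mm_mul_ps(v.z, v.z));
    return _mm_sqrt_ps(sq);   // 4 lengths in one instruction stream
}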
There are already several C++ wrapper libraries for SIMD, such as Agner Fog's Apache-licensed VectorClass, and some others.

Speed up float 5x5 matrix * vector multiplication with SSE

I need to run a matrix-vector multiplication 240000 times per second. The matrix is 5x5 and is always the same, whereas the vector changes at each iteration. The data type is float. I was thinking of using some SSE (or similar) instructions.
I am concerned that the number of arithmetic operations is too small compared to the number of memory operations involved. Do you think I can get some tangible (e.g. > 20%) improvement?
Do I need the Intel compiler to do it?
Can you point out some references?
The Eigen C++ template library for vectors, matrices, ... has both optimised code for small fixed-size matrices (as well as dynamically sized ones) and optimised code that uses SSE, so you should give it a try.
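A minimal sketch of what that looks like with Eigen (assuming Eigen 3.x; fixed-size types let Eigen unroll and vectorize the product internally):

#include <Eigen/Dense>

int main() {
    // Fixed-size 5x5 matrix and 5-vector; Random() is just for illustration.
    Eigen::Matrix<float, 5, 5> M = Eigen::Matrix<float, 5, 5>::Random();
    Eigen::Matrix<float, 5, 1> x = Eigen::Matrix<float, 5, 1>::Random();
    Eigen::Matrix<float, 5, 1> y = M * x;   // optimized fixed-size product
    (void)y;
}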
In principle the speedup could be 4 times with SSE (8 times with AVX). Let me explain.
Let's call your fixed 5x5 matrix M, and define the components of a 5D vector as (x,y,z,w,t). Now form a 5x4 matrix U from the first four vectors.
U =
xxxx
yyyy
zzzz
wwww
tttt
Next, do the matrix product MU = V. The matrix V contains the product of M and the first four vectors. The only problem is that for SSE we need to read the rows of U, but in memory U is stored as xyzwtxyzwtxyzwtxyzwt, so we have to transpose it to xxxxyyyyzzzzwwwwtttt. This can be done with shuffles/blends in SSE. Once we have this format, the matrix product is very efficient.
Instead of taking O(5x5x4) operations with scalar code, it only takes O(5x5) operations, i.e. a 4x speedup. With AVX the matrix U will be 5x8, so instead of taking O(5x5x8) operations it only takes O(5x5), i.e. an 8x speedup.
The matrix V, however, will be in xxxxyyyyzzzzwwwwtttt format so depending on the application it might have to be transposed to xyzwtxyzwtxyzwtxyzwt format.
Repeat this for the next four vectors (8 for AVX) and so forth until done.
If you have control over the vectors, for example if your application generates the vectors on the fly, then you can generate them in xxxxyyyyzzzzwwwwtttt format and avoid transposing the array. In that case you should get a 4x speedup with SSE and an 8x with AVX. If you combine this with threading, e.g. OpenMP, your speedup should be close to 16x (assuming four physical cores) with SSE. I think that's the best you can do with SSE.
Edit: Due to instruction level parallelism (ILP) you can get another factor of 2 in speedup, so the speedup for SSE could be 32x with four cores (64x with AVX), and again another factor of 2 with Haswell due to FMA3.
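A rough sketch of the inner kernel described above, once 4 input vectors are already in xxxx/yyyy/zzzz/wwww/tttt form (illustrative only; each matrix element is broadcast with _mm_set1_ps):

#include <immintrin.h>

// Multiply a fixed 5x5 matrix M by 4 vectors at once (SoA inputs/outputs).
void mat5x5_mul4(const float M[5][5],
                 const __m128 in[5],    // in[0] = xxxx, in[1] = yyyy, ...
                 __m128 out[5]) {       // out[i] = component i of the 4 results
    for (int i = 0; i < 5; ++i) {
        __m128 acc = _mm_mul_ps(_mm_set1_ps(M[i][0]), in[0]);
        for (int j = 1; j < 5; ++j)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(M[i][j]), in[j]));
        out[i] = acc;
    }
}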
I would suggest using Intel IPP and abstracting yourself from the dependency on specific techniques.
If you're using GCC, note that the -O3 option will enable auto-vectorization, which will automatically generate SSE or AVX instructions in many cases. In general, if you just write it as a simple for-loop, GCC will vectorize it. See http://gcc.gnu.org/projects/tree-ssa/vectorization.html for more information.
This should be easy, especially when you're on Core 2 or later: you need five _mm_dp_ps, one _mm_mul_ps, two _mm_add_ps, one ordinary multiplication, plus some shuffles, loads and stores (and if the matrix is fixed, you can keep most of it in SSE registers, if you don't need them for anything else).
As for memory bandwidth: we're talking about 2.4 megabytes of vectors, when memory bandwidths are in single-digit gigabytes per second.
What is known about the vector? Since the matrix is fixed, AND if there is a limited amount of values that the vector can take, then I'd suggest that you pre-compute the calculations and access them using a table look-up.
The classic optimization technique to trade memory for cycles...
I would recommend having a look at an optimised BLAS library, such as the Intel MKL or the AMD ACML. Based on your description I would assume that you'd be after the SGEMV level 2 matrix-vector routine, to do y = A*x style operations.
If you really want to implement something yourself, using the (available) SSE..SSE4 and AVX instruction sets can offer significant performance improvements in some cases, although this is exactly what a good BLAS library will be doing. You also need to think a lot about cache-friendly data access patterns.
I don't know if this is applicable in your case, but can you operate on "chunks" of vectors at a time? So rather than repeatedly doing a y = A*x style operation, can you operate on blocks of [y1 y2 ... yn] = A * [x1 x2 ... xn]? If so, this means that you could use an optimised matrix-matrix routine, such as SGEMM. Due to the data access patterns this may be significantly more efficient than repeated calls to SGEMV. If it were me, I would try to go down this path...
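For illustration, a batched call could look like the following sketch (assuming a CBLAS-providing library such as OpenBLAS or Intel MKL, and that A, X and Y are all stored column-major; names are illustrative):

#include <cblas.h>   // e.g. OpenBLAS; MKL exposes the same CBLAS interface

// Y = A * X, where A is 5x5 and X, Y are 5 x nvec blocks of vectors.
void apply_batched(const float* A, const float* X, float* Y, int nvec) {
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                /*M=*/5, /*N=*/nvec, /*K=*/5,
                1.0f, A, /*lda=*/5, X, /*ldb=*/5,
                0.0f, Y, /*ldc=*/5);
}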
Hope this helps.
If you know the vectors in advance (e.g., doing all 240k at once), you'd get a better speedup by parallelising the loop than by going to SSE. If you've already taken that step, or you don't know them all at once, SSE could be a big benefit.
If the memory is contiguous, then don't worry too much about the memory operations. If you've got a linked list or something then you're in trouble, but it should be able to keep up without too much problem.
5x5 is a funny size, but you could do at least 4 flops in one SSE instruction and try to cut your arithmetic overheads. You don't need the Intel compiler, but it might be better, I've heard legends about how it's much better with arithmetic code. Visual Studio has intrinsics for dealing with SSE2, and I think up to SSE4 depending on what you need. Of course, you'd have to roll it yourself. Grabbing a library might be the smart move here.

shortening the period of two dimensional dynamic array algorithms in C++

I defined two-dimensional dynamic arrays and allocated memory for them. The dimensions of the arrays are the same (256*256):
double **I1, **I2;
int M = 256;
int N = 256;
int i, j;

I1 = new double *[M+1];
for (i = 1; i <= M; i++)
    I1[i] = new double[N+1];

I2 = new double *[M+1];
for (i = 1; i <= M; i++)
    I2[i] = new double[N+1];
Then I assigned values to the elements of the arrays. I have to execute mathematical algorithms on these arrays. I used a lot of for loops, and my code runs very slowly.
For example, to subtract I2 from I1 and assign the result to another two-dimensional array I3, I used this code:
double **I3;
double temp;

// allocate I3
I3 = new double *[M+1];
for (i = 1; i <= M; i++)
    I3[i] = new double[N+1];

// I3 = I1 - I2
for (i = 1; i <= M; i++) {
    for (j = 1; j <= N; j++) {
        temp = I1[i][j] - I2[i][j];
        I3[i][j] = temp;
    }
}
How can I shorten the execution time of this C++ code without using for loops?
Could you advise me on other methods, please?
Best Regards..
First of all, in most cases I would advise against manually managing your memory like this. I'm sure you have heard that C++ offers container classes to which "algorithms" can be applied. These containers are less error prone (especially in the case of exceptions), the operations are more expressive, optimized and usually well-tested, so proven to work.
In your case, with the size of the array known beforehand, a std::vector can be used with no performance loss (except at creation), since the memory is guaranteed to be contiguous and can thus be used like an array. You should also think about flattening your array; calling an allocation routine in a loop is not exactly speedy - allocation is costly. When doing matrix multiplication, consider allocating in row-major / column-major pairs, as this helps caching... but I digress.
This is only a general advice, though - I am not advising you to re-implement this using containers, I just felt the need to mention them.
In this specific case, since you mentioned you want to "execute mathematical algorithms" I would suggest you have a look at a numeric library that is able to do matrix / vector operations, as this seems to be what you are after.
For C++, there is Newmat for example, and the (more or less) canonical BLAS/LAPACK implementations (i.e. Netlib, AMD's ACML, ATLAS). These allow you to perform common (and not so common) operations like adding/subtracting vectors, multiplying matrices etc. much faster, both by using optimized algorithms and by using optimizations such as the SIMD instructions your processor might offer (i.e. SSE).
Obviously, there is no way to avoid iterating over these values when doing computations, but you can do it in an optimized manner and with a standard interface.
In order of importance:
Switch on compiler optimization.
Allocate a single array for each matrix and use something like M*i+j for indexing. This will allocate faster and perhaps more importantly be more compact and less fragmented than multiple allocations.
Get used to indexing starting at zero; this will save you one array element, and in general comparisons against zero have the potential to be faster.
I see nothing wrong in using for loops.
If you are willing to spend even more effort, you could either use a vectorized 3rd party linear algebra lib or vectorize yourself by using things like SSE* or GPUs.
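As a sketch of the single-allocation, zero-based indexing suggested above (illustrative only), the example from the question becomes one contiguous allocation per matrix indexed with i*N + j:

#include <vector>

int main() {
    const int M = 256, N = 256;
    std::vector<double> I1(M * N), I2(M * N), I3(M * N);

    // ... fill I1 and I2 ...

    // I3 = I1 - I2: a single tight loop the compiler can auto-vectorize.
    for (int k = 0; k < M * N; ++k)
        I3[k] = I1[k] - I2[k];
}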
Some architectures have hardware support for vector arithmetic, such that a single instruction will sum all the elements of an array of doubles.
However, the first thing you must do to speed up a program is measure it. Have you timed your program to see where the slowdown occurs?
For example, one thing you appear to be doing in a for loop is lots of heap allocation, which tends to be slow. You could combine all your arrays into one array for greater speed.
You are currently doing the logical equivalent of this:
I3 = I1 - I2;
If you did this:
I1 -= I2;
Now I1 would be storing the result. This would destroy the original value of I1, but would avoid allocating a new array-of-arrays.
Also the intention of C++ is that you define classes to represent a data type and the operations on it. So you could write a class to represent your dynamic array storage. Or use an existing one - check out the uBLAS library.
I don't understand why you say that this is very slow. You're doing 256*256 subtractions here. I don't think there is a way to avoid for loops here (even if you're using a matrix library it will probably still do the same).
You might consider allocating 256*256 floats in one go instead of calling new 256 times (and then use some indexing arithmetic because you have only one index) but then it's probably easier to find a matrix library which does this for you.
Everything is already in STL, use valarray.
See also: How can I use a std::valarray to store/manipulate a contiguous 2D array?
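A minimal valarray sketch for the subtraction from the question (illustrative; the whole-array expression replaces the explicit loops):

#include <valarray>

int main() {
    const std::size_t M = 256, N = 256;
    std::valarray<double> I1(M * N), I2(M * N);

    // ... fill I1 and I2 ...

    std::valarray<double> I3 = I1 - I2;   // element-wise, no hand-written loop
    (void)I3;
}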

Optimising C++ 2-D arrays

I need a way to represent a 2-D array (a dense matrix) of doubles in C++, with absolute minimum accessing overhead.
I've done some timing on various linux/unix machines and gcc versions. An STL vector of vectors, declared as:
vector<vector<double> > matrix(n,vector<double>(n));
and accessed through matrix[i][j] is between 5% and 100% slower to access than an array declared as:
double *matrix = new double[n*n];
accessed through an inlined index function matrix[index(i,j)], where index(i,j) evaluates to i+n*j. Other ways of arranging a 2-D array without STL - an array of n pointers to the start of each row, or defining the whole thing on the stack as a constant size matrix[n][n] - run at almost exactly the same speed as the index function method.
Recent GCC versions (> 4.0) seem to be able to compile the STL vector-of-vectors to nearly the same efficiency as the non-STL code when optimisations are turned on, but this is somewhat machine-dependent.
I'd like to use STL if possible, but will have to choose the fastest solution. Does anyone have any experience in optimising STL with GCC?
If you're using GCC the compiler can analyze your matrix accesses and change the order in memory in certain cases. The magic compiler flag is defined as:
-fipa-matrix-reorg
    Perform matrix flattening and transposing. Matrix flattening tries to
    replace an m-dimensional matrix with its equivalent n-dimensional matrix,
    where n < m. This reduces the level of indirection needed for accessing
    the elements of the matrix. The second optimization is matrix transposing,
    which attempts to change the order of the matrix's dimensions in order to
    improve cache locality. Both optimizations need the -fwhole-program flag.
    Transposing is enabled only if profiling information is available.
Note that this option is not enabled by -O2 or -O3. You have to pass it yourself.
My guess is that the fastest approach, for a matrix, is to use a 1D STL array and overload the () operator to use it as a 2D matrix.
However, the STL also defines a type specifically for non-resizeable numerical arrays: valarray. It also has various optimisations for in-place operations.
valarray accepts a numerical type as its template argument:
valarray<double> a;
Then, you can use slices, indirect arrays, ... and of course, you can inherit from valarray and define your own operator()(int i, int j) for 2D arrays ...
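A minimal sketch of that idea (illustrative names, not a complete library): a 1-D valarray wrapped by a class with an inlined operator()(i, j):

#include <valarray>
#include <cstddef>

class Matrix {
    std::valarray<double> data_;
    std::size_t cols_;
public:
    Matrix(std::size_t rows, std::size_t cols)
        : data_(rows * cols), cols_(cols) {}

    // Row-major 2-D access over contiguous storage.
    double& operator()(std::size_t i, std::size_t j)       { return data_[i * cols_ + j]; }
    double  operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }
};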
Very likely this is a locality-of-reference issue. vector uses new to allocate its internal array, so each row will be at least a little apart in memory due to each block's header; it could be a long distance apart if memory is already fragmented when you allocate them. Different rows of the array are likely to at least incur a cache-line fault and could incur a page fault; if you're really unlucky two adjacent rows could be on memory lines that share a TLB slot and accessing one will evict the other.
In contrast your other solutions guarantee that all the data is adjacent. It could help your performance if you align the structure so it crosses as few cache lines as possible.
vector is designed for resizable arrays. If you don't need to resize the arrays, use a regular C++ array. STL operations can generally operate on C++ arrays.
Do be sure to walk the array in the correct direction, i.e. across (consecutive memory addresses) rather than down. This will reduce cache faults.
My recommendation would be to use Boost.UBLAS, which provides fast matrix/vector classes.
To be fair, it depends on the algorithms you are using on the matrix.
The double name[n*m] format is very fast when you are accessing data by rows, both because it has almost no overhead besides a multiplication and an addition, and because your rows are packed data that will be coherent in cache.
If your algorithms access column-ordered data, then other layouts might have much better cache coherence. If your algorithm accesses data in quadrants of the matrix, even other layouts might be better.
Try to do some research directed at the type of usage and algorithms you are using. That is especially important if the matrices are very large, since cache misses may hurt your performance far more than needing 1 or 2 extra math operations to access each address.
You could just as easily do vector< double >( n*m );
You may want to look at the Eigen C++ template library at http://eigen.tuxfamily.org/ . It generates AltiVec or sse2 code to optimize the vector/matrix calculations.
There is the uBLAS implementation in Boost. It is worth a look.
http://www.boost.org/doc/libs/1_36_0/libs/numeric/ublas/doc/matrix.htm
Another related library is Blitz++: http://www.oonumerics.org/blitz/docs/blitz.html
Blitz++ is designed to optimize array manipulation.
I did this some time back for raw images by declaring my own 2-dimensional array classes.
In a normal 2D array, you access the elements like array[2][3]. To get that effect, you'd have a class array with an overloaded [] accessor. But this would essentially return another array, thereby giving you the second dimension. The problem with this approach is that it has a double function call overhead.
The way I did it was to use the () style overload. So instead of array[2][3], I had it use this style: array(2,3).
That () function was very tiny and I made sure it was inlined.
See this link for the general concept of that:
http://www.learncpp.com/cpp-tutorial/99-overloading-the-parenthesis-operator/
You can template the type if you need to.
The difference I had was that my array was dynamic. I had a block of char memory I'd declare. And I employed a column cache, so I knew where in my sequence of bytes the next row began. Access was optimized for accessing neighbouring values, because I was using it for image processing.
It's hard to explain without the code but essentially the result was as fast as C, and much easier to understand and use.
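Since the original code isn't shown, here is an illustrative reconstruction of the idea (not the author's code): raw byte storage, a precomputed row-offset cache, and an inlined operator()(row, col):

#include <vector>
#include <cstddef>
#include <cstdint>

class Image {
    std::vector<std::uint8_t> bytes_;       // one flat block of char-sized memory
    std::vector<std::size_t>  row_start_;   // cached offset of each row's first byte
public:
    Image(std::size_t rows, std::size_t cols)
        : bytes_(rows * cols), row_start_(rows) {
        for (std::size_t r = 0; r < rows; ++r)
            row_start_[r] = r * cols;
    }
    // Inlined 2-D access; neighbouring columns are adjacent in memory.
    std::uint8_t& operator()(std::size_t r, std::size_t c) {
        return bytes_[row_start_[r] + c];
    }
};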