iteration direction on an array

Say we have two arrays a and b of a fundamental type (say, a float) and we need to calculate a[i] + b[i] for every valid index i, as well as store the result. What is the best way to iterate over the arrays to maximize cache hits? Is it front-to-back, back-to-front or something else?

For this kind of operation you should rely on your compiler's auto-vectorization. Iterate from small i to large i. The answer also depends on what you mean by "store the result" and on the number n of items you are going to iterate over.
If you mean c[i] = a[i] + b[i] and n is not too small then your compiler's auto-vectorizer will optimize this best without any more changes. Even MSVC will get that one correct (at least for SSE). Your compiler will have to do some adjustments for n not a multiple of 4 (or 8 for AVX) and alignment but this cost will be amortized across n and this overhead will have a negligible effect except for small n. If n is small then you might want to consider alignment. How small is small has to be determined but I would guess it's much less than 100.
If you mean sum += a[i] + b[i], a reduction, then you do need to think about this. This has a dependency chain, so you need to unroll your loop 3-10 times. Additionally, you need to use a relaxed floating point model, since floating point arithmetic is not associative and the auto-vectorization won't kick in without it, so add -ffast-math to GCC (/fp:fast to MSVC). If you unroll the loop and use a relaxed floating point model then GCC, ICC, Clang, and MSVC should auto-vectorize your reduction efficiently.
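The two cases above can be sketched as follows. This is only an illustration: the function names add_arrays and sum_pairs and the unroll factor of 4 are assumptions, not something the answer prescribes. Whether the compiler actually vectorizes either loop depends on the optimization flags you build with.

```cpp
#include <cstddef>

// Element-wise case: c[i] = a[i] + b[i].
// Iterating from small i to large i gives a simple sequential access
// pattern that compilers can usually auto-vectorize as written.
void add_arrays(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Reduction case: sum += a[i] + b[i].
// Four independent accumulators break the single dependency chain.
// Note that this changes the order of the floating point additions.
float sum_pairs(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     + b[i];
        s1 += a[i + 1] + b[i + 1];
        s2 += a[i + 2] + b[i + 2];
        s3 += a[i + 3] + b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
        sum += a[i] + b[i];
    }
    return sum;
}
```

Because the unrolled version reorders the additions, its result can differ in the last bits from a strictly sequential sum, which is the same reason the compiler needs a relaxed floating point model before it will do this transformation itself.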

In order to utilize the cache pre-fetch capability you need to read the arrays from front to back sequentially.
Furthermore, the arrays should be SSE aligned (16 byte). Even more important is that the items (e.g. floats) will be aligned on their size (4 bytes for floats). This is important so data will not cross cache lines (slower read).
After the arrays are aligned, you can use SSE/AVX to read, add and store the results doing 4 or 8 operations in a single instruction.
Edit:
You can read more on cache prefetching here and in depth description in the Intel SW Developer Manual.

Related

High performance table structure for really small tables (<10 items usually) where once the table is created it doesn't change?

I am searching for a high performance C++ structure for a table. The table will have void* as keys and uint32 as values.
The table itself is very small and will not change after creation. The first idea that came to my mind is using something like ska::flat_hash_map<void*, int32_t> or std::unordered_map<void*, int32_t>. However that will be overkill and will not provide me the performance I want (those tables are suited for high number of items too).
So I thought about using std::vector<std::pair<void*, int32_t>>, sorting it upon creation and linear probing it. The next ideas will be using SIMD instructions but it is possible with the current structure.
Another solution which I will shortly evaluate is like that:
struct Group
{
    void* keys[5];     // search using SIMD
    int32_t values[5];
}; // fits in cache line
struct Table
{
    Group* groups;
    size_t capacity;
};
Are there any better options? I need only 1 operation: finding values by keys, not modifying them, not anything. Thanks!
EDIT: another thing I think I should mention are the access patterns: suppose I have an array of those hash tables, each time I will look up from a random one in the array.

Linear probing is likely the fastest solution in this case on common mainstream architectures, especially since the number of elements is very small and bounded (i.e. <10). Sorting the items should not speed up the probing with so few items (it would only be useful for a binary search, which is much more expensive in this case).
If you want to use SIMD instructions, then you need to use a structure of arrays instead of an array of structures for the sake of performance. This means you should use std::pair<std::vector<void*>, std::vector<int32_t>> instead of std::vector<std::pair<void*, int32_t>> (which alternates void* keys and int32_t values in memory with some padding overhead due to the alignment constraints of void* on 64-bit architectures). Having two std::vector is not great either because you pay its overhead twice. As mentioned by @JorgeBellon in the comments, you can simply use a std::array instead of std::vector, assuming the number of items is known or bounded.
A possible optimization with SIMD instructions is to compact the key pointers on 64-bit architectures by splitting them into 32-bit lower/upper parts. Indeed, it is very unlikely that two pointers have the same lower part (least significant bits) while having a different upper part. This trick lets you check twice as many pointers at a time.
Note that using SIMD instructions may not be so great in this case in practice. This is especially true if the number of items is smaller than the number fitting in a SIMD vector. For example, with AVX2 (on x86-64 processors), you can work on 4 64-bit values at a time (or 8 32-bit values), but if you have fewer than 8 values, then you need to mask the unwanted values (or even not load them if the memory buffer does not contain some padding). This introduces an additional overhead. This is not much of a problem with AVX-512 and SVE (only available on a small fraction of processors so far) since they provide advanced masking operations. Moreover, some processors lower their frequency when they execute SIMD instructions (especially with AVX-512, although the down-clocking is not so strong with integer instructions). SIMD instructions also introduce some additional latency compared to the scalar version (which can be better pipelined), and modern processors tend to be able to execute more scalar instructions in parallel than SIMD ones. For all these reasons, it is certainly a good idea to try to write a scalar branchless implementation (possibly unrolled for better performance if the number of items is known at compile time).
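A scalar branchless lookup over a structure of arrays could look roughly like the sketch below. The names SmallTable, lookup_small, kMaxItems and kNotFound are made up for illustration, and the code assumes that unused slots hold nullptr and that nullptr is never used as a lookup key.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxItems = 8;     // assumed compile-time bound
constexpr std::int32_t kNotFound = -1;   // assumed "missing" marker

// Structure of arrays: all keys are contiguous, all values are contiguous.
struct SmallTable {
    std::array<const void*, kMaxItems> keys{};    // unused slots stay nullptr
    std::array<std::int32_t, kMaxItems> values{};
};

// Scans every slot with no early exit. The fixed trip count lets the
// compiler unroll the loop, and the select is typically compiled to a
// conditional move rather than a branch.
inline std::int32_t lookup_small(const SmallTable& t, const void* key) {
    std::int32_t result = kNotFound;
    for (std::size_t i = 0; i < kMaxItems; ++i) {
        result = (t.keys[i] == key) ? t.values[i] : result;
    }
    return result;
}
```

Whether this beats a loop with an early exit depends on the hit rate and on how predictable the lookups are, so it is worth benchmarking both variants.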

You may want to look into perfect hashing -- not too difficult, and can provide simple constant time lookups. It can take technically unbounded time to create the table, though, and it's not as fast as a regular hash table when the regular hash table gets lucky.
I think a nice alternative is an optimization of your simple linear probing idea.
Your lookup procedure would look like this:
Slot *s = &table[hash(key)];
Slot *e = s + s->max_extent;
for (; s < e; ++s) {
    if (s->key == key) {
        return s->value;
    }
}
return NOT_FOUND;
table[h].max_extent is the maximum number of elements you may have to look at if you're looking for an element with hash code h. You would pre-calculate this when you generate the table, so your lookup doesn't have to iterate until it gets a null. This greatly reduces the amount of probing you have to do for misses.
Of course you want max_extent to be as small as possible. Pick a hash result size (at least 2n) to make it <= 1 in most cases, and try a few different hash functions before picking the one that produces the best results by whatever metric you like. Your hash can be as simple as key % P, where trying different hashes means trying different P values. Fill your hash table in hash(key) order to produce the best result.
NOTE that we do not wrap around from the end to the start of the table while probing. Just allocate however many extra slots you need to avoid it.
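One way to build such a table is sketched below. It assumes keys are represented as std::uintptr_t, that the hash is key % P, and that P + n slots are allocated so probing never has to wrap; the names ExtentTable, build_table and lookup_extent are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr std::int32_t NOT_FOUND = -1;  // assumed "missing" value

struct Slot {
    std::uintptr_t key = 0;
    std::int32_t value = 0;
    std::uint32_t max_extent = 0;  // slots to scan for keys that hash here
};

struct ExtentTable {
    std::vector<Slot> slots;
    std::size_t P = 1;  // hash(key) = key % P
};

using Item = std::pair<std::uintptr_t, std::int32_t>;

// Fill the table in hash order; each key goes to its home slot or to the
// next free slot after it, and the home slot records how far to scan.
ExtentTable build_table(std::vector<Item> items, std::size_t P) {
    std::sort(items.begin(), items.end(), [P](const Item& a, const Item& b) {
        return a.first % P < b.first % P;
    });
    ExtentTable t;
    t.P = P;
    t.slots.resize(P + items.size());  // extra slots, so no wrap-around
    std::size_t next_free = 0;
    for (const Item& it : items) {
        const std::size_t h = it.first % P;
        const std::size_t pos = std::max(h, next_free);
        t.slots[pos].key = it.first;
        t.slots[pos].value = it.second;
        t.slots[h].max_extent = static_cast<std::uint32_t>(pos - h + 1);
        next_free = pos + 1;
    }
    return t;
}

// Same shape as the lookup procedure described above.
std::int32_t lookup_extent(const ExtentTable& t, std::uintptr_t key) {
    const Slot* s = &t.slots[key % t.P];
    const Slot* e = s + s->max_extent;
    for (; s < e; ++s) {
        if (s->key == key) return s->value;
    }
    return NOT_FOUND;
}
```

Trying several values of P and keeping the one with the smallest largest max_extent is then a small loop around build_table.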

How to set bits of a bit vector efficiently in parallel?

Consider a bit vector of N bits in it (N is large) and an array of M numbers (M is moderate, usually much smaller than N), each in range 0..N-1 indicating which bit of the vector must be set to 1. The latter array is not sorted. The bit vector is just an array of integers, specifically __m256i, where 256 bits are packed into each __m256i structure.
How can this work be split efficiently across multiple threads?
Preferred language is C++ (MSVC++2017 toolset v141), assembly is also great. Preferred CPU is x86_64 (intrinsics are ok). AVX2 is desired, if any benefit from it.

Let's assume you want to divide this work up among T threads. It's a pretty interesting problem since it isn't trivially parallelizable via partitioning and various solutions may apply for different sizes of N and M.
Fully Concurrent Baseline
You could simply divide up the array M into T partitions and have each thread work on its own partition of M with a shared N. The main problem is that since M is not sorted, all threads may access any element of N and hence stomp on each other's work. To avoid this, you'd have to use atomic operations such as std::atomic::fetch_or for each modification of the shared N array, or else come up with some locking scheme. Both approaches are likely to kill performance (i.e., using an atomic operation to set a bit is likely to be an order of magnitude slower than the equivalent single-threaded code).
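As a rough sketch of this baseline, each worker could run something like the function below over its own chunk of M. The function name and the choice of 64-bit words (rather than __m256i) are assumptions made for the example; thread creation and joining are left out.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sets the bits listed in m[0..count) in a bit vector shared by all workers.
// Every update is an atomic read-modify-write, so concurrent workers cannot
// lose each other's bits. Relaxed ordering is enough for the updates
// themselves; joining the worker threads makes the result visible.
void set_bits_atomic(std::vector<std::atomic<std::uint64_t>>& bits,
                     const std::uint32_t* m, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint64_t mask = std::uint64_t{1} << (m[i] % 64);
        bits[m[i] / 64].fetch_or(mask, std::memory_order_relaxed);
    }
}
```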
Let's look at ideas that are likely faster.
Private N
One relatively obvious idea to avoid the "shared N" problem which requires atomic operations for all mutations of N is simply to give each T a private copy of N and merge them at the end via or.
Unfortunately, this solution is O(N) + O(M/T) whereas the original single-threaded solution is O(M) and the "atomic" solution above is something like O(M/T) [4]. Since we know that N >> M this is likely to be a poor tradeoff in this case. Still, it's worth noting that the hidden constants in each term are very different: the O(N) term, which comes from the merging step [0], can use 256-bit wide vpor instructions, meaning a throughput of something close to 200-500 bits/cycle (if cached), while the bit-setting step, which is O(M/T), I estimate at closer to 1 bit/cycle. So this approach can certainly be the best one for moderate T even if the size of N is 10 or 100 times the size of M.
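The merge step of this approach is just a wide OR over the private copies. A minimal sketch, assuming each private copy is a std::vector of 64-bit words of the same length as the result (the name merge_private_copies is made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ORs every private copy into result. The inner loop is a plain
// element-wise OR, which compilers can usually turn into wide vector
// instructions such as vpor when AVX2 is enabled.
void merge_private_copies(std::vector<std::uint64_t>& result,
                          const std::vector<std::vector<std::uint64_t>>& copies) {
    for (const std::vector<std::uint64_t>& copy : copies) {
        for (std::size_t w = 0; w < result.size(); ++w) {
            result[w] |= copy[w];
        }
    }
}
```

The merge itself can also be split across threads by giving each thread a disjoint range of w.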
Partitions of M
The basic idea here is to partition the indexes in M such that each worker thread can then work on a disjoint part of the N array. If M was sorted, that would be trivial, but it's not, so...
A simple algorithm that will work well if M is smoothly distributed is to first partition the values of M into T buckets, with the buckets having values in the ranges [0, N/T), [N/T, 2N/T), ..., [(T-1)N/T, N). That is, divide N into T disjoint regions and then find the values of M that fall into each of them. You can spread that work across the T threads by assigning each thread an equal size chunk of M, and having them each create the T partitions and then logically merging [1] them at the end so you have the T partitions of M.
The second step is to actually set all the bits: you assign one partition to each thread, which can set the bits in a "single threaded" way, i.e., not worrying about concurrent updates, since each thread is working on a disjoint partition of N [2].
Both steps are O(M) and the second step is identical to the single-threaded case, so the overhead for parallelizing this is the first step. I suspect the first will range from about the same speed as the second to perhaps 2-4 times as slow, depending on implementation and hardware, so you can expect a speedup on a machine with many cores, but with only 2 or 4 it might not be any better.
If the distribution of M is not smooth, such that the partitions created in the first step have very different sizes, it will work poorly because some threads will get a lot more work. A simple strategy is to create say 10 * T partitions, rather than only T and have the threads in the second pass all consume from the same queue of partitions until complete. In this way you spread the work more evenly, unless the array M is very bunched up. In that case you might consider a refinement of the first step which first essentially creates a bucketed histogram of the elements, and then a reduce stage which looks at the combined histogram to create a good partitioning.
Essentially, we are just progressively refining the first stage into a type of parallel sort/partitioning algorithm, for which there is already lots of literature. You might even find that a full (parallel) sort is fastest, since it will greatly help in bit-setting phase, since accesses will be in-order and have the best spatial locality (helping with prefetching and caching, respectively).
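The two phases can be sketched as below. To keep the data flow visible, the sketch is deliberately simplified: phase 1 is written as a single pass (in the scheme above each thread would bucket its own chunk of M), and phase 2 is a function that would be called once per bucket, each call on its own thread. The function names and the rounding of the region size to 512 bits (one 64-byte cache line) are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Phase 1: bucket the indexes by the region of N they fall into.
// Region r covers bits [r * region_bits, (r + 1) * region_bits).
std::vector<std::vector<std::uint32_t>> partition_indexes(
        const std::vector<std::uint32_t>& M, std::size_t n_bits, std::size_t T) {
    // Round the region size up to a multiple of 512 bits so that two
    // regions never share a word or a cache line.
    const std::size_t region_bits = ((n_bits + T - 1) / T + 511) / 512 * 512;
    std::vector<std::vector<std::uint32_t>> buckets(T);
    for (std::uint32_t idx : M) {
        buckets[idx / region_bits].push_back(idx);
    }
    return buckets;
}

// Phase 2: set the bits of one bucket with plain, non-atomic stores.
// Buckets touch disjoint words, so one call per thread needs no locking.
void set_bits_in_bucket(std::vector<std::uint64_t>& bits,
                        const std::vector<std::uint32_t>& bucket) {
    for (std::uint32_t idx : bucket) {
        bits[idx / 64] |= std::uint64_t{1} << (idx % 64);
    }
}
```

With 10 * T buckets instead of T, the same two functions can feed a shared work queue for the unevenly distributed case.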
[0] ... and also from the "allocate a private array of length N" step, although this is likely to be quite fast.
[1] The conceptually simplest form of merging would be to simply copy each thread's partitions of M such that you have a contiguous partition of all of M, but in practice if the partitions are large you can just leave the partitions where they are and link them together, adding some complexity to the consuming code, but avoiding the compacting step.
[2] To make it truly disjoint from a threading point of view you want to ensure the partition of N falls on "byte boundaries", and perhaps even cache-line boundaries to avoid false sharing (although the latter is likely not to be a big problem since it only occurs at the edge of each partition, and the order of processing means that you are not likely to get contention).
[4] In practice, the exact "order" of the baseline concurrent solution using shared N is hard to define because there will be contention, so the O(M/T) scaling will break down for large enough T. If we assume N is quite large and T is limited to typical hardware concurrency of at most a dozen cores or so, it's probably an OK approximation.

@IraBaxter posted an interesting but flawed idea which can be made to work (at significant cost). I suspect @BeeOnRope's idea of partial-sort / partitioning the M array will perform better (especially for CPUs with large private caches which can keep parts of N hot). I'll summarize the modified version of Ira's idea that I described in comments on his deleted answer. (That answer has some suggestions about how big N has to be before it's worth multi-threading.)
Each writer thread gets a chunk of M with no sorting/partitioning.
The idea is that conflicts are very rare because N is large compared to the number of stores that can be in flight at once. Since setting a bit is idempotent, we can handle conflicts (where two threads want to set different bits in the same byte) by checking the value in memory to make sure it really does have the bit set that we want after a RMW operation like or [N + rdi], al (with no lock prefix).
E.g. thread 1 tried to store 0x1 and stepped on thread 2's store of 0x2. Thread 2 must notice and retry the read-modify-write (probably with lock or to keep it simple and make multiple retries not possible) to end up with 0x3 in the conflict byte.
We need an mfence instruction before the read-back. Otherwise store-forwarding will give us the value we just wrote before other threads see our store. In other words, a thread can observe its own stores earlier than they appear in the global order. x86 does have a Total Order for stores, but not for loads. Thus, we need mfence to prevent StoreLoad reordering. (Intel's "Loads Are not Reordered with Older Stores to the Same Location" guarantee is not as useful as it sounds: store/reload isn't a memory barrier; they're just talking about out-of-order execution preserving program-order semantics.)
mfence is expensive, but the trick that makes this better than just using lock or [N+rdi], al is that we can batch operations. e.g. do 32 or instructions and then 32 read-back. It's a tradeoff between mfence overhead per operation vs. increased chance of false-sharing (reading back cache lines that had already been invalidated by another CPU claiming them).
Instead of an actual mfence instruction, we can do the last or of a group as a lock or. This is better for throughput on both AMD and Intel. For example, according to Agner Fog's tables, mfence has one per 33c throughput on Haswell/Skylake, where lock add (same performance as or) has 18c or 19c throughput. Or for Ryzen, ~70c (mfence) vs. ~17c (lock add).
If we keep the amount of operations per fence very low, the array index (m[i]/8) + mask (1<<(m[i] & 7)) can be kept in registers for all the operations. This probably isn't worth it; fences are too expensive to do as often as every 6 or operations. Using the bts and bt bit-string instructions would mean we could keep more indices in registers (because no shift-result is needed), but probably not worth it because they're slow.
Using vector registers to hold indices might be a good idea, to avoid having to reload them from memory after the barrier. We want the load addresses to be ready as soon as the read-back load uops can execute (because they're waiting for the last store before the barrier to commit to L1D and become globally visible).
Using single-byte read-modify-write makes actual conflicts as unlikely as possible. Each write of a byte only does a non-atomic RMW on 7 neighbouring bytes. Performance still suffers from false-sharing when two threads modify bytes in the same 64B cache-line, but at least we avoid having to actually redo as many or operations. 32-bit element size would make some things more efficient (like using xor eax,eax / bts eax, reg to generate 1<<(m[i] & 31) with only 2 uops, or 1 for BMI2 shlx eax, r10d, reg (where r10d=1).)
Avoid the bit-string instructions like bts [N], eax: it has worse throughput than doing the indexing and mask calculation for or [N + rax], dl. This is the perfect use-case for it (except that we don't care about the old value of the bit in memory, we just want to set it), but still its CISC baggage is too much.
In C, a function might look something like
/// UGLY HACKS AHEAD, for testing only.
// #include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

void set_bits(volatile uint8_t *restrict N, const unsigned *restrict M, size_t len)
{
    const size_t batchsize = 32;
    // Only whole batches are processed here; the last len % batchsize
    // entries of M[] still need a separate cleanup loop.
    for (size_t i = 0; i + batchsize <= len; i += batchsize) {
        for (size_t j = 0; j < batchsize - 1; j++) {
            unsigned idx = M[i+j];
            unsigned mask = 1U << (idx & 7);
            idx >>= 3;
            N[idx] |= mask;   // plain (non-atomic) RMW on a volatile byte
        }

        // do the last operation of the batch with a lock prefix as a memory barrier.
        // seq_cst RMW is probably a full barrier on non-x86 architectures, too.
        unsigned idx = M[i+batchsize-1];
        unsigned mask = 1U << (idx & 7);
        idx >>= 3;
        __atomic_fetch_or(&N[idx], mask, __ATOMIC_SEQ_CST);
        // _mm_mfence();

        // TODO: cache `M[]` in vector registers
        // read back the whole batch and redo any bit another thread stepped on
        for (size_t j = 0; j < batchsize; j++) {
            unsigned idx = M[i+j];
            unsigned mask = 1U << (idx & 7);
            idx >>= 3;
            if (!(N[idx] & mask)) {
                __atomic_fetch_or(&N[idx], mask, __ATOMIC_RELAXED);
            }
        }
    }
}
This compiles to approximately what we want with gcc and clang. The asm (Godbolt) could be more efficient in several ways, but might be interesting to try this. This is not safe: I just hacked this together in C to get the asm I wanted for this stand-alone function, without inlining into a caller or anything. __atomic_fetch_or is not a proper compiler barrier for non-atomic variables the way asm("":::"memory") is. (At least the C11 stdatomic version isn't.) I should probably have used the legacy __sync_fetch_and_or, which is a full barrier for all memory operations.
It uses GNU C atomic builtins to do atomic RMW operations where desired on variables that aren't atomic_uint8_t. Running this function from multiple threads at once would be C11 UB, but we only need it to work on x86. I used volatile to get the asynchronous-modification-allowed part of atomic without forcing N[idx] |= mask; to be atomic. The idea is to make sure that the read-back checks don't optimize away.
I use __atomic_fetch_or as a memory barrier because I know it will be on x86. With seq_cst, it probably will be on other ISAs, too, but this is all a big hack.

There are a couple of operations involved in sets (A,B = set, X = element in a set):
Set operation                  Instruction
---------------------------------------------
Intersection of A,B            A and B
Union of A,B                   A or B
Symmetric difference of A,B    A xor B
A is subset of B               A and B = A
A is superset of B             A and B = B
A <> B                         A xor B <> 0
A = B                          A xor B = 0
X in A                         BT [A],X
Add X to A                     BTS [A],X
Subtract X from A              BTR [A],X
Given the fact that you can use the boolean operators to replace set operations you can use VPXOR, VPAND etc.
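For a set that fits in one machine word, the same correspondence can be written in C++ as below. This is only an illustration for sets of up to 64 elements; the type alias Set64 and the function names are made up, and larger sets would apply the same operators word by word (or vector by vector).

```cpp
#include <cstdint>

using Set64 = std::uint64_t;  // bit x is 1 when element x is in the set

constexpr Set64 intersect(Set64 a, Set64 b) { return a & b; }
constexpr Set64 unite(Set64 a, Set64 b)     { return a | b; }
constexpr Set64 sym_diff(Set64 a, Set64 b)  { return a ^ b; }

// a is a subset of b when intersecting with b removes nothing from a.
constexpr bool is_subset(Set64 a, Set64 b)  { return (a & b) == a; }

// Single-element operations; x must be in the range 0..63.
constexpr bool contains(Set64 a, unsigned x)    { return ((a >> x) & 1u) != 0; }
constexpr Set64 insert_bit(Set64 a, unsigned x) { return a | (Set64{1} << x); }
constexpr Set64 erase_bit(Set64 a, unsigned x)  { return a & ~(Set64{1} << x); }
```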
To set, reset or test individual bits you simply use
mov eax,BitPosition
BT [rcx],rax
You can test whether a set is empty (or equal to something else) using the following code
vpxor ymm0,ymm0,ymm0 //ymm0 = 0
//replace the previous instruction with something else if you don't want
//to compare to zero.
vpcmpeqq ymm1,ymm0,[mem] //compare mem qwords to 0 per qword
vpmovmskb eax,ymm1 //collect the top bit of each of the 32 bytes
cmp eax,-1 //all 32 bits set <=> all four qwords matched
//if the set is empty, ZF is set (use je/sete).
//if it's not, ZF is clear.
(I'm sure this code can be improved using the blend/gather etc instructions)
From here you can just extend to bigger sets or other operations.
Note that bt, btc, bts with a memory operand are not limited to 64 bits.
The following will work just fine.
mov eax,1023
bts [rcx],rax //set the 1024th element (the first element is 0).

Is treating two uint8_ts as a uint16_t less efficient

Suppose I created a class that took a template parameter equal to the number of uint8_ts I want to string together into a big int.
This way I can create a huge int like this:
SizedInt<1000> unspeakablyLargeNumber; //A 1000 byte number
Now the question arises: am I killing my speed by using uint8_ts instead of using a larger built in type.
For example:
SizedInt<2> num1;
uint16_t num2;
Are num1 and num2 the same speed, or is num2 faster?

It would undoubtedly be slower to use uint8_t[2] instead of uint16_t.
Take addition, for example. In order to get the uint8_t[2] speed up to the speed of uint16_t, the compiler would have to figure out how to translate your add-with-carry logic and fuse those multiple instructions into a single, wider addition. I'm sure that some compilers out there are capable of such optimizations sometimes, but there are many circumstances which could make the optimization unlikely or impossible.
On some architectures, this will even apply to loading / storing, since uint8_t[2] usually has different alignment requirements than uint16_t.
Typical bignum libraries, like GMP, work on the largest words that are convenient for the architecture. On x64, this means using an array of uint64_t instead of an array of something smaller like uint8_t. Adding two 64-bit numbers is quite fast on modern microprocessors; in fact, it is usually the same speed as adding two 8-bit numbers, to say nothing of the data dependencies that are introduced by propagating carry bits through arrays of small numbers. These data dependencies mean that you will often only be able to add one element of your array per clock cycle, so you want those elements to be as large as possible. (At a hardware level, there are special tricks which allow carry bits to quickly move across the entire 64-bit operation, but these tricks are unavailable in software.)
If you desire, you can always use template specialization to choose the right sized primitives to make the most space-efficient bignums you want. Otherwise, using an array of uint64_t is much more typical.
If you have the choice, it is usually best to simply use GMP. Portions of GMP are written in assembly to make bignum operations much faster than they would be otherwise.
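To make the carry dependency concrete, here is a minimal sketch of multi-word addition on 64-bit limbs. It is not how GMP implements it; the function name add_limbs and the little-endian limb order are assumptions, and both operands are assumed to have the same number of limbs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Adds b into a, limb by limb (least significant limb first), and returns
// the carry out of the top limb. Each iteration needs the carry from the
// previous one, which is the dependency chain mentioned above; with 8-bit
// limbs the same number would need eight times as many iterations.
std::uint64_t add_limbs(std::vector<std::uint64_t>& a,
                        const std::vector<std::uint64_t>& b) {
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::uint64_t s = a[i] + b[i];   // may wrap around
        const std::uint64_t c1 = s < b[i];     // 1 if the first add wrapped
        const std::uint64_t t = s + carry;     // add the incoming carry
        const std::uint64_t c2 = t < s;        // 1 if the second add wrapped
        a[i] = t;
        carry = c1 | c2;                       // at most one of them is set
    }
    return carry;
}
```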

You may get better performance from larger types due to a decreased loop overhead. However, the tradeoff here is a better speed vs. less flexibility in choosing the size.
For example, if most of your numbers are, say, 5 bytes in length, switching to uint16_t would require an overhead of an extra byte. This means a memory overhead of 20%. On the other hand, if we are talking about really large numbers, say, 50 bytes or more, memory overhead would be much smaller - on the order of 2%, so getting an increase in speed would be achieved at a much smaller cost.

Does int32_t have lower latency than int8_t, int16_t and int64_t?

(I'm referring to Intel CPUs and mainly with GCC, but poss ICC or MSVC)
Is it true that using int8_t, int16_t or int64_t is less efficient than int32_t due to additional instructions generated to convert between the CPU word size and the chosen variable size?
I would be interested if anybody has any examples or best practices for this? I sometimes use smaller variable sizes to reduce cacheline loads, but say I only consumed 50 bytes of a cacheline with one variable being 8-bit int, it may be quicker processing by using the remaining cacheline space and promote the 8-bit int to a 32-bit int etc?

You can stuff more uint8_ts into a cache line, so loading N uint8_ts will be faster than loading N uint32_ts.
In addition, if you are using a modern Intel chip with SIMD instructions, a smart compiler will vectorize what it can. Again, using a small variable in your code will allow the compiler to stuff more lanes into a SIMD register.
I think it is best to use the smallest size you can, and leave the details up to the compiler. The compiler is probably smarter than you (and me) when it comes to stuff like this. For many operations (say unsigned addition), the compiler can use the same code for uint8, uint16 or uint32 (and just ignore the upper bits), so there is no speed difference.
The bottom line is that a cache miss is WAY more expensive than any arithmetic or logical operation, so it is nearly always better to worry about cache (and thus data size) than simple arithmetic.
(It used to be true a long time ago that on Sun workstations, using double was significantly faster than float, because the hardware only supported double. I don't think that is true any more for modern x86, as the SIMD hardware (SSE, etc) has direct support for both single and double precision).

Mark Lakata's answer points in the right direction.
I would like to add some points.
A wonderful resource for understanding and making optimization decisions is the set of Agner Fog documents.
The Instruction Tables document lists the latency of the most common instructions. You can see that some of them perform better in the native-size version.
A mov, for example, may be eliminated, and a mul has lower latency.
However, here we are talking about gaining 1 clock; we would have to execute a lot of instructions to compensate for a cache miss.
If this were the whole story it would not be worth it.
The real problem comes with the decoders.
When you use length-changing prefixes (and you will when using a non-native word size) the decoder takes extra cycles.
The operand size prefix therefore changes the length of the rest of the instruction. The predecoders are unable to resolve this problem in a single clock cycle. It takes 6 clock cycles to recover from this error. It is therefore very important to avoid such length-changing prefixes.
On older (but still common) microarchitectures the penalty was severe, especially with some kinds of arithmetic instructions.
On later microarchitectures this has been mitigated, but the penalty is still present.
Another aspect to consider is that using a non-native size requires prefixing the instructions, which generates larger code.
This is as close as it gets to the statement "additional instructions [are] generated to convert between the CPU word size and the chosen variable size", since Intel CPUs can handle non-native word sizes.
With other CPUs, especially RISC ones, this is not generally true and more instructions may be generated.
So while you are making optimal use of the data cache, you are also making bad use of the instruction cache.
It is also worth noting that in the common x64 ABIs the stack must be aligned on a 16-byte boundary and that the compiler usually stores local variables in the native word size or one close to it (e.g. a DWORD on a 64-bit system).
Only if you are allocating a sufficient number of local variables, or if you are using arrays or packed structs, can you benefit from a small variable size.
If you declare a single uint16_t variable, it will probably take the same stack space as a single uint64_t, so it is best to go for the fastest size.
Furthermore, when it comes to the data cache it is locality that matters, rather than data size alone.
So, what to do?
Luckily, you don't have to decide between having small data or small code.
If you have a considerable quantity of data this is usually handled with arrays or pointers and by the use of intermediate variables. An example being this line of code.
t = my_big_data[i];
Here my approach is:
Keep the external representation of data, i.e. the my_big_data array, as small as possible. For example if that array store temperatures use a coded uint8_t for each element.
Keep the internal representation of data, i.e. the t variable, as close as possible to the CPU word size. For example t could be a uint32_t or uint64_t.
This way your program optimizes both caches and uses the native word size.
As a bonus, you may later decide to switch to SIMD instructions without having to repack the my_big_data memory layout.
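A small sketch of this split between external and internal representation is shown below. The function name and the choice of summing the values are made up for the example; my_big_data and t are the names used above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// External representation: one byte per element, so 64 elements fit in a
// 64-byte cache line. Internal representation: the loop works on 32-bit
// values, so the arithmetic itself uses a native operand size.
std::uint32_t sum_temperatures(const std::vector<std::uint8_t>& my_big_data) {
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < my_big_data.size(); ++i) {
        const std::uint32_t t = my_big_data[i];  // widen once, on load
        total += t;
    }
    return total;
}
```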
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
D. Knuth
When you design your structures memory layout be problem driven. For example, age values need 8 bit, city distances in miles need 16 bits.
When you code the algorithms use the fastest type the compiler is known to have for that scope. For example integers are faster than floating point numbers, uint_fast8_t is no slower than uint8_t.
When it is time to improve performance, start by changing the algorithm (by using faster types, eliminating redundant operations, and so on) and then, if needed, the data structures (by aligning, padding, packing and so on).

Speed up float 5x5 matrix * vector multiplication with SSE

I need to run a matrix-vector multiplication 240000 times per second. The matrix is 5x5 and is always the same, whereas the vector changes at each iteration. The data type is float. I was thinking of using some SSE (or similar) instructions.
I am concerned that the number of arithmetic operations is too small compared to the number of memory operations involved. Do you think I can get some tangible (e.g. > 20%) improvement?
Do I need the Intel compiler to do it?
Can you point out some references?

The Eigen C++ template library for vectors, matrices, ... has both
optimised code for small fixed size matrices (as well as dynamically sized ones)
optimised code that uses SSE optimisations
so you should give it a try.

In principle the speedup could be 4 times with SSE (8 times with AVX). Let me explain.
Let's call your fixed 5x5 matrix M. Defining the components of a 5D vector as (x,y,z,w,t). Now form a 5x4 matrix U from the first four vectors.
U =
xxxx
yyyy
zzzz
wwww
tttt
Next, do the matrix product MU = V. The matrix V contains the product of M and the first four vectors. The only problem is that for SSE we need read in the rows of U but in memory U is stored as xyzwtxyzwtxyzwtxyzwt so we have to transpose it to xxxxyyyyzzzzwwwwtttt. This can be done with shuffles/blends in SSE. Once we have this format the matrix product is very efficient.
Instead of taking O(5x5x4) operations with scalar code it only takes O(5x5) operations, i.e. a 4x speedup. With AVX the matrix U will be 5x8, so instead of taking O(5x5x8) operations it only takes O(5x5), i.e. an 8x speedup.
The matrix V, however, will be in xxxxyyyyzzzzwwwwtttt format so depending on the application it might have to be transposed to xyzwtxyzwtxyzwtxyzwt format.
Repeat this for the next four vectors (8 for AVX) and so forth until done.
If you have control over the vectors, for example if your application generates the vectors on the fly, then you can generate them in xxxxyyyyzzzzwwwwtttt format and avoid the transpose of the array. In that case you should get a 4x speed up with SSE and a 8x with AVX. If you combine this with threading, e.g. OpenMP, your speedup should be close to 16x (assuming four physical cores) with SSE. I think that's the best you can do with SSE.
Edit: Due to instruction level parallelism (ILP) you can get another factor of 2 in speedup, so the speedup for SSE could be 32x with four cores (64x with AVX), and again another factor of 2 with Haswell due to FMA3.
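The xxxxyyyyzzzzwwwwtttt layout can be sketched without intrinsics, as below, leaving the SSE code generation to the compiler. The type names Matrix5 and Batch and the function multiply_batch are assumptions for this example; a batch holds four vectors, with row c of the batch containing component c of all four.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kDim = 5;    // matrix is 5x5
constexpr std::size_t kLanes = 4;  // four vectors per batch (one SSE register wide)

using Matrix5 = std::array<std::array<float, kDim>, kDim>;
using Batch   = std::array<std::array<float, kLanes>, kDim>;  // U or V, stored row by row

// Computes V = M * U for four vectors at once. The innermost loop runs over
// the four lanes of one row, which is contiguous in memory, so it maps onto
// one SSE multiply and one SSE add per (r, c) pair.
Batch multiply_batch(const Matrix5& M, const Batch& U) {
    Batch V{};  // zero-initialized accumulators
    for (std::size_t r = 0; r < kDim; ++r) {
        for (std::size_t c = 0; c < kDim; ++c) {
            for (std::size_t lane = 0; lane < kLanes; ++lane) {
                V[r][lane] += M[r][c] * U[c][lane];
            }
        }
    }
    return V;
}
```

Changing kLanes to 8 gives the AVX-sized batch. Whether the compiler actually emits packed instructions for the inner loop should be checked in the generated assembly.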

I would suggest using Intel IPP and abstracting yourself from the dependency on specific techniques.

If you're using GCC, note that the -O3 option will enable auto-vectorization, which will automatically generate SSE or AVX instructions in many cases. In general, if you just write it as a simple for-loop, GCC will vectorize it. See http://gcc.gnu.org/projects/tree-ssa/vectorization.html for more information.

This should be easy, especially when you're on Core 2 or later: you need five _mm_dp_ps, one _mm_mul_ps, two _mm_add_ps, one ordinary multiplication, plus some shuffles, loads and stores (and if the matrix is fixed, you can keep most of it in SSE registers, if you don't need them for anything else).
As for memory bandwidth: we're talking about 2,4 megabytes of vectors, when memory bandwidths are in single-digit gigabytes per second.

What is known about the vector? Since the matrix is fixed, AND if there is a limited amount of values that the vector can take, then I'd suggest that you pre-compute the calculations and access them using a table look-up.
The classic optimization technique to trade memory for cycles...

I would recommend having a look at an optimised BLAS library, such as the Intel MKL or the AMD ACML. Based on your description I would assume that you'd be after the SGEMV level 2 matrix-vector routine, to do y = A*x style operations.
If you really want to implement something yourself, using the (available) SSE..SSE4 and AVX instruction sets can offer significant performance improvements in some cases, although this is exactly what a good BLAS library will be doing. You also need to think a lot about cache-friendly data access patterns.
I don't know if this is applicable in your case, but can you operate on "chunks" of vectors at a time?? So rather than repeatedly doing an y = A*x style operation can you operate on blocks of [y1 y2 ... yn] = A * [x1 x2 ... xn]. If so, this means that you could use an optimised matrix-matrix routine, such as SGEMM. Due to the data access patterns this may be significantly more efficient than repeated calls to SGEMV. If it were me, I would try to go down this path...
Hope this helps.

If you know the vectors in advance (e.g., doing all 240k at once), you'd get a better speedup by parallelising the loop than by going to SSE. If you've already taken that step, or you don't know them all at once, SSE could be a big benefit.
If the memory is contiguous, then don't worry too much about the memory operations. If you've got a linked list or something then you're in trouble, but it should be able to keep up without too much problem.
5x5 is a funny size, but you could do at least 4 flops in one SSE instruction and try to cut your arithmetic overheads. You don't need the Intel compiler, but it might be better, I've heard legends about how it's much better with arithmetic code. Visual Studio has intrinsics for dealing with SSE2, and I think up to SSE4 depending on what you need. Of course, you'd have to roll it yourself. Grabbing a library might be the smart move here.
