Adding 2 arrays together quickly in C++

Given the arrays:
int canvas[10][10];
int addon[10][10];
Where all the values range from 0 - 100, what is the fastest way in C++ to add those two arrays so each cell in canvas equals itself plus the corresponding cell value in addon?
i.e., I want to achieve something like:
canvas += addon;
So if canvas[0][0] = 3 and addon[0][0] = 2, then canvas[0][0] = 5 afterwards.
Speed is essential here as I am writing a very simple program to brute force a knapsack type problem and there will be tens of millions of combinations.
And as a small extra question (thanks if you can help!) what would be the fastest way of checking if any of the values in canvas exceed 100? Loops are slow!

Here is an SSE4 implementation that should perform pretty well on Nehalem (Core i7):
#include <limits.h>
#include <emmintrin.h>
#include <smmintrin.h>
static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i * cp = (__m128i *)&canvas[0][0];
    const __m128i * ap = (const __m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    for (i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_loadu_si128(cp);   // load 4 ints from canvas
        __m128i va = _mm_loadu_si128(ap);   // load 4 ints from addon
        vc = _mm_add_epi32(vc, va);         // element-wise add
        vmax = _mm_max_epi32(vmax, vc);     // track running maximum // SSE4 *
        _mm_storeu_si128(cp, vc);           // store the sums back into canvas
        cp++;
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit);   // compare maximum against 100 (SSE2)
    cmp = _mm_testz_si128(vcmp, vcmp);      // SSE4 *
    return cmp == 0;                        // non-zero if any element exceeds 100
}
Compile with gcc -msse4.1 ... or equivalent for your particular development environment.
For older CPUs without SSE4 (and with much more expensive misaligned loads/stores) you'll need to (a) use a suitable combination of SSE2/SSE3 intrinsics to replace the SSE4 operations (marked with an * above) and ideally (b) make sure your data is 16-byte aligned and use aligned loads/stores (_mm_load_si128/_mm_store_si128) in place of _mm_loadu_si128/_mm_storeu_si128.
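For example, a rough SSE2-only substitution for the two starred operations might look like this (a sketch, not benchmarked): _mm_max_epi32 becomes a compare plus a bitwise select, and _mm_testz_si128 becomes a movemask test.
// inside the loop, replacing the _mm_max_epi32 line:
__m128i gt = _mm_cmpgt_epi32(vc, vmax);                 // lanes where vc > vmax
vmax = _mm_or_si128(_mm_and_si128(gt, vc),
                    _mm_andnot_si128(gt, vmax));        // per-lane maximum

// after the loop, replacing the _mm_testz_si128 test:
vcmp = _mm_cmpgt_epi32(vmax, vlimit);                   // SSE2
return _mm_movemask_epi8(vcmp) != 0;                    // any element above 100?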

You can't do anything faster than loops in just C++. You would need to use some platform-specific vector instructions; that is, you would need to go down to the assembly-language level. However, there are C++ libraries that try to do this for you, so you can write at a high level and have the library take care of the low-level SIMD work appropriate for whatever architecture you are targeting with your compiler.
MacSTL is a library that you might want to look at. It was originally a Macintosh specific library, but it is cross platform now. See their home page for more info.

The best you're going to do in standard C or C++ is to recast that as a one-dimensional array of 100 numbers and add them in a loop. (Single subscripts will use a bit less processing than double ones, unless the compiler can optimize it out. The only way you're going to know how much of an effect there is, if there is one, is to test.)
You could certainly create a class where the addition would be one simple C++ instruction (canvas += addon;), but that wouldn't speed anything up. All that would happen is that the simple C++ instruction would expand into the loop above.
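A minimal sketch of such a wrapper, just to make that concrete (the names here are illustrative, not from the question):
struct Canvas {
    int cells[10][10];

    Canvas& operator+=(const Canvas& other) {
        for (int r = 0; r < 10; ++r)
            for (int c = 0; c < 10; ++c)
                cells[r][c] += other.cells[r][c];   // same loop, just hidden behind the operator
        return *this;
    }
};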
You would need to get into lower-level processing in order to speed that up. Many modern CPUs have additional instructions for such processing that you might be able to use. You might be able to run something like this on a GPU using something like CUDA. You could try making the operation parallel and running on several cores, but on such a small instance you'll have to know how caching works on your CPU.
The alternatives are to improve your algorithm (on a knapsack-type problem, you might be able to use dynamic programming in some way - without more information from you, we can't tell), or to accept the performance. Tens of millions of operations on a 10 by 10 array turn into billions of operations on individual numbers, and that's not as intimidating as it used to be. Of course, I don't know your usage scenario or performance requirements.

Two parts: first, consider your two-dimensional array [10][10] as a single array [100]. The layout rules of C++ should allow this. Second, check your compiler for intrinsic functions implementing some form of SIMD instructions, such as Intel's SSE. For example Microsoft supplies a set. I believe SSE has some instructions for checking against a maximum value, and even clamping to the maximum if you want.
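For instance, a hedged sketch of the max-check/clamp idea using SSE4.1 (canvas1d here is an assumed flattened int[100] view of the array):
#include <smmintrin.h>   // SSE4.1

__m128i v = _mm_loadu_si128((const __m128i *)&canvas1d[i]);   // four ints at a time
v = _mm_min_epi32(v, _mm_set1_epi32(100));                    // clamp each lane to 100
_mm_storeu_si128((__m128i *)&canvas1d[i], v);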

Here is an alternative.
If you are 100% certain that all your values are between 0 and 100, you could change your type from int to uint8_t. Then you could add four elements at a time through a uint32_t without worrying about overflow, since the largest possible per-byte sum is 200 and no carry can spill into the neighbouring byte.
That is ...
#include <stdint.h>

uint8_t array1[10][10];
uint8_t array2[10][10];
uint8_t dest[10][10];

uint32_t *pArr1 = (uint32_t *) &array1[0][0];
uint32_t *pArr2 = (uint32_t *) &array2[0][0];
uint32_t *pDest = (uint32_t *) &dest[0][0];
int i;

// each uint32_t addition adds four bytes at once
for (i = 0; i < sizeof (dest) / sizeof (uint32_t); i++) {
    pDest[i] = pArr1[i] + pArr2[i];
}
It may not be the most elegant, but it could help keep you from going to architecture specific code. Additionally, if you were to do this, I would strongly recommend you comment what you are doing and why.

You should check out CUDA. This kind of problem is right up CUDA's street. Recommend the Programming Massively Parallel Processors book.
However, this does require CUDA capable hardware, and CUDA takes a bit of effort to get setup in your development environment, so it would depend how important this really is!
Good luck!

Related

How to use SSE instruction set on C6678 DSP?

SSE can only be used on x86/x64 CPUs. I have a problem using the SPEEXDSP library on a TI C6678. I've never used SSE instructions before, and I've tried many ways but can't get this to work on the DSP.
Is it possible to convert SSE instructions to normal C++ code? How would I go about it?
Looking forward to your reply.
Example:
static inline double interpolate_product_double(const float* a, const float* b, unsigned int len, const spx_uint32_t oversample, float* frac) {
    int i;
    double ret;
    __m128d sum;
    __m128d sum1 = _mm_setzero_pd();
    __m128d sum2 = _mm_setzero_pd();
    __m128 f = _mm_loadu_ps(frac);
    __m128d f1 = _mm_cvtps_pd(f);
    __m128d f2 = _mm_cvtps_pd(_mm_movehl_ps(f, f));
    __m128 t;

    for (i = 0; i < len; i += 2)
    {
        t = _mm_mul_ps(_mm_load1_ps(a + i), _mm_loadu_ps(b + i * oversample));
        sum1 = _mm_add_pd(sum1, _mm_cvtps_pd(t));
        sum2 = _mm_add_pd(sum2, _mm_cvtps_pd(_mm_movehl_ps(t, t)));

        t = _mm_mul_ps(_mm_load1_ps(a + i + 1), _mm_loadu_ps(b + (i + 1) * oversample));
        sum1 = _mm_add_pd(sum1, _mm_cvtps_pd(t));
        sum2 = _mm_add_pd(sum2, _mm_cvtps_pd(_mm_movehl_ps(t, t)));
    }
    sum1 = _mm_mul_pd(f1, sum1);
    sum2 = _mm_mul_pd(f2, sum2);
    sum = _mm_add_pd(sum1, sum2);
    sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
    _mm_store_sd(&ret, sum);
    return ret;
}
Yes, you can use SIMD Everywhere (SIMDe). It provides portable implementations of many intrinsics, including all of the ones in your code. Full disclosure: I am the lead developer.
Edit: replying to phuclv here since it's a bit long for a comment.
SIMDe doesn't currently use the c6x intrinsics to implement functions the way we often do for NEON, AltiVec/VSX, WASM SIMD, etc. There is nothing preventing it, and patches are very much welcome, but they're not there yet.
However, every function in SIMDe has fallback implementations all the way back to standard C. Usually things don't get that far, though; even discounting the architecture-specific implementations mentioned above, if the compiler supports it the operations are also implemented using GNU C vector extensions, and even the portable fallbacks are annotated with OpenMP SIMD directives. Conversion functions use compiler built-ins like __builtin_convertvector, and functions which require shuffling data around will use __builtin_shuffle / __builtin_shufflevector.
Basically, SIMDe goes to great lengths to get the compiler to vectorize the code whenever possible, even if SIMDe doesn't actually know how to do it. The functions above are all pretty straightforward; I don't know enough about c6x SIMD to know what kind of operations are supported in hardware, but GCC and clang (which the TI compilers are based on) generally do a very good job with all the information SIMDe gives them. Honestly, the thing I'm most worried about here is whether the c6x supports double-precision floating point in SIMD (which the code above uses)... there is a pretty good chance it only supports single-precision floats.
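For what it's worth, a minimal sketch of how the question's function could be built with SIMDe (assuming the SIMDe headers are on the include path; the native-aliases mode lets the existing _mm_* names compile unchanged on a non-x86 target):
#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/sse2.h>   /* supplies portable _mm_setzero_pd, _mm_cvtps_pd, _mm_add_pd, ... */

/* interpolate_product_double() from the question can then be compiled as-is;
   SIMDe provides a fallback implementation for every intrinsic it uses. */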
Is it possible to modify SSE instructions to normal C++ instructions?
There's no such thing as "C++ instructions" because C++ is a high-level language with only statements and no instructions. But yes, it's possible to convert SSE intrinsics to C++ expressions, because each intrinsic is simply several scalar operations performed in parallel.
SSE is one of the SIMD instruction sets so just convert it to the corresponding SIMD in the target architecture. In your case TI C6678 does have SIMD support:
The C64x+ and C674x DSPs support 2-way SIMD operations for 16-bit data and 4-way SIMD operations for 8-bit data. On the C66x DSP, the vector processing capability is improved by extending the width of the SIMD instructions. C66x DSPs can execute instructions that operate on 128-bit vectors.
The C66x architecture indeed supports a number of SIMD instructions, somewhat comparable to those of Intel's SSE.
You need to be aware of the processor's register set in both architectures and compare the available instructions.
For example, _mm_add_ps performs four simultaneous additions of single-precision floats, packed four at a time in the SSE registers. The DSP has a similar DADDSP instruction that only performs two such additions, so you will need to translate one _mm_add_ps into two DADDSP instructions.
Read the manuals (these instruction sets are online), understand what the instructions are doing, and find the equivalences. In case of a dead-end, you still have the recourse to good old scalar operations, like C[0]= A[0]+B[0]; C[1]= A[1]+B[1];
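For the concrete function in the question, a plain-C fallback along these lines should be equivalent up to floating-point rounding (a sketch only; verify it against the SSE version before relying on it):
static inline double interpolate_product_double_scalar(const float *a, const float *b,
        unsigned int len, const spx_uint32_t oversample, float *frac)
{
    double accum[4] = {0.0, 0.0, 0.0, 0.0};
    unsigned int i, k;

    /* accumulate a[i] * b[i*oversample + k] for each of the 4 filter phases k */
    for (i = 0; i < len; i++)
        for (k = 0; k < 4; k++)
            accum[k] += (double)(a[i] * b[i * oversample + k]);

    /* weight each phase by frac[k] and sum, as the SSE code does after its loop */
    return frac[0] * accum[0] + frac[1] * accum[1]
         + frac[2] * accum[2] + frac[3] * accum[3];
}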

Simulating AVX-512 mask instructions

According to the documentation, from gcc 4.9 on the AVX-512 instruction set is supported, but I have gcc 4.8. I currently have code like this for summing up a block of memory (it's guaranteed to be less than 256 bytes, so no overflow worries):
__m128i sum = _mm_add_epi16(sum, _mm_cvtepu8_epi16(*(__m128i *) &mem));
Now, looking through the documentation, if we have, say, four bytes left over, I could use:
__m128i sum = _mm_add_epi16(sum,
                  _mm_mask_cvtepu8_epi16(_mm_set1_epi16(0),
                                         (__mmask8)_mm_set_epi16(0,0,0,0,1,1,1,1),
                                         *(__m128i *) &mem));
(Note, the type of __mmask8 doesn't seem to be documented anywhere I can find, so I am guessing...)
However, _mm_mask_cvtepu8_epi16 is an AVX-512 instruction, so is there a way to duplicate this? I tried:
_mm_mullo_epi16(_mm_set_epi16(0,0,0,0,1,1,1,1),
                _mm_cvtepu8_epi16(*(__m128i *) &mem));
However, there was a cache stall so just a direct for (int i = 0; i < remaining_bytes; i++) sum += mem[i]; gave better performance.
As I happened to stumble across this question, and it still hasn't gotten an answer, if this is still a problem...
For your example problem, you're on the right track.
Multiply is a relatively slow operation, so you should avoid the use of _mm_mullo_epi16. Use _mm_and_si128 instead as bitwise AND is a much faster operation, e.g. _mm_and_si128(_mm_cvtepu8_epi16(*(__m128i *) &mem), _mm_set_epi32(0, 0, -1, -1))
I'm not sure what you mean by a cache stall, but if memory access is a bottleneck, and the compiler won't put the constant for the above into a register, you could use something like _mm_srli_si128(vector, 8) which doesn't need any additional registers/memory loads. A shift may be slower than an AND.
If it's always 8 bytes, you can use _mm_move_epi64
None of this solves the case if the remaining number isn't a fixed number of elements (e.g. you have n%16 bytes for some arbitrary n). Note that AVX-512 doesn't really solve it either. If you need to deal with this case, you could have a table of masks and AND depending on what's remaining, e.g. _mm_and_si128(vector, masks[n & 0xf])
(_mm_mask_cvtepu8_epi16 only cares about the low half of the vector, so your example is somewhat confusing - that is, you don't need to mask anything because the later elements are completely ignored anyway)
On a more generic level, mask operations are really just an embedded _mm_blend_epi16 (or equivalent). For zeroing idioms, they can easily be emulated with _mm_and_si128 / _mm_andnot_si128, as shown above.
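To make the variable-remainder case concrete, here is a hedged SSE2-only sketch of the masking idea (note it reads a full 16 bytes from mem, so the surrounding buffer must be readable):
#include <emmintrin.h>

/* Build a byte mask whose first 'remaining' bytes are 0xFF and the rest zero. */
static inline __m128i tail_mask(int remaining /* 0..16 */)
{
    const __m128i idx = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    return _mm_cmplt_epi8(idx, _mm_set1_epi8((char)remaining));
}

/* usage: zero the lanes beyond the valid data before widening/summing
   __m128i v = _mm_and_si128(_mm_loadu_si128((const __m128i *)mem),
                             tail_mask(remaining));                     */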

How to effectively apply bitwise operation to (large) packed bit vectors?

I want to implement
void bitwise_and(
    char* __restrict__ result,
    const char* __restrict__ lhs,
    const char* __restrict__ rhs,
    size_t length);
or maybe a bitwise_or(), bitwise_xor() or any other bitwise operation. Obviously it's not about the algorithm, just the implementation details - alignment, loading the largest possible element from memory, cache-awareness, using SIMD instructions etc.
I'm sure this has (more than one) fast existing implementation, but I would guess most library implementations would require some fancy container, e.g. std::bitset or boost::dynamic_bitset - but I don't want to spend the time constructing one of those.
So do I... Copy-paste from an existing library? Find a library which can 'wrap' a raw packed bits array in memory with a nice object? Roll my own implementation anyway?
Notes:
I'm mostly interested in C++ code, but I certainly don't mind a plain C approach.
Obviously, making copies of the input arrays is out of the question - that would probably nearly-double the execution time.
I intentionally did not template the bitwise operator, in case there's some specific optimization for OR, or for AND etc.
Bonus points for discussing operations on multiple vectors at once, e.g. V_out = V_1 bitwise-and V_2 bitwise-and V_3 etc.
I noted this article comparing library implementations, but it's from 5 years ago. I can't ask which library to use since that would violate SO policy I guess...
If it helps you any, assume it's uint64_ts rather than chars (that doesn't really matter - if the char array is unaligned we can just treat the heading and trailing chars separately).
This answer is going to assume you want the fastest possible way and are happy to use platform-specific things. Your optimising compiler may be able to produce similar code to the below from normal C, but in my experience across a few compilers something as specific as this is still best hand-written.
Obviously, like all optimisation tasks, never assume anything is better/worse and measure, measure, measure.
If you could lock down your architecture to x86 with at least SSE3 you would do:
void bitwise_and(
    char* result,
    const char* lhs,
    const char* rhs,
    size_t length)
{
    while (length >= 16)
    {
        // Load in 16byte registers
        auto lhsReg = _mm_loadu_si128((__m128i*)lhs);
        auto rhsReg = _mm_loadu_si128((__m128i*)rhs);

        // do the op
        auto res = _mm_and_si128(lhsReg, rhsReg);

        // save off again
        _mm_storeu_si128((__m128i*)result, res);

        // book keeping
        length -= 16;
        result += 16;
        lhs += 16;
        rhs += 16;
    }

    // do the tail end. Assuming that the array is large the
    // most that the following code can be run is 15 times so I'm not
    // bothering to optimise. You could do it in 64 bit then 32 bit
    // then 16 bit then char chunks if you wanted...
    while (length)
    {
        *result = *lhs & *rhs;
        length -= 1;
        result += 1;
        lhs += 1;
        rhs += 1;
    }
}
This compiles to roughly 10 asm instructions per 16 bytes (plus a little for the leftover bytes and a little overhead).
The great thing about doing intrinsics like this (over hand-rolled asm) is that the compiler is still free to do additional optimisations (such as loop unrolling) on top of what you write. It also handles register allocation.
If you could guarantee aligned data you could save an asm instruction (use _mm_load_si128 instead, and the compiler will be clever enough to avoid a second load and use it as a direct memory operand to the 'pand').
If you could guarantee AVX2+ then you could use the 256-bit version and handle roughly 10 asm instructions per 32 bytes (a sketch of that loop follows below).
On ARM there are similar NEON instructions.
If you wanted to do multiple ops, just add the relevant intrinsic in the middle and it'll add 1 asm instruction per 16 bytes.
I'm pretty sure with a decent processor you don't need any additional cache control.
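A hedged sketch of that 256-bit main loop (untested here; needs AVX2 and -mavx2 or equivalent), keeping the same structure as the 16-byte version:
#include <immintrin.h>

while (length >= 32)
{
    auto lhsReg = _mm256_loadu_si256((const __m256i*)lhs);
    auto rhsReg = _mm256_loadu_si256((const __m256i*)rhs);
    _mm256_storeu_si256((__m256i*)result, _mm256_and_si256(lhsReg, rhsReg));

    length -= 32;
    result += 32;
    lhs += 32;
    rhs += 32;
}
// then fall through to the 16-byte loop and the scalar tail as before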
Don't do it this way. The individual operations will look great, sleek asm, nice performance .. but a composition of them will be terrible. You cannot make this abstraction, nice as it looks. The arithmetic intensity of those kernels is almost the worst possible (the only worse one is doing no arithmetic, such as a straight up copy), and composing them at a high level will retain that awful property. In a sequence of operations each using the result of the previous one, the results are written and read again a lot later (in the next kernel), even though the high level flow could be transposed so that the result the "next operation" needs is right there in a register. Also, if the same argument appears twice in an expression tree (and not both as operands to one operation), they will be streamed in twice, instead of reusing the data for two operations.
It doesn't have that nice warm fuzzy feeling of "look at all this lovely abstraction" about it, but what you should do is find out at a high level how you're combining your vectors, and then try to chop that in pieces that make sense from a performance perspective. In some cases that may mean making big ugly messy loops that will make people get an extra coffee before diving in, that's just too bad then. If you want performance, you often have to sacrifice something else. Usually it's not so bad, it probably just means you have a loop that has an expression consisting of intrinsics in it, instead of an expression of vector-operations that each individually have a loop.

Using SSE instructions

I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I heard that if I use SSE instructions for these operations it will run much faster compared to a normal loop written using bitwise AND and if-else conditions. My question is: should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work, or are these instructions processor specific?
SSE instructions are processor specific. You can look up which processor supports which SSE version on wikipedia.
If SSE code will be faster or not depends on many factors: The first is of course whether the problem is memory-bound or CPU-bound. If the memory bus is the bottleneck SSE will not help much. Try simplifying your integer calculations, if that makes the code faster, it's probably CPU-bound, and you have a good chance of speeding it up.
Be aware that writing SIMD-code is a lot harder than writing C++-code, and that the resulting code is much harder to change. Always keep the C++ code up to date, you'll want it as a comment and to check the correctness of your assembler code.
Think about using a library like the IPP, that implements common low-level SIMD operations optimized for various processors.
SIMD, of which SSE is an example, allows you to do the same operation on multiple chunks of data. So, you won't get any advantage to using SSE as a straight replacement for the integer operations, you will only get advantages if you can do the operations on multiple data items at once. This involves loading some data values that are contiguous in memory, doing the required processing and then stepping to the next set of values in the array.
Problems:
1 If the code path is dependent on the data being processed, SIMD becomes much harder to implement. For example:
a = array [index];
a &= mask;
a >>= shift;
if (a < somevalue)
{
    a += 2;
    array [index] = a;
}
++index;
is not easy to do as SIMD:
a1 = array [index] a2 = array [index+1] a3 = array [index+2] a4 = array [index+3]
a1 &= mask a2 &= mask a3 &= mask a4 &= mask
a1 >>= shift a2 >>= shift a3 >>= shift a4 >>= shift
if (a1<somevalue) if (a2<somevalue) if (a3<somevalue) if (a4<somevalue)
// help! can't conditionally perform this on each column, all columns must do the same thing
index += 4
2 If the data is not contiguous then loading the data into the SIMD instructions is cumbersome.
3 The code is processor specific. SSE is only on IA32 (Intel/AMD), and not all IA32 CPUs support SSE.
You need to analyse the algorithm and the data to see if it can be SSE'd and that requires knowing how SSE works. There's plenty of documentation on Intel's website.
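As an aside on point 1: the per-lane conditional can usually still be emulated branchlessly with a compare and a bitwise select. A hedged SSE2 sketch of that exact example (vmask and vsomevalue are assumed broadcast constants, and SHIFT a compile-time constant; untested):
#include <emmintrin.h>

__m128i orig = _mm_loadu_si128((const __m128i *)&array[index]);     // four ints
__m128i a    = _mm_srli_epi32(_mm_and_si128(orig, vmask), SHIFT);   // a &= mask; a >>= shift
__m128i lt   = _mm_cmplt_epi32(a, vsomevalue);                      // lanes where a < somevalue
__m128i upd  = _mm_add_epi32(a, _mm_set1_epi32(2));                 // a + 2
// keep the original element where the condition fails, the updated one where it holds
__m128i res  = _mm_or_si128(_mm_and_si128(lt, upd),
                            _mm_andnot_si128(lt, orig));
_mm_storeu_si128((__m128i *)&array[index], res);
index += 4;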
This kind of problem is a perfect example of where a good low level profiler is essential. (Something like VTune) It can give you a much more informed idea of where your hotspots lie.
My guess, from what you describe, is that your hotspot will probably be branch-prediction failures resulting from min/max calculations using if/else. Therefore, using SIMD intrinsics should allow you to use the min/max instructions; however, it might be worth just trying to use a branchless min/max calculation instead. This might achieve most of the gains with less pain.
Something like this:
inline int
minimum(int a, int b)
{
    // mask is all ones when a < b (relies on an arithmetic right shift of the sign bit)
    int mask = (a - b) >> 31;
    return ((a & mask) | (b & ~mask));
}
If you use SSE instructions, you're obviously limited to processors that support these.
That means x86, dating back to the Pentium III, which is when SSE was introduced (so it's a long time ago).
SSE2, which, as far as I can recall, is the one that offers integer operations, is somewhat more recent (Pentium 4; the early AMD Athlon processors didn't support it either).
In any case, you have two options for using these instructions. Either write the entire block of code in assembly (probably a bad idea. That makes it virtually impossible for the compiler to optimize your code, and it's very hard for a human to write efficient assembler).
Alternatively, use the intrinsics available with your compiler (if memory serves, they're usually defined in xmmintrin.h)
But again, the performance may not improve. SSE code poses additional requirements of the data it processes. Mainly, the one to keep in mind is that data must be aligned on 128-bit boundaries. There should also be few or no dependencies between the values loaded into the same register (a 128 bit SSE register can hold 4 ints. Adding the first and the second one together is not optimal. But adding all four ints to the corresponding 4 ints in another register will be fast)
It may be tempting to use a library that wraps all the low-level SSE fiddling, but that might also ruin any potential performance benefit.
I don't know how good SSE's integer operation support is, so that may also be a factor that can limit performance. SSE is mainly targeted at speeding up floating point operations.
If you intend to use Microsoft Visual C++, you should read this:
http://www.codeproject.com/KB/recipes/sseintro.aspx
We have implemented some image processing code, similar to what you describe but on a byte array, in SSE. The speedup compared to C code is considerable, depending on the exact algorithm more than a factor of 4, even with respect to the Intel compiler. However, as you already mentioned, you have the following drawbacks:
Portability. The code will run on every Intel-like CPU, so also AMD, but not on other CPUs. That is not a problem for us because we control the target hardware. Switching compilers, and even to a 64-bit OS, can also be a problem.
You have a steep learning curve, but I found that after you grasp the principles writing new algorithms is not that hard.
Maintainability. Most C or C++ programmers have no knowledge of assembly/SSE.
My advice to you will be to go for it only if you really need the performance improvement, and you can't find a function for your problem in a library like the intel IPP, and if you can live with the portability issues.
I can tell from my experience that SSE brings a huge (4x and up) speedup over a plain C version of the code (no inline asm, no intrinsics used), but hand-optimized assembler can beat compiler-generated assembly if the compiler can't figure out what the programmer intended (believe me, compilers don't cover all possible code combinations and they never will).
Also, the compiler can't always lay out the data so that it runs at the fastest possible speed.
But you need a lot of experience to get a speedup over an Intel compiler (if it's possible at all).
SSE instructions were originally just on Intel chips, but more recently (since the Athlon XP) AMD supports them as well, so if you code against the SSE instruction set, you should be portable to most x86 procs.
That being said, it may not be worth your time to learn SSE coding unless you're already familiar with assembler on x86's - an easier option might be to check your compiler docs and see if there are options to allow the compiler to autogenerate SSE code for you. Some compilers do very well vectorizing loops in this way. (You're probably not surprised to hear that the Intel compilers do a good job of this :)
Write code that helps the compiler understand what you are doing. GCC will understand and optimize SSE code such as this:
typedef union Vector4f
{
    // Easy constructor, defaulted to black/0 vector
    Vector4f(float a = 0, float b = 0, float c = 0, float d = 1.0f):
        X(a), Y(b), Z(c), W(d) { }

    // Cast operator, for []
    inline operator float* ()
    {
        return (float*)this;
    }
    // Const cast operator, for const []
    inline operator const float* () const
    {
        return (const float*)this;
    }

    // ---------------------------------------- //
    inline Vector4f operator += (const Vector4f &v)
    {
        for (int i = 0; i < 4; ++i)
            (*this)[i] += v[i];
        return *this;
    }
    inline Vector4f operator += (float t)
    {
        for (int i = 0; i < 4; ++i)
            (*this)[i] += t;
        return *this;
    }

    // Vertex / Vector
    // Lower case xyzw components
    struct {
        float x, y, z;
        float w;
    };
    // Upper case XYZW components
    struct {
        float X, Y, Z;
        float W;
    };
} Vector4f;
Just don't forget to have -msse -msse2 on your build parameters!
Although it is true that SSE is specific to some processors (SSE may be relatively safe, SSE2 much less in my experience), you can detect the CPU at runtime, and load the code dynamically depending on the target CPU.
SIMD intrinsics (such as SSE2) can speed this sort of thing up but take expertise to use correctly. They are very sensitive to alignment and pipeline latency; careless use can make performance even worse than it would have been without them. You'll get a much easier and more immediate speedup from simply using cache prefetching to make sure all your ints are in L1 in time for you to operate on them.
Unless your function needs a throughput of better than 100,000,000 integers per second, SIMD probably isn't worth the trouble for you.
Just to add briefly to what has been said before about different SSE versions being available on different CPUs: This can be checked by looking at the respective feature flags returned by the CPUID instruction (see e.g. Intel's documentation for details).
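With GCC or Clang, for example, the check can be as simple as the following sketch (the two process_* functions are hypothetical dispatch targets, not from any answer above):
if (__builtin_cpu_supports("sse4.1"))
    process_sse41(data, n);    // SIMD path
else
    process_scalar(data, n);   // portable fallback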
Have a look at inline assembler for C/C++, here is a DDJ article. Unless you are 100% certain your program will run on a compatible platform you should follow the recommendations many have given here.
I agree with the previous posters. Benefits can be quite large, but getting them may require a lot of work. Intel's documentation on these instructions is over 4K pages. You may want to check out EasySSE (a C++ wrappers library over intrinsics, plus examples), free from Ocali Inc.
I assume my affiliation with this EasySSE is clear.
I don't recommend doing this yourself unless you're fairly proficient with assembly. Using SSE will, more than likely, require careful reorganization of your data, as Skizz points out, and the benefit is often questionable at best.
It would probably be much better for you to write very small loops and keep your data very tightly organized and just rely on the compiler doing this for you. Both the Intel C Compiler and GCC (since 4.1) can auto-vectorize your code, and will probably do a better job than you. (Just add -ftree-vectorize to your CXXFLAGS.)
Edit: Another thing I should mention is that several compilers support assembly intrinsics, which would probably, IMO, be easier to use than the asm() or __asm{} syntax.

Fastest way to see how many bytes are equal between fixed length arrays

I have 2 arrays of 16 elements (chars) that I need to "compare" and see how many elements are equal between the two.
This routine is going to be used millions of times (a usual run is about 60 or 70 million times), so I need it to be as fast as possible. I'm working on C++ (C++Builder 2007, for the record)
Right now, I have a simple:
matches += array1[0] == array2[0];
repeated 16 times (profiling shows this to be about 30% faster than doing it with a for loop)
Is there any other way that could work faster?
Some data about the environment and the data itself:
I'm using C++Builder, which doesn't have any speed optimizations to take into account. I will try eventually with another compiler, but right now I'm stuck with this one.
The data will be different most of the times. 100% equal data is usually very very rare (maybe less than 1%)
UPDATE: This answer has been modified to make my comments match the source code provided below.
There is an optimization available if you have the capability to use SSE2 and popcnt instructions.
16 bytes happens to fit nicely in an SSE register. Using C++ and assembly/intrinsics, load the two 16-byte arrays into xmm registers, and cmp them. This generates a bitmask representing the true/false condition of the compare. You then use a movmsk instruction to load a bit representation of the bitmask into an x86 register; this then becomes a bit field where you can count all the 1's to determine how many true values you had. A hardware popcnt instruction can be a fast way to count all the 1's in a register.
This requires knowledge of assembly/intrinsics and SSE in particular. You should be able to find web resources for both.
If you run this code on a machine that does not support either SSE2 or popcnt, you must then iterate through the arrays and count the differences with your unrolled loop approach.
Good luck
Edit:
Since you indicated you did not know assembly, here's some sample code to illustrate my answer:
#include "stdafx.h"
#include <iostream>
#include "intrin.h"
inline unsigned cmpArray16( char (&arr1)[16], char (&arr2)[16] )
{
__m128i first = _mm_loadu_si128( reinterpret_cast<__m128i*>( &arr1 ) );
__m128i second = _mm_loadu_si128( reinterpret_cast<__m128i*>( &arr2 ) );
return _mm_movemask_epi8( _mm_cmpeq_epi8( first, second ) );
}
int _tmain( int argc, _TCHAR* argv[] )
{
unsigned count = 0;
char arr1[16] = { 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0 };
char arr2[16] = { 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0 };
count = __popcnt( cmpArray16( arr1, arr2 ) );
std::cout << "The number of equivalent bytes = " << count << std::endl;
return 0;
}
Some notes: This function uses SSE2 instructions and a popcnt instruction introduced in the Phenom processor (that's the machine that I use). I believe the most recent Intel processors with SSE4 also have popcnt. This function does not check for instruction support with CPUID; the function is undefined if used on a processor that does not have SSE2 or popcnt (you will probably get an invalid opcode instruction). That detection code is a topic for a separate thread.
I have not timed this code; the reason I think it's faster is because it compares 16 bytes at a time, branchless. You should modify this to fit your environment, and time it yourself to see if it works for you. I wrote and tested this on VS2008 SP1.
SSE prefers data that is aligned on a natural 16-byte boundary; if you can guarantee that then you should get additional speed improvements, and you can change the _mm_loadu_si128 instructions to _mm_load_si128, which requires alignment.
The key is to do the comparisons using the largest register your CPU supports, then fallback to bytes if necessary.
The below code demonstrates using 4-byte integers, but if you are running on a SIMD architecture (any modern Intel or AMD chip) you could compare both arrays in one instruction before falling back to an integer-based loop. Most compilers these days have intrinsic support for 128-bit types, so will NOT require ASM.
(Note that for the SIMD comparisons your arrays would have to be 16-byte aligned, and some processors (e.g. MIPS) would require the arrays to be 4-byte aligned for the int-based comparisons.)
E.g.
int* array1 = (int*)byteArray[0];
int* array2 = (int*)byteArray[1];
int same = 0;

for (int i = 0; i < 4; i++)
{
    // test as an int
    if (array1[i] == array2[i])
    {
        same += 4;
    }
    else
    {
        // test individual bytes
        char* bytes1 = (char*)(array1+i);
        char* bytes2 = (char*)(array2+i);

        for (int j = 0; j < 4; j++)
        {
            same += (bytes1[j] == bytes2[j]);
        }
    }
}
I can't remember what exactly the MSVC compiler supports for SIMD, but you could do something like:
// depending on compiler you may have to insert the words via an intrinsic
__m128 qw1 = *(__m128*)byteArray[0];
__m128 qw2 = *(__m128*)byteArray[1];

// again, depending on the compiler the comparison may have to be done via an intrinsic
if (qw1 == qw2)
{
    same = 16;
}
else
{
    // do int/byte testing
}
If you have the ability to control the location of the arrays, putting one right after the other in memory for instance, it might cause them to be loaded to the CPU's cache on the first access.
It depends on the CPU and its cache structure and will vary from one machine to another.
You can read about memory hierarchy and cache in Hennessy & Patterson's Computer Architecture: A Quantitative Approach.
If you need absolute lowest footprint, I'd go with assembly code. I haven't done this in a while but I'll bet MMX (or more likely SSE2/3) have instructions that can enable you to do exactly that in very few instructions.
If matches are the common case then try loading the values as 32 bit ints instead of 16 so you can compare 2 in one go (and count it as 2 matches).
If the two 32 bit values are not the same then you will have to test them separately (AND out the top and bottom 16 bit values).
The code will be more complex, but should be faster.
If you are targeting a 64-bit system you could do the same trick with 64 bit ints, and if you really want to push the limit then look at dropping into assembler and using the various vector based instructions which would let you work with 128 bits at once.
Magical compiler options will vary the time greatly. In particular making it generate SSE vectorization will likely get you a huge speedup.
Does this have to be platform independent, or will this code always run on the same type of CPU? If you restrict yourself to modern x86 CPUs, you may be able to use MMX instructions, which should allow you to operate on an array of 8 bytes in one clock tick. AFAIK, gcc allows you to embed assembly in your C code, and Intel's compiler (icc) supports intrinsics, which are wrappers that allow you to call specific assembly instructions directly. Other SIMD instruction sets, such as SSE, may also be useful for this.
Is there any connection between the values in the arrays? Are some bytes more likely to be the same than others? Might there be some intrinsic order in the values? Then you could optimize for the most probable case.
If you explain what the data actually represents then there might be a totally different way to represent the data in memory that would make this type of brute-force compare unnecessary. Care to elaborate on what the data actually represents?
Is it faster as one statement?
matches += (array1[0] == array2[0]) + (array1[1] == array2[1]) + ...;
If writing that 16 times is faster than a simple loop, then your compiler either sucks or you don't have optimization turned on.
Short answer: there's no faster way, unless you do vector operations on parallel hardware.
Try using pointers instead of arrays:
p1 = &array1[0];
p2 = &array2[0];
match += (*p1++ == *p2++);
// copy 15 times.
Of course you must measure this against other approaches to see which is fastest.
And are you sure that this routine is a bottleneck in your processing? Do you actually speed up the performance of your application as a whole by optimizing this? Again, only measurement will tell.
Is there any way you can modify the way the arrays are stored? Comparing 1 byte at a time is extremely slow considering you are probably using a 32-bit compiler. Instead if you stored your 16 bytes in 4 integers (32-bit) or 2 longs (64-bit), you would only need to perform 4 or 2 comparisons respectively.
The question to ask yourself is how much is the cost of storing the data as 4-integer or 2-long arrays. How often do you need to access the data, etc.
There's always the good old x86 REPNE CMPS instruction.
One extra possible optimization: if you are expecting that most of the time the arrays are identical, then it might be slightly faster to do a memcmp() as the first step, setting '16' as the answer if the test returns true. Of course, if you are not expecting the arrays to be identical very often, that would only slow things down.
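A rough sketch of that idea (count_matches stands in for whichever per-byte routine you end up using):
#include <string.h>

if (memcmp(array1, array2, 16) == 0)
    matches = 16;                              // fully identical: skip the per-byte work
else
    matches = count_matches(array1, array2);   // hypothetical per-byte fallback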