Fastest way to see how many bytes are equal between fixed-length arrays - C++

I have 2 arrays of 16 elements (chars) that I need to "compare" and see how many elements are equal between the two.
This routine is going to be used millions of times (a usual run is about 60 or 70 million calls), so I need it to be as fast as possible. I'm working in C++ (C++Builder 2007, for the record).
Right now, I have a simple:
matches += array1[0] == array2[0];
repeated 16 times (profiling shows it to be about 30% faster than doing it with a for loop).
Is there any other way that could work faster?
Some data about the environment and the data itself:
I'm using C++Builder, which doesn't offer much in the way of speed optimizations. I will eventually try another compiler, but right now I'm stuck with this one.
The data will be different most of the time; 100% equal arrays are very rare (maybe less than 1% of calls).

UPDATE: This answer has been modified to make my comments match the source code provided below.
There is an optimization available if you have the capability to use SSE2 and popcnt instructions.
16 bytes happens to fit nicely in an SSE register. Using C++ and assembly/intrinsics, load the two 16-byte arrays into xmm registers and compare them for byte-wise equality. This generates a bitmask representing the true/false condition of each byte compare. You then use a movmsk instruction to load a bit representation of that bitmask into an x86 register; this becomes a bit field in which you can count the 1's to determine how many bytes were equal. A hardware popcnt instruction is a fast way to count all the 1's in a register.
This requires knowledge of assembly/intrinsics and SSE in particular. You should be able to find web resources for both.
If you run this code on a machine that does not support either SSE2 or popcnt, you must then iterate through the arrays and count the differences with your unrolled loop approach.
Good luck
Edit:
Since you indicated you did not know assembly, here's some sample code to illustrate my answer:
#include "stdafx.h"
#include <iostream>
#include "intrin.h"
inline unsigned cmpArray16( char (&arr1)[16], char (&arr2)[16] )
{
__m128i first = _mm_loadu_si128( reinterpret_cast<__m128i*>( &arr1 ) );
__m128i second = _mm_loadu_si128( reinterpret_cast<__m128i*>( &arr2 ) );
return _mm_movemask_epi8( _mm_cmpeq_epi8( first, second ) );
}
int _tmain( int argc, _TCHAR* argv[] )
{
unsigned count = 0;
char arr1[16] = { 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0 };
char arr2[16] = { 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0 };
count = __popcnt( cmpArray16( arr1, arr2 ) );
std::cout << "The number of equivalent bytes = " << count << std::endl;
return 0;
}
Some notes: this function uses SSE2 instructions and the popcnt instruction introduced with the Phenom processor (that's the machine that I use). I believe the most recent Intel processors with SSE4 also have popcnt. This function does not check for instruction support with CPUID; its behavior is undefined on a processor without SSE2 or popcnt (you will probably get an invalid-opcode exception). That detection code is a separate exercise; a sketch follows.
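For completeness, here is a minimal sketch of such a detection check, assuming MSVC's __cpuid intrinsic (other compilers expose equivalents). SSE2 is reported in CPUID leaf 1, EDX bit 26, and POPCNT in ECX bit 23:

#include <intrin.h>

// Hedged sketch: query CPUID leaf 1 before taking the SSE2+popcnt fast path.
static bool haveSse2AndPopcnt()
{
    int info[4];
    __cpuid(info, 1);
    bool sse2   = (info[3] & (1 << 26)) != 0; // EDX bit 26
    bool popcnt = (info[2] & (1 << 23)) != 0; // ECX bit 23
    return sse2 && popcnt;
}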
I have not timed this code; the reason I think it's faster is because it compares 16 bytes at a time, branchless. You should modify this to fit your environment, and time it yourself to see if it works for you. I wrote and tested this on VS2008 SP1.
SSE prefers data that is aligned on a natural 16-byte boundary; if you can guarantee that then you should get additional speed improvements, and you can change the _mm_loadu_si128 instructions to _mm_load_si128, which requires alignment.
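If you can control the definitions, the aligned variant might look like this (a sketch; alignas is C++11, so older MSVC/C++Builder would use __declspec(align(16)) instead):

#include <emmintrin.h>

// Sketch: alignas(16) on the arrays' definitions guarantees the 16-byte
// boundary that _mm_load_si128 requires.
alignas(16) char alignedArr1[16];
alignas(16) char alignedArr2[16];

inline unsigned cmpArray16Aligned( const char (&a)[16], const char (&b)[16] )
{
    __m128i first  = _mm_load_si128( reinterpret_cast<const __m128i*>( &a ) );
    __m128i second = _mm_load_si128( reinterpret_cast<const __m128i*>( &b ) );
    return _mm_movemask_epi8( _mm_cmpeq_epi8( first, second ) );
}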

The key is to do the comparisons using the largest register your CPU supports, then fallback to bytes if necessary.
The code below demonstrates this using 4-byte integers, but if you are running on a SIMD architecture (any modern Intel or AMD chip) you could compare both arrays in one instruction before falling back to an integer-based loop. Most compilers these days have intrinsic support for 128-bit types, so it will NOT require ASM.
(Note that for the SIMD comparisons your arrays would have to be 16-byte aligned, and some processors (e.g. MIPS) would require the arrays to be 4-byte aligned even for the int-based comparisons.)
E.g.
int* array1 = (int*)byteArray[0];
int* array2 = (int*)byteArray[1];
int same = 0;
for (int i = 0; i < 4; i++)
{
    // test 4 bytes at a time as an int
    if (array1[i] == array2[i])
    {
        same += 4;
    }
    else
    {
        // mismatch somewhere in this word: test individual bytes
        char* bytes1 = (char*)(array1 + i);
        char* bytes2 = (char*)(array2 + i);
        for (int j = 0; j < 4; j++)
        {
            same += (bytes1[j] == bytes2[j]);
        }
    }
}
I can't remember exactly what the MSVC compiler supports for SIMD, but you could do something like:
// depending on the compiler you may have to load the words via an intrinsic
__m128 qw1 = *(__m128*)byteArray[0];
__m128 qw2 = *(__m128*)byteArray[1];

// again, depending on the compiler the comparison may have to be done via an intrinsic
if (qw1 == qw2)
{
    same = 16;
}
else
{
    // do int/byte testing
}
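For instance, with SSE2 intrinsics the fast path could be written like this (a sketch; byteArray and same are carried over from the snippet above, and the integer type __m128i replaces __m128 since we want byte-wise integer compares):

#include <emmintrin.h>

__m128i qw1 = _mm_loadu_si128((const __m128i*)byteArray[0]);
__m128i qw2 = _mm_loadu_si128((const __m128i*)byteArray[1]);

// All 16 comparison bits set means every byte matched.
if (_mm_movemask_epi8(_mm_cmpeq_epi8(qw1, qw2)) == 0xFFFF)
{
    same = 16;
}
else
{
    // do int/byte testing as above
}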

If you have the ability to control the location of the arrays, putting one right after the other in memory, for instance, it might cause them to be loaded into the CPU's cache on the first access.
It depends on the CPU and its cache structure, and will vary from one machine to another.
You can read about memory hierarchy and caches in Hennessy & Patterson's Computer Architecture: A Quantitative Approach.

If you need the absolute lowest footprint, I'd go with assembly code. I haven't done this in a while, but I'll bet MMX (or more likely SSE2/3) has instructions that let you do exactly that in very few instructions.

If matches are the common case, then try loading the values as 32-bit ints instead of 16-bit so you can compare two in one go (and count each hit as 2 matches).
If the two 32-bit values are not the same, then you will have to test them separately (mask out the top and bottom 16-bit halves).
The code will be more complex, but should be faster.
If you are targeting a 64-bit system you could do the same trick with 64-bit ints, and if you really want to push the limit then look at dropping into assembler and using the various vector-based instructions which would let you work with 128 bits at once.
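A minimal sketch of the 64-bit variant applied to the question's 16 char elements (the function name and the memcpy-based loads are mine; memcpy sidesteps alignment and strict-aliasing concerns and compiles to plain loads):

#include <cstring>
#include <cstdint>

int countMatches64(const char a[16], const char b[16])
{
    int matches = 0;
    for (int k = 0; k < 16; k += 8)
    {
        uint64_t wa, wb;
        std::memcpy(&wa, a + k, 8);
        std::memcpy(&wb, b + k, 8);
        if (wa == wb)
            matches += 8;  // whole word equal: 8 matching bytes at once
        else
            for (int j = 0; j < 8; ++j)  // mismatch: fall back to bytes
                matches += (a[k + j] == b[k + j]);
    }
    return matches;
}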

Magical compiler options will vary the time greatly. In particular, making it generate SSE vectorization will likely get you a huge speedup.

Does this have to be platform independent, or will this code always run on the same type of CPU? If you restrict yourself to modern x86 CPUs, you may be able to use MMX instructions, which should allow you to operate on an array of 8 bytes with a single instruction. AFAIK, gcc allows you to embed assembly in your C code, and Intel's compiler (icc) supports intrinsics, which are wrappers that allow you to call specific assembly instructions directly. Other SIMD instruction sets, such as SSE, may also be useful for this.

Is there any connection between the values in the arrays? Are some bytes more likely to be the same than others? Might there be some intrinsic order in the values? Then you could optimize for the most probable case.

If you explain what the data actually represents, then there might be a totally different way to represent it in memory that would make this type of brute-force compare unnecessary. Care to elaborate on what the data actually represents?

Is it faster as one statement?
matches += (array1[0] == array2[0]) + (array1[1] == array2[1]) + ...;

If writing that 16 times is faster than a simple loop, then your compiler either sucks or you don't have optimization turned on.
Short answer: there's no faster way, unless you do vector operations on parallel hardware.

Try using pointers instead of arrays:
const char *p1 = &array1[0];
const char *p2 = &array2[0];
matches += (*p1++ == *p2++);
// repeated 15 more times
Of course you must measure this against other approaches to see which is fastest.
And are you sure that this routine is a bottleneck in your processing? Do you actually speed up the performance of your application as a whole by optimizing this? Again, only measurement will tell.

Is there any way you can modify the way the arrays are stored? Comparing 1 byte at a time is extremely slow considering you are probably using a 32-bit compiler. Instead if you stored your 16 bytes in 4 integers (32-bit) or 2 longs (64-bit), you would only need to perform 4 or 2 comparisons respectively.
The question to ask yourself is how much it costs to store the data as 4-integer or 2-long arrays: how often do you need to access the data, and so on.

There's always the good old x86 REPNE CMPS instruction.

One extra possible optimization: if you are expecting that most of the time the arrays are identical, then it might be slightly faster to do a memcmp() as the first step, setting '16' as the answer if it reports equality (returns 0). Of course, if you are not expecting the arrays to be identical very often, that would only slow things down.
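A sketch of that fast path (the helper name is mine):

#include <cstring>

int countMatchesMemcmp(const char a[16], const char b[16])
{
    if (std::memcmp(a, b, 16) == 0)
        return 16;  // identical arrays: all bytes match
    int matches = 0;
    for (int i = 0; i < 16; ++i)
        matches += (a[i] == b[i]);
    return matches;
}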

Related

Why is vectorization not beneficial in this for loop?

I am trying to vectorize this for loop. After using the -Rpass flag, I am getting the following remark for it:
int someOuterVariable = 0;
for (unsigned int i = 7; i != -1; i--)
{
    array[someOuterVariable + i] -= 0.3 * anotherArray[i];
}
Remark:
The cost-model indicates that vectorization is not beneficial
the cost-model indicates that interleaving is not beneficial
I want to understand what this means. Does "interleaving is not beneficial" mean the array indexing is not proper?
It's hard to answer without more details about your types. But in general, starting a loop incurs some cost, and vectorising also implies some costs (such as moving data to/from SIMD registers and ensuring proper alignment of data).
I'm guessing here that the compiler is telling you that the vectorisation cost is bigger than simply running the 8 iterations without it, so it's not doing it.
Try increasing the number of iterations, or help the compiler by computing alignment, for example.
Typically, unless the array's element type is exactly of the proper alignment for a SIMD vector, accessing an array from an "unknown" offset (what you've called someOuterVariable) prevents the compiler from writing efficient vectorised code.
EDIT: About the "interleaving" question, it's hard to guess without knowing your tool. But in general, interleaving usually means mixing 2 streams of computation so that the compute units of the CPU are all busy. For example, if you have 2 ALUs in your CPU, and the program is doing:
c = a + b;
d = e * f;
The compiler can interleave the computations so that both the addition and multiplication happen at the same time (provided you have 2 ALUs available). Typically, this means that the multiplication, which takes a bit longer to compute (for example 6 cycles), will be started before the addition (for example 3 cycles). You'll then get the results of both operations after only 6 cycles instead of 9 if the compiler serialized the computations. This is only possible if there are no dependencies between the computations (if d required c, it cannot work). A compiler is very cautious about this and, in your example, will not apply this optimization if it can't prove that array and anotherArray don't alias.
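As an illustration of that last point, marking the pointers with the (non-standard but widely supported) __restrict qualifier promises the compiler there is no aliasing, which can change the cost model's verdict. The element type here is assumed to be double, since the question doesn't show it:

void update(double* __restrict array,
            const double* __restrict anotherArray,
            int someOuterVariable)
{
    // With the no-alias promise, the compiler is free to vectorize
    // and interleave these 8 iterations.
    for (unsigned int i = 7; i != (unsigned int)-1; i--)
        array[someOuterVariable + i] -= 0.3 * anotherArray[i];
}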

Simulating AVX-512 mask instructions

According to the documentation, from gcc 4.9 on the AVX-512 instruction set is supported, but I have gcc 4.8. I currently have code like this for summing up a block of memory (it's guaranteed to be less than 256 bytes, so no overflow worries):
__m128i sum = _mm_add_epi16(sum, _mm_cvtepu8_epi16(*(__m128i *) &mem));
Now, looking through the documentation, if we have, say, four bytes left over, I could use:
__m128i sum = _mm_add_epi16(sum,
              _mm_mask_cvtepu8_epi16(_mm_set1_epi16(0),
                                     (__mmask8)_mm_set_epi16(0,0,0,0,1,1,1,1),
                                     *(__m128i *) &mem));
(Note, the type of __mmask8 doesn't seem to be documented anywhere I can find, so I am guessing...)
However, _mm_mask_cvtepu8_epi16 is an AVX-512 instruction, so is there a way to duplicate this? I tried:
_mm_mullo_epi16(_mm_set_epi16(0,0,0,0,1,1,1,1),
                _mm_cvtepu8_epi16(*(__m128i *) &mem));
However, there was a cache stall so just a direct for (int i = 0; i < remaining_bytes; i++) sum += mem[i]; gave better performance.
As I happened to stumble across this question, and it still hasn't gotten an answer, if this is still a problem...
For your example problem, you're on the right track.
Multiply is a relatively slow operation, so you should avoid the use of _mm_mullo_epi16. Use _mm_and_si128 instead as bitwise AND is a much faster operation, e.g. _mm_and_si128(_mm_cvtepu8_epi16(*(__m128i *) &mem), _mm_set_epi32(0, 0, -1, -1))
I'm not sure what you mean by a cache stall, but if memory access is a bottleneck, and the compiler won't put the constant for the above into a register, you could use something like _mm_srli_si128(vector, 8) which doesn't need any additional registers/memory loads. A shift may be slower than an AND.
If it's always 8 bytes, you can use _mm_move_epi64
None of this solves the case where the remaining number isn't a fixed number of elements (e.g. you have n%16 bytes for some arbitrary n). Note that AVX-512 doesn't really solve it either. If you need to deal with this case, you could have a table of masks and AND depending on what's remaining, e.g. _mm_and_si128(vector, masks[n & 0xf]) (see the sketch below).
(_mm_mask_cvtepu8_epi16 only cares about the low half of the vector, so your example is somewhat confusing - that is, you don't need to mask anything because the later elements are completely ignored anyway)
On a more generic level, mask operations are really just an embedded _mm_blend_epi16 (or equivalent). For zeroing idioms, they can easily be emulated with _mm_and_si128 / _mm_andnot_si128, as shown above.
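A minimal sketch of the masks[n & 0xf] table mentioned above (built at startup here for brevity; a static const table works just as well):

#include <emmintrin.h>
#include <stdint.h>

static __m128i masks[16];

// masks[k] keeps the first k bytes of a vector and zeroes the rest.
static void init_masks(void)
{
    int8_t buf[16];
    for (int k = 0; k < 16; ++k) {
        for (int j = 0; j < 16; ++j)
            buf[j] = (j < k) ? -1 : 0;  // 0xFF marks a kept byte
        masks[k] = _mm_loadu_si128((const __m128i*)buf);
    }
}

// Usage for n remaining elements:
//   vector = _mm_and_si128(vector, masks[n & 0xf]);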

What C++ type use for fastest "for cycles"?

I think this is not answered on this site yet.
I made some code which goes through many combinations of 4 numbers. The number values are from 0 to 51, so they can be stored in 6 bits, i.e. in 1 byte, am I right? I use these 4 numbers in nested for loops and then use them in the innermost loop. So which C++ type, of those that can store at least 52 values, is the fastest for iterating through 4 nested for loops?
The code looks like:
for (type first = 0; first != 49; ++first)
    for (type second = first + 1; second != 50; ++second)
        for (type third = second + 1; third != 51; ++third)
            for (type fourth = third + 1; fourth != 52; ++fourth)
            {
                // using those values for about a billion bit operations
                // made in other for loops
            }
That code is very simplified, and maybe there is also a better way for this kind of iteration; you can help me with that as well.
Use the typedef std::uint_fast8_t from the header <cstdint>. It is supposed to be the "fastest" unsigned integer type with at least 8 bits.
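Applied to the loops above, that is simply (a two-line sketch):

#include <cstdint>

typedef std::uint_fast8_t type;  // the loops' "type" placeholder, resolved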
The fastest is whatever the underlying processor ALU can natively work with. Now registers may be addressable in multiple formats. In that case all those formats are equally fast.
So this becomes very processor architecture specific rather than C++ specific.
If you are working on a modern day PC processor then an int is as fast as anything else for your for loops.
On an embedded system there are more things to consider, e.g. whether the variable is stored in an aligned location or not.
On most machines, int is the fastest integer type. On all of the computers I work with, int is faster than unsigned, and significantly faster than signed char.
Another issue, perhaps a bigger one, is what you are doing with those numbers. You didn't show the code, so there's no way of telling. Use int if you expect first*second to produce the expected integral value.
Yet another issue is how widely portable you expect this code to be. There's a huge distinction between code that will be ported to a number of different architectures, different compilers versus code that will be used in a limited and controlled setting. If it's the latter, write some benchmarks, and use the type under which the benchmarks perform best. The problem is a bit tougher if you are writing something for wide consumption.

C++ Adding 2 arrays together quickly

Given the arrays:
int canvas[10][10];
int addon[10][10];
Where all the values range from 0 - 100, what is the fastest way in C++ to add those two arrays so each cell in canvas equals itself plus the corresponding cell value in addon?
IE, I want to achieve something like:
canvas += addon;
So if canvas[0][0] = 3 and addon[0][0] = 2, then canvas[0][0] = 5.
Speed is essential here as I am writing a very simple program to brute force a knapsack type problem and there will be tens of millions of combinations.
And as a small extra question (thanks if you can help!) what would be the fastest way of checking if any of the values in canvas exceed 100? Loops are slow!
Here is an SSE4 implementation that should perform pretty well on Nehalem (Core i7):
#include <limits.h>
#include <emmintrin.h>
#include <smmintrin.h>
static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i * cp = (__m128i *)&canvas[0][0];
    const __m128i * ap = (const __m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    for (i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_loadu_si128(cp);
        __m128i va = _mm_loadu_si128(ap);

        vc = _mm_add_epi32(vc, va);       /* canvas += addon, 4 ints at a time */
        vmax = _mm_max_epi32(vmax, vc);   /* SSE4 * */
        _mm_storeu_si128(cp, vc);

        cp++;
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit); /* did any element exceed 100? (SSE2) */
    cmp = _mm_testz_si128(vcmp, vcmp);    /* SSE4 * */
    return cmp == 0;                      /* non-zero return: limit exceeded */
}
Compile with gcc -msse4.1 ... or equivalent for your particular development environment.
For older CPUs without SSE4 (and with much more expensive misaligned loads/stores) you'll need to (a) use a suitable combination of SSE2/SSE3 intrinsics to replace the SSE4 operations (marked with an * above) and ideally (b) make sure your data is 16-byte aligned and use aligned loads/stores (_mm_load_si128/_mm_store_si128) in place of _mm_loadu_si128/_mm_storeu_si128.
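For reference, a hedged sketch of SSE2 replacements for the marked SSE4.1 operations (the helper names are mine):

#include <emmintrin.h>

/* _mm_max_epi32 replacement: select per lane via compare + bitwise ops. */
static inline __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    __m128i gt = _mm_cmpgt_epi32(a, b);  /* per-lane mask where a > b */
    return _mm_or_si128(_mm_and_si128(gt, a), _mm_andnot_si128(gt, b));
}

/* _mm_testz_si128(v, v) replacement: returns 1 if v is all zero. */
static inline int testz_sse2(__m128i v)
{
    __m128i zero = _mm_setzero_si128();
    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) == 0xFFFF;
}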
You can't do anything faster than loops in just C++. You would need to use some platform-specific vector instructions. That is, you would need to go down to the assembly-language level. However, there are some C++ libraries that try to do this for you, so you can write at a high level and have the library take care of doing the low-level SIMD work that is appropriate for whatever architecture you are targeting with your compiler.
MacSTL is a library that you might want to look at. It was originally a Macintosh specific library, but it is cross platform now. See their home page for more info.
The best you're going to do in standard C or C++ is to recast that as a one-dimensional array of 100 numbers and add them in a loop. (Single subscripts will use a bit less processing than double ones, unless the compiler can optimize it out. The only way you're going to know how much of an effect there is, if there is one, is to test.)
You could certainly create a class where the addition would be one simple C++ instruction (canvas += addon;), but that wouldn't speed anything up. All that would happen is that the simple C++ instruction would expand into the loop above.
You would need to get into lower-level processing in order to speed that up. There are additional instructions on many modern CPUs to do such processing that you might be able to use. You might be able to run something like this on a GPU using something like CUDA. You could try making the operation parallel and running on several cores, but on such a small instance you'll have to know how caching works on your CPU.
The alternatives are to improve your algorithm (on a knapsack-type problem, you might be able to use dynamic programming in some way - without more information from you, we can't tell you), or to accept the performance. Tens of millions of operations on a 10 by 10 array turn into hundreds of billions of operations on numbers, and that's not as intimidating as it used to be. Of course, I don't know your usage scenario or performance requirements.
Two parts: first, consider your two-dimensional array [10][10] as a single array [100]. The layout rules of C++ should allow this. Second, check your compiler for intrinsic functions implementing some form of SIMD instructions, such as Intel's SSE. For example Microsoft supplies a set. I believe SSE has some instructions for checking against a maximum value, and even clamping to the maximum if you want.
Here is an alternative.
If you are 100% certain that all your values are between 0 and 100, you could change your type from int to uint8_t. Then you could add 4 elements together at once through a single uint32_t without worrying about overflow: each per-byte sum is at most 200, so it never carries into the neighbouring byte.
That is...
#include <stdint.h>

uint8_t array1[10][10];
uint8_t array2[10][10];
uint8_t dest[10][10];

uint32_t *pArr1 = (uint32_t *) &array1[0][0];
uint32_t *pArr2 = (uint32_t *) &array2[0][0];
uint32_t *pDest = (uint32_t *) &dest[0][0];
int i;

/* Each 32-bit addition adds four packed bytes at once; with values
   capped at 100, per-byte sums never carry into the next byte. */
for (i = 0; i < sizeof (dest) / sizeof (uint32_t); i++) {
    pDest[i] = pArr1[i] + pArr2[i];
}
It may not be the most elegant, but it could help keep you from going to architecture specific code. Additionally, if you were to do this, I would strongly recommend you comment what you are doing and why.
You should check out CUDA. This kind of problem is right up CUDA's street. Recommend the Programming Massively Parallel Processors book.
However, this does require CUDA-capable hardware, and CUDA takes a bit of effort to get set up in your development environment, so it would depend how important this really is!
Good luck!

How to implement strlen as fast as possible

Assume that you're working on an x86 32-bit system. Your task is to implement strlen as fast as possible.
There are two problems you have to take care of:
1. address alignment;
2. reading memory one machine word (4 bytes) at a time.
It's not hard to find the first aligned address in the given string.
Then we can read memory 4 bytes at a time and add to the total length. But we should stop once there's a zero byte among those 4 bytes, and count only the bytes before it. To check for a zero byte quickly, there's this code snippet from glibc:
unsigned long int longword, himagic, lomagic;
himagic = 0x80808080L;
lomagic = 0x01010101L;

// There's a zero byte somewhere in the 4 bytes.
if (((longword - lomagic) & ~longword & himagic) != 0) {
    // handle the remaining bytes...
}
I used it in Visual C++ to compare with the CRT's implementation. The CRT's is much faster than the one above.
I'm not familiar with the CRT's implementation; did they use a faster way to check for the zero byte?
You could save the length of the string along with the string when creating it, as is done in Pascal.
First, the CRT's is written directly in assembler. You can see its source code at C:\Program Files\Microsoft Visual Studio 9.0\VC\crt\src\intel\strlen.asm (this is for VS 2008).
It depends. Microsoft's library really has two different versions of strlen. One is a portable version in C that's about the most trivial version of strlen possible, pretty close (and probably equivalent) to:
size_t strlen(char const *str) {
    char const *pos = str;
    for ( ; *pos; ++pos)
        ;
    return pos - str;
}
The other is in assembly language (used only for Intel x86), and quite similar to what you have above, at least as far as loading 4 bytes, checking whether one of them is zero, and reacting appropriately. The only obvious difference is that instead of subtracting the magic constant, they add a (nearly) negated one: instead of word - 0x01010101, they use word + 0x7efefeff.
There are also compiler-intrinsic versions which use the REPNE SCAS instruction; though these are generally found on older compilers, they can still be pretty fast. There are also SSE2 versions of strlen, such as the implementation in Agner Fog's performance library, or something such as this.
Remove those 'L' suffixes and see... You are promoting all calculations to "long"! On my 32-bit tests, that alone doubles the cost.
I also do two micro-optimizations:
Since most strings we scan consist of ASCII chars in the range 0-127, the high bit is (almost) never set, so check for it only in a second test.
Increment an index rather than a pointer, which is cheaper on some architectures (notably x86) and gives you the length for 'free'...
uint32_t gatopeich_strlen32(const char* str)
{
    uint32_t *u32 = (uint32_t*)str, u, abcd, i = 0;
    while (1)
    {
        u = u32[i++];
        abcd = (u - 0x01010101) & 0x80808080;
        if (abcd &&        // If abcd is not 0, we have NUL or a non-ASCII char > 127...
            (abcd &= ~u))  // ... discard non-ASCII chars
        {
#if BYTE_ORDER == BIG_ENDIAN
            return 4*i - (abcd & 0xffff0000 ? (abcd & 0xff000000 ? 4 : 3)
                                            : (abcd & 0xff00 ? 2 : 1));
#else
            return 4*i - (abcd & 0xffff ? (abcd & 0xff ? 4 : 3)
                                        : (abcd & 0xff0000 ? 2 : 1));
#endif
        }
    }
}
Assuming you know the maximum possible length, and you've initialized the memory to \0 before use, you could do a binary split and go left/right depending on the value (\0: split left; otherwise split right). That way you'd dramatically decrease the number of checks needed to find the length. Not optimal (requires some setup), but it should be really fast.
// Eric
Obviously, crafting a tight loop like this in assembler would be fastest, however if you want/need to keep it more human-readable and/or portable in C(++), you can still increase the speed of the standard function by using the register keyword.
The register keyword prompts the compiler to store the counter in a register on the CPU instead of in memory which will significantly speed up the loop.
Note, however, that the register keyword is only a suggestion, and the compiler is free to ignore it if it thinks it can do better, especially if certain optimization options are used. That said, while it is almost certainly going to be ignored for, say, a local variable in a triple nested for-loop, it is likely to be honored for the code below, thus improving performance quite a bit (nearly on par with the assembler version):
size_t strlen ( const char* s ) {
    register const char* i = s;
    for ( ; *i; ++i)
        ;
    return i - s;
}