Implementing memcmp - c++

The following is the Microsoft CRT implementation of memcmp:
int memcmp(const void* buf1,
const void* buf2,
size_t count)
{
if(!count)
return(0);
while(--count && *(char*)buf1 == *(char*)buf2 ) {
buf1 = (char*)buf1 + 1;
buf2 = (char*)buf2 + 1;
}
return(*((unsigned char*)buf1) - *((unsigned char*)buf2));
}
It basically performs a byte-by-byte comparison.
My question is in two parts:
1. Is there any reason not to alter this to an int-by-int comparison until count < sizeof(int), then do a byte-by-byte comparison for what remains?
2. If I were to do 1, are there any potential/obvious problems?
Notes: I'm not using the CRT at all, so I have to implement this function anyway. I'm just looking for advice on how to implement it correctly.

You could do it as an int-by-int comparison or an even wider data type if you wish.
The two things you have to watch out for (at a minimum) are an overhang at the start as well as the end, and whether the alignments are different between the two areas.
Some processors run slower if you access values without following their alignment rules (some even crash if you try it).
So your code could probably do char comparisons up to an int alignment area, then int comparisons, then char comparisons again but, again, the alignments of both areas will probably matter.
Whether that extra code complexity is worth whatever savings you will get depends on many factors outside your control. A possible method would be to detect the ideal case where both areas are aligned identically and do it a fast way, otherwise just do it character by character.
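A minimal sketch of that last suggestion, assuming C++11 and a hosted toolchain. The name my_memcmp is made up for illustration, and std::memcpy is used only so the word-sized loads stay well-defined; if you really have no CRT you would need your own equivalent or a compiler builtin. It takes the fast path only when both buffers share the same misalignment and otherwise goes byte by byte:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical name; a sketch, not the CRT's implementation.
int my_memcmp(const void* buf1, const void* buf2, std::size_t count)
{
    using word = std::uintptr_t;
    const unsigned char* p1 = static_cast<const unsigned char*>(buf1);
    const unsigned char* p2 = static_cast<const unsigned char*>(buf2);

    const std::uintptr_t a1 = reinterpret_cast<std::uintptr_t>(p1);
    const std::uintptr_t a2 = reinterpret_cast<std::uintptr_t>(p2);

    // Fast path only when both buffers have the same misalignment.
    if (((a1 ^ a2) % sizeof(word)) == 0) {
        // Leading overhang: compare bytes until p1 (and therefore p2) is aligned.
        while (count > 0 &&
               reinterpret_cast<std::uintptr_t>(p1) % sizeof(word) != 0) {
            if (*p1 != *p2) return *p1 - *p2;
            ++p1; ++p2; --count;
        }
        // Aligned middle: compare one word at a time.
        while (count >= sizeof(word)) {
            word w1, w2;
            std::memcpy(&w1, p1, sizeof w1);
            std::memcpy(&w2, p2, sizeof w2);
            if (w1 != w2) break;  // the byte loop below locates the difference
            p1 += sizeof(word); p2 += sizeof(word); count -= sizeof(word);
        }
    }

    // Trailing overhang, a mismatching word, or differently aligned buffers.
    for (; count > 0; ++p1, ++p2, --count) {
        if (*p1 != *p2) return *p1 - *p2;
    }
    return 0;
}
```

Whether this beats the plain byte loop depends on the target and the typical buffer sizes, so measure before adopting it.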

The optimization you propose is very common. The biggest concern would be if you try to run it on a processor that doesn't allow unaligned accesses for anything other than a single byte, or is slower in that mode; the x86 family doesn't have that problem.
It's also more complicated, and thus more likely to contain a bug.

Don't forget that when you find a mismatch within a larger chunk, you must then identify the first differing char within that chunk so that you can calculate the correct return value (memcmp() returns the difference of the first differing bytes, treated as unsigned char values).
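To illustrate, here is a small sketch (assuming C++11; compare_mismatched_words is a hypothetical helper, not part of any library). Subtracting or ordering the two words directly would give the wrong answer on a little-endian machine, so the helper walks the bytes in memory order instead:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: given two words that were loaded from memory,
// return a memcmp-style result based on the first differing byte
// in memory order, independent of endianness.
int compare_mismatched_words(std::uint32_t w1, std::uint32_t w2)
{
    unsigned char b1[sizeof w1];
    unsigned char b2[sizeof w2];
    std::memcpy(b1, &w1, sizeof w1);  // recover the bytes as they sat in memory
    std::memcpy(b2, &w2, sizeof w2);
    for (std::size_t i = 0; i < sizeof w1; ++i) {
        if (b1[i] != b2[i]) return b1[i] - b2[i];
    }
    return 0;  // the words were equal after all
}
```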

If you compare as int, you will need to check alignment and check if count is divisible by sizeof(int) (to compare the last bytes as char).

Is that really their implementation? I have other issues besides not doing it int-wise:
casting away constness.
does that return statement work? unsigned char - unsigned char = signed int?
int-at-a-time only works if the pointers are aligned, or if you can read a few bytes from the front of each and they are both then aligned. So if both are 1 byte before the alignment boundary, you can read one char of each and then go int-at-a-time; but if they are aligned differently, e.g. one is aligned and one is not, there is no way to do this.
memcmp is at its most inefficient (i.e. it takes the longest) when the buffers do actually compare equal (it has to go to the end) and the data is long.
I would not write my own but if you are going to be comparing large portions of data you could do things like ensure alignment and even pad the ends, then do word-at-a-time, if you want.

Another idea is to optimize for the processor cache and fetching. Processors like to fetch in large chunks rather than individual bytes at random times. Although the internal workings may already account for this, it would be a good exercise anyway. Always profile to determine the most efficient solution.
Pseudocode:
while bytes remaining > (cache size) / 2 do // Half the cache for source, other for dest.
fetch source bytes
fetch destination bytes
perform comparison using fetched bytes
end-while
perform byte by byte comparison for remainder.
For more information, search the web for "Data Driven Design" and "data oriented programming".
Some processors, such as the ARM family, allow for conditional execution of instructions (in 32-bit, non-thumb) mode. The processor fetches the instructions but will only execute them if the conditions are satisfied. In this case, try rephrasing the comparison in terms of boolean assignments. This may also reduce the number of branches taken, which improves performance.
See also loop unrolling.
See also assembly language.
You can gain a lot of performance by tailoring the algorithm to a specific processor, but you lose portability.

The code you found is just a debug implementation of memcmp; it's optimized for simplicity and readability, not for performance.
The intrinsic compiler implementation is platform specific and smart enough to generate processor instructions that compare dwords or qwords (depending on the target architecture) at once whenever possible.
Also, an intrinsic implementation may return immediately if both buffers have the same address (buf1 == buf2). This check is also missing in the debug implementation.
Finally, even when you know exactly on which platform you'll be running, the perfect implementation is still the less generic one as it depends on a bunch of different factors that are specific to the rest of your program:
What is the minimum guaranteed buffer alignment?
Can you read any padding bytes past the end of a buffer without triggering an access violation?
May the buffer parameters be identical?
May the buffer size be 0?
Do you only need to compare buffer contents for equality? Or do you also need to know which one is larger (return value < 0 or > 0)?
...
If performance is a concern, I suggest writing the comparison routine in assembly. Most compilers give you an option to see the assembly listing that they generate for a source file. You could take that code and adapt it to your needs.

Many processors implement this as a single instruction. If you can guarantee which processor you're running on, it can be implemented with a single line of inline assembler.

Related

Strange C++ Memory Allocation

I created a simple class, Storer, in C++, playing with memory allocation. It contains six field variables, all of which are assigned in the constructor:
int x;
int y;
int z;
char c;
long l;
double d;
I was interested in how these variables were being stored, so I wrote the following code:
Storer *s=new Storer(5,4,3,'a',5280,1.5465);
cout<<(long)s<<endl<<endl;
cout<<(long)&(s->x)<<endl;
cout<<(long)&(s->y)<<endl;
cout<<(long)&(s->z)<<endl;
cout<<(long)&(s->c)<<endl;
cout<<(long)&(s->l)<<endl;
cout<<(long)&(s->d)<<endl;
I was very interested in the output:
33386512
33386512
33386516
33386520
33386524
33386528
33386536
Why is the char c taking up four bytes? sizeof(char) returns, of course, 1, so why is the program allocating more memory than it needs? The following code confirms that more memory is being allocated than the members require:
cout<<sizeof(s->c)<<endl;
cout<<sizeof(Storer)<<endl;
cout<<sizeof(int)+sizeof(int)+sizeof(int)+sizeof(char)+sizeof(long)+sizeof(double)<<endl;
which prints:
1
32
29
confirming that, indeed, 3 bytes are being allocated needlessly. Can anyone explain to me why this is happening? Thanks.
Data alignment and compiler padding say hi!
The CPU has no notion of type, what it gets in its 32-bit (or 64-bit, or 128-bit (SSE), or 256-bit (AVX) - let's keep it simple at 32) registers needs to be properly aligned in order to be processed correctly and efficiently. Imagine a simple scenario, where you have a char, followed by an int. In a 32-bit architecture, that's 1 byte for a char and 4 bytes for an integer.
A 32-bit register would have to break on its boundary, only taking in 3 bytes of the integer and leaving the 4th byte for "a second run". It cannot process the data properly that way, so the compiler will add padding in order to make sure all the stuff is processed efficiently. And that means adding a certain amount of padding depending on the type in question.
Why is misalignment a problem?
The computer is not human; it can't just pick the bytes out with a pair of eyes and a brain. It has to be very deterministic and cautious about how it goes about doing things. First it loads one block containing n bytes of the given information and shifts it around to prune out unrelated information, then it loads another block and again shifts out the bytes that have nothing to do with the operation at hand, and only then can it do the necessary operation. And usually you have two operands; that's just one of them complete. That is far too much performance overhead when you can simply align the data properly (and most of the time compilers do it for you, if you're not doing anything fancy).
Could you visualize it?
Visually - the first green byte is the mentioned char, and the three green bytes plus the first red one of the second block is the 4-byte int, colorcoded on a 4-byte access boundary (we're talking about a 32-bit register). The "instead part" at the bottom shows an ideal setup where the int hits the register properly (the char getting padded into obedience somewhere off image):
Read more on data alignment, which comes in quite handy when you're dealing with fancy extensions of the instruction set like SSE (128-bit registers) or AVX (256-bit registers). Special care must be taken there so that the vectorization optimizations are not defeated (for SSE that means aligning on a 16-byte boundary: 16 * 8 = 128 bits).
Additional remarks on user defined alignment
phonetagger made a valid point in the comments: there are pragma directives that force the compiler to align the data the way the programmer specifies. But such directives, like #pragma pack(...), are a statement to the compiler that you know what you're doing and what's best for you. Be sure that you do, because if you fail to accommodate your environment, you might experience various penalties, the most obvious being a clash with external libraries you didn't write yourself that pack data differently.
Things simply explode when they clash. It is best to be cautious in such cases and to be really familiar with the issue at hand. If you're not sure, leave it to the defaults. If you are not sure but have to use something like SSE where alignment is king (and neither the default nor simple by a long shot), consult various resources online or ask another question here.
I will make an analogy to help you understand.
Assume there is a long loaf of bread and you have a cutting machine that can cut it into slices of equal thickness. Then you give these slices out to, let's say, children. Every child takes their slice and does whatever they want with it (put Nutella on it and eat it, etc.). They can even cut it into thinner slices and use those.
If one child comes up to you and says that he does not want that slice everyone is getting, but a thinner slice instead, then you will have difficulties, because your cutting machine is optimized to cut at least a minimum amount, which makes everyone happy. But when one child asks for a thinner slice, then you have to reinvent the machine or put additional complexity to it like introducing two cutting modes. You don't want that. Eventually you give up and just give him a big slice anyway.
This is the same reason why it happens. Hope you could relate to the analogy.
Data alignment is why the char appears to take up 4 bytes.
char does not take up four bytes: it takes up a single byte as usual. You can check it by printing sizeof(char). The other three bytes are padding that the compiler inserts to optimize access to other members of your class. Depending on hardware, it is often much faster to access multi-byte types, say, 4-byte integers, when they are located at an address divisible by four. A compiler may insert up to three bytes of padding before an int member to align it with a good memory address for faster access.
If you would like to experiment with class layouts, you can use a handy macro called offsetof. It takes two parameters, the name of the class and then the name of the member, and it returns the number of bytes from the base address of your struct to the position of the member in memory.
cout << offsetof(Storer, x) << endl;
cout << offsetof(Storer, y) << endl;
cout << offsetof(Storer, z) << endl;
Structure members are aligned in particular ways. In general, if you want the most compact representation, list the members in decreasing order of size.
http://en.wikipedia.org/wiki/Data_structure_alignment#Typical_alignment_of_C_structs_on_x86
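A small sketch of the ordering advice, assuming C++11. Loose and Tight are made-up structs for illustration (the Storer class from the question happens to come out at the same size either way on a typical 64-bit ABI, so it would not show the effect). On a typical x86-64 ABI Loose is 24 bytes and Tight is 16, but the exact numbers are implementation-defined, so the code only asserts the relationship:

```cpp
#include <cstddef>
#include <iostream>

// Hypothetical structs: same members, different declaration order.
struct Loose {
    char   a;  // usually followed by padding so that b is aligned
    double b;
    char   c;  // usually followed by tail padding
};

struct Tight {
    double b;  // largest member first
    char   a;
    char   c;  // the two chars share what would otherwise be padding
};

// Holds on common ABIs; the standard itself does not promise it.
static_assert(sizeof(Tight) <= sizeof(Loose),
              "reordering should never make this struct larger");

// Prints where each member landed on the current compiler and target.
void print_layouts()
{
    std::cout << "Loose: a@" << offsetof(Loose, a)
              << " b@" << offsetof(Loose, b)
              << " c@" << offsetof(Loose, c)
              << " size " << sizeof(Loose) << '\n';
    std::cout << "Tight: b@" << offsetof(Tight, b)
              << " a@" << offsetof(Tight, a)
              << " c@" << offsetof(Tight, c)
              << " size " << sizeof(Tight) << '\n';
}
```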

C++ BOOL (typedef int) vs bool for performance

I read somewhere that using BOOL (typedef int) is better than using the standard c++ type bool because the size of BOOL is 4 bytes (i.e. a multiple of 4) and it saves alignment operations of variables into registers or something along those lines...
Is there any truth to this? I imagine that the compiler would pad the stack frames in order to keep alignments of multiple of 4s even if you use bool (1 byte)?
I'm by no means an expert on the underlying workings of alignments, registers, etc so I apologize in advance if I've got this completely wrong. I hope to be corrected. :)
Cheers!
First of all, sizeof(bool) is not necessarily 1. It is implementation-defined, giving the compiler writer freedom to choose a size that's suitable for the target platform.
Also, sizeof(int) is not necessarily 4.
There are multiple issues that could affect performance:
alignment;
memory bandwidth;
CPU's ability to efficiently load values that are narrower than the machine word.
What -- if any -- difference that makes to a particular piece of code can only be established by profiling that piece of code.
The only guaranteed size you can get in C++ is with char, unsigned char, and signed char 2), which are always exactly one byte and defined for every platform. 0) 1)
0) Though a byte does not have a defined size. sizeof(char) is always 1 byte, but might be 40 binary bits in fact
1) Yes, there is uint32_t and friends, but no, their definition is optional for actual C++ implementations. Use them, but you may get compile time errors if they are not available (compile time errors are always good)
2) char, unsigned char, and signed char are distinct types and it is not defined whether char is signed or not. Keep this in mind when overloading functions and writing templates.
There are three commonly accepted performance-driven practices in regards to booleans:
In if-statements order of checking the expressions matters and one needs to be careful about them.
If a check of a boolean expression causes a lot of branch mispredictions, then it should (if possible) be substituted with a bit twiddling hack.
Since a boolean is usually among the smallest data types, boolean variables should be declared last in structures and classes, so that padding does not add noticeable holes in the structure memory layout.
I've never heard about any performance gain from substituting a boolean with (unsigned?) integer however.

Difference between uint8_t and unspecified int for large matrices

I have a matrix that is over 17,000 x 14,000 that I'm storing in memory in C++. The values will never get over 255, so I'm thinking I should store this matrix as a uint8_t type instead of a regular int type. Will the regular int type assume the native word size (64 bit, so 8 bytes per cell) even with an optimizing compiler? I'm assuming I'll use 8x less memory if I store the array as uint8_t?
If you doubt this, you could have just tried it.
Of course it will be smaller.
However, it wholly depends on your usage patterns which will be faster. Profile! Profile! Profile!
Reasons for unexpected performance considerations:
alignment issues
elements sharing cache lines (could be positive on sequential access; negative in multicore scenarios)
increased need for locking on atomic reads/writes (in case of threading)
reduced applicability of certain optimized MIPS instructions (? - I'm not up-to-date with details here; also a very good optimizing compiler might simply register-allocate temporaries of the right size)
other, unrelated border conditions, originating from the surrounding code
The standard doesn't specify the exact size of int other than it's at least the size of short. On some 64-bit architectures (for example many Linux and Solaris x86 systems I work with) int is 32 bits and long is 64 bits. The exact size of each type will of course vary by compiler/hardware.
The best way to find out is to use sizeof(int) on your system and see how big it is. If you have enough RAM using the native type may in fact be significantly faster than the uint8_t.
Even the best optimizing compiler is not going to do an analysis of the values of the data that you put into your matrix and assume (anthropomorphizing here) "Hmmm. He said int but everything is between 0 and 255. I'm going to make that an array of uint8_t."
The compiler can interpret some keywords such as register and inline as suggestions rather than mandates. Types on the other hand are mandates. You told the compiler to use int so the compiler must use int. So switching to a uint8_t matrix will save you a considerable amount of memory here.
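To put numbers on it, here is a rough sketch assuming C++11 and a platform that provides std::uint8_t. Matrix and matrix_bytes are made-up names for illustration; the element type is whatever you ask for, and the footprint follows directly from sizeof:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed for the element storage of a rows x cols matrix.
constexpr unsigned long long matrix_bytes(unsigned long long rows,
                                          unsigned long long cols,
                                          unsigned long long elem_size)
{
    return rows * cols * elem_size;
}

// 17,000 x 14,000 one-byte cells: 238,000,000 bytes.
// With a 4-byte int the same matrix needs 952,000,000 bytes.
static_assert(matrix_bytes(17000, 14000, sizeof(std::uint8_t)) == 238000000ULL,
              "one byte per cell");

// Hypothetical row-major matrix stored in one contiguous allocation.
template <typename T>
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : cols_(cols), data_(rows * cols) {}  // elements start at zero

    T&       at(std::size_t r, std::size_t c)       { return data_[r * cols_ + c]; }
    const T& at(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }

    // Size of the element storage, excluding the vector's own bookkeeping.
    std::size_t bytes() const { return data_.size() * sizeof(T); }

private:
    std::size_t cols_;
    std::vector<T> data_;
};
```

So the saving relative to int is a factor of sizeof(int), which is 4 rather than 8 on most current 64-bit systems.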

fastest way to write a bitstream on modern x86 hardware

What is the fastest way to write a bitstream on x86/x86-64? (codeword <= 32bit)
by writing a bitstream I refer to the process of concatenating variable bit-length symbols into a contiguous memory buffer.
currently I've got a standard container with a 32bit intermediate buffer to write to
void write_bits(SomeContainer<unsigned int>& dst,unsigned int& buffer, unsigned int& bits_left_in_buffer,int codeword, short bits_to_write){
if(bits_to_write < bits_left_in_buffer){
buffer|= codeword << (32-bits_left_in_buffer);
bits_left_in_buffer -= bits_to_write;
}else{
unsigned int full_bits = bits_to_write - bits_left_in_buffer;
unsigned int towrite = buffer|(codeword<<(32-bits_left_in_buffer));
buffer= full_bits ? (codeword >> bits_left_in_buffer) : 0;
dst.push_back(towrite);
bits_left_in_buffer = 32-full_bits;
}
}
Does anyone know of any nice optimizations, fast instructions or other info that may be of use?
Cheers,
I wrote once a quite fast implementation, but it has several limitations: It works on 32 bit x86 when you write and read the bitstream. I don't check for buffer limits here, I was allocating larger buffer and checked it from time to time from the calling code.
unsigned char* membuff;
unsigned bit_pos; // current BIT position in the buffer, so it's max size is 512Mb
// input bit buffer: we'll decode the byte address so that it's even, and the DWORD from that address will surely have at least 17 free bits
inline unsigned int get_bits(unsigned int bit_cnt){ // bit_cnt MUST be in range 0..17
unsigned int byte_offset = bit_pos >> 3;
byte_offset &= ~1; // rounding down by 2.
unsigned int bits = *(unsigned int*)(membuff + byte_offset);
bits >>= bit_pos & 0xF;
bit_pos += bit_cnt;
return bits & BIT_MASKS[bit_cnt];
};
// output buffer, the whole destination should be memset'ed to 0
inline void put_bits(unsigned int val, unsigned int bit_cnt){ // nothing is returned, so the return type is void
unsigned int byte_offset = bit_pos >> 3;
byte_offset &= ~1;
*(unsigned int*)(membuff + byte_offset) |= val << (bit_pos & 0xf);
bit_pos += bit_cnt;
};
It's hard to answer in general because it depends on many factors such as the distribution of bit-sizes you are reading, the call pattern in the client code and the hardware and compiler. In general, the two possible approaches for reading (writing) from a bitstream are:
Using a 32-bit or 64-bit buffer and conditionally reading (writing) from the underlying array when you need more bits. That's the approach your write_bits method takes.
Unconditionally reading (writing) from the underlying array on every bitstream read (write) and then shifting and masking the resultant values.
The primary advantages of (1) include:
Only reads from the underlying buffer the minimally required number of times in an aligned fashion.
The fast path (no array read) is somewhat faster since it doesn't have to do the read and associated addressing math.
The method is likely to inline better since it doesn't have reads - if you have several consecutive read_bits calls, for example, the compiler can potentially combine a lot of the logic and produce some really fast code.
The primary advantage of (2) is that it is completely predictable - it contains no unpredictable branches.
Just because there is only one advantage for (2) doesn't mean it's worse: that advantage can easily overwhelm everything else.
In particular, you can analyze the likely branching behavior of your algorithm based on two factors:
How often will the bitstream need to read from the underlying buffer?
How predictable is the number of calls before a read is needed?
For example, if you are reading 1 bit 50% of the time and 2 bits 50% of the time, you will do 64 / 1.5 = ~42 reads (if you can use a 64-bit buffer) before requiring an underlying read. This favors method (1), since reads of the underlying array are infrequent, even if mispredicted. On the other hand, if you are usually reading 20+ bits, you will read from the underlying array every few calls. This is likely to favor approach (2), unless the pattern of underlying reads is very predictable. For example, if you always read between 22 and 30 bits, you'll perhaps always take exactly three calls to exhaust the buffer and read the underlying array [1]. So the branch will be well predicted and (1) will stay fast.
Similarly, it depends on how you call these methods, and how the compiler can inline and simplify the code. Especially if you ever call the methods repeatedly with a compile-time constant size, a lot of simplification is possible. Little to no simplification is available when the codeword is known at compile-time.
Finally, you may be able to get increased performance by offering a more complex API. This mostly applies to implementation option (1). For example, you can offer an ensure_available(unsigned size) call which ensures that at least size bits (usually limited to the buffer size) are available to read. Then you can read up to that number of bits using unchecked calls that don't check the buffer size. This can help you reduce mispredictions by forcing the buffer fills onto a predictable schedule, and it lets you write simpler unchecked methods.
1 This depends on exactly how your "read from underlying" routine is written, as there are a few options here: Some always fill to 64-bits, some fill to between 57 and 64-bits (i.e., read an integral number of bytes), and some may fill between 32 or 33 and 64-bits (like your example which reads 32-bit chunks).
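For the writing side, approach (1) might look like the sketch below. This is only an illustration under stated assumptions: BitWriter is a made-up class, it packs bits LSB-first into 32-bit words like the code in the question, and it requires C++11. The 64-bit accumulator means a codeword of up to 32 bits always fits, so the only branch is the flush:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical writer: 64-bit accumulator, flushed 32 bits at a time.
class BitWriter {
public:
    // Appends the low nbits bits of codeword (nbits <= 32), LSB first.
    void write(std::uint32_t codeword, unsigned nbits)
    {
        // Invariant: used_ < 32 on entry, so used_ + nbits <= 63 fits in acc_.
        const std::uint64_t mask = (std::uint64_t{1} << nbits) - 1;
        acc_ |= (static_cast<std::uint64_t>(codeword) & mask) << used_;
        used_ += nbits;
        if (used_ >= 32) {  // the only branch: emit one full word
            out_.push_back(static_cast<std::uint32_t>(acc_));
            acc_ >>= 32;
            used_ -= 32;
        }
    }

    // Emits any partial word, zero-padded. Call once after the last write.
    void finish()
    {
        if (used_ > 0) {
            out_.push_back(static_cast<std::uint32_t>(acc_));
            acc_ = 0;
            used_ = 0;
        }
    }

    const std::vector<std::uint32_t>& words() const { return out_; }

private:
    std::vector<std::uint32_t> out_;
    std::uint64_t acc_ = 0;  // pending bits, lowest bit written first
    unsigned used_ = 0;      // number of valid bits in acc_
};
```

How predictable the flush branch is depends on the distribution of codeword lengths, as discussed above, so profile it against an unconditional-store variant before settling on either.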
You'll probably have to wait until 2013 to get hold of real HW, but the "Haswell" new instructions will bring proper vectorised shifts (ie the ability to shift each vector element by different amounts specified in another vector) to x86/AVX. Not sure of details (plenty of time to figure them out), but that will surely enable a massive performance improvement in bitstream construction code.
I don't have the time to write it for you (not too sure your sample is actually complete enough to do so) but if you must, I can think of
using translation tables for the various input/output bit shift offsets; This optimization would make sense for fixed units of n bits (with n sufficiently large (8 bits?) to expect performance gains)
In essence, you'd be able to do
destloc &= (lookuptable[bits_left_in_buffer][input_offset][codeword]);
disclaimer: this is very sloppy pseudo code, I just hope it conveys my idea of a lookup table to avoid bit-shift arithmetic
writing it in assembly (I know i386 has XLAT, but then again, a good compiler might already use something like that)
; Also, XLAT seems limited to 8 bits and the AL register, so it's not really versatile
Update
Warning: be sure to use a profiler and test your optimization for correctness and speed. Using a lookup table can result in poorer performance because of locality of reference. So, you might need to pin the bit-streaming thread to a single core (set thread affinity) to get the benefits, and you might have to adapt the lookup table size to the processor's L2 cache.
Also, have a look at SIMD, SSE4 or GPU (CUDA) instruction sets if you know you'll have certain features at your disposal.

Is there a relation between integer and register sizes?

Recently, I was challenged in an interview with a string manipulation problem and asked to optimize for performance. I had to use an iterator to move back and forth between TCHAR characters (with UNICODE support, 2 bytes each).
Not really thinking of the array length, I made the crucial mistake of using an int instead of size_t to iterate through. I understand it is not compliant and not secure.
int i, size = _tcslen(str);
for(i=0; i<size; i++){
// code here
}
But, the maximum memory we can allocate is limited. And if there is a relation between int and register sizes, it may be safe to use an integer.
E.g.: Without any virtual mapping tools, we can only map 2^register-size bytes. Since TCHAR is 2 bytes long, half of that number. For any system that has int as 32-bits, this is not going to be a problem even if you don't use an unsigned version of int. People with an embedded background used to think of int as 16-bits, but memory size will be restricted on such a device. So I wonder if there is an architectural fine-tuning decision between integer and register sizes.
The C++ standard doesn't specify the size of an int. (It says that sizeof(char) == 1, and sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long).)
So there doesn't have to be a relation to register size. A fully conforming C++ implementation could give you 256 byte integers on your PC with 32-bit registers. But it'd be inefficient.
So yes, in practice, the size of the int datatype is generally equal to the size of the CPU's general-purpose registers, since that is by far the most efficient option.
If an int was bigger than a register, then simple arithmetic operations would require more than one instruction, which would be costly. If they were smaller than a register, then loading and storing the values of a register would require the program to mask out the unused bits, to avoid overwriting other data. (That is why the int datatype is typically more efficient than short.)
(Some languages simply require an int to be 32-bit, in which case there is obviously no relation to register size --- other than that 32-bit is chosen because it is a common register size)
Going strictly by the standard, there is no guarantee as to how big/small an int is, much less any relation to the register size. Also, some architectures have different sizes of registers (i.e: not all registers on the CPU are the same size) and memory isn't always accessed using just one register (like DOS with its Segment:Offset addressing).
With all that said, however, in most cases int is the same size as the "regular" registers since it's supposed to be the most commonly used basic type and that's what CPUs are optimized to operate on.
AFAIK, there is no direct link between register size and the size of int.
However, since you know for which platform you're compiling the application, you can define your own type alias with the sizes you need:
Example
#ifdef WIN32 // Types for Win32 target
#define Int16 short
#define Int32 int
// .. etc.
#elif defined(OTHER_TARGET) // hypothetical macro standing in for another target
// .. types for that target
#endif
Then, use the declared aliases.
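Where the fixed-width types from <cstdint> are available, the same aliases can be written without per-platform macros. A sketch assuming C++11; Int16 and Int32 reuse the names from the example above, and the exact-width types are optional in the standard, so compilation fails on a platform that lacks them:

```cpp
#include <climits>
#include <cstdint>

// Aliases with an exact width, independent of what short or int happen to be.
using Int16 = std::int16_t;
using Int32 = std::int32_t;

// Compile-time checks that the aliases really have the advertised width.
static_assert(sizeof(Int16) * CHAR_BIT == 16, "Int16 must be 16 bits wide");
static_assert(sizeof(Int32) * CHAR_BIT == 32, "Int32 must be 32 bits wide");

// Always available, but possibly wider than requested.
using AtLeast16 = std::int_least16_t;
static_assert(sizeof(AtLeast16) * CHAR_BIT >= 16, "at least 16 bits");
```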
I'm not sure I understand this correctly, since several different problems (memory sizes, allocation, register sizes, performance?) are mixed here.
What I can say (just going by the headline) is that on most current processors you should use integers that match the register size for maximum speed. The reason is that smaller integers have the advantage of needing less memory, but on the x86 architecture, for example, an additional instruction is needed for conversion. Also, on Intel, accesses to unaligned memory (mostly relative to register-sized boundaries) incur a penalty. Of course, on today's processors things are even more complex, since the CPUs are able to process instructions in parallel. So you end up fine-tuning for a particular architecture.
So the best guess for speed, without knowing the architecture, is to use register-sized ints, as long as you can afford the memory.
I don't have a copy of the standard, but my old copy of The C Programming Language says (section 2.2) int refers to "an integer, typically reflecting the natural size of integers on the host machine." My copy of The C++ Programming Language says (section 4.6) "the int type is supposed to be chosen to be the most suitable for holding and manipulating integers on a given computer."
You're not the only person to say "I'll admit that this is technically a flaw, but it's not really exploitable."
There are different kinds of registers with different sizes. What's important are the address registers, not the general purpose ones. If the machine is 64-bit, then the address registers (or some combination of them) must be 64-bits, even if the general-purpose registers are 32-bit. In this case, the compiler may have to do some extra work to actually compute 64-bit addresses using multiple general purpose registers.
If you don't think that hardware manufacturers ever make odd design choices for their registers, then you probably never had to deal with the original 8086 "real mode" addressing.