fatest method to replace strlen in c++ [duplicate] - c++

Assume that you're working a x86 32-bits system. Your task is to implement the strlen as fast as possible.
There're two problems you've to take care:
1. address alignment.
2. read memory with machine word length(4 bytes).
It's not hard to find the first alignment address in the given string.
Then we can read memory once with the 4 bytes, and count up it the total length. But we should stop once there's a zero byte in the 4 bytes, and count the left bytes before zero byte. In order to check the zero byte in a fast way, there's a code snippet from glibc:
unsigned long int longword, himagic, lomagic;
himagic = 0x80808080L;
lomagic = 0x01010101L;
// There's zero byte in 4 bytes.
if (((longword - lomagic) & ~longword & himagic) != 0) {
// do left thing...
}
I used it in Visual C++, to compare with CRT's implementation. The CRT's is much more faster than the above one.
I'm not familiar with CRT's implementation, did they use a faster way to check the zero byte?

You could save the length of the string along with the string when creating it, as is done in Pascal.

First CRT's one is written directly in assembler. you can see it's source code here C:\Program Files\Microsoft Visual Studio 9.0\VC\crt\src\intel\strlen.asm (this is for VS 2008)

It depends. Microsoft's library really has two different versions of strlen. One is a portable version in C that's about the most trivial version of strlen possible, pretty close (and probably equivalent) to:
size_t strlen(char const *str) {
for (char const *pos=str; *pos; ++pos)
;
return pos-str;
}
The other is in assembly language (used only for Intel x86), and quite similar to what you have above, at least as far as load 4 bytes, check of one of them is zero, and react appropriately. The only obvious difference is that instead of subtracting, they basically add pre-negate the bytes and add. I.e. instead of word-0x0101010101, they use word + 0x7efefeff.

there are also compiler intrinsic versions which use the REPNE SCAS instruction pair, though these are generally on older compilers, they can still be pretty fast. there are also SSE2 versions of strlen, such as Dr Agner Fog's performance library's implementation, or something such as this

Remove those 'L' suffixes and see... You are promoting all calculations to "long"! On my 32-bits tests, that alone doubles the cost.
I also do two micro-optimizations:
Since most strings we use scan consist of ASCII chars in the range 0~127, the high bit is (almost) never set, so only check for it in a second test.
Increment an index rather than a pointer, which is cheaper on some architectures (notably x86) and give you the length for 'free'...
uint32_t gatopeich_strlen32(const char* str)
{
uint32_t *u32 = (uint32_t*)str, u, abcd, i=0;
while(1)
{
u = u32[i++];
abcd = (u-0x01010101) & 0x80808080;
if (abcd && // If abcd is not 0, we have NUL or a non-ASCII char > 127...
(abcd &= ~u)) // ... Discard non-ASCII chars
{
#if BYTE_ORDER == BIG_ENDIAN
return 4*i - (abcd&0xffff0000 ? (abcd&0xff000000?4:3) : abcd&0xff00?2:1);
#else
return 4*i - (abcd&0xffff ? (abcd&0xff?4:3) : abcd&0xff0000?2:1);
#endif
}
}
}

Assuming you know the maximum possible length, and you've initated the memory to \0 before use, you could do a binary split and go left/right depending on the value(\0, split on left, else split on right). That way you'd dramatically decrease the amount of checks you'll need to find the length. Not optimal(requires some setup), but should be really fast.
// Eric

Obviously, crafting a tight loop like this in assembler would be fastest, however if you want/need to keep it more human-readable and/or portable in C(++), you can still increase the speed of the standard function by using the register keyword.
The register keyword prompts the compiler to store the counter in a register on the CPU instead of in memory which will significantly speed up the loop.
Note however, that the register keyword is only a suggestion and the compiler is free to ignore it if it thinks it can do better, especially if certain optimization options are used. That said, while it is almost certainly going to be ignored for a local, class variable in a triple for-loop, it is likely to be honored for the code below, thus improving performance quite a bit (nearly on par with the assembler version):
size_t strlen ( const char* s ) {
for (register const char* i=s; *i; ++i);
return (i-s);
}

Related

Fastest way to find first occurrence of byte in buffer

Disclaimer
I have read What is the fastest substring search algorithm?, which may be suboptimal for the single character case.
strchr requires a NUL-terminated string
I am looking for the fastest way to identify the first occurrence of a given byte in a byte buffer.
This is reminiscent of looking for the first occurrence of a character in a string except that:
the byte buffer is not NUL-terminated, instead I have an explicit length (and possibly embedded NUL characters)
the byte buffer is not allocated in a string or vector, I am only handed down a slice (aka, pointer & length)
The basic solution is:
size_t search(char const* buffer, size_t length, char c) {
return std::find(buffer, buffer + length, c) - buffer;
}
However, a quick round-trip with the Godbolt compiler (-O2 -msse2 -mavx) does not show any hint of a vectorized instruction, only some unrolling, so I am wondering whether this is the optimal.
Is there a faster way to find the first occurrence of a given byte in a buffer?
Note: only the first occurrence matters.
Note: I am exclusively concerned with modern x86_64 CPUs on Linux, though I encourage answers to be as generic as possible and mention assumptions clearly.
you can use memchr , which is usually implemented as an intrinsic function, and is usually (from my experience) much faster than any hand-rolled loop.
http://en.cppreference.com/w/c/string/byte/memchr
Edit: at least on VC++ (and I bet on GCC as well, I haven't checked), std::find will use memchr anyway if you look for a byte , so I would check if memchr actually make the program run faster.

How to effectively apply bitwise operation to (large) packed bit vectors?

I want to implement
void bitwise_and(
char* __restrict__ result,
const char* __restrict__ lhs,
const char* __restrict__ rhs,
size_t length);
or maybe a bitwise_or(), bitwise_xor() or any other bitwise operation. Obviously it's not about the algorithm, just the implementation details - alignment, loading the largest possible element from memory, cache-awareness, using SIMD instructions etc.
I'm sure this has (more than one) fast existing implementations, but I would guess most library implementations would require some fancy container, e.g. std::bitset or boost::dynamic_bit_set - but I don't want to spend the time constructing one of those.
So do I... Copy-paste from an existing library? Find a library which can 'wrap' a raw packed bits array in memory with a nice object? Roll my own implementation anyway?
Notes:
I'm mostly interested in C++ code, but I certainly don't mind a plain C approach.
Obviously, making copies of the input arrays is out of the question - that would probably nearly-double the execution time.
I intentionally did not template the bitwise operator, in case there's some specific optimization for OR, or for AND etc.
Bonus points for discussing operations on multiple vectors at once, e.g. V_out = V_1 bitwise-and V_2 bitwise-and V_3 etc.
I noted this article comparing library implementations, but it's from 5 years ago. I can't ask which library to use since that would violate SO policy I guess...
If it helps you any, assume its uint64_ts rather than chars (that doesn't really matter - if the char array is unaligned we can just treated the heading and trailing chars separately).
This answer is going to assume you want the fastest possible way and are happy to use platform specific things. You optimising compiler may be able to produce similar code to the below from normal C but in my experiance across a few compilers something as specific as this is still best hand-written.
Obviously like all optimisation tasks, never assume anything is better/worse and measure, measure, measure.
If you could lock down you architecture to x86 with at least SSE3 you would do:
void bitwise_and(
char* result,
const char* lhs,
const char* rhs,
size_t length)
{
while(length >= 16)
{
// Load in 16byte registers
auto lhsReg = _mm_loadu_si128((__m128i*)lhs);
auto rhsReg = _mm_loadu_si128((__m128i*)rhs);
// do the op
auto res = _mm_and_si128(lhsReg, rhsReg);
// save off again
_mm_storeu_si128((__m128i*)result, res);
// book keeping
length -= 16;
result += 16;
lhs += 16;
rhs += 16;
}
// do the tail end. Assuming that the array is large the
// most that the following code can be run is 15 times so I'm not
// bothering to optimise. You could do it in 64 bit then 32 bit
// then 16 bit then char chunks if you wanted...
while (length)
{
*result = *lhs & *rhs;
length -= 1;
result += 1;
lhs += 1;
rhs += 1;
}
}
This compiles to ~10asm instructions per 16 bytes (+ change for the leftover and a little overhead).
The great thing about doing intrinsics like this (over hand rolled asm) is that the compiler is still free to do additional optimisations (such as loop unrolling) ontop of what you write. It also handles register allocation.
If you could guarantee aligned data you could save an asm instruction (use _mm_load_si128 instead and the compiler will be clever enough to avoid a second load and use it as an direct mem operand to the 'pand'.
If you could guarantee AVX2+ then you could use the 256 bit version and handle 10asm instructions per 32 bytes.
On arm theres similar NEON instructions.
If you wanted to do multiple ops just add the relevant intrinsic in the middle and it'll add 1 asm instruction per 16 bytes.
I'm pretty sure with a decent processor you dont need any additional cache control.
Don't do it this way. The individual operations will look great, sleek asm, nice performance .. but a composition of them will be terrible. You cannot make this abstraction, nice as it looks. The arithmetic intensity of those kernels is almost the worst possible (the only worse one is doing no arithmetic, such as a straight up copy), and composing them at a high level will retain that awful property. In a sequence of operations each using the result of the previous one, the results are written and read again a lot later (in the next kernel), even though the high level flow could be transposed so that the result the "next operation" needs is right there in a register. Also, if the same argument appears twice in an expression tree (and not both as operands to one operation), they will be streamed in twice, instead of reusing the data for two operations.
It doesn't have that nice warm fuzzy feeling of "look at all this lovely abstraction" about it, but what you should do is find out at a high level how you're combining your vectors, and then try to chop that in pieces that make sense from a performance perspective. In some cases that may mean making big ugly messy loops that will make people get an extra coffee before diving in, that's just too bad then. If you want performance, you often have to sacrifice something else. Usually it's not so bad, it probably just means you have a loop that has an expression consisting of intrinsics in it, instead of an expression of vector-operations that each individually have a loop.

Why are standard string functions faster than my custom string functions?

I decided to find the speeds of 2 functions :
strcmp - The standard comparison function defined in string.h
xstrcmp- A function that has same parameters and does the same, just that I created it.
Here is my xstrcmp function :
int xstrlen(char *str)
{
int i;
for(i=0;;i++)
{
if(str[i]=='\0')
break;
}
return i;
}
int xstrcmp(char *str1, char *str2)
{
int i, k;
if(xstrlen(str1)!=xstrlen(str2))
return -1;
k=xstrlen(str1)-1;
for(i=0;i<=k;i++)
{
if(str1[i]!=str2[i])
return -1;
}
return 0;
}
I didn't want to depend on strlen, since I want everything user-defined.
So, I found the results. strcmp did 364 comparisons per millisecond and my xstrcmp did just 20 comparisons per millisecond (atleast on my computer!)
Can anyone tell why this is so ? What does the xstrcmp function do to make itself so fast ?
if(xstrlen(str1)!=xstrlen(str2)) //computing length of str1
return -1;
k=xstrlen(str1)-1; //computing length of str1 AGAIN!
You're computing the length of str1 TWICE. That is one reason why your function loses the game.
Also, your implemetation of xstrcmp is very naive compared to the ones defined in (most) Standard libraries. For example, your xstrcmp compares one byte at a time, when in fact it could compare multiple bytes in one go, taking advantage of proper alignment as well, or can do little preprocessing so as to align memory blocks, before actual comparison.
strcmp and other library routines are written in assembly, or specialized C code, by experienced engineers and use a variety of techniques.
For example, the assembly implementation might load four bytes at a time into a register, and compare that register (as a 32-bit integer) to four bytes from the other string. On some machines, the assembly implementation might load eight bytes or even more. If the comparison shows the bytes are equal, the implementation moves on to the next four bytes. If the comparison shows the bytes are unequal, the implementation stops.
Even with this simple optimization, there are a number of issues to be dealt with. If the string addresses are not multiples of four bytes, the processor might not have an instruction that will load four bytes (many processors require four-byte loads to use addresses that are aligned to multiples of four bytes). Depending on the processor, the implementation might have to use slower unaligned loads or to write special code for each alignment case that does aligned loads and shifts bytes in registers to align the bytes to be compared.
When the implementation loads four bytes at once, it must ensure it does not load bytes beyond the terminating null character if those bytes might cause a segment fault (error because you tried to load an address that is not readable).
If the four bytes do contain the terminating null character, the implementation must detect it and not continue comparing further bytes, even if the current four are equal in the two strings.
Many of these issues require detailed assembly instructions, and the required control over the exact instructions used is not available in C. The exact techniques used vary from processor model to processor model and vary greatly from architecture to architecture.
Faster implementation of strlen:
//Return difference in addresses - 1 as we don't count null terminator in strlen.
int xstrlen(char *str)
{
char* ptr = str;
while (*str++);
return str - ptr - 1;
}
//Pretty nifty strcmp from here:
//http://vijayinterviewquestions.blogspot.com/2007/07/implement-strcmpstr1-str2-function.html
int mystrcmp(const char *s1, const char *s2)
{
while (*s1==*s2)
{
if(*s1=='\0')
return(0);
++s1;
++s2;
}
return(*s1-*s2);
}
I'll do the other one later if I have time. You should also note that most of these are done in assembly language or using other optimized means which will be faster than the best stright C implementation you can write.
Aside from the problems in your code (which have been pointed out already), -- at least in the gcc-C-libs, the str- and mem-functions are faster by a margin in most cases because their memory access patterns are higly optimized.
There were some discussions on the topic on SO already.
Try this:
int xstrlen(const char* s){
const char* s0 = s;
while(*s) s++;
return(s - s0);
}
int xstrcmp(const char* a, const char* b){
while(*a && *a==*b){a++; b++;}
return *a - *b;
}
This could probably be sped up with some loop unrolling.
1. Algorithm
Your implementation of strcmp could have a better algorithm. There should be no need to call strlen at all, each call to strlen will iterate over the whole length of the string again. You can find simple but effective implementations online, probably the place to start is something like:
// Adapted from http://vijayinterviewquestions.blogspot.co.uk
int xstrcmp(const char *s1, const char *s2)
{
for (;*s1==*s2;++s1,++s2)
{
if(*s1=='\0') return(0);
}
return(*s1-*s2);
}
That doesn't do everything, but should be simple and work in most cases.
2. Compiler optimisation
It's a stupid question, but make sure you turned on all the optimisation switches when you compile.
3. More sophisticated optimisations
People writing libraries will often use more advanced techniques, such as loading a 4-byte or 8-byte int at once, and comparing it, and only comparing individual bytes if the whole matches. You'd need to be an expert to know what's appropriate for this case, but you can find people discussing the most efficient implementation on stack overflow (link?)
Some standard library functions for some platforms may be hand-written in assembly if the coder can knows there's a more efficient implementation than the compiler can find. That's increasingly rare now, but may be common on some embedded systems.
4. Linker "cheating" with standard library
With some standard library functions, the linker may be able to make your program call them with less overhead than calling functions in your code because it was designed to know more about the specific internals of the functions (link?) I don't know if that applies in this case, it probably doesn't, but it's the sort of thing you have to think about.
5. OK, ok, I get that, but when SHOULD I implement my own strcmp?
Off the top of my head, the only reasons to do this are:
You want to learn how. This is a good reason.
You are writing for a platform which doesn't have a good enough standard library. This is very unlikely.
The string comparison has been measured to be a significant bottleneck in your code, and you know something specific about your strings that mean you can compare them more efficiently than a naive algorithm. (Eg. all strings are allocated 8-byte aligned, or all strings have an N-byte prefix.) This is very, very unlikely.
6. But...
OK, WHY do you want to avoid relying on strlen? Are you worried about code size? About portability of code or of executables?
If there's a good reason, open another question and there may be a more specific answer. So I'm sorry if I'm missing something obvious, but relying on the standard library is usually much better, unless there's something specific you want to improve on.

Implementing memcmp

The following is the Microsoft CRT implementation of memcmp:
int memcmp(const void* buf1,
const void* buf2,
size_t count)
{
if(!count)
return(0);
while(--count && *(char*)buf1 == *(char*)buf2 ) {
buf1 = (char*)buf1 + 1;
buf2 = (char*)buf2 + 1;
}
return(*((unsigned char*)buf1) - *((unsigned char*)buf2));
}
It basically performs a byte by byte comparision.
My question is in two parts:
Is there any reason to not alter this to an int by int comparison until count < sizeof(int), then do a byte by byte comparision for what remains?
If I were to do 1, are there any potential/obvious problems?
Notes: I'm not using the CRT at all, so I have to implement this function anyway. I'm just looking for advice on how to implement it correctly.
You could do it as an int-by-int comparison or an even wider data type if you wish.
The two things you have to watch out for (at a minimum) are an overhang at the start as well as the end, and whether the alignments are different between the two areas.
Some processors run slower if you access values without following their alignment rules (some even crash if you try it).
So your code could probably do char comparisons up to an int alignment area, then int comparisons, then char comparisons again but, again, the alignments of both areas will probably matter.
Whether that extra code complexity is worth whatever savings you will get depends on many factors outside your control. A possible method would be to detect the ideal case where both areas are aligned identically and do it a fast way, otherwise just do it character by character.
The optimization you propose is very common. The biggest concern would be if you try to run it on a processor that doesn't allow unaligned accesses for anything other than a single byte, or is slower in that mode; the x86 family doesn't have that problem.
It's also more complicated, and thus more likely to contain a bug.
Don't forget that when you find a mismatch within a larger chunk, you must then identify the first differing char within that chunk so that you can calculate the correct return value (memcmp() returns the difference of the first differing bytes, treated as unsigned char values).
If you compare as int, you will need to check alignment and check if count is divisible by sizeof(int) (to compare the last bytes as char).
Is that really their implementation? I have other issues besides not doing it int-wise:
castng away constness.
does that return statement work? unsigned char - unsigned char = signed int?
int at a time only works if the pointers are aligned, or if you can read a few bytes from the front of each and they are both still aligned, so if both are 1 before the alignment boundary you can read one char of each then go int-at-a-time, but if they are aligned differently eg one is aligned and one is not, there is no way to do this.
memcmp is at its most inefficient (i.e. it takes the longest) when they do actually compare (it has to go to the end) and the data is long.
I would not write my own but if you are going to be comparing large portions of data you could do things like ensure alignment and even pad the ends, then do word-at-a-time, if you want.
Another idea is to optimize for the processor cache and fetching. Processors like to fetch in large chunks rather than individual bytes at random times. Although the internal workings may already account for this, it would be a good exercise anyway. Always profile to determine the most efficient solution.
Psuedo code:
while bytes remaining > (cache size) / 2 do // Half the cache for source, other for dest.
fetch source bytes
fetch destination bytes
perform comparison using fetched bytes
end-while
perform byte by byte comparison for remainder.
For more information, search the web for "Data Driven Design" and "data oriented programming".
Some processors, such as the ARM family, allow for conditional execution of instructions (in 32-bit, non-thumb) mode. The processor fetches the instructions but will only execute them if the conditions are satisfied. In this case, try rephrasing the comparison in terms of boolean assignments. This may also reduce the number of branches taken, which improves performance.
See also loop unrolling.
See also assembly language.
You can gain a lot of performance by tailoring the algorithm to a specific processor, but loose in the portability area.
The code you found is just a debug implementation of memcmp, it's optimized for simplicity and readability, not for performance.
The intrinsic compiler implementation is platform specific and smart enough to generate processor instructions that compare dwords or qwords (depending on the target architecture) at once whenever possible.
Also, an intrinsic implementation may return immediately if both buffers have the same address (buf1 == buf2). This check is also missing in the debug implementation.
Finally, even when you know exactly on which platform you'll be running, the perfect implementation is still the less generic one as it depends on a bunch of different factors that are specific to the rest of your program:
What is the minumum guaranteed buffer alignment?
Can you read any padding bytes past the end of a buffer without triggering an access violation?
May the buffer parameters be identical?
May the buffer size be 0?
Do you only need to compare buffer contents for equality? Or do you also need to know which one is larger (return value < 0 or > 0)?
...
If performace is a concern, I suggest writing the comparison routine in assembly. Most compilers give you an option to see the assembly lising that they generate for a source. You could take that code and adapt it to your needs.
Many processors implement this as a single instruction. If you can guarantee the processor you're running on it can be implemented with a single line of inline assembler.

How to implement strlen as fast as possible

Assume that you're working a x86 32-bits system. Your task is to implement the strlen as fast as possible.
There're two problems you've to take care:
1. address alignment.
2. read memory with machine word length(4 bytes).
It's not hard to find the first alignment address in the given string.
Then we can read memory once with the 4 bytes, and count up it the total length. But we should stop once there's a zero byte in the 4 bytes, and count the left bytes before zero byte. In order to check the zero byte in a fast way, there's a code snippet from glibc:
unsigned long int longword, himagic, lomagic;
himagic = 0x80808080L;
lomagic = 0x01010101L;
// There's zero byte in 4 bytes.
if (((longword - lomagic) & ~longword & himagic) != 0) {
// do left thing...
}
I used it in Visual C++, to compare with CRT's implementation. The CRT's is much more faster than the above one.
I'm not familiar with CRT's implementation, did they use a faster way to check the zero byte?
You could save the length of the string along with the string when creating it, as is done in Pascal.
First CRT's one is written directly in assembler. you can see it's source code here C:\Program Files\Microsoft Visual Studio 9.0\VC\crt\src\intel\strlen.asm (this is for VS 2008)
It depends. Microsoft's library really has two different versions of strlen. One is a portable version in C that's about the most trivial version of strlen possible, pretty close (and probably equivalent) to:
size_t strlen(char const *str) {
for (char const *pos=str; *pos; ++pos)
;
return pos-str;
}
The other is in assembly language (used only for Intel x86), and quite similar to what you have above, at least as far as load 4 bytes, check of one of them is zero, and react appropriately. The only obvious difference is that instead of subtracting, they basically add pre-negate the bytes and add. I.e. instead of word-0x0101010101, they use word + 0x7efefeff.
there are also compiler intrinsic versions which use the REPNE SCAS instruction pair, though these are generally on older compilers, they can still be pretty fast. there are also SSE2 versions of strlen, such as Dr Agner Fog's performance library's implementation, or something such as this
Remove those 'L' suffixes and see... You are promoting all calculations to "long"! On my 32-bits tests, that alone doubles the cost.
I also do two micro-optimizations:
Since most strings we use scan consist of ASCII chars in the range 0~127, the high bit is (almost) never set, so only check for it in a second test.
Increment an index rather than a pointer, which is cheaper on some architectures (notably x86) and give you the length for 'free'...
uint32_t gatopeich_strlen32(const char* str)
{
uint32_t *u32 = (uint32_t*)str, u, abcd, i=0;
while(1)
{
u = u32[i++];
abcd = (u-0x01010101) & 0x80808080;
if (abcd && // If abcd is not 0, we have NUL or a non-ASCII char > 127...
(abcd &= ~u)) // ... Discard non-ASCII chars
{
#if BYTE_ORDER == BIG_ENDIAN
return 4*i - (abcd&0xffff0000 ? (abcd&0xff000000?4:3) : abcd&0xff00?2:1);
#else
return 4*i - (abcd&0xffff ? (abcd&0xff?4:3) : abcd&0xff0000?2:1);
#endif
}
}
}
Assuming you know the maximum possible length, and you've initated the memory to \0 before use, you could do a binary split and go left/right depending on the value(\0, split on left, else split on right). That way you'd dramatically decrease the amount of checks you'll need to find the length. Not optimal(requires some setup), but should be really fast.
// Eric
Obviously, crafting a tight loop like this in assembler would be fastest, however if you want/need to keep it more human-readable and/or portable in C(++), you can still increase the speed of the standard function by using the register keyword.
The register keyword prompts the compiler to store the counter in a register on the CPU instead of in memory which will significantly speed up the loop.
Note however, that the register keyword is only a suggestion and the compiler is free to ignore it if it thinks it can do better, especially if certain optimization options are used. That said, while it is almost certainly going to be ignored for a local, class variable in a triple for-loop, it is likely to be honored for the code below, thus improving performance quite a bit (nearly on par with the assembler version):
size_t strlen ( const char* s ) {
for (register const char* i=s; *i; ++i);
return (i-s);
}