Fast generic strlen() implementation that can accept an arbitrary terminator character (C++)

template <char terminator = '\0'>
size_t strlen(const char *str)
{
    const char *char_ptr;
    const unsigned long int *longword_ptr;
    unsigned long int longword, himagic, lomagic;

    /* Handle the first few characters by reading one byte at a time
       until char_ptr is aligned on a longword boundary.  */
    for (char_ptr = str;
         ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
         ++char_ptr)
        if (*char_ptr == '\0')
            return char_ptr - str;

    /* Then scan a longword at a time.  */
    longword_ptr = (const unsigned long int *) char_ptr;
    himagic = 0x80808080L;
    lomagic = 0x01010101L;

    for (;;)
    {
        longword = *longword_ptr++;

        /* (longword - lomagic) & himagic is nonzero if any byte of
           longword is zero (with occasional false positives for bytes
           with the high bit set, hence the byte-wise re-check below). */
        if (((longword - lomagic) & himagic) != 0)
        {
            const char *cp = (const char *) (longword_ptr - 1);

            if (cp[0] == 0)
                return cp - str;
            if (cp[1] == 0)
                return cp - str + 1;
            if (cp[2] == 0)
                return cp - str + 2;
            if (cp[3] == 0)
                return cp - str + 3;
        }
    }
}
The above is glibc's strlen() code. To make it fast, it relies on the trick of determining whether a word has a zero byte.
However, I wish to make the function work with any terminating character, not just '\0', using the template parameter. Is it possible to do something similar?

Use std::memchr to take advantage of libc's hand-written asm
It returns a pointer to the found byte, so you can get the length by subtracting. It returns NULL on not-found, but you said you can assume there will be a match, so we don't need to check except as a debug assert.
Even better, use rawmemchr if you can assume GNU functions are available, so you don't even have to pass a length. (Or not, since glibc 2.37 deprecates it.)
#include <cstring>

size_t lenx(const char *p, int c, size_t len)
{
    const void *match = std::memchr(p, c, len); // old C functions take char in an int
    return static_cast<const char*>(match) - p;
}
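If you want the templated interface from the question, a thin wrapper over memchr is enough; the terminator becomes a compile-time constant. A minimal sketch (strlen_t and its explicit length parameter are my additions; it assumes a match exists within len bytes):
#include <cstring>

template <char terminator = '\0'>
size_t strlen_t(const char *p, size_t len)
{
    // the terminator is baked in at compile time; memchr still does the work
    const void *match = std::memchr(p, terminator, len);
    return static_cast<const char*>(match) - p;
}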
Any decent libc implementation for a modern mainstream CPU will have a fast memchr implementation that checks multiple bytes at once, often hand-written in asm. Very similar to an strlen implementation, but with length-based loop exit condition in the unrolled loop, not just the match-finding loop exit condition.
memchr is somewhat cheaper than strchr, which has to check every byte for being a potential 0; an amount of work that doesn't go down with unrolling and SIMD. If data isn't hot in L1 cache, a good strchr can typically still keep up with available bandwidth, though, on most CPUs for most ISAs. Checking for 0s is also a correctness problem for arrays that contain a 0 byte before the byte you're looking for.
If available, it will even use SIMD instructions to check 16 or 32 bytes at once. A pure C bithack (with strict-aliasing UB) like the one you found is only used in portable fallback code in real C libraries (Why does glibc's strlen need to be so complicated to run quickly? explains this and has some links to glibc's asm implementations), or on targets where it compiles to asm as good as could be written by hand (e.g. MIPS for glibc). (But being wrapped up in a library function, the strict-aliasing UB is dealt with by some means, perhaps as simple as just not being able to inline into other code that accesses that data a different way. If you wanted to do it yourself, you'd want a typedef with something like GNU C __attribute__((may_alias)). See the link earlier in this paragraph.)
You certainly don't want a bithack that's only going to check 4 bytes at a time, especially if unsigned long is an 8-byte type on a 64-bit CPU!
If you don't know the buffer length, use len = -1 in C11 / C++17
Use rawmemchr if available, otherwise use memchr(ptr, c, -1).
That's equivalent to passing SIZE_MAX.
See Is it legal to call memchr with a too-long length, if you know the character will be found before reaching the end of the valid region?
It's guaranteed not to read past the match, or at least to behave as if it didn't, i.e. not faulting: it won't read into the next page, just like an optimized strlen, and probably (for performance reasons) not into the next cache line either. (At least since C++17 / C11, according to cppreference, but real implementations have almost certainly been safe to use this way for longer, at least for performance reasons.)
The ISO C++ standard itself defers to the C standard for <cstring> functions; C++17 and later defer to C11, which added this requirement that C99 didn't have. (I also don't know if there are real-world implementations that violate that standard; I'd guess not and that it was more a matter of documenting a guarantee that real implementations were already doing.)
The POSIX man page for memchr guarantees stopping on a match; I don't know how far back this guarantee goes for POSIX systems.
Implementations shall behave as if they read the memory byte by byte from the beginning of the bytes pointed to by s and stop at the first occurrence of c (if it is found in the initial n bytes).
Without a guarantee like this, it's hypothetically possible for an implementation to just use unaligned loads starting with the address you pass it, as long as it's far enough from the ptr[size-1] end of the buffer you told it about. That's unlikely for performance reasons, though.
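As a concrete sketch of that pattern (len_unbounded is a made-up name; it relies on the stop-at-match guarantee discussed above and assumes a match exists):
#include <cstring>

size_t len_unbounded(const char *p, int c)
{
    // (size_t)-1 == SIZE_MAX: "search the rest of the address space"
    return static_cast<const char*>(std::memchr(p, c, (size_t)-1)) - p;
}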
rawmemchr()
If you're on a GNU system, glibc has rawmemchr which assumes there will be a match, not taking a size limit. So it can loop exactly like strlen does, not having a 2nd exit condition based on either length or checking every byte for a 0 as well as the given character.
Fun fact: AArch64 glibc implements it as memchr(ptr, c, -1), or as strlen if the character happens to be 0. On other ISAs, it might actually duplicate the memchr code but without checking against the end of a buffer.
Glibc 2.37 will deprecate it, so it's apparently not a good idea for new code to switch to it now.
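For completeness, a usage sketch (GNU-only; on glibc, g++ normally defines _GNU_SOURCE, which makes rawmemchr visible in <string.h>):
#include <string.h>  // rawmemchr is a GNU extension, not standard C++

size_t len_raw(const char *p, int c)
{
    // no length argument: behavior is undefined if no match exists after p
    return static_cast<const char*>(rawmemchr(p, c)) - p;
}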

Related

What is really a misaligned pointer?

Misaligned pointers scare me. When parsing a string, a helpful technique I normally use is treating a group of chars as one unit.
So if I am comparing a string to see whether it starts with <!-- (the 4 chars that mean "begin comment"), I would do...
if( *(unsigned int*)string == tobe32('<!--') )
    // This is possibly the beginning of a comment
As you can see, I handle the endianness problem. But will I still stumble over the pointer alignment problem? If I am at index 1 of the string, will casting it to an unsigned int pointer give me a 4-byte object on a 1-byte boundary?
Yes, it will almost certainly be a misaligned pointer(a). Casting should not actually change the address, just how it's treated when you dereference it.
However, that's not necessarily a problem. Some environments may actually raise a hardware exception if you do this (some early ARMs, from memory), some will run a little slower (some x86s), and no doubt some won't care at all. So it would depend on your underlying environment.
However, I'd really question the need for this trick, since the fact you have to do endian conversion means it may not be as efficient as you think.
My first inclination would be to just write an inline function that checks the four characters individually, time it, and only worry about optimisation if there's a real problem. That would be something like:
// Check first four characters match. Pre-condition is that both
// legacy-C-strings are at least four characters in length.
// Check first four characters match. Pre-condition is that both
// legacy-C-strings are at least four characters in length.
inline bool match4(const char *str, const char *match) {
    if (*str++ != *match++) return false;
    if (*str++ != *match++) return false;
    if (*str++ != *match++) return false;
    return *str == *match;
}
This would be my starting position rather than relying on possibly non-portable solutions using casting.
(a) If you want to know the alignment requirements of certain types, you can use the C++ alignof expression, such as alignof(int), assuming you have C++11 or better and, really, you should have :-)
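If the four-bytes-at-once idea still appeals, a portable middle ground is a fixed-size memcmp (a sketch; is_comment_start is a made-up name). It sidesteps both the alignment and the endianness issues, and compilers commonly fold a constant-length 4-byte memcmp into a single 32-bit compare:
#include <cstring>

inline bool is_comment_start(const char *str) {
    // no pointer cast, no alignment assumption, no byte-order conversion
    return std::memcmp(str, "<!--", 4) == 0;
}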

Test of 8 subsequent bytes isn't translated into a single compare instruction

Motivated by this question, I compared three different functions for checking whether the 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):
#include <cstring>

bool f1(const char *ptr)
{
    for (int i = 0; i < 8; i++)
        if (ptr[i])
            return false;
    return true;
}

bool f2(const char *ptr)
{
    bool res = true;
    for (int i = 0; i < 8; i++)
        res &= (ptr[i] == 0);
    return res;
}

bool f3(const char *ptr)
{
    static const char tmp[8]{};
    return !std::memcmp(ptr, tmp, 8);
}
Though I would expect the same assembly for all three with optimizations enabled, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a rolled or an unrolled loop. Moreover, this holds for GCC, Clang, and the Intel compiler alike with -O3.
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seems to be a pretty straightforward optimization to me.
Live demo: https://godbolt.org/z/j48366
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction (possibly with an additional unaligned load)? It seems to be a pretty straightforward optimization to me.
In f1 the loop stops as soon as ptr[i] is non-zero, so it is not always equivalent to considering all 8 elements, as the two other functions do (or as directly comparing an 8-byte word would), when the size of the array is less than 8 (the compiler does not know the size of the array):
f1("\000\001"); // no access out of the array
f2("\000\001"); // access out of the array
f3("\000\001"); // access out of the array
For f2 I agree it can be replaced by an 8-byte comparison, on the condition that the CPU can read an 8-byte word from any address alignment, which is the case on x64 - but that can introduce unusual situations, as explained in Unusual situations where this wouldn't be safe in x86 asm.
First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as #bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.
You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)
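In C++, a reference to an array of known bound makes a similar promise to the callee (a sketch; whether current optimizers actually exploit that promise to invent reads is another matter):
bool f1_ref(const char (&buf)[8])
{
    // the parameter type guarantees all 8 bytes exist
    for (int i = 0; i < 8; i++)
        if (buf[i])
            return false;
    return true;
}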
GCC and clang can't auto-vectorize loops whose trip count isn't known before the first iteration. So that rules out all search loops like f1, even if you made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)
Your f2 could have been optimized the same as f3, to a qword cmp, without overcoming that major compiler-internals limitation, because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2; thanks to #Tharwen for spotting that.
Recognizing loop patterns is not that simple, and takes compile time to look for. I don't know how valuable this optimization would be in practice; that's what compiler devs need to trade off when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)
The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.
In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.
Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.
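That might look like this (a sketch; f2_memcpy is a made-up name):
#include <cstdint>
#include <cstring>

bool f2_memcpy(const char *ptr)
{
    uint64_t v;
    std::memcpy(&v, ptr, sizeof v);  // aliasing-safe unaligned 8-byte load
    return v == 0;                   // typically folds to a single qword cmp
}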
Or in GNU C++, use a typedef to express unaligned may-alias loads:
#include <cstdint>

bool f4(const char *ptr) {
    typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
    auto val = *(const aliasing_unaligned_u64*)ptr;
    return val == 0;   // true when all 8 bytes are zero, matching f1..f3
}
Compiles on Godbolt with GCC10 -O3:
f4(char const*):
        cmp     QWORD PTR [rdi], 0
        sete    al
        ret
Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.
And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.
https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/
Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using may_alias (without aligned(1) in that case, because implicit-length strings could end at any point, so you need to do aligned loads to make sure that your chunk containing at least 1 valid string byte doesn't cross a page boundary).
You need to help your compiler a bit to get exactly what you want... If you want to compare 8 bytes in one CPU operation, you'll need to change your char pointer so it points to something that's actually 8 bytes long, like a uint64_t pointer.
If your compiler does not support uint64_t, you can use unsigned long long* instead:
#include <cstdint>

// note: this cast has the alignment and strict-aliasing caveats
// discussed in the previous answer
inline bool EightBytesNull(const char *ptr)
{
    return *reinterpret_cast<const uint64_t*>(ptr) == 0;
}
Note that this will work on x86, but may not on ARM or other architectures that enforce stricter alignment for integer memory access.

Can std::memcmp read any bytes past the first difference?

Consider:
constexpr char s1[] = "a";
constexpr char s2[] = "abc";
std::memcmp(s1, s2, 3);
If memcmp stops at the first difference it sees, it will not read past the second byte of s1 (the nul terminator), however I don't see anything in the C standard to confirm this behavior, and I don't know of anything in C++ which extends it.
n1570 7.24.4.1 (PDF link):
int memcmp(const void *s1, const void *s2, size_t n);
The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2
Is my understanding correct that the standard describes the behavior as reading all n bytes of both arguments, but libraries can short circuit as-if they did?
The function is not guaranteed to short-circuit because the standard doesn't say it must.
Not only is it not guaranteed to short-circuit, but in practice many implementations will not. For example, glibc compares elements of type unsigned long int (except for the last few bytes), so it could read up to 7 bytes past the location which compared differently on a 64-bit implementation.
Some may think that this won't cause an access violation on the platforms glibc targets, because access to these unsigned long ints will always be aligned and therefore will not cross a page boundary. But when the two sources have a different alignment, glibc will read two consecutive unsigned long ints from one of the sources, which may be in different pages. If the differing byte was in the first of those, an access violation can still be triggered before glibc performs the comparison (see function memcmp_not_common_alignment).
In short: specifying a length that is larger than the real size of the buffer is undefined behavior even if the differing byte occurred before this length, and it can cause crashes on common implementations.
Here's proof that it can crash: https://ideone.com/8jTREr
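The fix for the opening example is simply to bound n by the smaller object. A minimal sketch:
#include <cstring>

constexpr char s1[] = "a";    // 2 bytes, including the NUL
constexpr char s2[] = "abc";  // 4 bytes

int cmp()
{
    // comparing sizeof s1 == 2 bytes never reads past the end of s1
    return std::memcmp(s1, s2, sizeof s1);
}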

Aligning buffer to an N-byte boundary but not a 2N-byte one?

I would like to allocate some char buffers (0), to be passed to an external non-C++ function, that have a specific alignment requirement.
The requirement is that the buffer be aligned to an N-byte (1) boundary, but not to a 2N boundary. For example, if N is 64, then the pointer p to this buffer should satisfy ((uintptr_t)p) % 64 == 0 and ((uintptr_t)p) % 128 != 0 - at least on platforms where pointers have the usual interpretation as a plain address when cast to uintptr_t.
Is there a reasonable way to do this with the standard facilities of C++11?
If not, is there a reasonable way to do this outside the standard facilities (2) which works in practice for modern compilers and platforms?
The buffer will be passed to an outside routine (adhering to the C ABI but written in asm). The required alignment will usually be greater than 16, but less than 8192.
Over-allocation or any other minor wasted-resource issues are totally fine. I'm more interested in correctness and portability than wasting a few bytes or milliseconds.
Something that works on both the heap and stack is ideal, but anything that works on either is still pretty good (with a preference towards heap allocation).
(0) This could be with operator new[] or malloc or perhaps some other method that is alignment-aware: whatever makes sense.
(1) As usual, N is a power of two.
(2) Yes, I understand an answer of this type causes language-lawyers to become apoplectic, so if that's you, just ignore this part.
Logically, to satisfy "aligned to N, but not 2N", we align to 2N then add N to the pointer. Note that this will over-allocate N bytes.
So, assuming we want to allocate B bytes, if you just want stack space, alignas would work, perhaps.
alignas(N*2) char buffer[B+N];
char *p = buffer + N;
If you want heap space, std::aligned_storage might do (though note that plain new is only guaranteed to honor over-alignment from C++17 onward):
#include <type_traits>

typedef std::aligned_storage<B + N, N * 2>::type ALIGNED_CHAR;
ALIGNED_CHAR buffer;  // automatic storage here; heap would use new ALIGNED_CHAR
char *p = reinterpret_cast<char *>(&buffer) + N;
I've not tested either out, but the documentation suggests it should be OK.
You can use _aligned_malloc(nbytes,alignment) (in MSVC) or _mm_malloc(nbytes,alignment) (on other compilers) to allocate (on the heap) nbytes of memory aligned to alignment bytes, which must be an integer power of two.
Then you can use the trick from Ken's answer to avoid alignment to 2N:
#include <xmmintrin.h>  // _mm_malloc / _mm_free on x86 compilers

void *ptr_alloc = _mm_malloc(nbytes + N, 2 * N);
void *ptr = static_cast<void*>(static_cast<char*>(ptr_alloc) + N);
/* do your number crunching */
_mm_free(ptr_alloc);
We must ensure to keep the pointer returned by _mm_malloc() for later de-allocation, which must be done via _mm_free().
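A quick sanity check of the resulting pointer (a sketch with N = 64; demo is a made-up name):
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

void demo(std::size_t nbytes)
{
    void *raw = _mm_malloc(nbytes + 64, 128);  // aligned to 2N = 128
    char *p = static_cast<char*>(raw) + 64;    // offset by N = 64

    assert(reinterpret_cast<std::uintptr_t>(p) % 64 == 0);   // aligned to N...
    assert(reinterpret_cast<std::uintptr_t>(p) % 128 != 0);  // ...but not to 2N

    _mm_free(raw);  // free the pointer _mm_malloc returned, not p
}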

Authoritative "correct" way to avoid signed-unsigned warnings when testing a loop variable against size_t

The code below generates a compiler warning:
private void test()
{
byte buffer[100];
for (int i = 0; i < sizeof(buffer); ++i)
{
buffer[i] = 0;
}
}
warning: comparison between signed and unsigned integer expressions
[-Wsign-compare]
This is because sizeof() returns a size_t, which is unsigned.
I have seen a number of suggestions for how to deal with this, but none with a preponderance of support and none with any convincing logic nor any references to support one approach as clearly "better." The most common suggestions seem to be:
1. ignore the warnings
2. turn off the warnings
3. use a loop variable of type size_t
4. use a loop variable of type size_t with tricks to avoid decrementing past zero
5. cast sizeof(buffer) to an int
6. some extremely convoluted suggestions that I did not have the patience to follow because they involved unreadable code, generally involving vectors and/or iterators
7. libraries that I cannot load in the AVR / ARM embedded environments I often use
8. free functions returning a valid int or long representing the byte count of T
9. don't use loops (gotta love that advice)
Is there a "correct" way to approach this?
-- Begin Edit --
The example I gave is, of course, trivial, and meant only to demonstrate the type mismatch warning that can occur in an indexing situation.
#3 is not necessarily the obviously correct answer because size_t carries special risks in a decrementing loop such as
for (size_t i = myArray.size(); i > 0; --i)
(the array may someday have a size of zero).
#4 is a suggestion to deal with decrementing size_t indexes by including appropriate and necessary checks to avoid ever decrementing past zero. Since that makes the code harder to read, there are some cute shortcuts that are not particularly readable, hence my referring to them as "tricks."
#7 is a suggestion to use libraries that are not generalizable in the sense that they may not be available or appropriate in every setting.
#8 is a suggestion to keep the checks readable, but to hide them in a non-member method, sometimes referred to as a "free function."
#9 is a suggestion to use algorithms rather than loops. This was offered many times as a solution to the size_t indexing problem, and there were a lot of upvotes. I include it even though I can't use the stl library in most of my environments and would have to write the code myself.
-- End Edit--
I am hoping for evidence-based guidance or references as to best practices for handling something like this. Is there a "standard text" or a style guide somewhere that addresses the question? A defined approach that has been adopted/endorsed internally by a major tech company? An emulatable solution forthcoming in a new language release? If necessary, I would be satisfied with an unsupported public recommendation from a single widely recognized expert.
None of the options on offer seem very appealing. The warnings drown out other things I want to see, yet I don't want to miss signed/unsigned comparisons in places where they might matter. Decrementing a loop variable of type size_t with a comparison >= 0 results in an infinite loop from unsigned integer wraparound, and even if we protect against that with something like for (size_t i = sizeof(buffer); i-- > 0 ;), there are other issues with incrementing/decrementing/comparing size_t variables. Testing against a size minus 1 will yield a huge positive 'oops' number when the size is unexpectedly zero (e.g. strlen(myEmptyString)). And casting an unsigned size_t to an int is not guaranteed to preserve the value, since size_t can be bigger than an int.
Given that my arrays are of known sizes well below INT_MAX, it seems to me that casting size_t to a signed integer is the best of the bunch, but it makes me cringe a little bit. Especially if it has to be static_cast<int>. Easier to take if it's hidden in a function call with some size testing (see the sketch below), but still...
Or perhaps there's a way to turn off the warnings, but just for loop comparisons?
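For illustration, the kind of size-checking free function described in #8 might look like this (a sketch; isize is a made-up name):
#include <climits>
#include <cstddef>

// returns the element count of a fixed-size array as an int, rejecting
// arrays too large for int indexing at compile time
template <typename T, std::size_t N>
constexpr int isize(const T (&)[N])
{
    static_assert(N <= static_cast<std::size_t>(INT_MAX),
                  "array too large for int indexing");
    return static_cast<int>(N);
}

// usage: for (int i = 0; i < isize(buffer); ++i) { ... }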
I find any of the three following approaches equally good.
Use a variable of type int to store the size and compare the loop variable to it.
byte buffer[100];
int size = sizeof(buffer);
for (int i = 0; i < size; ++i)
{
    buffer[i] = 0;
}
Use size_t as the type of the loop variable.
byte buffer[100];
for (size_t i = 0; i < sizeof(buffer); ++i)
{
    buffer[i] = 0;
}
Use a pointer.
byte buffer[100];
byte* end = buffer + sizeof(buffer);
for (byte* p = buffer; p < end; ++p)
{
    *p = 0;
}
If you are able to use a C++11 compiler, you can also use a range-based for loop.
byte buffer[100];
for (byte& b : buffer)
{
    b = 0;
}
The most appropriate solution will depend entirely on context. In the context of the code fragment in your question the most appropriate action is perhaps to have type-agreement - the third option in your bullet list. This is appropriate in this case because the usage of i throughout the code is only to index the array - in this case the use of int is inappropriate - or at least unnecessary.
On the other hand, if i were an arithmetic object involved in some arithmetic expression that was itself signed, then int might be appropriate and a cast would be in order.
I would suggest that as a guideline, a solution that involves the fewest number of necessary type casts (explicit or implicit) is appropriate, or to look at it another way, the maximum possible type agreement. There is no one "authoritative" rule, because the purpose and usage of the variables involved is semantically rather than syntactically dependent. In this case also, as has been pointed out in other answers, newer language features supporting iteration may avoid this specific issue altogether.
To discuss the advice you say you have been given specifically:
ignore the warnings
Never a good idea - some will be genuine semantic errors or maintenance issues, and by the time you have several hundred warnings you are ignoring, how will you spot the one warning that is an issue?
turn off the warnings
An even worse idea; the compiler is helping you to improve your code quality and reliability. Why would you disable that?
use a loop variable of type size_t
In this precise example, that is exactly what you should do; exact type agreement should always be the aim.
use a loop variable of type size_t with tricks to avoid decrementing past zero
This advice is irrelevant for the trivial example given. Moreover, I presume that by "tricks" the adviser in fact means checks, or just correct code. There is no need for "tricks", and the term is entirely ambiguous - who knows what the adviser means? It suggests something unconventional and a bit "dirty", when there is no need for any solution with such attributes.
cast sizeof(buffer) to an int
This may be necessary if the usage of i warrants the use of int for correct semantics elsewhere in the code. The example in the question does not, so this would not be an appropriate solution in this case. Essentially, if making i a size_t here causes type agreement warnings elsewhere that cannot themselves be resolved by universal type agreement for all operands in an expression, then a cast may be appropriate. The aim should be to achieve zero warnings and the minimum number of type casts.
some extremely convoluted suggestions that I did not have the patience to follow, generally involving vectors and/or iterators
If you are not prepared to elaborate or even consider such advice, you'd have been better off omitting it from your question. The use of STL containers is in any case not always appropriate for a large segment of embedded targets; excessive code-size increase and non-deterministic heap management are reasons to avoid them on many platforms and in many applications.
libraries that I cannot load in an embedded environment.
Not all embedded environments have equal constraints. The restriction is a feature of your embedded environment, not by any means all embedded environments. However, the "loading of libraries" to resolve or avoid type agreement issues seems like a sledgehammer to crack a nut.
free functions returning a valid int or long representing the byte count of T
It is not clear what that means. What is a "free function"? Is that just a non-member function? Such a function would necessarily contain a type cast internally, so what have you achieved other than hiding the cast?
Don't use loops (gotta love that advice).
I doubt you needed to include that advice in your list. The problem is not in any case limited to loops; it is not because you are using a loop that you have the warning, it is because you have used < with mismatched types.
My favorite solution is to use C++11 or newer and skip the whole manual size bounding entirely like so:
// assuming byte is defined by something like: using byte = std::uint8_t;
void test()
{
    byte buffer[100];
    for (auto&& b : buffer)
    {
        b = 0;
    }
}
Alternatively, if I can't use the ranged-based for loop (but still can use C++11 or newer), my favorite syntax becomes:
void test()
{
    byte buffer[100];
    for (auto i = decltype(sizeof(buffer)){0}; i < sizeof(buffer); ++i)
    {
        buffer[i] = 0;
    }
}
Or for iterating backwards:
void test()
{
    byte buffer[100];
    // relies on the well-defined modular wraparound of unsigned integers:
    // when i wraps past zero it becomes huge, so i < sizeof(buffer) fails
    for (auto i = sizeof(buffer) - 1; i < sizeof(buffer); --i)
    {
        buffer[i] = 0;
    }
}
The correct generic way is to use a loop iterator of type size_t, simply because that is the most correct type for describing an array size.
There is not much need for "tricks to avoid decrementing past zero", because the size of an object can never be negative.
If you find yourself needing negative numbers to describe a variable size, it is probably because you have some special case where you are iterating across an array backwards. If so, the "trick" to deal with it is this:
for (size_t i = 0; i < sizeof(array); i++)
{
    size_t index = sizeof(array) - 1 - i;
    array[index] = something;
}
However, size_t is often an inconvenient type to use in embedded systems, because it may end up as a larger type than what your MCU can handle with one instruction, resulting in needlessly inefficient code. It may then be better to use a fixed width integer such as uint16_t, if you know the maximum size of the array in advance.
Using plain int in an embedded system is almost certainly incorrect practice. Your variables must be of deterministic size and signedness - most variables in an embedded system are unsigned. Signed variables also lead to major problems whenever you need to use bitwise operators.
If you are able to use C++11, you could use decltype to obtain the actual type of what sizeof returns, for instance:
void test()
{
    byte buffer[100];
    // decltype(sizeof(buffer)) is size_t (e.g. unsigned long on macOS),
    // so the comparison passes the compiler without warnings.
    for (decltype(sizeof(buffer)) i = 0; i < sizeof(buffer); ++i)
    {
        buffer[i] = 0;
    }
}