Consider:
constexpr char s1[] = "a";
constexpr char s2[] = "abc";
std::memcmp(s1, s2, 3);
If memcmp stops at the first difference it sees, it will not read past the second byte of s1 (the NUL terminator). However, I don't see anything in the C standard to confirm this behavior, and I don't know of anything in C++ which extends it.
N1570 7.24.4.1:
int memcmp(const void *s1, const void *s2, size_t n);
The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2.
Is my understanding correct that the standard describes the behavior as reading all n bytes of both arguments, but libraries may short-circuit under the as-if rule?
The function is not guaranteed to short-circuit because the standard doesn't say it must.
Not only is it not guaranteed to short-circuit, but in practice many implementations will not. For example, glibc compares elements of type unsigned long int (except for the last few bytes), so it could read up to 7 bytes past the location which compared differently on a 64-bit implementation.
Some may think that this won't cause an access violation on the platforms glibc targets, because access to these unsigned long ints will always be aligned and therefore will not cross a page boundary. But when the two sources have a different alignment, glibc will read two consecutive unsigned long ints from one of the sources, which may lie in different pages. If the differing byte was in the first of those, an access violation can still be triggered before glibc performs the comparison (see the function memcmp_not_common_alignment).
In short: Specifying a length that is larger than the real size of the buffer is undefined behavior even if the differing byte occurred before this length, and it can cause crashes on common implementations.
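One standards-clean way to write the opening example is to clamp n to the smaller object; a minimal sketch:

#include <cstring>
#include <algorithm>

constexpr char s1[] = "a";    // 2 bytes, including the NUL
constexpr char s2[] = "abc";  // 4 bytes, including the NUL

int main()
{
    // Compare only as many bytes as both objects actually own.
    int r = std::memcmp(s1, s2, std::min(sizeof s1, sizeof s2));
    return r < 0 ? 0 : 1;  // here r < 0: '\0' compares less than 'b'
}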
Here's proof that it can crash: https://ideone.com/8jTREr
template <char terminator = '\0'>
size_t strlen(const char *str)
{
    const char *char_ptr;
    const unsigned long int *longword_ptr;
    unsigned long int longword, magic_bits, himagic, lomagic;

    /* Handle the first few characters by reading one character at a
       time, until char_ptr is aligned on a longword boundary. */
    for (char_ptr = str;
         ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
         ++char_ptr)
        if (*char_ptr == '\0')
            return char_ptr - str;

    longword_ptr = (unsigned long int *) char_ptr;

    /* Magic bits for the "does this word contain a zero byte?" trick. */
    himagic = 0x80808080L;
    lomagic = 0x01010101L;

    for (;;)
    {
        longword = *longword_ptr++;

        if (((longword - lomagic) & himagic) != 0)
        {
            /* Which of the bytes was the zero? */
            const char *cp = (const char *) (longword_ptr - 1);

            if (cp[0] == 0)
                return cp - str;
            if (cp[1] == 0)
                return cp - str + 1;
            if (cp[2] == 0)
                return cp - str + 2;
            if (cp[3] == 0)
                return cp - str + 3;
        }
    }
}
The above is glibc's strlen() code. It relies on a bit trick (see "Determine if a word has a zero byte") to make it fast.
However, I wish to make the function work with any terminating character, not just '\0', using a template parameter. Is it possible to do something similar?
Use std::memchr to take advantage of libc's hand-written asm
It returns a pointer to the found byte, so you can get the length by subtracting. It returns NULL on not-found, but you said you can assume there will be a match, so we don't need to check except as a debug assert.
Even better, use rawmemchr if you can assume GNU functions are available, so you don't even have to pass a length. (Or not, since glibc 2.37 deprecates it.)
#include <cstring>

size_t lenx(const char *p, int c, size_t len)
{
    const void *match = std::memchr(p, c, len); // old C functions take the char in an int
    return static_cast<const char*>(match) - p;
}
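If you still want the template interface from the question, it can be a thin wrapper over the above; a minimal sketch (strlen_t is a made-up name, and it assumes, as the question does, that the terminator is present within len bytes):

template <char terminator = '\0'>
size_t strlen_t(const char *str, size_t len)
{
    // The char template argument widens to the int that memchr takes.
    return lenx(str, terminator, len);
}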
Any decent libc implementation for a modern mainstream CPU will have a fast memchr implementation that checks multiple bytes at once, often hand-written in asm. Very similar to an strlen implementation, but with length-based loop exit condition in the unrolled loop, not just the match-finding loop exit condition.
memchr is somewhat cheaper than strchr, which has to check every byte for being a potential 0; an amount of work that doesn't go down with unrolling and SIMD. If data isn't hot in L1 cache, a good strchr can typically still keep up with available bandwidth, though, on most CPUs for most ISAs. Checking for 0s is also a correctness problem for arrays that contain a 0 byte before the byte you're looking for.
If available, it will even use SIMD instructions to check 16 or 32 bytes at once. A pure C bithack (with strict-aliasing UB) like the one you found is only used in portable fallback code in real C libraries (Why does glibc's strlen need to be so complicated to run quickly? explains this and has some links to glibc's asm implementations), or on targets where it compiles to asm as good as could be written by hand (e.g. MIPS for glibc). (But being wrapped up in a library function, the strict-aliasing UB is dealt with by some means, perhaps as simple as just not being able to inline into other code that accesses that data a different way. If you wanted to do it yourself, you'd want a typedef with something like GNU C __attribute__((may_alias)). See the link earlier in this paragraph.)
You certainly don't want a bithack that's only going to check 4 bytes at a time, especially if unsigned long is an 8-byte type on a 64-bit CPU!
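If you did roll your own word-at-a-time loop in GNU C, the strict-aliasing escape hatch mentioned above looks like this; a sketch only, not a recommendation:

// GNU C/C++ only: loads through this typedef may alias any object,
// like char does, avoiding the strict-aliasing UB of the bithack.
typedef unsigned long __attribute__((may_alias)) aliasing_ulong;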
If you don't know the buffer length, use len = -1 in C11 / C++17
Use rawmemchr if available, otherwise use memchr(ptr, c, -1).
That's equivalent to passing SIZE_MAX.
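A hedged sketch of the unknown-length version (len_unbounded is a made-up name; it assumes the byte is known to be present):

#include <cstring>
#include <cstddef>
#include <cstdint>

size_t len_unbounded(const char *p, int c)
{
#ifdef __GLIBC__
    // GNU extension: no length argument at all (but deprecated in glibc 2.37).
    return static_cast<const char *>(rawmemchr(p, c)) - p;
#else
    // Portable since C11 / C++17: memchr must behave as if it stops
    // at the first match, so the oversized length can't fault.
    return static_cast<const char *>(std::memchr(p, c, SIZE_MAX)) - p;
#endif
}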
See Is it legal to call memchr with a too-long length, if you know the character will be found before reaching the end of the valid region?
It's guaranteed not to read past the match, or at least to behave as if it didn't, i.e. not faulting. So it won't read into the next page, just like an optimized strlen, and probably, for performance reasons, won't read into the next cache line. (At least since C++17 / C11, according to cppreference, but real implementations have almost certainly been safe to use this way for longer, if only for performance reasons.)
The ISO C++ standard itself defers to the C standard for <cstring> functions; C++17 and later defer to C11, which added this requirement that C99 didn't have. (I also don't know if there are real-world implementations that violate that standard; I'd guess not and that it was more a matter of documenting a guarantee that real implementations were already doing.)
The POSIX man page for memchr guarantees stopping on a match; I don't know how far back this guarantee goes for POSIX systems.
Implementations shall behave as if they read the memory byte by byte from the beginning of the bytes pointed to by s and stop at the first occurrence of c (if it is found in the initial n bytes).
Without a guarantee like this, it's hypothetically possible for an implementation to just use unaligned loads starting with the address you pass it, as long as it's far enough from the ptr[size-1] end of the buffer you told it about. That's unlikely for performance reasons, though.
rawmemchr()
If you're on a GNU system, glibc has rawmemchr which assumes there will be a match, not taking a size limit. So it can loop exactly like strlen does, not having a 2nd exit condition based on either length or checking every byte for a 0 as well as the given character.
Fun fact: AArch64 glibc implements it as memchr(ptr, c, -1), or as strlen if the character happens to be 0. On other ISAs, it might actually duplicate the memchr code but without checking against the end of a buffer.
Glibc 2.37 will deprecate it, so it's apparently not a good idea for new code to switch to it now.
I have some questions about how to calculate the size of different data types in C++. I have int, char, unsigned char, unsigned int, double, and string. When I run sizeof(i), the computer gives me sizeof(int/unsigned int) == 4, sizeof(char/unsigned char) == 1, and sizeof(string) == 32. I've studied many different tutorials recently and just got very confused about this result; some claim that unsigned int's size is 8 bytes, or something like that, which is really confusing.
By the way, I'm really confused about the difference between char and string. When I declare a string, I write string mystring = "asd";, but I can also declare a char mystring = "asd";. That's really confusing too. I'm just a beginner; I hope somebody can point me in the right direction.
Can anybody help me out?
C++ was originally based on C, which was made to be a language that closely follows the hardware. For hardware it makes sense to have many different data types of different sizes (bytes, half-words, words, etc.), so it made sense for C to follow that, and this was inherited by C++ (which can also be used to make programs that run close to the hardware).
The size of the data types depends on the compiler and the hardware it targets, and can differ between platforms and even between different compilers on the same platform. For example, on a 64-bit Windows system using Visual Studio, the type long is 32 bits (four bytes), while on 64-bit Linux using GCC, a long is 64 bits (eight bytes).
Generally speaking you can say that
sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long) <= sizeof(long long)
Also, the C++ specification says that sizeof(char) is always 1, no matter the actual number of bits in a char. There is also no size difference between unsigned and signed variants: sizeof(unsigned int) == sizeof(signed int).
As for the size of structures and classes, roughly speaking the size of a structure or class is the sum of the sizes of its members. So if you have a structure (or class) with two int member variables, its size will be sizeof(int) + sizeof(int). This is however not the full truth, as the compiler may add padding to a structure to make member variables end up at nicely aligned positions inside it, and this padding is also counted when getting the size of a structure.
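For example, a struct mixing a char and an int is typically padded; a quick illustration (the exact sizes and padding are implementation-specific):

#include <iostream>

struct Example {
    char c;  // 1 byte
    // typically 3 bytes of padding so that i lands on a 4-byte boundary
    int  i;  // typically 4 bytes
};

int main()
{
    std::cout << sizeof(Example) << '\n';  // commonly prints 8, not 5
}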
The C++ standard is very open about what the size of various data types is; implementations are allowed to vary a lot.
As a quick summary of the rules:
sizeof(char) == 1
sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long) <= sizeof(long long)
Signed and unsigned variants of a type are guaranteed to have the same size, so sizeof(int) == sizeof(unsigned int).
sizeof(int) on modern desktops is typically 4, even on 64-bit systems; it's long and pointer sizes that vary with the platform's approach to 64-bit numbers. But you shouldn't assume that.
The reason sizeof(std::string) and sizeof(char) are different is that char is the type of the smallest addressable unit in the system, and C strings are just an array of them. So if you write char* a = "abcd"; std::cout << sizeof(a) << std::endl; you will get the size of a pointer-to-char on the system. std::string, on the other hand, is a class. std::string a = "abcd"; std::cout << sizeof(a) << std::endl; will give you the full size of the std::string object, including all of its data members and any padding, regardless of how long the string it holds is.
As for the sizes of the data types, you don't have to worry much about them: you can always use the sizeof operator to check. There is no single answer, because the sizes of the different data types depend on the computer and the operating system. As for string and char: a string is just an object made of a series of chars. If you use string to declare a string, that string becomes an object of the class string. If you use char to declare a string (which is actually declared as the pointer type char*), then that string is just a series of characters of type char. You can also declare a string using an array, for example char name[10] = "Christina"; (nine letters plus the terminating '\0'). So there are many ways to declare a string depending on its purpose, but the string object has a lot more functionality. Search for the C++ string class for more information. I hope this helps.
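To make the char-versus-string distinction concrete, here is a minimal sketch of the declaration styles mentioned above (the printed sizes are only typical values):

#include <iostream>
#include <string>

int main()
{
    const char *p = "asd";  // pointer to a string literal
    char buf[]    = "asd";  // array of 4 chars, '\0' included
    std::string s = "asd";  // class object managing its own storage

    std::cout << sizeof p   << '\n'   // pointer size, e.g. 8
              << sizeof buf << '\n'   // 4
              << sizeof s   << '\n';  // implementation-defined, e.g. 32
}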
I found that strncpy_s() is defined under VS2013 as
errno_t __cdecl strncpy_s
(
    _Out_writes_z_(_SizeInBytes) char * _Dst,
    _In_ rsize_t _SizeInBytes,
    _In_reads_or_z_(_MaxCount) const char * _Src,
    _In_ rsize_t _MaxCount
);
rsize_t is:
typedef size_t rsize_t;
I thought it was a trick done by Visual Studio. However, I found this function defined as follows on this page:
errno_t strncpy_s
(
    char *restrict dest,
    rsize_t destsz,
    const char *restrict src,
    rsize_t count
);
Why is rsize_t defined here?
What if size_t was used here?
Any special cases to use this rsize_t?
You've encountered it in Microsoft's C++ standard library, but it actually comes from C (C11, to be precise), which means it's not technically a part of C++.
The C11 standard's Annex K introduced all the _s functions and the corresponding typedefs, including rsize_t. There is also a "maximum value" macro RSIZE_MAX, which should be large enough for typical applications but smaller than the real maximum value of the type. The secure functions do nothing and report an error when a value of type rsize_t exceeds RSIZE_MAX.
The idea is to avoid crashes on buffer overruns and similar errors caused by invalid sizes, usually resulting from using a negative value for size. In 2's complement signed value representation (the most common one), a negative number corresponds to a very large number when treated as unsigned. RSIZE_MAX should catch such incorrect use.
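A small sketch of the failure mode RSIZE_MAX is meant to catch (the wrapped value assumes a 64-bit size_t):

#include <cstddef>
#include <cstdio>

int main()
{
    int miscomputed = -1;                      // e.g. a length that underflowed
    std::size_t n = (std::size_t)miscomputed;  // wraps to 18446744073709551615
    std::printf("%zu\n", n);
    // An Annex K function handed this n would see n > RSIZE_MAX and
    // report a runtime-constraint violation instead of overrunning.
}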
Quoting the "rationale" part of C11 (N1570), K.3.2:
3. Extremely large object sizes are frequently a sign that an object's size was calculated incorrectly. For example, negative numbers appear as very large positive numbers when converted to an unsigned type like size_t. Also, some implementations do not support objects as large as the maximum value that can be represented by type size_t.

4. For those reasons, it is sometimes beneficial to restrict the range of object sizes to detect programming errors. For implementations targeting machines with large address spaces, it is recommended that RSIZE_MAX be defined as the smaller of the size of the largest object supported or (SIZE_MAX >> 1), even if this limit is smaller than the size of some legitimate, but very large, objects. Implementations targeting machines with small address spaces may wish to define RSIZE_MAX as SIZE_MAX, which means that there is no object size that is considered a runtime-constraint violation.
It is worth noting that Annex K has very few implementations and there is a proposal (N1967) to deprecate and/or remove it from the standard.
These typedefs have semantic meaning. Obviously you can use size_t here (since it's the same type), but rsize_t makes the intent more explicit:
The type size_t generally covers the entire address space. ISO/IEC TR 24731-1-2007 introduces a new type rsize_t, defined to be size_t but explicitly used to hold the size of a single object. [1]
It's a similar situation to using size_t instead of unsigned int: it's basically the same, but named differently so it's easy for you to understand what you're working with (size_t = "size of something", which implies an unsigned whole number).
It is worth noting (as suggested by the comments) that rsize_t is defined in the C specification, but not in the C++ specification.
I know that the most common method to test endianness programmatically is to cast to char* like this:
short temp = 0x1234;
char* tempChar = (char*)&temp;
But can it be done by casting to short* like this:
unsigned char test[2] = {1, 0};
if (*(short *)test == 1)
    // Little-Endian
else
    // Big-Endian
Am I right that the "test" buffer will be saved (on x86 platforms) in memory using the little-endian convention (from right to left: "0" at the lower address, "1" at the higher), just like in the case of the "temp" var?
And more generally if I have a string:
char tab[] = "abcdef";
How would it be stored in the memory? Will it be reversed like: "fedcba"?
PS.
Is there any way to see how exactly the data of a program looks in system memory? I would like to see that byte swap in little-endian in "real life".
Your code would probably work in practice (you could have just tried it!). However, technically, it invokes undefined behaviour; the standard doesn't allow you to access a char array through a pointer of another type.
And more generally if I have a string: char tab[] = "abcdef"; How would it be stored in the memory? Will it be reversed like: "fedcba"?
No. Otherwise tab[0] would give you f.
Your alternative method for checking endianness would work.
char tab[] = "abcdef" would be stored in that same order: abcdef
Endianness comes into play when you access multiple bytes (short, int, and so on). When you try to access tab[] as a short array using a little endian machine, you'd read it as ba, dc, fe (whatever their actual byte equivalents are, this is the order the chars are "evaluated" in the short).
It would be safer, i.e. standards-compliant, to use a union.
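A sketch of that idea: the union trick is well-defined in C, but in C++ reading the inactive member is technically UB (although compilers accept it), so this version uses std::memcpy, which is fully defined in both languages:

#include <cstdint>
#include <cstring>

bool is_little_endian()
{
    std::uint16_t value = 0x0102;
    unsigned char bytes[2];
    std::memcpy(bytes, &value, sizeof value);
    return bytes[0] == 0x02;  // low-order byte first => little-endian
}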
Neither way is guaranteed to work; furthermore, the latter invokes undefined behavior.
The first fails if sizeof(char) == sizeof(short).
The second may fail for the same reason, and is also unsafe: the result of the pointer cast may have the wrong alignment for short, and accessing the short value through it invokes undefined behavior (3.10/15, the aliasing rule).
But yes, the char buffer is stored sequentially in memory, so that &test[0] < &test[1],
and more generally, as others have already said, char tab[] = "abcdef" is not reversed or otherwise permuted regardless of endianness.
Let us consider this snippet:
int s;
scanf("%c",&s);
Here I have used int, not char, for the variable s. Now, to use s safely for character conversion I have to make it char again, because when scanf reads a character it only overwrites one byte of the variable it is assigning to, not all four bytes that an int has.
For conversion I could use s = (char)s; as the next line, but is it possible to implement the same by subtracting something from s?
What you've done is technically undefined behaviour. The %c format calls for a char*; you've passed it an int*, which will (roughly speaking) be reinterpreted. Even assuming that the pointer value is still good after reinterpreting, storing an arbitrary character to the first byte of an int and then reading it back as an int is undefined behaviour. Even if it were defined, reading an int when 3 bytes of it are uninitialized is undefined behaviour.
In practice it probably does something sensible on your machine, and you just get garbage in the top 3 bytes (assuming little-endian).
Writing s = (char)s converts the value from int to char and then back to int again. This is implementation-defined behaviour: converting an out-of-range value to a signed type. On different implementations it might clean up the top 3 bytes, it might return some other result, or it might raise a signal.
The proper way to use scanf is:
char c;
scanf("%c", &c);
And then either int s = c; or int s = (unsigned char)c;, according to whether you want negative-valued characters to result in a negative integer, or a positive integer (up to 255, assuming 8-bit char).
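Putting that together, a minimal sketch of the corrected snippet:

#include <cstdio>

int main()
{
    char c;
    if (std::scanf("%c", &c) == 1)   // %c gets the char* it expects
    {
        int s = (unsigned char)c;    // always 0..255, assuming 8-bit char
        std::printf("%d\n", s);
    }
}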
I can't think of any good reason for using scanf improperly. There are good reasons for not using scanf at all, though:
int s = getchar();
Are you trying to convert a digit to its decimal value? If so, then
char c = '8';
int n = c - '0';
n should be 8 at this point.
That's probably not a good idea; GCC gives me a warning for that code:
main.c:10: warning: format ‘%c’ expects type ‘char *’, but argument 2 has type ‘int *’
In this case you're OK, since you're passing a pointer to more space than you need (on most systems), but what if you did it the other way around? Could be crash city. If you really want to do something like what you have there, just do the typecast or mask it; note that whether the mask picks out the byte scanf actually wrote is endian-dependent.
As written this won't work reliably. The argument &s to scanf is a pointer to int, and scanf is expecting a pointer to char. The two data types (int and char) have different sizes (at least on most architectures), so the data may get put in the wrong spot in memory, and the other part of s may not get properly cleared.
The answers suggesting manipulation of the result after using a pointer to int rely on unspecified behavior (i.e. that scanf will put the character value it has in the least significant byte of the int you're pointing to), and are not safe.
No, but you could use the following:
s = s & 0xFF;
That will blank out all of the data except the low-order byte. But in general all these ideas (and the ones above) are bad ideas, since not all systems store the lowest part of the integer first in memory. So if you ever have to port this code to a big-endian system, you'll be screwed.
True, you may never have to port the code, but why write unportable code to begin with?
See this for more info:
http://en.wikipedia.org/wiki/Endianness