Why is rsize_t defined? - c++

I found that strncpy_s() is defined under VS2013 as
errno_t __cdecl strncpy_s
(
_Out_writes_z_(_SizeInBytes) char * _Dst,
_In_ rsize_t _SizeInBytes,
_In_reads_or_z_(_MaxCount) const char * _Src,
_In_ rsize_t _MaxCount
);
rsize_t is:
typedef size_t rsize_t;
I think it's a trick done by Visual Studio. However, I found this function defined as follows on this page
errno_t strncpy_s
(
char *restrict dest,
rsize_t destsz,
const char *restrict src,
rsize_t count
);
Why is rsize_t defined here?
What if size_t was used here?
Any special cases to use this rsize_t?

You've encountered it in Microsoft's C++ standard library, but it actually comes from C. C11, to be precise, which means it's not technically a part of C++.
The C11 standard's Annex K introduced all the _s functions and the corresponding typedefs, including rsize_t. There is also a "maximum value" macro RSIZE_MAX which is large enough for typical applications, but smaller than the real maximum value of the type. The secure functions do nothing and report an error when a value of type rsize_t exceeds RSIZE_MAX.
The idea is to avoid crashes on buffer overruns and similar errors caused by invalid sizes, usually resulting from using a negative value for size. In 2's complement signed value representation (the most common one), a negative number corresponds to a very large number when treated as unsigned. RSIZE_MAX should catch such incorrect use.
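To make that concrete, here is a minimal sketch (not the actual Annex K machinery; copy_checked and MY_RSIZE_MAX are made-up names) of how such a check stops a negative length that has wrapped around to a huge unsigned value:
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical RSIZE_MAX, following the C11 recommendation of SIZE_MAX >> 1.
constexpr std::size_t MY_RSIZE_MAX = SIZE_MAX >> 1;

// Sketch of an Annex-K-style "secure" copy: reject implausibly large sizes
// instead of blindly writing through them.
bool copy_checked(void* dst, std::size_t dst_size, const void* src, std::size_t count)
{
    if (dst_size > MY_RSIZE_MAX || count > MY_RSIZE_MAX || count > dst_size) {
        return false;                // runtime-constraint violation: do nothing
    }
    std::memcpy(dst, src, count);
    return true;
}

int main()
{
    char dst[16];
    const char src[16] = "hello";

    int bad_len = -1;                                   // a buggy length calculation
    std::size_t n = static_cast<std::size_t>(bad_len);  // wraps to SIZE_MAX: huge

    if (!copy_checked(dst, sizeof dst, src, n)) {
        std::puts("rejected: size exceeds RSIZE_MAX");  // caught, no buffer overrun
    }
}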
Quoting the "rationale" part of C11 (N1570), K.3.2:
3 Extremely large object sizes are frequently a sign that an object’s size was calculated
incorrectly. For example, negative numbers appear as very large positive numbers when
converted to an unsigned type like size_t. Also, some implementations do not support
objects as large as the maximum value that can be represented by type size_t.
4 For those reasons, it is sometimes beneficial to restrict the range of object sizes to detect
programming errors. For implementations targeting machines with large address spaces,
it is recommended that RSIZE_MAX be defined as the smaller of the size of the largest
object supported or (SIZE_MAX >> 1), even if this limit is smaller than the size of
some legitimate, but very large, objects. Implementations targeting machines with small
address spaces may wish to define RSIZE_MAX as SIZE_MAX, which means that there is no object size that is considered a runtime-constraint violation.
It is worth noting that Annex K has very few implementations and there is a proposal (N1967) to deprecate and/or remove it from the standard.

These typedefs have semantic meaning. Obviously you can use size_t here (since it's the same type), but rsize_t makes the intent more explicit:
The type size_t generally covers the entire address space. ISO/IEC TR 24731-1-2007 introduces a new type rsize_t, defined to be size_t but explicitly used to hold the size of a single object. [1]
It's a similar situation to using size_t instead of unsigned int. It's basically the same, but named differently so it's easy for you to understand what you're working with (size_t = "size of something", which implies an unsigned whole number).
It is worth noting (as suggested by the comments) that rsize_t is defined in the C specification, but not in the C++ specification.

Related

Converting Integer Types

How does one convert from one integer type to another safely and without setting off alarm bells in compilers and static analysis tools?
Different compilers will warn for something like:
int i = get_int();
size_t s = i;
for loss of signedness or
size_t s = get_size();
int i = s;
for narrowing.
Casting can remove the warnings but doesn't solve the safety issue.
Is there a proper way of doing this?
You can try boost::numeric_cast<>.
boost::numeric_cast returns the result of converting a value of type Source to a value of type Target. If out-of-range is detected, an exception is thrown (see bad_numeric_cast, negative_overflow and positive_overflow).
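A short example of how that looks in practice, assuming Boost.NumericConversion is available:
#include <boost/numeric/conversion/cast.hpp>
#include <cstddef>
#include <iostream>

int main()
{
    std::size_t s = 3000000000u;   // too large for a 32-bit int

    try {
        // Throws if the value cannot be represented in the target type.
        int i = boost::numeric_cast<int>(s);
        std::cout << i << '\n';
    } catch (const boost::numeric::positive_overflow& e) {
        std::cerr << "conversion failed: " << e.what() << '\n';
    }

    try {
        int negative = -1;
        // Converting a negative value to an unsigned type throws negative_overflow.
        std::size_t u = boost::numeric_cast<std::size_t>(negative);
        std::cout << u << '\n';
    } catch (const boost::numeric::bad_numeric_cast& e) {   // base of both overflow types
        std::cerr << "conversion failed: " << e.what() << '\n';
    }
}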
How does one convert from one integer type to another safely and without setting off alarm bells in compilers and static analysis tools?
Control when conversion is needed. Where possible, only convert when there is no value change. Sometimes one must step back and code at a higher level. In other words, was a lossy conversion really needed, or can the code be re-worked to avoid the loss?
It is not hard to add an if(). The test just needs to be carefully formed.
Example where a size_t n and an int len need to be compared. Note that the maximum positive value of int may exceed that of size_t, or vice versa, or they may be the same. Note that in this case the conversion of int to unsigned only happens for non-negative values, thus no value change.
int len = snprintf(buf, n, ...); // negative on error, otherwise the length that would have been written
if (len < 0 || (unsigned)len >= n) {
    // Handle_error();
}
An unsigned-to-int example, for when it is known that the unsigned value at this point in the code is less than or equal to INT_MAX.
unsigned n = ...
int i = n & INT_MAX;
Good analysis tools see that n & INT_MAX always converts into int without loss.
There is no built-in safe narrowing conversion between integer types in C++ or its standard library. You could implement it yourself, using Microsoft's GSL as an example.
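A hand-rolled sketch in the spirit of GSL's narrow (the narrow_checked name and details are my own, not GSL's actual code) could look like this:
#include <cstddef>
#include <stdexcept>

struct narrowing_error : std::runtime_error {
    narrowing_error() : std::runtime_error("narrowing_error") {}
};

// Convert From -> To and round-trip the value to detect any change;
// this is the same basic idea gsl::narrow is built on.
template <class To, class From>
To narrow_checked(From from)
{
    const To to = static_cast<To>(from);
    if (static_cast<From>(to) != from) {
        throw narrowing_error{};            // value changed in the conversion
    }
    if ((to < To{}) != (from < From{})) {
        throw narrowing_error{};            // sign flipped (e.g. SIZE_MAX -> -1)
    }
    return to;
}

int main()
{
    std::size_t small = 42;
    int ok = narrow_checked<int>(small);        // fine, value preserved

    try {
        std::size_t big = static_cast<std::size_t>(-1);
        int bad = narrow_checked<int>(big);     // throws narrowing_error
        (void)bad;
    } catch (const narrowing_error&) {
        // handle the failed conversion here
    }
    return ok == 42 ? 0 : 1;
}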
Theoretically, if you want perfect safety, you shouldn't be mixing types like this at all. (And you definitely shouldn't be using explicit casts to silence warnings, as you know.) If you've got values of type size_t, it's best to always carry them around in variables of type size_t.
There is one case where I do sometimes decide I can accept less than 100.000% perfect type safety, and that is when I assign sizeof's return value, which is a size_t, to an int. For any machine I am ever going to use, the only time this conversion might lose information is when sizeof returns a value greater than 2147483647. But I am content to assume that no single object in any of my programs will ever be that big. (In particular, I will unhesitatingly write things like printf("sizeof(int) = %d\n", (int)sizeof(int)), explicit cast and all. There is no possible way that the size of a type like int will not fit in an int!)
[Footnote: Yes, it's true, on a 16-bit machine the assumption is the rather less satisfying threshold that sizeof won't return a value greater than 32767. It's more likely that a single object might have a size like that, but probably not in a program that's running on a 16-bitter.]

Can std::memcmp read any bytes past the first difference?

Consider:
constexpr char s1[] = "a";
constexpr char s2[] = "abc";
std::memcmp(s1, s2, 3);
If memcmp stops at the first difference it sees, it will not read past the second byte of s1 (the NUL terminator); however, I don't see anything in the C standard to confirm this behavior, and I don't know of anything in C++ which extends it.
N1570 7.24.4.1 (PDF link):
int memcmp(const void *s1, const void *s2, size_t n);
The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2
Is my understanding correct that the standard describes the behavior as reading all n bytes of both arguments, but libraries can short circuit as-if they did?
The function is not guaranteed to short-circuit because the standard doesn't say it must.
Not only is it not guaranteed to short-circuit, but in practice many implementations will not. For example, glibc compares elements of type unsigned long int (except for the last few bytes), so it could read up to 7 bytes past the location which compared differently on a 64-bit implementation.
Some may think that this won't cause an access violation on the platforms glibc targets, because access to these unsigned long ints will always be aligned and therefore will not cross a page boundary. But when the two sources have a different alignment, glibc will read two consecutive unsigned long ints from one of the sources, which may be in different pages. If the differing byte is in the first of those, an access violation can still be triggered before glibc performs the comparison (see the function memcmp_not_common_alignment).
In short: specifying a length that is larger than the real size of the buffer is undefined behavior even if the differing byte occurs before that length, and it can cause crashes on common implementations.
Here's proof that it can crash: https://ideone.com/8jTREr
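As a defensive pattern, one can simply never pass memcmp a length larger than the smaller of the two buffers; a minimal sketch (compare_buffers is a made-up helper):
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Compare two buffers of possibly different sizes without ever reading past
// either of them; on an equal prefix, the shorter buffer orders first.
int compare_buffers(const void* a, std::size_t a_len,
                    const void* b, std::size_t b_len)
{
    const std::size_t n = std::min(a_len, b_len);   // never exceed either buffer
    const int r = std::memcmp(a, b, n);
    if (r != 0) return r;
    return (a_len < b_len) ? -1 : (a_len > b_len) ? 1 : 0;
}

int main()
{
    constexpr char s1[] = "a";     // 2 bytes including the terminator
    constexpr char s2[] = "abc";   // 4 bytes including the terminator

    // Unlike std::memcmp(s1, s2, 3), this never reads past the end of s1.
    std::printf("%d\n", compare_buffers(s1, sizeof s1, s2, sizeof s2));
}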

Why is parameter to isdigit integer?

The function std::isdigit is:
int isdigit(int ch);
The return value (non-zero if the character is a numeric character, zero otherwise) smells like the function was inherited from C, but even that does not explain why the parameter type is int rather than char, while at the same time...
The behavior is undefined if the value of ch is not representable as
unsigned char and is not equal to EOF.
Is there any technical reason why isdigit takes an int and not a char?
The reason is to allow EOF as input. And EOF is (from here):
EOF: integer constant expression of type int and negative value
The accepted answer is correct, but I believe the question deserves more detail.
A char in C++ is either signed or unsigned depending on your implementation (and, yet, it's a distinct type from signed char and unsigned char).
Where C grew up, char was typically unsigned and assumed to be an n-bit byte that could represent [0..2^n-1]. (Yes, there were some machines that had byte sizes other than 8 bits.) In fact, chars were considered virtually indistinguishable from bytes, which is why functions like memcpy take char * rather than something like uint8_t *, why sizeof(char) is always 1, and why CHAR_BIT isn't named BYTE_BIT.
But the C standard, which was the baseline for C++, only promised that char could hold any value in the execution character set. They might hold additional values, but there was no guarantee. The source character set (basically 7-bit ASCII minus some control characters) required something like 97 values. For a while, the execution character set could be smaller, but in practice it almost never was. Eventually there was an explicit requirement that a char be large enough to hold an 8-bit byte.
But the range was still uncertain. If unsigned, you could rely on [0..255]. Signed chars, however, could, in theory, use a sign+magnitude representation that would give you a range of [-127..127]. Note that's only 255 unique values, not the 256 values ([-128..127]) you'd get from two's complement. If you were language-lawyerly enough, you could argue that you cannot store every possible value of an 8-bit byte in a char, even though that was a fundamental assumption throughout the design of the language and its run-time library. C++ finally closed that apparent loophole in C++20 by requiring two's complement representation for signed integer types.
When it came time to design fundamental input/output functions, they had to think about how to return a value or a signal that you've reached the end of the file. It was decided to use a special value rather than an out-of-band signaling mechanism. But what value to use? The Unix folks generally had [128..255] available and others had [-128..-1].
But that's only if you're working with text. The Unix/C folks thought of textual characters and binary byte values as the same thing. So getc() was also for reading bytes from a binary file. All 256 possible values of a char, regardless of its signedness, were already claimed.
K&R C (before the first ANSI standard) didn't require function prototypes. The compiler made assumptions about parameter and return types. This is why C and C++ have the "default promotions," even though they're less important now than they once were. In effect, you couldn't return anything smaller than an int from a function. If you did, it would just be converted to int anyway.
The natural solution was therefore to have getc() return an int containing either the character value or a special end-of-file value, imaginatively dubbed EOF, a macro for -1.
The default promotions not only mandated a function couldn't return an integral type smaller than an int, they also made it difficult to pass in a small type. So int was also the natural parameter type for functions that expected a character. And thus we ended up with function signatures like int isdigit(int ch).
If you're a Posix fan, this is basically all you need.
For the rest of us, there's a remaining gotcha: If your chars are signed, then -1 might represent a legitimate character in your execution character set. How can you distinguish between them?
The answer is that functions don't really traffic in char values at all. They're really using unsigned char values dressed up as ints.
int x = getc(source_file);
if (x == EOF) { /* reached end of file */ }
else if (0 <= x && x < 128) { /* plain 7-bit character */ }
else if (128 <= x && x < 256) {
    // Here it gets interesting.
    bool b1 = isdigit(x);                              // OK
    bool b2 = isdigit(static_cast<char>(x));           // NOT PORTABLE
    bool b3 = isdigit(static_cast<unsigned char>(x));  // CORRECT!
}
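If you are starting from a plain char (for example while walking a string) rather than from a getc() result, the usual idiom is a tiny wrapper that routes through unsigned char; a sketch (the wrapper name is made up):
#include <cctype>
#include <cstdio>

// Safe for any char value: going through unsigned char guarantees the argument
// is representable as unsigned char, as the standard requires.
bool is_digit_char(char c)
{
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

int main()
{
    const char* s = "a4b2";
    int digits = 0;
    for (const char* p = s; *p != '\0'; ++p) {
        if (is_digit_char(*p)) {   // no problem even if plain char is signed
            ++digits;
        }
    }
    std::printf("%d\n", digits);   // prints 2
}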

C/C++ Why to use unsigned char for binary data?

Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers? To make sense of my question, have a look at the code below -
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
Both printf calls output 𤭢 correctly, where f0 a4 ad a2 is the UTF-8 encoding of the Unicode code point U+24B62 (𤭢) in hex.
memcpy also correctly copied the bits held by the chars.
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification. But as the above example showed, the output doesn't seem to be affected by any padding as such.
I have used VC++ Express 2010 and MinGW to compile the above. Although VC gave the warning
warning C4309: '=' : truncation of constant value
the output doesn't seem to reflect that.
P.S. This could be marked a possible duplicate of Should a buffer of bytes be signed or unsigned char buffer? but my intent is different. I am asking why something which seems to work fine with char should be typed unsigned char?
Update: To quote from N3337,
Section 3.9 Types
2 For any object (other than a base-class subobject) of trivially
copyable type T, whether or not the object holds a valid value of type
T, the underlying bytes (1.7) making up the object can be copied into
an array of char or unsigned char. If the content of the array of char
or unsigned char is copied back into the object, the object shall
subsequently hold its original value.
In view of the above fact, and given that my original example was on an Intel machine where char defaults to signed char, I am still not convinced that unsigned char should be preferred over char.
Anything else?
In C the unsigned char data type is the only data type that has all the following three properties simultaneously
it has no padding bits, that is, all storage bits contribute to the value of the data
no bitwise operation starting from a value of that type, when converted back into that type, can produce overflow, trap representations or undefined behavior
it may alias other data types without violating the "aliasing rules", that is that access to the same data through a pointer that is typed differently will be guaranteed to see all modifications
If these are the properties of a "binary" data type you are looking for, you definitely should use unsigned char.
For the second property we need a type that is unsigned. For unsigned types, all conversions are defined with modulo arithmetic, here modulo UCHAR_MAX+1, which is 256 on the vast majority of architectures. Converting a wider value to unsigned char therefore just truncates it to the least significant byte.
The two other character types generally don't work the same. signed char is signed, anyhow, so conversion of values that don't fit is not well defined. char is not fixed to be signed or unsigned, and on a particular platform to which your code is ported it might be signed even if it is unsigned on yours.
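A quick sketch of that truncation behaviour:
#include <cstdio>

int main()
{
    int value = 0x1234;
    unsigned char byte = value;          // defined: value modulo 256
    std::printf("%#x\n", static_cast<unsigned>(byte));      // prints 0x34

    unsigned char wrapped = -1;          // also defined: -1 mod 256 == 255
    std::printf("%u\n", static_cast<unsigned>(wrapped));    // prints 255
}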
You'll get most of your problems when comparing the contents of individual bytes:
char c[5];
c[0] = 0xff;
/*blah blah*/
if (c[0] == 0xff)
{
    printf("good\n");
}
else
{
    printf("bad\n");
}
can print "bad", because, depending on your compiler, c[0] ends up holding -1, which is then promoted to the int -1 in the comparison and is in no way equal to 0xff (255).
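Declaring the buffer as unsigned char (or converting before the comparison) avoids the surprise; a minimal sketch:
#include <cstdio>

int main()
{
    unsigned char c[5];
    c[0] = 0xff;            // stores 255, no implementation-defined conversion

    if (c[0] == 0xff) {     // c[0] promotes to the int 255, so this is true
        std::printf("good\n");
    } else {
        std::printf("bad\n");
    }
}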
The plain char type is problematic and shouldn't be used for anything but strings. The main problem with char is that you can't know whether it is signed or unsigned: this is implementation-defined. This makes char different from int etc.; int is always guaranteed to be signed.
Although VC gave the warning ... truncation of constant value
It is telling you that you are trying to store int literals inside char variables. This might be related to the signedness: if you try to store an integer with a value > 0x7F inside a signed char, unexpected things might happen. Formally, the result of such a conversion is implementation-defined in C; practically you'd just get a weird output if attempting to print the result as an integer value stored inside a (signed) char.
In this specific case, the warning shouldn't matter.
EDIT :
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification.
In theory, all integer types except unsigned char and signed char are allowed to contain "padding bits", as per C11 6.2.6.2:
"For unsigned integer types other than unsigned char, the bits of the
object representation shall be divided into two groups: value bits and
padding bits (there need not be any of the latter)."
"For signed integer types, the bits of the object representation shall
be divided into three groups: value bits, padding bits, and the sign
bit. There need not be any padding bits; signed char shall not have
any padding bits."
The C standard is intentionally vague and fuzzy, allowing these theoretical padding bits because:
It allows different symbol tables than the standard 8-bit ones.
It allows implementation-defined signedness and weird signed integer formats such as one's complement or "sign and magnitude".
An integer may not necessarily use all bits allocated.
However, in the real world outside the C standard, the following applies:
Symbol tables are almost certainly 8 bits (UTF-8 or ASCII). Some weird exceptions exist, but clean implementations use the standard type wchar_t when implementing symbol tables larger than 8 bits.
Signed integers always use two's complement.
An integer always uses all bits allocated.
So there is no real reason to use unsigned char or signed char just to dodge some theoretical scenario in the C standard.
Bytes are usually intended as unsigned, 8-bit-wide integers.
Now, char doesn't specify the sign of the integer: on some compilers char could be signed, on other it may be unsigned.
If I add a bit shift operation to the code you wrote, the result is implementation-defined when char is signed. The added comparison will also have an unexpected result.
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
c[0] >>= 1; // If char is signed, will the 7th bit go to 0 or stay the same?
bool isBiggerThan0 = c[0] > 0; // FALSE if char is signed!
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
Regarding the warning during compilation: if char is signed then you are trying to assign the value 0xF0, which cannot be represented in a signed char (range -128 to +127), so it will be converted to a signed value (-16).
Declaring the char as unsigned will remove the warning, and it is always good to have a clean build without any warnings.
The signed-ness of the plain char type is implementation defined, so unless you're actually dealing with character data (a string using the platform's character set - usually ASCII), it's usually better to specify the signed-ness explicitly by either using signed char or unsigned char.
For binary data, the best choice is most probably unsigned char, especially if bitwise operations will be performed on the data (specifically bit shifting, which doesn't behave the same for signed types as for unsigned types).
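As a rough illustration, right-shifting the same bit pattern gives different results depending on signedness (a sketch; the signed result assumes the common arithmetic-shift behaviour, which is implementation-defined):
#include <cstdio>

int main()
{
    signed char   s = static_cast<signed char>(0xF0);    // -16 on two's complement
    unsigned char u = 0xF0;                               // 240

    // Right shift of a negative value: most compilers shift arithmetically and
    // keep the sign bit, giving 0xF8 (-8), but the result is implementation-defined.
    std::printf("signed   >> 1: %#x\n",
                static_cast<unsigned>(static_cast<unsigned char>(s >> 1)));

    // Right shift of an unsigned value: the top bit is filled with 0, giving 0x78.
    std::printf("unsigned >> 1: %#x\n", static_cast<unsigned>(u >> 1));
}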
I am asking why something which seems to be working as fine with char should be typed unsigned char?
If you do things which are not "correct" in the sense of the standard, you rely on undefined behaviour. Your compiler might do it the way you want today, but you don't know what it does tomorrow. You don't know what GCC or VC++ 2012 does, or even whether the behaviour depends on external factors or Debug/Release compiles, etc. As soon as you leave the safe path of the standard, you might run into trouble.
Well, what do you call "binary data"? This is a bunch of bits, without any meaning assigned to them by that specific part of software that calls them "binary data". What's the closest primitive data type, which conveys the idea of the lack of any specific meaning to any one of these bits? I think unsigned char.
Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers?
"really" necessary? No.
It is a very good idea though, and there are many reasons for this.
Your example uses printf, which is not type-safe. That is, printf takes its formatting cues from the format string and not from the data type. You could just as easily have tried:
printf("%s\n", (void*)c);
... and the result would have been the same. If you try the same thing with C++ iostreams, the result will be different (depending on the signedness of c).
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
Signed means that the most significant bit of the data (for an 8-bit char, the 8th bit) represents the sign. Since you obviously do not need that, you should specify that your data is unsigned (the "sign" bit then represents data, not the sign of the other bits).

string::size_type instead of int

const std::string::size_type cols = greeting.size() + pad * 2 + 2;
Why string::size_type? int is supposed to work! It holds numbers!!!
A short holds numbers too. As does a signed char.
But none of those types are guaranteed to be large enough to represent the sizes of any strings.
string::size_type guarantees just that. It is a type that is big enough to represent the size of a string, no matter how big that string is.
For a simple example of why this is necessary, consider 64-bit platforms. An int is typically still 32 bit on those, but you have far more than 2^32 bytes of memory.
So if a (signed) int was used, you'd be unable to create strings larger than 2^31 characters.
size_type will be a 64-bit value on those platforms however, so it can represent larger strings without a problem.
The example that you've given,
const std::string::size_type cols = greeting.size() + pad * 2 + 2;
is from Accelerated C++ by Koenig. He also states the reason for his choice right after this, namely:
The std::string type defines size_type to be the name of the
appropriate type for holding the number of characters in a string. Whenever we need a local
variable to contain the size of a string, we should use std::string::size_type as the type of that variable.
The reason that we have given cols a type of std::string::size_type is
to ensure that cols is capable of containing the number of characters
in greeting, no matter how large that number might be. We could simply
have said that cols has type int, and indeed, doing so would probably
work. However, the value of cols depends on the size of the input to
our program, and we have no control over how long that input might be.
It is conceivable that someone might give our program a string so long
that an int is insufficient to contain its length.
A nested size_type typedef is a requirement for STL-compatible containers (which std::string happens to be), so generic code can choose the correct integer type to represent sizes.
There's no point in using it in application code; a size_t is completely OK (int is not, because it's signed, and you'll get signed/unsigned comparison warnings).
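For instance, mixing int with size() is exactly what triggers those warnings; a short sketch:
#include <cstdio>
#include <string>

int main()
{
    const std::string greeting = "Hello, world!";

    // Typically warns (e.g. -Wsign-compare): i is signed, size() is unsigned.
    for (int i = 0; i < greeting.size(); ++i) {
        // ...
    }

    // Clean: the index has the same (unsigned) type as the size.
    for (std::string::size_type i = 0; i < greeting.size(); ++i) {
        std::printf("%c", greeting[i]);
    }
    std::printf("\n");
}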