Does implementation-definedness of char affect std::string? - c++

I thought all types were signed unless otherwise specified (like int). I was surprised to find that for char it's actually implementation-defined:
... It is implementation-defined whether a char object can hold negative values. ... In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.
However std::string is really just std::basic_string<char, ...>.
Can the semantics of this program vary from one implementation to another?
#include <string>
int main()
{
    char c = -1;
    std::string s{1, c};
}

Yes and no.
Since a std::string contains objects of type char, the signedness of type char can affect its behavior.
The program in your question:
#include <string>
int main()
{
    char c = -1;
    std::string s{1, c};
}
has no visible behavior (unless terminating without producing any output is "behavior"), so its behavior doesn't depend on the signedness of plain char. A compiler could reasonably optimize out the entire body of main. (I'm admittedly nitpicking here, commenting on the code example you picked rather than the question you're asking.)
But this program:
#include <iostream>
#include <string>
int main() {
    std::string s = "xx";
    s[0] = -1;
    s[1] = +1;
    std::cout << "Plain char is " << (s[0] < s[1] ? "signed" : "unsigned") << "\n";
}
will correctly print either Plain char is signed or Plain char is unsigned.
Note that a similar program that compares two std::string objects using that type's operator< does not distinguish whether plain char is signed or unsigned, since < treats the characters as if they were unsigned, similar to the way C's memcmp works.
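For instance, a minimal sketch of that comparison behavior (assuming an 8-bit char, so the second character compares as 255):
#include <iostream>
#include <string>
int main() {
    std::string a(1, static_cast<char>(1));
    std::string b(1, static_cast<char>(-1));   // negative if plain char is signed
    // std::char_traits<char> compares characters as unsigned char, so b's character
    // compares as 255 and a < b holds whether plain char is signed or unsigned.
    std::cout << std::boolalpha << (a < b) << "\n";   // prints: true
}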
But this shouldn't matter 99% of the time. You almost certainly have to go out of your way to write code whose behavior depends on the signedness of char. You should keep in mind that it's implementation-defined, but if the signedness matters, you should be using signed char or (more likely) unsigned char explicitly. char is a numeric type, but you should use it only to hold character data.

Related

If char and int differ only in the number of bits, why are they different when printing?

In Difference between char and int when declaring character, the accepted answer says that the difference is the size in bits. However, MicroVirus's answer says:
it plays the role of a character in a string, certainly historically. When seen like this, the value of a char maps to a specified character, for instance via the ASCII encoding, but it can also be used with multi-byte encodings (one or more chars together map to one character).
Based on those answers: In the code below, why is the output not the same (since it's just a difference in bits)? What is the mechanism that makes each type print a different "character"?
#include <iostream>
int main() {
    int a = 65;
    char b = 65;
    std::cout << a << std::endl;
    std::cout << b << std::endl;
    // output:
    // 65
    // A
}
A char may be treated as containing a numeric value, and when it is treated as such it indeed differs from an int by its size: a char is exactly one byte, while an int is typically larger.
However, int and char are still different types, and since C++ is a statically typed language, types matter: a variable's type, not just its value, can affect a program's behavior. In your example the two variables are printed differently because operator<< is overloaded; the int overload formats the value as a decimal number, while the char overload writes the corresponding character.
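If you want the numeric value of a char rather than the character it encodes, convert it before printing; a minimal sketch:
#include <iostream>
int main() {
    char b = 65;
    std::cout << b << "\n";                     // char overload: prints A
    std::cout << static_cast<int>(b) << "\n";   // int overload: prints 65
    std::cout << +b << "\n";                    // unary + promotes char to int: prints 65
}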

Taking an index out of const char* argument

I have the following code:
int some_array[256] = { ... };
int do_stuff(const char* str)
{
    int index = *str;
    return some_array[index];
}
Apparently the above code causes a bug on some platforms, because *str can in fact be negative.
So I thought of two possible solutions:
1. Casting the value on assignment (unsigned int index = (unsigned char)*str;).
2. Passing const unsigned char* instead.
Edit: The rest of this question did not get a treatment, so I moved it to a new thread.
The signedness of char is indeed platform-dependent, but what you do know is that there are as many values of char as there are of unsigned char, and the conversion is injective. So you can absolutely cast the value to associate a lookup index with each character:
unsigned char idx = *str;
return arr[idx];
You should of course make sure that arr has at least UCHAR_MAX + 1 elements. (This may cause hilarious edge cases when sizeof(unsigned long long int) == 1, which is fortunately rare.)
Characters are allowed to be signed or unsigned, depending on the platform. An assumption of unsigned range is what causes your bug.
Your do_stuff code does not treat const char* as a string representation. It uses it as a sequence of byte-sized indexes into a look-up table. Therefore, there is nothing wrong with forcing unsigned char type on the characters of your string inside do_stuff (i.e. use your solution #1). This keeps re-interpretation of char as an index localized to the implementation of do_stuff function.
Of course, this assumes that other parts of your code do treat str as a C string.
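A minimal sketch of solution #1 applied to the original function (the array size follows the UCHAR_MAX + 1 advice above; the initializer is a placeholder):
#include <climits>
int some_array[UCHAR_MAX + 1] = {};
int do_stuff(const char* str)
{
    // Force the byte into the unsigned range so the index can never be negative.
    unsigned char index = static_cast<unsigned char>(*str);
    return some_array[index];
}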

How to work with uint8_t instead of char?

I wish to understand the situation regarding uint8_t vs char: portability, bit manipulation, best practices, the state of affairs, etc. Do you know of good reading on the topic?
I wish to do byte I/O, but of course char has a more complicated and subtle definition than uint8_t, which I assume was one of the reasons for introducing the stdint header.
However, I had problems using uint8_t on multiple occasions. A few months ago, once, because iostreams are not defined for uint8_t. Isn't there a C++ library doing really well-defined byte I/O, i.e. reading and writing uint8_t? If not, I assume there is no demand for it. Why?
My latest headache stems from the failure of this code to compile:
uint8_t read(decltype(cin) & s)
{
    char c;
    s.get(c);
    return reinterpret_cast<uint8_t>(c);
}
error: invalid cast from type 'char' to type 'uint8_t {aka unsigned char}'
Why the error? How to make this work?
The general, portable, roundtrip-correct way would be to:
1. demand in your API that all byte values can be expressed with at most 8 bits,
2. use the layout-compatibility of char, signed char and unsigned char for I/O, and
3. convert unsigned char to uint8_t as needed.
For example:
#include <cstdint>
#include <istream>
#include <ostream>

bool read_one_byte(std::istream & is, uint8_t * out)
{
    unsigned char x;   // a "byte" on your system
    if (is.get(reinterpret_cast<char &>(x)))   // char and unsigned char may alias each other
    {
        *out = x;
        return true;
    }
    return false;
}

bool write_one_byte(std::ostream & os, uint8_t val)
{
    unsigned char x = val;
    return static_cast<bool>(os.write(reinterpret_cast<char const *>(&x), 1));
}
Some explanation: Rule 1 guarantees that values can be round-trip converted between uint8_t and unsigned char without losing information. Rule 2 means that we can use the iostream I/O operations on unsigned char variables, even though they're expressed in terms of chars.
We could also have used is.read(reinterpret_cast<char *>(&x), 1) instead of is.get() for symmetry. (Using read in general, for counts larger than 1, also requires the use of gcount() on error, but that doesn't apply here.)
As always, you must never ignore the return value of I/O operations. Doing so is always a bug in your program.
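For instance, a minimal usage sketch (assuming the two functions above are in scope) that copies bytes from standard input to standard output:
#include <cstdint>
#include <iostream>
int main()
{
    uint8_t byte;
    while (read_one_byte(std::cin, &byte))   // the return value is checked on every call
    {
        write_one_byte(std::cout, byte);
    }
}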
A few months ago, once, because iostreams are not defined for uint8_t.
uint8_t is pretty much just a typedef for unsigned char. In fact, I doubt you could find a machine where it isn't.
uint8_t read(decltype(cin) & s)
{
    char c;
    s.get(c);
    return reinterpret_cast<uint8_t>(c);
}
Using decltype(cin) instead of std::istream has no advantage at all; it is just a potential source of confusion.
The cast in the return-statement isn't necessary; converting a char into an unsigned char works implicitly.
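A minimal sketch of the same function with both points applied (a plain std::istream parameter and no cast):
#include <cstdint>
#include <istream>
uint8_t read(std::istream & s)
{
    char c = 0;
    s.get(c);     // as noted elsewhere in this thread, the result should really be checked
    return c;     // char converts to uint8_t (unsigned char) implicitly
}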
A few months ago, once, because iostreams are not defined for uint8_t.
They are. Not for uint8_t itself, but most certainly for the type it actually represents. operator>> is overloaded for unsigned char. This code works:
uint8_t read(istream& s)
{
    return s.get();
}
Since unsigned char and char can alias each other you can also just reinterpret_cast any pointer to a char string to an unsigned char* and work with that.
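For example, a minimal sketch of that approach (process and handle_bytes are hypothetical names used only for illustration):
#include <cstring>
void handle_bytes(unsigned char* bytes, std::size_t n);   // some byte-oriented function
void process(char* buffer)
{
    // char and unsigned char may alias each other, so this cast is fine.
    unsigned char* bytes = reinterpret_cast<unsigned char*>(buffer);
    handle_bytes(bytes, std::strlen(buffer));
}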
In case you want the most portable way possible, take a look at Kerrek's answer.

Do I need to cast to unsigned char before calling toupper(), tolower(), et al.?

A while ago, someone with high reputation here on Stack Overflow wrote in a comment that it is necessary to cast a char-argument to unsigned char before calling std::toupper and std::tolower (and similar functions).
On the other hand, Bjarne Stroustrup does not mention the need to do so in the C++ Programming Language. He just uses toupper like
string name = "Niels Stroustrup";
void m3() {
    string s = name.substr(6,10);    // s = "Stroustrup"
    name.replace(0,5,"nicholas");    // name becomes "nicholas Stroustrup"
    name[0] = toupper(name[0]);      // name becomes "Nicholas Stroustrup"
}
(Quoted from said book, 4th edition.)
The reference says that the input needs to be representable as unsigned char.
To me, this sounds like it holds for every char, since char and unsigned char have the same size.
So is this cast unnecessary or was Stroustrup careless?
Edit: The libstdc++ manual mentions that the input character must be from the basic source character set, but it does not cast. I guess this is covered by Keith Thompson's reply: they all have a non-negative representation as both signed char and unsigned char?
Yes, the argument to toupper needs to be converted to unsigned char to avoid the risk of undefined behavior.
The types char, signed char, and unsigned char are three distinct types. char has the same range and representation as either signed char or unsigned char. (Plain char is very commonly signed and able to represent values in the range -128..+127.)
The toupper function takes an int argument and returns an int result. Quoting the C standard, section 7.4 paragraph 1:
In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
(C++ incorporates most of the C standard library, and defers its definition to the C standard.)
The [] indexing operator on std::string returns a reference to char. If plain char is a signed type, and if the value of name[0] happens to be negative, then the expression
toupper(name[0])
has undefined behavior.
The language guarantees that, even if plain char is signed, all members of the basic character set have non-negative values, so given the initialization
string name = "Niels Stroustrup";
the program doesn't risk undefined behavior. But yes, in general a char value passed to toupper (or to any of the functions declared in <cctype> / <ctype.h>) needs to be converted to unsigned char, so that the implicit conversion to int won't yield a negative value and cause undefined behavior.
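A common way to apply that conversion when uppercasing a whole string is to take the character as unsigned char inside a lambda; a minimal sketch (the helper name is mine):
#include <algorithm>
#include <cctype>
#include <string>
void upcase_in_place(std::string & s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::toupper(ch)); });
}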
The <ctype.h> functions are commonly implemented using a lookup table. Something like:
// assume plain char is signed
char c = -2;
c = toupper(c); // undefined behavior
may index outside the bounds of that table.
Note that converting to unsigned:
char c = -2;
c = toupper((unsigned)c); // undefined behavior
doesn't avoid the problem. If int is 32 bits, converting the char value -2 to unsigned yields 4294967294. This is then implicitly converted to int (the parameter type), which probably yields -2.
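Converting to unsigned char, on the other hand, is exactly what avoids it:
char c = -2;
c = toupper((unsigned char)c); // well defined: the char value -2 converts to the
                               // unsigned char value 254 (assuming 8-bit char),
                               // which toupper is required to accept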
toupper can be implemented so it behaves sensibly for negative values (accepting all values from CHAR_MIN to UCHAR_MAX), but it's not required to do so. Furthermore, the functions in <ctype.h> are required to accept an argument with the value EOF, which is typically -1.
The C++ standard makes adjustments to some C standard library functions. For example, strchr and several other functions are replaced by overloaded versions that enforce const correctness. There are no such adjustments for the functions declared in <cctype>.
The reference is referring to the value being representable as an unsigned char, not to it being an unsigned char. That is, the behavior is undefined if the actual value is not between 0 and UCHAR_MAX (typically 255). (Or EOF, which is basically the reason it takes an int instead of a char.)
In C, toupper (and many other functions) take ints even though you'd expect them to take chars. Additionally, char is signed on some platforms and unsigned on others.
The advice to cast to unsigned char before calling toupper is correct for C. I don't think it's needed in C++, provided you pass it an int that's in range. I can't find anything specific to whether it's needed in C++.
If you want to sidestep the issue, use the toupper defined in <locale>. It's a template, and takes any acceptable character type. You also have to pass it a std::locale. If you don't have any idea which locale to choose, use std::locale(""), which is supposed to be the user's preferred locale:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>

int main()
{
    std::string name("Bjarne Stroustrup");
    std::string uppercase;
    std::locale loc("");
    std::transform(name.begin(), name.end(), std::back_inserter(uppercase),
                   [&loc](char c) { return std::toupper(c, loc); });
    std::cout << name << '\n' << uppercase << '\n';
    return 0;
}
Sadly, Stroustrup was careless :-(
And yes, the codes for Latin letters are guaranteed to be non-negative (so no cast is required for them)...
Some implementations happen to work correctly without the cast to unsigned char...
In my experience, it can cost several hours to find the cause of a segfault from such a toupper call (even when it is already known that a segfault is there)...
And the same applies to isupper, islower, etc.
Instead of casting the argument to unsigned char, you can cast the function. You will need to include the <functional> header. Here's a sample:
#include <string>
#include <algorithm>
#include <functional>
#include <locale>
#include <iostream>

int main()
{
    typedef unsigned char BYTE;   // just in case
    std::string name("Daniel Brühl");   // used this name for its non-ASCII character!
    std::transform(name.begin(), name.end(), name.begin(),
                   (std::function<int(BYTE)>)::toupper);
    std::cout << "uppercase name: " << name << '\n';
    return 0;
}
The output is:
uppercase name: DANIEL BRüHL
As expected, toupper has no effect on non-ASCII characters. But the cast is still beneficial for avoiding the undefined behavior described above.