C++ islower() function Debug Assertion Failed Error [duplicate] - c++

A while ago, someone with high reputation here on Stack Overflow wrote in a comment that it is necessary to cast a char-argument to unsigned char before calling std::toupper and std::tolower (and similar functions).
On the other hand, Bjarne Stroustrup does not mention the need to do so in the C++ Programming Language. He just uses toupper like
string name = "Niels Stroustrup";
void m3() {
string s = name.substr(6,10); // s = "Stroustrup"
name.replace(0,5,"nicholas"); // name becomes "nicholas Stroustrup"
name[0] = toupper(name[0]); // name becomes "Nicholas Stroustrup"
}
(Quoted from said book, 4th edition.)
The reference says that the input needs to be representable as unsigned char.
To me this sounds like it holds for every char, since char and unsigned char have the same size.
So is this cast unnecessary, or was Stroustrup careless?
Edit: The libstdc++ manual mentions that the input character must be from the basic source character set, but does not cast. I guess this is covered by @Keith Thompson's reply: do they all have a non-negative representation as both signed char and unsigned char?

Yes, the argument to toupper needs to be converted to unsigned char to avoid the risk of undefined behavior.
The types char, signed char, and unsigned char are three distinct types. char has the same range and representation as either signed char or unsigned char. (Plain char is very commonly signed and able to represent values in the range -128..+127.)
The toupper function takes an int argument and returns an int result. Quoting the C standard, section 7.4 paragraph 1:
In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of
the macro EOF . If the argument has any other value, the
behavior is undefined.
(C++ incorporates most of the C standard library, and defers its definition to the C standard.)
The [] indexing operator on std::string returns a reference to char. If plain char is a signed type, and if the value of name[0] happens to be negative, then the expression
toupper(name[0])
has undefined behavior.
The language guarantees that, even if plain char is signed, all members of the basic character set have non-negative values, so given the initialization
string name = "Niels Stroustrup";
the program doesn't risk undefined behavior. But yes, in general a char value passed to toupper (or to any of the functions declared in <cctype> / <ctype.h>) needs to be converted to unsigned char, so that the implicit conversion to int won't yield a negative value and cause undefined behavior.
The <ctype.h> functions are commonly implemented using a lookup table. Something like:
// assume plain char is signed
char c = -2;
c = toupper(c); // undefined behavior
may index outside the bounds of that table.
Note that converting to unsigned:
char c = -2;
c = toupper((unsigned)c); // undefined behavior
doesn't avoid the problem. If int is 32 bits, converting the char value -2 to unsigned yields 4294967294. This is then implicitly converted to int (the parameter type), which probably yields -2.
toupper can be implemented so it behaves sensibly for negative values (accepting all values from CHAR_MIN to UCHAR_MAX), but it's not required to do so. Furthermore, the functions in <ctype.h> are required to accept an argument with the value EOF, which is typically -1.
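A minimal sketch of the cast recommended above, wrapped in helpers so the conversion happens in one place. The names `to_upper_char` and `to_upper_copy` are hypothetical and not part of the standard library.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: convert to unsigned char first, so the int that
// std::toupper receives is always in the range 0..UCHAR_MAX.
inline char to_upper_char(char c) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Hypothetical helper: uppercase every char of a copy of the string.
inline std::string to_upper_copy(std::string s) {
    for (char& c : s) {
        c = to_upper_char(c);
    }
    return s;
}
```

With these, `name[0] = to_upper_char(name[0]);` stays well defined even if `name[0]` holds a negative value on a platform where plain char is signed.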
The C++ standard makes adjustments to some C standard library functions. For example, strchr and several other functions are replaced by overloaded versions that enforce const correctness. There are no such adjustments for the functions declared in <cctype>.

The reference is referring to the value being representable as an unsigned char, not to it being an unsigned char. That is, the behavior is undefined if the actual value is not between 0 and UCHAR_MAX (typically 255). (Or EOF, which is basically the reason it takes an int instead of a char.)
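The precondition described above can be written down as a small predicate. This is only a sketch, and `is_valid_ctype_arg` is a hypothetical name rather than a standard function.

```cpp
#include <climits>
#include <cstdio>

// Hypothetical helper: true if v is an acceptable argument for the
// <cctype> functions, i.e. representable as unsigned char or equal to EOF.
constexpr bool is_valid_ctype_arg(int v) {
    return (v >= 0 && v <= UCHAR_MAX) || v == EOF;
}
```

A char converted with `static_cast<unsigned char>` always satisfies this check, whereas a negative plain char promoted directly to int satisfies it only if it happens to equal EOF.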

In C, toupper (and many other functions) take ints even though you'd expect them to take chars. Additionally, char is signed on some platforms and unsigned on others.
The advice to cast to unsigned char before calling toupper is correct for C. I don't think it's needed in C++, provided you pass it an int that's in range. I can't find anything specific to whether it's needed in C++.
If you want to sidestep the issue, use the toupper defined in <locale>. It's a template, and takes any acceptable character type. You also have to pass it a std::locale. If you don't have any idea which locale to choose, use std::locale(""), which is supposed to be the user's preferred locale:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
int main()
{
std::string name("Bjarne Stroustrup");
std::string uppercase;
std::locale loc("");
std::transform(name.begin(), name.end(), std::back_inserter(uppercase),
[&loc](char c) { return std::toupper(c, loc); });
std::cout << name << '\n' << uppercase << '\n';
return 0;
}

Sadly, Stroustrup was careless :-(
And yes, the codes of Latin letters are non-negative, so no cast is required for them.
Some implementations also happen to work correctly without the cast to unsigned char.
But from experience, it can take several hours to track down the cause of a segfault inside such a toupper call, even when you already know that a segfault is happening.
The same applies to isupper, islower, etc.

Instead of casting the argument to unsigned char, you can cast the function. You will need to include the <functional> header. Here's some sample code:
#include <string>
#include <algorithm>
#include <functional>
#include <locale>
#include <iostream>
int main()
{
typedef unsigned char BYTE; // just in case
std::string name("Daniel Brühl"); // used this name for its non-ascii character!
std::transform(name.begin(), name.end(), name.begin(),
(std::function<int(BYTE)>)::toupper);
std::cout << "uppercase name: " << name << '\n';
return 0;
}
The output is:
uppercase name: DANIEL BRüHL
As expected, toupper has no effect on non-ASCII characters. Still, the cast is useful for avoiding unexpected behavior.
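A similar effect can be had without std::function by giving a lambda an unsigned char parameter, so each char is converted before toupper sees it. This is a sketch only, and `upper_in_place` is a hypothetical name.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: uppercase a string in place. The lambda's
// unsigned char parameter performs the conversion for every element.
inline void upper_in_place(std::string& s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::toupper(c));
                   });
}
```

In the default "C" locale this leaves bytes outside the basic character set untouched, just like the std::function version.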

Related

Convert sqlite3 result columns to string

I am using SQLITE3 and can successfully read data from a SQLITE database
table and display it in C++ like so:
cout << sqlite3_column_text(dbResult, 1);
However, I need to convert the column result into a string.
Is there perhaps an easy way in C++ to convert char into string?
Have been trying to find a solution, but to no avail.
Any suggestion would be much appreciated.
According to the documentation, sqlite3_column_text() is declared as:
const unsigned char *sqlite3_column_text(sqlite3_stmt*, int iCol);
For some reason, it returns a const unsigned char*. (This might be for historical reasons, to emphasize the fact that the returned string is UTF-8 encoded.)
Thus, for assignment to a std::string (which can be assigned from const char* expressions, among others), a small dirty trick does the job:
std::string myResult = (const char*)sqlite3_column_text(dbResult, 1);
This reinterprets the sequence of unsigned chars as a sequence of chars.
Please note that the signedness of char is left to the implementation and may be signed or unsigned. (With the major compilers MSVC, g++, and clang it is signed on x86 targets, though it is unsigned by default on some others, such as ARM Linux.) For that reason, the types signed char and unsigned char exist to make the signedness explicit (and independent of the compiler) when necessary. The conversion in the above snippet doesn't change the contents of the returned string; it just makes the pointer type compatible with assignment to std::string.
Googling a bit, I found another Q/A where the answer explains that the "small dirty trick" is legal according to the C++ standard:
Can I turn unsigned char into char and vice versa?
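A minimal sketch of the same conversion wrapped in a helper. The name `to_std_string` is hypothetical, and mapping a null pointer to an empty string is an assumption made here because constructing a std::string from a null pointer is undefined. As far as I know, sqlite3_column_text() returns a null pointer for an SQL NULL value.

```cpp
#include <string>

// Hypothetical helper: copy a NUL-terminated buffer of unsigned char
// (the pointer type sqlite3_column_text returns) into a std::string.
// Assumption: a null pointer is treated as an empty string.
inline std::string to_std_string(const unsigned char* text) {
    if (text == nullptr) {
        return std::string();
    }
    return std::string(reinterpret_cast<const char*>(text));
}
```

Usage would then look like `std::string myResult = to_std_string(sqlite3_column_text(dbResult, 1));`.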

If char and int differ only in the number of bits, why are they different when printing?

In Difference between char and int when declaring character, the accepted answer says that the difference is the size in bits. However, MicroVirus's answer says:
it plays the role of a character in a string, certainly historically. When seen like this, the value of a char maps to a specified character, for instance via the ASCII encoding, but it can also be used with multi-byte encodings (one or more chars together map to one character).
Based on those answers: In the code below, why is the output not the same (since it's just a difference in bits)? What is the mechanism that makes each type print a different "character"?
#include <iostream>
int main() {
int a = 65;
char b = 65;
std::cout << a << std::endl;
std::cout << b << std::endl;
//output :
//65
//A
}
A char may be treated as containing a numeric value, and when it is treated that way it does differ from an int by its size: it is smaller, typically a byte.
However, int and char are still different types and since C++ is a statically typed language, types matter. A variable's type can affect the behavior of programs, not just a variable's value. In the case in your question the two variables are printed differently because the operator << is overloaded; it treats int and char differently.
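To see the overload selection at work, here is a small sketch that formats a value through operator<< and returns the resulting text. `format_value` is a hypothetical helper name.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: format a value with operator<< and return the text.
template <typename T>
std::string format_value(const T& value) {
    std::ostringstream out;
    out << value;  // overload resolution picks the operator<< matching T
    return out.str();
}
```

On an ASCII-based system, `format_value(65)` gives "65" while `format_value(static_cast<char>(65))` gives "A". Converting the char back with `static_cast<int>` selects the int overload again and prints the number.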

Does implementation-definedness of char affect std::string?

I thought all types were signed unless otherwise specified (like int). I was surprised to find that for char it's actually implementation-defined:
... It is implementation-defined whether a char object can hold
negative values. ... In any particular implementation, a plain char
object can take on either the same values as a signed char or an
unsigned char; which one is implementation-defined.
However std::string is really just std::basic_string<char, ...>.
Can the semantics of this program change from implementation?
#include <string>
int main()
{
char c = -1;
std::string s{1, c};
}
Yes and no.
Since a std::string contains objects of type char, the signedness of type char can affect its behavior.
The program in your question:
#include <string>
int main()
{
char c = -1;
std::string s{1, c};
}
has no visible behavior (unless terminating without producing any output is "behavior"), so its behavior doesn't depend on the signedness of plain char. A compiler could reasonably optimize out the entire body of main. (I'm admittedly nitpicking here, commenting on the code example you picked rather than the question you're asking.)
But this program:
#include <iostream>
#include <string>
int main() {
std::string s = "xx";
s[0] = -1;
s[1] = +1;
std::cout << "Plain char is " << (s[0] < s[1] ? "signed" : "unsigned") << "\n";
}
will correctly print either Plain char is signed or Plain char is unsigned.
Note that a similar program that compares two std::string objects using that type's operator< does not distinguish whether plain char is signed or unsigned, since < treats the characters as if they were unsigned, similar to the way C's memcmp works.
But this shouldn't matter 99% of the time. You almost certainly have to go out of your way to write code whose behavior depends on the signedness of char. You should keep in mind that it's implementation-defined, but if the signedness matters, you should be using signed char or (more likely) unsigned char explicitly. char is a numeric type, but you should use it to hold character data.

Is char and int interchangeable for function arguments in C?

I wrote some code in C to verify that a serial number is alphanumeric, using isalnum. I wrote the code assuming that isalnum takes a char. Everything worked. However, after reviewing isalnum later, I see that it wants an int as input. Is my code okay the way it is, or should I change it?
If I do need to change, what would be the proper way? Should I just declare an int and set it to the char and pass that to isalnum? Is this considered bad programming practice?
Thanks in advance.
#include <ctype.h>   /* isalnum */
#include <stdio.h>   /* printf */
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
bool VerifySerialNumber( char *serialNumber ) {
int num;
char* charPtr = serialNumber;
if( strlen( serialNumber ) < 10 ) {
printf("The entered serial number seems incorrect.");
printf("It's less than 10 characters.\n");
return false;
}
while( *charPtr != '\0' ) {
if( !isalnum(*charPtr) ) {
return false;
}
charPtr++;
}
return true;
}
int main() {
char* str1 = "abcdABCD1234";
char* str2 = "abcdef##";
char* str3 = "abcdABCD1234$#";
bool result;
result = VerifySerialNumber( str1 );
printf("str= %s, result=%d\n\n", str1, result);
result = VerifySerialNumber( str2 );
printf("str= %s, result=%d\n\n", str2, result);
result = VerifySerialNumber( str3 );
printf("str= %s, result=%d\n\n", str3, result);
return 0;
}
Output:
str= abcdABCD1234, result=1
The entered serial number seems incorrect.It's less than 10 characters.
str= abcdef##, result=0
str= abcdABCD1234$#, result=0
You don't need to change it. The compiler will implicitly convert your char to an int before passing it to isalnum. Functions like isalnum take int arguments because functions like fgetc return int values, which allows for special values like EOF to exist.
Update: As others have mentioned, be careful with negative values of your char. Your version of the C library might be implemented carefully so that negative values are handled without causing any run-time errors. For example, glibc (the GNU implementation of the standard C library) appears to handle negative numbers by adding 128 to the int argument.* However, you won't always be able to count on having isalnum (or any of the other <ctype.h> functions) quietly handle negative numbers, so getting in the habit of not checking would be a very bad idea.
* Technically, it's not adding 128 to the argument itself, but rather it appears to be using the argument as an index into an array, starting at index 128, such that passing in, say, -57 would result in an access to index 71 of the array. The result is the same, though, since array[-57+128] and (array+128)[-57] point to the same location.
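The index arithmetic in that footnote can be reproduced with a toy table. This is an illustration only, not glibc's actual code or data, and all names in it are hypothetical.

```cpp
#include <cstddef>

// Toy table with 128 extra slots in front, mimicking the layout described
// above (an illustration only, not glibc's real table).
static int toy_table[384] = {};

// Pointer to the slot that corresponds to argument value 0.
static int* const toy_base = toy_table + 128;

// Index into the underlying array that a given argument value lands on.
inline std::ptrdiff_t slot_for(int arg) {
    return &toy_base[arg] - toy_table;
}
```

Here `slot_for(-57)` is 71, matching the example in the footnote, and `slot_for(0)` is 128.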
Usually it is fine to pass a char value to a function that takes an int. It will be converted to the int with the same value. This isn't a bad practice.
However, there is a specific problem with isalnum and the other C functions for character classification and conversion. Here it is, from the ISO/IEC 9899:TC2 7.4/1 (emphasis mine):
In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the
macro EOF. If the argument has any other value, the behavior is
undefined.
So, if char is a signed type (this is implementation-dependent), and if you encounter a char with negative value, then it will be converted to an int with negative value before passing it to the function. Negative numbers are not representable as unsigned char. The numbers representable as unsigned char are 0 to UCHAR_MAX. So you have undefined behavior if you pass in any negative value other than whatever EOF happens to be.
For this reason, you should write your code like this in C:
if( !isalnum((unsigned char)*charPtr) )
or in C++ you might prefer:
if( !isalnum(static_cast<unsigned char>(*charPtr)) )
The point is worth learning because at first encounter it seems absurd: do not pass a char to the character functions.
Alternatively, in C++ there is a two-argument version of isalnum in the header <locale>. This function (and its friends) do take a char as input, so you don't have to worry about negative values. You will be astonished to learn that the second argument is a locale ;-)
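A rough C++ counterpart of the serial number check using the two-argument isalnum might look like this. `verify_serial_number` is a hypothetical name, and the choice of the classic "C" locale is an assumption; another locale could be passed instead.

```cpp
#include <locale>
#include <string>

// Hypothetical C++ counterpart of VerifySerialNumber: at least 10
// characters, all alphanumeric in the classic "C" locale (assumed choice).
inline bool verify_serial_number(const std::string& serial) {
    if (serial.size() < 10) {
        return false;
    }
    const std::locale& loc = std::locale::classic();
    for (char c : serial) {
        if (!std::isalnum(c, loc)) {  // two-argument overload from <locale>
            return false;
        }
    }
    return true;
}
```

Because this overload takes the character type directly, no cast to unsigned char is needed at the call site.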
