isdigit() function pass a Chinese parameter

When I call isdigit() with a Chinese character, Visual Studio 2013 raises an assertion in Debug mode, but there is no problem in Release mode.
If this function is meant to determine whether its parameter is a digit, why doesn't it simply return 0 for a Chinese character?
This is my code:
string testString = "abcdefg12345中文";
int count = 0;
for (const auto &c : testString) {
if (isdigit(c)) {
++count;
}
}
and this is the assert :

You broke the contract of isdigit(int), which accepts only values that are representable as unsigned char, or EOF.
The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF.
Your standard library implementation is being kind and asserting, rather than going on to blow stuff up.
There is an alternative, locale-aware isdigit(charT ch, const locale&) that you may be able to use here.
I suggest performing some further research on how "characters" work in computers, particularly with regards to encoding more "exotic" [1] character sets.
[1] From the perspective of computer history. Of course, to you, it is the less exotic alternative!
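A minimal sketch of the locale-aware overload mentioned above, assuming the string holds UTF-8 (or another multi-byte encoding) and that only ASCII digits should be counted. The helper name count_digits is hypothetical and not part of the original question.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: counts the bytes that the classic "C" locale
// classifies as decimal digits. The std::isdigit overload from <locale>
// takes a char directly, so no cast to unsigned char is needed here.
std::size_t count_digits(const std::string& text) {
    const std::locale& loc = std::locale::classic();
    std::size_t count = 0;
    for (char c : text) {
        if (std::isdigit(c, loc)) {
            ++count;
        }
    }
    return count;
}
```

Called on the string from the question, this should count only the five ASCII digits; the bytes that make up the Chinese characters are simply classified as non-digits.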

The isdigit() and related functions / macros in <ctype.h> expect an int converted from an unsigned char, or EOF, which on most systems means a value in the range 0-255 (or -1 for EOF). So any value not in the range -1…255 is incorrect.
Problem 1: You are passing in a char, which on your system has range -128…+127. Solution to this problem is simple:
if (isdigit(static_cast<unsigned char>(c)))
This won't crash, however, it's not quite correct for Chinese characters.
Problem 2: Non-ASCII characters should probably use iswdigit() instead. This will correctly handle Chinese characters:
wstring testString = L"abcdefg12345中文";
int count = 0;
for (const auto &c : testString) {
if (iswdigit(c)) {
++count;
}
}

Related

C++ Atoi can't handle special characters

I'm using this code with atoi to strip the letters from a string and read the number that follows. However, my string contains special characters, as seen below, and because of them the program exits with an error. What should I do to solve this?
#include <iostream>
#include <string>
using namespace std;
int main() {
std::string playerPickS = "Klöver 12"; // string with special characters
size_t i = 0;
for (; i < playerPickS.length(); i++) { if (isdigit(playerPickS[i])) break; }
playerPickS = playerPickS.substr(i, playerPickS.length() - i); // convert the remaining text to an integer
cout << atoi(playerPickS.c_str());
}
This is what I believe is the error. I only get it when using those special characters, which is why I think they are my problem.

char can be signed or unsigned, but isdigit without a locale argument expects a non-negative value (or EOF, which is typically -1). In your encoding 'ö' has a negative value. You can cast it to unsigned char first: isdigit(static_cast<unsigned char>(playerPickS[i])), or use the locale-aware variant.

atoi stops scanning when it finds something that's not a digit (roughly speaking). So, to get it to do what you want, you have to feed it something that at least starts with the string you want to convert.
From the documentation:
[atoi] Discards any whitespace characters until the first non-whitespace character is found, then takes as many characters as possible to form a valid integer number representation and converts them to an integer value. The valid integer value consists of the following parts:
(optional) plus or minus sign
numeric digits
So, now you know how atoi works, you can pre-process your string appropriately before passing it in. Good luck!
Edit: If your call to isdigit is failing to yield the desired result, the clue lies here:
The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF.
So you need to check for that yourself before you call it. Casting playerPickS[i] to unsigned char before the call will work.
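The pre-processing described above can be sketched as follows: skip everything up to the first digit, converting each char to unsigned char before the isdigit call, then hand the remainder to atoi. The helper name first_number is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the first run of decimal digits in `text`
// as an int, or 0 if there is none (the same result atoi gives for
// non-numeric input).
int first_number(const std::string& text) {
    std::size_t i = 0;
    // Cast each char to unsigned char so that bytes such as those of 'ö'
    // are never passed to isdigit as negative values.
    while (i < text.size() &&
           !std::isdigit(static_cast<unsigned char>(text[i]))) {
        ++i;
    }
    // If no digit was found, this points at the terminating '\0'.
    return std::atoi(text.c_str() + i);
}
```

Note that this sketch ignores a leading sign, so "-12" would yield 12; whether that matters depends on the input you expect.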

Converting integer to string in c++

This is the code I wrote to convert integer to string.
#include <iostream>
using namespace std;
int main()
{
string s;
int b=5;
s.push_back((char)b);
cout<<s<<endl;
}
I expected the output to be 5 but it is giving me blank space.
I know there is another way of doing it using stringstream but I want to know what is wrong in this method?

On typical systems, the character codes for digits are not equal to the integers those characters represent.
It is guaranteed that the character codes for decimal digits are consecutive (N3337 2.3 Character sets, Paragraph 3), so you can add '0' to convert a one-digit number to its character.
#include <iostream>
#include <string>
using namespace std;
int main()
{
string s;
int b=5;
s.push_back((char)(b + '0'));
cout<<s<<endl;
}
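As a hedged sketch, the '0' + digit trick can be wrapped in a small function that rejects values outside 0..9, since adding '0' to anything else does not produce a digit character. The names digit_to_char and append_digit are hypothetical.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: maps a single decimal digit 0..9 to its character.
// Relies only on the guarantee that '0'..'9' have consecutive codes.
char digit_to_char(int digit) {
    if (digit < 0 || digit > 9) {
        throw std::out_of_range("digit_to_char expects a value in 0..9");
    }
    return static_cast<char>('0' + digit);
}

// Hypothetical helper: appends one digit character to a copy of `s`.
std::string append_digit(std::string s, int digit) {
    s.push_back(digit_to_char(digit));
    return s;
}
```

This only covers single digits; multi-digit numbers need a real conversion routine rather than character arithmetic.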

You are interpreting the integer 5 as a character. In ASCII encoding, 5 is the Enquiry (ENQ) control character, as you can look up in any ASCII table.
The character 5 on the other hand is represented by the decimal number 53.

As others said, you can't convert an integer to a string the way you are doing it.
IMHO, the best way to do it is using the C++11 method std::to_string.
Your example would translate to:
#include <iostream>
#include <string>
using namespace std;
int main()
{
string s;
int b=5;
s = to_string(b);
cout<<s<<endl;
}

The problem in your code is that you are converting the integer 5 to ASCII (=> ENQ ASCII code, which is not "printable").
To convert it to ASCII properly, you have to add the ASCII code of '0' (48), so:
char ascii = b + '0';
However, to convert an integer to std::string use:
std::stringstream ss; //from <sstream>
ss << 5;
std::string s = ss.str ();
I always use this helper function in my projects:
template <typename T>
std::string toString (T arg)
{
std::stringstream ss;
ss << arg;
return ss.str ();
}

Also, you can use stringstream; std::to_string doesn't work for me on GCC.

If we were designing C++ from scratch in 2016, maybe we would make this work. However, because it chose to be (mostly) backward compatible with a fairly low-level language like C, char is in fact just a number that string and printing algorithms interpret as a character, while most of the language doesn't treat it specially, including the cast. So with (char) you are only converting an int (typically a 32-bit signed number) to a char (typically an 8-bit number).
Then you interpret it as a character when you print it, since printing functions do treat it special. But the value it gets printed to is not '5'. The correspondence is conventional and completely arbitrary; the first numbers were reserved to special codes which are probably obsolete by now. As Hoffman pointed out, the bit value 5 is the code for Enquiry (whatever it means), while to print '5' the character has to contain the value 53. To print a proper space you'd need to enter 32. It has no meaning other than someone decided this was as good as anything, sometime decades ago, and the convention stuck.
If you need to know for other characters and values, what you need is an "ASCII table". Just google it, you'll find plenty.
You'll notice that numbers and letters of the same case are next to each other in the order you expect, so there is some logic to it at least. Beware, however, it's often not intuitive anyway: uppercase letters are before lowercase ones for instance, so 'A' < 'a'.
I guess you're starting to see why it's better to rely on dedicated system functions for strings!

C++ toupper Syntax

I've just been introduced to toupper, and I'm a little confused by the syntax; it seems like it's repeating itself. What I've been using it for is for every character of a string, it converts the character into an uppercase character if possible.
for (int i = 0; i < string.length(); i++)
{
if (isalpha(string[i]))
{
if (islower(string[i]))
{
string[i] = toupper(string[i]);
}
}
}
Why do you have to list string[i] twice? Shouldn't this work?
toupper(string[i]); (I tried it, so I know it doesn't.)

toupper is a function that takes its argument by value. It could have been defined to take a reference to character and modify it in-place, but that would have made it more awkward to write code that just examines the upper-case variant of a character, as in this example:
// compare chars case-insensitively without modifying anything
if (std::toupper(*s1++) == std::toupper(*s2++))
...
In other words, toupper(c) doesn't change c for the same reasons that sin(x) doesn't change x.
To avoid repeating expressions like string[i] on the left and right side of the assignment, take a reference to a character and use it to read and write to the string:
for (size_t i = 0; i < string.length(); i++) {
char& c = string[i]; // reference to character inside string
c = std::toupper(c);
}
Using range-based for, the above can be written more briefly (and executed more efficiently) as:
for (auto& c: string)
c = std::toupper(c);

As from the documentation, the character is passed by value.
Because of that, the answer is no, it shouldn't.
The prototype of toupper is:
int toupper( int ch );
As you can see, the character is passed by value, transformed and returned by value.
If you don't assign the returned value to a variable, it will be definitely lost.
That's why in your example it is reassigned so that to replace the original one.

As many of the other answers already say, the argument to std::toupper is passed and the result returned by-value which makes sense because otherwise, you wouldn't be able to call, say std::toupper('a'). You cannot modify the literal 'a' in-place. It is also likely that you have your input in a read-only buffer and want to store the uppercase-output in another buffer. So the by-value approach is much more flexible.
What is redundant, on the other hand, is your checking for isalpha and islower. If the character is not a lower-case alphabetic character, toupper will leave it alone anyway so the logic reduces to this.
#include <cctype>
#include <iostream>
int
main()
{
char text[] = "Please send me 400 $ worth of dark chocolate by Wednesday!";
for (auto s = text; *s != '\0'; ++s)
*s = std::toupper(*s);
std::cout << text << '\n';
}
You could further eliminate the raw loop by using an algorithm, if you find this prettier.
#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator>
int
main()
{
char text[] = "Please send me 400 $ worth of dark chocolate by Wednesday!";
std::transform(std::cbegin(text), std::cend(text), std::begin(text),
[](auto c){ return std::toupper(c); });
std::cout << text << '\n';
}

toupper takes an int by value and returns, as an int, the value of the corresponding uppercase character. Whenever a function doesn't take a pointer or reference as a parameter, the argument is passed by value: the parameter is a copy of the variable passed to the function, so changes made to it cannot be seen from outside. The way to catch the change is to save what the function returns, in this case the upper-cased character.
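A short sketch of the by-value behaviour described in the answers above: the result of toupper has to be stored somewhere, and here it is written back into a copy of the string. The helper name to_upper_copy is hypothetical; the lambda takes unsigned char so that toupper never receives a negative value.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns an upper-cased copy of `text` and leaves
// the caller's string untouched, mirroring how toupper itself works.
std::string to_upper_copy(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) {
                       // toupper returns an int; convert it back to char.
                       return static_cast<char>(std::toupper(c));
                   });
    return text;
}
```

Because the argument is taken by value, the original string is unchanged after the call, just as c is unchanged after toupper(c).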

Note that there is a nasty gotcha in isalpha(), which is the following: the function only works correctly for inputs in the range 0-255 + EOF.
So what, you think.
Well, if your char type happens to be signed, and you pass a value greater than 127, this is considered a negative value, and thus the int passed to isalpha will also be negative (and thus outside the range of 0-255 + EOF).
In Visual Studio, this will crash your application. I have complained about this to Microsoft, on the grounds that a character classification function that is not safe for all inputs is basically pointless, but received an answer stating that this was entirely standards conforming and I should just write better code. Ok, fair enough, but nowhere else in the standard does anyone care about whether char is signed or unsigned. Only in the isxxx functions does it serve as a landmine that could easily make it through testing without anyone noticing.
The following code crashes Visual Studio 2015 (and, as far as I know, all earlier versions):
int x = toupper ('é');
So not only is the isalpha() in your code redundant, it is in fact actively harmful, as it will cause any strings that contain characters with values greater than 127 to crash your application.
See http://en.cppreference.com/w/cpp/string/byte/isalpha: "The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF."
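One way to defuse this gotcha, sketched under the assumption that you control the call sites, is to route every call through thin wrappers that perform the unsigned char conversion. The names is_alpha_safe and to_upper_safe are hypothetical.

```cpp
#include <cctype>

// Hypothetical wrapper: classifies a plain char without ever passing a
// negative value to std::isalpha.
bool is_alpha_safe(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical wrapper: upper-cases a plain char; bytes the current
// C locale does not treat as lowercase letters come back unchanged.
char to_upper_safe(char c) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}
```

With these wrappers, a byte such as 0xE9 ('é' in Latin-1) is handed to the library as 233 rather than as a negative number, so the call stays within the documented range.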

Convert a single character to lowercase in C++ - tolower is returning an integer

I'm trying to convert a string to lowercase, and am treating it as a char* and iterating through each index. The problem is that the tolower function I read about online is not actually converting a char to lowercase: it's taking char as input and returning an integer.
cout << tolower('T') << endl;
prints 116 to the console when it should be printing t.
Is there a better way for me to convert a string to lowercase?
I've looked around online, and most sources say to "use tolower and iterate through the char array", which doesn't seem to be working for me.
So my two questions are:
What am I doing wrong with the tolower function that's making it return 116 instead of 't' when I call tolower('T')
Are there better ways to convert a string to lowercase in C++ other than using tolower on each individual character?

That's because there are two different tolower functions. The one that you're using is the C library function from <cctype>, which returns an int. That's why it's printing 116. That's the ASCII value of 't'. If you want to print a char, you can just cast it back to a char.
Alternatively, you could use the overload from <locale>, which actually returns the type you would expect it to return:
std::cout << std::tolower('T', std::locale()); // prints t
In response to your second question:
Are there better ways to convert a string to lowercase in C++ other than using tolower on each individual character?
Nope.

116 is indeed the correct value; this is simply a matter of how std::cout handles integers. Use char(tolower(c)) to achieve your desired result:
std::cout << char(tolower('T')); // print it like this

It's even weirder than that - it takes an int and returns an int. See http://en.cppreference.com/w/cpp/string/byte/tolower.
You need to ensure the value you pass it is representable as an unsigned char - no negative values allowed, even if char is signed.
So you might end up with something like this:
char c = static_cast<char>(tolower(static_cast<unsigned char>('T')));
Ugly isn't it? But in any case converting one character at a time is very limiting. Try converting 'ß' to upper case, for example.

tolower is declared in terms of int, so it returns an int. If you check <cctype> (<ctype.h> in C), you will see that the declaration is int tolower(int c). You can use a loop to go through the string and change every single char to lower case. For example:
while (str[i]) // go through the string until the terminating '\0'
{
c = str[i]; // c holds the current character of the string
putchar(tolower(static_cast<unsigned char>(c))); // print it in lower case
i++; // move on to the next character
}

The documentation of int tolower(int ch) mandates that ch must either be representable as an unsigned char or be equal to EOF (which is usually -1, but don't rely on that).
It's not uncommon for character manipulation functions that have been inherited from the C standard library to work in terms of ints. There are two reasons for this:
In the early days of C, all arguments were promoted to int (function prototypes did not exist).
For consistency these functions need to handle the EOF case, which for obvious reasons cannot be a value representable by a char, since that would mean we'd have to lose one of the legitimate encodings for a character.
http://en.cppreference.com/w/cpp/string/byte/tolower
The answer is to cast the result to a char before printing.
e.g.:
std::cout << static_cast<char>(std::tolower('A'));
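To make the difference visible, here is a small sketch that streams the raw int result and the result cast back to char into string streams. The function names are hypothetical, and the "116" outcome assumes an ASCII-compatible character set.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper: lower-cases one char and returns it as a char.
char to_lower_char(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Streams the int returned by std::tolower, so the number is printed.
std::string stream_as_int(char c) {
    std::ostringstream out;
    out << std::tolower(static_cast<unsigned char>(c));
    return out.str();
}

// Streams the value after casting back to char, so the letter is printed.
std::string stream_as_char(char c) {
    std::ostringstream out;
    out << to_lower_char(c);
    return out.str();
}
```

The stream picks its formatting from the static type of the operand, which is why the cast back to char is all that is needed.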

For ASCII letters, converting an uppercase character to lowercase only requires adding 32, because that is the difference between the ASCII codes of the lowercase and uppercase letters, e.g., 'a'-'A'=97-65=32.
char c = 'B';
c += 32; // c is now 'b'
printf("c=%c\n", c);
Another easy way would be to first map the uppercase character to an offset within the range of the English alphabet, 0-25 (i.e. 'A' maps to index 0 and 'Z' to index 25, inclusive), and then remap it to a lowercase character.
char c = 'B';
c = c - 'A' + 'a'; // c is now 'b'
printf("c=%c\n", c);
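Both variants above change any character they are given, so adding 32 unconditionally would turn '1' into 'Q' in ASCII. A guarded sketch, assuming an ASCII-compatible character set in which 'A'..'Z' are contiguous, might look like this (ascii_to_lower is a hypothetical name):

```cpp
// Hypothetical helper: lower-cases only the ASCII letters 'A'..'Z' and
// returns every other character unchanged.
char ascii_to_lower(char c) {
    if (c >= 'A' && c <= 'Z') {
        // Offset within the alphabet (0..25), remapped onto 'a'..'z'.
        return static_cast<char>(c - 'A' + 'a');
    }
    return c;
}
```

Unlike tolower, this ignores the locale entirely, which may or may not be what you want.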

C++ Strip non-ASCII Characters from string

Before you get started; yes I know this is a duplicate question and yes I have looked at the posted solutions. My problem is I could not get them to work.
bool invalidChar (char c)
{
return !isprint((unsigned)c);
}
void stripUnicode(string & str)
{
str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());
}
I tested this method on "Prusæus, Ægyptians," and it did nothing
I also attempted to substitute isprint for isalnum
The real problem occurs when, in another section of my program I convert string->wstring->string. the conversion balks if there are unicode chars in the string->wstring conversion.
Ref:
How can you strip non-ASCII characters from a string? (in C#)
How to strip all non alphanumeric characters from a string in c++?
Edit:
I still would like to remove all non-ASCII chars regardless yet if it helps, here is where I am crashing:
// Convert to wstring
wchar_t* UnicodeTextBuffer = new wchar_t[ANSIWord.length()+1];
wmemset(UnicodeTextBuffer, 0, ANSIWord.length()+1);
mbstowcs(UnicodeTextBuffer, ANSIWord.c_str(), ANSIWord.length());
wWord = UnicodeTextBuffer; //CRASH
Error Dialog
MSVC++ Debug Library
Debug Assertion Failed!
Program: //myproject
File: f:\dd\vctools\crt_bld\self_x86\crt\src\isctype.c
Line: //Above
Expression:(unsigned)(c+1)<=256
Edit:
Further compounding the matter: the .txt file I am reading in from is ANSI encoded. Everything within should be valid.
Solution:
bool invalidChar (char c)
{
return !(c>=0 && c <128);
}
void stripUnicode(string & str)
{
str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());
}
If someone else would like to copy/paste this, I can check this question off.
EDIT:
For future reference: try the MSVC-specific __isascii and iswascii functions.

At least one problem is in your invalidChar function. It should be:
return !isprint( static_cast<unsigned char>( c ) );
Casting a char to unsigned is likely to give some very, very big values if the char is negative (UINT_MAX+1 + c). Passing such a value to isprint is undefined behavior.

Another solution that doesn't require defining two functions but uses a lambda (anonymous function), available since C++11:
void stripUnicode(string & str)
{
str.erase(remove_if(str.begin(),str.end(), [](char c){return !(c>=0 && c <128);}), str.end());
}
I think it looks cleaner

isprint depends on the locale, so the character in question must be printable in the current locale.
If you want strictly ASCII, check the range for [0..127]. If you want printable ASCII, check the range and isprint.
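The two options above can be sketched side by side. The function names are hypothetical; both work on a copy of the string and compare through unsigned char, so the result does not depend on whether char is signed.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// True for bytes in the 7-bit ASCII range [0, 127].
bool is_ascii(char c) {
    return static_cast<unsigned char>(c) < 128;
}

// Hypothetical helper: drops every byte outside the ASCII range.
std::string strip_non_ascii(std::string text) {
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](char c) { return !is_ascii(c); }),
               text.end());
    return text;
}

// Hypothetical helper: keeps only bytes that are ASCII and printable.
std::string keep_printable_ascii(std::string text) {
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](char c) {
                                  return !is_ascii(c) ||
                                         !std::isprint(
                                             static_cast<unsigned char>(c));
                              }),
               text.end());
    return text;
}
```

Keep in mind that with a multi-byte encoding such as UTF-8, this removes every byte of a non-ASCII character, so "Prusæus" becomes "Prusus" rather than being transliterated.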
