How to compare a char from a string with an int in C++?

string str='中test'
first_char = str[0]
How can I compare first_char with an int 128? I want to test whether the first char is an ascii or not.
Something like this:
if char(first_char) < 128:
return true

In C++ (and C), the signedness of a char is implementation-defined. Hence, a simple less-than operator will not suffice. You need some bitwise action:
bool is_ascii( char c )
{
return !(c & 0x80);
}
As soon as you begin messing with UTF-8 text (or any other non-ASCII text) the usual assumptions about what a character is go out the window. You should use a library, such as ICU, to help you. (Many modern operating systems ship with ICU already, so this is often not a difficult requirement.)
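As a rough illustration of the bitwise check applied to the question's use case, here is a minimal sketch. It assumes the string holds UTF-8 (or another ASCII-compatible encoding) in a std::string; first_byte_is_ascii is a hypothetical helper name, not something from the question.

```cpp
#include <string>

// True if the byte is in the 7-bit ASCII range (high bit clear).
// Gives the same answer whether plain char is signed or unsigned.
bool is_ascii(char c)
{
    return !(c & 0x80);
}

// Hypothetical helper: looks only at the first byte of the string.
// An empty string has no first byte, so it is reported as non-ASCII here.
bool first_byte_is_ascii(const std::string& s)
{
    return !s.empty() && is_ascii(s[0]);
}
```

Note that this inspects a single byte. In UTF-8, a character such as '中' occupies several bytes, all of which have the high bit set, so the first byte is enough to tell that the first character is not ASCII.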

Related

isdigit() function pass a Chinese parameter

When I try using the isdigit() function with a Chinese character, it reports an assert in Visual Studio 2013 in Debug mode, but there is no problem in Release mode.
If the purpose of this function is to determine whether the parameter is a digit, why does it not simply return 0 for a Chinese character?
This is my code:
string testString = "abcdefg12345中文";
int count = 0;
for (const auto &c : testString) {
if (isdigit(c)) {
++count;
}
}
and this is the assert :
You broke the contract of isdigit(int), which expects only ASCII characters in the range stated.
The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF.
Your standard library implementation is being kind and asserting, rather than going on to blow stuff up.
There is an alternative, locale-aware isdigit(charT ch, const locale&) that you may be able to use here.
I suggest performing some further research on how "characters" work in computers, particularly with regard to encoding more "exotic" [1] character sets.
[1] From the perspective of computer history. Of course, to you, it is the less exotic alternative!
The isdigit() and related functions / macros in <ctype.h> expect an int converted from an unsigned char, or EOF, which on most systems means a value in the range 0-255 (or -1 for EOF). So any value not in the range -1…255 is incorrect.
Problem 1: You are passing in a char, which on your system has range -128…+127. Solution to this problem is simple:
if (isdigit(static_cast<unsigned char>(c)))
This won't crash; however, it's not quite correct for Chinese characters.
Problem 2: Non-ASCII characters should probably use iswdigit() instead. This will correctly handle Chinese characters:
wstring testString = L"abcdefg12345中文";
int count = 0;
for (const auto &c : testString) {
if (iswdigit(c)) {
++count;
}
}
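For the narrow-string case, the cast shown under Problem 1 can be folded into a small counting function. This is only a sketch: count_digits is a hypothetical name, and it assumes that the goal is to count ASCII decimal digits in a byte string while safely skipping the bytes of multi-byte characters.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Counts the bytes of s that std::isdigit classifies as decimal digits.
// Each char is converted to unsigned char first (via the lambda parameter),
// so bytes that are negative as plain char never reach std::isdigit as
// negative ints.
std::size_t count_digits(const std::string& s)
{
    return static_cast<std::size_t>(std::count_if(s.begin(), s.end(),
        [](unsigned char c) { return std::isdigit(c) != 0; }));
}
```

With the string from the question, the five characters '1' to '5' are counted and the bytes that make up the Chinese characters are ignored.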

Convert a single character to lowercase in C++ - tolower is returning an integer

I'm trying to convert a string to lowercase, and am treating it as a char* and iterating through each index. The problem is that the tolower function I read about online is not actually converting a char to lowercase: it's taking char as input and returning an integer.
cout << tolower('T') << endl;
prints 116 to the console when it should be printing t.
Is there a better way for me to convert a string to lowercase?
I've looked around online, and most sources say to "use tolower and iterate through the char array", which doesn't seem to be working for me.
So my two questions are:
What am I doing wrong with the tolower function that's making it return 116 instead of 't' when I call tolower('T')
Are there better ways to convert a string to lowercase in C++ other than using tolower on each individual character?
That's because there are two different tolower functions. The one that you're using is this one, which returns an int. That's why it's printing 116. That's the ASCII value of 't'. If you want to print a char, you can just cast it back to a char.
Alternatively, you could use this one, which actually returns the type you would expect it to return:
std::cout << std::tolower('T', std::locale()); // prints t
In response to your second question:
Are there better ways to convert a string to lowercase in C++ other than using tolower on each individual character?
Nope.
116 is indeed the correct value; this is simply an issue of how std::cout handles integers. Use char(tolower(c)) to achieve your desired results:
std::cout << char(tolower('T')); // print it like this
It's even weirder than that - it takes an int and returns an int. See http://en.cppreference.com/w/cpp/string/byte/tolower.
You need to ensure the value you pass it is representable as an unsigned char - no negative values allowed, even if char is signed.
So you might end up with something like this:
char c = static_cast<char>(tolower(static_cast<unsigned char>('T')));
Ugly isn't it? But in any case converting one character at a time is very limiting. Try converting 'ß' to upper case, for example.
tolower is declared to return int. If you check <cctype> (or <ctype.h>) you will see that its declaration is int tolower ( int c ); You can use a loop to go through the string and change every single char to lower case. For example:
int i = 0;
while (str[i]) // going through the string
{
unsigned char c = str[i]; // value of the current char, as unsigned char so tolower gets a valid argument
putchar (tolower(c)); // printing the lowercase version
i++; // moving on to the next char
}
The documentation of int tolower(int ch) mandates that ch must either be representable as an unsigned char or must be equal to EOF (which is usually -1, but don't rely on that).
It's not uncommon for character manipulation functions that have been inherited from the C standard library to work in terms of ints. There are two reasons for this:
In the early days of C, all arguments were promoted to int (function prototypes did not exist).
For consistency these functions need to handle the EOF case, which for obvious reasons cannot be a value representable by a char, since that would mean we'd have to lose one of the legitimate encodings for a character.
http://en.cppreference.com/w/cpp/string/byte/tolower
The answer is to cast the result to a char before printing.
e.g.:
std::cout << static_cast<char>(std::tolower('A'));
Generally speaking, to convert an uppercase ASCII letter to lowercase you only need to add 32, as this number is the difference between the ASCII codes of the lowercase and uppercase letters, e.g., 'a'-'A'=97-65=32. This only holds for the letters 'A' to 'Z' in an ASCII-compatible encoding.
char c = 'B';
c += 32; // c is now 'b'
printf("c=%c\n", c);
Another easy way would be to first map the uppercase character to an offset within the English alphabet, 0-25 (i.e. 'A' is index 0 and 'Z' is index 25, inclusive), and then remap that offset to a lowercase character.
char c = 'B';
c = c - 'A' + 'a'; // c is now 'b'
printf("c=%c\n", c);
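The per-character approach discussed in the answers above can be applied to a whole std::string with std::transform. The following is a minimal sketch, not the only way to do it; to_lower_copy is a hypothetical helper name, and the conversion only affects single-byte characters that the current C locale knows about.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns a lowercase copy of s, converting one char at a time.
// The lambda takes unsigned char so std::tolower never sees a negative
// value, and casts the int result back to char.
std::string to_lower_copy(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}
```

Because the result is cast back to char before it is stored, printing the returned string shows letters rather than integer codes.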

Check if character is upper case in older c-style

I'm working on a lab that requires password authentication as both an older c-string and a string class. I have the string class version working. I've gotten the password entered as an array using cin.getline(password, 20)
strlen(password) also works correctly.
I've been searching for how to determine if the older c-string version contains an uppercase letter in any of its values. Everything is saying to use isupper, which is from the newer string class (as far as I can tell).
Is there a way to do this? I'm considering just verifying using the string class version then inputting it into the char array.
There is a function called isupper in the C standard library, which takes a single character as an argument. (It doesn't matter where the character comes from, a C string or somewhere else.) This is probably what you are meant to use.
There is an isupper() function in the C standard library, as well - in <cctype> (or <ctype.h>)
It takes a single character (passed as an int), so you would need to iterate over the character array and call it for each character.
There's some good information about it here.
If you can assume that your system uses ASCII (the C standard itself does not require it), you could create your own function:
bool upper(char chr)
{
return chr >= 'A' && chr <= 'Z'; // same as return chr >= 65 && chr <= 90
}
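Putting the answers above together, a check over a whole C-string might look like the sketch below. The name has_upper is hypothetical, and the cast to unsigned char is there so that isupper is never called with a negative value.

```cpp
#include <cctype>

// Returns true if the NUL-terminated string contains at least one
// character that std::isupper classifies as uppercase.
bool has_upper(const char* s)
{
    for (; *s != '\0'; ++s) {
        if (std::isupper(static_cast<unsigned char>(*s))) {
            return true;
        }
    }
    return false;
}
```

This works directly on the char array filled by cin.getline, so no conversion to the string class is needed.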

C++ Strip non-ASCII Characters from string

Before you get started; yes I know this is a duplicate question and yes I have looked at the posted solutions. My problem is I could not get them to work.
bool invalidChar (char c)
{
return !isprint((unsigned)c);
}
void stripUnicode(string & str)
{
str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());
}
I tested this method on "Prusæus, Ægyptians," and it did nothing
I also attempted to substitute isprint for isalnum
The real problem occurs when, in another section of my program I convert string->wstring->string. the conversion balks if there are unicode chars in the string->wstring conversion.
Ref:
How can you strip non-ASCII characters from a string? (in C#)
How to strip all non alphanumeric characters from a string in c++?
Edit:
I still would like to remove all non-ASCII chars regardless yet if it helps, here is where I am crashing:
// Convert to wstring
wchar_t* UnicodeTextBuffer = new wchar_t[ANSIWord.length()+1];
wmemset(UnicodeTextBuffer, 0, ANSIWord.length()+1);
mbstowcs(UnicodeTextBuffer, ANSIWord.c_str(), ANSIWord.length());
wWord = UnicodeTextBuffer; //CRASH
Error Dialog
MSVC++ Debug Library
Debug Assertion Failed!
Program: //myproject
File: f:\dd\vctools\crt_bld\self_x86\crt\src\isctype.c
Line: //Above
Expression:(unsigned)(c+1)<=256
Edit:
Further compounding the matter: the .txt file I am reading in from is ANSI encoded. Everything within should be valid.
Solution:
bool invalidChar (char c)
{
return !(c>=0 && c <128);
}
void stripUnicode(string & str)
{
str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());
}
If someone else would like to copy/paste this, I can check this question off.
EDIT:
For future reference: try using the __isascii, iswascii commands
At least one problem is in your invalidChar function. It should be:
return !isprint( static_cast<unsigned char>( c ) );
Casting a char to an unsigned is likely to give some very, very big
values if the char is negative (UINT_MAX+1 + c). Passing such a
value to isprint is undefined behavior.
Another solution that doesn't require defining two functions but uses anonymous functions (lambdas), available in C++11 and later:
void stripUnicode(string & str)
{
str.erase(remove_if(str.begin(),str.end(), [](char c){return !(c>=0 && c <128);}), str.end());
}
I think it looks cleaner
isprint depends on the locale, so the character in question must be printable in the current locale.
If you want strictly ASCII, check the range for [0..127]. If you want printable ASCII, check the range and isprint.
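The "range and isprint" variant could be sketched as follows. The function names are hypothetical, and the sketch assumes that control characters such as tab and newline should be removed along with the non-ASCII bytes.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// True for bytes that are NOT printable 7-bit ASCII.
// The range check comes first, so isprint is only consulted for 0..127.
bool not_printable_ascii(char c)
{
    const unsigned char uc = static_cast<unsigned char>(c);
    return uc > 127 || !std::isprint(uc);
}

// Removes every byte that is not printable ASCII from str.
void strip_non_printable_ascii(std::string& str)
{
    str.erase(std::remove_if(str.begin(), str.end(), not_printable_ascii),
              str.end());
}
```

Dropping the isprint call from not_printable_ascii gives the strictly-ASCII version, which keeps control characters.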

HEX assignment in C

I have generated a long sequence of bytes which looks as follows:
0x401DA1815EB560399FE365DA23AAC0757F1D61EC10839D9B5521F.....
Now, I would like to assign it to a static unsigned char x[].
Obviously, I get a "hex escape sequence out of range" warning when I do this:
static unsigned char x[] = "\x401DA1815EB56039.....";
The format it needs is
static unsigned char x[] = "\x40\x1D\xA1\x81\x5E\xB5\x60\x39.....";
So I am wondering if there is a way in C to make this assignment without adding the
hex escape sequence before each byte by hand (that could take quite a while).
I don't think there's a way to make a literal out of it.
You can parse the string at runtime and store it in another array.
You can use sed or something to rewrite the sequence:
echo 401DA1815EB560399FE365DA23AAC0757F1D61EC10839D9B5521F | sed -e 's/../\\x&/g'
\x40\x1D\xA1\x81\x5E\xB5\x60\x39\x9F\xE3\x65\xDA\x23\xAA\xC0\x75\x7F\x1D\x61\xEC\x10\x83\x9D\x9B\x55\x21F
AFAIK, No.
But you can use the regex s/(..)/\\x$1/g to convert your sequence to the last format.
No there is no way to do that in C or C++. The obvious solution is to write a program to insert the '\x' sequences at the correct point in the string. This would be a suitable task for a scripting language like perl, but you can also easily do it in C or C++.
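Such a program can be very small. Below is a sketch in C++ of the same transformation as the regex above; escape_hex_pairs is a hypothetical name, and the input is assumed to be the hex digits without the leading 0x.

```cpp
#include <cstddef>
#include <string>

// Inserts "\x" in front of every pair of characters in a string of hex
// digits, producing text that can be pasted into a string literal.
// An unpaired trailing digit is copied as-is after the last pair.
std::string escape_hex_pairs(const std::string& digits)
{
    std::string out;
    std::size_t i = 0;
    for (; i + 1 < digits.size(); i += 2) {
        out += "\\x";
        out += digits[i];
        out += digits[i + 1];
    }
    if (i < digits.size()) {
        out += digits[i]; // odd digit count: leftover digit, not escaped
    }
    return out;
}
```

A leftover digit at the end usually means the input is missing a digit somewhere, so it is worth checking the result before pasting it into the source file.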
If the sequence is fixed, I suggest following the regexp-in-editor suggestion.
If the sequence changes dynamically, you can relatively easily convert it at runtime.
char in[]="0x401DA1815EB560399FE365DA23AAC0757F1D61EC10839D9B5521F..."; //or whatever, loaded from a file or such.
char out[MAX_LEN]; //or malloc() as l/2 or whatever...
int l = strlen(in);
int i; //declared outside the loop because it is used after it
for(i=2;i<l;i+=2) //start at 2 to skip the "0x" prefix
{
out[i/2-1]=16*AsciiAsHex(in[i])+AsciiAsHex(in[i+1]); //high nibble, then low nibble
}
out[i/2-1]='\0';
...
int AsciiAsHex(char in)
{
if(in>='0' && in<='9') return in-'0';
if(in>='A' && in<='F') return in+10-'A';
if(in>='a' && in<='f') return in+10-'a';
return 0;
}
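For C++ code, a variant of the same runtime conversion that avoids the fixed-size buffer might look like this. It is a sketch under a few assumptions that go beyond the answer above: the output is a std::vector, invalid input yields an empty result instead of zero bytes, and the names hex_digit_value and hex_to_bytes are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Value of one hex digit, or -1 if the character is not a hex digit.
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Converts a hex string (optionally prefixed with "0x") to bytes.
// Returns an empty vector if the number of digits is odd or if a
// non-hex character appears.
std::vector<unsigned char> hex_to_bytes(const std::string& hex)
{
    const std::size_t start = (hex.compare(0, 2, "0x") == 0) ? 2 : 0;
    std::vector<unsigned char> out;
    if ((hex.size() - start) % 2 != 0) {
        return out;
    }
    for (std::size_t i = start; i < hex.size(); i += 2) {
        const int hi = hex_digit_value(hex[i]);
        const int lo = hex_digit_value(hex[i + 1]);
        if (hi < 0 || lo < 0) {
            return std::vector<unsigned char>();
        }
        out.push_back(static_cast<unsigned char>(hi * 16 + lo));
    }
    return out;
}
```

Since the result carries its own size, no terminating '\0' is needed, which matters because 0x00 can be a legitimate byte in the data.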