I have to write a program that asks the user to input characters, stores them in a string, reads the last character of the string, and determines whether it is a digit, a letter, or neither.
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string str;
    cout << "Enter a line of string: ";
    getline(cin, str);
    int i = str.length() - 1;
    cout << str[i];
    cout << endl;
    if ((str[i] >= 'a' && str[i] <= 'z') || (str[i] >= 'A' && str[i] <= 'Z')) cout << "it's a letter";
    else if (str[i] >= 0) cout << "it's a number";
    else cout << "it is something else";
    return 0;
}
I came up with this code. It recognizes both letters and digits, but I have two problems with it:
1) If the last character of the string is a symbol, for example '*', the program says it's a number, although it's not. How can I correct this?
2) If I add a condition to the following statement in order to recognize integers from 0 to 9, the code fails:
else if ((str[i] >=0) && (str[i] <=9)) cout << "it's a number";
It works only when stated as (str[i] >= 0). Why is that?
Because digits in a string are not integer values but characters. Think about what 'a' is: the computer operates only with 1 and 0 values, so there is no "a" symbol in the hardware. The letter "a" must be encoded as some bit pattern, for example 0x61 (binary 01100001) is 'a' in ASCII and UTF-8. In the same way the digits are encoded as particular bit patterns: 0x30 to 0x39 in ASCII and UTF-8 for the digits 0 to 9.
So your test has to be if ((str[i] >= '0') && ... (notice the digit zero is in apostrophes, telling the compiler to use the character encoding; it compiles to 0x30, binary 00110000).
You could also write if ((str[i] >= 48) && ..., and it would compile to identical machine code, because to the compiler there is no difference between 48, 0x30, and '0'. But for humans reading your source later and trying to understand what you were doing, '0' is probably the best way to express your intent.
Also notice that the encoding in use (as defined by your compiler and source) may assign different values to a particular symbol, which can make string processing quite tricky. For example, in UTF-8 the common English characters are encoded identically to 7-bit ASCII (a trivial old encoding you should start by learning a bit about), and each such letter fits into a single byte (8 bits). But accented and other extra characters are encoded as a series of two to four bytes, so even a trivial task like "count how many characters are in this string" becomes complex, requiring rich tables of Unicode characters and their category/function in order to count characters the way a human would. A simple Arabic word of 4 letters may require, say, 12 bytes of memory, so a plain strlen will return 12. If you need these things for Unicode text, you are better off using a library than implementing it yourself.
Your code will work well only for simple ASCII-like strings, which is enough for practicing things like this. Just remember that everything in the computer is encoded as 1s and 0s: sounds, characters, pixel/vector graphics, everything. At some point, under many layers of abstraction and utilities, the information is a series of bit values and nothing else. So when you are processing some data, it can help to learn how that data is actually encoded in the hardware, and how much of that is abstracted away by your programming language, library, and OS API.
You can also check in the debugger's memory view which values are used to encode your string, and see how it looks in terms of byte values.
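Alternatively, here is a minimal sketch (assuming an ASCII-compatible encoding such as UTF-8) that prints each character of a string next to its numeric value, another way to see the encoding without a debugger:

#include <iostream>
#include <string>

int main()
{
    std::string str = "a0*";
    for (unsigned char c : str)
        std::cout << c << " = " << static_cast<int>(c) << '\n';
    // on an ASCII/UTF-8 system this prints: a = 97, 0 = 48, * = 42
    return 0;
}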
Your condition in the 'else if' doesn't satisfy your requirement.
Try something like:
if ((str[i] >= 'a' && str[i] <= 'z') || (str[i] >= 'A' && str[i] <= 'Z'))
    cout << "it's a letter";
else if (str[i] >= '0' && str[i] <= '9')
    cout << "it's a number";
else
    cout << "it is something else";
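For reference, here is a complete version of the original program with that corrected condition (a sketch; it assumes the entered line is not empty):

#include <iostream>
#include <string>
using namespace std;

int main()
{
    string str;
    cout << "Enter a line of string: ";
    getline(cin, str);
    char last = str[str.length() - 1];   // assumes str is not empty
    cout << last << endl;
    if ((last >= 'a' && last <= 'z') || (last >= 'A' && last <= 'Z'))
        cout << "it's a letter";
    else if (last >= '0' && last <= '9')
        cout << "it's a number";
    else
        cout << "it is something else";
    return 0;
}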
Related
I have to write a program in C++ that identifies and counts English and non-English characters in a string.
The user writes an input and the program must count the user's letters and report when it finds non-English letters.
My problem is that I get a question mark instead of the non-English letter!
At the beginning of the code I wrote:
...
#include <clocale>
int main() {
    std::setlocale(LC_ALL, "sv_SE.UTF-8");
...
(the locale is Swedish)
If I try to print Swedish letters before the counting loop (as a test), it does work, so I guess the locale setup is fine.
But when I launch the counting loop below,
for (unsigned char c : rad) {
    if (c < 128) {
        if (isalpha(c) != 0)
            bokstaver++;
    }
    if (c >= 134 && c <= 165) {
        cout << "Your text contains a " << c << '\n';
        bokstaver++;
    }
}
my non-English letter is taken into account but not printed out by cout.
I used unsigned char since non-English letters are between ASCII 134 and 165, so I really don't know what to do.
I tested with the word blå.
non-English letters are between ASCII 134 and 165
No, they aren't. In UTF-8, non-English characters do not sit between any ASCII values. Non-ASCII characters consist of two or more code units, none of which fall in the ASCII range. å, for example, consists of 0xC3 followed by 0xA5.
The C and C++ library functions that accept a single char (such as std::isalpha) are not useful with UTF-8, because a single char can only hold a single code unit.
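To see this concretely, here is a small sketch that dumps the code units of a UTF-8 string; the literal is spelled with explicit escapes so the byte values don't depend on the source file's encoding:

#include <cstdio>
#include <string>

int main()
{
    std::string rad = "bl\xC3\xA5";   // "blå" written as explicit UTF-8 bytes
    for (unsigned char c : rad)
        std::printf("%02X ", static_cast<unsigned>(c));
    std::printf("\n");   // prints: 62 6C C3 A5 -- 'å' occupies two code units
    return 0;
}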
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string s;
    cout << "enter the string :" << endl;
    cin >> s;
    for (string::size_type i = 0; i < s.length(); i++)
        s[i] ^= 32;
    cout << "modified string is : " << s << endl;
    return 0;
}
I saw this code, which converts uppercase to lowercase, on Stack Overflow.
But I don't understand the line s[i] ^= 32.
How does it work?
^= is the exclusive-or assignment operator. 32 is 100000 in binary, so ^= 32 toggles bit 5 (counting from the right, starting at 0) of the destination. In ASCII, lower- and uppercase letters are 32 positions apart, so this converts lowercase to uppercase and the other way around.
But it only works for ASCII (not for Unicode, for example) and only for letters. To write portable C++, you should not assume the character encoding to be ASCII, so please don't use such code. πάντα ῥεῖ's answer shows a way to do it properly.
How does it work?
Let's look at the ASCII value of 'A'. 'A' is binary 01000001 and 32 is binary 00100000, so XOR toggles exactly the bit that distinguishes upper- from lowercase:

  01000001   ('A')
^ 00100000   (32)
= 01100001   ('a' in ASCII)
Any sane and portable C or C++ application should use tolower():

#include <cctype>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string s;
    cout << "enter the string :" << endl;
    cin >> s;
    for (string::size_type i = 0; i < s.length(); i++)
        s[i] = tolower((unsigned char)s[i]);
        //             ^^^^^^^^^^^^^^^^^^^ the cast avoids undefined behavior
        //             when plain char is signed and holds a negative value
    cout << "modified string is : " << s << endl;
    return 0;
}
The s[i] = s[i] ^ 32 (cargo-cult) magic relies on the ASCII table's specific mapping of characters to numeric values. There are other character code tables, e.g. EBCDIC, where the s[i] ^= 32 method fails miserably to produce the corresponding lowercase letters.
There's a more sophisticated C++ way of converting to lowercase shown on the reference documentation page for std::ctype::tolower().
In C++, like its predecessor C, a char is a numeric type. This is after all how characters are represented on the hardware and these languages don't hide that from you.
In ASCII, letters have the useful property that the difference between an uppercase and a lowercase letter is a single binary bit: the 5th bit (if we start numbering from the right starting at 0).
Uppercase A is represented by the byte 0b01000001 (0x41 in hex), and lowercase a is represented by the byte 0b01100001 (0x61 in hex). Notice that the only difference between uppercase and lowercase A is the fifth bit. This pattern continues from B to Z.
So, when you do ^= 32 (which, incidentally, is 2 to the 5th power) on a number that represents an ASCII character, what that does is toggle the 5th bit - if it is 0, it becomes 1, and vice versa, which changes the character from upper to lower case and vice versa.
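A quick demonstration of the toggle (ASCII assumed; the trick is only meaningful for the letters A-Z and a-z):

#include <iostream>

int main()
{
    std::cout << static_cast<char>('A' ^ 32) << '\n';   // prints 'a' (bit 5 turned on)
    std::cout << static_cast<char>('z' ^ 32) << '\n';   // prints 'Z' (bit 5 turned off)
    return 0;
}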
I have a char a[] of hexadecimal characters like this:
"315c4eeaa8b5f8aaf9174145bf43e1784b8fa00dc71d885a804e5ee9fa40b16349c146fb778cdf2d3aff021dfff5b403b510d0d0455468aeb98622b137dae857553ccd8883a7bc37520e06e515d22c954eba5025b8cc57ee59418ce7dc6bc41556bdb36bbca3e8774301fbcaa3b83b220809560987815f65286764703de0f3d524400a19b159610b11ef3e"
I want to convert it to the letters corresponding to each pair of hexadecimal digits, like this:
68656c6c6f = hello
and store the result in char b[], and then do the reverse.
Please don't give me a block of code; I want an explanation of which libraries to use and how to use them.
Thanks
Assuming you are talking about ASCII codes: the first step is to find the size of b. If every character is represented by exactly 2 hexadecimal digits (for example, a tab would be 09), then the size of b is simply strlen(a) / 2 + 1.
That done, you go through the letters of a, two by two, convert each pair to its integer value, and store it in the output string. Written as a formula:
b[i] = (to_digit(a[2*i]) << 4) + to_digit(a[2*i+1])
where to_digit(x) converts '0'-'9' to 0-9 and 'a'-'f' or 'A'-'F' to 10-15.
Note that if bytes below 0x10 are written with only one hex digit (the only one I can think of is tab, written 9 instead of 09), then instead of using 2*i as the index into a, you should keep a next_index in your loop, advancing it by 2 if a[next_index] < '8' and by 1 otherwise. In the latter case, b[i] = to_digit(a[next_index]).
The reverse of this operation is very similar. Each character b[i] is written as:
a[2*i] = to_char(b[i] >> 4)
a[2*i+1] = to_char(b[i] & 0xf)
where to_char is the opposite of to_digit.
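A sketch of the helpers described above; to_digit and to_char are the names used in this answer, not standard functions, and the fixed two-digits-per-byte format is assumed:

#include <cstddef>
#include <cstring>

int to_digit(char x)   // '0'-'9' -> 0-9, 'a'-'f'/'A'-'F' -> 10-15
{
    if (x >= '0' && x <= '9') return x - '0';
    if (x >= 'a' && x <= 'f') return x - 'a' + 10;
    return x - 'A' + 10;   // assumes valid input
}

char to_char(int d)    // 0-15 -> '0'-'9', 'a'-'f'
{
    return d < 10 ? '0' + d : 'a' + (d - 10);
}

void dehex(const char *a, char *b)   // b must have room for strlen(a)/2 + 1 chars
{
    std::size_t n = std::strlen(a) / 2;
    for (std::size_t i = 0; i < n; ++i)
        b[i] = (to_digit(a[2 * i]) << 4) + to_digit(a[2 * i + 1]);
    b[n] = '\0';
}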
Converting the hexadecimal string to a character string can be done by using std::substr to get the next two characters of the hex string, then using std::stoi to convert the substring to an integer. That value can be cast to a character and appended to a std::string. The std::stoi function is C++11 only; if you don't have it, you can use e.g. std::strtol.
To do the opposite, loop over each character in the input string, cast it to an integer, and write it to a std::ostringstream preceded by manipulators that present it as a two-digit, zero-prefixed hexadecimal number; append the result to the output string.
Use std::string::c_str to get an old-style C char pointer if needed.
No external library, only using the C++ standard library.
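A sketch of that approach; the helper names hex_to_text and text_to_hex are mine, for illustration only:

#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

std::string hex_to_text(const std::string &hex)
{
    std::string out;
    for (std::size_t i = 0; i + 1 < hex.size(); i += 2)
        out += static_cast<char>(std::stoi(hex.substr(i, 2), nullptr, 16));
    return out;
}

std::string text_to_hex(const std::string &text)
{
    std::ostringstream os;
    for (unsigned char c : text)
        os << std::setw(2) << std::setfill('0') << std::hex
           << static_cast<int>(c);
    return os.str();
}

Note that std::stoi throws std::invalid_argument if a pair isn't valid hex, so real code would validate the input first.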
Forward:
Read two hex chars from input.
Convert to int (0..255). (hint: sscanf is one way)
Append int to output char array
Repeat 1-3 until out of chars.
Null terminate the array
Reverse:
Read single char from array
Convert to 2 hexadecimal chars (hint: sprintf is one way).
Append the buffer from step 2 to the final output string buffer.
Repeat 1-3 until out of chars.
Almost forgot to mention: only stdio.h and the regular C runtime are required, assuming you're using sscanf and sprintf. You could alternatively create a pair of conversion tables that would radically speed up the conversions.
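Following those steps with sscanf and sprintf might look like this (a sketch with error handling omitted):

#include <cstdio>
#include <cstring>

int main()
{
    const char a[] = "68656c6c6f";
    char b[sizeof a / 2 + 1];   // strlen(a)/2 bytes plus the terminator

    // forward: read two hex chars, convert to a byte, append to b
    std::size_t n = std::strlen(a) / 2;
    for (std::size_t i = 0; i < n; ++i) {
        unsigned int v;
        std::sscanf(a + 2 * i, "%2x", &v);   // width 2 reads at most two digits
        b[i] = static_cast<char>(v);
    }
    b[n] = '\0';
    std::printf("%s\n", b);     // hello

    // reverse: convert each byte back to two hex digits
    char hex[sizeof a];
    for (std::size_t i = 0; i < n; ++i) {
        unsigned int v = static_cast<unsigned char>(b[i]);
        std::sprintf(hex + 2 * i, "%02x", v);
    }
    std::printf("%s\n", hex);   // 68656c6c6f
    return 0;
}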
Here's a simple piece of code to do the trick:
#include <cstddef>
#include <string>

int hex_digit_value(char c)
{
    if ('0' <= c && c <= '9') { return c - '0'; }
    if ('a' <= c && c <= 'f') { return c + 10 - 'a'; }
    if ('A' <= c && c <= 'F') { return c + 10 - 'A'; }
    return -1;
}

std::string dehexify(std::string const & s)
{
    std::string result(s.size() / 2, '\0');
    for (std::size_t i = 0; i != s.size() / 2; ++i)
    {
        result[i] = hex_digit_value(s[2 * i]) * 16
                  + hex_digit_value(s[2 * i + 1]);
    }
    return result;
}
Usage:
char const a[] = "12AB";
std::string s = dehexify(a);
Notes:
A proper implementation would add checks that the input string length is even and that each digit is in fact a valid hex numeral.
Dehexifying has nothing to do with ASCII. It just turns any hexified sequence of nibbles into a sequence of bytes. I just use std::string as a convenient "container of bytes", which is exactly what it is.
There are dozens of answers on SO showing you how to go the other way; just search for "hexify".
Each hexadecimal digit corresponds to 4 bits, because 4 bits has 16 possible bit patterns (and there are 16 possible hex digits, each standing for a unique 4-bit pattern).
So, two hexadecimal digits correspond to 8 bits.
And on most computers nowadays (some Texas Instruments digital signal processors are an exception) a C++ char is 8 bits.
This means that each C++ char is represented by 2 hex digits.
So, simply read two hex digits at a time, convert to int using e.g. an istringstream, convert that to char, and append each char value to a std::string.
The other direction is just opposite, but with a twist.
Because char is signed on most systems, you need to convert to unsigned char before converting that value again to hex digits.
Conversion to and from hexadecimal representation can be done with the hex manipulator, e.g.
cout << hex << x;
cin >> hex >> x;
for a suitable definition of x, e.g. int x. This works for string streams as well.
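For example, a minimal sketch using string streams:

#include <iostream>
#include <sstream>

int main()
{
    int x = 0;
    std::istringstream in("68");
    in >> std::hex >> x;                         // x is now 0x68, i.e. 104
    std::cout << static_cast<char>(x) << '\n';   // prints: h

    std::ostringstream out;
    out << std::hex << x;
    std::cout << out.str() << '\n';              // prints: 68
    return 0;
}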
How do you count Unicode characters in a UTF-8 file in C++? Perhaps someone would be so kind as to show me a "stand-alone" method, or alternatively a short example using http://icu-project.org/index.html.
EDIT: An important caveat is that I need to build counts of each character, so it's not like I'm counting the total number of characters, but the number of occurrences of a set of characters.
In UTF-8, a non-leading byte always has its top two bits set to 10, so just ignore all such bytes. If you don't mind extra complexity, you can do more than that (skipping ahead across non-leading bytes based on the bit pattern of a leading byte), but in reality it's unlikely to make much difference except for short strings, because you'll typically be close to memory bandwidth anyway.
Edit: I originally misread your question as simply asking how to count the length of a string of characters encoded in UTF-8. If you want to count character frequencies, you probably want to convert to UTF-32/UCS-4 first; then you'll need some sort of sparse array to count the frequencies.
The hard part of this is dealing with code points vs. characters. For example, consider the character "À", the "Latin capital letter A with grave". There are at least two different ways to produce it: you can use code point U+00C0, which encodes the whole thing as a single code point, or you can use code point U+0041 (Latin capital letter A) followed by code point U+0300 (combining grave accent).
Normalizing (with respect to Unicode) means turning all such characters into the same form. You can either combine them all into single code points or separate them all into multiple code points; for your purposes, it's probably easier to combine them into a single code point whenever possible. Writing this on your own probably isn't practical; I'd use the normalizer API from the ICU project.
If you know the UTF-8 sequence is well formed, it's quite easy. Count each byte that starts with a zero bit or with two one bits. The first condition catches every code point represented by a single byte; the second catches the first byte of each multi-byte sequence.
while (*p != 0)
{
    if ((*p & 0x80) == 0 || (*p & 0xc0) == 0xc0)
        ++count;
    ++p;
}
Or alternatively as remarked in the comments, you can simply skip every byte that's a continuation:
while (*p != 0)
{
    if ((*p & 0xc0) != 0x80)
        ++count;
    ++p;
}
Or, if you want to be extra clever and make it a two-liner:
for (; *p != 0; ++p)
    count += (*p & 0xc0) != 0x80;
The Wikipedia page for UTF-8 clearly shows the patterns.
A discussion with a full routine written in C++ is at http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
I know it's late for this thread, but it could help.
With ICU, I did it like this:
string theString = "blabla";
UnicodeString uStr = UnicodeString::fromUTF8(theString.c_str());
cout << "length = " << uStr.length() << endl;
I wouldn't consider this a language-centric question. The UTF-8 format is fairly simple; decoding it from a file should be only a few lines of code in any language.
open file
until eof
    if file.readchar & 0xC0 != 0x80
        increment count
close file
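Translated into C++, the same idea might look like this (a sketch; the file name is a placeholder):

#include <cstddef>
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream file("input.txt", std::ios::binary);   // placeholder file name
    std::size_t count = 0;
    char c;
    while (file.get(c))
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
            ++count;   // count every byte that is not a continuation byte
    std::cout << count << '\n';
    return 0;
}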
Keep in mind, if you choose to answer the question, that I am a beginner in the field of programming and may need a bit more explanation than others as to how the solutions work.
Thank you for your help.
My problem is that I am trying to do computations with parts of a string (consisting only of numbers), but I do not know how to convert an individual char to an int. The string is named "message".
for (int place = 0; place < message.size(); place++)
{
    if (secondPlace == 0)
    {
        cout << (message[place]) * 100 << endl;
    }
}
Thank you.
If you mean that you want to convert the character '0' to the integer 0, '1' to 1, et cetera, then the simplest way to do this is probably the following:
int number = message[place] - '0';
Since the characters for digits are encoded in ascending, contiguous order (true in ASCII, and in fact guaranteed for '0' through '9' by the C++ standard), you can subtract the value of '0' from the character in question and get a number equal to the digit.
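A small demonstration of the idea:

#include <iostream>
#include <string>

int main()
{
    std::string message = "42";
    for (std::string::size_type place = 0; place < message.size(); place++)
    {
        int number = message[place] - '0';
        std::cout << number * 100 << '\n';   // prints 400, then 200
    }
    return 0;
}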