Error in getting ASCII of character in C++ - c++

I saw this question: How to convert an ASCII char to its ASCII int value?
The top-voted answer (https://stackoverflow.com/a/15999291/14911094) states the solution as:
Just do this:
int(k)
But I am having issues with this.
My code is:
std::cout << char(144) << std::endl;
std::cout << (int)(char(144)) << std::endl;
std::cout << int('É') << std::endl;
Now the output comes as:
É
-112
-55
Now I can understand the first line, but what is happening in the second and third lines?
Firstly, how can an ASCII value be negative, and secondly, how can it be different for the same character?
Also, as far as I have tested, this is not some random garbage from memory, as it stays the same every time I run the program.
Also, if I change it to 145:
æ
-111
The output too changes by 1, so my guess is that this may be due to some kind of overflow.
But I cannot see how, as I am converting to int and that should be enough (4 bytes) to store the result.
Can anyone suggest a solution?

If your platform is using ASCII for the character encoding (most do these days), then bear in mind that ASCII is only a 7 bit encoding.
It so happens that char is a signed type on your platform. (The signedness or otherwise of char doesn't matter for ASCII as only the first 7 bits are required.)
Hence char(144) gives you a char with a value of -112, since 144 does not fit in a signed 8-bit char and wraps around to 144 - 256 = -112. (You have a 2's complement char type on your platform: from C++14 you can assume that, but you can't in C).
The third line implies that that character (which is not in the ASCII set) has a value of -55.
int(unsigned char('É'))
would force it to a positive value on all but the most exotic of platforms.
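As a concrete check (a minimal sketch of my own, assuming an 8-bit, signed, 2's-complement char as on the asker's platform), going through unsigned char before widening to int recovers the non-negative byte value:
#include <iostream>

int main() {
    char c = char(144);                  // stored as -112 when char is signed
    std::cout << int(c) << '\n';                              // -112 (sign-extended)
    std::cout << int(static_cast<unsigned char>(c)) << '\n';  // 144 (the raw byte value)
}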

The C++ standard only guarantees that characters in the basic execution character set [1] have non-negative encodings. Characters outside that basic set may have negative encodings - it depends on the locale.
[1] Upper- and lowercase Latin alphabet, decimal digits, most punctuation, and control characters like tab, newline, form feed, etc.

Related

Visual Studio C++ C2022. Too big for character error occurs when trying to print a Unicode character

When I try to print a Unicode character to the console, Visual Studio gives me an error. How do I fix this and get Visual Studio to print the Unicode character?
#include <iostream>
int main() {
    std::cout << "\x2713";
    return 0;
}
Quite simply, \x2713 is too large for a single character. If you want two characters, write \x27\x13; if you want the wide character, prefix the literal with L, i.e. L"\x2713", and use std::wcout instead of std::cout.
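For instance (a minimal sketch of my own; the wide-character line assumes a console and locale that can actually display U+2713, and is left commented out to avoid mixing narrow and wide output on the same stream):
#include <iostream>

int main() {
    // Two separate narrow characters, bytes 0x27 and 0x13:
    std::cout << "\x27\x13" << std::endl;

    // Or a single wide character with the value 0x2713:
    // std::wcout << L"\x2713" << std::endl;
}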
Note, from the C++20 standard (draft) [lex.ccon]/7 (emphasis mine):
The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. The value of a character-literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for character-literals with no prefix) or wchar_t (for character-literals prefixed by L).
Essentially, the compiler may treat that character literal how it wants; g++ issues a warning, MSVC (for me) makes it a compile error, and clang also treats it as an error.
\xNNN (any positive number of hex digits) means a single byte whose value is given by NNN - unless it appears in a string or character literal prefixed by L, in which case it means a wchar_t whose value is given by NNN.
If you are looking to encode a Unicode code point, then the syntax is \uNNNN (exactly 4 digits) or \UNNNNNNNN (exactly 8 digits). Note that this is the code point, not a UTF representation.
Using the u or U forms instead of L avoids portability problems due to wchar_t having different sizes on different platforms.
To get well-defined behaviour you can manually specify the encoding of a string literal, e.g.:
std::cout << u8"\u2713" << std::endl;
which will encode the code point as UTF-8. Of course you still need a UTF-8 aware terminal to see meaningful output.
Without an encoding prefix, it is up to the compiler (I think) how to encode the code point.
See:
Escape sequences
String literal

Get decimal value of Unicode Character C++

How do I get the decimal value of a Unicode character such as "Ồ"?
#include <iostream>
#include <string>

std::string a = "Ồ";
unsigned char c = a[0];
long val = long(c);
std::cout << val << std::endl;
OUTPUT
7,891
Your question may look pretty straightforward, but as we delve into it, we'll find it isn't as simple as it might first appear.
The first problem is that std::string is defined as std::basic_string<char>, which isn't really compatible with "Ồ". Thus the results you get from your code will probably depend on the compiler you use and/or the environment and OS you are running on. For example, my copy of Visual Studio treats "Ồ" as an invalid ASCII character and puts "?" (or 0x3F) in a[0].
The second problem is that the character "Ồ" is more than eight bits wide, so it may not fit into the variable c. Whatever the compiler put into a[0], the variable c will only hold as many bits of that value as fit in a char (typically 8). Again, the results you get are likely to change depending on the compiler you use and/or the environment you run in.
Leaving that aside, let's start by assuming the character "Ồ" is LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND GRAVE (0x1ED2). With that assumption, one might imagine that the answer we are seeking to get is 0x1ED2 right? But not necessarily.
There are several ways to encode a Unicode character. The UTF-32 encoding is 0x1ED2 (or 0x00001ED2 if we include all the leading zeros to get thirty-two bits). The UTF-8 encoding is 0xE1BB92.
So the decimal value of "Ồ" is 7,890 if it is encoded in UTF-32 or 14,793,618 if it is encoded in UTF-8 (I'm ignoring the effects of endianness to keep things simple)
The Unicode site has a FAQ on encodings and Wikipedia has a page too.
As you can see, the answer to your question (to some extent) depends on the encoding you want to use. One C++ way to deal with encodings is std::codecvt. Another solution is to just treat your string as a sequence of bytes - which your code attempts to do - but that rather depends on you knowing how your system encodes strings, what endianness you are dealing with, etc. And the code won't necessarily be portable.
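As an illustration (a minimal sketch of my own, assuming the source file and the std::string both hold the UTF-8 bytes 0xE1 0xBB 0x92 discussed above), you can dump the individual bytes and assemble the value yourself:
#include <iostream>
#include <string>

int main() {
    std::string a = "Ồ";                 // assumed to be stored as UTF-8: 0xE1 0xBB 0x92
    unsigned long value = 0;
    for (unsigned char b : a) {
        std::cout << int(b) << ' ';      // prints 225 187 146
        value = (value << 8) | b;        // assemble the bytes big-endian
    }
    std::cout << '\n' << value << '\n';  // 14793618, i.e. 0xE1BB92
}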
Another wrinkle to consider is that - in the general case - "Ồ" might not be one character. Obviously it is one character in your code. But if you read a string in from a disk file, say, and that file produces "Ồ" when printed or displayed, we can't assume the file contains a single "Ồ" character.
Unicode defines COMBINING CIRCUMFLEX ACCENT (0x0302) and COMBINING GRAVE ACCENT (0x0300) as separate characters which can be combined with other characters. And it defines intermediate characters like LATIN CAPITAL LETTER O WITH GRAVE and LATIN CAPITAL LETTER O WITH ACUTE so there are actually several ways you can create a string in memory (or in a disk file) that would give you the same effect as the character "Ồ".

Unsigned byte for 10 conflicts with newline

Is there a way to differentiate the first value (which is a number 10 saved as an unsigned char) from the newline character in the following demo code?
#include <iostream>

int main() {
    unsigned char ch1(10), ch2('\n');
    std::cout << (int)ch1 << " " << (int)ch2 << std::endl;
}
The output is
10 10
I want to write such characters to a file as unsigned bytes, but I also want the newline character to be distinguishable from the number 10 when read at a later time.
Any suggestions?
There is no way. You write the same byte, and preserve no other information.
You need to think of another way of encoding your values, or reserve one value for your sentinel (like 255 or 0). Of course, you need to be sure that this value is not present in your input.
Another possibility is to use one byte value as a 'special' character to escape your control codes, similar to how '\' is used to give special meaning to 'n' in '\n'. But it makes all parsing more complicated, as your values may now be one or two bytes long. Unless you are under tight memory pressure, I would advise storing values as their string representation; this is usually more readable.
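For example (a minimal sketch of the escaping idea, using 0xFF as a hypothetical escape byte; the choice of escape byte and the exact scheme are my assumptions, not part of the original answer):
#include <iomanip>
#include <iostream>
#include <vector>

// Write a byte, escaping the values that must stay distinguishable.
void write_byte(std::vector<unsigned char>& out, unsigned char b) {
    const unsigned char ESC = 0xFF;          // hypothetical escape byte
    if (b == '\n' || b == ESC) {
        out.push_back(ESC);                  // marks the next byte as literal data
    }
    out.push_back(b);
}

int main() {
    std::vector<unsigned char> out;
    write_byte(out, 10);      // the data value 10: stored as FF 0A
    out.push_back('\n');      // a real newline/record separator: stored as 0A
    for (unsigned char b : out) {
        std::cout << std::hex << std::setw(2) << std::setfill('0') << int(b) << ' ';
    }
    std::cout << '\n';        // prints: ff 0a 0a
}
A reader would then treat a bare 0x0A as a separator and any byte following 0xFF as literal data.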
No, a char of value 10 is a newline. Take a look at an ASCII table: the number 10 written as text would be two different chars, '1' and '0' (values 49 and 48, respectively).

C++ Converting/casting String letters to integer values (ASCII)

I'm currently doing a project where I need to create two data structures that will be used to contain strings. One of them has to be a form of linked list, and I've been advised to separate the words out into separate lists inside of it for each letter of the alphabet. I'm required to think about efficiency, so I have an array of head pointers of size 26, and I want to convert the first character of the given word into an integer so I can use it as the subscript, such as:
//a string called s is passed as a parameter to the function
int i = /*some magic happens here*/ s.substr(0, 1);
currentPointer = heads[i]; //then I start iterating through the list
I've been searching around, and all I seem to have found is how to convert number characters that are in strings into integers, not letter characters, and I am wondering how on earth I can get this working without resorting to a huge and ugly set of if statements.
When you are setting i to the value of the first character, you are getting the ASCII value, so i is out of your 0-25 range (see man ascii).
You can reduce it by subtracting the ASCII value of the first letter of the alphabet (be careful with the case):
std::string s("zoro");
int i = s[0];
std::cout << "Ascii value : " << i << " Reduced : " << i - 'a' << std::endl;
This produces the ASCII value 'z' = 122 and 25 for the reduced value, as expected.
I think you are confusing values with representations. "Ten" "10" and "1 1 1 1 1 1 1 1 1 1" are all the same value, just represented differently.
I've been searching around, and all I seem to have found is how to convert number characters that are in strings into integers, not letter characters
There's no difference. Characters are always represented by integers anyway. It's just a matter of representation. Just present the value the way you want.
By the way, this is a key concept programmers have to understand. So it's worth spending some time thinking about it.
A classic example of this misunderstanding is a question like "I have a variable i that has some value in decimal. How can I make it store a value in hex?" Of course that makes no sense: it stores values, and hex and decimal are representations. If you have ten cars, you have ten, in cars, not in decimal or hex. If i has the value ten, then the value ten is in i, not a representation of ten in decimal or hex.
Of course, when you display the value stored in i, you have to choose how to represent it. You can display it as ten, or 10, or | | | | | | | | | |, or whatever.
And you might have a string that has a representation of the value "ten" in hex, and you might need to assign that value to a variable. That requires converting from a representation to the value it represents.
There are input and output functions that read and write values in various representations.
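For instance (a small sketch of my own, not from the original answer), the same value can be written out in several representations, and a string holding a hex representation can be converted back to a value:
#include <iostream>
#include <string>

int main() {
    int i = 10;                              // the value ten
    std::cout << std::dec << i << '\n';      // "10" - decimal representation
    std::cout << std::hex << i << '\n';      // "a"  - hex representation

    std::string hex_text = "a";              // a hex representation stored as text
    int value = std::stoi(hex_text, nullptr, 16);  // convert back to the value ten
    std::cout << std::dec << value << '\n';  // "10"
}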
I suspect you want to convert digits stored in strings as characters to integers, e.g. character '9' to integer 9.
In order to do so:
char c = '9';
int x = c - '0';
That will work regardless of whether you have a computer using ASCII or EBCDIC...
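Extending that slightly (a small sketch of my own, not part of the original answer), a whole string of digits can be converted the same way, since both ASCII and EBCDIC keep '0' through '9' contiguous:
#include <iostream>
#include <string>

int main() {
    std::string digits = "1234";
    int value = 0;
    for (char c : digits) {
        value = value * 10 + (c - '0');  // each digit contributes its value
    }
    std::cout << value << '\n';          // 1234
}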
In this case, you don't seem to need atoi or itoa (neither is going to do anything very useful with, for example, J). You just want something like:
int i = tolower(s[0])-'a';
In theory that's not portable -- if there's any chance of the code being used on a machine that uses EBCDIC (i.e., an IBM or compatible mainframe) you'll want to use something like 'z'-'a' as the size of your array, since it won't be exactly 26 (EBCDIC includes some other characters inserted between some letters, so the letters are in order but not contiguous).
Probably more importantly, if you want to support languages other than English, things change entirely in a hurry -- you might have a different number of letters than 26, they might not all be contiguous, etc. For such a case your basic design is really the problem. Rather than fixing that one line of code, you probably need to redesign almost completely.
As an aside, however, there's a pretty good chance that a linked list isn't a very good choice here.
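Putting the pieces above together for the 26-head design from the question (a minimal sketch under the assumption of a contiguous a-z encoding such as ASCII; the name bucket_index is mine, not from the question):
#include <cctype>
#include <string>

// Map a word to one of 26 buckets by its first letter,
// or -1 if it does not start with a Latin letter.
int bucket_index(const std::string& s) {
    if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0]))) {
        return -1;
    }
    return std::tolower(static_cast<unsigned char>(s[0])) - 'a';
}

// Usage with the question's array of head pointers:
//   int i = bucket_index(s);
//   if (i >= 0) currentPointer = heads[i];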

How to convert from ASCII to string or symbol

I want to output this ^ to the console, but I want to do it using its ASCII code rather than the character itself. Does anyone have an idea of how to do this?
That symbol is called a caret. The ASCII code is 0x5e in hexadecimal (= 94 in decimal).
C version:
printf("%c", 0x5e);
C++ version:
std::cout << static_cast<char>(0x5e);
Both of these assume that you are running on a system where the default character encoding assigns the caret symbol the value 0x5e.
To avoid having to rely on this assumption it is better to not use the ASCII code but instead use '^'.
The hexadecimal value for the caret character (^) is most often 0x5e (94 in decimal).
std::cout << static_cast<char> (0x5e) << " " << (char)94 << " " << '\x5e';
Output on my platform: "^ ^ ^"
I write "most often" because the standard doesn't guarantee what integer value is used to represent a certain character; therefore you shouldn't do what you are implying.
Even though it will probably work out the way you want (since most modern operating systems represent the caret using that value), it's not recommended. If it's not in the standard, no one can guarantee that it is going to work on all platforms, in all cases.
What does the standard say?
2.3/3 Character sets [lex.charset]
The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.
According to this page: http://www.ascii-code.com/
It's ASCII code 0x5e.