How compiler identifies the ASCII code of multiple characters

int var;
var=' '; // this is a single space
cout << var; // prints 32
var = '  '; // double space
cout << var; // prints 8224. Why?
How does the compiler calculate this value (8224) for two spaces?
This happens with every multi-character literal.

This is what C++ standard N3690 mentions about multicharacter literals:
An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
So the answer is that the corresponding int value is implementation-specific.
While for single-char literal:
An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set.

A char in C++ is one byte (with 256 possible values, e.g. 0 to 255 when unsigned).
So in your case, when the two-space literal '  ' is converted to an int, behind the scenes it's just a base-256 conversion. To be more precise:
the white-space ' ' has an ASCII code of 32.
So two white-spaces become an int of: 32 + 256*32 = 8224.
EDIT
This is how your two characters are represented in memory, where one char block is a byte, which can hold values in the range 0-255:
|char|char|
When you convert these two blocks to an int, you perform a base-256 conversion: the ASCII code of the right char block, which is 32, is multiplied by 256^0. Then the ASCII code of the next char block, also 32, is multiplied by 256^1.
Step 2. is implementation dependent as #saurav-sahu mentions, e.g. if it's big endian or little endian.
I tried to give you an intuition of what goes on behind the scenes, but as pete_becker has correctly pointed out, it's highly implementation specific, e.g. the char type can be interpreted as a signed or unsigned value and so on.
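The base-256 arithmetic described above can be sketched as a compile-time check. This is only an illustration of one common packing (leftmost character most significant); the standard leaves the actual value implementation-defined, and the helper name base256_value is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: treats each character as one base-256 digit,
// leftmost character most significant. This mirrors a common packing,
// not a guarantee made by the standard.
template <std::size_t N>
constexpr unsigned base256_value(const char (&text)[N]) {
    unsigned value = 0;
    for (std::size_t i = 0; i + 1 < N; ++i) {  // N counts the terminating '\0'
        value = value * 256u + static_cast<unsigned char>(text[i]);
    }
    return value;
}

// Two characters with code 32 each (spaces in ASCII), written as hex escapes
// so the whitespace cannot get lost: 32 * 256 + 32 = 8224.
static_assert(base256_value("\x20\x20") == 8224u, "32 * 256 + 32");
static_assert(base256_value("\x20") == 32u, "a single space is just 32");
```

The sketch needs C++14 or later because of the loop inside the constexpr function.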

Related

Unreported error VS 2015: Hex char specifier [duplicate]

I wanted this: char c = '\x20' ;
But by mistake I typed this: char c = 'x20';
The VS2015 compiler reported a warning 'converting integer to char', there was no error, the code ran but the value of c was 48 (decimal). Can anyone explain how the erroneous format conversion works, assuming it is a valid form (I didn't think it was). Or is this maybe an error that VS15 doesn't recognise?

'x20' is a multicharacter literal. Per [lex.ccon]/2:
A character literal that does not begin with u8, u, U, or L is
an ordinary character literal. An ordinary character literal that
contains a single c-char representable in the execution character
set has type char, with value equal to the numerical value of the
encoding of the c-char in the execution character set.
An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal, or an
ordinary character literal containing a single c-char not
representable in the execution character set, is
conditionally-supported, has type int, and has an
implementation-defined value.
Therefore, from a standard perspective, your implementation supports this conditionally-supported construct, and you get an implementation-defined value of type int which, when converted to type char, results in char(48).
Per Microsoft Visual Studio C++ Documentation:
Microsoft Specific
Multiple characters in the literal fill corresponding bytes as needed
from high-order to low-order. To create a char value, the compiler
takes the low-order byte. To create a wchar_t or char16_t value,
the compiler takes the low-order word. The compiler warns that the
result is truncated if any bits are set above the assigned byte or
word.
char c0 = 'abcd'; // C4305, C4309, truncates to 'd'
wchar_t w0 = 'abcd'; // C4305, C4309, truncates to '\x6364'
In your case, you use 'x20'. The compiler takes the low-order byte — '0', which is char(48) under ASCII encoding.
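As a rough illustration of the truncation quoted from the Microsoft documentation, the packed value can be written out by hand and masked. The helper names low_byte and low_word are hypothetical, and the concrete numbers assume an ASCII-compatible execution character set.

```cpp
// The int value the quoted documentation describes for 'x20': bytes are filled
// from high-order to low-order. It is spelled out with shifts here so that the
// sketch does not itself depend on a multicharacter literal.
constexpr unsigned packed_x20 = (static_cast<unsigned>('x') << 16) |
                                (static_cast<unsigned>('2') << 8) |
                                static_cast<unsigned>('0');

// Hypothetical helpers mirroring "takes the low-order byte" and
// "takes the low-order word".
constexpr unsigned low_byte(unsigned value) { return value & 0xFFu; }
constexpr unsigned low_word(unsigned value) { return value & 0xFFFFu; }

// Only the last character, '0', survives the conversion to char.
static_assert(low_byte(packed_x20) == static_cast<unsigned>('0'),
              "the low-order byte is the last character");

// 'abcd' packs to 0x61626364 under ASCII: low byte 0x64 ('d'), low word 0x6364.
static_assert(low_byte(0x61626364u) == 0x64u, "truncates to 'd'");
static_assert(low_word(0x61626364u) == 0x6364u, "truncates to 0x6364");
```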

How to mix hexadecimal char and normal char in string literal in C++? [duplicate]

Is it possible to mix '\xfd' and 'a' in a single string literal?
For example:
unsigned char buff1[] = "\xfda";
unsigned char buff1[] = "\x0f\x0015899999999";
VC++2015 reports:
Error C2022 '-1717986919': too big for character

As mentioned in the other answer, '\xfda' is treated as a single hex character literal. To get a string literal with '\xfd' and 'a', you need to split the string.
"\xfd" "a"
Adjacent string literal tokens are concatenated, which means that for example "ab" "cd" is the same as "abcd".

You will not be able to do so using a hex character literal in a single string. [lex.ccon]/8 states
The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for character literals with no prefix) or wchar_t (for character literals prefixed by L). [ Note: If the value of a character literal prefixed by u, u8, or U is outside the range defined for its type, the program is ill-formed. — end note ]
emphasis mine
This means '\xfda' is considered a single hex character literal since all of its digits are valid hex digits. What you can do is use multiple string literals that will be concatenated for you to break it up like
unsigned char buff1[] = "\xfd" "a";
Another option would be to switch to using an octal literal if you want 'a' to be part of the string. That would be "\375a".

Not possible, as explained well in NathanOliver's answer. But there is also no need: you can simply use two literals:
unsigned char buff1[] = "\x0f\x00""15899999999";
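The two workarounds from the answers above (splitting the literal, or switching to an octal escape) can be checked at compile time. The array names split_hex and octal_form are made up for this sketch.

```cpp
// Each escape sequence is parsed inside its own literal before adjacent
// literals are concatenated, so the hex escape ends at the first closing quote.
constexpr unsigned char split_hex[] = "\xfd" "a";

// The octal alternative: an octal escape stops after at most three digits.
constexpr unsigned char octal_form[] = "\375a";

static_assert(sizeof(split_hex) == 3, "0xFD, 'a', and the terminating '\\0'");
static_assert(split_hex[0] == 0xfd && split_hex[1] == 'a',
              "two separate characters");
static_assert(octal_form[0] == split_hex[0] && octal_form[1] == split_hex[1],
              "octal 375 is hex FD");
```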

char val = 'abcd'. Using multi character char

I am confused about how the compiler handles a char variable initialized with multiple characters. I understand that a char is 1 byte and can contain one character, such as an ASCII character.
But when I try:
char _val = 'ab';
char _val = 'abc';
char _val = 'abcd';
They compile fine, and when I print _val it always prints the last character. But when I did
char _val = 'abcde';
Then I got a compiler error:
Error 1 error C2015: too many characters in constant
So my questions are:
Why does the compiler always take the last character when multiple characters are used? What is the compiler mechanism in this situation?
Why did I get a "too many characters" error when I put 5 characters? 2 characters is already more than a char can handle, so why 5?
I am using Visual Studio 2013.
Thank you.

[lex.ccon]/1:
An ordinary character literal that contains more than one c-char is a
multicharacter literal. A multicharacter literal [..] is conditionally-supported, has type int, and
has an implementation-defined value.
Why does the compiler always take the last character when multiple
characters are used? What is the compiler mechanism in this situation?
Most compilers just shift the character values together in order: that way the last character occupies the least significant byte, the penultimate character occupies the byte next to the least significant one, and so forth.
I.e. 'abc' would be equivalent to 'c' + (((int)'b')<<8) + (((int)'a')<<16).
Converting this int back to a char will have an implementation-defined value, which might simply be the value of the int modulo 256. That would give you the last character.
Why did I get a "too many characters" error when I put 5 characters? 2
characters is already more than a char can handle, so why 5?
Because on your machine an int is probably four bytes large. If the above is indeed the way your compiler arranges multicharacter constants, it cannot put five char values into an int.
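The shift-based packing and the size limit can be sketched as follows. This assumes the compiler behaves as described above, which the standard does not guarantee; fits_in_int is a hypothetical helper, and one byte per character is assumed.

```cpp
#include <cstddef>

// The shift-based packing described above, spelled out for "abc".
constexpr int packed_abc = 'c' + ('b' << 8) + ('a' << 16);

// Keeping only the least significant byte leaves the last character.
static_assert(packed_abc % 256 == 'c', "modulo 256 yields the last character");

// Hypothetical helper: with one byte per character, an int has room for
// sizeof(int) characters and no more.
constexpr bool fits_in_int(std::size_t character_count) {
    return character_count <= sizeof(int);
}
```

On a platform with a four-byte int, fits_in_int(4) is true and fits_in_int(5) is false, which matches the point at which the compiler in the question gave up.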

Printing char by integer qualifier

I am trying to execute the below program.
#include "stdio.h"
#include "string.h"
void main()
{
char c='\08';
printf("%d",c);
}
I'm getting 56 as the output. For any number other than 8 the output is the number itself, but for 8 the answer is 56.
Can somebody explain?

An escape sequence made of a backslash followed by one to three octal digits represents an octal number. Octal is the base-8 number system and uses only the digits 0 to 7. So \08 is not a valid octal number, because 8 ∉ [0, 7]; the escape ends after \0, and you get an implementation-defined value.
Your compiler probably treats '\08' as a multi-character constant, with '\0' as one character and '8' as another, and keeps only the last one, '8'. Looking at the ASCII table, you'll note that the decimal value of '8' is 56.
Thanks to #DarkDust, #GrijeshChauhan and #EricPostpischil.

The value '\08' is considered to be a multi-character constant, consisting of \0 (which evaluates to the number 0) and the ASCII character 8 (which evaluates to decimal 56). How it's interpreted is implementation defined. The C99 standard says:
An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer. The
value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined. If an integer character constant contains a
single character or escape sequence, its value is the one that results
when an object with type char whose value is that of the single
character or escape sequence is converted to type int.
So if you assigned '\08' to something bigger than a char, like int or long, it would even be valid. But since you assign it to a char, you're "chopping off" some part. Which part is probably also implementation/machine dependent. In your case it happens to give you the value of the 8 (the ASCII character which evaluates to the number 56).
Both GCC and Clang do warn about this problem with "warning: multi-character character constant".

\0 starts an octal escape in C/C++. Octal digits run from 0 to 7, so \08 is a multi-character constant consisting of \0 and 8. The compiler interprets '\08' as '\0' followed by '8' and, on assignment to a char, keeps '8', whose ASCII value is 56. That's why you are getting 56 as output.

As other answers have said, escapes of this kind are octal (base 8). This means that you have to write '\010' for 8, '\011' for 9, etc.
There are other ways to write your assignment:
char c = 8;
char c = '\x8'; // hexadecimal (base 16) numbers
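The equivalences above can be verified at compile time. The variable names nul and eight are only illustrative; the last check relies on the guarantee that digit characters have consecutive codes.

```cpp
// Octal escapes: up to three digits, each in the range 0-7.
static_assert('\010' == 8, "octal 10 is decimal 8");
static_assert('\011' == 9, "octal 11 is decimal 9");

// Hex escape: ends at the first character that is not a hexadecimal digit.
static_assert('\x8' == 8, "hex 8 is decimal 8");

// The two pieces that '\08' falls apart into, written as separate literals
// to avoid a multicharacter constant.
constexpr char nul = '\0';
constexpr char eight = '8';
static_assert(nul == 0, "the escape \\0 has value 0");
static_assert(eight - '0' == 8, "the character '8' is not the number 8");
```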

C++ char type values are defined by OS?

I know that the first 128 values of the char type are ASCII symbols; I mean that if you print them, they appear as in the ASCII table. What about the rest of them? As I understand it, the other 128 values are not strictly defined. What determines what is printed if I print all possible char values, as below?
char a = 0;
for (int i = 0; i < 256; i++)
{
if (i == 128)
cout << "------------------------------" <<endl;
cout << a++ <<endl;
}
Can I configure the output?

The first 128 values of char do not necessarily correspond with the ASCII characters. The values of char correspond to characters in the execution character set which is an implementation-defined set. The values of the members of this character set are locale-specific (§2.3/3):
The values of the members of the execution character sets and the sets of additional members are locale-specific.
A character literal, such as 'a', has type char and a value equal to that character's value in the execution character set. Likewise for string literals. If a character in your literals falls outside the implementation-defined execution character set, it has an implementation-defined value (§2.14.4/5):
If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.
In many compilers, you can configure the execution character set. For example, with g++, you can use the -fexec-charset option.
Once you output your text, the interpretation of it is up to the medium in which it is being viewed, such as a terminal.

Since the terminal interprets the bytes written by std::cout, you can usually configure your terminal to show the bytes as Latin-1, Latin-15, Cyrillic, or anything else you want.
Your program cannot configure how the bytes are shown on the display. The only thing you can control is how the bytes are interpreted by your code. So, in order to use Latin-1, both your program and the terminal must agree that these bytes mean Latin-1.
