Why do extended ASCII (special) characters take 2 bytes to store? - C++

ASCII characters in the range 32 to 126 are printable. 127 is DEL, and the values after that are considered the extended characters.
To check how they are stored in a std::string, I wrote a test program:
#include <iostream>
#include <string>
using namespace std;

int main ()
{
    string s; // ASCII
    s += "!"; // 33
    s += "A"; // 65
    s += "a"; // 97
    s += "â"; // 131
    s += "ä"; // 132
    s += "à"; // 133
    cout << s << endl; // Print directly
    for(auto i : s)    // Print after iteration
        cout << i;
    cout << "\ns.size() = " << s.size() << endl; // outputs 9!
}
The special characters visible in the code above actually look different from what I typed; the original characters can be seen in this online example (and also in vi).
In the string s, the first 3 normal characters take 1 byte each, as expected. The next 3 extended characters surprisingly take 2 bytes each.
Questions:
Despite being ASCII (within the range 0 to 256), why do those 3 extended characters take 2 bytes of space?
When we iterate through s using a range-based loop, how does it figure out that it has to increment once for normal characters and twice for extended characters?
[Note: This may also apply to C and other languages.]

Despite being ASCII (within the range 0 to 256), why do those 3 extended characters take 2 bytes of space?
If you define 'being ASCII' as containing only bytes in the range [0, 256), then all data is ASCII: [0, 256) is the same as the range a byte is capable of representing, thus all data represented with bytes is ASCII, under your definition.
The issue is that your definition is incorrect, and you're looking incorrectly at how data types are determined; the kind of data represented by a sequence of bytes is not determined by those bytes. Instead, the data type is metadata that is external to the sequence of bytes. (This isn't to say that it's impossible to examine a sequence of bytes and determine statistically what kind of data it is likely to be.)
Let's examine your code, keeping the above in mind. I've taken the relevant snippets from the two versions of your source code:
s += "â"; // 131
s += "ä"; // 132
s += "â"; // 131
s += "ä"; // 132
You're viewing these source code snippets as text rendered in a browser, rather than as raw binary data. You've presented these two things as the 'same' data, but in fact they're not the same: the snippets above are two different sequences of characters, even though they may render identically here.
However there is something interesting about these two sequences of text elements: one of them, when encoded into bytes using a certain encoding scheme, is represented by the same sequence of bytes as the other sequence of text elements when that sequence is encoded into bytes using a different encoding scheme. That is, the same sequence of bytes on disk may represent two different sequences of text elements depending on the encoding scheme! In other words, in order to figure out what the sequence of bytes means, we have to know what kind of data it is, and therefore what decoding scheme to use.
So here's what probably happened. In vi you wrote:
s += "â"; // 131
s += "ä"; // 132
You were under the impression that vi would represent those characters using extended ASCII, and thus use the bytes 131 and 132. But that was incorrect. vi did not use extended ASCII, and instead it represented those characters using a different scheme (UTF-8) which happens to use two bytes to represent each of those characters.
Later, when you opened the source code in a different editor, that editor incorrectly assumed the file was extended ASCII and displayed it as such. Since extended ASCII uses a single byte for every character, it took the two bytes vi used to represent each of those characters, and showed one character for each byte.
The bottom line is that you're incorrect that the source code is using extended ASCII, and so your assumption that those characters would be represented by single bytes with the values 131 and 132 was incorrect.
When we iterate through s using a range-based loop, how does it figure out that it has to increment once for normal characters and twice for extended characters?
Your program isn't doing this. The characters print okay in your ideone.com example because printing the two bytes that represent a character one after the other still displays that character. Here's an example that makes this clear: live example.
std::cout << "Printed together: '";
std::cout << (char)0xC3;
std::cout << (char)0xA2;
std::cout << "'\n";
std::cout << "Printed separated: '";
std::cout << (char)0xC3;
std::cout << '/';
std::cout << (char)0xA2;
std::cout << "'\n";
Printed together: 'â'
Printed separated: '�/�'
The '�' character is what shows up when an invalid encoding is encountered.
If you're asking how you can write a program that does do this, the answer is to use code that understands the details of the encoding being used. Either get a library that understands UTF-8 or read the UTF-8 spec yourself.
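As a rough illustration of what such code has to do, here is a minimal sketch of my own (it assumes the source file and the execution character set are UTF-8, as in your example, and does no real validation) that works out how many bytes each character occupies from its lead byte:
#include <iostream>
#include <string>

// Sketch only: number of bytes in the UTF-8 sequence that starts with this
// lead byte (1-4). Continuation bytes are not validated here.
static std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead >> 5) == 0x06) return 2;    // 110xxxxx
    if ((lead >> 4) == 0x0E) return 3;    // 1110xxxx
    return 4;                             // 11110xxx (anything else assumed 4)
}

int main()
{
    std::string s = "!Aa\u00E2\u00E4\u00E0"; // same characters as the question; 9 bytes as UTF-8
    for (std::size_t i = 0; i < s.size(); )
    {
        std::size_t n = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        std::cout << s.substr(i, n) << " -> " << n << " byte(s)\n";
        i += n;
    }
}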
You should also keep in mind that the use of UTF-8 here is simply because this editor and compiler use UTF-8 by default. If you were to write the same code with a different editor and compile it with a different compiler, the encoding could be completely different; assuming code is UTF-8 can be just as wrong as your earlier assumption that the code was extended ASCII.

Your terminal probably uses UTF-8 encoding. It uses one byte for ASCII characters, and 2-4 bytes for everything else.
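A quick way to see how many characters (code points) a std::string holds, as opposed to how many bytes, is to count only the bytes that are not UTF-8 continuation bytes. This is just a sketch and assumes the string really contains valid UTF-8:
#include <algorithm>
#include <iostream>
#include <string>

int main()
{
    std::string s = "Aa\u00E2\u00E4"; // assumed to be UTF-8 encoded
    // Continuation bytes always look like 10xxxxxx; skip them when counting.
    auto code_points = std::count_if(s.begin(), s.end(),
        [](char c) { return (static_cast<unsigned char>(c) & 0xC0) != 0x80; });
    std::cout << s.size() << " bytes, " << code_points << " code points\n"; // 6 and 4
}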

The basic source character set for C++ source code does not include extended ASCII characters (ref. §2.3 in ISO/IEC 14882:2011):
The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
So, an implementation has to map those characters from the source file to characters in the basic source character set, before passing them on to the compiler. They will likely be mapped to universal character names, following ISO/IEC 10646 (UCS):
The universal-character-name construct provides a way to name other characters.
The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN.
A universal character name in a narrow string literal (as in your case) may be mapped to multiple chars, using multibyte encoding (ref. §2.14.5 in ISO/IEC 14882:2011):
In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding.
That's what you're seeing for those last 3 characters.
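You can observe that mapping directly. The following is only a sketch and assumes the compiler's narrow execution character set is UTF-8 (the default for GCC and Clang); with a different execution charset the counts may differ:
#include <cstring>
#include <iostream>

int main()
{
    const char s[] = "\u00E2"; // universal-character-name for â
    // With a UTF-8 narrow execution charset this literal maps to two chars
    // (0xC3 0xA2) plus the terminating null character.
    std::cout << "strlen = " << std::strlen(s)
              << ", sizeof = " << sizeof s << '\n'; // typically 2 and 3
}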

Related

How to make std::regex match UTF-8

I would like a pattern like ".c" to match the "." against any UTF-8 character, followed by 'c', using std::regex.
I've tried under Microsoft C++ and g++ and get the same result: each time, the "." only matches a single byte.
Here's my test case:
#include <stdio.h>
#include <iostream>
#include <string>
#include <regex>
using namespace std;

int main(int argc, char** argv)
{
    // make a string with 3 UTF8 characters
    const unsigned char p[] = { 'a', 0xC2, 0x80, 'c', 0 };
    string tobesearched((char*)p);
    // want to match the UTF8 character before c
    string pattern(".c");
    regex re(pattern);
    std::smatch match;
    bool r = std::regex_search(tobesearched, match, re);
    if (r)
    {
        // m.size() will be bytes, and we expect 3
        // expect 0xC2, 0x80, 'c'
        string m = match[0];
        cout << "match length " << m.size() << endl;
        // but we only get 2, we get the 0x80 and the 'c'.
        // so it's matching on single bytes and not utf8
        // code here is just to dump out the byte values.
        for (int i = 0; i < m.size(); ++i)
        {
            int c = m[i] & 0xff;
            printf("%02X ", c);
        }
        printf("\n");
    }
    else
        cout << "not matched\n";
    return 0;
}
I wanted the pattern ".c" to match 3 bytes of my tobesearched string, where the first two are a 2-byte UTF-8 character followed by 'c'.
Some regex flavours support \X, which matches a single Unicode character that may consist of a number of bytes depending on the encoding. It is common practice for regex engines to get the bytes of the subject string in an encoding the engine is designed to work with, so you shouldn't have to worry about the actual encoding (whether it is US-ASCII, UTF-8, UTF-16 or UTF-32).
Another option is \uFFFF, where FFFF refers to the Unicode character at that index in the Unicode charset. With that, you could create a ranged match inside a character class, i.e. [\u0000-\uFFFF]. Again, it depends on what the regex flavour supports. There is another variant of \u in \x{...}, which does the same thing, except the Unicode character index must be supplied inside curly braces and need not be padded, e.g. \x{65}.
Edit: This website is amazing for learning more about regex across various flavours: https://www.regular-expressions.info
Edit 2: To match any Unicode-exclusive character, i.e. excluding the 1-byte characters in the ASCII table, you can try "[\x{80}-\x{FFFFFFFF}]", i.e. any character with a value of 128 to 4,294,967,295: from the first character outside the ASCII range up to the last Unicode charset index, which currently uses up to a 4-byte representation (it was originally to be 6, and may change in the future).
A loop through the individual bytes would be more efficient, though (a code sketch follows after these steps):
If the lead bit is 0, i.e. if its signed value is > -1, it is a 1 byte char representation. Skip to the next byte and start again.
Else if the lead bits are 11110 i.e. if its signed value is > -17, n=4.
Else if the lead bits are 1110 i.e. if its signed value is > -33, n=3.
Else if the lead bits are 110 i.e. if its signed value is > -65, n=2.
Optionally, check that each of the next n-1 bytes (the continuation bytes) starts with 10, i.e. for each such byte, if its signed value is greater than -65 it does not start with 10, and the encoding is invalid UTF-8.
You now know that the previous n bytes constitute a Unicode-exclusive (multi-byte) character. So, if the NEXT character is 'c', i.e. == 99, you can say it matched and return true.
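Below is a rough sketch of that loop (my own illustration, not a drop-in std::regex replacement). It uses unsigned byte values; the signed thresholds above (> -17, > -33, > -65) correspond to >= 0xF0, >= 0xE0 and >= 0xC0 respectively:
#include <cstdio>
#include <string>

// Sketch: true if the string contains a multi-byte (non-ASCII) UTF-8
// character immediately followed by 'c'.
static bool matches_multibyte_then_c(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); )
    {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t n;
        if (b < 0x80)       { ++i; continue; } // 1-byte ASCII char, skip it
        else if (b >= 0xF0) n = 4;             // 11110xxx
        else if (b >= 0xE0) n = 3;             // 1110xxxx
        else if (b >= 0xC0) n = 2;             // 110xxxxx
        else                { ++i; continue; } // stray continuation byte, skip

        // Optionally check that the n-1 continuation bytes start with 10xxxxxx.
        for (std::size_t k = 1; k < n && i + k < s.size(); ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;                  // not valid UTF-8

        if (i + n < s.size() && s[i + n] == 'c')
            return true;                       // multi-byte character then 'c'
        i += n;
    }
    return false;
}

int main()
{
    const unsigned char p[] = { 'a', 0xC2, 0x80, 'c', 0 };
    std::printf("%s\n", matches_multibyte_then_c(reinterpret_cast<const char*>(p))
                            ? "matched" : "not matched");
}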

What assumption is safe for a C++ implementation's character set?

In The C++ Programming Language 6.2.3, it says:
It is safe to assume that the implementation character set includes the decimal digits, the 26 alphabetic characters of English, and some of the basic punctuation characters. It is not safe to assume that:
There are no more than 127 characters in an 8-bit character set (e.g., some sets provide 255 characters).
There are no more alphabetic characters than English provides (most European languages provide more, e.g., æ, þ, and ß).
The alphabetic characters are contiguous (EBCDIC leaves a gap between 'i' and 'j').
Every character used to write C++ is available (e.g., some national character sets do not provide {, }, [, ], |, and \).
A char fits in 1 byte. There are embedded processors without byte accessing hardware for which a char is 4 bytes. Also, one could reasonably use a 16-bit Unicode encoding for the basic chars.
I'm not sure I understand the last two statements.
In section 2.3 of the standard, it says:
The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits.
We can see that the standard states that characters like { } [ ] | \ are part of the basic execution character set. Then why does TC++PL say it's not safe to assume that those characters are available in the implementation's character set?
And for the size of a char, in section 5.3.3 of the standard:
The sizeof operator yields the number of bytes in the object representation of its operand. ... sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1.
We can see that the standard states that a char is 1 byte. What is the point TC++PL is trying to make here?
The word "byte" seems to be used sloppily in the first quote. As far as C++ is concerned, a byte is always a char, but the number of bits it holds is platform-dependent (and available in CHAR_BITS). Sometimes you want to say "a byte is eight bits", in which case you get a different meaning, and that may have been the intended meaning in the phrase "a char has four bytes".
The execution character set may very well be larger than or incompatible with the input character set provided by the environment. Trigraphs and alternate tokens exist to allow the representation of execution-set characters with fewer input characters on such restricted platforms (e.g. not is identical for all purposes to !, and the latter is not available in all character sets or keyboard layouts).
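As a small illustration of the alternative tokens mentioned above: in C++ they are ordinary tokens and need no header (C gets the same spellings as macros from <iso646.h>); this is just a sketch:
#include <iostream>

int main()
{
    bool ready = true, done = false;
    // 'not', 'and' and 'or' are alternative spellings of !, && and ||.
    if (ready and not done)
        std::cout << "still working" << std::endl;
}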
It used to be the case that some national variants of ASCII, such as the Scandinavian languages, used accented alphabetic characters for the code points where US ASCII has punctuation such as [, ], {, }. These are the reason that C89 included trigraphs — they allow code to be written in the 'invariant subset' of ISO 646. See the chart of the characters used in the national variants on the Wikipedia page.
For example, someone in Scandinavia might have to read:
#include <stdio.h>
int main(int argc, char **argv)
Å
    for (int i = 1; i < argc; i++)
        printf("%s\n", argvÆiØ);
    return 0;
ø
instead of:
#include <stdio.h>
int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("%s\n", argv[i]);
    return 0;
}
Using trigraphs, you might write:
??=include <stdio.h>
int main(int argc, char **argv)
??<
    for (int i = 1; i < argc; i++)
        printf("%s??/n", argv??(i??));
    return 0;
??>
which is equally ghastly in any language.
I'm not sure how much of an issue this still is, but that's why the comments are there.

Including decimal equivalent of a char in a character array

How do I create a character array using the decimal/hexadecimal representation of characters instead of the actual characters?
The reason I ask is that I am writing C code and I need to create a string that includes characters that are not used in the English language. That string would then be parsed and displayed on an LCD screen.
For example, '\0' decodes to 0, and '\n' to 10. Are there any more of these special characters that I can sacrifice to display custom characters? I could send "Temperature is 10\d C" and a degree sign would be printed instead of '\d'. Something like this would be great.
Assuming you have a character code that is a degree sign on your display (with a custom display, I wouldn't necessarily expect it to "live" at the common place in the extended IBM ASCII character set, or that the display supports Unicode character encoding) then you can use the encoding \nnn or \xhh, where nnn is up to three digits in octal (base 8) or hh is up to two digits of hex code. Unfortunately, there is no decimal encoding available - Dennis Ritchie and/or Brian Kernighan were probably more used to using octal, as that was quite common at the time when C was first developed.
E.g.
char *str = "ABC\101\102\103";
cout << str << endl;
should print ABCABC (assuming ASCII encoding)
You can directly write
char myValues[] = {1,10,33,...};
Use \u00b0 to make a degree sign (I simply looked up the Unicode code point for it).
This requires Unicode support in the terminal.
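A minimal sketch of that suggestion, assuming whatever renders the string (terminal or LCD driver) understands UTF-8:
#include <iostream>

int main()
{
    // \u00B0 is DEGREE SIGN; with a UTF-8 execution charset it becomes the
    // two bytes 0xC2 0xB0 inside the string literal.
    std::cout << "Temperature is 10\u00B0 C\n";
}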
Simple, use std::ostringstream and cast the characters:
#include <iostream>
#include <sstream>
#include <string>
int main() {
    std::string s = "hello world";
    std::ostringstream os;
    for (auto const& c : s)
        os << static_cast<unsigned>(c) << ' ';
    std::cout << "\"" << s << "\" in ASCII is \"" << os.str() << "\"\n";
}
Prints:
"hello world" in ASCII is "104 101 108 108 111 32 119 111 114 108 100 "
A little more research and I found the answer to my own question.
A '\' followed by certain characters forms an escape sequence.
You can put the octal equivalent of an ASCII code in your string by using an escape sequence from '\000' to '\377'.
Same goes for hex, '\x00' to '\xFF'.
I am printing my custom characters using '\xC1' to '\xC8', as I only had 8 custom characters.
Everything is done in a single line of code: lcd_putc("Degree \xC1");

Why do I get three different numbers when I try to output a UTF-8 character?

I'm trying to tokenize input consisting of UTF-8 characters. While trying to learn UTF-8, I get output that I cannot understand: when I input the character π (pi), I get three different numbers, 207 128 10. How can I use them to determine which category the character belongs to?
#include <iostream>
#include <sstream>
#include <string>
using namespace std;
int main()
{
    ostringstream oss;
    oss << cin.rdbuf();
    string input = oss.str();
    for (int i = 0; i < input.size(); i++)
    {
        unsigned char code_unit = input[i];
        cout << (int)code_unit << endl;
    }
}
Thanks in advance.
Characters encoded with UTF-8 may take up more than a single byte (and often do). The number of bytes used to encode a single code point can vary from 1 byte to 6 bytes (or 4 under RFC 3629). In the case of π, the UTF-8 encoding, in binary, is:
11001111 10000000
That is, it is two bytes. You are reading these bytes out individually. The first byte has decimal value 207 and the second has decimal value 128 (if you interpret them as unsigned integers). The third byte you're reading has decimal value 10: it is the Line Feed character generated when you hit Enter.
If you're going to do any processing of these UTF-8 characters, you're going to need to interpret what the bytes mean. What exactly you'll need to do depends on how you're categorising the characters.
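For example, to turn the two bytes of π back into a code point you can combine their payload bits. This is just a sketch for the two-byte case, not a full decoder:
#include <iostream>

int main()
{
    unsigned char b0 = 207, b1 = 128; // the two bytes read for π
    // A two-byte sequence is 110xxxxx 10yyyyyy; the code point is xxxxxyyyyyy.
    unsigned int code_point = ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
    std::cout << "U+" << std::hex << code_point << '\n'; // prints U+3c0
}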

Understanding Endianness - a variable value

I'm using a piece of code (found elsewhere on this site) that checks endianness at runtime.
static bool isLittleEndian()
{
    short int number = 0x1;
    char *numPtr = (char*)&number;
    std::cout << numPtr << std::endl;
    std::cout << *numPtr << std::endl;
    return (numPtr[0] == 1);
}
When in debug mode, the value numPtr looks like this: 0x7fffffffe6ee "\001"
I assume the first hexadecimal part is the pointer's memory address, and the second part is the value it holds. I know that \0 is the null terminator in C-style strings, but why is it at the front? Is it to do with endianness?
On a little-endian machine: is 01 the first byte and therefore least significant (byte place 0), and \0 the second/final byte (byte place 1)?
In addition, the cout statements do not print the pointer address or its value. What are the reasons for this?
The others have given you a clear answer to what "\000" means, so this is an answer to your question:
On a little-endian machine: is 01 the first byte and therefore least significant (byte place 0), and \0 the second/final byte (byte place 1)?
Yes, this is correct. If you look at a value like 0x1234, it consists of two bytes, the high part 0x12 and the low part 0x34. The term "little endian" means that the low part is stored first in memory:
addr: 0x34
addr+1: 0x12
Did you know that the term "endian" predates the computer industry? It was originally used by Jonathan Swift in his book Gulliver's Travels, where it described whether people ate their eggs from the pointy end or the round end.
The easiest way to check for endianness is to let the system do it for you:
if (htonl(0xFFFF0000)==0xFFFF0000) printf("Big endian");
else printf("Little endian");
That's not a \0 followed by "01", it's the single character \001, which represents the number 1 in octal. That's the only byte "in" your string. There's another byte after it with the value zero, but you don't see that since it's treated as the string terminator.
For starters: this type of function is totally worthless. On a machine where sizeof(int) is 4, there are 24 possible byte orders. Most, of course, don't make sense, but I've seen at least three. And endianness isn't the only thing which affects integer representation. If you have an int and you want to get the low-order 8 bits, use intValue & 0xFF; for the next 8 bits, use (intValue >> 8) & 0xFF.
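A short sketch of that masking-and-shifting approach, which gives you the logical bytes of the value regardless of how they are laid out in memory:
#include <cstdint>
#include <cstdio>

int main()
{
    std::uint32_t value = 0x11223344;
    // Shifting and masking extracts bytes by value, so the result does not
    // depend on the machine's byte order.
    for (int i = 0; i < 4; ++i)
        std::printf("byte %d: 0x%02X\n", i,
                    static_cast<unsigned>((value >> (8 * i)) & 0xFF));
}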
With regards to your precise question: I presume what you are describing as "looks like this" is what you see in the debugger when you break at the return. In this case, numPtr is a char* (an unsigned char const* would make more sense), so the debugger assumes a C-style string. The 0x7fffffffe6ee is the address; what follows is what the debugger sees as a C-style string, which it displays as a string, i.e. "...". Presumably, your platform is traditional little-endian (Intel); the pointer to the C-style string sees the sequence of numeric values 1, 0. The 0 is of course the equivalent of '\0', so the debugger considers this a one-character string, with that one character having the encoding 1. There is no printable character with an encoding of 1, and it doesn't correspond to any of the normal escape sequences (e.g. '\n', '\t', etc.) either. So the debugger outputs it using the octal escape sequence: a '\' followed by 1 to 3 octal digits. (The traditional '\0' is just a special case of this: a '\' followed by a single octal digit.) It outputs 3 digits because (probably) it doesn't want to look ahead to ensure that the next character isn't an octal digit. (If the sequence were the two bytes 1, 49, for example, 49 is '1' in the usual encodings, and if the debugger output only a single digit for the octal encoding of 1, the result would be "\11", a single-character string corresponding in the usual encodings to '\t'.) So what you see is: " (this is a string), \001 (its first character, which has the encoding 1 and no displayable representation), and " (that's the end of the string).
The "\001" you are seeing is just one byte. It's probably octal notation, which needs three digits to properly express the (decimal) values 0 to 255.
The \0 isn't a NUL; the debugger is showing you numPtr as a string, the first character of which is \001, or control-A in ASCII. The second character is \000, which isn't displayed because NULs aren't shown when displaying strings. The two-character string version of 'number' would appear as "\000\001" on a big-endian machine, instead of "\001\000" as it appears on little-endian machines.
In addition, the cout statements do not print the pointer address or its value. What are the reasons for this?
Because chars and char pointers are treated differently than integers when it comes to printing.
When you print a char, it prints the character from whatever character set is being used. Usually, this is ASCII, or some superset of ASCII. The value 0x1 in ASCII is non-printing.
When you print a char pointer, it doesn't print the address, it prints it as a null-terminated string.
To get the results you desire, cast your char pointer to a void pointer, and cast your char to an int.
std::cout << (void*)numPtr << std::endl;
std::cout << (int)*numPtr << std::endl;