Logic behind converting a character to UTF-8 - C++

I have the following piece of code, whose accompanying comment says it converts any character greater than 0x7F to UTF-8. I have the following questions about this code:
if ((const unsigned char)c > 0x7F)
{
    Buffer[0] = 0xC0 | ((unsigned char)c >> 6);
    Buffer[1] = 0x80 | ((unsigned char)c & 0x3F);
    return Buffer;
}
How does this code work?
Does the current Windows code page I am using have any effect on the character placed in Buffer?

For starters, the code doesn't work, in general. By
coincidence, it works if the encoding in char (or unsigned
char) is ISO-8859-1, because ISO-8859-1 has the same code
points as the first 256 Unicode code points. But ISO-8859-1 has
largely been superseded by ISO-8859-15, so it probably won't
work. (Try it for 0xA4, for example, which is the Euro sign in
ISO-8859-15; you will get a completely different character.)
There are two correct ways to do this conversion, both of which
depend on knowing the encoding of the byte being entered (which
means that you may need several versions of the code, depending
on the encoding). The simplest is simply to have an array with
256 strings, one per character, and index into that. In which
case, you don't need the if. The other is to translate the
code into a Unicode code point (32 bit UTF-32), and translate
that into UTF-8 (which can require more than two bytes for some
characters: the Euro character is 0x20AC: 0xE2, 0x82, 0xAC).
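As an illustration of the second approach, here is a minimal sketch that encodes a Unicode code point (UTF-32) as UTF-8; the table-based byte-to-code-point step is assumed to happen beforehand, and toUtf8 is just an illustrative name:
#include <cstdint>
#include <string>

// Minimal sketch: encode a Unicode code point (UTF-32) as UTF-8.
// Validity checks (surrogates, values above U+10FFFF) are omitted.
std::string toUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
// toUtf8(0x20AC) yields the three bytes 0xE2, 0x82, 0xAC (the Euro sign).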
EDIT:
For a good introduction to UTF-8:
http://www.cl.cam.ac.uk/~mgk25/unicode.html. The title says it
is for Unix/Linux, but there is very little, if any, system
specific information in it (and such information is clearly
marked).

Related

String to Unicode, and Unicode to decimal code point (C++)

Despite seeing a lot of questions on the forum about Unicode and string conversion (in C/C++) and Googling the topic for hours, I still can't find a straight explanation of what seems to me like a very basic process. Here is what I want to do:
I have a string which may potentially contain characters from any possible language. Let's take Cyrillic for example. So say I have:
std::string str = "сапоги";
I want to loop over each character making up that string and:
Know/print the character's Unicode value
Convert that Unicode value to a decimal value
I really Googled that for hours and couldn't find a straight answer. If someone could show me how this could be done, it would be great.
EDIT
So I managed to get that far:
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <locale>
#include <codecvt>
#include <iomanip>
// utility function for output
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for (unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}

int main()
{
    std::wstring test = L"сапоги";
    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv1;
    std::string u8str = conv1.to_bytes(test);
    hex_print(u8str);
    return 0;
}
Result:
04 41 04 30 04 3f 04 3e 04 33 04 38
Which is correct (it maps to the Unicode code points). The problem is that I don't know whether I should use UTF-8, UTF-16 or something else (as pointed out by Chris in the comments). Is there a way I can find that out (i.e. whatever encoding the string uses originally, or whatever encoding needs to be used)?
EDIT 2
I thought I would address some of the comments with a second edit:
"Convert that Unicode value to a decimal value" Why?
I will explain why, but I also want to point out, in a friendly way, that my problem was not 'why' but 'how' ;-). You can assume the OP has a reason for asking the question, yet of course I understand people are curious as to why... so let me explain. The reason I need all this is that I ultimately need to read glyphs from a font file (TrueType or OpenType, it doesn't matter). These files have a table called cmap, which is some sort of associative array that maps the value of a character (in the form of a code point) to the index of the glyph in the font file. The code points in the table are not stored using the U+XXXX notation but directly as the decimal counterpart of that number (the U+XXXX notation being the hexadecimal representation of a uint16 number [or U+XXXXXX if greater than uint16, but more on that later]). So, in summary, the letter г in Cyrillic ([gueu]) has code point value U+0433, which in decimal form is 1075. I need the value 1075 to do a lookup in the cmap table.
// utility function for output (the input is the UTF-16BE byte string from above)
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    uint16_t i = 0, codePoint = 0;
    for (unsigned char c : s) {
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
        if (i++ % 2 == 0) {
            codePoint = static_cast<uint16_t>(c << 8);   // high byte of the code unit
        } else {
            codePoint |= c;                              // low byte completes the code unit
            printf("Unicode value: U+%04X  Decimal value of code point: %d\n",
                   codePoint, codePoint);
        }
    }
    std::cout << std::dec;
}
std::string is encoding-agnostic. It essentially stores bytes. std::wstring is weird, though also not defined to hold any specific encoding. In Windows, wchar_t is used for UTF-16
Yes, exactly. While you might think (at least I did) that strings just store "ASCII" characters (hold on here), this turns out to be really wrong. In fact, as the comment suggests, std::string only stores 'bytes'. Clearly, though, if you look at the bytes of the string "english" you get:
std::string eng = "english";
hex_print(eng);
65 6e 67 6c 69 73 68
and if you do the same thing with "сапоги" you get:
std::string cyrillic = "сапоги";
hex_print(cyrillic );
d1 81 d0 b0 d0 bf d0 be d0 b3 d0 b8
What I'd really like to know/understand is how this conversion is implicitly done. Why UTF-8 encoding here rather than UTF-16, and is there a possibility of changing that (or is it defined by my IDE or OS)? Clearly, when I copy-paste the string сапоги into my text editor, it actually already copies an array of 12 bytes (these 12 bytes could be UTF-8 or UTF-16).
I think there is some confusion between Unicode and encoding. A code point (AFAIK) is just a character code. UTF-16 gives you the code, so you can say your 0x0441 is the code point of с, the Cyrillic small letter es. To my understanding, UTF-16 maps one-to-one to Unicode code points, which have a range of a million and something characters. However, other encoding techniques, for example UTF-8, do not map directly to Unicode code points. So, I guess, you had better stick to UTF-16.
Exactly! I found this comment very useful indeed. Because yes, there is confusion (and I was confused) about the fact that the way you encode the Unicode code point value has nothing to do with the Unicode value itself. Well, sort of, because in fact things can be misleading, as I will show now. You can indeed encode the string сапоги using UTF-8 and you will get:
d1 81 d0 b0 d0 bf d0 be d0 b3 d0 b8
So clearly it has nothing to do with the Unicode values of the glyphs indeed. Now if you encode the same string using UTF16 you get:
04 41 04 30 04 3f 04 3e 04 33 04 38
where 04 and 41 are indeed the two bytes (in hexadecimal form) of the letter с ([se] in Cyrillic). In this case at least, there is a direct mapping between the Unicode value and its uint16 representation. And this is why (per Wikipedia's explanation [source]):
Both UTF-16 and UCS-2 encode code points in this range as single 16-bit code units that are numerically equal to the corresponding code points.
But as someone suggested in the comment, some code points values go beyond what you can define with 2 bytes. For example:
1D307 𝌇 TETRAGRAM FOR FULL CIRCLE (Tai Xuan Jing Symbols)
which is what this comment was suggesting:
To my knowledge, UTF-16 doesn't cover all characters unless you use surrogate pairs. It was meant to originally, when 65k was more than enough, but that went out the window, making it an extremely awkward choice now
Though to be perfectly exact, UTF-16, like UTF-8, CAN encode ALL characters, though it may use up to 4 bytes to do so (as you suggested, it uses surrogate pairs when more than 2 bytes are needed).
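For reference, a minimal sketch of how a code point above U+FFFF, such as U+1D307, would be split into a UTF-16 surrogate pair (error handling omitted):
#include <cstdint>
#include <cstdio>

// Split a code point above U+FFFF into its UTF-16 surrogate pair.
void to_surrogate_pair(char32_t cp)
{
    std::uint32_t v = cp - 0x10000;             // 20-bit value
    std::uint16_t high = 0xD800 | (v >> 10);    // high (lead) surrogate
    std::uint16_t low  = 0xDC00 | (v & 0x3FF);  // low (trail) surrogate
    std::printf("U+%X -> %04X %04X\n", (unsigned)cp, (unsigned)high, (unsigned)low);
}
// to_surrogate_pair(0x1D307) prints "U+1D307 -> D834 DF07".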
I tried to do a conversion to UTF-32 using mbrtoc32 but cuchar is strangely missing on Mac.
BTW, if you don't know what a surrogate pair is (I didn't) there's a nice post about this on the forum.
For your purposes, finding and printing the value of each character, you probably want to use char32_t, because that has no multi-byte strings or surrogate pairs and can be converted to decimal values just by casting to unsigned long. I would link to an example I wrote, but it sounds as if you want to solve this problem yourself.
C++11 directly supports the types char16_t and char32_t (and C++20 adds char8_t), in addition to the legacy wchar_t that sometimes means UCS-4, sometimes UTF-16LE, sometimes UTF-16BE, sometimes something different. It also lets you store strings, no matter what character set you saved your source file in, in any of these formats with the u8"", u"" and U"" prefixes, and the \uXXXX Unicode escape as a fallback. For backward compatibility, you can encode UTF-8 with hex escape codes in an array of unsigned char.
Therefore, you can store the data in any format you want. You could also use the facet codecvt<wchar_t,char,mbstate_t>, which all locales are required to support. There are also the multi-byte string functions in <wchar.h> and <uchar.h>.
I highly recommend you store all new external data in UTF-8. This includes your source files! (Annoyingly, some older software still doesn’t support it.) It may also be convenient to use the same character set internally as your libraries, which will be UTF-16 (wchar_t) on Windows. If you need fixed-length characters that can hold any codepoint with no special cases, char32_t will be handy.
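To tie this back to the question, here is a minimal sketch using std::wstring_convert with std::codecvt_utf8<char32_t> (from C++11's <codecvt>, deprecated in C++17 but still available) that prints each character's code point as U+XXXX and as a decimal number:
#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

int main()
{
    // "сапоги" written with universal character names so the source charset doesn't matter
    std::string u8str = u8"\u0441\u0430\u043F\u043E\u0433\u0438";
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string cps = conv.from_bytes(u8str);
    for (char32_t cp : cps)
        std::printf("U+%04X  decimal: %lu\n", (unsigned)cp, (unsigned long)cp);
    // first line of output: U+0441  decimal: 1089  (Cyrillic small letter es)
}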
Originally computers were designed for the American market and used ASCII, the American Standard Code for Information Interchange. This had 7-bit codes covering just the basic English letters and a few punctuation marks, plus control codes at the lower end designed to drive paper-and-ink printer terminals.
This became inadequate as computers developed and started to be used for language processing as much as for numerical work. The first thing that happened was that various expansions to 8 bits were proposed. This could either cover most of the decorated European characters (accents, etc) or it could give a series of basic graphics good for creating menus and panels, but you couldn't achieve both. There was still no way of representing non-Latin character sets like Greek.
So a 16-bit code was proposed and called Unicode. Microsoft adopted this very early and invented the wchar_t / WCHAR type (it has various identifiers) to hold international characters. However, it emerged that 16 bits wasn't enough to hold all glyphs in common use, and the Unicode consortium also introduced some minor incompatibilities with Microsoft's 16-bit code set.
So Unicode can be a series of 16-bit integers. That's a wchar_t string. ASCII text now has zero bytes in the high byte of each character, so you can't pass a wide string to a function expecting ASCII. Since 16 bits was nearly but not quite enough, a 32-bit Unicode set was also produced.
However, saving Unicode to a file created problems: was it 16-bit or 32-bit? And was it big-endian or little-endian? So a flag at the start of the data (a byte order mark) was proposed to remedy this. The problem was that the file contents, memory-wise, no longer matched the string contents.
C++'s std::string was templated so it could use plain chars or one of the wide types, in practice almost always Microsoft's 16-bit near-Unicode encoding.
UTF-8 was invented to come to the rescue. It is a multi-byte, variable-length encoding which exploits the fact that ASCII is only 7 bits: if the high bit is set, the character occupies two, three, or four bytes. A very large number of strings are English text or mainly human-readable numbers, so essentially ASCII, and such strings are identical in ASCII and in UTF-8, which makes life a whole lot easier; you have no byte-order problems. You do have the problem that you must decode the UTF-8 into code points with a not entirely trivial function, and remember to advance your read position by the correct number of bytes.
UTF-8 is really the answer, but the other encodings are still in use and you will come across them.
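As a rough illustration of that decoding step, here is a minimal sketch of a UTF-8 decoder (no validation of malformed sequences, so not production code):
#include <cstddef>
#include <string>

// Decode the UTF-8 sequence starting at s[i] and advance i past it.
// No validation of malformed input; this is a sketch only.
char32_t decode_utf8(const std::string& s, std::size_t& i)
{
    unsigned char b = s[i++];
    if (b < 0x80) return b;                              // single-byte (ASCII) character
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;   // number of continuation bytes
    char32_t cp = b & (0x3F >> extra);                   // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (s[i++] & 0x3F);                // append 6 bits per continuation byte
    return cp;
}
// Feeding it the bytes d1 81 (the first character of "сапоги") yields 0x441, i.e. U+0441.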

Converting a uint8_t to its binary representation

I have a variable of type uint8_t which I'd like to serialize and write to a file (which should be quite portable, at least for Windows, which is what I'm aiming at).
Trying to write it to a file in its binary form, I came across this working snippet:
uint8_t m_num = 3;
unsigned int s = (unsigned int)(m_num & 0xFF);
file.write((wchar_t*)&s, 1); // file = std::wofstream
First, let me make sure I understand what this snippet does - it takes my var (which is basically an unsigned char, 1 byte long), converts it into an unsigned int (which is 4 bytes long, and not so portable), and using & 0xFF "extracts" only the least significant byte.
Now, there are two things I don't understand:
Why convert it into unsigned int in the first place? Why can't I simply do something like
file.write((wchar_t*)&m_num, 1); or reinterpret_cast<wchar_t *>(&m_num)? (Ref)
How would I serialize a longer type, say a uint64_t (which is 8 bytes long)? unsigned int may or may not be enough here.
uint8_t is 1 byte, same as char
wchar_t is 2 bytes on Windows and 4 bytes on Linux. It also depends on endianness. You should avoid wchar_t if portability is a concern.
You can just use std::ofstream. Windows has an additional std::ofstream constructor which accepts a UTF-16 (wchar_t) file name. This way your code is compatible with Windows UTF-16 filenames and you can still use std::fstream. For example:
int i = 123;
std::ofstream file(L"filename_in_unicode.bin", std::ios::binary);
file.write((char*)&i, sizeof(i)); //sizeof(int) is 4
file.close();
...
std::ifstream fin(L"filename_in_unicode.bin", std::ios::binary);
fin.read((char*)&i, 4); // output: i = 123
This is relatively simple because it's only storing integers. This will work on different Windows systems, because Windows is always little-endian, and int size is always 4.
But some systems are big-endian; you would have to deal with that separately.
If you use formatted I/O, for example fout << 123456, then the integer will be stored as the text "123456". Formatted I/O is portable, but it takes a little more disk space and can be a little slower.
It's compatibility versus performance. If you have large amounts of data (several megabytes or more) and you can deal with compatibility issues in the future, then go ahead and write raw bytes. Otherwise it's easier to use formatted I/O. The performance difference is usually not measurable.
It is impossible to write uint8_t values to a wofstream because a wofstream only writes wide characters and doesn't handle binary values at all.
If what you want to do is to write a wide character representing a code point between 0 and 255, then your code is correct.
If you want to write binary data to a file then your nearest equivalent is ofstream, which will allow you to write bytes.
To answer your questions:
wofstream::write writes wide characters, not bytes. If you reinterpret the address of m_num as the address of a wide character, you will be writing a 16-bit or 32-bit (depending on platform) wide character of which the first byte (that is, the least significant or most significant, depending on platform) is the value of m_num and the remaining bytes are whatever happens to occur in memory after m_num. Depending on the character encoding of the wide characters, this may not even be a valid character. Even if valid, it is largely nonsense. (There are other possible problems if wofstream::write expects a wide-character-aligned rather than a byte-aligned input, or if m_num is immediately followed by unreadable memory).
If you use wofstream then this is a mess, and I shan't address it. If you switch to a byte-oriented ofstream then you have two choices. 1. If you will only ever be reading the file on the same system, file.write((char*)&myint64value, sizeof(myint64value)) will work. The sequence in which the bytes of the 64-bit value are written is implementation-dependent, but the same sequence will be used when you read them back, so this doesn't matter. Don't try to do something analogous with wofstream, because it's dangerous! 2. Extract each of the 8 bytes of myint64value separately (shift right by a multiple of 8 bits and then take the bottom 8 bits) and write each one. This is fully portable because you control the order in which the bytes are written.
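For option 2, a minimal sketch of such byte-by-byte serialization of a uint64_t in explicit little-endian order (the stream is assumed to be opened in binary mode; the function names are just illustrative):
#include <cstdint>
#include <fstream>

// Write a uint64_t as 8 bytes, least significant byte first (explicit little-endian).
void write_u64_le(std::ofstream& out, std::uint64_t v)
{
    for (int i = 0; i < 8; ++i) {
        char b = static_cast<char>((v >> (8 * i)) & 0xFF);
        out.write(&b, 1);
    }
}

// Read it back in the same explicit byte order.
std::uint64_t read_u64_le(std::ifstream& in)
{
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i) {
        unsigned char b = 0;
        in.read(reinterpret_cast<char*>(&b), 1);
        v |= static_cast<std::uint64_t>(b) << (8 * i);
    }
    return v;
}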

Does the g++ 4.8.2 compiler support Unicode characters?

Consider the following statements -
cout<<"\u222B";
int a='A';
cout<<a;
The first statement prints an integration sign (the character equivalent to the Unicode code point) whereas the second cout statement prints the ASCII value 65.
So I want to ask two things -
1) If my compiler supports the Unicode character set, then why is it using the ASCII character set and showing the ASCII values of the characters?
2) With reference to this question - what is the difference between defining a 'byte' in terms of computer memory and in terms of C++?
Does my compiler implement a 16-bit or 32-bit byte? If so, then why is the value of CHAR_BIT set to 8?
In answer to your first question, the bottom 128 code points of Unicode are ASCII. There's no real distinction between the two.
The reason you're seeing 65 is because the thing you're outputting (a) is an int rather than a char (it may have started as a char but, by putting it into a, you modified how it would be treated in future).
For your second question, a byte is a char, at least as far as the ISO C and C++ standards are concerned. If CHAR_BIT is defined as 8, that's how wide your char type is.
However, you should keep in mind the difference between Unicode code points and Unicode representations (such as UTF-8). Having CHAR_BIT == 8 will still allow Unicode to work if UTF-8 representation is used.
My advice would be to capture the output of your program with a hex dump utility; you may well find the Unicode character is coming out as e2 88 ab, which is the UTF-8 representation of U+222B. It will then be interpreted by something outside of the program (e.g., the terminal program) to render the correct glyph(s):
#include <iostream>
using namespace std;
int main() { cout << "\u222B\n"; }
Running that program above shows what's being output:
pax> g++ -o testprog testprog.cpp ; ./testprog
∫
pax> ./testprog | hexdump
0000000 e2 88 ab 0a
You could confirm that by generating the same UTF-8 byte sequence in a different way:
pax> printf "\xe2\x88\xab\n"
∫
There are several different questions/issues here:
As paxdiablo pointed out, you're seeing "65" because you're outputting "a" (value 'A' = ASCII 65) as an "int".
Yes, gcc supports Unicode source files: -finput-charset=OPTION
The final issue is whether the C++ compiler treats your "strings" as 8-bit ASCII or n-bit Unicode.
C++11 added explicit support for Unicode strings and string literals, encoded as UTF-8, UTF-16 big endian, UTF-16 little endian, UTF-32 big endian and UTF-32 little endian:
How well is Unicode supported in C++11?
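For example, a tiny sketch of the C++11 literal forms (compiled as C++11/14/17; in C++20 the type of the u8 literal changes to char8_t):
int main()
{
    const char*     s8  = u8"\u222B";   // UTF-8 encoded: the bytes e2 88 ab
    const char16_t* s16 = u"\u222B";    // UTF-16 encoded: the single code unit 0x222B
    const char32_t* s32 = U"\u222B";    // UTF-32 encoded: the single code unit 0x0000222B
    (void)s8; (void)s16; (void)s32;     // silence unused-variable warnings
}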
PS:
As far as language support for Unicode:
Java was designed from the ground up for Unicode.
Unfortunately, at the time that meant only UTF-16. Java 5 supported Unicode 4.0, Java 7 Unicode 6.0, and the current Java 8 supports Unicode 6.2.
.Net is newer. C#, VB.Net and C++/CLI all fully support Unicode 4.0.
Newer versions of .Net support newer versions of Unicode. For example, .Net 4.0 supports Unicode 5.1 (see: What version of Unicode is supported by which .NET platform and on which version of Windows in regards to character classes?).
Python3 also supports Unicode 4.0: http://www.diveintopython3.net/strings.html
First of all, sorry for my English if it has mistakes.
A C++ byte is any defined amount of bits large enough to hold every character of a basic set specified by the standard. This required set of characters is a subset of ASCII, and that previously defined "amount of bits" must be the memory unit for char, the tiniest memory atom of C++. Every other type must be a multiple of sizeof(char) (any C++ value is a bunch of chars contiguously stored in memory).
So, sizeof(char) must be 1 by definition, because it is the memory measurement unit of C++. Whether that 1 means 1 physical byte or not is an implementation issue, but it is universally accepted as 1 byte.
What I don't understand is what you mean by a 16-bit or 32-bit byte.
The other related question is about the encoding your compiler applies to your source texts, literal strings included. A compiler, if I'm not wrong, normalizes each translation unit (source code file) to an encoding of its choice to handle the file.
I don't really know what happens under the hood, but perhaps you have read something somewhere about source-file/internal encodings and 16-bit/32-bit encodings, and it has all blended together in your head. I'm still a little confused about it myself, though.

What's the standard-defined endianness of std::wstring?

I know the UTF-16 has two types of endiannesses: big endian and little endian.
Does the C++ standard define the endianness of std::wstring, or is it implementation-defined?
If it is standard-defined, which page of the C++ standard provides the rules on this issue?
If it is implementation-defined, how do I determine it, e.g. under VC++? Does the compiler guarantee that the endianness of std::wstring depends strictly on the processor?
I have to know this because I want to send a UTF-16 string to others, and I must add the correct BOM at the beginning of the UTF-16 string to indicate its endianness.
In short: Given a std::wstring, how should I reliably determine its endianness?
Endianness is MACHINE dependent, not language dependent. Endianness is defined by the processor and how it arranges data going into and out of memory. When dealing with wchar_t (which is wider than a single byte), the processor itself, on a read or write, arranges the multiple bytes as it needs to in order to read or write them back to RAM again. Code simply sees it as the 16-bit (or larger) word as represented in a processor-internal register.
For determining endianness on your own (if that is really what you want to do), you could try writing a KNOWN 32-bit (unsigned int) value out to RAM, then reading it back using a char pointer and looking at the ordering that comes back.
It would look something like this:
unsigned int aVal = 0x11223344;
char * myValReadBack = (char *)(&aVal);
if(*myValReadBack == 0x11) printf("Big endian\r\n");
else printf("Little endian\r\n");
I'm sure there are other ways, but something like the above should work; check my little versus big though :-)
Further, until Windows RT, VC++ really only compiled for Intel-type processors, which really only have one endianness.
It is implementation-defined. wstring is just a string of wchar_t, and that can be any byte ordering, or for that matter, any old size.
wchar_t is not required to be UTF-16 internally and UTF-16 endianness does not affect how wchar's are stored, it's a matter of saving and reading it.
You have to use an explicit procedure of converting wstring to a UTF-16 bytestream before sending it anywhere. Internal endianness of wchar is architecture-dependent and it's better to use some opaque interfaces for converting than try to convert it manually.
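As a hedged sketch of such an explicit procedure, assuming every wchar_t value in the string is a code point at or below U+FFFF (so no surrogate handling), one could emit UTF-16LE bytes with a BOM like this:
#include <cstdint>
#include <string>
#include <vector>

// Serialize a wstring as UTF-16LE with a BOM, assuming every wchar_t value
// is a code point at or below U+FFFF (no surrogate handling; sketch only).
std::vector<std::uint8_t> to_utf16le_bytes(const std::wstring& ws)
{
    std::vector<std::uint8_t> bytes;
    bytes.push_back(0xFF);                                    // BOM in little-endian order
    bytes.push_back(0xFE);
    for (wchar_t wc : ws) {
        std::uint16_t u = static_cast<std::uint16_t>(wc);
        bytes.push_back(static_cast<std::uint8_t>(u & 0xFF)); // low byte first
        bytes.push_back(static_cast<std::uint8_t>(u >> 8));   // then high byte
    }
    return bytes;
}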
For the purposes of sending the correct BOM, you don't need to know the endianness. Just use the code \uFEFF. That will be big-endian or little-endian depending on the endianness of your implementation. You don't even need to know whether your implementation is UTF-16 or UTF-32. As long as it is some Unicode encoding, you'll end up with the appropriate BOM.
Unfortunately, neither wchars nor wide streams are guaranteed to be unicode.

Multibyte character constants and bitmap file header type constants

I have some existing code that I've used to write out an image to a bitmap file. One of the lines of code looks like this:
bfh.bfType='MB';
I think I probably copied that from somewhere. One of the other devs says to me "that doesn't look right, isn't it supposed to be 'BM'?" Anyway it does seem to work ok, but on code review it gets refactored to this:
bfh.bfType=*(WORD*)"BM";
A google search indicates that most of the time, the first line seems to be used, while some of the time people will do this:
bfh.bfType=0x4D42;
So what is the difference? How can they all give the correct result? What does the multi-byte character constant mean anyway? Are they the same really?
All three are (probably) equivalent, but for different reasons.
bfh.bfType=0x4D42;
This is the simplest to understand, it just loads bfType with a number that happens to represent ASCII 'M' in bits 8-15 and ASCII 'B' in bits 0-7. If you write this to a stream in little-endian format, then the stream will contain 'B', 'M'.
bfh.bfType='MB';
This is essentially equivalent to the first statement -- it's just a different way of expressing an integer constant. It probably depends on the compiler exactly what it does with it, but it will probably generate a value according to the endian-ness of the machine you compile on. If you compile and execute on a machine of the same endian-ness, then when you write the value out on the stream you should get 'B', 'M'.
bfh.bfType=*(WORD*)"BM";
Here, the "BM" causes the compiler to create a block of data that looks like 'B', 'M', '\0' and get a char* pointing to it. This is then cast to WORD* so that when it's dereferenced it will read the memory as a WORD. Hence it reads the 'B', 'M' into bfType in whatever endian-ness the machine has. Writing it out using the same endian-ness will obviously put 'B', 'M' on your stream. So long as you only use bfType to write out to the stream this is the most portable version. However, if you're doing any comparisons/etc with bfType then it's probably best to pick an endian-ness for it and convert as necessary when reading or writing the value.
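A quick sketch demonstrating the point on a typical little-endian build (uint16_t stands in for WORD; the multi-character constant 'MB' has an implementation-defined value, as noted above):
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    std::uint16_t a = 0x4D42;      // 'M' = 0x4D, 'B' = 0x42
    std::uint16_t b = 'MB';        // multi-character constant: implementation-defined value
    std::uint16_t c;
    std::memcpy(&c, "BM", 2);      // same idea as *(WORD*)"BM", without the aliasing cast
    unsigned char first[2];
    std::memcpy(first, &a, 2);     // the bytes that would be written to a stream
    std::printf("%04X %04X %04X, bytes on this machine: %02X %02X\n",
                (unsigned)a, (unsigned)b, (unsigned)c, (unsigned)first[0], (unsigned)first[1]);
    // A typical little-endian build prints: 4D42 4D42 4D42, bytes on this machine: 42 4D
}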
I did not find the API documentation, but according to http://cboard.cprogramming.com/showthread.php?t=24453, bfType is a field of the bitmap header. A value of "BM" would most likely mean "bitmap".
0x4D42 is a hexadecimal value (0x4D for 'M' and 0x42 for 'B'). In the little-endian way of writing (least significant byte first), that would be the same as "BM" (not "MB"). If it also works with 'MB' then probably some default value is taken.
Addendum to tehvan's post:
From Wikipedia's entry on BMP:
File header
Note that the first two bytes of the BMP file format (thus the BMP header) are stored in big-endian order. This is the magic number 'BM'. All of the other integer values are stored in little-endian format (i.e. least-significant byte first).
So it looks like the refactored code is correct according to the specification.
Have you tried opening the file with 'MB' as the magic number with a few different photo-editors?