Are there delimiter bytes for UTF8 characters? - c++

If I have a byte array that contains UTF8 content, how would I go about parsing it? Are there delimiter bytes that I can split off to get each character?

Take a look here...
http://en.wikipedia.org/wiki/UTF-8
If you're looking to identify the boundary between characters, what you need is in the table in "Description".
The only bytes with the high bit clear are the ASCII subset 0..127, each encoded in a single byte. For all non-ASCII codepoints, the 2nd byte onwards has "10" in the highest two bits. The leading byte of a codepoint never has that - its high bits indicate the number of bytes - but there's some redundancy: you could equally watch for the next byte that doesn't start with "10" to find the start of the next codepoint.
0xxxxxxx : ASCII
10xxxxxx : 2nd, 3rd or 4th byte of code
11xxxxxx : 1st byte of code, further high bits indicating number of bytes
A codepoint in Unicode isn't necessarily the same as a character. There are modifier codepoints (such as accents), for instance.
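A minimal sketch of that boundary rule (assuming the input is already valid UTF-8 and doing no validation; splitUtf8 is just an illustrative name): start a new code point at every byte that does not match 10xxxxxx.

#include <string>
#include <vector>

// Split a UTF-8 byte string into one std::string per code point by
// treating every byte that does NOT match 10xxxxxx as the start of a
// new code point. Assumes the input is valid UTF-8; no validation done.
std::vector<std::string> splitUtf8(const std::string& bytes) {
    std::vector<std::string> codepoints;
    for (unsigned char c : bytes) {
        if ((c & 0xC0) != 0x80)          // lead byte (ASCII or 11xxxxxx)
            codepoints.emplace_back();   // start a new code point
        codepoints.back() += static_cast<char>(c);
    }
    return codepoints;
}

Each element of the result is one code point; as noted above, that is not always one visible character.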

Bytes that have their most significant bit set to 0 are plain ASCII characters. Bytes that have their most significant bit set to 1 are part of a multi-byte UTF-8 character.
The first byte of every multi-byte UTF-8 character also has its second-most-significant bit set to 1, so it starts with the bits 11. Every following byte that belongs to the same character starts with 10 instead.
The number of leading 1 bits in that first byte also tells you how many bytes the character occupies in total.
For more details, see the Wikipedia page for UTF-8.

Related

Loop through Unicode string as character

With the following string, the size is incorrectly output. Why is this, and how can I fix it?
string str = " ██████";
cout << str.size();
// outputs 19 rather than 7
I'm trying to loop through str character by character so I can read it into a vector<string>, which should have a size of 7, but I can't do this since the above code outputs 19.
TL;DR
The size() and length() members of basic_string return the size in code units of the underlying character type, not the number of visible characters. To get the expected number:
Use UTF-16 with the u prefix for very simple strings that contain no non-BMP characters, no combining characters and no joining characters
Use UTF-32 with the U prefix for very simple strings that don't contain any combining or joining characters
Normalize the string and count for arbitrary Unicode strings
" ██████" is a space followed by a series of 6 U+2588 characters. Your compiler seems to be using UTF-8 for std::string. UTF-8 is a variable-length encoding and many letters are encoded using multiple bytes (because obviously you can't encode more than 256 characters with just one byte). In UTF-8 code points between U+0800 and U+FFFF are encoded by 3 bytes. Therefore the length of the the string in UTF-8 is 1 + 6*3 = 19 bytes.
You can check with any Unicode converter and see that the string is encoded as 20 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 in UTF-8, and you can also loop through each byte of your string to check.
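For instance, a quick loop like this (a sketch, assuming your source file and execution character set are UTF-8) dumps each byte so you can see the 3-byte E2 96 88 sequences yourself:

#include <cstdio>
#include <string>

int main() {
    std::string str = " ██████";
    for (unsigned char c : str)
        std::printf("%02X ", c);        // prints 20 E2 96 88 E2 96 88 ...
    std::printf("\n%zu\n", str.size()); // 19
}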
If you want the total number of visible characters in the string, then it's a lot trickier and churill's solution doesn't work. Read this example from Twitter:
If you use anything beyond the most basic letters, numbers, and punctuation the situation gets more confusing. While many people use multi-byte Kanji characters to exemplify these issues, Twitter has found that accented vowels cause the most confusion because English speakers simply expect them to work. Take the following example: the word “café”. It turns out there are two byte sequences that look exactly the same, but use a different number of bytes:
café 0x63 0x61 0x66 0xC3 0xA9 Using the “é” character, called the “composed character”.
café 0x63 0x61 0x66 0x65 0xCC 0x81 Using the combining diacritical, which overlaps the “e”
You need a Unicode library like ICU to normalize the string and count. Twitter, for example, uses Normalization Form C (NFC).
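A sketch of what that looks like with ICU4C, if it's available (using icu::UnicodeString and icu::Normalizer2; note that even after NFC this counts code points, which still isn't the same as grapheme clusters in the general case):

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return 1;

    // "cafe" + U+0301 COMBINING ACUTE ACCENT: the long form of "café"
    icu::UnicodeString decomposed =
        icu::UnicodeString::fromUTF8("\x63\x61\x66\x65\xCC\x81");

    icu::UnicodeString nfcForm = nfc->normalize(decomposed, status);
    if (U_FAILURE(status)) return 1;

    std::cout << decomposed.countChar32() << '\n'; // 5 code points before NFC
    std::cout << nfcForm.countChar32()    << '\n'; // 4 code points after NFC
}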
Edit:
Since you're only interested in box-drawing characters, which don't lie outside the BMP and don't contain any combining characters, UTF-16 and UTF-32 will work. Like std::string, std::wstring is also a basic_string and doesn't have a mandatory encoding. In most implementations it's either UTF-16 (Windows) or UTF-32 (*nix), so you may use it, but it's unreliable and depends on the source code encoding. The better way is to use std::u16string (std::basic_string<char16_t>) and std::u32string (std::basic_string<char32_t>). They'll work regardless of the system and the encoding of the source file.
std::wstring wstr = L" ██████";
std::u16string u16str = u" ██████";
std::u32string u32str = U" ██████";
std::cout << wstr.size(); // may work, returns the number of wchar_t units
std::cout << u16str.size(); // always returns the number of UTF-16 code units
std::cout << u32str.size(); // always returns the number of UTF-32 code units
In case you're interested in how to work this out for arbitrary Unicode strings, continue reading below.
The “café” issue mentioned above raises the question of how you count the characters in the Tweet string “café”. To the human eye the length is clearly four characters. Depending on how the data is represented this could be either five or six UTF-8 bytes. Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation. Therefore, Twitter does count “café” as four characters, no matter which representation is sent.
[...]
Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text. This type of normalization favors the use of a fully combined character (0xC3 0xA9 from the café example) over the long-form version (0x65 0xCC 0x81). Twitter also counts the number of codepoints in the text rather than UTF-8 bytes. The 0xC3 0xA9 from the café example is one codepoint (U+00E9) that is encoded as two bytes in UTF-8, whereas 0x65 0xCC 0x81 is two codepoints encoded as three bytes
Twitter - Counting characters
See also
When "Zoë" !== "Zoë". Or why you need to normalize Unicode strings
Getting Twitter characters count
Why is the length of this string longer than the number of characters in it?
std::string only contains single-byte chars (usually 8 bits, here holding UTF-8 code units); you need wchar_t and std::wstring to achieve what you want:
std::wstring str = L" ██████";
std::cout << str.size();
This prints 7 (one space and 6 Unicode characters). Notice the L before the string literal, so it will be interpreted as a wide string.

Reading a file with unknown UTF8 strings and known ASCII mixed

Sorry for the confusing title, I am not really sure how to word this myself. I will try and keep my question as simple as possible.
I am working on a system that keeps a "catalog" of strings. This catalog is just a simple flat text file that is indexed in a specific manner. The syntax of the files has to be in ASCII, but the contents of the strings can be UTF8.
Example of a file:
{
    STRINGS: {
        THISHASTOBEASCII: "But this is UTF8"
        HELLO1: "Hello, world"
        HELLO2: "您好"
    }
}
Reading a UTF8 file isn't the problem here; I don't really care what's between the quotation marks, as it's simply copied to other places and no changes are made to the strings.
The problem is that I need to parse the bracket and the labels of the strings to properly store the UTF8 strings in memory. How would I do this?
EDIT: Just realised I'm overcomplicating it. I should just copy and store whatever is between the two "", as UTF8 can be read into bytes >_<. Marked for closing.
You can do this in the UTF-8 processing method you mentioned.
One-byte UTF-8 characters also follow the ASCII rule.
A 1-byte UTF-8 character looks like 0XXXXXXX. For multi-byte UTF-8 characters, the first byte starts with as many 1 bits as there are bytes in total, followed by a 0, and every other byte starts with 10.
Like 3 bytes: 1110XXXX 10XXXXXX 10XXXXXX
4 bytes: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
When you go through the character array, just check each byte you read. You will know whether it's ASCII ((byte & 0x80) is false) or part of a multi-byte character ((byte & 0x80) is true).
Note: every code point in the Basic Multilingual Plane (up to U+FFFF, i.e. 16 bits of payload) fits in at most 3 UTF-8 bytes; code points above that take 4 bytes. (See the counts of 'X' listed above.)
ASCII is a subset of UTF-8, and UTF-8 can be processed using standard 8-bit string parsing functions. So the entire file can be processed as UTF-8. Just strip off the portions you do not need.
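A rough sketch of that approach (parseCatalog is a hypothetical helper, and it assumes the exact format shown in the question: ASCII labels followed by a colon and a double-quoted value with no escaped quotes):

#include <map>
#include <string>

// Parse LABEL: "value" pairs from the catalog text. The structural
// syntax (labels, colons, braces, quotes) is ASCII, so we can scan the
// buffer byte by byte; whatever sits between the quotes is copied
// verbatim, so UTF-8 content passes through untouched.
// Assumes values contain no embedded '"' characters.
std::map<std::string, std::string> parseCatalog(const std::string& text) {
    std::map<std::string, std::string> entries;
    std::size_t pos = 0;
    while ((pos = text.find('"', pos)) != std::string::npos) {
        std::size_t end = text.find('"', pos + 1);
        if (end == std::string::npos) break;   // unterminated string

        // Walk backwards from the opening quote to recover the ASCII label.
        std::size_t colon = text.rfind(':', pos);
        if (colon == std::string::npos) break;
        std::size_t labelStart = text.find_last_of(" \t\n{", colon - 1) + 1;
        std::string label = text.substr(labelStart, colon - labelStart);

        entries[label] = text.substr(pos + 1, end - pos - 1); // raw UTF-8 bytes
        pos = end + 1;
    }
    return entries;
}

Because every byte of a multi-byte UTF-8 character has its high bit set, a '"' (0x22) byte can only ever be a real quotation mark, which is why scanning byte by byte is safe here.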

How to find whether byte read is japanese or english?

I have an array which contains Japanese and ASCII characters.
I am trying to find whether the character read is an English character or a Japanese character.
In order to solve this I did the following:
Read the first byte; if the multi-character width is not equal to one, move the pointer to the next byte.
Now display the two bytes together and report that a Japanese character has been read.
If the multi-character width is equal to one, display the byte and report that an English character has been read.
The above algorithm works fine but fails for the half-width forms of Japanese, e.g. シ, ァ etc., as they are only one byte.
How can I find out whether the characters are Japanese or English?
Note: what I tried
I read on the web that the first byte will tell whether it is Japanese or not, which I have covered in step 1 of my algorithm. But it won't work for half-width characters.
EDIT:
In the problem I was solving, I include the control character 0x80 at the start and end of my characters to identify the string of characters.
I wrote the following to identify the end control character.
cntlchar.....(my characters , can be japnese).....cntlchar
if ((buf[*p + 1] & 0x80) && (mbMBCS_charWidth(&buf[*p]) == 1)) {
    // end of control characters reached
} else {
    (*p)++;
}
It worked fine for English but didn't work for Japanese half-width characters.
How can I handle this?
Your data must be using Windows Codepage 932. That is a guess, but examining the codepoints shows what you are describing.
The codepage shows that characters in the range 00 to 7F are "English" (a better description is "7-bit ASCII"), characters in the ranges 81 to 9F and E0 to FF are the first byte of a multibyte code, and everything between A1 and DF is a half-width Kana character.
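If it really is Codepage 932, a byte classifier along those lines might look like this (a sketch using exactly the ranges above; the names are made up for illustration):

// Classify one byte of (assumed) Codepage 932 text using the ranges
// described above. A lead byte must be combined with the byte that
// follows it to form one full-width character.
enum class Cp932Byte { Ascii, HalfWidthKana, LeadByte, Other };

Cp932Byte classifyCp932(unsigned char b) {
    if (b <= 0x7F)                               return Cp932Byte::Ascii;
    if (b >= 0xA1 && b <= 0xDF)                  return Cp932Byte::HalfWidthKana;
    if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0)) return Cp932Byte::LeadByte;
    return Cp932Byte::Other;                     // 0x80 and 0xA0 are unused
}

Half-width Kana are single bytes, so when this returns HalfWidthKana you can report "Japanese" even though the character width is 1.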
For individual bytes this is impractical to impossible. For larger sets of data you could do statistical analysis on the bytes and see if they match known English or Japanese patterns. For example, vowels are very common in English text, while Japanese text would show quite different frequency patterns.
Things get more complicated than testing bits if your data includes accented characters.
If you're dealing with Shift-JIS data and Windows-1252 encoded text, ideally you just remap it to UTF-8. There's no standard way to identify text encoding within a text file, although things like MIME can help if added on externally as metadata.

Reading a UTF-8 Unicode file through non-unicode code

I have to read a text file which is Unicode with UTF-8 encoding and have to write this data to another text file. The file has tab-separated data in lines.
My reading code is C++ code without unicode support. What I am doing is reading the file line-by-line in a string/char* and putting that string as-is to the destination file. I can't change the code so code-change suggestions are not welcome.
What I want to know is that while reading line-by-line can I encounter a NULL terminating character ('\0') within a line since it is unicode and one character can span multiple bytes.
My thinking was that it is quite possible that a NULL terminating character could be encountered within a line. Your thoughts?
UTF-8 uses 1 byte for all ASCII characters, which have the same code values as in the standard ASCII encoding, and up to 4 bytes for other characters. The upper bits of each byte are reserved as control bits. For code points using more than 1 byte, the control bits are set in every byte, so none of those bytes can be zero.
Thus a '\0' byte will not appear in your UTF-8 file unless the text itself contains the NUL character.
Check Wikipedia for UTF-8
Very unlikely: all the bytes in a UTF-8 multi-byte sequence have the high bit set to 1.

C/C++: How to convert 6bit ASCII to 7bit ASCII

I have a set of 6 bits that represent a 7-bit ASCII character. How can I get the correct 7-bit ASCII code out of the 6 bits I have? Just append a zero and do a bitwise OR?
Thanks for your help.
Lennart
ASCII is inherently a 7-bit character set, so what you have is not "6-bit ASCII". What characters make up your character set? The simplest decoding approach is probably something like:
char From6Bit( char c6 ) {
    // array of all 64 characters that appear in your 6-bit set
    static const char SixBitSet[] = { 'A', 'B', ... };
    return SixBitSet[ c6 ];
}
A footnote: 6-bit character sets were quite popular on old DEC hardware, some of which, like the DEC-10, had a 36-bit architecture where 6-bit characters made some sense.
You must tell us what your 6-bit character set looks like; I don't think there is a standard.
The easiest way to do the reverse mapping would probably be to just use a lookup table, like so:
static const char sixToSeven[] = { ' ', 'A', 'B', ... };
This assumes that space is encoded as (binary) 000000, capital A as 000001, and so on.
You index into sixToSeven with one of your six-bit characters, and get the local 7-bit character back.
I can't imagine why you'd be getting old DEC-10/20 SIXBIT, but if that's what it is, then just add 32 (decimal). SIXBIT took the ASCII characters starting with space (32), so just add 32 to the SIXBIT character to get the ASCII character.
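If that's what you have, the whole conversion is a single offset (a sketch assuming DEC SIXBIT, where 000000 is space and 111111 is '_'):

// DEC SIXBIT maps 0..63 onto ASCII 32 (space) .. 95 ('_'),
// so converting back to 7-bit ASCII is just an offset.
char fromSixbit(unsigned char sixbit) {
    return static_cast<char>(sixbit + 32);
}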
The only recent 6-bit code I'm aware of is base64. This uses four 6-bit printable characters to store three 8-bit values (6x4 = 8x3 = 24 bits).
The 6-bit values are drawn from the characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
which are the values 0 thru 63. Four of these (say UGF4) are used to represent three 8-bit values.
UGF4 = 010100 000110 000101 111000
= 01010000 01100001 01111000
= Pax
If this is how your data is encoded, there are plenty of snippets around that will tell you how to decode it (and many languages have the encoder and decoder built in, or in an included library). Wikipedia has a good article for it here.
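To make the regrouping concrete, here is a bare-bones sketch that decodes one four-character base64 group into three bytes (decodeBase64Quad is illustrative only: no '=' padding or error handling):

#include <cstring>
#include <string>

// Decode exactly four base64 characters into three bytes by packing
// four 6-bit values into 24 bits and slicing them into octets.
// Assumes all four characters are valid base64 characters.
std::string decodeBase64Quad(const char quad[4]) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    unsigned int bits = 0;
    for (int i = 0; i < 4; ++i)
        bits = (bits << 6) |
               static_cast<unsigned>(std::strchr(alphabet, quad[i]) - alphabet);

    std::string out(3, '\0');
    out[0] = static_cast<char>((bits >> 16) & 0xFF);
    out[1] = static_cast<char>((bits >> 8) & 0xFF);
    out[2] = static_cast<char>(bits & 0xFF);
    return out;
}
// decodeBase64Quad("UGF4") == "Pax", matching the worked example above.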
If it's not base64, then you'll need to find out the encoding scheme. Some older schemes used other lookup methods, such as the shift-in/shift-out (SI/SO) codes for choosing a page within character sets, but I think that was more for selecting extended (e.g., Japanese DBCS) characters rather than normal ASCII characters.
If I were to give you the value of a single bit, and I claimed it was taken from Windows XP, could you reconstruct the entire OS?
You can't. You've lost information. There is no way to reconstruct that, unless you have some knowledge about what was lost. If you know that, say, the most significant bit was chopped off, then you can set that to zero, and you've reconstructed at least half the characters correctly.
If you know how 'a' and 'z' are represented in your 6-bit encoding, you might be able to guess at what was removed by comparing them to their 7-bit representations.