Reading a file with unknown UTF-8 strings and known ASCII mixed - C++

Sorry for the confusing title, I am not really sure how to word this myself. I will try and keep my question as simple as possible.
I am working on a system that keeps a "catalog" of strings. This catalog is just a simple flat text file that is indexed in a specific manner. The syntax of the files has to be in ASCII, but the contents of the strings can be UTF8.
Example of a file:
{
STRINGS: {
THISHASTOBEASCII: "But this is UTF8"
HELLO1: "Hello, world"
HELLO2: "您好"
}
}
Reading a UTF-8 file isn't the problem here; I don't really care what's between the quotation marks, as it's simply copied to other places and no changes are made to the strings.
The problem is that I need to parse the brackets and the labels of the strings to properly store the UTF-8 strings in memory. How would I do this?
EDIT: Just realised I'm overcomplicating it. I should just copy and store whatever is between the two quotation marks, as UTF-8 can be read as plain bytes >_<. Marked for closing.
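The approach from the edit can be sketched roughly like this (the function name `parseCatalog` and the parsing strategy are illustrative assumptions, not the asker's actual code): scan the ASCII syntax byte by byte, and copy everything between double quotes verbatim. The UTF-8 bytes inside the quotes are never inspected, so they survive untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Minimal sketch: parse a catalog of the form  LABEL: "value"  where the
// labels and punctuation are ASCII and the quoted values may be UTF-8.
// The quoted bytes are copied as-is; no decoding is needed.
std::map<std::string, std::string> parseCatalog(const std::string& text) {
    std::map<std::string, std::string> result;
    std::size_t pos = 0;
    while ((pos = text.find(':', pos)) != std::string::npos) {
        std::size_t next = text.find_first_not_of(" \t", pos + 1);
        if (next == std::string::npos) break;
        if (text[next] != '"') { pos = next; continue; }  // e.g. "STRINGS: {"
        std::size_t close = text.find('"', next + 1);
        if (close == std::string::npos) break;
        // Label: the ASCII identifier between the previous separator and ':'.
        std::size_t labelStart = text.find_last_of(" \n\t{", pos - 1);
        std::string label = text.substr(labelStart + 1, pos - labelStart - 1);
        result[label] = text.substr(next + 1, close - next - 1);  // verbatim bytes
        pos = close + 1;
    }
    return result;
}
```

Note that this sketch assumes values contain no escaped quotes; a real parser would need to handle those.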

You can do it in the UTF-8 processing method you mentioned.
One-byte UTF-8 characters follow the ASCII rule exactly.
A 1-byte UTF-8 sequence looks like 0XXXXXXX. In a multi-byte sequence, the first byte starts with as many 1 bits as there are bytes in the whole sequence, followed by a 0, and every following byte starts with 10.
For example, 3 bytes: 1110XXXX 10XXXXXX 10XXXXXX
4 bytes: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
When you go through the character array, just check each char you read. You will know whether it's ASCII (c & 0x80 is false) or part of a multi-byte character (c & 0x80 is true).
Note: not all of Unicode fits in 3-byte UTF-8. A 3-byte sequence carries 16 payload bits (count the X's above), which covers the Basic Multilingual Plane, but Unicode extends to U+10FFFF, and code points beyond the BMP need 4 bytes.
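The byte check described above can be sketched as a small classifier (the function name is made up for illustration):

```cpp
#include <cassert>

// Classify a single byte of a UTF-8 stream, as described above.
// Returns 0 for a plain ASCII byte (0xxxxxxx), 1 for a continuation
// byte (10xxxxxx), or the total sequence length (2..4) for a lead byte.
int classifyUtf8Byte(unsigned char c) {
    if ((c & 0x80) == 0) return 0;     // 0xxxxxxx: ASCII
    if ((c & 0xC0) == 0x80) return 1;  // 10xxxxxx: continuation byte
    if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx: lead of 2-byte sequence
    if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx: lead of 3-byte sequence
    return 4;                          // 11110xxx: lead of 4-byte sequence
}
```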

ASCII is a subset of UTF-8, and UTF-8 can be processed using standard 8-bit string parsing functions. So the entire file can be processed as UTF-8. Just strip off the portions you do not need.

Related

How to use extended character set in reading ini file? (C++ lang.)

I face one little problem. I am from a country whose language uses an extended character set (specifically Latin Extended-A, due to characters like š, č, ť, ý, á, ...).
I have an ini file containing these characters and I would like to read them into my program. Unfortunately, it is not working with GetPrivateProfileStringW or ...A.
Here is part of the source code. I hope it will help someone find a solution, because I am getting a little desperate. :-)
SOURCE CODE:
char pcMyExtendedString[200];
GetPrivateProfileStringA(
    "CATEGORY_NAME",
    "SECTION_NAME",
    "error",
    pcMyExtendedString,
    200,
    PATH_TO_INI_FILE
);
INI FILE:
[CATEGORY_NAME]
SECTION_NAME= ľščťžýáíé
Characters ý, á, í, é are read correctly; they are from the Latin-1 Supplement block, and their hex values are correct (0xFD, 0xE1, 0xED, ...).
Characters ľ, š, č, ť, ž are read incorrectly; they are from the Latin Extended-A block, and their hex values are wrong (0xBE, 0x9A, 0xE8, ...). Expected are values like 0x013E, 0x0161, 0x010D, ...
How could this be done? Is it possible, or should I avoid these characters altogether?
GetPrivateProfileString doesn't do any character conversion. If the call succeeds, it gives you exactly what is in the file.
Since you want to have unicode characters, your file is probably in UTF-8 or UTF-16. If your file is UTF-8, you should be able to read it with GetPrivateProfileStringA, but it will give you a char array that will contain the correct UTF-8 characters (that is, not 0x013E, because 0x013E is not UTF-8).
If your file is UTF-16, then GetPrivateProfileStringW should work, and give you the UTF-16 codes (0x013E, 0x0161, 0x010D, ...) in a wchar_t array.
Edit: Actually your file is encoded in Windows-1250. This is a single byte encoding, so GetPrivateProfileStringA works fine, and you can convert it to UTF-16 if you want by using MultiByteToWideChar with 1250 as code page parameter.
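On Windows that conversion is a single MultiByteToWideChar call with 1250 as the code page. As a portable sketch of what that mapping does, here is a tiny lookup covering just the sample bytes from the question (a real converter covers the full Windows-1250 table; this function name and its scope are illustrative assumptions):

```cpp
#include <cassert>

// Sketch of the Windows-1250 -> UTF-16 mapping for the bytes mentioned
// in the question. MultiByteToWideChar(1250, ...) performs the complete
// mapping; this illustrative lookup handles only the three sample bytes.
wchar_t cp1250ToUtf16(unsigned char b) {
    switch (b) {
        case 0xBE: return 0x013E;  // ľ  LATIN SMALL LETTER L WITH CARON
        case 0x9A: return 0x0161;  // š  LATIN SMALL LETTER S WITH CARON
        case 0xE8: return 0x010D;  // č  LATIN SMALL LETTER C WITH CARON
        default:   return b < 0x80 ? b : 0xFFFD;  // ASCII passes through
    }
}
```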
Try saving the file in UTF-8 (code page 65001); most likely your file is currently in Western European (Windows), code page 1252.

UTF-8 encoded Japanese string in XML

I am trying to create a SOAP call with a Japanese string. The problem I face is that when I encode this string as UTF-8, it contains many control characters (e.g. 0x1B (Esc)). If I remove all such control characters to make the SOAP call valid, the Japanese content appears as garbage on the server side.
How can I create a valid SOAP request for Japanese characters? Any suggestion is highly appreciated.
I am using C++ with MS-DOM.
With Best Regards.
If I remember correctly, it's true: the first 32 Unicode code points are not allowed as characters in XML documents, even escaped with &#. I'm not sure whether they're allowed in HTML or not, but the server certainly thinks they're not allowed in your requests, and it gets the only meaningful vote.
I notice that your document claims to be encoded in iso-2022-jp, not utf-8. And indeed, the sequence of characters ESC $ B that appears in your document is valid iso-2022-jp. It indicates that the data is switching encodings (from ASCII to a 2-byte Japanese encoding called JIS X 0208-1983).
But somewhere in the process of constructing your request, something has seen that 0x1B byte and interpreted it as a character U+001B, not realising that it's intended as one byte in data that's already encoded in the document encoding. So, it has XML-escaped it as a "best effort", even though that's not valid XML.
Probably, whatever is serializing your XML document doesn't know that the encoding is supposed to be iso-2022-jp. I imagine it thinks it's supposed to be serializing the document as ASCII, ISO-Latin-1, or UTF-8, and the <meta> element means nothing to it (that's an HTML way of specifying the encoding anyway, it has no particular significance in XML). But I don't know MS-DOM, so I don't know how to correct that.
If you just remove the ESC characters from iso-2022-jp data, then you conceal the fact that the data has switched encodings, and so the decoder will continue to interpret all that 7nMK stuff as ASCII, when it's supposed to be interpreted as JIS X 0208-1983. Hence, garbage.
Something else strange -- the iso-2022-jp code to switch back to ASCII is ESC ( B, but I see |(B</font> in your data, when I'd expect the same thing to happen to the second ESC character as happened to the first: &#0x1B(B</font>. Similarly, $B#M#S(B and $BL#D+(B are mangled attempts to switch from ASCII to JIS X 0208-1983 and back, and again the ESC characters have just disappeared rather than being escaped.
I have no explanation for why some ESC characters have disappeared and one has been escaped, but it cannot be coincidence that what you generate looks almost, but not quite, like valid iso-2022-jp. I think iso-2022-jp is a 7 bit encoding, so part of the problem might be that you've taken iso-2022-jp data, and run it through a function that converts ISO-Latin-1 (or some other 8 bit encoding for which the lower half matches ASCII, for example any Windows code page) to UTF-8. If so, then this function leaves 7 bit data unchanged, it won't convert it to UTF-8. Then when interpreted as UTF-8, the data has ESC characters in it.
If you want to send the data as UTF-8, then first of all you need to actually convert it out of iso-2022-jp (to wide characters or to UTF-8, whichever your SOAP or XML library expects). Secondly you need to label it as UTF-8, not as iso-2022-jp. Finally you need to serialize the whole document as UTF-8, although as I've said you might already be doing that.
As pointed out by Steve Jessop, it looks like you have encoded the text as iso-2022-jp, not UTF-8. So the first thing to do is to check that and ensure that you have proper UTF-8.
If the problem still persists, consider encoding the text.
The simplest option is "hex encoding" where you just write the hex value of each byte as ASCII digits. e.g. the 0x1B byte becomes "1B", i.e. 0x31, 0x42.
If you want to be fancy you could use MIME or even UUENCODE.
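The "hex encoding" option described above can be sketched as follows (the function name is made up for illustration):

```cpp
#include <cassert>
#include <string>

// "Hex encoding": each byte becomes two ASCII hex digits, so the
// 0x1B (Esc) byte is transmitted as the characters '1' 'B' (0x31, 0x42).
std::string hexEncode(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char c : bytes) {
        out += digits[c >> 4];    // high nibble
        out += digits[c & 0x0F];  // low nibble
    }
    return out;
}
```

The receiver, of course, has to know to hex-decode the payload before interpreting it.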

How to get Unicode code points for character strings (UTF-8) in C or C++ (Linux)

I am working on an application in which I need to know the Unicode code point of each character in order to classify them: Chinese characters, Japanese characters (kanji, katakana, hiragana), Latin, Greek, etc.
The given string is in UTF-8 format.
Is there any way to get the Unicode code point for a UTF-8 character? For example:
Character '≠' has U+2260 Unicode value.
Character '建' has U+5EFA Unicode value.
UTF-8 is a variable-width encoding of Unicode. Each Unicode code point is encoded as one to four chars.
To decode a char* string and extract a single code point, read one byte. If its most significant bit is clear, the byte itself is the code point. Otherwise the code point is encoded on multiple chars, and the number of leading 1 bits in that first byte tells you how many chars encode it.
This table explain how to make the conversion:
UTF-8 (char*) | Unicode (21 bits)
------------------------------------+--------------------------
0xxxxxxx | 00000000000000000xxxxxxx
------------------------------------+--------------------------
110yyyyy 10xxxxxx | 0000000000000yyyyyxxxxxx
------------------------------------+--------------------------
1110zzzz 10yyyyyy 10xxxxxx | 00000000zzzzyyyyyyxxxxxx
------------------------------------+--------------------------
11110www 10zzzzzz 10yyyyyy 10xxxxxx | 000wwwzzzzzzyyyyyyxxxxxx
Based on that, the code is relatively straightforward to write. If you don't want to write it, you can use a library that does the conversion for you. There are many available under Linux : libiconv, icu, glib, ...
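A minimal sketch of such a decoder, following the table above (the function name is invented; a production decoder must also validate continuation bytes and reject overlong sequences):

```cpp
#include <cassert>
#include <cstdint>

// Decode the first code point of a well-formed UTF-8 string, per the
// table above. No validation is performed on the continuation bytes.
uint32_t decodeFirstCodePoint(const char* s) {
    unsigned char c = static_cast<unsigned char>(s[0]);
    if (c < 0x80) return c;                // 0xxxxxxx: 1-byte sequence
    int len = (c & 0xE0) == 0xC0 ? 2       // 110yyyyy
            : (c & 0xF0) == 0xE0 ? 3 : 4;  // 1110zzzz / 11110www
    uint32_t cp = c & (0x3F >> (len - 1)); // payload bits of the lead byte
    for (int i = 1; i < len; ++i)          // 6 payload bits per continuation
        cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    return cp;
}
```

With the decoded code point in hand, classification is a matter of range checks (e.g. CJK Unified Ideographs are U+4E00..U+9FFF).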
libiconv can help you with converting the UTF-8 string to UTF-16 or UTF-32. UTF-32 would be the safest option if you really want to support every possible Unicode code point.

Are there delimiter bytes for UTF8 characters?

If I have a byte array that contains UTF8 content, how would I go about parsing it? Are there delimiter bytes that I can split off to get each character?
Take a look here...
http://en.wikipedia.org/wiki/UTF-8
If you're looking to identify the boundary between characters, what you need is in the table in "Description".
The only bytes with the high bit zero are the ASCII subset 0..127, encoded in a single byte. All the non-ASCII code points have their 2nd byte onwards with "10" in the highest two bits. The leading byte of a code point never has that; its high bits indicate the number of bytes, but there's some redundancy: you could equally watch for the next byte that doesn't start with "10" to find the next code point.
0xxxxxxx : ASCII
10xxxxxx : 2nd, 3rd or 4th byte of code
11xxxxxx : 1st byte of code, further high bits indicating number of bytes
A codepoint in unicode isn't necessarily the same as a character. There are modifier codepoints (such as accents), for instance.
Bytes that have the first bit set to 0 are plain ASCII characters. Bytes that have their first bit set to 1 are part of a multi-byte UTF-8 character.
The first byte of every multi-byte UTF-8 character also has its second bit set to 1, so the byte starts with the bits 11. Each following byte that belongs to the same character starts with 10 instead.
The first byte of each multi-byte character additionally indicates how many of the following bytes belong to the character, via the number of 1 bits among its most significant bits.
For more details, see the Wikipedia page for UTF-8.
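So there are no delimiter bytes; a new character simply starts at every byte whose top two bits are not "10". Splitting a byte array on those boundaries can be sketched as (function name invented for illustration):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a UTF-8 byte string into per-code-point byte sequences.
// A new character starts at every byte that is NOT a continuation
// byte, i.e. whose top two bits are not "10".
std::vector<std::string> splitUtf8(const std::string& bytes) {
    std::vector<std::string> chars;
    for (char raw : bytes) {
        unsigned char c = static_cast<unsigned char>(raw);
        if ((c & 0xC0) != 0x80 || chars.empty())
            chars.emplace_back();  // boundary: start a new character
        chars.back() += raw;       // continuation: append to current one
    }
    return chars;
}
```

As noted above, each element is a code point, which is not always a full user-visible character (combining accents, for example, are separate code points).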

Reading a UTF-8 Unicode file through non-unicode code

I have to read a text file which is Unicode with UTF-8 encoding and have to write this data to another text file. The file has tab-separated data in lines.
My reading code is C++ code without unicode support. What I am doing is reading the file line-by-line in a string/char* and putting that string as-is to the destination file. I can't change the code so code-change suggestions are not welcome.
What I want to know is that while reading line-by-line can I encounter a NULL terminating character ('\0') within a line since it is unicode and one character can span multiple bytes.
My thinking was that it is quite possible that a NULL terminating character could be encountered within a line. Your thoughts?
UTF-8 uses 1 byte for all ASCII characters, which keep the same code values as in standard ASCII, and 2 to 4 bytes for other characters. The upper bits of each byte are reserved as control bits; for code points using more than 1 byte, those control bits are set.
Thus there will be no 0 byte inside any character in your UTF-8 file.
Check Wikipedia for UTF-8
Very unlikely: all the bytes in a UTF-8 multi-byte sequence have the high bit set to 1, so a '\0' byte can never appear inside one.
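This guarantee is easy to check directly: every lead and continuation byte of a multi-byte sequence has the high bit set, so a zero byte can only ever be an encoded U+0000 itself, never a fragment of another character. A tiny illustrative check (function name made up):

```cpp
#include <cassert>
#include <string>

// Scan a UTF-8 byte string for embedded zero bytes. Because all bytes
// of a multi-byte sequence have the high bit set, a 0 byte can only
// come from an actual NUL character in the text, never from encoding.
bool containsZeroByte(const std::string& utf8) {
    for (unsigned char c : utf8)
        if (c == 0) return true;
    return false;
}
```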