Reading a UTF-8 Unicode file through non-unicode code - c++

I have to read a text file which is Unicode with UTF-8 encoding and have to write this data to another text file. The file has tab-separated data in lines.
My reading code is C++ code without Unicode support. What I am doing is reading the file line by line into a string/char* and writing that string as-is to the destination file. I can't change the code, so code-change suggestions are not welcome.
What I want to know is: while reading line by line, can I encounter a NUL terminating character ('\0') within a line, since the file is Unicode and one character can span multiple bytes?
My thinking was that it is quite possible that a NUL character could be encountered within a line. Your thoughts?

UTF-8 uses 1 byte for all ASCII characters, which keep the same code values as in standard ASCII, and up to 4 bytes for other characters. The upper bits of each byte are reserved as control bits: for code points encoded with more than 1 byte, the high bit of every byte is set.
Thus there will be no 0 byte in your UTF-8 file (unless the text itself contains a U+0000 character).
Check Wikipedia for UTF-8

Very unlikely: all the bytes of a UTF-8 multi-byte sequence have the high bit set to 1.
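A small C++ sketch of the property both answers describe (the sample strings and helper names are mine): a zero byte in UTF-8 can only come from an actual U+0000 character, because every byte of a multi-byte sequence has its high bit set.

```cpp
#include <string>

// Returns true if the UTF-8 byte sequence contains a 0x00 byte.
// Per the answers above, this can only happen if the original text
// actually contained a U+0000 character.
inline bool contains_zero_byte(const std::string& utf8) {
    for (unsigned char c : utf8) {
        if (c == 0) return true;
    }
    return false;
}

// Counts bytes that belong to multi-byte sequences (high bit 0x80 set).
inline int multibyte_bytes(const std::string& utf8) {
    int n = 0;
    for (unsigned char c : utf8) {
        if (c & 0x80) ++n;
    }
    return n;
}
```

So a line-by-line copy that treats the bytes as an opaque char* will never see a spurious '\0' inside a line of well-formed UTF-8 text.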

Related

JFlex String Regex Strange Behaviour

I am trying to write a JSON string parser in JFlex, so far I have
string = \"((\\(\"|\\|\/|b|f|n|r|t|u[0-9a-fA-F]{4})) | [^\"\\])*\"
which I thought captured the specs (http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf).
I have tested it on the control characters and standard characters and symbols, but for some reason it does not accept £ or ( or ) or ¬. Please can someone let me know what is causing this behaviour?
Perhaps you are running in JLex compatibility mode? If so, please see the following from the official JFlex User's Manual. It seems the scanner will use 7-bit character codes for input by default, whereas what you want is 16-bit (Unicode).
You can fix this by adding the line %unicode after the first %%.
Input Character sets
%7bit
Causes the generated scanner to use a 7-bit input character set (character codes 0-127). If an input character with a code greater than 127 is encountered in the input at runtime, the scanner will throw an ArrayIndexOutOfBoundsException. Not only because of this, you should consider using the %unicode directive. See also Encodings for information about character encodings. This is the default in JLex compatibility mode.
%full
%8bit
Both options cause the generated scanner to use an 8-bit input character set (character codes 0-255). If an input character with a code greater than 255 is encountered in the input at runtime, the scanner will throw an ArrayIndexOutOfBoundsException. Note that even if your platform uses only one byte per character, the Unicode value of a character may still be greater than 255. If you are scanning text files, you should consider using the %unicode directive. See also the section on Encodings for more information about character encodings.
%unicode
%16bit
Both options cause the generated scanner to use the full Unicode input character set, including supplementary code points: 0-0x10FFFF. %unicode does not mean that the scanner will read two bytes at a time. What is read and what constitutes a character depends on the runtime platform. See also section Encodings for more information about character encodings. This is the default unless the JLex compatibility mode is used (command line option --jlex).
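To illustrate the fix from the answer above, a minimal JFlex specification with %unicode placed after the first %% might look like this (the class name and the single placeholder rule are made up; only the directive placement comes from the answer):

```
/* user code section (may be empty) */
%%
%unicode
%class JsonLexer
%%
/* lexical rules go here */
[^]   { /* placeholder rule */ }
```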

Reading a file with unknown UTF8 strings and known ASCII mixed

Sorry for the confusing title, I am not really sure how to word this myself. I will try and keep my question as simple as possible.
I am working on a system that keeps a "catalog" of strings. This catalog is just a simple flat text file that is indexed in a specific manner. The syntax of the files has to be in ASCII, but the contents of the strings can be UTF8.
Example of a file:
{
STRINGS: {
THISHASTOBEASCII: "But this is UTF8"
HELLO1: "Hello, world"
HELLO2: "您好"
}
}
Reading a UTF8 file isn't the problem here; I don't really care what's between the quotation marks, as it's simply copied to other places. No changes are made to the strings.
The problem is that I need to parse the bracket and the labels of the strings to properly store the UTF8 strings in memory. How would I do this?
EDIT: Just realised I'm overcomplicating it. I should just copy and store whatever is between the two "", as UTF8 can be read into bytes >_<. Marked for closing.
You can do it right in the UTF-8 processing method you mentioned.
Actually, one byte UTF-8 characters also follow the ASCII rule.
1-byte UTF-8 characters look like 0XXXXXXX. In a multi-byte sequence, the lead byte starts with as many 1 bits as there are bytes in the sequence, followed by a 0, and every other byte starts with 10.
For example, 3 bytes: 1110XXXX 10XXXXXX 10XXXXXX
4 bytes: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
When you go through the character array, just check each byte you read. You will know whether it's ASCII (c & 0x80 is false) or part of a multi-byte character (c & 0x80 is true).
Note: every code point in the Basic Multilingual Plane (U+0000 to U+FFFF) fits in at most 3 UTF-8 bytes; a 3-byte sequence carries exactly 16 payload bits (count the 'X's above). Supplementary code points need 4 bytes.
ASCII is a subset of UTF-8, and UTF-8 can be processed using standard 8-bit string parsing functions. So the entire file can be processed as UTF-8. Just strip off the portions you do not need.
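The byte test described above can be sketched in C++ like this (the enum and function names are my own):

```cpp
// Classify a single byte of a UTF-8 stream, as described above:
// high bit clear -> plain ASCII; 10xxxxxx -> continuation byte of a
// multi-byte character; 11xxxxxx -> lead byte of a multi-byte character.
enum class ByteKind { Ascii, Continuation, Lead };

inline ByteKind classify(unsigned char c) {
    if ((c & 0x80) == 0)    return ByteKind::Ascii;        // 0xxxxxxx
    if ((c & 0xC0) == 0x80) return ByteKind::Continuation; // 10xxxxxx
    return ByteKind::Lead;                                 // 11xxxxxx
}
```

A parser for the catalog syntax can treat every byte with the high bit clear as ASCII structure (braces, labels, quotes) and copy everything else through untouched.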

How to use extended character set in reading ini file? (C++ lang.)

I face one little problem. I am from country that uses extended character set in language (specifically Latin Extended-A due to characters like š,č,ť,ý,á,...).
I have an ini file containing these characters and I would like to read them into the program. Unfortunately, it is not working with GetPrivateProfileStringW or ...A.
Here is part of source code. I hope it will help someone to find solution, because I am getting a little desperate. :-)
SOURCE CODE:
char pcMyExtendedString[200];
GetPrivateProfileStringA(
"CATEGORY_NAME",
"SECTION_NAME",
"error",
pcMyExtendedString,
200,
PATH_TO_INI_FILE
);
INI FILE:
[CATEGORY_NAME]
SECTION_NAME= ľščťžýáíé
Characters ý,á,í,é are read correctly - they are from the Latin-1 Supplement character set. Their hex values are correct (0xFD, 0xE1, 0xED,...).
Characters ľ,š,č,ť,ž are read incorrectly - they are from the Latin Extended-A character set. Their hex values are incorrect (0xBE, 0x9A, 0xE8,...). Expected are values like 0x013E, 0x0161, 0x010D, ...
How could be this done? Is it possible or should I avoid these characters at all?
GetPrivateProfileString doesn't do any character conversion. If the call succeeds, it gives you exactly what is in the file.
Since you want to have unicode characters, your file is probably in UTF-8 or UTF-16. If your file is UTF-8, you should be able to read it with GetPrivateProfileStringA, but it will give you a char array that will contain the correct UTF-8 characters (that is, not 0x013E, because 0x013E is not UTF-8).
If your file is UTF-16, then GetPrivateProfileStringW should work, and give you the UTF-16 codes (0x013E, 0x0161, 0x010D, ...) in a wchar_t array.
Edit: Actually your file is encoded in Windows-1250. This is a single byte encoding, so GetPrivateProfileStringA works fine, and you can convert it to UTF-16 if you want by using MultiByteToWideChar with 1250 as code page parameter.
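For illustration, here is a portable sketch of what that MultiByteToWideChar conversion does for the letters from the question. The table is deliberately partial (just the code points listed above); the real API covers the entire Windows-1250 code page:

```cpp
#include <map>

// Map a Windows-1250 byte to the UTF-16 code unit the asker expected.
// Partial table for the problem characters only; bytes in the Latin-1
// range (e.g. 0xE1 = á, 0xFD = ý) pass through unchanged, which is why
// those characters already "worked" in the question.
inline wchar_t cp1250_to_utf16(unsigned char b) {
    static const std::map<unsigned char, wchar_t> table = {
        {0xBE, 0x013E}, // ľ
        {0x9A, 0x0161}, // š
        {0xE8, 0x010D}, // č
    };
    auto it = table.find(b);
    return it != table.end() ? it->second : static_cast<wchar_t>(b);
}
```

On Windows you would not hand-roll this table; you would call MultiByteToWideChar with 1250 as the code page parameter, as the answer says.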
Try saving the file in UTF-8 - CodePage 65001 encoding, most likely your file would be in Western European (Windows) - CodePage 1252.

Read text-file in C++ with fopen without linefeed conversion

I'm working with text-files (UTF-8) on Windows and want to read them using C++.
To open the file correctly, I use fopen. As described here, there are two options for opening the file:
Text mode "rt" (carriage return + linefeed will automatically be converted into linefeed; in short, "\r\n" becomes "\n").
Binary mode "rb" (The file will be read byte by byte).
Now it becomes tricky. I don't want to open the file in binary mode, since I would lose the correct handling of my UTF-8 characters (and there are special characters in my text-files, which are corrupted when interpreted as ANSI-character). But I also don't want fopen to convert all my CR+LF into LF.
Is there a way to combine the two modes, to read a text-file into a string without tampering with the linefeeds, while still being able to read UTF-8 correctly?
I am aware, that the reverse conversion would happen, if I write it through the same file, but the string is sent to another application that expects Windows-style line-endings.
The difference between opening files in text and binary mode is exactly the handling of line-end sequences in text mode versus not touching them in binary mode. Nothing more, nothing less. Since the ASCII characters use the same code points in Unicode, and UTF-8 retains the encoding of ASCII characters (i.e., every ASCII file happens to be a UTF-8 encoded Unicode file), whether you use binary or text mode won't affect the other bytes.
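A short C++ sketch of that point (the helper name is mine): opening with "rb" delivers the bytes untouched, so both the "\r\n" pairs and the UTF-8 sequences survive intact.

```cpp
#include <cstdio>
#include <string>

// Read a whole file byte-for-byte ("rb"): no CR+LF translation and no
// reinterpretation of UTF-8 sequences -- both come through untouched.
inline std::string read_all_bytes(const char* path) {
    std::string data;
    if (FILE* f = std::fopen(path, "rb")) {
        char buf[4096];
        size_t n;
        while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
            data.append(buf, n);
        std::fclose(f);
    }
    return data;
}
```

The string can then be handed to the other application with its Windows-style line endings preserved.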
It may be worth to have a look at James McNellis "Unicode in C++" presentation at C++Now 2014.

ASCII Value for Nothing

Is there an ascii value I can put into a char in C++, that represents nothing? I tried 0 but it ends up screwing up my file so I can't read it.
ASCII 0 is null. Other than that, there are no "nothing" characters in traditional ASCII. If appropriate, you could use a control character like SOH (start of heading), STX (start of text), or ETX (end of text). Their ASCII values are 1, 2, and 3 respectively.
For the full list of ASCII codes that I used for this explanation, see this site
Sure. Use any character value that won't appear in your regular data. This is commonly referred to as a delimited text file. Popular choices for delimiters include spaces, tabs, commas, semi-colons, vertical-bar characters, and tildes.
In a C++ source file, '\0' represents a 0 byte. However, C++ strings are usually null-terminated, which means that '\0' represents the end of the string - which may be what is messing up your file.
If you really want to store a 0 byte in a data file, you need to use some other encoding. A simplistic one would use some other character - 0xFF, for example - that doesn't appear in your data, or some length/data format or something similar.
Whatever encoding you choose to use, the application writing the file and the one reading it need to agree on what the encoding is. And that is a whole new nightmare.
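One way to make the "length/data format" idea above concrete (the record layout here is made up for illustration): write the payload length first, then the raw bytes, so a 0 byte in the data never has to double as a terminator.

```cpp
#include <cstdint>
#include <string>

// Encode: 4-byte little-endian length, followed by the raw payload bytes.
inline std::string encode_record(const std::string& payload) {
    std::string out;
    uint32_t len = static_cast<uint32_t>(payload.size());
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<char>((len >> (8 * i)) & 0xFF));
    return out + payload;
}

// Decode the payload back out. A '\0' inside the data is no problem,
// because the stored length, not a terminator byte, delimits the record.
inline std::string decode_record(const std::string& record) {
    uint32_t len = 0;
    for (int i = 0; i < 4; ++i)
        len |= static_cast<uint32_t>(static_cast<unsigned char>(record[i])) << (8 * i);
    return record.substr(4, len);
}
```

As the answer warns, both the writer and the reader have to agree on this layout for it to work.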
The null character '\0' still takes up a byte.
Does your software recognize the null character as an end-of-file character?
If your software is reading in this file, you can define a place holder character (one that isn't the same as data) but you'll also need to handle that character. As in, say '*' is your place-holder. You will read in the character but not add it to the structure that stores your data. It will still take up space in your file, but it won't take up space in your data structure.
Am I answering your question or missing it?
Do you mean a value you can write which won't actually change the file? The answer is no.
Maybe post a little more about what you're trying to accomplish.
It would depend on what kind of file it is and who is parsing it.