Detecting Unicode in files in Windows 10 - c++

Windows 10 Notepad no longer requires Unicode files to have a BOM header, and it no longer writes one by default. This breaks existing code that checks the header to determine whether a file is Unicode. How can I now tell in C++ if a file is Unicode?
Source: https://www.bleepingcomputer.com/news/microsoft/windows-10-notepad-is-getting-better-utf-8-encoding-support/
The code we have to determine Unicode:
int IsUnicode(const BYTE p2bytes[3])
{
    if (p2bytes[0]==0xEF && p2bytes[1]==0xBB && p2bytes[2]==0xBF)
        return 1; // UTF-8 (BOM)
    if (p2bytes[0]==0xFE && p2bytes[1]==0xFF)
        return 2; // UTF-16 (BE)
    if (p2bytes[0]==0xFF && p2bytes[1]==0xFE)
        return 3; // UTF-16 (LE)
    return 0;
}
If it's such a pain, why isn't there a standard function to determine the encoding?

You should use the W3C method, which goes something like this:
if you already know the encoding, use that
if there is a BOM, use it to determine the encoding
otherwise, try to decode as UTF-8. UTF-8 has strict byte-sequence rules (that is the point of UTF-8: being able to find the first byte of a character), so if the file is not UTF-8 the decoding will very probably fail. In ANSI (cp-1252) it is not frequent to have an accented letter followed by a symbol, and very improbable that every such pair in the file forms a valid sequence; with Latin-1 you would get C1 control characters (instead of symbols), and it is just as unlikely that C1 control characters appear only, and always, after accented letters.
if decoding fails (you can test just the first 4096 bytes, or the first 10 bytes above 127), use the standard 8-bit encoding of the OS (probably cp-1252 on Windows). A sketch of these steps follows at the end of this answer.
This method should work very well. It is biased toward UTF-8, but the world went in that direction long ago, and determining which specific codepage an 8-bit file uses is much more difficult.
You may add a step before the last one: if there are many 00 bytes, you may be looking at a UTF-16 or UTF-32 form. Unicode requires that you know which form you have (e.g. from a side channel), otherwise the file should carry a BOM, but you can guess the form (UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE) from the positions of the 00 bytes in the file (newlines and various ASCII characters belong to the "common script" and are used in text of many scripts, so you should see plenty of 00 bytes).
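A minimal sketch of the BOM-then-UTF-8 part of this method, assuming the first few kilobytes of the file have already been read into a buffer. The function names and return values are my own, and the UTF-8 check is simplified (it does not reject overlong forms or surrogate code points):

#include <cstddef>

enum Encoding { ENC_SYSTEM_8BIT, ENC_UTF8, ENC_UTF16BE, ENC_UTF16LE };

// Simplified structural check: every lead byte must be followed by the
// right number of 10xxxxxx continuation bytes.
static bool LooksLikeUtf8(const unsigned char* p, std::size_t n)
{
    for (std::size_t i = 0; i < n; )
    {
        unsigned char c = p[i];
        std::size_t len;
        if      (c < 0x80)           len = 1;       // plain ASCII
        else if ((c & 0xE0) == 0xC0) len = 2;       // 110xxxxx
        else if ((c & 0xF0) == 0xE0) len = 3;       // 1110xxxx
        else if ((c & 0xF8) == 0xF0) len = 4;       // 11110xxx
        else                         return false;  // invalid lead byte
        if (i + len > n) break;                     // sequence cut off by the buffer end: ignore
        for (std::size_t k = 1; k < len; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return false;
        i += len;
    }
    return true;
}

Encoding GuessEncoding(const unsigned char* p, std::size_t n)
{
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return ENC_UTF8;    // UTF-8 BOM
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)                 return ENC_UTF16BE; // UTF-16 BE BOM
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)                 return ENC_UTF16LE; // UTF-16 LE BOM
    if (LooksLikeUtf8(p, n))                                    return ENC_UTF8;    // no BOM, but decodes as UTF-8
    return ENC_SYSTEM_8BIT;                                     // fall back to e.g. cp-1252
}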

Now Windows 10 does not require Unicode files to have a BOM header.
Windows never had this requirement. Every program can read text files like it wants to.
Maybe interesting: a BOM may not be desirable for UTF-8 because it breaks ASCII compatibility.
This does break the existing code that checks the header to determine Unicode in files.
This is a misunderstanding. Other code has likely supported Unicode for longer than Windows Notepad has.
How can I now tell in C++ if a file is in unicode?
Typically you would check for the presence of a BOM and then use that information of course.
Next you can try to read (the beginning of) the file with all possible encodings. The ones that throw an exception are obviously not suitable.
From the remaining encodings, you could use a heuristic to determine the encoding.
And if it still was the wrong choice, give the user an option to change the encoding manually. That's how it is done in many editors, like Notepad++.
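On Windows, one hedged way to implement that "try to read it" step for UTF-8 is to let MultiByteToWideChar do the validation: with the MB_ERR_INVALID_CHARS flag the call fails on malformed sequences, which is a strong hint that the file is not UTF-8 and that a heuristic (or the user) should pick an 8-bit codepage instead.

#include <windows.h>

// Returns true if the buffer decodes cleanly as UTF-8.
bool DecodesAsUtf8(const char* data, int size)
{
    int wlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, data, size, NULL, 0);
    return wlen != 0;   // 0 means invalid UTF-8 (or an empty buffer)
}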

Related

How to store Unicode characters in an array?

I'm writing a C++ wxWidgets calculator application, and I need to store the characters for the operators in an array. I have something like int ops[10] = {'+', '-', '*', '/', '^'};. What if I wanted to also store characters such as √, ÷ and × in said array, in a way so that they are also displayable inside a wxTextCtrl and a custom button?
This is actually a hairy question, even though it does not look like it at first. Your best option is to use Unicode escape sequences instead of typing the special characters directly in your source code editor.
wxString ops[]={L"+", L"-", L"*", L"\u00F7"};
You need to make sure that characters such as √, ÷ and × are being compiled correctly.
Your source file (.cpp) needs to store them in a way that ensures the compiler generates the correct characters. This is harder than it looks, especially once SVN, Git, Windows and Linux are all involved.
Most of the time .cpp files are ANSI or 8-bit encoded and do not support Unicode constants out of the box.
You could save your source file as UTF-8 so that these characters are preserved, but not all compilers accept UTF-8.
The best way is therefore to write them as Unicode escape sequences.
wxString div(L"\u00F7"); is the string for ÷, or in your case perhaps wxChar div = L'\u00F7';. You have to look up the code points for the other special characters. This way your source file contains plain ASCII only and will be accepted by all compilers, and you also avoid code page problems when you exchange source files between OS platforms.
Then you have to make sure that you compile wxWidgets with UNICODE awareness (although I think this is the default for wx3.x). Then, if your OS supports it, these special characters should show up.
Read up on Unicode escape sequences (Wikipedia). Good input is also found on utf8everywhere.org. An .editorconfig file can also be of help.
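Putting this together, a small sketch of such an operator array built purely from escape sequences (the kOps name and the controls in the usage comments are placeholders of my own, not from the question):

#include <wx/string.h>

static const wxString kOps[] = {
    L"+", L"-", L"*", L"/", L"^",
    L"\u221A",   // √ SQUARE ROOT
    L"\u00F7",   // ÷ DIVISION SIGN
    L"\u00D7"    // × MULTIPLICATION SIGN
};
// Usage with your own controls, e.g.:
//   myButton->SetLabel(kOps[5]);
//   myTextCtrl->AppendText(kOps[6]);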
Prefer to use wchar_t instead of int.
wchar_t ops[10] = {L'+', L'-', L'*', L'/', L'^', L'√', L'÷', L'×'};
These trivially support the characters you describe, and trivially and correctly convert to wxStrings.

How to detect unicode file names in Linux

I have a Windows application written in C++. In it we check whether a file name is Unicode or not using the wcstombs() function; if the conversion fails, we assume that it is a Unicode file name. When I tried the same on Linux, the conversion doesn't fail. I know that on Windows the default charset is Latin, whereas the default charset on Linux is UTF-8. Based on whether the file name is Unicode or not, we take different code paths. Since I couldn't figure it out on Linux, I can't make my application portable for Unicode characters. Is there any other workaround for this, or am I doing something wrong?
UTF-8 has the nice property that all ASCII characters are represented exactly as in ASCII, and all non-ASCII characters are represented as sequences of two or more bytes >= 128. So all you have to check for ASCII is the numerical magnitude of each unsigned byte: if any byte is >= 128, the name is non-ASCII, which with UTF-8 as the basic encoding means "Unicode" (even if the characters are within the Latin-1 range; note that Latin-1 is a proper subset of Unicode, constituting the first 256 code points).
However, note that while in Windows a filename is a sequence of characters, in *nix it is a sequence of bytes.
So ideally you should really ignore what those bytes might encode.
That might be difficult to reconcile with a naïve user's view, though.
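A minimal sketch of that byte check, treating the *nix filename as an opaque byte string (the function name is my own):

#include <string>

// True if every byte is plain ASCII (< 128). Anything else means the name
// contains non-ASCII bytes, i.e. "Unicode" characters under a UTF-8 locale.
bool IsAsciiOnly(const std::string& filename)
{
    for (unsigned char c : filename)
        if (c >= 128)
            return false;
    return true;
}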

Reading text files of unknown encoding in C++

What should I use to read text files for which I don't know their encoding (ASCII or Unicode)?
Is there some class that auto-detects the encoding?
I can only give a negative answer here: there is no universally correct way to determine the encoding of a file. An ASCII file can be read as ISO-8859-15, because ASCII is a subset of it. Worse, other files may be valid in two different encodings while meaning something different in each. So you need to get this information via some other means. In many cases it is a good approach to just assume that everything is UTF-8. If you are working in a *NIX environment, the LC_CTYPE variable may be helpful. If you do not care about the encoding (e.g. you do not change or process the content), you can simply open files as binary.
This is impossible in the general case. If the file contains exactly
the bytes I'm typing here, it is equally valid as ASCII, UTF-8 or any of
the ISO 8859 variants. Several heuristics can be used as a guess,
however: read the first "page" (512 bytes or so), then, in the following
order:
1. See if the block starts with a BOM in one of the Unicode formats.
2. Look at the first four bytes. If they contain '\0', you're probably dealing with some form of UTF-16 or UTF-32, according to the following pattern:
   '\0', other, '\0', other  ->  UTF-16BE
   other, '\0', other, '\0'  ->  UTF-16LE
   '\0', '\0', '\0', other   ->  UTF-32BE
   other, '\0', '\0', '\0'   ->  UTF-32LE
3. Look for a byte with the top bit set. If it's the start of a legal UTF-8 character, then the file is probably in UTF-8. Otherwise... in the regions where I've worked, ISO 8859-1 is generally the best guess.
4. Otherwise, you more or less have to assume ASCII, until you encounter a byte with the top bit set (at which point, you use the previous heuristic).
But as I said, it's not 100% sure.
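For step 2, a hedged sketch of that null-byte pattern check on the first four bytes (the enum and function names are my own):

enum Utf1632Guess { GUESS_NONE, GUESS_UTF16BE, GUESS_UTF16LE, GUESS_UTF32BE, GUESS_UTF32LE };

// Maps the '\0' pattern of the first four bytes to a likely UTF-16/UTF-32 form.
Utf1632Guess GuessFromNulls(const unsigned char b[4])
{
    if (b[0] == 0 && b[1] == 0 && b[2] == 0 && b[3] != 0) return GUESS_UTF32BE;  // \0 \0 \0 x
    if (b[0] != 0 && b[1] == 0 && b[2] == 0 && b[3] == 0) return GUESS_UTF32LE;  // x \0 \0 \0
    if (b[0] == 0 && b[1] != 0 && b[2] == 0 && b[3] != 0) return GUESS_UTF16BE;  // \0 x \0 x
    if (b[0] != 0 && b[1] == 0 && b[2] != 0 && b[3] == 0) return GUESS_UTF16LE;  // x \0 x \0
    return GUESS_NONE;
}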
One of the ways (brute force) of doing this can be:
Build a list of suitable encodings (only ISO codepages and Unicode)
Iterate over all the considered encodings
Decode the text using this encoding
Encode it back from Unicode
Compare the results for errors
If there are no errors, remember the encoding that produced the fewest bytes
(A Windows sketch of this round trip follows below.)
Reference: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
If you are sure that your incoming encoding is ANSI or Unicode, then you can also check for the byte order mark. But let me tell you that it is not foolproof.
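A hedged Windows sketch of that round trip (the candidate codepage list is only illustrative, and MultiByteToWideChar/WideCharToMultiByte stand in for the decode/encode steps):

#include <windows.h>
#include <string>
#include <vector>

// Round-trip the raw bytes through each candidate codepage; codepages that
// decode without errors and reproduce the input exactly are plausible guesses.
std::vector<UINT> PlausibleCodepages(const std::string& bytes)
{
    const UINT candidates[] = { CP_UTF8, 1252 /* Windows-1252 */, 28591 /* ISO 8859-1 */ };
    std::vector<UINT> result;
    for (UINT cp : candidates)
    {
        int wlen = MultiByteToWideChar(cp, MB_ERR_INVALID_CHARS,
                                       bytes.data(), (int)bytes.size(), NULL, 0);
        if (wlen <= 0) continue;                       // does not decode at all
        std::wstring wide(wlen, L'\0');
        MultiByteToWideChar(cp, MB_ERR_INVALID_CHARS,
                            bytes.data(), (int)bytes.size(), &wide[0], wlen);
        int blen = WideCharToMultiByte(cp, 0, wide.data(), wlen, NULL, 0, NULL, NULL);
        std::string back((size_t)blen, '\0');
        if (blen > 0)
            WideCharToMultiByte(cp, 0, wide.data(), wlen, &back[0], blen, NULL, NULL);
        if (back == bytes) result.push_back(cp);       // survives the round trip
    }
    return result;
}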

C++ ifstream UTF8 first characters

Why does a file saved as UTF8 (in Notepad++) have this character at the beginning of the fstream I opened to it in my C++ program?
´╗┐
I have no idea what it is, I just know that it's not there when I save to ASCII.
UPDATE: If I save it to UTF8 (without BOM) it's not there.
How can I check the encoding of a file (ASCII or UTF8, everything else will be rejected ;) ) in c++. Is it exactly these characters?
Thanks!
When you save a file as UTF-16, each value is two bytes. Different computers use different byte orders. Some put the most significant byte first, some put the least significant byte first. Unicode reserves a special codepoint (U+FEFF) called a byte-order mark (BOM). When a program writes a file in UTF-16, it puts this special codepoint at the beginning of the file. When another program reads a UTF-16 file, it knows there should be a BOM there. By comparing the actual bytes to the expected BOM, it can tell if the reader uses the same byte order as the writer, or if all the bytes have to be swapped.
When you save a UTF-8 file, there's no ambiguity in byte order. But some programs, especially ones written for Windows, still add a BOM, encoded as UTF-8. When you encode the BOM codepoint as UTF-8, you get three bytes, 0xEF 0xBB 0xBF. Those bytes correspond to box-drawing characters in most OEM code pages (which is the default for a console window on Windows).
The argument in favor of doing this is that it marks the files as truly UTF-8, as opposed to some other native encoding. For example, lots of text files on western Windows are in codepage 1252. Tagging the file with the UTF-8-encoded BOM makes it easier to tell the difference.
The argument against doing this is that lots of programs expect ASCII or UTF-8 regardless, and don't know how to handle the extra three bytes.
If I were writing a program that reads UTF-8, I would check for exactly these three bytes at the beginning. If they're there, skip them.
Update: You can convert U+FEFF ZERO WIDTH NO-BREAK SPACE characters into U+2060 WORD JOINER, except at the beginning of a file [Gillam, Richard, Unicode Demystified, Addison-Wesley, 2003, p. 108]. My personal code does this. If, when decoding UTF-8, I see 0xEF 0xBB 0xBF at the beginning of the file, I take it as a happy sign that I indeed have UTF-8. If the file doesn't begin with those bytes, I just proceed decoding normally. If, while decoding later in the file, I encounter a U+FEFF, I emit U+2060 and proceed. This means U+FEFF is used only as a BOM and not with its deprecated meaning.
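A minimal sketch of that BOM check with a std::ifstream opened in binary mode (the function name is my own):

#include <fstream>

// Peek at the first three bytes; skip them only if they are the UTF-8
// encoded BOM (EF BB BF), otherwise rewind so nothing is lost.
void SkipUtf8Bom(std::ifstream& in)
{
    char bom[3] = { 0, 0, 0 };
    in.read(bom, 3);
    if (!(in.gcount() == 3 &&
          bom[0] == '\xEF' && bom[1] == '\xBB' && bom[2] == '\xBF'))
    {
        in.clear();   // a very short file may have set the eof/fail bits
        in.seekg(0);  // not a BOM: rewind and read everything
    }
}
// Usage:
//   std::ifstream in("file.txt", std::ios::binary);
//   SkipUtf8Bom(in);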
Without knowing what those characters really are (i.e., without a hex dump) it's only a guess, but my immediate guess would be that what you're seeing is the result of taking a byte order mark (BOM) and (sort of) encoding it as UTF-8. Technically, you're not allowed to/supposed to do that, but in practice it's actually fairly common.
Just to clarify, you should realize that this not really a byte-order mark. The basic idea of a byte-order mark simply doesn't apply to UTF-8. Theoretically, UTF-8 encoding is never supposed to be applied to a BOM -- but you can ignore that, and apply the normal UTF-8 encoding rules to the values that make up a BOM anyway, if you want to.
Why does a file saved as UTF8 have this character in the beginning [...] I have no idea what it is, I just know that it's not there when I save to ASCII.
I suppose you are referring to the byte order mark (BOM) U+FEFF, a zero-width, non-breaking space character. Here (Notepad++ 5.4.3) a file saved as UTF-8 has the bytes EF BB BF at the beginning. I suppose that's the BOM encoded in UTF-8.
How can I check the encoding of a file
You cannot. You have to know what encoding your file was written in. While Unicode-encoded files might start with a BOM, I don't think there's a requirement that they do so.
Regarding your second point: every valid ASCII string is also a valid UTF-8 string, so you don't have to check for ASCII explicitly. Simply read the file as UTF-8; if the file doesn't contain a valid UTF-8 string, you will get an error.
I'm guessing you meant to ask why it has those characters. Those characters are probably the byte order mark, which in UTF-8 is the byte sequence EF BB BF.
As for knowing what encoding a file is in, you cannot derive that from the file itself. You have to know it ahead of time (or ask the user who supplies you with the file). For a better understanding of encoding without having to do a lot of reading, I highly recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Read Unicode Files

I have a problem reading and using the content from unicode files.
I am working on a Unicode release build, and I am trying to read the content from a Unicode file, but the data has strange characters and I can't seem to find a way to convert the data to ASCII.
I'm using fgets. I tried fgetws, WideCharToMultiByte, and a lot of functions which I found in other articles and posts, but nothing worked.
Because you mention WideCharToMultiByte I will assume you are dealing with Windows.
"read the content from an unicode file ... find a way to convert data to ASCII"
This might be a problem. If you convert Unicode to ASCII (or another legacy code page) you run the risk of corrupting/losing data.
Since you are "working on a unicode release build" you will want to read Unicode and stay Unicode.
So your final buffer will have to be wchar_t (or WCHAR, or CStringW, same thing).
So your file might be utf-16, or utf-8 (utf-32 is quite rare).
For utf-16 the endianess might also matter. If there is a BOM that will help a lot.
Quick steps (a sketch follows after the links below):
open the file with _wopen or _wfopen in binary mode
read the first bytes and identify the encoding using the BOM
if the encoding is UTF-8, read into a byte array and convert to wchar_t with MultiByteToWideChar and CP_UTF8
if the encoding is UTF-16BE (big endian), read into a wchar_t array and byte-swap with _swab
if the encoding is UTF-16LE (little endian), read into a wchar_t array and you are done
Also (if you use a newer Visual Studio), you can take advantage of an MS extension to _wfopen. It can take an encoding as part of the mode (something like _wfopen(L"newfile.txt", L"rt, ccs=<encoding>"); with the encoding being UTF-8 or UTF-16LE). It can also detect the encoding based on the BOM.
Warning: being cross-platform here is problematic; wchar_t can be 2 or 4 bytes, and the conversion routines are not portable...
Useful links:
BOM (http://unicode.org/faq/utf_bom.html)
wfopen (http://msdn.microsoft.com/en-us/library/yeby3zcb.aspx)
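A hedged sketch of the quick steps above, assuming Windows (where wchar_t is 2-byte UTF-16); the function name is my own, and error handling and alignment concerns are trimmed for brevity:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string>

std::wstring ReadTextFileW(const wchar_t* path)
{
    FILE* f = _wfopen(path, L"rb");          // step 1: open as binary
    if (!f) return std::wstring();

    std::string bytes;
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        bytes.append(buf, n);
    fclose(f);

    // step 2: identify the encoding from the BOM
    if (bytes.size() >= 2 && (unsigned char)bytes[0] == 0xFF && (unsigned char)bytes[1] == 0xFE)
    {
        // UTF-16LE: already in the layout Windows wchar_t uses
        return std::wstring((const wchar_t*)(bytes.data() + 2), (bytes.size() - 2) / 2);
    }
    if (bytes.size() >= 2 && (unsigned char)bytes[0] == 0xFE && (unsigned char)bytes[1] == 0xFF)
    {
        // UTF-16BE: swap every byte pair to get UTF-16LE
        std::string swapped = bytes.substr(2);
        _swab(&swapped[0], &swapped[0], (int)(swapped.size() / 2) * 2);
        return std::wstring((const wchar_t*)swapped.data(), swapped.size() / 2);
    }

    // otherwise treat it as UTF-8, with or without the EF BB BF marker
    size_t off = (bytes.size() >= 3 &&
                  (unsigned char)bytes[0] == 0xEF &&
                  (unsigned char)bytes[1] == 0xBB &&
                  (unsigned char)bytes[2] == 0xBF) ? 3 : 0;
    int wlen = MultiByteToWideChar(CP_UTF8, 0, bytes.data() + off,
                                   (int)(bytes.size() - off), NULL, 0);
    std::wstring result(wlen > 0 ? wlen : 0, L'\0');
    if (wlen > 0)
        MultiByteToWideChar(CP_UTF8, 0, bytes.data() + off,
                            (int)(bytes.size() - off), &result[0], wlen);
    return result;
}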
We'll need more information to answer the question (for example, are you trying to read the Unicode file into a char buffer or a wchar_t buffer? What encoding does the file use?), but for now you might want to make sure you're not running into this issue if your file is Unicode and you're using fgetws in text mode.
When a Unicode stream-I/O function operates in text mode, the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).
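If you are on the Microsoft CRT, a hedged sketch of sidestepping that default ANSI conversion by naming the encoding in the mode string (this ccs extension is Microsoft-specific):

#include <stdio.h>
#include <wchar.h>

// Open in text mode with an explicit encoding so fgetws receives properly
// converted wide characters.
void DumpLines(const wchar_t* path)
{
    FILE* f = _wfopen(path, L"rt, ccs=UTF-8");
    if (!f) return;
    wchar_t line[512];
    while (fgetws(line, 512, f))
        wprintf(L"%ls", line);
    fclose(f);
}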
Unicode is the mapping from numerical codes into characters. The step before Unicode is the file's encoding: how do you transform a group of consecutive bytes into a numerical code? You have to check whether the file is stored as big-endian, little-endian or something else.
Often, the BOM (Byte order marker) is written as the first two bytes in the file: either FF FE or FE FF.
The intended way of handling charsets is to let the locale system do it.
You have to have set the correct locale before opening your stream.
BTW, you tagged your question C++, but you wrote about fgets and fgetws rather than IOStreams; is your problem C++ or C?
For C:
#include <locale.h>
setlocale(LC_ALL, ""); /* at least LC_CTYPE */
For C++
#include <locale>
std::locale::global(std::locale(""));
Then wide IO (wstream, fgetws) should work, provided your environment is correctly set up for Unicode. If not, you'll have to change your environment (I don't know how that works under Windows; for Unix, setting the LC_ALL variable is the way, see locale -a for the supported values). Alternatively, replacing the empty string with an explicit locale would also work, but then you hardcode the locale in your program, and your users perhaps won't appreciate that.
If your system doesn't support an adequate locale, in C++ you have the possibility to write a facet for the conversion yourself. But that is outside the scope of this answer.
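A minimal C++ sketch of that advice, assuming the environment locale is set to something UTF-8 capable:

#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::locale::global(std::locale(""));   // pick up the user's environment locale
    std::wifstream in("input.txt");         // streams created afterwards inherit it
    std::wstring line;
    while (std::getline(in, line))
        std::wcout << line << L'\n';
}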
You CANNOT reliably convert Unicode, even UTF-8, to ASCII. The character sets ('planes' in Unicode documentation) do not map back to ASCII - that's why Unicode exists in the first place.
First: I assume you are trying to read UTF-8-encoded Unicode (since you can read some characters). You can check this, for example, in Notepad++.
For your problem I'd suggest using some sort of library. You could try Qt; QFile supports Unicode (as does the rest of the library).
If this is too much, use a dedicated Unicode library, for example: http://utfcpp.sourceforge.net/.
And learn about Unicode: http://en.wikipedia.org/wiki/Unicode. There you'll find references to the different Unicode encodings.
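For example, a hedged sketch with utfcpp, assuming its utf8::is_valid and utf8::utf8to16 helpers (check the library's documentation for the exact API):

#include <fstream>
#include <iterator>
#include <string>
#include <vector>
#include "utf8.h"   // from the utfcpp library linked above

// Check that the file content is well-formed UTF-8 and, if so, convert it
// to UTF-16 code units for use with wide-character APIs.
bool LoadAsUtf16(const char* path, std::vector<unsigned short>& out)
{
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    if (!utf8::is_valid(bytes.begin(), bytes.end()))
        return false;   // not UTF-8: reject or fall back to another encoding
    utf8::utf8to16(bytes.begin(), bytes.end(), std::back_inserter(out));
    return true;
}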