How to check whether a text file is encoded in UTF-8? - C++

How to check whether a text file is encoded in UTF-8 in C++?

Try to read it as UTF-8 and see whether the UTF-8 encoding is broken, and if it is not, whether it contains only valid Unicode code points.
But still there's no guarantee the file is in UTF-8 or ASCII or something else. How would you interpret a file containing a single byte, the letter A? ASCII? UTF-8? Other? Likewise, what if the file starts with the BOM by sheer luck but isn't really UTF-8 or isn't intended to be UTF-8?
This article may be of interest.

You can never know for sure that any piece of binary data was intended to represent UTF-8. However, you can always check if it can be interpreted as UTF-8. The simplest way would be to just try and convert it (say to UTF-32) and see if you get no errors. If all you need is the validation, then you can do the same thing without actually writing the output. (You'll need to write this yourself, but it's easy.)
Note that it is crucial for security reasons to abort the conversion entirely at the first error, and not try to "recover" somehow.
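A minimal hand-rolled validator along those lines might look like the following; it walks the bytes, checks the continuation patterns, rejects overlong forms, surrogates and code points above U+10FFFF, and stops at the first error (a sketch, not a hardened library):
#include <cstddef>
#include <cstdint>

// Returns true if the buffer is a complete, well-formed UTF-8 sequence.
bool is_valid_utf8(const unsigned char* p, std::size_t n)
{
    for (std::size_t i = 0; i < n; ) {
        unsigned char b = p[i];
        std::size_t len;
        std::uint32_t cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;                      // invalid lead byte
        if (i + len > n) return false;          // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return false;   // bad continuation byte
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        // Reject overlong encodings, UTF-16 surrogates and values above U+10FFFF.
        if ((len == 2 && cp < 0x80) ||
            (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF)
            return false;
        i += len;
    }
    return true;
}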

Try converting to UTF-16. If you get no errors, then it is very likely UTF-8.
But no matter what you do, it is still a best guess.
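One way to do that "convert and see" check with nothing beyond the standard library is std::wstring_convert with std::codecvt_utf8_utf16 (deprecated since C++17 and removed from newer standards, but still shipped by major implementations); treat this as a rough sketch rather than a recommendation:
#include <codecvt>   // deprecated in C++17, but still widely available
#include <locale>
#include <string>

// Returns true if 'bytes' converts cleanly from UTF-8 to UTF-16.
bool looks_like_utf8(const std::string& bytes)
{
    try {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
        std::u16string utf16 = conv.from_bytes(bytes);   // throws std::range_error on bad input
        (void)utf16;
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}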

Related

Character Encoding in C++ internally?

If I create a string literal with the u8 prefix, does the machine code know and say that the corresponding value of that variable should be encoded in UTF-8?
So that no matter where I run the program, the computer knows how to encode it every time? Or does the machine code not say "encode it like this and this"?
Because if I encode something in a normal char string, and something in UTF-8 (e.g. with u8), then what is the difference, and how does the computer know the encoding if the machine code doesn't say anything about it?
u8"..." strings are always encoded in UTF-8, as specified in [lex.string]/1.
The encoding of "..." strings depends on the compiler (and on the source file encoding), but it shouldn't be hard to configure your IDE to save files in UTF-8, and your compiler to not touch UTF-8 in plain string literals.
In any case, the encoding is handled entirely at compile-time. In the compiled code the strings are just sequences of bytes; there is no conversion between encodings at runtime, unless you explicitly call some function that does that.
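As a small illustration (assuming C++20, where u8 literals have type const char8_t[]; the \u00e9 escape keeps the example independent of the source file's encoding):
#include <cstdio>

int main()
{
    const char8_t* s = u8"\u00e9";          // U+00E9, encoded by the compiler as 0xC3 0xA9
    // The encoding was fixed at compile time; at runtime these are just bytes.
    for (const char8_t* p = s; *p; ++p)
        std::printf("%02X ", static_cast<unsigned>(*p));
    std::printf("\n");                      // prints: C3 A9
}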
If I create a string literal with the u8 prefix, does the machine code know and say that the corresponding value of that variable should be encoded in UTF-8?
Machine code knows nothing. The compiler encodes the literal into UTF-8 and generates the correct sequence of bytes.
So that no matter where I run the program, the computer knows how to encode it every time? Or does the machine code not say "encode it like this and this"?
The sequence of bytes is then emitted at runtime, and the output device that receives it will translate it correctly if it knows how to. That means that, for example, a console that accepts UTF-8 encoding will show the correct characters; if it does not, garbage is shown.
Yes, the characters will almost certainly be encoded in UTF-8, but note that the standard doesn't require char8_t to be 8 bits wide, only that it be capable of storing UTF-8 code units, so some unusual C++ implementation could use 16-bit characters with only 8 bits stored in each element.
Also note that a single char8_t can only hold an ASCII character; all other characters require multiple code units, so they need to be stored in a char8_t string/array even when they are logically a single character.
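A two-line illustration of that point (again assuming C++20):
// "é" (U+00E9) needs two UTF-8 code units, so it cannot fit in a single char8_t.
constexpr char8_t e_acute[] = u8"\u00e9";
static_assert(sizeof(e_acute) == 3, "two code units plus the terminating null");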

Reading text files of unknown encoding in C++

What should I use to read text files for which I don't know their encoding (ASCII or Unicode)?
Is there some class that auto-detects the encoding?
I can only give a negative answer here: there is no universally correct way to determine the encoding of a file. An ASCII file can be read as ISO-8859-15, because ASCII is a subset of it. Worse, other files may be valid in two different encodings while meaning different things in each. So you need to get this information by some other means. In many cases it is a good approach to just assume that everything is UTF-8. If you are working in a *NIX environment, the LC_CTYPE variable may be helpful. If you do not care about the encoding (e.g. you do not change or process the content), you can open files as binary.
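For the *NIX suggestion, a rough sketch of consulting the locale environment (the LC_ALL > LC_CTYPE > LANG precedence follows the usual convention, and the substring test is deliberately crude):
#include <cstdlib>
#include <string>

// Best-effort guess: does the user's locale claim UTF-8?
bool locale_says_utf8()
{
    const char* vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };  // checked in precedence order
    for (const char* name : vars) {
        const char* v = std::getenv(name);
        if (v && *v) {
            std::string s(v);
            return s.find("UTF-8") != std::string::npos ||
                   s.find("utf8")  != std::string::npos;
        }
    }
    return false;  // no locale information at all
}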
This is impossible in the general case. If the file contains exactly the bytes I'm typing here, it is equally valid as ASCII, UTF-8 or any of the ISO 8859 variants. Several heuristics can be used as a guess, however: read the first "page" (512 bytes or so), then, in the following order:

1. See if the block starts with a BOM in one of the Unicode formats.

2. Look at the first four bytes. If they contain `'\0'`, you're probably dealing with some form of UTF-16 or UTF-32, according to the following pattern:

'\0', other, '\0', other   ->  UTF-16BE
other, '\0', other, '\0'   ->  UTF-16LE
'\0', '\0', '\0', other    ->  UTF-32BE
other, '\0', '\0', '\0'    ->  UTF-32LE

3. Look for a byte with the top bit set. If it's the start of a legal UTF-8 character, then the file is probably in UTF-8. Otherwise... in the regions where I've worked, ISO 8859-1 is generally the best guess.

4. Otherwise, you more or less have to assume ASCII, until you encounter a byte with the top bit set (at which point you use the previous heuristic).

But as I said, it's not 100% sure.
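Put together, that heuristic might look roughly like the sketch below. Here is_valid_utf8 is a validator like the one sketched near the top of this page; the 512-byte window, the enum names and the ISO 8859-1 fallback are just the guesses described above, and a multi-byte sequence cut off at the end of the window would need extra handling in real code:
#include <cstddef>

enum class Guess { Utf8Bom, Utf16BE, Utf16LE, Utf32BE, Utf32LE, Utf8, Latin1, Ascii };

bool is_valid_utf8(const unsigned char* p, std::size_t n);   // see the earlier sketch

Guess guess_encoding(const unsigned char* p, std::size_t n)  // n = first "page", e.g. 512 bytes
{
    // 1. BOMs (UTF-32 BOMs must be tested before UTF-16LE, whose BOM is a prefix of UTF-32LE's).
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return Guess::Utf8Bom;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) return Guess::Utf32BE;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) return Guess::Utf32LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return Guess::Utf16BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return Guess::Utf16LE;

    // 2. Zero-byte patterns in the first four bytes.
    if (n >= 4) {
        if (!p[0] &&  p[1] && !p[2] &&  p[3]) return Guess::Utf16BE;
        if ( p[0] && !p[1] &&  p[2] && !p[3]) return Guess::Utf16LE;
        if (!p[0] && !p[1] && !p[2] &&  p[3]) return Guess::Utf32BE;
        if ( p[0] && !p[1] && !p[2] && !p[3]) return Guess::Utf32LE;
    }

    // 3./4. Bytes with the top bit set: legal UTF-8 -> UTF-8, otherwise fall back to Latin-1/ASCII.
    bool has_high_byte = false;
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] & 0x80) { has_high_byte = true; break; }
    if (!has_high_byte) return Guess::Ascii;
    return is_valid_utf8(p, n) ? Guess::Utf8 : Guess::Latin1;
}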
One (brute force) way of doing this, sketched in code after the reference below, can be:
- Build a list of suitable encodings (only ISO code pages and Unicode)
- Iterate over all the considered encodings
- Convert the text from the candidate encoding to Unicode
- Convert it back from Unicode to the candidate encoding
- Compare the results for errors
- If there are no errors, remember the encoding that produced the fewest bytes
Reference: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
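A simplified C++ flavour of that round-trip idea, using POSIX iconv instead of the .NET classes from the article. The candidate list, the byte comparison and the "fewest bytes" tie-break are trimmed to a bare "does this encoding accept the whole buffer" test, and note that the exact const-ness of iconv's input pointer varies between platforms:
#include <iconv.h>
#include <cstddef>
#include <string>
#include <vector>

// Returns the first candidate encoding whose decoder accepts the whole buffer, or "".
std::string first_clean_decoding(const char* data, std::size_t size,
                                 const std::vector<std::string>& candidates)
{
    for (const std::string& enc : candidates) {
        iconv_t cd = iconv_open("UTF-32LE", enc.c_str());
        if (cd == (iconv_t)-1) continue;                 // encoding unknown to this iconv

        std::vector<char> out(size * 4 + 4);             // UTF-32 needs at most 4 bytes per input byte
        char* in_ptr = const_cast<char*>(data);
        std::size_t in_left = size;
        char* out_ptr = out.data();
        std::size_t out_left = out.size();

        std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        iconv_close(cd);

        if (rc != (std::size_t)-1 && in_left == 0)       // no EILSEQ/EINVAL, everything consumed
            return enc;
    }
    return "";
}

// Usage idea: first_clean_decoding(buf, n, { "UTF-8", "ISO-8859-15", "CP1252" });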
If you are sure that your incoming encoding is ANSI or Unicode, then you can also check for a byte order mark. But let me tell you that it is not foolproof.

Distinguishing between string formats

Having an untyped pointer pointing to some buffer which can hold either ANSI or Unicode string, how do I tell whether the current string it holds is multibyte or not?
Unless the string itself contains information about its format (e.g. a header or a byte order mark) then there is no foolproof way to detect if a string is ANSI or Unicode. The Windows API includes a function called IsTextUnicode() that basically guesses if a string is ANSI or Unicode, but then you run into this problem because you're forced to guess.
Why do you have an untyped pointer to a string in the first place? You must know exactly what and how your data is representing information, either by using a typed pointer in the first place or provide an ANSI/Unicode flag or something. A string of bytes is meaningless unless you know exactly what it represents.
Unicode is not an encoding, it's a mapping of code points to characters. The encoding is UTF-8 or UCS-2, for example.
And, given that there is zero difference between ASCII and UTF8 encoding if you restrict yourself to the lower 128 characters, you can't actually tell the difference.
You'd be better off asking if there were a way to tell the difference between ASCII and a particular encoding of Unicode. And the answer to that is to use statistical analysis, with the inherent possibility of inaccuracy.
For example, if the entire string consists of bytes less than 128, it's ASCII (it could be UTF8 but there's no way to tell and no difference in that case).
If it's primarily English/Roman and consists of lots of two-byte sequences with a zero as one of the bytes, it's probably UTF16. And so on. I don't believe there's a foolproof method without actually having an indicator of some sort (e.g., BOM).
My suggestion is to not put yourself in the position where you have to guess. If the data type itself can't contain an indicator, provide different functions for ASCII and a particular encoding of Unicode. Then force the work of deciding onto your client. At some point in the calling hierarchy, someone should know the encoding.
Or, better yet, ditch ASCII altogether, embrace the new world and use Unicode exclusively. With UTF8 encoding, ASCII has exactly no advantages over Unicode :-)
In general you can't.
You could check for the pattern of zeros: a single zero at the end probably means an ANSI C string, a zero in every other byte probably means ANSI text stored as UTF-16, and three zero bytes per character might mean UTF-32.
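A rough sketch of that zero-pattern test (the buffer length must be known, the enum names are made up for the example, and the thresholds are arbitrary):
#include <cstddef>

enum class BufferGuess { Ansi, Utf16, Utf32, Unknown };

BufferGuess guess_from_zeros(const unsigned char* p, std::size_t n)
{
    if (n == 0) return BufferGuess::Unknown;
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] == 0) ++zeros;
    if (zeros == 0 || (zeros == 1 && p[n - 1] == 0))
        return BufferGuess::Ansi;                       // only a terminating NUL (or none)
    if (zeros * 5 >= n * 3) return BufferGuess::Utf32;  // >= ~60% zeros: roughly 3 of 4 bytes
    if (zeros * 10 >= n * 3) return BufferGuess::Utf16; // >= ~30% zeros: roughly every other byte
    return BufferGuess::Unknown;
}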

C++ ifstream UTF8 first characters

Why does a file saved as UTF-8 (in Notepad++) have this character at the beginning of the fstream I opened to it in my C++ program?
´╗┐
I have no idea what it is, I just know that it's not there when I save to ASCII.
UPDATE: If I save it to UTF8 (without BOM) it's not there.
How can I check the encoding of a file (ASCII or UTF-8, everything else will be rejected ;)) in C++? Is it exactly these characters?
Thanks!
When you save a file as UTF-16, each value is two bytes. Different computers use different byte orders. Some put the most significant byte first, some put the least significant byte first. Unicode reserves a special codepoint (U+FEFF) called a byte-order mark (BOM). When a program writes a file in UTF-16, it puts this special codepoint at the beginning of the file. When another program reads a UTF-16 file, it knows there should be a BOM there. By comparing the actual bytes to the expected BOM, it can tell if the reader uses the same byte order as the writer, or if all the bytes have to be swapped.
When you save a UTF-8 file, there's no ambiguity in byte order. But some programs, especially ones written for Windows still add a BOM, encoded as UTF-8. When you encode the BOM codepoint as UTF-8, you get three bytes, 0xEF 0xBB 0xBF. Those bytes correspond to box-drawing characters in most OEM code pages (which is the default for a console window on Windows).
The argument in favor of doing this is that it marks the files as truly UTF-8, as opposed to some other native encoding. For example, lots of text files on western Windows are in codepage 1252. Tagging the file with the UTF-8-encoded BOM makes it easier to tell the difference.
The argument against doing this is that lots of programs expect ASCII or UTF-8 regardless, and don't know how to handle the extra three bytes.
If I were writing a program that reads UTF-8, I would check for exactly these three bytes at the beginning. If they're there, skip them.
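That check is only a few lines. One possible sketch, reading the file through an ifstream opened in binary mode:
#include <fstream>
#include <string>

// Opens a file and positions the stream just past a UTF-8 BOM, if one is present.
std::ifstream open_skipping_utf8_bom(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    char bom[3] = {};
    if (in.read(bom, 3) &&
        bom[0] == '\xEF' && bom[1] == '\xBB' && bom[2] == '\xBF') {
        return in;                  // BOM found: leave the stream just after it
    }
    in.clear();                     // fewer than 3 bytes may have set eof/fail
    in.seekg(0);                    // no BOM: rewind and read from the start
    return in;
}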
Update: You can convert U+FEFF ZERO WIDTH NO-BREAK SPACE characters into U+2060 WORD JOINER except at the beginning of a file [Gillam, Richard, Unicode Demystified, Addison-Wesley, 2003, p. 108]. My personal code does this. If, when decoding UTF-8, I see 0xEF 0xBB 0xBF at the beginning of the file, I take it as a happy sign that I indeed have UTF-8. If the file doesn't begin with those bytes, I just proceed decoding normally. If, while decoding later in the file, I encounter a U+FEFF, I emit U+2060 and proceed. This means U+FEFF is used only as a BOM and not with its deprecated meaning.
Without knowing what those characters really are (i.e., without a hex dump) it's only a guess, but my immediate guess would be that what you're seeing is the result of taking a byte order mark (BOM) and (sort of) encoding it as UTF-8. Technically, you're not allowed to/supposed to do that, but in practice it's actually fairly common.
Just to clarify, you should realize that this not really a byte-order mark. The basic idea of a byte-order mark simply doesn't apply to UTF-8. Theoretically, UTF-8 encoding is never supposed to be applied to a BOM -- but you can ignore that, and apply the normal UTF-8 encoding rules to the values that make up a BOM anyway, if you want to.
Why does a file saved as UTF-8 have this character in the beginning [...] I have no idea what it is, I just know that it's not there when I save to ASCII.
I suppose you are referring to the Byte Order Mark (BOM) U+FEFF, a zero-width, non-breaking space character. Here (Notepad++ 5.4.3) a file saved as UTF-8 has the bytes EF BB BF at the beginning. I suppose that's the BOM encoded in UTF-8.
How can I check the encoding of a file
You cannot. You have to know what encoding your file was written in. While Unicode-encoded files might start with a BOM, I don't think there's a requirement that they do so.
Regarding your second point: every valid ASCII string is also a valid UTF-8 string, so you don't have to check for ASCII explicitly. Simply read the file as UTF-8; if it doesn't contain a valid UTF-8 string, you will get an error.
I'm guessing you meant to ask, why does it have those characters. Those characters are probably the byte order mark, which according to that link in UTF-8 are the bytes EF BB BF.
As for knowing what encoding a file is in, you cannot derive that from the file itself. You have to know it ahead of time (or ask the user who supplies you with the file). For a better understanding of encoding without having to do a lot of reading, I highly recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Determine input encoding by examining the input bytes

I'm getting console input from the user and want to encode it to UTF-8. My understanding is C++ does not have a standard encoding for input streams, and that it instead depends on the compiler, the runtime environment, localization, and what not.
How can I determine the input encoding by examining the bytes of the input?
In general, you can't. If I shoot a stream of randomly generated bytes at your app how can it determine their "encoding"? You simply have to specify that your application accepts certain encodings, or make an assumption that what the OS hands you will be suitably encoded.
Generally, checking whether the input is UTF-8 is a matter of heuristics -- there is no definitive algorithm that will tell you "yes" or "no". The more sophisticated the heuristic, the fewer false positives/negatives you will get, but there is no "sure" way.
For an example of such heuristics you can check out this library: http://utfcpp.sourceforge.net/
#include <fstream>
#include <iterator>
#include "utf8.h"   // the utfcpp header

bool valid_utf8_file(const char* file_name)
{
    std::ifstream ifs(file_name);
    if (!ifs)
        return false; // even better, throw here

    std::istreambuf_iterator<char> it(ifs.rdbuf());
    std::istreambuf_iterator<char> eos;

    return utf8::is_valid(it, eos);
}
You can either use it, or check its sources how they have done it.
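Typical use would be a one-off guard before committing to a decoding path; the file name here is just a placeholder:
#include <iostream>

bool valid_utf8_file(const char* file_name);   // the function shown above

int main()
{
    const char* path = "input.txt";            // hypothetical input file
    if (valid_utf8_file(path))
        std::cout << path << " can be read as UTF-8\n";
    else
        std::cout << path << " is not valid UTF-8 (or could not be opened)\n";
}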
Use the built-in operating system facilities. Those vary from one OS to another. On Windows, it's always better to use the WideChar APIs and not think about the encoding at all.
And if your input comes from a file, as opposed to a real console, then all bets are off.
Jared Oberhaus answered this well on a related question specific to Java.
Basically there are a few steps you can take to make a reasonable guess, but ultimately it's just guesswork without an explicit indication. (Hence the (in)famous BOM marker in UTF-8 files.)
As has already been said in response to the question John Weldon has pointed to, there are a number of libraries which do character encoding recognition. You could also take a look at the source of the unix file command and see what tests it uses to determine file encoding. From the man page of file:
ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set.
PCRE provides a function to test whether a given string is entirely valid UTF-8.