C++ ifstream UTF8 first characters

Why does a file saved as UTF-8 (in Notepad++) have these characters at the beginning of the fstream I opened to it in my C++ program?
´╗┐
I have no idea what it is; I just know that it's not there when I save as ASCII.
UPDATE: If I save it as UTF-8 (without BOM), it's not there.
How can I check the encoding of a file (ASCII or UTF-8, everything else will be rejected ;) ) in C++? Is it exactly these characters?
Thanks!

When you save a file as UTF-16, each value is two bytes. Different computers use different byte orders. Some put the most significant byte first, some put the least significant byte first. Unicode reserves a special codepoint (U+FEFF) called a byte-order mark (BOM). When a program writes a file in UTF-16, it puts this special codepoint at the beginning of the file. When another program reads a UTF-16 file, it knows there should be a BOM there. By comparing the actual bytes to the expected BOM, it can tell if the reader uses the same byte order as the writer, or if all the bytes have to be swapped.
When you save a UTF-8 file, there's no ambiguity in byte order. But some programs, especially ones written for Windows, still add a BOM, encoded as UTF-8. When you encode the BOM codepoint as UTF-8, you get three bytes, 0xEF 0xBB 0xBF. Those bytes correspond to accent and box-drawing characters in most OEM code pages (which are the default for a console window on Windows), which is why they show up as the odd ´╗┐ you're seeing.
The argument in favor of doing this is that it marks the files as truly UTF-8, as opposed to some other native encoding. For example, lots of text files on western Windows are in codepage 1252. Tagging the file with the UTF-8-encoded BOM makes it easier to tell the difference.
The argument against doing this is that lots of programs expect ASCII or UTF-8 regardless, and don't know how to handle the extra three bytes.
If I were writing a program that reads UTF-8, I would check for exactly these three bytes at the beginning. If they're there, skip them.
Update: You can convert U+FEFF ZERO WIDTH NO-BREAK SPACE characters into U+2060 WORD JOINER except at the beginning of a file [Gillam, Richard, Unicode Demystified, Addison-Wesley, 2003, p. 108]. My personal code does this. If, when decoding UTF-8, I see the bytes 0xEF 0xBB 0xBF at the beginning of the file, I take it as a happy sign that I indeed have UTF-8. If the file doesn't begin with those bytes, I just proceed decoding normally. If, while decoding later in the file, I encounter a U+FEFF, I emit U+2060 and proceed. This means U+FEFF is used only as a BOM and not in its deprecated meaning.
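A minimal sketch of that "check and skip" approach, assuming a stream opened in binary mode (the helper name open_utf8 is made up, and error handling is omitted):

#include <fstream>
#include <string>

// Open a file and leave the stream positioned after a leading UTF-8 BOM, if any.
std::ifstream open_utf8(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    char bom[3] = {};
    in.read(bom, 3);
    bool has_bom = in.gcount() == 3 &&
                   bom[0] == '\xEF' && bom[1] == '\xBB' && bom[2] == '\xBF';
    if (!has_bom) {
        in.clear();   // a very short file may have set the eof/fail bits
        in.seekg(0);  // rewind so the caller sees every byte
    }
    return in;
}

If the BOM is present, the stream simply starts right after it, which matches the "skip them" advice above.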

Without knowing what those characters really are (i.e., without a hex dump) it's only a guess, but my immediate guess would be that what you're seeing is the result of taking a byte order mark (BOM) and (sort of) encoding it as UTF-8. Technically you're not supposed to do that, but in practice it's actually fairly common.
Just to clarify, you should realize that this is not really a byte-order mark. The basic idea of a byte-order mark simply doesn't apply to UTF-8. Theoretically, UTF-8 encoding is never supposed to be applied to a BOM -- but you can ignore that and apply the normal UTF-8 encoding rules to the values that make up a BOM anyway, if you want to.

Why does a file saved as UTF8 have this character in the beginning [...] I have no idea what it is, I just know that it's not there when I save to ASCII.
I suppose you are referring to the Byte Order Mark (BOM) U+FEFF, a zero-width, non-breaking space character. Here (Notepad++ 5.4.3), a file saved as UTF-8 has the bytes EF BB BF at the beginning. I suppose that's the BOM encoded in UTF-8.
How can I check the encoding of a file
You cannot. You have to know what encoding your file was written in. While Unicode-encoded files might start with a BOM, I don't think there's a requirement that they do so.

Regarding your second point, every valid ASCII string is also a valid UTF-8 string, so you don't have to check for ASCII explicitly. Simply read the file as UTF-8; if it doesn't contain a valid UTF-8 string, you will get an error.
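Whatever does the UTF-8 decoding is what reports that error; a plain std::ifstream won't check anything for you. A rough sketch of such a well-formedness check (the function name looks_like_utf8 is made up; pure ASCII passes, since ASCII is a subset of UTF-8):

#include <cstddef>
#include <cstdint>

// True if the buffer is well-formed UTF-8: valid lead/continuation bytes,
// no overlong forms, no surrogates, nothing above U+10FFFF.
bool looks_like_utf8(const unsigned char* p, std::size_t n)
{
    for (std::size_t i = 0; i < n; ) {
        unsigned char b = p[i];
        if (b < 0x80) { ++i; continue; }                      // ASCII byte
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if      ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;                                    // invalid lead byte
        if (i + len > n) return false;                        // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return false;      // bad continuation byte
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000)) return false;         // overlong encoding
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}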

I'm guessing you meant to ask why it has those characters. Those characters are probably the byte order mark, which in UTF-8 is encoded as the bytes EF BB BF.
As for knowing what encoding a file is in, you cannot derive that from the file itself. You have to know it ahead of time (or ask the user who supplies you with the file). For a better understanding of encoding without having to do a lot of reading, I highly recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Related

Detecting Unicode in files in Windows 10

Now Windows 10 Notepad does not require Unicode files to have the BOM header, and it does not write the header by default. This breaks the existing code that checks the header to determine Unicode in files. How can I now tell in C++ if a file is in Unicode?
Source: https://www.bleepingcomputer.com/news/microsoft/windows-10-notepad-is-getting-better-utf-8-encoding-support/
The code we have to determine Unicode:
int IsUnicode(const BYTE p2bytes[3])
{
    if (p2bytes[0] == 0xEF && p2bytes[1] == 0xBB && p2bytes[2] == 0xBF)
        return 1; // UTF-8
    if (p2bytes[0] == 0xFE && p2bytes[1] == 0xFF)
        return 2; // UTF-16 (BE)
    if (p2bytes[0] == 0xFF && p2bytes[1] == 0xFE)
        return 3; // UTF-16 (LE)
    return 0;     // no BOM found
}
If it's so much pain, why isn't there a typical function to determine the encoding?
You should use the W3C method, which is something like:
if you know the encoding, use that
if there is a BOM, use it to determine the encoding
decode as UTF-8. UTF-8 has strict byte-sequence rules (that is much of the point of UTF-8: being able to find the first byte of a character), so if the file is not UTF-8, decoding will very probably fail. In ANSI (cp-1252) text it is uncommon for an accented letter to be followed by a symbol, and very improbable for that to happen every single time a high byte appears; in Latin-1 the bytes that would have to follow a lead byte are C1 control characters, and it is extremely rare for control characters to appear only, and always, after accented letters.
if decoding fails (you may only need to test the first 4096 bytes, or the first 10 or so bytes with values above 127), fall back to the standard 8-bit encoding of the OS (probably cp-1252 on Windows).
This method should work very well. It is biased toward UTF-8, but the world moved in that direction long ago. Determining which specific code page a file uses is much more difficult. A sketch of this order is given below.
You may add a step before the last one: if there are many 00 bytes, you may be looking at a UTF-16 or UTF-32 form. Unicode expects you to know which form you have (e.g. from a side channel); otherwise the file should carry a BOM. But you can guess the form (UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE) from where the 00 bytes fall: newlines and various other ASCII characters belong to the common script used by many scripts, so a UTF-16/UTF-32 file will normally contain plenty of 00 bytes in predictable positions.
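A sketch of that decision order, under a few assumptions: the name guess_encoding is mine, the sample size is the 4096 bytes suggested above, the "cp-1252" fallback string is just a label, and looks_like_utf8 stands for a UTF-8 well-formedness check such as the one sketched earlier on this page:

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

bool looks_like_utf8(const unsigned char* p, std::size_t n); // e.g. the validator sketched earlier

// Guess the encoding of a file following the order described above.
std::string guess_encoding(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> buf(4096);
    in.read(reinterpret_cast<char*>(buf.data()), static_cast<std::streamsize>(buf.size()));
    buf.resize(static_cast<std::size_t>(in.gcount()));

    // 1. A BOM, if present, settles the question.
    if (buf.size() >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) return "UTF-8";
    if (buf.size() >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return "UTF-16BE";
    if (buf.size() >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return "UTF-16LE"; // FF FE 00 00 would be UTF-32LE

    // 2. No BOM: accept UTF-8 if the sample decodes cleanly.
    if (looks_like_utf8(buf.data(), buf.size())) return "UTF-8";

    // 3. Otherwise fall back to the OS's 8-bit code page (cp-1252 on Western Windows).
    return "cp-1252";
}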
Now Windows 10 does not require unicode files to have the BOM header.
Windows never had this requirement. Every program can read text files however it wants to.
Maybe interesting: a BOM may not be desirable for UTF-8 because it breaks ASCII compatibility.
This does break the existing code that checks the header to determine Unicode in files.
This is a misunderstanding. Other code has likely had Unicode support for longer than Windows Notepad has.
How can I now tell in C++ if a file is in unicode?
Typically you would check for the presence of a BOM and then use that information of course.
Next you can try to read (the beginning of) the file with all possible encodings. The ones that throw an exception are obviously not suitable.
From the remaining encodings, you could use a heuristic to determine the encoding.
And if it still was the wrong choice, give the user an option to change the encoding manually. That's how it is done in many editors, like Notepad++.

Read file with unknown character type

I need to read text from a file that could contain any type of character (char, char8_t, wchar_t, etc). How can I determine which type of character is used and create an instance of basic_ifstream<char_type> depending on that type?
So I guess you want to auto-detect the encoding of an unknown text file.
This is impossible to do in a 100% reliable way. However, my experience shows that you can achieve very high reliability (> 99.99%) in most practical situations. The bigger the file, the more reliable the guess: a few tens of bytes are usually already enough to be confident.
A valid Unicode code point is a value from U+1 to U+10FFFF inclusive, excluding the surrogate range U+D800 to U+DFFF. Code point U+0 is actually valid, but excluding it greatly reduces the number of false-positive guesses (NUL bytes should never appear in any practical text file). For an even better guess, you can also exclude a few more very rare control characters. A sketch of this test follows the algorithm below.
Here is the algorithm I would propose:
If the file begins with a valid BOM (UTF-8, UTF-16BE/LE, UTF-32BE/LE), trust that BOM.
If the file contains only ASCII characters (non-null bytes below 128), treat it as ASCII (use char).
If the file is valid UTF-8, then assume it is UTF-8 (use char8_t, but char will work also). Note that ASCII is a subset of UTF-8, so the previous check could be bypassed.
If the file is valid UTF-32 (check both little and big endian versions), then assume UTF-32 (char32_t, possibly also wchar_t on Linux or macOS). Swap the bytes if needed.
If the file is valid UTF-16 (check both little and big endian versions), including restrictions on surrogate pairs, and there is a higher correlation between even or odd bytes than between all bytes together, assume UTF-16 (char16_t, possibly also wchar_t on Windows). Swap the bytes if needed.
Otherwise, the file is probably not in a Unicode encoding and may use an old code page. Good luck auto-detecting which one; the most common by far is ISO 8859-1 (Latin-1), so use char. It may also be raw binary data.
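A small sketch of the code-point test those steps rely on, applied to step 4 (the UTF-32 check). The function names are made up, and the big-endian variant just mirrors the byte order:

#include <cstddef>
#include <cstdint>

// Accept U+0001..U+10FFFF, excluding U+0000 and the surrogate range, as described above.
bool plausible_code_point(std::uint32_t cp)
{
    return cp != 0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Check whether a buffer is plausible little-endian UTF-32.
bool plausible_utf32le(const unsigned char* p, std::size_t n)
{
    if (n % 4 != 0) return false;                       // UTF-32 code units are 4 bytes
    for (std::size_t i = 0; i < n; i += 4) {
        std::uint32_t cp =  static_cast<std::uint32_t>(p[i])
                         | (static_cast<std::uint32_t>(p[i + 1]) << 8)
                         | (static_cast<std::uint32_t>(p[i + 2]) << 16)
                         | (static_cast<std::uint32_t>(p[i + 3]) << 24);
        if (!plausible_code_point(cp)) return false;
    }
    return true;
}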
It's impossible to know for sure. You have to be told what the character type is. Frequently text files will begin with a Byte-Order-Mark to clue you in, but even that's not entirely foolproof.
You can make reasonable guesses as to the file contents, for example, if you "know" that most of it is ASCII-range text, it should be easy to figure out if the file is full of char or wchar_t characters. Even this relies on assumptions and should not be considered bulletproof.

How am I able to store a Japanese character in a 1-byte normal string?

std::string str1="いい";
std::string str2="الحانةالريفية";
WriteToLog(str1.size());
WriteToLog(str2.size());
I get "2,13" in my log file which is the exact number of characters in those strings. But how the japanese and arabic characters fit into one byte. I hope str.size() is supposed to return no of bytes used by the string.
On my UTF-8-based locale, I get 6 and 26 bytes respectively.
You must be using a locale that uses the high 8-bit portion of the character set to encode these non-Latin characters, using one byte per character.
If you switch to a UTF-8 locale, you should get the same results as I did.
The answer is, you can't.
Those strings don't contain what you think they contain.
First make sure you save your source file as UTF-8 with BOM, or as UTF-16. (Visual Studio calls these UTF-8 with signature and Unicode).
Don't use any other encoding, as then the meaning of that string literal changes as you move your source file between computers with different language settings.
Then, you need to make sure the compiler uses a suitable character set to embed those strings in your binary. That's called the execution character set → see Does VC have a compile option like '-fexec-charset' in GCC to set the execution character set?
Or you can go for the portable solution, which is encoding the strings to UTF-8 yourself, and then writing the string literals as bytes: "\xe3\x81\x84\xe3\x81\x84".
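For instance, a minimal sketch of that portable variant for the first string (WriteToLog is the asker's own logging function; only the literal changes):

std::string str1 = "\xe3\x81\x84\xe3\x81\x84"; // "いい" spelled out as its UTF-8 bytes
WriteToLog(str1.size());                       // now reports 6: the UTF-8 byte count, not the character count

Since C++11 the u8 prefix (u8"いい") asks the compiler to produce those same UTF-8 bytes for you; note that from C++20 on such literals have type const char8_t[], so they no longer convert implicitly to std::string.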
They're using MBCS (Multi-byte character set).
Whereas "Unicode" on Windows (UTF-16) encodes most characters in two bytes, MBCS encodes common characters in a single byte and uses a lead byte to signal that the character is going to take more than one byte. Confusingly, depending on which character you chose for the second character in the Japanese string, your size may have been 3, not 2 or 4.
MBCS is a bit dated; it's recommended to use Unicode for new development when possible. See the link below for more info.
https://msdn.microsoft.com/en-us/library/5z097dxa.aspx

Reading text files of unknown encoding in C++

What should I use to read text files for which I don't know their encoding (ASCII or Unicode)?
Is there some class that auto-detects the encoding?
I can only give a negative answer here: there is no universally correct way to determine the encoding of a file. An ASCII file can be read using an ISO-8859-15 encoding, because ASCII is a subset of it. Worse, other files may be valid in two different encodings, with different meanings in each. So you need to get this information via some other means. In many cases it is a good approach to just assume that everything is UTF-8. If you are working in a *NIX environment, the LC_CTYPE variable may be helpful. If you do not care about the encoding (e.g. you do not change or process the content), you can open files as binary.
This is impossible in the general case. If the file contains exactly the bytes I'm typing here, it is equally valid as ASCII, UTF-8 or any of the ISO 8859 variants. Several heuristics can be used as a guess, however: read the first "page" (512 bytes or so), then, in the following order:
See if the block starts with a BOM in one of the Unicode formats.
Look at the first four bytes. If they contain '\0', you're probably dealing with some form of UTF-16 or UTF-32, according to the following pattern:
    '\0', other, '\0', other   →  UTF-16BE
    other, '\0', other, '\0'   →  UTF-16LE
    '\0', '\0', '\0', other    →  UTF-32BE
    other, '\0', '\0', '\0'    →  UTF-32LE
Look for a byte with the top bit set. If it's the start of a legal UTF-8 character, then the file is probably in UTF-8. Otherwise... in the regions where I've worked, ISO 8859-1 is generally the best guess.
Otherwise, you more or less have to assume ASCII, until you encounter a byte with the top bit set (at which point, you use the previous heuristic).
But as I said, it's not 100% sure.
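A small sketch of the second heuristic; the function name is made up, and it merely encodes the pattern table above:

// Map the positions of '\0' in the first four bytes to a likely UTF-16/UTF-32 form.
const char* guess_from_nulls(const unsigned char b[4])
{
    bool z0 = b[0] == 0, z1 = b[1] == 0, z2 = b[2] == 0, z3 = b[3] == 0;
    if ( z0 && !z1 &&  z2 && !z3) return "UTF-16BE";
    if (!z0 &&  z1 && !z2 &&  z3) return "UTF-16LE";
    if ( z0 &&  z1 &&  z2 && !z3) return "UTF-32BE";
    if (!z0 &&  z1 &&  z2 &&  z3) return "UTF-32LE";
    return nullptr; // no tell-tale pattern; fall through to the UTF-8 heuristic
}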
One (brute-force) way of doing this can be:
Build a list of suitable encodings (only ISO code pages and Unicode)
Iterate over all the considered encodings
Convert the text from this encoding to Unicode
Convert it back again
Compare the results for errors
If there are no errors, remember the encoding that produced the fewest bytes
Reference: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
If you are sure that your incoming encoding is ANSI or Unicode, then you can also check for a byte order mark. But let me tell you that it is not foolproof.
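On Windows, one way to approximate that brute-force loop is MultiByteToWideChar with the MB_ERR_INVALID_CHARS flag, which makes the call fail when the bytes are not valid in the candidate code page. This is only a rough, Windows-only sketch; the function name and the candidate list are mine, not the article's:

#include <windows.h>
#include <vector>

// Return the candidate code pages in which the raw bytes decode without errors.
std::vector<UINT> plausible_codepages(const char* data, int size)
{
    const UINT candidates[] = { CP_UTF8, 1252 /*Western*/, 1251 /*Cyrillic*/, 932 /*Shift-JIS*/ };
    std::vector<UINT> ok;
    for (UINT cp : candidates) {
        int wide = MultiByteToWideChar(cp, MB_ERR_INVALID_CHARS, data, size, nullptr, 0);
        if (wide > 0)
            ok.push_back(cp);   // decoded cleanly; keep it as a plausible guess
    }
    return ok;
}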

Determine if a byte array contains an ANSI or Unicode string?

Say I have a function that receives a byte array:
void fcn(byte* data)
{
...
}
Does anyone know a reliable way for fcn() to determine if data is an ANSI string or a Unicode string?
Note that I'm intentionally NOT passing a length arg, all I receive is the pointer to the array. A length arg would be a great help, but I don't receive it, so I must do without.
This article mentions an OLE API that apparently does it, but of course they don't tell you WHICH api function: http://support.microsoft.com/kb/138142
First, a word on terminology. There is no such thing as an ANSI string; there are ASCII strings, and ASCII is a character encoding. ASCII was developed by ANSI, but the two terms are not interchangeable.
Also, there is no such thing as a Unicode string. There are Unicode encodings, but those are only a part of Unicode itself.
I will assume that by "Unicode string" you mean "UTF-8 encoded codepoint sequence." And by ANSI string, I'll assume you mean ASCII.
If so, then every ASCII string is also a UTF-8 string, by the definition of UTF-8's encoding. ASCII only defines characters up to 0x7F, and all UTF-8 code units (bytes) up to 0x7F mean the same thing as they do under ASCII.
Therefore, your concern would be for the other 128 possible values. That is... complicated.
The only reason you would ask this question is if you have no control over the encoding of the string input. And therefore, the problem is that ASCII and UTF-8 are not the only possible choices.
There's Latin-1, for example. There are many strings out there that are encoded in Latin-1, which takes the other 128 bytes that ASCII doesn't use and defines characters for them. That's bad, because those other 128 bytes will conflict with UTF-8's encoding.
There are also code pages. Many strings were encoded against a particular code page; this is particularly so on Windows. Decoding them requires knowing what codepage you're working on.
If you are in a situation where you are certain that a string is either ASCII (7-bit, with the high bit always 0) or UTF-8, then you can make the determination easily. Either the string is ASCII (and therefore also UTF-8), or one or more of the bytes will have the high bit set to 1. In which case, you must use UTF-8 decoding logic.
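A minimal sketch of that easy case, assuming the buffer is NUL-terminated (it has to end somehow, since fcn receives no length); the function name is made up:

#include <cstddef>

// True if every byte is 7-bit ASCII; otherwise the data needs UTF-8 decoding logic.
bool is_plain_ascii(const unsigned char* data)
{
    for (std::size_t i = 0; data[i] != 0; ++i)
        if (data[i] & 0x80)
            return false;   // high bit set somewhere: not plain ASCII
    return true;
}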
Unless you are truly certain that these are the only possibilities, you are going to need to do a bit more. You can validate the data by trying to run it through a UTF-8 decoder. If it runs into an invalid code unit sequence, then you know it isn't UTF-8. The problem is that it is theoretically possible to create a Latin-1 string that is technically valid UTF-8. You're kinda screwed at that point. The same goes for code page-based strings.
Ultimately, if you don't know what encoding the string is, there's no guarantee you can display it properly. That's why it's important to know where your strings come from and what they mean.