How to detect unicode file names in Linux - c++

I have a Windows application written in C++. In it we check whether a file name is Unicode by calling wcstombs(): if the conversion fails, we assume the file name is Unicode. When I tried the same approach on Linux, the conversion never fails. I know that on Windows the default charset is Latin, whereas the default charset on Linux is UTF-8. Depending on whether the file name is Unicode or not, we take different code paths. Since I can't get this check to work on Linux, I can't make my application portable for Unicode characters. Is there another workaround for this, or am I doing something wrong?

UTF-8 has the nice property that all ASCII characters are represented exactly as in ASCII, and all non-ASCII characters are represented as sequences of two or more bytes with values >= 128. So to check for ASCII, all you have to examine is the numerical magnitude of each unsigned byte: if it is >= 128, the name is non-ASCII, which with UTF-8 as the basic encoding means "Unicode" (even if the character is within the range of Latin-1; note that Latin-1 is a proper subset of Unicode, constituting the first 256 code points).
However, note that while on Windows a filename is a sequence of characters, on *nix it is a sequence of bytes, so ideally you should ignore what those bytes might encode.
That might be difficult to reconcile with a naïve user's view, though.
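A minimal sketch of that byte check, assuming the file name arrives as a byte string; the function name is hypothetical, and it only reports whether the name is pure ASCII, not whether the bytes form valid UTF-8:

#include <string>

// On Linux a file name is just a sequence of bytes. Any byte >= 0x80
// means the name is not plain ASCII; under a UTF-8 locale that usually
// means it contains non-ASCII Unicode characters.
bool is_ascii_name(const std::string& name)
{
    for (unsigned char c : name) {
        if (c >= 0x80)
            return false;
    }
    return true;
}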

Related

Avoid / set character set conversion /encoding for std::cout / std::cerr

General question
Is there a possibility to avoid character set conversion when writing to std::cout / std::cerr?
I do something like
std::cout << "Ȋ'ɱ ȁ ȖȚƑ-8 Șțȓȉɳɠ (in UTF-8 encoding)" << std::endl;
And I want the output to be written to the console maintaining the UTF-8 encoding (my console uses UTF-8 encoding, but my C++ standard library, GNU's libstdc++, doesn't think so for some reason).
If there's no possibility to forbid character encoding conversion: Can I set std::cout to use UTF-8, so it hopefully figures out itself that no conversion is needed?
Background
I used the Windows API function SetConsoleOutputCP(CP_UTF8); to set my console's encoding to UTF-8.
The problem seems to be that UTF-8 does not match the code page typically used for my system's locale, and libstdc++ therefore sets up std::cout with the default ANSI code page instead of correctly recognizing the switch.
Edit: Turns out I misinterpreted the issue and the solution is actually a lot simpler (or not...).
The "Ȋ'ɱ ȁ ȖȚƑ-8 Șțȓȉɳɠ (in UTF-8 encoding)" was just meant as a placeholder (and I shouldn't have used it as it has hidden the actual issue).
In my real code the "UTF-8 string" is a Glib::ustring, and those are by definition UTF-8 encoded.
However I did not realize that the output operator << was defined in glibmm in a way that forces character set conversion.
It uses g_locale_from_utf8() internally which in turn uses g_get_charset() to determine the target encoding.
Unfortunately the documentation for g_get_charset() states
On Windows the character set returned by this function is the so-called system default ANSI code-page. That is the character set used by the "narrow" versions of C library and Win32 functions that handle file names. It might be different from the character set used by the C library's current locale.
which simply means that glib neither cares about the C locale I set nor attempts to determine the encoding my console actually uses, and that basically makes it impossible to use many glib functions to create UTF-8 output. (As a matter of fact this also means that this issue has the exact same cause as the issue that triggered my other question: Force UTF-8 encoding in glib's "g_print()".)
I'm currently considering this a bug in glib (or a serious limitation at best) and will probably open a report in the issue tracker for it.
You are looking at the wrong side: you are talking about a string literal included in your source code (not input from your keyboard), and for that to work properly you have to tell the compiler which encoding is used for those characters (I think the first C++ spec that mentions non-ASCII charsets is C++11).
Since you are actually using the UTF character set, you would either have to encode the characters in at least a wchar_t for them to be treated as such, or agree with the compiler (probably this is what happens) that UTF characters in string literals are UTF-8 encoded. That commonly means they will be printed as UTF-8 and, if you use a UTF-8 compliant console device, they will print fine, without any other problem.
GCC has options to specify the encoding used for string literals in a source file (-finput-charset and -fexec-charset), and there should be an equivalent in clang as well. Check the documentation and this will probably solve any issues. But the best thing for portability is not to depend on the code set, or to use one like ISO 10646 (but know that full UTF coverage is not only UTF-8; UTF-8 is only one way to encode UTF characters, and as such, only one way to represent them).
Another issue is that C++11 doesn't refer to the Unicode consortium standard but to its ISO counterpart (ISO/IEC 10646, I think). The two are similar but not equal, and their character encodings are similar but not equal (the code size of the ISO standard is 32 bits while the Unicode consortium's is 21 bits, for example). These and other differences between them make some tricks awkward in C++ and produce problems when one is thinking in strict Unicode.
Of course, to output correct strings on a UTF-8 terminal you have to encode UTF code points into UTF-8 form before sending them to the terminal. This is true even if you already have them encoded as UTF-8 in a string object. If you say they are already UTF-8, no conversion is made at all... but if you don't say so, the usual assumption is that you are using plain code points (limited to 8-bit values), which are then encoded to UTF-8 before printing. This leads to double encoding: something like ú (Unicode code point U+00FA) should be encoded in UTF-8 as the byte sequence { 0xc3, 0xba }, but if you don't say the string literal is indeed UTF-8, the two bytes will be treated as the code points for the characters Ã (U+00C3) and º (U+00BA) and re-encoded as { 0xc3, 0x83, 0xc2, 0xba }, which displays incorrectly. This is a very common error and you have probably seen it whenever some encoding is done incorrectly. Source for the samples here.
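A small sketch of the double-encoding mistake just described; the helper name is made up, and the example assumes the string already holds UTF-8 bytes:

#include <cstdio>
#include <string>

// Re-encode a byte string as if every byte were a Latin-1 code point.
// Applying this to text that is *already* UTF-8 produces the classic
// double-encoding garbage described above.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

int main()
{
    std::string u = "\xC3\xBA";            // "ú" already encoded as UTF-8
    std::string twice = latin1_to_utf8(u); // double-encoded
    for (unsigned char c : twice)
        std::printf("%02X ", c);           // prints: C3 83 C2 BA
    std::printf("\n");
}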

UTF-8 Compatibility in C++

I am writing a program that needs to be able to work with text in all languages. My understanding is that UTF-8 will do the job, but I am experiencing a few problems with it.
Am I right to say that UTF-8 can be stored in a simple char in C++? If so, why do I get the following warning when I use char, string, and stringstream in a program: warning C4566: character represented by universal-character-name '\uFFFD' cannot be represented in the current code page (1252). (I do not get that warning when I use wchar_t, wstring and wstringstream.)
Additionally, I know that UTF-8 is variable length. When I use the at or substr string methods, will I get the wrong answer?
To use UTF-8 string literals you need to prefix them with u8, otherwise you get the implementation's character set (in your case, it seems to be Windows-1252): u8"\uFFFD" is a null-terminated sequence of bytes with the UTF-8 representation of the replacement character (U+FFFD). It has type char const[4].
Since UTF-8 has variable length, all kinds of indexing will index code units, not code points. It is not possible to do random access on code points in a UTF-8 sequence because of its variable-length nature. If you want random access you need to use a fixed-length encoding, like UTF-32. For that you can use the U prefix on strings.
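A short illustration of the two literal prefixes mentioned above, assuming a pre-C++20 compiler (from C++20 on, u8 literals use char8_t rather than char):

#include <cstddef>

// u8"..." yields char-based UTF-8 code units; U"..." yields char32_t code points.
const char     utf8[]  = u8"\uFFFD";   // 3 UTF-8 code units plus '\0' -> char[4]
const char32_t utf32[] = U"\uFFFD";    // 1 code point plus terminator -> char32_t[2]

static_assert(sizeof(utf8)  == 4, "U+FFFD takes three UTF-8 code units");
static_assert(sizeof(utf32) == 2 * sizeof(char32_t), "one code point plus terminator");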
Yes, the UTF-8 encoding can be used with char, string, and stringstream. A char will hold a single UTF-8 code unit, of which up to four may be required to represent a single Unicode code point.
However, there are a few issues using UTF-8 specifically with Microsoft's compilers. C++ implementations use an 'execution character set' for a number of things, such as encoding character and string literals. VC++ always uses the system locale encoding as the execution character set, and Windows does not support UTF-8 as the system locale encoding, therefore UTF-8 can never be the execution character set.
This means that VC++ never intentionally produces UTF-8 character and string literals. Instead the compiler must be tricked.
The compiler will convert from the known source code encoding to the execution encoding. That means that if the compiler uses the locale encoding for both the source and execution encodings then no conversion is done. If you can get UTF-8 data into the source code but have the compiler think that the source uses the locale encoding, then character and string literals will use the UTF-8 encoding. VC++ uses the so-called 'BOM' to detect the source encoding, and uses the locale encoding if no BOM is detected. Therefore you can get UTF-8 encoded string literals by saving all your source files as "UTF-8 without signature".
There are caveats with this method. First, you cannot use UCNs with narrow character and string literals. Universal Character Names have to be converted to the execution character set, which isn't UTF-8. You must either write the character literally so it appears as UTF-8 in the source code, or you can use hex escapes where you manually write out a UTF-8 encoding. Second, in order to produce wide character and string literals the compiler performs a similar conversion from the source encoding to the wide execution character set (which is always UTF-16 in VC++). Since we're lying to the compiler about the encoding, it will perform this conversion to UTF-16 incorrectly. So in wide character and string literals you cannot use non-ASCII characters literally, and instead you must use UCNs or hex escapes.
UTF-8 is variable length (as is UTF-16). The indices used with at() and substr() are code units rather than character or code point indices. So if you want a particular code unit then you can just index into the string or array or whatever as normal. If you need a particular code point then you either need a library that can understand composing UTF-8 code units into code points (such as the Boost Unicode iterators library), or you need to convert the UTF-8 data into UTF-32. If you need actual user perceived characters then you need a library that understands how code points are composed into characters. I imagine ICU has such functionality, or you could implement the Default Grapheme Cluster Boundary Specification from the Unicode standard.
The above consideration of UTF-8 only really matters for how you write Unicode data in the source code. It has little bearing on the program's input and output.
If your requirements allow you to choose how to do input and output then I would still recommend using UTF-8 for input. Depending on what you need to do with the input you can either convert it to another encoding that's easy for you to process, or you can write your processing routines to work directly on UTF-8.
If you want to ever output anything via the Windows console then you'll want a well defined module for output that can have different implementations, because internationalized output to the Windows console will require a different implementation from either outputting to a file on Windows or console and file output on other platforms. (On other platforms the console is just another file, but the Windows console needs special treatment.)
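To illustrate the code-unit vs. code-point distinction above, here is a rough sketch that counts code points in a UTF-8 string by skipping continuation bytes; the helper name is made up, and it assumes the input is valid UTF-8 (it does not attempt grapheme clustering):

#include <cstddef>
#include <string>

// Count Unicode code points in a valid UTF-8 string.
// Continuation bytes have the bit pattern 10xxxxxx and are skipped.
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Example: "n\xC3\xA9" ("né") has size() == 3 code units,
// but count_code_points(...) == 2 code points.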
The reason you get the warning about \uFFFD is that this character cannot be represented in your current code page (1252) and therefore cannot be stored in a single-byte char, since, as you noted, UTF-8 works on chars and is variable length.
If you use at or substr, you may well get wrong answers, since these methods assume that one byte is one character. This is not the case with UTF-8. Notably, with at you could end up with a single byte from the middle of a multi-byte sequence; with substr you could split a sequence and end up with an invalid UTF-8 string (when displayed it would start or end with �, U+FFFD, the very character you're apparently trying to use, and the broken character would be lost).
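For example (a minimal sketch, assuming the bytes in the literal end up UTF-8 encoded):

#include <iostream>
#include <string>

int main()
{
    std::string s = "caf\xC3\xA9";        // "café" in UTF-8: 5 bytes, 4 characters
    std::cout << s.size() << "\n";        // prints 5, not 4
    std::cout << s.substr(0, 4) << "\n";  // "caf" plus a lone 0xC3 byte: invalid UTF-8
}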
I would recommend that you use wchar_t to store Unicode strings. Since that type is at least 16 bits, many more characters can fit in a single "unit".

Internal and external encoding vs. Unicode

Since there was a lot of misinformation spread by several posters in the comments for this question: C++ ABI issues list
I have created this one to clarify.
What are the encodings used for C style strings?
Is Linux using UTF-8 to encode strings?
How does external encoding relate to the encoding used by narrow and wide strings?
Implementation defined. Or even application defined; the standard doesn't really put any restrictions on what an application does with them, and expects a lot of the behavior to depend on the locale. All that is really implementation defined is the encoding used in string literals.
In what sense? Most of the OS ignores most of the encodings; you'll have problems if '\0' isn't a nul byte, but even EBCDIC meets that requirement. Otherwise, depending on the context, there will be a few additional characters which may be significant (a '/' in path names, for example); all of these use the first 128 code points in Unicode, so will have a single-byte encoding in UTF-8. As an example, I've used both UTF-8 and ISO 8859-1 for filenames under Linux. The only real issue is displaying them: if you do ls in an xterm, for example, ls and the xterm will assume that the filenames are in the same encoding as the display font.
That mainly depends on the locale. Depending on the locale, it's quite possible for the internal encoding of a narrow character string not to correspond to that used for string literals. (But how could it be otherwise, since the encoding of a string literal must be determined at compile time, whereas the internal encoding for narrow character strings depends on the locale used to read it, and can vary from one string to the next.)
If you're developing a new application in Linux, I would strongly recommend using Unicode for everything, with UTF-32 for wide character strings and UTF-8 for narrow character strings. But don't count on anything outside the first 128 code points working in string literals.
This depends on the architecture. Most Unix systems use UTF-32 for wide strings (wchar_t) and ASCII for narrow strings (char). Note that ASCII is just a 7-bit encoding. Windows used UCS-2 until Windows 2000; later versions use the variable-length encoding UTF-16 (for wchar_t).
No. Most system calls on Linux are encoding agnostic (they don't care what the encoding is, since they do not interpret it in any way). The external encoding is actually defined by your current locale.
The internal encoding used by narrow and wide strings is fixed; it does not change when the locale changes. By changing the locale you are changing the translation functions that encode and decode the data entering and leaving your program (assuming you stick with the standard C text functions).
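A small sketch of the point above, assuming a Linux system where the locale name "en_US.UTF-8" (an assumption) is installed:

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main()
{
    std::setlocale(LC_ALL, "en_US.UTF-8");                    // assumed to exist
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));  // 4 on typical Linux

    const char* utf8 = "\xC3\xA9";       // "é" encoded as UTF-8
    wchar_t wide[8];
    std::mbstowcs(wide, utf8, 8);        // decode through the current locale
    std::printf("code point = U+%04X\n", (unsigned)wide[0]);  // U+00E9
}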

How to UTF-8 encode a character/string

I am using a Twitter API library to post a status to Twitter. Twitter requires that the post be UTF-8 encoded. The library contains a function that URL encodes a standard string, which works perfectly for all special characters such as !##$%^&*() but is the incorrect encoding for accented characters (and other UTF-8).
For example, 'é' gets converted to '%E9' rather than '%C3%A9' (it pretty much just converts the byte to a hexadecimal value). Is there a built-in function that could take something like 'é' and return something like '%C3%A9'?
Edit: I am fairly new to UTF-8, in case what I am requesting makes no sense.
Edit: if I have a
string foo = "bar é";
I would like to convert it to
"bar %C3%A9"
Thanks
If you have a wide character string, you can encode it in UTF8 with the standard wcstombs() function. If you have it in some other encoding (e.g. Latin-1) you will have to decode it to a wide string first.
Edit: ... but wcstombs() depends on your locale settings, and it looks like you can't select a UTF8 locale on Windows. (You don't say what OS you're using.) WideCharToMultiByte() might be more useful on Windows, as you can specify the encoding in the call.
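A rough sketch of the wcstombs() approach on a platform where a UTF-8 locale is available; the locale name is an assumption and must be installed on the system:

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main()
{
    std::setlocale(LC_ALL, "en_US.UTF-8");   // UTF-8 locale; not selectable on Windows
    const wchar_t* wide = L"bar \u00E9";     // "bar é" as a wide string
    char utf8[32];
    std::size_t n = std::wcstombs(utf8, wide, sizeof utf8);
    for (std::size_t i = 0; i < n; ++i)
        std::printf("%02X ", (unsigned char)utf8[i]);  // 62 61 72 20 C3 A9
    std::printf("\n");
}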
To understand what needs to be done, you have to first understand a bit of background. Different encodings use different values for the "same" character. Latin-1, for example, says "é" is a single byte with value E9 (hex), while UTF-8 says "é" is the two byte sequence C3 A9, and yet UTF-16 says that same character is the single double-byte value 00E9 – a single 16-bit value rather than two 8-bit values as in UTF-8. (Unicode, which isn't an encoding, actually uses the same codepoint value, U+E9, as Latin-1.)
To convert from one encoding to another, you must first take the encoded value, decode it to a value independent of the source encoding (i.e. Unicode codepoint), then re-encode it in the target encoding. If the target encoding doesn't support all of the source encoding's codepoints, then you'll either need to translate or otherwise handle this condition.
This re-encoding step requires knowing both the source and target encodings.
Your API function is not converting encodings; it appears to be URL-escaping an arbitrary byte string. The authors of the function apparently assume you will have already converted to UTF-8.
In order to convert to UTF-8, you must know what encoding your system is using and be able to map to Unicode codepoints. From there, the UTF-8 encoding is trivial.
Depending on your system, this may be as easy as converting the "native" character set (which has "é" as E9 for you, so probably Windows-1252, Latin-1, or something very similar) to wide characters (which is probably UTF-16 or UCS-2 if sizeof(wchar_t) is 2, or UTF-32 if sizeof(wchar_t) is 4) and then to UTF-8. Wcstombs, as Martin answers, may be able to handle the second part of this conversion, but this is system-dependent. However, I believe Latin-1 is a subset of Unicode, so conversion from this source encoding can skip the wide character step. Windows-1252 is close to Latin-1, but replaces some control characters with printable characters.
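Putting the pieces together, here is a hedged sketch of the final percent-encoding step, assuming the std::string already holds UTF-8 bytes; the helper name is made up:

#include <string>

// Percent-encode every byte outside the URL-safe "unreserved" set.
// If the input bytes are UTF-8, "é" (C3 A9) comes out as "%C3%A9".
std::string url_encode_utf8(const std::string& in)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        bool unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                          (c >= '0' && c <= '9') || c == '-' || c == '_' ||
                          c == '.' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// url_encode_utf8("bar \xC3\xA9") yields "bar%20%C3%A9".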

Read Unicode Files

I have a problem reading and using the content from Unicode files.
I am working on a Unicode release build, and I am trying to read the content from a Unicode file, but the data has strange characters and I can't seem to find a way to convert the data to ASCII.
I'm using fgets. I tried fgetws, WideCharToMultiByte, and a lot of functions which I found in other articles and posts, but nothing worked.
Because you mention WideCharToMultiByte I will assume you are dealing with Windows.
"read the content from an unicode file ... find a way to convert data to ASCII"
This might be a problem. If you convert Unicode to ASCII (or another legacy code page) you run the risk of corrupting or losing data.
Since you are "working on a unicode release build" you will want to read Unicode and stay Unicode.
So your final buffer will have to be wchar_t (or WCHAR, or CStringW, same thing).
So your file might be UTF-16 or UTF-8 (UTF-32 is quite rare).
For UTF-16 the endianness might also matter. If there is a BOM, that will help a lot.
Quick steps:
open the file with _wopen or _wfopen in binary mode
read the first bytes to identify the encoding using the BOM
if the encoding is UTF-8, read into a byte array and convert to wchar_t with MultiByteToWideChar and CP_UTF8
if the encoding is UTF-16BE (big endian), read into a wchar_t array and byte-swap with _swab
if the encoding is UTF-16LE (little endian), read into a wchar_t array and you are done
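A rough, Windows-oriented sketch of the quick steps above (error handling kept minimal; MultiByteToWideChar is the direction needed to go from UTF-8 bytes to wchar_t):

#include <windows.h>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Read a text file that may be UTF-8 or UTF-16 (with or without BOM) into a wide string.
std::wstring read_unicode_file(const wchar_t* path)
{
    FILE* f = _wfopen(path, L"rb");               // open as binary
    if (!f) return L"";
    std::vector<unsigned char> bytes;
    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        bytes.insert(bytes.end(), buf, buf + n);
    fclose(f);
    if (bytes.empty()) return L"";

    // BOMs: UTF-16LE = FF FE, UTF-16BE = FE FF, UTF-8 = EF BB BF
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        return std::wstring((const wchar_t*)(bytes.data() + 2), (bytes.size() - 2) / 2);
    if (bytes.size() >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) {
        _swab((char*)(bytes.data() + 2), (char*)(bytes.data() + 2), (int)(bytes.size() - 2));
        return std::wstring((const wchar_t*)(bytes.data() + 2), (bytes.size() - 2) / 2);
    }
    size_t skip = (bytes.size() >= 3 && bytes[0] == 0xEF &&
                   bytes[1] == 0xBB && bytes[2] == 0xBF) ? 3 : 0;  // otherwise assume UTF-8
    int len = MultiByteToWideChar(CP_UTF8, 0, (const char*)(bytes.data() + skip),
                                  (int)(bytes.size() - skip), NULL, 0);
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, (const char*)(bytes.data() + skip),
                        (int)(bytes.size() - skip), &out[0], len);
    return out;
}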
Also (if you use a newer Visual Studio), you might take advantage of an MS extension to _wfopen. It can take an encoding as part of the mode (something like _wfopen(L"newfile.txt", L"rt, ccs=<encoding>"); with the encoding being UTF-8 or UTF-16LE). It can also detect the encoding based on the BOM.
Warning: being cross-platform here is problematic; wchar_t can be 2 or 4 bytes, and the conversion routines are not portable...
Useful links:
BOM (http://unicode.org/faq/utf_bom.html)
wfopen (http://msdn.microsoft.com/en-us/library/yeby3zcb.aspx)
We'll need more information to answer the question (for example, are you trying to read the Unicode file into a char buffer or a wchar_t buffer? What encoding does the file use?), but for now you might want to make sure you're not running into this issue if your file is Unicode and you're using fgetws in text mode.
When a Unicode stream-I/O function operates in text mode, the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).
Unicode is the mapping from numerical codes to characters. The step before Unicode is the file's encoding: how do you transform consecutive bytes into a numerical code? You have to check whether the file is stored as big-endian, little-endian or something else.
Often, the BOM (byte order mark) is written as the first two bytes in the file: either FF FE or FE FF.
The intended way of handling charsets is to let the locale system do it.
You have to have set the correct locale before opening your stream.
By the way, you tagged your question C++ but wrote about fgets and fgetws rather than IOStreams; is your problem C++ or C?
For C:
#include <locale.h>
setlocale(LC_ALL, ""); /* at least LC_CTYPE */
For C++:
#include <locale>
std::locale::global(std::locale(""));
Then wide IO (wstream, fgetws) should work if your environment is correctly set up for Unicode. If not, you'll have to change your environment (I don't know how that works under Windows; for Unix, setting the LC_ALL variable is the way, see locale -a for supported values). Alternatively, replacing the empty string with an explicit locale name would also work, but then you hardcode the locale in your program and your users perhaps won't appreciate that.
If your system doesn't support an adequate locale, you have the possibility in C++ of writing a facet for the conversion yourself. But that is outside the scope of this answer.
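A minimal C++ sketch of the locale-based approach, assuming the environment locale is UTF-8 and using a hypothetical file name:

#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::locale::global(std::locale(""));   // take the locale from the environment
    std::wifstream in("input.txt");         // hypothetical file name
    in.imbue(std::locale());                // decode bytes to wchar_t via that locale
    std::wstring line;
    while (std::getline(in, line))
        std::wcout << line << L"\n";
}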
You CANNOT reliably convert Unicode, even UTF-8, to ASCII. The character sets ('planes' in Unicode documentation) do not map back to ASCII - that's why Unicode exists in the first place.
First: I assume you are trying to read UTF-8-encoded Unicode (since you can read some characters). You can check this, for example, in Notepad++.
For your problem I'd suggest using some sort of library. You could try Qt; QFile supports Unicode (as does the rest of the library).
If this is too much, use a dedicated Unicode library, for example: http://utfcpp.sourceforge.net/.
And learn about Unicode: http://en.wikipedia.org/wiki/Unicode. There you'll find references to the different Unicode encodings.