Is there any way to use environment variables in C++ for the path to a file?
The idea is to use them without expanding them, so I don't need to use wchar_t for languages with Unicode characters when I want to save/read a file.
// EDIT
A little edit with more explanation.
So what I am trying to achieve is to read/write a file without worrying about the characters in its path; I don't want to use wchar_t for the path, but it should still work if the path contains wide characters.
There are the functions getenv and GetEnvironmentVariable, but they require the proper language to be set under "Language for non-Unicode programs" in the Windows settings (Control Panel -> Clock, Language, and Region -> Region and Language -> Administrative), which requires action from users, and that is something I am trying to avoid.
This is specifically a Windows problem.
On other platforms such as Linux, filepaths and environment variables are natively byte-based; you can access them using the standard C library functions that take byte-string paths like fopen() and getenv(). The pathnames may represent Unicode strings to the user (decoded using some encoding, almost always UTF-8 which can encode any character), but to the code they're just byte strings.
Windows, on the other hand, has filenames and environment variables that are natively strings of 16-bit (UTF-16) code units (which are nearly the same thing as Unicode character code points, but not quite, because that would be too easy... but that's a sadness for another time). You can call Win32 file-handling APIs like CreateFileW() and GetEnvironmentVariableW() using UTF-16 code unit strings (wchar_t, when compiled on Windows) and access any file names directly.
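As a small illustration, here is a sketch of reading an environment variable through the wide API (a hypothetical helper; error handling is reduced to returning an empty string):

#include <string>
#include <windows.h>

// Read an environment variable via the native UTF-16 Win32 API.
std::wstring get_env_w(const wchar_t* name)
{
    // First call asks for the required buffer size (including the null terminator).
    DWORD len = GetEnvironmentVariableW(name, nullptr, 0);
    if (len == 0) return std::wstring(); // variable not set
    std::wstring value(len, L'\0');
    DWORD written = GetEnvironmentVariableW(name, &value[0], len);
    value.resize(written); // on success, 'written' excludes the terminating null
    return value;
}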
There are also old-school legacy byte-based Win32 functions like GetEnvironmentVariableA() (which is what GetEnvironmentVariable() points to if you are compiling a non-Unicode project). If you call those functions, Windows has to convert from the char byte strings you give it to UTF-16 strings, using some encoding.
That encoding is the ‘ANSI’ (‘A’) locale-specific default code page, which is what “Language for non-Unicode programs” sets.
Although that encoding can be changed by the user, it can't be set to UTF-8 or any other encoding that supports all characters, so even if you ask the user to change it, that still doesn't let you access all files. Thus the Win32 A APIs are always to be avoided.
The problem comes when you want to access files in a manner that works on both Windows and the other platforms. If you call the C standard library with byte strings, the Microsoft C runtime library adapts those calls to call the Win32 A byte-based APIs, which as above are irritatingly limited.
So your unattractive choices are:
1. use wchar_t and std::wstring strings in your code, using only Win32 APIs for interacting with filenames and environment variables, and accepting that your code will never run on other platforms; or
2. use char and UTF-8-encoded std::string strings, and give up on your code accessing filenames and environment variables containing non-ASCII characters on Windows; or
3. write a load of branching #ifdef code to switch between using C standard functions for filename and environment interaction, or using Win32 APIs with a bunch of UTF-8-char-to-wchar_t string conversions in between, so that the code works across multiple platforms; or
4. use a library that encapsulates (3) for you.
Notably there is boost::nowide (since Boost 1.73) which contains boost::nowide::getenv.
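For example, a minimal sketch using boost::nowide (assuming Boost 1.73 or later; all char strings are treated as UTF-8 on every platform):

#include <boost/nowide/cstdlib.hpp>   // boost::nowide::getenv
#include <boost/nowide/fstream.hpp>   // boost::nowide::ifstream
#include <boost/nowide/iostream.hpp>  // boost::nowide::cout

int main()
{
    // Returns a UTF-8 encoded value on Windows, plain bytes elsewhere.
    if (const char* home = boost::nowide::getenv("USERPROFILE"))
        boost::nowide::cout << "Home = " << home << '\n';

    // Opens via the wide Win32 API on Windows, the plain fopen machinery elsewhere.
    boost::nowide::ifstream file("\xC3\xBCml\xC3\xA4ut.txt"); // "ümläut.txt" as UTF-8 bytes
    return file.is_open() ? 0 : 1;
}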
This isn't entirely Microsoft's fault: Windows NT was designed in the early days of Unicode, before UTF-8 or the astral planes were invented, when it was thought that 16-bit code unit strings were a totally sensible way to store text, and not the lamentable disaster we know it is now. It is, however, very sad that Windows has not been updated since then to treat UTF-8 as a first-class citizen and provide an easy way to write cross-platform applications.
The standard library gives you the function getenv. Here is an example:
#include <cstdlib>
#include <iostream>

int main()
{
    const char* pPath = std::getenv("PATH");
    if (pPath)
        std::cout << "Path = " << pPath << std::endl;
    return 0;
}
Related
I want to understand the difference between char and wchar_t. I understand that wchar_t uses more bytes, but can I get a clear-cut example of when I would use char vs wchar_t?
Short answer:
You should never use wchar_t in modern C++, except when interacting with OS-specific APIs (basically use wchar_t only to call Windows API functions).
Long answer:
The design of the standard C++ library implies there is only one way to handle Unicode: by storing UTF-8 encoded strings in char arrays, as almost all functions exist only in char variants (think of std::exception::what).
In a C++ program you have two locales:
Standard C library locale set by std::setlocale
Standard C++ library locale set by std::locale::global
Unfortunately, none of them defines the behavior of the standard functions that open files (like std::fopen, std::fstream::open, etc.). Behavior differs between OSes:
Linux is encoding-agnostic, so those functions simply pass the char string to the underlying system call
On Windows, the char string is converted to a wide string, using a user-specific locale, before the system call is made
Everything usually works fine on Linux, as everyone uses UTF-8 based locales, so all user input and arguments passed to main will be UTF-8 encoded. But you might still need to switch the current locales to their UTF-8 variants explicitly, as by default a C++ program starts up using the default "C" locale. At this point, if you only care about Linux and don't need to support Windows, you can use char arrays and std::string, assume they hold UTF-8 sequences, and everything "just works".
Problems appear when you want to support Windows, as there you always have an additional, third locale: the one set for the current user, which can be configured somewhere in the Control Panel. The main issue is that this locale is never a Unicode locale, so it is impossible to use functions like std::fopen(const char *) and std::fstream::open(const char *) to open a file using a Unicode path. On Windows you have to use custom wrappers that call non-standard, Windows-specific functions like _wfopen or std::fstream::open(const wchar_t *). You can check Boost.Nowide (now part of Boost since 1.73) to see how this can be done: http://cppcms.com/files/nowide/html/
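A rough sketch of such a wrapper (a hypothetical fopen_utf8 helper; the Windows branch assumes conversion with MultiByteToWideChar and CP_UTF8):

#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Open a file from a UTF-8 encoded path on any platform.
std::FILE* fopen_utf8(const char* utf8_path, const char* mode)
{
#ifdef _WIN32
    // Convert the UTF-8 path and mode to UTF-16, then use the wide CRT function.
    auto widen = [](const char* s) {
        int len = MultiByteToWideChar(CP_UTF8, 0, s, -1, nullptr, 0);
        std::wstring w(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s, -1, &w[0], len);
        return w;
    };
    return _wfopen(widen(utf8_path).c_str(), widen(mode).c_str());
#else
    // POSIX filenames are byte strings already; pass them through unchanged.
    return std::fopen(utf8_path, mode);
#endif
}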
With C++17 you can use std::filesystem::path to store a file path in a portable way, but it is still broken on Windows:
The implicit constructor std::filesystem::path::path(const char *) uses the user-specific locale on MSVC, and there is no way to make it use UTF-8. The function std::filesystem::u8path should be used to construct a path from a UTF-8 string, but it is too easy to forget about this and use the implicit constructor instead.
std::error_category::message(int) for both error categories returns the error description using the user-specific encoding.
So what we have on Windows is:
Standard library functions that open files are broken and should never be used.
Arguments passed to main(int, char**) are broken and should never be used.
WinAPI functions ending with *A and macros are broken and should never be used.
std::filesystem::path is partially broken and should never be used directly.
Error categories returned by std::generic_category and std::system_category are broken and should never be used.
If you need a long-term solution for a non-trivial project, I would recommend the following (a small sketch follows below):
Using Boost.Nowide or implementing similar functionality directly - this fixes broken standard library.
Re-implementing standard error categories returned by std::generic_category and std::system_category so that they would always return UTF-8 encoded strings.
Wrapping std::filesystem::path so that new class would always use UTF-8 when converting path to string and string to path.
Wrapping all required functions from std::filesystem so that they would use your path wrapper and your error categories.
Unfortunately, this won't fix issues with other libraries that work with files, but many are broken anyway (they do not support Unicode).
You can check this link for further explanation: http://utf8everywhere.org/
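As a rough illustration of the path-wrapping idea, a minimal sketch assuming MSVC with C++17 (std::filesystem::u8path is the standard way to build a path from UTF-8, although it was later deprecated in C++20):

#include <filesystem>
#include <fstream>
#include <string>

// Always build paths from UTF-8, never via the implicit const char*
// constructor (which goes through the ANSI code page on MSVC).
std::filesystem::path path_from_utf8(const std::string& utf8)
{
    return std::filesystem::u8path(utf8);
}

int main()
{
    // "\xE4\xB8\xB8\xE7\xA9\xB4.txt" is the UTF-8 encoding of a Japanese file name.
    std::ofstream out(path_from_utf8("\xE4\xB8\xB8\xE7\xA9\xB4.txt"));
    out << "hello\n";
}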
Fundamentally, use wchar_t when the encoding has more symbols than a char can contain.
Background
The char type has enough capacity to hold any character code in the ASCII character set.
The issue is that many languages require more character codes than ASCII accounts for. So, instead of 128 possible codes, many more are needed. Some languages have more than 256 possible characters, and a char type does not guarantee a range greater than 256 values. Thus a new data type is required.
The wchar_t type, a.k.a. wide character, provides more room for character codes.
Summary
Use the char data type when the range of character codes is 256 or fewer, such as ASCII. Use wchar_t when you need capacity for more than 256 codes.
Prefer Unicode to handle large character sets (such as emojis).
Never use wchar_t.
When possible, use (some kind of array of) char, such as std::string, and ensure that it is encoded in UTF-8.
When you must interface with APIs that don't speak UTF-8, use char16_t or char32_t. Never use them otherwise; they provide only illusory advantages and encourage faulty code.
Note that there are plenty of cases where more than one char32_t is required to represent a single user-visible character. OTOH, using UTF-8 with char forces you to handle variable width very early.
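A small illustration of that last point (standard escape sequences; the size comes straight from the two code points):

#include <string>

// One user-visible character, two code points: U+0065 'e' followed by
// U+0301 COMBINING ACUTE ACCENT renders as a single accented letter,
// yet occupies two char32_t values. Fixed-width code units still do
// not give fixed-width "characters".
const std::u32string e_acute = U"\u0065\u0301"; // e_acute.size() == 2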
Is there a way to store math symbols in strings in C++?
I notably need the union/intersection symbols.
Thanks in advance!
This seemingly simple question is actually a tangle of several questions:
What character set to use?
Unicode is almost certainly the best choice nowadays.
What encoding to use?
C++ std::strings are strings of chars, but you can decide how those chars correspond to "characters" in your character set. The default representation assumed by the language and the system could be ASCII, some random code page like Latin-1 or Windows-1252, or UTF-8.
If you're on Linux or Mac, your best bet is to use UTF-8. If you're on Windows, you might choose to use wide strings instead (std::wstring), and to use UTF-16 as the encoding. But many people suggest that you always use UTF-8 in std::strings even on Windows, and simply convert from and to UTF-16 as needed to do I/O.
How to specify string literals in the code?
To store UTF-8 in older versions of C++ (before C++11), you could manually encode your string literals like this:
const std::string subset = "\xE2\x8A\x82";
To store UTF-8 in C++11 or newer, you use the u8 prefix to tell the compiler you want UTF-8 encoding. (Note that in C++20 u8 literals became char8_t-based, so they no longer convert to std::string directly.) You can use escaped characters:
const std::string subset = u8"\u2282";
Or you can enter the character directly into the source code:
const std::string subset = u8"⊂";
I tend to use the escaped versions to avoid worrying about the encoding of the source file and whether all the editors and viewers and IDEs I use will consistently understand the source file encoding.
If you're on Windows and you choose to use UTF-16 instead, then, regardless of C++ version, you can specify wide string literals in your code like this:
const std::wstring subset = L"\u2282"; // or L"⊂";
How to display these strings?
This is very system dependent.
On Mac and Linux, I suspect things will generally just work.
In a console program on Windows (e.g., one that just uses <iostreams> or printf to display in a command prompt), you're probably in trouble because the legacy command prompts don't have good Unicode and font support. (Maybe this is better on Windows 10?)
In a GUI program on Windows, you have to make sure you use the "Unicode" version of the API and to give it the wide string. ("Unicode" is in quotation marks here because the Windows API documentation often uses "Unicode" to mean a UTF-16 encoded wide character string, which isn't exactly what Unicode means.) So if you want to use an API like TextOut or MessageBox to display your string, you have to make sure you do two things: (1) call the "wide" version of the API, and (2) pass a UTF-16 encoded string.
You solve (1) by explicitly calling the wide versions (e.g., TextOutW or MessageBoxW) or by making sure you compile with "Unicode" selected in your project settings. (You can also do it by defining several C++ preprocessor macros instead, but this answer is already long enough.)
For (2), if you are using std::wstrings, you're already done. If you're using UTF-8, you'll need to make a wide copy of the string to pass to the output function. Windows provides MultiByteToWideChar for making such a copy. Make sure you specify CP_UTF8.
For (2), do not try to call the narrow versions of the API functions themselves (e.g., TextOutA or MessageBoxA). These will convert your string to a wide string automatically, but they do so assuming the string is encoded in the user's current code page. If the string is really in UTF-8, then these will do the wrong thing for all of the "interesting" (non-ASCII) characters.
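A minimal sketch of that conversion step (a hypothetical widen helper; error handling omitted for brevity):

#include <string>
#include <windows.h>

// Convert a UTF-8 std::string to the UTF-16 std::wstring the W APIs expect.
std::wstring widen(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}

// Usage: MessageBoxW(nullptr, widen(subset).c_str(), L"Demo", MB_OK);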
How to read these strings from a file, a socket, or the user?
This is very system specific and probably worth a separate question.
Yes, you can, as follows:
std::string unionChar = "∪";
std::string intersectionChar = "∩";
They are just characters, but don't expect this code to be portable. You could also use Unicode escape sequences, as follows:
std::string unionChar = u8"\u222A";
std::string intersectionChar = u8"\u2229";
My question seems to have confused folks. Here's something concrete:
Our code does the following:
FILE * fout = _tfsopen(_T("丸穴種類.txt"), _T("w"), _SH_DENYNO);
_fputts(W2T(L"刃物種類\n"), fout);
fclose(fout);
Under MBCS build target, the above produces a properly encoded file for code page 932 (assuming that 932 was the system default code page when this was run).
Under UNICODE build target, the above produces a garbage file full of ????.
I want to define a symbol, or use a compiler switch, or include a special header, or link to a given library, to make the above continue to work when the build target is UNICODE without changing the source code.
Here's the question as it used to exist:
- FILE* streams can be opened in t(ranslated) or b(inary) modes.
- Desktop applications can be compiled for UNICODE or MBCS (under Windows).

If my application is compiled for MBCS, then writing MBCS strings to a "wt" stream results in a well-formed text file containing MBCS text for the system code page (i.e. the code page "for non-Unicode software").

Because our software generally uses the _t versions of most string & stream functions, in MBCS builds output is handled primarily by puts(pszMBString) or something similar (putc, etc.). Since pszMBString is already in the system code page (e.g. 932 when running on a Japanese machine), the string is written out verbatim (although line terminators are massaged by puts and gets automatically).

However, if my application is compiled for UNICODE, then writing MBCS strings to a "wt" stream results in garbage (lots of "?????" characters) (i.e. I convert the UNICODE to the system's default code page and then write that to the stream using, for example, fwrite(pszNarrow, 1, length, stream)).

I can open my streams in binary mode, in which case I'll get the correct MBCS text... but the line terminators will no longer be PC-style CR+LF; instead they will be UNIX-style LF only. This is because in binary (non-translated) mode, the file stream doesn't handle the LF->CR+LF translation.

But what I really need is to be able to produce the exact same files I used to produce when compiling for MBCS: correct line terminators and MBCS text files using the system's code page.

Obviously I can manually adjust the line terminators myself and use binary streams. However, this is a very invasive approach, as I now have to find every bit of code throughout the system that writes text files and alter it so that it does all of this correctly. What blows my mind is that the UNICODE target is stupider / less capable than the MBCS target we used to use! Surely there is a way to toggle the C library to say "output narrow strings as-is but handle line terminators properly, exactly as you'd do in MBCS builds"?!
Sadly, this is a huge topic that deserves a small book devoted to it. And that book would basically need a specialized chapter for every target platform one wished to build for (Linux, Windows [flavor], Mac, etc.).
My answer is only going to cover Windows desktop applications, compiled for C++ with or without MFC.
Please note: this pertains to wanting to read in and write out MBCS (narrow) files from a UNICODE build using the system default code page (i.e. the code page for non-Unicode software). If you want to read and write Unicode files from a UNICODE build, you must open the files in binary mode, and you must handle BOM and line feed conversions manually: on input, you must skip the BOM (if any), convert the external encoding to Windows Unicode (i.e. UTF-16LE), and convert any CR+LF sequences to LF only; on output, you must write the BOM (if any), convert from UTF-16LE to whatever target encoding you want, and convert LF to CR+LF sequences for it to be a properly formatted PC text file.
BEWARE of MS's standard C library's puts and gets and fwrite and so on, which, if the stream is opened in text/translated mode, will convert any 0x0A to a 0x0D 0x0A sequence on write, and vice versa on read, regardless of whether you're reading or writing a single byte, a wide character, or a stream of random binary data; it doesn't care, and all of these functions boil down to doing blind byte conversions in text/translated mode!
Also be aware that many of the Windows API functions use CP_ACP internally, without any external control over their behavior (e.g. WritePrivateProfileString()). Hence one might want to ensure that all libraries are operating with the same character locale: CP_ACP and not some other one. Since you can't control some of these functions' behavior, you're forced either to conform to their choice or not use them at all.
If using MFC, one needs to:
// force CP_ACP *not* CP_THREAD_ACP for MFC CString auto-converters!!!
// this makes MFC's CString and CStdioFile and other interfaces use the
// system default code page, instead of the thread default code page (which is normally "C")
#define _CONVERSION_DONT_USE_THREAD_LOCALE
For C++ and C libraries, one must tell the libraries to use the system code page:
// force the C++ and C libraries based on setlocale() to use the system locale for narrow strings
// we only change the ctype category, not collation or date/time formatting
// (note: std::locale has no (const char*, category) constructor, so combine with the current global locale)
const std::string cp_name = str(boost::format(".%||") % GetACP()); // e.g. ".932"
std::locale::global(std::locale(std::locale(), cp_name.c_str(), std::locale::ctype));
setlocale(LC_CTYPE, cp_name.c_str()); // keep the C library in step explicitly
I do the #define in all of my precompiled headers, before including any other headers. I set the global locale in main (or its moral equivalent), once for the entire program (you may need to call this for every thread that is going to do I/O or string conversions).
The build target is UNICODE, and for most of our I/O, we use explicit string conversions before outputting via CStringA(my_wide_string).
One other thing to be aware of: there are two different sets of multibyte functions in the C standard library under VS C++: those which use the thread's locale for their operations, and another set which uses a code page set by _setmbcp() (which you can query via _getmbcp()). This is an actual code page (not a locale) that is used for all narrow-string interpretation in those functions (note: it is always initialized to CP_ACP, i.e. GetACP(), by the VS C++ startup code).
Useful reference materials:
- the-secret-family-split-in-windows-code-page-functions
- Sorting it all out (explains that there are four different locales in effect in Windows)
- MS offers some functions that allow you to set the encoding to use directly, but I didn't explore them
- An important note about a change to MFC that caused it to no longer respect CP_ACP, but rather CP_THREAD_ACP by default starting in MFC 7.0
- Exploration of why console apps in Windows are extreme FAIL when it comes to Unicode I/O
- MFC/ATL narrow/wide string conversion macros (which I don't use, but you may find useful)
- Byte order marker, which you need to write out for Unicode files of any encoding to be understood by other Windows software
The C library has support for both narrow (char) and wide (wchar_t) strings. In Windows these two types of strings are called MBCS (or ANSI) and Unicode respectively.
It is fully possible to use the narrow functions even though you have defined _UNICODE. The following code should produce the same output, regardless if _UNICODE is defined or not:
FILE* f = fopen("foo.txt", "wt");
fputs("foo\nbar\n", f);
fclose(f);
In your question you wrote: "I convert the UNICODE to the system's default code page and write that to the stream". This leads me to believe that your wide string contains characters that cannot be converted to the current code page, so each of them is replaced with a question mark.
Perhaps you could use some other encoding than the current code page. I recommend using the UTF-8 encoding where ever possible.
Update: Testing your example code on a Windows machine running code page 1252, the call to _fputts returns -1, indicating an error, and errno is set to EILSEQ, which means "Illegal byte sequence". The MSDN documentation for fopen states that:
When a Unicode stream-I/O function operates in text mode (the default), the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).
This is the key information for this error: wctomb will use the locale of the C standard library. By explicitly setting the C standard library locale to code page 932 (Shift JIS), the code ran perfectly and the output was correctly encoded as Shift JIS in the output file.
#include <locale.h>
#include <share.h>
#include <stdio.h>

int main()
{
    setlocale(LC_ALL, ".932"); // C library locale: code page 932 (Shift JIS)
    FILE* fout = _wfsopen(L"丸穴種類.txt", L"w", _SH_DENYNO);
    fputws(L"刃物種類\n", fout);
    fclose(fout);
}
An alternative (and perhaps preferable) solution to this would be to handle the conversions yourself before calling the narrow string functions of the C standard library.
When you compile for UNICODE, the C++ library knows nothing about MBCS. If you say you are opening the file for outputting text, it will attempt to treat the buffers you pass to it as UNICODE buffers.
Also, MBCS is a variable-length encoding. To parse it, the C++ library needs to iterate over characters, which is of course impossible when it knows nothing about MBCS. Hence it's impossible to "just handle line terminators correctly".
I would suggest that you either prepare your strings beforehand, or write your own function that writes a string to a file. Not sure whether writing characters one by one would be efficient (measurements required), but if not, you can handle strings piecewise, writing everything up to each \n in one go, as sketched below.
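A sketch of that piecewise idea (a hypothetical helper; it assumes the FILE* was opened in binary mode so no other translation happens):

#include <cstdio>
#include <cstring>

// Write an already-encoded narrow string to a binary-mode stream,
// expanding each '\n' into the PC-style "\r\n" by hand.
void fputs_crlf(const char* s, std::FILE* f)
{
    while (const char* nl = std::strchr(s, '\n')) {
        std::fwrite(s, 1, nl - s, f); // everything before the '\n' in one go
        std::fwrite("\r\n", 1, 2, f); // CR+LF line terminator
        s = nl + 1;
    }
    std::fputs(s, f); // remaining tail with no newline
}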
I'm currently taking care of an application that uses std::string and char for string operations, which is fine on Linux, since Linux is agnostic to Unicode (or so it seems; I don't really know, so please correct me if I'm telling stories here). This current style naturally leads to this kind of function/class declaration:
std::string doSomethingFunkyWith(const std::string& thisdata)
{
/* .... */
}
However, if thisdata contains Unicode characters, it will be displayed wrongly on Windows, since std::string can't hold Unicode characters on Windows.
So I thought up this concept:
namespace MyApplication {
#ifdef UNICODE
typedef std::wstring string_type;
typedef wchar_t char_type;
#else
typedef std::string string_type;
typedef char char_type;
#endif
/* ... */
string_type doSomethingFunkyWith(const string_type& thisdata)
{
/* ... */
}
}
Is this a good concept to go with to support Unicode on Windows?
My current toolchain consists of gcc/clang on Linux, and wine+MinGW for Windows support (cross-testing also happens via wine), if that matters.
Multiplatform issues come from the fact that there are many encodings, and a wrong encoding pick will lead to encóding Ãssues. Once you tackle that problem, you should be able to use std::wstring throughout your program.
The usual workflow is:
raw_input_data = read_raw_data()
input_encoding = "???" // What is your file or terminal encoding?
unicode_data = convert_to_unicode(raw_input_data, input_encoding)
// Do something with the unicode_data, store in some var, etc.
output_encoding = "???" // Is your terminal output encoding the same as your input?
raw_output_data = convert_from_unicode(unicode_data, output_encoding)
print_raw_data(raw_output_data)
Most Unicode issues come from wrongly detecting the values of input_encoding and output_encoding. On a modern Linux distribution this is usually UTF-8. On Windows, YMMV.
Standard C++ doesn't know about encodings, so you should use a library like ICU to do the conversion.
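For instance, a minimal sketch with ICU (assuming the ICU headers and the icuuc library are available):

#include <string>
#include <unicode/unistr.h> // icu::UnicodeString

// Decode raw bytes with an explicit source encoding, then re-encode as UTF-8.
std::string to_utf8(const std::string& raw, const char* encoding)
{
    // UnicodeString holds the text in ICU's internal UTF-16 form.
    icu::UnicodeString u(raw.c_str(), encoding); // e.g. "ISO-8859-1"
    std::string utf8;
    u.toUTF8String(utf8); // appends the UTF-8 bytes to 'utf8'
    return utf8;
}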
How you store a string within your application is entirely up to you -- after all, nobody would know as long as the strings stay within your application. The problem starts when you try to read or write strings from the outside world (console, files, sockets etc.) and this is where the OS matters.
Linux isn't exactly "agnostic" to Unicode -- it does recognize Unicode but the standard library functions assume UTF-8 encoding, so Unicode strings fit into standard char arrays. Windows, on the other hand, uses UTF-16 encoding, so you need a wchar_t array to represent 16-bit characters.
The typedefs you proposed should work fine, but keep in mind that this alone doesn't make your code portable. As an example, if you want to store text in files in a portable manner, you should choose one encoding and stick to it across all platforms -- this could require converting between encodings on certain platforms.
Linux does support Unicode, it simply uses UTF-8. Probably a better way to make your system portable would be to make use of International Components for Unicode and treat all std::string objects as containing UTF-8 characters, and convert them to UTF-16 as needed when invoking Windows functions. It almost always makes sense to use UTF-8 over UTF-16, as UTF-8 uses less space for some of the most commonly used characters (e.g. English*) and more space for less frequent characters, whereas UTF-16 wastes space equally for all characters, no matter how frequently they are used.
While you can use your typedefs, this will mean that you have to write two copies of every single function that has to deal with strings. I think it would be more efficient to simply do all internal computations in UTF-8 and simply translate that to/from UTF-16 if necessary when inputting/outputting as needed.
*For HTML, XML, and JSON, which use English as part of the format (e.g. "<html>", "<body>", etc.) regardless of the language of the values, this can still be a win for foreign languages.
The problem for Linux when using Unicode is that all the I/O and most system functions use UTF-8, while the wide character type is 32-bit. Then there is interfacing with Java and other programs, which requires UTF-16.
As a suggestion for Unicode support, see the OpenRTL library at http://code.google.com/p/openrtl which supports UTF-8, UTF-16 and UTF-32 on Windows, Linux, OS X and iOS. The Unicode support covers not just the character types, but also Unicode collation, normalization, case folding, title casing, and about 64 different Unicode character properties per full unsigned 32-bit character.
The OpenRTL code is now ready to support char8_t, char16_t and char32_t for the new C++ standards as well, although the same character types are supported using macros for existing C and C++ compilers. I think for Unicode and string processing it might be what you want for your library.
The point is that if you use OpenRTL, you can build the system using the OpenRTL "char_t" type. This supports the notion that your entire library can be built in either UTF8, UTF16 or UTF32 mode, even on Linux, because OpenRTL is already handling all the interfacing to a lot of system functions like files and io stuff. It has its own print_f functions for example.
By default, char_t maps to the wide character type, so it is 32 bit on both Windows and Linux; but you can also make it 8 bit everywhere, for example. It also has support for fast UTF decoding inside loops using macros.
So instead of #ifdef-ing between wchar_t and char, you can build everything using char_t and OpenRTL takes care of the rest.
At my company we have a cross-platform (Linux & Windows) library that contains our own extension of the STL std::string. This class provides all sorts of functionality on top of the string: split, format, to/from base64, etc. Recently we were given the requirement of making this string Unicode "friendly"; basically it needs to support characters from Chinese, Japanese, Arabic, etc. After initial research, this seems fine on the Linux side, since everything is inherently UTF-8; however, I am having trouble with the Windows side. Is there a trick to getting the STL std::string to work as UTF-8 on Windows? Is it even possible? Is there a better way? Ideally we would keep ourselves based on std::string, since that is what the string class is based on under Linux.
Thank you,
There are several misconceptions in your question.
Neither C++ nor the STL deal with encodings.
std::string is essentially a string of bytes, not characters. So you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.
Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.
Yes - by being more aware of locales and encodings.
Windows has two function calls for everything that requires text, a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings; the *A() functions take strings in the current codepage. However, Windows doesn't support a UTF-8 code page, so you can't directly use UTF-8 with the *A() functions, nor would you want to depend on the code page being set by users. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there; Googling "Unicode Windows tutorial" should get you some.
If you are storing UTF-8 data in a std::string, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing such), and then pass it to Windows.
Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character, it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as char's signedness is left implementation-defined by the standards. A statement like str[x] < 0x80 to check for multiple-byte characters can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integral type with a range of 0-255. That maps to the C type uint8_t exactly, although unsigned char works as well. Ideally then, I'd make a UTF-8 string an array of uint8_ts, but due to old APIs, this is rarely done.
Some people have recommended wchar_t, claiming it to be "A Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard states:
which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales
In Linux, a wchar_t represents a UTF-32 code unit / code point. It is thus 4 bytes. However, in Windows, it's a UTF-16 code unit, and is only 2 bytes. (Which, I would have said, does not conform to the above, since 2 bytes cannot represent all of Unicode, but that's the way it works.) This size difference, and difference in data encoding, clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (§5.2)
The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in std::string's, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather, I know that my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I speak to Window's API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.
Putting UTF-8 data into a std::string should be fine regardless of platform. The problem on Windows is that almost nothing else expects or works with UTF-8 -- it expects and works with UTF-16 instead. You can switch to a std::wstring which will store UTF-16 (at least on most Windows compilers), or you can write other routines that will accept UTF-8 (probably by converting to UTF-16 and then passing through to the OS).
Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.
No, there is no way to make Windows treat "narrow" strings as UTF-8.
Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).
Use std::string in cross-platform portion of the code. Assume that it always contains UTF-8 strings.
In the Windows portion of the code, use the "wide" versions of the Windows API explicitly, i.e. write e.g. CreateFileW instead of CreateFile. This avoids any dependency on the build system configuration.
In the platform abstraction layer, convert between UTF-8 and UTF-16 where needed (MultiByteToWideChar/WideCharToMultiByte); a sketch follows this list.
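For instance, the UTF-16-to-UTF-8 direction of that layer could look like this (a hypothetical narrow helper; error handling omitted):

#include <string>
#include <windows.h>

// Convert a UTF-16 string coming back from a wide Win32 call into the
// UTF-8 representation used by the cross-platform code.
std::string narrow(const std::wstring& utf16)
{
    if (utf16.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, utf16.data(),
                                  static_cast<int>(utf16.size()),
                                  nullptr, 0, nullptr, nullptr);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(),
                        static_cast<int>(utf16.size()),
                        &utf8[0], len, nullptr, nullptr);
    return utf8;
}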
Other approaches that I tried but don't like much:
typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and tstring, but it still adds a lot of pain.
Use std::wstring everywhere. This does not help much, since wchar_t is 16-bit on Windows, so you either have to restrict yourself to the BMP or go to a lot of complications to make the code dealing with Unicode cross-platform. In the latter case, all the benefits over UTF-8 evaporate.
Use ATL/WTL/MFC CString in the platform-specific portion; use std::string in the cross-platform portion. This is actually a variant of what I recommend above. CString is in many aspects superior to std::string (in my opinion), but it introduces an additional dependency and thus is not always acceptable or convenient.
If you want to avoid headaches, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required, and it supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulations (boost::wpath). Avoid std::string and std::fstream.
In the Windows API and C runtime library, char* parameters are interpreted as being encoded in the "ANSI" code page. The problem is that UTF-8 isn't supported as an ANSI code page, which I find incredibly annoying.
I'm in a similar situation, being in the middle of porting software from Windows to Linux while also making it Unicode-aware. The approach we've taken for this is:
Use UTF-8 as the default encoding for strings.
In Windows-specific code, always call the "W" version of functions, converting string arguments between UTF-8 and UTF-16 as necessary.
This is also the approach Poco has taken.
It is really platform dependent; Unicode is a headache. It depends on which compiler you use. For older ones from MS (VS2010 or older), you would need to use the API described on MSDN.
For VS2015:
using namespace std::string_literals; // needed for the s suffix
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;
according to their docs. I can't check that one.
For MinGW, gcc, etc.:
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();
The output contains the proper file name...
You should consider using QString and QByteArray; they have good Unicode support.
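For example, a small sketch of the usual round trip (assuming Qt 5 or later; QString stores UTF-16 internally):

#include <QByteArray>
#include <QFile>
#include <QIODevice>
#include <QString>

int main()
{
    // Build a QString from UTF-8 bytes; QFile handles wide paths on Windows.
    QString name = QString::fromUtf8("\xC3\xBCml\xC3\xA4ut.txt"); // "ümläut.txt"
    QFile file(name);
    if (!file.open(QIODevice::WriteOnly))
        return 1;
    QByteArray utf8 = name.toUtf8(); // back to UTF-8 bytes
    file.write(utf8);
    return 0;
}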