Change narrow string encoding or missing std::filesystem::path::imbue - c++

I'm on Windows and I'm constructing std::filesystem::path from std::string. According to constructor reference (emphasis mine):
If the source character type is char, the encoding of the source is assumed to be the native narrow encoding (so no conversion takes place on POSIX systems)
If I understand correctly, this means string content will be treated as encoded in ANSI under Windows. To treat it as encoded in UTF-8, I need to use std::filesystem::u8path() function. See the demo: http://rextester.com/PXRH65151
I want constructor of path to treat contents of narrow string as UTF-8 encoded. For boost::filesystem::path I could use imbue() method to do this:
boost::filesystem::path::imbue(std::locale(std::locale(), new std::codecvt_utf8_utf16<wchar_t>()));
However, I do not see such method in std::filesystem::path. Is there a way to achieve this behavior for std::filesystem::path? Or do I need to spit u8path all over the place?

My solution to this problem is to fully alias the std::filesystem to a different namespace named std::u8filesystem with classes and methods that treat std::string as UTF-8 encoded. Classes inherit their corresponding in std::filesystem with same name, without adding any field or virtual method to offer full API/ABI interoperability. Full proof of concept code here, tested only on Windows so far and far to be complete. The following snippet shows the core working of the helper:
std::wstring U8ToW(const std::string &string);
namespace std
{
namespace u8filesystem
{
#ifdef WIN32
class path : public filesystem::path
{
public:
path(const std::string &string)
: fs::path(U8ToW(path))
{
}
inline std::string string() const
{
return filesystem::path::u8string();
}
}
#else
using namespace filesystem;
#endif
}
}

For the sake of performance, path does not have a global way to define locale conversions. Since C++ pre-20 does not have a specific type for UTF-8 strings, the system assumes any char strings are narrow character strings. So if you want to use UTF-8 strings, you have to spell it out explicitly, either by providing an appropriate conversion locale to the constructor or by using u8path.
C++20 gave us char8_t, which is always presumed to be UTF-8. So if you consistently use char8_t-based strings (like std::u8string), path's implicit conversion will pick up on it and work appropriately.

Related

Convert a string to std filesystem path

A file path is passed as a string. How do I convert this string to a std::filesystem::path? Example:
#include <filesystem>
std::string inputPath = "a/custom/path.ext";
const std::filesystem::path path = inputPath; // Is this assignment safe?
Yes, this construction is safe:
const std::filesystem::path path = inputPath; // Is this assignment safe?
That is not assignment, that is copy initialization. You are invoking this constructor:
template< class Source >
path( const Source& source );
which takes:
Constructs the path from a character sequence provided by source (4), which is a pointer or an input iterator to a null-terminated character/wide character sequence, an std::basic_string or an std::basic_string_view,
So you're fine. Plus, it would be really weird if you couldn't construct a filesystem::path from a std::string.
I'm aware that this is a 5 year old question, but since it is still the top result for the search "string to filesystem::path" IMHO it requires some additional comment on usages on Windows (although the original question probably was implicitly focused on Linux, judging from the chosen path separator).
For Windows systems, the implicit construction of a filesystem::path from string is not safe with respect to encoding (despite the fact that it is legal according to the standard). If you have a path as a std::string that contains non-ASCII characters, you have to be absolutely clear about whether it is encoded in the local code-page or UTF-8 (there is no such distinction under Linux). The implicit construction shown in the question will assume encoding in the local code-page, while many 3rd party libraries will assume UTF-8 as the standard (e.g. QT in QString::toStdString() return UTF-8; strings in gRPC/protobuf should be expected being UTF-8 etc.). Therefore, you have to be very careful not to mix these:
const std::string path_as_string = "\xe4\xb8\xad"; //chinese character Zhong in UTF-8
const std::filesystem::path wrong_path = path_as_string; //looks innocent, but incorrect encoding is used!
std::ofstream stream;
stream.open(std::filesystem::current_path() / wrong_path); //create a file with the broken name
stream.close();
The correct way to construct the path in this case would be
const std::filesystem::path correct_path = std::filesystem::u8path(path_as_string);
Please also note, that this is a perfectly valid path, and you can easily create a file containing such characters using regular applications or the Windows Explorer. Due to the high potential to get the conversion wrong, IMHO it would have been preferable for Windows development, to not have any implicit constructor in the current situation, but requests to have an option to disable that behavior in the STL implementations have been rejected with reference to the standard. On the other hand, it seems that Windows will slowly move to UTF-8 as default code-page in the future (see e.g. this comment). Until then, the above mentioned caveats should to be taken into account.

How to Deal with Varying String types?

I have to work with an API that is using Microsoft's TCHAR macros and such, so I was wondering if I could use C++ in a way to simplify the task. So i was wondering if there is a way to support implicit conversion and why/why not std::string doesn't support converting from a smaller char size:
#include <Windows.h>
using String = std::basic_string<TCHAR>; // say TCHAR = wchar_t or equivalent
String someLiteralString = "my simple ansi string"; // Error here obviously
// some polymorphic class...
const TCHAR* MyOverriddenFunction() override { return someLiteralString.c_str(); }
// end some polymorphic class
The reason implicit conversion isn't supported is that conversion can be complicated. The simple case is when the string to convert is pure ASCII as in your example, but there's no way to guarantee that. The creators of the standard wisely stayed away from that problem.
If you don't know whether your strings are wide-character or not, you can use Microsoft's _T() macro around each string literal to generate the proper characters. But you say you don't want to do that.
Modern Windows programming always uses wide characters in the API. Chances are your program is too, otherwise the code you've shown would not cause an error. It's very unlikely that once you've used wide characters you'll switch back to narrow ones. A simple one-character change to your literals will make them wide-character to match the string type:
String someLiteralString = L"my simple ansi string";
Use the (ATL/MFC) CStringT class, it will make your life much easier.
http://msdn.microsoft.com/en-us/library/ms174284(v=vs.80).aspx

Encode/Decode std::string to UTF-16

I have to handle a file format (both read from and write to it) in which strings are encoded in UTF-16 (2 bytes per character). Since characters out of the ASCII table are rarely used in the application domain, all of the strings in my C++ model classes are stored in instances of std::string (UTF-8 encoded).
I'm looking for a library (searched in STL and Boost with no luck) or a set of C/C++ functions to handle this std::string <-> UTF-16 conversion when loading from or saving to file format (actually modeled as a bytestream) including the generation/recognition of surrogate pairs and all that Unicode stuffs (I'm admittedly no expert with)...
Any suggestions? Thanks!
EDIT: forgot to mention it should be cross-platform (Win / Mac) and cannot use C++11.
C++11 has this functionality:
std::string s = u8"Hello, World!";
// #include <codecvt>
std::wstring_convert<std::codecvt<char16_t,char,std::mbstate_t>,char16_t> convert;
std::u16string u16 = convert.from_bytes(s);
std::string u8 = convert.to_bytes(u16);
However to my knowledge the only implementation that has this so far is libc++. C++11 also has std::codecvt_utf8_utf16<char16_t> which some other implementations have. Specifically, codecvt_utf8_utf16 works in VS 2010 and above, and since wchar_t is used by Windows to represent UTF-16 you can use this to convert between UTF-8 and Windows' native encoding.
The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding
schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and
UTF-8 encoding schemes.
— [locale.codecvt] 22.4.1.4/3
Oh, and std::codecvt specializations have protected destructors, and wstring_convert requires access to the destructor so you really need an adapter:
template <class Facet>
class usable_facet : public Facet {
public:
using Facet::Facet; // inherit constructors
~usable_facet() {}
// workaround for compilers without inheriting constructors:
// template <class ...Args> usable_facet(Args&& ...args) : Facet(std::forward<Args>(args)...) {}
};
template<typename internT, typename externT, typename stateT>
using codecvt = usable_facet<std::codecvt<internT, externT, stateT>>;
std::wstring_convert<codecvt<char16_t,char,std::mbstate_t>> convert;
Did you look at Boost.Locale? This page, in particular, describes how to do UTF to UTF conversions and how to integrate it with IOStreams.
I would suggest having a look at:
Convert C++ std::string to UTF-16-LE encoded string
And check out the iconv function. It's a C library, no requirements for C++11.
There's also a Win32 specific iconv library at https://github.com/win-iconv/win-iconv.

WChars, Encodings, Standards and Portability

The following may not qualify as a SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, "Do I understand the C standard correctly and is this the right way to go about things?"
I would like to ask for clarification, confirmation and corrections on my understanding of character handling in C (and thus C++ and C++0x). First off, an important observation:
Portability and serialization are orthogonal concepts.
Portable things are things like C, unsigned int, wchar_t. Serializable things are things like uint32_t or UTF-8. "Portable" means that you can recompile the same source and get a working result on every supported platform, but the binary representation may be totally different (or not even exist, e.g. TCP-over-carrier pigeon). Serializable things on the other hand always have the same representation, e.g. the PNG file I can read on my Windows desktop, on my phone or on my toothbrush. Portable things are internal, serializable things deal with I/O. Portable things are typesafe, serializable things need type punning. </preamble>
When it comes to character handling in C, there are two groups of things related respectively to portability and serialization:
wchar_t, setlocale(), mbsrtowcs()/wcsrtombs(): The C standard says nothing about "encodings"; in fact, it is entirely agnostic to any text or encoding properties. It only says "your entry point is main(int, char**); you get a type wchar_t which can hold all your system's characters; you get functions to read input char-sequences and make them into workable wstrings and vice versa.
iconv() and UTF-8,16,32: A function/library to transcode between well-defined, definite, fixed encodings. All encodings handled by iconv are universally understood and agreed upon, with one exception.
The bridge between the portable, encoding-agnostic world of C with its wchar_t portable character type and the deterministic outside world is iconv conversion between WCHAR-T and UTF.
So, should I always store my strings internally in an encoding-agnostic wstring, interface with the CRT via wcsrtombs(), and use iconv() for serialization? Conceptually:
my program
<-- wcstombs --- /==============\ --- iconv(UTF8, WCHAR_T) -->
CRT | wchar_t[] | <Disk>
--- mbstowcs --> \==============/ <-- iconv(WCHAR_T, UTF8) ---
|
+-- iconv(WCHAR_T, UCS-4) --+
|
... <--- (adv. Unicode malarkey) ----- libicu ---+
Practically, that means that I'd write two boiler-plate wrappers for my program entry point, e.g. for C++:
// Portable wmain()-wrapper
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>
std::vector<std::wstring> parse(int argc, char * argv[]); // use mbsrtowcs etc
int wmain(const std::vector<std::wstring> args); // user starts here
#if defined(_WIN32) || defined(WIN32)
#include <windows.h>
extern "C" int main()
{
setlocale(LC_CTYPE, "");
int argc;
wchar_t * const * const argv = CommandLineToArgvW(GetCommandLineW(), &argc);
return wmain(std::vector<std::wstring>(argv, argv + argc));
}
#else
extern "C" int main(int argc, char * argv[])
{
setlocale(LC_CTYPE, "");
return wmain(parse(argc, argv));
}
#endif
// Serialization utilities
#include <iconv.h>
typedef std::basic_string<uint16_t> U16String;
typedef std::basic_string<uint32_t> U32String;
U16String toUTF16(std::wstring s);
U32String toUTF32(std::wstring s);
/* ... */
Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++, together with a well-defined I/O interface to UTF using iconv? (Note that issues like Unicode normalization or diacritic replacement are outside the scope; only after you decide that you actually want Unicode (as opposed to any other coding system you might fancy) is it time to deal with those specifics, e.g. using a dedicated library like libicu.)
Updates
Following many very nice comments I'd like to add a few observations:
If your application explicitly wants to deal with Unicode text, you should make the iconv-conversion part of the core and use uint32_t/char32_t-strings internally with UCS-4.
Windows: While using wide strings is generally fine, it appears that interaction with the console (any console, for that matter) is limited, as there does not appear to be support for any sensible multi-byte console encoding and mbstowcs is essentially useless (other than for trivial widening). Receiving wide-string arguments from, say, an Explorer-drop together with GetCommandLineW+CommandLineToArgvW works (perhaps there should be a separate wrapper for Windows).
File systems: File systems don't seem to have any notion of encoding and simply take any null-terminated string as a file name. Most systems take byte strings, but Windows/NTFS takes 16-bit strings. You have to take care when discovering which files exist and when handling that data (e.g. char16_t sequences that do not constitute valid UTF16 (e.g. naked surrogates) are valid NTFS filenames). The Standard C fopen is not able to open all NTFS files, since there is no possible conversion that will map to all possible 16-bit strings. Use of the Windows-specific _wfopen may be required. As a corollary, there is in general no well defined notion of "how many characters" comprise a given file name, as there is no notion of "character" in the first place. Caveat emptor.
Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++
No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost everywhere and work exclusively with wchar_t (not necessarily internally, but at all interfaces to the system). For example, if you start with
int main(int argc, char** argv)
you have already lost Unicode support for command line arguments. You have to write
int wmain(int argc, wchar_t** argv)
instead, or use the GetCommandLineW function, none of which is specified in the C standard.
More specifically,
any Unicode-capable program on Windows must actively ignore the C and C++ standard for things like command line arguments, file and console I/O, or file and directory manipulation. This is certainly not idiomatic. Use the Microsoft extensions or wrappers like Boost.Filesystem or Qt instead.
Portability is extremely hard to achieve, especially for Unicode support. You really have to be prepared that everything you think you know is possibly wrong. For example, you have to consider that the filenames you use to open files can be different from the filenames that are actually used, and that two seemingly different filenames may represent the same file. After you create two files a and b, you might end up with a single file c, or two files d and e, whose filenames are different from the file names you passed to the OS. Either you need an external wrapper library or lots of #ifdefs.
Encoding agnosticity usually just doesn't work in practice, especially if you want to be portable. You have to know that wchar_t is a UTF-16 code unit on Windows and that char is often (bot not always) a UTF-8 code unit on Linux. Encoding-awareness is often the more desirable goal: make sure that you always know with which encoding you work, or use a wrapper library that abstracts them away.
I think I have to conclude that it's completely impossible to build a portable Unicode-capable application in C or C++ unless you are willing to use additional libraries and system-specific extensions, and to put lots of effort in it. Unfortunately, most applications already fail at comparatively simple tasks such as "writing Greek characters to the console" or "supporting any filename allowed by the system in a correct manner", and such tasks are only the first tiny steps towards true Unicode support.
I would avoid the wchar_t type because it's platform-dependent (not "serializable" by your definition): UTF-16 on Windows and UTF-32 on most Unix-like systems. Instead, use the char16_t and/or char32_t types from C++0x/C1x. (If you don't have a new compiler, typedef them as uint16_t and uint32_t for now.)
DO define functions to convert between UTF-8, UTF-16, and UTF-32 functions.
DON'T write overloaded narrow/wide versions of every string function like the Windows API did with -A and -W. Pick one preferred encoding to use internally, and stick to it. For things that need a different encoding, convert as necessary.
The problem with wchar_t is that encoding-agnostic text processing is too difficult and should be avoided. If you stick with "pure C" as you say, you can use all of the w* functions like wcscat and friends, but if you want to do anything more sophisticated then you have to dive into the abyss.
Here are some things that much harder with wchar_t than they are if you just pick one of the UTF encodings:
Parsing Javascript: Identifers can contain certain characters outside the BMP (and lets assume that you care about this kind of correctness).
HTML: How do you turn 𐀀 into a string of wchar_t?
Text editor: How do you find grapheme cluster boundaries in a wchar_t string?
If I know the encoding of a string, I can examine the characters directly. If I don't know the encoding, I have to hope that whatever I want to do with a string is implemented by a library function somewhere. So the portability of wchar_t is somewhat irrelevant as I don't consider it an especially useful data type.
Your program requirements may differ and wchar_t may work fine for you.
Given that iconv is not "pure standard C/C++", I don't think you are satisfying your own specifications.
There are new codecvt facets coming with char32_t and char16_t so I don't see how you can be wrong as long as you are consistent and pick one char type + encoding if the facets are here.
The facets are described in 22.5 [locale.stdcvt] (from n3242).
I don't understand how this doesn't satisfy at least some of your requirements:
namespace ns {
typedef char32_t char_t;
using std::u32string;
// or use user-defined literal
#define LIT u32
// Communicate with interface0, which wants utf-8
// This type doesn't need to be public at all; I just refactored it.
typedef std::wstring_convert<std::codecvt_utf8<char_T>, char_T> converter0;
inline std::string
to_interface0(string const& s)
{
return converter0().to_bytes(s);
}
inline string
from_interface0(std::string const& s)
{
return converter0().from_bytes(s);
}
// Communitate with interface1, which wants utf-16
// Doesn't have to be public either
typedef std::wstring_convert<std::codecvt_utf16<char_T>, char_T> converter1;
inline std::wstring
to_interface0(string const& s)
{
return converter1().to_bytes(s);
}
inline string
from_interface0(std::wstring const& s)
{
return converter1().from_bytes(s);
}
} // ns
Then your code can use ns::string, ns::char_t, LIT'A' & LIT"Hello, World!" with reckless abandon, without knowing what's the underlying representation. Then use from_interfaceX(some_string) whenever it's needed. It doesn't affect the global locale or streams either. The helpers can be as clever as needed, e.g. codecvt_utf8 can deal with 'headers', which I assume is Standardese from tricky stuff like the BOM (ditto codecvt_utf16).
In fact I wrote the above to be as short as possible but you'd really want helpers like this:
template<typename... T>
inline ns::string
ns::from_interface0(T&&... t)
{
return converter0().from_bytes(std::forward<T>(t)...);
}
which give you access to the 3 overloads for each [from|to]_bytes members, accepting things like e.g. const char* or ranges.

Is there an STL string class that properly handles Unicode?

I know all about std::string and std::wstring but they don't seem to fully pay attention to extended character encoding of UTF-8 and UTF-16 (On windows at least). There is also no support for UTF-32.
So does anyone know of cross-platform drop-in replacement classes that provide full UTF-8, UTF-16 and UTF-32 support?
And let's not forget the lightweight, very user-friendly, header-only UTF-8 library UTF8-CPP. Not a drop-in replacement, but can easily be used in conjunction with std::string and has no external dependencies.
Well in C++0x there are classes std::u32string and std::u16string. GCC already partially supports them, so you can already use them, but streams support for unicode is not yet done Unicode support in C++0x.
It's not STL, but if you want proper Unicode in C++, then you should take a look at ICU.
There is no support of UTF-8 on the STL. As an alternative youo can use boost codecvt:
//...
// My encoding type
typedef wchar_t ucs4_t;
std::locale old_locale;
std::locale utf8_locale(old_locale,new utf8_codecvt_facet<ucs4_t>);
// Set a New global locale
std::locale::global(utf8_locale);
// Send the UCS-4 data out, converting to UTF-8
{
std::wstringstream oss;
oss.imbue(utf8_locale);
std::copy(ucs4_data.begin(),ucs4_data.end(),
std::ostream_iterator<ucs4_t,ucs4_t>(oss));
std::wcout << oss.str() << std::endl;
}
For UTF-8 support, there is the Glib::ustring class. It is modeled after std::string but is utf-8 aware,e.g. when you are scanning the string with an iterator. It also has some restrictions, e.g. the iterator is always const, as replacing a character can change the length of the string and so it can invalidate other iterators.
ustring does not automatically converts other encodings to utf-8, Glib library has various conversion functions for this. You can validate whether the string is a valid utf-8 though.
And also, ustring and std::string are interchangeable, i.e. ustring has a cast operator to std::string so you can pass a ustring as a parameter where an std::string is expected, and vice versa of course, as ustring can be constructed from std::string.
Qt has QString which uses UTF-16 internally, but has methods for converting to or from std::wstring, UTF-8, Latin1 or locale encoding. There is also the QTextCodec class which can convert QStrings to or from basically anything. But using Qt for just strings seems like an overkill to me.
Also look at http://grigory.info/UTF8Strings.About.html it is UTF8 native.