How do I write a std::codecvt facet? I'd like to write ones that go from UTF-16 to UTF-8, from UTF-16 to the system's current code page (Windows, so CP_ACP), and from UTF-16 to the system's OEM code page (Windows, so CP_OEMCP).
Cross-platform is preferred, but MSVC on Windows is fine too. Are there any kinds of tutorials or anything of that nature on how to correctly use this class?
I've written one based on iconv. It can be used on Windows or on any POSIX OS.
(You will need to link with iconv, obviously.)
Enjoy
The answer to the "how to" question is to follow the codecvt reference. I was not able to find any better instructions on the Internet two years ago.
Important notices
Theoretically there is no need for such work: codecvt_byname should be enough on any standard-conforming platform. In reality, some compilers don't support this class or support it badly.
The interface of codecvt_byname also differs between compilers.
My working example is implemented with the state template parameter of codecvt. Always use the standard mbstate_t type there, as this is the only way to use your codecvt with the standard iostream classes.
The std::mbstate_t type can't be used as a pointer on 64-bit platforms in a cross-platform way.
Stateless conversions work for short strings, but may fail if you try to convert a chunk of data larger than the streambuf's internal buffer (a multi-byte sequence can be split across buffer boundaries, so the conversion effectively needs state).
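As a rough illustration of what writing such a facet involves, here is a minimal sketch, not the iconv-based implementation mentioned above. The class name Utf8OutCodecvt and the helper wide_to_utf8 are hypothetical. It only overrides the output direction, and it assumes every wchar_t holds a whole code point (true where wchar_t is 32 bits; with a 16-bit wchar_t, surrogate halves are reported as errors).

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical facet: encodes each wchar_t as one code point in UTF-8.
class Utf8OutCodecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    result do_out(state_type&, const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            const unsigned long cp = static_cast<unsigned long>(*from_next);
            // Reject values that are not Unicode scalar values.
            if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return error;
            const int len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (to_end - to_next < len) return partial;  // output buffer is full
            if (len == 1) {
                *to_next++ = static_cast<char>(cp);
            } else {
                // Lead byte: length marker bits plus the top payload bits.
                static const unsigned char lead[] = {0, 0, 0xC0, 0xE0, 0xF0};
                *to_next++ = static_cast<char>(lead[len] | (cp >> (6 * (len - 1))));
                // Continuation bytes carry 6 payload bits each.
                for (int i = len - 2; i >= 0; --i)
                    *to_next++ = static_cast<char>(0x80 | ((cp >> (6 * i)) & 0x3F));
            }
            ++from_next;
        }
        return ok;
    }
    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 0; }  // variable width
    int do_max_length() const noexcept override { return 4; }
};

// Hypothetical helper that drives the facet through its public out() member.
std::string wide_to_utf8(const std::wstring& in) {
    Utf8OutCodecvt cvt;
    std::mbstate_t state = std::mbstate_t();
    std::string out(in.size() * 4, '\0');  // worst case: 4 bytes per code point
    const wchar_t* from_next = nullptr;
    char* to_next = nullptr;
    const std::codecvt_base::result r =
        cvt.out(state, in.data(), in.data() + in.size(), from_next,
                &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok) throw std::runtime_error("conversion failed");
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

To use a facet like this with a stream, it would typically be installed before the file is opened, along the lines of stream.imbue(std::locale(stream.getloc(), new Utf8OutCodecvt)). A complete facet would also need do_in, do_length and do_unshift.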
The problem with std::codecvt is that it's a solution looking for a problem. Or rather, the problem it's trying to solve is unsolvable, so anybody trying to use it as a solution is going to be very disappointed.
If you don't know which character set your input or output is, then std::codecvt isn't ever going to be able to help you. Conversely, if you do know which character sets you're using, then you can trivially convert between them with a single function call. Wrapping that function call in a complicated mess of templates doesn't change those fundamentals.
...and that's why nobody uses std::codecvt. I recommend you just do what everybody else does, and pretend it never happened.
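To make the "single function call" idea concrete, here is a small portable sketch for the UTF-16 to UTF-8 case. The name utf16_to_utf8 is hypothetical, not a standard function. On Windows the same job (and the CP_ACP and CP_OEMCP cases) would more likely be done with WideCharToMultiByte.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-16 code units to UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```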
I see "CString" in MFC, and "QString" in Qt.
What is the difference among string, CString, and QString?
Why not use "string" directly?
They're different variations on string types.
std::string is the one from the ISO standard and probably preferred in situations where you want portability. It is required to be provided by all implementations claiming to conform with the standard.
CString is, as you say, from MFC and will generally only work in that environment. If you're programming exclusively for Windows, you can probably use that. It may have extra features not provided by std::string.
Similarly, QString is the Qt variation and is meant to represent strings in programs using Qt. Like CString, it's more tightly bound to its environment, so it may offer efficiencies over std::string.
Looking around (doing your research for you basically) I found some stuff.
String: Does NOT support character encoding, no special functionality vs the others(.)
QString: Plenty of useful functions, some better compatibilities, supports character encoding, default UTF-16(.)
CString: Plenty of useful functions, some better compatibilities, and good for Unicode and Ascii compilation(..), ...
There are also some more things that are not mentioned here, the sources are
. http://blog.rburchell.com/2010/08/strings-and-qt.html
.. http://forums.codeguru.com/showthread.php?319932-CString-vs-std-string
... Elsewhere
.... Built to work better with its own framework
I hope I was helpful, as this is my first post.
We are specifically eyeing Windows and Linux development, and have come up with two differing approaches that both seem to have their merits. The natural unicode string type in Windows is UTF-16, and UTF-8 in linux.
We can't decide whether the best approach:
Standardise on one of the two in all our application logic (and persistent data), and make the other platforms do the appropriate conversions
Use the natural format for the OS for application logic (and thus making calls into the OS), and convert only at the point of IPC and persistence.
To me they seem like they are both about as good as each other.
and UTF-8 in linux.
It's mostly true for modern Linux. The actual encoding depends on which API or library is used. Some are hardcoded to use UTF-8, but others read the LC_ALL, LC_CTYPE, or LANG environment variables to detect which encoding to use (the Qt library, for example). So be careful.
We can't decide whether the best approach
As usual it depends.
If 90% of the code deals with platform-specific APIs in a platform-specific way, it is obviously better to use platform-specific strings. Examples: a device driver or a native iOS application.
If 90% of the code is complex business logic shared across platforms, it is obviously better to use the same encoding on all platforms. Examples: a chat client or a browser.
In the second case you have a choice:
Use a cross-platform library that provides string support (Qt or ICU, for example)
Use bare pointers (I consider std::string a "bare pointer" too)
If working with strings is a significant part of your application, choosing a good string library is a sound move. For example, Qt has a very solid set of classes that covers 99% of common tasks. Unfortunately, I have no ICU experience, but it also looks very good.
When you use a library for strings, you need to care about encoding only when working with external libraries, platform APIs, or when sending strings over the network (or to disk). For example, a lot of Cocoa, C#, or Qt programmers (all of which have solid string support) know very little about encoding details, and that is good, since they can focus on their main task.
My experience in working with strings is a little specific, so I personally prefer bare pointers. Code that uses them is very portable (in the sense that it can easily be reused in other projects and on other platforms) because it has fewer external dependencies. It is also extremely simple and fast (though one probably needs some experience and Unicode background to feel that).
I agree that bare pointers approach is not for everyone. It is good when:
You work with entire strings and splitting, searching, comparing is a rare task
You can use same encoding in all components and need a conversion only when using platform API
All your supported platforms have APIs to:
Convert from your encoding to the one used by the API
Convert from the API's encoding to the one used in your code
Pointers are not a problem for your team
In my (admittedly somewhat specific) experience, this is actually a very common case.
When working with bare pointers, it is good to choose one encoding for the entire project (or for all projects).
From my point of view, UTF-8 is the ultimate winner. If you can't use UTF-8, use a strings library or the platform API for strings; it will save you a lot of time.
Advantages of UTF-8:
Fully ASCII compatible. Any ASCII string is a valid UTF-8 string.
C std library works great with UTF-8 strings. (*)
C++ std library works great with UTF-8 (std::string and friends). (*)
Legacy code works great with UTF-8.
Almost every platform supports UTF-8.
Debugging is MUCH easier with UTF-8 (since it is ASCII compatible).
No Little-Endian/Big-Endian mess.
You will not catch a classical bug "Oh, UTF-16 is not always 2 bytes?".
(*) Until you need to compare them lexically, change case (toUpper/toLower), change normalization form, or something similar. If you do, use a strings library or the platform API.
The disadvantages are debatable:
Less compact than UTF-16 for Chinese (and other symbols with large code point numbers).
Slightly harder to iterate over symbols.
So I recommend using UTF-8 as the common encoding for projects that don't use a strings library.
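To show how much extra work iterating over symbols in UTF-8 involves, here is a minimal decoding sketch. The function name utf8_code_points is hypothetical. It checks lead and continuation bytes but, to stay short, does not reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decodes a UTF-8 string into its code points.
std::vector<char32_t> utf8_code_points(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte tells us how many bytes the sequence has.
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > s.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte (10xxxxxx) contributes 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```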
But encoding is not the only question you need to answer.
There is also such a thing as normalization. Put simply, some letters can be represented in several ways: as a single code point or as a combination of several code points. The common problem is that most string comparison functions treat these representations as different. If you are working on a cross-platform project, choosing one normalization form as the standard is the right move. It will save you time.
For example, if a user's password contains "йёжиг", it will be represented differently (in both UTF-8 and UTF-16) when entered on a Mac (which mostly uses Normalization Form D) and on Windows (which mostly prefers Normalization Form C). So if the user registered on Windows with such a password, he will have trouble logging in on a Mac.
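A small sketch of the problem, using only the first letter of that password. The constant and function names are hypothetical. The byte values are the UTF-8 encodings of U+0439 (precomposed) and of U+0438 followed by U+0306 (combining breve).

```cpp
#include <string>

// UTF-8 bytes for the single letter "й" in two normalization forms.
// NFC: U+0439 (precomposed letter).
// NFD: U+0438 followed by U+0306 (base letter plus combining breve).
const std::string kNfc = "\xD0\xB9";
const std::string kNfd = "\xD0\xB8\xCC\x86";

// A plain byte comparison, which is what std::string::operator== performs.
bool same_bytes(const std::string& a, const std::string& b) {
    return a == b;
}
```

Both strings render as the same letter, yet same_bytes(kNfc, kNfd) is false. Comparing them meaningfully requires normalizing both sides to one form first, typically with a library such as ICU or a platform API.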
In addition, I would not recommend using wchar_t (or would use it only in Windows code as a UCS-2/UTF-16 char type). The problem with wchar_t is that there is no encoding associated with it. It's just an abstract wide char that is larger than a normal char (16 bits on Windows, 32 bits on most *nix).
I'd use the same encoding internally, and normalize the data at the entry point. This will involve less code and fewer gotchas, and will allow you to use the same cross-platform library for string processing.
I'd use Unicode (UTF-16) because it's simpler to handle internally and should perform better, since every character in the Basic Multilingual Plane is a single 16-bit unit (characters outside it need a surrogate pair, so the length is not truly constant). UTF-8 is ideal for output and storage because it's backward compatible with ASCII and uses only 8 bits for English characters. But inside the program, 16-bit is simpler to handle.
C++11 provides the new string types u16string and u32string. Depending on the support your compiler versions deliver, and the expected life expectancy, it might be an idea to stay forward-compatible to those.
Other than that, using the ICU library is probably your best shot at cross-platform compatibility.
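For reference, a brief sketch of the C++11 string types mentioned above. The constant and function names are hypothetical. It also shows that a u16string still needs two code units for characters outside the Basic Multilingual Plane.

```cpp
#include <cstddef>
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane.
const std::u16string kUtf16 = u"\U0001F600";  // stored as a surrogate pair
const std::u32string kUtf32 = U"\U0001F600";  // stored as one code unit

// Counts code points in a UTF-16 string by skipping low (trailing) surrogates.
// Assumes the input is well-formed UTF-16.
std::size_t utf16_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    }
    return n;
}
```

Here kUtf16.size() is 2 while kUtf32.size() is 1, so size() only matches the number of code points for the UTF-32 type.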
This seems to be quite enlightening on the topic. http://www.utf8everywhere.org/
Programming with UTF-8 is difficult because lengths and offsets are counted in bytes, not characters. For example:
std::string s = Something();
std::cout << s.substr(0, 4);
does not necessarily return the first 4 characters.
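One possible workaround is to cut on code point boundaries rather than byte offsets. This is a minimal sketch with a hypothetical helper name, utf8_prefix. It assumes the input is valid UTF-8 and counts code points, not user-perceived characters (a base letter plus combining marks still counts as several).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the first n code points of a UTF-8 string.
std::string utf8_prefix(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size()) {
        // Any byte that is not a continuation byte (10xxxxxx) starts a code point.
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            if (n == 0) break;  // the next code point would exceed the limit
            --n;
        }
        ++i;
    }
    return s.substr(0, i);
}
```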
I would use whatever a wchar_t is. On Windows that will be UTF-16. On some *nix platforms it might be UTF-32.
When saving to a file, I would recommend converting to UTF-8. That often makes the file smaller, and removes any platform dependencies due to differences in sizeof(wchar_t) or to byte order.
I'm targeting Windows but I don't see any reason why some API code I'm writing cannot use basic C++ types. What I want to do is expose methods that return strings and ints. In the C# world I'd just use string, and have a unicode string, but in VC++ I've got the option of using std::string, std::wstring, or MFC/ATL CStrings.
Should I just use std::wstring exclusively to support unicode, or can I use std::string which would be compiled to unicode based on my build settings? I'm leaning toward the latter. I'd prefer to provide Get[Item]AsCString() methods on my objects for other string types.
Also should I be using size_t instead of integer?
The API is going to be used by me, and perhaps a future developer working on the C++ GUI. It is a way to separate concerns. My preferences:
Intuitiveness for other developers.
Forward compatibility with VC++
Compatibility with other C++ compilers
Performance (this is a lesser concern for me, but need the startup time for rest of my app)
Any guides would be appreciated.
You should probably stick to the STL string type. The MFC CString class is built on top of that nowadays anyway.
As has been noted before, using wstring is not a magic bullet to address Unicode issues since there are many Unicode characters that still require multiple wchars to encode.
Using UTF-8 instead has potential benefits (you don't have to worry about endianness, for example).
On Windows, all modern kernels are wchar_t based, so there is a (minimal) performance overhead if you use the 8-bit char versions of the APIs.
In your situation it would take me a few hours or days to develop an opinion and decide. First of all, I very much prefer a C API to a C++ API, even for C++ code. Then the answer would be char*, wchar_t*, or TCHAR*. Now, try to guess whether you REALLY expect to need Unicode. The great majority of my projects (including those with GUIs) had no need for Unicode, and the simplicity and familiarity of plain C arrays is often hard to beat.
In short, try to predict what will be your needs, do not try to look too far into future (2 years is a good mark), then come up with the simplest solution to meet the needs.
Last: To answer your question more directly, I would start with std::string as my first choice to evaluate. Unless I found some big advantage in favor of the other choices, I would stay with it.
Using std::wstring/string instead of the MFC CString will allow you to port your code to other frameworks (e.g. Qt for Windows).
Even when using std::string you could encode the strings in UTF-8, so your API will still be able to return UNICODE strings.
Keep in mind that on Windows wstring is really UTF-16, not full 32-bit Unicode (while on some other operating systems wstring is UTF-32).
So I've finally gotten back to my main task - porting a rather large C++ project from Windows to the Mac.
Straight away I've been hit by the problem where wchar_t is 16-bits on Windows but 32-bits on the Mac. This is a problem because all of the strings are represented by wchar_t and there will be string data going back and forth between Windows and Mac machines (in both on-disk data and network data forms). Because of the way in which it works it wouldn't be totally straightforward to convert the strings into some common format before sending and receiving the data.
We've also really started to support a lot more languages recently and so we're starting to deal with a lot of Unicode data (as well as dealing with right-to-left languages).
Now, I could be conflating multiple ideas here and causing more problems for myself than needed which is why I'm asking this question. We're thinking that storing all of our in-memory string data as UTF-8 makes a lot of sense. It solves the wchar_t being different sizes problem, it means we can easily support multiple languages and it also dramatically reduces our memory footprint (we have a LOT of - mostly English - strings loaded) - but it doesn't seem like many people are doing this. Is there something we're missing? There's the obvious problem you have to deal with where string length can be less than the memory size storing that string data.
Or is using UTF-16 a better idea? Or should we stick to wchar_t and write code to convert between wchar_t and, say, Unicode in places where we read/write to the disk or the network?
I realize this is dangerously close to asking for opinions - but we're nervous that we're overlooking something obvious because it doesn't seem like there are many Unicode string classes (for example) - but yet there's plenty of code for converting to/from Unicode like in boost::locale, iconv, utf-cpp and ICU.
Always use a protocol defined to the byte when a file or network connection is involved. Do not rely on how a C++ compiler stores anything in memory. For Unicode text, this means choosing both an encoding and a byte order (okay, UTF-8 doesn't care about byte order). Even if the platforms you currently want to support have similar architectures, another popular platform with different behavior or even a new OS for one of your existing platforms will likely come along, and you'll be glad you wrote portable code.
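A minimal sketch of "defined to the byte" for text, assuming the chosen wire format is UTF-16 little-endian. The function names are hypothetical. Because each byte is written explicitly, the result does not depend on the host's byte order or on sizeof(wchar_t).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: writes each UTF-16 code unit as two bytes, low byte first.
std::vector<std::uint8_t> to_utf16le_bytes(const std::u16string& s) {
    std::vector<std::uint8_t> out;
    out.reserve(s.size() * 2);
    for (char16_t u : s) {
        out.push_back(static_cast<std::uint8_t>(u & 0xFF));
        out.push_back(static_cast<std::uint8_t>((u >> 8) & 0xFF));
    }
    return out;
}

// Hypothetical helper: reads the bytes back. A trailing odd byte is ignored.
std::u16string from_utf16le_bytes(const std::vector<std::uint8_t>& b) {
    std::u16string s;
    for (std::size_t i = 0; i + 1 < b.size(); i += 2) {
        s.push_back(static_cast<char16_t>(b[i] | (b[i + 1] << 8)));
    }
    return s;
}
```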
I tend to use UTF-8 as the internal representation. You only lose string length checking, which isn't really useful anyway. For Windows API conversion, I use my own Win32 conversion functions. As Mac and Linux are, for the most part, UTF-8-aware as standard, there is no need to convert anything there. Free bonuses you get:
Use plain old std::string.
Byte-wise network/stream transport.
For most languages, nice memory footprint.
For more functionality: utf8cpp
As a rule of thumb: UTF-16 for processing, UTF-8 for communication & storage.
Sure, any rule can be broken and this one is not carved in stone.
But you have to know when it is ok to break it.
For instance it might be a good idea to use something else if the environment you are using wants something else. But Mac OS X APIs use UTF-16, same as Windows. So UTF-16 makes more sense.
It is more straightforward to convert before you put/get things on the net (because you probably do it in 2-3 routines) than doing all the conversions to call OS APIs.
The type of application you develop also matters.
If it is something with very little text processing and very few calls to the system (something like an email server that mostly moves things around without changing them), then UTF-8 might be a good choice.
So, as much as you might hate this answer, "it depends".
ICU has a C++ string class, UnicodeString
What's the current best practice for handling generic text in a platform independent way?
For example, on Windows there are the "A" and "W" versions of APIs. Down at the C layer we have the "_tcs" functions (like _tcscpy) which map to either "wcscpy" or "strcpy". And in the STL I've frequently used something like:
typedef std::basic_string<TCHAR> tstring;
What issues if any arise from these sorts of patterns on other systems?
There is no support for a generic (variable-width) character like TCHAR in standard C++. C++ does have wchar_t, but the encoding isn't guaranteed. C++1x will much improve things once we have char16_t and char32_t as well as UTF-{8,16,32} literals.
I personally am not a big fan of generic characters because they lead to some nasty problems (like conversion) and, what's more, if you are using a type (like TCHAR) that might ever have a maximum width of 8, you might as well code with char. If you really need that backwards-compatibility, just use UTF-8; it is specifically designed to be a strict superset of ASCII. You may have to use conversion APIs (especially on Windows, which for some bizarre reason is UTF-16), but at least it'll be consistent.
EDIT: To actually answer the original question, other platforms typically have no such construct. You will have to define your TCHAR on that platform, or else use a library that provides one (but as you should no doubt be able to guess, I'm not a big fan of that concept in libraries either).
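If you do want to keep the tstring pattern on other platforms, defining TCHAR yourself could look roughly like this. It is a sketch under the assumption that non-Windows builds use narrow (UTF-8) text. The function greeting is hypothetical, and _T is redefined here only to mimic the Windows macro.

```cpp
#include <string>

#ifdef _WIN32
#include <tchar.h>   // provides TCHAR and _T on Windows
#else
typedef char TCHAR;  // assumption: narrow (UTF-8) text on other platforms
#define _T(x) x      // mimics the Windows macro; literals stay narrow
#endif

typedef std::basic_string<TCHAR> tstring;

// Hypothetical function showing code written once against tstring.
tstring greeting() {
    return tstring(_T("hello"));
}
```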
One thing to be careful of: make sure that all your static libraries, and the modules that use them, are built with the same character format. Otherwise your code will compile but fail to link.
I typically create my own "t" types based on the STL types: tstring, tstringstream, and even down to Boost types like tpath_t.
Unicode character set + the encoding that makes the most sense for your data. I typically use UTF-8 because it's convenient with traditional C / C++ functions and the data I deal with doesn't cause too much bloat.
Some APIs (Windows) and cross language tools (Java) use UTF-16 so that might be a consideration.
One practice I wish we had been better at is leaving text as an array of bytes for low-tech operations like copying, simple comparison, simple searching, etc. When you need the richer, more character-aware operations, you can convert to some "super" string (ICU strings are nice, but heavy) and define the layers and entry points that need to do this, as opposed to naively doing it everywhere. The needless conversions kill our performance, especially when combined with an XML DOM library that also uses the "super" strings.