My code base mostly uses UTF-8, but an older library has Windows Latin-1 encoded strings hardcoded within it.
I was hoping Boost would have a ready-made conversion feature, but I did not find one. Do I really need to hand-code such a commonplace operation?
I'm looking for a portable solution, running on Linux.
(This Q is similar, but not quite the same)
Edit: ICU seems to be the right answer, but it's a bit overkill for my needs. I ended up doing a string replace for the few known extended characters that were used.
International Components for Unicode (ICU) does have the solutions you are looking for. Boost can be compiled with support for ICU, e.g. for Boost regular expressions, but precompiled versions of Boost usually don't include it.
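If pulling in ICU really is overkill, a hand-rolled conversion is short, because Latin-1 bytes map one-to-one onto the first 256 Unicode code points. Here is a minimal sketch (the helper name is my own; note that true Windows-1252 differs from ISO-8859-1 in the 0x80-0x9F range and would need a small lookup table there):

    #include <string>

    // Convert an ISO-8859-1 (Latin-1) string to UTF-8. Bytes 0x00..0x7F are
    // ASCII and pass through unchanged; bytes 0x80..0xFF become the two-byte
    // UTF-8 sequence for code points U+0080..U+00FF. Caveat: Windows-1252
    // remaps 0x80..0x9F (e.g. the euro sign) and needs a lookup table there.
    std::string latin1_to_utf8(const std::string& latin1)
    {
        std::string utf8;
        utf8.reserve(latin1.size());
        for (unsigned char c : latin1) {
            if (c < 0x80) {
                utf8 += static_cast<char>(c);
            } else {
                utf8 += static_cast<char>(0xC0 | (c >> 6));    // leading byte
                utf8 += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
            }
        }
        return utf8;
    }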
I see "CString" in MFC, and "QString" in QT.
what is the difference among string, CString, QString?
Why do not use "string" directly?
They're different variations on the string type.
std::string is the one from the ISO standard and probably preferred in situations where you want portability. It is required to be provided by all implementations claiming to conform with the standard.
CString is, as you say, from MFC (documented here) and will generally only work in that environment. If you're programming exclusively for Windows, you can probably use that. It may have extra features not provided by std::string.
Similarly, QString is the Qt variation, documented here, and is meant to represent strings in programs using Qt. Like CString, it's more tightly bound to its environment so may offer efficiencies over std::string.
Looking around (doing your research for you basically) I found some stuff.
String: does NOT support character encoding; no special functionality vs. the others [1].
QString: plenty of useful functions, some better compatibilities, supports character encodings, UTF-16 by default [1].
CString: plenty of useful functions, some better compatibilities, and good for Unicode and ASCII compilation [2], [3].
There are also some more things that are not mentioned here. The sources are:
[1] http://blog.rburchell.com/2010/08/strings-and-qt.html
[2] http://forums.codeguru.com/showthread.php?319932-CString-vs-std-string
[3] Elsewhere
[4] Built to work better with its own framework
I hope I was helpful, as this is my first post.
I want to use Unicode strings in C++ with a library that implements most of the routine operations for me. I want to work with the Boost libraries, and I found the Boost.Locale library. But I have not found that a lot of people use it; do they? What can you say from your experience about this library? Are there any other Boost libraries that implement Unicode string routines?
UPDATE:
There is a problem with using other libraries in some of my modules. I don't want to tie them to a lot of different libraries (Boost is OK), but I need Unicode string routines (maybe a class). Why Unicode? Some characters in the strings may be Japanese symbols, or symbols from other languages, and they must be treated the same way as English characters.
Please excuse the self-promotion here, but you might be interested in the answer I wrote here: What are the tradeoffs between boost::locale and std::locale?, comparing boost::locale to std::locale.
Depending on what you need to do with your text, boost::locale is probably the best approach to adding Unicode support to your C++ code. This is especially true if you need cross-platform support, or if you want to use UTF-8 on Windows.
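To give a feel for the API, here is a minimal sketch of Boost.Locale in use (the locale name "en_US.UTF-8" is an assumption; pick one your platform provides):

    #include <boost/locale.hpp>
    #include <boost/locale/encoding_utf.hpp>
    #include <iostream>
    #include <string>

    int main()
    {
        // Build a std::locale with Unicode-aware facets (backed by ICU).
        boost::locale::generator gen;
        std::locale loc = gen("en_US.UTF-8");

        std::string text = "gr\xC3\xBC\xC3\x9F" "en";  // "grüßen" in UTF-8

        // Linguistically correct case conversion (German ß upper-cases to SS):
        std::cout << boost::locale::to_upper(text, loc) << "\n";  // GRÜSSEN

        // Lossless conversion between UTF encodings:
        std::u16string utf16 = boost::locale::conv::utf_to_utf<char16_t>(text);
    }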
I hope there is a library out there that provides such functionality, so that I do not need to dig too deeply into the charset specification.
C++, hopefully with Chinese support, and hopefully on Windows.
Yes, ICU is a mature library providing Unicode and Globalization support. Among other things it provides easy access to all of the many Unicode character properties, Unicode Normalization, Case Folding and other fundamental operations as specified by the Unicode Standard.
I have not tried to program with it myself, but in the Unix world the GNU library libiconv is very widely used. It is also available for Windows. It is probably a bit slimmer than ICU.
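For reference, a minimal sketch of the iconv API (error handling abbreviated; on some platforms the input pointer parameter is declared const char**, so the call may need a cast):

    #include <iconv.h>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        // Open a converter from Latin-1 to UTF-8.
        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        char in[] = "caf\xE9";                  // "café" encoded as Latin-1
        char out[32] = {0};
        char* inp = in;
        char* outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out) - 1;

        // iconv() advances the pointers and decrements the byte counts.
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");
        else
            printf("%s\n", out);                // "café" encoded as UTF-8

        iconv_close(cd);
        return 0;
    }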
I'm looking for a portable and easy-to-use string library for C/C++ which helps me work with Unicode input/output. In the best case, it will store its strings in memory in UTF-8 and allow me to convert strings from ASCII to UTF-8/UTF-16 and back. I don't need much more than that (OK, a liberal license won't hurt). I have seen that C++ comes with a <locale> header, but this seems to work on wchar_t only, which may or may not be UTF-16 encoded; plus I'm not sure how good it actually is.
Use cases are, for example: on Windows, the Unicode APIs expect UTF-16 strings, and I need to convert ASCII or UTF-8 strings to pass them to the API. The same goes for XML parsing, which may come with UTF-16, while I actually only want to process internally in UTF-8 (or, for that matter, if I switch internally to UTF-16, I'll need a conversion to that anyway).
So far, I've taken a look at ICU, which is quite huge. Moreover, it wants to be built using its own project files, while I'd prefer a library for which there is either a CMake project or which is easy to build (something like "compile all these .c files, link, and good to go"), instead of shipping something as large as ICU along with my application.
Do you know such a library, which is also being maintained? After all, this seems to be a pretty basic problem.
UTF8-CPP seems to be exactly what you want.
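For illustration, a minimal sketch of its conversion helpers (UTF8-CPP is header-only and works on plain iterators; the string contents are just examples):

    #include <iterator>
    #include <string>
    #include "utf8.h"  // UTF8-CPP

    int main()
    {
        std::string utf8 = "\xC3\xA9tude";  // "étude" in UTF-8

        // Validate before converting.
        if (utf8::is_valid(utf8.begin(), utf8.end())) {
            // UTF-8 -> UTF-16
            std::u16string utf16;
            utf8::utf8to16(utf8.begin(), utf8.end(), std::back_inserter(utf16));

            // ...and back again: UTF-16 -> UTF-8
            std::string roundtrip;
            utf8::utf16to8(utf16.begin(), utf16.end(), std::back_inserter(roundtrip));
        }
    }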
I'd recommend that you look at the GNU iconv library.
There is another portable C library for string conversion between UTF-8, UTF-16, UTF-32, and wchar_t: the mdz_unicode library.
What is the best practice for Unicode processing in C++?
Use ICU (or a similar library) for dealing with your data
In your own data store, make sure everything is stored in the same encoding
Make sure you are always using your Unicode library for mundane tasks like string length, capitalization status, etc. Never use standard library builtins like isalpha unless that is really the definition you want.
I can't say it enough: never iterate over the indices of a string if you care about correctness; always use your Unicode library for this (the sketch below shows why).
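To make the iteration point concrete, here is a minimal ICU sketch (assuming the ICU C++ headers are installed) where one user-perceived character spans two code points and three bytes:

    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>

    int main()
    {
        UErrorCode status = U_ZERO_ERROR;

        // "é" written as 'e' plus a combining acute accent (U+0301):
        // one user-perceived character, two code points, three UTF-8 bytes.
        icu::UnicodeString s = icu::UnicodeString::fromUTF8("e\xCC\x81");

        std::cout << "UTF-16 units: " << s.length()      << "\n";  // 2
        std::cout << "code points:  " << s.countChar32() << "\n";  // 2

        // Count grapheme clusters -- what a user would call "characters".
        std::unique_ptr<icu::BreakIterator> it(
            icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status));
        if (U_FAILURE(status)) return 1;
        it->setText(s);
        int32_t graphemes = 0;
        it->first();
        while (it->next() != icu::BreakIterator::DONE)
            ++graphemes;
        std::cout << "graphemes:    " << graphemes << "\n";        // 1
    }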
If you don't care about backwards compatibility with previous C++ standards, the current C++11 standard has built-in Unicode support: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2011/n3242.pdf
So the truly best practice for Unicode processing in C++ would be to use the built in facilities for it. That isn't always a possibility with older code bases though, with the standard being so new at present.
EDIT: To clarify, C++11 is Unicode aware in that it now has support for Unicode literals and Unicode strings. However, the standard library has only limited support for Unicode processing and conversion. For your current needs this may be enough. However, if you need to do a large amount of heavy lifting right now then you may still need to use something like ICU for more in-depth processing. There are some proposals currently in the works to include more robust support for text conversion between different encodings. My guess (and hope) is that this will be part of the next technical report.
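For concreteness, the core-language support mentioned above looks like this when compiled as C++11 (note that C++20 later changed the type of u8 literals to char8_t):

    #include <string>

    int main()
    {
        // C++11 Unicode string literals and the new character types.
        // All three spell "zß水" (z, eszett, water) in different encodings.
        const char*    utf8  = u8"z\u00df\u6c34";  // UTF-8 encoded bytes
        std::u16string utf16 = u"z\u00df\u6c34";   // char16_t, UTF-16 code units
        std::u32string utf32 = U"z\u00df\u6c34";   // char32_t, one unit per code point

        (void)utf8; (void)utf16; (void)utf32;
    }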
Our company (and others) uses the open source International Components for Unicode (ICU) library, originally developed by Taligent.
It handles strings, locales, conversions, dates/times, collation, transformations, etc.
Start with the ICU User Guide.
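As a taste of the API, here is a minimal sketch of locale-aware comparison with ICU's Collator (the locale and strings are just examples):

    #include <unicode/coll.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>

    int main()
    {
        UErrorCode status = U_ZERO_ERROR;

        // Locale-sensitive comparison: a byte-wise compare would sort "Äpfel"
        // after "Zebra", but German collation sorts it near "A".
        std::unique_ptr<icu::Collator> coll(
            icu::Collator::createInstance(icu::Locale::getGermany(), status));
        if (U_FAILURE(status)) return 1;

        icu::UnicodeString a = icu::UnicodeString::fromUTF8("\xC3\x84pfel");  // "Äpfel"
        icu::UnicodeString b = icu::UnicodeString::fromUTF8("Zebra");

        UCollationResult r = coll->compare(a, b, status);
        std::cout << (r == UCOL_LESS ? "A-umlaut sorts first" : "unexpected order") << "\n";
    }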
Here is a checklist for Windows programming (the sketch after the list shows these conventions in use):
All strings enclosed in _T("my string")
strlen() etc. functions replaced with _tcslen() etc.
Use LPTSTR and LPCTSTR instead of char * and const char *
When starting new projects in Dev Studio, religiously make sure the Unicode option is selected in your project properties.
For C++ strings, use std::wstring instead of std::string
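A minimal sketch of what code following this checklist looks like (the message text is just an example):

    #include <windows.h>
    #include <tchar.h>

    // With the Unicode project option set, TCHAR and _T() map to wchar_t and
    // wide literals; in an ANSI build they map to char, so the same source
    // compiles both ways.
    int WINAPI _tWinMain(HINSTANCE, HINSTANCE, LPTSTR, int)
    {
        LPCTSTR greeting = _T("Hello, Unicode world");
        size_t length = _tcslen(greeting);  // instead of strlen()
        (void)length;

        MessageBox(NULL, greeting, _T("Checklist demo"), MB_OK);
        return 0;
    }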
Look at
Case insensitive string comparison in C++
That question has a link to the Microsoft documentation on Unicode: http://msdn.microsoft.com/en-us/library/cc194799.aspx
If you look at the left-hand navigation pane on MSDN next to that article, you should find a lot of information pertaining to Unicode functions. It is part of a chapter on "Encoding Characters" (http://msdn.microsoft.com/en-us/library/cc194786.aspx).
It has the following subsections:
The Code-Page Model
Double-Byte Character Sets in Windows
Unicode
Compatibility Issues in Mixed Environments
Unicode Data Conversion
Migrating Windows-Based Programs to Unicode
Summary
Although this may not be best practice for everyone, you can write your own C++ Unicode routines if you want!
I just finished doing it over a weekend. I learned a lot, and though I don't guarantee it's 100% bug free, I did a lot of testing and it seems to work correctly.
My code is under the New BSD license and can be found here:
http://code.google.com/p/netwidecc/downloads/list
It is called WSUCONV and comes with a sample main() program that converts between UTF-8, UTF-16, and standard ASCII. If you throw away the main code, you've got a nice library for reading/writing Unicode.
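To give a flavor of what writing it yourself involves, here is a minimal sketch of a UTF-8 decoder (deliberately without validation; a real implementation must also reject malformed, overlong, and surrogate sequences):

    #include <string>
    #include <vector>

    // Decode UTF-8 into code points. Each lead byte announces how many
    // continuation bytes follow; each continuation byte contributes 6 bits.
    // NOTE: no validation -- malformed input produces garbage, not errors.
    std::vector<char32_t> decode_utf8(const std::string& in)
    {
        std::vector<char32_t> out;
        for (std::size_t i = 0; i < in.size(); ) {
            unsigned char b = in[i];
            char32_t cp;
            int extra;
            if      (b < 0x80) { cp = b;        extra = 0; }  // ASCII
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }  // 2-byte sequence
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }  // 3-byte sequence
            else               { cp = b & 0x07; extra = 3; }  // 4-byte sequence
            ++i;
            for (int k = 0; k < extra && i < in.size(); ++k, ++i)
                cp = (cp << 6) | (in[i] & 0x3F);
            out.push_back(cp);
        }
        return out;
    }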
As has been said above, a library is the best bet when building a large system. However, sometimes you do want to handle things yourself (maybe because the library would use too many resources, as on a microcontroller). In this case you want a simple library that you can copy the parts you actually need out of.
Willow Schlanger's example code seems like a good one (see his answer for more details).
I also found another one that has smaller code, but it lacks full error checking and only handles UTF-8; it was, however, simpler to take parts out of.
Here's a list of the embedded libraries that seem decent.
Embedded libraries
http://code.google.com/p/netwidecc/downloads/list (UTF8, UTF16LE, UTF16BE, UTF32)
http://www.cprogramming.com/tutorial/unicode.html (UTF8)
http://utfcpp.sourceforge.net/ (Simple UTF8 library)
Use IBM's International Components for Unicode
Have a look at the recommendations of UTF-8 Everywhere
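In practice, the UTF-8 Everywhere approach means keeping std::string as UTF-8 internally and converting only at the Windows API boundary. A minimal sketch of the two boundary conversions (the helper names widen/narrow are my own):

    #include <windows.h>
    #include <string>

    // UTF-8 -> UTF-16, for passing strings into the wide Windows APIs.
    std::wstring widen(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(),
                                    nullptr, 0);
        std::wstring w(n, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &w[0], n);
        return w;
    }

    // UTF-16 -> UTF-8, for bringing API results back into the program.
    std::string narrow(const std::wstring& utf16)
    {
        if (utf16.empty()) return std::string();
        int n = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                                    nullptr, 0, nullptr, nullptr);
        std::string s(n, '\0');
        WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                            &s[0], n, nullptr, nullptr);
        return s;
    }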