Boost.Locale - Unicode string in C++

Can I make all std::string in my application support Unicode with Boost.Locale? After reading the documentation I can say yes. But I don't understand how it works. The main question is: can I still use the Boost string algorithms library or Boost.Lexical_Cast? If yes, why do I need boost::locale::to_upper and similar methods, if I already have these methods in the Boost string algorithm library?

Yes, you can make all strings in your application Unicode encoded with Boost.Locale.
To make it work, you imbue the locale into the streams you use (a std::string itself carries no locale), or set the default global locale to your new Unicode-based locale (generated by Boost.Locale).
See here for how to do that: http://www.boost.org/libs/locale/doc/html/locale_gen.html
and http://www.boost.org/libs/locale/doc/html/faq.html
The string manipulation APIs in Boost.Locale are different to the ones provided in the Boost string algorithm library.
See here for why the Boost.Locale functions are better: http://www.boost.org/libs/locale/doc/html/conversions.html
You can still use boost::lexical_cast, provided you set the global locale correctly (as, if I recall correctly, you can't explicitly pass a locale object to Boost.LexicalCast).
Keep in mind however that this will 'break' some cases, for example, when converting an integer to a string, instead of using the C locale (as was probably your previous default), it will use a different one, which may insert separators etc. When doing conversions that are NOT displayed to the user, you may wish to use std::stringstream et al directly to avoid these unwanted formatting changes.
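To illustrate the point about non-user-facing conversions, here is a minimal sketch using only the standard library. GroupingPunct is a hypothetical facet made up for this example; it stands in for any global locale that groups digits. The names format_with_global and format_for_machine are also just illustrative.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: imitates a user-facing locale that groups digits in threes.
struct GroupingPunct : std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats with whatever the global locale is at the time of the call.
std::string format_with_global(long value) {
    std::ostringstream oss;  // a new stream copies the current global locale
    oss << value;
    return oss.str();
}

// Formats with the classic "C" locale, regardless of the global locale.
std::string format_for_machine(long value) {
    std::ostringstream oss;
    oss.imbue(std::locale::classic());
    oss << value;
    return oss.str();
}
```

With a grouping locale installed via std::locale::global, the first function would produce grouped digits while the second keeps the plain form, which is what you usually want for files, protocols and logs.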
I highly suggest you read the Boost.Locale documentation in full, as it should address most of your concerns (especially the FAQ, generation backend information, etc.).

Related

Locale invariant guarantee of boost::lexical_cast<>

I'm using boost::lexical_cast<std::string>(double) for converting doubles to string, generating JSON serialized byte stream, that is (on remote side) parsed by .NET.
I'm able to force the .NET to use InvariantCulture for parsing, thereby returning predictable result on every possible language.
However, I was not able to find this guarantee in the boost::lexical_cast documentation. I tried it a little, and it works the same way with different locales set. But I cannot be sure from only a few tests. Am I missing something in the documentation, or can this not be guaranteed at all, so that I have to use something else?
EDIT:
I've found an issue.
std::locale::global(std::locale("Czech"));
std::cout << boost::lexical_cast<std::string>(0.15784465) << std::endl;
returns 0,15784465, and that is undesired. Can I force the boost::lexical_cast<> not to be aware of locales?
Can I force the boost::lexical_cast<> not to be aware of locales?
No, I don't think that is possible. The best you can do is call
std::locale::global(std::locale::classic());
to set the global locale to the "C" locale as boost::lexical_cast relies on the global locale. However, the problem is if somewhere else in the code the global locale is set to something else before calling boost::lexical_cast, then you still have the same problem.
Therefore, a robust solution would be to imbue a stringstream like so, and you can always be sure that this works:
std::ostringstream oss;
oss.imbue(std::locale::classic());
oss.precision(std::numeric_limits<double>::digits10);
oss << 0.15784465;
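For completeness, a self-contained sketch of the same idea wrapped in functions. CommaDecimal is a hypothetical facet invented here to imitate a locale with a decimal comma (so the example does not depend on a "Czech" locale being installed), and format_double / to_invariant_string are illustrative names, not part of any library.

```cpp
#include <cassert>
#include <limits>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: imitates a locale whose decimal separator is a comma.
struct CommaDecimal : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Formats a double with an explicitly chosen locale instead of the global one.
std::string format_double(double value, const std::locale& loc) {
    std::ostringstream oss;
    oss.imbue(loc);
    oss.precision(std::numeric_limits<double>::digits10);
    oss << value;
    return oss.str();
}

// Wire format (e.g. JSON): always the classic "C" locale.
std::string to_invariant_string(double value) {
    return format_double(value, std::locale::classic());
}
```

Note that digits10 (15 for a typical double) is the precision used in the answer above; if the receiving side must reconstruct exactly the same double, std::numeric_limits<double>::max_digits10 is the value that guarantees a round trip, at the cost of longer output.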
A better solution to this problem is to use a locale generated by Boost.Locale instead of a plain std::locale as the global locale. From the documentation:
Setting the global locale has bad side effects... it affects even printf and libraries like boost::lexical_cast giving incorrect or unexpected formatting. In fact many third-party libraries are broken in such a situation.
Unlike the standard localization library, Boost.Locale never changes the basic number formatting, even when it uses std based localization backends, so by default, numbers are always formatted using C-style locale. Localized number formatting requires specific flags.
Boost locale requires you to specify explicitly when you want numeric formatting to be locale aware, which is more consistent with recent library decisions like std::money_put.

Writing unicode C++ source code

I saw on the project properties on Visual Studio 2012 that you can select the Character set for your application.
I use Unicode Character set.
What is the problem with Multi-byte character set? Or better, why should I use the Unicode?
Take for example this piece of code from a DLL that I am doing
RECORD_API int startRecording(
char *cam_name, // Friendly video device name
char *time, // Max time for recording
char *f_width, // Frame width
char *f_height, // Frame height
char *file_path) // Complete output file path
{
...
}
A lot of Unicode functions from Windows.h header use wchar_t parameters; should I use wchar_t also for my functions parameters?
Should I always explicitly call the W functions (for example, ShellExecuteW)?
First, regardless of what the interface says, the question isn't
Unicode or not, but UTF-16 or UTF-8. Practically speaking, for
external data, you should only use UTF-8. Internally, it
depends on what you are doing. Conversion of UTF-8 to UTF-16 is
an added complication, but for more complex operations, it may
be easier to work in UTF-16. (Although the differences between
UTF-8 and UTF-16 aren't enormous. To reap any real benefits,
you'd have to use UTF-32, and even then...)
In practice, I would avoid the W functions completely, and
always use char const* at the system interface level. But
again, it depends on what you are doing. This is just a general
guideline. For the rest, I'd stick with std::string unless
there was some strong reason not to.
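To give an idea of what the "added complication" of going from UTF-8 to UTF-16 looks like, here is a rough portable sketch. The function name utf8_to_utf16 is made up for this example, and the validation is deliberately incomplete (see the comments). On Windows you would more likely call MultiByteToWideChar with CP_UTF8 at the system boundary instead of hand-rolling this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Sketch only: rejects bad lead bytes, truncated sequences and bad continuation
// bytes, but does NOT reject overlong encodings or encoded surrogates.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char byte = static_cast<unsigned char>(in[i + k]);
            if ((byte & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (byte & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));  // one UTF-16 code unit
        } else if (cp <= 0x10FFFF) {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            throw std::invalid_argument("code point out of range");
        }
    }
    return out;
}
```

The surrogate-pair branch is the reason UTF-16 is not really "one unit per character" either, which is the point made above about needing UTF-32 to get any real simplification.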
You don't need to explicitly call the W version of a function, as this should already be covered by the include files and the settings that you use. So if you compile for Unicode support, then the W version of your system call will be used, otherwise the A version.
Personally I would only compile for Unicode if you can really test it. At least you shouldn't assume that your application really can work properly in all cases just because you compiled for it. Compiling for it is only the first step; you must also consistently use the appropriate types and test your code, to be sure that there are no effects you may not have noticed.

Quick'n'dirty way to prototype some serialization in C++?

I need to do a prototype that involves some serialization in C++. It is a quick'n'dirty prototype, so I don't need to solve the problem generally, provide good error checking, or anything like that. But at the same time, I do need to be able to serialize strings of arbitrary length and with arbitrary characters.
Are there some best practices for how to whip up a quick data serialization in C++? Normally I'd just have output records into a text file with one record per line, but my strings may have new lines in them.
You could consider using JSON, notably through JsonCpp. You could also use libs11n, a full-fledged, template-friendly C++ serialization framework.
(If you want a C library for Json, consider jansson).
You might also consider using old XDR or ASN1 technology.
For a quick & dirty prototype, I do recommend the JsonCpp library mentioned above. And using JSON in that case is useful, since it is a textual, nearly-human-friendly, format.
Later you could even perhaps consider going to MongoDb which has a Json-like model.
Check out serialization with Boost:
http://www.boost.org/doc/libs/1_51_0/libs/serialization/doc/index.html
Not dirty at all but definitely quick.
If you do not mind binary data, for each string dump a length (cast to a char*) and then the value of the string to file. It is very easy to read back. POD structs can also be dumped directly by casting to a char*
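A minimal sketch of that length-prefix idea, assuming the file is written and read on the same machine (the length is dumped in native byte order and size, so this is not a portable format). The names write_string and read_string are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Writes an 8-byte length (native byte order) followed by the raw bytes.
void write_string(std::ostream& out, const std::string& s) {
    const std::uint64_t len = s.size();
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(s.data(), static_cast<std::streamsize>(s.size()));
}

// Reads back one record written by write_string.
std::string read_string(std::istream& in) {
    std::uint64_t len = 0;
    if (!in.read(reinterpret_cast<char*>(&len), sizeof len))
        throw std::runtime_error("missing length prefix");
    std::string s(static_cast<std::size_t>(len), '\0');
    if (len != 0 && !in.read(&s[0], static_cast<std::streamsize>(len)))
        throw std::runtime_error("truncated string data");
    return s;
}
```

Because the reader never scans for a delimiter, newlines and even embedded NUL bytes in the strings are harmless. When using a real file, open it with std::ios::binary so the bytes are not translated.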

using boost string algorithm with MFC CString to check for the end of a string

I need to check whether my CString object in MFC ends with a specific string.
I know that boost::algorithm has many functions meant for string manipulation, and that the header boost/algorithm/string/predicate.hpp could be used for that purpose.
I usually use this library with std::string. Do you know a convenient way to use this library also with CString?
I know that the library is generic and can also be used with other string types as template arguments, but it is not clear to me how (or whether it is possible) to apply this to CString.
Can you help me with that in case it is possible?
According to Boost String Algorithms Library, "consult the design chapter to see precise specification of supported string types", which says amongst other things, "first requirement of string-type is that it must [be] accessible using Boost.Range", and note at the bottom the MFC/ATL implementation written by Shunsuke Sogame which should allow you to combine libraries.
Edit: Since you mention regex in the comments below, this is all you really need to do (assuming a unicode build):
CString inputString;
wcmatch matchGroups;
wregex yourRegex(L"^(.*)$", regex::icase);
if (regex_search(static_cast<LPCWSTR>(inputString), matchGroups, yourRegex))
{
CString firstCapture = matchGroups[1].str().c_str();
}
Note how we reduce the different string types to raw pointers to pass them between libraries. Replace my contrived yourRegex with your requirements, including whether or not you ignore case or are explicit about anchors.
Why don't you save yourself the trouble and just use CStringT::Right?
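The check itself is small enough to write by hand. Below is a portable stand-in using std::wstring, since MFC is not available everywhere; ends_with is a name made up for this sketch. With a CString, the equivalent would presumably be comparing text.Right(suffix.GetLength()) against suffix, as the previous answer suggests.

```cpp
#include <cassert>
#include <string>

// True if text ends with suffix (case-sensitive). An empty suffix always matches.
bool ends_with(const std::wstring& text, const std::wstring& suffix) {
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

In a Unicode build a CString converts to a wide character pointer, so it could be passed to a function like this, at the cost of one temporary std::wstring copy.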

Localization of string literals

I need to localize error messages from a compiler. As it stands, all error messages are spread throughout the source code as string literals in English. We want to translate these error messages into German. What would be the best way to approach this? Leave the string literals as-is, and map the char* to another language inside the error reporting routine, or replace the english literals with a descriptive macro, eg. ERR_UNDEFINED_LABEL_NAME and have this map to the correct string at compile time?
How is such a thing approached in similar projects?
On Windows, typically this is done by replacing the string with integer constants, and then using LoadString or similar to get them from a resource in a DLL or EXE. Then you can have multiple language DLLs and a single EXE.
On Unixy systems I believe the most typical approach is gettext. The end result is similar, but instead of defining integer constants, you wrap your English string literals in a macro, and it will apply some magic to turn that into a localized string.
The most flexible way would be for the compiler to read the translated messages from message catalogs, with the choice of language being made according to the locale. This would require changing the compiler to use some tool like gettext.
Just a quick thought...
Could you overload your error reporting routine? Say you are using
printf("MESSAGE")
You could overload it in a way that "MESSAGE" is the input, and you hash it to the corresponding message in German.
Could this work?
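A rough sketch of that lookup idea, assuming the English literal itself is used as the key. The catalog contents and the names german_catalog and translate are invented for illustration; a real compiler would load the table from a file or use gettext rather than hard-coding it.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical message catalog: English literal -> German text.
const std::unordered_map<std::string, std::string>& german_catalog() {
    static const std::unordered_map<std::string, std::string> catalog = {
        {"undefined label name", "undefinierter Labelname"},
        {"unexpected end of file", "unerwartetes Dateiende"},
    };
    return catalog;
}

// Returns the German text if the message is known, otherwise the English original.
std::string translate(const std::string& english) {
    const auto& catalog = german_catalog();
    const auto it = catalog.find(english);
    return it != catalog.end() ? it->second : english;
}
```

The error reporting routine would then call translate on the message before printing it. Falling back to the English text means an untranslated message still shows up instead of disappearing, which is essentially how gettext behaves as well.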
You could use my CMsg() and CFMsg() wrappers around the LoadString() API. They make your life easier to load and format the strings pulled out of the resources.
And of course, appTranslator is your best friend to translate your resources ;-)
Disclaimer: I'm the author of appTranslator.
On Windows you can use the resource compiler and the WinAPI load functions to have localized strings and other resources. FindResource() and its specialized derivatives like LoadString() will automatically load language specific resources according to the user's current locale. FindResourceEx() even allows you to manually specify the language version of the resource you wish to retrieve.
In order to enable this in your program you must first change your program to compile the strings in a resource file (.rc) and use LoadString() to fetch the strings at runtime instead of using a literal string. Within the resource file you then set up multiple language versions of the STRINGTABLEs you use, with the LANGUAGE modifier. The multi-lingual resources are then loaded based on the search order described here on MSDN: Multiple-Language Resources
Note: If you have no reason to need a single executable, or are doing something like using a user-selected language from within your app, it gives you more control and less confusion to compile each language into a separate DLL and load them dynamically, rather than having a single large resource file and trying to switch locales dynamically.
Here is an example of a multiple-language StringTable resource file (e.g. strings.rc):
#define IDS_HELLO 361
STRINGTABLE
LANGUAGE LANG_ENGLISH, SUBLANG_ENGLISH_CAN
BEGIN
IDS_HELLO, "Hello!"
END
STRINGTABLE
LANGUAGE LANG_FRENCH, SUBLANG_NEUTRAL
BEGIN
IDS_HELLO, "Bonjour!"
END