Non-Unicode library and Windows locale - C++

I'm using what seems to be an ANSI (or ASCII?) DLL library. I think so because the header file provided with the lib shows functions using char*, LPSTR, and LPCSTR, and structs with char arrays.
This DLL is loaded via ::LoadLibrary from a C++/CLI class library that wraps its functionality and exposes it to C#. A C# console app and various other class libs use this C++/CLI lib to perform operations.
I can make the C++/CLI assembly either multibyte or Unicode (which, as far as I understand, is the same in terms of language support), and C# apps are always Unicode.
This native DLL is essentially a broker for a proprietary back-end server; it passes information back and forth to and from the server.
The issue I'm running into is that the native DLL will only operate correctly for a particular language if the OS locale for non-Unicode apps on the machine it's running on is set to that particular language.
I.e. if I want the app to work correctly with Chinese characters, that locale needs to be set. What I find hard to grasp is why the locale matters for the broker. I understand that if the server is an ANSI app and a user wanted to store non-Unicode Chinese on it, setting the locale would make sense on the server, and likewise on the client, but not in the middleman that just passes things along. The whole thing is getting very confusing.
Is there a way to pass Unicode to something like a char array in C++? Would that even work in this scenario?
Here's a scenario I'm thinking about:
The C# app gets a URL-encoded string.
The C# app decodes the string and passes it to the C++/CLI lib.
The C++/CLI lib somehow converts the String^ (or should it be byte[] at this point?) to char[] and passes it to the native lib.
Should this really be possible? In terms of memory layout it should be; I mean, a char is just a byte, no?
Am I approaching this the right way? Is there a better way to accomplish cross-language support? Mind you, the vendor is on record saying that there is no way to mix languages in the API, but that's not what I'm looking for. I just don't want to have to run an instance of the software on a separate OS for each language I want to support.

What is confusing in this case is that the DLL has a broken interface. Broken in the following sense: it does not support all of the Unicode code points. This is regardless of the parameter types: a char array is perfectly capable of carrying all of Unicode.
How do we know this? Because, according to you, what the DLL does depends on the system locale setting.
So, what to do? If the DLL source code is not under your control, you cannot fix it. You can, however, solve the problem for a single ANSI codepage by setting the locale, though that does not work for every language.
Better would be to urge the DLL vendor to support Unicode. The best encoding is, of course, UTF-8; that way existing code does not break, because types such as LPCSTR remain the same.
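To make the "pass Unicode to a char array" part concrete, here is a minimal sketch, assuming a Win32 build (the helper name ToCodepage is made up for this illustration). It uses WideCharToMultiByte to turn a UTF-16 string into bytes in a chosen codepage; anything the codepage cannot represent is silently replaced, which is exactly why the broker's behaviour depends on the system's non-Unicode locale.

#include <windows.h>
#include <string>

// Hypothetical helper: convert a UTF-16 string to a narrow string in a given
// Windows codepage (e.g. 936 for Simplified Chinese, or CP_UTF8 if the vendor
// ever adopts UTF-8). Characters that do not exist in that codepage are
// replaced with a default character, i.e. silently lost.
std::string ToCodepage(const std::wstring& wide, UINT codepage)
{
    if (wide.empty()) return std::string();

    int bytes = WideCharToMultiByte(codepage, 0, wide.c_str(), (int)wide.size(),
                                    nullptr, 0, nullptr, nullptr);
    std::string narrow(bytes, '\0');
    WideCharToMultiByte(codepage, 0, wide.c_str(), (int)wide.size(),
                        &narrow[0], bytes, nullptr, nullptr);
    return narrow;   // can be handed to a char* / LPCSTR parameter
}

With codepage 936, the resulting bytes mean what you intend only if the server and the client's "language for non-Unicode programs" setting agree; with CP_UTF8 they would be unambiguous, which is the vendor-side fix suggested above.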

I ended up using the approach described here:
Is it possible to set ANSI encoding per application in windows
This is what worked for me

Related

How to properly localize a cross platform program?

I am currently making a game engine that is eventually going to support all platforms. At the moment I am working on Windows support with the Win32 API. The documentation suggests that I use wide strings/chars and the Unicode versions of the API functions so that my application can be localized. But if I use the wide versions of everything (wcout, wstring, wchar_t, etc.), I will have to make my entire game engine use those wide types. That also means that when working on other platforms I will either have to use wide types there as well or convert between them.
My idea is that my code could be compiled with wide string types on Windows and with normal string types on other platforms, perhaps via macro definitions (see the sketch after this question). Is that the best option? And how might I go about doing this?
I also don't really understand how Unicode works in C++. If I set my system locale to English, I get a compiler warning from MSVC if I store any Chinese characters in a normal string type. However, now that I have set my system locale to Chinese and enabled UTF-8 encoding, I get no compiler warnings when I store Unicode characters in normal strings. I also have no idea how Unicode works on other platforms. Can somebody explain this for me?
Thanks in advance.
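For illustration only, here is a minimal sketch of the macro scheme the question describes, similar in spirit to Windows' own TCHAR/_T mechanism; every name in it (ENGINE_USE_WIDE, engine_char, estring, ETEXT) is invented for this sketch rather than taken from any real engine. The answers quoted elsewhere on this page generally argue for the simpler alternative of keeping UTF-8 in plain std::string everywhere and converting to UTF-16 only when calling the Win32 "W" functions.

#include <iostream>
#include <string>

// Switch between wide and narrow string types at compile time.
#if defined(_WIN32) && defined(ENGINE_USE_WIDE)
    using engine_char = wchar_t;
    using estring     = std::wstring;
    #define ETEXT(s) L##s        // same trick as the Windows _T() macro
#else
    using engine_char = char;
    using estring     = std::string;
    #define ETEXT(s) s
#endif

int main()
{
    // The same source line builds as wide on Windows (with ENGINE_USE_WIDE
    // defined) and as narrow everywhere else.
    estring windowTitle = ETEXT("My Game Engine");
#if defined(_WIN32) && defined(ENGINE_USE_WIDE)
    std::wcout << windowTitle << L"\n";
#else
    std::cout << windowTitle << "\n";
#endif
    return 0;
}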

What are the disadvantages to not using Unicode in Windows?

What are the disadvantages to not using Unicode on Windows?
By Unicode, I mean WCHAR and the wide API functions. (CreateWindowW, MessageBoxW, and so on)
What problems could I run into by not using this?
Your code won't be able to deal correctly with characters outside the currently selected codepage when dealing with system APIs.
Typical problems include unsupported characters being translated to question marks and an inability to process text with special characters, in particular files with "strange characters" in their names/paths.
Also, several newer APIs are present only in the "wide" version.
Finally, each API call involving text will be marginally slower, since the "A" versions of the APIs are normally just thin wrappers around the "W" APIs that convert the parameters to UTF-16 on the fly, so you have some overhead compared to a "plain" W call.
Nothing stops you from working in a narrow-character Unicode encoding (i.e. UTF-8) inside your application, but the Windows "A" APIs don't speak UTF-8, so you'd have to convert to UTF-16 and call the W versions anyway.
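As a rough sketch of that last point, assuming a Windows build (the Widen helper below is a made-up name, not a Windows API): keep UTF-8 in std::string inside the program and convert to UTF-16 only at the Win32 boundary.

#include <windows.h>
#include <string>

// Hypothetical helper: UTF-8 -> UTF-16 for the Win32 "W" APIs.
std::wstring Widen(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int chars = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), (int)utf8.size(),
                                    nullptr, 0);
    std::wstring wide(chars, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), (int)utf8.size(),
                        &wide[0], chars);
    return wide;
}

int main()
{
    std::string message = "Caf\xC3\xA9";               // "Café" as UTF-8 bytes
    MessageBoxW(nullptr, Widen(message).c_str(),       // UTF-16 at the boundary
                L"Demo", MB_OK);
    return 0;
}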
I believe the gist of the original question was: should I compile all my Windows apps with #define _UNICODE, and what's the downside if I don't?
My original reply was "Yeah, you should. We've moved past 8-bit ASCII, and '_UNICODE' is a reasonable default for any modern Windows code."
For Windows, I still believe that's reasonably good advice. But I've deleted my original reply, because I didn't realize until I re-read my own links how much "UTF-16 is quite a sad state of affairs" (as Matteo Italia eloquently put it).
For example:
http://utf8everywhere.org/
Microsoft has ... mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for narrow string WinAPI, one must compile her code with _UNICODE rather than _MBCS. Windows C++ programmers are educated that Unicode must be done with ‘widechars’. As a result of this mess, they are now among the most confused ones about what is the right thing to do about text.
I heartily recommend these three links:
The Absolute Minimum Every Software Developer Should Know about Unicode
Should UTF-16 Be Considered Harmful?
UTF-8 Everywhere
IMHO...

Spanish characters in C++ Windows/Mac/iOS

I'm having some issues with Spanish characters displaying in an iOS app. The code in question is all C++ and is shared between a Windows app and an iOS app. It is compiled on Windows using Visual Studio 2010 (character set: Multi-byte) and with Xcode 4.2 on the Mac.
Currently, the code is using char pointers, and my first thought was that I need to switch over to wchar_t pointers instead. However, I noticed that the Spanish characters I want to output display just fine in Windows using just char pointers. This made me think those characters are a part of the multi-byte character set and I don't need to go to all the trouble of updating everything to wchar_t until I'm ready to do some Japanese, Russian, Arabic, etc. translations.
Unfortunately, while the Spanish characters do display properly in the Windows app, they do not display right once they hit the Mac/iOS. Experimenting with wchar_t there, I see that they will display properly if I convert everything over. But what I don't understand, and I'm hoping someone can enlighten me as to the reason, is why the characters are perfectly valid on the Windows machine but display as gibberish (requiring wchar_t instead) in the Mac environment, with the same code?
Is Visual Studio doing something to my char pointers behind the scenes that the Mac is not? In other words, is the Microsoft environment simply being more forgiving of my architectural oversight in using char pointers instead of wchar_t?
Seeing as how I already know my answer is to convert from char pointers to wchar_t pointers, my real question then is "Why does the Mac require wchar_t but in Windows I can use char for the same characters?"
Thanks.
Mac and Windows use different codepages--they both have Spanish characters available, but they show up as different character values, so the same bytes will appear differently on each platform.
The best way to deal with localization in a cross-platform codebase is UTF8. UTF8 is supported natively in NSString -stringWithUTF8String: and in Windows Unicode applications by calling MultiByteToWideChar with CP_UTF8. In fact, since it's Unicode, you can even use the same technique to handle more complicated languages like Chinese.
Don't use wide characters in cross-platform code if you can help it. This gets complicated because wchar_t is actually 32 bits wide on OS X. In fact, it's wasteful of memory for that reason as well.
http://en.wikipedia.org/wiki/UTF-8
None of char, wchar_t, string or wstring have any encoding attached to them. They just contain whatever binary soup your compiler decides to interpret the source files as. You have three variables that could be off:
What your code contains (in the actual file, between the '"' characters, on a binary level).
What your compiler thinks this is. For example, you may have a UTF-8 source file, but the compiler could turn wchar_t[] literals into proper UCS-4. (I wish MSVC 2010 could do this, but as far as I know, it does not support UTF-8 at all.)
What your rendering API expects. On Windows, this is usually little-endian UTF-16 (as an LPWSTR pointer). For the old LPSTR APIs, it is usually the "current codepage", which could be anything as far as I know. iOS and Mac OS use UTF-16 internally I think, but they are very explicit about what they accept and return.
No class or encoding can help you if there is a mismatch between any of these.
In an IDE like Xcode or Eclipse, you can see the encoding of a file in its property sheet. In Xcode 4 this is the right-most pane; bring it up with cmd+alt+0 if it's hidden. If the characters look right in the code editor, the encoding is correct. A first step is to make sure that both Xcode and MSVC are interpreting the same source files the same way. Then you need to figure out what they are turned into in memory right before rendering. And then you need to ensure that both rendering APIs expect the same character set.
Or just move your strings into text files separate from your source code, in a well-defined encoding. UTF-8 is great for this, but anything that can encode all the necessary characters will work. Then convert your strings only for rendering (if necessary).
I just saw this answer which gives even more reasons for the latter option: https://stackoverflow.com/a/1866668/401925

Windows Codepage Interactions with Standard C/C++ filenames?

A customer is complaining that our code used to write files with Japanese characters in the filename but no longer works in all cases. We have always just used good old char * strings to represent filenames, so it came as a bit of a shock to me that it ever worked, and we haven't done anything I am aware of that should have made it stop working. I had them send me a file with an embedded filename in it exported from our software, and it looks like the strings use hex characters 82 and 83 as the first character of a double-byte sequence to represent the Japanese characters. Poking around online leads me to believe this is probably SHIFT_JIS and/or Windows codepage 932.
It looks to me like what is happening is that previously both fopen and ofstream::open accepted filenames in this codepage; now only fopen does. I've checked the Visual Studio fopen docs, and I see no hint of what makes an acceptable string to pass to fopen.
In the short run, I'm hoping someone can shed some light on the specific Windows fopen versus ofstream::open issue for me. In the long run, I'd really like to know the accepted way of opening Unicode (and other?) filenames in C++, on Windows, Linux, and OS X.
Edited to add: I believe that the opens that work are done in the "C" locale, whereas the ones that do not work are done in whatever the customer's default locale is. However, that has been the case for years now, and the old version of the program still works today on their system, so this seems a longshot for explaining the issue we are seeing.
Update: I sent off a small test program to the customer. It has verified that fopen works fine with the SHIFT_JIS filename, and std::ofstream does not. This is in Visual Studio 2005, and happened regardless of whether I used the default locale or the "C" locale.
I'm still interested if anyone has an explanation for this behavior (and why it mysteriously changed -- perhaps a service pack of VS2005?) and hoping to put together a comprehensive "best practices" for handling Unicode filenames in portable C++ code.
Functions like fopen or ofstream::open take the file name as char *, but that is interpreted as being in the system code page.
It means that the name can be Japanese represented as Shift-JIS (cp932), or Simplified Chinese (GBK/cp936), Korean, Arabic, Russian, you name it (as long as it matches the OS system code page).
It also means that the application can use Japanese file names only on a Japanese system.
Change the system code page and the application "stops working".
I suspect this is what happens here (there have been no big changes in this area of Windows since Windows 2000).
This is how you change the system code page: http://www.mihai-nita.net/article.php?artID=20050611a
In the long run you might consider moving to Unicode (and using _wfopen, wofstream).
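As a hedged sketch of that last suggestion (Windows-only code; _wfopen and the wchar_t* filename constructor of the MSVC fstreams are Microsoft extensions, not standard C++):

#include <cstdio>
#include <fstream>

int main()
{
    // A filename containing Japanese characters, held as UTF-16.
    const wchar_t* name = L"\u30c6\u30b9\u30c8.txt";   // "テスト.txt"

    // C: _wfopen is the wide-character counterpart of fopen in the Windows CRT.
    if (FILE* f = _wfopen(name, L"w")) {
        std::fputs("written via _wfopen\n", f);
        std::fclose(f);
    }

    // C++: MSVC's std::ofstream accepts a wchar_t* filename as an extension
    // (since C++17, a std::filesystem::path also works and is portable).
    std::ofstream out(name, std::ios::app);
    out << "written via ofstream with a wide filename\n";
    return 0;
}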
I'm not aware of any portable way of using Unicode filenames with the default system libraries, but there are some frameworks that provide portable functions, for example:
for C: glib uses filenames in UTF-8;
for C++: glibmm also uses filenames in UTF-8, requires glib;
for C++: boost can use wstring for filenames.
I'm pretty sure .NET/mono frameworks also do contain portable filesystem functions, but I don't know them.
Is somebody still watching this? I've just researched this question and found no answers anywhere, so I'll try to explain my findings here.
In VS2005 the fstream filename handling is the odd man out: it doesn't use the system default encoding, the one you get with GetACP and set in Control Panel / Region and Language / Administrative; instead it always uses CP 1252, I believe.
This can cause big confusion, and Microsoft has removed this quirk in later VS versions.
All workarounds for VS2005 have their drawbacks:
Convert your code to use Unicode everywhere.
Never open fstreams using narrow-character filenames; always convert them to Unicode using the system default encoding yourself, then use the wide-character filename open/constructor.
Retrieve the codepage using GetACP(), then do a matching setlocale: setlocale(LC_ALL, ("." + lexical_cast<string>(GetACP())).c_str())
I'm nearly certain that on Linux the filename string is a UTF-8 string (on the ext3 filesystem, for example, the only disallowed characters are the slash and the NUL byte), stored in a normal char *. The man page doesn't seem to mention character encoding, which is what leads me to believe it is the system standard of UTF-8. OS X likely does the same, since it comes from similar roots, but I am less sure about this.
You may have to set the thread locale to the system default locale.
See here for a possible reason for your problems:
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=100887
Mac OS X uses Unicode as its native character encoding. The basic string objects are CFString and NSString; they store their arrays of characters as Unicode.

Best way to design for localization of strings

This is kinda a general question, open for opinions. I've been trying to come up with a good way to design for localization of string resources for a Windows MFC application and related utilities. My wishlist is:
Must preserve string literals in code (as opposed to replacing with macro #define resource ID's), so that the messages are still readable inline
Must allow localized string resources (duh)
Must not impose additional run-time environment restrictions (eg: dependency on .NET, etc.)
Should have minimal obtrusion into existing code (the less modification the better)
Should be debuggable
Should generate resource files which are editable by common tools (ie: common format)
Should not use copy/paste comment blocks to preserve literal strings in code, or anything else which creates the potential for de-synchronization
Would be nice to allow static (compile-time) checking that every "notated" string is in the resource file(s)
Would be nice to allow cross-language resource string pooling (for components in various languages, eg: native C++ and .NET)
I have a way which fulfills all my wishlist to some extent except for static checking, but I have had to develop a bit of custom code to achieve it (and it has limitations). I'm wondering if anyone has solved this problem in a particularly good way.
Edit:
The solution I currently have looks like this:
ShowMessage( RESTRING( _T("Some string") ) );
ShowMessage( RESTRING( _T("Some string with variable %1"), sNonTranslatedStringVariable ) );
I then have a custom utility to parse out the strings from within the 'RESTRING' blocks and put them into a .resx file for localization, and a separate C# COM object to load them from localized resource files with fallback. If the C# object is not available (or cannot load), I fallback to the string in the code. The macro expands to a template class which calls the COM object and does the formatting, etc.
Anyway, I thought it would be useful to add what I have now for reference.
We use the English string as the ID.
If the lookup in the international resource object (loaded from the installed I18N DLL) fails, then we default to the ID string.
Code looks like:
doAction(I18N.get("Press OK to continue"));
As part of the build process we have a Perl script that parses all source for string constants. It builds a temp file of all strings in the application and then compares these against the resource strings in each locale to see if they exist. Any missing strings generate an e-mail to the appropriate translation team.
We can have multiple DLLs, one for each locale. The name of the DLL is based on RFC 3066:
language[_territory][.codeset][#modifier]
We try to extract the locale from the machine and be as specific as possible when loading the I18N DLL, but fall back to less specific locale variations if the more specific version is not present.
Example:
In the UK, if the locale were en_GB.UTF-8:
(I use the term DLL loosely, not in the specific Windows sense.)
First look for the I18N.en_GB.UTF-8 DLL. If this DLL does not exist, fall back to I18N.en_GB. If that does not exist, fall back to I18N.en. If that does not exist, fall back to I18N.default.
The only exception to this rule is:
Simplified Chinese (zh_CN) where the fallback is US English (en_US). If the machine does not support simplified Chinese then it is unlikely to support full Chinese.
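Purely as an illustration of that fallback order, here is a small sketch; FallbackNames is a made-up helper, and the actual loading (LoadLibrary, the zh_CN special case, etc.) is omitted.

#include <iostream>
#include <string>
#include <vector>

// Build the list of resource-module names to try, most specific first,
// for a locale identifier such as "en_GB.UTF-8".
std::vector<std::string> FallbackNames(std::string locale)
{
    std::vector<std::string> names;
    for (;;) {
        names.push_back("I18N." + locale);
        // Strip ".codeset" first, then "_territory", as described above.
        std::string::size_type cut = locale.find_last_of("._");
        if (cut == std::string::npos) break;
        locale.erase(cut);
    }
    names.push_back("I18N.default");
    return names;
}

int main()
{
    // Prints: I18N.en_GB.UTF-8, I18N.en_GB, I18N.en, I18N.default
    for (const std::string& n : FallbackNames("en_GB.UTF-8"))
        std::cout << n << "\n";
    return 0;
}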
The simple way is to only use string IDs in your code - no literal strings.
You can then produce different versions of the .rc file for each language and either create resource-only DLLs or simply different language builds.
There are a couple of shareware utils to help localise the .rc file, which handle resizing dialog elements for languages with longer words and warn about missing translations.
A more complicated problem is word order, when you have several values in a printf that must appear in a different order under a different language's grammar.
There are some extended printf classes on CodeProject that let you specify things like printf("word %1s and %2s", var1, var2) so you can switch %1s and %2s if necessary.
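Boost.Format (which comes up again further down this page) handles the same reordering problem with numbered placeholders, so a translated format string can move the arguments around without touching the code. A minimal sketch:

#include <boost/format.hpp>
#include <iostream>
#include <string>

int main()
{
    int count = 3;
    std::string what = "files";

    // English template: arguments in their natural order.
    std::string en = (boost::format("Deleted %1% %2%") % count % what).str();

    // A translated template can swap the placeholders; the code stays the same.
    std::string other = (boost::format("%2%: %1%") % count % what).str();

    std::cout << en << "\n" << other << "\n";
    return 0;
}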
I don't know much about how this is normally done on Windows, but the way localized strings are handled in Apple's Cocoa framework works pretty well. They have a very basic text-format file that you can send to a translator, and some preprocessor macros to retrieve the values from the files.
In your code, you'll see the strings in your native language, rather than as opaque IDs.
Since it is open for opinions, here is how I do it.
My localized text file is a simple tab delimited text file that can be loaded in Excel and edited.
The first column is for the define and each column to the right is a subsequent language, for example:
ID          ENGLISH   FRENCH   GERMAN
STRING_YES  YES       OUI      JA
STRING_NO   NO        NON      NEIN
Then in my makefile there is a custom build step that generates a strings.h file and a strings.dat. In my case it builds an enum list for the string IDs and then a binary file with offsets for the text. Since in my app the user can change the language at any time, I keep them all in memory, but you could easily have your preprocessor generate a different output file for each language if necessary.
The thing that I like about this design is that if any strings are missing I get a compile error, whereas if strings were looked up at runtime by name you might not find out about a missing string in a seldom-used part of the code until later.
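For illustration, here is roughly the kind of code such a build step might generate, with the table hard-coded instead of being read from strings.dat; all names are invented for this sketch.

#include <cstdio>

// Generated from the first column of the tab-delimited file.
enum StringId   { STRING_YES, STRING_NO, STRING_COUNT };
enum LanguageId { LANG_ENGLISH, LANG_FRENCH, LANG_GERMAN, LANG_COUNT };

// Generated table (the real build step would emit this, or offsets into a
// strings.dat binary, from the remaining columns).
static const char* const kStrings[STRING_COUNT][LANG_COUNT] = {
    /* STRING_YES */ { "Yes", "Oui", "Ja"   },
    /* STRING_NO  */ { "No",  "Non", "Nein" },
};

const char* GetString(StringId id, LanguageId lang)
{
    return kStrings[id][lang];
}

int main()
{
    // Referring to an ID that is missing from the table is a compile error,
    // which is the compile-time check the answer above is describing.
    std::printf("%s / %s\n", GetString(STRING_YES, LANG_FRENCH),
                             GetString(STRING_NO,  LANG_GERMAN));
    return 0;
}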
Your solution is quite similar to the Unix/Linux "gettext" solution. In fact, you would not need to write the extraction routines.
I'm not sure why you want the _RESTRING macro to handle multiple arguments. My code (using wxWidgets' support for gettext) looks like this: MyString.Format(_("Some string with variable %ls"), _("variable"));. That is to say, String::Format(...) gets two individually translated arguments. In hindsight, Boost.Format would have been better, but it too would allow boost::format(_("Some string with variable %1%")) % _("variable");
(We use the _() macro for brevity)
On one project I had localized into 10+ languages, I put everything that was to be localized into a single resource-only dll. At install time, the user selected which dll got installed with their application.
I only had to deliver the English dll to the localization team. They returned a localized dll to me for each language which I included in the build.
I know it's not perfect, but it worked.
You want an advanced utility that I've always wanted to write but never had the time to.
If you don't find such a tool, you may want to fall back on my CMsg() and CFMsg() wrapper classes, which make it very easy to pull strings from the resource table. (CFMsg even provides a FormatMessage one-liner wrapper.)
And yes, in the absence of the tool you're looking for, keeping a copy of the string in a comment is a good solution. Regarding desynchronization of the comment, remember that string literals are very rarely changed.
http://www.codeproject.com/KB/string/stringtable.aspx
BTW, native Win32 programs and .NET programs have a totally different resource storage management. You'll have a hard time finding a common solution for both.