How to properly localize a cross platform program?

How to properly localize a cross platform program? - c++

I am currently making a game engine that is eventually going to support all platforms. Currently I am working on the Windows support with the Win32 API. Reading the documentation, it suggests that I use wide strings/chars and the Unicode version of API functions so that my application can be localized. But if I use wide versions of everything (wcout wstring wchar_t etc.), I will have to make my entire game engine use those wide types. That also means that when working with other platforms, I will also have to use wide types, or I will have to convert between them.
My idea is that maybe my code will only be compiled with wide string types on Windows and be compiled with normal string types on other platforms perhaps with macro definitions. Is that the best option to do this? And how may I go about doing this?
I also don't really understand how unicode works in c++. If I set my system locale to English, then I will get a compiler warning from MSVC if I have any Chinese characters stored in a normal string type. However now I set my system locale to Chinese and enabled UTF-8 encoding, I get no compiler warnings if I store Unicode characters in normal strings. I also have no idea how unicode works on other platforms. Can somebody explain this for me?
Thanks in advance.

Related

None Unicode library and windows locale

I'm using(what seems to be) an ansi(or ascii??) dll library. I think it is such because the header file provided with the lib shows function using char*'s and LPSTR and LPCSTR and structs with char arrays.
This dll is loaded via ::LoadLibrary from a cpp/cli class library that wraps its functionality and exposes it to c#. A c# console app and various other class libs use this cli lib to perform operation.
I can make the cli assembly ether mutibyte or Unicode(which as far as I understand is the same in terms of language support) and c# apps are always Unicode.
This native dll is essentially a broker for a propriety back end server, it passes information back and froth from and to the server.
The issue I'm running into is that the native dll lib will only operate correctly for a particular language if the os locale, for none Unicode apps, its running in is set to that particular language.
I.e. if i want the app to correctly work with Chinese characters, that locale needs to be set. What I find hard to grasp is why does the locale matter for the broker. I understand that if the server is an ansi app if a user wanted to store none Unicode Chinese on it setting the locale on the server would make sense so it would in the client, but not in the middle man that just passes things along. Furthermore the whole thing is getting very confusing.
Is there away to pass Unicode to something like a char array in c++? Wold that even work in this scenario?
Here's a scenario I'm thinking about:
c# app gets url encoded string
c# app decodes the string and passes it to cli
cli somehow converts the String^(or should it be byte[] at
this point) to char[] and passes it to the native lib
Should this really be possible? In terms of memory layout it should, i mean char is just a byte no?
Am I approaching this the right way? is there there a better way to accomplish cross language support. Mind you the vendor is on record saying that there is no way to mix languages in the api, but that's not what I'm looking for. I just dont want to have to run an instance of the software on a separate os for each language i want to support.

What is confusing in this case, is that the DLL has a broken interface. Broken in the following sense: it does not support all of the Unicode codepoints. This is regardless of the type of parameters: char array is perfectly good for supporting all of unicode.
How do we know this? It is because, according to you, what it does depends on the system locale setting.
So, what to do? If the DLL source code is not under your control, you will not have it. You can, however, solve the problem of one ANSI codepage by setting the locale. It does not work for some languages.
Better would be to urge the DLL vendor to support unicode. Best encoding is, of course, UTF-8 - and this way it does not break existing code because the types LPCSTR remain the same.

I ended up using the approach described here:
Is it possible to set ANSI encoding per application in windows
This is what worked for me

What are the disadvantages to not using Unicode in Windows?

What are the disadvantages to not using Unicode on Windows?
By Unicode, I mean WCHAR and the wide API functions. (CreateWindowW, MessageBoxW, and so on)
What problems could I run into by not using this?

Your code won't be able to deal correctly with characters outside the currently selected codepage when dealing with system APIs1.
Typical problems include unsupported characters being translated to question marks, inability to process text with special characters, in particular files with "strange characters" in their names/paths.
Also, several newer APIs are present only in the "wide" version.
Finally, each API call involving text will be marginally slower, since the "A" versions of APIs are normally just thin wrappers around the "W" APIs, that convert the parameters to UTF-16 on the fly - so, you have some overhead in respect to a "plain" W call.
Nothing stops you to work in a narrow-characters Unicode encoding (=>UTF-8) inside your application, but Windows "A" APIs don't speak UTF-8, so you'd have to convert to UTF-16 and call the W versions anyway.

I believe the gist of the original question was "should I compile all my Windows apps with "#define _UNICODE", and what's the down side if I don't?
My original reply was "Yeah, you should. We've moved 8-bit ASCII, and '_UNICODE' is a reasonable default for any modern Windows code."
For Windows, I still believe that's reasonably good advice. But I've deleted my original reply. Because I didn't realize until I re-read my own links how much "UTF-16 is quite a sad state of affairs" (as Matteo Italia eloquently put it).
For example:
http://utf8everywhere.org/
Microsoft has ... mistakenly used ‘Unicode’ and ‘widechar’ as
synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be
set as the encoding for narrow string WinAPI, one must compile her
code with _UNICODE rather than _MBCS. Windows C++ programmers are
educated that Unicode must be done with ‘widechars’. As a result of
this mess, they are now among the most confused ones about what is the
right thing to do about text.
I heartily recommend these three links:
The Absolute Minimum Every Software Developer Should Know about Unicode
Should UTF-16 Be Considered Harmful?
UTF-8 Everywhere
IMHO...

Spanish characters in C++ Windows/Mac/iOS

I'm having some issues with spanish characters displaying in an iOS app. The code in question is all C++, and shared between both a Windows app and an iOS app. Compiled in Windows using Visual Studio 2010 (character set is Multi-byte). And Compiled using Xcode 4.2 on the Mac.
Currently, the code is using char pointers, and my first thought was that I need to switch over to wchar_t pointers instead. However, I noticed that the Spanish characters I want to output display just fine in Windows using just char pointers. This made me think those characters are a part of the multi-byte character set and I don't need to go to all the trouble of updating everything to wchar_t until I'm ready to do some Japanese, Russian, Arabic, etc. translations.
Unfortunatly, while the Spanish characters do display property in the Windows app, they do not display right once they hit the Mac/iOS. Experimenting with wchar_t there, I see that they will display properly if I convert everything over. But what I don't understand, and hoping someone can enlighten me as to the reason... is why are the characters perfectly valid on the Windows machine, same code, and dislaying as gibberish (requiring wchar_t instead) in the Mac environment?
Is visual studio doing something to my char pointers behind the scenes that the Mac is not doing? In other words, is the Microsoft environment simply being more forgiving to my architectural oversight when I used char pointers instead of wchar_t?
Seeing as how I already know my answer is to convert from char pointers to wchar_t pointers, my real question then is "Why does the Mac require wchar_t but in Windows I can use char for the same characters?"
Thanks.

Mac and Windows use different codepages--they both have Spanish characters available, but they show up as different character values, so the same bytes will appear differently on each platform.
The best way to deal with localization in a cross-platform codebase is UTF8. UTF8 is supported natively in NSString -stringWithUTF8String: and in Windows Unicode applications by calling MultiByteToWideChar with CP_UTF8. In fact, since it's Unicode, you can even use the same technique to handle more complicated languages like Chinese.
Don't use wide characters in cross-platform code if you can help it. This gets complicated because wchar_t is actually 32 bits wide on OS X. In fact, it's wasteful of memory for that reason as well.
http://en.wikipedia.org/wiki/UTF-8

None of char, wchar_t, string or wstring have any encoding attached to them. They just contain whatever binary soup your compiler decides to interpret the source files as. You have three variables that could be off:
What your code contains (in the actual file, between the '"' characters, on a binary level).
What your compiler thinks this is. For example, you may have a UTF-8 source file, but the compiler could turn wchar_t[] literals into proper UCS-4. (I wish MSVC 2010 could do this, but as far as I know, it does not support UTF-8 at all.)
What your rendering API expects. On Windows, this is usually Little-Endian UTF-16 (as a LPWCHAR pointer). For the old LPCHAR APIs, it is usually the "current codepage", which could be anything as far as I know. iOS and Mac OS use UTF-16 internally I think, but they are very explicit about what they accept and return.
No class or encoding can help you if there is a mismatch between any of these.
In an IDE like Xcode or Eclipse, you can see the encoding of a file in its property sheet. In Xcode 4, this is the right-most pane, bring it up with cmd+alt+0 if it's hidden. If the characters look right in the code editor, the encoding is correct. A first step is to make sure that both Xcode and MSVC are interpreting the same source files the same way. Then you need to figure what they are turned in into memory right before rendering. And then you need to ensure that both rendering APIs expect the same character set at all.
Or, just move your strings into text files separate from your source code, and in a well-defined encoding. UTF-8 is great for this, but everything will work that can encode all necessary characters. Then only translate your strings for rendering (if necessary).
I just saw this answer which gives even more reasons for the latter option: https://stackoverflow.com/a/1866668/401925

wopen calls when porting to Linux

I have an application which was developed under Windows, but for gcc. The code is mostly OS-independent, with very few classes which are Windows specific because a Linux port was always regarded as necessary.
The API, especially that which gets called as a direct result of user interaction, is using wide char arrays instead of char arrays (as a side note, I cannot change the API itself - at this point, std::wstring cannot be used). These are considered as encoded in UTF-16.
In some places, the code opens files, mostly using the windows-specific _wopen function call. The problem with this is there is no wopen-like substitute for Linux because Linux "only deals with bytes".
The question is: how do I port this code ? What if I wanted to open a file with the name "something™.log", how would I go about doing so in Linux ? Is a cast to char* sufficient, would the wide chars be picked up automatically based on the locale (probably not) ? Do I need to convert manually ? I'm a bit confused regarding this, perhaps someone could point me to some documentation regarding the matter.

The strategy I took on Mac hinges on the fact that Mac OS X uses utf-8 in all its file io POSIX api's.
I thus created a type "fschar" thats a char in windows non unicode builds, wchar_t in windows UNICODE builds and char (again) when building for Mac OS.
I pass around all file system strings using this type. String literals are encoded with wrappers (TEXT("literal")) to get the correct encoding - all my data files store utf-8 characters on disk that, on windows UNICODE builds, I MultiByteToWideChar to convert to utf16.

Linux does not support UTF16 filenames. It does however support UTF8 files and those can be opened using plain old fopen().
What you should do is convert you wide strings to UTF8.

Is TCHAR still relevant?

I'm new to Windows programming and after reading the Petzold book I wonder:
is it still good practice to use the TCHAR type and the _T() function to declare strings or if I should just use the wchar_t and L"" strings in new code?
I will target only Windows 2000 and up and my code will be i18n from the start up.

The short answer: NO.
Like all the others already wrote, a lot of programmers still use TCHARs and the corresponding functions. In my humble opinion the whole concept was a bad idea. UTF-16 string processing is a lot different than simple ASCII/MBCS string processing. If you use the same algorithms/functions with both of them (this is what the TCHAR idea is based on!), you get very bad performance on the UTF-16 version if you are doing a little bit more than simple string concatenation (like parsing etc.). The main reason are Surrogates.
With the sole exception when you really have to compile your application for a system which doesn't support Unicode I see no reason to use this baggage from the past in a new application.

I have to agree with Sascha. The underlying premise of TCHAR / _T() / etc. is that you can write an "ANSI"-based application and then magically give it Unicode support by defining a macro. But this is based on several bad assumptions:
That you actively build both MBCS and Unicode versions of your software
Otherwise, you will slip up and use ordinary char* strings in many places.
That you don't use non-ASCII backslash escapes in _T("...") literals
Unless your "ANSI" encoding happens to be ISO-8859-1, the resulting char* and wchar_t* literals won't represent the same characters.
That UTF-16 strings are used just like "ANSI" strings
They're not. Unicode introduces several concepts that don't exist in most legacy character encodings. Surrogates. Combining characters. Normalization. Conditional and language-sensitive casing rules.
And perhaps most importantly, the fact that UTF-16 is rarely saved on disk or sent over the Internet: UTF-8 tends to be preferred for external representation.
That your application doesn't use the Internet
(Now, this may be a valid assumption for your software, but...)
The web runs on UTF-8 and a plethora of rarer encodings. The TCHAR concept only recognizes two: "ANSI" (which can't be UTF-8) and "Unicode" (UTF-16). It may be useful for making your Windows API calls Unicode-aware, but it's damned useless for making your web and e-mail apps Unicode-aware.
That you use no non-Microsoft libraries
Nobody else uses TCHAR. Poco uses std::string and UTF-8. SQLite has UTF-8 and UTF-16 versions of its API, but no TCHAR. TCHAR isn't even in the standard library, so no std::tcout unless you want to define it yourself.
What I recommend instead of TCHAR
Forget that "ANSI" encodings exist, except for when you need to read a file that isn't valid UTF-8. Forget about TCHAR too. Always call the "W" version of Windows API functions. #define _UNICODE just to make sure you don't accidentally call an "A" function.
Always use UTF encodings for strings: UTF-8 for char strings and UTF-16 (on Windows) or UTF-32 (on Unix-like systems) for wchar_t strings. typedef UTF16 and UTF32 character types to avoid platform differences.

If you're wondering if it's still in practice, then yes - it is still used quite a bit. No one will look at your code funny if it uses TCHAR and _T(""). The project I'm working on now is converting from ANSI to unicode - and we're going the portable (TCHAR) route.
However...
My vote would be to forget all the ANSI/UNICODE portable macros (TCHAR, _T(""), and all the _tXXXXXX calls, etc...) and just assume unicode everywhere. I really don't see the point of being portable if you'll never need an ANSI version. I would use all the wide character functions and types directly. Preprend all string literals with a L.

I would still use the TCHAR syntax if I was doing a new project today. There's not much practical difference between using it and the WCHAR syntax, and I prefer code which is explicit in what the character type is. Since most API functions and helper objects take/use TCHAR types (e.g.: CString), it just makes sense to use it. Plus it gives you flexibility if you decide to use the code in an ASCII app at some point, or if Windows ever evolves to Unicode32, etc.
If you decide to go the WCHAR route, I would be explicit about it. That is, use CStringW instead of CString, and casting macros when converting to TCHAR (eg: CW2CT).
That's my opinion, anyway.

I would like to suggest a different approach (neither of the two).
To summarize, use char* and std::string, assuming UTF-8 encoding, and do the conversions to UTF-16 only when wrapping API functions.
More information and justification for this approach in Windows programs can be found in http://www.utf8everywhere.org.

The Introduction to Windows Programming article on MSDN says
New applications should always call the Unicode versions (of the API).
The TEXT and TCHAR macros are less useful today, because all applications should use Unicode.
I would stick to wchar_t and L"".

TCHAR/WCHAR might be enough for some legacy projects. But for new applications, I would say NO.
All these TCHAR/WCHAR stuff are there because of historical reasons. TCHAR provides a seemly neat way (disguise) to switch between ANSI text encoding (MBCS) and Unicode text encoding (UTF-16). In the past, people did not have an understanding of the number of characters of all the languages in the world. They assumed 2 bytes were enough to represent all characters and thus having a fixed-length character encoding scheme using WCHAR. However, this is no longer true after the release of Unicode 2.0 in 1996.
That is to say:
No matter which you use in CHAR/WCHAR/TCHAR, the text processing part in your program should be able to handle variable length characters for internationalization.
So you actually need to do more than choosing one from CHAR/WCHAR/TCHAR for programming in Windows:
If your application is small and does not involve text processing (i.e. just passing around the text string as arguments), then stick with WCHAR. Since it is easier this way to work with WinAPI with Unicode support.
Otherwise, I would suggest using UTF-8 as internal encoding and store texts in char strings or std::string. And covert them to UTF-16 when calling WinAPI. UTF-8 is now the dominant encoding and there are lots of handy libraries and tools to process UTF-8 strings.
Check out this wonderful website for more in-depth reading:
http://utf8everywhere.org/

Yes, absolutely; at least for the _T macro. I'm not so sure about the wide-character stuff, though.
The reason being is to better support WinCE or other non-standard Windows platforms. If you're 100% certain that your code will remain on NT, then you can probably just use regular C-string declarations. However, it's best to tend towards the more flexible approach, as it's much easier to #define that macro away on a non-windows platform in comparison to going through thousands of lines of code and adding it everywhere in case you need to port some library to windows mobile.

IMHO, if there's TCHARs in your code, you're working at the wrong level of abstraction.
Use whatever string type is most convenient for you when dealing with text processing - this will hopefully be something supporting unicode, but that's up to you. Do conversion at OS API boundaries as necessary.
When dealing with file paths, whip up your own custom type instead of using strings. This will allow you OS-independent path separators, will give you an easier interface to code against than manual string concatenation and splitting, and will be a lot easier to adapt to different OSes (ansi, ucs-2, utf-8, whatever).

The only reasons I see to use anything other than the explicit WCHAR are portability and efficiency.
If you want to make your final executable as small as possible use char.
If you don't care about RAM usage and want internationalization to be as easy as simple translation, use WCHAR.
If you want to make your code flexible, use TCHAR.
If you only plan on using the Latin characters, you might as well use the ASCII/MBCS strings so that your user does not need as much RAM.
For people who are "i18n from the start up", save yourself the source code space and simply use all of the Unicode functions.

TCHAR is not relevant anymore, since now we have UNICODE. You should use UTF-16 wchar_t* strings instead.
Windows APIs takes wchar_t* as strings, and it is UTF-16.

Just adding to an old question:
NO
Go start a new CLR C++ project in VS2010. Microsoft themselves use L"Hello World", 'nuff said.

TCHAR have a new meaning to port from WCHAR to CHAR.
https://learn.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page
Recent releases of Windows 10 have used the ANSI code page and -A
APIs as a means to introduce UTF-8 support to apps. If the ANSI code
page is configured for UTF-8, -A APIs operate in UTF-8.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js