Best way to design for localization of strings - c++

This is kind of a general question, open for opinions. I've been trying to come up with a good way to design for localization of string resources for a Windows MFC application and related utilities. My wishlist is:
Must preserve string literals in code (as opposed to replacing with macro #define resource ID's), so that the messages are still readable inline
Must allow localized string resources (duh)
Must not impose additional run-time environment restrictions (eg: dependency on .NET, etc.)
Should have minimal obtrusion into existing code (the less modification the better)
Should be debuggable
Should generate resource files which are editable by common tools (ie: common format)
Should not use copy/paste comment blocks to preserve literal strings in code, or anything else which creates the potential for de-synchronization
Would be nice to allow static (compile-time) checking that every "notated" string is in the resource file(s)
Would be nice to allow cross-language resource string pooling (for components in various languages, eg: native C++ and .NET)
I have a way which fulfills all my wishlist to some extent except for static checking, but I have had to develop a bit of custom code to achieve it (and it has limitations). I'm wondering if anyone has solved this problem in a particularly good way.
Edit:
The solution I currently have looks like this:
ShowMessage( RESTRING( _T("Some string") ) );
ShowMessage( RESTRING( _T("Some string with variable %1"), sNonTranslatedStringVariable ) );
I then have a custom utility to parse out the strings from within the 'RESTRING' blocks and put them into a .resx file for localization, and a separate C# COM object to load them from localized resource files with fallback. If the C# object is not available (or cannot load), I fallback to the string in the code. The macro expands to a template class which calls the COM object and does the formatting, etc.
Anyway, I thought it would be useful to add what I have now for reference.
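For illustration only, here is a stripped-down sketch of the fallback behaviour (a hypothetical stand-in, not the actual COM-backed implementation, and it ignores the %1-style formatting arguments):
// Hypothetical sketch: fall back to the in-code literal when the lookup fails.
#include <tchar.h>
#include <string>
typedef std::basic_string<TCHAR> tstring;
// Stub: the real implementation would query the loaded localized resources.
static bool LookupLocalized(const tstring& english, tstring& localized)
{
    (void)english; (void)localized;
    return false;   // pretend the resource is missing
}
class RestringHelper
{
public:
    explicit RestringHelper(const TCHAR* pszEnglish)
    {
        if (!LookupLocalized(pszEnglish, m_text))
            m_text = pszEnglish;   // fall back to the literal from the code
    }
    operator const TCHAR*() const { return m_text.c_str(); }
private:
    tstring m_text;
};
#define RESTRING(s) RestringHelper(s)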

We use the English string as the ID.
If the lookup from the international resource object (loaded from the installed I18N DLL) fails, we default to the ID string.
Code looks like:
doAction(I18N.get("Press OK to continue"));
As part of the build process we have a Perl script that parses all source for string constants. It builds a temp file of all strings in the application and then compares these against the resource strings in each locale to see if they exist. Any missing string generates an e-mail to the appropriate translation team.
We can have multiple DLLs, one for each locale. The name of the DLL is based on RFC 3066:
language[_territory][.codeset][#modifier]
We try and extract the locale from the machine and be as specific as possible when loading the I18N DLL, but fall back to less specific locale variations if the more specific version is not present.
Example:
In the UK: if the locale were en_GB.UTF-8
(I use the term dll loosely not in the specific windows sense).
First look for the I18N.en_GB.UTF-8 DLL. If this DLL does not exist, fall back to I18N.en_GB. If that does not exist, fall back to I18N.en. If that does not exist, fall back to I18N.default.
The only exception to this rule is:
Simplified Chinese (zh_CN) where the fallback is US English (en_US). If the machine does not support simplified Chinese then it is unlikely to support full Chinese.
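A minimal sketch of that fallback chain (module names and the use of LoadLibrary are illustrative; as noted above, "dll" is used loosely):
// Try progressively less specific I18N modules until one loads.
#include <windows.h>
#include <string>
#include <vector>
HMODULE LoadI18nDll(const std::string& locale)   // e.g. "en_GB.UTF-8"
{
    std::string noCodeset = locale.substr(0, locale.find('.'));        // en_GB
    std::string langOnly  = noCodeset.substr(0, noCodeset.find('_'));  // en
    std::vector<std::string> candidates;
    candidates.push_back("I18N." + locale);      // I18N.en_GB.UTF-8
    candidates.push_back("I18N." + noCodeset);   // I18N.en_GB
    candidates.push_back("I18N." + langOnly);    // I18N.en
    candidates.push_back("I18N.default");        // last resort
    for (size_t i = 0; i < candidates.size(); ++i)
    {
        HMODULE h = ::LoadLibraryA((candidates[i] + ".dll").c_str());
        if (h)
            return h;
    }
    return NULL;
}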

The simple way is to only use string IDs in your code - no literal strings.
You can then produce different versions of the .rc file for each language and either create resource-only DLLs or simply different language builds.
There are a couple of shareware utils to help localise the rc file which handle resizing dialog elements for languages with longer words and warn about missing translations.
A more complicated problem is word order, if you have several numbers in a printf which must be in a different order for different languages' grammar.
There are some extended printf classes on codeproject that let you specify things like printf("word %1s and %2s",var1,var2) so you can switch %1s and %2s if necessary.
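As a Boost alternative to those extended printf classes, boost::format supports reorderable positional arguments directly; a small sketch (the German wording is purely illustrative):
// %1% and %2% are positional, so a translation can swap their order.
#include <boost/format.hpp>
#include <iostream>
#include <string>
int main()
{
    std::string english = "%1% was copied to %2%";
    std::string german  = "%2% erhielt eine Kopie von %1%";   // illustrative translation
    std::cout << boost::format(english) % "file.txt" % "backup/" << std::endl;
    std::cout << boost::format(german)  % "file.txt" % "backup/" << std::endl;
    return 0;
}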

I don't know much about how this is normally done on Windows, but the way localized strings are handled in Apple's Cocoa framework works pretty well. They have a very basic text-format file that you can send to a translator, and some preprocessor macros to retrieve the values from the files.
In your code, you'll see the strings in your native language, rather than as opaque IDs.

Since it is open for opinions, here is how I do it.
My localized text file is a simple tab delimited text file that can be loaded in Excel and edited.
The first column is for the define and each column to the right is a subsequent language, for example:
ID ENGLISH FRENCH GERMAN
STRING_YES YES OUI YA
STRING_NO NO NON NEIN
Then in my makefile is a custom build step that generates a strings.h file and a strings.dat. In my case it builds an enum list for the string IDs and then a binary file with offsets for the text. Since in my app the user can change the language at any time, I have them all in memory, but you could easily have your pre-processor generate a different output file for each language if necessary.
The thing that I like about this design is that if any strings are missing then I would get a compile error whereas if strings were looked up at runtime then you might not know about a missing string in a seldom used part of the code until later.
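To illustrate the idea, here is a rough sketch of what the generated strings.h plus lookup could look like (names and the hard-coded table are illustrative; the real build step emits a binary strings.dat that is loaded into memory instead):
// Generated enum: referencing a string ID that was removed from the table
// becomes a compile error, which is the static check described above.
enum StringId
{
    STRING_YES,
    STRING_NO,
    STRING_COUNT               // sentinel: number of strings per language
};
enum Language { LANG_ENGLISH, LANG_FRENCH, LANG_GERMAN };
// Stand-in for the data loaded from strings.dat.
static const char* const g_strings[][STRING_COUNT] =
{
    /* ENGLISH */ { "YES", "NO"   },
    /* FRENCH  */ { "OUI", "NON"  },
    /* GERMAN  */ { "YA",  "NEIN" },
};
inline const char* GetString(Language lang, StringId id)
{
    return g_strings[lang][id];
}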

Your solution is quite similar to the Unix/Linux "gettext" solution. In fact, you would not need to write the extraction routines.
I'm not sure why you want the RESTRING macro to handle multiple arguments. My code (using wxWidgets' support for gettext) looks like this: MyString.Format(_("Some string with variable %ls"), _("variable"));. That is to say, String::Format(...) gets two individually translated arguments. In hindsight, Boost::Format would have been better, but it too would allow boost::format(_("Some string with variable %1")) % _("variable");
(We use the _() macro for brevity)

On one project I had localized into 10+ languages, I put everything that was to be localized into a single resource-only dll. At install time, the user selected which dll got installed with their application.
I only had to deliver the English dll to the localization team. They returned a localized dll to me for each language which I included in the build.
I know it's not perfect, but it worked.

You want an advanced utility that I've always wanted to write but never had the time to.
If you don't find such a tool, you may want to fall back on my CMsg() and CFMsg() wrapper classes, which allow you to very easily pull strings from the resource table. (CFMsg even provides a FormatMessage one-liner wrapper.)
And yes, in the absence of that tool you're looking for, keeping a copy of the string in comment is a good solution. Regarding desynchronisation of the comment, remember that string literals are very rarely changed.
http://www.codeproject.com/KB/string/stringtable.aspx
BTW, native Win32 programs and .NET programs have a totally different resource storage management. You'll have a hard time finding a common solution for both.

Related

Pass parameters when compiling a dll in c

Is it possible to pass parameters when compiling a project in VS?
I have a .dll and it has to be compiled for multiple countries. The country ID is needed in the code, which is why I need separate builds for every country. So, I was wondering if there is a way to pass the country ID as a parameter at compilation, rather than modifying the code every time I need to do a build for a certain country.
I will briefly describe the pros and cons of each approach mentioned above, although I would tend to use the locale-file approach unless there were a very strong case, or requirement, for obfuscation via separately compiled DLLs.
load the content that varies by country/language via a locale resource file; the code will remain the same but strings, formulae etc can all be loaded from a resource on the filesystem which has the relevant entries for the locale chosen.
Advantages: single codebase; multiple locale files; easy code maintenance; single release targets all locales that a resource exists for; can be easily expanded to new regions by addition of a simple resource/locale file
Disadvantages: requires external resource/locale file
use #define to wrap code so that each compilation route depends on a particular #define:
Advantages: releases are more secure, as all material is within a compiled dll;
Disadvantages: compilation is more complex as it requires parameterisation; addition of new locales means addition of code to the codebase; scope for errors to be introduced is multiplied by the number of locales to be supported + 1
Multiple configurations: this has the same issues and advantages as using #defines
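For the #define approach, a minimal sketch of how the country id could be injected at compile time (the macro name and value are illustrative):
// country.cpp -- build each country's DLL with a different definition, e.g.:
//   cl /DCOUNTRY_ID=44 /c country.cpp
// In Visual Studio the same define can live in C/C++ -> Preprocessor ->
// Preprocessor Definitions, selected per configuration or via property sheets.
#ifndef COUNTRY_ID
#error "COUNTRY_ID must be supplied on the compiler command line"
#endif
int GetCountryId()
{
    return COUNTRY_ID;   // baked into this particular build
}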

Non-Unicode library and Windows locale

I'm using (what seems to be) an ANSI (or ASCII??) DLL library. I think it is such because the header file provided with the lib shows functions using char*'s and LPSTR and LPCSTR and structs with char arrays.
This DLL is loaded via ::LoadLibrary from a C++/CLI class library that wraps its functionality and exposes it to C#. A C# console app and various other class libs use this CLI lib to perform operations.
I can make the CLI assembly either multibyte or Unicode (which as far as I understand is the same in terms of language support), and C# apps are always Unicode.
This native DLL is essentially a broker for a proprietary back-end server; it passes information back and forth, to and from the server.
The issue I'm running into is that the native DLL will only operate correctly for a particular language if the OS locale for non-Unicode apps (on the machine it's running on) is set to that particular language.
I.e. if I want the app to correctly work with Chinese characters, that locale needs to be set. What I find hard to grasp is why the locale matters for the broker. I understand that if the server is an ANSI app and a user wanted to store non-Unicode Chinese on it, setting the locale on the server would make sense, and so it would on the client, but not in the middle man that just passes things along. Furthermore, the whole thing is getting very confusing.
Is there a way to pass Unicode to something like a char array in C++? Would that even work in this scenario?
Here's a scenario I'm thinking about:
c# app gets url encoded string
c# app decodes the string and passes it to cli
CLI somehow converts the String^ (or should it be byte[] at this point?) to char[] and passes it to the native lib
Should this really be possible? In terms of memory layout it should; I mean, a char is just a byte, no?
Am I approaching this the right way? Is there a better way to accomplish cross-language support? Mind you, the vendor is on record saying that there is no way to mix languages in the API, but that's not what I'm looking for. I just don't want to have to run an instance of the software on a separate OS for each language I want to support.
What is confusing in this case is that the DLL has a broken interface. Broken in the following sense: it does not support all of the Unicode codepoints. This is regardless of the type of the parameters: a char array is perfectly good for supporting all of Unicode.
How do we know this? It is because, according to you, what it does depends on the system locale setting.
So, what to do? If the DLL source code is not under your control, you cannot fix it yourself. You can, however, solve the problem for one ANSI codepage at a time by setting the locale. It does not work for some languages.
Better would be to urge the DLL vendor to support Unicode. The best encoding is, of course, UTF-8; this way it does not break existing code, because types like LPCSTR remain the same.
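If the vendor (or the server behind the DLL) does accept UTF-8, the C++/CLI layer could convert its UTF-16 strings into a UTF-8 char buffer before calling the narrow interface; a sketch (it only helps if the receiving side really interprets the bytes as UTF-8 rather than as the ANSI codepage):
// Convert UTF-16 to UTF-8 so all codepoints survive a char* interface.
#include <windows.h>
#include <string>
std::string ToUtf8(const std::wstring& wide)
{
    if (wide.empty())
        return std::string();
    int bytes = ::WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                                      NULL, 0, NULL, NULL);
    std::string utf8(bytes, '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                          &utf8[0], bytes, NULL, NULL);
    return utf8;
}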
I ended up using the approach described here:
Is it possible to set ANSI encoding per application in windows
This is what worked for me

Open-source C++ scanning library

Rationale: In my day-to-day C++ code development, I frequently need to
answer basic questions such as who calls what in a very large C++ code
base that is frequently changing. But, I also need to have some
automated way to exactly identify what the code is doing around a
particular area of code. "grep" tools such as Cscope are useful (and
I use them heavily already), but are not C++-language-aware: They
don't give any way to identify the types and kinds of lexical
environment of a given use of a type or function in such a way that is
conducive to automation (even if said automation is limited to
"read-only" operations such as code browsing and navigation, but I'm
asking for much more than that below).
Question: Does there exist already an open-source C/C++-based library
(native, not managed, not Microsoft- or Linux-specific) that can
statically scan or analyze a large tree of C++ code, and can produce
result sets that answer detailed questions such as:
What functions are called by some supplied function?
What functions make use of this supplied type?
Ditto the above questions if C++ classes or class templates are involved.
The result set should provide some sort of "handle". I should be able
to feed that handle back to the library to perform the following types
of introspection:
What is the byte offset into the file where the reference was made?
What is the reference into the abstract syntax tree (AST) of that
reference, so that I can inspect surrounding code constructs? And
each AST entity would also have file path, byte-offset, and
type-info data associated with it, so that I could recursively walk
up the graph of callers or referrers to do useful operations.
The answer should meet the following requirements:
API: The API exposed must be one of the following:
C or C++ and probably is "C handle" or C++-class-instance-based
(and if it is, must be generic C or C++ code and not Microsoft- or
Linux-specific code constructs unless it is to meet specifics of
the given platform), or
Command-line standard input and standard output based.
C++ aware: Is not limited to C code, but understands C++ language
constructs in minute detail including awareness of inter-class
inheritance relationships and C++ templates.
Fast: Should scan large code bases significantly faster than
compiling the entire code base from scratch. This probably needs to
be relaxed, but only if Incremental result retrieval and Resilient
to small code changes requirements are fully met below.
Provide Result counts: I should be able to ask "How many results
would you provide for some request (and no, don't send me all of the
results)?" and get a response in less than 3 seconds, versus
having to retrieve all results for any given question. If it takes
too long to get that answer, then it wastes development time. This is
coupled with the next requirement.
Incremental result retrieval: I should be able to then ask "Give me
just the next N results of this request", and then a handle to the
result set so that I can ask the question repeatedly, thus
incrementally pulling out the results in stages. This means I
should not have to wait for the entire result set before seeing
some subset of all of the results. And that I can cancel the
operation safely if I have seen enough results. Reason: I need to
answer the question: "What is the build or development impact of
changing some particular function signature?"
Resilient to small code changes: If I change a header or source
file, I should not have to wait for the entire code base to be
rescanned, but only that header or source file
rescanned. Rescanning should be quick. E.g., don't do what cscope
requires you to do, which is to rescan the entire code base for
small changes. It is understood that if you change a header, then
scanning can take longer since other files that include that header
would have to be rescanned.
IDE Agnostic: Is text editor agnostic (don't make me use a specific
text editor; I've made my choice already, thank you!)
Platform Agnostic: Is platform-agnostic (don't make me only use it
on Linux or only on Windows, as I have to use both of those
platforms in my daily grind, but I need the tool to be useful on
both as I have code sandboxes on both platforms).
Non-binary: Should not cost me anything other than time to download
and compile the library and all of its dependencies.
Not trial-ware.
Actively Supported: Sending help requests to mailing lists or
associated forums is likely to get a response in less than 2 days.
Network agnostic: Databases the library builds should be able to be used directly on
a network from 32-bit and 64-bit systems, both Linux and Windows
interchangeably, at the same time, and do not embed hardcoded paths
to filesystems that would otherwise "root" the database to a
particular network.
Build environment agnostic: Does not require intimate knowledge of my build environment, with
the notable exception of possibly requiring knowledge of compiler
supplied CPP macro definitions (e.g. -Dmacro=value).
I would say that CLang Index is a close fit. However I don't think that it stores data in a database.
Anyway, the CLang framework offers what you actually need to build a tool tailored to your needs, if only because of its C, C++ and Objective-C parsing / indexing capabilities. And since it's provided as a set of reusable libraries... it was crafted to be built upon!
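For a flavour of what that looks like, here is a minimal libclang (the C API) sketch that parses one translation unit and prints every call expression, which is the raw material for "who calls what" queries (the file name and compiler flags are placeholders):
// Build against libclang, e.g. with -lclang.
#include <clang-c/Index.h>
#include <stdio.h>
static enum CXChildVisitResult visitor(CXCursor cursor, CXCursor parent, CXClientData data)
{
    (void)parent; (void)data;
    if (clang_getCursorKind(cursor) == CXCursor_CallExpr)
    {
        CXString name = clang_getCursorSpelling(cursor);
        unsigned line = 0, column = 0;
        clang_getSpellingLocation(clang_getCursorLocation(cursor), NULL, &line, &column, NULL);
        printf("call to %s at %u:%u\n", clang_getCString(name), line, column);
        clang_disposeString(name);
    }
    return CXChildVisit_Recurse;   // keep walking the AST
}
int main(void)
{
    const char* args[] = { "-std=c++11" };
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        index, "example.cpp", args, 1, NULL, 0, CXTranslationUnit_None);
    if (tu)
    {
        clang_visitChildren(clang_getTranslationUnitCursor(tu), visitor, NULL);
        clang_disposeTranslationUnit(tu);
    }
    clang_disposeIndex(index);
    return 0;
}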
I have to admit that I haven't used either because I work with a lot of Microsoft-specific code that uses Microsoft compiler extensions that I don't expect them to understand, but the two open source analyzers I'm aware of are Mozilla Pork and the Clang Analyzer.
If you are looking for results of code analysis (metrics, graphs, ...) why not use a tool (instead of API) to do that? If you can, I suggest you to take a look at Understand.
It's not free (there's a trial version) but I found it very useful.
Maybe Doxygen with GraphViz could be the answer to some of your constraints, but not all; for example, the analysis of Doxygen is not incremental.

What is the best way to split up utility functions in a library to maximize reusability?

I have a recurring problem with a statically linked library I've written (or in some cases, code was accumulated from open sources).
This library, MFC Toolbox Library by name, has a lot of free functions, classes, and so on which support MFC programming, Win32 API programming, as well as the venerable C-library and newer C++ standard library.
In short, this is a working library with tools that apply to my daily work, that I've accumulated over more than a decade, and is indispensable to our products. As such, it has a rich mixture of utilities and augmentations for all of these various technologies, and often internally mixes usage of all of these technologies to create further support.
For example, I have a String Utilities.h and String Utilities.cpp which provide a plethora of string-related free-functions and even a class or two.
And often I find that I have a pair of functions, one that works without need of MFC or its CStrings, and another sibling function that does need these things. For example:
////////////////////////////////////////////////////////////////////////
// Line Terminator Manipulation
////////////////////////////////////////////////////////////////////////
// AnsiToUnix() Convert Mac or PC style string to Unix style string (i.e. no CR/LF or CR only, but rather LF only)
// NOTE: in-place conversion!
TCHAR * AnsiToUnix(TCHAR * pszAnsi, size_t size);
template <typename T, size_t size>
T * AnsiToUnix(T (&pszBuffer)[size]) { return AnsiToUnix(pszBuffer, size); }
inline TCHAR * AnsiToUnix(Toolbox::AutoCStringBuffer & buffer) { return AnsiToUnix(buffer, buffer.size()); }
// UnixToAnsi() Converts a Unix style string to a PC style string (i.e. CR or LF alone -> CR/LF pair)
CString UnixToAnsi(const TCHAR * source);
As you can see, AnsiToUnix doesn't require a CString. Because Unix uses a single line feed as a line terminator, and Windows ANSI strings use CR+LF as a line terminator, I am guaranteed that the resulting string will fit within the original buffer space. But for the reverse conversion, the string is almost guaranteed to grow, adding an extra CR for every occurrence of a bare LF, and hence it is desirable to use a CString (or perhaps a std::string) to provide for the automatic growth of the string.
This is just one example, and in and of itself, is not too beastly to consider converting from CString to std::string to remove the dependency upon MFC from that portion of the library. However, there are other examples where the dependency is much more subtle, and the work greater to change it. Further, the code is well tested as is. If I go and try to remove all MFC dependencies, I am likely to introduce subtle errors to the code, which would potentially compromise our product, and exacerbate the amount of time needed on this essentially not-strictly-necessary task.
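For illustration, a sketch of what an MFC-free sibling of UnixToAnsi might look like (hypothetical name; this is not the code actually in the library):
// std::basic_string<TCHAR> grows on demand, so it can absorb the CR added
// before each bare LF (and the LF added after each bare CR).
#include <tchar.h>
#include <string>
std::basic_string<TCHAR> UnixToAnsiStd(const TCHAR* source)
{
    std::basic_string<TCHAR> result;
    for (const TCHAR* p = source; *p; ++p)
    {
        if (*p == _T('\r'))
        {
            result += _T('\r');
            result += _T('\n');
            if (*(p + 1) == _T('\n'))   // already a CR/LF pair: consume the LF
                ++p;
        }
        else if (*p == _T('\n'))
        {
            result += _T('\r');         // bare LF: supply the missing CR
            result += _T('\n');
        }
        else
            result += *p;
    }
    return result;
}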
The important thing I wanted to get across is that here we have a set of functions, all very related to one another (ANSI->UNIX, UNIX->ANSI), but where one side uses MFC, and the other only uses character arrays. So, if I am trying to provide a library header that is as reusable as possible, it is desirable to break out the functions which are all dependent on MFC into one header, and those which are not into another, so that it is easier to distribute said files to other projects which do not employ MFC (or whatever technology is in question: e.g. It would be desirable to have all functions which don't require Win32 headers - which are simply augmentations to C++, to have their own header, and etc.).
My question to all of you, is how do you manage these issues - Technology dependency vs. related functions all being in the same place?
How do you break down your libraries - divide things out? What goes with what?
Perhaps it is important to add my motivation: I would like to be able to publish articles and share code with others, but generally speaking, they tend to use portions of the MFC Toolbox Library, which themselves use other parts, creating a deep web of dependencies, and I don't want to burden the reader / programmer / consumer of these articles and code-projects with so much baggage!
I can certainly strip out just the parts needed for a given article or project, but that seems like a time-intensive and pointless endeavor. It would be much more sensible, to my mind, to clean up the library in such a way that I can more easily share things without dragging the entire library with me. i.e. Reorganize it once, rather than having to dig things out each and every time...
Here's another good example:
UINT GetPlatformGDILimit()
{
return CSystemInfo::IsWin9xCore() ? 0x7FFF : 0x7FFFFFFF;
}
GetPlatformGDILimit() is a fairly generic, utilitarian free function. It really doesn't have anything to do with CSystemInfo, other than as a client. So it doesn't belong in "SystemInfo.h". And it is just a single free-function - surely nobody would try to put it in its own header? I have placed it in "Win32Misc.h", which has an assortment of such things - free functions mostly which augment the Win32 API. Yet, this seemingly innocuous function is dependent upon CSystemInfo, which itself uses CStrings and a whole slew of other library functions to make it able to do its job efficiently, or in fewer lines of code, or more robustly, or all of the above.
But if I have a demo project that refers to one or two functions in the Win32Misc.h header, then I get into the bind of needing to either extract just the individual functions that project needs (and everything that those functions depends upon, and everything those entities depend upon, etc.) -- or I have to try to include the Win32Misc.h and its .cpp - which drags even more unwanted overhead with them (just in order for the demo project to compile).
So what rule of thumb do folks use to guide yourselves as to where to draw the line - what goes with what? How to keep C++ libraries from becoming a dependency tree from hell? ;)
Personally I'd break it down on functionality. String manipulation in one library. Integral types in another (except perhaps char; put that into the string lib).
I would certainly keep platform-dependent stuff away from non-platform-dependent stuff, and vendor-specific stuff away from the non-vendor-specific. This might require two or even three string libraries.
Perhaps you could use the paradigm "does it require MFC?" Anything that requires MFC should be split out. Then move on to "does it require Windows?" and again do some splitting, and so forth...
Without a doubt some projects will require that all libraries be compiled in VC++ and only run on Windows; that's just the way it goes. Other projects will happily compile on Linux using just a subset of the libraries, compilable with gcc.
DC
If you use only conformant types in your public interfaces, and keep the interfaces separated from the implementations, this becomes a non-issue.
Keep in mind that when you introduce a function like this:
std::string ProcessData();
...and put the source code for this in a module separate from the code that will call it (for example, in a DLL), you break the separate-interface-from-implementation edict. This is because the STL is a source code library, and every compiler that uses your library functions can and will have different implementations and different binary layouts for the utilities you use.
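One common workaround, sketched below with hypothetical names, is to export only a plain C-compatible surface from the DLL and rebuild the std::string in an inline wrapper on the caller's side, so each side sticks to its own STL:
// Exported from the DLL: no STL types cross the binary boundary.
#include <string>
extern "C" int ProcessDataRaw(char* buffer, int bufferSize);
// Header-only convenience wrapper, compiled entirely in the caller's module.
inline std::string ProcessData()
{
    char buffer[512] = { 0 };                         // size is arbitrary for the sketch
    int written = ProcessDataRaw(buffer, sizeof(buffer));
    return std::string(buffer, written > 0 ? written : 0);
}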
In a very vague answer, KISS is the best policy. However, it seems that the code has come too far and has reached the point of no return. This is unfortunate because what you would want to do is have separate libraries that are autonomous entities, meaning they don't depend on any outside stuff. You create an MFC helper functions library and another library for other helpers or whatever. Then you decide which ones you want and when. All the dependencies are within each library and they are stand-alone.
Then it just becomes a matter of which library to include or not.
Also, using conditional includes within header files works well if you want only certain things under certain scenarios. However, I'm still not entirely sure if I have interpreted the problem correctly.

How do you handle command line options and config files?

What packages do you use to handle command line options, settings and config files?
I'm looking for something that reads user-defined options from the command line and/or from config files.
The options (settings) should be dividable into different groups, so that I can pass different (subsets of) options to different objects in my code.
I know of boost::program_options, but I can't quite get used to the API. Are there light-weight alternatives?
(BTW, do you ever use a global options object in your code that can be read from anywhere? Or would you consider that evil?)
At Google, we use gflags. It doesn't do configuration files, but for flags, it's a lot less painful than using getopt.
#include <gflags/gflags.h>
DEFINE_string(server, "foo", "What server to connect to");
int main(int argc, char* argv[]) {
    google::ParseCommandLineFlags(&argc, &argv, true);
    if (!FLAGS_server.empty()) {   // DEFINE_string(server, ...) creates FLAGS_server
        Connect(FLAGS_server);
    }
}
You put the DEFINE_foo at the top of the file that needs to know the value of the flag. If other files also need to know the value, you use DECLARE_foo in them. There's also pretty good support for testing, so unit tests can set different flags independently.
For command lines and C++, I've been a fan of TCLAP: Templatized Command Line Argument Parser.
http://sourceforge.net/projects/tclap/
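A minimal TCLAP sketch, with illustrative argument names:
#include <tclap/CmdLine.h>
#include <iostream>
#include <string>
int main(int argc, char** argv)
{
    try
    {
        TCLAP::CmdLine cmd("Example tool", ' ', "1.0");
        TCLAP::ValueArg<std::string> nameArg("n", "name", "Name to greet", false, "world", "string");
        TCLAP::SwitchArg verboseArg("v", "verbose", "Chatty output", false);
        cmd.add(nameArg);
        cmd.add(verboseArg);
        cmd.parse(argc, argv);
        if (verboseArg.getValue())
            std::cout << "about to greet..." << std::endl;
        std::cout << "Hello, " << nameArg.getValue() << std::endl;
    }
    catch (TCLAP::ArgException& e)   // parse errors are reported here
    {
        std::cerr << "error: " << e.error() << " for arg " << e.argId() << std::endl;
    }
    return 0;
}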
Well, you're not going to like my answer. I use boost::program_options. The interface takes some getting used to, but once you have it down, it's amazing. Just make sure to do boatloads of unit testing, because if you get the syntax wrong you will get runtime errors.
And, yes, I store them in a singleton object (read-only). I don't think it's evil in that case. It's one of the few cases I can think of where a singleton is acceptable.
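For comparison, a bare-bones boost::program_options sketch (option names are illustrative); note that, as mentioned, a syntax mistake in the add_options() chain only shows up at runtime:
#include <boost/program_options.hpp>
#include <iostream>
#include <string>
namespace po = boost::program_options;
int main(int argc, char* argv[])
{
    po::options_description desc("Allowed options");
    desc.add_options()
        ("help,h", "show this help message")
        ("threads", po::value<int>()->default_value(1), "worker thread count")
        ("config", po::value<std::string>(), "path to a config file");
    po::variables_map vm;
    po::store(po::parse_command_line(argc, argv, desc), vm);
    po::notify(vm);
    if (vm.count("help"))
    {
        std::cout << desc << std::endl;
        return 0;
    }
    std::cout << "threads = " << vm["threads"].as<int>() << std::endl;
    return 0;
}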
If Boost is overkill for you, GNU Gengetopt is probably, too, but IMHO, it's a fun tool to mess around with.
And, I try to stay away from global options objects, I prefer to have each class read its own config. Besides the whole "Globals are evil" philosophy, it tends to end up becoming an ever-growing mess to have all of your configuration in one place, and also it's harder to tell what configuration variables are being used where. If you keep the configuration closer to where it's being used, it's more obvious what each one is for, and easier to keep clean.
(As to what I use, personally, for everything recently it's been a proprietary command line parsing library that somebody else at my company wrote, but that doesn't help you much, unfortunately)
I've been using TCLAP for a year or two now, but randomly I stumbled across ezOptionParser. ezOptionParser doesn't suffer from "it shouldn't have to be this complex"-syndrome the same way that other option parsers do.
I'm pretty impressed so far and I'll likely be using it going forward, specifically because it supports config files. TCLAP is a more sophisticated library, but the simplicity and extra features of ezOptionParser are very compelling.
Other perks from its website include (as of 0.2.0):
Pretty printing of parsed inputs for debugging.
Auto usage message creation in three layouts (aligned, interleaved or staggered).
Single header file implementation.
Dependent only on STL.
Arbitrary short and long option names (dash '-' or plus '+' prefixes not required).
Arbitrary argument list delimiters.
Multiple flag instances allowed.
Validation of required options, number of expected arguments per flag, datatype ranges, user defined ranges, membership in lists and case for string lists.
Validation criteria definable by strings or constants.
Multiple file import with comments.
Exports to file, either set options or all options including defaults when available.
Option parse index for order dependent contexts.
GNU getopt is pretty nice. If you want a C++ feel, consider getoptpp which is a wrapper around the native getopt.
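A bare-bones getopt_long sketch for reference (option names are illustrative):
#include <getopt.h>
#include <cstdio>
#include <cstdlib>
int main(int argc, char* argv[])
{
    static const struct option longOpts[] =
    {
        { "verbose", no_argument,       NULL, 'v' },
        { "output",  required_argument, NULL, 'o' },
        { NULL, 0, NULL, 0 }
    };
    int verbose = 0;
    const char* output = "out.txt";
    int c;
    while ((c = getopt_long(argc, argv, "vo:", longOpts, NULL)) != -1)
    {
        switch (c)
        {
        case 'v': verbose = 1; break;
        case 'o': output = optarg; break;
        default:  std::exit(1);      // unknown option
        }
    }
    if (verbose)
        std::printf("writing to %s\n", output);
    return 0;
}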
As far as the configuration file is concerned, you should try to make it as simple as possible so that parsing is easy. If you are a bit adventurous, you might want to use yacc & lex, but that would really be overkill for small apps.
I also would like to suggest that you support both config files and command-line options in your application. Config files are better for those options which change less frequently. Command-line options are good when you want to pass immediate, changing arguments (typically when you are creating an app which will be called by some other program).
If you are working with Visual Studio 2005 on x86 and x64 Windows, there are some good command line parsing utilities in the SimpleLibPlus library. I have used it and found it very useful.
Not sure about command line argument parsing. I have not needed very rich capabilities in that area and have generally rolled my own to save adding more dependencies to my software. Depending upon what your needs are you may or may not want to try this route. The C++ programs I have written are generally not invoked from the command line.
On the other hand, for a config file you really can't beat an XML based format. It's readable, extensible, structured, etc... :) Plus there are lots of XML parsers out there. Despite the fact it is a C library, I tend to use libxml2 from xmlsoft.org.
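A tiny libxml2 reading sketch (the config.xml layout and element names are purely illustrative):
// Reads a flat file like <config><threads>4</threads><logfile>app.log</logfile></config>
#include <libxml/parser.h>
#include <libxml/tree.h>
#include <cstdio>
int main()
{
    xmlDocPtr doc = xmlReadFile("config.xml", NULL, 0);
    if (!doc)
        return 1;
    xmlNodePtr root = xmlDocGetRootElement(doc);
    for (xmlNodePtr node = root ? root->children : NULL; node; node = node->next)
    {
        if (node->type != XML_ELEMENT_NODE)
            continue;
        xmlChar* value = xmlNodeGetContent(node);
        std::printf("%s = %s\n", (const char*)node->name, (const char*)value);
        xmlFree(value);
    }
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}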
Try Apache Ant. Its primary usage is Java projects, but there isn't anything Java about it, and it's usable for almost anything.
Usage is fairly simple and you've got a lot of community support too. It's really good at doing things the way you're asking.
As for global options in code, I think they're quite necessary and useful. Don't misuse them, though.