Writing unicode C++ source code

Writing unicode C++ source code - c++

I saw on the project properties on Visual Studio 2012 that you can select the Character set for your application.
I use Unicode Character set.
What is the problem with Multi-byte character set? Or better, why should I use the Unicode?
Take for example this piece of code from a DLL that I am doing
RECORD_API int startRecording(
char *cam_name, // Friendly video device name
char *time, // Max time for recording
char *f_width, // Frame width
char *f_height, // Frame height
char *file_path) // Complete output file path
{
...
}
A lot of Unicode functions from Windows.h header use wchar_t parameters; should I use wchar_t also for my functions parameters?
Should I always explicit the W functions (example: ShellExecuteW) ?

First, regardless of what the interface says, the question isn't
Unicode or not, but UTF-16 or UTF-8. Practically speaking, for
external data, you should only use UTF-8. Internally, it
depends on what you are doing. Conversion of UTF-8 to UTF-16 is
an added complication, but for more complex operations, it may
be easier to work in UTF-16. (Although the differences between
UTF-8 and UTF-16 aren't enormous. To reap any real benefits,
you'd have to use UTF-32, and even then...)
In practice, I would avoid the W functions completely, and
always use char const* at the system interface level. But
again, it depends on what you are doing. This is just a general
guideline. For the rest, I'd stick with std::string unless
there was some strong reason not to.

You don't need to explicitly call, the ..W version of a function as this should already be covered by the include files and the settings that you use. So if you compile for Unicodesupport, then the W version of your system call will be used, otherwise the A.
Personally I would only compile for Unicode if you can really test it. At least you shouldn't assume that your application really can work properly in all cases, just because you compiled for it. Compiling for it is only the first step, but of course, you must consequently use the appropriate types and test your code, to be sure that there are no effects you may not have noticed.

Related

Make Windows32 API members expand to A, or Change Character set in vs2019 (c++)

My problem is as follows. When I pass in strings into the methods/classes of the win32 api, I get the Problem, that the type "const char*" cannot be used to assign variables of type LPCWSTR. I than made a helper method, that converts const char* manually to LPCWSTR. In the most cases, this did the trick, but In the CreateWindow() function the same error remains.
I then read online, that to easily avoid this problem, one could change the Character set to UTF-8, but soon found out that vs2019 does not have this setting, where it was 2017.
What I want to know, basically, is whether there is a way change the character set in vs2019, or a way to manually force those Methods to expand to the A type instead of W by default (CreateWindow should automatically expand to CreateWindowA, instead of expanding to CreateWindowW).

CreateWindow expands to CreateWindowW if UNICODE is defined during compiling, otherwise it expands to CreateWindowA. Same with TCHAR (wchar_t/char, respectively) and all other W/A-based APIs.
So, either set the project's character set to ANSI/MBCS, or you can simply #undef UNICODE where needed. Prior to Windows 10 version 1903 (build 18362), the A APIs simply do not support UTF-8 at all. But since then, you can opt in to enable UTF-8 support via the application manifest.
That being said, you should not rely on the TCHAR-based APIs if your string data is not using TCHAR to begin with.
If you are working with char data specifically, use the A APIs directly (CreateWindowA, etc), unless your data is UTF-8 (or different than the user's locale), in which case convert it to UTF-16 using MultiByteToWideChar or equivalent and then call the W APIs directly.
If you are working with wchar_t data specifically, use the W APIs directly (CreateWindowW, etc).

C++ C2440 Error, cannot convert from 'std::_String_iterator<_Elem,_Traits,_Alloc>' to 'LPCTSTR'

Unfortunately I have been tasked with compiling an old C++ DLL so that it will work on Win 7 64 Bit machines. I have zero C++ experience. I have muddled through other issues, but this one has stumped me and i have not found a solution elsewhere. The following code is throwing a C2440 compiler error.
The error: "error C2440: '' : cannot convert from 'std::_String_iterator<_Elem,_Traits,_Alloc>' to 'LPCTSTR'"
the code:
#include "StdAfx.h"
#include "Antenna.h"
#include "MyTypes.h"
#include "objbase.h"
const TCHAR* CAntenna::GetMdbPath()
{
if(m_MdbPath.size() <= 0) return NULL;
return LPCTSTR(m_MdbPath.begin());
}
"m_MdbPath" is defined in the Antenna.h file as
string m_MdbPath;
Any assistance or guidance someone could provide would be extremely helpful. Thank you in advance. I am happy to provide additional details on the code if needed.

std::string has a .c_str() member function that should accomplish what you're looking for. It will return a const char* (const wchar_t* with std::wstring). std::string also has an empty() member function and I recommend using nullptr instead of the NULL macro.
const TCHAR* CAntenna::GetMdbPath()
{
if(m_MdbPath.empty()) return nullptr;
return m_MdbPath.c_str();
}

The solution:
const TCHAR* CAntenna::GetMdbPath()
{
if(m_MdbPath.size() <= 0) return NULL;
return m_MdbPath.c_str();
}
will work if you can guarantee that your program is using the MBCS character set option (or if the Character Set option is set to Not Setif you're using Visual Studio), since a std::string character type will be the same as the TCHAR type.
However, if your build is UNICODE, then a std::string character type will not be the same as the TCHAR type, since TCHAR is defined as a wide character, not a single-byte character. Thus returning c_str() will give a compiler error. So your original code, plus the tentative fix of returning std::string::c_str() may work, but it is technically wrong with respect to the types being used.
You should be working directly with the types presented to your function -- if the type is TCHAR, then you should be using TCHAR based types, but again, a TCHAR is a type that changes definition depending on the build type.
As a matter of fact, the code will fail miserably at runtime if it were a UNICODE build and to fix the compiler error(s), you casted to an LPCTSTR to keep the compiler quiet. You cannot cast a string pointer type to another string pointer type. Casting is not converting, and merely casting will not convert a narrow string into a wide string and vice-versa. You will wind up with at the least, odd characters being used or displayed, and worse, a program with random weird behavior and crashes.
In addition to this you never know when you really will need to build a UNICODE version of the DLL, since MBCS builds are becoming more rare. Going by your original post, the DLL's usage of TCHAR indicates that yes, this DLL may (or even has) been built for UNICODE.
There are a few alternate solutions that you can use. Note that some of these solutions may require you to make additional coding changes beyond the string types. If you're using C++ stream objects, they may also need to be changed to match the character type (for example, std::ostream and std::wostream)
Solution 1. Use a typedef, where the string type to use depends on the build type.
For example:
#ifdef UNICODE
typedef std::wstring tchar_string;
#else
typedef std::string tchar_string;
#endif
and then use tchar_string instead of std::string. Switching between MBCS and UNICODE builds will work without having to cast strin types, but requires coding changes.
Solution 2. Use a typedef to use TCHAR as the underlying character type in the std::basic_string template.
For example:
typedef std::basic_string<TCHAR> tchar_string;
And use tchar_string in your application. You get the same public interface as std::(w)string and this allows you to switch between MBCS and UNICODE build seamlessly (with respect to the string types).
Solution 3. Go with UNICODE builds and drop MBCS builds altogether
If the application you're building is a new one, then UNICODE is the default option (at least in Visual Studio) when creating a new application. Then all you really need to do is use std::wstring.
This is sort of the opposite of the original solution given of making sure you use MBCS, but IMO this solution makes more sense if there is only going to be one build type. MBCS builds are becoming more rare, and basically should be used only for legacy apps.

What happens to the non-unicode API functions when I define UNICODE and/or _UNICODE?

In MS Visual Studio, when you do not set the character set, the likes of AfxMessageBox() (and countless other API functions) will happily accept a CStringA argument. But the moment you set the character set to Unicode, what appear to be the very same functions will only accept CStringW arguments.
Now this is precisely what the documentation says should happen... but...
where exactly did those non-Unicode API functions go? Are they still there to be linked to under other names (AfxMessageBoxA() perhaps?). By what magic does one API disappear and another one appear in its place... or alternatively... by what mischievous hacker trick can one make them reappear? And if it is possible to make them reappear in the presence of Unicode, should one (judiciously) use such hacker mischief?

The declaration of AfxMessageBox() in afxwin.h is:
int AFXAPI AfxMessageBox(LPCTSTR lpszText, UINT nType = MB_OK,
UINT nIDHelp = 0);
It is LPCTSTR that adapts the string type. If you compile with UNICODE in effect then it is an alias for const wchar_t*. Without it is const char*. There is no AfxMessageBoxA() version.
This is very different from the way the winapi functions work, necessarily so since this is a C++ function that mangles differently. Technically they could have provided another overload of the function, they didn't. You'll also have a different link demand, you need to link the non-Unicode version of the MFC library to keep the linker happy. Notable is that it is deprecated and no longer ships with recent VS editions, but still available (right now) as a separate download.
This should answer your question, it doesn't go anywhere, it simply doesn't exist. Mixing cannot work, you'll need A2W() to convert the string. You could of course simply write your own overload if necessary.

Should string encoding for library conform to Unicode or flexible?

I am created a library in C++ which exposes c style interface APIs. Some of the arguments are string so they would be char *. Now I know they should be all Unicode but because it is a library I don't think I want to force users to use decide or not. Ideally I thought it would be best to use TCHAR so I can build it either way for unicode code and ASCII users. Than I read this and it opposes the idea in general.
As an example of API, the strings are filenames or error messages like below.
void LoadSomeFile(char * fileName );
const char * GetErrorMsg();
I am using c++ and STL. There is this debate of std::string vs std::wstring as well.
Personally I really like MFC's CString class which takes care of all this nicely but that means I have to use MFC just for its string class.
Now I think TCHAR is probably the best solution for me but do I have to use CString (internally) for that to work? Can I use it with STL string? As far as I can see, it is either string or wstring there.

The TCHAR type is an unfortunate design choice that has thankfully been left behind us. Nobody has to use TCHAR any more, thank goodness. The Unicode choice has been made for us as well: Unicode is the only sane choice going forwards.
The question is, is your library Windows-only? Or is it portable?
If your library is portable, then the typical choice is char * or std::string with UTF-8 encoded strings. For more information, see UTF-8 Everywhere. The summary is that wchar_t is UTF-16 on Windows but UTF-32 everywhere else, which makes it almost useless for cross-platform programming.
If your library runs on Win32 only, then you may feel free to use wchar_t instead. On Windows, wchar_t is UTF-16.
Don't use both, it will make your code and API bloated and difficult to read. TCHAR is a hack for supporting the Win32 API and migrating to Unicode.

WChars, Encodings, Standards and Portability

The following may not qualify as a SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, "Do I understand the C standard correctly and is this the right way to go about things?"
I would like to ask for clarification, confirmation and corrections on my understanding of character handling in C (and thus C++ and C++0x). First off, an important observation:
Portability and serialization are orthogonal concepts.
Portable things are things like C, unsigned int, wchar_t. Serializable things are things like uint32_t or UTF-8. "Portable" means that you can recompile the same source and get a working result on every supported platform, but the binary representation may be totally different (or not even exist, e.g. TCP-over-carrier pigeon). Serializable things on the other hand always have the same representation, e.g. the PNG file I can read on my Windows desktop, on my phone or on my toothbrush. Portable things are internal, serializable things deal with I/O. Portable things are typesafe, serializable things need type punning. </preamble>
When it comes to character handling in C, there are two groups of things related respectively to portability and serialization:
wchar_t, setlocale(), mbsrtowcs()/wcsrtombs(): The C standard says nothing about "encodings"; in fact, it is entirely agnostic to any text or encoding properties. It only says "your entry point is main(int, char**); you get a type wchar_t which can hold all your system's characters; you get functions to read input char-sequences and make them into workable wstrings and vice versa.
iconv() and UTF-8,16,32: A function/library to transcode between well-defined, definite, fixed encodings. All encodings handled by iconv are universally understood and agreed upon, with one exception.
The bridge between the portable, encoding-agnostic world of C with its wchar_t portable character type and the deterministic outside world is iconv conversion between WCHAR-T and UTF.
So, should I always store my strings internally in an encoding-agnostic wstring, interface with the CRT via wcsrtombs(), and use iconv() for serialization? Conceptually:
my program
<-- wcstombs --- /==============\ --- iconv(UTF8, WCHAR_T) -->
CRT | wchar_t[] | <Disk>
--- mbstowcs --> \==============/ <-- iconv(WCHAR_T, UTF8) ---
|
+-- iconv(WCHAR_T, UCS-4) --+
|
... <--- (adv. Unicode malarkey) ----- libicu ---+
Practically, that means that I'd write two boiler-plate wrappers for my program entry point, e.g. for C++:
// Portable wmain()-wrapper
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>
std::vector<std::wstring> parse(int argc, char * argv[]); // use mbsrtowcs etc
int wmain(const std::vector<std::wstring> args); // user starts here
#if defined(_WIN32) || defined(WIN32)
#include <windows.h>
extern "C" int main()
{
setlocale(LC_CTYPE, "");
int argc;
wchar_t * const * const argv = CommandLineToArgvW(GetCommandLineW(), &argc);
return wmain(std::vector<std::wstring>(argv, argv + argc));
}
#else
extern "C" int main(int argc, char * argv[])
{
setlocale(LC_CTYPE, "");
return wmain(parse(argc, argv));
}
#endif
// Serialization utilities
#include <iconv.h>
typedef std::basic_string<uint16_t> U16String;
typedef std::basic_string<uint32_t> U32String;
U16String toUTF16(std::wstring s);
U32String toUTF32(std::wstring s);
/* ... */
Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++, together with a well-defined I/O interface to UTF using iconv? (Note that issues like Unicode normalization or diacritic replacement are outside the scope; only after you decide that you actually want Unicode (as opposed to any other coding system you might fancy) is it time to deal with those specifics, e.g. using a dedicated library like libicu.)
Updates
Following many very nice comments I'd like to add a few observations:
If your application explicitly wants to deal with Unicode text, you should make the iconv-conversion part of the core and use uint32_t/char32_t-strings internally with UCS-4.
Windows: While using wide strings is generally fine, it appears that interaction with the console (any console, for that matter) is limited, as there does not appear to be support for any sensible multi-byte console encoding and mbstowcs is essentially useless (other than for trivial widening). Receiving wide-string arguments from, say, an Explorer-drop together with GetCommandLineW+CommandLineToArgvW works (perhaps there should be a separate wrapper for Windows).
File systems: File systems don't seem to have any notion of encoding and simply take any null-terminated string as a file name. Most systems take byte strings, but Windows/NTFS takes 16-bit strings. You have to take care when discovering which files exist and when handling that data (e.g. char16_t sequences that do not constitute valid UTF16 (e.g. naked surrogates) are valid NTFS filenames). The Standard C fopen is not able to open all NTFS files, since there is no possible conversion that will map to all possible 16-bit strings. Use of the Windows-specific _wfopen may be required. As a corollary, there is in general no well defined notion of "how many characters" comprise a given file name, as there is no notion of "character" in the first place. Caveat emptor.

Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++
No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost everywhere and work exclusively with wchar_t (not necessarily internally, but at all interfaces to the system). For example, if you start with
int main(int argc, char** argv)
you have already lost Unicode support for command line arguments. You have to write
int wmain(int argc, wchar_t** argv)
instead, or use the GetCommandLineW function, none of which is specified in the C standard.
More specifically,
any Unicode-capable program on Windows must actively ignore the C and C++ standard for things like command line arguments, file and console I/O, or file and directory manipulation. This is certainly not idiomatic. Use the Microsoft extensions or wrappers like Boost.Filesystem or Qt instead.
Portability is extremely hard to achieve, especially for Unicode support. You really have to be prepared that everything you think you know is possibly wrong. For example, you have to consider that the filenames you use to open files can be different from the filenames that are actually used, and that two seemingly different filenames may represent the same file. After you create two files a and b, you might end up with a single file c, or two files d and e, whose filenames are different from the file names you passed to the OS. Either you need an external wrapper library or lots of #ifdefs.
Encoding agnosticity usually just doesn't work in practice, especially if you want to be portable. You have to know that wchar_t is a UTF-16 code unit on Windows and that char is often (bot not always) a UTF-8 code unit on Linux. Encoding-awareness is often the more desirable goal: make sure that you always know with which encoding you work, or use a wrapper library that abstracts them away.
I think I have to conclude that it's completely impossible to build a portable Unicode-capable application in C or C++ unless you are willing to use additional libraries and system-specific extensions, and to put lots of effort in it. Unfortunately, most applications already fail at comparatively simple tasks such as "writing Greek characters to the console" or "supporting any filename allowed by the system in a correct manner", and such tasks are only the first tiny steps towards true Unicode support.

I would avoid the wchar_t type because it's platform-dependent (not "serializable" by your definition): UTF-16 on Windows and UTF-32 on most Unix-like systems. Instead, use the char16_t and/or char32_t types from C++0x/C1x. (If you don't have a new compiler, typedef them as uint16_t and uint32_t for now.)
DO define functions to convert between UTF-8, UTF-16, and UTF-32 functions.
DON'T write overloaded narrow/wide versions of every string function like the Windows API did with -A and -W. Pick one preferred encoding to use internally, and stick to it. For things that need a different encoding, convert as necessary.

The problem with wchar_t is that encoding-agnostic text processing is too difficult and should be avoided. If you stick with "pure C" as you say, you can use all of the w* functions like wcscat and friends, but if you want to do anything more sophisticated then you have to dive into the abyss.
Here are some things that much harder with wchar_t than they are if you just pick one of the UTF encodings:
Parsing Javascript: Identifers can contain certain characters outside the BMP (and lets assume that you care about this kind of correctness).
HTML: How do you turn 𐀀 into a string of wchar_t?
Text editor: How do you find grapheme cluster boundaries in a wchar_t string?
If I know the encoding of a string, I can examine the characters directly. If I don't know the encoding, I have to hope that whatever I want to do with a string is implemented by a library function somewhere. So the portability of wchar_t is somewhat irrelevant as I don't consider it an especially useful data type.
Your program requirements may differ and wchar_t may work fine for you.

Given that iconv is not "pure standard C/C++", I don't think you are satisfying your own specifications.
There are new codecvt facets coming with char32_t and char16_t so I don't see how you can be wrong as long as you are consistent and pick one char type + encoding if the facets are here.
The facets are described in 22.5 [locale.stdcvt] (from n3242).
I don't understand how this doesn't satisfy at least some of your requirements:
namespace ns {
typedef char32_t char_t;
using std::u32string;
// or use user-defined literal
#define LIT u32
// Communicate with interface0, which wants utf-8
// This type doesn't need to be public at all; I just refactored it.
typedef std::wstring_convert<std::codecvt_utf8<char_T>, char_T> converter0;
inline std::string
to_interface0(string const& s)
{
return converter0().to_bytes(s);
}
inline string
from_interface0(std::string const& s)
{
return converter0().from_bytes(s);
}
// Communitate with interface1, which wants utf-16
// Doesn't have to be public either
typedef std::wstring_convert<std::codecvt_utf16<char_T>, char_T> converter1;
inline std::wstring
to_interface0(string const& s)
{
return converter1().to_bytes(s);
}
inline string
from_interface0(std::wstring const& s)
{
return converter1().from_bytes(s);
}
} // ns
Then your code can use ns::string, ns::char_t, LIT'A' & LIT"Hello, World!" with reckless abandon, without knowing what's the underlying representation. Then use from_interfaceX(some_string) whenever it's needed. It doesn't affect the global locale or streams either. The helpers can be as clever as needed, e.g. codecvt_utf8 can deal with 'headers', which I assume is Standardese from tricky stuff like the BOM (ditto codecvt_utf16).
In fact I wrote the above to be as short as possible but you'd really want helpers like this:
template<typename... T>
inline ns::string
ns::from_interface0(T&&... t)
{
return converter0().from_bytes(std::forward<T>(t)...);
}
which give you access to the 3 overloads for each [from|to]_bytes members, accepting things like e.g. const char* or ranges.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js