I have a C++ Native WinAPI application that strictly uses Unicode functions and data types. Ie, CreateWindowW(), SendMessageW(), wstring, WCHAR, etc. Now I intend to expand my application to use SQLite3.
My Problem: The SQLite3 library is ANSI. Which means I have to use char* as most function parameters.
Are there any limitations or negative impacts from using ANSI Functions in a Unicode Application?
If there are what might these impacts be?
SQLite is not restricted to ANSI. It is a misconception that char* implies ANSI encoded text. Not all functions that operate on char* data assume that the data is ANSI encoded. In the case of SQLite it fully supports Unicode and does so using char* data encoded using UTF-8.
If you intend to continue using UTF-16 encoded text internal to your application you'll need to add an adapter layer at the boundary between your code and the SQLite code. Convert from UTF-16 to UTF-8 when passing data to SQLite, and the opposite direction when receiving.
Which to my mind renders the question that you asked somewhat moot, but I'll address that anyway:
Are there any limitations or negative impacts from using ANSI Functions in a Unicode Application?
The most obvious drawbacks of using ANSI functions are:
Severely restricted character set.
Performance cost when converting between different character sets.
Risk of programmer confusion and errors due to using multiple character sets in a single codebase.
No limitation, you can use ANSI strings in Unicode applications.
Some details: Unicode application is compile-time definition. At run time, program can work both with Unicode and ANSI strings.
For example:
char* ptr1; // this is always ANSI string
wchar_t* ptr2; // this is always Unicode string
TCHAR* ptr3; // this is generic string, which is compiled as char* or wchar_t*
Unicode/ANSI configuration differs by interpreting a generic text macros, like TCHAR. Some Windows API are also implemented using generic text macros. For example: SetWindowText is actually macro, which is expanded to SetWindowTextA in ANSI configuration, and to SetWindowTextW in Unicode configuration.
Any non-generic string or API name (like char*, SetWindowTextW etc.) works by the same way in any program configuration.
Use ATL conversion macros to convert between different (generic and non-generic) string types: http://msdn.microsoft.com/en-us/library/87zae4a3%28v=vs.80%29.aspx
You can use Ansi-based APIs in a Unicode application. Simply convert your input Unicode strings to Ansi when passing them to the API, and convert any output Ansi strings to Unicode upon return from the API. You can use WideCharToMultiByte() and MultiByteToWideChar() for that, or higher-level wrappers like CString, ATL conversions, etc.
Related
I want to understand the difference between char and wchar_t ? I understand that wchar_t uses more bytes but can I get a clear cut example to differentiate when I would use char vs wchar_t
Short anwser:
You should never use wchar_t in modern C++, except when interacting with OS-specific APIs (basically use wchar_t only to call Windows API functions).
Long answer:
Design of standard C++ library implies there is only one way to handle Unicode - by storing UTF-8 encoded strings in char arrays, as almost all functions exist only in char variants (think of std::exception::what).
In a C++ program you have two locales:
Standard C library locale set by std::setlocale
Standard C++ library locale set by std::locale::global
Unfortunately, none of them defines behavior of standard functions that open files (like std::fopen, std::fstream::open etc). Behavior differs between OSes:
Linux is encoding agnostic, so those function simply pass char string to underlying system call
On Windows char string is converted to wide string using user specific locale before system call is made
Everything usually works fine on Linux as everyone uses UTF-8 based locales so all user input and arguments passed to main functions will be UTF-8 encoded. But you might still need to switch current locales to UTF-8 variants explicitly as by default C++ program starts using default "C" locale. At this point, if you only care about Linux and don't need to support Windows, you can use char arrays and std::string assuming it is UTF-8 sequences and everything "just works".
Problems appear when you want to support Windows, as there you always have additional 3rd locale: the one set for the current user which can be configured somewhere in "Control Panel". The main issue is that this locale is never a unicode locale, so it is impossible to use functions like std::fopen(const char *) and std::fstream::open(const char *) to open a file using Unicode path. On Windows you will have to use custom wrappers that use non-standard Windows specific functions like _wfopen, std::fstream::open(const wchar_t *) on Windows. You can check Boost.Nowide (not yet included in Boost) to see how this can be done: http://cppcms.com/files/nowide/html/
With C++17 you can use std::filesystem::path to store file path in a portable way, but it is still broken on Windows:
Implicit constructor std::filesystem::path::path(const char *) uses user-specific locale on MSVC and there is no way to make it use UTF-8. Function std::filesystem::u8string should be used to construct path from UTF-8 string, but it is too easy to forget about this and use implicit constructor instead.
std::error_category::message(int) for both error categories returns error description using user-specific encoding.
So what we have on Windows is:
Standard library functions that open files are broken and should never be used.
Arguments passed to main(int, char**) are broken and should never be used.
WinAPI functions ending with *A and macros are broken and should never be used.
std::filesystem::path is partially broken and should never be used directly.
Error categories returned by std::generic_category and std::system_category are broken and should never be used.
If you need long term solution for a non-trivial project, I would recommend:
Using Boost.Nowide or implementing similar functionality directly - this fixes broken standard library.
Re-implementing standard error categories returned by std::generic_category and std::system_category so that they would always return UTF-8 encoded strings.
Wrapping std::filesystem::path so that new class would always use UTF-8 when converting path to string and string to path.
Wrapping all required functions from std::filesystem so that they would use your path wrapper and your error categories.
Unfortunately, this won't fix issues with other libraries that work with files, but many are broken anyway (do not support unicode).
You can check this link for further explanation: http://utf8everywhere.org/
Fundamentally, use wchar_t when the encoding has more symbols than a char can contain.
Background
The char type has enough capacity to hold any character (encoding) in the ASCII character set.
The issue is that many languages require more encodings than the ASCII accounts for. So, instead of 127 possible encodings, more are needed. Some languages have more than 256 possible encodings. A char type does not guarantee a range greater than 256. Thus a new data type is required.
The wchar_t, a.k.a. wide characters, provides more room for encodings.
Summary
Use char data type when the range of encodings is 256 or less, such as ASCII. Use wchar_t when you need the capacity for more than 256.
Prefer Unicode to handle large character sets (such as emojis).
Never use wchar_t.
When possible, use (some kind of array of) char, such as std::string, and ensure that it is encoded in UTF-8.
When you must interface with APIs that don't speak UTF-8, use char16_t or char32_t. Never use them otherwise; they provide only illusory advantages and encourage faulty code.
Note that there are plenty of cases where more than one char32_t is required to represent a single user-visible character. OTOH, using UTF-8 with char forces you to handle variable width very early.
A file contains non-latin content and is encoded in UTF8.
Currently the existing code uses "fopen" to open the file, parses it and calls my validate function with the non-latin content and passes data as char*.
void validate(const char* str)
{
....
}
I have to do some validation on passed char array.
The application uses Sun C++ 5.11 and which I think doesn't supports unicode. (I googled for unicode support on Sun C++ 5.11, I didn't get any proper pointers about the unicode support. So I wrote a simple program to check if Sun C++ supports unicode and the program didn't compile).
How do I do the validation on the input char*? Is it possible using wchar_t?
The application uses <compiler> and which I think doesn't supports unicode
This isn't a problem. You only need compiler support for unicode to embed unicode string literals in the code, or for fixed width character types to represent UTF-16 or UTF-32. Your unicode is UTF-8 and comes from user input, so no unicode compiler support should be needed.
How do I do the validation on the input char*?
The C++ standard library has very few tools for processing unicode. The provided tools primarily consist of conversion between different unicode formats, and even those tools were not available prior to C++11.
Input and output is mostly just copying of bytes, so no significant processing is required to do that. For other processing (which you presumably need for "validation") you will need to implement the tools yourself, or use third party tools. You will need to refer to the ~1000 pages of the unicode standard if you choose to implement yourself: http://www.unicode.org/versions/Unicode9.0.0/UnicodeStandard-9.0.pdf
Is it possible using wchar_t?
wchar_t is the native wide character type used for the native wide character encoding of the system. UTF-8 does not use wide code-units.
I have a question:
Some libraries use WCHAR as the text parameter and others use CHAR (as UTF-8): I need to know when to use WCHAR or CHAR when I write my own library.
Use char and treat it as UTF-8. There are a great many reasons for this; this website summarises it much better than I can:
http://utf8everywhere.org/
It recommends converting from wchar_t to char (UTF-16 to UTF-8) as soon as you receive it from any library, and converting back when you need to pass strings to it. So to answer your question, always use char except at the point that an API requires you to pass or receive wchar_t.
WCHAR (or wchar_t on Visual C++ compiler) is used for Unicode UTF-16 strings.
This is the "native" string encoding used by Win32 APIs.
CHAR (or char) can be used for several other string formats: ANSI, MBCS, UTF-8.
Since UTF-16 is the native encoding of Win32 APIs, you may want to use WCHAR (and better a proper string class based on it, like std::wstring) at the Win32 API boundary, inside your app.
And you can use UTF-8 (so, CHAR/char and std::string) to exchange your Unicode text outside your application boundary. For example: UTF-8 is widely used on the Internet, and when you exchange UTF-8 text between different platforms you don't have the problem of endianness (instead with UTF-16 you have to consider both the UTF-16BE big-endian and the UTF-16LE little-endian cases).
You can convert between UTF-16 and UTF-8 using the WideCharToMultiByte() and MultiByteToWideChar() Win32 APIs. These are pure-C APIs, and these can be conveniently wrapped in C++ code, using string classes instead of raw character pointers, and exceptions instead of raw error codes. You can find an example of that here.
The right question is not which type to use, but what should be your contract with your library users. Both char and wchar_t can mean more than one thing.
The right answer to me, is use char and consider everything utf-8 encoded, as utf8everywhere.org suggests. This will also make it easier to write cross-platform libraries.
Make sure you make correct use of strings though. Some APIs like fopen(), would accept a char* string and treat it differently (not as UTF-8) when compiled on Windows. If Unicode is important to you (and it probably is, when you are dealing with strings), be sure to handle your strings correctly. A good example can be seen in boost::locale. I also recommend using boost::nowide on Windows to get strings handled correctly inside your library.
In Windows we stick to WCHARS. std::wstring. Mainly because if you don't you end up having to convert because calling Windows functions.
I have a feeling that trying to use utf8 internally simply because of http://utf8everywhere.org/ is gonna bite us in the bum later on down the line.
It is best recommended that, when developing a Windows application, resort to TCHARs. The good thing about TCHARs is that they can be either regular chars or wchars, depending whether the unicode setting is set or not. Once you resort to TCHARs, you make sure that all string manipulations that you use also start with the _t prefix (e.g. _tcslen for length of string). That way you will know that your code will work both in Unicode and ASCII environments.
I want to develope an application in Linux. I want to use wstring beacuse my application should supports unicode and I don't want to use UTF-8 strings.
In Windows OS, using wstring is easy. beacuse any ANSI API has a unicode form. for example there are two CreateProcess API, first API is CreateProcessA and second API is CreateProcessW.
wstring app = L"C:\\test.exe";
CreateProcess
(
app.c_str(), // EASY!
....
);
But it seems working with wstring in Linux is complicated! for example there is an API in Linux called parport_open (It just an example).
and I don't know how to send my wstring to this API (or APIs like parport_open that accept a string parameter).
wstring name = L"myname";
parport_open
(
0, // or a valid number. It is not important in this question.
name.c_str(), // Error: because type of this parameter is char* not wchat_t*
....
);
My question is how can I use wstring(s) in Linux APIs?
Note: I don't want to use UTF-8 strings.
Thanks
Linux APIs (on recent kernels and with correct locale setting) on almost every distribution use UTF-8 strings by default1. You too should use them inside your code. Resistance is futile.
The wchar_t (and thus wstring) on Windows were convenient only when Unicode was limited to 65536 characters (i.e. wchar_t were used for UCS-2), now that the 16-bit Windows wchar_t are used for UTF-16 the advantage of 1 wchar_t=1 Unicode character is long gone, so you have the same disadvantages of using UTF-8. Nowadays IMHO the Linux approach is the most correct. (Another answer of mine on UTF-16 and why Windows and Java use it)
By the way, both string and wstring aren't encoding-aware, so you can't reliably use any of these two to manipulate Unicode code points. I heard that wxString from the wxWidgets toolkit handles UTF-8 nicely, but I never did extensive research about it.
actually, as pointed out below, the kernel aims to be encoding-agnostic, i.e. it treats the strings as opaque sequences of (NUL-terminated?) bytes (and that's why encodings that use "larger" character types like UTF-16 cannot be used). On the other hand, wherever actual string manipulation is done, the current locale setting is used, and by default on almost any modern Linux distribution it is set to UTF-8 (which is a reasonable default to me).
I don't want to use UTF-8 strings.
Well, you will need to overcome that reluctance, at least when calling the APIs. Linux uses single byte string encodings, invariably UTF-8. Clearly you should use a single byte string type since you obviously can't pass wide characters to a function that expects char*. Use string rather than wstring.
I want to make my Win32 C++ application able to be played on any encoding version (UNICODE & ANSI). Now I am a little confused as to what exactly is the difference between the two(or more?) encodings?
To make my Win32 application cross-encoding compatible does that mean I have to go through my code & replace every std::string with std::wstring, then replace every char with a wchar_t* and then replace every literal string("") with L""?
What will happen if my application runs on a UNICODE machine & my application has one std::string in it?
Do you have any advice on the steps I need to take to make my application cross-encoding compatible?
For eg:
- Change all c_strings & strings to their UNICODE equivalent
- Change any Win32 functions to the uncide version (eg, change from getenv() to _wgetenv())
What will happen if my application runs on a UNICODE machine & my application has one std::string in it?
Computers are not ANSI or Unicode but the Operating Systems on which the computers operate on are. The last version of Windows that didn't support Unicode was Windows 3.11 for Workgroups. If you run a ASCII compiled application on a UniCode.
What exactly is the difference between the two(or more?) encodings?
What is ASCII?
ASCII is a seven-bit encoding technique which assigns a number to each of the 128 characters used most frequently in American English. This allows most computers to record and display basic text. ASCII does not include symbols frequently used in other countries.
What is Unicode?
One major draw back to ASCII was you could only have 256 different characters. However, languages such as Japanese and Arabic have thousands of characters. Thus ASCII would not work in these situations. The result was Unicode which allowed for up to 65,536 different characters.
Unicode is an attempt by ISO and the Unicode Consortium to develop a coding system for electronic text that includes every written alphabet in existence. Unicode uses 8-, 16-, or 32-bit characters depending on the specific representation, so Unicode documents often require up to twice as much disk space as ASCII or Latin-1 documents. The first 256 characters of Unicode are identical to Latin-1.
In Win32, UNICODE is supported by #define-ing the UNICODE and _UNICODE macros. This, in turn, causes your program to use the Unicode variants of the Win32 functions.
Do you have any advice on the steps I need to take to make my application cross-encoding compatible?
Each Win32 function (that takes or returns a string) has two variants, one for ASCII and one for Unicode. And the function call resolves to one of these, depending on whether or not the UNICODE macro is defined. So you should define the macro and start using the Unicode versions of the functions. for eg:
Replacing every std::string with std::wstring,
Replacing every char with a wchar_t*
Replacing every literal string("") with L""
Making use of the TCHAR support in Windows etc.
as you pointed out are a list of things that you will have to take care of, mind you this is not the complete list.
Basically, You will have to use all the Unicode versions of the types and function calls in your code.
The last version of Windows that did not use Unicode internally was Windows ME. The recommendation for new code is to use Unicode exclusively. Some conversion may be necessary when you need to read and write files that are encoded with a specific code page.
You're on the right track with your initial thoughts. If you're using Microsoft's CString, it comes in two versions CStringA and CStringW - you need to change one compiler definition and it will use CStringW in every place that you specify CString, and everything will just work. You should use std::wstring instead of std::string. Prefix every string literal with L"" or use Microsoft's macro _T("") which will convert to the same thing.
When you compile a program for ANSI or Unicode, you're affecting two things.
Which set of APIs get called. Suppose your code calls CreateFile(). The actual API called is either CreateFileA() or CreateFileW() (ANSI or Wide (i.e. Unicode)) depending on your compiler setting. Internally the NT kernal uses Unicde for all APIs. The ANSI APIs simply convert their string parameters to ANSI and call the Unicode APIs. Many APIs are Unicode only.
How T* macros are expanded. TCHAR will eventually be expanded to char in ANSI mode, wchar_t in Unicode mode.
Things like std::string and std::wstring are not affected until you need to call an API and want to pass a string to them. The use of string vs. wstring should be determined by your program's needs and not whether it's compiled ANSI or Unicode.
You can use ATL to easily convert strings as necessary.
// assume compiled for Unicode
#include <atlbase.h>
void myfunc() {
USES_CONVERSION;
std::string filename = "...";
HANDLE hFile = CreateFile(A2W(filename.c_str()), ...
or, if you prefer, you can use A2T() and your code will work whether it's compiled for ANSI or Unicode.
You can use TCHAR in your case.
In UNICODE, TCHAR is WCHAR.
In not UNICODE, TCHAR is CHAR.
If you want to use std::string, I recommend you the following usage.
#ifdef UNICODE
#define std::tstring str::wstring
#else
#define std::tstring str::string
#endif
and,
Use std::tstring in your program.