Handling the utf8 encoded char* array - c++

A file contains non-latin content and is encoded in UTF8.
Currently the existing code uses "fopen" to open the file, parses it and calls my validate function with the non-latin content and passes data as char*.
void validate(const char* str)
{
....
}
I have to do some validation on passed char array.
The application uses Sun C++ 5.11, which I think doesn't support Unicode. (I googled for Unicode support in Sun C++ 5.11 but didn't find any proper pointers. So I wrote a simple program to check whether Sun C++ supports Unicode, and the program didn't compile.)
How do I do the validation on the input char*? Is it possible using wchar_t?

The application uses <compiler>, which I think doesn't support unicode
This isn't a problem. You only need compiler support for unicode to embed unicode string literals in the code, or for fixed width character types to represent UTF-16 or UTF-32. Your unicode is UTF-8 and comes from user input, so no unicode compiler support should be needed.
How do I do the validation on the input char*?
The C++ standard library has very few tools for processing unicode. The provided tools primarily consist of conversion between different unicode formats, and even those tools were not available prior to C++11.
Input and output is mostly just copying of bytes, so no significant processing is required to do that. For other processing (which you presumably need for "validation") you will need to implement the tools yourself, or use third party tools. You will need to refer to the ~1000 pages of the unicode standard if you choose to implement yourself: http://www.unicode.org/versions/Unicode9.0.0/UnicodeStandard-9.0.pdf
Is it possible using wchar_t?
wchar_t is the native wide character type used for the native wide character encoding of the system. UTF-8 does not use wide code-units.
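If "validation" includes checking that the bytes are well-formed UTF-8, that part can be done directly on the char* without any compiler Unicode support. The following is a minimal sketch under that assumption; the name is_valid_utf8 is hypothetical, and the code sticks to pre-C++11 features since Sun C++ 5.11 predates C++11. It only checks the encoding rules (lead and continuation bytes, overlong forms, surrogates, the U+10FFFF limit) and knows nothing about what the characters mean.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns true if the NUL-terminated buffer is
// well-formed UTF-8. Rejects truncated sequences, overlong forms,
// surrogates (U+D800..U+DFFF) and values above U+10FFFF.
bool is_valid_utf8(const char* str)
{
    // Work on unsigned bytes so comparisons do not depend on char's signedness.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(str);
    const unsigned char* end = p + std::strlen(str);

    while (p < end) {
        unsigned char lead = *p++;
        unsigned long cp;      // decoded code point
        unsigned long min_cp;  // smallest code point allowed for this length
        int trailing;          // number of continuation bytes expected

        if (lead < 0x80)                { continue; }  // plain ASCII
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; min_cp = 0x10000; }
        else                            { return false; } // stray continuation or invalid lead

        if (end - p < trailing) return false;            // sequence is truncated
        for (int i = 0; i < trailing; ++i) {
            unsigned char c = *p++;
            if ((c & 0xC0) != 0x80) return false;        // not a 10xxxxxx byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    }
    return true;
}
```

Calling is_valid_utf8(str) at the top of validate would catch malformed input early; anything beyond that (character classes, normalization) still needs the Unicode data tables or a third-party library.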

Related

Storing math symbols into string c++

Is there a way to store math symbols into strings in c++ ?
I notably need the union/intersection symbols.
Thanks in advance!
This seemingly simple question is actually a tangle of multiple questions:
What character set to use?
Unicode is almost certainly the best choice nowadays.
What encoding to use?
C++ std::strings are strings of chars, but you can decide how those chars correspond to "characters" in your character set. The default representation assumed by the language and the system could be ASCII, some random code page like Latin-1 or Windows-1252, or UTF-8.
If you're on Linux or Mac, your best bet is to use UTF-8. If you're on Windows, you might choose to use wide strings instead (std::wstring), and to use UTF-16 as the encoding. But many people suggest that you always use UTF-8 in std::strings even on Windows, and simply convert from and to UTF-16 as needed to do I/O.
How to specify string literals in the code?
To store UTF-8 in older versions of C++ (before C++11), you could manually encode your string literals like this:
const std::string subset = "\xE2\x8A\x82";
To store UTF-8 in C++11 or newer, you use the u8 prefix to tell the compiler you want UTF-8 encoding. You can use escaped characters:
const std::string subset = u8"\u2282";
Or you can enter the character directly into the source code:
const std::string subset = u8"⊂";
I tend to use the escaped versions to avoid worrying about the encoding of the source file and whether all the editors and viewers and IDEs I use will consistently understand the source file encoding.
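To see where the bytes in the manually encoded literal come from, here is a small sketch that encodes a single code point as UTF-8. The function name append_utf8 is made up for illustration, and the code avoids C++11 features so it also applies to older compilers.

```cpp
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one Unicode code point
// to `out`. Returns false (appending nothing) for surrogates or values
// above U+10FFFF, which have no valid UTF-8 encoding.
bool append_utf8(std::string& out, unsigned long cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    if (cp > 0x10FFFF) return false;

    if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

Passing 0x2282 produces the three bytes E2 8A 82, the same sequence as the hand-written "\xE2\x8A\x82" literal.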
If you're on Windows and you choose to use UTF-16 instead, then, regardless of C++ version, you can specify wide string literals in your code like this:
const std::wstring subset = L"\u2282"; // or L"⊂";
How to display these strings?
This is very system dependent.
On Mac and Linux, I suspect things will generally just work.
In a console program on Windows (e.g., one that just uses <iostream> or printf to display in a command prompt), you're probably in trouble because the legacy command prompts don't have good Unicode and font support. (Maybe this is better on Windows 10?)
In a GUI program on Windows, you have to make sure you use the "Unicode" version of the API and to give it the wide string. ("Unicode" is in quotation marks here because the Windows API documentation often uses "Unicode" to mean a UTF-16 encoded wide character string, which isn't exactly what Unicode means.) So if you want to use an API like TextOut or MessageBox to display your string, you have to make sure you do two things: (1) call the "wide" version of the API, and (2) pass a UTF-16 encoded string.
You solve (1) by explicitly calling the wide versions (e.g., TextOutW or MessageBoxW) or by making sure you compile with "Unicode" selected in your project settings. (You can also do it by defining several C++ preprocessor macros instead, but this answer is already long enough.)
For (2), if you are using std::wstrings, you're already done. If you're using UTF-8, you'll need to make a wide copy of the string to pass to the output function. Windows provides MultiByteToWideChar for making such a copy. Make sure you specify CP_UTF8.
For (2), do not try to call the narrow versions of the API functions themselves (e.g., TextOutA or MessageBoxA). These will convert your string to a wide string automatically, but they do so assuming the string is encoded in the user's current code page. If the string is really in UTF-8, then these will do the wrong thing for all of the "interesting" (non-ASCII) characters.
How to read these strings from a file, a socket, or the user?
This is very system specific and probably worth a separate question.
Yes, you can, as follows:
std::string unionChar = "∪";
std::string intersectionChar = "∩";
They are just characters but don't expect this code to be portable. You could also use Unicode, as follows:
std::string unionChar = u8"\u222A";
std::string intersectionChar = u8"\u2229";

How to display Unicode with FLTK?

According to the FLTK 1.3.2 documentation:
Unicode support was only recently added to FLTK and is still
incomplete.
However, the following are supposedly implemented:
It is important to note that the initial implementation of Unicode and
UTF-8 in FLTK involves three important areas:
provision of Unicode character tables and some simple related functions
conversion of char* variables and function parameters from single byte per character representation to UTF-8 variable length
sequences
modifications to the display font interface to accept general Unicode character or UCS code numbers instead of just ASCII or
Latin1 characters.
My question is, how do I actually display Unicode on my FLTK controls? I can't find any widget functions which accept Unicode. For example, this is the signature for the label function:
void Fl_Widget::label ( const char * text )
From the link you posted:
FLTK will be entirely converted to Unicode using UTF-8 encoding. If a different encoding is required by the underlying operating system, FLTK will convert the string as needed.
The three bullet points you list are the areas that make up their implementation of Unicode support; That is, these are things they are doing or are planning to do.
FLTK implementers are going to provide Unicode character tables and some simple related functions
FLTK implementers are going to convert char* variables and function parameters from using SBCS to UTF-8. (That is, they are going to re-implement FLTK functions and variables to treat char* strings as UTF-8.)
FLTK implementers are going to modify the display font interface to cover more than just ASCII and Latin1 characters.
My question is, how do I actually display Unicode on my FLTK controls? I can't find any widget functions which accept Unicode. For example, this is the signature for the label function:
void Fl_Widget::label ( const char * text )
There are many people that incorrectly use 'Unicode' to mean an encoding that uses 2-byte characters. The FLTK documentation you link to does not make this mistake. Understanding this, the documentation says quite clearly how you use Unicode with the above signature: You pass the Unicode data as a char* string using the UTF-8 encoding. For example if you're using a compiler that uses UTF-8 as the execution encoding:
widget.label("кошка 日本国");
Or if you have a C++11 compiler:
widget.label( u8"кошка 日本国");

Are there any limitations from using ANSI Functions in a Unicode Application

I have a C++ Native WinAPI application that strictly uses Unicode functions and data types. Ie, CreateWindowW(), SendMessageW(), wstring, WCHAR, etc. Now I intend to expand my application to use SQLite3.
My Problem: The SQLite3 library is ANSI. Which means I have to use char* as most function parameters.
Are there any limitations or negative impacts from using ANSI Functions in a Unicode Application?
If there are what might these impacts be?
SQLite is not restricted to ANSI. It is a misconception that char* implies ANSI encoded text. Not all functions that operate on char* data assume that the data is ANSI encoded. In the case of SQLite it fully supports Unicode and does so using char* data encoded using UTF-8.
If you intend to continue using UTF-16 encoded text internal to your application you'll need to add an adapter layer at the boundary between your code and the SQLite code. Convert from UTF-16 to UTF-8 when passing data to SQLite, and the opposite direction when receiving.
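The outbound half of such an adapter layer might look like the sketch below. It is portable C++11 and works on std::u16string; on Windows, where wchar_t holds UTF-16 code units, the contents of a std::wstring could be copied into one, or WideCharToMultiByte with CP_UTF8 could be used instead. The name utf16_to_utf8 is an assumption for illustration, not part of SQLite or the Windows API.

```cpp
#include <cstddef>
#include <string>

// Hypothetical adapter helper: converts UTF-16 to UTF-8.
// Returns false if the input contains an unpaired surrogate.
bool utf16_to_utf8(const std::u16string& in, std::string& out)
{
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate: needs a partner
            if (i + 1 >= in.size()) return false;
            char32_t low = in[i + 1];
            if (low < 0xDC00 || low > 0xDFFF) return false;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            ++i;                                       // consumed the low surrogate
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {     // low surrogate without a high one
            return false;
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return true;
}
```

The resulting std::string can then be handed to any SQLite function that takes UTF-8 char* data via c_str().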
Which to my mind renders the question that you asked somewhat moot, but I'll address that anyway:
Are there any limitations or negative impacts from using ANSI Functions in a Unicode Application?
The most obvious drawbacks of using ANSI functions are:
Severely restricted character set.
Performance cost when converting between different character sets.
Risk of programmer confusion and errors due to using multiple character sets in a single codebase.
No limitation, you can use ANSI strings in Unicode applications.
Some details: whether an application is "Unicode" is a compile-time definition. At run time, the program can work with both Unicode and ANSI strings.
For example:
char* ptr1; // this is always ANSI string
wchar_t* ptr2; // this is always Unicode string
TCHAR* ptr3; // this is generic string, which is compiled as char* or wchar_t*
The Unicode and ANSI configurations differ in how they interpret generic text macros, like TCHAR. Some Windows APIs are also implemented using generic text macros. For example: SetWindowText is actually a macro, which is expanded to SetWindowTextA in the ANSI configuration, and to SetWindowTextW in the Unicode configuration.
Any non-generic string or API name (like char*, SetWindowTextW etc.) works the same way in any program configuration.
Use ATL conversion macros to convert between different (generic and non-generic) string types: http://msdn.microsoft.com/en-us/library/87zae4a3%28v=vs.80%29.aspx
You can use Ansi-based APIs in a Unicode application. Simply convert your input Unicode strings to Ansi when passing them to the API, and convert any output Ansi strings to Unicode upon return from the API. You can use WideCharToMultiByte() and MultiByteToWideChar() for that, or higher-level wrappers like CString, ATL conversions, etc.

Unicode Portability

I'm currently taking care of an application that uses std::string and char for string operations - which is fine on linux, since Linux is agnostic to Unicode (or so it seems; I don't really know, so please correct me if I'm telling stories here). This current style naturally leads to this kind of function/class declarations:
std::string doSomethingFunkyWith(const std::string& thisdata)
{
/* .... */
}
However, if thisdata contains unicode characters, it will be displayed wrongly on windows, since std::string can't hold unicode characters on Windows.
So I thought up this concept:
namespace MyApplication {
#ifdef UNICODE
typedef std::wstring string_type;
typedef wchar_t char_type;
#else
typedef std::string string_type;
typedef char char_type;
#endif
/* ... */
string_type doSomethingFunkyWith(const string_type& thisdata)
{
/* ... */
}
}
Is this a good concept to go with to support Unicode on windows?
My current toolchain consists of gcc/clang on Linux, and wine+MinGW for Windows support (crosstesting also happens via wine), if that matters.
Multiplatform issues come from the fact that there are many encodings, and a wrong encoding pick will lead to encóding íssues. Once you tackle that problem, you should be able to use std::wstring throughout your program.
The usual workflow is:
raw_input_data = read_raw_data()
input_encoding = "???" // What is your file or terminal encoding?
unicode_data = convert_to_unicode(raw_input_data, input_encoding)
// Do something with the unicode_data, store in some var, etc.
output_encoding = "???" // Is your terminal output encoding the same as your input?
raw_output_data = convert_from_unicode(unicode_data, output_encoding)
print_raw_data(raw_output_data)
Most Unicode issues come from wrongly detecting the values of input_encoding and output_encoding. On a modern Linux distribution this is usually UTF-8. On Windows YMMV.
Standard C++ doesn't know about encodings; you should use a library like ICU to do the conversion.
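As one concrete instance of the convert_to_unicode step, here is a sketch for the simple case where the input encoding is known to be Latin-1 (ISO-8859-1) and UTF-8 is chosen as the internal Unicode representation. That choice and the name latin1_to_utf8 are assumptions for illustration; for arbitrary encodings a library such as ICU remains the practical route. Latin-1 is easy because each byte value is also its Unicode code point.

```cpp
#include <string>

// Hypothetical helper: re-encodes Latin-1 (ISO-8859-1) bytes as UTF-8.
// Every Latin-1 byte maps to the code point with the same numeric value,
// so values below 0x80 stay one byte and the rest become two bytes.
std::string latin1_to_utf8(const std::string& latin1)
{
    std::string utf8;
    for (std::string::size_type i = 0; i < latin1.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(latin1[i]);
        if (byte < 0x80) {
            utf8 += static_cast<char>(byte);
        } else {
            utf8 += static_cast<char>(0xC0 | (byte >> 6));
            utf8 += static_cast<char>(0x80 | (byte & 0x3F));
        }
    }
    return utf8;
}
```

If the input is actually in some other encoding, this function will still run and silently produce the wrong characters, which is exactly the misdetection problem described above.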
How you store a string within your application is entirely up to you -- after all, nobody would know as long as the strings stay within your application. The problem starts when you try to read or write strings from the outside world (console, files, sockets etc.) and this is where the OS matters.
Linux isn't exactly "agnostic" to Unicode -- it does recognize Unicode but the standard library functions assume UTF-8 encoding, so Unicode strings fit into standard char arrays. Windows, on the other hand, uses UTF-16 encoding, so you need a wchar_t array to represent 16-bit characters.
The typedefs you proposed should work fine, but keep in mind that this alone doesn't make your code portable. As an example, if you want to store text in files in a portable manner, you should choose one encoding and stick to it across all platforms -- this could require converting between encodings on certain platforms.
Linux does support Unicode, it simply uses UTF-8. Probably a better way to make your system portable would be to make use of International Components for Unicode and treat all std::string objects as containing UTF-8 characters, and convert them to UTF-16 as needed when invoking Windows functions. It almost always makes sense to use UTF-8 over UTF-16, as UTF-8 uses less space for some of the most commonly used characters (e.g. English*) and more space for less frequent characters, whereas UTF-16 wastes space equally for all characters, no matter how frequently they are used.
While you can use your typedefs, this will mean that you have to write two copies of every single function that has to deal with strings. I think it would be more efficient to simply do all internal computations in UTF-8 and simply translate that to/from UTF-16 if necessary when inputting/outputting as needed.
*For HTML, XML, and JSON, which use English as part of the markup (e.g. <html>, <body>, etc.) regardless of the language of the values, this can still be a win for foreign languages.
The problem for Linux and using Unicode is that all the IO and most system functions use UTF-8 and the wide character type is 32 bit. Then there is interfacing to Java and other programs which requires UTF-16.
As a suggestion for Unicode support, see the OpenRTL library at http://code.google.com/p/openrtl which supports UTF-8, UTF-16 and UTF-32 on Windows, Linux, OS X and iOS. The Unicode support is not just the character types, but also Unicode collation, normalization, case folding, title casing and about 64 different Unicode character properties per full unsigned 32 bit character.
The OpenRTL code is ready now to support char8_t, char16_t and char32_t for the new C++ standards as well, although the same character types are supported using macros for existing C and C++ compilers. I think for Unicode and string processing it might be what you want for your library.
The point is that if you use OpenRTL, you can build the system using the OpenRTL "char_t" type. This supports the notion that your entire library can be built in either UTF8, UTF16 or UTF32 mode, even on Linux, because OpenRTL is already handling all the interfacing to a lot of system functions like files and io stuff. It has its own print_f functions for example.
By default char_t maps to the wide character type, so on Windows it is 16 bit and on Linux it is 32 bit. But you can also make it 8 bit everywhere, for example. It also supports fast UTF decoding inside loops using macros.
So instead of ifdeffing between wchar_t and char, you can build everything using char_t and OpenRTL takes care of the rest.

How do I get STL std::string to work with unicode on windows?

At my company we have a cross-platform (Linux & Windows) library that contains our own extension of the STL std::string. This class provides all sorts of functionality on top of the string: split, format, to/from base64, etc. Recently we were given the requirement of making this string Unicode "friendly"; basically it needs to support characters from Chinese, Japanese, Arabic, etc. After initial research this seems fine on the Linux side since everything is inherently UTF-8. However, I am having trouble with the Windows side: is there a trick to getting the STL std::string to work as UTF-8 on Windows? Is it even possible? Is there a better way? Ideally we would keep ourselves based on std::string since that is what the string class is based on in Linux.
Thank you,
There are several misconceptions in your question.
Neither C++ nor the STL deal with encodings.
std::string is essentially a string of bytes, not characters. So you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.
Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.
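To illustrate the bytes-versus-characters point, here is a small sketch that counts code points in a UTF-8 std::string by skipping continuation bytes. The name count_code_points is made up, the input is assumed to be valid UTF-8, and a code point is still not always one user-perceived character (combining marks, for example).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a string that is
// assumed to contain valid UTF-8. Continuation bytes have the bit
// pattern 10xxxxxx; every other byte starts a new code point.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the three-byte string "\xE2\x8A\x82" (U+2282), length() reports 3 while count_code_points reports 1.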
Yes - by being more aware of locales and encodings.
Windows has two function calls for everything that requires text, a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings, the *A() takes strings in the current codepage. However, Windows doesn't support a UTF-8 code page, so you can't directly use it in that sense with the *A() functions, nor would you want to depend on that being set by users. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there, Googling "Unicode Windows tutorial" should get you some.
If you are storing UTF-8 data in a std::string, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing such), and then pass it to Windows.
Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character, it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as char's signedness is implementation-defined. A statement like str[x] < 0x80 to check for multiple-byte characters can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integral value with a range of 0-255. That maps to the C type uint8_t exactly, although unsigned char works as well. Ideally then, I'd make a UTF-8 string an array of uint8_t, but due to old APIs, this is rarely done.
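A minimal sketch of the signedness-safe form of that check, using a hypothetical helper name:

```cpp
// Hypothetical helper: true if `c` is a single-byte (ASCII) UTF-8 code unit.
inline bool is_ascii_unit(char c)
{
    // Writing `c < 0x80` directly is always true where char is signed,
    // because such a char only holds values from -128 to 127.
    // Converting to unsigned char first gives the intended 0..255 range.
    return static_cast<unsigned char>(c) < 0x80;
}
```

The cast costs nothing at run time and makes the comparison behave the same whether the platform's char is signed or unsigned.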
Some people have recommended wchar_t, claiming it to be "A Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard states:
which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales
In Linux, a wchar_t represents a UTF-32 code unit / code point. It is thus 4 bytes. However, in Windows, it's a UTF-16 code unit, and is only 2 bytes. (Which, I would have said, does not conform to the above, since 2 bytes cannot represent all of Unicode, but that's the way it works.) This size difference, and difference in data encoding, clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (§5.2)
The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in std::strings, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather that I know that my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I speak to the Windows API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.
Putting UTF-8 code points into an std::string should be fine regardless of platform. The problem on Windows is that almost nothing else expects or works with UTF-8 -- it expects and works with UTF-16 instead. You can switch to an std::wstring which will store UTF-16 (at least on most Windows compilers) or you can write other routines that will accept UTF-8 (probably by converting to UTF-16, and then passing through to the OS).
Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.
No, there is no way to make Windows treat "narrow" strings as UTF-8.
Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).
Use std::string in cross-platform portion of the code. Assume that it always contains UTF-8 strings.
In the Windows portion of the code, use the "wide" versions of the Windows API explicitly, i.e. write e.g. CreateFileW instead of CreateFile. This avoids a dependency on the build system configuration.
In the platform abstraction layer, convert between UTF-8 and UTF-16 where needed (MultiByteToWideChar/WideCharToMultiByte).
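For readers who want to see what the UTF-8 to UTF-16 step involves, here is a portable C++11 sketch. In real Windows code MultiByteToWideChar with CP_UTF8 does this job; the function name utf8_to_utf16 and the use of std::u16string are assumptions for illustration. The sketch rejects truncated sequences, bad continuation bytes, surrogates and out-of-range values, but for brevity it does not reject overlong encodings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 to UTF-16.
// Returns false on malformed input (overlong forms are NOT detected here).
bool utf8_to_utf16(const std::string& in, std::u16string& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp;
        int trailing;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        trailing = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; }
        else                            { return false; }

        if (in.size() - i < static_cast<std::size_t>(trailing)) return false;
        for (int k = 0; k < trailing; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i++]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);               // fits in one code unit
        } else {
            cp -= 0x10000;                                  // split into a surrogate pair
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return true;
}
```

Code points above U+FFFF come out as two 16-bit units, which is why UTF-16 strings cannot be treated as one unit per character either.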
Other approaches that I tried but don't like much:
typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and std::tstring, but it still adds a lot of pain.
Use std::wstring everywhere. Does not help much since wchar_t is 16 bit on Windows, so you either have to restrict yourself to BMP or go to a lot of complications to make the code dealing with Unicode cross-platform. In the latter case, all benefits over UTF-8 evaporate.
Use ATL/WTL/MFC CString in the platform-specific portion; use std::string in the cross-platform portion. This is actually a variant of what I recommend above. CString is in many aspects superior to std::string (in my opinion). But it introduces an additional dependency and is thus not always acceptable or convenient.
If you want to avoid headache, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable, it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required, and supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulations (boost::wpath). Avoid std::string and std::fstream.
In the Windows API and C runtime library, char* parameters are interpreted as being encoded in the "ANSI" code page. The problem is that UTF-8 isn't supported as an ANSI code page, which I find incredibly annoying.
I'm in a similar situation, being in the middle of porting software from Windows to Linux while also making it Unicode-aware. The approach we've taken for this is:
Use UTF-8 as the default encoding for strings.
In Windows-specific code, always call the "W" version of functions, converting string arguments between UTF-8 and UTF-16 as necessary.
This is also the approach Poco has taken.
It's really platform dependent; Unicode is a headache. It depends on which compiler you use. For older ones from MS (VS2010 or older), you would need to use the API described in MSDN.
for VS2015
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;
according to their docs. I can't check that one.
for mingw, gcc, etc.
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();
output contains proper file name...
You should consider using QString and QByteArray; they have good Unicode support.