According to the FLTK 1.3.2 documentation:
Unicode support was only recently added to FLTK and is still
incomplete.
However, the following are supposedly implemented:
It is important to note that the initial implementation of Unicode and
UTF-8 in FLTK involves three important areas:
provision of Unicode character tables and some simple related functions
conversion of char* variables and function parameters from single byte per character representation to UTF-8 variable length sequences
modifications to the display font interface to accept general Unicode character or UCS code numbers instead of just ASCII or Latin1 characters.
My question is, how do I actually display Unicode on my FLTK controls? I can't find any widget functions which accept Unicode. For example, this is the signature for the label function:
void Fl_Widget::label ( const char * text )
From the link you posted:
FLTK will be entirely converted to Unicode using UTF-8 encoding. If a different encoding is required by the underlying operating system, FLTK will convert the string as needed.
The three bullet points you list are the areas that make up their implementation of Unicode support; that is, these are things they are doing or are planning to do.
FLTK implementers are going to provide Unicode character tables and some simple related functions
FLTK implementers are going to convert char* variables and function parameters from using SBCS to UTF-8. (That is, they are going to re-implement FLTK functions and variables to treat char* strings as UTF-8.)
FLTK implementers are going to modify the display font interface to cover more than just ASCII and Latin1 characters.
My question is, how do I actually display Unicode on my FLTK controls? I can't find any widget functions which accept Unicode. For example, this is the signature for the label function:
void Fl_Widget::label ( const char * text )
There are many people that incorrectly use 'Unicode' to mean an encoding that uses 2-byte characters. The FLTK documentation you link to does not make this mistake. Understanding this, the documentation says quite clearly how you use Unicode with the above signature: You pass the Unicode data as a char* string using the UTF-8 encoding. For example if you're using a compiler that uses UTF-8 as the execution encoding:
widget.label("кошка 日本国");
Or, if you have a C++11 compiler (before C++20, where u8 literals changed type to const char8_t* and would need a cast here):
widget.label( u8"кошка 日本国");
Is there a way to store math symbols into strings in c++ ?
I notably need the union/intersection symbols.
Thanks in advance!
This seemingly simple question is actually a tangle of multiple questions:
What character set to use?
Unicode is almost certainly the best choice nowadays.
What encoding to use?
C++ std::strings are strings of chars, but you can decide how those chars correspond to "characters" in your character set. The default representation assumed by the language and the system could be ASCII, some code page like Latin-1 or Windows-1252, or UTF-8.
If you're on Linux or Mac, your best bet is to use UTF-8. If you're on Windows, you might choose to use wide strings instead (std::wstring), and to use UTF-16 as the encoding. But many people suggest that you always use UTF-8 in std::strings even on Windows, and simply convert from and to UTF-16 as needed to do I/O.
How to specify string literals in the code?
To store UTF-8 in older versions of C++ (before C++11), you could manually encode your string literals like this:
const std::string subset = "\xE2\x8A\x82";
To store UTF-8 in C++11 or newer, you use the u8 prefix to tell the compiler you want UTF-8 encoding (from C++20 on, u8 literals are arrays of char8_t and no longer initialize a std::string directly). You can use escaped characters:
const std::string subset = u8"\u2282";
Or you can enter the character directly into the source code:
const std::string subset = u8"⊂";
I tend to use the escaped versions to avoid worrying about the encoding of the source file and whether all the editors and viewers and IDEs I use will consistently understand the source file encoding.
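As a minimal sketch of what the escaped literal actually stores: the single character ⊂ occupies three chars in a UTF-8 std::string, so size() reports bytes, not characters. The helper name count_code_points is hypothetical (not a standard library function), and it assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes (those of the form 10xxxxxx).
// Assumption: the input is well-formed UTF-8; no validation is done here.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte starts a new code point
        }
    }
    return count;
}
```

With this, std::string("\xE2\x8A\x82").size() is 3 while count_code_points("\xE2\x8A\x82") is 1. Keep in mind that one code point is still not always one user-visible character (combining marks, for example).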
If you're on Windows and you choose to use UTF-16 instead, then, regardless of C++ version, you can specify wide string literals in your code like this:
const std::wstring subset = L"\u2282"; // or L"⊂";
How to display these strings?
This is very system dependent.
On Mac and Linux, I suspect things will generally just work.
In a console program on Windows (e.g., one that just uses <iostream> or printf to display in a command prompt), you're probably in trouble because the legacy command prompts don't have good Unicode and font support. (Maybe this is better on Windows 10?)
In a GUI program on Windows, you have to make sure you use the "Unicode" version of the API and to give it the wide string. ("Unicode" is in quotation marks here because the Windows API documentation often uses "Unicode" to mean a UTF-16 encoded wide character string, which isn't exactly what Unicode means.) So if you want to use an API like TextOut or MessageBox to display your string, you have to make sure you do two things: (1) call the "wide" version of the API, and (2) pass a UTF-16 encoded string.
You solve (1) by explicitly calling the wide versions (e.g., TextOutW or MessageBoxW) or by making sure you compile with "Unicode" selected in your project settings. (You can also do it by defining several C++ preprocessor macros instead, but this answer is already long enough.)
For (2), if you are using std::wstrings, you're already done. If you're using UTF-8, you'll need to make a wide copy of the string to pass to the output function. Windows provides MultiByteToWideChar for making such a copy. Make sure you specify CP_UTF8.
For (2), do not try to call the narrow versions of the API functions themselves (e.g., TextOutA or MessageBoxA). These will convert your string to a wide string automatically, but they do so assuming the string is encoded in the user's current code page. If the string is really in UTF-8, then these will do the wrong thing for all of the "interesting" (non-ASCII) characters.
How to read these strings from a file, a socket, or the user?
This is very system specific and probably worth a separate question.
Yes, you can, as follows:
std::string unionChar = "∪";
std::string intersectionChar = "∩";
They are just characters but don't expect this code to be portable. You could also use Unicode, as follows:
std::string unionChar = u8"\u222A";
std::string intersectionChar = u8"\u2229";
A file contains non-latin content and is encoded in UTF8.
Currently the existing code uses "fopen" to open the file, parses it and calls my validate function with the non-latin content and passes data as char*.
void validate(const char* str)
{
....
}
I have to do some validation on passed char array.
The application uses Sun C++ 5.11, which I think doesn't support unicode. (I googled for unicode support in Sun C++ 5.11 but didn't find any proper pointers, so I wrote a simple program to check whether Sun C++ supports unicode, and the program didn't compile.)
How do I do the validation on the input char*? Is it possible using wchar_t?
The application uses <compiler>, which I think doesn't support unicode
This isn't a problem. You only need compiler support for unicode to embed unicode string literals in the code, or for fixed width character types to represent UTF-16 or UTF-32. Your unicode is UTF-8 and comes from user input, so no unicode compiler support should be needed.
How do I do the validation on the input char*?
The C++ standard library has very few tools for processing unicode. The provided tools primarily consist of conversion between different unicode formats, and even those tools were not available prior to C++11.
Input and output is mostly just copying of bytes, so no significant processing is required to do that. For other processing (which you presumably need for "validation") you will need to implement the tools yourself, or use third party tools. You will need to refer to the ~1000 pages of the unicode standard if you choose to implement yourself: http://www.unicode.org/versions/Unicode9.0.0/UnicodeStandard-9.0.pdf
Is it possible using wchar_t?
wchar_t is the native wide character type used for the native wide character encoding of the system. UTF-8 does not use wide code-units.
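If "validation" includes checking that the incoming bytes are well-formed UTF-8 at all, that part can be done directly on the char* with no compiler Unicode support. The following is only a sketch under that assumption: is_valid_utf8 is a hypothetical helper name, it uses nothing beyond plain C++98-style code, and it checks encoding form only (lead and continuation bytes, overlong forms, surrogates, range), not any application-specific rules.

```cpp
// Hypothetical helper: returns true if the NUL-terminated byte string is
// well-formed UTF-8. It does not interpret the characters in any way.
bool is_valid_utf8(const char* str)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(str);
    while (*p != 0) {
        unsigned long cp;      // code point decoded so far
        int trailing;          // continuation bytes expected after the lead byte
        unsigned long min_cp;  // smallest code point allowed for this length

        if (*p < 0x80)                { cp = *p;        trailing = 0; min_cp = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; trailing = 1; min_cp = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; trailing = 2; min_cp = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; trailing = 3; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        ++p;

        for (int i = 0; i < trailing; ++i, ++p) {
            // A NUL terminator fails this test too, so we never read past it.
            if ((*p & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (*p & 0x3F);
        }

        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    }
    return true;
}
```

A validate(const char* str) function could call this first and reject the input early. Anything beyond encoding form (normalization, character classes, case folding) still needs the Unicode tables or a third-party library, as described above.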
I have a C++ Native WinAPI application that strictly uses Unicode functions and data types. Ie, CreateWindowW(), SendMessageW(), wstring, WCHAR, etc. Now I intend to expand my application to use SQLite3.
My Problem: The SQLite3 library is ANSI. Which means I have to use char* as most function parameters.
Are there any limitations or negative impacts from using ANSI Functions in a Unicode Application?
If there are what might these impacts be?
SQLite is not restricted to ANSI. It is a misconception that char* implies ANSI encoded text. Not all functions that operate on char* data assume that the data is ANSI encoded. In the case of SQLite it fully supports Unicode and does so using char* data encoded using UTF-8.
If you intend to continue using UTF-16 encoded text internal to your application you'll need to add an adapter layer at the boundary between your code and the SQLite code. Convert from UTF-16 to UTF-8 when passing data to SQLite, and the opposite direction when receiving.
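The outbound half of such an adapter layer might look like the sketch below. On Windows you would more likely call WideCharToMultiByte with CP_UTF8; this hand-written version is only a portable illustration. The name utf16_to_utf8 is hypothetical, std::u16string stands in for the application's UTF-16 strings, and replacing unpaired surrogates with U+FFFD is an assumption about the desired error handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical adapter helper: converts UTF-16 text to UTF-8.
// Assumption: unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high surrogate
        }

        // Emit the code point as 1 to 4 UTF-8 bytes.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting std::string can be passed via c_str() to the SQLite functions that take UTF-8 char* text. The return path needs the mirror-image conversion from UTF-8 back to UTF-16.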
Which to my mind renders the question that you asked somewhat moot, but I'll address that anyway:
Are there any limitations or negative impacts from using ANSI Functions in a Unicode Application?
The most obvious drawbacks of using ANSI functions are:
Severely restricted character set.
Performance cost when converting between different character sets.
Risk of programmer confusion and errors due to using multiple character sets in a single codebase.
No limitation, you can use ANSI strings in Unicode applications.
Some details: whether an application is "Unicode" is a compile-time setting. At run time, the program can work with both Unicode and ANSI strings.
For example:
char* ptr1; // this is always ANSI string
wchar_t* ptr2; // this is always Unicode string
TCHAR* ptr3; // this is generic string, which is compiled as char* or wchar_t*
The Unicode and ANSI configurations differ in how they interpret generic-text macros such as TCHAR. Some Windows APIs are also implemented using generic-text macros. For example, SetWindowText is actually a macro, which expands to SetWindowTextA in the ANSI configuration and to SetWindowTextW in the Unicode configuration.
Any non-generic string type or API name (like char*, SetWindowTextW etc.) works the same way in any program configuration.
Use ATL conversion macros to convert between different (generic and non-generic) string types: http://msdn.microsoft.com/en-us/library/87zae4a3%28v=vs.80%29.aspx
You can use Ansi-based APIs in a Unicode application. Simply convert your input Unicode strings to Ansi when passing them to the API, and convert any output Ansi strings to Unicode upon return from the API. You can use WideCharToMultiByte() and MultiByteToWideChar() for that, or higher-level wrappers like CString, ATL conversions, etc.
Recently, I have gotten interested in text encoding. As you know, there are many kinds of text encoding, such as CP949, UTF-8 and so on.
I am wondering how to express them properly (to the screen and users). I mean, they are different from each other. I remember there was a particular way to express text according to its encoding in C#.
Is it possible one can use just simple printf() in C to express string regardless of encoding? Does the compiler automatically do it?
Read Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
From the article:
We decided to do everything internally in UCS-2 (two byte) Unicode,
which is what Visual Basic, COM, and Windows NT/2000/XP use as their
native string type. In C++ code we just declare strings as wchar_t
("wide char") instead of char and use the wcs functions instead of the
str functions (for example wcscat and wcslen instead of strcat and
strlen). To create a literal UCS-2 string in C code you just put an L
before it as so: L"Hello".
I want to make my Win32 C++ application able to be played on any encoding version (UNICODE & ANSI). Now I am a little confused as to what exactly is the difference between the two(or more?) encodings?
To make my Win32 application cross-encoding compatible, does that mean I have to go through my code and replace every std::string with std::wstring, then replace every char* with a wchar_t*, and then replace every string literal ("") with L""?
What will happen if my application runs on a UNICODE machine & my application has one std::string in it?
Do you have any advice on the steps I need to take to make my application cross-encoding compatible?
For eg:
- Change all c_strings & strings to their UNICODE equivalent
- Change any Win32 functions to the Unicode version (eg, change from getenv() to _wgetenv())
What will happen if my application runs on a UNICODE machine & my application has one std::string in it?
Computers are not ANSI or Unicode, but the operating systems they run are. The last version of Windows that didn't support Unicode was Windows 3.11 for Workgroups. If you run an ANSI-compiled application on a Unicode system, it still runs, because the ANSI variants of the API functions remain available.
What exactly is the difference between the two(or more?) encodings?
What is ASCII?
ASCII is a seven-bit encoding technique which assigns a number to each of the 128 characters used most frequently in American English. This allows most computers to record and display basic text. ASCII does not include symbols frequently used in other countries.
What is Unicode?
One major drawback of ASCII is that it can represent only 128 different characters (256 in 8-bit extensions such as Latin-1). However, languages such as Japanese and Arabic have thousands of characters, so ASCII does not work in these situations. The result was Unicode, which originally allowed for up to 65,536 different characters and has since been extended to 1,114,112 code points.
Unicode is an attempt by ISO and the Unicode Consortium to develop a coding system for electronic text that includes every written alphabet in existence. Unicode uses 8-, 16-, or 32-bit code units depending on the specific representation (UTF-8, UTF-16, or UTF-32), so Unicode documents often require up to twice as much disk space as ASCII or Latin-1 documents. The first 256 characters of Unicode are identical to Latin-1.
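To make the "depending on the specific representation" point concrete, here is a small sketch that reports how many code units a single code point needs in UTF-8 and in UTF-16. The function names utf8_units and utf16_units are hypothetical, and both assume a valid code point (at most U+10FFFF and not a surrogate).

```cpp
// Hypothetical helpers, assuming cp is a valid Unicode code point.

// Number of 8-bit code units (bytes) UTF-8 needs for one code point.
int utf8_units(char32_t cp)
{
    if (cp < 0x80)    return 1;  // ASCII range
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

// Number of 16-bit code units UTF-16 needs for one code point.
int utf16_units(char32_t cp)
{
    return cp < 0x10000 ? 1 : 2;  // above U+FFFF a surrogate pair is required
}
```

For example, U+222A (∪) takes three bytes in UTF-8 but a single 16-bit unit in UTF-16, while U+1F600 takes four bytes in UTF-8 and two 16-bit units in UTF-16. In UTF-32 every code point is exactly one 32-bit unit.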
In Win32, UNICODE is supported by #define-ing the UNICODE and _UNICODE macros. This, in turn, causes your program to use the Unicode variants of the Win32 functions.
Do you have any advice on the steps I need to take to make my application cross-encoding compatible?
Each Win32 function that takes or returns a string has two variants, one for ANSI and one for Unicode. The function call resolves to one of these, depending on whether or not the UNICODE macro is defined. So you should define the macro and start using the Unicode versions of the functions. For example:
Replacing every std::string with std::wstring,
Replacing every char with wchar_t
Replacing every string literal ("") with L""
Making use of the TCHAR support in Windows etc.
These, as you pointed out, are things that you will have to take care of. Mind you, this is not the complete list.
Basically, you will have to use the Unicode versions of all the types and function calls in your code.
The last version of Windows that did not use Unicode internally was Windows ME. The recommendation for new code is to use Unicode exclusively. Some conversion may be necessary when you need to read and write files that are encoded with a specific code page.
You're on the right track with your initial thoughts. If you're using Microsoft's CString, it comes in two versions CStringA and CStringW - you need to change one compiler definition and it will use CStringW in every place that you specify CString, and everything will just work. You should use std::wstring instead of std::string. Prefix every string literal with L"" or use Microsoft's macro _T("") which will convert to the same thing.
When you compile a program for ANSI or Unicode, you're affecting two things.
Which set of APIs gets called. Suppose your code calls CreateFile(). The actual API called is either CreateFileA() or CreateFileW() (ANSI or Wide, i.e. Unicode) depending on your compiler setting. Internally the NT kernel uses Unicode for all APIs. The ANSI APIs simply convert their string parameters to Unicode and call the Unicode APIs. Many APIs are Unicode only.
How T* macros are expanded. TCHAR will eventually be expanded to char in ANSI mode, wchar_t in Unicode mode.
Things like std::string and std::wstring are not affected until you need to call an API and want to pass a string to them. The use of string vs. wstring should be determined by your program's needs and not whether it's compiled ANSI or Unicode.
You can use ATL to easily convert strings as necessary.
// assume compiled for Unicode
#include <atlbase.h>
void myfunc() {
USES_CONVERSION;
std::string filename = "...";
HANDLE hFile = CreateFile(A2W(filename.c_str()), ...
or, if you prefer, you can use A2T() and your code will work whether it's compiled for ANSI or Unicode.
You can use TCHAR in your case.
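In UNICODE builds, TCHAR is WCHAR.
In non-UNICODE builds, TCHAR is CHAR.
If you want to use std::string, I recommend the following usage.
#include <string>
#ifdef UNICODE
typedef std::wstring tstring;
#else
typedef std::string tstring;
#endif
and,
Use tstring in your program. Note that tstring is declared outside namespace std, because user code may not add names to std and a macro name cannot contain ::.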