How to store Unicode characters in an array? - c++

I'm writing a C++ wxWidgets calculator application, and I need to store the characters for the operators in an array. I have something like int ops[10] = {'+', '-', '*', '/', '^'};. What if I wanted to also store characters such as √, ÷ and × in said array, in a way so that they are also displayable inside a wxTextCtrl and a custom button?

This is actually a hairy question, even though it does not look like it at first. Your best option is to use Unicode escape sequences instead of typing the special characters directly in your source code editor.
wxString ops[]={L"+", L"-", L"*", L"\u00F7"};
You need to make sure that characters such as √, ÷ and × are being compiled correctly.
Your source file (.cpp) needs to store them in a way that ensures the compiler generates the correct characters. This is harder than it looks, especially once svn, git, Windows and Linux are involved.
Most of the time .cpp files are ANSI/8-bit encoded and do not support Unicode constants out of the box.
You could save your source file as UTF-8 so that these characters are preserved, but not all compilers accept UTF-8.
The best way around this is to encode the characters using Unicode escape sequences.
wxString div(L"\u00F7"); is the string for ÷. Or, in your case, perhaps wxChar div(L'\u00F7');. You have to look up the Unicode code points for the other special characters. This way your source file contains ASCII characters only and will be accepted by all compilers. You will also avoid code page problems when you exchange source files across OS platforms.
Then you have to make sure that you compile wxWidgets with UNICODE awareness (although I think this is the default for wx3.x). Then, if your OS supports it, these special characters should show up.
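For example, a minimal sketch of the operator array written entirely with escape sequences (the control names myTextCtrl and opButton are just placeholders, not from your actual code):
const wxString ops[] = { L"+", L"-", L"*", L"/", L"^",
                         L"\u221A", L"\u00F7", L"\u00D7" };  // √, ÷, ×

// Display an operator in a text control and on a custom button:
myTextCtrl->AppendText(ops[6]);   // appends ÷
opButton->SetLabel(ops[5]);       // labels the button √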
Read up on Unicode escape sequences (Wikipedia). Good input is also found on utf8everywhere.org. An .editorconfig file can also be of help.

Prefer to use wchar_t instead of int.
wchar_t ops[10] = {L'+', L'-', L'*', L'/', L'^', L'√', L'÷', L'×'};
These trivially support the characters you describe, and trivially and correctly convert to wxStrings.
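A small sketch of that conversion (assuming a Unicode build of wxWidgets; the variable names are illustrative):
wchar_t ops[10] = { L'+', L'-', L'*', L'/', L'^', L'\u221A', L'\u00F7', L'\u00D7' };  // √ ÷ ×

wxString divide(ops[6]);   // single-character wxString containing ÷
wxString all(ops);         // the unused trailing elements are 0, so ops is NUL-terminated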

Related

Storing math symbols into string c++

Is there a way to store math symbols into strings in c++ ?
I notably need the union/intersection symbols.
Thanks in advance!
This seemingly simple question is actually a tangle of multiple questions:
What character set to use?
Unicode is almost certainly the best choice nowadays.
What encoding to use?
C++ std::strings are strings of chars, but you can decide how those chars correspond to "characters" in your character set. The default representation assumed by the language and the system could be ASCII, some random code page like Latin-1 or Windows-1252, or UTF-8.
If you're on Linux or Mac, your best bet is to use UTF-8. If you're on Windows, you might choose to use wide strings instead (std::wstring), and to use UTF-16 as the encoding. But many people suggest that you always use UTF-8 in std::strings even on Windows, and simply convert from and to UTF-16 as needed to do I/O.
How to specify string literals in the code?
To store UTF-8 in older versions of C++ (before C++11), you could manually encode your string literals like this:
const std::string subset = "\xE2\x8A\x82";
To store UTF-8 in C++11 or newer, you use the u8 prefix to tell the compiler you want UTF-8 encoding. (In C++20 a u8 literal has type const char8_t[], so it no longer converts to std::string directly; the examples below assume C++11 through C++17.) You can use escaped characters:
const std::string subset = u8"\u2282";
Or you can enter the character directly into the source code:
const std::string subset = u8"⊂";
I tend to use the escaped versions to avoid worrying about the encoding of the source file and whether all the editors and viewers and IDEs I use will consistently understand the source file encoding.
If you're on Windows and you choose to use UTF-16 instead, then, regardless of C++ version, you can specify wide string literals in your code like this:
const std::wstring subset = L"\u2282"; // or L"⊂";
How to display these strings?
This is very system dependent.
On Mac and Linux, I suspect things will generally just work.
In a console program on Windows (e.g., one that just uses <iostream> or printf to display in a command prompt), you're probably in trouble because the legacy command prompts don't have good Unicode and font support. (Maybe this is better on Windows 10?)
In a GUI program on Windows, you have to make sure you use the "Unicode" version of the API and to give it the wide string. ("Unicode" is in quotation marks here because the Windows API documentation often uses "Unicode" to mean a UTF-16 encoded wide character string, which isn't exactly what Unicode means.) So if you want to use an API like TextOut or MessageBox to display your string, you have to make sure you do two things: (1) call the "wide" version of the API, and (2) pass a UTF-16 encoded string.
You solve (1) by explicitly calling the wide versions (e.g., TextOutW or MessageBoxW) or by making sure you compile with "Unicode" selected in your project settings. (You can also do it by defining several C++ preprocessor macros instead, but this answer is already long enough.)
For (2), if you are using std::wstrings, you're already done. If you're using UTF-8, you'll need to make a wide copy of the string to pass to the output function. Windows provides MultiByteToWideChar for making such a copy. Make sure you specify CP_UTF8.
For (2), do not try to call the narrow versions of the API functions themselves (e.g., TextOutA or MessageBoxA). These will convert your string to a wide string automatically, but they do so assuming the string is encoded in the user's current code page. If the string is really in UTF-8, then these will do the wrong thing for all of the "interesting" (non-ASCII) characters.
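A minimal sketch of that conversion, assuming the input is a valid UTF-8 std::string (error handling kept to the bare minimum):
#include <windows.h>
#include <string>

// Convert a UTF-8 std::string to a UTF-16 std::wstring for the "W" APIs.
std::wstring Utf8ToWide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
}

// Usage: MessageBoxW(nullptr, Utf8ToWide("\xE2\x8A\x82").c_str(), L"Demo", MB_OK);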
How to read these strings from a file, a socket, or the user?
This is very system specific and probably worth a separate question.
Yes, you can, as follows:
std::string unionChar = "∪";
std::string intersectionChar = "∩";
They are just characters, but don't expect this code to be portable. You could also use Unicode escape sequences, as follows:
std::string unionChar = u8"\u222A";
std::string intersectionChar = u8"\u2229";

How to search a non-ASCII character in a c++ string?

string s="x1→(y1⊕y2)∧z3";
for(auto i=s.begin(); i!=s.end();i++){
if(*i=='→'){
...
}
}
The char comparison is definitely wrong; what's the correct way to do it? I am using VS2013.
First you need some basic understanding of how programs handle Unicode; if you don't have it yet, read up on it first. I quite like this post on Joel on Software.
You actually have 2 problems here:
Problem #1: getting the string into your program
Your first problem is getting that actual string in your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.
Either save your C++ file as UTF-16 (which Windows confusingly calls Unicode) and use wchar_t and wstring (effectively encoding the expression as UTF-16). Saving as UTF-8 with a BOM will also work. With any other encoding, your L"..." string literals will contain the wrong characters.
Note that other platforms may define wchar_t as 4 bytes instead of 2. So the handling of characters above U+FFFF will be non-portable.
In all other cases, you can't just write those characters in your source file. The most portable way is encoding your string literals as UTF-8, using \x escape codes for all non-ASCII characters. Like this: "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)".
And yes, that's as unreadable and cumbersome as it gets. The root problem is MSVC doesn't really support using UTF-8. You can go through this question here for an overview: How to create a UTF-8 string literal in Visual C++ 2008 .
But, also consider how often those strings will actually show up in your source code.
Problem #2: finding the character
(If you're using UTF-16, you can just find the L'→' character, since that character is representable as one wchar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
It's impossible to define a single char representing the arrow character. You can, however, represent it with a string: "\xe2\x86\x92" (that's a string with 3 chars for the arrow, plus the \0 terminator).
You can now search for this string in your expression:
s.find("\xe2\x86\x92");
The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind this is an offset in bytes.
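Putting it together, a minimal self-contained sketch of the search (the character comparison from the question replaced with a byte-string find):
#include <iostream>
#include <string>

int main() {
    // The expression from the question, written with \x escapes so it survives any source encoding:
    std::string s = "x1\xe2\x86\x92(y1\xe2\x8a\x95y2)\xe2\x88\xa7z3";   // "x1→(y1⊕y2)∧z3"
    std::string::size_type pos = s.find("\xe2\x86\x92");               // the UTF-8 bytes of '→' (U+2192)
    if (pos != std::string::npos)
        std::cout << "arrow found at byte offset " << pos << '\n';     // prints 2
    return 0;
}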
My comment is too large, so I am submitting it as an answer.
The problem is that everybody is concentrating on the issue of the different encodings that Unicode may use (UTF-8, UTF-16, UCS-2, etc.). But that is only where your problems begin.
There is also the issue of combining characters, which will really mess up any search that you are trying to make.
Let's say you are looking for the character 'é'. You find it in Unicode as U+00E9 and do your search, but it is not guaranteed that this is the only way to represent that character. The document may also contain the combination U+0065 U+0301, which is exactly the same character.
Yes, not just "a character that looks the same", but exactly the same character, so any software and even some programming libraries will freely convert from one form to the other without telling you.
So if you wish to make a robust search, you need something that handles not just the different Unicode encodings, but also Unicode normalization, so that composed sequences and precomposed (ready-made) characters compare equal.
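A tiny illustration of the problem (both strings render as 'é', yet a byte-wise comparison or find() treats them as different):
#include <string>

std::string precomposed = "\xC3\xA9";    // U+00E9: 'é' as one precomposed character
std::string combining   = "e\xCC\x81";   // U+0065 U+0301: 'e' plus a combining acute accent
bool equal = (precomposed == combining); // false, even though both display as 'é'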

Euro Symbol for C++

I am creating a small currency converting program in C++ in Visual Studio 2013.
For a pound sign, I have created a char variable using the Windows-1252 code for £ symbol.
const char poundSign(156);
I need to find a way in which I can implement a symbol for €, the euro symbol, that works when I run the program.
You should not use any of the legacy extended-ASCII encodings, as they are long obsolete nowadays. As user1937198 said, 156 is the character code of £ in the archaic OEM code page 437/850 (not in Windows-1252, where £ is 163). The appearance of non-ASCII characters in these encodings depends on the code page selected by the user, which makes it impossible to mix characters from different code pages under such a scheme. Additionally, your users will be very confused if they pick the wrong code page.
Consider using Unicode instead. Since you're on Windows, you should probably use UTF-16, in which case the correct way is to declare:
// make sure the source file is saved in a Unicode encoding the compiler understands (e.g., UTF-16 or UTF-8 with a BOM)!
const wchar_t poundSign[] = L"£";
const wchar_t euroSign[] = L"€";
In UTF-16, there's no guarantee that a single character will take only one 16-bit code unit, due to the surrogate pair mechanism, hence it's best to store "characters"* as strings.
Keep in mind that this means the rest of your program should switch to Unicode as well: use the "W" versions of the Windows API (the WCHAR-versions).
[*] In technical lingo, each "Unicode character" is referred to as a code point.
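Equivalently, to avoid depending on the source file's encoding at all, the same strings can be written with escape sequences and passed to a "W" API (MessageBoxW is just one example of such a function):
const wchar_t poundSign[] = L"\u00A3";   // £
const wchar_t euroSign[]  = L"\u20AC";   // €

// e.g. display the euro sign with a wide Windows API call:
// MessageBoxW(nullptr, euroSign, L"Currency", MB_OK);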
You have to distinguish between the translation environment and the execution environment.
E.g., which glyph is assigned to the character you intended to produce depends heavily on where your program is being run (cf. 'code page').
You may want to check whether your terminal (or whatever output device you use) is Unicode-aware and then just use the appropriate code points.
Also, you should not create character constants the way you did. Use a character literal instead.
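For example, a character literal instead of a bare integer (a sketch; assumes a Unicode-aware target):
const wchar_t poundSign = L'\u00A3';   // '£' as a wide character literal, not the raw code 156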

Using Different Character Encodings

Recently, I have gotten interested in text encoding. As you know, there are many kinds of text encoding, such as CP949, UTF-8 and so on.
I am wondering how to display them properly (on the screen, to users). I mean, they are different from each other. I remember there was a particular way to output text according to its encoding in C#.
Is it possible to just use a simple printf() in C to output a string regardless of its encoding? Does the compiler handle this automatically?
Read Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
From the article:
We decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t ("wide char") instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it as so: L"Hello".
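A minimal sketch of what the quoted approach looks like in practice (nothing here beyond standard wide-character functions):
#include <cwchar>
#include <cstdio>

int main() {
    const wchar_t* greeting = L"Hello";              // wide string literal, as in the quote
    std::wprintf(L"%ls is %zu code units long\n",
                 greeting, std::wcslen(greeting));   // wcslen instead of strlen
    return 0;
}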

Using Unicode in a C++ source file

I'm working with a C++ sourcefile in which I would like to have a quoted string that contains Asian Unicode characters.
I'm working with QT on Windows, and the QT Creator development environment has no problem displaying the Unicode. The QStrings also have no problem storing Unicode. When I paste in my Unicode, it displays fine, something like:
#define MY_STRING 鸟
However, when I save, my lovely Unicode characters all become ? marks.
I tried to open up the source file and resave it as Unicode encoded. It then displays and saves correctly in QT Creator. However, on compile, it seems like the compiler has no idea what to do with this, and throws a ton of misguided errors and warnings, such as "stray \255 in program" and "null character(s) ignored".
What's the correct way to include Unicode in C++ source files?
Personally, I don't use any non-ASCII characters in source code. The reason is that if you use arbitrary Unicode characters in your source files, you have to worry about the encoding the compiler considers the source file to be in, what execution character set it will use, and how it's going to do the source-to-execution character set conversion.
I think that it's a much better idea to have Unicode data in some sort of resource file, which could be compiled to static data at compile time or loaded at runtime for maximum flexibility. That way you can control how the encoding occurs, and not worry about how the compiler behaves, which may be influenced by the local locale settings at compile time.
It does require a bit more infrastructure, but if you're having to internationalize it's well worth spending the time choosing or developing a flexible and robust strategy.
While it's possible to use universal character escapes (L'\uXXXX') or explicitly encoded byte sequences ("\xXX\xYY\xZZ") in source code, this makes Unicode strings virtually unreadable for humans. If you're having translations made it's easier for most people involved in the process to be able to deal with text in an agreed universal character encoding scheme.
Using the L prefix and \u or \U notation for escaping Unicode characters:
Section 6.4.3 of the C99 specification defines the \u escape sequences.
Example:
#define MY_STRING L"A \u8801 B"
/* A congruent-to B */
Are you using a wchar_t interface? If so, you want L"\u1234" for a wide string containing Unicode character U+1234 (hex 0x1234). (Looking at the QString header file I think this is what you need.)
If not, and your interface is UTF-8, then you'll need to encode your character in UTF-8 first and then create a narrow string containing those bytes, e.g. "\xE9\xB8\x9F" (the UTF-8 encoding of 鸟, U+9E1F) or similar.
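For example, with Qt's own conversion helpers (QString::fromUtf8 and QString::fromWCharArray are existing Qt APIs; the character matches the one from the question):
// From a UTF-8 encoded narrow literal:
QString bird1 = QString::fromUtf8("\xE9\xB8\x9F");   // U+9E1F, 鸟

// From a wide literal using a universal character name:
QString bird2 = QString::fromWCharArray(L"\u9E1F");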