I am creating a small currency converting program in C++ in Visual Studio 2013.
For a pound sign, I have created a char variable using the Windows-1252 code for £ symbol.
const char poundSign(156);
I need to find a way in which I can implement a symbol for €, euro symbol. When I run the program.
You should not use any of the deprecated extended ASCII encodings as they are long obsolete nowadays. As user1937198 said, 156 is the character code of £ in the archaic Windows-1252* encoding. The appearance of non-ASCII characters in these encodings depends on the codepage selected by the user, which makes it impossible to mix characters from different codepages under such a scheme. Additionally, your users will be very confused if they pick the wrong codepage.
Consider using Unicode instead. Since you're on Windows, you should probably use UTF-16, in which case the correct way is to declare:
// make sure the source code is saved as UTF-16!
const wchar_t poundSign[] = L"£";
const wchar_t euroSign[] = L"€";
In UTF-16, there's no guarantee that a single character will take only 1 character due to the surrogate mechanism, hence it's best to store "characters"* as strings.
Keep in mind that this means the rest of your program should switch to Unicode as well: use the "W" versions of the Windows API (the WCHAR-versions).
[*] In technical lingo, each "Unicode character" is referred to as a code point.
You have to distinguish between the translation environment and the execution environment.
E.g., it heavily depends on where your program is being run what glyph is assigned to the character you intended to produce (cf. 'codepage').
You may want to check of your terminal or whatsoever is unicode-aware and then just use the appropriate codepoints.
Also, you should not create character constants the way you did. Use a character literal instead.
Related
I'm writing a C++ wxWidgets calculator application, and I need to store the characters for the operators in an array. I have something like int ops[10] = {'+', '-', '*', '/', '^'};. What if I wanted to also store characters such as √, ÷ and × in said array, in a way so that they are also displayable inside a wxTextCtrl and a custom button?
This is actually a hairy question, even though it does not look like it at first. Your best option is to use Unicode Control Sequences instead of adding the special characters with your Source Code Editor.
wxString ops[]={L"+", L"-", L"*", L"\u00F7"};
You need to make sure that characters such as √, ÷ and × are being compiled correctly.
Your sourcefile (.cpp) needs to store them in a way that ensures the compiler is generating the correct characters. This is harder than it looks, especially when svn, git, windows and linux are involved.
Most of the time .cpp files are ANSI or 8 bit encoded and do not support Unicode constants out of the box.
You could save your sourcefile in UTF-8 codepage so that these characters are preserved. But not all compilers accept UTF-8
The best way to do that is to encode these using Unicode control characters.
wxString div(L"\u00F7"); is the string for ÷. Or in your case perhaps wxChar div('\u00F7'). You have to look up the Unicode control sequences for the other special chars. This way your source file will contain ANSI chars only and will be accepted by all compilers. You will also avoid code page problems when you exchange source files with different OS platforms.
Then you have to make sure that you compile wxWidgets with UNICODE awareness (although I think this is the default for wx3.x). Then, if your OS supports it, these special characters should show up.
Read up on Unicode controls (Wikipedia). Also good input is found in utf8everywhere.org. The file .editorconfig can also be of help.
Prefer to use wchar_t instead of int.
wchar_t ops[10] = {L'+', L'-', L'*', L'/', L'^', L'√', L'÷', L'×'};
These trivially support the characters you describe, and trivially and correctly convert to wxStrings.
string s="x1→(y1⊕y2)∧z3";
for(auto i=s.begin(); i!=s.end();i++){
if(*i=='→'){
...
}
}
The char comparing is definitely wrong, what's the correct way to do it? I am using vs2013.
First you need some basic understanding of how programs handle Unicode. Otherwise, you should read up, I quite like this post on Joel on Software.
You actually have 2 problems here:
Problem #1: getting the string into your program
Your first problem is getting that actual string in your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.
either save your C++ file as UTF-16 (which Windows confusingly calls Unicode), and use whcar_t and wstring (effectively encoding the expression as UTF-16). Saving as UTF-8 with BOM will also work. Any other encoding and your L"..." character literals will contain the wrong characters.
Note that other platforms may define wchar_t as 4 bytes instead of 2. So the handling of characters above U+FFFF will be non-portable.
In all other cases, you can't just write those characters in your source file. The most portable way is encoding your string literals as UTF-8, using \x escape codes for all non-ASCII characters. Like this: "x1\xe2\x86\x92a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)".
And yes, that's as unreadable and cumbersome as it gets. The root problem is MSVC doesn't really support using UTF-8. You can go through this question here for an overview: How to create a UTF-8 string literal in Visual C++ 2008 .
But, also consider how often those strings will actually show up in your source code.
Problem #2: finding the character
(If you're using UTF-16, you can just find the L'→' character, since that character is representable as one whcar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
It's impossible to define a char representing the arrow character. You can however with a string: "\xe2\x86\x92". (that's a string with 3 chars for the arrow, and the \0 terminator.
You can now search for this string in your expression:
s.find("\xe2\x86\x92");
The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind this is an offset in bytes.
My comment is too large, so i am submitting it as an answer.
The problem is that everybody is concentrating on the issue of different encodings that Unicode may use (UTF-8, UTF-16, UCS2, etc). But your problems here will just begin.
There is also an issue of composite characters, which will really mess up any search that you are trying to make.
Let's say you are looking for a character 'é', you find it in Unicode as U+00E9 and do your search, but it is not guaranteed that this is the only way to represent this character. The document may also contain U+0065 U+0301 combination. Which is actually exactly the same character.
Yes, not just "character that looks the same", but it is exactly the same, so any software and even some programming libraries will freely convert from one to another without even telling you.
So if you wish to make a search, that is robust, you will need something that represents not just different encodings of Unicode, but Unicode characters themselves with equality between Composite and Ready-Made chars.
Recently, I have gotten interested in Text Encoding. As you know, there are many kinds of Text Encoding such as CRC949, UTF-8 and so on.
I am wondering how to express them properly. (To the screen and users.) I mean, they are different from each other. I remember there was particular way to express text accrording to encoding in C#.
Is it possible one can use just simple printf() in C to express string regardless of encoding? Does the compiler automatically do it?
Read Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
From the article:
We decided to do everything internally in UCS-2 (two byte) Unicode,
which is what Visual Basic, COM, and Windows NT/2000/XP use as their
native string type. In C++ code we just declare strings as wchar_t
("wide char") instead of char and use the wcs functions instead of the
str functions (for example wcscat and wcslen instead of strcat and
strlen). To create a literal UCS-2 string in C code you just put an L
before it as so: L"Hello".
I'm working with a C++ sourcefile in which I would like to have a quoted string that contains Asian Unicode characters.
I'm working with QT on Windows, and the QT Creator development environment has no problem displaying the Unicode. The QStrings also have no problem storing Unicode. When I paste in my Unicode, it displays fine, something like:
#define MY_STRING 鸟
However, when I save, my lovely Unicode characters all become ? marks.
I tried to open up the source file and resave it as Unicode encoded. It then displays and saves correctly in QT Creator. However, on compile, it seems like the compiler has no idea what to do with this, and throws a ton of misguided errors and warnings, such as "stray \255 in program" and "null character(s) ignored".
What's the correct way to include Unicode in C++ source files?
Personally, I don't use any non-ASCII characters in source code. The reason is that if you use arbitary Unicode characters in your source files, you have to worry about the encoding that the compiler considers the source file to be in, what execution character set it will use and how it's going to do the source to execution character set conversion.
I think that it's a much better idea to have Unicode data in some sort of resource file, which could be compiled to static data at compile time or loaded at runtime for maximum flexibility. That way you can control how the encoding occurs, at not worry about how the compiler behaves which may be influence by the local locale settings at compile time.
It does require a bit more infrastructure, but if you're having to internationalize it's well worth spending the time choosing or developing a flexible and robust strategy.
While it's possible to use universal character escapes (L'\uXXXX') or explicitly encoded byte sequences ("\xXX\xYY\xZZ") in source code, this makes Unicode strings virtually unreadable for humans. If you're having translations made it's easier for most people involved in the process to be able to deal with text in an agreed universal character encoding scheme.
Using the L prefix and \u or \U notation for escaping Unicode characters:
Section 6.4.3 of the C99 specification defines the \u escape sequences.
Example:
#define MY_STRING L"A \u8801 B"
/* A congruent-to B */
Are you using a wchar_t interface? If so, you want L"\u1234" for a wide string containing Unicode character U+1234 (hex 0x1234). (Looking at the QString header file I think this is what you need.)
If not and your interface is UTF-8 then you'll need to encode your character in UTF-8 first and then create a narrow string containing that, e.g. "\xE0\xF8" or similar.
I've recently tried to get the full picture about what steps it takes to create platform independent C++ applications that support unicode. A thing that is confusing to me is that most howtos and stuff equalize the character encoding (i.e. ANSI or Unicode) and the character type (char or wchar_t). As I've learned so far, these are different things and there may exist a character sequence encodeded in Unicode but represented by std::string as well as a character sequence encoded in ANSI but represented as std::wstring, right?
So the question that comes to my mind is whether the C++ standard gives any guarantee about the encoding of string literals starting with L or does it just say it's of type wchar_t with implementation specific character encoding?
If there is no such guaranty, does that mean I need some sort of external resource system to provide non ASCII string literals for my application in a platform independent way?
What is the prefered way for this? Resource system or proper encoding of source files plus proper compiler options?
The L symbol in front of a string literal simply means that each character in the string will be stored as a wchar_t. But this doesn't necessarily imply Unicode. For example, you could use a wide character string to encode GB 18030, a character set used in China which is similar to Unicode. The C++03 standard doesn't have anything to say about Unicode, (however C++11 defines Unicode char types and string literals) so it's up to you to properly represent Unicode strings in C++03.
Regarding string literals, Chapter 2 (Lexical Conventions) of the C++ standard mentions a "basic source character set", which is basically equivalent to ASCII. So this essentially guarantees that "abc" will be represented as a 3-byte string (not counting the null), and L"abc" will be represented as a 3 * sizeof(wchar_t)-byte string of wide-characters.
The standard also mentions "universal-character-names" which allow you to refer to non-ASCII characters using the \uXXXX hexadecimal notation. These "universal-character-names" usually map directly to Unicode values, but the standard doesn't guarantee that they have to. However, you can at least guarantee that your string will be represented as a certain sequence of bytes by using universal-character-names. This will guarantee Unicode output provided the runtime environment supports Unicode, has the appropriate fonts installed, etc.
As for string literals in C++03 source files, again there is no guarantee. If you have a Unicode string literal in your code which contains characters outside of the ASCII range, it is up to your compiler to decide how to interpret these characters. If you want to explicitly guarantee that the compiler will "do the right thing", you'd need to use \uXXXX notation in your string literals.
The C++03 does not mention unicode (future C++0x does). Currently you have to either use external libraries (ICU, UTF-CPP, etc.) or build your own solution using platform specific code. As others have mentioned, wchar_t encoding (or even size) is not specified. Consequently, string literal encoding is implementation specific. However, you can give unicode codepoints in string literals by using \x \u \U escapes.
Typically unicode apps in Windows use wchar_t (with UTF-16 encoding) as internal character format, because it makes using Windows APIs easier as Windows itself uses UTF-16. Unix/Linux unicode apps in turn usually use char (with UTF-8 encoding) internally. If you want to exchange data between different platforms, UTF-8 is usual choice for data transfer encoding.
The standard makes no mention of encoding formats for strings.
Take a look at ICU from IBM (its free). http://site.icu-project.org/