Converting UTF-8 Characters to Upper/Lower case C++

I have a string that contains UTF-8 characters, and I have a method that is supposed to convert every character to either upper or lower case. This is easily done for characters that overlap with ASCII, and obviously some characters cannot be converted, e.g. any Chinese character. However, is there a good way to detect and convert other characters that have upper and lower case forms, e.g. all the Greek characters? Also please note that I need to be able to do this on both Windows and Linux.
Thank you,

Have a look at ICU.
Note that lower case to upper case functions are locale-dependent. Think about Turkish, where the (ASCII) letter I lowercases to "dotless lowercase i" (ı) and the (ASCII) letter i uppercases to "uppercase I with a dot" (İ).

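Assuming that you have access to wctype.h, convert your text to a wide-character (wchar_t) Unicode string, which has 2-byte units on Windows and usually 4-byte units on Linux, and call towupper() on each character. Then convert it back to UTF-8.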

On Linux, or with a standard library that supports it, you would obtain a std::locale object for the appropriate locale, as uppercase conversion is locale-specific. Convert each UTF-8 character to a wchar_t, then call std::toupper() on it (the two-argument overload from <locale> that takes the locale), then convert back to UTF-8. Note that the resulting string might be longer or shorter, and some ligatures might not work properly: ß to SS in German is the example everyone keeps bringing up.
On Windows, this approach will work even less of the time, because wide characters are UTF-16 and not a fixed-width encoding (which violates the C++ language standard, but then maybe the standards committee shouldn't have tried to bluff Microsoft into breaking the Windows API). There is a ToUpper method in the CLR.
It is probably easier to use a portable library such as ICU.
Also make sure whether what you want is uppercase (capitalizing every letter) or titlecase (capitalizing the first letter of a string, or the first part of a ligature).
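To make the decode, map, re-encode pipeline described above concrete, here is a minimal sketch that does not rely on any locale. The names decode_one, append_utf8, to_upper_sketch and utf8_to_upper_sketch are made up for illustration, the input is assumed to be valid UTF-8, and the case table deliberately covers only ASCII and the basic Greek letters. A real program would get the mapping from ICU or the platform instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Assumes valid UTF-8; there is no error handling in this sketch.
char32_t decode_one(const std::string& s, std::size_t& i) {
    unsigned char b = static_cast<unsigned char>(s[i++]);
    if (b < 0x80) return b;                          // 1-byte sequence (ASCII)
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
    char32_t cp = b & (0x3F >> extra);               // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Append the UTF-8 encoding of one code point to out.
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical, deliberately tiny case table: ASCII and basic Greek only.
char32_t to_upper_sketch(char32_t cp) {
    if (cp >= U'a' && cp <= U'z') return cp - 0x20;
    if (cp == 0x03C2) return 0x03A3;                 // final sigma -> capital sigma
    if (cp >= 0x03B1 && cp <= 0x03C9) return cp - 0x20;
    return cp;                                       // everything else is left alone
}

std::string utf8_to_upper_sketch(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); )
        append_utf8(out, to_upper_sketch(decode_one(in, i)));
    return out;
}
```

Because the mapping works on whole code points, characters without a case (such as Chinese ones) pass through byte-for-byte. One-to-many mappings such as ß to SS do not fit this one-in, one-out shape, which is one more reason to prefer a library.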

Related

Can I read åäö from wxWidget wxTextCtrl?

In C++, I have a wxTextCtrl and want the user to input a Swedish translation of a word.
Everything works until the user types å, ä or ö (UTF-8).
Converting wxString to UTF-8 is not the problem; the problem is that I cannot even get the text out of the field. For the rest of the text I use the following (where ans is a pointer to the text field), and it works perfectly. Any idea?
std::string ch = std::string((ans->GetValue()));
You can't convert an arbitrary Unicode string to std::string without specifying the encoding. By default, the encoding is that of the current locale, which, especially under Windows, is not necessarily UTF-8. UTF-8 is almost certainly what you want to use, precisely because any characters not representable in the locale encoding are simply lost during conversion.
So the correct thing to do is to explicitly use ans->GetValue().ToUTF8(), and then your std::string will contain the UTF-8-encoded representation of your characters. Of course, you need to realize that the string won't necessarily have length 1, even for a single character, so perhaps you need to use std::wstring instead.
P.S. In wxWidgets 3.1.5+ you also have utf8_string() directly returning std::string, so you can also use this one if you have a new enough version.
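To illustrate the point about length, here is a small sketch that does not depend on wxWidgets. The helper utf8_code_points is a made-up name; it counts code points in a UTF-8 std::string, which is not the same as size() once å, ä or ö are involved.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is not a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}
```

For the single character å, encoded as the two bytes C3 A5, size() reports 2 while this function reports 1.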

Does `std::wregex` support utf-16/unicode or only UCS-2?

With c++11 the regex library was introduced into the standard library.
On the Windows/MSVC platform wchar_t has size of 2 (16 bit) and wchar_t* is normally utf-16 when interfacing with the system/platform (eg. CreateFileW).
However it seems that std::regex isn't utf-8 or does not support it, so I'm wondering whether std::wregex supports utf-16 or just ucs2 ?
I do not find any mention of this (Unicode or the like) in the documentation. In other languages normalization takes place.
The question is:
Is std::wregex representing ucs2 when wchar_t has size of 2 ?
The C++ standard doesn't enforce any encoding on std::string and std::wstring. They're simply a sequence of CharT. Only std::u8string, std::u16string and std::u32string have a defined encoding.
What encoding does std::string.c_str() use?
Does std::string in c++ has encoding format
Similarly, std::regex and std::wregex are also built around std::basic_string and CharT. Their constructors accept a std::basic_string, and whatever encoding is used for the std::basic_string will also be used for the std::basic_regex. So the statement "std::regex isn't utf-8 or does not support it" is wrong. If the current locale is UTF-8 then std::regex and std::string will be UTF-8 (yes, modern Windows does support a UTF-8 locale).
On Windows std::wstring uses UTF-16, so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2 because UTF-16 is self-synchronizing, so a proper needle string can never match starting in the middle of a character in the haystack. The same applies to UTF-8. If the tool doesn't understand UTF-16 then it's highly likely that it doesn't know that UTF-8 is variable-length either, and will truncate the UTF-8 in the middle of a character.
Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.
https://en.wikipedia.org/wiki/UTF-8#Description
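The "backing up at most 3 bytes" property quoted above can be sketched in a few lines. The function names is_continuation and char_start are hypothetical, and the input is assumed to be valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Given any byte offset inside a valid UTF-8 string, return the offset of the
// lead byte of the character that contains it.
std::size_t char_start(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos])))
        --pos;
    return pos;
}
```

For the bytes of "a→b" (61 E2 86 92 62), any offset from 1 to 3 maps back to offset 1, the lead byte of the arrow.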
The only things you need to care about are: avoid truncating in the middle of a character, and normalize the string before matching if necessary. The former issue can be avoided in UCS-2-only regex engines if you never use characters outside the BMP in a character class, as noted in the comments. Replace them with a group (an alternation) instead.
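As an illustration of "avoid truncating in the middle of a character" for UTF-16, here is a minimal sketch using std::u16string. The helper names is_high_surrogate and truncate_utf16 are made up for this example, and the input is assumed to be well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// High (leading) surrogates occupy the range U+D800..U+DBFF.
bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// Truncate to at most max_units code units, backing off by one unit if the
// cut would otherwise land between the two halves of a surrogate pair.
std::u16string truncate_utf16(const std::u16string& s, std::size_t max_units) {
    if (max_units >= s.size()) return s;
    if (max_units > 0 && is_high_surrogate(s[max_units - 1]))
        --max_units;
    return s.substr(0, max_units);
}
```

A tool that only knows UCS-2 would cut blindly at max_units and could leave a lone high surrogate at the end of the result.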
In other languages normalization takes place.
This is wrong. Some languages may do normalization before matching a regex, but that definitely doesn't apply to all "other languages"
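If you want a little bit more assurance, use std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively. Note that the standard library only provides std::regex_traits specializations for char and wchar_t, so these need a custom traits class. You'll still need a UTF-16-aware library though; otherwise that will still only work for regex strings that only contain words.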
The better solution may be changing to another library like ICU regex. You can check Comparison of regular expression engines for some suggestions. It even has a column indicating native UTF-16 support for each library
Related:
Do C++11 regular expressions work with UTF-8 strings?
How well is Unicode supported in C++11?
How do I properly use std::string on UTF-8 in C++?
How to use Unicode range in C++ regex
See also
Unicode Regular Expressions
Unicode Support in the Standard Library

How to search a non-ASCII character in a c++ string?

string s="x1→(y1⊕y2)∧z3";
for(auto i=s.begin(); i!=s.end();i++){
if(*i=='→'){
...
}
}
The char comparing is definitely wrong, what's the correct way to do it? I am using vs2013.
First you need some basic understanding of how programs handle Unicode. If you don't have that yet, you should read up; I quite like this post on Joel on Software.
You actually have 2 problems here:
Problem #1: getting the string into your program
Your first problem is getting that actual string in your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.
Either save your C++ file as UTF-16 (which Windows confusingly calls Unicode) and use wchar_t and wstring (effectively encoding the expression as UTF-16). Saving as UTF-8 with BOM will also work. With any other encoding, your L"..." literals will contain the wrong characters.
Note that other platforms may define wchar_t as 4 bytes instead of 2. So the handling of characters above U+FFFF will be non-portable.
In all other cases, you can't just write those characters in your source file. The most portable way is encoding your string literals as UTF-8, using \x escape codes for all non-ASCII characters. Like this: "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)". The literal is split before the b because a \x escape swallows every hex digit that follows it.
And yes, that's as unreadable and cumbersome as it gets. The root problem is MSVC doesn't really support using UTF-8. You can go through this question here for an overview: How to create a UTF-8 string literal in Visual C++ 2008 .
But, also consider how often those strings will actually show up in your source code.
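As a small sketch of the escaped-literal approach for Problem #1 (the function name expression_utf8 is made up for illustration), this is what the expression "x1→(a⊕b)" looks like as raw UTF-8 bytes:

```cpp
#include <string>

// "x1→(a⊕b)" spelled as UTF-8 bytes. The literal is split before "b" because
// a \x escape consumes every following hex digit: "\x95b" would be parsed as
// the single out-of-range escape \x95b, not as byte 0x95 followed by 'b'.
std::string expression_utf8() {
    return "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)";
}
```

The result is 12 bytes long: 6 ASCII characters plus 3 bytes each for the arrow and the circled plus.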
Problem #2: finding the character
(If you're using UTF-16, you can just find the L'→' character, since that character is representable as one wchar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
It's impossible to define a char representing the arrow character. You can, however, with a string: "\xe2\x86\x92" (that's a string with 3 chars for the arrow, plus the \0 terminator).
You can now search for this string in your expression:
s.find("\xe2\x86\x92");
The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind this is an offset in bytes.
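A slightly fuller sketch of the same idea, collecting every occurrence rather than only the first. The helper name find_all is hypothetical, and both strings are assumed to be valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Return the byte offsets of every non-overlapping occurrence of needle.
// Works for UTF-8 because a complete character sequence can never match
// starting in the middle of another character.
std::vector<std::size_t> find_all(const std::string& haystack,
                                  const std::string& needle) {
    assert(!needle.empty());
    std::vector<std::size_t> hits;
    for (std::size_t pos = haystack.find(needle); pos != std::string::npos;
         pos = haystack.find(needle, pos + needle.size()))
        hits.push_back(pos);
    return hits;
}
```

For the expression from the question, find_all(s, "\xe2\x86\x92") reports a single hit at byte offset 2, right after "x1". Remember that these are byte offsets, not character indices.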
My comment is too large, so I am submitting it as an answer.
The problem is that everybody is concentrating on the issue of the different encodings that Unicode may use (UTF-8, UTF-16, UCS-2, etc.). But that is only where your problems begin.
There is also an issue of composite characters, which will really mess up any search that you are trying to make.
Let's say you are looking for the character 'é'. You find it in Unicode as U+00E9 and do your search, but it is not guaranteed that this is the only way to represent this character. The document may also contain the combination U+0065 U+0301, which is actually exactly the same character.
Yes, not just a "character that looks the same": it is exactly the same, so any software and even some programming libraries will freely convert from one to the other without even telling you.
So if you wish to make a search that is robust, you will need something that represents not just the different encodings of Unicode, but Unicode characters themselves, with equality between composite (combining-sequence) and ready-made (precomposed) characters.
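To see the problem in bytes, here is a minimal sketch. The helper find_e_acute is a made-up name, and it only handles this one character; a robust solution would normalize both strings first (for example with ICU) instead of enumerating forms by hand.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Look for "é" in either of its two canonically equivalent UTF-8 spellings
// and return the byte offset of the first hit, or npos if neither occurs.
std::size_t find_e_acute(const std::string& utf8) {
    static const std::string precomposed = "\xc3\xa9";   // U+00E9
    static const std::string decomposed  = "e\xcc\x81";  // U+0065 U+0301
    std::size_t a = utf8.find(precomposed);
    std::size_t b = utf8.find(decomposed);
    return std::min(a, b);  // npos is the largest value, so a real hit wins
}
```

A plain find for the precomposed bytes C3 A9 misses a document that stores the same word in decomposed form, which is exactly the trap described above.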

Detecting Multi-Byte Character Encodings

What C/C++ libraries are there for detecting the multi-byte character encoding (UTF-8, UTF-16, etc.) of a character array (char*)? A bonus would be to also detect where the matcher halted, that is, to detect prefix match ranges for a given set of possible encodings.
ICU does character set detection. You must note that, as the ICU documentation states:
This is, at best, an imprecise operation using statistics and
heuristics. Because of this, detection works best if you supply at
least a few hundred bytes of character data that's mostly in a single
language.
If the input is only ASCII, there's no way to detect what should be done had there been any high-bit-set bytes in the stream. May as well just pick UTF-8 in that case.
As for UTF-8 vs. ISO-8859-x, you could try parsing the input as UTF-8 and fall back to ISO-8859 if the parse fails, but that's about it. There's not really a way to detect which ISO-8859 variant is there. I'd recommend looking at the way Firefox tries to auto-detect, but it's not foolproof and probably depends on knowing the input is HTML.
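The "try parsing as UTF-8, fall back if it fails" idea can be sketched with a strict validator. The function name is_valid_utf8 is hypothetical; the checks (continuation bytes, overlong forms, surrogates, the U+10FFFF limit) follow the UTF-8 definition.

```cpp
#include <cstddef>
#include <string>

// Return true if s is well-formed UTF-8. If this fails, the caller can fall
// back to treating the bytes as some ISO-8859 variant.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;                    // stray continuation or invalid lead byte
        if (i + len > n) return false;        // sequence is cut off at the end
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;   // overlong encoding
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}
```

Pure ASCII input passes trivially, which matches the remark above that ASCII-only data tells you nothing. Text in ISO-8859-1 with accented letters almost always fails, because a lone high byte is rarely followed by valid continuation bytes.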
In general, it is not possible to detect the character encoding unless the text has some special mark denoting the encoding. You could heuristically detect an encoding using dictionaries that contain words with characters that are only present in some encodings.
This can of course only be a heuristic and you need to scan the whole text.
Example: "an English text can be written in multiple encodings". This sentence can be written for example using a German codepage. It's indistinguishable from most "western" encodings (including UTF-8) unless you add some special characters (like ä) that are not present in ASCII.
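One common "special mark" is a byte order mark (BOM) at the start of the data. Here is a minimal sketch of checking for one; the function name detect_bom and the returned labels are made up for illustration, and an empty result only means no BOM was found, not that the encoding is known.

```cpp
#include <cstddef>
#include <string>

// Return the encoding implied by a leading byte order mark, or "" if none.
std::string detect_bom(const std::string& data) {
    auto starts_with = [&data](const char* bom, std::size_t n) {
        return data.size() >= n && data.compare(0, n, bom, n) == 0;
    };
    if (starts_with("\xEF\xBB\xBF", 3))     return "UTF-8";
    if (starts_with("\xFF\xFE\x00\x00", 4)) return "UTF-32LE";  // must precede UTF-16LE
    if (starts_with("\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (starts_with("\xFF\xFE", 2))         return "UTF-16LE";
    if (starts_with("\xFE\xFF", 2))         return "UTF-16BE";
    return "";
}
```

Even this is not fully unambiguous: UTF-16LE text that begins with a BOM followed by a NUL character has the same first four bytes as a UTF-32LE BOM.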

Converting wide char string to lowercase in C++

How do I convert a wchar_t string from upper case to lower case in C++?
The string contains a mixture of Japanese, Chinese, German and Greek characters.
I thought about using towlower...
http://msdn.microsoft.com/en-us/library/8h19t214%28VS.80%29.aspx
.. but the documentation says that:
The case conversion of towlower is locale-specific. Only the characters relevant to the current locale are changed in case.
Edit: Maybe I should describe what I'm doing. I receive a Unicode search query from a user. It's originally in UTF-8 encoding, but I'm converting it to a wide-character string (I may be wrong on the wording). My debugger (VS2008) correctly shows the Japanese, German, etc. characters in the "variable quick watch". I need to go through another set of data in Unicode and find matches of the search string. While this is no problem for me to do when the search is case sensitive, it's more problematic to do it case insensitively. My (maybe naive) approach to solve the problem would be to convert all input data and output data to lower case and then compare it.
If your string contains all those characters, the codeset must be Unicode-based. If implemented properly, Unicode (Chapter 4 'Character Properties') defines character properties including whether the character is upper case and the lower case mapping, and so on.
Given that preamble, the towlower() function from <wctype.h> is the correct tool to use. If it doesn't do the job, you have a QoI (Quality of Implementation) problem to discuss with your vendor. If you find the vendor unresponsive, then look at alternative libraries. In this case, you might consider ICU (International Components for Unicode).
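A minimal sketch of the lower-case-then-compare approach using towlower() follows. The helper names to_lower_wide and equals_ignore_case are made up for illustration. How many non-ASCII characters actually get mapped depends on the current locale and on the implementation, and on Windows characters outside the BMP are not handled because they occupy two wchar_t units.

```cpp
#include <cwctype>
#include <string>

// Lower-case every wide character with towlower(), using the current C locale.
std::wstring to_lower_wide(std::wstring s) {
    for (wchar_t& c : s)
        c = static_cast<wchar_t>(std::towlower(static_cast<std::wint_t>(c)));
    return s;
}

// Case-insensitive equality by comparing the lower-cased forms.
bool equals_ignore_case(const std::wstring& a, const std::wstring& b) {
    return to_lower_wide(a) == to_lower_wide(b);
}
```

Characters with no case distinction, such as Japanese or Chinese ones, are returned unchanged by towlower(), so mixed-script strings can go through the same loop.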
You have a nasty problem on your hands. A Japanese locale will not help with converting German, and vice versa. There are also languages which do not have the concept of capitalization (toupper and friends would be a no-op there, I suppose). So, can you break up your string into individual chunks of words from the same language? If you can, then you can convert the pieces and string them back together.
This SO answer shows how to use facets to work with several locales. If this is on Windows, you can consider using Win32 API functions; if you can work with C++.NET (managed C++), you can use the char.ToLower and string.ToLower functions, which are Unicode compliant.
Have a look at _wcslwr_l in <wchar.h> (MSDN).
You should be able to run the function on the input for each of the locales.