Converting wide char string to lowercase in C++

How do I convert a wchar_t string from upper case to lower case in C++?
The string contains a mixture of Japanese, Chinese, German and Greek characters.
I thought about using towlower...
http://msdn.microsoft.com/en-us/library/8h19t214%28VS.80%29.aspx
.. but the documentation says that:
The case conversion of towlower is locale-specific. Only the characters relevant to the current locale are changed in case.
Edit: Maybe I should describe what I'm doing. I receive a Unicode search query from a user. It's originally in UTF-8 encoding, but I'm converting it to a wide-character string (I may be wrong on the wording). My debugger (VS2008) correctly shows the Japanese, German, etc. characters in the "variable quick watch". I need to go through another set of data in Unicode and find matches of the search string. While this is no problem for me to do when the search is case sensitive, it's more problematic to do it case insensitively. My (maybe naive) approach to solving the problem would be to convert all input data and output data to lower case and then compare it.

If your string contains all those characters, the code set must be Unicode-based. Unicode (Chapter 4, 'Character Properties') defines character properties, including whether a character is upper case, what its lower-case mapping is, and so on; a properly implemented library can draw on these.
Given that preamble, the towlower() function from <wctype.h> is the correct tool to use. If it doesn't do the job, you have a QoI (Quality of Implementation) problem to discuss with your vendor. If you find the vendor unresponsive, then look at alternative libraries. In this case, you might consider ICU (International Components for Unicode).
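For illustration, a minimal sketch of the towlower() route. It assumes your implementation's wide character set is Unicode; the locale name "en_US.UTF-8" is an assumption, so substitute one your system actually has:

#include <clocale>
#include <cwctype>
#include <string>

// Lower-case a wide string using the C library's locale tables.
std::wstring to_lower_copy(std::wstring s) {
    std::setlocale(LC_ALL, "en_US.UTF-8");  // assumption: this locale exists
    for (wchar_t& c : s)
        c = static_cast<wchar_t>(std::towlower(static_cast<wint_t>(c)));
    return s;
}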

You have a nasty problem in hand. A Japanese locale will not help with converting German, and vice versa. There are also languages which do not have the concept of capitalization at all (toupper and friends would be a no-op there, I suppose). So, can you break your string up into individual chunks of words from the same language? If you can, then you can convert the pieces and string them back together.

This SO answer shows how to use facets to handle several locales. If this is on Windows, you can consider using Win32 API functions; if you can work with C++/CLI (managed C++), you can use the Char.ToLower and String.ToLower functions, which are Unicode compliant.
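For what it's worth, a sketch of the facet approach; the locale name is an assumption, so use whatever names your platform accepts (e.g. "de_DE.UTF-8" on Linux, "German" on Windows):

#include <locale>
#include <string>

// Lower-case a wide string via std::ctype<wchar_t> from a named locale.
std::wstring to_lower(std::wstring s, const char* locale_name) {
    std::locale loc(locale_name);
    const auto& ct = std::use_facet<std::ctype<wchar_t>>(loc);
    ct.tolower(&s[0], &s[0] + s.size());  // range overload, in place
    return s;
}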

Have a look at _wcslwr_l in <wchar.h> (MSDN).
You should be able to run the function on the input for each of the locales.
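A hedged sketch of how that might look (MSVC-specific; the locale name is an assumption, check what your CRT version accepts):

#include <locale.h>  // _create_locale, _free_locale
#include <wchar.h>   // _wcslwr_l

// Lower-case a NUL-terminated wide buffer once per locale of interest.
void lower_in_locale(wchar_t* buf, const char* locale_name) {
    _locale_t loc = _create_locale(LC_ALL, locale_name);  // e.g. "German"
    if (loc) {
        _wcslwr_l(buf, loc);  // modifies buf in place
        _free_locale(loc);
    }
}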

Related

How to achieve unicode-agnostic case insensitive comparison in C++

I have a requirement wherein my C++ code needs to do case insensitive comparison without worrying about whether the string is encoded or not, or the type of encoding involved. The string could be an ASCII or a non-ASCII, I just need to store it as is and compare it with a second string without concerning if the right locale is set and so forth.
Use case: Suppose my application receives a string (let's say it's a file name) initially as "Zoë Saldaña.txt" and it stores it as is. Subsequently, it receives another string "zoë saLdañA.txt", and the comparison between this and the first string should result in a match, by using a few APIs. Same with file name "abc.txt" and "AbC.txt".
I read about IBM's ICU and how it uses UTF-16 encoding by default. I'm curious to know:
If ICU provides a means of solving my requirement by seamlessly handling the strings regardless of their encoding type?
If the answer to 1. is no, then, using ICU's APIs, is it safe to normalize all strings (both ASCII and non-ASCII) to UTF-16 and then do the case-insensitive comparison and other operations?
Are there alternatives that facilitate this?
I read this post, but it doesn't quite meet my requirements.
Thanks!
The requirement is impossible. Computers don't work with characters, they work with numbers. But "case insensitive" comparisons are operations which work on characters. Locales determine which numbers correspond to which characters, and are therefore indispensable.
The above isn't just true for all programming languages, it's even true for case-sensitive comparisons. The mapping from character to number isn't always unique, which means that comparing two numbers doesn't always work. There could be a locale in which character 42 is equivalent to character 43. In Unicode, it's even worse: there are number sequences of different lengths which are still equivalent (precomposed and decomposed characters in particular).
Without knowing the encoding, you cannot do that. I will take one example using French accented characters and two different encodings: cp850, used as the OEM character set for Windows in the West European zone, and the well-known iso-8859-1 (also known as latin1, not very different from the win1252 ANSI character set for Windows).
in cp850, 0x96 is 'û', 0xca is '╩', 0xea is 'Û'
in latin1, 0x96 is non printable(*), 0xca is 'Ê', 0xea is 'ê'
so if the string is cp850-encoded, 0xea should compare equal to 0x96 (both are 'û'/'Û'), while 0xca is an unrelated character
but if the string is latin1-encoded, 0xea should compare equal to 0xca (both are 'ê'/'Ê'), and 0x96 is a control character
You could find similar examples with other iso-8859-x encodings, but I only speak of languages I know.
(*) in cp1252 0x96 is '–' unicode character U+2013 not related to 'ê'
For UTF-8 (or other Unicode) encodings, it is possible to perform a "locale neutral" case-insensitive string comparison. This type of comparison is useful in multi-locale applications, e.g. network protocols (e.g. CIFS), international database data, etc.
The operation is possible due to Unicode metadata which clearly identifies which characters may be "folded" to/from which upper/lower case characters.
As of 2007, when I last looked, there were fewer than 2,000 upper/lower case character pairs. It was also possible to generate a perfect hash function to convert upper to lower case (and most likely vice versa as well, but I didn't try it).
At the time, I used Bob Jenkins's perfect hash generator. It worked great in a CIFS implementation I was working on.
There aren't many smallish, fixed sets of data out there you can point a perfect hash generator at. But this is one of 'em. :--)
Note: this is locale-neutral, so it will not support applications like German telephone books. There are a great many applications where you should definitely use locale-aware folding and collation. But there are also a large number where locale-neutral is actually preferable, especially now that folks are sharing data across so many time zones and, necessarily, cultures. The Unicode standard does a good job of defining a good set of shared rules.
If you're not using Unicode, the presumption is that you have a really good reason. As a practical matter, if you have to deal with other character encodings, you have a highly locale aware application. In which case, the OP's question doesn't apply.
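To make the idea concrete, here is a sketch of what table-driven simple folding could look like. The kSimpleFolds table is hypothetical: you would generate it offline from the C and S entries of CaseFolding.txt (the one-to-one mappings only; full folds such as ß to "ss" are one-to-many and need extra handling):

#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical table generated from Unicode's CaseFolding.txt.
extern const std::unordered_map<char32_t, char32_t> kSimpleFolds;

char32_t fold(char32_t cp) {
    auto it = kSimpleFolds.find(cp);
    return it == kSimpleFolds.end() ? cp : it->second;
}

// Locale-neutral, code-point-wise caseless comparison.
bool equal_folded(const std::u32string& a, const std::u32string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (fold(a[i]) != fold(b[i])) return false;
    return true;
}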
See also:
The Unicode® Standard, Chapter 4, section 4.2, Case
The Unicode® Standard, Chapter 5, section 5.18, Case Mappings, subsection Caseless Matching.
UCD - CaseFolding.txt
Well, first I must say that any programmer dealing with natural-language text has the utmost duty to know and understand Unicode well. Other ancient 20th-century encodings still exist, but things like EBCDIC and ASCII are not able to encode even simple English text, which may contain words like façade, naïve or fiancée, or a geographical sign, a mathematical symbol or even emojis (conceptually, they are similar to ideograms). And the majority of the world's population does not use Latin characters to write text at all.
UTF-8 is now the prevalent encoding on the Internet, and UTF-16 is used internally by all present-day operating systems, including Windows, which unfortunately still gets parts of it wrong. (For example, NTFS has a long-standing reported bug that allows a directory to contain two files with names that look exactly the same but are encoded with different normal forms. I run into this a lot when synchronising files via FTP between Windows and macOS or Linux: all my files with accented characters get duplicated because, unlike the other systems, Windows uses a different normal form and only normalises the file names at the GUI level, not at the file-system level. I reported this years ago for Windows 7 and the bug is still present today in Windows 10.)
If you still don't know what a normal form is, start here: https://en.wikipedia.org/wiki/Unicode_equivalence
Unicode has strict rules for lower- and uppercase conversion, and these should be followed to the letter in order for things to work nicely. First, make sure both strings use the same normal form (you should do this during input processing; the Unicode standard specifies the algorithm). Please do not reinvent the wheel: use ICU's normalisation and comparison facilities. They have been extensively tested and they work correctly; use them, IBM has made them gratis.
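A sketch of that route with ICU (the wrapper function is illustrative, not part of ICU; it assumes the ICU headers and libraries are available):

#include <string>
#include <unicode/normalizer2.h>
#include <unicode/uchar.h>   // U_FOLD_CASE_DEFAULT
#include <unicode/unistr.h>

// Normalize both UTF-8 inputs to NFC, then compare with full,
// locale-neutral case folding.
bool equal_ignore_case(const std::string& a, const std::string& b) {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return false;  // real code should surface this

    icu::UnicodeString ua = nfc->normalize(icu::UnicodeString::fromUTF8(a), status);
    icu::UnicodeString ub = nfc->normalize(icu::UnicodeString::fromUTF8(b), status);
    return U_SUCCESS(status) && ua.caseCompare(ub, U_FOLD_CASE_DEFAULT) == 0;
}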
A note: if you plan on comparing strings for ordering, please remember that collation is locale-dependent, and highly influenced by the language and the scenario. For example, in a dictionary these Portuguese words would appear in this exact order: sabia, sabiá, sábia, sábio. The same ordering rules would not work for an address list, which would use phonetic rules to place names like Peçanha and Pessanha adjacently. The same phenomenon happens in German with ß and ss. Yes, natural language is not logical (or rather, its rules are not simple).
C'est la vie ("that's life"). これが私たちの世界です ("this is our world").

How to search for a non-ASCII character in a C++ string?

string s = "x1→(y1⊕y2)∧z3";
for (auto i = s.begin(); i != s.end(); i++) {
    if (*i == '→') {
        ...
    }
}
The char comparison is definitely wrong; what's the correct way to do it? I am using VS2013.
First you need some basic understanding of how programs handle Unicode. If you don't have it, you should read up; I quite like this post on Joel on Software.
You actually have 2 problems here:
Problem #1: getting the string into your program
Your first problem is getting that actual string in your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.
Either save your C++ file as UTF-16 (which Windows confusingly calls "Unicode"), and use wchar_t and wstring (effectively encoding the expression as UTF-16). Saving as UTF-8 with a BOM will also work. With any other encoding, your L"..." character literals will contain the wrong characters.
Note that other platforms may define wchar_t as 4 bytes instead of 2. So the handling of characters above U+FFFF will be non-portable.
In all other cases, you can't just write those characters in your source file. The most portable way is encoding your string literals as UTF-8, using \x escape codes for all non-ASCII characters, like this: "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)". (The literal is split in two because a hex escape would otherwise swallow the following 'b', which is a valid hex digit.)
And yes, that's as unreadable and cumbersome as it gets. The root problem is that MSVC doesn't really support using UTF-8. You can go through this question for an overview: How to create a UTF-8 string literal in Visual C++ 2008.
But, also consider how often those strings will actually show up in your source code.
Problem #2: finding the character
(If you're using UTF-16, you can just find the L'→' character, since that character is representable as one wchar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
It's impossible to define a char representing the arrow character. You can, however, represent it with a string: "\xe2\x86\x92" (that's a string with 3 chars for the arrow, plus the \0 terminator).
You can now search for this string in your expression:
s.find("\xe2\x86\x92");
The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind this is an offset in bytes.
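Putting the two together, a self-contained sketch (the string contents match the question's expression, spelled out as UTF-8 escape codes):

#include <iostream>
#include <string>

int main() {
    // "x1→(y1⊕y2)∧z3" as UTF-8 bytes
    std::string s = "x1\xe2\x86\x92(y1\xe2\x8a\x95y2)\xe2\x88\xa7z3";
    std::string arrow = "\xe2\x86\x92";  // U+2192 '→' in UTF-8

    std::size_t pos = s.find(arrow);
    if (pos != std::string::npos)
        std::cout << "arrow found at byte offset " << pos << '\n';
}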
My comment is too large, so I am submitting it as an answer.
The problem is that everybody is concentrating on the issue of the different encodings that Unicode may use (UTF-8, UTF-16, UCS-2, etc.). But that is only where your problems begin.
There is also an issue of composite characters, which will really mess up any search that you are trying to make.
Let's say you are looking for the character 'é'. You find it in Unicode as U+00E9 and do your search, but it is not guaranteed that this is the only way to represent this character. The document may also contain the combination U+0065 U+0301, which is exactly the same character.
Yes, not just "a character that looks the same", but exactly the same, so any software and even some programming libraries will freely convert from one form to the other without even telling you.
So if you wish to make a search that is robust, you will need something that handles not just the different encodings of Unicode, but Unicode characters themselves, with equality between composite (decomposed) and ready-made (precomposed) characters.
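A tiny demonstration of the problem (the byte values are the UTF-8 encodings of U+00E9 and of U+0065 U+0301):

#include <cassert>
#include <string>

int main() {
    std::string precomposed = "\xc3\xa9";   // U+00E9: 'é' as one code point
    std::string decomposed  = "e\xcc\x81";  // U+0065 U+0301: 'e' + combining acute

    // Canonically equivalent to a reader, different to a byte-wise search:
    assert(precomposed != decomposed);
    // A robust search must normalize both haystack and needle first
    // (e.g. with ICU's Normalizer2) before comparing bytes.
}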

Getting the upper or lower case of a unicode code point (as uint32_t)

Is there a way to get the upper or lower case character for a given Unicode code point (or the equivalent UTF-8 code unit sequence)?
I read that this could be done with ICU, but that would be the only thing I'd need ICU for, so I don't want to import a whole huge library (with its licences and dependencies, if any) for a single feature.
I also read that upper and lower case depend on the locale. What does this mean exactly ?
Thanks for your help.
PS: Can't use C++11, using VS2005
ICU is the right tool for this. Case-folding (the idea that multiple symbols represent the same "letter") is a tricky concept in the general form.
What's the uppercase form of i? What country are we in, and what language are we writing? English has the pair Ii. Turkish has two pairs: İi and Iı. So it's not so simple, which explains the "locale matters" part of the problem.
Another interesting case is the capital for the German ß (Eszett or "sharp S" in English). Its capital form is two letters, SS. So there's no promise that the uppercase form of a string will even have the same number of letters in it.
It's possible that there's some small library that just focuses on case folding, but I'm not aware of it. Generally to do Unicode reasonably, you have to do a lot of Unicode.
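That said, if ICU does end up in your build anyway, the simple (unconditional, locale-independent) per-code-point mappings are a single call each. Note that these deliberately cannot express the Turkish and German cases above; those need the full string APIs:

#include <unicode/uchar.h>

// Simple one-to-one mappings only: u_toupper(0x00DF) returns ß
// unchanged, because its uppercase form "SS" is two code points.
UChar32 simple_upper(UChar32 cp) { return u_toupper(cp); }
UChar32 simple_lower(UChar32 cp) { return u_tolower(cp); }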

How to check if the casting to wchar_t "failed"

I have a code that does something like this:
char16_t msg[256] = {0};
//...
wstring wstr;
for (int i = 0; i < len; ++i)
{
    if ((unsigned short)msg[i] != 167)
        wstr.push_back((wchar_t)msg[i]);
    else
        wstr.append(L"_<?>_");
}
As you can see, it uses some rather ugly hardcoding (I'm not sure it is correct in general, but it works for my data) to figure out whether the wchar_t cast "failed" (that is, whether the value is the replacement character).
From wiki:
The replacement character � (often a black diamond with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is not able to decode a stream of data to a correct symbol. It is most commonly seen when a font does not contain a character, but is also seen when the data is invalid and does not match any character.
So I have 2 questions:
1. Is there a proper way to do this nicely?
2. Are there other characters like replacement character that signal the failed conversion?
EDIT: I use gcc on Linux, so wchar_t is 32-bit, and the reason why I need this cast to work is that weird wstrings kill my glog library. :) Also wcout dies. :( :)
It doesn't work like that. wchar_t and char16_t are both integer types in C++. Casting from one to the other follows the usual rules for integer conversions; it does not attempt to convert between character sets in any way, or verify that anything is a genuine Unicode code point.
Any replacement characters will have to come from more sophisticated code than a simple cast (or could be from the original input, of course).
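For illustration, a sketch of that more sophisticated code: decode the char16_t buffer as UTF-16, pair up surrogates, and substitute U+FFFD yourself for malformed input (assumes a 32-bit wchar_t, as on the asker's Linux/gcc setup):

#include <cstddef>
#include <string>

std::wstring from_utf16(const char16_t* msg, std::size_t len) {
    std::wstring out;
    for (std::size_t i = 0; i < len; ++i) {
        char16_t c = msg[i];
        if (c >= 0xD800 && c <= 0xDBFF) {  // high surrogate
            if (i + 1 < len && msg[i + 1] >= 0xDC00 && msg[i + 1] <= 0xDFFF) {
                out.push_back(static_cast<wchar_t>(
                    0x10000 + ((c - 0xD800) << 10) + (msg[i + 1] - 0xDC00)));
                ++i;  // consumed the low surrogate too
            } else {
                out.push_back(L'\xFFFD');  // unpaired high surrogate
            }
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            out.push_back(L'\xFFFD');      // stray low surrogate
        } else {
            out.push_back(static_cast<wchar_t>(c));
        }
    }
    return out;
}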
Provided that:
The input in msg is a sequence of code points in the BMP
wchar_t in your implementation is at least 16 bits and the wide character set used by your implementation is Unicode (or a 16-bit version of Unicode, whether that's BMP-only, or UTF-16).
Then the code you have should work fine. It will not validate the input, though, just copy the values.
If you want to actually handle Unicode strings in C++ (and not merely sequences of 16-bit values), you should use the International Components for Unicode (ICU) library. Quoting the FAQ:
Why ICU4C?
The C and C++ languages and many operating system environments do not provide full support for Unicode and standards-compliant text handling services. Even though some platforms do provide good Unicode text handling services, portable application code can not make use of them. The ICU4C libraries fills in this gap. ICU4C provides an open, flexible, portable foundation for applications to use for their software globalization requirements. ICU4C closely tracks industry standards, including Unicode and CLDR (Common Locale Data Repository).
As a side effect, you get proper error reporting if a conversion fails...
If you don't mind platform-specific code, Windows has the MultiByteToWideChar API.
*Edit: I see you're on Linux; I'll leave my answer here, though, in case Windows people can benefit from it.
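For those Windows people, a sketch of converting UTF-8 to UTF-16 with MultiByteToWideChar, where the MB_ERR_INVALID_CHARS flag makes the call fail on malformed input instead of silently substituting characters:

#include <windows.h>
#include <string>

bool utf8_to_wide(const std::string& in, std::wstring& out) {
    if (in.empty()) { out.clear(); return true; }
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                in.data(), (int)in.size(), nullptr, 0);
    if (n == 0) return false;  // call GetLastError() for the reason
    out.resize(n);
    return MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                               in.data(), (int)in.size(), &out[0], n) == n;
}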
A cast cannot fail, nor will it produce any replacement characters. The value 167 in your code does not indicate a failed cast; it means something that only the code's author knows.
Just for reference, Unicode code point 167 (0x00A7) is a section sign: §. Maybe that will ring some bells about what the code was supposed to do.
And though I don't know what it is, consider rewriting it with:
wchar_t msg[256];
...
wstring wstr(msg, wcslen(msg));
or
char16_t msg[256];
...
u16string u16str(msg, char_traits<char16_t>::length(msg)); // wcslen() only takes wchar_t*
then do something about those 167 values if you need to.

Converting UTF-8 Characters to Upper/Lower case C++

I have a string that contains UTF-8 characters, and I have a method that is supposed to convert every character to either upper or lower case. This is easily done with characters that overlap with ASCII, and obviously some characters cannot be converted, e.g. any Chinese character. However, is there a good way to detect and convert the other characters that do have upper/lower forms, e.g. all the Greek characters? Also please note that I need to be able to do this on both Windows and Linux.
Thank you,
Have a look at ICU.
Note that lower-case to upper-case conversion functions are locale-dependent. Think about the Turkish (ASCII) letter I, which lowercases to the dotless ı, and the (ASCII) letter i, which uppercases to İ, an uppercase I with a dot.
Assuming that you have access to <wctype.h>, convert your text to a 2-byte Unicode string and use towupper(), then convert it back to UTF-8.
On Linux, or with a standard library that supports it, you would obtain a std::locale object for the appropriate locale, as uppercase conversion is locale-specific. Convert each UTF-8 character to a wchar_t, then call std::toupper() on it, then convert back to UTF-8. Note that the resulting string might be longer or shorter, and some ligatures might not work properly: ß to Ss in German is the example everyone keeps bringing up.
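A sketch of that approach; the locale name is an assumption and must actually exist on the system (check with locale -a):

#include <locale>
#include <string>

std::wstring upper_wide(const std::wstring& in) {
    std::locale loc("el_GR.UTF-8");  // assumed locale; pick the right one
    std::wstring out;
    out.reserve(in.size());
    for (wchar_t c : in)
        out.push_back(std::toupper(c, loc));  // locale-aware overload
    return out;
}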
On Windows, this approach will work even less of the time, because wide characters are UTF-16 and not a fixed-width encoding (which violates the C++ language standard, but then maybe the standards committee shouldn't have tried to bluff Microsoft into breaking the Windows API). There is a ToUpper method in the CLR.
It is probably easier to use a portable library such as ICU.
Also make sure whether what you want is uppercase (capitalizing every letter) or titlecase (capitalizing the first letter of a string, or the first part of a ligature).