How to search for a non-ASCII character in a C++ string?

string s = "x1→(y1⊕y2)∧z3";
for (auto i = s.begin(); i != s.end(); i++) {
    if (*i == '→') {
        ...
    }
}
Comparing chars like this is definitely wrong; what's the correct way to do it? I am using VS2013.

First you need some basic understanding of how programs handle Unicode. If you don't have one, you should read up; I quite like this post on Joel on Software.
You actually have 2 problems here:
Problem #1: getting the string into your program
Your first problem is getting that actual text into your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.
Either save your C++ file as UTF-16 (which Windows confusingly calls Unicode) and use wchar_t and wstring (effectively encoding the expression as UTF-16). Saving as UTF-8 with a BOM will also work. With any other encoding, your L"..." string literals will contain the wrong characters.
Note that other platforms may define wchar_t as 4 bytes instead of 2. So the handling of characters above U+FFFF will be non-portable.
In all other cases, you can't just write those characters in your source file. The most portable way is to encode your string literals as UTF-8, using \x escape codes for all non-ASCII characters. Like this: "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)".
And yes, that's as unreadable and cumbersome as it gets. The root problem is that MSVC doesn't really support using UTF-8. You can go through this question for an overview: How to create a UTF-8 string literal in Visual C++ 2008.
But, also consider how often those strings will actually show up in your source code.
Problem #2: finding the character
(If you're using UTF-16, you can just search for the L'→' character, since that character is representable as a single wchar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
It's impossible to define a single char representing the arrow character. You can, however, represent it with a string: "\xe2\x86\x92" (that's a string with three chars for the arrow, plus the \0 terminator).
You can now search for this string in your expression:
s.find("\xe2\x86\x92");
The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind this is an offset in bytes.
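Putting both pieces together, here is a minimal, self-contained sketch of this approach, using the expression from the question with every non-ASCII character spelled out as UTF-8 \x escapes:

#include <iostream>
#include <string>

int main() {
    // \xe2\x86\x92 = U+2192 (→), \xe2\x8a\x95 = U+2295 (⊕),
    // \xe2\x88\xa7 = U+2227 (∧); written as escapes so the bytes are
    // UTF-8 no matter how the source file itself is encoded.
    std::string s = "x1\xe2\x86\x92(y1\xe2\x8a\x95y2)\xe2\x88\xa7z3";

    std::size_t pos = s.find("\xe2\x86\x92");  // search for the arrow
    if (pos != std::string::npos)
        std::cout << "arrow found at byte offset " << pos << "\n";  // prints 2
}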

My comment is too large, so I am submitting it as an answer.
The problem is that everybody is concentrating on the issue of the different encodings Unicode may use (UTF-8, UTF-16, UCS-2, etc.). But your problems will only begin there.
There is also an issue of composite characters, which will really mess up any search that you are trying to make.
Let's say you are looking for the character 'é'. You find it in Unicode as U+00E9 and do your search, but that is not guaranteed to be the only way to represent this character: the document may also contain the combination U+0065 U+0301, which is exactly the same character.
Yes, not just "a character that looks the same": it is exactly the same, so software and even some programming libraries will freely convert from one form to the other without telling you.
So if you wish to make a robust search, you will need something that handles not just the different encodings of Unicode, but Unicode characters themselves, with equality between composed and precomposed forms.
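For illustration, a sketch of how that could look with ICU (assuming ICU 59+, where UChar is char16_t): normalize both needle and haystack to NFC before comparing, so the precomposed and the combining forms become identical.

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return 1;

    icu::UnicodeString precomposed(u"\u00E9");   // é as a single code point
    icu::UnicodeString combining(u"e\u0301");    // e + combining acute accent

    // The raw strings differ, but their NFC normalizations compare equal.
    icu::UnicodeString a = nfc->normalize(precomposed, status);
    icu::UnicodeString b = nfc->normalize(combining, status);
    std::cout << (a == b ? "equal after NFC" : "still different") << "\n";
}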

Related

Does `std::wregex` support utf-16/unicode or only UCS-2?

With C++11 the regex library was introduced into the standard library.
On the Windows/MSVC platform wchar_t has a size of 2 bytes (16 bits), and wchar_t* is normally UTF-16 when interfacing with the system/platform (e.g. CreateFileW).
However, it seems that std::regex isn't utf-8 or does not support it, so I'm wondering whether std::wregex supports UTF-16 or just UCS-2.
I can't find any mention of this (Unicode or the like) in the documentation. In other languages normalization takes place.
The question is: does std::wregex represent UCS-2 when wchar_t has a size of 2?
The C++ standard doesn't enforce any encoding on std::string and std::wstring; they're simply sequences of CharT. Only std::u8string, std::u16string, and std::u32string have a defined encoding:
What encoding does std::string.c_str() use?
Does std::string in c++ has encoding format
Similarly, std::regex and std::wregex wrap std::basic_string and CharT. Their constructors accept std::basic_string, and the encoding used for std::basic_string will also be used for std::basic_regex. So what you said, that "std::regex isn't utf-8 or does not support it", is wrong. If the current locale is UTF-8, then std::regex and std::string will be UTF-8 (yes, modern Windows does support a UTF-8 locale).
On Windows std::wstring uses UTF-16, so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore; you don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts strings in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2, because UTF-16 is self-synchronizing and a proper needle string can never match starting from the middle of a character in the haystack. The same goes for UTF-8. If a tool doesn't understand UTF-16, it's highly likely that it doesn't know that UTF-8 is variable-length either, and will truncate UTF-8 mid-character just the same.
Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.
https://en.wikipedia.org/wiki/UTF-8#Description
The only things you need to care about are: avoid truncating in the middle of a character, and normalize the string before matching if necessary. The former issue can be avoided in UCS-2-only regex engines if you never use characters outside the BMP in a character class, as noted in the comments; replace them with a group instead, as in the sketch below.
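A sketch of that workaround, assuming MSVC where wchar_t is a 16-bit UTF-16 code unit (the non-BMP character U+1D49C is written out as its surrogate pair D835 DC9C):

#include <iostream>
#include <regex>
#include <string>

int main() {
    // U+1D49C (𝒜) lies outside the BMP; in UTF-16 it is the surrogate
    // pair D835 DC9C. In a character class, a UCS-2-level engine would
    // treat the two units separately; a group keeps them together.
    std::wstring haystack = L"prefix \xD835\xDC9C suffix";
    std::wregex needle(L"(?:\xD835\xDC9C)");  // group, not character class

    std::cout << (std::regex_search(haystack, needle) ? "found" : "not found")
              << "\n";
}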
In other languages normalization takes place.
This is wrong. Some languages may do normalization before matching a regex, but that definitely doesn't apply to all "other languages"
If you want a little bit more assurance, use std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively. Note, though, that the standard library only provides std::regex_traits specializations for char and wchar_t, so you'd have to supply your own traits for those character types. And you'll still need a UTF-16-aware library, otherwise that'll still only work for regex strings that contain only literal words.
A better solution may be switching to another library, like ICU regex. You can check Comparison of regular expression engines for suggestions; it even has a column indicating native UTF-16 support for each library.
Related:
Do C++11 regular expressions work with UTF-8 strings?
How well is Unicode supported in C++11?
How do I properly use std::string on UTF-8 in C++?
How to use Unicode range in C++ regex
See also
Unicode Regular Expressions
Unicode Support in the Standard Library

Detecting Multi-Byte Character Encodings

What C/C++ libraries are there for detecting the multi-byte character encoding (UTF-8, UTF-16, etc.) of a character array (char*)? A bonus would be to also detect when the matcher halted, that is, to detect the prefix match ranges for a given set of possible encodings.
ICU does character set detection. You must note that, as the ICU documentation states:
This is, at best, an imprecise operation using statistics and heuristics. Because of this, detection works best if you supply at least a few hundred bytes of character data that's mostly in a single language.
If the input is only ASCII, there's no way to detect what encoding would have been used had there been any high-bit-set bytes in the stream. You may as well just pick UTF-8 in that case.
As for UTF-8 vs. ISO-8859-x: you could try parsing the input as UTF-8 and fall back to ISO-8859 if the parse fails, but that's about it; there's not really a way to detect which ISO-8859 variant it is. I'd recommend looking at how Firefox does its auto-detection, but it's not foolproof and probably depends on knowing the input is HTML.
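For the "parse as UTF-8" part, a hand-rolled structural check is short enough to sketch. It checks lead/continuation byte patterns, overlong encodings, surrogates, and the U+10FFFF upper bound:

#include <cstddef>
#include <cstdint>

bool is_valid_utf8(const unsigned char* buf, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        std::size_t n;        // number of continuation bytes expected
        std::uint32_t cp;     // decoded code point
        if (b < 0x80)                { n = 0; cp = b; }
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; }
        else return false;    // stray continuation byte or invalid lead
        if (i + n >= len) return false;   // truncated sequence
        for (std::size_t j = 1; j <= n; ++j) {
            if ((buf[i + j] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (buf[i + j] & 0x3F);
        }
        static const std::uint32_t min_cp[4] = {0x00, 0x80, 0x800, 0x10000};
        if (cp < min_cp[n]                      // overlong encoding
            || (cp >= 0xD800 && cp <= 0xDFFF)   // UTF-16 surrogate range
            || cp > 0x10FFFF)                   // beyond Unicode
            return false;
        i += n + 1;
    }
    return true;
}

If this returns true for non-trivial input, UTF-8 is a good bet; ISO-8859 text containing high-bit bytes is very unlikely to pass by accident.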
In general, there is no way to detect the character encoding, except if the text carries some special mark denoting it. You could heuristically detect an encoding using dictionaries that contain words with characters that are only present in some encodings.
This can of course only be a heuristic, and you need to scan the whole text.
Example: "an English text can be written in multiple encodings". This sentence could, for example, be written using a German codepage; it's indistinguishable from most "western" encodings (including UTF-8) unless you add some special characters (like ä) that are not present in ASCII.

Distinguishing between string formats

Having an untyped pointer to a buffer which can hold either an ANSI or a Unicode string, how do I tell whether the string it currently holds is multibyte or not?
Unless the string itself contains information about its format (e.g. a header or a byte order mark) then there is no foolproof way to detect if a string is ANSI or Unicode. The Windows API includes a function called IsTextUnicode() that basically guesses if a string is ANSI or Unicode, but then you run into this problem because you're forced to guess.
Why do you have an untyped pointer to a string in the first place? You must know exactly what your data represents and how, either by using a typed pointer in the first place or by providing an ANSI/Unicode flag or something similar. A string of bytes is meaningless unless you know exactly what it represents.
Unicode is not an encoding; it's a mapping of code points to characters. The encoding is UTF-8 or UCS-2, for example.
And, given that there is zero difference between the ASCII and UTF-8 encodings if you restrict yourself to the lower 128 characters, you can't actually tell the difference.
You'd be better off asking if there were a way to tell the difference between ASCII and a particular encoding of Unicode. And the answer to that is to use statistical analysis, with the inherent possibility of inaccuracy.
For example, if the entire string consists of bytes less than 128, it's ASCII (it could be UTF-8, but there's no way to tell and no difference in that case).
If it's primarily English/Roman text and consists of lots of two-byte sequences with a zero as one of the bytes, it's probably UTF-16. And so on. I don't believe there's a foolproof method without actually having an indicator of some sort (e.g. a BOM).
My suggestion is to not put yourself in the position where you have to guess. If the data type itself can't contain an indicator, provide different functions for ASCII and a particular encoding of Unicode. Then force the work of deciding onto your client. At some point in the calling hierarchy, someone should know the encoding.
Or, better yet, ditch ASCII altogether, embrace the new world and use Unicode exclusively. With UTF-8 encoding, ASCII has exactly no advantages over Unicode :-)
In general, you can't.
You could check for the pattern of zeros: just one at the end probably means an ANSI C string; a zero in every other byte probably means ANSI text stored as UTF-16; three zeros per four bytes might mean UTF-32. A sketch of this heuristic:
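This is purely a guess, and the sketch below assumes mostly Latin-script text (where UTF-16 stores a zero high byte in nearly every unit and UTF-32 in three bytes out of four) and that the terminating zero, if any, is not included in len:

#include <cstddef>

enum class WidthGuess { Ansi, Utf16, Utf32, Unknown };

WidthGuess guess_width(const unsigned char* buf, std::size_t len) {
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < len; ++i)
        if (buf[i] == 0) ++zeros;
    if (zeros == 0)           return WidthGuess::Ansi;   // no embedded zeros
    if (zeros * 4 >= len * 3) return WidthGuess::Utf32;  // ~3 zeros per 4 bytes
    if (zeros * 2 >= len)     return WidthGuess::Utf16;  // ~1 zero per 2 bytes
    return WidthGuess::Unknown;
}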

Converting UTF-8 Characters to Upper/Lower case C++

I have a string that contains UTF-8 characters, and I have a method that is supposed to convert every character to either upper or lower case. This is easily done for characters that overlap with ASCII, and obviously some characters cannot be converted (e.g. any Chinese character). However, is there a good way to detect and convert the other characters that do have case, e.g. all the Greek characters? Also note that I need to be able to do this on both Windows and Linux.
Thank you,
Have a look at ICU.
Note that lowercase-to-uppercase functions are locale-dependent. Think of the Turkish letters: (ASCII) I lowercases to 'ı' (dotless lowercase i), and (ASCII) i uppercases to 'İ' (uppercase I with a dot).
Assuming you have access to wctype.h, convert your text to a 2-byte-per-unit wide string and use towupper(), then convert it back to UTF-8.
On Linux, or with a standard library that supports it, you would obtain a std::locale object for the appropriate locale, as uppercase conversion is locale-specific. Convert each UTF-8 character to a wchar_t, call std::toupper() on it, then convert back to UTF-8. Note that the resulting string might be longer or shorter, and some ligatures might not convert properly: ß to SS in German is the example everyone keeps bringing up.
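A minimal sketch of that approach, assuming the text is already decoded into a wide string and that an "en_US.UTF-8" locale is installed on the system (the std::locale constructor throws if it isn't):

#include <locale>
#include <string>

std::wstring to_upper(std::wstring s) {
    std::locale loc("en_US.UTF-8");  // uppercasing is locale-specific
    for (auto& c : s)
        c = std::toupper(c, loc);    // strictly one-to-one mapping
    return s;  // cannot produce one-to-many results such as ß -> SS
}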
On Windows, this approach will work even less of the time, because wide characters are UTF-16 and not a fixed-width encoding (which violates the C++ language standard, but then maybe the standards committee shouldn't have tried to bluff Microsoft into breaking the Windows API). There is a ToUpper method in the CLR.
It is probably easier to use a portable library such as ICU.
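For comparison, a sketch with ICU's UnicodeString, which does full, locale-aware case mapping directly on UTF-8 input (assumes ICU is installed; link against the ICU common library):

#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

std::string utf8_to_upper(const std::string& utf8, const icu::Locale& loc) {
    icu::UnicodeString us = icu::UnicodeString::fromUTF8(utf8);
    us.toUpper(loc);           // handles one-to-many and locale rules
    std::string out;
    return us.toUTF8String(out);
}

int main() {
    // ß -> SS (German) and i -> İ (Turkish): cases towupper() misses.
    std::cout << utf8_to_upper("stra\xc3\x9f" "e", icu::Locale::getGerman()) << "\n"
              << utf8_to_upper("istanbul", icu::Locale("tr")) << "\n";
}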
Also make sure you know whether what you want is uppercase (capitalizing every letter) or titlecase (capitalizing the first letter of a string, or the first part of a ligature).

MFC: what would be the regex to check if a character is unicode or not?

I'm trying to use Windows' IsTextUnicode API to check whether a character input is Unicode or not, but it is sort of buggy. I figured it might be better to use a regex. However, I'm new to constructing regular expressions. What would be the regex to check if a character is Unicode or not?
Thanks...
Well, that depends what you mean by ‘Unicode’. As the answers so far say, pretty much any character “is Unicode”.
Windows abuses the term ‘Unicode’ to mean the UTF-16LE encoding that the Win32 API uses internally. You can detect UTF-16 by looking for the Byte Order Mark at the front, bytes FF FE for UTF-16LE (or FE FF for UTF-16BE). It's possible to have UTF-16 text that is not marked with a BOM, but that's quite bad news as you can only detect it by pure guesswork.
Pure guesswork is what the IsTextUnicode function is all about. It looks at the input bytes and, by seeing how often common patterns turn up in it, guesses how likely it is that the bytes represent UTF-16LE or UTF-16BE-encoded characters. Since every sequence of bytes is potentially a valid encoding of characters(*), you might imagine this isn't very predictable or reliable. And you'd be right.
See Windows i18n guru Michael Kaplan's description of IsTextUnicode and why it's probably not a good idea.
In general you would want a more predictable way of guessing what encoding a set of bytes represents. You could try the following, sketched in code after the list:
if it begins FF FE, it's UTF-16LE, what Windows thinks of as ‘Unicode’;
if it begins FE FF, it's UTF-16BE, what Windows equally-misleadingly calls ‘reverse’ Unicode;
otherwise check the whole string for invalid UTF-8 sequences. If there are none, it's probably UTF-8 (or just ASCII);
otherwise try the system default codepage.
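A sketch of that decision ladder; is_valid_utf8() stands in for the kind of structural check sketched earlier on this page (a hypothetical helper, only declared here):

#include <cstddef>

enum class Encoding { Utf16LE, Utf16BE, Utf8OrAscii, SystemCodepage };

bool is_valid_utf8(const unsigned char* buf, std::size_t len);  // earlier sketch

Encoding guess_encoding(const unsigned char* buf, std::size_t len) {
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return Encoding::Utf16LE;         // BOM FF FE
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return Encoding::Utf16BE;         // BOM FE FF
    if (is_valid_utf8(buf, len))
        return Encoding::Utf8OrAscii;     // no invalid UTF-8 sequences
    return Encoding::SystemCodepage;      // last resort: default codepage
}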
(*: actually not quite true. Apart from the never-a-characters like U+FFFF, there are also many sequences of UTF-16 code units that aren't valid characters, thanks to the ‘surrogates’ approach to encoding characters outside the 16-bit range. However IsTextUnicode doesn't know about those anyway, as it predates the astral planes.)
Every character you'll encounter is part of Unicode. For instance, Latin 'a' is U+0061. This is especially true on Windows, which natively uses Unicode with the UTF-16 encoding.
The Microsoft function IsTextUnicode is named rather unfortunately. It could more accurately be described as GuessTextEncodingFromRawBytes(). I suspect that your real problem is not the interpretation of raw bytes, since you already know it's one character.
I think you're mixing up two different concepts. A character and its encoding are not the same. Some characters (like A) are encoded identically in ASCII, Latin-1, and UTF-8; some aren't; some can only be encoded in UTF-8; and so on.
IsTextUnicode() tries to guess the encoding from a stream of raw bytes.
If, on the other hand, you already have a character representation, and you wish to find out whether it can be natively expressed as ASCII or latin-1 or some other encoding, then you could indeed look at the character range ([\u0000-\u007F] for ASCII).
Lastly, there are some invalid codes (like \uFFFE) which are possible byte representations that are not allowed as Unicode characters. But I don't think this is what you're looking for.