Range of UTF-8 Characters in C++11 Regex

This question is an extension of Do C++11 regular expressions work with UTF-8 strings?
#include <iostream>
#include <regex>
int main() {
    if (std::regex_match("中", std::regex("中"))) // "\u4e2d" also works
        std::cout << "matched\n";
}
The program is compiled on Mac Mountain Lion with clang++ with the following options:
clang++ -std=c++0x -stdlib=libc++
The code above works. "[一-龠々〆ヵヶ]" is a standard range regex for matching any Japanese Kanji or Chinese character. It works in JavaScript and Ruby, but I can't seem to get ranges working in C++11, even when using a similar version, [\u4E00-\u9fa0]. The code below does not match the string.
if (std::regex_match ("中", std::regex("[一-龠々〆ヵヶ]")))
std::cout << "range matched\n";
Changing locale hasn't helped either. Any ideas?
EDIT
So I have found that all ranges work if you add a + to the end, in this case [一-龠々〆ヵヶ]+, but if you add {1}, as in [一-龠々〆ヵヶ]{1}, it does not work. Moreover, it seems to overreach its boundaries. It won't match Latin characters, but it will match は, which is \u306f, and ぁ, which is \u3041. Both lie below \u4E00.
nhahtdh also suggested regex_search, which also works without adding +, but it still runs into the same problem as above by pulling in values outside of its range. I played with the locales a bit as well. Mark Ransom suggests it treats the UTF-8 string as a dumb set of bytes; I think this is possibly what it is doing.
Further pushing the theory that UTF-8 is getting jumbled somehow, [a-z]{1} and [a-z]+ both match a, but only [一-龠々〆ヵヶ]+ matches any of the characters, not [一-龠々〆ヵヶ]{1}.

Encoded in UTF-8, the string "[一-龠々〆ヵヶ]" is equal to this one: "[\xe4\xb8\x80-\xe9\xbe\xa0\xe3\x80\x85\xe3\x80\x86\xe3\x83\xb5\xe3\x83\xb6]". And this is not the droid character class you are looking for.
The character class you are looking for is the one that includes:
any character in the range U+4E00..U+9FA0; or
any of the characters 々, 〆, ヵ, ヶ.
The character class you specified is the one that includes:
any of the "characters" \xe4 or \xb8; or
any "character" in the range \x80..\xe9; or
any of the "characters" \xbe, \xa0, \xe3, \x80, \x85, \xe3 (again), \x80 (again), \x86, \xe3 (again), \x83, \xb5, \xe3 (again), \x83 (again), \xb6.
Messy isn't it? Do you see the problem?
This will not match "latin" characters (which I assume you mean things like a-z) because in UTF-8 those all use a single byte below 0x80, and none of those is in that messy character class.
It will not match "中" either because "中" has three "characters", and your regex matches only one "character" out of that weird long list. Try assert(std::regex_match("中", std::regex("..."))) and you will see.
If you add a + it works because "中" has three of those "characters" in your weird long list, and now your regex matches one or more.
If you instead add {1} it does not match because we are back to matching three "characters" against one.
Incidentally "中" matches "中" because we are matching the three "characters" against the same three "characters" in the same order.
The regex with + will actually match some undesired things because it does not care about order. Any character that can be made from that list of bytes in UTF-8 will match. It will match "\xe3\x81\x81" (ぁ U+3041) and it will even match invalid UTF-8 input like "\xe3\xe3\xe3\xe3".
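The byte-level behaviour described above can be reproduced outside C++. The following is a minimal sketch in Python, not the libc++ implementation: it assumes the engine treats each UTF-8 byte as one "character" and emulates that by compiling the UTF-8 bytes of the pattern as a bytes regex. The helper name full_match is made up for illustration.

```python
import re

# UTF-8 bytes of the character class, as a byte-oriented engine would see them.
CLASS_BYTES = "[一-龠々〆ヵヶ]".encode("utf-8")

one_unit = re.compile(CLASS_BYTES)            # like [...] or [...]{1}
one_or_more = re.compile(CLASS_BYTES + b"+")  # like [...]+


def full_match(regex, data):
    """Return True if the whole input matches (hypothetical helper)."""
    if isinstance(data, str):
        data = data.encode("utf-8")
    return regex.fullmatch(data) is not None


print(full_match(one_unit, "中"))                     # three bytes against one unit
print(full_match(one_or_more, "中"))                  # every byte is in the class
print(full_match(one_or_more, "ぁ"))                  # U+3041, below U+4E00
print(full_match(one_or_more, b"\xe3\xe3\xe3\xe3"))  # invalid UTF-8
print(full_match(one_or_more, "a"))                   # ASCII byte, not in the class
```

Under these assumptions only the first and last calls report no match, which mirrors the symptoms in the question.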
The bigger problem is that you are using a regex library that does not even have level 1 support for Unicode, the bare minimum required. It munges bytes and there isn't much your precious tiny regex can do about it.
And the even bigger problem is that you are using a hardcoded set of characters to specify "any Japanese Kanji or Chinese character". Why not use the Unicode Script property for that?
R"(\p{Script=Han})"
Oh right, this won't work with C++11 regexes. For a moment there I almost forgot those are annoyingly worse than useless with Unicode.
So what should you do?
You could decode your input into a std::u32string and use char32_t all over for the matching. That would not give you this mess, but you would still be hardcoding ranges and exceptions when you mean "a set of characters that share a certain property".
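To illustrate what matching on code points instead of bytes buys you, here is a minimal sketch in Python, whose str type is a sequence of code points (roughly the role std::u32string would play in C++). The name is_single_kanji is hypothetical, and the ranges are still hardcoded, which is the remaining weakness mentioned above.

```python
import re

# Python str patterns work on code points, so this range means U+4E00..U+9FA0.
KANJI = re.compile("[\u4e00-\u9fa0々〆ヵヶ]")


def is_single_kanji(text):
    """Return True if text is exactly one character from the class."""
    return KANJI.fullmatch(text) is not None


for sample in ["中", "々", "ぁ", "は", "a"]:
    print("U+%04X" % ord(sample), is_single_kanji(sample))
```

With code points as the matching unit, ぁ and は are rejected because U+3041 and U+306F really are outside the range.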
I recommend you forget about C++11 regexes and use some regular expression library that has the bare minimum level 1 Unicode support, like the one in ICU.

Related

Raku Is there a fast method to find and remove/replace non-ASCII or malformed utf8 characters?

I frequently encounter malformed UTF-8 characters that break my code. I have read some (not all) related questions/answers on Stack Overflow, but nothing specific to Raku/Perl 6. Is there a fast way to remove these pesky characters from strings? The predefined character classes in "https://docs.raku.org/language/regexes#Predefined_character_classes" just won't do it:
Example: from REPL:
> say "â " ~~ /\w/ # you have to have a space following the "a" with "^" for it to work
｢â｣
> say "�" ~~ /\w/ # without the space, the character doesn't look normal
Malformed UTF-8 at line 1 col 6
> say "â ".chars # looks like 2 chars, but it says 1 char
1
> say "â ".comb.[0] # strange, the pesky char makes the space precede the cursor as I type
â
> say "â".comb.[0 ] # strange, the pesky char makes the space precede the cursor as I type
â
> say "â".comb.[0] # there is a space following ']' or it won't work
â
> say "â".comb.[0 ] # very strange, must have space before ']'
â
> say "â".comb
(â)
> say "â".comb.[0] .ord # # same here, very strange, it makes space precede the cursor
226
> my $a = Buf.new(226)
Buf:0x<E2>
> say $a.decode
Malformed termination of UTF-8 string
in block <unit> at <unknown file> line 1
> say $a.decode('utf8-c8')
􏿽xE2
> for @$a { say $_.chr; }
â
> say (@$a).elems
1
> say "â " ~~ / <alpha> / # again, must have space in the quote
｢â｣
alpha => ｢â｣
> say "â " ~~ / <cntrl> /
Nil
This is very troublesome. How to remove these non-utf8 chars? Is there a predefined character class for all good utf-8 chars or for good ASCII chars that are model citizens?

Hopefully someone will have a better answer. In the meantime...
There are several very different things going on in your question.
Is there a fast method to find and remove/replace non-ASCII or malformed utf8 characters?
There is supposed to be a nice, obvious, fairly simple one:
say .decode: replacement => '�'
given $buf-that's-supposed-to-be-utf8
This should decode the same way a plain slurp does, except that, instead of just giving up on the decode when it encounters "Malformed UTF-8", it should just replace malformed data with the replacement character you've specified and continue as best it can.
Unfortunately (as far as I know) this doesn't work due to bugs in rakudo/moarvm as outlined in my answer to decode with replacement does not seem to work.
I did not file an issue at the time I wrote that SO. Your new SO has prompted me to file two bug reports:
.decode's replacement option didn't work in Rakudo v2019.03.01 and presumably still doesn't #3509
decoder replacement options didn't work in Rakudo v2019.03.01 and presumably still don't #1245
Some other options are given in the answers to error message: Malformed UTF-8.
I see in your repl examples you've tried .decode('utf8-c8'). This may be your best bet within raku as it stands.
If none of the above is helpful, I think you're stuck for now with using an external tool to preprocess files before they get to raku.
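As one possible shape for such an external preprocessing step, here is a minimal sketch in Python (not Raku). The function names decode_lossy and strip_non_ascii are made up for illustration; the only library behaviour relied on is the errors="replace" and errors="ignore" handlers of the standard codecs.

```python
def decode_lossy(raw, replacement="\ufffd"):
    """Decode bytes as UTF-8, substituting malformed sequences.

    Caveat: if a custom replacement is given, genuine U+FFFD characters
    already present in the input are substituted as well.
    """
    text = raw.decode("utf-8", errors="replace")
    if replacement != "\ufffd":
        text = text.replace("\ufffd", replacement)
    return text


def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")


broken = b"a\xe2 b"  # 0xE2 starts a multi-byte sequence that never completes
print(repr(decode_lossy(broken)))
print(repr(decode_lossy(broken, "?")))
print(repr(strip_non_ascii("\u00e2 b")))
```

A file cleaned this way can then be read by a plain slurp without hitting "Malformed UTF-8".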
Is there a predefined character class for all good utf-8 chars
utf8 data is not characters. It's just bytes. The data encodes characters, or at least it's supposed to, but it's very important to keep encodings and characters separate in your mind.
If you know how old-fashioned telegrams work, it's like that. There's a message in characters. And then morse code for transmitting it. They're very different things.
When you see "Malformed UTF-8" or similar, it means the decoder is choking on some part of the data (the bytes). They make no sense to it as characters. It's like morse code that doesn't follow the rules for morse code.
Such data is considered to be confusing crap at best and dangerous crap at worst. The Unicode standard requires that it is entirely eliminated before you can do anything with it.
The obvious friendly solution is to replace crap with a user specified replacement character as you asked. In contrast, a regex character class is both the wrong tool and too late.
Example: from REPL
This is another whole ball of wax.
There's:
The encoding used by your (terminal on your) local system;
The characters you see rendered, and the indication of the cursor, when you use your local system;
What's in your cut/paste buffer when you copy from your repl display;
What your browser does with that buffer when you paste into the edit window for an SO question;
What SO's servers do with the contents of the edit window when you click the Post your question button and when SO renders your question;
What my local system, browser, terminal, cut/paste buffer, etc. are doing when I look at your SO question;
Etc.
This complexity exists even if both our systems and both you and I are doing what we're supposed to be doing. So, sure, something is amiss with the cursor and other issues, but I'm not going to try to nail that down in this answer because, unlike the first part of your question that I answered above, it's not really to do with raku/do.

I was trying to determine whether the issues you're seeing are due to the REPL, or some other factor. Here's a link to a gist from your input code:
https://gist.github.com/jubilatious1/b99def4cb2d02e6cef5c15b3fd102447
I removed spaces inside the doublequotes to force an error (if any). I inserted a semicolon at the end of every code line before the comment (if any). I moved one problematic line, say $a.decode;, to the very end. Then I tested the gist with a fairly recent version of Rakudo:
~$ raku --version
Welcome to 𝐑𝐚𝐤𝐮𝐝𝐨™ v2020.10.
Implementing the 𝐑𝐚𝐤𝐮™ programming language v6.d.
Built on MoarVM version 2020.10.
Here's the output I see:
~$ raku lisprogtor_unicode_SO.p6
｢â｣
Nil
1
â
â
â
â
(â)
226
􏿽xE2
â
1
｢â｣
alpha => ｢â｣
Nil
----
Malformed termination of UTF-8 string
in block <unit> at lisprogtor_unicode_SO.p6 line 36
I'm wondering if this means some/many of the Unicode errors you've encountered are either 1) confined to the REPL, or 2) have been resolved since you first posted?
HTH.
(updated 11/24/2020).

Translate accented to unaccented characters in Sublime Text snippet using regex

I'm writing a ST3 snippet that inserts a \subsection{} with a label. The label is created by converting the header text to conform with the LaTeX standards for labels using a (rather lengthy) regular expression:
${1/(?:([ \t_]+)?|\b)(?:([ÅÄÆÁÀÃ])?|\b)(?:([åäæâàáã])?|\b)(?:([ÉÈÊË])?|\b)(?:([éèëê])?|\b)(?:([ÌÌÎÏ])?|\b)(?:([íìïî])?|\b)(?:([Ñ])?|\b)(?:([ñ])?|\b)(?:([ÖØÓÒÔÖÕ])?|\b)(?:([öøóòôõ])?|\b)(?:([ÜÛÚÙ])?|\b)(?:([üûúù])?|\b)/(?1:-)(?2:A)(?3:a)(?4:E)(?5:e)(?6:I)(?7:i)(?8:N)(?9:n)(?10:O)(?11:o)(?12:U)(?13:u)/g}
Actually, I would like for it to be even longer. But if I add the extra groups that I would like, then ST3 crashes when I execute the snippet.
${1/(?:([ \t_]+)?|\b)(?:([ÅÄÆÁÀÃ])?|\b)(?:([åäæâàáã])?|\b)(?:([Ç])?|\b)(?:([ç])?|\b)(?:([ÉÈÊË])?|\b)(?:([éèëê])?|\b)(?:([ÌÌÎÏ])?|\b)(?:([íìïî])?|\b)(?:([Ñ])?|\b)(?:([ñ])?|\b)(?:([ÖØÓÒÔÖÕ])?|\b)(?:([öøóòôõ])?|\b)(?:([ÜÛÚÙ])?|\b)(?:([üûúù])?|\b)(?:([Ý])?|\b)(?:([ÿý])?|\b)/(?1:-)(?2:A)(?3:a)(?4:C)(?5:c)(?6:E)(?7:e)(?8:I)(?9:i)(?10:O)(?11:o)(?12:N)(?13:n)(?14:U)(?15:u)(?16:Y)(?17:y)/g}
Is there any more efficient way of doing this? Preferably one that won't cause ST3 to crash ;)
Edit:
Here are some example strings:
Flygande bæckasiner søka hwila på mjuka tuvor
Åke Staël hade en överflödig idé
And the results (with the current, working regex):
Flygande-backasiner-soka-hwila-pa-mjuka-tuvor
Ake-Stael-hade-en-overflodig-ide
But I would like to also replace the characters (ÇçÝÿý) with their unaccented counterparts (CcYyy) so that e.g.
Comment ça va
becomes
Comment-ca-va

I don't know this syntax, but I suspect that the problem comes from having too many optional groups combined with many alternatives, which makes the processing too complex.
So you can try to design your pattern like this, and you can add other groups of letters in the same way (take a look at the Unicode table to find character ranges):
${1/([ \t_]+)|([À-Å])|([à-å])|([È-Ë])|([è-ë])|([Ì-Ï])|([ì-ï])|([Ò-ÖØ])|([ò-öø])|([Ù-Ü])|([ù-ü])|(Æ)|(æ)|(Œ)|(œ)|(Ñ)|(ñ)/(?1:-)(?2:A)(?3:a)(?4:E)(?5:e)(?6:I)(?7:i)(?8:O)(?9:o)(?10:U)(?11:u)(?12:AE)(?13:ae)(?14:OE)(?15:oe)(?16:N)(?17:n)/g}
If the lookahead feature is available, you can improve this pattern so that non-accented characters are not tested against each alternative:
${1/(?=[ \t_À-ÆÈ-ÏÑ-ÖØ-Üà-æè-ïñ-öø-üŒœ])(?:([ \t_]+)|([À-Å])|([à-å])|([È-Ë])|([è-ë])|([Ì-Ï])|([ì-ï])|([Ò-ÖØ])|([ò-öø])|([Ù-Ü])|([ù-ü])|(Æ)|(æ)|(Œ)|(œ)|(Ñ)|(ñ))/(?1:-)(?2:A)(?3:a)(?4:E)(?5:e)(?6:I)(?7:i)(?8:O)(?9:o)(?10:U)(?11:u)(?12:AE)(?13:ae)(?14:OE)(?15:oe)(?16:N)(?17:n)/g}
Note: Æ (Aelig) must be transliterated as AE (the same for Œ => OE)
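If the conversion can be done outside the snippet regex (for example in a small Python helper, which is an assumption and not something the question asks for), Unicode decomposition avoids listing every accented letter. The sketch below is hedged accordingly: make_label and the SPECIAL table are hypothetical names, and only letters that do not decompose (such as Æ, Œ, Ø) need explicit entries.

```python
import re
import unicodedata

# Letters with no canonical decomposition need an explicit mapping.
SPECIAL = {"Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe", "Ø": "O", "ø": "o"}


def make_label(text):
    """Turn a heading into an unaccented, hyphen-separated label."""
    text = "".join(SPECIAL.get(ch, ch) for ch in text)
    # NFD splits e.g. "ç" into "c" plus a combining cedilla.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[ \t_]+", "-", stripped)


print(make_label("Åke Staël hade en överflödig idé"))
print(make_label("Comment ça va"))
```

Because Æ is mapped to AE here, "bæckasiner" becomes "baeckasiner" rather than the "backasiner" shown in the question.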

find if string starts with \U in Python 3.3

I have a string and I want to find out if it starts with \U.
Here is an example
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
I was trying this:
myStr.startswith('\\U')
but I get False.
How can I detect \U in a string?
The larger picture:
I have a list of strings, most of them are normal English word strings, but there are a few that are similar to what I have shown in myStr, how can I distinguish them?

The original string does not have the character \U. It has the unicode escape sequence \U0001f64c, which is a single Unicode character.
Therefore, it does not make sense to try to detect \U in the string you have given.
Trying to detect the \U in that string is similar to trying to detect \x in the C string "\x90".
It makes no sense because the interpreter has read the sequence and converted it. Of course, if you want to detect the first Unicode character in that string, that works fine.
myStr.startswith('\U0001f64c')
Note that if you define the string with a real \U, like this, you can detect it just fine. Based on some experimentation, I believe Python 2.7.6 defaults to this behavior.
myStr = r'\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
myStr.startswith('\\U') # Returns True.
Update: The OP requested a way to convert from the Unicode string into the raw string above.
I will show the solution in two steps.
First observe that we can view the raw hex for each character like this.
>>> [hex(ord(x)) for x in myStr]
['0x1f64c', '0x1f60d', '0x1f4a6', '0x1f445', '0x1f4af']
Next, we format it by using a format string.
myChars = [ord(x) for x in myStr]  # code point of each character
formatString = "".join(r'\U%08x' for x in myStr)
output = formatString % tuple(myChars)
output.startswith("\\U") # Returns True.
Note of course that since we are converting a Unicode string and we are formatting it this way deliberately, it is guaranteed to start with \U. However, I assume your actual application is not just to detect whether it starts with \U.
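A related route, sketched here as an alternative rather than as the method used above, is Python's built-in unicode_escape codec. The helper name to_escaped is hypothetical. Unlike the format-string approach, the codec only emits \U for characters above U+FFFF; ASCII is left alone and other characters come out as \x or \u escapes.

```python
def to_escaped(text):
    """Return text with non-ASCII characters written as escape sequences."""
    return text.encode("unicode_escape").decode("ascii")


my_str = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
escaped = to_escaped(my_str)

print(escaped)
print(escaped.startswith("\\U"))
print(to_escaped("hello"))
```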
Update2: If the OP is trying to differentiate between "normal English" strings and "Unicode Strings", the above approach will not work, because all characters have a corresponding Unicode representation.
However, one heuristic you might use to check whether a string looks like ASCII is to just check whether the values of each character are outside the normal ASCII range. Assuming that you consider the normal ASCII range to be between 32 and 127 (You can take a look here and decide what you want to include.), you can do something like the following.
def isNormal(myStr):
    myChars = [ord(x) for x in myStr]
    return all(x < 128 and x > 31 for x in myChars)
This can be done in one line, but I separated it to make it more readable.

Your string:
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
is not foreign-language text. It is 5 Unicode characters, which are (in order):
PERSON RAISING BOTH HANDS IN CELEBRATION
SMILING FACE WITH HEART-SHAPED EYES
SPLASHING SWEAT SYMBOL
TONGUE
HUNDRED POINTS SYMBOL
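The names listed above can be looked up programmatically. This is a small sketch using the standard unicodedata module; the variable names are illustrative, and the exact names returned depend on the Unicode database shipped with your Python version.

```python
import unicodedata

my_str = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'

# unicodedata.name raises ValueError for unnamed characters unless a default is given.
names = [unicodedata.name(ch, "<unnamed>") for ch in my_str]

for ch, name in zip(my_str, names):
    print("U+%04X" % ord(ch), name)
```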
If you want to get strings that only contain 'normal' characters, you can use something like this:
import re
if re.search(r'[^A-Za-z0-9\s]', myStr):
    # String contained 'weird' characters.
    pass
Note that this will also trip on characters like é, which will sometimes be used in English on words with a French origin.

Why aren't my hyphens displaying correctly using std::cout?

I am trying to print out the following string using std::cout :
"Encryptor –pid1 0x34f –pid2"
The '–' characters appear as u's with a circumflex above them (I'm not sure how to type this).
How do I print out the hyphen as intended?

That was not a hyphen.
It was an "en dash", which will render differently across consoles depending on encoding settings.
The hyphen key is usually on the number row of your keyboard, on Western layouts.

Make sure your terminal's idea of the character encoding matches that of your source code. How to do this, of course, depends on your operating system, which terminal emulator (assuming it's an emulator at all) you're using, and so on, none of which you state.
Also, that's not a hyphen in your example, it's too long. It's probably an "en dash" (U+2013).
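The symptom in the question (a u with a circumflex) is consistent with one particular encoding mismatch, sketched below in Python. This is a guess, since the question states neither encoding: it assumes the character is U+2013 (en dash), the source was saved as Windows-1252, and the console uses code page 437. The helper replace_dashes is hypothetical.

```python
EN_DASH = "\u2013"

# Assumed source encoding: Windows-1252 stores the en dash as one byte.
raw = EN_DASH.encode("cp1252")

# Assumed console encoding: code page 437 shows that same byte differently.
shown = raw.decode("cp437")
print(repr(raw), shown)


def replace_dashes(text):
    """Swap en and em dashes for a plain ASCII hyphen-minus."""
    return text.replace("\u2013", "-").replace("\u2014", "-")


print(replace_dashes("Encryptor \u2013pid1 0x34f \u2013pid2"))
```

If the option names are meant to be typed at a command line anyway, replacing the dashes with plain hyphens sidesteps the console encoding entirely.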

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama

Use the /u modifier for UTF-8 mode in regexes.
On the whole, you're better off doing an iconv('utf-8','ascii//TRANSLIT',$string) on both the name and the search term and comparing those.

One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways: as a single character (U+0101), or as TWO Unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely only ever going to get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexps.
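Both spellings of ā can be handled by normalizing before comparing. The sketch below shows the idea in Python rather than PHP, with hypothetical helper names fold and suggest: it decomposes each string, drops the combining marks, and then does a plain substring test.

```python
import unicodedata


def fold(text):
    """Lowercase text and remove combining marks such as the macron."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return bare.casefold()


def suggest(term, titles):
    """Return the titles whose folded form contains the folded term."""
    needle = fold(term)
    return [title for title in titles if needle in fold(title)]


titles = [
    "M\u0101ori History",    # precomposed U+0101
    "Ma\u0304ori Language",  # 'a' followed by combining macron U+0304
    "Moorings",
]
print(suggest("maori", titles))
```

Here both Māori titles are returned and "Moorings" is not, regardless of which form the macron was stored in.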
