Regex matching behavior with multi-character unicode symbol

I am having trouble understanding some behavior I observed with multi-character Unicode symbols.
Take, as an example, the string 🤚🏾🇺🇸🤚🏾🇺🇸🤚🏾 and the regex (🤚🏾|🇺🇸)(?![🏾]). I get three matches: both flags and the last hand. I expected 5 matches, one for each symbol.
Since both 🤚🏾 and 🇺🇸 are two-character symbols, I tried writing a non-Unicode example. With the string abcdabcdab and the regex (ab|cd)(?![b]), I get the expected 5 matches, one for each ab and cd pair.
Thinking that there might be some interaction between 🤚🏾 and 🏾, I used a different Unicode character, giving me the regex (🤚🏾|🇺🇸)(?![🇹]). Here I get the same result as in the first example.
Since both 🇹 and 🏾 are usually not used individually, I tried using "normal" Unicode or ASCII characters instead of 🇹. In my example, I used 🐰 and a, which gave me the expected result of 5 matches, one for each symbol.
Is someone able to explain this behavior, or is this a bug?
This behavior only happened with the PCRE and JavaScript regex engines. I used https://regex101.com/ to test it.

You should not put a character from outside the Basic Multilingual Plane inside a character class, as in (?![🏾]). In an engine that works on UTF-16 code units (such as JavaScript without the u flag), the character is "decomposed" inside the class into its two code units, \uD83C and \uDFFE (a surrogate pair), and the class matches either one of them, not the two as a sequence. The hand emoji is the sequence \uD83E\uDD1A\uD83C\uDFFE, and the flag 🇺🇸 that follows it starts with \uD83C too, so the negative lookahead fails after every hand that is followed by a flag, which is why those matches are missing.
To solve the problem, just remove the brackets and use (🤚🏾|🇺🇸)(?!🏾), so that 🏾 is treated as a sequence of code units and not as one unit or the other.
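
The difference can be reproduced outside the browser with a rough Python sketch. This is only an approximation of an engine that works on UTF-16 code units, not the engine's actual implementation: the helper utf16_units is a made-up name that maps every UTF-16 code unit of a string to one character, so that Python's re module sees the same units.

```python
import re
import struct

def utf16_units(text):
    """Return a str in which each character is one UTF-16 code unit of `text`."""
    data = text.encode("utf-16-be")
    return "".join(chr(unit) for (unit,) in struct.iter_unpack(">H", data))

HAND = utf16_units("\U0001F91A\U0001F3FE")  # 🤚🏾 -> D83E DD1A D83C DFFE
FLAG = utf16_units("\U0001F1FA\U0001F1F8")  # 🇺🇸 -> D83C DDFA D83C DDF8
TONE = utf16_units("\U0001F3FE")            # 🏾 -> D83C DFFE

subject = HAND + FLAG + HAND + FLAG + HAND

# Character class: a match is rejected if the next unit is D83C OR DFFE.
with_class = re.findall(f"({HAND}|{FLAG})(?![{TONE}])", subject)

# Plain sequence: a match is rejected only if D83C DFFE follows, in that order.
with_sequence = re.findall(f"({HAND}|{FLAG})(?!{TONE})", subject)

print(len(with_class))     # 3
print(len(with_sequence))  # 5
```

In JavaScript itself, adding the u flag should have the same effect as removing the brackets, since the pattern is then read as code points instead of code units.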

Related

How to search a unicode character using its code point in sublime text

From what I understand, Unicode characters have various representations, e.g. the code point or the hex bytes (these two representations are not always the same if UTF-8 encoding is used).
If I want to search for a visible Unicode character (e.g., 汉), I can just copy it and search. This works even if I do not know its underlying Unicode representation. But for characters that are not easily visible, such as a zero-width space, that approach does not work well. For these characters, we may want to search by code point.
My question
If I know a character's code point, how do I search for it in Sublime Text using a regular expression? I single out Sublime Text because different editors may use different formats.

Zero width space characters can be found via:
\x{200b}
Non breaking space characters can be found via:
\xa0

For a Unicode character whose code point is CODE_POINT (written in hexadecimal), we can safely use a regular expression of the form \x{CODE_POINT} to search for it.
General rules
For characters whose code points fit in two hex digits, it is fine to use \x without curly braces, but for characters whose code points need more than two hex digits, you have to use \x followed by curly braces.
Some examples
To find the character A, you can use either \x{41} or \x41.
As another example, to find 我 (code point U+6211), you have to use \x{6211} instead of \x6211. If you use \x6211, you will not find the character 我, because \x62 is read as the character b, followed by the literal text 11.
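
If you can paste the character but do not know its code point, a small helper can build the pattern for you. This is just a sketch in Python; sublime_code_point_pattern is a made-up name, and it always uses the braced form since that works for any number of hex digits.

```python
def sublime_code_point_pattern(char):
    """Build a \\x{...} search pattern for a single character."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    # ord() gives the code point; format(..., "x") writes it in lowercase hex.
    return "\\x{" + format(ord(char), "x") + "}"

print(sublime_code_point_pattern("\u200b"))  # \x{200b}
print(sublime_code_point_pattern("我"))      # \x{6211}
print(sublime_code_point_pattern("A"))       # \x{41}
```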

Replace odd length substrings of character

I am struggling with a little problem concerning regular expressions.
I want to replace all odd length substrings of a specific character with another substring of the same length but with a different character.
All even sequences of the specified character should remain the same.
Simplified example: A string contains the letters a,b and y and all the odd length sequences of y's should be replaced by z's:
abyyyab -> abzzzab
Another possible example might be:
ycyayybybcyyyyycyybyyyyyyy
becomes
zczayybzbczzzzzcyybzzzzzzz
I have no problem matching all the sequences of odd length using a regular expression.
Unfortunately I have no idea how to incorporate the length information from these matches into the replacement string.
I know I have to use backreferences/capture groups somehow, but even after reading lots of documentation and Stack Overflow articles I still don't know how to pursue the issue correctly.
Concerning possible regex engines, I am mainly working with Emacs or Vim.
In case I have overlooked an easier general solution without a complicated regular expression (e.g. a small and fixed series of simple search and replace commands), this would help too.

Here's how I'd do it in vim:
:s/\vy@<!y(yy)*y@!/\=repeat('z', len(submatch(0)))/g
Explanation:
The regex we're using is \vy@<!y(yy)*y@!. The \v at the beginning turns on the "very magic" option, so we don't have to escape as much. Without it, we would have y\@<!y\(yy\)*y\@!.
The basic idea for this search is that we're looking for a single 'y', y, followed by a run of pairs of 'y's, (yy)*. Then we add y@<! to guarantee there isn't a 'y' before our match, and add y@! to guarantee there isn't a 'y' after our match.
Then we replace this using the eval register, i.e. \=. From :h sub-replace-\=:
*sub-replace-\=* *s/\=*
When the substitute string starts with "\=" the remainder is interpreted as an
expression.
The special meaning for characters as mentioned at |sub-replace-special| does
not apply except for "<CR>". A <NL> character is used as a line break, you
can get one with a double-quote string: "\n". Prepend a backslash to get a
real <NL> character (which will be a NUL in the file).
The "\=" notation can also be used inside the third argument {sub} of
|substitute()| function. In this case, the special meaning for characters as
mentioned at |sub-replace-special| does not apply at all. Especially, <CR> and
<NL> are interpreted not as a line break but as a carriage-return and a
new-line respectively.
When the result is a |List| then the items are joined with separating line
breaks. Thus each item becomes a line, except that they can contain line
breaks themselves.
The whole matched text can be accessed with "submatch(0)". The text matched
with the first pair of () with "submatch(1)". Likewise for further
sub-matches in ().
TL;DR: :s/foo/\=blah replaces foo with blah evaluated as Vimscript code. So the code we're evaluating is repeat('z', len(submatch(0))), which simply makes one 'z' for each 'y' we've matched.
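
For comparison, the same idea (match a run of odd length that is not touching another 'y', then build the replacement from the match length) can be sketched in Python. This is not part of the vim solution; replace_odd_runs is a made-up name, and the callable passed to sub plays the role of \= here.

```python
import re

# One 'y' plus any number of 'yy' pairs, with no 'y' directly before or after.
ODD_RUN = re.compile(r"(?<!y)y(?:yy)*(?!y)")

def replace_odd_runs(text):
    """Replace every odd-length run of 'y' with a run of 'z' of the same length."""
    return ODD_RUN.sub(lambda m: "z" * len(m.group(0)), text)

print(replace_odd_runs("abyyyab"))  # abzzzab
print(replace_odd_runs("ayyb"))     # ayyb
```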

regexEXR V2.1 character set mismatch

I was working on regex and studying the applications of character sets.
I tried the regex /[64-bit]/g, but the highlighted result was not what I expected: it highlighted uppercase letters, digits and certain operators.
Why is that?

It's obvious that you're not using the right construct. Once you fix that, everything falls into place.
It doesn't make sense to use a character class if you want to match 64-bit literally. You should just use /64-bit/g as your regex in this case.
Character classes (specified by []) have different rules than the rest of the regex. They match a single character listed within (or not listed, if it's a negated char class).
A range of characters can also be specified to match, and that is where you have your problem: inside the brackets, 4-b is read as a range. According to any ASCII chart, 4 is #52 in the table and b is #98. (Note that [4-bit] is actually an equivalent regex, since 6 already lies inside that range.) Between those two points there are many characters, including the remaining digits, the uppercase letters and several punctuation characters. That is why you are getting unexpected matches.
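
To check this for yourself, here is a minimal Python sketch that lists every printable ASCII character the class accepts and compares it with the literal pattern. The test sentence is made up for illustration.

```python
import re

char_class = re.compile(r"[64-bit]")  # '6', the range '4'..'b', 'i', 't'
literal = re.compile(r"64-bit")       # the six characters "64-bit" in order

# Collect every printable ASCII character that the class matches on its own.
matched = [chr(c) for c in range(32, 127) if char_class.fullmatch(chr(c))]
print("".join(matched))

found = literal.findall("A 64-bit CPU, not 32-bit")
print(found)  # ['64-bit']
```

The first line printed runs from 4 through b in ASCII order, followed by i and t.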

Boost regex does not match

I made a python regular expression and now I'm supposed to code the program in C++.
I was told to use boost's regex by the respective person.
It is supposed to match a group of 1 to 80 lowercase alphanumeric characters (including underscore), followed by a slash, then another group of 1 to 80 lowercase alphanumeric characters (again including underscore), and last but not least a question mark. The total string must be at least 1 character long and is not allowed to exceed 256.
Here is my python regex:
^((?P<grp1>[a-z0-9_]{1,80})/(?P<grp2>[a-z0-9_]{1,80})([?])){1,256}$
My current boost regex is:
^(([a-z0-9_]{1,80})\/([a-z0-9_]{1,80})([?])){1,256}$
Cut down, my code basically looks like this:
boost::cmatch match;
bool isMatch;
boost::regex myRegex = "^(([a-z0-9_]{1,80})\/([a-z0-9_]{1,80})([?])){1,256}$";
isMatch = boost::regex_match(str.c_str(), match, myRegex);
Edit: whoops totally forgot the question xDD. My problem is quite simple: The regex doesn't match though it's supposed to.
Example matches would be:
some/more?
object/value?
devel42/version_number?

The last requirement
The total string must be at least 1 character long and is not allowed to exceed 256.
is always true, as the rest of your pattern already limits the string to between 4 and 162 characters. You only have to keep the first part of your regex:
^[a-z0-9_]{1,80}/[a-z0-9_]{1,80}\?$
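
Since the pattern started out as a Python regex, the simplified version can be checked there first. A minimal sketch, with is_valid as a made-up helper name; fullmatch stands in for the ^ and $ anchors:

```python
import re

PATTERN = re.compile(r"[a-z0-9_]{1,80}/[a-z0-9_]{1,80}\?")

def is_valid(text):
    """True if the whole string is <group>/<group>? as described in the question."""
    return PATTERN.fullmatch(text) is not None

for sample in ["some/more?", "object/value?", "devel42/version_number?"]:
    print(sample, is_valid(sample))  # all True

shortest = "a/b?"                          # 1 + 1 + 1 + 1 = 4 characters
longest = "a" * 80 + "/" + "b" * 80 + "?"  # 80 + 1 + 80 + 1 = 162 characters
print(len(shortest), is_valid(shortest))   # 4 True
print(len(longest), is_valid(longest))     # 162 True
```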

My g++ gives me the warning "unknown escape sequence: '\/'"; that means you should use "\\/" instead of "\/". You need a backslash character stored in the string, so that the regex parser can then consume it as an escape.
By the way, my boost also requires a constructor invocation, so
boost::regex myRegex("^(([a-z0-9_]{1,80})\\/([a-z0-9_]{1,80})([?])){1,256}$");
seems to work.
You can also use a C++11 raw string literal to avoid C++ escaping:
boost::regex myRegex(R"(^(([a-z0-9_]{1,80})\/([a-z0-9_]{1,80})([?])){1,256}$)");
By the way, testing <regex> in libstdc++ svn is welcome. It should come with GCC 4.9 ;)

The actual error was a newline that the client sent to the server along with the entered string, so the string that was later compared still contained it.
Funny how the error's root is rarely where you expect it to be.
Anyways, thank you all for your answers. They gave me the ability to clean up my regular expressions.
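
For anyone running into the same thing, the effect is easy to reproduce. The sketch below is in Python and not the original server code; the received string is a made-up example of what a client might send.

```python
import re

PATTERN = re.compile(r"[a-z0-9_]{1,80}/[a-z0-9_]{1,80}\?")

received = "some/more?\n"  # hypothetical input with the trailing newline

# A whole-string match fails because of the extra newline at the end.
raw_ok = PATTERN.fullmatch(received) is not None

# Stripping line terminators before matching fixes it.
stripped_ok = PATTERN.fullmatch(received.rstrip("\r\n")) is not None

print(raw_ok)       # False
print(stripped_ok)  # True
```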

Regex for extracting qmake variables

I'm trying to write the QRegExp for extracting variable names from qmake project code (*.pro files).
The syntax of variable usage has two forms:
$$VAR
$${VAR}
So, my regular expression must handle both cases.
I'm trying to write the expression this way:
\$\$\{?(\w+)\}?
But it does not work as expected: for the string $$VAR I get a $$V match with greedy matching disabled (QRegExp::setMinimal(true)). As I understand it, greedy mode can lead to wrong results in my case.
So, what am I doing wrong?
Or maybe I should just use greedy mode and not care about this behavior :)
P.S. Variable names can't contain spaces or other "special" symbols, only letters.

You do not need to disable greedy matching. If greedy matching is disabled, the minimal match that satisfies your expression is returned. In your example, there's no need to match the AR, because $$V satisfies your expression.
So turn greedy matching back on (i.e. turn minimal mode off), and use
\$\$(\w+|\{\w+\})
This matches two dollar signs, followed by either a bunch of word characters, or by a bunch of word characters between braces. If you can trust your data not to contain any non-matching braces, your expression should work just as well.
For ASCII text, \w is equivalent to [A-Za-z0-9_], so it matches all digits, all upper and lowercase alphabetical letters, and the underscore. If you want to restrict this to just the letters of the alphabet, use [A-Za-z] instead.
Since the variable names cannot contain any special characters, there's no danger of matching too much, unless a variable can be followed directly by more regular characters, in which case it's ambiguous.
For instance, if the data contains a string like Buy our new $$Varbuster!, where $$Var is supposed to be the variable, there is no regular expression that will separate the variable from the rest of the string.
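
The difference between the two modes can be tried out quickly in Python. This is only a sketch of the matching behavior, not QRegExp code: Python's lazy quantifier +? is used here as a stand-in for setMinimal(true), and extract_variables and the sample line are made up for illustration.

```python
import re

# Lazy quantifier: stops as soon as the expression is satisfied.
lazy = re.search(r"\$\$\{?(\w+?)\}?", "$$VAR")
# Greedy quantifier: takes as many word characters as it can.
greedy = re.search(r"\$\$\{?(\w+)\}?", "$$VAR")
print(lazy.group(0))    # $$V
print(greedy.group(0))  # $$VAR

VARIABLE = re.compile(r"\$\$(\w+|\{\w+\})")

def extract_variables(text):
    """Return variable names for both $$VAR and $${VAR}, without the braces."""
    return [name.strip("{}") for name in VARIABLE.findall(text)]

print(extract_variables("SOURCES += $$SRC_DIR/main.cpp $${EXTRA}"))
# ['SRC_DIR', 'EXTRA']
```

Note that the captured group still contains the braces in the $${VAR} case, which is why the sketch strips them afterwards.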
