How to search for a negated Unicode backreference (PCRE)? - regex

I'd like to find two non-identical Unicode words separated by a colon using a PCRE regex.
Take for example, this string:
Lôrem:ipsüm dõlör:sït amêt:amêt cønsectetûr:cønsectetûr âdipiscing:elït
I can easily find the two identical words separated by a colon using:
(\p{L}+):(\1)
which will match: cønsectetûr:cønsectetûr and amêt:amêt
However, I want to negate the backreference to find only non-identical Unicode words separated by a colon.
What's the proper way to negate a backreference in PCRE?
(\p{L}+):(^\1) obviously does not work.

You start by using a negative lookahead to prevent a match if the captured part repeats after the colon:
(\p{L}+):(?!\1)
Then you need to match the second unicode word, another \p{L}+:
(\p{L}+):(?!\1)\p{L}+
And last, to prevent false matches, use word boundaries:
\b(\p{L}+):(?!\1\b)\p{L}+\b
regex101 demo

Related

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result to That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature that removes from the match everything that has been captured from the regex up to that point, and starts a new match that will be returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
Also note that, the pattern \w selects only Latin characters, so if your documents contain text in a different character set, these characters will be consumed by the \W which is supposed to match the separators. If you want to capture text with non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text on one language and not on another language, you can modify it accordingly, by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}" which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
The following pattern matches for each pair of two words (note that it won't match the last word since it does not have a pair). The first word is in the first capture group, (\w+). Then a positive lookahead includes a match for a sequence of non word characters \W+ and then another string of word characters \w+. The lookahead (?=...) the second word is not "consumed".
Here is a link to a demo on Regex101
Note that for each match, each word is in its own capture group (group 1, group 2)
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that is a problem of an overlapping regex pattern.

Regex not extracting all matching words

I am trying to extract words that have at least one character from a special character set. It picks up some words and not others. Here is a link to regex101 to test it. This it the regex \b(\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*)\b, and this is the sample sentence I am using
His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn
Al-Daḥāk Al-Sulamī Al-Tirmidhī.
It should match the following words:
ʿĪsa Muḥammad ʿĪsa Mūsa Al-Daḥāk Al-Sulamī Al-Tirmidhī
I am not too experienced with regex, so I have no idea what I am doing wrong. If someone knows any tool to find out why a specific word doesn't match a regex pattern, please let me know as well.
You can use
[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ][\wāīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]*
After matching the one required special character, use another character set to match more occurrences of those characters or normal word characters.
https://regex101.com/r/ovJoLt/2
You can make this work by enabling the Unicode flag /u (so that the word boundary \b assertions support Unicode characters) and adding hyphens to the surrounding character groups:
/\b[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+[\w-]*\b/gu
Plus, you don't need the capturing group, since the only characters being matched form the desired output anyway (\b is a zero-width assertion).
Demo
You are not doing anything wrong except that to match unicode boundaries you have to enable u modifier or use (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*(?!\S)
If you want to match hyphen add it to your character class (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]+\w*(?!\S)

Need RegEx to remove dots from a string only if there are more than one

Hello I have strings like
tda2030 100.200.300 circuit
I want to check if this string contains any keyword (separated by whitespace but can be on start or end) that contain more than 1 dot and then remove the dots.
The result should be
tda2030 100200300 circuit
in the example.
I tried a lot but I think I need a regex-pert :) Thanks in advance.
That's an interesting question because of your requirement to have multiple dots and the fact that PCRE does not allow infinite-width lookbehinds to see if we might have a dot behind us. We'll get over that limitation by using \K and \G.
Here is a regex that will find the right dots (see online demo)
(?<=\w)\.(?=\w+\.)|\G\w+\K\.
Use preg_replace to replace with an empty string:
$replaced = preg_replace("~(?<=\w)\.(?=\w+\.)|\G\w+\K\.~","",$string);
How does it work?
We have two cases separated by an | (OR)
Match a dot that is preceded by at least one word character and followed by some word characters and a dot
Match a dot that follows the previous match (which had to be a dot) and some word characters

Regex for this particular pattern

I have three different things
xxx
xxx>xxx
xxx>xxx>xxx
Where xxx can be any combination of letters and number
I need a regex that can match the first two but NOT the third.
To match ASCII letters and digits try the following:
^[a-zA-Z0-9]{3}(>[a-zA-Z0-9]{3})?$
If letters and digits outside of the ASCII character set are required then the following should suffice:
^[^\W_]{3}(>[^\W_]{3})?$
^\w+(?:>\w+)?$
matches an entire string.
\w+(?:>\w+)?\b(?!>)
matches strings like this in a larger substring.
If you want to exclude the underscore from matching, you can use [\p{L]\p{N}] instead (if your regex engine knows Unicode), or [^\W_] if it doesn't, as a substitute for \w.

Matching parts of string that contain no consecutive dashes

I need a regex that will match strings of letters that do not contain two consecutive dashes.
I came close with this regex that uses lookaround (I see no alternative):
([-a-z](?<!--))+
Which given the following as input:
qsdsdqf--sqdfqsdfazer--azerzaer-azerzear
Produces three matches:
qsdsdqf-
sqdfqsdfazer-
azerzaer-azerzear
What I want however is:
qsdsdqf-
-sqdfqsdfazer-
-azerzaer-azerzear
So my regex loses the first dash, which I don't want.
Who can give me a hint or a regex that can do this?
This should work:
-?([^-]-?)*
It makes sure that there is at least one non-dash character between every two dashes.
Looks to me like you do want to match strings that contain double hyphens, but you want to break them into substrings that don't. Have you considered splitting it between pairs of hyphens? In other words, split on:
(?<=-)(?=-)
As for your regex, I think this is what you were getting at:
(?:[^-]+|-(?<!--)|\G-)+
The -(?<!--) will match one hyphen, but if the next character is also a hyphen the match ends. Next time around, \G- picks up the second hyphen because it's the next character; the only way that can happen (except at the beginning of the string) is if a previous match broke off at that point.
Be aware that this regex is more flavor dependent than most; I tested it in Java, but not all flavors support \G and lookbehinds.