What is the purpose of using positive lookarounds over not? - regex

Say the string is ‘abc’ and the expression is (?=a)abc, would that not be the same as just searching for abc? When do positive lookarounds have purpose over not using them?

Positive lookahead works just the same. q(?=u) matches a q that is followed by a u, without making the u part of the match. The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.
http://www.regular-expressions.info/lookaround.html
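To make the quoted explanation concrete, here is a minimal sketch in Python (my choice of language; the question does not name one). It shows that q(?=u) reports only the q, and that for the string from the question, (?=a)abc and abc find the same text, so the lookahead there is redundant.

```python
import re

# q(?=u) requires a following "u" but does not include it in the match.
m = re.search(r"q(?=u)", "quit")
matched = m.group(0)
span = m.span()

# With no "u" after the "q", the lookahead fails and there is no match.
no_match = re.search(r"q(?=u)", "Iraq")

# For the string in the question the lookahead adds nothing:
# both patterns match the same text.
with_lookahead = re.search(r"(?=a)abc", "abc").group(0)
without_lookahead = re.search(r"abc", "abc").group(0)

print(matched, span, no_match, with_lookahead == without_lookahead)
```

The lookahead starts to matter when the text it checks should stay out of the match, as in the comma example that follows.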
Here is a small example from https://ourcraft.wordpress.com/2009/03/25/positive-examples-of-positive-and-negative-lookahead/
Say I want to retrieve from a text document all the words that are immediately followed by a comma. We’ll use this example string:
What then, said I, shall I do? You shan't, he replied, do anything.
As a first attempt, I could use this regular expression to get one or more word parts followed by a comma:
[A-Za-z']+,
This yields four results over the string:
then,
I,
shan't,
replied,
Notice that this gets me the comma too, though, which I would then have to remove. Wouldn’t it be better if we could express that we want to match a word that is followed by a comma without also matching the comma?
We can do that by modifying our regex as follows:
[A-Za-z']+(?=,)
This matches groups of word characters that are followed by a comma, but because of the use of lookahead the comma is not part of the matched text (just as we want it not to be). The modified regex results in these matches:
then
I
shan't
replied
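The same comparison can be sketched in Python, assuming the standard re module; the variable names are only for illustration.

```python
import re

text = "What then, said I, shall I do? You shan't, he replied, do anything."

# Without lookahead: the comma is part of each match.
with_comma = re.findall(r"[A-Za-z']+,", text)

# With lookahead: the comma must follow, but is not part of the match.
without_comma = re.findall(r"[A-Za-z']+(?=,)", text)

print(with_comma)
print(without_comma)
```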

Related

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result in That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature, which drops everything matched so far from the reported match and starts the returned match from that point. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
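For readers who want to try the two-stage idea outside RapidMiner, here is a rough Python sketch. Python's built-in re module does not support \K, so this is not the same regex: instead of matching the delimiter, it matches each two-word token directly and then splits it on \W+. Treat it as an approximation of the approach, not as what RapidMiner does internally.

```python
import re

text = "That was a good movie."

# Stage 1: grab each "word, separators, word" token directly,
# instead of splitting on the separator that follows it.
two_word_tokens = re.findall(r"\w+\W+\w+", text)

# Stage 2: split every token on the separators to get its two words.
pairs = [re.split(r"\W+", token) for token in two_word_tokens]

print(two_word_tokens)
print(pairs)
```

Note that the tokens do not overlap, and a trailing word without a partner ("movie" here) is simply dropped by this sketch.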
Also note that, by default, \w matches only ASCII letters, digits, and the underscore, so if your documents contain text in a different script, those characters will be consumed by \W, which is supposed to match the separators. If you want to capture text in non-Latin scripts, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language and not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you would use "{Greek}", or "{InGreek}", which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
This pattern matches once for each pair of adjacent words (note that the last word on its own produces no match, since it has no partner). The first word is in the first capture group, (\w+). Then a positive lookahead matches a sequence of non-word characters, \W+, followed by another run of word characters, \w+. Because this happens inside the lookahead (?=...), the second word is not "consumed" and can start the next match.
Note that for each match, the first word is in group 1, and group 2 holds the separator together with the second word.
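Here is a small Python sketch of that pattern applied to the sentence from the question. It assumes you can post-process the capture groups yourself, which may not be possible inside a tokenizer that only accepts a delimiter pattern.

```python
import re

text = "That was a good movie."

# Group 1 is the first word; the lookahead captures the separator plus
# the second word in group 2 without consuming them.
pattern = re.compile(r"(\w+)(?=(\W+\w+))")

# Join the two groups to rebuild each overlapping pair.
pairs = [m.group(1) + m.group(2) for m in pattern.finditer(text)]

print(pairs)
```

Because the second word is never consumed, it is available as the first word of the next match, which is what produces the overlapping pairs.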
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that it is a problem of overlapping regex matches.

Match string with prefix and at least one comma (,)

Trying to match a comma-separated list of values.
I want to check that a comma ',' occurs at least once and that the string contains a certain prefix,
i.e.
tel_local: 123456, tel_national: 123456
is valid but:
tel_local: 123456 is not as no comma
Currently using
^(tel_local:)|(tel_national:),+$
but it matches tel_local: 123456
Just try with the following, which requires at least one ", key: digits" group after the first entry:
^(\w+: \d+)(?:, (\w+: \d+))+$
To do exactly what you ask, use:
^tel_(?:local|national):(?=.*?,)
First, your main problem was the alternation. Think about a conditional such as (A=B OR A=C) AND Z=5. The parentheses are necessary for the order of evaluation to remain correct, and the same goes for the alternation ((?:local|national)) in your expression.
Then, since you didn't specify what had to come after tel_local or tel_national (one would assume a space and digits) aside from the comma, I used a lookahead assertion. (?=.*?,) looks ahead over zero or more characters and checks for a ,. If no comma is found, the assertion fails and so does the match.
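A quick way to see both points is to run the original and the corrected pattern against the two sample strings. This is a sketch in Python, assuming its re module behaves like your engine for these constructs.

```python
import re

valid = "tel_local: 123456, tel_national: 123456"
invalid = "tel_local: 123456"

# Original pattern: the top-level | splits it into "^(tel_local:)" OR
# "(tel_national:),+$", so the first half matches without any comma.
original = re.compile(r"^(tel_local:)|(tel_national:),+$")

# Corrected pattern: grouped alternation plus a lookahead for a comma.
fixed = re.compile(r"^tel_(?:local|national):(?=.*?,)")

original_matches_invalid = original.search(invalid) is not None
fixed_matches_valid = fixed.search(valid) is not None
fixed_matches_invalid = fixed.search(invalid) is not None

print(original_matches_invalid, fixed_matches_valid, fixed_matches_invalid)
```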
You can use this lookahead based regex:
^(?=[^,]*,)((?:tel_local|tel_national): *\d+,? *)+$
You need to include the Kleene plus ("+") in a regular expression to state that a term should occur one or more times.
Therefore, if you want to specify that a matching string should contain at least one comma, something like ",+" should do.

Regular expressions, can I exclude pairs of characters?

How do you exclude pairs of characters from a regular expression?
I am trying to get a regular expression that will have 5 alphanumeric characters followed by
anything except "XX" and "AD", followed by XX.
So
D22D0ACXX
will match, but the following two will not match
D22D0ADXX
D22D0XXXX.
My first attempt was :
([A-Z0-9]{5}[^(?AD)|(?XX)]XX)
But this treats the character class part [^(?AD)|(?XX)] as one character, so I end up with the last 8 characters, not all 9.
Can I exclude pairs of characters without getting into back references?
I need to capture the whole group, hence the outer parentheses. The negative lookahead suggestions don't seem to do this.
Use negative lookahead:
([A-Z0-9]{5}(?!(AD|XX)XX).{4})
Don't treat it as a character class. Instead, think of it as an alternation inside a negative lookahead, e.g.:
([A-Z0-9]{5}(?!(AD|XX)XX))
Then, if you need the tail, include it after the lookahead, e.g.:
([A-Z0-9]{5}(?!(AD|XX)XX)[A-Z0-9]{4})
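The following Python sketch checks that last pattern against the three strings from the question. The fourth sample and the pattern named strict are my own additions, not from the thread: the answer's pattern does not itself require the trailing XX, so strict is one hypothetical way to enforce it.

```python
import re

# Pattern from the answer: five alphanumerics, a negative lookahead that
# rejects "ADXX" and "XXXX", then the four-character tail.
pattern = re.compile(r"([A-Z0-9]{5}(?!(AD|XX)XX)[A-Z0-9]{4})")

# Hypothetical stricter variant (an assumption, not from the thread):
# exclude the pair AD or XX, then require a literal XX at the end.
strict = re.compile(r"([A-Z0-9]{5}(?!AD|XX)[A-Z0-9]{2}XX)")

samples = ["D22D0ACXX", "D22D0ADXX", "D22D0XXXX", "D22D0ABCD"]

results = {s: bool(pattern.fullmatch(s)) for s in samples}
strict_results = {s: bool(strict.fullmatch(s)) for s in samples}

print(results)
print(strict_results)
```

Both patterns keep the outer parentheses, so group 1 holds the whole nine-character string on a successful match.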

Regex to match one or two quotes but not three in a row

For the life of me I can't figure this one out.
I need to search the following text, matching only the quotes in bold:
Don't match: """This is a python docstring"""
Match: " This is a regular string "
Match: "" ← That is an empty string
How can I do this with a regular expression?
Here's what I've tried:
Doesn't work:
(?!"")"(?<!"")
Close, but doesn't match double quotes.
Doesn't work:
"(?<!""")|(?!"")"(?<!"")|(?!""")"
I naively thought that I could add the alternates that I don't want but the logic ends up reversed. This one matches everything because all quotes match at least one of the alternates.
(Please note: I'm not running the code, so solutions around using __doc__ won't help, I'm just trying to find and replace in my code editor.)
You can use /(?<!")"{1,2}(?!")/
Autopsy:
(?<!") a negative look-behind for the literal ". The match cannot have this character in front
"{1,2} the literal " matched once or twice
(?!") a negative look-ahead for the literal ". The match cannot have this character after
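Your first try probably failed because of where the assertions sit. A negative look-ahead placed before the quote and a negative look-behind placed after it both end up inspecting the quote being matched, so they do not test the surrounding characters the way you intended. Put the look-behind before the match and the look-ahead after it.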
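To check the pattern against the three sample lines, here is a short Python sketch. The sample strings are typed out from the question, with a plain ASCII arrow in the third one.

```python
import re

# One or two quotes, with no quote directly before or after them.
pattern = re.compile(r'(?<!")"{1,2}(?!")')

lines = [
    '"""This is a python docstring"""',
    '" This is a regular string "',
    '"" <- That is an empty string',
]

matches = [pattern.findall(line) for line in lines]

for line, found in zip(lines, matches):
    print(line, "->", found)
```

The docstring line yields no matches, the regular string yields its two single quotes, and the empty string yields one match containing both quotes.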
I realized that my original problem description was actually slightly wrong. That is, I actually need to match only a single quote character at a time, unless it's part of a group of 3 quote characters.
The difference is that this is desirable for editing so that I can find and replace with '. If I match "one or two quotes" then I can't automatically replace with a single character.
I came up with this modification to h20000000's answer that satisfies that case:
(?<!"")(?<=(?!""").)"(?!"")
In the demo, you can see that the "" are matched individually, instead of as a group.
This works very similarly to the other answer, except:
it only matches a single "
that leaves us matching everything we want, except that it still matches the middle quote of a """
Finally, adding (?<=(?!""").) excludes that case specifically, by saying: look back one character, then fail the match if the next three characters are """.
I decided not to change the question because I don't want to hijack the answer, but I think this can be a useful addition.

Regular expression to split a string but consider multi-digit escape sequences

I could use some help with the following regular expression problem and would appreciate any help. Thanks in advance.
I have to split a string by another string; let me call it separatorString. However, if an escape sequence precedes separatorString, the string should not be split at this point. The escape sequence is also a string; let me call it escapeSequence.
Maybe it is better to start with some examples
separatorString = "§§";
escapeSequence = "###";
inputString = "Part1§§Part2" ==> Desired output: "Part1", "Part2"
inputString = "Part1§§Part2§§ThisIs###§§AllPart3" ==> Desired output: "Part1", "Part2", "ThisIs###§§AllPart3"
Searching stackoverflow, I found Splitting a string that has escape sequence using regular expression in Java and came up with the regular expression
"(?<!(###))§§".
This is basically saying, match if you find "§§", unless it is preceded by "###".
This works fine with Regex.Split for the examples above. However, if inputString is "Part1###§§§§Part2" I receive "Part1###§", "§Part2" instead of "Part1###§§", "Part2".
I understand why: the second "§" gives a match, because the preceding characters are "##§" and not "###". I tried for several hours to modify the regex, but the result only got worse. Does someone have an idea?
Let's call the things that appear between the separators, tokens. Your regex needs to stipulate what the beginning and end of a token looks like.
In the absence of any stipulation, in other words, using the regex you have now, the regex engine is happy to say that the first token is Part1###§ and the second is §Part2.
The syntax you used, (?<!foo), is called a zero-width negative look-behind assertion. In other words, it looks behind the current position and asserts that the preceding text must not match foo. Zero-width just indicates that the assertion does not advance the pointer or cursor in the subject string when the assertion is evaluated.
If you require that a new token start with something specific (say, an alphanumeric character), you can specify that with a zero-width positive lookahead assertion. It's similar to your lookbehind, but it says "the next bit has to match the following pattern", again without advancing the cursor or pointer.
To use it, put (?=[A-Z]) after the §§. The entire regex for the separator is then
(?<!###)§§(?=[A-Z])
This asserts that the character following a separator sequence must be an uppercase letter, while the characters preceding the separator sequence must not be ###. In your example, it forces the match on the §§ separator to be the pair of characters immediately before Part2, so you get Part1###§§ and Part2 as the tokens.
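As a sanity check, here is a Python sketch of the split, using re.split in place of .NET's Regex.Split and the uppercase class [A-Z] in the lookahead. It relies on the assumption stated above, that every token starts with an uppercase letter.

```python
import re

# Separator: "§§" not preceded by the escape "###" and followed by the
# start of a new token (assumed here to be an uppercase letter).
separator = re.compile(r"(?<!###)§§(?=[A-Z])")

inputs = [
    "Part1§§Part2",
    "Part1§§Part2§§ThisIs###§§AllPart3",
    "Part1###§§§§Part2",
]

results = [separator.split(s) for s in inputs]

for s, parts in zip(inputs, results):
    print(s, "->", parts)
```

If tokens can start with other characters, the class inside the lookahead has to be widened accordingly.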
If you want to stipulate what a token is in the negative, in other words to stipulate that a token begins with anything except a certain pattern, you can use a negative lookahead assertion. The syntax for this is (?!foo). It works just as you would expect: like your negative lookbehind, only looking forward.
The regular-expressions.info website has good explanations for all things regex, including for the lookahead and lookbehind constructs.
How about doing the opposite: Instead of splitting the string at the separators match non-separator parts and separator parts:
/(?:[^§#]|§[^§#]|#(?:[^#]|#(?:[^#]|#§§)))+|§§/
Then you just have to remove every matched separator part to get the non-separator parts.
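Here is how that match-instead-of-split idea might look in Python, as a sketch. It uses re.findall to collect both kinds of matches and then filters out the separators; I have only checked it against the three inputs from the question, not against edge cases such as a token ending in a lone § or #.

```python
import re

# Either a run of non-separator material (including an escaped "###§§"),
# or a bare "§§" separator.
pattern = re.compile(r"(?:[^§#]|§[^§#]|#(?:[^#]|#(?:[^#]|#§§)))+|§§")

inputs = [
    "Part1§§Part2",
    "Part1§§Part2§§ThisIs###§§AllPart3",
    "Part1###§§§§Part2",
]

# Keep every match that is not a separator.
results = [[m for m in pattern.findall(s) if m != "§§"] for s in inputs]

for s, parts in zip(inputs, results):
    print(s, "->", parts)
```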