Regular expression no match when followed by character [duplicate] - regex

This question already has an answer here:
Regex match numbers not followed by a hyphen
(1 answer)
Closed 1 year ago.
I am trying to capture groups in a text that only match when the match is not followed by a specific character, in this case the opening parenthesis "(", which indicates the start of a 'function/method' rather than a 'property'.
This seems pretty straightforward so I tried:
TEXT
$this->willMatch but $this->willNot()
RESULT
RegExp pattern: \$this->[a-zA-Z0-9\_]+(?<!\()
Expected: $this->willMatch
Actual: $this->willMatch, $this->willNot
RegExp pattern: \$this->[a-zA-Z0-9\_]+[^\(]
Expected: $this->willMatch
Actual: $this->willMatch, $this->willNot
RegExp pattern: \$this->[a-zA-Z0-9]+(?!\()
Expected: $this->willMatch
Actual: $this->willMatch, $this->willNo
My intuition says I need to add ^ and $, but that won't work for multiple occurrences in a text.
Curious to meet the RegExp wizard that can solve this!

Answer from The fourth bird definitely works and it is well explained as well.
As an alternative to using a word boundary, one can use a possessive quantifier, i.e. ++, to turn off backtracking, which improves efficiency further.
\$this->\w++(?!\()
Please note use of \w instead of equivalent [a-zA-Z0-9_] here.
Like a greedy quantifier, a possessive quantifier repeats the token as many times as possible. Unlike a greedy quantifier, it does not give up matches as the engine backtracks.

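The lookbehind (?<!\() always succeeds here, because the character it inspects was just matched by the character class, which cannot match a (.
Note that you don't have to escape the underscore: \_ can be written as _.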
You can use a word boundary after the character class to prevent backtracking, and turn the negative lookbehind into a negative lookahead (?!\() to assert not ( directly to the right.
\$this->[a-zA-Z0-9_]+\b(?!\()
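The effect of the word boundary can be checked against the sample text from the question. The sketch below uses Python's re module, which is an assumption (the question does not name a language), and the variable names are illustrative only.

```python
import re

text = "$this->willMatch but $this->willNot()"

# Negative lookahead without a boundary: the engine gives back one
# character so that the lookahead sees "t" instead of "(".
no_boundary = re.findall(r"\$this->[a-zA-Z0-9]+(?!\()", text)

# Word boundary before the lookahead: after giving back characters the
# position sits between two word characters, where \b cannot match.
with_boundary = re.findall(r"\$this->[a-zA-Z0-9_]+\b(?!\()", text)

print(no_boundary)    # ['$this->willMatch', '$this->willNo']
print(with_boundary)  # ['$this->willMatch']
```

The first result reproduces the truncated $this->willNo from the question; the second keeps only the property access.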

Related

Regular Expression for anything in Between ${something} [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I am a newbie in regular expressions. I have written a regular expression for ${serviceName}; basically, I want to take the words between ${ and }. The regular expression I wrote for this works perfectly fine:
"\\$\\{(\\w+)\\}"
But I want to capture any value between the braces, not only word characters, for example ${serviceName.1.Type}. Can you help me with a regular expression for ${serviceName.1.Type}?
I hope my question is clear.
Thanks In Advance.
A good place to test regular expressions is https://regex101.com/
\w+ matches any word character (equal to [a-zA-Z0-9_])
If you want to match anything you can replace it with: .*
.* matches any character (except for line terminators)
You might want to add a "?" after the * so the match stops at the first "}"
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed
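Whether { and } need escaping depends on the flavor: PCRE accepts a literal { in this position, while Java's Pattern rejects an unescaped {. Keeping the escapes is the portable choice.
So what you want is:
"\\$\\{(.*?)\\}"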
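Another option is a character class that lists the allowed characters:
\$\{([\w.\s]+)\}
This expression captures everything that appears between the braces as a group.
You can then refer to the group as $1.
The class accepts word characters \w (letters, digits and underscore), dots and whitespace \s; if other placeholders contain additional characters, add them to the class. Inside a character class ? is a literal and . needs no escape, so a longer form such as [\w?\.?\d?\s?] reduces to [\w.\s], except that the longer form also accepts a literal ?.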
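As a rough check of the difference between \w+ and .*? inside the braces, here is a minimal sketch in Python (an assumption; the question's double-escaped string suggests another language). The sample text is made up for illustration.

```python
import re

# Hypothetical input containing both placeholder styles from the question.
text = "call ${serviceName} then ${serviceName.1.Type}"

# \w+ accepts only word characters, so the dotted placeholder is skipped.
words_only = re.findall(r"\$\{(\w+)\}", text)

# Lazy .*? accepts any character and stops at the first closing brace.
anything = re.findall(r"\$\{(.*?)\}", text)

print(words_only)  # ['serviceName']
print(anything)    # ['serviceName', 'serviceName.1.Type']
```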

RegEx for combining "match everything" and "negative lookahead" [duplicate]

This question already has answers here:
RegExp exclusion, looking for a word not followed by another
(3 answers)
Closed 3 years ago.
I'm trying to match the string "this" followed by anything (any number of characters) except "notthis".
Regex: ^this.*(?!notthis)$
Matches: thisnotthis
Why?
Even its explanation in a regex calculator seems to say it should work. The explanation section says
Negative Lookahead (?!notthis)
Assert that the Regex below does not match
notthis matches the characters notthis literally (case sensitive)
The negative lookahead has no effect in ^this.*(?!notthis)$ because .* first matches all the way to the end of the string. At that position nothing follows, so notthis cannot be directly to the right and the lookahead always succeeds.
I think you meant ^this(?!notthis).*$ where you match this from the start of the string and then check what is directly on the right can not be notthis
If that is the case, then match any character except a newline until the end of the string.
^this(?!notthis).*$
Details of the pattern
^ Assert start of the string
this Match this literally
(?!notthis) Assert what is directly on the right is not notthis
.* Match 0+ times any char except a newline
$ Assert end of the string
If notthis must not appear anywhere in the string, rather than only directly after this, you could add .* to the negative lookahead:
^this(?!.*notthis).*$
        ^^
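The three variants discussed in this answer can be compared side by side. This is a minimal sketch in Python (an assumption, since the question names no language); the helper matches and the sample strings other than thisnotthis are hypothetical.

```python
import re

original = re.compile(r"^this.*(?!notthis)$")        # lookahead runs at the end
directly_after = re.compile(r"^this(?!notthis).*$")  # lookahead runs right after "this"
anywhere = re.compile(r"^this(?!.*notthis).*$")      # lookahead scans the rest of the line


def matches(pattern, s):
    """Return True if the compiled pattern matches the string."""
    return pattern.search(s) is not None


print(matches(original, "thisnotthis"))             # True: the lookahead has no effect
print(matches(directly_after, "thisnotthis"))       # False
print(matches(directly_after, "this and notthis"))  # True: notthis is not directly after
print(matches(anywhere, "this and notthis"))        # False
print(matches(anywhere, "this is fine"))            # True
```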
Because of the order of your rules. By the time the engine reaches the negative lookahead, the preceding .* has already consumed the rest of the string, so there is nothing left for the lookahead to reject.
If you wish to match everything after this, except for notthis, this RegEx might also help you to do so:
^this([\s\S]*?)(notthis|())$
Here the lazy first group captures the text after this, and the second group matches either a trailing notthis or, through the empty group (), nothing at all.
You might remove (), ^ and $, and it may still work:
this([\s\S]*?)(notthis|)

Regex Negative Lookbehind Matches Lookbehind text .NET

Say I have the following strings:
PB-GD2185-11652-MTCH
GD2185-11652-MTCH
KD-GD2185-11652-MTCH
KD-GD2185-11652
I want Regex.IsMatch to return true if the string has MTCH in it and does not start with PB.
I expected the regex to be the following:
^(?<!PB)\S+(?=MTCH)
but that gives me the following matches:
PB-GD2185-11652-
GD2185-11652-
KD-GD2185-11652-
I do not understand why the negative lookbehind not only doesn't exclude the match but includes the PB characters in the match. The positive lookahead works as expected.
EDIT 1
Let me start with a simpler example. The following regex matches all of the strings as I would expect it to:
\S+
The following regex still matches all of the strings even though I would expect it not to:
\S+(?!MTCH)
The following regex matches all but the final H character on the first three strings:
\S+(?<!MTCH)
From the documentation at regex101, a lookahead looks for text to the right of the pattern and a lookbehind looks for text to the left of the pattern, so having a lookahead at the beginning of a string does not jibe with the documentation.
Edit 2
take another example with the following three strings:
grey
greyhound
hound
the regex:
^(?<!grey)hound
only matches the final hound. whereas the regex:
^(?<!grey)\S+
matches all three.
You need a lookahead: ^(?!PB)\S+(?=MTCH). Using the look-behind means the PB has to come before the first character.
The problem was caused by the greediness of \S+. When combining lookarounds with greedy quantifiers, it is easy to match more characters than you expect. One way to deal with this is to put a negative lookaround in a group together with the greedy quantifier, so that the unwanted text is excluded from the match, as described in this question:
How to non-greedy multiple lookbehind matches
and on this helpful website about greediness in regular expressions:
http://www.rexegg.com/regex-quantifiers.html
Note that this second link has a few other ways to deal with the greediness in various situations.
A good regular expression for this situation is as follows:
^(?<!PB)((?!PB)\S+)(MTCH)
In situations like this it is going to be much clearer to do it logically within the code. So first check if the string matches MTCH and then that it doesn't match ^PB
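The difference between a lookbehind and a lookahead placed right after ^, and the check-in-code approach, can be sketched as follows. Python is used here as an assumption (the question targets .NET, where Regex.IsMatch plays the role of re.search), and plain_check is a hypothetical helper name.

```python
import re

samples = [
    "PB-GD2185-11652-MTCH",
    "GD2185-11652-MTCH",
    "KD-GD2185-11652-MTCH",
    "KD-GD2185-11652",
]

# A lookbehind at ^ inspects the text before the start of the string,
# which is empty, so it never rejects anything.
lookbehind = re.compile(r"^(?<!PB)\S+(?=MTCH)")

# A lookahead at ^ inspects the first characters of the string itself.
lookahead = re.compile(r"^(?!PB)\S+(?=MTCH)")


def plain_check(s):
    """Two-step check done in code instead of in one regex."""
    return "MTCH" in s and not s.startswith("PB")


by_lookbehind = [bool(lookbehind.search(s)) for s in samples]
by_lookahead = [bool(lookahead.search(s)) for s in samples]
by_code = [plain_check(s) for s in samples]

print(by_lookbehind)  # [True, True, True, False]
print(by_lookahead)   # [False, True, True, False]
print(by_code)        # [False, True, True, False]
```

On these four strings the lookahead version and the plain check agree, while the lookbehind version still accepts the PB string.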

Why the [^A] does not work?

Why the regular expression:
changes\s*=\s*[^A].*
matches
changes = AssignDictionary(out
What I want is to match only when the word following the spaces (\s*) does not start with the character "A" ([^A]), so it is not supposed to match that line. What am I doing wrong?
The [^A] does not work because of backtracking. \s* matches zero or more whitespaces, and then the engine backtracks to accommodate for a non-A. Since there are two spaces after =, the second space is matched with [^A] -> there is a match.
If you want to fail the match when there is an A after =, you need a negative lookahead:
changes\s*=(?!\s*A)\s*.*
           ^^^^^^^^
Or another PCRE variation: changes\s*=\s*+(?!A).* (check if the character is not A after all whitespaces after =).
If your regex engine supports atomic groups or possessive quantifiers, you can make your regex work by preventing backtracking into the \s* construct:
changes\s*=\s*+[^A].*
             ^^ (possessive quantifier)
changes\s*=(?>\s*)[^A]\s*.*
            ^^   ^ - atomic group
And in case your engine does not support atomic groups, nor possessive quantifiers, you can disable backtracking with a capture group/backreference combination (to emulate an atomic group):
changes\s*=(?=(\s*))\1[^A].*
Still, the first solution with a lookahead is preferable since it seems the most universal one. The fastest looks to be the one with the possessive quantifier.
It is also possible to get that with plain regex. Just indicate what is not a valid character following the arbitrary number of spaces before the "not A". As you indicated this is: not A, but of course also "not a space".
Otherwise backtracking would allow a space preceding an A in that position to be matched by the "not-A" and defeat your intentions.
Using changes\s*=\s*[^A\s].* will match anything that does not have an A or a whitespace character after the spaces following the equals sign (and extend the match to end-of-line/end-of-input).
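The backtracking problem and the workarounds above can be compared in a short sketch. Python is an assumption here; possessive quantifiers and atomic groups are left out because older Python versions do not support them. The second input line is hypothetical.

```python
import re

line = "changes = AssignDictionary(out"   # should NOT match
other = "changes = BuildDictionary(out"   # hypothetical line that should match

patterns = {
    "original": r"changes\s*=\s*[^A].*",
    "lookahead": r"changes\s*=(?!\s*A)\s*.*",
    "emulated_atomic": r"changes\s*=(?=(\s*))\1[^A].*",
    "negated_with_space": r"changes\s*=\s*[^A\s].*",
}

# For each pattern: (matches the "A" line, matches the other line)
results = {
    name: (bool(re.search(p, line)), bool(re.search(p, other)))
    for name, p in patterns.items()
}

for name, outcome in results.items():
    print(name, outcome)
# original (True, True)
# lookahead (False, True)
# emulated_atomic (False, True)
# negated_with_space (False, True)
```

Only the original pattern accepts the line starting with A, because \s* gives back the space and [^A] then matches it.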

What is the difference between .*? and .* regular expressions?

I'm trying to split up a string into two parts using regex. The string is formatted as follows:
text to extract<number>
I've been using (.*?)< and <(.*?)> which work fine but after reading into regex a little, I've just started to wonder why I need the ? in the expressions. I've only done it like that after finding them through this site so I'm not exactly sure what the difference is.
On greedy vs non-greedy
Repetition in regex is greedy by default: it tries to match as many reps as possible, and when this doesn't work and the engine has to backtrack, it tries one fewer rep at a time, until a match of the whole pattern is found. As a result, when a match finally happens, a greedy repetition matches as many reps as possible.
Appending ? to a repetition quantifier changes this behavior to non-greedy, also called reluctant (e.g. in Java) and sometimes "lazy". In contrast, this repetition first tries to match as few reps as possible, and when this doesn't work and the engine has to backtrack, it matches one more rep at a time. As a result, when a match finally happens, a reluctant repetition matches as few reps as possible.
References
regular-expressions.info/Repetition - Laziness instead of Greediness
Example 1: From A to Z
Let's compare these two patterns: A.*Z and A.*?Z.
Given the following input:
eeeAiiZuuuuAoooZeeee
The patterns yield the following matches:
A.*Z yields 1 match: AiiZuuuuAoooZ (see on rubular.com)
A.*?Z yields 2 matches: AiiZ and AoooZ (see on rubular.com)
Let's first focus on what A.*Z does. When it matched the first A, the .*, being greedy, first tries to match as many . as possible.
eeeAiiZuuuuAoooZeeee
   \_______________/
    A.* matched, Z can't match
Since the Z doesn't match, the engine backtracks, and .* must then match one fewer .:
eeeAiiZuuuuAoooZeeee
   \______________/
    A.* matched, Z still can't match
This happens a few more times, until finally we come to this:
eeeAiiZuuuuAoooZeeee
   \__________/
    A.* matched, Z can now match
Now Z can match, so the overall pattern matches:
eeeAiiZuuuuAoooZeeee
   \___________/
    A.*Z matched
By contrast, the reluctant repetition in A.*?Z first matches as few . as possible, and then taking more . as necessary. This explains why it finds two matches in the input.
Here's a visual representation of what the two patterns matched:
eeeAiiZuuuuAoooZeeee
   \__/r   \___/r      r = reluctant
   \____g____/         g = greedy
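The two match sets from Example 1 can be reproduced with a few lines of code. This is a minimal sketch in Python (an assumption; the answer itself links to Ruby and Java demos).

```python
import re

text = "eeeAiiZuuuuAoooZeeee"

# Greedy: .* runs to the end of the input and backtracks to the last Z.
greedy = re.findall(r"A.*Z", text)

# Reluctant: .*? grows one character at a time and stops at the first Z.
reluctant = re.findall(r"A.*?Z", text)

print(greedy)     # ['AiiZuuuuAoooZ']
print(reluctant)  # ['AiiZ', 'AoooZ']
```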
Example: An alternative
In many applications, the two matches in the above input are what is desired, thus a reluctant .*? is used instead of the greedy .* to prevent overmatching. For this particular pattern, however, there is a better alternative, using a negated character class.
The pattern A[^Z]*Z also finds the same two matches as the A.*?Z pattern for the above input (as seen on ideone.com). [^Z] is what is called a negated character class: it matches anything but Z.
The main difference between the two patterns is in performance: being more strict, the negated character class can only match one way for a given input. It doesn't matter whether you use the greedy or reluctant modifier for this pattern. In fact, in some flavors, you can do even better and use what is called a possessive quantifier, which doesn't backtrack at all.
References
regular-expressions.info/Repetition - An Alternative to Laziness, Negated Character Classes and Possessive Quantifiers
Example 2: From A to ZZ
This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.
eeAiiZooAuuZZeeeZZfff
These are the matches for the above input:
A[^Z]*ZZ yields 1 match: AuuZZ (as seen on ideone.com)
A.*?ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
A.*ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
Here's a visual representation of what they matched:
         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
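The three results from Example 2 can be verified the same way. The sketch below is in Python, which is an assumption, and the variable names are illustrative.

```python
import re

text = "eeAiiZooAuuZZeeeZZfff"

# Negated class: [^Z]* can never step over a Z, so the first A fails
# (it is followed by a single Z) and the second A produces the match.
negated = re.findall(r"A[^Z]*ZZ", text)

# Reluctant: starts at the first A and expands until the first ZZ.
reluctant = re.findall(r"A.*?ZZ", text)

# Greedy: starts at the first A and backtracks from the end to the last ZZ.
greedy = re.findall(r"A.*ZZ", text)

print(negated)    # ['AuuZZ']
print(reluctant)  # ['AiiZooAuuZZ']
print(greedy)     # ['AiiZooAuuZZeeeZZ']
```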
Related topics
These are links to questions and answers on stackoverflow that cover some topics that may be of interest.
One greedy repetition can outgreed another
Regex not being greedy enough
Regular expression: who's greedier
It is the difference between greedy and non-greedy quantifiers.
Consider the input 101000000000100.
Using 1.*1, * is greedy - it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.
.*? is non-greedy. * will match nothing, but then will try to match extra characters until it matches 1, eventually matching 101.
All quantifiers have a non-greedy mode: .*?, .+?, .{2,6}?, and even .??.
In your case, a similar pattern could be <([^>]*)> - matching anything but a greater-than sign (strictly speaking, it matches zero or more characters other than > in-between < and >).
See Quantifier Cheat Sheet.
Let's say you have:
<a></a>
<(.*)> would capture a></a, whereas <(.*?)> would capture a.
The latter stops at the first > it can use: the lazy .*? starts with zero characters and takes one more only when the rest of the pattern fails to match.
The first expression, <(.*)>, doesn't stop at the first >. It continues until the last >.
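Both short examples from these answers can be run directly. This is a minimal sketch in Python (an assumption, as the question names no language), with illustrative variable names.

```python
import re

digits = "101000000000100"

# Greedy: runs to the end, then backtracks to the last 1.
greedy_digits = re.search(r"1.*1", digits).group()

# Lazy: expands from zero characters until the next 1.
lazy_digits = re.search(r"1.*?1", digits).group()

tags = "<a></a>"
greedy_tag = re.search(r"<(.*)>", tags).group(1)      # up to the last >
lazy_tag = re.search(r"<(.*?)>", tags).group(1)       # up to the first >
negated_tag = re.search(r"<([^>]*)>", tags).group(1)  # cannot cross a >

print(greedy_digits, lazy_digits)         # 1010000000001 101
print(greedy_tag, lazy_tag, negated_tag)  # a></a a a
```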