Perl: Matching string not containing PATTERN - regex

While using Perl regex to chop a string down into usable pieces I had the need to match everything except a certain pattern. I solved it after I found this hint on Perl Monks:
/^(?:(?!PATTERN).)*$/; # Matches strings not containing PATTERN
Although I solved my initial problem, I have little clue about how it actually works. I checked perlre, but it is a bit too formal to grasp.
The question "Regular expression to match a line that doesn't contain a word?" helps a lot in understanding, but why are the . and the ?: in my example, and how do the outer parentheses work?
Can someone break up the regex and explain in simple words how it works?

Building it up piece by piece (and throughout assuming no newlines in the string or PATTERN):
This matches any string:
/^.*$/
But we don't want . to match a character that starts PATTERN, so replace
.
with
(?!PATTERN).
This uses a negative look-ahead that tests a given pattern without actually consuming any of the string and only succeeds if the pattern does not match at the given point in the string. So it's like saying:
if PATTERN doesn't match at this point,
match the next character
This needs to be done for every character in the string, so * is used to match zero or more times, from the beginning to the end of the string.
To make the * apply to the combination of the negative look-ahead and ., not just the ., it needs to be surrounded by parentheses, and since there's no reason to capture, they should be non-capturing parentheses (?: ):
(?:(?!PATTERN).)*
And putting back the anchors to make sure we test at every position in the string:
/^(?:(?!PATTERN).)*$/
Note that this solution is particularly useful as part of a larger match; e.g. to match any string with foo and later baz but no bar in between:
/foo(?:(?!bar).)*baz/
If there aren't such considerations, you can simply do:
/^(?!.*PATTERN)/
to check that PATTERN does not match anywhere in the string.
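As a rough cross-check of the three forms above, here is a small sketch in Python rather than Perl (an assumption made for illustration; for single-line input Python's re module treats these lookaheads the same way). The sample strings, the choice of bar as PATTERN, and the helper name contains_no_bar are all made up.

```python
import re

# PATTERN is "bar" in this sketch; none of the test strings contain newlines.
tempered = re.compile(r'^(?:(?!bar).)*$')     # test at every position, then consume
up_front = re.compile(r'^(?!.*bar)')          # a single lookahead from the start
between = re.compile(r'foo(?:(?!bar).)*baz')  # foo ... baz with no bar in between

def contains_no_bar(s):
    """Return the verdicts of both whole-string forms as a pair of booleans."""
    return (tempered.search(s) is not None, up_front.search(s) is not None)

# Both whole-string forms should agree on every input.
results = {s: contains_no_bar(s) for s in ["", "foo baz", "foo bar baz", "ba r"]}

# The embedded form only cares about what lies between foo and baz.
in_between = [bool(between.search(s))
              for s in ["foo x baz", "foo bar baz", "bar foo baz"]]
```

The last input shows why the embedded form is useful: a bar before foo does not matter, only a bar between foo and baz does.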
About newlines: there are two problems with your regex and newlines. First, . doesn't match newlines, so "foo\nbar" =~ /^(?:(?!baz).)*$/ doesn't match, even though the string does not contain baz. You need to add the /s flag to make . match any character; "foo\nbar" =~ /^(?:(?!baz).)*$/s correctly matches. Second, $ doesn't match just at the end of the string, it also can match before a newline at the end of the string. So "foo\n" =~ /^(?:(?!\s).)*$/s does match, even though the string contains whitespace and you are attempting to only match strings with no whitespace; \z always only matches at the end, so "foo\n" =~ /^(?:(?!\s).)*\z/s correctly fails to match the string that does in fact contain a \s. So the correct general purpose regex is:
/^(?:(?!PATTERN).)*\z/s
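The two newline pitfalls can be reproduced with a short sketch, again in Python as an assumption for illustration. Python's re.S plays the role of Perl's /s, and Python spells Perl's \z as \Z.

```python
import re

# Without DOTALL the dot stops at the newline, so the match fails
# even though "baz" is absent.
no_flag = re.compile(r'^(?:(?!baz).)*$')
dotall = re.compile(r'^(?:(?!baz).)*$', re.S)

# $ may also match just before a trailing newline; \Z (Perl: \z) may not.
dollar = re.compile(r'^(?:(?!\s).)*$', re.S)
strict = re.compile(r'^(?:(?!\s).)*\Z', re.S)

checks = {
    "no_flag": bool(no_flag.search("foo\nbar")),
    "dotall": bool(dotall.search("foo\nbar")),
    "dollar": bool(dollar.search("foo\n")),
    "strict": bool(strict.search("foo\n")),
}
```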

jippie, first, here's a tip. If you see a regex that is not immediately obvious to you, you can dump it in a tool that explains every token.
For instance, here is the RegexBuddy output:
"
^ # Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed)
(?: # Match the regular expression below
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
PATTERN # Match the character string “PATTERN” literally (case insensitive)
)
. # Match any single character that is NOT a line break character (line feed)
)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ # Assert position at the end of a line (at the end of the string or before a line break character) (line feed)
# Perl 5.18 allows a zero-length match at the position where the previous match ends.
# Perl 5.18 attempts the next match at the same position as the previous match if it was zero-length and may find a non-zero-length match at the same position.
"
Some people also use regex101.
A Human Explanation
Now if I had to explain the regex, I would not be so linear. I would start by saying that it is fully anchored by the ^ and the $, implying that the only possible match is the whole string, not a substring of that string.
Then we come to the meat: a non-capturing group introduced by (?: and repeated any number of times by the *.
What does this group do? It contains
a negative lookahead (you may want to read up on lookarounds) asserting that at this exact position in the string, we cannot match the word PATTERN,
then a dot to match the next character
This means that at each position in the string, we assert that we cannot match PATTERN, then we match the next character.
If PATTERN can be matched anywhere, the negative lookahead fails, and so does the entire regex.
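The position-by-position reading can be spelled out as a plain loop. This is only a sketch of the idea for a literal PATTERN and a string without newlines, written in Python; lacks_pattern is a hypothetical helper name, not something from the regex engine.

```python
def lacks_pattern(text, pattern):
    """Mimic ^(?:(?!PATTERN).)*$ for a literal PATTERN (no newlines assumed)."""
    pos = 0
    while pos < len(text):
        # Negative lookahead: give up if PATTERN starts at this position.
        if text.startswith(pattern, pos):
            return False
        # The dot: consume exactly one character and move on.
        pos += 1
    # Every position passed the check and we reached the end: $ is satisfied.
    return True
```

For example, lacks_pattern("abcthe", "the") returns False because the check fails once pos reaches the t, while lacks_pattern("abc", "the") returns True.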

Related

Regex matching pattern in multiple lines without specific word in the match

I would like to match the following pattern in multiple lines
The pattern begins with "PAT_A"
The pattern ends with the first ";" after "PAT_A"
The pattern contains "PAT_B" between "PAT_A" and ";"
The pattern does not contain "NOT_MATCH_THIS" between "PAT_A" and ";"
For example, this should make a match
PAT_A_YYY(
OK,
PAT_B
);
And this should not make a match.
PAT_A_XXX(
NOT_MATCH_THIS,
PAT_B
);
I managed to fulfill the first three requirements with
(PAT_A[^;]*?)(\bPAT_B\b)([^;]*;)
where the groups are for extracting the different parts matched.
However, I did not succeed in excluding matches containing "NOT_MATCH_THIS".
I have checked the post "How to negate specific word in regex?" about negative lookahead. However, it seems that the answer there matches the whole line instead of the pattern requirement described above. And I am not sure how I should incorporate the negative lookahead into my regex pattern.
Is there any way I could match with regex fulfilling all the four requirements?
You might use
^PAT_A[^;\n]*(?:\n(?![^\n;]*NOT_MATCH_THIS)[^;\n]*)*\n[^;\n]*PAT_B[^;]*;
In parts, the pattern matches:
^ Start of string
PAT_A Match literally
[^;\n]* Optionally match any char except ; or a newline
(?: Non capture group (to repeat as a whole)
\n(?![^\n;]*NOT_MATCH_THIS) Match a newline, and assert that the line that follows does not contain NOT_MATCH_THIS (the [^\n;]* keeps the lookahead on that line and stops it at a ;)
[^;\n]* If the previous assertion is true, match the whole line (not containing a ;)
)* Close the non capture group, and optionally repeat matching all lines
\n[^;\n]* Match a newline, and any char except ; or a newline
PAT_B[^;]*; Then match PAT_B followed by any char except ; followed by matching the ;
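To try the pattern against the two samples from the question, a sketch along these lines could be used. It is written in Python, and the re.M flag is an assumption so that ^ can match at the start of any line when both samples sit in one string.

```python
import re

statement = re.compile(
    r'^PAT_A[^;\n]*(?:\n(?![^\n;]*NOT_MATCH_THIS)[^;\n]*)*\n[^;\n]*PAT_B[^;]*;',
    re.M,  # assumption: allow ^ at the start of every line
)

good = "PAT_A_YYY(\nOK,\nPAT_B\n);"
bad = "PAT_A_XXX(\nNOT_MATCH_THIS,\nPAT_B\n);"

# Scan both statements in one string; only the first should be reported.
found = [m.group(0) for m in statement.finditer(good + "\n" + bad)]
```

Note that, as written, the pattern expects PAT_B on a later line than PAT_A.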
I don't have a RegEx interpreter handy, but you could try this:
(PAT_A[^;]*?(?!NOT_MATCH_THIS))(\bPAT_B\b)([^;]*;)
Or maybe:
(PAT_A[^;]*?(?!NOT_MATCH_THIS)[^;]*?)(\bPAT_B\b)([^;]*;)
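A single lookahead only tests the one position where it stands, so another option is to repeat the test before every character, in the style of (?:(?!X).)*. The following Python sketch is not taken from the answers above; it keeps the three groups from the question and assumes NOT_MATCH_THIS should be rejected anywhere between PAT_A and the ;.

```python
import re

# Every character between PAT_A and ";" is preceded by a check that
# NOT_MATCH_THIS does not start at that position.
tempered = re.compile(
    r'(PAT_A(?:(?!NOT_MATCH_THIS)[^;])*?)'
    r'(\bPAT_B\b)'
    r'((?:(?!NOT_MATCH_THIS)[^;])*;)'
)

good = "PAT_A_YYY(\nOK,\nPAT_B\n);"
bad_before = "PAT_A_XXX(\nNOT_MATCH_THIS,\nPAT_B\n);"
bad_after = "PAT_A_ZZZ(\nPAT_B,\nNOT_MATCH_THIS\n);"

ok = tempered.search(good)
rejected = [tempered.search(s) for s in (bad_before, bad_after)]
```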

How negative lookahead works

I want to match a string not containing word "the"
The following solution looks logical to me:
^(?!.*the.*).*$
The following one (which I came across on SO) also works, but I cannot understand WHY it works:
^((?!the).)*$
In my view (?!the). should match (a) any (b) single character, which is then repeated by *, so shouldn't the regex match any string?
I use the great site http://www.rexegg.com for reference, but there is no such example there.
It's basically doing a match-any-character, and search for the string literal "the" in every position. If found, the negation cancels the match.
^ # Assert position at the beginning of a line (at beginning of the string or after a line break character)
( # Match the regular expression below and capture its match into backreference number 1
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
the # Match the characters “the” literally
)
. # Match any single character that is not a line break character
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ # Assert position at the end of a line (at the end of the string or before a line break character)
The above solution works but only if you also want to match strings not containing words with the characters the in them -- e.g., I was going there would be excluded. You need word boundaries if you want to match everything not containing the word the:
^((?!\bthe\b).)*$
or:
^(?!.*\bthe\b).*$
^((?!the).)*$
This checks at every position, before consuming a character, whether the is ahead. So in the string abcthe, after c the regex engine will see the and the lookahead will fail. But because you have the ^ and $ anchors and the engine could not make a complete match, the whole regex fails and matches nothing. If you remove $, it will match up to abc.
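The effect of the anchors and of the word boundaries can be checked with a few calls. This is a sketch in Python (an assumption; the thread does not name a language), with sample strings taken from the answers above plus one made-up sentence.

```python
import re

anchored = re.compile(r'^((?!the).)*$')
unanchored = re.compile(r'^((?!the).)*')       # same, with the $ removed
whole_word = re.compile(r'^((?!\bthe\b).)*$')  # only reject the word "the"

# With $, "abcthe" cannot be matched at all; without it, the match stops at abc.
no_match = anchored.search("abcthe")
partial = unanchored.search("abcthe").group(0)

# "there" contains the letters t-h-e but not the word "the".
there_plain = anchored.search("I was going there")
there_word = whole_word.search("I was going there")
the_word = whole_word.search("I saw the cat")
```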

Regex capturing any text between

I'm trying to capture text (any text) that falls between some kind of delimiter with word boundaries on each end, like so:
This is not the text. ##This is the text I want to capture.## This is also not the text. ##But I would like to capture this, too##.
I thought this would be easy with regex like this
\b([#]{2})(.*)(\1)\b
This doesn't produce a match and I can't figure why.
Note, I would also like to avoid capturing the text between the first '##' and the last '##', capturing both sections with all the text in between.
In other words I don't want one of the matches to be:
##This is the text I want to capture.## This is also not the text. ##But I would like to capture this, too##
georg and Ulugbek Umirov posted the perfect answer to this question in the comments. I repeat the expression here with an explanation, mainly to give the question an answer and so remove it from the list of unanswered questions.
##\b(.+?)## searches for a string
starting and ending with ## and
with a word character at beginning and
having 1 or more characters between.
Because of the parentheses the string found between ## is marked for backreference.
The question mark ? after the + multiplier changes the matching behavior from greedy to non greedy. The greedy expression .+ matches everything from first ## to last ## whereas the non greedy expression .+? matches just everything from first ## to next ##.
\b means word boundary and therefore the first character after ## must be a word character (letter, digit or underscore).
The matching behavior of . depends on a flag. The dot can match any character including line terminating characters, or any character except line terminating characters. Line terminating characters are carriage return (= \r = CR) and line feed (= newline = \n = LF).
If matching everything between two delimiter strings should be independent of how the dot behaves, it is better to use the regular expression ##\b([\w\W]+?)##, as Ulugbek Umirov suggested: \w matches any word character and \W matches any non-word character, so the two together in a character class always match any character, including CR and LF.
It would also be possible to use ##\b([\s\S]+?)##, where \s matches any whitespace character and \S matches any non-whitespace character; together in a character class they likewise match any character, including CR and LF.
Further, it would be possible to use ##(\w[\s\S]*?)## or ##\w([\w\W]*?)## or ##(\w.*?)##, all of which match the same text as the expressions above, provided that in the last one the dot is set to match any character including CR and LF.
Last, if the regular expression engine supports lookbehind and lookahead, it would also be possible to match only the string between the ## delimiters, without matching the delimiters themselves, using for example (?<=##)\b[\w\W]+?(?=##), which makes a capturing group unnecessary. (?<=##) is a positive lookbehind and (?=##) is a positive lookahead, both for the string ##.
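A quick way to compare these expressions is to run them against the sample sentence from the question. The sketch below uses Python's re module as an assumption (the question does not name an engine); the variable names are made up.

```python
import re

text = ("This is not the text. ##This is the text I want to capture.## "
        "This is also not the text. ##But I would like to capture this, too##.")

# The question's attempt: \b before ## needs a word character right before
# the first #, which the opening delimiters do not have.
original = re.search(r'\b([#]{2})(.*)(\1)\b', text)

# Non-greedy: stops at the next ##, giving one capture per delimited section.
captured = re.findall(r'##\b(.+?)##', text)

# Greedy: runs from the first ## to the last ##.
greedy = re.findall(r'##\b(.+)##', text)

# Lookaround variant: the delimiters are tested but not part of the match.
lookaround = re.findall(r'(?<=##)\b[\w\W]+?(?=##)', text)
```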

Regex.Replace formatting a query

I am working in VB.Net and trying to use Regex.Replace to format a string I am using to query SQL. What I'm going for is to cut out "--" comments. I've found that in most cases the code below works for what I need.
string = Regex.Replace(command, "--.*\n", "")
and
string = Regex.Replace(command, "--.*$", "")
However, I have run into a problem. If my query contains a string with a double dash in it, this doesn't work: the replace just cuts out the rest of the line starting at the double dash. It makes sense to me why, but I can't figure out the regular expression I need to match on.
Logically I need to match on a string that starts with "--" and is not proceeded by "'" and not followed by "'" with any number of characters in between. But I'm not sure how to express that in a regular expression. I have tried variations of:
string = Regex.Replace(cmd, "[^('.*)]--.*\n[^(.*')]", "")
Which I know is obviously wrong. I have looked at a couple of online resources including http://www.codeproject.com/KB/dotnet/regextutorial.aspx
but due to my lack of understanding I can't seem to figure this one out.
I think you meant "match on a string that starts with -- and is not preceded by ' and not followed by ' with any number of characters in between".
If so, then this is what you are looking for:
string = Regex.Replace(cmd, "(?<!'.*?--)--(?!.*?').*(?=\r\n)", "")
'EDIT: modified a little
Of course, it means you can't have apostrophes in your comments... and would be exceedingly easy to hack if someone wanted to (you aren't thinking of using this to protect against injection attacks, are you? ARE YOU!??! :D )
I can break down the expression if you'd like, but it's essentially the same as my modified quote above!
EDIT:
I modified the expression a little, so it does not consume any carriage return, only the comment itself... the expression says:
(?<! # negative lookbehind assertion*
' # match a literal single quote
.*? # followed by anything (reluctantly*)
-- # two literal dashes
) # end assertion
-- # match two literal dashes
(?! # negative lookahead assertion
.*? # match anything (reluctant)
' # followed by a literal single quote
) # end assertion
.* # match anything
(?= # positive lookahead assertion
\r\n # match carriage-return, line-feed
) # end assertion
negative lookbehind assertion means at this point in the match, look backward here and assert that this cannot be matched
negative lookahead assertion means look forward from this point and assert this cannot be matched
positive lookahead asserts the following expression CAN be matched
reluctant means only consume a match for the previous atom (the . which means everything in this case) if you cannot match the expression that follows. Thus the .*? in .*?-- (when applied against the string abc--) will consume a, then check to see if the -- can be matched and fail; it will then consume ab, but stop again to see if the -- can be matched and fail; once it consumes abc and the -- can be matched (success), it will finally consume the entire abc--
non-reluctant or "greedy" which would be .* without the ? will match abc-- with the .*, then try to match the end of the string with -- and fail; it will then backtrack until it can match the --
one additional note is that the . "anything" does not by default include newlines (carriage-return/line-feed), which is needed for this to work properly (there is a switch that will allow . to match newlines and it will break this expression)
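The difference between reluctant and greedy matching is easy to see on a string with two -- sequences. This is a sketch in Python rather than VB.Net (an assumption for illustration; both engines treat .*? and .* this way), and the sample string is made up.

```python
import re

sample = "abc--def--"

# Reluctant: expand one character at a time and stop at the first "--".
reluctant = re.match(r'.*?--', sample).group(0)

# Greedy: grab everything, then backtrack to the last "--" that fits.
greedy = re.match(r'.*--', sample).group(0)
```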
A good resource - where I've learned 90% of what I know about regex - is Regular-Expressions.info
Tread carefully and good luck!
OK what you are doing here is not right :
/[^('.*)]--.*\n[^(.*')]/
You are saying the following :
Match one character that is not (, ), ', . or *, then match --, then match anything up to a newline, then match one more character that is not in that same set.
What you probably meant to do is this :
/(?<!['"])\s*--.*[\r\n]*/
Which says: make sure the match is not preceded by a ' or ", then match any whitespace, then --, then anything up to the end of the line, plus any trailing carriage-return or line-feed characters.
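A different technique, not used in either answer above, is to match quoted strings and comments in one alternation and only delete the comments. The sketch below is in Python rather than VB.Net and makes several assumptions: strings use single quotes with '' as the escape, only -- line comments matter, and strip_sql_comments is a hypothetical helper name. It is a formatting aid, not a defense against injection.

```python
import re

# Alternative 1 (captured): a single-quoted string, with '' as an escaped quote.
# Alternative 2: a "--" comment running to the end of the line.
TOKEN = re.compile(r"('(?:[^']|'')*')|--[^\r\n]*")

def strip_sql_comments(sql):
    """Drop -- comments but keep any -- that sits inside a quoted string."""
    # group(1) is set only when a string matched; comments become "".
    return TOKEN.sub(lambda m: m.group(1) or "", sql)
```

Because the string alternative is tried first at each position, a -- inside quotes is consumed as part of the string and never seen as a comment start.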

Regular expression doesn't match if a character participated in a previous match

I have this regex:
(?:\S)\++(?:\S)
Which is supposed to catch all the pluses in a query string like this:
?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or
It should have been 4 matches, but there are only 3:
s+n
e+c
s+e
It is missing the last one:
e+S
And it seems to happen because the "e" character has participated in a previous match (s+e), because the "e" character is right in the middle of two pluses (Teni s+e+S quash).
If you test the regex with the following input, it matches the last "+":
?busca=tenis+nike+categoria:"Tenis_e+Squash"&pagina=4&operador=or
(changed "s+e" for "s_e" in order not to cause the "e" character to participate in the match).
Would someone please shed a light on that?
Thanks in advance!
In consecutive matching, the search for the next match starts at the position where the previous match ended. And since the non-whitespace character after the + is matched too, the search for the next match will start after that non-whitespace character. So in a sequence like s+e+S you will only find one match:
s+e+S
\_/
You can fix that by using look-around assertions, which don’t consume the characters they test, like:
\S\++(?=\S)
This will match any non-whitespace character followed by one or more + only if it is followed by another non-whitespace character.
But since whitespace is not allowed in a URI query, you don’t need the surrounding \S at all as every character is non-whitespace. So the following will already match every sequence of one or more + characters:
\++
You are correct: The fourth match doesn't happen because the surrounding character has already participated in the previous match. The solution is to use lookaround (if your regex implementation supports it; JavaScript engines before ES2018 don't support lookbehind, for example).
Try
(?<!\s)\++(?!\s)
This matches one or more + unless they are directly preceded or followed by whitespace. This also works if the plus is at the start or the end of the string.
Explanation:
(?<!\s) # assert that there is no space before the current position
# (but don't make that character a part of the match itself)
\++ # match one or more pluses
(?!\s) # assert that there is no space after the current position
If your regex implementation doesn't support lookbehind, you could also use
\S\++(?!\s)
That way, your match would contain the character before the plus, but not after it, and therefore there will be no overlapping matches (Thanks Gumbo!). This will fail to match a plus at the start of the string, though (because the \S does need to match a character). But this is probably not a problem.
You can use the regex:
(?<=\S)\++(?=\S)
To match only the +'s that are surrounded by non-whitespace.
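The missing fourth match and the fixes suggested above can be compared side by side. This is a sketch in Python (an assumption; the question does not say which engine is used) run against the query string from the question.

```python
import re

query = '?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or'

# Original: both neighbours are consumed, so the "e" in s+e+S is used up
# by the third match and the fourth plus is never found.
consuming = re.findall(r'(?:\S)\++(?:\S)', query)

# Consume only the character before; test the one after with a lookahead.
lookahead_only = re.findall(r'\S\++(?=\S)', query)

# Consume neither neighbour.
both_sides = re.findall(r'(?<=\S)\++(?=\S)', query)
negated = re.findall(r'(?<!\s)\++(?!\s)', query)
```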