Skipping comments with regex

This has been asked many times, yet I don't understand why the following negative lookbehind still matches after the comment character ";":
(?<!;).+mylib.*
TEST-TEXT:
; /home/mylib/blabla/laydef1.rul (matches wrongly!?)
/home/mylib/blabla/laydef2.rul (matches as it should)
P.S. RegEx class is PCRE

Since PCRE doesn't support variable length lookbehind you can use this regex construct:
/^\h*(?:;.*(*SKIP)(*F)|.*mylib.*)/m
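Your regex (?<!;).+mylib.* fails because the lookbehind is only checked at the position where the match starts. At the start of the line nothing precedes the current position, so (?<!;) succeeds there, and .+ then matches everything from ; up to mylib.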
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
Together, (*SKIP)(*FAIL) provide a convenient workaround for the restriction that PCRE does not allow a variable-length lookbehind, as in the regex above.
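For readers without PCRE, here is a minimal sketch in Python. Python's built-in re module does not support backtracking verbs such as (*SKIP) and (*F), so the sketch swaps in a different technique: a negative lookahead anchored at the start of each line. It also reproduces the failure of the original pattern. The names BROKEN, SKIP_COMMENTS and non_comment_matches are illustrative assumptions, not part of the question.

```python
import re

# Original pattern from the question: the lookbehind is only checked
# where the match starts, so it passes at the very start of the line.
BROKEN = re.compile(r"(?<!;).+mylib.*")

# Substitute technique (not the PCRE verbs): reject a line when its first
# non-blank character is ";", otherwise require "mylib" somewhere on it.
SKIP_COMMENTS = re.compile(r"^(?![ \t]*;).*mylib.*", re.MULTILINE)

TEXT = "; /home/mylib/blabla/laydef1.rul\n/home/mylib/blabla/laydef2.rul"


def non_comment_matches(text):
    """Return every non-comment line that mentions mylib."""
    return SKIP_COMMENTS.findall(text)


if __name__ == "__main__":
    print(BROKEN.findall(TEXT))        # both lines, including the comment
    print(non_comment_matches(TEXT))   # only the uncommented line
```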

Related

Replacement for lookbehind in std::regex

I need a regex to match tokens for a syntax highlighter, which should match full words when surrounded by non-alphanumeric characters or string boundaries. The regex I initially came up with is:
(?<=[^\w]|^)TOKEN(?=[^\w]|$)
Where TOKEN is the token I'm searching for. This works in regex testers, but c++'s regex doesn't support lookbehinds. Omitting the lookbehind causes the regex to match the character before the token as well, which causes issues. I'm aware boost::regex supports lookbehinds, but I'd like to keep to std::regex if possible.
My question is: can I change my regex to exclude the character before the token from the match?
The pattern is missing a closing ] at the end, and \w also matches \d
You might use an alternation asserting either the start of the string, or a position where \b does not match and assert not a word char to the right.
(?:^|\B)TOKEN(?!\w)
After the update of the question, you can write (?<=[^\w]|^)TOKEN(?=[^\w]|$) as (?<=\W|^)TOKEN(?=\W|$) or in short without the lookbehind:
\bTOKEN(?!\w)
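To see how the lookbehind-free pattern behaves, here is a small sketch. It is written in Python rather than C++ purely for illustration; the helper name find_token and the sample text are assumptions. It relies on the token starting with a word character, since \b only marks a boundary between a word and a non-word character.

```python
import re


def find_token(token, text):
    """Return (start, end) spans where token appears as a whole word."""
    # re.escape guards against regex metacharacters inside the token.
    # \b replaces the lookbehind; (?!\w) rejects a trailing word character.
    pattern = re.compile(r"\b" + re.escape(token) + r"(?!\w)")
    return [m.span() for m in pattern.finditer(text)]


if __name__ == "__main__":
    # "print" and "int_y" contain "int" but are not whole-word matches.
    print(find_token("int", "int x = print(int_y); int"))
```

Because neither \b nor the lookahead consumes characters, the reported spans cover only the token itself and not the character before it.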

SyntaxError: (irb):4: invalid pattern in look-behind (positive look-behind/ahead)

I am trying to write a regex-replace pattern in order to replace a number in a hash like such:
some_dict = {
TEST: 123
}
such that 123 could be captured and replaced.
(?<= |\t*[a-zA-Z0-9_]+: |\t+)\d+(?=.*)
This works perfectly fine in regexr.
When I run this gsub in irb, however, here is what happens:
irb(main):005:0> " TEST: 123".gsub(/(?<= |\t*[a-zA-Z0-9_]+: |\t+)\d+(?=.*)/, "321")
SyntaxError: (irb):5: invalid pattern in look-behind: /(?<= |\t*[a-zA-Z0-9_]+: |\t+)\d+(?=.*)/
I was looking around for similar issues like Invalid pattern in look-behind but I made sure to exclude capture groups in my look-behind so I'm really not sure where the problem lies.
The reason is that Ruby's Onigmo regex engine does not support infinite-width lookbehind patterns.
In a general case, positive lookbehinds that contain quantifiers like *, + or {x,} can often be substituted with a consuming pattern followed with \K:
/(?: |\t*[a-zA-Z0-9_]+: |\t+)\K\d+(?=.*)/
#^^^                         ^^
However, you do not even need that complicated pattern. (?=.*) is redundant because it requires nothing: .* matches even an empty string. Every alternative in the lookbehind ends with a space or a tab, so the lookbehind succeeds whenever there is a space or tab immediately to the left of the current location. The regex is therefore equivalent to
.gsub(/(?<=[ \t])\d+/, "321")
where the pattern matches
(?<=[ \t]) - a location immediately preceded with a space/tab
\d+ - one or more digits.
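The same restriction can be observed outside Ruby. The sketch below uses Python, whose re module also rejects lookbehinds that are not fixed-width, to contrast the two patterns; it illustrates the idea and says nothing about Onigmo's internals. The names compiles and replace_number are assumptions.

```python
import re

# Lookbehind from the question (the redundant (?=.*) is dropped).
VARIABLE_WIDTH = r"(?<= |\t*[a-zA-Z0-9_]+: |\t+)\d+"

# Simplified pattern from the answer: the lookbehind is one character wide.
FIXED_WIDTH = r"(?<=[ \t])\d+"


def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def replace_number(text, replacement="321"):
    """Replace each run of digits that is preceded by a space or tab."""
    return re.sub(FIXED_WIDTH, replacement, text)


if __name__ == "__main__":
    print(compiles(VARIABLE_WIDTH), compiles(FIXED_WIDTH))
    print(repr(replace_number(" TEST: 123")))
```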

How to properly use char negation [^ ] to do word non-fixed width backward lookbehind?

I am trying to match all Python source code lines which have an open parenthesis but are not a function definition. Basically, match all function calls, but not function definitions.
I am parsing Python source code, but I only have the PCRE engine, not the newer JavaScript engine with non-fixed-width lookbehind. I do not want a match if the open parenthesis is preceded by the word def anywhere (.*) earlier on the line.
This regular expression gets halfway there:
(?:^)(?:[^d][^e][^f])+\(
It should not match lines with: (not match an open parenthesis preceded by def)
anything def anything(thing)
anyyything def anythinggg(thing)
And only match lines as: (match an open parenthesis preceded by anything but def)
anything anything(thing)
anyyything anythinggg(thing)
But it has a problem: because I use (?:[^d][^e][^f])+, the expression only works when the open parenthesis ( is preceded by text whose length is a multiple of 3:
https://regex101.com/r/ec0FgD/1 - Live example
In PCRE you cannot use variable length lookbehind but can make use of (*SKIP)(*FAIL) verbs to fail a match:
def[^(]*\((*SKIP)(*F)|\(
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
Together, (*SKIP)(*FAIL) provide a convenient workaround for the restriction that PCRE does not allow a variable-length lookbehind, as in the regex above.
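Where the backtracking verbs are unavailable, a similar effect can often be approximated by matching the unwanted case in one alternative and capturing the wanted case in another. The sketch below does this in Python; it is a substitute technique rather than the PCRE verbs themselves, and the names PATTERN and call_paren_positions are assumptions.

```python
import re

# The first alternative consumes "def ... (" so that parenthesis can never
# be reported; only the second alternative fills capture group 1.
# Note: [^(] also matches newlines, so pass one line at a time.
PATTERN = re.compile(r"def[^(]*\(|(\()")


def call_paren_positions(line):
    """Return indexes of '(' that are not the first '(' after a 'def'."""
    return [m.start(1) for m in PATTERN.finditer(line) if m.group(1) is not None]


if __name__ == "__main__":
    for line in ("anything def anything(thing)", "anything anything(thing)"):
        print(line, call_paren_positions(line))
```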
It should not match lines with: (not match an open parenthesis preceded by def)
You can use a negative lookahead assertion at ^ start of each line to check for your condition:
^(?![^\n(]*?def)[^\n(]*\(
the negated class [^\n(] matches any character besides newline and opening parenthesis
to discard the part before ( from match, use \K for reset: ^(?![^\n(]*?def)[^\n(]*+\K\(
use word boundaries \b at start/end of def if it's desired to match the substring as word
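The lookahead version ports to engines without backtracking verbs. A minimal Python sketch follows, run against the four sample lines from the question. Python's re module has no \K, so a capture group marks the parenthesis instead; CALL_LINE, SAMPLE and call_prefixes are hypothetical names.

```python
import re

# Same idea as the pattern above: at each line start, refuse the line if
# "def" occurs before the first "(", then consume up to that "(".
CALL_LINE = re.compile(r"^(?![^\n(]*?def)[^\n(]*(\()", re.MULTILINE)

SAMPLE = "\n".join([
    "anything def anything(thing)",
    "anyyything def anythinggg(thing)",
    "anything anything(thing)",
    "anyyything anythinggg(thing)",
])


def call_prefixes(text):
    """Return the text up to and including '(' for each accepted line."""
    return [m.group(0) for m in CALL_LINE.finditer(text)]


if __name__ == "__main__":
    print(call_prefixes(SAMPLE))
```

If only the position of the parenthesis is needed, m.start(1) gives it directly.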

Regex Negative Lookbehind Matches Lookbehind text .NET

Say I have the following strings:
PB-GD2185-11652-MTCH
GD2185-11652-MTCH
KD-GD2185-11652-MTCH
KD-GD2185-11652
I want REGEX.IsMatch to return true if the string has MTCH in it and does not start with PB.
I expected the regex to be the following:
^(?<!PB)\S+(?=MTCH)
but that gives me the following matches:
PB-GD2185-11652-
GD2185-11652-
KD-GD2185-11652-
I do not understand why the negative lookbehind not only doesn't exclude the match but includes the PB characters in the match. The positive lookahead works as expected.
EDIT 1
Let me start with a simpler example. The following regex matches all of the strings as I would expect it to:
\S+
The following regex still matches all of the strings even though I would expect it not to:
\S+(?!MTCH)
The following regex matches all but the final H character on the first three strings:
\S+(?<!MTCH)
From the documentation at regex101, a lookahead looks for text to the right of the pattern and a lookbehind looks for text to the left of the pattern, so having a lookahead at the beginning of a string does not jibe with the documentation.
Edit 2
Take another example with the following three strings:
grey
greyhound
hound
the regex:
^(?<!grey)hound
only matches the final hound, whereas the regex:
^(?<!grey)\S+
matches all three.
You need a lookahead: ^(?!PB)\S+(?=MTCH). Using the look-behind means the PB has to come before the first character.
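The difference between the two assertions can be checked directly. The sketch below uses Python instead of .NET purely for illustration; for these particular patterns the behaviour shown is the same as the results reported in the question. The names LOOKBEHIND, LOOKAHEAD and matched_text are assumptions.

```python
import re

SAMPLES = [
    "PB-GD2185-11652-MTCH",
    "GD2185-11652-MTCH",
    "KD-GD2185-11652-MTCH",
    "KD-GD2185-11652",
]

# Lookbehind right after ^: nothing precedes position 0, so it always passes.
LOOKBEHIND = re.compile(r"^(?<!PB)\S+(?=MTCH)")
# Lookahead right after ^: inspects the first two characters of the string.
LOOKAHEAD = re.compile(r"^(?!PB)\S+(?=MTCH)")


def matched_text(pattern, strings):
    """Return the matched text for each string, or None when nothing matches."""
    result = []
    for s in strings:
        m = pattern.search(s)
        result.append(m.group(0) if m else None)
    return result


if __name__ == "__main__":
    print(matched_text(LOOKBEHIND, SAMPLES))
    print(matched_text(LOOKAHEAD, SAMPLES))
```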
The problem was because of the greediness of \S+. When dealing with lookarounds and greedy quantifiers you can easily match more characters than you expect. One way to deal with this is to insert a negative lookaround in a group with the greedy quantifier to exclude it as a match as stated in this question:
How to non-greedy multiple lookbehind matches
and on this helpful website about greediness in regular expressions:
http://www.rexegg.com/regex-quantifiers.html
Note that this second link has a few other ways to deal with the greediness in various situations.
A good regular expression for this situation is as follows:
^(?<!PB)((?!PB)\S+)(MTCH)
In situations like this it is going to be much clearer to do it logically within the code: first check that the string contains MTCH, and then that it does not match ^PB.
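A minimal sketch of that two-step check, written in Python for brevity (the question itself targets .NET, where the corresponding string methods would be used instead). The function name is_wanted is an assumption.

```python
def is_wanted(code):
    """True when the string contains MTCH and does not start with PB."""
    return "MTCH" in code and not code.startswith("PB")


if __name__ == "__main__":
    for code in (
        "PB-GD2185-11652-MTCH",
        "GD2185-11652-MTCH",
        "KD-GD2185-11652-MTCH",
        "KD-GD2185-11652",
    ):
        print(code, is_wanted(code))
```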

Perl Regex "Not" (negative lookahead)

I'm not terribly certain what the correct wording for this type of regex would be, but basically what I'm trying to do is match any string that starts with "/" but is not followed by "bob/", as an example.
So these would match:
/tom/
/tim/
/steve
But these would not
tom
tim
/bob/
I'm sure the answer is terribly simple, but I had a difficult time searching for "regex not" anywhere. I'm sure there is a fancier word for what I want that would pull good results, but I'm not sure what it would be.
Edit: I've changed the title to indicate the correct name for what I was looking for
You can use a negative lookahead (documented under "Extended Patterns" in perlre):
/^\/(?!bob\/)/
TLDR: Negative Lookaheads
If you wanted a negative lookahead just to find "foo" when it isn't followed by "bar"...
$string =~ m/foo(?!bar)/g;
To quote the docs...
(?!pattern)
(*nla:pattern)
(*negative_lookahead:pattern)
A zero-width negative lookahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar". Note however that lookahead and lookbehind are NOT the same thing. You cannot use this for lookbehind. (Source: PerlDocs.)
Negative Lookaheads For Your Case
The accepted answer is great, but it leaves no explanation, so let me add one...
/^\/(?!bob\/)/
^ - Match only at the start of the string.
\/ - Match the / character, which must be escaped because / is the regex delimiter here (as in s/find/replacewith/).
(?!...) - Do not match if the current position is followed by ....
bob\/ - This is the ... value: do not match bob/. Once more, the / must be escaped.
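The same pattern can be tried against the examples from the question. The sketch below is in Python rather than Perl, so the slash needs no escaping; the names STARTS_WITH_SLASH_NOT_BOB and is_match are assumptions.

```python
import re

# "/" is not a delimiter in Python, so it is written unescaped.
STARTS_WITH_SLASH_NOT_BOB = re.compile(r"^/(?!bob/)")


def is_match(path):
    """True when path starts with "/" that is not followed by "bob/"."""
    return STARTS_WITH_SLASH_NOT_BOB.search(path) is not None


if __name__ == "__main__":
    for path in ("/tom/", "/tim/", "/steve", "tom", "tim", "/bob/"):
        print(path, is_match(path))
```

Note that the lookahead only rejects the exact continuation bob/, so a path such as /bobby still matches.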