Regex negated character disjunction [duplicate] - regex

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
Very quick and simple question.
Consider the vector of character strings ("AvAv", "AvAvAv")
Why does the pattern (Av)\1([^A]|$) match both strings?
The pattern says have an isntance of "Av", have another, then either have a character that is not an "A" or else come to an end. The first string clearly matches, the latter I do not see how it does. It has two copies of "Av" but then it fails to end (missing the second disjunct), and fails to be followed by a charavter other than "A" (missing the first disjunct), so how does the pattern successfully match it?
Thank you so much for your time and assistance. It is greatly appreciated.

Here is an explanation:
AvAv - matches (Av)\1$
In this case, we can match Av, followed by that captured quantity, followed by $ from the alternation. In the case of AvAvAv we also have a match:
AvAvAv - again matches (Av)\1$
^^^^ last four letters match
It is the same logic here, except that in order to match, we have to skip the first Av.
If the pattern were ^(Av)\1([^A]|$) then only AvAv would be a match.

A RegEx only needs to match a part of the string to be considered "a match".
In other words, your RegEx matches this part:
AvAvAv
for the second example.
If you don't want it to match the second one, use a caret ^
^(Av)\1([^A]|$)
In this way the second one won't be matched.

Related

My regex appears to be complete, yet it's missing matches

I'm currently working on a piece of regex that mostly works, however there's a few matches that aren't capturing, despite working when they're the only match. I'm hoping someone can point out what is clearly an obvious error, but one that I'm missing.
Specifically, the string kMad matches to [Kk\D]+ by itself, but not when it's part of the bigger string.
For reference:
Full Regex showing missing matches
Specific Regex showing matches
Expected matches by line
Non-matching occurrence of kMad10:31-18:5771 does not include 4 digits at the end since two digits after the colon is already captured by (\d{2}:\d{2}\-\d{2}:\d{2}). You can define a range for the digit occurrences for the regex section after it like \d{2,4} instead of \d{4}
The new regex will be:
(?:\d{4}\-\d{2}\-\d{2}+\.)?(?:Line\d{1,3})?(OFF|ADO|([Kk\D]+)?(\d{2}:\d{2}\-\d{2}:\d{2})(\d{2,4}))
Regex101 Demo

Regex: repeated matches using start of line

Say that I would like to replace all as that are after 2 initial as and that only have as in between it and the first 2 as. I can do this in Vim using the (very magic \v) regex s:\v(^a{2}a{-})#<=a:X:g:
aaaaaaaaaaa
goes to
aaXXXXXXXXX
However, why does s:\v^a{2}a{-}\zsa:X:g only replace the first occurrence? I.e., giving
aaXaaaaaaaa
I presume this is because the first match "consumes" the start of the line and the first 2 as such that later matches only are matching on what remains of the line, which never can match the ^ again. Is this true? Or rather what is the most pedagogical explanation?
P.S. This is a minimal example of another problem.
Edit
Accepted answer corrected a typo in the original regex (a missing ^) and its comment answered the question: why can the ^ be "reused" in the lookbehind but not in the \zs case? (Ans: lookbehind doesn't consume the match whereas \zs does.)
The point here is that (a{2}a{-})#<=a matches any a (see the last a) that is preceded with two or more a chars. In NFA regex flavors, it is equal to (?<=a{2,}?)a, see its demo.
The ^a{2}a{-}\zsa regex matches the start of string, then two or more as, then discards this matched text and matches an a. So, it cannot match other as since the ^ anchors the match at the start of the string (and it does not allow matching anywhere else).
You probably want to go on using a lookbehind construct and add ^ there (if you want to only start matching if the string starts with two as):
:%s/\v(^a{2}a{-})#<=a/X/g

Multiple possible matches for regex in Perl [duplicate]

This question already has answers here:
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 3 years ago.
I'm new to Perl and is working with regular expressions. I am not able to decide how Perl resolves the ambiguity for a regex match when multiple matches are possible for a given query string. For example
('hellohellohello' =~ m/h.*o/)
This could match 'hello', 'hellohello' or 'hellohellohello'. Which one will it choose - shortest or largest match ? What if we want opposite behavior (like if default is to find the shortest match then finding the largest match) ?
In case the answer to the first is largest consider
('hello
hellohello' =~ m/h.*o/)
Here, it could match from the first line (before the newline character) or the second line (after the newline character) - first vs largest match. Which one will it use ?
What are the complete set of rules that can be used to decide which substring of a string would match a given regex (might be some case other than the one mentioned in the examples where multiple matches could be found) ?
* is greedy, so it tries to match the longest possible string, so long as the rest of the pattern can still be matched. So it will match hellohellohello.
If you use *? instead, that makes it non-greedy, and it will match the shortest possible string, again as long as the rest of the pattern matches. So m/h.*?o/ will match hello.

Regex for string containing one string, but not another [duplicate]

This question already has answers here:
Regular expression for a string containing one word but not another
(5 answers)
Closed 3 years ago.
Have regex in our project that matches any url that contains the string
"/pdf/":
(.+)/pdf/.+
Need to modify it so that it won't match urls that also contain "help"
Example:
Shouldn't match: "/dealer/help/us/en/pdf/simple.pdf"
Should match: "/dealer/us/en/pdf/simple.pdf"
If lookarounds are supported, this is very easy to achieve:
(?=.*/pdf/)(?!.*help)(.+)
See a demo on regex101.com.
(?:^|\s)((?:[^h ]|h(?!elp))+\/pdf\/\S*)(?:$|\s)
First thing is match either a space or the start of a line
(?:^|\s)
Then we match anything that is not a or h OR any h that does not have elp behind it, one or more times +, until we find a /pdf/, then match non-space characters \S any number of times *.
((?:[^h ]|h(?!elp))+\/pdf\/\S*)
If we want to detect help after the /pdf/, we can duplicate matching from the start.
((?:[^h ]|h(?!elp))+\/pdf\/(?:[^h ]|h(?!elp))+)
Finally, we match a or end line/string ($)
(?:$|\s)
The full match will include leading/trailing spaces, and should be stripped. If you use capture group 1, you don't need to strip the ends.
Example on regex101

How to match a line not containing a word [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 6 years ago.
I was wondering how to match a line not containing a specific word using Python-style Regex (Just use Regex, not involve Python functions)?
Example:
PART ONE OVERVIEW 1
Chapter 1 Introduction 3
I want to match lines that do not contain the word "PART"?
This should work:
/^((?!PART).)*$/
Edit (by request): How this works
The (?!...) syntax is a negative lookahead, which I've always found tough to explain. Basically, it means "whatever follows this point must not match the regular expression /PART/." The site I've linked explains this far better than I can, but I'll try to break this down:
^ #Start matching from the beginning of the string.
(?!PART) #This position must not be followed by the string "PART".
. #Matches any character except line breaks (it will include those in single-line mode).
$ #Match all the way until the end of the string.
The ((?!xxx).)* idiom is probably hardest to understand. As we saw, (?!PART) looks at the string ahead and says that whatever comes next can't match the subpattern /PART/. So what we're doing with ((?!xxx).)* is going through the string letter by letter and applying the rule to all of them. Each character can be anything, but if you take that character and the next few characters after it, you'd better not get the word PART.
The ^ and $ anchors are there to demand that the rule be applied to the entire string, from beginning to end. Without those anchors, any piece of the string that didn't begin with PART would be a match. Even PART itself would have matches in it, because (for example) the letter A isn't followed by the exact string PART.
Since we do have ^ and $, if PART were anywhere in the string, one of the characters would match (?=PART). and the overall match would fail. Hope that's clear enough to be helpful.