Multiple possible matches for regex in Perl [duplicate] - regex

This question already has answers here:
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 3 years ago.
I'm new to Perl and am working with regular expressions. I can't work out how Perl resolves the ambiguity when a regex could match a given string in more than one way. For example
('hellohellohello' =~ m/h.*o/)
This could match 'hello', 'hellohello' or 'hellohellohello'. Which one will it choose: the shortest or the longest match? What if we want the opposite behavior (for example, if the default is the shortest match, how do we find the longest one)?
If the answer to the first question is the longest match, consider
('hello
hellohello' =~ m/h.*o/)
Here, it could match on the first line (before the newline character) or on the second line (after the newline character), that is, the first match versus the longest match. Which one will it use?
What is the complete set of rules that decides which substring of a string matches a given regex (including cases other than the ones in these examples where multiple matches are possible)?

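* is greedy, so it matches the longest possible string, as long as the rest of the pattern can still be matched. So it will match hellohellohello.
If you use *? instead, that makes it non-greedy, and it will match the shortest possible string, again as long as the rest of the pattern matches. So m/h.*?o/ will match hello.
In both cases the engine takes the leftmost match first, and . does not match a newline unless the /s modifier is used, so in the two-line example m/h.*o/ matches the hello on the first line.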
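The greedy and non-greedy behavior described above can be checked with a short script. This is only a sketch, and it uses Python's re module as a stand-in for Perl (an assumption; the two engines agree on these constructs). re.DOTALL plays the role of Perl's /s modifier, and the variable names are made up for illustration.

```python
import re

text = "hellohellohello"

# Greedy: .* grabs as much as possible, then backtracks to the last 'o'.
greedy = re.search(r"h.*o", text).group()
# Non-greedy: .*? grows one character at a time and stops at the first 'o'.
lazy = re.search(r"h.*?o", text).group()

two_lines = "hello\nhellohello"
# Without DOTALL, '.' stops at the newline, and the leftmost match wins.
first_line = re.search(r"h.*o", two_lines).group()
# With DOTALL (Perl's /s), '.' also matches the newline.
across = re.search(r"h.*o", two_lines, flags=re.DOTALL).group()

print(greedy)        # hellohellohello
print(lazy)          # hello
print(first_line)    # hello
print(repr(across))  # 'hello\nhellohello'
```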

Related

Find DATE match starting from end of string [duplicate]

This question already has answers here:
Regex Last occurrence?
(7 answers)
Closed 3 years ago.
I have the following RegEx syntax that will match the first date found.
([0-9]+)/([0-9]+)/([0-9]+)
However, I would like to start from the end of the content and search backwards. In other words, in the below example, my syntax will always match the first date, but I want it to match the last instead.
Some Text here
01/02/15
Some additional
text here.
10/04/14
Ending text
here
I believe this is possible by using a negative lookahead, but all my attempts failed at this because I don't understand RegEx enough. Help would be appreciated.
Note: my application uses PCRE.
You could make the dot match newlines, for example with the inline modifier (?s), and let .* match to the end of the string.
Backtracking then gives characters back until the last occurrence of the date-like pattern is found; the word boundary \b before the first digit keeps the match from starting in the middle of a number.
Use \K to discard what was matched so far, so that only the date-like pattern is returned.
^(?s).*\b\K[0-9]+/[0-9]+/[0-9]+
Note that the pattern is a very broad match and does not validate a date itself.
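The same backtracking idea can be tried outside PCRE. The sketch below uses Python's re module, which has no \K, so a capturing group stands in for it (an assumption of this example, not part of the answer's pattern). The names last_date_re and last_date are made up for illustration.

```python
import re

text = """Some Text here
01/02/15
Some additional
text here.
10/04/14
Ending text
here"""

# DOTALL lets '.' cross newlines; the greedy .* runs to the end of the text
# and backtracks until the date-like pattern fits, which yields the last date.
last_date_re = re.compile(r".*\b([0-9]+/[0-9]+/[0-9]+)", flags=re.DOTALL)

match = last_date_re.search(text)
last_date = match.group(1) if match else None
print(last_date)  # 10/04/14
```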

Regex substitution: find double quotes not followed by a specific character [duplicate]

This question already has an answer here:
Regex Match a character which is not followed by another specific character
(1 answer)
Closed 4 years ago.
I have the following situation:
3" a
3":a
3",a
3"a
3"2
3"A
I need to find and replace a double quote with a space every time the double quote is not followed by : or ,.
So, for my case the expected results will be:
3 a
3":a
3",a
3 a
3 2
3 A
Any idea how to write this logic using regex?
You can use a negative lookahead A(?!B) for that. It matches an expression A that is not followed by expression B.
How the matches are replaced with spaces depends on the language used.
"(?![:,])
Applied to your examples: https://regex101.com/r/UiPlaC/2
If you want to handle the case 3" a without ending up with multiple spaces, include an optional space (or several) in the match.
"(?![:,])\ ?
See here for more information:
Regex lookahead, lookbehind and atomic groups
https://www.regular-expressions.info/lookaround.html
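Since the replacement step depends on the language, here is one possible way to apply the second pattern, sketched in Python as an assumption (the question does not name a language). The helper name replace_quotes is hypothetical.

```python
import re

# A double quote not followed by ':' or ',', plus one optional space after it.
QUOTE_RE = re.compile(r'"(?![:,]) ?')

def replace_quotes(line: str) -> str:
    """Replace each matching double quote (and an optional space) with one space."""
    return QUOTE_RE.sub(" ", line)

samples = ['3" a', '3":a', '3",a', '3"a', '3"2', '3"A']
results = [replace_quotes(s) for s in samples]
for before, after in zip(samples, results):
    print(f"{before!r} -> {after!r}")
```

The quotes in 3":a and 3",a are left alone because the lookahead rejects them, and 3" a becomes 3 a with a single space.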

Regex negated character disjunction [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
Very quick and simple question.
Consider the vector of character strings ("AvAv", "AvAvAv")
Why does the pattern (Av)\1([^A]|$) match both strings?
The pattern says: have an instance of "Av", have another, then either have a character that is not an "A" or else come to an end. The first string clearly matches, but I do not see how the second does. It has two copies of "Av", but then it neither ends (missing the second disjunct) nor is followed by a character other than "A" (missing the first disjunct), so how does the pattern successfully match it?
Thank you so much for your time and assistance. It is greatly appreciated.
Here is an explanation:
AvAv - matches (Av)\1$
In this case, we can match Av, followed by that captured quantity, followed by $ from the alternation. In the case of AvAvAv we also have a match:
AvAvAv - again matches (Av)\1$
  ^^^^ last four letters match
It is the same logic here, except that in order to match, we have to skip the first Av.
If the pattern were ^(Av)\1([^A]|$) then only AvAv would be a match.
A RegEx only needs to match a part of the string to be considered "a match".
In other words, for the second example your RegEx matches the last four characters, AvAv, of AvAvAv.
If you don't want it to match the second one, anchor the pattern to the start of the string with a caret ^:
^(Av)\1([^A]|$)
In this way the second one won't be matched.
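Both answers can be verified by looking at where the match starts. The question uses R; the sketch below uses Python's re module instead (an assumption, chosen only for illustration), which handles this backreference pattern the same way.

```python
import re

unanchored = re.compile(r"(Av)\1([^A]|$)")
anchored = re.compile(r"^(Av)\1([^A]|$)")

strings = ["AvAv", "AvAvAv"]

# Start offset (0-based) of the unanchored match in each string.
starts = {s: unanchored.search(s).start() for s in strings}
# Whether the anchored pattern matches at all.
anchored_hits = {s: anchored.search(s) is not None for s in strings}

print(starts)         # {'AvAv': 0, 'AvAvAv': 2}
print(anchored_hits)  # {'AvAv': True, 'AvAvAv': False}
```

The start offset of 2 for AvAvAv shows that the engine skipped the first Av and matched the remaining AvAv followed by the end of the string.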

Regex for string containing one string, but not another [duplicate]

This question already has answers here:
Regular expression for a string containing one word but not another
(5 answers)
Closed 3 years ago.
We have a regex in our project that matches any URL that contains the string "/pdf/":
(.+)/pdf/.+
I need to modify it so that it won't match URLs that also contain "help".
Example:
Shouldn't match: "/dealer/help/us/en/pdf/simple.pdf"
Should match: "/dealer/us/en/pdf/simple.pdf"
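If lookarounds are supported, this is very easy to achieve. Anchor the pattern at the start of the string, otherwise a search can begin after the h of help and still succeed:
^(?=.*/pdf/)(?!.*help)(.+)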
See a demo on regex101.com.
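A quick check of the lookahead approach, sketched in Python's re module (an assumption; the question does not say which engine the project uses). The sketch anchors the lookaheads with ^, because an unanchored search is free to start in the middle of the string, past the word help; the last lines demonstrate that effect.

```python
import re

good = "/dealer/us/en/pdf/simple.pdf"
bad = "/dealer/help/us/en/pdf/simple.pdf"

# Anchored: both lookaheads are evaluated from the start of the string only.
anchored = re.compile(r"^(?=.*/pdf/)(?!.*help)(.+)")
good_hit = anchored.search(good) is not None
bad_hit = anchored.search(bad) is not None

# Unanchored: the search may begin just after the 'h' of 'help',
# where the negative lookahead no longer sees the word.
unanchored = re.compile(r"(?=.*/pdf/)(?!.*help)(.+)")
leak = unanchored.search(bad)
leaked_text = leak.group() if leak else None

print(good_hit)     # True
print(bad_hit)      # False
print(leaked_text)  # elp/us/en/pdf/simple.pdf
```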
(?:^|\s)((?:[^h ]|h(?!elp))+\/pdf\/\S*)(?:$|\s)
First, match either whitespace or the start of a line
(?:^|\s)
Then we match anything that is not a space or h, OR any h that is not followed by elp, one or more times (+), until we find /pdf/, then match non-space characters (\S) any number of times (*).
((?:[^h ]|h(?!elp))+\/pdf\/\S*)
If we also want to reject help after the /pdf/, we can repeat the same construct after it.
((?:[^h ]|h(?!elp))+\/pdf\/(?:[^h ]|h(?!elp))+)
Finally, we match whitespace or the end of the line/string ($)
(?:$|\s)
The full match will include leading/trailing spaces, and should be stripped. If you use capture group 1, you don't need to strip the ends.
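The pattern without lookahead on the whole string can be exercised the same way. This is a sketch in Python's re module (an assumption, as the question names no engine); the \/ escapes are dropped because / needs no escaping in Python, and the helper name find_pdf_url is hypothetical.

```python
import re
from typing import Optional

# Same structure as the pattern above: a boundary, then characters that are
# neither a space nor an 'h' starting 'help', then /pdf/ and the rest of the URL.
PDF_RE = re.compile(r"(?:^|\s)((?:[^h ]|h(?!elp))+/pdf/\S*)(?:$|\s)")

def find_pdf_url(text: str) -> Optional[str]:
    """Return capture group 1 (the URL without surrounding whitespace), or None."""
    m = PDF_RE.search(text)
    return m.group(1) if m else None

good = find_pdf_url("/dealer/us/en/pdf/simple.pdf")
bad = find_pdf_url("/dealer/help/us/en/pdf/simple.pdf")
padded = find_pdf_url(" /dealer/us/en/pdf/simple.pdf ")

print(good)    # /dealer/us/en/pdf/simple.pdf
print(bad)     # None
print(padded)  # /dealer/us/en/pdf/simple.pdf
```

Reading group 1 rather than the full match is what makes stripping unnecessary in the padded case.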

Matching pattern multiple times in same string with regex [duplicate]

This question already has an answer here:
Finding the indexes of multiple/overlapping matching substrings
(1 answer)
Closed 7 years ago.
I'm trying to find all matches of a particular pattern "8ab|ab8" in the string "8ab8". So I tried the R command gregexpr("8ab|ab8","8ab8") hoping to get a return vector with the starting positions as c(1,2).
Unfortunately, it seems that what happens is that once the first pattern is matched, that portion of the string is "removed" and the second pattern won't be matched.
For example, once "8ab" is matched, "8ab8" becomes "8" and when R tries matching "ab8" in "8", the pattern won't be found. I know this because gregexpr("8ab|ab8","8ab ab8") works fine and returns starting positions of pattern matches as c(1,5).
The question is, how do I match the same pattern multiple times in the first case?
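Use Perl-compatible regular expressions with perl=TRUE (see ?regex for details) and wrap each alternative in a lookahead. A lookahead consumes no characters, so a match at one position does not prevent another match from starting inside it.
gregexpr("(?=8ab)|(?=ab8)","8ab8",perl=TRUE)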
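The effect of the zero-width lookaheads can be compared with the plain alternation. The question is about R; the sketch below uses Python's re module as a stand-in (an assumption), so the positions are 0-based, whereas gregexpr reports 1-based positions.

```python
import re

text = "8ab8"

# Plain alternation consumes "8ab", so "ab8" can no longer be found.
consuming = [m.start() for m in re.finditer(r"8ab|ab8", text)]

# Lookaheads match an empty string at each position where a pattern starts,
# so overlapping occurrences are all reported.
overlapping = [m.start() for m in re.finditer(r"(?=8ab)|(?=ab8)", text)]

print(consuming)    # [0]
print(overlapping)  # [0, 1]
```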