Regex Negative Lookahead Is Being Ignored - regex

I have the following sample text
[Item 1](./path/notes.md)
[Item 2](./path)
[Item 3](./path/notes.md)
[Item 4](./path)
When I applied the following regex \[(.*)\].*(?!notes\.md).*\), I expected the following output when I printed the first capture group
Item 2
Item 4
but what I ended up getting was
Item 1
Item 2
Item 3
Item 4
which, to me, seems that the negative lookahead portion, (?!notes\.md), was being ignored for some reason, so the regex was matching the whole string.
What is really confusing me, is that a positive lookahead works as I would expect. For example using \[(.*)\].*(?=notes\.md).*\) returns the following when printing the first capture group
Item 1
Item 3
which makes sense, so I am really confused as to why the negative lookahead is not functioning properly.

Let's go through what happens when matching your pattern on item 1:
\[(.*)\] matches [Item 1]
.* matches (./path/notes.md
The remaining string is now )
(?!notes\.md) checks whether the remaining string matches the pattern notes\.md. It does not, so the match continues.
\) matches the ) and the match succeeds.
If you change it so that the .* before the lookahead is inside the lookahead instead (\[(.*)\](?!.*notes\.md).*\)), it will now work as follows:
\[(.*)\] matches [Item 1]
The remaining string is now (./path/notes.md)
(?!.*notes\.md) checks whether the remaining string matches the pattern .*notes\.md, which it does, so the match fails (well more precisely, the regex engine will try to backtrack before it gives up on the match, but there's no alternative way to match \[(.*)\]', so the match still fails).
So with that change, it will reject all strings where notes.md appears anywhere before the ). If you want it to only reject instances where notes.md appears directly before the ), you can use a loookbehind (without .*) instead or add the \) to the lookahead.

In short, you have way too many .* (can lead to Catastrophic backtracking, look it up!). Remember that they match any character zero or more times. That means they will keep trying to match until they get success. And that is not necessarily the number of characters, you want.
A way to solve your problem is to move the negative look ahead to the front, like this:
(?!.*notes\.md)\[([^\]]+)\].*
Explanation:
(?!.*notes\.md) negative look ahead for any number of any char followed by 'notes.md'
\[ a [ character
([^\]]+) group 1, any character not being ], one or more times
\] a ] character
.* the rest of the line
Use the 'multiline' flag to get each line.

The problem here is that .* before the negative lookahead is greedy and will continue finding anything then stops.
A way to manage this is to include this greedy behaviour inside of the negative lookahead like hère
https://regex101.com/r/yzUQoP/1
/\[(.*)\](?!.*notes\.md)/gm

The pattern \[(.*)\].*(?!notes\.md).*\) that you tried matches from the first [ until the last occurrence of ]
Then what happens is that .* will match the rest of the line, so the following assertion (?!notes\.md) will always be true as the rest of the line is already matched.
Then the engine can backtrack to match the last )
If you don't want to cross [] and () while matching:
\[([^][]+)]\((?![^()]*\bnotes\.md\b)[^()]*\)
\[ Match [
([^][]+) Capture group 1, match 0+ times any char other than [ and ]
]\( Match ](
(?! Negative lookahead
[^()]*\bnotes\.md\b Match 0+ times any char other then ( and ) and then match notes.md between word boundaries to prevent partial matches
) Close lookahead
[^()]* Match 0+ times any char except ( and )
\) Match )
Regex demo

Related

regex match two words based on a matching substring

there are 4 strings as shown below
ABC_FIXED_20220720_VALUEABC.csv
ABC_FIXED_20220720_VALUEABCQUERY_answer.csv
ABC_FIXED_20220720_VALUEDEF.csv
ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv
Two strings are considered as matched based on a matching substring value (VALUEABC, VALUEDEF in the above shown strings). Thus I am looking to match first 2 (having VALUEABC) and then next 2 (having VALUEDEF). The matched strings are identified based on the same value returned for one regex group.
What I tried so far
ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv
This returns regex group-1 (from (.*[^QUERY_answer])) value "VALUEABC" for first 2 strings and "VALUEDEF" for next 2 strings and thus desired matching achieved.
But the problem with above regex is that as soon as the value ends with any of the characters of "QUERY_answer", the regex doesn't match any value for the grouping. For instance, the below 2 strings doesn't match at all as the VALUESTU ends with "U" here :
ABC_FIXED_20220720_VALUESTU.csv
ABC_FIXED_20220720_VALUESTUQUERY_answer.csv
I tried to use Negative Lookahead:
ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv
but in this case the grouping-1 value is returned as "VALUESTU" for first string and "VALUESTUQUERY_answer" for second string, thus effectively making the 2 strings unmatched.
Any way to achieve the desired matching?
With your shown samples please try following regex.
^ABC_[^_]*_[0-9]+_(.*?)(?:QUERY_answer)?\.csv$
OR to match exact 8 digits try:
^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$
Here is the online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^ABC_[^_]*_ ##Matching from starting of value ABC followed by _ till next occurrence of _.
[0-9]+_ ##Matching continuous occurrences of digits followed by _ here.
(.*?) ##Creating one and only capturing group using lazy match which is opposite of greedy match.
(?:QUERY_answer)? ##In a non-capturing group matching QUERY_answer and keeping it optional.
\.csv$ ##Matching dot literal csv at the end of the value.
You need
ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv
See the regex demo.
Note
.*[^QUERY_answer] matches any zero or more chars other than line break chars as many as possible, and then any one char other than Q, U, E, etc., i.e. any char in the negated character class. This is replaced with .*?, to match any zero or more chars other than line break chars as few as possible.
(?:QUERY_answer)? - the group is made non-capturing to reduce grouping complexity.
\.csv - the . is escaped to match a literal dot.

How can I get the first and last part of one wordcombination using regex

How can I get only the middle part of a combined name with PCRE regex?
name: 211103_TV_storyname_TYPE
result: storyname
I have used this single line: .(\d)+.(_TV_) to remove the first part: 211103_TV_
Another idea is to use (_TYPE)$ but the problem is that I don´t have in all variations of names a space to declare a second word to use the ^ for the first word and $ for the second.
The variation of the combined name is fix for _TYPE and the TV.
The numbers are changing according to the date. And the storyname is variable.
Any ideas?
Thanks
With your shown samples, please try following regex, this creates one capturing group which contains matched values in it.
.*?_TV_([^_]*)(?=_TYPE)
OR(adding a small variation of above solution with fourth bird's nice suggestion), following is without lazy match .*? unlike above:
_TV_([^_]*)(?=_TYPE)
Here is the Online demo for above regex
Explanation: Adding detailed explanation for above.
.*?_ ##Using Lazy match to match till 1st occurrence of _ here.
TV_ ##Matching TV_ here.
([^_]*) ##Creating 1st capturing group which has everything before next occurrence of _ here.
(?=_TYPE) ##Making sure previous values are followed by _TYPE here.
You could match as least as possible chars after _TV_ until you match _TYPE
\d_TV_\K.*?(?=_TYPE)
\d_TV_ Match a digit and _TV_
\K Forget what is matched until now
.*? Match as least as possible characters
(?=_TYPE) Assert _TYPE to the right
Regex demo
Another option without a non greedy quantifier, and leaving out the digit at the start:
_TV_\K[^_]*+(?>_(?!TYPE)[^_]*)*(?=_TYPE)
_TV_ Match literally
\K[^_]*+ Forget what is matched until now and optionally match any char except _
(?>_(?!TYPE)[^_]*)* Only allow matching _ when not directly followed by TYPE
(?=_TYPE) Assert _TYPE to the right
Regex demo
Edit
If you want to replace the 2 parts, you can use an alternation and replace with an empty string.
If it should be at the start and the end of the string, you can prepend ^ and append $ to the pattern.
\b\d{6}_TV_|_TYPE\b
\b\d{6}_TV_ A word boundary, match 6 digits and _TV_
| Or
_TYPE\b Match _TYPE followed by a word boundary
Regex demo
Here i put some additional Screenshots to the post. With the Documentation that appears on the help button. And you see the forms and what i see.
Documentation
The regular expressions we use are based on PCRE - Perl Compatible Regular Expressions. Full specification can be found here: http://www.pcere.org and http://perldoc.perl.org/perlre.html
Summary of some useful terms:
Metacharacters
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class
Quantifiers
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
Charcter Classes
\w Match a "word" character (alphanumeric plus mao}
\W Match a non-"word" character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
Capture buffers
The bracketing construct (...) creates capture buffers. To refer to
Within the same pattern, use \1 for the first, \2 for the second, and so on. Outside the match use "$" instead of "". The \ notation works in certain circumstances outside the match. See the warning below about \1 vs $1 for details.
Referring back to another part of the match is called a backreference.
Examples
Replace story with certain prefix letters M N or E to have the prefix "AA":
`srcPattern "(M|N|E ) ([A-Za-z0-9\s]*)"`
`trgPattern "AA$2" `
`"N StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
`"E StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
`"M StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
"NoMatchWord StoryWord1 StoryWord2" -> "NoMatchWord StoryWord1 StoryWord2" (no match found, name remains the same)

Regex to find a line with two capture groups that match the same regex but are still different

I am trying to analyse my source code (written in C) for not corresponding timer variable comparisons/allocations. I have a rage of timers with different timebases (2-250 milliseconds). Every timer variable contains its granularity in milliseconds in its name (e.g. timer10ms) as well as every timer-photo and define (e.g. fooTimer10ms, DOO_TIMEOUT_100MS).
Here are some example lines:
fooTimer10ms = timer10ms;
baaTimer20ms = timer10ms;
if (DIFF_100MS(dooTimer10ms) >= DOO_TIMEOUT_100MS)
if (DIFF_100MS(dooTimer10ms) < DOO_TIMEOUT_100MS)
I want to match those line where the timebases are not corresponding (in this case the second, third and fourth line). So far I have this regex:
(\d{1,3}(?i)ms(?-i)).*[^\d](\d{1,3}(?i)ms(?-i))
that is capable of finding every line where there are two of those granularities. So instead of just line 2, 3 and 4 it matches all of them. The only idea I had to narrow it down is to add a negative lookbehind with a back-reference, like so:
(\d{1,3}(?i)ms(?-i)).*[^\d](\d{1,3}(?i)ms(?-i))(?<!\1)
but this is not allowed because a negative lookbehind has to have a fixed length.
I found these two questions (one, two) but the fist does not have the restriction of having both capture groups being of the same kind and the second is looking for equal instances of the capture group.
If what I want can be achieved way easier, by using something else than regex, I would be happy to know. My mind is just stuck due to my believe that regex is capable of that and I am just not creative enough to use it properly.
One option is to match the timer part followed by the digits and use a negative lookahead with a backreference to assert that it does not occur at the right.
For the example data, a bit specific pattern using a range from 2-250 might be:
.*?(timer(?:2[0-4]\d|250|1?\d\d|[2-9])ms)\b\S*[^\S\r\n]*[<>]?=[^\S\r\n]*\b(?!\S*\1)\S+
The pattern matches
.*? Match any char except a newline, as least as possible (Non greedy)
( Capture group 1
timer Match literally
(?:2[0-4]\d|250|1?\d\d|[2-9]) Match a digit in the range of 2-250
ms Match literally
)\b Close group and a word boundary
\S*[^\S\r\n]* Match optional non whitespace chars and optional spaces without newlines
[<>]?= Match an optional < or > and =
[^\S\r\n]*\b Match optional whitespace chars without a newline and a word boundary
(?!\S*\1) Negative lookahead, assert no occurrence of what is captured in group 1 in the value
\S+ Match 1+ non whitespace chars
Regex demo
Or perhaps a broader pattern matching 1-3 digits and optional whitespace chars which might also match a newline:
.*?(timer\d{1,3}ms\b)\S*\s*[<>]?=\s*\b(?!.*\1)\S+
Regex demo
Note that {1-3} should be {1,3} and could also match 999

Using regex to determine straight (unordered hand)

A straight in poker is five cards in a row, for example 23456 or 89TJQ. With a "sorted" hand, the regex could be written as:
^(A2345|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA)$
It's a bit verbose but straightforward enough. However, would it be possible to generate a (sensible) regex if the hand was unordered? For example, if the hand was 52634 or JQ89T??
One possible way would be to use a ?=.*<item> lookahead (which would essentially be "unsorted"), for example:
^(?:
(?=.*A)(?=.*2)(?=.*3)(?=.*4)(?=.*5)
|(?=.*2)(?=.*3)(?=.*4)(?=.*5)(?=.*6)
|(?=.*3)(?=.*4)(?=.*5)(?=.*6)(?=.*7)
|(?=.*4)(?=.*5)(?=.*6)(?=.*7)(?=.*8)
|(?=.*5)(?=.*6)(?=.*7)(?=.*8)(?=.*9)
|(?=.*6)(?=.*7)(?=.*8)(?=.*9)(?=.*T)
|(?=.*7)(?=.*8)(?=.*9)(?=.*T)(?=.*J)
|(?=.*8)(?=.*9)(?=.*T)(?=.*J)(?=.*Q)
|(?=.*9)(?=.*T)(?=.*J)(?=.*Q)(?=.*K)
|(?=.*T)(?=.*J)(?=.*Q)(?=.*K)(?=.*A)
)
.{5}$
Are there other / better approaches to finding if a straight exists using regex only?
You can use the following regex:
See regex in use here
(?!.*(.).*\1)(?:[A2345]{5}|[23456]{5}|[34567]{5}|[45678]{5}|[56789]{5}|[6789T]{5}|[789TJ]{5}|[89TJQ]{5}|[9TJQK]{5}|[TJQKA]{5})
This works by first using a negative lookahead to ensure that the string doesn't contain any duplicates (?!.*(.).*\1). Then it matches 5 characters from any of the straight possibilities.
(?!.*(.).*\1)
#^^^ ^ negative lookahead ensuring what follows doesn't match
# ^^ match any character any number of times
# ^^^ capture a character into capture group #1
# ^^ match any character any number of times
# ^^ match the same text as most recently matched by the 1st capture group
Against JQQ89, it works as follows:
- .* matches J
- (.) captures Q
- .* matches nothing
- \1 tries to match Q (and succeeds)
- Negative lookahead has a match, so fail the match.

R- regex extracting a string between a dash and a period

First of all I apologize if this question is too naive or has been repeated earlier. I tried to find it in the forum but I'm posting it as a question because I failed to find an answer.
I have a data frame with column names as follows;
head(rownames(u))
[1] "A17-R-Null-C-3.AT2G41240" "A18-R-Null-C-3.AT2G41240" "B19-R-Null-C-3.AT2G41240"
[4] "B20-R-Null-C-3.AT2G41240" "A21-R-Transgenic-C-3.AT2G41240" "A22-R-Transgenic-C-3.AT2G41240"
What I want is to use regex in R to extract the string in between the first dash and the last period.
Anticipated results are,
[1] "R-Null-C-3" "R-Null-C-3" "R-Null-C-3"
[4] "R-Null-C-3" "R-Transgenic-C-3" "R-Transgenic-C-3"
I tried following with no luck...
gsub("^[^-]*-|.+\\.","\\2", rownames(u))
gsub("^.+-","", rownames(u))
sub("^[^-]*.|\\..","", rownames(u))
Would someone be able to help me with this problem?
Thanks a lot in advance.
Shani.
Here is a solution to be used with gsub:
v <- c("A17-R-Null-C-3.AT2G41240", "A18-R-Null-C-3.AT2G41240", "B19-R-Null-C-3.AT2G41240", "B20-R-Null-C-3.AT2G41240", "A21-R-Transgenic-C-3.AT2G41240", "A22-R-Transgenic-C-3.AT2G41240")
gsub("^[^-]*-([^.]+).*", "\\1", v)
See IDEONE demo
The regex matches:
^[^-]* - zero or more characters other than -
- - a hyphen
([^.]+) - Group 1 matching and capturing one or more characters other than a dot
.* - any characters (even including a newline since perl=T is not used), any number of occurrences up to the end of the string.
This can easily be achieved with the following regex:
-([^.]+)
# look for a dash
# then match everything that is not a dot
# and save it to the first group
See a demo on regex101.com. Outputs are:
R-Null-C-3
R-Null-C-3
R-Null-C-3
R-Null-C-3
R-Transgenic-C-3
R-Transgenic-C-3
Regex
-([^.]+)\\.
Description
- matches the character - literally
1st Capturing group ([^\\.]+)
[^\.]+ match a single character not present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
. matches the character . literally
\\. matches the character . literally
Debuggex Demo
Output
MATCH 1
1. [4-14] `R-Null-C-3`
MATCH 2
1. [29-39] `R-Null-C-3`
MATCH 3
1. [54-64] `R-Null-C-3`
MATCH 4
1. [85-95] `R-Null-C-3`
MATCH 5
1. [110-126] `R-Transgenic-C-3`
MATCH 6
1. [141-157] `R-Transgenic-C-3`
This seems an appropriate case for lookarounds:
library(stringr)
str_extract(v, '(?<=-).*(?=\\.)')
where
(?<= ... ) is a positive lookbehind, i.e. it looks for a - immediately before the next captured group;
.* is any character . repeated 0 or more times *;
(?= ... ) is a positive lookahead, i.e. it looks for a period (escaped as \\.) following what is actually captured.
I used stringr::str_extract above because it's more direct in terms of what you're trying to do. It is possible to do the same thing with sub (or gsub), but the regex has to be uglier:
sub('.*?(?<=-)(.*)(?=\\.).*', '\\1', v, perl = TRUE)
.*? looks for any character . from 0 to as few as possible times *? (lazy evaluation);
the lookbehind (?<=-) is the same as above;
now the part we want .* is put in a captured group (...), which we'll need later;
the lookahead (?=\\.) is the same;
.* captures any character, repeated 0 to as many as possible times (here the end of the string).
The replacement is \\1, which refers to the first captured group from the pattern regex.