How To Use a Regex Capture Result to Lookbehind - regex

I am trying to use the result of the capture group to perform a look behind for a specific answer.
Sample of Text:
10) Once a strategy has been formulated and implemented, it is important that the firm sticks to it no matter what happens.
Answer: FALSE
11) Which of the following strategies does Tesla need to implement or achieve to gain a competitive advantage?
A) imitate the features of the most popular SUVs on the market
B) reinvest profits to build successively better electric automobiles
C) sell advertising space on their cars' digital displays
D) substitute less-expensive components to keep costs low
Answer: B
Current Output:
https://regex101.com/r/bLKmYX/1
It is currently outputting FALSE and B as the answers to these questions.
Expected Output
I would like it to output FALSE and B) reinvest profits to build successively better electric automobiles
Current Regex Expression
'^\d+\)\s*([\s\S]*?)\nAnswer:\s*(.*)'
How can I use the result of the second capture group, (B), to perform a lookbehind and get the whole answer?

What you ask for is not possible, because a captured value can only be checked after it has been obtained.
You may try different logic: capture the answer letter at the start of an option line, and then match the same letter after the Answer: substring using a backreference to that group.
You may consider a pattern like
(?m)^\d+\)\s*((?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)?[\s\S]*?)\nAnswer:\s*(\3|FALSE)
See the regex demo.
It now has four capturing groups: the first contains the whole question body, the second contains the answer line you need, the third is auxiliary (it is used to check which answer is correct), and the fourth is the answer value.
Details
(?m) - ^ now matches line start positions and $ matches line end positions
^ - start of a line
\d+ - 1+ digits
\) - a ) char
\s* - 0+ whitespaces
((?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)?[\s\S]*?) - Group 1:
(?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)? - an optional non-capturing group matching
(?:(?!^\d+\))[\s\S])*? - any char, zero or more occurrences but as few as possible, that does not begin a sequence of a line start, 1+ digits and then a ) char
\n - a newline
(([A-Z])\).*) - Group 2: an ASCII uppercase letter captured into Group 3, then ) char and then the rest of the line (.*)
$ - end of line
[\s\S]*? - any 0+ chars as few as possible
\nAnswer: - a new line, Answer: string
\s* - 0+ whitespaces
(\3|FALSE) - Group 4: Group 3 value or FALSE.
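For anyone who wants to try the pattern outside regex101, here is a minimal Python sketch run against the sample text from the question. The names `PATTERN`, `sample` and `answers` are only illustrative. It assumes Python's `re` behaviour, where a backreference to a group that did not participate simply fails, so the FALSE alternative takes over for true/false questions.

```python
import re

# Pattern from the answer above; (?m) lets ^ and $ match at line boundaries.
PATTERN = re.compile(
    r"(?m)^\d+\)\s*((?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)?[\s\S]*?)"
    r"\nAnswer:\s*(\3|FALSE)"
)

sample = (
    "10) Once a strategy has been formulated and implemented, it is important "
    "that the firm sticks to it no matter what happens.\n"
    "Answer: FALSE\n"
    "11) Which of the following strategies does Tesla need to implement or "
    "achieve to gain a competitive advantage?\n"
    "A) imitate the features of the most popular SUVs on the market\n"
    "B) reinvest profits to build successively better electric automobiles\n"
    "C) sell advertising space on their cars' digital displays\n"
    "D) substitute less-expensive components to keep costs low\n"
    "Answer: B"
)

answers = []
for m in PATTERN.finditer(sample):
    # Group 2 holds the full option line when a lettered option is the answer;
    # it is None for FALSE, so fall back to Group 4 (the raw answer value).
    answers.append(m.group(2) or m.group(4))

print(answers)
```

Group 2 is what gives the full answer line; Group 4 is only needed as the fallback for answers that have no option line.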

Related

Ignore Until "Spacebar+I or V or X" - Regex Expression

So... I had a regex which worked just fine (wasn't pretty but worked), until the Roman Numerals reached more than X.
Currently my Regex looks like this:
(.*?)(^(X{1,3})(I[XV]|V?I{0,3})$|^(I[XV]|V?I{1,3})$|^V$)*(.)( EP\. )(\d*)(.*)
The problem I have right now is that if the Roman numeral has a value of 10 or more, it ends up in the 1st group, which drives me nuts.
I need it to work in a way that all before roman numerals is ignored.
Test Text:
PEPA THE PIG XVI EP. 169 - BAD ENDING
Could you please help me fix the regex so it actually does what it is supposed to do?
You should re-consider using anchors in the middle of a regex: ^ requires start of string and $ requires the end of string.
Besides, (.) before ( EP\. ) consumes the space, so the leading space in the EP pattern has nothing left to match.
Consider using
^(.*?)\b(X{1,3}(?:I[XV]|V?I{0,3})|I[XV]|V?I{1,3}|V)\b(.)\b(EP\.)\s*(\d+)(.*)
See the regex demo. You might still need to check what exactly you want to match with (.).
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
\b - a word boundary
(X{1,3}(?:I[XV]|V?I{0,3})|I[XV]|V?I{1,3}|V) - Group 2: either one to three Xs followed by IX, IV, or an optional V and zero to three Is; or IX or IV; or an optional V followed by one to three Is; or V
\b - a word boundary
(.) - Group 3: any one char (other than a newline)
\b - a word boundary
(EP\.) - Group 4: EP.
\s* - zero or more whitespaces
(\d+) - Group 5: one or more digits
(.*) - Group 6: any zero or more chars other than line break chars, as many as possible
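A quick way to check what each group ends up holding is to run the pattern on the test title. This is a small Python sketch; `PATTERN`, `title` and `groups` are names made up for the example.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(
    r"^(.*?)\b(X{1,3}(?:I[XV]|V?I{0,3})|I[XV]|V?I{1,3}|V)\b(.)\b(EP\.)\s*(\d+)(.*)"
)

title = "PEPA THE PIG XVI EP. 169 - BAD ENDING"
m = PATTERN.match(title)

# groups is None when the title does not follow the expected layout.
groups = m.groups() if m else None
print(groups)
# ('PEPA THE PIG ', 'XVI', ' ', 'EP.', '169', ' - BAD ENDING')
```

Note that Group 1 keeps the trailing space before the numeral, and Group 3 is just the single space between the numeral and EP.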

Regex for two of any digit then three of another then four of another?

Regex is great, but I can't for the life of me figure out how I'd express the following constraint, without spelling out the whole permutation:
2 of any digit [0-9]
3 of any other digit [0-9] excluding the above
4 of any third digit [0-9] excluding the above
I've got this monster, which is clearly not a good way of doing this, as it grows exponentially with each additional set of digits:
^(001112222|001113333|001114444|001115555|001116666|0001117777|0001118888|0001119999|0002220000|...)$
OR
^(0{2}1{3}2{4}|0{2}1{3}3{4}|0{2}1{3}4{4}|0{2}1{3}5{4}|0{2}1{3}6{4}|0{2}1{3}7{4}|0{2}1{3}8{4}|...)$
Looks like the following will work:
^((\d)\2(?!.+\2)){2}\2(\d)\3{3}$
It may look a bit tricky because it uses backreferences inside a repeated group, but it looks more intimidating than it really is. See the online demo.
^ - Start string anchor.
( - Open 1st capture group:
(\d) - A 2nd capture group that does capture a single digit ranging from 0-9.
\2 - A backreference to what is captured in this 2nd group.
(?!.+\2) - Negative lookahead to prevent 1+ characters followed by a backreference to the 2nd group's match.
){2} - Close the 1st capture group and match this two times.
\2 - A backreference to what is most recently captured in the 2nd capture group.
(\d) - A 3rd capture group holding a single digit.
\3{3} - Exactly three backreferences to the 3rd capture group's match.
$ - End string anchor.
EDIT:
Looking at your alternations it looks like you are also allowing numbers like "002220000" as long as the digits in each sequence are different to the previous sequence of digits. If that is the case you can simplify the above to:
^((\d)\2(?!.\2)){2}\2(\d)\3{3}$
The main difference is that the "+" quantifier has been removed from the lookahead, which allows the same digit to be used again further on.
See the demo
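To see how the two variants differ, here is a small Python comparison. The names `STRICT`, `RELAXED` and the candidate strings are chosen for illustration only; the two patterns are the ones given above.

```python
import re

# All three digits must differ from each other.
STRICT = re.compile(r"^((\d)\2(?!.+\2)){2}\2(\d)\3{3}$")
# Only neighbouring runs must differ, so "002220000" is allowed.
RELAXED = re.compile(r"^((\d)\2(?!.\2)){2}\2(\d)\3{3}$")

candidates = ["001112222", "002220000", "000001111", "001111111"]

# Map each candidate to (matches STRICT, matches RELAXED).
results = {
    s: (bool(STRICT.match(s)), bool(RELAXED.match(s))) for s in candidates
}

for s, (strict_ok, relaxed_ok) in results.items():
    print(s, strict_ok, relaxed_ok)
```

Only "002220000" is treated differently by the two patterns: the strict one rejects it because 0 is reused, the relaxed one accepts it.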
Depending on whether your target environment/framework/language supports lookaheads, you could do something like:
^(\d)\1(?!\1)(\d)\2\2(?!\1|\2)(\d)\3\3\3$
First capture group ((\d)) allows us to enforce the "two identical consecutive digits" by referencing the capture value (\1) as the next match, after which the negative lookahead ensures the next sequence doesn't start with the previous digit - then we just repeat this pattern twice
Note: If you want to exclude only the digit used in the immediately preceding sequence, change (?!\1|\2) to just (?!\2)
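Since the pattern is just the same building block repeated, it can also be generated instead of typed out. The helper below, `build_pattern`, is a hypothetical function written for this sketch (it is not part of any library), and it assumes every run must use a digit different from all earlier runs.

```python
import re

def build_pattern(run_lengths):
    """Build a regex for consecutive runs of digits, each run using a new digit.

    Hypothetical helper for illustration: run_lengths=[2, 3, 4] reproduces
    the pattern shown above.
    """
    parts = []
    for i, n in enumerate(run_lengths, start=1):
        if i > 1:
            # Negative lookahead: the next digit must differ from every earlier group.
            earlier = "|".join("\\%d" % j for j in range(1, i))
            parts.append("(?!" + earlier + ")")
        # Capture one digit, then require n - 1 repeats of it via a backreference.
        parts.append(r"(\d)" + ("\\%d" % i) * (n - 1))
    return "^" + "".join(parts) + "$"

pattern = build_pattern([2, 3, 4])
print(pattern)

print(bool(re.match(pattern, "001112222")))  # True
print(bool(re.match(pattern, "002220000")))  # False: 0 is reused
```

This keeps the pattern length linear in the number of runs, which was the original concern with the spelled-out alternation.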

Regex (PCRE): Match all digits in a line following a line which includes a certain string

Using PCRE, I want to capture only and all digits in a line which follows a line in which a certain string appears. Say the string is "STRING99". Example:
car string99 house 45b
22 dog 1 cat
women 6 man
In this case, the desired result is:
221
I asked a similar question some time ago; however, back then I was trying to capture the numbers in the SAME line where the string appears (Regex (PCRE): Match all digits conditional upon presence of a string). While the question is similar, I don't think the answer, if there is one at all, will be similar. The approach using the line anchor ^ does not work in this case.
I am looking for a single regular expression without any other programming code. It would be easy to accomplish with two consecutive regex operations, but this is not what I'm looking for.
Maybe you could try:
(?:\bstring99\b.*?\n|\G(?!^))[^\d\n]*\K\d
See the online demo
(?: - Open non-capture group:
\bstring99\b - Literally match "string99" between word-boundaries.
.*?\n - Lazy match up to (including) nearest newline character.
| - Or:
\G(?!^) - Asserts the position at the end of the previous match; the negative lookahead prevents it from matching at the start of the string on the first match.
) - Close non-capture group.
[^\d\n]* - Match 0+ non-digit/newline characters.
\K - Resets the starting point of the reported match.
\d - Match a digit.
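Python's built-in `re` module supports neither \G nor \K, so the pattern above cannot be run there as written. Purely to make the expected result concrete, here is a two-step equivalent in Python. It is not the single-regex solution the question asks for, and `digits_after_marker` is a hypothetical helper name used only in this sketch.

```python
import re

text = "car string99 house 45b\n22 dog 1 cat\nwomen 6 man"

def digits_after_marker(text, marker="string99"):
    """Collect all digits from every line that directly follows a marker line."""
    # The lookahead captures the following line without consuming it, so a
    # marker on that following line would still be found by finditer.
    marker_line = re.compile(r"\b" + re.escape(marker) + r"\b.*\n(?=(.*))")
    digits = []
    for m in marker_line.finditer(text):
        digits.extend(re.findall(r"\d", m.group(1)))
    return "".join(digits)

result = digits_after_marker(text)
print(result)  # 221
```

The digits 45 on the marker line and 6 on the last line are ignored, which is the same outcome the PCRE pattern is meant to produce.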

Regex pattern to match letter combination of a word

I am currently developing a puzzle game for kids where the player needs to select the correct word from a grid. I use a regex to match the word.
For example, I used ([D|E|C|K]){4} to match DECK, because the player should be able to select the word in any order, not only in the exact D->E->C->K order. The player may select it as KDEC or EDCK or KCED or in any other order.
I achieved this by using ([D|E|C|K]){4}.
But here I am facing an issue: this pattern also matches EEEE or DDDD or DKDK, etc. In other words, it matches any combination of 4 chars from the set.
Any idea how I can modify the regex to get my desired outcome?
Thanks in advance.
Basically, this is not a good job for a regex: a pattern has no concise way to say "these letters, each used exactly once, in any order". You'd better follow a simple algorithm: split the input string into characters, sort them, and rejoin them into a string; do the same with the search string; then compare the results.
See a JavaScript demo with the word TALL:
const strings = ['TALL','LATL','TLAL','TTAL','AATT','ATL','STL'];
const search = 'TALL';
const compare_with = search.split("").sort().join("");
for (let s of strings) {
  console.log(s, ':', s.split("").sort().join("") == compare_with );
}
Can it be done with a regex? In .NET, you may use balancing groups, and that is a genuine solution, not a workaround.
Scenario 1: .NET regex engine specific solution
Assuming your search word is TALL, you may build a regex like
^(?:(T)|(A)|(L)|(L)){4}$(?<-1>)(?<-2>)(?<-3>)(?<-4>)
See the regex demo.
Details
^ - start of string
(?:(T)|(A)|(L)|(L)){4} - a non-capturing group that matches 4 occurrences of
(T) - T pushed on to the Group 1 capture stack
|(A) - or A pushed on to the Group 2 capture stack
|(L) - or L pushed on to the Group 3 capture stack
|(L) - or L pushed on to the Group 4 capture stack
$ - end of string
(?<-1>)(?<-2>)(?<-3>)(?<-4>) - Pop a value from each of the group capture stacks. If any group's stack is empty, the pop fails and there is no match; otherwise, there is a match.
Scenario 2: Lookahead-based workaround in case all characters are unique
You may match and capture each letter from the range into a separate capturing group and add a negative lookahead before each subsequent capturing group to avoid matching a letter matched before it.
The regex will look like
^([DECK])(?!\1)([DECK])(?!\1|\2)([DECK])(?!\1|\2|\3)([DECK])$
See the regex demo
Details
^ - start of string
([DECK]) - Group 1: a letter, D, E, C or K
(?!\1) - the next char cannot be the one captured into Group 1
([DECK]) - Group 2: a letter, D, E, C or K
(?!\1|\2)([DECK]) - the next letter cannot be equal to the first and second one
(?!\1|\2|\3)([DECK]) - the next letter cannot be equal to the first, second and third one
$ - end of string
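The lookahead workaround does not rely on any engine-specific feature, so it can be checked in Python as well. A small sketch, where `DECK`, `words` and `accepted` are names invented for the example:

```python
import re

# Each lookahead forbids repeating a letter that an earlier group captured.
DECK = re.compile(
    r"^([DECK])(?!\1)([DECK])(?!\1|\2)([DECK])(?!\1|\2|\3)([DECK])$"
)

words = ["DECK", "KDEC", "EDCK", "EEEE", "DKDK", "DEC"]

# Keep only the selections that use each of D, E, C, K exactly once.
accepted = [w for w in words if DECK.match(w)]
print(accepted)  # ['DECK', 'KDEC', 'EDCK']
```

As the scenario title says, this only works when all letters of the search word are unique; a word like TALL needs the sorting approach or the .NET balancing groups.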

Find matches ending with a letter that is not a starting letter of the next match

Intro
I have a string containing diagnosis codes (ICD-10), not separated by any character. I would like to extract all valid diagnosis codes. Valid diagnosis codes are of the form
[Letter][between 2 and 4 numbers][optional letter that is not the next match starting letter]
The regex for this pattern is (I believe)
\w\d{2,4}\w?
Example
Here is an example
mystring='F328AG560F33'
In this example there are three codes:
'F328A' 'G560' 'F33'
I would like to extract these codes with a function like str_extract_all in R (preferably but not exclusively)
My solution so far
So far, I managed to come up with an expression like:
str_extract_all(mystring,pattern='\\w\\d{2,4}\\w?(?!(\\w\\d{2,4}\\w?))')
However when applied to the example above it returns
"F328" "G560F"
Basically it misses the letter A in the first code, and misses altogether the last code "F33" by mistakenly assigning F to the preceding code.
Question
What am I doing wrong? I only want to extract values that end with a letter that is not the start of the next match, and if it is, the match should not include the letter.
Application
This question is of great relevance for example when mining patient Electronic Health Records that have not been validated.
You have a letter, two to four digits, then an optional letter. That optional letter, if it's there, will only ever be followed by another letter (the start of the next code) or by the end of the string; put another way, it is never followed by a digit. You can write a negative lookahead to express this:
\w\d{2,4}(?:\w(?!\d))?
This at least works with PCRE. I don't know about how R will handle it.
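For what it's worth, the same pattern can be checked with Python's `re` module, which also supports negative lookaheads. This is a minimal sketch with illustrative names (`CODE`, `codes`):

```python
import re

# A word char, 2 to 4 digits, then an optional word char that is not followed by a digit.
CODE = re.compile(r"\w\d{2,4}(?:\w(?!\d))?")

mystring = "F328AG560F33"
codes = CODE.findall(mystring)
print(codes)  # ['F328A', 'G560', 'F33']
```

Note that \w also matches digits and underscores, so for unvalidated input a stricter class such as [A-Z] for the letters may be safer.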
Your matches are overlapping. In this case, you might use str_match_all that allows easy access to capturing groups and use a pattern with a positive lookahead containing a capturing group inside:
(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))
See the regex demo
Details
(?= - a positive lookahead start (it will be run at every location before each char and at the end of the string)
( - Group 1 start
[A-Z] - a letter (if you use a case insensitive modifier (?i), it will be case insensitive)
\d{2,4} - 2 to 4 digits
(?: - an optional non-capturing group start:
[A-Z] - a letter
(?!\d{2,4}) - not followed with 2 to 4 digits
)? - the optional non-capturing group end
) - Group 1 end
) - Lookahead end.
R demo:
> library(stringr)
> res <- str_match_all("F328AG560F33", "(?i)(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))")
> res[[1]][,2]
[1] "F328A" "G560" "F33"
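The capturing-lookahead idea is not specific to R. As a rough Python counterpart, `re.findall` returns the contents of Group 1 for every position where the lookahead succeeds. The names `LOOKAHEAD` and `codes` are only for this sketch.

```python
import re

# Zero-width match at each position; the code itself is captured inside the lookahead.
LOOKAHEAD = re.compile(r"(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))")

codes = LOOKAHEAD.findall("F328AG560F33")
print(codes)  # ['F328A', 'G560', 'F33']
```

Because the match itself is empty, the engine advances one character at a time, and only positions that start with a letter followed by digits produce a result.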