How to write If conditions in regex? - regex

I am trying to capture the amount from the following string:
Delivery Charge $2
Promo - (FIRST) ($4)
$7
New Coins earned $5
Issued on behalf of
.......................
The line "New Coins earned $5" might not always be present. I want to capture the amount paid, which is "7" in this case. I tried \.?\s*\n*([\d.,]+)\s*\n*Issued\s*\n*on, but this only captures the amount if "New Coins earned $5" is absent from the document. I read about if-else conditions and positive lookahead, but couldn't get them to work. Any suggestions on how to capture it?

Since the value you need is always preceded with $ on a separate line you may use
\$(\d[\d,.]*)[\n\r]+(?:.*[\r\n]+){0,2}Issued\s+on\b
The value you need is in Group 1.
Details
\$ - a $ char
(\d[\d,.]*) - Group 1: a digit followed with any 0+ digits, , or . chars
[\n\r]+ - 1 or more CR or LF symbols
(?:.*[\r\n]+){0,2} - 0, 1 or 2 repetitions of 0+ chars other than linebreak chars followed with 1+ LF/CR symbols
Issued\s+on\b - Issued, 1+ whitespaces, on as a whole word (as \b is a word boundary).
See the regex demo.
Python demo:
import re
rx = r"\$(\d[\d,.]*)[\n\r]+(?:.*[\r\n]+){0,2}Issued\s+on\b"
s = "Delivery Charge $2\nPromo - (FIRST) ($4)\n$1,000.55\nNew Coins earned $5\nIssued on behalf of ......................."
match = re.search(rx, s, re.M)
if match:
    print(match.group(1)) # -> 1,000.55
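One case worth checking is the input without the "New Coins earned" line. Because \$ in the pattern above is not tied to the start of a line, the search can begin at the "$2" on the first line and skip two lines to reach "Issued on". The sketch below compares the pattern as posted with an assumed variant that adds ^ under re.M. The names ORIGINAL, ANCHORED and amount are illustrative only, and the sample strings are shortened versions of the one in the question.

```python
import re

# The pattern exactly as given in the answer.
ORIGINAL = re.compile(r"\$(\d[\d,.]*)[\n\r]+(?:.*[\r\n]+){0,2}Issued\s+on\b")
# Assumed variant: ^ with re.M requires the $ to be the first char on its line.
ANCHORED = re.compile(r"^\$(\d[\d,.]*)[\n\r]+(?:.*[\r\n]+){0,2}Issued\s+on\b", re.M)

with_coins = "Delivery Charge $2\nPromo - (FIRST) ($4)\n$7\nNew Coins earned $5\nIssued on behalf of"
without_coins = "Delivery Charge $2\nPromo - (FIRST) ($4)\n$7\nIssued on behalf of"

def amount(rx, text):
    """Return Group 1 of the first match, or None if nothing matches."""
    m = rx.search(text)
    return m.group(1) if m else None

for label, text in (("with coins", with_coins), ("without coins", without_coins)):
    print(label, amount(ORIGINAL, text), amount(ANCHORED, text))
```

With these inputs the unanchored pattern returns 2 when the optional line is missing, while the anchored variant returns 7 in both cases.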

In engines that support conditionals (PCRE and .NET, for example) you can write (?(?=regex)then|else). Note that (?=...) is a lookahead and is zero-length: it consumes no characters, so your "then" branch must itself match the text that the lookahead checked.
You can also build more complex expressions like this:
(?(?=condition)(then1|then2|then3)|(else1|else2|else3))
Here then1, then2 and then3 are listed in descending priority order, because the first alternative that matches causes the remaining ones to be skipped.
You can look for more info here
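Not every engine accepts the lookahead form. Python's built-in re module, for instance, documents only conditionals on a group reference, (?(id/name)yes|no). A common workaround is to spell the conditional out as (?:(?=condition)then|(?!condition)else). The sketch below is a minimal illustration of that rewrite. The group names usd and plain and the helper classify are made up for the example.

```python
import re

# Emulates (?(?=\$)then|else): if the next char is "$", take the "then"
# branch, otherwise (negative lookahead) take the "else" branch.
PATTERN = re.compile(r"^(?:(?=\$)\$(?P<usd>\d+)|(?!\$)(?P<plain>\d+))$")

def classify(text):
    """Return which branch matched and the digits it captured."""
    m = PATTERN.match(text)
    if m is None:
        return None
    if m.group("usd") is not None:
        return ("usd", m.group("usd"))
    return ("plain", m.group("plain"))

print(classify("$7"), classify("7"), classify("x7"))
```

As in the conditional form, the "then" branch repeats the \$ that the lookahead tested, because the lookahead itself consumes nothing.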

Related

Regex to match whole sentence but with exceptions

I use this regex to mark whole sentences ending in period, question mark or exclamation mark - it works so far.
[A-Z][^.!?]*[.?!]
But there are two problems:
if there is a number followed by a period in the sentence.
if there is an abbreviation with a period in the sentence.
Then the sentence is extracted incorrectly.
Example:
Sentence Example: "Er kam am 1. November."
Sentence Example: "Dr. Schiwago."
The first sentence then becomes two sentences because a period follows the number.
The second sentence then also becomes two sentences because the abbreviation ends in a period.
How can I adjust the regex so that both problems do not occur?
So in the first sentence, whenever a period follows a number, this should not be seen as the end of the sentence, but the regex continues to the next period.
In the second sentence, for example, a minimum size of 4 characters would ensure that the abbreviation is not seen as a complete sentence.
You can use
\b(?:\d+\.\s*[A-Z]|(?:[DJS]r|M(?:rs?|(?:is)?s))\.|[^.!?])*[.?!]
See the regex demo. Add the known abbreviations and other patterns you come across as alternatives to the non-capturing group.
\b - word boundary (Note: add (?=[A-Z]) after \b if you need to start matching with an uppercase letter)
(?:\d+\.\s*[A-Z]|(?:[DJS]r|M(?:rs?|(?:is)?s))\.|[^.!?])* - zero or more occurrences of:
\d+\.\s*[A-Z] - one or more digits, ., zero or more whitespaces, uppercase letter
| - or
(?:[DJS]r|M(?:rs?|(?:is)?s))\. - Dr., Jr., Sr., Mr., Mrs., Ms. or Miss. (each followed by a period)
| - or
[^.!?] - a char other than ., ! and ?
[.?!] - a ., ? or ! char.
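To see the breakdown above in action, here is a small Python sketch that runs the pattern over the two example sentences from the question, joined into one string. The names SENTENCE and sentences are illustrative only.

```python
import re

# The pattern from the answer, unchanged. It has no capturing groups,
# so findall returns the full text of each match.
SENTENCE = re.compile(
    r"\b(?:\d+\.\s*[A-Z]|(?:[DJS]r|M(?:rs?|(?:is)?s))\.|[^.!?])*[.?!]"
)

text = "Er kam am 1. November. Dr. Schiwago."
sentences = SENTENCE.findall(text)
print(sentences)  # ['Er kam am 1. November.', 'Dr. Schiwago.']
```

The "1." is consumed by the first alternative and "Dr." by the second, so neither period ends a sentence.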
Try (regex101)
[A-Z].*?(?<!Dr)(?<!\d)\s*[.!?]
[A-Z] - start with capital letter
.*? - non-greedy match-all
(?<!Dr)(?<!\d)\s*[.!?] - we match until ., ! or ?, there must not be Dr or digit before it.
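With this lookbehind approach, each further abbreviation needs its own negative lookbehind. The sketch below shows the pattern as posted next to an assumed extension with (?<!Mr). The sample sentence and the names POSTED and EXTENDED are made up for illustration.

```python
import re

# The pattern exactly as posted: only "Dr" and digits are excluded.
POSTED = re.compile(r"[A-Z].*?(?<!Dr)(?<!\d)\s*[.!?]")
# Assumed extension: one more lookbehind so "Mr." no longer ends a sentence.
EXTENDED = re.compile(r"[A-Z].*?(?<!Dr)(?<!Mr)(?<!\d)\s*[.!?]")

text = "Mr. Smith kam am 1. November. Dr. Schiwago."
print(POSTED.findall(text))
print(EXTENDED.findall(text))
```

The posted pattern splits off "Mr." as its own sentence here, while the extended one keeps "Mr. Smith kam am 1. November." together.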

How To Use a Regex Capture Result to Lookbehind

I am trying to use the result of the capture group to perform a look behind for a specific answer.
Sample of Text:
10) Once a strategy has been formulated and implemented, it is important that the firm sticks to it no matter what happens.
Answer: FALSE
11) Which of the following strategies does Tesla need to implement or achieve to gain a competitive advantage?
A) imitate the features of the most popular SUVs on the market
B) reinvest profits to build successively better electric automobiles
C) sell advertising space on their cars' digital displays
D) substitute less-expensive components to keep costs low
Answer: B
Current Output:
https://regex101.com/r/bLKmYX/1
It is currently outputting FALSE and B as the answers to these questions.
Expected Output
I would like it to output FALSE and B) reinvest profits to build successively better electric automobiles
Current Regex Expression
'^\d+\)\s*([\s\S]*?)\nAnswer:\s*(.*)'
How can I use the result of the second capture group, (B), to perform a lookbehind and get the whole answer?
What you ask for is not possible due to the fact that a captured value can only be checked after it was obtained.
You may try another logic: capture the answer letter and then match the same letter after Answer: substring using the backreference to the group value.
You may consider a pattern like
(?m)^\d+\)\s*((?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)?[\s\S]*?)\nAnswer:\s*(\3|FALSE)
See the regex demo.
It has 4 capturing groups now, the first one containing the whole question body, then the second one containing the answer line you need, the third one is auxiliary (it is used to check which answer is correct), and the fourth one is the answer value.
Details
(?m) - ^ now matches line start positions and $ matches line end positions
^ - start of a line
\d+ - 1+ digits
\) - a ) char
\s* - 0+ whitespaces
((?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)?[\s\S]*?) - Group 1:
(?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)? - an optional non-capturing group matching
(?:(?!^\d+\))[\s\S])*? - any char, 0 or more occurrences but as few as possible, as long as the char does not begin a line that starts with 1+ digits and a ) (that is, the next numbered question)
\n - a newline
(([A-Z])\).*) - Group 2: an ASCII uppercase letter captured into Group 3, then ) char and then the rest of the line (.*)
$ - end of line
[\s\S]*? - any 0+ chars as few as possible
\nAnswer: - a new line, Answer: string
\s* - 0+ whitespaces
(\3|FALSE) - Group 4: Group 3 value or FALSE.
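A minimal Python sketch of how the groups might be consumed, using the sample text from the question. It assumes Python's re module, where a backreference to a group that did not take part in the match simply fails, so the FALSE alternative is used for questions without lettered options. The names QUESTION and answers are illustrative.

```python
import re

# The pattern from the answer, unchanged.
QUESTION = re.compile(
    r"(?m)^\d+\)\s*((?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)?[\s\S]*?)"
    r"\nAnswer:\s*(\3|FALSE)"
)

text = "\n".join([
    "10) Once a strategy has been formulated and implemented, it is important "
    "that the firm sticks to it no matter what happens.",
    "Answer: FALSE",
    "11) Which of the following strategies does Tesla need to implement or "
    "achieve to gain a competitive advantage?",
    "A) imitate the features of the most popular SUVs on the market",
    "B) reinvest profits to build successively better electric automobiles",
    "C) sell advertising space on their cars' digital displays",
    "D) substitute less-expensive components to keep costs low",
    "Answer: B",
])

# Group 2 is the full option line when one was matched; otherwise fall
# back to Group 4, which then holds FALSE.
answers = [m.group(2) or m.group(4) for m in QUESTION.finditer(text)]
print(answers)
```

For question 11 the engine first tries option A, fails at the backreference, and backtracks until the option letter captured in Group 3 equals the letter after Answer:.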

How to capture 2 different patterns one from the beginning and other from the end of the string using Regular Expression?

I need to capture two different patterns one from the beginning of the string and the other from the end.
I am using Python3.
Example 1:
string: 'TRADE ACCOUNT BALANCE FROM 2 TRADE LINES CALL. .... $ 23,700'
expected_output: TRADE ACCOUNT BALANCE 23,700
my_regex_pattern: r'(TRADE ACCOUNT BALANCE).+([\d,]+)'
output(group 0): TRADE ACCOUNT BALANCE
output(group 1): 0
Example 2:
string: 'AVERAGE BALANCE IN THE PAST 5 QUARTERS ......... $ 26,460'
output: AVERAGE BALANCE 26,460
my_regex_pattern: r'(AVERAGE BALANCE).+([\d,]+)'
output(group 0): AVERAGE BALANCE
output(group 1): 0
The substring, in the end, will always be a number. The substring, in the beginning, will always be a word
I do not understand why it is capturing just the last character from the end.
The .+ in your pattern is greedy: it first grabs the rest of the string and then backtracks one character at a time until [\d,]+ can match. Since [\d,]+ needs only a single character, backtracking stops at the final 0, and the match succeeds with just 0 in the second group.
What you need to do in this situation is to find where to "anchor" the second group start.
In the strings you provided, there is a dollar symbol before the number. So, you may use
(TRADE ACCOUNT BALANCE).*\$\s*(\d[\d,]*)
See the regex demo.
Details
(TRADE ACCOUNT BALANCE) - Group 1: a literal substring
.* - any 0+ chars other than line break chars, as many as possible
\$ - a $ char
\s* - 0+ whitespaces
(\d[\d,]*) - Group 2: a digit, and then 0+ digits or commas.
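The difference between the two patterns can be checked with a short Python sketch. Putting both labels into one alternation is an assumption made here so that a single pattern covers both example strings. GREEDY and ANCHORED are illustrative names.

```python
import re

# The question's approach: greedy .+ leaves only the last char for group 2.
GREEDY = re.compile(r"(TRADE ACCOUNT BALANCE|AVERAGE BALANCE).+([\d,]+)")
# The answer's approach: the $ sign anchors where the number starts.
ANCHORED = re.compile(r"(TRADE ACCOUNT BALANCE|AVERAGE BALANCE).*\$\s*(\d[\d,]*)")

samples = [
    "TRADE ACCOUNT BALANCE FROM 2 TRADE LINES CALL. .... $ 23,700",
    "AVERAGE BALANCE IN THE PAST 5 QUARTERS ......... $ 26,460",
]

for s in samples:
    print(GREEDY.search(s).groups(), ANCHORED.search(s).groups())
```

The greedy version yields '0' as the second group for both strings, while the anchored version yields '23,700' and '26,460'.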

Regex to get the word after specific match words

I am trying to pull the dollar amount from some invoices. I need the match to be on the word directly after the word "TOTAL". Also, the word total may sometimes appear with a colon after it (ie Total:). An example text sample is shown below:
4 Discover Credit Purchase - c REF#: 02353R TOTAL: 40.00 AID: 1523Q1Q TC: mzQm 40.00 CHANGE 0.00 TOTAL NUMBER OF ITEMS SOLD = 0 12/23/17 Ql:38piii 414 9 76 1G6 THANK YOU FOR SHOPPING KR08ER Now Hiring - Apply Today!
In the case of the sample above, the match should be "40.00".
The Regex statement that I wrote:
(?<=total)([^\n\r]*)
pulls EVERYTHING after the word "total". I only want the very next word.
This (unlike other answers so far) matches only the total amount (ie without needing to examine groups):
((?<=\bTOTAL\b )|(?<=\bTOTAL\b: ))[\d.]+
See live demo matching when input has, and doesn’t have, the colon after TOTAL.
Two lookbehinds (which don't capture input) are needed because in many regex engines a lookbehind cannot have variable length. The optional colon is therefore handled with an alternation (a regex OR via ...|...) of two fixed-length lookbehinds, one with and one without the colon.
If TOTAL can be in any case, add (?i) (the ignore case flag) to the start of the regex.
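As a sketch of how this behaves in Python, whose re module is one of the engines that require fixed-width lookbehinds: the snippet below applies the pattern to a shortened copy of the sample receipt and also checks that a single lookbehind with an optional colon is rejected. The names TOTAL, total_amount and variable_width_ok are illustrative, and re.IGNORECASE stands in for the (?i) prefix.

```python
import re

# The pattern from the answer. The outer group only captures an empty
# string, so the amount is read from the whole match (group 0).
TOTAL = re.compile(r"((?<=\bTOTAL\b )|(?<=\bTOTAL\b: ))[\d.]+", re.IGNORECASE)

def total_amount(text):
    """Return the number right after TOTAL / TOTAL:, or None."""
    m = TOTAL.search(text)
    return m.group(0) if m else None

receipt = ("4 Discover Credit Purchase - c REF#: 02353R TOTAL: 40.00 AID: 1523Q1Q "
           "TC: mzQm 40.00 CHANGE 0.00 TOTAL NUMBER OF ITEMS SOLD = 0")
print(total_amount(receipt), total_amount("Total 12.50"))

# A single lookbehind with an optional colon has variable width,
# which Python's re refuses to compile.
try:
    re.compile(r"(?<=\bTOTAL\b:? )[\d.]+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```

"TOTAL NUMBER" later in the receipt is skipped because the text after the lookbehind must start with a digit or a dot.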
What you could do is match total followed by an optional colon :? and zero or more times a whitespace character \s* and capture in a group one or more digits followed by an optional part that matches a dot and one or more digits.
To match an uppercase or lowercase variant of total, you could make the match case-insensitive, for example by adding the (?i) modifier or by using a case-insensitive flag.
\btotal:?\s*(\d+(?:\.\d+)?)
The value 40.00 will be in group 1.
Explanations are in the regex pattern.
using System;
using System.Text.RegularExpressions;
string str = "4 Discover Credit Purchase - c REF#: 02353R TOTAL: 40.00 AID: 1523Q1Q";
// A verbatim string (@"...") keeps the backslashes and line breaks as written.
string pattern = @"(?ix) # 'i' = case-insensitive, 'x' = ignore pattern whitespace and allow comments
\b # Word boundary
total # 'TOTAL' or 'total' or any other combination of cases
:? # Matches colon if it exists
\s+ # One or more spaces
(\d+\.\d+) # Sought number saved into group
\s # One space";
// The number is in the first group: Groups[1]
Console.WriteLine(Regex.Match(str, pattern).Groups[1].Value);
You can use the regex below to get the amount after TOTAL:
\bTOTAL\b:?\s*([\d.]+)
It will capture the amount in the first group.
Link : https://regex101.com/r/tzze8J/1/
Try this pattern: TOTAL:? ?(\d+\.\d+)[^\d]?. (The dot between the digit runs is escaped so that it matches only a literal decimal point.)

regex with optional capture group

I am trying to get the amount, unit and substance out of a string using a regex. The units and substances come from a predefined list.
So:
"2 kg of water" should return: 2, kg, water
"1 gallon of crude oil" should return: 1, gallon, oil
I can achieve this with the following regex:
(\d*) ?(kg|ml|gallon).*(water|oil)
The problem is that I can't figure out how to make the last capture group optional. If the substance is not in the predefined list, I still want to get the amount and unit. So:
"1 gallon of diesel" should return: 1, gallon or 1, gallon, ''
I have tried wrapping the last group in an optional non capturing group as explained here: Regex with optional capture fields but with no success.
Here is the current regex in the online regex tester: https://regex101.com/r/hV3wQ3/55
You are trying to use (\d+) ?(kg|ml|gallon).*(?:(water|oil))? and there is no way this pattern can capture water / oil. The problem is the .* grabs any 0+ chars other than line break chars up to the end of the string / line, and the (?:(water|oil))? is tried when the regex index is there, at the string end. Since (?:(water|oil))? can match an empty string, it matches the location at the end of the string, and the match is returned.
You may still use the capturing group as obligatory, but wrap the .* and the capturing group with an optional non-capturing group:
(\d+) ?(kg|ml|gallon)(?:.*(water|oil))?
                     ^^^             ^^
See the regex demo
The (?:.*(water|oil))? matches 1 or 0 (greedily) occurrences of any 0+ chars other than line break chars (.*) and then either water or oil.
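A short Python sketch comparing the two patterns on the strings from the question. BROKEN, FIXED and parse are illustrative names. In Python an optional group that did not take part in the match is reported as None.

```python
import re

# Optional group placed after a greedy .* : it can only ever match empty.
BROKEN = re.compile(r"(\d+) ?(kg|ml|gallon).*(?:(water|oil))?")
# .* moved inside the optional group, so water/oil is found when present.
FIXED = re.compile(r"(\d+) ?(kg|ml|gallon)(?:.*(water|oil))?")

def parse(rx, text):
    """Return the tuple of captured groups, or None if there is no match."""
    m = rx.search(text)
    return m.groups() if m else None

for text in ("2 kg of water", "1 gallon of crude oil", "1 gallon of diesel"):
    print(text, parse(BROKEN, text), parse(FIXED, text))
```

With FIXED the third group is 'water', 'oil' and None respectively, whereas BROKEN always leaves it as None.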