Regex to get the word after specific match words - regex

I am trying to pull the dollar amount from some invoices. I need the match to be the word directly after the word "TOTAL". The word may sometimes appear with a colon after it (i.e. "Total:"). An example text sample is shown below:
4 Discover Credit Purchase - c REF#: 02353R TOTAL: 40.00 AID: 1523Q1Q TC: mzQm 40.00 CHANGE 0.00 TOTAL NUMBER OF ITEMS SOLD = 0 12/23/17 Ql:38piii 414 9 76 1G6 THANK YOU FOR SHOPPING KR08ER Now Hiring - Apply Today!
In the case of the sample above, the match should be "40.00".
The Regex statement that I wrote:
(?<=total)([^\n\r]*)
pulls EVERYTHING after the word "total". I only want the very next word.

This (unlike the other answers so far) matches only the total amount, i.e. without needing to examine groups:
((?<=\bTOTAL\b )|(?<=\bTOTAL\b: ))[\d.]+
See live demo matching when input has, and doesn’t have, the colon after TOTAL.
Two lookbehinds (which don't consume input) are needed because, in many regex flavors, a lookbehind can't have variable length. The optional colon is handled by an alternation (a regex OR via ...|...) of the two lookbehinds, one with and one without the colon.
If TOTAL can be in any case, add (?i) (the ignore case flag) to the start of the regex.
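A minimal Python sketch of the two-lookbehind pattern above, run against shortened versions of the question's sample text. Python's re module also requires fixed-width lookbehinds, so the alternation is needed there as well. The variable names are illustrative only.

```python
import re

# Two fixed-width lookbehinds joined by alternation:
# one for "TOTAL " and one for "TOTAL: ".
TOTAL_RE = re.compile(r"((?<=\bTOTAL\b )|(?<=\bTOTAL\b: ))[\d.]+")

with_colon = "REF#: 02353R TOTAL: 40.00 AID: 1523Q1Q CHANGE 0.00 TOTAL NUMBER OF ITEMS SOLD = 0"
without_colon = "REF#: 02353R TOTAL 40.00 AID: 1523Q1Q"

for text in (with_colon, without_colon):
    match = TOTAL_RE.search(text)
    # group(0) is the whole match, which is just the amount
    print(match.group(0) if match else None)
```

"TOTAL NUMBER" is skipped because the character after the space is not a digit or a dot.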

What you could do is match total followed by an optional colon :? and zero or more whitespace characters \s*, then capture in a group one or more digits followed by an optional part that matches a dot and one or more digits.
To match an upper- or lowercase variant of total, you could make the match case-insensitive, for example by adding the modifier (?i) or by using a case-insensitive flag.
\btotal:?\s*(\d+(?:\.\d+)?)
The value 40.00 will be in group 1.
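As a rough illustration, here is the same pattern in Python, assuming the re.IGNORECASE flag as the case-insensitive option. The sample string is a shortened version of the one in the question.

```python
import re

text = ("4 Discover Credit Purchase - c REF#: 02353R TOTAL: 40.00 "
        "AID: 1523Q1Q TOTAL NUMBER OF ITEMS SOLD = 0")

# re.IGNORECASE plays the role of the (?i) modifier mentioned above.
pattern = re.compile(r"\btotal:?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)

# With a single capturing group, findall returns the contents of group 1.
amounts = pattern.findall(text)
print(amounts)
```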

Explanations are in the regex pattern.
string str = "4 Discover Credit Purchase - c REF#: 02353R TOTAL: 40.00 AID: 1523Q1Q";
string pattern = @"(?ix) # 'i': case-insensitive; 'x': ignore pattern whitespace, allow comments
\b # Word boundary
total # 'TOTAL' or 'total' or any other combination of cases
:? # Matches colon if it exists
\s+ # One or more whitespace characters
(\d+\.\d+) # Sought number saved into group
\s # One whitespace character";
// The number is in the first group: Groups[1]
Console.WriteLine(Regex.Match(str, pattern).Groups[1].Value);

You can use the regex below to get the amount after TOTAL:
\bTOTAL\b:?\s*([\d.]+)
It will capture the amount in the first group.
Link : https://regex101.com/r/tzze8J/1/

Try this pattern: TOTAL:? ?(\d+\.\d+)[^\d]?.

Related

How to reduce specific Regex expression to 50 characters or less

I am trying to match strings where there are two or more of the following words: Strength, Intelligence and Dexterity, with a value of 45 or higher. This is an example of a string that would return a match:
+51 to Strength
+47 to Intelligence
+79 to maximum Life
+73 to maximum Mana
28% increased Rarity of Items found
+37% to Cold Resistance
The regex is to be entered in a game (Path of Exile). The regex string can be a maximum of 50 characters.
The fourth bird has found a solution, but the string is more than 50 characters:
\b[45][0-9] to (?:Str|Int|Dex)[\s\S]*?\b[45][0-9] to (?:Str|Int|Dex).
Is there a way to find a similar expression, but with 50 characters or less?
Thanks in advance!
You can shorten it using a group and repeat that group with a quantifier, and write [0-9] as \d for example:
^(?:[\s\S]*?\b[45]\d to (?:Str|Int|Dex)){2}
The pattern matches:
^ - start of string
(?: - non-capture group
[\s\S]*? - match any char, as few as possible
\b[45]\d to (?:Str|Int|Dex) - match 4 or 5 followed by a digit, " to " and one of Str, Int, Dex
){2} - close the non-capture group and repeat it 2 times
Regex demo
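A quick way to sanity-check the shortened pattern is sketched below in Python. This assumes the game's regex engine treats the pattern the same way Python's re module does, which is not confirmed here. Note that [45]\d covers 40 to 59, so values of 40 to 44 also match and values of 60 or more do not.

```python
import re

PATTERN = r"^(?:[\s\S]*?\b[45]\d to (?:Str|Int|Dex)){2}"

# Two qualifying stats: expected to match.
two_stats = "+51 to Strength\n+47 to Intelligence\n+79 to maximum Life"
# Only one qualifying stat: expected not to match.
one_stat = "+51 to Strength\n+30 to Intelligence\n+79 to maximum Life"

print(len(PATTERN) <= 50)                   # fits the length limit?
print(bool(re.search(PATTERN, two_stats)))
print(bool(re.search(PATTERN, one_stat)))
```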

Negate a character group to replace all other characters

I have the following string:
"Thu Dec 31 22:00:00 UYST 2009"
I want to replace everything except for the hours and minutes so I get the following result:
"22:00"
I am using this regex :
(^([0-9][0-9]:[0-9][0-9]))
But it's not matching anything.
This would be my line of actual code :
println("Thu Dec 31 22:00:00 UYST 2009".replace("(^([0-9][0-9]:[0-9][0-9]))".toRegex(),""))
Can someone help me to correct the regex?
The reason the one you have isn't working is that you are asserting that the line starts right before the hours and minutes, which isn't the case. This can be fixed by removing the assertion (^).
If you need the assertion to remain, there is another way. In most languages, you wouldn't be able to use a variable-length positive lookbehind here, but lucky for you, it looks like you can in Kotlin.
A positive lookbehind basically tells the pattern "this comes before what I'm looking for". It's denoted by a group beginning with ?<=. In this case, you can use something like (?<=^[\w ]+). This will match all word characters or spaces between the beginning of the line and the point where the pattern that follows it is able to match. Prepending it to your expression would look something like (?<=^[\w ]+)([0-9][0-9]:[0-9][0-9]) (note that inside a regular Kotlin string literal you will have to escape the backslash in \w, i.e. write \\w).
Side note, Yogesh_D is correct in saying that \d\d:\d\d is the same as your [0-9][0-9]:[0-9][0-9]. Using this, it would look more like (?<=^[\w ]+)\d\d:\d\d.
You may use various solutions, here are two:
val text = """Thu Dec 31 22:00:00 UYST 2009"""
val match = """\b(?:0?[1-9]|1\d|2[0-3]):[0-5]\d\b""".toRegex().find(text)
println(match?.value)
val match2 = """\b(\d{1,2}:\d{2}):\d{2}\b""".toRegex().find(text)
println(match2?.groupValues?.getOrNull(1))
Both return 22:00. See regex #1 demo and regex #2 demo.
The regex complexity should be selected based on how messy the input string is.
Details
\b - a word boundary
(?:0?[1-9]|1\d|2[0-3]) - an optional zero and then a non-zero digit, or 1 and any digit, or 2 and a digit from 0 to 3
: - a : char
[0-5]\d - 0, 1, 2, 3, 4 or 5 and then any one digit
\b - a word boundary.
If there is a match with this regex, you get it as a whole match, so you can access it via match?.value.
If you do not have to worry about any pre-validation when matching, you may simply match 3 colon-separated digit pairs and capture the first two, see the second regex:
\b - a word boundary
(\d{1,2}:\d{2}) - Group 1: one or two digits, : and two digits
:\d{2} - a : and two digits (not captured)
\b - a word boundary.
If there is a match, we need Group 1 value, hence match2?.groupValues?.getOrNull(1) is used.
I am not sure what language you are using, but why use negation when you can directly match the first digits in the hh:mm format?
This assumes that the date string always contains an hh:mm part.
With this regex, the first group should match the hh:mm.
https://regex101.com/r/aHdehZ/1
The regex to use is (\d\d:\d\d)
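For illustration, a small Python sketch of matching the time directly instead of replacing everything else. The asker's code is Kotlin, so this only demonstrates the regex, not the exact API they would call.

```python
import re

text = "Thu Dec 31 22:00:00 UYST 2009"

# search() returns the first occurrence, which is the hh:mm part of hh:mm:ss.
match = re.search(r"(\d\d:\d\d)", text)
hours_minutes = match.group(1) if match else None
print(hours_minutes)
```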

Regex to extract Credit Card information

I have some data from which I only want to extract card details, i.e. card number, expiry month/year and CVV.
I tried many patterns, but none of them worked for all of the formats. If one format gets matched, then the other won't.
Test Data:
4400634848591837Cvv: 362Expm: 04Expy: 20
4400634848591837:04 20 362
4400634848591837|04/24 362
4400634848591837 0420 362
Regex:
(\d{16})[\/\s:|]*?(\d\d)[\/\s|]*?(\d{2,4})[\/\s|-]*?(\d{3})
This matches the rest of them, but I haven't figured out how to match the first line. I have tried positive and negative lookahead and lookbehind, but it never worked for me. Any help would be great.
The part after the 16 digits for the first line has a different format, and the order of the values is also different.
You can use an alternation | with 3 groups to get the values for the Cvv part.
Note that you don't have to make the character classes such as [\/\s|-]*? non-greedy with the ?, because none of the characters in the class can match the digits that follow.
\b(\d{16})(?:[\/\s:|]*(\d\d)[\/\s|]*(\d{2,4})[\/\s|-]*(\d{3})|Cvv:\s*(\d{3})Expm:\s*(\d\d)Expy:\s*(\d\d))\b
\b A word boundary to prevent a partial match
(\d{16})(?:[\/\s:|]*(\d\d)[\/\s|]*(\d{2,4})[\/\s|-]*(\d{3}) The pattern for the last 3 lines
| Or
Cvv:\s*(\d{3})Expm:\s*(\d\d)Expy:\s*(\d\d)) The pattern for the first line, matching the texts in the line followed by capturing the digits in 3 groups
\b A word boundary
Regex demo
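A sketch of how the alternation could be consumed in Python. Because the two branches use different group numbers, the values have to be picked from whichever branch matched. The helper name parse_card is hypothetical and exists only for this example.

```python
import re

CARD_RE = re.compile(
    r"\b(\d{16})(?:[\/\s:|]*(\d\d)[\/\s|]*(\d{2,4})[\/\s|-]*(\d{3})"
    r"|Cvv:\s*(\d{3})Expm:\s*(\d\d)Expy:\s*(\d\d))\b"
)

def parse_card(line):
    """Return (number, month, year, cvv), or None if the line does not match."""
    m = CARD_RE.search(line)
    if m is None:
        return None
    number, month, year, cvv, cvv_alt, month_alt, year_alt = m.groups()
    if month is None:
        # The "Cvv: ... Expm: ... Expy: ..." branch matched (groups 5 to 7).
        month, year, cvv = month_alt, year_alt, cvv_alt
    return number, month, year, cvv

lines = [
    "4400634848591837Cvv: 362Expm: 04Expy: 20",
    "4400634848591837:04 20 362",
    "4400634848591837|04/24 362",
    "4400634848591837 0420 362",
]
results = [parse_card(line) for line in lines]
for result in results:
    print(result)
```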

How to write If conditions in regex?

I am trying to capture the amount from the following string:
Delivery Charge $2
Promo - (FIRST) ($4)
$7
New Coins earned $5
Issued on behalf of
.......................
The line "New Coins earned $5" might sometimes be absent. I want to capture the amount paid, which is "7" in this case. I tried \.?\s*\n*([\d.,]+)\s*\n*Issued\s*\n*on, but this only captures the amount if "New Coins earned $5" is not present in the document. I read about if-else conditions and positive lookahead, but couldn't get them working. Any suggestions on how to capture it?
Since the value you need is always preceded with $ on a separate line you may use
\$(\d[\d,.]*)[\n\r]+(?:.*[\r\n]+){0,2}Issued\s+on\b
The value you need is in Group 1.
Details
\$ - a $ char
(\d[\d,.]*) - Group 1: a digit followed with any 0+ digits, , or . chars
[\n\r]+ - 1 or more CR or LF symbols
(?:.*[\r\n]+){0,2} - 0, 1 or 2 repetitions of 0+ chars other than linebreak chars followed with 1+ LF/CR symbols
Issued\s+on\b - Issued, 1+ whitespaces, on as a whole word (as \b is a word boundary).
See the regex demo.
Python demo:
import re
rx = r"\$(\d[\d,.]*)[\n\r]+(?:.*[\r\n]+){0,2}Issued\s+on\b"
s = "Delivery Charge $2\nPromo - (FIRST) ($4)\n$1,000.55\nNew Coins earned $5\nIssued on behalf of ......................."
match = re.search(rx, s, re.M)
if match:
    print(match.group(1)) # -> 1,000.55
In regex flavors that support conditionals (PCRE, for example), you can do it like (?(?=regex)then|else), but note that (?=...) is a lookahead and has zero length, so your "then" branch must itself match the expression inside the lookahead too.
You also can make more complex expressions in a way
(?(?=condition)(then1|then2|then3)|(else1|else2|else3))
Here then1, then2 and then3 are sorted in descending priority order, because the first alternative that matches causes the others to be skipped.
You can look for more info here

Limit number of character of capturing group

Let's say I have this text: "AAAA1 AAA11 AA111AA A1111 AAAAA AAAA1111".
I want to find all occurrences matching these 3 criteria:
-Capital letter 1 to 4 times
-Digit 1 to 4 times
-Max number of characters to be 5
so the matches would be :
{"AAAA1", "AAA11", "AA111", "A1111", "AAAA1"}
I tried
([A-Z]{1,4}[0-9]{1,4}){5}
but I knew it would fail, since it looks for my group five times.
Is there a way to limit result of the groups to 5 characters?
Thanks
You can limit the character count with a lookahead while checking the pattern with your matching part.
If you can split the input by whitespace you can use:
^(?=.{2,5}$)[A-Z]{1,4}[0-9]{1,4}$
See demo here.
If you cannot split by whitespace you can use capturing group with (?:^| )(?=.{2,5}(?=$| ))([A-Z]{1,4}[0-9]{1,4})(?=$| ) for example, or lookbehind or \K to do the split depending on your regex flavor (see demo).
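Both variants can be tried in Python as sketched below (variable names are illustrative). Note that both treat each whitespace-delimited token as a whole, so "AA111AA" and "AAAA1111" are rejected rather than truncated to their first five characters.

```python
import re

text = "AAAA1 AAA11 AA111AA A1111 AAAAA AAAA1111"

# Variant 1: split on whitespace, then validate each token on its own.
TOKEN_RE = re.compile(r"^(?=.{2,5}$)[A-Z]{1,4}[0-9]{1,4}$")
by_split = [tok for tok in text.split() if TOKEN_RE.match(tok)]

# Variant 2: no split; the token is captured in group 1.
INLINE_RE = re.compile(r"(?:^| )(?=.{2,5}(?=$| ))([A-Z]{1,4}[0-9]{1,4})(?=$| )")
by_findall = INLINE_RE.findall(text)

print(by_split)
print(by_findall)
```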
PREVIOUS ANSWER, wrongly matches A1A1A, updated after @a_guest's remark.
You can use a lookahead to check for your pattern, while limiting the character count with the matching part of the regex:
(?=[A-Z]{1,4}[0-9]{1,4}).{2,5}
See demo here.