Regex for receipt items - regex

I have a simple receipt which I typed out. I need to be able to read the items purchased on the receipt. The sample receipt is below.
Tim Hortons
Alwasy Fresh
1 Brek Wrap Combo /A ($0.76)
1 Bacon-wrap $3.79
1 Grilled $0.00
1 5 Pieces Bacon-wrap $0.00
1 Orange $1.40
1 Deposit $0.10
Subtotal: $55.84
GST: $0.29
Debit: $55.84
Take out
Thanks for stopping by!!
Tell us how we did
I came up with the following regex string to find the items.
\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}
It works for the most part but there are a few incorrect lines like
4
GST: $0.29
Can someone come up with a better pattern. Below is a link to see it in action.
http://regexr.com/3cnk9

I see a number of problems with this original regex:
\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}
First, parentheses both group and match, though when you quantify your match, only the last iteration is captured, so matching like (.)* will only store the last character; you wanted (.*) for that. Since it's greedy, that will be the character before the space preceding a dollar sign, which given your data will always be a space. Similarly, you're quantifying a group at the beginning with (\s){1,10}, which captures only the last whitespace character. In this case, you don't need the group since \s is a single space character, so you can simply use \s{1,10}.
Here is a piece-by-piece explanation of what that regular expression does.
Capturing solution
The following regex captures the quantity ($1), item description ($2), whether the price is parenthesized ($3), and the price ($4):
^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$
Explained and matched to your sample at regex101.
Separated out and commented (assumes the /x flag is supported):
/ # begin regex
^\s* # start of line, ignore leading spaces if present
(\d+) # $1 = quantity
\s+ # spacing as a delimiter
(.*\S) # $2 = item: contains anything, must end in a non-space char
\s+ # spacing as a delimiter
(\(?) # $3 = negation, an optional open parenthesis
\$ # dollar sign
([0-9.]+) # $4 = price
\)?\s*$ # trailing characters: optional end-paren and space(s)
/x # end regex, multi-line regex flag
with sample perl code executed from a command line:
perl -ne '
my ($quantity, $item, $neg, $price)
= /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/;
if ($item) {
if ($neg) { $price *= -1; }
print "<$quantity><$item><$price>\n"
}' RECEIPT_FILE
(If you want that as a perl script, wrap the code with while(<>) { } and you're done.)
This assigns the variables $quantity, $item, and $price to the itemized lines on your receipt. I am assuming that a parenthesized item is to be subtracted (but I can't verify that since the totals are nonsensical), so $neg notes the existence of a parenthesis so the $price can be negated.
I set the output to use angle brackets (< and >) to indicate what each variable stores.
The output of your given sample receipt would therefore be:
<1><Brek Wrap Combo /A><-0.76>
<1><Bacon-wrap><3.79>
<1><Grilled><0.00>
<1><5 Pieces Bacon-wrap><0.00>
<1><Orange><1.40>
<1><Deposit><0.10>
Prices only solution
You didn't say what you wanted to match. If you don't care about anything but the prices and there are no negative values, you don't need matchers if you have negative look-behind or \K:
grep -Po '^\s*[0-9].*\$\K[0-9.]+' RECEIPT_FILE
Grep's -P flag invokes libpcre (which may not be available if you're on an old or embedded system) and -o displays only the matching text. \K denotes the start of the match. Put the \$ after the \K if you want to capture it. (See also the regex101 description and matches.)
Output from that grep command:
0.76
3.79
0.00
0.00
1.40
0.10
Prices only – with awk
There aren't great ways to handle this regex with efficiency. If you're processing through a mountain of content, you'll feel the hurt. Here's a solution using awk that should be significantly faster. (The difference won't be noticeable with a small input.)
awk '$1 / 1 > 0 && $NF ~ /\$/ { gsub(/[()]/, "", $0); print $NF; }' RECEIPT_FILE
Commented version with explanation:
awk '
# if the quantity is indeed a number and the last field has a dollar sign
$1 / 1 > 0 && $NF ~ /\$/ {
gsub(/[()]/, "", $NF); # remove all parentheses from the last field
print $NF; # print the contents of the last field
}' RECEIPT_FILE
Prices only – with awk, supporting negative prices
awk '
# if the quantity is indeed a number and the last field has a dollar sign
$1 / 1 > 0 && $NF ~ /\$/ {
neg = 1;
if ( $NF ~ /\(/ ) { # the last field has an open parenthesis
gsub(/[()]/, "", $NF); # remove all parentheses from the last field
neg = -1;
}
print $NF * neg; # print the last field, negated if parenthesized
}' RECEIPT_FILE

Here's my attempt:
^(\d+)\s+(.*)\s+\(?(\$.+)\)?$
Stub. Remember to turn the multiline option on. Components:
^ - beginning of line
(\d+) - capture the quantity at the beginning of each line item
\s+ - one or more space
(.*) - capture the item description
\s+ - one or more space
\(? - optional open bracket `(` character
($.+) - capture anything including and after the dollar sign
\)? - optional close bracket `)` character
$ - end of line

You can use
^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)
See the regex demo
This regex should be used with the /m modifier to match data on different lines. In JS, the /g modifier is also required.
Explanation:
^ - start of a line
(\d+) - Group 1 capturing one or more digits
\s+ - one or more whitespaces
(.*?) - Group 2 capturing zero or more any characters but a newline up to the closest
\s+ - one or more whitespaces
\(? - an optional ( (on the first line)
\$ - a literal $
(\d+\.\d+) - Group 3 capturing one or more digits followed with . and one or more digits.
JS demo:
var re = /^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)/gm;
var str = ' Tim Hortons\n Alwasy Fresh\n\n1 Brek Wrap Combo /A ($0.76)\n1 Bacon-wrap $3.79\n1 Grilled $0.00\n1 5 Pieces Bacon-wrap $0.00\n1 Orange $1.40\n1 Deposit $0.10\nSubtotal: $55.84\nGST: $0.29\nDebit: $55.84\nTake out\n\n Thanks for stopping by!!\n Tell us how we did';
while ((m = re.exec(str)) !== null) {
document.body.innerHTML += "Pcs: <b>" + m[1] + "</b>, item: <b>" + m[2] + "</b>, paid: <b>" + m[3] + "</b><br/>";
}

Adam Katz's answer should be the accepted one! I used this variation of his answer for an implementation in JavaScript:
const receiptRegex = /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/gm
let items = [];
const matches = inputStr.matchAll(receiptRegex);
for (const matchedGroup of matches) {
const [
fullString, //[0] -> matched string "1 Blue gatorade $2.00"
quantity, //[1] -> quantity "1"
item, //[2] -> item description "Blue gatorade"
ignoredSymbol, //[3] -> "$" (should probably always ignore)
price //[4] -> amount "2.00"
] = matchedGroup;
items.push({
quantity,
item,
price,
});
}

Related

Powershell: Can't Get RegEx to work on multiple lines

I am getting notes from a ticket that come in the form of:
[Employee ID]:
[First Name]: Test
[Last Name]: User
[Middle Initial]:
[Email]:
[Phone]:
[* Last 4 of SSN]: 1234
I've tried the following code to get the first name (in this example it would be 'Test':
if ($incNotes -match '(^\[First Name\]:)(. * ?$)')
{
Write-Host $_.matches.groups[0].value
Write-Host $_.matches.groups[1].value
}
But I get nothing. Is there a way I can use just one long regex pattern to get the information I need? The information stays in the same format on every ticket that comes through.
How would I get the information after the [First Name]: and so on....
You can use
if ($incNotes -match '(?m)^\[First Name]: *(\S+)') {
Write-Host $matches[1]
}
See the regex demo. If you can have any kind of horizontal whitespace chars between : and the name, replace the space with [\p{Zs}\t], or some kind of [\s-[\r\n]].
Details:
(?m) - a RegexOptions.Multiline option that makes ^ match start of any line position, and $ match end of lines
^ - start of a line
\[First Name]: - a [First Name]: string
* - zero or more spaces
(\S+) - Capturing group 1: one or more non-whitespace chars (replace with \S.* or \S[^\n\r]* to match any text till end of string).
Note that -match is a case insensitive regex matching operator, use -cmatch if you need a case sensitive behavior. Also, it only finds the first match and $matches[1] returns the Group 1 value.

Pyspark - Regex - Extract value from last brackets

I created the following regular expression with the idea of extracting the last element in brackets. See that if I only have one parenthesis it works fine, but if I have 2 parenthesis it extracts the first one (which is a mistake) or extract with the brackets .
Do you know how to solve it?
tmp= spark.createDataFrame(
[
(1, 'foo (123) oiashdj (hi)'),
(2, 'bar oiashdj (hi)'),
],
['id', 'txt']
)
tmp = tmp.withColumn("old", regexp_extract(col("txt"), "(?<=\().+?(?=\))", 0));
tmp = tmp.withColumn("new", regexp_extract(col("txt"), "\(([^)]+)\)?$", 0));
tmp.show()
+---+--------------------+---+----+
| id| txt|old| new| needed
+---+--------------------+---+----+
| 1|foo (123) oiashdj...|123|(hi)| hi
| 2| bar oiashdj (hi)| hi|(hi)| hi
+---+--------------------+---+----+
To extract the substring between parentheses with no other parentheses inside at the end of the string you may use
tmp = tmp.withColumn("new", regexp_extract(col("txt"), r"\(([^()]+)\)$", 1));
Details
\( - matches (
([^()]+) - captures into Group 1 any 1+ chars other than ( and )
\) - a ) char
$ - at the end of the string.
The 1 argument tells the regexp_extract to extract Group 1 value.
See the regex demo online.
NOTE: To allow trailing whitespace, add \s* right before $: r"\(([^()]+)\)\s*$"
NOTE2: To match the last occurrence of such a substring in a longer string, with exactly the same code as above, use
r"(?s).*\(([^()]+)\)"
The .* will grab all the text up to the end, and then backtracking will do the job.
This should work. Use it with the single line flag.
\([^\(\)]*?\)(?!.*\([^\(\)]*?\))
https://regex101.com/r/Qrnlf3/1

Regex for preg_match_all prices

I need to set Regdex for these expressions:
number€
number €
€number
€ number
Need to Output: number €
number$
number $
$number
$ number
Need to Output: number $
Do i have to write one regex for each one? Which is the easy way?
Example for number€:
preg_match_all("/((?:[0-9]*[.,])?[0-9]+)\p{Sc}/u", $post->post_title, $percentage, PREG_SET_ORDER);
if(isset($percentage[0][0]) && $percentage[0][0] != "" ){
$text = $percentage[0][1]." €";
Thanks
It seems you may use one regex with a preg_replace_callback:
/(?J)(?<num>[0-9]*[.,]?[0-9]+)\s*(?<cur>\p{Sc})|(?<cur>\p{Sc})\s*(?<num>[0-9]*[.,]?[0-9]+)/u
See the regex demo and a PHP demo:
$re = '/(?J)(?<num>[0-9]*[.,]?[0-9]+)\s*(?<cur>\p{Sc})|(?<cur>\p{Sc})\s*(?<num>[0-9]*[.,]?[0-9]+)/u';
$str = '12€';
echo preg_replace_callback($re, function($m) {
return $m["num"] . " " . $m["cur"];
}, $str);
Or, to extract the values with the number currency order, use the same regex and the following code:
$res = array();
preg_replace_callback($re, function($m) use (&$res) {
array_push($res, $m["num"] . " " . $m["cur"]);
return $m[0];
}, $str);
print_r($res);
See another PHP demo.
Pattern details:
(?J) - PCRE_INFO_JCHANGED modifier so that the identically matched groups could be used multiple times in the pattern
(?<num>[0-9]*[.,]?[0-9]+) - Group "num" capturing a float/integer
\s* - 0+ whitespaces
(?<cur>\p{Sc}) - Group "cur" capturing the currency symbol
| - or (then goes the same as above in different order)
(?<cur>\p{Sc})\s*(?<num>[0-9]* [.,]?[0-9]+) - currency symbol ("cur" group), 0+ whitespaces, a float/integer (in "num" group).

Regex - Extracting a number when preceeded OR followed by a currency sign

if (preg_match_all('((([£€$¥](([ 0-9]([0-9])*)((\.|\,)(\d{2}|\d{1}))|([ 0-9]([0-9])*)))|(([0-9]([0-9])*)((\.|\,)(\d{2}|\d{1})(\s{0}|\s{1}))|([0-9]([0-9])*(\s{0}|\s{1})))[£€$¥]))', $Commande, $matches)) {
$tot1 = $matches[0];
This is my tested solution.
It works for all 4 currencies when sign is placed before or after, with or without a space in between.
It works with a dot or a comma for decimals.
It works without decimal, or with just 1 number after the dot or comma.
It extracts several amounts in the same string in a mix of formats declined above as long as there is a space in between.
I think it covers everything, although I am sure it can be simplified.
It was Needed for an international order form where clients enter the amounts themselves as well as the description in the same field.
You can use a conditional:
if (preg_match_all('~(\$ ?)?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?(?:[pcm]|bn|[mb]illion)?(?(1)| ?\$)~i', $order, $matches)) {
$tot = $matches[0];
}
Explanation:
I put the currency in the first capturing group: (\$ ?) and I make it optional with a ?
At the end of the pattern, I use an if then else:
(?(1) # if the first capturing group exist
# then match nothing
| # else
[ ]?\$ # matches the currency
) # end of the conditional
You should check for optional $ at the end of amount:
\$? ?(\d[\d ,]*(?:\.\d{1,2})?|\d[\d,](?:\.\d{2})?) ?\$?(?:[pcm]|bn|[mb]illion)
Live demo

How to Capture Only Surnames from a Regex Pattern?

Team
I have written a Perl program to validate the accuracy of formatting (punctuation and the like) of surnames, forenames, and years.
If a particular entry doesn't follow a specified pattern, that entry is highlighted to be fixed.
For example, my input file has lines of similar text:
<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My programs works just fine, that is, if any entry doesn't follow the pattern, the script generates an error. The above input text doesn't generate any error. But the one below is an example of an error because Rose A. J. is missing a comma after Rose:
NOT FOUND: <bibliomixed id="bkrmbib120">Asher, S. R., & Rose A. J. (1997). Promoting children’s social-emotional adjustment with peers. In P. Salovey & D. Sluyter, (Eds). <emphasis>Emotional development and emotional intelligence: Educational implications.</emphasis> New York: Basic Books.</bibliomixed>
From my regex search pattern, is it possible to capture all the surnames and the year, so I can generate a text prefixed to each line as shown below?
<BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB><bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My regex search script is as follows:
while(<$INPUT_REF_XML_FH>){
$line_count += 1;
chomp;
if(/
# bibliomixed XML ID tag and attribute----<START>
<bibliomixed
\s+
id=".*?">
# bibliomixed XML ID tag and attribute----<END>
# --------2 OR MORE AUTHOR GROUP--------<START>
(?:
(?:
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
,\s)+
#---------------FINAL AUTHOR GROUP SEPATOR----<START>
&\s
#---------------FINAL AUTHOR GROUP SEPATOR----<END>
# --------2 OR MORE AUTHOR GROUP--------<END>
)?
# --------LAST AUTHOR GROUP--------<START>
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
(?: # pattern for editor notation----<START>
\s\(Ed(?:s)?\.\)\.
)? # pattern for editor notation----<END>
# --------LAST AUTHOR GROUP--------<END>
\s
\(
# pattern for a year----<START>
(?:[A-Za-z]+,\s)? # July, 1999
(?:[A-Za-z]+\s)? # July 1999
(?:[0-9]{4}\/)? # 1999\/2000
(?:\w+\s\d+,\s)?# August 18, 2003
(?:[0-9]{4}|in\spress|manuscript\sin\spreparation) # (1999) (in press) (manuscript in preparation)----REQUIRED
(?:[A-Za-z])? # 1999a
(?:,\s[A-Za-z]+\s[0-9]+)? # 1999, July 2
(?:,\s[A-Za-z]+\s[0-9]+\x{2013}[0-9]+)? # 2002, June 19–25
(?:,\s[A-Za-z]+)? # 1999, Spring
(?:,\s[A-Za-z]+\/[A-Za-z]+)? # 1999, Spring\/Winter
(?:,\s[A-Za-z]+-[A-Za-z]+)? # 2003, Mid-Winter
(?:,\s[A-Za-z]+\s[A-Za-z]+)? # 2007, Anniversary Issue
# pattern for a year----<END>
\)\.
/six){
print $FOUND_REPORT_FH "$line_count\tFOUND: $&\n";
$found_count += 1;
} else{
print $ERROR_REPORT_FH "$line_count\tNOT FOUND: $_\n";
$not_found_count += 1;
}
Thanks for your help,
Prem
Alter this bit
# pattern for surname----<END>
,?\s
This now means an optional , followed by white space. If the Persons surname is "Bunga Bunga" it won't work
All of your subpatterns are non-capturing groups, starting with (?:. This reduces compilation times by a number of factors, one of which being that the subpattern is not captured.
To capture a pattern you merely need to place parenthesis around the part you require to capture. So you could remove the non-capturing assertion ?: or place parens () where you need them. http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
I'm not sure but, from your code I think you may be attempting to use lookahead assertions as, for example, you test for surnames with spaces, if none then test for surnames with hyphens. This will not start from the same point every time, it will either match the first example or not, then move forward to test the next position with the second surname pattern, whether the regex will then test the second name for the first subpattern is what I am unsure of. http://perldoc.perl.org/perlretut.html#Looking-ahead-and-looking-behind
#!usr/bin/perl
use warnings;
use strict;
my $line = '123 456 7antelope89';
$line =~ /^(\d+\s\d+\s)?(\d+\w+\d+)?/;
my ($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7bealzelope89';
$line =~ /(?:\d+\s\d+\s)?(?:\d+\w+\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7canteloupe89';
$line =~ /((?:\d+\s\d+\s))?(?:\d+(\w+)\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
exit 0;
For capturing the whole pattern the first pattern of the third example does not make sense, as this tells the regex to not capture the pattern group while also capturing the pattern group. Where this is useful is in the second pattern which is a fine grained pattern capture, in that the pattern captured is part of a non-capturing group.
a: 123 456 b: 7antelope89
a: nocapture b: nocapture
a: 123 456 b: canteloupe
One little nitpic
id=".*?"
may be better as
id="\w*?"
id names requiring to be _alphanumeric iirc.