Regex - Extracting a number when preceeded OR followed by a currency sign - regex

if (preg_match_all('((([£€$¥](([ 0-9]([0-9])*)((\.|\,)(\d{2}|\d{1}))|([ 0-9]([0-9])*)))|(([0-9]([0-9])*)((\.|\,)(\d{2}|\d{1})(\s{0}|\s{1}))|([0-9]([0-9])*(\s{0}|\s{1})))[£€$¥]))', $Commande, $matches)) {
$tot1 = $matches[0];
This is my tested solution.
It works for all 4 currencies when sign is placed before or after, with or without a space in between.
It works with a dot or a comma for decimals.
It works without decimal, or with just 1 number after the dot or comma.
It extracts several amounts in the same string in a mix of formats declined above as long as there is a space in between.
I think it covers everything, although I am sure it can be simplified.
It was Needed for an international order form where clients enter the amounts themselves as well as the description in the same field.

You can use a conditional:
if (preg_match_all('~(\$ ?)?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?(?:[pcm]|bn|[mb]illion)?(?(1)| ?\$)~i', $order, $matches)) {
$tot = $matches[0];
}
Explanation:
I put the currency in the first capturing group: (\$ ?) and I make it optional with a ?
At the end of the pattern, I use an if then else:
(?(1) # if the first capturing group exist
# then match nothing
| # else
[ ]?\$ # matches the currency
) # end of the conditional

You should check for optional $ at the end of amount:
\$? ?(\d[\d ,]*(?:\.\d{1,2})?|\d[\d,](?:\.\d{2})?) ?\$?(?:[pcm]|bn|[mb]illion)
Live demo

Related

Regex for validating account names for NEAR protocol

I want to have accurate form field validation for NEAR protocol account addresses.
I see at https://docs.near.org/docs/concepts/account#account-id-rules that the minimum length is 2, maximum length is 64, and the string must either be a 64-character hex representation of a public key (in the case of an implicit account) or must consist of "Account ID parts" separated by . and ending in .near, where an "Account ID part" consists of lowercase alphanumeric symbols separated by either _ or -.
Here are some examples.
The final 4 cases here should be marked as invalid (and there might be more cases that I don't know about):
example.near
sub.ex.near
something.near
98793cd91a3f870fb126f66285808c7e094afcfc4eda8a970f6648cdf0dbd6de
wrong.near.suffix (INVALID)
shouldnotendwithperiod.near. (INVALID)
space should fail.near (INVALID)
touchingDotsShouldfail..near (INVALID)
I'm wondering if there is a well-tested regex that I should be using in my validation.
Thanks.
P.S. Originally my question pointed to what I was starting with at https://regex101.com/r/jZHtDA/1 but starting from scratch like that feels unwise given that there must already be official validation rules somewhere that I could copy.
I have looked at code that I would have expected to use some kind of validation, such as these links, but I haven't found it yet:
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/utils/account.js#L8
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/components/accounts/AccountFormAccountId.js#L95
https://github.com/near/near-cli/blob/cdc571b1625a26bcc39b3d8db68a2f82b91f06ea/commands/create-account.js#L75
The pre-release (v0.6.0-0) version of the JS SDK comes with a built-in accountId validation function:
const ACCOUNT_ID_REGEX =
/^(([a-z\d]+[-_])*[a-z\d]+\.)*([a-z\d]+[-_])*[a-z\d]+$/;
/**
* Validates the Account ID according to the NEAR protocol
* [Account ID rules](https://nomicon.io/DataStructures/Account#account-id-rules).
*
* #param accountId - The Account ID string you want to validate.
*/
export function validateAccountId(accountId: string): boolean {
return (
accountId.length >= 2 &&
accountId.length <= 64 &&
ACCOUNT_ID_REGEX.test(accountId)
);
}
https://github.com/near/near-sdk-js/blob/dc6f07bd30064da96efb7f90a6ecd8c4d9cc9b06/lib/utils.js#L113
Feel free to implement this in your program too.
Something like this should do: /^(\w|(?<!\.)\.)+(?<!\.)\.(testnet|near)$/gm
Breakdown
^ # start of line
(
\w # match alphanumeric characters
| # OR
(?<!\.)\. # dots can't be preceded by dots
)+
(?<!\.) # "." should not precede:
\. # "."
(testnet|near) # match "testnet" or "near"
$ # end of line
Try the Regex out: https://regex101.com/r/vctRlo/1
If you want to match word characters only, separated by a dot:
^\w+(?:\.\w+)*\.(?:testnet|near)$
Explanation
^ Start of string
\w+ Match 1+ word characters
(?:\.\w+)* Optionally repeat . and 1+ word characters
\. Match .
(?:testnet|near) Match either testnet or near
$ End of string
Regex demo
A bit broader variant matching whitespace character excluding the dot:
^[^\s.]+(?:\.[^\s.]+)*\.(?:testnet|near)$
Regex demo

Regex to grab formulas

I am trying to parse a file that contains parameter attributes. The attributes are setup like this:
w=(nf*40e-9)*ng
but also like this:
par_nf=(1) * (ng)
The issue is, all of these parameter definitions are on a single line in the source file, and they are separated by spaces. So you might have a situation like this:
pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0
The current algorithm just splits the line on spaces and then for each token, the name is extracted from the LHS of the = and the value from the RHS. My thought is if I can create a Regex match based on spaces within parameter declarations, I can then remove just those spaces before feeding the line to the splitter/parser. I am having a tough time coming up with the appropriate Regex, however. Is it possible to create a regex that matches only spaces within parameter declarations, but ignores the spaces between parameter declarations?
Try this RegEx:
(?<=^|\s) # Start of each formula (start of line OR [space])
(?:.*?) # Attribute Name
= # =
(?: # Formula
(?!\s\w+=) # DO NOT Match [space] Word Characters = (Attr. Name)
[^=] # Any Character except =
)* # Formula Characters repeated any number of times
When checking formula characters, it uses a negative lookahead to check for a Space, followed by Word Characters (Attribute Name) and an =. If this is found, it will stop the match. The fact that the negative lookahead checks for a space means that it will stop without a trailing space at the end of the formula.
Live Demo on Regex101
Thanks to #Andy for the tip:
In this case I'll probably just match on the parameter name and equals, but replace the preceding whitespace with some other "parse-able" character to split on, like so:
(\s*)\w+[a-zA-Z_]=
Now my first capturing group can be used to insert something like a colon, semicolon, or line-break.
You need to add Perl tag. :-( Maybe this will help:
I ended up using this in C#. The idea was to break it into name value pairs, using a negative lookahead specified as the key to stop a match and start a new one. If this helps
var data = #"pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0";
var pattern = #"
(?<Key>[a-zA-Z_\s\d]+) # Key is any alpha, digit and _
= # = is a hard anchor
(?<Value>[.*+\-\\\/()\w\s]+) # Value is any combinations of text with space(s)
(\s|$) # Soft anchor of either a \s or EOB
((?!\s[a-zA-Z_\d\s]+\=)|$) # Negative lookahead to stop matching if a space then key then equal found or EOB
";
Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture)
.OfType<Match>()
.Select(mt => new
{
LHS = mt.Groups["Key"].Value,
RHS = mt.Groups["Value"].Value
});
Results:

Regex for receipt items

I have a simple receipt which I typed out. I need to be able to read the items purchased on the receipt. The sample receipt is below.
Tim Hortons
Alwasy Fresh
1 Brek Wrap Combo /A ($0.76)
1 Bacon-wrap $3.79
1 Grilled $0.00
1 5 Pieces Bacon-wrap $0.00
1 Orange $1.40
1 Deposit $0.10
Subtotal: $55.84
GST: $0.29
Debit: $55.84
Take out
Thanks for stopping by!!
Tell us how we did
I came up with the following regex string to find the items.
\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}
It works for the most part but there are a few incorrect lines like
4
GST: $0.29
Can someone come up with a better pattern. Below is a link to see it in action.
http://regexr.com/3cnk9
I see a number of problems with this original regex:
\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}
First, parentheses both group and match, though when you quantify your match, only the last iteration is captured, so matching like (.)* will only store the last character; you wanted (.*) for that. Since it's greedy, that will be the character before the space preceding a dollar sign, which given your data will always be a space. Similarly, you're quantifying a group at the beginning with (\s){1,10}, which captures only the last whitespace character. In this case, you don't need the group since \s is a single space character, so you can simply use \s{1,10}.
Here is a piece-by-piece explanation of what that regular expression does.
Capturing solution
The following regex captures the quantity ($1), item description ($2), whether the price is parenthesized ($3), and the price ($4):
^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$
Explained and matched to your sample at regex101.
Separated out and commented (assumes the /x flag is supported):
/ # begin regex
^\s* # start of line, ignore leading spaces if present
(\d+) # $1 = quantity
\s+ # spacing as a delimiter
(.*\S) # $2 = item: contains anything, must end in a non-space char
\s+ # spacing as a delimiter
(\(?) # $3 = negation, an optional open parenthesis
\$ # dollar sign
([0-9.]+) # $4 = price
\)?\s*$ # trailing characters: optional end-paren and space(s)
/x # end regex, multi-line regex flag
with sample perl code executed from a command line:
perl -ne '
my ($quantity, $item, $neg, $price)
= /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/;
if ($item) {
if ($neg) { $price *= -1; }
print "<$quantity><$item><$price>\n"
}' RECEIPT_FILE
(If you want that as a perl script, wrap the code with while(<>) { } and you're done.)
This assigns the variables $quantity, $item, and $price to the itemized lines on your receipt. I am assuming that a parenthesized item is to be subtracted (but I can't verify that since the totals are nonsensical), so $neg notes the existence of a parenthesis so the $price can be negated.
I set the output to use angle brackets (< and >) to indicate what each variable stores.
The output of your given sample receipt would therefore be:
<1><Brek Wrap Combo /A><-0.76>
<1><Bacon-wrap><3.79>
<1><Grilled><0.00>
<1><5 Pieces Bacon-wrap><0.00>
<1><Orange><1.40>
<1><Deposit><0.10>
Prices only solution
You didn't say what you wanted to match. If you don't care about anything but the prices and there are no negative values, you don't need matchers if you have negative look-behind or \K:
grep -Po '^\s*[0-9].*\$\K[0-9.]+' RECEIPT_FILE
Grep's -P flag invokes libpcre (which may not be available if you're on an old or embedded system) and -o displays only the matching text. \K denotes the start of the match. Put the \$ after the \K if you want to capture it. (See also the regex101 description and matches.)
Output from that grep command:
0.76
3.79
0.00
0.00
1.40
0.10
Prices only – with awk
There aren't great ways to handle this regex with efficiency. If you're processing through a mountain of content, you'll feel the hurt. Here's a solution using awk that should be significantly faster. (The difference won't be noticeable with a small input.)
awk '$1 / 1 > 0 && $NF ~ /\$/ { gsub(/[()]/, "", $0); print $NF; }' RECEIPT_FILE
Commented version with explanation:
awk '
# if the quantity is indeed a number and the last field has a dollar sign
$1 / 1 > 0 && $NF ~ /\$/ {
gsub(/[()]/, "", $NF); # remove all parentheses from the last field
print $NF; # print the contents of the last field
}' RECEIPT_FILE
Prices only – with awk, supporting negative prices
awk '
# if the quantity is indeed a number and the last field has a dollar sign
$1 / 1 > 0 && $NF ~ /\$/ {
neg = 1;
if ( $NF ~ /\(/ ) { # the last field has an open parenthesis
gsub(/[()]/, "", $NF); # remove all parentheses from the last field
neg = -1;
}
print $NF * neg; # print the last field, negated if parenthesized
}' RECEIPT_FILE
Here's my attempt:
^(\d+)\s+(.*)\s+\(?(\$.+)\)?$
Stub. Remember to turn the multiline option on. Components:
^ - beginning of line
(\d+) - capture the quantity at the beginning of each line item
\s+ - one or more space
(.*) - capture the item description
\s+ - one or more space
\(? - optional open bracket `(` character
($.+) - capture anything including and after the dollar sign
\)? - optional close bracket `)` character
$ - end of line
You can use
^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)
See the regex demo
This regex should be used with the /m modifier to match data on different lines. In JS, the /g modifier is also required.
Explanation:
^ - start of a line
(\d+) - Group 1 capturing one or more digits
\s+ - one or more whitespaces
(.*?) - Group 2 capturing zero or more any characters but a newline up to the closest
\s+ - one or more whitespaces
\(? - an optional ( (on the first line)
\$ - a literal $
(\d+\.\d+) - Group 3 capturing one or more digits followed with . and one or more digits.
JS demo:
var re = /^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)/gm;
var str = ' Tim Hortons\n Alwasy Fresh\n\n1 Brek Wrap Combo /A ($0.76)\n1 Bacon-wrap $3.79\n1 Grilled $0.00\n1 5 Pieces Bacon-wrap $0.00\n1 Orange $1.40\n1 Deposit $0.10\nSubtotal: $55.84\nGST: $0.29\nDebit: $55.84\nTake out\n\n Thanks for stopping by!!\n Tell us how we did';
while ((m = re.exec(str)) !== null) {
document.body.innerHTML += "Pcs: <b>" + m[1] + "</b>, item: <b>" + m[2] + "</b>, paid: <b>" + m[3] + "</b><br/>";
}
Adam Katz's answer should be the accepted one! I used this variation of his answer for an implementation in JavaScript:
const receiptRegex = /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/gm
let items = [];
const matches = inputStr.matchAll(receiptRegex);
for (const matchedGroup of matches) {
const [
fullString, //[0] -> matched string "1 Blue gatorade $2.00"
quantity, //[1] -> quantity "1"
item, //[2] -> item description "Blue gatorade"
ignoredSymbol, //[3] -> "$" (should probably always ignore)
price //[4] -> amount "2.00"
] = matchedGroup;
items.push({
quantity,
item,
price,
});
}

Regex on Splitting String

I am trying to split the following string into proper output using regex. Answers do not have to be in perl but in general regex is fine:
Username is required.
Multi-string name is optional
Followed by Uselessword is there but should be be parsed
Followed by an optional number
Following by an IP in brackets < > (Required)
String = username optional multistring name uselessword 45 <100.100.100.100>
Output should be:
Match 1 = username
Match 2 = optional multistring name
Match 3 = 45
Match 4 = 100.100.100.100
This sort of things are easier to handle using multiple regex. Here is an example:
my #arr = (
'username optional multistring name uselessword 45 <100.100.100.100>',
'username 45 <100.100.100.100>'
);
for(#arr){
## you can use anchor ^ $ here
if(/(\S+) (.+?) (\d+) <(.+?)>/){
print "$1\n$2\n$3\n$4\n";
}
## you can use anchor ^ $ here
elsif(/(\S+) (\d+) <(.+?)>/){
print "$1\n$2\n\n$3\n";
}
print "==========\n";
}
First if block is looking for four groups from the input. And the second block is looking for three groups.
If you need, you can use [ ]+ to handle multiple spaces between the groups.
Also, if you need, you can adjust the optional group (.+?) according to your preferred characters(usually through the character class [bla]).

Regular expression help - comma delimited string

I don't write many regular expressions so I'm going to need some help on the one.
I need a regular expression that can validate that a string is an alphanumeric comma delimited string.
Examples:
123, 4A67, GGG, 767 would be valid.
12333, 78787&*, GH778 would be invalid
fghkjhfdg8797< would be invalid
This is what I have so far, but isn't quite right: ^(?=.*[a-zA-Z0-9][,]).*$
Any suggestions?
Sounds like you need an expression like this:
^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$
Posix allows for the more self-descriptive version:
^[[:alnum:]]+(,[[:alnum:]]+)*$
^[[:alnum:]]+([[:space:]]*,[[:space:]]*[[:alnum:]]+)*$ // allow whitespace
If you're willing to admit underscores, too, search for entire words (\w+):
^\w+(,\w+)*$
^\w+(\s*,\s*\w+)*$ // allow whitespaces around the comma
Try this pattern: ^([a-zA-Z0-9]+,?\s*)+$
I tested it with your cases, as well as just a single number "123". I don't know if you will always have a comma or not.
The [a-zA-Z0-9]+ means match 1 or more of these symbols
The ,? means match 0 or 1 commas (basically, the comma is optional)
The \s* handles 1 or more spaces after the comma
and finally the outer + says match 1 or more of the pattern.
This will also match
123 123 abc (no commas) which might be a problem
This will also match 123, (ends with a comma) which might be a problem.
Try the following expression:
/^([a-z0-9\s]+,)*([a-z0-9\s]+){1}$/i
This will work for:
test
test, test
test123,Test 123,test
I would strongly suggest trimming the whitespaces at the beginning and end of each item in the comma-separated list.
You seem to be lacking repetition. How about:
^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$
I'm not sure how you'd express that in VB.Net, but in Python:
>>> import re
>>> x [ "123, $a67, GGG, 767", "12333, 78787&*, GH778" ]
>>> r = '^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$'
>>> for s in x:
... print re.match( r, s )
...
<_sre.SRE_Match object at 0xb75c8218>
None
>>>>
You can use shortcuts instead of listing the [a-zA-Z0-9 ] part, but this is probably easier to understand.
Analyzing the highlights:
[a-zA-Z0-9 ]+ : capture one or more (but not zero) of the listed ranges, and space.
(?:[...]+,)* : In non-capturing parenthesis, match one or more of the characters, plus a comma at the end. Match such sequences zero or more times. Capturing zero times allows for no comma.
[...]+ : capture at least one of these. This does not include a comma. This is to ensure that it does not accept a trailing comma. If a trailing comma is acceptable, then the expression is easier: ^[a-zA-Z0-9 ,]+
Yes, when you want to catch comma separated things where a comma at the end is not legal, and the things match to $LONGSTUFF, you have to repeat $LONGSTUFF:
$LONGSTUFF(,$LONGSTUFF)*
If $LONGSTUFF is really long and contains comma repeated items itself etc., it might be a good idea to not build the regexp by hand and instead rely on a computer for doing that for you, even if it's just through string concatenation. For example, I just wanted to build a regular expression to validate the CPUID parameter of a XEN configuration file, of the ['1:a=b,c=d','2:e=f,g=h'] type. I... believe this mostly fits the bill: (whitespace notwithstanding!)
xend_fudge_item_re = r"""
e[a-d]x= #register of the call return value to fudge
(
0x[0-9A-F]+ | #either hardcode the reply
[10xks]{32} #or edit the bitfield directly
)
"""
xend_string_item_re = r"""
(0x)?[0-9A-F]+: #leafnum (the contents of EAX before the call)
%s #one fudge
(,%s)* #repeated multiple times
""" % (xend_fudge_item_re, xend_fudge_item_re)
xend_syntax = re.compile(r"""
\[ #a list of
'%s' #string elements
(,'%s')* #repeated multiple times
\]
$ #and nothing else
""" % (xend_string_item_re, xend_string_item_re), re.VERBOSE | re.MULTILINE)
Try ^(?!,)((, *)?([a-zA-Z0-9])\b)*$
Step by step description:
Don't match a beginning comma (good for the upcoming "loop").
Match optional comma and spaces.
Match characters you like.
The match of a word boundary make sure that a comma is necessary if more arguments are stacked in string.
Please use - ^((([a-zA-Z0-9\s]){1,45},)+([a-zA-Z0-9\s]){1,45})$
Here, I have set max word size to 45, as longest word in english is 45 characters, can be changed as per requirement