A regex to get any price string - regex

I need to get the price from a string, but no other numbers. There are no restrictions on what the string can say, but it will always have a dollar amount in it. It's the dollar amount I need to get from the string.
The closest solution I've been able to find is \d{1,3}[,.]?(\d{1,2})?
On an example string like "2 BED / 2 BATH for $120,000.00, what a deal!!!", the regex should return only $120,000.00 and no other numbers. The solution above also matches the 2 and 2 from the bed and bath counts. An ideal solution should NOT match any digits that are outside of the dollar amount. It also needs to include the symbol immediately before the number, to account for other currency symbols (USD, GBP, EUR, etc.).
So, the price that's matched by the regex should look like: $120,000.00, but it could also match on something like €40,000

If you want to match all currency symbols before a number with the number itself, you may combine the two expressions:
Currency symbol regex: \b(?:[BS]/\.|R(?:D?\$|p))|\b(?:[TN]T|[CJZ])\$|Дин\.|\b(?:Bs|Ft|Gs|K[Mč]|Lek|B[Zr]|k[nr]|[PQLSR]|лв|ден|RM|MT|lei|zł|USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)\b|\$[Ub]|[\p{Sc}ƒ]
Number regex: (?<!\d)(?<!\d\.)(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d{1,2})?(?!\.?\d)
The currency symbols are taken from World Currency Symbols. The 3-letter currency codes used in the pattern are the most commonly used ones, but a comprehensive list can be compiled from the same data.
The answer is
(?:\b(?:[BS]/\.|R(?:D?\$|p))|\b(?:[TN]T|[CJZ])\$|Дин\.|\b(?:Bs|Ft|Gs|K[Mč]|Lek|B[Zr]|k[nr]|[PQLSR]|лв|ден|RM|MT|lei|zł|USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)|\$[Ub]|[\p{Sc}ƒ])\s?(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d{1,2})?(?!\.?\d)
It is created like this: (?:CUR_SYM_REGEX)\s?NUM_REGEX, with the lookbehinds in number regex stripped from the pattern since the left-hand context is already defined.
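As a rough illustration of the (?:CUR_SYM_REGEX)\s?NUM_REGEX composition, here is a minimal Python sketch. It assumes a deliberately shortened currency alternation, because Python's built-in re module does not support \p{Sc}; the names CUR_SYM, NUM, PRICE and find_prices are made up for this example.

```python
import re

# Shortened, illustrative currency alternation (an assumption, not the full list above).
CUR_SYM = r"(?:\b(?:USD|GBP|EUR)|[$€£¥])"
# Number part from the answer, with the lookbehinds dropped as described.
NUM = r"(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d{1,2})?(?!\.?\d)"
PRICE = re.compile(CUR_SYM + r"\s?" + NUM)


def find_prices(text):
    """Return every currency-prefixed amount found in text."""
    # All groups are non-capturing, so findall returns the full matches.
    return PRICE.findall(text)


print(find_prices("2 BED / 2 BATH for $120,000.00, what a deal!!!"))
print(find_prices("Offers: €40,000 or EUR 1500"))
```

The bare 2s are skipped because the pattern can only start at a currency symbol or code.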

You can use this one:
[$€]{1}(?P<amount>[\d,\.]+(?>\.\d{2}){0,})\b
Add any other currency signs to the leading character class [$€] to match them as well.

This alternative will match any amount without specifying the currency:
\S+\d[\d,\.]*?\b
If you have to specify currency due to misspellings in the input, then you can also use the following regex as an alternative:
(?:\p{Sc}|ƒ)[\d,\.]+\b
Note: \p{Sc} can match any Currency Symbol.
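Python's built-in re module has no \p{Sc}, so a rough equivalent has to build the character class by hand. The sketch below does that with unicodedata; it is only one possible workaround, and the names CURRENCY_CHARS, PRICE and find_prices are invented for the example.

```python
import re
import sys
import unicodedata

# Collect every code point whose Unicode general category is "Sc" (currency symbol).
CURRENCY_CHARS = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Sc"
)

# ƒ is classified as a letter, not a currency symbol, so it is added by hand,
# mirroring the (?:\p{Sc}|ƒ) alternative above.
PRICE = re.compile("[" + re.escape(CURRENCY_CHARS) + "ƒ]" + r"[\d,.]+\b")


def find_prices(text):
    """Return every amount that is directly preceded by a currency symbol."""
    return PRICE.findall(text)


print(find_prices("₧10 2 BED / 2 BATH for ƒ80.00, what a deal ₨9"))
```

Building the class scans the whole Unicode range once at import time, which is acceptable for a script but worth caching in a larger program.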
The regex \S+\d[\d,\.]*?\b was tested with the following Java test harness, to show that it handles any amount and currency:
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; the original snippet showed only the main method.
public class PriceRegexDemo {
    public static void main(String[] args) {
        List<String> inputs = Arrays.asList(
            "2 BED / 2 BATH for $120,000.00, what a deal!!!",
            "$1 2 BED / 2 BATH for $120,000.00, what a deal $3",
            "$1.00 2 BED / 2 BATH for $2,000.00, what a deal $300",
            "£40.00 2 BED / 2 BATH for $50,000, what a deal €600.00",
            "₧10 2 BED / 2 BATH for ƒ80.00, what a deal ₨9"
        );
        Pattern pattern = Pattern.compile("\\S+\\d[\\d,\\.]*?\\b");
        for (String input : inputs) {
            System.out.printf("Line to match: '%s'%n", input);
            Matcher matcher = pattern.matcher(input);
            System.out.println("Extracted price string:");
            // Print every match found in the current line.
            while (matcher.find()) {
                System.out.println(matcher.group());
            }
            System.out.println("=======================");
        }
    }
}
Output:
Line to match: '2 BED / 2 BATH for $120,000.00, what a deal!!!'
Extracted price string:
$120,000.00
=======================
Line to match: '$1 2 BED / 2 BATH for $120,000.00, what a deal $3'
Extracted price string:
$1
$120,000.00
$3
=======================
Line to match: '$1.00 2 BED / 2 BATH for $2,000.00, what a deal $300'
Extracted price string:
$1.00
$2,000.00
$300
=======================
Line to match: '£40.00 2 BED / 2 BATH for $50,000, what a deal €600.00'
Extracted price string:
£40.00
$50,000
€600.00
=======================
Line to match: '₧10 2 BED / 2 BATH for ƒ80.00, what a deal ₨9'
Extracted price string:
₧10
ƒ80.00
₨9
=======================
Link to more currency signs:
https://en.wikipedia.org/wiki/Currency_sign_(typography)

Related

Extract £ or % value from Google Sheets with REGEXEXTRACT

I have a Google Sheets formula that extracts a £ currency value or a percentage discount from a block of text.
=REGEXEXTRACT(B2,"[\d,.£%]+") - Extracts £ value or % discount (but other numbers too)
=REGEXEXTRACT(B2,"[\d,.]+") - Extracts digits, commas, or periods
However, if the text contains any other numbers before the £ value or % discount, they get extracted first.
How can I only extract the £ value or % discount from each cell in Google Sheets?
Discounts are displayed with at most 2 decimal places, which may help in building a formula to extract 4 digits left or right of the value.
EXAMPLE DATA
Amy Wills 44% Discount
1Direction Food 45.37% Discount
AllUnder20 £120 Commission
AATU 13.31% Discount
Tickets4You £70 Commission
AllAboutU £7 Commission
Andrea Cardini 4% Discount
You can use
=JOIN("", REGEXEXTRACT(B2, "£(\d+(?:[.,]\d+)?)|(\d+(?:[.,]\d+)?)%"))
Details:
£(\d+(?:[.,]\d+)?) - matches a £ and then captures into Group 1 one or more digits, optionally followed by a . or , and one or more digits
| - or
(\d+(?:[.,]\d+)?)% - captures into Group 2 one or more digits, optionally followed by a . or , and one or more digits, and then matches a %.
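For readers who want to try the same alternation outside Sheets, here is a small Python sketch of the idea. Only one of the two capture groups takes part in any match, which is what the JOIN("", ...) wrapper relies on; the function name extract_value is made up for this example.

```python
import re

PATTERN = re.compile(r"£(\d+(?:[.,]\d+)?)|(\d+(?:[.,]\d+)?)%")


def extract_value(text):
    """Return the number after a £ or before a %, or "" if neither is present."""
    m = PATTERN.search(text)
    if m is None:
        return ""
    # Exactly one group participates in a match; the other one is None.
    return m.group(1) or m.group(2)


for row in ["Amy Wills 44% Discount",
            "1Direction Food 45.37% Discount",
            "AllUnder20 £120 Commission",
            "Tickets4You £70 Commission"]:
    print(row, "->", extract_value(row))
```

Digits inside names such as 1Direction or AllUnder20 are skipped because they are neither preceded by £ nor followed by %.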
Based on your samples, this should work.
=SUMPRODUCT(N(SPLIT(B2," ")))

How to find all currency related digits REGEX?

For a string that has free text:
"The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5
from last monday. If they do not level up again to 100€ by the end of this week there might
be serious consequences to the company"
How to find a regex pattern that will extract currency related numbers?
In this case: 89.99, 95, and 100?
So far, I've tried these patterns:
[0-9]*[€.]([0-9]*)
\[0-9]{1,3}(?:\.\[0-9]{3})*,\[0-9]\[0-9]
[0-9]+\€\.[0-9]+
But these don't seem to be producing exactly what is needed
A simpler solution would be [.\d]*€[.\d]*.
One option is to match all 3 variations and afterwards remove the euro sign from the match.
(?:\d+€\d*|€\d+(?:\.\d+)?)
Explanation
(?: Non capture group
\d+€\d* Match 1+ digit and € followed by optional digits
| Or
€\d+(?:\.\d+)? Match € followed by digits and an optional decimal part
) Close non capture group
For example
import re
regex = r"(?:\d+€\d*|€\d+(?:\.\d+)?)"
test_str = ("\"The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5 \n"
"from last monday. If they do not level up again to 100€ by the end of this week there might \n"
"be serious consequences to the company\"")
print([x.replace("€", "") for x in re.findall(regex, test_str)])
Output
['89.99', '95', '100']
A bit more precise pattern for the number with optional comma followed by 3 digits and 2 digit decimal part could be:
(?:\d+€\d*|€\d{1,3}(?:,\d{3})*\.\d{2})
This needs further testing, but I would simply grab everything around € that is not whitespace:
import re
text = """The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5
from last monday. If they do not level up again to 100€ by the end of this week there might
be serious consequences to the company"""
values = re.findall(r"\S*€\S*", text)
print(values)
Output:
['€89.99', '9€5', '100€']

Regex that returns words, including words that contain numbers but not words that are just numbers

More specifically, it must not return words that are just numbers or that contain other characters such as #, $, or . (characters with accents are fine).
So if I use this text as an example:
we bought 6 500ml beers for $ 6.00 each from the êcole bar
it would return we bought 500ml beers for each from the êcole bar, so the 6, the $, and the 6.00 are removed.
In short, I am trying to read the item name from a restaurant receipt while ignoring the price and quantity of the items being bought.
You could try a regex like
/([0-9]+)?[a-zA-Zê]/
To match all words with no digits use
/\b[^\d\W]+\b/g
To skip words which consist of only digits use
(?!^\d+$)^.+$
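One way to combine these ideas, sketched in Python under the assumption that a "word" is a run of word characters containing at least one letter. The pattern and the helper name item_words are illustrative rather than taken from the answers above.

```python
import re

# A run of word characters that contains at least one letter.
# [^\W\d_] means "a word character that is neither a digit nor an underscore".
WORD = re.compile(r"\b(?=\w*[^\W\d_])\w+\b")


def item_words(line):
    """Keep words and mixed tokens like 500ml; drop bare numbers and symbols."""
    return " ".join(WORD.findall(line))


print(item_words("we bought 6 500ml beers for $ 6.00 each from the êcole bar"))
```

In Python 3, \w is Unicode-aware for str patterns, so accented characters such as ê count as letters without extra flags.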

Regular expression for Number masking with exceptions

I want to mask phone numbers in a resume that also contains dates in the forms 2001 and 2001-03, and percentages such as 45%, 87%, 78.45%, and 56.5%.
I only want to mask the phone numbers, and I don't need to mask it completely. If I could only mask 3 or 4 digits that makes it hard to guess, that does the job. Kindly help me out.
Phone number formats are
9876543210
98765 43210
98765-43210
9876 543 210
9876-543-210
Here is my answer:
(([0-9][- ]*){5})(([0-9][- ]*){5})
It will match exactly 10 digits with or without - or space.
After that, you can replace the first or the third group with ***** or anything you like.
For example:
$1*****
\d{4,5}[ -]?\d{3}[ -]?\d{2,3}
Strings matched:
9876543210, 98765 43210, 98765-43210, 9876 543 210, 9876-543-210
Strings not matched:
45% 87% 78.45% 56.5%
2001, 2001-03
I feel that a more complicated regex that doesn't match invalid phone numbers is not required since the requirement is to mask valid phone numbers of the above format.
Python code:
import re

def fun(m):
    # Replace the first group with asterisks and keep the rest of the number.
    return '*' * len(m.group(1)) + m.group(2)

string = "Resume of candidate abcd. His phone numbers are : 9876543210, 98765 43210, 98765-43210.Date of birth of the candidate is 23-10-2013. His percentage is 57%. One more number 9876 543 213 His percentage in grad school is 44%. Another number 9876-543-210"
re.sub(r'(\d{4,5})([ -]?\d{3}[ -]?\d{2,3})', fun, string)
Output:
'Resume of candidate abcd. His phone numbers are : *****43210, *****
43210, *****-43210. Date of birth of the candidate is 23-10-2013. His
percentage is 57%. One more number **** 543 213 His percentage in grad
school is 44%. Another number ****-543-210'
More about re.sub:
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping
occurrences of pattern in string by the replacement repl. If the
pattern isn’t found, string is returned unchanged. repl can be a
string or a function;
Just to help you on your way... I would use Python to do this.
Use re module to search for number-like strings:
import re

num_re = re.compile('[0-9 -]{5,}')
with open('/my/file', 'r') as f:
    for l in f:
        for s in num_re.findall(l):
            # Do some additional testing here, like 'not starting with' or similar
            # str.replace returns a new string, so the result must be reassigned.
            l = l.replace(s, '!!!MASKED!!!')
        print(l, end='')
I'm not saying that this code is finished, but it should help you on your way.
By the way, why I would use this approach:
You can easily add any tests you like to fix false positives.
It's readable.

RegEx for Prices?

I am searching for a RegEx for prices.
So it should be X numbers in front, then a "," and at the end 2 numbers max.
Can someone support me and post it please?
In what language are you going to use it?
It should be something like:
^\d+(,\d{1,2})?$
Explanation:
X number in front is: ^\d+ where ^ means the start of the string, \d means a digit and + means one or more
We use group () with a question mark, a ? means: match what is inside the group one or no times.
inside the group there is ,\d{1,2}, the , is the comma you wrote, \d is still a digit {1,2} means match the previous digit one or two times.
The final $ matches the end of the string.
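A quick way to check this pattern is a few lines of Python. This is only a sketch: re.fullmatch stands in for the ^ and $ anchors, and is_price is a name chosen for the example.

```python
import re

# Same pattern as above; fullmatch plays the role of ^ and $.
PRICE = re.compile(r"\d+(,\d{1,2})?")


def is_price(text):
    """True if text is digits, optionally followed by a comma and 1-2 digits."""
    return PRICE.fullmatch(text) is not None


for candidate in ["12", "12,5", "12,50", "12,505", "12.50", ",50"]:
    print(candidate, is_price(candidate))
```

The first three candidates are accepted; the last three are rejected because of the third decimal digit, the dot separator, and the missing leading digits respectively.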
I was not satisfied with the previous answers. Here is my take on it:
\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})
|^^^^^^|^^^^^^^^^^^^^|^^^^^^^^^^^|
| 1-3  | 3 digits    | 2 digits  |
|digits| repeat any  |           |
|      | no. of      |           |
|      | times       |           |
(get a detailed explanation here: https://regex101.com/r/cG6iO8/1)
Covers all cases below
5.00
1,000
1,000,000.99
5,99 (european price)
5.999,99 (european price)
0.11
0.00
But also weird stuff like
5.000,000.00
In case you want to include 5 and 1000 (I personally would not like to match ALL numbers), then just add a "?" like so:
\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?
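Whether a case counts as "covered" depends on anchoring. The Python sketch below assumes whole-string matching with re.fullmatch and compares the two variants; the names STRICT, LOOSE, CASES and whole_matches are made up for the example.

```python
import re

STRICT = re.compile(r"\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})")   # decimals required
LOOSE = re.compile(r"\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?")   # decimals optional

CASES = ["5", "5.00", "1,000", "1,000,000.99", "5,99", "5.999,99", "0.11", "1000"]


def whole_matches(pattern, cases):
    """Return the cases that the pattern matches in their entirety."""
    return [c for c in cases if pattern.fullmatch(c)]


print("strict:", whole_matches(STRICT, CASES))
print("loose: ", whole_matches(LOOSE, CASES))
```

Under this assumption the strict variant rejects 5 and 1,000 as whole strings, while an unanchored search still finds the prefix 1,00 inside 1,000. The optional-decimal variant accepts everything in the list except 1000, which has no thousands separator and more than three leading digits.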
I am working on a similar problem. However, I only want to match if a currency symbol or string such as EUR, €, USD, or $ is also included. The symbol may be trailing or leading, and I don't care whether there is a space between the number and the currency substring. I based the number matching on the previous discussion and used this price number pattern: \d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?
Here is the final result:
(USD|EUR|€|\$)\s?(\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2}))|(\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?)\s?(USD|EUR|€|\$)
I use (\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?)\s?(USD|EUR|€|\$) as a pattern to match against a currency symbol (here with tolerance for a leading space). I think you can easily tweak it for any other currencies
A Gist with the latest Version can be found at https://gist.github.com/wischweh/b6c0ac878913cca8b1ba
So I ran into a similar problem, needing to validate if an arbitrary string is a price, but needed a lot more resilience than the regexes provided in this thread and many other threads.
I needed a regex that would match all of the following:
5
5.00
1,000
1,000,000.99
5,99 (european price)
5.999,99 (european price)
0.11
0.00
And not to match stuff like IP addresses. I couldn't figure out a single regex to deal with the european and non-european stuff in one fell swoop so I wrote a little bit of Ruby code to normalise prices:
if value =~ /^([1-9][0-9]{,2}(,[0-9]{3})*|[0-9]+)(\.[0-9]{1,9})?$/
  Float(value.delete(","))
elsif value =~ /^([1-9][0-9]{,2}(\.[0-9]{3})*|[0-9]+)(,[0-9]{1,9})?$/
  Float(value.delete(".").gsub(",", "."))
else
  false
end
The only difference between the two regexes is the swapped decimal place and comma. I'll try and break down what this is doing:
/^([1-9][0-9]{,2}(,[0-9]{3})*|[0-9]+)(\.[0-9]{1,9})?$/
The first part:
[1-9][0-9]{,2}(,[0-9]{3})*
This is a statement of numbers that follow this form: 1,000 1,000,000 100 12. But it does not allow leading zeroes. It's for the properly formatted numbers that have groups of 3 numerics separated by the thousands separator.
Second part:
[0-9]+
Just match any number 1 or more times. You could make this 0 or more times if you want to match: .11 .34 .00 etc.
The last part:
(\.[0-9]{1,9})?
This is the decimal place bit. Why up to 9 numerics, you ask? I've seen it happen. This regex is supposed to be able to handle any weird and wonderful price it sees and I've seen some retailers use up to 9 decimal places in prices. Usually all 0s, but we wouldn't want to miss out on the data ^_^
Hopefully this helps the next person to come along needing to process arbitrarily badly formatted price strings of either European or non-European format :)
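For comparison, a rough Python port of the same two-pass idea might look like the sketch below. It is not the author's code: re.fullmatch replaces the ^ and $ anchors, None is returned instead of false, and the names US_STYLE, EU_STYLE and normalise_price are invented for the example.

```python
import re

# Thousands separated by commas, decimal point as ".".
US_STYLE = re.compile(r"([1-9][0-9]{0,2}(,[0-9]{3})*|[0-9]+)(\.[0-9]{1,9})?")
# Thousands separated by dots, decimal separator as ",".
EU_STYLE = re.compile(r"([1-9][0-9]{0,2}(\.[0-9]{3})*|[0-9]+)(,[0-9]{1,9})?")


def normalise_price(value):
    """Return the price as a float, or None if value is not a recognisable price."""
    if US_STYLE.fullmatch(value):
        return float(value.replace(",", ""))
    if EU_STYLE.fullmatch(value):
        return float(value.replace(".", "").replace(",", "."))
    return None


for raw in ["5", "1,000,000.99", "5,99", "5.999,99", "0.11", "192.168.0.1"]:
    print(raw, "->", normalise_price(raw))
```

As in the Ruby version, the non-European form is tried first, so an ambiguous input such as 1,000 is read as one thousand rather than as one with three decimal places.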
^\d+,\d{1,2}$
I am currently working on a small function using regex to get price amount inside a String :
private static String getPrice(String input)
{
    String output = "";
    Pattern pattern = Pattern.compile("\\d{1,3}[,\\.]?(\\d{1,2})?");
    Matcher matcher = pattern.matcher(input);
    if (matcher.find())
    {
        output = matcher.group(0);
    }
    return output;
}
This seems to work with small prices (0,00 to 999,99) and various currencies:
$12.34 -> 12.34
$12,34 -> 12,34
$12.00 -> 12.00
$12 -> 12
12€ -> 12
12,11€ -> 12,11
12.999€ -> 12.99
12.9€ -> 12.9
£999.99€ -> 999.99
...
Pretty simple for ","-separated numbers (or no separation) with 2 decimal places; it supports delimiters but does not force them. Needs some improvement but should work.
^((\d{1,3}|\s*){1})((\,\d{3}|\d)*)(\s*|\.(\d{2}))$
matches:
1,123,456,789,134.45
1123456134.45
1234568979
12,345.45
123.45
123
no match:
1,2,3
12.4
This may need some editing to make it function correctly: 1234,456.45, for example, does match, because (\,\d{3}|\d)* lets bare digits and comma groups mix freely.
Quick explanation: matches 1-3 digits (or nothing), then a comma followed by 3 digits as many times as needed (or just digits), then a decimal point followed by exactly 2 digits (or nothing).
This code worked for me !! (PHP)
preg_match_all('/\d+((,\d+)+)?(.\d+)?(.\d+)?(,\d+)?/',$price[1]->plaintext,$lPrices);
Of what I have tried so far, this is the best:
\d{1,3}[,.]?(\d{1,2})?
https://regex101.com/r/xT8aQ7/1
r'(^\-?\d*\d+.?(\d{1,2})?$)'
This will allow digits with only one decimal and two digits after decimal
This one works reasonably well when the decimal part may or may not be present, e.g. for amounts like 100,000 or 100,000.00. Tested using Clojure only.
\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2,3})
\d+((,\d+)+)?(.\d+)?(.\d+)?(,\d+)?
to cover all
5
5.00
1,000
1,000,000.99
5,99 (european price)
5.999,99 (european price)
0.11
0.00
^((\d+)((,\d+|\d+)*)(\s*|\.(\d{2}))$)
Matches:
1
11
111
1111111
11,2122
1222,21222
122.23
1223,3232.23
Does not match:
11e
x111
111,111.090
1.000
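Note that an unanchored pattern like \d+,\d{1,2} can still match inside a longer string, e.g. it finds 34,1 in 12.34,1 (\d itself only matches digits, not the dot).
To validate a whole string, anchor it: ^[0-9]+,[0-9]{2}$ (or ^[0-9]+,[0-9]{1,2}$ to also allow a single decimal place).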