Capturing the digit in a new line after whitespace - regex

The string we have right now is:
DB GOALS: DISADVANTAGED BUSINESS ENTERPRISE - 6.0%
PROPOSALS ISSUED 9 FUND TOTAL , , 0
TOTAL NUMBER OF WORKING DAYS 30
NUMBER OF BIDDERS 4 ENGINEERS EST 1,674,885.00 AMOUNT OVER
177,014.00 PERCENT OVER EST 10.57
PROGRAM ELEMENTS
I am using the pattern (AMOUNT OVER|AMOUNT UNDER)[\n\r\s]+(?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S) but it does not capture
AMOUNT OVER
177,014.00
in the text. I suspect it is because of the whitespace before 177,014.00 because it works when we remove the whitespace.
Is there a way to capture it as it is? Thanks so much!
Here is the regex101.com link for reference.

You might simplify the pattern a bit to:
\b(AMOUNT (?:OVER|UNDER))\s+((?:\d{1,3}(?:,\d{3})*(?:\.\d\d)?))(?!\S)
Note that [\n\r\s]+ can be written as \s+
Regex demo

Related

How to find all currency related digits REGEX?

For a string that has free text:
"The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5
from last monday. If they do not level up again to 100€ by the end of this week there might
be serious consequences to the company"
How to find a regex pattern that will extract currency related numbers?
In this case: 89.99, 95, and 100?
So far, I've tried these patterns:
[0-9]*[€.]([0-9]*)
\[0-9]{1,3}(?:\.\[0-9]{3})*,\[0-9]\[0-9]
[0-9]+\€\.[0-9]+
But these don't seem to be producing exactly what is needed
Simpler solution would be [.\d]*€[.\d]*.
One option is to match all 3 variations and afterwards remove the euro sign from the match.
(?:\d+€\d*|€\d+(?:\.\d+)?)
Explanation
(?: Non capture group
\d+€\d* Match 1+ digit and € followed by optional digits
| Or
€\d+(?:\.\d+)? Match € followed by digits and an optional decimal part
) Close non capture group
Regex demo
For example
import re
regex = r"(?:\d+€\d*|€\d+(?:\.\d+)?)"
test_str = ("\"The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5 \n"
"from last monday. If they do not level up again to 100€ by the end of this week there might \n"
"be serious consequences to the company\"")
print([x.replace("€", "") for x in re.findall(regex, test_str)])
Output
['89.99', '95', '100']
A bit more precise pattern for the number with optional comma followed by 3 digits and 2 digit decimal part could be:
(?:\d+€\d*|€\d{1,3}(?:,\d{3})*\.\d{2})
Regex demo
This need further testing but I would simply grab everything around € which is not whitespace, that is:
import re
text = """The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5
from last monday. If they do not level up again to 100€ by the end of this week there might
be serious consequences to the company"""
values = re.findall(r"\S*€\S*", text)
print(values)
Output:
['€89.99', '9€5', '100€']

Detecting whole number with an "x" or "-" after using regex

I'm trying to use regex to detect the quantity in a list of items on a receipt. The software uses OCR so the return can vary a bit. To help ive narrowed it to assume that the quantity will always be at the start of the line and is always a whole number. The use cases I'm trying to cover are:
2 Burgers $4.00
2 x Burgers $4.00
2 X Burgers $4.00
2x Burgers $4.00
2X Burgers $4.00
2- Burgers $4.00
2 - Burgers $4.00
The plan is for the regex to return 2 for each example above. The regex I have so far is \\d{1,2}(\\s[xX]|[xX]) this returns the top three examples fine but as much as I have tried I cant seem to get the rest detected, I haven't looked at adding the - yet as was stuck on detecting the x next to the Int.
Any help would be great, thanks
To help ive narrowed it to assume that the quantity will always be at the start of the line and is always a whole number.
I suggest using something like
let pattern = "(?m)^\\d+"
See the regex demo.
The pattern will match 1 or more digits at the start of any line:
(?m) - a MULTILINE modifier that makes ^ match the start of a line rather than the start of a string
^ - start of a line
\d+ - 1 or more (+) digits.
If you need to specify that some text should follow the digits, use a positive lookahead. E.g. you may require x/X/- after 0+ whitespaces, or a whitespace right after. Then, you need to use
let pattern = "(?m)\\d+(?=\\s*[xX-]|\\s)"
Here, (?=\\s*[xX-]|\\s) will make the regex match only those digits at the start of the line(s) that are immediately followed with either 0+ whitespace chars and then X, x or -, or that are immediately followed with a whitespace.
See this regex demo.
^(\\d+)\\s?[xX-]?.*?([$£](?:\\d{1,2})(?:,?\\d{3})*\.?\\d{0,2})$
See it working here (extra backslashes have been added in the code above to allow it to work in Swift, whereas the below link shows the expected result in JS, Python, Go and PHP, which means there are less backslashes there).
Will capture number of items and the price, what the item is is not captured.

Regex for validation of a street number

I'm using an online tool to create contests. In order to send prizes, there's a form in there asking for user information (first name, last name, address,... etc).
There's an option to use regular expressions to validate the data entered in this form.
I'm struggling with the regular expression to put for the street number (I'm located in Belgium).
A street number can be the following:
1234
1234a
1234a12
begins with a number (max 4 digits)
can have letters as well (max 2 char)
Can have numbers after the letter(s) (max3)
I came up with the following expression:
^([0-9]{1,4})([A-Za-z]{1,2})?([0-9]{1,3})?$
But the problem is that as letters and second part of numbers are optional, it allows to enter numbers with up to 8 digits, which is not optimal.
1234 (first group)(no letters in the second group) 5678 (third group)
If one of you can tip me on how to achieve the expected result, it would be greatly appreciated !
You might use this regex:
^\d{1,4}([a-zA-Z]{1,2}\d{1,3}|[a-zA-Z]{1,2}|)$
where:
\d{1,4} - 1-4 digits
([a-zA-Z]{1,2}\d{1,3}|[a-zA-Z]{1,2}|) - optional group, which can be
[a-zA-Z]{1,2}\d{1,3} - 1-2 letters + 1-3 digits
or
[a-zA-Z]{1,2} - 1-2 letters
or
empty
\d{0,4}[a-zA-Z]{0,2}\d{0,3}
\d{0,4} The first groupe matches a number with 4 digits max
[a-zA-Z]{0,2} The second groupe matches a char with 2 digit in max
\d{0,3} The first groupe matches a number with 3 digits max
You have to keep the last two groups together, not allowing the last one to be present, if the second isn't, e.g.
^\d{1,4}(?:[a-zA-z]{1,2}\d{0,3})?$
or a little less optimized (but showing the approach a bit better)
^\d{1,4}(?:[a-zA-z]{1,2}(?:\d{1,3})?)?$
As you are using this for a validation I assumed that you don't need the capturing groups and replaced them with non-capturing ones.
You might want to change the first number check to [1-9]\d{0,3} to disallow leading zeros.
Thank you so much for your answers ! I tried Sebastian's solution :
^\d{1,4}(?:[a-zA-z]{1,2}\d{0,3})?$
And it works like a charm ! I still don't really understand what the ":" stand for, but I'll try to figure it out next time i have to fiddle with Regex !
Have a nice day,
Stan
The first digit cannot be 0.
There shouldn't be other symbols before and after the number.
So:
^[1-9]\d{0,3}(?:[a-zA-Z]{1,2}\d{0,3})?$
The ?: combination means that the () construction does not create a matching substring.
Here is the regex with tests for it.

Regex - Alteryx - Parse - How to find an expression starting by the end of the string

I need to parse the following expression:
Fertilizer abc 7-15-15 5KG BOX 250 KG
in 3 fields:
The product description: Fertilizer abc 7-15-15
Size: 250
Size unit: KG
Do not know how to proceed. Please, any help and explanation?
Try this in the alteryx REGEX Tool with Parse selected as the Method:
([A-z ]* [\d-]{6,8}) ([A-Z\d]{2,6}) (.{1,5}?) (\d*) ([A-Z]*)
You can test it at Regexpal to see the breakdown of each group but essentially the first set of brackets will get you your product description (text and spaces until 6-8 characters made up of digits and dashes), the 2nd & 3rd parts will deal with the erroneous info that you don't want, the 4th group will be just digits and the 5th group will be any text afterwards.
Note that this will change dramatically if your data has digits where there is characters currently etc.
You can always break it up into even smaller groups and then concatenate back together as well.

RegEx for Prices?

I am searching for a RegEx for prices.
So it should be X numbers in front, than a "," and at the end 2 numbers max.
Can someone support me and post it please?
In what language are you going to use it?
It should be something like:
^\d+(,\d{1,2})?$
Explaination:
X number in front is: ^\d+ where ^ means the start of the string, \d means a digit and + means one or more
We use group () with a question mark, a ? means: match what is inside the group one or no times.
inside the group there is ,\d{1,2}, the , is the comma you wrote, \d is still a digit {1,2} means match the previous digit one or two times.
The final $ matches the end of the string.
I was not satisfied with the previous answers. Here is my take on it:
\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})
|^^^^^^|^^^^^^^^^^^^^|^^^^^^^^^^^|
| 1-3 | 3 digits | 2 digits |
|digits| repeat any | |
| | no. of | |
| | times | |
(get a detailed explanation here: https://regex101.com/r/cG6iO8/1)
Covers all cases below
5.00
1,000
1,000,000.99
5,99 (european price)
5.999,99 (european price)
0.11
0.00
But also weird stuff like
5.000,000.00
In case you want to include 5 and 1000 (I personally wound not like to match ALL numbers), then just add a "?" like so:
\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?
I am working on similar problem. However i want only to match if a currency Symbol or String is also included in the String like EUR,€,USD or $. The Symbol may be trailing or leading. I don't care if there is space between the Number and the Currency substring. I based the Number matching on the previous discussion and used Price Number: \d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?
Here is final result:
(USD|EUR|€|\$)\s?(\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2}))|(\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?)\s?(USD|EUR|€|\$)
I use (\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?)\s?(USD|EUR|€|\$) as a pattern to match against a currency symbol (here with tolerance for a leading space). I think you can easily tweak it for any other currencies
A Gist with the latest Version can be found at https://gist.github.com/wischweh/b6c0ac878913cca8b1ba
So I ran into a similar problem, needing to validate if an arbitrary string is a price, but needed a lot more resilience than the regexes provided in this thread and many other threads.
I needed a regex that would match all of the following:
5
5.00
1,000
1,000,000.99
5,99 (european price)
5.999,99 (european price)
0.11
0.00
And not to match stuff like IP addresses. I couldn't figure out a single regex to deal with the european and non-european stuff in one fell swoop so I wrote a little bit of Ruby code to normalise prices:
if value =~ /^([1-9][0-9]{,2}(,[0-9]{3})*|[0-9]+)(\.[0-9]{1,9})?$/
Float(value.delete(","))
elsif value =~ /^([1-9][0-9]{,2}(\.[0-9]{3})*|[0-9]+)(,[0-9]{1,9})?$/
Float(value.delete(".").gsub(",", "."))
else
false
end
The only difference between the two regexes is the swapped decimal place and comma. I'll try and break down what this is doing:
/^([1-9][0-9]{,2}(,[0-9]{3})*|[0-9]+)(\.[0-9]{1,9})?$/
The first part:
([1-9][0-9]{,2}(,[0-9]{3})*
This is a statement of numbers that follow this form: 1,000 1,000,000 100 12. But it does not allow leading zeroes. It's for the properly formatted numbers that have groups of 3 numerics separated by the thousands separator.
Second part:
[0-9]+
Just match any number 1 or more times. You could make this 0 or more times if you want to match: .11 .34 .00 etc.
The last part:
(\.[0-9]{1,9})?
This is the decimal place bit. Why up to 9 numerics, you ask? I've seen it happen. This regex is supposed to be able to handle any weird and wonderful price it sees and I've seen some retailers use up to 9 decimal places in prices. Usually all 0s, but we wouldn't want to miss out on the data ^_^
Hopefully this helps the next person to come along needing to process arbitrarily badly formatted price strings or either european or non-european format :)
^\d+,\d{1,2}$
I am currently working on a small function using regex to get price amount inside a String :
private static String getPrice(String input)
{
String output = "";
Pattern pattern = Pattern.compile("\\d{1,3}[,\\.]?(\\d{1,2})?");
Matcher matcher = pattern.matcher(input);
if (matcher.find())
{
output = matcher.group(0);
}
return output;
}
this seems to work with small price (0,00 to 999,99) and various currency :
$12.34 -> 12.34
$12,34 -> 12,34
$12.00 -> 12.00
$12 -> 12
12€ -> 12
12,11€ -> 12,11
12.999€ -> 12.99
12.9€ -> 12.9
£999.99€ -> 999.99
...
Pretty simple for "," separated numbers(Or no seperation) with 2 decimal places , supports deliminator but does not force them. Needs some improvement but should work.
^((\d{1,3}|\s*){1})((\,\d{3}|\d)*)(\s*|\.(\d{2}))$
matches:
1,123,456,789,134.45
1123456134.45
1234568979
12,345.45
123.45
123
no match:
1,2,3
12.4
1234,456.45
This may need some editing to make it function correctly
Quick explanation: Matches 1-3 numbers(Or nothing), matches a comma followed by 3 numbers as many times as needed(Or just numbers), matches a decimal point followed by 1 or 2 numbers(Or Nothing)
This code worked for me !! (PHP)
preg_match_all('/\d+((,\d+)+)?(.\d+)?(.\d+)?(,\d+)?/',$price[1]->plaintext,$lPrices);
So far I tried, this is the best
\d{1,3}[,\\.]?(\\d{1,2})?
https://regex101.com/r/xT8aQ7/1
r'(^\-?\d*\d+.?(\d{1,2})?$)'
This will allow digits with only one decimal and two digits after decimal
This one reasonably works when you may or may not have decimal part but an amount shows up like this 100,000 - or 100,000.00. Tested using Clojure only
\d{1,3}(?:[.,]\\d{3})*(?:[.,]\d{2,3})
\d+((,\d+)+)?(.\d+)?(.\d+)?(,\d+)?
to cover all
5
5.00
1,000
1,000,000.99
5,99 (european price)
5.999,99 (european price)
0.11
0.00
^((\d+)((,\d+|\d+)*)(\s*|\.(\d{2}))$)
Matches:
1
11
111
1111111
11,2122
1222,21222
122.23
1223,3232.23
Not Matches:
11e
x111
111,111.090
1.000
anything like \d+,\d{2} is wrong because the \d matches [0-9\.] i.e. 12.34,1.
should be: [0-9]+,[0-9]{2} (or [0-9]+,[0-9]{1,2} to allow only 1 decimal place)