How to find all currency related digits REGEX? - regex

For a string that has free text:
"The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5
from last monday. If they do not level up again to 100€ by the end of this week there might
be serious consequences to the company"
How to find a regex pattern that will extract currency related numbers?
In this case: 89.99, 95, and 100?
So far, I've tried these patterns:
[0-9]*[€.]([0-9]*)
[0-9]{1,3}(?:\.[0-9]{3})*,[0-9][0-9]
[0-9]+\€\.[0-9]+
But these don't seem to be producing exactly what is needed

A simpler solution would be [.\d]*€[.\d]* (the € sign still has to be stripped from each match afterwards)
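A minimal Python sketch of how that pattern behaves (the variable names and the shortened sample text are made up for illustration). Because the character class also accepts dots, a full stop directly after an amount ends up in the match, so this is only a rough first pass:

```python
import re

# Pattern from the answer above: digits/dots on either side of a euro sign.
pattern = re.compile(r"[.\d]*€[.\d]*")

# Shortened version of the question's text (illustrative only).
text = "fell by €89.99 today, a drop of a 9€5 from last monday, back to 100€ by Friday"

# Strip the euro sign from each match to keep only the number.
amounts = [m.replace("€", "") for m in pattern.findall(text)]
print(amounts)

# Caveat: a full stop right after the amount is swallowed by [.\d]*.
trailing = pattern.findall("It closed at 100€.")
print(trailing)
```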

One option is to match all 3 variations and afterwards remove the euro sign from the match.
(?:\d+€\d*|€\d+(?:\.\d+)?)
Explanation
(?: Non capture group
\d+€\d* Match 1+ digit and € followed by optional digits
| Or
€\d+(?:\.\d+)? Match € followed by digits and an optional decimal part
) Close non capture group
For example
import re
regex = r"(?:\d+€\d*|€\d+(?:\.\d+)?)"
test_str = ("\"The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5 \n"
"from last monday. If they do not level up again to 100€ by the end of this week there might \n"
"be serious consequences to the company\"")
print([x.replace("€", "") for x in re.findall(regex, test_str)])
Output
['89.99', '95', '100']
A bit more precise pattern for the number with optional comma followed by 3 digits and 2 digit decimal part could be:
(?:\d+€\d*|€\d{1,3}(?:,\d{3})*\.\d{2})
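As a hedged illustration of what the stricter variant accepts and rejects, here is a small Python sketch. The sample sentence is invented; note that an amount such as €89, with no decimal part, is no longer matched by the second alternative:

```python
import re

# Stricter pattern from the answer: after a leading euro sign, require
# 1-3 digits, optional ",ddd" groups and exactly two decimals.
strict = re.compile(r"(?:\d+€\d*|€\d{1,3}(?:,\d{3})*\.\d{2})")

# Made-up sample sentence for illustration.
sample = "Revenue was €1,234,567.89, up from 9€5, target 100€, not €89"

# Remove the euro sign from each match; "€89" is skipped (no decimals).
found = [m.replace("€", "") for m in strict.findall(sample)]
print(found)
```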

This needs further testing, but I would simply grab everything around € that is not whitespace:
import re
text = """The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5
from last monday. If they do not level up again to 100€ by the end of this week there might
be serious consequences to the company"""
values = re.findall(r"\S*€\S*", text)
print(values)
Output:
['€89.99', '9€5', '100€']
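The tokens returned this way still contain the € sign and possibly punctuation. One possible clean-up step is sketched below; the helper name to_number_text is hypothetical and the approach (drop everything that is not a digit or dot, then trim stray dots) is an assumption, not part of the original answer:

```python
import re

text = """The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5
from last monday. If they do not level up again to 100€ by the end of this week there might
be serious consequences to the company"""

def to_number_text(token):
    # Hypothetical helper: keep only digits and dots, then trim dots
    # left over from sentence punctuation.
    return re.sub(r"[^\d.]", "", token).strip(".")

# Grab every whitespace-delimited token containing a euro sign, then clean it.
cleaned = [to_number_text(t) for t in re.findall(r"\S*€\S*", text)]
print(cleaned)
```

The same helper also copes with tokens such as "(100€)." because brackets and the trailing dot are removed.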

Related

Capture values >= 20 $ (without cents, currency symbol and spaces)

I would like to capture $- (or £-)prices >= 20, without cents (pence), where the $ (£) may be in front of or after the value and the currency-symbol may be separated from the value by space(s) or not, e.g.:
$20
$3000
£ 60.67 (but only the '60'-part)
33$
500.99$ (but only the '500'-part)
90 £
Something like:
(?:[^\d][$£] ?)([\d]{3,}|[2-9][\d]{1})|([\d]{3,}|[2-9][\d]{1})(?: *.?[0-9]* ?[$£])
...which works, but I would like something simpler (or at least without the non-capturing (?: ) syntax, because it doesn't work with my regex browser highlight extension).
I would like to use this to highlight prices e.g. on Amazon via a regex browser extension. If you happen to know a good one (which possibly even supports (?: )-syntax) I'd be happy to hear your suggestions, too :-)
Many thanks in advance
You can use the following regex:
(?<=[$£])\s*([2-9]\d|\d{3,})(?=[\.\s])|(?<!\.)([2-9]\d|\d{3,})(?=(?:\.\d+)?\s*[$£])
which splits the matching in two cases:
case 1: your money symbol is found before the numeric value
case 2: your money symbol is found after the numeric value
"Case 1": (?<=[$£])\s*([2-9]\d|\d{3,})(?=[\.\s])
(?<=[$£]): preceded by the money symbol
\s*: optional spaces
([2-9]\d|\d{3,}): numeric values of 20 or more
(?=[\.\s]): followed by either a dot (if decimal) or whitespace
"Case 2": (?<!\.)([2-9]\d|\d{3,})(?=(?:\.\d+)?\s*[$£])
(?<!\.): match is not preceded by a dot
([2-9]\d|\d{3,}): numeric values of 20 or more
(?=(?:\.\d+)?\s*[$£]): followed by an optional dot and digits (the decimal part), optional spaces and the money symbol
Check the demo here.
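To see the two cases in action, here is a rough Python sketch (Python's re accepts these fixed-width lookbehinds). The constant PRICE, the helper prices and the sample sentence are made up for illustration. One limitation it makes visible: a price like "$20" at the very end of the text is missed, because case 1 requires a dot or whitespace after the number.

```python
import re

PRICE = re.compile(
    r"(?<=[$£])\s*([2-9]\d|\d{3,})(?=[\.\s])"         # case 1: symbol before the value
    r"|(?<!\.)([2-9]\d|\d{3,})(?=(?:\.\d+)?\s*[$£])"  # case 2: symbol after the value
)

def prices(text):
    # Exactly one of the two groups is set per match; use whichever matched.
    return [int(m.group(1) or m.group(2)) for m in PRICE.finditer(text)]

# Made-up sentence containing the question's sample prices plus two values below 20.
text = "$20 and $3000 or £ 60.67 or 33$ and 500.99$ and 90 £ but not $5 or 7$"
print(prices(text))

# Nothing follows the number here, so the lookahead of case 1 fails.
print(prices("$20"))
```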

Capturing the digit in a new line after whitespace

The string we have right now is:
DB GOALS: DISADVANTAGED BUSINESS ENTERPRISE - 6.0%
PROPOSALS ISSUED 9 FUND TOTAL , , 0
TOTAL NUMBER OF WORKING DAYS 30
NUMBER OF BIDDERS 4 ENGINEERS EST 1,674,885.00 AMOUNT OVER
177,014.00 PERCENT OVER EST 10.57
PROGRAM ELEMENTS
I am using the pattern (AMOUNT OVER|AMOUNT UNDER)[\n\r\s]+(?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S) but it does not capture
AMOUNT OVER
177,014.00
in the text. I suspect it is because of the whitespace before 177,014.00 because it works when we remove the whitespace.
Is there a way to capture it as it is? Thanks so much!
Here is the regex101.com link for reference.
You might simplify the pattern a bit to:
\b(AMOUNT (?:OVER|UNDER))\s+((?:\d{1,3}(?:,\d{3})*(?:\.\d\d)?))(?!\S)
Note that [\n\r\s]+ can be written as \s+
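A short Python sketch of the simplified pattern, assuming the second line of the report starts with a space as described in the question (the scraped sample above may have lost it). The variable names are illustrative only:

```python
import re

# Simplified pattern from the answer, unchanged.
AMOUNT = re.compile(r"\b(AMOUNT (?:OVER|UNDER))\s+((?:\d{1,3}(?:,\d{3})*(?:\.\d\d)?))(?!\S)")

# Two lines of the report; the leading space on line 2 is an assumption.
report = (
    "NUMBER OF BIDDERS 4 ENGINEERS EST 1,674,885.00 AMOUNT OVER\n"
    " 177,014.00 PERCENT OVER EST 10.57\n"
)

m = AMOUNT.search(report)
label, amount_text = m.groups()

# Drop the thousands separators before converting to a number.
amount = float(amount_text.replace(",", ""))
print(label, amount)
```

Since \s+ covers both the newline and the leading space, no separate handling of the line break is needed.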

Regex - lazy match first pattern occurrence, but no subsequent matching patterns

I need to return the first percentage, and only the first percentage, from each row in a file.
Each row may have one or two, but not more than two, percentages.
There may or may not be other numbers in the line, such as a dollar amount.
The percentage may appear anywhere in the line.
Ex:
Profits in California were down 10.00% to $100.00, a decrease from 22.6% the prior year.
Profits in New York increased by 0.9%.
Profits in Texas were up 1.58% an increase from last year's 0.58%.
I can write a regex to capture all occurrences:
[0-9]+\.[0-9]+[%]+?
https://regex101.com/r/owZaGE/1
The other SO questions I've perused only address this issue when the pattern is at the front of the line or always preceded by a particular set of characters
What am I missing?
/^.*?((?:\d+\.)?\d+%)/gm
works with a multiline flag, no negative lookbehind (some engines don't support non-fixed width lookbehinds). Your match will be in the capture group.
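For instance, in Python the same pattern could be applied to a whole multi-line string with re.MULTILINE, as in this small sketch (variable names are illustrative). With a single capture group, findall returns just the captured percentages, one per line:

```python
import re

# ^ anchors each line; .*? lazily skips ahead to the first percentage.
first_pct = re.compile(r"^.*?((?:\d+\.)?\d+%)", re.MULTILINE)

rows = """Profits in California were down 10.00% to $100.00, a decrease from 22.6% the prior year.
Profits in New York increased by 0.9%.
Profits in Texas were up 1.58% an increase from last year's 0.58%."""

firsts = first_pct.findall(rows)
print(firsts)
```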
Mine is similar to yours, except that I allowed numbers like 30% (without a decimal point):
\d+(\.\d+)?%
I don't know what language you are using, but in Python you can use re.search() to get the first occurrence.
Here is an example:
import re
pattern = r'\d+(\.\d+)?%'
string = 'Profits in California were down 10.00% to $100.00, a decrease from 22.6% the prior year.'
print(re.search(pattern, string).group())
I was able to solve using a negative lookbehind:
(?<!%.*?)([0-9]+\.[0-9]+[%]+?)
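Note that this relies on a variable-length lookbehind, which only some engines accept (.NET and recent JavaScript engines do, as far as I know). Python's built-in re module rejects it, as the sketch below shows; the fallback helper first_percentage is a hypothetical name and simply takes the first match per line:

```python
import re

lookbehind = r"(?<!%.*?)([0-9]+\.[0-9]+[%]+?)"

try:
    re.compile(lookbehind)
    supported = True
except re.error as exc:
    # Python's re requires fixed-width lookbehind patterns.
    supported = False
    print("not supported by re:", exc)

def first_percentage(line):
    # Fallback without lookbehind: re.search stops at the first match.
    m = re.search(r"[0-9]+\.[0-9]+%", line)
    return m.group() if m else None

line = "Profits in California were down 10.00% to $100.00, a decrease from 22.6% the prior year."
print(first_percentage(line))
```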

Regex to get any numbers after the occurrence of a string in a line

Hi guys, I'm trying to get a substring as well as the corresponding number from this string:
text = "Milk for human consumption may be taken only from cattle from 80 hours after the last treatment."
I want to select the word milk and the corresponding number 80 from this sentence. This is part of a larger file and I want a generic solution that finds the word milk in a line and then the first number that occurs after it anywhere in that line.
(Milk+)\d
This is what I came up with, thinking that I could make a group for milk and then check for digits, but I'm stumped on how to search for numbers anywhere on the line and not just immediately after the word milk. Also, is there any way to make the search case insensitive?
Edit: I'm looking to get both the word and the number if possible, e.g. "milk" "80", and I'm using Python.
/(?<!\p{L})([Mm]ilk)(?!\p{L})\D*(\d+)/
This matches the following strings, with the match and the contents of the two capture groups noted.
"The Milk99" # "Milk99" 1:"Milk" 2:"99"
"The milk99 is white" # "milk99" 1:"milk" 2:"99"
"The 8 milk is 99" # "milk is 99" 1:"milk" 2:"99"
"The 8milk is 45 or 73" # "milk is 45" 1:"milk" 2:"45"
The following strings are not matched.
"The Milk is white"
"The OJ is 99"
"The milkman is 37"
"Buttermilk is 99"
"MILK is 99"
This regular expression could be made self-documenting by writing it in free-spacing mode:
/
(?<!\p{L}) # the following match is not preceded by a Unicode letter
([Mm]ilk) # match 'M' or 'm' followed by 'ilk' in capture group 1
(?!\p{L}) # the preceding match is not followed by a Unicode letter
\D* # match zero or more characters other than digits
(\d+) # match one or more digits in capture group 2
/x # free-spacing regex definition mode
\D* could be replaced with .*?, where the ? makes the match non-greedy. If the greedy variant were used (.*), the second capture group for "The 8milk is 45 or 73" would contain "3".
To match "MILK is 99", change ([Mm]ilk) to (?i)(milk).
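Since the question asks for Python, here is a hedged adaptation. Python's built-in re module does not support \p{L}, so this sketch substitutes [^\W\d_] as an approximation of "a letter" and uses re.IGNORECASE instead of [Mm]; the names MILK and milk_number are made up. Because \D also matches newlines, it is meant to be applied one line at a time:

```python
import re

# [^\W\d_] stands in for \p{L}: a word character that is neither digit nor underscore.
MILK = re.compile(r"(?<![^\W\d_])(milk)(?![^\W\d_])\D*(\d+)", re.IGNORECASE)

def milk_number(line):
    # Returns (word, number) for the first hit on the line, or None.
    m = MILK.search(line)
    return m.groups() if m else None

text = "Milk for human consumption may be taken only from cattle from 80 hours after the last treatment."
print(milk_number(text))
print(milk_number("The milkman is 37"))
```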
This seems to work in Java the way you want (I overlooked that the question asks for Python, or it was edited later):
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String example =
"Test 40\n" +
"Test Test milk for human consumption may be taken only from cattle from hours after the last treatment." +
"\nTest Milk for human consumption may be taken only from cattle from 80 hours after the last treatment." +
"\nTest miLk for human consumption may be taken only from cattle from 80 hours after the last treatment.";
// (?i) makes "milk" case-insensitive; .*? stops at the first number on the same line
Matcher m = Pattern.compile("((?i)(milk).*?(\\d+).*\n?)+").matcher(example);
// Only read the groups if a match was found
if (m.find()) {
    System.out.print(m.group(2) + m.group(3));
}
Note how it tests whether the word "milk" appears, case-insensitively, anywhere before a number on the same line, and prints only those two values. It also prints only the first occurrence found (finding all occurrences is possible with small modifications to the code).
I hope the way it extracts both values from the match fits your task.
You should try this one
(Milk).*?(\d+)
Based on your language, you can also specify a case-insensitive search. Example in JS: /(Milk).*?(\d+)/i, the final i makes the search case insensitive.
Note the *?, the most important part! This is a lazy quantifier: it consumes as few characters as possible and stops as soon as the rest of the pattern can match. Here, that means stopping at the first digit after Milk. A plain greedy * would instead run to the end of the line and backtrack, so the group would capture only the final digit of the last number on the line.
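The difference between the lazy and the greedy quantifier can be checked with a quick Python sketch on the sentence from the question (re.IGNORECASE plays the role of the /i flag here; the variable names are illustrative):

```python
import re

text = "Milk for human consumption may be taken only from cattle from 80 hours after the last treatment."

# Lazy: .*? stops at the first digit, so the whole number is captured.
lazy = re.search(r"(milk).*?(\d+)", text, re.IGNORECASE)

# Greedy: .* runs to the end of the line and gives back only one digit.
greedy = re.search(r"(milk).*(\d+)", text, re.IGNORECASE)

print(lazy.groups())
print(greedy.groups())
```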

Capture the latest in backreference

I have this regex
(\b(\S+\s+){1,10})\1.*MY
and I want group 1 to capture "The name" from
The name is is The name MY
I get "is" for now.
The name can be any random words of any length.
It need not be at the beginning.
It need not be only 2 or 3 words. It can be less than 10 words.
Only thing sure is that it will be the last set of repeating words.
Examples:
The name is Anthony is is The name is Anthony - "The name is Anthony".
India is my country All Indians are India is my country - "India is my country "
Times of India Alphabet Google is the company Alphabet Google canteen - "Alphabet Google"
You could try:
(\b\w+[\w\s]+\b)(?:.*?\b\1)
Explanation -
(\b\w+[\w\s]+\b) is the capture group 1 - which is the text that is repeated - separated by word boundaries.
(?:.*?\b\1) is a non-capturing group which tells the regex system to match the text in group 1, only if it is followed by zero-or-more characters, a word-boundary, and the repeated text.
Regex generally captures the longest leftmost match. There are no examples in your question where this would not actually be the string you want, but that could just mean you have not found good examples to show us.
With that out of the way,
((\S+\s)+)(\S+\s){0,9}\1
would appear to match your requirements as currently stated. The "longest leftmost" behavior could still get in the way if there are e.g. straddling repetitions, like
this that more words this that more words
where in the general case regex alone cannot easily be made to always prefer the last possible match and tolerate arbitrary amounts of text after it.
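As a rough check, both suggested patterns can be tried in Python on two of the question's examples. This is only a sketch: the dictionary labels and the helper repeated_phrase are made up, and the captured group is stripped because both patterns include the trailing space in group 1. It does not cover the straddling case described above.

```python
import re

# The two patterns suggested in the answers (labels are illustrative).
candidates = {
    "word-boundary version": r"(\b\w+[\w\s]+\b)(?:.*?\b\1)",
    "token version": r"((\S+\s)+)(\S+\s){0,9}\1",
}

samples = [
    "The name is is The name MY",
    "Times of India Alphabet Google is the company Alphabet Google canteen",
]

def repeated_phrase(pattern, text):
    # Group 1 holds the repeated words; strip the trailing whitespace.
    m = re.search(pattern, text)
    return m.group(1).strip() if m else None

results = {
    name: [repeated_phrase(p, s) for s in samples]
    for name, p in candidates.items()
}
for name, phrases in results.items():
    print(name, phrases)
```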