Extract £ or % value from Google Sheets with REGEXEXTRACT - regex

I have a Google Sheets formula that extracts a £ currency value or a percentage discount from a block of text.
=REGEXEXTRACT(B2,"[\d,.£%]+") - Extracts £ value or % discount (but other numbers too)
=REGEXEXTRACT(B2,"[\d,.]+") - Extracts digits, commas, or periods
However, if the text contains any other numbers before the £ value or % discount, they get extracted first.
How can I only extract the £ value or % discount from each cell in Google Sheets?
The discount is shown to a maximum of 2 decimal places, which may help in building a formula to extract 4 digits left or right of the value.
EXAMPLE DATA
Amy Wills 44% Discount
1Direction Food 45.37% Discount
AllUnder20 £120 Commission
AATU 13.31% Discount
Tickets4You £70 Commission
AllAboutU £7 Commission
Andrea Cardini 4% Discount

You can use
=JOIN("", REGEXEXTRACT(B2, "£(\d+(?:[.,]\d+)?)|(\d+(?:[.,]\d+)?)%"))
Details:
£(\d+(?:[.,]\d+)?) - matches a £, then captures into Group 1 one or more digits, optionally followed by a . or , and one or more digits
| - or
(\d+(?:[.,]\d+)?)% - captures into Group 2 one or more digits, optionally followed by a . or , and one or more digits, then matches a %
REGEXEXTRACT returns one value per capturing group, and the group that did not take part in the match is empty, so JOIN("", ...) collapses the result into a single cell.
See the RE2 regex demo.
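For readers who want to check the pattern outside Sheets, here is a minimal Python sketch of the same two-alternative idea, run against the sample rows from the question. It assumes Python's re module treats this particular pattern the same way RE2 does. The helper name extract_value is made up for illustration, and the `or` between the two groups plays the role that JOIN plays in the formula.

```python
import re

# Same pattern as in the formula: Group 1 is the number after a £,
# Group 2 is the number before a %.
PATTERN = re.compile(r"£(\d+(?:[.,]\d+)?)|(\d+(?:[.,]\d+)?)%")


def extract_value(text):
    """Return the first £ amount or % value in text, or None if there is none.

    extract_value is a hypothetical helper name, not part of any library.
    """
    match = PATTERN.search(text)
    if match is None:
        return None
    # Only one of the two groups takes part in a match; the other is None.
    return match.group(1) or match.group(2)


samples = [
    "Amy Wills 44% Discount",
    "1Direction Food 45.37% Discount",
    "AllUnder20 £120 Commission",
    "AATU 13.31% Discount",
    "Tickets4You £70 Commission",
    "AllAboutU £7 Commission",
    "Andrea Cardini 4% Discount",
]

for sample in samples:
    print(sample, "->", extract_value(sample))
```

Digits that are not attached to a £ or % (the 1 in 1Direction, the 20 in AllUnder20, the 4 in Tickets4You) are skipped because neither alternative can match them.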

Based on your samples, this should work.
=SUMPRODUCT(N(SPLIT(B2," ")))
You can see it at work here in cell C2.

Related

Regex - lazy match first pattern occurrence, but no subsequent matching patterns

I need to return the first percentage, and only the first percentage, from each row in a file.
Each row may have one or two, but not more than two, percentages.
There may or may not be other numbers in the line, such as a dollar amount.
The percentage may appear anywhere in the line.
Ex:
Profits in California were down 10.00% to $100.00, a decrease from 22.6% the prior year.
Profits in New York increased by 0.9%.
Profits in Texas were up 1.58% an increase from last year's 0.58%.
I can write a regex to capture all occurrences:
[0-9]+\.[0-9]+[%]+?
https://regex101.com/r/owZaGE/1
The other SO questions I've perused only address this issue when the pattern is at the front of the line or is always preceded by a particular set of characters.
What am I missing?
/^.*?((?:\d+\.)?\d+%)/gm
This works with the multiline flag and needs no negative lookbehind (some engines don't support non-fixed-width lookbehinds). Your match will be in the capture group.
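A minimal Python sketch of this approach, using the three example lines from the question. It assumes Python's re module with re.MULTILINE as the equivalent of the m flag. The variable names are illustrative only.

```python
import re

text = (
    "Profits in California were down 10.00% to $100.00, "
    "a decrease from 22.6% the prior year.\n"
    "Profits in New York increased by 0.9%.\n"
    "Profits in Texas were up 1.58% an increase from last year's 0.58%."
)

# ^ anchors each match to the start of a line (MULTILINE), the lazy .*?
# stops at the first percentage, and findall returns the capture group.
first_percentages = re.findall(r"^.*?((?:\d+\.)?\d+%)", text, flags=re.MULTILINE)
print(first_percentages)  # ['10.00%', '0.9%', '1.58%']
```

Because every match has to begin at the start of a line, the second percentage on a line can never start a new match, so findall yields at most one value per line.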
Mine is similar to yours, except that I allow numbers like 30% (without a decimal point):
\d+(\.\d+)?%
I don't know what language you are using, but in Python you can use re.search() to get the first occurrence.
Here is an example:
import re
pattern = r'\d+(\.\d+)?%'
string = 'Profits in California were down 10.00% to $100.00, a decrease from 22.6% the prior year.'
print(re.search(pattern, string).group())
I was able to solve using a negative lookbehind:
(?<!%.*?)([0-9]+\.[0-9]+[%]+?)

Extracting data from a text file using Regex

I have a document which I am trying to extract information from using a Regex extractor.
I am trying to extract the value for Option 2 and Option 3
ie:
Option 2 will return €6644
Option 3 will return €8532
As both of them contain the same text ("I wish to withdraw"), I would appreciate any help writing a regex that extracts the amount.
So far I have:
((?<=Option\s2\s-\sWithdraw\sa\sspecified\samount\by\sfully\scashing\sin\spolicies\s\sI\swish\sto\scash\sin)[a-zA-Z0-9-\s]{30})
This doesn't return anything.
Any help would be greatly appreciated
<b>Text : </b>
<p>
Option 2 - Withdraw a specified amount by fully cashing in policies
I wish to withdraw €6644 (insert amount and currency)
(Please note that we will cash in the appropriate number of policies to reach the closest possible figure below the amount you require.
The balance will then be taken across all the remaining policies.
Please specify your fund choices for this balance overleaf.)
Please note: If you've invested in a PruFund Protected Fund, cashing in policies will erode the Guaranteed Minimum Fund.
Notes
1 For information on withdrawal limits, please see your Key Features Document.
2
At least £500, €750 or US$750 must remain invested in each fund you hold.
3 If you have invested in one of the PruFund Range of Funds, withdrawals may be subject to a 28-day delay.
If you also hold other funds, this could mean your withdrawal is made in two payments.
Option 3 - Withdraw a specified amount from across all policies
I wish to withdraw
€8532
(insert amount and currency) from across all the policies in my bond.
Please specify your fund choices below.
</P>
Try:
Option 2[\s\S]*?(€\d+)[\s\S]*?Option 3[\s\S]*?(€\d+)
See regex demo
Option 2 - Matches 'Option 2'
[\s\S]*? - Non-greedily matches 0 or more characters
(€\d+) - Matches '€' followed by one or more digits in Group 1
[\s\S]*? - Non-greedily matches 0 or more characters
Option 3 - Matches 'Option 3'
[\s\S]*? - Non-greedily matches 0 or more characters
(€\d+) - Matches '€' followed by one or more digits in Group 2
The two numbers are returned in Groups 1 and 2.
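A minimal Python sketch of the pattern, applied to an abbreviated copy of the text from the question (only the lines that matter for the match are kept, so the shortened text is an assumption about what can safely be omitted):

```python
import re

# Abbreviated version of the document text from the question.
text = """Option 2 - Withdraw a specified amount by fully cashing in policies
I wish to withdraw €6644 (insert amount and currency)
Notes
1 For information on withdrawal limits, please see your Key Features Document.
Option 3 - Withdraw a specified amount from across all policies
I wish to withdraw
€8532
(insert amount and currency) from across all the policies in my bond.
"""

pattern = re.compile(r"Option 2[\s\S]*?(€\d+)[\s\S]*?Option 3[\s\S]*?(€\d+)")

match = pattern.search(text)
# Guard against a document that lacks one of the options or amounts.
amounts = match.groups() if match else None
print(amounts)  # ('€6644', '€8532')
```

[\s\S] is used instead of . so that the gap between a heading and its amount may span line breaks, as it does for Option 3.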

How to find all currency related digits REGEX?

For a string that has free text:
"The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5
from last monday. If they do not level up again to 100€ by the end of this week there might
be serious consequences to the company"
How to find a regex pattern that will extract currency related numbers?
In this case: 89.99, 95, and 100?
So far, I've tried these patterns:
[0-9]*[€.]([0-9]*)
\[0-9]{1,3}(?:\.\[0-9]{3})*,\[0-9]\[0-9]
[0-9]+\€\.[0-9]+
But these don't seem to be producing exactly what is needed
A simpler solution would be: [.\d]*€[.\d]*
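A quick Python sketch of how that simpler pattern behaves on the text from the question. Removing the euro sign afterwards is an extra step added here for illustration, and the variable names are made up.

```python
import re

text = (
    "The shares of the stock at the XKI Market fell by €89.99 today, "
    "which saw a drop of a 9€5\n"
    "from last monday. If they do not level up again to 100€ by the end "
    "of this week there might\n"
    "be serious consequences to the company"
)

# Grab any run of digits and dots on either side of a euro sign.
matches = re.findall(r"[.\d]*€[.\d]*", text)
print(matches)  # ['€89.99', '9€5', '100€']

# Drop the euro sign to keep only the numeric part.
amounts = [m.replace("€", "") for m in matches]
print(amounts)  # ['89.99', '95', '100']
```

One caveat: because the character class contains a dot, a sentence-ending period directly after the amount (for example "100€.") becomes part of the match and would need trimming.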
One option is to match all 3 variations and afterwards remove the euro sign from the match.
(?:\d+€\d*|€\d+(?:\.\d+)?)
Explanation
(?: Non capture group
\d+€\d* Match 1+ digit and € followed by optional digits
| Or
€\d+(?:\.\d+)? Match € followed by digits and an optional decimal part
) Close non capture group
Regex demo
For example
import re
regex = r"(?:\d+€\d*|€\d+(?:\.\d+)?)"
test_str = ("\"The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5 \n"
"from last monday. If they do not level up again to 100€ by the end of this week there might \n"
"be serious consequences to the company\"")
print([x.replace("€", "") for x in re.findall(regex, test_str)])
Output
['89.99', '95', '100']
A bit more precise pattern for the number with optional comma followed by 3 digits and 2 digit decimal part could be:
(?:\d+€\d*|€\d{1,3}(?:,\d{3})*\.\d{2})
Regex demo
This needs further testing, but I would simply grab everything around € that is not whitespace, that is:
import re
text = """The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5
from last monday. If they do not level up again to 100€ by the end of this week there might
be serious consequences to the company"""
values = re.findall(r"\S*€\S*", text)
print(values)
Output:
['€89.99', '9€5', '100€']

Regex - Alteryx - Parse - How to find an expression starting by the end of the string

I need to parse the following expression:
Fertilizer abc 7-15-15 5KG BOX 250 KG
in 3 fields:
The product description: Fertilizer abc 7-15-15
Size: 250
Size unit: KG
Do not know how to proceed. Please, any help and explanation?
Try this in the Alteryx RegEx tool with Parse selected as the method:
([A-Za-z ]* [\d-]{6,8}) ([A-Z\d]{2,6}) (.{1,5}?) (\d*) ([A-Z]*)
You can test it at Regexpal to see the breakdown of each group. Essentially, the first group captures the product description (letters and spaces, followed by 6-8 characters made up of digits and dashes), the 2nd and 3rd groups absorb the extra information you don't want, the 4th group is just the digits, and the 5th group is any uppercase text after them.
Note that this will change dramatically if your data has digits where there are currently letters, and so on.
You can always break it up into even smaller groups and then concatenate back together as well.
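To see which group ends up with which piece, here is a small Python sketch that runs a five-group pattern of this shape against the sample string. It is only an approximation of what the Alteryx tool does, and it writes the letter class as [A-Za-z ] (the range A-z would also admit a few punctuation characters). The field names are made up for illustration.

```python
import re

pattern = re.compile(
    r"([A-Za-z ]* [\d-]{6,8}) "  # 1: description ending in e.g. 7-15-15
    r"([A-Z\d]{2,6}) "           # 2: unwanted token (5KG)
    r"(.{1,5}?) "                # 3: unwanted token (BOX)
    r"(\d*) "                    # 4: size
    r"([A-Z]*)"                  # 5: size unit
)

line = "Fertilizer abc 7-15-15 5KG BOX 250 KG"
match = pattern.match(line)

if match:
    description, _, _, size, unit = match.groups()
    print(description)  # Fertilizer abc 7-15-15
    print(size)         # 250
    print(unit)         # KG
```

Groups 2 and 3 are matched but discarded, which corresponds to leaving those output columns unused in the Parse configuration.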

Regex for any non-zero number

I'm using regex in Notepad++ (a recent version at the time of posting, Nov 2016).
I'm working on a problem where I have dozens of text files containing financial information on wage and commission earnings. I need to find all employees that earned a commission.
Files are formatted as such:
emp001smithj20150000095000
is an example of no commission earned last year (position 17, 5 chars = "00000")
emp002jonest20151752545000
is an example of $17525 commission earned last year (position 17, 5 chars = "17525")
I've tried...
^.{16}\b[1-9]{5}\b
The rationale is that I want any non-zero five-character number, delimited by word boundaries, that starts after the first 16 characters, but no luck. I'm obviously missing something!
^.{16}(?!00000)
Use a negative lookahead to assert that the next five characters are not 00000.
You shouldn't use word boundaries here, as your numbers are surrounded by other digits. You need a lookahead to check that your 5-digit number doesn't consist of zeros only; one way would be:
^.{16}(?=0{0,4}[1-9])\d{5}
You also shouldn't try to match [1-9]{5}, because it won't match 15000, for example.
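A small Python sketch comparing the lookahead pattern with the [1-9]{5} idea. The first two records come from the question; the third (emp003) is a hypothetical record invented here to show a commission that contains zeros. The helper name earned_commission is also made up.

```python
import re

# Skip 16 characters, then require at least one non-zero digit
# somewhere in the next five digits.
LOOKAHEAD = re.compile(r"^.{16}(?=0{0,4}[1-9])\d{5}")

# The idea from the question, without the word boundaries:
# every one of the five digits must be non-zero.
NAIVE = re.compile(r"^.{16}[1-9]{5}")

records = [
    "emp001smithj20150000095000",  # from the question: no commission
    "emp002jonest20151752545000",  # from the question: commission 17525
    "emp003brownk20151500045000",  # hypothetical: commission 15000
]


def earned_commission(record):
    """True if the 5-digit commission field is not all zeros."""
    return LOOKAHEAD.match(record) is not None


for record in records:
    print(record, earned_commission(record), NAIVE.match(record) is not None)
```

The hypothetical emp003 record is accepted by the lookahead version but rejected by [1-9]{5}, which is exactly the case described above.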