Extracting data from a text file using Regex - regex

I have a document which I am trying to extract information from using a Regex extractor.
I am trying to extract the values for Option 2 and Option 3,
i.e.:
Option 2 will return €6644
Option 3 will return €8532
As both of them contain the same text, "I wish to withdraw", I would appreciate any help on how I can write my regex statement so that it extracts the amount.
So far I have:
((?<=Option\s2\s-\sWithdraw\sa\sspecified\samount\by\sfully\scashing\sin\spolicies\s\sI\swish\sto\scash\sin)[a-zA-Z0-9-\s]{30})
which doesn't bring anything back.
Any help would be greatly appreciated.
<b>Text : </b>
<p>
Option 2 - Withdraw a specified amount by fully cashing in policies
I wish to withdraw €6644 (insert amount and currency)
(Please note that we will cash in the appropriate number of policies to reach the closest possible figure below the amount you require.
The balance will then be taken across all the remaining policies.
Please specify your fund choices for this balance overleaf.)
Please note: If you've invested in a PruFund Protected Fund, cashing in policies will erode the Guaranteed Minimum Fund.
Notes
1 For information on withdrawal limits, please see your Key Features Document.
2
At least £500, €750 or US$750 must remain invested in each fund you hold.
3 If you have invested in one of the PruFund Range of Funds, withdrawals may be subject to a 28-day delay.
If you also hold other funds, this could mean your withdrawal is made in two payments.
Option 3 - Withdraw a specified amount from across all policies
I wish to withdraw
€8532
(insert amount and currency) from across all the policies in my bond.
Please specify your fund choices below.
</P>

Try:
Option 2[\s\S]*?(€\d+)[\s\S]*?Option 3[\s\S]*?(€\d+)
See regex demo
Option 2 - Matches 'Option 2'
[\s\S]*? - Non-greedily matches 0 or more characters
(€\d+) - Matches '€' followed by one or more digits in Group 1
[\s\S]*? - Non-greedily matches 0 or more characters
Option 3 - Matches 'Option 3'
[\s\S]*? - Non-greedily matches 0 or more characters
(€\d+) - Matches '€' followed by one or more digits in Group 2
The two numbers are returned in Groups 1 and 2.
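A minimal Python sketch of the pattern above, using the standard `re` module. The `text` variable is a shortened copy of the form from the question and is only an assumed sample; the variable names are illustrative.

```python
import re

# Shortened copy of the form text from the question (assumed sample input).
text = """Option 2 - Withdraw a specified amount by fully cashing in policies
I wish to withdraw €6644 (insert amount and currency)
Notes
Option 3 - Withdraw a specified amount from across all policies
I wish to withdraw
€8532
(insert amount and currency) from across all the policies in my bond.
"""

# [\s\S]*? lazily skips any characters, including newlines,
# until the next euro amount or the next "Option" heading.
pattern = re.compile(r"Option 2[\s\S]*?(€\d+)[\s\S]*?Option 3[\s\S]*?(€\d+)")

match = pattern.search(text)
option_2 = match.group(1)  # amount following "Option 2"
option_3 = match.group(2)  # amount following "Option 3"
print(option_2, option_3)
```

Note that the pattern only recognises amounts written with a € sign; other currencies would need the character added to the capture groups.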

Related

Extract £ or % value from Google Sheets with REGEXEXTRACT

I have a Google Sheets formula that extracts a £ currency value or a percentage discount from a block of text.
=REGEXEXTRACT(B2,"[\d,.£%]+") - Extracts £ value or % discount (but other numbers too)
=REGEXEXTRACT(B2,"[\d,.]+") - Extracts digits, commas, or periods
However, if the text contains any other numbers before the £ value or % discount, they get extracted first.
How can I only extract the £ value or % discount from each cell in Google Sheets?
Discounts are displayed with at most 2 decimal places, which may help in building a formula to extract 4 digits left or right of the value.
EXAMPLE DATA
Amy Wills 44% Discount
1Direction Food 45.37% Discount
AllUnder20 £120 Commission
AATU 13.31% Discount
Tickets4You £70 Commission
AllAboutU £7 Commission
Andrea Cardini 4% Discount
You can use
=JOIN("", REGEXEXTRACT(B2, "£(\d+(?:[.,]\d+)?)|(\d+(?:[.,]\d+)?)%"))
Details:
£(\d+(?:[.,]\d+)?) - matches a £ and then captures into Group 1 one or more digits, optionally followed by a . or , and one or more digits
| - or
(\d+(?:[.,]\d+)?)% - captures into Group 2 one or more digits, optionally followed by a . or , and one or more digits, and then matches a %.
See the RE2 regex demo.
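The same alternation can be tried outside Sheets. The sketch below uses Python's `re` module rather than RE2, so it only approximates what `REGEXEXTRACT` does; `extract_value` is a hypothetical helper name, and joining the groups mimics the `JOIN("", ...)` wrapper, since only one of the two groups participates in any match.

```python
import re

# Group 1: number after a pound sign; Group 2: number before a percent sign.
PATTERN = re.compile(r"£(\d+(?:[.,]\d+)?)|(\d+(?:[.,]\d+)?)%")

def extract_value(cell: str) -> str:
    """Return the £ amount or % discount found in the cell, or "" if none."""
    m = PATTERN.search(cell)
    if m is None:
        return ""
    # Only one group is set per match; the other is None.
    return "".join(g or "" for g in m.groups())

for cell in ["Amy Wills 44% Discount",
             "1Direction Food 45.37% Discount",
             "AllUnder20 £120 Commission"]:
    print(cell, "->", extract_value(cell))
```

Digits that are not attached to a £ or % (such as the leading 1 in 1Direction or the 20 in AllUnder20) are skipped because neither alternative can complete at those positions.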
Based on your samples, this should work.
=SUMPRODUCT(N(SPLIT(B2," ")))
You can see it at work here in cell C2.

SSN and 9 number screening issues

This regex is looking for Social Security Numbers (SSNs) in several formats, but it also ignores obviously non-valid SSNs like 123-45-6789 or 000-00-0000, etc.
This expression should find a Social Security Number that:
Contains any non-numeric delimiter (i.e. ###-##-####, ###.##.####, or ### ### ####)
It should also catch 9 digits in sequence with no delimiter, but bounded by whitespace (i.e. `text ### text`, or `### ######### ###`)
This expression will ignore a Social Security Number that:
Contains all zeroes in any specific group (i.e. 000-##-####, ###-00-####, or ###-##-0000)
Begins with 666
Begins with any value from 900-999
Is equal to 078-05-1120 (due to the Woolworth's Wallet Fiasco)
Is equal to 219-09-9999 (appeared in an advertisement for the Social Security Administration)
Contains all matching values(i.e. 000-00-0000, 111-11-1111, 222-22-2222, etc.)
Contains all incrementing values (i.e. 123-45-6789)
Regex
(#"(?!\b(\d)\1+\D?(\d)\1+\D?(\d)\1+\b)(?!123\D?45\D?6789|219\D?09\D?9999|078\D?05\D?1120)(?!666|000|9\d{2})(?<!\d)\d{3}\D?(?!00)\d{2}\D?(?!0{4})\d{4}(?!\d)(?<!\d{5}-\d{4})",
The problem is that it also catches other entries that merely resemble SSNs; the expression needs to be specific enough that these aren't caught.
Such as -
(xxxx) xxx-xx-xxxx
684072943 (an order number, etc.)
FA300217F0090
Potential Match #1:--------------- nt: ex: 201[[71230 0821]] am ex: 201[[71230 0821]] am 26 JUNE 2012 ---------------Potential Match #2:--------------- am ex: 201[[71230 0821]] am 26 JUNE 2012 DTG (date time group)
"[[ 210v13:2012]],"
Any ideas?
You can use \D? to match any non-digit as your delimiter. This would be a more simplified SSN validator:
^(?!219\D?09\D?9999|078\D?05\D?1120)(?!666|000|9\d{2})\d{3}\D?(?!00)\d{2}\D?(?!0{4})\d{4}$
This article might be helpful: http://rion.io/2013/09/10/validating-social-security-numbers-through-regular-expressions-2/
The article also gives a more over-the-top solution, which may be what you are looking for:
^(?!\b(\d)\1+\D?(\d)\1+\D?(\d)\1+\b)(?!123\D?45\D?6789|219\D?09\D?9999|078\D?05\D?1120)(?!666|000|9\d{2})\d{3}\D?(?!00)\d{2}\D?(?!0{4})\d{4}$
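As a rough check of the simplified validator, here is a Python sketch. The pattern is the simplified one from this answer, written with `\d{3}` for the first group; `looks_like_ssn` is a hypothetical helper name, and the sample inputs are made up for illustration.

```python
import re

SSN = re.compile(
    r"^(?!219\D?09\D?9999|078\D?05\D?1120)"  # two well-known invalid numbers
    r"(?!666|000|9\d{2})\d{3}\D?"            # first group: not 666, 000 or 9xx
    r"(?!00)\d{2}\D?"                        # second group: not 00
    r"(?!0{4})\d{4}$"                        # third group: not 0000
)

def looks_like_ssn(candidate: str) -> bool:
    """True if the whole string passes the simplified SSN pattern."""
    return SSN.match(candidate) is not None

for s in ["123-45-6788", "123 45 6788", "078-05-1120",
          "666-12-3456", "123-00-4567", "684072943", "FA300217F0090"]:
    print(s, looks_like_ssn(s))
```

Because the pattern is anchored with ^ and $, it validates a whole field rather than scanning free text, so a string such as FA300217F0090 is rejected. A bare nine-digit order number such as 684072943 still passes, since nothing in the digits themselves distinguishes it from an SSN; ruling those out would need context from the surrounding text.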

Capture the latest in backreference

I have this regex
(\b(\S+\s+){1,10})\1.*MY
and I want group 1 to capture "The name" from
The name is is The name MY
I get "is" for now.
The name can be any random words of any length.
It need not be at the beginning.
It need not be only 2 or 3 words. It can be anything less than 10 words.
The only sure thing is that it will be the last set of repeating words.
Examples:
The name is Anthony is is The name is Anthony - "The name is Anthony".
India is my country All Indians are India is my country - "India is my country "
Times of India Alphabet Google is the company Alphabet Google canteen - "Alphabet Google"
You could try:
(\b\w+[\w\s]+\b)(?:.*?\b\1)
As demonstrated here
Explanation -
(\b\w+[\w\s]+\b) is the capture group 1 - which is the text that is repeated - separated by word boundaries.
(?:.*?\b\1) is a non-capturing group which tells the regex system to match the text in group 1, only if it is followed by zero-or-more characters, a word-boundary, and the repeated text.
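A small Python sketch of this pattern, run against two of the sample strings from the question. The capture can carry a trailing space (the character class allows whitespace), so the hypothetical helper `repeated_phrase` strips it before returning.

```python
import re

# Group 1 is the repeated phrase; the non-capturing group requires
# that the same text appears again later in the string.
pattern = re.compile(r"(\b\w+[\w\s]+\b)(?:.*?\b\1)")

def repeated_phrase(text: str):
    """Return the repeated phrase captured by group 1, or None."""
    m = pattern.search(text)
    return m.group(1).strip() if m else None

print(repeated_phrase("The name is is The name MY"))
print(repeated_phrase("The name is Anthony is is The name is Anthony"))
```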
Regex generally captures the longest leftmost match. There are no examples in your question where this would not actually be the string you want, but that could just mean you have not found good examples to show us.
With that out of the way,
((\S+\s)+)(\S+\s){0,9}\1
would appear to match your requirements as currently stated. The "longest leftmost" behavior could still get in the way if there are e.g. straddling repetitions, like
this that more words this that more words
where in the general case regex alone cannot easily be made to always prefer the last possible match and tolerate arbitrary amounts of text after it.
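To see how the second pattern behaves, here is a hedged Python sketch; `repeated_phrase` is a hypothetical helper, and the second input is the straddling example from above.

```python
import re

# Group 1: one or more whitespace-terminated words.
# Then up to nine other words, then the same text as group 1 again.
pattern = re.compile(r"((\S+\s)+)(\S+\s){0,9}\1")

def repeated_phrase(text: str):
    """Return group 1 without its trailing whitespace, or None."""
    m = pattern.search(text)
    return m.group(1).strip() if m else None

print(repeated_phrase("The name is is The name MY"))
print(repeated_phrase("this that more words this that more words"))
```

In the straddling example the capture stops short of the final "words": every word in group 1 has to be followed by whitespace, and the last word of the string is not, so the longest phrase that can repeat is "this that more".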

Regex in Hive QL (RLIKE) - performance?

I'm wondering how, or whether, I can improve the regex I'm using in a query. I have a set of identifiers for certain user groups. They can be in two main formats:
X123 or XY12, (type 1)
any two letter combo, excluding XY (type 2)
Type 1 groups always are of length 4. It's either letter X followed by a number between 100 and 999 (inclusive) OR XY followed by numbers between 0 and 99 (padded to length 2 with zeros).
Type 2 groups are 2 letter strings, with any letter allowed, excluding XY (although my query doesn't specify this).
A user can belong to multiple groups, in which case different groups are separated by a pound symbol (#). Here's an example:
groups      user     age
X124        john     23
XY22#AB     mike     33
AB          peter    21
X122#XY01   francis  43
I want to count rows in which at least one group in second format appears, i.e. where user is not exclusively member of groups in first format.
I need to catch all rows (i.e. users) which don't belong exclusively to first type of groups. In the example above, I want to exclude users john and francis because they are members only of type 1 groups.
On the other hand, mike is OK because he's member of AB group (i.e. group of type 2).
I'm currently doing it like this:
select
count(*)
from
users
where
groups not rlike '^(X[Y1-9][0-9]{2,2})(#X[Y1-9][0-9]{2,2})*$'
Is this bad performance-wise? And how should I approach fixing it?
I want to count rows in which at least one group in second format appears.
It seems a bit simpler, then, to select the rows where groups matches (rlike):
\b(?:(?!XY)[A-Z]{2})\b
\b is a word boundary. It doesn't consume a character; instead it asserts that the position sits between a word character (letter, digit, or underscore) and a non-word character such as #, or the start or end of the string.
Live demo.
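The two approaches can be compared on the sample rows from the question. This is a Python sketch using the `re` module, not Hive itself, so it says nothing about Hive performance; it only checks that both patterns select the same users. The row data is copied from the question and the variable names are illustrative.

```python
import re

# (groups, user) pairs from the example table in the question.
rows = [
    ("X124", "john"),
    ("XY22#AB", "mike"),
    ("AB", "peter"),
    ("X122#XY01", "francis"),
]

# Original approach: the whole string consists only of type 1 groups.
only_type1 = re.compile(r"^(X[Y1-9][0-9]{2})(#X[Y1-9][0-9]{2})*$")
# Suggested approach: some two-letter group other than XY appears.
has_type2 = re.compile(r"\b(?:(?!XY)[A-Z]{2})\b")

by_negation = [user for groups, user in rows if not only_type1.search(groups)]
by_positive = [user for groups, user in rows if has_type2.search(groups)]

print(by_negation)
print(by_positive)
```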

Regex - Alteryx - Parse - How to find an expression starting by the end of the string

I need to parse the following expression:
Fertilizer abc 7-15-15 5KG BOX 250 KG
in 3 fields:
The product description: Fertilizer abc 7-15-15
Size: 250
Size unit: KG
Do not know how to proceed. Please, any help and explanation?
Try this in the Alteryx RegEx tool with Parse selected as the Method:
([A-Za-z ]* [\d-]{6,8}) ([A-Z\d]{2,6}) (.{1,5}?) (\d*) ([A-Z]*)
You can test it at Regexpal to see the breakdown of each group. Essentially, the first set of brackets gets you the product description (letters and spaces, then 6-8 characters made up of digits and dashes), the 2nd and 3rd groups absorb the extra information that you don't want, the 4th group is just digits, and the 5th group is any uppercase text after it.
Note that this will change dramatically if, for example, your data has digits where there are currently letters.
You can always break it up into even smaller groups and then concatenate back together as well.
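For a quick check outside Alteryx, the expression can be run in Python. This is only a sketch: it uses Python's `re` module rather than the Alteryx RegEx tool, writes the first character class as `[A-Za-z ]` (the range `A-z` would also admit a few punctuation characters), and uses the single sample string from the question.

```python
import re

pattern = re.compile(
    r"([A-Za-z ]* [\d-]{6,8})"  # 1: description ending in the digit-dash code
    r" ([A-Z\d]{2,6})"          # 2: unwanted token, e.g. 5KG
    r" (.{1,5}?)"               # 3: unwanted token, e.g. BOX
    r" (\d*)"                   # 4: size
    r" ([A-Z]*)"                # 5: size unit
)

m = pattern.search("Fertilizer abc 7-15-15 5KG BOX 250 KG")
description, size, unit = m.group(1), m.group(4), m.group(5)
print(description, "|", size, "|", unit)
```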