Regular Expression for group - regex

I have a text where I need to find three groups of strings.
I tried the expression \r?\n\r?\n\r?[0-9A-Z].*\d{7} but it finds only 2 strings instead of 3.
It should highlight 00170784, HEDINV, and 00173575, but I get only 00170784 and 00173575.
This is the text:
BUY
USM4
200 contracts
04/28/2014 15:50
00170784
56
contracts
HEDINV
64
contracts
00173575
80
contracts
At average price of USD 134.375
SELL
USM4
200 contracts
04/28/2014 15:50
00170784
56
contracts
HEDINV
64
contracts
00173575
80
contracts
At average price of USD 134.5938

May I suggest using this instead?
^\d{8}$|^[A-Z]{6}$
It has two alternatives. One matches an 8-digit sequence that fills a whole line; the other matches a 6-letter uppercase sequence that fills a whole line. With multiline mode enabled (so that ^ and $ match at every line break), that grabs what you're looking for, unless there's a specific reason you're using all those line-break matches.
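As a quick check of the suggested pattern, here is a minimal Python sketch. It assumes the text is available as one multi-line string and uses only the BUY block from the question; the names TEXT and LINE_PATTERN are made up for this illustration.

```python
import re

# One BUY block from the question, one item per line.
TEXT = """BUY
USM4
200 contracts
04/28/2014 15:50
00170784
56
contracts
HEDINV
64
contracts
00173575
80
contracts
At average price of USD 134.375"""

# re.MULTILINE makes ^ and $ match at every line boundary,
# not only at the start and end of the whole text.
LINE_PATTERN = re.compile(r"^\d{8}$|^[A-Z]{6}$", re.MULTILINE)

# The pattern has no capture groups, so findall returns the full matches.
matches = LINE_PATTERN.findall(TEXT)
print(matches)
```

Lines such as BUY or USM4 are skipped because they are not exactly six uppercase letters or eight digits.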

Related

Extract the last 1 or 2 alphabetic character(s) from a quantity (mg,g,ml,l,cm,mm,m) when preceded by a digit

I need to be able to populate a cell in a Google Sheets spreadsheet with the measurement units extracted from the end of a string value in another cell. The raw data comes through with every source cell ending with a measurement unit, either preceded with a numeric value or not, as in the example data below...
SAMPLE DATA:
Colgate Plax Spearmint Alcohol Free Mouthwash 500ml
Peckish Tangy BBQ Rice Crackers 100g
Alison's Pantry BBQ Chickpea Snacks kg
Yoghurt Raisins Miscellaneous Confectionery kg
Roasted Unsalted Supreme Mixed Nuts kg
Alison's Pantry Honey & Dijon Snippets kg
Banana Chips kg
Sealord Satay Tuna 95g
Sealord Savoury Onion Tuna 95g
Coca-Cola No Sugar Soft Drink 2.25l
Tongariro Natural Spring 15l
Trident Sweet Chilli Sauce With Ginger 285ml
Pams Lite Whole Egg Mayonnaise 443ml
Value Lite Milk 2l
Morning Harvest Caged Size 7 Eggs 12pk
EXPECTED RESULT:
![New column showing the measurement units][1]
CURRENT METHODOLOGY:
=IF(A1<>"",REGEXEXTRACT(A1,"^.*([a-zA-Z][a-zA-Z])$|^.*([a-zA-Z])$"),"")
CURRENT RESULT:
![Result being split over two columns][2]
While I can combine the two values into a third column using the expression =IF(B1<>"",B1,IF(C1<>"",C1,"")), this becomes messy, convoluted, and adds unnecessary columns. I would prefer to tweak the regular expression to return just a single value, either the one or two character measurement unit. I have no idea how to achieve this, though. Any help would be appreciated.
You could also make the pattern a bit more specific by matching either a digit or a space, and then capturing one of the units at the end of the string.
=IF(A1<>"",REGEXEXTRACT(A1, "[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$"),"")
See a regex demo for the capture group values.
Match one optional letter, then one letter, anchored to the end of the string:
=IF(A1<>"",REGEXEXTRACT(A1, "[a-zA-Z]?[a-zA-Z]$"),"")
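To compare the two suggested patterns outside of Sheets, here is a small Python sketch. Google Sheets uses a different regex engine than Python's re module, so this only illustrates the matching logic for these particular patterns; the function names unit_specific and unit_generic are invented for the example.

```python
import re
from typing import Optional

# Pattern from the first answer: a digit or space, then one known unit at the end.
SPECIFIC = re.compile(r"[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$")
# Pattern from the second answer: one optional letter, then one letter, at the end.
GENERIC = re.compile(r"[a-zA-Z]?[a-zA-Z]$")


def unit_specific(name: str) -> Optional[str]:
    """Return the captured unit (group 1), or None if no known unit is found."""
    m = SPECIFIC.search(name)
    return m.group(1) if m else None


def unit_generic(name: str) -> Optional[str]:
    """Return the trailing one or two letters, or None if the string ends otherwise."""
    m = GENERIC.search(name)
    return m.group(0) if m else None


if __name__ == "__main__":
    for sample in (
        "Colgate Plax Spearmint Alcohol Free Mouthwash 500ml",
        "Peckish Tangy BBQ Rice Crackers 100g",
        "Banana Chips kg",
        "Coca-Cola No Sugar Soft Drink 2.25l",
        "Morning Harvest Caged Size 7 Eggs 12pk",
    ):
        print(sample, "->", unit_specific(sample), unit_generic(sample))
```

The specific pattern only accepts units from its list, while the generic one returns whatever one or two letters end the string, so it would also return the tail of an ordinary word if a cell had no unit.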

I want to use postal code digits for Belgium/Netherlands on my website at checkout

I'm looking for a database of postal codes for Belgium and the Netherlands, to be used when a person checks out his/her shopping cart.
I did find a map of them, but of course it has to be put into code. (The map is coded by the first 2 digits.)
For example:
3630 Maasmechelen = 2-digit postal code prefix (36)
Here the 2-digit prefix is 36, but every code starting with 36 should be possible, for example 3600, 3601, 3602 (hamlets), and of course their names have to be selectable too.
Another example: 3600 Genk (the client should be able to select this as a whole).
Doing this manually takes ages, so I would like to know whether a database of all the codes exists somewhere.

How can a regex catch all parts before a keyword from a finite set, but sometimes separated only by a single space

This question relates to PCRE regular expressions.
Part of my big dataset are address data like this:
12H MARKET ST. Canada
123 SW 4TH Street USA
ONE HOUSE USA
1234 Quantity Dr USA
123 Quality Court Canada
1234 W HWY 56A USA
12345 BERNARDO CNTR DRIVE Canada
12 VILLAGE PLAZA USA
1234 WEST SAND LAKE RD ?567 USA
1234 TELEGRAM BLVD SUITE D USA
1234-A SOUTHWEST FRWY USA
123 CHURCH STREET USA
123 S WASHINGTON USA
123 NW-SE BLVD USA
# USA
1234 E MAIN STREET USA
I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).
For the purpose of this question I just manually clipped the interesting part from the data for this example.
The number of words in the address parts is not known before. The only criterion I have found so far is to find the occurrence of country names belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example just to those two countries.
This regular expression
([a-zA-Z0-9?\-#.]+\s)
already isolates the words making up what I am after, including one space after each of them. Unfortunately there are cases where the to-be-extracted street information is separated from the country by only a single space, as in the first and the last example.
Since I want to capture the matching parts glued together, I place a + sign behind my regular expression:
([a-zA-Z0-9?\-#.]+\s)+
but then in the two nasty cases with only one separating space before the country, the country is also caught!
Since I know the possible countries from looking at the data, I could try to exclude them by a look ahead-condition like this:
([a-zA-Z0-9?\-#.]+\s)(?!USA|Canada)
which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this.
So I would add a plus sign to the group on the left:
([a-zA-Z0-9?\-#.]+\s)+(?!USA|Canada)
But then ST. and STREET are caught again, together with the country that follows them after only a single space, which I want to exclude from my result!
How would you proceed in such a case?
If it would be possible by properly using regular expressions to replace each country name by the same one preceded by an additional space (or even to do this only for cases, where there is only a single space in front of one of the country-names), my problem would be solved. But I want to avoid such a substitution for the whole database in a separate run because a country name might appear in some other column too.
I am quite new to regular expressions and I have no idea how to do two processing steps onto the same input in sequence. - But maybe, someone has a better idea how to cope with this problem.
If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.
So you should be able to set the 'global' and 'multiline' options and then use the following regex:
^(.*?)(?=\s+(USA|Canada)\s*$)
Explanation:
^(.*?) lazily match as few characters as possible from the start of the line
(?=\s+(USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and end of line.
That should give you a list with all addresses.
Edit:
I have changed the first part to: (.*?), making it non-greedy. That way the match will stop at the last letter before country instead of including some spaces.
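The question is about PCRE, but the same pattern can be tried in Python's re module, which handles these constructs the same way. The following is a minimal sketch under that assumption; ADDRESSES is a shortened sample built from the question's data (with extra spaces added before USA on the second line), and re.finditer plays the role of the 'global' option.

```python
import re

# A few records from the question; the second one has several spaces
# before the country, the others only one.
ADDRESSES = """12H MARKET ST. Canada
123 SW 4TH Street   USA
# USA
1234 E MAIN STREET USA"""

# Lazy prefix, then a lookahead that requires whitespace, a country name,
# optional trailing whitespace, and the end of the line.
STREET_PATTERN = re.compile(r"^(.*?)(?=\s+(USA|Canada)\s*$)", re.MULTILINE)

# Group 1 holds the street part without the spaces that precede the country.
streets = [m.group(1) for m in STREET_PATTERN.finditer(ADDRESSES)]
print(streets)
```

Because the country sits inside a lookahead, it is checked but never becomes part of the match, so a single separating space is enough.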

Regex in Hive QL (RLIKE) - performance?

I'm wondering how, or whether, I can improve the regex I'm using in a query. I have a set of identifiers for certain user groups. They can be in two main formats:
X123 or XY12, (type 1)
any two letter combo, excluding XY (type 2)
Type 1 groups always are of length 4. It's either letter X followed by a number between 100 and 999 (inclusive) OR XY followed by numbers between 0 and 99 (padded to length 2 with zeros).
Type 2 groups are 2 letter strings, with any letter allowed, excluding XY (although my query doesn't specify this).
A user can belong to multiple groups, in which case the different groups are separated by a pound symbol (#). Here's an example:
groups      user      age
X124        john      23
XY22#AB     mike      33
AB          peter     21
X122#XY01   francis   43
I want to count rows in which at least one group in second format appears, i.e. where user is not exclusively member of groups in first format.
I need to catch all rows (i.e. users) which don't belong exclusively to first type of groups. In the example above, I want to exclude users john and francis because they are members only of type 1 groups.
On the other hand, mike is OK because he's member of AB group (i.e. group of type 2).
I'm currently doing it like this:
select
count(*)
from
users
where
groups not rlike '^(X[Y1-9][0-9]{2,2})(#X[Y1-9][0-9]{2,2})*$'
Is this bad performance wise? And how should I approach fixing it?
I want to count rows in which at least one group in second format appears.
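In that case it seems a bit simpler to select the rows where groups rlike:
\b(?:(?!XY)[A-Z]{2})\b
\b is a word boundary. It doesn't consume a character; instead it asserts that the position lies between a word character (letter, digit, or underscore) and a non-word character, or at the start or end of the string.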
Live demo.
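To see that both conditions select the same rows on the sample data, here is a small Python sketch. It assumes that RLIKE looks for a match anywhere in the string (hence re.search) and that Python's re treats these constructs like the Java regex engine Hive uses; ROWS and the variable names are made up for the illustration.

```python
import re

# (groups, user) pairs from the example table in the question.
ROWS = [
    ("X124", "john"),
    ("XY22#AB", "mike"),
    ("AB", "peter"),
    ("X122#XY01", "francis"),
]

# Question's approach: the row consists only of type 1 groups; the query negates this.
ONLY_TYPE1 = re.compile(r"^(X[Y1-9][0-9]{2,2})(#X[Y1-9][0-9]{2,2})*$")
# Answer's approach: the row contains a standalone two-letter group other than XY.
HAS_TYPE2 = re.compile(r"\b(?:(?!XY)[A-Z]{2})\b")

by_negation = [user for groups, user in ROWS if not ONLY_TYPE1.search(groups)]
by_lookahead = [user for groups, user in ROWS if HAS_TYPE2.search(groups)]

print(by_negation, by_lookahead)
```

Which of the two is faster in Hive is not something this sketch can show; it only confirms that they agree on the four sample rows.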

Extract digit after or before a specific word

I have several articles about terrorist attacks which include info of the number of people killed and wounded. I am trying to extract the number concerning the people wounded.
This is a sample of the sentences to target:
at least 22 others were wounded
additional 20 soldiers were wounded
more than 40 people had been wounded
wounding at least six people
injuring at least 60 others
wounding more than 25
27 others were wounded
wounding 14
wounding 33
185 people were wounded
28 people wounded
As you can see, the words wounded, wounding, and injuring appear either before or after the number I want to extract, usually within 3 or 4 words of it.
In this link you can find a sample of the articles and the regular expression that I am trying to apply without success:
[Regex](https://regex101.com/r/0DRayP/10)
You need to use capturing groups to pull out the parts you want, like:
(\d+)?.*?(wound(?:ed|ing)|injured).*?(\d+)
You are interested in groups $1, $2 and $3
Here is an example:
Online Demo
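The sketch below first shows how the three groups of the pattern above come out in Python for one sample sentence, and then tries a hypothetical alternative that is not part of the answer: two branches, one for a number followed by "wounded" and one for "wounding"/"injuring" followed by a number. The four-word window, the name wounded_count, and the decision to ignore spelled-out numbers such as "six" are all assumptions made for this illustration.

```python
import re
from typing import Optional

# Pattern from the answer: optional number, keyword, then a required number.
ANSWER_PATTERN = re.compile(r"(\d+)?.*?(wound(?:ed|ing)|injured).*?(\d+)")

# Hypothetical alternative: number ... wounded | wounding/injuring ... number,
# allowing up to four intervening words (an assumed window).
ALT_PATTERN = re.compile(
    r"(\d+)(?:\s+\w+){0,4}?\s+wounded"
    r"|(?:wounding|injuring)(?:\s+\w+){0,4}?\s+(\d+)"
)


def wounded_count(sentence: str) -> Optional[int]:
    """Return the number tied to wounded/wounding/injuring, or None if absent."""
    m = ALT_PATTERN.search(sentence)
    if m is None:
        return None
    # Exactly one of the two groups takes part in a successful match.
    return int(m.group(1) or m.group(2))


if __name__ == "__main__":
    # Group 1 is None here because no number precedes the keyword.
    print(ANSWER_PATTERN.search("wounding more than 25").groups())
    for s in ("at least 22 others were wounded", "injuring at least 60 others",
              "28 people wounded", "wounding at least six people"):
        print(s, "->", wounded_count(s))
```

Sentences with spelled-out numbers return None in this sketch and would need separate handling.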