Conditional Regex Parsing - regex

I have a bunch of product codes that I'm trying to parse (example: 99 ITEM SEC SALE). In rare cases, product codes are like 99 ITEM SEC SALE.
If the cell is "99 ITEM SEC SALE", then "ITEM SEC" should be parsed (take out 99 and SALE).
If the cell is "99 ITEM SEC" (no SALE, SOLD, or PURCHASED), I want ITEM SEC to be parsed as well. In other words, "SALE", "SOLD", and "PURCHASED" are prohibited words.
1-It always starts with a set of numbers (no limit)
2-Alphabetic characters (Any)
3-Alphabetic characters (any)-optional
4-If the ending value(string) is NOT "SALE" or "SOLD" or "PURCHASED" then take the digits out and parse
I found something similar but could not figure out how it should work for my case.
Thanks for the help

Okay, so what you're looking for is something like this.
(?P<number>\d+)\s+(?P<Item_Name>\w+)\s+(?P<code>[a-zA-Z]{0,3})\s+(?P<status>SOLD|SALE|PURCHASED)?
(?P<number>\d+) -- Named Capture Group 1 (number)- Match any number
\s+ -- Match any number of spaces
(?P<Item_Name>\w+) -- Named Capture Group 2 (Item_Name) - Match any word until space
\s+ match any number of spaces
(?P<code>[a-zA-Z]{0,3}) -- Named Capture Group 3 (code) - Match any a-zA-Z character 0-3 times
\s+ match any number of spaces
(?P<status>SOLD|SALE|PURCHASED)? -- Named Capture Group 4 (status) - Match SOLD / SALE / PURCHASED (? means 0 or 1 times so this is optional)
Live example: https://regex101.com/r/oR3sK8/1
I don't recall if named capture groups work like this in Objective-C; if they don't, you can remove the ?P<...> and the regex should still work without issues (and keep your capture groups largely unchanged).
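For readers working in Python, here is a minimal sketch of the same idea. It is a variant of the pattern above, not a copy: the status word is optional together with the whitespace before it, so "99 ITEM SEC" with nothing after it also matches, and a negative lookahead keeps SALE, SOLD, or PURCHASED out of the item text. The names parse_code, CODE_RE, and the group name item are made up for illustration, and the pattern assumes the item is one or two alphabetic words, as described in the question.

```python
import re

# Assumption: a code is digits, then one or two alphabetic words,
# then optionally one of the prohibited status words.
STATUS = r"(?:SALE|SOLD|PURCHASED)"
CODE_RE = re.compile(
    r"^\s*\d+\s+"                                                  # leading number
    r"(?P<item>[A-Za-z]+(?:\s+(?!" + STATUS + r"\b)[A-Za-z]+)?)"  # one or two item words
    r"(?:\s+" + STATUS + r")?\s*$"                                 # optional status word
)

def parse_code(cell):
    """Return the item text without the number and status, or None if no match."""
    match = CODE_RE.match(cell)
    return match.group("item") if match else None

print(parse_code("99 ITEM SEC SALE"))  # ITEM SEC
print(parse_code("99 ITEM SEC"))       # ITEM SEC
```

The lookahead is what stops the optional second word from swallowing a status word in an input such as "99 ITEM SOLD".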

Related

GSheets - remove everything *after* a word (but keep the word)

How can I remove everything after a specific word (while keeping the word)?
I want to remove everything after the word 'films'.
"George Fellini 194 films 273 169 Edit" would turn into "George Fellini 194 films"
"Rick Bathista 7 films 10 27 Edit" would turn into "Rick Bathista 7 films"
There are many posts that are similar but aren't google sheets specific, and the two google sheets specific answers I've found eliminate the word I want to keep.
(It would be a bonus if it could also keep the singular "film", but that's not necessary.)
What I've tried:
=REGEXEXTRACT(B2,"(.*) films .*") - deletes the word 'films'
=regexreplace(B2,"films ","") - also deletes the word 'films'
my sheet: https://docs.google.com/spreadsheets/d/1UL0cvdgbwJIAPSJTxajxM7_pw_pPqxq-Ofmt8uK6J6o/edit?usp=sharing
Use this formula:
=REGEXEXTRACT(B2,".*films?")
The documentation of REGEXEXTRACT says:
Extracts matching substrings according to a regular expression.
The regular expression matches any sequence of zero or more characters (.*) followed by film and an optional s (s?).
use:
=INDEX(IFNA(REGEXEXTRACT(B2:B, "(.+films)")))
() - extract group of something
.+ - all characters / anything
(.+films) - extract a group of all characters up to and including films
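The same pattern can be tried outside of Sheets. The following is a small Python sketch, assuming Python's re module treats this simple pattern the same way REGEXEXTRACT does; the function name keep_through_films is made up for illustration.

```python
import re

def keep_through_films(text):
    """Return everything up to and including the last 'film' or 'films'."""
    match = re.search(r".*films?", text)
    return match.group(0) if match else None

print(keep_through_films("George Fellini 194 films 273 169 Edit"))  # George Fellini 194 films
print(keep_through_films("Rick Bathista 7 films 10 27 Edit"))       # Rick Bathista 7 films
```

Because .* is greedy, the match runs to the last occurrence of film or films in the text, and everything after it is dropped.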

Regex to get any numbers after the occurrence of a string in a line

Hi guys, I'm trying to get a substring as well as the corresponding number from this string
text = "Milk for human consumption may be taken only from cattle from 80 hours after the last treatment."
I want to select the word milk and the corresponding number 80 from this sentence. This is part of a larger file and I want a generic solution to get the word milk in a line and then the first number that occurs after this word anywhere in that line.
(Milk+)\d
This is what I came up with, thinking that I can make a group for milk and then check for digits, but I'm stumped on how to search for numbers anywhere on the line and not just immediately after the word milk. Also, is there any way to make the search case insensitive?
Edit: I'm looking to get both the word and the number if possible, e.g. "milk" "80", and I'm using Python.
/(?<!\p{L})([Mm]ilk)(?!\p{L})\D*(\d+)/
This matches the following strings, with the match and the contents of the two capture groups noted.
"The Milk99" # "Milk99" 1:"Milk" 2:"99"
"The milk99 is white" # "milk99" 1:"milk" 2:"99"
"The 8 milk is 99" # "milk is 99" 1:"milk" 2:"99"
"The 8milk is 45 or 73" # "milk is 45" 1:"milk" 2:"45"
The following strings are not matched.
"The Milk is white"
"The OJ is 99"
"The milkman is 37"
"Buttermilk is 99"
"MILK is 99"
This regular expression could be made self-documenting by writing it in free-spacing mode:
/
(?<!\p{L}) # the following match is not preceded by a Unicode letter
([Mm]ilk) # match 'M' or 'm' followed by 'ilk' in capture group 1
(?!\p{L}) # the preceding match is not followed by a Unicode letter
\D* # match zero or more characters other than digits
(\d+) # match one or more digits in capture group 2
/x # free-spacing regex definition mode
\D* could be replaced with .*?, ? making the match non-greedy. If the greedy variant were used (.*), the second capture group for "The 8milk is 45 or 73" would contain "3".
To match "MILK is 99", change ([Mm]ilk) to (?i)(milk).
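Since the question asks for Python, here is a hedged sketch of the same approach with the built-in re module. That module does not support \p{L}, so the sketch substitutes the character class [^\W\d_] as an approximation of "any letter"; that substitution, the names LETTER, MILK_RE, and milk_and_number, and the choice to always match case-insensitively are assumptions made for illustration.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# used here as a stand-in for \p{L}, which the built-in re module lacks.
LETTER = r"[^\W\d_]"
MILK_RE = re.compile(
    rf"(?<!{LETTER})(milk)(?!{LETTER})\D*(\d+)",
    re.IGNORECASE,  # also accepts "MILK", unlike ([Mm]ilk)
)

def milk_and_number(line):
    """Return (word, number) for the first standalone 'milk' followed by digits."""
    match = MILK_RE.search(line)
    return match.groups() if match else None

text = ("Milk for human consumption may be taken only from cattle "
        "from 80 hours after the last treatment.")
print(milk_and_number(text))                 # ('Milk', '80')
print(milk_and_number("The milkman is 37"))  # None
```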
This seems to work in Java the way you want (I overlooked that the questioner wanted Python, or the question was edited later):
String example =
"Test 40\n" +
"Test Test milk for human consumption may be taken only from cattle from hours after the last treatment." +
"\nTest Milk for human consumption may be taken only from cattle from 80 hours after the last treatment." +
"\nTest miLk for human consumption may be taken only from cattle from 80 hours after the last treatment.";
Matcher m = Pattern.compile("((?i)(milk).*?(\\d+).*\n?)+").matcher(example);
m.find();
System.out.print(m.group(2) + m.group(3));
Look at how it tests whether the word "milk" appears, in a case-insensitive manner, anywhere before a number on the very same line, and prints only those two values. It also prints only the first occurrence found (making it find all occurrences is easy with small modifications to the given code).
I hope the way it extracts these two things from a matching pattern is what your task needs.
You should try this one
(Milk).*?(\d+)
Based on your language, you can also specify a case-insensitive search. Example in JS: /(Milk).*?(\d+)/i, the final i makes the search case insensitive.
Note the *?, the most important part! This is a lazy quantifier. In other words, it reads any character, but as soon as it can stop and match the next part of the pattern successfully, it does. Here, as soon as a digit can be read, it is read. A plain greedy * would instead run to the end of the line and back off only as far as needed, so the group would capture just the last digit of the last number after Milk.
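The difference between the lazy and the greedy quantifier can be checked with a short Python sketch on the sentence from the question. The variable names are illustrative, and re.IGNORECASE plays the role of the JS i flag.

```python
import re

text = ("Milk for human consumption may be taken only from cattle "
        "from 80 hours after the last treatment.")

# Lazy: stop at the first place where a run of digits can start.
lazy = re.search(r"(milk).*?(\d+)", text, re.IGNORECASE)

# Greedy: consume as much as possible, then give back only what \d+ needs.
greedy = re.search(r"(milk).*(\d+)", text, re.IGNORECASE)

print(lazy.groups())    # ('Milk', '80')
print(greedy.groups())  # ('Milk', '0')
```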

Capture the latest in backreference

I have this regex
(\b(\S+\s+){1,10})\1.*MY
and I want to group 1 to capture "The name" from
The name is is The name MY
I get "is" for now.
The name can be any random words of any length.
It need not be at the beginning.
It need not be only 2 or 3 words. It can be fewer than 10 words.
Only thing sure is that it will be the last set of repeating words.
Examples:
The name is Anthony is is The name is Anthony - "The name is Anthony".
India is my country All Indians are India is my country - "India is my country "
Times of India Alphabet Google is the company Alphabet Google canteen - "Alphabet Google"
You could try:
(\b\w+[\w\s]+\b)(?:.*?\b\1)
As demonstrated here
Explanation -
(\b\w+[\w\s]+\b) is the capture group 1 - which is the text that is repeated - separated by word boundaries.
(?:.*?\b\1) is a non-capturing group which tells the regex system to match the text in group 1, only if it is followed by zero-or-more characters, a word-boundary, and the repeated text.
Regex generally captures the longest leftmost match. There are no examples in your question where this would not actually be the string you want, but that could just mean you have not found good examples to show us.
With that out of the way,
((\S+\s)+)(\S+\s){0,9}\1
would appear to match your requirements as currently stated. The "longest leftmost" behavior could still get in the way if there are e.g. straddling repetitions, like
this that more words this that more words
where in the general case regex alone cannot easily be made to always prefer the last possible match and tolerate arbitrary amounts of text after it.

How to limit characters when using regexp_extract in hive?

I have a fixed length string in which I need to extract portions as fields. First 5 characters to ACCOUNT1, next 2 characters to ACCOUNT2 and so on.
I would like to use regexp_extract (not substring) but I am missing the point. They return nothing.
select regexp_extract('47t7916A2088M040323','(.*){0,5}',1) as ACCOUNT1,
regexp_extract('47t7916A2088M040323','(.*){6,8}',1) as ACCOUNT2 --and so on
If you want to use regexp, use it as in this example. For Account1, the expression '^(.{5})' means: ^ is the beginning of the string, followed by capturing group 1 consisting of any 5 characters (a group is enclosed in round brackets). {5} is a quantifier meaning exactly 5 times. For Account2, capturing group 2 comes right after group 1: (.{2}) means any two characters.
In this example the second regexp contains two groups (for the first and second columns) and we extract the second group.
hive> select regexp_extract('47t7916A2088M040323','^(.{5})',1) as Account1,
> regexp_extract('47t7916A2088M040323','^(.{5})(.{2})',2) as Account2;
OK
47t79 16
Time taken: 0.064 seconds, Fetched: 1 row(s)
Actually you can use the same regexp containing groups for all columns, extracting different capturing groups.
Example using the same regexp and extracting different groups:
hive> select regexp_extract('47t7916A2088M040323','^(.{5})(.{2})',1) as Account1,
> regexp_extract('47t7916A2088M040323','^(.{5})(.{2})',2) as Account2
> ;
OK
47t79 16
Time taken: 1.043 seconds, Fetched: 1 row(s)
Add more groups for each column. This approach works only for fixed-length columns. If you want to parse a delimited string, put the delimiter characters between groups, modify the groups to match everything except delimiters, and remove or modify the quantifiers. For such cases substring, or split for a delimited string, looks much simpler and cleaner, while regexp lets you parse very complex patterns. Hope you get the idea.
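The group-per-column idea is not specific to Hive. As a rough illustration, the same pattern can be tried in Python, where group(1) and group(2) correspond to the index passed as the third argument of regexp_extract. The variable names are made up, and only the first two fields are shown because the question does not give the widths of the remaining ones.

```python
import re

record = "47t7916A2088M040323"

# One anchored pattern, one capturing group per fixed-width field.
match = re.match(r"^(.{5})(.{2})", record)
account1 = match.group(1)  # first 5 characters
account2 = match.group(2)  # next 2 characters

print(account1, account2)  # 47t79 16
```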

Matching a group that may or may not exist

My regex needs to parse an address which looks like this:
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
-------------------- ----- -------- -----
         1             2      3       4*
Groups one, two and three will always exist in an address. Group 4 may not exist. I've written a regex that helps me get the first, second and third part but I would also need the fourth part. Part 4 is the country name and can either be FINLAND or SUOMI. If the fourth part didn't exist in an address the fourth group would be empty. This is my regex so far but the third group captures the country too. Any help?
(.*?)\s(\d{5})\s(.*)$
(I'm going to be using this Oracles REGEXP function)
Change the regex to:
(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$
Making group three non-greedy lets you match the optional space plus the country choices. If group 4 doesn't match, I think it will be uninitialized rather than blank; that depends on the language.
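To see how the optional fourth group behaves, here is a small sketch in Python rather than Oracle, so treat it only as an illustration of the pattern. In Python an optional group that did not take part in the match comes back as None; Oracle's REGEXP functions may report it differently. The names ADDRESS_RE and split_address are made up.

```python
import re

ADDRESS_RE = re.compile(r"(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$")

def split_address(address):
    """Return (street, postcode, city, country); country is None when absent."""
    match = ADDRESS_RE.match(address)
    return match.groups() if match else None

print(split_address("BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI"))
# ('BLOOKKOKATU 20 A 773', '00810', 'HELSINKI', 'SUOMI')
print(split_address("BLOOKKOKATU 20 A 773 00810 HELSINKI"))
# ('BLOOKKOKATU 20 A 773', '00810', 'HELSINKI', None)
```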
To match a character (or, in your case, a group) that may or may not exist, you need to use ? after the character, subpattern, or class in question. I'm answering now because RegEx is complicated and should be explained: posting only the fix without an explanation isn't enough!
A question mark matches zero or one of the preceding character, class, or subpattern. Think of this as "the preceding item is optional". For example, colou?r matches both color and colour because the "u" is optional.
Above quote from http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
Try this:
(.*?)\s(\d{5})\s(.*?)\s?([^\s]*)?$
This will match your input more tightly and each of your groups is in its own regex group:
(\w+\s\d+\s\w\s\d+)\s(\d+)\s(\w+)\s(\w*)
or if space is OK instead of "whitespace":
(\w+ \d+ \w \d+) (\d+) (\w+) (\w*)
Group 1: BLOOKKOKATU 20 A 773
Group 2: 00810
Group 3: HELSINKI
Group 4: SUOMI (optional - doesn't have to match)
(.*?)\s(\d{5})\s(\w+)\s(\w*)
An example:
SQL> with t as
2 ( select 'BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI' text from dual
3 )
4 select text
5 , regexp_replace(text,'(.*?)\s(\d{5})\s(\w+)\s(\w*)','\1**\2**\3**\4') new_text
6 from t
7 /
TEXT
-----------------------------------------
NEW_TEXT
-----------------------------------------------------------------------------------------
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
BLOOKKOKATU 20 A 773**00810**HELSINKI**SUOMI
1 row selected.
Regards,
Rob.