Regex to match when groups aren't same length

I've been having trouble understanding how to make this regex more dynamic. I specifically want to pull out these four elements, but sometimes some of them will be missing. In my case here, it doesn't recognize the pattern because the 4th group isn't present.
For example, given the 3 strings:
Rafael C. is eating a Burger by McDonalds at Beach
David K. is eating a Burger by McDonalds
John G. is eating a by at House
I'm trying to pull out the [name], [item], [by name], [at name]. It will always be in this pattern, but parts of it may be missing at times. Sometimes it's the name missing, sometimes it's the item, sometimes it's the name and by name, etc.
So I'm using:
(.*) is eating a (.*) by (.*) at (.*)
But because it's missing in the second string, it doesn't recognize it. I've tried using lookbehinds/lookaheads. I've tried using quantifiers, but I'm having a hard time understanding what is needed to get exactly those 4 groups, as you can see below:
Output desired:
I'd like it to capture:
[Rafael C., Burger, McDonalds, Beach]
[David K., Burger, McDonalds, '']
[John G., '', '', House]

You can use
^(.*) is eating a ((?:(?!\b(?:by|at)\b).)*?)(?: ?\bby ((?:(?!\bat\b).)*?))?(?: ?\bat (.*))?$
See the regex demo.
Details:
^ - string start
(.*) - Group 1: any zero or more chars other than line break chars as many as possible
is eating a - a literal string
((?:(?!\b(?:by|at)\b).)*?) - Group 2: any char other than line break char, zero or more but as few as possible occurrences, that is not a starting point for a by or at whole word char sequence
(?: ?\bby ((?:(?!\bat\b).)*?))? - an optional non-capturing group that matches an optional space, word boundary, by, space and then captures into Group 3 any char other than line break char, zero or more but as few as possible occurrences, that is not a starting point for an at whole word char sequence
(?: ?\bat (.*))? - an optional non-capturing group that matches an optional space, word boundary, at, space and then captures into Group 4 any zero or more chars other than line break chars as many as possible
$ - string end.
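A minimal sketch of how this pattern could be applied, assuming Python's re module (the question does not name an engine) and a hypothetical helper called parse_meal. In Python, optional groups that take no part in the match come back as None, so groups(default="") is used here to get the empty strings from the desired output.

```python
import re

# The pattern from the answer, split over three lines for readability.
PATTERN = re.compile(
    r"^(.*) is eating a ((?:(?!\b(?:by|at)\b).)*?)"
    r"(?: ?\bby ((?:(?!\bat\b).)*?))?"
    r"(?: ?\bat (.*))?$"
)

def parse_meal(line):
    """Return [name, item, by_name, at_name], or None if the line does not match.

    parse_meal is a hypothetical name used only for this illustration.
    """
    m = PATTERN.match(line)
    if m is None:
        return None
    # Groups that did not participate in the match are reported as "".
    return list(m.groups(default=""))

samples = [
    "Rafael C. is eating a Burger by McDonalds at Beach",
    "David K. is eating a Burger by McDonalds",
    "John G. is eating a by at House",
]
for sample in samples:
    print(parse_meal(sample))
```

Using groups(default="") keeps the result at exactly four items per line, whichever optional parts are present.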

I suggest making the last part optional with a quantifier, like this.
(.*) is eating a (.*) by (.*)(?: at (.*))*
This works with your example. https://regex101.com/r/B4JbdS/1
Edit: You are right @chitown88, this regex should match better. I use "[^\\]*?" instead of ".*" to trim whitespace when there is no value.
I also used "(?=)" (lookahead) and "(?<=)" (lookbehind) to capture the group between two specific matches.
(.*)(?=^is eating a| is eating a).*(?<=^is eating a| is eating a) *([^\\]*?) *(?=by).*(?<=by) *([^\\]*?) *(?=at |at$).*(?<=at |at$)(.*)
https://regex101.com/r/PHbmAZ/1

Related

Remove all but the first four characters on each line

So I have a text file in Vscode that contains several lines of text like so:
1801: Joseph Marie Jacquard, a French merchant and inventor invent a loom that uses punched wooden cards to automatically weave fabric designs. Early computers would use similar punch cards.
So now I'm trying to isolate the year number/the first 4 characters of each line. I'm new to regex, and I know how to get the first 4 characters (I used ^.{4}) but how would I be able to find all EXCEPT for the first 4 characters so that I can replace them with nothing and be left with just the year numbers?
Find: (?<=^\d{4}).*
Replace: with nothing
(?<=^\d{4}) - a positive lookbehind, (?<=...), that requires the current position to be right after four digits at the start of a line (^)
.* - matches everything else up to the line terminator, so the : is included in the match
Because a lookbehind is not part of the match, the four digits you want to keep are never consumed, so you don't have to worry about capture groups or a replacement string.
You can use
Find:       ^(.{4}).+
Replace: $1
See the regex demo. Details:
^ - start of a line (in Visual Studio Code, ^ matches any line start)
(.{4}) - capturing group #1 that captures any four chars other than line break chars
.+ - one or more chars other than line break chars, as many as possible.
The $1 backreference in the replacement pattern replaces the match with Group 1 value.
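Outside the editor, the same two ideas can be sketched in Python. This assumes Python's re module, where the backreference is written \1 instead of $1 and ^ only matches at every line start when re.MULTILINE is set. The sample text is made up for illustration.

```python
import re

# Made-up sample lines in the same "YYYY: text" shape as in the question.
text = (
    "1801: Jacquard invents a loom driven by punched cards.\n"
    "1821: Another line of text."
)

# Capture-group variant: keep Group 1 (the first four chars), drop the rest.
years_group = re.sub(r"^(.{4}).+", r"\1", text, flags=re.MULTILINE)

# Lookbehind variant: match only what follows the four digits and delete it.
years_lookbehind = re.sub(r"(?<=^\d{4}).*", "", text, flags=re.MULTILINE)

print(years_group)
print(years_lookbehind)
```

Both variants leave only the year on each line. The capture-group version accepts any four characters, while the lookbehind version only touches lines that start with four digits.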

Regex to capture a group, but only if preceded by a string and followed by a string

There are a few examples of the 'typical' solution to the problem, here on SO and elsewhere, but we need help with a slightly different version.
We have a string such as the following
pies
bob likes,
larry likes,
harry likes
cakes
And with the following regexp
(?<=pies\n|\G,\n)(\w+ likes)
We can capture the 'nnn likes' entries as expected, but only when the string starts with pies. However, we also need the capture to fail if the string doesn't end with 'cakes', and our attempts at doing so have failed.
Link to the regex101: https://regex101.com/r/uDNWXN/1/
Any help appreciated.
I suggest adding an extra lookahead at the start, to make sure there is cakes in the string:
(?s)(?<=\G(?!^),\n|pies\n(?=.*?cakes))(\w+ likes)
See the regex demo (no match as expected, add some char on the last line to have a match).
Pattern details
(?s) - DOTALL/singleline modifier to let . match any chars including line breaks
(?<= - a positive lookbehind that requires the following immediately to the left of the current location:
\G(?!^),\n - right after the end of the previous match (but not at the start of the string), a comma and then a newline
| - or
pies\n(?=.*?cakes) - pies, then a newline that is followed with any 0+ chars, as few as possible, and then a cakes string
) - end of the lookbehind
(\w+ likes) - Group 1: any one or more letters, digits or underscores and then a space and likes.
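The pattern above relies on \G, which Python's built-in re module does not support. As a rough sketch of a different, two-step technique (not the lookbehind approach from the answer), one could first require the whole pies ... cakes block and then pull the entries out of it. The function name likes_between is hypothetical, and the sketch assumes the block looks exactly like the sample in the question.

```python
import re

# Step 1 pattern: the whole block, from "pies" to "cakes", with the
# comma-separated "<word> likes" lines captured in Group 1.
BLOCK = re.compile(r"pies\n((?:\w+ likes,\n)*\w+ likes)\ncakes")

def likes_between(text):
    """Return every '<word> likes' entry, or [] if the block is incomplete.

    likes_between is a hypothetical helper name for this illustration.
    """
    block = BLOCK.search(text)
    if block is None:
        # Either "pies" or "cakes" is missing, so nothing is captured.
        return []
    # Step 2: extract the individual entries from the captured block.
    return re.findall(r"\w+ likes", block.group(1))

good = "pies\nbob likes,\nlarry likes,\nharry likes\ncakes"
bad = "pies\nbob likes,\nlarry likes,\nharry likes\ncake"
print(likes_between(good))
print(likes_between(bad))
```

The trade-off is that this version is stricter than the single-regex one: any line that breaks the expected shape makes the whole block fail.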

Regular expression to match n consecutive capitalized words

I am trying to capture n consecutive capitalized words. My current code is
n=5
a='This is a Five Gram With Five Caps and it also contains a Two Gram'
re.findall(' ([A-Z]+[a-z|A-Z]* ){n}',a)
Which returns the following:
['Caps ']
It's identifying the fifth consecutive capitalized word, but I would like it to return the entire string of capitalized words. In other words:
[' Five Gram With Five Caps ']
Note that | doesn't act as an OR inside a character class. It'll match | literally. The other issue here is that findall's behaviour is to return the whole match unless a group exists (although Python's documentation doesn't really make this clear):
The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups
So this is why you're getting the result of the first capture group, whose final repetition matched the last capitalized word, Caps.
The simple solution is to change your capturing group to a non-capturing group. I've also changed the space at the start to \b so as to not match an additional whitespace (which I presume you were planning on trimming anyway).
See code in use here
import re
r = re.compile(r"\b(?:[A-Z][a-zA-Z]* ){5}")
s = "This is a Five Gram With Five Caps and it also contains a Two Gram"
print(r.findall(s))
See regex in use here
\b(?:[A-Z][a-zA-Z]* ){5}
\b Assert position as a word boundary
(?:[A-Z][a-zA-Z]* ){5} Match the following exactly 5 times
[A-Z] Match an uppercase ASCII letter once
[a-zA-Z]* Match any ASCII letter any number of times
' ' Match a space character
Result: ['Five Gram With Five Caps ']
Additionally, you may use the regex \b[A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*){4}\b instead. This will allow matches at the start/end of the string as well as anywhere in the middle without grabbing extra whitespace. Another alternative may include (?:^|(?<= ))[A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*){4}(?= |$)
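Since the question keeps the word count in a variable n, here is a small sketch that builds the whitespace-free variant for any n. The function name capitalized_runs is hypothetical. The repeat count is n - 1 because the first word is matched outside the repeated group.

```python
import re

def capitalized_runs(text, n):
    """Find runs of exactly n consecutive capitalized words.

    capitalized_runs is a hypothetical helper name for this illustration.
    """
    # One capitalized word, then (n - 1) more, each preceded by one space.
    pattern = r"\b[A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*){%d}\b" % (n - 1)
    # The pattern has no capturing groups, so findall returns whole matches.
    return re.findall(pattern, text)

s = "This is a Five Gram With Five Caps and it also contains a Two Gram"
print(capitalized_runs(s, 5))
```

Note that findall returns non-overlapping matches, so a longer run is split into consecutive chunks of n words rather than reported as a whole.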
Wrap the whole pattern in a capturing group (re.findall then returns a tuple of both groups for each match, so take the first element):
(([A-Z]+[a-z|A-Z]* ){5})

regex with optional capture group

I am trying to get the amount, unit and substance out of a string using a regex. The units and substances come from a predefined list.
So:
"2 kg of water" should return: 2, kg, water
"1 gallon of crude oil" should return: 1, gallon, oil
I can achieve this with the following regex:
(\d*) ?(kg|ml|gallon).*(water|oil)
The problem is that I can't figure out how to make the last capture group optional. If the substance is not in the predefined list, I still want to get the amount and unit. So:
"1 gallon of diesel" should return: 1, gallon or 1, gallon, ''
I have tried wrapping the last group in an optional non-capturing group as explained here: Regex with optional capture fields but with no success.
Here is the current regex in the online regex tester: https://regex101.com/r/hV3wQ3/55
You are trying to use (\d+) ?(kg|ml|gallon).*(?:(water|oil))? and there is no way this pattern can capture water / oil. The problem is the .* grabs any 0+ chars other than line break chars up to the end of the string / line, and the (?:(water|oil))? is tried when the regex index is there, at the string end. Since (?:(water|oil))? can match an empty string, it matches the location at the end of the string, and the match is returned.
You may still use the capturing group as obligatory, but wrap the .* and the capturing group with an optional non-capturing group:
(\d+) ?(kg|ml|gallon)(?:.*(water|oil))?
                     ^^^             ^^
See the regex demo
The (?:.*(water|oil))? matches 1 or 0 (greedily) occurrences of any 0+ chars other than line break chars (.*) and then either water or oil.
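A short sketch of the fixed pattern, assuming Python's re module and a hypothetical helper called parse_quantity. When the optional part does not match, Group 3 is None in Python, so groups(default="") is used to return '' as in the question.

```python
import re

# The pattern from the answer: .* and the substance group are optional together.
PATTERN = re.compile(r"(\d+) ?(kg|ml|gallon)(?:.*(water|oil))?")

def parse_quantity(text):
    """Return (amount, unit, substance); substance is '' if not in the list.

    parse_quantity is a hypothetical helper name for this illustration.
    """
    m = PATTERN.search(text)
    if m is None:
        return None
    # A group that did not participate in the match is reported as "".
    return m.groups(default="")

for sample in ("2 kg of water", "1 gallon of crude oil", "1 gallon of diesel"):
    print(parse_quantity(sample))
```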

How can I capture the desired group using REGEX

How can I break this string, to just capture Chocolate cake & nuts?
Input string
pizza & coke > sweets > Chocolate cake & nuts >
I am using this regex:
.*[\>]\s(.*)
However, it is capturing Chocolate cake & nuts >
How can I remove the > and the space in the end?
Desired result
lastone=Chocolate cake & nuts
Avoiding capture of space around the final phrase is a bit tricky. In Java,
.*>\s*(\S+(?:\s+[^>\s]+)*)\s*>.*
captures everything except initial and ending whitespace between the final two >'s. Note that you only get the last stuff between >'s because the leading .* is "greedy": it matches the longest possible string that allows the rest of the regex to match.
Note that when you ask about a regex, you need to specify which regex engine you're using.
Edit: How it works
.*> matches anything followed by >. Then \s* matches 0 or more whitespace chars, and capturing starts. The \S+ matches one or more non-space characters, and (?:\s+[^>\s]+)* matches 0 or more repeats of spaces followed by characters that are anything except > and space (this is the tricky part), whereupon capturing stops. The (?: ) form of parentheses are non-capturing. They only group what's inside so * can match 0 or more of whatever that is. Finally, \s*>.* matches a final > preceded by optional whitespace and followed by anything.
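The answer targets Java, but the pattern uses no Java-specific syntax, so as a sketch it can be tried with Python's re module as well. The variable name lastone is taken from the question. fullmatch is used here because the pattern is written to cover the whole input.

```python
import re

text = "pizza & coke > sweets > Chocolate cake & nuts >"

# Same pattern as in the answer; Group 1 is the text between the last two
# '>' characters, without the surrounding whitespace.
pattern = re.compile(r".*>\s*(\S+(?:\s+[^>\s]+)*)\s*>.*")

m = pattern.fullmatch(text)
lastone = m.group(1) if m else None
print(lastone)
```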
Try moving the > out of the capturing group: .*[\>]\s(.*?)\s*>
Or the more precise version [>\s]+(\w+[\w ]*&[ \w]*\w+)[> ]+