Regex optional capture groups in any order - regex

I would like to capture groups based on a consecutive occurrence of matched groups in any order. And when one set type is repeated without the alternative set type, the alternative set is returned as nil.
I am trying to extract names and emails based on the following regex:
For names, two consecutive capitalized words:
[A-Z][\w]+\s+[A-Z][\w]+
For emails:
\b[a-zA-Z0-9._%+-]+#[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b
Example text:
John Doe john#doe.com random text
Jane Doe random text jane#doe.com
jim#doe.com more random text tim#doe.com Tim Doe
So far I have used non-capture groups and positive look aheads to tackle the "in-no-particular-order-or-even-present" problem but only managed to do so by segmenting by newlines. So my regex looks like this:
^(?=(?:.*([A-Z][\w]+\s+[A-Z][\w]+))?)(?=(?:.*(\b[a-zA-Z0-9._%+-]+#[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b))?).*
And the results miss items where there are multiple contacts on the same line:
[
["John Doe", "john#doe.com"],
["Jane Doe", "jane#doe.com"],
["Tim Doe", "tim#doe.com"],
]
When what I'm looking for is:
[
["John Doe", "john#doe.com"],
["Jane Doe", "jane#doe.com"],
[nil, "jim#doe.com"],
["Tim Doe", "tim#doe.com"],
]
My skills in regex are limited and I started using regex because it seemed like the best tool for matching names and emails.
Is regex the best tool to use for this kind of problem or are there more efficient alternatives using loops if we're extracting hundreds of contacts in this manner?

Your text is already almost too random to make this work. Worse, names and emails are sometimes very difficult to capture. A more advanced email pattern would only help a little. Not only are there unusual email addresses, there are also all sorts of wild name patterns.
What about D'arcy Bly, Markus-Anthony Reid, or Lee Z? And those are probably the simplest examples.
So, you have to make a lot of assumptions and won't be fully satisfied unless you are using more advanced techniques like Natural language processing.
If you insist on your approach, I came up with this (toothless) monstrosity:
([A-Z]\w+ [A-Z]\w+)(?:\w* )*([a-zA-Z0-9._%+-]+#[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})|
([a-zA-Z0-9._%+-]+#[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})(?:\w* )*([A-Z]\w+ [A-Z]\w+)|
([a-zA-Z0-9._%+-]+#[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})
The order of the alternation groups is important to be able to capture the stray email.
Demo
PS: The demo uses a branch reset group to capture only in groups 1 and 2. However, it looks like Ruby 2.x does not support branch reset groups, so you need to check all 5 groups for values.
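For readers outside Ruby, here is a minimal Python sketch of the same three-way alternation. Python's re module has no branch reset groups either, so the five groups are collapsed by hand. The helper name extract_contacts and the NAME, EMAIL and FILLER constants are made up for this illustration; the sub-patterns are the ones from the answer above, with # as the address separator exactly as in the sample text.

```python
import re

# Sub-patterns copied from the answer above (hypothetical constant names).
NAME = r"[A-Z]\w+ [A-Z]\w+"
EMAIL = r"[a-zA-Z0-9._%+-]+#[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"
FILLER = r"(?:\w* )*"

# Order matters: the two "pair" alternatives come before the lone email.
PATTERN = re.compile(
    rf"({NAME}){FILLER}({EMAIL})"
    rf"|({EMAIL}){FILLER}({NAME})"
    rf"|({EMAIL})"
)

def extract_contacts(text):
    """Return (name, email) tuples; name is None for a stray email."""
    contacts = []
    for m in PATTERN.finditer(text):
        name1, email1, email2, name2, email3 = m.groups()
        # Only one alternative matched, so at most one value per slot is set.
        contacts.append((name1 or name2, email1 or email2 or email3))
    return contacts

text = (
    "John Doe john#doe.com random text\n"
    "Jane Doe random text jane#doe.com\n"
    "jim#doe.com more random text tim#doe.com Tim Doe\n"
)
contacts = extract_contacts(text)
for contact in contacts:
    print(contact)
```

Collapsing the groups in code keeps the regex itself unchanged, at the cost of one extra line per match.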

Here's a rewrite of @wp78de's idea into Ruby regexp syntax:
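# In extended (/x) mode an unescaped # starts a comment, so the
# separator in the email pattern is written as \# here.
regexp = /
  (?<name>
    [A-Z][\w]+\s+[A-Z][\w]+
  ){0}
  (?<email>
    \b[a-zA-Z0-9._%+-]+\#[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b
  ){0}
  (?:
    \g<name> (?:\w*\s)* \g<email>
  | \g<email> (?:\w*\s)* \g<name>
  | \g<email>
  )
/x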
text = <<-TEXT
John Doe john#doe.com random text
Jane Doe random text jane#doe.com
jim#doe.com more random text tim#doe.com Tim Doe
TEXT
p text.scan(regexp)
# => [["John Doe", "john#doe.com"],
# => ["Jane Doe", "jane#doe.com"],
# => [nil, "jim#doe.com"],
# => ["Tim Doe", "tim#doe.com"]]

Related

Using regex to match groups that may not contain given text patterns

I'd like to use regex to extract birth dates and places, as well as (when they're defined) death dates and places, from a collection of encyclopedia entries. Here are some examples of such entries which illustrate the patterns that I'm trying to codify:
William Wordsworth, (born April 7, 1770, Cockermouth, Cumberland, England—died April 23, 1850, Rydal Mount, Westmorland), English poet...
Jane Goodall, in full Dame Jane Goodall, original name Valerie Jane Morris-Goodall, (born April 3, 1934, London, England), British ethologist...
Kenneth Wartinbee Spence, (born May 6, 1907, Chicago, Illinois, U.S.—died January 12, 1967, Austin, Texas), American psychologist...
I was hoping that the following regex pattern would identify the desired capture groups:
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(?:—died )?(\w+ \d{1,2}, \d{4})?(?:, )?(.*?)\)
But sadly, it does not. (Use https://regexr.com/ to view the results.)
Note: When I restructure the pattern around the phrase —died in the following way, the pattern does produce the expected results for entries like the first and third above (those with given death dates/places), but it obviously does not work in all cases.
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?)\)
What am I missing?
In general, you can mark the whole —died section as optional:
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?))?\)
https://regexr.com/765fc
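As a rough check of the optional-group idea, here is a small Python sketch that runs the pattern above against two of the sample entries. The names BIO and parse_entry are invented for the example. Because the optional section is itself a capture group (group 3), the four fields of interest end up in groups 1, 2, 4 and 5.

```python
import re

# Same pattern as above; the whole "—died ..." section is one optional group.
BIO = re.compile(
    r"\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)"
    r"(—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?))?\)"
)

def parse_entry(entry):
    """Return (birth date, birth place, death date, death place) or None."""
    m = BIO.search(entry)
    if m is None:
        return None
    # Group 3 is only the optional wrapper, so it is skipped here.
    return m.group(1, 2, 4, 5)

wordsworth = (
    "William Wordsworth, (born April 7, 1770, Cockermouth, Cumberland, "
    "England—died April 23, 1850, Rydal Mount, Westmorland), English poet..."
)
goodall = (
    "Jane Goodall, in full Dame Jane Goodall, original name Valerie Jane "
    "Morris-Goodall, (born April 3, 1934, London, England), British ethologist..."
)

print(parse_entry(wordsworth))
print(parse_entry(goodall))
```

For an entry without a death section, the last two fields come back as None rather than as empty strings.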
You can use a negative lookahead, (?!pattern), to assert that "pattern" does not match at the current position. Note that it only tests that one position: "(?!a).*" merely requires that the match not start with "a", and it will still match text that contains an "a" later on. To match a run of characters that contains no "a" at all, repeat the check before every character, as in "^(?:(?!a).)*$" (or, for a single excluded character, simply "^[^a]*$").
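A quick Python check of how a negative lookahead behaves may help here. It is only a sketch, and the names naive and tempered are made up for the illustration. The lookahead tests a single position, so (?!a).* can still run across a later "a"; repeating the check before every character is one common way to exclude it from the whole match.

```python
import re

# The lookahead is checked once, at the position where the match starts.
naive = re.compile(r"(?!a).*")

# The lookahead is checked before every single character.
tempered = re.compile(r"^(?:(?!a).)*$")

# "banana" does not start with "a", so the naive pattern accepts all of it.
print(naive.match("banana").group())

# For "apple" the search just skips the leading "a" and matches the rest.
print(naive.search("apple").group())

# The tempered pattern rejects any string containing an "a".
print(tempered.match("banana"))
print(tempered.match("hello").group())
```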

Regex match characters when not preceded by a string

I am trying to match spaces just after punctuation marks so that I can split up a large corpus of text, but I am seeing some common edge cases with places, titles and common abbreviations:
I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith
I am using this with the re.split function in Python 3. I want to get this:
["I am from New York, N.Y. and I would like to say hello!",
"How are you today?",
"I am well.",
"I owe you $6. 00 because you bought me a No. 3 burger.",
"-Sgt. Smith"]
This is currently my regex:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)(?<=[^N]..)(?<=[^o].)
I decided to try to fix the No. first, with the last two conditions. But it relies on matching the N and the o independently, which I think is going to cause false positives elsewhere. I cannot figure out how to make it match just the string No before the period. I will then use a similar approach for Sgt. and any other "problem" strings I come across.
I am trying to use something like:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)^(?<=^No$)
But it doesn't capture anything after that. How can I get it to exclude certain strings which I expect to have a period in it, and not capture them?
Here is a regexr of my situation: https://regexr.com/4sgcb
This is the closest regex I could get (the trailing space is the one we match):
(?<=(?<!(No|\.\w))[\.\?\!])(?! *\d+ *)
which will also split after Sgt., for the simple reason that a lookbehind assertion has to be fixed-width in Python (what a limitation!).
This is how I would do it in vim, which has no such limitation (the trailing space is the one we match):
\(\(No\|Sgt\|\.\w\)\#<![?.!]\)\( *\d\+ *\)\#!\zs
For the OP as well as the casual reader, this question and the answers to it are about lookarounds and are very interesting.
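The fixed-width restriction mentioned above can be sidestepped, at least for a known list of abbreviations, by chaining several lookbehinds that are each fixed-width on their own. The following Python sketch is one possible way to do that, not the answer's own pattern: the names SPLIT, sentences and variable_width_ok are invented, and the abbreviation list (No., Sgt., and dotted initials such as N.Y.) is an assumption taken from the sample sentence only.

```python
import re

text = ("I am from New York, N.Y. and I would like to say hello! How are you "
        "today? I am well. I owe you $6. 00 because you bought me a No. 3 "
        "burger. -Sgt. Smith")

# A single lookbehind whose alternatives differ in width is rejected by re.
try:
    re.compile(r"(?<!No|Sgt)\.")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Several lookbehinds in a row are fine, because each one has a fixed width.
SPLIT = re.compile(
    r"(?<=[.?!])"    # the whitespace must follow sentence punctuation
    r"(?<!No\.)"     # ...but not "No."
    r"(?<!Sgt\.)"    # ...nor "Sgt."
    r"(?<!\.\w\.)"   # ...nor dotted initials such as "N.Y."
    r"\s+"
    r"(?!\d)"        # and must not be followed by a digit ("$6. 00")
)

sentences = SPLIT.split(text)
for sentence in sentences:
    print(sentence)
```

Every new abbreviation needs its own lookbehind, so this only scales to a short, known list.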
You may consider a matching approach; it will give you better control over the entities you want to count as single words rather than as sentence break signals.
Use a pattern like
\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))
See the regex demo
It is very similar to what I posted here, but it contains a pattern to match poorly formatted float numbers, added No. and Sgt. abbreviation support and a better handling of strings not ending with final sentence punctuation.
Python demo:
import re
p = re.compile(r'\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))')
s = "I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith"
for m in p.findall(s):
    print(m)
Output:
I am from New York, N.Y. and I would like to say hello!
How are you today?
I am well.
I owe you $6. 00 because you bought me a No. 3 burger.
-Sgt. Smith
Pattern details
\s* - matches 0 or more whitespace (used to trim the results)
(?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+ - one or more occurrences of several alternatives:
\d+\.\s*\d+ - 1+ digits, ., 0+ whitespaces, 1+ digits
(?:No|M[rs]|[JD]r|S(?:r|gt))\. - abbreviated strings like No., Mr., Ms., Jr., Dr., Sr., Sgt.
\.(?!\s+-?[A-Z0-9]) - matches a dot not followed by 1 or more whitespaces, then an optional -, then an uppercase letter or a digit
| - or
[^.!?] - any character other than ., ! and ?
(?:[.?!]|$) - a ., ! or ?, or end of string.
As mentioned in my comment above, if you are not able to define a fixed set of edge cases, this might not be possible without false positives or false negatives. Again, without context you cannot distinguish between abbreviations like "-Sgt. Smith" and ends of sentences like "Sergeant is oftentimes abbreviated as Sgt. This makes it shorter.".
However, if you can define a fixed set of edge cases, it's probably easier and much more readable to do this in multiple steps.
1. Identify your edge cases
For example, you can distinguish "I'll have a No. 3" and "No. I am your father" by checking for a subsequent number. So you would identify that edge case with a regex like this: No\. \d (Again, context matters. Sentences like "Is 200 enough? No. 200 is not enough." will still give you a false positive.)
2. Mask your edge cases
For each edge case, mask the string with a respective string that will 100% not be part of the original text. E.g. "No." => "======NUMBER======"
3. Run your algorithm
Now that you got rid of your unwanted punctuations, you can run a simpler regex like this to identify the true positives: [\.\!\?]\s
4. Unmask your edge cases
Turn "======NUMBER======" back into "No."
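The four steps above can be sketched in Python roughly as follows. This is a minimal illustration under a few assumptions: the only edge case handled is "No." followed by a digit, the function name split_sentences is made up, and the split uses a lookbehind instead of the plain [\.\!\?]\s so that the punctuation stays attached to each sentence.

```python
import re

# Step 2 placeholder; it must never occur in the real text.
MASK = "======NUMBER======"

def split_sentences(text):
    # Steps 1 and 2: find "No." followed by a space and a digit, and mask it.
    masked = re.sub(r"No\.(?= \d)", MASK, text)
    # Step 3: split on whitespace that follows sentence punctuation.
    parts = re.split(r"(?<=[.!?])\s+", masked)
    # Step 4: put the original "No." back.
    return [part.replace(MASK, "No.") for part in parts]

result = split_sentences("I'll have a No. 3 burger. No. I am your father.")
for sentence in result:
    print(sentence)
```

Further edge cases (Sgt., N.Y., and so on) would each get their own mask string and their own pair of mask and unmask steps.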
Doing it with only one regex will be tricky - as stated in comments, there are lots of edge cases.
Myself I would do it with three steps:
Replace spaces that should stay with some special character (re.sub)
Split the text (re.split)
Replace the special character with space
For example:
import re
zero_width_space = '\u200B'
s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'
s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
s = re.split(r'(?<=[.?!])\s+', s)
from pprint import pprint
pprint([line.replace(zero_width_space, ' ') for line in s])
Prints:
['I am from New York, N.Y. and I would like to say hello!',
'How are you today?',
'I am well.',
'I owe you $6. 00 because you bought me a No. 3 burger.',
'-Sgt. Smith']

Capture the latest in backreference

I have this regex
(\b(\S+\s+){1,10})\1.*MY
and I want group 1 to capture "The name" from
The name is is The name MY
I get "is" for now.
The name can be any random words of any length.
It need not be at the beginning.
It need not be only 2 or 3 words. It can be fewer than 10 words.
The only sure thing is that it will be the last set of repeating words.
Examples:
The name is Anthony is is The name is Anthony - "The name is Anthony".
India is my country All Indians are India is my country - "India is my country "
Times of India Alphabet Google is the company Alphabet Google canteen - "Alphabet Google"
You could try:
(\b\w+[\w\s]+\b)(?:.*?\b\1)
As demonstrated here
Explanation -
(\b\w+[\w\s]+\b) is the capture group 1 - which is the text that is repeated - separated by word boundaries.
(?:.*?\b\1) is a non-capturing group which tells the regex system to match the text in group 1, only if it is followed by zero-or-more characters, a word-boundary, and the repeated text.
Regex generally captures the longest leftmost match. There are no examples in your question where this would not actually be the string you want, but that could just mean you have not found good examples to show us.
With that out of the way,
((\S+\s)+)(\S+\s){0,9}\1
would appear to match your requirements as currently stated. The "longest leftmost" behavior could still get in the way if there are e.g. straddling repetitions, like
this that more words this that more words
where in the general case regex alone cannot easily be made to always prefer the last possible match and tolerate arbitrary amounts of text after it.
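To see the pattern above in action, here is a small Python sketch run against the first sample string from the question. The names REPEAT and repeated are invented for the example. Note that every word in the pattern is written as \S+\s, so group 1 carries a trailing whitespace character (stripped below), and a repetition sitting at the very end of the input with no whitespace after it would not be matched in full as the pattern stands.

```python
import re

# Group 1: one or more words; then up to 9 filler words; then group 1 again.
REPEAT = re.compile(r"((\S+\s)+)(\S+\s){0,9}\1")

m = REPEAT.search("The name is is The name MY")
repeated = m.group(1).strip() if m else None
print(repeated)
```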

Regex - get string after full date and before standard text

I'm stuck on another regex. I'm extracting email data. In the below example, only the time, date and message in quotes changes.
Message Received 6:06pm 21st February "Hello. My name is John Smith" Some standard text.
Message Received 8:08pm 22nd February "Hello. My name is "John Smith"" Some standard text.
How can I get the message only if I need to start with the positive lookbehind, (?<=Message Received ) to begin searching at this particular point of the data? The message will always start and end with quotes but the user is able to insert their own quotes as in the second example.
You can just use a negated character class in a capturing group:
/Message Received.*?"([^\n]+)"/
Snippet:
$input = 'Message Received 6:06pm 21st February "Hello. My name is John Smith" Some standard text.
Message Received 8:08pm 22nd February "Hello. My name is "John Smith"" Some standard text.';
preg_match_all('/Message Received.*?"([^\n]+)"/', $input, $matches);
foreach ($matches[1] as $match) {
echo $match . "\r\n";
}
Output:
> Hello. My name is John Smith
> Hello. My name is "John Smith"
To extract the message between the double quotes:
(?=Message Received)[^\"]+\K\"[\w\s\"\.]+\"
Regex demo
You can capture the message in a group:
(?<=Message Received)[^"]*(.*)(?=\s+Some standard text)
Two out of the other three posted answers on this page provide an incorrect result. None of the other posted answers are as efficient as they could be:
To correctly extract the substring between the outer double quotes, use one of the following patterns:
/Message Received[^"]+"\K[^\n]+(?=")/ (No capture group, takes 132 steps, Demo)
/Message Received[^"]+"([^\n]+)"/ (Capture group, takes 130 steps, Demo)
Both patterns provide maximum accuracy and efficiency using negated character classes leading up to and including the targeted substring. The first pattern reduces preg_match_all()'s output array bloat by 50% by using \K instead of a capture group. For these reasons, one of these patterns should be used in your project. As your input string increases in size, my patterns provide increasingly better performance versus the other posted patterns.
PHP Implementation:
$in represents your input string.
Pattern #1 Method:
var_export(preg_match_all('/Message Received[^"]+"\K[^\n]+(?=")/',$in,$out)?$out[0]:[]);
// notice the output array only has elements in the fullstring subarray [0]
Output:
array (
0 => 'Hello. My name is John Smith',
1 => 'Hello. My name is "John Smith"',
)
Pattern #2 Method:
var_export(preg_match_all('/Message Received[^"]+"([^\n]+)"/',$in,$out)?$out[1]:[]);
// notice because a capture group is used, [0] subarray is ignored, [1] is used
Output:
array (
0 => 'Hello. My name is John Smith',
1 => 'Hello. My name is "John Smith"',
)
Both methods provide the desired output.
Anirudha's incorrect pattern: /(?<=Message Received)[^"]*(.*)(?=\s+Some standard text)/ (345 steps + a capture group + includes the unwanted outer double quotes)
Josh Crozier's pattern: /Message Received.*?"([^\n]+)"/ (174 steps + a capture group)
Sahil Gulati's incorrect pattern: /(?=Message Received)[^\"]+\K\"[\w\s\"\.]+\"/ (109 steps + includes the unwanted outer double quotes + unnecessarily escapes characters in the pattern)
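For comparison outside PHP, the capture-group pattern from this answer carries over to Python unchanged. Python's re module does not support \K, so only the capture-group variant is shown. This is a sketch; the names MESSAGE, data and messages are made up for the example.

```python
import re

# Negated class up to the first quote, then greedy to the last quote on the line.
MESSAGE = re.compile(r'Message Received[^"]+"([^\n]+)"')

data = (
    'Message Received 6:06pm 21st February "Hello. My name is John Smith" '
    'Some standard text.\n'
    'Message Received 8:08pm 22nd February "Hello. My name is "John Smith"" '
    'Some standard text.'
)

# With exactly one capture group, findall returns that group for every match.
messages = MESSAGE.findall(data)
for message in messages:
    print(message)
```

The greedy [^\n]+ is what keeps the user's own inner quotes: it runs to the end of the line and backs off only as far as the last double quote.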

Regex parse with alteryx

One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried regex (\<\w+\>)\s\<\w+\> but only get a few records to the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA
This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
See demo
The negative look ahead includes a word boundary to allow suburbs that start with a state abbreviation (see demo).
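The pattern can be tried outside Alteryx as well. The Python sketch below applies it to the sample column from the question. The names SUBURB, suburb and rows are made up for the illustration, and falling back to the original value when nothing matches is an assumption, not part of the answer.

```python
import re

# Letter-only words, stopping before a state abbreviation or a postcode.
SUBURB = re.compile(r"( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+")

def suburb(value):
    """Return the leading suburb name, or the value unchanged if none is found."""
    m = SUBURB.match(value)
    return m.group(0) if m else value

rows = [
    "CABRAMATTA",
    "CANLEY HEIGHTS",
    "ST JOHNS PARK",
    "Parramatta NSW 2150",
    "Claymore 2559",
    "CASULA",
]
cleaned = [suburb(row) for row in rows]
for name in cleaned:
    print(name)
```

The postcode needs no separate rule here, because [a-zA-Z]+ simply cannot match digits.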
Expanding on Bohemian's answer, you can use groupings to do a REGEXP REPLACE in alteryx. So:
REGEX_Replace([Field1], "(.*)(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This will grab anything that matches in the first group (so just the suburb). The second and third groups match the state and the zip. Not a perfect regex, but should get you most of the way there.
I think this workflow will help you :