RegEx for passing punctuation - regex

I am using:
(.*) CO\s?[\(.*\)|\[.*\]|\{.*\}|''.*''|".*"](.*)
to represent
3M CO 'A'(MINNESOTA MINING AND MANUFACTURING COMPANY).
However, the first single quotation mark is not captured by the regex. Could you please tell me why?
s/(.*) CO\s?[\(.*\)|\[.*\]|\{.*\}|''.*''|".*"](.*)/$1 CO $2
I expect to get:
3M CO 'A'(MINNESOTA MINING AND MANUFACTURING COMPANY)
but I get
3M CO A'(MINNESOTA MINING AND MANUFACTURING COMPANY)

One option is to design an expression that matches the input part by part, such as:
(.+?)\s+CO\s+(['"].+?['"])([(\[{]).+?([)\]}])
The extra boundaries in this expression can be relaxed if they are not needed.
The expression has three main parts:
(.+?) # anything before CO;
(['"].+?['"]) # the quoted part; and
([(\[{]).+?([)\]}]) # the bracketed part, with the opening and closing brackets captured in their own groups; these can be escaped if required.
This snippet shows how the capturing groups work:
const regex = /(.+?)\s+CO\s+(['"].+?['"])([(\[{]).+?([)\]}])/mg;
const str = `3M CO 'A'(MINNESOTA MINING AND MANUFACTURING COMPANY)
3M CO 'A'[MINNESOTA MINING AND MANUFACTURING COMPANY]
3M CO 'A'{MINNESOTA MINING AND MANUFACTURING COMPANY}
3M CO "A"{MINNESOTA MINING AND MANUFACTURING COMPANY}`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
If this expression is not what you want, it can be modified and tested on regex101.com.

Your regex should be written as
/(.*)\sCO\s?(\(.+\).*|".+".*|'.+'.*|{.+}.*|\[.+\].*)/
(.*) The first capture group captures the leading text ("3M" in your example)
\sCO\s? Then looks for a whitespace character, followed by CO, followed by an optional whitespace character
(".+".* etc.) The second capture group looks for an opening quote or bracket, followed by at least one character of anything, followed by the closing quote or bracket, then followed by any number of any characters
Why Original Regex Didn't Work
In the original regex, [\(.*\)|\[.*\]|\{.*\}|''.*''|".*"] can be simplified into [''.*''] (for the string you provided). I realize that for other strings, you might want to look for (.*) or [.*] or {.*} or ".*", but for the "3M" string, only the [''.*''] is relevant so we'll just look at this.
So [''.*''] just means: match any one character from the list inside [], regardless of order. In this case, there are three unique characters in the list: ', . and * (although ' appears four times). So it matched the first '. But since this match is outside your capture groups (), this first ' is not included in either captured group.
So the next match with (.*) matches everything else that comes after the first ' and includes them in the second matching group, i.e. A'(MINNESOTA MINING AND MANUFACTURING COMPANY) without the ' in front.
Does that make sense?
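To make this concrete, here is a minimal Python sketch. Using Python's re module is an assumption on my part (the question does not name an engine), but the character-class behaviour is the same. It runs the original pattern next to a corrected one where the alternatives sit in a capture group; the names ORIGINAL and FIXED are only for illustration.

```python
import re

text = "3M CO 'A'(MINNESOTA MINING AND MANUFACTURING COMPANY)"

# Original pattern: everything inside [...] is a set of single characters,
# so the class consumes exactly one character (here, the first ').
ORIGINAL = re.compile(r"""(.*) CO\s?[\(.*\)|\[.*\]|\{.*\}|''.*''|".*"](.*)""")

# Corrected sketch: the alternatives are moved into a capture group.
FIXED = re.compile(r"""(.*)\sCO\s?(\(.+\).*|".+".*|'.+'.*|\{.+\}.*|\[.+\].*)""")

original_groups = ORIGINAL.match(text).groups()
fixed_groups = FIXED.match(text).groups()

print(original_groups)  # group 2 has lost its leading quote
print(fixed_groups)     # group 2 keeps the leading quote
```

The only difference that matters is [...] versus (...): the first matches one character from a set, the second matches one of several sub-patterns.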
If you wanted to ensure the format includes 'A' or [A] or "A" or {A} or (A), then this is what you want:
let regex = /(.*)\sCO\s?(\(.+\)|".+".*|'.+'.*|{.+}.*|\[.+\].*)/;
let [pattern, match1, match2] = "3M CO 'A'(MINNESOTA MINING AND MANUFACTURING COMPANY)".match(regex);
console.log(match1 + " CO " + match2);
//3M CO 'A'(MINNESOTA MINING AND MANUFACTURING COMPANY)
[pattern, match1, match2] = '3M CO (A)(MINNESOTA MINING AND MANUFACTURING COMPANY)'.match(regex);
console.log(match1 + " CO " + match2);
//3M CO (A)(MINNESOTA MINING AND MANUFACTURING COMPANY)
[pattern, match1, match2] = '3M CO "A"(MINNESOTA MINING AND MANUFACTURING COMPANY)'.match(regex);
console.log(match1 + " CO " + match2);
//3M CO "A"(MINNESOTA MINING AND MANUFACTURING COMPANY)
[pattern, match1, match2] = "3M CO [A](MINNESOTA MINING AND MANUFACTURING COMPANY)".match(regex);
console.log(match1 + " CO " + match2);
//3M CO [A](MINNESOTA MINING AND MANUFACTURING COMPANY)
[pattern, match1, match2] = "3M CO {A}(MINNESOTA MINING AND MANUFACTURING COMPANY)".match(regex);
console.log(match1 + " CO " + match2);
//3M CO {A}(MINNESOTA MINING AND MANUFACTURING COMPANY)

The ' is not captured in the second group because you are using a character class, which can be written as CO\s?[(.*)|[\]{}'"], and that class matches the ' in CO '
So your pattern actually looks like:
(.*) CO\s?[.*()|[\]{}'"](.*)
^         ^             ^
group 1   Char class    group 2
What you might do to get those matching in 2 groups is to use:
(.*?)CO\s?((?:(['"]).*?\3|\(.*?\)|\[.*?\]|\{.*?\}).*)
Explanation
(.*?) Capturing group 1, match any char except newline, non-greedy
CO\s? Match CO and an optional whitespace char
( Capturing group 2
  (?: Non-capturing group, match any of the options
    (['"]).*?\3 Match ' or " and use a backreference to what is captured
    | Or
    \(.*?\) Match (....)
    | Or
    \[.*?\] Match [....]
    | Or
    \{.*?\} Match {....}
  ) Close non-capturing group
  .* Match any char until the end of the string
) Close group 2
Note that the .*? is non greedy to prevent unnecessary backtracking and over matching.
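As a rough check of the backreference idea, here is a small Python sketch (Python's re is my assumption; the helper name split_company is made up for the example). It shows that \3 forces the closing quote to be the same character as the opening one.

```python
import re

# \3 must repeat whatever quote character group 3 captured.
PATTERN = re.compile(r"""(.*?)CO\s?((?:(['"]).*?\3|\(.*?\)|\[.*?\]|\{.*?\}).*)""")

def split_company(text):
    """Return (group 1, group 2), or None when no delimited part follows CO."""
    m = PATTERN.search(text)
    return (m.group(1), m.group(2)) if m else None

single = split_company("3M CO 'A'(MINNESOTA MINING AND MANUFACTURING COMPANY)")
double = split_company('3M CO "A"[X]')
mismatched = split_company("3M CO 'A\"(X)")  # opens with ' but closes with "

print(single)
print(double)
print(mismatched)
```

Note that group 1 keeps the space before CO, because (.*?) is followed directly by CO in the pattern.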

Related

Using regex to parse data between delimiter and ending at a specified substring

I'm trying to parse out the names from a bunch of semi-unpredictable strings. More specifically, I'm using ruby, but I don't think that should matter much. This is a contrived example but some example strings are:
Eagles vs Bears
NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN
NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN
Philadelphia Eagles vs Chicago Bears - NFL Match
Phil.Eagles vs Chic.Bears
3agles vs B3ars
The regex I've come up with is
/([0-9A-Z .]*) vs ([0-9A-Z .]*)(?:[ -:]*tune)?/i
but in the case of "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN" I'm receiving Chicago Bears TUNE as the second match. I'm trying to remove "tune in" so it's in its own group.
I thought that by adding (?:[ -:]*tune)? it would separate the ending portion of the expression the same way that having vs in the middle was able to, but that doesn't seem to be the case. If I remove the ? at the end, it matches correctly for the above example, but it no longer matches for Eagles vs Bears
If anyone could help me, I would greatly appreciate it if you could break down your regex piece by piece.
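You can make the second group lazy and let it capture up to a -, a : or tune (each preceded by zero or more whitespace characters), or up to the end of the line. Keep the case-insensitive flag from your original pattern, since the inputs mix vs/VS and tune/TUNE: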
([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))
Details:
([\w .]*) - Group 1: zero or more word, space or . chars, as many as possible
vs - a vs string with a space on each side
([\w .]*?) - Group 2: zero or more word, space or . chars, as few as possible
(?=\s*(?:[:-]|tune|$)) - a positive lookahead that requires the following pattern to appear immediately to the right of the current location:
    \s* - zero or more whitespaces
    (?:[:-]|tune|$) - : or -, tune or end of a line.
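Here is a quick Python sketch of that pattern against the sample strings from the question. Python and the helper name teams are my own choices for illustration, not something from the question (which uses Ruby). One thing the sketch makes visible: group 1 can pick up a leading space, because the character class allows spaces, so the helper strips it.

```python
import re

# Case-insensitive, because the inputs mix "vs"/"VS" and "tune"/"TUNE".
PATTERN = re.compile(r"([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))", re.IGNORECASE)

examples = [
    "Eagles vs Bears",
    "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
    "NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
    "Philadelphia Eagles vs Chicago Bears - NFL Match",
    "Phil.Eagles vs Chic.Bears",
    "3agles vs B3ars",
]

def teams(line):
    """Return (team1, team2) with surrounding whitespace removed, or None."""
    m = PATTERN.search(line)
    return (m.group(1).strip(), m.group(2).strip()) if m else None

results = [teams(line) for line in examples]
for line, pair in zip(examples, results):
    print(f"{line!r} -> {pair}")
```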
You can use the following regular expression which I have expressed in free-spacing mode to make it self-documenting (search for "Free-Spacing Mode" at the link).
rgx = /
  (?:[ ]|\A)           # match a space or the beginning of the string
  (?<team1>            # begin capture group team1
    (?<team>           # begin capture group team
      (?<word>         # begin capture group word
        (?:\p{Lu}|\d)  # word begins with an uppercase letter or digit
        (?:\p{Ll}|\d)+ # ...followed by 1+ lowercase letters or digits
      )                # end capture group word
      (?:              # begin non-capture group
        [ .]           # match a space or period
        \g<word>       # match another word
      )*               # end non-capture group and execute 0+ times
    )                  # end capture group team
  )                    # end capture group team1
  [ ]+                 # match one or more spaces
  (?:VS|vs)            # match literal
  [ ]+                 # match one or more spaces
  (?<team2>            # begin capture group team2
    \g<team>           # match the second team name
  )                    # end capture group team2
  (?:                  # begin non-capture group
    [ ]                # match a space
    (?:                # begin non-capture group
      (?:-[ ])?        # optionally match literal
      TUNE[ ]IN        # match literal
      |                # or
      -[ ]NFL[ ]Match  # match literal
    )                  # end inner non-capture group
  )?                   # end outer non-capture group and make it optional
  \z                   # match end of string
/x                     # free-spacing regex definition mode
examples = [
"Eagles vs Bears",
"NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
"NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
"Philadelphia Eagles vs Chicago Bears - NFL Match",
"Phil.Eagles vs Chic.Bears",
"3agles vs B3ars"
]
examples.map do |s|
  m = s.match(rgx)
  [m[:team1], m[:team2]]
end
#=> [["Eagles", "Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Phil.Eagles", "Chic.Bears"],
# ["3agles", "B3ars"]]
See Regexp#match and MatchData#[].
Note that \g<word> and \g<team> effectively copy the code contained in the capture groups word and team, respectively. These are called "Subexpression Calls". For additional information search for that term at Regexp. There are two advantages to using subexpression calls: less code is needed and the opportunities for coding errors are reduced.
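For readers outside Ruby: Python's standard re module has no subexpression calls, but a similar "define once, reuse" effect can be had by composing the pattern from strings. The sketch below is only an approximation under stated assumptions: it swaps \p{Lu} and \p{Ll} for ASCII ranges (re does not support \p), and the names WORD, TEAM and PATTERN are mine.

```python
import re

# ASCII-only stand-ins for Ruby's \p{Lu} and \p{Ll} (an assumption made to
# stay within the standard library).
WORD = r"[A-Z0-9][a-z0-9]+"
TEAM = rf"{WORD}(?:[ .]{WORD})*"          # reuses WORD, like \g<word>

PATTERN = re.compile(
    rf"(?:[ ]|\A)(?P<team1>{TEAM})"       # reuses TEAM, like \g<team>
    rf"[ ]+(?:VS|vs)[ ]+(?P<team2>{TEAM})"
    r"(?:[ ](?:(?:-[ ])?TUNE[ ]IN|-[ ]NFL[ ]Match))?\Z"
)

examples = [
    "Eagles vs Bears",
    "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
    "NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
    "Philadelphia Eagles vs Chicago Bears - NFL Match",
    "Phil.Eagles vs Chic.Bears",
    "3agles vs B3ars",
]

pairs = []
for s in examples:
    m = PATTERN.search(s)
    pairs.append((m["team1"], m["team2"]))

print(pairs)
```

The trade-off is that string composition copies the sub-pattern text into the final regex, whereas a subexpression call is resolved by the engine; for a pattern this size the result is the same.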

how to implement or after group in regex pattern

I want to get the thread ID from my URLs with one pattern. The pattern should have just one group (on level 1). My test strings are:
https://www.mypage.com/thread-3306-page-32.html
https://www.mypage.com/thread-3306.html
https://www.mypage.com/Thread-String-Thread-Id
So I want a pattern that gives me the number 3306 for lines 1 and 2, and "String-Thread-Id" for the last line.
My current attempt is .*[t|T]hread-(.*)[\-page.*|.html]. But it fails at the end, after the ID. How can I do this properly? I also solved it with .*Thread-(.*)|.*thread-(\\w+).*, but this uses two groups, which does not work for my Java code.
I don't know if this fits all situations, but I would try this:
^.*?thread-((?:(?!-page|\.html).)*)
In Java, that could look something like
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("^.*?thread-((?:(?!-page|\\.html).)*)", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(subjectString); // subjectString holds the input text
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group(1));
}
Explanation:
^                      # Match start of line
.*?                    # Match any number of characters, as few as possible
thread-                # until "thread-" is matched.
(                      # Then start a capturing group (number 1) to match:
  (?:                  # (start of non-capturing group)
    (?!-page|\.html)   # assert that neither "-page" nor ".html" follows
    .                  # then match any character
  )*                   # repeat as often as possible
)                      # end of capturing group
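The same pattern can be tried quickly in Python (my choice for a short, runnable check; the question itself targets Java). With a single capturing group, re.findall returns just that group for each line.

```python
import re

# Same pattern as above; IGNORECASE covers both "thread-" and "Thread-".
PATTERN = re.compile(
    r"^.*?thread-((?:(?!-page|\.html).)*)",
    re.IGNORECASE | re.MULTILINE,
)

urls = """https://www.mypage.com/thread-3306-page-32.html
https://www.mypage.com/thread-3306.html
https://www.mypage.com/Thread-String-Thread-Id"""

thread_ids = PATTERN.findall(urls)
print(thread_ids)
```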

Exclude word and quotes from regexp

I have the following phrases:
Mr "Smith"
MrS "Smith"
I need to retrieve only Smith from these phrases. I tried thousands of variants. I stopped at
(?!Mr|MrS)([^"]+).
Help, please.
The pattern (?!Mr|MrS)([^"]+) asserts from the current position that what is directly to the right is not Mr or MrS and then captures 1+ occurrences of any char except "
So it will not start the match at Mr, but it will start at r, because at the position before the r the lookahead assertion is true.
Instead of using a lookaround, you could match either Mr or MrS and capture what is in between double quotes.
\mMrS? "([^"]+)"
\m Matches only at the beginning of a word (PostgreSQL)
MrS? Match Mr with an optional S
 " Match a space and "
([^"]+) capture in group 1 what is between the "
" Match "
For example
select REGEXP_MATCHES('Mr "Smith"', '\mMrS? "([^"]+)"');
select REGEXP_MATCHES('MrS "Smith"', '\mMrS? "([^"]+)"');
Output
regexp_matches
1 Smith
regexp_matches
1 Smith
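To see why the lookahead version misbehaves, here is a small Python sketch. Python is my substitution for illustration (the answer above targets PostgreSQL); Python's re has no \m, so \b is used in its place, and the variable names are made up.

```python
import re

phrase = 'Mr "Smith"'

# The lookahead only blocks a match that STARTS at "Mr". The engine then
# retries one character later, where the text ahead is 'r "Smith"'.
lookahead_only = re.search(r'(?!Mr|MrS)([^"]+)', phrase).group(1)

# Match the title explicitly and capture only what is between the quotes.
# \b (word boundary) stands in for PostgreSQL's \m here.
TITLE = re.compile(r'\bMrS? "([^"]+)"')
names = [TITLE.search(p).group(1) for p in ('Mr "Smith"', 'MrS "Smith"')]

print(repr(lookahead_only))
print(names)
```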

Or between groups when one group has to be preceded by a character

I have the following data:
$200 – $4,500
Points – $2,500
I would like to capture the ranges in dollars, or capture the Points string if that is the lower range.
For example, if I ran my regex on each of the entries above I would expect:
Group 1: 200
Group 2: 4,500
and
Group 1: Points
Group 2: 2,500
For the first group, I can't figure out how to capture only the integer value (without the $ sign) while allowing for capturing Points.
Here is what I tried:
(?:\$([0-9,]+)|Points) – \$([0-9,]+)
https://regex101.com/r/mD9JeR/1
Just use an alternation here:
^(?:(Points)|\$(\d{1,3}(?:,\d{3})*)) – \$(\d{1,3}(?:,\d{3})*)$
The salient points of the above regex pattern are that we use an alternation to match either Points or a dollar amount on the lower end of the range, and we use the following regex for matching a dollar amount with commas:
\$\d{1,3}(?:,\d{3})*
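A minimal Python sketch of this alternation follows. It assumes the separator is the en dash (–) that appears in the sample data, and the helper name parse_range is hypothetical. Because Points and the lower dollar amount land in different groups, the helper picks whichever one matched.

```python
import re

AMOUNT = r"\d{1,3}(?:,\d{3})*"
# Assumes the en dash separator used in the sample data.
PATTERN = re.compile(rf"^(?:(Points)|\$({AMOUNT})) – \$({AMOUNT})$")

def parse_range(text):
    """Return (lower, upper) as strings, or None if the text does not match."""
    m = PATTERN.match(text)
    if m is None:
        return None
    points, low, high = m.groups()
    # Exactly one of `points` and `low` is set; the other is None.
    return (points or low, high)

ranges = [parse_range(t) for t in ("$200 – $4,500", "Points – $2,500")]
print(ranges)
```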
Coming up with a regex that doesn't match the $ is not difficult. Coming up with a regex that doesn't match the $ and consistently puts the two values, whether they are both numeric or one of them is Points, in capture groups 1 and 2 is not straightforward. The difficulties disappear if you use named capture groups. This regex requires the regex module from the PyPI repository, since it uses the same group name multiple times, which the built-in re module does not allow.
import regex
tests = [
    '$200 – $4,500',
    'Points – $2,500'
]
re = r"""(?x)             # verbose mode
    ^                     # start of string
    (
        \$                # first alternate choice
        (?P<G1>[\d,]+)    # named group G1
    |                     # or
        (?P<G1>Points)    # second alternate choice
    )
    \x20–\x20             # ' – '
    \$
    (?P<G2>[\d,]+)        # named group G2
    $                     # end of string
"""
# or re = r'^(\$(?P<G1>[\d,]+)|(?P<G1>Points)) – \$(?P<G2>[\d,]+)$'
for test in tests:
    m = regex.match(re, test)
    print(m.group('G1'), m.group('G2'))
Prints:
200 4,500
Points 2,500
UPDATE
@marianc was on the right track with his comment but did not ensure that there were no extraneous characters in the input. So, with his useful input:
import re
tests = [
    '$200 – $4,500',
    'Points – $2,500',
    'xPoints – $2,500',
]
rex = r'((?<=^\$)\d{1,3}(?:,\d{3})*|(?<=^)Points) – \$(\d{1,3}(?:,\d{3})*)$'
for test in tests:
    m = re.search(rex, test)
    if m:
        print(test, '->', m.groups())
    else:
        print(test, '->', 'No match')
Prints:
$200 – $4,500 -> ('200', '4,500')
Points – $2,500 -> ('Points', '2,500')
xPoints – $2,500 -> No match
Note that re.search is used rather than re.match: re.match only attempts a match at position 0, where the lookbehind for $ cannot succeed because nothing precedes that position. We still reject extraneous characters at the start of the line by including the ^ anchor inside the lookbehind assertions.
For the first capturing group, you could use an alternation that either matches Points, asserting that what is on the left is not a non-whitespace char, or matches the digits (with optional comma-separated groups of three digits), asserting with a positive lookbehind that what is on the left is a dollar sign, if lookbehinds are supported.
For the second capturing group, there is no alternative, so you can match the dollar sign and capture the digits in group 2.
((?<=\$)\d{1,3}(?:,\d{3})*|(?<!\S)Points) – \$(\d{1,3}(?:,\d{3})*)
Explanation
( Capture group 1
  (?<=\$)\d{1,3}(?:,\d{3})* Positive lookbehind, assert a $ to the left and match 1-3 digits and repeat 0+ times matching a comma and 3 digits
  | Or
  (?<!\S)Points Negative lookbehind, assert that there is no non-whitespace char to the left and match Points
) Close group 1
 –  Match literally (with a space on each side)
\$ Match $
( Capture group 2
  \d{1,3}(?:,\d{3})* Match 1-3 digits and 0+ times a comma and 3 digits
) Close group 2
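As a rough check of the two lookbehinds, here is a short Python sketch (Python's re supports these fixed-width lookbehinds; the helper name groups_for is mine). The third input shows (?<!\S) rejecting Points when it is glued to a preceding character.

```python
import re

PATTERN = re.compile(
    r"((?<=\$)\d{1,3}(?:,\d{3})*|(?<!\S)Points) – \$(\d{1,3}(?:,\d{3})*)"
)

def groups_for(text):
    """Return the two captured groups, or None when there is no match."""
    m = PATTERN.search(text)
    return m.groups() if m else None

dollars = groups_for("$200 – $4,500")
points = groups_for("Points – $2,500")
glued = groups_for("xPoints – $2,500")  # 'x' directly before Points

print(dollars, points, glued)
```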

REGEX - Need a way of negative lookbehind'ing multiple strings

I am trying to match unwanted regions in filenames to delete the files.
I want to get any match if the REGEX finds a "bad region" (Brazil or Columbia) BUT not if they are mixed in with "good regions" in the same bracket (USA, UK, Europe, Australia).
I have a regex of
(?<![( ](USA)[,)])[( ](Brazil|Columbia)[,)](?![( ](USA|UK|Europe|Australia)[,)])
FIFA Soccer (USA, Brazil) <<< DON'T MATCH IF USA IS IN SAME BRACKET BEFORE
FIFA Soccer (Brazil, USA) <<< DON'T MATCH IF USA IS IN SAME BRACKET AFTER
FIFA Soccer (Brazil) <<< MATCH
FIFA Soccer (Brazil, Ireland) <<< MATCH
FIFA Soccer (Moon, Brazil) <<< MATCH
So far the correct lines match, but that's because I have a fixed-width "negative lookbehind" looking for "USA"...... but I also want "UK" "Europe" and "Australia" in my negative lookbehinds, and I can't do that as they have to be "fixed width"...
FIFA Soccer (UK, Brazil) <<< ERROR - THIS ONE SHOULDN'T MATCH AND DOES
FIFA Soccer (Brazil, UK) <<< This one works (no match) because I have my lookahead set up
So is there a way of getting, in effect, something like (?<![( ](USA|UK|Europe|Australia)[,)]) at the start of the regex to reject things like UK, Brazil and Europe, Brazil?
You may use
\((?!(?:[^()]*,\s*)?(?:USA|UK|Europe|Australia)\s*[,)])[^()]*\)
Details
\( - a ( char
(?!(?:[^()]*,\s*)?(?:USA|UK|Europe|Australia)\s*[,)]) - a negative lookahead that will fail the match if, immediately to the right, there is
    (?:[^()]*,\s*)? - an optional sequence of
        [^()]* - 0+ chars other than ( and )
        , - a comma
        \s* - 0+ whitespaces
    (?:USA|UK|Europe|Australia) - one of the good values
    \s* - 0+ whitespaces
    [,)] - a , or )
[^()]* - 0 or more chars other than ( and )
\) - a ) char.
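Here is a short Python sketch that runs this pattern over the sample filenames from the question, plus the two UK cases. Python and the names GOOD, PATTERN and matched are my own choices for the illustration. Keep in mind that the pattern matches any bracket that contains no good region; it does not itself check that a bad region is present.

```python
import re

GOOD = "USA|UK|Europe|Australia"
# The lookahead rejects a bracket if any good region appears inside it.
PATTERN = re.compile(rf"\((?!(?:[^()]*,\s*)?(?:{GOOD})\s*[,)])[^()]*\)")

names = [
    "FIFA Soccer (USA, Brazil)",
    "FIFA Soccer (Brazil, USA)",
    "FIFA Soccer (Brazil)",
    "FIFA Soccer (Brazil, Ireland)",
    "FIFA Soccer (Moon, Brazil)",
    "FIFA Soccer (UK, Brazil)",
    "FIFA Soccer (Brazil, UK)",
]

matched = [n for n in names if PATTERN.search(n)]
print(matched)
```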
Instead of variable length negative lookbehind, you may use PCRE verbs (*SKIP)(*F) in alternation to match and reject a match:
(?:USA|UK|Europe|Australia),\h*(?:Brazil|Austria)[,)](*SKIP)(*F)|(?:Brazil|Austria)[,)](?!\h?(?:USA|UK|Europe|Australia)[,)])
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
Together, (*SKIP)(*FAIL) provide a convenient workaround for the restriction that a lookbehind in the regex above cannot be variable-length.
You may use a (?(DEFINE)...) group in PCRE to avoid repetition in your regex, like this:
/
  (?(DEFINE)                         # use define to avoid repetitions
    (?<ct>USA|UK|Europe|Australia)   # disallowed countries
    (?<mct>Brazil|Austria)           # matching countries
  )
  # main regex starts here
  (?&ct),\h*(?&mct)[,)](*SKIP)(*F)
  |
  (?&mct)[,)](?!\h?(?&ct)[,)])
/x
Using a pattern to extract country names and array_filter:
$filenames = ['FIFA Soccer (USA, Brazil)',
              'FIFA Soccer (Brazil, USA)',
              'FIFA Soccer (Brazil)',
              'FIFA Soccer (Brazil, Ireland)',
              'FIFA Soccer (Moon, Brazil)'];
$bad = ['Brazil', 'Columbia'];
$good = ['USA', 'UK', 'Europe', 'Australia'];
$todelete = array_filter($filenames, function ($i) use ($bad, $good) {
    $countries = preg_match_all('~(?:\G(?!\A), |\()\K\pL+~', $i, $m) ? $m[0] : [];
    return array_intersect($countries, $bad) && !array_intersect($countries, $good);
});
print_r($todelete);
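The same "extract the regions, then compare sets" idea can be sketched in Python. This is not a translation of the PHP pattern: as a simplifying assumption it grabs the first bracketed group and splits on commas, and the function names regions and should_delete are made up for the example.

```python
import re

BAD = {"Brazil", "Columbia"}
GOOD = {"USA", "UK", "Europe", "Australia"}

def regions(filename):
    """Return the set of region names inside the first (...) group."""
    m = re.search(r"\(([^()]*)\)", filename)
    return {part.strip() for part in m.group(1).split(",")} if m else set()

def should_delete(filename):
    """True when a bad region is present and no good region is."""
    found = regions(filename)
    return bool(found & BAD) and not (found & GOOD)

filenames = [
    "FIFA Soccer (USA, Brazil)",
    "FIFA Soccer (Brazil, USA)",
    "FIFA Soccer (Brazil)",
    "FIFA Soccer (Brazil, Ireland)",
    "FIFA Soccer (Moon, Brazil)",
]

to_delete = [f for f in filenames if should_delete(f)]
print(to_delete)
```

Keeping the region lists as plain sets means new good or bad regions can be added without touching the regex.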