REGEX - Need a way of negative lookbehind'ing multiple strings

I am trying to match unwanted regions in filenames to delete the files.
I want to get any match if the REGEX finds a "bad region" (Brazil or Columbia) BUT not if they are mixed in with "good regions" in the same bracket (USA, UK, Europe, Australia).
I have a regex of
(?<![( ](USA)[,)])[( ](Brazil|Columbia)[,)](?![( ](USA|UK|Europe|Australia)[,)])
FIFA Soccer (USA, Brazil) <<< DON'T MATCH IF USA IS IN SAME BRACKET BEFORE
FIFA Soccer (Brazil, USA) <<< DON'T MATCH IF USA IS IN SAME BRACKET AFTER
FIFA Soccer (Brazil) <<< MATCH
FIFA Soccer (Brazil, Ireland) <<< MATCH
FIFA Soccer (Moon, Brazil) <<< MATCH
So far the correct lines match, but that's because I have a fixed-width "negative lookbehind" looking for "USA"...... but I also want "UK" "Europe" and "Australia" in my negative lookbehinds, and I can't do that as they have to be "fixed width"...
FIFA Soccer (UK, Brazil) <<< ERROR - THIS ONE SHOULDN'T MATCH AND DOES
FIFA Soccer (Brazil, UK) <<< This one works (no match) because I have my lookahead set up
So is there a way of getting, in effect, something like (?<![( ](USA|UK|Europe|Australia)[,)]) at the start of the REGEX to unmatch things like UK, Brazil and Europe, Brazil?

You may use
\((?!(?:[^()]*,\s*)?(?:USA|UK|Europe|Australia)\s*[,)])[^()]*\)
See the regex demo
Details
\( - a ( char
(?!(?:[^()]*,\s*)?(?:USA|UK|Europe|Australia)\s*[,)]) - a negative lookahead that will fail the match if, immediately to the right, there is
(?:[^()]*,\s*)? - an optional sequence of
[^()]* - 0+ chars other than ( and )
, - a comma
\s* - 0+ whitespaces
(?:USA|UK|Europe|Australia) - one of the good values
\s* - 0+ whitespaces
[,)] - a , or )
[^()]* - 0 or more chars other than ( and )
\) - a ) char.
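As a quick check of the pattern above, here is a minimal Ruby sketch. Ruby is an assumption (the question does not name a language), and the variable names are made up for illustration. It runs the regex over the sample filenames from the question and keeps the ones that match. Note that the pattern matches any bracket that lacks a good region; it does not itself check that a bad region is present.

```ruby
# Matches a bracket group that contains none of the "good" regions.
no_good_region = /\((?!(?:[^()]*,\s*)?(?:USA|UK|Europe|Australia)\s*[,)])[^()]*\)/

# Sample filenames taken from the question.
filenames = [
  "FIFA Soccer (USA, Brazil)",
  "FIFA Soccer (Brazil, USA)",
  "FIFA Soccer (Brazil)",
  "FIFA Soccer (Brazil, Ireland)",
  "FIFA Soccer (Moon, Brazil)",
  "FIFA Soccer (UK, Brazil)",
  "FIFA Soccer (Brazil, UK)"
]

# Keep only the names whose bracket has no good region in it.
to_delete = filenames.select { |name| name.match?(no_good_region) }
puts to_delete
```

If brackets without any bad region can occur in your filenames, you would still need a second check for Brazil or Columbia.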

Instead of variable length negative lookbehind, you may use PCRE verbs (*SKIP)(*F) in alternation to match and reject a match:
(?:USA|UK|Europe|Australia),\h*(?:Brazil|Austria)[,)](*SKIP)(*F)|(?:Brazil|Austria)[,)](?!\h?(?:USA|UK|Europe|Australia)[,)])
Updated RegEx Demo
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice workaround for the restriction that a lookbehind cannot be variable-length in the regex above.
You may use a DEFINE group in PCRE to avoid repetition in your regex, like this:
/
(?(DEFINE) # use define to avoid repetitions
(?<ct>USA|UK|Europe|Australia) # disallow countries
(?<mct>Brazil|Austria) # matching countries
)
# main regex starts here
(?&ct),\h*(?&mct)[,)](*SKIP)(*F)
|
(?&mct)[,)](?!\h?(?&ct)[,)])
/x
RegEx Demo 2

Using a pattern to extract country names and array_filter:
$filenames = ['FIFA Soccer (USA, Brazil)',
'FIFA Soccer (Brazil, USA)',
'FIFA Soccer (Brazil)',
'FIFA Soccer (Brazil, Ireland)',
'FIFA Soccer (Moon, Brazil)'];
$bad = ['Brazil', 'Columbia'];
$good = ['USA', 'UK', 'Europe', 'Australia'];
$todelete = array_filter($filenames, function ($i) use ($bad, $good) {
$countries = preg_match_all('~(?:\G(?!\A), |\()\K\pL+~', $i, $m) ? $m[0] : [];
return array_intersect($countries, $bad) && !array_intersect($countries, $good);
});
print_r($todelete);
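The same "extract the regions, then compare lists" idea can be sketched in Ruby. This is only an illustration under assumptions: it expects a single bracket group per filename, and the variable names are hypothetical. It uses array intersection instead of lookarounds.

```ruby
bad  = %w[Brazil Columbia]
good = %w[USA UK Europe Australia]

filenames = [
  "FIFA Soccer (USA, Brazil)",
  "FIFA Soccer (Brazil, USA)",
  "FIFA Soccer (Brazil)",
  "FIFA Soccer (Brazil, Ireland)",
  "FIFA Soccer (Moon, Brazil)",
  "FIFA Soccer (UK, Brazil)"
]

to_delete = filenames.select do |name|
  # Text inside the first (...) group, split on commas; [] if there is none.
  regions = name[/\(([^()]*)\)/, 1].to_s.split(/\s*,\s*/)
  # Delete only when a bad region is present and no good region is.
  (regions & bad).any? && (regions & good).empty?
end

puts to_delete
```

Adding a region to either list then needs no change to any regex.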

Related

Using regex to parse data between delimiter and ending at a specified substring

I'm trying to parse out the names from a bunch of semi-unpredictable strings. More specifically, I'm using ruby, but I don't think that should matter much. This is a contrived example but some example strings are:
Eagles vs Bears
NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN
NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN
Philadelphia Eagles vs Chicago Bears - NFL Match
Phil.Eagles vs Chic.Bears
3agles vs B3ars
The regex I've come up with is
/([0-9A-Z .]*) vs ([0-9A-Z .]*)(?:[ -:]*tune)?/i
but in the case of "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN" I'm receiving Chicago Bears TUNE as the second match. I'm trying to remove "tune in" so it's in its own group.
I thought that by adding (?:[ -:]*tune)? it would separate the ending portion of the expression the same way that having vs in the middle was able to, but that doesn't seem to be the case. If I remove the ? at the end, it matches correctly for the above example, but it no longer matches for Eagles vs Bears
If anyone could help me, I would greatly appreciate it if you could break down your regex piece by piece.
You can make the second group lazy and let it stop at a -, a : or tune (each optionally preceded by whitespace), or at the end of the line. Keep the /i flag from your original pattern so that VS and TUNE still match:
([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))
See the regex demo.
Details:
([\w .]*) - Group 1: zero or more word, space or . chars as many as possible
vs - a vs string
([\w .]*?) - Group 2: zero or more word, space or . chars as few as possible
(?=\s*(?:[:-]|tune|$)) - a positive lookahead that requires the following pattern to appear immediately to the right of the current location:
\s* - zero or more whitespaces
(?:[:-]|tune|$) - : or -, tune or end of a line.
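Here is a small Ruby sketch that applies this pattern to the sample strings from the question. It assumes the /i flag from the original pattern is kept, and it strips the captures because group 1 can pick up a leading space (for example after "NFL Matchup:"). The variable names are illustrative only.

```ruby
pattern = /([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))/i

titles = [
  "Eagles vs Bears",
  "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
  "NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
  "Philadelphia Eagles vs Chicago Bears - NFL Match",
  "Phil.Eagles vs Chic.Bears",
  "3agles vs B3ars"
]

teams = titles.map do |title|
  m = pattern.match(title)
  # strip removes the space that group 1 may capture after a prefix like "NFL Matchup:"
  m && [m[1].strip, m[2].strip]
end

teams.each { |pair| p pair }
```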
You can use the following regular expression which I have expressed in free-spacing mode to make it self-documenting (search for "Free-Spacing Mode" at the link).
rgx = /
(?: |\A) # match space or beginning of string
(?<team1> # begin capture group team1
(?<team> # begin capture group team
(?<word> # begin capture group word
(?:\p{Lu}|\d) # word begins with an uppercase letter or digit
(?:\p{Ll}|\d)+ # ...followed by 1+ lowercase letters or digits
) # end capture group word
(?: # begin non-capture group
[ .] # match a space or period
\g<word> # match another word
)* # end non-capture group and execute 0+ times
) # end capture group team
) # end capture group team1
[ ]+ # match one or more spaces
(?:VS|vs) # match literal
[ ]+ # match one or more spaces
(?<team2> # begin capture group team2
\g<team> # match the second team name
) # end capture group team2
(?: # begin non-capture group
[ ] # match a space
(?: # begin non-capture group
(?:-[ ])? # optionally match literal
TUNE[ ]IN # match literal
| # or
-[ ]NFL[ ]Match # match literal
) # end inner non-capture group
)? # end outer non-capture group and make it optional
\z # match end of string
/x # free-spacing regex definition mode
examples = [
"Eagles vs Bears",
"NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
"NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
"Philadelphia Eagles vs Chicago Bears - NFL Match",
"Phil.Eagles vs Chic.Bears",
"3agles vs B3ars"
]
examples.map do |s|
m = s.match(rgx)
[m[:team1], m[:team2]]
end
#=> [["Eagles", "Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Phil.Eagles", "Chic.Bears"],
# ["3agles", "B3ars"]]
See Regexp#match and MatchData#[].
Note that \g<word> and \g<team> effectively copy the code contained in the capture groups word and team, respectively. These are called "Subexpression Calls". For additional information search for that term at Regexp. There are two advantages to using subexpression calls: less code is needed and the opportunities for coding errors are reduced.
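To see a subexpression call in isolation, here is a stripped-down Ruby sketch. It is a simplified, hypothetical pattern and not the full regex above: the group word is defined once and then reused through \g<word> for every further word.

```ruby
# One capitalized word, then any number of further capitalized words
# separated by a space or a period. \g<word> re-runs the "word" group.
name = /\A(?<word>\p{Lu}\p{Ll}+)(?:[ .]\g<word>)*\z/

p name.match?("Chicago Bears")  # every word is capitalized
p name.match?("Phil.Eagles")    # a period is accepted as separator
p name.match?("Chicago bears")  # second word is not capitalized
```

Changing the definition of word in one place changes it for every call, which is where the reduced risk of coding errors comes from.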

Regex matching a written list such as "New York, Texas, and Florida"

I need a regex that will match the following conditions for an arbitrarily long list where each capture can be multiple words. It will always have the oxford comma, if that helps.
'New York' #=> ['New York']
'New York and Texas' #=> ['New York', 'Texas']
'New York, Texas, and Florida' #=> ['New York', 'Texas', 'Florida']
I found that (.+?)(?:,|$)(?:\sand\s|$)? will match 1 and 3 but not 2.
And (.+?)(?:\sand\s|$) will match 1 and 2 but not 3.
How can I match all 3?
You may split the text with the following pattern:
(?:\s*(?:\band\b|,))+\s*
Details
(?:\s*(?:\band\b|,))+ - 1 or more occurrences of:
\s* - 0+ whitespaces
(?:\band\b|,) - and as a whole word or a comma
\s* - 0 or more whitespace characters.
See the regex demo.
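A short Ruby sketch of the split on the three inputs from the question (the variable names are illustrative). One caveat, stated as an assumption about your data: a name that itself contains the word "and" would also be split.

```ruby
separator = /(?:\s*(?:\band\b|,))+\s*/

inputs = [
  "New York",
  "New York and Texas",
  "New York, Texas, and Florida"
]

lists = inputs.map { |text| text.split(separator) }
lists.each { |list| p list }
```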
Note you may make it a bit more efficient if your regex engine supports possessive quantifiers:
(?:\s*+(?:\band\b|,))+\s*
      ^
Or atomic groups:
(?>\s*+(?:\band\b|,))+\s*
 ^^

Regex pattern that matches only if not included in another pattern

I am trying to understand how to make a regular expression that only matches a pattern if this pattern is not included in another one.
In the following example, I want to match dashes only if they are not inside a [code][/code] tag.
---------
[code]
-------------------------------------------------------------------------------------
Some text
-----------------
Some other text
-------------------------------------------------------------------------------------
test
[/code]
I have searched for explanations about lookahead and lookbehind but cannot understand if and how it could be suitable for what I need.
I wanted to use a combination of negative lookbehind and negative lookahead but it seems that it is not possible to use + or * in negative lookbehind pattern.
So, for example, this won't work (because of the + in the negative look behind)
/(?<!\[code\].+?)(-{5,100})(?!.+?\[\/code\])/m
How can I achieve that in another way ?
One possibility if the tags are not nested is to match from the opening till the closing tag to match what you don't want. Then use an alternation to capture in a group what you do want, in this case 5 - 100 times a hyphen.
/\[code\](?:(?!\[\/?code\]).)*\[\/code]|(-{5,100})/m
Explanation
\[code\] Match [code]
(?: Non capturing group
(?!\[\/?code\]). Negative lookahead: assert that what is on the right is neither [code] nor [/code], then match any character
)* Close the non capturing group and repeat it 0+ times
\[\/code] Match [/code]
| Or
(-{5,100}) Capture in group 1 matching 5 - 100 times a hyphen
Regex demo
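In Ruby, this "match what you don't want, capture what you do" approach could look like the sketch below. The sample text is a shortened, made-up version of the one in the question. Matches of a whole [code] block leave group 1 empty (nil), so they are dropped with compact.

```ruby
text = <<~TEXT
  ---------
  [code]
  --------------------
  Some text
  -----------------
  [/code]
  ------
TEXT

# First alternative swallows a whole [code]...[/code] block,
# second alternative captures a run of 5 to 100 dashes in group 1.
pattern = /\[code\](?:(?!\[\/?code\]).)*\[\/code\]|(-{5,100})/m

# scan returns one [group1] array per match; nil means a code block matched.
dashes = text.scan(pattern).flatten.compact
p dashes
```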
I don't believe a regular expression is the right tool for the job here.
str = <<END
---------
[code]
-------------------------------------------------------------------------------
Some text
----------------------------------
Some other text
-------------------------------------------------------------------------------
test
[/code]
------------
---
[code]
Some text
-------------------------------------------
[/code]
------------
END
within = false
str.split("\n").select do |line|
case line
when "[code]"
within = true
false
when "[/code]"
within = false
false
else
within == false
end
end
#=> ["---------", "------------", "---", "------------"]
I would have used the to-some-beloved flip-flop operator had it not been deprecated.
str.split("\n").reject do |line|
true if line == "[code]"..line == "[/code]"
end
#=> ["---------", "------------", "---", "------------"]
Hold the phone! It looks like Matz has un-deprecated it! (Scroll to end.)

Regex to split text ignoring occurrences of delimiter(s) in quoted text

How would I approach writing a regex where given a set of delimiters such as both ; and ,, I could get the following results on these examples:
coffee, water; tea -> [coffee, water, tea]
"coffee, black;", water; tea -> ["coffee, black;", water, tea]
To clarify, regular text cannot have spaces, quoted text can have spaces, delimiters inside the quotes are ignored, and all text is separated by delimiters.
I've been experimenting with regex myself, and haven't gotten the results that I want. I'm also working in an environment without lookaheads/lookbehinds. Any thoughts on how to achieve this?
Here is a good way:
(?:\r?\n|[,;]|^)[^\S\r\n]*((?:(?:[^\S\r\n]*[^,;"\s])*(?:"[^"]*")?[^,;"\s]*))[^\S\r\n]*
It also trims whitespace around each element.
Nice demo here -> https://regex101.com/r/FsJtOE/1
Capture group 1 contains the element.
A simple find all should work.
Note that RE2 has no assertions, but the pattern really needs them to handle all corner cases.
Unfortunately, this is as close as you can get with that regex engine.
One thing this will do is allow multiple words in non-quoted fields.
Readable version
# Validate even quotes: ^[^"]*(?:"[^"]*"[^"]*)*$
# Then ->
# ----------------------------------------------
# Find all:
(?: \r? \n | [,;] | ^ )
[^\S\r\n]*
( # (1 start)
(?:
(?:
[^\S\r\n]*
[^,;"\s]
)*
(?: " [^"]* " )?
[^,;"\s]*
)
) # (1 end)
[^\S\r\n]*
Replacing:
((\"[^\"]*\")|[a-zA-Z]+)[,;]
With:
$1,
Will give you what's inside the brackets.
Explanation:
((\"[^\"]*\")|[a-zA-Z]+) any of these two options:
(\"[^\"]*\") anything between double quotes
[a-zA-Z]+ any sequence of one or more ASCII letters
[,;] any occurrence of , or ;
See on regex101, with this input:
coffee, water; tea
"coffee, black;", water; tea
You get this output:
coffee, water, tea
"coffee, black;", water, tea
Not sure what flavor of regex you're using that excludes the use of lookaheads, but would something like this work for you?
/".*"|[^;,"\s]+/
It checks first for a quoted value (using ".*") before trying values that exclude delimiters, quotes, and whitespace (using a negative character class [^;,"\s]+)
https://regex101.com/r/zWea28/1/
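A Ruby sketch of this "find all tokens" approach, with one change that is my own assumption and not part of the answer above: ".*" is swapped for "[^"]*" so that two quoted fields on the same line are not merged into one match. No lookarounds are used.

```ruby
# Quoted field first, otherwise a run of non-delimiter, non-space characters.
token = /"[^"]*"|[^;,"\s]+/

plain  = 'coffee, water; tea'.scan(token)
quoted = '"coffee, black;", water; tea'.scan(token)
p plain
p quoted

# With two quoted fields, the greedy ".*" variant swallows the whole line,
# while the "[^"]*" variant keeps the fields apart.
two_quoted = '"a, b"; c, "d; e"'
greedy = two_quoted.scan(/".*"|[^;,"\s]+/)
strict = two_quoted.scan(token)
p greedy
p strict
```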

Regex for city with Apostrophe

Here is my JavaScript regex for a city name. It handles almost all cases except names with an apostrophe.
^[a-zA-Z]+[\. - ']?(?:[\s-][a-zA-Z]+)*$
(Should pass)
Coeur d'Alene
San Tan Valley
St. Thomas
St. Thomas-Vincent
St. Thomas Vincent
St Thomas-Vincent
St-Thomas
anaconda-deer lodge county
(Should Fail)
San. Tan. Valley
St.. Thomas
St.. Thomas--Vincent
St.- Thomas -Vincent
St--Thomas
This matches all your names from the first list and not those from the second:
/^[a-zA-Z]+(?:\.(?!-))?(?:[\s-](?:[a-z]+')?[a-zA-Z]+)*$/
Multiline explanation:
^[a-zA-Z]+ # begins with a word
(?:\.(?!-))? # maybe a dot but not followed by a dash
(?:
[\s-] # whitespace or dash
(?:[a-z]+\')? # maybe a lowercase-word and an apostrophe
[a-zA-Z]+ # word
)*$ # repeated to the end
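To check the first pattern against both lists from the question, here is a small sketch in Ruby (chosen only for illustration; the question is about JavaScript). Because ^ and $ match at line boundaries in Ruby, the sketch uses \A and \z instead, which is an adaptation and not part of the answer.

```ruby
city = /\A[a-zA-Z]+(?:\.(?!-))?(?:[\s-](?:[a-z]+')?[a-zA-Z]+)*\z/

should_pass = [
  "Coeur d'Alene", "San Tan Valley", "St. Thomas", "St. Thomas-Vincent",
  "St. Thomas Vincent", "St Thomas-Vincent", "St-Thomas",
  "anaconda-deer lodge county"
]
should_fail = [
  "San. Tan. Valley", "St.. Thomas", "St.. Thomas--Vincent",
  "St.- Thomas -Vincent", "St--Thomas"
]

# Collect any name that behaves differently from what the question expects.
wrongly_rejected = should_pass.reject { |name| name.match?(city) }
wrongly_accepted = should_fail.select { |name| name.match?(city) }

p wrongly_rejected
p wrongly_accepted
```

Both lists come out empty when the pattern behaves as described.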
To allow the dots anywhere, but not two of them, use this:
/^(?!.*?\..*?\.)[a-zA-Z]+(?:(?:\.\s?|\s|-)(?:[a-z]+')?[a-zA-Z]+)*$/
^(?!.*?\..*?\.) # does not contain two dots
[a-zA-Z]+ # a word
(?:
(?:\.\s?|\s|-) # delimiter: dot with maybe whitespace, whitespace or dash
(?:[a-z]+\')? # maybe a lowercase-word and an apostrophe
[a-zA-Z]+ # word
)*$ # repeated to the end
Try this regex:
^(?:[a-zA-Z]+(?:[.'\-,])?\s?)+$
This does match:
Coeur d'Alene
San Tan Valley
St. Thomas
St. Thomas-Vincent
St. Thomas Vincent
St Thomas-Vincent
St-Thomas
anaconda-deer lodge county
Monte St.Thomas
San. Tan. Valley
Washington, D.C.
But doesn't match:
St.. Thomas
St.. Thomas--Vincent
St.- Thomas -Vincent
St--Thomas
(I allowed it to match San. Tan. Valley, since there's probably a city name out there with 2 periods.)
How the regex works:
# ^ - Match the line start.
# (?: - Start a non-capturing group
# [a-zA-Z]+ - That starts with 1 or more letters.
# [.'\-,]? - Followed by one period, apostrophe, dash, or comma. (optional)
# \s? - Followed by a space (optional)
# )+ - End of the group, match at least one or more of the previous group.
# $ - Match the end of the line
I think the following regexp fits your requirements :
^([Ss]t\. |[a-zA-Z ]|['-](?:[^-']))+$
On the other hand, you may question the idea of using a regexp to do that... Whatever the complexity of your regexp, there will always be some fool finding a new unwanted pattern that matches...
Usually when you need to have valid city names, it's better to use a geocoding API, such as the Google Geocoding API.