Regex for city with Apostrophe - regex

Here is my JavaScript regex for a city name. It handles almost all cases except names containing an apostrophe.
^[a-zA-Z]+[\. - ']?(?:[\s-][a-zA-Z]+)*$
(Should pass)
Coeur d'Alene
San Tan Valley
St. Thomas
St. Thomas-Vincent
St. Thomas Vincent
St Thomas-Vincent
St-Thomas
anaconda-deer lodge county
(Should Fail)
San. Tan. Valley
St.. Thomas
St.. Thomas--Vincent
St.- Thomas -Vincent
St--Thomas

This matches all your names from the first list and not those from the second:
/^[a-zA-Z]+(?:\.(?!-))?(?:[\s-](?:[a-z]+')?[a-zA-Z]+)*$/
Multiline explanation:
^[a-zA-Z]+ # begins with a word
(?:\.(?!-))? # maybe a dot but not followed by a dash
(?:
[\s-] # whitespace or dash
(?:[a-z]+\')? # maybe a lowercase-word and an apostrophe
[a-zA-Z]+ # word
)*$ # repeated to the end
To allow the dots anywhere, but not two of them, use this:
/^(?!.*?\..*?\.)[a-zA-Z]+(?:(?:\.\s?|\s|-)(?:[a-z]+')?[a-zA-Z]+)*$/
^(?!.*?\..*?\.) # does not contain two dots
[a-zA-Z]+ # a word
(?:
(?:\.\s?|\s|-) # delimiter: dot with maybe whitespace, whitespace or dash
(?:[a-z]+\')? # maybe a lowercase-word and an apostrophe
[a-zA-Z]+ # word
)*$ # repeated to the end
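Both patterns above can be checked against the lists from the question. The following is a minimal Python sketch, assuming the patterns carry over unchanged to Python's `re` module (they only use lookaheads and character classes); the names `STRICT`, `ONE_DOT` and `check` are made up for illustration.

```python
import re

# Pattern 1: at most one dot, and only directly after the first word.
STRICT = re.compile(r"^[a-zA-Z]+(?:\.(?!-))?(?:[\s-](?:[a-z]+')?[a-zA-Z]+)*$")
# Pattern 2: a dot may follow any word, but the whole name may contain only one dot.
ONE_DOT = re.compile(
    r"^(?!.*?\..*?\.)[a-zA-Z]+(?:(?:\.\s?|\s|-)(?:[a-z]+')?[a-zA-Z]+)*$"
)

SHOULD_PASS = [
    "Coeur d'Alene", "San Tan Valley", "St. Thomas", "St. Thomas-Vincent",
    "St. Thomas Vincent", "St Thomas-Vincent", "St-Thomas",
    "anaconda-deer lodge county",
]
SHOULD_FAIL = [
    "San. Tan. Valley", "St.. Thomas", "St.. Thomas--Vincent",
    "St.- Thomas -Vincent", "St--Thomas",
]

def check(pattern):
    """Return the names the pattern classifies wrongly (empty list = all correct)."""
    wrong = [name for name in SHOULD_PASS if not pattern.match(name)]
    wrong += [name for name in SHOULD_FAIL if pattern.match(name)]
    return wrong

print(check(STRICT))   # names misclassified by pattern 1
print(check(ONE_DOT))  # names misclassified by pattern 2
```

The difference between the two shows up on a name such as "Monte St.Thomas", where the dot is not after the first word: only the second pattern accepts it.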

Try this regex:
^(?:[a-zA-Z]+(?:[.'\-,])?\s?)+$
This does match:
Coeur d'Alene
San Tan Valley
St. Thomas
St. Thomas-Vincent
St. Thomas Vincent
St Thomas-Vincent
St-Thomas
anaconda-deer lodge county
Monte St.Thomas
San. Tan. Valley
Washington, D.C.
But doesn't match:
St.. Thomas
St.. Thomas--Vincent
St.- Thomas -Vincent
St--Thomas
(I allowed it to match San. Tan. Valley, since there's probably a city name out there with 2 periods.)
How the regex works:
# ^ - Match the start of the line.
# (?: - Start a non-capturing group
# [a-zA-Z]+ - that starts with 1 or more letters,
# [.'\-,]? - followed by one period, apostrophe, dash, or comma (optional),
# \s? - followed by a whitespace character (optional).
# )+ - End of the group; match the group one or more times.
# $ - Match the end of the line.
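As a quick check of this more permissive pattern, here is a small Python sketch that runs it over the names listed in this answer. It assumes the pattern behaves the same under Python's `re` module; the variable names are illustrative only.

```python
import re

# One or more "word + optional punctuation + optional whitespace" units.
PERMISSIVE = re.compile(r"^(?:[a-zA-Z]+(?:[.'\-,])?\s?)+$")

matches = [
    "Coeur d'Alene", "San Tan Valley", "St. Thomas", "St. Thomas-Vincent",
    "St. Thomas Vincent", "St Thomas-Vincent", "St-Thomas",
    "anaconda-deer lodge county", "Monte St.Thomas", "San. Tan. Valley",
    "Washington, D.C.",
]
rejects = [
    "St.. Thomas", "St.. Thomas--Vincent", "St.- Thomas -Vincent", "St--Thomas",
]

for name in matches + rejects:
    print(f"{name!r}: {bool(PERMISSIVE.match(name))}")
```

Because the punctuation and the whitespace are both optional at the end of every unit, the pattern also accepts a name that ends in a single punctuation mark or a trailing space, which may or may not be desirable.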

I think the following regexp fits your requirements :
^([Ss]t\. |[a-zA-Z ]|['-](?:[^-']))+$
On the other hand, you may question the idea of using a regexp for this at all. However complex your regexp is, someone will always find a new unwanted pattern that matches.
Usually, when you need valid city names, it's better to use a geocoding API, such as the Google Geocoding API.

Related

Regex to match when groups aren't same length

I've been having trouble understanding how to make this regex more dynamic. I specifically want to pull out these four elements, but sometimes some of them will be missing. In my case here, it doesn't recognize the pattern because the 4th group isn't present.
For example, given the 3 strings:
Rafael C. is eating a Burger by McDonalds at Beach
David K. is eating a Burger by McDonalds
John G. is eating a by at House
I'm trying to pull out the [name], [item], [by name], [at name]. It will always be in this pattern, but parts of it may be missing at times. Sometimes it's the name that is missing, sometimes it's the item, sometimes it's the name and the by name, etc.
So I'm using:
(.*) is eating a (.*) by (.*) at (.*)
But because the "at" part is missing in the second string, the pattern doesn't match it. I've tried using lookbehinds/lookaheads and quantifiers, but I'm having a hard time working out how to get exactly those 4 groups.
Output desired:
I'd like it to capture:
[Rafael C., Burger, McDonalds, Beach]
[David K., Burger, McDonalds, '']
[John G., '', '', 'House']
You can use
^(.*) is eating a ((?:(?!\b(?:by|at)\b).)*?)(?: ?\bby ((?:(?!\bat\b).)*?))?(?: ?\bat (.*))?$
See the regex demo.
Details:
^ - string start
(.*) - Group 1: any zero or more chars other than line break chars as many as possible
is eating a - a literal string
((?:(?!\b(?:by|at)\b).)*?) - Group 2: any char other than line break char, zero or more but as few as possible occurrences, that is not a starting point for a by or at whole word char sequence
(?: ?\bby ((?:(?!\bat\b).)*?))? - an optional non-capturing group that matches an optional space, word boundary, by, space and then captures into Group 3 any char other than line break char, zero or more but as few as possible occurrences, that is not a starting point for an at whole word char sequence
(?: ?\bat (.*))? - an optional non-capturing group that matches an optional space, word boundary, at, space and then captures into Group 4 any zero or more chars other than line break chars as many as possible
$ - string end.
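To see the groups this pattern produces on the three sample strings, here is a minimal Python sketch. It assumes the pattern is used as-is with Python's `re` module; the helper name `parse` is made up, and converting unmatched optional groups from `None` to `''` is an added convenience to mirror the desired output.

```python
import re

EATING = re.compile(
    r"^(.*) is eating a ((?:(?!\b(?:by|at)\b).)*?)"  # name, then item (stops before by/at)
    r"(?: ?\bby ((?:(?!\bat\b).)*?))?"               # optional "by ..." part
    r"(?: ?\bat (.*))?$"                             # optional "at ..." part
)

def parse(sentence):
    """Return [name, item, by_name, at_name], or None if the sentence does not match."""
    m = EATING.match(sentence)
    if m is None:
        return None
    # Optional groups that did not take part in the match are None; use '' instead.
    return [g or "" for g in m.groups()]

for s in (
    "Rafael C. is eating a Burger by McDonalds at Beach",
    "David K. is eating a Burger by McDonalds",
    "John G. is eating a by at House",
):
    print(parse(s))
```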
I suggest making the last group optional with a quantifier, like this.
(.*) is eating a (.*) by (.*)(?: at (.*))*
This works with your example. https://regex101.com/r/B4JbdS/1
Edit: You are right, @chitown88, this regex should match better. I use "[^\\]*?" (between " *") instead of ".*" to trim whitespace when there is no value.
I also used "(?=)" (lookahead) and "(?<=)" (lookbehind) to capture the group between two specific matches.
(.*)(?=^is eating a| is eating a).*(?<=^is eating a| is eating a) *([^\\]*?) *(?=by).*(?<=by) *([^\\]*?) *(?=at |at$).*(?<=at |at$)(.*)
https://regex101.com/r/PHbmAZ/1

Using regex to parse data between delimiter and ending at a specified substring

I'm trying to parse out the names from a bunch of semi-unpredictable strings. More specifically, I'm using ruby, but I don't think that should matter much. This is a contrived example but some example strings are:
Eagles vs Bears
NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN
NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN
Philadelphia Eagles vs Chicago Bears - NFL Match
Phil.Eagles vs Chic.Bears
3agles vs B3ars
The regex I've come up with is
/([0-9A-Z .]*) vs ([0-9A-Z .]*)(?:[ -:]*tune)?/i
but in the case of "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN" I'm receiving Chicago Bears TUNE as the second match. I'm trying to remove "tune in" so it's in its own group.
I thought that by adding (?:[ -:]*tune)? it would separate the ending portion of the expression the same way that having vs in the middle was able to, but that doesn't seem to be the case. If I remove the ? at the end, it matches correctly for the above example, but it no longer matches for Eagles vs Bears.
If anyone could help me, I would greatly appreciate it if you could break down your regex piece by piece.
You can make the second group lazy and let it run only up to a -, a :, or the word tune (each optionally preceded by whitespace), or up to the end of the line. Keep the case-insensitive flag (/i) from your pattern so that VS and TUNE still match:
([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))
See the regex demo.
Details:
([\w .]*) - Group 1: zero or more word, space or . chars as many as possible
vs - a vs string
([\w .]*?) - Group 2: zero or more word, space or . chars as few as possible
(?=\s*(?:[:-]|tune|$)) - a positive lookahead that requires the following pattern to appear immediately to the right of the current location:
\s* - zero or more whitespaces
(?:[:-]|tune|$) - : or -, tune or end of a line.
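The lookahead approach can be tried on the sample titles with a short Python sketch. It assumes case-insensitive matching (as with the `/i` flag in the question) and strips surrounding spaces from the groups, since the first group may pick up a leading space; `TEAMS` and `teams` are illustrative names.

```python
import re

TEAMS = re.compile(
    r"([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))",
    re.IGNORECASE,  # assumption: same effect as the /i flag in the question
)

def teams(title):
    """Return (team1, team2) with surrounding spaces removed, or None if no match."""
    m = TEAMS.search(title)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2).strip()

examples = [
    "Eagles vs Bears",
    "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
    "NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
    "Philadelphia Eagles vs Chicago Bears - NFL Match",
    "Phil.Eagles vs Chic.Bears",
    "3agles vs B3ars",
]
for title in examples:
    print(teams(title))
```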
You can use the following regular expression which I have expressed in free-spacing mode to make it self-documenting (search for "Free-Spacing Mode" at the link).
rgx = /
(?: |\A) # match space or beginning of string
(?<team1> # begin capture group team1
(?<team> # begin capture group team
(?<word> # begin capture group word
(?:\p{Lu}|\d) # word begins with an uppercase letter or digit
(?:\p{Ll}|\d)+ # ...followed by 1+ lowercase letters or digits
) # end capture group word
(?: # begin non-capture group
[ .] # match a space or period
\g<word> # match another word
)* # end non-capture group and execute 0+ times
) # end capture group team
) # end capture group team1
[ ]+ # match one or more spaces
(?:VS|vs) # match literal
[ ]+ # match one or more spaces
(?<team2> # begin capture group team2
\g<team> # match the second team name
) # end capture group team2
(?: # begin non-capture group
[ ] # match a space
(?: # begin non-capture group
(?:-[ ])? # optionally match literal
TUNE[ ]IN # match literal
| # or
-[ ]NFL[ ]Match # match literal
) # end inner non-capture group
)? # end outer non-capture group and make it optional
\z # match end of string
/x # free-spacing regex definition mode
examples = [
"Eagles vs Bears",
"NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
"NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
"Philadelphia Eagles vs Chicago Bears - NFL Match",
"Phil.Eagles vs Chic.Bears",
"3agles vs B3ars"
]
examples.map do |s|
m = s.match(rgx)
[m[:team1], m[:team2]]
end
#=> [["Eagles", "Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Phil.Eagles", "Chic.Bears"],
# ["3agles", "B3ars"]]
See Regexp#match and MatchData#[].
Note that \g<word> and \g<team> effectively copy the code contained in the capture groups word and team, respectively. These are called "Subexpression Calls"; for additional information, search for that term at Regexp. Subexpression calls have two advantages: less code is needed, and the opportunities for coding errors are reduced.
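For comparison, Python's built-in `re` module has no subexpression calls, but a similar reuse of sub-patterns can be approximated by composing pattern strings. The sketch below is an adaptation, not a translation of the Ruby code: it assumes ASCII letters (`[A-Z]`, `[a-z]`) in place of `\p{Lu}` and `\p{Ll}`, and the names `WORD`, `TEAM`, `TAIL` and `RGX` are made up.

```python
import re

# A word: an uppercase letter or digit followed by 1+ lowercase letters or digits.
WORD = r"(?:[A-Z]|\d)(?:[a-z]|\d)+"
# A team name: one word, then zero or more words separated by a space or period.
TEAM = rf"{WORD}(?:[ .]{WORD})*"
# Optional trailer: " TUNE IN", " - TUNE IN" or " - NFL Match".
TAIL = r"(?:[ ](?:(?:-[ ])?TUNE[ ]IN|-[ ]NFL[ ]Match))?"

RGX = re.compile(
    rf"(?:[ ]|\A)(?P<team1>{TEAM})[ ]+(?:VS|vs)[ ]+(?P<team2>{TEAM}){TAIL}\Z"
)

examples = [
    "Eagles vs Bears",
    "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
    "NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
    "Philadelphia Eagles vs Chicago Bears - NFL Match",
    "Phil.Eagles vs Chic.Bears",
    "3agles vs B3ars",
]
results = [RGX.search(s).group("team1", "team2") for s in examples]
print(results)
```

The `TEAM` string is interpolated twice, which plays the role that `\g<team>` plays in the Ruby version.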

Python Regex some name + US Address

I have these kind of strings:
WILLIAM SMITH 2345 GLENDALE DR RM 245 ATLANTA GA 30328-3474
LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046
I want to verify that the strings I receive follow the same format as the ones above.
Here's my regular expression:
[A-Z]+ [A-Z]+ [0-9]{3,4} [A-Z]+ [A-Z]{2,4} [A-Z]{2,4} [0-9]+ [A-Z]+ [A-Z]{2} [0-9]{5}-[0-9]{4}$
But my regular expression only matches the first example and does not match the second one.
Here's dawg's regex with capturing groups:
^([A-Z]+[ \t]+[A-Z]+)[ \t]+(\d+)[ \t](.*)[ \t]+([A-Z]{2})[ \t]+(\d{5}(?:-\d{4}))$
Here's the url.
UPDATE
Sorry, I forgot to remove the non-capturing group at the end of dawg's regex.
Here's the new regex without the non-capturing group: regex101
Try this:
^[A-Z]+[ \t]+[A-Z]+[ \t]+\d+.*[ \t]+[A-Z]{2}[ \t]+\d{5}(?:-\d{4})?$
Demo
Explanation:
1. ^[A-Z]+[ \t]+[A-Z]+[ \t]+ - Starting at the start of the line, two blocks of A-Z for the name (however, names are often more complicated...)
2. \d+.*[ \t]+[A-Z]{2}[ \t]+ - Uses the leading number and the two-letter state code at the end to delimit the full address. Cities can contain spaces, such as 'Miami Beach'.
3. \d{5}(?:-\d{4})?$ - ZIP code with optional -NNNN, followed by the end anchor
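Since the question is about Python, here is a minimal sketch of how such a pattern might be used for validation. It combines the capturing groups from the first answer with an optional ZIP+4 part, which is an assumption on my side; the third sample string and the names `ADDRESS` and `split_record` are hypothetical.

```python
import re

ADDRESS = re.compile(
    r"^([A-Z]+[ \t]+[A-Z]+)[ \t]+"   # first and last name
    r"(\d+)[ \t](.*)[ \t]+"          # house number, then street and city
    r"([A-Z]{2})[ \t]+"              # two-letter state code
    r"(\d{5}(?:-\d{4})?)$"           # ZIP code, -NNNN part assumed optional
)

def split_record(line):
    """Return (name, number, street_and_city, state, zip) or None if the line is invalid."""
    m = ADDRESS.match(line)
    return m.groups() if m else None

for line in (
    "WILLIAM SMITH 2345 GLENDALE DR RM 245 ATLANTA GA 30328-3474",
    "LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046",
    "JOHN DOE 12 MAIN ST MIAMI BEACH FL 33139",  # hypothetical, no -NNNN part
):
    print(split_record(line))
```

The street and the city end up in one group because nothing in the text marks where one stops and the other starts.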

RegEx for extracting US address not working when address is separated with newline

I have the following RegEx to extract US address from a string.
(\d+)[ \n]+((\w+[ ,])+[\$\n, ]+){2}([a-zA-Z]){2}[$\n, ]+(\d){5}
This is not working when the address is in the below format.
2933 Glen Crow Court
San Jose
CA 95148
and is working for the below data.
2933 Glen Crow Court,
San Jose, CA 95148
.
2933 Glen Crow Court, San Jose, CA 95148
Any help on this would be much appreciated.
You can simplify your pattern to something like this to match the address, whether it is on one line or spread across multiple lines.
\b\d+(?:\s+[\w,]+)+?\s+[a-zA-Z]{2}\s+\d{5}\b
Regex Explanation:
\b\d+ - Starts matching at a word boundary followed by one or more digits
(?:\s+[\w,]+)+? - A non-capturing group that matches one or more whitespace characters followed by one or more word characters or commas; the whole group repeats one or more times, but in a non-greedy way.
\s+[a-zA-Z]{2} - Matches one or more whitespace characters, then two alphabetic characters, to expect text like CA, NY
\s+\d{5}\b - Followed by one or more whitespace characters, then finally five digits with a word boundary, to avoid matching partially inside a larger text
Demo
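A small Python sketch can confirm that this pattern handles all three layouts from the question, because `\s` also matches newlines. It assumes the pattern is used unchanged with Python's `re` module; `US_ADDRESS` and `extract` are illustrative names.

```python
import re

US_ADDRESS = re.compile(r"\b\d+(?:\s+[\w,]+)+?\s+[a-zA-Z]{2}\s+\d{5}\b")

samples = [
    "2933 Glen Crow Court\nSan Jose\nCA 95148",
    "2933 Glen Crow Court,\nSan Jose, CA 95148",
    "2933 Glen Crow Court, San Jose, CA 95148",
]

def extract(text):
    """Return the first address-like substring, or None if there is none."""
    m = US_ADDRESS.search(text)
    return m.group(0) if m else None

for s in samples:
    print(repr(extract(s)))
```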
Add ? to the [ ,] check:
(\d+)[ \n]+((\w+[ ,]?)+[\$\n, ]+){2}([a-zA-Z]){2}[$\n, ]+(\d){5}
Try this pattern \d+\s+[\w ]+[\s,]+[\w ]+[\s,]+\w+ \d+
Explanation:
\d+\s+ - match one or more digits, then one or more whitespace characters
[\w ]+[\s,]+ - match one or more word characters or spaces, then one or more whitespace characters or commas
\w+ \d+ - match one or more word characters, a space, and one or more digits
Demo
Not drake but you can thank me later...
r"(?:(\d+ [A-Za-z][A-Za-z ]+)[\s,]*([A-Za-z#0-9][A-Za-z#0-9 ]+)?[\s,]*)?(?:([A-Za-z][A-Za-z ]+)[\s,]+)?((?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2})(?:[,\s]+(\d{5}(?:-\d{4})?))?"
You can test it out here... demo
Note: this only works for US addresses.

REGEX - Need a way of negative lookbehind'ing multiple strings

I am trying to match unwanted regions in filenames to delete the files.
I want to get any match if the REGEX finds a "bad region" (Brazil or Columbia) BUT not if they are mixed in with "good regions" in the same bracket (USA, UK, Europe, Australia).
I have a regex of
(?<![( ](USA)[,)])[( ](Brazil|Columbia)[,)](?![( ](USA|UK|Europe|Australia)[,)])
FIFA Soccer (USA, Brazil) <<< DON'T MATCH IF USA IS IN SAME BRACKET BEFORE
FIFA Soccer (Brazil, USA) <<< DON'T MATCH IF USA IS IN SAME BRACKET AFTER
FIFA Soccer (Brazil) <<< MATCH
FIFA Soccer (Brazil, Ireland) <<< MATCH
FIFA Soccer (Moon, Brazil) <<< MATCH
So far the correct lines match, but that's because I have a fixed-width "negative lookbehind" looking for "USA"...... but I also want "UK" "Europe" and "Australia" in my negative lookbehinds, and I can't do that as they have to be "fixed width"...
FIFA Soccer (UK, Brazil) <<< ERROR - THIS ONE SHOULDN'T MATCH AND DOES
FIFA Soccer (Brazil, UK) <<< This one works (no match) because I have my lookahead set up
See the live demo:
Here
So is there a way of getting, in effect, something like (?<![( ](USA|UK|Europe|Australia)[,)]) at the start of the regex, so that things like UK, Brazil and Europe, Brazil no longer match?
You may use
\((?!(?:[^()]*,\s*)?(?:USA|UK|Europe|Australia)\s*[,)])[^()]*\)
See the regex demo
Details
\( - a ( char
(?!(?:[^()]*,\s*)?(?:USA|UK|Europe|Australia)\s*[,)]) - a negative lookahead that will fail the match if, immediately to the right, there is
(?:[^()]*,\s*)? - an optional sequence of
[^()]* - 0+ chars other than ( and )
, - a comma
\s* - 0+ whitespaces
(?:USA|UK|Europe|Australia) - one of the good values
\s* - 0+ whitespaces
[,)] - a , or )
[^()]* - 0 or more chars other than ( and )
\) - a ) char.
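The negative-lookahead pattern can be exercised against the filenames from the question with a short Python sketch. It assumes the pattern is used as-is with Python's `re` module; `NO_GOOD_REGION` and `is_match` are made-up names.

```python
import re

# Matches a bracketed group only if it contains none of the good regions.
NO_GOOD_REGION = re.compile(
    r"\((?!(?:[^()]*,\s*)?(?:USA|UK|Europe|Australia)\s*[,)])[^()]*\)"
)

def is_match(filename):
    """True if the filename has a bracket without any good region in it."""
    return NO_GOOD_REGION.search(filename) is not None

for name in (
    "FIFA Soccer (USA, Brazil)",
    "FIFA Soccer (Brazil, USA)",
    "FIFA Soccer (Brazil)",
    "FIFA Soccer (Brazil, Ireland)",
    "FIFA Soccer (Moon, Brazil)",
    "FIFA Soccer (UK, Brazil)",
    "FIFA Soccer (Brazil, UK)",
):
    print(name, is_match(name))
```

Note that the pattern does not look for Brazil or Columbia at all: any bracket without a good region matches, so a name such as "FIFA Soccer (Moon)" would be matched as well.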
Instead of variable length negative lookbehind, you may use PCRE verbs (*SKIP)(*F) in alternation to match and reject a match:
(?:USA|UK|Europe|Australia),\h*(?:Brazil|Austria)[,)](*SKIP)(*F)|(?:Brazil|Austria)[,)](?!\h?(?:USA|UK|Europe|Australia)[,)])
Updated RegEx Demo
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice workaround for the restriction that you cannot use a variable-length lookbehind in the above regex.
You may use a DEFINE group in PCRE to avoid repetition in your regex, like this:
/
(?(DEFINE) # use define to avoid repetitions
(?<ct>USA|UK|Europe|Australia) # disallow countries
(?<mct>Brazil|Austria) # matching countries
)
# main regex starts here
(?&ct),\h*(?&mct)[,)](*SKIP)(*F)
|
(?&mct)[,)](?!\h?(?&ct)[,)])
/x
RegEx Demo 2
Using a pattern to extract country names and array_filter:
$filenames = ['FIFA Soccer (USA, Brazil)',
              'FIFA Soccer (Brazil, USA)',
              'FIFA Soccer (Brazil)',
              'FIFA Soccer (Brazil, Ireland)',
              'FIFA Soccer (Moon, Brazil)'];
$bad = ['Brazil', 'Columbia'];
$good = ['USA', 'UK', 'Europe', 'Australia'];
$todelete = array_filter($filenames, function ($i) use ($bad, $good) {
    $countries = preg_match_all('~(?:\G(?!\A), |\()\K\pL+~', $i, $m) ? $m[0] : [];
    return array_intersect($countries, $bad) && !array_intersect($countries, $good);
});
print_r($todelete);
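The same extract-then-filter idea can be sketched in Python. This is an adaptation rather than a port: Python's `re` has no `\G` or `\K`, so the sketch assumes a single bracketed group per filename and splits its content on commas; `BAD`, `GOOD` and `should_delete` are illustrative names.

```python
import re

BAD = {"Brazil", "Columbia"}
GOOD = {"USA", "UK", "Europe", "Australia"}

def should_delete(filename):
    """True if the bracket lists a bad region and no good region."""
    m = re.search(r"\(([^()]*)\)", filename)
    # Split "USA, Brazil" into {"USA", "Brazil"}; no bracket means no regions.
    regions = {r.strip() for r in m.group(1).split(",")} if m else set()
    return bool(regions & BAD) and not (regions & GOOD)

filenames = [
    "FIFA Soccer (USA, Brazil)",
    "FIFA Soccer (Brazil, USA)",
    "FIFA Soccer (Brazil)",
    "FIFA Soccer (Brazil, Ireland)",
    "FIFA Soccer (Moon, Brazil)",
]
to_delete = [f for f in filenames if should_delete(f)]
print(to_delete)
```

Keeping the region lists as sets makes the rule explicit: delete when the bracket intersects the bad set and does not intersect the good set.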