RegEx Address Parsing City - regex

In our database we have 1 "Address" field that stores the complete address as text. I am in the process of splitting the address into the following fields: Line1, City, State, Zip. (US Addresses). I have good expressions for parsing the state and zip, but I'm having a bit of difficulty parsing the city.
Basically, I am using the following rules for parsing the city:
It must come right before the state
It can have a comma, or two or more spaces before it.
If neither of the above are true, then just return the 1 word that comes before the state.
I am not interested in validating these addresses.
Here's an example of the RegEx that I've been working with. It works great for parsing address components that are separated by a comma or by two or more spaces, but I can't get it to work if I try to include the alternative of the single preceding word:
Sample Address: 1977 S. Joshua Tree PL, Palm Springs, CA 92264
.*(?i)(?((((,\s|\s{2,})\w+)+(\s\w+)))(?=(,\s+|\s+)(Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New Hampshire|New Jersey|New Mexico|New York|North Carolina|North Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode Island|South Carolina|South Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West Virginia|Wisconsin|AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|WA|WV|WI|WY)))
Trying to make the 1st word optional causes the expression to only return "Springs", instead of "Palm Springs", which definitely matches in the expression above:
.*(?i)(?((((,\s|\s{2,})\w+)?(\s\w+)))(?=(,\s+|\s+)
Thanks for your help!
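A likely reason the optional version returns only "Springs" is that the leading `.*` is greedy: once the comma/space group is allowed to match nothing, `.*` consumes as much as it can and leaves only the mandatory `(\s\w+)` for the city. One way around this is to apply the three rules in separate steps instead of one expression. The following is a minimal Python sketch of that idea, not a drop-in replacement. The `STATE_ZIP` pattern and the `split_address` helper are assumptions made for illustration (any two letters followed by a ZIP at the end of the string); substitute your own state and ZIP expressions.

```python
import re

# Assumption: the state is a two-letter code followed by a 5-digit ZIP
# (optionally ZIP+4) at the end of the string. Swap in your own
# state and ZIP expressions here.
STATE_ZIP = re.compile(
    r",?\s+(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)

def split_address(address):
    """Return (line1, city, state, zip), or None if no state/ZIP is found."""
    m = STATE_ZIP.search(address)
    if m is None:
        return None
    head = address[:m.start()].rstrip()
    # Rules 1 and 2: the city follows the last comma or run of 2+ spaces.
    parts = re.split(r",\s*|\s{2,}", head)
    if len(parts) > 1:
        city = parts[-1]
        line1 = head[:len(head) - len(city)].rstrip(" ,")
    else:
        # Rule 3: fall back to the single word before the state.
        line1, _, city = head.rpartition(" ")
    return line1, city, m.group("state").upper(), m.group("zip")

print(split_address("1977 S. Joshua Tree PL, Palm Springs, CA 92264"))
```

Splitting from the right (state and ZIP first, then the city) avoids relying on how a greedy prefix backtracks.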

Personally, I think I would take a totally different approach. I would treat the zip code as authoritative, as it is the most granular data you have available. I would get a list of zip code to city mappings and extract the zip code portion of the address. Then write the city and state values for that zip code into the new database fields. Next, write a script that goes through each record and checks whether the city and state names derived from the zip code can be found in your string. If they can, remove those values from the string and flag the record as successfully processed. If they can't, flag the record as one that might need manual review.
Another alternate approach might be to use an API like Google Maps, to send your address string to and hopefully get a cleaned address out.
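The zip-code-first approach could be sketched roughly as follows in Python. The `ZIP_TO_CITY_STATE` table and the `split_by_zip` helper are hypothetical names used only for illustration; a real table would be loaded from a ZIP code dataset, and `None` stands in for "flag for manual review".

```python
import re

# Hypothetical lookup table; a real one would come from a ZIP code dataset.
ZIP_TO_CITY_STATE = {"92264": ("Palm Springs", "CA")}

# A 5-digit ZIP (optionally ZIP+4) at the end of the string.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?\s*$")

def split_by_zip(address):
    """Return the split fields, or None if the record needs manual review."""
    m = ZIP_RE.search(address)
    if m is None or m.group(1) not in ZIP_TO_CITY_STATE:
        return None
    city, state = ZIP_TO_CITY_STATE[m.group(1)]
    rest = address[:m.start()]
    # Look for "<city>, <state>" at the end of what remains and strip it.
    tail = re.compile(
        r"[,\s]*" + re.escape(city) + r"[,\s]+" + re.escape(state) + r"[,\s]*$",
        re.IGNORECASE,
    )
    t = tail.search(rest)
    if t is None:
        return None
    return {
        "line1": rest[:t.start()],
        "city": city,
        "state": state,
        "zip": m.group(0).strip(),
    }

print(split_by_zip("1977 S. Joshua Tree PL, Palm Springs, CA 92264"))
```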

This may be overly broad, but it might work for you, depending on the regex implementation you are using:
(.+?),\s*(.+?)(?:,\s|\s\s)(.+?)\s(\d{5})
This will return the following groups from your example:
('1977 S. Joshua Tree PL', 'Palm Springs', 'CA', '92264')
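For reference, this is one way to run the pattern with Python's `re` module (other regex implementations may behave slightly differently). Note that the pattern still requires a comma after the street line; only the separator between city and state may be a comma or two spaces.

```python
import re

pattern = re.compile(r"(.+?),\s*(.+?)(?:,\s|\s\s)(.+?)\s(\d{5})")

match = pattern.search("1977 S. Joshua Tree PL, Palm Springs, CA 92264")
groups = match.groups()
print(groups)

# The same pattern also accepts two spaces between city and state.
two_spaces = pattern.search("1977 S. Joshua Tree PL, Palm Springs  CA 92264")
print(two_spaces.group(2))
```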

I always prefer named capture groups for something like this. So try
(?<addr>[^,]+),\s+(?<city>[^,]+),\s+(?<state>[A-Za-z]{2})\s+(?<zip>\d{5}(-\d{4})?)
Parsing your example this will give you
addr: 1977 S. Joshua Tree PL
city: Palm Springs
state: CA
zip: 92264
and I threw in support for the extended postal code format as well.
You can just extract the value of the city group from the match generated by this regex.
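The `(?<name>...)` syntax above is the form used by .NET and PCRE-style engines. As a sketch of the same idea in Python, where named groups are written `(?P<name>...)`, something like the following should work. The `ADDRESS` name is just illustrative, and the inner ZIP+4 group is made non-capturing here.

```python
import re

ADDRESS = re.compile(
    r"(?P<addr>[^,]+),\s+(?P<city>[^,]+),\s+"
    r"(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)"
)

m = ADDRESS.search("1977 S. Joshua Tree PL, Palm Springs, CA 92264")
fields = m.groupdict()
print(fields["city"])

# The extended postal code format is captured as part of the zip group.
extended = ADDRESS.search("1977 S. Joshua Tree PL, Palm Springs, CA 92264-1234")
print(extended.group("zip"))
```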

Related

Regex with multiple groups, some of which are optional

I have trouble matching multiple groups, some of which are optional. I've tried variations of greedy/non greedy, but can't get it to work.
As input, I have cells which look like this:
SEPA Overboeking IBAN: AB1234 BIC: LALA678 Naam: John Smith Omschrijving: Hello hello Kenmerk: 03-05-2019 23:12 533238
I wanna split these up into groups of IBAN, BIC, Naam, Omschrijving, Kenmerk.
For this example, this yields: AB1234; LALA678; John Smith; Hello hello; 03-05-2019 23:12 533238.
To obtain this, I've used:
.*IBAN: (.*)\s+BIC: (.*)\s+Naam: (.*)\s+Omschrijving: (.*)\s+Kenmerk: (.*)
This works perfectly as long as all these groups are present in the input. Some cells, however don't have the "Omschrijving" and/or "Kenmerk" part. As output, I would like to have empty groups if they're not present. Right now, nothing is matched.
I've tried variations with greedy/non greedy, but couldn't get it to work.
Help would be greatly appreciated!
N.B.: I'm working in KNIME (open source data analysis tool)
I was able to split your input using the following regular expression:
^.*
\s+IBAN\:\s*(?<IBAN>.*?)
\s+BIC\:\s*(?<BIC>.*?)
\s+Naam\:\s*(?<Naam>.*?)
(?:\s+Omschrijving\:\s*(?<Omschrijving>.*?))?
(?:\s+Kenmerk\:\s*(?<Kenmerk>.*?))?
$
This requires your fields to follow the given order and treats the fields IBAN, BIC and Naam as required. The fields Omschrijving and Kenmerk are optional. I am pretty sure this can still be optimized, but it yields one named group per field, with the optional groups left empty when they are absent, which should be fine for you (or at least a starting point).
For evaluation and testing in KNIME, I used Palladian's Regex Extractor node, which provides a nice preview functionality.
I added an example workflow to my NodePit Space. It contains some example lines, parses them and provides the above seen output.
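Outside of KNIME, the same expression can be tried in Python as a quick check. This is only a sketch: Python spells named groups `(?P<name>...)`, the `re.VERBOSE` flag allows the multi-line layout, and `parse_cell` is a hypothetical helper that turns missing optional groups into empty strings.

```python
import re

PATTERN = re.compile(
    r"""
    ^.*
    \s+IBAN:\s*(?P<IBAN>.*?)
    \s+BIC:\s*(?P<BIC>.*?)
    \s+Naam:\s*(?P<Naam>.*?)
    (?:\s+Omschrijving:\s*(?P<Omschrijving>.*?))?
    (?:\s+Kenmerk:\s*(?P<Kenmerk>.*?))?
    $
    """,
    re.VERBOSE,
)

def parse_cell(cell):
    """Return a dict of the five fields, or None if the cell does not match."""
    m = PATTERN.match(cell)
    if m is None:
        return None
    # Optional groups that did not take part in the match come back as None.
    return {name: (value or "") for name, value in m.groupdict().items()}

print(parse_cell("SEPA Overboeking IBAN: AB1234 BIC: LALA678 Naam: John Smith"))
```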

Validate Street Address Format

I'm trying to validate the format of a street address in Google Forms using regex. I won't be able to confirm it's a real address, but I would like to at least validate that the string is:
[numbers (max 6 digits)] [words (minimum one to max 8 words with spaces in between; numbers and # allowed)], [words (minimum one to max four words, only letters)], [2 capital letters] [5 digit number]
I want the spaces and commas I left in between the brackets to be required, exactly where I put them in the above example. This would validate
123 test st, test city, TT 12345
That's obviously not a real address, but at least it requires the entry of the correct format. The data is coming from people answering a question on a form, so it will always be just an address, no names. Plus, all the addresses are in one area, South Florida, where pretty much all addresses will match this format. The problem I'm having is people not entering a city, or leaving out the commas, so I want to give them an error if they do. So far, I've found this
^([0-9a-zA-Z]+)(,\s*[0-9a-zA-Z]+)*$
But that doesn't allow for multiple words between the commas, or the capital letters and numbers for zip. Any help would save me a lot of headaches, and I would greatly appreciate it.
There really is a lot to consider when dealing with a street address--more than you can meaningfully deal with using a regular expression. Besides, if a human being is at a keyboard, there's always a high likelihood of typing mistakes, and there just isn't a regex that can account for all possible human errors.
Also, depending on what you intend to do with the address once you receive it, there's all sorts of helpful information you might need that you wouldn't get just from splitting the rough address components with a regex.
As a software developer at SmartyStreets (disclosure), I've learned that regular expressions really are the wrong tool for this job because addresses aren't as 'regular' (standardized) as you might think. There are more rigorous validation tools available, even plugins you can install on your web form to validate the address as it is typed, and which return a wealth of useful metadata and information.
Try Regex:
\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}
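To check the pattern above against a few inputs before pasting it into a form, a small Python harness like this can help. It is a sketch only: `is_valid_format` is an illustrative helper, and `fullmatch` is used so the whole entry has to fit the pattern (the pattern itself has no `^`/`$` anchors). Because of the `\s*` parts, the spaces after the commas and before the ZIP are optional rather than required.

```python
import re

ADDRESS_FORMAT = re.compile(
    r"\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*"
    r"(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}"
)

def is_valid_format(text):
    """True if the entire string has the street, city, ST ZIP shape."""
    return ADDRESS_FORMAT.fullmatch(text) is not None

for sample in [
    "123 test st, test city, TT 12345",   # expected shape
    "123 test st test city TT 12345",     # commas missing
    "123 test st, TT 12345",              # city missing
]:
    print(sample, "->", is_valid_format(sample))
```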
Accepts Apt# also:
(^[0-9]{1,5}\s)([A-Za-z]{1,}(\#\s|\s\#|\s\#\s|\s)){1,5}([A-Za-z]{1,}\,|[0-9]{1,}\,)(\s[a-zA-Z]{1,}\,|[a-zA-Z]{1,}\,)(\s[a-zA-Z]{2}\s|[a-zA-Z]{2}\s)([0-9]{5})

Parsing a String in SSIS or C#

I have one string without any delimiter and I want to parse it. Is it possible in SSIS or C#?
For example, I have address info in a single column, but I want to split/parse it into multiple columns such as house number, road number, road name, road type, locality name, state code, post code, country, etc.
12/38 Meacher Street Mount Druitt NSW 2770 Australia -- In this case, house number: 12, road number: 38, road name: Meacher, road type: Street, locality: Mount Druitt, state: NSW, post code: 2770
I have all this info in a single column, so how can I parse it and split it into multiple columns? I know that splitting on a space delimiter will not work, because some road names contain more than one word, so the information would end up in the wrong columns.
Any suggestion would be appreciated.
Thanks.
Please remember that the country can also have spaces in it and some countries use alphanumerical post codes.
If all addresses are in Australia and in the same format of (...), state, postcode, Australia then you can split it into
StreetAddress, State, PostCode
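As a rough illustration of that three-way split, here is a Python sketch (the same pattern should carry over to C#'s `Regex` class with `(?<name>...)` groups). It assumes every address ends in a state abbreviation, a four-digit postcode and the literal word "Australia", with or without commas between them; `AU_ADDRESS` and `split_au_address` are names made up for this example.

```python
import re

# Assumption: "<street address> <STATE> <4-digit postcode> Australia",
# with optional commas between the parts.
AU_ADDRESS = re.compile(
    r"^(?P<street_address>.+?),?\s+"
    r"(?P<state>NSW|VIC|QLD|SA|WA|TAS|NT|ACT),?\s+"
    r"(?P<postcode>\d{4}),?\s+Australia$"
)

def split_au_address(text):
    """Return a dict with street_address, state and postcode, or None."""
    m = AU_ADDRESS.match(text.strip())
    return m.groupdict() if m else None

print(split_au_address("12/38 Meacher Street Mount Druitt NSW 2770 Australia"))
```

Anchoring on the fixed tail (state, postcode, country) means the street part can contain any number of words without confusing the split.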
You could also use one of the online APIs to look up the address, which then gives you the individual elements.
The best solution is to keep it together - why split it?

I'm going to be teaching a few developers regular expressions - what are some good homework problems? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I'm thinking of presenting questions in the form of "here is your input: [foo], here are the capture groups/results: [bar]" (and maybe writing a small script to test their answers for my results).
What are some good regex questions to ask? I need everything from beginner questions like "validate a 4 digit number" to "extract postal codes from addresses".
A few that I can think of off the top of my head:
Phone numbers in any format e.g. 555-5555, 555 55 55 55, (555) 555-555 etc.
Remove all html tags from text.
Match social security number (Finnish one is easy;)
All IP addresses
IP addresses with shorthand netmask (xx.xx.xx.xx/yy)
There's a bunch of examples of various regular expression techniques over at www.regular-expressions.info - everything from simple literal matching to backreferences and lookahead.
To keep things a bit more interesting than the usual email/phone/url stuff, try looking for more original exercises. Avoid boredom.
For example, have a look at the Forsyth-Edwards Notation, which is used for describing a particular board position of a chess game.
Have your students validate and extract all the bits of information from a string like this:
rnbqkbnr/pp1ppppp/8/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2
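One possible reference solution for this exercise, sketched in Python, splits the string into its six space-separated fields (piece placement, active color, castling availability, en passant target square, halfmove clock, fullmove number). The `FEN` pattern and its group names are illustrative choices. It checks only the shape of each field; it does not verify that every rank adds up to exactly eight squares.

```python
import re

FEN = re.compile(
    r"""
    ^(?P<placement>(?:[pnbrqkPNBRQK1-8]{1,8}/){7}[pnbrqkPNBRQK1-8]{1,8})
    \s(?P<active>[wb])
    \s(?P<castling>-|[KQkq]{1,4})
    \s(?P<en_passant>-|[a-h][36])
    \s(?P<halfmove>\d+)
    \s(?P<fullmove>\d+)$
    """,
    re.VERBOSE,
)

position = "rnbqkbnr/pp1ppppp/8/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2"
m = FEN.match(position)
fen_fields = m.groupdict()
ranks = fen_fields["placement"].split("/")
print(fen_fields["active"], fen_fields["castling"], len(ranks))
```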
Additionally, have a look at algebraic chess notation, used to describe moves. Extract chess moves out of a piece of text (and make them bold).
1. e4 e5 2. Nf3 Black now defends his pawn 2...Nc6 3. Bb5 Black threatens c4
Validate phone numbers (extract area code + rest of number with grouping) (assuming US phone numbers, otherwise generalize for your style)
Play around with validating email addresses (you probably want to tell the students that the full version is a hugely complicated regular expression, but for simple cases it is pretty straightforward)
regexplib.com has a good library you can search through for examples.
How about extracting first name, middle name, last name, and personal suffix (Jr., III, etc.) from a format like:
Smith III, John Paul
How about Reg Ex to remove line breaks and tabs from the input
I would start with the common ones:
validate email
validate phone number
separate the parts of a URL
Be cruel. Tell them parse HTML.
RegEx match open tags except XHTML self-contained tags
Are you teaching them theory of finite automata as well?
Here is a good one: parse the addresses of churches correctly from this badly structured format (copy and paste it as text first)
http://www.churchangel.com/WEBNY/newhart.htm
I'm a fan of parsing date strings. Define a few common date formats, as well as time and date-time formats. These are often good exercises because some dates are simple mixes of digits and punctuation. There's a limited degree of freedom in parsing dates.
Just to throw them for a loop, why not reword a question or two to suggest that they write a regular expression to generate data fitting a specific pattern like email addresses, phone numbers, etc.? It's the same thing as validating, but can help them get out of the mindset that regex is just for validation (whereas the data generation tool in visual studio uses regex to randomly generate data).
Rather than teaching examples based on the data set, I would do examples from the perspective of the rule set to get the basics across. Give them simple examples to solve that lead them to use ONE of several basic constructs in each solution. Then have a couple of "compound" regexes at the end.
Simple:
s/abc/def/
Spinners and special characters:
s/a\s*b/abc/
Character classes:
s/[abc]/def/
Backreference:
s/ab(c)/def$1/
Anchors:
s/^fred/wilma/
s/rubble$/and betty/
Modifiers:
s/Abcd/def/gi
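If the class works in Python rather than Perl or sed, the same progression can be expressed with `re.sub`. The inputs below are made-up sample strings; `count=1` mirrors a substitution without the `g` flag, and the end-of-line example places `$` after the literal text.

```python
import re

# Simple literal: s/abc/def/
simple = re.sub(r"abc", "def", "abc abc", count=1)

# Quantifier and special character: s/a\s*b/abc/
spaced = re.sub(r"a\s*b", "abc", "a   b", count=1)

# Character class: s/[abc]/def/
char_class = re.sub(r"[abc]", "def", "cat", count=1)

# Backreference: s/ab(c)/def$1/  (Python writes $1 as \1)
backref = re.sub(r"ab(c)", r"def\1", "abc", count=1)

# Anchors: start of string and end of string
start = re.sub(r"^fred", "wilma", "fred flintstone", count=1)
end = re.sub(r"rubble$", "and betty", "barney rubble", count=1)

# Modifiers: s/Abcd/def/gi  (all matches, case-insensitive)
modifiers = re.sub(r"Abcd", "def", "Abcd aBCD", flags=re.IGNORECASE)

print(simple, spaced, char_class, backref, start, end, modifiers, sep=" | ")
```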
After this, I would give a few examples illustrating the pitfalls of trying to match HTML tags or other strings that shouldn't be handled with regexes, to show the limitations.
Try to think of some exercises whose answers can't simply be found with Google.
The answer to "write an email validator", for example, takes no effort to find.
Try something like a "5 proof" test.
The input is a 5-digit number, and the sum of its digits must be divisible by five: 12345 gives 1+2+3+4+5 = 15, and 15 / 5 = 3 with no remainder.

where can i get a regex or a library package for recognizing street address, postal code, state, phone numbers, emails and etc?

I have a bunch of unformatted docs.
I need regexes to capture street addresses, postal codes, states, phone numbers, emails, and similar common formats.
This site offers a searchable library of regexes, and this regular expression cookbook contains hundreds of examples of regex matching patterns.
In the case of street addresses and, to a certain extent, postal codes, regexes can only go so far. As a matter of fact, trying to regex a street is essentially impossible because of the huge variety of formats for a street address, even within the United States.
A regex that has worked rather well for strictly formatted US-based postal codes is: ^\d{5}([-+]?\d{4})?$
In the US, ZIP Codes are typically formatted as follows:
12345
123456789
12345-6789
12345+6789
12345-67ND (yes, you read that right, sometimes the last two can be "ND")
The other issue that you'll have is when a zero-prefixed ZIP such as one from New England has been run through Excel and it has removed the leading zero, leaving a four-digit number. This is why a regex alone can't get the job done 100% even for something as "simple" as a US-based ZIP Code.
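A small Python sketch can show both points: what the regex above accepts, and how a value that lost its leading zero might be repaired first. `normalize_zip` is a hypothetical helper, and the rule "a bare four-digit value lost exactly one leading zero" is an assumption that will not hold for every data set.

```python
import re

ZIP_RE = re.compile(r"^\d{5}([-+]?\d{4})?$")

def normalize_zip(raw):
    """Return a ZIP string that matches ZIP_RE, or None if it cannot be repaired."""
    text = str(raw).strip()
    # Assumption: a bare 4-digit value lost one leading zero (e.g. in Excel).
    if re.fullmatch(r"\d{4}", text):
        text = text.zfill(5)
    return text if ZIP_RE.match(text) else None

for value in ["12345", "123456789", "12345-6789", "12345+6789", "12345-67ND", 2134]:
    print(repr(value), "->", normalize_zip(value))
```

The "ND" variant is rejected by the regex as written, which is one of the gaps described above.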
Depending upon the business needs, you'll want to investigate an address verification solution. Any online provider worth their salt can standardize and verify an address, which tells you whether the address is real and can help reduce fraud, return shipping, etc.
In the interest of full disclosure, I'm the founder of SmartyStreets. We have an online address verification service which cleans, standardizes, and validates addresses. You're more than welcome to contact me personally for any questions you have.