Hey, folks. I'm looking for some regular expressions to help grab street addresses and phone numbers from free-form text (a la Gmail).
Given some text: "John, I went to the store today, and it was awesome! Did you hear that they moved to 500 Green St.? ... Give me a call at +14252425424 when you get a chance."
I'd like to be able to pull out:
500 Green St. (recognized as a street address)
+14252425424 (recognized as a phone number)
What makes this problem easier is that I don't care about parsing the text that gets pulled out. That is, I don't care that Green is the name of the road or that 425 is the area code. I just want to grab strings that "look like" addresses or telephone numbers.
Unfortunately, this needs to work internationally, as well as possible.
Anyone have any leads? Thanks!
Phone numbers are easy as long as you have a list of all country codes and number formats. For street addresses I have no idea; the only advice I can give you is to validate each set of words against addressdoctor.com.
You can give RecogniContact (-> address-parser.com) a try, it recognizes both postal addresses and phone numbers.
Take a look at Chapter 7 of Dive Into Python. It touches both phone numbers and street addresses. I believe you can use this as a starting point. The international part seems tough. I suggest you build a first draft, try it on several locales, iterate and improve.
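As a possible first draft along those lines, here is a minimal Python sketch that pulls out strings that merely "look like" a phone number or a street address. Both patterns are my own assumptions, not anything from Dive Into Python: the phone pattern accepts an optional "+" and 7 to 15 digits, and the address pattern only knows a handful of English street suffixes, so it would need a different suffix list (or a different approach) for each locale.

```python
import re

# Assumption: a "phone-like" string is an optional "+" followed by 7-15 digits,
# optionally separated by single spaces, dots, or hyphens.
PHONE_RE = re.compile(r"\+?\d(?:[\s.-]?\d){6,14}")

# Assumption: an "address-like" string is a house number, one to four
# capitalized words, and a common English street suffix.
ADDRESS_RE = re.compile(
    r"\b\d{1,6}\s+(?:[A-Z][A-Za-z]*\s+){1,4}"
    r"(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Dr|Drive|Ln|Lane)\b\.?"
)


def extract_candidates(text):
    """Return (addresses, phones) that look plausible in free-form text."""
    addresses = [m.group(0) for m in ADDRESS_RE.finditer(text)]
    phones = [m.group(0) for m in PHONE_RE.finditer(text)]
    return addresses, phones
```

Run against the sample sentence from the question, this returns "500 Green St." as the only address candidate and "+14252425424" as the only phone candidate. Expect false positives and misses on real mail; iterate per locale as suggested above.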
I'm trying to validate the format of a street address in Google Forms using regex. I won't be able to confirm it's a real address, but I would like to at least validate that the string is:
[numbers(max 6 digits)] [word(minimum one to max 8 words with spaces in between and numbers and # allowed)], [words(minimum one to max four words, only letters)], [2 capital letters] [5 digit number]
I want the spaces and commas I left in between the brackets to be required, exactly where I put them in the above example. This would validate
123 test st, test city, TT 12345
That's obviously not a real address, but at least it requires the entry of the correct format. The data is coming from people answering a question on a form, so it will always be just an address, no names. Plus, the addresses are all in one area, South Florida, where pretty much all addresses will match this format. The problem I'm having is people not entering a city, or commas, so I want to give them an error if they don't. So far, I've found this:
^([0-9a-zA-Z]+)(,\s*[0-9a-zA-Z]+)*$
But that doesn't allow for multiple words between the commas, or the capital letters and numbers for zip. Any help would save me a lot of headaches, and I would greatly appreciate it.
There really is a lot to consider when dealing with a street address--more than you can meaningfully deal with using a regular expression. Besides, if a human being is at a keyboard, there's always a high likelihood of typing mistakes, and there just isn't a regex that can account for all possible human errors.
Also, depending on what you intend to do with the address once you receive it, there's all sorts of helpful information you might need that you wouldn't get just from splitting the rough address components with a regex.
As a software developer at SmartyStreets (disclosure), I've learned that regular expressions really are the wrong tool for this job because addresses aren't as 'regular' (standardized) as you might think. There are more rigorous validation tools available, even plugins you can install on your web form to validate the address as it is typed, and which return a wealth of useful metadata and information.
Try Regex:
\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}
See Demo
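If you want to check the pattern outside of Google Forms, here is a small Python sketch. The function name looks_like_address is hypothetical, and using re.fullmatch is my assumption so that the whole answer has to fit the pattern rather than just part of it.

```python
import re

# The pattern from this answer, split across two string literals for readability.
ADDRESS_FORMAT = re.compile(
    r"\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*"
    r"(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}"
)


def looks_like_address(value):
    """True if the entire string fits 'number street, city, ST 12345'."""
    return ADDRESS_FORMAT.fullmatch(value) is not None


print(looks_like_address("123 test st, test city, TT 12345"))  # sample from the question
print(looks_like_address("123 test st test city TT 12345"))    # commas missing
print(looks_like_address("123 test st, TT 12345"))             # city missing
```

Note that the `\s*` parts make the spaces after the commas and before the ZIP optional; swap them for `\s` if those spaces really must be present.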
Accepts Apt# also:
(^[0-9]{1,5}\s)([A-Za-z]{1,}(\#\s|\s\#|\s\#\s|\s)){1,5}([A-Za-z]{1,}\,|[0-9]{1,}\,)(\s[a-zA-Z]{1,}\,|[a-zA-Z]{1,}\,)(\s[a-zA-Z]{2}\s|[a-zA-Z]{2}\s)([0-9]{5})
Got a block of text I'm trying to pull phone numbers out of.
for example:
Phone Numbers
Any phone numbers that Angelo may currently or previously have used
are displayed below. Run a phone report on a particular number for
more information.
(555) 444-5555 (555) 555-7777 Not seeing something? Access additional
data sources. Accessing premium data sources may reveal hard to find
phone numbers like cell phones
the regex code I wrote to extract the numbers is
.?\d{3}.?\s\d{3}.\d{4}
For whatever reason, the results come back blank and I'm not sure why. I've tested this regex inside a uBot Expression Checker and it pulls the phone numbers out as it should. But once I enter it in uBot, it pulls nothing.
Any help? Thanks
FIGURED IT OUT:
.*\d{3}.?\s\d{3,5}.\d{3,5}
for whatever reason uBot wouldn't display the phone numbers correctly until I had the above worked out.
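For comparison outside of uBot, here is a minimal Python sketch of a slightly stricter pattern for the "(555) 444-5555" style shown in the sample. It is not a diagnosis of the uBot behaviour, and the pattern and function name are my own assumptions: the parentheses are matched literally instead of with ".", and the separators are limited to whitespace, dots, and hyphens.

```python
import re

# Assumption: numbers look like "(555) 444-5555", "555 444-5555" or "555-444-5555".
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]\d{4}")


def find_phone_numbers(text):
    """Return every phone-like substring, in order of appearance."""
    return PHONE_RE.findall(text)
```

Because the pattern has no capture groups, findall returns the full matched strings rather than group contents.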
Most credit card regexes list MasterCard as starting with a 5 followed by a second digit of 1-5, but this card is from Sears and has 5049 as its first four digits. I don't really want to change the regex without knowing whether any other unconventional prefixes are in use. Does anyone know if it's safe to change it, or whether other alterations need to be made as well?
Thanks in advance!
Your RegEx is faulty :-) [Edit: If you want to support Sears cards, which is the premise of your question]
There is an accurate list of issuer numbers on Wikipedia:
http://en.wikipedia.org/wiki/List_of_Issuer_Identification_Numbers
It includes 5049 for Sears.
I suggest creating one or more unit tests for each listed issuer number and validating your RegEx with those unit tests.
UPDATE
There are plenty of widely accepted credit cards that start with "50", so your RegEx is still faulty if it asserts the 2nd digit is in the range 1-5.
Examples (From the Wiki link):
500235 National Bank of Canada
500766 Bank of Montreal
If you are selling things that are allowed to be sold to public benefit recipients (e.g. welfare recipients), there are also the EBT cards, e.g.:
507683 Missouri EBT Card
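The unit-test suggestion could look roughly like this in Python. Everything here apart from the four prefixes quoted in this thread is an assumption for illustration: both patterns are hypothetical stand-ins (I don't know your actual regex), the 16-digit length is assumed, and the zero-padded sample numbers only exercise the format.

```python
import re
import unittest

# Hypothetical pattern of the kind the question describes: 5, then 1-5, 16 digits.
CONVENTIONAL = re.compile(r"^5[1-5]\d{14}$")
# Hypothetical widened pattern that also admits 0 as the second digit.
WIDENED = re.compile(r"^5[0-5]\d{14}$")

# Issuer prefixes quoted in this thread.
THREAD_PREFIXES = ["5049", "500235", "500766", "507683"]


def sample_number(prefix, length=16):
    """Pad a prefix with zeros; a format-only sample, not a real card number."""
    return prefix.ljust(length, "0")


class IssuerPrefixTest(unittest.TestCase):
    def test_conventional_rejects_50_prefixes(self):
        for prefix in THREAD_PREFIXES:
            with self.subTest(prefix=prefix):
                self.assertIsNone(CONVENTIONAL.match(sample_number(prefix)))

    def test_widened_accepts_50_prefixes(self):
        for prefix in THREAD_PREFIXES:
            with self.subTest(prefix=prefix):
                self.assertIsNotNone(WIDENED.match(sample_number(prefix)))
```

Adding one entry per issuer number you care about keeps the regex honest whenever someone edits it later.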
I have a bunch of unformatted docs.
I need a regex to capture street addresses, postal codes, states, phone numbers, emails, and other such common formats.
This site offers a searchable library of regexes, and this regular expression cookbook contains hundreds of examples of regex matching patterns.
In the case of street addresses and, to a certain extent, postal codes, regexes can only go so far. As a matter of fact, trying to regex a street is essentially impossible because of the huge variety of formats for a street address, even within the United States.
A regex that has worked rather well for strictly formatted US-based postal codes is: ^\d{5}([-+]?\d{4})?$
In the US, ZIP Codes are typically formatted as follows:
12345
123456789
12345-6789
12345+6789
12345-67ND (yes, you read that right, sometimes the last two can be "ND")
The other issue that you'll have is when a zero-prefixed ZIP such as one from New England has been run through Excel and it has removed the leading zero, leaving a four-digit number. This is why a regex alone can't get the job done 100% even for something as "simple" as a US-based ZIP Code.
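To make both points concrete, here is a minimal Python sketch using the regex above. The helper restore_leading_zeros is hypothetical, and it rests on a loud assumption: any all-digit value of three or four digits is a five-digit ZIP that lost its leading zeros, which a regex alone cannot confirm.

```python
import re

# The strictly formatted US pattern quoted above.
ZIP_RE = re.compile(r"^\d{5}([-+]?\d{4})?$")


def is_zip(value):
    """True if the whole string is a 5-digit ZIP or a ZIP+4 variant."""
    return ZIP_RE.fullmatch(value) is not None


def restore_leading_zeros(value):
    """Assumption: a 3- or 4-digit value is a ZIP whose leading zeros were dropped."""
    digits = str(value).strip()
    if digits.isdigit() and 3 <= len(digits) < 5:
        return digits.zfill(5)
    return digits
```

With this, is_zip("2134") is False until the value is passed through restore_leading_zeros, and "12345-67ND" is rejected outright, which is exactly the kind of case the pattern cannot cover.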
Depending upon the business needs, you'll want to investigate an address verification solution. Any online provider worth their salt can standardize and verify an address, which tells you whether the address is real and can help reduce fraud, return shipping, etc.
In the interest of full disclosure, I'm the founder of SmartyStreets. We have an online address verification service which cleans, standardizes, and validates addresses. You're more than welcome to contact me personally for any questions you have.
So I'm looking for a regex or some other solution to detect street addresses, phone numbers, fax numbers, etc. in Western countries.
I know it won't be perfect, but my priority is US and Canadian street addresses, provinces/states, postal codes, etc.
It would be nice if someone has already done this, so I don't have to rewrite the regexes myself.
Canadian postal codes can be verified through Canada Post's website.
It returns a range of valid addresses given a postal code. I am not sure if there's a web API for it, but it could provide much better accuracy than a regex.
You can probably find some interesting stuff in the sub-packages of PEAR::Validate (it's in PHP) that correspond to the locales you want
For instance, in the Validate_US class :
function postalCode($postalCode, $strong = false)
{
    return (bool)preg_match('/^[0-9]{5}((-| )[0-9]{4})?$/', $postalCode);
}
The same method, in the Validate_FR class :
function postalCode($postalCode, $strong = false)
{
    return (bool)preg_match('/^(0[1-9]|[1-9][0-9])[0-9][0-9][0-9]$/', $postalCode);
}
But note that this kind of regex will only allow you to validate that a given code looks valid, not that it actually is valid: there are so many postal codes (and even more addresses!) that the list would be unmanageable and a maintenance nightmare, I guess.
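If you are not on PHP, the same two checks are easy to carry over. Here is a rough Python sketch; the regex bodies are copied from the PEAR snippets above, while the dict and the function name are my own and not part of PEAR.

```python
import re

# Regex bodies copied from the Validate_US and Validate_FR snippets;
# the anchors are dropped because fullmatch already requires a whole-string match.
POSTAL_PATTERNS = {
    "US": re.compile(r"[0-9]{5}((-| )[0-9]{4})?"),
    "FR": re.compile(r"(0[1-9]|[1-9][0-9])[0-9][0-9][0-9]"),
}


def postal_code_looks_valid(code, locale):
    """Format check only: says nothing about whether the code really exists."""
    return POSTAL_PATTERNS[locale].fullmatch(code) is not None
```

Adding a locale is then just one more dictionary entry, which keeps the per-country rules in one place.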
You may try this regex; it works for US ZIP codes and Canadian postal codes, e.g. A1A 1A1 for Canada and 99999 or 99999-8989 for the US.
(^\d{5}(-\d{4})?$)|(^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$)
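Since that pattern puts the US and Canadian alternatives in separate groups, you can also tell which one matched. A minimal Python sketch, with a hypothetical function name:

```python
import re

# Group 1 is the US alternative, group 3 the Canadian one.
US_CA_RE = re.compile(
    r"(^\d{5}(-\d{4})?$)|(^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$)"
)


def classify_postal_code(code):
    """Return "US", "CA", or None if the code fits neither format."""
    match = US_CA_RE.match(code)
    if match is None:
        return None
    return "US" if match.group(1) else "CA"
```

Be aware that the pattern as written only accepts uppercase letters and a hyphen in ZIP+4, so "a1a 1a1" and "99999 8989" both come back as None; normalize the input first if you need those.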
Using information I got from this question I searched http://regexlib.com and found what you are looking for
This matches either postal code or zip
^(\d{5}-\d{4}|\d{5}|[A-Z]\d[A-Z] \d[A-Z]\d)$
Phone or fax:
^\+[0-9]{1,3}\([0-9]{3}\)[0-9]{7}$
Like Ben has mentioned, you won't be able to verify whether the address is valid, but you can verify that the format is correct.
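One thing to watch with alternations like the postal-code one: "|" binds more loosely than the anchors, so a pattern of the form ^A|B|C$ only anchors A at the start and C at the end, and a bare B can match anywhere. The Python sketch below compares an ungrouped and a grouped version and also exercises the phone pattern; the constant names are mine.

```python
import re

# Without grouping, the middle alternative \d{5} is not anchored at all.
UNGROUPED = re.compile(r"^\d{5}-\d{4}|\d{5}|[A-Z]\d[A-Z] \d[A-Z]\d$")
# With a non-capturing group, both anchors apply to every alternative.
GROUPED = re.compile(r"^(?:\d{5}-\d{4}|\d{5}|[A-Z]\d[A-Z] \d[A-Z]\d)$")
# The phone/fax pattern: "+", country code, area code in parentheses, 7 digits.
PHONE = re.compile(r"^\+[0-9]{1,3}\([0-9]{3}\)[0-9]{7}$")

print(bool(UNGROUPED.search("abc 1234567 xyz")))  # five digits found mid-string
print(bool(GROUPED.search("abc 1234567 xyz")))    # whole string must be a code
print(bool(PHONE.search("+1(425)2425424")))       # fits the expected layout
print(bool(PHONE.search("+14252425424")))         # no parentheses around area code
```

The phone pattern is strict about layout, so numbers written without the parenthesized area code are rejected even though they may be perfectly valid.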