Parsing a String in SSIS or C#

I have a string without any delimiters and I want to parse it. Is this possible in SSIS or C#?
For example, I have address information in a single column, but I want to split it into multiple columns such as house number, road number, road name, road type, locality name, state code, post code, and country.
12/38 Meacher Street Mount Druitt NSW 2770 Australia -- In this case house number: 12, road number: 38, road name: Meacher, road type: Street, locality: Mount Druitt, state: NSW, post code: 2770
All of this information is in a single column, so how can I parse it and split it into multiple columns? I know that splitting on spaces will not work, because some road names contain more than one word, so values would end up in the wrong columns.
Any suggestions would be appreciated.
Thanks.

Please remember that a country name can also contain spaces, and that some countries use alphanumeric post codes.
If all addresses are in Australia and in the same format of (...), state, postcode, Australia, then you can split each one into
StreetAddress, State, PostCode
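If that fixed layout really holds, the split can be anchored on the right-hand end of the string, where the fields are predictable. Below is a minimal Python sketch of the idea (the question asks about SSIS or C#; .NET's Regex supports the same constructs with `(?<name>...)` group syntax). The function name `split_au_address` and the hard-coded list of state codes are assumptions made for illustration, and everything before the state, including the locality, ends up in the street address part.

```python
import re

# Assumed layout: "<anything> <STATE> <4-digit postcode> Australia".
# The state list is an assumption covering the usual state/territory codes.
AU_ADDRESS = re.compile(
    r"^(?P<street_address>.+?)\s+"
    r"(?P<state>NSW|VIC|QLD|SA|WA|TAS|NT|ACT)\s+"
    r"(?P<postcode>\d{4})\s+Australia$",
    re.IGNORECASE,
)


def split_au_address(address):
    """Return a dict with street_address, state and postcode, or None."""
    match = AU_ADDRESS.match(address.strip())
    return match.groupdict() if match else None


print(split_au_address("12/38 Meacher Street Mount Druitt NSW 2770 Australia"))
```

Addresses that do not end in this pattern return None, so they can be set aside for manual handling instead of being split incorrectly.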
You could also use one of the online address APIs to look up an address; the result gives you the individual elements.
The best solution is to keep it together - why split it?

Related

Excel formulae: need way to determine if 3rd text character from right is "-"

I have a column of hospital names. In most of them, the last three characters are "-" and the two-letter abbreviation for the state, e.g. "-CA", but some (out of hundreds) have the state name somewhere in the hospital name, e.g. "Texas Tech U Affil-Lubbock" or "Community Health of South Florida".
I'm trying to find a way to make Excel give the last two characters only if the 3rd character from the right is a dash ("-"), but trying to specify that character position seems impossible.
I tried:
=IF(RIGHT(H4,-3)="-",RIGHT(H4,2),"noabbrev") and get #VALUE
=IF(RIGHT(H4,3)="-??",(RIGHT(H4,2)),"noabbrev") and always get noabbrev for all cells
At this point, I fear I need to use =RIGHT(H4,2) in order to get the bulk of the cells correctly and eyeball/correct the errors by hand.
Am I missing the obvious again?

You can use this formula if H4 is text:
=IF(MID(H4,LEN(H4)-2,1)="-",RIGHT(H4,2),"noabbrev")

If A1 contains some text, then:
=Left(Right(A1,3),1)
should isolate the character you want.

Validate Street Address Format

I'm trying to validate the format of a street address in Google Forms using regex. I won't be able to confirm it's a real address, but I would like to at least validate that the string is:
[numbers (max 6 digits)] [words (minimum one to max 8 words with spaces in between, numbers and # allowed)], [words (minimum one to max four words, only letters)], [2 capital letters] [5 digit number]
I want the spaces and commas I left in between the brackets to be required, exactly where I put them in the above example. This would validate
123 test st, test city, TT 12345
That's obviously not a real address, but at least it requires the entry of the correct format. The data is coming from people answering a question on a form, so it will always be just an address, no names. Plus, all the addresses are in one area, South Florida, where pretty much all addresses will match this format. The problem I'm having is people not entering a city, or commas, so I want to give them an error if they don't. So far, I've found this:
^([0-9a-zA-Z]+)(,\s*[0-9a-zA-Z]+)*$
But that doesn't allow for multiple words between the commas, or the capital letters and numbers for zip. Any help would save me a lot of headaches, and I would greatly appreciate it.

There really is a lot to consider when dealing with a street address--more than you can meaningfully deal with using a regular expression. Besides, if a human being is at a keyboard, there's always a high likelihood of typing mistakes, and there just isn't a regex that can account for all possible human errors.
Also, depending on what you intend to do with the address once you receive it, there's all sorts of helpful information you might need that you wouldn't get just from splitting the rough address components with a regex.
As a software developer at SmartyStreets (disclosure), I've learned that regular expressions really are the wrong tool for this job because addresses aren't as 'regular' (standardized) as you might think. There are more rigorous validation tools available, even plugins you can install on your web form to validate the address as it is typed, and which return a wealth of useful metadata and information.

Try Regex:
\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}
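To see how this pattern behaves on the inputs from the question, here is a small Python check. It is only a sketch: the helper name `is_valid_format` is made up for illustration, and `re.fullmatch` is used so that the whole answer has to fit the pattern. How Google Forms anchors a pattern is not something stated here, so adding `^` and `$` there may be needed.

```python
import re

# The pattern from the answer above, split over two lines for readability.
ADDRESS_FORMAT = re.compile(
    r"\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*"
    r"(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}"
)


def is_valid_format(text):
    """True only if the entire string has the street, city, ST 12345 shape."""
    return ADDRESS_FORMAT.fullmatch(text) is not None


for sample in (
    "123 test st, test city, TT 12345",  # expected format
    "123 test st test city TT 12345",    # commas missing
    "123 test st, TT 12345",             # city missing
):
    print(sample, "->", is_valid_format(sample))
```

Only the first sample passes; the two failure cases are the ones the question complains about (no commas, no city).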

Accepts Apt# also:
(^[0-9]{1,5}\s)([A-Za-z]{1,}(\#\s|\s\#|\s\#\s|\s)){1,5}([A-Za-z]{1,}\,|[0-9]{1,}\,)(\s[a-zA-Z]{1,}\,|[a-zA-Z]{1,}\,)(\s[a-zA-Z]{2}\s|[a-zA-Z]{2}\s)([0-9]{5})

Matching datasets on variables with inconsistent formats in SAS

I have two datasets, one that lives within my agency and another that comes from an external source. Theoretically, all my agency's data should be matchable as a subset of the external data, but the problem is that there's no consistency in how PHN + street addresses are being recorded externally.
Our data = 100 West 10 Street
Their data = 100W 10th St / 100 W. 10 St. / 100west 10TH Street (you get the idea)
We have a lot of data, but they have even more, and both datasets change on a daily basis, so it's infeasible to fix formats one by one.
So I have two questions, coming from a SAS novice who's learned through work and lots of Googling, so please bear with me.
1 - Is there a way to do a quick non-perfect/fuzzy matching of the two datasets on addresses if they're not totally consistent in format? I understand that I'd have to go through the results, but I wanted a quick way to eliminate most of the non-matches immediately with minimal clean-up beforehand.
2 - If 1 isn't possible, what is the best approach to clean up the external data and to make the addresses more consistent? Should I keep the PHN + Street together, or keep them as separate variables? I started looking into prxchange and while it's definitely useful, it's not perfect. For example:
Address = left(prxchange('s / ST | ST. / STREET /', -1, cat(' ', address, ' ')));
Works great until it hits addresses at St Marks, for example, and converts the St to STREET.
The other problem is that I have to account for all the possible variations in spelling, abbreviations, periods, etc., which I'm doing now the old-fashioned way in Excel, but this leaves room for error.
Also, if some of the addresses have been compressed, such as 10west instead of 10 west, what is the best way to add a space or separate the parts out entirely? Everything has been read in as text, and again there's no consistency in the number of characters, so a simple substring won't work.
Thanks!
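One possible shape for the clean-up described in question 2, sketched in Python rather than SAS (the same regular expressions could be tried with prxchange). Everything here is an assumption for illustration: the function name `normalize_address`, the two tiny lookup tables, and the rule that "ST" becomes "STREET" only when it is the last word, which is one way to leave "St Marks" alone. Real data will need longer tables and more rules.

```python
import re

# Hypothetical, deliberately small lookup tables; extend them from real data.
DIRECTIONS = {"N": "NORTH", "S": "SOUTH", "E": "EAST", "W": "WEST"}
SUFFIXES = {"ST": "STREET", "AVE": "AVENUE", "PL": "PLACE", "RD": "ROAD"}


def normalize_address(address):
    """Bring differently typed addresses to one comparable spelling."""
    text = address.upper().replace(".", "")
    # Drop ordinal endings: "10TH" -> "10"
    text = re.sub(r"\b(\d+)(?:ST|ND|RD|TH)\b", r"\1", text)
    # Separate a number glued to the next word: "100WEST" -> "100 WEST"
    text = re.sub(r"\b(\d+)([A-Z]+)\b", r"\1 \2", text)
    tokens = [DIRECTIONS.get(tok, tok) for tok in text.split()]
    # Expand a street-type abbreviation only in the final position,
    # so a leading "ST" (as in "ST MARKS") is left untouched.
    if tokens and tokens[-1] in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1]]
    return " ".join(tokens)


for raw in ("100 West 10 Street", "100W 10th St", "100 W. 10 St.", "100west 10TH Street"):
    print(raw, "->", normalize_address(raw))
```

With both datasets passed through the same function, an exact join on the normalized value removes most format-only mismatches, and whatever is left over is the smaller set to review by hand or to hand to a fuzzy comparison.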

RegEx Address Parsing City

In our database we have 1 "Address" field that stores the complete address as text. I am in the process of splitting the address into the following fields: Line1, City, State, Zip. (US Addresses). I have good expressions for parsing the state and zip, but I'm having a bit of difficulty parsing the city.
Basically, I am using the following rules for parsing the city:
It must come right before the state
It can have a comma, or two or more spaces before it.
If neither of the above are true, then just return the 1 word that comes before the state.
I am not interested in validating these addresses.
Here's an example of the RegEx that I've been working with. It works great for parsing address components that are separated by a comma or by two or more spaces, but I can't get it to work if I try to include the alternative of the single preceding word:
Sample Address: 1977 S. Joshua Tree PL, Palm Springs, CA 92264
.*(?i)(?((((,\s|\s{2,})\w+)+(\s\w+)))(?=(,\s+|\s+)(Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New Hampshire|New Jersey|New Mexico|New York|North Carolina|North Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode Island|South Carolina|South Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West Virginia|Wisconsin|AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|WA|WV|WI|WY)))
Trying to make the 1st word optional causes the expression to only return "Springs", instead of "Palm Springs", which definitely matches in the expression above:
.*(?i)(?((((,\s|\s{2,})\w+)?(\s\w+)))(?=(,\s+|\s+)
Thanks for your help!

Personally, I think I would take a totally different approach. I would treat the zip code as authoritative, as it is the most granular data you have available. I would get a list of zip code to city mappings. Extract the zip code portion of the address. Write the city and state values based on the zip code into the new database fields. Then write a script to go through each data entry and determine whether the city and state names based on the zip code can be found in your string. If they can, remove those values from the string and flag that record as successfully processed. If they can't, flag the record as one that might need manual review.
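The steps above could be sketched roughly as follows in Python. This is an illustration under assumptions, not a finished tool: `ZIP_TO_CITY_STATE` is a hypothetical one-entry stand-in for a real zip code dataset, `split_by_zip` is a made-up name, and the sketch only looks for the city and state directly in front of the zip code.

```python
import re

# Hypothetical lookup table; a real one would be loaded from a zip code dataset.
ZIP_TO_CITY_STATE = {
    "92264": ("Palm Springs", "CA"),
}

# Five-digit zip (optionally ZIP+4) at the very end of the string.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?\s*$")


def split_by_zip(address):
    """Return (line1, city, state, zip_code, needs_review)."""
    match = ZIP_RE.search(address)
    if not match or match.group(1) not in ZIP_TO_CITY_STATE:
        return address, None, None, None, True
    zip_code = match.group(1)
    city, state = ZIP_TO_CITY_STATE[zip_code]
    rest = address[:match.start()]
    # Look for "<city> <state>" (commas or spaces between) at the end of the rest.
    tail = re.compile(
        r"[,\s]*" + re.escape(city) + r"[,\s]+" + re.escape(state) + r"[,\s]*$",
        re.IGNORECASE,
    )
    tail_match = tail.search(rest)
    if not tail_match:
        # City/state from the zip are not in the string: flag for manual review.
        return address, city, state, zip_code, True
    line1 = rest[:tail_match.start()].strip(" ,")
    return line1, city, state, zip_code, False


print(split_by_zip("1977 S. Joshua Tree PL, Palm Springs, CA 92264"))
```

Because the city comes from the lookup rather than from guessing where it starts, multi-word cities such as "Palm Springs" need no special handling, and the final flag marks records that still need a person to look at them.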
An alternative approach might be to send your address string to an API like Google Maps and hopefully get a cleaned address back.

This may be overly broad, but it might work for you, depending on the regex implementation you are using:
(.+?),\s*(.+?)(?:,\s|\s\s)(.+?)\s(\d{5})
This will return the following groups from your example:
('1977 S. Joshua Tree PL', 'Palm Springs', 'CA', '92264')

I always prefer named capture groups for something like this. So try
(?<addr>[^,]+),\s+(?<city>[^,]+),\s+(?<state>[A-Za-z]{2})\s+(?<zip>\d{5}(-\d{4})?)
Parsing your example this will give you
addr: 1977 S. Joshua Tree PL
city: Palm Springs
state: CA
zip: 92264
and I threw in support for the extended postal code format as well.
You can just extract the value of the city group from the match generated by this regex.
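For anyone trying this outside .NET: Python's `re` module spells named groups as `(?P<name>...)`, so the pattern needs that small change. A minimal sketch, with `parse_us_address` being a name invented for this example:

```python
import re

# Same pattern as above, with Python's (?P<name>...) group syntax.
US_ADDRESS = re.compile(
    r"(?P<addr>[^,]+),\s+(?P<city>[^,]+),\s+"
    r"(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(-\d{4})?)"
)


def parse_us_address(text):
    """Return the named groups as a dict, or None if the text does not match."""
    match = US_ADDRESS.search(text)
    return match.groupdict() if match else None


parsed = parse_us_address("1977 S. Joshua Tree PL, Palm Springs, CA 92264")
print(parsed["city"])
```

Note that this pattern relies on the commas; an address that separates the city only with spaces returns None, so the single-word fallback from the question would have to be handled separately.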

Pulling international street addresses / phone numbers from free-form text

Hey, folks. I'm looking for some regular expressions to help grab street addresses and phone numbers from free-form text (a la Gmail).
Given some text: "John, I went to the store today, and it was awesome! Did you hear that they moved to 500 Green St.? ... Give me a call at +14252425424 when you get a chance."
I'd like to be able to pull out:
500 Green St. (recognized as a street address)
+14252425424 (recognized as a phone number)
What makes this problem easier is that I don't care about parsing text that gets pulled out. That is, I don't care that Green is the name of the road or that 425 is the area code. I just want to grab strings that "look like" addresses or telephone numbers.
Unfortunately, this needs to work internationally, as best as possible.
Anyone have any leads? Thanks!

Phone numbers are easy as long as you have a list of all country codes and number formats. For street addresses I have no idea; the only advice I can give you is to validate each set of words at addressdoctor.com

You can give RecogniContact (-> address-parser.com) a try, it recognizes both postal addresses and phone numbers.

Take a look at Chapter 7 of Dive Into Python. It touches both phone numbers and street addresses. I believe you can use this as a starting point. The international part seems tough. I suggest you build a first draft, try it on several locales, iterate and improve.
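As a possible first draft in that spirit, here is a Python sketch that only grabs strings that "look like" a phone number or a street address, without parsing them. Both patterns are assumptions made for illustration: the phone pattern only accepts a plus sign followed by 7 to 15 digits, and the street pattern only knows a handful of English street-type abbreviations, so other locales would need their own rules.

```python
import re

# International-style number: "+" followed by 7 to 15 digits, no separators.
PHONE = re.compile(r"\+\d{7,15}\b")

# House number, one to four capitalized words, then a known street type.
# The suffix list is a small English-only assumption.
STREET = re.compile(
    r"\b\d{1,6}\s+(?:[A-Z][A-Za-z]*\s+){1,4}(?:St|Ave|Rd|Blvd|Dr|Ln)\b\.?"
)


def find_candidates(text):
    """Return substrings that look like phone numbers or street addresses."""
    return {"phones": PHONE.findall(text), "addresses": STREET.findall(text)}


message = (
    "John, I went to the store today, and it was awesome! Did you hear that "
    "they moved to 500 Green St.? ... Give me a call at +14252425424 when you "
    "get a chance."
)
print(find_candidates(message))
```

On the sample message from the question this picks out "500 Green St." and "+14252425424"; trying it on text from several locales, as suggested above, is the way to find out which rules to add next.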
