Matching datasets on variables with inconsistent formats in SAS

I have two datasets, one that lives within my agency and another that comes from an external source. Theoretically, all my agency's data should be matchable as a subset of the external data, but the problem is that there's no consistency in how PHN + street addresses are being recorded externally.
Our data = 100 West 10 Street
Their data = 100W 10th St / 100 W. 10 St. / 100west 10TH Street (you get the idea)
We have a lot of data, but they have even more, and both datasets change on a daily basis, so it's infeasible to change formats one by one.
So I have two questions, coming from a SAS novice who's learned through work and lots of Googling, so please bear with me.
1 - Is there a way to do a quick non-perfect/fuzzy matching of the two datasets on addresses if they're not totally consistent in format? I understand that I'd have to go through the results, but I wanted a quick way to eliminate most of the non-matches immediately with minimal clean-up beforehand.
2 - If 1 isn't possible, what is the best approach to clean up the external data and to make the addresses more consistent? Should I keep the PHN + Street together, or keep them as separate variables? I started looking into prxchange and while it's definitely useful, it's not perfect. For example:
Address = left(prxchange('s / ST | ST. / STREET /', -1, cat(' ', address, ' ')));
Works great until it hits addresses at St Marks, for example, and converts the St to STREET.
The other problem is that I have to account for all the possible variations in spelling, abbreviations, periods, etc., which I'm doing now the old-fashioned way in Excel, but this leaves room for error.
Also, if some of the addresses have been compressed, such as 10west instead of 10 west, what is the best way to add a space or separate out entirely? Everything has been read in in the text format, and again there's no consistency in the number of characters to do a simple substring.
Thanks!

Related

Regex masking all phone numbers except a specific range

I'm not 100% sure if this is possible, but I would like to convert any outbound call that does not match my DID range to a set phone number.
With our carrier in Australia if the ANI is not from their supplied range the call is blocked as part of new regulations. 
What I am looking for is something like this. 
if not +61 2 XXXX XXXX - +61 2 XXXX  XXXX  then send as +612XXXX XXXX
I apologise: I have no real understanding of regex and do not even know where to begin.
I am starting to work on my knowledge of it, though. Please be kind. If anyone can point me to an "idiots guide" link I would be appreciative, as I am just getting into this.

Of course it's possible. It's just a matter of how much work you want to do. I'm not quite sure what you want to mask and what you want to pass on unmutilated. A couple of particular examples would help. How many different formats, countries, and so on do you need to support?
With these problems, I tend to follow this approach:
1 - Normalize the data. Make them all look the same. So, remove all non-digits, for example. +61 2 XXXX XXXX turns into 612XXXXXXXX. In this step, you'd also fill in implicit information, like a local number that does not include the country code. Number::Phone may be interesting, but also note it was the largest distro on CPAN for a while.
2 - Now it should be easier to recognize the number and its components (because if it isn't, you didn't do Step 1 right). Instead of a regex, you might use a parser. That is, get the country code, and then from that, decide what has to happen next. That's the sort of thing I have to do with ISBNs in Business::ISBN, which have a group code then a publisher code (both of which are variable length).
3 - Once you can recognize the number, it's easy to select a range. If it's in the range, you know what to replace.
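The steps above could be sketched roughly as follows in Python. This is only an illustration: the DID range, the fallback number, and the function names are hypothetical placeholders, and the handling of a leading 0 assumes Australian national format.

```python
import re

# Hypothetical values for illustration only; substitute the real DID range
# and fallback number supplied by the carrier.
DID_RANGE_START = 61255500000
DID_RANGE_END = 61255500099
FALLBACK_ANI = "61255500000"


def normalize(number: str, default_country: str = "61") -> str:
    """Step 1: strip non-digits and fill in the implicit country code."""
    digits = re.sub(r"\D", "", number)
    if digits.startswith("0"):
        # Assumed national format such as 02 5550 0042:
        # swap the trunk prefix 0 for the country code.
        digits = default_country + digits[1:]
    return digits


def outbound_ani(number: str) -> str:
    """Steps 2 and 3: recognise the number, keep it only if it is in range."""
    digits = normalize(number)
    if digits.isdigit() and DID_RANGE_START <= int(digits) <= DID_RANGE_END:
        return digits
    # Anything outside the range (or empty) is replaced by the set number.
    return FALLBACK_ANI
```

Because the number is reduced to plain digits first, the range check becomes an ordinary integer comparison rather than a regex.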

Validate Street Address Format

I'm trying to validate the format of a street address in Google Forms using regex. I won't be able to confirm it's a real address, but I would like to at least validate that the string is:
[numbers(max 6 digits)] [word(minimum one to max 8 words with
spaces in between and numbers and # allowed)], [words(minimum one to max four words, only letters)], [2
capital letters] [5 digit number]
I want the spaces and commas I left in between the brackets to be required, exactly where I put them in the above example. This would validate
123 test st, test city, TT 12345
That's obviously not a real address, but at least it requires the entry of the correct format. The data is coming from people answering a question on a form, so it will always be just an address, no names. Plus, all the addresses are in one area, South Florida, where pretty much all addresses will match this format. The problem I'm having is people not entering a city, or commas, so I want to give them an error if they don't. So far, I've found this
^([0-9a-zA-Z]+)(,\s*[0-9a-zA-Z]+)*$
But that doesn't allow for multiple words between the commas, or the capital letters and numbers for zip. Any help would save me a lot of headaches, and I would greatly appreciate it.

There really is a lot to consider when dealing with a street address--more than you can meaningfully deal with using a regular expression. Besides, if a human being is at a keyboard, there's always a high likelihood of typing mistakes, and there just isn't a regex that can account for all possible human errors.
Also, depending on what you intend to do with the address once you receive it, there's all sorts of helpful information you might need that you wouldn't get just from splitting the rough address components with a regex.
As a software developer at SmartyStreets (disclosure), I've learned that regular expressions really are the wrong tool for this job because addresses aren't as 'regular' (standardized) as you might think. There are more rigorous validation tools available, even plugins you can install on your web form to validate the address as it is typed, and which return a wealth of useful metadata and information.

Try Regex:
\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}
See Demo
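A quick way to sanity-check the pattern above outside the form is a small Python sketch like this one. The helper name is made up for illustration, and `re.fullmatch` is used here as a stand-in for anchoring the pattern with `^` and `$`; whether the form tool needs explicit anchors is something to verify there.

```python
import re

# The pattern from the answer above, unchanged, split over two lines for
# readability. re.fullmatch requires the whole string to match.
ADDRESS_RE = re.compile(
    r"\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*"
    r"(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}"
)


def looks_like_address(text: str) -> bool:
    """Return True when the text has the number/street/city/state/zip shape."""
    return ADDRESS_RE.fullmatch(text) is not None


print(looks_like_address("123 test st, test city, TT 12345"))  # well-formed
print(looks_like_address("123 test st test city TT 12345"))    # commas missing
print(looks_like_address("123 test st, TT 12345"))             # city missing
```

This only checks the shape of the entry; as the other answer points out, it says nothing about whether the address exists.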

Accepts Apt# also:
(^[0-9]{1,5}\s)([A-Za-z]{1,}(\#\s|\s\#|\s\#\s|\s)){1,5}([A-Za-z]{1,}\,|[0-9]{1,}\,)(\s[a-zA-Z]{1,}\,|[a-zA-Z]{1,}\,)(\s[a-zA-Z]{2}\s|[a-zA-Z]{2}\s)([0-9]{5})

Parsing a String in SSIS or C#

I have one string without any delimiter and I want to parse it. Is this possible in SSIS or C#?
For example, I have address info in a single column, but I want to split/parse it into multiple columns such as house number, road number, road name, road type, locality name, state code, post code, country, etc.
12/38 Meacher Street Mount Druitt NSW 2770 Australia -- in this case, house number: 12, road number: 38, road name: Meacher, road type: Street, locality: Mount Druitt, state: NSW, post code: 2770
I have all this info in a single column, so how can I parse it and split it into multiple columns? I know that splitting on a space delimiter will not work, because some road names contain more than one space, so the pieces would end up in the wrong columns.
Any suggestion would be appreciated.
Thanks.

Please remember that the country can also have spaces in it and some countries use alphanumerical post codes.
If all addresses are in Australia and in the same format of (...), state, postcode, Australia then you can split it into
StreetAddress, State, PostCode
You could also use one of the online APIs to look up an address, and then you get the individual elements.
The best solution is to keep it together - why split it?
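For the all-Australian case described above, the three-way split could look something like this Python sketch. It assumes every value ends with a state abbreviation, a four-digit postcode, and the literal word Australia; the function and group names are illustrative only, and the same idea carries over to a C# `Regex` in an SSIS script component.

```python
import re
from typing import Dict, Optional

# Assumption: every address ends with "<STATE> <4-digit postcode> Australia".
AU_ADDRESS_RE = re.compile(
    r"^(?P<street_address>.+?)\s+"
    r"(?P<state>NSW|VIC|QLD|SA|WA|TAS|NT|ACT)\s+"
    r"(?P<postcode>\d{4})\s+Australia$"
)


def split_au_address(text: str) -> Optional[Dict[str, str]]:
    """Split into street address, state, and postcode, or None if no match."""
    match = AU_ADDRESS_RE.match(text.strip())
    return match.groupdict() if match else None


print(split_au_address("12/38 Meacher Street Mount Druitt NSW 2770 Australia"))
```

Working from the right-hand end avoids guessing where the road name stops and the locality starts, which is the part a plain space split gets wrong.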

Formatting UK postal codes for storage

I want to store UK postal codes in the database. Is it OK to store those postal codes without the spaces?

It is possible to store postcodes without spaces, but I would definitely recommend formatting them correctly when they are displayed/output.
You can check out the allowed formats for postcodes here. There are always 3 characters after the space so it's easy to reinsert it.

Last 3 are always xyy
x Digit 0-9
yy Alpha A-Z
Anything before is the first part of the grid reference and has various formats.

We store postcodes and we accept inputs in any format, space or no space, but then strip or correct the entry for data storage.
We find it works better this way when using the data for other things.
Why would you want to store with no spaces?

UK postcodes have a variety of formats:
list of formats
Why are you unable to store white spaces?

As others have said, there is no problem with removing all spaces and storing them, if that is what you want to do. As has been said, you can always format them with a space before the last three characters.
However, I would normally take them in any reasonable format, strip all spaces out, and then store them with this one extra space. The storage requirements are not an issue, and it makes it easier to simply display as it is. You would need to resolve the format before saving in some way, so you may as well save it as it is needed.

It's usually safe to remove the space. As others have said, you can re-insert the space later if required. The existence of a space between Outcode and Incode will not normally affect postal delivery. You should not have any non alpha numeric characters in a UK postcode, so if you see a dash you can safely remove it.
I work for Experian Data Quality and if your aim is clean data you may want to consider an address verification web service, like our Pro On Demand product. This will ensure you capture the correct postcode, as they can change over time, and that it is formatted correctly for your database.

It is okay to store without a space because you can always add an empty space back in to each postcode string - the heuristic is pretty simple.
As some other users have very helpfully explained, all UK postcodes have two groups of numbers and letters, separated by a space. The group following the space always contains a number and then two letters (thus, there are always three characters after the space). The group before the space will have either two, three, or four characters (see this Wikipedia page).
So, you can recreate the correct spacing by adding a space before the third-to-last character.
In R, it looks like this (but the same logic would work in other languages, such as Python):
# stringr provides str_sub()
library(stringr)
# list of example postcodes
postcodes = c("LS176JA", "OX41EZ", "A99AA")
# add a space to each postcode in the list of example postcodes
for (postcode in postcodes){
  # the last three characters are the part after the space
  last_three = str_sub(postcode, start = -3)
  # everything before the last three characters
  first_x = str_sub(postcode, end = -4)
  final_postcode = paste0(first_x, " ", last_three)
  print(final_postcode)
}
Which returns:
[1] "LS17 6JA"
[1] "OX4 1EZ"
[1] "A9 9AA"

SQL Server Regular Expression Workaround in T-SQL?

I have some SQLCLR code for working with Regular Expressions. But now that it is getting migrated into Azure, which does not allow SQLCLR, that's out. I need to find a way to do regex in pure T-SQL.
Master Data Services are not available because the dev edition of MSSQL we have is not R2.
All ideas appreciated, thanks.
Regular expression match samples that need handling
(culled from regexlib and other places over the past few years)
email address
^[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?$
dollars
^(\$)?(([1-9]\d{0,2}(\,\d{3})*)|([1-9]\d*)|(0))(\.\d{2})?$
uri
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$
one numeric digit
^\d$
percentage
^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$
height notation
^\d?\d'(\d|1[01])"$
numbers between 1 and 1000
^([1-9]|[1-9]\d|1000)$
credit card numbers
^((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}$
list of years
^([1-9]{1}[0-9]{3}[,]?)*([1-9]{1}[0-9]{3})$
days of the week
^(Sun|Mon|(T(ues|hurs))|Fri)(day|\.)?$|Wed(\.|nesday)?$|Sat(\.|urday)?$|T((ue?)|(hu?r?))\.?$
time on 12 hour clock
(?<Time>^(?:0?[1-9]:[0-5]|1(?=[012])\d:[0-5])\d(?:[ap]m)?)
time on 24 hour clock
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
usa phone numbers
^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$

Unfortunately, you will not be able to move your CLR function(s) to SQL Azure. You will need to either use the normal string functions (PATINDEX, CHARINDEX, LIKE, and so on) or perform these operations outside of the database.
EDIT Adding some information for the examples added to the question.
Email address
This one is always controversial because people disagree about which version of the RFC they want to support. The original didn't support apostrophes, for example (or at least people insist that it didn't - I haven't dug it up from the archives and read it myself, admittedly), and it has to be expanded quite often for new TLDs (once for 4-letter TLDs like .info, then again for 6-letter TLDs like .museum). I've often heard quite knowledgeable people state that perfect e-mail validation is impossible, and having previously worked for an e-mail service provider, I can tell you that it was a constantly moving target. But for the simplest approaches, see the question TSQL Email Validation (without regex).
One numeric digit
Probably the easiest one of the bunch:
WHERE @s LIKE '[0-9]';
Credit card numbers
Assuming you strip out dashes and spaces, which you should do in any case. Note that this isn't an actual check of the credit card number algorithm to ensure that the number itself is actually valid, just that it conforms to the general format (AmEx = 15 digits starting with a 3, the rest are 16 digits - Visa starts with a 4, MasterCard starts with a 5, Discover starts with 6 and I think there's one that starts with a 7 (though that may just be gift cards of some kind)):
WHERE @s + ' ' LIKE '[3-7]' + REPLICATE('[0-9]', 14) + '[0-9 ]';
If you want to be a little more precise at the cost of being long-winded, you can say:
WHERE (LEN(@s) = 15 AND @s LIKE '3' + REPLICATE('[0-9]', 14))
OR (LEN(@s) = 16 AND @s LIKE '[4-7]' + REPLICATE('[0-9]', 15));
USA phone numbers
Again, assuming you're going to strip out parentheses, dashes and spaces first. Pretty sure a US area code can't start with a 1; if there are other rules, I am not aware of them.
WHERE @s LIKE '[2-9]' + REPLICATE('[0-9]', 9);
-----
I'm not going to go further, because a lot of the other expressions you've defined can be extrapolated from the above. Hopefully this gives you a start. You should be able to Google for some of the others to see how other people have replicated the patterns with T-SQL. Some of them (like days of the week) can probably just be checked against a table - it seems overkill to do invasive pattern matching for a set of 7 possible values. Similarly with a list of 1000 numbers or years, these are things that will be much easier (and probably more efficient) to check if the numeric value is in a table rather than convert it to a string and see if it matches some pattern.
I'll state again that a lot of this will be much better if you can cleanse and validate the data before it gets into the database in the first place. You should strive to do this wherever possible, because without CLR, you just can't do powerful RegEx inside SQL Server.
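As one illustration of validating outside the database, the credit card and US phone checks above could be done in application code before the insert. This Python sketch mirrors the same format-only rules as the LIKE patterns; the function names are made up for the example, and it does not run any checksum on the card number.

```python
import re


def strip_separators(value: str) -> str:
    """Remove spaces, dashes, and parentheses, as the T-SQL examples assume."""
    return re.sub(r"[\s()\-]", "", value)


def looks_like_card_number(value: str) -> bool:
    """15 digits starting with 3, or 16 digits starting with 4 to 7."""
    return re.fullmatch(r"3[0-9]{14}|[4-7][0-9]{15}", strip_separators(value)) is not None


def looks_like_us_phone(value: str) -> bool:
    """Ten digits where the first digit is 2 to 9."""
    return re.fullmatch(r"[2-9][0-9]{9}", strip_separators(value)) is not None
```

Rows that fail these checks can be rejected or flagged before they reach SQL Azure, so the database only needs simple constraints.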

Ken Henderson wrote about ways to replicate RegEx without CLR, but they require sp_OA* procedures, which are even less likely to ever see the light of day in Azure than CLR. Most of the other articles you'll find online use an approach similar to Ken's or make complex use of built-in string functions.
Which portions of RegEx specifically are you trying to replicate? Can you show an example of the input/output of one of your functions? Perhaps it will be easy to convert to get similar results using the built-in string functions like PATINDEX.
