SAS address formatting - sas

How would I capitalize the first letter of each word, leave only one space between the words of each address, and insert a comma and one space between the city and state variables in SAS?
For example: 200 Grass GROVE Lane SF CA

Related

Is there a method for dividing an Address string into 3 separate strings using regex

I am currently working on a project that requires me to divide an address into its street number, its street name, and if it has a suite, into its suite name.
EX: 1360 WHITE OAK RD STE F -----> 1360 | White Oak RD | STE F
I am currently using Google Sheets and its =REGEXEXTRACT() function, which uses regex to parse the string into different columns. This is how I am currently splitting out the number and the street (given that the full address is in column B):
=ArrayFormula(REGEXEXTRACT(B1:B,"[0-9]*")) ---->gets the number EX:(1360)
=ArrayFormula(REGEXEXTRACT(B1:B," [a-zA-Z0-9 ]+")) ---->gets the street address including the suite number with a white space at the beginning EX:( WHITE OAK RD STE F)
What I am struggling with is how to remove the white space from the 2nd formula and also prevent it from capturing the suite text (which always starts with STE). Lastly, what would be a formula for grabbing the suite text and number?
Thanks and I appreciate any help you can give!
The formula provided by MonkeyZeus works perfectly giving no issues whatsoever.
If you want your results in adjacent columns, though, you can use a single formula on every row, like
=SPLIT(REGEXREPLACE(B1,"([0-9]+) (.+) (STE.*)","$1♣︎$2♣︎$3"),"♣︎")
Or even use an Arrayformula to get your results for an entire column
=ArrayFormula(IFERROR(SPLIT(REGEXREPLACE(B1:B,"([0-9]+) (.+) (STE.*)","$1♣︎$2♣︎$3"),"♣︎")))
What the formula does
Using parentheses (), we divide the text into 3 groups: $1, $2, $3.
With $1♣︎$2♣︎$3 we insert the character ♣︎ between the groups (it could be any character that does not interfere with the formula) to prepare our text for the SPLIT function.
We then split the grouped text into adjacent columns wherever ♣︎ is found.
The Arrayformula applies all the above to every single row in column B while IFERROR makes sure we don't get any errors (like when empty cells are found).
Functions used:
ArrayFormula
IFERROR
SPLIT
REGEXREPLACE
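For readers who want to try the same mark-then-split idea outside of Sheets, here is a minimal Python sketch. It reuses the pattern from the formula above; the function name split_address and the default separator are my own choices for illustration, not part of the original answer.

```python
import re

# Same pattern as the sheet formula: number, street, then the suite
# (which is assumed to always start with STE).
ADDRESS_RE = re.compile(r"([0-9]+) (.+) (STE.*)")


def split_address(address, sep="\u2663"):
    """Mark the three captured groups with a separator, then split on it."""
    marked = ADDRESS_RE.sub(rf"\1{sep}\2{sep}\3", address)
    return marked.split(sep)


print(split_address("1360 WHITE OAK RD STE F"))
```

If the pattern does not match (for example, an address without a suite), the string is returned unchanged as a one-element list, which plays roughly the role IFERROR plays in the sheet.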
For Google Sheets you could use the following 3 formulas:
=REGEXEXTRACT(B1,"^[0-9]*")
=REGEXREPLACE(B1,"^[0-9\s]*|\s*STE.*$", "")
=REGEXEXTRACT(B1,"STE.*$")
I would have used lookbehinds but they are not universally supported in all browsers (yet).
I'm not a Google Sheets expert so I've opted to remove ArrayFormula and replace the B1:B with just B1 since they seemed superfluous.
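As a rough cross-check, the three patterns above can be exercised in Python, whose re module accepts them unchanged. This is only a sketch: the function names are hypothetical, and returning an empty string when no suite is present is my assumption.

```python
import re


def street_number(address):
    # ^[0-9]* always matches, possibly as an empty string.
    return re.search(r"^[0-9]*", address).group(0)


def street_name(address):
    # Remove the leading number/whitespace and the trailing suite part.
    return re.sub(r"^[0-9\s]*|\s*STE.*$", "", address)


def suite(address):
    match = re.search(r"STE.*$", address)
    return match.group(0) if match else ""


sample = "1360 WHITE OAK RD STE F"
print(street_number(sample), "|", street_name(sample), "|", suite(sample))
```

One thing the sketch makes easy to see: a street name that itself contains the letters STE would be cut short, so the patterns rely on STE appearing only as the suite marker.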

Open Refine regex for alphabets

I want to keep only the alphabetic characters from my cell.
What I have done:
value.match(/.*?(\^[a-zA-Z]*$).*?/)
but it returns null.
I am trying to clean the address column in my data set. The following are sample addresses:
H3656 GALI#4 BLOCK-D, AREA 1
H#36/17 SECTOR 5D AREA 2
AREA 3 BLOCK-B NORTH NAZIMABAD
GERMANY AL JANNAT BENQUET SECTOR 16 Area 2 with short name
So I am first trying to remove all numbers from my string.
If you want to remove all the numbers, the most direct approach is probably:
value.replace(/\d+/, "")
If for any reason you want to find only the alphabetic characters, as indicated by the title of your question, this will be more effective than value.match():
value.find(/\p{L}\s?/).join("")
(\p{L} is a Java regular expression class (OpenRefine is written in Java) similar to [a-zA-Z], but it also matches Unicode letters such as accented characters.)
In general, you should avoid using the .match() method unless you know exactly what you are doing. In 90% of cases, it is actually .find() that is desired.
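To see what the two expressions produce on one of the sample addresses, here is a small Python sketch of the same idea. Python's built-in re module has no \p{L}, so the sketch substitutes [^\W\d_], which only approximates it; the function names are made up for the example.

```python
import re


def remove_digits(value):
    # Counterpart of value.replace(/\d+/, ""): drop every run of digits.
    return re.sub(r"\d+", "", value)


def letters_only(value):
    # Counterpart of value.find(/\p{L}\s?/).join(""): keep each letter
    # plus one optional whitespace character right after it.
    # [^\W\d_] is a stand-in for \p{L} (an approximation, not identical).
    return "".join(re.findall(r"[^\W\d_]\s?", value))


sample = "H#36/17 SECTOR 5D AREA 2"
print(repr(remove_digits(sample)))
print(repr(letters_only(sample)))
```

Note that letters_only also drops punctuation such as # and /, while remove_digits leaves it in place.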

Parsing a String in SSIS or C#

I have one string without any delimiter and I want to parse it. Is this possible in SSIS or C#?
For example, I have address info in a single column, but I want to split/parse it into multiple columns such as house number, road number, road name, road type, locality name, state code, post code, country, etc.
12/38 Meacher Street Mount Druitt NSW 2770 Australia -- In this case, house number: 12, road number: 38, road name: Meacher, road type: Street, locality: Mount Druitt, state: NSW, post code: 2770
I have all this info in a single column, so how can I parse it and split it into multiple columns? I know that splitting on spaces will not work: some road names contain more than one space, so the pieces would end up in the wrong columns.
Any suggestion would be appreciated.
Thanks.
Please remember that the country can also have spaces in it and some countries use alphanumerical post codes.
If all addresses are in Australia and in the same format of (...), state, postcode, Australia then you can split it into
StreetAddress, State, PostCode
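A minimal sketch of that split, written in Python for brevity (the same regular expression idea carries over to C#'s Regex class). It assumes every address ends with an Australian state abbreviation, a 4-digit postcode, and the word Australia; the function and group names are hypothetical.

```python
import re

# Assumption: "<street address> <STATE> <4-digit postcode> Australia".
AU_ADDRESS_RE = re.compile(
    r"^(?P<street>.+?)\s+"
    r"(?P<state>NSW|VIC|QLD|SA|WA|TAS|NT|ACT)\s+"
    r"(?P<postcode>\d{4})\s+Australia$"
)


def split_au_address(address):
    """Return street/state/postcode as a dict, or None if the format differs."""
    match = AU_ADDRESS_RE.match(address.strip())
    return match.groupdict() if match else None


print(split_au_address("12/38 Meacher Street Mount Druitt NSW 2770 Australia"))
```

Anchoring on the fixed tail of the string avoids guessing where the road name ends and the locality begins, which is the part that space splitting gets wrong.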
You could also use one of the online APIs to look up an address; you then get the individual elements.
The best solution is to keep it together - why split it?

SAS while reading varbinary data from Amazon RDS is appending spaces at the end of the data. Can we avoid it?

SAS while reading varbinary data from Amazon RDS is appending spaces at the end of the data.
proc sql;
select emailaddr from tablename1;
quit;
The column emailaddr is varbinary(20)
For example:
I inserted "XX#WWW.com ", but while reading from the db, it is appending spaces up to the length of the column.
Since the column length is 20 it is returning "XX#WWW.com " (note the spaces appended). I cannot use the trim() function since this also removes spaces that might genuinely be part of the original inserted data.
How can I stop SAS from appending these spaces?
For my program I need to get the exact data as present in the database without any extra spaces attached.
That's how SAS works; SAS has only a CHAR-equivalent datatype (in base SAS, anyway; DS2 is different) and no VARCHAR concept. Whatever the length of the column is (20 here), the value will have 20 total characters, padded with trailing spaces.
Most of the time, it doesn't matter; when SAS inserts into another RDBMS for example it will typically treat trailing spaces as nonexistent (so they won't be inserted). You can use TRIM and similar to deal with the spaces if you're using regular expressions or concatenation to work with these values; CATS and similar functions perform concatenation-with-trimming.
If trailing spaces are part of your data, you are mostly out of luck in SAS. SAS considers trailing spaces irrelevant (equivalent to null characters). You can append a non-space character in SQL, or translate the spaces to NBSPs ('A0'x) or something else, while still in SQL, or use quotes or something around your actual values - but whatever you do will be complicated.
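To illustrate the "append a non-space character" idea without a database, here is a small Python simulation. It is not SAS code: pad_fixed only mimics the fixed-width padding described above, and the sentinel character "|" and all function names are assumptions made for the example.

```python
WIDTH = 20  # column length from the question


def pad_fixed(value, width=WIDTH):
    # Mimics a fixed-width CHAR column: pad with trailing spaces.
    return value.ljust(width)


def add_sentinel(value, sentinel="|"):
    # Would be done on the SQL side, before the value reaches SAS.
    return value + sentinel


def strip_sentinel(padded, sentinel="|"):
    # Drop the padding, then drop the sentinel; genuine spaces survive.
    trimmed = padded.rstrip(" ")
    if trimmed.endswith(sentinel):
        return trimmed[: -len(sentinel)]
    return trimmed


original = "XX#WWW.com  "  # two genuine trailing spaces
print(repr(pad_fixed(original).rstrip(" ")))
print(repr(strip_sentinel(pad_fixed(add_sentinel(original)))))
```

The sentinel takes one extra character, so the receiving column has to be at least one character wider than the longest stored value.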

Cannot delete non-space whitespace character in excel

When bringing data into Excel via whatever method (import, paste, ...) I sometimes get the following issue. At the beginning of the cell there is an extra space in front of the text. Now I know the usual procedures to handle this, namely:
=TRIM(cell number)
and, if it's not a space character:
=TRIM(SUBSTITUTE(cell number,CHAR(160),CHAR(32)))
But this time both of these didn't work. I did try other substitute CHAR's.
AND the character at the beginning is just plain weird. When I go to the very beginning of the cell and try to delete it I must hit the delete key twice to remove one space! But when I go to the first character in the cell and instead hit backspace I only need to press it once.
What else can I do to eliminate this weird non-space whitespace character?
If cell A1 contains non-visible junk characters, you must identify them before you can remove them.
Pick some cell and enter:
=IFERROR(CODE(MID($A$1,ROWS($1:1),1)),"")
and copy down. This will give you the CHAR code for each character in A1
Then you can use SUBSTITUTE() to remove the offender.
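The same identify-then-remove approach can be sketched in Python, which can be handy if the data passes through a script before it reaches Excel. The sample string with a leading em space (code 8195) is hypothetical, and ord() reports Unicode code points, which may differ from what CODE() shows for characters outside the ANSI range.

```python
def char_codes(text):
    """List each character together with its numeric code point."""
    return [(ch, ord(ch)) for ch in text]


def remove_code(text, code):
    """Remove every occurrence of the character with the given code."""
    return text.replace(chr(code), "")


sample = "\u2003Hello"  # hypothetical cell content with a leading em space
for ch, code in char_codes(sample):
    print(repr(ch), code)

print(repr(remove_code(sample, 8195)))
```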
Let's assume column A has text where some cells are good and some have text with the weird space-like character at the front. So some cells we want to change and some we don't.
1) Create a one column table with one letter in each cell. I decided to go over to the right to column H for the table. So for example cell H1 has A, cell H2 has B and so on.
2) Get the length of the cell we want to edit. I've put this formula in cell B1.
=LEN(A1)
3) Test the cell for the first letter. This gives us which cell to change and which not. I've put this formula in cell C1.
=ISNA(VLOOKUP(LEFT(A1),$H$1:$H$26,1,0))
4) Change (or not depending on step 3) using RIGHT and the result from LEN.
=IF(C1,RIGHT(A1,B1-2),A1)
Notice that I have to subtract 2 spaces and not one? Like I said it was a strange character.
5) Repeat down the column.
If the first legitimate character in your string will be in the set [A-Za-z0-9] then you could use this formula:
=MID(A1,MIN(SEARCH({"a";"b";"c";"d";"e";"f";"g";"h";"i";"j";"k";"l";"m";"n";"o";"p";"q";"r";"s";"t";"u";"v";"w";"x";"y";"z";0;1;2;3;4;5;6;7;8;9},A1&"abcdefghijklmnopqrstuvwxyz1234567890")),99)
where 99 is longer than the longest string might be. If there are other legitimate starting characters, then add them to both the array constant and the string at the end.
If you might need to remove trailing spaces (char(32)), you can enclose the above in a TRIM function.
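For comparison, the "start at the first legitimate character" logic is short to express with a regular expression. This Python sketch assumes, like the formula, that the first legitimate character is in [A-Za-z0-9]; the function name is made up.

```python
import re


def strip_leading_junk(text):
    """Drop everything before the first ASCII letter or digit."""
    match = re.search(r"[A-Za-z0-9]", text)
    # Like the MID formula, return an empty string when nothing qualifies.
    return text[match.start():] if match else ""


print(repr(strip_leading_junk("\u2003\u200bHello 1")))
```

Other legitimate starting characters can be added to the character class, and .rstrip(" ") on the result covers the trailing-space case.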