Regex in SAS - how to strip white spaces surrounding special characters?

For the address data below, would anyone have an idea of how to strip the white space around the special characters in the numeric part of these records? For the third record, for example, I want "44/95" instead of "44 / 95".
The special characters I want to do this for are "/", "-", "|" and ",".
I am guessing that a regex is the best way, but I can't think of how to do it.
data addresses1;
infile datalines ;
input #1 address $35. ;
format address $50.;
datalines;
26 32-50 CENTRE DANDENONG ROAD
9 /93-95 DANDENONG ROAD EAST
44 / 95 OUTER CRESCENT
17| 21-25 PARKHILL DRIVE
run;
I have tried something like the following code, but it did not work. Can someone point me in the right direction?
data addresses2;
set addresses1;
format fixed_address fixed_address2 $255.;
address=strip(address);
fixed_address2=compbl(strip(prxchange("s/(?<=[\|.\(\)\{\}\-\:\s\*\;\.\#\&\_\/\\]) +(?=\[\|.\(\)\{\}\-\:\s\*\;\.\#\&\_\/\\])/$1/",-1,strip(fixed_address))));
run;

I have made a regex for you that should work:
\S*( ?(?![/|,-])).*(?<![[/|,-])
It selects zero or more non-whitespace characters, then a space that is not followed by any of your characters, then zero or more of any character, making sure the previous character is not one of your characters. It's not elegant, and you will have to strip empty matches.
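For comparison, a more direct route is a single PRXCHANGE substitution that captures the special character and drops any whitespace on either side of it. This is only a minimal sketch, not the method from the answer above: it assumes the addresses1 data set from the question, and the names addresses_fixed and fixed_address are made up for illustration.

```sas
data addresses_fixed;                /* hypothetical output data set name */
  set addresses1;
  length fixed_address $50;          /* hypothetical variable; give the result an explicit length */
  /* \s* on both sides swallows the surrounding blanks;  */
  /* $1 writes back only the captured / - | or ,         */
  fixed_address = prxchange('s/\s*([\/\-|,])\s*/$1/', -1, address);
  put address= fixed_address=;
run;
```

With this pattern "44 / 95 OUTER CRESCENT" becomes "44/95 OUTER CRESCENT". Note that it acts on every occurrence in the string, not just the numeric part, so a hyphen or comma with spaces around it inside a street name would be tightened as well.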

Related

Formatting character 'decimals' (comma delimiter) AND character 'integers' to numeric 'decimals' (point delimiter)

This is somewhat related to my other question recently.
Setup: I am reading in character variables of the sort 1 or 2,0 or 10,0 or 2,5. I want to convert them to numerics that use a decimal point instead of a comma.
So ideally I would like to get the following result:
1 -> 1
2,0 -> 2
10,0 -> 10
2,5 -> 2.5
My code
data _null_;
test='5,0';
result=input(test,comma10.1);
put 'this should be:' result;
run;
does this for all character variables which are of the type 'xy,z' but fails for 'xy' with no comma separation at all. Here I would get
xy -> x,y
I was thinking to add an if/else to check whether the character string has length of 1 or bigger. So something like
data _null_;
test='5';
if length(test)=1 then result=input(test, comma10.);
else result=input(test, comma10.1);
put 'this should be:' result;
run;
But the problem here would be that
10 -> 1
Problems with values like 10,00 (which is supposed to be 10) becoming 100 could probably be resolved by substituting the ',' with '.', but the character values with no decimal delimiter remain a problem.
Is there any clever solution to this?
My solution which is a bit hacky (and basically only uses the fact that the comma introduces a length>2 - problems with e.g. 123 would still arise):
data _null_;
t='5,5';
test=tranwrd(t, ',', '.');
if length(test)=1 or length(test)=2 then result=input(test, comma10.);
else result=input(test, comma10.1);
put 'this should be:' result;
run;
Sounds like your text strings were created in a place where the normal meaning of comma and period in numbers is reversed. So instead of using a period for decimal point and comma for thousand grouping they have reversed the meaning.
For that type of strings SAS has the COMMAX informat.
Normally you do NOT want to add a decimal specification to your informat. The decimal part of the informat is only used when the source string does not have an explicit decimal point. Basically it is telling SAS to divide values without an explicit decimal point by 10 to the power of the number of decimal places in the informat specification. It is designed to read data where the decimal point was purposely not written in order to save space.
Pretty much all the COMMA informat does is strip the string of commas and dollar signs and then read it using the normal numeric informat.
The COMMAX informat is the one that will understand the reversed meaning of the commas and periods. So it pretty much eliminates the periods and then converts the commas to periods and then reads it using the normal numeric informat.
Try a little test of your own.
data check;
input #1 string $32. #1 num ??32. #1 comma ??comma32. #1 commax ??commax32.
#1 d2num ??32.2 #1 d2comma ??comma32.2 #1 d2commax ??commax32.2
;
cards;
123
123.4
123,4
1,234.5
1.234,5
;
proc print;
run;
As it turns out, the COMMAXw.d informat does the trick without any hassle. The code would then be:
data _null_;
test='0,5';
result = input(test, COMMAX10.);
put 'this should be:' result;
run;
I find it a bit anti-intuitive, but it works.
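To check this against all four sample values from the question at once, a small loop can help. This is a minimal sketch; the loop and the variable names are only illustrative, and the informat is used without a decimal specification, as recommended above.

```sas
data _null_;
  length test $10;
  /* the four sample values from the question */
  do test = '1', '2,0', '10,0', '2,5';
    result = input(test, commax10.);   /* comma is read as the decimal separator */
    put test= result=;
  end;
run;
```

The log should show 1, 2, 10 and 2.5, which matches the mapping the question asks for.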

sas, regex, numbers, substring, prxchange

I need help with the below code. I do not see how this is extracting the number from this address line text. When it (the pattern) says s/\D/ / I thought this replaces the digits with a space. I know the second part here is taking the substring up to the first space in the address line text. But, then I do not see how this is extracting the numbers. I pulled up the data set and it looks like this does work. Please help me understand how this is working.
DATA OUT.REQ_1_2_03;
SET OUT.REQ_1_2_02;
/* GET STREET NUMBER*/
PRE_RCV_ST_NB=PRXCHANGE('s/\D/ /',-1,SUBSTR(PRE_RCV_ADDRESSS_LINE_1,1,PRXMATCH('/\s/',PRE_RCV_ADDRESSS_LINE_1)));
POST_RCV_ST_NB=PRXCHANGE('s/\D/ /',-1,SUBSTR(POST_RCV_ADDRESSS_LINE_1,1,PRXMATCH('/\s/',POST_RCV_ADDRESSS_LINE_1)));
PRE_HOST_ST_NB=PRXCHANGE('s/\D/ /',-1,SUBSTR(PRE_HOST_ADDR_LINE_1,1,PRXMATCH('/\s/',PRE_HOST_ADDR_LINE_1)));
POST_HOST_ST_NB=PRXCHANGE('s/\D/ /',-1,SUBSTR(POST_HOST_ADDR_LINE_1,1,PRXMATCH('/\s/',POST_HOST_ADDR_LINE_1)));
RUN;
Try to understand it using an example:
PRE_RCV_ADDRESSS_LINE_1 = "123hello Village st"
Work through the code one function at a time.
First, PRXMATCH finds the position of the first whitespace character (\s), which comes right after 123hello.
SUBSTR up to that position then gives you 123hello.
Then PRXCHANGE replaces every \D (that is, anything other than a digit; the lowercase \d is the one that matches digits), so the value
is converted to 123.
To sum it up by example:
"123hello Village st" -- PRXMATCH finds the space (\s), and the substring up to that space gives "123hello"
"123hello" is changed to "123" by PRXCHANGE, which replaces anything other than a digit (\D).
/* run this step to understand it better*/
data want ;
PRE_RCV_ADDRESSS_LINE_1 = "123hello Village st";
test1= SUBSTR(PRE_RCV_ADDRESSS_LINE_1,1,PRXMATCH('/\s/',PRE_RCV_ADDRESSS_LINE_1));
PRE_RCV_ST_NB= PRXCHANGE('s/\D//',-1,SUBSTR(PRE_RCV_ADDRESSS_LINE_1,1,PRXMATCH('/\s/',PRE_RCV_ADDRESSS_LINE_1)));
run;
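One detail worth noticing: the code in the question replaces each non-digit with a blank ('s/\D/ /'), while the demo step above deletes it ('s/\D//'). When the digits come first, the two look the same, because the blanks end up as trailing padding. They differ when a non-digit comes before the digits. The sketch below makes that visible by printing the first four bytes in hex; the sample value and variable names are hypothetical.

```sas
data _null_;
  length with_blank no_blank $20;      /* hypothetical variable names */
  first_word = 'A12B';                 /* hypothetical sample value */
  with_blank = prxchange('s/\D/ /', -1, first_word);  /* each non-digit becomes a blank */
  no_blank   = prxchange('s/\D//',  -1, first_word);  /* each non-digit is deleted */
  /* $hex8. shows the first four bytes: 20 = blank, 31 = '1', 32 = '2' */
  put with_blank= $hex8. no_blank= $hex8.;
run;
```

Here with_blank starts with a blank (20313220) while no_blank starts with the digits (31322020), so the deleting form is the safer one if the street number might be preceded by a letter.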

Rearrange text on SAS

I can not find the way to reverse text strings.
For example I want to reverse these:
MMMM121231M34 to become 43M132121MMMM
MM1M11M1 to become 1M11M1MM
1111213111 to become 1113121111
Judging from your examples, what you mean by 'rearrange' is actually 'reverse'.
In that case, you've got the very handy reverse() function in SAS.
Used in context:
data test;
length text $32;
infile datalines;
input text $;
result=reverse(strip(text));
datalines;
MMMM121231M34
MM1M11M1
1111213111
;
run;
EDIT on @Joe's request: in the particular example above, I create the test dataset by setting a length of 32 characters for the text variable. Therefore, when reading the values from datalines, these are padded with blanks up to that total of 32 characters. Hence, when reversing that value, the result has that many blanks at the start, followed by the actual value you are looking for. By adding the strip function, you remove the excess blanks from the value of text before reversing, keeping only the "real" value in the result.

Removing quotes and spaces in SAS dataset

I am working in SAS EG and DI, facing a very peculiar problem.
When I look into a column of a dataset in SAS DI Studio or EG, it is appearing fine. But when I paste the data into notepad, some quotes and spaces are appearing.
The data which I am seeing in EG:
But the same data when copied into Notepad,
extra quotes and spaces are appearing like this(in 6th row):
I found this problem when I am using this field as a key in a join, the other related column values for 6th row are not going to the output as the match is failing for that 6th record.
I tried many things like tranwrd,dequote and compress but none of them is changing my result.
Can someone please help in understanding what the problem is and how can this be solved.
Take a look at what is in the column so that you can decide how to handle it. This query will show you both the character string and the Hexadecimal representation of the string.
proc sql;
select postcode,put(trim(postcode),$hex.) as hexcode,count(*) as nobs
from x
group by 1,2
;
quit;
So if you see hex characters like 0A, 0D, A0, 08 or other non-printable codes then you can figure out what is happening.
So you might see that you have POSTCODE='LS5 3BT' with HEXCODE='4C533520334254' for most of the records. But perhaps have some that look like the POSTCODE='LS5 3BT', but the value of HEXCODE is something like '0A4C533520334254' which would mean that you have a linefeed character at the beginning of the string. Or perhaps instead of space ('20'X) you have a tab ('09'X) in the middle of the string.
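Once the hex output shows which stray bytes are present, they can be removed or replaced before the join. The following is a minimal sketch under some assumptions: the output data set name x_clean is made up, the list of bytes should be adjusted to whatever the hex output actually shows, and it assumes a single-byte session encoding (in a UTF-8 session the byte A0 can be part of a multi-byte character and should not be removed blindly).

```sas
data x_clean;                          /* hypothetical output data set name */
  set x;
  /* remove linefeed, carriage return, backspace and non-breaking space */
  postcode = compress(postcode, '0A0D08A0'x);
  /* turn an embedded tab into an ordinary space */
  postcode = translate(postcode, ' ', '09'x);
run;
```

Rerunning the PROC SQL query above against the cleaned data set is a quick way to confirm that only the expected hex codes remain.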

How to modify string to a character value in which each character of the string is separated by a comma?

I came across this question this morning and I am still trying to figure out how it can be done. The following dataset is present and has a character variable CAT.
CAT
A
AB
B
ABCD
CB
.
.
.
and so on.
We need to write a SAS program to introduce commas in between the characters of the string if the length of the string is more than 1. I used the length() function and a do loop to create different variables, and it just got messy. How do I tackle this?
Regular expression solution:
data have;
input CAT $;
datalines;
A
AB
B
ABCD
CB
;;;;
run;
data want;
set have;
cat_c = prxchange('s/(?<=[[:alpha:]])([[:alpha:]])/,$1/io',-1,CAT);
put cat_c=;
run;
The first parenthetical group is a look-behind for an alpha character; then the captured alpha character. Then replace with comma and character. If you want something other than [[:alpha:]] (ie, A-Z) then supply that as a class.
The solution using length and do loop isn't bad, honestly, if you want something that is more readable to novice programmers. Just use SUBSTR left of the equal sign.
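data want2;
set have;
length cat_c $15; /* CAT is $8, so at most 8 letters plus 7 commas */
if length(cat) > 1 then
do _t = 1 to length(cat)-1;
/* write each letter except the last, followed by a comma */
substr(cat_c,2*_t-1,2)=substr(cat,_t,1)||',';
end;
/* write the last letter without a trailing comma */
substr(cat_c,2*length(cat)-1,1)=substr(cat,length(cat),1);
put cat_c=;
run;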