SAS PRXPARSE usage to retrieve length of string matching pattern - regex

I have to retrieve the X-axis and Y-axis positions from the ADDITIONAL_DETAILS field, which is more than 300 bytes long.
Somewhere in this string, the location details appear in a form like RETLOCID=2312.4892.
I am trying to use Perl regular expressions in SAS.
Problem: I am able to get the starting position into postn1 from call prxsubstr(MATCH_PATTERN1, ADDITIONAL_DETAILS, postn1,length1); but the length is always returned as 8, even though the value I need is longer than that.
TRANSACTION_ID = substrn(ADDITIONAL_DETAILS, postn1, length1); This does not give me the proper value because the length is restricted to 8. Any help is appreciated. Below is the code:
DATA WORK.LOCATION;
INFILE DATALINES;
INPUT ADDITIONAL_DETAILS $50.;
datalines;
afdsf RFTXNID=121.5435 xx
fdsg RFTXNID=7821.5487 xx fdsg
gfdgf
;
RUN;
data WORK.POSITION;
set WORK.POSITION;
if _N_ = 1 then do;
MATCH_PATTERN1 = PRXPARSE("/(RETLOCID=)/");
MATCH_PATTERN2 = PRXPARSE("/([0-9]{1,}\.[0-9]{1,})/");
end;
retain MATCH_PATTERN1 MATCH_PATTERN2;
call prxsubstr(MATCH_PATTERN1, ADDITIONAL_DETAILS, postn1,length1);
call prxsubstr(MATCH_PATTERN2, ADDITIONAL_DETAILS, postn2,length2);
if postn1 > 0 and not missing(ADDITIONAL_DETAILS) then
TRANSACTION_ID = substrn(ADDITIONAL_DETAILS, postn1 + 8, length1);
RUN;
data work.POSITION;
set work.POSITION;
drop MATCH_PATTERN1 postn1 length1;
run;
I need to pull out 121.5435 and 7821.5487.

Try this:
DATA WORK.LOCATION;
INPUT ADDITIONAL_DETAILS $50.;
string=prxchange('s/[a-z=_]+//i',-1,ADDITIONAL_DETAILS);
datalines;
afdsf RFTXNID=121.5435 xx
fdsg RFTXNID=7821.5487 xx fdsg
DISTR_QUOTE=66.92
gfdgf
;
run;
Or
DATA WORK.LOCATION;
INPUT ADDITIONAL_DETAILS $50.;
length string $20.;
if prxmatch('/\=/',ADDITIONAL_DETAILS)=0 then string='';
else string=prxchange('s/.*(?<=\=)([^a-z]+).*/$1/i',-1,ADDITIONAL_DETAILS);
datalines;
afdsf RFTXNID=121.5435 xx
fdsg RFTXNID=7821.5487 xx fdsg
gfdgf
DISTR_QUOTE=66.92
;
proc print;
run;
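As for why the length is always 8: CALL PRXSUBSTR returns the position and length of the text matched by the whole pattern, and a pattern such as /(RFTXNID=)/ matches only those eight characters, never the number that follows. A minimal sketch of an alternative is to put the number in a capture group and read it back with PRXPOSN. It assumes the key is RFTXNID= as in the sample data (the question text mentions RETLOCID=, so adjust the literal as needed); the names re and transaction_id are illustrative only.

```sas
data work.location;
   input additional_details $50.;
   length transaction_id $20;
   retain re;
   /* compile the pattern once; group 1 captures the digits after the key */
   if _n_ = 1 then re = prxparse('/RFTXNID=(\d+\.\d+)/');
   /* PRXMATCH returns 0 when there is no match, so transaction_id stays blank */
   if prxmatch(re, additional_details) then
      transaction_id = prxposn(re, 1, additional_details);
   drop re;
datalines;
afdsf RFTXNID=121.5435 xx
fdsg RFTXNID=7821.5487 xx fdsg
gfdgf
;
run;
proc print data=work.location;
run;
```

With the sample rows this should leave transaction_id as 121.5435, 7821.5487 and blank, without any separate SUBSTRN arithmetic.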

Concatenating a variable dynamically in SAS

I want to create a variable that resolves to the character immediately before a specified character (*) in a string. When the specified character appears several times in the string (as in the example below), how can I build one variable that concatenates all of the preceding characters, separated by commas?
Example:
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
;
run;
Code:
data want;
set have;
cnt = count(string, "*");
_startpos = 0;
do i=0 to cnt until(_startpos=0);
before = catx(",",substr(string, find(string, "*", _startpos+1)-1,1));
end;
drop i _startpos;
run;
That code outputs before=C for both the first and second observations. However, I want before=C,E for the first observation and before=C,W,d for the second.
You can use a Perl regular expression replacement pattern to transform the original string.
Example:
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
;
data want;
set have;
csl = prxchange('s/([^*]*?)([^*])\*/$2,/',-1,string); /* comma separated letters */
csl = prxchange('s/, *$//',1,csl); /* remove trailing comma */
run;
Make sure to increment _STARTPOS so your loop will finish. You can use CATX() to add the commas, but it needs the existing value of BEFORE as one of its arguments so that each new character is appended. Simplify selecting the character by using CHAR() instead of SUBSTR(). Also make sure to tell the data step how to define the new variable (with a LENGTH statement) instead of forcing it to guess. I also included a test to handle the situation where * is in the first position.
data have;
input string $20.;
datalines;
ABC*EDE*
EFCC*W*d*
*XXXX*
asdf
;
data want;
set have;
length before $20 ;
_startpos = 0;
do cnt=0 to length(string) until(_startpos=0);
_startpos = find(string,'*',_startpos+1);
if _startpos>1 then before = catx(',',before,char(string,_startpos-1));
end;
cnt=cnt-(string=:'*');
drop _startpos;
run;
Results:
Obs string before cnt
1 ABC*EDE* C,E 2
2 EFCC*W*d* C,W,d 3
3 *XXXX* X 1
4 asdf 0
CALL SCAN is also a good choice for getting the position of each *.
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
****
asdf
;
data want;
length before $20.;
set have;
do i = 1 to count(string,'*');
call scan(string,i,pos,len,'*');
before = catx(',',before,substrn(string,pos+len-1,1));
end;
put _n_ = +7 before=;
run;
Result:
_N_=1 before=C,E
_N_=2 before=C,W,d
_N_=3 before=
_N_=4 before=

SAS: How can I pad a character variable with zeroes while reading in from csv

Most of my data is read in from a fixed-width format, such as fixedwidth.txt:
00012000ABC
0044500DEFG
345340000HI
00234000JKL
06453MNOPQR
Where the first 5 characters are colA and the next six are colB. The code to read this in looks something like:
infile "&path.fixedwidth.txt" lrecl = 397 missover;
input colA $5.
colB $6.
;
label colA = 'column A '
colB = 'column B '
;
run;
However some of my data is coming from elsewhere and is formatted as a csv without the leading zeroes, i.e. example.csv:
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
As the csv data is being added to the existing data read in from the fixed width file, I want to match the formatting exactly.
The code I've got so far for reading in example.csv is:
data work.example;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile "&path./example.csv" delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat colA $5.;
informat colB $6.;
format colA z5.; *;
format colB z6.; *;
input
colA $
colB $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
But the formats z5. & z6. only work on columns formatted as numeric so this isn't working and gives this output:
ColA colB
12 ABC
445 DEFG
34534 HI
234 JKL
6453 MNOPQR
When I want:
ColA colB
00012 000ABC
00445 00DEFG
34534 0000HI
00234 000JKL
06453 MNOPQR
With both columns formatted as characters.
Ideally I'd like to find a way to get the output I need using only formats & informats to keep the code easy to follow (I have a lot of columns to keep track of!).
Grateful for any suggestions!
You can use CATS to force the csv columns to character without knowing what types the csv import determined they were. Right-justify the result to the expected or needed variable length and translate the padding spaces to zeroes.
For example:
data have;
length a 8 b $7; * dang csv data, someone entered 7 chars for colB;
a = 12; b = "MNQ"; output;
a = 123456; b = "ABCDEFG"; output;
run;
data want;
set have (rename=(a=csvA b=csvB));
length a $5 b $6;
* may transfer, truncate or convert, based on length and type of csv variables;
* substr used to prevent blank results when cats (number) is too long;
* instead, the number will be truncated;
a = substr(cats(csvA),1);
b = substr(cats(csvB),1);
a = translate(right(a),'0',' ');
b = translate(right(b),'0',' ');
run;
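Another option is to use SUBSTR on the left side of the assignment: initialize each variable to zeroes, then overwrite its rightmost characters with the value that was read.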
data test;
infile cards firstobs=2 dsd;
length cola $5 colb $6;
cola = '00000';
colb = '000000';
input (a b)($);
substr(cola,vlength(cola)-length(a)+1)=a;
substr(colb,vlength(colb)-length(b)+1)=b;
cards;
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
;;;;
run;
proc print;
run;

specifying data informat using do loops in sas

I have a large data file with data in the following format: country, datatype, year1month1 to year2018month7.
Reading the data using PROC IMPORT did not work for all data fields, so I ended up modifying the generated SAS data step code to ensure the data formats were correct.
However, I am having trouble simplifying the code. I would like a DO loop to go through all the years and months. That way I could use the current date to work out the range of dates for the file, and the code that creates each year/month variable would not have to be repeated 100 times in the file.
data test;
infile 'abc.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat Country_Name $34. ;
do i = 1940 to 2018;
do j = 1 to 12;
informat _(i)M(j) best32.;
end;
end;
informat Base_Year $1. ;
format Country_Name $34. ;
do i = 1940 to 2018;
do j = 1 to 12;
format _(i)M(j) best12.;
end;
end;
format Base_Year $1. ;
input
Country_Name $
do i = 1940 to 2018;
do j = 1 to 12;
_(i)M(j) $;
end;
end;
Base_Year $;
run;
There are a few approaches here that could work. The most directly translatable to your approach is to use the macro language.
You need to translate those two loops to something like this:
%do i = 1940 %to 2018;
%do j = 1 %to 12;
informat _&i.M&j. best32.;
%end;
%end;
Notice the % there. This also has to be in a macro; you can't do this in normal datastep code.
I would rewrite it to use a macro like so:
%macro make_ym(startyear=, endyear=, separator=);
%local i j;
%do i = &startyear. %to &endyear.;
%do j = 1 %to 12;
_&i.&separator.&j.
%end;
%end;
%mend make_ym;
data test;
infile 'abc.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat Country_Name $34. ;
informat %make_ym(startyear=1940,endyear=2018,separator=M) best32.;
informat Base_Year $1. ;
format %make_ym(startyear=1940,endyear=2018,separator=M) best12.;
format Base_Year $1. ;
input
Country_Name $
%make_ym(startyear=1940,endyear=2018,separator=M)
Base_Year $;
run;
I took out the $ after the yMm bits in the input since you declared them as numeric.
Don't model your data step after the code generated by PROC IMPORT. It does a lot of useless things, like attaching formats and informats to variables that don't need them.
For your problem you just need a simple program like this:
data test;
infile 'abc.csv' dsd dlm= ',' truncover firstobs=2 ;
input Country_Name :$34. Y1940M01 .... Y2018M08 Base_Year :$1. ;
run;
Now the only tricky part is building that list of numerical variables. If the list is small enough you could just put it into a macro variable. Fortunately that is not a problem in this case since using 8 character names (YyyyyMmm) there is room for over 300 years worth in a data step character variable. A variable of length 10,800 bytes should have room for 100 years of month names.
So just run this data step first.
data _null_;
length names $10800 ;
basedate = mdy(1,1,1940);
lastdate = today();
do i=0 to intck('month',basedate,lastdate);
date=intnx('month',basedate,i);
names=catx(' ',names,cats('Y',year(date),'M',put(month(date),Z2.)));
end;
call symputx('names',names);
run;
Now you can use the macro variable in your INPUT statement.
data test;
infile 'abc.csv' dsd dlm= ',' truncover firstobs=2 ;
input Country_Name :$34. &names Base_Year :$1. ;
run;

SAS programming : Convert Exponential value to numeric value in SAS

How can I convert values in exponential notation (e.g. 3.22254e2, 3.24456545e-3) to plain numeric values (322.254, 0.00324456545) in SAS? I am getting the source as varchar from a file and need to store it in Oracle as a number.
I need to read the values from a csv file, and when I try the conversion the result (b) comes back as null.
My code:
data work.exp_num ;
infile 'exp_number.csv'
lrecl = 256
delimiter = '~'
dsd
missover
firstobs = 2;
;
attrib a length = $300
format = $32.
informat = $32.;
input a ;
run;
data test;
set work.exp_num;
b=input(a,32.16);
run;
Kindly help.
Thanks in advance.
You can use the standard w.d informat.
Here is an example:
data test;
a='3.24456545e-3'; output;
a='3.22254e2'; output;
run;
data erg;
set test;
b=input(a,32.16);
run;
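One caveat with the example above: in the w.d informat, the d value is only applied when the input text contains no decimal point, so a width-only informat avoids any accidental scaling. The following is a small sketch along the same lines, assuming the values arrive as character strings of at most 32 characters; the BEST16. format is just an illustrative choice for display.

```sas
data erg;
   length a $32;
   /* sample values in exponential notation, stored as character */
   do a = '3.22254e2', '3.24456545e-3';
      /* width-only informat: no implied decimal places */
      b = input(a, 32.);
      output;
   end;
   format b best16.;
run;
proc print data=erg;
run;
```

If b still comes back missing when reading the real file, it may be worth checking the raw values of a for stray characters (for example a trailing carriage return) before the conversion.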

Multiple hash objects in SAS

I have two SAS data sets. The first is relatively small, and contains unique dates and a corresponding ID:
date dateID
1jan90 10
2jan90 15
3jan90 20
...
The second data set is very large, and has two date variables:
dt1 dt2
1jan90 2jan90
3jan90 1jan90
...
I need to match both dt1 and dt2 to dateID, so the output would be:
id1 id2
10 15
20 10
Efficiency is very important here. I know how to use a hash object to do one match, so I could do one data step to do the match for dt1 and then another step for dt2, but I'd like to do both in one data step. How can this be done?
Here's how I would do the match for just dt1:
data tbl3;
if 0 then set tbl1 tbl2;
if _n_=1 then do;
declare hash dts(dataset:'work.tbl2');
dts.DefineKey('date');
dts.DefineData('dateid');
dts.DefineDone();
end;
set tbl1;
if dts.find(key:date)=0 then output;
run;
A format would probably work just as efficiently given the size of your hash table...
data fmt ;
retain fmtname 'DTID' type 'N' ;
set tbl1 ;
start = date ;
label = dateid ;
run ;
proc format cntlin=fmt ; run ;
data tbl3 ;
set tbl2 ;
id1 = put(dt1,DTID.) ;
id2 = put(dt2,DTID.) ;
run ;
Edited version based on below comments...
data fmt ;
retain fmtname 'DTID' type 'I' ;
set tbl1 end=eof ;
start = date ;
label = dateid ;
output ;
if eof then do ;
hlo = 'O' ;
label = . ;
output ;
end ;
run ;
proc format cntlin=fmt ; run ;
data tbl3 ;
set tbl2 ;
id1 = input(dt1,DTID.) ;
id2 = input(dt2,DTID.) ;
run ;
I don't have SAS in front of me right now to test it but the code would look like this:
data tbl3;
if 0 then set tbl1 tbl2;
if _n_=1 then do;
declare hash dts(dataset:'work.tbl2');
dts.DefineKey('date');
dts.DefineData('dateid');
dts.DefineDone();
end;
set tbl1;
date = dt1;
if dts.find()=0 then do;
id1 = dateId;
end;
date = dt2;
if dts.find()=0 then do;
id2 = dateId;
end;
if not missing(id1) or not missing(id2) then output; * KEEP ONLY RECORDS THAT MATCHED AT LEAST ONE;
drop date dateId;
run;
I agree with the format solution, for one, but if you want the hash solution, here it goes. The key point is that you pass the variable you're matching as the key in each FIND call, rather than relying on the key variable defined in the hash itself.
data tbl2;
informat date DATE7.;
input date dateID;
datalines;
01jan90 10
02jan90 15
03jan90 20
;;;;
run;
data tbl1;
informat dt1 dt2 DATE7.;
input dt1 dt2;
datalines;
01jan90 02jan90
03jan90 01jan90
;;;;
run;
data tbl3;
if 0 then set tbl1 tbl2;
if _n_=1 then do;
declare hash dts(dataset:'work.tbl2');
dts.DefineKey('date');
dts.DefineData('dateid');
dts.DefineDone();
end;
set tbl1;
rc1 = dts.find(key:dt1);
if rc1=0 then id1=dateID;
rc2 = dts.find(key:dt2);
if rc2=0 then id2=dateID;
if rc1=0 and rc2=0 then output;
run;