Can I use Perl-regular expressions in SAS to add imputations to date? - regex

For context I'm a SAS programmer in clinical trials but I have this spec for variable ADTC.
If EC.ECDTC contains a full datetime, set ADTMC to the value of EC.ECDTC in "YYYY-MM-DD hh:mm" format. If EC.ECDTC contains a full or partial date but no time part then set ADTMC to the date part of EC.ECDTC in "YYYY-MM-DD" format. In both cases, replace any missing elements of the format with "XX", for example "2022-01-01 16:XX" or "2022-01-XX"
So currently I'm using this piece of code which is partially fine but not ideal
check=count(ecdtc,'-');
if check = 0 and ~missing(ecdtc) then adtc = cats(ecdtc,"-XX-XX");
else if check = 1 then adtc = cats(ecdtc,"-XX");
else if check = 2 then adtc = ecdtc;
Is there a way I could use perl-regular expressions to have like a template of the outline of the date/datetime and have it search through the values for that column and if they don't match to add -XX if missing day or -XX-XX if missing day and month etc. I was thinking of utilising prxchange but how do you incorporate the template so it knows to add -XX in the correct position where applicable.

SUBSTR on the left.
data want2;
set have;
length adtmc $16;
if length(ecdtc) le 10 then adtmc = 'xxxx-xx-xx';
else adtmc = 'xxxx-xx-xx xx:xx';
substr(adtmc,1,length(ecdtc))=ecdtc;
run;

Honestly, I wouldn't; regex are not faster for the most part than just straight-up checking with normal code, for simple things like this. If you have time pressure, or thousands or millions of rows... not a good idea, just use scan.
But that said, it's certainly possible, and somewhat interesting. We'll use PRXPOSN, which lets us iterate through the capture buffers, and "capture" each bit. This might need some tweaking, and you might need to capture/not capture the hyphens for example, but for my data this works - if your data is different, the regex will be different (and next time, post sample data!).
data have;
length ecdtc $16;
infile datalines truncover;
input #1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-02
2020-01
2020
junk
;;;;
run;
data want;
set have;
length adtmc $16;
array vals[3] $;
vals[1]='XXXX';
vals[2]='-XX';
vals[3]='-XX';
_rx = prxparse('/(\d{4})(-\d{2})?(-\d{2})?( \d{2}:\d{2})?/ios');
_rc = prxmatch(_rx,ecdtc); *this does the matching. Probably should check for value of _rc to make sure it matched before continuing.;
do _i = 1 to 4; *now iterate through the four capture buffers;
_rt = prxposn(_rx,_i,ecdtc);
if _i le 3 then vals[_i] = coalescec(_rt,vals[_i]);
else timepart = _rt; *we do the timepart outside the array since it needs to be catted with a space while the others do not, easier this way;
end;
adtmc = cats(of vals[*]); *cat them together now - if you do not capture the hyphen then use catx ('-',of vals[*]) instead;
if timepart ne ' ' then adtmc = catx(' ',adtmc,timepart); *and append the timepart after.;
run;

Related

Combine one column's values into a single string

This might sound awkward but I do have a requirement to be able to concatenate all the values of a char column from a dataset, into one single string. For example:
data person;
input attribute_name $ dept $;
datalines;
John Sales
Mary Acctng
skrill Bish
;
run;
Result : test_conct = "JohnMarySkrill"
The column could vary in number of rows in the input dataset.
So, I tried the code below but it errors out when the length of the combined string (samplkey) exceeds 32K in length.
DATA RECKEYS(KEEP=test_conct);
length samplkey $32767;
do until(eod);
SET person END=EOD;
if lengthn(attribute_name) > 0 then do;
test_conct = catt(test_conct, strip(attribute_name));
end;
end;
output; stop;
run;
Can anyone suggest a better way to do this, may be break down a column into chunks of 32k length macro vars?
Regards
It would very much help if you indicated what you're trying to do but a quick method is to use SQL
proc sql NOPRINT;
select name into :name_list separated by ""
from sashelp.class;
quit;
%put &name_list.;
As you've indicated macro variables do have a size limit (64k characters) in most installations now. Depending on what you're doing, a better method may be to build a macro that puts the entire list as needed into where ever it needs to go dynamically but you would need to explain the usage for anyone to suggest that option. This answers your question as posted.
Try this, using the VARCHAR() option. If you're on an older version of SAS this may not work.
data _null_;
set sashelp.class(keep = name) end=eof;
length long_var varchar(1000000);
length want $256.;
retain long_var;
long_var = catt(long_var, name);
if eof then do;
want = md5(long_var);
put want;
end;
run;

sas, combine, unknown code datasets observations

I need help reading the code below. I am not sure what specific parts in this code are doing. For example, what does ( firstobs = 2 keep = column3 rename = (column3 = column4) ) do?
Also, what does ( obs = 1 drop = _all_ ); do?
I have also not used column5 = ifn( first.column1, (.), lag(column3) ); before. What does this do?
I am reading someone else's code. I wish I could provide more detail. If I find a solution, I will post it. Thanks for your help.
data out.dataset1;
set out.dataset2;
by column1;
WHERE column2 = 'N';
set out.dataset1 ( firstobs = 2 keep = column3 rename = (column3 = column4) )
out.dataset1 ( obs = 1 drop = _all_ );
FORMAT column5 DATETIME20.;
FORMAT column4 DATETIME20.;
column5 = ifn( first.column1, (.), lag(column3) );
column4 = ifn( last.column1, (.), column4 );
IF first.column1 then DIF=intck('dtday',column4,column3);
ELSE DIF= intck('dtday',column5,column3);
format column6 $6.;
IF first.column1
OR intck('dtday',column5,column3) GT 20 THEN column6= 'HARM';
ELSE column6= 'REPEAT';
run;
Seems like you need to learn about SAS datastep languange!
This series of things happening in the parenthesis are datastep options
You can use those options when ever you are referencing a table, even in a proc sql
The options you have:
firstobs : This starts the datafeed on the record enter in your case 2 it means SAS will start on the table on the 2nd record.
keep : This will only use the fields in list rather than using all the field in the table
rename = rename will rename field, so it works like an alias in SQL
OBS = will limit the amount of record you pull out of a table like top or limit in SQL
DROP = would remove the fields selected from the table in your case all is used wich means he's dropping all the fields.
as for the functions:
LAG is keeping the value from the previous record for the field you put in parenthesis so DPD_CLOSE_OF_BUSINESS_DT
INF = Works like a case or if. Basically you create a condition in the 1st argument and then the 2nd argument is applied when your condition in the 1st argument is true, the 3rd argument get done in the event that your condition on the 1st argument is false.
So to answer that question if it's the first record for the variable SOR_LEASE_NBR then the field Prev_COB_DT will be . otherwise it will be a previous value of DPD_CLOSE_OF_BUSINESS_DT.
The best advise I can give you is to start googling SAS and the function name you are wondering what it does, then it's a matter of encapsulation!
Hope this helps!
Basically your data step is using the LAG() function to look back one observation and the extra SET statement to look ahead one observation.
The IFN() function calls are then being used to make sure that missing values are assigned when at the boundary of a group.
You then use these calculated PREV and NEXT dates to calculate the DIF variable.
Note for this to work you need to be referencing the same input dataset in the two different SET statements (the dataset used in the last with the obs=1 and drop=_all_ dataset options
doesn't really need to be the same since it is not reading any of the actual data, it just has to have at least one observation).
( firstobs = 2 keep = DPD_CLOSE_OF_BUSINESS_DT rename =
(DPD_CLOSE_OF_BUSINESS_DT = Next_COB_DT) ) do?
Here the code firstobs=2 says SAS to read the data from 2nd observation in the dataset.
and also by using rename option trying to change the name of the variable.
(obs = 1 drop = _all_);
obs=1 is reading only the 1st obs in the dataset. If you specify obs=2 then up to 2nd obs will be read.
drop = _all_ , is dropping all of your variables.
Firstobs:
Can read part of the data. If you specify Firstobs= 10, it starts reading the data from 10th observation.
Obs :
If specify obs=15, up to 15th obs the data will be readed.
If you run the below table, it gives you 3 observations ( from 2nd to 4th ) in the output result.
Example;
DATA INSURANCE;
INFILE CARDS FIRSTOBS=2 OBS=4;
INPUT NAME$ GENDER$ AGE INSURANCE $;
CARDS;
SOWMYA FEMALE 20 MEDICAL
SUNDAR MALE 25 MEDICAL
DIANA FEMALE 67 MEDICARE
NINA FEMALE 56 MEDICAL
RUN;

How to split a column into multiple rows in SAS

I have a SAS table that I imported from Oracle with two fields. SYSTEMID and T_BLOB.
Inside the T_BLOB field there is data:
2203 Mountain Meadow===========OSCAR ST===========Zephyrhill Road
(why they are delimiting with equal signs I do not know nor do I know who to ask).
I'm new to SAS and I'm being asked to split T_BLOB field into multiple rows in a table called rick.split_blob. I tried Google but I can't find the exact example. I'm trying to get the output to look like:
SYSTEM_ID T_BLOB
GID_1 2203 Mountain Ave
GID_1 OSCAR ST
GID_1 Zephyrhill Road
Can anyone help me with how to code this?
If none of the values ever contain = then you can just use the scan() function.
data want;
set have ;
length T_BLOB_VALUE $200 ;
do i=1 by 1 until(t_blob_value=' ');
t_blob_value=scan(t_blob,i,'=') ;
if i=1 or t_blob_value ne ' ' then output;
end;
run;
You could try this:
data rick.split_blob (keep=SYSTEM_ID T_BLOB_SUB rename=(T_BLOB_SUB=T_BLOB));
set orig_dataset;
T_BLOB_TRANS = tranwrd(T_BLOB,"===========","|");
do i = 1 to countw(T_BLOB_TRANS,"|");
T_BLOB_SUB = scan(T_BLOB,i,"|");
output;
end;
run;
What I'm trying to do is first translate the odd string of equals signs to a simple pipe to avoid counting them as consecutive delimiters. Then we determine how many "words" (really - delimited strings) there are in T_BLOB_TRANS so we know how many times to run the DO loop. Finally we read everything between each delimiter and output it to a new T_BLOB variable for each new word.
It looks like you'll want to use a combination of the "scan" function and the "output" statement (with countw to get you the number of words if it is variable). Scan returns the nth word where you can specify the delimiter. Output outputs a record. So, for example, you can say
do i=1 to countw(line);
newvar = scan(line,i);
output;
end;

SAS Perl Regular Expressions: How to write correct syntax?

I have some complicated string parsing which would be very difficult to accomplish using regular SAS functions because of the string value inconsistency; as a result
I think I will need to use Perl Regular Expressions. Below have 4 variables (price, date, size, bundle) which I have to create using parts of the text string. I'm have trouble getting the syntax correct - I am new to regular expressions.
Here is a sample data set.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;run;
/The first variable is price it is normally located near the end or middle of the string/
data want;
set have;
price =(input(prxchange('s/(\w+)_(\d+)_(\w+)/$2/',-1,text),8.))/100;
format price dollar8.2;
run;
Using the data set above I need to have this result:
price
0
79.99
89.99
89.99
79.99
64.99
/Date is always a series of consecutive digits. Either 6, 7 or 8. Using | which means 'or' I thought I would be able to pull that way/
data want;
set have;
date=prxparse('/\d\d\d\d\d\d|\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d/',text);
run;
Using the data set above I need to have this result:
Date
1192014
112014
2102014
272014
12252014
462014
1192014
12162013
/* For size there is always an ‘x’ in the middle of the sub-string which is with followed by two or three digits on either side*/
data want;
set have;
size=prxparse('/(\w+)_(\d+)'x'(\d+)_(\w+)/',text);
run;
Size
728x90
160x600
300x250
160x600
728x90
/*This is normally located towards the beginning of the string. It’s always a single digit number followed by an x It in never followed by additional digits but can also be just 0. */
data want;
set have;
Bundle=prxparse('/(\d+)'x'',text);
run;
Bundle
0
3x
3x
3X
3x
0
2x
3x
The final product I am looking for should look like this:
Text Date price Size Bundle
acq_newsale_0_CartChat_0_Flash_1192014.jpg 1192014 0 0
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf 112014 79.99 3x
acq_sale_3xconoffer_8999_nacpg_2102014.sfw 2102014 89.99 3x
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp 272014 89.99 728x90 3X
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov 12252014 160x600 3x
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg 462014 300x250 0
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf 1192014 79.99 160x600 2x
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf 12162013 64.99 728x90 3
x
If you're extracting, don't use PRXCHANGE. Use PRXPARSE, PRXMATCH, and PRXPOSN.
Sample usage, with date:
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;
run;
data want;
set have;
rx_date = prxparse('~(\d{6,8})~io');
rc_date = prxmatch(rx_date,text);
if rc_date then datevar = prxposn(rx_date,1,text);
run;
Just enclose in parens the section you want to extract (in this case, all of it).
Date was easy - as you say, 6-8 numbers. The others may be harder. The 3x etc. bit you can probably find, depending on how strict you need to be; the price I think you'll have a very hard time finding. You need to be able to better articulate the rules. "Towards the beginning" isn't a regex rule. "The second set of digits" is; "The second to last set", perhaps might work. I'll see if I can figure out a few.
In your example data, this works. I in particular don't like the price search; that one may well fail with a more complicated set of data. You can figure out adding the decimal for yourself.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
blahblah :23 blahblah
blahblahblah 23 blah blah
;
run;
data want;
set have;
rx_date = prxparse('~_(\d{6,8})[_\.]~io');
rx_price = prxparse('~_(\d+)_.*?(?=_\d+[_\.]).*?(?!_\d+[_\.])~io');
rx_bundle = prxparse('~(?!_\d+_)_(\dx)~io');
rx_size = prxparse('~_(\d+x\d+)[_\.]~io');
rx_adnum = prxparse('~\s:?(\d\d)\s~io');
rc_date = prxmatch(rx_date,text);
rc_price = prxmatch(rx_price,text);
rc_bundle = prxmatch(rx_bundle,text);
rc_size = prxmatch(rx_size,text);
rc_adnum = prxmatch(rx_adnum,text);
if rc_date then datevar = prxposn(rx_date,1,text);
if rc_price then price = prxposn(rx_price,1,text);
if rc_bundle then bundle = prxposn(rx_bundle,1,text);
if rc_size then size = prxposn(rx_size,1,text);
if rc_adnum then adnum = prxposn(rx_adnum,1,text);
run;

TRANWRD to fix Merge error?

I recently combined two datasets with a pretty straightforward Merge statement. I was using an ACS dataset and a Census population dataset. I needed a flag from the latter to be in the former. When I merged, the place variable (town/county, state) was not de-duplicated because one dataset used state abbreviations while the other used the full spelling:
Obs GeoID GeoName
1 . Abbeville County, SC
2 45001 Abbeville County, South Carolina
I need to change the GeoName for Obs1 so that it equals Obs2
Would an index function work? Or do I need the TRANWRD function? Thanks.
Solved:
data _null_;
length geoName $100;
GeoName_C = scan(GeoName,1,',');
GeoName_S = scan(GeoName,-1,','); *-1 scans from the right in case you could have commas in the city - check for this and adjust GeoName_C to include them if it is possible;
GeoName_S_F = stnamel(strip(GeoName_S));
GeoName = catx(',',GeoName_C,GeoName_S_F);
put _all_;
run;
What I would do is separate the city from the state and use SAS's inbuilt function stnamel to convert the abbreviation to the full name.
data _null_;
length geoName $100;
GeoName='Abbeville Road, SC';
GeoName_C = scan(GeoName,1,',');
GeoName_S = scan(GeoName,-1,','); *-1 scans from the right in case you could have commas in the city - check for this and adjust GeoName_C to include them if it is possible;
GeoName_S_F = stnamel(strip(GeoName_S));
GeoName = catx(',',GeoName_C,GeoName_S_F);
put _all_;
run;