SAS Perl Regular Expressions: How to write correct syntax? - regex

I have some complicated string parsing that would be very difficult to accomplish using regular SAS functions because the string values are inconsistent; as a result
I think I will need to use Perl regular expressions. Below are 4 variables (price, date, size, bundle) that I have to create from parts of the text string. I'm having trouble getting the syntax correct, as I am new to regular expressions.
Here is a sample data set.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;run;
/* The first variable is price; it is normally located near the end or middle of the string */
data want;
set have;
price =(input(prxchange('s/(\w+)_(\d+)_(\w+)/$2/',-1,text),8.))/100;
format price dollar8.2;
run;
Using the data set above I need to have this result:
price
0
79.99
89.99
89.99
79.99
64.99
/* Date is always a series of consecutive digits: either 6, 7 or 8. Using | which means 'or' I thought I would be able to pull that way */
data want;
set have;
date=prxparse('/\d\d\d\d\d\d|\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d/',text);
run;
Using the data set above I need to have this result:
Date
1192014
112014
2102014
272014
12252014
462014
1192014
12162013
/* For size there is always an 'x' in the middle of the substring, with two or three digits on either side */
data want;
set have;
size=prxparse('/(\w+)_(\d+)'x'(\d+)_(\w+)/',text);
run;
Size
728x90
160x600
300x250
160x600
728x90
/* This is normally located towards the beginning of the string. It's always a single-digit number followed by an x. It is never followed by additional digits, but it can also be just 0. */
data want;
set have;
Bundle=prxparse('/(\d+)'x'',text);
run;
Bundle
0
3x
3x
3X
3x
0
2x
3x
The final product I am looking for should look like this:
Text                                                        Date      price  Size     Bundle
acq_newsale_0_CartChat_0_Flash_1192014.jpg                  1192014   0               0
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf              112014    79.99           3x
acq_sale_3xconoffer_8999_nacpg_2102014.sfw                  2102014   89.99           3x
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp        272014    89.99  728x90   3X
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov           12252014         160x600  3x
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg     462014           300x250  0
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf  1192014   79.99  160x600  2x
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf       12162013  64.99  728x90   3x

If you're extracting, don't use PRXCHANGE. Use PRXPARSE, PRXMATCH, and PRXPOSN.
Sample usage, with date:
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;
run;
data want;
set have;
rx_date = prxparse('~(\d{6,8})~io');
rc_date = prxmatch(rx_date,text);
if rc_date then datevar = prxposn(rx_date,1,text);
run;
Just enclose in parens the section you want to extract (in this case, all of it).
Date was easy - as you say, 6-8 numbers. The others may be harder. The 3x etc. bit you can probably find, depending on how strict you need to be; the price I think you'll have a very hard time finding. You need to be able to better articulate the rules. "Towards the beginning" isn't a regex rule. "The second set of digits" is; "The second to last set", perhaps might work. I'll see if I can figure out a few.
In your example data, this works. I particularly don't like the price search; that one may well fail with a more complicated set of data. You can figure out adding the decimal for yourself.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
blahblah :23 blahblah
blahblahblah 23 blah blah
;
run;
data want;
set have;
rx_date = prxparse('~_(\d{6,8})[_\.]~io');
rx_price = prxparse('~_(\d+)_.*?(?=_\d+[_\.]).*?(?!_\d+[_\.])~io');
rx_bundle = prxparse('~(?!_\d+_)_(\dx)~io');
rx_size = prxparse('~_(\d+x\d+)[_\.]~io');
rx_adnum = prxparse('~\s:?(\d\d)\s~io');
rc_date = prxmatch(rx_date,text);
rc_price = prxmatch(rx_price,text);
rc_bundle = prxmatch(rx_bundle,text);
rc_size = prxmatch(rx_size,text);
rc_adnum = prxmatch(rx_adnum,text);
if rc_date then datevar = prxposn(rx_date,1,text);
if rc_price then price = prxposn(rx_price,1,text);
if rc_bundle then bundle = prxposn(rx_bundle,1,text);
if rc_size then size = prxposn(rx_size,1,text);
if rc_adnum then adnum = prxposn(rx_adnum,1,text);
run;
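For the decimal, a minimal sketch, assuming the price variable produced above is a character string of digits such as 7999 that represents cents. The dataset name want_price and the variable price_num are hypothetical names chosen for illustration.

```sas
data want_price;
  set want;
  /* price is character here (it comes from PRXPOSN), so convert it first. */
  /* The ?? modifier suppresses log notes if the text is not numeric.      */
  if not missing(price) then price_num = input(price, ?? 8.) / 100;
  format price_num dollar8.2;
run;
```

Rows where no price was matched keep a missing price_num rather than 0, which keeps "no price found" distinct from a price of zero.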

Related

Can I use Perl-regular expressions in SAS to add imputations to date?

For context I'm a SAS programmer in clinical trials but I have this spec for variable ADTC.
If EC.ECDTC contains a full datetime, set ADTMC to the value of EC.ECDTC in "YYYY-MM-DD hh:mm" format. If EC.ECDTC contains a full or partial date but no time part then set ADTMC to the date part of EC.ECDTC in "YYYY-MM-DD" format. In both cases, replace any missing elements of the format with "XX", for example "2022-01-01 16:XX" or "2022-01-XX"
So currently I'm using this piece of code which is partially fine but not ideal
check=count(ecdtc,'-');
if check = 0 and ~missing(ecdtc) then adtc = cats(ecdtc,"-XX-XX");
else if check = 1 then adtc = cats(ecdtc,"-XX");
else if check = 2 then adtc = ecdtc;
Is there a way I could use Perl regular expressions with a template of the date/datetime layout, search the values in that column, and where they don't match add -XX if the day is missing, -XX-XX if the day and month are missing, and so on? I was thinking of using prxchange, but how do you incorporate the template so it knows to add -XX in the correct position where applicable?
SUBSTR on the left.
data want2;
set have;
length adtmc $16;
if length(ecdtc) le 10 then adtmc = 'XXXX-XX-XX'; *uppercase XX, as the spec requires;
else adtmc = 'XXXX-XX-XX XX:XX';
substr(adtmc,1,length(ecdtc))=ecdtc;
run;
Honestly, I wouldn't; for simple things like this, regexes are for the most part no faster than straight-up checks in normal code. If you're under time pressure, or have thousands or millions of rows, they're not a good idea; just use scan.
But that said, it's certainly possible, and somewhat interesting. We'll use PRXPOSN, which lets us iterate through the capture buffers, and "capture" each bit. This might need some tweaking, and you might need to capture/not capture the hyphens for example, but for my data this works - if your data is different, the regex will be different (and next time, post sample data!).
data have;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-02
2020-01
2020
junk
;;;;
run;
data want;
set have;
length adtmc $16;
array vals[3] $;
vals[1]='XXXX';
vals[2]='-XX';
vals[3]='-XX';
_rx = prxparse('/(\d{4})(-\d{2})?(-\d{2})?( \d{2}:\d{2})?/ios');
_rc = prxmatch(_rx,ecdtc); *this does the matching. Probably should check for value of _rc to make sure it matched before continuing.;
do _i = 1 to 4; *now iterate through the four capture buffers;
_rt = prxposn(_rx,_i,ecdtc);
if _i le 3 then vals[_i] = coalescec(_rt,vals[_i]);
else timepart = _rt; *we do the timepart outside the array since it needs to be catted with a space while the others do not, easier this way;
end;
adtmc = cats(of vals[*]); *cat them together now - if you do not capture the hyphen then use catx ('-',of vals[*]) instead;
if timepart ne ' ' then adtmc = catx(' ',adtmc,timepart); *and append the timepart after.;
run;

Numbered range lists for character data in SAS

I'm trying to create variables Cap1 through Cap6. I'm not sure how to have them read as character data. My code is:
DATA Capture;
INFILE '/folders/myfolders/sasuser.v94/Capture.txt' DLM='09'x DSD MISSOVER FIRSTOBS=2;
INPUT Sex $ AgeGroup $ Weight Cap1 - Cap6 $;
RUN;
And my issue is Cap1 through Cap5 are interpreted as numerical data. How do I solve this?
Your issue is simple: you are using a variable list, but you aren't applying the $ to the whole variable list! You need ( ) around the list and the modifier to apply it to the whole list.
See:
DATA Capture;
INFILE datalines DLM=' ' DSD;
INPUT Sex $ AgeGroup $ Weight (Cap1 - Cap6) ($);
datalines;
M 18-34 135 A B C D E F
F 35-54 115 G H I J K L
;;;;
RUN;
Indeed, I would also expect this input statement to work, as you did, but it does not. Putting a $ after Cap1 does not resolve it either, as this log shows.
26 INPUT Sex $ AgeGroup $ Weight Cap1 $ - Cap6 $;
_
22
ERROR 22-322: Expecting a name.
You can solve it by assigning a format to your variables before reading them, for instance format Cap1 - Cap6 $2.;
To test it, I included the data in the source file, i.e. using datalines
DATA Capture;
INFILE datalines DLM='09'x DSD missover FIRSTOBS=1;
format Sex $1. AgeGroup $9. Weight 8.2 Cap1 - Cap6 $2.;
INPUT Sex AgeGroup Weight Cap1 - Cap6;
datalines;
M 1-5 24.5 11 12 13 14 15 16
M 6-10 34.2 21 22 23 24 25 26
;
proc print;
proc contents;
RUN;
How to understand this:
SAS was originally created as a programming language for non-developers (in this case, statisticians) who would rather not care about data formats, so SAS does a lot of guesswork for you (just like VBA if you don't use Option Explicit).
So, the first time you mention a variable name in a data step, SAS adds a variable to the Program Data Vector (PDV) with an appropriate type (numeric or character) and length, but this is guesswork.
For instance, since 'male' is the first value assigned to gender in the code (SAS fixes the length at compile time from that first assignment, regardless of the data), running this on the test dataset CLASS included in the standard installation of SAS,
data WORK.CLASS;
set sasHelp.CLASS;
select (sex);
when ('M') gender = 'male';
when ('F') gender = 'female';
otherwise gender = 'unknown';
end;
run;
results in truncating 'female' to four positions ('fema').
You can correct that by instructing SAS to add the variable to the PDV beforehand.
For a character variable,
format myName $20.; and
length myName $20.; are equivalent as far as the variable's length is concerned, and
informat myName $20.; is also about the same.
(The story becomes more complex with user-defined formats, though.)
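As a minimal sketch of that correction, applied to the gender example above: the LENGTH statement comes before the first assignment, so the variable is added to the PDV with room for the longest value. The length of 7 is chosen here simply because 'unknown' has seven characters.

```sas
data work.class;
  set sashelp.class;
  /* Declare gender before its first assignment so it is not sized from 'male'. */
  length gender $7;
  select (sex);
    when ('M') gender = 'male';
    when ('F') gender = 'female';
    otherwise  gender = 'unknown';
  end;
run;
```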
For numerics, there is a huge difference:
length mySize 8.; reserves 8 bytes in the PDV for mySize
format mySize 8.; tells SAS to print or display mySize with up to 8 digits and no decimals
informat mySize 8.; tells SAS to expect 8 digits without decimals when reading mySize.
Numerics can only have certain lengths, depending on the operating system. On Windows
8. is the default and corresponds to a double in most databases
4. corresponds to a float
3. is the minimum, which I use for booleans
Formats can be very different
format mySize 8.3; tells SAS to print mySize with 8 characters, including 3 decimals for the fraction (which leaves room for up to 4 digits before the decimal point if it has a positive value. Fewer decimals will be printed to display larger numbers)
informat mySize 8.3; tells SAS to read mySize assuming the last 3 digits are the fraction, so 12345678 will be interpreted as 12345.678
Then there are special formats to read and write dates, times and so on, and user-defined value and picture formats, but that would lead me too far.
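The difference between reading with 8.3 and writing with 8.3 can be checked with a small sketch like the one below. It uses the INPUT function rather than an INFORMAT statement, which is an assumption made only to keep the example self-contained; the variable name x is arbitrary.

```sas
data _null_;
  /* Reading: the w.d informat treats the last 3 digits as the fraction */
  /* when the text contains no decimal point.                           */
  x = input('12345678', 8.3);
  /* Writing with a wide enough format shows the value that was stored. */
  put x= 10.3;
  /* Writing with 8.3 has only 8 characters available for this value.   */
  put x= 8.3;
run;
```

The first PUT should show 12345.678 in the log; the second has to fit the same value into 8 characters, which illustrates the remark about fewer decimals being printed for larger numbers.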

SAS: Transform variable into time series in text file import - length greater than 32.767

I get a calendar file from a vendor containing all holidays for a specific calendar.
The file contains 7 columns separated by a pipe (|). However, column 7, which contains the actual holidays, comes as a string separated by semicolons (;).
My problem is that column 7 has a length greater than 32,767, so the solution I have used so far, based on some array and transpose tricks, doesn't work anymore.
Basically the text file looks like:
INTERNAL_NAME|ERROR_CODE|NUMBER_OF_FIELDS|CALENDAR_CODE|CALENDAR_TYPE|CALENDAR_NAME|DATES
US|0|4|US|Country|United States|;2;15728;1;5;19440101;5;19440102;5;19440103;5;19440108;5;19440109......etc.
However column 7 is delivered in a nice format so that the size of the array/matrix is given and the delimiter is given at the start of the string.
*1st character = delimiter -> ;
*Number of dimensions in matrix -> 2
*Number of rows in matrix -> 15,728
*Number of columns -> 1
*Data elements + Data -> 5 = Date and Data=01JAN1944 etc.
My desired result would be a dataset looking like
INTERNAL_NAME DATES
US 01JAN1944
US 02JAN1944
US 03JAN1944
US 08JAN1944
etc. until 15,728 observations are read.....
You can do this fairly easily.
The manual solution, i.e., assuming the fields are just as you say they are, is to use the secondary delimiter (;) and then you can parse that initial string on your own later since it's known to be shorter. Then iterate the inputs of that string, using a trailing @ to hold the line.
data want;
infile datalines4 dlm=';' truncover;
length initial_string $500;
input initial_String $ @; *trailing @ holds the line for the next INPUT;
input dim row col @;
do _n_ = 1 by 1 until (missing(holiday_date));
input col_type holiday_Date @;
if not missing(holiday_date) then output;
end;
datalines4;
US|0|4|US|Country|United States|;2;15728;1;5;19440101;5;19440102;5;19440103;5;19440108;5;19440109
;;;;
run;
If you want to use that information that tells you about the delimiter/etc. to drive the readin, you could do that, but it would take two passes on the data file (unless it has a limited set of possibilities and you could just use if/else branching with those limited set of input statements). One pass would read just that part, then call a macro to read in the rest in a separate data step. But if this is always the format of the file, and you don't really care about those fields - you just have to work with them being there - the above is probably better as it's faster and less complicated.
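To get from the output of the manual solution to the layout in the question (INTERNAL_NAME plus a date shown as 01JAN1944), a follow-up step along these lines might work. It is a sketch that assumes holiday_date was read as a number such as 19440101 and that INTERNAL_NAME is the first pipe-delimited piece of initial_string; the dataset name want_dates and the length $32 are hypothetical.

```sas
data want_dates;
  set want;
  length internal_name $32;
  /* First pipe-delimited field of the part read before the semicolons. */
  internal_name = scan(initial_string, 1, '|');
  /* Turn the number 19440101 into text, then read it as a SAS date.    */
  dates = input(put(holiday_date, 8.), yymmdd8.);
  format dates date9.;
  keep internal_name dates;
run;
```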

Custom format for SAS

Hi I'm interested in making a couple of slightly complex custom formats for data I produce in SAS. I need it to be of the numeric type.
FORMAT 1
0="-"
>0="<number>%"
<0="<number>%"
ie
0 >>>>>>> -
.74 >>>>> 74%
-.65>>>>> -65%
FORMAT 2
0="-"
>0="$<number(with commas)>"
<0="$(<number(with commas)>)"
ie
0>>>>>>-
1467>>>>$1,467
-39087>>$(39,087)
I've made simple custom formats using code like this
proc format;
picture test
0='-';
run;
But I'm not sure how to write the syntax to append the $ sign and ( ) signs.
Thanks.
The percent format is fairly straightforward. The dollar one is a mite trickier as you have to watch your widths.
Basically you need to use prefix to get anything on the front (dollar, paren, minus sign) and just put anything that you want at the end actually at the end. '0' means a sometimes-printing digit, '9' means always-printing digit.
You use mult to make the .13 -> 13%.
And, for dollar, you can make use of the dollar format. You might also be able to use the NEGPAREN format on the negative side, but you can't combine that with the dollar sign...
proc format;
picture pctfmt
low - <0= '000%' (mult=100 prefix='-')
0 = '-'
0<-high = '000%' (mult=100);
picture dollfmt
low - <0 = '000,000,000.00)' (prefix='$(')
0 = '-'
0 <- high = [dollar16.2]
;
run;
data _null_;
input x;
put x= pctfmt.;
datalines;
-.15
-.05
0
.05
.15
;;;;
run;
data _null_;
input x;
put x= dollfmt12.2;
datalines;
-5.93
-13432554
0
12345324
5.98
;;;;
run;

SAS-How to format arrays dynamically based on information in one column

I'm new to SAS, and would greatly appreciate anyone who can help me formulate a code. Can someone please help me with formatting changing arrays based on the first column values?
So basically here's the original data:
Category Name1 Name2......... (Changes invariably)
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
I would like to format the values under Name1 to infinite Name# and reformat them to dollar10.2 for any values under Category called 'AmountBilled','AmountPaid','AmountDed'.
Thank you so much for your help!
You can't conditionally format a column (like you might in excel). A variable/column has one format for the entire column. There are tricks to get around this, but they're invariably more complex than should be considered useful.
You can store the formatted value in a character variable, but it loses the ability to do math.
data have;
input category :$10. name1 name2;
datalines;
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
;;;;
run;
data want;
set have;
array names name:; *colon is wildcard (starts with);
array newnames $10 newname1-newname10; *Arbitrarily 10, can be whatever;
if substr(category,1,6)='Amount' then do;
do _t = 1 to dim(names);
newnames[_t] = put(names[_t],dollar10.2);
end;
end;
run;
You could programmatically figure out the newname1000 endpoint using PROC CONTENTS or SQL's DICTIONARY.COLUMNS / SAS's SASHELP.VCOLUMN. Alternately, you could put out the original dataset as a three column dataset with many rows for each category (was it this way to begin with prior to a PROC TRANSPOSE?) and put the character variable there (not needing an array). To me that's the cleanest option.
data have_t;
set have;
array names name:;
format nameval $10.;
do namenum = 1 to dim(names);
if substr(category,1,6)='Amount' then nameval = put(names[namenum],dollar10.2 -l);
else nameval=put(names[namenum],10. -l); *left aligning here, change this if you want otherwise;
output; *now we have (namenum) rows per line. Test for missing(name) if you want only nonmissing rows output (if not every row has same number of names).;
end;
run;
proc transpose data=have_t out=want_T(drop=_name_) prefix=name;
by category notsorted;
var nameval;
run;
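For the array approach, the endpoint of the newname list could be counted from the dictionary tables instead of being hard-coded. This is a sketch that assumes the input dataset is WORK.HAVE and that every variable whose name starts with NAME belongs in the array; the macro variable n_names is a hypothetical name.

```sas
proc sql noprint;
  /* Count the columns of WORK.HAVE whose names start with NAME. */
  select count(*) into :n_names trimmed
  from dictionary.columns
  where libname = 'WORK' and memname = 'HAVE'
    and upcase(name) like 'NAME%';
quit;

data want;
  set have;
  array names name:;
  /* One character variable per NameN column found above. */
  array newnames $10 newname1-newname&n_names;
  if substr(category,1,6)='Amount' then do _t = 1 to dim(names);
    newnames[_t] = put(names[_t], dollar10.2);
  end;
  drop _t;
run;
```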
Finally, depending on what you're actually doing with this, you may have superior options in terms of the output method. If you're doing PROC REPORT for example, you can use compute blocks to set the style (format) of the column conditionally in the report output.
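A minimal sketch of that PROC REPORT idea, assuming only the two columns name1 and name2 from the sample data; with a varying number of Name columns, the DEFINE and COMPUTE blocks would have to be generated (for example by a macro), which is not shown here.

```sas
proc report data=have;
  column category name1 name2;
  define category / display;
  define name1    / display;
  define name2    / display;
  /* category is to the left of both columns, so its value is available here. */
  compute name1;
    if substr(category,1,6) = 'Amount' then
      call define(_col_, 'format', 'dollar10.2');
  endcomp;
  compute name2;
    if substr(category,1,6) = 'Amount' then
      call define(_col_, 'format', 'dollar10.2');
  endcomp;
run;
```

The underlying variables stay numeric; only the displayed cell changes format for the Amount rows.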