SAS: Transform variable into time series in text file import - length greater than 32.767 - sas

I get a calendar file from a vendor containing all holidays for a specific calendar.
The file contain 7 columns separated by a pipe (|). However column 7 that contain the actual holiday comes in a string format separated by semi-colon (;).
My problem is that column 7 has a length greater than 32.767 - then the solution I have done so far using some array and transpose tricks doesn't work anymore.
Basically the text file looks like:
INTERNAL_NAME|ERROR_CODE|NUMBER_OF_FIELDS|CALENDAR_CODE|CALENDAR_TYPE|CALENDAR_NAME|DATES
US|0|4|US|Country|United States|;2;15728;1;5;19440101;5;19440102;5;19440103;5;19440108;5;19440109......etc.
However column 7 is delivered in a nice format so that the size of the array/matrix is given and the delimiter is given at the start of the string.
*1st charachter = delimiter -> ;
*Number of dimensions in matrix -> 2
*Number of rows in matrix -> 15.728
*Number of columns -> 1
*Data elements + Data -> 5 = Date and Data=01JAN1944 etc.
My desired result would be a dataset looking like
INTERNAL_NAME DATES
US 01JAN1944
US 02JAN1944
US 03JAN1944
US 08JAN1944
etc. until 15.728 observations is read.....

You can do this fairly easily.
The manual solution, i.e., assuming the fields are just as you say they are, is to use the secondary delimiter (;) and then you can parse that initial string on your own later since it's known to be shorter. Then iterate the inputs of that string, using # to hold the line.
data want;
infile datalines4 dlm=';' truncover;
length initial_string $500;
input initial_String $ #;
input dim row col #;
do _n_ = 1 by 1 until (missing(holiday_date));
input col_type holiday_Date #;
if not missing(holiday_date) then output;
end;
datalines4;
US|0|4|US|Country|United States|;2;15728;1;5;19440101;5;19440102;5;19440103;5;19440108;5;19440109
;;;;
run;
If you want to use that information that tells you about the delimiter/etc. to drive the readin, you could do that, but it would take two passes on the data file (unless it has a limited set of possibilities and you could just use if/else branching with those limited set of input statements). One pass would read just that part, then call a macro to read in the rest in a separate data step. But if this is always the format of the file, and you don't really care about those fields - you just have to work with them being there - the above is probably better as it's faster and less complicated.

Related

Can I use Perl-regular expressions in SAS to add imputations to date?

For context I'm a SAS programmer in clinical trials but I have this spec for variable ADTC.
If EC.ECDTC contains a full datetime, set ADTMC to the value of EC.ECDTC in "YYYY-MM-DD hh:mm" format. If EC.ECDTC contains a full or partial date but no time part then set ADTMC to the date part of EC.ECDTC in "YYYY-MM-DD" format. In both cases, replace any missing elements of the format with "XX", for example "2022-01-01 16:XX" or "2022-01-XX"
So currently I'm using this piece of code which is partially fine but not ideal
check=count(ecdtc,'-');
if check = 0 and ~missing(ecdtc) then adtc = cats(ecdtc,"-XX-XX");
else if check = 1 then adtc = cats(ecdtc,"-XX");
else if check = 2 then adtc = ecdtc;
Is there a way I could use perl-regular expressions to have like a template of the outline of the date/datetime and have it search through the values for that column and if they don't match to add -XX if missing day or -XX-XX if missing day and month etc. I was thinking of utilising prxchange but how do you incorporate the template so it knows to add -XX in the correct position where applicable.
SUBSTR on the left.
data want2;
set have;
length adtmc $16;
if length(ecdtc) le 10 then adtmc = 'xxxx-xx-xx';
else adtmc = 'xxxx-xx-xx xx:xx';
substr(adtmc,1,length(ecdtc))=ecdtc;
run;
Honestly, I wouldn't; regex are not faster for the most part than just straight-up checking with normal code, for simple things like this. If you have time pressure, or thousands or millions of rows... not a good idea, just use scan.
But that said, it's certainly possible, and somewhat interesting. We'll use PRXPOSN, which lets us iterate through the capture buffers, and "capture" each bit. This might need some tweaking, and you might need to capture/not capture the hyphens for example, but for my data this works - if your data is different, the regex will be different (and next time, post sample data!).
data have;
length ecdtc $16;
infile datalines truncover;
input #1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-02
2020-01
2020
junk
;;;;
run;
data want;
set have;
length adtmc $16;
array vals[3] $;
vals[1]='XXXX';
vals[2]='-XX';
vals[3]='-XX';
_rx = prxparse('/(\d{4})(-\d{2})?(-\d{2})?( \d{2}:\d{2})?/ios');
_rc = prxmatch(_rx,ecdtc); *this does the matching. Probably should check for value of _rc to make sure it matched before continuing.;
do _i = 1 to 4; *now iterate through the four capture buffers;
_rt = prxposn(_rx,_i,ecdtc);
if _i le 3 then vals[_i] = coalescec(_rt,vals[_i]);
else timepart = _rt; *we do the timepart outside the array since it needs to be catted with a space while the others do not, easier this way;
end;
adtmc = cats(of vals[*]); *cat them together now - if you do not capture the hyphen then use catx ('-',of vals[*]) instead;
if timepart ne ' ' then adtmc = catx(' ',adtmc,timepart); *and append the timepart after.;
run;

How to split a column into multiple rows in SAS

I have a SAS table that I imported from Oracle with two fields. SYSTEMID and T_BLOB.
Inside the T_BLOB field there is data:
2203 Mountain Meadow===========OSCAR ST===========Zephyrhill Road
(why they are delimiting with equal signs I do not know nor do I know who to ask).
I'm new to SAS and I'm being asked to split T_BLOB field into multiple rows in a table called rick.split_blob. I tried Google but I can't find the exact example. I'm trying to get the output to look like:
SYSTEM_ID T_BLOB
GID_1 2203 Mountain Ave
GID_1 OSCAR ST
GID_1 Zephyrhill Road
Can anyone help me with how to code this?
If none of the values ever contain = then you can just use the scan() function.
data want;
set have ;
length T_BLOB_VALUE $200 ;
do i=1 by 1 until(t_blob_value=' ');
t_blob_value=scan(t_blob,i,'=') ;
if i=1 or t_blob_value ne ' ' then output;
end;
run;
You could try this:
data rick.split_blob (keep=SYSTEM_ID T_BLOB_SUB rename=(T_BLOB_SUB=T_BLOB));
set orig_dataset;
T_BLOB_TRANS = tranwrd(T_BLOB,"===========","|");
do i = 1 to countw(T_BLOB_TRANS,"|");
T_BLOB_SUB = scan(T_BLOB,i,"|");
output;
end;
run;
What I'm trying to do is first translate the odd string of equals signs to a simple pipe to avoid counting them as consecutive delimiters. Then we determine how many "words" (really - delimited strings) there are in T_BLOB_TRANS so we know how many times to run the DO loop. Finally we read everything between each delimiter and output it to a new T_BLOB variable for each new word.
It looks like you'll want to use a combination of the "scan" function and the "output" statement (with countw to get you the number of words if it is variable). Scan returns the nth word where you can specify the delimiter. Output outputs a record. So, for example, you can say
do i=1 to countw(line);
newvar = scan(line,i);
output;
end;

Adding a new Column to existing SAS dataset

I have a SAS dataset that I have created by reading in a .txt file. It has about 20-25 rows and I'd like to add a new column that assigns an alphabet in serial order to each row.
Row 1 A
Row 2 B
Row 3 C
.......
It sounds like a really basic question and one that should have an easy solution, but unfortunately, I'm unable to find this anywhere. I get solutions for adding new calculated columns and so on, but in my case, I just want to add a new column to my existing datatable - there is no other relation between the variables.
This is kind of ugly and if you have more than 26 rows it will start to use random ascii characters. But it does solve the problem as defined by the question.
Test data:
data have;
do row = 1 to 26;
output;
end;
run;
Explanation:
On my computer, the letter 'A' is at position 65 in the ASCII table (YMMV). We can determine this by using this code:
data _null_;
pos = rank('A');
put pos=;
run;
The ASCII table will position the alphabet sequentially, so that B will be at position 66 (if A is at 65 and so on).
The byte() function returns a character from the ASCII table at a certain position. We can take advantage of this by using the position of ASCII character A as an offset, subtracting 1, then adding the row number (_n_) to it.
Final Solution:
data want;
set have;
alphabet = byte(rank('A')-1 + _n_);
run;
Not better than Tom's but a brute force alternative essentially. Create the string of Alpha and then use CHAR() to identify character of interest.
data want;
set sashelp.class;
retain string 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
letter = char(string, _n_);
run;

Construct SAS dataset based on file containing metadata

I have two text files, one containing raw data with no headers and another containing the associated column names and lengths. I'd like to use these two files to construct a single SAS dataset containing the data from one file with the column names and lengths from the other.
The file containing the data is a fixed-width text file. That is, each column of data is aligned to a particular column of the text file, padded with spaces to ensure alignment.
datafile.txt:
John 45 Has two kids
Marge 37 Likes books
Sally 29 Is an astronaut
Bill 60 Drinks coffee
The file containing the metadata is tab-delimited with two columns: one with the name of the column in the data file and one with the character length of that column. The names are listed in the order in which they appear in the data file.
metadata.txt:
Name 7
Age 5
Comments 15
My goal is to have a SAS dataset that looks like this:
Name | Age | Comments
-------+------+-----------------
John | 45 | Has two kids
Marge | 37 | Likes books
Sally | 29 | Is an astronaut
Bill | 60 | Drinks coffee
I want every column to be character with the length specified in the metadata file.
There has to be a better way than my naive approach, which is to construct a length statement and an input statement using the imported metadata, like so:
/* Import metadata */
data meta;
length colname $ 50 collen 8;
infile 'C:\metadata.txt' dsd dlm='09'x;
input colname $ collen;
run;
/* Construct LENGTH and INPUT statements */
data _null_;
length lenstmt inptstmt $ 1000;
retain lenstmt inptstmt '' colstart 1;
set meta end=eof;
call catx(' ', lenstmt, colname, '$', collen);
call catx(' ', inptstmt, cats('#', colstart), colname, '$ &');
colstart + collen;
if eof then do;
call symputx('lenstmt', lenstmt);
call symputx('inptstmt', inptstmt);
end;
run;
/* Import data file */
data datafile;
length &lenstmt;
infile 'C:\datafile.txt' dsd dlm='09'x;
input &inptstmt;
run;
This gets me what I need, but there has to be a cleaner way. One could run into trouble with this approach if insufficient space is allocated to the variables storing the length and input statements, or if the statement lengths exceed the maximum macro variable length.
Any ideas?
What you're doing is a fairly standard method of doing this. Yes, you could check things a bit more carefully; I would allocate $32767 for the two statements, for example, just to be cautious.
There are some ways you can improve this, though, that may take some of your worries away.
First off, a common solution is to build this at the row level (as you do) and then use proc sql to create the macro variable. This has a larger maximum length limitation than the data step method (the data step method maximum is $32767 if you don't use multiple variables, SQL's is double that at 64kib).
proc sql;
select catx(' ',colname,'$',collen)
into :lenstmt separated by ' '
from meta; *and similar for inputstmt;
quit;
Second, you can surpass the 64k limit by writing to a file instead of to a macro variable. Take your data step, and instead of accumulating and then using call symput, write each line out to a temp file (or two). Then %include those files instead of using the macro variable in the input datastep - yes, you can %include in the middle of a datastep.
There are other methods, but these two are the most common and should work for most use cases. Some other methods include call execute, run_macro, or using file open commands to work with the file directly. In general, those are either more complicated or less useful than the most common two, although certainly they are also acceptable solutions and not uncommon to see in practice.
call execute show be able to help.
data _null_;
retain start 0;
infile 'c:\metadata.txt' missover end=eof;
if _n_=1 then do;
start=1;
call execute('data final_output; infile "c:\datafile.txt" truncover; input ');
end;
input colname :$8.
collen :8.
;
call execute( '#'|| put(start,8. -l) || ' ' || colname || ' $'|| put(collen,8. -r) ||'. ' );
start=sum(start,collen);
if eof then do;
call execute(';run;');
end;
run;
proc contents data=final_output;run;

SAS Perl Regular Expressions: How to write correct syntax?

I have some complicated string parsing which would be very difficult to accomplish using regular SAS functions because of the string value inconsistency; as a result
I think I will need to use Perl Regular Expressions. Below have 4 variables (price, date, size, bundle) which I have to create using parts of the text string. I'm have trouble getting the syntax correct - I am new to regular expressions.
Here is a sample data set.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;run;
/The first variable is price it is normally located near the end or middle of the string/
data want;
set have;
price =(input(prxchange('s/(\w+)_(\d+)_(\w+)/$2/',-1,text),8.))/100;
format price dollar8.2;
run;
Using the data set above I need to have this result:
price
0
79.99
89.99
89.99
79.99
64.99
/Date is always a series of consecutive digits. Either 6, 7 or 8. Using | which means 'or' I thought I would be able to pull that way/
data want;
set have;
date=prxparse('/\d\d\d\d\d\d|\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d/',text);
run;
Using the data set above I need to have this result:
Date
1192014
112014
2102014
272014
12252014
462014
1192014
12162013
/* For size there is always an ‘x’ in the middle of the sub-string which is with followed by two or three digits on either side*/
data want;
set have;
size=prxparse('/(\w+)_(\d+)'x'(\d+)_(\w+)/',text);
run;
Size
728x90
160x600
300x250
160x600
728x90
/*This is normally located towards the beginning of the string. It’s always a single digit number followed by an x It in never followed by additional digits but can also be just 0. */
data want;
set have;
Bundle=prxparse('/(\d+)'x'',text);
run;
Bundle
0
3x
3x
3X
3x
0
2x
3x
The final product I am looking for should look like this:
Text Date price Size Bundle
acq_newsale_0_CartChat_0_Flash_1192014.jpg 1192014 0 0
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf 112014 79.99 3x
acq_sale_3xconoffer_8999_nacpg_2102014.sfw 2102014 89.99 3x
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp 272014 89.99 728x90 3X
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov 12252014 160x600 3x
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg 462014 300x250 0
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf 1192014 79.99 160x600 2x
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf 12162013 64.99 728x90 3
x
If you're extracting, don't use PRXCHANGE. Use PRXPARSE, PRXMATCH, and PRXPOSN.
Sample usage, with date:
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;
run;
data want;
set have;
rx_date = prxparse('~(\d{6,8})~io');
rc_date = prxmatch(rx_date,text);
if rc_date then datevar = prxposn(rx_date,1,text);
run;
Just enclose in parens the section you want to extract (in this case, all of it).
Date was easy - as you say, 6-8 numbers. The others may be harder. The 3x etc. bit you can probably find, depending on how strict you need to be; the price I think you'll have a very hard time finding. You need to be able to better articulate the rules. "Towards the beginning" isn't a regex rule. "The second set of digits" is; "The second to last set", perhaps might work. I'll see if I can figure out a few.
In your example data, this works. I in particular don't like the price search; that one may well fail with a more complicated set of data. You can figure out adding the decimal for yourself.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
blahblah :23 blahblah
blahblahblah 23 blah blah
;
run;
data want;
set have;
rx_date = prxparse('~_(\d{6,8})[_\.]~io');
rx_price = prxparse('~_(\d+)_.*?(?=_\d+[_\.]).*?(?!_\d+[_\.])~io');
rx_bundle = prxparse('~(?!_\d+_)_(\dx)~io');
rx_size = prxparse('~_(\d+x\d+)[_\.]~io');
rx_adnum = prxparse('~\s:?(\d\d)\s~io');
rc_date = prxmatch(rx_date,text);
rc_price = prxmatch(rx_price,text);
rc_bundle = prxmatch(rx_bundle,text);
rc_size = prxmatch(rx_size,text);
rc_adnum = prxmatch(rx_adnum,text);
if rc_date then datevar = prxposn(rx_date,1,text);
if rc_price then price = prxposn(rx_price,1,text);
if rc_bundle then bundle = prxposn(rx_bundle,1,text);
if rc_size then size = prxposn(rx_size,1,text);
if rc_adnum then adnum = prxposn(rx_adnum,1,text);
run;