SAS Proc Format - Format a character based on a substring - sas

I want to know if it is possible to create a custom format that will work on a range of character values by taking a substring and formatting it base on that.
e.g. A001, B001, C001 are all formatted to 'JAN'
A002, B002, C002 are all formatted to 'FEB'
Is this possible?
value $custom_format'A001', 'B001', 'C001' = 'JAN'
'A002', 'B002', 'C002' = 'FEB';
etc.
I know you can just list specific character values you want formatted (as above), but I would like this to work for any letter at the begining and only read the last 3 characters to determin the format, without having to list all possible iterations.

Use substr to only pull the last 3 values and create a format based on that.
proc format;
value $custom_format
'001' = 'JAN'
'002' = 'FEB'
;
run;
/* substr(var, 2) will start at the second character and read until the end */
data test;
var = 'A001';
month = put(substr(var, 2), $custom_format.);
output;
var = 'B002';
month = put(substr(var, 2), $custom_format.);
output;
run;
Output:
var month
A001 JAN
B002 FEB
If you need something more complex, you can supply regex to formats in SAS.

Related

Can I use Perl-regular expressions in SAS to add imputations to date?

For context I'm a SAS programmer in clinical trials but I have this spec for variable ADTC.
If EC.ECDTC contains a full datetime, set ADTMC to the value of EC.ECDTC in "YYYY-MM-DD hh:mm" format. If EC.ECDTC contains a full or partial date but no time part then set ADTMC to the date part of EC.ECDTC in "YYYY-MM-DD" format. In both cases, replace any missing elements of the format with "XX", for example "2022-01-01 16:XX" or "2022-01-XX"
So currently I'm using this piece of code which is partially fine but not ideal
check=count(ecdtc,'-');
if check = 0 and ~missing(ecdtc) then adtc = cats(ecdtc,"-XX-XX");
else if check = 1 then adtc = cats(ecdtc,"-XX");
else if check = 2 then adtc = ecdtc;
Is there a way I could use perl-regular expressions to have like a template of the outline of the date/datetime and have it search through the values for that column and if they don't match to add -XX if missing day or -XX-XX if missing day and month etc. I was thinking of utilising prxchange but how do you incorporate the template so it knows to add -XX in the correct position where applicable.
SUBSTR on the left.
data want2;
set have;
length adtmc $16;
if length(ecdtc) le 10 then adtmc = 'xxxx-xx-xx';
else adtmc = 'xxxx-xx-xx xx:xx';
substr(adtmc,1,length(ecdtc))=ecdtc;
run;
Honestly, I wouldn't; regex are not faster for the most part than just straight-up checking with normal code, for simple things like this. If you have time pressure, or thousands or millions of rows... not a good idea, just use scan.
But that said, it's certainly possible, and somewhat interesting. We'll use PRXPOSN, which lets us iterate through the capture buffers, and "capture" each bit. This might need some tweaking, and you might need to capture/not capture the hyphens for example, but for my data this works - if your data is different, the regex will be different (and next time, post sample data!).
data have;
length ecdtc $16;
infile datalines truncover;
input #1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-02
2020-01
2020
junk
;;;;
run;
data want;
set have;
length adtmc $16;
array vals[3] $;
vals[1]='XXXX';
vals[2]='-XX';
vals[3]='-XX';
_rx = prxparse('/(\d{4})(-\d{2})?(-\d{2})?( \d{2}:\d{2})?/ios');
_rc = prxmatch(_rx,ecdtc); *this does the matching. Probably should check for value of _rc to make sure it matched before continuing.;
do _i = 1 to 4; *now iterate through the four capture buffers;
_rt = prxposn(_rx,_i,ecdtc);
if _i le 3 then vals[_i] = coalescec(_rt,vals[_i]);
else timepart = _rt; *we do the timepart outside the array since it needs to be catted with a space while the others do not, easier this way;
end;
adtmc = cats(of vals[*]); *cat them together now - if you do not capture the hyphen then use catx ('-',of vals[*]) instead;
if timepart ne ' ' then adtmc = catx(' ',adtmc,timepart); *and append the timepart after.;
run;

SAS: Converting numeric to character values

I am trying to convert datatime20. from numeric to character value.
Currently I have numeric values like this: 01Jan200:00:00:00 and I need to convert it to character values and received output like: 2020-01-01 00:00:00.0
What format and informat should be used in aboved ?
I have tried used PUT function to convert numeric to character and tried many option, each time receiving other format. Should be also use DHMS function before PUT ?
There is not a native format that produces that string exactly. But it it not hard to build it in steps using existing formats. Or you could use PICTURE statement in PROC FORMAT to build your own format.
If you don't really care about the time of day part of the datetime value then this is an easy and clearly understand way to convert the numeric variable DT with number of seconds into a new character variable in that style. Use DATEPART() to get the date (number of days) from the datetime value and then use the YYMMDD format to generate the 10 character string for the date and then just append the constant string of the formatted zeros.
length dt_string $21.;
dt_string = put(datepart(dt),yymmdd10.)||' 00:00:00.0';
If you need the time of day part then you could also use the TOD format.
dt_string = put(datepart(dt),yymmdd10.)||put(dt,tod11.1);
Or you could use the format E8601DT21.1 and then change the letter T between the date and time to a space instead.
dt_string = translate(put(dt,E8601DT21.1),' ','T');
If you want to figure out what formats exist for datetime values and what the formatted results look like you could run a little program to pull the formats from the meta data and apply them to a specific datetime value.
data datetime_formats;
length format $50 string $80 ;
set sashelp.vformat;
where fmttype='F';
where also fmtinfo(fmtname,'cat')='datetime';
keep format string fmtname maxw minw maxd ;
format=cats(fmtname,maxw,'.','-L');
string=putn('01Jan2020:01:02:03'dt,format);
run;
A custom format can be defined to return the result of a user defined function. Docs
proc format;
value <format-name> (default=<width>)
other = [<function-name>()]
;
run;
Example:
options cmplib=(sasuser.functions);
proc fcmp outlib=sasuser.functions.temporal;
function E8601DTS (datetime) $21;
return (
translate (putn(datetime,'E8601DT21.1'),' ','T')
);
endsub;
run;
proc format;
value E8601DTS (default=21)
other = [E8601DTS()]
;
run;
data have;
do dt = '01jan2020:0:0'dt to '10jan2020:0:0'dt by '60:00't;
output;
end;
format dt datetime16.;
run;
ods html file='function-based-format.html';
proc print data=have(obs=4); title 'stock E8601DT';
proc print data=have(obs=4); title 'custom E8601DTS';
format dt E8601DTS.;
run;
ods html close;

Converting to date5. format in SAS

I have a column which has mixed values of month and date (its in character $5 format).
date
7/23
5/23
23MAR
7/19
I want the data to come as uniform date5. format like this
date
23MAR
23MAY
23MAR
19JUL.
Here is the code that I'm using
data DAte_check4again;
set Date_2test;
format check_dt date5.;
check_dt=datepart(date);
run;
SAS stores DATE, TIME and DATETIME values as numbers. The DATEPART() function you are trying to use is for converting DATETIME values to DATE values. But your source variable is character with a length of 5. (FORMATs are just instructions for how to display values).
So your first problem will be to convert the string into a DATE value. You can then take the first 5 characters of the DATE. format and store that into either your original variable or some other variable. Assuming that the month/day values are for the current year and you only have those two styles of strings here is one method to generate a date and also the 5 character string.
data want;
set have ;
if index(date,'/') then date_ck = input(cats(date,'/',year(today())),mmmddyy10.);
else date_ck = input(cats(date,year(today())),date9.);
format date_ck date9.;
new_date = substr(put(date_ck,date9.),1,5);
run;

Convert character value to DATE MMDDYYYY

I have character variable as below
03211962
04181968
when i run this , the excEl output shows like
3211962
It removes 0.
I need to change this as DATE MMDDYYYY.
You can do this using the function mdy. Here is an example based on what you've provided.
data ds;
input dte $;
datalines;
03211962
04181968
;
run;
data ds;
set ds;
format date MMDDYY10.;
mnth = input(substr(dte,1,2),2.);
day = input(substr(dte,3,2),2.);
year = input(substr(dte,5,4),4.);
date = mdy(mnth,day,year);
run;
In the first data step I read in the two values as the character variable dte and then in the second data step I convert the values to their numeric counterparts. The line mnth = input(substr(dte,1,2),2.); is just taking the first 2 characters in the dte variable and converting it to a numeric variable. The mdy function takes numeric values for month, day and year.
I would recommend looking here for other ways to format the date.

SAS: Transform variable into time series in text file import - length greater than 32.767

I get a calendar file from a vendor containing all holidays for a specific calendar.
The file contain 7 columns separated by a pipe (|). However column 7 that contain the actual holiday comes in a string format separated by semi-colon (;).
My problem is that column 7 has a length greater than 32.767 - then the solution I have done so far using some array and transpose tricks doesn't work anymore.
Basically the text file looks like:
INTERNAL_NAME|ERROR_CODE|NUMBER_OF_FIELDS|CALENDAR_CODE|CALENDAR_TYPE|CALENDAR_NAME|DATES
US|0|4|US|Country|United States|;2;15728;1;5;19440101;5;19440102;5;19440103;5;19440108;5;19440109......etc.
However column 7 is delivered in a nice format so that the size of the array/matrix is given and the delimiter is given at the start of the string.
*1st charachter = delimiter -> ;
*Number of dimensions in matrix -> 2
*Number of rows in matrix -> 15.728
*Number of columns -> 1
*Data elements + Data -> 5 = Date and Data=01JAN1944 etc.
My desired result would be a dataset looking like
INTERNAL_NAME DATES
US 01JAN1944
US 02JAN1944
US 03JAN1944
US 08JAN1944
etc. until 15.728 observations is read.....
You can do this fairly easily.
The manual solution, i.e., assuming the fields are just as you say they are, is to use the secondary delimiter (;) and then you can parse that initial string on your own later since it's known to be shorter. Then iterate the inputs of that string, using # to hold the line.
data want;
infile datalines4 dlm=';' truncover;
length initial_string $500;
input initial_String $ #;
input dim row col #;
do _n_ = 1 by 1 until (missing(holiday_date));
input col_type holiday_Date #;
if not missing(holiday_date) then output;
end;
datalines4;
US|0|4|US|Country|United States|;2;15728;1;5;19440101;5;19440102;5;19440103;5;19440108;5;19440109
;;;;
run;
If you want to use that information that tells you about the delimiter/etc. to drive the readin, you could do that, but it would take two passes on the data file (unless it has a limited set of possibilities and you could just use if/else branching with those limited set of input statements). One pass would read just that part, then call a macro to read in the rest in a separate data step. But if this is always the format of the file, and you don't really care about those fields - you just have to work with them being there - the above is probably better as it's faster and less complicated.