Create numeric value from character value in sas - sas

I want to convert a code like "13232C" to a numeric value. Maybe assign values 1 to 26 for A to Z. Then the new code would be "132323".

This code will work if there is just one letter in the code. If there are more, you will need to scan through each one to get its value. I've calculated the letter value (1-26) by subtracting 64 from the ASCII value (A=65), making sure to convert the letter to upper case if necessary. I've also assumed that the letter always appears at the end of the string.
data have;
input code $;
datalines;
132323C
24578D
5147896G
;
run;
data want;
set have;
new_code=input(cats(compress(code,,'dk'),rank(compress(upcase(code),,'ak'))-64),best12.);
run;
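If a code can contain more than one letter, one possible approach is to walk through the string character by character and build the digit string as you go. This is only a sketch under that assumption: the dataset name want_multi and the helper variables digits, ch and i are made up for illustration, and letters J to Z contribute two digits each, so the mapping cannot be reversed.

```sas
data want_multi;
  set have;
  length digits $32;          /* hypothetical work variable holding the digit string */
  digits = ' ';
  do i = 1 to length(code);
    ch = upcase(char(code, i));   /* current character, upper-cased */
    /* letters become 1-26 (ASCII value minus 64); everything else is copied as-is */
    if 'A' <= ch <= 'Z' then digits = cats(digits, rank(ch) - 64);
    else digits = cats(digits, ch);
  end;
  new_code = input(digits, 32.);  /* convert the assembled digits to a number */
  drop i ch digits;
run;
```

SAS numerics lose precision beyond roughly 15 digits, so for long codes it may be safer to keep the result as a character variable instead of calling input().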

Keith's solution is probably better for most uses, but I can't help seeing this as a good chance to play with PROC FCMP (Function Compile). This works nicely in the case where you have A-I only; starting with J it won't work since I'm only allowing a single character's space. If it can have 2 digits, the FCMP would need to be changed to do what Keith's solution does.
proc fcmp outlib=work.funcs.trial;
function cton(charvar $) $;
do n = 1 to length(charvar);
if 48 le rank(char(charvar,n)) le 57 then ;
else substr(charvar,n,1) = put(rank(upcase(char(charvar,n)))-64,1.);
put charvar;
end;
return (charvar);
endsub;
quit;
options cmplib=work.funcs;
data test;
x="23456CAB";
y = cton(x);
put x= y=;
run;
I also return it as a character, but that's not important - you could return it as a numeric if you prefer (I saw the " " in the original question).

Related

Can I use Perl-regular expressions in SAS to add imputations to date?

For context I'm a SAS programmer in clinical trials but I have this spec for variable ADTC.
If EC.ECDTC contains a full datetime, set ADTMC to the value of EC.ECDTC in "YYYY-MM-DD hh:mm" format. If EC.ECDTC contains a full or partial date but no time part then set ADTMC to the date part of EC.ECDTC in "YYYY-MM-DD" format. In both cases, replace any missing elements of the format with "XX", for example "2022-01-01 16:XX" or "2022-01-XX"
So currently I'm using this piece of code which is partially fine but not ideal
check=count(ecdtc,'-');
if check = 0 and ~missing(ecdtc) then adtc = cats(ecdtc,"-XX-XX");
else if check = 1 then adtc = cats(ecdtc,"-XX");
else if check = 2 then adtc = ecdtc;
Is there a way I could use Perl regular expressions to define a template for the date/datetime layout, search through the values in that column, and, where a value doesn't match, add -XX if the day is missing, or -XX-XX if the day and month are missing, and so on? I was thinking of using prxchange, but how do you incorporate the template so it knows to add -XX in the correct position where applicable?
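Use SUBSTR on the left side of the assignment: start from the fully imputed template and overwrite its leading characters with whatever ECDTC contains.
data want2;
set have;
length adtmc $16;
if length(ecdtc) le 10 then adtmc = 'XXXX-XX-XX';
else adtmc = 'XXXX-XX-XX XX:XX';
substr(adtmc,1,length(ecdtc))=ecdtc;
run;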
Honestly, I wouldn't; for simple things like this, regexes are for the most part no faster than straight-up checking with normal code. If you have time pressure, or thousands or millions of rows, regex is not a good idea; just use scan.
But that said, it's certainly possible, and somewhat interesting. We'll use PRXPOSN, which lets us iterate through the capture buffers, and "capture" each bit. This might need some tweaking, and you might need to capture/not capture the hyphens for example, but for my data this works - if your data is different, the regex will be different (and next time, post sample data!).
data have;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-02
2020-01
2020
junk
;;;;
run;
data want;
set have;
length adtmc $16;
array vals[3] $;
vals[1]='XXXX';
vals[2]='-XX';
vals[3]='-XX';
_rx = prxparse('/(\d{4})(-\d{2})?(-\d{2})?( \d{2}:\d{2})?/ios');
_rc = prxmatch(_rx,ecdtc); *this does the matching. Probably should check for value of _rc to make sure it matched before continuing.;
do _i = 1 to 4; *now iterate through the four capture buffers;
_rt = prxposn(_rx,_i,ecdtc);
if _i le 3 then vals[_i] = coalescec(_rt,vals[_i]);
else timepart = _rt; *we do the timepart outside the array since it needs to be catted with a space while the others do not, easier this way;
end;
adtmc = cats(of vals[*]); *cat them together now - if you do not capture the hyphen then use catx ('-',of vals[*]) instead;
if timepart ne ' ' then adtmc = catx(' ',adtmc,timepart); *and append the timepart after.;
run;
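For comparison, the scan-based route mentioned above might look roughly like this. It is a sketch, not a tested solution for real EC data: it assumes the date and time are separated by a blank, as in the sample data here, and the dataset name want_scan and the underscore-prefixed helper variables are invented for the example. It does no validation, so a value such as "junk" simply comes out as "junk-XX-XX".

```sas
data want_scan;
  set have;
  length adtmc $16 _datepart _timepart $16 _y _m _d _hh _mi $4;
  /* split into date and time pieces on the blank */
  _datepart = scan(ecdtc, 1, ' ');
  _timepart = scan(ecdtc, 2, ' ');
  /* scan returns blank when a piece is absent, so coalescec supplies the placeholder */
  _y = coalescec(scan(_datepart, 1, '-'), 'XXXX');
  _m = coalescec(scan(_datepart, 2, '-'), 'XX');
  _d = coalescec(scan(_datepart, 3, '-'), 'XX');
  adtmc = catx('-', _y, _m, _d);
  /* only append a time when some time part was collected */
  if not missing(_timepart) then do;
    _hh = coalescec(scan(_timepart, 1, ':'), 'XX');
    _mi = coalescec(scan(_timepart, 2, ':'), 'XX');
    adtmc = catx(' ', adtmc, catx(':', _hh, _mi));
  end;
  drop _:;   /* drop every helper variable whose name starts with an underscore */
run;
```

With the sample rows above, "2020-01" would become "2020-01-XX" and "2020-01-01 01:02" would pass through unchanged.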

Adding columns to a dataset in SAS using a for loop

I'm coming at SAS from a Python/R/Stata background, and learning that things are rather different in SAS. I'm approaching the following problem from the standpoint of one of these languages, perhaps SAS isn't up to what I want to do.
I have a panel dataset with an age column in it. I want to add new columns to the dataset using this age column. I'm going to simplify the functions of age to keep it simple in my example.
The goal is to loop over a sequence, and use the value of that sequence at each loop step to 1. assign the name of the new column and 2. assign the values of that column. I'm hoping to get my starting dataset, with new columns added to it taking values spline1 spline2... spline7
data somePath.FinalDataset;
do i = 1 to 7;
if i = 1 then
spline&i. = age;
if i ^= 1 then spline&i. = age + i;
end;
set somePath.StartingDataset;
run;
This code won't even run, though in an earlier version I was able to get it to run, but the new columns had their values shifted down one row from what they should have been. I include this code block as pseudocode of what I'm trying to do. Any help is much appreciated.
One way to do this in SAS is with arrays. A SAS array can be used to reference a group of variables, and it can also create variables.
data have;
input age;
cards;
5
10
;
run;
data want;
set have;
array spline{7}; *create spline1 spline2 ... spline7;
do i=1 to 7;
if i = 1 then spline{i} = age;
else spline{i} = age + i;
end;
drop i;
run;
Spline{i} refers to the ith variable of the array named spline.
i is a regular variable; the DROP statement prevents it from being written to the output dataset.
When you say new columns were "shifted by one," note that spline1=age and spline2=age+2. You can change your code accordingly, e.g. if you want spline2=age+1, you could change your else statement to else spline{i} = age + i - 1 ; It is also possible to change the array statement to define it with 0 as the lower bound, rather than 1.
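As a rough illustration of the zero-based option, the array bounds can be written as {0:6}. The variable names spline0-spline6 and the dataset name want0 are assumptions made for this sketch; the variables are listed explicitly so the names line up with the subscripts.

```sas
data want0;
  set have;
  /* subscripts run from 0 to 6 and map onto spline0 ... spline6 */
  array spline{0:6} spline0-spline6;
  do i = 0 to 6;
    spline{i} = age + i;   /* spline0 = age, spline1 = age + 1, ... */
  end;
  drop i;
run;
```

With this layout the special case for the first element disappears, because adding 0 leaves age unchanged.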
Arrays are likely the best way to solve this, but I will demonstrate a macro approach, which is necessary in some cases.
SAS separates its doing-things-with-data language from its writing-code language into the 'data step language' and the 'macro language'. They don't really talk to each other during a data step, because the macro language runs during the compilation stage (before any data is processed) while the data step language runs during the execution stage (while rows of data are being processed).
In any event, for something like this it's quite possible to write a macro to do what you want. Borrowing Quentin's general structure and initial dataset:
data have;
input age;
cards;
5
10
;
run;
%macro make_spline(var=, count=);
%local i;
%do i = 1 %to &count;
%if &i=1 %then &var.&i. = &var.;
%else &var.&i. = &var. + &i.;
; *this semicolon ends the assignment statement;
%end;
/* You end up with the IF statement generating:
age1 = age
and the extra semicolon after the if/else generates the ; for that line, making it
age1 = age;
etc. for the other lines.
*/
%mend make_spline;
data want;
set have;
%make_spline(var=age,count=7);
run;
This would then perform what you're looking to perform. The looping is in the macro language, not in the data step. You can assign parameters however you see fit; I prefer to have parameters like above, or even more (start loop could also be a parameter, and in fact the assignment code could be a parameter!).
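A variant with a few more parameters could be sketched as follows. The macro name make_spline2 and the parameters prefix, start and stop are hypothetical, and the assignment rule (the offset from the start of the loop) is a simplification chosen for the example, not the original rule.

```sas
%macro make_spline2(var=, prefix=spline, start=1, stop=7);
  %local i;
  %do i = &start %to &stop;
    /* generates e.g. spline1 = age + 0; spline2 = age + 1; ... */
    &prefix.&i. = &var. + %eval(&i - &start);
  %end;
%mend make_spline2;

data want2;
  set have;
  %make_spline2(var=age, prefix=spline, start=1, stop=7)
run;
```

Because there is no %if/%else here, the semicolon after the assignment is ordinary generated text and ends each data step statement directly.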

SAS Perl Regular Expressions: How to write correct syntax?

I have some complicated string parsing which would be very difficult to accomplish using regular SAS functions because of the inconsistency of the string values; as a result I think I will need to use Perl Regular Expressions. Below are 4 variables (price, date, size, bundle) which I have to create using parts of the text string. I'm having trouble getting the syntax correct - I am new to regular expressions.
Here is a sample data set.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;run;
/* The first variable is price; it is normally located near the end or middle of the string */
data want;
set have;
price =(input(prxchange('s/(\w+)_(\d+)_(\w+)/$2/',-1,text),8.))/100;
format price dollar8.2;
run;
Using the data set above I need to have this result:
price
0
79.99
89.99
89.99
79.99
64.99
/* Date is always a series of consecutive digits: either 6, 7 or 8. Using | which means 'or' I thought I would be able to pull it that way */
data want;
set have;
date=prxparse('/\d\d\d\d\d\d|\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d/',text);
run;
Using the data set above I need to have this result:
Date
1192014
112014
2102014
272014
12252014
462014
1192014
12162013
/* For size there is always an 'x' in the middle of the sub-string, with two or three digits on either side */
data want;
set have;
size=prxparse('/(\w+)_(\d+)'x'(\d+)_(\w+)/',text);
run;
Size
728x90
160x600
300x250
160x600
728x90
/* This is normally located towards the beginning of the string. It's always a single digit number followed by an x. It is never followed by additional digits but can also be just 0. */
data want;
set have;
Bundle=prxparse('/(\d+)'x'',text);
run;
Bundle
0
3x
3x
3X
3x
0
2x
3x
The final product I am looking for should look like this:
Text                                                        Date      price  Size     Bundle
acq_newsale_0_CartChat_0_Flash_1192014.jpg                  1192014   0               0
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf              112014    79.99           3x
acq_sale_3xconoffer_8999_nacpg_2102014.sfw                  2102014   89.99           3x
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp        272014    89.99  728x90   3X
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov           12252014         160x600  3x
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg     462014           300x250  0
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf  1192014   79.99  160x600  2x
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf       12162013  64.99  728x90   3x
If you're extracting, don't use PRXCHANGE. Use PRXPARSE, PRXMATCH, and PRXPOSN.
Sample usage, with date:
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;
run;
data want;
set have;
rx_date = prxparse('~(\d{6,8})~io');
rc_date = prxmatch(rx_date,text);
if rc_date then datevar = prxposn(rx_date,1,text);
run;
Just enclose in parens the section you want to extract (in this case, all of it).
Date was easy - as you say, 6-8 numbers. The others may be harder. The 3x etc. bit you can probably find, depending on how strict you need to be; the price I think you'll have a very hard time finding. You need to be able to better articulate the rules. "Towards the beginning" isn't a regex rule. "The second set of digits" is; "The second to last set", perhaps might work. I'll see if I can figure out a few.
In your example data, this works. I in particular don't like the price search; that one may well fail with a more complicated set of data. You can figure out adding the decimal for yourself.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
blahblah :23 blahblah
blahblahblah 23 blah blah
;
run;
data want;
set have;
rx_date = prxparse('~_(\d{6,8})[_\.]~io');
rx_price = prxparse('~_(\d+)_.*?(?=_\d+[_\.]).*?(?!_\d+[_\.])~io');
rx_bundle = prxparse('~(?!_\d+_)_(\dx)~io');
rx_size = prxparse('~_(\d+x\d+)[_\.]~io');
rx_adnum = prxparse('~\s:?(\d\d)\s~io');
rc_date = prxmatch(rx_date,text);
rc_price = prxmatch(rx_price,text);
rc_bundle = prxmatch(rx_bundle,text);
rc_size = prxmatch(rx_size,text);
rc_adnum = prxmatch(rx_adnum,text);
if rc_date then datevar = prxposn(rx_date,1,text);
if rc_price then price = prxposn(rx_price,1,text);
if rc_bundle then bundle = prxposn(rx_bundle,1,text);
if rc_size then size = prxposn(rx_size,1,text);
if rc_adnum then adnum = prxposn(rx_adnum,1,text);
run;
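Adding the decimal could be done in a follow-up step along these lines. This is a sketch that assumes the price variable produced above is a character string of digits expressed in cents; the names want_price and price_num are made up for the example.

```sas
data want_price;
  set want;
  /* price holds text such as 7999; read it as a number and shift two decimals */
  if not missing(price) then price_num = input(price, 8.) / 100;
  format price_num dollar8.2;
run;
```

Rows where no price was captured keep a missing price_num instead of being turned into 0.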

How to combine text and numbers in catx statement

The variable upc is already defined in my cool dataset. How do I convert it to a macro variable? I am trying to combine both text and numbers. For example blah should equal upc=123;
data cool;
set cool;
blah = catx("","upc=&upc","ccc")
run;
If upc is a numeric variable and you just want to include its value into some character string then you don't need to do anything special. Concatenation function will convert it into character before concatenating automatically:
data cool;
set cool;
blah = catx("","upc=",upc,"ccc");
run;
The result:
upc----blah
123 upc= 123ccc
BTW, if you want to concatenate strings without blanks between them, you can use function CATS(), which strips all leading and trailing spaces from each argument.
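A small illustration of the difference, using a made-up value of upc and hypothetical variable names a and b: CATS joins the stripped arguments directly, while CATX places the given delimiter between them.

```sas
data _null_;
  upc = 123;                            /* numeric value, converted automatically */
  a = cats("upc=", upc);                /* no separator between the pieces */
  b = catx("-", "upc", upc, "ccc");     /* hyphen inserted between the pieces */
  put a= b=;
run;
```

The log should show a=upc=123 and b=upc-123-ccc.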
The following test code works for my SAS 9.3 x64 PC.
Please note that:
1. symputx() provides the connection between dataset and macro variables.
2. cats() will be more appropriate than catx() if delimiting characters are not needed.
3. If you do not need to create a new data set, data _NULL_ is fine.
You can check the log to see that the correct values are being assigned.
Bill
data a;
input ID $ x y @@;
datalines;
A 1 10 A 2 20 A 3 30
;
run;
options SymbolGen MPrint MLogic MExecNote MCompileNote=all;
data _NULL_;
set a;
call symputx(cats("blah",_N_),cats(ID,x),"G");
run;
%put blah1=&blah1;
%put blah2=&blah2;
%put blah3=&blah3;

Categorical variables with macro

I am trying to create categorical variables in sas. I have written the following macro, but I get an error: "Invalid symbolic variable name xxx" when I try to run. I am not sure this is even the correct way to accomplish my goal.
Here is my code:
%macro addvars;
proc sql noprint;
select distinct coverageid
into :coverageid1 - :coverageid9999999
from save.test;
%do i=1 %to &sqlobs;
%let n=coverageid&i;
%let v=%superq(&n);
%let f=coverageid_&v;
%put &f;
data save.test;
set save.test;
%if coverageid eq %superq(&v)
%then &f=1;
%else &f=0;
run;
%end;
%mend addvars;
%addvars;
You're combining macro code with data step code in a way that isn't correct. %if = macro language, meaning you are actually evaluating whether the text "coverageid" is equal to the text that %superq(&v) evaluates to, not whether the contents of the coverageid variable equal the value in &v. You could just convert %if to if, but even if you got that to work properly it would be hideously inefficient (you're rewriting the dataset N times, so if you have 1500 values for coverageID you rewrite the entire 500MB dataset or whatnot 1500 times, instead of just once).
If what you want to do is take the variable 'coverageid' and convert it to a set of 1/0 binary variables, one for each possible value of coverageid, there are a number of ways to do it. I'm fairly sure the ETS module has a procedure that just does this, but I don't recall it off the top of my head - if you were to post this to the SAS mailing list, one of the guys there would undoubtedly have it quickly.
The simple way, for me, is to do this entirely with datastep code. First determine how many potential values there are for COVERAGEID, then assign each to a direct value, then assign the value to the correct variable.
If the COVERAGEID values are consecutive (ie, 1 to some number, no skips, or you don't mind skipping) then this is easy - set up an array and iterate over it. I will assume they are NOT consecutive.
*First, get the distinct values of coverageID. There are a dozen ways to do this, this works as well as any;
proc freq data=save.test;
tables coverageid/out=coverage_values(keep=coverageid);
run;
*Then save them into a format. This converts each value to a consecutive number (so the lowest value becomes 1, the next lowest 2, etc.) This is not only useful for this step, but it can be useful in the future in converting back.;
data coverage_values_fmt;
set coverage_values;
start=coverageid;
label=_n_;
fmtname='COVERAGEF';
type='i';
call symputx('CoverageCount',_n_);
run;
*Import the created format;
proc format cntlin=coverage_values_fmt;
quit;
*Now use the created format. If you had already-consecutive values, you could skip to this step and skip the input statement - just use the value itself;
data save.test_fin;
set save.test;
array coverageids coverageid1-coverageid&coveragecount.;
do _t = 1 to &coveragecount.;
if input(coverageid,COVERAGEF.) = _t then coverageids[_t]=1;
else coverageids[_t]=0;
end;
drop _t;
run;
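For the simpler consecutive case mentioned earlier, no format is needed. The following is a sketch with made-up test data, assuming coverageid is numeric and takes the values 1 to 5; the names test_consec, test_consec_fin and flags are hypothetical.

```sas
data test_consec;
  input coverageid @@;
  cards;
1 3 2 5 4 2
;
run;

data test_consec_fin;
  set test_consec;
  array flags{5} coverageid_1-coverageid_5;
  do _t = 1 to dim(flags);
    /* a comparison evaluates to 1 when true and 0 when false */
    flags{_t} = (coverageid = _t);
  end;
  drop _t;
run;
```

The dataset is still read and written only once, which is the main point of moving the loop into the data step.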
Here's another way that doesn't use formats, and may be easier to follow.
First, just make some test data:
data test;
input coverageid @@;
cards;
3 27 99 105
;
run;
Next, create a data set with no observations but one variable for each level of coverageid. Note that this approach allows arbitrary values here.
proc transpose data=test out=wide(drop=_name_);
id coverageid;
run;
Finally, create a new data set that combines the initial data set and the wide one. Then, for each level of coverageid, look at each categorical variable and decide whether to turn it "on".
data want;
set test wide;
array vars{*} _:;
do i=1 to dim(vars);
vars{i} = (coverageid = substr(vname(vars{i}),2));
end;
drop i;
run;
The line
vars{i} = (coverageid = substr(vname(vars{i}),2));
may require more explanation. vname returns the name of the variable, and since we didn't specify a prefix in proc transpose, all variables are named something like _1, _2, etc. So we take the substring of the variable name that starts in the second position, and compare it to coverageid; if they're the same, we set the variable to 1; otherwise it evaluates to 0.