modifying character variable contents based on lookup table in SAS - replace

HAVE is a wide dataset with names stored in the variables name1-name250. Here are the first two obs and several vars:
episode name1 name2 name3 name4 name5 ...
121 DETWEILER.TJ.M BLUMBERG.MIKEY GRISWOLD.GUS.N
451 BOB.KING KID.HUSTLER FINSTER.MS PRICKLEY.PETEY GRISWOLD.GUS
...
Some of the names need to be corrected. The corrections are stored in the dataset FIXES:
goodname badname
DETWEILER.TJ DETWEILER.TJ.M
GRISWOLD.GUS GRISWOLD.GUS.N
I simply need to find the badname values from FIXES that appear in HAVE and replace them with the matching goodname. I currently loop through name1-name250 in a data step, with one condition per row in FIXES, to accomplish this:
data WANT;
set HAVE;
array name {*} name1-name250;
do i=1 to dim(name);
if name{i} = "DETWEILER.TJ.M" then name{i} = "DETWEILER.TJ";
else if name{i} = "GRISWOLD.GUS.N" then name{i} = "GRISWOLD.GUS";
/*manually add other corrections from FIXES dataset*/
else name{i} = name{i};
end;
run;
This feels really inefficient. What is a better way?

When you have a simple exact-match translation like that, a FORMAT is a simple way to implement it. You can convert your "lookup" data into a format.
data fixes ;
input goodname :$30. badname :$30. ;
cards;
DETWEILER.TJ DETWEILER.TJ.M
GRISWOLD.GUS GRISWOLD.GUS.N
;
data format ;
retain fmtname '$FIXNAME' ;
set fixes;
rename badname=start goodname=label;
run;
proc format cntlin=format;
run;
Then just use the format to convert the names.
data want;
set have;
array name name1-name5;
do over name;
name=put(name,$fixname30.);
end;
run;
Result:
episode name1 name2 name3 name4 name5
121 DETWEILER.TJ BLUMBERG.MIKEY GRISWOLD.GUS
451 BOB.KING KID.HUSTLER FINSTER.MS PRICKLEY.PETEY GRISWOLD.GUS
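To check that the lookup pairs actually loaded, a minimal sketch along these lines may help. It assumes the $FIXNAME format created above is in the WORK catalog; the variable names before and after are hypothetical and used only for illustration.

```sas
/* List the stored ranges and labels of the $FIXNAME format */
proc format fmtlib;
  select $fixname;
run;

/* Spot-check one name that is in FIXES and one that is not */
data _null_;
  length before after $30;
  do before = 'DETWEILER.TJ.M', 'BOB.KING';
    after = put(before, $fixname30.);
    put before= after=;
  end;
run;
```

Because the format defines no OTHER range, a name with no match in FIXES should come back unchanged, which is why the data step needs no ELSE branch.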

SAS function to every observation (finance xirr)

I have an sql table like this one
id | payment | date |
______|_____________|________________________|
obs1 | -20,10,13 | 21184,22765,22704 |
And so on (1M+ observations). I prepared all the data for finance() in SQL, so in SAS I just need to take the values and pass them to the function. I am confident that the data I prepared will return the right answer.
The problem is that I can't find a proper way to calculate the function over the entire dataset. Right now I loop row by row, passing the data to macro variables through proc sql, BUT I can't get a string longer than 1000 characters, so my program isn't working.
I am running the following function:
finance('XIRR', payment, date, 0.15);
Can you help me please? Thanks
The code I had before the answer. It ran unacceptably long!
%macro eir (input_data, cash_var, dt_var, output_data);
data rawdata;
set &input_data(dbmax_text=32000);
run;
proc sql noprint;
select count(*) into :n from rawdata ;
quit;
%let n = 100;
%do j=1 %to &n;
data x;
set rawdata(firstobs = &j obs= &j);
run;
proc sql noprint;
select &cash_var into: cf from x;
select &dt_var into: dt from x;
quit;
data x;
set x;
r= finance('xirr', &cf, &dt, 0.15);
drop &cash_var &dt_var;
run;
data out;
set %if &j>1 %then %do; out %end; x;
run;
%end;
proc append base = &output_data data=out;
run;
proc datasets nolist;
delete x out rawdata;
run;
%mend eir;
%eir(input_data = have, cash_var = pmt, dt_var = dt, output_data = ggg);
Took 20 minutes to calculate 50,000 rows
and now it's just
data want;
set have(dbmax_text=32000);
eir = input(resolve(catx(',','%sysfunc(finance(XIRR',pmt,dt,'0.15),hex16)')),hex16.);
run;
Took 6 minutes to calculate 1,400,000 rows
Tom just saved our project =)
The FINANCE() function wants a list of values, not a character string. You could parse the string and convert the text back into numbers and pass those to the function. But if the lengths of the lists vary from observation to observation that will cause issues.
You could use the macro processor to help you. You can generate a call to %sysfunc(finance()) and read the generated string back into a numeric variable.
It also might work to pad the short lists with zero payments on the last recorded date.
Let's make some test data.
data have ;
infile cards dsd dlm='|' ;
length id $20 payment date $100 ;
input id payment date;
cards;
obs1 | -20,10,13 | 21184,22765,22704
obs2 | -20,10 | 21184,22765
;
Now let's try converting it two ways: one by creating numeric variables to pass to the FINANCE() function call, and the other by generating a %sysfunc(finance()) call, so that we can make sure the %sysfunc() call is working properly.
data want;
set have ;
array v (3) _temporary_;
array d (3) _temporary_;
do i=1 to dim(v);
v(i)=coalesce(input(scan(payment,i,','),32.),0);
d(i)=input(scan(date,i,','),32.);
if missing(d(i)) and i>1 then d(i)=d(i-1);
end;
drop i;
value1=finance('XIRR',of v(*),of d(*),0.15);
value2=input(resolve(catx(',','%sysfunc(finance(XIRR',payment,date,'0.15),hex16)')),hex16.);
run;
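The HEX16 round trip used for value2 can be tried in isolation. The sketch below is only an illustration: it swaps in SQRT as a stand-in for FINANCE (an assumption, chosen because it needs no test data) and compares the value that comes back through the macro processor with the value computed directly.

```sas
data _null_;
  /* Single quotes keep the macro call from resolving at compile time; */
  /* RESOLVE runs it while the data step executes.                     */
  x = input(resolve('%sysfunc(sqrt(2),hex16.)'), hex16.);
  y = sqrt(2);
  same = (x = y);   /* 1 if the round trip lost no precision */
  put x= y= same=;
run;
```

Writing the result with the HEX16. format and reading it back with the HEX16. informat carries the floating-point representation through as text, rather than a rounded decimal string.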
Here's my best guess based on the limited details you've provided. I think you need to split out each date and payment into separate variables before you can call the finance function, e.g.:
data have;
infile datalines dlm='|';
input id :$8. amount :$20. date :$20.;
datalines;
obs1 | -20,10,13 | 21184,22765,22704
;
run;
data want;
set have;
array dates[3] d1-d3;
array amounts[3] a1-a3;
do i = 1 to 3;
amounts[i] = input(scan(amount, i, ','), 8.);
dates[i] = input(scan(date, i, ','), 8.);
end;
XIRR = finance('XIRR', of a1-a3, of d1-d3, 0.15);
run;
I suspect this will only work if you have the same number of dates and payments in every row; otherwise you will run into array out-of-bounds issues or problems with the IRR calculation.

SAS group by first digits

I have a variable in SAS with a lot of numbers, for example 11000, 30129, 11111, 30999. I want to group these by the first two digits so that "11000 and 11111" and "30129 and 30999" will each be in their own table.
It's quite simple.
You have to create a second column and extract the first 2 digits.
Then sort the dataset by this second column.
data test;
infile datalines dsd ;
input a : 15. ;
datalines;
11000,
30129,
11111,
309999,
;
run;
data test_a;
length val_a $2;
set test;
val_a= SUBSTRN(a,1,2);
run;
proc sort data=test_a out=test_b;
by val_a;
run;
Result will be :
val_a a
11 11000
11 11111
30 30129
30 309999
And then you can create 2 datasets by selecting on val_a like this:
data data_11 data_30;
set test_b;
/* val_a is character, so compare against quoted values */
if val_a = '11' then output data_11;
else if val_a = '30' then output data_30;
run;
Regards,
I think I did the same as you, but my new column only shows ".". But I think your answer can help me anyway, thank you!
data books;
infile "&path\Boken.csv" dlm=';' missover dsd firstobs=2;
input ISBN: $12.
Book: $quote150.;
run;
data test_a;
format val_ISBN 15.;
set books;
val_ISBN= SUBSTRN(ISBN,1,2);
run;
proc sort data=test_a out=test_b;
by val_ISBN;
run;
proc print data=test_b (obs=10) noobs ;
run;

SAS replace character in ALL columns

I have a SAS dataset that I have to export to a .csv-file. I have the following two contradicting requirements.
I have to use the semicolon as the delimiter in the .csv-file.
Some of the character variables are manually inputted strings from formulas, hence they may contain semicolons.
My solution to the above is to either escape the semicolon or to replace it with a comma.
How can I, in a nice, clean and efficient way use e.g. tranwrd on an entire dataset?
My attempt:
For each variable, use the tranwrd(.., ";", ",") function on a variable in the data set. Update the dataset and loop through all variables. This, however, is naturally a very inefficient way of doing it for even semi-large datasets, since I have to do a datastep for each variable. The code for it is quite ugly, since I have to get the variable names by a few steps, but the inefficiency definitely takes the cake.
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
* Get the variable names;
proc contents data=test out=vars(keep=name type varnum) order=varnum noprint;
run;
* Sort by variable number;
proc sort data=vars;
by varnum;
run;
* Put variable names into a space-separated string;
proc sql noprint;
select compress(name)
into :name_list separated by ' '
from vars;
quit;
%let len = %sysfunc(countw(&name_list));
*Initialize loop dataset;
data a;
set test;
run;
%macro loop;
%do i = 1 %to &len;
%let j = %scan(&name_list,&i);
data a(rename=(v_&j = &j) drop=&j);
set a;
v_&j.=compress(tranwrd(&j,";",","));
run;
%end;
%mend;
%loop;
I think I may have a more elegant solution to your problem:
data class;
set sashelp.class;
array vars [*] _character_;
do i = 1 to dim(vars);
vars[i] = compress(tranwrd(vars[i],"a","X"));
end;
drop i;
run;
You can use an array to reference all character columns in your data set and then loop through them.
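Applied to the semicolon case from the question, the same array idea might look like the sketch below. It assumes the TEST dataset from the question; the output name test_clean is hypothetical. TRANSLATE is swapped in for TRANWRD here because only a single character is being replaced.

```sas
data test_clean;
  set test;
  /* _character_ picks up every character variable read by SET */
  array cvars [*] _character_;
  do i = 1 to dim(cvars);
    /* TRANSLATE(source, to, from): turn each ';' into ',' */
    cvars[i] = translate(cvars[i], ',', ';');
  end;
  drop i;
run;
```

This makes one pass over the data regardless of how many character variables there are, instead of one data step per variable.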
The most widely used standard for csv files whose fields can contain delimiters is to quote fields that contain them, and double up any quotes. In SAS you can do this automatically using the dlm and dsd options in a put statement:
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb" 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
data _null_;
set test;
file "c:\temp\test.csv" dsd dlm=';';
put (_ALL_) (&);
run;
This results in the following semicolon-delimited csv (minus a header row, but that's a separate issue):
"Aaa;;";50;11;1;222;"a;s"
"Bbb""";35;12;2;250;qw
Comma,;75;13;3;foo;zx
Sorry, didn't notice your comment about the workaround until after I posted this. I'll leave it here in case anyone finds it helpful.
Fields in a properly formatted delimited file are quoted. PROC EXPORT will do that. There is no need to change the data.
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
filename FT45F001 temp;
proc export data=test outfile=FT45F001 dbms=csv;
delimiter=';';
run;
data _null_;
infile FT45F001;
input;
list;
run;
proc import replace datafile=FT45F001 dbms=csv out=test2;
delimiter=';';
run;
proc print;
run;
proc compare base=test compare=test2;
run;

SAS: Creating dummy variables from categorical variable

I would like to turn the following long dataset:
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
Into a wide dataset that looks like this:
ID Ankle Shoulder Head
1 1 1 0
2 1 0 1
3 0 1 1
This answer seemed the most relevant but was falling over at the proc freq stage (my real dataset is around 1 million records, and has around 30 injury types):
Creating dummy variables from multiple strings in the same row
Additional help: https://communities.sas.com/t5/SAS-Statistical-Procedures/Possible-to-create-dummy-variables-with-proc-transpose/td-p/235140
Thanks for the help!
Here's a basic method that should work easily, even with several million records.
First you sort the data, then add a count variable that is always 1. Next you use PROC TRANSPOSE to flip the data from long to wide. Then fill in the missing values with 0. This is a fully dynamic method: it doesn't matter how many different Injury types you have or how many records per person. There are other methods that probably need less code, but I think this is simple and easy to understand and modify if required.
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
proc sort data=test;
by id injury;
run;
data test2;
set test;
count=1;
run;
proc transpose data=test2 out=want prefix=Injury_;
by id;
var count;
id injury;
idlabel injury;
run;
data want;
set want;
array inj(*) injury_:;
do i=1 to dim(inj);
if inj(i)=. then inj(i) = 0;
end;
drop _name_ i;
run;
Here's a solution involving only two steps... Just make sure your data is sorted by id first (the injury column doesn't need to be sorted).
First, create a macro variable containing the list of injuries
proc sql noprint;
select distinct injury
into :injuries separated by " "
from have
order by injury;
quit;
Then, let RETAIN do the magic -- no transposition needed!
data want(drop=i injury);
set have;
by id;
format &injuries 1.;
retain &injuries;
array injuries(*) &injuries;
if first.id then do i = 1 to dim(injuries);
injuries(i) = 0;
end;
do i = 1 to dim(injuries);
if injury = scan("&injuries",i) then injuries(i) = 1;
end;
if last.id then output;
run;
EDIT
Following OP's question in the comments, here's how we could use codes and labels for injuries. It could be done directly in the last data step with a label statement, but to minimize hard-coding, I'll assume the labels are entered into a sas dataset.
1 - Define Labels:
data myLabels;
infile datalines dlm="|" truncover;
informat injury $12. labl $24.;
input injury labl;
datalines;
S460|Acute meniscal tear, medial
S520|Head trauma
;
2 - Add a new query to the existing proc sql step to prepare the label assignment.
proc sql noprint;
/* Existing query */
select distinct injury
into :injuries separated by " "
from have
order by injury;
/* New query */
select catx("=",injury,quote(trim(labl)))
into :labls separated by " "
from myLabels;
quit;
3 - Then, at the end of the data want step, just add a label statement.
data want(drop=i injury);
set have;
by id;
/* ...same as before... */
* Add labels;
label &labls;
run;
And that should do it!

Populate SAS variable based on content of another variable

I have a variable, textvar, that looks like this:
type=1&name=bob
type=2&name=sue
I want to create a new table that looks like this:
type name
1 bob
2 sue
My approach is to use scan to split the variables on & so for the first observation I have
var1 var2
type=1 name=bob
So now I can use scan again to split on =:
vname = scan(var1, 1, '=');
value = scan(var1, 2, '=');
But how can I now assign value to the variable named vname?
PROC TRANSPOSE is the quickest way. You need an ID variable (dummy or real).
data test;
informat testvar $50.;
input testvar $;
datalines;
type=1&name=bob
type=2&name=sue
;;;;
run;
data test_vert;
set test;
id+1;
length scanner $20 vname vvalue $20;
scanner=scan(testvar,1,"&");
do _t=2 by 1 until (scanner=' ');
vname=scan(scanner,1,"=");
vvalue=scan(scanner,2,"=");
output;
scanner=scan(testvar,_t,"&");
end;
run;
proc transpose data=test_vert out=test_T;
by id;
id vname;
var vvalue;
run;
Does this help? Dynamic variable names in SAS
I think I have some code to address this, but left it at my workplace.
Obviously you haven't included your real data, but can't you just hard code some of the values if the format of the raw data is the same in each row? My code converts the "=" and "&" to "," to make the scan function easier to use.
data want (keep=type name);
set test;
_newvar=translate(testvar,",,","&=");
type=input(scan(_newvar,2),best12.);
length name $20;
name=scan(_newvar,4);
run;