How to rename variables without using their original names? - sas

I have a data set that I am uploading to sas. There are always 4 variables in the exact same order. The problem is sometimes the variables could have slightly different names.
For example the first variable user . The next day i get the same dataset, it might be userid . . . So I cannot use rename(user=my_user)
Is there any way i could refer to the variable by their order . . something like this
rename(var_order_1=my_user) ;
rename(var_order_3=my_inc) ;
rename _ALL_=x1-x4 ;

There are a few ways to do this. One is to determine the variable names from PROC CONTENTS or dictionary.columns and generate rename statements.
data have;
input x1-x4;
datalines;
1 2 3 4
5 6 7 8
;;;;
run;
%macro rename(var=,newvar=);
rename &var.=&newvar.;
%mend rename;
data my_vars; *the list of your new variable names, and their variable number;
length varname $10;
input varnum varname $;
datalines;
1 FirstVar
2 SecondVar
3 ThirdVar
4 FourthVar
;;;;
run;
proc sql; *Create a list of macro calls to the rename macro from joining dictionary.columns with your data. ;
* Dictionary.columns is like proc contents.;
select cats('%rename(var=',name,',newvar=',varname,')')
into :renamelist separated by ' '
from dictionary.columns C, my_vars M
where C.memname='HAVE' and C.libname='WORK'
and C.varnum=M.varnum;
quit;
proc datasets;
modify have;
&renamelist; *use the calls;
quit;
Another is to put/input the data using the input stream and the _INFILE_ automatic variable (that references the current line in the input stream). Here's an example. You would of course keep only the new variables if you wanted.
data have;
input x1-x4;
datalines;
1 2 3 4
5 6 7 8
;;;;
run;
data want;
set have;
infile datalines truncover; *or it will go to next line and EOF prematurely;
input #1 ##; *Reinitialize to the start of the line or it will eventually EOF early;
_infile_=catx(' ',of _all_); *put to input stream as space delimited - if your data has spaces you need something else;
input y1-y4 ##; *input as space delimited;
put _all_; *just checking our work, for debugging;
datalines; *dummy datalines (could use a dummy filename as well);
;;;;
run;

Here is another approach using the dictionary tables..
data have;
format var1-var4 $1.;
call missing (of _all_);
run;
proc sql noprint;
select name into: namelist separated by ' ' /* create macro var */
from dictionary.columns
where libname='WORK' and memname='HAVE' /* uppercase */
order by varnum; /* should be ordered by this anyway */
%macro create_rename(invar=);
%do x=1 %to %sysfunc(countw(&namelist,%str( )));
/* OLDVAR = NEWVARx */
%scan(&namelist,&x) = NEWVAR&x
%end;
%mend;
data want ;
set have (rename=(%create_rename(invar=&namelist)));
put _all_;
run;
gives:
NEWVAR1= NEWVAR2= NEWVAR3= NEWVAR4=

Related

SAS - Loop through rows and calculate MD5

I want to sweep each table in a libname and calculate a hash over each row.
For that purpose, i have already a table with libname, memname, concatenated columns with ',' and number of observations
libname
memname
columns
num_obs
lib_A
table_A
col1a,col2a...colna
1
lib_A
table_B
col1b,col2b...colnb
2
lib_B
table_C
col1c,col2c...colnc
1
I first get all data into ranged macro variables (i think its easier to work, but could be wrong, ofc)
proc sql;
select libname, memname, columns, num_obs
into :lib1-, :tab1-, :column1-, :sqlobs1-
from have
where libname="&sel_livraria"; /*macro var = prompt from user*/
quit;
Just for developing guideline i made the code just to check one specific table without getting the row number of it since with a simple counter doesn't work (i get the order of the rows mess up each time i run) and it works for that purpose
%let lib=lib_A;
%let tab=table_B;
%let columns=col1b,col2b,colnb;
data want;
length check $32.;
format check $hex32.;
set &lib..&tab;
libname="&lib";
memname="&tab";
check = md5(cats(&columns));
hash = put(check,$hex32.);
keep libname memname hash;
put hash;
put _all_;
run;
So, what’s the best approach for getting a MD5 from each row (same order as tables) of all tables in a libname? I saw problems i couldn’t overcame using data steps, procs or macros.
The result i wanted if lib_A was selected in prompt were something like:
libname
memname
obs_row
hash
lib_A
table_A
1
64A29CCA15F53C83A9583841294A26AA
lib_A
table_B
1
80DAC7B9854CF71A67F9C00A7EC4D9EF
lib_A
table_B
2
0AC44CD79DAB2E33C93BB2312D3A9A40
Need some help.
Tks in advance.
You're pretty close. This is how I would approach it. We'll create a macro with three parameters: data, lib, and out. data is the dataset you have with the column information. lib is the library you want to pull from your dataset, and out is the output dataset that you want to have.
We'll read each column into an individual macro variable:
memname1
memname2
memname3
libname1
libname2
libname3
etc.
From here, we simply need to loop over all of the macro variables and apply them where appropriate. We can easily count how many there are in a data step. All we need to do is add double-ampersands to resolve them correctly. For more information on why this is, check out this MWSUG paper.
%macro get_md5(data=, lib=, out=);
/* Save all variables into macro variables:
memname1 memname2 ...
columns1 columns2 ...
*/
data _null_;
set &data.;
where upcase(libname)=upcase("&lib.");
call symputx(cats('memname', _N_), memname);
call symputx(cats('columns', _N_), columns);
call symputx(cats('obs', _N_), obs);
call symputx('n_datasets', _N_);
run;
/* Loop through all the datasets and access each macro variable */
%do i = 1 %to &n_datasets.;
/* Double ampersand needed:
First, resolve &i. to get &memname1
Then resolve &mename1 to get the value stored in the macro variable memname1
*/
%let memname = &&memname&i.;
%let columns = &&columns&i.;
%let obs = &&obs&i.;
/* Calculate md5 in a temporary dataset */
data _tmp_;
length lib $8.
memname $32.
obs_row 8.
hash $32.
;
set &lib..&memname.(obs=&obs.);
lib = "&lib.";
memname = "&memname.";
obs_row = _N_;
hash = put(md5(cats(&columns.)), $hex32.);
keep libname memname obs_row hash;
run;
/* Overwrite the dataset so we don't keep appending */
%if(&i. = 1) %then %do;
data &out.;
set _tmp_;
run;
%end;
%else %do;
proc append base=&out. data=_tmp_;
run;
%end;
%end;
/* Remove temporary data */
proc datasets lib=work nolist;
delete _tmp_;
quit;
%mend;
Example:
data have;
length libname memname columns $15.;
input libname$ memname$ columns$ obs;
datalines;
sashelp cars make,model,msrp 1
sashelp class age,height,name 2
sashelp comet dose,length,sample 1
;
run;
%get_md5(data=have, lib=sashelp, out=want);
Output:
libname memname obs_row hash
sashelp cars 1 258DADA4843E7068ABAF95667E881B7F
sashelp class 1 29E8F4F03AD2275C0F191FE3DAA03778
sashelp class 2 DB664382B88BE7E445418B1A1C8CE13B
sashelp comet 1 210394B77E7696506FDEFD78890A8AB9
I would make a macro that takes as input the four values in your metadata dataset. Note that commas are anathema to SAS programs, especially macro code, so make the macro so it can accept space delimited variable lists (like normal SAS program statements do).
To reduce the risk of name conflict I will name the variable using triple underscores and then rename them back to human friendly names when the dataset is written.
%macro next_ds(libname,memname,num_obs,varlist);
data next_ds;
length ___1 $8 ___2 $32 ___3 8 ___4 $32 ;
___1 = "&libname";
___2 = "&memname";
___3 + 1;
set &libname..&memname(obs=&num_obs keep=&varlist);
___4 = put(md5(cats(of &varlist)),$hex32.);
keep ___1-___4 ;
rename ___1=libname ___2=memname ___3=obs_row ___4=hash;
run;
%mend next_ds;
Let's make some test metadata that reference datasets everyone should have.
data have;
infile cards truncover ;
input libname :$8. memname :$32. num_obs columns $200.;
cards;
sashelp class 3 name,sex,age
sashelp cars 2 make,model
;
And make sure the target dataset does not already exists.
%if %sysfunc(exist(want)) %then %do;
proc delete data=want; run;
%end;
Now you can call that macro once for each observation in your source metadata dataset. There is no need to generated oodles of macro variables. Instead you can use CALL EXECUTE() to generate the macro calls directly from the dataset.
We can replace the commas in the column lists when making the macro call. You can add in a PROC APPEND step after each macro call to aggregate the results into a single dataset.
data _null_;
set have;
call execute(cats(
'%nrstr(%next_ds)(',libname,',',memname,',',num_obs
,',',translate(columns,' ',','),')'
));
call execute('proc append data=next_ds base=want force; run;');
run;
Notice that wrapping the macro call in %NRSTR() makes the SAS log easier to read.
1 + %next_ds(sashelp,class,3,name sex age)
2 + proc append data=next_ds base=want force; run;
3 + %next_ds(sashelp,cars,2,make model)
4 + proc append data=next_ds base=want force; run;
Results:
Obs libname memname obs_row hash
1 sashelp class 1 5425E9CEDA1DDEB71B2692A3C7050A8A
2 sashelp class 2 C532D227D358A3764C2D225DC8C02D18
3 sashelp class 3 13AD5F1517E0C4494780773B6DC15211
4 sashelp cars 1 777C60693BF5E16F38706C89301CD0A8
5 sashelp cars 2 07080C9321145395D1A2BCC10FBE6B83
Note that CATS() might not be the best method for generating the string to pass to the MD5() function. That can generate the same string for different combinations of the source variables. For example 'AB' || 'CD' is the same as 'A' || 'BCD'. Perhaps just use CAT() instead.
Stu's approach is nice, and will work most of the time but will fall over when you have wiiiide variables, a large number of variables, variables with large precision, and other edge cases.
So for the actual hashing part, you might consider this macro, which is extensively tested within Data Controller for SAS:
https://core.sasjs.io/mp__md5_8sas.html
Usage:
data _null_;
set sashelp.class;
hashvar=%mp_md5(cvars=name sex, nvars=age height weight);
put hashvar=;
run;

Mixed Delimiters in Proc Export

Is there a method to make the first delimiter in an observation different to the rest? In Microsoft SQL Server Integration Services (SSIS), there is an option to set the delimiter per column. I wonder if there is a similar way to achieve this in SAS with an amendment to the below code, whereby the first delimiter would be tab instead and the rest pipe:
proc export
dbms=csv
data=mydata.dataset1
outfile="E:\OutPutFile_%sysfunc(putn("&sysdate9"d,yymmdd10.)).txt"
replace
label;
delimiter='|';
run;
For example
From:
var1|var2|var3|var4
to
var1 var2|var3|var4
...Where the large space between var1 and var2 is a tab.
Many thanks in advance.
Sounds like you just want to make a new variable that has the first two variables combined and then write that out using tab delimiter.
data fix ;
length new1 $50 ;
set have ;
new1=catx('09'x,var1,var2);
drop var1 var2 ;
run;
proc export data=fix ... delimiter='|' ...
Note that you can reference a variable in the DLM= option on the FILE statement in a data step.
data _null_;
dlm='09'x ;
file 'outfile.txt' dsd dlm=dlm ;
set have ;
put var1 # ;
dlm='|' ;
put var2-var4 ;
run;
Or you could use the catx() trick in a data _null step. You also might want to use vvalue() function to insure formats are applied.
data _null_;
length newvar $200;
file 'outfile.txt' dsd dlm='|' ;
set have ;
newvar = catx('09'x,vvalue(var1),vvalue(var2));
put newvar var3-var4 ;
run;
Updated Fixed order of delimiters to match question.
Final code based on the marked answer by Tom:
data _null_;
dlm='09'x ;
file "E:\outputfile_%sysfunc(putn("&sysdate9"d,yymmdd10.)).txt" dsd dlm=dlm ;
set work.have;
put
var1 # ;
dlm='|';
put var2 var3 var4;
run;

Replace Missing Values with Mean of column in SAS

I have a data set in SAS that has multiple columns that have missing data. This post replaces all the missing values in the entire data set with zeros. But since it goes through the entire data set you can't just replace the zero with the mean or median for that column. How do I replace missing data with the mean of that column?
There are only 5 or so columns so the script doesn't need to go through the entire data set.
PROC STDIZE has an option to do just this. The REPONLY option tells it you want it to only replace missing values, and METHOD=MEAN tells it how you want to replace those values. (PROC EXPAND also could be used, if you are using time series data, but if you're just using mean, STDIZE is the simpler one.)
For example:
data missing_class;
set sashelp.class;
if _N_=5 then call missing(age);
if _N_=7 then call missing(height);
if _N_=9 then call missing(weight);
run;
proc stdize data=missing_class out=imputed_class
method=mean reponly;
var age height weight;
run;
Ideally, you would want to use PROC MI to do multiple imputation and get a more accurate representation of missing values; however, if you wish to use the average, and alternate way of doing so can be done with PROC MEANS and a data step.
/* Set up data */
data have(index=(sex) );
set sashelp.class;
if(_N_ IN(3,7,9,12) ) then call missing(height);
run;
/* Calculate mean of all non-missing values */
proc means data=have noprint;
by sex;
output out=means mean(height) = imp_height;
run;
/* Merge avg. values with original data */
data want;
merge have
means;
by sex;
if(missing(height) ) then height = imp_height;
drop imp_height;
run;
You can use the mean function in proc sql to replace only the missing observations in each column:
data temp;
input var1 var2 var3 var4 var5;
datalines;
. 2 3 4 .
6 7 8 9 10
. 12 . . 15
16 17 18 19 .
21 . 23 24 25
;
run;
proc sql;
create table temp2 as select
case when missing(var1) then mean(var1) else var1 end as var1,
case when missing(var2) then mean(var2) else var2 end as var2,
case when missing(var3) then mean(var3) else var3 end as var3,
case when missing(var4) then mean(var4) else var4 end as var4,
case when missing(var5) then mean(var5) else var5 end as var5
from temp;
quit;
And, as Joe mentioned, you can use coalesce instead if you prefer that syntax:
coalesce(var1, mean(var1)) as var1

SAS loop through datasets

I have multiple tables in a library call snap1:
cust1, cust2, cust3, etc
I want to generate a loop that gets the records' count of the same column in each of these tables and then insert the results into a different table.
My desired output is:
Table Count
cust1 5,000
cust2 5,555
cust3 6,000
I'm trying this but its not working:
%macro sqlloop(data, byvar);
proc sql noprint;
select &byvar.into:_values SEPARATED by '_'
from %data.;
quit;
data_&values.;
set &data;
select (%byvar);
%do i=1 %to %sysfunc(count(_&_values.,_));
%let var = %sysfunc(scan(_&_values.,&i.));
output &var.;
%end;
end;
run;
%mend;
%sqlloop(data=libsnap, byvar=membername);
First off, if you just want the number of observations, you can get that trivially from dictionary.tables or sashelp.vtable without any loops.
proc sql;
select memname, nlobs
from dictionary.tables
where libname='SNAP1';
quit;
This is fine to retrieve number of rows if you haven't done anything that would cause the number of logical observations to differ - usually a delete in proc sql.
Second, if you're interested in the number of valid responses, there are easier non-loopy ways too.
For example, given whatever query that you can write determining your table names, we can just put them all in a set statement and count in a simple data step.
%let varname=mycol; *the column you are counting;
%let libname=snap1;
proc sql;
select cats("&libname..",memname)
into :tables separated by ' '
from dictionary.tables
where libname=upcase("&libname.");
quit;
data counts;
set &tables. indsname=ds_name end=eof; *9.3 or later;
retain count dataset_name;
if _n_=1 then count=0;
if ds_name ne lag(ds_name) and _n_ ne 1 then do;
output;
count=0;
end;
dataset_name=ds_name;
count = count + ifn(&varname.,1,1,0); *true, false, missing; *false is 0 only;
if eof then output;
keep count dataset_name;
run;
Macros are rarely needed for this sort of thing, and macro loops like you're writing even less so.
If you did want to write a macro, the easier way to do it is:
Write code to do it once, for one dataset
Wrap that in a macro that takes a parameter (dataset name)
Create macro calls for that macro as needed
That way you don't have to deal with %scan and troubleshooting macro code that's hard to debug. You write something that works once, then just call it several times.
proc sql;
select cats('%mymacro(name=',"&libname..",memname,')')
into :macrocalls separated by ' '
from dictionary.tables
where libname=upcase("&libname.");
quit;
&macrocalls.;
Assuming you have a macro, %mymacro, which does whatever counting you want for one dataset.
* Updated *
In the future, please post the log so we can see what is specifically not working. I can see some issues in your code, particularly where your macro variables are being declared, and a select statement that is not doing anything. Here is an alternative process to achieve your goal:
Step 1: Read all of the customer datasets in the snap1 library into a macro variable:
proc sql noprint;
select memname
into :total_cust separated by ' '
from sashelp.vmember
where upcase(memname) LIKE 'CUST%'
AND upcase(libname) = 'SNAP1';
quit;
Step 2: Count the total number of obs in each data set, output to permanent table:
%macro count_obs;
%do i = 1 %to %sysfunc(countw(&total_cust) );
%let dsname = %scan(&total_cust, &i);
%let dsid=%sysfunc(open(&dsname) );
%let nobs=%sysfunc(attrn(&dsid,nobs) );
%let rc=%sysfunc(close(&dsid) );
data _total_obs;
length Member_Name $15.;
Member_Name = "&dsname";
Total_Obs = &nobs;
format Total_Obs comma8.;
run;
proc append base=Total_Obs
data=_total_obs;
run;
%end;
proc datasets lib=work nolist;
delete _total_obs;
quit;
%mend;
%count_obs;
You will need to delete the permanent table Total_Obs if it already exists, but you can add code to handle that if you wish.
If you want to get the total number of non-missing observations for a particular column, do the same code as above, but delete the 3 %let statements below %let dsname = and replace the data step with:
data _total_obs;
length Member_Name $7.;
set snap1.&dsname end=eof;
retain Member_Name "&dsname";
if(NOT missing(var) ) then Total_Obs+1;
if(eof);
format Total_Obs comma8.;
run;
(Update: Fixed %do loop in step 2)

SAS : keep if exists

I have several databases, one per geographical variables, that I want to append in the end. I am doing some data steps on them. As I have large databases, I select only the variables I need when I first call each table. But on tables in which one variable always equals 0, the variable is not in the table.
So when I select my (keep=var) in a for loop, it works fine if the variable exists, but it produces an error in the other case, so that these tables are ignored.
%do i=1 to 10 ;
data temp;
set area_i(keep= var1 var2);
run;
proc append base=want data=temp force;
run;
%end;
Is there a simple way to tackle that ?
In fact I have found a solution : the DKRICOND (or DKROCOND) options specify the level of error detection to report when a variable is missing from respectively an input (or output) data set during the processing of a DROP=, KEEP=, or RENAME= data set option.
The options are DKRICOND=ERROR | WARN | WARNING | NOWARN | NOWARNING, so you just wave to set
dkricond=warn
/*your program, in my case :*/
%do i=1 to 10 ;
data temp;
set area_i(keep= var1 var2);
run;
proc append base=want data=temp force;
run;
%end;
dkricond=error /* the standard value, probably better to set it back after/ */
How about just adding it to the table if it doesn't already exist?
/*look at dictionary.columns to see if the column already exists*/
proc sql;
select name into :flag separated by ' ' from dictionary.columns where libname = 'WORK' and memname = 'AREA_I' and name = 'VAR1';
run;
/*if it doesn't, then created it as empty*/
%if &flag. ne VAR1 %then %do;
data area_i;
set area_i;
call missing(var1);
run;
%end;