I have a database where 12,000 variables are named "A0122_40", "A0122_45", "A0122_50" and so on. I would like to rename them by keeping in the initial name the numbers from 2 to 5. I would then like to create variables adding all the columns with the same name.
CASE 1
If you want to join two tables that share column names and keep all the data:
data class;
set sashelp.class (obs=2);
rename name=A0122_40
Sex=A0122_45
Weight=A0122_50
Height=A0122_55
;run;
Prepared Table:
+----------+----------+-----+----------+----------+
| A0122_40 | A0122_45 | Age | A0122_55 | A0122_50 |
+----------+----------+-----+----------+----------+
| Alfred | M | 14 | 69 | 112.5 |
| Alice | F | 13 | 56.5 | 84 |
+----------+----------+-----+----------+----------+
Code to rename:
%macro renameCols(lb , ds);
proc sql noprint;
select name
into :rn_vr1-
from dictionary.columns
where LIBNAME=upcase("&lb") AND MEMNAME=upcase("&ds")
AND NAME LIKE 'A0122%'
;
%let num_vars = &sqlobs.;
proc datasets library=&lb nolist nodetails nowarn;
modify &ds;
rename
%do i=1 %to &num_vars;
&&rn_vr&i=new_&&rn_vr&i
%end;
;
quit;
%mend renameCols;
%renameCols(work,class);
Result Table:
+--------------+--------------+-----+--------------+--------------+
| new_A0122_40 | new_A0122_45 | Age | new_A0122_55 | new_A0122_50 |
+--------------+--------------+-----+--------------+--------------+
| Alfred | M | 14 | 69 | 112.5 |
| Alice | F | 13 | 56.5 | 84 |
+--------------+--------------+-----+--------------+--------------+
As you can see, all columns were renamed except Age.
CASE 2
If you want to "append" all A0122_... columns into one, I suggest the following code:
data class;
set sashelp.class (obs=2);
rename name=A0123_40
Sex=A0123_45
Weight=A0122_50
Height=A0121_55
;
run;
Prepared Table:
+----------+----------+-----+----------+----------+
| A0123_40 | A0123_45 | Age | A0121_55 | A0122_50 |
+----------+----------+-----+----------+----------+
| Alfred | M | 14 | 69 | 112.5 |
| Alice | F | 13 | 56.5 | 84 |
+----------+----------+-----+----------+----------+
Code to rename:
%macro renameCols(lb , ds);
%macro d;
%mend d;
proc sql noprint;
select name
into :rn_vr1-
from dictionary.columns
where LIBNAME=upcase("&lb") AND MEMNAME=upcase("&ds")
AND NAME LIKE 'A012%'
;
%let num_vars = &sqlobs.;
/*Create a lot of tables with one column from one input*/
data
%do i=1 %to &num_vars;
&&rn_vr&i (rename=(&&rn_vr&i=%scan(&&rn_vr&i,1,_)) keep = &&rn_vr&i )
%end;
;
set &lb..&ds.;
run;
/*Count variable patterns and write it to macro*/
proc sql noprint;
select distinct scan(name,1,'_')
into :aggr_vars1-
from dictionary.columns
where LIBNAME=upcase("&lb") AND MEMNAME=upcase("&ds")
AND NAME LIKE 'A012%'
;
%let num_aggr_vars=&sqlobs.;
/*Append all tables that contains same column name pattern*/
%do i=1 %to &num_aggr_vars;
data _&&aggr_vars&i;
set &&aggr_vars&i:;
n=_n_;
run;
%end;
/*Merge that tables into one*/
data res (drop= n);
merge
%do i=1 %to &num_aggr_vars;
_&&aggr_vars&i
%end;
;
by n;
run;
%mend renameCols;
%renameCols(work,class);
The res table:
+-------+-------+--------+
| A0121 | A0122 | A0123 |
+-------+-------+--------+
| 69 | 112.5 | Alfred |
| 56.5 | 84 | Alice |
| . | . | M |
| . | . | F |
+-------+-------+--------+
Is this what you are looking for:
proc contents data=have out=cols noprint;
run;
proc sql noprint;
select distinct substr(name,1,6) into :colgrps separated by " " from cols;
quit;
%macro process;
data want;
set have;
%let ii = 1;
%do %while (%scan(&colgrps, &ii, %str( )) ~= );
%let grp = %scan(&colgrps, &ii, %str( ));
&grp._sum = sum(of &grp.:);
%let ii = %eval(&ii + 1);
%end;
run;
%mend;
%process;
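To see what the name-prefix list does in isolation, here is a minimal sketch on a made-up data set. The data set names demo and demo_sum, the variable total_A0122, and the values are assumptions for illustration only.

```sas
/* Hypothetical demo data: three columns sharing the prefix A0122_ */
data demo;
  input A0122_40 A0122_45 A0122_50;
  datalines;
1 2 3
4 . 6
;
run;

data demo_sum;
  set demo;
  /* "of A0122_:" expands to every variable whose name starts with A0122_ */
  /* SUM ignores missing values, so the second row totals 4 + 6 */
  total_A0122 = sum(of A0122_:);
run;

proc print data=demo_sum;
run;
```

The total is given a name that does not start with the prefix, so it cannot be picked up by the prefix list itself.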
Related
I have table in SAS with missing values like below:
col1 | col2 | col3 | ... | coln
-----|------|------|-----|-------
111 | | abc | ... | abc
222 | 11 | C1 | ... | 11
333 | 18 | | ... | 12
... | ... | ... | ... | ...
I need to drop from the table above every variable with 80% or more missing values (>=80%).
How can I do that in SAS?
The macro below creates a macro variable named &drop_vars that holds the list of variables in your dataset that meet or exceed a missing-value threshold. It works for both character and numeric variables. With a very large number of variables, the generated select list may exceed the maximum length of a macro variable value (65,534 characters) and the macro will fail, but it can easily be modified to handle any number of variables. You can save and reuse this macro.
%macro get_missing_vars(lib=, dsn=, threshold=);
%global drop_vars;
/* Generate a select statement that calculates the proportion missing:
nmiss(var1)/count(*) as var1, nmiss(var2)/count(*) as var2, ... */
proc sql noprint;
select cat('nmiss(', strip(name), ')/count(*) as ', strip(name) )
into :calculate_pct_missing separated by ','
from dictionary.columns
where libname = upcase("&lib")
AND memname = upcase("&dsn")
;
quit;
/* Calculate the percent missing */
proc sql;
create table pct_missing as
select &calculate_pct_missing.
from &lib..&dsn.
;
quit;
/* Convert to a long table */
proc transpose data=pct_missing out=drop_list;
var _NUMERIC_;
run;
/* Get a list of variables to drop that are >= the drop threshold */
proc sql noprint;
select _NAME_
into :drop_vars separated by ' '
from drop_list
where COL1 GE &threshold.
;
quit;
%mend;
It has three parameters:
lib: Library of your dataset
dsn: Dataset name without the library
threshold: Proportion of missing values a variable must meet or exceed to be dropped
For example, let's generate some sample data and use this. col1 col2 col3 all have 80% missing values.
data have;
array col[10];
do i = 1 to 10;
do j = 1 to 10;
col[j] = i;
if(i > 2 AND j in(1, 2, 3) ) then col[j] = .;
end;
output;
end;
drop i j;
run;
We'll run the macro and check the log:
%get_missing_vars(lib=work, dsn=have, threshold=0.8);
%put &drop_vars;
The log shows:
col1 col2 col3
Now we can pass this into a simple data step.
data want;
set have;
drop &drop_vars;
run;
I have a dataset with several variables like the one below:
Data have(drop=x);
call streaminit(1);
do x = 1 to 20 by 1;
if x < 11 then group = 'A';
else group = 'B';
var1 = rand('normal',0,1);
var2 = rand('uniform');
output;
end;
Run;
In my analysis I need to get some summary stats using PROC MEANS and output the results for each variable into one dataset. I tried doing it with the code below, but it only includes stats from the first variable in the dataset. How can I output the remaining variables into the same dataset?
Proc means data=have n sum mean;
By group;
Output out=want(drop=_freq_ _type_) n=n sum=sum mean=mean;
Run;
Output:
+-------+----+----------+----------+
| group | n | sum | mean |
+-------+----+----------+----------+
| A | 10 | 4.517081 | 0.451708 |
+-------+----+----------+----------+
| B | 10 | -0.77369 | -0.07737 |
+-------+----+----------+----------+
Desired output:
+----------+-------+----+----------+----------+
| variable | group | n | sum | mean |
+----------+-------+----+----------+----------+
| var1 | A | 10 | 4.517081 | 0.451708 |
+----------+-------+----+----------+----------+
| var1 | B | 10 | -0.77369 | -0.07737 |
+----------+-------+----+----------+----------+
| var2 | A | 10 | 7.947089 | 0.794709 |
+----------+-------+----+----------+----------+
| var2 | B | 10 | 5.003049 | 0.500305 |
+----------+-------+----+----------+----------+
You requested SAS to name the count n, the sum sum and the mean mean.
It can only do that for one variable.
This is the syntax to ask SAS to use different names for the statistics of each variable:
Output out=want(drop=_freq_ _type_)
n(var1 var2)=n1 n2
sum(var1 var2)=sum1 sum2
mean(var1 var2)=mean1 mean2;
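Put together with the sample data from the question, a complete step might look like the sketch below. The output data set name want_wide and the explicit VAR statement are additions for illustration, not part of the original answer.

```sas
/* Sketch: one row per group, one column per statistic and variable */
proc means data=have n sum mean;
  by group;            /* have is already ordered by group */
  var var1 var2;       /* name the analysis variables explicitly */
  output out=want_wide(drop=_freq_ _type_)
    n(var1 var2)=n1 n2
    sum(var1 var2)=sum1 sum2
    mean(var1 var2)=mean1 mean2;
run;
```

This produces a wide layout (n1, n2, sum1, ...), which is why a transpose is still needed to reach the desired one-row-per-variable shape.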
To get that output you will need to transpose the data. Either transpose beforehand and add the _NAME_ variable to the BY or CLASS statement:
data have;
call streaminit(1);
do x = 1 to 20 by 1;
if x < 11 then group = 'A';
else group = 'B';
var1 = rand('normal',0,1);
var2 = rand('uniform');
output;
end;
run;
proc transpose data=have out=tall;
by group x;
run;
proc means data=tall nway n sum mean;
by group;
class _name_;
var col1; /* without this, x would also be analyzed and would receive the n/sum/mean names */
output out=want(drop=_freq_ _type_) n=n sum=sum mean=mean;
run;
Or use /autoname and transpose the resulting dataset from one observation per GROUP to multiple observations.
proc means data=have(drop=x) nway n sum mean;
by group;
output out=wide(drop=_freq_ _type_) n= sum= mean= /autoname;
run;
proc transpose data=wide out=tall;
by group;
run;
data tall ;
set tall ;
stat=scan(_name_,-1,'_');
_name_=substrn(_name_,1,length(_name_)-length(stat) -1);
rename _name_=varname;
run;
proc sort data=tall;
by group varname;
run;
proc transpose data=tall out=want(drop=_name_);
by group varname ;
id stat;
var col1;
run;
proc print data=want;
run;
I have a macro array &start_num
+-------+
| start |
+-------+
| 25.5 |
| 33.5 |
| 42.5 |
| 54.5 |
| 98 |
+-------+
but when I am using
%put %scan(&start_num,1);
it returns: 25
%put %scan(&start_num,2);
it returns: 533
Why does this happen, and how can I fix it?
You can specify the delimiter in the %scan function.
For example, if the macro variable was initialized from the variable start in a table:
proc sql noprint;
select start into:start_num separated by ' ' from have;
quit;
START_NUM=25.5 33.5 42.5 54.5 98
Or in %let statement:
%let START_NUM=25.5 33.5 42.5 54.5 98;
Using %scan function:
%let var1 = %scan(&start_num,1,%str( ));
%put &=var1;
VAR1=25.5
%let var2 = %scan(&start_num,2,%str( ));
%put &=var2;
VAR2=33.5
%scan treats the period as one of its default delimiters, which is why you get 25 and 533 respectively. Check out the example below:
%let start_num= 25.533.545.554.598;
%let var1 = %scan(&start_num,1);
%put value with dot as separator: &var1;
/* &var1 yields 25 */
%let var2 = %scan(&start_num,2);
%put value with dot as separator: &var2;
/* &var2 yields 533 */
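To walk through every item of a blank-separated list without the period being treated as a separator, a small macro loop can pass the blank delimiter explicitly to both countw and %scan. This is only a sketch; the macro name show_items and its parameter list are hypothetical.

```sas
%macro show_items(list);
  %local i n item;
  /* Count words using a blank as the only delimiter */
  %let n = %sysfunc(countw(&list, %str( )));
  %do i = 1 %to &n;
    /* Extract the i-th item, again splitting on blanks only */
    %let item = %scan(&list, &i, %str( ));
    %put item &i = &item;
  %end;
%mend show_items;

%show_items(25.5 33.5 42.5 54.5 98)
```

Each value, decimal part included, is written to the log on its own line.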
I have two datasets, one for male and one for female, which contain identical variables. I need to find the percent difference between the sexes on each variable by group.
The datasets look something like this, but with more variables and groups,
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | F | 8 | 5 |
| 2 | F | 6 | 3 |
| 3 | F | 7 | 0 |
|-------+-----+------+------|
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | M | 9 | 7 |
| 2 | M | 8 | 5 |
| 3 | M | 6 | 3 |
|-------+-----+------+------|
The result I need is this:
| Group | percent_diffA | percent_diffB |
|-------+---------------+---------------|
| 1 | -0.117647059 | -0.333333333 |
| 2 | -0.285714286 | -0.5 |
| 3 | 0.153846154 | -2 |
|-------+---------------+---------------|
I could solve this via a merge by renaming each variable.
data difference;
merge
females (rename = (VarA = VarA_F VarB = VarB_F))
males (rename = (VarA = VarA_M VarB = VarB_M))
;
by group;
percent_diffA = (VarA_F - VarA_M) / ( (VarA_F + VarA_M) / 2 );
percent_diffB = (VarB_F - VarB_M) / ( (VarB_F + VarB_M) / 2 );
drop sex;
run;
However, this approach requires me to rename everything manually. With several variables, the rename statement becomes cumbersome. Unfortunately, this calculation is being interjected into some old code, so renaming the original datasets is not practical.
I'm wondering if there is another way to solve this problem which is less cumbersome.
EDIT: I have updated the variable names because that appears to have caused people confusion. They were originally called Var1 and Var2. They are now VarA and VarB. The real variable names are descriptive, for instance body_weight_g or gonadal_somatic_index. The variables are not simply listed with sequential numbers.
For a data set that contains variables that are sequentially numbered there is variable list syntax for renaming the whole range of variables:
This example creates sample data with 100 variables.
data have1 have2;
do group = 1 to 100;
sex = 'M';
array var(100);
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 42 then output have1;
sex = 'F';
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 100-42 then output have2;
end;
run;
The rename option works on all 100 variables.
data want;
merge
have1(rename=(var1-var100=mvar1-mvar100) in=_M)
have2(rename=(var1-var100=fvar1-fvar100) in=_F)
;
by group;
if _M & _F & first.group & last.group then do;
array one mvar1-mvar100;
array two fvar1-fvar100;
array results result1-result100;
do i = 1 to dim(results);
diff = one(i) - two(i);
mean = mean (one(i), two(i));
results(i) = diff / mean * 100;
end;
end;
keep group result:;
run;
Shenglin's answer is a nice and concise use of SQL.
An alternative method is constructing a macro variable specifying the renames to be used in the rename DSO (data set option). This can be done with an SQL query to the dictionary table containing the column names.
* This macro creates the macro variable rename_<inset>, to be used in a rename statement or data set option ;
* It will be of the form: var1 = var1_suffix var2 = var2_suffix ... ;
* &inset is the input set. &suffix is the suffix to be added to all variables except for the variables specified in &keys. ;
* &keys variables should each be given in quotation marks, and separated by spaces. ;
%macro rename_list(inset, suffix, keys) ;
%global rename_&inset ; /* So that this macro variable is accessible outside the macro */
proc sql ;
select strip(name) || ' = ' || strip(name) || "_&suffix"
into :rename_&inset separated by ' '
from sashelp.vcolumn /* dictionary.columns can be used in place of sashelp.vcolumn */
where libname = 'WORK' & memname = "%sysfunc(upcase(&inset))"
& upcase(strip(name)) not in (' ' %sysfunc(upcase(&keys))); /* The ' ' is included, so there is no error if no keys are given */
quit ;
%mend rename_list ;
%rename_list(females, F, 'GROUP' 'SEX')
%rename_list(males , M, 'GROUP' 'SEX')
%put &rename_females ; * Check that the macro variables are correct ;
%put &rename_males ;
%macro pct_diff(num) ;
percent_diff&num = (Var&num._F - Var&num._M) / ( (Var&num._F + Var&num._M) / 2 ) ;
%mend pct_diff ;
data difference ;
merge females(rename = (&rename_females) drop = sex)
males (rename = (&rename_males ) drop = sex) ;
by group ;
%pct_diff(1) ;
%pct_diff(2) ;
run ;
dm 'vt difference';
The percent_diff variable creation can also be shortened with a macro (as shown). If you had a large and/or variable number of variables to compare, and they were sequentially numbered (Var1, Var2, ...), then you could further shorten it by automatically detecting the number of comparisons, by running the same SQL query with the select into part modified to be
select count(name) into :varct trimmed
to count the number of variables, and then use a macro %do loop (which must sit inside a macro definition) within the data step:
%do i = 1 %to &varct ;
%pct_diff(&i)
%end ;
Use table alias in proc sql to avoid name change:
proc sql;
select a.group,(a.var1-b.var1)/((a.var1+b.var1)/2) as percent_diff1,
(a.var2-b.var2)/((a.var2+b.var2)/2) as percent_diff2
from females as a, males as b
where a.group=b.group;
quit;
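As a self-contained sketch of the table-alias approach, the example below rebuilds the two small tables from the question (with the updated names VarA and VarB) and computes both percent differences in one join. The output table name difference and the aliases f and m are assumptions for illustration.

```sas
/* Sample data copied from the question */
data females;
  input group sex $ VarA VarB;
  datalines;
1 F 8 5
2 F 6 3
3 F 7 0
;
run;

data males;
  input group sex $ VarA VarB;
  datalines;
1 M 9 7
2 M 8 5
3 M 6 3
;
run;

/* Aliases keep the identically named columns apart, so no renaming is needed */
proc sql;
  create table difference as
  select f.group,
         (f.VarA - m.VarA) / ((f.VarA + m.VarA) / 2) as percent_diffA,
         (f.VarB - m.VarB) / ((f.VarB + m.VarB) / 2) as percent_diffB
  from females as f
       inner join males as m
       on f.group = m.group
  order by f.group;
quit;
```

With many variables, the select list would still have to be written out or generated, for instance from dictionary.columns.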
I need help with a SAS PROC FREQ table.
I have two columns : TYPE and STATUS
TYPE has two values : TypeA and TypeB
STATUS has two values : ON and OFF
I need the sas proc freq table output as follows :
--------------------------------------------------
| TYPE | STATUS |
--------------------------------------------------
| | ON | OFF | ON | OFF |
--------------------------------------------------
| TypeA| 33 | 44 | 22 | 55 |
--------------------------------------------------
| TypeB| 11 | 22 | 33 | 44 |
--------------------------------------------------
How can I obtain this? I have tried:
proc freq data = XYZ;
tables TYPE*STATUS/missing nocol norow nopercent nocum;
run;
PROC TABULATE is probably the easiest way to get that total column. The keyword all gives you a total across a class.
proc tabulate data=xyz;
class type status;
tables type,(status all)*n;
run;
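To try the PROC TABULATE step, a small test table can be built first. The rows below are made up for illustration and do not reproduce the counts in the question.

```sas
/* Hypothetical sample data with the two columns from the question */
data xyz;
  length type $5 status $3;
  input type $ status $;
  datalines;
TypeA ON
TypeA OFF
TypeA OFF
TypeB ON
TypeB ON
TypeB OFF
;
run;

/* Rows: type. Columns: each status value plus an overall total (all) */
proc tabulate data=xyz;
  class type status;
  tables type, (status all)*n;
run;
```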