I have table in SAS with missing values like below:
col1 | col2 | col3 | ... | coln
-----|------|------|-----|-------
111 | | abc | ... | abc
222 | 11 | C1 | ... | 11
333 | 18 | | ... | 12
... | ... | ... | ... | ...
And I need to delete from above table variables where is more than 80% missing values (>=80%).
How can I do taht in SAS ?
The macro below will create a macro variable named &drop_vars that holds a list of variables to drop from your dataset that exceed missing threshold. This works for both character and numeric variables. If you have a ton of them then this macro will fail but it can easily be modified to handle any number of variables. You can save and reuse this macro.
%macro get_missing_vars(lib=, dsn=, threshold=);
%global drop_vars;
/* Generate a select statement that calculates the proportion missing:
nmiss(var1)/count(*) as var1, nmiss(var2)/count(*) as var2, ... */
proc sql noprint;
select cat('nmiss(', strip(name), ')/count(*) as ', strip(name) )
into :calculate_pct_missing separated by ','
from dictionary.columns
where libname = upcase("&lib")
AND memname = upcase("&dsn")
;
quit;
/* Calculate the percent missing */
proc sql;
create table pct_missing as
select &calculate_pct_missing.
from &lib..&dsn.
;
quit;
/* Convert to a long table */
proc transpose data=pct_missing out=drop_list;
var _NUMERIC_;
run;
/* Get a list of variables to drop that are >= the drop threshold */
proc sql noprint;
select _NAME_
into :drop_vars separated by ' '
from drop_list
where COL1 GE &threshold.
;
quit;
%mend;
It has three parameters:
lib: Library of your dataset
dsn: Dataset name without the library
threshold: Proportion of missing values a variable must meet or exceed to be dropped
For example, let's generate some sample data and use this. col1 col2 col3 all have 80% missing values.
data have;
array col[10];
do i = 1 to 10;
do j = 1 to 10;
col[j] = i;
if(i > 2 AND j in(1, 2, 3) ) then col[j] = .;
end;
output;
end;
drop i j;
run;
We'll run the macro and check the log:
%get_missing_vars(lib=work, dsn=have, threshold=0.8);
%put &drop_vars;
The log shows:
col1 col2 col3
Now we can pass this into a simple data step.
data want;
set have;
drop &drop_vars;
run;
I have a dataset with several variables like the one below:
Data have(drop=x);
call streaminit(1);
do x = 1 to 20 by 1;
if x < 11 then group = 'A';
else group = 'B';
var1 = rand('normal',0,1);
var2 = rand('uniform');
output;
end;
Run;
In my analysis I need to get some summary stats using PROC MEANS and output the results for each variable into one dataset. I tried doing it with the code below, but it only includes stats from the first variable in the dataset. How can I output the remaining variables into the same dataset?
Proc means data=have n sum mean;
By group;
Output out=want(drop=_freq_ _type_) n=n sum=sum mean=mean;
Run;
Output:
+-------+----+----------+----------+
| group | n | sum | mean |
+-------+----+----------+----------+
| A | 10 | 4.517081 | 0.451708 |
+-------+----+----------+----------+
| B | 10 | -0.77369 | -0.07737 |
+-------+----+----------+----------+
Desired output:
+----------+-------+----+----------+----------+
| variable | group | n | sum | mean |
+----------+-------+----+----------+----------+
| var1 | A | 10 | 4.517081 | 0.451708 |
+----------+-------+----+----------+----------+
| var1 | B | 10 | -0.77369 | -0.07737 |
+----------+-------+----+----------+----------+
| var2 | A | 10 | 7.947089 | 0.794709 |
+----------+-------+----+----------+----------+
| var2 | B | 10 | 5.003049 | 0.500305 |
+----------+-------+----+----------+----------+
You requested SAS to name the count n, the sum sum and the mean mean.
It can only do that for one variable.
This is the syntax to ask SAS to use different names for the statistics of each variable:
Output out=want(drop=_freq_ _type_)
n(var1 var2)=n1 n2
sum(var1 var2)=sum1 sum2
mean(var1 var2)=mean1 mean2;
To get that output you will need to transpose the data. Either transpose before hand and add the _NAME_ variable to the BY or CLASS statement.
data have;
call streaminit(1);
do x = 1 to 20 by 1;
if x < 11 then group = 'A';
else group = 'B';
var1 = rand('normal',0,1);
var2 = rand('uniform');
output;
end;
run;
proc transpose data=have out=tall;
by group x;
run;
proc means data=tall nway n sum mean;
by group;
class _name_;
output out=want(drop=_freq_ _type_) n=n sum=sum mean=mean;
run;
Or use /autoname and transpose the resulting dataset from one observation per GROUP to multiple observations.
proc means data=have(drop=x) nway n sum mean;
by group;
output out=wide(drop=_freq_ _type_) n= sum= mean= /autoname;
run;
proc transpose data=wide out=tall;
by group;
run;
data tall ;
set tall ;
stat=scan(_name_,-1,'_');
_name_=substrn(_name_,1,length(_name_)-length(stat) -1);
rename _name_=varname;
run;
proc sort data=tall;
by group varname;
run;
proc transpose data=tall out=want(drop=_name_);
by group varname ;
id stat;
var col1;
run;
proc print data=want;
run;
I have the following (fake) crime data of offenders:
/* Some fake-data */
DATA offenders;
INPUT id :$12. crime :4. offenderSex :$1. count :3.;
INFORMAT id $12.;
INFILE DATALINES DSD;
DATALINES;
1,110,f,3
2,32,f,1
3,31,m,1
4,113,m,1
5,110,m,1
6,31,m,1
7,31,m,1
8,110,f,2
9,113,m,1
10,31,m,1
11,113,m,1
12,110,f,1
13,32,m,1
14,31,m,1
15,31,m,1
16,31,m,1
17,110,f,2
18,113,m,2
19,31,m,1
20,31,m,1
21,110,m,4
22,32,f,1
23,31,m,1
24,31,m,1
25,110,f,4
26,110,m,1
27,110,m,1
28,110,m,2
29,32,m,1
30,113,f,1
31,32,m,1
32,31,f,1
33,110,m,1
34,32,f,1
35,113,m,2
36,31,m,1
37,113,m,1
38,110,f,1
39,113,u,2
;
RUN;
proc format;
value crimes 110 = 'Theft'
113 = 'Robbery'
32 = 'Assault'
31 = 'Minor assault';
run;
I want to create a cross table using PROC TABULATE:
proc tabulate;
format crime crimes.;
freq count;
class crime offenderSex;
table crime="Type of crime", offenderSex="Sex of the offender" /misstext="0";
run;
This gives me a table like this:
m f
------------------------------------
Minor assault |
Assault |
Theft |
Robbery |
Now, I'd like to group the different types of crimes:
'Assault' and 'minor assault' should be in a category "Violent crimes" and 'theft' and 'robbery' should be in a category "Crimes against property":
m f
------------------------------------
Minor assault |
Assault |
*Total violent crimes* |
Theft |
Robbery |
*Total property crimes* |
Can anyone explain me how to do this? I tried to use another format for the 'crime'-variable and use "category * crime" within PROC TABULATE, but then it turned out like this, which is not exactly what I want:
m f
-------------------------------------------------------
Violent crimes Minor assault |
Assault |
Property crimes Theft |
Robbery |
Use the all= option within a table dimension :
table group='Category' * (crime="Type of crime" All='Total'), offenderSex="Sex of the offender" /misstext="0";
I have a dataset that looks like the following:
pt_fin Admit_Type MONTH_YEAR BED_ORDERED_TO_DISPO (minutes)
1 Acute Jan 214
2 Acute Jan 628
3 ICU Jan 300
4 ICU Feb 99
I already have a code (see below) that produces a plot with a x (admit type grouped my month) and y axes (median bed to dispo time), but I want to add a secondary Y axes which counts the number of patients which were used to compute each respective median.
For example, I want a secondary Y axis data point that corresponds to the month and admit type, so for Jan, the secondary Y axis data point will have a 2 separate counts 1)of the patients admitted to acute and 2) of the patients admitted to ICU.
proc sgplot data=Combined;
title "Median Bed Order To Dispo By Month, Admit Location";
vbar MONTH_YEAR / response=BED_ORDERED_TO_DISPO stat=median
group = Admit_Type groupdisplay=cluster ;
run;
I've been trying to adapt what I've found here but the plots my code produces are super messy and incorrect.
https://blogs.sas.com/content/iml/2019/01/14/align-y-y2-axes-sgplot.html
Desired output(pretend X's and *'s, respectively, are connected in a line graph corresponding to the Y axis):
| * |
m | | | X | | #
e | x | | * |
d | | | | | |
|-------------------------------|
Acute ICU Acute ICU
Jan FEb
Code which I've tried that produce rubbish
proc sgplot data=Combined;
vbarbasic MONTH_YEAR/ response=Bed_Order_Hour y2axis; /*needs to be on y axis 1*/
group = Admit_Type
series x=MONTH_YEAR y=Pt_fin/ markers; *Pt_fin needs to be on y axis 2*/
run;
Your visualization explanation is weak. You might want to use two plotting statements in your SGPLOT, VBAR and VLINE.
data have;
do type = 'Acute', 'ICU';
do month = '01jan2018'd to '31dec2018'd;
do _n_ = 1 to floor (50 * ranuni(123));
patid + 1;
minutes = 10 + floor(1000 * ranuni(123));
output;
end;
month = intnx ('month', month, 0, 'e');
end;
end;
format month monname3.;
run;
ods html5 file="plot.html" path="c:\temp";
proc sgplot data=have;
title "Median of patient minutes by month";
vbar month / group=type groupdisplay=cluster response=minutes stat=median;
vline month / group=type groupdisplay=cluster response=minutes stat=freq y2axis ;
run;
ods html5 close;
The vline presents the viewer a secondary focus on the frequency for each median. The same information (as an aspect) of the median could be communicated instead with just a modification of the vbar intensity. The highest freq bars (of median) would be 'strongest' shade and the lower 'freq' bars would be faded.
I have a data set which contains four monthly observations in each row.
1Sep11 389.00 1Oct11 491.00 1Nov11 370.00 1Dec11 335.00
2Sep11 423.00 2Oct11 478.00 2Nov11 407.00 2Dec11 442.00
3Sep11 482.00 3Oct11 300.00 3Nov11 303.00 3Dec11 372.00
I need to have a data set, which would contain the months (Sep, Oct, Nov, Dec) as four columns, and the readings against each month placed on the right column. Example.
Day|Sep|Oct|Nov|Dec
1|389.00|491.00|370.00|335.00
2|423.00|478.00|407.00|442.00
3|482.00|300.00|303.00|372.00
How can I do this in SAS? I have tried the ## option, but that only helps me to read the four readings in the row, and create one observation for each reading.
You original data is in categorical form. That is good!
You are asking for a transformation that changes data (month part of date) into meta data (month name as column). This means down the road you will be dealing with arrays or variable name lists.
I would recommend keeping your data in categorical form. Categorical form means you can use CLASS and BY statements for efficient processing. Use Proc TABULATE to arrange your data items for output or delivery consumption (such as ODS EXCEL).
data have;
attrib date informat=date9. format=date9.;
input date value ##;
date_again = date;
datalines;
1Sep11 389.00 1Oct11 491.00 1Nov11 370.00 1Dec11 335.00
2Sep11 423.00 2Oct11 478.00 2Nov11 407.00 2Dec11 442.00
3Sep11 482.00 3Oct11 300.00 3Nov11 303.00 3Dec11 372.00
run;
proc tabulate data=have;
class date date_again;
var value;
format date monname.;
format date_again day.;
table date_again='', date=''*value=''*max='' / nocellmerge;
run;
ODS LISTING output
-------------------------------------------------------------
| | September | October | November | December |
|-------+------------+------------+------------+------------|
|1 | 389.00| 491.00| 370.00| 335.00|
|-------+------------+------------+------------+------------|
|2 | 423.00| 478.00| 407.00| 442.00|
|-------+------------+------------+------------+------------|
|3 | 482.00| 300.00| 303.00| 372.00|
-------------------------------------------------------------
If you feel you must transpose the data, split out the day and month for use as by and id
data have2(keep=day month value);
attrib date informat=date9. format=date9.;
input date value ##;
day = day(date);
month = put(date,monname3.);
datalines;
1Sep11 389.00 1Oct11 491.00 1Nov11 370.00 1Dec11 335.00
2Sep11 423.00 2Oct11 478.00 2Nov11 407.00 2Dec11 442.00
3Sep11 482.00 3Oct11 300.00 3Nov11 303.00 3Dec11 372.00
run;
proc transpose data=have2 out=want2(drop=_name_);
by day;
var value;
id month;
run;
You are also going to run into problems when overall date range exceeds one year, or if the raw data rows are not day in month grouped or are disordered.
Code:
/* Step 1: Read each line in a string*/
data raw;
input line $ 1-70;
cards;
1Sep11 389.00 1Oct11 491.00 1Nov11 370.00 1Dec11 335.00
2Sep11 423.00 2Oct11 478.00 2Nov11 407.00 2Dec11 442.00
3Sep11 482.00 3Oct11 300.00 3Nov11 303.00 3Dec11 372.00
;;;
run;
/*Step 2: Exract the individual values separated by space */
data input;
set raw;
September= input(scan(line,1,' '),date7.);
S_Value= scan(line,2,' ');
October= input(scan(line,3,' '),date7.);
O_Value= scan(line,4,' ');
November= input(scan(line,5,' '),date7.);
N_Value= scan(line,6,' ');
December= input(scan(line,7,' '),date7.);
D_Value= scan(line,8,' ');
format September October November December date7. ;
drop line;
put _ALL_;
run;
Output:
September=01SEP11 S_Value=389.00 October=01OCT11 O_Value=491.00
November=01NOV11 N_Value=370.00 December=01DEC11 D_Value=335.00 _ERROR_=0 _N_=1
September=02SEP11 S_Value=423.00 October=02OCT11 O_Value=478.00
November=02NOV11 N_Value=407.00 December=02DEC11 D_Value=442.00 _ERROR_=0 _N_=2
September=03SEP11 S_Value=482.00 October=03OCT11 O_Value=300.00
November=03NOV11 N_Value=303.00 December=03DEC11 D_Value=372.00 _ERROR_=0 _N_=3