SAS: Limiting variables in PROC EXPORT - sas

I have a PROC EXPORT question that I am wondering if you can answer.
I have a SAS dataset with 800+ variables and over 200K observations and I am trying to export a subset of the variables to a CSV file (i.e. I need all records; I just don’t want all 800+ variables). I can always create a temporary dataset “KEEP”ing just the fields I need and run the EXPORT on that temp dataset, but I am trying to avoid the additional step because I have a large number of records.
To demonstrate this, consider a dataset that has three variables named x, y and z. But, I want the text file generated through PROC EXPORT to only contain x and y. My attempt at a solution below does not quite work.
The SAS Code
When I run the following code, I don’t get exactly what I need. If you run this code and look at the text file that was generated, it has a comma at the end of every line and the header includes all variables in the dataset anyway. Also, I get some messages in the log that I shouldnt be getting.
data ds1;
do x = 1 to 100;
y = x * x;
z = x * x * x;
output;
end;
run;
proc export data=ds1(keep=x y)
file='c:\test.csv'
dbms=csv
replace;
quit;
Here are the first few lines of the text file that was generated ("C:\test.csv")
x,y,z
1,1,
2,4,
3,9,
4,16,
The SAS Log
9343 proc export data=ds1(keep=x y)
9344 file='c:\test.csv'
9345 dbms=csv
9346 replace;
9347 quit;
9348 /**********************************************************************
9349 * PRODUCT: SAS
9350 * VERSION: 9.2
9351 * CREATOR: External File Interface
9352 * DATE: 30JUL12
9353 * DESC: Generated SAS Datastep Code
9354 * TEMPLATE SOURCE: (None Specified.)
9355 ***********************************************************************/
9356 data _null_;
9357 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
9358 %let _EFIREC_ = 0; /* clear export record count macro variable */
9359 file 'c:\test.csv' delimiter=',' DSD DROPOVER lrecl=32767;
9360 if _n_ = 1 then /* write column names or labels */
9361 do;
9362 put
9363 "x"
9364 ','
9365 "y"
9366 ','
9367 "z"
9368 ;
9369 end;
9370 set DS1(keep=x y) end=EFIEOD;
9371 format x best12. ;
9372 format y best12. ;
9373 format z best12. ;
9374 do;
9375 EFIOUT + 1;
9376 put x #;
9377 put y #;
9378 put z ;
9379 ;
9380 end;
9381 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
9382 if EFIEOD then call symputx('_EFIREC_',EFIOUT);
9383 run;
NOTE: Variable z is uninitialized.
NOTE: The file 'c:\test.csv' is:
Filename=c:\test.csv,
RECFM=V,LRECL=32767,File Size (bytes)=0,
Last Modified=30Jul2012:12:05:02,
Create Time=30Jul2012:12:05:02
NOTE: 101 records were written to the file 'c:\test.csv'.
The minimum record length was 4.
The maximum record length was 10.
NOTE: There were 100 observations read from the data set WORK.DS1.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.01 seconds
100 records created in c:\test.csv from DS1.
NOTE: "c:\test.csv" file was successfully created.
NOTE: PROCEDURE EXPORT used (Total process time):
real time 0.12 seconds
cpu time 0.06 seconds
Any ideas how I can solve this problem? I am running SAS 9.2 on windows 7.
Any help would be appreciated. Thanks.
Karthik

Based in Itzy's comment to my question, here is the answer and this does exactly what I need.
proc sql;
create view vw_ds1 as
select x, y from ds1;
quit;
proc export data=vw_ds1
file='c:\test.csv'
dbms=csv
replace;
quit;
Thanks for the help!

Related

SaS 9.4: How to use different weights on the same variable without datastep or proc sql

I can't find a way to summarize the same variable using different weights.
I try to explain it with an example (of 3 records):
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously it gives me a warning because I used the same variable "a" (I can't rename it!).
I've to save a huge amount of data and not so much physical space and I should construct like 120 field (a0-a6,b0-b6 etc) that are the same variables just with fixed weight (wgt0-wgt5).
I want to store a dataset with 20 columns (a,b,c..) and 6 weight (wgt0-wgt5) and, on demand, processing a "summary" without an intermediate datastep that oblige me to create 120 fields.
Due to the huge amount of data (more or less 55Gb every month) I'd like also not to use proc sql statement:
proc sql;
create table pluto
as select sum(db.a * wgt1) as a0, sum(db.a * wgt1) as a1 , etc.
quit;
There is a "Super proc summary" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is just running the proc summary however many times you have weights, and either using ods output with the persist=proc or 20 output datasets and then setting them together.
A third option, though, is to roll your own summarization. This is advantageous in that it only sees the data once - so it's faster. It's disadvantageous in that there's a bit of work involved and it's more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables, and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want NWAY combinations of only.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');;
h_summary.defineKey('i','varname'); *also would use any CLASS variable in the key;
h_summary.defineData('i','varname'); *also would include any CLASS variable in the key;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
One other thing to mention though... if you're doing this because you're doing something like jackknife variance estimation or that sort of thing, or anything that uses replicate weights, consider using PROC SURVEYMEANS which can handle replicate weights for you.
You can SCORE your data set using a customized SCORE data set that you can generate
with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;[enter image description here][1]
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;
enter image description here

How to write a concise list of variables in table of a freq when the variables are differentiated only by a suffix?

I have a dataset with some variables named sx for x = 1 to n.
Is it possible to write a freq which gives the same result as:
proc freq data=prova;
table s1 * s2 * s3 * ... * sn /list missing;
run;
but without listing all the names of the variables?
I would like an output like this:
S1 S2 S3 S4 Frequency
A 10
A E 100
A E J F 300
B 10
B E 100
B E J F 300
but with an istruction like this (which, of course, is invented):
proc freq data=prova;
table s1:sn /list missing;
run;
Why not just use PROC SUMMARY instead?
Here is an example using two variables from SASHELP.CARS.
So this is PROC FREQ code.
proc freq data=sashelp.cars;
where make in: ('A','B');
tables make*type / list;
run;
Here is way to get counts using PROC SUMMARY
proc summary missing nway data=sashelp.cars ;
where make in: ('A','B');
class make type ;
output out=want;
run;
proc print data=want ;
run;
If you need to calculate the percentages you can instead use the WAYS statement to get both the overall and the individual cell counts. And then add a data step to calculate the percentages.
proc summary missing data=sashelp.cars ;
where make in: ('A','B');
class make type ;
ways 0 2 ;
output out=want;
run;
data want ;
set want ;
retain total;
if _type_=0 then total=_freq_;
percent=100*_freq_/total;
run;
So if you have 10 variables you would use
ways 0 10 ;
class s1-s10 ;
If you just want to build up the string "S1*S2*..." then you could use a DO loop or a macro %DO loop and put the result into a macro variable.
data _null_;
length namelist $200;
do i=1 to 10;
namelist=catx('*',namelist,cats('S',i));
end;
call symputx('namelist',namelist);
run;
But here is an easy way to make such a macro variable from ANY variable list not just those with numeric suffixes.
First get the variables names into a dataset. PROC TRANSPOSE is a good way if you use the OBS=0 dataset option so that you only get the _NAME_ column.
proc transpose data=have(obs=0) ;
var s1-s10 ;
run;
Then use PROC SQL to stuff the names into a macro variable.
proc sql noprint;
select _name_
into :namelist separated by '*'
from &syslast
;
quit;
Then you can use the macro variable in your TABLES statement.
proc freq data=have ;
tables &namelist / list missing ;
run;
Car':
In short, no. There is no shortcut syntax for specifying a variable list that crosses dimension.
In long, yes -- if you create a surrogate variable that is an equivalent crossing.
Discussion
Sample data generator:
%macro have(top=5);
%local index;
data have;
%do index = 1 %to ⊤
do s&index = 1 to 2+ceil(3*ranuni(123));
%end;
array V s:;
do _n_ = 1 to 5*ranuni(123);
x = ceil(100*ranuni(123));
if ranuni(123) < 0.1 then do;
ix = ceil(&top*ranuni(123));
h = V(ix);
V(ix) = .;
output;
V(ix) = h;
end;
else
output;
end;
%do index = 1 %to &top;
end;
%end;
run;
%mend;
%have;
As you probably noticed table s: created one freq per s* variable.
For example:
title "One table per variable";
proc freq data=have;
tables s: / list missing ;
run;
There is no shortcut syntax for specifying a variable list that crosses dimension.
NOTE: If you specify out=, the column names in the output data set will be the last variable in the level. So for above, the out= table will have a column "s5", but contain counts corresponding to combinations for each s1 through s5.
At each dimensional level you can use a variable list, as in level1 * (sublev:) * leaf. The same caveat for out= data applies.
Now, reconsider the original request discretely (no-shortcut) crossing all the s* variables:
title "1 table - 5 columns of crossings";
proc freq data=have;
tables s1*s2*s3*s4*s5 / list missing out=outEach;
run;
And, compare to what happens when a data step view uses a variable list to compute a surrogate value corresponding to the discrete combinations reported above.
data haveV / view=haveV;
set have;
crossing = catx(' * ', of s:); * concatenation of all the s variables;
keep crossing;
run;
title "1 table - 1 column of concatenated crossings";
proc freq data=haveV;
tables crossing / list missing out=outCat;
run;
Reality check with COMPARE, I don't trust eyeballs. If zero rows with differences (per noequal) then the out= data sets have identical counts.
proc compare noprint base=outEach compare=outCat out=diffs outnoequal;
var count;
run;
----- Log -----
NOTE: There were 31 observations read from the data set WORK.OUTEACH.
NOTE: There were 31 observations read from the data set WORK.OUTCAT.
NOTE: The data set WORK.DIFFS has 0 observations and 3 variables.
NOTE: PROCEDURE COMPARE used (Total process time)

How to run prediction model

Is there an equivalent of R's function predict(model, data) in SAS?
For example, how would you apply the model below to a large test data set where the response variable "Age" is unknown?
proc reg data=sashelp.class;
model Age = Height Weight ;
run;
I understand you can extract the formula Age = Intercept + Height(Estimate_height) + Weight(Estimate_weight) from the results window and manually predict "Age" for unknown observations, but that's not very efficient.
SAS does this by itself. As long as the model has enough data points to go on, it will output the predicted value. I've used proc glm, but you can use any model procedure to create this kind of output.
/* this is a sample dataset */
data mydata;
input age weight dataset $;
cards;
1 10 mydata
2 11 mydata
3 12 mydata
4 15 mydata
5 12 mydata
;
run;
/* this is a test dataset. It needs to have all of the variables that you'll use in the model */
data test;
input weight dataset $;
cards;
6 test
7 test
10 test
;
run;
/* append (add to the bottom) the test to the original dataset */
proc append data=test base=mydata force; run;
/* you can look at mydata to see if that worked, the dependent var (age) should be '.' */
/* do the model */
proc glm data=mydata;
model age = weight/p clparm; /* these options after the '/' are to show predicte values in results screen - you don't need it */
output out=preddata predicted=pred lcl=lower ucl=upper; /* this line creates a dataset with the predicted value for all observations */
run;
quit;
/* look at the dataset (preddata) for the predicted values */
proc print data=preddata;
where dataset='test';
run;

SAS Remove Outliers

I'm looking for a macro or something in SAS that can help me in isolating the outliers from a dataset. I define an outlier as: Upperbound: Q3+1.5(IQR) Lowerbound: Q1-1.5(IQR). I have the following SAS code:
title 'Fall 2015';
proc univariate data = fall2015 freq;
var enrollment_count;
histogram enrollment_count / vscale = percent vaxis = 0 to 50 by 5 midpoints = 0 to 300 by 5;
inset n mean std max min range / position = ne;
run;
I would like to get rid of the outliers from fall2015 dataset. I found some macros, but no luck in working the macro. Several have a class variable which I don't have. Any ideas how to separate my data?
Here's a macro I wrote a while ago to do this, under slightly different rules. I've modified it to meet your criteria (1.5).
Use proc means to calculate Q1/Q3 and IQR (QRANGE)
Build Macro to cap based on rules
Call macro using call execute and boundaries set, using the output from step 1.
*Calculate IQR and first/third quartiles;
proc means data=sashelp.class stackods n qrange p25 p75;
var weight height;
ods output summary=ranges;
run;
*create data with outliers to check;
data capped;
set sashelp.class;
if name='Alfred' then weight=220;
if name='Jane' then height=-30;
run;
*macro to cap outliers;
%macro cap(dset=,var=, lower=, upper=);
data &dset;
set &dset;
if &var>&upper then &var=&upper;
if &var<&lower then &var=&lower;
run;
%mend;
*create cutoffs and execute macro for each variable;
data cutoffs;
set ranges;
lower=p25-1.5*qrange;
upper=p75+1.5*qrange;
string = catt('%cap(dset=capped, var=', variable, ", lower=", lower, ", upper=", upper ,");");
call execute(string);
run;

Macro returning a value

I created the following macro. Proc power returns table pw_cout containing column Power. The data _null_ step assigns the value in column Power of pw_out to macro variable tpw. I want the macro to return the value of tpw, so that in the main program, I can call it in DATA step like:
data test;
set tmp;
pw_tmp=ttest_power(meanA=a, stdA=s1, nA=n1, meanB=a2, stdB=s2, nB=n2);
run;
Here is the code of the macro:
%macro ttest_power(meanA=, stdA=, nA=, meanB=, stdB=, nB=);
proc power;
twosamplemeans test=diff_satt
groupmeans = &meanA | &meanB
groupstddevs = &stdA | &stdB
groupns = (&nA &nB)
power = .;
ods output Output=pw_out;
run;
data _null_;
set pw_out;
call symput('tpw'=&power);
run;
&tpw
%mend ttest_power;
#itzy is correct in pointing out why your approach won't work. But there is a solution maintaing the spirit of your approach: you need to create a power-calculation function uisng PROC FCMP. In fact, AFAIK, to call a procedure from within a function in PROC FCMP, you need to wrap the call in a macro, so you are almost there.
Here is your macro - slightly modified (mostly to fix the symput statement):
%macro ttest_power;
proc power;
twosamplemeans test=diff_satt
groupmeans = &meanA | &meanB
groupstddevs = &stdA | &stdB
groupns = (&nA &nB)
power = .;
ods output Output=pw_out;
run;
data _null_;
set pw_out;
call symput('tpw', power);
run;
%mend ttest_power;
Now we create a function that will call it:
proc fcmp outlib=work.funcs.test;
function ttest_power_fun(meanA, stdA, nA, meanB, stdB, nB);
rc = run_macro('ttest_power', meanA, stdA, nA, meanB, stdB, nB, tpw);
if rc = 0 then return(tpw);
else return(.);
endsub;
run;
And finally, we can try using this function in a data step:
options cmplib=work.funcs;
data test;
input a s1 n1 a2 s2 n2;
pw_tmp=ttest_power_fun(a, s1, n1, a2, s2, n2);
cards;
0 1 10 0 1 10
0 1 10 1 1 10
;
run;
proc print data=test;
You can't do what you're trying to do this way. Macros in SAS are a little different than in a typical programming language: they aren't subroutines that you can call, but rather just code that generate other SAS code that gets executed. Since you can't run proc power inside of a data step, you can't run this macro from a data step either. (Just imagine copying all the code inside the macro into the data step -- it wouldn't work. That's what a macro in SAS does.)
One way to do what you want would be to read each observation from tmp one at a time, and then run proc power. I would do something like this:
/* First count the observations */
data _null_;
call symputx('nobs',obs);
stop;
set tmp nobs=obs;
run;
/* Now read them one at a time in a macro and call proc power */
%macro power;
%do j=1 %to &nobs;
data _null_;
nrec = &j;
set tmp point=nrec;
call symputx('meanA',meanA);
call symputx('stdA',stdA);
call symputx('nA',nA);
call symputx('meanB',meanB);
call symputx('stdB',stdB);
call symputx('nB',nB);
stop;
run;
proc power;
twosamplemeans test=diff_satt
groupmeans = &meanA | &meanB
groupstddevs = &stdA | &stdB
groupns = (&nA &nB)
power = .;
ods output Output=pw_out;
run;
proc append base=pw_out_all data=pw_out; run;
%end;
%mend;
%power;
By using proc append you can store the results of each round of output.
I haven't checked this code so it might have a bug, but this approach will work.
You can invoke a macro which calls procedures, etc. (like the example) from within a datastep using call execute(), but it can get a bit messy and difficult to debug.