Extracting sub-data from a SAS dataset & applying to a different dataset

I have written a macro, %cust_quants(dsn= , varlist= , quant_list= ), that uses proc univariate to calculate custom quantiles for variables in a dataset (say dsn1). The output is a summary dataset (say dsn2) that looks something like the following:
q_1 q_2.5 q_50 q_80 q_97.5 q_99 var_name
1 2.5 50 80 97.5 99 ex_var_1_100
-2 10 25 150 500 20000 ex_var_pos_skew
-20000 -500 -150 0 10 50 ex_var_neg_skew
What I would like to do is to use the summary dataset to cap/floor extreme values in the original dataset. My idea is to extract the column of interest (say q_99) and put it into a vector of macro-variables (say q_99_1, q_99_2, ..., q_99_n). I can then do something like the following:
/* create summary of dsn1 as above example */
%cust_quants(dsn= dsn1, varlist= ex_var_1_100 ex_var_pos_skew ex_var_neg_skew,
quant_list= 1 2.5 50 80 97.5 99);
/* cap dsn1 var's at 99th percentile */
data dsn1_cap;
set dsn1;
if ex_var_1_100 > &q_99_1 then ex_var_1_100 = &q_99_1;
if ex_var_pos_skew > &q_99_2 then ex_var_pos_skew = &q_99_2;
/* don't cap neg skew */
run;
In R, it is very easy to do this. One can extract sub-data from a data-frame using matrix like indexing and assign this sub-data to an object. This second object can then be referenced later. R example--extracting b from data-frame a:
> a <- as.data.frame(cbind(c(1,2,3), c(4,5,6)))
> print(a)
V1 V2
1 1 4
2 2 5
3 3 6
> a[, 2]
[1] 4 5 6
> b <- a[, 2]
> b[1]
[1] 4
Is it possible to do the same thing in SAS? I want to be able to assign a column (or columns) of sub-data to a macro variable / array, such that I can then use the macro / array within a second data step. One thought is proc sql's into: syntax:
proc sql noprint;
select v2 into :v2_macro separated by " "
from a;
quit;
However, this creates a single string variable when what I really want is a vector of variables (or array--no vectors in SAS). Another thought is to add %scan (assuming this is inside a macro):
proc sql noprint;
select v2 into :v2_macro separated by " "
from a;
quit;
%let i = 1;
%do %until(%scan(&v2_macro, &i) eq %str());
%let var_&i = %scan(&v2_macro, &i);
%let i = %eval(&i + 1);
%end;
This seems inefficient and takes a lot of code. It also requires the programmer to remember which var_&i corresponds to each future purpose. Is there a simpler / cleaner way to do this?
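(For reference, proc sql's into: can also create a numbered series of macro variables directly, which removes the need for the %scan loop; a sketch against the same table a, where the upper bound just needs to be at least the row count and &sqlobs reports how many were actually filled:)
proc sql noprint;
select v2 into :v2_1 - :v2_999 /* creates &v2_1, &v2_2, ... one per row */
from a;
quit;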
Please let me know in the comments if this is enough background / example. I'm happy to give a more complete description of why I'm doing what I'm attempting if needed.

First off, I assume you are talking about SAS/Base not SAS/IML; SAS/IML is essentially similar to R and has the same kind of operations available in the same manner.
SAS/Base is more similar to a database language than a matrix language (though it has some elements of both, and some elements of an OOP language, as well as being a full-featured procedural programming language).
As a result, you do things somewhat differently in order to achieve the same goal. Additionally, because of the cost of moving data in a large data table, you are given multiple methods to achieve the same result; you can choose the appropriate method for the required situation.
To begin with, you generally should not store data in a macro variable in the manner you suggest. It is bad programming practice, and it is inefficient (as you have already noticed). SAS Datasets exist to store data; SAS macro variables exist to help simplify your programming tasks and drive the code.
Creating the dataset "b" as above is trivial in Base SAS:
data b;
set a;
keep v2;
run;
That creates a new dataset with the same rows as A, but only the second column. KEEP and DROP allow you to control which columns are in the dataset.
However, there would be very little point in this dataset, unless you were planning on modifying the data; after all, it contains the same information as A, just less. So for example, if you wanted to merge V2 into another dataset, rather than creating b, you could simply use a dataset option with A:
data c;
merge z a(keep=v2);
by id;
run;
(Note: I presuppose an ID variable of some form to combine A and Z.)
This merge combines the v2 column onto z, in a new dataset, c. This is equivalent to horizontally concatenating two matrices (a straight-up concatenation would remove the 'by id;' requirement, but in databases you typically do not do that, as row order is not guaranteed to be what you expect).
If you plan on using b to do something else, how you create and/or use it depends on that usage. You can create a format, which is a mapping of values [ie, 1='Hello' 2='Goodbye'] and thus allows you to convert one value to another with a single programming statement. You can load it into a hash table. You can transpose it into a row (proc transpose). Supply more detail and a more specific answer can be provided.
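To make the format option concrete for the quantile-capping question above, a hedged sketch: build a CNTLIN dataset from the summary (assuming it is named dsn2 with columns var_name and q_99 as shown earlier; the format name capfmt is invented here), then look each cap up at run time:
data cntlin;
set dsn2;
fmtname = 'capfmt'; /* name of the format to create */
type = 'C'; /* character format: maps var_name to a label */
start = var_name; /* lookup key */
label = strip(put(q_99, best32.)); /* the 99th percentile, stored as text */
keep fmtname type start label;
run;
proc format cntlin=cntlin; /* turn the dataset into a usable format */
run;
data dsn1_cap;
set dsn1;
/* look up the cap for a variable by name and convert back to numeric */
cap = input(put('ex_var_1_100', $capfmt.), best32.);
if ex_var_1_100 > cap then ex_var_1_100 = cap;
drop cap;
run;
This keeps the percentiles in data (and a format) rather than in macro variables, consistent with the advice above.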

Related

SAS: Adding aggregated data to same dataset

I'm migrating from SPSS to SAS.
I need to compute the sum of variable varX, separately by groups of variables varA and varB, and add it as a new variable SUMvarX to the same dataset.
In SPSS this is implemented easily with aggregate:
aggregate outfile=* mode=addvariables
/break=varA varB
/SUMvarX=sum(varX).
Can this be done in SAS?
There are a number of ways to do this, but the best way depends on your data.
For a typical use case, the PROC MEANS solution is what I'd recommend. It's not the fastest, but it gets the job done, and it has a lot lower opportunity of error - you're not really doing anything except match-merging afterwards.
Use the class statement instead of by in most cases; it shouldn't make much of a difference here, but this is what class is for. by runs the analysis separately for each value of those variables; class runs one analysis grouped by all of those variables. class is more flexible and doesn't require a sorted dataset (though you would have to sort anyway for the later merge). class also lets you request multiple combinations - not just the nway combination you ask for here: if you want the data grouped just by a, just by b, and by a*b, you can get all of that in one pass (with class and the types statement; see the sketch after the merge step below).
proc means data=have nway noprint;
class a b;
var x;
output out=summary sum(x)=sum_x;
run;
data want;
merge have summary;
by a b;
run;
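For the multiple-combinations case mentioned above, a sketch using the types statement (the output name summary_multi is invented here; the automatic _TYPE_ variable distinguishes the groupings in the output):
proc means data=have noprint;
class a b;
types a b a*b; /* one set of summary rows per requested grouping */
var x;
output out=summary_multi sum(x)=sum_x;
run;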
The DoW loop covered in Kermit's answer is a reasonable data step option as well, though riskier in terms of programmer error; I'd use it only in particular cases where the dataset is very, very large - larger than fits in memory even in summarized form - and performance is important.
If the data fits in memory, you can also use a hash table to do the summary, and that's what I'd do if the summary dataset fit comfortably in memory. This is too long for an answer here, but Data Aggregation using Hash Object is a good start for how to do that. Basically, you use a hash table to store the results of the summary (not the raw data), adding to it with each row, and then output the hash table at the end. A bit faster than the DoW loop, but slightly memory constrained (although if you used SPSS, you're much more memory constrained than this!). Also very easy to handle multiple combinations.
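A minimal sketch of that hash approach, using the same have/a/b/x names as above (error handling omitted):
data _null_;
if _n_ = 1 then do;
declare hash h(ordered:'a'); /* ordered:'a' keeps the output sorted by key */
h.defineKey('a', 'b');
h.defineData('a', 'b', 'sum_x');
h.defineDone();
end;
set have end=eof;
if h.find() ne 0 then sum_x = 0; /* new group: start the running total at zero */
sum_x = sum(sum_x, x); /* accumulate this row's value */
h.replace(); /* store the running total back in the table */
if eof then h.output(dataset: 'summary'); /* write the summary after the last row */
run;
Each row either finds its group's running total or starts a new one, so the summary is built in a single pass; you would then merge summary back on as in the proc means example.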
Another "programmer easy" way to do it is with SQL.
proc sql;
create table want as
select *, sum(x) as sum_x
from have
group by a,b
;
quit;
This is not standard SQL, but SAS manages it - basically it does the two-step process of the proc means and the merge in one step. I like this in some ways (it skips the intermediate dataset - even though SAS actually makes that dataset in the util folder, it cleans up for you automatically) and dislike it in others (it's not standard SQL, so it will confuse people, and it leaves a note in the log - only a note, so not a big deal, but still).
Adding a note about SPSS -> SAS thinking. One of the bigger differences you'll see going from SPSS to SAS is that, in SPSS, you have one dataset, and you do stuff to it (mostly). You could save it as a different dataset, but you mostly don't until the end - all of your work really is just editing one dataset, in memory.
In SAS, you read datasets from disk and do stuff and then write them out, and if you're doing anything that is at the dataset level (like a summary), you mostly will do it separately and then recombine with the data in a later step. As such, it's very, very common to have lots of datasets - a program I just ran probably has a thousand. Not kidding! Don't worry about random temporary datasets being produced - it doesn't mean your code is not efficient. It's just how SAS works. There are times where you do have to be careful about it - like you have 150GB datasets or something - but if you're working with 5000 rows with 150 variables, your dataset is so small you could write it a thousand times without noticing a meaningful difference to your code execution time.
The big benefit to this style is that you have different datasets for each step, so if you go back and want to rerun part of your code, you can safely - knowing the predecessor dataset still exists, without having to rerun all of your code. It also lets you debug really easily since you can see each of the component parts.
It's a tradeoff for sure: it does mean the code takes a little longer to run, but modern CPUs and SSDs are fast enough that this rarely matters - it's just not necessary to write code that stays all in one data step or runs entirely in memory. In exchange you get the ability to do crazy large amounts of things that couldn't possibly fit in memory and to work with massive datasets, constrained only by disk, which is usually in far greater supply. When it's possible to do something in a PROC, do so, even when it costs a tiny bit of time at the end to re-merge - the PROCs are what you're paying SAS the big bucks for; they are easy to use, well tested, and fast at what they do.
OK, I think I found a way of doing that.
First, you produce the summing variables:
proc means data= <dataset> noprint nway;
by varA varB;
var varX;
output out=<TEMPdataset> sum = SUMvarX;
run;
then you merge the two datasets:
DATA <dataset>;
MERGE <TEMPdataset> <dataset>;
BY varA varB;
run;
This seems to work, although an extra dataset and several extra variables are formed in the process.
There are probably more efficient ways of doing it...
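(The several extra variables are proc means' automatic _TYPE_ and _FREQ_ columns; a drop= dataset option on the merge is enough to remove them, e.g.:)
DATA <dataset>;
MERGE <TEMPdataset> (drop=_type_ _freq_) <dataset>;
BY varA varB;
run;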
Ever heard of DoW Loop?
*-- Create synthetic data --*
data have;
varA=2; varB=4; varX=21; output;
varA=4; varB=6; varX=32; output;
varA=5; varB=8; varX=83; output;
varA=4; varB=3; varX=78; output;
varA=4; varB=8; varX=72; output;
varA=2; varB=4; varX=72; output;
run;
proc sort data=have; by varA varB; run;
The sorted data:
varA varB varX
2 4 21
2 4 72
4 3 78
4 6 32
4 8 72
5 8 83
data stage1;
set have;
by varA varB;
if first.varB then group_number+1;
run;
data want;
do _n_=1 by 1 until (last.group_number);
set stage1;
by group_number;
SUMvarX=sum(SUMvarX, varX);
end;
do until (last.group_number);
set stage1;
by group_number;
output;
end;
drop group_number;
run;
varA varB varX SUMvarX
2 4 21 93
2 4 72 93
4 3 78 78
4 6 32 32
4 8 72 72
5 8 83 83

Summing Multiple lags in SAS using LAG

I'm trying to make a data step that creates a column in my table that has the sum of ten, fifteen, twenty and forty-five lagged variables. What I have below works, but it is not practical to write this code for the twenty and forty-five summed lags. I'm new to SAS and can't find a good way to write the code. Any help would be greatly appreciated.
Here's what I have:
data averages;
set work.cuts;
sum_lag_ten = (lag10(col) + lag9(col) + lag8(col) + lag7(col) + lag6(col) + lag5(col) + lag4(col) + lag3(col) + lag2(col) + lag1(col));
run;
Proc EXPAND allows easy calculation of moving statistics.
Technically it requires a time component, but if you don't have one you can make one up, just make sure it's consecutive. A row number would work.
Given this, I'm not sure it's less code, but it's easier to read and type. And if you're calculating for multiple variables it's much more scalable.
Transformout specifies the transformation; in this case, a moving sum with a window of 10 periods. Trimleft/trimright can be used to ensure that only records with a full 10 observations are included.
You may need to tweak these depending on what exactly you want. The third example under PROC EXPAND has examples.
Data have;
Set have;
RowNum = _n_;
Run;
Proc EXPAND data=have out=want;
ID rownum;
Convert col=col_lag10 / transformout=(MOVSUM 10 trimleft 9);
Run;
Documentation (SAS/ETS 14.1):
http://support.sas.com/documentation/cdl/en/etsug/68148/HTML/default/viewer.htm#etsug_expand_examples04.htm
If you must do this in the datastep (and if you do things like this regularly, SAS/ETS has better tools for sure), I would do it like this.
data want;
set sashelp.steel;
array lags[20];
retain lags1-lags20;
*move everything up one;
do _i = dim(lags) to 2 by -1;
lags[_i] = lags[_i-1];
end;
*assign the current record value;
lags[1] = steel;
*now calculate sums;
*if you want only earlier records and NOT this record, then use lags2-lags11, or do the sum before the move everything up one step;
lag_sum_10 = sum(of lags1-lags10);
lag_sum_15 = sum(of lags1-lags15); *etc.;
run;
Note - this is not the best solution (I think a hash table is better), but this is better for a more intermediate level programmer as it uses data step variables.
I don't use a temporary array because you need variable-list shortcuts (lags1-lags10) to do the sums; with a temporary array you don't get those, unfortunately (there is no way to sum just elements 1-10 with sum(of ...), only the whole array via [*]; see the sketch below for the loop-based workaround).
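For completeness, here is a sketch of the temporary-array variant under that constraint - the sums have to be accumulated with an explicit loop instead of a variable list:
data want2;
set sashelp.steel;
array lags[20] _temporary_; /* _temporary_ elements are retained automatically */
*move everything up one;
do _i = dim(lags) to 2 by -1;
lags[_i] = lags[_i-1];
end;
lags[1] = steel;
*loop-based sum, since lags1-lags10 style shortcuts do not exist for temporary arrays;
lag_sum_10 = 0;
do _i = 1 to 10;
lag_sum_10 = sum(lag_sum_10, lags[_i]);
end;
drop _i;
run;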

Holding Sampled Macro Variable Constant

Hopefully a simple answer. I'm doing a simulation study, where I need to sample a random number of individuals, N, from a uniform distribution, U(25,200), at each of a thousand or so replications. Code for one replication is shown below:
%LET U = RAND("UNIFORM");
%LET N = ROUND(25 + (200 - 25)*&U.);
I created both of these macro variables outside of a DATA step because I need to call the N variable repeatedly in subsequent DATA steps and DO loops in both SAS and IML.
The problem is that every time I call N within a replication, it re-samples U, which necessarily modifies N. Thus, N is not held constant within a replication. This issue is shown in the code below, where I first create N as a variable (that is constant across individuals) and sample predictor values for X for each individual using a DO loop. Note that the value in N is not the same as the total number of individuals, which is also a problem.
DATA ID;
N = &N.;
DO PersonID = 1 TO &N.;
X = RAND("NORMAL",0,1); OUTPUT;
END;
RUN;
I'm guessing that what I need to do is to somehow hold U constant throughout the entirety of one replication, and then allow it to be re-sampled for replication 2, and so on. By holding U constant, N will necessarily be held constant.
Is there a way to do this using macro variables?
&N does not store a value; &N stores the code ROUND(...RAND...) itself, which is re-executed every time &N is resolved. You're misusing macro variables here: while you could store a number in &N, you aren't doing so - that would take %sysfunc (sketched below) - and either way it's not really the right answer here.
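(For illustration, the %sysfunc version, which evaluates once at assignment time, would look something like this sketch - note that %sysfunc does not evaluate arithmetic in its arguments, hence the %sysevalf:)
%let U = %sysfunc(rand(uniform));
%let N = %sysfunc(round(%sysevalf(25 + (200 - 25) * &U)));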
First, if you're repeatedly sampling replicates, look at the paper 'Don't Be Loopy', which has some applications here. Also consider Rick Wicklin's paper, Sampling with Replacement, and the book he references in it ("Simulating Data with SAS") is quite good as well. If you're running your process on a one-sample-one-execution model, that's the slow and difficult way to do it. Do all the replicates at once and process them all at once; IML and SAS are both happy to do that for you. Your uniform random sample size is a bit more difficult to work with, but it's not insurmountable.
If you must do it the way you're doing it, I would have the data step create the macro variable, if there's a reason to do that: at the end of the sample, you can use call symputx to write out the value of N. I.e.:
%let iter=7; *we happen to be on the seventh iteration of your master macro;
DATA ID;
CALL STREAMINIT(&iter.);
U = RAND("UNIFORM");
N = ROUND(25 + (200 - 25)*U);
DO PersonID = 1 TO N;
X = RAND("NORMAL",0,1);
OUTPUT;
END;
CALL SYMPUTX('N',N);
CALL SYMPUTX('U',U);
RUN;
But again, a one-data-step model is probably your most efficient model.
I'm not sure how to do it in the macro world, but this is how you could convert your code to a data step to accomplish the same thing.
The key is setting the random number stream initialization value, using CALL STREAMINIT.
Data _null_;
call streaminit(35);
u=rand('uniform');
call symputx('U', u);
call symputx('N', ROUND(25 + (200 - 25)*u));
run;
%put &n;
%put &u;
As Joe points out, the efficient way to perform this simulation is to generate all 1000 samples in a single data step, as follows:
data AllSamples;
call streaminit(123);
do SampleID = 1 to 1000;
N = ROUND(25 + (200 - 25)*RAND("UNIFORM"));
/* simulate sample of size N HERE */
do PersonID = 1 to N;
X = RAND("NORMAL",0,1);
OUTPUT;
end;
end;
run;
This ensures independence of the random number streams, and it takes a fraction of a second to produce the 1000 samples. You can then use a BY statement to analyze the sampling distributions of the statistics on each sample. For example, the following call to PROC MEANS outputs the sample size, sample mean, and sample standard deviation for each of the 1000 samples:
proc means data=AllSamples noprint;
by SampleID;
var X;
output out=OutStats n=SampleN mean=SampleMean std=SampleStd;
run;
proc print data=OutStats(obs=5);
var SampleID SampleN SampleMean SampleStd;
run;
For more details about why the BY-group approach is more efficient (total time= less than 1 second!) see the article "Simulation in SAS: The slow way or the BY way."

Unknown Errors with Proc Transpose

Trying to utilize proc transpose to a dataset of the form:
ID_Variable Target_Variable String_Variable_1 ... String_Variable_100
1 0 The End
2 0 Don't Stop
to the form:
ID_Variable Target_Variable String_Variable
1 0 The
. . .
. . .
1 0 End
2 0 Don't
. . .
. . .
2 0 Stop
However, when I run the code:
proc transpose data=input_data out=output_data;
by ID_Variable Target_Variable;
var String_Variable_1-String_Variable_100;
run;
The change in file size from input to output balloons from 33.6GB to over 14TB, and instead of the output described above we have that output with many additional completely null string variables (41 of them). There are no other columns on the input dataset so I'm unsure why the resulting output occurs. I already have a work around using macros to create my own proxy transposing procedure, but any information around why the situation above is being encountered would be extremely appreciated.
In addition to the suggestion of compression (which is nearly always a good one when dealing with even medium sized datasets!), I'll make a suggestion for a simple solution without PROC TRANSPOSE, and hazard a few guesses as to what's going on.
First off, wide-to-narrow transpose is usually just as easy in a data step, and sometimes can be faster (not always). You don't need a macro to do it, unless you really like typing ampersands and percent signs, in which case feel free.
data want;
set have;
array transvars[*] string_Variable_1-string_Variable_100;
do _t = 1 to dim(transvars);
string_variable = transvars[_t];
if not missing(String_variable) then output; *unless you want the missing ones;
end;
keep id_variable target_variable string_Variable;
run;
Nice short code, and if you want you can throw in a call to vname to get the name of the transposed variable (or not). PROC TRANSPOSE is shorter, but this is short enough that I often just use it instead.
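(That vname call is one extra line inside the loop, plus adding the new variable to the keep statement; something like this, where the name source_var is arbitrary:)
source_var = vname(transvars[_t]); /* records which original column the value came from */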
Second, my guess. 41 extra string variables tells me that you very likely have some duplicates by your BY group. If PROC TRANSPOSE sees duplicates, it will create that many columns. For EVERY ROW, since that's how columns work. It will look like they're empty, and who knows, maybe they are empty - but SAS still transposes empty things if it sees them.
To verify this, run a PROC SORT NODUPKEY before the transpose. If that doesn't delete at least 40 rows (maybe blank rows - if this data originated from Excel or something, I wouldn't be shocked to learn you had 41 blank rows at the end), I'll be surprised. If it doesn't fix it, and you don't like the data step solution, then you'll need to provide a reproducible example (i.e., provide some data that has a similar expansion of variables).
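(That check is a one-liner; out= keeps the original data intact, and the log's "duplicate key values" note gives the count:)
proc sort data=input_data out=input_nodup nodupkey;
by ID_Variable Target_Variable;
run;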
Without seeing a working example, it's hard to say exactly what's going on here with regards to the extra variables generated by proc transpose.
However, I can see three things that might be contributing towards the increased file size after transposing:
1. If you have options compress = no; set (the default), proc transpose creates an uncompressed output dataset. Also, if some of your character variables have different lengths, they will all be transposed into one variable with the longest of those lengths, further increasing the file size when compression is disabled in the output dataset.
2. Some of the increase in file size may be coming from the automatic _NAME_ column generated by proc transpose, which contains an extra ~100 * max_var_name_length bytes for every ID-target combination in the input dataset.
3. If you are using options compress = binary; (i.e. compressing all output datasets that way by default), the SAS compression algorithm may be less effective after transposing. This is because SAS compresses one record at a time, and this type of compression is much less effective on shorter records. There isn't much you can do about this, unfortunately.
Here's an example of how you can avoid the first two of these potential issues.
/*Start with a compressed dataset*/
data have(compress = binary);
length String_variable_1 $ 10 String_variable_2 $20; /*These are transposed into 1 var with length 20*/
input ID_Variable Target_Variable String_Variable_1 $ String_Variable_2 $;
cards;
1 0 The End
2 0 Don't Stop
;
run;
/*By default, proc transpose creates an uncompressed output dataset*/
proc transpose data = have out = want_default prefix = string_variable;
by ID_variable Target_variable;
var String_Variable_1 String_Variable_2;
run;
/*Transposing with compression enabled and without the _NAME_ column*/
proc transpose data = have out = want(drop = _NAME_ compress = binary) prefix = string_variable;
by ID_variable Target_variable;
var String_Variable_1 String_Variable_2;
run;

How to subset data in SAS

Variable ParameterEstimate
slope 1
intercept 2.5
slope 2
intercept 5.6
slope 22.2
intercept 9
Suppose my dataset looks something like this, where the variable names are Variable and ParameterEstimate. I want to extract just the ParameterEstimates of the slope. However, I can't think of a simple way to do that. How can I go about getting just the slopes, i.e just 1, 2, and 22.2?
You can use a subsetting where clause like this:
data want;
set have;
where variable = 'slope';
run;
This reads only those observations from data set "have" where the value of variable equals 'slope'.
Depending on what you're doing with it next, you can probably do this without a separate datastep.
Say you wanted the mean of the slopes:
proc means data=have(where=(variable='slope'));
var parameterEstimate;
run;
In most cases you can use a where= dataset option, unless you have reason to create a new dataset (perhaps you're going to use this subset for 50 different steps or somesuch, where it's easier to create one than to type that bit out 50 times).
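(If you do end up reusing the subset across many steps, a data step view is a middle ground - it stores the subsetting logic rather than a copy of the data; a sketch:)
data slopes / view=slopes;
set have;
where variable = 'slope';
run;
/* any later step reads the view like a dataset */
proc means data=slopes;
var ParameterEstimate;
run;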