SAS: Adding aggregated data to same dataset

I'm migrating from SPSS to SAS.
I need to compute the sum of variable varX, separately by groups of variables varA varB, and add it as a new variable sumX to the same dataset.
In SPSS this is implemented easily with aggregate:
aggregate outfile=* mode=addvariables
 /break=varA varB
 /SUMvarX=sum(varX).
Can this be done in SAS?

There are a number of ways to do this, but the best way depends on your data.
For a typical use case, the PROC MEANS solution is what I'd recommend. It's not the fastest, but it gets the job done, and it leaves much less room for error - you're not really doing anything except match-merging afterwards.
Use the class statement instead of by in most cases. The results are similar, but class is designed for this: by runs a separate analysis for each combination of the by variables, while class runs one analysis grouped by all of them. class is more flexible and doesn't require a sorted dataset (though you'd have to sort anyway for the later merge). It also lets you request multiple combinations - not just the nway combination you ask for here: if you want the sum grouped just by a, just by b, and by a*b, you can get all three (with class and types - a short sketch follows the merge step below).
proc means data=have noprint nway; /* nway keeps only the a*b combination rows */
    class a b;
    var x;
    output out=summary sum(x)=sum_x; /* name the sum so it doesn't collide with x */
run;
data want;
    merge have summary(drop=_type_ _freq_); /* drop PROC MEANS's automatic variables */
    by a b;
run;
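If you do want several groupings at once, here's a minimal sketch using class with types (reusing the have/a/b/x names from above):

proc means data=have noprint;
    class a b;
    types a b a*b; /* three groupings in one pass: by a alone, by b alone, and by a*b */
    var x;
    output out=summaries sum(x)=sum_x;
run;

The automatic _TYPE_ variable in summaries tells you which grouping each output row belongs to.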
The DoW loop covered in Kermit's answer is a reasonable data step option as well, though riskier in terms of programmer error; I'd use it only in particular cases where the dataset is very large - so large that even the summary wouldn't fit in memory - and performance is important.
If the data fits in memory, you can also use a hash table to do the summary, and that's what I'd do if the summary dataset fit comfortably in memory. A full treatment is too long for an answer here, but Data Aggregation using Hash Object is a good start. Basically, you use a hash table to store the results of the summary (not the raw data), updating it with each row, and then output the hash table at the end. It's a bit faster than the DoW loop, but somewhat memory constrained (although if you used SPSS, you were much more memory constrained than this!). It also makes it very easy to handle multiple combinations.
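To give a flavor of the approach, a minimal sketch (reusing the have/a/b/x names from above; see the linked paper for a fuller treatment):

data _null_;
    set have end=done;
    if _n_ = 1 then do;
        declare hash h(); /* holds one row per a*b group */
        h.defineKey('a', 'b');
        h.defineData('a', 'b', 'sum_x');
        h.defineDone();
        call missing(sum_x);
    end;
    if h.find() ne 0 then sum_x = 0; /* first time this group is seen */
    sum_x = sum(sum_x, x);
    h.replace(); /* write the updated sum back into the hash */
    if done then h.output(dataset: 'summary');
run;

The resulting summary dataset can then be merged back onto have exactly as in the PROC MEANS example.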
Another "programmer easy" way to do it is with SQL.
proc sql;
    create table want as
    select *, sum(x) as sum_x
    from have
    group by a, b
    ;
quit;
This is not standard SQL, but SAS handles it - it basically does the two-step process of the PROC MEANS plus the merge in one step. I like this in some ways (it skips the intermediate dataset - although SAS does actually create one in the utility folder, it just cleans it up for you automatically) and dislike it in others (it's not standard SQL, so it will confuse people, and it leaves a note in the log - only a note, so not a big deal, but still).
Adding a note about SPSS -> SAS thinking. One of the bigger differences you'll see going from SPSS to SAS is that, in SPSS, you have one dataset, and you do stuff to it (mostly). You could save it as a different dataset, but you mostly don't until the end - all of your work really is just editing one dataset, in memory.
In SAS, you read datasets from disk and do stuff and then write them out, and if you're doing anything that is at the dataset level (like a summary), you mostly will do it separately and then recombine with the data in a later step. As such, it's very, very common to have lots of datasets - a program I just ran probably has a thousand. Not kidding! Don't worry about random temporary datasets being produced - it doesn't mean your code is not efficient. It's just how SAS works. There are times where you do have to be careful about it - like you have 150GB datasets or something - but if you're working with 5000 rows with 150 variables, your dataset is so small you could write it a thousand times without noticing a meaningful difference to your code execution time.
The big benefit to this style is that you have different datasets for each step, so if you go back and want to rerun part of your code, you can safely - knowing the predecessor dataset still exists, without having to rerun all of your code. It also lets you debug really easily since you can see each of the component parts.
It's a tradeoff, certainly: the code takes a little longer to run, but modern CPUs and SSDs are really fast, and it's just not necessary to write code that stays in one data step or runs entirely in memory. In exchange, you get the ability to do crazy large amounts of work that couldn't possibly fit in memory and to handle massive datasets - constrained only by disk, which is usually in far greater supply. When it's possible to do something in a PROC, do so, even when that costs a tiny bit of time at the end to re-merge; the PROCs are what you're paying SAS the big bucks for, and they're easy to use, well tested, and fast at what they do.

OK, I think I found a way of doing it.
First, you produce the summary variables:
proc means data=<dataset> noprint;
    by varA varB;
    var varX;
    output out=<TEMPdataset> sum=SUMvarX;
run;
Then you merge the two datasets:
DATA <dataset>;
    MERGE <TEMPdataset> <dataset>;
    BY varA varB;
run;
This seems to work, although an extra dataset and several extra variables are formed in the process.
There are probably more efficient ways of doing it...
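The "several extra variables" are the automatic _TYPE_ and _FREQ_ columns that PROC MEANS adds to its OUT= dataset; you can drop them on the way into the merge:

DATA <dataset>;
    MERGE <TEMPdataset>(drop=_TYPE_ _FREQ_) <dataset>;
    BY varA varB;
run;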

Ever heard of the DoW loop?
*-- Create synthetic data --*;
data have;
    varA=2; varB=4; varX=21; output;
    varA=4; varB=6; varX=32; output;
    varA=5; varB=8; varX=83; output;
    varA=4; varB=3; varX=78; output;
    varA=4; varB=8; varX=72; output;
    varA=2; varB=4; varX=72; output;
run;

proc sort data=have;
    by varA varB;
run;

The sorted dataset:
varA varB varX
2 4 21
2 4 72
4 3 78
4 6 32
4 8 72
5 8 83
data stage1;
    set have;
    by varA varB;
    if first.varB then group_number+1; /* sequential id for each varA*varB group */
run;

data want;
    /* first pass over the group: accumulate the sum */
    do _n_=1 by 1 until (last.group_number);
        set stage1;
        by group_number;
        SUMvarX=sum(SUMvarX, varX);
    end;
    /* second pass over the same group: output every row with the group total */
    do until (last.group_number);
        set stage1;
        by group_number;
        output;
    end;
    drop group_number;
run;
The result:
varA varB varX SUMvarX
2 4 21 93
2 4 72 93
4 3 78 78
4 6 32 32
4 8 72 72
5 8 83 83

Related

Missing values in a FREQ (SAS)

I'm going to ask this with an example...
Suppose I have a data set where each observation represents a person. Two of the variables are AGE and HASADOG (say, with values 1 for yes and 2 for no). Is there a way to run a PROC FREQ (by AGE*HASADOG) that forces SAS to include in the report a line for combinations where the count is zero?
By this I mean: if there is a particular value of AGE such that no observation with that AGE has a 1 for HASADOG, the report should still include a row for this combination (with a row percent of 0).
Is this possible?
The SPARSE option in PROC FREQ is likely all you need.
proc freq data=sashelp.class;
    table sex*age / sparse list;
run;
If a value appears nowhere in your data set at all, then there's no way for SAS to know it exists. In that case you'd need a more complex solution - basically, a way to tell SAS ahead of time all the values you will be using. This can be done via the PRELOADFMT or CLASSDATA options on several procs. There are asked and answered questions on this topic here on SO, so I won't provide a solution for that option, which seems beyond the scope of your question.

Unknown Errors with Proc Transpose

Trying to use proc transpose to convert a dataset of the form:
ID_Variable Target_Variable String_Variable_1 ... String_Variable_100
1 0 The End
2 0 Don't Stop
to the form:
ID_Variable Target_Variable String_Variable
1 0 The
. . .
. . .
1 0 End
2 0 Don't
. . .
. . .
2 0 Stop
However, when I run the code:
proc transpose data=input_data out=output_data;
    by ID_Variable Target_Variable;
    var String_Variable_1-String_Variable_100;
run;
The file size balloons from 33.6GB on input to over 14TB on output, and instead of the output described above, we get that output plus many additional, completely null string variables (41 of them). There are no other columns on the input dataset, so I'm unsure why this happens. I already have a workaround that uses macros to build my own transpose, but any information about why this is happening would be greatly appreciated.
In addition to the suggestion of compression (which is nearly always a good one when dealing with even medium sized datasets!), I'll make a suggestion for a simple solution without PROC TRANSPOSE, and hazard a few guesses as to what's going on.
First off, wide-to-narrow transpose is usually just as easy in a data step, and sometimes can be faster (not always). You don't need a macro to do it, unless you really like typing ampersands and percent signs, in which case feel free.
data want;
    set have;
    array transvars string_Variable_1-string_Variable_100; /* the 100 wide columns */
    do _t = 1 to dim(transvars);
        string_variable = transvars[_t];
        if not missing(string_variable) then output; /* unless you want the missing ones */
    end;
    keep id_variable target_variable string_variable;
run;
Nice short code, and if you want you can throw in a call to vname to get the name of the transposed variable (or not). PROC TRANSPOSE is shorter, but this is short enough that I often just use it instead.
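For instance, to capture which wide column each value came from, vname on the array element does it (varname is just an illustrative name here; add it to the keep statement if you use it):

do _t = 1 to dim(transvars);
    string_variable = transvars[_t];
    varname = vname(transvars[_t]); /* e.g. "string_Variable_7" */
    if not missing(string_variable) then output;
end;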
Second, my guess: 41 extra string variables tells me that you very likely have duplicates within your BY groups. If PROC TRANSPOSE sees duplicates, it creates that many extra columns - for EVERY row, since that's how columns work. They will look empty, and who knows, maybe they are empty - but SAS still transposes empty values if it sees them.
To verify this, run a PROC SORT NODUPKEY before the transpose (a sketch follows). If that doesn't delete at least 40 rows (maybe blank rows - if this data originated from Excel or something, I wouldn't be shocked to learn you had 41 blank rows at the end), I'll be surprised. If that doesn't fix it, and you don't like the data step solution, then you'll need to provide a reproducible example (i.e., some data that shows a similar expansion of variables).
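A minimal sketch of that check (out= leaves the original intact, and dupout= collects the removed duplicates so you can inspect them):

proc sort data=input_data out=deduped dupout=dupes nodupkey;
    by ID_Variable Target_Variable;
run;
/* if dupes contains roughly 40+ rows, duplicate BY groups explain the extra columns */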
Without seeing a working example, it's hard to say exactly what's going on here with regards to the extra variables generated by proc transpose.
However, I can see three things that might be contributing towards the increased file size after transposing:
If you have options compress = no; set (the default), proc transpose creates an uncompressed output dataset. Also, if some of your character variables have different lengths, they will all be transposed into one variable with the longest of those lengths, further increasing the file size when compression is disabled in the output dataset.
I suspect that some of the increase in file size may be coming from the automatic _NAME_ column generated by proc transpose, which contains an extra ~100 * max_var_name_length bytes for every ID-target combination in the input dataset.
If you are using options compress = binary; (i.e., compressing all output datasets that way by default), the SAS compression algorithm may be less effective after transposing. This is because SAS compresses one record at a time, and this type of compression is much less effective on shorter records. There isn't much you can do about this, unfortunately.
Here's an example of how you can avoid the first two of these potential issues.
/* Start with a compressed dataset */
data have(compress = binary);
    length String_variable_1 $10 String_variable_2 $20; /* these are transposed into 1 var with length 20 */
    input ID_Variable Target_Variable String_Variable_1 $ String_Variable_2 $;
cards;
1 0 The End
2 0 Don't Stop
;
run;

/* By default, proc transpose creates an uncompressed output dataset */
proc transpose data = have out = want_default prefix = string_variable;
    by ID_variable Target_variable;
    var String_Variable_1 String_Variable_2;
run;

/* Transposing with compression enabled and without the _NAME_ column */
proc transpose data = have out = want(drop = _NAME_ compress = binary) prefix = string_variable;
    by ID_variable Target_variable;
    var String_Variable_1 String_Variable_2;
run;

Hash object in SAS - is it possible to concatenate two tables below using hash object?

I'm trying to find ways to replace proc sql and regular merges with hash objects wherever possible.
Sample data:
data TABLE1;
    input Date :date9. Property :$6. Headcount;
    format Date date9.;
datalines;
01Jul2013 East 100
02Jul2013 East 50
02Jul2013 West 50
;
run;

data TABLE2;
    input Date :date9. Property :$6. Headcount;
    format Date date9.;
datalines;
11Aug2013 East 60
02Oct2013 East 50
22Dec2013 West 40
;
run;
Both data sets are already sorted by Date and Property. Currently I do it via
data WANT;
set TABLE1 TABLE2;
run;
But the problem is that both tables have quite a large number of records, and the code above takes 20 minutes or more to finish the concatenation.
I do know how to use a hash object to obtain an outer-join result. But how would I use one for this purpose?
Do you subsequently use your WANT datastep in other steps (data or proc), e.g. to summarize or subset it down?
If so, you can reduce the I/O by specifying WANT as a view instead of a table.
data want /view=want ;
set table1 table2 ;
run ;
/* Then use `want` elsewhere... */
proc summary data=want ... ;
...
run ;
BUT... if you use want several times, it may still be more efficient (in terms of time or I/O) to build it as a table.
You are unlikely to get much of a performance gain from using a hash object in this scenario. The main benefit of using hash objects is that they allow you to merge on values from one or more small datasets onto a larger dataset without having to sort the large dataset. In this scenario:
Both of your datasets are large
You aren't doing any merging
Appending via hash iterators is possible if you're really keen, but I wouldn't bother. As other users have suggested, PROC APPEND is the way to go here, as it reduces the I/O requirements: it reads and writes only the rows of the dataset being appended, leaving the base table in place. Look at the documentation for proc append for more details; a sketch follows.
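A minimal sketch - note that this modifies TABLE1 in place, so copy it first if you need the original unchanged:

proc append base=TABLE1 data=TABLE2;
run;
/* TABLE1 now holds the rows of both tables; only TABLE2's rows were read and written */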

editing categorical data for uniformity

I have 1 million+ rows of data, and one of the columns is channel_name. The people collecting the data didn't seem to care that they entered one channel in about 10 different variations, many of which contain the # symbol. A Google search isn't giving me any decent documentation; can anyone direct me to something useful?
To some extent the answer has to be, "it depends". Your actual data will determine the best solution to this; and there may not be one true solution - you may have to try a few things, and there may well be more manual work than you'd like.
One option is to build a format based on what you see. That format can either convert various values to one consistent value, or convert to a numeric category (which is then overlaid with a format that shows the consistent value).
For example, you might have 'channel' as retail store:
data have;
    infile datalines truncover;
    input #1 channel $8.;
datalines;
Best Buy
BestBuy
BB
;;;;
run;
So you can do one of two things:
proc format;
    value $channel
        "Best Buy", "BB", "BestBuy" = "Best Buy";
quit;

data want;
    set have;
    channel_coded = put(channel, $channel.);
run;
Or you can do:
proc format;
    invalue channeli
        "Best Buy", "BB", "BestBuy" = 1
    ;
    value channelf
        1 = "Best Buy"
    ;
quit;

data want;
    set have;
    channel_coded = input(channel, CHANNELI.);
    format channel_coded channelf.;
run;
Which you choose is largely up to you - the latter gives you more flexibility in the long run. For example, when Sears and K-Mart merged, it would be somewhat easier to take 2 and 16 and format them both as Sears than to change the stored values for the character format - and even easier to roll back if/when K-Mart splits off again.
This does require some manual work, though; you have to code things by hand here, or develop some method for figuring out what the coding is. You can use the OTHER option in PROC FORMAT to easily identify new values and add them to the format (which can be derived from a dataset instead of hand-written code - see the sketch below), but at the end of the day the actual values you have determine what solution is best for the real work of deciding what counts as "Best Buy", and a by-hand solution (each time a new value comes in, a person looks at it and codes it) may ultimately be the best.
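For illustration, a minimal sketch of the OTHER catch-all (the "UNKNOWN" label is just a placeholder):

proc format;
    value $channel
        "Best Buy", "BB", "BestBuy" = "Best Buy"
        other = "UNKNOWN"; /* anything not yet mapped surfaces as UNKNOWN */
quit;

data to_review;
    set have;
    channel_coded = put(channel, $channel.);
    if channel_coded = "UNKNOWN"; /* these are the new values to review and add */
run;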

Extracting sub-data from a SAS dataset & applying to a different dataset

I have written a macro that uses proc univariate to calculate custom quantiles for variables in a dataset (say dsn1): %cust_quants(dsn= , varlist= , quant_list= ). The output is a summary dataset (say dsn2) that looks something like the following:
q_1 q_2.5 q_50 q_80 q_97.5 q_99 var_name
1 2.5 50 80 97.5 99 ex_var_1_100
-2 10 25 150 500 20000 ex_var_pos_skew
-20000 -500 -150 0 10 50 ex_var_neg_skew
What I would like to do is to use the summary dataset to cap/floor extreme values in the original dataset. My idea is to extract the column of interest (say q_99) and put it into a vector of macro-variables (say q_99_1, q_99_2, ..., q_99_n). I can then do something like the following:
/* create summary of dsn1 as in the example above */
%cust_quants(dsn= dsn1, varlist= ex_var_1_100 ex_var_pos_skew ex_var_neg_skew,
             quant_list= 1 2.5 50 80 97.5 99);

/* cap dsn1 vars at the 99th percentile */
data dsn1_cap;
    set dsn1;
    if ex_var_1_100 > &q_99_1 then ex_var_1_100 = &q_99_1;
    if ex_var_pos_skew > &q_99_2 then ex_var_pos_skew = &q_99_2;
    /* don't cap neg skew */
run;
In R, it is very easy to do this. One can extract sub-data from a data-frame using matrix like indexing and assign this sub-data to an object. This second object can then be referenced later. R example--extracting b from data-frame a:
> a <- as.data.frame(cbind(c(1,2,3), c(4,5,6)))
> print(a)
V1 V2
1 1 4
2 2 5
3 3 6
> a[, 2]
[1] 4 5 6
> b <- a[, 2]
> b[1]
[1] 4
Is it possible to do the same thing in SAS? I want to be able to assign a column (or columns) of sub-data to a macro variable / array, such that I can then use the macro / array within a second data step. One thought is PROC SQL's INTO:
proc sql noprint;
    select v2 into :v2_macro separated by " "
    from a;
quit;
However, this creates a single string variable, when what I really want is a vector of variables (or an array - there are no vectors in SAS). Another thought is to add %scan (assuming this is inside a macro):
proc sql noprint;
    select v2 into :v2_macro separated by " "
    from a;
quit;

%let i = 1;
%do %until(%scan(&v2_macro, &i) eq %str());
    %let var_&i = %scan(&v2_macro, &i);
    %let i = %eval(&i + 1);
%end;
This seems inefficient and takes a lot of code. It also requires the programmer to remember which var_&i corresponds to each future purpose. Is there a simpler / cleaner way to do this?
Please let me know in the comments if this is enough background / example. I'm happy to give a more complete description of why I'm attempting this if needed.
First off, I assume you are talking about SAS/Base not SAS/IML; SAS/IML is essentially similar to R and has the same kind of operations available in the same manner.
SAS/Base is more similar to a database language than a matrix language (though has some elements of both, and some elements of an OOP language, as well as being a full-featured functional programming language).
As a result, you do things somewhat differently in order to achieve the same goal. Additionally, because of the cost of moving data in a large data table, you are given multiple methods to achieve the same result; you can choose the appropriate method for the required situation.
To begin with, you generally should not store data in a macro variable in the manner you suggest. It is bad programming practice, and it is inefficient (as you have already noticed). SAS Datasets exist to store data; SAS macro variables exist to help simplify your programming tasks and drive the code.
Creating the dataset "b" as above is trivial in Base SAS:
data b;
set a;
keep v2;
run;
That creates a new dataset with the same rows as A, but only the second column. KEEP and DROP allow you to control which columns are in the dataset.
However, there would be very little point in this dataset, unless you were planning on modifying the data; after all, it contains the same information as A, just less. So for example, if you wanted to merge V2 into another dataset, rather than creating b, you could simply use a dataset option with A:
data c;
merge z a(keep=v2);
by id;
run;
(Note: I presuppose an ID variable of some form to combine A and Z.)
This merge combines the v2 column onto z, in a new dataset, c. This is equivalent to horizontally concatenating two matrices (although a straight-up concatenation would remove the 'by id;' requirement; in databases you do not typically do that, as row order is not guaranteed to be what you expect).
If you plan on using b to do something else, how you create and/or use it depends on that usage. You can create a format, which is a mapping of values [i.e., 1='Hello' 2='Goodbye'] that lets you convert one value to another with a single programming statement. You can load it into a hash table. You can transpose it into a row (proc transpose - see the sketch below). Supply more detail, and a more specific answer can be provided.
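For the capping use case in the question, here's a hedged sketch of that transpose route (dataset and variable names are assumed from the question's example, including a summary column literally named q_99):

proc transpose data=dsn2 out=q99_row(drop=_name_) prefix=q99_;
    id var_name; /* new columns named q99_ + each var_name value */
    var q_99;
run;

data dsn1_cap;
    if _n_ = 1 then set q99_row; /* one row of caps, retained across all of dsn1 */
    set dsn1;
    if ex_var_1_100 > q99_ex_var_1_100 then ex_var_1_100 = q99_ex_var_1_100;
    if ex_var_pos_skew > q99_ex_var_pos_skew then ex_var_pos_skew = q99_ex_var_pos_skew;
    drop q99_:; /* discard the cap columns once applied */
run;

This keeps the thresholds in a dataset the whole way through, avoiding macro variables entirely.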