Perform calculations using n and nmiss values - sas

I have the following SAS PROC MEANS statement that works great as it is.
proc means data=MBA_NODUP_APPLICANT_&TERM. missing nmiss n mean median p10 p90 fw = 8;
where ENR = 1;
by SRC_TYPE;
var gmattotal greverb2 grequant2 greanwrt;
run;
However, I am trying to add a new variable calculated as nmiss/(nmiss+n). I don't see any examples of this online, but also nothing that says it cannot be done.

To calculate the percent missing, which is what your formula means, just use the OUTPUT statement to generate a dataset with the NMISS and N values. Then add a step to do the arithmetic yourself.
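A minimal sketch of that OUTPUT-statement approach, assuming a single analysis variable from sashelp.cars; the dataset name stats and the variable name pct_missing are illustrative, not from the question. With several VAR variables you would need distinct output names (for example via the AUTONAME option).

```sas
/* Write NMISS and N to a dataset instead of printing them */
proc means data=sashelp.cars noprint;
  var cylinders;
  output out=stats nmiss=nmiss n=n;
run;

/* Do the arithmetic in a separate step */
data stats;
  set stats;
  pct_missing = nmiss / (nmiss + n);  /* proportion of rows with a missing value */
run;

proc print data=stats;
run;
```

The same pattern should carry over to the original step by adding the WHERE and BY statements to the PROC MEANS call.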
Or you could create a new binary variable using the MISSING() function and take the MEAN of that. The mean of a 1/0 variable is the same as the proportion of observations that were 1 (TRUE).
Example:
data test;
set sashelp.cars;
missing_cylinders=missing(cylinders);
run;
proc means data=test nmiss n mean;
var cylinders missing_cylinders ;
run;
So 2/428 is a little less than 0.5%.
The MEANS Procedure

                        N
Variable             Miss      N          Mean
----------------------------------------------
Cylinders               2    426     5.8075117
missing_cylinders       0    428     0.0046729

Calculate the top 5 and summarize them by store

Let's say I have stores all around the world and I want to know the top sales losses per store. What is the code for that?
here is my try:
proc sort data= store out=sorted_store;
by store descending amount;
run;
and
data calc1;
do _n_=1 by 1 until(last.store);
set sorted_store;
by store;
if _n_ <= 5 then "Sum_5Largest_Losses"n=sum(amount);
end;
run;
but this just prints out the 5th amount and not 1 to 5! I really don't know how to select the top 5 of EACH store. I think some kind of group by would be a perfect fit. But first things first: how do I select i = 1...5, and not just i = 5?
There is also a way of doing it with proc sql:
data have;
input store$ amount;
datalines;
A 100
A 200
A 300
A 400
A 500
A 600
A 700
B 1000
B 1100
C 1200
C 1300
C 1400
D 600
D 700
E 1000
E 1100
F 1200
;
run;
proc sql outobs=4; /* limit output to the first 4 rows */
select store, sum(amount) as TOTAL_AMT
from have
group by 1
order by 2 desc; /* sort by total, largest first, so the top rows are kept */
quit;
The data step sum(,) function adds up its arguments. If you only give it one argument then there is nothing to actually sum so it just returns the input value.
data calc1;
do _n_=1 by 1 until(last.store);
set sorted_store;
by store;
if _n_ <= 5 then Sum_5Largest_Losses=sum(Sum_5Largest_Losses,amount);
end;
run;
I would highly recommend learning the basic methods before getting into DOW loops.
1) Add a counter so you can find the first 5 records of each store
2) As the data step loops, the sum accumulates
3) Output the sum when the counter reaches 5 (or at the last record, if a store has fewer than 5)
proc sort data= store out=sorted_store;
by store descending amount;
run;
data calc1;
set sorted_store;
by store;
*if first record of a store then set counter to 1 and total sum to 0;
if first.store then do ;
counter=1;
total_sum=0;
end;
*otherwise increment the counter;
else counter+1;
*accumulate the sum if counter <= 5 (a sum statement, so total_sum is retained across rows);
if counter <=5 then total_sum + amount;
*output on the 5th record, or on the last record for stores with fewer than 5;
if counter=5 or (last.store and counter<5) then output;
run;
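Another way to get the same totals, sketched under the assumption that sorted_store from the PROC SORT above already exists: keep only the first five rows of each store, then let PROC MEANS do the summing. The names top5, rank_in_store and calc2 are made up for this illustration.

```sas
/* Keep at most the 5 largest amounts per store */
data top5;
  set sorted_store;
  by store;
  if first.store then rank_in_store = 0;
  rank_in_store + 1;           /* sum statement: value is retained across rows */
  if rank_in_store <= 5;       /* subsetting IF drops rows 6 and beyond */
run;

/* One output row per store with the sum of the kept rows */
proc means data=top5 noprint nway;
  class store;
  var amount;
  output out=calc2 (drop=_type_ _freq_) sum=Sum_5Largest_Losses;
run;
```

Splitting the filter from the aggregation makes it easy to inspect which rows were counted, at the cost of an extra pass over the data.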

Macro does not retain in a IF-THEN loop in DATA STEP

data output;
set input;
by id;
if first.id = 1 then do;
call symputx('i', 1); /* also tried %let i = 1; */
a_&i = a;
end;
else do;
call symputx('i', &i + 1); /* also tried %let i = %sysevalf(&i + 1); */
a_&i = a;
end;
run;
Example Data:
ID A
1 2
1 3
2 2
2 4
Want output:
ID A A_1 A_2
1 2 2 .
1 3 . 3
2 2 2 .
2 4 . 4
I know that you can do this using transpose, but I'm just curious why this way does not work. The macro variable does not retain its value for the next observation.
Thanks!
edit: Since %let is compile time, and call symput is execution time, %let will run only once and call symput will always be 1 step slow.
"why does this way not work"
The SAS executor processes code in this order:
1) resolve macro expressions
2) process steps
   a) automatic compile of the proc or data step (compile time)
   b) run the compiled step (run time)
A running data step cannot modify its PDV layout (which is fixed during compilation) while it is running.
call symput() is performed at run time, so any change it makes will not and cannot be applied to source code such as a_&i = a; because that text was already resolved and compiled before the step started running.
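The timing can be seen in a small experiment. This is only an illustrative sketch; the values 1 and 99 are arbitrary.

```sas
%let i = 1;

data _null_;
  call symputx('i', 99);  /* runs at execution time */
  x = &i;                 /* &i was replaced by the text 1 before the step compiled */
  put x=;                 /* writes x=1 to the log */
run;

%put NOTE: after the step, i=&i;  /* the macro variable is now 99 */
```

The reference &i inside the step is resolved once, before compilation, so the assignment never sees the value that call symputx stores while the step is running.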
Array based transposition
You will need to determine the maximum number of items in the groups prior to coding the data step. Use array addressing to place the a value in the desired array slot:
* Hand coded transpose requires a scan over the data first;
* determine largest group size;
data _null_;
set have end=lastrecord_flag;
by id;
if first.id
then seq=1;
else seq+1;
retain maxseq 0;
if last.id then maxseq = max(seq,maxseq);
if lastrecord_flag then call symputx('maxseq', maxseq);
run;
* Use maxseq macro variable computed during scan to define array size;
data want (drop=seq);
set have;
by id;
array a_[&maxseq]; %* <--- set array size to max group size;
if first.id
then seq=1;
else seq+1;
a_[seq] = a; * 'triangular' transpose;
run;
Note: Your 'want' is a triangular reshaping of the data. To achieve a row per id reshaping the a_ elements would have to be cleared (call missing()) at first.id and output at last.id.
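A sketch of that row-per-id variant, assuming the macro variable maxseq has already been computed by the scan step above; the output name want_wide is hypothetical.

```sas
data want_wide (drop=seq a);
  set have;
  by id;
  array a_[&maxseq];
  retain a_1-a_&maxseq;          /* keep slot values across the rows of a group */
  if first.id then do;
    call missing(of a_[*]);      /* clear the slots at the start of each id */
    seq = 0;
  end;
  seq + 1;
  a_[seq] = a;
  if last.id then output;        /* one row per id */
run;
```

Without the RETAIN statement the array elements would be reset to missing on every iteration, which is exactly what produces the triangular shape in the previous step.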

Assign missing variables values based on distribution SAS

I would like to assign IDs with blank Sizes a size based on the frequency distribution of their Group.
Dataset A contains a snapshot of my data:
ID Group Size
1 A Large
2 B Small
3 C Small
5 D Medium
6 C Large
7 B Medium
8 B -
Dataset B shows the frequency distribution of the Sizes among the Groups:
Group Small Medium Large
A 0.31 0.25 0.44
B 0.43 0.22 0.35
C 0.10 0.13 0.78
D 0.29 0.27 0.44
For ID 8, we know that it has a 43% probability of being "small", a 22% probability of being "medium" and a 35% probability of being "large". That's because these are the Size distributions for Group B.
How do I assign ID 8 (and other blank IDs) a Size based on the Group distributions in Dataset B? I'm using SAS 9.4. Macros, SQL, anything is welcome!
The table distribution is ideal for this. The last data step here shows how to use it; the steps before that generate random data and build the frequency table, so you can skip them if you already have both.
See Rick Wicklin's blog about simulating multinomial data for an example of this in other use cases (and more information about the function).
*Setting this up to help generate random data;
proc format;
value sizef
low - 1.3 = 'Small'
1.3 <-<2.3 = 'Medium'
2.3 - high = 'Large'
;
quit;
*Generating random data;
data have;
call streaminit(7);
do id = 1 to 1e5;
group = byte(65+rand('Uniform')*4); *A = 65, B = 66, etc.;
size = put((rank(group)-66)*0.5 + rand('Uniform')*3,sizef.); *Intentionally making size somewhat linked to group to allow for differences in the frequency;
if rand('Uniform') < 0.05 then call missing(size); *A separate call to set missingness;
output;
end;
run;
proc sort data=have;
by group;
run;
title "Initial frequency of size by group";
proc freq data=have;
by group;
tables size/list out=freq_size;
run;
title;
*Transpose to one row per group, needed for table distribution;
proc transpose data=freq_size out=table_size prefix=pct_;
var percent;
id size;
by group;
run;
data want;
merge have table_size;
by group;
array pcts pct_:; *convenience array;
if first.group then do _i = 1 to dim(pcts); *must divide by 100 but only once!;
pcts[_i] = pcts[_i]/100;
end;
if missing(size) then do;
size_new = rand('table',of pcts[*]); *table uses the pcts[] array to tell SAS the table of probabilities;
size = scan(vname(pcts[size_new]),2,'_');
end;
run;
title "Final frequency of size by group";
proc freq data=want;
by group;
tables size/list;
run;
title;
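To see the table distribution in isolation, here is a small sketch that draws sizes using the Group B probabilities from the question (0.43, 0.22, 0.35). The seed, the dataset name demo and the array name sizes are arbitrary choices for this illustration.

```sas
data demo;
  call streaminit(2024);
  /* labels in the same order as the probabilities passed to RAND */
  array sizes[3] $6 _temporary_ ('Small' 'Medium' 'Large');
  do draw = 1 to 10000;
    /* rand('Table', p1, p2, p3) returns 1, 2 or 3 with those probabilities */
    size = sizes[rand('Table', 0.43, 0.22, 0.35)];
    output;
  end;
run;

proc freq data=demo;
  tables size;
run;
```

With this many draws the observed percentages should land close to 43%, 22% and 35%.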
You can also do this with a random value and some if-else logic:
proc sql;
create table temp_assigned as select
a.*, rand("Uniform") as random_roll, /*generate a random number from 0 to 1*/
case when missing(size) then
case when calculated random_roll < small then small
when calculated random_roll < sum(small, medium) then medium
when calculated random_roll < sum(small, medium, large) then large
end end as value_selected, /*pick the value of the size associated with that value in each group*/
coalesce(case when calculated value_selected = small then "Small"
when calculated value_selected = medium then "Medium"
when calculated value_selected = large then "Large" end, size) as group_assigned /*pick the value associated with that size*/
from temp as a
left join freqs as b
on a.group = b.group;
quit;
Obviously you can do this without creating the value_selected variable, but I thought showing it for demonstrative purposes would be helpful.

Determine rates of change for different groups

I have a SAS issue that I know is probably fairly straightforward for SAS users who are familiar with array programming, but I am new to this aspect.
My dataset looks like this:
Data have;
Input group $ size price;
Datalines;
A 24 5
A 28 10
A 30 14
A 32 16
B 26 10
B 28 12
B 32 13
C 10 100
C 11 130
C 12 140
;
Run;
What I want to do is determine the rate at which price changes for the first two items in the family and apply that rate to every other member in the family.
So, I’ll end up with something that looks like this (for A only…):
Data want;
Input group $ size price newprice;
Datalines;
A 24 5 5
A 28 10 10
A 30 14 12.5
A 32 16 15
;
Run;
The technique you'll need to learn is either retain or dif/lag. Both methods would work here.
The following illustrates one way to solve this, but would need additional work by you to deal with things like size not changing (meaning a 0 denominator) and other potential exceptions.
Basically, we use retain to cause a value to persist across records, and use that in the calculations.
data want;
set have;
by group;
retain lastprice rateprice lastsize;
if first.group then do;
counter=0;
call missing(of lastprice rateprice lastsize); *clear these out;
end;
counter+1; *Increment the counter;
if counter=2 then do;
rateprice=(price-lastprice)/(size-lastsize); *Calculate the rate over 2;
end;
if counter le 2 then newprice=price; *For the first two just move price into newprice;
else if counter>2 then newprice=lastprice+(size-lastsize)*rateprice; *Else set it to the change;
output;
lastprice=newprice; *save the price and size in the retained vars;
lastsize=size;
run;
Here is a different approach that is obviously longer than Joe's, but could be generalized to other similar situations where the calculation is different or depends on more values.
Add a sequence number to your data set:
data have2;
set have;
by group;
if first.group then seq = 0;
seq + 1;
run;
Use proc reg to calculate the intercept and slope for the first two rows of each group, outputting the estimates with outest:
proc reg data=have2 outest=est;
by group;
model price = size;
where seq le 2;
run;
quit;
Join the original table to the parameter estimates and calculate the predicted values:
proc sql;
create table want as
select
h.*,
e.intercept + h.size * e.size as newprice
from
have h
left join est e
on h.group = e.group
order by
group,
size
;
quit;

Differentiating between missing and total in output of proc means?

I've got something like the following:
proc means data = ... missing;
class 1 2 3 4 5;
var a b;
output sum=;
run;
This does what I want it to do, except for the fact that it is very difficult to differentiate between a missing value that represents a total, and a missing value that represents a missing value. For example, the following would appear in my output:
1 2 3 4 5 type sumA sumB
. . . . . 0 num num
. . . . . 1 num num
Ways I can think of handling this:
1) Change missings to a cardinal value prior to proc means. This is definitely doable...but not exactly clean.
2) Format the missings to something else prior, and then use preloadfmt? This is a bit of a pain...I'd really rather not.
3) Somehow use the proc means-generated variable type to determine whether the missing is a missing or a total
4) Other??
I feel like this is clearly a common enough problem that there must be a clean, easy way, and I just don't know what it is.
Option 3, for sure. _TYPE_ is simply a binary number with a 1 for each class variable, in order, that is included in the current row and a 0 for each one that is collapsed into a total. You can use the CHARTYPE option to ask for it to be given explicitly as a string ('01101110' etc.), or work with math if that's more your style.
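As a quick illustration of what CHARTYPE produces, here is a sketch using sashelp.cars with two class variables; the dataset names sums and labeled and the variable origin_label are invented for this example.

```sas
proc means data=sashelp.cars noprint missing chartype;
  class origin type;
  var msrp;
  output out=sums sum=;
run;

/* _TYPE_ is now a 2-character string: '00', '01', '10' or '11'.
   Position 1 refers to origin, position 2 to type. */
data labeled;
  set sums;
  length origin_label $12;
  if char(_type_, 1) = '0' then origin_label = 'TOTAL';  /* origin collapsed */
  else origin_label = origin;                            /* real value, possibly blank */
run;

proc print data=labeled;
run;
```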
How exactly you use this depends on what you're trying to accomplish. Rows that have a missing value on them will have a type that suggests a class variable should exist, but doesn't. So for example:
data want;
set have; *post-proc means assuming used CHARTYPE option;
array classvars a b c d e; *whatever they are;
hasmissing=0;
do _t = 1 to dim(classvars);
if char(_type_,_t) = '1' and classvars[_t] = . then hasmissing=1;
end;
*or;
if cmiss(of classvars[*]) = countc(_type_,'0') then hasmissing=0;
else hasmissing=1; *number of 0s = number of missings = total row, otherwise not;
run;
That's a brute force application, of course. You may also be able to identify it based on the number of missings, if you have a small number of types requested. For example, let's say you have 3 class variables (so 0 to 7 values for type), and you only asked for the 3 way combination (7, '111') and the 3 two way combination 'totals' (6,5,3, ie, '110','101','011'). Then:
data want;
set have;
if (_type_=7 and cmiss(of a b c) = 0) or (_type_ ne 7 and cmiss(of a b c) = 1) then ... ; *either base row or total row, no missings;
else ... ; *has at least one missing;
run;
Depending on your data, NMISS may also work. That checks to see if the number of missings is appropriate for the type of data.
Joe's strategy, modified slightly for my exact problem, because it may be useful to somebody at some point in the future.
data want;
set have;
array classvars a b c d e;
do _t = 1 to dim(classvars);
if char(_type_,_t) = '0' and (strip(classvars[_t]) = "" or strip(classvars[_t]) = ".") then classvars[_t] = "TOTAL"; *a 0 in _type_ marks a class variable collapsed into a total;
end;
run;
The rationale for the changes is as follows:
1) I'm working with (mostly) character variables, not numeric.
2) I'm not interested in whether a row has any missing values or not, as those are very frequent and I want to keep them. Instead, I just want the output to differentiate between the missings and the totals, which I have accomplished by replacing the blank values that represent totals with a label that indicates a total.