Convert Stata "egen, group" to SAS

I am trying to find the equivalent of the Stata code "egen group" in SAS.
The goal is:
I have three variables x, y, and z. I want to create a new variable which will assign a different ordinal number for each combination of values of x, y, and z. How can I do this in SAS?

If you order your data by x, y, and z, SAS knows exactly where the groups x, y, and z start/end. You can use this to create unique identifiers.
Let's make some sample data. This data purposefully has duplicate values to illustrate how first. works.
data have;
do x = 'a', 'b', 'c';
do y = 'd', 'e', 'f';
do z = 'g', 'h', 'i';
output;
output;
end;
end;
end;
run;
Single-Threaded Unique IDs
This is the most likely case for you. This applies if you're running code in Base SAS.
First, sort the data by x y z.
proc sort data=have;
by x y z;
run;
Next, create your identifiers. We'll tell SAS that the data is ordered by x y z. Since z is nested within y and x, if we reach the first value of z, we've reached a unique combination of x y z.
data want;
set have;
by x y z;
if(first.z) then id+1;
run;
Output:
x y z id
a d g 1
a d g 1
a d h 2
a d h 2
a d i 3
a d i 3
...
id+1 is a SAS shortcut called a sum statement: it initializes id to 0 and retains its value across rows. The IF statement above is equivalent to the following code:
retain id 0;
if(first.z) then id = id+1;
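To see how first. behaves on the sorted sample data, a minimal sketch is to copy the automatic first.z and last.z flags into ordinary variables, since the automatic flags themselves are not written to the output data set. The names flags, first_z, and last_z are made up for this illustration.

```sas
/* Assumes have is already sorted by x y z (see the PROC SORT step above). */
data flags;
  set have;
  by x y z;
  first_z = first.z;  /* 1 on the first row of each x/y/z combination, else 0 */
  last_z  = last.z;   /* 1 on the last row of each x/y/z combination, else 0 */
run;

proc print data=flags(obs=4);
run;
```

Because every combination appears twice in the sample data, first_z should be 1 on the first row of each pair and last_z should be 1 on the second.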
Multi-threaded Unique IDs
This applies if you're running code in SAS Viya in CAS. There the DATA step runs in several threads, and each thread keeps its own counter, so you need to add _THREADID_ to the ID to make it unique. For example:
cas;
libname casuser cas caslib='casuser';
data casuser.have;
set have;
run;
data casuser.want;
set casuser.have;
by x y z;
if(first.z) then _id+1;
id = catx('_', _THREADID_, _id);
drop _id;
run;
Output:
x y z id
a d g 15_1
a d g 15_1
a d h 15_2
a d h 15_2
a d i 15_3
a d i 15_3
...

Related

Deleting first instance of a column after group by in sas proc sql

I have the following SAS dataset.
correlation policynum risknum
A X Y
A X Y
A X Y
B X Y
B X Y
B X Y
B X L
B X L
B X L
C Z M
C Z M
C Z M
D Z M
D Z M
D Z M
In SAS, I want to filter the above dataset so I get my final output as:
correlation policynum risknum
B X Y
B X Y
B X Y
B X L
B X L
B X L
D Z M
D Z M
D Z M
i.e. for each group of policynum and risknum, if multiple values exist for correlation, I want to keep the second value and get rid of the first value.
If only a single value of correlation exists for a group of policynum and risknum, I want to retain that group in my final output too.
What would be the best way to do this? It might be something simple as I am relatively new to SAS.
Thanks in advance!
If sorting the correlation values gives the same order in which they appear row-wise in the data set, you can use SQL. Otherwise SQL cannot be used, because it is based on set theory and has no implicit row numbers; use a DATA step with a DOW loop instead.
Example:
FYI, one common situation in which SAS coders use the phrase 'DOW loop' is when SET & BY statements occur inside a DO loop.
data have;
input correlation $ policynum $ risknum $;
datalines;
A X Y
A X Y
A X Y
B X Y
B X Y
B X Y
B X L
B X L
B X L
C Z M
C Z M
C Z M
D Z M
D Z M
D Z M
;
/* keep last group of a nested group */
* SQL can be used only if correlation wanted is ALWAYS highest valued correlation;
proc sql;
create table want as
select * from have
group by policynum, risknum
having correlation = max(correlation)
;
quit;
* DATA Step DOW loops can be used when correlation wanted is last occurring correlation within by group;
data want;
do _n_ = 1 by 1 until (last.risknum); /* stop at the last row of each policynum/risknum group */
set have;
by policynum risknum notsorted; /* presume at least contiguous */
end;
_want_correlation = correlation;
do _n_ = 1 to _n_;
set have;
if _want_correlation = correlation then OUTPUT;
end;
run;
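For readers new to the pattern, the DOW loop can be sketched in its simplest form as a per-group row counter. This is only an illustration on the have data above; group_sizes and n_rows are hypothetical names.

```sas
data group_sizes;
  /* One pass of the DATA step reads one whole policynum/risknum group. */
  do n_rows = 1 by 1 until (last.risknum);
    set have;
    by policynum risknum notsorted;  /* groups only need to be contiguous */
  end;
  /* The implicit OUTPUT at the end of the step writes one row per group. */
  keep policynum risknum n_rows;
run;
```

On the sample data this should give three rows: X Y with 6 rows, X L with 3, and Z M with 6. The answer's code relies on the same loop index to know how many rows to reread in its second loop.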

COUNTING VALUE PER PARTICIPANTS

I would like to add a new column to a dataset but I am not sure how to do so. My dataset has a variable called KEYVAR (character variable) with three different values. A participant can appear multiple times in my dataset, with each row containing the same or a different value for KEYVAR. What I want to do is create a new variable called NEWVAR that counts how many times a participant has a specific value for KEYVAR; when a participant does not have an observation for that specific value, I want NEWVAR to have a result of zero.
Here's an example of the dataset I would like (in this example, I want to count every instance of "Y" per participant as newvar):
have
PARTICIPANT KEYVAR
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
want
PARTICIPANT KEYVAR NEWVAR
A Y 1
A N 1
B Y 3
B Y 3
B Y 3
C W 0
C N 0
C W 0
D Y 2
D N 2
D N 2
D Y 2
D W 2
You can use Proc SQL to compute an aggregate result over a group meeting a criterion, and have that aggregate value automatically merged into the result set.
-OR-
Use a MEANS, TRANSPOSE, MERGE approach
Sample Code (SQL)
data have;
input ID $ value $; datalines;
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
E X
;
proc sql;
create table want as
select ID, value
, sum(value='Y') as Y_COUNT /* relies on logic eval 'math' 0 false, 1 true */
, sum(value='N') as N_COUNT
, sum(value='W') as W_COUNT
from have
group by ID
;
quit;
Sample Code (PROC and MERGE)
* format for PRELOADFMT and COMPLETETYPES;
proc format;
value $eachvalue
'Y' = 'Y'
'N' = 'N'
'W' = 'W'
other = '-'
;
run;
* Count how many per combination ID/VALUE;
proc means noprint data=have nway completetypes;
class ID ;
class value / preloadfmt;
format value $eachvalue.;
output out=freqs(keep=id value _freq_);
run;
* TRANSPOSE reshapes to wide (across) data layout, one row per ID;
proc transpose data=freqs suffix=_count out=counts_across(drop=_name_);
by id;
id value;
var _freq_;
where put(value,$eachvalue.) ne '-';
run;
* MERGE;
data want_way_2;
merge have counts_across;
by id;
run;
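If only one count is needed, as with NEWVAR in the question, and the original row order must be kept, a double DOW loop is another option. The sketch below assumes have is grouped by ID and that 'Y' is the value of interest; want_dow and newvar are illustrative names.

```sas
data want_dow;
  /* First pass over the group: count the rows where value is 'Y'. */
  do until (last.ID);
    set have;
    by ID;
    newvar = sum(newvar, value = 'Y');  /* sum() treats the initial missing as 0 */
  end;
  /* Second pass over the same group: write every row with the total attached. */
  do until (last.ID);
    set have;
    by ID;
    output;
  end;
run;
```

A participant with no 'Y' rows ends up with newvar = 0, which is what the question asks for.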

Creating variables based on values of another variable

I have two groups, A and B, and two numeric variables, X and Y. I want to create two new variables, new1 and new2, based on the values of X and Y (respectively) for group B (i.e., IF group = B THEN new1 = X, new2 = Y). I want to take those newly created variables, append them to group A, and then delete group B. In the end, there should be one row for group A containing X, Y, new1, and new2. I'm uncertain how to accomplish this.
I've looked into using PROC TRANSPOSE, but I'm unsure if that's the right starting point. My internet searches are lacking because I'm not even sure what to call what I'm attempting to do, though I'm betting this is a common procedure requiring a common solution.
EXAMPLE
Not sure how to generalize the problem, but for the given problem this will work:
/* Just reversing the records */
proc sort data = have;
by descending group;
run;
data want;
set have;
retain new1 new2;
if _N_ = 1 then do;
new1 = x;
new2 = y;
end;
else output;
run;
This sounds like a case of one-to-one merging (MERGE without a BY statement).
data have; input
group $1. x y; datalines;
A 3 4
B 2 6
run;
data want;
merge
have(where=( group='A'))
have(where=(Bgroup='B') rename=(x=Bx y=By group=Bgroup))
;
drop Bgroup;
run;

Comparison of two data sets in SAS

I have the following data set:
data data_one;
length X 3
Y $ 20;
input x y ;
datalines;
1 test
2 test
3 test1
4 test1
5 test
6 test
7 test1
run;
data data_two;
length Z 3
A $ 20;
input Z A;
datalines;
1 test
2 test1
3 test2
run;
What I would like to have is a data set which tells me how often column Y in data_one contains the same string of column A in data_two. The result should look like this one:
Obs test test1 test2
1 4 3 0
Thanks in advance!
1. First we need the counts for those values of Y present in data_one.
2. Then we create a sorted (for the next merge) list of the values present in data_two.
3. The data_one Y counts from step 1 are merged with the list from step 2.
4. The Y values present in data_two but not in data_one (b and not a) are assigned count=0; the Y values not present in data_two are discarded (if b).
5. The last step transposes the vertical list of counts into a horizontal set of variables.
proc freq data=data_one noprint;
table y / out=count_one (keep=y count);
run;
proc sort data=data_two out=list_two (keep=a rename=(a=y)) nodupkey;
by a;
run;
data count_all;
merge count_one (in=a) list_two (in=b);
by y;
if (b and not a) then count=0;
if b;
run;
proc transpose data=count_all out=final (drop=_name_ _label_);
id y;
run;
The first 3 steps can be replaced with one proc SQL:
proc sql;
create table count_all as
select distinct
coalesce(t1.y,t2.a) as y,
case
when missing(t1.y) then 0
else count(t1.y)
end as N
from data_one as t1
right join data_two as t2
on t1.y=t2.a
group by 1
order by 1;
quit;
proc transpose data=count_all out=final (drop=_name_);
id y;
run;

SAS Proc means, storing mean values as variables

I need to find a ratio of two mean values, that I have found using proc means.
proc means data=a;
class X Y;
var x1 x2;
run;
Then I get the output mean values for variables x1 and x2 in the two categories of X and Y, but it is x1/x2 for each category that I am interested in, and doing it by hand is not really a solution.
I am not a professional programmer, so I hope there is a simple piece of code that I can understand and use.
You need to either precompute or postcompute x1/x2, depending on whether you want mean(x1/x2) or mean(x1)/mean(x2). The two can give different answers, for example if x1 and x2 have different numbers of responses.
So use either of the following ("..." means fill in what you already have):
data premean;
set have;
x1x2 = x1/x2;
run;
proc means ... ;
class ... ;
var x1x2;
run;
or
proc means ...;
class ... ;
var x1 x2;
output out=postmeans mean=;
run;
data want;
set postmeans;
x1x2=x1/x2;
run;
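To see that the two choices really do differ, a tiny made-up data set is enough. The data set demo and its two rows are invented for this illustration and are not taken from the question.

```sas
data demo;
  input x1 x2;
  datalines;
1 1
4 2
;

proc sql;
  select avg(x1 / x2)      as mean_of_ratios,  /* (1/1 + 4/2) / 2 = 1.5 */
         avg(x1) / avg(x2) as ratio_of_means   /* 2.5 / 1.5, about 1.67 */
  from demo;
quit;
```

Which of the two is right depends on the question being asked of the data, so it is worth deciding before choosing between the precompute and postcompute versions.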
proc sql noprint;
create table xy_ratio as /* New table name*/
select distinct X, Y, avg(x1)/avg(x2) as x1_x2_ratio /* selects distinct rows containing variables listed here. (Must include group by variables) mean of x1 / mean of x2 to form ratio*/
from a /*source dataset*/
group by X, Y /*Similar to class statement, will provide an average for each distinct combination of X and Y that appear in the dataset*/
;
quit;