Count the number of unique ids for every subset of variables - sas

I want to find the number of unique ids for every subset combination of the variables. For example
data have;
input id var1 var2 var3;
datalines;
5 1 0 0
5 1 1 1
5 1 0 1
5 0 0 0
6 1 0 0
7 1 1 1
8 1 0 1
9 0 0 0
10 1 0 0
11 1 0 0
12 1 . 1
13 0 0 1
;
run;
I want the result to be
var1 var2 var3 count
. . 0 5
. . 1 5
. 0 . 7
. 0 0 5
. 0 1 3
. 1 . 2
. 1 1 2
0 . . 3
0 . 0 2
0 . 1 1
0 0 . 3
0 0 0 2
0 0 1 1
1 . . 7
1 . 0 4
1 . 1 4
1 0 . 5
1 0 0 4
1 0 1 2
1 1 . 2
1 1 1 2
which is the result of appending the output of every possible PROC SQL GROUP BY (the one for var1 is shown below):
proc sql;
create table sub1 as
select var1, count(distinct id) as count
from have
where not missing(var1)
group by var1
;
quit;
I don't care about the case where all variables are missing or when any of the variables in the group by are missing. Is there a more efficient way of doing this?

You can use PROC SUMMARY to compute the combinations of var1-var3 values within each id BY group; the MISSING option keeps observations that have a missing class variable, such as id 12. From the SUMMARY output, a SQL query can count the distinct ids per combination.
Example:
data have;
input id var1 var2 var3;
datalines;
5 1 0 0
5 1 1 1
5 1 0 1
5 0 0 0
6 1 0 0
7 1 1 1
8 1 0 1
9 0 0 0
10 1 0 0
11 1 0 0
12 1 . 1
13 0 0 1
;
proc summary noprint missing data=have;
by id;
class var1-var3;
output out=combos;
run;
proc sql;
create table want as
select var1, var2, var3, count(distinct id) as count
from combos
group by var1, var2, var3
;
quit;
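The combos dataset also holds, for every id, a row where var1-var3 are all missing (the overall _TYPE_=0 row that PROC SUMMARY writes), so the query above returns one extra all-missing group. A minimal sketch of dropping it, assuming the combos dataset from the PROC SUMMARY step above; the N function counts its nonmissing numeric arguments:

```sas
proc sql;
  create table want as
  select var1, var2, var3, count(distinct id) as count
  from combos
  /* keep only rows where at least one class variable is nonmissing */
  where n(var1, var2, var3) > 0
  group by var1, var2, var3
  ;
quit;
```

The question says the all-missing case does not matter, so this filter is optional.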


Create duplicate rows in SAS and change values of variables

I am confused about how to implement this in SAS. I am trying to create duplicate rows when the value 2 occurs more than once among the variables member1-member4. For example, if a row has the value 2 in member2, member3, and member4, then I will create 2 duplicate rows: the initial row serves for member2 and the duplicate rows are for member3 and member4. On the duplicate row for member3, for example, member2 and member4 will be set to missing if their value equals 2. Basically, the value 2 can occur only once per row.
Assume sa1 to sa4 are other variables that correspond to member1 to member4 respectively. When we create a duplicate row for a member, the sa variables of the other members should be set to missing if they have the value 1. For example, if the duplicate row is for member3, then any of sa1, sa2 and sa4 that equal 1 should be set to missing. The other variables in the dataset keep the same values on the duplicate rows as on the initial row. Duplicate rows also get a suffix on the ID to indicate the parent row.
This is an example of the data I have
id member1 member2 member3 member4 sa1 sa2 sa3 sa4
1 0 2 2 0 0 1 1 0
2 2 2 0 5 . 1 0 0
3 2 2 3 2 1 1 0 1
Then this is the output I am trying to achieve
id member1 member2 member3 member4 sa1 sa2 sa3 sa4
1 0 2 . 0 0 1 . 0
1_1 0 . 2 0 0 . 1 0
2 2 . 0 5 . . 0 0
2_1 . 2 0 5 . 1 0 0
3 2 . 3 . 1 . 0 .
3_1 . 2 3 . . 1 0 .
3_2 . . 3 2 . . 0 1
Will appreciate any help. Thank you!
You need to count the number of '2's. You also need to remember where they used to be. "I had the spots removed for good luck, but I remember where the spots formerly were."
data have ;
input id :$10. member1 member2 member3 member4 sa1 sa2 sa3 sa4 ;
cards;
1 0 2 2 0 0 1 1 0
2 2 2 0 5 . 1 0 0
3 2 2 3 2 1 1 0 1
4 2 0 0 0 . . . .
5 0 0 0 0 . . . .
;
data want ;
set have ;
array m member1-member4 ;
array x [4] _temporary_;
do index=1 to dim(m);
x[index]=m[index]=2;
end;
n2 = sum(of x[*]);
if n2<2 then output;
else do counter=1 to n2;
id=scan(id,1,'_');
if counter > 1 then id=catx('_',id,counter-1);
counter2=0;
do index=1 to dim(m);
if x[index] then do;
counter2+1;
if counter = counter2 then m[index]=2;
else m[index]=.;
end;
end;
output;
end;
drop index n2 counter counter2;
run;
Results
Obs id member1 member2 member3 member4 sa1 sa2 sa3 sa4
1 1 0 2 . 0 0 1 1 0
2 1_1 0 . 2 0 0 1 1 0
3 2 2 . 0 5 . 1 0 0
4 2_1 . 2 0 5 . 1 0 0
5 3 2 . 3 . 1 1 0 1
6 3_1 . 2 3 . 1 1 0 1
7 3_2 . . 3 2 1 1 0 1
8 4 2 0 0 0 . . . .
9 5 0 0 0 0 . . . .
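The results above leave sa1-sa4 unchanged, while the question also asks for sa values of 1 to be blanked. Here is a possible extension of the step above. It assumes that an sa value is set to missing only when its member was one of the 2s blanked out on that row. That rule reproduces the question's expected output for ids 1-3, but it is a guess at the intended logic. The arrays s and s0 are names introduced here for illustration.

```sas
data want;
  set have;
  array m member1-member4;
  array s sa1-sa4;
  array x  [4] _temporary_;  /* 1 where the member was originally 2 */
  array s0 [4] _temporary_;  /* original sa values, restored for each output row */
  do index = 1 to dim(m);
    x[index]  = (m[index] = 2);
    s0[index] = s[index];
  end;
  n2 = sum(of x[*]);
  if n2 < 2 then output;
  else do counter = 1 to n2;
    id = scan(id, 1, '_');
    if counter > 1 then id = catx('_', id, counter - 1);
    counter2 = 0;
    do index = 1 to dim(m);
      s[index] = s0[index];
      if x[index] then do;
        counter2 + 1;
        if counter = counter2 then m[index] = 2;
        else do;
          m[index] = .;
          /* assumption: blank the matching sa only when it equals 1 */
          if s0[index] = 1 then s[index] = .;
        end;
      end;
    end;
    output;
  end;
  drop index n2 counter counter2;
run;
```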
I think you're expecting us to code the whole thing for you... I don't follow your explanation of the logic, but to start off with:
1. Create a new dataset.
2. Rename all the variables on the way in, prefixing them with O_ (Original).
3. Code however you like to count how many values contain 2 (HOWMANYTWOS).
4. do ROW = 1 to HOWMANYTWOS
4.1 Go through the values of the O_ variables again.
4.2 If the 2 corresponds to ROW on your increasing counter, it is the 2 you wish to keep, so you don't touch it. If the 2 does not correspond to ROW, make it missing (.).
4.3 Output the record with a new (if required) ID.
A start for you:
data NEW;
/* rename any other variables you need to check in the same way */
set ORIG (rename=(MEMBER1-MEMBER4=O_MEMBER1-O_MEMBER4 ID=O_ID));
HOWMANYTWOS = sum(O_MEMBER1=2, O_MEMBER2=2, O_MEMBER3=2, O_MEMBER4=2);
/* This loop steps through and creates the new rows. You still need to step through
   the variables to decide whether to set them to missing before outputting.
   NOTE: do not change the O_ variables; only create/update the variables going to
   the output dataset (the O_ versions are for checking against only). */
do ROW = 1 to HOWMANYTWOS;
ID = ifc(ROW = 1, O_ID, catx("_", O_ID, ROW - 1)); /* assumes ID is character */
/* create a counter here */
output;
end;
run;
Sorry, I don't have SAS here and it's been a little while.

Identifying first occurrence after trigger event

I have a big panel dataset that looks somewhat like this:
data have;
input id t a b ;
datalines;
1 1 0 0
1 2 0 0
1 3 1 0
1 4 0 0
1 5 0 1
1 6 1 0
1 7 0 0
1 8 0 0
1 9 0 0
1 10 0 1
2 1 0 0
2 2 1 0
2 3 0 0
2 4 0 0
2 5 0 1
2 6 0 1
2 7 0 1
2 8 0 1
2 9 1 0
2 10 0 1
3 1 0 0
3 2 0 0
3 3 0 0
3 4 0 0
3 5 0 0
3 6 0 0
3 7 1 0
3 8 0 0
3 9 0 0
3 10 0 0
;
run;
For every ID I want to record all 'trigger' events, namely when a=1, and then I need to know how long it takes until the next occurrence of b=1. The final output should then give me the following:
data want;
input id a_no a_t b_t diff ;
datalines;
1 1 3 5 2
1 2 6 10 4
2 1 2 5 3
2 2 9 10 1
3 1 7 . .
;
run;
It is of course no problem to get all a=1 and b=1 events, but as it is a very big dataset with a lot of both events for every ID I am searching for an elegant and straight-forward solution. Any ideas?
Here's a fairly simple SQL approach that gives more or less the desired output:
proc sql;
create table want
as select
t1.id,
t1.t as a_t,
t2.t as b_t,
t2.t - t1.t as diff
from
have(where = (a=1)) t1
left join
have(where = (b=1)) t2
on
t1.id = t2.id
and t2.t > t1.t
group by t1.id, t1.t
having diff = min(diff)
;
quit;
The only part missing is a_no. This sort of row-incrementing ID is quite a lot of work to generate consistently in SQL, but it's trivial with an extra data step:
data want;
set want;
by id;
if first.id then a_no = 0;
a_no + 1;
run;
An elegant DATA step solution can use nested DOW loops. It's straightforward once you understand DOW loops.
data want(keep=id--diff);
length id a_no a_t b_t diff 8;
do until (last.id); * process each group;
do a_no = 1 by 1 until(last.id); * counter for each output;
do until ( output_condition or end); * process each triggering state change;
SET have end=end; * read data;
by id; * setup first. last. variables for group;
if a=1 then a_t = t; * detect and record start of trigger state;
output_condition = (b=1 and t > a_t > 0); * evaluate for proper end of trigger state;
end;
if output_condition then do;
b_t = t; * compute remaining info at output point;
diff = b_t - a_t;
OUTPUT;
a_t = .; * reset trigger state tracking variables;
b_t = .;
end;
else
OUTPUT; * end of data reached without triggered output;
end;
end;
run;
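For readers new to the pattern: a DOW loop places the SET statement inside an explicit DO UNTIL(last.) loop, so one pass of the DATA step consumes a whole BY group. Below is a minimal single-loop sketch, assuming the have dataset from the question. The names a_counts and n_a are made up for illustration. It only counts the a=1 events per id and does not reproduce the nested logic above.

```sas
data a_counts(keep=id n_a);
  /* one pass of this loop reads every row of one id group */
  do until (last.id);
    set have;
    by id;
    /* n_a is reset to missing at the top of each DATA step pass,
       so it accumulates within the current group only */
    n_a = sum(n_a, a = 1);
  end;
  output;  /* one row per id, written after its last row was read */
run;
```

With the sample data this should yield n_a of 2, 2 and 1 for ids 1, 2 and 3.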
Note: A SQL way (not shown) can use self join within groups.

Create values for group - SAS

data test;
input Index Indicator value FinalValue;
datalines;
1 0 5 21
1 1 21 21
2 1 0 0
3 0 4 7
3 1 7 7
3 0 8 7
3 0 2 7
4 1 1 1
4 0 4 1
;
run;
I have a data set with the first 3 columns. How do I get the 4th column based on the indicator? For example, for index 1, the value is 21 where indicator = 1, so I put 21 as the final value in all rows for index 1.
Use the SAS RETAIN statement.
You can do this in a data step by retaining the value where indicator = 1.
Steps:
1. Sort your data by Index and descending Indicator
2. Group by Index and retain the value where Indicator=1
Code:
/*Sort data by Index and descending Indicator, and remove the hardcoded FinalValue*/
proc sort data=test (keep= Index Indicator value);
by index descending indicator ;
run;
/*Retain the FinalValue*/
data want;
set test;
by index;
retain FinalValue;
keep Index Indicator value FinalValue;
if indicator =1 then FinalValue=value;
/*The IF statement below assigns . to groups that do not have a record with an indicator value of 1*/
if indicator ne 1 and FIRST.Index=1 then FinalValue=.;
run;
Output:
Index=1 Indicator=1 value=21 FinalValue=21
Index=1 Indicator=0 value=5 FinalValue=21
Index=2 Indicator=1 value=0 FinalValue=0
Index=3 Indicator=1 value=7 FinalValue=7
Index=3 Indicator=0 value=4 FinalValue=7
Index=3 Indicator=0 value=8 FinalValue=7
Index=3 Indicator=0 value=2 FinalValue=7
Index=4 Indicator=1 value=1 FinalValue=1
Index=4 Indicator=0 value=4 FinalValue=1
Use a left join in PROC SQL. Select the value where indicator=1 within each index group, then left join the result to the original dataset. It seems that your first row of index=3 should be 7, not 0.
proc sql;
select a.*,b.finalvalue from test a
left join (select *,value as finalvalue from test group by index having indicator=1) b
on a.index=b.index;
quit;
This is rather old school but should be adequate. I reckon you call it a self merge or something.
data test;
input Index Indicator value;* FinalValue;
datalines;
1 0 5 21
1 1 21 21
2 1 0 0
3 0 4 7
3 1 7 7
3 0 8 7
3 0 2 7
4 1 1 1
4 0 4 1
;;;;
run;
data final;
if 0 then set test;
merge test(where=(indicator eq 1) rename=(value=FinalValue)) test;
by index;
run;
proc print;
run;
Obs Index Indicator value FinalValue
1 1 0 5 21
2 1 1 21 21
3 2 1 0 0
4 3 0 4 7
5 3 1 7 7
6 3 0 8 7
7 3 0 2 7
8 4 1 1 1
9 4 0 4 1

SAS : proc freq on 3 categorical variables

I have data with
One binary variable, poor
Two socio-demographic variables var1 and var2
I would like to have the poverty rate for each possible combination of var1 * var2 values, which would look like this:
But with three variables in a proc freq, I get multiple outputs, one for each value of the first variable in my table request
proc freq data=test;
table var1*var2*poor;
run;
How can I get something close to what I would like ?
Try this
data test;
input var1 var2 poor;
cards;
1 1 1
2 3 0
3 2 1
4 1 1
1 2 1
2 3 0
4 1 0
4 2 0
3 1 1
1 2 0
3 2 0
1 3 1
3 3 0
3 3 0
3 3 1
1 1 0
2 2 0
2 2 1
2 2 1
2 1 1
2 1 1
2 1 1
;
run;
proc tabulate data=test;
class var1 var2 poor;
tables var1,
var2*poor*pctn<poor>={label="%"};
run;
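If a plain list of rates is enough, here is an alternative sketch that is not part of the answer above. It assumes poor is coded 0/1 as in the test data, so its mean within each var1*var2 group is the poverty rate. The column names poverty_rate and n are made up for illustration.

```sas
proc sql;
  select var1,
         var2,
         mean(poor) as poverty_rate format=percent8.1, /* share of rows with poor=1 */
         count(*)   as n                               /* group size, for context */
  from test
  group by var1, var2;
quit;
```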

How to show variable value by proc tabulate in sas?

How can I get proc tabulate to show the value of a variable that has missing values, instead of a statistic? Thanks!
For example, I want to show the value of sym. It takes value 'x' or missing value. How can I do it?
Sample code:
data test;
input tx mod bm $ yr sym $;
datalines;
1 1 a 0 x
1 2 a 0 x
1 3 a 0 x
2 1 a 0 x
2 2 a 0 x
2 3 a 0 x
3 1 a 0
3 2 a 0
3 3 a 0 x
1 1 b 0 x
1 2 b 0
1 3 b 0
1 4 b 0
1 5 b 0
2 1 b 0
2 2 b 0
2 3 b 0
2 4 b 0
2 5 b 0
3 1 b 0 x
3 2 b 0
3 3 b 0
1 1 c 0
1 2 c 0 x
1 3 c 0
2 1 c 0
2 2 c 0
2 3 c 0
3 1 c 0
3 2 c 0
3 3 c 0
1 3 a 1 x
2 3 a 1
3 3 a 1
1 3 b 1
2 3 b 1
3 3 b 1
1 3 c 1 x
2 3 c 1
3 3 c 1
;
run;
proc tabulate data=test;
class yr bm tx mod ;
var sym;
table yr*bm, tx*mod;
run;
proc tabulate data=test;
class tx mod bm yr sym;
table yr*bm, tx*mod*sym*n;
run;
That gives you a 1 for each SYM=x. It drops the observations where SYM is missing, because PROC TABULATE excludes observations with a missing class variable by default, hence you lose some cells from your example table overall. (You could easily format the column with a format that defines 1 = 'x'.)
proc tabulate data=test;
class tx mod bm yr;
class sym /missing;
table yr*bm, tx*mod*sym=' '*n;
run;
That gives you all of your combinations of the 4 main variables, but includes missing syms as their own column.
If you want to have your cake and eat it too, then you need to redefine SYM to be a numeric variable, so you can use it as a VAR.
proc format;
invalue ISYM
'x'=1
;
value FSYM
1='x';
run;
data test;
infile datalines truncover;
input tx mod bm $ yr sym :ISYM.;
format sym FSYM.;
datalines;
1 1 a 0 x
1 2 a 0 x
1 3 a 0 x
... more lines ...
;
run;
proc tabulate data=test;
class tx mod bm yr;
var sym;
table yr*bm, tx*mod*sym*sum*f=FSYM.;
run;
All of these assume these are unique combination rows. If you start having multiples of yr*bm*tx*mod, you would have a problem here as this wouldn't give you the expected result (sum 1+1+1=3 would not give you an 'x').
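One way to soften that limitation, offered as a sketch rather than a tested fix: request MAX instead of SUM, so duplicate rows of 1 still collapse to 1 and format as 'x'. The dup dataset below is a made-up three-row example with one duplicated combination.

```sas
proc format;
  value FSYM 1='x';
run;

/* hypothetical data: the first combination appears twice */
data dup;
  input tx mod bm $ yr sym;
  datalines;
1 1 a 0 1
1 1 a 0 1
1 2 a 0 .
;
run;

proc tabulate data=dup;
  class tx mod bm yr;
  var sym;
  /* max of 1 and 1 is still 1, so the cell formats as x */
  table yr*bm, tx*mod*sym*max*f=FSYM.;
run;
```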