I have the following dataset:
ID Status
1 cake
1 cake
1 flower
2 flower
2 flower
3 cake
3 flower
4 cake
4 cake
4 cake
Basically, I am only interested in the observations that, grouped by ID, include at least one flower. I also want an indicator showing whether the ID group contains only flower or cake as well. For example, I would ideally like something like:
ID Status Indicator
1 cake 1
1 cake 1
1 flower 1
2 flower 2
2 flower 2
3 cake 1
3 flower 1
4 cake 0
4 cake 0
4 cake 0
I have tried to subset the dataset in multiple ways and merge together, conditional on the ID, but it does not seem to be working.
This SAS data step, based on your input (which I called test here), returns that indicator value once per ID group.
proc sort data=test;
by ID descending status;
run;
data result(drop=status);
set test;
by ID;
retain indicator;
if first.ID then indicator=0;
if status='flower' and indicator=0 then indicator=2;
if status='cake' and indicator=2 then indicator=1;
if last.ID then output;
run;
You could join that result with the source data to get the result as you provided it in your post.
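The join mentioned above could be sketched as a simple match-merge. This assumes TEST is still sorted by ID (it is after the PROC SORT above) and that RESULT holds one row per ID; the output name WANT is only illustrative.

```sas
/* Sketch: attach the per-ID indicator back onto every detail row. */
/* Assumes TEST is sorted by ID and RESULT has one row per ID.     */
data want;
  merge test
        result(keep=ID indicator);
  by ID;
run;
```

Because RESULT contributes a single row per BY group, its indicator value is carried across all detail rows of that ID.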
NOTE: I don't have enough reputation to comment on the answer provided by Gordon Linoff, but I want to point out that its indicator will not take three values (0='no flower', 1='cake+flower', 2='only flower'); it will instead be a count of the 'flower' entries per ID, which I don't think is quite what the poster is asking for.
Rewritten as follows, the query gives a three-valued indicator: 0='no flower', 1='only flower', 2='cake+flower'. Note that 1 and 2 are swapped relative to the coding used in the question.
proc sql;
select t.*,
(count(distinct status))*(sum(case when status = 'flower' then 1 else 0 end)>0) as indicator
from test t
group by id;
quit;
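If the exact coding from the question is needed (1 = cake and flower, 2 = only flower), one possible variant is a CASE expression over the same summary functions. This is a sketch, not part of the original answer; it assumes the same TEST table and treats any second distinct status alongside flower as "cake too".

```sas
proc sql;
  select t.*,
         case
           when max(status = 'flower') = 0 then 0  /* no flower in this ID     */
           when count(distinct status) > 1 then 1  /* flower plus another status */
           else 2                                   /* only flower              */
         end as indicator
  from test t
  group by id;
quit;
```

Like the query above, this relies on PROC SQL remerging the group-level summary values with the detail rows.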
proc sql comes to mind:
proc sql;
select t.*, tt.indicator
from t join
(select id, sum(case when status = 'flower' then 1 else 0 end) as indicator
from t
group by id
) tt
on tt.id = t.id;
proc sql also has a "remerge" extension to SQL, which merges group-level summary values back onto the detail rows. That allows you to do:
proc sql;
select t.*,
sum(case when status = 'flower' then 1 else 0 end) as indicator
from t
group by id;
quit;
If your data is already sorted by ID then you could use a double DOW loop. The first loop will check for the presence of the values. Then you can use another loop to write back all of the detail rows for that group.
data want ;
do until (last.id);
set have;
by id;
if status='flower' then _flower=1;
else if status='cake' then _cake=1;
end;
if _flower and _cake then indicator=1;
else if _flower then indicator=2;
else indicator=0;
do until (last.id);
set have;
by id;
output;
end;
run;
This should be fast assuming the data is already sorted.
I have data with three variables: the first is id, the second is the observation count for that id, and the third is the value of that observation. I want to transpose the data from long to wide. The issue is that I am getting an error saying my BY group is not sorted in ascending order (even though it is). Another issue is that not all ids have the same number of observations. Please see the example below and the data structure I am looking for.
data have;
input id observation value;
cards;
1 1 '4.8.9'
1 2 '4.5.7'
2 1 '5.0.5'
3 1 '4.2.0'
3 2 '4.1.0'
3 3 '5.1.9';run;
data want;
input id observation1 observation2 observation3;
cards;
1 '4.8.9' '4.5.7' NA
2 '5.0.5' NA NA
3 '4.2.0' '4.1.0' '5.1.9'
;run;
/* i have tried the following:
proc transpose data=b out=c ;
by value ;
id id;
var value;
run;
proc transpose data=b out=c ;
by value ;
id id;
var observation;
run;
*/
Your BY variable is called ID in your example dataset.
Your example data step is not defining VALUE as character. Also don't indent the in-line data lines.
You can use the prefix= option to help name the new variables. Also let's modify the value of OBSERVATION for ID=2 to demonstrate more clearly how the value of OBSERVATION is setting the variable name instead of just the order of the observations in the ID group. Now the value '5.0.5' will be stored in OBSERVATION2 even though it is the first observation for that value of ID.
data have;
input id observation value $;
cards;
1 1 '4.8.9'
1 2 '4.5.7'
2 2 '5.0.5'
3 1 '4.2.0'
3 2 '4.1.0'
3 3 '5.1.9'
;
proc transpose data=have out=want(drop=_name_) prefix=observation;
by id;
id observation;
var value;
run;
Results:
Obs    id    observation1    observation2    observation3
 1      1      '4.8.9'         '4.5.7'
 2      2                      '5.0.5'
 3      3      '4.2.0'         '4.1.0'         '5.1.9'
I have a dataset as follows:
ID status
101 Checked
101 Checked
101 NotChecked
101 Checked
101 NotChecked
I want to count the number of obs based on the status variable, like:
ID status Count
101 Checked 2
101 Checked 2
101 NotChecked 1
101 Checked 1
101 NotChecked 1
I don't want to use proc sql, because GROUP BY sorts the dataset before returning the result, whereas here the status variable is not sorted.
Aggregating by groups will always require sorting unless you want to use some complex data step logic.
If you have a particular sort order that you want to keep, the easiest way is to create a key column that holds your desired order. You can then re-sort the data back into that order after grouping.
data have2;
set have;
varorder = _N_;
run;
proc sql;
create table want as
select id, status, count(*) as count
from have2
group by id, status
order by varorder
;
quit;
This works for me. It is a somewhat longer solution, but the idea is to add row and group identifiers to control the count. The NOTSORTED option on the BY statement lets each consecutive run of the same status be treated as its own group.
data have;
input ID status $12.;
cards;
101 Checked
101 Checked
101 NotChecked
101 Checked
101 NotChecked
;;;;
run;
data grouped;
set have;
by id status notsorted;
retain MyGroups count;
if first.id then count=1;
else count+1;
if first.status then MyGroups+1;
run;
proc sql;
create table want as
select *, count(*) as numberFound
from grouped
group by MyGroups
order by ID, count;
quit;
I have a dataset that contains the ID and a variable called CC. The CC holds multiple numbered values where each value represents something. It looks like this:
An ID can have the same CC in multiple rows. I just want to flag whether the CC exists or not, so even if Joe had five rows stating that his CC equals 3, I just want a 1 or 0 stating whether Joe ever had a CC equal to 3.
I want it to look like this:
I tried coding it as shown below. The issue is that, although I know an ID can have more than one type of CC, the final dataset created by the code only shows one filled CC for each ID. I think maybe it's overwriting it?
Also I should note that prior to this code I created the CC Flag variables and filled it all as zeros.
proc sql;
DROP TABLE Flagged_CCs;
CREATE TABLE Flagged_CCs AS
select
ID,
COUNT(ID) as count_ID,
case when CC=1 then 1 end as CC_1,
case when CC=2 then 1 end as CC_2,
case when CC=3 then 1 end as CC_3
from Original_Dataset
group by ID;
quit;
Any help is appreciated, thank you.
Is your issue the fact that after running your new code you still get multiple lines per ID?
If so I propose this:
proc sql;
DROP TABLE Flagged_CCs;
CREATE TABLE Flagged_CCs AS
select ID
,case when CC_1 >0 then 1 else 0 end as CC_1
,case when CC_2 >0 then 1 else 0 end as CC_2
,case when CC_3 >0 then 1 else 0 end as CC_3
from (
select
ID,
COUNT(ID) as count_ID,
sum(case when CC=1 then 1 end) as CC_1,
sum(case when CC=2 then 1 end) as CC_2,
sum(case when CC=3 then 1 end) as CC_3
from Original_Dataset
group by ID
);
quit;
The reason you are having the issue is that you are only aggregating the count of ID and not the other values. Using an aggregate function on them as well will collapse the duplicate records into one row per ID.
Hope this helps
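A more compact variant of the same idea, offered as a sketch rather than a replacement for the query above: in PROC SQL a comparison such as CC=1 evaluates to 1 or 0, so taking MAX of it per ID yields the flag directly. Table and column names are the ones from the question.

```sas
proc sql;
  create table Flagged_CCs as
  select ID,
         max(CC = 1) as CC_1,  /* 1 if any row for this ID has CC=1, else 0 */
         max(CC = 2) as CC_2,
         max(CC = 3) as CC_3
  from Original_Dataset
  group by ID;
quit;
```

Every selected column is now either the grouping key or an aggregate, so the result has exactly one row per ID.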
If you're looking for a report here's one method, using PROC TABULATE.
proc format ;
value indicator_fmt
low - 0, . = 0
0 <- high = 1;
run;
proc tabulate data=have;
class id cc;
table id , cc*N=''*f=indicator_fmt.;
run;
Your output will look like this then:
If you want a fully dynamic approach that produces a table without needing to know anything ahead of time (such as the number of CCs), a different approach is needed. It's a bit longer, but the dynamic part may make it worthwhile to implement.
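One way such a dynamic approach might look is sketched below. It is an illustration under assumptions, not the original answer's code: it assumes the input is called HAVE with a numeric CC, and the names PAIRS, WIDE, WANT, flag and the CC_ prefix are made up for the example.

```sas
/* One row per distinct ID/CC pair, ignoring rows where CC is missing */
proc sort data=have(keep=id cc where=(not missing(cc)))
          out=pairs nodupkey;
  by id cc;
run;

/* Constant to transpose: 1 means "this ID has this CC" */
data pairs;
  set pairs;
  flag = 1;
run;

/* One column per CC value found in the data: CC_1, CC_2, ... */
proc transpose data=pairs out=wide(drop=_name_) prefix=CC_;
  by id;
  id cc;
  var flag;
run;

/* Combinations that never occurred come out missing; set them to 0 */
data want;
  set wide;
  array flags{*} CC_:;
  do _i = 1 to dim(flags);
    if missing(flags{_i}) then flags{_i} = 0;
  end;
  drop _i;
run;
```

The set of CC_ columns is determined by whatever values appear in the data, so nothing has to be listed by hand. An ID whose CC is missing on every row would not appear in the output.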
I want to ask a question about SAS programming that is complicated (for me). I think I can explain it best with a simple example. I have the following dataset:
Group Category
A 1
A 1
A 2
A 1
A 2
A 3
B 1
B 2
B 2
B 1
B 3
B 2
I want to count each category within each group. I can do it with PROC FREQ, but that is not a good fit for my dataset: it is very large and has a huge number of groups, so it would be time consuming. If I use PROC FREQ, I first need to create a new dataset for each group and then run PROC FREQ on each one. In sum, I need to create the following dataset:
         CATEGORIES
Group    1 (first category)    2    3
A        3                     2    1
B        2                     3    1
So, the count of the first category in group A is 3, the count of the first category in group B is 2, and so on. I hope that explains it. Thanks for your help.
There is more than one way to do this in SAS. My bias is proc sql, so:
proc sql;
select grp,
sum(case when category = 1 then 1 else 0 end) as cat_1,
sum(case when category = 2 then 1 else 0 end) as cat_2,
sum(case when category = 3 then 1 else 0 end) as cat_3
from t
group by grp;
Either proc freq or proc summary will do the job of producing frequency counts:
data example;
length group category $1;
input group category;
cards;
A 1
A 1
A 2
A 1
A 2
A 3
B 1
B 2
B 2
B 1
B 3
B 2
;
run;
proc freq data=example;
table group*category;
run;
proc summary data=example nway;
class group category;
output out=example_frequency (drop=_type_);
run;
proc summary will produce a dataset in a 'long' format. If you need to transpose it (I'd suggest not doing so: you'll probably find working with the long format easier in most circumstances) you can use proc transpose:
proc transpose data=example_frequency out=example_matrix (drop=_name_);
by group;
id category;
var _freq_;
run;
Here is my data :
data example;
input id sports_name;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
This is just a sample. The variable sports_name is categorical with 56 types.
I am trying to transpose the data to wide form where each row would have a user_id and the names of sports as the variables with values being 1/0 indicating Presence or absence.
So far, I have used PROC FREQ to get the cross-tabulated frequency table, put that in a different data set, and then transposed that data. Now I have missing values in some cases and the count of the sport in the rest.
Is there any better way to do this?
Thanks!!
You need a way to create something from nothing. You could have also used the SPARSE option in PROC FREQ. SAS names cannot have length greater than 32.
data example;
input id sports_name :$16.;
retain y 1;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;;;;
run;
proc print;
run;
proc summary data=example nway completetypes;
class id sports_name;
output out=freq(drop=_type_);
run;
proc print;
run;
proc transpose data=freq out=wide(drop=_name_);
by id;
var _freq_;
id sports_name;
run;
proc print;
run;
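The SPARSE alternative mentioned above could be sketched as follows. The output names FREQ_SPARSE and WIDE_SPARSE are illustrative; COUNT and PERCENT are the variables PROC FREQ writes to its OUT= data set.

```sas
/* SPARSE writes every ID x sport combination, with COUNT=0 where absent */
proc freq data=example noprint;
  tables id*sports_name / sparse out=freq_sparse(drop=percent);
run;

/* One row per ID, one column per sport */
proc transpose data=freq_sparse out=wide_sparse(drop=_name_ _label_);
  by id;
  var count;
  id sports_name;
run;
```

In the sample data each sport appears at most once per id, so COUNT is already 1 or 0. With repeated rows the counts would need to be capped at 1 to get a pure presence flag.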
Same theory here: generate a list of all possible combinations, using SQL instead of PROC SUMMARY, and then transpose the results.
data example;
informat sports_name $20.;
input id sports_name $;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
proc sql;
create table complete as
select a.id, a_x.sports_name, case when not missing(e.sports_name) then 1 else 0 end as Present
from (select distinct ID from example) a
cross join (select distinct sports_name from example) a_x
full join example as e
on e.id=a.id
and e.sports_name=a_x.sports_name;
quit;
proc transpose data=complete out=want;
by id;
id sports_name;
var Present;
run;