I want to count the number of times a certain value appears in a particular column in SAS. For example, in the following dataset the value 1 appears 3 times,
value 2 appears twice, value 3 appears once, value 4 appears 3 times and value 5 appears 5 times.
Game_ball
1
1
1
2
2
3
4
4
4
5
5
5
5
5
I want the dataset to be represented like the following:
Game_ball Count
1 3
2 2
3 1
4 3
5 5
. .
. .
. .
Thanks in advance
As per @Dwal, proc freq is the easiest solution.
Using your sample data,
proc freq data=sample;
table game_ball/out=output;
run;
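Or do it in a one-pass data step (after sorting):
proc sort data = sample; by game_ball; run;
data output;
set sample;
by game_ball; /* BY follows SET so that first./last. variables are created */
if first.game_ball then count = 0;
count + 1; /* sum statement: count is retained automatically */
if last.game_ball then output;
run;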
Or in SQL
proc sql;
create table output as
select game_ball, count(*) as count
from sample
group by game_ball;
quit;
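To try any of the three approaches above, the sample dataset has to exist first. The following is a minimal sketch, assuming the dataset is named sample and holds the single numeric column game_ball from the question. The NOPRINT option and the DROP= dataset option are additions here, used only to keep the output down to the two requested columns.

```sas
/* Sample data from the question, read several values per line with @@ */
data sample;
  input game_ball @@;
  datalines;
1 1 1 2 2 3 4 4 4 5 5 5 5 5
;
run;

/* NOPRINT suppresses the printed table; OUT= writes one row per value */
proc freq data=sample noprint;
  tables game_ball / out=output(drop=percent);
run;

/* OUTPUT now holds GAME_BALL and COUNT only */
proc print data=output noobs;
run;
```

The COUNT column in output is the number of rows per game_ball value, the same figure the data step and SQL versions produce.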
I have a dataset with a varying number of observations per ID, and the participants have different treatment statuses (Group). Can I use proc means to quickly calculate the number of participants and the number of clinic visits per group status? Ideally, I would use the proc means sum function to capture those with 0 and 1 based on group status and get the totals. However, I am stuck on how to proceed.
ID Visit Group
1 1 0
1 2 0
2 1 1
2 2 1
2 3 1
3 1 0
4 1 1
4 2 1
5 1 0
5 2 0
6 1 1
6 2 1
6 3 1
6 4 1
Specifically, I am interested in 1) the total number of participants in each group status. In this case we have 3 participants (ID: 1, 3, and 5) in the control group (0) and another 3 participants (ID: 2, 4, and 6) in the treatment group (1).
2) the total number of visits per group status. In this case, the total visits in the control group (0) will be 5 (2+1+2=5) and the total visits in the treatment group (1) will be 9 (3+2+4=9).
Can proc means quickly calculate such values? Thanks.
Yes, you can use proc means to get counts.
data have;
input ID$ Visit Group;
cards;
1 1 0
1 2 0
2 1 1
2 2 1
2 3 1
3 1 0
4 1 1
4 2 1
5 1 0
5 2 0
;
run;
proc means data=have n;
class group id;
var visit;
types group id group*id;
run;
If you want the sum of visit as well, add the sum keyword after n in the proc means statement, that is, between n and the semicolon.
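As a sketch of that change, here is the same step with SUM added to the statistics list. Note that N of visit counts the rows (the number of visits), whereas SUM adds up the visit sequence numbers themselves, so for the "total number of visits" asked about in the question it is the N column that matters.

```sas
/* Same step as above, with SUM requested alongside N */
proc means data=have n sum;
  class group id;
  var visit;
  /* one table per group, per id, and per group*id combination */
  types group id group*id;
run;
```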
It looks like GROUP is assigned at the ID level and not the ID/VISIT level. In that case if you want to count the number of ID's in each group you need to first get down to one observation per ID.
proc sort data=have nodupkey out=unique_ids ;
by id;
run;
Now you can count how many ID's are in each group. The normal way is to use PROC FREQ.
proc freq data=unique_ids;
tables group;
run;
But you can count with PROC MEANS/SUMMARY also.
proc summary data=unique_ids nway;
class group;
output out=counts N=N_ids ;
run;
proc print data=counts;
var group n_ids;
run;
MEANS doesn't do a distinct count easily, so SQL may be an easier-to-understand option here.
proc sql;
create table want as
select group, count(*) as num_visits, count(distinct ID) as num_participants
from have
group by group
order by 1;
quit;
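If you would rather stay with PROC SUMMARY, a distinct count can be approximated in two passes: first collapse to one row per group and ID, then summarize those rows per group. This is only a sketch; the dataset names per_id and per_group and the variable name visits are made up for illustration.

```sas
/* Pass 1: one row per group*id; _FREQ_ is that participant's visit count */
proc summary data=have nway;
  class group id;
  output out=per_id(drop=_type_ rename=(_freq_=visits));
run;

/* Pass 2: per group, _FREQ_ is the number of participants,
   and the sum of VISITS is the total number of visits */
proc summary data=per_id nway;
  class group;
  var visits;
  output out=per_group(drop=_type_ rename=(_freq_=num_participants))
         sum=num_visits;
run;

proc print data=per_group noobs;
run;
```

This gives the same two columns as the SQL query, at the cost of an intermediate dataset.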
I am trying to show every value ordered by frequency, not only the mode.
For example, I import an Excel file like:
A
1
1
2
3
3
3
and the code is:
ods select Modes;
proc univariate data=Want modes;
var A;
run;
The result looks like:
Mode Count
3 3
I want it to look like:
Mode Count
3 3
1 2
2 1
How can I do that?
Your desired output is actually not modes. The MODES option returns the most frequent value, or values if several share the same highest frequency, with the corresponding count. In your example, there is only one mode (3), as it is the value with the highest frequency. And that's what the result shows.
You may be interested in showing frequencies of every value present in variable A. In that case, you want to use this code:
ods select Frequencies;
proc univariate data=Want freq;
var A;
run;
That is a frequency table.
data have ;
input A @@;
cards;
1 1 2 3 3 3
;
proc freq data=have order=freq ;
tables a / out=counts;
run;
proc print data=counts;
run;
Result:
Obs A COUNT PERCENT
1 3 3 50.0000
2 1 2 33.3333
3 2 1 16.6667
This has been hard to search for. I have two categorical variables, age and month, with 7 levels each. For a few combinations, say age = 7 and month = 7, there is no value, and when I use proc sql the combinations that have no entries do not show, e.g.:
age month value
1 1 4
2 1 12
3 1 5
....
7 1 6
...
1 7 8
....
5 7 44
6 7 5
THIS LINE DOESN'T SHOW
What I want:
age month value
1 1 4
2 1 12
3 1 5
....
7 1 6
...
1 7 8
....
5 7 44
6 7 5
7 7 0
This happens a few times in the data, where the last groups don't have a value so they don't show, but I'd like them to for later purposes.
You have a few options available. One family of approaches works on the premise of creating the master data (every combination) and then merging the actual values in.
Another is to use PRELOADFMT with formats, or the CLASSDATA= option.
And the last, but possibly the easiest: if every month and every age appears somewhere in the data set, use the SPARSE option within PROC FREQ. It creates all possible combinations.
proc freq data=have;
table age*month /out = want SPARSE;
weight value;
run;
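The CLASSDATA= route mentioned above can be sketched as follows. This assumes ages and months both run from 1 to 7, as in the question, and that value is numeric; the dataset name all_levels is made up for illustration. CLASSDATA= adds the listed combinations to the analysis, but the statistic for a combination with no data comes out missing, so a short data step turns those into zeros.

```sas
/* Hypothetical master list of every age*month combination */
data all_levels;
  do age = 1 to 7;
    do month = 1 to 7;
      output;
    end;
  end;
run;

/* CLASSDATA= adds combinations that never occur in HAVE */
proc summary data=have nway classdata=all_levels;
  class age month;
  var value;
  output out=want(drop=_type_ _freq_) sum=value;
run;

/* Combinations with no data have a missing sum; replace with 0 */
data want;
  set want;
  value = coalesce(value, 0);
run;
```

Unlike SPARSE, this also works when a level is missing from the data entirely, because the levels come from all_levels rather than from have.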
First some sample data:
data test;
do age=1 to 7;
do month=1 to 12;
value = ceil(10*ranuni(1));
if ranuni(1) < .9 then
output;
end;
end;
run;
This leaves a few holes, notably, (1,1).
I would use a series of SQL statements to get the levels, cross join those, and then left join the values on, doing a coalesce to put 0 when missing.
proc sql;
create table ages as
select distinct age from test;
create table months as
select distinct month from test;
create table want as
select a.age,
a.month,
coalesce(b.value,0) as value
from (
select age, month from ages, months
) as a
left join
test as b
on a.age = b.age
and a.month = b.month;
quit;
The group-independent crossing of the classification variables requires that a distinct selection of each level variable be cross joined with the others; this forms a hull that can be left joined to the original data. For the case of an age*month combination having more than one item, you need to determine if you want
1. rows with repeated age and month and the original value, or
2. rows with distinct age and month, with either
   a. an aggregate function to summarize the values, or
   b. an indication of too many values
data have;
input age month value;
datalines;
1 1 4
2 1 12
3 1 5
7 1 6
1 7 8
5 7 44
6 7 5
8 8 1
8 8 11
run;
proc sql;
create table want1(label="Original class combos including duplicates and zeros for absent cross joins")
as
select
allAges.age
, allMonths.month
, coalesce(have.value,0) as value
from
(select distinct age from have) as allAges
cross join
(select distinct month from have) as allMonths
left join
have
on
have.age = allAges.age and have.month = allMonths.month
order by
allMonths.month, allAges.age
;
quit;
And a slight variation that marks duplicated class crossings
proc format;
value S_V_V .t = 'Too many source values'; /* single valued value */
run;
proc sql;
create table want2(label="Distinct class combos allowing only one contributor to value, or defaulting to zero when none")
as
select distinct
allAges.age
, allMonths.month
, case
when count(*) = 1 then coalesce(have.value,0)
else .t
end as value format=S_V_V.
, count(*) as dup_check
from
(select distinct age from have) as allAges
cross join
(select distinct month from have) as allMonths
left join
have
on
have.age = allAges.age and have.month = allMonths.month
group by
allMonths.month, allAges.age
order by
allMonths.month, allAges.age
;
quit;
This type of processing can also be done in Proc TABULATE using the CLASSDATA= option.
I have data that's tracking a certain eye phenomenon. Some patients have it in both eyes, and some patients have it in a single eye. This is what some of the data looks like:
EyeID PatientID STATUS Gender
1 1 1 M
2 1 0 M
3 2 1 M
4 3 0 M
5 3 1 M
6 4 1 M
7 4 0 M
8 5 1 F
9 6 1 F
10 6 0 F
11 7 1 F
12 8 1 F
13 8 0 F
14 9 1 F
As you can see from the data above, there are 9 patients in total and all of them have the phenomenon in one eye.
I need to count the number of patients with this eye phenomenon.
To get the number of total patients in the dataset, I used:
PROC FREQ data=new nlevels;
tables PatientID;
run;
To count the number of patients with this eye phenomenon, I used:
PROC SORT data=new out=new1 nodupkey;
by Patientid Status;
run;
proc freq data=new1 nlevels;
tables Status;
run;
However, it gave the correct number of patients with the phenomenon (9), but not the correct number without (0).
I now need to calculate the gender distribution of this phenomenon. I used:
proc freq data=new1;
tables gender*Status/chisq;
run;
However, the cross table has the correct number of patients who have the phenomenon (9), but not the correct number without (0). Does anyone have any thoughts on how to do this chi-square, where a patient counts as positive if they have the phenomenon in at least 1 eye?
Thanks!
PROC FREQ is doing what you told it to: counting the status=0 cases.
In general here you are using sort of blunt tools to accomplish what you're trying to accomplish, when you probably should use a more precise tool. PROC SORT NODUPKEY is sort of overkill for example, and it doesn't really do what you want anyway.
To set up a dataset of has/doesn't have, for example, let's do a few things. First I add one more row - someone who actually doesn't have - so we see that working.
data have;
input eyeID patientID status gender $;
datalines;
1 1 1 M
2 1 0 M
3 2 1 M
4 3 0 M
5 3 1 M
6 4 1 M
7 4 0 M
8 5 1 F
9 6 1 F
10 6 0 F
11 7 1 F
12 8 1 F
13 8 0 F
14 9 1 F
15 10 0 M
;;;;
run;
Now we use the data step. We want a patient-level dataset at the end, where we have eye-level now. So we create a new patient-level status.
data patient_level;
set have;
by patientID;
retain patient_status;
if first.patientID then patient_status =0;
patient_status = (patient_Status or status);
if last.patientID then output;
keep patientID patient_Status gender;
run;
Now, we can run your second proc freq. Also note you have a nice dataset of patients.
title "Patients with/without condition in any eye";
proc freq data=patient_level;
tables patient_status;
run;
title;
You also may be able to do your chi-square analysis, though I'm not a statistician and won't dip my toe into whether this is an appropriate analysis. It's likely better than your first, anyway - as it correctly identifies has/doesn't have status in at least one eye. You may need a different indicator, if you need to know number of eyes.
title "Crosstab of gender by patient having/not having condition";
proc freq data=patient_level;
tables gender*patient_Status/chisq;
run;
title;
If your actual data has every single patient having the condition, of course, it's unlikely a chi-square analysis is appropriate.
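For comparison, the patient-level dataset built in the data step above could also be sketched in SQL. This assumes status is coded 0/1, so that the maximum over a patient's eyes is 1 exactly when at least one eye is positive, and that gender is constant within a patient. The table name patient_level_sql is made up for illustration.

```sas
proc sql;
  create table patient_level_sql as
  select patientID,
         max(gender) as gender,          /* constant per patient, so MAX just picks it */
         max(status) as patient_status   /* 1 if any eye has status = 1 */
  from have
  group by patientID;
quit;

/* Same crosstab as above, on the SQL-built table */
proc freq data=patient_level_sql;
  tables gender*patient_status / chisq;
run;
```

Unlike the data step, this does not rely on the input being sorted by patientID.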
I have the data in this format. It is just an example, with n=2:
X Y info
2 1 good
2 4 bad
3 2 good
4 1 bad
4 4 good
6 2 good
6 3 good
Now, the above data is sorted (7 rows in total). I need to form groups of 2, 3 or 4 rows separately and generate a graph. In the above data, I made groups of 2 rows. The third row is left alone, as there is no other row with the same X value (3) to form a group with. A group can be formed only from rows with the same X value, NOT across different X values.
Now, I will check whether both rows have “good” in the info column. If both rows have “good”, the group formed is also good; otherwise it is bad. In the above example, the 3rd (last) group is a “good” group. The rest are all bad groups. Once I’m done with all the rows, I will calculate the total no. of good groups formed / total no. of groups.
In the above example, the output will be: Total no. of good groups/Total no. of groups => 1/3.
This is the case of n=2(size of group)
Now, for n=3 we make groups of 3 rows, and for n=4 we make groups of 4 rows, and find the good/bad groups in a similar way. If all the rows in a group are “good”, the result is a good group; otherwise it is bad.
Example: n= 3
2 1 good
2 4 bad
2 6 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
In the above case, I left the 4th row and last 2 rows as I can’t make group of 3 rows with them. The first group result is “bad” and last group result is “good”.
Output: 1/ 2
For n= 4:
2 1 good
2 4 good
2 6 good
2 7 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
6 4 good
6 5 good
In this case, I make groups of 4 and find the result. The 5th, 6th, 7th and 8th rows are left behind or ignored. I made 2 groups of 4 rows and both are “good” groups.
Output: 2/2
So, after getting the 3 output values for n=2, n=3, and n=4, I will plot a graph of these values.
Below is code that I think is getting what you are looking for. It assumes that the data that you described is stored separately in three datasets named data_2, data_3, and data_4. Each of these datasets is processed by the %FIND_GOOD_GROUPS macro, which determines which groups of X have all "GOOD" values in INFO; this summary information is then appended as a new row to the BASE dataset. I didn't add the code, but you could calculate the ratio of GOOD_COUNT to _FREQ_ in a separate data step, then use a procedure to plot the N value against the ratio. Hope this gets close to what you're trying to accomplish.
%******************************************************************************;
%macro main;
%find_good_groups(dsn=data_2, n=2);
%find_good_groups(dsn=data_3, n=3);
%find_good_groups(dsn=data_4, n=4);
proc print data=base uniform noobs;
run;
%mend main;
%******************************************************************************;
%******************************************************************************;
%macro find_good_groups(dsn=,n=);
%***************************************************************************;
%* Sort data by X and Y so that you can use FIRST.X variable in Data step. *;
%***************************************************************************;
proc sort data=&dsn;
by x y;
run;
%***************************************************************************;
%* TEMP dataset uses the FIRST.X variable to reset COUNT and GOOD_COUNT to *;
%* initial values for each row where X changes. Each row in the X groups *;
%* adds 1 to COUNT and sets GOOD_COUNT to 0 (zero) if INFO is ever "BAD". *;
%* A record is output if COUNT is equal to the macro parameter &N. *;
%***************************************************************************;
data temp;
keep good_count n;
retain count 0 good_count 1 n &n;
set &dsn;
by x y;
if first.x then do;
count = 0;
good_count = 1;
end;
count = count + 1;
if good_count eq 1 then do;
if trim(left(upcase(info))) eq "BAD" then do;
good_count = 0;
end;
end;
if count eq &n then output;
run;
%***************************************************************************;
%* Summarize the TEMP data to find the number of times that all of the *;
%* rows had "GOOD" in the INFO column for each value of X. *;
%***************************************************************************;
proc summary data=temp;
id n;
var good_count;
output out=n_&n (drop=_type_) sum=;
run;
%***************************************************************************;
%* Append to BASE dataset to retain the sums and frequencies from all of *;
%* the datasets. BASE can be used to plot the N / number of Good records. *;
%***************************************************************************;
proc append data=n_&n base=base force; run;
%mend find_good_groups;
%******************************************************************************;
%main
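The ratio and plot steps left out above could look roughly like this. It is a sketch, assuming BASE holds the variables N, _FREQ_ and GOOD_COUNT as written by the PROC SUMMARY step in the macro; the dataset name ratios and the variable name ratio are made up for illustration, and PROC SGPLOT is one possible choice of plotting procedure.

```sas
/* Ratio of good groups to all complete groups, one row per group size N */
data ratios;
  set base;
  /* guard against a group size for which no complete group was found */
  if _freq_ > 0 then ratio = good_count / _freq_;
run;

/* Line plot of the ratio against the group size */
proc sgplot data=ratios;
  series x=n y=ratio / markers;
  xaxis integer label="Group size (n)";
  yaxis min=0 max=1 label="Good groups / total groups";
run;
```

Because PROC APPEND adds to BASE on every run, delete BASE before rerunning %main if you do not want earlier results in the plot.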