SAS - group individual observations - sas

Sorry I'm new to a lot of the features of SAS - I've only been using for a couple months, mostly for survey data analysis but now I'm working with a dataset which has individual level data for a cross-over study. It's in the form: ID treatment period measure1 measure2 ....
What I want to do is be able to group these individuals by their treatment group and then output a variable with a group average for measure 1 and measure 2 and another variable with the count of observations in each group.
ie
ID trt per m1 m2
1 1 1 101 75
1 2 2 135 89
2 1 1 103 77
2 2 2 140 87
3 2 1 134 79
3 1 2 140 80
4 2 1 156 98
4 1 2 104 78
what I want is the data in the form:
group a = where trt=1 & per=1
group b = where trt=2 & per=2
group c = where trt=2 & per=1
group d = where trt=1 & per=2
trtgrp avg_m1 avg_m2 n
A 102 76 2
B ... ... ...
C
D
Thank you for the help.

/Creating Sample dataset/
data test;
infile datalines dlm=" ";
input ID : 8.
trt : 8.
per : 8.
m1 : 8.
m2 : 8.;
put ID=;
datalines;
1 1 1 101 75
1 2 2 135 89
2 1 1 103 77
2 2 2 140 87
3 2 1 134 79
3 1 2 140 80
4 2 1 156 98
4 1 2 104 78
;
run;
/Using proc summary to summarize trt and per/
Variables(dimensions) on which you want to summarize would go into class
Variables(measures) for which you want to have average would go into var
Since you want to have produce average so you will have to write mean as the desired statistics.
Read more about proc summary here
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473735.htm
and here
http://web.utk.edu/sas/OnlineTutor/1.2/en/60476/m41/m41_19.htm
proc summary data=test nway;
class trt per;
var m1 m2;
output out=final(drop= _type_)
mean=;
run;

The alternative method uses PROC SQL, the advantage being that it makes use of plain-English syntax, so the concept of a group in your question is maintained in the syntax:
PROC SQL;
CREATE TABLE final AS
SELECT
trt,
per,
avg(m1) AS avg_m1,
avg(m2) AS avg_m2,
count(*) AS n
FROM
test
GROUP BY trt, per;
QUIT;
You can even add your own group headings by applying conditional CASE logic as you did in your question:
PROC SQL;
CREATE TABLE final AS
SELECT
CASE
WHEN trt=1 AND per=1 THEN 'A'
WHEN trt=2 AND per=2 THEN 'B'
WHEN trt=2 AND per=1 THEN 'C'
WHEN trt=1 AND per=2 THEN 'D'
END AS group
avg(m1) AS avg_m1,
avg(m2) AS avg_m2,
count(*) AS n
FROM
test
GROUP BY group;
QUIT;
COUNT(*) simply counts the number of rows found within the group. The AVG function calculates the average for the given column.
In each example, you can replace the explicitly named columns in the GROUP BY clause with a number representing column position in the SELECT clause.
GROUP BY 1,2
However, take care with this method, as adding columns to the SELECT clause later can cause problems.

Related

SAS: Count number of a particular type of disease with patient data on multiple lines

I have large dataset of a few million patient encounters that include a diagnosis, timestamp, patientID, and demographic information.
We have found that a particular type of disease is frequently comorbid with a common condition.
I would like to count the number of this type of disease that each patient has, and then create a histogram showing how many people have 1,2,3,4, etc. additional diseases.
This is the format of the data.
PatientID Diagnosis Date Gender Age
1 282.1 1/2/10 F 25
1 282.1 1/2/10 F 87
1 232.1 1/2/10 F 87
1 250.02 1/2/10 F 41
1 125.1 1/2/10 F 46
1 90.1 1/2/10 F 58
2 140 12/15/13 M 57
2 282.1 12/15/13 M 41
2 232.1 12/15/13 M 66
3 601.1 11/19/13 F 58
3 231.1 11/19/13 F 76
3 123.1 11/19/13 F 29
4 601.1 12/30/14 F 81
4 130.1 12/30/14 F 86
5 230.1 1/22/14 M 60
5 282.1 1/22/14 M 46
5 250.02 1/22/14 M 53
Generally, I was thinking of a DO loop, but I'm not sure where to start because there are duplicates in the dataset, like with patient 1 (282.1 is listed twice). I'm not sure how to account for that. Any thoughts?
Target diagnoses to count would be 282.1, 232.1, 250.02. In this example, patient 1 would have a count of 3, patient 2 would have 2, etc.
Edit:
This is what I have used, but the output is showing each PatientID on multiple lines in the output.
PROC SQL;
create table want as
select age, gender, patientID,
count(distinct diagnosis_description) as count
from dz_prev
where diagnosis in (282.1, 232.1)
group by patientID;
quit;
This is what the output table looks like. Why is this patientID showing up so many times?
Obs AGE GENDER PATIENTID count
1 55 Male 107828695 1
2 54 Male 107828695 1
3 54 Male 107828695 1
4 54 Male 107828695 1
5 54 Male 107828695 1
If you include variables that are neither grouping variables or summary statistics then SAS will happily re-merge your summary statistics back with all of the source records. That is why you are getting multiple records. AGE can usually vary if your dataset covers many years. And GENDER can also vary if your data is messy. So for a quick analysis you might try something like this.
create table want as
select patientID
, min(age) as age_at_onset
, min(gender) as gender
, count(distinct diagnosis_description) as count
from dz_prev
where diagnosis in (282.1, 232.1)
group by patientID
;
I think you can get what you want with an SQL statement
PROC SQL NOPRINT;
create table want as
select PatientID,
count(distinct Diagnosis) as count
from have
where Diagnosis in (282.1, 232.1, 250.02)
group by PatientID;
quit;
This filters to only the diagnoses you are interested in, counts the distinct times they are seen, by the PatientID, and saves the results to a new table.

How to subset automatically in SAS?

I am new to SAS, so this might be a silly type of question.
Assume there are several datasets with similar structure but different column names. I want to get new datasets with the same number of rows but only a subset of columns.
In the following example, Data_A and Data_B are original datasets and SubA and SubBare what I want. What is the efficient way of deriving SubA and SubB?
DATA A_auto;
LENGTH A_make $ 20;
INPUT A_make $ 1-17 A_price A_mpg A_rep78 A_hdroom A_trunk A_weight A_length A_turn A_displ A_gratio A_foreign;
CARDS;
AMC Concord 4099 22 3 2.5 11 2930 186 40 121 3.58 0
AMC Pacer 4749 17 3 3.0 11 3350 173 40 258 2.53 0
Audi Fox 6295 23 3 2.5 11 2070 174 36 97 3.70 1
;
RUN;
DATA B_auto;
LENGTH make $ 20;
INPUT B_make $ 1-17 B_price B_mpg B_rep78 B_hdroom B_trunk B_weight B_length B_turn B_displ B_gratio B_foreign;
CARDS;
Toyota Celica 5899 18 5 2.5 14 2410 174 36 134 3.06 1
Toyota Corolla 3748 31 5 3.0 9 2200 165 35 97 3.21 1
VW Scirocco 6850 25 4 2.0 16 1990 156 36 97 3.78 1
;
RUN;
DATA SubA;
set A_auto;
keep A_make A_price;
RUN;
DATA SubB;
set B_auto;
keep B_make B_price;
RUN;
Here's my new answer. This introduces quite a few concepts, but all are necessary to complete this task.
First of all I would store the required part variable names (the suffixes that are common to all datasets) in a new dataset. This keeps them all in one place and makes it easier to change if required.
The next step is to create a regular expression (regex) search string that combines all the names, separated by a pipe (|), which is the regex symbol for or. I've also added a $ symbol to end of the names, this ensures only variables ending with the part names will be selected.
select into :[macroname] is the method to create macro variables within proc sql
Then I set up a macro to extract the specific variable names for the current dataset and use those names to create a view (like my original answer)
The dictionary library referenced in the proc sql is a metadata library that contains information on all active libraries, tables, columns etc, so is a good source of identifying what the actual variable names are called (based on the regex search string created earlier).
You won't need the proc print in your code, I just put it in to show everything is working as expected.
Let me know if this works for you
/* create intial datasets */
DATA A_auto;
LENGTH A_make $ 20;
INPUT A_make $ 1-17 A_price A_mpg A_rep78 A_hdroom A_trunk A_weight A_length A_turn A_displ A_gratio A_foreign;
CARDS;
AMC Concord 4099 22 3 2.5 11 2930 186 40 121 3.58 0
AMC Pacer 4749 17 3 3.0 11 3350 173 40 258 2.53 0
Audi Fox 6295 23 3 2.5 11 2070 174 36 97 3.70 1
;
RUN;
DATA B_auto;
LENGTH B_make $ 20;
INPUT B_make $ 1-17 B_price B_mpg B_rep78 B_hdroom B_trunk B_weight B_length B_turn B_displ B_gratio B_foreign;
CARDS;
Toyota Celica 5899 18 5 2.5 14 2410 174 36 134 3.06 1
Toyota Corolla 3748 31 5 3.0 9 2200 165 35 97 3.21 1
VW Scirocco 6850 25 4 2.0 16 1990 156 36 97 3.78 1
;
RUN;
/* create dataset containing partial name of variables to keep */
data keepvars;
input part_name $ :20.;
datalines;
_make
_price
;
run;
/* create regular expression search string from partial names */
proc sql noprint;
select
cats(part_name,'$') /* '$' matches end of string */
into
:name_str separated by '|' /* '|' is an 'or' search operator in regular expressions */
from
keepvars;
quit;
%put &name_str.; /* print search string to log */
/* macro to create views from datasets */
%macro create_views (dsname, vwname); /* inputs are dataset name being read in and view name being created */
/* extract specific variable names to be kept, based on search string */
proc sql noprint;
select
name
into
:vars separated by ' '
from
dictionary.columns
where
libname = 'WORK'
and memname = upper("&dsname.")
and prxmatch("/&name_str./",strip(name))>0; /* prxmatch is regular expression search function */
quit;
%put &vars.; /* print variables to keep to log */
/* create views */
data &vwname. / view=&vwname.;
set &dsname. (keep=&vars.);
run;
/* test view by printing */
proc print data=&vwname.;;
run;
%mend create_views;
/* run macro for each dataset */
%create_views(A_auto, SubA);
%create_views(B_auto, SubB);

how to find the maximum value across dynamically change observations in sas?

so i have the following dataset with three variables: account, balance, and time.
account balance time
1 110 01/2006
1 111 02/2006
1 88 03/2006
1 61 04/2006
1 1203 05/2006
2 112 01/2006
2 111 02/2006
2 665 03/2006
2 61 04/2006
2 1243 05/2006
3 110 01/2006
3 111 02/2006
3 88 03/2006
3 61 04/2006
3 1203 05/2006
each account has more records. so the starting time might be before what I wrote and the ending time might be after what i wrote.
so my question is:
I am trying to find the maximum balance for each account in prievious 12 monthes. For example, for account 3 on 05/2006, i am trying to find max(account 3 balance at 04/2006, account 3 balance at 03/2006, account 3 balance at 03/2006,............., account 3 balance at 04/2006).
what is in your mind? what i did is to use lag function with array. however, it is NOT so efficient. because i will be in trouble if i need previous 120 months.
Thank you.
Best
Xintong
Try this:
PROC SQL;
create table max_balance as
select *
from your_table
group by account
having balance=max(balance)
;
QUIT;
options mprint;
%macro lag;
data temp (drop=count i);
set balance;
by acct time;
array x(*) lag_bal1 - lag_bal3;
%do i=1 %to 3;
lag_bal&i=lag&i.(balance);
%end;
if first.acct then count=1;
do i=count to dim(x);
call missing(x(i));
end;
count+1;
max_ball=max(balance, of lag_bal1-lag_bal3);
%mend;
%lag;

How to transpose data with multiple observations for the id variable in SAS?

I am wondering the best way to transpose data in SAS when I have multiple occurances of my id variable. I know I can use the let option in the proc transpose statement to do this, but I do not want to get rid of any data, as I intend to compute averages.
Here is an example of my data and my code:
data grades;
input student testnum grade;
cards;
1 1 30
1 1 25
1 2 45
1 3 67
2 1 22
2 2 63
2 2 12
2 2 77
3 1 22
3 1 17
3 2 14
3 4 17
;
run;
proc sort data=grades;
by student testnum;
run;
proc transpose data=grades out=trgrades;
by student;
id testnum;
var grade;
run;
Here is how I would like my resulting dataset to look:
student testnum1 testnum2 testnum3 testnum4 avg12 avg34
1 30 45 67 . 33.33 67
1 25 . . . 33.33 67
2 22 63 . . 43.5 .
2 . 12 . . 43.5 .
2 . 77 . . 43.5 .
3 22 14 . 17 53 17
3 17 . . . 53 17
I want to use this new dataset (not sure how yet) to create the new columns that are the average score of all testnum1's and testnum2's for a student (avg12) and the average of all testenum3's and testnum4's (avg34) for a student.
There may be a much more efficient way to do this but I am stumped.
Any advice is appreciated.
If all you really need is the average of all test 1's and 2's, and 3's and 4's for each student, then you don't need to transpose at all. All you need is a simple data step:
data grouped;
set grades;
if testnum In (1,2) then group=1;
else if testnum in (3,4) then group=2;
run;
Then a basic proc means:
proc means data=grouped;
by student group;
var grade;
output out=averages mean=groupaverage;
run;
If you need the averages in a single observation, you can easily transpose the averages dataset.
proc transpose data=grades out=trgrades;
by student;
id group;
var grade;
run;
Update:
As mentioned by #Keith, using a format to group the tests is an excellent choice as well. Skip the data step and create the format like so:
proc format;
value TestGroup
1,2 = 'Tests 1 and 2'
3,4 = 'Tests 3 and 4'
;
run;
Then the proc means becomes:
proc means data=grouped;
by student testnum;
var grade;
format testnum TestGroup.;
output out=averages mean=groupaverage;
run;
End Update
If, for some reason, you really need to have all the test scores in one observation then I would recommend using a data step to make them uniquely identifiable. Use by, testnum.first, retain, and a simple counter to assign each score a retake number. Now your transpose uses retake and testnum as id variables. You should be able to figure it out from there.
Really hoping right now that I didn't just do your SAS homework assignment for you.

SAS: No valid observations are found Error - Simple Regression

I have big panel time series data set. I wish to do this basic SAS regression code:
proc sort data=dataset;
by time_id;
run;
ods output parameterestimates=pe;
proc reg data=dataset;
by time_id;
model y=x1 x2 x3....x15;
quit;
run;
I get this error when I run the code:
ERROR: No valid observations are found.
NOTE: The above message was for the following BY group:
time_id=1
ERROR: No valid observations are found.
NOTE: The above message was for the following BY group:
time_id=2....
Why? My time_id variable exists... is it because I have too many time_id variables? If I select firm_id it works but I want time_id.
Here's a sample of my data (panel time series):
y x firm_id time_id
3.4 100 1 1
2.3 200 1 2
6.5 653 1 3
3 50 2 1
4.34 23 2 2
4.8 55 2 3
1.311 400 3 1
1.23 200 3 2
5.63 50 3 3
You'll get this error message if all values of a particular x variable are missing for a given time_id. Take a look at the example below where all values of x2 are missing for time_id 1, when you run the code the Results Output window details the problem (number of missing observations the same as the number of observations).
It works for firm_id because you have fewer values than time_id, therefore not all values of a particular x variable are missing for each firm_id.
data have;
input y x1 x2 firm_id time_id;
cards;
3.4 100 . 1 1
2.3 200 200 1 2
6.5 653 653 1 3
3 50 . 2 1
4.34 23 23 2 2
4.8 55 55 2 3
1.311 400 . 3 1
1.23 200 200 3 2
5.63 50 50 3 3
;
run;
proc sort data=have;
by time_id;
run;
ods output parameterestimates=pe;
proc reg data=have;
by time_id;
model y=x1-x2;
quit;
run;