I am trying to combine the following rows in SAS. Here is the data:
StudentNumber Test1 Test2 Test3
001 . 86 .
001 94 . .
001 . . 75
002 68 . .
002 . 82 .
002 . . 97
I'd like the rows to look like the following:
StudentNumber Test1 Test2 Test3
001 94 86 75
002 68 82 97
I'm used to merging columns with the COALESCE function, but I'm not sure how to do this with rows.
You can use the UPDATE statement to do that. The update statement expects to have a source dataset with unique observations per BY group and a transaction dataset that could have multiple observations per BY group. Only the non-missing values of the transactions will change the values. The output will have one observation per BY group with all transactions applied.
You can use your existing data as both the source and the transaction datasets by adding the dataset option obs=0 to the first reference.
data want;
update have(obs=0) have;
by studentnumber;
run;
Related
I'm not sure of the best way to describe this, and I'll admit that the code I wrote to recreate the problem in a smaller format isn't quite accurate.
I have 7 data sets that have the same number of columns (122) but a different number of rows. The labels for these columns are identical except for an underscore and an integer. Example: first column of each data set is "study_id_1" "study_id_2" ... "study_id_7"
I am trying to stack each of these data sets, in numerical order, on top of each other AND drop the underscore and integer.
However, if I use this code, all of the values are in chunks but along a diagonal.
data all;
set PT_BS1_all PT_BS2_all PT_BS3_all PT_BS4_all PT_BS5_all PT_BS6_all PT_BS7_all;
run;
The following code (written in SAS Studio) pretty much illustrates the problem and the "diagonal." However, in my actual data (working in SAS EG), all of the missing values are periods, regardless of variable type. In the example below, I could only get periods to appear for missing values for the numerical variables.
data have;
input study_id_1 $ variable1_1 $ variable2_1 variable3_1 study_id_2 $ variable1_2 $ variable2_2 variable3_2 study_id_3 $ variable1_3 $ variable2_3 variable3_3;
cards;
A treatment 35 24 . . . . . . . .
B placebo 24 44 . . . . . . . .
C treatment 66 77 . . . . . . . .
D placebo 73 45 . . . . . . . .
. . . . A treatment 23 34 . . . .
. . . . B placebo 43 56 . . . .
. . . . C treatment 34 34 . . . .
. . . . D placebo 54 67 . . . .
. . . . . . . . A treatment 22 66
. . . . . . . . B placebo 33 67
. . . . . . . . C treatment 23 48
. . . . . . . . D placebo 69 70
;
run;
proc print data=have;
run;
data want;
input study_id $ variable1 $ variable2 variable3;
cards;
A treatment 35 24
B placebo 24 44
C treatment 66 77
D placebo 73 45
A treatment 23 34
B placebo 43 56
C treatment 34 34
D placebo 54 67
A treatment 22 66
B placebo 33 67
C treatment 23 48
D placebo 69 70
;
run;
proc print data=want;
run;
I hope I've described the problem sufficiently and thanks for any help.
The first non-missing from a list of values is returned by the COALESCE and COALESCEC functions.
A list of variables is very simple in your data set because alike variables have a common prefix (and 1,2,3 suffixes). The syntax for specifying the alike variables is <prefix>:
Example:
data want;
set have;
* coalesce during stacking;
* set PT_BS1_all PT_BS2_all PT_BS3_all PT_BS4_all PT_BS5_all PT_BS6_all PT_BS7_all;
length study_id $8 variable1 $9;
study_id = coalesceC(of study_id_:);
variable1 = coalesceC(of variable1_:);
variable2 = coalesce (of variable2_:);
variable3 = coalesce (of variable3_:);
drop study_id_: variable1_: variable2_: variable3_:;
run;
Rather than clean up the compiled dataset output that are diagonal due to misaligned column names, adjust the inputs by appropriately renaming columns. Specifically, remove the suffix at underscore with scan using a dynamic macro of oldname=newname pattern built from proc sql. Then pass this macro into a subsequent rename command.
Below assumes all datasets resides in WORK library. Adjust SQL WHERE accordingly.
%macro rename_cols(dset);
proc sql noprint;
select cats(name,'=',scan(name, 1, '_'))
into :suffix_clean separated by ' '
from dictionary.columns
where libname = 'WORK' and memname = "&dset.";
quit;
data &dset;
set &dset;
rename &suffix_clean;
run;
%mend rename_cols;
%rename_cols(PT_BS1_ALL);
%rename_cols(PT_BS2_ALL);
%rename_cols(PT_BS3_ALL);
%rename_cols(PT_BS4_ALL);
%rename_cols(PT_BS5_ALL);
%rename_cols(PT_BS6_ALL);
%rename_cols(PT_BS7_ALL);
data all;
set PT_BS1_all
PT_BS2_all
PT_BS3_all
PT_BS4_all
PT_BS5_all
PT_BS6_all
PT_BS7_all;
run;
I´m trying to combine and sum certain observations of a dataset with different values for their common variables, in this case, I am trying to combine the deaths of three age intervals (85-90), (91-95), (95+) in one only (85+) age interval. Our teacher told us it is better if we do not create a new variable and use proc means, tabulate etc.
I have read every google page and all I can find is a proc means combining and summing by variable, but I don´t need the whole group summed, just some observations of the group.
Having the dataset like:
.
.
.
71 to 75 3
76 to 80 4
81 to 85 2
86 to 90 3
91 to 95 1
95+ 3
I would like to have it like
.
.
.
71 to 75 3
76 to 80 4
81 to 85 2
85+ 7
Thanks!
Create a custom format to map the existing literal categorizations into a new ones.
* A format to map literal agecat strings to broader categories;
proc format ;
value $age_cat_want (default=20)
'86 to 90' = '86+'
'91 to 95' = '86+'
'95+' = '86+'
;
This only works for concatenating categories, creating a coarser aggregation.
Example:
* A format to get you into the pickle you are in;
proc format;
value age_cat_have
71-75 = '71 to 75'
76-80 = '76 to 80'
81-84 = '81 to 85'
86-90 = '86 to 90'
91-95 = '91 to 95'
95-high = '95+'
;
data have;
input age ##;
agecat = put (age, age_cat_have.);
datalines;
71 72 73
76 77 78 79
82 83
87 86 86
94
99 101 113
;
proc freq data=have;
title "Original categories are character literals";
table agecat;
run;
* A format to map literal agecat strings to broader categories;
proc format ;
value $age_cat_want (default=20)
'86 to 90' = '86+'
'91 to 95' = '86+'
'95+' = '86+'
;
proc freq data=have;
title "New age categories via custom format $age_cat_want";
table agecat;
format agecat $age_cat_want.;
run;
Note: An existing literal categorization cannot be explicitly split. You would have to make presumptions about the age value distribution within each category and impute a specific age that could be applied to a different age mapping format.
I have large dataset of a few million patient encounters that include a diagnosis, timestamp, patientID, and demographic information.
We have found that a particular type of disease is frequently comorbid with a common condition.
I would like to count the number of this type of disease that each patient has, and then create a histogram showing how many people have 1,2,3,4, etc. additional diseases.
This is the format of the data.
PatientID Diagnosis Date Gender Age
1 282.1 1/2/10 F 25
1 282.1 1/2/10 F 87
1 232.1 1/2/10 F 87
1 250.02 1/2/10 F 41
1 125.1 1/2/10 F 46
1 90.1 1/2/10 F 58
2 140 12/15/13 M 57
2 282.1 12/15/13 M 41
2 232.1 12/15/13 M 66
3 601.1 11/19/13 F 58
3 231.1 11/19/13 F 76
3 123.1 11/19/13 F 29
4 601.1 12/30/14 F 81
4 130.1 12/30/14 F 86
5 230.1 1/22/14 M 60
5 282.1 1/22/14 M 46
5 250.02 1/22/14 M 53
Generally, I was thinking of a DO loop, but I'm not sure where to start because there are duplicates in the dataset, like with patient 1 (282.1 is listed twice). I'm not sure how to account for that. Any thoughts?
Target diagnoses to count would be 282.1, 232.1, 250.02. In this example, patient 1 would have a count of 3, patient 2 would have 2, etc.
Edit:
This is what I have used, but the output is showing each PatientID on multiple lines in the output.
PROC SQL;
create table want as
select age, gender, patientID,
count(distinct diagnosis_description) as count
from dz_prev
where diagnosis in (282.1, 232.1)
group by patientID;
quit;
This is what the output table looks like. Why is this patientID showing up so many times?
Obs AGE GENDER PATIENTID count
1 55 Male 107828695 1
2 54 Male 107828695 1
3 54 Male 107828695 1
4 54 Male 107828695 1
5 54 Male 107828695 1
If you include variables that are neither grouping variables or summary statistics then SAS will happily re-merge your summary statistics back with all of the source records. That is why you are getting multiple records. AGE can usually vary if your dataset covers many years. And GENDER can also vary if your data is messy. So for a quick analysis you might try something like this.
create table want as
select patientID
, min(age) as age_at_onset
, min(gender) as gender
, count(distinct diagnosis_description) as count
from dz_prev
where diagnosis in (282.1, 232.1)
group by patientID
;
I think you can get what you want with an SQL statement
PROC SQL NOPRINT;
create table want as
select PatientID,
count(distinct Diagnosis) as count
from have
where Diagnosis in (282.1, 232.1, 250.02)
group by PatientID;
quit;
This filters to only the diagnoses you are interested in, counts the distinct times they are seen, by the PatientID, and saves the results to a new table.
Sorry I'm new to a lot of the features of SAS - I've only been using for a couple months, mostly for survey data analysis but now I'm working with a dataset which has individual level data for a cross-over study. It's in the form: ID treatment period measure1 measure2 ....
What I want to do is be able to group these individuals by their treatment group and then output a variable with a group average for measure 1 and measure 2 and another variable with the count of observations in each group.
ie
ID trt per m1 m2
1 1 1 101 75
1 2 2 135 89
2 1 1 103 77
2 2 2 140 87
3 2 1 134 79
3 1 2 140 80
4 2 1 156 98
4 1 2 104 78
what I want is the data in the form:
group a = where trt=1 & per=1
group b = where trt=2 & per=2
group c = where trt=2 & per=1
group d = where trt=1 & per=2
trtgrp avg_m1 avg_m2 n
A 102 76 2
B ... ... ...
C
D
Thank you for the help.
/Creating Sample dataset/
data test;
infile datalines dlm=" ";
input ID : 8.
trt : 8.
per : 8.
m1 : 8.
m2 : 8.;
put ID=;
datalines;
1 1 1 101 75
1 2 2 135 89
2 1 1 103 77
2 2 2 140 87
3 2 1 134 79
3 1 2 140 80
4 2 1 156 98
4 1 2 104 78
;
run;
/Using proc summary to summarize trt and per/
Variables(dimensions) on which you want to summarize would go into class
Variables(measures) for which you want to have average would go into var
Since you want to have produce average so you will have to write mean as the desired statistics.
Read more about proc summary here
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473735.htm
and here
http://web.utk.edu/sas/OnlineTutor/1.2/en/60476/m41/m41_19.htm
proc summary data=test nway;
class trt per;
var m1 m2;
output out=final(drop= _type_)
mean=;
run;
The alternative method uses PROC SQL, the advantage being that it makes use of plain-English syntax, so the concept of a group in your question is maintained in the syntax:
PROC SQL;
CREATE TABLE final AS
SELECT
trt,
per,
avg(m1) AS avg_m1,
avg(m2) AS avg_m2,
count(*) AS n
FROM
test
GROUP BY trt, per;
QUIT;
You can even add your own group headings by applying conditional CASE logic as you did in your question:
PROC SQL;
CREATE TABLE final AS
SELECT
CASE
WHEN trt=1 AND per=1 THEN 'A'
WHEN trt=2 AND per=2 THEN 'B'
WHEN trt=2 AND per=1 THEN 'C'
WHEN trt=1 AND per=2 THEN 'D'
END AS group
avg(m1) AS avg_m1,
avg(m2) AS avg_m2,
count(*) AS n
FROM
test
GROUP BY group;
QUIT;
COUNT(*) simply counts the number of rows found within the group. The AVG function calculates the average for the given column.
In each example, you can replace the explicitly named columns in the GROUP BY clause with a number representing column position in the SELECT clause.
GROUP BY 1,2
However, take care with this method, as adding columns to the SELECT clause later can cause problems.
This is my input dataset:
Ref Col_A0 Col_01 Col_02 Col_aa Col_03 Col_04 Col_bb
NYC 10 0 44 55 66 34 44
CHG 90 55 4 33 22 34 23
TAR 10 8 0 25 65 88 22
I need to calculate the % of Col_A0 for a specific reference.
For example % col_A0 would be calculated as
10/(10+0+44+55+66+34+44)=.0395 i.e. 3.95%
So my output should be
Ref %Col_A0 %Rest
NYC 3.95% 96.05%
CHG 34.48% 65.52%
TAR 4.58% 95.42%
I can do this part but the issue is column variables.
Col_A0 and Ref are fixed columns so they will be there in the input every time. But the other columns won't be there. And there can be some additional columns too like Col_10, col_11 till col_30 and col_cc till col_zz.
For example the input data set in some scenarios can be just:
Ref Col_A0 Col_01 Col_02 Col_aa Col_03
NYC 10 0 44 55 66
CHG 90 55 4 33 22
TAR 10 8 0 25 65
So is there a way I can write a SAS code which checks to see if the column exists or not. Or if there is any other better way to do it.
This is my current SAS code written in Enterprise Guide.
PROC SQL;
CREATE TABLE output123 AS
select
ref,
(col_A0/(Sum(Col_A0,Col_01,Col_02,Col_aa,Col_03,Col_04,Col_bb)) FORMAT=PERCENT8.2 AS PERCNT_ColA0,
(1-(col_A0/(Sum(Col_A0,Col_01,Col_02,Col_aa,Col_03,Col_04,Col_bb))) FORMAT=PERCENT8.2 AS PERCNT_Rest
From Input123;
quit;
Scenarios where all the columns are not there I get an error. And if there are additional columns then I miss those. Please advice.
Thanks
I would not use SQL, but would use regular datastep.
data want;
set have;
a0_prop = col_a0/sum(of _numeric_);
run;
If you wanted to do this in SQL, the easiest way is to keep (or transform) the dataset in vertical format, ie, each variable a separate row per ID. Then you don't need to know how many variables there are to figure it out.
If you always want to sum all the numeric columns then just do :
col_A0 / sum(of _numeric_)