How to align data that are on a diagonal in SAS

I'm not sure of the best way to describe this, and I'll admit that the code I wrote to recreate the problem in a smaller format isn't quite accurate.
I have 7 data sets that have the same number of columns (122) but a different number of rows. The labels for these columns are identical except for an underscore and an integer. Example: first column of each data set is "study_id_1" "study_id_2" ... "study_id_7"
I am trying to stack each of these data sets, in numerical order, on top of each other AND drop the underscore and integer.
However, if I use this code, all of the values are in chunks but along a diagonal.
data all;
set PT_BS1_all PT_BS2_all PT_BS3_all PT_BS4_all PT_BS5_all PT_BS6_all PT_BS7_all;
run;
The following code (written in SAS Studio) pretty much illustrates the problem and the "diagonal." However, in my actual data (working in SAS EG), all of the missing values are periods, regardless of variable type. In the example below, I could only get periods to appear for missing values for the numerical variables.
data have;
input study_id_1 $ variable1_1 $ variable2_1 variable3_1 study_id_2 $ variable1_2 $ variable2_2 variable3_2 study_id_3 $ variable1_3 $ variable2_3 variable3_3;
cards;
A treatment 35 24 . . . . . . . .
B placebo 24 44 . . . . . . . .
C treatment 66 77 . . . . . . . .
D placebo 73 45 . . . . . . . .
. . . . A treatment 23 34 . . . .
. . . . B placebo 43 56 . . . .
. . . . C treatment 34 34 . . . .
. . . . D placebo 54 67 . . . .
. . . . . . . . A treatment 22 66
. . . . . . . . B placebo 33 67
. . . . . . . . C treatment 23 48
. . . . . . . . D placebo 69 70
;
run;
proc print data=have;
run;
data want;
input study_id $ variable1 $ variable2 variable3;
cards;
A treatment 35 24
B placebo 24 44
C treatment 66 77
D placebo 73 45
A treatment 23 34
B placebo 43 56
C treatment 34 34
D placebo 54 67
A treatment 22 66
B placebo 33 67
C treatment 23 48
D placebo 69 70
;
run;
proc print data=want;
run;
I hope I've described the problem sufficiently and thanks for any help.

The COALESCE function (numeric) and the COALESCEC function (character) each return the first non-missing value from a list of arguments.
Building that list is simple for your data set because the related variables share a common prefix and differ only in their 1, 2, 3 suffix. A name prefix list, written <prefix>:, refers to every variable whose name starts with that prefix.
Example:
data want;
set have;
* coalesce during stacking;
* set PT_BS1_all PT_BS2_all PT_BS3_all PT_BS4_all PT_BS5_all PT_BS6_all PT_BS7_all;
length study_id $8 variable1 $9;
study_id = coalesceC(of study_id_:);
variable1 = coalesceC(of variable1_:);
variable2 = coalesce (of variable2_:);
variable3 = coalesce (of variable3_:);
drop study_id_: variable1_: variable2_: variable3_:;
run;

Rather than cleaning up the stacked output, which is diagonal because the column names do not match, fix the inputs by renaming their columns. Specifically, strip the trailing underscore and integer from each name and use proc sql to build a macro variable holding oldname=newname pairs. Then pass that macro variable to a rename statement. The suffix is removed with prxchange rather than by taking the first underscore-delimited token, because names such as study_id_1 contain more than one underscore.
The code below assumes all datasets reside in the WORK library. Adjust the SQL WHERE clause accordingly.
%macro rename_cols(dset);
proc sql noprint;
select cats(name,'=',prxchange('s/_\d+\s*$//', 1, name))
into :suffix_clean separated by ' '
from dictionary.columns
where libname = 'WORK' and memname = "%upcase(&dset.)";
quit;
data &dset;
set &dset;
rename &suffix_clean;
run;
%mend rename_cols;
%rename_cols(PT_BS1_ALL);
%rename_cols(PT_BS2_ALL);
%rename_cols(PT_BS3_ALL);
%rename_cols(PT_BS4_ALL);
%rename_cols(PT_BS5_ALL);
%rename_cols(PT_BS6_ALL);
%rename_cols(PT_BS7_ALL);
data all;
set PT_BS1_all
PT_BS2_all
PT_BS3_all
PT_BS4_all
PT_BS5_all
PT_BS6_all
PT_BS7_all;
run;
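A variation on the macro above, offered only as a sketch: PROC DATASETS can rename the columns in place, which avoids rewriting each dataset with a data step. The macro name rename_cols_inplace and the macro variable rename_list are hypothetical names chosen for illustration, and the datasets are again assumed to live in WORK.

```sas
/* Sketch: rename columns in place instead of rewriting the data.      */
/* rename_cols_inplace and rename_list are hypothetical names.         */
%macro rename_cols_inplace(dset);
  %local rename_list;

  proc sql noprint;
    /* Build oldname=newname pairs, dropping only a trailing _<digits> */
    select cats(name, '=', prxchange('s/_\d+\s*$//', 1, name))
      into :rename_list separated by ' '
      from dictionary.columns
      where libname = 'WORK'
        and memname = "%upcase(&dset)"
        and prxmatch('/_\d+\s*$/', name);  /* only suffixed columns */
  quit;

  /* MODIFY + RENAME changes the variable names without copying rows */
  proc datasets library=work nolist;
    modify &dset;
    rename &rename_list;
  quit;
%mend rename_cols_inplace;

%rename_cols_inplace(PT_BS1_all);
```

The sketch assumes every dataset has at least one suffixed column; if none match, rename_list is empty and the RENAME statement would fail.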

Related

Create new SAS observations until number of observations equals other variable?

I am trying to calculate attack rate in some populations using GEE, but at this time only have total number of non-cases. My dataset has individual level data for each case, and a population-wide count for number of non-cases.
In order to do the GEE, I am trying to create non-case observations so that I have one observation for each non-case.
For example:
In this starting data...
PopID CaseNum N_cases N_noncases
14 . 2 3
14 1 . .
14 2 . .
15 . 5 2
15 1 . .
15 2 . .
15 3 . .
15 4 . .
15 5 . .
I would need to create 3 new observations for PopID 14 and 2 new observations for PopID 15,
so that it would look like this:
PopID CaseNum No_case N_cases N_noncases
14 . . 2 3
14 1 . . .
14 2 . . .
14 . 1 . 3
14 . 2 . 3
14 . 3 . 3
15 . . 5 2
15 1 . . .
15 2 . . .
15 3 . . .
15 4 . . .
15 5 . . .
15 . 1 . 2
15 . 2 . 2
Once I have the non-case observations, I'm planning to separate into case-level and population-level datasets before doing my GEE in the case-level dataset.
I have tried a DO-UNTIL loop set to end when no_case=n_noncases, but it just continues forever and never stops.
data test1;
set test;
do until (no_case=n_noncases) ;
no_case +1;
by Popid;
output;
end; run;
I am open to any and all other ways of doing this :) (I also attempted a proc sql, but that went downhill quickly because I have only ever used them to go from case level to population level data, and not vice versa)
Weird but doable.
data have;
input PopID CaseNum N_cases N_noncases;
cards;
14 . 2 3
14 1 . .
14 2 . .
15 . 5 2
15 1 . .
15 2 . .
15 3 . .
15 4 . .
15 5 . .
;
run;
data want;
set have;
by popID;
retain nrecs;
if first.popID then
nrecs=N_noncases;
output;
if last.popid then
do;
N_noncases=nrecs;
call missing(caseNum);
do No_case=1 to nrecs;
output;
end;
end;
keep popID Casenum No_case n_cases N_noncases;
run;
It is probably clearer if you do it in steps.
First, generate the "non-cases".
data controls;
set have(keep=popid n_noncases);
where n_noncases > 0 ;
do controlnum=1 to n_noncases;
output;
end;
run;
Then combine them with the "cases":
data want;
set have(where=(not missing(casenum))) controls;
by popid;
run;
If performance is an issue then make the first one a data step view instead.
data controls / view=controls;
...

Determine mean/median/IQ range of age for two separate groups

I have a dataset in Stata with variables age and carrier, an indicator for carrier of a particular disease.
Using univar age I am able to get some descriptive statistics of age for the dataset, but now I want to compare the mean, median, and interquartile range between carriers and non-carriers. Is there some way to do this?
I have tried one line so far:
univar age if carrier = 1
which resulted in invalid syntax error, r(198)
I had expected descriptive statistics of age when carrier is 1.
Sample Data
clear
set obs 100
gen age = runiformint(18,70)
gen carrier = runiformint(0,1)
Summary Stats
There are several ways to get summary statistics in Stata, but one way is to use the tabstat command:
tabstat age, by(carrier) statistics(n mean sd min p25 median p75 max iqr)
Summary for variables: age
Group variable: carrier
carrier | N Mean SD Min p25 p50 p75 Max IQR
---------+------------------------------------------------------------------------------------------
0 | 52 43.96154 16.45667 19 30 39.5 59 70 29
1 | 48 48.4375 14.24692 20 39 49 60.5 69 21.5
---------+------------------------------------------------------------------------------------------
Total | 100 46.11 15.52183 19 33 44 59.5 70 26.5
----------------------------------------------------------------------------------------------------
See help tabstat for additional statistics options.
Edited to mimic output of univar.
You'd have to search quite hard for univar if you had not heard of it already. It's community-contributed and dates from 1997 and 1999:
STB-51 sg67.1 . . . . . . . . . . . . . . . . . . . . . . . Update to univar
(help univar if installed) . . . . . . . . . . . . . . J. R. Gleason
9/99 pp.27--28; STB Reprints Vol 9, pp.159--161
improvements and new options to univar
STB-36 sg67 . . . . . . . . . . . . . . . Univariate summaries with boxplots
(help univar if installed) . . . . . . . . . . . . . . J. R. Gleason
3/97 pp.23--25; STB Reprints Vol 6, pp.179--183
command that offers a streamlined display of univariate summaries,
including, optionally, text-mode boxplots
Its help indicates that you need its by() option. Here's a reproducible example:
. sysuse auto, clear
(1978 automobile data)
. univar mpg, by(foreign)
-> foreign=Domestic
-------------- Quantiles --------------
Variable n Mean S.D. Min .25 Mdn .75 Max
-------------------------------------------------------------------------------
mpg 52 19.83 4.74 12.00 16.50 19.00 22.00 34.00
-------------------------------------------------------------------------------
-> foreign=Foreign
-------------- Quantiles --------------
Variable n Mean S.D. Min .25 Mdn .75 Max
-------------------------------------------------------------------------------
mpg 22 24.77 6.61 14.00 21.00 24.50 28.00 41.00
-------------------------------------------------------------------------------
Like @JR96, I recommend tabstat here.

Merging SAS rows with COALESCE function

I am trying to combine the following rows in SAS. Here is the data:
StudentNumber Test1 Test2 Test3
001 . 86 .
001 94 . .
001 . . 75
002 68 . .
002 . 82 .
002 . . 97
I'd like the rows to look like the following:
StudentNumber Test1 Test2 Test3
001 94 86 75
002 68 82 97
I'm used to merging columns with the COALESCE function, but I'm not sure how to do this with rows.
You can use the UPDATE statement to do that. The update statement expects to have a source dataset with unique observations per BY group and a transaction dataset that could have multiple observations per BY group. Only the non-missing values of the transactions will change the values. The output will have one observation per BY group with all transactions applied.
You can use your existing data as both the source and the transaction datasets by adding the dataset option obs=0 to the first reference.
data want;
update have(obs=0) have;
by studentnumber;
run;
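As a minimal end-to-end sketch of the UPDATE approach, the step below recreates the question's data first. The input step is added only for illustration, and reading studentnumber as character (to keep the leading zeros) is an assumption about the real data. The data must be sorted by studentnumber, which it already is here.

```sas
/* Recreate the question's data (illustrative input step). */
data have;
  input studentnumber $ test1 test2 test3;
  cards;
001 . 86 .
001 94 . .
001 . . 75
002 68 . .
002 . 82 .
002 . . 97
;
run;

/* have(obs=0) supplies an empty master dataset; every row of have  */
/* is then applied as a transaction. Missing transaction values do  */
/* not overwrite values already collected for the BY group.         */
data want;
  update have(obs=0) have;
  by studentnumber;
run;

proc print data=want noobs;
run;
```

This should print one row per student: 001 with 94, 86, 75 and 002 with 68, 82, 97.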

How to transpose data with multiple observations for the id variable in SAS?

I am wondering about the best way to transpose data in SAS when I have multiple occurrences of my id variable. I know I can use the let option in the proc transpose statement to do this, but I do not want to get rid of any data, as I intend to compute averages.
Here is an example of my data and my code:
data grades;
input student testnum grade;
cards;
1 1 30
1 1 25
1 2 45
1 3 67
2 1 22
2 2 63
2 2 12
2 2 77
3 1 22
3 1 17
3 2 14
3 4 17
;
run;
proc sort data=grades;
by student testnum;
run;
proc transpose data=grades out=trgrades;
by student;
id testnum;
var grade;
run;
Here is how I would like my resulting dataset to look:
student testnum1 testnum2 testnum3 testnum4 avg12 avg34
1 30 45 67 . 33.33 67
1 25 . . . 33.33 67
2 22 63 . . 43.5 .
2 . 12 . . 43.5 .
2 . 77 . . 43.5 .
3 22 14 . 17 17.67 17
3 17 . . . 17.67 17
I want to use this new dataset (not sure how yet) to create the new columns that are the average score of all testnum1's and testnum2's for a student (avg12) and the average of all testnum3's and testnum4's (avg34) for a student.
There may be a much more efficient way to do this but I am stumped.
Any advice is appreciated.
If all you really need is the average of all test 1's and 2's, and 3's and 4's for each student, then you don't need to transpose at all. All you need is a simple data step:
data grouped;
set grades;
if testnum in (1,2) then group=1;
else if testnum in (3,4) then group=2;
run;
Then a basic proc means:
proc means data=grouped;
by student group;
var grade;
output out=averages mean=groupaverage;
run;
If you need the averages in a single observation, you can easily transpose the averages dataset.
proc transpose data=averages out=trgrades;
by student;
id group;
var groupaverage;
run;
Update:
As mentioned by @Keith, using a format to group the tests is an excellent choice as well. Skip the data step and create the format like so:
proc format;
value TestGroup
1,2 = 'Tests 1 and 2'
3,4 = 'Tests 3 and 4'
;
run;
Then the proc means becomes:
proc means data=grades;
by student testnum;
var grade;
format testnum TestGroup.;
output out=averages mean=groupaverage;
run;
End Update
If, for some reason, you really need to have all the test scores in one observation, then I would recommend using a data step to make them uniquely identifiable. Use by, first.testnum, retain, and a simple counter to assign each score a retake number. Now your transpose uses retake and testnum as id variables. You should be able to figure it out from there.
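One way the retake counter could look, as a sketch rather than the only layout: here retake is used as a BY variable in the transpose (instead of as a second id variable), which gives one row per student and retake, close to the layout shown in the question. The dataset names numbered and wide are hypothetical, and grades is assumed to be sorted by student and testnum already.

```sas
/* Number repeated scores within each student/test combination. */
data numbered;
  set grades;                 /* assumed sorted by student testnum */
  by student testnum;
  if first.testnum then retake = 0;
  retake + 1;                 /* sum statement: retake is retained */
run;

/* The transpose below needs the data ordered by its BY variables. */
proc sort data=numbered;
  by student retake testnum;
run;

/* One row per student and retake; columns testnum1, testnum2, ... */
proc transpose data=numbered out=wide(drop=_name_) prefix=testnum;
  by student retake;
  id testnum;
  var grade;
run;
```

Because each student/retake group now has at most one score per testnum, the id values are unique within a BY group and no data has to be dropped.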
Really hoping right now that I didn't just do your SAS homework assignment for you.

Use variable name for calculation in SAS

I have a SAS dataset that looks like this:
var _12 _41 _17
12 . . .
41 . . .
17 . . .
So for each var there is a column named _var.
I want to use an array or macro to populate all the missing values with the product of the row and column:
_12 _41 _17
12 144 492 204
41 492 1681 697
17 204 697 289
Any thoughts on how to approach this? I want it to be completely general, so I don't need to know the names of the columns, and make no assumptions about their order or values, other than that they are all numbers.
As all the variables (apart from var) begin with an underscore then it is easy to reference them in an array. You can then use the INPUT, COMPRESS and VNAME functions to extract the number and perform the calculation in a single line! Here is the code.
data have;
input var _12 _41 _17;
cards;
12 . . .
41 . . .
17 . . .
;
run;
data want;
set have;
array nums{*} _: ;
do i=1 to dim(nums);
nums{i}=var*input(compress(vname(nums{i}),"_"),best12.);
end;
drop i;
run;