I have a SAS dataset that looks like this:
var _12 _41 _17
12 . . .
41 . . .
17 . . .
So for each var there is a column named _var.
I want to use an array or macro to populate all the missing values with the product of the row and column:
_12 _41 _17
12 144 492 204
41 492 1681 697
17 204 697 289
Any thoughts on how to approach this? I want it to be completely general, so I don't need to know the names of the columns, and make no assumptions about their order or values, other than that they are all numbers.
Because all the variables (apart from var) begin with an underscore, they are easy to reference in an array using the _: prefix list. You can then use the VNAME, COMPRESS and INPUT functions to extract the number from each column name and perform the calculation in a single line. Here is the code.
data have;
input var _12 _41 _17;
cards;
12 . . .
41 . . .
17 . . .
;
run;
data want;
set have;
array nums{*} _: ;
do i=1 to dim(nums);
nums{i}=var*input(compress(vname(nums{i}),"_"),best12.);
end;
drop i;
run;
I am trying to calculate attack rate in some populations using GEE, but at this time only have total number of non-cases. My dataset has individual level data for each case, and a population-wide count for number of non-cases.
In order to do the GEE, I am trying to create non-case observations so that I have one observation for each non-case.
For example:
In this starting data...
PopID CaseNum N_cases N_noncases
14 . 2 3
14 1 . .
14 2 . .
15 . 5 2
15 1 . .
15 2 . .
15 3 . .
15 4 . .
15 5 . .
I would need to create 3 new observations for PopID 14 and 2 new observations for PopID 15.
So it would look like this:
PopID CaseNum No_case N_cases N_noncases
14 . . 2 3
14 1 . . .
14 2 . . .
14 . 1 . 3
14 . 2 . 3
14 . 3 . 3
15 . . 5 2
15 1 . . .
15 2 . . .
15 3 . . .
15 4 . . .
15 5 . . .
15 . 1 . 2
15 . 2 . 2
Once I have the non-case observations, I'm planning to separate into case-level and population-level datasets before doing my GEE in the case-level dataset.
I have tried a DO-UNTIL loop set to end when no_case=n_noncases, but it just continues forever and never stops.
data test1;
set test;
do until (no_case=n_noncases) ;
no_case +1;
by Popid;
output;
end; run;
I am open to any and all other ways of doing this :) (I also attempted a proc sql, but that went downhill quickly because I have only ever used them to go from case level to population level data, and not vice versa)
Weird but doable.
data have;
input PopID CaseNum N_cases N_noncases;
cards;
14 . 2 3
14 1 . .
14 2 . .
15 . 5 2
15 1 . .
15 2 . .
15 3 . .
15 4 . .
15 5 . .
;
run;
data want;
set have;
by popID;
retain nrecs;
if first.popID then
nrecs=N_noncases;
output;
if last.popid then
do;
N_noncases=nrecs;
call missing(caseNum);
do No_case=1 to nrecs;
output;
end;
end;
keep popID Casenum No_case n_cases N_noncases;
run;
Probably clearer if you do it in steps.
Generate the "non-cases" .
data controls;
set have(keep=popid n_noncases);
where n_noncases > 0 ;
do controlnum=1 to n_noncases;
output;
end;
run;
Then combine with the "cases" ;
data want;
set have(where=(not missing(casenum))) controls;
by popid;
run;
If performance is an issue then make the first one a data step view instead.
data controls / view=controls;
...
I'm not sure of the best way to describe this, and I'll admit that the code I wrote to recreate the problem in a smaller format isn't quite accurate.
I have 7 data sets that have the same number of columns (122) but a different number of rows. The labels for these columns are identical except for an underscore and an integer. Example: first column of each data set is "study_id_1" "study_id_2" ... "study_id_7"
I am trying to stack each of these data sets, in numerical order, on top of each other AND drop the underscore and integer.
However, if I use this code, all of the values are in chunks but along a diagonal.
data all;
set PT_BS1_all PT_BS2_all PT_BS3_all PT_BS4_all PT_BS5_all PT_BS6_all PT_BS7_all;
run;
The following code (written in SAS Studio) pretty much illustrates the problem and the "diagonal." However, in my actual data (working in SAS EG), all of the missing values are periods, regardless of variable type. In the example below, I could only get periods to appear for missing values for the numerical variables.
data have;
input study_id_1 $ variable1_1 $ variable2_1 variable3_1 study_id_2 $ variable1_2 $ variable2_2 variable3_2 study_id_3 $ variable1_3 $ variable2_3 variable3_3;
cards;
A treatment 35 24 . . . . . . . .
B placebo 24 44 . . . . . . . .
C treatment 66 77 . . . . . . . .
D placebo 73 45 . . . . . . . .
. . . . A treatment 23 34 . . . .
. . . . B placebo 43 56 . . . .
. . . . C treatment 34 34 . . . .
. . . . D placebo 54 67 . . . .
. . . . . . . . A treatment 22 66
. . . . . . . . B placebo 33 67
. . . . . . . . C treatment 23 48
. . . . . . . . D placebo 69 70
;
run;
proc print data=have;
run;
data want;
input study_id $ variable1 $ variable2 variable3;
cards;
A treatment 35 24
B placebo 24 44
C treatment 66 77
D placebo 73 45
A treatment 23 34
B placebo 43 56
C treatment 34 34
D placebo 54 67
A treatment 22 66
B placebo 33 67
C treatment 23 48
D placebo 69 70
;
run;
proc print data=want;
run;
I hope I've described the problem sufficiently and thanks for any help.
The COALESCE (numeric) and COALESCEC (character) functions return the first non-missing value from a list of arguments.
Specifying that list is simple in your data set because related variables share a common prefix (followed by the 1, 2, 3 suffixes). The syntax for a prefix variable list is <prefix>: and it is passed to a function with OF, as in of study_id_:
Example:
data want;
set have;
* coalesce during stacking;
* set PT_BS1_all PT_BS2_all PT_BS3_all PT_BS4_all PT_BS5_all PT_BS6_all PT_BS7_all;
length study_id $8 variable1 $9;
study_id = coalesceC(of study_id_:);
variable1 = coalesceC(of variable1_:);
variable2 = coalesce (of variable2_:);
variable3 = coalesce (of variable3_:);
drop study_id_: variable1_: variable2_: variable3_:;
run;
Rather than cleaning up the stacked output, which is diagonal because the column names do not line up, rename the columns of each input first. Specifically, strip the trailing underscore and integer from each name with prxchange (names such as study_id_1 contain more than one underscore, so splitting on the first underscore is not enough), and build a list of oldname=newname pairs with proc sql, stored in a macro variable. Then pass that macro variable to a rename statement.
The code below assumes all datasets reside in the WORK library. Adjust the SQL WHERE clause accordingly.
%macro rename_cols(dset);
proc sql noprint;
select cats(name,'=',prxchange('s/_\d+\s*$//', 1, name))
into :suffix_clean separated by ' '
from dictionary.columns
where libname = 'WORK' and memname = "&dset.";
quit;
data &dset;
set &dset;
rename &suffix_clean;
run;
%mend rename_cols;
%rename_cols(PT_BS1_ALL);
%rename_cols(PT_BS2_ALL);
%rename_cols(PT_BS3_ALL);
%rename_cols(PT_BS4_ALL);
%rename_cols(PT_BS5_ALL);
%rename_cols(PT_BS6_ALL);
%rename_cols(PT_BS7_ALL);
data all;
set PT_BS1_all
PT_BS2_all
PT_BS3_all
PT_BS4_all
PT_BS5_all
PT_BS6_all
PT_BS7_all;
run;
Problem Statement: I have a text file that I want to read using the SAS INFILE statement, but SAS is not giving me the proper output.
Text File:
Akash    18 19 20
Tejas       20 16
Shashank 16 20
Meera    18    20
The Code that I have tried:
DATA Arr;
INFILE "/folders/myfolders/personal/SAS_Array .txt" missover;
INPUT Name$ SAS DS R;
RUN;
PROC PRINT DATA=arr;
RUN;
The result I got is:
Obs Name SAS DS R
1 Akash 18 19 20
2 Tejas 20 16 .
3 Shashank16 20 .
4 Meera 18 20 .
Which is improper. So what is wrong with the code? I need to read the file in SAS with the same sequence of marks as in text file. Please help.
Expected result:
Obs Name SAS DS R
1 Akash 18 19 20
2 Tejas . 20 16
3 Shashank16 20 .
4 Meera 18 . 20
Thanks in advance.
If that text file is tab-delimited, you should specify the delimiter in the infile statement and use the dsd option to account for missing values:
DATA Arr;
INFILE "/folders/myfolders/personal/SAS_Array .txt" missover dlm='09'x dsd;
INPUT Name $ SAS DS R;
RUN;
PROC PRINT DATA=arr;
RUN;
EDIT: after editing, your sample text file now looks fixed-width rather than space-delimited. In that case you should be using column input:
DATA Arr;
INFILE "/folders/myfolders/personal/SAS_Array .txt" truncover;
INPUT Name $1-9 SAS 10-12 DS 13-15 R 16-18;
RUN;
example with datalines:
DATA Arr;
INFILE datalines missover;
INPUT Name $1-9 SAS 10-12 DS 13-15 R 16-18;
datalines;
Akash    18 19 20
Tejas       20 16
Shashank 16 20
Meera    18    20
;
RUN;
I need to delete duplicates from a data set. My issue is that once I sort the data and flag the duplicates (using lag function), some information across variables is present within the duplicate observation and some within the original observation. I need to retain information across all variables while also deleting the duplicates.
My thought was to first fill in all the information between both the original and duplicate before deleting the duplicate.
Example of observations after sorting data and flagging duplicates (fake data values):
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . . 0
AB 36 1980 . . 2135 1
ON 26 1990 . . 8868 0
ON 26 1990 . 35464 8868 1
What I want:
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . 2135 0
AB 36 1980 45654 . 2135 1
ON 26 1990 . 35464 8868 0
ON 26 1990 . 35464 8868 1
So I can delete duplicates and eventually have this:
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . 2135 0
ON 26 1990 . 35464 8868 0
I created lag and lead variables to attempt to fill in information but it only seems to be working on some of the data set.
Here is the code for the lead variables:
data uncleaned_data;
merge uncleaned_data
uncleaned_data(
firstobs=2
keep= TRANS_ID MORB_ID Varx
rename=(TRANS_ID=lead_TRANS_ID MORB_ID=lead_MORB_ID Varx=lead_Varx ));
if lag(flag_duplicate=1) then do;
if TRANS_ID=. then do;
TRANS_ID= lead_TRANS_ID;
end;
if MORB_ID=. then do;
MORB_ID= lead_MORB_ID;
end;
if Varx=. then do;
Varx= lead_Varx;
end;
end;
run;
I did the same kind of thing for lag variables except my initial if statement is 'if flag_duplicate=1 then do;'
This method does not seem to work for many duplicate pairs in my data set.
Is there a better way to approach my problem overall? possibly through proc SQL?
Thanks for reading and any advice offered!
I'm assuming that you don't have different values of, for example, Trans_id for the same Province. If that is the case, you can flatten the original data in one go using an update statement with a by statement. In my code, the first reference to the dataset, with obs=0, just creates the variables; the second reference populates the values, and the by statement ensures that only one row is output per Province.
Using this method means you don't need to identify the duplicate values beforehand.
data have;
input Province $ AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate;
datalines;
AB 36 1980 45654 . . 0
AB 36 1980 . . 2135 1
ON 26 1990 . . 8868 0
ON 26 1990 . 35464 8868 1
;
run;
data want;
update have(obs=0) have;
by province;
run;
Something like this should work...
proc sort data=uncleaned_data; by Province AGE BRTHYEAR; run;
data cleaned_data (DROP=TRANS_ID RENAME=(KEEP_TRANS_ID=TRANS_ID) ...);
set uncleaned_data;
retain keep_TRANS_ID; /* retain the other keep_ variables as well */
by Province AGE BRTHYEAR;
if first.BRTHYEAR then do;
keep_TRANS_ID=TRANS_ID;
...
end;
else do;
if keep_TRANS_ID=. then keep_TRANS_ID=TRANS_ID;
...
end;
if last.BRTHYEAR then output;
run;
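For the PROC SQL route the question asks about, here is a minimal sketch. It assumes, as the first answer does, that each Province/AGE/BRTHYEAR combination has at most one distinct non-missing value per variable, so MAX simply picks up whichever value is present. The output name cleaned_sql is only illustrative.

```sas
proc sql;
  create table cleaned_sql as
  select Province, AGE, BRTHYEAR,
         /* summary functions skip missing values, so each MAX
            returns the one non-missing value in the group, if any */
         max(Trans_id) as Trans_id,
         max(Morb_id)  as Morb_id,
         max(VarX)     as VarX
  from uncleaned_data
  group by Province, AGE, BRTHYEAR;
quit;
```

This collapses each group to a single row, so no duplicate flag is needed beforehand.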
I am wondering about the best way to transpose data in SAS when I have multiple occurrences of my id variable. I know I can use the let option in the proc transpose statement to do this, but I do not want to get rid of any data, as I intend to compute averages.
Here is an example of my data and my code:
data grades;
input student testnum grade;
cards;
1 1 30
1 1 25
1 2 45
1 3 67
2 1 22
2 2 63
2 2 12
2 2 77
3 1 22
3 1 17
3 2 14
3 4 17
;
run;
proc sort data=grades;
by student testnum;
run;
proc transpose data=grades out=trgrades;
by student;
id testnum;
var grade;
run;
Here is how I would like my resulting dataset to look:
student testnum1 testnum2 testnum3 testnum4 avg12 avg34
1 30 45 67 . 33.33 67
1 25 . . . 33.33 67
2 22 63 . . 43.5 .
2 . 12 . . 43.5 .
2 . 77 . . 43.5 .
3 22 14 . 17 53 17
3 17 . . . 53 17
I want to use this new dataset (not sure how yet) to create the new columns that are the average score of all testnum1's and testnum2's for a student (avg12) and the average of all testnum3's and testnum4's (avg34) for a student.
There may be a much more efficient way to do this but I am stumped.
Any advice is appreciated.
If all you really need is the average of all test 1's and 2's, and 3's and 4's for each student, then you don't need to transpose at all. All you need is a simple data step:
data grouped;
set grades;
if testnum In (1,2) then group=1;
else if testnum in (3,4) then group=2;
run;
Then a basic proc means:
proc means data=grouped;
by student group;
var grade;
output out=averages mean=groupaverage;
run;
If you need the averages in a single observation, you can easily transpose the averages dataset.
proc transpose data=averages out=trgrades;
by student;
id group;
var groupaverage;
run;
Update:
As mentioned by @Keith, using a format to group the tests is an excellent choice as well. Skip the data step and create the format like so:
proc format;
value TestGroup
1,2 = 'Tests 1 and 2'
3,4 = 'Tests 3 and 4'
;
run;
Then the proc means becomes:
proc means data=grades;
by student testnum;
var grade;
format testnum TestGroup.;
output out=averages mean=groupaverage;
run;
End Update
If, for some reason, you really need to have all the test scores in one observation, then I would recommend using a data step to make them uniquely identifiable. Use by, first.testnum, retain, and a simple counter to assign each score a retake number. Now your transpose uses retake and testnum as id variables. You should be able to figure it out from there.
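The retake-number idea can be sketched roughly as follows. This assumes grades is already sorted by student and testnum, as in the question. The names numbered and retake are illustrative. As one possible variation, retake is used here as a BY variable rather than a second ID variable, which gives one row per student and retake.

```sas
data numbered;
  set grades;
  by student testnum;
  /* restart the counter for each new student/test combination */
  if first.testnum then retake = 0;
  /* the sum statement retains retake across observations */
  retake + 1;
run;

proc sort data=numbered;
  by student retake testnum;
run;

proc transpose data=numbered out=trgrades(drop=_name_) prefix=testnum;
  by student retake;
  id testnum;
  var grade;
run;
```

Because each student/retake group now holds at most one score per testnum, the transpose no longer runs into duplicate ID values and no scores are dropped.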
Really hoping right now that I didn't just do your SAS homework assignment for you.