Dear SAS users in the SO community:
I am using the sample data publicly available at:
https://stats.idre.ucla.edu/wp-content/uploads/2016/02/whas500.sas7bdat
The tutorial is found here along with a description of the data set: https://stats.idre.ucla.edu/sas/seminars/sas-survival/
lenfol: length of followup, terminated either by death or censoring. The outcome in this study.
fstat: the censoring variable, loss to followup=0, death=1
age: age at hospitalization
bmi: body mass index
hr: initial heart rate
gender: males=0, females=1
I believe the units of follow-up are days, but for the sake of my question, let's instead assume that the units of follow-up are years. If this were the case, the minimum follow-up time captured by the LENFOL variable is 1 year and the maximum is 2358 years.
My understanding of Cox PH regression is that while the hazard function may vary over time, the hazard ratio is assumed to remain constant. Please correct me if I am wrong, but this implies that the hazard ratio at year 1 equals the hazard ratio at year 2358 when it is estimated from the entire length of follow-up (2358 years in this study).
If I wanted to estimate the 5-year hazard ratio (i.e., assuming the study ended at year 5), could the PHREG procedure return the hazard ratio assuming that follow-up ended at year 5 instead of the actual full length of the study (2358 years in this case)? For example, to estimate the association between death and gender, I used the following SAS code:
libname ucla "C:\<FILEPATH>";
data ucla_surv;
set ucla.whas500;
run;
proc phreg data=ucla_surv;
model lenfol*fstat(0) = gender/ties=efron;
run;
This results in a hazard ratio (HR) estimate over the entire length of follow-up. Could my code be modified to estimate the 5-year HR as I mentioned above (study artificially ends at year 5)?
Relatedly, would it be appropriate to create a new LENFOL variable that truncates the data at year 5 and run the model with the new variables, as follows?
data ucla_surv_5yr;
set ucla_surv;
label
lenfol5="5-year follow-up"
fstat5="Event indicator for 5-year FU; 1=death,0=censor"
;
if lenfol <5 then do;
fstat5=fstat;
lenfol5=lenfol;
end;
else do;
fstat5=0;
lenfol5=5;
end;
run;
proc phreg data=ucla_surv;
model lenfol*fstat(0) = gender/ties=efron;
title "HR over entire study FU";
ods select ParameterEstimates;
run;
title;
proc phreg data=ucla_surv_5yr;
model lenfol5*fstat5(0) = gender/ties=efron;
title "HR over 5 years of FU";
ods select ParameterEstimates;
run;
title;
One can see from the output that the HR estimate has changed: over the entire follow-up period, the HR for death modeled against gender was 1.465 while at 5-year FU the estimate was 1.363. Because of the truncation though, the 5-year estimate is less precise.
I welcome any thoughts about my approach from the SO community.
Thanks very much.
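One possible way to probe whether the hazard ratio for gender really is constant over follow-up, before comparing truncated and full-length estimates, is to add a time-dependent interaction term. The following is only a sketch under assumptions: the variable name gender_logt and the choice of log(lenfol) as the time function are illustrative and not part of the original post.

```sas
/* Sketch: test the proportional hazards assumption for gender.      */
/* gender_logt is a hypothetical name; log(time) is one common       */
/* choice of time function, not the only one.                        */
proc phreg data=ucla_surv;
  model lenfol*fstat(0) = gender gender_logt / ties=efron;
  /* Programming statement: evaluated at each event time */
  gender_logt = gender*log(lenfol);
run;
```

If the coefficient for gender_logt is clearly different from zero, the hazard ratio appears to change over time, which would be one explanation for a 5-year estimate that differs from the full-follow-up estimate.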
For an if-query I would like to create a macro variable giving the frequency of the underlying time series. I tried to get some descriptive statistics from PROC TIMESERIES. However, they unfortunately do not include the frequency.
The underlying time series does not necessarily include all periods of the frequency. From my point of view, that rules out a simple count with PROC SQL.
Does anyone know an efficient procedure to determine the frequency without computing it on my own (in a data step or PROC SQL code)?
You can use the OUTSPECTRA= option to learn what kind of seasonality the series has. Based on the data, give PROC TIMESERIES your best guess of the interval (day, month, etc.). In the example below, we know we want to forecast by month but we do not know what seasonality it has.
proc timeseries data=sashelp.air outspectra=spectra;
id date interval=month;
var air;
run;
Plot this spectra dataset in proc sgplot and you'll see something that looks like this:
proc sgplot data=spectra;
where NOT missing(period);
series x=period y=p;
run;
This line naturally increases as the period grows, but we're looking for bumps in the line. Notice the large bump somewhere between 0 and 24 months and the several smaller bumps before it. Let's zoom in on that by filtering out the longer periods.
proc sgplot data=spectra;
where period < 24 and NOT missing(period);
series x=period y=p;
run;
It's pretty clear that there is a strong seasonality of 12, with potentially smaller cycles at 3 and 6 months. From this spectral plot, we can conclude that our seasonality should be 12.
You can turn this into a macro to help identify the season if you'd like. Simply search for the largest bump within a reasonable timeframe. In our case we'll choose 36 because we do not suspect that we have any seasonality > 36 months.
proc sort data=spectra;
by period;
run;
data identify_period;
set spectra;
by period;
where NOT missing(period) AND period LE 36;
delta = abs(p - lag(p) );
run;
proc sql;
select period, max(delta) as max_delta
from identify_period
having delta = max(delta)
;
quit;
Output:
PERIOD max_delta
12 163712
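To actually store the detected period in a macro variable rather than just printing it, the same query can write into one. This is a minimal sketch; the macro variable name season is an assumption chosen for illustration.

```sas
/* Sketch: same query as above, but store the result in a macro variable. */
/* "season" is a hypothetical name. If several periods tie for the        */
/* largest delta, only the first row returned is kept.                    */
proc sql noprint;
  select period into :season trimmed
  from identify_period
  having delta = max(delta)
  ;
quit;

%put &=season;
```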
I don't know how to do this without data step logic, but you could wrap the data step in a macro as follows:
%macro get_frequency(data,date_variable,output_variable);
proc sort data=&data (keep=&date_variable) out=__tempsorted;
by &date_variable;
run;
data _null_;
set __tempsorted end=lastobs;
prevdate=lag(&date_variable);
if _n_ > 1 then do;
interval_number+1;
interval_total + (&date_variable - prevdate);
end;
if lastobs then do;
average_interval = interval_total/interval_number;
frequency = round(365.25/average_interval);
/* 'G' creates the macro variable in the global symbol table so it is still available after the macro ends */
call symputx("&output_variable", frequency, 'G');
end;
run;
proc datasets nolist;
delete __tempsorted;
run;
quit;
%mend get_frequency;
Then you can call the macro on your original data set timeseries to examine the variable date and create a new macro variable frequency1 with the required frequency.
data work.timeseries;
input date date. value;
format date date9.;
datalines;
01Oct18 3000
01Nov18 4000
01Dec18 6500
01Jan19 7000
01Feb19 4000
01Mar19 5000
01Apr19 7500
01May19 4800
01Jun19 4500
;
run;
%get_frequency(timeseries,date,frequency1)
%put &=frequency1;
This seems to work on your sample data, where each date is the first of the month. If your dates are evenly spaced (e.g. always near month start/end, or always near mid-month) then this macro should work OK. Obviously, if you have multiple observations per date it will give a completely incorrect frequency.
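If repeated dates are a possibility, one option is to collapse them before measuring the gaps. The sketch below shows the sort step on its own, using the sample data set and variable names from above; inside the macro the same NODUPKEY option could be added to the existing PROC SORT.

```sas
/* Sketch: keep one row per distinct date so that repeated dates */
/* do not add zero-length intervals to the average.              */
proc sort data=work.timeseries (keep=date)
          out=work.__tempsorted nodupkey;
  by date;
run;
```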
I am working with a SAS dataset that includes up to 30 medications prescribed to an individual patient. The medications are coded med1, med2 ... med30. Each medication is represented by a 5-digit character variable. Using the identifier, I can then code the name of the drug, and whether that particular medication is a topical antibiotic or a systemic antibiotic.
For each patient, I want to use all 30 medication codes to create one variable indicating whether the patient got a topical antibiotic only, a systemic antibiotic only, or both a topical and an oral antibiotic. So if any of the 30 medications is a systemic antibiotic, I want the patient coded as oral_antibiotic=1.
I currently have this code:
data want;
set have;
array meds[30] med1-med30;
if meds[i] in ('06925' '06920') then do;
penicillin=1;
oral_antibiotic=1;
end;
else if meds[i] in ('03197') then do;
neosporin=1;
topical_antibiotic=1;
end;
.... (many more do loops with many more medications)
run;
The problem is that this code creates one indicator variable instead of 30, overwriting previous information.
I think that I really need 30 indicator variables, indicating whether each of the 30 drugs is an oral or topical antibiotic, before I write code that says if any of the drugs are oral antibiotics, the patient received an oral antibiotic.
I am new to macros and would really appreciate help.
data current;
/* medication codes are character, so read them with a $5. informat */
input (med1-med3) (:$5.);
cards;
06925 06920 03197
;
run;
And I want this:
data want;
input med1 :$5. topical_antibiotic1 oral_antibiotic1 med2 :$5. topical_antibiotic2 oral_antibiotic2 med3 :$5. topical_antibiotic3 oral_antibiotic3;
cards;
06925 0 1 06920 0 1 03197 1 0
;
run;
"I think that I really need 30 indicator variables, indicating whether each of the 30 drugs is an oral or topical antibiotic, before I write code that says if any of the drugs are oral antibiotics, the patient received an oral antibiotic."
That's not true. Your current approach is fine as long as you're not resetting them. You don't show us the full code, so it's hard to say, but I'm going to assume that's what is happening here.
Your loop should look like:
array med(30) med1-med30;
*set to 0 at top of the loop;
topical_antibiotic=0; oral_antibiotic=0;
do i=1 to dim(med);
if med(i) in (.....) /*list of topical codes*/ then topical_antibiotic=1;
else if med(i) in (.....) /*list of oral codes*/ then oral_antibiotic=1;
end;
This assumes that an antibiotic cannot be in both Topical/Oral groups. If it can, you need to remove the ELSE from the second IF statement.
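Putting the loop into a full data step and deriving the single summary variable the question asks for could look like the sketch below. The code lists contain only the three codes given in the question, and the variable antibiotic_group and its values are assumptions made for illustration.

```sas
/* Sketch: one row per patient in HAVE, with character codes med1-med30. */
/* Code lists are limited to the examples from the question.             */
data want;
  set have;
  length antibiotic_group $8;   /* hypothetical summary variable */
  array med(30) med1-med30;

  /* reset the flags for every patient */
  topical_antibiotic = 0;
  oral_antibiotic = 0;

  do i = 1 to dim(med);
    if med(i) in ('03197') then topical_antibiotic = 1;
    else if med(i) in ('06925' '06920') then oral_antibiotic = 1;
  end;

  if topical_antibiotic and oral_antibiotic then antibiotic_group = 'Both';
  else if topical_antibiotic then antibiotic_group = 'Topical';
  else if oral_antibiotic then antibiotic_group = 'Oral';
  else antibiotic_group = 'None';

  drop i;
run;
```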
I agree that you probably only need one indicator variable for each drug group (medication of interest). It seems you just want to know, for each subject, "Do they have it?" This example flips the arguments of the IN operator. If you had given more example data I could have done better with this example.
data current;
infile cards missover;
array med[3] $5;
input med[*];
oral_antibiotic = '069' in: med; /*Assume oral all start with '069'*/
topical_antibiotic = '03197' in med;
cards;
06925 06920 03197
06925
;;;;
run;
data work.smallmarket;
set work.market;
where country='Nigeria';
NetMargin=profit2/Rev2;
keep Product# NetMargin DT;
run;
Question 1: How can I calculate an industry average NetMargin by date (DT) across all products, bearing in mind that not every product has data on every date? That is, no data is not the same as 0.
Question 2: How can I calculate a moving industry average for NetMargin?
Question 1:
proc sort data=smallmarket; by DT; run;
/* Missing NetMargin values are left out of the mean rather than counted as 0 */
proc means data=smallmarket noprint;
by DT;
var NetMargin;
output out=by_date
mean(NetMargin)=
;
run;
Question 2:
If you have access to SAS/ETS, you could use PROC EXPAND; if not, you can find a worked example at:
http://support.sas.com/kb/25/027.html
Edit: found better example:
https://communities.sas.com/t5/Base-SAS-Programming/Calculate-moving-average-by-group/td-p/296267?nobounce
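For the PROC EXPAND route, a minimal sketch on the by-date averages from Question 1 might look like this. It assumes SAS/ETS is licensed, that by_date holds one row per DT with the mean in NetMargin, and that a 3-period window is wanted; the output names by_date_ma and NetMargin_ma3 are hypothetical.

```sas
/* Sketch: 3-period moving average of the daily industry mean.      */
/* METHOD=NONE keeps the observed values as they are and only       */
/* applies the transformation.                                      */
proc expand data=by_date out=by_date_ma method=none;
  id DT;
  convert NetMargin = NetMargin_ma3 / transformout=(movave 3);
run;
```

Change the window size in MOVAVE to whatever span suits the data.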
I have some data about students and their dropout percentage. I have information about which education they started, in which city (some educations are offered in several cities), and the year they started. I also have information about whether there was a quota the students had to meet to be able to start their education.
The quota variable can contain both numeric values and character values (see the table).
I want to make a table in SAS that shows the quota and the dropout % as in the picture below:
So for each education and each city I have the years laid out, and in the cells I have the quota and the dropout % for that year.
I cannot get this to work in SAS. I have tried:
proc tabulate data= sammensat missing;
var dropout;
class education year city quota ;
Table education* city,year *dropout all/ rts=180;
run;
This gives me part of the output I want. But I want another row showing the quota for each combination of education and city for each year.
Two problems: including the quota, and dealing with the char values.
Including quota is easy, if it's numeric.
proc tabulate data= sammensat missing;
class education year city;
var dropout quota;
Table education*city*(quota dropout),year all/ rts=180;
run;
You might need to add statistics to those if they should not both use the default; probably *mean for both, but I'm not sure exactly what your data looks like.
To deal with the character problem, you need to store quota as a number and create a format. The format either maps special values to labels, if the character values are simply assigned categories, or uses values that show there is no quota (missing, 0, etc.) when a city-year-education combination has no quota by definition.
proc format;
value quotaf
-1='NO QUOTA'
-9='PASSED AUDITION'
0-high=[3.1]
;
run;
Then use that to format quota, either in the dataset or with an f= option on quota in the PROC TABULATE.
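As a sketch of the f= route, assuming quota has already been recoded to a numeric variable that uses -1 and -9 for the two special cases and that there is one row per education, city and year (so the mean is simply the value itself):

```sas
/* Sketch: apply the quotaf. format inside PROC TABULATE.            */
/* Width 15 leaves room for the longest label, 'PASSED AUDITION'.    */
/* The ALL column is left out here because averaging the special     */
/* codes -1 and -9 across years would not be meaningful.             */
proc tabulate data=sammensat missing;
  class education year city;
  var dropout quota;
  table education*city*(quota*mean*f=quotaf15. dropout*mean),
        year / rts=180;
run;
```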
I have data on exam results for 2 years for a number of students. I have a column with the year, the student's name and the mark. Some students don't appear in year 2 because they don't sit any exams in the second year. I want to show whether the performance of students persists or whether there's any pattern in their subsequent performance. I can split the data into two halves of equal size to account for the 'first-half' and 'second-half' marks. I can also split the first half into quintiles according to the exam results using PROC RANK.
The output I want is a table with the original 5 quintiles on one axis and the 5 subsequent quintiles plus a 'dropped out' category on the other, so a 5 x 6 matrix. There will obviously be around 20% of the total number of students in each quintile in the first exam, and if there's no relationship there should be 16.67% in each of the 6 subsequent categories. But I don't know how to proceed to show whether this is the case or not with this data.
How can I go about doing this in SAS, please? Could someone point me towards a good tutorial that would show how to set this up? I've been searching for terms like 'performance persistence', but to no avail.
I've been proceeding like this to set up my dataset. I've added a column with 0 or 1 for the first or second half of the data using the first procedure below. I've also added a column with the quintile rank in terms of marks for all the students. But I think I've gone about this the wrong way. Shouldn't I be dividing the data into quintiles within each half, rather than across the whole two periods?
Proc rank groups=2;
var yearquarter;
ranks ExamRank;
run;
Proc rank groups=5;
var percentageResult;
ranks PerformanceRank;
run;
Thanks in advance.
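Why are you dividing the data into quintiles?
I would leave the scores as they are, then make a scatterplot with a loess smoother:
proc sgplot data=dataset;
/* raw points: year 1 score against year 2 score */
scatter x=year1 y=year2;
/* smoothed trend only; the markers are already drawn above */
loess x=year1 y=year2 / nomarkers;
run;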
Here's a fairly basic example of the simple tabulation. I transpose your quintile data and then make a table. In this simulated data there is basically no relationship, and I only allow a 5% dropout rate, so you get roughly 19% in each of the five quintiles and 5% missing.
data have;
do i = 1 to 10000;
do year = 1 to 2;
if year=2 and ranuni(7) < 0.05 then call missing(quintile);
else quintile = ceil(5*ranuni(7));
output;
end;
end;
run;
proc transpose data=have prefix=year out=have_t;
by i;
var quintile;
id year;
run;
proc tabulate data=have_t missing;
class year1 year2;
tables year1,year2*rowpctn;
run;
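To get from raw marks to the year1/year2 layout used above, the quintiles can be computed separately within each year and then transposed. This is only a sketch: the data set name results and the variables name, year and mark are assumptions based on the description in the question, and year is assumed to be coded 1 and 2.

```sas
/* Sketch: rank within each year, then one row per student. */
proc sort data=results;
  by year;
run;

proc rank data=results groups=5 out=ranked;
  by year;
  var mark;
  ranks quintile;   /* PROC RANK numbers the groups 0 (lowest) to 4 (highest) */
run;

proc sort data=ranked;
  by name year;
run;

proc transpose data=ranked prefix=year out=ranked_t (drop=_name_);
  by name;
  var quintile;
  id year;          /* creates year1 and year2 */
run;
```

Students with no year 2 record end up with a missing year2, which the MISSING option in PROC TABULATE then shows as its own column.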
PROC CORRESP might be helpful for the analysis, though it doesn't look like it exactly does what you want.
proc corresp data=have_t outc=want outf=want2 missing;
tables year1,year2;
run;
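For a formal check of whether the second-year category depends on the first-year quintile, a chi-square test of independence on the same cross-tabulation is one option. A minimal sketch using the have_t data set from above:

```sas
/* Sketch: chi-square test on the year1 by year2 table.            */
/* MISSING keeps the dropped-out students as their own category.   */
proc freq data=have_t;
  tables year1*year2 / chisq missing nocol nopercent;
run;
```

A large p-value would be consistent with no relationship between first-year and second-year performance.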