For some reason when SAS does proportional hazards regression it is including those observations that are specified as . as a group in the results. I suspect it has something to do with how I created my variable (and that SAS thinks my numeric variables are characters) but I can't figure out what I did wrong. I am using SAS 9.4
data final; set final;
if edu_d = 'hs less' then edu_regress = 1;
else if edu_d = 'hs' then edu_regress = 1;
else if edu_d = 'some college' then edu_regress = 2;
else if edu_d = 'college plus' then edu_regress = 3;
else if edu_d = 'missing' then edu_regress=.;
run;
Then I run my regression:
proc phreg data=final;
class edu_regress;
model fuptime*dc(0)=edu_regress/rl;
run;
And the output is as follows:
edu_regress . 1 0.10963 0.12941 0.7177 0.3969 1.116 0.866 1.438
edu_regress 1 1 0.22514 0.10949 4.2278 0.0398 1.252 1.011 1.552
edu_regress 2 1 0.21706 0.11410 3.6190 0.0571 1.242 0.993 1.554
Where . is a category instead of treated as missing.
I'm sure I'm making a rookie mistake but I just can't figure it out.
I would clear your output, and re-run the code, and check the log and output.
As I read the docs, to get missing values treated as a category you would need to have /missing on your CLASS statement, which you do not have in the code shown. Without that, I think missing values should be automatically excluded.
When I run PHREG with a CLASS variable that has missing values, I get a note in the log about observations being deleted due to missing values, and the output shows that the number of observations used is less than the number of observations read.
If SAS thinks edu_regress is character, that's possible if it already was on the dataset as character. This is one reason not to do data x; set x; and instead make a new dataset. You should see notes in the datastep when you run it the way you have now regarding numeric to character conversion, if this is indeed the problem.
Anyway, one way to adjust this is to use CALL MISSING. It sets a variable to missing correctly regardless of the type.
data final;
set final;
if edu_d = 'hs less' then edu_regress = 1;
else if edu_d = 'hs' then edu_regress = 1;
else if edu_d = 'some college' then edu_regress = 2;
else if edu_d = 'college plus' then edu_regress = 3;
else if edu_d = 'missing' then call missing(edu_Regress);
run;
Related
I created a SAS function using fcmp to calculate the jaccard distance between two strings. I do not want to use macros, as I'm going to use it through a large dataset for multiples variables. the substrings I have are missing others.
proc fcmp outlib=work.functions.func;
function distance_jaccard(string1 $, string2 $);
n = length(string1);
m = length(string2);
ngrams1 = "";
do i = 1 to (n-1);
ngrams1 = cats(ngrams1, substr(string1, i, 2) || '*');
end;
/*ngrams1= ngrams1||'*';*/
put ngrams1=;
ngrams2 = "";
do j = 1 to (m-1);
ngrams2 = cats(ngrams2, substr(string2, j, 2) || '*');
end;
endsub;
options cmplib=(work.functions);
data test;
string1 = "joubrel";
string2 = "farjoubrel";
jaccard_distance = distance_jaccard(string1, string2);
run;
I expected ngrams1 and ngrams2 to contain all the substrings of length 2 instead I got this
ngrams1=jo*ou*ub
ngrams2=fa*ar*rj
If you want real help with your algorithm you need to explain in words what you want to do.
I suspect your problem is that you never defined how long you new character variables NGRAM1 and NGRAM2 should be. From the output you show it appears that FCMP defaulted them to length $8.
To define a variable you need use a LENGTH statement (or an ATTRIB statement with the LENGTH= option) before you start referencing the variable.
I want to display my data as either yes or no in the output for initaltesting, site visit, and follow up, how would I do that? There are numeric values for this on the data set but want character responses of "y" or "n"
PROC FORMAT;
VALUE SiteVisitfmt 1 = 'yes'
0 = 'no';
VALUE InitialTestingfmt 1 = 'yes'
2 = 'no';
VALUE TestEventfmt 1 = 'One Event '
2 = 'Two Events'
3 = 'Three Events'
4 = 'Four Events'
5 = 'Five Events';
VALUE FollowUpfmt 1 = 'yes'
0 = 'no';
FORMAT SiteVisit SiteVisitfmt. InitialTesting InitialTestingfmt. TestEvent TestEventfmt.
FollowUp FollowUpfmt.;
RUN;
data PMdataedits;
set PMdata (rename = (Number_of_Days_from_Onset_to_Sit =SiteVisit
Number_of_Days_between_Onset_and = InitialTesting
Number_of_Test_Events_in_IRIS = TestEvent
Number_of_Days_between_Test_1_an = FollowUp));
drop SPA;
attrib date1 format=date9.;
date1=input(date,mmddyy10.);
NewSiteVisit = put(SiteVisit, 8.);
NewInitialTesting = put(InitialTesting, 8.);
NewFollowUp = put(FollowUp, 8.);
NewSiteVisit=;
if (NewSiteVisit=<1) THEN NewSiteVisit= '1';
if (NewSiteVisit>1) THEN NewSiteVist= '0';
NewInitialTesting=;
if (NewInitialTesting<=2) THEN NewInitialTesting= '1';
if (NewInitialTesting>2) THEN NewInitialTesting='0';
This statement:
FORMAT SiteVisit SiteVisitfmt. InitialTesting InitialTestingfmt. TestEvent TestEventfmt.
FollowUp FollowUpfmt.;
Needs to be on the data step (sometime after data PMdataedits; but before the run; that you don't show), not in the proc format. That's the statement that assigns the format to a variable; each dataset (which is defined by a data step) has its own, unique set of variables that can be the same name as other datasets but have different contents and formats.
Also note that you don't have to name the formats after the variables, and don't need three different yes/no formats. You could have done:
proc format;
format ynf
'1'='yes'
'0'='no'
;
run;
And then used
format sitevisit initialtesting followup ynf.;
And that would have covered all three of them with one format. But what you did is legal, it's just more typing than you need!
I'm trying to understand how to code something along the lines of "NOT IN the LIST" type of logic in SAS.
I figured I could do "NOT" + "IN" as something like below.
Data work.OUT;
Set work.IN;
If VAR=1 then OUTPUT=1;
else if VAR=2 then OUTPUT=2;
else if VAR NOT in (1,2) then OUTPUT=3;
else OUTPUT=4;
run;
When I export the dataset all I see is OUTPUT=3 for all records. So something is happening in the derivation and it's transforming all VAR values into OUTPUT 3 values for some reason. Even though I know for a fact that other values exist in the VAR.
I don't understand what the problem is? Can we not combine NOT+IN operators? Alternatively, do you have any other ways of coding this type of logic in SAS? I rather not code each bit of code since I have more than 300 unique values for VAR
Welcome to Stack Overflow Alejandro. Your code assigns values 1 2 or 3 depending on what values are in the variable called var:
data in;
do var = 1 to 5;
output;
end;
run;
Data work.OUT;
set work.IN;
If VAR=1 then OUTPUT=1;
else if VAR=2 then OUTPUT=2;
else if VAR NOT in (1,2) then OUTPUT=3;
else OUTPUT=4;
run;
Your code says check for var = 1 then check for var = 2 and then check if it is not 1 or 2. The final else is never checked because a var will be 1 or 2 or not 1 or 2.
If you have a pile of if checks, you can use a select/when/otherwise/end block. It will check a series of rules (in the order you type them) and then will do something based on whichever rule is true first.
data out;
set in;
select;
when(var = 1) output = 1;
when(var = 2) output = 2;
when(var < 5) output = 3;
when(.) output = -9999999;
otherwise output = 42;
end;
run;
I hope that helps. If not please send up another flare.
Variable name is PRC. This is what I have so far. First block to delete negative values. Second block is to delete missing values.
data work.crspselected;
set work.crspraw;
where crspyear=2016;
if (PRC < 0)
then delete;
where ticker = 'SKYW';
run;
data work.crspselected;
set work.crspraw;
where ticker = 'SKYW';
where crspyear=2016;
where=(PRC ne .) ;
run;
Instead of using a function to remove negative and missing values, it can be done more simply when inputting or outputting the data. It can also be done with only one data step:
data work.crspselected;
set work.crspraw(where = (PRC >= 0 & PRC ^= .)); * delete values that are negative and missing;
where crspyear = 2016;
where ticker = 'SKYW';
run;
The section that does it is:
(where = (PRC >= 0 & PRC ^= .))
Which can be done for either the input dataset (work.crspraw) or the output dataset (work.crspselected).
If you must use a function, then the function missing() includes only missing values as per this answer. Hence ^missing() would do the opposite and include only non-missing values. There is not a function for non-negative values. But I think it's easier and quicker to do both together simultaneously without a function.
You don't need more than your first test to remove negative and missing values. SAS treats all 28 missing values (., ._, .A ... .Z) as less than any actual number.
I have some SAS code along the lines of:
DATA MY_SAMPLE;
SET SAMPLE;
BY A;
IF A = 1 THEN B = 1;
ELSE IF A ^= 1 THEN B = 0;
ELSE IF MISSING(A) THEN B = .;
IF FIRST.A;
RUN;
which is returning a set with 0 observations (it shouldn't do this). I have sorted the data by A and tried reading the data into an intermediate dataset before applying the IF FIRST.A but get the same results.
Am I missing something completely obvious? I use the FIRST and LAST all of the time!
Agree with #Robert, the sample code should output records, assuming there are records in your input data and it is sorted.
I would double-check the log from your real program/data, and make sure there are no errors, and that the input dataset has records.
If that doesn't help, I would add some debugging PUT statements, something like below (untested):
DATA MY_SAMPLE;
SET SAMPLE;
BY A;
IF A = 1 THEN B = 1;
ELSE IF A ^= 1 THEN B = 0;
ELSE IF MISSING(A) THEN B = .; *This will never be true ;
put "Before subsetting if " (_n_ A first.A)(=) ;
IF FIRST.A;
put "After subsetting if " (_n_ A first.A)(=) ;
RUN;
As Robert noted, as written your Else if Missing(A) would never be true, because if A is missing the prior Else if A ^= 1 will evaluate to true because SAS uses binary logic (true/false), not trinary logic(true/false/null).
Also I would check for any stray OUTPUT statements in your code.
Checked the log; checked the input; closed MSSQL down; opened it up again and lo and behold, code worked first time. Thanks for the downgrade, but I didn't realize that MSSQL is prone to twitches!