NOT+IN SAS operators combined, is this valid? Can't find documentation - sas

I'm trying to understand how to code something along the lines of "NOT IN the LIST" type of logic in SAS.
I figured I could do "NOT" + "IN" as something like below.
Data work.OUT;
Set work.IN;
If VAR=1 then OUTPUT=1;
else if VAR=2 then OUTPUT=2;
else if VAR NOT in (1,2) then OUTPUT=3;
else OUTPUT=4;
run;
When I export the dataset all I see is OUTPUT=3 for all records. So something is happening in the derivation and it's transforming all VAR values into OUTPUT 3 values for some reason. Even though I know for a fact that other values exist in the VAR.
I don't understand what the problem is. Can we not combine the NOT and IN operators? Alternatively, are there other ways of coding this type of logic in SAS? I'd rather not code each value separately, since I have more than 300 unique values for VAR.

Welcome to Stack Overflow Alejandro. Your code assigns values 1 2 or 3 depending on what values are in the variable called var:
data in;
do var = 1 to 5;
output;
end;
run;
Data work.OUT;
set work.IN;
If VAR=1 then OUTPUT=1;
else if VAR=2 then OUTPUT=2;
else if VAR NOT in (1,2) then OUTPUT=3;
else OUTPUT=4;
run;
Your code checks for var = 1, then for var = 2, and then whether var is neither 1 nor 2. The final else is never reached, because every value of var (including missing) is either 1, or 2, or something other than 1 and 2.
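To see which values of var actually end up in each output group, a cross-tabulation is a quick check. This is only a sketch, assuming the work.OUT dataset from the step above; the MISSING option asks PROC FREQ to list missing values as their own level.

```sas
/* List every var/output combination that occurs, including missing values */
proc freq data=work.out;
  tables var*output / list missing;
run;
```

If every row of that table shows output = 3, then no record in work.IN has var equal to 1 or 2, which would point at the data (or the variable's type) rather than at NOT IN.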
If you have a pile of if checks, you can use a select/when/otherwise/end block. It will check a series of rules (in the order you type them) and then will do something based on whichever rule is true first.
data out;
set in;
select;
when(var = 1) output = 1;
when(var = 2) output = 2;
when(missing(var)) output = -9999999; /* test missing before var < 5: a missing value compares as less than any number */
when(var < 5) output = 3;
otherwise output = 42;
end;
run;
I hope that helps. If not please send up another flare.

Related

How to correct this sas function in order to have the jaccard distance?

I created a SAS function using PROC FCMP to calculate the Jaccard distance between two strings. I do not want to use macros, as I'm going to apply it to multiple variables in a large dataset. The lists of substrings I get are missing some of the substrings.
proc fcmp outlib=work.functions.func;
function distance_jaccard(string1 $, string2 $);
n = length(string1);
m = length(string2);
ngrams1 = "";
do i = 1 to (n-1);
ngrams1 = cats(ngrams1, substr(string1, i, 2) || '*');
end;
/*ngrams1= ngrams1||'*';*/
put ngrams1=;
ngrams2 = "";
do j = 1 to (m-1);
ngrams2 = cats(ngrams2, substr(string2, j, 2) || '*');
end;
endsub;
options cmplib=(work.functions);
data test;
string1 = "joubrel";
string2 = "farjoubrel";
jaccard_distance = distance_jaccard(string1, string2);
run;
I expected ngrams1 and ngrams2 to contain all the substrings of length 2, but instead I got this:
ngrams1=jo*ou*ub
ngrams2=fa*ar*rj
If you want real help with your algorithm you need to explain in words what you want to do.
I suspect your problem is that you never defined how long your new character variables NGRAMS1 and NGRAMS2 should be. From the output you show, it appears that FCMP defaulted them to length $8.
To define a variable you need to use a LENGTH statement (or an ATTRIB statement with the LENGTH= option) before you start referencing the variable.
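As a rough illustration of the LENGTH fix, here is a minimal sketch. The helper name bigrams and the length of $200 are assumptions made for the example, not part of the original function, and the Jaccard calculation itself is left out.

```sas
proc fcmp outlib=work.functions.func;
  /* Hypothetical helper: returns the 2-character substrings of a string, separated by '*' */
  function bigrams(string $) $ 200;
    length ngrams $ 200;   /* declare the length before the first assignment */
    ngrams = '';
    n = length(string);
    do i = 1 to (n - 1);
      ngrams = cats(ngrams, substr(string, i, 2), '*');
    end;
    return (ngrams);
  endsub;
run;

options cmplib=(work.functions);

data test;
  length ngrams1 ngrams2 $ 200;   /* assumed large enough for the test strings */
  ngrams1 = bigrams("joubrel");
  ngrams2 = bigrams("farjoubrel");
  put ngrams1= ngrams2=;
run;
```

The same LENGTH statement placed at the top of distance_jaccard should stop ngrams1 and ngrams2 from being cut off, provided the chosen length covers the longest input you expect.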

Trying to make a new column conditional on whether other columns are empty

I have a dataset which is like this:
new_fish old_fish
1 2
4
3
And I want to make a column called status, where if new_fish is empty call it dead, and if old_fish is empty call it born, and if neither are empty call it alive.
I would want it to look like this:
new_fish old_fish status
1 2 alive
4 dead
3 born
I've tried the following code in sas,
data diff_withclass;
set diff;
if missing(new_fish) then status= 'dead';
if missing(old_fish) then status= 'born';
else status = 'alive';
run;
However, this doesn't work. It just sets status to alive.
Any suggestions would be great.
You need to use else if. The second if statement is overwriting the first.
data diff_withclass;
set diff;
length status $5; /* without this, status gets length 4 from 'dead' and 'alive' is truncated */
if missing(new_fish) then status= 'dead';
else if missing(old_fish) then status= 'born';
else status = 'alive';
run;
A useful construction when you're choosing one of a list of options is select. Sometimes you select on a variable:
select(var);
when (1) do something; *if var=1 then ... ;
when (2) do something; *else if var=2 then ... ;
otherwise do something; *else ... ;
end;
However, it can be used alone also, as in your case.
data diff_withclass;
set diff;
length status $5;
select;
when (missing(new_fish)) status= 'dead';
when (missing(old_fish)) status= 'born';
otherwise status = 'alive';
end;
run;
Another useful construction is the ifc function. This works like Excel's if, in that it takes an argument that is a boolean (so, the "if" part), if that is true it returns the second argument, and if it is false it returns the third argument. Here we can nest two of them to get your result.
data diff_ifc;
set diff;
length status $5;
status = ifc(missing(new_fish),'dead',ifc(missing(old_fish),'born','alive'));
run;
Finally, there is the good old "boolean arithmetic" solution. Here we create a numeric, and you can then turn that into a character if you prefer, or just use a user-defined format, which is really the optimal way to handle this anyway. This takes advantage of the fact that "true" evaluates to 1 and "false" to 0 in SAS, so we just add up the values, and arbitrarily call Dead the 2 and Born the 1 (could do the opposite, or even have Dead=-1, Born=1, Alive=0, by multiplying by -1 for Dead).
proc format;
value statusf
0 = 'Alive'
1 = 'Born'
2 = 'Dead'
;
run;
data diff_bool;
set diff;
status = missing(new_fish)*2 + missing(old_fish);
format status statusf.;
run;

Is there a SAS function to delete negative and missing values from a variable in a dataset?

Variable name is PRC. This is what I have so far. First block to delete negative values. Second block is to delete missing values.
data work.crspselected;
set work.crspraw;
where crspyear=2016;
if (PRC < 0)
then delete;
where ticker = 'SKYW';
run;
data work.crspselected;
set work.crspraw;
where ticker = 'SKYW';
where crspyear=2016;
where=(PRC ne .) ;
run;
Instead of using a function to remove negative and missing values, it can be done more simply when inputting or outputting the data. It can also be done with only one data step:
data work.crspselected;
/* One WHERE= option holds every condition. A WHERE statement is ignored for a data set that has a WHERE= option, and a second WHERE statement replaces the first. */
set work.crspraw(where = (PRC >= 0 & PRC ^= . & crspyear = 2016 & ticker = 'SKYW'));
run;
The part that removes the negative and missing values is:
PRC >= 0 & PRC ^= .
The WHERE= option can be placed on either the input dataset (work.crspraw) or the output dataset (work.crspselected).
If you must use a function, missing() returns true only for missing values, so ^missing() is true only for non-missing values. There is no built-in function that tests for non-negative values. But I think it's easier and quicker to handle both conditions together without a function.
You don't need more than your first test to remove negative and missing values. SAS treats all 28 missing values (., ._, .A ... .Z) as less than any actual number.
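A minimal sketch of that single-test version, using the dataset and variable names from the question. The two subsetting conditions are combined into one WHERE statement, because a second WHERE statement would replace the first.

```sas
data work.crspselected;
  set work.crspraw;
  where crspyear = 2016 and ticker = 'SKYW';
  if PRC < 0 then delete;   /* also true when PRC is missing, so those rows go too */
run;
```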

SAS FIRST.VARIABLE giving no output

I have some SAS code along the lines of:
DATA MY_SAMPLE;
SET SAMPLE;
BY A;
IF A = 1 THEN B = 1;
ELSE IF A ^= 1 THEN B = 0;
ELSE IF MISSING(A) THEN B = .;
IF FIRST.A;
RUN;
which is returning a set with 0 observations (it shouldn't do this). I have sorted the data by A and tried reading the data into an intermediate dataset before applying the IF FIRST.A but get the same results.
Am I missing something completely obvious? I use the FIRST and LAST all of the time!
Agree with @Robert, the sample code should output records, assuming there are records in your input data and it is sorted.
I would double-check the log from your real program/data, and make sure there are no errors, and that the input dataset has records.
If that doesn't help, I would add some debugging PUT statements, something like below (untested):
DATA MY_SAMPLE;
SET SAMPLE;
BY A;
IF A = 1 THEN B = 1;
ELSE IF A ^= 1 THEN B = 0;
ELSE IF MISSING(A) THEN B = .; *This will never be true ;
put "Before subsetting if " (_n_ A first.A)(=) ;
IF FIRST.A;
put "After subsetting if " (_n_ A first.A)(=) ;
RUN;
As Robert noted, as written your ELSE IF MISSING(A) can never be true: if A is missing, the prior ELSE IF A ^= 1 already evaluates to true, because SAS uses binary logic (true/false), not ternary logic (true/false/null).
Also I would check for any stray OUTPUT statements in your code.
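If the intent is for B to be missing whenever A is missing, one way to make that branch reachable is to test for missing first. This is a sketch under that assumption, reusing the dataset names from the question:

```sas
data my_sample;
  set sample;
  by a;
  if missing(a) then b = .;   /* test for missing before any comparison */
  else if a = 1 then b = 1;
  else b = 0;                 /* A is non-missing and not 1 */
  if first.a;
run;
```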
Checked the log; checked the input; closed MSSQL down; opened it up again and lo and behold, code worked first time. Thanks for the downgrade, but I didn't realize that MSSQL is prone to twitches!

SAS regresses missing values

For some reason when SAS does proportional hazards regression it is including those observations that are specified as . as a group in the results. I suspect it has something to do with how I created my variable (and that SAS thinks my numeric variables are characters) but I can't figure out what I did wrong. I am using SAS 9.4
data final; set final;
if edu_d = 'hs less' then edu_regress = 1;
else if edu_d = 'hs' then edu_regress = 1;
else if edu_d = 'some college' then edu_regress = 2;
else if edu_d = 'college plus' then edu_regress = 3;
else if edu_d = 'missing' then edu_regress=.;
run;
Then I run my regression:
proc phreg data=final;
class edu_regress;
model fuptime*dc(0)=edu_regress/rl;
run;
And the output is as follows:
edu_regress . 1 0.10963 0.12941 0.7177 0.3969 1.116 0.866 1.438
edu_regress 1 1 0.22514 0.10949 4.2278 0.0398 1.252 1.011 1.552
edu_regress 2 1 0.21706 0.11410 3.6190 0.0571 1.242 0.993 1.554
Where . is a category instead of treated as missing.
I'm sure I'm making a rookie mistake but I just can't figure it out.
I would clear your output, and re-run the code, and check the log and output.
As I read the docs, to get missing values treated as a category you would need to have /missing on your CLASS statement, which you do not have in the code shown. Without that, I think missing values should be automatically excluded.
When I run PHREG with a CLASS variable that has missing values, I get a note in the log about observations being deleted due to missing values, and the output shows that the number of observations used is less than the number of observations read.
SAS could be treating edu_regress as character if that variable already existed on the dataset as character. This is one reason not to do data x; set x; and instead make a new dataset. If this is indeed the problem, you should see notes about numeric to character conversion in the log when you run the data step the way you have it now.
Anyway, one way to adjust this is to use CALL MISSING. It sets a variable to missing correctly regardless of the type.
data final;
set final;
if edu_d = 'hs less' then edu_regress = 1;
else if edu_d = 'hs' then edu_regress = 1;
else if edu_d = 'some college' then edu_regress = 2;
else if edu_d = 'college plus' then edu_regress = 3;
else if edu_d = 'missing' then call missing(edu_Regress);
run;
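To check whether the variable type really is the issue, and to avoid overwriting the input, something along these lines might help. It is only a sketch: the names final2 and edu_num are made up for the example, and it assumes edu_num does not already exist on final.

```sas
/* Show the type (Num or Char) of each variable on the existing dataset */
proc contents data=final;
run;

/* Write to a new dataset and a new variable so an existing character edu_regress cannot interfere */
data final2;
  set final;
  length edu_num 8;
  if edu_d in ('hs less', 'hs') then edu_num = 1;
  else if edu_d = 'some college' then edu_num = 2;
  else if edu_d = 'college plus' then edu_num = 3;
  else edu_num = .;   /* 'missing' and any unexpected value */
run;
```

The regression would then use edu_num in the CLASS and MODEL statements in place of edu_regress.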