How to drop a single row based on a condition in SAS?

I have the following data set which is a set of variables and their respective p values and R squared values from a simple linear regression.
data have;
input Variable$ Probt R_value tie$;
cards;
X1 0.0016 0.4344 .
X2 0.0003 0.5204 .
X3 0.0001 0.7497 yes
X4 0.0001 0.9026 yes
run;
However, as you can see, there are two variables that have the same Probt value of 0.0001, and I have created a variable called tie to capture the situation where two variables have the same p value.
What I want is the following. When there is a tie, I want to break it by picking the tied variable with the highest R_value, so that the result looks like this:
data want;
input Variable$ Probt R_value tie$;
cards;
X1 0.0016 0.4344 .
X2 0.0003 0.5204 .
X4 0.0001 0.9026 yes
run;

Assuming the probt values are truly identical, as they are in your example, you can do something as simple as using the last. variable from BY-group processing (this also assumes the data are already sorted in the BY order; if not, use proc sort first):
data want;
set have;
by descending probt r_value;
if last.probt; *if it is the last record from any set of identical probt values, keep it;
run;
If the probt values are rounded and not truly identical, you need to make a variable first which is truly identical (using round). If you already computed tie you may have done this already.
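A minimal sketch of that rounding step, assuming four decimal places is the precision at which two p values should count as tied (the name probt_r and the rounding unit 0.0001 are illustrative assumptions, not part of the original answer):

```sas
/* Build a rounded copy of probt so that ties are exact matches. */
/* The rounding unit 0.0001 is an assumption; adjust as needed.  */
data have_r;
  set have;
  probt_r = round(probt, 0.0001);
run;

/* Sort so that the highest r_value is last within each rounded p value. */
proc sort data=have_r;
  by descending probt_r r_value;
run;

/* Keep only the last (highest r_value) row of each tie group. */
data want;
  set have_r;
  by descending probt_r r_value;
  if last.probt_r;
  drop probt_r;
run;
```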

Something like the code below should work, but be careful about how the tie value is computed, as mentioned by #reeza and #joe.
data have;
input Variable$ Probt R_value tie$;
cards;
X1 0.0016 0.4344 .
X2 0.0003 0.5204 .
X3 0.0001 0.7497 yes
X4 0.0001 0.9026 yes
X5 0.0001 0.9028 yes
X6 0.0002 0.7499 yes
X7 0.0002 0.9027 yes
run;
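proc sql;
create table want as
select * from have a
/* keep only the row with the highest R_value for each probt value; */
/* untied rows are their own maximum, so they are kept as well */
where R_value =
(select max(R_value) from have b
where a.probt = b.probt);
quit;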

Related

SAS length warning : Multiple lengths were specified for the variable np by input data set(s)

I want to set two datasets with different length of the same variable. My example:
data set 1
det np ord
C 5 0
data set 2
det np ord
A 1(10) 1
B 3(30) 2
Could someone help me set these two datasets correctly, without the warning? Many thanks!
My final dataset should be:
det np ord
C 5 0
A 1(10) 1
B 3(30) 2
Assuming the following situation with two data sets with different lengths in column "det":
data have1;
length det $4 np ord c $10;
det='1234';np='np1'; ord='ord1';
c='c text';
run;
data have2;
length det $5 np ord a b $10;
det='12345'; np='np2'; ord='ord2';
a='a sample';
b='b others';
run;
If you put them together by a simple data step there will be a warning due to different lengths depending on the order of your input data sets:
data want1;
set have1 have2;
run;
WARNING: Multiple lengths were specified for the variable det by input data set(s). This can cause truncation of data.
And here is the code without warning:
data want2;
set have2 have1;
run;
/* Sort step, just to get the original order */
proc sort data=want2;
by det;
run;
Keep in mind that in the first case the warning has real consequences: your data may be truncated, because the resulting column det will have a length of 4. So, in the example above, the second row will have det="1234" instead of "12345", i.e. want1 looks like this:
det   np   ord   c       a         b
-------------------------------------------------
1234  np1  ord1  c text
1234  np2  ord2          a sample  b others
In the case of want2 the length will be 5 and there will not be any truncation.
Explanation: SAS sets the length of a column in the resulting data set from the first occurrence of that variable it encounters.
But the best way to avoid this warning is to define the lengths for the resulting table before the set statement. In this way you can also define the order of the resulting columns:
data want;
length det $5 np ord c a b $10;
set have1 have2;
run;
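To decide which length to declare, it can help to look up the stored lengths first. A small sketch, assuming the two data sets live in the WORK library as in the example above:

```sas
/* List the stored length of det in each input data set.         */
/* DICTIONARY.COLUMNS keeps libname and memname in upper case.   */
proc sql;
  select memname, name, type, length
  from dictionary.columns
  where libname = 'WORK'
    and memname in ('HAVE1', 'HAVE2')
    and upcase(name) = 'DET';
quit;
```

The largest length reported is the one to use in the length statement before the set statement.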

How to fix the range of x-axis in slicefit plot option in SAS

When I run the following code to show the predicted probabilities of y (binary) vs. x1 (continuous) at different values of x2 (continuous), the range of x1 goes from its minimum to its maximum.
proc logistic data=data;
model y(event='1') = x1 | x2;
store logiMod;
run;
title "Predicted probabilities";
proc plm source=logiMod;
effectplot slicefit(x=x1 sliceby=x2=0 to 30 by 5);
run;
However, I want to show this graph only for x1 values ranging from 0 to 20 with an increment of 2 if possible. I don't want to change my model. I just want to change the range of the display for the x-axis. How do I do that?
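No answer is recorded for this question here. One possible approach, offered only as a sketch, is to score a grid of x1 and x2 values with the stored model and draw the plot with PROC SGPLOT, where the axis can be controlled directly. The data set names grid and pred and the variable name p are hypothetical, and the sketch uses the RESTORE= option of PROC PLM to read the item store:

```sas
/* Grid of the x1 values to display, crossed with the x2 slices. */
data grid;
  do x2 = 0 to 30 by 5;
    do x1 = 0 to 20 by 2;
      output;
    end;
  end;
run;

/* Score the grid with the stored model; ILINK requests predictions */
/* on the probability scale rather than the logit scale.            */
proc plm restore=logiMod;
  score data=grid out=pred predicted=p / ilink;
run;

/* One curve per x2 slice, with the x-axis limited to 0..20 by 2. */
title "Predicted probabilities";
proc sgplot data=pred;
  series x=x1 y=p / group=x2;
  xaxis values=(0 to 20 by 2);
  yaxis label="Predicted probability";
run;
```

The model itself is unchanged; only the points at which it is evaluated and displayed are restricted.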

Multiply each line of a file by all lines of another file using SAS

I have 2 databases:
Database 1
Database 2
I need to multiply each line of database 1 by all the lines of database 2 (ie. line 1 of database 1 by all lines of database 2; line 2 of database 1 by all lines of database 2, etc), in such a way:
Example equations
I need to get a value for each stage within each id.
Can you help me with this, please? I use SAS software.
I am not going to retype all of the data from your pictures, but here is a program that will work for two of your "stages". I called the first dataset HAVE and the second one STAGES. This data step generates a WANT dataset that keeps all of the data from HAVE and adds the new calculated variables.
data want ;
set have ;
array vars x y z ;
array stages a b ;
do p=1 to dim(stages);
set stages point=p ;
array factor m1-m3 ;
stages(p)=0;
do j=1 to dim(vars);
stages(p) + vars(j)*factor(j) ;
end;
end;
drop stage m1-m3 j;
run;
So here is the result for two rows of input data and two of the new stages.
Obs id x y z a b
1 1 0.5 0.5 0.3 1.40 1.12
2 2 0.3 0.1 0.1 0.48 0.34
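For reference, here is a sketch of input data sets that are consistent with the result above. The values in HAVE are read off the output table, and the multipliers in STAGES are assumed to match the ones used in the PROC SCORE example further down; the layout (one row per stage, with variables stage and m1-m3) is inferred from the data step:

```sas
/* Two rows of source data, as shown in the result table. */
data have;
  input id x y z;
cards;
1 0.5 0.5 0.3
2 0.3 0.1 0.1
;

/* One row per stage: the stage name and one multiplier per variable. */
data stages;
  input stage $ m1-m3;
cards;
a 0.7 1.2 1.5
b 0.3 1.1 1.4
;
```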
To expand this to be more flexible you could use macro variables to specify the list of variable names in the ARRAY statements. You could even generate the list of names to use for the STAGES array by using PROC SQL and INTO clause to extract the names from the STAGE column in the STAGES dataset.
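A rough sketch of that idea, assuming the STAGES dataset has a character column STAGE holding valid SAS variable names. The macro variable name stage_names is hypothetical. The data step relies on the names being listed in the same order as the rows of STAGES, which PROC SQL typically returns without an ORDER BY clause but does not guarantee:

```sas
/* Collect the stage names into a space-separated macro variable. */
proc sql noprint;
  select stage into :stage_names separated by ' '
  from stages;
quit;

/* Same data step as above, with the array list supplied by the macro variable. */
data want;
  set have;
  array vars x y z;
  array stages &stage_names;
  do p = 1 to dim(stages);
    set stages point=p;        /* read row p of STAGES directly */
    array factor m1-m3;
    stages(p) = 0;
    do j = 1 to dim(vars);
      stages(p) + vars(j) * factor(j);
    end;
  end;
  drop stage m1-m3 j;
run;
```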
You can also just follow this example from data_null_ (https://communities.sas.com/t5/SAS-Procedures/Multiplication-of-tables-in-SAS/m-p/125059#M34355) on how to use PROC SCORE to multiply matrices. Set up your STAGES dataset to have the same variable names as your input dataset and include _TYPE_ and _NAME_ variables.
data stages ;
_TYPE_='SCORE';
input _NAME_ :$32. x y z ;
cards;
a 0.7 1.2 1.5
b 0.3 1.1 1.4
;
Then you can use it to "score" your source data.
proc score score=stages data=have out=want;
var x y z ;
run;

Numeric values in PROC FREQ or formatted values in PROC PRINT

I have a situation that seems like it should be easy to fix. But, I’m struggling to find an elegant solution. I was given data that was already formatted. Similar to the toy dataset below.
proc format;
value x1_f 1 = "Yes"
0 = "No";
value x2_f 1 = "Yes"
2 = "No";
run;
data ds;
input x1 x2;
datalines;
1 2
1 1
0 1
;
data ds;
set ds;
format x1 x1_f.
x2 x2_f.;
run;
Now, as part of my data management process I create a 2x2 table using x1 and x2. Let’s say I’m checking my data, and expect x1 and x2 to always agree.
proc freq data = ds;
tables x1*x2;
run;
When I look at the report I notice that x1 and x2 don’t always agree. So, I want to print the observations that don’t agree to see if I can figure out what might be going on. Because this is a toy example, there are no other variables to look at, but hopefully you get the idea.
proc print data = ds;
where x1 = "Yes" & x2 = "No";
run;
SAS gives me the following error:
ERROR: WHERE clause operator requires compatible variables
Ok, I guess I need to give SAS the numeric values instead of the formatted values. But, when I go look at the PROC FREQ report from earlier, it only shows me the formatted values. So, I run another PROC FREQ.
proc freq data = ds;
tables x1*x2;
format x1 x2;
run;
Now I can see which variable uses 0’s and 1’s, and which variable uses 1’s and 2’s.
proc print data = ds;
where x1 = 0 & x2 = 1;
run;
Finally, I get what I’m looking for. This just seems really clunky and inelegant. Can someone tell me how to either view my numeric values and formatted values simultaneously in my frequency report, OR how to use the formatted values in proc print?
If you know the format name then use the PUT() function in the WHERE statement.
proc print data=sashelp.class ;
where put(age,2.) = '12';
run;
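Applied to the ds data set from the question, that might look like the following sketch, assuming the x1_f and x2_f formats defined there are available in the session:

```sas
/* Filter on formatted values by converting each variable with its own format. */
proc print data=ds;
  where put(x1, x1_f.) = "Yes" and put(x2, x2_f.) = "No";
run;

/* Or list every observation where the two formatted values disagree. */
proc print data=ds;
  where put(x1, x1_f.) ne put(x2, x2_f.);
run;
```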
If you don't know the format name then you can use the VVALUE() function. But you probably need to add a data step for it to work.
data to_print;
set sashelp.class ;
if strip(vvalue(age))='12';
run;
proc print data=to_print;
run;
In the old days I used to just create a separate format catalog with formats that included the values in the labels.
proc format;
value x1_f 1 = "1=Yes" 0 = "0=No";
run;
Then when you read your output you knew the values the variables actually had. It is pretty simple to create a program to convert a format catalog.
http://github.com/sasutils/macros/blob/master/cfmtgen.sas

SAS backwards retain

I have a large data set that can be dissected into the following:
ID x
1 0
1 0
1 0
1 1
1 1
I have one ID variable telling me which individual that the X value corresponds to.
The X variable is 0 if no event has occurred for this individual and 1 if an event has occurred.
I'm interested in creating a variable which tells me whether an event has occurred at all for a customer during the whole time series for that specific ID, as seen in x2 below.
ID x x2
1 0 1
1 0 1
1 0 1
1 1 1
1 1 1
Hence x2 takes the value 1 across all observations because x takes the value 1 in at least one instance.
I have looked at creating a reversed lag through the "SAS leading technique" but it doesn't seem to be able to retain the value, so I would need to do a reverse lag multiple times which is not an option since my actual data set contains thousands of rows and every ID needs different amounts of lags.
Does anyone have an idea about how to solve this?
Thanks in advance!
The easiest way to do this is the double DoW loop.
data want;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if x then x2=1;
end;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
output;
end;
run;
This loops through the dataset once to find the value of x2, sets it, and then loops through again to output the data. You don't actually retain anything because it's all in one data step loop iteration for a single ID - x2 isn't being reset except between IDs.
This is reasonably fast, as long as you don't have more records per ID than you can fit in the read buffer, as it will buffer the first read and thus not have to re-read from disk a second time.
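One detail of the step above is that x2 is left missing, rather than 0, for an ID that never has an event. If a 0/1 flag is wanted, a small variation (a sketch, not part of the original answer) initializes x2 at the top of each data step iteration:

```sas
data want;
  x2 = 0;                      /* reset for each ID; runs once per iteration */
  do until (last.id);          /* first pass: look for any event in this ID  */
    set have;
    by id;
    if x = 1 then x2 = 1;
  end;
  do until (last.id);          /* second pass: write the rows with the flag  */
    set have;
    by id;
    output;
  end;
run;
```

Because x2 is assigned before the first set statement, it becomes the first column of WANT; a retain or format statement placed after it will not change that.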
Try a SQL solution.
proc sql;
create table flagged as
select
a.*,
b.x2
from
have a
join
(select
id,
max(x) as x2
from have
group by id) b
on
a.id = b.id
;
quit;
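PROC SQL in SAS can also remerge a summary statistic onto the detail rows by itself, which allows a shorter form. This is a sketch assuming the input data set is named have, as in the other answers. SAS writes a note to the log that the query requires remerging summary statistics, and the order of rows within an ID is not guaranteed to be preserved:

```sas
/* max(x) is computed per id and remerged onto every row of that id. */
proc sql;
  create table flagged as
  select *, max(x) as x2
  from have
  group by id;
quit;
```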
You could do this by merging the data with itself, applying a where= dataset option to the second copy. You will need to keep a copy of the X variable, but renamed, so that it can be used in the where=. You could use this renamed variable as the new X2, but then you would need to convert missing values to zeros. Or you could use the IN= dataset option to generate the new X2 variable with 0/1 values.
data want;
merge have have(in=in2 keep=id x rename=(x=x3) where=(x3=1)) ;
by id;
x2 = in2;
drop x3;
run;