Select observations using variables from a second data set - sas

I have two datasets in SAS. They both contain the same variable x. In the first data set, I want to remove those observations whose x value is also in the x values in the second data set.
Example,
data set1;
input x y z;
datalines;
1 1.5 2.2
1 2.1 9.0
2 4.2 4.4
3 4.5 2.4
;
run;
data set2;
input x y;
datalines;
1 15
2 44
;
run;
In set 1, I want to remove those observations if x=1 or x=2 where 1 and 2 come from the x values from second data set. I only want to keep the last row in set 1.

So your final result should include only the row where x=3? There are a few ways, but I find this the clearest to understand.
proc sql;
create table want as
select *
from set1
where x not in (select x from set2);
quit;

Data step version:
data want;
merge set1(in = _1)
set2(in = _2 keep = x);
by x;
if _1 and not(_2);
run;
This assumes that set1 and set2 have both either been sorted by x or have an index on x.
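If neither data set is sorted yet, the preparatory step might look like the minimal sketch below. The output names set1_sorted and set2_sorted are made up for illustration and are not part of the original answer.

```sas
/* Sort both inputs by the BY variable before the merge */
proc sort data=set1 out=set1_sorted;
  by x;
run;

proc sort data=set2 out=set2_sorted;
  by x;
run;

data want;
  merge set1_sorted(in = _1)
        set2_sorted(in = _2 keep = x);  /* keep=x so set2's y cannot overwrite set1's y */
  by x;
  if _1 and not _2;                     /* keep rows of set1 with no match in set2 */
run;
```

Writing to new data sets with OUT= leaves the original set1 and set2 untouched.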

Related

Values of common column in A replaced by those in B with MERGE in SAS

I want to merge two tables that have 2 columns in common, and I do not want the value of var1 in A to be replaced by the value in B. Does anyone know how to do this without using DROP or RENAME?
I can fix it with SQL, but I am curious whether it can be done with MERGE.
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 30 50
2 b 30 50
;
run;
/* Merge A and B */
data c;
merge a (in=N) b(in=M);
if N;
by id1;
run;
But what I would like is:
data C;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 10 50
1 b 10 50
2 a 10 50
2 b 10 50
;
run;
Use rename
data c;
merge a (in=N) b(in=M rename=(var1=var1_2));
by id1;
if N;
run;
If you don't want to use rename / drop etc., then you could just flip the merge order so that the dataset whose var1 should be retained overwrites the other:
data c;
merge b (in=M) a(in=N);
by id1;
if N;
run;
When the data step loads data from the datasets mentioned, it does so in the order that they appear on the MERGE (or SET or UPDATE) statement. So if you are merging two datasets and the BY variables have matching values, then the record from the first is loaded and then the record from the second is loaded, overwriting the values read from the first.
For 1 to 1 matching you can just change the order that the datasets are mentioned.
merge b(in=M) a(in=N) ;
If you really want the variables defined in the output dataset in the order they appear in A then add a SET statement that the compiler will process but that can never execute before your MERGE statement.
if 0 then set a b ;
If you are doing a 1 to many matching then you might have other trouble since when a dataset stops contributing values to the current BY group then SAS does not re-read the last observation. In that case you will have to use some combination of RENAME=, DROP= or KEEP= dataset options.
In PROC SQL when you have duplicate names for selected columns (and are trying to create an output dataset instead of report) then SAS ignores the second copy of the named variable. So in a sense it is the reverse of what happens with the MERGE statement.
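For comparison, here is a minimal sketch of a PROC SQL join that should produce the table C the question asks for. It assumes, as in the sample data, that B has at most one row per id1, and it sidesteps the duplicate-name issue by selecting var1 from A explicitly.

```sas
proc sql;
  create table c as
  select a.id1, a.id2, a.var1, b.var2   /* var1 comes from A, var2 from B */
  from a left join b
    on a.id1 = b.id1
  order by a.id1, a.id2;
quit;
```

Listing the columns explicitly means the result does not depend on which copy of a duplicated name SAS keeps.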

Does SAS have an equivalent function to all() or any() in R

In R you can perform a condition across all rows in a column variable by using the all() or any() function. Is there an equivalent method in SAS?
I want a condition that returns TRUE if ANY row in column x is negative.
Or, if ALL rows in column y are negative, this should return TRUE.
For example
x y
-1 -2
2 -4
3 -4
4 -3
In R:
all(x<0) would give the output FALSE
all(y<0) would give the output TRUE
I wish to replicate the same column-wise operation in SAS.
For completeness' sake, here is the SAS-IML solution. Of course, it's trivial as the any and all functions exist by the same name... I also include an example of using loc to identify the positive elements.
data have ;
input x y @@;
cards;
1 2
2 4
3 -4
4 -3
;
run;
proc iml;
use have;
read all var {"x" "y"};
print x y;
x_all = all(x>0);
x_any = any(x>0);
y_all = all(y>0);
y_any = any(y>0);
y_pos = y[loc(y>0)];
print x_all x_any y_all y_any;
print y_pos;
quit;
If you want to operate on all observations that might be easiest to do using SQL summary functions.
SAS evaluates boolean expressions as 1 for true and 0 for false. So to find out whether any observation meets a condition, test whether MAX( condition ) is true (i.e. equal to 1). To find out whether all observations meet the condition, test whether MIN( condition ) is true.
data have ;
input x y @@;
cards;
-1 -2 2 -4 3 -4 4 -3
;
proc sql ;
create table want as
select
min(x<0) as ALL_X
, max(x<0) as ANY_X
, min(y<0) as ALL_Y
, max(y<0) as ANY_Y
from have
;
quit;
Result
Obs    ALL_X    ANY_X    ALL_Y    ANY_Y
  1        0        1        1        1
SQL probably feels the most similar to R here, but the data step is just as efficient and lends itself a bit better to modification. Frankly, if you're trying to learn SAS, it is probably the way to go, simply to learn how to do things the SAS way.
data want;
set have end=eof;
retain any_x all_x; *persist the value across rows;
any_x = max(any_x, (x>0)); *(x>0) 1=true 0=false, keep the MAX (so keep any true);
all_x = min(all_x, (x>0)); *(x>0) keep the MIN (so keep any false);
if eof then output; *only output the final row to a dataset;
*and/or;
if eof then do; *here we output the any/all values to macro variables;
call symputx('any_x',any_x); *these macro variables can further drive logic;
call symputx('all_x',all_x); *and exist in a global scope (unless we define otherwise);
end;
run;
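The step above tests x>0. A minimal sketch of the same pattern using the question's condition (negative values) for both columns might look like this. The variable names any_y and all_y are additions for illustration, and it assumes the dataset have that holds the question's values (-1 -2, 2 -4, 3 -4, 4 -3).

```sas
data want;
  set have end=eof;
  retain any_x all_x any_y all_y;   /* persist the running results across rows */
  any_x = max(any_x, (x<0));        /* MAX ignores the initial missing value */
  all_x = min(all_x, (x<0));        /* MIN ignores the initial missing value */
  any_y = max(any_y, (y<0));
  all_y = min(all_y, (y<0));
  if eof then output;               /* write only the final summary row */
  keep any_x all_x any_y all_y;
run;
```

On the question's data this should agree with the PROC SQL result (ALL_X=0, ANY_X=1, ALL_Y=1, ANY_Y=1).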

Comparison of two data sets in SAS

I have the following data set:
data data_one;
length X 3
Y $ 20;
input x y ;
datalines;
1 test
2 test
3 test1
4 test1
5 test
6 test
7 test1
;
run;
data data_two;
length Z 3
A $ 20;
input Z A;
datalines;
1 test
2 test1
3 test2
;
run;
What I would like to have is a data set that tells me how often each string in column A of data_two occurs in column Y of data_one. The result should look like this:
Obs test test1 test2
1 4 3 0
Thanks in advance!
1. First we need the counts for those values of Y present in data_one.
2. Then we create a sorted (for the next merge) list of the values present in data_two.
3. The data_one Y counts from 1. are merged with the list from 2. The Y values present in data_two but not in data_one (b and not a) are assigned count=0; the Y values not present in data_two are discarded (if b).
4. The last step transposes the vertical list of counts into a horizontal set of variables.
proc freq data=data_one noprint;
table y / out=count_one (keep=y count);
run;
proc sort data=data_two out=list_two (keep=a rename=(a=y)) nodupkey;
by a;
run;
data count_all;
merge count_one (in=a) list_two (in=b);
by y;
if (b and not a) then count=0;
if b;
run;
proc transpose data=count_all out=final (drop=_name_ _label_);
id y;
run;
The first 3 steps can be replaced with one proc SQL:
proc sql;
create table count_all as
select distinct
coalesce(t1.y,t2.a) as y,
case
when missing(t1.y) then 0
else count(t1.y)
end as N
from data_one as t1
right join data_two as t2
on t1.y=t2.a
group by 1
order by 1;
quit;
proc transpose data=count_all out=final (drop=_name_);
id y;
run;

How to add new observation to already created dataset in SAS?

How can I add a new observation to an existing dataset in SAS? For example, if I have dataset 'dataX' with variables 'x' and 'y' and I want to add a new observation that is a multiple of observation number n, how can I do it?
dataX :
x y
1 1
1 21
2 3
I want to create :
dataX :
x y
1 1
1 21
2 3
10 210
where observation number four is observation number two multiplied by ten.
data X;
input x y;
datalines;
1 1
1 21
2 3
;
run;
data X ;
set X end=eof;
if eof then do;
output;
x=10 ;y=210;
end;
output;
run;
Here is one way to do this:
data dataX;
input x y;
datalines;
1 1
1 21
2 3
;
run;
/* Create a new observation into temp data set */
data _addRec;
set dataX(firstobs=2); /* Get observation 2 */
x = x * 10; /* Multiply each by 10 */
y = y * 10;
output; /* Output new observation */
stop;
run;
/* Add new obs to original data set */
proc append base=dataX data=_addRec;
run;
/* Delete the temp data set (to be safe) */
proc delete data=_addRec;
run;
data a ;
do kk=1 to 5 ;
output ;
end ;
run;
data a2 ;
kk=999 ;
output ;
run;
data a; set a a2 ;run ;
proc print data=a ;run ;
Result:
The SAS System 1
OBS kk
1 1
2 2
3 3
4 4
5 5
6 999
You can use a macro to obtain your desired result:
Write a macro which reads the first dataset and, when _n_=2, multiplies x and y by 10.
After that, create another dataset which holds only the multiplied values, say x'=10x and y'=10y.
Pass both datasets to another macro which sets the original dataset and the newly created dataset together.
The logic is that you create another dataset with the values 10x and 10y and then SET it after the previous dataset.
I hope this helps!
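The steps described above could be sketched as a single macro rather than two. The macro name add_scaled_obs, its parameters, and the temporary data set _scaled are all hypothetical, and the variable names x and y are taken from the question.

```sas
/* Hypothetical macro: append a copy of observation &n of &ds, scaled by &factor */
%macro add_scaled_obs(ds=, n=, factor=);
  /* Step 1: read only observation &n and scale x and y */
  data _scaled;
    set &ds(firstobs=&n obs=&n);
    x = x * &factor;
    y = y * &factor;
  run;

  /* Step 2: stack the original rows and the new row */
  data &ds;
    set &ds _scaled;
  run;

  /* Step 3: remove the temporary data set */
  proc delete data=_scaled;
  run;
%mend add_scaled_obs;

%add_scaled_obs(ds=dataX, n=2, factor=10);
```

With the question's dataX, this call should append the row 10 210 after the three existing rows.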

What does this if mean in a data step?

In this data step I do not understand what if last.y does...
Could you tell me ?
data stop2;
set stop2;
by x y z t;
if last.y; /*WHAT DOES THIS DO ??*/
if t ne 999999 then
t=t+1;
else do;
t=0;
z=z+1;
end;
run;
LAST.Y refers to the row immediately before a change in the value of Y. So, in the following dataset:
data have;
input x y z;
datalines;
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 2 3
1 3 1
1 3 2
1 3 3
2 3 1
2 3 2
2 3 3
;;;;
run;
LAST.Y would occur on the third, sixth, ninth, and twelfth rows in that dataset (on each row where Z=3). The first two times are when Y is about to change from 1 to 2, and when it is about to change from 2 to 3. The third time is when X is about to change - LAST.Y triggers when Y is about to change or when any variable before it in the BY list changes. Finally, the last row in the dataset is always LAST.(whatever).
In the specific dataset above, the subsetting if means you only take the last row for each group of Ys. In this code:
data want;
set have;
by x y z;
if last.y;
run;
You would end up with the following dataset:
data want;
input x y z;
datalines;
1 1 3
1 2 3
1 3 3
2 3 3
;;;;
run;
at the end.
One thing you can do if you want to see how FIRST and LAST operate is to use PUT _ALL_;. For example:
data want;
set have;
by x y z;
put _all_;
if last.y;
run;
It will show you all of the variables, including FIRST.(whatever) and LAST.(whatever) on the dataset. (FIRST.Y and LAST.Y are actually variables.)
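FIRST. and LAST. variables are temporary and are not written to the output dataset. If you would rather inspect them in a table than in the log, a minimal sketch is to copy them into ordinary variables. The names first_y, last_y, and flags are chosen here for illustration.

```sas
data flags;
  set have;
  by x y z;
  first_y = first.y;   /* 1 on the first row of each x-y group, else 0 */
  last_y  = last.y;    /* 1 on the last row of each x-y group, else 0 */
run;

proc print data=flags;
run;
```

There is no subsetting IF in this step, so every row of have is kept together with its flags.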
In SAS, first. and last. are variables created implicitly within a data step.
Each BY variable gets a first. and a last. variable, set for every record in the DATA step. Their values will be either 0 or 1. Writing if last.y; is the same as writing if last.y = 1;.
Please refer here for further info.
That is an example of a subsetting IF statement, which is different from an IF/THEN statement. It basically means that if the condition is not true, SAS stops this iteration of the data step right there, without writing the observation.
So
if last.y;
is equivalent to
if not last.y then delete;
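or, provided the data step writes its observations with an explicit OUTPUT statement (otherwise RETURN still triggers the implicit output of the current observation),
if not last.y then return;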
or
if last.y then do;
... rest of the data step before the run ...
end;