I need to merge two datasets like the ones below using a data step:
Data have1;
x=1; output;
x=2; output;
x=3; output;
Run;
Data have2;
y = 'A';
z = 'B';
Run;
Data want;
Merge have1 have2;
Run;
The result should be the following:
x y z
1 A B
2 A B
3 A B
However, when I run the merge, SAS only merges the first row and gives me the following:
x y z
1 A B
2
3
I know this can be done using a left join; however, in order to process the variables in the full dataset, I would prefer doing it via a merge. Can anyone help, please?
Where did variable Z come from? I think this may be what you want.
Data want;
set have1;
if _n_ eq 1 then set have2;
Run;
To explain what's going on: when SAS brings in data from a dataset and "runs out" of rows, it sets all variables that come from that dataset to missing. This can happen with set a b (two datasets on the same SET statement), with merge a b and no BY statement, or with set a; set b;, though in the latter case the data step terminates as soon as either dataset runs out of rows, so it doesn't matter.
The reason @data_null_'s code works is this statement:
if _n_ eq 1 then set b;
That never attempts to pull a row that's not there! It pulls the first row, and then stops trying to pull. Since all variables coming from a set or merge are retained automatically, the values are kept even after the first iteration of the data step loop (as long as you don't change them).
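As a hedged illustration of the set a; set b; case described above, here is a minimal sketch using the question's have1 and have2 (the output name want_two_sets is made up for this example). Because the second SET statement hits the end of have2 on the second iteration, the step should stop after writing a single observation:

```sas
/* Sketch only: two separate SET statements, using have1/have2 from the question. */
/* want_two_sets is a hypothetical dataset name. */
data want_two_sets;
  set have1;   /* reads one row of have1 per iteration */
  set have2;   /* have2 has one row, so the second read attempt ends the step */
run;
```

Compare this with if _n_ eq 1 then set have2;, which only ever attempts one read from have2.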
The code below gets the last 2 observations from a dataset without using loops, FIRST./LAST. variables, or sorting.
data a;
set sashelp.cars nobs=_nobs_;/*create the temporary variable to store total no of obs*/
if _N_ ge _nobs_-1;/*Now compare the automatic variable _N_ to _nobs_*/
run;
Not sure there is a question here.
You can also use the POINT= option on the SET statement. You have to explicitly end the data step since most data steps end when they read past the end of the input data and this step cannot do that.
data want;
do p=max(1,nobs-1) to nobs;
set have point=p nobs=nobs;
output;
end;
stop;
run;
A DATA step is an implicit loop with an automatic index variable _N_. You can leverage that fact to implicitly output rows without an explicit DO loop. As @Tom notes, the point= option is used so the entire data set does not need to be read to reach the last two rows.
Example:
data want;
if _N_ > min(2,_Z_) then stop;
_P_ = _Z_ - min(2,_Z_) + _N_;
set sashelp.class point=_P_ nobs=_Z_;
run;
Consider I have two datasets:
data dataset_1;
input CASENO X;
datalines;
1 100
2 200
3 300
;
data dataset_2;
input CASENO Y;
datalines;
2 200000
3 300000
;
I'm looking to find how many CASENOs appear in both lists: in the example above, I would get 2.
My data is very large. It is taking a long time to get this sort of result by using a merge.
data result;
merge dataset_1 (in = a) dataset_2 (in = b);
by CASENO;
if a and b;
RUN;
I'm looking for a more efficient way.
edit: for clarity, is there a way to return the number of matches in two datasets without SAS having to write out the resulting file?
If the datasets are already sorted, the data step merge is incredibly efficient. It passes over each row in each table exactly once. Of course, if you just want the count, you don't need to output all of the rows to a dataset, you can just:
data _null_;
merge dataset_1 (in = a keep=caseno) dataset_2 (in = b keep=caseno) end=eof;
by CASENO;
if a and b then count+1;
if eof then call symputx('count',count);
RUN;
This will be much faster to run since you're not writing anything out. I also added KEEP= dataset options (as Tom points out in the comments) to the incoming datasets so that only the BY variable is read in; this produces a speed-up of about 10%.
If the datasets are indexed, you have some additional options that will be faster, as they will be doing index scans (such as using SQL). But for sorted, non-indexed tables, it's hard to improve on the data step merge.
Try:
proc sql;
select count(caseno) as Number from dataset_1 where caseno in (select caseno from dataset_2);
quit;
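If the count is needed in later code rather than in printed output, a variant of the query above could store it in a macro variable. This is only a sketch, assuming CASENO is unique within dataset_1 (otherwise duplicates are counted more than once) and a SAS release that supports the TRIMMED keyword (9.3 or later, as far as I know); n_matches is a name made up for this example.

```sas
proc sql noprint;
  /* NOPRINT suppresses the result listing; INTO stores the count instead */
  select count(caseno) into :n_matches trimmed
    from dataset_1
    where caseno in (select caseno from dataset_2);
quit;

/* n_matches is a hypothetical macro variable name */
%put Number of matching CASENOs: &n_matches;
```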
When running a data step with two datasets in the set statement, sometimes variables do not reset to missing between iterations. This is also true of merge when you have duplicate by values (i.e., when your by variables do not guarantee a unique record).
For example:
data have1;
do x=1 to 5;
y=1;
output;
end;
run;
data have2;
do x = 6 to 10;
z=x+1;
output;
end;
run;
data want;
set have1 have2;
if missing(y) and mod(z,2)=0 then y=2;
run;
Here, y is given a value of 2 for every record from have2 once the first even z value has been reached, as opposed to only the records with even z values.
Similarly,
data have1;
do x = 1 to 5;
y=1;
output;
end;
run;
data have2;
do x = 1 to 5;
do z = 1 to 4;
output;
end;
end;
run;
data want;
merge have1 have2;
by x;
if mod(z,4)=3 then y=3;
run;
Why does this happen, and how can I prevent it from causing unexpected consequences?
Why is this happening?
As discussed at length in the SAS documentation in Combining SAS Data Sets: Methods, this arises from the fact that variables defined on a set, merge, or update statement are not set to missing on each iteration of the data step (this is equivalent to using retain for all variables on the incoming data sets).
For the first example, this follows naturally from the retain concept: y is retained, so when it is not replaced by a new record from set having a value of y on it, it keeps its last value. (As we'll see later, it is cleared once: when the set dataset changes, hence why it doesn't still have the earlier value from the previous dataset).
However, this doesn't quite explain the functionality of the merge (how it goes back and forth). That is caused by a different behavior when a by group is involved.
Specifically, variables are not set to missing between each data step iteration; however, they are set to missing for each new by group or data set. From the documentation:
The values of the variables in the program data vector are set to
missing each time SAS starts to read a new data set and when the BY
group changes.
This is why, in the second example, y is set back to 1 for the first two iterations of z but stays at 3 for the z=4 iteration.
In order, labelling each iteration by its z value:
Z=1: first record of by group, so everything is set to missing. HAVE1 is read, HAVE2 is read. X=1, Y=1, Z=1 are all set.
Z=2: Second record of have2 is read. y retains its value of 1 from the previous iteration.
Z=3: Third record of have2 is read. y is set to 3.
Z=4: Fourth record of have2 is read. y retains its value of 3 from the previous iteration.
Note that HAVE1 is only read once, on the z=1 iteration. If this were a many-to-many merge, HAVE1 would be read once for each different row that had the same x value on it.
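One way to watch this happen is to write the variables to the log before and after the assignment. The following is a minimal sketch, assuming the have1 and have2 datasets from the second example; the WHERE statement restricts it to the x=1 BY group to keep the log short.

```sas
data _null_;
  merge have1 have2;
  by x;
  where x = 1;                          /* applies to both input datasets */
  put 'After read:   ' _n_= x= z= y=;   /* y as it stands after the MERGE statement */
  if mod(z,4)=3 then y=3;
  put 'After assign: ' _n_= x= z= y=;   /* y after the conditional assignment */
run;
```

If the walkthrough above is right, the "After read" line for z=4 should already show y=3, since nothing re-reads HAVE1 within the BY group.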
How do we prevent it from happening?
You have several options to deal with this, assuming you want it to act as if it was not automatically retained.
Add a by statement
As was noted before, on new by values it will automatically reset everything to missing. So if you ran
data want;
set have1 have2;
by x;
if missing(y) and mod(z,2)=0 then y=2;
run;
This would work as expected (though giving a slightly different result here).
Set some or all variables to missing on your own
You can do this in two places:
data want;
set have1 have2;
if missing(y) and mod(z,2)=0 then y=2;
output;
call missing(of _all_);
run;
or
data want;
y=.;
set have1 have2;
if missing(y) and mod(z,2)=0 then y=2;
run;
One or the other may be more appropriate for your program depending on your needs. The first sets everything to missing but requires an extra statement (output;), while the second only sets y to missing (which is all that's needed) but changes the variable order by putting y first.
For a merge with duplicate by values, if you want to preserve the value of y you may need to do something like:
data want;
merge have1 have2;
by x;
y_new=y;
if mod(z,4)=3 then y_new=3;
rename y_new=y;
drop y;
run;
which gets around things by using a separate variable to store the new value. You also can set it to missing similarly to the above, if that is what is desired.
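For the set-to-missing alternative, a sketch along the same lines as the earlier SET example might look like this (assuming the have1 and have2 datasets from the merge example). Note the consequence: since HAVE1 is read only once per BY group, y would be missing on every later row of the group unless the IF condition assigns it.

```sas
data want;
  merge have1 have2;
  by x;
  if mod(z,4)=3 then y=3;
  output;            /* explicit OUTPUT, so the reset below happens after the row is written */
  call missing(y);   /* clear y so it is not carried into the next iteration */
run;
```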
I tried to recode the missing values but instead lost all my other variables within a dataset
data work.newdataset;
if (year =.) then year = 2000;
run;
You are missing the SET statement.
data want;
set have;
myvar=5;
run;
will create a new dataset, want, from have, with the new variable value applied (or the recode or whatever). You could also do
data have;
set have;
myvar=5;
run;
That would replace have with itself plus the recode/whatever. This is actually less common in SAS; it is often preferable to do all recodes in one step, but to create a new dataset (so that the code is reversible easily).
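Applied to the code in the question, that might look like the sketch below. The input dataset name work.olddataset is a placeholder, since the question does not show what the original dataset is called.

```sas
data work.newdataset;
  set work.olddataset;             /* placeholder name: read the existing data first */
  if year = . then year = 2000;    /* recode missing YEAR values */
run;
```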
I'm very new to SAS and I'm trying to figure out some basic things available in other languages.
I have a table
ID Number
-- ------
1 2
2 5
3 6
4 1
I would like to create a new variable where I sum the value of one observation of Number to each other observations, like
Number2 = Number + Number[3]
ID Number Number2
-- ------ ------
1 2 8
2 5 11
3 6 12
4 1 7
How do I get the value of the third observation of Number and add it to each observation of Number in a new variable?
There are several ways to do this; here is one using the SAS POINT= option:
data have;
input ID Number;
datalines;
1 2
2 5
3 6
4 1
run;
data want;
retain adder;
drop adder;
if _n_=1 then do;
adder = 3;
set have point=adder;
adder = number;
end;
set have;
number = number + adder;
run;
The RETAIN and DROP statements define a temp variable to hold the value you want to add. RETAIN means the value is not to be re-initialized to missing each time through the data step and DROP means you do not want to include that variable in the output data set.
The POINT= option allows one to read a specific observation from a SAS data set. The _n_=1 part is a control mechanism to only execute that bit of code once, assigning the variable adder to the value of the third observation.
The next section reads the data set one observation at a time and applies your change.
Note that the same data set is read twice; a handy SAS feature.
I'll start by suggesting that Base SAS doesn't really work this way, normally; it's not that it can't, but normally you can solve most problems without pointing to a specific row.
So while this answer will solve your explicit problem, it's probably not something useful in a real world scenario; usually in the real world you'd have a match key or some other element other than 'row number' to combine with, and if you did then you could do it much more efficiently. You also likely could rearrange your data structure in a way that made this operation more convenient.
That said, the specific example you give is trivial:
data have;
input ID Number;
datalines;
1 2
2 5
3 6
4 1
;;;;
run;
data want;
set have;
_t = 3;
set have(rename=number=number3 keep=number) point=_t ;
number2=number+number3;
run;
If you have SAS/IML (SAS's matrix language), which is somewhat similar to R, then this is a very different story both in your likelihood to perform this operation and in how you'd do it.
proc iml;
a= {1 2, 2 5, 3 6, 4 1}; *create initial matrix;
b = a[,2] + a[3,2]; *create a new matrix which is the 2nd column of a added
elementwise to the value in the third row second column;
c = a||b; *append new matrix to a - could be done in same step of course;
print b c;
quit;
To do this with the first observation, it's a lot easier.
data want;
set have;
retain _firstpoint; *prevents _firstpoint from being set to missing each iteration;
if _n_ = 1 then _firstpoint=number; *on the first iteration (usually first row) set to number's value;
number = number - _firstpoint; *now subtract that from number to get relative value;
run;
I'll elaborate a little more on this. SAS works on a record-by-record level, where each record is independently processed in the DATA step. (PROCs, on the other hand, may not behave this way, though many do at some level.) SAS, like SQL and similar databases, doesn't truly acknowledge that any row is "first" or "second" or "nth"; however, unlike SQL, it does let you pretend that it is, based on the current sort. The POINT= random access method is one way to go about doing that.
Most of the time, though, you're going to be using something in the data to determine what you want to do rather than something related to the ordering of the data. Here's a way you could do the same thing as the POINT= method, but using the value of ID:
data want;
if _n_ = 1 then set have(where=(ID=3) rename=number=number3);
set have;
number2=number+number3;
run;
On the first iteration of the data step (_N_=1), that takes the row from HAVE where ID=3, and then takes the rows from HAVE in order. What it really does is this:
*check to see if _n_=1; it is; so take row id=3;
*take first row (id=1);
*check to see if _n_=1; it is not;
*take second row (id=2);
... continue ...
Variables that are in a SET statement are automatically retained, so NUMBER3 is automatically retained (yay!) and not set to missing between iterations of the data step loop. As long as you don't modify the value, it will stay for each iteration.