I figured out the solution to my problem already, but I'd like to know what is happening exactly, and why, or maybe if there is a workaround to the following:
Suppose you have:
data test;
length group $20.;
subject=1; hours=0; group= 'hour 1'; output;
subject=1; hours=1; group= 'hour 15'; output;
subject=1; hours=2; group= 'hour 15'; output;
subject=2; hours=0; group= 'hour 1'; output;
subject=2; hours=1; group= 'hour 15'; output;
subject=2; hours=2; group= 'hour 15'; output;
run;
You sort on hours first and then on group, because group is a character variable and would not sort in the intended order on its own.
proc sort data=test;
by subject hours group;
run;
Now when you run this code to retrieve only the first record of each group:
data test2;
set test;
by subject hours group;
if first.group;
run;
It keeps every record.
I recently learned that 'when you use more than one variable in the BY statement, if the first/last variable linked to a primary BY variable changes to 1, the first/last variable linked to the second BY variable will also be changed to 1.'
So of course, because the hours variable changes on every row, first.group is set to 1 on every row as well.
So 'why' is this code running fine?
data test2;
set test;
by subject group;
if first.group;
run;
It seems a bit weird to have to leave out variables you sorted on, and it doesn't seem very flexible: you can't, for example, pass the same macro variable list to both the sort and the BY statement of a data step. If this is just the way it is, is there another preferred way of doing this kind of operation? I can see myself making this error often, just copy-pasting the list of sorting variables...
If you want to use a BY statement to generate FIRST. and LAST. variables for a grouped variable that is not actually sorted then use the NOTSORTED keyword on the BY statement.
For example you might want to order the data by HOUR and then group it by the STATUS so that you can find out at what hour they transitioned to that STATUS.
data have;
input subject hour status $;
cards;
1 0 C
1 1 B
1 2 B
1 3 D
2 0 A
2 1 D
2 2 D
;
data want ;
set have ;
by subject status notsorted;
if first.status;
run;
Result:
Obs subject hour status
1 1 0 C
2 1 1 B
3 1 3 D
4 2 0 A
5 2 1 D
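To see why the three-variable BY statement in the question kept every row, it can help to write the FIRST. flags to the log. This is only a minimal sketch, assuming the sorted test data set from the question is available:

```sas
/* Diagnostic sketch: print the FIRST. flags for each row to the log. */
/* Assumes TEST has already been sorted by subject hours group.       */
data _null_;
  set test;
  by subject hours group;
  put subject= hours= group= first.subject= first.hours= first.group=;
run;
```

Because hours changes on every row within a subject, first.hours is 1 on every row, and that forces first.group to 1 as well. With by subject group; the flag is set only when subject or group changes.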
I have the following dataset.
ID var1 var2 var3
1 100 200
1 150 300
2 120
2 100 150 200
3 200 150
3 250 300
I would like to have a new dataset with only the last non-blank value of each variable within each ID group.
id var1 var2 var3
1 150 200 300
2 100 150 200
3 250 300 150
last. selects the last record, but I need to select the last non-null value.
Looks like you want the last non-missing value for each non-key variable, so you can let the UPDATE statement do the work for you. By default, missing values in the transaction dataset do not overwrite existing values. Normally, for an update operation, you apply transactions to a master dataset, but for your application you can use the OBS=0 dataset option to make your current dataset work as both the master and the transactions.
data want ;
update have(obs=0) have ;
by id;
run;
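As a hedged, self-contained illustration of the OBS=0 trick: the sample table in the question lost its blank columns, so the layout below is an assumption, chosen because it is consistent with the desired output shown there.

```sas
/* Hypothetical reconstruction of the sample data; the positions of the */
/* missing values (.) are assumed, not taken from the question.        */
data have;
  input id var1 var2 var3;
  datalines;
1 100 200 .
1 150 . 300
2 120 . .
2 100 150 200
3 200 . 150
3 250 300 .
;
run;

/* The empty master (obs=0) supplies only the structure; every row of   */
/* HAVE is then applied as a transaction, and missing transaction       */
/* values leave the current value untouched.                            */
data want;
  update have(obs=0) have;
  by id;
run;
```

Under that assumed layout the step should return one row per id (150 200 300, then 100 150 200, then 250 300 150), matching the table requested in the question.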
Riccardo:
There are many ways to select, within each group, the last non-missing value of each column. Here are three. I would not say one is the best; each approach has its merits depending on the specific data set, coder comfort and long-term maintainability.
Way 1 - UPDATE statement
Perhaps the simplest of the coding approaches goes like this:
make a data set that has one row per id and the same columns as the original data.
use the UPDATE statement to replace each like named variable with a non-missing value.
Example:
data want_base_table(label="One row per id, original columns");
set have;
by id;
if first.id;
run;
* use have as a transaction data set in the update statement;
data want_by_update;
update want_base_table have;
by id;
run;
Way 2 - DOW loop
Others will involve arrays and first. and last. flag variables of the BY group. This example shows a DOW loop that tracks the non-missing value and then uses them for the output of each ID:
data want_dow;
do until (last.id);
set have;
by id;
array myvar var1-var3 ;
array myhas has1-has3 ;
do _i = 1 to dim(myvar);
if not missing (myvar(_i)) then
myhas(_i) = myvar(_i);
end;
end;
do _i = 1 to dim(myhas);
myvar(_i) = myhas(_i);
end;
output;
drop _i has1-has3;
run;
A loop is most often called a DOW loop when there is a SET statement inside the DO; END; block and the loop termination is triggered by the last. flag variable. A similar non DOW approach (not shown) would use the implicit loop and first. to initialize the tracking array and last. for copying the values (tracked within group) into the columns for output.
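The non-DOW variant mentioned above is not shown in the answer; one way it could look is sketched below. It assumes the same have data set and variable names as the DOW example; the output name want_retain is made up for illustration.

```sas
/* Sketch of the implicit-loop alternative: RETAIN keeps the tracking   */
/* array across rows, FIRST. resets it, LAST. writes the group result.  */
data want_retain;
  set have;
  by id;
  array myvar var1-var3;
  array myhas has1-has3;
  retain has1-has3;
  if first.id then call missing(of myhas(*));   /* new group: clear tracker */
  do _i = 1 to dim(myvar);
    if not missing(myvar(_i)) then myhas(_i) = myvar(_i);
  end;
  if last.id then do;
    do _i = 1 to dim(myhas);
      myvar(_i) = myhas(_i);                     /* copy tracked values back */
    end;
    output;
  end;
  drop _i has1-has3;
run;
```

The DOW version gets the reset for free, because non-retained variables are set to missing at the top of each DATA step iteration; here the reset has to be written explicitly.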
Way 3 - Merging column slices
data want_by_column_slices;
merge
have (keep=id var1 where=(var1 ne .))
have (keep=id var2 where=(var2 ne .))
have (keep=id var3 where=(var3 ne .))
;
by id;
if last.id;
run;
proc sort data = group;
by studystyle;
run;
proc means data= group mean;
var test1 test2;
by studystyle;
output out = groupmeans mean = groupmeans;
run;
So I have this dataset of groups of students containing student ID, test1 scores, test2 scores, and their study styles.
I then created a new dataset of the means of these test scores, grouped by study style.
I am trying to create 2 new datasets, one per test; both should include the study style, the mean, and the test number.
I figure I can start by creating a new dataset with the SET statement to read the previous dataset. However, I don't really know how to grab the test means for each study style. Instead, I just used datalines to enter the mean values manually, but I would prefer to take those values from the previous dataset itself.
data newgroup1;
set groupmeans;
drop test1 test2 _type_ _freq_ _stat_;
input StudyStyle AVG Testnum;
datalines;
1 51.6875 1
2 49.27273 1
3 49.09091 1
;
run;
data newgroup2;
set groupmeans;
drop test2 test1 _type_ _freq_ _stat_;
input StudyStyle AVG Testnum;
datalines;
1 51.5 2
2 65.2727 2
3 90.5454 2
;
run;
data newgroup;
set newgroup1 newgroup2;
run;
If I understand your problem correctly, all you need to change is to create the means of test1 and test2 separately and then write two datasets. Try the code below.
proc sort data = group;
by studystyle;
run;
proc means data= group mean;
var test1 test2;
by studystyle;
output out = groupmeans mean(test1) = mtest1 mean(test2) = mtest2;
run;
data newgroup1 (keep=studystyle mtest1) newgroup2(keep=studystyle mtest2) ;
set groupmeans;
run;
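The question also asked for a test number and a single stacked data set at the end. A possible sketch, assuming groupmeans was created by the PROC MEANS step above with the columns mtest1 and mtest2, and reusing the names AVG and Testnum from the question:

```sas
/* Write two rows per study style: one for each test mean. */
data newgroup;
  set groupmeans;
  avg = mtest1; testnum = 1; output;
  avg = mtest2; testnum = 2; output;
  keep studystyle avg testnum;
run;
```

This avoids typing the means in by hand; the rows come out interleaved by study style, so a PROC SORT by testnum would be needed if the original stacking order matters.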
I have following dataset:
ID Status
1 cake
1 cake
1 flower
2 flower
2 flower
3 cake
3 flower
4 cake
4 cake
4 cake
Basically, I am only interested in the observations that, grouped by the ID, include at least one flower. Also I want an indication of whether the observation grouped by ID only has flower or if it was cake too. E.g. I would ideally like something like:
ID Status Indicator
1 cake 1
1 cake 1
1 flower 1
2 flower 2
2 flower 2
3 cake 1
3 flower 1
4 cake 0
4 cake 0
4 cake 0
I have tried to subset the dataset in multiple ways and merge together, conditional on the ID, but it does not seem to be working.
This SAS data step based on your input (which I called test here) will return that indicator value by ID group.
proc sort data=test;
by ID descending status;
run;
data result(drop=status);
set test;
by ID;
retain indicator;
if first.ID then indicator=0;
if status='flower' and indicator=0 then indicator=2;
if status='cake' and indicator=2 then indicator=1;
if last.ID then output;
run;
You could join that result with the source data to get the result as you provided it in your post.
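That join is not shown in the answer; a minimal sketch of it, assuming test is still sorted by ID from the PROC SORT above and result holds one row per ID with the indicator (the output name want is hypothetical):

```sas
/* One-to-many match-merge: the single indicator row per ID is carried */
/* onto every detail row of the same ID.                               */
data want;
  merge test result;
  by ID;
run;
```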
NOTE: I don't have enough reputation to comment on the answer provided by Gordon Linoff, but I just want to point out that the indicator there will not take three values (0='no flower', 1='cake+flower', 2='only flower') but will instead be a count of the number of 'flower' entries per ID, which I don't think is quite what the poster is asking for.
Rewriting the query as follows gives a three-valued indicator. Note that the codes for the last two cases are swapped relative to the question: 0='no flower', 1='only flower', 2='cake+flower'.
proc sql;
select t.*,
(count(distinct status))*(sum(case when status = 'flower' then 1 else 0 end)>0) as indicator
from test t
group by id;
quit;
proc sql comes to mind:
proc sql;
select t.*, tt.indicator
from t join
(select id, sum(case when status = 'flower' then 1 else 0 end) as indicator
from t
group by id
) tt
on tt.id = t.id;
quit;
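proc sql also has a "remerge" extension to SQL: when a query mixes aggregate and non-aggregate columns, the group summary is merged back onto every detail row. That allows you to do:
proc sql;
select t.*,
sum(case when status = 'flower' then 1 else 0 end) as indicator
from t
group by id;
quit;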
If your data is already sorted by ID then you could use a double DOW loop. The first loop will check for the presence of the values. Then you can use another loop to write back all of the detail rows for that group.
data want ;
do until (last.id);
set have;
by id;
if status='flower' then _flower=1;
else if status='cake' then _cake=1;
end;
if _flower and _cake then indicator=1;
else if _flower then indicator=2;
else indicator=0;
do until (last.id);
set have;
by id;
output;
end;
run;
This should be fast assuming the data is already sorted.
Suppose the dataset has three columns
Date Region Price
01-03 A 1
01-03 A 2
01-03 B 3
01-03 B 4
01-03 A 5
01-04 B 4
01-04 B 6
01-04 B 7
I tried to get the lead price by date and region with the following code.
data want;
set have;
by _ric date_l_;
do until (eof);
set have(firstobs=2 keep=price rename=(price=lagprice)) end=eof;
end;
if last.date_l_ then call missing(lagprice);
run;
However, WANT has only one observation. I then created new_date=date and tried another approach:
data want;
set have nobs=nobs;
do _i = _n_ to nobs until (new_date ne Date);
if eof1=0 then
set have (firstobs=2 keep=price rename=(price=leadprice)) end=eof1;
else leadprice=.;
end;
run;
With this code, SAS runs very slowly, so I think it is not appropriate either. Could anyone give some suggestions? Thanks
Try sorting by the variables you want lead price for then set together twice:
data test;
length Date Region $12 Price 8 ;
input Date $ Region $ Price ;
datalines;
01-03 A 1
01-03 A 2
01-03 B 3
01-03 B 4
01-03 A 5
01-04 B 4
01-04 B 6
01-04 B 7
;
run;
** sort by vars you want lead price for **;
proc sort data = test;
by DATE REGION;
run;
** set together twice -- once for lead price and once for all variables **;
data lead_price;
set test;
by DATE REGION;
set test (firstobs = 2 keep = PRICE rename = (PRICE = LEAD_PRICE))
test (obs = 1 drop = _ALL_);
if last.DATE or last.REGION then do;
LEAD_PRICE = .;
end;
run;
You can use proc expand to generate leads on numeric variables by group. Try the following method instead:
Step 1: Sort by Region, Date
proc sort data=have;
by Region Date;
run;
Step 2: Create a new ID variable to denote observation numbers
Because you have multiple values per date per region, we need to generate a new ID variable so that proc expand uses lead by observation number rather than by date.
data have2;
set have;
_ID_ = _N_;
run;
Step 3: Run proc expand by region with the lead transformation
lead will do exactly as it sounds. You can lead by as many values as you'd like, as long as the data supports it. In this case, we are leading by one observation.
proc expand data=have2
out=want;
by Region;
id _ID_;
convert Price = Lead_Price / transform=(lead 1) ;
run;
Hi, another quick question.
In proc sql we have ON, which is used for conditional joins. Is there something similar for the SAS data step?
For example:
proc sql;
....
data1 left join data2
on first<value<last
quit;
Can we replicate this in a SAS data step, like this?
data work.combined;
set data1(in=a) data2(in=b);
if a then output;
run;
You can also reproduce an SQL join in one DATA step using hash objects. It can be really fast, but it depends on the amount of RAM in your machine, since this method loads one table into memory: the more RAM, the larger the dataset you can load into the hash. This method is particularly effective for look-ups in a relatively small reference table.
data have1;
input first last;
datalines;
1 3
4 7
6 9
;
run;
data have2;
input value;
datalines;
2
5
6
7
;
run;
data want;
if _N_=1 then do;
if 0 then set have2;
declare hash h(dataset:'have2');
h.defineKey('value');
h.defineData('value');
h.defineDone();
declare hiter hi('h');
end;
set have1;
rc=hi.first();
do while(rc=0);
if first<value<last then output;
rc=hi.next();
end;
drop rc;
run;
The result:
value first last
2 1 3
5 4 7
6 4 7
7 6 9
Yes there is a simple (but subtle) way in just 7 lines of code.
What you intend to achieve is intrinsically a conditional Cartesian join, which can be done by a do-looped SET statement. The following code uses the test dataset from Dmitry and a modified version of the code in the appendix of SUGI Paper 249-30.
data data1;
input first last;
datalines;
1 3
4 7
6 9
;
run;
data data2;
input value;
datalines;
2
5
6
7
;
run;
/***** by data step looped SET *****/
DATA CART_data;
SET data1;
DO i=1 TO NN; /*NN can be referenced before set*/
SET data2 point=i nobs=NN; /*point=i - random access*/
if first<value<last then OUTPUT; /*conditional output*/
END;
RUN;
/***** by SQL *****/
proc sql;
create table cart_SQL as
select * from data1
left join data2
on first<value<last;
quit;
One can easily see that the results coincide.
Also note that from SAS 9.2 documentation: "At compilation time, SAS reads the descriptor portion of each data set and assigns the value of the NOBS= variable automatically. Thus, you CAN refer to the NOBS= variable BEFORE the SET statement. The variable is available in the DATA step but is not added to any output data set."
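One caveat worth sketching: the looped SET above behaves like an inner join, so it agrees with the LEFT JOIN only because every data1 row in this sample finds a match. A possible way to keep unmatched data1 rows as well, assuming the same data1 and data2 (the names cart_left and _matched are made up for illustration):

```sas
/* Looped SET with left-join behaviour: a data1 row that matches */
/* nothing in data2 is written once with VALUE set to missing.   */
data cart_left;
  set data1;
  _matched = 0;
  do i = 1 to nn;
    set data2 point=i nobs=nn;
    if first < value < last then do;
      _matched = 1;
      output;
    end;
  end;
  if not _matched then do;
    call missing(value);   /* discard the last VALUE read in the loop */
    output;
  end;
  drop _matched;
run;
```

The step still ends normally, because the outer SET data1 reaches end of file; the POINT= variable i is not written to the output data set.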
There isn't a direct way to do this with a MERGE. This is one example where the SQL method is clearly superior to any SAS data step methods, as anything you do will take much more code and possibly more time.
However, depending on the data, it's possible a few approaches may make sense. In particular, the format merge.
If data1 is fairly small (even, say, millions of records), you can make a format out of it. Like so:
data fmt_set;
set data1;
format label $8.;
start=first; *set up the names correctly;
end=last;
label='MATCH';
fmtname='DATA1F';
output;
if _n_=1 then do; *put out a hlo='o' line which is for unmatched lines;
start=.; *both unnecessary but nice for clarity;
end=.;
label='NOMATCH';
hlo='o';
output;
end;
run;
proc format cntlin=fmt_set; *import the dataset;
quit;
data want;
set data2;
if put(value,DATA1F.)="MATCH";
run;
This is very fast to run, unless data1 is extremely large (hundreds of millions of rows, on my system) - faster than a data step merge, if you include sort time, since this doesn't require a sort. One major limitation is that this will only give you one row per data2 row; if that is what is desired, then this will work. If you want repeats of data2 then you can't do it this way.
If data1 may have overlapping rows (i.e., two rows where start/end overlap each other), you will also need to address this, since start/end ranges aren't normally allowed to overlap. You can set hlo="m" for every row, and "om" for the non-match row, or you can resolve the overlaps.
I'd still do the sql join, however, since it's much shorter to code and much easier to read, unless you have performance issues, or it doesn't work the way you want it to.
Here's another solution, using a temporary array to hold the lookup dataset. Performance is probably similar to Dmitry's hash-based solution, but this should also work for people still using versions of SAS prior to 9.1 (the release in which hash objects were first introduced).
I've reused Dmitry's sample datasets:
data have1;
input first last;
datalines;
1 3
4 7
6 9
;
run;
data have2;
input value;
datalines;
2
5
6
7
;
run;
/*We need a macro var with the number of obs in the lookup dataset*/
/*This is so we can specify the dimension for the array to hold it*/
data _null_;
if 0 then set have2 nobs = nobs;
call symput('have2_nobs',put(nobs,8.));
stop;
run;
data want_temparray;
array v{&have2_nobs} _temporary_;
do _n_ = 1 to &have2_nobs;
set have2 (rename=(value=value_array));
v{_n_}=value_array;
end;
do _n_ = 1 by 1 until (eof_have1);
set have1 end = eof_have1;
value=.;
do i=1 to &have2_nobs;
if first < v{i} < last then do;
value=v{i};
output;
end;
end;
if missing(value) then output;
end;
drop i value_array;
run;
Output:
value first last
2 1 3
5 4 7
6 4 7
7 6 9
This matches the output from the equivalent SQL:
proc sql;
create table want_sql as
select * from
have1 left join have2
on first<value<last
;
quit;