I'm having trouble using the SAS RETAIN statement within a group.
Suppose I have a data set:
data have_data;
input dev nr amount flag $ ;
cards;
1 1356 30000 S
2 1356 35000 S
3 1356 40000 L
4 1356 35000 S
1 2345 15000 S
2 2345 20000 S
3 2345 20000 S
4 2345 25000 S
5 2345 25000 S
1 3456 39000 S
2 3456 40000 L
3 3456 45000 L
4 3456 35000 S
;
run;
I want to create a column flag2 that becomes 'L' once amount >= 40000 within the dev and nr group, and then keeps 'L' for the remaining rows of that group.
The output should be like:
data want_data;
input dev nr amount flag $ flag2 $ ;
cards;
1 1356 30000 S S
2 1356 35000 S S
3 1356 40000 L L
4 1356 35000 S L
1 2345 15000 S S
2 2345 20000 S S
3 2345 20000 S S
4 2345 25000 S S
5 2345 25000 S S
1 3456 39000 S S
2 3456 40000 L L
3 3456 45000 L L
4 3456 35000 S L
;
run;
I sorted the data first and tried the following, based on a similar post I found, but it is not working.
data new_data;
set have_data;
by dev nr;
retain test;
if flag = 'L' then help=1;
if first.nr then test = help;
flag2 = test;
run;
Please help?
Many thanks!!
dev appears to be simply a row counter within each nr group, and the problem seems to concern the nr group alone.
Assuming that flag already contains L from a prior step, and that the by group is just nr, you can carry the L state forward in flag2 by never reassigning flag2 once it is L.
Example:
data want;
set have;
by nr;
length flag2 $ 1; * declare flag2 as character before it is first referenced;
retain flag2;
* flag2 is reset at start of group, or assigned if L state not reached yet;
if first.nr or flag2 ne 'L' then flag2=flag;
run;
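Applied to the sample data from the question, the same idea might look like the sketch below. It assumes the data should be ordered by nr and then dev; the data set name have_sorted is only illustrative, and the explicit length statement is a defensive addition.

```sas
/* Sketch: make each nr group contiguous and ordered by dev */
proc sort data=have_data out=have_sorted;  /* have_sorted is an assumed name */
  by nr dev;
run;

data want_data;
  set have_sorted;
  by nr;
  length flag2 $ 1;   /* declare flag2 as character before it is referenced */
  retain flag2;       /* keep the value of flag2 from one row to the next */
  /* reset at the start of each nr group; otherwise copy flag until 'L' is reached */
  if first.nr or flag2 ne 'L' then flag2 = flag;
run;
```

For nr 1356 this gives flag2 = S, S, L, L, which matches the requested output.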
I've got grouped data with a flag that is set any time a name changes within a group. I can pull the last two or first two observations within the group, but I am struggling to figure out how to pull the last observation with a name change AND the row right after it.
The code below gives me the first or last two observations per group, depending on how I sort the data.
DATA LastTwo;
SET WhatIveGot;
count + 1;
BY group_ID /*data pre sorted*/;
IF FIRST.group_ID THEN count=1;
IF count<=2 THEN OUTPUT;
RUN;
What I need is the LAST observation with a name change AND the following row.
group_ID NAME DATE NAME_CHange
1 TOM 1/1/19 0
1 Jill 1/30/19 1
1 Jill 1/20/19 0
1 Bob 2/10/19 1
1 Bob 2/30/19 0
2 TOM 2/1/19 0
2 Jill 2/30/19 1
2 Jill 2/20/19 0
2 Jim 3/10/19 1
2 Jim 3/30/19 0
2 Jim 4/15/19 0
3 Joe 2/20/19 0
3 Kim 3/10/19 1
3 Kim 3/30/19 0
3 Ken 4/15/19 1
4 Tim 3/10/19 0
4 Tim 3/30/19 0
The desired output:
group_ID NAME DATE NAME_CHange
1 Bob 2/10/19 1
1 Bob 2/30/19 0
2 Jim 3/10/19 1
2 Jim 3/30/19 0
3 Ken 4/15/19 1
The cases for Group_ID 2 and 3 are the roadblock. The data is already sorted by date.
Thank you for any help in advance
Use DOW processing to determine where the last name change was. Apply that information in a succeeding loop.
Example:
data want;
do _n_ = 1 by 1 until (last.id);
set have;
by id name notsorted;
if first.name then _index_of_last_name_change = _n_;
end;
do _n_ = 1 to _n_;
set have;
if _index_of_last_name_change <= _n_ <= _index_of_last_name_change+1 then OUTPUT;
end;
drop _:;
run;
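The example above uses the generic names have, id and name. Note that first.name is also true on the first row of a group, so a group with no name change at all (such as group 4 in the question) would still produce output. A variant keyed on the existing NAME_CHange flag could be sketched as follows; it assumes WhatIveGot is sorted by group_ID in date order and that NAME_CHange is numeric, and _last_change is an illustrative name.

```sas
data want;
  /* first pass over the group: remember the position of the last flagged row */
  _last_change = 0;
  do _n_ = 1 by 1 until (last.group_ID);
    set WhatIveGot;
    by group_ID;
    if NAME_CHange = 1 then _last_change = _n_;
  end;
  /* second pass over the same group: keep that row and the one after it */
  do _n_ = 1 to _n_;
    set WhatIveGot;
    if _last_change > 0 and _last_change <= _n_ <= _last_change + 1 then output;
  end;
  drop _: ;  /* drop the helper variable */
run;
```

With the sample data this keeps rows 4 and 5 of groups 1 and 2, row 4 of group 3, and nothing from group 4.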
I am using PROC SQL in SAS 9.4, so there are no window functions to use.
The data has client id, time period (month), amount and the corresponding class.
client_id data_period amount class
1 200801 30000 2
2 200801 17000 1
3 200801 9000 1
1 200802 30000 2
2 200802 55555 2
3 200802 11000 2
Threshold amount = 20 000.
amount > 20k gives class = 2, amount <= 20k makes class = 1
client_id = 1, amount and class are the same for 200801 and 200802.
client_id = 2, amount gets higher from 17k to 55.5k, class change is correct, from 1 to 2.
client_id =3, amount changed within the same class 1 (<20K), but class changed incorrectly.
Desired result is
client_id oldDate newDate AmtOld AmtNew ClassOld ClassNew Good Bad
2 200801 200802 17000 55555 1 2 1 0
3 200801 200802 9000 11000 1 1 0 1
I tried to apply a self join to get all the differences between data periods, but there are too many rows in the output. The data below is not from the example above; it uses real numbers.
client_id oldDate newDate AmtOld AmtNew ClassOld ClassNew
A001687463 200808 200802 -5613 1690386 I03 I04
A001687463 200807 200802 -5613 1690386 I03 I04
A001687463 200806 200802 -5613 1690386 I03 I04
A001687463 200805 200802 -5613 1690386 I03 I04
PROC SQL;
CREATE TABLE WORK.'Q'n AS
SELECT distinct
t1.client_id, t1.data_period as oldDate, t2.data_period as newDate, t1.amount as expAmtOld, t2.amount as expAmtNew, t1.class as classOld, t2.class as classNew
FROM WORK.'E'n t1, WORK.'E'n t2
where
t1.client_id = t2.client_id and
t1.amount <> t2.amount
order by t1.client_id;
Do not attempt to do sequential processing using SQL. It is not built for that.
It should be easy to do in a data step. For example let's convert your printout into an actual SAS dataset so we have something to code with.
data have ;
input client_id data_period amount class ;
cards;
1 200801 30000 2
2 200801 17000 1
3 200801 9000 1
1 200802 30000 2
2 200802 55555 2
3 200802 11000 2
;
And let's sort it by client and period.
proc sort data=have ;
by client_id data_period ;
run;
Now just set the data and use the LAG() function to get the previous values.
I'm not sure what your definitions of GOOD and BAD are, so I just created new class variables based on your 20K rule.
data want ;
set have ;
by client_id;
old_period = lag(data_period);
old_class = lag(class);
newclass = 1 + (amount > 20000) ;
old_newclass = lag(newclass);
if first.client_id then call missing(of old_:);
bad = (class ne newclass) or (old_newclass ne old_class) ;
run;
So here are the results.
client_id data_period amount class old_period old_class newclass old_newclass bad
1 200801 30000 2 . . 2 . 0
1 200802 30000 2 200801 2 2 2 0
2 200801 17000 1 . . 1 . 0
2 200802 55555 2 200801 1 2 1 0
3 200801 9000 1 . . 1 . 0
3 200802 11000 2 200801 1 1 1 1
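The step above calls LAG() on every row and only afterwards clears the results on the first row of each client. That order matters because each LAG call keeps its own queue, which is updated only when the call actually executes. A small sketch of the difference, using the sorted have data set from above (lag_demo, prev_amount and prev_amount_cond are illustrative names):

```sas
data lag_demo;
  set have;   /* assumed sorted by client_id data_period */
  by client_id;
  /* unconditional call: the queue is updated on every row */
  prev_amount = lag(amount);
  /* discard the value carried over from the previous client */
  if first.client_id then prev_amount = .;
  /* conditional call: the queue is updated only on rows where this runs, */
  /* so it returns the amount from the last such row, not the previous row */
  if not first.client_id then prev_amount_cond = lag(amount);
run;
```

On the 200802 rows, prev_amount is 30000, 17000 and 9000 for clients 1, 2 and 3, while prev_amount_cond is missing, 30000 and 55555, because the conditional call only ever sees the 200802 rows.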
I have a data structure that looks like this:
DATA have ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;
So, we have family id, individual id, implicate number and imputed income for each implicate.
What I need is to replicate the results of the first individual in each family (all five implicates) for the remaining individuals within that family, replacing whatever values we previously had in those cells, like this:
DATA want ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;
In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.
So far, I came up with this solution:
%let implist_1=imp_inc;
%macro copyv1(list);
%let nwords=%sysfunc(countw(&list));
%do i=1 %to &nwords;
%let varl=%scan(&list, &i);
proc means data=have max noprint;
var &varl;
by famid implicate;
where indid=1;
OUTPUT OUT=copy max=max_&varl;
run;
data want;
set have;
drop &varl;
run;
data want (drop=_TYPE_ _FREQ_);
merge want copy;
by famid implicate;
rename max_&varl=&varl;
run;
%end;
%mend;
%copyv1(&implist_1);
This works well for one or two variables. However, it is tremendously slow once you do it for 400 variables in a data set of 1.5 GB.
I'm pretty sure there is a faster way to do this with some form of proc sql or first.var etc., but I'm relatively new to SAS and so far I couldn't come up with a better solution.
Thank you very much for your support.
Best regards
Yes, this can be done in DATA step using a first. reference made available via the by statement.
data want;
set have (keep=famid indid implicate imp_inc /* other vars */);
by famid indid implicate; /* by implicate is so step logs an error (at run-time) if data not sorted */
if first.famid then if indid ne 1 then abort;
array across imp_inc /* other vars */;
array hold [1,5] _temporary_; /* or [<n>,5] where <n> means the number of variables in the across array */
if indid = 1 then do; /* hold data for 1st individuals implicate across data */
do _n_ = 1 to dim(across);
hold[_n_,implicate] = across[_n_]; /* store info of each implicate of first individual */
end;
end;
else do;
do _n_ = 1 to dim(across);
across[_n_] = hold[_n_,implicate]; /* apply 1st persons info to subsequent persons */
end;
end;
run;
The DATA step could be significantly faster because it makes a single pass through the data. However, there is an internal processing cost associated with calculating all those pesky [] array addresses at run time, and that cost could become significant at some <n>.
SQL has simpler syntax, is easier to understand, and works even if the have data set is unsorted or has some peculiar sequencing within the by group.
This is fairly straightforward with a bit of SQL:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have a
left join (
select * from have
group by famid
having indid = min(indid)
) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.
It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have(sortedby = famid) a
left join have(where = (indid = 1)) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.
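One detail of both queries: b.* also selects famid, indid and implicate from b, which duplicate the three columns already taken from a. As far as I know, CREATE TABLE keeps the first occurrence of each name and writes a warning to the log for the others. A sketch that avoids this by naming the copied columns explicitly, assuming a row with indid = 1 exists in every family:

```sas
proc sql;
  create table want as
  select a.famid, a.indid, a.implicate,
         b.imp_inc            /* list the other copied variables here */
  from have a
  left join have(where = (indid = 1)) b
    on a.famid = b.famid
   and a.implicate = b.implicate
  order by a.famid, a.indid, a.implicate
  ;
quit;
```

With dozens of variables, the column list could be generated once (for example from a macro variable) rather than typed by hand.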
data test;
input Index Indicator value FinalValue;
datalines;
1 0 5 21
1 1 21 21
2 1 0 0
3 0 4 7
3 1 7 7
3 0 8 7
3 0 2 7
4 1 1 1
4 0 4 1
;
run;
I have a data set with the first 3 columns. How do I derive the 4th column from the indicator? For example, for index 1 the value is 21 on the row where indicator = 1, so FinalValue should be 21 on all rows for index 1.
Use the SAS Retain Keyword.
You can do this in a data step by retaining the value from the row where indicator = 1.
Steps:
Sort your data by Index and Indicator
Group by the Index & Retain the Value where Indicator=1
Code:
/*Sort data by Index and descending Indicator & remove the hardcoded FinalValue*/
proc sort data=test (keep= Index Indicator value);
by index descending indicator ;
run;
/*Retain the FinalValue*/
data want;
set test;
by index;
retain FinalValue;
keep Index Indicator value FinalValue;
if indicator = 1 then FinalValue = value;
/*Reset FinalValue to missing for groups that have no record with indicator = 1*/
if indicator ne 1 and first.index then FinalValue = .;
run;
Output:
Index=1 Indicator=1 value=21 FinalValue=21
Index=1 Indicator=0 value=5 FinalValue=21
Index=2 Indicator=1 value=0 FinalValue=0
Index=3 Indicator=1 value=7 FinalValue=7
Index=3 Indicator=0 value=4 FinalValue=7
Index=3 Indicator=0 value=8 FinalValue=7
Index=3 Indicator=0 value=2 FinalValue=7
Index=4 Indicator=1 value=1 FinalValue=1
Index=4 Indicator=0 value=4 FinalValue=1
Use proc sql with a left join. Select the value from the rows where indicator=1, then left join the result to the original data set on index. It seems that the FinalValue in your first row for index=3 should be 7, not 0.
proc sql;
select a.*,b.finalvalue from test a
left join (select index, value as finalvalue from test where indicator=1) b
on a.index=b.index;
quit;
This is rather old school but should be adequate. I reckon you call it a self merge or something.
data test;
input Index Indicator value;* FinalValue;
datalines;
1 0 5 21
1 1 21 21
2 1 0 0
3 0 4 7
3 1 7 7
3 0 8 7
3 0 2 7
4 1 1 1
4 0 4 1
;;;;
run;
data final;
if 0 then set test;
merge test(where=(indicator eq 1) rename=(value=FinalValue)) test;
by index;
run;
proc print;
run;
Obs Index Indicator value FinalValue
1 1 0 5 21
2 1 1 21 21
3 2 1 0 0
4 3 0 4 7
5 3 1 7 7
6 3 0 8 7
7 3 0 2 7
8 4 1 1 1
9 4 0 4 1
I have a panel data set that looks like this:
ID Usage month
1234 2 -2
1234 4 -1
1234 3 1
1234 2 2
2345 5 -2
2345 6 -1
2345 3 1
2345 6 2
Obviously there are more ID variables and usage data, but this is the general form. I want to average the usage data when the month column is negative, and when it is positive for each ID. In other words for each unique ID, average the usage for negative months and for positive months. My goal is to get something like this.
ID avg_usage_neg avg_usage_pos
1234 3 2.5
2345 5.5 4.5
Here are a few options for you.
First create the test data:
data sample;
input ID
Usage
month;
datalines;
1234 2 -2
1234 4 -1
1234 3 1
1234 2 2
2345 5 -2
2345 6 -1
2345 3 1
2345 6 2
;
run;
Here's an SQL solution:
proc sql noprint;
create table result as
select id,
avg(ifn(month < 0, usage, .)) as avg_usage_neg,
avg(ifn(month > 0, usage, .)) as avg_usage_pos
from sample
group by 1
;
quit;
Here's a datastep / proc means solution:
data sample2;
set sample;
usage_neg = ifn(month < 0, usage, .);
usage_pos = ifn(month > 0, usage, .);
run;
proc means data=sample2 noprint missing nway;
class id;
var usage_neg usage_pos;
output out=result2 mean=;
run;