I have the following two sas datasets:
data have ;
input a b;
cards;
1 15
2 10
3 40
4 200
1 25
2 15
3 10
4 75
1 1
2 99
3 30
4 100
;
data ref ;
input x y;
cards;
1 10
2 20
3 30
4 100
;
I would like to have the following dataset:
data want ;
input a b outcome ;
cards;
1 15 0
2 10 1
3 40 0
4 200 0
1 25 0
2 15 1
3 10 1
4 75 1
1 1 1
2 99 0
3 30 1
4 100 1
;
I would like to create a variable 'outcome' which is produced by an if statement upon conditions of variables a, b, x and y. As in reality the 'have' dataset is extremely large I would like to avoid a sort and merging the two datasets together (where a = x).
I am trying to use macro variables with the following code:
data _null_ ;
set ref ;
call symput('listx', x) ;
call symput('listy', y) ;
run ;
data want ;
set have ;
if a=&listx and b le &listy then outcome = 1 ; else outcome = 0 ;
run ;
which, however, does not produce the desired result shown above.
I have redone my solution using hash tables. Below is my approach:
data ref2(rename=(x=a));
set ref ;
run;
data want;
declare Hash Plan ();
rc = plan.DefineKey ('a'); /*x originally*/
rc = plan.DefineData('a', 'y');
rc = plan.DefineDone();
do until (eof1);
set ref2 end=eof1;
rc = plan.add(); /*add each record from ref2 to plan (hash table)*/
end;
do until (eof2);
set have end=eof2;
call missing(y);
rc = plan.find();
outcome = (rc = 0 and b <= y); /* b le y, matching the condition in the question */
output;
end;
stop;
run;
hope it helps
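For what it is worth, the macro attempt in the question fails because CALL SYMPUT overwrites listx and listy on every row of ref, so only the last pair (4 and 100) survives. The hash lookup above can also be written a little more compactly by letting the hash object load ref itself. This is only a sketch under the same assumptions as the answer (x is unique in ref and the lookup table fits in memory); the name want_alt is made up for illustration.

```sas
data want_alt;                                   /* want_alt is a hypothetical name */
    if _n_ = 1 then do;
        if 0 then set ref(rename=(x=a));         /* defines y in the PDV, reads nothing */
        declare hash plan(dataset: 'ref(rename=(x=a))');
        plan.defineKey('a');
        plan.defineData('y');
        plan.defineDone();
    end;
    set have;                                    /* have is read once, unsorted */
    call missing(y);                             /* clear y left over from the previous row */
    outcome = (plan.find() = 0 and b <= y);      /* 1 when a is found and b le y */
    drop y;
run;
```

Loading the table once on the first iteration avoids both the intermediate ref2 dataset and the explicit read loops.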
I have this data
data have;
input cust_id $ pmt months;
datalines;
AA 100 0
AA 50 1
AA 200 2
AA 350 3
AA 150 4
AA 700 5
BB 500 0
BB 300 1
BB 1000 2
BB 800 3
run;
and I'd like to generate an output that looks like this
data want;
input cust_id $ pmt months i;
datalines;
AA 100 0 0
AA 50 0 1
AA 200 0 2
AA 350 0 3
AA 150 0 4
AA 700 0 5
AA 50 1 0
AA 200 1 1
AA 350 1 2
AA 150 1 3
AA 700 1 4
AA 200 2 0
AA 350 2 1
AA 150 2 2
AA 700 2 3
AA 350 3 0
AA 150 3 1
AA 700 3 2
AA 150 4 0
AA 700 4 1
AA 700 5 0
BB 500 0 0
BB 300 0 1
BB 1000 0 2
BB 800 0 3
BB 300 1 0
BB 1000 1 1
BB 800 1 2
BB 1000 2 0
BB 800 2 1
BB 800 3 0
run;
There are a few thousand rows with different cust_id values and different numbers of months. I tried joining tables, but that did not give me the sequence 100 50 200 350 150 700 (for cust_id AA). I could only replicate 100 if months is 0, 50 if months is 1, and so on. I created maxval, which is the maximum months value. My code is something like this:
data temp1;
set have;
do i = 0 to maxval;
if (months <=maxval) then output;
end;
I thought of creating a unique key to join my have data and temp1 data, but it could only give me
AA 100 0 0
AA 50 0 1
AA 200 0 2
AA 350 0 3
AA 150 0 4
AA 700 0 5
AA 100 1 0
AA 50 1 1
AA 200 1 2
AA 350 1 3
AA 150 1 4
AA 100 2 0
AA 50 2 1
AA 200 2 2
AA 350 2 3
AA 100 3 0
AA 50 3 1
AA 200 3 2
AA 100 4 0
AA 50 4 1
AA 100 5 0
Any thoughts or different approach on how to generate my want table? Thank you!
This problem is a little tricky because you have things going in three directions:
The number of group repetitions descends from the group count. Within each repetition:
The payments item start index ascends and terminates at the group count.
The months (as i) item start index is 1 and its termination descends from the group count.
SQL
One SQL approach is a three-way reflexive join within each group. The months values act as a within-group index and must increase by exactly 1 from 0 for this to work.
proc sql;
create table want as
select X.cust_id, Z.pmt, X.months, Y.months as i
from have as X
join have as Y on X.cust_id = Y.cust_id
join have as Z on Y.cust_id = Z.cust_id
where
X.months + Y.months = Z.months
order by
X.cust_id, X.months, Z.months
;
quit;
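Under the same assumption (months runs 0, 1, 2, ... within each cust_id), the third copy of the table can probably be dropped, because i is just the difference between the two months values. The following is a sketch of that two-way variant, not the answer's original query; the table name want_alt is made up for illustration.

```sas
proc sql;
    create table want_alt as                      /* want_alt is a hypothetical name */
    select X.cust_id,
           Z.pmt,
           X.months,
           Z.months - X.months as i               /* offset of the payment within the repetition */
    from have as X
    join have as Z
      on X.cust_id = Z.cust_id
     and Z.months >= X.months                     /* only payments at or after the starting month */
    order by X.cust_id, X.months, Z.months;
quit;
```

For cust_id AA and months = 1 this pairs the starting row with the payments at months 1 through 5, giving 50 200 350 150 700 with i running from 0 to 4.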
DATA Step
A DOW loop is used to count the group size. 2-deep looping crosses the combinations and three point= values are computed (finagled) to retrieve the relevant values.
data want2;
if 0 then set have; * prep pdv to match have;
retain point_end ;
point_start = sum(point_end,0);
do group_count = 1 by 1 until (last.cust_id);
set have(keep=cust_id);
by cust_id;
end;
do index1 = 1 to group_count;
point1 = point_start + index1;
set have (keep=months) point = point1;
do index2 = 0 to group_count - index1 ;
point2 = point_start + index1 + index2;
set have (keep=pmt) point=point2;
point3 = point_start + index2 + 1;
set have (keep=months rename=months=i) point=point3;
output;
end;
end;
point_end = point1;
keep cust_id pmt months i;
run;
Try the following:
data want(drop = start_obs limit j);
retain start_obs 1;
/* read by cust_id group */
do until(last.cust_id);
set have end = last_obs;
by cust_id;
end;
limit = months;
do j = 0 to limit;
i = 0;
do obs_num = start_obs + j to start_obs + limit;
/* read specific observations using direct access */
set have point = obs_num;
months = j;
output;
i = i + 1;
end;
end;
/* prepare for next direct access read */
start_obs = start_obs + limit + 1; /* first observation of the next group */
if last_obs then
stop;
run;
I have a big panel dataset that looks somewhat like this:
data have;
input id t a b ;
datalines;
1 1 0 0
1 2 0 0
1 3 1 0
1 4 0 0
1 5 0 1
1 6 1 0
1 7 0 0
1 8 0 0
1 9 0 0
1 10 0 1
2 1 0 0
2 2 1 0
2 3 0 0
2 4 0 0
2 5 0 1
2 6 0 1
2 7 0 1
2 8 0 1
2 9 1 0
2 10 0 1
3 1 0 0
3 2 0 0
3 3 0 0
3 4 0 0
3 5 0 0
3 6 0 0
3 7 1 0
3 8 0 0
3 9 0 0
3 10 0 0
;
run;
For every ID I want to record all 'trigger' events, namely when a=1, and then I need to know how long it takes until the next occurrence of b=1. The final output should then give me the following:
data want;
input id a_no a_t b_t diff ;
datalines;
1 1 3 5 2
1 2 6 10 4
2 1 2 5 3
2 2 9 10 1
3 1 7 . .
;
run;
It is of course no problem to get all a=1 and b=1 events, but as it is a very big dataset with a lot of both events for every ID I am searching for an elegant and straight-forward solution. Any ideas?
Here's a fairly simple SQL approach that gives more or less the desired output:
proc sql;
create table want
as select
t1.id,
t1.t as a_t,
t2.t as b_t,
t2.t - t1.t as diff
from
have(where = (a=1)) t1
left join
have(where = (b=1)) t2
on
t1.id = t2.id
and t2.t > t1.t
group by t1.id, t1.t
having diff = min(diff)
;
quit;
The only part missing is a_no. This sort of row-incrementing ID is quite a lot of work to generate consistently in SQL, but it's trivial with an extra data step:
data want;
set want;
by id;
if first.id then a_no = 0;
a_no + 1;
run;
An elegant DATA step way can use nested DOW loops. It's straightforward once you understand DOW loops.
data want(keep=id--diff);
length id a_no a_t b_t diff 8;
do until (last.id); * process each group;
do a_no = 1 by 1 until(last.id); * counter for each output;
do until ( output_condition or last.id); * process each triggering state change without reading past the current group;
SET have; * read data;
by id; * setup first. last. variables for group;
if a=1 then a_t = t; * detect and record start of trigger state;
output_condition = (b=1 and t > a_t > 0); * evaluate for proper end of trigger state;
end;
if output_condition then do;
b_t = t; * compute remaining info at output point;
diff = b_t - a_t;
OUTPUT;
a_t = .; * reset trigger state tracking variables;
b_t = .;
diff = .;
end;
else
if a_t > 0 then OUTPUT; * end of group reached with a trigger still waiting for b=1;
end;
end;
run;
Note: A SQL way (not shown) can use self join within groups.
data test;
input ID month d_month;
datalines;
1 59 0
1 70 11
1 80 21
2 10 0
2 11 1
2 13 3
3 5 0
3 9 4
4 8 0
;
run;
I have two columns of data, ID and month. Column 1 is the ID; the same ID may have multiple rows (1-5). The second column is the enrolled month. I want to create the third column, which holds the difference between the current month and the initial month for each ID.
You can do it like this:
data test;
input ID month d_month;
datalines;
1 59 0
1 70 11
1 80 21
2 10 0
2 11 1
2 13 3
3 5 0
3 9 4
4 8 0
;
run;
data calc;
set test;
by id;
retain current_month;
if first.id then do;
current_month=month;
calc_month=0;
end;
if ^first.id then do;
calc_month = month - current_month ;
end;
run;
Krs
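If the initial month of each ID is also its smallest month (true in the sample data, but an assumption in general), the same column can be sketched in PROC SQL, which remerges the group minimum back onto every row. The table name calc_sql is made up for illustration.

```sas
proc sql;
    create table calc_sql as                      /* calc_sql is a hypothetical name */
    select ID,
           month,
           month - min(month) as calc_month       /* min(month) is taken per ID and remerged */
    from test
    group by ID
    order by ID, month;
quit;
```

Unlike the DATA step with BY, this does not need the input to be sorted by ID, although SAS will write a note to the log about the remerge.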
Hi, I am trying to subset a dataset that looks like the following:
ID sal count
1 10 1
1 10 2
1 10 3
1 10 4
2 20 1
2 20 2
2 20 3
3 30 1
3 30 2
3 30 3
3 30 4
I want to keep only those IDs that are recorded 4 times.
I wrote something like:
data AN; set BU
if last.count gt 4 and last.count lt 4 then delete;
run;
But there is something wrong.
EDIT - Thanks for clarifying. Based on your needs, PROC SQL will be more direct:
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING MAX(COUNT) = 4
;quit;
For posterity, here is how you could do it with only a data step:
In order to use first. and last., you need to use a by clause, which requires sorting:
proc sort data=BU;
by ID DESCENDING count;
run;
When using a SET statement BY ID, first.ID will be equal to 1 (TRUE) on the first instance of a given ID, 0 (FALSE) for all other records.
data AN;
set BU;
by ID;
retain keepMe;
If first.ID THEN DO;
IF count = 4 THEN keepMe=1;
ELSE keepMe=0;
END;
if keepMe=0 THEN DELETE;
run;
During the datastep BY ID, your data will look like:
ID sal count keepMe first.ID
1 10 4 1 1
1 10 3 1 0
1 10 2 1 0
1 10 1 1 0
2 20 3 0 1
2 20 2 0 0
2 20 1 0 0
3 30 4 1 1
3 30 3 1 0
3 30 2 1 0
3 30 1 1 0
If I understand correctly, you are trying to extract all observations that are repeated 4 times or more. If so, your use of last.count and first.count is wrong. last.var is a boolean that indicates which observation is the last in its BY group. Have a look at Tim's suggestion.
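To see what the first. and last. flags actually contain, a small sketch like the following writes them to the log. It assumes BU is already sorted by ID, as it is in the question's sample.

```sas
data _null_;
    set BU;
    by ID;                                        /* creates first.ID and last.ID */
    put ID= count= first.ID= last.ID=;            /* each flag is 1 or 0, never a count */
run;
```

Every row shows 0 or 1 for both flags, which is why comparing last.count to 4 can never select anything.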
In order to extract all observations that are repeated four times or more, I would suggest to use the following PROC SQL:
PROC SQL;
CREATE TABLE WORK.WANT AS
SELECT /* COUNT_of_ID */
(COUNT(t1.ID)) AS COUNT_of_ID,
t1.ID,
t1.SAL,
t1.count
FROM WORK.HAVE t1
GROUP BY t1.ID
HAVING (CALCULATED COUNT_of_ID) ge 4
ORDER BY t1.ID,
t1.SAL,
t1.count;
QUIT;
Result:
1 10 1
1 10 2
1 10 3
1 10 4
3 30 1
3 30 2
3 30 3
3 30 4
A slight variation on Tim's answer, assuming you don't necessarily have the count variable:
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING Count(ID) >= 4;
quit;
How can I get proc tabulate to show the value of a variable that has missing values, instead of a statistic? Thanks!
For example, I want to show the value of sym, which is either 'x' or missing. How can I do it?
Sample code:
data test;
input tx mod bm $ yr sym $;
datalines;
1 1 a 0 x
1 2 a 0 x
1 3 a 0 x
2 1 a 0 x
2 2 a 0 x
2 3 a 0 x
3 1 a 0
3 2 a 0
3 3 a 0 x
1 1 b 0 x
1 2 b 0
1 3 b 0
1 4 b 0
1 5 b 0
2 1 b 0
2 2 b 0
2 3 b 0
2 4 b 0
2 5 b 0
3 1 b 0 x
3 2 b 0
3 3 b 0
1 1 c 0
1 2 c 0 x
1 3 c 0
2 1 c 0
2 2 c 0
2 3 c 0
3 1 c 0
3 2 c 0
3 3 c 0
1 3 a 1 x
2 3 a 1
3 3 a 1
1 3 b 1
2 3 b 1
3 3 b 1
1 3 c 1 x
2 3 c 1
3 3 c 1
;
run;
proc tabulate data=test;
class yr bm tx mod ;
var sym;
table yr*bm, tx*mod;
run;
proc tabulate data=test;
class tx mod bm yr sym;
table yr*bm, tx*mod*sym*n;
run;
That gives you ones for each SYM=x (since n=missing). That hides the rows for SYM=missing, hence you miss some values overall from your example table. (You could format the column with a format that defines 1 = 'x' easily).
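The format mentioned in the parentheses could look roughly like this. The format name onex is made up for illustration, and the sketch relies on each combination occurring only once so that N is always 1.

```sas
proc format;
    value onex 1 = 'x';                           /* onex is a hypothetical format name */
run;

proc tabulate data=test;
    class tx mod bm yr sym;
    table yr*bm, tx*mod*sym=' '*n=' '*f=onex.;    /* show 'x' wherever the count is 1 */
run;
```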
proc tabulate data=test;
class tx mod bm yr;
class sym /missing;
table yr*bm, tx*mod*sym=' '*n;
run;
That gives you all of your combinations of the 4 main variables, but includes missing syms as their own column.
If you want to have your cake and eat it too, then you need to redefine SYM to be a numeric variable, so you can use it as a VAR.
proc format;
invalue ISYM
x=1
;
value FSYM
1='x';
quit;
data test;
infile datalines truncover;
input tx mod bm $ yr sym :ISYM.;
format sym FSYM.;
datalines;
1 1 a 0 x
1 2 a 0 x
1 3 a 0 x
... more lines ...
;
run;
proc tabulate data=test;
class tx mod bm yr;
var sym;
table yr*bm, tx*mod*sym*sum*f=FSYM.;
run;
All of these assume these are unique combination rows. If you start having multiples of yr*bm*tx*mod, you would have a problem here as this wouldn't give you the expected result (sum 1+1+1=3 would not give you an 'x').
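One possible way around the duplicates caveat, offered only as a sketch, is to request MAX instead of SUM, so that several rows with sym = 1 still produce 1 and therefore 'x'. This assumes the numeric sym and the FSYM format defined above.

```sas
proc tabulate data=test;
    class tx mod bm yr;
    var sym;
    table yr*bm, tx*mod*sym=' '*max=' '*f=FSYM.;  /* the max of repeated 1s is still 1 */
run;
```

Whether that is the right behaviour depends on what a duplicated yr*bm*tx*mod combination is supposed to mean in the data.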