Do you know how to make the n in the LAGn(variable) function refer to another macro variable in the program? In my case it is the max by V1.
data example1;
input V1 value V2;
datalines;
a 1.0 2.0
a 1.0 1.0
a 1.0 1.0
b 1.0 1.0
b 1.0 1.0
;
proc sql;
select max(V2) format = 1. into :n
from example1;
quit;
data example1;
set example1;
by V1;
lagval=lag&n(V2);
run;
The code is from user667489 and works for a single column. Now n changes by V1.
I expect:
V1 value V2 MAX LAG
a 1.0 2.0 2 .
a 1.0 1.0 2 .
a 1.0 1.0 2 2
b 1.0 1.0 1 .
b 1.0 1.0 1 1
Forget about LAG(). Just add a counter variable and join on that.
Let's fix your example data step so it works.
data example1;
input V1 $ value V2;
datalines;
a 1 2
a 1 1
a 1 1
b 1 1
b 1 1
;
Now add a unique row id within each BY group.
data step1;
set example1;
by v1;
if first.v1 then row=0;
row+1;
run;
Now just join this dataset with itself.
proc sql ;
create table want as
select a.*,b.v2 as lag_v2
from (select *,max(v2) as max_v2 from step1 group by v1) a
left join step1 b
on a.v1= b.v1 and a.row = b.row + a.max_v2
;
quit;
Results:
Obs V1 value V2 row max_v2 lag_v2
1 a 1 2 1 2 .
2 a 1 1 2 2 .
3 a 1 1 3 2 2
4 b 1 1 1 1 .
5 b 1 1 2 1 1
Hopefully your real use case makes more sense than this example.
The LAG<n> function is an in-place stack of fixed depth that is specific to its code use location and thus to the step state at invocation. The stack is of depth <n> and cannot be altered dynamically at runtime.
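A small sketch of what "specific to its code use location" means in practice. The dataset and variable names (demo, lag_all, lag_even) are made up for illustration. Each LAG call keeps its own queue, and that queue only advances when the call actually executes:

```sas
data demo;
  input x @@;
  datalines;
1 2 3 4 5 6
;
run;

data lag_demo;
  set demo;
  /* this call executes on every row, so it returns the previous row's x */
  lag_all = lag(x);
  /* this call has its own queue and only executes on even rows,  */
  /* so it returns the x seen the last time THIS call ran         */
  if mod(x, 2) = 0 then lag_even = lag(x);
run;

proc print data=lag_demo noobs;
run;
/* lag_all  : . 1 2 3 4 5 */
/* lag_even : . . . 2 . 4 */
```

This is why a conditional or BY-group-dependent lag is awkward to build from LAG alone.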
A dynamic lag can be implemented in a SAS DATA step using a hash object. The double DOW technique allows a group to be measured first and its items to be operated upon afterwards.
Sample code
This example defines a hash object that maintains a stack of values within a group. A first DOW loop computes the maximum of a field, which becomes the dynamic stack height. The second DOW loop iterates over the group and retrieves the lag value while also building up the stack for future item lags.
* some faux data;
data have (keep=group value duration);
do group = 1 to 10;
limit = ceil(4 * ranuni(6));
put group= limit=;
do _n_ = 1 to 8 + 10*ranuni(123);
value = group*10 + _n_;
duration = 1 + floor(limit*ranuni(123));
output;
end;
end;
run;
* dynamic lag provided via hash;
data want;
if _n_ = 1 then do;
retain index lag_value .;
declare hash lag_stack();
lag_stack.defineKey('index');
lag_stack.defineData('lag_value');
lag_stack.defineDone();
end;
do _n_ = 1 by 1 until (last.group);
set have;
by group;
max_duration = max(max_duration, duration);
end;
* max_duration within group is the lag_stack height;
* pre-fill missings ;
do index = 1-max_duration to 0;
lag_stack.replace(key: index, data: .);
end;
do _n_ = 1 to _n_;
set have;
lag_stack.replace(key: _n_, data: value);
lag_stack.find(key: _n_ - max_duration);
output;
end;
drop index;
run;
Another technique would involve a fixed length ring-array instead of a hash-stack, but you would need to compute the maximum lag over all groups prior to coding the DATA step using the array.
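A rough sketch of that ring-array alternative, assuming the same HAVE dataset as above and that DURATION holds positive integers. The names maxlag, ring, slot and want_ring are hypothetical. The overall maximum is computed first so the array can be sized at compile time. Each group then uses only the first max_duration slots.

```sas
/* overall maximum lag, needed to size the array before the step compiles */
proc sql noprint;
  select max(duration) into :maxlag trimmed from have;
quit;

data want_ring;
  array ring[0:%eval(&maxlag - 1)] _temporary_;

  /* first DOW loop: measure the group's own lag depth */
  do _n_ = 1 by 1 until (last.group);
    set have;
    by group;
    max_duration = max(max_duration, duration);
  end;

  /* clear the ring so values do not leak in from the previous group */
  call missing(of ring[*]);

  /* second DOW loop: the slot about to be overwritten holds the value */
  /* from max_duration rows earlier                                    */
  do _n_ = 1 to _n_;
    set have;
    slot = mod(_n_, max_duration);
    lag_value = ring[slot];
    ring[slot] = value;
    output;
  end;
  drop slot;
run;
```

The ring avoids hash lookups, at the cost of the extra pass over the data to find the overall maximum.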
I have a dataset that contains an ID and some additional data. I want to perform transformations based on the ID with a by statement. The transformation works. Unfortunately SAS automatically reduces the dataset to one row per group. Does anybody know how to keep the original (number of) rows and still perform the group actions?
Here is some sample code to illustrate my problem
data dat;
input ID X $;
datalines;
1 a
1 b
1 c
1 d
2 a
2 b
3 a
4 k
5 z
5 a
5 c
;
data dat_new;
length x_new $2100.;
do until(last.ID);
set dat;
by ID notsorted;
x_new = ',' ||catx(',',x,x_new);
end;
drop x;
run;
Just add an OUTPUT statement inside the DO loop.
data dat_new;
length x_new $2100.;
do until(last.ID);
set dat;
by ID notsorted;
x_new = ',' ||catx(',',x,x_new);
output;
end;
drop x;
run;
When you do not have an explicit OUTPUT statement in a data step then an implied OUTPUT statement executes at the end of the data step. Your DO loop around the SET statement means that the end of the data step is only reached for the last observation per group.
If you want the final calculated value to be replicated on each observation then just add another loop to re-read the observations and put the OUTPUT statement in that loop.
data dat_new;
length x_new $2100.;
do until(last.ID);
set dat;
by ID notsorted;
x_new = ',' ||catx(',',x,x_new);
end;
do until(last.ID);
set dat;
by ID notsorted;
output;
end;
drop x;
run;
When you want to associate a group level computation result to EACH row in the group you will need to first iterate over the group to compute the result, and then have a second loop that reads the same rows of the group and outputs each. Use additional variables if you need to know the sequence number within the group and the total number of rows in the group.
data want(keep=id x_csv_list by_group_size seq);
length x_csv_list $2100.;
do by_group_size = 1 by 1 until(last.ID);
set dat;
by ID notsorted;
x_csv_list = catx(',',x_csv_list,x);
end;
do seq = 1 to by_group_size;
set dat;
output;
end;
run;
Also, if you are at the 'never really get it' stage, remember NOTSORTED means contiguous rows with the same by group variable values.
by s
s group first.s last.s
- ----- ------- ------
A 1st 1 0
A 1st 0 0 /* trick knowledge both 0 means row is interior */
A 1st 0 1
B 2nd 1 1 /* trick knowledge both 1 means group size is 1 row */
A 3rd 1 0
A 3rd 0 1
B 4th 1 0
B 4th 0 0
B 4th 0 1
C 5th 1 0
C 5th 0 1
I have a sample that includes two variables: ID and ym. ID is the specific ID for each trader and ym is the year-month. I want to create a variable that shows the number of years with an observation over the 10-year period prior to month t, as shown below.
ID ym Want
1 200101 0
1 200301 1
1 200401 2
1 200501 3
1 200601 4
1 200801 5
1 201201 5
1 201501 4
2 200001 0
2 200203 1
2 200401 2
2 200506 3
I attempted to use a BY statement and first.id to count the number.
data want;
set have;
want+1;
by id;
if first.id then want=1;
run;
However, the years in ym are not continuous. When the gap is longer than 10 years, this method does not work. I assume I need to count the number of years in a rolling 10-year window, but I am not sure how to achieve that. Please give me some suggestions. Thanks.
Just do a self join in SQL. With your coding of YM it is easy to use an interval that is a multiple of a year, but harder to use other intervals.
proc sql;
create table want as
select a.id,a.ym,count(b.ym) as want
from have a
left join have b
on a.id = b.id
and (a.ym - 1000) <= b.ym < a.ym
group by a.id,a.ym
order by a.id,a.ym
;
quit;
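For windows that are not a whole number of years, one possible approach (a sketch, not part of the answer above) is to convert YM to a SAS date and count month boundaries with INTCK. The output name want_months and the 120-month window are assumptions for illustration.

```sas
proc sql;
  create table want_months as
  select a.id, a.ym, count(b.ym) as want
  from have a
  left join have b
    on a.id = b.id
   and b.ym < a.ym
   /* yyyymm number -> text -> SAS date, then count months between */
   and intck('month',
             input(put(b.ym, 6.), yymmn6.),
             input(put(a.ym, 6.), yymmn6.)) <= 120
  group by a.id, a.ym
  order by a.id, a.ym
  ;
quit;
```

Changing 120 to another number of months gives other window lengths.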
This method retains the previous values for each ID and directly checks to see how many are within 120 months of the current value. It is not optimized but it works. You can set the array m() to the maximum number of values you have per ID if you care about efficiency.
The variable d is a quick shorthand I often use which converts years/months into an integer value - so
200012 -> (2000*12) + 12 = 24012
200101 -> (2001*12) + 1 = 24013
time from 200012 to 200101 = 24013 - 24012 = 1 month
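The arithmetic above can be checked with a tiny step like this one (a sketch; the month is taken as the last two digits of ym):

```sas
data _null_;
  do ym = 200012, 200101;
    /* year * 12 + month gives a running month number */
    d = 12 * floor(ym / 100) + mod(ym, 100);
    put ym= d=;
  end;
run;
/* log shows: ym=200012 d=24012 and ym=200101 d=24013 */
```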
data have;
input id ym;
datalines;
1 200101
1 200301
1 200401
1 200501
1 200601
1 200801
1 201201
1 201501
2 200001
2 200203
2 200401
2 200506
;
proc sort data=have;
by id ym;
run;
data want (keep=id ym want);
set have;
by id;
retain seq m1-m100;
array m(100) m1-m100;
** Convert date to comparable value **;
d = 12 * floor(ym/100) + mod(ym,100);
** Initialize number of previous records **;
want = 0;
** If first record, set retained values to missing and leave want=0 **;
if first.id then call missing(seq,of m1-m100);
** Otherwise loop through previous months and count how many were within 120 months **;
else do;
do i = 1 to seq;
if d <= (m(i) + 120) then want = want + 1;
end;
end;
** Increment variables for next iteration **;
seq + 1;
m(seq) = d;
run;
proc print data=want noobs;
run;
I want to apply a pre-defined format to several columns, but only for one variable. The problem is that this variable has two subgroups, Left and Right, and my code only changes the format for the first subgroup (Left), not the second (Right). I want to apply the same format to the second subgroup as well.
Here is my code:
DATA have;
INPUT subject $ variable $ parameter $ V1-V6;
DATALINES;
A-001 qAF Left 1 2 3 4 5 6
A-001 qAF Right 1 2 3 4 5 6
A-001 Cortical Left 1 1 1 1 1 1
A-001 Cortical Right 1 2 1 1 1 1
A-001 Posterial Left 1 1 1 2 1 1
A-001 Posterial Right 1 1 1 1 1 3
;
RUN;
PROC FORMAT;
VALUE cort
1 = 'C1'
2 = 'C2';
RUN;
PROC REPORT DATA = have;
COLUMNS subject variable parameter V1 V2 V3 V4 V5 V6 dummy;
DEFINE subject / ORDER;
DEFINE variable / ORDER;
DEFINE dummy / COMPUTED NOPRINT;
COMPUTE dummy;
IF variable = 'Cortical' THEN DO;
DO i = 4 TO 9;
CALL DEFINE (i, 'format', 'cort.');
END;
END;
ENDCOMP;
COMPUTE AFTER variable;
LINE ' ';
ENDCOMP;
OPTIONS missing = '';
RUN;
You need to HOLD the value of VARIABLE. See COMPUTE BEFORE.
PROC REPORT DATA = have;
COLUMNS subject variable parameter V1 V2 V3 V4 V5 V6 dummy;
DEFINE subject / ORDER;
DEFINE variable / ORDER;
DEFINE dummy / COMPUTED NOPRINT;
compute before variable;
hold=variable;
endcomp;
COMPUTE dummy;
IF hold = 'Cortical' THEN DO;
DO i = 4 TO 9;
CALL DEFINE (i, 'format', 'cort.');
END;
END;
ENDCOMP;
COMPUTE AFTER variable;
LINE ' ';
ENDCOMP;
OPTIONS missing = '';
RUN;
I checked out this previous post (LINK) for a potential solution, but it is still not working. I want to sum across rows using the ID as the common identifier. The num variable is constant. The id and comp are the two variables I want to use to create a pct variable, which = sum of [comp = 1] / num
Have:
id Comp Num
1 1 2
2 0 3
3 1 1
2 1 3
1 1 2
2 1 3
Want:
id tot pct
1 2 100
2 3 0.666666667
3 1 100
Currently have:
proc sort data=have;
by id;
run;
data want;
retain tot 0;
set have;
by id;
if first.id then do;
tot = 0;
end;
if comp in (1) then tot + 1;
else tot + 0;
if last.id;
pct = tot / num;
keep id tot pct;
output;
run;
I use SQL for things like this. You can do it in a Data Step, but the SQL is more compact.
data have;
input id Comp Num;
datalines;
1 1 2
2 0 3
3 1 1
2 1 3
1 1 2
2 1 3
;
run;
proc sql noprint;
create table want as
select id,
sum(comp) as tot,
sum(comp)/count(id) as pct
from have
group by id;
quit;
Hi there is a much more elegant solution to your problem :)
proc sort data = have;
by id;
run;
data want;
do _n_ = 1 by 1 until (last.id);
set have ;
by id ;
tot = sum (tot, comp) ;
end ;
pct = tot / num ;
run;
I hope it is clear. I use SQL too, because I am new and the DOW loop is rather complicated, but in your case it's pretty straightforward.
How do I add a new observation to an already created dataset in SAS? For example, if I have a dataset 'dataX' with variables 'x' and 'y' and I want to add a new observation that is a multiple of observation number n, how can I do it?
dataX :
x y
1 1
1 21
2 3
I want to create :
dataX :
x y
1 1
1 21
2 3
10 210
where observation number four is multiplication by ten of observation number two.
data X;
input x y;
datalines;
1 1
1 21
2 3
;
run;
data X ;
set X end=eof;
if eof then do;
output;
x=10 ;y=210;
end;
output;
run;
Here is one way to do this:
data dataX;
input x y;
datalines;
1 1
1 21
2 3
;
run;
/* Create a new observation into temp data set */
data _addRec;
set dataX(firstobs=2); /* Get observation 2 */
x = x * 10; /* Multiply each by 10 */
y = y * 10;
output; /* Output new observation */
stop;
run;
/* Add new obs to original data set */
proc append base=dataX data=_addRec;
run;
/* Delete the temp data set (to be safe) */
proc delete data=_addRec;
run;
data a ;
do kk=1 to 5 ;
output ;
end ;
run;
data a2 ;
kk=999 ;
output ;
run;
data a; set a a2 ;run ;
proc print data=a ;run ;
Result:
The SAS System 1
OBS kk
1 1
2 2
3 3
4 4
5 5
6 999
You can use a macro to obtain your desired result:
Write a macro that reads the first dataset and, when _n_=2, multiplies x and y by 10.
After that, create another dataset that holds only your multiplied values, say x'=10x and y'=10y.
Pass both datasets to another macro that SETs the original dataset and the newly created dataset.
The logic is that you create another dataset with the values 10x and 10y and then SET it together with the previous dataset.
I hope this helps!
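A minimal sketch of the steps described above, written as a single macro. The macro name add_scaled_obs, its parameters and the temporary dataset _scaled are hypothetical, and it assumes the dataset has numeric variables x and y as in the question.

```sas
%macro add_scaled_obs(ds=, obsnum=, factor=);
  /* read only the requested observation and scale it */
  data _scaled;
    set &ds (firstobs=&obsnum obs=&obsnum);
    x = x * &factor;
    y = y * &factor;
  run;

  /* append the scaled observation to the original dataset */
  data &ds;
    set &ds _scaled;
  run;

  /* remove the temporary dataset */
  proc delete data=_scaled;
  run;
%mend add_scaled_obs;

%add_scaled_obs(ds=dataX, obsnum=2, factor=10)
```

With the dataX example from the question, this appends the row x=10, y=210.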