Suppose I have a dataset with three variables:
ID Year Status
1 2017 Y
1 2017 N
1 2018 N
1 2018 Y
2 2017 Y
2 2017
2 2018 N
2 2018 N
I would like to create a fourth column called NEW which has three possible values ('Yonly' 'Nonly' and 'yesno'). In the example above the output will be:
ID Year Status NEW
1 2017 Y
1 2017 N yesno
1 2018 N
1 2018 Y yesno
2 2017 Y
2 2017 yesonly
2 2018 N
2 2018 N noonly
Note: could have missing data. My solution so far is wrong:
retain tmp '';
by ID Year;
if Status='Y' then tmp='Yonly';
if Status='N' then tmp='Nonly';
if tmp='Yonly' and Status='N' then tmp='yesno';
if tmp='Nonly' and Status='Y' then tmp='yesno';
if last.Year=1 then NEW=tmp;
Please help? Any method will do, you don't have to use RETAIN.
Make sure to define the length of TMP. Your current code will set the length of TMP to 1 since the first usage is the initial value listed in the RETAIN statement.
You are missing an initialization step for when starting a new group.
if first.year then tmp=' ';
Your method can only set the result on the last record for each group. If you want all observations in a group to have the same value then I would suggest using a double DOW loop. The first loop can be used to see if there are any 'Y' or 'N' status. Then you can calculate your NEW variable. Then a second loop will read in the data for the group again and write the values out. Because all observations for a group are processed in a single data step iteration there is no need to use RETAIN.
data want ;
do until (last.year) ;
set have ;
by id year ;
y = y or (status='Y');
n = n or (status='N');
end;
length new $8;
if Y and N then new='yesno';
else if Y then new='yesonly';
else if N then new='noonly';
else new='none';
drop y n ;
do until (last.year) ;
set have ;
by id year ;
output ;
end;
run;
It's along the lines of what you were doing, but it's better to do the conditionals more like this.
data want;
set have;
by id year;
retain last_status;
if first.year then last_status = status;
if last.year then do;
if status = last_status or missing(last_status) then new=cats(status,'only');
else if missing(status) then new=cats(last_status,'only');
else new='yesno';
end;
run;
retain the value from the first row, and then on the last row just consider what to do based on the two variables - it's fairly straightforward that way.
Related
I have a dataset that contains an ID and some additional data. I want to perform transformations based on the ID with a by statement. The transformation works. Unfortunately SAS automatically reduces the dataset to one row per group. Does anybody know how to keep the original (number of) rows and still perform the group actions?
Here is some sample code to illustrate my problem
data dat;
input ID X $;
datalines;
1 a
1 b
1 c
1 d
2 a
2 b
3 a
4 k
5 z
5 a
5 c
;
data dat_new;
length x_new $2100.;
do until(last.ID);
set dat;
by ID notsorted;
x_new = ',' ||catx(',',x,x_new);
end;
drop x;
run;
Just add an OUTPUT statement inside the DO loop.
data dat_new;
length x_new $2100.;
do until(last.ID);
set dat;
by ID notsorted;
x_new = ',' ||catx(',',x,x_new);
output;
end;
drop x;
run;
When you do not have an explicit OUTPUT statement in a data step then an implied OUTPUT statement executes at the end of the data step. Your DO loop around the SET statement means that the end of the data step is only reached for the last observation per group.
If you want the final calculated value to be replicated on each observation then just add another loop to re-read the observations and put the OUTPUT statement in that loop.
data dat_new;
length x_new $2100.;
do until(last.ID);
set dat;
by ID notsorted;
x_new = ',' ||catx(',',x,x_new);
end;
do until(last.ID);
set dat;
by ID notsorted;
output;
end;
drop x;
run;
When you want to associate a group level computation result to EACH row in the group you will need to first iterate over the group to compute the result, and then have a second loop that reads the same rows of the group and outputs each. Use additional variables if you need to know the sequence number within the group and the total number of rows in the group.
data want(keep=id x_csv_list by_group_size seq);
length x_csv_list $2100.;
do by_group_size = 1 by 1 until(last.ID);
set dat;
by ID notsorted;
x_csv_list = catx(',',x_csv_list,x);
end;
do seq = 1 to by_group_size;
set dat;
output;
end;
run;
Also, if you are at the 'never really get it' stage, remember NOTSORTED means contiguous rows with the same by group variable values.
by s
s group first.s last.s
- ----- ------- ------
A 1st 1 0
A 1st 0 0 /* trick knowledge both 0 means row is interior */
A 1st 0 1
B 2nd 1 1 /* trick knowledge both 1 means group size is 1 row */
A 3rd 1 0
A 3rd 0 1
B 4th 1 0
B 4th 0 0
B 4th 0 1
C 5th 1 0
C 5th 0 1
I would like to know if my data would be included in a specified month. Please see reprex below:
id Period_start Period_end
1 01-01-2012 12-03-2015
1 21-03-2014 12-11-2014
2 09-05-2018 31-01-2019
3 08-12-2013 30-03-2015
3 26-03-2016 22-03-2020
4 31-07-2018 07-08-2018
4 29-09-2014 03-03-2017
4 13-06-2020 17-02-2021
4 23-01-2008 15-08-2016
4 05-10-2009 26-12-2015
I've tried the below codes using a single month. They worked the first time and did not work after that.
data dates2;
set work.dates;
by id;
if (period_start>='01MAR2016'd and period_end<='01MAR2016'd) or (period_start>='31MAR2016'd and period_end<='31MAR2016'd) then flag='March 2016';
else flag='';
run;
/* Or */
data dates2;
set work.dates;
by id;
if ('01MAR2016'd ge period_start and '01MAR2016'd le period_end) or ('31MAR2016'd ge period_start and '31MAR2016'd le period_end) then flag='March 2016';
else flag='';
run;
My intended outcome for this example is below:
id Period_start Period_end Flag
1 01-01-2012 12-03-2015
1 21-03-2014 12-11-2014
2 09-05-2018 31-01-2019
3 08-12-2013 30-03-2015
3 26-03-2016 22-03-2020 March 2016
4 31-07-2018 07-08-2018
4 29-09-2014 03-03-2017 March 2016
4 13-06-2020 17-02-2021
4 23-01-2008 15-08-2016 March 2016
4 05-10-2009 26-12-2015
Please note that I have a number of months to compare them against which is why I didn't use the where function.
You can process multiple months to flag (i.e "number of months to compare") in one go if you store those months in a separate data set (as opposed to hard coding the month in a DATA Step program source code)
Example:
The months to flag are stored in a control data set, which, is then transposed to create flag variables. The flag variables are reloaded at every iteration of the DATA Step implicit loop using SET and POINT= and conditionally cleared based on date range comparison in an explicit loop over the flag variables.
data have;
attrib
id length=8
period_start period_end informat=ddmmyy10. format=ddmmyyd10.;
input
id Period_start Period_end; datalines;
1 01-01-2012 12-03-2015
1 21-03-2014 12-11-2014
2 09-05-2018 31-01-2019
3 08-12-2013 30-03-2015
3 26-03-2016 22-03-2020
4 31-07-2018 07-08-2018
4 29-09-2014 03-03-2017
4 13-06-2020 17-02-2021
4 23-01-2008 15-08-2016
4 05-10-2009 26-12-2015
;
data flag_months;
attrib month informat=monyy7. format=monyy7.;
input month; datalines;
MAR2016
AUG2018
;
proc transpose data=flag_months out=flag_vars(drop=_name_) prefix=FLAG_;
id month;
var month;
run;
data want;
set have;
retain one 1;
set flag_vars point=one; drop one; * load flag values;
array flag_vars flag_:;
do _i_ = 1 to dim(flag_vars);
* clear flag value if month does not touch any day in the period;
if not
( intnx('month', period_start, 0)
<=
flag_vars(_i_)
<=
intnx('month', period_end, 0, 'E')
)
then
call missing(flag_vars(_i_));
end;
run;
Flag months
Transposed into Flag vars
Which are loaded and conditionally cleared during a pass over the data set containing date range information.
I have a sample that include two variables: ID and ym. ID id refer to the specific ID for each trader and ym refer to the year-month variable. And I want to create a variable that show the number of years over the 10 years period prior month t as shown in the following figure.
ID ym Want
1 200101 0
1 200301 1
1 200401 2
1 200501 3
1 200601 4
1 200801 5
1 201201 5
1 201501 4
2 200001 0
2 200203 1
2 200401 2
2 200506 3
I attempt to use by function and fisrt.id to count the number.
data want;
set have;
want+1;
by id;
if first.id then want=1;
run;
However, the year in ym is not continuous. When the time gap is higher than 10 years, this method is not working. Although I assume I need to count the number of year in a rolling window (10 years), I am not sure how to achieve it. Please give me some suggestions. Thanks.
Just do a self join in SQL. With your coding of YM it is easy to do interval that is a multiple of a year, but harder to do other intervals.
proc sql;
create table want as
select a.id,a.ym,count(b.ym) as want
from have a
left join have b
on a.id = b.id
and (a.ym - 1000) <= b.ym < a.ym
group by a.id,a.ym
order by a.id,a.ym
;
quit;
This method retains the previous values for each ID and directly checks to see how many are within 120 months of the current value. It is not optimized but it works. You can set the array m() to the maximum number of values you have per ID if you care about efficiency.
The variable d is a quick shorthand I often use which converts years/months into an integer value - so
200012 -> (2000*12) + 12 = 24012
200101 -> (2001*12) + 1 = 24013
time from 200012 to 200101 = 24013 - 24012 = 1 month
data have;
input id ym;
datalines;
1 200101
1 200301
1 200401
1 200501
1 200601
1 200801
1 201201
1 201501
2 200001
2 200203
2 200401
2 200506
;
proc sort data=have;
by id ym;
data want (keep=id ym want);
set have;
by id;
retain seq m1-m100;
array m(100) m1-m100;
** Convert date to comparable value **;
d = 12 * floor(ym/100) + mod(ym,10);
** Initialize number of previous records **;
want = 0;
** If first record, set retained values to missing and leave want=0 **;
if first.id then call missing(seq,of m1-m100);
** Otherwise loop through previous months and count how many were within 120 months **;
else do;
do i = 1 to seq;
if d <= (m(i) + 120) then want = want + 1;
end;
end;
** Increment variables for next iteration **;
seq + 1;
m(seq) = d;
run;
proc print data=want noobs;
I need help to split a row into multiple rows when the value on the row is something like 1-5. The reason is that I need to count 1-5 to become 5, and not 1, as it is when it count on one row.
I've a ID, the value and where it belong.
As exempel:
ID Value Page
1 1-5 2
The output I want is something like this:
ID Value Page
1 1 2
1 2 2
1 3 2
1 4 2
1 5 2
I've tried using a IF-statement
IF bioVerdi='1-5' THEN
DO;
..
END;
So I don't know what I should put between the DO; and END;. Any clues to help me out here?
You need to loop over the values inside your range and OUTPUT the values. The OUTPUT statement causes the Data Step to write a record to the output data set.
data want;
set have;
if bioVerdi = '1-5' then do;
do value=1 to 5;
output;
end;
end;
Here is another solution that is less restricted to the actual value '1-5' given in your example, but would work for any value in the format '1-6', '1-7', '1-100', etc.
*this is the data you gave ;
data have ;
ID = 1 ;
value = '1-5';
page = 2;
run;
data want ;
set have ;
min = scan( value, 1, '-' ) ; * get the 1st word, delimited by a dash ;
max = scan( value, 2, '-' ) ; * get the 2nd word, delimited by a dash ;
/*loop through the values from min to max, and assign each value as the loop iterates to a new column 'NEWVALUE.' Each time the loop iterates through the next value, output a new line */
do newvalue = min to max ;
output ;
end;
/*drop the old variable 'value' so we can rename the newvalue to it in the next step*/
drop value min max;
/*newvalue was a temporary name, so renaming here to keep the original naming structure*/
rename newvalue = value ;
run;
how can i perform calculation for the last n observation in a data set
For example if I have 10 observations I would like to create a variable that would sum the last 5 values of another variable. Please do not suggest that I lag 5 times or use module ( N ). I need a bit more elegant solution than that.
with the code below alpha is the data set that i have and bravo is the one i need.
data alpha;
input lima ## ;
cards ;
3 1 4 21 3 3 2 4 2 5
;
run ;
data bravo;
input lima juliet;
cards;
3 .
1 .
4 .
21 .
3 32
3 32
2 33
4 33
2 14
5 16
;
run;
thank you in advance!
You can do this in the data step or using PROC EXPAND from SAS/ETS if available.
For the data step the idea is that you start with a cumulative sum (summ), but keep track of the number of values that were added so far (ninsum). Once that reaches 5, you start outputting the cumulative sum to the target variable (juliet), and from the next step you start subtracting the lagged-5 value to only store the sum of the last five values.
data beta;
set alpha;
retain summ ninsum 0;
summ + lima;
ninsum + 1;
l5 = lag5(lima);
if ninsum = 6 then do;
summ = summ - l5;
ninsum = ninsum - 1;
end;
if ninsum = 5 then do;
juliet = summ;
end;
run;
proc print data=beta;
run;
However there is a procedure that can do all kind of cumulative, moving window, etc calculations: PROC EXPAND, in which this is really just one line. We just tell it to calculate the backward moving sum in a window of width 5 and set the first 4 observations to missing (by default it will expand your series by 0's on the left).
proc expand data=alpha out=gamma;
convert lima = juliet / transformout=(movsum 5 trimleft 4);
run;
proc print data=gamma;
run;
Edit
If you want to do more complicated calculations, you need to carry the previous values in retained variables. I thought you wanted to avoid that, but here it is:
data epsilon;
set alpha;
array lags {5};
retain lags1 - lags5;
/* do whatever calculation is needed */
juliet = 0;
do i=1 to 5;
juliet = juliet + lags{i};
end;
output;
/* shift over lagged values, and add self at the beginning */
do i=5 to 2 by -1;
lags{i} = lags{i-1};
end;
lags{1} = lima;
drop i;
run;
proc print data=epsilon;
run;
I can offer rather ugly solution:
run data step and add increasing number to each group.
run sql step and add column of max(group).
run another data step and check if value from (2)-(1) is less than 5. If so, assign to _num_to_sum_ variable (for example) the value that you want to sum, otherwise leave it blank or assign 0.
and last do a sql step with sum(_num_to_sum_) and group results by grouping variable from (1).
EDIT: I have added a live example of the concept in a bit more compacted way.
input var1 $ var2;
cards;
aaa 3
aaa 5
aaa 7
aaa 1
aaa 11
aaa 8
aaa 6
bbb 3
bbb 2
bbb 4
bbb 6
;
run;
data step1;
set sourcetable;
by var1;
retain obs 0;
if first.var1 then obs = 0;
else obs = obs+1;
if obs >=5 then to_sum = var2;
run;
proc sql;
create table rezults as
select distinct var1, sum(to_sum) as needed_summs
from step1
group by var1;
quit;
In case anyone reads this :)
I solved it the way I needed it to be solved. Although now I am more curious which of the two(the retain and my solution) is more optimal in terms of computing/processing time.
Here is my solution:
data bravo(keep = var1 summ);
set alpha;
do i=_n_ to _n_-4 by -1;
set alpha(rename=var1=var2) point=i;
summ=sum(summ,var2);
end;
run;