I would like to know if my data would be included in a specified month. Please see reprex below:
id Period_start Period_end
1 01-01-2012 12-03-2015
1 21-03-2014 12-11-2014
2 09-05-2018 31-01-2019
3 08-12-2013 30-03-2015
3 26-03-2016 22-03-2020
4 31-07-2018 07-08-2018
4 29-09-2014 03-03-2017
4 13-06-2020 17-02-2021
4 23-01-2008 15-08-2016
4 05-10-2009 26-12-2015
I've tried the below codes using a single month. They worked the first time and did not work after that.
data dates2;
set work.dates;
by id;
if (period_start>='01MAR2016'd and period_end<='01MAR2016'd) or (period_start>='31MAR2016'd and period_end<='31MAR2016'd) then flag='March 2016';
else flag='';
run;
/* Or */
data dates2;
set work.dates;
by id;
if ('01MAR2016'd ge period_start and '01MAR2016'd le period_end) or ('31MAR2016'd ge period_start and '31MAR2016'd le period_end) then flag='March 2016';
else flag='';
run;
My intended outcome for this example is below:
id Period_start Period_end Flag
1 01-01-2012 12-03-2015
1 21-03-2014 12-11-2014
2 09-05-2018 31-01-2019
3 08-12-2013 30-03-2015
3 26-03-2016 22-03-2020 March 2016
4 31-07-2018 07-08-2018
4 29-09-2014 03-03-2017 March 2016
4 13-06-2020 17-02-2021
4 23-01-2008 15-08-2016 March 2016
4 05-10-2009 26-12-2015
Please note that I have a number of months to compare them against which is why I didn't use the where function.
You can process multiple months to flag (i.e "number of months to compare") in one go if you store those months in a separate data set (as opposed to hard coding the month in a DATA Step program source code)
Example:
The months to flag are stored in a control data set, which, is then transposed to create flag variables. The flag variables are reloaded at every iteration of the DATA Step implicit loop using SET and POINT= and conditionally cleared based on date range comparison in an explicit loop over the flag variables.
data have;
attrib
id length=8
period_start period_end informat=ddmmyy10. format=ddmmyyd10.;
input
id Period_start Period_end; datalines;
1 01-01-2012 12-03-2015
1 21-03-2014 12-11-2014
2 09-05-2018 31-01-2019
3 08-12-2013 30-03-2015
3 26-03-2016 22-03-2020
4 31-07-2018 07-08-2018
4 29-09-2014 03-03-2017
4 13-06-2020 17-02-2021
4 23-01-2008 15-08-2016
4 05-10-2009 26-12-2015
;
data flag_months;
attrib month informat=monyy7. format=monyy7.;
input month; datalines;
MAR2016
AUG2018
;
proc transpose data=flag_months out=flag_vars(drop=_name_) prefix=FLAG_;
id month;
var month;
run;
data want;
set have;
retain one 1;
set flag_vars point=one; drop one; * load flag values;
array flag_vars flag_:;
do _i_ = 1 to dim(flag_vars);
* clear flag value if month does not touch any day in the period;
if not
( intnx('month', period_start, 0)
<=
flag_vars(_i_)
<=
intnx('month', period_end, 0, 'E')
)
then
call missing(flag_vars(_i_));
end;
run;
Flag months
Transposed into Flag vars
Which are loaded and conditionally cleared during a pass over the data set containing date range information.
Related
I have a sample that include two variables: ID and ym. ID id refer to the specific ID for each trader and ym refer to the year-month variable. And I want to create a variable that show the number of years over the 10 years period prior month t as shown in the following figure.
ID ym Want
1 200101 0
1 200301 1
1 200401 2
1 200501 3
1 200601 4
1 200801 5
1 201201 5
1 201501 4
2 200001 0
2 200203 1
2 200401 2
2 200506 3
I attempt to use by function and fisrt.id to count the number.
data want;
set have;
want+1;
by id;
if first.id then want=1;
run;
However, the year in ym is not continuous. When the time gap is higher than 10 years, this method is not working. Although I assume I need to count the number of year in a rolling window (10 years), I am not sure how to achieve it. Please give me some suggestions. Thanks.
Just do a self join in SQL. With your coding of YM it is easy to do interval that is a multiple of a year, but harder to do other intervals.
proc sql;
create table want as
select a.id,a.ym,count(b.ym) as want
from have a
left join have b
on a.id = b.id
and (a.ym - 1000) <= b.ym < a.ym
group by a.id,a.ym
order by a.id,a.ym
;
quit;
This method retains the previous values for each ID and directly checks to see how many are within 120 months of the current value. It is not optimized but it works. You can set the array m() to the maximum number of values you have per ID if you care about efficiency.
The variable d is a quick shorthand I often use which converts years/months into an integer value - so
200012 -> (2000*12) + 12 = 24012
200101 -> (2001*12) + 1 = 24013
time from 200012 to 200101 = 24013 - 24012 = 1 month
data have;
input id ym;
datalines;
1 200101
1 200301
1 200401
1 200501
1 200601
1 200801
1 201201
1 201501
2 200001
2 200203
2 200401
2 200506
;
proc sort data=have;
by id ym;
data want (keep=id ym want);
set have;
by id;
retain seq m1-m100;
array m(100) m1-m100;
** Convert date to comparable value **;
d = 12 * floor(ym/100) + mod(ym,10);
** Initialize number of previous records **;
want = 0;
** If first record, set retained values to missing and leave want=0 **;
if first.id then call missing(seq,of m1-m100);
** Otherwise loop through previous months and count how many were within 120 months **;
else do;
do i = 1 to seq;
if d <= (m(i) + 120) then want = want + 1;
end;
end;
** Increment variables for next iteration **;
seq + 1;
m(seq) = d;
run;
proc print data=want noobs;
I want to output the last value of a variable pr. sub-group to a SAS dataset, preferably in just a few steps. The code below do it, but I was hoping to do it in one step a la by variable; if last.variable then output; as for the case with just 1 by-variable.
data two;
input year firm price;
cards;
1 1 48
1 1 45
2 2 50
1 2 42
2 1 41
2 2 51
2 1 52
1 1 43
1 2 52;
run;
proc sort data = two;by year firm;run;
/* a) Create id across both sub-groups */
data two1;
set two;
by year firm;
retain case_id;
if FIRST.year OR first.firm then case_id + 1;
run;
/* b) Use id to output last values across both by-groups */
data two2;
set two1;
by case_id;
if last.case_id then output;
run;
proc print data = two1;run;
proc print data = two2;run;
With just 1 by-variable the two steps marked a) and b) can be combined. Is it possible with more than one by-group?
In data step a) add condition if lst.firm then output two2.
The final code should looks like:
data two1 two2;
set two;
by year firm;
retain case_id;
if FIRST.year OR first.firm then case_id + 1;
if last.firm then output two2;
output two1;
run;
I want to create a column in my dataset that calculates the sum of the current row and next row for another field. There are several groups within the data, and I only want to take the sum of the next row if the next row is part of the current group. If a row is the last record for that group I want to fill with a null value.
I'm referencing reading next observation's value in current observation, but still can't figure out how to obtain the solution I need.
For example:
data have;
input Group ID Salary;
cards;
10 1 1
10 2 2
10 3 2
10 4 1
11 1 2
11 2 2
11 3 1
11 4 1
;
run;
The result I want to obtain here is this:
data want;
input Group ID Salary Sum;
cards;
10 1 1 3
10 2 2 4
10 3 2 3
10 4 1 .
11 1 2 4
11 2 2 3
11 3 1 2
11 4 1 .
;
run;
Similar to Tom's answer, but using a 'look-ahead' merge (without a by statement, and firstobs=2) :
data want ;
merge have
have (firstobs=2
keep=Group Salary
rename=(Group=NextGroup Salary=NextSalary)) ;
if Group = NextGroup then sum = sum(Salary,NextSalary) ;
drop Next: ;
run ;
Use BY group processing and a second SET statement that skips the first observation.
data want ;
set have end=eof;
by group ;
if not eof then set have (keep=Salary rename=(Salary=Sum) firstobs=2);
if last.group then Sum=.;
else sum=sum(sum,salary);
run;
I found a solution using proc expand that produced what I needed:
proc sort data = have;
by Group ID;
run;
proc expand data=have out=want method=none;
by Group;
convert Salary = Next_Sal / transformout=(lead 1);
run;
data want(keep=Group ID Salary Sum);
set want;
Sum = Salary + Next_Sal;
run;
I want to delete the whole group that none of its observation has NUM=14
So something likes this:
Original DATA
ID NUM
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
Since none of the ID=2 contain NUM=14, I delete group 2.
And it should looks like this:
ID NUM
1 14
1 12
1 10
3 14
3 10
This is what I have so far, but it doesn't seem to work.
data originaldat;
set newdat;
by ID;
If first.ID then do;
IF NUM EQ 14 then Score = 100;
Else Score = 10;
end;
else SCORE+1;
run;
data newdat;
set newdat;
If score LT 50 then delete;
run;
An approach using proc sql would be:
proc sql;
create table newdat as
select *
from originaldat
where ID in (
select ID
from originaldat
where NUM = 14
);
quit;
The sub query selects the IDs for groups that contain an observation where NUM = 14. The where clause then limits the selected data to only these groups.
The equivalent data step approach would be:
/* Get all the groups that contain an observation where N = 14 */
data keepGroups;
set originaldat;
if NUM = 14;
keep ID;
run;
/* Sort both data sets to ensure the data step merge works as expected */
proc sort data = originaldat;
by ID;
run;
/* Make sure there are no duplicates values in the groups to be kept */
proc sort data = keepGroups nodupkey;
by ID;
run;
/*
Merge the original data with the groups to keep and only keep records
where an observation exists in the groups to keep dataset
*/
data newdat;
merge
originaldat
keepGroups (in = k);
by ID;
if k;
run;
In both datasets the subsetting if statement is used to only output observations when the condition is met. In the second case k is a temporary variable with value 1(true) when a value is read from keepGroups an 0(false) otherwise.
You're sort of getting at a DoW loop here, but not quite doing it right. The problem (Assuming the DATA/SET names are mistyped and not actually wrong in your program) is the first data step doesn't append that 100 to every row - only to the 14 row. What you need is one 'line' per ID value with a keep/no keep decision.
You can either do this by doing your first data step, but RETAIN score, and only output one row per ID. Your code would actually work, based on 14 being the first row, if you just fixed your data/set typo; but it only works when 14 is the first row.
data originaldat;
input ID NUM ;
datalines;
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
;;;;
run;
data has_fourteen;
set originaldat;
by ID;
retain keep;
If first.ID then keep=0;
if num=14 then keep=1;
if last.id then output;
run;
data newdata;
merge originaldat has_fourteen;
by id;
if keep=1;
run;
That works by merging the value from a 1-per-ID to the whole dataset.
A double DoW also works.
data newdata;
keep=0;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if num=14 then keep=1;
end;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if keep=1 then output;
end;
run;
This works because it iterates over the dataset twice; for each ID, it iterates once through all records, looking for a 14, if it finds one then setting keep to 1. Then it reads all records again for that ID, and keeps if keep=1. Then it goes on to the next set of records by ID.
data in;
input id num;
cards;
1 14
1 12
1 10
2 16
2 13
3 14
3 67
;
/* To find out the list of groups which contains num=14, use below SQL */
proc sql;
select distinct id into :lst separated by ','
from in
where num = 14;
quit;
/* If you want to create a new data set with only groups containing num=14 then use following data step */
data out;
set in;
where id in (&lst.);
run;
I have a data set with daily data in SAS. I would like to convert this to monthly form by taking differences from the previous month's value by id. For example:
thedate, id, val
2012-01-01, 1, 10
2012-01-01, 2, 14
2012-01-02, 1, 11
2012-01-02, 2, 12
...
2012-02-01, 1, 20
2012-02-01, 2, 15
I would like to output:
thedate, id, val
2012-02-01, 1, 10
2012-02-01, 2, 1
Here is one way. If you license SAS-ETS, there might be a better way to do it with PROC EXPAND.
*Setting up the dataset initially;
data have;
informat thedate YYMMDD10.;
input thedate id val;
datalines;
2012-01-01 1 10
2012-01-01 2 14
2012-01-02 1 11
2012-01-02 2 12
2012-02-01 1 20
2012-02-01 2 15
;;;;
run;
*Sorting by ID and DATE so it is in the right order;
proc sort data=have;
by id thedate;
run;
data want;
set have;
retain lastval; *This is retained from record to record, so the value carries down;
by id thedate;
if (first.id) or (last.id) or (day(thedate)=1); *The only records of interest - the first record, the last record, and any record that is the first of a month.;
* To do END: if (first.id) or (last.id) or (thedate=intnx('MONTH',thedate,0,'E'));
if first.id then call missing(lastval); *Each time ID changes, reset lastval to missing;
if missing(lastval) then output; *This will be true for the first record of each ID only - put that record out without changes;
else do;
val = val-lastval; *set val to the new value (current value minus retained value);
output; *put the record out;
end;
lastval=sum(val,lastval); *this value is for the next record;
run;
You could achieve this using a PROC SQL, and the intnx function to bring last months date forward a month...
proc sql ;
create table lag as
select b.thedate, b.id, (b.val - a.val) as val
from mydata b
left join
mydata a on b.date = intnx('month',a.date,1,'s')
and b.id = a.id
order by b.date, b.id ;
quit ;
This may need tweaking to handle scenarios where the previous month doesn't exist or months which have a different number of days to the previous month.