I am working on a dataset in SAS to get the next observation's score should be the current observation's value for the column Next_Row_score. If there is no next observation then the current observation's value for the column Next_Row_score should be 'null'per group(ID). For better illustration i have provided the sample below dataset :
ID Score
10 1000
10 1500
10 2000
20 3000
20 4000
30 2500
Resultant output should be like -
ID Salary Next_Row_Salary
10 1000 1500
10 1500 2000
10 2000 .
20 3000 4000
20 4000 .
30 2500 2500
Thank you in advance for your help.
data want(drop=_: flag);
merge have have(firstobs=2 rename=(ID=_ID Score=_Score));
if ID=_ID then do;
Next_Row_Salary=_Score;
flag+1;
end;
else if ID^=_ID and flag>=1 then do;
Next_Row_Salary=.;
flag=.;
end;
else Next_Row_Salary=score;
run;
Try this :
data have;
input ID Score;
datalines;
10 1000
10 1500
10 2000
20 3000
20 4000
30 2500
;
run;
proc sql noprint;
select count(*) into :obsHave
from have;
quit;
data want2(rename=(id1=ID Score1=Salary) drop=ID id2 Score);
do i=1 to &obsHave;
set have point=i;
id1=ID;
Score1=Score;
j=i+1;
set have point=j;
id2=ID;
if id1=id2 then do;
Next_Row_Salary = Score;
end;
else Next_Row_Salary=".";
output;
end;
stop;
;
run;
There is a simpler (in my mind, at least) proc sql approach that doesn't involve loops:
data have;
input ID Score;
datalines;
10 1000
10 1500
10 2000
20 3000
20 4000
30 2500
;
run;
/*count each observation's place in its ID group*/
data have2;
set have;
count + 1;
by id;
if first.id then count = 1;
run;
/*if there is only one ID in a group, keep original score, else lag by 1*/
proc sql;
create table want as select distinct
a.id, a.score,
case when max(a.count) = 1 then a.score else b.score end as score2
from have2 as a
left join have2 (where = (count > 1)) as b
on a.id = b.id and a.count = b.count - 1
group by a.id;
quit;
Related
I have a dataset like below
Customer_ID Vistited_Date
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
I am trying find the cumulative unique count of customers by date, assuming my output will be like below
Cust_count Vistited_Date
3 7-Feb-20
2 14-Feb-20
7-Feb-2020 has 3 unique customers, whereas 14-Feb-2020 has only 2 hence customer 1234 has visited already.
Anyone knows how I could develop a data set in these conditions?
Sorry if my question is not clear enough, and I am available to give more details if necessary.
Thanks!
NOTE: #draycut's answer has the same logic but is faster, and I will explain why.
#draycut's code uses one hash method, add(), using the return code as test for conditional increment. My code uses check() to test for conditional increment and then add (which will never fail) to track. The one method approach can be perceived as being anywhere from 15% to 40% faster in performance (depending on number of groups, size of groups and id reuse rate)
You will need to track the IDs that have occurred in all prior groups, and exclude the tracked IDs from the current group count.
Tracking can be done with a hash, and conditional counting can be performed in a DOW loop over each group. A DOW loop places the SET statement inside an explicit DO.
Example:
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
;
data counts(keep=date count);
if _n_ = 1 then do;
declare hash tracker();
tracker.defineKey('id');
tracker.defineDone();
end;
do until (last.date);
set have;
by date;
if tracker.check() ne 0 then do;
count = sum(count, 1);
tracker.add();
end;
end;
run;
Raw performance benchmark - no disk io, cpu required to fill array before doing hashing, so those performance components are combined.
The root performance is how fast can new items be added to the hash.
Simulate 3,000,000 'records', 1,000 groups of 3,000 dates, 10% id reuse (so the distinct ids will be ~2.7M).
%macro array_fill (top=3000000, n_group = 1000, overlap_factor=0.10);
%local group_size n_overlap index P Q;
%let group_size = %eval (&top / &n_group);
%if (&group_size < 1) %then %let group_size = 1;
%let n_overlap = %sysevalf (&group_size * &overlap_factor, floor);
%if &n_overlap < 0 %then %let n_overlap = 0;
%let top = %sysevalf (&group_size * &n_group);
P = 1;
Q = &group_size;
array ids(&top) _temporary_;
_n_ = 0;
do i = 1 to &n_group;
do j = P to Q;
_n_+1;
ids(_n_) = j;
end;
P = Q - &n_overlap;
Q = P + &group_size - 1;
end;
%mend;
options nomprint;
data _null_ (label='check then add');
length id 8;
declare hash h();
h.defineKey('id');
h.defineDone();
%array_fill;
do index = 1 to dim(ids);
id = ids(index);
if h.check() ne 0 then do;
count = sum(count,1);
h.add();
end;
end;
_n_ = h.num_items;
put 'num_items=' _n_ comma12.;
put index= comma12.;
stop;
run;
data _null_ (label='just add');
length id 8;
declare hash h();
h.defineKey('id');
h.defineDone();
%array_fill;
do index = 1 to dim(ids);
id = ids(index);
if h.add() = 0 then
count = sum(count,1);
end;
_n_ = h.num_items;
put 'num_items=' _n_ comma12.;
put index= comma12.;
stop;
run;
data have;
input Customer_ID Vistited_Date :anydtdte12.;
format Vistited_Date date9.;
datalines;
1234 7-Feb-2020
4567 7-Feb-2020
9870 7-Feb-2020
1234 14-Feb-2020
7654 14-Feb-2020
3421 14-Feb-2020
;
data want (drop=Customer_ID);
if _N_=1 then do;
declare hash h ();
h.definekey ('Customer_ID');
h.definedone ();
end;
do until (last.Vistited_Date);
set have;
by Vistited_Date;
if h.add() = 0 then Count = sum(Count, 1);
end;
run;
If your data is not sorted and you like the SQL maybe this solution is same good for you and it is very simple:
/* your example 3 rows */
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
1234 15-Feb-20
7654 15-Feb-20
1111 15-Feb-20
;
run;
/* simple set theory. Final dataset contains your final data like results
below*/
proc sql;
create table temp(where =(mindate=date)) as select
ID, date,min(date) as mindate from have
group by id;
create table final as select count(*) as customer_count,date from temp
group by date;
quit;
/* results:
customer_count Date
3 07.febr.20
2 14.febr.20
1 15.febr.20
*/
Another method cause I dont know hash so well. >_<
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
;
data want;
length Used $200.;
retain Used;
set have;
by Date;
if first.Date then count = .;
if not find(Used,cats(ID)) then do;
count + 1;
Used = catx(',',Used,ID);
end;
if last.Date;
put Date= count=;
run;
If you are not overly concerned with processing speed and want something simple:
proc sort data=have;
by id date;
** Get date of each customer's first unique visit **;
proc sort data=have out=first_visit nodupkey;
by id;
proc freq data=first_visit noprint;
tables date /out=want (keep=date count);
run;
Since I am new to SAS I need some help to understand how to combine the overlap date ranges into one row.I want to combine the overlap date ranges when they have matching Id. If the dates don’t overlap then I want to keep them as it is. IF they over lap by Matching Id and drug code Then it should combine into one line. Please look at the same ple data set which I have below and the expected results:
Current Data set:
ID Drug Code BEG_Date End_Date
1 100 1/1/2018 1/1/2019
1 100 1/1/2018 3/1/2018
1 100 2/1/2018 04/30/2018
1 90 4/1/2018 04/30/2018
1 100 5/1/2018 6/1/2018
1 98 6/1/2018 8/31/2018
1 100 9/1/2018 5/4/2019
Expected results:
ID Drug Code BEG_Date End_Date
1 100 1/1/2018 3/31/2018
1 90 4/1/2018 04/30/2018
1 100 5/1/2018 6/1/2018
1 98 6/2/2018 8/31/2018
1 100 9/1/2018 5/4/2019
I wrote some SAS code but I am combining the dates even when there is no overlap. I want to write some code which should work in SAS.
PROC SORT DATA=Want OUT=ONE;
BY PERSON_ID BEG_DATE DRUG_CODE END_DATE;
RUN;
data TWO (DROP=PERSON_ID2 DRUG_CODE2 BEG_DATE END_DATE
RENAME=(BEG2=BEG_DOS
END2=END_DOS));
SET ONE;
RETAIN BEG2 END2;
PERSON_ID2=LAG1(PERSON_ID);
DRUG_CODE2=LAG1(DRUG_CODE);
IF PERSON_ID2=PERSON_ID AND DRUG_CODE2=DRUG_CODE AND BEG_DATE LE(END2+1) THEN
DO;
BEG2=MIN(BEG_DATE,BEG2);
END2=MAX(END_DATE,END2);
END;
ELSE
DO;
SEG+1;
BEG2=BEG_DATE;
END2=END_DATE;
END;
FORMAT BEG2 END2 MMDDYY10.;
RUN;
DATA THREE(DROP=BEG_DOS END_DOS SEG);
RETAIN BEG_DATE END_DATE;
SET TWO;
BY PERSON_ID SEG;
FORMAT BEG_DATE END_DATE MMDDYY10.;
IF FIRST.SEG THEN
DO;
BEG_DATE=BEG_DOS;
END;
IF LAST.SEG THEN
DO;
END_DATE = END_DOS;
OUTPUT;
END;
RUN;
This is how I would do it. Create an obs for each ID DRUG and DATE. Flag the gaps and summarize by RUN.
data have;
input ID Drug_Code (BEG End)(:mmddyy.);
format BEG End mmddyyd10.;
cards;
1 100 1/1/2018 3/1/2018
1 100 2/1/2018 04/30/2018
1 90 4/1/2018 04/30/2018
1 90 6/1/2018 8/15/2018
1 100 5/1/2018 6/1/2018
1 98 6/1/2018 8/31/2018
1 100 9/1/2018 5/4/2019
;;;;
run;
proc print;
run;
/*1 100 1/1/2018 1/1/2019*/
data exv/ view=exv;
set have;
do date = beg to end;
output;
end;
drop beg end;
format date mmddyyd10.;
run;
proc sort data=exv out=ex nodupkey;
by id drug_code date;
run;
data breaksV / view=BreaksV;
set ex;
by id drug_code;
dif = dif(date);
if first.drug_code then do; dif=1; run=1; end;
if dif ne 1 then run+1;
run;
proc summary data=breaksV nway missing;
class id drug_code run;
var date;
output out=want(drop=_type_) min=Begin max=End;
run;
Proc print;
run;
Computing the extent range composed of overlapping segment ranges requires a good understanding of the range conditions (cases).
Consider the scenarios when sorted by start date (within any larger grouping set, G, such as id and drug)
Let [ and ] be endpoints of a range
# be date values (integers) within
Extent be the combined range that grows
Segment be the range in the current row
Case 1 - Growth. Within G Segment start before Extent end
Segment will either not contribute to Extent or extend it.
[####] Extent
+ [#] Segment range DOES NOT contribute
--------
[####] Extent (do not output a row, still growing)
or
[####] Extent
+ [#####] Segment range DOES contribute
--------
[#######] Extent (do not output a row, still growing)
Case 2 - Terminus. 3 possibilities:
Within G Segment start after Extent end,
Next G reached (different id/drug combination),
End of data reached.
#2 and #3 can be tested by checking the appropriate last. flag.
[####] Extent
+ ..[#] Segment beyond Extent (gap is 2)
--------
[####] output Extent
[#] reset Extent to Segment
You can adjust your rules for Segment being adjacent (gap=0) or close enough (gap < threshold) to mean an Extent is either expanded, or, output and reset to Segment.
Note: The situation is a little more (not shown) complicated for the real world cases of:
missing start means the Segment has an unknown start date (presume it to be epoch (0=01JAN1960, or some date that pre-dates all dates in the data or study)
missing end means the Segment is active today (end date is date when processing data)
Sample code:
data have;
call streaminit(42);
do id = 1 to 10;
do _n_ = 1 to 50;
drug = ceil(rand('UNIFORM', 10));
beg_date = intnx ('MONTH', '01JAN2008'D, rand('UNIFORM',20));
end_date = intnx ('DAY', beg_date, rand('UNIFORM',75));
OUTPUT;
end;
end;
format beg_date end_date yymmdd10.;
run;
proc sort data=have out=segments;
by id drug beg_date end_date;
run;
data want;
set segments;
by id drug beg_date end_date; * will error if incoming data is NOT sorted;
retain ext_beg ext_end;
retain gap_allowed 0; * set to 1 for contiguously adjacent segment ;
if first.drug then do;
ext_beg = beg_date;
ext_end = end_date;
segment_count = 0;
end;
if beg_date <= ext_end + gap_allowed then do;
ext_end = max (ext_end, end_date);
segment_count + 1;
end;
else do;
extent_id + 1;
OUTPUT;
ext_beg = beg_date;
ext_end = end_date;
segment_count = 1;
end;
if last.drug then do;
extent_id + 1;
OUTPUT;
* reset occurs implicitly;
* it will happen at first. logic when control returns to top of step;
end;
format ext_: yymmdd10.;
keep id drug ext_beg ext_end segment_count extent_id;
run;
During some data cleaning process, there is a need to compare the data between different rows. For example, if the rows have the same countryID and subjectID then keep the largest temperature:
CountryID SubjectID Temperature
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
In this case like this, I will use the lag() function as follows.
proc sort table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
CountryID_lag = lag(CountryID);
SubjectID_lag = lag(SubjectID);
Temperature_lag = lag(Temperature);
if CountryID = CountryID_lag and SubjectID = SubjectID_lag then do;
if Temperature < Temperature_lag then delete;
end;
drop CountryID_lag SubjectID_lag Temperature_lag;
run;
The code above may work.
But I still want to know if there are any better ways to solve this kind of questions?
I think you complicate task. You can use proc sql and max function:
proc sql noprint;
create table table_laged as
select CountryID,SubjectID,max(Temperature)
from table
group by CountryID,SubjectID;
quit;
I don't know if you want it that way but you code would keep the highest temperatures
So when you have 2 1 3 for one subject if will keep 3. But when you have 1 4 3 4 4 it will keep 4 4 4. Better is to keep simple the first row for each subject which is the highest because of descending order.
proc sort data = table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
by CountryID SubjectID;
if first.SubjectID;
run;
You can use double DOW technique to:
Compute a measure over a group,
Apply the measure to items in the group.
The benefit of DOW looping is a single pass over the data set when incoming data is already grouped.
In this question, 1. is to identify the row in the group with the first highest temperature, and 2. is to select the row for output.
data want;
do _n_ = 1 by 1 until (last.SubjectId);
set have;
by CountryId SubjectId;
if temperature > _max_temp then do;
_max_temp = temperature;
_max_at_n = _n_;
end;
end;
do _n_ = 1 to _n_;
set have;
if _n_ = _max_at_n then OUTPUT;
end;
drop _:;
run;
The traditional procedural technique is Proc MEANS
data have;input
CountryID SubjectID Temperature; datalines;
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
run;
proc means noprint data=have;
by countryid subjectid;
output out=want(drop=_:) max(temperature)=temperature;
run;
If the data is disordered in CountryID and SubjectID going into the data step, a hash object can be used or SQL per #Aurieli.
I am trying to match max daily data within a month to a monthly data.
data daily;
input permno $ date ret;
datalines;
1000 19860101 88
1000 19860102 90
1000 19860201 70
1000 19860202 55
1001 19860201 97
1001 19860202 74
1001 19860203 79
1002 19860301 55
1002 19860302 100
1002 19860301 10
;
run;
data monthly;
input permno $ date ret;
datalines;
1000 19860131 1
1000 19860228 2
1000 19860331 5
1001 19860331 3
1002 19860430 4
;
run;
The result I want is the following; (I want to match daily max data to one month lag monthly data. )
1000 19860102 90 1000 19860228 2
1000 19860201 70 1000 19860331 5
1001 19860201 97 1001 19860331 3
1002 19860302 100 1002 19860430 4
Below is what I have tried so far.
I want to have maximum ret value within a month so I have created yrmon to assign same yyyymm data for the same month daily data
data a1; set daily;
yrmon=year(date)*100 + month(date);
run;
In order to choose the maximum value(here, ret) within same yrmon group for the same permno, I used code below
proc means data=a1 noprint;
class permno yrmon ;
var ret;
output out= a2 max=maxret;
run;
However, it only got me permno yrmon ret data, leaving the original date data away.
data a3;
set a2;
new=intnx('month',yrmon,1);
format date new yymmn6.;
run;
But it won't work since yrmon is no longer date format.
Thank you in advance.
Hello
I am trying to match two different sets by permno(same company) but with one month lag (eg. daily9 dataset yrmon=198601 and monthly2 dataset yrmon=198602)
it is pretty difficult to handle for me because if I just add +1 in yrmon, 198612 +1 will not be 198701 and I am confused with handling these issues.
Can anyone help?
1) informat date1/date2 yymmn6. is used to read the date in yyyymm format
2) format date1/date2 yymmn6. is used to view the date in yyyymm format
3) intnx("months",b.date2,-1) is used to join the dates with lag of 1 month
data data1;
input date1 value1;
informat date1 yymmn6.;
format date1 yymmn6.;
cards;
200101 200
200212 300
200211 400
;
run;
data data2;
input date2 value2;
informat date2 yymmn6.;
format date2 yymmn6.;
cards;
200101 3000000
200102 4000000
200301 2000000
200212 2000000
;
run;
proc sql;
create table result as
select a.*,b.date2,b.value2 from
data1 a
left join
data2 b
on a.date1 = intnx("months",b.date2,-1);
quit;
My Output:
date1 |value1 |date2 |value2
200101 |200 |200102 |4000000
200211 |400 |200212 |2000000
200212 |300 |200301 |2000000
Let me know in case of any queries.
I have the following data where people in households are sorted by age (oldest to youngest):
data houses;
input HouseID PersonID Age;
datalines;
1 1 25
1 2 20
2 1 32
2 2 16
2 3 14
2 4 12
3 1 44
3 2 42
3 3 10
3 4 5
;
run;
I would like to calculate for each household the maximum age difference between consecutively aged people. So this example would give values of 5 (=25-20), 16 (=32-16) and 32 (=42-10) for households 1, 2 and 3 consecutively.
I could do this using lots of merges (i.e. extract person 1, merge with extract of person 2, and so on), but as there can be upto 20+ people in a household I'm looking for a much more direct method.
Here's a two pass solution. Same first step as the two solutions above, sort by age. In the second step keep track of max_diff per row, at the last record of HouseID output the results. This results in only two passes through the data.
proc sort data=houses; by houseid age;run;
data want;
set houses;
by houseID;
retain max_diff 0;
diff = dif1(age)*-1;
if first.HouseID then do;
diff = .; max_diff=.;
end;
if diff>max_diff then max_diff=diff;
if last.houseID then output;
keep houseID max_diff;
run;
proc sort data=houses; by houseid personid age;run;
data _t1;
set houses;
diff = dif1(age) * (-1);
if personid = 1 then diff = .;
run;
proc sql;
create table want as
select houseid, max(diff) as Max_Diff
from _t1
group by houseid;
proc sort data = house;
by houseid descending age;
run;
data house;
set house;
by houseid;
lag_age = lag1(age);
if first.houseid then age_diff = 0;
age_diff = lag_age - age;
run;
proc sql;
select houseid,max(age_diff) as max_age_diff
from house
group by houseid;
quit;
Working:
First sort the data set using houseid and descending Age.
Second data step will calculate difference between current age value (in PDV) and previous age value in PDV. Then, using sql procedure, we can get the max age difference for each houseid.
Just throwing one more into the mix. This one is a condensed version of Reeza's response.
/* No need to sort by PersonID as age is the only concern */
proc sort data = houses;
by HouseID Age;
run;
data want;
set houses;
by HouseID;
/* Keep the diff when a new row is loaded */
retain diff;
/* Only replace the diff if it is larger than previous */
diff = max(diff, abs(dif(Age)));
/* Reset diff for each new house */
if first.HouseID then diff = 0;
/* Only output the final diff for each house */
if last.HouseID;
keep HouseID diff;
run;
Here is an example using FIRST. and LAST. with one pass (after sort) through the data.
data houses;
input HouseID PersonID Age;
datalines;
1 1 25
1 2 20
2 1 32
2 2 16
2 3 14
2 4 12
3 1 44
3 2 42
3 3 10
3 4 5
;
run;
Proc sort data=HOUSES;
by houseid descending age ;
run;
Data WANT(keep=houseid max_diff);
format houseid max_diff;
retain max_diff age1 age2;
Set HOUSES;
by houseid descending age ;
if first.houseid and last.houseid then do;
max_diff=0;
output;
end;
else if first.houseid then do;
call missing(max_diff,age1,age2);
age1=age;
end;
else if not(first.houseid or last.houseid) then do;
age2=age;
temp=age1-age2;
if temp>max_diff then max_diff=temp;
age1=age;
end;
else if last.houseid then do;
age2=age;
temp=age1-age2;
if temp>max_diff then max_diff=temp;
output;
end;
Run;