I have three columns in my dataset: ID, Date1 and Date2. I need to fill in the missing values of Date1 according to Date2. In other words, if a Date1 for the same ID is earlier than Date2, then the missing Date1 should take that same value.
My original data is:
ID |Date1 | Date2
84 | |13JAN2015
84 |12NOV2014 |05FEB2015
84 |12NOV2014 |07FEB2015
106 | |13JAN2015
106 |09MAY2014 |05FEB2015
106 |09MAY2014 |07FEB2015
106 |09MAY2014 |09MAR2015
153 |16JAN2015 |08OCT2015
153 |16JAN2015 |12NOV2015
155 | |13JAN2015
155 |01JUN2014 |05FEB2015
155 |25APR2015 |12NOV2015
155 |25APR2015 |28NOV2015
And I want to obtain this result:
ID |Date1 | Date2
84 |12NOV2014 |13JAN2015
84 |12NOV2014 |05FEB2015
84 |12NOV2014 |07FEB2015
106 |09MAY2014 |13JAN2015
106 |09MAY2014 |05FEB2015
106 |09MAY2014 |07FEB2015
106 |09MAY2014 |09MAR2015
153 |16JAN2015 |08OCT2015
153 |16JAN2015 |12NOV2015
155 |01JUN2014 |13JAN2015
155 |01JUN2014 |05FEB2015
155 |25APR2015 |12NOV2015
155 |25APR2015 |28NOV2015
Assuming Date1 is the birth date and Date2 is the collection information date, then all information that was collected after birth (for the same ID) must belong to the same birth date, thus filling in the blanks.
I've tried to use an IF statement, but it didn't work.
Could someone help me?
You could use the lag() function which allows you to read the value of the previous record in the data step loop.
data have;
infile datalines delimiter='|';
length Date1$10 Date2$10;
input ID Date1 $ Date2 $;
datalines;
84 | |13JAN2015
84 |12NOV2014 |05FEB2015
84 |12NOV2014 |07FEB2015
106 | |13JAN2015
106 |09MAY2014 |05FEB2015
106 |09MAY2014 |07FEB2015
106 |09MAY2014 |09MAR2015
153 |16JAN2015 |08OCT2015
153 |16JAN2015 |12NOV2015
155 | |13JAN2015
155 |01JUN2014 |05FEB2015
155 |25APR2015 |12NOV2015
155 |25APR2015 |28NOV2015
;
run;
*** convert to dates;
data have;
set have;
format d2 d1 Date9.;
*Date1 = COMPRESS( tranwrd(Date1,'/',''),,'s');
d1 = input(Date1,Date9.);
*Date2 = COMPRESS( tranwrd(Date2,'/',''),,'s');
d2 = input(Date2,Date9.);
rename d1=Date1 d2=Date2;
drop Date1 Date2;
run;
*** sort;
proc sort data=have out=want;
by id descending Date2;
run;quit;
*** if missing then fill in previous value;
data want;
set want;
lagDate1=lag(Date1);
if Date1=. then Date1=lagDate1;
drop lagDate1;
run;
*** sort back to normal;
proc sort data=want;
by id Date2;
run;quit;
In short: sort the data, copy the previous value when Date1 is missing, then sort back to the original order.
You might want to add a rule to verify that ID is the same as previous record before copying the date from it.
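One way to add that check, as a minimal sketch: process the descending-sorted data with a BY statement and carry the last non-missing Date1 in a retained variable that is reset at the start of each ID. The variable name last_date1 is made up for illustration; the dataset name follows the code above. Unlike lag(), this should also fill several consecutive missing rows within one ID.

```sas
/* Assumes WANT is already sorted by id descending Date2, as above. */
data want;
  set want;
  by id;
  retain last_date1;                     /* hypothetical helper variable */
  if first.id then last_date1 = .;       /* never carry a date across IDs */
  if Date1 = . then Date1 = last_date1;  /* fill from the same ID only */
  else last_date1 = Date1;               /* remember the latest non-missing value */
  drop last_date1;
run;
```

This step would stand in for the lag() data step; the final sort back by id and Date2 is still needed.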
I have a large dataset named Planes with missing values in the arrival delay column (Arr_Delay). I want to replace those missing values with the average delay on the specific route (origin to dest) for the specific carrier.
Here is a sample of the dataset:
date carrier Flight tailnum origin dest Distance Air_time Arr_Delay
01-01-2013 UA 1545 N14228 EWR IAH 1400 227 17
01-01-2013 UA 1714 N24211 LGA IAH 1416 227 .
01-01-2013 AA 1141 N619AA JFK MIA 1089 160 .
01-01-2013 EV 5708 N829AS LGA IAD 229 53 -18
01-01-2013 B6 79 N593JB JFK MCO 944 140 14
01-01-2013 AA 301 N3ALAA LGA ORD 733 138 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 .
01-01-2013 B6 71 N657JB JFK TPA 1005 158 19
01-01-2013 UA 194 N29129 JFK LAX 2475 345 23
01-01-2013 UA 1124 N53441 EWR SFO 2565 361 -29
Code I tried:
Proc stdize data=cs1.Planes reponly method=mean out=cs1.Complete_data;
var Arrival_delay_minutes;
Run;
But as my problem states, I want the mean for the specific route and specific carrier to replace each missing value. Please help me with this!
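PROC STDIZE has no CLASS statement, so one alternative is to compute the group means with PROC MEANS and merge them back. You can use the code below to complete your task:
Proc means data=cs1.Planes noprint nway; /* nway keeps only the full carrier*origin*dest combinations */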
var Arr_Delay;
class carrier origin dest;
output out=mean1;
Run;
proc sort data=cs1.Planes;
by carrier origin dest;
run;
proc sort data=mean1;
by carrier origin dest;
run;
data cs1.Complete_data(drop=Arr_Delay1 _stat_);
merge cs1.Planes(in=a) mean1(where=(_stat_="MEAN")
keep=carrier origin dest Arr_Delay _stat_
rename=(Arr_Delay = Arr_Delay1) in=b);
by carrier origin dest;
if a;
if Arr_Delay =. then Arr_Delay=Arr_Delay1;
run;
You just need to sort the table cs1.Planes by origin, dest & carrier before running Proc stdize and add by origin dest carrier; to do the grouping you wanted. The only case the values will remain missing is when there are no other values for this carrier/route.
You can find the SAS documentation here and available options here.
Code:
data have;
informat date ddmmyy10.;
format date ddmmyy10.;
input
date carrier $ Flight tailnum $ origin $ dest $ Distance Air_time Arr_Delay;
datalines;
01-01-2013 UA 1545 N14228 EWR IAH 1400 227 17
01-01-2013 UA 1714 N24211 LGA IAH 1416 227 .
01-01-2013 AA 1141 N619AA JFK MIA 1089 160 .
01-01-2013 EV 5708 N829AS LGA IAD 229 53 -18
01-01-2013 B6 79 N593JB JFK MCO 944 140 14
01-01-2013 AA 301 N3ALAA LGA ORD 733 138 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 15
01-01-2013 B6 71 N657JB JFK TPA 1005 158 19
01-01-2013 UA 194 N29129 JFK LAX 2475 345 23
01-01-2013 UA 1124 N53441 EWR SFO 2565 361 -29
;
run;
proc sort data=work.have; by origin dest carrier; run;
Proc stdize data=work.have reponly method=mean out=work.Complete_data ;
var Arr_Delay;
by origin dest carrier ;
Run;
This is my problem: I have a dataset that has 10 measurements over time, something like this:
ID Expenditure Age
25 100 89
25 102 89
25 178 89
25 290 89
25 200 89
.
.
.
26 100 79
26 102 79
26 178 79
26 290 79
26 200 79
.
.
.
27 100 80
27 102 80
27 178 80
27 290 80
27 200 80
.
.
.
Now I want to obtain the frequency of age, so I did this:
proc freq data=Expenditure;
table Age / out= Age_freq outexpect sparse;
run;
Output:
Age Frequency Count Percent of total frequency
79 10 0.1
80 140 1.4
89 50 0.5
The problem is that this counts all rows, but doesn't take into account the repeated measurements per ID. So I wanted to create a new column with the actual frequencies like this:
data Age;
set Age_freq;
freq = Frequency Count /10;
run;
but I think SAS doesn't recognize this 'Frequency Count' variable. Can anybody give me some insight on this?
thanks
You have to remove the duplicate records so that each ID has one record containing the age.
Solution: create a new table with the distinct values of ID and Age, then run PROC FREQ on it.
Code:
I created a new table called Expenditure_ids that doesn't have any duplicate values for the ID & Age.
data Expenditure;
input ID Expenditure Age ;
datalines;
25 100 89
25 102 89
25 178 89
25 290 89
25 200 89
26 100 79
26 102 79
26 178 79
26 290 79
26 200 79
27 100 80
27 102 80
27 178 80
27 290 80
27 200 80
28 100 80
28 102 80
28 178 80
28 290 80
28 200 80
;
run;
proc sql;
create table Expenditure_ids as
select distinct ID, Age from Expenditure ;
quit;
proc freq data=Expenditure_ids;
table Age / out= Age_freq outexpect sparse;
run;
Output:
Age=79 COUNT=1 PERCENT=25
Age=80 COUNT=2 PERCENT=50
Age=89 COUNT=1 PERCENT=25
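On the error in the question itself: 'Frequency Count' is only the label of the column. In the OUT= dataset written by PROC FREQ the variables are named COUNT and PERCENT. If the division were applied to an Age_freq built from the original, non-deduplicated table, it could be sketched as follows, assuming every ID really has exactly 10 rows (the deduplication approach does not need that assumption):

```sas
data Age;
  set Age_freq;
  freq = count / 10;   /* COUNT is the variable name; 'Frequency Count' is its label */
run;
```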
I am practicing SAS programming with two SAS datasets, sample and master. The dummy data below illustrates my problem. I would like to extract data from the master dataset (test) for the IDs in the sample dataset. For each ID and yearmonth in the sample, I need the next 12 months of information from the master table (the desired output is shown in the third listing below).
Below is the code I use to extract the previous 12 months of data, but I can't work out how to extract the next 12 months in the same way. Can anyone help me solve this efficiently in SAS?
proc sort data=test;
by id yearmonth;
run;
data result;
set test;
array prev_month {13} PREV_MONTH_0-PREV_MONTH_12;
by id;
if first.id then do;
do i =1 to 13;
prev_month(i)=0;
end;
end;
do i = 13 to 2 by -1;
prev_month(i)=prev_month(i-1);
end;
prev_month(1)=no_of_cust;
drop i prev_month_0;
retain prev_month:;
run;
data sample1;
set sample(drop=no_of_cust);
run;
proc sort data=sample1;
by id yearmonth;
run;
data all;
merge sample1(in=a) result(in=b);
by id yearmonth;
if a;
run;
One sample dataset (dataset name - sample).
ID YEARMONTH NO_OF_CUST
1 200909 50
1 201005 65
1 201008 78
1 201106 95
2 200901 65
2 200902 45
2 200903 69
2 201005 14
2 201006 26
2 201007 98
One master dataset - dataset name (test) (huge dataset over the year for each id from start of the account to till date.)
ID YEARMONTH NO_OF_CUST
1 200808 125
1 200809 125
1 200810 111
1 200811 174
1 200812 98
1 200901 45
1 200902 74
1 200903 73
1 200904 101
1 200905 164
1 200906 104
1 200907 22
1 200908 35
1 200909 50
1 200910 77
1 200911 86
1 200912 95
1 201001 95
1 201002 87
1 201003 79
1 201004 71
1 201005 65
1 201006 66
1 201007 66
1 201008 78
1 201009 88
1 201010 54
1 201011 45
1 201012 100
1 201101 136
1 201102 111
1 201103 17
1 201104 77
1 201105 111
1 201106 95
1 201107 79
1 201108 777
1 201109 758
1 201110 32
1 201111 15
1 201112 22
2 200711 150
2 200712 150
2 200801 44
2 200802 385
2 200803 65
2 200804 66
2 200805 200
2 200806 333
2 200807 285
2 200808 265
2 200809 222
2 200810 220
2 200811 205
2 200812 185
2 200901 65
2 200902 45
2 200903 69
2 200904 546
2 200905 21
2 200906 256
2 200907 214
2 200908 14
2 200909 44
2 200910 65
2 200911 88
2 200912 79
2 201001 65
2 201002 45
2 201003 69
2 201004 54
2 201005 14
2 201006 26
2 201007 98
The desired output should look like this:
ID YEARMONTH NO_OF_CUST AFTER_MONTH_1 AFTER_MONTH_2 AFTER_MONTH_3 AFTER_MONTH_4 AFTER_MONTH_5 AFTER_MONTH_6 AFTER_MONTH_7 AFTER_MONTH_8 AFTER_MONTH_9 AFTER_MONTH_10 AFTER_MONTH_11 AFTER_MONTH_12
1 200909 50 77 86 95 95 87 79 71 65 66 66 78 88
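Step 1: Join your sample table with the main (test) table, using intnx to get all the values for the next 12 months. Note that intnx works on SAS date values, so this assumes yearmonth is stored as a date rather than as a plain number such as 200909.
Step 2: Create a column holding the names "after_month_0", "after_month_1", and so on.
Step 3: Transpose to get your final output.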
proc sql;
create table abc as
select a.id,a.yearmonth,b.yearmonth as yearmonth1, b.no_of_cust
from
sample a
left join
test b
on a.id = b.id and a.yearmonth <= b.yearmonth <= intnx("month",a.yearmonth,12)
order by a.id,a.yearmonth,b.yearmonth;
quit;
data abc1(drop=col yearmonth1);
set abc;
by id yearmonth;
if first.yearmonth then col=-1;
col+1;
columns = compress("after_month_"||col);
run;
proc transpose data=abc1 out=abc2(rename=(after_month_0 = no_of_cust) drop=_name_);
by id yearmonth;
id columns;
var no_of_cust;
run;
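If yearmonth is stored as a plain number such as 200909, as the sample listings suggest, it could be converted to a SAS date before the join. A minimal sketch, assuming numeric YYYYMM values; the names test_dt, sample_dt and ym_date are made up for illustration:

```sas
/* Convert numeric YYYYMM (e.g. 200909) to a SAS date on the first of the month */
data test_dt;
  set test;
  ym_date = input(put(yearmonth, 6.), yymmn6.);
  format ym_date yymmn6.;
run;

data sample_dt;
  set sample;
  ym_date = input(put(yearmonth, 6.), yymmn6.);
  format ym_date yymmn6.;
run;
```

The join condition would then compare ym_date instead of yearmonth.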
Or, if you would rather adapt your own query, you could use the code below.
proc sort data=test;
by id descending yearmonth;
run;
data result;
set test;
array after_month {13} after_MONTH_0-after_MONTH_12;
by id;
if first.id then do;
do i = 1 to 13;
after_month(i) = 0;
end;
end;
do i = 13 to 2 by -1;
after_month(i) = after_month(i-1);
end;
after_month(1) = NO_OF_CUST;
drop i after_MONTH_0;
retain after_MONTH:;
run;
data sample1;
set sample(drop=no_of_cust);
run;
proc sort data=result;
by id yearmonth;
run;
proc sort data=sample1;
by id yearmonth;
run;
data all;
merge sample1(in=a) result(in=b);
by id yearmonth;
if a;
run;
Let me know in case of any queries.
I would like a more efficient way to perform this imputation. What I want is for the value of the variable CID to be copied to the rows where CID is missing, for instance having CID=1818 also reported for days 1, 14, 28 and 42.
The program that I wrote works fine, but I would like to know if there is a simpler way to perform this action. Note that here RETAIN can't be used.
DATA test;
/* column input: cid in columns 1-4, days in 6-7, CH in 9-11, CL in 13-14 */
input cid $ 1-4 days $ 6-7 CH 9-11 CL 13-14;
cards;
1818 -2 117 46
     1  107 45
     14 97  46
     28 104 46
     42 106 44
5684 -2 100 62
     1  58  78
     14 87  46
     28 102 45
     42 155 41
;
RUN;
options mprint mlogic symbolgen;
%macro lag(var,num);
%do i=2 %to &num.;
sub&i.=lag&i.(&var);
if cid=' ' then cid=sub&i.;
/*drop sub&i.;*/
%end;
%mend lag;
data test_1 ;set test;
sub=lag(cid);
if cid=' ' then cid=sub;
%lag(cid,5);
run;
Why can't Retain be used?
data want;
set test;
retain _cid;
if not missing(cid) then _cid=cid;
else cid=_cid;
drop _cid;
run;
I want to check whether the operator (OP_NAME) changes within an SN. If the operator changes, output all the corresponding rows (SN, OP_NAME, Value) to Table1: OP_CHANGE; otherwise output them to Table2: OP_UNQ.
Condition: if the operator changes even once within an SN, output all rows for that SN to the OP_CHANGE table.
SN OP_Name Value
109029 SPAN 150
109029 SPAN 235
109032 SPAN 550
109033 SPAN 650
109033 SPAN 700
109036 FRAN 124
109036 SECURIT 224
109036 SECURIT 560
109036 SECURIT 752
109037 AOM 44
109037 SECA 58
109037 SECURIT 85
109037 SECURIT 98
109038 FRAN 45
109038 SECURIT 47
109038 SECURIT 58
109038 SECURIT 65
109039 GOVER 33
109039 GOVER 45
109039 GOVER 48
109041 SOREL 45
109041 SOREM 55
109041 INA 45
109043 SPAN 96
109044 SPAN 53
109045 SOREM 25
109045 SOREM 65
I want to see output tables like
Table1:OP_CHANGE
SN OP Name value
109036 FRAN 124
109036 SECURIT 224
109036 SECURIT 560
109036 SECURIT 752
109037 AOM 44
109037 SECA 58
109037 SECURIT 85
109037 SECURIT 98
109038 FRAN 45
109038 SECURIT 47
109038 SECURIT 58
109038 SECURIT 65
109041 SOREM 45
109041 SOREM 55
109041 INA 45
Table2:OP_UNQ
SN OP Name value
109029 SPAN 150
109029 SPAN 235
109032 SPAN 550
109033 SPAN 650
109033 SPAN 700
109039 GOVER 33
109039 GOVER 45
109039 GOVER 48
109043 SPAN 96
109044 SPAN 53
109045 SOREM 25
109045 SOREM 65
You could do something like the following:
data op_change(drop=change) op_unq(drop=change);
set have(in=h1) have;
by sn;
length change $1;
retain change;
if h1 then do;
if lag(op_name) ne op_name then change = 'Y';
if first.sn then change = 'N';
end;
else do;
if change = 'Y' then output op_change;
else output op_unq;
end;
run;
This should work, assuming your input table is called table:
proc sql;
/* Find SN values with different OP_NAMEs */
create table op_change as
select t.*
from table t
inner join
(select SN, count(distinct OP_NAME)
from table
group by SN
having count(distinct OP_NAME) > 1) x
on t.SN = x.SN;
/* Find SN values with unique OP_NAMEs */
create table op_unq as
select t.*
from table t
inner join
(select SN, count(distinct OP_NAME)
from table
group by SN
having count(distinct OP_NAME) = 1) x
on t.SN = x.SN;
quit;