I have a dataset which looks like
ID STATUS YEAR AMOUNT DT_1
. OPEN 2010 12 12
. OPEN 2009 24 10
. OPEN 2008 32 1
AA CLOSE 2015 150 12
AA CLOSE 2014 200 10
AA CLOSE 2010 10 8
AA CLOSE 2009 20 7
AA CLOSE 2008 18 5
AA OPEN 2012 21 8
AA OPEN 2001 20 7
AA OPEN 2000 18 5
Column DT_1 may take from a max of 12 to a min of 1.
I would like to calculate the amount within this range each time, which means assigning the previous year's amount to the current year.
I would expect something like this:
ID STATUS YEAR AMOUNT DT_1
. OPEN 2010 12 24
. OPEN 2009 24 32
. OPEN 2008 32 .
AA CLOSE 2015 150 200
AA CLOSE 2014 200 10
AA CLOSE 2010 10 20
AA CLOSE 2009 20 18
AA CLOSE 2008 18 .
AA OPEN 2012 21 20
AA OPEN 2001 20 18
AA OPEN 2000 18 .
I have tried as follows
proc sql;
create table tab1 as
select ID, status, year, sum(amount) as tot_amount, dt_1
from tab
group by 1,2,3;
quit;
but it does not give me the expected output.
EDIT: I had to edit the question as the expected output was different.
So DT_1 is the amount from the previous year? If so, it would be a lot easier if the data were sorted by increasing value of YEAR, instead of decreasing as displayed in the question. Then you can just use the LAG() function.
proc sort data=HAVE out=WANT ;
by id status year ;
run;
data WANT;
set want ;
by id status year;
dt_1 = lag(amount);
if first.status then dt_1=.;
run;
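To cross-check the logic outside SAS, here is a minimal Python sketch of the same idea: sort by id, status and year, then carry the previous amount forward within each id/status group. The function name add_previous_amount and the tuple layout are assumptions made for illustration; None stands in for the SAS missing value.

```python
from itertools import groupby

# (id, status, year, amount) rows copied from the question
rows = [
    (".", "OPEN", 2010, 12), (".", "OPEN", 2009, 24), (".", "OPEN", 2008, 32),
    ("AA", "CLOSE", 2015, 150), ("AA", "CLOSE", 2014, 200),
    ("AA", "CLOSE", 2010, 10), ("AA", "CLOSE", 2009, 20),
    ("AA", "CLOSE", 2008, 18),
    ("AA", "OPEN", 2012, 21), ("AA", "OPEN", 2001, 20), ("AA", "OPEN", 2000, 18),
]

def add_previous_amount(rows):
    """Return rows sorted by id, status, year with the prior amount appended."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    out = []
    for _, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        previous = None  # reset at the start of each id/status group
        for id_, status, year, amount in group:
            out.append((id_, status, year, amount, previous))
            previous = amount  # becomes DT_1 for the next year in the group
    return out

result = add_previous_amount(rows)
```

The result is in ascending YEAR order; sort it descending afterwards if you need the layout shown in the question.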
See if this is what you want
data have;
input ID $ STATUS $ YEAR AMOUNT;
datalines;
. OPEN 2010 12
. OPEN 2009 24
. OPEN 2008 32
AA CLOSE 2015 150
AA CLOSE 2014 200
AA CLOSE 2010 10
AA CLOSE 2009 20
AA CLOSE 2008 18
AA OPEN 2012 21
AA OPEN 2001 20
AA OPEN 2000 18
;
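data want(drop = i s);
/* look-ahead merge: pair each row with the row that follows it */
merge have
have(firstobs = 2 keep = ID STATUS amount
rename = (ID = i STATUS = s amount = DT_1));
/* the next row belongs to another group, so there is no previous year */
if ID ne i or STATUS ne s then DT_1 = .;
run;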
I'm working with panel data in Stata, and I have a set up like the following:
ID year value
1 2010
1 2011 20
1 2012 20
1 2013
1 2014
2 2010
2 2011 14
2 2012 14
2 2013 14
2 2014 14
and I want to change the blank entries to be the same as the other entries within that ID, for any year. I.e., I want something like the following:
ID year value
1 2010 20
1 2011 20
1 2012 20
1 2013 20
1 2014 20
2 2010 14
2 2011 14
2 2012 14
2 2013 14
2 2014 14
What do you recommend?
If the values of the variable value are always the same within id, you can use this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int year byte value
1 2010 .
1 2011 20
1 2012 20
1 2013 .
1 2014 .
2 2010 .
2 2011 14
2 2012 14
2 2013 14
2 2014 14
end
*Get mean of values within id
bysort id : egen value2 = mean(value)
*Transfer values back to original var to maintain var labels etc. then drop value2
replace value = value2
drop value2
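The same group-mean fill can be sketched in Python as a sanity check. This is an illustration only, not Stata: the function name fill_within_id is made up, None stands in for a missing value, and the sketch relies on the same assumption as above, namely that the non-missing values agree within each id.

```python
from collections import defaultdict
from statistics import mean

# (id, year, value) rows from the example; None marks a blank entry
panel = [
    (1, 2010, None), (1, 2011, 20), (1, 2012, 20), (1, 2013, None), (1, 2014, None),
    (2, 2010, None), (2, 2011, 14), (2, 2012, 14), (2, 2013, 14), (2, 2014, 14),
]

def fill_within_id(panel):
    """Replace every value with the mean of the observed values for its id."""
    observed = defaultdict(list)
    for id_, _, value in panel:
        if value is not None:
            observed[id_].append(value)
    group_mean = {id_: mean(values) for id_, values in observed.items()}
    # ids with no observed value at all stay missing (None)
    return [(id_, year, group_mean.get(id_)) for id_, year, _ in panel]

filled = fill_within_id(panel)
```

If the observed values differ within an id, the mean silently blends them, so it is worth checking that assumption before overwriting the original variable.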
I have an example table as below
id term subj prof hour
20 2016 COM James 4
20 2016 COM Henrey 4
30 2016 HUM Nelly 3
30 2016 HUM John 3
30 2016 HUM Jimmy 3
45 2016 CGS Tim 3
I need to divide hour when id, term and subj are the same. There are 2 different profs with the same id (20), term and subj, so hour is divided by 2.
There are 3 different profs with the same id (30), term and subj, so hour is divided by 3.
So the output should look like this:
id term subj prof hour
20 2016 COM James 2
20 2016 COM Henrey 2
30 2016 HUM Nelly 1
30 2016 HUM John 1
30 2016 HUM Jimmy 1
45 2016 CGS Tim 3
In SAS you can use a double DOW loop to achieve this, once the data has been sorted in the correct order. The first loop counts how many profs there are with the same id, term and subj. The second loop divides hour by the number of profs. The loops are performed at each change of id, term or subj.
I've created a new_hour variable and kept the temporary _counter variable just so you can see the code working; you can of course overwrite the hour variable and drop _counter if you wish.
/* create initial dataset */
data have;
input id term subj $ prof $ hour;
datalines;
20 2016 COM James 4
20 2016 COM Henrey 4
30 2016 HUM Nelly 3
30 2016 HUM John 3
30 2016 HUM Jimmy 3
45 2016 CGS Tim 3
;
run;
/* sort data */
proc sort data=have;
by id term subj prof;
run;
/* create output dataset */
data want;
do until(last.subj); /* 1st loop*/
set have;
by id term subj prof;
if first.subj then _counter=0; /* reset counter when id, term or subj change */
_counter+first.prof; /* count number of times prof changes */
end;
do until(last.subj); /* 2nd loop */
set have;
by id term subj;
new_hour=hour / _counter; /* divide hour by number of profs from 1st loop */
output; /* output record */
end;
run;
Assuming your problem is as simple as the one you gave as an example, one proc sql should suffice. If it is more complicated, please explain how so we can be more helpful!
data have;
input id term subj $ prof $ hour;
datalines;
20 2016 COM James 4
20 2016 COM Henrey 4
30 2016 HUM Nelly 3
30 2016 HUM John 3
30 2016 HUM Jimmy 3
45 2016 CGS Tim 3
;
run;
proc sql;
create table want as select
*, hour / count(prof) as hour_adj
from have
group by id, term, subj;
quit;
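For comparison, the count-then-divide logic can be sketched in plain Python. The function name split_hours is hypothetical, and the sketch assumes each prof appears once per id/term/subj group, so counting rows is the same as counting profs.

```python
from collections import Counter

# (id, term, subj, prof, hour) rows from the question
rows = [
    (20, 2016, "COM", "James", 4), (20, 2016, "COM", "Henrey", 4),
    (30, 2016, "HUM", "Nelly", 3), (30, 2016, "HUM", "John", 3),
    (30, 2016, "HUM", "Jimmy", 3),
    (45, 2016, "CGS", "Tim", 3),
]

def split_hours(rows):
    """Divide hour by the number of rows sharing the same id, term and subj."""
    # first pass: count the rows in each group
    group_size = Counter((id_, term, subj) for id_, term, subj, _, _ in rows)
    # second pass: divide each hour by its group's count
    return [
        (id_, term, subj, prof, hour / group_size[(id_, term, subj)])
        for id_, term, subj, prof, hour in rows
    ]

adjusted = split_hours(rows)
```

The two passes mirror the two loops of the double DOW approach: one to count, one to divide.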
I'm trying to improve the processing time used via an already existing for-loop in a *.jsl file my classmates and I are using in our programming course using SAS. My question: is there a PROC or sequence of statements that exist that SAS offers that can replicate a search and match condition? Or a way to go through unsorted files without going line by line looking for matching condition(s)?
Our current script file is below:
if( roadNumber_Fuel[n]==roadNumber_TO[m] &
fuelDate[n]>=tripStart[m] & fuelDate[n]<=TripEnd[m],
newtripID[n] = tripID[m];
);
I have 2 sets of data simplified below.
DATA1:
ID1 Date1
1 May 1, 2012
2 Jun 4, 2013
3 Aug 5, 2013
..
.
&
DATA2:
ID2 Date2 Date3 TRIP_ID
1 Jan 1 2012 Feb 1 2012 9876
2 Sep 5 2013 Nov 3 2013 931
1 Dec 1 2012 Dec 3 2012 236
3 Mar 9 2013 May 3 2013 390
2 Jun 1 2013 Jun 9 2013 811
1 Apr 1 2012 May 5 2012 76
...
..
.
I need to check a lot of iterations but my goal is to have the code
check:
Data1.ID1 = Data2.ID2 AND (Date1 >Date2 and Date1 < Date3)
My desired output dataset would be
ID1 Date1 TRIP_ID
1 May 1, 2012 76
2 Jun 4, 2013 811
Thanks for any insight!
You can do range matches in two ways. First off, you can match using PROC SQL if you're familiar with SQL:
proc sql;
create table C as
select * from A
left join B
on A.id=B.id and A.date > B.date1 and A.date < B.date2
;
quit;
Second, you can create a format. This is usually the faster option if it's possible to do this. This is tricky when you have IDs, but you can do it.
First, create a new variable, ID+date. Dates are numbers around 18,000-20,000, so multiply your ID by 100,000 and you're safe.
Second, create a dataset from the range dataset where START=lower date + id*100,000, END=higher date + id*100,000, and FMTNAME=some string that will become the format name (it must start with A-Z or _, contain only A-Z, _ and digits, and must not end in a digit). LABEL is the value you want to retrieve (Trip_ID in the above example).
data b_fmts;
set b;
start=id*100000+date1;
end =id*100000+date2;
label=value_you_want_out;
fmtname='MYDATEF';
run;
Then use PROC FORMAT with the CNTLIN= option to import the formats.
proc format cntlin=b_fmts;
quit;
Make sure your date ranges don't overlap - if they do this will fail.
Then you can use it easily:
data a_match;
set a;
trip_id=put(id*100000+date,MYDATEF.);
run;
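To see why the composite key works, here is a small Python sketch of the same lookup using the sample data from the question. It is an illustration under stated assumptions, not a SAS replacement: the names key and lookup are made up, dates are converted to days since 1 January 1960 to mimic SAS date values, the ranges are assumed not to overlap, and the bounds are treated as inclusive (the question asks for strict inequalities, so adjust if the endpoints matter).

```python
from bisect import bisect_right
from datetime import date

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this date

def key(id_, d):
    """Combine id and date into one number, as in id*100000+date."""
    return id_ * 100_000 + (d - SAS_EPOCH).days

# (ID2, Date2, Date3, TRIP_ID) rows from DATA2 in the question
trips = [
    (1, date(2012, 1, 1), date(2012, 2, 1), 9876),
    (2, date(2013, 9, 5), date(2013, 11, 3), 931),
    (1, date(2012, 12, 1), date(2012, 12, 3), 236),
    (3, date(2013, 3, 9), date(2013, 5, 3), 390),
    (2, date(2013, 6, 1), date(2013, 6, 9), 811),
    (1, date(2012, 4, 1), date(2012, 5, 5), 76),
]

# sorted (start, end, label) triples play the role of the CNTLIN dataset
ranges = sorted((key(i, lo), key(i, hi), trip) for i, lo, hi, trip in trips)
starts = [r[0] for r in ranges]

def lookup(id_, d):
    """Return the trip whose range contains (id_, d), or None if no range does."""
    k = key(id_, d)
    pos = bisect_right(starts, k) - 1  # last range starting at or before k
    if pos >= 0 and ranges[pos][0] <= k <= ranges[pos][1]:
        return ranges[pos][2]
    return None
```

With this data, lookup(1, date(2012, 5, 1)) finds trip 76 and lookup(3, date(2013, 8, 5)) finds nothing, which matches the desired output in the question. Because the id occupies the digits above 100,000, a date can never fall into another id's range.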
I have a table as below:
id term subj degree
18 2007 ww Yes
32 2015 AA Yes
32 2016 AA No
25 2011 NM No
25 2001 ts No
18 2009 ww Yes
18 2010 ww No
I need a new variable term2. When degree is Yes, term2 should hold the term of the next row with the same id and subj; otherwise it should be 0. That is:
id term subj degree term2
18 2007 ww Yes 2009
32 2015 AA Yes 2016
32 2016 AA No 0
25 2011 NM No 0
25 2001 ts No 0
18 2009 ww Yes 2010
18 2010 ww No 0
What I tried with if/then/else doesn't work. Any ideas? Thank you.
This is the code I used:
data have;
merge aa aa (rename=(id=id1 subj=subj1
term=term1);
term2=0;
if id=id1 and subj=subj1 and degree="Yes" then
term2=term1
run;
data have;
input id term subj $ degree $;
cards;
32 2015 AA Yes
32 2016 AA No
25 2011 NM No
25 2001 ts No
18 2007 ww Yes
18 2010 ww No
;
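data want;
/* look-ahead merge: pair each row with the row that follows it */
merge have have(firstobs=2 keep=id subj term rename=(id=_id subj=_subj term=_term));
term2=0;
/* copy the next term only when the next row has the same id and subj */
if id=_id and subj=_subj and degree='Yes' then term2=_term;
drop _:;
run;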
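The look-ahead idea can be sketched in Python to check the expected result. The function name add_term2 is hypothetical, the rows are the ones from the data step above, and the sketch compares both id and subj with the following row, as the question asks. It assumes the rows are already ordered so that matching rows are adjacent.

```python
# (id, term, subj, degree) rows, already grouped so matching rows are adjacent
rows = [
    (32, 2015, "AA", "Yes"),
    (32, 2016, "AA", "No"),
    (25, 2011, "NM", "No"),
    (25, 2001, "ts", "No"),
    (18, 2007, "ww", "Yes"),
    (18, 2010, "ww", "No"),
]

def add_term2(rows):
    """Append term2: the next row's term when id and subj match and degree is Yes."""
    out = []
    # pair each row with the one after it; the last row is paired with None
    for current, following in zip(rows, rows[1:] + [None]):
        id_, term, subj, degree = current
        same_group = (
            following is not None
            and following[0] == id_
            and following[2] == subj
        )
        term2 = following[1] if same_group and degree == "Yes" else 0
        out.append(current + (term2,))
    return out

with_term2 = add_term2(rows)
```

As with the merge, the result depends on row order, so sort by id, subj and term first if the input is not already grouped.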
Some important information is missing. For example, when an id has a degree = Yes row, is there always a degree = No row with the same id?
What should be done if an id has a degree = Yes row and more than one degree = No row with different terms? And why do you want to solve this with an if-else statement?
Assuming there is always exactly one id-matching degree = No row for each row with degree = Yes, you can use this:
proc sql;
/* have is a placeholder: use your own table name */
select a.*, case when a.degree = "Yes" then b.term else 0 end as term2
from have as a
left join have as b
on a.id = b.id and b.degree = "No" and a.degree = "Yes";
quit;
This uses neither an if statement nor a data step, but you will need to provide more information if you want a more specific solution.