Data transformation with if-then-else

I have a table as below:
id term subj degree
18 2007 ww Yes
32 2015 AA Yes
32 2016 AA No
25 2011 NM No
25 2001 ts No
18 2009 ww Yes
18 2010 ww No
I need another variable, term2. If degree is Yes, term2 should hold the term of the next row with the same id and subj; otherwise it should be 0. That is:
id term subj degree term2
18 2007 ww Yes 2009
32 2015 AA Yes 2016
32 2016 AA No 0
25 2011 NM No 0
25 2001 ts No 0
18 2009 ww Yes 2010
18 2010 ww No 0
What I tried with if-then-else doesn't work. Any ideas? Thank you.
This is the code I used:
data have;
merge aa aa (rename=(id=id1 subj=subj1
term=term1));
term2=0;
if id=id1 and subj=subj1 and degree="Yes" then
term2=term1;
run;

data have;
input id term subj $ degree $;
cards;
32 2015 AA Yes
32 2016 AA No
25 2011 NM No
25 2001 ts No
18 2007 ww Yes
18 2010 ww No
;
data want;
merge have have(firstobs=2 keep=id term rename=(id=_id term=_term));
term2=0;
if id=_id and degree='Yes' then term2=_term;
drop _:;
run;
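The look-ahead merge above relies on the rows already being in order and compares only id. A minimal sketch of a variant that sorts first and also compares subj, assuming the input dataset is named have as above (the name have_sorted and the _subj variable are made up for illustration):

```sas
/* Sketch: sort so that rows of the same id and subj are adjacent and in term order. */
proc sort data=have out=have_sorted;
  by id subj term;
run;

data want;
  /* Second copy starts at row 2, so each row sees the values of the following row. */
  merge have_sorted
        have_sorted(firstobs=2 keep=id subj term
                    rename=(id=_id subj=_subj term=_term));
  term2 = 0;
  /* Take the next term only when the following row has the same id and subj. */
  if id = _id and subj = _subj and degree = 'Yes' then term2 = _term;
  drop _:;
run;
```

The result comes out in the sorted order (id, subj, term), not in the original row order.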

Some important information is missing. When an id has a degree = Yes row, is there always a degree = No row with the same id?
What should happen if an id with a degree = Yes row has more than one degree = No row, each with a different term? And why do you want to solve this with an if-else statement?
Assuming there is always exactly one id-matching degree = No row for each degree = Yes row, you can use this:
Proc sql;
Select a.*, case when a.degree = "Yes" then b.term else 0 end as term2 from have as a
left outer join have as b on a.id = b.id and b.degree = "No" and a.degree="Yes";
quit;
This uses neither an if-statement nor a data step, but you need to provide more information if you want a more specific solution.

Related

Calculate lags and update flags after comparing years

Suppose I have the following:
ID yS yE flagS FlagE
0001 2015 2017 1 1
0001 2017 2020 2 2
0002 2017 2018 1 1
0002 2019 2020 2 2
I'm trying to generate the following:
ID yS yE flagS FlagE
0001 2015 2017 1 1
0001 2017 2020 2 1
0002 2017 2018 1 1
0002 2019 2020 2 2
That is, flagE in the second row becomes equal to the flag in the first row (i.e., 1), because yE in the first row (2017) equals yS in the second row (2017).
Conversely, for ID=0002 nothing should change, because there is a gap between the end year and the next start year.
I tried using the lag function on yE and then comparing yS with the lagged yE, but the lag carries values over from one ID to the next. Using first.ID did not work either.
Can anyone help, please?
The following gives you the expected output:
1. Sort by id and descending yS.
2. Simulate a lead (look-ahead) function with the self-merge trick.
3. Apply the desired flag if the given condition is true.
4. Sort the output back by ascending id and yS to get the expected format.
proc sort data=have out=stage1;
by id descending ys;
run;
data stage2;
merge stage1 stage1(firstobs=2 keep=ye flage id rename=(ye=_ye flage=_flage
id=_id));
if id ne _id then
call missing(_ye, _flage, _id);
if _ye=ys then
flage=_flage;
drop _:;
run;
proc sort data=stage2 out=want;
by id ys;
run;
want
id yS yE flagS flagE
0001 2015 2017 1 1
0001 2017 2020 2 1
0002 2017 2018 1 1
0002 2019 2020 2 2
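For comparison, the LAG approach mentioned in the question can also work if LAG is called on every row and its result is ignored on the first row of each id. A minimal sketch, assuming the input dataset is named have with the variables shown above (the names sorted, prev_ye and prev_flage are made up for illustration):

```sas
/* Sketch: LAG-based alternative, one sort instead of two. */
proc sort data=have out=sorted;
  by id ys;
run;

data want;
  set sorted;
  by id;
  /* Call LAG unconditionally so its queue is updated on every row. */
  prev_ye    = lag(ye);
  prev_flage = lag(flage);
  /* Ignore the lagged values on the first row of each id. */
  if not first.id and prev_ye = ys then flage = prev_flage;
  drop prev_:;
run;
```

Calling LAG inside an IF condition is a common reason for it to return values from unexpected rows, because the queue is then only updated when the condition is true.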

Count amount within a year

I have a dataset which looks like
ID STATUS YEAR AMOUNT DT_1
. OPEN 2010 12 12
. OPEN 2009 24 10
. OPEN 2008 32 1
AA CLOSE 2015 150 12
AA CLOSE 2014 200 10
AA CLOSE 2010 10 8
AA CLOSE 2009 20 7
AA CLOSE 2008 18 5
AA OPEN 2012 21 8
AA OPEN 2001 20 7
AA OPEN 2000 18 5
Column DT_1 may take values from a minimum of 1 to a maximum of 12.
I would like to calculate how much amount there is within this range each time. This means that I should assign the previous year's amount to the current year.
I would expect something like this:
ID STATUS YEAR AMOUNT DT_1
. OPEN 2010 12 24
. OPEN 2009 24 32
. OPEN 2008 32 .
AA CLOSE 2015 150 200
AA CLOSE 2014 200 10
AA CLOSE 2010 10 20
AA CLOSE 2009 20 18
AA CLOSE 2008 18 .
AA OPEN 2012 21 20
AA OPEN 2001 20 18
AA OPEN 2000 18 .
I have tried as follows
proc sql;
create table tab1 as
select ID, status, year, sum(amount) as tot_amount, dt_1
from tab
group by 1,2,3;
quit;
but it does not give me the expected output.
EDIT: I had to edit the question as the expected output was different.
So DT_1 is the amount from the previous year? If so, it would be a lot easier if the data were sorted by increasing YEAR instead of decreasing, as displayed in the question. Then you can just use the LAG() function.
proc sort data=HAVE out=WANT ;
by id status year ;
run;
data WANT;
set want ;
by id status year;
dt_1 = lag(amount);
if first.status then dt_1=.;
run;
See if this is what you want
data have;
input ID $ STATUS $ YEAR AMOUNT;
datalines;
. OPEN 2010 12
. OPEN 2009 24
. OPEN 2008 32
AA CLOSE 2015 150
AA CLOSE 2014 200
AA CLOSE 2010 10
AA CLOSE 2009 20
AA CLOSE 2008 18
AA OPEN 2012 21
AA OPEN 2001 20
AA OPEN 2000 18
;
data want(drop = s);
merge have
have(firstobs = 2 keep = amount STATUS
rename = (amount = DT_1 STATUS = s));
if STATUS ne s then DT_1 = .;
run;

Divide variable if the rest is same

I have an example table as below
id term subj prof hour
20 2016 COM James 4
20 2016 COM Henrey 4
30 2016 HUM Nelly 3
30 2016 HUM John 3
30 2016 HUM Jimmy 3
45 2016 CGS Tim 3
I need to divide hour when id, term and subj are the same. There are 2 different profs with the same id (20), term and subj, so I divided hour by 2.
There are 3 different profs with the same id (30), term and subj, so I divided hour by 3.
So the output should look like this:
id term subj prof hour
20 2016 COM James 2
20 2016 COM Henrey 2
30 2016 HUM Nelly 1
30 2016 HUM John 1
30 2016 HUM Jimmy 1
45 2016 CGS Tim 3
In SAS you can use a double DOW loop to achieve this, once the data has been sorted in the correct order. The first loop counts how many profs there are with the same id, term and subj. The second loop divides hour by the number of profs. The loops are performed at each change of id, term or subj.
I've created a new_hour variable and kept the temporary _counter variable just so you can see the code working. You can of course overwrite the hour variable and drop the _counter variable if you wish.
/* create initial dataset */
data have;
input id term subj $ prof $ hour;
datalines;
20 2016 COM James 4
20 2016 COM Henrey 4
30 2016 HUM Nelly 3
30 2016 HUM John 3
30 2016 HUM Jimmy 3
45 2016 CGS Tim 3
;
run;
/* sort data */
proc sort data=have;
by id term subj prof;
run;
/* create output dataset */
data want;
do until(last.subj); /* 1st loop*/
set have;
by id term subj prof;
if first.subj then _counter=0; /* reset counter when id, term or subj change */
_counter+first.prof; /* count number of times prof changes */
end;
do until(last.subj); /* 2nd loop */
set have;
by id term subj;
new_hour=hour / _counter; /* divide hour by number of profs from 1st loop */
output; /* output record */
end;
run;
Assuming your problem is as simple as the one you gave as an example, one proc sql should suffice. If it is more complicated, please explain how so we can be more helpful!
data have;
input id term subj $ prof $ hour;
datalines;
20 2016 COM James 4
20 2016 COM Henrey 4
30 2016 HUM Nelly 3
30 2016 HUM John 3
30 2016 HUM Jimmy 3
45 2016 CGS Tim 3
;
run;
proc sql;
create table want as select
*, hour / count(prof) as hour_adj
from have
group by id, term, subj;
quit;

Proc report- grouping

I have a simple table, and I need to create a complicated report. I tried to do it with proc report using lots of grouping, but it didn't give me the right result. Here is my example table:
campus id year gender
West 35 2013 F
West 35 2014 F
West 35 2015 F
West 38 2014 M
West 38 2015 M
East 48 2014 -
East 48 2015 -
East 55 2013 F
East 55 2014 F
And this is the report I need to create:
          west          east
        2014  2015    2014  2015
total      2     2       2     1
Gender     2     2       2     1
F          1     1       1     -
M          1     1       -     -
none       -     -       1     1
So I have 4 different groups. I worked on this code:
proc tabulate data=a ;
class gender year ;
table gender, year*n*f=4. ;
by id;
run ;
Do you think I can do the total first, then gender, and then append them?
This doesn't quite match your requested output, but I'm not sure having the total repeated makes sense either. Proc Tabulate works well here:
proc tabulate data=have;
class campus year gender/missing;
table (all='Total' gender='Gender'), campus=''*year=''*n='';
run;
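To get closer to the requested layout, the same step could be restricted to the two years in the report and the missing gender relabelled. A minimal sketch, assuming the missing gender is stored as the literal character '-' and using a hypothetical format name $gfmt:

```sas
/* Sketch: relabel '-' as 'none' (assumes '-' is the stored value). */
proc format;
  value $gfmt '-' = 'none';
run;

proc tabulate data=have;
  where year in (2014, 2015);            /* keep only the years shown in the report */
  class campus year gender / missing;
  format gender $gfmt.;
  table (all='Total' gender='Gender'), campus=''*year=''*n='';
run;
```

Empty cells are shown as missing values by default rather than as a dash.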

Find matches by condition between 2 datasets in SAS

I'm trying to improve the processing time of an existing for-loop in a *.jsl file that my classmates and I are using in our SAS programming course. My question: does SAS offer a PROC or a sequence of statements that can replicate a search-and-match condition? Or is there a way to go through unsorted files without going line by line looking for matching conditions?
Our current script is below:
if( roadNumber_Fuel[n]==roadNumber_TO[m] &
fuelDate[n]>=tripStart[m] & fuelDate[n]<=TripEnd[m],
newtripID[n] = tripID[m];
);
I have 2 sets of data simplified below.
DATA1:
ID1 Date1
1 May 1, 2012
2 Jun 4, 2013
3 Aug 5, 2013
..
.
&
DATA2:
ID2 Date2 Date3 TRIP_ID
1 Jan 1 2012 Feb 1 2012 9876
2 Sep 5 2013 Nov 3 2013 931
1 Dec 1 2012 Dec 3 2012 236
3 Mar 9 2013 May 3 2013 390
2 Jun 1 2013 Jun 9 2013 811
1 Apr 1 2012 May 5 2012 76
...
..
.
I need to check a lot of iterations, but my goal is to have the code check:
Data1.ID1 = Data2.ID2 AND (Date1 > Date2 AND Date1 < Date3)
My desired output dataset would be:
ID1 Date1 TRIP_ID
1 May 1, 2012 76
2 Jun 4, 2013 811
Thanks for any insight!
You can do range matches in two ways. First off, you can match using PROC SQL if you're familiar with SQL:
proc sql;
create table C as
select * from A
left join B
on A.id=B.id and A.date > B.date1 and A.date < B.date2
;
quit;
Second, you can create a format. This is usually the faster option if it's possible to do this. This is tricky when you have IDs, but you can do it.
First, create a new variable, ID+date. Dates are numbers around 18,000-20,000, so multiply your ID by 100,000 and you're safe.
Second, create a dataset from the range dataset where START=lower date plus id*100,000, END=higher date + id*100,000, FMTNAME=some string that will become the format name (must start with A-Z or _ and have A-Z, _, digits only). LABEL is the value you want to retrieve (Trip_ID in the above example).
data b_fmts;
set b;
start=id*100000+date1;
end =id*100000+date2;
label=value_you_want_out;
fmtname='MYDATEF';
run;
Then use PROC FORMAT with the CNTLIN= option to import the format.
proc format cntlin=b_fmts;
quit;
Make sure your date ranges don't overlap - if they do this will fail.
Then you can use it easily:
data a_match;
set a;
trip_id=put(id*100000+date,MYDATEF.);
run;
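The overlap caveat above can be checked before building the format. A minimal sketch, assuming the range dataset is named b with variables id, date1 and date2 as in the code above (the names b_sorted, b_overlaps and prev_end are made up for illustration):

```sas
/* Sketch: list ranges that overlap the previous range of the same id. */
proc sort data=b out=b_sorted;
  by id date1;
run;

data b_overlaps;
  set b_sorted;
  by id;
  prev_end = lag(date2);           /* end date of the previous row */
  if first.id then prev_end = .;   /* never compare across ids */
  /* A shared endpoint is treated as an overlap here, a conservative assumption. */
  if not missing(prev_end) and date1 <= prev_end then output;
run;
```

If b_overlaps ends up with zero observations, no two ranges of the same id overlap; otherwise each listed row starts before the preceding range of its id has ended.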