Proc Transpose With multiple ID values per Group - sas

In this first data-set each employee has one team lead and one supervisor. I can transpose that no problem.
data a;
input employee_id ReportsTo $ ReportsToType $12.;
cards;
100 Jane Supervisor
100 Mark Team_lead
101 Max Supervisor
101 Marie Team_lead
102 Sarah Supervisor
102 Sam Team_lead
;
run;
proc transpose data = a
out = aTP(drop = _:);
by employee_id;
id ReportsToType;
var ReportsTo;
run;
/* Output */
/*employee_id Supervisor Team_lead */
/*100 Jane Mark */
/*101 Max Marie */
/*102 Sarah Sam */
Now, what if an employee can have anywhere from 1 to 3 team leads?
data b;
input employee_id ReportsTo $ ReportsToType $12.;
cards;
100 Jane Supervisor
100 Mark Team_lead
100 Jamie Team_lead
101 Max Supervisor
101 Marie Team_lead
101 Satyendra Team_lead
101 Usha Team_lead
102 Sarah Supervisor
102 Sam Team_lead
;
run;
/* Desired Output */
/*employee_id Supervisor Team_lead1 Team_lead2 Team_lead3 */
/*100 Jane Mark Jamie */
/*101 Max Marie Satyendra Usha */
/*102 Sarah Sam */
Using proc transpose gives an error telling me I can't have the same ID value more than once in a BY group. Is there a way of transposing that does allow this?
ERROR: The ID value "Team_lead" occurs twice in the same BY group

You need to change your input data so that, rather than the value Team_lead repeating, it increments: Team_lead1, Team_lead2, and so on.
You can use BY-group processing and the RETAIN statement to achieve this:
proc sort data=b;
by employee_id reportstotype;
run;
data want;
set b;
by employee_id reportstotype;
retain cnt .;
if first.reportstotype then do;
cnt = 1;
end;
if upcase(reportsToType) eq 'TEAM_LEAD' then do;
reportsToType = cats(reportsToType,cnt);
end;
cnt = cnt + 1;
run;
Then simply call proc transpose as you did before:
proc transpose data=want out=trans;
by employee_id;
id reportsToType;
var reportsTo;
run;

Related

Drop observations once condition is met by multiple variables

I have the following data. I tried to adapt one of the existing answered questions to my problem but could not get what I want. These are the rules for my data:
Amt1 is populated when Evt_Type is Fee.
Amt2 is populated when Evt_Type is REF1/REF2.
I don't want to display any observations after the last Flag='Y'.
If there is no Flag='Y', then I want all the observations for that id (e.g. id=102).
If the rows after Flag='Y' for that id are a Fee followed by REF1/REF2, I want to display them (e.g. id=101). However, I don't want them if there is no REF1/REF2 (e.g. id=103).
Have:
id   Date      Evt_Type  Flag  Amt1  Amt2
101  2/2/2019  Fee             5
101  2/3/2019  REF1      Y           5
101  2/4/2019  Fee             10
101  2/6/2019  REF2      Y           10
101  2/7/2019  Fee             4
101  2/8/2019  REF1
102  2/2/2019  Fee             25
102  2/2/2019  REF1      N           25
103  2/3/2019  Fee             10
103  2/4/2019  REF1      Y           10
103  2/5/2019  Fee             10
Want:
id   Date      Evt_Type  Flag  Amt1  Amt2
101  2/2/2019  Fee             5
101  2/3/2019  REF1      Y           5
101  2/4/2019  Fee             10
101  2/6/2019  REF2      Y           10
101  2/7/2019  Fee             4
101  2/8/2019  REF1
102  2/2/2019  Fee             25
102  2/2/2019  REF1      N           25
103  2/4/2019  REF1      Y           10
103  2/5/2019  Fee             10
I tried the following
data want;
set have;
by id Date;
drop count;
if (first.id or first.date) and FLAG='Y' then
do;
retain count;
count=1;
output;
return;
end;
if count=1 and ((first.id or first.date) and Flag ne 'Y') then
do;
retain count;
delete;
return;
end;
output;
run;
Any help is appreciated.
Thanks
A technique known as the DOW loop can compute a measure over a group and then, in a second loop, apply that measure to the members of the group.
The DOW loop relies on a SET statement inside the loop. In this case the computation is 'which row in the group is the last one having flag="Y"'.
data want;
* DOW loop, contains computation;
_max_n_with_Y = 1e12;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if flag='Y' then _max_n_with_Y = _n_;
end;
* Follow up loop, applies computation;
do _n_ = 1 to _n_;
set have;
if _n_ <= _max_n_with_Y then OUTPUT;
end;
drop _:;
run;
Here is one way
data have;
infile datalines dsd missover;
input id $ Date : mmddyy10. Evt_Type $ Flag $ Amt1 Amt2;
format Date mmddyy10.;
datalines;
101,2/2/2019,Fee,,5,
101,2/3/2019,REF1,Y,,5
101,2/4/2019,Fee,,10,
101,2/6/2019,REF2,Y,,10
101,2/7/2019,Fee,,4,
102,2/2/2019,Fee,,25,
102,2/2/2019,REF1,N,,25
;
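data want;
/* Default: keep every row when the group has no flag = "Y". */
/* _iorc_ starts at 0, so it must be set before the first group is read. */
_iorc_ = 1e7;
do _N_ = 1 by 1 until (last.id);
set have;
by id;
if flag = "Y" then _iorc_ = _N_; /* remember the last flagged row */
end;
do _N_ = 1 to _N_;
set have;
if _N_ le _iorc_ then output;
end;
run;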

compute variable after datalines

I have the following dataset (fictional data).
DATA test;
INPUT name $ age height weight;
DATALINES;
Peter 20 1.70 80
Hans 30 1.72 75
Tina 25 1.67 65
Luisa 10 1.20 50
;
RUN;
How can I compute a new variable "bmi" (weight / height^2) directly after the end of the DATALINES block? Unfortunately, all the examples in my SAS book use DATA ... INFILE instead of DATALINES.
PROC PRINT
DATA = test;
TITLE 'Fictional Data';
RUN;
DATALINES must be the last statement in the data step. Place your computation statements before DATALINES, after the INPUT statement:
INPUT name $ age height weight;
bmi = weight / height**2;
DATALINES;
…
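For reference, a complete version of the step might look like the sketch below. It only combines the data from the question with the assignment shown above; nothing else is assumed.

```sas
DATA test;
  INPUT name $ age height weight;
  /* Computed for every row as it is read; must come before DATALINES */
  bmi = weight / height**2;
  DATALINES;
Peter 20 1.70 80
Hans 30 1.72 75
Tina 25 1.67 65
Luisa 10 1.20 50
;
RUN;
```

The PROC PRINT step from the question can then be run unchanged and will show the extra bmi column.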

Is there a better way to compare values between different rows in SAS?

During some data cleaning process, there is a need to compare the data between different rows. For example, if the rows have the same countryID and subjectID then keep the largest temperature:
CountryID SubjectID Temperature
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
In a case like this, I use the lag() function as follows.
proc sort data=table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
CountryID_lag = lag(CountryID);
SubjectID_lag = lag(SubjectID);
Temperature_lag = lag(Temperature);
if CountryID = CountryID_lag and SubjectID = SubjectID_lag then do;
if Temperature < Temperature_lag then delete;
end;
drop CountryID_lag SubjectID_lag Temperature_lag;
run;
The code above may work.
But I would still like to know whether there is a better way to solve this kind of problem.
I think you are overcomplicating the task. You can use proc sql and the max function:
proc sql noprint;
create table table_laged as
select CountryID,SubjectID,max(Temperature) as Temperature
from table
group by CountryID,SubjectID;
quit;
I don't know whether this is what you want, but your code keeps every row that ties for the highest temperature.
So when one subject has 2 1 3, it keeps 3, but when it has 1 4 3 4 4, it keeps 4 4 4. It is simpler to keep only the first row for each subject, which is the highest because of the descending sort order.
proc sort data = table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
by CountryID SubjectID;
if first.SubjectID;
run;
You can use the double DOW technique to:
1. Compute a measure over a group.
2. Apply the measure to items in the group.
The benefit of DOW looping is that everything happens in a single DATA step when the incoming data is already grouped.
In this question, step 1 identifies the first row in the group with the highest temperature, and step 2 selects that row for output.
data want;
do _n_ = 1 by 1 until (last.SubjectId);
set have;
by CountryId SubjectId;
if temperature > _max_temp then do;
_max_temp = temperature;
_max_at_n = _n_;
end;
end;
do _n_ = 1 to _n_;
set have;
if _n_ = _max_at_n then OUTPUT;
end;
drop _:;
run;
The traditional procedural technique is PROC MEANS:
data have;
input CountryID SubjectID Temperature;
datalines;
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
;
run;
proc means noprint data=have;
by countryid subjectid;
output out=want(drop=_:) max(temperature)=temperature;
run;
If the data is not sorted by CountryID and SubjectID going into the data step, a hash object can be used, or SQL as in @Aurieli's answer.
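A minimal sketch of the hash object route, assuming the data set have from the PROC MEANS example; the object name h and the temporary variable _temp are illustrative choices, not part of the original answer. The hash keeps one entry per CountryID/SubjectID pair and overwrites it whenever a higher temperature is read, so the input does not need to be sorted.

```sas
data _null_;
  if 0 then set have;                 /* defines CountryID, SubjectID, Temperature in the PDV */
  declare hash h(ordered: 'a');       /* output ordered by ascending key */
  h.defineKey('CountryID', 'SubjectID');
  h.defineData('CountryID', 'SubjectID', 'Temperature');
  h.defineDone();
  do until (eof);
    /* read the incoming temperature under a temporary name */
    set have(rename=(Temperature=_temp)) end=eof;
    /* find() returns 0 and loads the stored Temperature when the key exists */
    if h.find() ne 0 or _temp > Temperature then do;
      Temperature = _temp;
      h.replace();                    /* adds the key or overwrites its data */
    end;
  end;
  h.output(dataset: 'want');
  stop;
run;
```

The whole table is held in memory, so this suits cases where the number of distinct CountryID/SubjectID pairs is moderate.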

Using the sum of the columns to create a new variable

I have a data set that has State, Corn, and Cotton. I want to create a new variable, Corn_Pct, in SAS (the state's corn output as a percentage of the country's corn output), and likewise Cotton_Pct.
Sample of data (numbers are not real):
State Corn Cotton
TX 135 500
AK 120 350
...
Can anyone help?
You can do this using a simple PROC SQL step. Let the dataset be "Test":
Proc sql ;
create table test_percent as
select *,
Corn/sum(corn) as Corn_Pct format=percent7.1,
Cotton/sum(Cotton) as Cotton_Pct format=percent7.1
from test
;
quit;
If you have many columns, you can use arrays and DO loops to generate the percentages automatically.
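One possible shape for the array approach is sketched below. It assumes the same Test data set; the names totals, corn_sum, cotton_sum, and the loop index i are hypothetical. The column totals are computed once with PROC MEANS and then applied to every row inside a DO loop.

```sas
/* Step 1: one-row data set holding the column totals */
proc means data=test noprint;
  var corn cotton;
  output out=totals(drop=_type_ _freq_) sum=corn_sum cotton_sum;
run;

/* Step 2: divide each column by its total */
data test_percent(drop=i corn_sum cotton_sum);
  if _n_ = 1 then set totals;          /* totals are retained for all rows */
  set test;
  array vals{*} corn cotton;
  array sums{*} corn_sum cotton_sum;
  array pcts{*} corn_pct cotton_pct;
  do i = 1 to dim(vals);
    pcts{i} = vals{i} / sums{i};
  end;
  format corn_pct cotton_pct percent7.1;
run;
```

To cover more columns, extend the VAR list, the SUM= list, and the three arrays in the same order.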
Try this. The inner query calculates the total of each column, and the outer query uses those totals for the calculation through a cross join:
/*My Dataset */
Data Test;
input State $ Corn Cotton ;
cards;
TK 135 500
AK 120 350
CK 100 250
FG 200 300
;
run;
/*Code*/
Proc sql;
create table test_percent as
Select a.*, (corn * 100/sm_corn) as Corn_pct, (Cotton * 100/sm_cotton) as Cotton_pct
from test a
cross join
(
select sum(corn) as sm_corn ,
sum(Cotton) as sm_cotton
from test
) b ;
quit;
/*My Output*/
State Corn Cotton Corn_pct Cotton_pct
TK 135 500 24.32432432 35.71428571
AK 120 350 21.62162162 25
CK 100 250 18.01801802 17.85714286
FG 200 300 36.03603604 21.42857143
Here you have an alternative using proc means and data step:
proc means data=test sum noprint;
output out=test2(keep=corn cotton) sum=corn cotton;
run;
data test_percent (drop=corn_sum cotton_sum);
set test2(rename=(corn=corn_sum cotton=cotton_sum) in=in1) test(in=in2);
if (in1=1) then do;
call symput('corn_sum',corn_sum);
call symput('cotton_sum',cotton_sum);
end;
else do;
Corn_pct = corn/symget('corn_sum');
Cotton_pct = cotton/symget('cotton_sum');
output;
end;
run;

first and last statements in SAS

I am trying to count the number of births. The data looks like this:
ID date
101 2016-01-01
101 2016-02-01
101 2016-02-01
102 2015-03-02
102 2016-04-01
103 2016-02-08
Now I want to create a count based on the date.
The expected output is:
ID date count
101 2016-01-01 1
101 2016-02-01 2
101 2016-02-01 2
102 2015-03-02 1
102 2016-04-01 2
103 2016-02-08 1
I am trying to do it with first. and last., and also with count from proc sql, but I am missing something here.
data temp;
set temp;
by ID DATE notsorted;
if first.date then c=1;
else c+1;
if first.ID then m=1;
else m+1;
run;
Another solution using your original approach:
data x;
input id : 3. date : ddmmyy10.;
FORMAT DATE ddmmyy10.;
datalines;
101 01-01-2016
101 02-01-2016
101 02-01-2016
102 03-02-2015
102 04-01-2016
103 02-08-2016
;
run;
data x;
set x;
by ID DATE notsorted;
if first.ID then c=0; /*reset count every time id changes*/
if first.date then c+1; /*raise count when date changes*/
run;
This produces c = 1, 2, 2 for ID 101, c = 1, 2 for ID 102, and c = 1 for ID 103.
Do you absolutely need to use first.?
I would use proc freq to achieve this:
data have;
input ID $ date :$10.; /* the data lines below are space-delimited */
datalines;
101 2016-01-01
101 2016-02-01
101 2016-02-01
102 2015-03-02
102 2016-04-01
103 2016-02-08
;
run;
proc freq DATA=have NOPRINT;
TABLES ID * date / OUT=want(drop=percent);
run;
creates this:
ID date count
101 2016-01-01 1
101 2016-02-01 2
102 2015-03-02 1
102 2016-04-01 1
103 2016-02-08 1
If you want to reproduce COUNT in the data step, you will have to use the double DOW. The dataset is SET twice: the first time to count rows by ID and date, the second time to output all rows.
data out;
/* First loop: count the rows in the current ID/date group */
do _n_ = 1 by 1 until (last.date);
set have;
by ID date;
if first.date then count = 1;
else count + 1;
end;
/* Second loop: re-read the same group and output every row with its count */
do _n_ = 1 by 1 until (last.date);
set have;
by ID date;
output;
end;
run;
You forgot to add a RETAIN statement to your data step.
data temp;
set temp;
retain c m 0;
by ID DATE notsorted;
if first.date then c=1;
else c+1;
if first.ID then m=1;
else m+1;
run;
Okay, I have edited the previous code. Hopefully this will suit your needs. Just make sure your date variable is stored as a numeric SAS date so that you can sort your table by ID and date first.
data want;
set have;
by id date;
if first.date then count=0;
count+1;
run;