I have three columns: Employee_ID, Error_Type and Month.
Sample data:
Employee_ID  Error_Type  Month   Count
---------------------------------------
101          A           Jan'21  1
102          B           Jan'21  1
103          C           Jan'21  1
101          B           Feb'21  1
102          B           Feb'21  2
103          C           Feb'21  2
101          A           Mar'21  1
102          B           Mar'21  3
103          A           Mar'21  1
101          A           Apr'21  2
102          B           May'21  3
103          C           May'21  2
I need to calculate the last column, Count, in SAS.
Example:
If an employee made the same error in Jan'21, Feb'21 and Mar'21, then the count is 3.
If an employee made the same error in Jan'21, Feb'21 and May'21, then the count is 2, because the consecutive months are Jan'21 and Feb'21; the same count has to be carried over to May'21.
Please drop your comment. THANKS
data have;
input Employee_ID Error_Type $ Month : monyy5.;
format Month monyy5.;
datalines;
101 A Jan21
102 B Jan21
103 C Jan21
101 B Feb21
102 B Feb21
103 C Feb21
101 A Mar21
102 B Mar21
103 A Mar21
101 A Apr21
102 B May21
103 C May21
;
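data want(drop = e m);
   /* e and m hold the previous error type and month per employee.   */
   /* Declare them before first use so that e is created as character. */
   length e $ 8 m 8;
   if _N_ = 1 then do;
      dcl hash h();
      h.definekey("Employee_ID");
      h.definedata("e", "m", "Count");
      h.definedone();
   end;

   set have;

   /* find() returns a nonzero code when the employee is not in the hash yet */
   if h.find() then Count = 1;
   else do;
      /* same error in the following month: extend the streak */
      if e = Error_Type and intck("month", m, Month) = 1 then Count + 1;
      /* different error in the following month: restart the streak */
      if e ne Error_Type and intck("month", m, Month) = 1 then Count = 1;
   end;

   /* store the current row as the "previous" row for this employee */
   e = Error_Type;
   m = Month;
   h.replace();
run;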
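The intck("month", m, Month) = 1 test in the step above is what decides whether two rows fall in consecutive months. Here is a minimal sketch of how that function counts month boundaries; the date literals and variable names are chosen only for illustration:

```sas
data _null_;
   /* Jan21 and Feb21, as read by the monyy5. informat, are the first day of each month */
   consecutive = intck("month", "01JAN2021"d, "01FEB2021"d);
   with_gap    = intck("month", "01FEB2021"d, "01MAY2021"d);
   put consecutive= with_gap=;   /* writes consecutive=1 with_gap=3 to the log */
run;
```

A gap of more than one month satisfies neither condition in the step above, so the Count fetched from the hash is carried forward unchanged.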
I have been tasked with taking the following data and creating two permanent data sets from it. The first should contain the average of the "value" column for each group, so it should end up with only four rows and a new column holding the average of the values for A, B, C, and D. Averages should exclude missing values, meaning that if category A has one missing value, its sum should be divided by 3, not 4. The second permanent data set should contain only the one row with the highest overall value in the "value" column (in this case, the row D 18JUL2021 951). I am having a tough time extracting that single row for the second data set. If you know of a way to perform these operations simultaneously, please let me know. Thank you for your time!
Example data:
data work.have;
input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;
Here is what I tried:
data mylib.data1(keep=type date value value_avg) mylib.data2;
set work.have;
by type;
if value ne . then NotMissing=1; else NotMissing=0;
if first.type then call missing(of value_avg);
value_avg+value;
if first.type then call missing(of num_per_cat);
num_per_cat+NotMissing;
Avg=divide((value_avg+value),(num_per_cat+NotMissing));
if last.type then output mylib.data1;
run;
This was successful for me with calculating averages, but I have no idea how to extract the row with the highest value in the "value" column to a second data set.
data work.have;
input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;
proc summary data = have nway;
   class type;
   var value;
   output out = want_mean(drop = _:) mean = ;
run;

proc summary data = have nway;
   class type;
   var value;
   output out = want_max(drop = _:) max = ;
run;
Both sets are easily done with proc sql.
First one:
proc sql;
create table want1 as
select distinct type, max(value) as Max_value, mean(value) as Average_value
from have
group by type
;
quit;
Second one:
proc sql;
create table want2 as
select *
from have
having value = max(value)
;
quit;
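Since the question also asks whether both data sets can be produced simultaneously, here is a rough sketch of a single DATA step that writes the per-type averages and the overall maximum row in one pass. It assumes work.have from the question exists and is already sorted by type. The output names avg_by_type and max_row, the helper variables best_type, best_date, best_value, total and n, and the use of the WORK library instead of a permanent one are all assumptions made for illustration.

```sas
data work.avg_by_type(keep = type value_avg)
     work.max_row(keep = type date value);
   length best_type $ 8;
   retain best_type best_date best_value;
   set work.have end = eof;
   by type;

   /* reset the per-group accumulators at the start of each type */
   if first.type then do;
      total = 0;
      n = 0;
   end;

   /* sum statements retain total and n; missing values are skipped */
   if not missing(value) then do;
      total + value;
      n + 1;
   end;

   /* remember the row with the largest value seen so far */
   if value > best_value then do;
      best_type  = type;
      best_date  = date;
      best_value = value;
   end;

   /* one output row per type, averaged over non-missing values only */
   if last.type then do;
      if n > 0 then value_avg = total / n;
      output work.avg_by_type;
   end;

   /* after the last row, write the remembered maximum row */
   if eof then do;
      type  = best_type;
      date  = best_date;
      value = best_value;
      output work.max_row;
   end;

   format date date9.;
run;
```

To make the results permanent, the two WORK names would be replaced by two-level names in an assigned library such as mylib.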
I have the following data and tried to adapt one of the existing answered questions to my problem, but could not get what I want. Here is what I have in my data:
Amt1 is populated when Evt_Type is Fee.
Amt2 is populated when Evt_Type is REF1/REF2.
I don't want to display any observations after the last Flag='Y'.
If there is no Flag='Y', then I want all the observations for that id (e.g. id=102).
After the last Flag='Y', I still want to display the following rows for that id if they are a Fee followed by a REF1/REF2 (e.g. id=101). However, I don't want them if there is no REF1/REF2 (e.g. id=103).
Have:
id   Date      Evt_Type  Flag  Amt1  Amt2
101  2/2/2019  Fee             5
101  2/3/2019  REF1      Y           5
101  2/4/2019  Fee             10
101  2/6/2019  REF2      Y           10
101  2/7/2019  Fee             4
101  2/8/2019  REF1
102  2/2/2019  Fee             25
102  2/2/2019  REF1      N           25
103  2/3/2019  Fee             10
103  2/4/2019  REF1      Y           10
103  2/5/2019  Fee             10
Want:
id   Date      Evt_Type  Flag  Amt1  Amt2
101  2/2/2019  Fee             5
101  2/3/2019  REF1      Y           5
101  2/4/2019  Fee             10
101  2/6/2019  REF2      Y           10
101  2/7/2019  Fee             4
101  2/8/2019  REF1
102  2/2/2019  Fee             25
102  2/2/2019  REF1      N           25
103  2/4/2019  REF1      Y           10
103  2/5/2019  Fee             10
I tried the following
data want;
set have;
by id Date;
drop count;
if (first.id or first.date) and FLAG='Y' then
do;
retain count;
count=1;
output;
return;
end;
if count=1 and ((first.id or first.date) and Flag ne 'Y') then
do;
retain count;
delete;
return;
end;
output;
run;
Any help is appreciated.
Thanks
A technique known as the DOW loop can perform a computation that measures a group in some way and then, in a second loop, apply that computation to the members of the group.
The DOW loop relies on a SET statement inside the loop. In this case the computation is "which row in the group is the last one having flag='Y'".
data want;
   * DOW loop, contains computation;
   _max_n_with_Y = 1e12;
   do _n_ = 1 by 1 until (last.id);
      set have;
      by id;
      if flag='Y' then _max_n_with_Y = _n_;
   end;

   * Follow up loop, applies computation;
   do _n_ = 1 to _n_;
      set have;
      if _n_ <= _max_n_with_Y then OUTPUT;
   end;

   drop _:;
run;
Here is one way
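data have;
   /* INFILE has to come before INPUT for dsd and missover to take effect */
   infile datalines dsd missover;
   input id $ Date : mmddyy10. Evt_Type $ Flag $ Amt1 Amt2;
   format Date mmddyy10.;
   datalines;
101,2/2/2019,Fee,,5,
101,2/3/2019,REF1,Y,,5
101,2/4/2019,Fee,,10,
101,2/6/2019,REF2,Y,,10
101,2/7/2019,Fee,,4,
102,2/2/2019,Fee,,25,
102,2/2/2019,REF1,N,25,
;

data want;
   /* default: keep every row when the id has no flag = "Y" */
   _iorc_ = 1e7;

   /* first pass: find the position of the last flag = "Y" in the id */
   do _N_ = 1 by 1 until (last.id);
      set have;
      by id;
      if flag = "Y" then _iorc_ = _N_;
   end;

   /* second pass: output rows up to and including that position */
   do _N_ = 1 to _N_;
      set have;
      if _N_ le _iorc_ then output;
   end;
run;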
I am trying to do a count on the number of births. The data looks like this:
ID date
101 2016-01-01
101 2016-02-01
101 2016-02-01
102 2015-03-02
102 2016-04-01
103 2016-02-08
So now I want to create a count based on the date.
The expected output looks like this:
ID date count
101 2016-01-01 1
101 2016-02-01 2
101 2016-02-01 2
102 2015-03-02 1
102 2016-04-01 2
103 2016-02-08 1
I am trying to do it by first and last and also the count from proc sql but I am missing something here.
data temp;
set temp;
by ID DATE notsorted;
if first.date then c=1;
else c+1;
if first.ID then m=1;
else m+1;
run;
Another solution with your original approach
data x;
input id : 3. date : ddmmyy10.;
FORMAT DATE ddmmyy10.;
datalines;
101 01-01-2016
101 02-01-2016
101 02-01-2016
102 03-02-2015
102 04-01-2016
103 02-08-2016
;
run;
data x;
set x;
by ID DATE notsorted;
if first.ID then c=0; /*reset count every time id changes*/
if first.date then c+1; /*raise count when date changes*/
run;
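This produces c = 1, 2, 2 for ID 101, c = 1, 2 for ID 102 and c = 1 for ID 103, which matches the expected count.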
Do you absolutely require to use first?
I would use proc freq to achieve this
data have;
infile datalines delimiter='09'x;
input ID $ date $10. ;
datalines;
101 2016-01-01
101 2016-02-01
101 2016-02-01
102 2015-03-02
102 2016-04-01
103 2016-02-08
;run;
proc freq DATA=have NOPRINT;
TABLES ID * date / OUT=want(drop=percent);
run;
creates this:
ID date count
101 2016-01-01 1
101 2016-02-01 2
102 2015-03-02 1
102 2016-04-01 1
103 2016-02-08 1
If you want to reproduce COUNT in the datastep you will have to use the double DOW. The dataset is SET twice. First time to count rows by ID and date. Second time to output all rows.
data out;
   do _n_ = 1 by 1 until (last.date);
      set test ;
      by ID date;
      if first.date then count = 1;
      else count + 1;
   end;
   do _n_ = 1 by 1 until (last.date);
      set test ;
      by ID date;
      output;
   end;
run;
You forgot to add a RETAIN statement to your data step.
data temp;
set temp;
retain c m 0;
by ID DATE notsorted;
if first.date then c=1;
else c+1;
if first.ID then m=1;
else m+1;
run;
Okay, I have edited the previous code. Hopefully this will suit your needs. Just make sure your date variable is in numeric or calendar format so that you can sort your table by ID and date first.
data want;
set have;
by id date;
if first.date then count=0;
count+1;
run;
Here is a very basic question, but I'm unable to find an easy way to do it.
I have a dataset that references different highschools and students :
Highschool Students Sexe
A 1 m
A 2 m
A 3 m
A 4 f
A 5 f
B 1 m
B 2 m
And I'd like to create two new variables that count the number of male and female in each schools :
Highschool Students Sexe Nb_m Nb_f
A 1 m 1 0
A 2 m 2 0
A 3 m 3 0
A 4 f 3 1
A 5 f 3 2
B 1 m 1 0
B 2 m 2 0
And I can finally extract the last line with the total that would look like this :
Highschool Students Sexe Nb_m Nb_f
A 5 f 3 2
B 2 m 2 0
Any ideas ?
You can do this in a single PROC SQL step...
Also, I don't think you really need the value of Sexe from the last row.
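proc sql ;
   create table want as
   select Highschool,
          sum(case when Sexe = 'f' then 1 else 0 end) as Nb_f,
          sum(case when Sexe = 'm' then 1 else 0 end) as Nb_m,
          /* columns computed in the same SELECT need the CALCULATED keyword */
          calculated Nb_f + calculated Nb_m as Students
   from have   /* input data set name is assumed; the FROM clause was missing */
   group by Highschool
   order by Highschool ;
quit ;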
First you have to sort your dataset by Highschool:
proc sort data = your_dataset;
by Highschool;
run;
Then you use:
- retain, so that Nb_m and Nb_f are not reset at every record;
- last.Highschool and an output statement, so that only the last observation for every school is written.
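data new_dataset;
   set your_dataset;
   by Highschool;
   /* explicit initial value 0, so Students is never a missing sum */
   retain Nb_m Nb_f 0;
   if Sexe = 'm' then
      Nb_m + 1;
   else
      Nb_f + 1;
   if last.Highschool then do;
      Students = Nb_m + Nb_f;
      output;
      /* reset the counters for the next school */
      Nb_m = 0;
      Nb_f = 0;
   end;
run;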
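The question also shows an intermediate table with running counts on every row. A minimal sketch of that step, assuming the input is called have (a name not given in the question) and is already sorted by Highschool, could look like this:

```sas
data have;
   input Highschool $ Students Sexe $;
   datalines;
A 1 m
A 2 m
A 3 m
A 4 f
A 5 f
B 1 m
B 2 m
;

data running;
   set have;
   by Highschool;
   /* restart both counters at the first student of each school */
   if first.Highschool then do;
      Nb_m = 0;
      Nb_f = 0;
   end;
   /* sum statements keep their value from one row to the next */
   if Sexe = 'm' then Nb_m + 1;
   else if Sexe = 'f' then Nb_f + 1;
run;
```

Adding a subsetting "if last.Highschool;" as the final statement would keep only the total row per school.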
I am trying to do a recursive lag in sas, the problem that I just learned is that x = lag(x) does not work in SAS.
The data I have is similar in format to this:
id date count x
a 1/1/1999 1 10
a 1/1/2000 2 .
a 1/1/2001 3 .
b 1/1/1997 1 51
b 1/1/1998 2 .
What I want is that given x for the first count, I want each successive x by id to be the lag(x) + some constant.
For example, lets say: if count > 1 then x = lag(x) + 3.
The output that I would want is:
id date count x
a 1/1/1999 1 10
a 1/1/2000 2 13
a 1/1/2001 3 16
b 1/1/1997 1 51
b 1/1/1998 2 54
Yes, the lag function in SAS requires some understanding. You should read through the documentation on it (http://support.sas.com/documentation/cdl/en/lefunctionsref/67398/HTML/default/viewer.htm#n0l66p5oqex1f2n1quuopdvtcjqb.htm)
When you have conditional statements with a lag inside the "then", I tend to use a retained variable.
data test;
input id $ date count x;
informat date anydtdte.;
format date date9.;
datalines;
a 1/1/1999 1 10
a 1/1/2000 2 .
a 1/1/2001 3 .
b 1/1/1997 1 51
b 1/1/1998 2 .
;
run;
data test(drop=last);
   set test;
   by id;
   retain last;
   if ^first.id then do;
      if count > 1 then
         x = last + 3;
   end;
   last = x;
run;
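To illustrate why a lag inside a conditional "then" behaves unexpectedly, here is a small sketch with made-up data (the data set lag_demo and the variable prev are hypothetical). The lag function keeps its own queue, and that queue is only updated on rows where the function actually executes:

```sas
data lag_demo;
   input id $ x;
   /* lag() only updates its queue on rows where this statement executes */
   if id = "b" then prev = lag(x);
   datalines;
a 1
b 2
a 3
b 4
;

proc print data = lag_demo;
run;
```

On the last row prev is 2 (the x from the previous "b" row), not 3 (the x from the row directly above), which is why the retained variable in the step above is the more predictable choice.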