Summary table of three variables - sas

Suppose I have the following:
ID HS REP YEAR
0001 A a 2015
0001 B a 2015
0001 B c 2015
0001 B d 2015
0002 A f 2015
0002 A g 2015
0002 B a 2015
...... .... ..... .....
I would like to get the count of "REP" for each "HS", and also the count of IDs for each "HS" (if the same ID appears under more than one HS, it is counted once under each). Desired output:
Year HS REP TotIDs
2015 A 3 2
2015 B 4 2
This means: HS "A" has 3 REPs and 2 IDs (overall, without distinguishing by REP). The same goes for "B".
I also need some summary statistics such as the mean, median, etc. Is there a way to do this with PROC MEANS, UNIVARIATE, or FREQ (maybe?) in one shot?
Thank you in advance

SQL is a little more versatile when you want to count distinct values within a group.
SQL does not have an aggregate function for the mode statistic.
If the grouping is, as it appears to be, year and hs, the following is an example:
data have; input
ID HS $ REP $ YEAR x;
id2=id;
datalines;
0001 A a 2015 12
0001 B a 2015 2
0001 B c 2015 3
0001 B d 2015 5
0002 A f 2015 13
0002 A g 2015 14
0002 B a 2015 6
;
proc sql;
  create table want as
  select
    year
  , hs
  , count(*) as rep
  , count(distinct id) as id_n_unq
  , sum(x) as x_sum
  , mean(x) as x_mean
  , median(x) as x_median
  /* , mode(x) as x_mode -- SQL does not have an aggregate function for mode */
  from have
  group by year, hs
  ;
quit;
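Since the question also asks about PROC MEANS, here is a minimal sketch of how the summary statistics, including the mode that SQL lacks, might be obtained with it. It assumes the have dataset and the analysis variable x from the example above; the output dataset name x_stats and the x_-prefixed statistic names are made up for illustration. It does not produce the distinct ID count, which would still have to come from the SQL step.

```sas
/* Sketch only: x_stats and the x_* names are hypothetical */
proc means data=have noprint nway;
  class year hs;            /* one output row per year*hs combination */
  var x;
  output out=x_stats(drop=_type_ _freq_)
         n=x_n mean=x_mean median=x_median mode=x_mode;
run;
```

The NWAY option keeps only the rows for the full year*hs crossing, so the result has the same grain as the SQL table and could be merged with it by year and hs.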

Related

Sort variables based on another data set and append data

is there a way in SAS to order columns (variables) of a data set based on the order of another data set? The names are perfectly equal.
And is there also a way to append them (vertically) based on the same column names?
Thank you in advance
ID YEAR DAYS WORK DATASET
0001 2020 32 234 1
0002 2019 31 232 1
0003 2015 3 22 1
0004 2003 15 60 1
0005 2021 32 98 1
0006 2000 31 56 1
DATASET DAYS WORK ID YEAR
2 56 23 0001 2010
2 34 123 0002 2011
2 432 3 0003 2013
2 45 543 0004 2022
2 76 765 0005 2000
2 43 8 0006 1999
I just need to sort the second data set based on the first and append the second to the first.
Can anyone help me please?
This should work:
data have1;
input ID YEAR DAYS WORK DATASET;
format ID z4.;
datalines;
0001 2020 32 234 1
0002 2019 31 232 1
0003 2015 3 22 1
0004 2003 15 60 1
0005 2021 32 98 1
0006 2000 31 56 1
;
run;
data have2;
input DATASET DAYS WORK ID YEAR;
format ID z4.;
datalines;
2 56 23 0001 2010
2 34 123 0002 2011
2 432 3 0003 2013
2 45 543 0004 2022
2 76 765 0005 2000
2 43 8 0006 1999
;
run;
First we create a new table by copying our first table. Then we just insert the rows from the second table into it, listing the columns by name. There is no need to change the column order of the original second table.
proc sql;
create table want as
select *
from have1
;
insert into want(ID, YEAR, DAYS, WORK, DATASET)
select ID, YEAR, DAYS, WORK, DATASET
from have2
;
quit;
I have no idea how you could sort based on something that is not there.
But appending is trivial. You can just set them together.
data want;
set one two;
run;
And if both datasets are already sorted by some key variables (year perhaps, in your example?) then you could interleave the observations instead. Just add a BY statement.
data want;
set one two;
by year;
run;
And if you want to make a new version of the second dataset with the variable order modified to match the variable order in the first dataset (something that really has nothing to do with sorting the data), you could use the OBS= dataset option. Code like this will order the variables based on the order they have in ONE without actually using any of the data from that dataset.
data want;
set one(obs=0) two;
run;
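If the goal is only to stack the two datasets, PROC APPEND is another option worth considering, since it matches variables by name rather than by position. A minimal sketch, assuming the have1 and have2 datasets created in the first answer; want is a hypothetical copy, made so that have1 itself is left unchanged:

```sas
/* Sketch only: copy the first dataset, then append the second by variable name */
data want;
  set have1;
run;
proc append base=want data=have2;
run;
```

The variable order of the result follows the BASE= dataset, so the second dataset does not need to be reordered first.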

Calculate lags and update flags after comparing years

Suppose I have the following:
ID yS yE flagS FlagE
0001 2015 2017 1 1
0001 2017 2020 2 2
0002 2017 2018 1 1
0002 2019 2020 2 2
I'm trying to generate the following:
ID yS yE flagS FlagE
0001 2015 2017 1 1
0001 2017 2020 2 1
0002 2017 2018 1 1
0002 2019 2020 2 2
meaning: flagE at the second row becomes equal to the flag at the first row (i.e., 1) because yE at the first row = 2017 is equal to yS at the second row, i.e., 2017.
Conversely, for ID=0002 nothing should be done because there is a gap between end and start in terms of years.
I tried the LAG function on yE so that I could compare yS with the previous yE, but the lag carries over from one ID to the next. Using first.ID did not work either.
Can anyone help me please?
The following gives you the expected output:
1. Sort by id and descending yS.
2. Simulate a lead (look-ahead) function with the merging trick.
3. Apply the desired flag if the given condition is true.
4. Sort the output back by ascending id and yS to get the expected format.
proc sort data=have out=stage1;
  by id descending ys;
run;
data stage2;
  merge stage1
        stage1(firstobs=2 keep=ye flage id
               rename=(ye=_ye flage=_flage id=_id));
  if id ne _id then
    call missing(_ye, _flage, _id);
  if _ye=ys then
    flage=_flage;
  drop _:;
run;
proc sort data=stage2 out=want;
  by id ys;
run;
want
id yS yE flagS flagE
0001 2015 2017 1 1
0001 2017 2020 2 1
0002 2017 2018 1 1
0002 2019 2020 2 2
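A single-pass alternative can be sketched with RETAIN and BY-group processing instead of the two sorts. This is only an illustration, assuming have is already sorted by id and yS; the helper variables _prev_ye and _prev_flage are hypothetical names. Unlike the merge approach, an updated flagE is carried forward, so a chain of three or more contiguous periods would all inherit the first flag.

```sas
/* Sketch only: assumes have is sorted by id ys */
data want;
  set have;
  by id;
  retain _prev_ye _prev_flage;
  /* same ID and this period starts in the year the previous one ended */
  if not first.id and ys = _prev_ye then flage = _prev_flage;
  /* remember the current row for the next iteration */
  _prev_ye    = ye;
  _prev_flage = flage;
  drop _prev_:;
run;
```

Because the comparison is skipped on first.id, values retained from the previous ID never leak into the next one, which is the problem the LAG attempt ran into.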

Proc report- grouping

I have a simple table, and I need to create a complicated report. I tried to do it with PROC REPORT using lots of grouping, but it didn't give me the right result. Here is my example table:
campus id year gender
West 35 2013 F
West 35 2014 F
West 35 2015 F
West 38 2014 M
West 38 2015 M
East 48 2014 -
East 48 2015 -
East 55 2013 F
East 55 2014 F
And this is the report I need to create:
west east
2014 2015 2014 2015
total 2 2 2 1
Gender 2 2 2 1
F 1 1 1 -
M 1 1 - -
none - - 1 1
So I have 4 different groups. I worked on this code:
proc tabulate data=a ;
class gender year ;
table gender, year*n*f=4. ;
by id;
run ;
Do you think I can do the total first, then gender, and then append them?
This doesn't quite match your requested output, but I'm not sure having the total repeated makes sense either. Proc Tabulate works well here:
proc tabulate data=have;
class campus year gender/missing;
table (all='Total' gender='Gender'), campus=''*year=''*n='';
run;

Data transformation with if then else

I have a table as below:
id term subj degree
18 2007 ww Yes
32 2015 AA Yes
32 2016 AA No
25 2011 NM No
25 2001 ts No
18 2009 ww Yes
18 2010 ww No
I need another variable, term2. If degree is Yes, term2 should hold the next term for the same id and subj; otherwise it should be 0. That is:
id term subj degree term2
18 2007 ww Yes 2009
32 2015 AA Yes 2016
32 2016 AA No 0
25 2011 NM No 0
25 2001 ts No 0
18 2009 ww Yes 2010
18 2010 ww No 0
What I did with if-then-else doesn't work. Any idea? Thank you.
This is the one I used:
data have;
merge aa aa (rename=(id=id1 subj=subj1
term=term1);
term2=0;
if id=id1 and subj=subj1 and degree="Yes" then
term2=term1
run;
data have;
input id term subj $ degree $;
cards;
32 2015 AA Yes
32 2016 AA No
25 2011 NM No
25 2001 ts No
18 2007 ww Yes
18 2010 ww No
;
data want;
merge have have(firstobs=2 keep=id term rename=(id=_id term=_term));
term2=0;
if id=_id and degree='Yes' then term2=_term;
drop _:;
run;
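The look-ahead merge above relies on rows for the same id being adjacent and in term order, which is not the case in the table as posted in the question. A hedged variant, assuming the have dataset from the question, sorts first and also compares subj; sorted is a hypothetical intermediate dataset name, and the output will be in sorted order rather than the original row order.

```sas
/* Sketch only: sort so that the "next" row is the next term of the same id and subj */
proc sort data=have out=sorted;
  by id subj term;
run;
data want;
  merge sorted
        sorted(firstobs=2 keep=id subj term
               rename=(id=_id subj=_subj term=_term));
  term2 = 0;
  if id = _id and subj = _subj and degree = 'Yes' then term2 = _term;
  drop _:;
run;
```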
Some important information is missing. For example, when an id has a degree = Yes row, is there always a degree = No row with the same id?
What should be done if there is more than one degree = No row, with different terms, for an id that also has a degree = Yes row? Why do you want to solve this with an if-else statement?
Assuming there is always exactly one id-matching degree = No row for each row with degree = Yes, you can use this:
proc sql;
  /* "have" stands in for your table name */
  select a.*, case when a.degree = "Yes" then b.term else 0 end as term2
  from have as a
  left outer join have as b
    on a.id = b.id and b.degree = "No" and a.degree = "Yes";
quit;
This uses neither an IF statement nor a DATA step, but you must provide more information if you want a more specific solution.

Drop all observations by ID where conditions are not met

I have a dataset with ~4 million transactional records, grouped by Customer_No (consisting of 1 or more transactions per Customer_No, denoted by a sequential counter). Each transaction has a Type code and I am only interested in customers where a particular combination of transaction Types was used. Neither joining the table to itself nor using EXISTS in Proc Sql is allowing me to efficiently evaluate the transaction Type criteria. I suspect a data step using retain and do-loops would process the dataset faster.
The dataset:
Customer_No Tran_Seq Tran_Type
0001 1 05
0001 2 12
0002 1 07
0002 2 86
0002 3 04
0003 1 07
0003 2 84
0003 3 84
0003 4 84
The criteria I am trying to apply:
1. All of a Customer_No's Tran_Types must be in ('04','05','07','84','86'); drop all transactions for that Customer_No if any other Tran_Type was used.
2. A Customer_No's Tran_Types must include ('84' or '86') AND '04'; drop all transactions for the Customer_No if this condition is not met.
The output I want:
Customer_No Tran_Seq Tran_Type
0002 1 07
0002 2 86
0002 3 04
The DoW loop solution should be the most efficient if the data is sorted. If it's not sorted, it will either be the most efficient or similar in scale but slightly less efficient depending on the circumstances of the dataset.
I compared it to Dom's solution on a 3e7-ID dataset: the DoW loop took a similar (slightly shorter) total elapsed time with less CPU for an unsorted dataset, and was about 50% faster for a sorted one. It is guaranteed to run in about the length of time the dataset takes to write out (maybe a bit more, but it shouldn't be much), plus sorting time if needed.
data want;
  /* first pass over the customer: set the flags */
  do _n_=1 by 1 until (last.customer_no);
    set have;
    by customer_no;
    if tran_type in ('84','86')
      then has_8486 = 1;
    else if tran_type in ('04')
      then has_04 = 1;
    else if not (tran_type in ('04','05','07','84','86'))
      then has_other = 1;
  end;
  /* second pass over the same customer: output rows if the criteria are met */
  do _n_= 1 by 1 until (last.customer_no);
    set have;
    by customer_no;
    if has_8486 and has_04 and not has_other then output;
  end;
run;
I don't think it's that complicated. Join to a subquery, group by Customer_No, and put your conditions in a having clause. A condition in a min function must be true for all rows, whereas a condition in a max function must be true for any one row:
proc sql;
create table want as
select
h.*
from
have h
inner join (
select
Customer_No
from
have
group by
Customer_No
having
min(Tran_Type in('04','05','07','84','86')) and
max(Tran_Type in('84','86')) and
max(Tran_Type eq '04')) h2
on h.Customer_No = h2.Customer_No
;
quit;
I must have made a join error. On re-writing, Proc Sql completed in less than 30 seconds (on the original 4.9 million record dataset). It's not particularly elegant code though, so I'd still appreciate any improvements or alternative methods.
data Have;
input Customer_No $ Tran_Seq $ Tran_Type:$2.;
cards;
0001 1 05
0001 2 12
0002 1 07
0002 2 86
0002 3 04
0003 1 07
0003 2 84
0003 3 84
0003 4 84
;
run;
Proc sql;
Create table Want as
select t1.* from Have t1
LEFT JOIN (select DISTINCT Customer_No from Have
where Tran_Type not in ('04','05','07','84','86')
) t2
ON(t1.Customer_No=t2.Customer_No)
INNER JOIN (select DISTINCT Customer_No from Have
where Tran_Type in ('84','86')
) t3
ON(t1.Customer_No=t3.Customer_No)
INNER JOIN (select DISTINCT Customer_No from Have
where Tran_Type in ('04')
) t4
ON(t1.Customer_No=t4.Customer_No)
Where t2.Customer_No is null
;Quit;
I would offer a slightly less complex SQL solution than @naed555's, using the INTERSECT operator.
proc sql noprint;
create table to_keep as
(
select distinct customer_no
from have
where tran_type in ('84','86')
INTERSECT
select distinct customer_no
from have
where tran_type in ('04')
)
EXCEPT
select distinct customer_no
from have
where tran_type not in ('04','05','07','84','86')
;
create table want as
select a.*
from have as a
inner join
to_keep as b
on a.customer_no = b.customer_no;
quit;