SAS Dataset: Counting observations that match an IF condition

Here is a very basic question, but I'm unable to find an easy way to do it.
I have a dataset that references different highschools and students :
Highschool Students Sexe
A 1 m
A 2 m
A 3 m
A 4 f
A 5 f
B 1 m
B 2 m
And I'd like to create two new variables that count the number of males and females in each school:
Highschool Students Sexe Nb_m Nb_f
A 1 m 1 0
A 2 m 2 0
A 3 m 3 0
A 4 f 3 1
A 5 f 3 2
B 1 m 1 0
B 2 m 2 0
Finally, I can extract the last line for each school, with the totals, which would look like this:
Highschool Students Sexe Nb_m Nb_f
A 5 f 3 2
B 2 m 2 0
Any ideas ?

You can do this in a single PROC SQL step...
Also, I don't think you really need the value of Sexe from the last row.
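proc sql ;
create table want as
select Highschool,
sum(case when Sexe = 'f' then 1 else 0 end) as Nb_f,
sum(case when Sexe = 'm' then 1 else 0 end) as Nb_m,
/* CALCULATED is required to reuse a column alias within the same SELECT */
calculated Nb_f + calculated Nb_m as Students
from your_dataset /* replace with the name of your input dataset */
group by Highschool
order by Highschool ;
quit ;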

First you have to sort your dataset by Highschool:
proc sort data = your_dataset;
by Highschool;
run;
then you use:
- a retain statement so that Nb_m and Nb_f are not reset at every record;
- last.Highschool (created by the by statement) and an output statement to write only the last observation for each school.
data new_dataset;
set your_dataset;
by Highschool;
retain Nb_m Nb_f 0;
if Sexe = 'm' then
Nb_m + 1;
else
Nb_f + 1;
if last.Highschool then do;
Students = Nb_m + Nb_f;
output;
Nb_m = 0;
Nb_f = 0;
end;
run;
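If you also want the running counts on every row (the intermediate table shown in the question), a minimal sketch along the same lines could look like this. The dataset names have and running are placeholders, and it assumes the input is already grouped by Highschool.

```sas
/* Sample data from the question */
data have;
    input Highschool $ Students Sexe $;
    datalines;
A 1 m
A 2 m
A 3 m
A 4 f
A 5 f
B 1 m
B 2 m
;
run;

data running;
    set have;
    by Highschool;
    /* Restart both counters at the first row of each school */
    if first.Highschool then do;
        Nb_m = 0;
        Nb_f = 0;
    end;
    /* Sum statements retain their variables automatically */
    if Sexe = 'm' then Nb_m + 1;
    else if Sexe = 'f' then Nb_f + 1;
run;
```

Adding "if last.Highschool;" at the end of the second step would keep only the total row for each school.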

Related

SAS versioning of two tables using data step

I'm new to SAS and I'm at a dead end.
I need to build a final table C with the full set of attributes and the "intersection" of the version histories: as soon as a version changes in either the Tarifs or the Abonents table, the version in C should change too. If the version changes in both tables at the same time, the version in C should change only once.
Tarifs
abon_id tariff_plan type from_date to_date
1 1 1 01OCT2005 01JAN2040
2 1 2 05NOV2005 01DEC2006
2 2 2 02DEC2006 01DEC2007
2 2 1 02DEC2007 01JAN2040
3 0 0 07NOV1917 11JUN1991
3 1 1 12JUN1991 01JAN2040
4 1 1 12JUN1991 01JAN2040
Abonents
abon_id name sex from_date to_date
1 Igor M 01OCT2005 01JAN2040
2 Vasya M 05NOV2005 01AUG2006
2 Lena F 02AUG2006 02SEP2007
2 Julia F 03SEP2007 01JAN2040
3 USSR Country 07NOV1917 11JUN1991
3 Russia Country 12JUN1991 01JAN2040
4 Petya M 12AUG1991 01JAN2040
Resulting table should be:
C:
abon_id tariff_plan type name sex fd td
1 1 1 Igor M 01oct2005 01jan2040
2 1 2 Vasya M 05nov2005 01aug2006
2 1 2 Lena F 02aug2006 01dec2006
2 2 2 Lena F 02dec2006 02sep2007
2 2 2 Julia F 03sep2007 01dec2007
2 2 1 Julia F 02dec2007 01jan2040
3 0 0 USSR Country 07nov1917 11jun1991
3 1 1 Russia Country 12jun1991 01jan2040
4 1 1 . . 12jun1991 11aug1991
4 1 1 Petya M 12aug1991 01jan2040
So far I have something like:
data out;
retain fd1 fd2 td1 td2;
format fd1 fd2 td1 td2 ddmmyy10.;
merge Tarifs(in=x) Abonents(in=y);
by abon_id fd;
fd1 = 0; fd2 = 0; td1 = 0; td2 = 0;
if x then do;
fd1 = fd;
td1 = td;
end;
if y then do;
fd2 = fd;
td2 = td;
end;
if fd1 <= fd2 then do;
fd = fd1;
if fd2 < td1 and f2 < td2 then td = fd2;
else if td1 < td2 then td = td1;
else td = td2;
end;
else do;
fd = fd2;
if fd1 < td1 and fd1 < td2 then td = fd1;
else if td1 < td2 then td = td1;
else td = td2;
end;
run;
But I think I'm doing something wrong. Please help me!
You can use an SQL union to combine the overlapping intervals with the tariff periods that precede the first abonent record:
data tarifs;
input
abon_id tariff_plan type from_date: date9. to_date date9.;
format _numeric_ 4. from_date to_date date9.;
datalines;
1 1 1 01OCT2005 01JAN2040
2 1 2 05NOV2005 01DEC2006
2 2 2 02DEC2006 01DEC2007
2 2 1 02DEC2007 01JAN2040
3 0 0 07NOV1917 11JUN1991
3 1 1 12JUN1991 01JAN2040
4 1 1 12JUN1991 01JAN2040
;
data abonents;
length abon_id 8 name $10 sex $10;
input
abon_id name sex from_date: date9. to_date date9.;
format from_date to_date date9.;
datalines;
1 Igor M 01OCT2005 01JAN2040
2 Vasya M 05NOV2005 01AUG2006
2 Lena F 02AUG2006 02SEP2007
2 Julia F 03SEP2007 01JAN2040
3 USSR Country 07NOV1917 11JUN1991
3 Russia Country 12JUN1991 01JAN2040
4 Petya M 12AUG1991 01JAN2040
;
proc sql;
create table want as
(
select
A.abon_id, A.tariff_plan, A.type
, B.name, B.sex
, case
when A.from_date < B.from_date then B.from_date else A.from_date
end as fd format=date9.
, case
when A.to_date > B.to_date then B.to_date else A.to_date
end as td format=date9.
from tarifs A
left join abonents B
on A.abon_id = B.abon_id
where
B.from_date between A.from_date and A.to_date
or
B.to_date between A.from_date and A.to_date
)
union
(
select
A.abon_id, A.tariff_plan, A.type
, ' ' as name , ' ' as sex
, A.from_date as fd
, min(B.from_date)-1 as td
from tarifs A
left join abonents B
on A.abon_id = B.abon_id
group by
B.abon_id
having
A.from_date < min(B.from_date)
)
;
quit;
A simple merge cannot accomplish the task because you need to cross join on abon_id.
A cross join can be done in a DATA step by loading abonents into a multidata hash, reading tarifs sequentially with SET, and iterating over the matches with find/find_next.
Example
data tarifs;
input
abon_id tariff_plan type from_date: date9. to_date date9.;
format _numeric_ 4. from_date to_date date9.;
datalines;
1 1 1 01OCT2005 01JAN2040
2 1 2 05NOV2005 01DEC2006
2 2 2 02DEC2006 01DEC2007
2 2 1 02DEC2007 01JAN2040
3 0 0 07NOV1917 11JUN1991
3 1 1 12JUN1991 01JAN2040
4 1 1 12JUN1991 01JAN2040
5 1 1 06JAN2021 31DEC2031
;
data abonents;
length abon_id 8 name $10 sex $10;
input
abon_id name sex from_date: date9. to_date date9.;
format from_date to_date date9.;
datalines;
1 Igor M 01OCT2005 01JAN2040
2 Vasya M 05NOV2005 01AUG2006
2 Lena F 02AUG2006 02SEP2007
2 Julia F 03SEP2007 01JAN2040
3 USSR Country 07NOV1917 11JUN1991
3 Russia Country 12JUN1991 01JAN2040
4 Petya M 12AUG1991 01JAN2040
;
data want(keep=abon_id tariff_plan type name sex tariffed:);
if 0 then set tarifs abonents;
if _n_ = 1 then do;
declare hash abon (dataset:'abonents', multidata:'y');
abon.defineKey('abon_id');
abon.defineData('name', 'sex', 'from_date', 'to_date');
abon.defineDone();
end;
set tarifs (rename=(from_date=fd to_date=td));
min_from = 1e9;
if abon.find() = 0 then do until (abon.find_next() ne 0);
if fd <= from_date <= td then tariffed_fd = from_date;
else
if from_date <= fd <= to_date then tariffed_fd = fd;
if fd <= to_date <= td then tariffed_td = to_date;
else
if from_date <= td <= to_date then tariffed_td = td;
if nmiss(of tariffed:) = 0 then output;
if from_date < min_from then min_from = from_date;
call missing (of tariffed:);
end;
if fd < min_from then do;
tariffed_fd = fd;
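/* min_from is the earliest abonent start; fall back to td when there is no abonent at all */
tariffed_td = min(td, min_from - 1);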
call missing (name, sex);
output;
end;
format min_from tariffed: date9.;
run;
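Both approaches above hinge on deciding whether two date ranges overlap. As a rough sketch (not the method either answer uses verbatim), two closed intervals overlap exactly when each one starts on or before the other one ends, and the shared part runs from the later start to the earlier end. The table names t_demo, a_demo and overlaps are made up for illustration and hold only the abon_id 2 rows. This produces the overlapping pieces only, not the gap rows before the first abonent record.

```sas
/* Hypothetical cut-down copies of the two tables (abon_id 2 only) */
data t_demo;
    input abon_id tariff_plan type from_date :date9. to_date :date9.;
    format from_date to_date date9.;
    datalines;
2 1 2 05NOV2005 01DEC2006
2 2 2 02DEC2006 01DEC2007
2 2 1 02DEC2007 01JAN2040
;
run;

data a_demo;
    input abon_id name $ sex $ from_date :date9. to_date :date9.;
    format from_date to_date date9.;
    datalines;
2 Vasya M 05NOV2005 01AUG2006
2 Lena F 02AUG2006 02SEP2007
2 Julia F 03SEP2007 01JAN2040
;
run;

proc sql;
    create table overlaps as
    select A.abon_id, A.tariff_plan, A.type, B.name, B.sex,
           /* with two arguments, MAX and MIN work row by row */
           max(A.from_date, B.from_date) as fd format=date9.,
           min(A.to_date, B.to_date) as td format=date9.
    from t_demo A
    inner join a_demo B
        on A.abon_id = B.abon_id
       /* general overlap test for two closed intervals */
       and A.from_date <= B.to_date
       and B.from_date <= A.to_date
    order by A.abon_id, fd;
quit;
```

Unlike a test on the individual endpoints, this condition also catches the case where one interval lies entirely inside the other.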

Compare rows within group of size two

I've got the code below, which works beautifully for comparing rows in a group when the first row doesn't matter.
data want_Find_Change;
set WORK.IA;
by ID;
array var[*] $ RATING;
array lagvar[*] $ zRATING;
array changeflag[*] RATING_UPDATE;
do i = 1 to dim(var);
lagvar[i] = lag(var[i]);
end;
do i = 1 to dim(var) ;
changeflag[i] = (var[i] NE lagvar[i] AND NOT first.ID);
end;
drop i;
run;
Unfortunately, when I use a dataset that has two rows per group I get incorrect results, I assume because the first row has to be used in the comparison. How can I compare the only two rows and return the flag only on the second row? This did not work:
data Change;
set WORK.Two;
by ID;
changeflag = last.RATING NE first.RATING;
run;
Example of the data I have and want
Group Name Sport DogName Eligibility
1 Tom BBALL Toto Yes
1 Tom golf spot Yes
2 Nancy vllyball Jimmy yes
2 Nancy vllyball rover no
want
Group Name Sport DogName Eligibility N_change S_change D_Change E_change
1 Tom BBall Toto Yes 0 0 0 0
1 Tom golf spot Yes 0 1 1 0
2 Nancy vllyball Jimmy yes 0 0 0 0
2 Nancy vllyball rover no 0 0 1 1
If you want only the first row to not be flagged, you first need to create a variable enumerating the rows within each group. You can do so with:
data temp;
set have;
count + 1;
by Group;
if first.Group then count = 1;
run;
In a second step, you can run a proc sql with a subquery, count distinct by groups, and case when:
proc sql;
create table want as
select
Group, Name, Sport, DogName, Eligibility,
case when count_name > 1 and count > 1 then 1 else 0 end as N_change,
case when count_sport > 1 and count > 1 then 1 else 0 end as S_change,
case when count_dog > 1 and count > 1 then 1 else 0 end as D_change,
case when count_E > 1 and count > 1 then 1 else 0 end as E_change
from (select *,
count(distinct(Name)) as count_name,
count(distinct(Sport)) as count_sport,
count(distinct(DogName)) as count_dog,
count(distinct(Eligibility)) as count_E
from temp
group by Group);
quit;
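For comparison, here is a minimal DATA step sketch of the same idea using lag, closer to the code in the question. The prev_ variable names are my own, and the comparison is case-sensitive, so "Yes" and "yes" would count as different values.

```sas
/* Sample data from the question */
data have;
    length Name $10 Sport $10 DogName $10 Eligibility $3;
    input Group Name $ Sport $ DogName $ Eligibility $;
    datalines;
1 Tom BBALL Toto Yes
1 Tom golf spot Yes
2 Nancy vllyball Jimmy yes
2 Nancy vllyball rover no
;
run;

data want;
    set have;
    by Group;
    /* LAG must run on every row so its queue always holds the previous row */
    prev_name = lag(Name);
    prev_sport = lag(Sport);
    prev_dog = lag(DogName);
    prev_elig = lag(Eligibility);
    /* Flag a change only on rows that are not the first of their group */
    N_change = (not first.Group and Name ne prev_name);
    S_change = (not first.Group and Sport ne prev_sport);
    D_change = (not first.Group and DogName ne prev_dog);
    E_change = (not first.Group and Eligibility ne prev_elig);
    drop prev_:;
run;
```

Because each row is compared with the one directly before it, this also works for groups with more than two rows, where counting distinct values per group would flag every row after the first.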

SAS-How to count the number of observation over the 10 years prior to certain month

I have a sample with two variables, ID and ym. ID identifies each trader and ym is the year-month. I want to create a variable that shows the number of observations over the 10-year period prior to month t, as shown below.
ID ym Want
1 200101 0
1 200301 1
1 200401 2
1 200501 3
1 200601 4
1 200801 5
1 201201 5
1 201501 4
2 200001 0
2 200203 1
2 200401 2
2 200506 3
I attempted to use a BY statement and first.id to count the number.
data want;
set have;
want+1;
by id;
if first.id then want=1;
run;
However, the years in ym are not continuous. When the time gap is longer than 10 years, this method does not work. I assume I need to count within a rolling 10-year window, but I am not sure how to achieve it. Please give me some suggestions. Thanks.
Just do a self join in SQL. With your coding of YM it is easy to handle an interval that is a multiple of a year, but harder to handle other intervals.
proc sql;
create table want as
select a.id,a.ym,count(b.ym) as want
from have a
left join have b
on a.id = b.id
and (a.ym - 1000) <= b.ym < a.ym
group by a.id,a.ym
order by a.id,a.ym
;
quit;
This method retains the previous values for each ID and directly checks to see how many are within 120 months of the current value. It is not optimized but it works. You can set the array m() to the maximum number of values you have per ID if you care about efficiency.
The variable d is a quick shorthand I often use which converts years/months into an integer value - so
200012 -> (2000*12) + 12 = 24012
200101 -> (2001*12) + 1 = 24013
time from 200012 to 200101 = 24013 - 24012 = 1 month
data have;
input id ym;
datalines;
1 200101
1 200301
1 200401
1 200501
1 200601
1 200801
1 201201
1 201501
2 200001
2 200203
2 200401
2 200506
;
proc sort data=have;
by id ym;
run;
data want (keep=id ym want);
set have;
by id;
retain seq m1-m100;
array m(100) m1-m100;
** Convert date to comparable value **;
d = 12 * floor(ym/100) + mod(ym,100);
** Initialize number of previous records **;
want = 0;
** If first record, set retained values to missing and leave want=0 **;
if first.id then call missing(seq,of m1-m100);
** Otherwise loop through previous months and count how many were within 120 months **;
else do;
do i = 1 to seq;
if d <= (m(i) + 120) then want = want + 1;
end;
end;
** Increment variables for next iteration **;
seq + 1;
m(seq) = d;
run;
proc print data=want noobs;
run;
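To see the month-number shorthand on its own, here is a small sketch that prints it for the two example values 200012 and 200101. It also shows an alternative I am adding for comparison: converting ym to a real SAS date with the yymmn6. informat and letting intck count the months. The variable names dt and months_apart are just for illustration.

```sas
data _null_;
    do ym = 200012, 200101;
        /* year * 12 + month gives a running month number */
        d = 12 * floor(ym / 100) + mod(ym, 100);
        /* same idea via a real SAS date: the first day of that month */
        dt = input(put(ym, 6.), yymmn6.);
        put ym= d= dt= date9.;
    end;
    /* number of month boundaries crossed between the two dates */
    months_apart = intck('month', input('200012', yymmn6.), input('200101', yymmn6.));
    put months_apart=;
run;
```

With real dates, windows that are not a whole number of years become easy to express as well, for example with intnx.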

Generating Unique ID for same group

I have a data set:
CustID Rating
1 A
1 A
1 B
2 A
2 B
2 C
2 D
3 X
3 X
3 Z
4 Y
4 Y
5 M
6 N
7 O
8 U
8 T
8 U
And the expected output is:
CustID Rating ID
1 A 1
1 A 1
1 B 1
2 A 1
2 B 2
2 C 3
2 D 4
3 X 1
3 X 1
3 Z 2
4 Y 1
4 Y 1
5 M 1
6 N 1
7 O 1
8 U 1
8 T 2
8 U 1
In the solution below, I select the distinct ratings into a macro variable that is used in an array statement. Each rating is then searched for among those distinct values, and the position of the match becomes its id.
If you know the number of distinct ratings beforehand (3 here), you can hard-code the array size and skip the macro. Otherwise, the %sysfunc call works it out for you.
data have;
input CustomerID Rating $;
cards;
1 A
1 A
1 B
2 A
2 A
3 A
3 A
3 B
3 C
;
run;
proc sql noprint;
select distinct quote(strip(rating)) into :list separated by ' '
from have
order by 1;
%put &list.;
quit;
If you know the number beforehand:
data want;
set have;
array num(3) $ _temporary_ (&list.);
do i = 1 to dim(num);
if findw(rating,num(i),'tips')>0 then id = i;
end;
drop i;
run;
Otherwise:
%macro Y;
data want;
set have;
array num(%sysfunc(countw(&list., %str( )))) $ _temporary_ (&list.);
do i = 1 to dim(num);
if findw(rating,num(i),'tips')>0 then id = i;
end;
drop i;
run;
%mend;
%Y;
The output:
Obs CustomerID Rating id
1 1 A 1
2 1 A 1
3 1 B 2
4 2 A 1
5 2 A 1
6 3 A 1
7 3 A 1
8 3 B 2
9 3 C 3
Assuming the data is sorted by customerid and rating (as in the original, unedited question), is the following what you want?
data want;
set have;
by customerid rating;
if first.customerid then
id = 0;
if first.rating then
id + 1;
run;
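If the ratings are not sorted within a customer (as for CustID 8 in the question, where U, T, U should give 1, 2, 1), one possible sketch is to remember the ratings already seen in a hash object and number them in order of first appearance. This assumes the data is at least grouped by CustID. The names seen and next_id are my own.

```sas
/* A few rows from the question, including the unsorted CustID 8 */
data have;
    input CustID Rating $;
    datalines;
2 A
2 B
3 X
3 X
3 Z
8 U
8 T
8 U
;
run;

data want;
    if _n_ = 1 then do;
        /* Maps each (CustID, Rating) pair to the ID it was given */
        declare hash seen();
        seen.defineKey('CustID', 'Rating');
        seen.defineData('ID');
        seen.defineDone();
    end;
    set have;
    by CustID;
    /* Restart the numbering for every customer */
    if first.CustID then next_id = 0;
    /* find() loads the stored ID when the pair was seen before */
    if seen.find() ne 0 then do;
        next_id + 1;
        ID = next_id;
        seen.add();
    end;
    drop next_id;
run;
```

For the rows above this gives 1, 2 for CustID 2, then 1, 1, 2 for CustID 3, and 1, 2, 1 for CustID 8, which matches the expected output for those customers.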

sas recursive lag by id

I am trying to do a recursive lag in SAS. The problem, as I just learned, is that x = lag(x) does not work in SAS.
The data I have is similar in format to this:
id date count x
a 1/1/1999 1 10
a 1/1/2000 2 .
a 1/1/2001 3 .
b 1/1/1997 1 51
b 1/1/1998 2 .
What I want is that given x for the first count, I want each successive x by id to be the lag(x) + some constant.
For example, let's say: if count > 1 then x = lag(x) + 3.
The output that I would want is:
id date count x
a 1/1/1999 1 10
a 1/1/2000 2 13
a 1/1/2001 3 16
b 1/1/1997 1 51
b 1/1/1998 2 54
Yes, the lag function in SAS requires some understanding. You should read through the documentation on it (http://support.sas.com/documentation/cdl/en/lefunctionsref/67398/HTML/default/viewer.htm#n0l66p5oqex1f2n1quuopdvtcjqb.htm)
When you have conditional statements with a lag inside the "then", I tend to use a retained variable.
data test;
informat date anydtdte.;
input id $ date count x;
format date date9.;
datalines;
a 1/1/1999 1 10
a 1/1/2000 2 .
a 1/1/2001 3 .
b 1/1/1997 1 51
b 1/1/1998 2 .
;
run;
data test(drop=last);
set test;
by id;
retain last;
if ^first.id then do;
if count > 1 then
x = last + 3;
end;
last = x;
run;
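To illustrate why the direct approach fails, here is a small sketch with made-up dataset names (demo, naive). LAG does not look at the previous observation; it returns the value that was stored the last time that same LAG call executed. When the call sits behind a condition, the first row never enters the queue.

```sas
data demo;
    input id $ count x;
    datalines;
a 1 10
a 2 .
a 3 .
;
run;

data naive;
    set demo;
    /* LAG only runs when count > 1, so its queue never sees x = 10.   */
    /* It stores the value of x as read from the input (missing here), */
    /* not the value assigned on the left-hand side.                   */
    if count > 1 then x = lag(x) + 3;
run;
```

In this sketch x stays missing on rows 2 and 3, which is why carrying the value forward in a retained variable, as above, is the more reliable pattern.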