I have the following dataset
data input;
input Row$ A B;
datalines;
1 1 2
2 1 2
3 1 1
4 1 1
5 2 3
6 2 3
7 2 3
8 2 2
9 2 2
10 2 1
;
run;
My goal is only to keep records of the first group of data for the variable A. For example I only want records where A=1 and B=2 (lines 1 and 2) and for the next group where A=2 and B=3 and so on...
I tried the following code
data input (rename= (count=rank_b));
set input;
count + 1;
by A descending B;
if first.B then count = 1;
run;
which just gives the number of observations in A (1 to 4) and B (1 to 6). What I would like is
A B rank_b rank_b_desired
1 2 1 1
1 2 2 1
1 1 1 2
1 1 2 2
2 3 1 1
2 3 2 1
2 2 1 2
2 2 2 2
2 1 1 3
So that I can then eliminate all obs where rank_b_desired does not equal 1.
Set a flag to 1 when you encounter a new value of A, then set it to 0 if B changes. retain will preserve the value of the flag when a new line is read from the input.
data want;
set input;
by A descending B;
retain flag;
if first.B then flag = 0;
if first.A then flag = 1;
run;
The desired result can also be achieved via proc sql, with the added benefit that it does not depend on the data being pre sorted.
proc sql;
create table want as
select *
from input
group by A
having B = max(B)
order by Row;
quit;
Or to match user234821's output:
proc sql;
create table want as
select
*,
ifn(B = max(B), 1, 0) as flag
from input
group by A
order by Row;
quit;
Related
I would like to create a page break value that can help me break the page when I use proc report.
Now my data looks like this:
Group Value
a 1
a 2
a 3
...
b 1
b 2
...
c 1
c 2
c 3
And suppose I only want two lines per page, and break if the group changed.
So I need a dataset like this:
Group Value Page
a 1 1
a 2 1
a 3 2
...
b 1 3
b 2 3
...
c 1 4
c 2 4
c 3 5
Can anyone help me with this? Thanks!
Retain holds values across rows. Create a counter value that you can use to track the number of records per group. This allows you to split it into pages of N amount.
Use BY and FIRST to reset counter at the start of each group
Check if the you need to increment page
data have;
input Group $ Value;
cards;
a 1
a 2
a 3
b 1
b 2
c 1
c 2
c 3
;;;;
data want;
set have;
by group;
retain counter page;
if first.group then counter=0;
counter+1;
if mod(counter, 2) =1 or first.group then page+1;
run;
proc print data=want;
run;
Results:
Obs Group Value counter page
1 a 1 1 1
2 a 2 2 1
3 a 3 3 2
4 b 1 1 3
5 b 2 2 3
6 c 1 1 4
7 c 2 2 4
8 c 3 3 5
I have a data structure that looks like this:
DATA have ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;
So, we have family id, individual id, implicate number and imputed income for each implicate.
What i need is to replicate the results of the first individual in each family (all of the five implicates) for the remaining individuals within each family, replacing whatever values we previously had on those cells, like this:
DATA want ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;
In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.
So far, I came up with this solution:
%let implist_1=imp_inc;
%macro copyv1(list);
%let nwords=%sysfunc(countw(&list));
%do i=1 %to &nwords;
%let varl=%scan(&list, &i);
proc means data=have max noprint;
var &varl;
by famid implicate;
where indid=1;
OUTPUT OUT=copy max=max_&varl;
run;
data want;
set have;
drop &varl;
run;
data want (drop=_TYPE_ _FREQ_);
merge want copy;
by famid implicate;
rename max_&varl=&varl;
run;
%end;
%mend;
%copyv1(&imp_list1);
This works well for one or two variables. However it is tremendously slow once you do it for 400 variables in a data-set with the size of 1.5 GB.
I'm pretty sure there is a faster way to do this with some form of proc sql or first.var etc., but i'm relatively new to SAS and so far I couldn't come up with a better solution.
Thank you very much for your support.
Best regards
Yes, this can be done in DATA step using a first. reference made available via the by statement.
data want;
set have (keep=famid indid implicate imp_inc /* other vars */);
by famid indid implicate; /* by implicate is so step logs an error (at run-time) if data not sorted */
if first.famid then if indid ne 1 then abort;
array across imp_inc /* other vars */;
array hold [1,5] _temporary_; /* or [<n>,5] where <n> means the number of variables in the across array */
if indid = 1 then do; /* hold data for 1st individuals implicate across data */
do _n_ = 1 to dim(across);
hold[_n_,implicate] = across[_n_]; /* store info of each implicate of first individual */
end;
end;
else do;
do _n_ = 1 to dim(across);
across[_n_] = hold[_n_,implicate]; /* apply 1st persons info to subsequent persons */
end;
end;
run;
The DATA step could be significantly faster due to single pass through data, however there is an internal processing cost associated with calculating all those pesky [] array addresses at run; time, and that cost could become impactful at some <n>
SQL is simpler syntax, clearer understanding and works if have data set is unsorted or has some peculiar sequencing in the by group.
This is fairly straightforward with a bit of SQL:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have a
left join (
select * from have
group by famid
having indid = min(indid)
) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.
It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have(sortedby = famid) a
left join have(where = (indid = 1)) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.
I have data set,
CustID Rating
1 A
1 A
1 B
2 A
2 B
2 C
2 D
3 X
3 X
3 Z
4 Y
4 Y
5 M
6 N
7 O
8 U
8 T
8 U
And expecting Output
CustID Rating ID
1 A 1
1 A 1
1 B 1
2 A 1
2 B 2
2 C 3
2 D 4
3 X 1
3 X 1
3 Z 2
4 Y 1
4 Y 1
5 M 1
6 N 1
7 O 1
8 U 1
8 T 2
8 U 1
In the solution below, I selected the distinct possible ratings into a macro variable to be used in an array statement. These distinct values are then searched in the ratings tolumn to return the number assigned at each successful find.
You can avoid the macro statement in this case by replacing the %sysfunc by 3 (the number of distinct ratings, if you know it before hand). But the %sysfunc statement helps resolve this in case you don't know.
data have;
input CustomerID Rating $;
cards;
1 A
1 A
1 B
2 A
2 A
3 A
3 A
3 B
3 C
;
run;
proc sql noprint;
select distinct quote(strip(rating)) into :list separated by ' '
from have
order by 1;
%put &list.;
quit;
If you know the number before hand:
data want;
set have;
array num(3) $ _temporary_ (&list.);
do i = 1 to dim(num);
if findw(rating,num(i),'tips')>0 then id = i;
end;
drop i;
run;
Otherwise:
%macro Y;
data want;
set have;
array num(%sysfunc(countw(&list., %str( )))) $ _temporary_ (&list.);
do i = 1 to dim(num);
if findw(rating,num(i),'tips')>0 then id = i;
end;
drop i;
run;
%mend;
%Y;
The output:
Obs CustomerID Rating id
1 1 A 1
2 1 A 1
3 1 B 2
4 2 A 1
5 2 A 1
6 3 A 1
7 3 A 1
8 3 B 2
9 3 C 3
Assuming data is sorted by customerid and rating (as in the original unedited question). Is the following what you want:
data want;
set have;
by customerid rating;
if first.customerid then
id = 0;
if first.rating then
id + 1;
run;
I have data which is as follows.
data have;
input group replicate $ sex $ count;
datalines;
1 A F 3
1 A M 2
1 B F 4
1 B M 2
1 C F 4
1 C M 5
2 A F 5
2 A M 4
2 B F 6
2 B M 3
2 C F 2
2 C M 2
3 A F 5
3 A M 1
3 B F 3
3 B M 4
3 C F 3
3 C M 1
;
run;
I want to break the count column into two separate columns based on gender.
count_ count_
Obs group replicate female male
1 1 A 3 2
2 1 B 4 2
3 1 C 4 5
4 2 A 5 4
5 2 B 6 3
6 2 C 2 2
7 3 A 5 1
8 3 B 3 4
9 3 C 3 1
This can be done by first creating two separate data sets for each level of sex and then performing a merge.
data just_female;
set have;
where sex = 'F';
rename count = count_female;
run;
data just_male;
set have;
where sex = 'M';
rename count = count_male;
run;
data want;
merge
just_female
just_male
;
by
group
replicate
;
keep
group
replicate
count_female
count_male
;
run;
Is there a less verbose way to do this which doesn't require the need to sort or explicitly drop/keep variables?
You can do this using proc transpose but you will need to sort the data. I believe this is what you're looking for though.
proc sort data=have;
by group replicate;
run;
The data is sorted so now you have your by-group for transposing.
proc transpose data=have out=want(drop=_name_) prefix=count_;
by group replicate;
id sex;
var count;
run;
proc print data=want;
Then you get:
Obs group replicate count_F count_M
1 1 A 3 2
2 1 B 4 2
3 1 C 4 5
4 2 A 5 4
5 2 B 6 3
6 2 C 2 2
7 3 A 5 1
8 3 B 3 4
9 3 C 3 1
Hi I am trying to subset a dataset which has following
ID sal count
1 10 1
1 10 2
1 10 3
1 10 4
2 20 1
2 20 2
2 20 3
3 30 1
3 30 2
3 30 3
3 30 4
I want to take out only those IDs who are recorded 4 times.
I wrote like
data AN; set BU
if last.count gt 4 and last.count lt 4 then delete;
run;
But there is something wrong.
EDIT - Thanks for clarifying. Based on your needs, PROC SQL will be more direct:
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING MAX(COUNT) = 4
;quit;
For posterity, here is how you could do it with only a data step:
In order to use first. and last., you need to use a by clause, which requires sorting:
proc sort data=BU;
by ID DESCENDING count;
run;
When using a SET statement BY ID, first.ID will be equal to 1 (TRUE) on the first instance of a given ID, 0 (FALSE) for all other records.
data AN;
set BU;
by ID;
retain keepMe;
If first.ID THEN DO;
IF count = 4 THEN keepMe=1;
ELSE keepMe=0;
END;
if keepMe=0 THEN DELETE;
run;
During the datastep BY ID, your data will look like:
ID sal count keepMe first.ID
1 10 4 1 1
1 10 3 1 0
1 10 2 1 0
1 10 1 1 0
2 20 3 0 1
2 20 2 0 0
2 20 1 0 0
3 30 4 1 1
3 30 3 1 0
3 30 2 1 0
3 30 1 1 0
If I understand correct, you are trying to extract all observations are are repeated 4 time or more. if so, your use of last.count and first.count is wrong. last.var is a boolean and it will indicate which observation is last in the group. Have a look at Tim's suggestion.
In order to extract all observations that are repeated four times or more, I would suggest to use the following PROC SQL:
PROC SQL;
CREATE TABLE WORK.WANT AS
SELECT /* COUNT_of_ID */
(COUNT(t1.ID)) AS COUNT_of_ID,
t1.ID,
t1.SAL,
t1.count
FROM WORK.HAVE t1
GROUP BY t1.ID
HAVING (CALCULATED COUNT_of_ID) ge 4
ORDER BY t1.ID,
t1.SAL,
t1.count;
QUIT;
Result:
1 10 1
1 10 2
1 10 3
1 10 4
3 30 1
3 30 2
3 30 3
3 30 4
Slight variation on Tims - assuming you don't necessarily have the count variable.
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING Count(ID) >= 4;
quit;