I would like to create a page break value that can help me break the page when I use proc report.
Now my data looks like this:
Group Value
a 1
a 2
a 3
...
b 1
b 2
...
c 1
c 2
c 3
Suppose I want at most two lines per page, with an additional page break whenever the group changes.
So I need a dataset like this:
Group Value Page
a 1 1
a 2 1
a 3 2
...
b 1 3
b 2 3
...
c 1 4
c 2 4
c 3 5
Can anyone help me with this? Thanks!
RETAIN holds values across rows. Create a counter variable to track the number of records per group; this lets you split each group into pages of N rows.
Use BY and FIRST. to reset the counter at the start of each group.
Check whether you need to increment the page.
data have;
input Group $ Value;
cards;
a 1
a 2
a 3
b 1
b 2
c 1
c 2
c 3
;;;;
data want;
set have;
by group;
retain counter page;
if first.group then counter=0;
counter+1;
if mod(counter, 2) =1 or first.group then page+1;
run;
proc print data=want;
run;
Results:
Obs Group Value counter page
1 a 1 1 1
2 a 2 2 1
3 a 3 3 2
4 b 1 1 3
5 b 2 2 3
6 c 1 1 4
7 c 2 2 4
8 c 3 3 5
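The same idea extends to an arbitrary number of rows per page. The sketch below is a variant under stated assumptions, not part of the answer above: the macro variable lines_per_page and the names want_n and page_no are hypothetical, and the PROC REPORT step shows one way the page variable might drive the break the question asks about.

```sas
/* Sample data from the question, entered compactly with @@ */
data have;
  input Group $ Value @@;
  datalines;
a 1 a 2 a 3 b 1 b 2 c 1 c 2 c 3
;

%let lines_per_page = 2;  /* hypothetical parameter: rows allowed per page */

data want_n;
  set have;
  by group;
  if first.group then counter = 0;  /* restart the row count for each group */
  counter + 1;                      /* sum statement: counter is retained */
  /* rows 1, N+1, 2N+1, ... of a group each start a new page */
  if mod(counter - 1, &lines_per_page) = 0 then page_no + 1;
  drop counter;
run;

proc report data=want_n;
  column page_no group value;
  define page_no / order noprint;  /* ordering variable, hidden from the output */
  define group   / display;
  define value   / display;
  break after page_no / page;      /* start a new page whenever page_no changes */
run;
```

Testing `mod(counter - 1, N) = 0` rather than `mod(counter, N) = 1` keeps the logic valid when N is 1, where the second form would never be true.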
I am currently trying to write some code that goes through my dataset and sums each group every time it appears, independently of the other occurrences of that group. This is what it currently looks like versus what I want it to look like. I thought it would be simple, but SAS 9.3 does not support sum over statements.
week ID var2 ... MinUnits group
24jun2019 1 x 5 0
01jul2019 1 x 4 1
08jul2019 1 x 7 1
15jul2019 1 x 2 1
22jul2019 1 x 0 2
29jul2019 1 x 5 2
05aug2019 1 x 2 2
24jun2019 1 x 9 0
01jul2019 2 x 5 1
08jul2019 2 x 6 1
15jul2019 2 x 8 1
22jul2019 2 x 1 2
29jul2019 2 x 5 2
05aug2019 3 x 3 2
What I want it to show:
week ID var2 ... MinUnits group SumMinUnits
24jun2019 1 x 5 0 5
01jul2019 1 x 4 1 13
08jul2019 1 x 7 1
15jul2019 1 x 2 1
22jul2019 1 x 0 2 7
29jul2019 1 x 5 2
05aug2019 1 x 2 2
24jun2019 1 x 9 0 9
01jul2019 2 x 5 1 19
08jul2019 2 x 6 1
15jul2019 2 x 8 1
22jul2019 2 x 1 2 9
29jul2019 2 x 5 2
05aug2019 2 x 3 2
As you can see, simply summing by group would not work because the group number gets repeated for different IDs (and eventually for the same IDs, but in those cases a location variable is different from the original time the ID showed up).
Please note I am not asking you to code it for me, as that is too much work. I just want to know if there is a function I could use to do this. I thought about using a loop and a group by, but that would sum up the total groups.
You can use the NOTSORTED keyword on the BY statement to make BY groups from consecutive runs of the GROUP variable.
data want;
do until (last.group);
set have ;
by group notsorted;
SumMinUnits=sum(SumMinUnits,MinUnits);
end;
do until (last.group);
set have ;
by group notsorted;
output;
end;
run;
Note this will set SUMMINUNITS to the same value for all observations in the group. You could add extra code to set it to missing inside the second DO loop when it is not the first observation for the group.
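One way the extra code mentioned above might look is sketched here. It assumes a trimmed-down copy of the question's data holding only MinUnits and group (the other columns are omitted for brevity), and it blanks the total after the first row of each run has been written.

```sas
/* Trimmed-down copy of the question's data: only the two columns the logic needs */
data have;
  input MinUnits group @@;
  datalines;
5 0 4 1 7 1 2 1 0 2 5 2 2 2 9 0 5 1 6 1 8 1 1 2 5 2 3 2
;

data want;
  /* Pass 1: total MinUnits over the current run of identical GROUP values */
  do until (last.group);
    set have;
    by group notsorted;
    SumMinUnits = sum(SumMinUnits, MinUnits);
  end;
  /* Pass 2: re-read the same run and write it out */
  do until (last.group);
    set have;
    by group notsorted;
    output;
    call missing(SumMinUnits);  /* only the first row of the run keeps the total */
  end;
run;
```

Because SumMinUnits is not a variable in have, the second SET statement does not overwrite it, so the total computed in the first loop is still available when the first row is output.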
Wouldn't something like this work? It adds the total to every record of the group, but otherwise your data seems to be ordered by ID and GROUP.
proc sql;
create table want as
select *, sum(minUnits) as total_units
from have
group by ID, GROUP;
quit;
I have a data structure that looks like this:
DATA have ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;
So, we have family id, individual id, implicate number and imputed income for each implicate.
What I need is to replicate the results of the first individual in each family (all five implicates) for the remaining individuals within each family, replacing whatever values we previously had in those cells, like this:
DATA want ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;
In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.
So far, I came up with this solution:
%let implist_1=imp_inc;
%macro copyv1(list);
%let nwords=%sysfunc(countw(&list));
%do i=1 %to &nwords;
%let varl=%scan(&list, &i);
proc means data=have max noprint;
var &varl;
by famid implicate;
where indid=1;
OUTPUT OUT=copy max=max_&varl;
run;
data want;
set have;
drop &varl;
run;
data want (drop=_TYPE_ _FREQ_);
merge want copy;
by famid implicate;
rename max_&varl=&varl;
run;
%end;
%mend;
%copyv1(&implist_1);
This works well for one or two variables. However it is tremendously slow once you do it for 400 variables in a data-set with the size of 1.5 GB.
I'm pretty sure there is a faster way to do this with some form of proc sql or first.var etc., but I'm relatively new to SAS and so far I couldn't come up with a better solution.
Thank you very much for your support.
Best regards
Yes, this can be done in DATA step using a first. reference made available via the by statement.
data want;
set have (keep=famid indid implicate imp_inc /* other vars */);
by famid indid implicate; /* by implicate is so step logs an error (at run-time) if data not sorted */
if first.famid then if indid ne 1 then abort;
array across imp_inc /* other vars */;
array hold [1,5] _temporary_; /* or [<n>,5] where <n> means the number of variables in the across array */
if indid = 1 then do; /* hold data for 1st individuals implicate across data */
do _n_ = 1 to dim(across);
hold[_n_,implicate] = across[_n_]; /* store info of each implicate of first individual */
end;
end;
else do;
do _n_ = 1 to dim(across);
across[_n_] = hold[_n_,implicate]; /* apply 1st persons info to subsequent persons */
end;
end;
run;
The DATA step could be significantly faster because it makes a single pass through the data. However, there is an internal processing cost associated with calculating all those pesky [] array addresses at run-time, and that cost could become noticeable at some <n>.
SQL has simpler syntax, is easier to understand, and works even if the have data set is unsorted or has some peculiar sequencing in the by group.
This is fairly straightforward with a bit of SQL:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have a
left join (
select * from have
group by famid
having indid = min(indid)
) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.
It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have(sortedby = famid) a
left join have(where = (indid = 1)) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.
I have data which is as follows.
data have;
input group replicate $ sex $ count;
datalines;
1 A F 3
1 A M 2
1 B F 4
1 B M 2
1 C F 4
1 C M 5
2 A F 5
2 A M 4
2 B F 6
2 B M 3
2 C F 2
2 C M 2
3 A F 5
3 A M 1
3 B F 3
3 B M 4
3 C F 3
3 C M 1
;
run;
I want to break the count column into two separate columns based on gender.
count_ count_
Obs group replicate female male
1 1 A 3 2
2 1 B 4 2
3 1 C 4 5
4 2 A 5 4
5 2 B 6 3
6 2 C 2 2
7 3 A 5 1
8 3 B 3 4
9 3 C 3 1
This can be done by first creating two separate data sets for each level of sex and then performing a merge.
data just_female;
set have;
where sex = 'F';
rename count = count_female;
run;
data just_male;
set have;
where sex = 'M';
rename count = count_male;
run;
data want;
merge
just_female
just_male
;
by
group
replicate
;
keep
group
replicate
count_female
count_male
;
run;
Is there a less verbose way to do this that doesn't require sorting or explicitly dropping/keeping variables?
You can do this using proc transpose but you will need to sort the data. I believe this is what you're looking for though.
proc sort data=have;
by group replicate;
run;
The data is sorted so now you have your by-group for transposing.
proc transpose data=have out=want(drop=_name_) prefix=count_;
by group replicate;
id sex;
var count;
run;
proc print data=want;
run;
Then you get:
Obs group replicate count_F count_M
1 1 A 3 2
2 1 B 4 2
3 1 C 4 5
4 2 A 5 4
5 2 B 6 3
6 2 C 2 2
7 3 A 5 1
8 3 B 3 4
9 3 C 3 1
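If avoiding the sort matters, conditional aggregation in PROC SQL is one possible alternative. This is a different technique from the PROC TRANSPOSE answer above, offered only as a sketch: the output names count_female and count_male follow the question, the alias h is arbitrary, and the code assumes sex only takes the values F and M.

```sas
/* Sample data from the question, entered compactly with @@ */
data have;
  input group replicate $ sex $ count @@;
  datalines;
1 A F 3 1 A M 2 1 B F 4 1 B M 2 1 C F 4 1 C M 5
2 A F 5 2 A M 4 2 B F 6 2 B M 3 2 C F 2 2 C M 2
3 A F 5 3 A M 1 3 B F 3 3 B M 4 3 C F 3 3 C M 1
;

proc sql;
  create table want as
  select h.group,
         h.replicate,
         /* each CASE yields the count for one sex and missing otherwise; SUM ignores missing */
         sum(case when h.sex = 'F' then h.count else . end) as count_female,
         sum(case when h.sex = 'M' then h.count else . end) as count_male
  from have as h
  group by h.group, h.replicate
  order by h.group, h.replicate;
quit;
```

Unlike PROC TRANSPOSE with an ID statement, this form hard-codes the levels of sex, so a new level would need its own CASE expression.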
I have the following dataset
data input;
input Row A B;
datalines;
1 1 2
2 1 2
3 1 1
4 1 1
5 2 3
6 2 3
7 2 3
8 2 2
9 2 2
10 2 1
;
run;
My goal is only to keep records of the first group of data for the variable A. For example I only want records where A=1 and B=2 (lines 1 and 2) and for the next group where A=2 and B=3 and so on...
I tried the following code
data input (rename= (count=rank_b));
set input;
count + 1;
by A descending B;
if first.B then count = 1;
run;
which just gives the number of observations in A (1 to 4) and B (1 to 6). What I would like is
A B rank_b rank_b_desired
1 2 1 1
1 2 2 1
1 1 1 2
1 1 2 2
2 3 1 1
2 3 2 1
2 2 1 2
2 2 2 2
2 1 1 3
So that I can then eliminate all obs where rank_b_desired does not equal 1.
Set a flag to 1 when you encounter a new value of A, then set it to 0 if B changes. retain will preserve the value of the flag when a new line is read from the input.
data want;
set input;
by A descending B;
retain flag;
if first.B then flag = 0;
if first.A then flag = 1;
run;
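To finish the job the question describes, a subsetting IF can drop the unflagged rows in the same step. This is a minimal sketch, assuming the data are already sorted by A and descending B as in the question; want_first is a hypothetical dataset name, and the Row column is left out of the sample for brevity.

```sas
/* Columns A and B from the question's data, entered compactly with @@ */
data input;
  input A B @@;
  datalines;
1 2 1 2 1 1 1 1 2 3 2 3 2 3 2 2 2 2 2 1
;

data want_first;
  set input;
  by A descending B;
  retain flag;
  if first.B then flag = 0;  /* a new B value within A: no longer the first run */
  if first.A then flag = 1;  /* a new A value: its first B run starts here */
  if flag;                   /* subsetting IF keeps only flagged rows */
  drop flag;
run;
```

The order of the two assignments matters: on the first row of each A group both first.A and first.B are true, so the first.A check has to come last for the flag to end up as 1.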
The desired result can also be achieved via proc sql, with the added benefit that it does not depend on the data being pre-sorted.
proc sql;
create table want as
select *
from input
group by A
having B = max(B)
order by Row;
quit;
Or to match user234821's output:
proc sql;
create table want as
select
*,
ifn(B = max(B), 1, 0) as flag
from input
group by A
order by Row;
quit;
I have the following dataset :
ID CODE
1 A
1 B
2 A
2 A
2 B
3 A
3 B
I would like to add a third column to this table which gives a sequence no. as given below :
ID CODE SEQ
1 A 1
1 B 2
2 A 1
2 A 1
2 B 2
3 A 1
3 B 2
How can I achieve this with a retain statement, rather than by hard-coding A as 1 and B as 2?
You should look at BY-group processing and the first. variables. Something like this will work; basically, for each ID initialize seq to zero, and for each new code increment it by one.
data want;
set have;
by id code;
if first.id then seq=0;
if first.code then seq+1;
run;