I have a data structure that looks like this:
DATA have ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;
So, we have family id, individual id, implicate number and imputed income for each implicate.
What i need is to replicate the results of the first individual in each family (all of the five implicates) for the remaining individuals within each family, replacing whatever values we previously had on those cells, like this:
DATA want ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;
In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.
So far, I came up with this solution:
%let implist_1=imp_inc;
%macro copyv1(list);
%let nwords=%sysfunc(countw(&list));
%do i=1 %to &nwords;
%let varl=%scan(&list, &i);
proc means data=have max noprint;
var &varl;
by famid implicate;
where indid=1;
OUTPUT OUT=copy max=max_&varl;
run;
data want;
set have;
drop &varl;
run;
data want (drop=_TYPE_ _FREQ_);
merge want copy;
by famid implicate;
rename max_&varl=&varl;
run;
%end;
%mend;
%copyv1(&imp_list1);
This works well for one or two variables. However it is tremendously slow once you do it for 400 variables in a data-set with the size of 1.5 GB.
I'm pretty sure there is a faster way to do this with some form of proc sql or first.var etc., but i'm relatively new to SAS and so far I couldn't come up with a better solution.
Thank you very much for your support.
Best regards
Yes, this can be done in DATA step using a first. reference made available via the by statement.
data want;
set have (keep=famid indid implicate imp_inc /* other vars */);
by famid indid implicate; /* by implicate is so step logs an error (at run-time) if data not sorted */
if first.famid then if indid ne 1 then abort;
array across imp_inc /* other vars */;
array hold [1,5] _temporary_; /* or [<n>,5] where <n> means the number of variables in the across array */
if indid = 1 then do; /* hold data for 1st individuals implicate across data */
do _n_ = 1 to dim(across);
hold[_n_,implicate] = across[_n_]; /* store info of each implicate of first individual */
end;
end;
else do;
do _n_ = 1 to dim(across);
across[_n_] = hold[_n_,implicate]; /* apply 1st persons info to subsequent persons */
end;
end;
run;
The DATA step could be significantly faster due to single pass through data, however there is an internal processing cost associated with calculating all those pesky [] array addresses at run; time, and that cost could become impactful at some <n>
SQL is simpler syntax, clearer understanding and works if have data set is unsorted or has some peculiar sequencing in the by group.
This is fairly straightforward with a bit of SQL:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have a
left join (
select * from have
group by famid
having indid = min(indid)
) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.
It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have(sortedby = famid) a
left join have(where = (indid = 1)) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.
Related
I would like to create a page break value that can help me break the page when I use proc report.
Now my data looks like this:
Group Value
a 1
a 2
a 3
...
b 1
b 2
...
c 1
c 2
c 3
And suppose I only want two lines per page, and break if the group changed.
So I need a dataset like this:
Group Value Page
a 1 1
a 2 1
a 3 2
...
b 1 3
b 2 3
...
c 1 4
c 2 4
c 3 5
Can anyone help me with this? Thanks!
Retain holds values across rows. Create a counter value that you can use to track the number of records per group. This allows you to split it into pages of N amount.
Use BY and FIRST to reset counter at the start of each group
Check if the you need to increment page
data have;
input Group $ Value;
cards;
a 1
a 2
a 3
b 1
b 2
c 1
c 2
c 3
;;;;
data want;
set have;
by group;
retain counter page;
if first.group then counter=0;
counter+1;
if mod(counter, 2) =1 or first.group then page+1;
run;
proc print data=want;
run;
Results:
Obs Group Value counter page
1 a 1 1 1
2 a 2 2 1
3 a 3 3 2
4 b 1 1 3
5 b 2 2 3
6 c 1 1 4
7 c 2 2 4
8 c 3 3 5
I am dealing with a repeated measures dataset in a wide format. Each observation represents one measurement for one subject and each subject is measures six times. The data contains mainly dummy variables.
I am looking to do a count of unique dummy variable values across all six observations for each subject.
Have:
MeasurementNum SubjectID Dummy0 Dummy1 Dummy2 Dummy3 Dummy4
-----------------------------------------------------------------------------
1 1 1 1 0 0 0
2 1 0 1 0 1 0
3 1 - - - - -
4 1 0 0 1 1 0
5 1 - - - - -
6 1 0 0 0 1 0
1 2 1 0 0 1 0
2 2 0 0 0 0 0
3 2 0 1 0 0 0
4 2 1 1 0 1 0
5 2 - - - - -
6 2 1 1 1 0 0
Want:
Total for Overall
MeasurementNum SubjectID ... MeasurementNUM Total
--------------------------------...-----------------------------
1 1 ... 2 4
2 1 ... 2 4
3 1 ... - 4
4 1 ... 2 4
5 1 ... - 4
6 1 ... 1 4
1 2 ... 2 4
2 2 ... 0 4
3 2 ... 1 4
4 2 ... 3 4
5 2 ... - 4
6 2 ... 3 4
My current approach is to consolidate all six rows within each subject to one rows retaining value 1 using Proc MEANS with BY and OUTPUT statements, as described in this related question. I then use Proc SUMMARY to get the values listed under variable 'Total` in the have statement.
proc summary
data=have;
By SubjectID
class Dummy1-4;
output out=want sum=sum;
Is there a way to get the distinct/unique counts across observations without consolidating rows first?
I prefer PROC SQL as it will also allow me to do conditional counts according to subject covariates present in my working dataset. I.e. producing the want descriptives on condition of a covariate specific to the subject.
I suspect that using PROC SUMMARY (aka PROC MEANS) will be the easiest way. Sounds like you want to find the MAX for each SUBJECT and then SUM those to get the subject totals.
proc summary data=have nway ;
class SubjectID ;
var Dummy0-Dummy999;
output out=any(drop=_type_ _freq_) n=n_reps max= ;
run;
data want ;
set any ;
total = sum(of Dummy0-Dummy999) ;
run;
Not sure how SQL helps any with conditional counts. But you could generate the counts and total in one step with PROC SQL, but it would require wallpaper code like this:
proc sql ;
create table want as
select SubjectID
, count(*) as n_reps
, max(dummy0) as dummy0
, max(dummy1) as dummy1
...
, max(dummy999) as dumyy999
, sum
( max(dummy0)
, max(dummy1)
...
, max(dummy999)
) as Total
from have
group by 1
;
quit;
You could probably define a macro (or some other tool) to generate that wallpaper code for you from a list of variable names.
I have a time series SAS dataset and I want to transfer it to vertical dataset.
My data looks like..
ID A2009 A2010 A2011 A2012
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 1 2 3 4
5 1 2 3 4
data multcol;
infile datalines;
input ID A2009 A2010 A2011 A2012 A2013;
return;
datalines;
1 1 2 3 4 5
2 1 2 3 4 5
3 1 2 3 4 5
4 1 2 3 4 5
5 1 2 3 4 5
;
run;
proc print data=multcol noobs;
run;
I search the web only find someone's solution as following.Not worked.
But my dataset is too large, this method shut down my computer.
data cmbcol(keep=a orig_varname orig_obsnum);
set multcol;
array myvars _numeric_;
do i = 2 to dim(myvars);
orig_varname = vname(myvars(i));
orig_obsnum = _n_;
A = myvars(i);
output;
end;
run;
proc print data=cmbcol ;
title 'cmbcol';
run;
proc sort data=cmbcol;
by orig_varname a;
run;
proc print data=cmbcol noobs;
title 'cmbcol';
run;
And I want them to become like this.
ID t t+1
1 1 2
2 1 2
3 1 2
4 1 2
5 1 2
1 2 3
2 2 3
3 2 3
4 2 3
5 2 3
1 3 4
2 3 4
3 3 4
4 3 4
5 3 4
How can we do that?
Thanks in advance.
That is an unusual data structure for sure, but you could achieve this using the following macro (adjust to your needs).
options validvarname = any;
%macro transp;
%let i = 2009;
%do %while (&i <= 2011);
%let j = %eval(&i + 1);
data part_&i(rename = (A&i = t A&j = 't+1'n));
set multcol(keep = ID A&i A&j);
run;
%let i = %eval(&i + 1);
%end;
data combined;
set part_:;
run;
proc datasets nolist nodetails;
delete part_:;
quit;
%mend transp;
%transp
Hi I am trying to subset a dataset which has following
ID sal count
1 10 1
1 10 2
1 10 3
1 10 4
2 20 1
2 20 2
2 20 3
3 30 1
3 30 2
3 30 3
3 30 4
I want to take out only those IDs who are recorded 4 times.
I wrote like
data AN; set BU
if last.count gt 4 and last.count lt 4 then delete;
run;
But there is something wrong.
EDIT - Thanks for clarifying. Based on your needs, PROC SQL will be more direct:
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING MAX(COUNT) = 4
;quit;
For posterity, here is how you could do it with only a data step:
In order to use first. and last., you need to use a by clause, which requires sorting:
proc sort data=BU;
by ID DESCENDING count;
run;
When using a SET statement BY ID, first.ID will be equal to 1 (TRUE) on the first instance of a given ID, 0 (FALSE) for all other records.
data AN;
set BU;
by ID;
retain keepMe;
If first.ID THEN DO;
IF count = 4 THEN keepMe=1;
ELSE keepMe=0;
END;
if keepMe=0 THEN DELETE;
run;
During the datastep BY ID, your data will look like:
ID sal count keepMe first.ID
1 10 4 1 1
1 10 3 1 0
1 10 2 1 0
1 10 1 1 0
2 20 3 0 1
2 20 2 0 0
2 20 1 0 0
3 30 4 1 1
3 30 3 1 0
3 30 2 1 0
3 30 1 1 0
If I understand correct, you are trying to extract all observations are are repeated 4 time or more. if so, your use of last.count and first.count is wrong. last.var is a boolean and it will indicate which observation is last in the group. Have a look at Tim's suggestion.
In order to extract all observations that are repeated four times or more, I would suggest to use the following PROC SQL:
PROC SQL;
CREATE TABLE WORK.WANT AS
SELECT /* COUNT_of_ID */
(COUNT(t1.ID)) AS COUNT_of_ID,
t1.ID,
t1.SAL,
t1.count
FROM WORK.HAVE t1
GROUP BY t1.ID
HAVING (CALCULATED COUNT_of_ID) ge 4
ORDER BY t1.ID,
t1.SAL,
t1.count;
QUIT;
Result:
1 10 1
1 10 2
1 10 3
1 10 4
3 30 1
3 30 2
3 30 3
3 30 4
Slight variation on Tims - assuming you don't necessarily have the count variable.
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING Count(ID) >= 4;
quit;
I have a dataset that consists of a series of readings made by different people/instruments, of a bunch of different dimensions. It looks like this:
SUBJECT DIM1_1 DIM1_2 DIM1_3 DIM1_4 DIM1_5 DIM2_1 DIM2_2 DIM2_3 DIM3_1 DIM3_2
1 1 . 1 1 2 3 3 3 2 .
2 1 1 . 1 1 2 2 3 1 1
3 2 2 2 . . 1 . . 5 5
... ... ... ... ... ... ... ... ... ... ...
My real dataset contains around 190 dimensions, with up to 5 measures in each one
I have to obey a set of rules to create a new variable for each dimension:
If there are 2 different values in the same dimension (missings excluded), the new variable is a missing.
If all values are the same (missings excluded), the new variable assumes the same value.
My new variables should look like this:
SUBJECT ... DIM1_X DIM2_X DIM3_X
1 ... . 3 2
2 ... 1 . 1
3 ... 2 1 5
The problem here is that i don't have the same number of measures for each dimension. Also, i could only come up with a lot of IF's (and I mean a LOT, as more measures in a given dimension increases the number of comparisons), so I wonder if there is some easier way to handle this particular problem.
Any help would be apreciated.
Thanks in advance.
Easiest way is to transpose it to vertical (one row per DIMx_y), summarize, then set the ones you want missing to missing, then retranspose (and if needed merge back on).
data have;
input SUBJECT DIM1_1 DIM1_2 DIM1_3 DIM1_4 DIM1_5 DIM2_1 DIM2_2 DIM2_3 DIM3_1 DIM3_2;
datalines;
1 1 . 1 1 2 3 3 3 2 .
2 1 1 . 1 1 2 2 3 1 1
3 2 2 2 . . 1 . . 5 5
;;;;
run;
data have_pret;
set have;
array dim_data DIM:;
do _t = 1 to dim(dim_Data); *dim function is not related to the name - it gives # of vars in array;
dim_Group = scan(vname(dim_data[_t]),1,'_');
dim_num = input(scan(vname(dim_data[_t]),2,'_'),BEST12.);
dim_val=dim_data[_t];
output;
end;
keep dim_group dim_num subject dim_val;
run;
proc freq data=have_pret noprint;
by subject dim_group;
tables dim_val/out=want_pret(where=(not missing(dim_val)));
run;
data want_pret2;
set want_pret;
by subject dim_Group;
if percent ne 100 then dim_val=.;
idval = cats(dim_Group,'_X');
if last.dim_Group;
run;
proc transpose data=want_pret2 out=want;
by subject;
id idval;
var dim_val;
run;