I have this dataset:
data temp;
input ID date :ddmmyy10. type;
format date ddmmyy10.;
datalines;
1 10/11/2006 1
1 10/12/2006 2
1 15/01/2007 2
1 20/01/2007 3
2 10/08/2008 1
2 11/09/2008 1
2 17/10/2008 1
2 12/11/2008 2
2 10/12/2008 3
;
I would like to create a new column that repeats the last date within each ID:
data temp;
input ID date :ddmmyy10. type last_date :ddmmyy10.;
format date last_date ddmmyy10.;
datalines;
1 10/11/2006 1 20/01/2007
1 10/12/2006 2 20/01/2007
1 15/01/2007 2 20/01/2007
1 20/01/2007 3 20/01/2007
2 10/08/2008 1 10/12/2008
2 11/09/2008 1 10/12/2008
2 17/10/2008 1 10/12/2008
2 12/11/2008 2 10/12/2008
2 10/12/2008 3 10/12/2008
;
I have tried this code but it doesn't work:
data temp;
set temp;
IF last.ID then last_date= .;
RETAIN last_date;
if missing(last_date) then last_date= date;
run;
Thank you in advance for your help!
First, the FIRST.ID and LAST.ID variables are only created in the data step when the variable ID appears in a BY statement.
Second, to attach the last date to every observation you need to process the data twice. Your current code (even with the BY statement added) can only set LAST_DATE to the group's last date once it reaches the last observation of the BY group, which is too late for the earlier observations.
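As a minimal sketch of the first point: the step below assumes the question's rows are stored in a dataset named have, with date read as a SAS date through the ddmmyy10. informat (the dataset name and the informat are my assumptions, not something the question shows). It only writes the automatic flags to the log, so you can see that they exist once BY is present.

```sas
/* Assumed sample data: same rows as the question, dates read with an informat */
data have;
  input ID date :ddmmyy10. type;
  format date ddmmyy10.;
  datalines;
1 10/11/2006 1
1 10/12/2006 2
1 15/01/2007 2
1 20/01/2007 3
2 10/08/2008 1
2 11/09/2008 1
2 17/10/2008 1
2 12/11/2008 2
2 10/12/2008 3
;
run;

/* FIRST.ID and LAST.ID are available only because of the BY statement */
data _null_;
  set have;
  by id;
  put id= date= first.id= last.id=;
run;
```

Removing the BY statement from the second step should leave both flags uninitialized, which is why the original attempt never sees last.ID as true.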
One way to do this is to re-sort the data by descending date within each ID. After that, BY ID with FIRST.ID and RETAIN is enough:
proc sort data=have;
by id descending date;
run;
data want;
set have;
by id descending date;
if first.id then last_date=date;
retain last_date;
format last_date ddmmyy10.;
run;
Here is a way to use the original sort order using what is called a double DOW loop. By placing the SET/BY statements inside of a DO loop you can read all of the observations for a group in a single pass of the data step. You then add a second DO loop to re-process that BY group and use the information calculated in the first loop and write out the observations.
data want;
do until (last.id);
set have;
by id;
end;
last_date=date ;
format last_date ddmmyy10.;
do until (last.id);
set have;
by id;
output;
end;
run;
Two other ways are:
Proc SQL joining a subselect, or
Proc MEANS + DATA/MERGE
SQL
proc sql;
create table want as
select have.*, id_group.max_date as last_date format=ddmmyy10.
from
have
join
( select id, max(date) as max_date
from have
group by id
) as id_group
on
have.id = id_group.id
;
quit;
MEANS
proc means noprint data=have;
by id;
var date;
output out=maxdates(keep=id last_date) max(date)=last_date;
run;
data want;
merge have maxdates;
by id;
run;
I need some help comparing rows across different ID groups, all within a single dataset.
That is, if a value appears in two or more ID groups, I'd like to delete every observation with that value, so that only values belonging to a single ID remain.
For example:
ID Value
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
The output I desire is:
ID Value
1 D
3 Z
I have looked online extensively, and tried a few things. I thought I could mark the duplicates with a flag and then delete based off that flag.
The flagging code is:
data have;
set want;
flag = first.ID ne last.ID;
run;
This worked for some cases, but I also got duplicates within the same value group flagged.
Therefore the first observation got deleted:
ID Value
3 Z
I also tried:
data have;
set want;
flag = first.ID ne last.ID and first.value ne last.value;
run;
but that didn't mark any duplicates at all.
I would appreciate any help.
Please let me know if any other information is required.
Thanks.
Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.
data have;
input ID Value $;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;
run;
proc sort data = have nodupkey;
by value ID;
run;
data want;
set have;
by value;
if first.value and last.value;
run;
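Note that the PROC SORT above overwrites have with the deduplicated rows. If you would rather keep the original data, a small variation is sketched here; the intermediate dataset name dedup is my own choice, not part of the answer.

```sas
/* Write the deduplicated rows to a separate dataset instead of overwriting HAVE */
proc sort data=have out=dedup nodupkey;
  by value ID;
run;

/* A value that is both first and last in its BY group occurs for exactly one ID */
data want;
  set dedup;
  by value;
  if first.value and last.value;
run;
```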
proc sql version:
proc sql;
create table want as
select distinct ID, value from have
group by value
having count(distinct id) =1
order by id
;
quit;
This is my interpretation of the requirements.
Find levels of value that occur in only 1 ID.
data have;
input ID Value:$1.;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;;;;
proc print;
proc summary nway; /*Dedup*/
class id value;
output out=dedup(drop=_type_ rename=(_freq_=occr));
run;
proc print;
run;
proc summary nway;
class value;
output out=want(drop=_type_) idgroup(out[1](id)=) sum(occr)=;
run;
proc print;
where _freq_ eq 1;
run;
proc print;
run;
A slightly different approach can use a hash object to track the unique values belonging to a single group.
data have; input
ID Value:& $1.; datalines;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
run;
proc delete data=want;
proc ds2;
data _null_;
declare package hash values();
declare package hash discards();
declare double idhave;
method init();
values.keys([value]);
values.data([value ID]);
values.defineDone();
discards.keys([value]);
discards.defineDone();
end;
method run();
set have;
if discards.find() ne 0 then do;
idhave = id;
if values.find() eq 0 and id ne idhave then do;
values.remove();
discards.add();
end;
else
values.add();
end;
end;
method term();
values.output('want');
end;
enddata;
run;
quit;
%let syslast = want;
I think what you should do is:
data want;
set have;
by ID value;
if not first.value then flag = 1;
else flag = 0;
run;
This basically flags all occurrences of a value except the first for a given ID.
I also swapped want and have, assuming you create what you want from what you have, and I assume have is sorted by ID and value.
Note that this only flags the repeated 1 D row above. It does not identify 3 Z.
Additional Inputs
Can't you just do a sort to get rid of the duplicates:
proc sort data = have out = want nodupkey dupout = not_wanted;
by ID value;
run;
So if you process the observations by VALUE levels (instead of by ID levels), then you just need to keep track of whether any ID is ever different from the first one. The data must be sorted by VALUE for the BY statement to work.
data want ;
do until (last.value);
set have ;
by value ;
if first.value then first_id=id;
else if id ne first_id then remapped=1;
end;
if not remapped;
keep value id;
run;
I want to calculate some sums for a data set. The challenge is that I need both the row sum AND the column sum by ID, combined into one total per ID. Below is an example.
data have;
input ID var1 var2;
datalines;
1 1 1
1 3 2
1 2 3
2 0 5
2 1 3
3 0 1
;
run;
data want;
input ID var1 var2 sum;
datalines;
1 1 1 12
1 3 2 12
1 2 3 12
2 0 5 9
2 1 3 9
3 0 1 1
;
run;
Using SQL is cool, but SAS has nice data step!
proc sort data=have; by id; run;
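data result(keep=id sum); /* keep only the key and the total so the merge cannot overwrite var1/var2 */
set have;
by id;
retain sum 0;
if first.id then sum=0; /* restart the total for each ID */
sum=sum+sum(var1,var2);
if last.id then output; /* one row per ID holding the group total */
run;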
data want;
merge have result;
by id;
run;
You will decide what to use...
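Use SQL to do all of it in one step. Group only by ID, but keep var1 and var2 in the column selection; PROC SQL then remerges the group total back onto every row (the log shows a note about remerging summary statistics). This will create the same data in want.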
proc sql noprint;
create table want as
select ID
, var1
, var2
, sum(var1) + sum(var2) as sum
from have
group by ID
;
quit;
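For comparison, the same per-ID total can also be sketched in a single data step with a double DOW loop. This is only an illustration under the assumption that have is already sorted by ID, as the sample data is.

```sas
data want;
  /* First pass: accumulate the total of var1 and var2 over the whole ID group */
  do until (last.id);
    set have;
    by id;
    sum = sum(sum, var1, var2);  /* SUM() ignores the missing starting value */
  end;
  /* Second pass: re-read the same group and write every row with the total */
  do until (last.id);
    set have;
    by id;
    output;
  end;
run;
```

Because sum is neither retained nor read from have, it is reset to missing at the start of each data step iteration, so each ID starts from a clean total.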
I have data like below
p_id E_id
---- ----
1 1
1 2
1 3
1 4
2 1
3 1
3 2
3 3
4 1
For each p_id I have to create a table of the corresponding E_id values.
How do I do this in SAS?
I am using:
proc freq data = abc;
where p_id = 1;
tables p_id * E_id;
run;
How do I generalize the where statement for all the primary keys??
The by statement is how you get a separate table for each ID. It requires data to be sorted by the variable.
proc freq data = abc;
by p_id;
tables p_id * E_id;
run;
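Since BY-group processing needs sorted input, a fuller sketch would sort first. It assumes the dataset is called abc, as in the question; using a one-way tables request is my own simplification, because p_id is constant within each BY group.

```sas
/* BY-group processing expects the data to be sorted by the BY variable */
proc sort data=abc;
  by p_id;
run;

/* One frequency table of E_id per value of p_id */
proc freq data=abc;
  by p_id;
  tables E_id;
run;
```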
Here is a solution allowing you to select p_id to generate frequency tables.
data have;
input p_id e_id;
datalines;
1 1
1 2
1 3
1 4
2 1
3 1
3 2
3 3
4 1
;
run;
proc sort data = have;
by p_id;
run;
%let pid_list = (1,2); ** only generate two tables;
data _null_;
set have;
by p_id;
if first.p_id and p_id in &pid_list then do;
call execute('
proc freq data = have(where = (p_id = '||p_id||'));
tables p_id * e_id;
run;
');
end;
run;
My SAS data set contains groups (identified by id), and I want to delete every group that has a missing value in a certain variable.
For example, I have this SAS data set:
data have;
input v1 v2 v3 id;
datalines;
9 7 210 1
0 6 . 1
9 3 320 2
6 1 . 1
9 4 432 2
;
run;
I tried this:
/*Order by id*/
proc sort data=have;
by id;
run;
/*Select no missing observations by id*/
data want;
set have;
if cmiss(of _all_) then delete;
run;
However, this code does not exclude whole ids that have missing values. It only deletes the rows with missing values.
Hmmm. You can use proc sql for this:
proc sql;
delete from have
where exists (select 1 from have have2 where have.id = have2.id and (have2.v1 is null or have2.v2 is null or have2.v3 is null));
quit;
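If you would rather not delete from have in place, a non-destructive variant is sketched below. It builds a new table (the name want is my assumption) containing only the ids that never have a missing v1, v2 or v3.

```sas
proc sql;
  /* Keep only rows whose id never appears with a missing v1, v2 or v3 */
  create table want as
  select *
  from have
  where id not in
    (select id
     from have
     where v1 is null or v2 is null or v3 is null);
quit;
```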
One idea might be to use a double DOW loop. First to check for any missing values and then a second one to output the records for the ids with no missing values.
data have;
input v1 v2 v3 id;
datalines;
9 7 210 1
0 6 . 1
9 3 320 2
6 1 . 1
9 4 432 2
1 2 333 3
;
You will need to sort as in your example.
data want ;
do until (last.id);
set have;
by id;
anymissing=max(anymissing,cmiss(of v1-v3));
end;
do until (last.id);
set have;
by id;
if not anymissing then output;
end;
run;
You just don't want rows with missing columns in your result dataset. So instead of deleting them, simply exclude them when writing the result dataset (or when overwriting the source dataset):
data have;/*overwriting my have dataset instead of deleting lines*/
set have;
if not cmiss(of _ALL_);
run;
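If you want to remove all rows of a group when at least one row has a missing value, you can store the ID once a missing value is seen and then skip every row with that ID, so that only the IDs you want remain. This only works if the row with the missing value comes first within its ID. Sorting by id alone does not guarantee that, so sort by id and then by the variable that can be missing (missing values sort first), for example by id v3 for this data. The check uses v1-v3 rather than _ALL_, because the retained helper x is itself missing on the first row and would otherwise always trigger:
data want;
retain x;
set have;
if cmiss(of v1-v3) then
x= id;
if x ne id;
run;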
This question is an expansion of this one: SAS: Create a frequency variable
The code provided in the first response works well, but what if I want to add another categorical variable? I have a date variable and an ID categorical variable. I've tried multiple things, but here's what seemed the most logical to me (but doesn't work):
data work.frequencycounts;
do _n_ =1 by 1 until (last.Date);
set work.dataset;
by Date ID;
if first.Date & first.ID then count=0;
count+1;
end;
frequency= count;
do _n_ = 1 by 1 until (last.Date);
set work.dataset;
by Date ID;
output;
end;
run;
Should I add a do loop ?
Thanks for your help.
Edits:
Example of what I have:
Date ID
1 19736 H-3-10
2 19736 H-3-12
3 19737 E-2-10
4 19737 E-2-10
Example of what I want:
Date ID Count
1 19736 H-3-10 1
2 19736 H-3-12 1
3 19737 E-2-10 2
4 19737 E-2-10 2
This produces the desired output.
What is happening here is that you need to use the last variable in the BY statement for all of the first./last. processing. If you want to see why, add a few put _all_; statements to the data step and check the values at different points. You shouldn't check first.Date at all, because if first.Date is true then first.ID is always true (by definition, first. propagates rightwards), and you also want a fresh count for [first.ID and not first.date].
Basically, treat the initial example as correct, and the variable in the initial example should be the last variable in your by statement; add as many additional variables as you want to the left of it, and nothing will change. This does require the data be sorted by the by-group variables.
data have;
input date id $;
datalines;
19736 H-3-10
19736 H-3-12
19737 E-2-10
19737 E-2-10
;;;;;
run;
data work.want;
do _n_ =1 by 1 until (last.ID); *last.<last variable in by group>;
set work.have;
by Date ID;
if first.ID then count=0; *first.ID is what you want here.;
count+1;
end;
frequency= count; *this is not really needed - can use just the one variable consistently;
do _n_ = 1 by 1 until (last.ID); *again, last.<last var in by group>;
set work.have;
by Date ID;
output;
end;
run;
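To see the flags the explanation refers to, a small inspection step like the following can help. It is only a sketch and assumes work.have from above is sorted by Date and ID, as the sample rows are.

```sas
/* Write the BY-group flags for every row to the log */
data _null_;
  set work.have;
  by Date ID;
  put Date= ID= first.Date= first.ID= last.Date= last.ID=;
run;
```

Whenever first.Date is 1 in the log, first.ID should be 1 as well, while first.ID can also be 1 on rows where first.Date is 0.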