This question is an expansion of this one: SAS: Create a frequency variable
The code provided in the first response works well, but what if I want to add another categorical variable? I have a date variable and an ID (categorical) variable. I've tried multiple things; here is what seemed most logical to me, but it doesn't work:
data work.frequencycounts;
do _n_ =1 by 1 until (last.Date);
set work.dataset;
by Date ID;
if first.Date & first.ID then count=0;
count+1;
end;
frequency= count;
do _n_ = 1 by 1 until (last.Date);
set work.dataset;
by Date ID;
output;
end;
run;
Should I add another DO loop?
Thanks for your help.
Edits:
Example of what I have:
Date ID
1 19736 H-3-10
2 19736 H-3-12
3 19737 E-2-10
4 19737 E-2-10
Example of what I want:
Date ID Count
1 19736 H-3-10 1
2 19736 H-3-12 1
3 19737 E-2-10 2
4 19737 E-2-10 2
The code below produces the desired output.
With several BY variables, use the last variable in the BY statement for all of the first./last. processing. To see why, add a few put _all_; statements to the data step and look at the values at different points. You shouldn't check first.Date at any point: whenever first.Date is true, first.ID is also true (by definition, first. propagates rightwards), and you also want the count to restart when first.ID is true but first.Date is not.
Basically, treat the initial example as correct and make its variable the last one in your BY statement; add as many additional variables as you like to the left of it, and nothing else changes. This does require the data to be sorted by the BY variables.
data have;
input date id $;
datalines;
19736 H-3-10
19736 H-3-12
19737 E-2-10
19737 E-2-10
;;;;;
run;
data work.want;
do _n_ =1 by 1 until (last.ID); *last.<last variable in by group>;
set work.have;
by Date ID;
if first.ID then count=0; *first.ID is what you want here.;
count+1;
end;
frequency= count; *this is not really needed - can use just the one variable consistently;
do _n_ = 1 by 1 until (last.ID); *again, last.<last var in by group>;
set work.have;
by Date ID;
output;
end;
run;
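To see how the first./last. flags behave with two BY variables, a small diagnostic step can help. This is only a sketch: it assumes the work.have dataset created above (already in Date, ID order) and simply writes the flags to the log.

```sas
/* Diagnostic sketch: print the BY-group flags for every row */
data _null_;
  set work.have;
  by Date ID;
  /* first.Date=1 always implies first.ID=1; the reverse is not true */
  put Date= ID= first.Date= first.ID= last.Date= last.ID=;
run;
```

On the second row (19736, H-3-12) the log should show first.Date=0 together with first.ID=1, which is why the count has to be reset on first.ID rather than first.Date.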
I have this database:
data temp;
input ID date :ddmmyy10. type;
format date ddmmyy10.;
datalines;
1 10/11/2006 1
1 10/12/2006 2
1 15/01/2007 2
1 20/01/2007 3
2 10/08/2008 1
2 11/09/2008 1
2 17/10/2008 1
2 12/11/2008 2
2 10/12/2008 3
;
I would like to create a new column where I repeat the last date by ID:
data temp;
input ID date :ddmmyy10. type last_date :ddmmyy10.;
format date last_date ddmmyy10.;
datalines;
1 10/11/2006 1 20/01/2007
1 10/12/2006 2 20/01/2007
1 15/01/2007 2 20/01/2007
1 20/01/2007 3 20/01/2007
2 10/08/2008 1 10/12/2008
2 11/09/2008 1 10/12/2008
2 17/10/2008 1 10/12/2008
2 12/11/2008 2 10/12/2008
2 10/12/2008 3 10/12/2008
;
I have tried this code but it doesn't work:
data temp;
set temp;
IF last.ID then last_date= .;
RETAIN last_date;
if missing(last_date) then last_date= date;
run;
Thank you in advance for your help!
First thing is that FIRST.ID and LAST.ID variables are not created in the data step unless you include the variable ID in the BY statement.
Second is that to attach the last date to each observation you need to process the data twice. Your current code (if the BY statement is added) will only assign a value to LAST_DATE on the last observation of the by group.
One way to do this is to re-sort the data by descending date within each BY group; then you can use BY ID with FIRST.ID and RETAIN.
proc sort data=have;
by id descending date;
run;
data want;
set have;
by id descending date;
if first.id then last_date=date;
retain last_date;
format last_date ddmmyy10.;
run;
Here is a way to use the original sort order using what is called a double DOW loop. By placing the SET/BY statements inside of a DO loop you can read all of the observations for a group in a single pass of the data step. You then add a second DO loop to re-process that BY group and use the information calculated in the first loop and write out the observations.
data want;
do until (last.id);
set have;
by id;
end;
last_date=date ;
format last_date ddmmyy10.;
do until (last.id);
set have;
by id;
output;
end;
run;
Two other ways are:
Proc SQL joining a subselect, or
Proc MEANS + DATA/MERGE
SQL
proc sql;
create table want as
select have.*, id_group.max_date as last_date format=ddmmyy10.
from
have
join
( select id, max(date) as max_date
from have
group by id
) as id_group
on
have.id = id_group.id
;
quit;
MEANS
proc means noprint data=have;
by id;
var date;
output out=maxdates(keep=id last_date) max(date)=last_date;
run;
data want;
merge have maxdates;
by id;
run;
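A shorter variant of the SQL approach is also possible, because PROC SQL remerges summary statistics with the detail rows when a summary function is selected alongside non-grouped columns (the log shows a NOTE about remerging). The sketch below uses a small hypothetical sample named have, with date read as a numeric SAS date; the ORDER BY is added because the remerge does not promise the original row order.

```sas
/* Hypothetical sample data, read with a date informat */
data have;
  input ID date :ddmmyy10. type;
  format date ddmmyy10.;
  datalines;
1 10/11/2006 1
1 20/01/2007 3
2 10/08/2008 1
2 10/12/2008 3
;
run;

proc sql;
  create table want as
  select have.*,
         max(date) as last_date format=ddmmyy10.  /* remerged per ID */
  from have
  group by id
  order by id, date;
quit;
```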
I need some help comparing rows across different ID groups, all within a single dataset.
That is, if the same value occurs in two or more ID groups, I'd like to delete those observations entirely.
For example:
ID Value
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
The output I desire is:
ID Value
1 D
3 Z
I have looked online extensively, and tried a few things. I thought I could mark the duplicates with a flag and then delete based off that flag.
The flagging code is:
data have;
set want;
flag = first.ID ne last.ID;
run;
This worked for some cases, but I also got duplicates within the same value group flagged.
Therefore the first observation got deleted:
ID Value
3 Z
I also tried:
data have;
set want;
flag = first.ID ne last.ID and first.value ne last.value;
run;
but that didn't mark any duplicates at all.
I would appreciate any help.
Please let me know if any other information is required.
Thanks.
Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.
data have;
input ID Value $;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;
run;
proc sort data = have nodupkey;
by value ID;
run;
data want;
set have;
by value;
if first.value and last.value;
run;
proc sql version:
proc sql;
create table want as
select distinct ID, value from have
group by value
having count(distinct id) =1
order by id
;
quit;
This is my interpretation of the requirements.
Find levels of value that occur in only 1 ID.
data have;
input ID Value:$1.;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;;;;
proc print;
proc summary nway; /*Dedup*/
class id value;
output out=dedup(drop=_type_ rename=(_freq_=occr));
run;
proc print;
run;
proc summary nway;
class value;
output out=want(drop=_type_) idgroup(out[1](id)=) sum(occr)=;
run;
proc print;
where _freq_ eq 1;
run;
proc print;
run;
A slightly different approach can use a hash object to track the unique values belonging to a single group.
data have; input
ID Value:& $1.; datalines;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
run;
proc delete data=want;
proc ds2;
data _null_;
declare package hash values();
declare package hash discards();
declare double idhave;
method init();
values.keys([value]);
values.data([value ID]);
values.defineDone();
discards.keys([value]);
discards.defineDone();
end;
method run();
set have;
if discards.find() ne 0 then do;
idhave = id;
if values.find() eq 0 and id ne idhave then do;
values.remove();
discards.add();
end;
else
values.add();
end;
end;
method term();
values.output('want');
end;
enddata;
run;
quit;
%let syslast = want;
I think what you should do is:
data want;
set have;
by ID value;
if not first.value then flag = 1;
else flag = 0;
run;
This flags every occurrence of a value except the first within a given ID.
I swapped want and have, assuming you create want from have. I also assume have is sorted by ID and value.
Note that this flags only the second 1 D row above; it does not isolate 3 Z.
Additional Inputs
Can't you just do a sort to get rid of the duplicates:
proc sort data = have out = want nodupkey dupout = not_wanted;
by ID value;
run;
If you process the observations by VALUE levels (instead of by ID levels), you just need to keep track of whether any ID is ever different from the first one. This requires the data to be sorted by VALUE.
data want ;
do until (last.value);
set have ;
by value ;
if first.value then first_id=id;
else if id ne first_id then remapped=1;
end;
if not remapped;
keep value id;
run;
I'm trying to create a new variable that increments by days based on the first date of the ID variable in SAS.
I've been trying to use intck but to no avail. Below is my code:
DATA want;
SET have;
LENGTH NEWVAR 8.;
by IDVAR DATEVAR;
RETAIN NEWVAR ;
if first.IDVAR then newvar =0 ;
if first.DATEVAR then NEWVAR = intck('day',first.DATEVAR,'continuous')+1;
RUN;
This is the dataset I'm looking to create:
IDVAR DATEVAR NEWVAR
1 1-Jan-18 1
1 2-Jan-18 2
1 5-Jan-18 5
1 6-Jan-18 6
1 1-Feb-18 32
1 3-Feb-18 34
2 2-Jan-18 1
2 3-Jan-18 2
You need to store the first value in a separate variable that does not get updated.
FIRST.<var> is a special flag variable that will be either 0 or 1. It does not take on the value of the variable.
DATA want;
SET have;
LENGTH NEWVAR 8.;
by IDVAR DATEVAR;
RETAIN first_date_for_id ;
if first.IDVAR then first_date_for_id = datevar ;
NEWVAR = intck('day', first_date_for_id, datevar, 'continuous')+1; /* intck(interval, start, end): start date goes first */
label newvar = 'Days since id started';
drop first_date_for_id;
RUN;
You're pretty close here. When referencing first.datevar the resulting value will be either 1 or 0 (i.e. true or false). Instead, you will need to retain the first datevar.
DATA want;
SET have;
LENGTH NEWVAR 8.;
by IDVAR DATEVAR;
RETAIN firstdate;
if first.IDVAR then do;
firstdate = datevar;
end;
NEWVAR = intck('day',firstdate,datevar)+1;
RUN;
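Since SAS dates are stored as a count of days, plain subtraction gives the same day difference as INTCK with the 'day' interval. The sketch below illustrates this on hypothetical sample data; it assumes DATEVAR is a numeric SAS date rather than a character string.

```sas
/* Hypothetical sample data; DATEVAR is read as a numeric SAS date */
data have;
  input IDVAR DATEVAR :date9.;
  format DATEVAR date9.;
  datalines;
1 01JAN2018
1 05JAN2018
1 01FEB2018
2 02JAN2018
2 03JAN2018
;
run;

data want;
  set have;
  by IDVAR DATEVAR;
  retain firstdate;
  if first.IDVAR then firstdate = DATEVAR;
  /* dates are day counts, so subtraction gives elapsed days */
  NEWVAR = DATEVAR - firstdate + 1;
  drop firstdate;
run;
```

With this sample, NEWVAR comes out as 1, 5, 32 for IDVAR 1 and 1, 2 for IDVAR 2.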
I hope this is not a duplicate question. I've searched the forum, and the RETAIN statement seems to be the weapon of choice, but it copies an observation down, whereas I'm trying to do the following: for a given id, copy the x value from the second row up to the first row. The first value of x is always 2.
Here's my data;
id x
3 2
3 1
3 1
2 2
2 1
2 1
6 2
6 0
6 0
and I want it to look like this:
id x
3 1
3 1
3 1
2 1
2 1
2 1
6 0
6 0
6 0
and here's the starter code;
data have;
input id x;
cards;
3 2
3 1
3 1
2 2
2 1
2 1
6 2
6 0
6 0
;
run;
Lead is tricky in SAS. You can sort in reverse and use the LAG function to get around it, though, and you are right: a RETAIN statement lets us add an order variable so we can sort the data back into its original within-ID order.
data have;
set have;
retain order;
lagid = lag(id);
if id ne lagid then order = 0;
order = order + 1;
drop lagid;
run;
proc sort data=have; by id descending order; run;
data have;
set have;
leadx = lag(x);
run;
proc sort data=have; by id order; run;
data have;
set have;
if order = 3 then x_fixed = x;
else x_fixed = leadx;
run;
If your data is exactly as you say, then you can use a lookahead merge. It merges the dataset side by side with a copy of itself that starts on row 2. You just have to check that you're still on the same ID. This does change the value of x on every record to the value from the following row, not just on the first record of each ID; you could add additional code to handle that (but you can't use FIRST and LAST, because the merge has no BY statement).
data want;
merge have have(firstobs=2 rename=(id=newid x=newx));
if newid=id then x=newx;
keep x id;
run;
If you don't have any additional variables of interest, then you can do something even more interesting: duplicate the second row in its entirety and delete the first row.
data want;
set have;
by id notsorted;
if first.id then do;
firstrow+1;
delete;
end;
if firstrow=1 then do;
firstrow=0;
output;
end;
output;
run;
However, the "safest" method (in terms of doing most likely what you want precisely) is the following, which is a DoW loop.
data want;
idcounter=0;
do _n_ = 1 by 1 until (last.id);
set have;
by id notsorted;
idcounter+1;
if idcounter=2 then second_x = x;
end;
do _n_=1 by 1 until (last.id);
set have;
by id notsorted;
if first.id then x=second_x;
output;
end;
run;
This identifies the second x in the first loop, for that BY group, then in the second loop sets it to the correct value for row 1 and outputs.
In both of the latter examples I assume your data is organized by ID but not truly sorted (like your initial example is). If it's not organized by ID, you need to perform a sort first (but then can remove the NOTSORTED).
I have a dataset with 3 observations:
1 2 3
4 5 6
7 8 9
Now I have to interchange 1 2 3 and 7 8 9.
How can I do this in Base SAS?
If you just want to sort your dataset by a variable in descending order, use proc sort:
data example;
input number;
datalines;
123
456
789
;
run;
proc sort data = example;
by descending number;
run;
If you want to re-order a dataset in a more complex way, create a new variable containing the position that you want each row to be in, and then sort it by that variable.
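The position-variable idea can be sketched as follows. This is a minimal illustration, assuming the example dataset from above; the names position, example_pos and example_swapped are hypothetical, and the ordering rule (swap first and last rows) is just one possible choice.

```sas
/* Assign each row the position it should end up in */
data example_pos;
  set example nobs=n;   /* n = number of rows; not written to the output */
  /* hypothetical ordering rule: swap the first and last rows */
  if _n_ = 1 then position = n;
  else if _n_ = n then position = 1;
  else position = _n_;
run;

/* Sort by the new position and drop the helper variable */
proc sort data=example_pos out=example_swapped(drop=position);
  by position;
run;
```

Any other re-ordering only needs a different rule for computing position; the sort step stays the same.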
If you want to swap the contents of the first and last observations while leaving the rest of the dataset in place, you could do something like this.
data class;
set sashelp.class;
run;
data firstobs;
i = 1;
set sashelp.class(obs = 1);
run;
data lastobs;
i = nobs;
set sashelp.class nobs = nobs point = nobs;
output;
stop;
run;
data transaction;
set lastobs firstobs;
/*Swap the values of i for first and last obs*/
retain _i;
if _n_ = 1 then do;
_i = i;
i = 1;
end;
if _n_ = 2 then i = _i;
drop _i;
run;
data class;
set transaction(keep = i);
modify class point = i;
set transaction;
run;
This modifies just the first and last observations, which should be quite a bit faster than sorting or replacing a large dataset. You can do a similar thing with the update statement, but that only works if your dataset is already sorted / indexed by a unique key.
By Sandeep Sharma:sandeep.sharmas091#gmail.com
data testy;
input a;
datalines;
1
2
3
4
5
6
7
8
9
;
run;
data ghj;
drop y;
do i=nobs-2 to nobs;
set testy point=i nobs=nobs;
output;
end;
do n=4 to nobs-3;
set testy point=n;
output;
end;
do y=1 to 3;
set testy;
output;
end;
stop;
run;