Please can anyone help me the follwing probelm.
I have following dummy data:
id num
1 1
1 2
1 1
1 2
1 1
1 2
2 1
2 15
2 1
2 1
2 1
2 15
2 1
2 15
How to count number of times num (column) is changing for each id?
Please find the results and new column.
I need results like this
id number no_of_times
1 1 1
1 2 1
1 1 1
1 2 2
1 1 1
1 2 3
2 1 1
2 15 1
2 1 1
2 1 1
2 1 1
2 15 2
2 1 1
2 15 3
Hope you can understand after seeing the results
The following hash approach works for the test data provided with the question:
data have;
input id number no_of_times_target;
cards;
1 1 1
1 2 1
1 1 1
1 2 2
1 1 1
1 2 3
2 1 1
2 15 1
2 1 1
2 1 1
2 1 1
2 15 2
2 1 1
2 15 3
;
run;
data want;
set have;
by id;
if _n_ = 1 then do;
length prev_number no_of_times 8;
declare hash h();
rc = h.definekey('number','prev_number');
rc = h.definedata('no_of_times');
rc = h.definedone();
end;
prev_number = lag(number);
if number > prev_number and not(first.id) then do;
rc = h.find();
no_of_times = sum(no_of_times,1);
rc = h.replace();
end;
else no_of_times = 1;
if last.id then rc = h.clear();
drop rc prev_number;
run;
Related
I want to find the number of unique ids for every subset combination of the variables. For example
data have;
input id var1 var2 var3;
datalines;
5 1 0 0
5 1 1 1
5 1 0 1
5 0 0 0
6 1 0 0
7 1 1 1
8 1 0 1
9 0 0 0
10 1 0 0
11 1 0 0
12 1 . 1
13 0 0 1
;
run;
I want the result to be
var1 var2 var3 count
. . 0 5
. . 1 5
. 0 . 7
. 0 0 5
. 0 1 3
. 1 . 2
. 1 1 2
0 . . 3
0 . 0 2
0 . 1 1
0 0 . 3
0 0 0 2
0 0 1 1
1 . . 7
1 . 0 4
1 . 1 4
1 0 . 5
1 0 0 4
1 0 1 2
1 1 . 2
1 1 1 2
which is the result of appending all the possible proc sql; group bys (var1 is shown below)
proc sql;
create table sub1 as
select var1, count(distinct id) as count
from have
where not missing(var1)
group by var1
;
quit;
I don't care about the case where all variables are missing or when any of the variables in the group by are missing. Is there a more efficient way of doing this?
You can use Proc SUMMARY to compute the combinations of var1-var3 values for each id by group. From the SUMMARY output a SQL query can count the distinct ids per combination.
Example:
data have;
input id var1 var2 var3;
datalines;
5 1 0 0
5 1 1 1
5 1 0 1
5 0 0 0
6 1 0 0
7 1 1 1
8 1 0 1
9 0 0 0
10 1 0 0
11 1 0 0
12 1 . 1
13 0 0 1
;
proc summary noprint missing data=have;
by id;
class var1-var3;
output out=combos;
run;
proc sql;
create table want as
select var1, var2, var3, count(distinct id) as count
from combos
group by var1, var2, var3
;
I have this database:
data temp;
input ID monitoring_date score ;
datalines;
1 10/11/2006 0
1 10/12/2006 0
1 15/01/2007 1
1 20/01/2007 1
1 20/04/2007 1
2 10/08/2008 0
2 11/09/2008 0
2 17/10/2008 1
2 12/11/2008 0
3 10/12/2008 0
3 10/08/2008 0
3 11/09/2008 0
3 17/10/2009 1
3 12/12/2009 1
3 05/01/2010 0
4 10/12/2006 0
4 10/08/2006 0
4 11/09/2006 0
4 17/10/2007 0
4 12/12/2007 0
4 09/04/2008 1
4 05/08/2008 1
5 10/12/2013 0
5 03/09/2013 0
5 11/09/2013 0
5 19/10/2014 0
5 10/12/2014 1
5 14/01/2015 1
6 10/12/2017 0
6 10/08/2018 0
6 11/09/2018 0
6 17/10/2018 1
6 12/12/2018 1
6 09/04/2019 1
6 25/07/2019 0
6 05/08/2019 1
6 15/09/2019 0
;
I would like to create a new database with a new column where I note, for each ID, the date of the first progression of the score from 0 to 1 and if this progression is stable at least 3 months until at the end of monitoring else date_progresion = . :
data want;
input ID date_progression;
datalines;
1 15/01/2007
2 .
3 .
4 09/04/2008
5 .
6 .
;
I really have no idea to code this and I would like to get the wanted data to generate a cox model where the progression (Yes/No) is my event.
I am really stuck !
Thank you in advance for your help!
A DOW loop can process the ID groups, tracking for a single active run of 1s. A run has a start date and duration.
Example:
data want;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
select;
when (pscore = 0 and score = 1) do; state = 1; start = date; dur = 1; end;
when (pscore = 1 and score = 1) do; state = 2; dur + 1; end;
when (pscore = 1 and score = 0) do; state = 3; start = .; dur = .; end;
when (pscore = 0 and score = 0) do; state = 4; end;
otherwise;
end;
pscore = score;
end;
if state = 2 and dur >= 3 then progression_date = start;
keep ID progression_date;
format progression_date yymmdd10.;
run;
I have a big panel dataset that looks somewhat like this:
data have;
input id t a b ;
datalines;
1 1 0 0
1 2 0 0
1 3 1 0
1 4 0 0
1 5 0 1
1 6 1 0
1 7 0 0
1 8 0 0
1 9 0 0
1 10 0 1
2 1 0 0
2 2 1 0
2 3 0 0
2 4 0 0
2 5 0 1
2 6 0 1
2 7 0 1
2 8 0 1
2 9 1 0
2 10 0 1
3 1 0 0
3 2 0 0
3 3 0 0
3 4 0 0
3 5 0 0
3 6 0 0
3 7 1 0
3 8 0 0
3 9 0 0
3 10 0 0
;
run;
For every ID I want to record all 'trigger' events, namely when a=1 and then I need to how long it takes to the next occurrence of b=1. The final output should then give me the following:
data want;
input id a_no a_t b_t diff ;
datalines;
1 1 3 5 2
1 2 6 10 4
2 1 2 5 3
2 2 9 10 1
3 1 7 . .
;
run;
It is of course no problem to get all a=1 and b=1 events, but as it is a very big dataset with a lot of both events for every ID I am searching for an elegant and straight-forward solution. Any ideas?
Here's a fairly simple SQL approach that gives more or less the desired output:
proc sql;
create table want
as select
t1.id,
t1.t as a_t,
t2.t as b_t,
t2.t - t1.t as diff
from
have(where = (a=1)) t1
left join
have(where = (b=1)) t2
on
t1.id = t2.id
and t2.t > t1.t
group by t1.id, t1.t
having diff = min(diff)
;
quit;
The only part missing is a_no. This sort of row-incrementing ID is quite a lot of work to generate consistently in SQL, but it's trivial with an extra data step:
data want;
set want;
by id;
if first.id then a_no = 0;
a_no + 1;
run;
An elegant DATA step way can use nested DOW loops. It's straight forward when you understand DOW loops.
data want(keep=id--diff);
length id a_no a_t b_t diff 8;
do until (last.id); * process each group;
do a_no = 1 by 1 until(last.id); * counter for each output;
do until ( output_condition or end); * process each triggering state change;
SET have end=end; * read data;
by id; * setup first. last. variables for group;
if a=1 then a_t = t; * detect and record start of trigger state;
output_condition = (b=1 and t > a_t > 0); * evaluate for proper end of trigger state;
end;
if output_condition then do;
b_t = t; * compute remaining info at output point;
diff = b_t - a_t;
OUTPUT;
a_t = .; * reset trigger state tracking variables;
b_t = .;
end;
else
OUTPUT; * end of data reached without triggered output;
end;
end;
run;
Note: A SQL way (not shown) can use self join within groups.
I have a dataset work.test1 that consists of 4 variables hhid (household id), pid (person id), pidlink (combination of hhid and pid) and bin (positive or negative).
example data looks like this:
obs hhid pid pidlink bin
1 10600 1 1060001 1
2 10600 1 1060001 1
3 10800 1 1080001 1
4 10800 1 1080001 1
5 10800 2 1080002 1
6 10800 2 1080002 2
7 12200 1 1220001 1
8 12200 1 1220001 2
Now I want to create a dataset work.test2 that should only contain unique hhid that are either bin 2 (if there is a bin=2 in the household) or bin 1 (if there are no bin 2 in the household). If there are more than 1 bin=2, i would choose the first one. And if there are no bin 2 but there are more than 1 bin 1 i would chose the first one. The resulting dataset should only have unique hhid (single entry per household).
The resulting output should look like this:
obs hhid pid pidlink bin
1 10600 1 1060001 1
2 10800 2 1080001 2
3 12200 1 1220001 2
Thank you
As far as data and output shown a group by and max function should work and given me desired results.
data have(drop =obs);
input obs hhid pid pidlink bin;
datalines;
1 10600 1 1060001 1
2 10600 1 1060001 1
3 10800 1 1080001 1
4 10800 1 1080001 1
5 10800 2 1080002 1
6 10800 2 1080002 2
7 12200 1 1220001 1
8 12200 1 1220001 2
;
proc sql;
select hhid, max(pid) as pid, max(pidlink) as pidlink, max(bin) as bin
from have
group by 1;
if you have more columns then it gets little tricky but you can do it but again you need more choice, otherwise you will more records. see the query below
data have(drop =obs);
input obs hhid pid pidlink bin anotherval1 abotherval2 $;
datalines;
1 10600 1 1060001 1 7 A
2 10600 1 1060001 1 8 B
3 10800 1 1080001 1 6 C
4 10800 1 1080001 1 8 D
5 10800 2 1080002 1 8 E
6 10800 2 1080002 2 9 F
7 12200 1 1220001 1 10 G
8 12200 1 1220001 2 7 H
;
proc sql;
select * from have
group by 1
having pid= max(pid)
and pidlink = max(pidlink)
and bin = max(bin) ;
if you want to have only distinct records with additional columns then
data have1;
set have;
val =_n_;
run;
proc sql;
create table have2(drop =val) as
select * from
(select * from have1
group by 1
having pid= max(pid)
and pidlink = max(pidlink)
and bin = max(bin))a
group by hhid, pid, pid,bin
having val=min(val);
I have data with
One binary variable, poor
Two socio-demographic variables var1 and var2
I would like to have the poverty rate of each of my var1 * var2 possible value, that would look like that :
But with three variables in a proc freq, I get multiple outputs, one for each value of the first variable I put on my product
proc freq data=test;
table var1*var2*poor;
run;
How can I get something close to what I would like ?
Try this
data test;
input var1 var2 poor;
cards;
1 1 1
2 3 0
3 2 1
4 1 1
1 2 1
2 3 0
4 1 0
4 2 0
3 1 1
1 2 0
3 2 0
1 3 1
3 3 0
3 3 0
3 3 1
1 1 0
2 2 0
2 2 1
2 2 1
2 1 1
2 1 1
2 1 1
;
run;
proc tabulate data=test;
class var1 var2 poor;
tables var1,
var2*poor*pctn<poor>={label="%"};
run;