SAS Creating a count variable that resets

I'd like to create a count variable that resets every six months within each type of incident. Here is my code:
proc sort data=incident_v1;
by incident year;
run;
Data incident_v2;
set incident_v1 end= eof;
by incident;
do i=1 until (eof);
do j = 1 to 6;
retain ID 0;
if first.incident then ID=ID +1;
end;
end;
run;
Year-mo  incident  # of incidents  Count every six months
199901   car-jack  6               1
199902   car-jack  7               2
199903   car-jack  12              3
199904   car-jack  8               4
199905   car-jack  13              5
199906   car-jack  8               6
199907   car-jack  13              1
199908   car-jack  6               2
199909   car-jack  8               6
200001   robbery   5               1
200002   robbery   5               2
200003   robbery   8               3
200004   robbery   4               4
200005   robbery   6               5
200006   robbery   14              6

Try this:
proc sort data=incident_v1;
by incident year;
run;
Data incident_v2;
set incident_v1 end= eof;
by incident;
/*Reset Count if New Incident Type*/
if first.incident then Incident_count_6_mnths = 0;
/* Reset the count at month 07 (from your data, year appears to hold year and month in the form YYYYMM) */
if substr(year,5,2) = "07" then Incident_count_6_mnths = 0;
/* If year is not a character variable, use: substr(compress(put(year,8.)),5,2) = "07" */
/* Increment the counter. No RETAIN statement is needed, because a sum statement (variable+1) retains the variable automatically. */
Incident_count_6_mnths+1;
run;
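A variant of the step above, as a rough sketch only. It assumes year is a numeric YYYYMM value (for example 199907) and also resets in January, so that an incident type spanning more than one calendar year starts over each half-year. The variable month is a helper introduced here for illustration. Like the step above, it counts rows, so it relies on months 01 and 07 being present in the data.

```sas
proc sort data=incident_v1;
  by incident year;
run;

data incident_v2;
  set incident_v1;
  by incident;
  /* Assumption: year is numeric YYYYMM, so the last two digits are the month */
  month = mod(year, 100);
  /* Restart on a new incident type or at the start of either half-year */
  if first.incident or month in (1, 7) then Incident_count_6_mnths = 0;
  /* Sum statement: increments and retains the counter */
  Incident_count_6_mnths + 1;
  drop month;
run;
```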

Related

SAS how to Dense_rank

I am new to SAS; I used to work with Oracle SQL.
I asked a similar question before:
How to tricky rank SAS?
I thought the answer to that question would solve this problem,
but I got stuck.
My code is this:
data stepstep;
input emplid KEY:$3. count;
cards;
11 11Y 1
11 11Y 2
11 11N 3
11 11N 4
11 11Y 5
11 11N 6
12 12Y 1
12 12Y 2
12 12N 3
;
run;
and then I tried
data stepstep2;
set stepstep;
by key emplid NOTSORTED;
if first.key AND first.emplid then rank=1;
ELSE rank+1;
run;
Output is this
I want to show
emplid key count rank
11 11Y 1 1
11 11Y 2 1
11 11N 3 2
11 11N 4 2
11 11Y 5 3
11 11N 6 4
12 12Y 1 1
12 12Y 2 1
12 12N 3 2
When a new emplid starts, I want rank to restart from 1.
In this example, when the first row for emplid 12 appears, rank goes back to 1.
How can I do that?
You need to leverage your BY groups properly and I think you have them in the wrong order for starters. Try this instead:
data stepstep2;
set stepstep;
by emplid KEY NOTSORTED;
if first.emplid then rank=1; *start of each emplid group;
ELSE if first.key then rank+1; *start of each new key;
run;
You can also use a sum statement:
data stepstep2;
set stepstep;
by emplid key NOTSORTED;
if first.emplid then rank=0;
rank + first.key;
run;
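To see why rank + first.key works, it can help to copy the automatic FIRST. flags into ordinary variables and print them. This is only an illustrative sketch; the dataset name flags and the variables first_emplid and first_key are made up here. FIRST.key is 1 on the first row of each run of identical key values (and whenever emplid changes) and 0 otherwise, so adding it to rank increments rank exactly once per run.

```sas
data flags;
  set stepstep;
  by emplid key notsorted;
  /* FIRST. variables are dropped automatically, so copy them to keep them */
  first_emplid = first.emplid;
  first_key    = first.key;
run;

proc print data=flags;
run;
```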

Create values for group - SAS

data test;
input Index Indicator value FinalValue;
datalines;
1 0 5 21
1 1 21 21
2 1 0 0
3 0 4 7
3 1 7 7
3 0 8 7
3 0 2 7
4 1 1 1
4 0 4 1
;
run;
I have a data set with the first 3 columns. How do I get the 4th column based on the indicator? For example, for index 1, the value is 21 in the row where indicator = 1, so I put 21 as the final value in all rows for index 1.
Use the SAS RETAIN statement.
You can do this in a data step by retaining the value where indicator = 1.
Steps:
1. Sort your data by Index and descending Indicator
2. Process by Index and retain the value where Indicator = 1
Code:
/* Sort by Index and descending Indicator, and drop the hardcoded FinalValue */
proc sort data=test (keep=Index Indicator value);
by index descending indicator;
run;
/* Retain FinalValue across the rows of each Index */
data want;
set test;
by index;
retain FinalValue;
keep Index Indicator value FinalValue;
if indicator = 1 then FinalValue = value;
/* Assign missing (.) when an Index group has no row with Indicator = 1 */
if indicator ne 1 and first.index then FinalValue = .;
run;
Output:
Index=1 Indicator=1 value=21 FinalValue=21
Index=1 Indicator=0 value=5 FinalValue=21
Index=2 Indicator=1 value=0 FinalValue=0
Index=3 Indicator=1 value=7 FinalValue=7
Index=3 Indicator=0 value=4 FinalValue=7
Index=3 Indicator=0 value=8 FinalValue=7
Index=3 Indicator=0 value=2 FinalValue=7
Index=4 Indicator=1 value=1 FinalValue=1
Index=4 Indicator=0 value=4 FinalValue=1
Use proc sql with a left join: select the value where indicator=1 for each index, then left join that back to the original dataset. It seems the final value in your first row for index=3 should be 7, not 0.
proc sql;
select a.*, b.finalvalue from test a
left join (select index, value as finalvalue from test where indicator=1) b
on a.index=b.index;
quit;
This is rather old school but should be adequate. I reckon you call it a self merge or something.
data test;
input Index Indicator value;* FinalValue;
datalines;
1 0 5 21
1 1 21 21
2 1 0 0
3 0 4 7
3 1 7 7
3 0 8 7
3 0 2 7
4 1 1 1
4 0 4 1
;;;;
run;
data final;
if 0 then set test;
merge test(where=(indicator eq 1) rename=(value=FinalValue)) test;
by index;
run;
proc print;
run;
Obs Index Indicator value FinalValue
1 1 0 5 21
2 1 1 21 21
3 2 1 0 0
4 3 0 4 7
5 3 1 7 7
6 3 0 8 7
7 3 0 2 7
8 4 1 1 1
9 4 0 4 1

SAS, calculate row difference

data test;
input ID month d_month;
datalines;
1 59 0
1 70 11
1 80 21
2 10 0
2 11 1
2 13 3
3 5 0
3 9 4
4 8 0
;
run;
I have two columns of data, ID and month. Column 1 is the ID; the same ID may have multiple rows (1-5). The second column is the enrolled month. I want to create the third column, which holds the difference between the current month and the initial month for each ID.
You can do it like this:
data test;
input ID month d_month;
datalines;
1 59 0
1 70 11
1 80 21
2 10 0
2 11 1
2 13 3
3 5 0
3 9 4
4 8 0
;
run;
data calc;
set test;
by id;
retain current_month;
if first.id then do;
current_month=month;
calc_month=0;
end;
if ^first.id then do;
calc_month = month - current_month ;
end;
run;
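As an alternative sketch, PROC SQL can compute the same difference without a RETAIN. It assumes the initial month of each ID is also its smallest month, which holds in the sample data but is an assumption in general. The table name calc_sql is made up for illustration. SAS remerges the group minimum back onto every row and writes a note about that to the log.

```sas
proc sql;
  create table calc_sql as
  select ID,
         month,
         /* min(month) is computed per ID and remerged onto each row */
         month - min(month) as calc_month
  from test
  group by ID
  order by ID, month;
quit;
```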

Filling in gaps between sequentially numbered records and updating a status indicator

In a summarized dataset, I have the status of an event at each hour after baseline in which it was recorded. I also have the last hour the event could have been recorded. I want to create a new dataset with one record for each hour from the first through the last hour, with the status for each record being the one from the last recorded status.
Here is an example dataset:
data new;
input hour status last_hour;
cards;
2 1 12
4 1 12
5 1 12
6 1 12
7 0 12
9 1 12
10 0 12
;
run;
In this case, the first recorded hour was the second, and the last recorded hour was the 10th. The last possible hour to record data was the 12th.
The final dataset should look like so:
0 . 12
1 . 12
2 1 12
3 1 12
4 1 12
5 1 12
6 1 12
7 0 12
8 0 12
9 1 12
10 0 12
11 0 12
12 0 12
I sort of have it working with this series of data steps, but I'm not sure if there's a cleaner way I'm not seeing.
data step1;
set new (keep=id hour last_hour);
by id;
do hour = 0 to last_hour;
output;
end;
run;
proc sort data=step1;
by id hour;
run;
proc sql;
create table step2 as
select distinct a.id, a.hour, b.status
from step1 as a
left join new as b
on a.id = b.id
and a.hour = b.hour
order by a.id, a.hour;
quit;
data step3;
set step2;
by id hour;
retain previous_status;
if first.id then do;
previous_status = .;
if status > . then previous_status = status;
end;
if not first.id then do;
if status = . and previous_status > . then status = previous_status;
if status > . then previous_status = status;
end;
run;
Seeing your code, it seems you left out of your question the fact that you also have id's. So this is a newer solution that deals with different id's. See further below for my first solution ignoring id's.
Since last_hour is always 12, I left it out of the have dataset. It will be added later on.
data have;
input id hour status;
cards;
1 2 1
1 4 1
1 5 1
1 6 1
1 7 0
1 9 1
1 10 0
2 2 1
2 4 1
2 5 1
2 6 1
2 7 0
2 9 1
2 10 0
;
Create an hours dataset containing just the numbers 0 through 12:
data hours;
do i = 0 to 12;
hour = i;
output;
end;
drop i;
run;
Create a temporary dataset that has the right number of rows: 13 rows for every id, one for each hour from 0 through 12.
proc sql;
create table tmp as
select distinct t1.id, t2.hour, 12 as last_hour
from have as t1
cross join
(select hour from hours) as t2;
quit;
Then use merge and retain to fill in the missing status values where appropriate.
data want;
merge have
tmp;
by id hour;
retain status_previous;
/* Start each id with no carried-over status */
if first.id then status_previous = .;
if status ne . then status_previous = status;
else status = status_previous;
drop status_previous;
run;
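If last_hour can differ between id's, a single data step with a temporary array may be a cleaner option. The following is a rough sketch rather than a tested solution: the dataset names have_lh and want_dow, the variable prev, and the upper array bound of 48 are all assumptions made for illustration. The input must be sorted by id.

```sas
/* Hypothetical input: one row per recorded hour; last_hour may differ by id */
data have_lh;
  input id hour status last_hour;
  cards;
1 2 1 12
1 4 1 12
1 7 0 12
2 1 0 6
2 3 1 6
;
run;

data want_dow (keep=id hour status last_hour);
  /* Assumption: hours never exceed 48; enlarge the array if they can */
  array s[0:48] _temporary_;
  call missing(of s[*]);
  /* First pass over one id: remember the status recorded at each hour */
  do until (last.id);
    set have_lh;
    by id;
    s[hour] = status;
  end;
  /* Second pass: write every hour, carrying the last known status forward */
  prev = .;
  do hour = 0 to last_hour;
    if s[hour] ne . then prev = s[hour];
    status = prev;
    output;
  end;
run;
```

Each iteration of the data step handles one id, so no separate hours table, cross join, or merge is needed.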
Previous solution (no id's)
If last_hour is always 12, then this should do it:
data have;
input hour status last_hour;
datalines;
2 1 12
4 1 12
5 1 12
6 1 12
7 0 12
9 1 12
10 0 12
;
data hours;
do i = 0 to 12;
hour = i;
last_hour = 12;
output;
end;
drop i;
run;
data want;
merge have
hours;
by hour;
retain status_previous;
if status ne . then status_previous = status;
else if status_previous ne . then status = status_previous;
drop status_previous;
run;

Proportional sampling via SAS

I have 600,000+ observations that I want to sample proportionally to ZIP code (the number of records per ZIP code in the data is proportional to its population density). The key variables in the data are ZIP CODE, ID, and GROUP.
I need to fix my existing SAS code so that when SAS picks a record for a ZIP code, it also picks all the other records in that record's GROUP. For example, if ID=2 is selected, I need ID=1 and ID=3 as well, so that I have all the records in GROUP=1.
ID GROUP ZIP
1 1 46227
2 1 46227
3 1 46227
4 2 47620
5 3 47433
6 3 47433
7 3 47433
8 4 46135
9 4 46135
10 5 46202
11 5 46202
12 5 46202
13 5 46202
14 6 46793
15 6 46793
16 7 46202
17 7 46202
18 7 46202
19 8 46409
20 8 46409
21 9 46030
22 9 46030
23 9 46030
24 10 46383
25 10 46383
26 10 46383
I have the following SAS code, which samples 1000 observations from the data. However, it picks records at random within each ZIP code without considering the GROUP variable.
proc freq data=sample;
tables zip / out=outfreq noprint;
run;
data newfreq error; set outfreq;
sampnum=(percent*1000)/100;
_NSIZE_=round(sampnum, 1);
sampnum=round(sampnum, .01);
if _NSIZE_=0 then output error;
if _NSIZE_=0 then delete;
output newfreq;
run;
data newfreq2; set newfreq error;
by zip;
keep zip _NSIZE_;
run;
proc sort data=newfreq2;
by zip;
run;
proc sort data=sample;
by zip;
run;
/* proportional stratified sampling */
proc surveyselect data=sample seed=2020 out=sampout sampsize=newfreq2;
strata zip;
id id zip;
run;
I hope I am explaining my problem clearly. If not, I'll try to clarify and/or elaborate on things that are unclear.
Thanks in advance.
Here's an attempt that seems to work.
data test;
input id group zip;
cards;
1 1 46227
2 1 46227
3 1 46227
4 2 47620
5 3 47433
6 3 47433
7 3 47433
8 4 46135
9 4 46135
10 5 46202
11 5 46202
12 5 46202
13 5 46202
14 6 46793
15 6 46793
16 7 46202
17 7 46202
18 7 46202
19 8 46409
20 8 46409
21 9 46030
22 9 46030
23 9 46030
24 10 46383
25 10 46383
26 10 46383
;
run;
data test;
set test;
rand = ranuni(1200);
run;
proc sort data=test;
by rand;
run;
/* 10 here is how many cases you want to sample initially */
data test;
set test;
if _n_ <= 10 then sample = 1;
else sample = 0;
run;
proc sort data=test;
by group
descending sample;
run;
data test;
set test;
by group;
retain keep;
/* After the sort, the first row of each group has sample = 1 if any row in the group was sampled */
if first.group then keep = sample;
drop rand
sample;
run;
proc sort data=test;
by id;
run;
As a bonus, here's an R one-liner that will give the same results:
# 3 here is the number of cases being sampled
test[test$group %in% (test[sample(1:nrow(test),3),]$group),]
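Another possibility, offered as a sketch under assumptions rather than a verified solution: PROC SURVEYSELECT has a SAMPLINGUNIT statement that selects whole units (here, groups) instead of single rows. This assumes your SAS/STAT release supports that statement; the output name sampout_groups and the choice of 3 groups are arbitrary. It does not implement the proportional-by-ZIP allocation from the question.

```sas
/* Assumption: SAMPLINGUNIT is available in your SAS/STAT release */
proc surveyselect data=test method=srs n=3 seed=2020 out=sampout_groups;
  /* Select 3 whole groups; every row of a selected group is written out */
  samplingunit group;
run;
```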
Not sure what you mean. Are you trying to sample ZIP codes (and return all obs for each ZIP), or do you want a sample stratified BY ZIP code (meaning N obs from each ZIP)? You might want to see Example 89.4 in the SAS/STAT User's Guide.
This example of 'proportional allocation' on p. 6 of the article referenced below may help:
proc surveyselect data=frame out=sampsizes_prop sampsize=400;
strata cityside / alloc=prop;
run;
Article:
http://analytics.ncsu.edu/sesug/2013/SD-01.pdf