count and group the column by sequence - sas

I have a dataset that has to be grouped by number as follows.
ID dept count
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 30 4
8 30 4
9 30 4
10 30 4
So for every 3rd row I need a new level. The output should be as follows.
ID dept count Level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 2
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 2
10 30 4 2
I have tried counting the number of rows based on the dept and count.
data want;
set have;
by dept count;
if first.count then level=1;
else level+1;
run;
This generates a running count within each group, but not exactly what I am looking for:
ID dept count Level
1 10 2 1
2 10 2 2
3 20 4 1
4 20 4 2
5 20 4 3
6 20 4 4
7 30 4 1
8 30 4 2
9 30 4 3
10 30 4 4

It isn't quite clear what output you want. I've extended your input data a bit - please
could you clarify what output you'd expect for this input and what the logic is for generating it?
I've made a best guess at roughly what you might be aiming for (incrementing the level every 3 rows within the same dept and count); perhaps this will be enough for you to get to the answer you want.
data have;
input ID dept count;
cards;
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 30 4
8 30 4
9 30 4
10 30 4
11 30 4
12 30 4
13 30 4
14 30 4
;
run;
data want;
set have;
by dept count;
if first.count then do;
level = 0;
dummy = 0;
end;
if mod(dummy,3) = 0 then level + 1;
dummy + 1;
drop dummy;
run;
Output:
ID dept count level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 1
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 1
10 30 4 2
11 30 4 2
12 30 4 2
13 30 4 3
14 30 4 3

One way to do this is to nest the SET statement inside a DO loop, or in this case two DO loops: one to generate LEVEL (within DEPT) and a second to count by twos. Use the LAST.DEPT flag to handle an odd number of observations.
So if I modify the input to include an odd number of observations in some groups:
data have;
input ID dept count;
cards;
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 20 4
8 30 4
9 30 4
10 30 4
;
Then you can use this step to assign the LEVEL variable:
data want ;
do level=1 by 1 until(last.dept);
do sublevel=1 to 2 until(last.dept);
set have;
by dept;
output;
end;
end;
run;
Results:
Obs level sublevel ID dept count
1 1 1 1 10 2
2 1 2 2 10 2
3 1 1 3 20 4
4 1 2 4 20 4
5 2 1 5 20 4
6 2 2 6 20 4
7 3 1 7 20 4
8 1 1 8 30 4
9 1 2 9 30 4
10 2 1 10 30 4
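SUBLEVEL is kept in the results above only to show how the pairing works; if it isn't wanted in the final data, a minor variation of the same step (just adding a DROP statement) removes it:
data want ;
do level=1 by 1 until(last.dept);
do sublevel=1 to 2 until(last.dept);
set have;
by dept;
output;
end;
end;
drop sublevel;
run;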

Related

Pandas grouped differences with variable lags

I have a pandas data frame with three variables. The first is a grouping variable, the second a within group "scenario" and the third an outcome. I would like to calculate the within group difference between the null condition, scenario zero, and the other scenarios within the group. The number of scenarios varies between the different groups. My data looks like:
ipdb> aDf
FieldId Scenario TN_load
0 0 0 134.922952
1 0 1 111.787326
2 0 2 104.805951
3 1 0 17.743467
4 1 1 13.411849
5 1 2 13.944552
6 1 3 17.499152
7 1 4 17.640090
8 1 5 14.220673
9 1 6 14.912306
10 1 7 17.233862
11 1 8 13.313953
12 1 9 17.967438
13 1 10 14.051882
14 1 11 16.307317
15 1 12 12.506358
16 1 13 16.266233
17 1 14 12.913150
18 1 15 18.149811
19 1 16 12.337736
20 1 17 12.008868
21 1 18 13.434605
22 2 0 454.857959
23 2 1 414.372215
24 2 2 478.371387
25 2 3 385.973388
26 2 4 487.293966
27 2 5 481.280175
28 2 6 403.285123
29 3 0 30.718375
... ... ...
29173 4997 3 53.193992
29174 4997 4 45.800968
I will also have to write functions to get percentage differences etc. but this has me stumped. Any help greatly appreciated.
You can get the difference with the scenario 0 within groups using groupby and transform like:
df['TN_load_0'] = df['TN_load'].groupby(df['FieldId']).transform(lambda x: x - x.iloc[0])
df
FieldId Scenario TN_load TN_load_0
0 0 0 134.922952 0.000000
1 0 1 111.787326 -23.135626
2 0 2 104.805951 -30.117001
3 1 0 17.743467 0.000000
4 1 1 13.411849 -4.331618
5 1 2 13.944552 -3.798915
6 1 3 17.499152 -0.244315
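For the percentage differences mentioned in the question, the same groupby/transform pattern can be extended. This is only a sketch; TN_load_pct is a hypothetical column name, and it assumes the scenario-0 row is the first row of each group, as above:
df['TN_load_pct'] = df['TN_load'].groupby(df['FieldId']).transform(lambda x: (x - x.iloc[0]) / x.iloc[0] * 100)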

Keep first record when event occurs

I have the following data in Stata:
clear
* Input data
input grade id exit time
1 1 . 10
2 1 . 20
3 1 2 30
4 1 0 40
5 1 . 50
1 2 0 10
2 2 0 20
3 2 0 30
4 2 0 40
5 2 0 50
1 3 1 10
2 3 1 20
3 3 0 30
4 3 . 40
5 3 . 50
1 4 . 10
2 4 . 20
3 4 . 30
4 4 . 40
5 4 . 50
1 5 1 10
2 5 2 20
3 5 1 30
4 5 1 40
5 5 1 50
end
The objective is to take the first row for each id when an event occurs, and if no event occurs then take the last record for each id. Here is an example of the data I hope to obtain:
* Input data
input grade id exit time
3 1 2 30
5 2 0 50
1 3 1 10
5 4 . 50
1 5 1 10
end
The definition of an event appears to be that exit is not zero or missing. If so, then all you need to do is tweak the code in my previous answer:
bysort id (time): egen when_first_e = min(cond(exit > 0 & exit < ., time, .))
by id: gen tokeep = cond(when_first_e == ., time == time[_N], time == when_first_e)
Previous thread was here.
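If you then want to reduce the data to just those rows, a minimal follow-up (a sketch assuming the when_first_e and tokeep variables created above) is:
keep if tokeep
drop when_first_e tokeep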

SAS - Split single column into two based on value of non-binary ID column

I have data which is as follows:
data have;
length
group 8
replicate $ 1
day 8
observation 8
;
input (_all_) (:);
datalines;
1 A 1 0
1 A 1 5
1 A 1 3
1 A 1 3
1 A 2 7
1 A 2 2
1 A 2 4
1 A 2 2
1 B 1 1
1 B 1 3
1 B 1 8
1 B 1 0
1 B 2 3
1 B 2 8
1 B 2 1
1 B 2 3
1 C 1 1
1 C 1 5
1 C 1 2
1 C 1 7
1 C 2 2
1 C 2 1
1 C 2 4
1 C 2 1
2 A 1 7
2 A 1 5
2 A 1 3
2 A 1 1
2 A 2 0
2 A 2 5
2 A 2 3
2 A 2 0
2 B 1 0
2 B 1 3
2 B 1 4
2 B 1 8
2 B 2 1
2 B 2 3
2 B 2 4
2 B 2 0
2 C 1 0
2 C 1 4
2 C 1 3
2 C 1 1
2 C 2 2
2 C 2 3
2 C 2 0
2 C 2 1
3 A 1 4
3 A 1 5
3 A 1 6
3 A 1 7
3 A 2 3
3 A 2 1
3 A 2 5
3 A 2 2
3 B 1 2
3 B 1 0
3 B 1 2
3 B 1 3
3 B 2 0
3 B 2 6
3 B 2 3
3 B 2 7
3 C 1 7
3 C 1 5
3 C 1 3
3 C 1 1
3 C 2 0
3 C 2 3
3 C 2 2
3 C 2 1
;
run;
I want to split observation into two columns based on day.
observation_ observation_
Obs group replicate day_1 day_2
1 1 A 0 7
2 1 A 5 2
3 1 A 3 4
4 1 A 3 2
5 1 B 1 3
6 1 B 3 8
7 1 B 8 1
8 1 B 0 3
9 1 C 1 2
10 1 C 5 1
11 1 C 2 4
12 1 C 7 1
13 2 A 7 0
14 2 A 5 5
15 2 A 3 3
16 2 A 1 0
17 2 B 0 1
18 2 B 3 3
19 2 B 4 4
20 2 B 8 0
21 2 C 0 2
22 2 C 4 3
23 2 C 3 0
24 2 C 1 1
25 3 A 4 3
26 3 A 5 1
27 3 A 6 5
28 3 A 7 2
29 3 B 2 0
30 3 B 0 6
31 3 B 2 3
32 3 B 3 7
33 3 C 7 0
34 3 C 5 3
35 3 C 3 2
36 3 C 1 1
The observant SO reader will notice that I have asked essentially the same question previously. However, because the variable being used to split the variable of interest isn't binary, that solution doesn't generalize, thanks to SAS's obsession with "levels" and "by groups".
Trying it directly, the following occurs:
proc sort data = have out = sorted;
by
group
replicate
;
run;
proc transpose data = sorted out = test;
by
group
replicate
;
var observation;
id day;
run;
ERROR: The ID value "_1" occurs twice in the same BY group.
I can use the LET option to suppress the errors, but in addition to cluttering up the log, SAS retains only the last observation of each BY group.
proc sort data = have out = sorted;
by
group
replicate
;
run;
proc transpose data = sorted out = test let;
by
group
replicate
;
var observation;
id day;
run;
Obs group replicate _NAME_ _1 _2
1 1 A observation 3 2
2 1 B observation 0 3
3 1 C observation 7 1
4 2 A observation 1 0
5 2 B observation 8 0
6 2 C observation 1 1
7 3 A observation 7 2
8 3 B observation 3 7
9 3 C observation 1 1
I don't doubt there's some kludgy way it could be done, such as splitting each group into a separate data set and then re-merging them. It seems like it should be doable with PROC TRANSPOSE, although how escapes me. Any ideas?
Not sure what you're talking about with "SAS's obsession...", but the issue here is fairly straightforward; you need to tell SAS about the four rows (or whatever) being separate, distinct rows. by tells SAS what the row-level ID is, but you're lying to it when you say by group replicate, since there are still multiple rows under that. So you need to have a unique key. (This would be true in any database-like language, nothing unique to SAS here.)
I would do this - make a day_row field, then sort by that.
data have_id;
set have;
by group replicate day;
if first.day then day_row = 0;
day_row+1;
run;
proc sort data=have_id;
by group replicate day_row;
run;
proc transpose data=have_id out=want(drop=_name_) prefix=observation_day_;
by group replicate day_row;
var observation;
id day;
run;
Your output looks like you don't want to transpose the data but instead just want to split it into DAY1 and DAY2 sets and merge them back together. This will just pair the multiple readings per BY group in the same order that they appear, which is what it looks like you did in your example.
data want ;
merge
have(where=(day=1) rename=(observation=day_1))
have(where=(day=2) rename=(observation=day_2))
;
by group replicate;
drop day ;
run;
You can read the source data as many times as you need for the number of values of DAY.
If you think that you might not have the same number of observations per BY group for each DAY, then you should add these statements at the end of the data step:
output;
call missing(of day_:);
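For reference, a sketch of the complete step with those statements added (same HAVE data and variable names as above):
data want ;
merge
have(where=(day=1) rename=(observation=day_1))
have(where=(day=2) rename=(observation=day_2))
;
by group replicate;
drop day ;
output;
call missing(of day_:);
run;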

How to find maximum distance apart of values within a variable

I create a working example dataset:
input ///
group value
1 3
1 2
1 3
2 4
2 6
2 7
3 4
3 4
3 4
3 4
4 17
4 2
5 3
5 5
5 12
end
My goal is to figure out the maximum distance between incremental values within each group. For group 2, this would be 2, because the next highest value after 4 is 6. Note that the only value relevant to 4 is 6, not 7, because 7 is not the next highest value after 4. The result for group 3 is 0 because there is only one distinct value in group 3. There will only be one result per group.
What I want to get:
input ///
group value result
1 3 1
1 2 1
1 3 1
2 4 2
2 6 2
2 7 2
3 4 0
3 4 0
3 4 0
3 4 0
4 17 15
4 2 15
5 3 7
5 5 7
5 12 7
end
The order is not important, so the order just above can change with no problem.
Any tips?
I may have figured it out:
bys group (value): gen d = value[_n+1] - value[_n]
bys group: egen result = max(d)
drop d

if condition is intended to be fulfilled for an observation based on the value of another variable

At the moment my code reads: gen lateFirms = 1 if firmage0 != .
So at the moment the dataset which I get looks like this:
firm_id lateFirms firmage0
1
1
1
1
1
3
3
3
3
3
4
4
4
4
4
5
5
6 1 110
6
6
6
6
7
7
7
7
7
8 1 90
8
8
8
8
But what I want is this:
firm_id lateFirms firmage0
1
1
1
1
1
3
3
3
3
3
4
4
4
4
4
5
5
6 1 110
6 1
6 1
6 1
6 1
7
7
7
7
7
8 1 90
8 1
8 1
8 1
8 1
NOTE: All blank entries are missing values!
So "lateFirms" should equal 1 if, regarding a "firm_id", there exists one observation for which firmage0 is not a missing value.
bysort firm_id : egen present = count(firmage0)
replace lateFirms = present > 0
The count() function of egen counts non-missings and assigns the count to all values for each firm.
Maybe this helps:
bysort firm_id: gen dum = 1 if sum(firmage0) != 0
To get exactly what you want, you can use replace instead of generate:
bysort firm_id: replace lateFirms = 1 if sum(firmage0) != 0
As @NickCox pointed out, this solution is specific to the example dataset you provided.