Keep first record when event occurs - Stata

I have the following data in Stata:
clear
* Input data
input grade id exit time
1 1 . 10
2 1 . 20
3 1 2 30
4 1 0 40
5 1 . 50
1 2 0 10
2 2 0 20
3 2 0 30
4 2 0 40
5 2 0 50
1 3 1 10
2 3 1 20
3 3 0 30
4 3 . 40
5 3 . 50
1 4 . 10
2 4 . 20
3 4 . 30
4 4 . 40
5 4 . 50
1 5 1 10
2 5 2 20
3 5 1 30
4 5 1 40
5 5 1 50
end
The objective is to take the first row for each id when an event occurs, and if no event occurs, to take the last row for each id. Here is an example of the data I hope to obtain:
* Input data
input grade id exit time
3 1 2 30
5 2 0 50
1 3 1 10
5 4 . 50
1 5 1 10
end

The definition of an event appears to be that exit is not zero or missing. If so, then all you need to do is tweak the code in my previous answer:
bysort id (time): egen when_first_e = min(cond(exit > 0 & exit < ., time, .))
by id: gen tokeep = cond(when_first_e == ., time == time[_N], time == when_first_e)
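If you then want to reduce the data to one row per id, as in the example shown above, a minimal follow-up (a sketch reusing the variables created above) is:
keep if tokeep
drop when_first_e tokeep
list grade id exit time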
Previous thread was here.


One variable in kg and grams, another indicates which unit; how can I get a new variable in kg?

In Stata, quantity has values in both kg and grams, while unit = 1 indicates kg and unit = 2 indicates grams. How can I generate a new variable quantity_kg which converts all gram values to kg?
My existing dataset:
clear
input double(hhid quantity unit unit_price)
1 24 1 .
1 4 1 .
1 350 2 50
1 550 2 90
1 2 1 65
1 3.5 1 85
1 1 1 20
1 4 1 25
1 2 1 .
2 1 1 30
2 2 1 15
2 1 1 20
2 250 2 10
2 2 1 20
2 400 2 10
2 100 2 60
2 1 1 20
end
My expected dataset:
input double(hhid quantity unit unit_price quantity_kg)
1 24 1 . 24
1 4 1 . 4
1 350 2 50 .35
1 550 2 90 .55
1 2 1 65 2
1 3.5 1 85 3.5
1 1 1 20 1
1 4 1 25 4
1 2 1 . 2
2 1 1 30 1
2 2 1 15 2
2 1 1 20 1
2 250 2 10 .25
2 2 1 20 2
2 400 2 10 .40
2 100 2 60 .10
2 1 1 20 1
end
The code below does what you want.
This looks like household data, where one typically has to do a lot of unit conversions. Conversions are also a common source of error, so I have included the best practice of defining conversion rates and unit codes in locals. If you define these in one place, you can reuse the locals everywhere you convert units, and it becomes easy to spot typos in the replace lines: you would notice if one row paired kilo_rate with gram_unit. In this simple example it might be overkill, but if you have many units and rates, this is a neat way to avoid errors.
clear
input double(hhid quantity unit unit_price)
1 24 1 .
1 4 1 .
1 350 2 50
1 550 2 90
1 2 1 65
1 3.5 1 85
1 1 1 20
1 4 1 25
1 2 1 .
2 1 1 30
2 2 1 15
2 1 1 20
2 250 2 10
2 2 1 20
2 400 2 10
2 100 2 60
2 1 1 20
end
*Define conversion rates and unit codes
local kilo_rate = 1
local kilo_unit = 1
local gram_rate = 0.001
local gram_unit = 2
*Create the standardized variable
gen quantity_kg = .
replace quantity_kg = quantity * `kilo_rate' if unit == `kilo_unit'
replace quantity_kg = quantity * `gram_rate' if unit == `gram_unit'
An alternative is a single generate with nested cond():
// unit 1 means kg, unit 2 means g, and 1000 g = 1 kg
generate quantity_kg = cond(unit == 1, quantity, cond(unit == 2, quantity/1000, .))
Your example doesn't have any missing values on unit, but it does no harm to imagine that they might occur.
Providing a comment by way of explanation could be anywhere between redundant and essential for third parties.
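If you want to guard against unit codes other than 1 and 2 (or stray missings) slipping through unnoticed, a quick check along these lines may help; this is only a sketch and assumes 1 and 2 are the only valid codes:
* stop with an error if any non-missing unit code is not 1 or 2
assert inlist(unit, 1, 2) | missing(unit)
* count observations left unconverted
count if missing(quantity_kg)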

count and group the column by sequence

I have a dataset that has to be grouped by number as follows.
ID dept count
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 30 4
8 30 4
9 30 4
10 30 4
So for every 3rd row I need a new level; the output should be as follows.
ID dept count Level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 2
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 2
10 30 4 2
I have tried counting the number of rows based on the dept and count.
data want;
set have;
by dept count;
if first.count then level=1;
else level+1;
run;
This generates a running count, but not exactly what I am looking for:
ID dept count Level
1 10 2 1
2 10 2 2
3 20 4 1
4 20 4 2
5 20 4 3
6 20 4 4
7 30 4 1
8 30 4 2
9 30 4 3
10 30 4 4
It isn't quite clear what output you want. I've extended your input data a bit - please could you clarify what output you'd expect for this input and what the logic is for generating it?
I've made a best guess at roughly what you might be aiming for - incrementing every 3 rows with the same dept and count - perhaps this will be enough for you to get to the answer you want?
data have;
input ID dept count;
cards;
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 30 4
8 30 4
9 30 4
10 30 4
11 30 4
12 30 4
13 30 4
14 30 4
;
run;
data want;
set have;
by dept count;
if first.count then do;
level = 0;
dummy = 0;
end;
if mod(dummy,3) = 0 then level + 1;
dummy + 1;
drop dummy;
run;
Output:
ID dept count level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 1
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 1
10 30 4 2
11 30 4 2
12 30 4 2
13 30 4 3
14 30 4 3
One way to do this is to nest the SET statement inside a DO loop, or in this case two DO loops: one to generate LEVEL (within DEPT) and a second to count by twos. Use the LAST.DEPT flag to handle an odd number of observations.
So if I modify the input to include an odd number of observations in some groups:
data have;
input ID dept count;
cards;
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 20 4
8 30 4
9 30 4
10 30 4
;
Then this step can be used to assign the LEVEL variable:
data want ;
do level=1 by 1 until(last.dept);
do sublevel=1 to 2 until(last.dept);
set have;
by dept;
output;
end;
end;
run;
Results:
Obs level sublevel ID dept count
1 1 1 1 10 2
2 1 2 2 10 2
3 1 1 3 20 4
4 1 2 4 20 4
5 2 1 5 20 4
6 2 2 6 20 4
7 3 1 7 20 4
8 1 1 8 30 4
9 1 2 9 30 4
10 2 1 10 30 4
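If you do not want the helper variable SUBLEVEL in the final data set, a small tweak (a sketch) is to add a DROP statement anywhere inside the step, for example just before RUN:
drop sublevel;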

Pandas grouped differences with variable lags

I have a pandas data frame with three variables. The first is a grouping variable, the second a within group "scenario" and the third an outcome. I would like to calculate the within group difference between the null condition, scenario zero, and the other scenarios within the group. The number of scenarios varies between the different groups. My data looks like:
ipdb> aDf
FieldId Scenario TN_load
0 0 0 134.922952
1 0 1 111.787326
2 0 2 104.805951
3 1 0 17.743467
4 1 1 13.411849
5 1 2 13.944552
6 1 3 17.499152
7 1 4 17.640090
8 1 5 14.220673
9 1 6 14.912306
10 1 7 17.233862
11 1 8 13.313953
12 1 9 17.967438
13 1 10 14.051882
14 1 11 16.307317
15 1 12 12.506358
16 1 13 16.266233
17 1 14 12.913150
18 1 15 18.149811
19 1 16 12.337736
20 1 17 12.008868
21 1 18 13.434605
22 2 0 454.857959
23 2 1 414.372215
24 2 2 478.371387
25 2 3 385.973388
26 2 4 487.293966
27 2 5 481.280175
28 2 6 403.285123
29 3 0 30.718375
... ... ...
29173 4997 3 53.193992
29174 4997 4 45.800968
I will also have to write functions to get percentage differences etc., but this has me stumped. Any help greatly appreciated.
You can get the difference from scenario 0 within groups using groupby and transform like this:
df['TN_load_0'] = df['TN_load'].groupby(df['FieldId']).transform(lambda x: x - x.iloc[0])
df
FieldId Scenario TN_load TN_load_0
0 0 0 134.922952 0.000000
1 0 1 111.787326 -23.135626
2 0 2 104.805951 -30.117001
3 1 0 17.743467 0.000000
4 1 1 13.411849 -4.331618
5 1 2 13.944552 -3.798915
6 1 3 17.499152 -0.244315
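For the percentage differences also mentioned in the question, the same groupby/transform pattern should work. This is only a sketch, and it assumes (as in the example) that Scenario 0 is the first row of each FieldId group:
# percent change relative to the Scenario 0 value within each FieldId
df['TN_load_pct'] = df['TN_load'].groupby(df['FieldId']).transform(lambda x: 100 * (x - x.iloc[0]) / x.iloc[0])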

Create a dummy variable for the last rows based on another variable

I would like to create a dummy variable that looks at the variable "count" and labels rows as 1 starting from the last row of each id. As an example, ID 1 has a count of 3, so the last three rows of this id follow the pattern 0,0,1,1,1. Similarly, ID 4, which has a count of 1, will have 0,0,0,1. The IDs have different numbers of rows. The variable "wish" shows what I want to obtain as the final output.
input byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
For future questions, you should provide your failed attempts. This shows that you have done your part, namely, researching your problem.
One way is:
clear
set more off
*----- example data -----
input ///
byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
list, sepby(id)
*----- what you want -----
bysort id: gen wish2 = _n > (_N - count)
list, sepby(id)
I assume you already sorted your date variable within ids.
One way to accomplish this would be to use within-group row numbers using 'bysort'-type logic:
***Create variable of within-group row numbers.
bysort id: gen obsnum = _n
***Calculate total number of rows within each group.
by id: egen max_obsnum = max(obsnum)
***Subtract the count variable from the group row count.
***This is the number of rows where we want the dummy to equal zero.
gen max_obsnum_less_count = max_obsnum - count
***Create the dummy to equal one when the row number is
***greater than this last variable.
gen dummy = (obsnum > max_obsnum_less_count)
***Clean up.
drop obsnum max_obsnum max_obsnum_less_count

Stata: Capture p-value from ranksum test

When I run return list, all after running a ranksum test, the count and z-score are available, but not the p-value. Is there any way of picking it up?
clear
input eventtime prefflag winner stakechange
1 1 1 10
1 2 1 5
2 1 0 50
2 2 0 31
2 1 1 51
2 2 1 20
1 1 0 10
2 2 1 10
2 1 0 5
3 2 0 8
4 2 0 8
5 2 0 8
5 2 1 8
3 1 1 8
4 1 1 8
5 1 1 8
5 1 1 8
end
bysort eventtime winner: tabstat stakechange, stat(mean median n) columns(statistics)
ranksum stakechange if inlist(eventtime, 1, 2) & inlist(winner, 0, .), by (eventtime)
return list, all
Try computing it after ranksum:
scalar pval = 2 * normprob(-abs(r(z)))
display pval
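Equivalently, normprob() is just the older name for the cumulative standard normal normal(), so (as a sketch) the same two-sided p-value can be displayed directly:
display "two-sided p-value = " 2 * normal(-abs(r(z)))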
The answer is by @NickCox:
http://www.stata.com/statalist/archive/2004-12/msg00622.html
The Statalist archive is a valuable resource.