Merging two stat(sum) codes - stata

I have obtained a list of projects that in total generate zero revenue (total revenue over a period of time)
tabstat revenue, by(project) stat(sum)
I have identified 261 projects (out of 1000s) that generate zero revenue for the whole period of time.
Now, want to look at the total value of a specific variable that can be tracked over multiple periods for each project in these zero-revenue-generating projects. I know that I can go after each campaign by typing
tabstat variable_of_interest if project==127, stat(sum)
Again, here project 127 generated zero revenue.
Is there a way to merge these two codes so that I can generate a table with the following logic
generate total sum of the variable_of_interest if project's stat(sum) was equal to zero?
here is a data sample
project revenue var_of_intr
1 0 5
1 0 8
1 2 10
1 0 5
2 0 5
2 0 90
2 0 2
2 0 0
3 0 76
3 0 5
3 0 23
3 0 4
4 0 75
4 8 2
4 0 9
4 0 6
5 0 88
5 0 20
5 0 9
5 0 14
Since projects 1 and 4 generated revenue>0, the code should ignore then when summing up the variable of interest by campaign, thus, the table I am interested in should look like this
project var_of_intr
2 97
3 108
5 131

You can use collapse:
clear
set more off
*----- example data -----
input ///
project revenue somevar
1 0 5
1 0 8
1 2 10
1 0 5
2 0 5
2 0 90
2 0 2
2 0 0
3 0 76
3 0 5
3 0 23
3 0 4
4 0 75
4 8 2
4 0 9
4 0 6
5 0 88
5 0 20
5 0 9
5 0 14
end
list
*----- what you want -----
collapse (sum) revenue somevar, by(project)
keep if revenue == 0
That will destroy the database, of course, but it might be useful anyway. You don't really specify if this approach is acceptable or not.
For a table, you can flag projects with revenue equal to zero and condition on that:
bysort project (revenue): gen revzero = revenue[_N] == 0
tabstat somevar if revzero, by(project) stat(sum)
If you have missing or negative revenues, modifications are required.

Related

Merge all .dta files in one folder?

I have a folder with 36 .dta files which are all structured the same. Each one has 2 fields: RowID and value. Each file also has the same number of rows (2,500). The name of the "value" variable is unique to each file. I would like to construct a loop that loads in the first .dta file and then merges the "value" variable from each of the other 35 files. Any help will be greatly appreciated.
Here are sample data from 3 of the .dta files:
Example 1:
input int rowid_ float value_ex_1
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 1
9 1
10 1
Example 2:
input int rowid_ float value_ex_2
1 1
2 0
3 0
4 1
5 1
6 0
7 0
8 0
9 0
10 0
Example 3:
input int rowid_ float value_ex_3
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 0
9 0
10 1
In order to loop over all your .dta files, first make sure they are named after a logical order (i.e example_1.dta, example_2.dta, example_3.dta etc).
Then, you can load the first dataset and loop over the other ones with a forvalues loop:
cd "path/to/your/datasets"
use example_1.dta, clear
forvalues i = 2(1)35 {
merge 1:1 rowid_ using example_`i'.dta
drop _merge
}

Conditional summation in time-to-event data

I have the following data that has been prepared with stset. The resulting variables signify cohort entry and exit times along with event status. In addition, a numerical variable - prob has been calculated based on the riskset size.
For those subjects that are not cases (where _d == 0), I need to sum all values of the prob variable where _t falls within that subject's follow-up time.
For example, subject 8 enters the cohort at _t0 == 0 and exits at _t == 8. Between these times, there are three prob values 0.9, 0.875 and 0.875 - giving the desired answer for subject 8 as 2.65.
* Example generated by -dataex-. To install: ssc install dataex
clear
input long id byte(_t0 _t _d) float prob
1 0 1 0 .
2 0 2 0 .
3 1 3 1 .9
4 0 4 0 .
5 0 5 1 .875
6 0 6 1 .875
7 5 7 0 .
8 0 8 0 .
9 0 9 1 .8333333
10 0 10 1 .8
11 0 11 0 .
12 8 12 1 .6666667
13 0 13 0 .
14 0 14 0 .
15 0 15 0 .
end
The desired output would return all of the data with an additional variable signifying the summed values of prob.
Thanks so much in advance.

Pandas grouped differences with variable lags

I have a pandas data frame with three variables. The first is a grouping variable, the second a within group "scenario" and the third an outcome. I would like to calculate the within group difference between the null condition, scenario zero, and the other scenarios within the group. The number of scenarios varies between the different groups. My data looks like:
ipdb> aDf
FieldId Scenario TN_load
0 0 0 134.922952
1 0 1 111.787326
2 0 2 104.805951
3 1 0 17.743467
4 1 1 13.411849
5 1 2 13.944552
6 1 3 17.499152
7 1 4 17.640090
8 1 5 14.220673
9 1 6 14.912306
10 1 7 17.233862
11 1 8 13.313953
12 1 9 17.967438
13 1 10 14.051882
14 1 11 16.307317
15 1 12 12.506358
16 1 13 16.266233
17 1 14 12.913150
18 1 15 18.149811
19 1 16 12.337736
20 1 17 12.008868
21 1 18 13.434605
22 2 0 454.857959
23 2 1 414.372215
24 2 2 478.371387
25 2 3 385.973388
26 2 4 487.293966
27 2 5 481.280175
28 2 6 403.285123
29 3 0 30.718375
... ... ...
29173 4997 3 53.193992
29174 4997 4 45.800968
I will also have to write functions to get percentage differences etc. but this has me stumped. Any help greatly appreciated.
You can get the difference with the scenario 0 within groups using groupby and transform like:
df['TN_load_0'] = df['TN_load'].groupby(df['FieldId']).transform(lambda x: x - x.iloc[0])
df
FieldId Scenario TN_load TN_load_0
0 0 0 134.922952 0.000000
1 0 1 111.787326 -23.135626
2 0 2 104.805951 -30.117001
3 1 0 17.743467 0.000000
4 1 1 13.411849 -4.331618
5 1 2 13.944552 -3.798915
6 1 3 17.499152 -0.244315

Keep first record when event occurrs

I have the following data in Stata:
clear
* Input data
input grade id exit time
1 1 . 10
2 1 . 20
3 1 2 30
4 1 0 40
5 1 . 50
1 2 0 10
2 2 0 20
3 2 0 30
4 2 0 40
5 2 0 50
1 3 1 10
2 3 1 20
3 3 0 30
4 3 . 40
5 3 . 50
1 4 . 10
2 4 . 20
3 4 . 30
4 4 . 40
5 4 . 50
1 5 1 10
2 5 2 20
3 5 1 30
4 5 1 40
5 5 1 50
end
The objective is to take the first row foreach id when a event occurs and if no event occur then take the last report foreach id. Here is a example for the data I hope to attain
* Input data
input grade id exit time
3 1 2 30
5 2 0 50
1 3 1 10
5 4 . 50
1 5 1 10
end
The definition of an event appears to be that exit is not zero or missing. If so, then all you need to do is tweak the code in my previous answer:
bysort id (time): egen when_first_e = min(cond(exit > 0 & exit < ., time, .))
by id: gen tokeep = cond(when_first_e == ., time == time[_N], time == when_first_e)
Previous thread was here.

How to modify a variable conditioned on max value of other variable

I have a long format dataset: ID, time varying variable, time and outcome (y).
Subjects have differing numbers of rows due to different times and different outcome values, 0,1 or 2. But I need to only keep the outcome value corresponding to the last time point, and replace all other outcome rows to 0.
I can't figure out how to gen a new variable = outcome only for max(time) by ID
id sbp y time
1 120 1 0
1 126 1 1
1 126 1 2
1 126 1 3
1 126 1 4
1 132 1 5
1 132 1 6
1 132 1 7
1 150 1 8
1 150 1 9
1 150 1 10
1 160 1 11
1 160 1 12
1 160 1 13
1 160 1 14
You seem to be asking quite different things:
Replacing outcome values before the last for each panel with 0.
Keeping only the last.
Here they are in turn:
bysort id (time) : replace y = 0 if _n < _N
by id: keep if _n == _N
If you just want the second, you need bysort id (time) rather than by id.