Get previous rowindex in Google Sheets where certain columnvalue is zero - if-statement

Consider a sheet like:
rowNr | Another Col | Filled | Cumul. Size
0 2 -1000 -1000
1 3 1000 0
2 1 -5000 -5000
3 4 5000 0
4 5 -10000 -10000
5 2 -10000 -20000
6 1 -20000 -40000
6 4 40000 0
The 'Cumul. Size'-column displays the cumulative sum of the 'filled' column.
each time Cummulutive Size = 0, I need to calculate the sum of 'Another Column' for all previous rows until 'Cummulutive Size' != 0 again. For rows where 'Cummulutive Size' = 0, display '' (blank)
So something like this:
rowNr | Another Col | Filled | Cumul. Size | calculated
0 2 -1000 -1000
1 3 1000 0 5
2 1 -5000 -5000
3 4 5000 0 5
4 5 -10000 -10000
5 2 -10000 -20000
6 1 -20000 -40000
6 4 40000 0 12
I'm sure I can create something working as long as I can find a function with a signature similar to: findPreviousRowIndex(curRowIndex, whereCondition)
Any pointers much appreciated
EDIT
Link To example Google Sheet

paste in D2 cell and drag down:
=ARRAYFORMULA(IF(LEN(A2), IF(C2=0, SUM(INDIRECT(ADDRESS(IFERROR(MAX(IF(
INDIRECT("C1:C"&ROW()-1)=0, ROW(A:A), ))+1, 2), 1, 4)&":A"&ROW())), ), ))

Related

Conditional summation in time-to-event data

I have the following data that has been prepared with stset. The resulting variables signify cohort entry and exit times along with event status. In addition, a numerical variable - prob has been calculated based on the riskset size.
For those subjects that are not cases (where _d == 0), I need to sum all values of the prob variable where _t falls within that subject's follow-up time.
For example, subject 8 enters the cohort at _t0 == 0 and exits at _t == 8. Between these times, there are three prob values 0.9, 0.875 and 0.875 - giving the desired answer for subject 8 as 2.65.
* Example generated by -dataex-. To install: ssc install dataex
clear
input long id byte(_t0 _t _d) float prob
1 0 1 0 .
2 0 2 0 .
3 1 3 1 .9
4 0 4 0 .
5 0 5 1 .875
6 0 6 1 .875
7 5 7 0 .
8 0 8 0 .
9 0 9 1 .8333333
10 0 10 1 .8
11 0 11 0 .
12 8 12 1 .6666667
13 0 13 0 .
14 0 14 0 .
15 0 15 0 .
end
The desired output would return all of the data with an additional variable signifying the summed values of prob.
Thanks so much in advance.

How to add a number to a portion of dataframe column in pandas?

I have a dataframe with two columns A and B.
A B
1 0
2 0
3 1
4 2
5 0
6 3
What I want to do is to add column A with with column B. But only with the corresponding non zero values of column B. And put the result on column B.
A B
1 0
2 0
3 4
4 6
5 0
6 9
Thank you for your help and sugestion in advance.
use .loc with a boolean mask:
In [49]:
df.loc[df['B'] != 0, 'B'] = df['A'] + df['B']
df
Out[49]:
A B
0 1 0
1 2 0
2 3 4
3 4 6
4 5 0
5 6 9

Create a dummy variable for the last rows based on on another variable

I would like to create a dummy variable that will look at the variable "count" and label the rows as 1 starting from the last row of each id. As an example ID 1 has count of 3 and the last three rows of this id will have such pattern: 0,0,1,1,1 Similarly, ID 4 which has a count of 1 will have 0,0,0,1. The IDs have different number of rows. The variable "wish" shows what I want to obtain as a final output.
input byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
For future questions, you should provide your failed attempts. This shows that you have done your part, namely, research your problem.
One way is:
clear
set more off
*----- example data -----
input ///
byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
list, sepby(id)
*----- what you want -----
bysort id: gen wish2 = _n > (_N - count)
list, sepby(id)
I assume you already sorted your date variable within ids.
One way to accomplish this would be to use within-group row numbers using 'bysort'-type logic:
***Create variable of within-group row numbers.
bysort id: gen obsnum = _n
***Calculate total number of rows within each group.
by id: egen max_obsnum = max(obsnum)
***Subtract the count variable from the group row count.
***This is the number of rows where we want the dummy to equal zero.
gen max_obsnum_less_count = max_obsnum - count
***Create the dummy to equal one when the row number is
***greater than this last variable.
gen dummy = (obsnum > max_obsnum_less_count)
***Clean up.
drop obsnum max_obsnum max_obsnum_less_count

Merging two stat(sum) codes

I have obtained a list of projects that in total generate zero revenue (total revenue over a period of time)
tabstat revenue, by(project) stat(sum)
I have identified 261 projects (out of 1000s) that generate zero revenue for the whole period of time.
Now, want to look at the total value of a specific variable that can be tracked over multiple periods for each project in these zero-revenue-generating projects. I know that I can go after each campaign by typing
tabstat variable_of_interest if project==127, stat(sum)
Again, here project 127 generated zero revenue.
Is there a way to merge these two codes so that I can generate a table with the following logic
generate total sum of the variable_of_interest if project's stat(sum) was equal to zero?
here is a data sample
project revenue var_of_intr
1 0 5
1 0 8
1 2 10
1 0 5
2 0 5
2 0 90
2 0 2
2 0 0
3 0 76
3 0 5
3 0 23
3 0 4
4 0 75
4 8 2
4 0 9
4 0 6
5 0 88
5 0 20
5 0 9
5 0 14
Since projects 1 and 4 generated revenue>0, the code should ignore then when summing up the variable of interest by campaign, thus, the table I am interested in should look like this
project var_of_intr
2 97
3 108
5 131
You can use collapse:
clear
set more off
*----- example data -----
input ///
project revenue somevar
1 0 5
1 0 8
1 2 10
1 0 5
2 0 5
2 0 90
2 0 2
2 0 0
3 0 76
3 0 5
3 0 23
3 0 4
4 0 75
4 8 2
4 0 9
4 0 6
5 0 88
5 0 20
5 0 9
5 0 14
end
list
*----- what you want -----
collapse (sum) revenue somevar, by(project)
keep if revenue == 0
That will destroy the database, of course, but it might be useful anyway. You don't really specify if this approach is acceptable or not.
For a table, you can flag projects with revenue equal to zero and condition on that:
bysort project (revenue): gen revzero = revenue[_N] == 0
tabstat somevar if revzero, by(project) stat(sum)
If you have missing or negative revenues, modifications are required.

Stata: Capture p-value from ranksum test

When I run return list, all after running a ranksum test, the count and z-score are available, but not the p-value. Is there any way of picking it up?
clear
input eventtime prefflag winner stakechange
1 1 1 10
1 2 1 5
2 1 0 50
2 2 0 31
2 1 1 51
2 2 1 20
1 1 0 10
2 2 1 10
2 1 0 5
3 2 0 8
4 2 0 8
5 2 0 8
5 2 1 8
3 1 1 8
4 1 1 8
5 1 1 8
5 1 1 8
end
bysort eventtime winner: tabstat stakechange, stat(mean median n) columns(statistics)
ranksum stakechange if inlist(eventtime, 1, 2) & inlist(winner, 0, .), by (eventtime)
return list, all
Try computing it after ranksum:
scalar pval = 2 * normprob(-abs(r(z)))
display pval
The answer is by #NickCox:
http://www.stata.com/statalist/archive/2004-12/msg00622.html
The Statalist archive is a valuable resource.