Pandas grouped differences with variable lags - python-2.7

I have a pandas data frame with three variables. The first is a grouping variable, the second a within-group "scenario", and the third an outcome. I would like to calculate the within-group difference between the null condition, scenario zero, and the other scenarios within the group. The number of scenarios varies between groups. My data looks like:
ipdb> aDf
       FieldId  Scenario     TN_load
0            0         0  134.922952
1            0         1  111.787326
2            0         2  104.805951
3            1         0   17.743467
4            1         1   13.411849
5            1         2   13.944552
6            1         3   17.499152
7            1         4   17.640090
8            1         5   14.220673
9            1         6   14.912306
10           1         7   17.233862
11           1         8   13.313953
12           1         9   17.967438
13           1        10   14.051882
14           1        11   16.307317
15           1        12   12.506358
16           1        13   16.266233
17           1        14   12.913150
18           1        15   18.149811
19           1        16   12.337736
20           1        17   12.008868
21           1        18   13.434605
22           2         0  454.857959
23           2         1  414.372215
24           2         2  478.371387
25           2         3  385.973388
26           2         4  487.293966
27           2         5  481.280175
28           2         6  403.285123
29           3         0   30.718375
...        ...       ...         ...
29173     4997         3   53.193992
29174     4997         4   45.800968
I will also have to write functions to get percentage differences etc., but this has me stumped. Any help greatly appreciated.

You can get the difference from scenario 0 within groups using groupby and transform:
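# within each FieldId group, subtract the first row's TN_load (the Scenario-0 value)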
df['TN_load_0'] = df['TN_load'].groupby(df['FieldId']).transform(lambda x: x - x.iloc[0])
df
   FieldId  Scenario     TN_load   TN_load_0
0        0         0  134.922952    0.000000
1        0         1  111.787326  -23.135626
2        0         2  104.805951  -30.117001
3        1         0   17.743467    0.000000
4        1         1   13.411849   -4.331618
5        1         2   13.944552   -3.798915
6        1         3   17.499152   -0.244315
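For the percentage differences the question mentions, the same groupby/transform pattern applies; a minimal sketch, assuming (as the answer above does) that the Scenario-0 row comes first within each FieldId group:

import pandas as pd

# Small frame with the question's column names (values abbreviated)
df = pd.DataFrame({'FieldId':  [0, 0, 0, 1, 1, 1],
                   'Scenario': [0, 1, 2, 0, 1, 2],
                   'TN_load':  [134.92, 111.79, 104.81,
                                17.74, 13.41, 13.94]})

# percent difference from each group's Scenario-0 baseline
df['TN_load_pct'] = df['TN_load'].groupby(df['FieldId']).transform(
    lambda x: (x - x.iloc[0]) / x.iloc[0] * 100)

If the rows are not guaranteed to arrive with Scenario 0 first, sorting by ['FieldId', 'Scenario'] beforehand keeps the baseline correct.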

Related

One variable in kg and grams, another indicates which unit; how can I get a new variable in kg?

In Stata, quantity has inputs in both kg and grams, where unit==1 indicates kg and unit==2 indicates grams. How can I generate a new variable quantity_kg which converts all gram values into kg?
My existing dataset:
clear
input double(hhid quantity unit unit_price)
1 24 1 .
1 4 1 .
1 350 2 50
1 550 2 90
1 2 1 65
1 3.5 1 85
1 1 1 20
1 4 1 25
1 2 1 .
2 1 1 30
2 2 1 15
2 1 1 20
2 250 2 10
2 2 1 20
2 400 2 10
2 100 2 60
2 1 1 20
My expected dataset:
input double(hhid quantity unit unit_price quantity_kg)
1 24 1 . 24
1 4 1 . 4
1 350 2 50 .35
1 550 2 90 .55
1 2 1 65 2
1 3.5 1 85 3.5
1 1 1 20 1
1 4 1 25 4
1 2 1 . 2
2 1 1 30 1
2 2 1 15 2
2 1 1 20 1
2 250 2 10 .25
2 2 1 20 2
2 400 2 10 .40
2 100 2 60 .10
2 1 1 20 1
The code below does what you want.
This looks like household data, where one typically has to do a lot of unit conversions. They are also a common source of error, so I have included the best practice of defining conversion rates and unit codes in locals. If you define these in one place, you can reuse the locals everywhere you convert units. It is also easy to spot typos in the replace rows: you would notice if one row said kilo_rate but then gram_unit. In this simple example it might be overkill, but if you have many units and rates, this is a neat way to avoid errors.
clear
input double(hhid quantity unit unit_price)
1 24 1 .
1 4 1 .
1 350 2 50
1 550 2 90
1 2 1 65
1 3.5 1 85
1 1 1 20
1 4 1 25
1 2 1 .
2 1 1 30
2 2 1 15
2 1 1 20
2 250 2 10
2 2 1 20
2 400 2 10
2 100 2 60
2 1 1 20
end
*Define conversion rates and unit codes
local kilo_rate = 1
local kilo_unit = 1
local gram_rate = 0.001
local gram_unit = 2
*Create the standardized variable
gen quantity_kg = .
replace quantity_kg = quantity * `kilo_rate' if unit == `kilo_unit'
replace quantity_kg = quantity * `gram_rate' if unit == `gram_unit'
An alternative is a single cond() call:
// unit 1 means kg, unit 2 means g, and 1000 g = 1 kg
generate quantity_kg = cond(unit == 1, quantity, cond(unit == 2, quantity/1000, .))
Your example doesn't have any missing values on unit, but it does no harm to imagine that they might occur.
Providing a comment by way of explanation could be anywhere between redundant and essential for third parties.
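If missing or unexpected unit codes do turn up, a quick check after either conversion will catch rows left unconverted; a minimal sketch (the assert line is my addition, not part of either answer above):

*Fail loudly if any recorded quantity was left unconverted
assert quantity_kg < . if quantity < .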

Keep first record when event occurs

I have the following data in Stata:
clear
* Input data
input grade id exit time
1 1 . 10
2 1 . 20
3 1 2 30
4 1 0 40
5 1 . 50
1 2 0 10
2 2 0 20
3 2 0 30
4 2 0 40
5 2 0 50
1 3 1 10
2 3 1 20
3 3 0 30
4 3 . 40
5 3 . 50
1 4 . 10
2 4 . 20
3 4 . 30
4 4 . 40
5 4 . 50
1 5 1 10
2 5 2 20
3 5 1 30
4 5 1 40
5 5 1 50
end
The objective is to take the first row for each id when an event occurs, and if no event occurs, to take the last record for each id. Here is an example of the data I hope to attain:
* Expected output
input grade id exit time
3 1 2 30
5 2 0 50
1 3 1 10
5 4 . 50
1 5 1 10
end
The definition of an event appears to be that exit is not zero or missing. If so, then all you need to do is tweak the code in my previous answer:
bysort id (time): egen when_first_e = min(cond(exit > 0 & exit < ., time, .))
by id: gen tokeep = cond(when_first_e == ., time == time[_N], time == when_first_e)
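To actually reduce the dataset to the expected rows, two more lines (my addition to the answer above) keep the flagged records and drop the helper variables:

keep if tokeep
drop when_first_e tokeep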
Previous thread was here.

Pandas grouping values to column

I have a dataframe that looks like this:
   a  b  c  d  e
0  0  1  2  1  0
1  3  0  0  4  3
2  3  4  0  4  2
3  4  1  0  4  3
4  2  1  3  4  3
5  3  2  0  3  3
6  2  1  1  1  0
7  1  1  0  3  3
8  3  3  3  3  4
9  2  3  4  2  2
I run the following command:
df.groupby('a').sum()
And I get:
   b  c   d   e
a
0  1  2   1   0
1  1  0   3   3
2  5  8   7   5
3  9  3  14  12
4  1  0   4   3
And after that I want to access
labels = df['a']
but I get an error that there is no such column.
So does pandas have some syntax to get something like this?
   a  b  c   d   e
0  0  1  2   1   0
1  1  1  0   3   3
2  2  5  8   7   5
3  3  9  3  14  12
4  4  1  0   4   3
I need to sum the values of columns b, c, d, e grouped by column a, keeping a as a regular column with the relevant index.
You can just access the index with df.index, and add it back into your dataframe as another column.
grouped_df = df.groupby('a').sum()
grouped_df['a'] = grouped_df.index
grouped_df.sum(axis=1)
Alternatively, groupby has an as_index option to keep 'a' as a column:
df.groupby('a', as_index=False).sum()
or, after the groupby, you can use reset_index to put the column 'a' back.
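A minimal, self-contained sketch of both alternatives (column names taken from the sample data):

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 2], 'b': [1, 1, 5, 0],
                   'c': [2, 0, 8, 0], 'd': [1, 3, 7, 0],
                   'e': [0, 3, 5, 0]})

# Option 1: keep the grouping key as an ordinary column from the start.
grouped = df.groupby('a', as_index=False).sum()

# Option 2: group as usual, then move the index back into a column.
grouped = df.groupby('a').sum().reset_index()

labels = grouped['a']   # no error now: 'a' is a regular column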

Merging two stat(sum) codes

I have obtained a list of projects that in total generate zero revenue (total revenue over a period of time):
tabstat revenue, by(project) stat(sum)
I have identified 261 projects (out of 1000s) that generate zero revenue for the whole period of time.
Now I want to look at the total value of a specific variable, tracked over multiple periods, for each of these zero-revenue-generating projects. I know that I can go after each project by typing
tabstat variable_of_interest if project==127, stat(sum)
Again, here project 127 generated zero revenue.
Is there a way to merge these two pieces of code so that I can generate a table with the following logic:
generate the total sum of variable_of_interest if the project's stat(sum) of revenue was equal to zero?
Here is a data sample:
project revenue var_of_intr
1 0 5
1 0 8
1 2 10
1 0 5
2 0 5
2 0 90
2 0 2
2 0 0
3 0 76
3 0 5
3 0 23
3 0 4
4 0 75
4 8 2
4 0 9
4 0 6
5 0 88
5 0 20
5 0 9
5 0 14
Since projects 1 and 4 generated revenue > 0, the code should ignore them when summing up the variable of interest by project; thus, the table I am interested in should look like this:
project var_of_intr
2 97
3 108
5 131
You can use collapse:
clear
set more off
*----- example data -----
input ///
project revenue somevar
1 0 5
1 0 8
1 2 10
1 0 5
2 0 5
2 0 90
2 0 2
2 0 0
3 0 76
3 0 5
3 0 23
3 0 4
4 0 75
4 8 2
4 0 9
4 0 6
5 0 88
5 0 20
5 0 9
5 0 14
end
list
*----- what you want -----
collapse (sum) revenue somevar, by(project)
keep if revenue == 0
That will destroy the dataset in memory, of course, but it might be useful anyway. You don't really specify whether this approach is acceptable or not.
For a table, you can flag projects with revenue equal to zero and condition on that:
bysort project (revenue): gen revzero = revenue[_N] == 0
tabstat somevar if revzero, by(project) stat(sum)
If you have missing or negative revenues, modifications are required.
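An equivalent flag can be built with egen's total() function; a sketch under the same assumption of non-negative, non-missing revenues:

*Total revenue per project; a zero total means no revenue in any period
egen total_rev = total(revenue), by(project)
tabstat somevar if total_rev == 0, by(project) stat(sum)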

J (Tacit) Sieve Of Eratosthenes

I'm looking for J code to do the following.
Suppose I have a list of random integers (sorted),
2 3 4 5 7 21 45 49 61
I want to start with the first element and remove any multiples of that element from the list, then move on to the next element and cancel out its multiples, and so on and so forth.
Thus the output I'm looking for is 2 3 5 7 61. Basically a Sieve of Eratosthenes. Would appreciate it if someone could explain the code as well, since I'm learning J and find it difficult to understand most code :(
Regards,
babsdoc
It's not exactly what you ask, but here is a more idiomatic (and much faster) version of the Sieve.
Basically, what you need is to check which number is a multiple of which. You can get this from the table of modulos: |/~
l =: 2 3 4 5 7 21 45 49 61
|/~ l
0 1 0 1 1 1 1 1 1
2 0 1 2 1 0 0 1 1
2 3 0 1 3 1 1 1 1
2 3 4 0 2 1 0 4 1
2 3 4 5 0 0 3 0 5
2 3 4 5 7 0 3 7 19
2 3 4 5 7 21 0 4 16
2 3 4 5 7 21 45 0 12
2 3 4 5 7 21 45 49 0
Every pair of multiples gives a 0 in the table. Now, we are not interested in the 0s that correspond to self-modulos (2 mod 2, 3 mod 3, etc.; the 0s on the diagonal), so we have to remove them. One way to do this is to put 1s in their place, like so:
=/~ l
1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0
0 0 0 0 1 0 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1
(=/~l) + (|/~l)
1 1 0 1 1 1 1 1 1
2 1 1 2 1 0 0 1 1
2 3 1 1 3 1 1 1 1
2 3 4 1 2 1 0 4 1
2 3 4 5 1 0 3 0 5
2 3 4 5 7 1 3 7 19
2 3 4 5 7 21 1 4 16
2 3 4 5 7 21 45 1 12
2 3 4 5 7 21 45 49 1
This can be also written as (=/~ + |/~) l.
From this table we get the final list of numbers: every number whose column contains a 0 is excluded.
We build this list of exclusions simply by multiplying down each column. If a column contains a 0, its product is 0; otherwise it's a positive number:
*/ (=/~ + |/~) l
256 2187 0 6250 14406 0 0 0 18240
Before doing the last step, we'll improve this a little. There is no reason to perform long multiplications since we are only interested in 0s and non-0s. So, when building the table, we'll keep only 0s and 1s by taking the "sign" of each number (this is the signum, *):
* (=/~ + |/~) l
1 1 0 1 1 1 1 1 1
1 1 1 1 1 0 0 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1
1 1 1 1 1 0 1 0 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
so,
*/ * (=/~ + |/~) l
1 1 0 1 1 0 0 0 1
From the list of exclusions, you just copy (#) the numbers to your final list:
l #~ */ * (=/~ + |/~) l
2 3 5 7 61
or,
(]#~[:*/[:*=/~+|/~) l
2 3 5 7 61
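If the one-liner is hard to read, the same phrase can be split into named parts (the names are mine, not part of the original answer):

mask =: [: */ [: * =/~ + |/~   NB. 1 for survivors, 0 for multiples
sieveList =: ] #~ mask         NB. copy only the surviving numbers
sieveList l
2 3 5 7 61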
Tacit iteration is usually done with the Power conjunction (^:). When the test for completion needs to be something other than hitting a fixed point, the Do While construction works well.
In this solution, filterMultiplesOfHead is applied repeatedly until every number has either been used or filtered out. Numbers already processed are accumulated in a partial answer. When the list to be processed is empty, the partial answer is the result, after stripping off the boxing used to segregate processed from unprocessed data.
filterMultiplesOfHead=: {. (((~: >.)# %~) # ]) }.
appendHead=: (>#[ , {.#>#])/
pass=: appendHead ; filterMultiplesOfHead#>#{:
prep=: a: , <
unfinished=: [: -. a: -: {:
sieve=: [: ; [: pass^:unfinished^:_ prep
sieve 2 3 4 5 7 21 45 49 61
2 3 5 7 61
prep 2 3 4 7 9 10
┌┬────────────┐
││2 3 4 7 9 10│
└┴────────────┘
appendHead prep 2 3 4 7 9 10
2
filterMultiplesOfHead 2 3 4 7 9 10
3 7 9
pass^:2 prep 2 3 4 7 9 10
┌───┬─┐
│2 3│7│
└───┴─┘
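The final line below tests sieve on a random input; reading right to left (my gloss, not part of the original answer): ?. $~ 100 rolls 100 random integers in 0..99, >: shifts them to 1..100, ~. removes duplicates, /:~ sorts ascending, and 1 -.~ drops any 1s before sieving.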
sieve 1-.~/:~~.>:?.$~100
2 3 7 11 29 31 41 53 67 73 83 95 97