Sum data in multiple CSVs based on row and column keys (Python 2.7)

I have multiple CSV files and I want to add up the data in them based on row and column keys.
Example:
input1.csv
account,param1,param2,param3
D1,2,-1,0
D2,3,2,-2
D4,12,-1,-2
D3,1,1,0
input2.csv
account,param1,param2,param3
D4,22,-1,0
D6,3,2,-2
D1,-2,-1,-2
D3,1,1,0
output.csv
account,param1,param2,param3
D1,0,-2,-2
D2,3,2,-2
D3,2,2,0
D4,34,-2,-2
D6,3,2,-2
So in output.csv I need all accounts present in either CSV, and for accounts common to both the param values need to be added together.
Note: the accounts are not in sorted order.

Here's one way, using pd.concat followed by groupby:
In [824]: df = pd.concat((pd.read_csv(f) for f in ['input1.csv', 'input2.csv']), ignore_index=True)
In [825]: df
Out[825]:
account param1 param2 param3
0 D1 2 -1 0
1 D2 3 2 -2
2 D4 12 -1 -2
3 D3 1 1 0
4 D4 22 -1 0
5 D6 3 2 -2
6 D1 -2 -1 -2
7 D3 1 1 0
In [826]: df.groupby('account', as_index=False).sum()
Out[826]:
account param1 param2 param3
0 D1 0 -2 -2
1 D2 3 2 -2
2 D3 2 2 0
3 D4 34 -2 -2
4 D6 3 2 -2
In [827]: df.groupby('account', as_index=False).sum().to_csv('output.csv', index=False)
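The session above can be condensed into a self-contained sketch. It assumes the two files have exactly the contents shown in the question; the inline strings below stand in for input1.csv and input2.csv so the example runs without files on disk, and the variable names are only illustrative.

```python
import io
import pandas as pd

# Inline stand-ins for input1.csv and input2.csv from the question.
input1 = io.StringIO(
    "account,param1,param2,param3\n"
    "D1,2,-1,0\nD2,3,2,-2\nD4,12,-1,-2\nD3,1,1,0\n"
)
input2 = io.StringIO(
    "account,param1,param2,param3\n"
    "D4,22,-1,0\nD6,3,2,-2\nD1,-2,-1,-2\nD3,1,1,0\n"
)

# Stack the rows of all files, then sum every param column per account.
combined = pd.concat((pd.read_csv(f) for f in (input1, input2)),
                     ignore_index=True)
totals = combined.groupby("account", as_index=False).sum()

# to_csv with no path returns the CSV text; pass 'output.csv' to write a file.
print(totals.to_csv(index=False))
```

Because groupby sorts its keys by default, the accounts come out in order (D1, D2, D3, D4, D6) even though the inputs are unsorted. The same pattern extends to any number of files by lengthening the list passed to pd.concat.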


Conditional summation in time-to-event data

I have the following data that has been prepared with stset. The resulting variables signify cohort entry and exit times along with event status. In addition, a numerical variable, prob, has been calculated based on the risk-set size.
For those subjects that are not cases (where _d == 0), I need to sum all values of the prob variable where _t falls within that subject's follow-up time.
For example, subject 8 enters the cohort at _t0 == 0 and exits at _t == 8. Between these times there are three prob values (0.9, 0.875 and 0.875), giving the desired answer for subject 8 as 2.65.
* Example generated by -dataex-. To install: ssc install dataex
clear
input long id byte(_t0 _t _d) float prob
1 0 1 0 .
2 0 2 0 .
3 1 3 1 .9
4 0 4 0 .
5 0 5 1 .875
6 0 6 1 .875
7 5 7 0 .
8 0 8 0 .
9 0 9 1 .8333333
10 0 10 1 .8
11 0 11 0 .
12 8 12 1 .6666667
13 0 13 0 .
14 0 14 0 .
15 0 15 0 .
end
The desired output would return all of the data with an additional variable signifying the summed values of prob.
Thanks so much in advance.
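To make the requested calculation concrete, here is a rough sketch of the logic in Python with pandas rather than Stata. It is not a Stata solution. It assumes that "within follow-up time" means _t0 < _t <= exit time, which reproduces the 2.65 for subject 8 but is otherwise a guess at the intended boundaries; the helper name sum_prob and the column name prob_sum are made up for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
# The dataex listing from the question, re-entered as a DataFrame.
df = pd.DataFrame({
    "id":   list(range(1, 16)),
    "_t0":  [0, 0, 1, 0, 0, 0, 5, 0, 0, 0, 0, 8, 0, 0, 0],
    "_t":   list(range(1, 16)),
    "_d":   [0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0],
    "prob": [nan, nan, .9, nan, .875, .875, nan, nan,
             .8333333, .8, nan, .6666667, nan, nan, nan],
})

# Rows that carry a prob value (the event times).
events = df.dropna(subset=["prob"])

def sum_prob(row):
    # Assumption: an event counts if entry < event time <= exit.
    inside = (events["_t"] > row["_t0"]) & (events["_t"] <= row["_t"])
    return events.loc[inside, "prob"].sum()

# Only non-cases (_d == 0) get a sum; cases stay missing.
noncase = df["_d"] == 0
df["prob_sum"] = np.nan
df.loc[noncase, "prob_sum"] = df[noncase].apply(sum_prob, axis=1)
print(df)
```

The row-by-row apply is quadratic in the number of subjects, which is fine for a small cohort; a larger dataset would probably call for a sorted cumulative sum instead.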

Expand table by merging additional variables as columns

I have a dataset that looks like this but with several more binary outcome variables: 
* Example generated by -dataex-. To install: ssc install dataex
clear
input long ID byte(Region Type Tier Secure Offshore Highland)
120034 12 1 2 1 0 1
120035 12 1 2 1 0 1
120036 12 1 2 1 0 1
120037 12 1 2 1 0 1
120038 41 1 2 1 0 0
120039 41 2 2 1 1 0
120040 41 2 1 0 1 0
120041 41 2 1 0 1 0
120042 41 2 1 0 1 0
120043 41 2 1 0 0 .
120044 65 2 1 0 0 .
120045 65 3 1 0 0 0
120046 65 3 1 1 0 0
120047 65 3 2 1 1 0
120048 65 3 2 1 0 0
120049 65 3 2 . 1 1
120050 25 3 2 . 1 1
120051 25 5 2 . 1 1
120052 25 5 1 . 0 1
120053 25 5 2 . 0 .
120054 25 5 2 0 0 .
120055 25 5 1 0 . 0
120056 25 5 1 0 . 0
120057 95 7 1 0 1 0
120058 95 7 1 0 1 0
120059 95 7 1 1 1 0
120060 95 7 2 1 0 1
120061 95 7 2 1 0 1
120062 59 7 2 1 0 1
120063 95 8 2 0 . 1
120064 59 8 1 0 . 1
120065 59 8 1 0 . 0
120066 59 8 1 1 . 0
120067 59 8 1 1 1 0
120068 59 8 2 1 1 0
120069 40 9 2 1 1 1
120070 40 9 2 1 0 1
120071 40 9 2 1 0 1
120072 40 9 1 0 0 1
end
I am creating a table with the community-contributed command tabout:
foreach v of var Secure Offshore Highland{
    tabout Region Type Tier `v' using `v'.docx ///
    , replace ///
    style(docx) font(bold) c(freq row) ///
    f(1) layout(cb) landscape nlab(Obs) paper(A4)
    }
It has row frequencies, row percentages and totals.
However, I did not need all this information, so I modified my code as follows:
foreach v of var Secure Offshore Highland{
    tabout Region Type Tier `v' using `v'.docx ///
    , replace ///
    style(docx) font(bold) c(freq row) ///
    f(1) layout(cb) h3(nil) h2(nil) dropc(2 3 4 5 7) landscape nlab(Obs) paper(A4)
    }
This produces what I need, but both versions of my code create a separate table for each outcome variable. I then have to combine the three tables manually, keeping the left-most column, the % of "1" column and the right-most column showing the row total.
Can anyone help me with the following:
Merging all the tables in one go, keeping the explanatory variable labels in the left-most column and the row total in the right-most column.
Keeping only the desired column instead of deleting every column except the % of "1"s. Deleting columns seems crude and dangerous.
Can I get the same output in Excel through putexcel? I tried following the wonderfully written blog post by Chuck Huber, but I cannot figure out the "merging" part.
I came this far through lots and lots of studying, especially Ian Watson's "User Guide for tabout Version 3" and Nicholas Cox's "How to face lists with fortitude".
Cross-posted on Statalist.
You cannot do this readily with tabout -- custom tables require custom programming.
My advice is to create a matrix with whatever values you need and then use the (also) community-contributed command esttab to tabulate and export everything.
That said, what you want requires a lot of work but here is a simplified example based on your data:
matrix N = J(1, 2, .)
local i 0
foreach v in Region Type Tier {
local i = `i' + 1
tabulate `v' Secure, matcell(A`i')
matrix arowsum = J(1, rowsof(A`i'), 1) * A`i'
matrix A`i' = A`i' \ arowsum
if `i' > 1 local N \ N
matrix m1a = (nullmat(m1a) `N' \ A`i')
}
local i 0
foreach v in Region Type Tier {
local i = `i' + 1
tabulate `v' Offshore, matcell(B`i')
matrix browsum = J(1, rowsof(B`i'), 1) * B`i'
matrix B`i' = B`i' \ browsum
if `i' > 1 local N \ N
matrix m2a = (nullmat(m2a) `N' \ B`i')
}
local i 0
foreach v in Region Type Tier {
local i = `i' + 1
tabulate `v' Highland, matcell(C`i')
matrix crowsum = J(1, rowsof(C`i'), 1) * C`i'
matrix C`i' = C`i' \ crowsum
if `i' > 1 local N \ N
matrix m3a = (nullmat(m3a) `N' \ C`i')
}
matrix m1b = m1a * J(colsof(m1a), 1, 1)
matrix m2b = m2a * J(colsof(m2a), 1, 1)
matrix m3b = m3a * J(colsof(m3a), 1, 1)
matrix M1 = m1a, m1b
matrix M2 = m2a, m2b
matrix M3 = m3a, m3b
matrix K = J(1, 3, .)
matrix M = M1 \ K \ M2 \ K \ M3
You can then use esttab to export the results in Excel or Word:
esttab matrix(M)
---------------------------------------------------
M
c1 c2 c1
---------------------------------------------------
r1 0 4 4
r2 3 0 3
r3 1 3 4
r4 4 2 6
r5 2 4 6
r6 2 3 5
r7 3 3 6
r1 15 19 34
r1 . . .
r1 0 5 5
r2 5 1 6
r3 1 3 4
r4 3 0 3
r5 2 4 6
r6 3 3 6
r7 1 3 4
r1 15 19 34
r1 . . .
r1 13 4 17
r2 2 15 17
r1 15 19 34
r1 . . .
r1 . . .
r1 4 0 4
r2 3 2 5
r3 3 1 4
r4 2 4 6
r5 1 2 3
r6 4 2 6
r7 2 3 5
r1 19 14 33
r1 . . .
r1 5 0 5
r2 2 4 6
r3 3 3 6
r4 3 1 4
r5 3 3 6
r6 0 2 2
r7 3 1 4
r1 19 14 33
r1 . . .
r1 6 7 13
r2 13 7 20
r1 19 14 33
r1 . . .
r1 . . .
r1 0 4 4
r2 2 3 5
r3 0 4 4
r4 5 0 5
r5 4 2 6
r6 4 1 5
r7 3 3 6
r1 18 17 35
r1 . . .
r1 1 4 5
r2 4 0 4
r3 4 2 6
r4 2 2 4
r5 3 3 6
r6 4 2 6
r7 0 4 4
r1 18 17 35
r1 . . .
r1 13 3 16
r2 5 14 19
r1 18 17 35
---------------------------------------------------
You will have to generate the rest of the elements you want separately (including column and row names etc.) but the idea is the same. You will also have to play with the options in esttab to fine tune the desired final outcome.
Note that the above can be written more efficiently but I have kept everything separate in this answer so you can understand it.
EDIT:
If you are working with matrices as above you can also use putexcel easily:
putexcel A1 = matrix(M)

Build a static outputTable (similar to a pivot table) with shiny

I have a table that has the following data (shortened for this example):
C1 C2 C3
1 0 1 1
2 1 1 0
3 1 0 1
4 1 1 1
5 0 0 1
6 0 0 0
I want to create a query that gives me the following result:
C1
C2 sum(C3)
It's similar to a pivot table, but static.
Could you help me, please? I'd be grateful.

How to add a number to a portion of dataframe column in pandas?

I have a dataframe with two columns A and B.
A B
1 0
2 0
3 1
4 2
5 0
6 3
What I want to do is add column A to column B, but only where column B is non-zero, and store the result in column B.
A B
1 0
2 0
3 4
4 6
5 0
6 9
Thank you in advance for your help and suggestions.
Use .loc with a boolean mask:
In [49]:
df.loc[df['B'] != 0, 'B'] = df['A'] + df['B']
df
Out[49]:
A B
0 1 0
1 2 0
2 3 4
3 4 6
4 5 0
5 6 9
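For reference, a self-contained version of the masked assignment, using the sample frame from the question. The names mask, updated and alt are illustrative only, and the Series.where variant is offered as one possible alternative rather than something the answer above proposes.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6],
                   "B": [0, 0, 1, 2, 0, 3]})

# True where B is non-zero; only those rows are overwritten.
mask = df["B"] != 0

updated = df.copy()
# The right-hand side is aligned on the index, so rows outside
# the mask are simply ignored.
updated.loc[mask, "B"] = df["A"] + df["B"]

# Alternative without in-place assignment: keep B where it is zero,
# otherwise take A + B.
alt = df["B"].where(~mask, df["A"] + df["B"])

print(updated)
```

The .loc form modifies the frame in place, while Series.where returns a new Series and leaves df untouched, which can be convenient inside a method chain.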

Pandas Multi-indexing from a Flatten DataFrame

I want to reshape a flat dataframe so that it has multi-indexed columns.
The dataframe looks like this:
goods category month stock
a c1 1 5
a c1 2 0
a c1 3 0
a c2 1 0
a c2 2 10
a c2 3 0
b c1 1 30
b c1 2 0
b c1 3 10
b c2 1 0
b c2 2 40
b c2 3 0
And I would like to reshape it like this:
         stock
goods        a       b
category    c1  c2  c1  c2
month
1            5   0  30   0
2            5  10  30  40
3            5  10  10  40
I tried a few things with groupby and stack but couldn't find a good way. Does anyone know how to do this?
With unstack (to use this you first have to set the multi-index):
In [48]: df.set_index(['goods', 'category', 'month']).unstack([0,1])
Out[48]:
         stock
goods        a       b
category    c1  c2  c1  c2
month
1            5   0  30   0
2            0  10   0  40
3            0   0  10   0
Alternative with pivot_table. Be aware that if you have multiple values with the same combination of goods/category/month, they will be averaged by default (another function can be specified with aggfunc):
In [54]: df.pivot_table(columns=['goods', 'category'], index='month', values='stock')
Out[54]:
goods      a       b
category  c1  c2  c1  c2
month
1          5   0  30   0
2          0  10   0  40
3          0   0  10   0
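Both approaches can be reproduced with the sketch below, which rebuilds the sample frame from the question. Passing aggfunc="sum" is a choice made here so that duplicate goods/category/month combinations would be added rather than averaged. The last step is speculative: the desired table in the question seems to carry the last non-zero stock forward, and if that is what is wanted, treating zeros as missing and forward-filling is one way to get there.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "goods":    ["a"] * 6 + ["b"] * 6,
    "category": (["c1"] * 3 + ["c2"] * 3) * 2,
    "month":    [1, 2, 3] * 4,
    "stock":    [5, 0, 0, 0, 10, 0, 30, 0, 10, 0, 40, 0],
})

# unstack: move goods and category from the row index into the columns.
wide = (df.set_index(["goods", "category", "month"])
          .unstack(["goods", "category"]))

# pivot_table: same shape, but duplicates are aggregated (summed here).
pivoted = df.pivot_table(index="month", columns=["goods", "category"],
                         values="stock", aggfunc="sum")

# Assumption: a zero means "no new reading", so carry the last
# non-zero value forward; leading zeros stay zero.
filled = (pivoted.replace(0, np.nan).ffill().fillna(0).astype(int))

print(wide)
print(filled)
```

With this data, wide["stock"] and pivoted hold the same values, and filled matches the table the question asks for.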