Pandas Dataframe: Pairwise division of columns without replacement [duplicate] - python-2.7

This question already has answers here: Fastest way to calculate difference in all columns (closed as a duplicate).
I am trying to divide every column by every other column, but only once per pair (A/B but not B/A).
From Dividing each column by every other column and creating a new dataframe from the results, and thanks to @COLDSPEED, the following code divides every column by every other column (and adds the corresponding new columns).
I cannot figure out how to avoid the duplicated pairs.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0,9,size=(5, 3)), columns=list('ABC'))
ratio_df = pd.concat([df[df.columns.difference([col])].div(df[col], axis=0)
                      for col in df.columns], axis=1)
print ratio_df
Which outputs:
Original dataframe
A B C
0 6 3 7
1 4 6 2
2 6 7 4
3 3 7 7
4 2 5 4
Resulting dataframe
B C A C A B
0 0.500000 1.166667 2.000000 2.333333 0.857143 0.428571
1 1.500000 0.500000 0.666667 0.333333 2.000000 3.000000
2 1.166667 0.666667 0.857143 0.571429 1.500000 1.750000
3 2.333333 2.333333 0.428571 1.000000 0.428571 1.000000
4 2.500000 2.000000 0.400000 0.800000 0.500000 1.250000
In row 0, the first column B holds B/A, i.e. 3/6 = 0.5, and the first column A holds A/B, i.e. 6/3 = 2.
I would like to keep only one result per pair (e.g. only left column / right column):
A/B A/C B/C
0 2.000000 0.857143 0.428571
1 0.666667 2.000000 3.000000
2 0.857143 1.500000 1.750000
3 0.428571 0.428571 1.000000
4 0.400000 0.500000 1.250000
I was not able to find clues on this matter.
How could I resolve it?
Thanks!

Here's one approach -
# Indices of the strict upper triangle (k=1 skips the diagonal),
# i.e. each unordered column pair (left, right) exactly once
idx0, idx1 = np.triu_indices(df.shape[1], 1)
# Divide the "left" columns by the matching "right" columns position-wise;
# dividing by the DataFrame (rather than .values) keeps pandas' float division
df_out = pd.DataFrame(df.iloc[:, idx0].values / df.iloc[:, idx1])
c = df.columns.values
df_out.columns = c[idx0] + '/' + c[idx1]
Sample run -
In [58]: df
Out[58]:
A B C
0 6 3 7
1 4 6 2
2 6 7 4
3 3 7 7
4 2 5 4
In [59]: df_out
Out[59]:
A/B A/C B/C
0 2.000000 0.857143 0.428571
1 0.666667 2.000000 3.000000
2 0.857143 1.500000 1.750000
3 0.428571 0.428571 1.000000
4 0.400000 0.500000 1.250000
Alternative way to get idx0 and idx1 -
from itertools import combinations

# All 2-combinations of the column positions, transposed into two index arrays
idx0, idx1 = np.array(list(combinations(range(df.shape[1]), 2))).T
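For comparison, here is a minimal pure-pandas sketch of the same idea that builds each ratio column by name from itertools.combinations. It is an equivalent formulation, not the original answer's code:
from itertools import combinations
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('ABC'))

# One float-division Series per unordered column pair, concatenated side by side
pairs = list(combinations(df.columns, 2))
ratio_df = pd.concat([df[a].div(df[b]) for a, b in pairs], axis=1)
ratio_df.columns = ['%s/%s' % (a, b) for a, b in pairs]
print(ratio_df)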

Related

How to add dummy row based on one column in pandas dataframe?

I'm working with pandas. I have two dataframes, and the number of rows differs between them:
df
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 5520.50 1 0.06148 0.12556 8.21685 5520.484742
df1
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28616 0.07521 22.91064 4050.327388
1 4208.98 6 0.48781 0.08573 44.51609 4208.990029
2 4374.94 9 0.71548 0.11437 87.10152 4374.944513
3 4379.74 10 0.31338 0.09098 30.34791 4379.778009
4 4398.01 15 0.49950 0.08612 45.78707 4398.020367
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
So there are some rows in df1 that are not in df. I want to add those rows to the dataframe and reset the index accordingly. Previously I was just removing the extra rows to keep the two dataframes equal, but now I want to add a row of zeros for every wave value that isn't there.
The desired result should look like this,
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 4502.21 0 0 0 0 0
6 4508.28 0 0 0 0 0
7 4512.99 0 0 0 0 0
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
How can I get this?
IIUC, you can use DataFrame.loc to update the values of df1 where wave doesn't exist in df:
# Zero out every column from 'num' onward where wave is absent from df
df1.loc[~df1.wave.isin(df.wave), 'num':] = 0
Then use DataFrame.combine_first to make sure that the values in df take precedence:
# Align on wave; df's values win wherever both frames have a row
df_out = df.set_index('wave').combine_first(df1.set_index('wave')).reset_index()
[out]
print(df_out)
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5.0 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9.0 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9.0 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14.0 0.50415 0.09845 52.83236 4398.007473
5 4502.21 0.0 0.00000 0.00000 0.00000 0.000000
6 4508.28 0.0 0.00000 0.00000 0.00000 0.000000
7 4512.99 0.0 0.00000 0.00000 0.00000 0.000000
8 5520.50 1.0 0.06148 0.12556 8.21685 5520.484742
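Note that num comes back as float: combine_first aligns on wave first, and the alignment introduces NaN (df has no rows for the three new waves), which promotes the column to float. If integer dtype matters, a cast afterwards restores it; a small follow-up sketch, not part of the original answer:
# Restore the integer dtype that the NaN-producing alignment promoted to float
df_out['num'] = df_out['num'].astype(int)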

How can I calculate correlation between different values of a python dictionary?

I am learning Python. I want to calculate the correlation between the values. Below is my data, which is a dictionary.
My_data = {1: [1450.0, -80.0, 840.0, -220.0, 630.0, 780.0, -1140.0], 2: [1450.0, -80.0, 840.0, -220.0, 630.0, 780.0, -1140.0],3:[ 720.0, -230.0, 460.0, 220.0, 710.0, -460.0, 90.0] }
This is what I expect to have in return:
   1     2     3
1  1     0.69  0.77
2        1     0.54
3              1
This is the code I tried. I get TypeError: unsupported operand type(s) for /: 'list' and 'long'.
I am not sure what went wrong. I would appreciate it if somebody could explain it to me and help me get the desired result.
my_array = np.array(My_data.values())
Correlation = np.corrcoef(my_array, my_array)
Case 1: if you are open to using pandas
Using pandas (which builds on numpy), you can proceed as follows:
In [55]: import pandas as pd
In [56]: df = pd.DataFrame.from_dict(My_data, orient='index').T
In [57]: df.corr(method='pearson')
Out[57]:
1 2 3
1 1.000000 1.000000 0.384781
2 1.000000 1.000000 0.121978
3 0.384781 0.121978 1.000000
In [58]: df.corr(method='kendall')
Out[58]:
1 2 3
1 1.000000 1.000000 0.333333
2 1.000000 1.000000 0.240385
3 0.333333 0.240385 1.000000
In [59]: df.corr(method='spearman')
Out[59]:
1 2 3
1 1.000000 1.00000 0.464286
2 1.000000 1.00000 0.327370
3 0.464286 0.32737 1.000000
Explanation:
The following line creates a pandas.DataFrame from the dictionary My_data
df = pd.DataFrame.from_dict(My_data, orient='index').T
Which looks like this:
In [60]: df
Out[60]:
1 2 3
0 1450.0 1450.0 720.0
1 -80.0 -80.0 -230.0
2 840.0 840.0 460.0
3 -220.0 -220.0 220.0
4 630.0 630.0 710.0
5 780.0 780.0 -460.0
6 -1140.0 -1140.0 90.0
7 NaN 450.0 -640.0
8 NaN 730.0 870.0
9 NaN -810.0 -290.0
10 NaN 390.0 -2180.0
11 NaN -220.0 -790.0
12 NaN -1640.0 65.0
13 NaN -590.0 70.0
14 NaN -145.0 460.0
15 NaN -420.0 NaN
16 NaN 620.0 NaN
17 NaN 450.0 NaN
18 NaN -90.0 NaN
19 NaN 990.0 NaN
20 NaN -705.0 NaN
then df.corr() will compute the pairwise correlation between columns.
Case 2: if you want a pure numpy solution
You need to convert your data into a numpy.ndarray first; then you can compute the correlation like this:
In [91]: np.corrcoef(np.asarray(My_data.values()))
Out[91]:
array([[ 1. , 1. , 0.38478131],
[ 1. , 1. , 0.38478131],
[ 0.38478131, 0.38478131, 1. ]])
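As for the original TypeError: judging from the NaN padding visible in Out[60], the real lists have different lengths, so np.array(My_data.values()) builds a 1-d object array of lists instead of a 2-d numeric array, and np.corrcoef's internal mean computation then divides a Python list by an integer count. A tiny sketch of the failure mode, with made-up values (numpy versions of that era accepted such ragged input silently):
import numpy as np

ragged = [[1.0, 2.0, 3.0], [1.0, 2.0]]  # unequal lengths
arr = np.array(ragged)                  # dtype=object, shape (2,): an array of lists
# np.corrcoef(arr) ends up dividing a Python list by an integer -> TypeError
The pandas route above sidesteps this, because DataFrame.from_dict pads the shorter columns with NaN and corr() excludes the NaN entries pairwise.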

pandas.DataFrame: How to div row by row [python]

I want to divide row[i] by row[i+1] in a pandas.DataFrame:
row[i] = row[i] / row[i+1]
for example:
1 2 3 4
4 2 6 2
8 5 3 1
the result is
0.25 1 0.5 2
0.5 0.4 2 2
You can divide the DataFrame by its shifted version with div, then remove the all-NaN last row with dropna:
print (df)
a b c d
0 1 2 3 4
1 4 2 6 2
2 8 5 3 1
print (df.div(df.shift(-1), axis=1))
a b c d
0 0.25 1.0 0.5 2.0
1 0.50 0.4 2.0 2.0
2 NaN NaN NaN NaN
df = df.div(df.shift(-1), axis=1).dropna(how='all')
print (df)
a b c d
0 0.25 1.0 0.5 2.0
1 0.50 0.4 2.0 2.0
Another way to remove the last row is to select all but the last row with iloc:
df = df.div(df.shift(-1), axis=1).iloc[:-1]
print (df)
a b c d
0 0.25 1.0 0.5 2.0
1 0.50 0.4 2.0 2.0
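For larger frames, here is an equivalent numpy formulation of the same row-over-next-row division; a sketch alongside the answer above, not part of it:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [4, 2, 6, 2], [8, 5, 3, 1]],
                  columns=list('abcd'))

# Divide every row by the next one; the slice drops the pairless last row.
# astype(float) guards against integer floor division under Python 2.
out = pd.DataFrame(df.values[:-1].astype(float) / df.values[1:],
                   columns=df.columns)
print(out)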

something wrong in pandas.Series.mean() or .apply()

I'm practicing the .apply() function in pandas, but something goes wrong when I use Series.mean() inside the applied function.
here is my code:
In [1]: column = ['UserInfo_2', 'UserInfo_4', 'info_1', 'info_2', 'info_3', 'target']
        value = [['a', 'b', 'a', 'c', 'b', 'a'],
                 ['a', 'c', 'b', 'c', 'b', 'b'],
                 range(0, 11, 2),
                 range(1, 12, 2),
                 range(15, 21),
                 [0, 0, 1, 0, 1, 0]]
        master_train = pd.DataFrame(dict(zip(column, value)))

In [2]: def f(group):
            return pd.DataFrame({'original': group, 'demand': group - group.mean()})

In [3]: master_train.groupby('UserInfo_2')['info_1'].apply(f)
Out[3]:
demand original
0 -4.666667 0
1 -3.000000 2
2 -0.666667 4
3 0.000000 6
4 3.000000 8
5 5.333333 10
I am confused: the mean of info_1 is actually 5, but from the result above the implied mean varies from row to row (4.666667, 5, 6, ...).
What's wrong?
I think it is now clear: the mean of column info_1 (original) is computed per group, where the groups come from column UserInfo_2:
def f(group):
    return pd.DataFrame({'original': group,
                         'groups': group.name,
                         'demand': group - group.mean(),
                         'mean': group.mean()})

print (master_train.groupby('UserInfo_2')['info_1'].apply(f))
demand groups mean original
0 -4.666667 a 4.666667 0
1 -3.000000 b 5.000000 2
2 -0.666667 a 4.666667 4
3 0.000000 c 6.000000 6
4 3.000000 b 5.000000 8
5 5.333333 a 4.666667 10
I think you want the mean of the whole column info_1:
def f(group):
    return pd.DataFrame({'original': group,
                         'demand': group - master_train['info_1'].mean(),
                         'mean': master_train['info_1'].mean()})

print (master_train.groupby('UserInfo_2')['info_1'].apply(f))
demand mean original
0 -5.0 5.0 0
1 -3.0 5.0 2
2 -1.0 5.0 4
3 1.0 5.0 6
4 3.0 5.0 8
5 5.0 5.0 10
EDIT:
For testing, it is possible to add print(group) to the function f; it shows the Series that f receives from column info_1, one per group from column UserInfo_2:
def f(group):
    print (group)
    return pd.DataFrame({'original': group,
                         'groups': group.name,
                         'demand': group - group.mean(),
                         'mean': group.mean()})

print (master_train.groupby('UserInfo_2')['info_1'].apply(f))
0 0
2 4
5 10
Name: a, dtype: int32
1 2
4 8
Name: b, dtype: int32
3 6
Name: c, dtype: int32
demand groups mean original
0 -4.666667 a 4.666667 0
1 -3.000000 b 5.000000 2
2 -0.666667 a 4.666667 4
3 0.000000 c 6.000000 6
4 3.000000 b 5.000000 8
5 5.333333 a 4.666667 10
And if you need the mean of the whole column info_1:
print (master_train['info_1'])
0 0
1 2
2 4
3 6
4 8
5 10
Name: info_1, dtype: int32
print (master_train['info_1'].mean())
5.0
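As an aside, the per-group demand can also be computed without a custom function via groupby().transform, which broadcasts each group's mean back onto the original rows; an equivalent sketch, not from the original answer:
# Per-group mean of info_1, aligned with master_train's original index
group_mean = master_train.groupby('UserInfo_2')['info_1'].transform('mean')
demand = master_train['info_1'] - group_mean
print (demand)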

how to convert planar mesh to arrangement in CGAL

I need to convert 2D planar polygonal meshes to 2D Arrangements in CGAL. For example, if I have the following mesh in Wavefront OBJ format:
v -5.687006 -4.782805 0.000000
v 4.878987 -4.782805 0.000000
v -5.687006 4.782805 0.000000
v 4.878987 4.782805 0.000000
v -0.404010 -4.782805 0.000000
v -5.687006 0.000000 0.000000
v 4.878987 0.000000 0.000000
v -0.404010 4.782805 0.000000
v -0.404010 0.000000 0.000000
f 5 2 9
f 9 2 7
f 7 4 9
f 9 4 8
f 8 3 9
f 9 3 6
f 6 1 9
f 9 1 5
What is the simplest way to convert it to a 2D Arrangement using the CGAL library?
The specialized insertion functions of the 2D Arrangements package, applied segment by segment along each face, cover the three cases:
insert_in_face_interior for the first segment,
insert_from_left_vertex or insert_from_right_vertex for the middle ones, depending on the orientation of your polygon,
insert_at_vertices for the last one.