I am trying to divide every column by every other column, but only once per pair (A/B but not B/A).
From Dividing each column by every other column and creating a new dataframe from the results, and thanks to @COLDSPEED, the following code performs the division of every column by every other column (and adds the corresponding new columns).
I cannot figure out how to avoid the pair duplication.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0,9,size=(5, 3)), columns=list('ABC'))
ratio_df = pd.concat([df[df.columns.difference([col])].div(df[col], axis=0)
                      for col in df.columns], axis=1)
print(ratio_df)
Which outputs:
Original dataframe
A B C
0 6 3 7
1 4 6 2
2 6 7 4
3 3 7 7
4 2 5 4
Resulting dataframe
B C A C A B
0 0.500000 1.166667 2.000000 2.333333 0.857143 0.428571
1 1.500000 0.500000 0.666667 0.333333 2.000000 3.000000
2 1.166667 0.666667 0.857143 0.571429 1.500000 1.750000
3 2.333333 2.333333 0.428571 1.000000 0.428571 1.000000
4 2.500000 2.000000 0.400000 0.800000 0.500000 1.250000
In row 0, the first column B holds B/A, i.e. 3/6 = 0.5, and the first column A holds A/B, i.e. 6/3 = 2.
I would like to keep only one result per pair (e.g. only left column / right column):
A/B A/C B/C
0 2.000000 0.857143 0.428571
1 0.666667 2.000000 3.000000
2 0.857143 1.500000 1.750000
3 0.428571 0.428571 1.000000
4 0.400000 0.500000 1.250000
I was not able to find any clues on this matter.
How could I resolve it?
Thanks!
Here's one approach -
# Indices of the upper triangle (offset 1 skips the diagonal): each
# (idx0, idx1) pair appears exactly once, so A/B is kept but B/A is not.
idx0, idx1 = np.triu_indices(df.shape[1], 1)
df_out = pd.DataFrame(df.iloc[:, idx0].values / df.iloc[:, idx1].values)
c = df.columns.values
df_out.columns = c[idx0] + '/' + c[idx1]
Sample run -
In [58]: df
Out[58]:
A B C
0 6 3 7
1 4 6 2
2 6 7 4
3 3 7 7
4 2 5 4
In [59]: df_out
Out[59]:
A/B A/C B/C
0 2.000000 0.857143 0.428571
1 0.666667 2.000000 3.000000
2 0.857143 1.500000 1.750000
3 0.428571 0.428571 1.000000
4 0.400000 0.500000 1.250000
Alternative way to get idx0 and idx1 -
from itertools import combinations
idx0, idx1 = np.array(list(combinations(range(df.shape[1]), 2))).T
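For intuition, a quick sanity check (a minimal sketch, hardcoding 3 columns) shows that both routes produce the same upper-triangle pairs (0,1), (0,2), (1,2):
import numpy as np
from itertools import combinations

# Offset 1 skips the diagonal, so no self-pairs like A/A are produced.
idx0, idx1 = np.triu_indices(3, 1)
print(idx0, idx1)  # [0 0 1] [1 2 2]

# The same pairs via itertools, transposed into two index arrays.
print(np.array(list(combinations(range(3), 2))).T)
# [[0 0 1]
#  [1 2 2]]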
I'm working with pandas. Basically, I have two dataframes, and the number of rows differs between them:
df
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 5520.50 1 0.06148 0.12556 8.21685 5520.484742
df1
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28616 0.07521 22.91064 4050.327388
1 4208.98 6 0.48781 0.08573 44.51609 4208.990029
2 4374.94 9 0.71548 0.11437 87.10152 4374.944513
3 4379.74 10 0.31338 0.09098 30.34791 4379.778009
4 4398.01 15 0.49950 0.08612 45.78707 4398.020367
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
So there are some rows in df1 that are not in df. I want to add those rows to df and reset the index accordingly. Previously I was just removing the extra rows from df1 to keep the two dataframes equal, but now I want to add a zero-filled row for each wave value that is missing from df.
The desired result should look like this:
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 4502.21 0 0 0 0 0
6 4508.28 0 0 0 0 0
7 4512.99 0 0 0 0 0
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
How can I get this?
IIUC, you can use DataFrame.loc to update the values of df1 where wave doesn't exist in df:
df1.loc[~df1.wave.isin(df.wave), 'num':] = 0
Then use DataFrame.combine_first to make sure that the values in df take precedence:
df_out = df.set_index('wave').combine_first(df1.set_index('wave')).reset_index()
[out]
print(df_out)
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5.0 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9.0 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9.0 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14.0 0.50415 0.09845 52.83236 4398.007473
5 4502.21 0.0 0.00000 0.00000 0.00000 0.000000
6 4508.28 0.0 0.00000 0.00000 0.00000 0.000000
7 4512.99 0.0 0.00000 0.00000 0.00000 0.000000
8 5520.50 1.0 0.06148 0.12556 8.21685 5520.484742
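To see the two steps in isolation, here is a minimal runnable sketch with toy frames (a and b are illustrative stand-ins for df and df1, not the data above):
import pandas as pd

# Toy stand-ins: wave 3.0 exists only in b (playing the role of df1).
a = pd.DataFrame({'wave': [1.0, 2.0], 'num': [3, 5]})
b = pd.DataFrame({'wave': [1.0, 2.0, 3.0], 'num': [3, 6, 9]})

# Step 1: zero every column after 'wave' where the wave is missing from a.
b.loc[~b.wave.isin(a.wave), 'num':] = 0

# Step 2: index on 'wave' so combine_first aligns rows; a's values win.
out = a.set_index('wave').combine_first(b.set_index('wave')).reset_index()
print(out)
#    wave  num
# 0   1.0  3.0
# 1   2.0  5.0
# 2   3.0  0.0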
I am learning Python. I want to calculate the correlation between values. Below is my data, which is a dictionary.
My_data = {1: [1450.0, -80.0, 840.0, -220.0, 630.0, 780.0, -1140.0], 2: [1450.0, -80.0, 840.0, -220.0, 630.0, 780.0, -1140.0],3:[ 720.0, -230.0, 460.0, 220.0, 710.0, -460.0, 90.0] }
This is what I expect to have in return.
1 2 3
1 1 0.69 0.77
2 1 0.54
3 1
This is the code I tried. I get TypeError: unsupported operand type(s) for /: 'list' and 'long'.
I am not sure what went wrong. I would appreciate it if somebody could explain and help me get the desired solution.
my_array = np.array(My_data.values())
Correlation = np.corrcoef(my_array, my_array)
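A likely cause (an assumption: the 'long' in the message suggests Python 2, and the fuller dataset shown below has lists of unequal lengths): np.array over ragged lists builds a 1-D object array of lists, and the mean computed inside corrcoef then divides a list by a long, which raises exactly this TypeError. A minimal reproduction of the shape problem:
import numpy as np

# Ragged input: NumPy can only build a 1-D array of list objects from it.
arr = np.array([[1.0, 2.0], [1.0, 2.0, 3.0]], dtype=object)
print(arr.shape)    # (2,) -- an array of lists, not a 2-D numeric array
# np.corrcoef(arr)  # would fail: arithmetic on list objects is undefined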
Case 1: if you are open to using pandas
Using pandas (which wraps numpy), you can proceed as follows:
In [55]: import pandas as pd
In [56]: df = pd.DataFrame.from_dict(My_data, orient='index').T
In [57]: df.corr(method='pearson')
Out[57]:
1 2 3
1 1.000000 1.000000 0.384781
2 1.000000 1.000000 0.121978
3 0.384781 0.121978 1.000000
In [58]: df.corr(method='kendall')
Out[58]:
1 2 3
1 1.000000 1.000000 0.333333
2 1.000000 1.000000 0.240385
3 0.333333 0.240385 1.000000
In [59]: df.corr(method='spearman')
Out[59]:
1 2 3
1 1.000000 1.00000 0.464286
2 1.000000 1.00000 0.327370
3 0.464286 0.32737 1.000000
Explanation:
The following line creates a pandas.DataFrame from the dictionary My_data
df = pd.DataFrame.from_dict(My_data, orient='index').T
Which looks like this:
In [60]: df
Out[60]:
1 2 3
0 1450.0 1450.0 720.0
1 -80.0 -80.0 -230.0
2 840.0 840.0 460.0
3 -220.0 -220.0 220.0
4 630.0 630.0 710.0
5 780.0 780.0 -460.0
6 -1140.0 -1140.0 90.0
7 NaN 450.0 -640.0
8 NaN 730.0 870.0
9 NaN -810.0 -290.0
10 NaN 390.0 -2180.0
11 NaN -220.0 -790.0
12 NaN -1640.0 65.0
13 NaN -590.0 70.0
14 NaN -145.0 460.0
15 NaN -420.0 NaN
16 NaN 620.0 NaN
17 NaN 450.0 NaN
18 NaN -90.0 NaN
19 NaN 990.0 NaN
20 NaN -705.0 NaN
Then df.corr() computes the pairwise correlation between columns (NaN entries are excluded pairwise).
Case 2: if you want a pure numpy solution
You need to convert your data into a numpy.ndarray first; then you can compute the correlation like this:
In [91]: np.corrcoef(np.asarray(My_data.values()))
Out[91]:
array([[ 1. , 1. , 0.38478131],
[ 1. , 1. , 0.38478131],
[ 0.38478131, 0.38478131, 1. ]])
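A small note in case you are on Python 3 (an assumption; the original error suggests Python 2): there, dict.values() returns a view, so materialize it with list() before handing it to NumPy:
import numpy as np

My_data = {1: [1.0, 2.0, 3.0], 2: [2.0, 4.0, 6.0]}  # toy values
# list() turns the dict_values view into a proper sequence of rows.
print(np.corrcoef(np.asarray(list(My_data.values()))))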
I want to divide row[i] by row[i+1] in a pandas.DataFrame:
row[i] = row[i] / row[i+1]
for example:
1 2 3 4
4 2 6 2
8 5 3 1
the result is
0.25 1 0.5 2
0.5 0.4 2 2
You can divide by the shifted DataFrame using div, then remove the all-NaN last row with dropna:
print (df)
a b c d
0 1 2 3 4
1 4 2 6 2
2 8 5 3 1
print (df.div(df.shift(-1), axis=1))
a b c d
0 0.25 1.0 0.5 2.0
1 0.50 0.4 2.0 2.0
2 NaN NaN NaN NaN
df = df.div(df.shift(-1), axis=1).dropna(how='all')
print (df)
a b c d
0 0.25 1.0 0.5 2.0
1 0.50 0.4 2.0 2.0
Another way to remove the last row is to select all but the last row with iloc:
df = df.div(df.shift(-1), axis=1).iloc[:-1]
print (df)
a b c d
0 0.25 1.0 0.5 2.0
1 0.50 0.4 2.0 2.0
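For intuition about where the NaN row comes from: df.shift(-1) moves every row up by one, so row i sits next to row i+1 before the element-wise division (a minimal sketch):
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 8], 'b': [2, 2, 5]})
# The last row has nothing shifted into it, hence NaN before and after dividing.
print(df.shift(-1))
#      a    b
# 0  4.0  2.0
# 1  2.0  5.0
# 2  NaN  NaN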
I'm practicing the .apply() function in pandas, but something goes wrong when I use Series.mean() inside the function.
Here is my code:
In[1]: column = ['UserInfo_2', 'UserInfo_4','info_1', 'info_2', 'info_3','target']
value = [['a', 'b', 'a', 'c', 'b', 'a'],
['a', 'c', 'b', 'c', 'b', 'b'],
range(0, 11, 2),
range(1, 12, 2),
range(15, 21),
[0, 0, 1, 0, 1, 0]
]
master_train = pd.DataFrame(dict(zip(column, value)))
In[2]: def f(group):
           return pd.DataFrame({'original': group, 'demand': group - group.mean()})
In[3]: master_train.groupby('UserInfo_2')['info_1'].apply(f)
Out[3]:
demand original
0 -4.666667 0
1 -3.000000 2
2 -0.666667 4
3 0.000000 6
4 3.000000 8
5 5.333333 10
I am confused because the mean of info_1 is actually 5, but from the result above, the implied mean changes from row to row (e.g. 4.666667 in row 0 but 6 in row 3).
What's wrong?
I think now it is clear - you compute the mean of column info_1 (original) per group from column UserInfo_2:
def f(group):
    return pd.DataFrame({'original': group,
                         'groups': group.name,
                         'demand': group - group.mean(),
                         'mean': group.mean()})
print (master_train.groupby('UserInfo_2')['info_1'].apply(f))
demand groups mean original
0 -4.666667 a 4.666667 0
1 -3.000000 b 5.000000 2
2 -0.666667 a 4.666667 4
3 0.000000 c 6.000000 6
4 3.000000 b 5.000000 8
5 5.333333 a 4.666667 10
I think you want the mean of the whole column info_1 instead:
def f(group):
    return pd.DataFrame({'original': group,
                         'demand': group - master_train['info_1'].mean(),
                         'mean': master_train['info_1'].mean()})
print (master_train.groupby('UserInfo_2')['info_1'].apply(f))
demand mean original
0 -5.0 5.0 0
1 -3.0 5.0 2
2 -1.0 5.0 4
3 1.0 5.0 6
4 3.0 5.0 8
5 5.0 5.0 10
EDIT:
For testing, it is possible to add print(group) to the function f - it shows that f receives a Series from column info_1 for each group from column UserInfo_2:
def f(group):
    print (group)
    return pd.DataFrame({'original': group,
                         'groups': group.name,
                         'demand': group - group.mean(),
                         'mean': group.mean()})
print (master_train.groupby('UserInfo_2')['info_1'].apply(f))
0 0
2 4
5 10
Name: a, dtype: int32
1 2
4 8
Name: b, dtype: int32
3 6
Name: c, dtype: int32
demand groups mean original
0 -4.666667 a 4.666667 0
1 -3.000000 b 5.000000 2
2 -0.666667 a 4.666667 4
3 0.000000 c 6.000000 6
4 3.000000 b 5.000000 8
5 5.333333 a 4.666667 10
And if you need the mean of the whole column info_1:
print (master_train['info_1'])
0 0
1 2
2 4
3 6
4 8
5 10
Name: info_1, dtype: int32
print (master_train['info_1'].mean())
5.0
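As an aside (not part of the original answer), groupby().transform is a more idiomatic way to get the per-group demand without building a DataFrame inside apply - a sketch:
# transform('mean') broadcasts each group's mean back to the original rows.
g = master_train.groupby('UserInfo_2')['info_1']
master_train['demand'] = master_train['info_1'] - g.transform('mean')
print(master_train[['UserInfo_2', 'info_1', 'demand']])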
I need to convert 2D planar polygonal meshes to 2D Arrangements in CGAL. For example, if I have the following mesh in Wavefront OBJ format:
v -5.687006 -4.782805 0.000000
v 4.878987 -4.782805 0.000000
v -5.687006 4.782805 0.000000
v 4.878987 4.782805 0.000000
v -0.404010 -4.782805 0.000000
v -5.687006 0.000000 0.000000
v 4.878987 0.000000 0.000000
v -0.404010 4.782805 0.000000
v -0.404010 0.000000 0.000000
f 5 2 9
f 9 2 7
f 7 4 9
f 9 4 8
f 8 3 9
f 9 3 6
f 6 1 9
f 9 1 5
What is the simplest way I could convert it to a 2D Arrangement using the CGAL library?
You can build the arrangement face by face, inserting the three segments of each triangle with the specialized insertion functions of the 2D Arrangements package:
insert_in_face_interior for the first segment,
insert_from_left_vertex or insert_from_right_vertex for the middle one, depending on the orientation of your polygon, and
insert_at_vertices for the last one.