Null-independent column-wise mean calculation in Python - python-2.7

I am trying to calculate the mean of three columns in Python. Here is the catch:
If all three row values of my three columns are non-null, then my mean is (x+y+z)/3.
If one of the row values is null (suppose z), then my mean should be (x+y)/2.
I'm storing these mean values in a separate column which is part of the pandas DataFrame.
I'm looking for the best approach, as my dataset has over 2 million rows.
My data is below.
Thanks in advance.
A B C
0 1 2 3 # = (1+2+3)/3 = 2
1 4 NaN 6 # = (4+6)/2 = 5
2 NaN 8 9 # = (8+9)/2 = 8.5

Just apply the numpy.nanmean function along axis 0 (columns). This is the default axis, so you will get the same result by omitting axis=0. If you want the means row-wise, as the question asks, use axis=1:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a': [2.3, 4.5, 2.1, np.nan, 6.7],
    'b': [2.4, 5.6, np.nan, np.nan, 7.1],
    'c': [np.nan, np.nan, np.nan, np.nan, 0.9]
})
colmeans = df.apply(np.nanmean, axis = 0)
# colmeans
# a 3.900000
# b 5.033333
# c 0.900000
# dtype: float64
rowmeans = df.apply(np.nanmean, axis = 1)
# 0 2.35
# 1 5.05
# 2 2.10
# 3 NaN
# 4 4.90
# dtype: float64
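Since the question mentions 2 million rows, it is worth noting that pandas' own mean skips NaN by default (skipna=True), so the conditional row mean comes out of a single vectorized call. A sketch of that alternative (not part of the original answer), which avoids apply entirely:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 4, np.nan],
    'B': [2, np.nan, 8],
    'C': [3, 6, 9]
})
# mean(axis=1) averages each row over its non-null values only,
# so row 1 is (4 + 6) / 2 = 5.0 without any special casing
df['mean'] = df[['A', 'B', 'C']].mean(axis=1)
# 0    2.0
# 1    5.0
# 2    8.5
# Name: mean, dtype: float64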

Related

How to plot a parallel coordinate plot from Hyperparameter Tuning with the HParams Dashboard?

I am trying to replicate the parallel coordinate plot from the Hyperparameter Tuning with the HParams Dashboard tutorial in TensorFlow, and I have written my own csv file where I store my results.
My output from reading the csv file looks like this:
conv_layers filters dropout accuracy
0 4 16 0.5 0.447917
1 4 16 0.6 0.458333
2 4 32 0.5 0.635417
3 4 32 0.6 0.447917
4 4 64 0.5 0.604167
5 4 64 0.6 0.645833
6 8 16 0.5 0.437500
7 8 16 0.6 0.437500
8 8 32 0.5 0.437500
9 8 32 0.6 0.562500
10 8 64 0.5 0.562500
11 8 64 0.6 0.437500
How can I create the same plot as in the tutorial in Python?
So I found the answer using Plotly:
import pandas as pd
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objects as go

init_notebook_mode(connected=True)

df = pd.read_csv('path/to/csv')

fig = go.Figure(data=
    go.Parcoords(
        line = dict(color = df['accuracy'],
                    showscale = True,  # draw the color bar for the line colors
                    colorscale = [[0, '#6C9E12'],
                                  [0.25, '#0D5F67'],
                                  [0.5, '#AA1B13'],
                                  [0.75, '#69178C'],
                                  [1, '#DE9733']]),
        dimensions = list([
            dict(range = [0, 12],
                 label = 'Conv_layers', values = df['conv_layers']),
            dict(range = [8, 64],
                 label = 'filter_number', values = df['filters']),
            dict(range = [0.2, 0.8],
                 label = 'dropout_rate', values = df['dropout']),
            # only include this if your csv also has a 'dense' column:
            # dict(range = [0.2, 0.8],
            #      label = 'dense_num', values = df['dense']),
            dict(range = [0.1, 1.0],
                 label = 'accuracy', values = df['accuracy'])
        ])
    )
)
fig.update_layout(
    plot_bgcolor = '#E5E5E5',
    paper_bgcolor = '#E5E5E5',
    title = "Parallel Coordinates Plot"
)
# show the plot
fig.show()
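As a follow-up, the dimension ranges above are hard-coded; a small helper can derive them from the data instead. This is a hypothetical convenience (the function dim is my name, not part of the answer):
def dim(df, col, label=None):
    # build one Parcoords dimension whose range spans the observed values
    return dict(range=[df[col].min(), df[col].max()],
                label=label or col,
                values=df[col])

# e.g.:
# dimensions = [dim(df, 'conv_layers', 'Conv_layers'),
#               dim(df, 'filters', 'filter_number'),
#               dim(df, 'dropout', 'dropout_rate'),
#               dim(df, 'accuracy')]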

Python Side-by-side box plots on same figure

I am trying to generate a box plot in Python 2.7 for each categorical value in column E of the pandas DataFrame below:
A B C D E
0 0.647366 0.317832 0.875353 0.993592 1
1 0.504790 0.041806 0.113889 0.445370 2
2 0.769335 0.120647 0.749565 0.935732 3
3 0.215003 0.497402 0.795033 0.246890 1
4 0.841577 0.211128 0.248779 0.250432 1
5 0.045797 0.710889 0.257784 0.207661 4
6 0.229536 0.094308 0.464018 0.402725 3
7 0.067887 0.591637 0.949509 0.858394 2
8 0.827660 0.348025 0.507488 0.343006 3
9 0.559795 0.820231 0.461300 0.921024 1
I would be willing to do this with Matplotlib or any other plotting library. So far, the code below plots all the categories combined in one plot. Here is the code to generate the above data and produce the plot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# Data
df = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'))
df['E'] = [1,2,3,1,1,4,3,2,3,1]
# Boxplot
bp = ax.boxplot(df.iloc[:,:-1].values, widths=0.2)
plt.show()
In this example, the categories are 1, 2, 3, 4. I would like to plot separate box plots side by side on the same figure, for only categories 1 and 2, and show the category names in the legend.
Is there a way to do this?
Additional information:
The output should look similar to the 3rd figure from here - replace "Yes"/"No" with "1"/"2".
Starting with this:
import numpy
import pandas
from matplotlib import pyplot
import seaborn
seaborn.set(style="ticks")
# Data
df = pandas.DataFrame(numpy.random.rand(10,4), columns=list('ABCD'))
df['E'] = [1, 2, 3, 1, 1, 4, 3, 2, 3, 1]
You've got a couple of options. If separate axes are okay:
fig, axes = pyplot.subplots(ncols=4, figsize=(12, 5), sharey=True)
df.query("E in [1, 2]").boxplot(by='E', return_type='axes', ax=axes)
If you want a single Axes, I think seaborn will be easier. You just need to clean up your data:
ax = (
    df.set_index('E', append=True)  # set E as part of the index
      .stack()                      # pull A - D into rows
      .to_frame()                   # convert to a dataframe
      .reset_index()                # make the index into reg. columns
      .rename(columns={'level_2': 'quantity', 0: 'value'})  # rename columns
      .drop('level_0', axis='columns')  # drop junk columns
      .pipe((seaborn.boxplot, 'data'), x='E', y='value', hue='quantity', order=[1, 2])
)
seaborn.despine(trim=True)
The cool thing about seaborn is that tweaking the parameters slightly can achieve a lot in terms of the plot's layout. If we switch our hue and x variables, we get:
ax = (
    df.set_index('E', append=True)  # set E as part of the index
      .stack()                      # pull A - D into rows
      .to_frame()                   # convert to a dataframe
      .reset_index()                # make the index into reg. columns
      .rename(columns={'level_2': 'quantity', 0: 'value'})  # rename columns
      .drop('level_0', axis='columns')  # drop junk columns
      .pipe((seaborn.boxplot, 'data'), x='quantity', y='value', hue='E', hue_order=[1, 2])
)
seaborn.despine(trim=True)
If you're curious, the resulting dataframe looks something like this:
E quantity value
0 1 A 0.935433
1 1 B 0.862290
2 1 C 0.197243
3 1 D 0.977969
4 2 A 0.675037
5 2 B 0.494440
6 2 C 0.492762
7 2 D 0.531296
8 3 A 0.119273
9 3 B 0.303639
10 3 C 0.911700
11 3 D 0.807861
An addition to @Paul_H's answer.
Side-by-side box plots on a single matplotlib.axes.Axes, no seaborn:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(10, 4), columns=list('ABCD'))
df['E'] = [1, 2, 1, 1, 1, 2, 1, 2, 2, 1]
mask_e = df['E'] == 1

# prepare data: interleave the two categories for each quantity
data_to_plot = [df[mask_e]['A'], df[~mask_e]['A'],
                df[mask_e]['B'], df[~mask_e]['B'],
                df[mask_e]['C'], df[~mask_e]['C'],
                df[mask_e]['D'], df[~mask_e]['D']]

# positions defaults to range(1, N+1), where N is the number of box plots drawn;
# we move them a little to visually group the pairs
plt.figure(figsize=(10, 6))
box = plt.boxplot(data_to_plot,
                  positions=[1, 1.6, 2.5, 3.1, 4, 4.6, 5.5, 6.1],
                  labels=['A1', 'A0', 'B1', 'B0', 'C1', 'C0', 'D1', 'D0'])
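The question also asked for the category names in a legend; plt.boxplot creates no legend entries by itself, so here is a hedged addition (not in the original answer) that colors the box pairs and builds proxy artists by hand:
import matplotlib.patches as mpatches
# data_to_plot interleaves E == 1 and E == 2, so alternate two colors
for artist, color in zip(box['boxes'], ['#1f77b4', '#ff7f0e'] * 4):
    artist.set_color(color)
plt.legend(handles=[mpatches.Patch(color='#1f77b4', label='E = 1'),
                    mpatches.Patch(color='#ff7f0e', label='E = 2')])
plt.show()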

pandas checking for nan not working using .isin()

I have the following pandas DataFrame with a NaN in it.
import pandas as pd
df = pd.DataFrame([1,2,3,float('nan')], columns=['A'])
df
A
0 1
1 2
2 3
3 NaN
I also have a list, filter_list, with which I want to filter my DataFrame. But if I use the .isin() function, it is not detecting the NaN; instead of getting True, I am getting False in the last row:
filter_list = [1, float('nan')]
df['A'].isin(filter_list)
0 True
1 False
2 False
3 False
Name: A, dtype: bool
Expected output:
0 True
1 False
2 False
3 True
Name: A, dtype: bool
I know that I can use .isnull() to check for NaNs, but here I have other values to check as well. I am using pandas version 0.16.0.
Edit: The list filter_list comes from the user, so it might or might not contain NaN. That's why I am using .isin().
The float NaN has the interesting property that it is not equal to itself:
In [194]: float('nan') == float('nan')
Out[194]: False
isin checks for equality, so you can't use it to check whether a value equals NaN.
To check for NaNs, it is best to use the isnull method:
In [200]: df['A'].isin([1]) | df['A'].isnull()
Out[200]:
0 True
1 False
2 False
3 True
Name: A, dtype: bool
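Because the question's edit says the list comes from the user and may or may not contain NaN, a generic hedged sketch (the helper name is mine, not from the answer) combines the two checks:
import numpy as np
import pandas as pd

def isin_with_nan(series, values):
    # NaN != NaN, so v == v filters NaN out of the plain-equality part
    non_nan = [v for v in values if v == v]
    mask = series.isin(non_nan)
    if len(non_nan) != len(values):  # the list contained at least one NaN
        mask |= series.isnull()
    return mask

df = pd.DataFrame([1, 2, 3, np.nan], columns=['A'])
isin_with_nan(df['A'], [1, float('nan')])
# 0     True
# 1    False
# 2    False
# 3     True
# Name: A, dtype: bool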
You could replace NaN with a unique non-NaN value that will not occur in your list, say 'NA' or ''. For example:
In [23]: import pandas as pd
In [24]: import numpy as np
In [25]: df = pd.DataFrame([1, 2, 3, np.nan], columns=['A'])
In [26]: filter_list = pd.Series([1, np.nan])
In [27]: na_equiv = 'NA'
In [28]: df['A'].replace(np.nan, na_equiv).isin(filter_list.replace(np.nan, na_equiv))
Out[28]:
0     True
1    False
2    False
3     True
Name: A, dtype: bool
I think that the simplest way is to use numpy.nan instead of float('nan'):
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 3, np.nan], columns=['A'])
filter_list = [1, np.nan]
df['A'].isin(filter_list)
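Why this works is worth spelling out (my explanation, not the answer's, so verify on your pandas version): np.nan is a single shared object, so identity-based membership checks can match it, whereas float('nan') builds a fresh object each time and only ever compares unequal:
float('nan') is float('nan')   # False: two distinct objects
np.nan is np.nan               # True: the same singleton object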
If you really want to use isin() to match NaN, you can create a class that has the same hash as NaN and returns True when compared to NaN:
import numpy as np
import pandas as pd

class NAN(object):
    def __eq__(self, v):
        # True exactly when the other value is NaN
        return np.isnan(v)
    def __hash__(self):
        return hash(np.nan)

nan = NAN()
df = pd.DataFrame([1, 2, 3, float('nan')], columns=['A'])
df.A.isin([1, nan])

Pandas list comprehension in a dataframe

I would like to pull out the price at the next day's open, currently stored in (row + 1), and store it in a new column if some condition is met. Here is my attempt:
df['b']=''
df['shift']=''
df['shift']=df['open'].shift(-1)
df['b']=df[x for x in df['shift'] if df["MA10"]>df["MA100"]]
There are a few approaches. Using apply:
>>> df = pd.read_csv("bondstack.csv")
>>> df["shift"] = df["open"].shift(-1)
>>> df["b"] = df.apply(lambda row: row["shift"] if row["MA10"] > row["MA100"] else np.nan, axis=1)
which produces
>>> df[["MA10", "MA100", "shift", "b"]][:10]
MA10 MA100 shift b
0 16.915625 17.405625 16.734375 NaN
1 16.871875 17.358750 17.171875 NaN
2 16.893750 17.317187 17.359375 NaN
3 16.950000 17.279062 17.359375 NaN
4 17.137500 17.254062 18.640625 NaN
5 17.365625 17.229063 18.921875 18.921875
6 17.550000 17.200312 18.296875 18.296875
7 17.681250 17.177500 18.640625 18.640625
8 17.812500 17.159375 18.609375 18.609375
9 17.943750 17.142813 18.234375 18.234375
For a more vectorized approach, you could use
>>> df = pd.read_csv("bondstack.csv")
>>> df["b"] = np.nan
>>> df["b"][df["MA10"] > df["MA100"]] = df["open"].shift(-1)
or my preferred approach:
>>> df = pd.read_csv("bondstack.csv")
>>> df["b"] = df["open"].shift(-1).where(df["MA10"] > df["MA100"])
Modifying DSM's third approach, stating the True/False values in np.where explicitly:
#numpy.where(condition, x, y)
df["b"] = np.where(df["MA10"] > df["MA100"], df["open"].shift(-1), np.nan)
Using list comprehension explicitly:
#[xv if c else yv for (c,xv,yv) in zip(condition,x,y)] #np.where documentation
df['b'] = [ xv if c else np.nan for (c,xv) in zip(df["MA10"]> df["MA100"], df["open"].shift(-1))]
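One caveat worth adding (mine, not from the answers above): the chained assignment in the second approach, df["b"][mask] = ..., triggers SettingWithCopyWarning on modern pandas; .loc is the safe spelling:
df["b"] = np.nan
mask = df["MA10"] > df["MA100"]
df.loc[mask, "b"] = df["open"].shift(-1)  # the right-hand side aligns on the index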

How to identify the first occurrence of duplicate rows in a Python pandas DataFrame

I have a pandas DataFrame with duplicate values for a set of columns. For example:
df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3}, 'Column2': {0: 'ABC', 1: 'XYZ', 2: 'ABC'}, 'Column3': {0: 'DEF', 1: 'DEF', 2: 'DEF'}, 'Column4': {0: 10, 1: 40, 2: 10}})
In [2]: df
Out[2]:
   Column1 Column2 Column3  Column4
0        1     ABC     DEF       10
1        2     XYZ     DEF       40
2        3     ABC     DEF       10
Rows (1) and (3) are the same; essentially, row (3) is a duplicate of row (1).
I am looking for the following output:
Is_Duplicate, containing whether the row is a duplicate or not [this can be accomplished by using the "duplicated" method on the DataFrame columns (Column2, Column3 and Column4)]
Dup_Index, the original index of the duplicate row.
In [3]: df
Out[3]:
Column1 Column2 Column3 Column4 Is_Duplicate Dup_Index
0 1 ABC DEF 10 False 0
1 2 XYZ DEF 40 False 1
2 3 ABC DEF 10 True 0
There is a DataFrame method duplicated for the first part:
In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])
Out[11]:
0 False
1 False
2 True
In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])
To do the second you could try something like this:
In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])
In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])
In [15]: df1.index.map(lambda ind: g.indices[ind][0])
Out[15]: array([0, 1, 0])
In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])
In [17]: df
Out[17]:
Column1 Column2 Column3 Column4 is_duplicated dup_index
0 1 ABC DEF 10 False 0
1 2 XYZ DEF 40 False 1
2 3 ABC DEF 10 True 0
Let's say your DataFrame is stored in df.
You can use groupby to get the non-duplicated rows of your DataFrame. Here we have to ignore Column1, which is not part of the data:
df_nodup = df.groupby(by=['Column2', 'Column3', 'Column4']).first()
You can then merge this new DataFrame with the original one by using the merge function:
df = df.merge(df_nodup, left_on=['Column2', 'Column3', 'Column4'], right_index=True, suffixes=('', '_dupindex'))
You can then use the Column1_dupindex column merged into the DataFrame to do the simple math that adds the requested columns:
df['Is_Duplicate'] = df['Column1']!=df['Column1_dupindex']
df['Dup_Index'] = None
df['Dup_Index'] = df['Dup_Index'].where(df['Column1_dupindex']==df['Column1'], df['Column1_dupindex'])
del df['Column1_dupindex']
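On recent pandas versions, a compact alternative sketch (mine, not from the answers above, assuming the default integer index) produces both requested columns directly:
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3],
                   'Column2': ['ABC', 'XYZ', 'ABC'],
                   'Column3': ['DEF', 'DEF', 'DEF'],
                   'Column4': [10, 40, 10]})
keys = ['Column2', 'Column3', 'Column4']
df['Is_Duplicate'] = df.duplicated(keys)
# broadcast the smallest original index in each key group,
# i.e. the row number of the first occurrence
df['Dup_Index'] = df.groupby(keys)['Column1'].transform(lambda s: s.index.min())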