pandas checking for nan not working using .isin() - python-2.7

I have the following pandas DataFrame with a NaN in it.
import pandas as pd
df = pd.DataFrame([1,2,3,float('nan')], columns=['A'])
df
A
0 1
1 2
2 3
3 NaN
I also have a list, filter_list, with which I want to filter my DataFrame. But if I use the .isin() function, it does not detect the NaN: instead of True I get False in the last row.
filter_list = [1, float('nan')]
df['A'].isin(filter_list)
0 True
1 False
2 False
3 False
Name: A, dtype: bool
Expected output:
0 True
1 False
2 False
3 True
Name: A, dtype: bool
I know that I can use .isnull() to check for NaNs, but here I have other values to check as well. I am using pandas version 0.16.0.
Edit: The list filter_list comes from the user, so it might or might not contain NaN. That's why I am using .isin().

The float NaN has the interesting property that it is not equal to itself:
In [194]: float('nan') == float('nan')
Out[194]: False
isin checks for equality. So you can't use isin to check if a value equals NaN.
To check for NaNs it is best to use pandas' isnull method.
In [200]: df['A'].isin([1]) | df['A'].isnull()
Out[200]:
0 True
1 False
2 False
3 True
Name: A, dtype: bool
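Since filter_list comes from the user and may or may not contain NaN, one option is to split the list into its NaN and non-NaN parts and combine isin with isnull. This is a minimal sketch of my own (the helper name isin_allow_nan is made up, not from the original answers):
import numpy as np
import pandas as pd

def isin_allow_nan(series, values):
    # Treat float NaN in the filter values separately, since isin
    # compares with == and NaN never equals itself.
    has_nan = any(isinstance(v, float) and np.isnan(v) for v in values)
    non_nan = [v for v in values if not (isinstance(v, float) and np.isnan(v))]
    mask = series.isin(non_nan)
    if has_nan:
        mask |= series.isnull()
    return mask

df = pd.DataFrame([1, 2, 3, float('nan')], columns=['A'])
isin_allow_nan(df['A'], [1, float('nan')])
# 0     True
# 1    False
# 2    False
# 3     True
# Name: A, dtype: bool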

You could replace NaN with a unique non-NaN value that will not occur in your list, say 'NA' or ''. For example:
In [23]: import pandas as pd
In [24]: df = pd.DataFrame([1, 2, 3, pd.np.nan], columns=['A'])
In [25]: filter_list = pd.Series([1, pd.np.nan])
In [26]: na_equiv = 'NA'
In [27]: df['A'].replace(pd.np.nan, na_equiv).isin(filter_list.replace(pd.np.nan, na_equiv))
Out[27]:
0 True
1 False
2 False
3 True
Name: A, dtype: bool
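If this needs to handle arbitrary user input, the same trick can be wrapped in a helper. The function name here is mine, and it assumes na_equiv occurs in neither the data nor the filter values:
def isin_with_nan(series, values, na_equiv='NA'):
    # Map NaN to a sentinel on both sides, then compare with isin as usual.
    values = pd.Series(list(values)).replace(pd.np.nan, na_equiv)
    return series.replace(pd.np.nan, na_equiv).isin(values)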

I think that the simplest way is to use numpy.nan:
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 3, np.nan], columns=['A'])
filter_list = [1, np.nan]
df['A'].isin(filter_list)
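For what it's worth, recent pandas releases special-case NaN in isin's hash-based lookup, so on a new enough version this returns True for the NaN row regardless of whether the NaN came from numpy.nan or float('nan'); behaviour on 0.16 may differ:
df['A'].isin(filter_list)
# 0     True
# 1    False
# 2    False
# 3     True
# Name: A, dtype: bool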

If you really want to use isin() to match NaN, you can create a class that has the same hash as NaN and returns True when compared to NaN:
import numpy as np
import pandas as pd
class NAN(object):
    def __eq__(self, v):
        return np.isnan(v)
    def __hash__(self):
        return hash(np.nan)
nan = NAN()
df = pd.DataFrame([1,2,3,float('nan')], columns=['A'])
df.A.isin([1, nan])

Related

Null independent column wise mean calculation in Python

I am trying to calculate the mean of three columns in Python. Here is the catch:
If all 3 row values of my 3 columns are not null, then my mean will be (x+y+z)/3.
If one of my row values is null (suppose z), then my mean should be (x+y)/2.
I'm storing these mean values in a separate column which is part of the pandas DataFrame.
I'm looking for the best approach as my dataset has over 2 million rows.
My data is below.
Thanks in advance.
A B C
0 1 2 3 # = (1+2+3)/3 = 2
1 4 NaN 6 # = (4+6)/2 = 5
2 NaN 8 9 # = (8+9)/2 = 8.5
Just apply the numpy.nanmean function along axis 0 (columns). This is the default axis, so you get the same result if you omit axis=0. If you want the means row-wise, use axis=1:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a': [2.3, 4.5, 2.1, np.nan, 6.7],
    'b': [2.4, 5.6, np.nan, np.nan, 7.1],
    'c': [np.nan, np.nan, np.nan, np.nan, 0.9]
})
colmeans = df.apply(np.nanmean, axis = 0)
# colmeans
# a 3.900000
# b 5.033333
# c 0.900000
# dtype: float64
rowmeans = df.apply(np.nanmean, axis = 1)
# 0 2.35
# 1 5.05
# 2 2.10
# 3 NaN
# 4 4.90
# dtype: float64
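Since the question mentions over 2 million rows, it is probably worth noting that pandas' own mean skips NaN by default (skipna=True) and is fully vectorized, so it should be noticeably faster than apply with np.nanmean:
# Equivalent results; all-NaN rows still produce NaN.
colmeans = df.mean(axis=0)
rowmeans = df.mean(axis=1)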

python - check if nan float in dictionary

import numpy as np
import pandas as pd
import tia.bbg.datamgr as dm
mgr = dm.BbgDataManager()
bb_yearb4 = "2016-12-30"
bb_today = "2017-09-22"
indices = [list of indices]
sids_index = mgr[indices]
df_idx = sids_index.get_historical('PX_LAST', bb_yearb4, bb_today)
nan = np.nan
price_test = {}
for index in indices:
    price_test["{0}".format(index)] = df_idx.loc[bb_today][index]
The output shows multiple nan float values:
In [1]: price_test.values()
Out[1]: [nan, nan, nan, 47913.199999999997, nan, 1210.3299999999999, nan]
However, testing for nan shows false:
In [2]: nan in price_test.values()
Out[2]: False
What is the correct way to test this?
NaN is weird, because NaN != NaN. There's a good reason for that, but it still breaks membership (in) checks and everything else that assumes normal == behavior.
Check for NaN with NaN-specific checks, like numpy.isnan:
any(np.isnan(val) for val in d.values())
or in a non-NumPy context,
any(math.isnan(val) for val in d.values())
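One extra caveat: the in operator checks identity before equality, so nan in some_list can return True when it happens to be the very same NaN object, which makes it inconsistent rather than reliably False:
nan = float('nan')
nan in [nan]           # True  -- same object, identity short-circuits ==
nan in [float('nan')]  # False -- different object, == fails for NaN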

How to create a column in pandas dataframe using conditions defined in dict

Here's my code:
import pandas as pd
import numpy as np
input = {'name': ['Andy', 'Alex', 'Amy', 'Olivia'],
         'rating': ['A', 'A', 'B', 'B'],
         'score': [100, 60, 70, 95]}
df = pd.DataFrame(input)
df['valid1']=np.where((df['score']==100) & (df['rating']=='A'),'true','false')
The code above works fine, setting the new column 'valid1' to 'true' where score is 100 and rating is 'A'.
If the condition comes from a dict variable as
c = {'score':'100', 'rating':'A'}
how can I use the condition defined in c to get the same 'valid' column value? I tried the following code
for key, value in c.iteritems():
    df['valid2'] = np.where((df[key] == value), 'true', 'false')
got an error:
TypeError: Invalid type comparison
I'd define c as a pd.Series so that when you compare it to a dataframe, it automatically compares against each row while matching columns with series indices. Note that I made sure 100 was an integer and not a string.
c = pd.Series({'score':100, 'rating':'A'})
i = df.columns.intersection(c.index)
df.assign(valid1=df[i].eq(c).all(1))
name rating score valid1
0 Andy A 100 True
1 Alex A 60 False
2 Amy B 70 False
3 Olivia B 95 False
You can use the same series and still use numpy to speed things up:
c = pd.Series({'score':100, 'rating':'A'})
i = df.columns.intersection(c.index)
v = np.column_stack([df[col].values for col in i])
df.assign(valid1=(v == c.loc[i].values).all(1))
name rating score valid1
0 Andy A 100 True
1 Alex A 60 False
2 Amy B 70 False
3 Olivia B 95 False
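If you would rather keep c as a plain dict, one possible sketch (my own construction, not from the answers) ANDs together one comparison per key; as above, the values must already have the right types (100 as an int, not the string '100'):
c = {'score': 100, 'rating': 'A'}
mask = np.logical_and.reduce([df[k] == v for k, v in c.items()])
df.assign(valid2=mask)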

how to index np.nan in a DataFrame

I create a DataFrame df with some NaN in the column label. How can I get the index of the NaN?
I have tried df['label'] == np.nan, but it doesn't seem to work, while sum(df['Adj. Volume'] == 5090527.0) gives me the right answer. What happened? Why doesn't == np.nan work?
use isnull to test for NaN values:
df[df['label'].isnull()]
This will return all rows in your df where the label is NaN.
The equality operator doesn't work with NaN, which is why == np.NaN doesn't work:
NaN has the property that np.NaN != np.NaN, which is counter-intuitive.
Example:
In [5]:
s = pd.Series([0,np.NaN, 3])
s
Out[5]:
0 0.0
1 NaN
2 3.0
dtype: float64
In [6]:
s == np.NaN
Out[6]:
0 False
1 False
2 False
dtype: bool
In [7]:
s != s
Out[7]:
0 False
1 True
2 False
dtype: bool
You can see in the last example that if we test whether s != s, it returns True for the NaN entry.
Using isnull also gives the same result:
In [8]:
s.isnull()
Out[8]:
0 False
1 True
2 False
dtype: bool
You can then access the index attribute of the above to get just the index values:
In [10]:
s[s.isnull()].index
Out[10]:
Int64Index([1], dtype='int64')
I think you need boolean indexing with isnull, then return the index with .index:
print (df[df.label.isnull()].index)
Sample:
df = pd.DataFrame({'A': [1, 2, 3],
                   'label': [4, np.nan, np.nan],
                   'C': [7, 8, 9]})
print (df)
A C label
0 1 7 4.0
1 2 8 NaN
2 3 9 NaN
print (df[df.label.isnull()].index)
Int64Index([1, 2], dtype='int64')
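A slightly more direct variant of the same idea skips the intermediate row selection and indexes the index directly with the boolean mask:
print (df.index[df.label.isnull()])
Int64Index([1, 2], dtype='int64')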

Pandas list comprehension in a dataframe

I would like to pull out the price at the next day's open currently stored in (row + 1) and store it in a new column, if some condition is met.
df['b']=''
df['shift']=''
df['shift']=df['open'].shift(-1)
df['b']=df[x for x in df['shift'] if df["MA10"]>df["MA100"]]
There are a few approaches. Using apply:
>>> df = pd.read_csv("bondstack.csv")
>>> df["shift"] = df["open"].shift(-1)
>>> df["b"] = df.apply(lambda row: row["shift"] if row["MA10"] > row["MA100"] else np.nan, axis=1)
which produces
>>> df[["MA10", "MA100", "shift", "b"]][:10]
MA10 MA100 shift b
0 16.915625 17.405625 16.734375 NaN
1 16.871875 17.358750 17.171875 NaN
2 16.893750 17.317187 17.359375 NaN
3 16.950000 17.279062 17.359375 NaN
4 17.137500 17.254062 18.640625 NaN
5 17.365625 17.229063 18.921875 18.921875
6 17.550000 17.200312 18.296875 18.296875
7 17.681250 17.177500 18.640625 18.640625
8 17.812500 17.159375 18.609375 18.609375
9 17.943750 17.142813 18.234375 18.234375
For a more vectorized approach, you could use
>>> df = pd.read_csv("bondstack.csv")
>>> df["b"] = np.nan
>>> df["b"][df["MA10"] > df["MA100"]] = df["open"].shift(-1)
(though this chained assignment can trigger a SettingWithCopyWarning on newer pandas versions)
or my preferred approach:
>>> df = pd.read_csv("bondstack.csv")
>>> df["b"] = df["open"].shift(-1).where(df["MA10"] > df["MA100"])
Modifying DSM's approach 3, stating True/False values in np.where explicitly:
#numpy.where(condition, x, y)
df["b"] = np.where(df["MA10"] > df["MA100"], df["open"].shift(-1), np.nan)
Using list comprehension explicitly:
#[xv if c else yv for (c,xv,yv) in zip(condition,x,y)] #np.where documentation
df['b'] = [xv if c else np.nan for (c, xv) in zip(df["MA10"] > df["MA100"], df["open"].shift(-1))]
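For completeness, Series.mask is the mirror image of .where, dropping values where the condition is True; assuming the MA columns contain no NaN, this is equivalent to the .where version above:
df['b'] = df["open"].shift(-1).mask(df["MA10"] <= df["MA100"])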