check if NaN float in dictionary (Python 2.7)

import numpy as np
import pandas as pd
import tia.bbg.datamgr as dm
mgr = dm.BbgDataManager()
bb_yearb4 = "2016-12-30"
bb_today = "2017-09-22"
indices = [list of indices]
sids_index = mgr[indices]
df_idx = sids_index.get_historical('PX_LAST', bb_yearb4, bb_today)
nan = np.nan
price_test = {}
for index in indices:
    price_test["{0}".format(index)] = df_idx.loc[bb_today][index]
The output shows multiple nan float values:
In [1]: price_test.values()
Out[1]: [nan, nan, nan, 47913.199999999997, nan, 1210.3299999999999, nan]
However, a membership test for nan returns False:
In [2]: nan in price_test.values()
Out[2]: False
What is the correct way to test this?

NaN is weird, because NaN != NaN. There's a good reason for that, but it still breaks membership tests with in, and everything else that assumes normal == behavior.
Check for NaN with NaN-specific checks, like numpy.isnan:
any(np.isnan(val) for val in price_test.values())
or, in a non-NumPy context (after import math),
any(math.isnan(val) for val in price_test.values())
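As a self-contained sketch (toy index names and values standing in for the Bloomberg data, which needs a terminal connection), both checks behave as described:

import math
import numpy as np

price_test = {'IDX1': float('nan'), 'IDX2': 47913.2, 'IDX3': 1210.33}
# `in` tests identity first, then ==; these NaNs are distinct objects,
# and NaN != NaN, so the membership test fails
print(np.nan in price_test.values())                    # False
print(any(np.isnan(v) for v in price_test.values()))    # True
print(any(math.isnan(v) for v in price_test.values()))  # True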

Related

Getting same value for Precision and Recall (K-NN) using sklearn

Updated question:
I did this, but I am getting the same result for both precision and recall. Is it because I am using average='binary'?
When I use average='macro' instead, I get this deprecation warning:
Test a custom review message
C:\Python27\lib\site-packages\sklearn\metrics\classification.py:976:
DeprecationWarning: From version 0.18, binary input will not be handled specially when using averaged precision/recall/F-score. Please use average='binary' to report only the positive class performance.
  'positive class performance.', DeprecationWarning)
Here is my updated code:
import pandas as pd
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import precision_score, recall_score

path = 'opinions.tsv'
data = pd.read_table(path, header=None, skiprows=1, names=['Sentiment', 'Review'])
X = data.Review
y = data.Sentiment
# Using CountVectorizer to convert text into tokens/features
vect = CountVectorizer(stop_words='english', ngram_range=(1, 1), max_df=.80, min_df=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)
# Using training data to transform text into counts of features for each message
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)
# Accuracy using KNN model
KNN = KNeighborsClassifier(n_neighbors=3)
KNN.fit(X_train_dtm, y_train)
y_pred = KNN.predict(X_test_dtm)
print('\nK Nearest Neighbors (NN = 3)')
# Feature names learned by the vectorizer
tokens_words = vect.get_feature_names()
print '\nAnalysis'
print 'Accuracy Score: %f%%' % (metrics.accuracy_score(y_test, y_pred) * 100)
print 'Precision Score: %f%%' % (precision_score(y_test, y_pred, average='binary') * 100)
print 'Recall Score: %f%%' % (recall_score(y_test, y_pred, average='binary') * 100)
With the code above I get the same value for precision and recall.
Thank you for answering my question, much appreciated.
To calculate precision and recall metrics, you should import the corresponding methods from sklearn.metrics.
As stated in the documentation, their parameters are 1-d arrays of true and predicted labels:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
print('Calculating the metrics...')
precision_score(y_true, y_pred, average='macro')
>>> 0.22
recall_score(y_true, y_pred, average='macro')
>>> 0.33
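For the binary case in the question, a small sketch with made-up labels (not the reviews data) shows that average='binary' scores only the positive class, and that precision and recall differ as soon as the false-positive and false-negative counts differ:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
# for the positive class: TP = 2, FP = 1, FN = 2
print(precision_score(y_true, y_pred, average='binary'))  # 2/(2+1) = 0.666...
print(recall_score(y_true, y_pred, average='binary'))     # 2/(2+2) = 0.5

Equal precision and recall therefore usually means the classifier happened to produce as many false positives as false negatives, not that the code is wrong.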

how to index np.nan in a DataFrame

I created a DataFrame df with some NaN values in the column label. How can I get the index of the NaN rows?
I have tried df['label'] == np.nan, but it doesn't seem to work, while sum(df['Adj. Volume'] == 5090527.0) gives the right answer. What happened? Why doesn't == np.nan work?
Use isnull to test for NaN values:
df[df['label'].isnull()]
This will return all rows in your df where the label is NaN.
The equality operator doesn't work with NaN, which is why == np.NaN fails: NaN has the counter-intuitive property that np.NaN != np.NaN.
Example:
In [5]:
s = pd.Series([0, np.NaN, 3])
s
Out[5]:
0    0.0
1    NaN
2    3.0
dtype: float64
In [6]:
s == np.NaN
Out[6]:
0    False
1    False
2    False
dtype: bool
In [7]:
s != s
Out[7]:
0    False
1     True
2    False
dtype: bool
You can see in the last example that testing s != s returns True for the NaN entry.
Using isnull gives the same result:
In [8]:
s.isnull()
Out[8]:
0    False
1     True
2    False
dtype: bool
You can then access the index attribute of the above to get just the index values:
In [10]:
s[s.isnull()].index
Out[10]:
Int64Index([1], dtype='int64')
I think you need boolean indexing with isnull, and then to take the .index of the result:
print (df[df.label.isnull()].index)
Sample:
df = pd.DataFrame({'A': [1, 2, 3],
                   'label': [4, np.nan, np.nan],
                   'C': [7, 8, 9]})
print (df)
   A  C  label
0  1  7    4.0
1  2  8    NaN
2  3  9    NaN
print (df[df.label.isnull()].index)
Int64Index([1, 2], dtype='int64')
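A further variant (my addition, not from the answers above): index df.index directly with the boolean mask, which skips materialising the intermediate filtered frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'label': [4, np.nan, np.nan]})
# boolean mask applied to the index itself
print(df.index[df['label'].isnull()])  # Int64Index([1, 2], dtype='int64')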

SciPy curve_fit not working when one of the parameters to fit is a power

I'm trying to fit my data to a user-defined function using SciPy curve_fit, which works when fitting to a function with a fixed power (func1), but fails when the function contains the power as a parameter to fit (func2).
curve_fit still does not work if I provide an initial guess for the parameters using the keyword p0. I cannot use the bounds keyword, as the version of SciPy I have does not support it.
This script illustrates the point:
import sys
import numpy as np
import scipy
from scipy.optimize import curve_fit
print 'scipy version: ', scipy.__version__
print 'np.version: ', np.__version__
print sys.version_info

def func1(x, a):
    return (x - a)**3.0

def func2(x, a, b):
    return (x - a)**b

x_train = np.linspace(0, 12, 50)
y = func2(x_train, 0.5, 3.0)
y_train = y + np.random.normal(size=len(x_train))
print 'dtype of x_train: ', x_train.dtype
print 'dtype of y_train: ', y_train.dtype
popt1, pcov1 = curve_fit(func1, x_train, y_train, p0=[0.6])
popt2, pcov2 = curve_fit(func2, x_train, y_train, p0=[0.6, 4.0])
print 'Function 1: ', popt1, pcov1
print 'Function 2: ', popt2, pcov2
Which outputs the following:
scipy version: 0.14.0
np.version: 1.8.2
sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)
dtype of x_train: float64
dtype of y_train: float64
stack_overflow.py:14: RuntimeWarning: invalid value encountered in power
return (x-a)**b
Function 1: [ 0.50138759] [[ 3.90044196e-07]]
Function 2: [ nan nan] [[ inf inf]
[ inf inf]]
(As @xnx first commented,) the problem with the second formulation, where the exponent b is unknown and considered real-valued, is that in the process of testing potential values for a and b, quantities of the form z**p must be evaluated, where z is a negative real number and p is a non-integer. This quantity is complex in general, hence the procedure fails. For example, for x = 0 and test values a = 0.5, b = 4.1, we get (x - a)**b = (-0.5)**4.1 ≈ 0.0555 + 0.018j.
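One hedged workaround (my sketch, not from the original thread): keep the base of the power non-negative so the expression stays real for any trial value of b, carrying the sign of (x - a) separately. Note that func2_safe is only equivalent to the original model for odd symmetry; it is one way to keep the optimizer in real arithmetic, not a universal fix.

import numpy as np
from scipy.optimize import curve_fit

def func2_safe(x, a, b):
    # |x - a|**b is real for any real b; reattach the sign afterwards
    return np.sign(x - a) * np.abs(x - a)**b

x_train = np.linspace(0, 12, 50)
y_train = (x_train - 0.5)**3.0 + np.random.normal(size=len(x_train))
popt, pcov = curve_fit(func2_safe, x_train, y_train, p0=[0.6, 4.0])
print popt  # should land near [0.5, 3.0]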

pandas checking for nan not working using .isin()

I have the following pandas Dataframe with a NaN in it.
import pandas as pd
df = pd.DataFrame([1,2,3,float('nan')], columns=['A'])
df
     A
0    1
1    2
2    3
3  NaN
I also have a list, filter_list, with which I want to filter my DataFrame. But if I use the .isin() function, it does not detect the NaN: instead of True I get False in the last row.
filter_list = [1, float('nan')]
df['A'].isin(filter_list)
0     True
1    False
2    False
3    False
Name: A, dtype: bool
Expected output:
0     True
1    False
2    False
3     True
Name: A, dtype: bool
I know that I can use .isnull() to check for NaNs, but here I have other values to check as well. I am using pandas version 0.16.0.
Edit: The list filter_list comes from the user, so it might or might not contain NaN. That's why I am using .isin().
The float NaN has the interesting property that it is not equal to itself:
In [194]: float('nan') == float('nan')
Out[194]: False
isin checks for equality, so you can't use isin to check whether a value equals NaN.
To check for NaNs it is best to use isnull (pd.isnull, or the .isnull() method):
In [200]: df['A'].isin([1]) | df['A'].isnull()
Out[200]:
0     True
1    False
2    False
3     True
Name: A, dtype: bool
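If the filter list only sometimes contains NaN (as in the edit above), one way to generalise this is a small helper (hypothetical, my construction) that strips NaNs out of the list and folds them into an isnull check:

import pandas as pd

def isin_with_nan(series, values):
    # NaN != NaN, so v == v is False exactly for the NaN entries
    non_nan = [v for v in values if v == v]
    mask = series.isin(non_nan)
    if len(non_nan) < len(values):  # the list contained at least one NaN
        mask |= series.isnull()
    return mask

df = pd.DataFrame([1, 2, 3, float('nan')], columns=['A'])
print(isin_with_nan(df['A'], [1, float('nan')]))  # True, False, False, True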
You could replace nan with a unique non-NaN value that will not occur in your list, say 'NA' or ''. For example:
In [23]: import pandas as pd
In [24]: df = pd.DataFrame([1, 2, 3, pd.np.nan], columns=['A'])
In [25]: filter_list = pd.Series([1, pd.np.nan])
In [26]: na_equiv = 'NA'
In [27]: df['A'].replace(pd.np.nan, na_equiv).isin(filter_list.replace(pd.np.nan, na_equiv))
Out[27]:
0     True
1    False
2    False
3     True
Name: A, dtype: bool
I think that the simplest way is to use numpy.nan:
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 3, np.nan], columns=['A'])
filter_list = [1, np.nan]
df['A'].isin(filter_list)
If you really want to use isin() to match NaN, you can create a class that has the same hash as nan and compares equal to it:
import numpy as np
import pandas as pd
class NAN(object):
    def __eq__(self, v):
        return np.isnan(v)
    def __hash__(self):
        return hash(np.nan)

nan = NAN()
df = pd.DataFrame([1, 2, 3, float('nan')], columns=['A'])
df.A.isin([1, nan])

Pandas list comprehension in a dataframe

I would like to pull out the price at the next day's open, currently stored in (row + 1), and store it in a new column if some condition is met. This was my attempt:
df['b']=''
df['shift']=''
df['shift']=df['open'].shift(-1)
df['b']=df[x for x in df['shift'] if df["MA10"]>df["MA100"]]
There are a few approaches. Using apply:
>>> df = pd.read_csv("bondstack.csv")
>>> df["shift"] = df["open"].shift(-1)
>>> df["b"] = df.apply(lambda row: row["shift"] if row["MA10"] > row["MA100"] else np.nan, axis=1)
which produces
>>> df[["MA10", "MA100", "shift", "b"]][:10]
MA10 MA100 shift b
0 16.915625 17.405625 16.734375 NaN
1 16.871875 17.358750 17.171875 NaN
2 16.893750 17.317187 17.359375 NaN
3 16.950000 17.279062 17.359375 NaN
4 17.137500 17.254062 18.640625 NaN
5 17.365625 17.229063 18.921875 18.921875
6 17.550000 17.200312 18.296875 18.296875
7 17.681250 17.177500 18.640625 18.640625
8 17.812500 17.159375 18.609375 18.609375
9 17.943750 17.142813 18.234375 18.234375
For a more vectorized approach, you could use
>>> df = pd.read_csv("bondstack.csv")
>>> df["b"] = np.nan
>>> df["b"][df["MA10"] > df["MA100"]] = df["open"].shift(-1)
or my preferred approach:
>>> df = pd.read_csv("bondstack.csv")
>>> df["b"] = df["open"].shift(-1).where(df["MA10"] > df["MA100"])
Modifying DSM's third approach by stating the True/False values in np.where explicitly:
#numpy.where(condition, x, y)
df["b"] = np.where(df["MA10"] > df["MA100"], df["open"].shift(-1), np.nan)
Using a list comprehension explicitly:
#[xv if c else yv for (c, xv, yv) in zip(condition, x, y)]  # from the np.where documentation
df['b'] = [xv if c else np.nan for (c, xv) in zip(df["MA10"] > df["MA100"], df["open"].shift(-1))]
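A variant of the vectorized assignment above (my addition): writing through .loc avoids pandas' chained-assignment pitfall while keeping the same logic. The toy frame here is illustrative; it assumes 'open', 'MA10' and 'MA100' columns as in the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'open': [10.0, 11.0, 12.0, 13.0],
                   'MA10': [1.0, 2.0, 3.0, 4.0],
                   'MA100': [2.0, 2.0, 2.0, 2.0]})
df['b'] = np.nan
# conditional write in one step; the shifted series aligns on the index
df.loc[df['MA10'] > df['MA100'], 'b'] = df['open'].shift(-1)
print(df)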