ValueError: invalid literal for float(): when adding annotation in pandas - python-2.7

I get this error when I try to add an annotation to my plot - ValueError: invalid literal for float(): 10_May.
My code (I use to_datetime and strftime before plotting, as I needed to sort dates that were stored as strings):
# dealing with dates as strings
grouped.index = pd.to_datetime(grouped.index, format='%d_%b')
grouped = grouped.sort_index()
grouped.index = grouped.index.strftime('%d_%b')

plt.annotate('Peak',
             (grouped.index[9], grouped['L'][9]),
             xytext=(15, 15),
             textcoords='offset points',
             arrowprops=dict(arrowstyle='-|>'))
grouped.plot()
grouped.index[9] returns u'10_May', while grouped['L'][9] returns 10.0.
I know that pandas expects the index to be a float, but I thought I could access it with df.index[]. I will appreciate your suggestions.

For me, what works is to plot first and then get the index position with Index.get_loc:
ax = df.plot()
ax.annotate('Peak',
            (df.index.get_loc(df.index[9]), df['L'][9]),
            xytext=(15, 15),
            textcoords='offset points',
            arrowprops=dict(arrowstyle='-|>'))
Sample:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(10)
df = pd.DataFrame({'L': [3, 5, 0, 1]}, index=['4_May', '3_May', '1_May', '2_May'])
#print (df)

df.index = pd.to_datetime(df.index, format='%d_%b')
df = df.sort_index()
df.index = df.index.strftime('%d_%b')

df.plot()
plt.annotate('Peak',
             (df.index.get_loc(df.index[2]), df['L'][2]),
             xytext=(15, 15),
             textcoords='offset points',
             arrowprops=dict(arrowstyle='-|>'))
EDIT:
More general solution with get_loc + idxmax + max:
ax = df.plot()
ax.annotate('Peak',
            (df.index.get_loc(df['L'].idxmax()), df['L'].max()),
            xytext=(15, 15),
            textcoords='offset points',
            arrowprops=dict(arrowstyle='-|>'))
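A note on why get_loc is needed here: with a string index, df.plot() places the points at integer positions 0 through n-1 and only uses the strings as tick labels, so annotate's x-coordinate has to be that numeric position. Passing the raw label ('10_May') is what triggers the float conversion error; get_loc simply maps the label back to its position.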

Related

Replace Null values with median in pyspark

How can I replace the null values in the columns Age and Height of the data set df below?
df = spark.createDataFrame([(1, 'John', 1.79, 28, 'M', 'Doctor'),
                            (2, 'Steve', 1.78, 45, 'M', None),
                            (3, 'Emma', 1.75, None, None, None),
                            (4, 'Ashley', 1.6, 33, 'F', 'Analyst'),
                            (5, 'Olivia', 1.8, 54, 'F', 'Teacher'),
                            (6, 'Hannah', 1.82, None, 'F', None),
                            (7, 'William', None, 42, 'M', 'Engineer'),
                            (None, None, None, None, None, None),
                            (8, 'Ethan', 1.55, 38, 'M', 'Doctor'),
                            (9, 'Hannah', 1.65, None, 'F', 'Doctor'),
                            (10, 'Xavier', 1.64, 43, None, 'Doctor')],
                           ['Id', 'Name', 'Height', 'Age', 'Gender', 'Profession'])
In the post Replace missing values with mean - Spark Dataframe, I used the function given there:
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns])
imputer.fit(df).transform(df)
It throws me an error:
IllegalArgumentException: 'requirement failed: Column Id must be of type equal to one of the following types: [DoubleType, FloatType] but was actually of type LongType.'
So please help.
Thank you.
It's likely an initial casting error (I had some strings that I needed to be floats). To convert all columns to floats, do:
from pyspark.sql.functions import col
df = df.select(*(col(c).cast("float").alias(c) for c in df.columns))
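Note that this cast turns the string columns (Name, Gender, Profession) into nulls, since strings like 'John' can't be parsed as floats - which is one reason the categoricals are handled separately further down.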
Then you should be fine to impute. Note: I set my strategy to median rather than mean.
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
).setStrategy("median")

# Add imputation cols to df
df = imputer.fit(df).transform(df)
I'd be interested in a more elegant solution, but I imputed the categoricals separately from the numerics. To impute the categoricals, I got the most common value and filled the blanks with it using the when and otherwise functions:
import pyspark.sql.functions as F

for col_name in ['Name', 'Gender', 'Profession']:
    common = df.dropna().groupBy(col_name).agg(F.count("*")) \
               .orderBy('count(1)', ascending=False).first()[col_name]
    df = df.withColumn(col_name, F.when(F.isnull(col_name), common).otherwise(df[col_name]))
To impute the numerics, simply casting the Age and Id columns as doubles circumvents the type issue for the numeric fields; then restrict the imputer to the numerical columns:
from pyspark.ml.feature import Imputer

df = df.withColumn("Age", df['Age'].cast('double')).withColumn('Id', df['Id'].cast('double'))
imputer = Imputer(
    inputCols=['Id', 'Height', 'Age'],
    outputCols=['Id', 'Height', 'Age'])
imputer.fit(df).transform(df)
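If you'd rather avoid the ML Imputer altogether, here is a minimal alternative sketch: compute the medians with DataFrame.approxQuantile (a relativeError of 0.0 asks for the exact median, at the cost of a full pass over the data) and fill with fillna:
# Imputer-free sketch: approxQuantile ignores nulls, and fillna accepts
# a {column: value} dict. Assumes Age was already cast to double above.
medians = {c: df.approxQuantile(c, [0.5], 0.0)[0] for c in ['Age', 'Height']}
df = df.fillna(medians)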

Exporting Dictionary to CSV

One of the Stack Overflow buddies was kind enough to give me the code below for creating a dictionary of DataFrames. This works well. But now I want to export the DataFrames in the dictionary into a single CSV file. Can someone please help me with this?
import pandas as pd

DF1 = pd.DataFrame({"A": [3], "B": [2], "C": [100]})
DF_list = {}
for i in ["A", "B"]:
    DF = pd.DataFrame({})
    DF[i] = DF1[[i]]
    DF["C"] = DF1[["C"]]
    DF["value"] = DF[i] * DF["C"]
    DF_list["DF_" + i] = DF
print(DF_list)
{'DF_A':    A    C  value
0  3  100    300, 'DF_B':    B    C  value
0  2  100    200}
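The thread as captured here has no answer, so here is a minimal sketch of one way to do it: pd.concat accepts the dictionary directly and uses its keys to label which frame each row came from, and a single to_csv call then writes everything to one file (the filename combined.csv is just a placeholder). Since DF_A and DF_B have different columns (A vs. B), the missing cells come out as NaN.
# Stack the per-key DataFrames into one frame; the dict keys become
# the first level of the resulting row MultiIndex.
combined = pd.concat(DF_list)
combined.to_csv('combined.csv')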

Getting same value for Precision and Recall (K-NN) using sklearn

Updated question:
I did this, but I am getting the same result for both precision and recall. Is it because I am using average='binary'?
When I use average='macro' I get this warning instead:
Test a custom review message
C:\Python27\lib\site-packages\sklearn\metrics\classification.py:976:
DeprecationWarning: From version 0.18, binary input will not be
handled specially when using averaged precision/recall/F-score. Please
use average='binary' to report only the positive class performance.
  'positive class performance.', DeprecationWarning)
Here is my updated code:
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.neighbors import KNeighborsClassifier
# on scikit-learn < 0.18 this import lives in sklearn.cross_validation instead
from sklearn.model_selection import train_test_split

path = 'opinions.tsv'
data = pd.read_table(path, header=None, skiprows=1, names=['Sentiment', 'Review'])
X = data.Review
y = data.Sentiment

# Using CountVectorizer to convert text into tokens/features
vect = CountVectorizer(stop_words='english', ngram_range=(1, 1), max_df=.80, min_df=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)

# Using training data to transform text into counts of features for each message
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)

# Accuracy using KNN model
KNN = KNeighborsClassifier(n_neighbors=3)
KNN.fit(X_train_dtm, y_train)
y_pred = KNN.predict(X_test_dtm)
print('\nK Nearest Neighbors (NN = 3)')

tokens_words = vect.get_feature_names()
print '\nAnalysis'
print 'Accuracy Score: %f %%' % (metrics.accuracy_score(y_test, y_pred) * 100)
print "Precision Score: %f%%" % precision_score(y_test, y_pred, average='binary')
print "Recall Score: %f%%" % recall_score(y_test, y_pred, average='binary')
With the code above, I get the same value for precision and recall.
Thank you for answering my question, much appreciated.
To calculate precision and recall metrics, you should import the corresponding functions from sklearn.metrics.
As stated in the documentation, their parameters are 1-d arrays of true and predicted labels:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

print('Calculating the metrics...')
precision_score(y_true, y_pred, average='macro')
>>> 0.22
recall_score(y_true, y_pred, average='macro')
>>> 0.33
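If you want to see why the averaged numbers come out the way they do, average=None returns one score per label instead of a single average, and confusion_matrix shows where the errors fall. A sketch using the toy labels above:
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# One precision/recall value per class instead of a single average
print(precision_score(y_true, y_pred, average=None))
print(recall_score(y_true, y_pred, average=None))

# Rows are true labels, columns are predicted labels
print(confusion_matrix(y_true, y_pred))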

Reading Data from CSV and fill Empty Values Python

I am reading in a CSV file with the general schema of
,abv,ibu,id,name,style,brewery_id,ounces
14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0
0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0
I am running into problems where fields are missing, such as in object 0, which lacks an IBU value. I would like to be able to insert a value such as 0.0 that would work as a float for values that require floats, and an empty string for ones that require strings.
My code is along the lines of
import csv
import numpy as np

def dataset(path, filter_field, filter_value):
    with open(path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        if filter_field:
            for row in filter(lambda row: row[filter_field] == filter_value, reader):
                yield row

def main(path):
    data = [(row["ibu"], float(row["ibu"])) for row in dataset(path, "style", "American Pale Lager")]
As of right now my code would throw an error, since there are empty values in the "ibu" column for object 0.
How should one go about solving this problem?
You can do the following:
- add a default dictionary input that you can use for missing values,
- and also update it upon certain conditions, such as when ibu is empty.
This is your implementation, changed to do what you need. If I were you, though, I would use pandas...
import csv, copy

def dataset(path, filter_field, filter_value,
            default={'brewery_id': -1, 'style': 'unknown style', ' ': -1,
                     'name': 'unknown name', 'abv': 0.0, 'id': -1,
                     'ounces': -1, 'ibu': 0.0}):
    with open(path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            if row is None:
                break
            if row[filter_field].strip() != filter_value:
                continue
            default_row = copy.copy(default)
            default_row.update(row)
            # you might want to add conditions
            if default_row["ibu"] == "":
                default_row["ibu"] = default["ibu"]
            yield default_row

data = [(row["ibu"], float(row["ibu"])) for row in dataset('test.csv', "style", "American Pale Lager")]
print data
>> [(0.0, 0.0)]
Why don't you use
import pandas as pd
df = pd.read_csv(data_file)
The following is the result:
In [13]: df
Out[13]:
   Unnamed: 0    abv   ibu    id          name                    style  \
0          14  0.061  60.0  1979  Bitter Bitch  American Pale Ale (APA)
1           0  0.050   NaN  1436      Pub Beer      American Pale Lager

   brewery_id  ounces
0         177    12.0
1         408    12.0
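From there, getting the exact defaults the question asks for (0.0 for float columns, an empty string for string columns) is one more step; the dtype-based split below is a sketch that assumes the sample file's schema:
# Fill numeric NaNs with 0.0 and missing strings with ''
num_cols = df.select_dtypes(include=['number']).columns
str_cols = df.select_dtypes(include=['object']).columns
df[num_cols] = df[num_cols].fillna(0.0)
df[str_cols] = df[str_cols].fillna('')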
Simulating your file with a text string:
In [48]: txt = b""" ,abv,ibu,id,name,style,brewery_id,ounces
    ...: 14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0
    ...: 0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0
    ...: """
I can load it with numpy genfromtxt.
In [49]: data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype=None,
    ...:                      skip_header=1, filling_values=0)
In [50]: data
Out[50]:
array([ (14, 0.061, 60., 1979, b'Bitter Bitch', b'American Pale Ale (APA)', 177, 12.),
        ( 0, 0.05 ,  0., 1436, b' Pub Beer', b' American Pale Lager', 408, 12.)],
      dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<i4'), ('f4', 'S12'), ('f5', 'S23'), ('f6', '<i4'), ('f7', '<f8')])
I had to skip the header line because it is incomplete (a blank for the 1st field). The result is a structured array - a mix of ints, floats and strings (bytestrings in Py3).
After correcting the header line, and using names=True, I get
array([ (14, 0.061, 60., 1979, b'Bitter Bitch', b'American Pale Ale (APA)', 177, 12.),
        ( 0, 0.05 ,  0., 1436, b' Pub Beer', b' American Pale Lager', 408, 12.)],
      dtype=[('f0', '<i4'), ('abv', '<f8'), ('ibu', '<f8'), ('id', '<i4'), ('name', 'S12'), ('style', 'S23'), ('brewery_id', '<i4'), ('ounces', '<f8')])
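For reference, a sketch of that corrected call - assuming the fix was just removing the stray leading space in the header so names=True can read the field names (the unnamed first field still falls back to 'f0'):
txt = b""",abv,ibu,id,name,style,brewery_id,ounces
14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0
0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0
"""
data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype=None,
                     names=True, filling_values=0)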
genfromtxt is the most powerful csv reader in numpy. See its docs for more parameters. The pandas reader is faster and more flexible - but of course produces a data frame, not an array.

only apply date format to date columns with xlwt

I have results from a database search where some columns have dates, but not others. I need help to write
data = [['556644', 'Mr', 'So', 'And', 'So', Decimal('0.0000'), datetime.datetime(2012, 2, 25, 0, 0), '', False, datetime.datetime(2013, 6, 30, 0, 0)],...]
into an Excel spreadsheet so that
easyxf(num_format_str='DD/MM/YYYY')
is only applied to the datetime columns. I'm quite new to Python and I've been banging my head on this for quite a few days now. Thanks!
After simplifying your self-answer:
date_xf = easyxf(num_format_str='DD/MM/YYYY')  # sets date format in Excel

data = [list(n) for n in cursor.fetchall()]
for row_index, row_contents in enumerate(data):
    for column_index, cell_value in enumerate(row_contents):
        if isinstance(cell_value, datetime.date):
            sheet1.write(row_index + 1, column_index, cell_value, date_xf)
        else:
            sheet1.write(row_index + 1, column_index, cell_value)
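Note that datetime.datetime is a subclass of datetime.date, so the isinstance check above catches both plain dates and full timestamps.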
The data is coming from SQL Server via pyodbc, which is where cursor.fetchall() comes from:
data = [list(n) for n in cursor.fetchall()]
for row_index, row_contents in enumerate(data):
    for column_index, cell_value in enumerate(row_contents):
        xf = None
        if isinstance(data[row_index][column_index], datetime.date):
            xf = easyxf(num_format_str='DD/MM/YYYY')  # sets date format in Excel
        if xf:
            sheet1.write(row_index + 1, column_index, cell_value, xf)
        else:
            sheet1.write(row_index + 1, column_index, cell_value)
Done! :)
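One practical difference between the two versions: the simplified one builds the easyxf style once, outside the loop. As far as I know, xlwt (without style compression enabled) emits a record for each distinct style object, and the .xls format caps how many styles a file may contain, so reusing a single style object is both faster and safer on large result sets.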
To write a date to a cell in Excel, add an extra argument to the write call - the style for the cell:
import datetime
import xlwt
workbook = xlwt.Workbook()
worksheet = workbook.add_sheet('test')
date_style = xlwt.easyxf(num_format_str='YYYY/MM/DD')
worksheet.write(0, 0, 'Today')
worksheet.write(1, 0, datetime.date.today(), date_style)
workbook.save('C:\\data.xls')