Replace null values with median in PySpark

How can I replace null values with the median in the columns Age and Height of the data set df below?
df = spark.createDataFrame(
    [(1, 'John', 1.79, 28, 'M', 'Doctor'),
     (2, 'Steve', 1.78, 45, 'M', None),
     (3, 'Emma', 1.75, None, None, None),
     (4, 'Ashley', 1.6, 33, 'F', 'Analyst'),
     (5, 'Olivia', 1.8, 54, 'F', 'Teacher'),
     (6, 'Hannah', 1.82, None, 'F', None),
     (7, 'William', None, 42, 'M', 'Engineer'),
     (None, None, None, None, None, None),
     (8, 'Ethan', 1.55, 38, 'M', 'Doctor'),
     (9, 'Hannah', 1.65, None, 'F', 'Doctor'),
     (10, 'Xavier', 1.64, 43, None, 'Doctor')],
    ['Id', 'Name', 'Height', 'Age', 'Gender', 'Profession'])
In the post Replace missing values with mean - Spark Dataframe I used the approach given there:
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns])
imputer.fit(df).transform(df)
It throws an error:
IllegalArgumentException: 'requirement failed: Column Id must be of type equal to one of the following types: [DoubleType, FloatType] but was actually of type LongType.'
So please help.
Thank you

It's likely an initial casting error (I had some strings I needed to be floats). To convert all cols to floats do:
from pyspark.sql.functions import col
df = df.select(*(col(c).cast("float").alias(c) for c in df.columns))
Then you should be fine to impute. Note: I set my strategy to median rather than mean.
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
).setStrategy("median")

# Add imputation cols to df
df = imputer.fit(df).transform(df)
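One caveat with casting every column to float: the string columns (Name, Gender, Profession in the question's DataFrame) become null after the cast. A minimal sketch, assuming you only want the numeric columns imputed, is to cast and impute just those:

from pyspark.sql.functions import col
from pyspark.ml.feature import Imputer

# Cast only the numeric columns; leave the string columns untouched.
numeric_cols = ['Id', 'Height', 'Age']
df = df.select(*[col(c).cast("double").alias(c) if c in numeric_cols else col(c)
                 for c in df.columns])

imputer = Imputer(
    inputCols=numeric_cols,
    outputCols=["{}_imputed".format(c) for c in numeric_cols]
).setStrategy("median")

df = imputer.fit(df).transform(df)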

I'd be interested in a more elegant solution, but I imputed the categoricals separately from the numerics. To impute the categoricals, I got the most common value and filled the blanks with it using the when and otherwise functions:
import pyspark.sql.functions as F

for col_name in ['Name', 'Gender', 'Profession']:
    common = df.dropna().groupBy(col_name).agg(F.count("*")).orderBy('count(1)', ascending=False).first()[col_name]
    df = df.withColumn(col_name, F.when(F.isnull(col_name), common).otherwise(df[col_name]))
For the numerics, casting the Age and Id columns to doubles before running the imputer circumvents the type issue, and the imputer can then be restricted to the numerical columns:
from pyspark.ml.feature import Imputer

df = df.withColumn("Age", df['Age'].cast('double')).withColumn('Id', df['Id'].cast('double'))

imputer = Imputer(
    inputCols=['Id', 'Height', 'Age'],
    outputCols=['Id', 'Height', 'Age'])
imputer.fit(df).transform(df)
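A quick sanity check (a sketch; df_imputed is a hypothetical name for whatever the last transform above returned) is to count how many nulls are left per column:

import pyspark.sql.functions as F

# Count remaining nulls per column on the final DataFrame;
# the columns handled above should now show 0.
df_imputed.select([F.count(F.when(F.isnull(c), c)).alias(c)
                   for c in df_imputed.columns]).show()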

Related

Create a dictionary in a loop

I have 2 lists that I want to convert into a dict of keys and values. I managed to do so, but there are too many steps, so I would like to know if there's a simpler way of achieving this. Basically I would like to create the dict directly in the loop without the extra steps below. I just started working with Python and I don't quite understand all the data types it provides.
The jName form can be modified if needed.
jName = ["Nose", "Neck", "RShoulder", "RElbow", "RWrist", "LShoulder", "LElbow", "LWrist", "RHip",
         "RKnee", "RAnkle", "LHip", "LKnee", "LAnkle", "REye", "LEye", "REar", "LEar"]
def get_joints(subset, candidate):
    joints_per_skeleton = [[] for i in range(len(subset))]
    # for each detected skeleton
    for n in range(len(subset)):
        # for each joint
        for i in range(18):
            cidx = subset[n][i]
            if cidx != -1:
                y = candidate[cidx.astype(int), 0]
                x = candidate[cidx.astype(int), 1]
                joints_per_skeleton[n].append((y, x))
            else:
                joints_per_skeleton[n].append(None)
    return joints_per_skeleton

joints = get_joints(subset, candidate)
print joints
Here is the output of the joints list of lists:
[[None, (48.0, 52.0), (72.0, 50.0), None, None, (24.0, 55.0), (5.0, 105.0), None, (63.0, 159.0), (57.0, 221.0), (55.0, 281.0), (28.0, 154.0), (23.0, 219.0), (23.0, 285.0), None, (25.0, 17.0), (55.0, 18.0), (30.0, 21.0)]]
Here I defined a function to create the dictionary from the 2 lists
def create_dict(keys, values):
    return dict(zip(keys, values))

my_dict = create_dict(jName, joints[0])
Here is the result:
{'LAnkle': (23.0, 285.0),
'LEar': (30.0, 21.0),
'LElbow': (5.0, 105.0),
'LEye': (25.0, 17.0),
'LHip': (28.0, 154.0),
'LKnee': (23.0, 219.0),
'LShoulder': (24.0, 55.0),
'LWrist': None,
'Neck': (48.0, 52.0),
'Nose': None,
'RAnkle': (55.0, 281.0),
'REar': (55.0, 18.0),
'RElbow': None,
'REye': None,
'RHip': (63.0, 159.0),
'RKnee': (57.0, 221.0),
'RShoulder': (72.0, 50.0),
'RWrist': None}
I think defaultdict could help you. I made my own example to show that you could predefine the keys and then go through a double for loop and have the values of the dict be lists of potentially different sizes. Please let me know if this answers your question:
from collections import defaultdict
import random

joint_names = ['hip', 'knee', 'wrist']
num_skeletons = 10
d = defaultdict(list)
for skeleton in range(num_skeletons):
    for joint_name in joint_names:
        r1 = random.randint(0, 10)
        r2 = random.randint(0, 10)
        if r1 > 4:
            d[joint_name].append(r1 * r2)
print d
Output:
defaultdict(<type 'list'>, {'hip': [0, 5, 30, 36, 56], 'knee': [35, 50, 10], 'wrist': [27, 5, 15, 64, 30]})
As a note, I found it very difficult to read through your code since some variables were defined before the snippet you posted.
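To tie this back to your code, here is a minimal sketch (assuming subset, candidate and jName behave exactly as in your get_joints function) that builds the name-to-joint dictionary directly inside the loop, so the separate zip/create_dict step is no longer needed:

def get_joints_dict(subset, candidate):
    joints_per_skeleton = []
    # for each detected skeleton
    for n in range(len(subset)):
        joints = {}
        # for each of the 18 joints, keyed by its name in jName
        for i, name in enumerate(jName):
            cidx = subset[n][i]
            if cidx != -1:
                joints[name] = (candidate[int(cidx), 0], candidate[int(cidx), 1])
            else:
                joints[name] = None
        joints_per_skeleton.append(joints)
    return joints_per_skeleton

joints = get_joints_dict(subset, candidate)
print joints[0]  # same mapping as my_dict above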

Groupby on one column of pandas dataframe, and train feature and target (X, y) of each group with a common sklearn pipeline using GridsearchCv

I have a pandas dataframe with the following structure:
pd.DataFrame({"user_id": ['user_id1', 'user_id1', 'user_id1', 'user_id2', 'user_id2'],
              'meeting': ['text1', 'text2', 'text3', 'text4', 'text5'],
              'label': ['a,b', 'a', 'a,c', 'x', 'x,y']})
There are a total of 12 user_ids. I have a pipeline as follows:
knn_tfidf = Pipeline([('tf_idf', TfidfVectorizer(stop_words='english')),
                      ('model', OneVsRestClassifier(KNeighborsClassifier()))])
a parameter grid as follows:
param_grid_1 = {'tf_idf__max_df': (0.25, 0.5, 0.75),
                'tf_idf__ngram_range': [(1, 1), (1, 2), (2, 2), (1, 3)],
                'model__estimator__n_neighbors': np.arange(1, 30)}
And finally GridSearchCV:
Grid_Search_tune = GridSearchCV(knn_tfidf, param_grid_1, cv=2)
I need to create a model for each user with the corresponding X and y values. For one user, I can do the following:
t = df[df.user_id == 'user_id1']
Extract X and y from t, pass y to a MultiLabelBinarizer(), and then, after instantiating the pipeline, param_grid and GridSearchCV, I can do:
Grid_Search_tune.fit(X, y)
Doing this 12 times, once per user, is repetitive, so I looped through the grouped pandas DataFrame. Here is what I have done:
g = df.groupby('user_id')
for names, groups in g:
    X = groups.meeting_subject.as_matrix()
    labels = [x.split(', ') for x in groups.priority_label.tolist()]
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(labels)
    knn_tfidf = Pipeline([('tf_idf', TfidfVectorizer(stop_words='english')),
                          ('model', OneVsRestClassifier(KNeighborsClassifier()))])
    param_grid_1 = {'tf_idf__max_df': (0.25, 0.5, 0.75),
                    'tf_idf__ngram_range': [(1, 2), (2, 2), (1, 3)],
                    'model__estimator__n_neighbors': np.arange(1, 4)}
    Grid_Search_tune = GridSearchCV(knn_tfidf, param_grid_1, cv=2)
    all_estimators = Grid_Search_tune.fit(X, y)
    best_of_all_estimators = Grid_Search_tune.best_estimator_
    print(best_of_all_estimators)
This gives me an output like:
user_id1
Pipeline(memory=None,
steps=[('tf_idf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=0.25, max_features=None, min_df=1,
ngram_range=(2, 2), norm=u'l2', preprocessor=None, smooth_idf=T...tric_params=None, n_jobs=1, n_neighbors=1, p=2,
weights='uniform'),
n_jobs=1))])
user_id2
Pipeline(memory=None,
steps=[('tf_idf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=0.25, max_features=None, min_df=1,
ngram_range=(1, 2), norm=u'l2', preprocessor=None, smooth_idf=T...tric_params=None, n_jobs=1, n_neighbors=1, p=2,
weights='uniform'),
n_jobs=1))])
And so on till user_id12 and the corresponding pipeline. I don't know if this is the correct way of doing it, and from here on I am lost. If I do:
best_of_all_estimators.predict(['some_text_string'])
I get a prediction for all the 12 models. How do I key or index my models with the for loop variable 'names' so that when I do:
str(raw_input('Choose user_id from above list:'))
Say I choose user_id3 , and then
str(raw_input('Enter text string:'))
I enter 'some random string'. The model trained on the X and y belonging to user_id3 should be pulled up and a prediction made with that model only, not with all the models. A very similar question is linked here: training an ML model on selected parts of a data frame. I am a beginner and I'm really struggling! Please, please help! Thanks a ton in advance.
Apparently Pipeline doesn't support changing the number of samples, such as in groupby or other aggregation.
Here is a similar question and possible workaround.
sklearn: Have an estimator that filters samples
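For the "how do I key my models by user_id" part, one workable sketch is to store each fitted GridSearchCV in a dict keyed by the groupby name and look it up later. This is only an illustration: it uses the column names of the example frame (meeting, label), so adjust them to meeting_subject / priority_label for the real data, and the imports assume a recent scikit-learn.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

models = {}
binarizers = {}
for name, group in df.groupby('user_id'):
    X = group['meeting'].values
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform([labels.split(',') for labels in group['label']])
    knn_tfidf = Pipeline([('tf_idf', TfidfVectorizer(stop_words='english')),
                          ('model', OneVsRestClassifier(KNeighborsClassifier()))])
    param_grid_1 = {'tf_idf__max_df': (0.25, 0.5, 0.75),
                    'tf_idf__ngram_range': [(1, 2), (2, 2), (1, 3)],
                    'model__estimator__n_neighbors': np.arange(1, 4)}
    grid = GridSearchCV(knn_tfidf, param_grid_1, cv=2)
    models[name] = grid.fit(X, y)   # one fitted model per user_id
    binarizers[name] = mlb          # keep the binarizer to decode predicted labels

# Later: predict with the model of one chosen user only.
user = raw_input('Choose user_id from above list: ')
text = raw_input('Enter text string: ')
pred = models[user].predict([text])
print(binarizers[user].inverse_transform(pred))

Note that cv=2 may need adjusting for users with very few rows.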

pandas shape issues when applying function returning multiple new columns

I need to return multiple calculated columns for each row of a pandas dataframe.
This error: ValueError: Shape of passed values is (4, 2), indices imply (4, 3) is raised when the apply function is executed in the following code snippet:
import pandas as pd

my_df = pd.DataFrame({
    'datetime_stuff': ['2012-01-20', '2012-02-16', '2012-06-19', '2012-12-15'],
    'url': ['http://www.something', 'http://www.somethingelse', 'http://www.foo', 'http://www.bar'],
    'categories': [['foo', 'bar'], ['x', 'y', 'z'], ['xxx'], ['a123', 'a456']],
})
my_df['datetime_stuff'] = pd.to_datetime(my_df['datetime_stuff'])
my_df.sort_values(['datetime_stuff'], inplace=True)
print(my_df.head())

def calculate_stuff(row):
    if row['url'].startswith('http'):
        categories = row['categories'] if type(row['categories']) == list else []
        calculated_column_x = row['url'] + '_other_stuff_'
    else:
        calculated_column_x = None
    another_column = 'deduction_from_fields'
    return calculated_column_x, another_column

print(my_df.shape)
my_df['calculated_column_x'], my_df['another_column'] = zip(*my_df.apply(calculate_stuff, axis=1))
Each row of the dataframe I am working on is more complicated than the example above, and the function calculate_stuff I am applying is using many different columns for each row, then returning multiple new columns.
However, the example above still raises this ValueError about the shape of the dataframe, and I am not able to understand how to fix it.
How to create multiple new columns (for each row) that can be calculated starting from the existing columns?
When you return a list or tuple from a function that is being applied, pandas attempts to shoehorn it back into the dataframe you ran apply over. Instead, return a series.
Reconfigured Code
my_df = pd.DataFrame({
    'datetime_stuff': ['2012-01-20', '2012-02-16', '2012-06-19', '2012-12-15'],
    'url': ['http://www.something', 'http://www.somethingelse', 'http://www.foo', 'http://www.bar'],
    'categories': [['foo', 'bar'], ['x', 'y', 'z'], ['xxx'], ['a123', 'a456']],
})
my_df['datetime_stuff'] = pd.to_datetime(my_df['datetime_stuff'])
my_df.sort_values(['datetime_stuff'], inplace=True)

def calculate_stuff(row):
    if row['url'].startswith('http'):
        categories = row['categories'] if type(row['categories']) == list else []
        calculated_column_x = row['url'] + '_other_stuff_'
    else:
        calculated_column_x = None
    another_column = 'deduction_from_fields'
    # I changed this VVVV
    return pd.Series((calculated_column_x, another_column), ['calculated_column_x', 'another_column'])

my_df.join(my_df.apply(calculate_stuff, axis=1))
categories datetime_stuff url calculated_column_x another_column
0 [foo, bar] 2012-01-20 http://www.something http://www.something_other_stuff_ deduction_from_fields
1 [x, y, z] 2012-02-16 http://www.somethingelse http://www.somethingelse_other_stuff_ deduction_from_fields
2 [xxx] 2012-06-19 http://www.foo http://www.foo_other_stuff_ deduction_from_fields
3 [a123, a456] 2012-12-15 http://www.bar http://www.bar_other_stuff_ deduction_from_fields
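If you prefer to keep returning a plain tuple from calculate_stuff, a related sketch (pandas 0.23 or newer) is to let apply expand the tuple into columns with result_type='expand' and name the columns afterwards:

# calculate_stuff here is the original version that returns a tuple.
new_cols = my_df.apply(calculate_stuff, axis=1, result_type='expand')
new_cols.columns = ['calculated_column_x', 'another_column']
my_df = my_df.join(new_cols)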

ValueError: Shape of passed values is (6, 251), indices imply (6, 1)

I am getting an error and I'm not sure how to fix it.
Here is my code:
from matplotlib.finance import quotes_historical_yahoo_ochl
from datetime import date
from datetime import datetime
import pandas as pd

today = date.today()
start = (today.year - 1, today.month, today.day)
quotes = quotes_historical_yahoo_ochl('AXP', start, today)
fields = ['date', 'open', 'close', 'high', 'low', 'volume']
list1 = []
for i in range(len(quotes)):
    x = date.fromordinal(int(quotes[i][0]))
    y = datetime.strftime(x, '%Y-%m-%d')
list1.append(y)
quotesdf = pd.DataFrame(quotes, index=list1, columns=fields)
quotesdf = quotesdf.drop(['date'], axis=1)
print quotesdf
How can I change my code to achieve my goal: change the date format and delete the original date column?
In principle your code should work; you just need to indent it correctly, that is, append the value of y to list1 inside the for loop.
for i in range(len(quotes)):
    x = date.fromordinal(int(quotes[i][0]))
    y = datetime.strftime(x, '%Y-%m-%d')
    list1.append(y)
That way list1 will have as many entries as quotes instead of only one (the last one), and the final dataframe will not complain about mismatched shapes.
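The loop can also be replaced with a list comprehension; a compact sketch, assuming (as above) that quotes stores the ordinal date in its first field:

quotesdf = pd.DataFrame(quotes, columns=fields)
# Build the formatted-date index in one pass, then drop the original column.
quotesdf.index = [date.fromordinal(int(q[0])).strftime('%Y-%m-%d') for q in quotes]
quotesdf = quotesdf.drop(['date'], axis=1)
print quotesdf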

How do I create a dictionary mapping strings to sets given a list and a tuple of tuples?

I am trying to create a dictionary from a list and tuple of tuples as illustrated below. I have to reverse map the tuples to the list and create a set of non-None column names.
Any suggestions on a pythonic way of achieving the solution (the desired dictionary) are much appreciated.
MySQL table 'StateLog':
Name NY TX NJ
Amy 1 None 1
Kat None 1 1
Leo None None 1
Python code :
## Fetching data from MySQL table
# cursor.execute("select * from statelog")
# mydataset = cursor.fetchall()
## Fetching column names for mapping
# state_cols = [fieldname[0] for fieldname in cursor.description]
state_cols = ['Name', 'NY', 'TX', 'NJ']
mydataset = (('Amy', '1', None, '1'), ('Kat', None, '1', '1'), ('Leo', None, None, '1'))

temp = [zip(state_cols, each) for each in mydataset]

# Looks like I can't do a tuple comprehension for the following snippet:
# finallist = ((eachone[1], eachone[0]) for each in temp for eachone in each if eachone[1] if eachone[0] == 'Name')
for each in temp:
    for eachone in each:
        if eachone[1]:
            if eachone[0] == 'Name':
                k = eachone[1]
            print k, eachone[0]

print '''How do I get a dictionary in this format'''
print '''name_state = {"Amy": set(["NY", "NJ"]),
                       "Kat": set(["TX", "NJ"]),
                       "Leo": set(["NJ"])}'''
Output so far :
Amy Name
Amy NY
Amy NJ
Kat Name
Kat TX
Kat NJ
Leo Name
Leo NJ
Desired dictionary :
name_state = {"Amy": set(["NY", "NJ"]),
"Kat": set(["TX", "NJ"]),
"Leo": set(["NJ"])}
To be really honest, I would say your problem is that your code is becoming too cumbersome. Resist the temptation of "one-lining" it and create a function. Everything will become way easier!
mydataset = (
    ('Amy', '1', None, '1'),
    ('Kat', None, '1', '1'),
    ('Leo', None, None, '1')
)

def states(cols, data):
    """
    This function receives one of the tuples with data and returns a pair
    where the first element is the name from the tuple, and the second
    element is a set with all matched states. Well, at least *I* think
    it is more readable :)
    """
    name = data[0]
    states = set(state for state, value in zip(cols, data) if value == '1')
    return name, states

pairs = (states(state_cols, data) for data in mydataset)

# Since dicts can receive an iterator which yields pairs where the first one
# will become a key and the second one will become the value, I just pass
# a list with all pairs to the dict constructor.
print dict(pairs)
The result is:
{'Amy': set(['NY', 'NJ']), 'Leo': set(['NJ']), 'Kat': set(['NJ', 'TX'])}
Looks like another job for defaultdict!
So let's create our defaultdict:
name_state = collections.defaultdict(set)
We now have a dictionary whose default values are sets, so we can do something like this:
name_state['Amy'].add('NY')
Moving on, you just need to iterate over your data and add the right states to each name, as in the sketch below. Enjoy!
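A minimal sketch of that iteration over the question's data (state_cols and mydataset as defined above):

import collections

name_state = collections.defaultdict(set)
for row in mydataset:
    name = row[0]
    # pair each state column with its value and keep the non-None ones
    for state, value in zip(state_cols[1:], row[1:]):
        if value is not None:
            name_state[name].add(state)

print dict(name_state)
# {'Amy': set(['NY', 'NJ']), 'Kat': set(['TX', 'NJ']), 'Leo': set(['NJ'])}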
You can do this as a dictionary comprehension (Python 2.7+):
from itertools import compress
name_state = {data[0]: set(compress(state_cols[1:], data[1:])) for data in mydataset}
or as a generator expression:
name_state = dict((data[0], set(compress(state_cols[1:], data[1:]))) for data in mydataset)