Deleting rows if missing in some variable in Python Pandas - python-2.7

I am trying to use Pandas to remove rows that contain missing ethnicity information, though I haven't gotten very far, as I am new to Pandas.
Using 'print name[ethnic.isnull() == True]' I can see which people are missing ethnicity information. But ultimately I want to 1) record the indexes of the missing-ethnicity cases by appending them to the 'missing' array, and 2) create a second frame by deleting all rows whose indexes match those in the 'missing' array.
I am currently stuck in the 'for case in frame' loop, where I try to print the names of those with missing ethnicity. The program ends without error, but nothing is printed.
import pandas as pd
from pandas import DataFrame, Series
import numpy as np

### Remove cases with missing name or missing ethnicity information
def RemoveMissing():
    data = pd.read_csv("C:\...\sample.csv")
    frame = DataFrame(data)
    frame.columns = ["Name", "Ethnicity", "Event_Place", "Birth_Place", "URL"]
    missing = []
    name = frame.Name
    ethnic = frame.Ethnicity

    # Filter based on some variable criteria
    #print name[ethnic == "English"]
    #print name[ethnic.isnull() == True] # identify those who don't have ethnicity entry

    # This works
    for case in frame:
        print frame.Name

    # Doesn't work
    for case in frame:
        if frame.Ethnicity.isnull() is True:
            print frame.Name

RemoveMissing()

This seems to work:
# Create a var to check if Ethnicity is missing
index_missEthnic = frame.Ethnicity.isnull()
frame2 = frame[index_missEthnic != True]
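A more idiomatic alternative (a minimal sketch, assuming the same frame as above) is to let pandas drop the incomplete rows directly:

# dropna(subset=...) removes rows that have NaN in any of the listed columns
frame2 = frame.dropna(subset=["Ethnicity"])
# Equivalent boolean-mask form: ~ negates the null mask
frame2 = frame[~frame.Ethnicity.isnull()]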

Related

XGboost Google-AI-Model expecting float values instead of using Categorical values and converting them

I'm trying to run a simple XGBoost prediction on Google Cloud, based on this example: https://cloud.google.com/ml-engine/docs/scikit/getting-predictions-xgboost#get_online_predictions
The model builds fine, but when I try to run a prediction with a sample input JSON it fails with the error "Could not initialize DMatrix from inputs: could not convert string to float:". I understand this is happening because the test input has strings; I was hoping the Google machine learning model would have the information needed to convert the categorical values to floats. I cannot expect my users to submit online prediction requests with float values.
Based on the tutorial it should work without converting the categorical values to floats. Please advise. Thanks
import json
import numpy as np
import os
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# these are the column labels from the census data files
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)

# load training set
with open('./census_data/adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# remove column we are trying to predict ('income-level') from features list
train_features = raw_training_data.drop('income-level', axis=1)
# create training labels list
train_labels = (raw_training_data['income-level'] == ' >50K')

# load test set
with open('./census_data/adult.test', 'r') as test_data:
    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# remove column we are trying to predict ('income-level') from features list
test_features = raw_testing_data.drop('income-level', axis=1)
# create test labels list
test_labels = (raw_testing_data['income-level'] == ' >50K.')

# convert data in categorical columns to numerical values
encoders = {col: LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])
for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].fit_transform(test_features[col])

# load data into DMatrix objects
dtrain = xgb.DMatrix(train_features, train_labels)
dtest = xgb.DMatrix(test_features)

# train XGBoost model
bst = xgb.train({}, dtrain, 20)
bst.save_model('./model.bst')
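As a quick local sanity check before deploying (a minimal sketch, reusing the dtest and test_labels built above; with an empty parameter dict XGBoost falls back to its default regression objective, so the outputs are raw scores rather than probabilities):

# predict locally with the trained booster and eyeball a few scores
predictions = bst.predict(dtest)
print(predictions[:5])
# rough accuracy against the test labels, thresholding the scores at 0.5
print('accuracy: %.3f' % ((predictions > 0.5) == test_labels.values).mean())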
Here is a fix. Put the input shown in the Google documentation in a file input.json, then run the script below. The output is input_numerical.json, and prediction will succeed if you use that in place of input.json.
This code just preprocesses the categorical columns into numerical form, using the same procedure as was applied to the training and test data.
import json
import pandas as pd
from sklearn.preprocessing import LabelEncoder

COLUMNS = (
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income-level",
)

# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
)

with open("./input.json", "r") as json_lines:
    rows = [json.loads(line) for line in json_lines]

prediction_features = pd.DataFrame(rows, columns=(COLUMNS[:-1]))

encoders = {col: LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    prediction_features[col] = encoders[col].fit_transform(prediction_features[col])

with open("input_numerical.json", "w") as input_numerical:
    for index, row in prediction_features.iterrows():
        input_numerical.write(row.to_json(orient="values") + "\n")
I created this Google Issues Tracker ticket as the Google documentation is missing this important step.
You can use pandas to convert categorical strings into codes for model inputs. For prediction input you can define a dictionary for each category with corresponding category values and codes. For example, for workclass:
df['workclass_cat'] = df['workclass'].astype('category')
df['workclass_cat'] = df['workclass_cat'].cat.codes
workclass_dict = dict(zip(list(df['workclass'].values), list(df['workclass_cat'].values)))
If a prediction input is 'somestring' you can access its code as follows:
category_input = workclass_dict['somestring']
XGBoost models take floats as input. In your training script you converted the categorical variables into numbers. The same transformation needs to be done when submitting a prediction.
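One way to guarantee the same transformation at prediction time (a minimal sketch, assuming the fitted LabelEncoder objects from the training script above are available; the file name encoders.pkl is hypothetical) is to persist the encoders alongside the model:

import pickle

# after training: save the fitted encoders next to the model
with open('encoders.pkl', 'wb') as f:
    pickle.dump(encoders, f)

# at prediction time: load them and use transform(), not fit_transform(),
# so categories map to the same codes as in training; note that
# transform() raises on labels never seen during training
with open('encoders.pkl', 'rb') as f:
    encoders = pickle.load(f)
for col in CATEGORICAL_COLUMNS:
    prediction_features[col] = encoders[col].transform(prediction_features[col])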

Select row with regex instead of unique value

Hello everyone. I'm doing a really simple lookup in a pandas DataFrame; what I need is to match the input I'm typing as a regex instead of with == myvar.
What I have so far is very inefficient, because there are a lot of names in my DataFrame and an exact match only ever finds one of them. The data looks like this:
Name      LastName
NAME 1    Some Awesome
Name 2    Last Names
Nam e 3   I can keep going
Bane      Writing this is awesome
BANE 114  Lets continue
This is my current code:
import pandas as pd
contacts = pd.read_csv("contacts.csv")
print("regex contacts")
nameLookUp = input("Type the name you are looking for: ")
print(nameLookUp)
desiredRegexVar = contacts.loc[contacts['Name'] == nameLookUp]
print(desiredRegexVar)
I have to type 'NAME 1' or 'Nam e 3' exactly in order to get results, or I won't get any at all. I tried using this, but it didn't work:
#regexVar = "^" + contacts.filter(regex = nameLookUp)
Thanks for the answer, @Code Different. The code now looks like this:
import pandas as pd
import re
namelookup = input("Type the name you are looking for: ")
pattern = '^' + re.escape(namelookup)
match = contactos['Cliente'].str.contains(pattern, flags=re.IGNORECASE, na=False)
print(contactos[match])
Use Series.str.contains. Tweak the pattern as appropriate:
import re
pattern = '^' + re.escape(namelookup)
match = contacts['Name'].str.contains(pattern, flags=re.IGNORECASE)
contacts[match]
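For example, with the sample data from the question (a small sketch; na=False guards against missing names, as in the snippet above):

import re
import pandas as pd

contacts = pd.DataFrame({'Name': ['NAME 1', 'Name 2', 'Nam e 3', 'Bane', 'BANE 114']})
pattern = '^' + re.escape('bane')
match = contacts['Name'].str.contains(pattern, flags=re.IGNORECASE, na=False)
print(contacts[match])  # matches both 'Bane' and 'BANE 114'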

Compare value in a column in one row of csv file to value in next row using python

I have been reading up on csv.reader's next() but did not see a way to compare the values in a column from one row to the next. For instance, if my data looked like this in the Maps.csv file:
County1 C:/maps/map1.pdf
County1 C:/maps/map2.pdf
County2 C:/maps/map1.pdf
County2 C:/maps/map3.pdf
County3 C:/maps/map3.pdf
County4 C:/maps/map2.pdf
County4 C:/maps/map4.pdf
If line two's county equals line one's county, do something.
The following code compares whole rows; I want to compare only the county values between the current and previous rows.
import csv

f = open("Maps.csv", "r+")
ff = csv.reader(f)
pre_line = ff.next()
while True:
    try:
        cur_line = ff.next()
        if pre_line == cur_line:
            print "Matches"
        pre_line = cur_line
    except:
        break
I know I can grab the current value (see below) but do not know how to grab the previous value. Is this possible? If so, could someone please tell me how. I am on day three of trying to write my script to append PDF files from a CSV file and am about to toss my coffee cup at my monitor. I am breaking the problem into smaller parts and piloting it on simpler data; my real file is much larger. I was advised to focus on just one issue at a time when posting to this forum, and this is my latest one. It seems no matter what tack I take, I can't read the data the way I want. Arrrggghhhhh.
CurColor = row[color]
Using python 2.7
You already know how to look up the previous row. Why not get the column you need from that row?
import csv

f = open("Maps.csv", "r+")
ff = csv.reader(f)
pre_line = ff.next()
while True:
    try:
        cur_line = ff.next()
        if pre_line[0] == cur_line[0]:  # <-- compare first column
            print "Matches"
        pre_line = cur_line
    except:
        break
or more simply:
pre_line = ff.next()
for cur_line in ff:
    if pre_line[0] == cur_line[0]:  # <-- compare first column
        print "Matches"
    pre_line = cur_line
import csv

f = open("Maps.csv", "r+")
# Use a delimiter to split each line into separate elements.
# This example uses a comma; your csv may use a different delimiter,
# but make sure it is a single-character string.
# So no multiple spaces as in "County1 C:/maps/map1.pdf";
# it should be something like "County1,C:/maps/map1.pdf"
ff = csv.reader(f, delimiter=',')
COUNTY_INDEX = 0
# each call to ff.next() returns a list such as ['County1', 'C:/maps/map1.pdf'];
# since you want to compare the value in the first column, reference element 0.
# The line below sets pre_line = 'County1'
pre_line = ff.next()[COUNTY_INDEX]
while True:
    try:
        # the current line will be 'County1' or 'County2' etc.,
        # depending on which line is read
        cur_line = ff.next()[COUNTY_INDEX]
        if pre_line == cur_line:
            print "Matches"
        pre_line = cur_line
    except:
        break
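An alternative that avoids the explicit while/try loop (a minimal sketch, assuming the whole file fits in memory): read all the rows up front, then pair each row with the next one using zip:

import csv

with open("Maps.csv") as f:
    rows = list(csv.reader(f, delimiter=','))

# zip pairs row i with row i+1, so each iteration sees a previous/current pair
for pre_line, cur_line in zip(rows, rows[1:]):
    if pre_line[0] == cur_line[0]:
        print "Matches"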

How to remove unwanted items from a parse file

from googlefinance import getQuotes
import json
import time as t
import re

List = ["A", "AA", "AAB"]
Time = t.localtime()  # retrieve date/time info
Date2 = ('%d-%d-%d %dh:%dm:%dsec' % (Time[0], Time[1], Time[2], Time[3], Time[4], Time[5]))  # format time stamp
while True:
    for i in List:
        try:  # process each symbol; on an error, fall through to the except block
            Data = json.dumps(getQuotes(i.lower()), indent=1)  # retrieve data from Google Finance
            regex = ('"LastTradePrice": "(.+?)",')  # define the parse pattern
            pattern = re.compile(regex)  # compile the pattern
            price = re.findall(pattern, Data)  # extract the match
            print(i)
            print(price)
        except:  # error handling
            Error = (i + ' Failed to load on: ' + Date2)
            print(Error)
It will display the quote as: ['(number)'].
I would like it to display only the number, which means removing the brackets and quotes.
Any help would be great.
Changing:
print(price)
into:
print(price[0])
prints this:
A
42.14
AA
10.13
AAB
0.110
Try using the type() function to find out the data type, in your case type(price). If the data type is a list, use print(price[0]) and you will get just the number. As for the brackets, you need to check the Google data and your regex.
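Alternatively, since getQuotes() already returns Python objects, you can skip the regex entirely (a minimal sketch, assuming getQuotes returns a list of dicts keyed by 'LastTradePrice', as the JSON dump in the question suggests):

quotes = getQuotes(i.lower())        # list with one dict per symbol
price = quotes[0]['LastTradePrice']  # plain string, no brackets or quotes
print(price)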

python replace string function throws asterisk wildcard error

When I use * I receive the error:
raise error, v # invalid expression
error: nothing to repeat
Other wildcard characters such as ^ work fine.
The line of code:
df.columns = df.columns.str.replace('*agriculture', 'agri')
I am using pandas and python.
edit:
When I try using / to escape, the wildcard does not work as I intend:
In [44]: df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])
In [45]: df
Out[45]:
Empty DataFrame
Columns: [agriculture, dfad agriculture df]
Index: []
In [46]: df.columns.str.replace('/*agriculture*', 'agri')
Out[46]: Index([u'agri', u'dfad agri df'], dtype='object')
I thought the wildcard should output Index([u'agri', u'agri'], dtype='object')
edit:
I am currently using hierarchical columns and would like to replace 'agriculture' with 'agri' only for a specific level (level = 2).
original:
df.columns[0] = ('grand total', '2005', 'agriculture')
df.columns[1] = ('grand total', '2005', 'other')
desired:
df.columns[0] = ('grand total', '2005', 'agri')
df.columns[1] = ('grand total', '2005', 'other')
I'm looking at this link right now: Changing columns names in Pandas with hierarchical columns. That author says it gets easier as of 0.15.0, so I am hoping there are more recent, updated solutions.
You need to put the asterisk * at the end: * is a quantifier that repeats the preceding element zero or more times, so a pattern cannot begin with it (hence the "nothing to repeat" error). See the docs:
In [287]:
df = pd.DataFrame(columns=['agriculture'])
df
Out[287]:
Empty DataFrame
Columns: [agriculture]
Index: []
In [289]:
df.columns.str.replace('agriculture*', 'agri')
Out[289]:
Index(['agri'], dtype='object')
EDIT
Based on your new, actual requirements, you can use str.contains to find the matching columns, build a dict mapping the old names to the new ones, and then call rename:
In [307]:
matching_cols = df.columns[df.columns.str.contains('agriculture')]
df.rename(columns = dict(zip(matching_cols, ['agri'] * len(matching_cols))))
Out[307]:
Empty DataFrame
Columns: [agri, agri]
Index: []
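For the hierarchical-columns case in the question, there is also a targeted option (a minimal sketch, assuming a pandas version where DataFrame.rename accepts a level argument): rename the label only on the given MultiIndex level.

import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ('grand total', '2005', 'agriculture'),
    ('grand total', '2005', 'other'),
])
df = pd.DataFrame(columns=columns)

# rename 'agriculture' -> 'agri' only on the innermost level (level=2)
df = df.rename(columns={'agriculture': 'agri'}, level=2)
print(df.columns.tolist())
# [('grand total', '2005', 'agri'), ('grand total', '2005', 'other')]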