Python pandas data frame warning, suggest to use .loc instead? - python-2.7

Hi, I would like to clean my data by removing rows with missing information and converting all letters to lowercase. But for the lowercase conversion, I get this warning:
E:\Program Files Extra\Python27\lib\site-packages\pandas\core\frame.py:1808: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
"DataFrame index.", UserWarning)
C:\Users\KubiK\Desktop\FamSeach_NameHandling.py:18: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
frame3["name"] = frame3["name"].str.lower()
C:\Users\KubiK\Desktop\FamSeach_NameHandling.py:19: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
frame3["ethnicity"] = frame3["ethnicity"].str.lower()
import pandas as pd
from pandas import DataFrame
# Get csv file into data frame
data = pd.read_csv("C:\Users\KubiK\Desktop\OddNames_sampleData.csv")
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity
# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame[index_missEthnic != True]
frame3 = frame2[index_missName != True]
# Make all letters into lowercase
frame3["name"] = frame3["name"].str.lower()
frame3["ethnicity"] = frame3["ethnicity"].str.lower()
# Test outputs
print frame3
This warning doesn't seem to be fatal (at least for my small sample data), but how should I deal with this?
Sample data
Name Ethnicity
Thos C. Martin Russian
Charlotte Wing English
Frederick A T Byrne Canadian
J George Christe French
Mary R O'brien English
Marie A Savoie-dit Dugas English
J-b'te Letourneau Scotish
Jane Mc-earthar French
Amabil?? Bonneau English
Emma Lef??c French
C., Akeefe African
D, James Matheson English
Marie An: Thomas English
Susan Rrumb;u English
English
Kaio Chan

I'm not sure why you need so many boolean masks...
Also note that .isnull() does not catch empty strings.
And filtering out empty strings before applying .lower() doesn't seem necessary either.
But if there is a need, this works for me:
frame = pd.DataFrame({'name':['Abc Def', 'EFG GH', ''], 'ethnicity':['Ethnicity1','', 'Ethnicity2']})
print frame
    ethnicity     name
0  Ethnicity1  Abc Def
1               EFG GH
2  Ethnicity2
name_null = frame.name.str.len() == 0
frame.loc[~name_null, 'name'] = frame.loc[~name_null, 'name'].str.lower()
print frame
    ethnicity     name
0  Ethnicity1  abc def
1               efg gh
2  Ethnicity2
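To illustrate the point about .isnull() and empty strings, here is a minimal sketch. The sample values are made up for illustration, and converting empty strings to NaN first is just one possible approach, not necessarily what the asker needs.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one real name, one empty string, one truly missing value
frame = pd.DataFrame({"name": ["Abc Def", "", None]})

# .isnull() only flags the truly missing value, not the empty string
null_only = frame["name"].isnull().tolist()

# Replacing empty strings with NaN first makes both cases count as missing
blank_or_null = frame["name"].replace("", np.nan).isnull().tolist()

print(null_only)
print(blank_or_null)
```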

When you create frame2 and frame3, try using .loc as follows:
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]
I think this would fix the warning you're seeing:
frame3.loc[:, "name"] = frame3.loc[:, "name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3.loc[:, "ethnicity"].str.lower()
You can also try the following, although it doesn't answer your question:
frame3.loc[:, "name"] = [t.lower() if isinstance(t, str) else t for t in frame3.name]
frame3.loc[:, "ethnicity"] = [t.lower() if isinstance(t, str) else t for t in frame3.ethnicity]
This converts any string in the column into lowercase, otherwise it leaves the value untouched.
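As far as I know, the warning is raised when you assign into a DataFrame that pandas regards as a slice of another one, and depending on the pandas version, slicing with .loc may still trigger it. A commonly suggested alternative is to take an explicit .copy() after filtering. The sketch below assumes made-up sample rows and uses print() with parentheses so it runs on both Python 2.7 and Python 3.

```python
import pandas as pd

# Hypothetical sample rows: one complete, one missing a name, one missing ethnicity
frame = pd.DataFrame({
    "name": ["Thos C. Martin", None, "Charlotte Wing"],
    "ethnicity": ["Russian", "English", None],
})

# Keep rows where both columns are present, then copy so that frame3
# is an independent DataFrame rather than a slice of frame
complete = frame["name"].notnull() & frame["ethnicity"].notnull()
frame3 = frame[complete].copy()

# Plain column assignment now modifies frame3 only
frame3["name"] = frame3["name"].str.lower()
frame3["ethnicity"] = frame3["ethnicity"].str.lower()

print(frame3)
```

The original frame is left untouched, which is usually what you want when building frame2 and frame3 step by step.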

Related

Pandas data frame replace international currency sign

I am working with an Excel file that has international currency signs in multiple columns. The file also contains text in several languages.
Example: Paying £40.50 doesn't make any sense for a one-hour parking.
Example: Produkty są zbyt drogie (Polish)
Example: 15% de la population féminine n'obtient pas de bons emplois (French)
As a cleanup step, the following actions have been taken:
df = df.apply(lambda x: x.str.replace('\\r',' '))
df = df.apply(lambda x: x.str.replace('\\n',' '))
df = df.apply(lambda x: x.str.replace('\.+', ''))
df = df.apply(lambda x: x.str.replace('-', ''))
df = df.apply(lambda x: x.str.replace('&', ''))
df = df.apply(lambda x: x.str.replace(r"[\"\',]", ''))
df = df.apply(lambda x: x.str.replace('[%*]', ''), axis=1)
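(If there is a more efficient way, suggestions are more than welcome.)
In addition to this, a method has been created to remove stopwords:
from nltk.corpus import stopwords  # assumed source of stopwords (NLTK)
def cleanup(row):
    # Build the English stopword set, then keep only the words that are not in it
    stops = set(stopwords.words('english'))
    removedStopWords = " ".join([str(i) for i in row.lower().split() if i not in stops])
    return removedStopWords
To apply this method to all columns in the data frame that contain the above examples: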
df = df.applymap(self._row_cleaner)['ComplainColumns']
But UnicodeEncodeError has been the biggest problem. One of the first places it throws this error is on the British pound sign:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 646: ordinal not in range(128)
I tried the following, but it didn't work:
df = df.apply(lambda x: x.unicode.replace(u'\xa3', ''))
The goal is to replace all non-alphabetical characters with '' or ' '.
If you want to remove every character other than letters, digits, underscores, and whitespace, you can use replace with a regex, i.e.
df = df.replace('[^\w\s]','',regex=True)
There might be missing data in the dataframe, so you might need to use astype(str): since cleanup calls .lower() on each value, NaN would otherwise be treated as a float and fail. Because cleanup works on a single string, apply it element-wise with applymap:
df.astype(str).applymap(cleanup)
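Here is a small sketch of what the regex replace above does, using two of the example sentences from the question. The column name "complaint" is made up for illustration. Note that on Python 3 the \w class is Unicode-aware, so accented letters survive while the pound sign and punctuation are removed.

```python
import pandas as pd

# Two of the example sentences from the question, in a hypothetical column
df = pd.DataFrame({
    "complaint": [
        u"Paying \u00a340.50 doesn't make sense",
        u"15% de la population f\u00e9minine",
    ]
})

# Remove everything that is neither a word character nor whitespace
cleaned = df.replace(r"[^\w\s]", "", regex=True)

print(cleaned["complaint"].tolist())
```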

python replace string function throws asterisk wildcard error

When I use * I receive the error
raise error, v # invalid expression
error: nothing to repeat
Other wildcard characters such as ^ work fine.
The line of code:
df.columns = df.columns.str.replace('*agriculture', 'agri')
I am using pandas and Python.
Edit:
When I try using / to escape, the wildcard does not work as I intend:
In [44]: df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])
In [45]: df
Out[45]:
Empty DataFrame
Columns: [agriculture, dfad agriculture df]
Index: []
In [46]: df.columns.str.replace('/*agriculture*','agri')
Out[46]: Index([u'agri', u'dfad agri df'], dtype='object')
I thought the wildcard should output Index([u'agri', u'agri'], dtype='object')
edit:
I am currently using hierarchical columns and would like to replace agriculture with agri only at that specific level (level = 2).
original:
df.columns[0] = ('grand total', '2005', 'agriculture')
df.columns[1] = ('grand total', '2005', 'other')
desired:
df.columns[0] = ('grand total', '2005', 'agri')
df.columns[1] = ('grand total', '2005', 'other')
I'm looking at this link right now: Changing columns names in Pandas with hierarchical columns
and that author says it will get easier at 0.15.0 so I am hoping there are more recent updated solutions
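In a regular expression, * is a quantifier that repeats the preceding element 0 or more times, so it cannot come first in the pattern; that is why '*agriculture' raises "nothing to repeat". You need to put the asterisk after the thing it should repeat, see the docs (note that 'agriculture*' matches 'agricultur' followed by 0 or more 'e' characters):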
In [287]:
df = pd.DataFrame(columns=['agriculture'])
df
Out[287]:
Empty DataFrame
Columns: [agriculture]
Index: []
In [289]:
df.columns.str.replace('agriculture*', 'agri')
Out[289]:
Index(['agri'], dtype='object')
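To make the quantifier behaviour concrete, here is a minimal sketch using Python's re module directly rather than pandas. The pattern '.*agriculture.*' is my own suggestion for getting the "match anything around the word" behaviour the question seems to expect; the sample names come from the question plus one made-up non-matching name. In recent pandas versions str.replace treats the pattern literally by default, so you would likely need regex=True there.

```python
import re

# A leading * has nothing to repeat, so compiling the pattern fails
try:
    re.compile("*agriculture")
    message = ""
except re.error as exc:
    message = str(exc)

# '.' matches any character, so '.*' means "any run of characters"
pattern = re.compile(r".*agriculture.*")
names = ["agriculture", "dfad agriculture df", "other"]
renamed = [pattern.sub("agri", name) for name in names]

print(message)
print(renamed)
```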
EDIT
Based on your new and actual requirements, you can use str.contains to find the matching columns, build a dict that maps the old names to the new ones, and then call rename:
In [307]:
matching_cols = df.columns[df.columns.str.contains('agriculture')]
df.rename(columns = dict(zip(matching_cols, ['agri'] * len(matching_cols))))
Out[307]:
Empty DataFrame
Columns: [agri, agri]
Index: []
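For the hierarchical-columns case in the question, one option that may be simpler is the level argument of DataFrame.rename, which is available in reasonably recent pandas versions. This is only a sketch under that assumption; the column tuples are taken from the question, and the rename matches the label 'agriculture' exactly rather than as a substring.

```python
import pandas as pd

# Hierarchical columns as described in the question
columns = pd.MultiIndex.from_tuples([
    ("grand total", "2005", "agriculture"),
    ("grand total", "2005", "other"),
])
df = pd.DataFrame(columns=columns)

# Rename labels only in the third level (level=2); other levels are untouched
renamed = df.rename(columns={"agriculture": "agri"}, level=2)

print(list(renamed.columns))
```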

Confused about how to store extracted sub-string features before running machine learning

I have some data with name and ethnicity
j-bte letourneau scotish
jane mc-earthar french
amabil bonneau english
I then normalize the name by replacing each space with "#" and adding trailing "?" characters to standardize the total length of the name entries. I would like to use sequential three-letter substrings as my features to predict ethnicity.
name_filled substr1 substr2 substr3 \
0 j-bte#letourneau??????????????????????????? j-b -bt bte
1 jane#mc-earthar???????????????????????????? jan ane ne#
2 amabil#bonneau????????????????????????????? ama mab abi
Here is my code for data manipulation to this point:
import pandas as pd
from pandas import DataFrame
import re
# Get csv file into data frame
data = pd.read_csv("C:\Users\KubiK\Desktop\OddNames_sampleData.csv")
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity
# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]
# Make all letters into lowercase
frame3.loc[:, "name"] = frame3["name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3["ethnicity"].str.lower()
# Remove all non-alphabetical characters in Name
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[^a-zA-Z\s\-]', '') # Retain space and hyphen
# Replace empty space as "#"
frame3.loc[:, "name"] = frame3["name"].str.replace('[\s]', '#')
# Find the longest name in the dataset
##frame3["name_length"] = frame3["name"].str.len()
##nameLength = frame3.name_length
##print nameLength.max() # Longest name has !!!40 characters!!! including spaces and hyphens
# Add "?" to fill spaces up to 43 characters
frame3["name_filled"] = frame3["name"].str.pad(side="right", width=43, fillchar="?")
# Split into three-character strings
for i in range(1, 41):
    substr = "substr" + str(i)
    frame3[substr] = frame3["name_filled"].str[i-1:i+2]
My question is: would it be a problem to store my 3-character substrings this way before running the machine learning algorithm? It could be a problem, as in the example below.
Imagine two Chinese people who both have the last name Chan, but one is called "Li Chan" and the other is called "Joseph Chan".
The Chan will be split into "cha" and "han", but in the first case "cha" will be stored in substr4, while in the second it will be stored in substr8, because the longer first name pushes it further along. I wonder if I could and should store the 3-character substrings in a single variable as a list (for example: ["j-b", "-bt", "bte"] as the substr variable for case 0). And if the substrings are stored in one single variable, can it be used with machine learning algorithms to predict ethnicity?
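A minimal sketch of the single-list idea, in plain Python. The helper name char_trigrams is made up for illustration. Counting the trigrams per name (a "bag of trigrams") drops the position information, so "cha" and "han" count as the same feature for both names; this is one possible representation, not necessarily the best one for the task.

```python
from collections import Counter


def char_trigrams(name):
    """Return all overlapping 3-character substrings of name, in order."""
    return [name[i:i + 3] for i in range(len(name) - 2)]


# The two names from the example, with spaces already replaced by '#'
li = Counter(char_trigrams("li#chan"))
joseph = Counter(char_trigrams("joseph#chan"))

# Trigrams that both names share, regardless of where they occur
shared = sorted(set(li) & set(joseph))

print(char_trigrams("li#chan"))
print(shared)
```

Counts like these can then be turned into a numeric feature matrix, for example with scikit-learn's CountVectorizer(analyzer="char", ngram_range=(3, 3)), which also removes the need for "?" padding.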

AttributeError: 'DataFrame' object has no attribute 'Height'

I am able to convert a csv file to a pandas DataFrame and print out the table, as seen below. However, when I try to print out the Height column I get an error. How can I fix this?
import pandas as pd
df = pd.read_csv('/path../NavieBayes.csv')
print df  # this prints the table shown below
print df.Height  # this raises "AttributeError: 'DataFrame' object has no attribute 'Height'"
Height Weight Classifer
0 70.0 180 Adult
1 58.0 109 Adult
2 59.0 111 Adult
3 60.0 113 Adult
4 61.0 115 Adult
I have run into a similar issue before when reading from csv. Assuming it is the same:
col_name = df.columns[0]
df = df.rename(columns={col_name: 'new_name'})
The error in my case was caused (I think) by a byte order mark in the csv or some other non-printing character being added to the first column label. df.columns returns an array of the column names, and df.columns[0] gets the first one. Try printing it and seeing if something is odd with the result.
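A small sketch of that check, assuming the culprit is stray whitespace in the header (a byte order mark would need different handling). The CSV text here is made up to mimic the table in the question, with a trailing space after "Height".

```python
import io

import pandas as pd

# Hypothetical CSV whose first header has a trailing space
csv_text = u"Height ,Weight,Classifer\n70.0,180,Adult\n58.0,109,Adult\n"
df = pd.read_csv(io.StringIO(csv_text))

# Inspecting the raw labels reveals the hidden character
raw_names = list(df.columns)
print(raw_names)

# Strip surrounding whitespace from every column label
df.columns = df.columns.str.strip()
heights = df.Height.tolist()
print(heights)
```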
PS regarding the above answer by JAB: if there are clearly spaces in your column names, use skipinitialspace=True in read_csv, e.g.
df = pd.read_csv('/path../NavieBayes.csv',skipinitialspace=True)
df = pd.read_csv(r'path_of_file\csv_file_name.csv')
OR
df = pd.read_csv('path_of_file/csv_file_name.csv')
Example:
data = pd.read_csv(r'F:\Desktop\datasets\hackathon+data+set.csv')
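Try one of these forms: a raw string or forward slashes keeps the backslashes in the path from being read as escape sequences.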

Deleting rows if missing in some variable in Python Pandas

I am trying to use Pandas to remove rows that contain missing ethnicity information, though I didn't get very far as I am new to Pandas.
Using 'print name[ethnic.isnull() == True]' I can see which people have missing ethnicity information. But ultimately I want to 1) record the indexes of the missing-ethnicity cases by appending them to the 'missing' array, and 2) then create a second frame by deleting all the rows whose index matches one of those in the 'missing' array.
I am currently stuck in the 'for case in frame' loop, where I try to print the names of those with missing ethnicity. My program ends without an error, but also without printing anything.
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
### Remove cases with missing name or missing ethnicity information
def RemoveMissing():
    data = pd.read_csv("C:\...\sample.csv")
    frame = DataFrame(data)
    frame.columns = ["Name", "Ethnicity", "Event_Place", "Birth_Place", "URL"]
    missing = []
    name = frame.Name
    ethnic = frame.Ethnicity
    # Filter based on some variable criteria
    #print name[ethnic == "English"]
    #print name[ethnic.isnull() == True] # identify those who don't have ethnicity entry
    # This works
    for case in frame:
        print frame.Name
    # Doesn't work
    for case in frame:
        if frame.Ethnicity.isnull() is True:
            print frame.Name
RemoveMissing()
This seems to work:
# Create a var to check if Ethnicity is missing
index_missEthnic = frame.Ethnicity.isnull()
frame2 = frame[index_missEthnic != True]
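The loop in the question prints nothing because frame.Ethnicity.isnull() returns a whole Series, and a Series is never the object True, so the "is True" test always fails. Here is a minimal sketch of the two steps the question asks for, using made-up sample rows; dropna with subset is one way to do the second step and is not the only option.

```python
import pandas as pd

# Hypothetical sample: row 1 has no ethnicity, row 2 has no name
frame = pd.DataFrame({
    "Name": ["Charlotte Wing", "Mary R O'brien", None],
    "Ethnicity": ["English", None, "English"],
})

# 1) Record the index labels of the rows with missing ethnicity
missing = frame.index[frame["Ethnicity"].isnull()].tolist()

# 2) Build a second frame without rows missing a name or an ethnicity
frame2 = frame.dropna(subset=["Name", "Ethnicity"])

print(missing)
print(frame2)
```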