I am working with an Excel file which has international currency signs in multiple columns. In addition, the file contains text in several languages.
Example: Paying £40.50 doesn't make any sense for a one-hour parking.
Example: Produkty są zbyt drogie (Polish: "The products are too expensive")
Example: 15% de la population féminine n'obtient pas de bons emplois (French: "15% of the female population does not get good jobs")
As a cleanup process, the following actions have been taken:
df = df.apply(lambda x: x.str.replace('\\r',' '))
df = df.apply(lambda x: x.str.replace('\\n',' '))
df = df.apply(lambda x: x.str.replace('\.+', ''))
df = df.apply(lambda x: x.str.replace('-', ''))
df = df.apply(lambda x: x.str.replace('&', ''))
df = df.apply(lambda x: x.str.replace(r"[\"\',]", ''))
df = df.apply(lambda x: x.str.replace('[%*]', ''))
(If there is a more efficient way, suggestions are more than welcome; one possibility is sketched below.)
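A minimal sketch of such a consolidation, assuming every column in df holds strings: the chain of apply calls above is collapsed into a single replace pass with a regex dictionary.
# \r and \n become spaces; dots, hyphens, &, %, *, commas and quotes are dropped
df = df.replace({'[\\r\\n]': ' ', '[.\\-&%*,\'"]+': ''}, regex=True)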
In addition to this, a method has been created to remove stopwords:
from nltk.corpus import stopwords

def cleanup(row):
    stops = set(stopwords.words('english'))
    removedStopWords = " ".join([str(i) for i in row.lower().split() if i not in stops])
    return removedStopWords
This method is applied to all columns in the data frame that contain the examples above:
df = df.applymap(self._row_cleaner)['ComplainColumns']
but UnicodeEncodeError has been the biggest problem. One of the first places it throws this error is on the British pound sign:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 646: ordinal not in range(128)
I tried the following:
df = df.apply(lambda x: x.unicode.replace(u'\xa3', ''))
but it didn't work.
The goal is to replace all non-alphabetical characters with '' or ' '.
If you want to replace all characters other than word characters ([A-Za-z0-9_]) and whitespace, you can use replace with a regex, i.e.
df = df.replace(r'[^\w\s]', '', regex=True)
There might be missing data in the dataframe, so you might need to use astype(str): since you are using a list comprehension with .lower(), NaN would otherwise be treated as a float.
df.astype(str).applymap(cleanup)
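A short illustration of the two steps together, with a made-up column name and the pound-sign example from the question; the regex works on unicode text directly, so no ASCII encoding is involved:
import pandas as pd

df = pd.DataFrame({'Complain': [u"Paying \xa340.50 doesn't make any sense"]})
# strip everything that is not a word character or whitespace (this removes the pound sign),
# then apply the cleanup() stopword removal defined above element-wise
df = df.replace(u'[^\\w\\s]', u'', regex=True)
df = df.applymap(cleanup)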
I have a Pandas Dataframe with data as below
id, name, date
[101],[test_name],[2019-06-13T13:45:00.000Z]
[103],[test_name3],[2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00.000Z]
[104],[],[]
I am trying to convert it to the format below, with no square brackets.
Expected output:
id, name, date
101,test_name,2019-06-13T13:45:00.000Z
103,test_name3,2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00.000Z
104,,
I tried using a regex as below, but it gave me an error: TypeError: expected string or bytes-like object
re.search(r"\[([A-Za-z0-9_]+)\]", df['id'])
I figured out that I am able to extract the data using the following:
df['id'].str.get(0)
Loop through the data frame to access each string then use:
newstring = oldstring[1:len(oldstring)-1]
to replace the cell in the dataframe.
Try looping through columns:
for col in df.columns:
    df[col] = df[col].str[1:-1]
Or use apply if duplicating your data is not a problem:
df = df.apply(lambda x: x.str[1:-1])
Output:
id name date
0 101 test_name 2019-06-13T13:45:00.000Z
1 103 test_name3 2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00....
2 104
Or if you want to use a regex, you need the str accessor and extract:
df.apply(lambda x: x.str.extract(r'\[([A-Za-z0-9_]+)\]'))
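A small usage sketch, assuming the bracketed values are stored as plain strings rather than lists; str.strip('[]') only removes brackets at either end, so the cell containing two dates stays intact:
import pandas as pd

df = pd.DataFrame({'id': ['[101]', '[103]', '[104]'],
                   'name': ['[test_name]', '[test_name3]', '[]'],
                   'date': ['[2019-06-13T13:45:00.000Z]',
                            '[2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00.000Z]',
                            '[]']})
# strip leading/trailing square brackets from every cell
df = df.apply(lambda x: x.str.strip('[]'))
print(df)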
I need to add double quotes to specific columns in a csv file that my script generates.
Below is the goofy way I thought of doing this. For these two fixed-width fields, it works:
df['DATE'] = df['DATE'].str.ljust(9,'"')
df['DATE'] = df['DATE'].str.rjust(10,'"')
df['DEPT CODE'] = df['DEPT CODE'].str.ljust(15,'"')
df['DEPT CODE'] = df['DEPT CODE'].str.rjust(16,'"')
For the following field, it doesn't. It has a variable length. So, if the value is shorter than the standard 6 digits, I get extra double quotes: "5673"""
df['ID'] = df['ID'].str.ljust(7,'"')
df['ID'] = df['ID'].str.rjust(8,'"')
I have tried zfill, but the data in the column is a Series; I get "pandas.core.series.Series" when I run
print type(df['ID'])
and I have not been able to convert it to string using astype. I'm not sure why. I have not imported numpy.
I tried using len() to get the length of the ID number and pass it to str.ljust and str.rjust as its first argument, but I think it got hung up on the data not being a string.
Is there a simpler way to apply double-quotes as I need, or is the zfill going to be the way to go?
You can add a speech mark before / after:
In [11]: df = pd.DataFrame([["a"]], columns=["A"])
In [12]: df
Out[12]:
A
0 a
In [13]: '"' + df['A'] + '"'
Out[13]:
0 "a"
Name: A, dtype: object
Assigning this back:
In [14]: df['A'] = '"' + df.A + '"'
In [15]: df
Out[15]:
A
0 "a"
If it's for exporting to csv you can use the quoting kwarg:
In [21]: df = pd.DataFrame([["a"]], columns=["A"])
In [22]: df.to_csv()
Out[22]: ',A\n0,a\n'
In [23]: df.to_csv(quoting=1)
Out[23]: '"","A"\n"0","a"\n'
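For readability, the same quoting values can be spelled with the constants from the standard csv module (quoting=1 is csv.QUOTE_ALL); a short sketch, assuming df is the frame being exported:
import csv

df.to_csv('out.csv', quoting=csv.QUOTE_ALL)         # quote every field
df.to_csv('out.csv', quoting=csv.QUOTE_NONNUMERIC)  # quote only non-numeric fields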
With numpy, not pandas, you can specify the formatting method when saving to a csv file. As a very simple example:
In [209]: np.savetxt('test.txt',['string'],fmt='%r')
In [210]: cat test.txt
'string'
In [211]: np.savetxt('test.txt',['string'],fmt='"%s"')
In [212]: cat test.txt
"string"
I would expect the pandas csv writer to have a similar degree of control, if not more.
When I use * I receive the error
raise error, v # invalid expression
error: nothing to repeat
Other wildcard characters such as ^ work fine.
The line of code:
df.columns = df.columns.str.replace('*agriculture', 'agri')
I am using pandas and Python.
Edit:
When I try using / to escape, the wildcard does not work as I intend:
In [44]: df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])
In [45]: df
Out[45]:
Empty DataFrame
Columns: [agriculture, dfad agriculture df]
Index: []
In [46]: df.columns.str.replace('/*agriculture*', 'agri')
Out[46]: Index([u'agri', u'dfad agri df'], dtype='object')
I thought the wildcard should output Index([u'agri', u'agri'], dtype='object')
Edit:
I am currently using hierarchical columns and would like to replace 'agriculture' with 'agri' only at that specific level (level = 2).
original:
df.columns[0] = ('grand total', '2005', 'agriculture')
df.columns[1] = ('grand total', '2005', 'other')
desired:
df.columns[0] = ('grand total', '2005', 'agri')
df.columns[1] = ('grand total', '2005', 'other')
I'm looking at this link right now: Changing columns names in Pandas with hierarchical columns
and that author says it will get easier in 0.15.0, so I am hoping there are more recent, updated solutions.
You need the asterisk * at the end in order to match the string 0 or more times; see the docs:
In [287]:
df = pd.DataFrame(columns=['agriculture'])
df
Out[287]:
Empty DataFrame
Columns: [agriculture]
Index: []
In [289]:
df.columns.str.replace('agriculture*', 'agri')
Out[289]:
Index(['agri'], dtype='object')
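If the intention was a shell-style wildcard, the regex equivalent is .* (any character, repeated), since str.replace treats the pattern as a regular expression here. A small sketch using the column names from the question:
# '.*' plays the role of the shell wildcard '*', so both column names collapse to 'agri'
df.columns.str.replace('.*agriculture.*', 'agri')
# Index([u'agri', u'agri'], dtype='object')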
EDIT
Based on your new and actual requirements, you can use str.contains to find matches, use this to build a dict mapping the old names to the new names, and then call rename:
In [307]:
matching_cols = df.columns[df.columns.str.contains('agriculture')]
df.rename(columns = dict(zip(matching_cols, ['agri'] * len(matching_cols))))
Out[307]:
Empty DataFrame
Columns: [agri, agri]
Index: []
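For the hierarchical-columns case in the edit, one possible sketch (assuming the 'agriculture' label sits in level 2 and a pandas version that supports set_levels with a level argument) rewrites only that level of the MultiIndex:
import pandas as pd

cols = pd.MultiIndex.from_tuples([('grand total', '2005', 'agriculture'),
                                  ('grand total', '2005', 'other')])
df = pd.DataFrame(columns=cols)
# replace only in level 2, leaving the other levels untouched
df.columns = df.columns.set_levels(
    df.columns.levels[2].str.replace('agriculture', 'agri'), level=2)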
I have some data with name and ethnicity
j-bte letourneau scotish
jane mc-earthar french
amabil bonneau english
I then normalize the name by replacing spaces with "#" and adding trailing "?" characters to standardize the total length of the name entries. I would like to use sequential three-letter substrings as my features to predict ethnicity.
name_filled substr1 substr2 substr3 \
0 j-bte#letourneau??????????????????????????? j-b -bt bte
1 jane#mc-earthar???????????????????????????? jan ane ne#
2 amabil#bonneau????????????????????????????? ama mab abi
Here is my code for data manipulation to this point:
import pandas as pd
from pandas import DataFrame
import re
# Get csv file into data frame
data = pd.read_csv(r"C:\Users\KubiK\Desktop\OddNames_sampleData.csv")
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity
# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]
# Make all letters into lowercase
frame3.loc[:, "name"] = frame3["name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3["ethnicity"].str.lower()
# Remove all non-alphabetical characters in Name
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[^a-zA-Z\s\-]', '') # Retain space and hyphen
# Replace empty space as "#"
frame3.loc[:, "name"] = frame3["name"].str.replace('[\s]', '#')
# Find the longest name in the dataset
##frame3["name_length"] = frame3["name"].str.len()
##nameLength = frame3.name_length
##print nameLength.max() # Longest name has !!!40 characters!!! including spaces and hyphens
# Add "?" to fill spaces up to 43 characters
frame3["name_filled"] = frame3["name"].str.pad(side="right", width=43, fillchar="?")
# Split into three-character strings
for i in range(1, 41):
    substr = "substr" + str(i)
    frame3[substr] = frame3["name_filled"].str[i-1:i+2]
My question is, would it be a problem to store my 3-character substrings this way when running a machine learning algorithm? This could be a problem, as in the example below.
Imagine two Chinese people both with the last name Chan, but one is called "Li Chan" and the other is called "Joseph Chan".
The name Chan will be split into "cha" and "han", but in the first case "cha" will be stored in substr4 while in the other it will be stored in substr8, because the longer first name pushes it to a later position. I wonder if I could and should store the 3-character substrings in just one variable as a list (for example: ["j-b", "-bt", "bte"] for case 0), and if the substrings are stored in one single variable, can it be used with machine learning algorithms to predict ethnicity?
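One way to sidestep the position problem, sketched here with scikit-learn (not part of the code above), is to skip the fixed-width padding and let a character n-gram vectorizer build the features, so that "cha" counts the same wherever it appears in the name:
from sklearn.feature_extraction.text import CountVectorizer

# character 3-grams over the normalized names; position within the name no longer matters
vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 3))
X = vectorizer.fit_transform(frame3['name'])  # one column per distinct 3-gram
y = frame3['ethnicity']
# X and y can then be passed to any standard classifier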
Hi, I would like to manipulate the data by removing missing information and making all letters lowercase. But for the lowercase conversion, I get these warnings:
E:\Program Files Extra\Python27\lib\site-packages\pandas\core\frame.py:1808: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
"DataFrame index.", UserWarning)
C:\Users\KubiK\Desktop\FamSeach_NameHandling.py:18: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
frame3["name"] = frame3["name"].str.lower()
C:\Users\KubiK\Desktop\FamSeach_NameHandling.py:19: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
frame3["ethnicity"] = frame3["ethnicity"].str.lower()
import pandas as pd
from pandas import DataFrame
# Get csv file into data frame
data = pd.read_csv(r"C:\Users\KubiK\Desktop\OddNames_sampleData.csv")
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity
# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame[index_missEthnic != True]
frame3 = frame2[index_missName != True]
# Make all letters into lowercase
frame3["name"] = frame3["name"].str.lower()
frame3["ethnicity"] = frame3["ethnicity"].str.lower()
# Test outputs
print frame3
This warning doesn't seem to be fatal (at least for my small sample data), but how should I deal with this?
Sample data
Name Ethnicity
Thos C. Martin Russian
Charlotte Wing English
Frederick A T Byrne Canadian
J George Christe French
Mary R O'brien English
Marie A Savoie-dit Dugas English
J-b'te Letourneau Scotish
Jane Mc-earthar French
Amabil?? Bonneau English
Emma Lef??c French
C., Akeefe African
D, James Matheson English
Marie An: Thomas English
Susan Rrumb;u English
English
Kaio Chan
Not sure why you need so many booleans...
Also note that .isnull() does not catch empty strings.
And filtering empty strings before applying .lower() doesn't seem necessary either.
But if there is a need... this works for me:
frame = pd.DataFrame({'name':['Abc Def', 'EFG GH', ''], 'ethnicity':['Ethnicity1','', 'Ethnicity2']})
print frame
ethnicity name
0 Ethnicity1 Abc Def
1 EFG GH
2 Ethnicity2
name_null = frame.name.str.len() == 0
frame.loc[~name_null, 'name'] = frame.loc[~name_null, 'name'].str.lower()
print frame
ethnicity name
0 Ethnicity1 abc def
1 efg gh
2 Ethnicity2
When you set frame2/3, try using .loc as follows:
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]
I think this would fix the error you're seeing:
frame3.loc[:, "name"] = frame3.loc[:, "name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3.loc[:, "ethnicity"].str.lower()
You can also try the following, although it doesn't answer your question:
frame3.loc[:, "name"] = [t.lower() if isinstance(t, str) else t for t in frame3.name]
frame3.loc[:, "ethnicity"] = [t.lower() if isinstance(t, str) else t for t in frame3. ethnicity]
This converts any string in the column into lowercase, otherwise it leaves the value untouched.
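If the warning persists, another common approach (a sketch, not part of the answer above) is to take an explicit copy when building the filtered frame, so that later assignments operate on an independent DataFrame rather than a view:
# filter both conditions at once and copy, so frame3 is not a view of frame
frame3 = frame[~index_missEthnic & ~index_missName].copy()
frame3["name"] = frame3["name"].str.lower()
frame3["ethnicity"] = frame3["ethnicity"].str.lower()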