Replacing strings using regex in Pandas

In Pandas, why does the following not replace any strings containing an exclamation mark with whatever follows it?
In [1]: import pandas as pd
In [2]: ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
In [3]: ser
Out[3]:
0 Aland Islands !Åland Islands
1 Reunion !Réunion
2 Zimbabwe
dtype: object
In [4]: patt = r'.*!(.*)'
In [5]: repl = lambda m: m.group(1)
In [6]: ser.replace(patt, repl)
Out[6]:
0 Aland Islands !Åland Islands
1 Reunion !Réunion
2 Zimbabwe
dtype: object
Whereas the direct reference to the matched substring does work:
In [7]: ser.replace({patt: r'\1'}, regex=True)
Out[7]:
0 Åland Islands
1 Réunion
2 Zimbabwe
dtype: object
What am I doing wrong in the first case?

It appears that Series.replace does not support a callable as the replacement argument. Thus, all you can do is import the re library explicitly and use apply:
>>> import re
>>> #... your code ...
>>> ser.apply(lambda row: re.sub(patt, repl, row))
0 Åland Islands
1 Réunion
2 Zimbabwe
dtype: object

There are two replace methods in Pandas.
The one that acts directly on a Series can take a regex pattern string or a compiled regex and can act in-place, but doesn't allow the replacement argument to be a callable. You must set regex=True and use raw strings.
With:
import re
import pandas as pd
ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
Yes:
ser.replace(r'.*!(.*)', r'\1', regex=True, inplace=True)
ser.replace(r'.*!', '', regex=True, inplace=True)
regex = re.compile(r'.*!(.*)')
ser.replace(regex, r'\1', regex=True, inplace=True)
No:
repl = lambda m: m.group(1)
ser.replace(regex, repl, regex=True, inplace=True)
There's another, used as Series.str.replace. This one accepts a callable replacement but won't substitute in-place and doesn't take a regex argument (though regular expression pattern strings can be used):
Yes:
ser.str.replace(r'.*!', '')
ser.str.replace(r'.*!(.*)', r'\1')
ser.str.replace(regex, repl)
No:
ser.str.replace(regex, r'\1')
ser.str.replace(r'.*!', '', inplace=True)
I hope this is helpful to someone out there.
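Tying this back to the original question's data: Series.str.replace will accept the callable replacement that Series.replace rejects. A minimal sketch (note: on recent pandas versions str.replace takes a regex keyword that defaults to False, so it is passed explicitly here; older versions treated the pattern as a regex by default and the keyword can be dropped):
import pandas as pd

ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
repl = lambda m: m.group(1)

# str.replace passes each match object to the callable and substitutes its return value
print(ser.str.replace(r'.*!(.*)', repl, regex=True))
# 0    Åland Islands
# 1          Réunion
# 2         Zimbabwe
# dtype: object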

Try this snippet:
pattern = r'(.*)!'
ser.replace(pattern, '', regex=True)
In your case, you didn't set regex=True; it is False by default.

RegEx from PCRE engine does not match the same way in Pandas

So I have a pandas series like below
test_urls = pd.Series([
'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
'http://www.interactivedynamicvideo.com/',
'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
'HTTPS://github.com/keppel/pinn',
'Http://phys.org/news/2015-09-scale-solar-youve.html',
'https://iot.seeed.cc',
'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
'http://beta.crowdfireapp.com/?beta=agnipath',
'https://www.valid.ly?param',
'http://css-cursor.techstream.org'
])
To capture only the domains, I have used the below regex expression from PCRE:
(?<=\/\/)(\w+[-.]?\w+[.]?){2}
Now when I use this in Pandas, I get the unexpected result below:
test_urls_clean = test_urls.str.extract(r"(?<=\/\/)(\w+[-.]?\w+[.]?){2}", expand=False)
0
0 com
1 com
2 com
3 om
4 om
5 rg
6 cc
7 com
8 com
9 ly
10 techstream.org
But using the regex below, the correct results are fetched:
https?://([\w\-\.]+)
Any reason why this issue happens with Pandas?
If I'm understanding correctly...
The first regex (?<=\/\/)(\w+[-.]?\w+[.]?){2} is matching the url, but only capturing the TLD.
So .com etc. is the only part returned when using str.extract (extract the captured group only).
See https://regex101.com/r/aMKgFv/1 and notice the blue (match) vs the green (capture) in the highlighting of the urls.
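If you want to keep the original pattern's structure, one option (a sketch, not from the original answer) is to make the inner group non-capturing and wrap the repetition in an outer capturing group, so str.extract returns the whole match instead of only the last repetition:
# the inner (?:...) group no longer captures; the outer (...) captures the full
# two-part domain, so str.extract returns e.g. 'www.amazon.com' instead of 'com'
test_urls_clean = test_urls.str.extract(r"(?<=//)((?:\w+[-.]?\w+[.]?){2})", expand=False)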
Additional:
For getting the domain out of the string I would recommend avoiding regex and letting urllib do the work.
Code:
from urllib.parse import urlparse
import pandas as pd
def getDomain(s):
    return urlparse(s).netloc
test_urls = pd.Series([
'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
'http://www.interactivedynamicvideo.com/',
'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
'HTTPS://github.com/keppel/pinn',
'Http://phys.org/news/2015-09-scale-solar-youve.html',
'https://iot.seeed.cc',
'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
'http://beta.crowdfireapp.com/?beta=agnipath',
'https://www.valid.ly?param',
'http://css-cursor.techstream.org'
])
test_urls = test_urls.apply(getDomain)
print(test_urls)
0 www.amazon.com
1 www.interactivedynamicvideo.com
2 www.nytimes.com
3 evonomics.com
4 github.com
5 phys.org
6 iot.seeed.cc
7 www.bfilipek.com
8 beta.crowdfireapp.com
9 www.valid.ly
10 css-cursor.techstream.org
dtype: object

RegEx for matching the month, day and year

I'm trying to find a regular expression to extract the month, day and year from a datetime stamp in this format:
01/20/2019 12:34:54
It should return a list:
['01', '20', '2019']
I know this can be solved using:
dt.split(' ')[0].split('/')
But, I'm trying to find a regex to do it:
[^\/\s]+
But, I need it to exclude everything after the space.
As you are expecting the month, day and year to be returned as a list, you can use this Python code:
import re
s = '01/20/2019 12:34:54'
print(re.findall(r'\d+(?=[ /])', s))
Prints,
['01', '20', '2019']
Otherwise, you can better write your regex as,
(\d{2})/(\d{2})/(\d{4})
And get the month, day and year from group 1, group 2 and group 3.
Regex Demo
Python code in this way should be,
import re
s = '01/20/2019 12:34:54'
m = re.search(r'(\d{2})/(\d{2})/(\d{4})', s)
if m:
    print([m.group(1), m.group(2), m.group(3)])
Prints,
['01', '20', '2019']
You should absolutely be taking advantage of Python's date/time API here. Use strptime to parse your input datetime string to a bona fide Python datetime. Then, just build a list, accessing the various components you need.
dt = "01/20/2019 12:34:54"
dto = datetime.strptime(dt, '%m/%d/%Y %H:%M:%S')
list = [dto.month, dto.day, dto.year]
print(list)
[1, 20, 2019]
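If the zero-padded string form from the question is needed rather than integers, strftime can format the components back out (continuing from the dto above):
print([dto.strftime('%m'), dto.strftime('%d'), dto.strftime('%Y')])
# ['01', '20', '2019']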
If you really want/need to work with the original datetime string, then split provides an option, without even formally using a regex:
dt = "01/20/2019 12:34:54"
dt = dt.split()[0].split('/')
print(dt)
['01', '20', '2019']
This RegEx might help you to do so.
([0-9]+)\/([0-9]+)\/([0-9]+)\s[0-9]+:[0-9]+:[0-9]+
Code:
import re
string = '01/20/2019 12:34:54'
matches = re.search(r'([0-9]+)/([0-9]+)/([0-9]+)', string)
if matches:
    print([matches.group(1), matches.group(2), matches.group(3)])
else:
    print('Sorry! No matches! Something is not right! Call 911')
Output
['01', '20', '2019']

rdflib SPARQL queries (involving subclasses) not behaving as expected

I have just started to self-learn RDF and the Python rdflib library. But I have hit an issue with the following code. I expected the queries to return mark and nat as Persons and only natalie as a Professor. I can't see where I've gone wrong. (BTW, I know Professor should be a title rather than a kind of Person, but I'm just tinkering ATM.) Any help appreciated. Thanks.
rdflib 4, Python 2.7
>>> from rdflib import Graph, BNode, URIRef, Literal
INFO:rdflib:RDFLib Version: 4.2.
>>> from rdflib.namespace import RDF, RDFS, FOAF
>>> G = Graph()
>>> mark = BNode()
>>> nat = BNode()
>>> G.add((mark, RDF.type, FOAF.Person))
>>> G.add((mark, FOAF.firstName, Literal('mark')))
>>> G.add((URIRef('Professor'), RDF.type, RDFS.Class))
>>> G.add((URIRef('Professor'), RDFS.subClassOf, FOAF.Person))
>>> G.add((nat, RDF.type, URIRef('Professor')))
>>> G.add((nat, FOAF.firstName, Literal('natalie')))
>>> qres = G.query(
"""SELECT DISTINCT ?aname
WHERE {
?a rdf:type foaf:Person .
?a foaf:firstName ?aname .
}""", initNs = {"rdf": RDF,"foaf": FOAF})
>>> for row in qres:
...     print "%s is a person" % row
...
mark is a person
>>> qres = G.query(
"""SELECT DISTINCT ?aname
WHERE {
?a rdf:type ?prof .
?a foaf:firstName ?aname .
}""", initNs = {"rdf": RDF,"foaf": FOAF, "prof": URIRef('Professor')})
>>> for row in qres:
...     print "%s is a Prof" % row
...
natalie is a Prof
mark is a Prof
>>>
Your query is bringing back all types because the ?prof variable isn't bound to a value.
I think what you want is the initBindings kwarg, which passes the URI for 'Professor' into your query. Changing your query as below retrieves just natalie.
qres = G.query(
"""SELECT DISTINCT ?aname
WHERE {
?a rdf:type ?prof .
?a foaf:firstName ?aname .
}""",
initNs ={"rdf": RDF,"foaf": FOAF},
initBindings={"prof": URIRef('Professor')})

python/pandas: need help adding double quotes to columns

I need to add double quotes to specific columns in a csv file that my script generates.
Below is the goofy way I thought of doing this. For these two fixed-width fields, it works:
df['DATE'] = df['DATE'].str.ljust(9,'"')
df['DATE'] = df['DATE'].str.rjust(10,'"')
df['DEPT CODE'] = df['DEPT CODE'].str.ljust(15,'"')
df['DEPT CODE'] = df['DEPT CODE'].str.rjust(16,'"')
For the following field, it doesn't. It has a variable length. So, if the value is shorter than the standard 6-digits, I get extra double-quotes: "5673"""
df['ID'] = df['ID'].str.ljust(7,'"')
df['ID'] = df['ID'].str.rjust(8,'"')
I have tried zfill, but the data in the column is a Series -- I get "pandas.core.series.Series" when I run
print type(df['ID'])
and I have not been able to convert it to string using astype. I'm not sure why. I have not imported numpy.
I tried using len() to get the length of the ID number and pass it to str.ljust and str.rjust as its first argument, but I think it got hung up on the data not being a string.
Is there a simpler way to apply double-quotes as I need, or is the zfill going to be the way to go?
You can add a speech mark before / after:
In [11]: df = pd.DataFrame([["a"]], columns=["A"])
In [12]: df
Out[12]:
A
0 a
In [13]: '"' + df['A'] + '"'
Out[13]:
0 "a"
Name: A, dtype: object
Assigning this back:
In [14]: df['A'] = '"' + df.A + '"'
In [15]: df
Out[15]:
A
0 "a"
If it's for exporting to csv you can use the quoting kwarg:
In [21]: df = pd.DataFrame([["a"]], columns=["A"])
In [22]: df.to_csv()
Out[22]: ',A\n0,a\n'
In [23]: df.to_csv(quoting=1)
Out[23]: '"","A"\n"0","a"\n'
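For readability, quoting=1 is csv.QUOTE_ALL; the named constants from the csv module make the intent clearer (a small sketch, same df as above):
import csv

df.to_csv(quoting=csv.QUOTE_ALL)         # quote every field
df.to_csv(quoting=csv.QUOTE_NONNUMERIC)  # quote only non-numeric fields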
With numpy, not pandas, you can specify the formatting method when saving to a csv file. As a very simple example:
In [209]: np.savetxt('test.txt',['string'],fmt='%r')
In [210]: cat test.txt
'string'
In [211]: np.savetxt('test.txt',['string'],fmt='"%s"')
In [212]: cat test.txt
"string"
I would expect the pandas csv writer to have a similar degree of control, if not more.

DictVectorizer for a list as one feature in Python Pandas and scikit-learn

I have been trying to solve this for days, and although I have found a similar problem here How can i vectorize list using sklearn DictVectorizer, the solution is overly simplified.
I would like to fit some features into a logistic regression model to predict 'chinese' or 'non-chinese'. I have a raw_name from which I will extract two features: 1) the last name, and 2) a list of substrings of the last name; for example, 'Chan' will give ['ch', 'ha', 'an']. But it seems DictVectorizer doesn't take list values as part of the dictionary. From the link above, I tried to create a function list_to_dict, which successfully returns some dict elements,
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
but I have no idea how to incorporate that into the my_dict = ... before applying the DictVectorizer.
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
lr = LogisticRegression()
dv = DictVectorizer()
# Get csv file into data frame
data = pd.read_csv("V2-1_2000Records_Processed_SEP2015.csv", header=0, encoding="utf-8")
df = DataFrame(data)
# Pandas data frame shuffling
df_shuffled = df.iloc[np.random.permutation(len(df))]
df_shuffled.reset_index(drop=True)
# Assign X and y variables
X = df.raw_name.values
y = df.chineseScan.values
# Feature extraction functions
def feature_full_last_name(nameString):
    try:
        last_name = nameString.rsplit(None, 1)[-1]
        if len(last_name) > 1:  # not accept name with only 1 character
            return last_name
        else: return None
    except: return None

def feature_twoLetters(nameString):
    placeHolder = []
    try:
        for i in range(0, len(nameString)):
            x = nameString[i:i+2]
            if len(x) == 2:
                placeHolder.append(x)
        return placeHolder
    except: return []

def list_to_dict(substring_list):
    try:
        substring_dict = {}
        for i in substring_list:
            substring_dict['substring='+str(i)] = True
        return substring_dict
    except: return None
list_example = ['co', 'or', 'rn', 'ns']
print list_to_dict(list_example)
# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'two-letter-substrings': feature_twoLetters(feature_full_last_name(i)),
'last-name': feature_full_last_name(i), 'dummy': 1} for i in X]
print my_dict[3]
Output:
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
{'dummy': 1, 'two-letter-substrings': [u'co', u'or', u'rn', u'ns'], 'last-name': u'corns'}
Sample data:
Raw_name chineseScan
Jack Anderson non-chinese
Po Lee chinese
If I have understood correctly you want a way to encode list values in order to have a feature dictionary that DictVectorizer could use. (One year too late but) something like this can be used depending on the case:
my_dict_list = []
for i in X:
    # create a new feature dictionary
    feat_dict = {}
    # add the features that are straight forward
    feat_dict['last-name'] = feature_full_last_name(i)
    feat_dict['dummy'] = 1
    # for the features that have a list of values iterate over the values and
    # create a custom feature for each value
    for two_letters in feature_twoLetters(feature_full_last_name(i)):
        # make sure the naming is unique enough so that no other feature
        # unrelated to this will have the same name/ key
        feat_dict['two-letter-substrings-' + two_letters] = True
    # save it to the feature dictionary list that will be used in Dict vectorizer
    my_dict_list.append(feat_dict)

print my_dict_list

from sklearn.feature_extraction import DictVectorizer
dict_vect = DictVectorizer(sparse=False)
transformed_x = dict_vect.fit_transform(my_dict_list)
print transformed_x
Output:
[{'dummy': 1, u'two-letter-substrings-er': True, 'last-name': u'Anderson', u'two-letter-substrings-on': True, u'two-letter-substrings-de': True, u'two-letter-substrings-An': True, u'two-letter-substrings-rs': True, u'two-letter-substrings-nd': True, u'two-letter-substrings-so': True}, {'dummy': 1, u'two-letter-substrings-ee': True, u'two-letter-substrings-Le': True, 'last-name': u'Lee'}]
[[ 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1.]
[ 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.]]
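As a minimal follow-up sketch (assuming transformed_x and y were built from the same rows, with y = df.chineseScan.values as in the question), the resulting matrix can be fed straight into the classifier:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(transformed_x, y)
print(lr.score(transformed_x, y))  # training accuracy, just to sanity-check the pipeline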
Another thing you could do (but I don't recommend) if you don't want to create as many features as the values in your lists is something like this:
# sorting the values would be a good idea
feat_dict[frozenset(feature_twoLetters(feature_full_last_name(i)))] = True
# or
feat_dict[" ".join(feature_twoLetters(feature_full_last_name(i)))] = True
but the first one means that you can't have any duplicate values, and probably neither makes a good feature, especially if you need fine-grained and detailed ones. Also, they make it unlikely that two rows share the same combination of two-letter substrings, so the classification probably won't do well.
Output:
[{'dummy': 1, 'last-name': u'Anderson', frozenset([u'on', u'rs', u'de', u'nd', u'An', u'so', u'er']): True}, {'dummy': 1, 'last-name': u'Lee', frozenset([u'ee', u'Le']): True}]
[{'dummy': 1, 'last-name': u'Anderson', u'An nd de er rs so on': True}, {'dummy': 1, u'Le ee': True, 'last-name': u'Lee'}]
[[ 1. 0. 1. 1. 0.]
[ 0. 1. 1. 0. 1.]]