Replacing strings using regex in Pandas

In Pandas, why does the following not replace any strings containing an exclamation mark with whatever follows it?
In [1]: import pandas as pd
In [2]: ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
In [3]: ser
Out[3]:
0 Aland Islands !Åland Islands
1 Reunion !Réunion
2 Zimbabwe
dtype: object
In [4]: patt = r'.*!(.*)'
In [5]: repl = lambda m: m.group(1)
In [6]: ser.replace(patt, repl)
Out[6]:
0 Aland Islands !Åland Islands
1 Reunion !Réunion
2 Zimbabwe
dtype: object
Whereas the direct reference to the matched substring does work:
In [7]: ser.replace({patt: r'\1'}, regex=True)
Out[7]:
0 Åland Islands
1 Réunion
2 Zimbabwe
dtype: object
What am I doing wrong in the first case?

It appears that Series.replace does not support a callable as the replacement argument. Thus, all you can do is import the re library explicitly and use apply:
>>> import re
>>> #... your code ...
>>> ser.apply(lambda row: re.sub(patt, repl, row))
0 Åland Islands
1 Réunion
2 Zimbabwe
dtype: object

There are two replace methods in Pandas.
The one that acts directly on a Series can take a regex pattern string or a compiled regex and can act in-place, but doesn't allow the replacement argument to be a callable. You must set regex=True and use raw strings.
With:
import re
import pandas as pd
ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
Yes:
ser.replace(r'.*!(.*)', r'\1', regex=True, inplace=True)
ser.replace(r'.*!', '', regex=True, inplace=True)
regex = re.compile(r'.*!(.*)')
ser.replace(regex, r'\1', regex=True, inplace=True)
No:
repl = lambda m: m.group(1)
ser.replace(regex, repl, regex=True, inplace=True)
There's another, used as Series.str.replace. This one accepts a callable replacement but won't substitute in-place and doesn't take a regex argument (though regular expression pattern strings can be used):
Yes:
ser.str.replace(r'.*!', '')
ser.str.replace(r'.*!(.*)', r'\1')
ser.str.replace(regex, repl)
No:
ser.str.replace(regex, r'\1')
ser.str.replace(r'.*!', '', inplace=True)
I hope this is helpful to someone out there.
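Tying this back to the original question's data: Series.str.replace will accept the callable replacement that Series.replace rejects. A minimal sketch (note: on recent pandas versions str.replace takes a regex keyword that defaults to False, so it is passed explicitly here; older versions treated the pattern as a regex by default and the keyword can be dropped):
import pandas as pd

ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
repl = lambda m: m.group(1)

# str.replace passes each match object to the callable and substitutes its return value
print(ser.str.replace(r'.*!(.*)', repl, regex=True))
# 0    Åland Islands
# 1          Réunion
# 2         Zimbabwe
# dtype: object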

Try this snippet:
pattern = r'(.*)!'
ser.replace(pattern, '', regex=True)
In your case, you didn't set regex=True; it is False by default.

RegEx from PCRE engine does not match the same way in Pandas

So I have a pandas series like below
test_urls = pd.Series([
'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
'http://www.interactivedynamicvideo.com/',
'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
'HTTPS://github.com/keppel/pinn',
'Http://phys.org/news/2015-09-scale-solar-youve.html',
'https://iot.seeed.cc',
'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
'http://beta.crowdfireapp.com/?beta=agnipath',
'https://www.valid.ly?param',
'http://css-cursor.techstream.org'
])
To capture only the domains, I have used the below regex expression from PCRE:
(?<=\/\/)(\w+[-.]?\w+[.]?){2}
Now when I use this in Pandas, I get the unexpected result below:
test_urls_clean = test_urls.str.extract(r"(?<=\/\/)(\w+[-.]?\w+[.]?){2}", expand=False)
0
0 com
1 com
2 com
3 om
4 om
5 rg
6 cc
7 com
8 com
9 ly
10 techstream.org
But using the regex below, the correct results are fetched:
https?://([\w\-\.]+)
Any reason why this issue happens with Pandas?
If I'm understanding correctly...
The first regex (?<=\/\/)(\w+[-.]?\w+[.]?){2} is matching the url, but only capturing the TLD.
So .com etc. is the only part returned when using str.extract (extract the captured group only).
See https://regex101.com/r/aMKgFv/1 and notice the blue (match) vs the green (capture) in the highlighting of the urls.
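If you want to keep the original pattern's structure, one option (a sketch, not from the original answer) is to make the inner group non-capturing and wrap the repetition in an outer capturing group, so str.extract returns the whole match instead of only the last repetition:
# the inner (?:...) group no longer captures; the outer (...) captures the full
# two-part domain, so str.extract returns e.g. 'www.amazon.com' instead of 'com'
test_urls_clean = test_urls.str.extract(r"(?<=//)((?:\w+[-.]?\w+[.]?){2})", expand=False)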
Additional:
For getting the domain out of the string I would recommend avoiding regex and letting urllib do the work.
Code:
from urllib.parse import urlparse
import pandas as pd
def getDomain(s):
    return urlparse(s).netloc
test_urls = pd.Series([
'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
'http://www.interactivedynamicvideo.com/',
'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
'HTTPS://github.com/keppel/pinn',
'Http://phys.org/news/2015-09-scale-solar-youve.html',
'https://iot.seeed.cc',
'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
'http://beta.crowdfireapp.com/?beta=agnipath',
'https://www.valid.ly?param',
'http://css-cursor.techstream.org'
])
test_urls = test_urls.apply(getDomain)
print(test_urls)
0 www.amazon.com
1 www.interactivedynamicvideo.com
2 www.nytimes.com
3 evonomics.com
4 github.com
5 phys.org
6 iot.seeed.cc
7 www.bfilipek.com
8 beta.crowdfireapp.com
9 www.valid.ly
10 css-cursor.techstream.org
dtype: object

RegEx for matching the month, day and year

I'm trying to find a regular expression to extract the month, day and year from a datetime stamp in this format:
01/20/2019 12:34:54
It should return a list:
['01', '20', '2019']
I know this can be solved using:
dt.split(' ')[0].split('/')
But, I'm trying to find a regex to do it:
[^\/\s]+
But, I need it to exclude everything after the space.
As you are expecting the month, day and year to be returned as a list, you can use this Python code:
import re
s = '01/20/2019 12:34:54'
print(re.findall(r'\d+(?=[ /])', s))
Prints,
['01', '20', '2019']
Otherwise, you can better write your regex as,
(\d{2})/(\d{2})/(\d{4})
And get the month, day and year from group 1, group 2 and group 3.
Regex Demo
Python code in this way should be,
import re
s = '01/20/2019 12:34:54'
m = re.search(r'(\d{2})/(\d{2})/(\d{4})', s)
if m:
    print([m.group(1), m.group(2), m.group(3)])
Prints,
['01', '20', '2019']
You should absolutely be taking advantage of Python's date/time API here. Use strptime to parse your input datetime string to a bona fide Python datetime. Then, just build a list, accessing the various components you need.
dt = "01/20/2019 12:34:54"
dto = datetime.strptime(dt, '%m/%d/%Y %H:%M:%S')
list = [dto.month, dto.day, dto.year]
print(list)
[1, 20, 2019]
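If the zero-padded string form from the question is needed rather than integers, strftime can format the components back out (continuing from the dto above):
print([dto.strftime('%m'), dto.strftime('%d'), dto.strftime('%Y')])
# ['01', '20', '2019']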
If you really want/need to work with the original datetime string, then split provides an option, without even formally using a regex:
dt = "01/20/2019 12:34:54"
dt = dt.split()[0].split('/')
print(dt)
['01', '20', '2019']
This RegEx might help you to do so.
([0-9]+)\/([0-9]+)\/([0-9]+)\s[0-9]+:[0-9]+:[0-9]+
Code:
import re
string = '01/20/2019 12:34:54'
matches = re.search(r'([0-9]+)/([0-9]+)/([0-9]+)', string)
if matches:
    print([matches.group(1), matches.group(2), matches.group(3)])
else:
    print('Sorry! No matches! Something is not right! Call 911')
Output
['01', '20', '2019']

rdflib SPARQL queries (involving subclasses) not behaving as expected

I have just started to self-learn RDF and the Python rdflib library. But I have hit an issue with the following code. I expected the queries to return mark and nat as Persons and only natalie as a Professor. I can't see where I've gone wrong. (BTW, I know Professor should be a title rather than a kind of Person, but I'm just tinkering ATM.) Any help appreciated. Thanks.
rdflib 4, Python 2.7
>>> from rdflib import Graph, BNode, URIRef, Literal
INFO:rdflib:RDFLib Version: 4.2.
>>> from rdflib.namespace import RDF, RDFS, FOAF
>>> G = Graph()
>>> mark = BNode()
>>> nat = BNode()
>>> G.add((mark, RDF.type, FOAF.Person))
>>> G.add((mark, FOAF.firstName, Literal('mark')))
>>> G.add((URIRef('Professor'), RDF.type, RDFS.Class))
>>> G.add((URIRef('Professor'), RDFS.subClassOf, FOAF.Person))
>>> G.add((nat, RDF.type, URIRef('Professor')))
>>> G.add((nat, FOAF.firstName, Literal('natalie')))
>>> qres = G.query(
"""SELECT DISTINCT ?aname
WHERE {
?a rdf:type foaf:Person .
?a foaf:firstName ?aname .
}""", initNs = {"rdf": RDF,"foaf": FOAF})
>>> for row in qres:
...     print "%s is a person" % row
...
mark is a person
>>> qres = G.query(
"""SELECT DISTINCT ?aname
WHERE {
?a rdf:type ?prof .
?a foaf:firstName ?aname .
}""", initNs = {"rdf": RDF,"foaf": FOAF, "prof": URIRef('Professor')})
>>> for row in qres:
...     print "%s is a Prof" % row
...
natalie is a Prof
mark is a Prof
>>>
Your query is bringing back all types because the ?prof variable isn't bound to a value.
I think what you want is the initBindings kwarg, which passes the URI for 'Professor' into your query. Changing your query as below retrieves just natalie.
qres = G.query(
"""SELECT DISTINCT ?aname
WHERE {
?a rdf:type ?prof .
?a foaf:firstName ?aname .
}""",
initNs ={"rdf": RDF,"foaf": FOAF},
initBindings={"prof": URIRef('Professor')})

python/pandas: need help adding double quotes to columns

I need to add double quotes to specific columns in a csv file that my script generates.
Below is the goofy way I thought of doing this. For these two fixed-width fields, it works:
df['DATE'] = df['DATE'].str.ljust(9,'"')
df['DATE'] = df['DATE'].str.rjust(10,'"')
df['DEPT CODE'] = df['DEPT CODE'].str.ljust(15,'"')
df['DEPT CODE'] = df['DEPT CODE'].str.rjust(16,'"')
For the following field, it doesn't. It has a variable length. So, if the value is shorter than the standard 6-digits, I get extra double-quotes: "5673"""
df['ID'] = df['ID'].str.ljust(7,'"')
df['ID'] = df['ID'].str.rjust(8,'"')
I have tried zfill, but the data in the column is a Series -- I get "pandas.core.series.Series" when I run
print type(df['ID'])
and I have not been able to convert it to string using astype. I'm not sure why. I have not imported numpy.
I tried using len() to get the length of the ID number and pass it to str.ljust and str.rjust as its first argument, but I think it got hung up on the data not being a string.
Is there a simpler way to apply double-quotes as I need, or is the zfill going to be the way to go?
You can add a speech mark before / after:
In [11]: df = pd.DataFrame([["a"]], columns=["A"])
In [12]: df
Out[12]:
A
0 a
In [13]: '"' + df['A'] + '"'
Out[13]:
0 "a"
Name: A, dtype: object
Assigning this back:
In [14]: df['A'] = '"' + df.A + '"'
In [15]: df
Out[15]:
A
0 "a"
If it's for exporting to csv you can use the quoting kwarg:
In [21]: df = pd.DataFrame([["a"]], columns=["A"])
In [22]: df.to_csv()
Out[22]: ',A\n0,a\n'
In [23]: df.to_csv(quoting=1)
Out[23]: '"","A"\n"0","a"\n'
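For readability, quoting=1 is csv.QUOTE_ALL; the named constants from the csv module make the intent clearer (a small sketch, same df as above):
import csv

df.to_csv(quoting=csv.QUOTE_ALL)         # quote every field
df.to_csv(quoting=csv.QUOTE_NONNUMERIC)  # quote only non-numeric fields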
With numpy, not pandas, you can specify the formatting method when saving to a csv file. As a very simple example:
In [209]: np.savetxt('test.txt',['string'],fmt='%r')
In [210]: cat test.txt
'string'
In [211]: np.savetxt('test.txt',['string'],fmt='"%s"')
In [212]: cat test.txt
"string"
I would expect the pandas csv writer to have a similar degree of control, if not more.

DictVectorizer for a list as one feature in Python Pandas and scikit-learn

I have been trying to solve this for days, and although I have found a similar problem here How can i vectorize list using sklearn DictVectorizer, the solution is overly simplified.
I would like to fit some features into a logistic regression model to predict 'chinese' or 'non-chinese'. I have a raw_name from which I will extract two features: 1) the last name, and 2) a list of substrings of the last name; for example, 'Chan' will give ['ch', 'ha', 'an']. But it seems DictVectorizer doesn't take list values as part of the dictionary. From the link above, I tried to create a function list_to_dict, which successfully returns some dict elements,
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
but I have no idea how to incorporate that into the my_dict = ... before applying the DictVectorizer.
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
lr = LogisticRegression()
dv = DictVectorizer()
# Get csv file into data frame
data = pd.read_csv("V2-1_2000Records_Processed_SEP2015.csv", header=0, encoding="utf-8")
df = DataFrame(data)
# Pandas data frame shuffling
df_shuffled = df.iloc[np.random.permutation(len(df))]
df_shuffled.reset_index(drop=True)
# Assign X and y variables
X = df.raw_name.values
y = df.chineseScan.values
# Feature extraction functions
def feature_full_last_name(nameString):
    try:
        last_name = nameString.rsplit(None, 1)[-1]
        if len(last_name) > 1:  # not accept name with only 1 character
            return last_name
        else: return None
    except: return None

def feature_twoLetters(nameString):
    placeHolder = []
    try:
        for i in range(0, len(nameString)):
            x = nameString[i:i+2]
            if len(x) == 2:
                placeHolder.append(x)
        return placeHolder
    except: return []

def list_to_dict(substring_list):
    try:
        substring_dict = {}
        for i in substring_list:
            substring_dict['substring='+str(i)] = True
        return substring_dict
    except: return None
list_example = ['co', 'or', 'rn', 'ns']
print list_to_dict(list_example)
# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'two-letter-substrings': feature_twoLetters(feature_full_last_name(i)),
'last-name': feature_full_last_name(i), 'dummy': 1} for i in X]
print my_dict[3]
Output:
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
{'dummy': 1, 'two-letter-substrings': [u'co', u'or', u'rn', u'ns'], 'last-name': u'corns'}
Sample data:
Raw_name chineseScan
Jack Anderson non-chinese
Po Lee chinese
If I have understood correctly you want a way to encode list values in order to have a feature dictionary that DictVectorizer could use. (One year too late but) something like this can be used depending on the case:
my_dict_list = []
for i in X:
    # create a new feature dictionary
    feat_dict = {}
    # add the features that are straight forward
    feat_dict['last-name'] = feature_full_last_name(i)
    feat_dict['dummy'] = 1
    # for the features that have a list of values iterate over the values and
    # create a custom feature for each value
    for two_letters in feature_twoLetters(feature_full_last_name(i)):
        # make sure the naming is unique enough so that no other feature
        # unrelated to this will have the same name/ key
        feat_dict['two-letter-substrings-' + two_letters] = True
    # save it to the feature dictionary list that will be used in Dict vectorizer
    my_dict_list.append(feat_dict)

print my_dict_list

from sklearn.feature_extraction import DictVectorizer
dict_vect = DictVectorizer(sparse=False)
transformed_x = dict_vect.fit_transform(my_dict_list)
print transformed_x
Output:
[{'dummy': 1, u'two-letter-substrings-er': True, 'last-name': u'Anderson', u'two-letter-substrings-on': True, u'two-letter-substrings-de': True, u'two-letter-substrings-An': True, u'two-letter-substrings-rs': True, u'two-letter-substrings-nd': True, u'two-letter-substrings-so': True}, {'dummy': 1, u'two-letter-substrings-ee': True, u'two-letter-substrings-Le': True, 'last-name': u'Lee'}]
[[ 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1.]
[ 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.]]
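As a minimal follow-up sketch (assuming transformed_x and y were built from the same rows, with y = df.chineseScan.values as in the question), the resulting matrix can be fed straight into the classifier:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(transformed_x, y)
print(lr.score(transformed_x, y))  # training accuracy, just to sanity-check the pipeline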
Another thing you could do (but I don't recommend) if you don't want to create as many features as the values in your lists is something like this:
# sorting the values would be a good idea
feat_dict[frozenset(feature_twoLetters(feature_full_last_name(i)))] = True
# or
feat_dict[" ".join(feature_twoLetters(feature_full_last_name(i)))] = True
but the first one means that you can't have any duplicate values, and probably neither makes a good feature, especially if you need fine-grained and detailed ones. Also, they make it unlikely that two rows share the same combination of two-letter substrings, so the classification probably won't do well.
Output:
[{'dummy': 1, 'last-name': u'Anderson', frozenset([u'on', u'rs', u'de', u'nd', u'An', u'so', u'er']): True}, {'dummy': 1, 'last-name': u'Lee', frozenset([u'ee', u'Le']): True}]
[{'dummy': 1, 'last-name': u'Anderson', u'An nd de er rs so on': True}, {'dummy': 1, u'Le ee': True, 'last-name': u'Lee'}]
[[ 1. 0. 1. 1. 0.]
[ 0. 1. 1. 0. 1.]]