RegEx from PCRE engine does not match RegEx in Pandas - regex

So I have a pandas Series like the one below:
test_urls = pd.Series([
'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
'http://www.interactivedynamicvideo.com/',
'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
'HTTPS://github.com/keppel/pinn',
'Http://phys.org/news/2015-09-scale-solar-youve.html',
'https://iot.seeed.cc',
'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
'http://beta.crowdfireapp.com/?beta=agnipath',
'https://www.valid.ly?param',
'http://css-cursor.techstream.org'
])
To capture only the domains, I have used the regex below, written for the PCRE engine:
(?<=\/\/)(\w+[-.]?\w+[.]?){2}
Now when I use this in Pandas, I get the unexpected result below:
test_urls_clean = test_urls.str.extract(r"(?<=\/\/)(\w+[-.]?\w+[.]?){2}", expand=False)
0
0 com
1 com
2 com
3 om
4 om
5 rg
6 cc
7 com
8 com
9 ly
10 techstream.org
But using the regex below, the correct results are returned:
https?://([\w\-\.]+)
Any reason why this issue happens with Pandas?

If I'm understanding correctly...
The first regex (?<=\/\/)(\w+[-.]?\w+[.]?){2} is matching the whole domain, but because the capture group itself is repeated with {2}, the group only retains its last repetition, i.e. the TLD.
So .com etc. is the only part returned by str.extract, which returns the captured group, not the whole match.
See https://regex101.com/r/aMKgFv/1 and notice the blue (match) vs the green (capture) in the highlighting of the urls.
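If you want to keep the lookbehind approach, one fix (a minimal sketch, assuming the same test_urls Series as above) is to make the repeated group non-capturing and wrap the whole repetition in a single outer capture group:
# the outer group now captures both repetitions, e.g. 'www.amazon.com'
test_urls_clean = test_urls.str.extract(r"(?<=\/\/)((?:\w+[-.]?\w+[.]?){2})", expand=False)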
Additional:
For getting the domain out of the string I would recommend avoiding regex and letting urllib do the work.
Code:
from urllib.parse import urlparse
import pandas as pd

def getDomain(s):
    # netloc is the network-location (domain) part of the URL
    return urlparse(s).netloc
test_urls = pd.Series([
'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
'http://www.interactivedynamicvideo.com/',
'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
'HTTPS://github.com/keppel/pinn',
'Http://phys.org/news/2015-09-scale-solar-youve.html',
'https://iot.seeed.cc',
'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
'http://beta.crowdfireapp.com/?beta=agnipath',
'https://www.valid.ly?param',
'http://css-cursor.techstream.org'
])
test_urls = test_urls.apply(getDomain)
print(test_urls)
0 www.amazon.com
1 www.interactivedynamicvideo.com
2 www.nytimes.com
3 evonomics.com
4 github.com
5 phys.org
6 iot.seeed.cc
7 www.bfilipek.com
8 beta.crowdfireapp.com
9 www.valid.ly
10 css-cursor.techstream.org
dtype: object

Related

Merging two Pandas Dataframes using Regular Expressions

I'm new to Python and Pandas, but I'm trying to use Pandas DataFrames to merge two dataframes based on regular expressions.
I have one dataframe with some 2 million rows. This table contains data about cars, but the model name is often specified in, let's say, a creative way, e.g. 'Audi A100', 'Audi 100', 'Audit 100 Quadro', or just 'A 100'. And the same for other brands. This is stored in a column called "Model". A second column holds the manufacturer.
Index  Model        Manufacturer
0      A 100        Audi
1      A100 Quadro  Audi
2      Audi A 100   Audi
...    ...          ...
To clean up the data I created about 1000 regular expressions to search for some key words and stored them in a dataframe called 'regex'. In a second column of this table I store the manufacturer. This value is used in a second step to validate the result.
Index  RegEx        Manufacturer
0      .* A100 .*   Audi
1      .* A 100 .*  Audi
2      .* C240 .*   Mercedes
3      .* ID3 .*    Volkswagen
I hope you get the idea.
As far as I understand, the Pandas merge() function does not work with regular expressions. Therefore I use a loop to process the list of regular expressions, then use the match function to locate matching rows in the cars DataFrame and assign the successfully used RegEx and the suggested manufacturer.
I added two additional columns, 'RegEx' and 'Manufacturer', to the cars table.
for index, row in regex.iterrows():
    cars.loc[cars['Model'].str.match(row['RegEx']), 'RegEx'] = row['RegEx']
    cars.loc[cars['Model'].str.match(row['RegEx']), 'Manufacturer'] = row['Manufacturer']
I learned that iterrows should not be used for performance reasons. It takes 8 minutes to finish the loop, which isn't too bad. However, is there a better way to get it done?
Kind regards
Jiriki
I have no idea if it would be faster (I'd be glad if you would test it), but it doesn't use iterrows():
regex.groupby(["RegEx", "Manufacturer"])["RegEx"]\
.apply(lambda x: cars.loc[cars['Model'].str.match(x.iloc[0])])
EDIT: Code for reproduction:
cars = pd.DataFrame({"Model": ["A 100", "A100 Quatro", "Audi A 100", "Passat V", "Passat Gruz"],
                     "Manufacturer": ["Audi", "Audi", "Audi", "VW", "VW"]})
regex = pd.DataFrame({"RegEx": [".*A100.*", ".*A 100.*", ".*Passat.*"],
                      "Manufacturer": ["Audi", "Audi", "VW"]})
#Output:
# Model Manufacturer
#RegEx Manufacturer
#.*A 100.* Audi 0 A 100 Audi
# 2 Audi A 100 Audi
#.*A100.* Audi 1 A100 Quatro Audi
#.*Passat.* VW 3 Passat V VW
# 4 Passat Gruz VW
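Another way to drop iterrows() while keeping the question's logic, shown here as a minimal sketch (same column names as in the question, not benchmarked), is to zip the two columns and compute each boolean mask only once:
for pat, manu in zip(regex['RegEx'], regex['Manufacturer']):
    hit = cars['Model'].str.match(pat)  # compute each mask once
    cars.loc[hit, 'RegEx'] = pat
    cars.loc[hit, 'Manufacturer'] = manu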

Extract the second instance of the website pattern in a string using pandas str.contains

I am trying to extract the second instance of a www URL from the string below. This is in a pandas dataframe.
https://google.com/url?q=https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD
So I want to extract the following string and store it in a separate column.
https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD
Final Dataframe:
sr.no  link_orig           link_extracted
1      <the above string>  <the extracted string starting at https://www.accenture.com>
Below is the code snippet:
df['link_extracted'] = df['link_orig'].str.contains('www.accenture.com', regex=False, na=np.NaN)
I am getting the following error:
ValueError: Cannot mask with non-boolean array containing NA / NaN values
What am I missing here? If I have to use regex, then what should be the approach?
The error message means you probably have NaNs in the link_orig column. That can be fixed by adding a fillna('') to your code.
Something like
df['link_extracted'] = df['link_orig'].fillna('').str.contains ...
That said, I'm not sure the rest of your code will do what you want. That will just return True if www.accenture.com is anywhere in the link_orig string.
If the link you are trying to extract always contains www.accenture.com then you can do this
df['link_extracted'] = df['link_orig'].fillna('').str.extract('(www\.accenture\.com.*)')
Personally, I'd use Series.str.extract() for this. E.g.:
df['link_extracted'] = df['link_orig'].str.extract('http.*(http.*)')
This matches http, followed by anything, then captures http followed by anything.
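For instance, applied to the sample string (a quick check; the one-row DataFrame is built inline just for illustration):
import pandas as pd

url = ('https://google.com/url?q=https://www.accenture.com/in-en/insights/'
       'software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1'
       'AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD')
df = pd.DataFrame({'link_orig': [url]})
# expand=False returns a Series; the greedy first 'http.*' backtracks
# to the last 'http', i.e. the embedded URL
df['link_extracted'] = df['link_orig'].str.extract('http.*(http.*)', expand=False)
print(df['link_extracted'].iloc[0])  # https://www.accenture.com/in-en/insights/...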
An alternate approach would be to use urlparse.
You can use urllib.parse module:
import pandas as pd
from urllib.parse import urlparse, parse_qs
url = 'https://google.com/url?q=https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD'
df = pd.DataFrame({'sr.no':[1], 'link_orig':[url]})
extract_q = lambda url: parse_qs(urlparse(url).query)['q'][0]
df['link_extracted'] = df['link_orig'].apply(extract_q)
Output:
>>> df
sr.no link_orig link_extracted
0 1 https://google.com/url?q=https://www.accenture... https://www.accenture.com/in-en/insights/softw...

how to remove everything but letters, numbers and ! ? . ; , # ' using regex in python pandas df?

I am trying to remove everything but letters, numbers and ! ? . ; , # ' from the text column of my pandas DataFrame.
I have already read some other questions on the topic, but still cannot make mine work.
Here is an example of what I am doing:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'text': ['hey+ guys! wuzup',
                            'hello p3ople!What\'s up?',
                            'hey, how- thing == do##n',
                            'my name is bond, james b0nd']}
                  )
Then we have the following table:
id text
1 hey+ guys! wuzup
2 hello p3ople!What\'s up?
3 hey, how- thing == do##n
4 my name is bond, james b0nd
Now, trying to remove everything but letters, numbers and ! ? . ; , # '
First try:
df.loc[:,'text'] = df['text'].str.replace(r"^(?!(([a-zA-z]|[\!\?\.\;\,\#\'\"]|\d))+)$",' ',regex=True)
output
id text
1 hey+ guys! wuzup
2 hello p3ople!What's up?
3 hey, how- thing == do##n
4 my name is bond, james b0nd
Second try
df.loc[:,'text'] = df['text'].str.replace(r"(?i)\b(?:(([a-zA-Z\!\?\.\;\,\#\'\"\:\d])))",' ',regex=True)
output
id text
1 ey+ uys uzup
2 ello 3ople hat p
3 ey ow- hing == o##
4 y ame s ond ames 0nd
Third try
df.loc[:,'text'] = df['text'].str.replace(r'(?i)(?<!\w)(?:[a-zA-Z\!\?\.\;\,\#\'\"\:\d])',' ',regex=True)
output
id text
1 ey+ uys! uzup
2 ello 3ople! hat' p?
3 ey, ow- hing == o##
4 y ame s ond, ames 0nd
Afterwards, I also tried the re.sub() function with the same regex patterns, but still did not manage to get the expected result, which is as follows:
id text
1 hey guys! wuzup
2 hello p3ople!What's up?
3 hey, how- thing don
4 my name is bond, james b0nd
Can anyone help me with that?
Links that I have seen over the topic:
Is there a way to remove everything except characters, numbers and '-' from a string
How do check if a text column in my dataframe, contains a list of possible patterns, allowing mistyping?
removing newlines from messy strings in pandas dataframe cells?
https://stackabuse.com/using-regex-for-text-manipulation-in-python/
Is this what you are looking for?
df.text.str.replace(r"(?i)[^0-9a-z!?.;,#' -]", '', regex=True)
Out:
0 hey guys! wuzup
1 hello p3ople!What's up?
2 hey, how- thing don
3 my name is bond, james b0nd
Name: text, dtype: object
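Since re.sub() was also mentioned, the same negated character class works outside pandas too; a minimal sketch:
import re

# drop every character that is not alphanumeric, ! ? . ; , # ' or a space/hyphen
clean = re.sub(r"(?i)[^0-9a-z!?.;,#' -]", "", "hey+ guys! wuzup")
print(clean)  # hey guys! wuzup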

Getting proper length of emojis

I noticed that while you are typing emojis in a phone message, some of them take 1 character and some take 2. For example, "♊" takes 1 char but "😁" takes 2. In Python, I'm trying to get the length of emojis and I'm getting:
len("♊") # 3
len("😁") # 4
len(unicode("♊", "utf-8")) # 1 OH IT WORKS!
len(unicode("😁", "utf-8")) # 1 Oh wait, no it doesn't.
Any ideas?
This site shows the emoji's length in its Character.charCount() row: http://www.fileformat.info/info/unicode/char/1F601/index.htm
Read sys.maxunicode:
An integer giving the value of the largest Unicode code point, i.e.
1114111 (0x10FFFF in hexadecimal).
Changed in version 3.3: Before PEP 393, sys.maxunicode used to
be either 0xFFFF or 0x10FFFF, depending on the configuration
option that specified whether Unicode characters were stored as
UCS-2 or UCS-4.
The following script should work in both Python versions 2 and 3:
# coding=utf-8
from __future__ import print_function
import sys, platform, unicodedata

print(platform.python_version(), 'maxunicode', hex(sys.maxunicode))
tab = '\t'
unistr = u'\u264a \U0001f601'  ### unistr = u'♊ 😁'
print(len(unistr), tab, unistr, tab, repr(unistr))
for char in unistr:
    print(len(char), tab, char, tab, repr(char), tab,
          unicodedata.category(char), tab, unicodedata.name(char, 'private use'))
The output shows the consequence of different sys.maxunicode values. For instance, the 😁 character (Unicode code point 0x1F601, above the Basic Multilingual Plane) is converted to the corresponding surrogate pair (code points u'\ud83d' and u'\ude01') if sys.maxunicode is 0xFFFF:
PS D:\PShell> [System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8
PS D:\PShell> . py -3 D:\test\Python\Py\42783173.py
3.5.1 maxunicode 0x10ffff
3 ♊ 😁 '♊ 😁'
1 ♊ '♊' So GEMINI
1 ' ' Zs SPACE
1 😁 '😁' So GRINNING FACE WITH SMILING EYES
PS D:\PShell> . py -2 D:\test\Python\Py\42783173.py
2.7.12 maxunicode 0xffff
4 ♊ 😁 u'\u264a \U0001f601'
1 ♊ u'\u264a' So GEMINI
1 u' ' Zs SPACE
1 �� u'\ud83d' Cs private use
1 �� u'\ude01' Cs private use
Note: above output examples were taken from Unicode-aware Powershell-ISE console pane.
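For reference, on Python 3.3+ (PEP 393), strings are sequences of code points and sys.maxunicode is always 0x10FFFF, so both characters have length 1; a quick check:
import sys

print(hex(sys.maxunicode))  # 0x10ffff
print(len('\u264a'))        # 1  (GEMINI)
print(len('\U0001f601'))    # 1  (GRINNING FACE WITH SMILING EYES)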

In R, use regular expression to match multiple patterns and add new column to list

I've found numerous examples of how to match and update an entire list with one pattern and one replacement, but what I am looking for now is a way to do this for multiple patterns and multiple replacements in a single statement or loop.
Example:
> print(recs)
phonenumber amount
1 5345091 200
2 5386052 200
3 5413949 600
4 7420155 700
5 7992284 600
I would like to insert a new column called 'service_provider', with numbers matching /^5/ as Company1 and numbers matching /^7/ as Company2.
I can do this with the following two lines of R:
recs$service_provider[grepl("^5", recs$phonenumber)]<-"Company1"
recs$service_provider[grepl("^7", recs$phonenumber)]<-"Company2"
Then I get:
phonenumber amount service_provider
1 5345091 200 Company1
2 5386052 200 Company1
3 5413949 600 Company1
4 7420155 700 Company2
5 7992284 600 Company2
I'd like to provide a list, rather than a discrete set of grepl's, so it is easier to keep country-specific information in one place, and all the programming logic in another.
thisPhoneCompanies<-list(c('^5','Company1'),c('^7','Company2'))
In other languages I would use a for loop over the phone company list:
for every row in thisPhoneCompanies
    add the service provider to matched entries in recs (as in the grepl statements)
end loop
But I understand that isn't the way to do it in R.
Using stringi:
library(stringi)
recs$service_provider <- stri_replace_all_regex(str = recs$phonenumber,
                                                pattern = c('^5.*', '^7.*'),
                                                replacement = c('Company1', 'Company2'),
                                                vectorize_all = FALSE)
recs
# phonenumber amount service_provider
# 1 5345091 200 Company1
# 2 5386052 200 Company1
# 3 5413949 600 Company1
# 4 7420155 700 Company2
# 5 7992284 600 Company2
Thanks to @thelatemail
Looks like if I use a dataframe instead of a list for the phone companies:
phcomp <- data.frame(ph=c(5,7),comp=c("Company1","Company2"))
I can match and add a new column to my list of phone numbers in a single command (using the match function).
recs$service_provider <- phcomp$comp[match(substr(recs$phonenumber,1,1), phcomp$ph)]
Looks like I lose the ability to use regular expressions, but the matching here is very simple, just the first digit of the phone number.