Splitting a string in Python based on a regex pattern - regex

I have a bytes object that contains urls:
> body.decode("utf-8")
> 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
I need to split it into a list with each url as a separate element:
import re
pattern = '^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$'
urls = re.compile(pattern).split(body.decode("utf-8"))
What I get is a list of one element with all urls pasted together:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n']
How do I split each url into a separate element?

Try splitting on \s+ (one or more whitespace characters).
Here is some sample Python code:
import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
urls = re.compile(r'\s+').split(s)
print(urls)
This outputs,
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/', '']
Does this result look OK? If not, we can refine it further.
If you don't want the empty string ('') in the result list (it is caused by the trailing \r\n), you can use findall to find all the URLs in the string instead. Sample Python code:
import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
urls = re.findall(r'http.*?(?=\s+)', s)
print(urls)
This gives the following output:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/']
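For this particular input no regex is strictly needed. As a minimal sketch, assuming the URLs themselves never contain whitespace, calling str.split() with no arguments splits on runs of whitespace and does not produce the trailing empty string:

```python
# The sample string from the question, already decoded from bytes.
s = ('https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\n'
     'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n')

# With no arguments, str.split() treats any run of whitespace as one
# separator and ignores leading/trailing whitespace, so no '' appears.
urls = s.split()
print(urls)
```

The regex versions above remain the better fit if the separator ever becomes something other than plain whitespace.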

Related

Is there a regular expression for finding all question sentences from a webpage?

I am trying to extract some questions from a web site using BeautifulSoup, and want to use regular expression to get these questions from the web. Is my regular expression incorrect? And how can I combine soup.find_all with re.compile?
I have tried the following:
from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
import urllib
import re
url = "https://www.sanfoundry.com/python-questions-answers-variable-names/"
headers = {'User-Agent':'Mozilla/5.0'}
page = requests.get(url)
soup = BeautifulSoup(page.text, "lxml")
a = soup.find_all("p")
for m in a:
    print(m.get_text())
Now I have some text containing the questions like "1. Is Python case sensitive when dealing with identifiers?". I want to use r"[^.!?]+\?" to filter out the unwanted text, but I have the following error:
a = soup.find_all("p" : re.compile(r'[^.!?]+\?'))
a = soup.find_all("p" : re.compile(r'[^.!?]+\?'))
^
SyntaxError: invalid syntax
I checked my regular expression on https://regex101.com, it seems right. Is there a way to combine the regular expression and soup.find_all together?
One way to find p elements containing a ? is to
define a criterion function:
def criterion(tag):
    return tag.name == 'p' and re.search(r'\?', tag.text)
and use it in find_all:
pars = soup.find_all(criterion)
But you want to print only questions, not the whole paragraphs
from pars.
To match these questions, define a pattern:
pat = re.compile(r'\d+\.\s[^?]+\?')
(a sequence of digits, a dot, a space, then a sequence of chars other
than ? and finally a ?).
Note that, in general, one paragraph may contain multiple
questions, so the loop that processes the paragraphs should:
- use findall to find all questions in the current paragraph
  (the result is a list of the matched strings),
- print all of them on separate lines, using join with \n
  as the separator.
So the whole loop should be:
for m in pars:
    questions = pat.findall(m.get_text())
    print('\n'.join(questions))
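To see what the pattern does on its own, without fetching the page, here is a small sketch that applies it to a made-up paragraph string. The paragraph text is hypothetical and only imitates the layout of the quiz page (a numbered question followed by answer options):

```python
import re

# Same pattern as above: digits, a dot, one whitespace character,
# then everything up to and including the next question mark.
pat = re.compile(r'\d+\.\s[^?]+\?')

# Hypothetical paragraph text, invented for illustration.
paragraph = ("1. Is Python case sensitive when dealing with identifiers? a) yes b) no "
             "2. What is the maximum possible length of an identifier? a) 31 b) 63")

# findall returns every non-overlapping match as a list of strings.
questions = pat.findall(paragraph)
print('\n'.join(questions))
```

The answer options are skipped because none of them has the "digits, dot, whitespace" prefix followed by a question mark.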
Not a big regex fan, so tried this:
for q in a:
    for i in q:
        if '?' in i:
            print(i)
Output:
1. Is Python case sensitive when dealing with identifiers?
2. What is the maximum possible length of an identifier?
3. Which of the following is invalid?
4. Which of the following is an invalid variable?
5. Why are local variable names beginning with an underscore discouraged?
6. Which of the following is not a keyword?
8. Which of the following is true for variable names in Python?
9. Which of the following is an invalid statement?
10. Which of the following cannot be a variable?

How do you extract a certain part of a URL in Python?

I am looking to extract only a portion of a patterned URL:
https://<some_server>/server/rest/services/<that part I need>/<other data I do not need>/...
I just need to extract the portion after 'rest/services/'. The first part can change and the last part can change, but the URL will always contain 'rest/services/', and the part I need follows it, terminated by the next '/'.
What you can do is turn this string into a list by splitting it, and then access the nth part.
m = 'https://<some_server>/server/rest/services/<that part I need>/<other data I do not need>'
a = m.split('/')
a[6]  # this is the part that you want
You could try this:
(?<=rest/services/) # look behind for this
[^/]+ # anything except a '/'
import re
text = 'https://<some_server>/server/rest/services/<that part I need>/<other data I do not need>'
rgxp = re.compile(r'(?<=rest/services/)[^/]+')
print(rgxp.findall(text))
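Since the question says the part before 'rest/services/' can vary, a fixed list index may not always line up. A possible alternative without regex, sketched here with a hypothetical URL on example.com, is to cut the string at the marker with str.partition and then take everything up to the next slash:

```python
# Hypothetical URL, used only for illustration.
url = 'https://example.com/server/rest/services/Folder/MapServer/0'
marker = 'rest/services/'

# partition returns (text before marker, marker, text after marker);
# the middle element is '' when the marker is not present.
before, found, after = url.partition(marker)

# Keep only the first path segment after the marker.
part = after.split('/', 1)[0] if found else None
print(part)
```

This does not depend on how many path segments precede the marker, which is the same property the lookbehind regex has.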

Regular expression: Matching multiple occurrences in the same line

I have a string that I need to match using regex. It works perfectly fine when I have a single occurrence in a single line, however, when there are multiple occurrences of the same string in a single line I'm not getting any matches. Can you please help?
Sample strings:
MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T
Regex that I tried:
(([A-Z]{2}[0-9]{8,9}[A-Z]{1})|([A-Z]{2}[0-9]{8,9}))
This seems to work fine:
a = '''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''
import re
patterns = ['[A-Z]{2}[0-9]{8,9}[A-Z]{1}','[A-Z]{2}[0-9]{8,9}']
pattern = '({})'.format(')|('.join(patterns))
matches = re.findall(pattern, a)
print([match for sub in matches for match in sub if match])
#['MS17010314', 'MS00030208', 'IL00171198', 'IH09850115', 'IH99400409',
# 'IH99410409', 'IL01771010', 'IL01791002', 'IL01930907', 'IL02360907',
# 'CM00010904', 'IH09520115', 'MS00201285', 'MS19050708', 'MS00370489',
# 'MS19011285T']
I've added a way to combine all patterns.
I tried this in Python and the following code worked:
import re
s='''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''
# the two alternatives from the question; the longer one goes first
a = '[A-Z]{2}[0-9]{8,9}[A-Z]{1}'
b = '[A-Z]{2}[0-9]{8,9}'
lst_of_regex = [a,b]
pattern = '|'.join(lst_of_regex)
print(re.findall(pattern,s))
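The two alternatives differ only in the optional trailing letter, so they can probably be folded into one pattern with no capturing groups. A minimal sketch on the sample data from the question:

```python
import re

s = '''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''

# Two letters, 8 or 9 digits, then an optional trailing letter.
pattern = re.compile(r'[A-Z]{2}[0-9]{8,9}[A-Z]?')

# With no groups in the pattern, findall returns the full matches
# as plain strings, one per occurrence, across all lines.
codes = pattern.findall(s)
print(codes)
```

Because there are no groups, there is no need to flatten tuples or filter out empty strings afterwards.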

Python Regex List into Another List

I currently have a piece of code that mostly runs as I would expect, except that it prints out both the original list and the filtered one. Essentially, what I am trying to do is read URLs from a webpage and store them in a list (called match; this part works fine) and then filter that list into a new list (called fltrmtch), because the original contains all of the extra href tags etc.
For example, at the moment it prints out A and B, but I'm only after B:
A Core Development',
B'http://docs.python.org/devguide/'),
Here's the code:
url = "URL WOULD BE IN HERE BUT NOT ALLOWED TO POST MULTIPLE LINKS" #Name of the url being searched
webpage = urllib.urlopen(url)
content = webpage.read() #places the read url contents into variable content
import re # Imports the re module which allows searching for matches.
import pprint # This import allows all list items to be printed on separate lines.
match = re.findall(r'\<a.*href\=.*http\:.+', content)#matches any content that begins with a href and ends in >
def filterPick(list, filter):
    return [( l, m.group(1) ) for l in match for m in (filter(l),) if m]
regex=re.compile(r'\"(.+?)\"').search
fltrmtch = filterPick(match, regex)
try:
    if match: # defines that if there is a match the below is run.
        print "The number of URL's found is:" , len(match)
        match.sort()
        print "\nAnd here are the URL's found: "
        pprint.pprint(fltrmtch)
except:
    print "No URL matches have been found, please try again!"
Any help would be much appreciated.
Thank you in advance.
UPDATE: Thank you for the answer issued however I managed to find the flaw
return [( l, m.group(1) ) for l in match for m in (filter(l),) if m]
I simply had to remove the "l," from ( l, m.group(1) ). Thanks again.
It appears that the bottom portion of your code is mostly catching errors from the top portion, and that the regex you provided has no capturing groups. Here is a revised example:
import re
from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen

url = "http://www.site.com" # place real web address here
# read web page into string
page = urlopen(url).read().decode("utf-8", errors="replace")
# use regex to extract URLs from <a href=""> patterns;
# with two groups, findall returns (quote, url) tuples, not match objects
matches = re.findall(r'''\<a\s[^\>]*?\bhref\=(['"])(.+?)\1[^\>]*?\>''', page, re.IGNORECASE)
# keep only the second group (the URL) of each match
matches = sorted(link for _quote, link in matches)
# print matches if they exist
if matches:
    print("The number of URLs found is: " + str(len(matches)))
    print("\nAnd here are the URLs found:")
    # print each match
    print('\n'.join(matches))
else:
    print('No URL matches have been found, please try again!')
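To check the href pattern without a network request, it can be run against an inline HTML string. This is only a sketch: the HTML snippet is made up (it reuses the devguide link from the question), and a regex like this is a rough tool compared with a real HTML parser:

```python
import re

# Hypothetical HTML fragment, invented for illustration.
html = '''<p><a href="http://docs.python.org/devguide/">Core Development</a>
<a class="x" href='https://www.python.org/'>Python</a></p>'''

# Group 1 captures the opening quote, group 2 the URL, and \1 requires
# the same quote character to close the attribute value.
href_re = re.compile(r'''<a\s[^>]*?\bhref=(['"])(.+?)\1[^>]*?>''', re.IGNORECASE)

# Because the pattern has two groups, findall returns a list of
# (quote, url) tuples rather than match objects or plain strings.
pairs = href_re.findall(html)
urls = sorted(link for _quote, link in pairs)
print(urls)
```

Unpacking the tuples this way is what replaces the match.group(2) idea, since findall never hands back match objects.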

Regular expression syntax in python

I am trying to write a Python script to analyze a text data file. I want the script to
find all the time values in each line and compare them. This is my first time writing regular expressions, so I wrote a small script first.
My script is:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile('\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    print pattern.match(a[i])
#print a
The output is always None.
My text file looks like the picture:
What's the problem? Please help me, thanks a lot.
My Python is 2.7.2 and my OS is Windows XP SP3.
Didn't you miss one of the ":" in your regex? I think you meant
re.compile('\d{2}:\d{2}:\d{2}')
The other problems are:
First, if you want to search in the whole text, use search instead of match, because match only succeeds at the beginning of the string. Second, to access the result you need to call group() on the match object returned by search.
Try it:
import re

pattern = re.compile(r'\d{2}:\d{2}:\d{2}')
with open('1.txt', 'r') as txt:
    for line in txt:
        match = pattern.search(line)
        if match:  # search returns None when the line contains no time
            print(match.group())
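The difference between match and search described above can be checked on a single string. The sample line below is hypothetical, since the original data file was only shown as a picture:

```python
import re

pattern = re.compile(r'\d{2}:\d{2}:\d{2}')

# Hypothetical sample line; the real file contents are not known.
line = 'start 12:30:05 end 12:31:10'

# match only tries the pattern at position 0, and the line starts
# with 'start', so this is None.
m_match = pattern.match(line)

# search scans forward and returns the first occurrence.
m_search = pattern.search(line)

# findall returns every occurrence in the line as a list of strings.
all_times = pattern.findall(line)

print(m_match)
print(m_search.group())
print(all_times)
```

For the stated goal of finding all the times in one line, findall is likely the most direct of the three.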
I think you're missing the colons and dots in your regex. Also try using re.search or re.findall instead on the entire text. Like this:
import re
text = open("./1.txt", "r").read() # or readlines() to make a list of lines
pattern = re.compile(r'\d{2}:\d{2}:\d{2}')
matches = pattern.findall(text)
for i in matches:
    print(i)
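The question also mentions comparing the times found on one line. One way to do that, sketched here in Python 3 with invented sample lines, is to convert each matched string with datetime.strptime so the values compare as times rather than as text. The helper name times_in is made up for this example:

```python
import re
from datetime import datetime

pattern = re.compile(r'\d{2}:\d{2}:\d{2}')

# Hypothetical lines standing in for the contents of 1.txt.
lines = ['job A 09:15:00 -> 09:47:30',
         'job B 23:59:59 -> 00:10:00',
         'no times here']

def times_in(line):
    """Return all HH:MM:SS values in the line as datetime.time objects."""
    # Note: strptime raises ValueError for out-of-range values such
    # as 99:99:99, which the regex alone would still accept.
    return [datetime.strptime(t, '%H:%M:%S').time() for t in pattern.findall(line)]

for line in lines:
    times = times_in(line)
    if times:  # skip lines without any time value
        print(line, '| earliest:', min(times), '| latest:', max(times))
```

Times that cross midnight, as in the second sample line, compare by clock value only, so a real script may need dates as well.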