Extracting URL from a string - regex

I'm just starting to learn regular expressions in Python and came across this problem, where I'm supposed to extract the URLs from this string:
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
The code I have is:
import re
url = re.findall('<tag>(.*)</tag>', str)
print(url)
returns:
['http://example-1.com</tag><tag>http://example-2.com']
If anyone could point me in the right direction on how to approach this problem, I would be most appreciative!
Thanks everyone!

You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.
You can use BeautifulSoup to parse HTML.
For example:
from bs4 import BeautifulSoup
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
soup = BeautifulSoup(str, 'html.parser')
tags = soup.find_all('tag')
for tag in tags:
    print(tag.text)

Using only re package:
import re
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
url = re.findall('<tag>(.*?)</tag>', str)
print(url)
returns:
['http://example-1.com', 'http://example-2.com']
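The only change from the question's pattern is the `?` after `.*`, which makes the quantifier non-greedy. A minimal sketch comparing the two patterns on the string from the question (the variable names are just for illustration):

```python
import re

# Sample input taken from the question above.
html = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

# Greedy: .* runs on to the last </tag>, so everything lands in one match.
greedy = re.findall(r"<tag>(.*)</tag>", html)

# Non-greedy: .*? stops at the first </tag> it reaches, one match per element.
lazy = re.findall(r"<tag>(.*?)</tag>", html)

print(greedy)
print(lazy)
```

The greedy version reproduces the single oversized match from the question, while the non-greedy one yields one entry per tag.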
Hope it helps!

Related

Python 2 regex search only for https and export

I have a list with many links inside (http and https). Now I just want all the URLs with https.
Is there a regex for that? I found only one that matches both.
The URLs are in double quotes (""). Maybe this makes it easier?
Does anyone have any idea?
Yes.
Regular expressions are very capable of matching all kinds of strings.
The following example program works as you suggest:
import re
links = ["http://www.x.com", "https://www.y.com", "http://www.a.com", "https://www.b.com",]
r = re.compile("^https")
httpslinks = list(filter(r.match, links))
print(httpslinks)
This prints out only the https links.
The regular expression looks for strings that start with https. The caret (^) anchors the match at the start of the string, so the pattern only matches strings that begin with the characters that follow it, in this case "https".
If you are facing a space-delimited string, as you somewhat suggested in the comments, then you can just convert the links to a list using split like so:
links = "http://www.x.com https://www.y.com http://www.a.com https://www.b.com"
r = re.compile("^https")
httpslinks = list(filter(r.match, links.split(" ")))
You can read more on regular expressions here.
The list(...) wrapper around filter(...) is only necessary in Python 3.x, where filter returns a lazy iterator; in Python 2.x, filter already returns a list.
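If the links arrive as one block of text with each URL in double quotes, as the question mentions, a single findall can pick out the https ones directly. This is only a sketch: the sample text and variable names below are made up for illustration.

```python
import re

# Hypothetical input: URLs wrapped in double quotes, as the question describes.
text = '"http://www.x.com", "https://www.y.com", "http://www.a.com", "https://www.b.com"'

# Capture what sits between a pair of quotes, but only if it starts with https://
https_links = re.findall(r'"(https://[^"]*)"', text)
print(https_links)
```

The `[^"]*` part keeps each match inside one pair of quotes, so no splitting step is needed.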
Now it works:
Thanks to everyone.
import os
from bs4 import BeautifulSoup

with open('copyfromfile.txt', 'r') as file:
    text = file.read()
# Turn the "Url": ... ,"At" fields into anchor tags so BeautifulSoup can find them.
text = text.replace('"Url":', '[<a href=')
text = text.replace(',"At"', '</a>] ')
soup = BeautifulSoup(text, 'html.parser')
# open() does not expand %Username% on its own, so expand it first.
out_path = os.path.expandvars("C:/users/%Username%/desktop/copy.txt")
for link in soup.find_all('a'):
    link2 = link.get('href')
    # Keep only the links that contain "video".
    if link2 and "video" in link2:
        with open(out_path, "a+") as f:
            f.write(str(link2) + "\n")

Regular expression to find precise pdf links in a webpage

Given url='http://normanpd.normanok.gov/content/daily-activity', the website has three types of documents: arrests, incidents, and case summaries. I was asked to use regular expressions to discover the URL strings of all the incident PDF documents in Python.
The pdfs are to be downloaded in a defined location.
I have gone through the link and found that Incident pdf files URLs are in the form of:
normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
I have written this code:
import re
import urllib.request
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$',text)
But in the URLs list, the values are empty.
I am a beginner in python3 and regex commands. Can anyone help me?
This is not an advisable method. Instead, use an HTML parsing library like bs4 (BeautifulSoup) to find the links, and use a regex only to filter the results.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)'))
for el in links:
    print("http://normanpd.normanok.gov" + el['href'])
Output :
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf
But if you were asked to use only regexes, then try something simpler:
import urllib.request
import re
url="http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)',text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
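A regex-only approach can be tried offline before running it against the live page. The sketch below uses a made-up HTML fragment modeled on the link format quoted in the question; the "Arrest" file name and the link text are assumptions. It also uses `[^"]*` rather than `.+?`, so a match cannot spill from one href into the next.

```python
import re

# Hypothetical page fragment; only the Incident path format comes from the question.
sample = (
    '<a href="/filebrowser_download/657/2017-02-19%20Daily%20Arrest%20Summary.pdf">Arrests</a>'
    '<a href="/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf">Incidents</a>'
)

# [^"]* cannot cross a double quote, so each match stays inside one href value.
pattern = re.compile(r'href="(/filebrowser_download/[^"]*Incident[^"]*\.pdf)"')

incident_links = ["http://normanpd.normanok.gov" + path for path in pattern.findall(sample)]
print(incident_links)
```

With the Arrest link placed first, a pattern built on `.+?` would start matching at the Arrest href and run on into the Incident one, which is the case the character class guards against.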
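Using BeautifulSoup, this is an easy way:
from urllib.parse import urljoin
# open_page is assumed to hold the HTML already fetched from url
soup = BeautifulSoup(open_page, 'html.parser')
links = []
for link in soup.find_all('a', href=True):
    current = link['href']
    if current.endswith('pdf') and "Incident" in current:
        # urljoin resolves a relative href against the page URL
        links.append(urljoin(url, current))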

How to extract URLs matching a pattern

I'm trying to extract URLs from a webpage with the following pattern :
'http://www.realclearpolitics.com/epolls/????/governor/??/-.html'
My current code extracts all the links. How could I change my code to only extract URLs that match the pattern? Thank you!
import requests
from bs4 import BeautifulSoup
def find_governor_races(html):
    url = html
    base_url = 'http://www.realclearpolitics.com/'
    page = requests.get(html).text
    soup = BeautifulSoup(page,'html.parser')
    links = []
    for a in soup.findAll('a', href=True):
        links.append(a['href'])
find_governor_races('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html')
You can provide a regular expression pattern as an href argument value for the .find_all():
import re
pattern = re.compile(r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html")
links = soup.find_all("a", href=pattern)
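Before handing the pattern to .find_all(), it can help to check it against a few plain strings. This is a rough sketch: the candidate URLs are hypothetical examples (only the map page comes from the question), and the dots are escaped here so that they match literal dots only.

```python
import re

pattern = re.compile(
    r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html"
)

# Hypothetical hrefs; only the first follows the year/governor/state/race layout.
candidates = [
    "http://www.realclearpolitics.com/epolls/2010/governor/ca/some_race-1.html",
    "http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html",
    "http://www.realclearpolitics.com/epolls/2010/senate/ca/some_race-2.html",
]

matches = [u for u in candidates if pattern.search(u)]
print(matches)
```

The map page is rejected because it has no state segment after governor/, and the senate page because of the literal "governor" in the pattern.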

Web Scraping between tags

I am trying to get all of the content between tags from a webpage. The code I have is outputting empty arrays. When I print the htmltext it shows the complete contents of the page, but will not show the contents of the tags.
import urllib
import re
urlToOpen = "webAddress"
htmlfile = urllib.urlopen(urlToOpen)
htmltext = htmlfile.read()
regex = '<h5> (.*) </h5>'
pattern = re.compile(regex)
names = re.findall(pattern,htmltext)
print "The h5 tag contains: ", names
You made a mistake when passing the string urlToOpen. Write str(urlToOpen) instead of urlToOpen.
import urllib2
import re
urlToOpen = "http://stackoverflow.com/questions/25107611/web-scraping-between-tags"
htmlfile = urllib2.urlopen(str(urlToOpen))
htmltext = htmlfile.read()
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
names = re.findall(pattern,htmltext)
print names
Don't put spaces between the tags and the capture group in the regular expression. Write it like this:
regex = '<h5>(.+?)</h5>'
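The effect of those literal spaces is easy to see on a small string. The markup below is invented for illustration, since the real page is not shown. The third pattern is one possible way to tolerate optional whitespace, not something taken from the answers above.

```python
import re

# Hypothetical markup: one h5 without padding, one with spaces around the text.
html = "<h5>Alice</h5><h5> Bob </h5>"

# Literal spaces in the pattern must be present in the page for a match.
with_spaces = re.findall(r"<h5> (.*) </h5>", html)

# No spaces in the pattern: both tags match, padding ends up in the capture.
without_spaces = re.findall(r"<h5>(.+?)</h5>", html)

# \s* accepts any amount of whitespace (including none) around the captured text.
tolerant = re.findall(r"<h5>\s*(.+?)\s*</h5>", html)

print(with_spaces)
print(without_spaces)
print(tolerant)
```

If the page's tags contain no padding at all, the spaced pattern returns an empty list, which would explain the empty output in the question.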

Cannot find suitable regex

What I'm trying to do is pull the HTML content and find a particular string that I know exists:
import urllib.request
import re
response = urllib.request.urlopen('http://ipchicken.com/')
data = response.read()
portregex = re.compile('Remote[\s]+Port: [\d]+')
port = portregex.findall(str(data))
print(data)
print(port)
Now in my case the website contains Remote Port: 50880, but I simply cannot come up with suitable regex! Can anyone find my mistake?
I'm using python 3.4 on Windows
You mistakenly used square brackets instead of round parentheses:
portregex = re.compile(r'Remote\s+Port: (\d+)')
This ensures that the results of re.findall() will contain only the matched number(s) (because re.findall() returns only the capturing groups' matches when those are present):
>>> s = "Foo Remote Port: 12345 Bar Remote Port: 54321"
>>> portregex.findall(s)
['12345', '54321']
You need to use a raw string:
portregex = re.compile(r'Remote[\s]+Port: [\d]+')
or double backslashes:
portregex = re.compile('Remote[\\s]+Port: [\\d]+')
Note that square brackets are not needed.
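One more thing that may be worth checking in the question's code is str(data): in Python 3, calling str() on a bytes object gives its repr, with line breaks spelled out as backslash escapes, so \s can no longer match them. A small sketch, assuming (hypothetically) that the page has a line break between "Remote" and "Port":

```python
import re

portregex = re.compile(r"Remote\s+Port: (\d+)")

# Hypothetical response body; the line break between the words is an assumption.
data = b"Remote\r\nPort: 50880"

as_repr = str(data)             # "b'Remote\\r\\nPort: 50880'", escapes spelled out
as_text = data.decode("utf-8")  # contains a real carriage return and newline

from_repr = portregex.findall(as_repr)
from_text = portregex.findall(as_text)

print(from_repr)
print(from_text)
```

Decoding the bytes before matching avoids the issue; the encoding used here is a guess and should follow what the server actually sends.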
I'd use an HTML parser in this case. Example using BeautifulSoup:
import urllib.request
from bs4 import BeautifulSoup
response = urllib.request.urlopen('http://ipchicken.com/')
soup = BeautifulSoup(response, 'html.parser')
print(soup.find(text=lambda x: x.startswith('Remote')).text)