Regex to extract URLs from href attribute in HTML with Python [duplicate]

This question already has answers here:
What is the best regular expression to check if a string is a valid URL?
(62 answers)
Closed last month.
Considering a string as follows:
string = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
How could I, with Python, extract the URLs inside the anchor tags' href attributes? Something like:
>>> url = getURLs(string)
>>> url
['http://example.com', 'http://2.example']

>>> import re
>>> string = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
>>> urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', string)
>>> print(urls)
['http://example.com', 'http://2.example']

The best answer is...
Don't use a regex
The expression in the accepted answer misses many cases. Among other things, URLs can have unicode characters in them. A truly comprehensive URL regex does exist, and after looking at it, you may conclude that you don't really want it after all: the most correct version is ten thousand characters long.
Admittedly, if you were starting with plain, unstructured text with a bunch of URLs in it, then you might need that ten-thousand-character regex. But if your input is structured, use the structure. Your stated aim is to "extract the URLs, inside the anchor tag's href." Why use a ten-thousand-character regex when you can do something much simpler?
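As a quick illustration of one failure mode (my own example, not from the accepted answer): the pattern's character class contains no /, so it silently truncates any URL with a path or query string:
>>> import re
>>> re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', 'see http://example.com/some/page?id=1')
['http://example.com']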
Parse the HTML instead
For many tasks, Beautiful Soup is far faster and easier to use:
>>> from bs4 import BeautifulSoup as Soup
>>> html = Soup(string, 'html.parser')  # or Soup(string, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://2.example']
If you prefer not to use external tools, you can also directly use Python's own built-in HTML parsing library. Here's a really simple subclass of HTMLParser that does exactly what you want:
from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        HTMLParser.__init__(self)
        if output_list is None:
            self.output_list = []
        else:
            self.output_list = output_list

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))
Test:
>>> p = MyParser()
>>> p.feed(string)
>>> p.output_list
['http://example.com', 'http://2.example']
You could even create a new method that accepts a string, calls feed, and returns output_list, as sketched below. This is a vastly more powerful and extensible approach than regular expressions for extracting information from HTML.
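A minimal sketch of that idea (the class name URLExtractor and method name get_urls are mine, not from the original answer):
from html.parser import HTMLParser

class URLExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.output_list = []  # collected href values

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))

    def get_urls(self, html_string):
        # convenience wrapper: feed the string, return what was collected
        self.feed(html_string)
        return self.output_list

print(URLExtractor().get_urls(string))
# ['http://example.com', 'http://2.example']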

Related

Python 2 regex search only for https and export

I have a list with many links inside (http and https). Now I just want all the URLs that use https.
Is there a regex for that? I only found one that matches both.
The URLs are wrapped in double quotes. Maybe this makes it easier?
Does anyone have an idea?
Yes, regular expressions are very capable of matching all kinds of strings.
The following example program works as you suggest:
import re
links = ["http://www.x.com", "https://www.y.com", "http://www.a.com", "https://www.b.com",]
r = re.compile("^https")
httpslinks = list(filter(r.match, links))
print(httpslinks)
This prints out only the https links.
What the regular expression is doing is looking for strings that start with https. The caret ^ anchors the match to the start of the string, so only strings beginning with "https" are kept.
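A quick check (note that re.match already anchors at the start of the string, so the ^ is technically redundant here, though harmless):
>>> import re
>>> r = re.compile("^https")
>>> bool(r.match("https://www.y.com"))
True
>>> bool(r.match("http://www.x.com"))
False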
If you are facing a space-delimited string, as you somewhat suggested in the comments, then you can just convert the links to a list using split like so:
links = "http://www.x.com https://www.y.com http://www.a.com https://www.b.com"
r = re.compile("^https")
httpslinks = list(filter(r.match, links.split(" ")))
You can read more on regular expressions here.
The list(filter(...)) wrapper is needed on Python 3, where filter returns a lazy iterator; on Python 2, filter already returns a list.
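Since you mentioned the URLs are wrapped in double quotes, you could also match them directly. A sketch (the sample text is made up):
>>> import re
>>> text = '"https://www.y.com" and "http://www.x.com" and "https://www.b.com"'
>>> re.findall(r'"(https://[^"]+)"', text)
['https://www.y.com', 'https://www.b.com']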
Now it works:
Thanks to everyone.
import re
from bs4 import BeautifulSoup

with open('copyfromfile.txt', 'r') as file:
    text = file.read()

text = text.replace('"Url":', '[<a href=')
text = text.replace(',"At"', '</a>] ')
soup = BeautifulSoup(text, 'html.parser')
for link in soup.find_all('a'):
    link2 = link.get('href')
    if link2.find("video") == -1:
        link3 = 0
    else:
        f = open("C:/users/%Username%/desktop/copy.txt", "a+")
        f.write(str(link2))
        f.write("\n")
        f.close()

Trying to write a custom template tag in Django that finds a phone number in text and converts it to a link

I want to convert the string tel:123-456-7890.1234 into an HTML tel: link whose visible text is 123-456-7890 ext 1234.
I'm not great with Regex and I'm REALLY close, but I need some help. I know that I'm not all the way there with the regex and output. How do I change what I have to make it work?
import re

@register.filter(name='phonify')
@stringfilter
def phonify(val):
    """
    Pass the string 'tel:1234' to the filter and the right tel link is returned.
    """
    # find every instance of 'tel' and then get the number after it
    for tel in re.findall(r'tel:(\d{3}\D{0,3}\d{3}\D{0,3}\d{4})\D*(\d*)', val):
        # format the tag for each instance of tel
        tag = '<a href="tel:{}">{}</a>'.format(tel, tel)
        # replace the tel instance with the new formatted html
        val = val.replace('tel:{}'.format(tel), tag)
    # return the new output to the template context
    return val
I added the wagtail tag because I've seen other solutions for this in Wagtail, and it's something Wagtail needs, so this might be helpful to others.
You can use re.sub to perform a find and replace:
import re

@register.filter(name='phonify')
@stringfilter
def phonify(val):
    tel_rgx = r'tel:(\d{3}\D{0,3}\d{3}\D{0,3}\d{4}\D*\d*)'
    return re.sub(tel_rgx, r'<a href="tel:\1">\1</a>', val)
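As a quick sanity check outside Django (decorators omitted, the sample string is mine), the substitution behaves like this:
>>> import re
>>> tel_rgx = r'tel:(\d{3}\D{0,3}\d{3}\D{0,3}\d{4}\D*\d*)'
>>> re.sub(tel_rgx, r'<a href="tel:\1">\1</a>', 'Call tel:123-456-7890.1234 today')
'Call <a href="tel:123-456-7890.1234">123-456-7890.1234</a> today'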
Note however that in your template, you will need to mark the result as "safe", otherwise Django will replace < with &lt; etc. and thus render <a href="..."> as literal text.
You can mark the string as safe in your template filter as well:
import re
from django.utils.safestring import mark_safe

@register.filter(name='phonify')
@stringfilter
def phonify(val):
    tel_rgx = r'tel:(\d{3}\D{0,3}\d{3}\D{0,3}\d{4}\D*\d*)'
    return mark_safe(re.sub(tel_rgx, r'<a href="tel:\1">\1</a>', val))
Regardless of how you do this, it will mark all items as safe, even tags, etc. that were part of the original string and thus should be escaped. Therefore, I'm not sure this is a good idea.
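If you do go this route, one possible mitigation (a sketch of my own, not part of the original answer) is to escape the incoming value first, so any pre-existing markup stays inert and only the anchor you inject is live:
import re
from django.utils.html import escape
from django.utils.safestring import mark_safe

@register.filter(name='phonify')
@stringfilter
def phonify(val):
    tel_rgx = r'tel:(\d{3}\D{0,3}\d{3}\D{0,3}\d{4}\D*\d*)'
    # escape() neutralizes any HTML already in val before we add our own tag
    return mark_safe(re.sub(tel_rgx, r'<a href="tel:\1">\1</a>', escape(val)))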

parsing URL in newspaper website

I have many urls from the same newspaper, each url has a depository for each writer.
For example:
http://alhayat.com/Opinion/Zinab-Ghasab.aspx
http://alhayat.com/Opinion/Abeer-AlFozan.aspx
http://www.alhayat.com/Opinion/Suzan-Mash-hadi.aspx
http://www.alhayat.com/Opinion/Thuraya-Al-Shahri.aspx
http://www.alhayat.com/Opinion/Badria-Al-Besher.aspx
Could someone please help me write a regular expression that would match all the writer URLs?
Thanks!
In order to get Zinab-Ghasab.aspx, you need no regex.
Just iterate through all of these URLs and use
print(s[s.rfind("/")+1:])
A regex would look like
print(re.findall(r"/([^/]+)\.aspx", text))
It will get all the values from the input text without the .aspx extension.
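For example, run against a couple of the URLs from the question:
>>> import re
>>> text = '''http://alhayat.com/Opinion/Zinab-Ghasab.aspx
... http://www.alhayat.com/Opinion/Suzan-Mash-hadi.aspx'''
>>> re.findall(r"/([^/]+)\.aspx", text)
['Zinab-Ghasab', 'Suzan-Mash-hadi']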
You can use the findall() method in the "re" module.
Assuming that you are reading the content from a file:
import re

with open("file_name", "r") as fp:
    contents = fp.read()

writer_urls = re.findall(r"https?://.+\.com/.+/(.*)\.aspx", contents)
Now the writer_urls list holds the writer name from each URL.

Cannot find suitable regex

What I'm trying to do is pull the HTML content and find a particular string that I know exists:
import urllib.request
import re
response = urllib.request.urlopen('http://ipchicken.com/')
data = response.read()
portregex = re.compile('Remote[\s]+Port: [\d]+')
port = portregex.findall(str(data))
print(data)
print(port)
Now in my case the website contains Remote Port: 50880, but I simply cannot come up with a suitable regex! Can anyone find my mistake?
I'm using Python 3.4 on Windows.
You mistakenly used square brackets instead of round parentheses:
portregex = re.compile(r'Remote\s+Port: (\d+)')
This ensures that the results of re.findall() will contain only the matched number(s) (because re.findall() returns only the capturing groups' matches when those are present):
>>> s = "Foo Remote Port: 12345 Bar Remote Port: 54321"
>>> portregex.findall(s)
['12345', '54321']
You need to use a raw string:
portregex = re.compile(r'Remote[\s]+Port: [\d]+')
or double backslashes:
portregex = re.compile('Remote[\\s]+Port: [\\d]+')
Note that square brackets are not needed.
I'd use an HTML parser in this case. Example using BeautifulSoup:
import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://ipchicken.com/')
soup = BeautifulSoup(response, 'html.parser')
# find() returns the matching NavigableString, which prints as plain text
print(soup.find(text=lambda x: x.startswith('Remote')))

TypeError : 'NoneType' object not callable Python with BeautifulSoup XML

I have the following XML file :
<user-login-permission>true</user-login-permission>
<total-matched-record-number>15000</total-matched-record-number>
<total-returned-record-number>15000</total-returned-record-number>
<active-user-records>
<active-user-record>
<active-user-name>username</active-user-name>
<authentication-realm>realm</authentication-realm>
<user-roles>Role</user-roles>
<user-sign-in-time>date</user-sign-in-time>
<events>0</events>
<agent-type>text</agent-type>
<login-node>node</login-node>
</active-user-record>
There are many records
I'm trying to get values from tags and save them in a different text file using the following code :
soup = BeautifulSoup(open("path/to/xmlfile"), features="xml")

with open('path/to/outputfile', 'a') as f:
    for i in range(len(soup.findall('active-user-name'))):
        f.write('%s\t%s\t%s\t%s\n' % (soup.findall('active-user-name')[i].text, soup.findall('authentication-realm')[i].text, soup.findall('user-roles')[i].text, soup.findall('login-node')[i].text))
I get the error TypeError: 'NoneType' object is not callable on the line for i in range(len(soup.findall('active-user-name'))):
Any idea what could be causing this?
Thanks!
There are a number of issues that need to be addressed with this, the first is that the XML file you provided is not valid XML - a root element is required.
Try something like this as the XML:
<root>
  <user-login-permission>true</user-login-permission>
  <total-matched-record-number>15000</total-matched-record-number>
  <total-returned-record-number>15000</total-returned-record-number>
  <active-user-records>
    <active-user-record>
      <active-user-name>username</active-user-name>
      <authentication-realm>realm</authentication-realm>
      <user-roles>Role</user-roles>
      <user-sign-in-time>date</user-sign-in-time>
      <events>0</events>
      <agent-type>text</agent-type>
      <login-node>node</login-node>
    </active-user-record>
  </active-user-records>
</root>
Now onto the Python. First off, there is no findall method; it's either findAll or find_all. The two are equivalent, as the BeautifulSoup documentation notes.
Next up, I would suggest altering your code so you aren't calling find_all quite so often; using find on each record instead will improve efficiency, especially for large XML files. Additionally, the code below is easier to read and debug:
from bs4 import BeautifulSoup

with open('./path_to_file.xml', 'r') as xml_file:
    soup = BeautifulSoup(xml_file, "xml")

with open('./path_to_output_f.txt', 'a') as f:
    for s in soup.findAll('active-user-record'):
        username = s.find('active-user-name').text
        auth = s.find('authentication-realm').text
        role = s.find('user-roles').text
        node = s.find('login-node').text
        f.write("{}\t{}\t{}\t{}\n".format(username, auth, role, node))
Hope this helps. Let me know if you require any further assistance!
The solution is simple: don't use a findall method - use find_all.
Why? Because there is no findall method at all; the real methods are findAll and find_all, which are equivalent. See the docs for more information.
The error message is confusing, I agree. It happens because BeautifulSoup treats an unknown attribute like soup.findall as a search for a child tag named findall; no such tag exists, so the lookup returns None, and calling None raises the TypeError.
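A quick illustration of where the None comes from (assuming the soup object from the question):
>>> print(soup.findall)  # bs4 treats this as soup.find('findall') -> no such tag
None
>>> soup.findall('active-user-name')
Traceback (most recent call last):
  ...
TypeError: 'NoneType' object is not callable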
Hope that helps.
The fix for my version of this problem was to coerce the BeautifulSoup instance into a plain string with str(). See this thread:
https://groups.google.com/forum/#!topic/comp.lang.python/ymrea29fMFI
From the Python manual:
str([object])
Return a string containing a nicely printable representation of an object. For strings, this returns the string itself. The difference with repr(object) is that str(object) does not always attempt to return a string that is acceptable to eval(); its goal is to return a printable string. If no argument is given, returns the empty string, ''.
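Read charitably, the suggestion is: stringify the soup, then search the plain text. A minimal sketch of that idea (the sample markup and variable names are mine):
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<active-user-name>username</active-user-name>', 'html.parser')
text = str(soup)  # coerce the BeautifulSoup instance to a plain string
print(re.findall(r'<active-user-name>(.*?)</active-user-name>', text))
# ['username']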