Python 2 regex search only for https and export - python-2.7

I have a list with many links in it (http and https). Now I just want the URLs that use https.
Is there a regex for that? I only found one that matches both.
The URLs are enclosed in "". Maybe that makes it easier?
Does anyone have an idea?

Yes, regular expressions are very capable of matching all kinds of strings.
The following example program does what you describe:
import re
links = ["http://www.x.com", "https://www.y.com", "http://www.a.com", "https://www.b.com",]
r = re.compile("^https")
httpslinks = list(filter(r.match, links))
print(httpslinks)
This prints out only the https links.
The regular expression looks for strings that start with https. The caret ^ anchors the match to the start of the string, so the pattern only matches strings beginning with the following characters, in this case "https".
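As an aside, a fixed prefix like this does not strictly need a regex; str.startswith does the same job. A minimal sketch, reusing the links list from the example above:
# Equivalent filtering without a regex (same result as the ^https pattern)
httpslinks = [link for link in links if link.startswith("https")]
print(httpslinks)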
If you are dealing with a single space-delimited string, as you suggested in the comments, you can convert the links to a list using split like so:
links = "http://www.x.com https://www.y.com http://www.a.com https://www.b.com"
r = re.compile("^https")
httpslinks = list(filter(r.match, links.split(" ")))
You can read more on regular expressions here.
The list(filter(...)) wrapping is only needed on Python 3, where filter returns a lazy iterator; on Python 2, filter already returns a list.
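To make that difference concrete, printing the filter result directly shows what each version hands back; a small illustration:
import re

links = ["http://www.x.com", "https://www.y.com"]
r = re.compile("^https")
print(filter(r.match, links))        # Python 3: a lazy filter object; Python 2: already a list
print(list(filter(r.match, links)))  # ['https://www.y.com'] on both versions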

Now it works:
Thanks to everyone.
import re
from bs4 import BeautifulSoup

with open('copyfromfile.txt', 'r') as file:
    text = file.read()

# Wrap the "Url": ... ,"At" entries in anchor tags so BeautifulSoup can pick them up
text = text.replace('"Url":', '[<a href=')
text = text.replace(',"At"', '</a>] ')

soup = BeautifulSoup(text, 'html.parser')
for link in soup.find_all('a'):
    link2 = link.get('href')
    if link2.find("video") == -1:
        link3 = 0
    else:
        # Append the matching link to the output file
        f = open("C:/users/%Username%/desktop/copy.txt", "a+")
        f.write(str(link2))
        f.write("\n")
        f.close()
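As a variation, the https-only filtering from the answer above can also be applied directly to the extracted hrefs; this is only a sketch reusing the soup object from the code above:
# Sketch: collect only the https links from the parsed anchors (reuses soup and re from above)
https_pattern = re.compile(r'^https')
https_links = [a.get('href') for a in soup.find_all('a')
               if a.get('href') and https_pattern.match(a.get('href'))]
print(https_links)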

Related

Extracting URL from a string

I'm just starting with regular expressions in Python and came across this problem where I'm supposed to extract URLs from a string:
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
The code I have is:
import re
url = re.findall('<tag>(.*)</tag>', str)
print(url)
returns:
['http://example-1.com</tag><tag>http://example-2.com']
If anyone could point me in the right direction on how I might approach this problem, it would be most appreciated!
Thanks everyone!
You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.
You can use BeautifulSoup to parse HTML.
For example:
from bs4 import BeautifulSoup
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
soup = BeautifulSoup(str, 'html.parser')
tags = soup.find_all('tag')
for tag in tags:
    print(tag.text)
Using only re package:
import re
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
url = re.findall('<tag>(.*?)</tag>', str)
print(url)
returns:
['http://example-1.com', 'http://example-2.com']
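To make the greedy vs. non-greedy difference explicit, here are both patterns run on the same input (expected output shown as comments):
import re

s = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
print(re.findall('<tag>(.*)</tag>', s))   # ['http://example-1.com</tag><tag>http://example-2.com']
print(re.findall('<tag>(.*?)</tag>', s))  # ['http://example-1.com', 'http://example-2.com']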
Hope it helps!

BeautifulSoup and regexp: Attribute error

I am trying to extract information with beautifulsoup4 methods by means of regular expressions.
But I get the following error:
AttributeError: 'NoneType' object has no attribute 'group'
I do not understand what is wrong. I am trying to:
get the Typologie name: 'herenhuizen'
get the weblink
Here is my code:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://inventaris.onroerenderfgoed.be/erfgoedobjecten/4778'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
text = soup.prettify()
##block
p = re.compile('(?s)(?<=(Typologie))(.*?)(?=(</a>))', re.VERBOSE)
block = p.search(text).group(2)
##typo_url
p = re.compile('(?s)(?<=(href=\"))(.*?)(?=(\">))', re.VERBOSE)
typo_url = p.search(block).group(2)
## typo_name
p = re.compile('\b(\w+)(\W*?)$', re.VERBOSE)
typo_name = p.search(block).group(1)
Does someone have an idea where the mistake is?
I would change this:
## typo_name
block_reverse = block[::-1]
p = re.compile('(\w+)', re.VERBOSE)
typo_name_reverse = p.search(block_reverse).group(1)
typo_name = typo_name_reverse[::-1]
print(typo_name)
Sometimes it's easier to just reverse the string if you are looking for stuff at the end. This just finds the name at the end of your block. There are a number of ways to find what you are looking for, and we could come up with all kinds of clever regexes, but if this works that's probably enough :)
update
However, I just noticed the reason the original regex was not working: in a normal string literal, \b is interpreted as a backspace character, so it needs to be escaped as \\b or the pattern needs to be a raw string, like this:
## typo_name
p = re.compile(r'\b(\w+)(\W*?)$', re.VERBOSE)
typo_name = p.search(block).group(1)
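To see why the raw string matters: in a normal string literal \b is the backspace character (0x08), so the regex engine never receives a word-boundary assertion. A quick check:
print(len('\b'), repr('\b'))    # 1 '\x08'  (a single backspace character)
print(len(r'\b'), repr(r'\b'))  # 2 '\\b'   (backslash + 'b', which is what re needs)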
There is a good related Q&A here: Does Python re module support word boundaries (\b)?

How to extract URLs matching a pattern

I'm trying to extract URLs matching the following pattern from a webpage:
'http://www.realclearpolitics.com/epolls/????/governor/??/-.html'
My current code extracts all the links. How could I change my code to only extract URLs that match the pattern? Thank you!
import requests
from bs4 import BeautifulSoup
def find_governor_races(html):
    url = html
    base_url = 'http://www.realclearpolitics.com/'
    page = requests.get(html).text
    soup = BeautifulSoup(page, 'html.parser')
    links = []
    for a in soup.findAll('a', href=True):
        links.append(a['href'])

find_governor_races('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html')
You can provide a regular expression pattern as an href argument value for the .find_all():
import re
pattern = re.compile(r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html")
links = soup.find_all("a", href=pattern)
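Put together with the function from the question, a minimal sketch of the filtered version might look like this (the function name and URL come from the question; returning the list is an assumption, since the original function never returned anything):
import re
import requests
from bs4 import BeautifulSoup

def find_governor_races(html):
    pattern = re.compile(r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html")
    page = requests.get(html).text
    soup = BeautifulSoup(page, 'html.parser')
    # href=pattern keeps only anchors whose href matches the regex
    return [a['href'] for a in soup.find_all('a', href=pattern)]

races = find_governor_races('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html')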

Cannot find suitable regex

What I'm trying to do is pull the HTML content and find a particular string that I know exists:
import urllib.request
import re
response = urllib.request.urlopen('http://ipchicken.com/')
data = response.read()
portregex = re.compile('Remote[\s]+Port: [\d]+')
port = portregex.findall(str(data))
print(data)
print(port)
Now in my case the website contains Remote Port: 50880, but I simply cannot come up with a suitable regex! Can anyone find my mistake?
I'm using Python 3.4 on Windows.
You mistakenly used square brackets instead of round parentheses:
portregex = re.compile(r'Remote\s+Port: (\d+)')
This ensures that the results of re.findall() will contain only the matched number(s) (because re.findall() returns only the capturing groups' matches when those are present):
>>> s = "Foo Remote Port: 12345 Bar Remote Port: 54321"
>>> portregex.findall(s)
['12345', '54321']
You need to use a raw string:
portregex = re.compile(r'Remote[\s]+Port: [\d]+')
or double backslashes:
portregex = re.compile('Remote[\\s]+Port: [\\d]+')
Note that square brackets are not needed.
I'd use an HTML parser in this case. Example using BeautifulSoup:
import urllib.request
from bs4 import BeautifulSoup
response = urllib.request.urlopen('http://ipchicken.com/')
soup = BeautifulSoup(response)
print(soup.find(text=lambda x: x.startswith('Remote')).text)
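If you still want only the digits once the text is located, a regex can be applied to that single string; a minimal sketch, using a hypothetical example string rather than the live page:
import re

remote_line = "Remote Port: 50880"  # hypothetical text, shaped like the page content described above
port = re.search(r'Remote\s+Port:\s*(\d+)', remote_line).group(1)
print(port)  # 50880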

Regex to extract URLs from href attribute in HTML with Python [duplicate]

This question already has answers here:
What is the best regular expression to check if a string is a valid URL?
Considering a string as follows:
string = "<p>Hello World</p>More ExamplesEven More Examples"
How could I, with Python, extract the URLs inside the anchor tags' href attributes? Something like:
>>> url = getURLs(string)
>>> url
['http://example.com', 'http://2.example']
import re
url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)
>>> print urls
['http://example.com', 'http://2.example']
The best answer is...
Don't use a regex
The expression in the accepted answer misses many cases. Among other things, URLs can have unicode characters in them. The regex you want is here, and after looking at it, you may conclude that you don't really want it after all. The most correct version is ten-thousand characters long.
Admittedly, if you were starting with plain, unstructured text with a bunch of URLs in it, then you might need that ten-thousand-character-long regex. But if your input is structured, use the structure. Your stated aim is to "extract the URL, inside the anchor tag's href." Why use a ten-thousand-character-long regex when you can do something much simpler?
Parse the HTML instead
For many tasks, Beautiful Soup is far faster and easier to use:
>>> from bs4 import BeautifulSoup as Soup
>>> html = Soup(string, 'html.parser') # Soup(string, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://2.example']
If you prefer not to use external tools, you can also directly use Python's own built-in HTML parsing library. Here's a really simple subclass of HTMLParser that does exactly what you want:
from html.parser import HTMLParser
class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        HTMLParser.__init__(self)
        if output_list is None:
            self.output_list = []
        else:
            self.output_list = output_list

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))
Test:
>>> p = MyParser()
>>> p.feed(string)
>>> p.output_list
['http://example.com', 'http://2.example']
You could even create a new method that accepts a string, calls feed, and returns output_list. This is a vastly more powerful and extensible way to extract information from HTML than regular expressions.
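For instance, a small wrapper along those lines might look like this (getURLs is the name the question asked for; purely a sketch):
def getURLs(html_text):
    parser = MyParser()          # fresh parser so results don't accumulate across calls
    parser.feed(html_text)
    return parser.output_list

print(getURLs('<p>Hello World</p><a href="http://example.com">More Examples</a>'
              '<a href="http://2.example">Even More Examples</a>'))
# ['http://example.com', 'http://2.example']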