Cannot find suitable regex - regex

What I'm trying to do is pull the HTML content and find a particular string that I know exists:
import urllib.request
import re
response = urllib.request.urlopen('http://ipchicken.com/')
data = response.read()
portregex = re.compile('Remote[\s]+Port: [\d]+')
port = portregex.findall(str(data))
print(data)
print(port)
Now in my case the website contains Remote Port: 50880, but I simply cannot come up with a suitable regex! Can anyone find my mistake?
I'm using Python 3.4 on Windows.

You mistakenly used square brackets instead of round parentheses:
portregex = re.compile(r'Remote\s+Port: (\d+)')
This ensures that the results of re.findall() will contain only the matched number(s) (because re.findall() returns only the capturing groups' matches when those are present):
>>> s = "Foo Remote Port: 12345 Bar Remote Port: 54321"
>>> portregex.findall(s)
['12345', '54321']

You need to use a raw string:
portregex = re.compile(r'Remote[\s]+Port: [\d]+')
or double backslashes:
portregex = re.compile('Remote[\\s]+Port: [\\d]+')
Note that square brackets are not needed.
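As a quick sketch combining the two suggestions above (raw string plus a capturing group), and decoding the response bytes instead of wrapping them in str():
import urllib.request
import re

# Fetch the page and decode the bytes so the regex runs over real text,
# not over the b'...' repr produced by str(data)
response = urllib.request.urlopen('http://ipchicken.com/')
data = response.read().decode('utf-8', errors='replace')

portregex = re.compile(r'Remote\s+Port: (\d+)')
print(portregex.findall(data))  # e.g. ['50880']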

I'd use an HTML parser in this case. Example using BeautifulSoup:
import urllib.request
from bs4 import BeautifulSoup
response = urllib.request.urlopen('http://ipchicken.com/')
soup = BeautifulSoup(response, 'html.parser')  # pass an explicit parser to avoid the bs4 warning
print(soup.find(text=lambda x: x.startswith('Remote')).text)

Related

Python 2 regex search only for https and export

I have a list with many links in it (http and https). Now I just want all the URLs with https.
Is there a regex for that? I only found one that matches both.
The URLs are in "". Maybe this makes it easier?
Does anyone have an idea?
Yes, regular expressions are very capable of matching all kinds of strings.
The following example program works as you suggest:
import re
links = ["http://www.x.com", "https://www.y.com", "http://www.a.com", "https://www.b.com",]
r = re.compile("^https")
httpslinks = list(filter(r.match, links))
print(httpslinks)
This prints out only the https links.
The regular expression looks for strings that start with https. The caret ^ anchors the match at the start of the string, so only strings beginning with "https" are matched.
If you are facing a space-delimited string, as you somewhat suggested in the comments, then you can just convert the links to a list using split like so:
links = "http://www.x.com https://www.y.com http://www.a.com https://www.b.com"
r = re.compile("^https")
httpslinks = list(filter(r.match, links.split(" ")))
You can read more about regular expressions in Python's re module documentation.
Wrapping the call in list(filter(...)) is only necessary on Python 3, where filter() returns an iterator; on Python 2, filter() already returns a list.
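If the list()/filter() difference is a concern, a list comprehension (a small sketch with the same sample data as above) behaves identically on Python 2 and 3:
import re

links = ["http://www.x.com", "https://www.y.com", "http://www.a.com", "https://www.b.com"]
r = re.compile(r"^https")

# The comprehension builds the list directly, regardless of what filter() returns
httpslinks = [link for link in links if r.match(link)]
print(httpslinks)  # ['https://www.y.com', 'https://www.b.com']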
Now it works:
Thanks to everyone.
import re
from bs4 import BeautifulSoup

with open('copyfromfile.txt', 'r') as file:
    text = file.read()

text = text.replace('"Url":', '[<a href=')
text = text.replace(',"At"', '</a>] ')
soup = BeautifulSoup(text, 'html.parser')
for link in soup.find_all('a'):
    link2 = link.get('href')
    if link2.find("video") == -1:
        link3 = 0
    else:
        f = open("C:/users/%Username%/desktop/copy.txt", "a+")
        f.write(str(link2))
        f.write("\n")
        f.close()

Extracting URL from a string

I'm just starting with regular expressions in Python and came across this problem where I'm supposed to extract URLs from the string:
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
The code I have is:
import re
url = re.findall('<tag>(.*)</tag>', str)
print(url)
returns:
['http://example-1.com</tag><tag>http://example-2.com']
If anyone could point me in the right direction on how to approach this problem, it would be much appreciated!
Thanks everyone!
You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.
You can use BeautifulSoup to parse HTML.
For example:
from bs4 import BeautifulSoup
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
soup = BeautifulSoup(str, 'html.parser')
tags = soup.find_all('tag')
for tag in tags:
    print(tag.text)
Using only the re package, make the quantifier non-greedy (.*?) so each match stops at the first closing </tag> instead of spanning across both tags:
import re
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
url = re.findall('<tag>(.*?)</tag>', str)
print(url)
returns:
['http://example-1.com', 'http://example-2.com']
Hope it helps!

Finding string with findall regex python 3 problem

Below is a list of web addresses. However, I would like to print only the hostname of each address.
http://www.askoxford.com
http://www.hydrogencarsnow.com
http://www.bnsf.com
http://web.archive.org
Expected result:
askoxford.com
hydrogencarsnow.com
bnsf.com
web.archive.org
My code:
import re
import codecs
raw = codecs.open("D:\Python\gg.txt",'r',encoding='utf-8')
string = raw.read()
link = re.findall(r'www\.(\w+\.com|\w+\.org)',string)
print(link)
Current Output:
['askoxford.com', 'askoxford.com', 'hydrogencarsnow.com', 'bnsf.com']
As the current output shows, it does not include the .org hostname (web.archive.org). I'm unsure how to write an OR condition for the prefix in front of the hostname.
My attempt:
link = re.findall(r'(http://www\.|http://)(\w+\.com|\w+\.org)',string), but it does not work, as it collects the http:// prefix along with the hostname.
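One way to cover both prefixes is to make the www. part optional with a non-capturing group; here is a small sketch using the sample addresses inline rather than the file:
import re

text = """http://www.askoxford.com
http://www.hydrogencarsnow.com
http://www.bnsf.com
http://web.archive.org"""

# (?:www\.)? makes the prefix optional and keeps it out of the result;
# the capturing group then grabs the remaining hostname (.com and .org alike)
link = re.findall(r'https?://(?:www\.)?([\w.-]+)', text)
print(link)  # ['askoxford.com', 'hydrogencarsnow.com', 'bnsf.com', 'web.archive.org']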

Extract a specific string inbetween strings by regex

I have a url like this
http://localhost:4970/eagle
&Account=001
&FruitSlad=Apple, Rambutab
&Fruits=Canada
and I want to extract the text between FruitSlad= and the next &, so I tried to write a regex for it.
I tried ((GroupByMultiples)(.$&?)) but it didn't work. I'm looking for help extracting the text between &FruitSlad= and the next &.
/FruitSlad=([^&]+)/
See here.
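The same pattern works in Python as well; a quick sketch, assuming the URL is held in a single string:
import re

url = "http://localhost:4970/eagle&Account=001&FruitSlad=Apple, Rambutab&Fruits=Canada"

# [^&]+ captures everything after FruitSlad= up to (but not including) the next &
m = re.search(r"FruitSlad=([^&]+)", url)
if m:
    print(m.group(1))  # Apple, Rambutab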
In JavaScript you could do the following.
var url = "\
http://localhost:4970/eagle \
&Account=001 \
&FruitSlad=Apple, Rambutab \
&Fruits=Canada";
var result = url.split('&')[2].split('=')[1];
console.log(result.trim()); //=> "Apple, Rambutab"
Or if you prefer using regex ...
var result = url.match(/&FruitSlad=([^&]+)/);
if (result)
console.log(result[1].trim()); //=> "Apple, Rambutab"
The convenient way to get the values of query variables in Python 2.x is to use the urlparse module:
import urlparse
url = 'http://localhost:4970/eagle?Account=001&FruitSlad=Apple, Rambutab&Fruits=Canada'
vars = urlparse.parse_qs(urlparse.urlparse(url).query)
print vars['FruitSlad'][0]
In Python 3.x use urllib.parse module:
import urllib.parse
url = 'http://localhost:4970/eagle?Account=001&FruitSlad=Apple, Rambutab&Fruits=Canada'
vars = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)
print(vars['FruitSlad'][0])
This module also decodes percent-encoded characters (%nn), so it is more suitable than a regex in this case.
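Note that the URL shown in the question has no ?, so urlparse() would put everything in the path and leave the query empty. A small sketch of a workaround, assuming everything after the first & is the parameter list:
import urllib.parse

url = 'http://localhost:4970/eagle&Account=001&FruitSlad=Apple, Rambutab&Fruits=Canada'

# Treat everything after the first '&' as the query string
query = url.split('&', 1)[1]
params = urllib.parse.parse_qs(query)
print(params['FruitSlad'][0])  # Apple, Rambutab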

Regex to extract URLs from href attribute in HTML with Python [duplicate]

This question already has answers here:
What is the best regular expression to check if a string is a valid URL?
(62 answers)
Considering a string as follows:
string = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
How could I, with Python, extract the URLs inside the anchor tags' href attributes? Something like:
>>> url = getURLs(string)
>>> url
['http://example.com', 'http://2.example']
import re
url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)
>>> print(urls)
['http://example.com', 'http://2.example']
The best answer is...
Don't use a regex
The expression in the accepted answer misses many cases. Among other things, URLs can have Unicode characters in them. The regex you would actually need does exist, and after looking at it, you may conclude that you don't really want it after all. The most correct version is ten-thousand characters long.
Admittedly, if you were starting with plain, unstructured text with a bunch of URLs in it, then you might need that ten-thousand-character-long regex. But if your input is structured, use the structure. Your stated aim is to "extract the URL, inside the anchor tag's href." Why use a ten-thousand-character-long regex when you can do something much simpler?
Parse the HTML instead
For many tasks, Beautiful Soup will be far faster and easier to use:
>>> from bs4 import BeautifulSoup as Soup
>>> html = Soup(s, 'html.parser') # Soup(s, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://2.example']
If you prefer not to use external tools, you can also directly use Python's own built-in HTML parsing library. Here's a really simple subclass of HTMLParser that does exactly what you want:
from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        HTMLParser.__init__(self)
        if output_list is None:
            self.output_list = []
        else:
            self.output_list = output_list
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))
Test:
>>> p = MyParser()
>>> p.feed(s)
>>> p.output_list
['http://example.com', 'http://2.example']
You could even create a new method that accepts a string, calls feed, and returns output_list. This is a vastly more powerful and extensible way than regular expressions to extract information from HTML.
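A rough sketch of that idea as a small helper that wraps the MyParser class above (the name getURLs just mirrors the question's example):
def getURLs(html):
    # Feed the HTML to a fresh parser instance and return the collected hrefs
    parser = MyParser()
    parser.feed(html)
    return parser.output_list
Usage:
>>> getURLs(s)
['http://example.com', 'http://2.example']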