Below is a list of web addresses. I would like to print only the hostname of each address.
http://www.askoxford.com
http://www.hydrogencarsnow.com
http://www.bnsf.com
http://web.archive.org
Expected result:
askoxford.com
hydrogencarsnow.com
bnsf.com
web.archive.org
My code:
import re
import codecs
raw = codecs.open(r"D:\Python\gg.txt", 'r', encoding='utf-8')
string = raw.read()
link = re.findall(r'www\.(\w+\.com|\w+\.org)',string)
print(link)
Current Output:
['askoxford.com', 'askoxford.com', 'hydrogencarsnow.com', 'bnsf.com']
As the current output shows, it does not include the .org hostname. I'm unsure how to write an OR condition for the part of the pattern in front of the hostname.
My Tries:
link = re.findall(r'(http://www\.|http://)(\w+\.com|\w+\.org)', string), but it does not work: it collects the http:// prefix along with the hostname.
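One way that should work (a minimal sketch, assuming the file contains one URL per line): make the scheme and the optional www. prefix non-capturing (?:...) groups, so findall returns only the hostname.
import re

with open(r"D:\Python\gg.txt", encoding="utf-8") as raw:
    string = raw.read()

# (?:...) matches without capturing, so findall returns just the hostname
link = re.findall(r'http://(?:www\.)?(\w[\w.-]*\.(?:com|org))', string)
print(link)  # ['askoxford.com', 'hydrogencarsnow.com', 'bnsf.com', 'web.archive.org']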
I have a list with many links inside (http and https). Now I just want the URLs with https.
Is there a regex for that? I have only found one that matches both.
The URLs are in quotation marks (""); maybe that makes it easier?
Does anyone have an idea?
Yes. Regular expressions are well suited to matching all kinds of strings.
The following example program works as you suggest:
import re
links = ["http://www.x.com", "https://www.y.com", "http://www.a.com", "https://www.b.com",]
r = re.compile("^https")
httpslinks = list(filter(r.match, links))
print(httpslinks)
This prints out only the https links.
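For the sample list above, the output would be:
['https://www.y.com', 'https://www.b.com']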
What the regular expression is doing is looking for strings that start with https. The caret ^ anchors the match to the start of the string, so only strings beginning with "https" match.
If you are facing a space-delimited string, as you somewhat suggested in the comments, then you can just convert the links to a list using split like so:
links = "http://www.x.com https://www.y.com http://www.a.com https://www.b.com"
r = re.compile("^https")
httpslinks = list(filter(r.match, links.split(" ")))
You can read more about regular expressions in the Python re module documentation.
The list(filter(...)) wrapping is needed in Python 3, where filter() returns a lazy iterator; in Python 2, filter() already returns a list.
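A quick interactive check of that difference (Python 3):
>>> import re
>>> r = re.compile("^https")
>>> filter(r.match, ["http://www.x.com", "https://www.y.com"])
<filter object at 0x7f...>
>>> list(filter(r.match, ["http://www.x.com", "https://www.y.com"]))
['https://www.y.com']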
Now it works:
Thanks to everyone.
import re
from bs4 import BeautifulSoup
with open('copyfromfile.txt', 'r') as file:
    text = file.read()

text = text.replace('"Url":', '[<a href=')
text = text.replace(',"At"', '</a>] ')
soup = BeautifulSoup(text, 'html.parser')

for link in soup.find_all('a'):
    link2 = link.get('href')
    if link2.find("video") == -1:
        link3 = 0
    else:
        f = open("C:/users/%Username%/desktop/copy.txt", "a+")
        f.write(str(link2))
        f.write("\n")
        f.close()
I frequently need a list of CVEs listed on a vendor's security bulletin page. Sometimes that's simple to copy off, but often they're mixed in with a bunch of text.
I haven't touched Python in a good while, so I thought this would be a great exercise to figure out how to extract that info – especially since I keep finding myself doing it manually.
Here's my current code:
#!/usr/bin/env python3
# REQUIREMENTS
# python3
# BeautifulSoup (pip3 install beautifulsoup4)
# python 3 certificates (Applications/Python 3.x/ Install Certificates.command) <-- this one took me forever to figure out!
import sys
if sys.version_info[0] < 3:
    raise Exception("Use Python 3: python3 " + sys.argv[0])
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
#specify/get the url to scrape
#url ='https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
#url = 'https://source.android.com/security/bulletin/2020-02-01.html'
url = input("What is the URL? ") or 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
print("Checking URL: " + url)
# CVE regular expression
cve_pattern = r'CVE-\d{4}-\d{4,7}'
# query the website and return the html
page = urlopen(url).read()
# parse the html returned using beautiful soup
soup = BeautifulSoup(page, 'html.parser')
count = 0
############################################################
# ANDROID === search for CVE references within <td> tags ===
# find all <td> tags
all_tds = soup.find_all("td")
#print all_tds
for td in all_tds:
    if "cve" in td.text.lower():
        print(td.text)
############################################################
# CHROME === search for CVE reference within <span> tags ===
# find all <span> tags
all_spans = soup.find_all("span")
for span in all_spans:
    # this code returns results in triplicate
    for i in re.finditer(cve_pattern, span.text):
        count += 1
        print(count, i.group())
    # this code works, but only returns the first match
    # match = re.search(cve_pattern, span.text)
    # if match:
    #     print(match.group(0))
What I have for the Android URL works fine; the problem is with the Chrome URL. There the CVE info is inside <span> tags, and I'm trying to leverage regular expressions to pull it out.
Using the re.finditer approach, I end up with results in triplicate.
Using the re.search approach, it misses CVE-2019-19925 – they listed two CVEs on that same line, and re.search returns only the first match.
Can you offer any advice on the best way to get this working?
I finally worked it out myself. No need for BeautifulSoup; everything is regex now. To work around the duplicate/triplicate results I was seeing before (most likely from nested <span> tags, since span.text also includes the text of child spans), I convert the re.findall list result to a dictionary (which retains the order of unique values) and back to a list.
import sys
if sys.version_info[0] < 3:
    raise Exception("Use Python 3: python3 " + sys.argv[0])
import requests
import re
# Specify/get the url to scrape (included a default for easier testing)
### there is no input validation taking place here ###
url = input("What is the URL? ") #or 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
print()
# CVE regular expression
cve_pattern = r'CVE-\d{4}-\d{4,7}'
# query the website and return the html
page = requests.get(url)
# initialize count to 0
count = 0
#search for CVE references using RegEx
cves = re.findall(cve_pattern, page.text)
# after several days of fiddling, I was still getting double and sometimes triple results on certain pages. This next line
# converts the list of objects returned from re.findall to a dictionary (which retains order) to get unique values, then back to a list.
# (thanks to https://stackoverflow.com/a/48028065/9205677)
# I found order to be important sometimes, as the most severely rated CVEs are often listed first on the page
cves = list(dict.fromkeys(cves))
# print the results to the screen
for cve in cves:
    print(cve)
    count += 1
print()
print(str(count) + " CVEs found at " + url)
print()
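For illustration, here is the order-preserving dedup idiom on its own (the CVE IDs are just sample values):
>>> cves = ["CVE-2019-19925", "CVE-2020-6378", "CVE-2019-19925"]
>>> list(dict.fromkeys(cves))
['CVE-2019-19925', 'CVE-2020-6378']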
I am trying to use BeautifulSoup and regular expressions to get the IP addresses from the website (http://www.gatherproxy.com/).
By inspecting the website, I saw that the IP addresses appear in the following format:
<tr class="proxy 149-56-34-94-225F" prx="149.56.34.94:8799" time="2017-03-29T15:42:33Z" type="Transparent" country="United States" port="8799" tmres="797"><td>2m 54s ago</td><td>149.56.34.94</td><td><a>
<tr class="proxy 138-68-180-44-1FB6" prx="138.68.180.44:8118" time="2017-03-29T15:42:32Z" type="Elite" country="United States" port="8118" tmres="47"><td>3m 25s ago</td><td>138.68.180.44</td><td><a>
So I am using the following code to get each tag
soup.find_all(name='tr',attrs={'class':re.compile(r"proxy [0-9a-zA-Z]+-[0-9a-zA-Z]+-[0-9a-zA-Z]+-[0-9a-zA-Z]+-[0-9a-zA-Z]+")})
But the output is nothing.
If you print the contents of your request from that website, you'll notice that the rows are being generated via JavaScript.
Here's an example of that:
gp.insertPrx({"PROXY_CITY":"","PROXY_COUNTRY":"France","PROXY_IP":"149.202.191.205","PROXY_LAST_UPDATE":"3 1","PROXY_PORT":"C38","PROXY_REFS":null,"PROXY_STATE":"","PROXY_STATUS":"OK","PROXY_TIME":"524","PROXY_TYPE":"Transparent","PROXY_UID":null,"PROXY_UPTIMELD":"4152/393"});
For this step you don't need BeautifulSoup; you can regex the contents directly.
Like this:
import re
import requests
import json
result = requests.get("http://www.gatherproxy.com").content
matches = re.findall(r'gp\.insertPrx\(([^(]*)\);', str(result))
for match in matches:
    _object = json.loads(match)
    print(_object["PROXY_IP"])
Which outputs:
104.156.226.80
52.32.220.134
138.68.184.128
...
I am new to coding with Python and learning it step by step. I have an assignment to parse a text file and update a database. The file would have some status fields like:
ticket summary:
Frequency :
Action taken: < something like "restarted server">
Status:
I want to parse this file, fetch the values for fields like "ticket summary" and "frequency", and put them in the database, where columns for them are defined. I have been reading about Python regex and substring parsing, but I can't figure out how to start. I need help.
Since you haven't included an example, I will provide one here. I am assuming you have already installed MongoDB and PyMongo and have an instance running locally. As an example, I am choosing my file to have the following formatting:
Frequency: restarted server, Action taken- none, Status: active
The code will use regex to extract the fields; for demos, have a look at an online regex tester.
import pymongo
import re
client = pymongo.MongoClient()
db = client['some-db']
freqgroup = re.compile(r"(?P<Frequency>Frequency[\:\-]\s?).*?(?=,)", flags=re.I)
actgroup = re.compile(r"(?P<ActionTaken>Action\sTaken[\:\-]\s?).*?(?=,)", flags=re.I)
statgroup = re.compile(r"(?P<Status>Status[\:\-]\s?).*", flags=re.I)
with open("some-file.txt", "rb") as f:
for line in f:
k = re.search(freqgroup, line)
db.posts.insert_one({"Frequency": line[k.end("Frequency"):k.end()]})
k = re.search(actgroup, line)
db.posts.insert_one({"ActionTaken": line[k.end("ActionTaken"):k.end()]})
k = re.search(statgroup, line)
db.posts.insert_one({"Status": line[k.end("Status"):k.end()]})
What I'm trying to do is pull the HTML content and find a particular string that I know exists.
import urllib.request
import re
response = urllib.request.urlopen('http://ipchicken.com/')
data = response.read()
portregex = re.compile('Remote[\s]+Port: [\d]+')
port = portregex.findall(str(data))
print(data)
print(port)
Now in my case the website contains Remote Port: 50880, but I simply cannot come up with a suitable regex! Can anyone find my mistake?
I'm using Python 3.4 on Windows.
You mistakenly used square brackets instead of round parentheses:
portregex = re.compile(r'Remote\s+Port: (\d+)')
This ensures that the results of re.findall() will contain only the matched number(s) (because re.findall() returns only the capturing groups' matches when those are present):
>>> s = "Foo Remote Port: 12345 Bar Remote Port: 54321"
>>> portregex.findall(s)
['12345', '54321']
You should use a raw string so the backslash sequences reach the regex engine untouched:
portregex = re.compile(r'Remote[\s]+Port: [\d]+')
or double backslashes:
portregex = re.compile('Remote[\\s]+Port: [\\d]+')
Note that square brackets are not needed.
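Both spellings produce the same pattern text, as a quick check shows:
>>> r'Remote\s+Port: \d+' == 'Remote\\s+Port: \\d+'
True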
I'd use an HTML parser in this case. Example using BeautifulSoup:
import urllib.request
from bs4 import BeautifulSoup
response = urllib.request.urlopen('http://ipchicken.com/')
soup = BeautifulSoup(response, 'html.parser')
print(soup.find(text=lambda x: x.startswith('Remote')).text)