Urllib html not showing - regex

When I use the urllib module, I can call/print/search the HTML of a website the first time, but when I try again it is gone. How can I keep the HTML available throughout the program?
For example, when I try:
html = urllib.request.urlopen('http://www.bing.com/search?q=Mike&go=&qs=n&form=QBLH&filt=all&pq=mike&sc=8-2&sp=-1&sk=')
search = re.findall(r'Mike',str(html.read()))
search
I get:
['Mike','Mike','Mike','Mike']
But then when I try to do this a second time like so:
results = re.findall(r'Mike',str(html.read()))
I get:
[]
when calling 'results'.
Why is this and how can I stop it from happening/fix it?

Without being very well versed in Python, I'm guessing html.read() reads the HTTP stream, so when you call it a second time there is nothing left to read.
Try:
html = urllib.request.urlopen('http://www.bing.com/search?q=Mike&go=&qs=n&form=QBLH&filt=all&pq=mike&sc=8-2&sp=-1&sk=')
data = str(html.read())
search = re.findall(r'Mike',data)
search
And then use
results = re.findall(r'Mike',data)
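To see why the second read() comes back empty, here is a minimal sketch (any reachable URL would do) showing that HTTPResponse.read() consumes the stream:

from urllib.request import urlopen

resp = urlopen('http://www.bing.com')
first = resp.read()    # read() consumes the entire response body
second = resp.read()   # the stream is now exhausted
print(second)          # b''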

In addition to @rvalik's correct guess that you can only read a stream once, data = str(html.read()) is incorrect: html.read() returns a bytes object, and str() returns the display representation of that object. An example:
>>> data = b'Mike'
>>> str(data)
"b'Mike'"
What you should do is either decode the bytes object using the encoding of the HTML page (UTF-8 in this case):
from urllib.request import urlopen
import re
with urlopen('http://www.bing.com/search?q=Mike&go=&qs=n&form=QBLH&filt=all&pq=mike&sc=8-2&sp=-1&sk=') as html:
    data = html.read().decode('utf8')
print(re.findall(r'Mike', data))
or search with a bytes object:
from urllib.request import urlopen
import re
with urlopen('http://www.bing.com/search?q=Mike&go=&qs=n&form=QBLH&filt=all&pq=mike&sc=8-2&sp=-1&sk=') as html:
    data = html.read()
print(re.findall(rb'Mike', data))
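If you would rather not hard-code UTF-8, the charset can usually be taken from the response's Content-Type header. A small sketch (the UTF-8 fallback is an assumption):

from urllib.request import urlopen

with urlopen('http://www.bing.com') as resp:
    # get_content_charset() parses the charset out of the Content-Type header
    charset = resp.headers.get_content_charset() or 'utf-8'
    data = resp.read().decode(charset)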

Related

Return the content of a text file in JSON format for a matching section

There is a text file containing data in the form:
[sec1]
"ab": "s"
"sd" : "d"
[sec2]
"rt" : "ty"
"gh" : "rr"
"kk":"op"
We are supposed to return the data of the matching section in JSON format: if the user asks for sec1, we should send the contents of sec1.
The format you specified is very similar to the TOML format. However, TOML uses equals signs for the assignment of key-value pairs.
If your format actually uses colons for the assignment, the following example may help you.
It uses regular expressions in conjunction with a defaultdict to read the data from the file. The section to be queried is extracted from the URL using a variable rule.
If there is no hit within the loaded data, the server responds with a 404 error (NOT FOUND).
import re
from collections import defaultdict
from flask import (
    Flask,
    abort,
    jsonify
)

def parse(f):
    data = defaultdict(dict)
    section = None
    for line in f:
        if re.match(r'^\[[^\]]+\]$', line.strip()):
            # strip the surrounding brackets to get the section name
            section = line.strip()[1:-1]
            data[section] = dict()
            continue
        m = re.match(r'^"(?P<key>[^"]+)"\s*:\s*"(?P<val>[^"]+)"$', line.strip())
        if m:
            key, val = m.groups()
            if not section:
                raise OSError('illegal format')
            data[section][key] = val
            continue
    return dict(data)

app = Flask(__name__)

@app.route('/<string:section>')
def data(section):
    path = 'path/to/file'
    with open(path) as f:
        data = parse(f)
    if section in data:
        return jsonify(data[section])
    abort(404)
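For a quick check, the endpoint can be exercised with Flask's built-in test client; this sketch assumes 'path/to/file' points at a file containing the sample data from the question:

with app.test_client() as client:
    # expected: {"ab": "s", "sd": "d"}
    print(client.get('/sec1').get_json())
    # an unknown section triggers the abort(404) branch
    print(client.get('/missing').status_code)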

Regular expression to find precise pdf links in a webpage

Given url='http://normanpd.normanok.gov/content/daily-activity', the website has three types of documents: arrests, incidents, and case summaries. I was asked to use regular expressions to discover the URL strings of all the Incident PDF documents in Python.
The PDFs are to be downloaded to a defined location.
I have gone through the link and found that Incident pdf files URLs are in the form of:
normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
I have written this code:
import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read()    # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', text)
But in the URLs list, the values are empty.
I am a beginner in python3 and regex commands. Can anyone help me?
This is not an advisable method. Instead, use an HTML parsing library like bs4 (BeautifulSoup) to find the links, and then use a regex only to filter the results.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)'))
for el in links:
    print("http://normanpd.normanok.gov" + el['href'])
Output:
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf
But if you were asked to use only regexes, then try something simpler:
import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read()    # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)', text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
Using BeautifulSoup, this is an easy way:
# open_page is the raw HTML of the listing page, fetched as in the snippets above;
# hrefs on the page are site-relative, so join them to the site root, not the page URL
soup = BeautifulSoup(open_page, 'html.parser')
base = 'http://normanpd.normanok.gov'
links = []
for link in soup.find_all('a'):
    current = link.get('href')
    if current and current.endswith('pdf') and "Incident" in current:
        links.append('{0}{1}'.format(base, current))

I want to scrape all the text data from a website's page if that page contains a specific word, but this code is showing errors

I want to scrape all the text data from a website's page if that page contains certain specific words. I have written this code to collect the data from a page if it contains the search phrase, but it gives an error when run.
import urllib2
from bs4 import BeautifulSoup
import urlparse
import html2text
import re

yoururl = raw_input('Enter your url:')
page = urllib2.urlopen('http://' + yoururl)
soup = BeautifulSoup(page)
for tag in soup.findAll('a', href=True):
    raw = tag['href']
    b1 = urlparse.urlparse(tag['href']).hostname
    b2 = urlparse.urlparse(tag['href']).path
    fulllink = str(b1) + str(b2)
    html = urllib2.urlopen(fulllink)
    h = html2text.HTML2Text()
    h.ignore_links = True
    if "searchphrase" in h.html:
        print h.handle(html)
As discussed above, this might not be exactly what the OP intended.
This program takes a start URL, finds the hrefs in the page, does some pointless manipulation on the hrefs that ICBA to refactor :), and retrieves the objects those hrefs point to.
Finally, if the data in a retrieved object contains a magic keyword (in my case "winter"), the object is displayed.
The depth of searching is only one level deep, unlike a proper crawler.
import urllib2
from bs4 import BeautifulSoup
import urlparse

yoururl = raw_input('Enter your url:')
page = urllib2.urlopen('http://' + yoururl)
soup = BeautifulSoup(page)
for tag in soup.findAll('a', href=True):
    raw = tag['href']
    b1 = urlparse.urlparse(tag['href']).hostname
    b2 = urlparse.urlparse(tag['href']).path
    fulllink = str(b1) + str(b2)
    html = urllib2.urlopen('http://' + fulllink).readlines()
    if "winter" in repr(html):
        print html
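As a side note on the href handling above: urlparse.urljoin resolves relative links against the page URL more robustly than gluing hostname and path together by hand. A rough sketch of the lines it would replace inside the loop:

# instead of building fulllink from b1 and b2:
fulllink = urlparse.urljoin('http://' + yoururl, tag['href'])
html = urllib2.urlopen(fulllink).readlines()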

Scraping urls from html, save in csv using BeautifulSoup

I'm trying to save all hyperlinked urls in an online forum in a CSV file, for a research project.
When I 'print' the html scraping result it seems to be working fine, in the sense that it prints all the urls I want, but I'm unable to write these to separate rows in the CSV.
I'm clearly doing something wrong, but I don't know what! So any help will be greatly appreciated.
Here's the code I've written:
import urllib2
from bs4 import BeautifulSoup
import csv
import re
soup = BeautifulSoup(urllib2.urlopen('http://forum.sex141.com/eforum/forumdisplay.php?fid=28&page=5').read())

urls = []
for url in soup.find_all('a', href=re.compile('viewthread.php')):
    print url['href']

csvfile = open('Ss141.csv', 'wb')
writer = csv.writer(csvfile)
for url in zip(urls):
    writer.writerow([url])
csvfile.close()
You do need to add your matches to the urls list:
for url in soup.find_all('a', href=re.compile('viewthread.php')):
    print url['href']
    urls.append(url['href'])
and you don't need to use zip() here.
Best just write your urls as you find them, instead of collecting them in a list first:
soup = BeautifulSoup(urllib2.urlopen('http://forum.sex141.com/eforum/forumdisplay.php?fid=28&page=5').read())

with open('Ss141.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    for url in soup.find_all('a', href=re.compile('viewthread.php')):
        writer.writerow([url['href']])
The with statement will close the file object for you when the block is done.
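For anyone on Python 3, a rough equivalent is sketched below; note that the csv module there wants the file opened in text mode with newline='' instead of 'wb' (the 'html.parser' argument is an assumption):

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import re

soup = BeautifulSoup(urlopen('http://forum.sex141.com/eforum/forumdisplay.php?fid=28&page=5').read(), 'html.parser')
with open('Ss141.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for url in soup.find_all('a', href=re.compile('viewthread.php')):
        writer.writerow([url['href']])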

Read multilanguage strings from html via Python 2.7

I am new to Python 2.7 and I am trying to extract some info from HTML files. More specifically, I want to read some text information that contains multilanguage content. I give my script, hoping to make things more clear.
import urllib2
import BeautifulSoup
url = 'http://www.bbc.co.uk/zhongwen/simp/'
page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup.BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})
print data[0]['content'].encode("utf-8")
The result I am getting is:
BBCϊ╕φόΨΘύ╜ΣΎ╝Νϊ╕╗ώκ╡Ύ╝Νbbcchinese.com, email news, newsletter, subscription, full text
The problem is with the first string. Is there any way to print exactly what I am reading? Also, is there any way to find the exact encoding of each page?
PS: I would like to mention that the site selected totally randomly as it is representative to the problem I am encountering.
Thank you in advance!
The problem is with the terminal where you are outputting the result. The script works fine, and if you write the data to a file you will get it correctly.
Example:
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.bbc.co.uk/zhongwen/simp/'
page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})

with open("test.txt", "w") as myfile:
    myfile.write(data[0]['content'].encode("utf-8"))
test.txt:
BBC中文网,主页,bbcchinese.com, email news, newsletter, subscription, full text
Which OS and terminal are you using?
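As for finding the exact encoding: on Python 2, the charset is usually available from the Content-Type response header. A small sketch (the UTF-8 fallback is an assumption):

import urllib2

resp = urllib2.urlopen('http://www.bbc.co.uk/zhongwen/simp/')
# charset parameter of the Content-Type header, if the server sent one
charset = resp.headers.getparam('charset') or 'utf-8'
page = resp.read().decode(charset)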