Recursively searching a website for links with httplib2 and BeautifulSoup - python-2.7

I'm using the following to get all external JavaScript references from a web page. How can I modify the code to search not only this URL, but all pages of the website?
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://stackoverflow.com')
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
    if link.has_key('src'):
        if 'http' in link['src']:
            print link['src']
Below is my first attempt to make it scrape two pages deep. Any advice on how to make it return only unique URLs? As it stands, most are duplicates. (Note that all internal links contain the word "index" on the sites I need to run this on.)
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

site = 'http://www.stackoverflow.com/'
http = httplib2.Http()
status, response = http.request(site)
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        if 'index' in link['href']:
            page = site + link['href']
            status, response = http.request(page)
            for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
                if link.has_key('src'):
                    if 'http' in link['src']:
                        print "script" + " " + link['src']
            for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
                print "iframe" + " " + iframe['src']
            for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
                if link.has_key('href'):
                    if 'index' in link['href']:
                        page = site + link['href']
                        status, response = http.request(page)
                        for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
                            if link.has_key('src'):
                                if 'http' in link['src']:
                                    print "script" + " " + link['src']
                        for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
                            print "iframe" + " " + iframe['src']

Crawling websites is a vast subject: deciding how to index content, how to crawl deeper into the website, and how to parse content the way your rudimentary crawler or spider is doing. It is definitely non-trivial to write a bot comparable to Googlebot. Professional crawling bots do a lot of work, which may include:
Monitoring domain-related changes to initiate a crawl
Scheduling sitemap lookups
Fetching web content (which is the scope of this question)
Fetching the set of links for further crawling
Adding weights or priorities to each URL
Monitoring when services of the website go down
For a crawl of just a specific website like Stack Overflow, I have modified your code for recursive crawling. It would be trivial to convert this code further to a multi-threaded form (a sketch of that follows after the code). It uses a Bloom filter to make sure it does not crawl the same page again. Let me warn you upfront: there will still be unexpected pitfalls while doing a crawl. Mature crawling software like Scrapy, Nutch or Heritrix does a far better job.
import requests
from bs4 import BeautifulSoup as Soup, SoupStrainer
from bs4.element import Tag
from bloom_filter import BloomFilter
from Queue import Queue
from urlparse import urljoin, urlparse

# Bloom filter remembers which URLs have already been visited
visited = BloomFilter(max_elements=100000, error_rate=0.1)
visitlist = Queue()

def isurlabsolute(url):
    return bool(urlparse(url).netloc)

def visit(url):
    print "Visiting %s" % url
    visited.add(url)
    return requests.get(url)

def parsehref(response):
    if response.status_code == 200:
        for link in Soup(response.content, 'lxml', parse_only=SoupStrainer('a')):
            if type(link) == Tag and link.has_attr('href'):
                href = link['href']
                if isurlabsolute(href) == False:
                    href = urljoin(response.url, href)
                href = str(href)
                if href not in visited:
                    visitlist.put_nowait(href)
                else:
                    print "Already visited %s" % href
    else:
        print "Got issues mate"

if __name__ == '__main__':
    visitlist.put_nowait('http://www.stackoverflow.com/')
    while visitlist.empty() != True:
        url = visitlist.get()
        resp = visit(url)
        parsehref(resp)
        visitlist.task_done()
    visitlist.join()
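As a rough sketch of the multi-threaded version mentioned above, the __main__ block could be replaced with a pool of worker threads draining the same queue. The worker count is an arbitrary choice, and concurrent access to the Bloom filter is left unguarded here, which a real crawler would need to handle:

import threading

def worker():
    # Each worker pulls URLs off the shared queue until the program exits.
    while True:
        url = visitlist.get()
        try:
            resp = visit(url)
            parsehref(resp)
        except Exception as e:
            print "Error crawling %s: %s" % (url, e)
        finally:
            visitlist.task_done()

if __name__ == '__main__':
    visitlist.put_nowait('http://www.stackoverflow.com/')
    for _ in range(4):           # four crawler threads, an arbitrary choice
        t = threading.Thread(target=worker)
        t.daemon = True          # daemon threads die once the main thread finishes
        t.start()
    visitlist.join()             # block until every queued URL has been processed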

Related

Storing the data in the console in a database and accessing it

I'm building a web app which accesses the location of the user when a particular button is pressed. For this I'm using the HTML geolocation API.
Below is the location.js file:
var x = document.getElementById("demo");

function getLocation() {
    if (navigator.geolocation) {
        navigator.geolocation.getCurrentPosition(showPosition);
    } else {
        x.innerHTML = "Geolocation is not supported by this browser.";
    }
}

function showPosition(position) {
    x.innerHTML = "Latitude: " + position.coords.latitude +
        "<br>Longitude: " + position.coords.longitude;
    console.log(position.coords.latitude);
    console.log(position.coords.longitude);
}
Below is the snippet of the HTML file:
<button onclick="getLocation()">HELP</button>
<p id="demo"></p>
<script src="../static/location.js"></script>
What I want to do is send this information (i.e. the longitude/latitude of the user) to a list of e-mails associated with that user, but I don't know how to store this data and access it after the button is pressed.
It would be of great help if someone could get me started on how to save this data for the user and access it from the database.
If you want to store this info in a Django DB, it might be easier to do it in a Django view. This could be a RedirectView that just redirects to the same view after the button is clicked.
I have previously used a downloaded copy of the GeoLite2-City.mmdb database, which might not always be up to date, but is OK.
You can get the IP address of a request in Django with the ipware library, convert it into an IP object with IPy, and then use the geoip2 library to get the information out of the DB.
Import the following libraries:
from ipware.ip import get_ip
from IPy import IP
import geoip2.database
Then your method to get the IP location would be something like:
# logger and settings (django.conf.settings) are assumed to be set up elsewhere in your project
class MyRedirectView(RedirectView):

    def get_redirect_url(self, *args, **kwargs):
        ## Write some code to handle the redirect url first ##

        ip_address = get_ip(self.request)

        """Ensure that the IP address is a valid IP first"""
        try:
            IP(ip_address)
        except Exception:
            logger.exception("GEOIP2 error: ")

        """Then get the IP location"""
        geo_path = settings.GEOIP_PATH
        reader = geoip2.database.Reader(geo_path + '/GeoLite2-City.mmdb')
        try:
            response = reader.city(ip_address)
            city = response.city.name
            country = response.country.name
            ### Some code here to save to your DB
        except Exception:
            logger.exception("GEOIP2 lookup error: ")

        return super(MyRedirectView, self).get_redirect_url(*args, **kwargs)
If you need a much more accurate IP location service, you could involve an API call to something like http://ip-api.com/. But then you would have to wait for that response before serving the next view.
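For illustration, a minimal sketch of such a lookup with the requests library; the /json endpoint and field names follow ip-api.com's published response format, and error handling is kept deliberately simple:

import requests

def lookup_ip(ip_address):
    """Query ip-api.com for a rough geolocation of the given IP address."""
    resp = requests.get('http://ip-api.com/json/{}'.format(ip_address), timeout=5)
    data = resp.json()
    if data.get('status') == 'success':
        return data.get('city'), data.get('country'), data.get('lat'), data.get('lon')
    return None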

How to solve the URL decoding problem in Django and nginx

My web app is deployed using nginx. I have a view like below for the URL /incoming/.
def incoming_view(request):
    incoming = request.GET["incoming"]
    user = request.GET["user"]
    ...
When I hit the URL /incoming/?incoming=hello&user=nkishore I get the response I need, but when I call this URL using the requests module with the code below, I get an error.
r = requests.get('http://localhost/incoming/?incoming=%s&user=%s'%("hello", "nkishore"))
print r.json()
I have checked the nginx logs and the request I got was /incoming/?incoming=hiu0026user=nkishore, so in my view request.GET["user"] fails to find user.
I don't see what I am missing here. Is it a problem with nginx, or is there another way to make this call with requests?
See the Requests docs for how to pass URL parameters, e.g.:
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get('https://httpbin.org/get', params=payload)
>>> print(r.url)
https://httpbin.org/get?key2=value2&key1=value1
Internally, Requests will take care of escaping the & ampersand for you. If you really want to build the URL manually, try this as your URL string:
'http://localhost/incoming/?incoming=%s&user=%s'
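Applied to this particular view, the call could look roughly like the following sketch; localhost and the parameter values are simply the ones from the question:

import requests

# Let requests build and encode the query string from a dict
payload = {'incoming': 'hello', 'user': 'nkishore'}
r = requests.get('http://localhost/incoming/', params=payload)
print(r.url)     # e.g. http://localhost/incoming/?incoming=hello&user=nkishore
print(r.json())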

Test scrapy spider still working - find page changes

How can I test a scrapy spider against online data?
I know from this post that it is possible to test a spider against offline data.
My target is to check whether my spider still extracts the right data from a page, or whether the page has changed. I extract the data via XPath, and sometimes the page receives an update and my scraper no longer works. I would love to have the test as close to my code as possible, e.g. using the spider and the Scrapy setup and just hooking into the parse method.
Referring to the link you provided, you could try this method for online testing, which I used for a problem similar to yours. Instead of reading the request from a file, you can use the Requests library to fetch the live web page and compose a Scrapy response from the response you get, like below:
import requests
from scrapy.http import TextResponse, Request

def online_response_from_url(url=None):
    """Fetch a live page with requests and wrap it in a Scrapy TextResponse."""
    if not url:
        url = 'http://www.example.com'
    request = Request(url=url)
    oresp = requests.get(url)
    response = TextResponse(url=url, request=request,
                            body=oresp.text, encoding='utf-8')
    return response
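Hooked into a test, it might look roughly like this sketch. ExampleSpider is a stand-in for your own spider, and online_response_from_url is assumed to live in the same module:

import unittest
import scrapy

class ExampleSpider(scrapy.Spider):
    # Stand-in for your real spider; replace with your own class and XPaths
    name = 'example'

    def parse(self, response):
        for title in response.xpath('//h1/text()').extract():
            yield {'title': title}

class TestLiveParse(unittest.TestCase):
    def test_parse_still_finds_items(self):
        response = online_response_from_url('http://www.example.com')
        items = list(ExampleSpider().parse(response))
        # If the page layout changed, the XPath usually comes back empty
        self.assertTrue(items, "parse() returned no items - the page may have changed")

if __name__ == '__main__':
    unittest.main()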

Scrapy can't follow url with commas without encoding it

Can I force Scrapy to request a URL including commas without encoding them into %2C? The site (a Phorum board) I want to crawl does not accept encoded URLs and redirects me to the root.
So, for example, I have this page to parse: example.phorum.com/read.php?12,8
The URL is being encoded into: example.phorum.com/read.php?12%2C8=
But every time I try to request this URL, I'm redirected to the page with the list of topics:
example.phorum.com/list.php?12
In these example URLs, 12 is the category number and 8 is the topic number.
I tried to disable redirecting by disabling RedirectMiddleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
}
and in spider:
handle_httpstatus_list = [302, 403]
Moreover, I tried to rewrite this URL and request it via a sub-parser:
Rules = [Rule(RegexLinkExtractor(allow=[r'(.*%2C.*)']), follow=True, callback='prepare_url')]

def prepare_url(self, response):
    url = response.url
    url = re.sub(r'%2C', ',', url)
    if "=" in url[-1]:
        url = url[:-1]
    yield Request(urllib.unquote(url), callback=self.parse_site)
Here parse_site is the target parser, which is still called with the encoded URL.
Thanks in advance for any feedback.
You can try canonicalize=False. Example IPython session:
In [1]: import scrapy
In [2]: from scrapy.contrib.linkextractors.regex import RegexLinkExtractor
In [3]: hr = scrapy.http.HtmlResponse(url="http://example.phorum.com", body="""<a href="list.php?1,2">link</a>""")
In [4]: lx = RegexLinkExtractor(canonicalize=False)
In [5]: lx.extract_links(hr)
Out[5]: [Link(url='http://example.phorum.com/list.php?1,2', text=u'link', fragment='', nofollow=False)]
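In a CrawlSpider this would translate to something like the following sketch; the allow pattern is adapted to the read.php?category,topic URLs from the question, and parse_site is the question's own callback name:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.regex import RegexLinkExtractor

class PhorumSpider(CrawlSpider):
    name = 'phorum'
    start_urls = ['http://example.phorum.com/']

    # canonicalize=False keeps the literal comma instead of rewriting it to %2C
    rules = [
        Rule(RegexLinkExtractor(allow=[r'read\.php\?\d+,\d+'], canonicalize=False),
             callback='parse_site', follow=True),
    ]

    def parse_site(self, response):
        # extract topic data here
        pass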

Python request module - Getting response cookies

I am using Python 3.3 and the requests module, and I am trying to understand how to retrieve cookies from a response. The requests documentation says:
url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)
r.cookies['example_cookie_name']
That doesn't make sense to me: how do you get data from a cookie if you don't already know the name of the cookie? Maybe I don't understand how cookies work? If I try to print the response cookies I get:
<<class 'requests.cookies.RequestsCookieJar'>[]>
Thanks
You can retrieve them iteratively:
import requests

r = requests.get('http://example.com/some/cookie/setting/url')
for c in r.cookies:
    print(c.name, c.value)
I got the following code from HERE:
from urllib2 import Request, build_opener, HTTPCookieProcessor, HTTPHandler
import cookielib
#Create a CookieJar object to hold the cookies
cj = cookielib.CookieJar()
#Create an opener to open pages using the http protocol and to process cookies.
opener = build_opener(HTTPCookieProcessor(cj), HTTPHandler())
#create a request object to be used to get the page.
req = Request("http://www.about.com")
f = opener.open(req)
#see the first few lines of the page
html = f.read()
print html[:50]
#Check out the cookies
print "the cookies are: "
for cookie in cj:
    print cookie
See if this works for you.
Cookies are stored in headers as well. If this isn't working for you, check your headers for:
"Set-Cookie: Name=Value; [Expires=Date; Max-Age=Value; Path=Value]"