Why does scrapy miss some links? - python-2.7

I am scraping the website "www.accell-group.com" using the "scrapy" library for Python. The site is scraped completely; in total, 131 pages (text/html) and 2 documents (application/pdf) are identified. Scrapy does not throw any warnings or errors. My spider is supposed to scrape every single link. I use CrawlSpider.
However, when I look at the page "http://www.accell-group.com/nl/investor-relations/jaarverslagen/jaarverslagen-van-accell-group.htm", which "scrapy" reports as scraped/processed, I see that there are more PDF documents on it, for example "http://www.accell-group.com/files/4/5/0/1/Jaarverslag2014.pdf". I cannot find any reason for them not to be scraped. There is no dynamic/JavaScript content on this page, and they are not forbidden by "http://www.accell-group.com/robots.txt".
Do you have any idea why this can happen?
Could it be because the "files" folder is not listed in "http://www.accell-group.com/sitemap.xml"?
Thanks in advance!
My code:
import re
import logging
from urlparse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector

# logger, PyscrappItem and the WARC handler come from elsewhere in the project

class PyscrappSpider(CrawlSpider):
    """This is the Pyscrapp spider"""
    name = "PyscrappSpider"

    def __init__(self, *a, **kw):
        # Get the passed URL
        originalURL = kw.get('originalURL')
        logger.debug('Original url = {}'.format(originalURL))

        # Add a protocol, if needed
        startURL = 'http://{}/'.format(originalURL)
        self.start_urls = [startURL]

        self.in_redirect = {}
        self.allowed_domains = [urlparse(i).hostname.strip() for i in self.start_urls]
        self.pattern = r""
        self.rules = (Rule(LinkExtractor(deny=[r"accessdenied"]), callback="parse_data", follow=True), )

        # Get WARC writer
        self.warcHandler = kw.get('warcHandler')

        # Initialise the base constructor
        super(PyscrappSpider, self).__init__(*a, **kw)

    def parse_start_url(self, response):
        if "redirect_urls" in response.request.meta:
            original_url = response.request.meta["redirect_urls"][0]
            if (original_url not in self.in_redirect) or (not self.in_redirect[original_url]):
                self.in_redirect[original_url] = True
                self.allowed_domains.append(original_url)
        return self.parse_data(response)

    def parse_data(self, response):
        """This function extracts data from the page."""
        self.warcHandler.write_response(response)

        pattern = self.pattern

        # Check if we are interested in the current page
        if (not response.request.headers.get('Referer')
                or re.search(pattern, self.ensure_not_null(response.meta.get('link_text')), re.IGNORECASE)
                or re.search(r"/(" + pattern + r")", self.ensure_not_null(response.url), re.IGNORECASE)):
            logging.debug("This page gets processed = %(url)s", {'url': response.url})
            sel = Selector(response)
            item = PyscrappItem()
            item['url'] = response.url
            return item
        else:
            logging.warning("This page does NOT get processed = %(url)s", {'url': response.url})
            return response.request

Remove your "allowed_domains" variable or expand it appropriately and you should be fine. By default, all the URLs the spider follows are restricted by allowed_domains.
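For instance, a minimal sketch (the bare registered domain is an assumption here; Scrapy treats an entry in allowed_domains as matching that domain and all of its subdomains):

# either drop the restriction entirely by not setting allowed_domains at all,
# or restrict it to the registered domain so www. and other subdomains still pass:
self.allowed_domains = ['accell-group.com']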
EDIT: This case concerns PDFs in particular. PDFs are explicitly excluded as an extension by the default value of deny_extensions (see here), which is IGNORED_EXTENSIONS (see here).
To allow your spider to crawl PDFs, all you have to do is exclude them from IGNORED_EXTENSIONS by setting deny_extensions explicitly:
from scrapy.linkextractors import IGNORED_EXTENSIONS

self.rules = (
    Rule(
        LinkExtractor(
            deny=[r"accessdenied"],
            deny_extensions=set(IGNORED_EXTENSIONS) - set(['pdf'])
        ),
        callback="parse_data",
        follow=True,
    ),
)
So, I'm afraid, this is the answer to the question "Why does Scrapy miss some links?". As you will likely see, it just opens the door to further questions, like "how do I handle those PDFs?", but I guess that is the subject of another question.

Related

Web crawling and extracting data using Scrapy

I am new to Python as well as Scrapy.
I am trying to crawl a seed url, https://www.health.com/patients/status/. This seed url contains many urls, but I want to fetch only the urls that contain Faci/Details/#somenumber. The link structure looks like this:

https://www.health.com/patients/status/  ->  https://www.health.com/Faci/Details/2
                                          ->  https://www.health.com/Faci/Details/3
                                          ->  https://www.health.com/Faci/Details/4

https://www.health.com/Faci/Details/2    ->  https://www.health.com/provi/details/64
                                          ->  https://www.health.com/provi/details/65

https://www.health.com/Faci/Details/3    ->  https://www.health.com/provi/details/70
                                          ->  https://www.health.com/provi/details/71

Inside each https://www.health.com/Faci/Details/2 page there are links such as https://www.health.com/provi/details/64 and https://www.health.com/provi/details/65. Finally, I want to fetch some data from the https://www.health.com/provi/details/#somenumber urls. How can I achieve this?
So far I have tried the code below, based on the Scrapy tutorial, and I am able to crawl only the urls that contain https://www.health.com/Faci/Details/#somenumber. It does not go to https://www.health.com/provi/details/#somenumber. I tried setting a depth limit in the settings.py file, but that didn't work.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from news.items import NewsItem


class MySpider(CrawlSpider):
    name = 'provdetails.com'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']

    rules = (
        Rule(LinkExtractor(allow=('/Faci/Details/\d+', )), follow=True),
        Rule(LinkExtractor(allow=('/provi/details/\d+', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = NewsItem()
        item['id'] = response.xpath("//title/text()").extract()
        item['name'] = response.xpath("//title/text()").extract()
        item['description'] = response.css('p.introduction::text').extract()

        filename = 'details.txt'
        with open(filename, 'wb') as f:
            f.write(item)
        self.log('Saved file %s' % filename)

        return item
Please help me to proceed further.
To be honest, the regex-based and mighty Rule/LinkExtractor often gave me a hard time. For a simple project, one approach is to extract all links on the page and then look at the href attribute. If the href matches your needs, yield a new Request object for it. For instance:
from scrapy.http import Request
from scrapy.selector import Selector

...

# follow links
for href in sel.xpath('//div[@class="contentLeft"]//div[@class="pageNavigation nobr"]//a').extract():
    linktext = Selector(text=href).xpath('//a/text()').extract_first()
    if linktext == "Weiter":
        link = Selector(text=href).xpath('//a/@href').extract_first()
        url = response.urljoin(link)
        print url
        yield Request(url, callback=self.parse)
Some remarks on your code:
response.xpath(...).extract()
This returns a list; you may want to have a look at extract_first(), which provides just the first item (or None).
with open(filename, 'wb') as f:
This will overwrite the file several times, so you will only end up with the last item saved. Also, you open the file in binary mode ('b'); from the filename I guess you want to write text? Use 'a' to append? See the open() docs.
An alternative is to use the -o flag to use Scrapy's facilities to store the items as JSON or CSV.
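For example, with the spider name from the snippet above:
scrapy crawl provdetails.com -o items.json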
return item
It is good style to yield items instead of returning them. At the very least, if you need to create several items from one page, you have to yield them.
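For example (a hypothetical multi-row page, just to illustrate yielding several items; the selectors here are made up):

def parse_item(self, response):
    for row in response.css('table tr'):
        item = NewsItem()
        item['name'] = row.css('td::text').extract_first()
        yield item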
Another good approach is: use one parse function per type/kind of page.
For instance, every page in start_urls ends up in parse(). From there you could extract the links and yield a Request for each /Faci/Details/N page with callback=self.parse_faci_details. In parse_faci_details() you again extract the links of interest, create Requests and pass them via callback= to e.g. parse_provi_details(). In that function you create the items you need. A sketch of this structure follows below.
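A hedged sketch of that structure (the method names parse_faci_details and parse_provi_details come from the description above, the field extraction is reused from your snippet, and the generic //a/@href plus a regex filter is an assumption about how the links can be found):

import re

import scrapy
from news.items import NewsItem


class MySpider(scrapy.Spider):
    name = 'provdetails.com'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']

    def parse(self, response):
        # seed page: follow only the /Faci/Details/N links
        for href in response.xpath('//a/@href').extract():
            if re.search(r'/Faci/Details/\d+', href):
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_faci_details)

    def parse_faci_details(self, response):
        # facility page: follow only the /provi/details/N links
        for href in response.xpath('//a/@href').extract():
            if re.search(r'/provi/details/\d+', href):
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_provi_details)

    def parse_provi_details(self, response):
        # provider page: build the item here
        item = NewsItem()
        item['id'] = response.xpath('//title/text()').extract_first()
        item['name'] = response.xpath('//title/text()').extract_first()
        item['description'] = response.css('p.introduction::text').extract()
        yield item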

Retrieve all URLs to web pages on the same domain, but no media or anchors

I'm new to Python, and extremely impressed by the number of libraries at my disposal. I already have a function that uses Beautiful Soup to extract URLs from a site, but not all of them are relevant. I only want webpages (no media) on the same website (domain or subdomain, but no other domains). I'm trying to manually program around the examples I run into, but I feel like I'm reinventing the wheel; surely this is a common problem in internet applications.
Here's an example list of URLs that I might retrieve from a website, say http://example.com, with markings for whether or not I want them and why. Hopefully this illustrates the issue.
Good:
example.com/page - it links to another page on the same domain
example.com/page.html - has a filetype ending but it's an HTML page
subdomain.example.com/page.html - it's on the same site, though on a subdomain
/about/us - it's a relative link, so it doesn't have the domain in it, but it's implied
Bad:
otherexample.com/page - bad, the domain doesn't match
example.com/image.jpg - bad, it's an image, not a page
/ - bad - sometimes there's just a slash in the "a" tag, but that's a reference to the page I'm already on
#anchor - this is also a relative link, but it's on the same page, so there's no need for it
I've been writing cases in if statements for each of these...but there has to be a better way!
Edit: Here's my current code, which returns nothing:
import urllib2

from bs4 import BeautifulSoup


ignore_values = {"", "/"}


def desired_links(href):
    # ignore if href is not set
    if not href:
        return False

    # ignore if it is just a link to the same page
    if href.startswith("#"):
        return False

    # skip ignored values
    if href in ignore_values:
        return False


def explorePage(pageURL):
    # Get web page
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open(pageURL)
    html = response.read()

    # Parse web page for links
    soup = BeautifulSoup(html, 'html.parser')
    links = [a["href"] for a in soup.find_all("a", href=desired_links)]
    for link in links:
        print(link)

    return


def main():
    explorePage("http://xkcd.com")
BeautifulSoup is quite flexible in helping you create and apply rules to attribute values. You can create a filtering function and use it as the value of the href argument to find_all().
For example, something for you to start with:
ignore_values = {"", "/"}


def desired_links(href):
    # ignore if href is not set
    if not href:
        return False

    # ignore if it is just a link to the same page
    if href.startswith("#"):
        return False

    # skip ignored values
    if href in ignore_values:
        return False

    # TODO: more rules
    # you would probably need "urlparse" package for a proper url analysis

    return True
Usage:
links = [a["href"] for a in soup.find_all("a", href=desired_links)]
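As the TODO comment hints, urlparse (together with urljoin for relative links) can cover the domain/subdomain rules. A minimal sketch, assuming the URL of the page currently being crawled is passed in as base_url (the default value and the extension list are placeholders; in practice you would bind the real page URL, e.g. with functools.partial):

from urlparse import urljoin, urlparse  # Python 2.7 stdlib

MEDIA_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.pdf', '.mp3', '.mp4', '.zip')


def desired_links(href, base_url="http://example.com/"):
    # ignore unset hrefs, bare slashes and same-page anchors
    if not href or href == "/" or href.startswith("#"):
        return False

    # resolve relative links against the page being crawled
    absolute = urlparse(urljoin(base_url, href))
    base = urlparse(base_url)

    # keep only the same domain or one of its subdomains
    same_site = (absolute.hostname == base.hostname
                 or (absolute.hostname or '').endswith('.' + (base.hostname or '')))
    if not same_site:
        return False

    # skip obvious media files; everything else is treated as a page
    return not absolute.path.lower().endswith(MEDIA_EXTENSIONS)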
You should take a look at Scrapy and its Link Extractors.

Scrapy webcrawler gets caught in infinite loop, despite initially working.

Alright, so I'm working on a Scrapy-based webcrawler with some simple functionality. The bot is supposed to go from page to page, parsing and then downloading. I've gotten the parser to work, and I've gotten the downloading to work. I can't get the crawling to work. I've read the documentation on the Spider class, I've read the documentation on how parse is supposed to work. I've tried returning vs. yielding, and I'm still nowhere. I have no idea where my code is going wrong. What seems to happen, from a debug script I wrote, is the following: the code will run, it will grab page 1 just fine, it'll get the link to page two, it'll go to page two, and then it will happily stay on page two, not grabbing page three at all. I don't know where the mistake in my code is, or how to alter it to fix it, so any help would be appreciated. I'm sure the mistake is basic, but I can't figure out what's going on.
import scrapy


class ParadiseSpider(scrapy.Spider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]

    def __init__(self):
        self.found = 0
        self.goto = "no"

    def parse(self, response):
        urlthing = response.xpath("//a[@title='Next page']").extract()
        urlthing = urlthing.pop()
        newurl = urlthing.split()
        print newurl
        url = newurl[1]
        url = url.replace("href=", "")
        url = url.replace('"', "")
        url = "http://forums.somethingawful.com/" + url
        print url
        self.goto = url
        return scrapy.Request(self.goto, callback=self.parse_save, dont_filter=True)

    def parse_save(self, response):
        nfound = str(self.found)
        print "Testing" + nfound
        self.found = self.found + 1
        return scrapy.Request(self.goto, callback=self.parse, dont_filter=True)
Use the Scrapy rule engine so that you don't need to write the next-page crawling code in the parse function. Just pass the XPath for the next-page link in restrict_xpaths and your callback will get the response of each crawled page:

rules = (
    Rule(LinkExtractor(restrict_xpaths=['//a[contains(text(), "Next")]']), follow=True),
)

def parse(self, response):
    response.url
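Note that the rules attribute is only honoured by CrawlSpider (not a plain scrapy.Spider), and CrawlSpider reserves parse() for its own machinery, so the callback should get a different name. A hedged sketch of how the whole spider could look with this approach (parse_page is a name chosen here, the XPath reuses the "Next page" title from the question, and the start URL itself is only handed to a callback if you also override parse_start_url):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ParadiseSpider(CrawlSpider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]

    rules = (
        # follow every "Next page" link; each fetched page is handed to parse_page
        Rule(LinkExtractor(restrict_xpaths=['//a[@title="Next page"]']),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        self.logger.info("Grabbed %s", response.url)
        # parse/download the page contents here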

TDD Django tests seem to skip certain parts of the view code

I'm writing some tests for a site using django TDD.
The problem is that when I manually go to the test server, fill in the form and submit it, it seems to work fine. But when I run the test using manage.py test wiki, it seems to skip parts of the code within the view. The Page parts all seem to work fine, but the PageModification parts of the code, and even a write() I added just to see what was going on, seem to be ignored.
I have no idea what could be causing this and can't seem to find a solution. Any ideas?
This is the code:
test.py
# imports

class WikiSiteTest(LiveServerTestCase):
    ....

    def test_wiki_links(self):
        '''Go to the site, and check a few links'''

        # creating a few objects which will be used later
        .....

        # some code to get to where I want:
        .....

        # testing the link to see if the tester can add pages
        link = self.browser.find_element_by_link_text('Add page (for testing only. delete this later)')
        link.click()

        # filling in the form
        template_field = self.browser.find_element_by_name('template')
        template_field.send_keys('homepage')
        slug_field = self.browser.find_element_by_name('slug')
        slug_field.send_keys('this-is-a-slug')
        title_field = self.browser.find_element_by_name('title')
        title_field.send_keys('this is a title')
        meta_field = self.browser.find_element_by_name('meta_description')
        meta_field.send_keys('this is a meta')
        content_field = self.browser.find_element_by_name('content')
        content_field.send_keys('this is content')

        # submitting the filled form so that it can be processed
        s_button = self.browser.find_element_by_css_selector("input[value='Submit']")
        s_button.click()

        # now the view is called
and a view:
views.py
def page_add(request):
    '''This function does one of these 3 things:
    - Prepares an empty form
    - Checks the formdata it got. If its ok then it will save it and create and save
      a copy in the form of a Pagemodification.
    - Checks the formdata it got. If its not ok then it will redirect the user back'''
    .....

    if request.method == 'POST':
        form = PageForm(request.POST)
        if form.is_valid():
            user = request.user.get_profile()
            page = form.save(commit=False)
            page.partner = user.partner
            page.save()  # works

            # Gets ignored
            pagemod = PageModification()
            pagemod.template = page.template
            pagemod.parent = page.parent
            pagemod.page = Page.objects.get(slug=page.slug)
            pagemod.title = page.title
            pagemod.meta_description = page.meta_description
            pagemod.content = page.content
            pagemod.author = request.user.get_profile()
            pagemod.save()

            f = open("/location/log.txt", "w", True)
            f.write('are you reaching this line?')
            f.close()
            # /gets ignored

    # a render to response
Then later I do:
test.py
print '###############Data check##################'
print Page.objects.all()
print PageModification.objects.all()
print '###############End data check##############'
And get:
terminal:
###############Data check##################
[<Page: this is a title 2012-10-01 14:39:21.739966+00:00>]
[]
###############End data check##############
All the imports are fine. Putting the page.save() after the ignored code makes no difference.
This only happens when running it through the TDD test.
Thanks in advance.
How very strange. Could it be that the view is somehow erroring at the PageModification stage? Do you have any checks later in your test that assert that the response from the view is coming through correctly, i.e. that a 500 error is not being returned instead?
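For instance, a rough check from the test itself (a sketch; Selenium does not expose HTTP status codes, so this inspects the rendered page and the database instead, assuming Page and PageModification are imported in test.py):

# right after s_button.click() in test_wiki_links
self.assertNotIn('Traceback', self.browser.page_source)  # crude check for a Django error page
self.assertEqual(Page.objects.count(), 1)
self.assertEqual(PageModification.objects.count(), 1)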
Now this was a long time ago.
It was solved, but the solution was a little embarrassing. Basically, it was me being stupid. I can't remember the exact details, but I believe a different view was being called instead of the one I showed here. That view had the same code except for the "skipped" part.
My apologies to anyone who took their time looking into this.

Django: creating/modifying the request object

I'm trying to build a URL-alias app which allows the user to create aliases for existing urls on his website.
I'm trying to do this via middleware, where request.META['PATH_INFO'] is checked against database records of aliases:
try:
    src = request.META['PATH_INFO']
    alias = Alias.objects.get(src=src)
    view = get_view_for_this_path(request)
    return view(request)
except Alias.DoesNotExist:
    pass

return None
However, for this to work correctly it is vitally important that (at least) PATH_INFO is changed to the destination path.
Now there are some snippets which allow the developer to create testing request objects (http://djangosnippets.org/snippets/963/, http://djangosnippets.org/snippets/2231/), but these state that they are intended for testing purposes.
Of course, it could be that these snippets are fit for use in a live environment, but my knowledge of Django request processing is too limited to assess this.
Instead of the approach you're taking, have you considered the Redirects app?
It won't invisibly alias the path /foo/ to return the view bar(), but it will redirect /foo/ to /bar/
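For reference, a minimal sketch of that approach (this uses the stock django.contrib.redirects app; /foo/ and /bar/ are just the placeholder paths from above):

# settings.py: add 'django.contrib.redirects' to INSTALLED_APPS and
# 'django.contrib.redirects.middleware.RedirectFallbackMiddleware' to your middleware setting

# then create an alias, e.g. from the shell or a data migration
from django.contrib.redirects.models import Redirect
from django.contrib.sites.models import Site

Redirect.objects.create(site=Site.objects.get_current(),
                        old_path='/foo/', new_path='/bar/')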
(posted as answer because comments do not seem to support linebreaks or other markup)
Thanks for the advice; I have the same feeling about modifying request attributes. There must be a reason that the Django documentation states they should be considered read-only.
I came up with this middleware:
def process_request(self, request):
    try:
        obj = A.objects.get(src=request.path_info.rstrip('/'))  # The alias record.
        view, args, kwargs = resolve_to_func(obj.dst + '/')  # Modified http://djangosnippets.org/snippets/2262/

        request.path = request.path.replace(request.path_info, obj.dst)
        request.path_info = obj.dst
        request.META['PATH_INFO'] = obj.dst
        request.META['ROUTED_FROM'] = obj.src
        request.is_routed = True

        return view(request, *args, **kwargs)
    except A.DoesNotExist:  # No alias for this path
        request.is_routed = False
    except TypeError:  # View does not exist.
        pass

    return None
But, considering the objections against modifying the request's attributes, wouldn't it be a better solution to just skip that part, and only add the is_routed and ROUTED_TO (instead of ROUTED_FROM) parts?
Code that relies on the original path could then use that key from META.
Doing this with URLconfs is not possible, because this aliasing is aimed at enabling the end user to configure his own URLs, on the assumption that the end user has no access to the codebase or does not know how to write his own URLconf.
Though it would be possible to write a function that converts a user-editable file (XML, for example) into valid Django urls, using database records feels like it allows more dynamic generation of aliases (other objects defining their own aliases).
Sorry to necro-post, but I just found this thread while searching for answers. My solution seems simpler. Maybe a) I'm depending on newer django features or b) I'm missing a pitfall.
I encountered this because there is a bot named "Mediapartners-Google" which asks for pages with URL parameters still encoded, as from a naive scrape (or double-encoded, depending on how you look at it), i.e. I have 404s in my log from it that look like:
1.2.3.4 - - [12/Nov/2012:21:23:11 -0800] "GET /article/my-slug-name%3Fpage%3D2 HTTP/1.1" 1209 404 "-" "Mediapartners-Google
Normally I'd just ignore a broken bot, but this one I want to appease because it ought to better target our ads (it's Google AdSense's bot), resulting in better revenue, if it can see our content. Rumor has it that it doesn't follow redirects, so I wanted to find a solution similar to the original question's. I do not want regular clients accessing pages by these broken urls, so I detect the user agent. Other applications probably won't do that.
I agree a redirect would normally be the right answer.
My (complete?) solution:
from django.http import QueryDict
from django.core.urlresolvers import Resolver404, resolve


class MediapartnersPatch(object):
    def process_request(self, request):
        # short-circuit asap
        if request.META['HTTP_USER_AGENT'] != 'Mediapartners-Google':
            return None

        idx = request.path.find('?')
        if idx == -1:
            return None

        oldpath = request.path
        newpath = oldpath[0:idx]
        try:
            url = resolve(newpath)
        except Resolver404:
            return None

        request.path = newpath
        request.GET = QueryDict(oldpath[idx+1:])
        response = url.func(request, *url.args, **url.kwargs)
        response['Link'] = '<%s>; rel="canonical"' % (oldpath,)
        return response
return response