Scrapy webcrawler gets caught in infinite loop, despite initially working. - python-2.7

Alright, so I'm working on a Scrapy-based web crawler with some simple functionality. The bot is supposed to go from page to page, parsing and then downloading. I've gotten the parser to work and I've gotten the downloading to work, but I can't get the crawling to work. I've read the documentation on the Spider class and on how parse is supposed to work. I've tried returning vs. yielding, and I'm still nowhere.
From a debug script I wrote, what seems to happen is the following: the code runs, grabs page 1 just fine, gets the link to page two, goes to page two, and then happily stays on page two, never grabbing page three. I don't know where the mistake in my code is or how to fix it, so any help would be appreciated. I'm sure the mistake is basic, but I can't figure out what's going on.
import scrapy
class ParadiseSpider(scrapy.Spider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]
    def __init__(self):
        self.found = 0
        self.goto = "no"
    def parse(self, response):
        urlthing = response.xpath("//a[@title='Next page']").extract()
        urlthing = urlthing.pop()
        newurl = urlthing.split()
        print newurl
        url = newurl[1]
        url = url.replace("href=", "")
        url = url.replace('"', "")
        url = "http://forums.somethingawful.com/" + url
        print url
        self.goto = url
        return scrapy.Request(self.goto, callback=self.parse_save, dont_filter = True)
    def parse_save(self, response):
        nfound = str(self.found)
        print "Testing" + nfound
        self.found = self.found + 1
        return scrapy.Request(self.goto, callback=self.parse, dont_filter = True)

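Use Scrapy's rule engine so that you don't need to write the next-page crawling code in the parse function. Pass the XPath for the next-page link in restrict_xpaths, and the callback will receive the response of each crawled page. Rules only take effect on a CrawlSpider, and a CrawlSpider uses parse() internally, so the callback needs a different name:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# inside a CrawlSpider subclass; do not override parse() there
rules = (Rule(LinkExtractor(restrict_xpaths=['//a[contains(text(),"Next")]']), callback='parse_page', follow=True),)
def parse_page(self, response):
    print response.url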
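As a complement to the rule-based approach, here is a minimal standard-library sketch (Python 3, no Scrapy) of reading the next-page link from its href attribute instead of splitting the serialized tag as the question does. NextLinkParser, find_next_url and the sample HTML are made up for illustration. An HTML parser also decodes entities such as &amp; in attribute values, which plain string splitting leaves in place; whether that is what stalls the crawler in the question is not something this thread establishes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Remember the href of the first <a> whose title is 'Next page'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_next = tag == "a" and attributes.get("title") == "Next page"
        if is_next and self.next_href is None:
            # Attribute values arrive with entities already decoded.
            self.next_href = attributes.get("href")


def find_next_url(page_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    return urljoin(page_url, parser.next_href)


# Hypothetical markup, modelled on the link the question is looking for.
SAMPLE_HTML = (
    '<a href="showthread.php?threadid=3755369&amp;pagenumber=2" '
    'title="Next page">next</a>'
)
PAGE_URL = "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1"

next_url = find_next_url(PAGE_URL, SAMPLE_HTML)
print(next_url)
```

Inside a Scrapy callback, the corresponding idiom would be response.xpath("//a[@title='Next page']/@href").extract_first() combined with response.urljoin().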

Related

django - view for redirecting to another page with query received

I have two search inputs, one on the home page and another on the search results page itself. What I'm trying to do is receive the query from the home page and redirect to the search results page like this:
If I search for html-5,
the redirect page should be 127.0.0.1/html-5/find/?q=html-5
I have tried but unfortunately haven't found the right way to do it. Please suggest the correct way.
I use these URL patterns:
url(r'^(?P<key>.*)/find/', FacetedSearchView.as_view(), name='haystack_search'),
url(r'^search/',category_query_view,name='category_query'),
then in category_query_view:
def category_query_view(request):
    category = request.GET.get('q')
    print('hihi',category)
    return HttpResponseRedirect(reverse('haystack_search', kwargs={'key':category},))
It redirects me to
127.0.0.1/html-5/find/
but I don't know how to append
/?q=html-5
after this.
Oh, I found the right way; it's pretty simple:
def category_query_view(request):
    category = request.GET.get('q')
    print('hihi',category)
    url = '{category}/find/?q={category}'.format(category=category)
    return HttpResponseRedirect('/'+url)
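The string formatting above works for a simple term like html-5. For terms containing spaces or characters such as + or &, the path segment and the query string each need escaping. The following is a minimal sketch using only the Python 3 standard library; build_search_redirect is a hypothetical helper, not part of Django or Haystack.

```python
from urllib.parse import quote, urlencode


def build_search_redirect(category):
    """Return '/<category>/find/?q=<category>' with both parts escaped."""
    # Escape the term for use as a single path segment.
    path = "/{}/find/".format(quote(category, safe=""))
    # urlencode applies query-string escaping (spaces become '+').
    query = urlencode({"q": category})
    return "{}?{}".format(path, query)


simple = build_search_redirect("html-5")
tricky = build_search_redirect("c++ basics")
print(simple)
print(tricky)
```

In the view itself, the same idea could presumably be written as reverse('haystack_search', kwargs={'key': category}) + '?' + urlencode({'q': category}), which keeps the URL pattern as the single source of truth for the path.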

web Crawling and Extracting data using scrapy

I am new to Python as well as Scrapy.
I am trying to crawl a seed URL, https://www.health.com/patients/status/. This seed URL contains many links, but I want to follow only those that contain Faci/Details/#somenumber. The URLs look like this:
https://www.health.com/patients/status/ -> https://www.health.com/Faci/Details/2
                                        -> https://www.health.com/Faci/Details/3
                                        -> https://www.health.com/Faci/Details/4
https://www.health.com/Faci/Details/2 -> https://www.health.com/provi/details/64
                                      -> https://www.health.com/provi/details/65
https://www.health.com/Faci/Details/3 -> https://www.health.com/provi/details/70
                                      -> https://www.health.com/provi/details/71
Each page such as https://www.health.com/Faci/Details/2 links to https://www.health.com/provi/details/64, https://www.health.com/provi/details/65, and so on. Finally, I want to fetch some data from each https://www.health.com/provi/details/#somenumber URL. How can I achieve this?
So far I have tried the code below, based on the Scrapy tutorial. It crawls only URLs that contain https://www.health.com/Faci/Details/#somenumber and does not go on to https://www.health.com/provi/details/#somenumber. I tried setting a depth limit in settings.py, but that didn't work.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from news.items import NewsItem
class MySpider(CrawlSpider):
    name = 'provdetails.com'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']
    rules = (
        Rule(LinkExtractor(allow=(r'/Faci/Details/\d+', )), follow=True),
        Rule(LinkExtractor(allow=(r'/provi/details/\d+', )), callback='parse_item'),
    )
    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = NewsItem()
        item['id'] = response.xpath("//title/text()").extract()
        item['name'] = response.xpath("//title/text()").extract()
        item['description'] = response.css('p.introduction::text').extract()
        filename = 'details.txt'
        with open(filename, 'wb') as f:
            f.write(item)
        self.log('Saved file %s' % filename)
        return item
Please help me proceed further.
To be honest, the mighty regex-based Rule/LinkExtractor machinery has often given me a hard time. For a simple project, one approach is to extract all links on the page and then look at the href attribute. If the href matches your needs, yield a new Request object for it. For instance:
from scrapy.http import Request
from scrapy.selector import Selector
...
# follow links
for href in sel.xpath('//div[@class="contentLeft"]//div[@class="pageNavigation nobr"]//a').extract():
    linktext = Selector(text=href).xpath('//a/text()').extract_first()
    # extract_first() returns a string (or None), so compare the whole text
    if linktext and linktext.strip() == "Weiter":
        link = Selector(text=href).xpath('//a/@href').extract()[0]
        url = response.urljoin(link)
        print url
        yield Request(url, callback=self.parse)
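The same idea, collecting every link and keeping only the hrefs that match, can be tried outside Scrapy. Below is a minimal standard-library sketch (Python 3); HrefCollector, matching_links and the sample markup are hypothetical, and the pattern is the /Faci/Details/ one from the question.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def matching_links(base_url, html, pattern):
    """Return absolute URLs for all hrefs that match the regex pattern."""
    collector = HrefCollector()
    collector.feed(html)
    regex = re.compile(pattern)
    return [urljoin(base_url, href) for href in collector.hrefs if regex.search(href)]


# Hypothetical page content, for illustration only.
SAMPLE_HTML = """
<a href="/Faci/Details/2">Facility 2</a>
<a href="/about">About</a>
<a href="/Faci/Details/3">Facility 3</a>
"""

links = matching_links(
    "https://www.health.com/patients/status/", SAMPLE_HTML, r"/Faci/Details/\d+$"
)
print(links)
```

In a spider, each URL in links would become a Request with the callback for that page type.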
Some remarks on your code:
response.xpath(...).extract()
This returns a list. You may want to look at extract_first(), which provides the first item (or None).
with open(filename, 'wb') as f:
This overwrites the file on every call, so only the last item saved survives. You also open the file in binary mode ('b'); judging by the filename, you want a text file. Use 'a' to append instead; see the open() docs.
An alternative is to use the -o flag and let Scrapy's feed exports store the items as JSON or CSV.
return item
It is good style to yield items instead of returning them. At the very least, if you need to create several items from one page, you have to yield them.
Another good approach: use one parse function for each type of page.
For instance, every page in start_urls will end up in parse(). From there you could extract the links and yield a Request for each /Faci/Details/N page with the callback parse_faci_details(). In parse_faci_details() you again extract the links of interest, create Requests and pass them via callback= to, e.g., parse_provi_details().
In that function you create the items you need.
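To illustrate the remark about file modes: opening with 'a' keeps earlier records, whereas 'w' or 'wb' starts from an empty file on every call. This is a minimal sketch in Python 3 that writes one JSON object per line to a temporary file; append_item and the sample items are made up for the example, and the item is serialized first because write() expects a string rather than an item object.

```python
import json
import os
import tempfile


def append_item(path, item):
    """Append one item as a JSON line, keeping whatever is already in the file."""
    with open(path, "a") as handle:
        handle.write(json.dumps(item, sort_keys=True) + "\n")


# Temporary location so that the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "details.txt")
append_item(path, {"id": "64", "name": "first"})
append_item(path, {"id": "65", "name": "second"})

with open(path) as handle:
    lines = handle.read().splitlines()
print(len(lines))
```

Both records survive because each call appends. With Scrapy's -o flag this bookkeeping is handled by the framework instead.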

Why does scrapy miss some links?

I am scraping the web-site "www.accell-group.com" using the "scrapy" library for Python. The site is scraped completely, in total 131 pages (text/html) and 2 documents (application/pdf) are identified. Scrapy did not throw any warnings or errors. My algorithm is supposed to scrape every single link. I use CrawlSpider.
However, when I look into the page "http://www.accell-group.com/nl/investor-relations/jaarverslagen/jaarverslagen-van-accell-group.htm", which is reported by "scrapy" as scraped/processed, I see that there are more pdf-documents, for example "http://www.accell-group.com/files/4/5/0/1/Jaarverslag2014.pdf". I cannot find any reasons for it not to be scraped. There is no dynamic/JavaScript content on this page. It is not forbidden in "http://www.airproducts.com/robots.txt".
Do you have any idea why this happens?
Could it be because the "files" folder is not in "http://www.accell-group.com/sitemap.xml"?
Thanks in advance!
My code:
class PyscrappSpider(CrawlSpider):
    """This is the Pyscrapp spider"""
    name = "PyscrappSpider"
    def __init__(self, *a, **kw):
        # Get the passed URL
        originalURL = kw.get('originalURL')
        logger.debug('Original url = {}'.format(originalURL))
        # Add a protocol, if needed
        startURL = 'http://{}/'.format(originalURL)
        self.start_urls = [startURL]
        self.in_redirect = {}
        self.allowed_domains = [urlparse(i).hostname.strip() for i in self.start_urls]
        self.pattern = r""
        self.rules = (Rule(LinkExtractor(deny=[r"accessdenied"]), callback="parse_data", follow=True), )
        # Get WARC writer
        self.warcHandler = kw.get('warcHandler')
        # Initialise the base constructor
        super(PyscrappSpider, self).__init__(*a, **kw)
    def parse_start_url(self, response):
        if "redirect_urls" in response.request.meta:
            original_url = response.request.meta["redirect_urls"][0]
            if original_url not in self.in_redirect or not self.in_redirect[original_url]:
                self.in_redirect[original_url] = True
                self.allowed_domains.append(original_url)
        return self.parse_data(response)
    def parse_data(self, response):
        """This function extracts data from the page."""
        self.warcHandler.write_response(response)
        pattern = self.pattern
        # Check if we are interested in the current page
        if (not response.request.headers.get('Referer')
                or re.search(pattern, self.ensure_not_null(response.meta.get('link_text')), re.IGNORECASE)
                or re.search(r"/(" + pattern + r")", self.ensure_not_null(response.url), re.IGNORECASE)):
            logging.debug("This page gets processed = %(url)s", {'url': response.url})
            sel = Selector(response)
            item = PyscrappItem()
            item['url'] = response.url
            return item
        else:
            logging.warning("This page does NOT get processed = %(url)s", {'url': response.url})
            return response.request
Remove or appropriately expand your allowed_domains variable and you should be fine. By default, all URLs the spider follows are restricted by allowed_domains.
EDIT: This case is specifically about PDFs. PDFs are explicitly excluded by the default value of deny_extensions, which is IGNORED_EXTENSIONS.
To allow your application to crawl PDFs, all you have to do is exclude them from IGNORED_EXTENSIONS by explicitly setting the value of deny_extensions:
from scrapy.linkextractors import IGNORED_EXTENSIONS
# same rule as in the question, with 'pdf' removed from the denied extensions
self.rules = (Rule(
    LinkExtractor(deny=[r"accessdenied"], deny_extensions=set(IGNORED_EXTENSIONS)-set(['pdf'])),
    callback="parse_data", follow=True), )
So, I'm afraid, this is the answer to the question "Why does Scrapy miss some links?". As you will likely see it just opens the doors to further questions, like "how do I handle those PDFs" but I guess this is the subject of another question.
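The effect of that set difference can be seen in isolation. The sketch below (Python 3, standard library only) is not Scrapy's implementation: SAMPLE_IGNORED is a shortened, hypothetical stand-in for IGNORED_EXTENSIONS, and is_followed is an illustrative helper that only mimics filtering links by file extension.

```python
from os.path import splitext
from urllib.parse import urlparse

# Hypothetical, shortened stand-in for Scrapy's IGNORED_EXTENSIONS list.
SAMPLE_IGNORED = ["pdf", "zip", "jpg", "mp3"]


def is_followed(url, deny_extensions):
    """Return True if the URL's file extension is not in the deny set."""
    extension = splitext(urlparse(url).path)[1].lower().lstrip(".")
    return extension not in deny_extensions


default_deny = set(SAMPLE_IGNORED)
allow_pdf = set(SAMPLE_IGNORED) - {"pdf"}

url = "http://www.accell-group.com/files/4/5/0/1/Jaarverslag2014.pdf"
followed_by_default = is_followed(url, default_deny)
followed_with_pdf_allowed = is_followed(url, allow_pdf)
print(followed_by_default, followed_with_pdf_allowed)
```

With the default set the PDF link is dropped before any request is made, which would explain why no warning or error shows up in the crawl log.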

TDD Django tests seem to skip certain parts of the view code

I'm writing some tests for a site using django TDD.
The problem is that when I manually go to the test server, fill in the form and submit it, everything seems to work fine. But when I run the test using manage.py test wiki, it seems to skip parts of the code within the view. The page parts all seem to work fine, but the pagemod parts within the code, and even a write() I added just to see what was going on, seem to be ignored.
I have no idea what could be causing this and can't seem to find a solution. Any ideas?
This is the code:
test.py
#imports
class WikiSiteTest(LiveServerTestCase):
    ....
    def test_wiki_links(self):
        '''Go to the site, and check a few links'''
        #creating a few objects which will be used later
        .....
        #some code to get to where I want:
        .....
        #testing the link to see if the tester can add pages
        link = self.browser.find_element_by_link_text('Add page (for testing only. delete this later)')
        link.click()
        #filling in the form
        template_field = self.browser.find_element_by_name('template')
        template_field.send_keys('homepage')
        slug_field = self.browser.find_element_by_name('slug')
        slug_field.send_keys('this-is-a-slug')
        title_field = self.browser.find_element_by_name('title')
        title_field.send_keys('this is a title')
        meta_field = self.browser.find_element_by_name('meta_description')
        meta_field.send_keys('this is a meta')
        content_field = self.browser.find_element_by_name('content')
        content_field.send_keys('this is content')
        #submitting the filled form so that it can be processed
        s_button = self.browser.find_element_by_css_selector("input[value='Submit']")
        s_button.click()
        # now the view is called
and a view:
views.py
def page_add(request):
    '''This function does one of these 3 things:
    - Prepares an empty form
    - Checks the formdata it got. If its ok then it will save it and create and save
      a copy in the form of a Pagemodification.
    - Checks the formdata it got. If its not ok then it will redirect the user back'''
    .....
    if request.method == 'POST':
        form = PageForm(request.POST)
        if form.is_valid():
            user = request.user.get_profile()
            page = form.save(commit=False)
            page.partner = user.partner
            page.save() #works
            #Gets ignored
            pagemod = PageModification()
            pagemod.template = page.template
            pagemod.parent = page.parent
            pagemod.page = Page.objects.get(slug=page.slug)
            pagemod.title = page.title
            pagemod.meta_description = page.meta_description
            pagemod.content = page.content
            pagemod.author = request.user.get_profile()
            pagemod.save()
            f = open("/location/log.txt", "w", True)
            f.write('are you reaching this line?')
            f.close()
            #/gets ignored
    #a render to response
Then later I do:
test.py
print '###############Data check##################'
print Page.objects.all()
print PageModification.objects.all()
print '###############End data check##############'
And get:
terminal:
###############Data check##################
[<Page: this is a title 2012-10-01 14:39:21.739966+00:00>]
[]
###############End data check##############
All the imports are fine. Putting the page.save() after the ignored code makes no difference.
This only happens when running it through the TDD test.
Thanks in advance.
How very strange. Could it be that the view is somehow erroring at the PageModification stage? Have you got any checks later on in your test that assert that the response from the view is coming through correctly, i.e. that a 500 error is not being returned instead?
Now this was a long time ago.
It was solved but the solution was a little embarrassing. Basically, it was me being stupid. I can't remember the exact details but I believe a different view was called instead of the one that I showed here. That view had the same code except the "skipped" part.
My apologies to anyone who took their time looking into this.

Problems Scraping a Page With Beautiful Soup

I am using Beautiful Soup to try and scrape a page.
I am trying to follow this tutorial.
I am trying to get the contents of the following page after submitting a Stock Ticker Symbol:
http://www.cboe.com/delayedquote/quotetable.aspx
The tutorial is for a page with a "GET" method, but my page uses "POST". I wonder if that is part of the problem?
I want to use the first text box, under where it says:
“Enter a Stock or Index symbol below for delayed quotes.”
Relevant code:
user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = { 'User-Agent' : user_agent }
values = {'ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol' : 'IBM' }
data = urllib.urlencode(values)
request = urllib2.Request("http://www.cboe.com/delayedquote/quotetable.aspx", data, headers)
response = urllib2.urlopen(request)
The call does not fail, but I do not get a set of options and prices returned like I do when I run the page interactively. I get a bunch of garbled HTML.
Thanks in advance!
Ok - I think I figured out the problem (and found another). I decided to switch to 'mechanize' from 'urllib2'. Unfortunately, I kept having problems getting the data. Finally, I realized that there are two 'submit' buttons, so I tried passing the name parameter when submitting the form. That did the trick as far as getting the correct response.
However, the next problem was that I could not get BeautifulSoup to parse the HTML and find the necessary tags. A brief Google search revealed others having similar problems. So, I gave up on BeautifulSoup and just did a basic regex on the HTML. Not as elegant as BeautifulSoup, but effective.
Ok - enough speechifying. Here's what I came up with:
import mechanize
import re
br = mechanize.Browser()
url = 'http://www.cboe.com/delayedquote/quotetable.aspx'
br.open(url)
br.select_form(name='aspnetForm')
br['ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol'] = 'IBM'
# here's the key step that was causing the trouble - pass the name parameter
# for the button when calling submit
response = br.submit(name="ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$btnSubmit")
data = response.read()
match = re.search( r'Bid</font><span> \s*([0-9]{1,4}\.[0-9]{2})', data, re.M|re.I)
if match:
    print match.group(1)
else:
    print "There was a problem retrieving the quote"
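The regex step can be checked without contacting the site. The following is a minimal sketch in Python 3 that applies a pattern like the one above to a made-up HTML fragment; extract_bid and both sample strings are hypothetical, and the real page markup may differ.

```python
import re

# Pattern modelled on the one in the answer: the price that follows "Bid".
BID_PATTERN = re.compile(r"Bid</font><span>\s*([0-9]{1,4}\.[0-9]{2})", re.IGNORECASE)


def extract_bid(html):
    """Return the bid price as a string, or None if the pattern is absent."""
    match = BID_PATTERN.search(html)
    return match.group(1) if match else None


# Hypothetical fragments, for illustration only.
with_quote = "<font>Bid</font><span> 172.35</span>"
without_quote = "<p>No quote available</p>"

bid = extract_bid(with_quote)
missing = extract_bid(without_quote)
print(bid, missing)
```

Returning None instead of printing inside the function leaves it to the caller to decide how to report a missing quote.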