Scrapy Request callback method never called - python-2.7

I am building a CrawlSpider using Scrapy 0.22.2 for Python 2.7.3 and am having problems with Requests, where the callback method that I specify is never called. Here is a snippet from my parsing method that initiates a Request within an elif block:
elif current_status == "Superseded":
    # Need to do more work here. Have to check whether there is a replacement unit available.
    # If there isn't, download whatever outline is there.
    # We need to look for a <td> element which contains "Is superseded by " and follow that link
    updated_unit = hxs.xpath('/html/body/div[@id="page"]/div[@id="layoutWrapper"]/div[@id="twoColLayoutWrapper"]/div[@id="twoColLayoutLeft"]/div[@class="layoutContentWrapper"]/div[@class="outer"]/div[@class="fieldset"]/div[@class="display-row"]/div[@class="display-row"]/div[@class="display-field-info"]/div[@class="t-widget t-grid"]/table/tbody/tr[1]/td[contains(., "Is superseded by ")]/a')
    # need child element a
    updated_unit_link = updated_unit.xpath('@href').extract()[0]
    updated_url = "http://training.gov.au" + updated_unit_link
    print "\033[0;31mSuperseded by " + updated_url + "\033[0m"  # prints in red for superseded; need to follow this link to the current unit
    yield Request(url=updated_url, callback='sortSuperseded', dont_filter=True)

def sortSuperseded(self, response):
    print "\033[0;35mtest callback called\033[0m"
There are no errors when I execute this and the URL is OK, but sortSuperseded is never called: I never see 'test callback called' printed in the console.
The URL I am extracting is also within the domain that I specify for my CrawlSpider:
allowed_domains = ["training.gov.au"]
Where am I going wrong?

The callback should not be a string: Request expects a callable, so pass a reference to the method instead. Change the line:
yield Request(url=updated_url, callback='sortSuperseded', dont_filter=True)
to
yield Request(updated_url, callback=self.sortSuperseded, dont_filter=True)
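If it helps, here is a minimal sketch of how the corrected call fits into a spider. The class name, start URL, and placeholder unit URL are assumptions for illustration; only the callback pattern is the point.

import scrapy
from scrapy.http import Request

class UnitSpider(scrapy.Spider):
    # hypothetical spider wrapping the snippet from the question
    name = "training_units"
    allowed_domains = ["training.gov.au"]
    start_urls = ["http://training.gov.au/"]

    def parse(self, response):
        # ... status handling as in the question ...
        updated_url = "http://training.gov.au/Training/Details/XYZ"  # placeholder
        # pass the bound method itself, not its name as a string
        yield Request(updated_url, callback=self.sortSuperseded, dont_filter=True)

    def sortSuperseded(self, response):
        # defined at class level, so self.sortSuperseded exists when parse runs
        print "\033[0;35mtest callback called\033[0m"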

Save a cookie from callback function in Dash by Plotly

I looked at the following post explaining how to store cookies:
How to access a cookie from callback function in Dash by Plotly?
I'm trying to replicate this and I'm not able to store/retrieve cookies. What is wrong in the simple example below? There are no error messages, but when debugging, the all_cookies dict is empty, while I'd expect it to have at least one member, 'dash cookie'.
@app.callback(
    Output(ThemeSwitchAIO.ids.switch("theme"), "value"),
    Input("url-login", "pathname"),
)
def save_load_cookie(value):
    dash.callback_context.response.set_cookie('dash cookie', '1 - cookie')
    all_cookies = dict(flask.request.cookies)
    return dash.no_update
Please note the app is running on my local machine via the standard Flask server:
app.run_server(host='127.0.0.1', port=80, debug=True,
               use_debugger=False, use_reloader=False, passthrough_errors=True)
Thank you @coralvanda: the callback needs to return a value instead of dash.no_update. The code should simply be:
@app.callback(
    Output(ThemeSwitchAIO.ids.switch("theme"), "value"),
    Input("url-login", "pathname"),
)
def save_load_cookie(value):
    dash.callback_context.response.set_cookie('dash cookie', '1 - cookie')
    all_cookies = dict(flask.request.cookies)
    return value
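To confirm the cookie actually round-trips, a second callback can read it back from flask.request.cookies on a later request. This is only a sketch; the cookie-check output component is an assumed placeholder in the layout.

@app.callback(
    Output("cookie-check", "children"),  # assumed placeholder Div in the layout
    Input("url-login", "pathname"),
)
def show_cookies(pathname):
    # flask.request.cookies only contains cookies the browser sent with this request,
    # so the cookie set above appears here on the next callback, not the one that set it
    return str(dict(flask.request.cookies))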

Additional arguments in Flask grequests hook

I am having an issue passing an additional parameter to grequests using a hook. It works in a standalone (non-Flask) app, but not with Flask (the integrated Flask server). Here is my code snippet.
self.async_list = []
for url in self.urls:
    self.action_item = grequests.get(url, hooks={'response': [self.hook_factory(test='new_folder')]}, proxies={'http': 'proxy url'}, timeout=20)
    self.async_list.append(self.action_item)
grequests.map(self.async_list)

def hook_factory(self, test, *factory_args, **factory_kwargs):
    print (test + "In start of hook factory")  # this worked and I see the test value printed as new_folder

    def do_something(response, *args, **kwargs):
        print (test + "In do something")  # this is not working, hence I was not able to save this response to a newly created folder
        self.file_name = str(test) + "/"
        print ("file name is " + self.file_name)
        with open(REL_PATH + self.file_name, 'wb') as f:
            f.write(response.content)
        return None

    return do_something
Am I missing anything here?
To answer my own question: after further analysis there was nothing wrong with the code above. For some reason I was not getting my session data, which lives in _request_ctx_stack.top, but the same session data was available in _request_ctx_stack._local. I don't know the reason, but for this hook alone I was able to get my data from _request_ctx_stack._local instead of _request_ctx_stack.top, and after that change the hook executed without any issues.
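A less fragile alternative (my own sketch, not part of the original answer): Flask keeps the request context in context locals that the greenlets spawned by grequests do not share, so you can read whatever you need from flask.request or flask.session up front, while the context is still active, and pass it into the hook factory as a plain value. The session key below is an assumption.

import flask
import grequests

def fetch_all(self, urls):
    # capture request-bound data here, while Flask's request context is still available
    folder = flask.session.get('folder', 'new_folder')  # assumed session key
    async_list = [
        grequests.get(url,
                      hooks={'response': [self.hook_factory(test=folder)]},
                      timeout=20)
        for url in urls
    ]
    # the hooks now close over plain values and never touch Flask's context locals
    grequests.map(async_list)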

Scrapy webcrawler gets caught in infinite loop, despite initially working.

Alright, so I'm working on a Scrapy-based webcrawler with some simple functionality. The bot is supposed to go from page to page, parsing and then downloading. I've gotten the parser to work and I've gotten the downloading to work, but I can't get the crawling to work. I've read the documentation on the Spider class, I've read the documentation on how parse is supposed to work, and I've tried returning vs. yielding, and I'm still nowhere. I have no idea where my code is going wrong.
What seems to happen, from a debug script I wrote, is the following: the code will run, it will grab page 1 just fine, it'll get the link to page two, it'll go to page two, and then it will happily stay on page two, not grabbing page three at all. I don't know where the mistake in my code is, or how to alter it to fix it, so any help would be appreciated. I'm sure the mistake is basic, but I can't figure out what's going on.
import scrapy

class ParadiseSpider(scrapy.Spider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]

    def __init__(self):
        self.found = 0
        self.goto = "no"

    def parse(self, response):
        urlthing = response.xpath("//a[@title='Next page']").extract()
        urlthing = urlthing.pop()
        newurl = urlthing.split()
        print newurl
        url = newurl[1]
        url = url.replace("href=", "")
        url = url.replace('"', "")
        url = "http://forums.somethingawful.com/" + url
        print url
        self.goto = url
        return scrapy.Request(self.goto, callback=self.parse_save, dont_filter=True)

    def parse_save(self, response):
        nfound = str(self.found)
        print "Testing" + nfound
        self.found = self.found + 1
        return scrapy.Request(self.goto, callback=self.parse, dont_filter=True)
Use Scrapy's rule engine so that you don't need to write the next-page crawling code in the parse function. Just pass the XPath for the next-page link to restrict_xpaths, and the parse function will get the response of each crawled page:
rules = (Rule(LinkExtractor(restrict_xpaths=['//a[contains(text(), "Next")]']), follow=True),)

def parse(self, response):
    response.url
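A fuller sketch of that approach, assuming a 1.x-style Scrapy install, the start URL from the question, and a parse_page callback name (CrawlSpider uses parse internally for its rule handling, so the per-page callback gets a different name):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ParadiseCrawlSpider(CrawlSpider):
    name = "testcrawl3"  # hypothetical name
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]

    # follow every "Next page" link and hand each fetched page to parse_page
    rules = (
        Rule(LinkExtractor(restrict_xpaths=['//a[@title="Next page"]']),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # the per-page parsing/downloading logic from the original parse goes here
        print response.url

Note that the start page itself is not passed to parse_page by the rules; if you need it as well, CrawlSpider's parse_start_url can be overridden for that.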

How to start a new request after the item_scraped scrapy signal is called?

I need to scrape the data of each item from a website using Scrapy (http://example.com/itemview). I have a list of itemIDs and I need to pass each one in a form on example.com.
There is no URL change for each item, so for each request in my spider the URL will always be the same, but the content will be different.
I don't want a for loop for handling each request, so I followed the steps below:
started the spider with the above URL
added item_scraped and spider_closed signals
passed through several functions
passed the scraped data to the pipeline
triggered the item_scraped signal
After this it automatically calls the spider_closed signal, but I want the above steps to continue until the list of itemIDs is finished.
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    itemIDs = [11111, 22222, 33333]
    current_item_num = 0

    def __init__(self, itemids=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.item_scraped, signals.item_scraped)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.quit()

    def start_requests(self):
        request = self.make_requests_from_url('http://example.com/itemview')
        yield request

    def parse(self, response):
        self.driver = webdriver.PhantomJS()
        self.driver.get(response.url)
        first_data = self.driver.find_element_by_xpath('//div[@id="itemview"]').text.strip()
        yield Request(response.url, meta={'first_data': first_data}, callback=self.processDetails, dont_filter=True)

    def processDetails(self, response):
        itemID = self.itemIDs[self.current_item_num]
        # ...form submission with the current itemID goes here...
        # ...the content of the page is updated with the given itemID...
        yield Request(response.url, meta={'first_data': response.meta['first_data']}, callback=self.processData, dont_filter=True)

    def processData(self, response):
        # ...some more scraping goes here...
        item = ExamplecrawlerItem()
        item['first_data'] = response.meta['first_data']
        yield item

    def item_scraped(self, item, response, spider):
        self.current_item_num += 1
        # I need to call the processDetails function here for the next itemID
        # and the process needs to continue till the itemIDs are finished
        self.parse(response)
My pipeline:
class ExampleDBPipeline(object):
    def process_item(self, item, spider):
        MYCOLLECTION.insert(dict(item))
        return
I wish I had an elegant solution to this, but all I have is a hackish way of calling into the underlying classes:
self.crawler.engine.slot.scheduler.enqueue_request(scrapy.Request(url, self.yourCallBack))
However, you can yield a request after you yield the item and have it call back to self.processDetails. Simply add this to your processData function:
yield item
self.counter += 1
yield scrapy.Request(response.url, callback=self.processDetails, dont_filter=True, meta={"your": "Dictionary"})
Also, PhantomJS can be nice and make your life easy, but it is slower than regular connections. If possible, find the request for the JSON data or whatever makes the page unparseable without JS. To do so, open up Chrome, right click, click Inspect, go to the Network tab, then enter the ID into the form, then look at the XHR or JS tabs for a JSON response that has the data or next URL you want. Most of the time there will be some URL made by adding the ID; if you can find it, you can just concatenate your URLs and call that directly without the cost of JS rendering. Sometimes it is randomized, or not there, but I've had fair success with it. You can then also use that to yield many requests at the same time without having to worry about PhantomJS trying to do two things at once or having to initialize many instances of it. You could use tabs, but that is a pain.
Also, I would use a Queue of your IDs to ensure thread safety. Otherwise, you could have processDetails called twice on the same ID; although in the logic of your program everything seems to go linearly, that means you aren't using the concurrency capabilities of Scrapy and your program will run more slowly. To use a Queue, add:
import Queue

# go inside the class definition and add
itemIDQueue = Queue.Queue()

# within __init__ add
[self.itemIDQueue.put(ID) for ID in self.itemIDs]

# within processDetails, replace itemID = self.itemIDs[self.current_item_num] with
itemID = self.itemIDQueue.get()
And then there is no need to increment the counter and your program is thread safe.
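Putting the two suggestions together, here is a sketch of how processDetails and processData could chain until the queue runs dry. The form-submission step is still elided as in the question, and stopping via Queue.Empty is an assumption about how you want the spider to finish.

import Queue

def processDetails(self, response):
    try:
        itemID = self.itemIDQueue.get(block=False)
    except Queue.Empty:
        return  # no IDs left; let the spider finish naturally
    # ...form submission with itemID goes here, as in the question...
    yield Request(response.url, meta={'first_data': response.meta['first_data']},
                  callback=self.processData, dont_filter=True)

def processData(self, response):
    # ...some more scraping goes here...
    item = ExamplecrawlerItem()
    item['first_data'] = response.meta['first_data']
    yield item
    # immediately chain the next ID instead of waiting for the item_scraped signal
    yield Request(response.url, meta={'first_data': response.meta['first_data']},
                  callback=self.processDetails, dont_filter=True)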

Unable to reverse view in middleware

I'm trying to catch a missing variable in a Django route with middleware; however, I am unable to reverse the URL, as Django cannot find the view (even though it exists in the urlconf). For example:
With this route:
# matches /test and /game/test
url(r'^((?P<game>[A-Za-z0-9]+)/)?test', 'hyp.views.test'),
I am trying to detect if the game part is not given, and redirect in that case with middleware:
from django.core.urlresolvers import resolve, reverse
from django.http import HttpResponseRedirect

class GameMiddleware:
    def process_view(self, request, view_func, view_args, view_kwargs):
        if 'game' in view_kwargs:
            game = view_kwargs['game']
            if game is None:
                # As a test, attempt to resolve the url
                # Correctly finds ResolverMatch for hyp.views.test, game=TestGame
                print resolve('/TestGame/test', urlconf=request.urlconf)
                # Fails with "Reverse for 'hyp.views.test' with arguments '()'
                # and keyword arguments '{'game': 'TestGame'}' not found."
                return HttpResponseRedirect(reverse(
                    request.resolver_match.url_name,  # 'hyp.views.test'
                    urlconf=request.urlconf,
                    kwargs={'game': 'TestGame'}
                ))
        return None
request.urlconf does contain the test url:
{ '__name__': 'urlconf', '__doc__': None, 'urlpatterns': [
<RegexURLPattern None ^$>,
<RegexURLPattern None ^((?P<game>[A-Za-z0-9]+)/)?test>
], '__package__': None }
The only thing I can think of is that the URL reverser might not be able to deal with a regex containing optional parts. Would a better solution be to create separate views for these cases (I'm going to have a lot of views with optional game params), or can I fix it?
Update
I managed to get it to work by removing the wrapping brackets in the route (so it reads r'^(?P<game>[A-Za-z0-9]+/)?test') and by passing 'TestGame/' as the game. However, this isn't ideal, as I have to call game.rstrip('/') each time (although only in the middleware). It's also difficult to use {% url %} tags, as a name ending with / is expected.
Leaving this open in case someone has a better solution.
Thanks to Todd's answer on another question, I found a clean method of doing this: define two routes (one with the game and one without), specifying game as None in the route without the pattern:
url(r'^test$', 'hyp.views.test', {'game': None}),
url(r'^(?P<game>[A-Za-z0-9]+)/test', 'hyp.views.test'),
This triggers the middleware's if game is None part correctly and also allows games to be specified without trailing slashes.
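With the two-route setup, the reverse call in the middleware should resolve cleanly for both forms. A quick sketch against the old-style dotted view paths used above (if you add name= arguments to the patterns, reverse by name instead):

from django.core.urlresolvers import reverse

reverse('hyp.views.test')                               # matches the first pattern  -> '/test'
reverse('hyp.views.test', kwargs={'game': 'TestGame'})  # matches the second pattern -> '/TestGame/test'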