Scrapy spider closes prematurely after yield request inside start_requests function - python-2.7

I have no idea at all how to do this. Basically, I want to run a database insert before running the scraping program. I have tried to do this by placing a yield Request at the top of the start_requests() function, like so, except what happens is that the pipeline executes and the scrape terminates instead of actually going on to the next line in start_requests.
def start_requests(self):
    db = MyDatabase()
    link = "http://alexanderwhite.se/"
    item = PopulateListingAvItem()
    self.start_urls.append(link)
    yield Request(link, callback=self.listing_av_populate, meta={'item': item}, priority=300)
    # program terminates, successfully completing the above request,
    # but I need it to continue to the next line
    query1 = "SELECT listing_id FROM listing_availability WHERE availability=1"
    listing_ids = db.query(query1)
    for lid in listing_ids:
        query2 = "SELECT url FROM listings WHERE listing_id=" + str(lid['listing_id'])
        self.start_urls.append(db.query(query2)[0]['url'])
    for url in self.start_urls:
        yield Request(url, self.parse, priority=1)

The solution is not obvious and could be helpful to others. In this case, when you need to perform an insert query before doing any scraping, you can yield the item and then put the rest of your start_urls generator code inside the first callback function, as suggested by ice13berg. The whole key is to use "yield item".
I don't quite understand why it works, but it does: it correctly executes the insert with the item, and uses the result of the insert for the next select queries.
def listing_av_populate(self, response):
    db = MyDatabase()
    item = response.meta['item']
    item['update_bid_av'] = 3
    yield item
    query1 = "SELECT listing_id FROM listing_availability WHERE availability=1"
    listing_ids = db.query(query1)
    for lid in listing_ids:
        query2 = "SELECT url FROM listings WHERE listing_id=" + str(lid['listing_id'])
        self.start_urls.append(db.query(query2)[0]['url'])
    for url in self.start_urls:
        yield Request(url, self.parse, priority=1)

def start_requests(self):
    link = "http://alexanderwhite.se/"
    item = PopulateListingAvItem()
    yield Request(link, callback=self.listing_av_populate, meta={'item': item}, priority=300)
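For context, here is a minimal sketch of what the pipeline side of this could look like, assuming a hypothetical MyDatabase helper and using update_bid_av as the trigger for the insert. The table name and SQL are illustrative assumptions, not taken from the original post:

class ListingAvPipeline(object):
    # Hypothetical pipeline sketch; MyDatabase and the SQL below are assumptions.
    def open_spider(self, spider):
        self.db = MyDatabase()

    def process_item(self, item, spider):
        if item.get('update_bid_av'):
            # the insert that has to run before the main crawl begins
            self.db.query("INSERT INTO listing_availability (availability) VALUES (%d)"
                          % item['update_bid_av'])
        return item

Because the item is yielded before any further Requests, the insert runs through the pipeline before the follow-up select queries build the real crawl list.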

Related

List populated with Scrapy is returned before actually filled

This involves pretty much the same code I just asked a different question about this morning, so if it looks familiar, that's because it is.
class LbcSubtopicSpider(scrapy.Spider):

    ...irrelevant/sensitive code...

    rawTranscripts = []
    rawTranslations = []

    def parse(self, response):
        rawTitles = []
        rawVideos = []
        for sel in response.xpath('//ul[1]'):  # only scrape the first list
            ...irrelevant code...
            index = 0
            for sub in sel.xpath('li/ul/li/a'):  # scrape the sublist items
                index += 1
                if index % 2 != 0:  # odd numbered entries are the transcripts
                    transcriptLink = sub.xpath('@href').extract()
                    #url = response.urljoin(transcriptLink[0])
                    #yield scrapy.Request(url, callback=self.parse_transcript)
                else:  # even numbered entries are the translations
                    translationLink = sub.xpath('@href').extract()
                    url = response.urljoin(translationLink[0])
                    yield scrapy.Request(url, callback=self.parse_translation)
        print rawTitles
        print rawVideos
        print "translations:"
        print self.rawTranslations

    def parse_translation(self, response):
        for sel in response.xpath('//p[not(@class)]'):
            rawTranslation = sel.xpath('text()').extract()
            rawTranslation = ''.join(rawTranslation)
            #print rawTranslation
            self.rawTranslations.append(rawTranslation)
            #print self.rawTranslations
My problem is that "print self.rawTranslations" in the parse(...) method prints nothing more than "[]". This could mean one of two things: either the list is being reset right before printing, or the print happens before the calls to parse_translation(...), which populate the list from the links parse(...) follows, have finished. I'm inclined to suspect the latter, as I can't see any code that would reset the list, unless "rawTranslations = []" in the class body is run multiple times.
Worth noting: if I uncomment the same line in parse_translation(...), it prints the desired output, meaning the text is extracted correctly and the problem seems to be unique to the main parse(...) method.
My attempts to resolve what I believe is a synchronization problem were pretty aimless. I tried using an RLock object based on as many Google tutorials as I could find, and I'm 99% sure I misused it anyway, as the result was identical.
The problem here is that you are not understanding how Scrapy really works.
Scrapy is a crawling framework used for creating website spiders, not just for making requests; that's what the requests module is for.
Scrapy's requests work asynchronously: when you call yield Request(...) you are adding a request to a queue of pending requests that will be executed at some point (you don't have control over when). That means you can't expect the code after the yield Request(...) to run at that moment. In fact, your method should always end by yielding a Request or an Item.
From what I can see, and as in most cases of confusion with Scrapy, you want to keep populating an item you created in one method, but the information you need lives in a different request.
In that case, communication between callbacks is usually done with the meta parameter of the Request, something like this:
...
yield Request(url, callback=self.second_method,
              meta={'item': myitem, 'moreinfo': 'moreinfo', 'foo': 'bar'})

def second_method(self, response):
    previous_meta_info = response.meta
    # I can access the previous item with `response.meta['item']`
    ...
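To make that pattern concrete, here is a minimal, self-contained sketch that builds an item across two requests using meta. The spider name, URLs, and field names are hypothetical, purely for illustration:

import scrapy

class MetaPassingSpider(scrapy.Spider):
    # Hypothetical example spider; names and URLs are made up for illustration.
    name = "meta_passing_example"
    start_urls = ["http://example.com/page1"]

    def parse(self, response):
        item = {'title': response.xpath('//title/text()').extract_first()}
        # Hand the partially built item to the next callback via meta.
        yield scrapy.Request("http://example.com/page2",
                             callback=self.parse_details,
                             meta={'item': item})

    def parse_details(self, response):
        item = response.meta['item']
        item['detail'] = response.xpath('//h1/text()').extract_first()
        # Only yield the item once it is complete.
        yield item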
So this seems like somewhat of a hacky solution, especially since I just learned of Scrapy's request priority functionality, but here's my new code that gives the desired result:
class LbcVideosSpider(scrapy.Spider):
    ...code omitted...
    done = 0  # variable to keep track of subtopic iterations
    rawTranscripts = []
    rawTranslations = []

    def parse(self, response):
        # initialize containers for each field
        rawTitles = []
        rawVideos = []
        ...code omitted...
        index = 0
        query = sel.xpath('li/ul/li/a')
        for sub in query:  # scrape the sublist items
            index += 1
            if index % 2 != 0:  # odd numbered entries are the transcripts
                transcriptLink = sub.xpath('@href').extract()
                #url = response.urljoin(transcriptLink[0])
                #yield scrapy.Request(url, callback=self.parse_transcript)
            else:  # even numbered entries are the translations
                translationLink = sub.xpath('@href').extract()
                url = response.urljoin(translationLink[0])
                yield scrapy.Request(url, callback=self.parse_translation,
                                     meta={'index': index / 2, 'maxIndex': len(query) / 2})
        print rawTitles
        print rawVideos

    def parse_translation(self, response):
        # grab meta variables
        i = response.meta['index']
        maxIndex = response.meta['maxIndex']
        # interested in p nodes without class
        query = response.xpath('//p[not(@class)]')
        for sel in query:
            rawTranslation = sel.xpath('text()').extract()
            rawTranslation = ''.join(rawTranslation)  # collapse each line
            self.rawTranslations.append(rawTranslation)
        # increment number of translations done, check if finished
        self.done += 1
        print self.done
        if self.done == maxIndex:
            print self.rawTranslations
Basically, I just kept track of how many requests had completed and made some code conditional on the final request. This prints the fully populated list.

How to call a function after finishing recursive asynchronous jobs in Python?

I use Scrapy for scraping this site.
I want to save all the sub-categories in an array, then get the corresponding pages (pagination).
As a first step I have:
def start_requests(self):
    yield Request(start_urls[i], callback=self.get_sous_cat)
get_sous_cat is a function which gets all the sub-categories of a site, then asynchronously starts jobs to explore the sub-sub-categories recursively.
def get_sous_cat(self, response):
    # Put all the categories in an array
    catList = response.css('div.categoryRefinementsSection')
    if catList:
        for category in catList.css('a::attr(href)').extract():
            category = 'https://www.amazon.fr' + category
            print category
            self.arrayCategories.append(category)
            yield Request(category, callback=self.get_sous_cat)
When all the respective requests have been sent, I need to call this termination function:
def pagination(self, response):
    for i in range(0, len(self.arrayCategories)):
        # DO something with each sub-category
I tried this
def start_requests(self):
    yield Request(start_urls[i], callback=self.get_sous_cat)
    for subCat in range(0, len(self.arrayCategories)):
        yield Request(self.arrayCategories[subCat], callback=self.pagination)
Well done, this is a good question! Two small things:
a) use a set instead of an array. That way you won't have duplicates.
b) the site structure will change once a month/year, but you will likely crawl more frequently. Break the spider into two: 1. one that creates the list of category urls and runs monthly, and 2. one that takes as start_urls the file generated by the first.
Now, if you really want to do it the way you do it now, hook the spider_idle signal (see here: Scrapy: How to manually insert a request from a spider_idle event callback?). This gets called when there are no further urls to crawl and allows you to inject more. Set a flag or reset your list at that point so that the second time the spider goes idle (after it has crawled everything), it doesn't re-inject the same category urls forever.
If, as it seems in your case, you don't want to do any fancy processing on the urls but just crawl categories before other URLs, this is what the Request priority property is for (http://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-subclasses). Just set it to e.g. 1 for your category URLs and it will follow those links before it processes any non-category links. This is more efficient, since it won't load those category pages twice as your current implementation would.
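For the spider_idle route, here is a minimal sketch of how the signal could be hooked up. The spider, URLs, and selectors are hypothetical, and it assumes a Scrapy 1.x API (crawler.engine.crawl taking a request and a spider); the injected flag is the guard mentioned above:

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class CategoryFirstSpider(scrapy.Spider):
    # Hypothetical spider for illustration; names, URLs and selectors are assumptions.
    name = "category_first_example"
    start_urls = ["https://www.example.com/"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CategoryFirstSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        spider.categories = set()   # use a set to avoid duplicates
        spider.injected = False     # flag so we only inject once
        return spider

    def parse(self, response):
        for href in response.css('a.category::attr(href)').extract():
            self.categories.add(response.urljoin(href))

    def on_idle(self, spider):
        if self.categories and not self.injected:
            self.injected = True
            for url in self.categories:
                self.crawler.engine.crawl(scrapy.Request(url, callback=self.paginate), spider)
            # keep the spider open so the newly scheduled requests get processed
            raise DontCloseSpider

    def paginate(self, response):
        pass  # pagination handling would go here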
This is not "recursion", it's asynchronous jobs. What you need is a global counter (protected by a Lock); when it reaches 0, run your completion:
from threading import Lock

class JobCounter(object):
    def __init__(self, completion_callback, *args, **kwargs):
        self.c = 0
        self.l = Lock()
        self.completion = (completion_callback, args, kwargs)

    def __iadd__(self, n):
        done = False
        with self.l:
            self.c += n
            if self.c <= 0:
                done = True
        if done:
            f, args, kwargs = self.completion
            f(*args, **kwargs)
        return self

    def __isub__(self, n):
        return self.__iadd__(-n)
each time you launch a job, do counter += 1
each time a job finishes, do counter -= 1
NOTE : this does the completion in the thread of the last calling job. If you want to do it in a particular thread, use a Condition instead of a Lock, and do notify() instead of the call.
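A short usage sketch of the counter above. The worker function and the completion callback are hypothetical, purely for illustration; all jobs are counted before any is launched so the callback can't fire early:

import threading

def all_jobs_finished():
    print "all jobs finished"      # completion callback (Python 2 print)

counter = JobCounter(all_jobs_finished)

def worker(n):
    global counter
    try:
        pass                       # ... the actual job would go here ...
    finally:
        counter -= 1               # job finished

jobs = range(5)
for n in jobs:
    counter += 1                   # count every job before launching any
threads = [threading.Thread(target=worker, args=(n,)) for n in jobs]
for t in threads:
    t.start()
for t in threads:
    t.join()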

Scrapy webcrawler gets caught in infinite loop, despite initially working.

Alright, so I'm working on a Scrapy-based web crawler with some simple functionality. The bot is supposed to go from page to page, parsing and then downloading. I've gotten the parser to work and I've gotten the downloading to work, but I can't get the crawling to work. I've read the documentation on the Spider class, I've read the documentation on how parse is supposed to work, I've tried returning vs. yielding, and I'm still nowhere. I have no idea where my code is going wrong. What seems to happen, from a debug script I wrote, is the following: the code will run, it will grab page 1 just fine, it'll get the link to page two, it'll go to page two, and then it will happily stay on page two, not grabbing page three at all. I don't know where the mistake in my code is, or how to alter it to fix it, so any help would be appreciated. I'm sure the mistake is basic, but I can't figure out what's going on.
import scrapy

class ParadiseSpider(scrapy.Spider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]

    def __init__(self):
        self.found = 0
        self.goto = "no"

    def parse(self, response):
        urlthing = response.xpath("//a[@title='Next page']").extract()
        urlthing = urlthing.pop()
        newurl = urlthing.split()
        print newurl
        url = newurl[1]
        url = url.replace("href=", "")
        url = url.replace('"', "")
        url = "http://forums.somethingawful.com/" + url
        print url
        self.goto = url
        return scrapy.Request(self.goto, callback=self.parse_save, dont_filter=True)

    def parse_save(self, response):
        nfound = str(self.found)
        print "Testing" + nfound
        self.found = self.found + 1
        return scrapy.Request(self.goto, callback=self.parse, dont_filter=True)
Use Scrapy's rule engine so that you don't need to write the next-page crawling code in the parse function. Just pass the XPath for the next-page link in restrict_xpaths, and the parse function will receive the response of each crawled page:
rules = (
    Rule(LinkExtractor(restrict_xpaths=['//a[contains(text(), "Next")]']), follow=True),
)

def parse(self, response):
    response.url
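A minimal, self-contained sketch of how that could look as a CrawlSpider against the forum URL from the question, assuming a Scrapy 1.x install. Note that CrawlSpider reserves parse() for its own link handling, so the callback here is named parse_page; the spider name is made up:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ParadiseCrawlSpider(CrawlSpider):
    # Sketch only: CrawlSpider follows the "Next page" links through the rule,
    # so there is no hand-rolled URL string manipulation.
    name = "testcrawl3"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths=['//a[@title="Next page"]']),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # CrawlSpider uses parse() internally, so a different callback name is required.
        print response.url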

How to start a new request after the item_scraped scrapy signal is called?

I need to scrape the data of each item from a website using Scrapy (http://example.com/itemview). I have a list of itemIDs and I need to pass each one into a form on example.com.
There is no url change for each item, so for each request in my spider the url will always be the same, but the content will be different.
I don't want a for loop for handling each request, so I followed the steps below:
started the spider with the above url
added item_scraped and spider_closed signals
passed through several functions
passed the scraped data to the pipeline
triggered the item_scraped signal
After this it automatically calls the spider_closed signal, but I want the above steps to continue until the itemIDs are exhausted.
import scrapy
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    itemIDs = [11111, 22222, 33333]
    current_item_num = 0

    def __init__(self, itemids=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.item_scraped, signals.item_scraped)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.quit()

    def start_requests(self):
        request = self.make_requests_from_url('http://example.com/itemview')
        yield request

    def parse(self, response):
        self.driver = webdriver.PhantomJS()
        self.driver.get(response.url)
        first_data = self.driver.find_element_by_xpath('//div[@id="itemview"]').text.strip()
        yield Request(response.url, meta={'first_data': first_data},
                      callback=self.processDetails, dont_filter=True)

    def processDetails(self, response):
        itemID = self.itemIDs[self.current_item_num]
        ..form submission with the current itemID goes here...
        ...the content of the page is updated with the given itemID...
        yield Request(response.url, meta={'first_data': response.meta['first_data']},
                      callback=self.processData, dont_filter=True)

    def processData(self, response):
        ...some more scraping goes here...
        item = ExamplecrawlerItem()
        item['first_data'] = response.meta['first_data']
        yield item

    def item_scraped(self, item, response, spider):
        self.current_item_num += 1
        # I need to call the processDetails function here for the next itemID
        # and the process needs to continue until the itemIDs finish
        self.parse(response)
My pipeline:
class ExampleDBPipeline(object):
    def process_item(self, item, spider):
        MYCOLLECTION.insert(dict(item))
        return
I wish I had an elegant solution to this, but instead here's a hackish way of calling the underlying classes:
self.crawler.engine.slot.scheduler.enqueue_request(scrapy.Request(url,self.yourCallBack))
However, you can yield a request after you yield the item and have it call back to self.processDetails. Simply add this to your processData function:
yield item
self.counter += 1
yield scrapy.Request(response.url, callback=self.processDetails,
                     dont_filter=True, meta={"your": "Dictionary"})
Also, PhantomJS can be nice and make your life easy, but it is slower than regular connections. If possible, find the request that returns the JSON data (or whatever makes the page unparseable without JS). To do so, open Chrome, right-click, click Inspect, go to the Network tab, enter the ID into the form, then look at the XHR or JS tabs for a JSON response that contains the data or the next url you want. Most of the time there will be some url built by appending the ID; if you can find it, you can just concatenate your urls and call that directly, without paying the cost of JS rendering. Sometimes it is randomized or simply not there, but I've had fair success with it. You can then also use it to yield many requests at the same time without having to worry about PhantomJS trying to do two things at once, or having to initialize many instances of it. You could use tabs, but that is a pain.
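If such an endpoint does exist, a spider can hit it directly. This is a hypothetical sketch: the API URL pattern, spider name, and JSON fields are assumptions that would have to be discovered in the browser's Network tab as described above:

import json

import scrapy

class ItemApiSpider(scrapy.Spider):
    # Hypothetical spider; the endpoint URL and response fields are assumptions.
    name = "item_api_example"
    itemIDs = [11111, 22222, 33333]

    def start_requests(self):
        for itemID in self.itemIDs:
            url = "http://example.com/api/itemview?id=%d" % itemID
            yield scrapy.Request(url, callback=self.parse_json)

    def parse_json(self, response):
        data = json.loads(response.body)
        # yield whatever fields the endpoint actually exposes
        yield {'itemID': data.get('id'), 'first_data': data.get('first_data')}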
Also, I would use a Queue of your IDs to ensure thread safety. Otherwise, you could have processDetails called twice on the same ID. (That said, in the logic of your program everything seems to go linearly, which means you aren't using the concurrency capabilities of Scrapy, and your program will run more slowly.) To use a Queue, add:
import Queue

# go inside the class definition and add
itemIDQueue = Queue.Queue()

# within __init__ add
[self.itemIDQueue.put(ID) for ID in self.itemIDs]

# within processDetails replace itemID = self.itemIDs[self.current_item_num] with
itemID = self.itemIDQueue.get()
And then there is no need to increment the counter and your program is thread safe.
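Putting those pieces together, here is a condensed sketch of how the queue could slot into the spider from the question. Only the affected parts are shown, the placeholder item is made up, and the elided form-submission logic is left out:

import Queue  # Python 2 standard library queue

import scrapy
from scrapy.http import Request

class ExampleSpider(scrapy.Spider):
    name = "example"
    itemIDs = [11111, 22222, 33333]

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        # Thread-safe queue of IDs to work through.
        self.itemIDQueue = Queue.Queue()
        for ID in self.itemIDs:
            self.itemIDQueue.put(ID)

    def processDetails(self, response):
        itemID = self.itemIDQueue.get()  # each call consumes exactly one ID
        # ...form submission with itemID would go here...
        yield Request(response.url, callback=self.processData, dont_filter=True)

    def processData(self, response):
        item = {'itemID_processed': True}  # placeholder item for the sketch
        yield item
        if not self.itemIDQueue.empty():
            # keep the loop going until the queue is drained
            yield Request(response.url, callback=self.processDetails, dont_filter=True)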

django subprocess p.wait() doesn't return - web

With a Django button, I need to launch multiple tracks (selected at random).
In my models.py, I have two functions, 'playmusic' and 'playmusicrandom':
def playmusic(self, music):
    if self.isStarted():
        self.stop()
    command = "sudo /usr/bin/mplayer " + music.path
    p = subprocess.Popen(command, shell=True)
    p.wait()

def playmusicrandom(request):
    conn = sqlite3.connect(settings.DATABASES['default']['NAME'])
    cur = conn.cursor()
    cur.execute("SELECT id FROM webgui_music")
    list_id = [row[0] for row in cur.fetchall()]
    ### Get three IDs randomly from the list ###
    selected_ids = random.sample(list_id, 3)
    for i in selected_ids:
        music = Music.objects.get(id=i)
        player.playmusic(music)
With this code, three tracks are played (one after the other), but the web page just shows "Loading..." during execution.
Is there a way to display the refreshed web page to the user during the loop?
Thanks.
Your view is blocked from returning anything to the web server while it is waiting for playmusicrandom() to finish.
You need to arrange for playmusicrandom() to do its task after you've returned the HTTP response from the view.
This means that you likely need a thread (or similar solution).
Your view will have something like this:
import threading

t = threading.Thread(target=player_model.playmusicrandom,
                     args=(request,))
t.setDaemon(True)
t.start()
return HttpResponse()
This code snippet came from here, where you will find more detailed information about the issues you face and possible solutions.