How to call a function after finishing recursive asynchronous jobs in Python? - python-2.7

I use Scrapy to scrape this site. I want to save all the sub-categories in an array, then get the corresponding pages (pagination).
As a first step I have:
def start_requests(self):
    yield Request(start_urls[i], callback=self.get_sous_cat)
get_sous_cat is a function which gets all the sub-categories of a site, then asynchronously starts jobs to explore the sub-sub-categories recursively:
def get_sous_cat(self, response):
    # Put all the categories in an array
    catList = response.css('div.categoryRefinementsSection')
    if catList:
        for category in catList.css('a::attr(href)').extract():
            category = 'https://www.amazon.fr' + category
            print category
            self.arrayCategories.append(category)
            yield Request(category, callback=self.get_sous_cat)
When all the respective requests have been sent, I need to call this termination function:
def pagination(self, response):
    for i in range(0, len(self.arrayCategories)):
        # Do something with each sub-category
I tried this
def start_requests(self):
    yield Request(start_urls[i], callback=self.get_sous_cat)
    for subCat in range(0, len(self.arrayCategories)):
        yield Request(self.arrayCategories[subCat], callback=self.pagination)

Well done, this is a good question! Two small things:
a) use a set instead of an array. This way you won't have duplicates
b) the site structure will change once a month or year, while you will likely crawl more frequently. Break the spider in two: 1. one that creates the list of category URLs and runs monthly, and 2. one that takes the file generated by the first as its start_urls
Now, if you really want to do it the way you do it now, hook the spider_idle signal (see here: Scrapy: How to manually insert a request from a spider_idle event callback?). This gets called when there are no further URLs to crawl and allows you to inject more. Set a flag or reset your list at that point so that the second time the spider goes idle (after it has crawled everything), it doesn't re-inject the same category URLs forever.
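For illustration, here is a rough sketch of that spider_idle wiring (the MySpider name and the pagination_started flag are mine, and the two-argument engine.crawl(request, spider) call assumes an older, Python 2-era Scrapy; check the signals docs for your version):
import scrapy
from scrapy import Request, signals
from scrapy.exceptions import DontCloseSpider

class MySpider(scrapy.Spider):
    pagination_started = False  # guard so the categories are injected only once

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def on_idle(self, spider):
        if not self.pagination_started:
            self.pagination_started = True
            for url in self.arrayCategories:
                self.crawler.engine.crawl(Request(url, callback=self.pagination), self)
            raise DontCloseSpider  # keep the spider alive until the injected requests finish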
If, as it seems in your case, you don't want to do some fancy processing on the urls but just crawl categories before other URLs, this is what Request priority property is for (http://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-subclasses). Just set it to e.g. 1 for your category URLs and then it will follow those links before it processes any non-category links. This is more efficient since it won't load those category pages twice as your current implementation would do.
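If you go the priority route, a minimal sketch could look like this (priority is a documented Request argument; the rest mirrors the question's get_sous_cat):
def get_sous_cat(self, response):
    for href in response.css('div.categoryRefinementsSection a::attr(href)').extract():
        url = 'https://www.amazon.fr' + href
        self.arrayCategories.append(url)
        # priority=1 beats the default 0, so category pages are scheduled first
        yield Request(url, callback=self.get_sous_cat, priority=1)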

This is not "recursion", it's asynchronous jobs. What you need is a global counter (protected by a Lock) and if 0, do your completion :
from threading import Lock

class JobCounter(object):
    def __init__(self, completion_callback, *args, **kwargs):
        self.c = 0
        self.l = Lock()
        self.completion = (completion_callback, args, kwargs)

    def __iadd__(self, n):
        b = False
        with self.l:
            self.c += n
            if self.c <= 0:
                b = True
        if b:
            f, args, kwargs = self.completion
            f(*args, **kwargs)
        return self  # in-place operators must return the object

    def __isub__(self, n):
        return self.__iadd__(-n)
each time you launch a job, do counter += 1
each time a job finishes, do counter -= 1
NOTE : this does the completion in the thread of the last calling job. If you want to do it in a particular thread, use a Condition instead of a Lock, and do notify() instead of the call.
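A hedged usage sketch of the counter (pagination_done, schedule_async_fetch and job_finished are placeholder names, not part of the answer):
counter = JobCounter(pagination_done)  # pagination_done: whatever you want run at the end

def launch_job(url):
    global counter
    counter += 1  # one more job in flight
    schedule_async_fetch(url, callback=job_finished)  # hypothetical async scheduler

def job_finished(result):
    global counter
    # ... handle the result ...
    counter -= 1  # the last job to finish triggers pagination_done()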

Related

running flow through code Django-Viewflow

I want to run my process entirely by code. I've managed to start a process but I can't run the next part of the process. I tried using flow.Function and calling my desired function, but it says
'Function {} should be called with task instance', 'execute'
and the documentation on this subject isn't very clear.
flows.py
@flow.flow_start_func
def create_flow(activation, campos_proceso, **kwargs):
    activation.process.asignador = campos_proceso['asignador']
    activation.process.ejecutor = campos_proceso['ejecutor']
    activation.process.tipo_de_flujo = campos_proceso['tipo_de_flujo']
    activation.process.estado_del_entregable = campos_proceso[
        'estado_del_entregable']
    activation.process.save()
    activation.prepare()
    activation.done()
    return activation

@flow.flow_func
def exec_flow(activation, process_fields, **kwargs):
    activation.process.revisor = process_fields['revisor']
    activation.process.save()
    activation.prepare()
    activation.done()
    return activation

@frontend.register
class Delivery_flow(Flow):
    process_class = DeliveryProcess
    start = flow.StartFunction(create_flow).Next(this.execute)
    execute = flow.Function(exec_flow).Next(this.end)
    end = flow.End()
views.py
def Execute(request):  # campos_ejecucion, request):
    campos_ejecucion = {
        'ejecutor': request.user,
        'revisor': request.user,
        'observaciones_ejecutor': 'Este es un puente magico',
        'url_ejecucion': 'https://www.youtube.com/watch?v=G-yNGb0Q91Y',
    }
    campos_proceso = {
        'revisor': campos_ejecucion['revisor']
    }
    flows.Delivery_flow.execute.run()
    Entregable.objects.crear_entregable()
    return render(request, "Flujo/landing.html")
Generally, running things "entirely by code" is an antipattern and should be avoided. A Flow class is a set of views bound to URLs, so it is like a class-based URL config; you don't need an additional separate view and URL entry.
For custom views, you can take a look at the cookbook sample - https://github.com/viewflow/cookbook/blob/master/custom_views/demo/bloodtest/views.py
As for the actual question, you have missed the task_loader. The Function node needs to figure out which task is actually being executed. You can do that at the flow layer (with task_loader) or get the Task model instance directly and pass it as the function parameter - http://docs.viewflow.io/viewflow_flow_nodes.html#viewflow.flow.Function
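A rough sketch of the second option, passing the Task instance explicitly (the Task import, the lookup fields and the 'NEW' status value are assumptions and may differ between viewflow versions):
from django.shortcuts import render
from viewflow.models import Task
from . import flows

def Execute(request):
    process_fields = {'revisor': request.user}
    # look up the pending task for the Function node instead of calling run() with no task
    task = Task.objects.get(flow_task=flows.Delivery_flow.execute, status='NEW')
    flows.Delivery_flow.execute.run(task, process_fields)
    return render(request, "Flujo/landing.html")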

Twisted Python for responding to multiple clients at a time

I have a problem. I have an echo server which accepts clients, processes each client's request, and returns the result to that client.
Suppose I have two clients: client 1's request takes 10 seconds to process and client 2's request takes 1 second.
When both clients connect to the server at the same time, how do I run both clients' tasks in parallel and return the response to whichever client finishes first?
I have read that this can be achieved using Python Twisted. I have tried my luck, but I'm unable to do it.
Please help me out with this issue.
Your code (https://trinket.io/python/87fd18ca9e) has many mistakes in terms of async design patterns, but I will only address the most blatant one. There are a few calls to time.sleep(); this is blocking code and it causes your program to stop until the sleep function is done running. The number 1 rule in async programming is: do not use blocking functions! Don't worry, this is a very common mistake and the Twisted and Python async communities are there to help you :) I'll give you a naive solution for your server:
from twisted.internet.protocol import Factory
from twisted.internet import reactor, protocol, defer, task

def sleep(n):
    return task.deferLater(reactor, n, lambda: None)

class QuoteProtocol(protocol.Protocol):
    def __init__(self, factory):
        self.factory = factory

    def connectionMade(self):
        self.factory.numConnections += 1

    @defer.inlineCallbacks
    def recur_factorial(self, n):
        fact = 1
        print(n)
        for i in range(1, int(n) + 1):
            fact = fact * i
            yield sleep(5)  # async sleep
        defer.returnValue(str(fact))

    def dataReceived(self, data):
        try:
            number = int(data)  # validate data is an int
        except ValueError:
            self.transport.write('Invalid input!')
            return  # "exit" otherwise
        # use Deferreds to write to client after calculation is finished
        deferred_factorial = self.recur_factorial(number)
        deferred_factorial.addCallback(self.transport.write)

    def connectionLost(self, reason):
        self.factory.numConnections -= 1

class QuoteFactory(Factory):
    numConnections = 0

    def buildProtocol(self, addr):
        return QuoteProtocol(self)

reactor.listenTCP(8000, QuoteFactory())
reactor.run()
The main differences are in recur_factorial() and dataReceived(). recur_factorial() now returns a Deferred (look up how inlineCallbacks or coroutines work), which allows code to execute after the result is available. So when the data is received, the factorial is calculated, then written to the end user. Finally, there's the new sleep() function, which allows for an asynchronous sleep. I hope this helps. Keep reading the Krondo blog.
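To try it out, here is a minimal test client sketch (it assumes the server above is listening on localhost:8000, sends one number and prints the factorial the server returns):
from twisted.internet import reactor, protocol

class FactorialClient(protocol.Protocol):
    def connectionMade(self):
        self.transport.write('5')  # ask the server for 5!

    def dataReceived(self, data):
        print('Server replied: %s' % data)
        self.transport.loseConnection()

class FactorialClientFactory(protocol.ClientFactory):
    def buildProtocol(self, addr):
        return FactorialClient()

    def clientConnectionLost(self, connector, reason):
        reactor.stop()

reactor.connectTCP('localhost', 8000, FactorialClientFactory())
reactor.run()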

List populated with Scrapy is returned before actually filled

This involves pretty much the same code I just asked a different question about this morning, so if it looks familiar, that's because it is.
class LbcSubtopicSpider(scrapy.Spider):
    ...irrelevant/sensitive code...
    rawTranscripts = []
    rawTranslations = []

    def parse(self, response):
        rawTitles = []
        rawVideos = []
        for sel in response.xpath('//ul[1]'):  # only scrape the first list
            ...irrelevant code...
            index = 0
            for sub in sel.xpath('li/ul/li/a'):  # scrape the sublist items
                index += 1
                if index % 2 != 0:  # odd numbered entries are the transcripts
                    transcriptLink = sub.xpath('@href').extract()
                    #url = response.urljoin(transcriptLink[0])
                    #yield scrapy.Request(url, callback=self.parse_transcript)
                else:  # even numbered entries are the translations
                    translationLink = sub.xpath('@href').extract()
                    url = response.urljoin(translationLink[0])
                    yield scrapy.Request(url, callback=self.parse_translation)
        print rawTitles
        print rawVideos
        print "translations:"
        print self.rawTranslations

    def parse_translation(self, response):
        for sel in response.xpath('//p[not(@class)]'):
            rawTranslation = sel.xpath('text()').extract()
            rawTranslation = ''.join(rawTranslation)
            #print rawTranslation
            self.rawTranslations.append(rawTranslation)
            #print self.rawTranslations
My problem is that "print self.rawTranslations" in the parse(...) method prints nothing more than "[]". This could mean one of two things: either something resets the list right before printing, or the print runs before the parse_translation(...) calls (which populate the list from the links parse(...) follows) have finished. I'm inclined to suspect the latter, as I can't see any code that would reset the list, unless "rawTranslations = []" in the class body is run multiple times.
Worth noting is that if I uncomment the same line in parse_translation(...), it will print the desired output, meaning it's extracting the text correctly and the problem seems to be unique to the main parse(...) method.
My attempts to resolve what I believe is a synchronization problem were pretty aimless: I just tried using an RLock object based on as many Google tutorials as I could find, and I'm 99% sure I misused it anyway, as the result was identical.
The problem here is that you are not understanding how scrapy really works.
Scrapy is a crawling framework, used for creating website spiders, not just for making requests; that's what the requests module is for.
Scrapy's requests work asynchronously: when you call yield Request(...), you add a request to a stack of requests that will be executed at some point (you don't have control over when). This means that you can't expect that some part of your code after the yield Request(...) will be executed at that moment. In fact, your method should always end by yielding a Request or an Item.
Now, from what I can see (and this is one of the most common points of confusion with scrapy), you want to keep populating an item you created in one method, but the information you need comes from a different request.
In that case, communication is usually done with the meta parameter of the Request, something like this:
...
yield Request(url, callback=self.second_method, meta={'item': myitem, 'moreinfo': 'moreinfo', 'foo': 'bar'})

def second_method(self, response):
    previous_meta_info = response.meta
    # I can access the previous item with `response.meta['item']`
    ...
So this seems like somewhat of a hacky solution, especially since I just learned of Scrapy's request priority functionality, but here's my new code that gives the desired result:
class LbcVideosSpider(scrapy.Spider):
    ...code omitted...
    done = 0  # variable to keep track of subtopic iterations
    rawTranscripts = []
    rawTranslations = []

    def parse(self, response):
        # initialize containers for each field
        rawTitles = []
        rawVideos = []
        ...code omitted...
        index = 0
        query = sel.xpath('li/ul/li/a')
        for sub in query:  # scrape the sublist items
            index += 1
            if index % 2 != 0:  # odd numbered entries are the transcripts
                transcriptLink = sub.xpath('@href').extract()
                #url = response.urljoin(transcriptLink[0])
                #yield scrapy.Request(url, callback=self.parse_transcript)
            else:  # even numbered entries are the translations
                translationLink = sub.xpath('@href').extract()
                url = response.urljoin(translationLink[0])
                yield scrapy.Request(url, callback=self.parse_translation, \
                    meta={'index': index / 2, 'maxIndex': len(query) / 2})
        print rawTitles
        print rawVideos

    def parse_translation(self, response):
        # grab meta variables
        i = response.meta['index']
        maxIndex = response.meta['maxIndex']
        # interested in p nodes without class
        query = response.xpath('//p[not(@class)]')
        for sel in query:
            rawTranslation = sel.xpath('text()').extract()
            rawTranslation = ''.join(rawTranslation)  # collapse each line
            self.rawTranslations.append(rawTranslation)
            # increment number of translations done, check if finished
            self.done += 1
            print self.done
            if self.done == maxIndex:
                print self.rawTranslations
Basically, I just kept track of how many requests were completed and made some code conditional on the request being the final one. This prints the fully populated list.

How to start a new request after the item_scraped scrapy signal is called?

I need to scrape the data of each item from a website using Scrapy (http://example.com/itemview). I have a list of itemIDs and I need to pass each one into a form on example.com.
There is no url change for each item. So for each request in my spider the url will always be the same. But the content will be different.
I don't want a for loop for handling each request, so I followed the steps below:
started the spider with the above URL
added item_scraped and spider_closed signals
passed through several functions
passed the scraped data to pipeline
triggered the item_scraped signal
After this, it automatically calls the spider_closed signal. But I want the above steps to continue until all the itemIDs are finished.
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    itemIDs = [11111, 22222, 33333]
    current_item_num = 0

    def __init__(self, itemids=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.item_scraped, signals.item_scraped)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.quit()

    def start_requests(self):
        request = self.make_requests_from_url('http://example.com/itemview')
        yield request

    def parse(self, response):
        self.driver = webdriver.PhantomJS()
        self.driver.get(response.url)
        first_data = self.driver.find_element_by_xpath('//div[@id="itemview"]').text.strip()
        yield Request(response.url, meta={'first_data': first_data}, callback=self.processDetails, dont_filter=True)

    def processDetails(self, response):
        itemID = self.itemIDs[self.current_item_num]
        ..form submission with the current itemID goes here...
        ...the content of the page is updated with the given itemID...
        yield Request(response.url, meta={'first_data': response.meta['first_data']}, callback=self.processData, dont_filter=True)

    def processData(self, response):
        ...some more scraping goes here...
        item = ExamplecrawlerItem()
        item['first_data'] = response.meta['first_data']
        yield item

    def item_scraped(self, item, response, spider):
        self.current_item_num += 1
        # i need to call the processDetails function here for the next itemID
        # and the process needs to continue till the itemID finishes
        self.parse(response)
My pipeline:
class ExampleDBPipeline(object):
    def process_item(self, item, spider):
        MYCOLLECTION.insert(dict(item))
        return
I wish I had an elegant solution to this, but instead here is a hackish way of calling the underlying classes:
self.crawler.engine.slot.scheduler.enqueue_request(scrapy.Request(url,self.yourCallBack))
However, you can yield a request after you yield the item and have it callback to self.processDetails. Simply add this to your processData function:
yield item
self.counter += 1
yield scrapy.Request(response.url, callback=self.processDetails, dont_filter=True, meta={"your": "Dictionary"})
Also, PhantomJS can be nice and make your life easy, but it is slower than regular connections. If possible, find the request for json data or whatever makes the page unparseable without JS. To do so, open up chrome, right click, click inspect, go to the network tab, then enter the ID into the form, then look at the XHR or JS tabs for a JSON that has the data or next url you want. Most of the time, there will be some url made by adding the ID, if you can find it, you can just concatenate your urls and call that directly without having the cost of JS rendering. Sometimes it is randomized, or not there, but I've had fair success with it. You can then also use that to yield many requests at the same time without having to worry about phantomJS trying to do two things at once or having to initialize many instances of it. You could use tabs, but that is a pain.
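As a hedged sketch of that idea: the /api/itemview endpoint below is made up for illustration; the real URL (and the field names in the JSON) are whatever you find in the network tab.
import json
import scrapy

def processDetails(self, response):
    itemID = self.itemIDs[self.current_item_num]
    # hit the hypothetical JSON endpoint directly instead of rendering the page with PhantomJS
    yield scrapy.Request('http://example.com/api/itemview?id=%s' % itemID,
                         callback=self.parse_json, dont_filter=True)

def parse_json(self, response):
    data = json.loads(response.body)
    item = ExamplecrawlerItem()
    item['first_data'] = data.get('first_data')  # field name is a guess
    yield item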
Also, I would use a Queue of your IDs to ensure thread safety. Otherwise, you could have processDetails called twice on the same ID, though in the logic of your program everything seems to go linearly, which means you aren't using the concurrency capabilities of Scrapy and your program will go more slowly. To use a Queue, add:
import Queue
# go inside the class definition and add
itemIDQueue = Queue.Queue()
# within __init__ add
[self.itemIDQueue.put(ID) for ID in self.itemIDs]
# within processDetails replace itemID = self.itemIDs[self.current_item_num] with
itemID = self.itemIDQueue.get()
And then there is no need to increment the counter and your program is thread safe.

Celery: clean way of revoking the entire chain from within a task

My question is probably pretty basic, but I still can't find a solution in the official docs. I have defined a Celery chain inside my Django application, performing a set of tasks dependent on each other:
chain(tasks.apply_fetching_decision.s(x, y),
      tasks.retrieve_public_info.s(z, x, y),
      tasks.public_adapter.s())()
Obviously the second and the third tasks need the output of the parent, that's why I used a chain.
Now the question: I need to programmatically revoke the 2nd and the 3rd tasks if a test condition in the 1st task fails. How can I do this in a clean way? I know I can revoke the tasks of a chain from within the method where I have defined the chain (see this question and this doc), but inside the first task I have no visibility of the subsequent tasks nor of the chain itself.
Temporary solution
My current solution is to skip the computation inside the subsequent tasks based on result of the previous task:
@shared_task
def retrieve_public_info(result, x, y):
    if not result:
        return []
    ...

@shared_task
def public_adapter(result, z, x, y):
    for r in result:
        ...
But this "workaround" has some flaw:
Adds unnecessary logic to each task (based on predecessor's result), compromising reuse
Still executes the subsequent tasks, with all the resulting overhead
I haven't played too much with passing references to the chain into the tasks, for fear of messing things up. I also admit I haven't tried the exception-throwing approach, because I think that the choice of not proceeding through the chain can be a functional (thus non-exceptional) scenario...
Thanks for helping!
I think I found the answer to this issue: this seems to be the right way to proceed, indeed. I wonder why such a common scenario is not documented anywhere, though.
For completeness I post the basic code snippet:
@app.task(bind=True)  # Note that we need bind=True for self to work
def task1(self, other_args):
    # do_stuff
    if end_chain:
        self.request.callbacks[:] = []
    ....
Update
I implemented a more elegant way to cope with the issue and I want to share it with you. I am using a decorator called revoke_chain_authority, so that it can revoke the chain automatically without rewriting the code I described previously.
from functools import wraps

class RevokeChainRequested(Exception):
    def __init__(self, return_value):
        Exception.__init__(self, "")
        # Now for your custom code...
        self.return_value = return_value

def revoke_chain_authority(a_shared_task):
    """
    @see: https://gist.github.com/bloudermilk/2173940
    @param a_shared_task: a @shared_task(bind=True) celery function.
    @return:
    """
    @wraps(a_shared_task)
    def inner(self, *args, **kwargs):
        try:
            return a_shared_task(self, *args, **kwargs)
        except RevokeChainRequested, e:
            # Drop subsequent tasks in chain (if not EAGER mode)
            if self.request.callbacks:
                self.request.callbacks[:] = []
            return e.return_value
    return inner
This decorator can be used on a shared task as follows:
@shared_task(bind=True)
@revoke_chain_authority
def apply_fetching_decision(self, latitude, longitude):
    # ...
    if condition:
        raise RevokeChainRequested(False)
Please note the use of @wraps. It is necessary to preserve the signature of the original function; otherwise the signature will be lost and Celery will make a mess of calling the right wrapped task (e.g. it will always call the first registered function instead of the right one).
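For reference, here is a minimal sketch of launching the original chain with the decorated task (argument names x, y, z follow the question; if apply_fetching_decision raises RevokeChainRequested, the wrapper clears self.request.callbacks and the two remaining tasks never run):
from celery import chain

result = chain(tasks.apply_fetching_decision.s(x, y),
               tasks.retrieve_public_info.s(z, x, y),
               tasks.public_adapter.s()).apply_async()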
As of Celery 4.0, what I found to be working is to remove the remaining tasks from the current task instance's request using the statement:
self.request.chain = None
Let's say you have a chain of tasks a.s() | b.s() | c.s(). You can only access the self variable inside a task if you bind the task by passing bind=True as an argument to the task's decorator.
@app.task(name='main.a', bind=True)
def a(self):
    if something_happened:
        self.request.chain = None
If something_happened is truthy, b and c wouldn't be executed.
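A minimal sketch of building and launching that chain (b and c stand for any other registered tasks):
from celery import chain

# if a() sets self.request.chain to None, b and c are dropped and never execute
result = chain(a.s(), b.s(), c.s()).apply_async()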