Parsing the data from Wikipedia takes an unacceptably long time. Instead of one thread/process, I want to run at least 5. After googling I found that Python 3.5 has async for.
Below is a "very short" version of the current "synchronous" code to show the whole process (with comments to quickly understand what the code does).
def update_data(region_id=None, country__inst=None, upper_region__inst=None):
    all_ids = []
    # Get data about countries or regions or subregions
    countries_or_regions_dict = OSM().get_countries_or_regions(region_id)
    # Loop that I want to make async
    for osm_id in countries_or_regions_dict:
        names = countries_or_regions_dict[osm_id]['names']
        if 'wiki_uri' in countries_or_regions_dict[osm_id]:
            wiki_uri = countries_or_regions_dict[osm_id]['wiki_uri']
            # PARSER: From Wikipedia gets translations of countries or regions or subregions
            translated_names = Wiki().get_translations(wiki_uri, osm_id)
        if not region_id:  # Means it is a country
            country__inst = Countries.objects.update_or_create(osm_id=osm_id,
                                                               defaults={**countries_or_regions_dict[osm_id]})[0]
        else:  # Means it is a region/subregion (in case of recursion)
            upper_region__inst = Regions.objects.update_or_create(osm_id=osm_id,
                                                                  country=country__inst,
                                                                  region=upper_region__inst,
                                                                  defaults={**countries_or_regions_dict[osm_id]})[0]
        # Add to DB translated names from wiki
        for lang_code in names:
            ###
        # RECURSION: If a country has regions or a region has subregions, start recursion
        if 'divisions' in countries_or_regions_dict[osm_id]:
            regions_list = countries_or_regions_dict[osm_id]['divisions']
            for division_id in regions_list:
                all_regions_osm_ids = update_data(region_id=division_id, country__inst=country__inst,
                                                  upper_region__inst=upper_region__inst)
                all_ids += all_regions_osm_ids
    return all_ids
I realized that I need to change def update_data to async def update_data and, accordingly, for osm_id in countries_or_regions_dict to async for osm_id in countries_or_regions_dict,
but I could not find out whether it is necessary to use get_event_loop() in my case (and where), or how/where to specify how many iterations of the loop can run simultaneously. Could someone please help me make the for loop asynchronous?
The asyncio module doesn't create multiple threads/processes; it runs code in one thread and one process, but it can handle situations where the code blocks on I/O (if the code is written in a special way). Read up on when you should use asyncio.
Since your code is synchronous in nature, I would suggest using threads instead of asyncio. Create a ThreadPoolExecutor and use it to parse Wikipedia in multiple threads, as in the sketch below.
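A minimal sketch of that idea, assuming Wiki().get_translations is safe to call from worker threads and that the Wikipedia requests are the slow, I/O-bound part; fetch_translations is a hypothetical helper name, and max_workers=5 matches the "at least 5" from the question:
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_translations(countries_or_regions_dict):
    # Only entries that actually have a Wikipedia URI need a request
    targets = [(osm_id, data['wiki_uri'])
               for osm_id, data in countries_or_regions_dict.items()
               if 'wiki_uri' in data]
    translations = {}
    # Run up to 5 Wikipedia requests concurrently
    with ThreadPoolExecutor(max_workers=5) as executor:
        future_to_id = {executor.submit(Wiki().get_translations, wiki_uri, osm_id): osm_id
                        for osm_id, wiki_uri in targets}
        for future in as_completed(future_to_id):
            translations[future_to_id[future]] = future.result()
    return translations
update_data can then look up translations[osm_id] inside its existing loop instead of calling Wiki() directly, so only the network-bound part runs in parallel while the Django ORM calls stay in the main thread.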
I have read and write queries in a single Django view function, as in the code below:
def multi_query_function(request):
    model_data = MyModel.objects.all()  # first read command
    ...(do something)...
    new_data = MyModel(
        id=1234,
        first_property='random value',
        second_property='another value'
    )
    new_data.save()  # second write command
    return render(request, 'index.html')
I need the queries in this function to be executed consecutively. For example, if multiple users call this function at the same time, it should be executed for them one by one: the 'read' of one user should only be allowed once the previous user has completed both their 'read' and 'write'. The queries of the two users should never be interleaved.
Should I use the table-locking feature of my PostgreSQL DB, or is there another well-managed way?
Yep, using your database's locks is a good way to do this.
https://github.com/Xof/django-pglocks looks like a good library to give you a lock context manager.
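A minimal sketch of how the view above could use it, assuming django-pglocks' advisory_lock context manager and a made-up lock name ('my_model_read_write'); whichever request acquires the lock first finishes its read and write before the next one starts:
from django.shortcuts import render
from django_pglocks import advisory_lock

def multi_query_function(request):
    # Only one request at a time can hold this advisory lock,
    # so one user's read and write never interleave with another's.
    with advisory_lock('my_model_read_write'):
        model_data = MyModel.objects.all()  # first read command
        # ...(do something)...
        new_data = MyModel(
            id=1234,
            first_property='random value',
            second_property='another value'
        )
        new_data.save()  # second write command
    return render(request, 'index.html')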
I have a function:
def _process_sensor(self, last_value, raw_data, raw_string, tariff_data, counter, electricity):
    raw_data = OrderedDict(sorted(raw_data.items(), key=lambda t: iso8601.parse_date(t[0])))
    interval_data = self._transform_cumulative_to_interval(last_value, raw_data, electricity)
    interval_data = self._calculate_sensor_commodities(interval_data, tariff_data, tariff_data['ID'], electricity)
    self._persist_results(raw_string, interval_data, tariff_data['ID'], electricity)
Yesterday each function execution took 2 s; today I improved it to around 0.25 s, which I am very happy about. But when I call my function:
from multiprocessing.pool import ThreadPool as Pool

pool = Pool(processes=8)
for sensor_id in today_values:
    try:
        pool.apply(self._process_sensor, (yesterday_values[sensor_id], today_values[sensor_id],
                                          raw_string[sensor_id], self.electricity_tariff_data[sensor_id],
                                          processed, True, ))
    except Exception as e:
        self.logger.warning('Some data error: {}'.format(e))
    processed += 1
Looping through 100 elements takes the same amount of time: about 24 seconds. What could I be doing wrong? The parameters passed are parts of dictionaries. self._persist_results calls AWS via the s3transfer library (from s3transfer.manager import TransferManager).
Edit: I know I'm running the code on an 8-core box. I also tried pool.apply_async with the same results.
If a single run takes 0.25 s and 100 runs take 24 s, it sounds like the function is running into contention, so that most of its run time isn't executing in parallel. This is quite common when there's contention around I/O resources, so my guess would be that the _persist_results call is the culprit.
Looking into TransferManager, it seems there's a max_concurrency setting on the TransferConfig - it looks like that should default to 10. Any chance you've reduced it? I'd suggest checking whether it's being set, and if not, seeing whether setting it explicitly helps.
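For reference, a hedged sketch of what setting that explicitly could look like if the upload goes through boto3's S3 transfer layer; the bucket, key, and file names here are made up:
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')
# max_concurrency controls how many threads a single transfer may use;
# 10 is the documented default, so make sure it hasn't been lowered.
config = TransferConfig(max_concurrency=10)
s3.upload_file('interval_data.json', 'my-bucket',
               'results/interval_data.json', Config=config)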
Fixed it:
import copy
from multiprocessing.pool import Pool

def _looop(dict_obj):
    _process_sensor(**dict_obj)

pool = Pool(processes=8)
to_process = []
for sensor_id in today_values:
    # The keys below are placeholders for the real keyword arguments of _process_sensor
    to_process.append(copy.deepcopy({'fields': True,
                                     'for': True,
                                     '_process_sensor': True}))
pool.map_async(_looop, to_process)
Didn't have time to check if copy.deepcopy is necessary (and to be honest I don't need to put it in a separate loop - I will refactor that soon).
I wrote a Django application to tokenize a list of documents. I tried to run it in parallel using multiprocessing, but I found that instead of all cores running at 100% of their computation capacity, a single core takes turns running at 100% while the other threads sit almost idle. I'm running a 4-core/8-thread Ubuntu 14.04 box with Python 2.7. Here is a simplified version of my code to make it easier to understand.
tokenization.py
def compute_customizedStopwords():
    stopword_dictionary = open(BASE_DIR + "/app1/NLP/Dictionary/humanDecisionDictionary.txt", 'r')
    customizedStopwords = set()
    # compute stopwords set
    for line in stopword_dictionary:
        customizedStopwords.add(line.strip('\n').lower())
    return customizedStopwords

def tokenize_task(narrative, customizedStopwords):
    tokens = narrative.corpus.split(",")
    tokens = [token for token in tokens if token not in customizedStopwords]  # remove stopwords
    newTokenObjects = [Token(token=token) for token in tokens]
    Token.objects.bulk_create(newTokenObjects)  # save all tokens to database
    return tokens
views.py
def tokenize(request):
    narratives = models.Narrative.objects.all()  # get all documents
    customizedStopwords = compute_customizedStopwords()  # get stopwords set
    pool = Pool()
    results = [pool.apply(tokenize_task, args=(narrative, customizedStopwords)) for narrative in narratives]
    tokens = []
    tokens += results  # flatten the token list
    return HttpResponse(tokens)
Is it because the database write operation is the bottleneck? Tokenization itself is very fast, but perhaps only one process can write to the database at a time and thus blocks all the others. If this is the case, are there any ways to optimize this code?
I also have concerns about the stopwords set. I suspect Python will copy this object for every job and distribute it to every task, which adds memory cost, especially since I have 1 million documents in the database.
Currently I have a method that retrieves all ~119,000 Gmail accounts and writes them to a CSV file, using the Python code below with the Admin SDK enabled and OAuth 2.0:
def get_accounts(self):
    students = []
    page_token = None
    params = {'customer': 'my_customer'}
    while True:
        try:
            if page_token:
                params['pageToken'] = page_token
            current_page = self.dir_api.users().list(**params).execute()
            students.extend(current_page['users'])
            # write each page of data to a file
            csv_file = CSVWriter(students, self.output_file)
            csv_file.write_file()
            # clear the list for the next page of data
            del students[:]
            page_token = current_page.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            break
I would like to retrieve all 119,000 accounts in one go, that is, without having to loop, or as a batch call. Is this possible, and if so, can you provide example Python code? I have run into communication issues and have to rerun the process multiple times to obtain the ~119,000 accounts successfully (it takes about 10 minutes to download), so I would like to minimize communication errors. Please advise if a better method exists or if a non-looping approach is possible.
There's no way to do this as a batch because you need to know each pageToken and those are only given as the page is retrieved. However, you can increase your performance somewhat by getting larger pages:
params = {'customer': 'my_customer', 'maxResults': 500}
since the default page size when maxResults is not set is 100, adding maxResults: 500 will reduce the number of API calls by a factor of 5. While each call may take slightly longer, you should notice a performance increase because you're making far fewer API calls and HTTP round trips.
You should also look at using the fields parameter to only specify user attributes you need to read in the list. That way you're not wasting time and bandwidth retrieving details about your users that your app never uses. Try something like:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {
    'customer': 'my_customer',
    'maxResults': 500,
    'fields': my_fields
}
Last of all, if your app retrieves the list of users fairly frequently, turning on caching may help.
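One possible way to do that with the httplib2/oauth2client stack this code era typically used is to give httplib2 a cache directory when building the Directory service; the '.cache' directory name and the already-obtained credentials object below are assumptions, not taken from the question:
import httplib2
from googleapiclient.discovery import build

# httplib2 stores cacheable HTTP responses in this directory
# and reuses them instead of going back over the network.
http = credentials.authorize(httplib2.Http(cache='.cache'))
dir_api = build('admin', 'directory_v1', http=http)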
My question is probably pretty basic, but I still can't find a solution in the official docs. I have defined a Celery chain inside my Django application, performing a set of tasks dependent on each other:
chain(tasks.apply_fetching_decision.s(x, y),
      tasks.retrieve_public_info.s(z, x, y),
      tasks.public_adapter.s())()
Obviously the second and third tasks need the output of their parent; that's why I used a chain.
Now the question: I need to programmatically revoke the 2nd and 3rd tasks if a test condition in the 1st task fails. How do I do it in a clean way? I know I can revoke the tasks of a chain from within the method where I have defined the chain (see this question and this doc), but inside the first task I have no visibility of subsequent tasks nor of the chain itself.
Temporary solution
My current solution is to skip the computation inside the subsequent tasks based on result of the previous task:
@shared_task
def retrieve_public_info(result, x, y):
    if not result:
        return []
    ...

@shared_task
def public_adapter(result, z, x, y):
    for r in result:
        ...
But this "workaround" has some flaw:
Adds unnecessary logic to each task (based on predecessor's result), compromising reuse
Still executes the subsequent tasks, with all the resulting overhead
I haven't played too much with passing references of the chain to the tasks for fear of messing things up. I admit I also haven't tried the exception-throwing approach, because I think that choosing not to proceed through the chain can be a functional (thus non-exceptional) scenario...
Thanks for helping!
I think I found the answer to this issue: this seems to be the right way to proceed, indeed. I wonder why such a common scenario is not documented anywhere, though.
For completeness, I post the basic code snippet:
@app.task(bind=True)  # Note that we need bind=True for self to work
def task1(self, other_args):
    # do_stuff
    if end_chain:
        self.request.callbacks[:] = []
    ....
Update
I implemented a more elegant way to cope with the issue and I want to share it with you. I am using a decorator called revoke_chain_authority, so that it can automatically revoke the chain without rewriting the code I previously described.
from functools import wraps

class RevokeChainRequested(Exception):
    def __init__(self, return_value):
        Exception.__init__(self, "")
        # Now for your custom code...
        self.return_value = return_value

def revoke_chain_authority(a_shared_task):
    """
    @see: https://gist.github.com/bloudermilk/2173940
    @param a_shared_task: a @shared_task(bind=True) celery function.
    @return:
    """
    @wraps(a_shared_task)
    def inner(self, *args, **kwargs):
        try:
            return a_shared_task(self, *args, **kwargs)
        except RevokeChainRequested as e:
            # Drop subsequent tasks in chain (if not EAGER mode)
            if self.request.callbacks:
                self.request.callbacks[:] = []
            return e.return_value
    return inner
This decorator can be used on a shared task as follows:
@shared_task(bind=True)
@revoke_chain_authority
def apply_fetching_decision(self, latitude, longitude):
    # ...
    if condition:
        raise RevokeChainRequested(False)
Please note the use of @wraps. It is necessary to preserve the signature of the original function; otherwise the signature is lost and Celery makes a mess of calling the right wrapped task (e.g. it will always call the first registered function instead of the right one).
As of Celery 4.0, what I found to be working is to remove the remaining tasks from the current task instance's request using the statement:
self.request.chain = None
Let's say you have a chain of tasks a.s() | b.s() | c.s(). You can only access the self variable inside a task if you bind the task by passing bind=True as an argument to the task's decorator.
@app.task(name='main.a', bind=True)
def a(self):
    if something_happened:
        self.request.chain = None
If something_happened is truthy, b and c wouldn't be executed.
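For illustration only, a minimal sketch of how such a chain might be defined and dispatched; the bodies of b and c are placeholders added here, not taken from the answer:
from celery import chain

@app.task(name='main.b')
def b(prev_result):
    pass  # placeholder; receives a's return value

@app.task(name='main.c')
def c(prev_result):
    pass  # placeholder; receives b's return value

# Dispatch a -> b -> c; if a() sets self.request.chain = None,
# b and c are dropped from the request and never run.
chain(a.s(), b.s(), c.s()).apply_async()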