Show information about the file that is being transferred using Django Celery?

I have a task like this in Django celery:
@task
def file(password, source12, destination):
    subprocess.Popen(['sshpass', '-p', password, 'rsync', '-avz', '--info=progress2', source12, destination],
                     stderr=subprocess.PIPE, stdout=subprocess.PIPE).communicate()[0]
I have a function that executes the above task:
@celery.task
@login_required(login_url='/login_backend/')
def sync(request):
    # user_id = request.session['user_id']
    """Sync the files into the server with the progress bar"""
    if request.method == 'POST':
        choices = request.POST.getlist('choice')
        for i in choices:
            new_source = source + "/" + i
            # b = result.successful()
            # result.get()  # Poll the database to get the progress
            start_date1 = datetime.datetime.utcnow().replace(tzinfo=utc)
            source12 = new_source.replace(' ', '')  # Remove whitespaces
            file.delay(password, source12, destination)
        return HttpResponseRedirect('/uploaded_files/')
    else:
        return HttpResponseRedirect('/uploaded_files/')
I want to show the user the file-transfer information with a progress bar: the file name, the remaining time, and the size of the file being transferred. How can I do that?

Here is an example of how to pass the information through Celery itself. Simply start the upload process with an AJAX call, return the task's id, and later ask by AJAX about the state of the task.
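A minimal sketch of that flow, assuming the file task from the question; the view names sync_start and sync_progress are illustrative, not from the original post:
# views.py - start the transfer and let the page poll the task state by id
from celery.result import AsyncResult
from django.http import JsonResponse

def sync_start(request):
    """Kick off the transfer and hand the task id back to the AJAX caller."""
    # password, source12, destination are assumed to come from the request/session
    result = file.delay(password, source12, destination)
    return JsonResponse({'task_id': result.id})

def sync_progress(request, task_id):
    """Polled by the browser; returns whatever state/meta the task has reported."""
    result = AsyncResult(task_id)
    info = result.info if isinstance(result.info, dict) else {}
    return JsonResponse({'state': result.state, 'info': info})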
To actually report progress you need to know the progress inside the file task itself. I doubt you can achieve that with a subprocess call to sshpass in any way other than parsing the PIPE output.
Mind that Celery already spawns a process for you when you .delay a task.
Alternatively, try reading the source in chunks, rsyncing the chunks, and then merging the file at the destination. That way you know how many chunks have been transferred already.
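For the PIPE-parsing route, here is a rough sketch of a bound task that parses rsync's --info=progress2 output and publishes it through Celery's update_state; the regular expression is an approximation of that output format, and the task name is made up for the example:
import re
import subprocess

from celery import shared_task

@shared_task(bind=True)
def transfer_file(self, password, source, destination):
    """Run rsync via sshpass and report progress through the task state."""
    proc = subprocess.Popen(
        ['sshpass', '-p', password,
         'rsync', '-avz', '--info=progress2', source, destination],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True, bufsize=1)
    # --info=progress2 prints overall progress like "  1,234,567  42%  1.23MB/s  0:00:05"
    pattern = re.compile(r'(\d+)%\s+(\S+/s)\s+(\d+:\d{2}:\d{2})')
    for line in iter(proc.stdout.readline, ''):
        match = pattern.search(line)
        if match:
            self.update_state(state='PROGRESS', meta={
                'file': source,
                'percent': int(match.group(1)),
                'speed': match.group(2),
                'eta': match.group(3),
            })
    proc.wait()
    return {'file': source, 'percent': 100}
The polling view can then show the file name, percentage, speed and remaining time from the meta dict. Note that progress2 reports whole-transfer progress, so a per-file size would have to be obtained separately (for example with os.path.getsize on the source).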

Related

Django: Script that executes many queries runs massively slower when executed from Admin view than when executed from shell

I have a script that loops through the rows of an external csv file (about 12,000 rows) and executes a single Model.objects.get() query to retrieve each item from the database (final product will be much more complicated but right now it's stripped down to the barest functionality possible to try to figure this out).
For right now the path to the local csv file is hardcoded into the script. When I run the script through the shell using py manage.py runscript update_products_from_csv it runs in about 6 seconds.
The ultimate goal is to be able to upload the csv through the admin and then have the script run from there. I've already been able to accomplish that, but the runtime when I do it that way takes more like 160 seconds. The view for that in the admin looks like...
from django import forms
from django.contrib import admin, messages
from django.http import HttpResponseRedirect
from django.urls import path, reverse

from apps.products.models import Product
from .scripts import update_products_from_csv

class CsvUploadForm(forms.Form):
    csv_file = forms.FileField(label='Upload CSV')

@admin.register(Product)
class ProductAdmin(admin.ModelAdmin):
    # list_display, list_filter, fieldsets, etc.

    def changelist_view(self, request, extra_context=None):
        extra_context = extra_context or {}
        extra_context['csv_upload_form'] = CsvUploadForm()
        return super().changelist_view(request, extra_context=extra_context)

    def get_urls(self):
        urls = super().get_urls()
        new_urls = [path('upload-csv/', self.upload_csv),]
        return new_urls + urls

    def upload_csv(self, request):
        if request.method == 'POST':
            # csv_file = request.FILES['csv_file'].file
            # result_string = update_products_from_csv.run(csv_file)
            # I commented out the above two lines and added the below line to rule out
            # the possibility that the csv upload itself was the problem. Whether I execute
            # the script using the uploaded file or let it use the hardcoded local path,
            # the results are the same. It works, but takes more than 20 times longer
            # than executing the same script from the shell.
            result_string = update_products_from_csv.run()
            print(result_string)
            messages.success(request, result_string)
        return HttpResponseRedirect(reverse('admin:products_product_changelist'))
Right now the actual running parts of the script are about as simple as this...
import csv
from time import time

from apps.products.models import Product

CSV_PATH = 'path/to/local/csv_file.csv'

def run():
    csv_data = get_csv_data()
    update_data = build_update_data(csv_data)
    update_handler(update_data)
    return 'Finished'

def get_csv_data():
    with open(CSV_PATH, 'r') as f:
        return [d for d in csv.DictReader(f)]

def build_update_data(csv_data):
    update_data = []
    # Code that loops through csv data, applies some custom logic, and builds a list of
    # dicts with the data cleaned and formatted as needed
    return update_data

def update_handler(update_data):
    query_times = []
    for upd in update_data:
        iter_start = time()
        product_obj = Product.objects.get(external_id=upd['external_id'])
        # external_id is not the primary key but is an indexed field in the Product model
        query_times.append(time() - iter_start)
    # Code to export query_times to an external file for analysis
update_handler() has a bunch of other code checking field values to see if anything needs to be changed, and building the objects when a match does not exist, but that's all commented out right now. As you can see, I'm also timing each query and logging those values. (I've been dropping time() calls in various places all day and have determined that the query is the only part that's noticeably different.)
When I run it from the shell, the average query time is 0.0005 seconds and the total of all query times comes out to about 6.8 seconds every single time.
When I run it through the admin view and then check the queries in Django Debug Toolbar it shows the 12,000+ queries as expected, and shows a total query time of only about 3900ms. But when I look at the log of query times gathered by the time() calls, the average query time is 0.013 seconds (26 times longer than when I run it through the shell), and the total of all query times always comes out at 156-157 seconds.
The queries in Django Debug Toolbar when I run it through the admin all look like SELECT ••• FROM "products_product" WHERE "products_product"."external_id" = 10 LIMIT 21, and according to the toolbar they are mostly all 0-1ms. I'm not sure how I would check what the queries look like when running it from the shell, but I can't imagine they'd be different? I couldn't find anything in django-extensions runscript docs about it doing query optimizations or anything like that.
One additional interesting facet is that when running it from the admin, from the time I see result_string print in the terminal, it's another solid 1-3 minutes before the success message appears in the browser window.
I don't know what else to check. I'm obviously missing something fundamental, but I don't know what.
Somebody on Reddit suggested that running the script from the shell might be automatically spinning up a new thread where the logic can run unencumbered by the other Django server processes, and this seems to be the answer. If I run the script in a new thread from the admin view, it runs just as fast as it does when I run it from the shell.
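For reference, a minimal sketch of that change, assuming the upload_csv view shown above; the message text is illustrative:
import threading

from django.contrib import messages
from django.http import HttpResponseRedirect
from django.urls import reverse

from .scripts import update_products_from_csv

def upload_csv(self, request):
    if request.method == 'POST':
        # Run the import in its own thread so it isn't slowed down by
        # (or blocking) the request/response cycle
        worker = threading.Thread(target=update_products_from_csv.run, daemon=True)
        worker.start()
        messages.success(request, 'CSV import started in the background.')
    return HttpResponseRedirect(reverse('admin:products_product_changelist'))
The trade-off is that result_string is no longer available when the redirect is returned, so any completion message has to be reported some other way (logging, a status field on a model, etc.).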

Does a Celery task id change after redistribution?

I have a Django model which has a column called celery_task_id. I am using RabbitMQ as the broker. There's a Celery function called test_celery which takes a model object as a parameter. Now I have the following lines of code which create a Celery task.
def create_celery_task():
    celery_task_id = test_celery.apply_async((model_obj,), eta='Future Datetime Object')
    model_obj.celery_task_id = celery_task_id
    model_obj.save()
----
----
Now inside the Celery function I am verifying whether the task id is the same as the one stored in the DB.
@app.task
def test_celery(model_obj):
    if model_obj.celery_task_id == test_celery.request.id:
        ## Do something
My problem is that there are a lot of cases where I can see the task being received and succeeding in the log, but the code inside the if condition is not executed.
Is it possible that the Celery task id changes after redistribution? Or are there any other reasons?
One of the recommendations is not to pass database/ORM objects into Celery tasks, because they may contain stale data. Try to rewrite the task as:
@app.task
def test_celery(model_obj_id):
    model_obj = YourModel.objects.filter(id=model_obj_id).first()  # None if the row is gone
    if model_obj:
        if model_obj.celery_task_id == test_celery.request.id:
            ## Do something
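The caller then needs a matching change: pass the primary key rather than the instance, and store the id of the AsyncResult rather than the result object itself. A sketch using the names from the question, with some_future_datetime standing in for the 'Future Datetime Object':
def create_celery_task():
    result = test_celery.apply_async((model_obj.id,), eta=some_future_datetime)
    model_obj.celery_task_id = result.id
    model_obj.save()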

Manage multiple uploads with Flask session

I have the following situation. I created a simple backend in Flask that handles file uploads. With the files received, Flask does something (uploads them) and returns the data to the caller. There are two scenarios in the app: uploading one image and uploading multiple images. When uploading one image, I can simply get the response and voila, I'm all set.
However, I am stuck on handling multiple file uploads. I can use the same handler for the actual file upload, but the issue is that all of those files need to be stored in a list or something, then processed, and after doing that, a single link (album) containing all those images needs to be delivered.
Here is my upload handling code:
@app.route('/uploadv3', methods=['POST'])
def upload():
    if request.method == 'POST':
        data_file = request.files["file"]
        file_name = data_file.filename
        path_to_save_to = os.path.join(app.config['UPLOAD_FOLDER'], file_name)
        data_file.save(path_to_save_to)
        file_url = upload_image_to_image_host(path_to_save_to)
        return file_url
I was experimenting with the session in Flask, but I don't know whether I can create a list of items under one key, like session['links'], then get all of those, and clear the key after doing the work. Or is there some other, simpler solution?
I assume that I could probably do this with a key for each image, like session["link1"] and so on, but that would impose a limit on the number of images (depending on how many keys I create), make the code very ugly, make it problematic to iterate over each one to build the list passed to the album-building method, and make clearing the session tedious.
Some code that I wrote for getting the actual link at the end and clearing the session follows (this assumes that session['link'] holds a list of urls, which I can't really achieve with my knowledge of session management in Flask):
def create_album(images):
    session.pop('link', None)
    new_album = im.create_album(images)
    return new_album.link

@app.route('/get_album_link')
def get_album_link():
    return create_album(session['link'])
Thanks in advance for your time!
You can assign anything to a session, including an individual value or a list/dictionary etc. If you know the links, you can store them in the session as follows:
session['links'] = ['link1','link2'...and so on]
This way, you have a list of all the links. You can now access a link by:
if 'links' in session:
    for link in session['links']:
        print(link)
Once you are done with them, you can clear the session as:
if 'links' in session:
    del session['links']
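Applied to the upload handler from the question, a sketch of accumulating the links across requests could look like this (it assumes the same upload_image_to_image_host helper; note that mutating a list already stored in the session is not detected automatically, hence session.modified):
@app.route('/uploadv3', methods=['POST'])
def upload():
    data_file = request.files['file']
    path_to_save_to = os.path.join(app.config['UPLOAD_FOLDER'], data_file.filename)
    data_file.save(path_to_save_to)
    file_url = upload_image_to_image_host(path_to_save_to)
    links = session.setdefault('links', [])
    links.append(file_url)
    session.modified = True  # in-place changes to session values are not auto-detected
    return file_url
The album view could then read session.pop('links', []) to build the album and clear the key in one step.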
To clarify what I did to make this work: in the end, it turned out that uploading images and adding them to an album anonymously had to be done "in reverse", i.e. not adding image objects to an album object, but uploading each image to an album id.
I made a method that gets the album link and puts it in the session:
@app.route('/get_album_link')
def get_album_link():
    im = pyimgur.Imgur(CLIENT_ID)
    new_album = im.create_album()
    session.clear()
    session['album'] = new_album.deletehash
    session['album_link'] = new_album.link
    return new_album.link
Later on, when handling uploads, I just add the image to the album and voila, all set :)
uploaded_image = im.upload_image(path_of_saved_image, album=session['album'])
file_url = uploaded_image.link
return file_url
One caveat is that the image should be added using the "deletehash" value passed as the album parameter, not the album id (this is covered in the Imgur API documentation).

django - iterating over return render_to_response

I would like to read a file, update the website, read more lines, update the site, and so on. The logic is below, but it's not working.
It only shows the first line from the logfile and stops. Is there a way to iterate over 'return render_to_response'?
# django view calling a remote python script that appends output to the logfile
proc = subprocess.Popen([program, branch, service, version, nodelist])
logfile = 'text.log'
fh = open(logfile, 'r')
while proc.poll() is None:
    where = fh.tell()
    line = fh.read()
    if not line:
        time.sleep(1)
        fh.seek(where, os.SEEK_SET)
    else:
        output = cgi.escape(line)
        output = line.replace('\n\r', '<br>')
        return render_to_response('hostinfo/deployservices.html', {'response': output})
Thank you for your help.
You can actually do this by making your function a generator - that is, using 'yield' to return each line.
However, you would need to create the response directly rather than using render_to_response.
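A rough sketch of that approach using Django's StreamingHttpResponse; the variables program, branch, service, version and nodelist are assumed to be defined as in the question, and the view name is made up:
import subprocess
import time

from django.http import StreamingHttpResponse

def deploy_services(request):
    def follow_log():
        proc = subprocess.Popen([program, branch, service, version, nodelist])
        with open('text.log', 'r') as fh:
            # Keep tailing the log until the remote script finishes
            while proc.poll() is None:
                line = fh.readline()
                if not line:
                    time.sleep(1)
                    continue
                yield line.replace('\n', '<br>')

    return StreamingHttpResponse(follow_log())
The browser receives the lines as they are produced instead of a rendered template, so the page (or an iframe/AJAX request) has to consume the stream itself.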
render_to_response will render the first batch to the website and stop. Then the website must call this view again somehow if you want to send the next batch. You will also have to maintain a record of where you were in the log file so that the second batch can be read from that point.
I assume that you have some logic in the templates so that the second post to render_to_response doesn't overwrite the first.
If your data is not humongous, you should explore sending over the entire contents you want to show on the webpage each time you read some new lines.
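For that simpler approach, a sketch of an endpoint the page could poll periodically, returning everything written to the log so far (the view name is illustrative):
from django.http import HttpResponse

def deploy_log(request):
    # Re-read and return the whole log on every poll; fine while the log stays small
    with open('text.log', 'r') as fh:
        contents = fh.read()
    return HttpResponse(contents.replace('\n', '<br>'))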
Instead of re-inventing the wheel, use django_logtail

Returning data to the original process that invoked a subprocess

Someone told me to post this as a new question. This is a follow-up to
Instantiating a new WX Python GUI from spawn thread
I implemented the following code in a script that gets called from a spawned thread (Thread2):
# Function that gets invoked by Thread #2
def scriptFunction():
    # Code to instantiate GUI2; GUI2 contains wx.TextCtrl fields and a 'Done' button
    p = subprocess.Popen("python secondGui.py", bufsize=2048, shell=True,
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    # Wait for a response
    p.wait()
    # Read response
    response = p.stdout.read()
    # Process entered data
    processData()
On the new process running GUI2, I want the 'Done' button event handler to return 4 data sets to Thread2, and then destroy itself (GUI2)
def onDone(self, event):
    # This is the part I need help with; trying to return data back to the main process that instantiated this GUI (GUI2)
    process = subprocess.Popen(['python', 'MainGui.py'], shell=False, stdout=subprocess.PIPE)
    print(process.communicate('input1', 'input2', 'input3', 'input4'))
    # kill GUI
    self.Close()
Currently, this implementation spawns another Main GUI in a new process. What I want to do is return data back to the original process. Thanks.
Do the two scripts have to be separate? I mean, you can have multiple frames running on one main loop and transfer information between the two using pubsub: http://www.blog.pythonlibrary.org/2010/06/27/wxpython-and-pubsub-a-simple-tutorial/
Theoretically, what you're doing should work too. Other methods I've heard of involve using Python's socket server library to create a really simple server that the two programs can post to and read data from. Or a database, or watching a directory for file updates.
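A compact sketch of the single-process alternative: two frames on one main loop exchanging the entered data over pubsub (this uses the pypubsub package, which older wxPython bundles as wx.lib.pubsub; the class and topic names are made up for the example):
import wx
from pubsub import pub

class SecondFrame(wx.Frame):
    """Stand-in for GUI2: one text field and a 'Done' button."""
    def __init__(self, parent):
        super().__init__(parent, title='GUI2')
        panel = wx.Panel(self)
        self.field = wx.TextCtrl(panel, pos=(10, 10))
        done = wx.Button(panel, label='Done', pos=(10, 40))
        done.Bind(wx.EVT_BUTTON, self.on_done)

    def on_done(self, event):
        # Publish the entered data; MainFrame receives it without any subprocess
        pub.sendMessage('data.entered', values=[self.field.GetValue()])
        self.Close()

class MainFrame(wx.Frame):
    def __init__(self):
        super().__init__(None, title='Main GUI')
        pub.subscribe(self.on_data, 'data.entered')
        SecondFrame(self).Show()

    def on_data(self, values):
        print('received from GUI2:', values)

if __name__ == '__main__':
    app = wx.App()
    MainFrame().Show()
    app.MainLoop()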
# Function that gets invoked by Thread #2
def scriptFunction():
    # Code to instantiate GUI2; GUI2 contains wx.TextCtrl fields and a 'Done' button
    p = subprocess.Popen("python secondGui.py", bufsize=2048, shell=True,
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    # Wait for a response
    p.wait()
    # Read the response and split the returned string, which contains 4 words separated by commas
    responseArray = p.stdout.read().split(",")
    # Process entered data
    processData(responseArray)
The 'Done' button event handler that gets invoked when the 'Done' button is clicked in GUI2:
def onDone(self, event):
    # Package 4 word inputs into string to return back to main process (Thread2)
    sys.stdout.write("%s,%s,%s,%s" % (dataInput1, dataInput2, dataInput3, dataInput4))
    # kill GUI2
    self.Close()
Thanks for your help Mike!