Django: subprocess p.wait() doesn't return to the web page

With a Django button, I need to play multiple music files (selected at random).
In my models.py I have two functions, playmusic and playmusicrandom:
def playmusic(self, music):
    if self.isStarted():
        self.stop()
    command = "sudo /usr/bin/mplayer " + music.path
    p = subprocess.Popen(command, shell=True)
    p.wait()
def playmusicrandom(request):
    conn = sqlite3.connect(settings.DATABASES['default']['NAME'])
    cur = conn.cursor()
    cur.execute("SELECT id FROM webgui_music")
    list_id = [row[0] for row in cur.fetchall()]
    # Get three IDs randomly from the list
    selected_ids = random.sample(list_id, 3)
    for i in selected_ids:
        music = Music.objects.get(id=i)
        player.playmusic(music)
With this code the three tracks are played one after the other, but the web page just shows "Loading..." for the whole time the loop is running.
Is there a way to return the refreshed page to the user while the loop keeps playing?
Thanks.

Your view is blocked from returning anything to the web server while it is waiting for playmusicrandom() to finish.
You need to arrange for playmusicrandom() to do its work after you've returned the HTTP response from the view.
This means that you likely need a thread (or similar solution).
Your view will have something like this:
import threading

t = threading.Thread(target=player_model.playmusicrandom,
                     args=(request,))
t.setDaemon(True)
t.start()
return HttpResponse()
This code snippet came from here, where you will find more detailed information about the issues you face and possible solutions.
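Putting the pieces together, a minimal sketch of what such a view could look like (the player_model import path and the playmusicrandom signature are assumptions carried over from the snippets above, not a verified implementation):

import threading

from django.http import HttpResponse

from .models import player_model  # assumed location of the player object


def play_random_view(request):
    # Hand the long-running playback off to a daemon thread so the view
    # can return a response immediately instead of waiting for p.wait().
    t = threading.Thread(target=player_model.playmusicrandom, args=(request,))
    t.daemon = True  # don't keep the process alive just for playback
    t.start()
    return HttpResponse("Playback started")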

Related

Django: Script that executes many queries runs massively slower when executed from Admin view than when executed from shell

I have a script that loops through the rows of an external csv file (about 12,000 rows) and executes a single Model.objects.get() query to retrieve each item from the database (the final product will be much more complicated, but right now it's stripped down to the barest functionality possible while I try to figure this out).
Right now the path to the local csv file is hardcoded into the script. When I run the script through the shell using py manage.py runscript update_products_from_csv, it runs in about 6 seconds.
The ultimate goal is to be able to upload the csv through the admin and then have the script run from there. I've already been able to accomplish that, but the runtime when I do it that way takes more like 160 seconds. The view for that in the admin looks like...
from django import forms
from django.contrib import admin, messages
from django.http import HttpResponseRedirect
from django.urls import path, reverse

from apps.products.models import Product
from .scripts import update_products_from_csv


class CsvUploadForm(forms.Form):
    csv_file = forms.FileField(label='Upload CSV')


@admin.register(Product)
class ProductAdmin(admin.ModelAdmin):
    # list_display, list_filter, fieldsets, etc.

    def changelist_view(self, request, extra_context=None):
        extra_context = extra_context or {}
        extra_context['csv_upload_form'] = CsvUploadForm()
        return super().changelist_view(request, extra_context=extra_context)

    def get_urls(self):
        urls = super().get_urls()
        new_urls = [path('upload-csv/', self.upload_csv)]
        return new_urls + urls

    def upload_csv(self, request):
        if request.method == 'POST':
            # csv_file = request.FILES['csv_file'].file
            # result_string = update_products_from_csv.run(csv_file)
            # I commented out the above two lines and added the line below to rule out
            # the possibility that the csv upload itself was the problem. Whether I execute
            # the script using the uploaded file or let it use the hardcoded local path,
            # the results are the same. It works, but takes more than 20 times longer
            # than executing the same script from the shell.
            result_string = update_products_from_csv.run()
            print(result_string)
            messages.success(request, result_string)
        return HttpResponseRedirect(reverse('admin:products_product_changelist'))
Right now the actual running parts of the script are about as simple as this...
import csv
from time import time

from apps.products.models import Product

CSV_PATH = 'path/to/local/csv_file.csv'


def run():
    csv_data = get_csv_data()
    update_data = build_update_data(csv_data)
    update_handler(update_data)
    return 'Finished'


def get_csv_data():
    with open(CSV_PATH, 'r') as f:
        return [d for d in csv.DictReader(f)]


def build_update_data(csv_data):
    update_data = []
    # Code that loops through csv data, applies some custom logic, and builds a list of
    # dicts with the data cleaned and formatted as needed
    return update_data


def update_handler(update_data):
    query_times = []
    for upd in update_data:
        iter_start = time()
        product_obj = Product.objects.get(external_id=upd['external_id'])
        # external_id is not the primary key but is an indexed field in the Product model
        query_times.append(time() - iter_start)
    # Code to export query_times to an external file for analysis
update_handler() has a bunch of other code checking field values to see if anything needs to be changed, and building the objects when a match does not exist, but that's all commented out right now. As you can see, I'm also timing each query and logging those values. (I've been dropping time() calls in various places all day and have determined that the query is the only part that's noticeably different.)
When I run it from the shell, the average query time is 0.0005 seconds and the total of all query times comes out to about 6.8 seconds every single time.
When I run it through the admin view and then check the queries in Django Debug Toolbar it shows the 12,000+ queries as expected, and shows a total query time of only about 3900ms. But when I look at the log of query times gathered by the time() calls, the average query time is 0.013 seconds (26 times longer than when I run it through the shell), and the total of all query times always comes out at 156-157 seconds.
The queries in Django Debug Toolbar when I run it through the admin all look like SELECT ••• FROM "products_product" WHERE "products_product"."external_id" = 10 LIMIT 21, and according to the toolbar they are mostly all 0-1ms. I'm not sure how I would check what the queries look like when running it from the shell, but I can't imagine they'd be different? I couldn't find anything in django-extensions runscript docs about it doing query optimizations or anything like that.
One additional interesting facet is that when running it from the admin, from the time I see result_string print in the terminal, it's another solid 1-3 minutes before the success message appears in the browser window.
I don't know what else to check. I'm obviously missing something fundamental, but I don't know what.
Somebody on Reddit suggested that running the script from the shell might be automatically spinning up a new thread where the logic can run unencumbered by the other Django server processes, and this seems to be the answer. If I run the script in a new thread from the admin view, it runs just as fast as it does when I run it from the shell.
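For reference, a minimal sketch of what running the script in a new thread from the admin view could look like (the info message and the daemon flag are assumptions; with this change the result_string is no longer reported back to the browser):

import threading

def upload_csv(self, request):
    if request.method == 'POST':
        # Run the import in a daemon thread so the admin view can redirect
        # immediately instead of blocking until the script finishes.
        t = threading.Thread(target=update_products_from_csv.run, daemon=True)
        t.start()
        messages.info(request, 'CSV import started in the background.')
    return HttpResponseRedirect(reverse('admin:products_product_changelist'))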

Dash-Plotly: keep dropdown selection on page reload

I'm quite new to web development in Python and am developing a Dash/Plotly application at the moment. In the application there is a dropdown menu for the user to select a specific time interval for the data shown in the graph. When the page is refreshed, the selection obviously returns to the default. This simplified code shows the dropdown setup:
import dash
import dash_core_components as dcc
import dash_html_components as html

app = dash.Dash(__name__)
app.layout = html.Div([
    dcc.Store(id='memory-intervals', storage_type='session'),
    dcc.Dropdown(
        id='time',
        options=get_intervals(),
        value=Interval.DAY.value,
        multi=False,
    ),
])
What I understand so far is that I can store data in the browser session through Dash's Store component. I managed to store the selection like this:
from dash.dependencies import Input, Output
from dash.exceptions import PreventUpdate

@app.callback(
    Output('memory-intervals', 'data'),
    Input('time', 'value'),
)
def select_interval(interval):
    if interval is None:
        raise PreventUpdate
    return interval
So I am stuck at this point... how can I set the store's data as the selection value after a page reload?
Thank you in advance!
You could use the Dash persistence feature.
As per documentation:
persistence_type ('local', 'session', or 'memory'; default 'local'): Where persisted user changes will be stored:
memory: only kept in memory, reset on page refresh. This is useful for example if you have a tabbed app, that deletes the component when a different tab is active, and you want changes persisted as you switch tabs but not after reloading the app.
local: uses window.localStorage. This is the default, and keeps the data indefinitely within that browser on that computer.
session: uses window.sessionStorage. Like 'local' the data is kept when you reload the page, but cleared when you close the browser or open the app in a new browser tab.
In your example, if you need to preserve the dropdown value across page reloads, but not after closing the browser or the tab, you could write:
import dash
import dash_core_components as dcc
import dash_html_components as html

app = dash.Dash(__name__)
app.layout = html.Div([
    dcc.Store(id='memory-intervals', storage_type='session'),
    dcc.Dropdown(
        id='time',
        options=get_intervals(),
        value=Interval.DAY.value,
        multi=False,
        persistence=True,
        persistence_type='session',
    ),
])
However, if you need to keep the selection indefinitely within that browser on that computer, you can simply use persistence=True, because the 'local' type is the default.
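That local-persistence variant would reduce to something like this (a minimal sketch of the same dropdown, relying on the default persistence_type):

dcc.Dropdown(
    id='time',
    options=get_intervals(),
    value=Interval.DAY.value,
    multi=False,
    persistence=True,  # persistence_type defaults to 'local'
),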
I don't know if this is the best solution, but going by the Plotly documentation I managed to do it like this:
from dash.dependencies import Input, Output

@app.callback(
    Output('time', 'value'),
    Output('memory-intervals', 'data'),
    Input('time', 'value'),
    Input('memory-intervals', 'data'),
    Input('memory-intervals', 'modified_timestamp'),
)
def select_interval(dd_interval, memory_interval, timestamp):
    ctx = dash.callback_context
    trigger_id = ctx.triggered[0]["prop_id"].split(".")[0]
    if trigger_id == 'time':
        interval = dd_interval
    elif timestamp == -1 or memory_interval is None:
        interval = Interval.DAY.value
    else:
        interval = memory_interval
    return interval, interval

Trying to offload long running task from django to a separate thread

I have a Django 2 app with an action in the admin that takes one to many images and uses the face_recognition project to find faces and store the face information in the database. I would like to offload this action to a different thread. However, in my attempt at starting a new thread, the Django app does not return until all the threads are done. I thought that if I put the long running task in a separate thread, then the Django app would return and the user could do other things with it as the long running process ran in the background.
DocumentAdmin action code:
def find_faces_4(self, request, queryset):
    from biometric_identification.threaded_tasks import map_thread_pool_face_finder
    images_to_scan = []
    for obj in queryset:
        # If the object is a Photo with one to many people, send to face recognition
        metadata = DocumentMetaData.objects.filter(document_id=obj.document_id)[0].metadata
        if "Photo Type" in metadata and metadata["Photo Type"] != 'No People':
            images_to_scan.append(obj)
    map_thread_pool_face_finder(images_to_scan)
The code to find the images:
def map_thread_pool_face_finder(document_list):
    t0 = timeit.default_timer()
    pool = ThreadPool()
    pool.map(fftask2, document_list)
    pool.close()
    pool.join()
    t1 = timeit.default_timer()
    logger.debug("Function map_thread_pool_face_finder {} seconds".format(t1 - t0))


def fftask2(obj):
    find_faces_task(obj.document_id, obj.storage_file_name.name, settings.MEDIA_ROOT)
The find_faces_task does the actual image scanning for faces.
I expected the action to pass off the face finding to separate threads, and return right away while the images are found in the background. The face recognition works, but the Django app is frozen until all the images are found. What am I missing in my understanding of multithreading or my code?
Thanks!
Mark
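The action blocks because pool.map() and pool.join() run inside the request that handles the admin action. Following the same daemon-thread pattern as the first answer on this page, here is a minimal sketch (an assumption, not a verified solution) of a non-blocking version of find_faces_4:

import threading

def find_faces_4(self, request, queryset):
    from biometric_identification.threaded_tasks import map_thread_pool_face_finder
    images_to_scan = []
    for obj in queryset:
        metadata = DocumentMetaData.objects.filter(document_id=obj.document_id)[0].metadata
        if "Photo Type" in metadata and metadata["Photo Type"] != 'No People':
            images_to_scan.append(obj)
    # Run the ThreadPool work (including the blocking pool.join()) in its own
    # daemon thread so the admin action returns to the user immediately.
    t = threading.Thread(target=map_thread_pool_face_finder,
                         args=(images_to_scan,), daemon=True)
    t.start()
    self.message_user(request, "Face detection started in the background.")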

How to start a new request after the item_scraped scrapy signal is called?

I need to scrape the data of each item from a website using Scrapy (http://example.com/itemview). I have a list of item IDs and I need to pass each one into a form on example.com.
The URL does not change from item to item, so for every request in my spider the URL will be the same, but the content will be different.
I don't want a for loop to handle each request, so I followed the steps below:
started the spider with the above url
added item_scraped and spider_closed signals
passed through several functions
passed the scraped data to the pipeline
triggered the item_scraped signal
After this the spider_closed signal is called automatically, but I want the steps above to repeat until all of the item IDs have been processed.
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    itemIDs = [11111, 22222, 33333]
    current_item_num = 0

    def __init__(self, itemids=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.item_scraped, signals.item_scraped)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.quit()

    def start_requests(self):
        request = self.make_requests_from_url('http://example.com/itemview')
        yield request

    def parse(self, response):
        self.driver = webdriver.PhantomJS()
        self.driver.get(response.url)
        first_data = self.driver.find_element_by_xpath('//div[@id="itemview"]').text.strip()
        yield Request(response.url, meta={'first_data': first_data}, callback=self.processDetails, dont_filter=True)

    def processDetails(self, response):
        itemID = self.itemIDs[self.current_item_num]
        # ..form submission with the current itemID goes here...
        # ...the content of the page is updated with the given itemID...
        yield Request(response.url, meta={'first_data': response.meta['first_data']}, callback=self.processData, dont_filter=True)

    def processData(self, response):
        # ...some more scraping goes here...
        item = ExamplecrawlerItem()
        item['first_data'] = response.meta['first_data']
        yield item

    def item_scraped(self, item, response, spider):
        self.current_item_num += 1
        # I need to call the processDetails function here for the next itemID,
        # and the process needs to continue till the itemIDs finish
        self.parse(response)
My pipeline:
class ExampleDBPipeline(object):
    def process_item(self, item, spider):
        MYCOLLECTION.insert(dict(item))
        return item
I wish I had an elegant solution to this; instead, here is a hackish way of calling the underlying classes:
self.crawler.engine.slot.scheduler.enqueue_request(scrapy.Request(url,self.yourCallBack))
However, you can yield a request after you yield the item and have its callback point back to self.processDetails. Simply add this to your processData function:
yield item
self.counter += 1
yield scrapy.Request(response.url, callback=self.processDetails, dont_filter=True, meta={"your": "Dictionary"})
Also, PhantomJS can be nice and make your life easy, but it is slower than regular connections. If possible, find the request for json data or whatever makes the page unparseable without JS. To do so, open up chrome, right click, click inspect, go to the network tab, then enter the ID into the form, then look at the XHR or JS tabs for a JSON that has the data or next url you want. Most of the time, there will be some url made by adding the ID, if you can find it, you can just concatenate your urls and call that directly without having the cost of JS rendering. Sometimes it is randomized, or not there, but I've had fair success with it. You can then also use that to yield many requests at the same time without having to worry about phantomJS trying to do two things at once or having to initialize many instances of it. You could use tabs, but that is a pain.
Also, I would use a Queue of your IDs to ensure thread safety. Otherwise, you could have processDetails called twice on the same ID, though in the logic of your program everything seems to go linearly, which means you aren't using the concurrency capabilities of Scrapy and your program will go more slowly. To use Queue add:
import Queue

# go inside the class definition and add
itemIDQueue = Queue.Queue()

# within __init__ add
[self.itemIDQueue.put(ID) for ID in self.itemIDs]

# within processDetails, replace itemID = self.itemIDs[self.current_item_num] with
itemID = self.itemIDQueue.get()
And then there is no need to increment the counter and your program is thread safe.
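Putting the two suggestions together, a rough sketch of how processData could chain straight to the next ID (the queue-emptiness check is an assumption added here; the rest mirrors the snippets above):

def processData(self, response):
    # ...scraping for the current item goes here...
    item = ExamplecrawlerItem()
    item['first_data'] = response.meta['first_data']
    yield item
    # Chain directly to the next itemID instead of relying on the
    # item_scraped signal to restart the cycle.
    if not self.itemIDQueue.empty():
        yield scrapy.Request(response.url,
                             callback=self.processDetails,
                             meta={'first_data': response.meta['first_data']},
                             dont_filter=True)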

Can Django run in a threaded model?

I was looking through the code base, particularly the database connectivity parts, and came across this issue.
First, one gets a cursor to the database using the following statement:
from django.db import connection, transaction
cursor = connection.cursor()
connection is a module attribute, so in a threaded model, all threads would share that variable. Sounds a bit strange. Diving in further, the cursor() method belongs to django.db.backends.BaseDatabaseWrapper and looks like this:
def cursor(self):
    self.validate_thread_sharing()
    if (self.use_debug_cursor or
            (self.use_debug_cursor is None and settings.DEBUG)):
        cursor = self.make_debug_cursor(self._cursor())
    else:
        cursor = util.CursorWrapper(self._cursor(), self)
    return cursor
The key is the call to _cursor(), which executes the backend code for whatever database is being used. In the case of MySQL, it executes the _cursor() method on django.db.backends.mysql.DatabaseWrapper, which looks like:
def _cursor(self):
    new_connection = False
    if not self._valid_connection():
        new_connection = True
        kwargs = {
            'conv': django_conversions,
            'charset': 'utf8',
            'use_unicode': True,
        }
        settings_dict = self.settings_dict
        if settings_dict['USER']:
            kwargs['user'] = settings_dict['USER']
        if settings_dict['NAME']:
            kwargs['db'] = settings_dict['NAME']
        if settings_dict['PASSWORD']:
            kwargs['passwd'] = settings_dict['PASSWORD']
        if settings_dict['HOST'].startswith('/'):
            kwargs['unix_socket'] = settings_dict['HOST']
        elif settings_dict['HOST']:
            kwargs['host'] = settings_dict['HOST']
        if settings_dict['PORT']:
            kwargs['port'] = int(settings_dict['PORT'])
        # We need the number of potentially affected rows after an
        # "UPDATE", not the number of changed rows.
        kwargs['client_flag'] = CLIENT.FOUND_ROWS
        kwargs.update(settings_dict['OPTIONS'])
        self.connection = Database.connect(**kwargs)
        self.connection.encoders[SafeUnicode] = self.connection.encoders[unicode]
        self.connection.encoders[SafeString] = self.connection.encoders[str]
        connection_created.send(sender=self.__class__, connection=self)
    cursor = self.connection.cursor()
    if new_connection:
        # SQL_AUTO_IS_NULL in MySQL controls whether an AUTO_INCREMENT column
        # on a recently-inserted row will return when the field is tested for
        # NULL. Disabling this value brings this aspect of MySQL in line with
        # SQL standards.
        cursor.execute('SET SQL_AUTO_IS_NULL = 0')
    return CursorWrapper(cursor)
So a new cursor is not necessarily created. If a call to _cursor() had already been made, the previously used cursor would have been returned.
In a threaded model, that means multiple threads are possibly sharing the same database cursor, which seems like a no-no.
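To make that concern concrete, here is a small experiment one could run from a Django shell (a sketch only; it does not assert what the result will be, it simply shows how to check whether threads end up with the same underlying DB-API connection):

import threading

from django.db import connection


def worker(name):
    # Each thread asks the module-level `connection` for a cursor, runs a
    # trivial query, and logs the identity of the underlying DB-API connection.
    cursor = connection.cursor()
    cursor.execute("SELECT 1")
    print(name, "underlying connection id:", id(connection.connection))


threads = [threading.Thread(target=worker, args=("thread-%d" % i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()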
There are also other signs that indicate that threading is not allowed in Django. This module-level code from django/db/__init__.py, for example:
def close_connection(**kwargs):
    for conn in connections.all():
        conn.close()

signals.request_finished.connect(close_connection)
So whenever any request finishes, all database connections are closed. What happens if there are concurrent requests?
Seems like a lot of stuff is being shared, which indicates that threading is not allowed. I didn't see synchronization code anywhere.
Thanks!