I have a script that loops through the rows of an external csv file (about 12,000 rows) and executes a single Model.objects.get() query to retrieve each item from the database (final product will be much more complicated but right now it's stripped down to the barest functionality possible to try to figure this out).
For right now the path to the local csv file is hardcoded into the script. When I run the script through the shell using py manage.py runscript update_products_from_csv it runs in about 6 seconds.
The ultimate goal is to be able to upload the csv through the admin and then have the script run from there. I've already been able to accomplish that, but the runtime when I do it that way takes more like 160 seconds. The view for that in the admin looks like...
from .scripts import update_products_from_csv

class CsvUploadForm(forms.Form):
    csv_file = forms.FileField(label='Upload CSV')

@admin.register(Product)
class ProductAdmin(admin.ModelAdmin):
    # list_display, list_filter, fieldsets, etc

    def changelist_view(self, request, extra_context=None):
        extra_context = extra_context or {}
        extra_context['csv_upload_form'] = CsvUploadForm()
        return super().changelist_view(request, extra_context=extra_context)

    def get_urls(self):
        urls = super().get_urls()
        new_urls = [path('upload-csv/', self.upload_csv)]
        return new_urls + urls

    def upload_csv(self, request):
        if request.method == 'POST':
            # csv_file = request.FILES['csv_file'].file
            # result_string = update_products_from_csv.run(csv_file)
            # I commented out the above two lines and added the below line to rule out
            # the possibility that the csv upload itself was the problem. Whether I execute
            # the script using the uploaded file or let it use the hardcoded local path,
            # the results are the same. It works, but takes more than 20 times longer
            # than executing the same script from the shell.
            result_string = update_products_from_csv.run()
            print(result_string)
            messages.success(request, result_string)
        return HttpResponseRedirect(reverse('admin:products_product_changelist'))
Right now the actual running parts of the script are about as simple as this...
import csv
from time import time

from apps.products.models import Product

CSV_PATH = 'path/to/local/csv_file.csv'

def run():
    csv_data = get_csv_data()
    update_data = build_update_data(csv_data)
    update_handler(update_data)
    return 'Finished'

def get_csv_data():
    with open(CSV_PATH, 'r') as f:
        return [d for d in csv.DictReader(f)]

def build_update_data(csv_data):
    update_data = []
    # Code that loops through csv data, applies some custom logic, and builds a list of
    # dicts with the data cleaned and formatted as needed
    return update_data

def update_handler(update_data):
    query_times = []
    for upd in update_data:
        iter_start = time()
        product_obj = Product.objects.get(external_id=upd['external_id'])
        # external_id is not the primary key but is an indexed field in the Product model
        query_times.append(time() - iter_start)
    # Code to export query_times to an external file for analysis
update_handler() has a bunch of other code checking field values to see if anything needs to be changed, and building the objects when a match does not exist, but that's all commented out right now. As you can see, I'm also timing each query and logging those values. (I've been dropping time() calls in various places all day and have determined that the query is the only part that's noticeably different.)
When I run it from the shell, the average query time is 0.0005 seconds and the total of all query times comes out to about 6.8 seconds every single time.
When I run it through the admin view and then check the queries in Django Debug Toolbar it shows the 12,000+ queries as expected, and shows a total query time of only about 3900ms. But when I look at the log of query times gathered by the time() calls, the average query time is 0.013 seconds (26 times longer than when I run it through the shell), and the total of all query times always comes out at 156-157 seconds.
The queries in Django Debug Toolbar when I run it through the admin all look like SELECT ••• FROM "products_product" WHERE "products_product"."external_id" = 10 LIMIT 21, and according to the toolbar they are mostly all 0-1ms. I'm not sure how I would check what the queries look like when running it from the shell, but I can't imagine they'd be different? I couldn't find anything in django-extensions runscript docs about it doing query optimizations or anything like that.
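(One way to check the SQL from the shell, assuming DEBUG=True so Django records queries, is connection.queries:)

from django.db import connection, reset_queries

reset_queries()
Product.objects.get(external_id=10)
print(connection.queries[-1]['sql'])   # the raw SQL Django sent
print(connection.queries[-1]['time'])  # duration in seconds, as a string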
One additional interesting facet is that when running it from the admin, from the time I see result_string print in the terminal, it's another solid 1-3 minutes before the success message appears in the browser window.
I don't know what else to check. I'm obviously missing something fundamental, but I don't know what.
Somebody on Reddit suggested that running the script from the shell might be automatically spinning up a new thread where the logic can run unencumbered by the other Django server processes, and this seems to be the answer. If I run the script in a new thread from the admin view, it runs just as fast as it does when I run it from the shell.
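A minimal sketch of that fix, reusing the names from the admin code above (error handling omitted; note the success message can now only report that the import has started, since the thread outlives the response):

import threading

def upload_csv(self, request):
    if request.method == 'POST':
        # Run the import in its own thread so the request/response cycle
        # (and whatever else the server is doing) doesn't slow it down.
        threading.Thread(target=update_products_from_csv.run).start()
        messages.success(request, 'CSV import started.')
    return HttpResponseRedirect(reverse('admin:products_product_changelist'))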
I need to scrape the data for each item from a website (http://example.com/itemview) using Scrapy. I have a list of itemIDs and I need to pass each one to a form on example.com.
There is no url change between items, so for each request in my spider the url will always be the same, but the content will be different.
I don't want a for loop handling each request, so I followed the steps below:
1. started the spider with the above url
2. added item_scraped and spider_closed signals
3. passed through several functions
4. passed the scraped data to the pipeline
5. triggered the item_scraped signal
After this, the spider_closed signal is called automatically. But I want the above steps to continue until all the itemIDs are finished.
import scrapy
from scrapy import signals
from scrapy.http import Request
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    itemIDs = [11111, 22222, 33333]
    current_item_num = 0

    def __init__(self, itemids=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.item_scraped, signals.item_scraped)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.quit()

    def start_requests(self):
        request = self.make_requests_from_url('http://example.com/itemview')
        yield request

    def parse(self, response):
        self.driver = webdriver.PhantomJS()
        self.driver.get(response.url)
        first_data = self.driver.find_element_by_xpath('//div[@id="itemview"]').text.strip()
        yield Request(response.url, meta={'first_data': first_data}, callback=self.processDetails, dont_filter=True)

    def processDetails(self, response):
        itemID = self.itemIDs[self.current_item_num]
        # ...form submission with the current itemID goes here...
        # ...the content of the page is updated with the given itemID...
        yield Request(response.url, meta={'first_data': response.meta['first_data']}, callback=self.processData, dont_filter=True)

    def processData(self, response):
        # ...some more scraping goes here...
        item = ExamplecrawlerItem()
        item['first_data'] = response.meta['first_data']
        yield item

    def item_scraped(self, item, response, spider):
        self.current_item_num += 1
        # I need to call the processDetails function here for the next itemID,
        # and the process needs to continue till the itemIDs finish
        self.parse(response)
My pipeline:
class ExampleDBPipeline(object):
    def process_item(self, item, spider):
        MYCOLLECTION.insert(dict(item))
        return item
I wish I had an elegant solution to this, but instead here's a hackish way of calling the underlying classes:
self.crawler.engine.slot.scheduler.enqueue_request(scrapy.Request(url, self.yourCallBack))
However, you can yield a request after you yield the item and have it call back to self.processDetails. Simply add this to your processData function:
yield item
self.counter += 1
yield scrapy.Request(response.url, callback=self.processDetails, dont_filter=True, meta={"your": "Dictionary"})
Also, PhantomJS can be nice and makes your life easy, but it is slower than plain HTTP requests. If possible, find the request that returns JSON data, or whatever it is that makes the page unparseable without JS. To do so, open Chrome, right-click, click Inspect, go to the Network tab, enter the ID into the form, then look at the XHR or JS tabs for a JSON response that has the data or the next url you want. Most of the time there will be some url made by appending the ID; if you can find it, you can just concatenate your urls and call them directly without the cost of JS rendering. Sometimes the url is randomized, or not there at all, but I've had fair success with this. You can then also yield many requests at the same time without worrying about PhantomJS trying to do two things at once, or having to initialize many instances of it. You could use tabs, but that is a pain.
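As a sketch of that idea, assuming you can find an endpoint that takes the ID in the url (the query-string pattern below is hypothetical):

import scrapy

class DirectExampleSpider(scrapy.Spider):
    name = "example_direct"
    itemIDs = [11111, 22222, 33333]

    def start_requests(self):
        # One request per ID, scheduled concurrently by Scrapy,
        # with no browser rendering involved.
        for item_id in self.itemIDs:
            url = 'http://example.com/itemview?itemID=%s' % item_id  # hypothetical url pattern
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        # parse the JSON or HTML returned for this ID
        pass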
Also, I would use a Queue of your IDs to ensure thread safety. Otherwise you could have processDetails called twice on the same ID. That said, in the logic of your program everything seems to go linearly, which means you aren't using Scrapy's concurrency capabilities and your program will go more slowly. To use a Queue, add:
import Queue

# inside the class definition, add
itemIDQueue = Queue.Queue()

# within __init__, add
for ID in self.itemIDs:
    self.itemIDQueue.put(ID)

# within processDetails, replace itemID = self.itemIDs[self.current_item_num] with
itemID = self.itemIDQueue.get()
And then there is no need to increment the counter and your program is thread safe.
I have the following situation. I created a simple backend in Flask that handles file uploads. With the files received, Flask does something with them (uploads them) and returns the data to the caller. There are two scenarios in the app: uploading one image and uploading multiple images. When uploading one image, I can simply get the response and voila, I'm all set.
However, I am stuck on handling multiple file uploads. I can use the same handler for the actual file upload, but the issue is that all of those files need to be stored into a list or something, then processed, and after doing that, a single link (album) containing all those images, needs to be delivered.
Here is my upload handling code:
@app.route('/uploadv3', methods=['POST'])
def upload():
    if request.method == 'POST':
        data_file = request.files["file"]
        file_name = data_file.filename
        path_to_save_to = os.path.join(app.config['UPLOAD_FOLDER'], file_name)
        data_file.save(path_to_save_to)
        file_url = upload_image_to_image_host(path_to_save_to)
        return file_url
I was experimenting with sessions in Flask, but I don't know whether I can create a list of items under one key, like session['links'], then get all of those, and clear the key after doing the work. Or is there some other, simpler solution?
I assume that I could probably do this with a separate key for each image, like session["link1"] and so on, but that would impose a limit on the number of images (depending on how many keys I create), would make the code very ugly, would make it problematic to iterate over them all to build the list passed to an album-building method, and would make clearing the session tedious.
Some code that I wrote for getting the actual link at the end and clearing the session follows (this assumes that session['link'] holds a list of urls, which I can't really achieve with my knowledge of session management in Flask):
def create_album(images):
    session.pop('link', None)
    new_album = im.create_album(images)
    return new_album.link

@app.route('/get_album_link')
def get_album_link():
    return create_album(session['link'])
Thanks in advance for your time!
You can assign anything to a session key, including a single value or a list/dictionary etc. If you know the links, you can store them in the session as follows:
session['links'] = ['link1', 'link2']  # ...and so on
This way, you have a list of all the links. You can now access a link by:
if 'links' in session:
    for link in session['links']:
        print link
Once you are done with them, you can clear the session as:
if 'links' in session:
    del session['links']
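One caveat when building the list incrementally across uploads: mutating a list that is already stored in the session does not always mark the session as dirty, so reassign the key (or set session.modified = True). A minimal sketch, following the upload handler from the question (file_url as computed there):

@app.route('/uploadv3', methods=['POST'])
def upload():
    # ... save the file and compute file_url as in the question ...
    links = session.get('links', [])
    links.append(file_url)
    session['links'] = links   # reassign so Flask persists the change
    session.modified = True    # extra safety when mutating in place
    return file_url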
To clarify what I have done to make this work: in the end, it turned out that uploading images and adding them to the album anonymously had to be done "in reverse", so not adding images to an album object, but uploading each image object to an album id.
I made a method that gets the album link and puts it in the session:
@app.route('/get_album_link')
def get_album_link():
    im = pyimgur.Imgur(CLIENT_ID)
    new_album = im.create_album()
    session.clear()
    session['album'] = new_album.deletehash
    session['album_link'] = new_album.link
    return new_album.link
Later on, when handling uploads, I just add the image to the album and voila, all set :)
uploaded_image = im.upload_image(path_of_saved_image, album=session['album'])
file_url = uploaded_image.link
return file_url
One caveat is that the image should be added using the "deletehash" value passed as the album value, not the album ID (which is covered by the imgur api documentation).
How can I show a "please wait" loading message from a Django view?
I have a Django view that takes significant time to perform calculations on a large dataset.
While the process runs, I would like to present the user with a feedback message, e.g. a spinning loading animated gif or similar.
After trying the two different approaches suggested by Brandon and Murat, Brandon's suggestion proved the most successful.
1. Create a wrapper template that includes the javascript from http://djangosnippets.org/snippets/679/. The javascript has been modified: (i) to work without a form (ii) to hide the progress bar / display results when a 'done' flag is returned (iii) with the JSON update url pointing to the view described below.
2. Move the slow loading function to a thread. This thread is passed a cache key and is responsible for updating the cache with progress status and then its results. The thread renders the original template as a string and saves it to the cache.
3. Create a view based on upload_progress from http://djangosnippets.org/snippets/678/, modified to (i) instead render the original wrapper template if progress_id='' (ii) generate the cache_key, check if a cache entry already exists and if not start a new thread (iii) monitor the progress of the thread and, when done, pass the results to the wrapper template. (A sketch of steps 2 and 3 follows this list.)
4. The wrapper template displays the results via document.getElementById('main').innerHTML=data.result
(* looking at whether step 4 might be better implemented via a redirect, as the rendered template contains javascript that is not currently run by document.getElementById('main').innerHTML=data.result)
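A minimal sketch of steps 2 and 3, assuming Django's cache framework; the view name, cache-key scheme and slow_calculation() are placeholders, and the real version would report incremental progress rather than a single flag:

import json
import threading
from django.core.cache import cache
from django.http import HttpResponse

def _worker(cache_key):
    # Thread body: record progress in the cache, then store the final result.
    cache.set(cache_key, {'done': False, 'progress': 0})
    result = slow_calculation()  # placeholder for the slow rendering work
    cache.set(cache_key, {'done': True, 'result': result})

def progress_view(request):
    # Polled by the wrapper template's javascript; starts the worker on first call.
    cache_key = 'calc-%s' % request.session.session_key  # illustrative key scheme
    state = cache.get(cache_key)
    if state is None:
        threading.Thread(target=_worker, args=(cache_key,)).start()
        state = {'done': False, 'progress': 0}
    return HttpResponse(json.dumps(state), content_type='application/json')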
Another thing you could do is add a javascript function that displays a loading image before it actually calls the Django View.
function showLoaderOnClick(url) {
    showLoader();
    window.location = url;
}

function showLoader() {
    $('body').append('<div style="" id="loadingDiv"><div class="loader">Loading...</div></div>');
}
And then in your template you can do:
<a href="#" onclick="showLoaderOnClick('{% url 'my_slow_view' %}')">This will take some time...</a> {# 'my_slow_view' is a placeholder url name #}
Here's a quick default loadingDiv: https://stackoverflow.com/a/41730965/13476073
Note that this requires jQuery.
A more straightforward approach is to generate a wait page with your gif etc., and then use the javascript
window.location.href = 'insert results view here';
to switch to the results view, which starts your lengthy calculation. The page won't change until the calculation is finished; when it does, the results page will be rendered.
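A minimal wait-page template for this approach might look like the following (the 'results_view' url name and spinner path are placeholders):

<!-- wait.html: rendered instantly, then hands off to the slow view -->
<img src="/static/spinner.gif" alt="Loading...">  <!-- placeholder spinner path -->
<script>
    // The browser keeps showing this page until the slow view responds.
    window.location.href = "{% url 'results_view' %}";  // placeholder url name
</script>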
Here's an oldie, but might get you going in the right direction: http://djangosnippets.org/snippets/679/
A workaround that I chose was to use the beforeunload and unload events to show the loading image. This can be used with or without window.load. In my case it's the view that takes a great amount of time, not the page loading, hence I am not using window.load (because by the time window.load fires, a lot of time has already passed, and at that point I no longer need the loading icon to be shown).
The downside is that the user gets a false message that the page is loading even when the request has not yet reached the server or is simply taking a long time. Also, it doesn't work for requests coming from outside my website. But I'm living with this for now.
Update: Sorry for not adding a code snippet earlier, thanks @blockhead. The following is a quick and dirty mix of plain JS and jQuery that I have in the master template.
Update 2: I later moved to making my view(s) lightweight so they send the crucial part of the page quickly, then fetch the remaining content with AJAX while showing the loading icon. It needed quite some work, but the end result is worth it.
window.onload = function() {
    $("#load-icon").hide(); // I needed the loading icon to hide once the page loads
}

var onBeforeUnLoadEvent = false;
window.onunload = window.onbeforeunload = function() {
    if (!onBeforeUnLoadEvent) { // avoid dual calls in browsers that support both events
        onBeforeUnLoadEvent = true;
        $("#load-icon").show();
        setTimeout(function() {
            $("#load-icon").hide(); // hiding the loading icon in any case after
        }, 5000);                   // 5 seconds (remove if you do not want it)
    }
};
P.S. I cannot comment yet hence posted this as an answer.
Iterating HttpResponse
https://stackoverflow.com/a/1371061/198062
Edit:
I found an example of sending big files with django: http://djangosnippets.org/snippets/365/ Then I looked at the FileWrapper class (django.core.servers.basehttp):
class FileWrapper(object):
    """Wrapper to convert file-like objects to iterables"""

    def __init__(self, filelike, blksize=8192):
        self.filelike = filelike
        self.blksize = blksize
        if hasattr(filelike, 'close'):
            self.close = filelike.close

    def __getitem__(self, key):
        data = self.filelike.read(self.blksize)
        if data:
            return data
        raise IndexError

    def __iter__(self):
        return self

    def next(self):
        data = self.filelike.read(self.blksize)
        if data:
            return data
        raise StopIteration
I think we can make an iterable class like this:
class FlushContent(object):
    def __init__(self):
        # some initialization code
        pass

    def __getitem__(self, key):
        # send a part of html
        pass

    def __iter__(self):
        return self

    def next(self):
        # do some work
        # return some html code
        if finished:
            raise StopIteration
then in views.py
def long_work(request):
    flushcontent = FlushContent()
    return HttpResponse(flushcontent)
Edit:
Example code, still not working:
class FlushContent(object):
    def __init__(self):
        self.stop_index = 2
        self.index = 0

    def __getitem__(self, key):
        pass

    def __iter__(self):
        return self

    def next(self):
        if self.index == 0:
            html = "loading"
        elif self.index == 1:
            import time
            time.sleep(5)
            html = "finished loading"
        self.index += 1
        if self.index > self.stop_index:
            raise StopIteration
        return html
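For what it's worth, on Django 1.5+ the same idea can be expressed with a generator and StreamingHttpResponse, which avoids middleware consuming the iterator (a sketch of the same two-chunk output; whether the client actually sees the chunks incrementally still depends on the WSGI server and any buffering in between):

import time
from django.http import StreamingHttpResponse

def long_work(request):
    def content():
        # Each yielded chunk is written to the client as it is produced.
        yield 'loading'
        time.sleep(5)  # stand-in for the slow calculation
        yield 'finished loading'
    return StreamingHttpResponse(content())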
Here is another explanation of how to get a loading message for long-loading Django views.
Views that do a lot of processing (e.g. complex queries with many objects, accessing 3rd party APIs) can take quite some time before the page is loaded and shown to the user in the browser. What happens is that all that processing is done on the server and Django is not able to serve the page before it is completed.
The only way to show a loading message (e.g. a spinner gif) during the processing is to break the current view up into two views:
The first view renders the page with no processing, and with the loading message.
The page includes an AJAX call to the second view, which does the actual processing. The result of the processing is displayed on the page via AJAX / JavaScript once it's done.
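A minimal sketch of that split (the view names, template name and run_expensive_processing() are placeholders):

import json
from django.http import HttpResponse
from django.shortcuts import render

def page_view(request):
    # First view: returns immediately; the template shows the spinner
    # and fires the AJAX call to data_view.
    return render(request, 'page_with_spinner.html')  # placeholder template

def data_view(request):
    # Second view: does the heavy processing, called from the page via AJAX.
    result = run_expensive_processing()  # placeholder for the slow work
    return HttpResponse(json.dumps({'result': result}),
                        content_type='application/json')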
I'm trying to get my images thumbnailed and stored on s3 using django-storages, boto, and sorl-thumbnail. I have it working, but it's very slow, even with small images. I don't mind it being slow when I save the form and upload the images to s3, but I'd like it to display the image quickly after that.
The answer to this SO question explains that the thumbnail won't be created until first access, but that you can use get_thumbnail() to create it beforehand.
Django + S3 (boto) + Sorl Thumbnail: Suggestions for optimisation
I'm doing that, and now it seems that all entries in the thumbnail_kvstore table are created when uploading the image, rather than when it is displayed.
The problem is that the page displaying the image is still really slow. Looking at the logging panel in the debug toolbar, it looks like there is still lots of communication with s3. It seems like after the image and thumbnails are uploaded and cached, the page should render quickly without communicating with s3.
What am I doing wrong? Thanks!
Update: a weak hack seems to have gotten it working, but I'd love to know how to do this properly:
https://github.com/asciitaxi/sorl-thumbnail/commit/545cce3f5e719a91dd9cc21d78bb973b2211bbbf
Update: more information for @sorl
I'm working with 2 views:
ADD VIEW: In this view I submit the form to create the model with the image in it. The image is uploaded to s3. In a post_save signal, I call get_thumbnail() to generate the thumbnail before it's needed:
im = get_thumbnail(instance.image, '360x360')
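For context, the signal hook looks roughly like this (a sketch; MyModel stands in for the actual model, and 'image' is its ImageField, per the description above):

from django.db.models.signals import post_save
from django.dispatch import receiver
from sorl.thumbnail import get_thumbnail

@receiver(post_save, sender=MyModel)  # MyModel is a placeholder
def warm_thumbnail(sender, instance, **kwargs):
    # Pre-generate the 360x360 thumbnail so the display view finds it cached.
    get_thumbnail(instance.image, '360x360')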
DISPLAY VIEW: In this view I display the thumbnail generated in the add view:
{% thumbnail object.image "360x360" as im %}
    <img src="{{ im.url }}" width="{{ im.width }}" height="{{ im.height }}">
{% endthumbnail %}
Without the patch:
ADD VIEW: creates 3 entries in the kvstore table, accesses the cache 10 times (6 sets, 4 gets), logging tab of debug toolbar says "establishing HTTP connection" 12 times
DISPLAY VIEW: still just 3 entries in the kvstore table, just 1 get from cache, but debug toolbar says "establishing HTTP connection" 3 times still
With only the change on line 122:
ADD VIEW: same as above, except the logging only says "establishing HTTP connection" 2 times
DISPLAY VIEW: same as above, except the logging only says "establishing HTTP connection" 1 time
Also adding the change on line 118:
ADD VIEW: same as above, but now we are down to 2 "establishing HTTP connection" messages
DISPLAY VIEW: same as above, with no logging messages at all
UPDATE: It looks like storage._setup() is called twice, and storage.url() is called once. Based on the timing, I'd say each one makes connections to s3:
1304711315.4
_setup
1304711317.84
1304711317.84
_setup
1304711320.3
1304711320.39
_url
1304711323.66
This seems to be reflected by the boto logging, which says "establishing HTTP connection" 3 times.
As the author of sorl-thumbnail I am really interested in solving this if it is not working as I intended. If the key value store is populated, it will currently store: name, storage and size. I have made the assumption that the url is based on the name and thus should not cause any storage calls; looking at django-storages, https://github.com/e-loue/django-storages/blob/master/storages/backends/s3boto.py#L214, that seems like a safe assumption to make. In your patch you have patched the read method for some reason. When creating a thumbnail, an ImageFile instance is fetched from cache (or created if absent). You can of course call read, which will read the file, but the intended use is .url, which calls url on the storage with the cached name, which in turn should be a non-storage-access op. Could you try to isolate exactly where in your code this storage access happens?
Also make sure you have THUMBNAIL_DEBUG on and that you have the key value store properly set up.
I'm not sure if your problem is the same as mine, but I found that accessing the width or height property of a normal Django ImageField reads the file from the storage backend, loads it into PIL, and returns the dimensions from there. This is especially costly with a remote backend like the one we're using, and we have very media-heavy pages.
https://code.djangoproject.com/ticket/8307 was opened to address this, but the Django devs closed it as wontfix because they want the width and height properties to always return the true values. So I just monkeypatch _get_image_dimensions() to use the width/height fields on the model, which prevents a large number of the boto messages and improves my page-load times.
Below is my code, modified from the patch attached to that ticket. I stuck it somewhere that gets executed early, such as models.py.
from django.core.files.images import ImageFile, get_image_dimensions

def _get_image_dimensions(self):
    from numbers import Number
    if not hasattr(self, '_dimensions_cache'):
        close = self.closed
        if self.field.width_field and self.field.height_field:
            width = getattr(self.instance, self.field.width_field)
            height = getattr(self.instance, self.field.height_field)
            # check if the fields have proper values
            if isinstance(width, Number) and isinstance(height, Number):
                self._dimensions_cache = (width, height)
            else:
                self.open()
                self._dimensions_cache = get_image_dimensions(self, close=close)
        else:
            self.open()
            self._dimensions_cache = get_image_dimensions(self, close=close)
    return self._dimensions_cache

ImageFile._get_image_dimensions = _get_image_dimensions
After looking at the django ticket from @shadfc's answer, I reimplemented the monkeypatch as follows:
from django.core.files.images import ImageFile, get_image_dimensions

def _get_image_dimensions(self):
    if not hasattr(self, '_dimensions_cache'):
        if getattr(self.storage, 'IGNORE_IMAGE_DIMENSIONS', False):
            self._dimensions_cache = (0, 0)
        else:
            close = self.closed
            self.open()
            self._dimensions_cache = get_image_dimensions(self, close=close)
    return self._dimensions_cache

ImageFile._get_image_dimensions = _get_image_dimensions
To use it, just add IGNORE_IMAGE_DIMENSIONS = True to your storage class, and the file will never be read to determine image dimensions. Like so:
from storages.backends.s3boto import S3BotoStorage
S3BotoStorage.IGNORE_IMAGE_DIMENSIONS = True
I still need to investigate where those numbers are used, to know whether simply returning (0, 0) can lead to any problems, but no bugs have surfaced so far.