Storing large data on the server - django

I am using Django to write a website that conducts a user study. For each user, I need to load a large amount of data in RAM, and let that data be accessible throughout this particular user's time on the website. When the user leaves the website, this data can be discarded. When the next user visits the website, a new set of data will be loaded into RAM. The data is the same size, but of different value, for each user. A maximum of four users will be visiting the website at any one time. The data can be up to 100MB in size.
What is the best way to implement this? The only solution I can think of is to store the data as a session variable, but I'm wondering whether this involves any memory copying, which might be slow given that the data is large?

You shouldn't hold that much data in RAM inside the Django process. If you have heavy processing to run, run it asynchronously; you probably need Celery:
https://pypi.python.org/pypi/django-celery
http://www.celeryproject.org/
First, put your "machine learning calculations based on the user's input" into a Django management command. Then use Celery to control when it runs.
The workflow would be:
- user enters some data in a form
- user submits it: that saves a record in the database
- the command is automatically run afterwards using that record
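
The workflow above can be sketched without a Celery installation. This is only a minimal stand-in: a thread pool from the standard library plays the role of the Celery worker, and the names RECORDS, heavy_calculation and handle_form_submission are hypothetical, not part of Django or Celery.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the database table of submissions.
RECORDS = {}

# Stand-in for the Celery worker pool; jobs run outside the request thread.
executor = ThreadPoolExecutor(max_workers=2)


def heavy_calculation(record_id):
    """Placeholder for the long-running job; it works from the saved record."""
    record = RECORDS[record_id]
    record["result"] = sum(record["values"])  # stand-in for the real computation
    record["status"] = "done"
    return record["result"]


def handle_form_submission(record_id, values):
    """What the view would do: save the record, queue the job, return at once."""
    RECORDS[record_id] = {"values": values, "status": "pending", "result": None}
    return executor.submit(heavy_calculation, record_id)


future = handle_form_submission(1, [1, 2, 3])
print(future.result())  # blocks here only to show the outcome; prints 6
```

With Celery, heavy_calculation would instead be declared as a task and queued with its delay() method, so the work survives beyond the web process.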

Related

Django: how to take input as range and store it in the database?

I am a beginner in Django and am planning to create a website that asks the user for input (here, an hour range) and stores it in the database.
For example, with a slider that ranges from 5am to 9pm, the user could set the start and end (say 7am-9am) and submit it, and it would then be stored in the database.
I am not sure how to approach this. Also, how should it be stored in the database: as a single entry or as multiple entries? Your help is much appreciated.
I have searched repeatedly for tutorials, videos, articles, or any piece of information that could give me a kickstart, but without success.

Storing raw text data vs analytics

I’ve been working on a hobby project: a Django/React site that provides analytics and data visualization for texts. It will most likely be hosted on AWS. The user uploads a CSV of texts. The current logic is that the texts get stored in the db, and when the user calls the API it runs the analytics on them and returns the results. I’m trying to decide whether to store the raw text data (what I have now) or to run the analytics on the texts once, when they're uploaded, and then discard them, storing only the analytics.
My thoughts are:
Raw data:
pros:
- changes to analytics won’t require re-uploading
- probably simpler db schema
cons:
- more sensitive data (not sure how safe it is in a django db on AWS, not sure what measures I could put in place to protect it more)
- more data to store (not sure what it would cost to store a lot of rows of texts)
Analytics:
pros:
- less sensitive, less space
cons:
- if something goes wrong with the analytics on the first run (that doesn’t throw an error), then they could be inaccurate and will remain that way

Django database; how to download huge data in csv format

I have set up my database in Django, and it holds a huge amount of data. The task is to download all of the data at once in CSV format. The problem is that when the table has up to 2,000 rows the download works, but once it exceeds 5,000 rows it fails with a "Gateway timeout" error. How can I handle this? There is no table indexing as of now.
Also, with 2,000 rows the download takes around 18 seconds. How can this be optimized?
First, make sure the code that is generating the CSV is as optimized as possible.
Next, the gateway timeout is coming from your front end proxy; so simply increase the timeout there.
However, this is a temporary reprieve - as your data set grows, this timeout will be exhausted and you'll keep getting these errors.
The permanent solution is to trigger a separate process that generates the CSV in the background, and then download the file once it's finished. You can do this with celery or rq, both of which queue tasks for execution (and let you collect the results at a later time).
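If you are currently using HttpResponse from django.http then you could try using StreamingHttpResponse instead, which sends the response in chunks as they are generated rather than building the whole body first.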
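
A minimal sketch of the streaming idea, using only the standard library so it runs without a Django project. The Echo class and iter_csv_lines function are illustrative names, and the rows are made-up sample data; in a real view the rows would come from a lazily evaluated query.

```python
import csv


class Echo:
    """Pseudo-buffer: write() returns the value instead of storing it."""

    def write(self, value):
        return value


def iter_csv_lines(rows, header):
    """Yield CSV-formatted lines one at a time, so the file is never held whole."""
    writer = csv.writer(Echo())
    yield writer.writerow(header)
    for row in rows:
        yield writer.writerow(row)


# Hypothetical rows produced by a generator, standing in for a database query.
rows = ((i, f"name-{i}") for i in range(3))
lines = list(iter_csv_lines(rows, ["id", "name"]))
print("".join(lines), end="")
```

In a Django view, the generator would be passed straight to the response, along the lines of StreamingHttpResponse(iter_csv_lines(rows, header), content_type="text/csv"), instead of being collected into a list as done here for display.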
Failing that, you could try querying the database directly. For example, if you use the MySQL database backend, these answers might help you:
dump-a-mysql-database-to-a-plaintext-csv-backup-from-the-command-line
As for the speed of the transaction, you could experiment with other database backends. However, if you need to do this often enough for the speed to be a major issue then there may be something else in the larger process which should be optimized instead.

Access to pandas dataframe object between requests via session key

I have a pandas dataframe with a loose wrapper class around it that provides metadata for my django/DRF application. The application is basically a user-friendly (non-programmer) way to do some data analysis and validation. Between requests I want to be able to save the state of the dataframe so I can have a series of interactions with the data, but it does not need to be saved in a database (it only needs to survive as long as the browser session). From this it was logical to check out django's session framework, but from what I've heard session data should be lightweight, and the dataframe object does not serialize to JSON.
Because I don't have a ton of users, and I want the app to feel like a desktop application, I was thinking of using the django cache as a way to keep the dataframe object in memory. So putting the data in the cache would go something like this
>>> from django.core.cache import caches
>>> cache1 = caches['default']
>>> cache1.set(request.session.session_key, dataframe_object)
and then the same, except using get, in the following requests to access it.
Is this a good way to handle this workflow, or is there another system I should use to keep rather large data (5 MB to 100 MB) in memory?
If you are running your application on a modern server then 100 MB is not a huge amount of memory. However, if you have more than a couple dozen simultaneous users, each requiring 100 MB of cache, then this could add up to more memory than your server can handle. Your cache and server should be configured appropriately, and you may want to limit the total number of cached dataframes in your python code.
Since it does appear that Django needs to serialize session data, your choice is either to use sessions with PickleSerializer or to use the cache. According to the documentation, PickleSerializer is not recommended for security reasons, so your choice to use the cache is a good one.
The default cache backend in Django (local memory) does not share entries across processes, so you would get better memory and time efficiency by installing memcached and enabling the memcached.MemcachedCache backend. Note that this backend was removed in Django 4.1, where memcached.PyMemcacheCache takes its place, and that memcached's default item size limit is 1 MB, so it would have to be raised to hold dataframes of this size.
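
One way to limit the total number of cached dataframes in Python code is a small least-recently-used container. The following is only a sketch under stated assumptions: BoundedObjectCache is a hypothetical helper, not part of Django, the session keys are invented, and plain strings stand in for dataframe objects.

```python
from collections import OrderedDict


class BoundedObjectCache:
    """Keep at most max_entries objects, evicting the least recently used."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def set(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)
        return self._entries[key]


# Hypothetical session keys; the strings stand in for dataframe objects.
frames = BoundedObjectCache(max_entries=2)
frames.set("session-a", "dataframe A")
frames.set("session-b", "dataframe B")
frames.get("session-a")                 # touch A so that B becomes the oldest
frames.set("session-c", "dataframe C")  # evicts B
print(frames.get("session-b"))          # prints None
```

Like the local-memory backend, such a container lives in a single process and is not synchronized across threads, so it only suits a deployment where one worker serves a given session.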

Django Cache + Django Database request

I'm building a Django web application that lets users select photos from their computer and keep adding them to their timeline. The timeline will show 10 photos initially, with pull-to-refresh to fetch the next 10 photos.
So my first question is: I'm able to upload images, which get stored on the file system, but how do I show only the first 10 and then pull to refresh to fetch the next 10, and so on?
Next, I want the app to feel fast, so I'm considering caching. What should I cache? There are three types of cache in Django: database cache, Memcached, and file system caching.
So my second question is: should I cache the first 10 photos of each user, or something else?
Kindly answer with your suggestions.
"So my first question is I'm able to upload images which gets store on the file system,but how do I show only first 10 and then pull a refresh to fetch the next 10 and so on."
Fetch the first 10 with your initial query, then fetch the following photos in chronological order. You must have some timestamp tied to each photo upload; order the images by that. You can use Django's Paginator for this.
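
The paging logic can be sketched in plain Python. This is a stand-in for Django's Paginator rather than the Paginator itself, and the get_page function, the uploaded_at field name and the sample photos are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def get_page(photos, page_number, per_page=10):
    """Return one page of photos, newest first; page_number starts at 1."""
    ordered = sorted(photos, key=lambda p: p["uploaded_at"], reverse=True)
    start = (page_number - 1) * per_page
    return ordered[start:start + per_page]


# Hypothetical data: 25 photos uploaded one minute apart.
base = datetime(2024, 1, 1)
photos = [
    {"name": f"photo-{i}.jpg", "uploaded_at": base + timedelta(minutes=i)}
    for i in range(25)
]

first_page = get_page(photos, 1)
third_page = get_page(photos, 3)
print(first_page[0]["name"], len(first_page), len(third_page))
# prints: photo-24.jpg 10 5
```

Each pull-to-refresh would then request the next page number. With a queryset ordered by the timestamp field, Paginator(queryset, 10).page(n) plays the same role and lets the database do the slicing.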
"what do I cache"
Whatever static data you want to show to the user frequently and that won't change right away. You can cache per user or for all users; choose what to cache accordingly.
"should I cache the first 10 photos of each user or something else"
Depends on you. Are those first pictures common to all the users? Then you can cache. If not and the pictures are user dependent, there is no point caching them. The user will anyway have to fetch the first images. And I highly doubt the user will keep asking for the same first 10 photos frequently. Again, it's your logic. If you think caching will help, you can go ahead and cache.
The DiskCache project was first created for a similar problem (caching images). It includes a couple of features that will help you to cache and serve images efficiently. DiskCache is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django.
diskcache.DjangoCache provides a Django-compatible cache interface with a few extra features. In particular, the get and set methods permit reading and writing files. An example:
from django.core.cache import cache
with open('filename.jpg', 'rb') as reader:
    cache.set('filename.jpg', reader, read=True)
Later you can get a reference to the file:
reader = cache.get('filename.jpg', read=True)
If you simply wanted the name of the file on disk (in the cache):
try:
    with cache.get('filename.jpg', read=True) as reader:
        filename = reader.name
except (AttributeError, TypeError):
    # cache miss: get() returned None, which is not a context manager
    filename = None
The code above requests a file from the cache. If there is no such value, get returns None. None is not a context manager (it lacks the __enter__ and __exit__ methods), so the with statement raises an exception: AttributeError on Python versions before 3.11, TypeError from 3.11 onward. In that case, the exception is caught and filename is set to None.
With the filename, you can use something like X-Accel-Redirect to tell Nginx to serve the file directly from disk.
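
A rough sketch of how the filename could be turned into an X-Accel-Redirect header. Everything specific here is an assumption: the cache directory, the /protected/ prefix (which would have to match an nginx location marked internal and aliased to that directory), the example path, and the accel_redirect_headers helper itself.

```python
import posixpath

# Assumed locations: the cache directory on disk and the nginx "internal"
# location that is mapped onto that directory.
CACHE_DIR = "/var/cache/app-images"
INTERNAL_PREFIX = "/protected/"


def accel_redirect_headers(filename):
    """Build response headers asking nginx to serve a cached file itself."""
    relative = posixpath.relpath(filename, CACHE_DIR)
    if relative == ".." or relative.startswith("../"):
        raise ValueError("file is outside the cache directory")
    return {
        "X-Accel-Redirect": INTERNAL_PREFIX + relative,
        "Content-Type": "image/jpeg",
    }


headers = accel_redirect_headers("/var/cache/app-images/ab/cd/ef.val")
print(headers["X-Accel-Redirect"])  # prints /protected/ab/cd/ef.val
```

In a Django view these headers would be set on an otherwise empty HttpResponse, so that the application only decides which file to send and Nginx does the actual transfer.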