Access to pandas dataframe object between requests via session key - django

I have a pandas dataframe with a loose wrapper class around it that provides metadata for my Django/DRF application. The application is basically a user-friendly (non-programmer) way to do some data analysis and validation. Between requests I want to be able to save the state of the dataframe so I can have a series of interactions with the data, but it does not need to be saved in a database (it only needs to survive as long as the browser session). From this it was logical to check out Django's session framework, but from what I've read session data should be lightweight, and the dataframe object does not serialize to JSON.
Because I don't have a ton of users, and I want the app to feel like a desktop site, I was thinking of using the Django cache as a way to keep the dataframe object in memory. So putting the data in the cache would go something like this:
>>> from django.core.cache import caches
>>> cache1 = caches['default']
>>> cache1.set(request.session.session_key, dataframe_object)
and then use get with the same key in the following requests to access the dataframe.
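For the read side, a rough sketch of the follow-up request (assuming the session key is already established and the entry is still in the cache) would be:
>>> from django.core.cache import caches
>>> cache1 = caches['default']
>>> dataframe_object = cache1.get(request.session.session_key)
>>> if dataframe_object is None:
...     # entry expired or was evicted; rebuild or reload the dataframe here
...     pass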
Is this a good way to handle this workflow, or is there another system I should use to keep rather large data (5 MB to 100 MB) in memory?

If you are running your application on a modern server, then 100 MB is not a huge amount of memory. However, if you have more than a couple dozen simultaneous users, each requiring 100 MB of cache, then this could add up to more memory than your server can handle. Your cache and server should be configured appropriately, and you may want to limit the total number of cached dataframes in your Python code.
Since Django does need to serialize session data, your choice is either to use sessions with PickleSerializer or to use the cache. According to the documentation, PickleSerializer is not recommended for security reasons, so your choice to use the cache is a good one.
The default cache backend in Django does not share entries across processes, so you would get better memory and time efficiency by installing Memcached and enabling the django.core.cache.backends.memcached.MemcachedCache backend.
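If you go that route, a minimal CACHES sketch could look like the following (the Memcached address and the one-hour timeout are assumptions to adjust; on newer Django versions you would use the PyMemcacheCache or PyLibMCCache backend instead):
# settings.py -- sketch only
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
        'TIMEOUT': 60 * 60,  # cached dataframes expire after an hour
    }
}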

Related

what is the best method to initialize or store a lookup dictionary that will be used in django views

I'm reviving an old Django 1.2 app. Most of the steps have been taken.
I have views in my Django app that will reference a simple dictionary of only 1300-ish key-value pairs.
Basically the view will query the dictionary a few hundred to a few thousand times for user-supplied values. The dictionary data may change twice a year or so.
fwiw: django served by gunicorn, db=postgres, apache as proxy, no redis available yet on the server
I thought of a few options here:
- A table in the database that will be queried, letting caching do its job (at the expense of a few hundred SQL queries).
- Simply define the dictionary in the settings file (ugly, and how many times is it read? Every time you do a 'from django.conf import settings'?). This was how it was coded in the Django 1.2 predecessor of this app many years ago.
- Read a tab-delimited file using pandas in the Django settings and make it available. The advantage is that I can do some pandas magic in the view. (How efficient is this? Will the file be read many times for different users, or just once during server startup?)
- Prepopulate a Redis cache from a file as part of the startup process (complicates things on the server side and we want it to be simple, but it's fast).
- List items in a tab-delimited file and read it in the view (my least popular option since it seems to be rather slow).
What are your thoughts on this? Any other options?
Let me give a few options, from simple to more involved:
- Hold it in memory
- Basic flat file
- SQLite file
- Redis
- DB
I wouldn't bring Redis in for 1300 key-value pairs that don't even get mutated all that much.
I would put a file alongside the code that gets slurped into memory at startup, or do a single SQL query that grabs the entire thing at startup, and keep it in memory to use throughout the application.
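As a rough sketch of the "slurp it into memory at startup" option, assuming a tab-delimited lookup.tsv sitting next to the code (the file name, module name and column layout are all hypothetical):
# lookups.py -- hypothetical module; the dict is built once per worker process at import time
import csv
from pathlib import Path

_LOOKUP_PATH = Path(__file__).with_name('lookup.tsv')  # assumed location of the data file

def _load_lookup():
    # Each row is assumed to be "key<TAB>value"
    with open(_LOOKUP_PATH, newline='') as fh:
        return {key: value for key, value in csv.reader(fh, delimiter='\t')}

LOOKUP = _load_lookup()  # module-level: loaded once, reused for every request in this worker

# In a view:
#     from myapp.lookups import LOOKUP
#     result = LOOKUP.get(user_supplied_value)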

Storing raw text data vs analytics

I've been working on a hobby project that's a Django/React site that gives analytics and data viz for texts. Most likely I will host it on AWS. The user uploads a CSV of texts. The current logic is that they get stored in the DB, and when the user calls the API it runs the analytics on them and sends back the analytics. I'm trying to decide whether to store the raw text data (what I have now), or run the analytics on the texts once when they're uploaded and then discard them, only storing the analytics.
My thoughts are:
Raw data:
Pros:
- changes to the analytics won't require re-uploading
- probably a simpler DB schema
Cons:
- more sensitive data (not sure how safe it is in a Django DB on AWS, or what measures I could put in place to protect it more)
- more data to store (not sure what it would cost to store a lot of rows of texts)
Analytics:
Pros:
- less sensitive, less space
Cons:
- if something goes wrong with the analytics on the first run (that doesn't throw an error), then they could be inaccurate and will remain that way

How much can request.session store?

I'm new to learning about Django sessions (and Django in general). It seems to me that request.session functions like a dictionary, but I'm not sure how much data I can save in it. Most of the examples I have looked at so far use request.session to store relatively small data such as a short string or integer. So is there a limit to the amount of data I can save in request.session, or is it more related to what database I am using?
Part of the reason I have this question is that I don't fully understand how the storage of request.session works. Does it work like another Model? If so, how can I access the keys/items on the admin page?
Thanks for any help in advance!
In short: it depends on the backend you use; you specify this with the SESSION_ENGINE setting [Django-doc]. The backends can be (but are not limited to):
'django.contrib.sessions.backends.db'
'django.contrib.sessions.backends.file'
'django.contrib.sessions.backends.cache'
'django.contrib.sessions.backends.cached_db'
'django.contrib.sessions.backends.signed_cookies'
Depending on how each backend is implemented, different maximums are applied.
Furthermore, the SESSION_SERIALIZER setting matters as well, since this determines how the data is encoded. There are two built-in serializers:
'django.contrib.sessions.serializers.JSONSerializer'; and
'django.contrib.sessions.serializers.PickleSerializer'.
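For illustration, a hedged settings sketch combining one engine with one serializer (just one possible combination, not a recommendation):
# settings.py -- sketch
SESSION_ENGINE = 'django.contrib.sessions.backends.cached_db'
SESSION_SERIALIZER = 'django.contrib.sessions.serializers.JSONSerializer'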
Serializers
The serializer determines how the session data is converted to a byte stream, and thus has some impact on how much space the stored session takes.
For the JSONSerializer, the session data is dumped to JSON, then base64-encoded and signed with HMAC/SHA1. The base64 encoding adds roughly 33% overhead compared to the original JSON blob.
The PickleSerializer will first pickle the object, then encode and sign it in the same way. Pickling tends to be less compact than JSON encoding, but on the other hand it can serialize objects that are not plain dictionaries, lists, etc.
Backends
Once the data is serialized, the backend determines where it is stored. Some backends have limitations.
django.contrib.sessions.backends.db
Here Django uses a database model to store session data. If the database can store values up to 4 GiB (like MySQL's LONGTEXT, for example), then after the base64 overhead it will probably hold JSON blobs up to roughly 3 GiB per session. Note that of course there should be sufficient disk space to store the table.
django.contrib.sessions.backends.file
Here the data is written to a file. There are no limitations implemented, but of course there should be sufficient disk space. Some operating systems can add certain limitations to the amount of disk space files in a directory can allocate.
django.contrib.sessions.backends.cache
Here it is stored in one of the caches you specified in the CACHES setting [Django-doc], depending on the cache system you pick certain limitations can apply.
django.contrib.sessions.backends.cached_db
Here you use a combination of cache and db: you use the cache, but the data is backed by the database, such that if the cache is invalidated, the database still contains the data. This thus means that the limitations of both backends apply.
django.contrib.sessions.backends.signed_cookies
Here you store signed cookies at the browser of the client. The limitations of the cookies are here specified by the browser.
RFC 2965 on HTTP State Management Mechanism specifies that a browser should normally be capable of storing at least 4096 bytes per cookie. But since the signature takes up part of that space, the usable payload is even smaller.
If you use the cookies of the browser, you thus can only store very limited amounts of data.

PostgreSQL with Django: should I store static JSON in a separate MongoDB database?

Context
I'm making a Django web application that depends on scraped API data.
The workflow:
A) I retrieve data from an external API
B) I insert the structured, processed data that I need into my PostgreSQL database (about 5% of the whole JSON)
I would like to add a third step (before or after step B) which will store the whole external API response in my database, for three reasons:
1) I want to "freeze" the data, as an "audit trail", in case the API changes the content (it has happened before)
2) API calls in my business are expensive, and often limited to 6 months of history.
3) I might decide to integrate more data from the API later.
Calling the external API again when the data is needed is not possible because of 2) and 3).
Please note that the stored API responses will never be updated and read performance is not really important. Also, being able to query the stored API responses would be really nice, to perform exploratory analysis.
To provide additional context, there are a few thousand API calls a day, which represent around 50 GB of data a year.
Here comes my question(s)
Should I store the raw JSON in the same PostgreSQL database I'm using for the Django web application, or in a separate datastore (MongoDB or some other NoSQL database)?
If I go with storing the raw JSON in my PostgreSQL database, I fear that my web application's performance will decrease due to the database being "bloated" (50 MB of parsed SQL data in my Django database is equivalent to 2 GB of raw JSON from the external API, so integrating the full API responses into my database will multiply its size by 40).
What about cost, since all of this is hosted on a DBaaS? I understand that the cost will increase greatly (due to the DB's size increase), but is either of the two options more cost-effective?
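If the raw JSON ends up in the same PostgreSQL database, a minimal sketch of an archive model could look like this (the model and field names are hypothetical; on PostgreSQL, JSONField is stored as jsonb, which keeps the responses queryable for exploratory analysis):
# models.py -- hypothetical model for archiving raw API responses
from django.db import models

class RawApiResponse(models.Model):
    endpoint = models.CharField(max_length=255)        # which API endpoint was called
    fetched_at = models.DateTimeField(auto_now_add=True)
    payload = models.JSONField()                        # the full, unmodified API response

    class Meta:
        indexes = [models.Index(fields=['endpoint', 'fetched_at'])]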

Storing large data on the server

I am using Django to write a website that conducts a user study. For each user, I need to load a large amount of data in RAM, and let that data be accessible throughout this particular user's time on the website. When the user leaves the website, this data can be discarded. When the next user visits the website, a new set of data will be loaded into RAM. The data is the same size, but of different value, for each user. A maximum of four users will be visiting the website at any one time. The data can be up to 100MB in size.
What is the best way to implement this? The only solution I can think of is to store the data as a session variable, but I'm wondering whether this involves any memory copying, which might be slow given that the data is large?
You shouldn't allocate RAM via Django. If you have heavy processes to run, run them asynchronously - you probably need Celery:
https://pypi.python.org/pypi/django-celery
http://www.celeryproject.org/
First do your "machine learning calculations based on the user's input" in a Django management command or task. Then you can use Celery to decide when to run it.
The workflow would be:
- user enters some data in a form
- user submits it: that saves a record in the database
- the command is automatically run afterwards using that record, as sketched below
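A minimal Celery sketch of that workflow, assuming a Celery app is already configured for the project (the task, model and field names are hypothetical):
# tasks.py -- hypothetical Celery task kicked off after the form record is saved
from celery import shared_task

def heavy_computation(input_data):
    # Placeholder for the actual machine-learning calculation on the user's input.
    ...

@shared_task
def run_user_study_computation(record_id):
    from myapp.models import StudyRecord  # hypothetical model storing the submitted form data

    record = StudyRecord.objects.get(pk=record_id)
    result = heavy_computation(record.input_data)
    StudyRecord.objects.filter(pk=record_id).update(result=result)

# In the view, right after saving the form record:
#     run_user_study_computation.delay(record.pk)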