Django textarea for 50,000,000 character data

I have a Django application that deals with large text files, up to roughly 50,000,000 characters. For a variety of reasons it's desirable to store them as a model field.
We are using SQLite for dev and Postgres for production.
Users do not need to enter the data via any UI.
The field does not need to be visible in the admin or elsewhere to the user.
Several questions:
Is it practicable to store this much text in a textarea field?
What, if any, performance issues will this likely create?
Would using a binary field improve performance?
Any guidance would be greatly appreciated.

Another consideration: when you are querying that model, make sure you use defer() on your querysets, so you aren't transferring 50 MB of data down the pipe every time you want to retrieve an object from the db.
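For example, a minimal sketch (the model and field names here are made up):

from django.db import models

class Document(models.Model):
    title = models.CharField(max_length=255)
    body = models.TextField()  # the ~50,000,000 character payload

# Load everything except the huge column; `body` is only fetched
# lazily if it is actually accessed on an instance:
docs = Document.objects.defer("body")

# Or, equivalently, whitelist just the cheap columns:
docs = Document.objects.only("id", "title")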
I highly recommend storing those files on disk, or on S3 or equivalent, via a FileField though. You won't really be able to query the contents of those files efficiently anyway.

This is more related to the database you use. You're using SQLite, so look at SQLite's limits:
The maximum number of bytes in a string or BLOB in SQLite is defined by the preprocessor macro SQLITE_MAX_LENGTH. The default value of this macro is 1 billion (1 thousand million or 1,000,000,000).
http://www.sqlite.org/limits.html
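If you want to verify the limit your particular SQLite build was compiled with, Python 3.11+ exposes it through the sqlite3 module:

import sqlite3

conn = sqlite3.connect(":memory:")
# Maximum size in bytes of any string, BLOB, or table row.
# Connection.getlimit() requires Python 3.11+.
print(conn.getlimit(sqlite3.SQLITE_LIMIT_LENGTH))  # default: 1000000000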
Besides that, it's probably better to use a TextField in Django.
A binary field wouldn't improve performance. Binary fields are meant for binary data, and you are storing text.

After some experimentation we decided to use a Django FileField and not store the file contents in PostgreSQL. Performance was the primary decision driver. With a FileField we are able to query very quickly to get the underlying file, which in turn can be accessed directly at the OS level with much higher performance than is available if the data is stored in a PostgreSQL table.
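A minimal sketch of what we ended up with (names are illustrative; this assumes the default filesystem storage, where FieldFile.path is available):

from django.db import models

class LargeText(models.Model):
    # Only the file path is stored in the Postgres row.
    content = models.FileField(upload_to="large_texts/")

# The row fetch is cheap because the 50M characters are not in it:
obj = LargeText.objects.get(pk=1)

# Direct OS-level access to the file, bypassing the database entirely:
with open(obj.content.path, encoding="utf-8") as f:
    text = f.read()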
Thanks for the input. It was a big help.

Related

Django Upload File, Analyse Contents and Write to DB or Update Form

I'm pretty new to Django, trying to get to grips with it and expand what I think it's capable of doing, and maybe one of you more knowledgeable people here can point me in the right direction.
I'm basically trying to build a system similar to a so-called "asset management" system, to track software versions of a product. When an engineer updates the software version, they run a script which gathers all the information (version, install date, hardware, etc.) and stores it in a .txt file. The engineer then comes back to the website and uploads this .txt file for that customer, and it automatically updates the fields in the form, or writes directly to the database.
While I've searched a bit on here for concepts, I haven't been able to find anything similar (maybe my search terms aren't correct?), and wanted to ask whether what I'm doing is even feasible, or am I lost down the rabbit hole of limitations :) Maybe it's not doable within Django. Any suggestions on how I should approach such a problem would be greatly appreciated.
What you're asking is doable, either live in the form or after the form or file upload is submitted.
In the form approach you'll want a live reload of the form when the data comes from a .txt file. That can be done with JavaScript. This means the data will come from the text file and be entered into the form fields in the manner you define, and form validation will work as you want it to.
Another option is to require a .txt file in a specified format and parse it in the view (in form_valid() for class-based views, or via request.FILES for function-based views), then perform all the required validations and save the values to the database as a model instance.
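As a rough function-based sketch (the view name, the "report" field name, and the "Key: Value" file format are all assumptions):

from django.http import HttpResponse, HttpResponseBadRequest

def upload_report(request):
    if request.method == "POST" and "report" in request.FILES:
        # Parse "Key: Value" lines from the uploaded .txt file.
        data = {}
        for line in request.FILES["report"].read().decode("utf-8").splitlines():
            key, sep, value = line.partition(":")
            if sep:
                data[key.strip().lower()] = value.strip()
        # Validate `data` here, then write it to the db, e.g. with a
        # hypothetical Asset model:
        # Asset.objects.update_or_create(customer=..., defaults=data)
        return HttpResponse("Parsed fields: %s" % sorted(data))
    return HttpResponseBadRequest("Expected a POST with a 'report' file.")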

Saving a Base64 string representing an image to the Django database

I'm fairly new to Django and I'm looking for the best way to store Base64 images in my Django db/server.
My goal is to be able to send batches of images through HTTP requests, so it would make sense to send them as Base64-encoded images. I may end up rendering them on a webpage, but they will primarily be sent to a desktop application in batches.
After looking through several other posts it looks like there are three approaches, and I'm not sure which best fits my needs.
1. Store the Base64 string in a model TextField
2. Store the Base64 string in a FileField
3. Store the image in an ImageField and convert it to Base64 when needed
My concern with option 1 is that storing large text fields in the db will hinder performance; however, like I said, I'm new, so I really don't have any idea.
Option 2 seems to make sense to me, as this is similar to how Django handles images: storing them elsewhere and just referencing the location in the db. However, I'm not sure whether that pattern exists simply because SQLite does not support fields of this type. I also see the potential for additional overhead, having to open and read files vs. just reading a text field.
Lastly, option 3 appears to be a rather unattractive option for my use case: since these Base64 images will primarily be sent in batches via HTTP requests, I figured it would be best to store the converted version rather than encode each image on every request.
I would greatly appreciate any insight the community could offer as to which approach might make the most sense for me to take. What are your thoughts?
Follow up question, if I intend on converting my database to Postgres does anything change regarding which approach I should take?
It is better not to store binary data in the database. Typically this requires escaping to create/update/retrieve the data, and thus results in less efficient access.
What is usually done is to work with a FileField [Django-doc] or an ImageField [Django-doc]. For these two model fields, Django stores the file in the file system and saves only the path in the database, which reduces the overhead of loading or saving an object.
You can decide to store a Base64 encoding of the file, but that will likely not be more efficient: Base64 output is roughly a third larger than the raw bytes, so it takes correspondingly longer to read from disk. Encoding to Base64 is cheap, and therefore it will likely be more efficient to store the file in its compact form and return a Base64 string created in the view.
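A sketch of that last suggestion (the model and view names are made up): store the raw image with an ImageField and build the Base64 payload on the fly in the view:

import base64

from django.db import models
from django.http import JsonResponse

class Photo(models.Model):
    image = models.ImageField(upload_to="photos/")  # raw bytes on disk; requires Pillow

def photo_batch(request):
    # Encode a small batch of stored images as Base64 per request.
    payload = []
    for photo in Photo.objects.all()[:20]:
        photo.image.open("rb")
        payload.append(base64.b64encode(photo.image.read()).decode("ascii"))
        photo.image.close()
    return JsonResponse({"images": payload})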

Should I create a model or just use a string

I'm writing a database application (using Django), and in a few places I can either use a string (max 4 chars) or create a model and reference it. I've looked and I can't find a good discussion of the merits of either solution. Because this database is going to be quite large, which solution scales better, both in performance and in size?
Thanks!
If you just have a string or a few strings which will be constant, just define them in settings.py and use them like:
from django.conf import settings
print(settings.my_string_1)
If that is the case, you save time by avoiding database access.
If you are going to use many strings which may vary over time, or which need frequent insert, update or delete operations, you have to store them in a database. If you already have a database like MySQL or Postgres set up in the project, you can use it. If not, an SQLite database is enough, as long as the database is not going to be too large.
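To make the two database options concrete (all names here are made up): a plain 4-character column, optionally constrained with choices, versus a small lookup table behind a foreign key:

from django.db import models

# Option 1: store the value inline; fine when the set of codes is fixed.
class Ticket(models.Model):
    STATUS_CHOICES = [("OPEN", "Open"), ("CLSD", "Closed")]
    status = models.CharField(max_length=4, choices=STATUS_CHOICES)

# Option 2: a separate model; use this when the values change at runtime
# or carry extra attributes, at the cost of a join on lookups.
class Status(models.Model):
    code = models.CharField(max_length=4, unique=True)

class Order(models.Model):
    status = models.ForeignKey(Status, on_delete=models.PROTECT)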

Are small changes in large documents something document databases are good for?

Sometimes documents, with their free-form structure, are attractive for storing data (in contrast to a relational database). But one problem is persistence combined with making small changes to the data, since the entire document would have to be rewritten to disk.
So my question is, are "document databases" especially made to solve this?
UPDATE
I think I understand the concept of "document-oriented databases" better now. It's obviously not documents of just any kind; each implementation uses its own format, such as, for instance, JSON. And then the answer to my question also becomes obvious. If the entire JSON-structure had to be rewritten to disk after each change to keep it persisted, it wouldn't be a very good database.
If the entire JSON-structure had to be rewritten to disk after each change to keep it persisted, it wouldn't be a very good database.
I would say this is not true of any document database I know of. For example, Mongo doesn't store documents as JSON, it stores them as BSON (http://en.wikipedia.org/wiki/BSON).
Also databases like Mongo will store documents in RAM and persist them to disk later.
In fact many document databases will follow that pattern of storing documents in main memory and then writing them to disk.
But the fact that a given document database will write data to disk - and the fact that some documents might get changed a lot - does not mean the database is non-performant. I wouldn't disregard document databases based on speculation.

C++: Reading a big text file into a string (bigger than string::max_size)

I have a huge text file (~5 GB) which is the database for my program. During a run, this database is read completely many times with string functions like string::find(), string::at(), string::substr()...
The problem is that this text file cannot be loaded into one string, because string::max_size is definitely too small.
How would you implement this? I had the idea of loading a part into a string -> reading -> closing -> loading another part into the same string -> reading -> closing -> ...
Is there a better/more efficient way?
How would you implement this?
With a real database, for instance SQLite. The performance improvement from having indexes will more than make up for the time spent learning another API.
Since this is a database, I'm assuming it'd have many records. That to me implies the best idea would be to implement a data class for each record and populate a list/vector/etc. depending on how you plan to use it. I'd also look into a persistent cache, as the file is big.
And within your container class of all records, you could implement search and other functions as you see fit. But as suggested, for a db of this size you're probably best off using a database.
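To illustrate the database route, here is a sketch using Python's built-in sqlite3 module and an assumed tab-separated record format; the same schema and queries apply from C++ via the SQLite C API:

import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (key TEXT, body TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_records_key ON records (key)")

# One-time import: stream the ~5 GB file in without ever holding it
# all in memory (assumes hypothetical "key<TAB>body" lines).
with open("huge.txt", encoding="utf-8") as f:
    rows = (line.rstrip("\n").split("\t", 1) for line in f)
    conn.executemany(
        "INSERT INTO records VALUES (?, ?)",
        (r for r in rows if len(r) == 2),
    )
conn.commit()

# An indexed lookup replaces scanning the whole file with string::find():
hits = conn.execute(
    "SELECT body FROM records WHERE key = ?", ("some-key",)
).fetchall()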