I'm writing a database (using Django) and in a few places I can either use a string (max 4 chars) or create a model and reference it in those places. I've looked around and can't find a good discussion of the merits of either solution. Because this database is going to be quite large, which solution scales better, both in performance and in size?
Thanks!
If you just have one string, or a few strings that will stay constant, just define them in settings.py and use them like:
# settings.py must define MY_STRING_1 (Django only exposes uppercase settings)
from django.conf import settings
print(settings.MY_STRING_1)
In that case, you save time by avoiding database access.
If you are going to have many strings that may vary over time, or that need frequent insert, update, or delete operations, you will have to store them in the database. If you already have a database like MySQL or Postgres set up in the project, use it. If not, SQLite is enough as long as the database is not going to grow too large.
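Coming back to the original question, here is a minimal sketch of the two options, with made-up model and field names:

from django.db import models

# Option 1: store the 4-character code inline on each row.
class Order(models.Model):
    status_code = models.CharField(max_length=4, db_index=True)

# Option 2: normalize the codes into their own small table and reference it.
class Status(models.Model):
    code = models.CharField(max_length=4, unique=True)

class NormalizedOrder(models.Model):
    status = models.ForeignKey(Status, on_delete=models.PROTECT)

The foreign key costs a join (or an extra query) on reads but keeps the codes in one place; the inline CharField is simpler, and at only four characters the per-row size difference is small either way.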
I am writing a C++ program. I have a class that provides services for the rest of the classes in the program.
I am now writing the classes and the UML.
1) The class I am referring to has a task list that changes over time, and conditions are checked against this list. I am thinking of keeping it in a database table, where every row in the table would represent a task; that way, if the program crashes or stops working, I can restore the last state. The other option is to keep the task list in memory and keep a copy in the database.
The task list has to be searched every second.
Which approach is more recommended?
2) To read from and write to the database I can either call the database directly from the class or build a database communication class. If I write such a communication class I need to expose specific operations and build a mini server for it,
e.g. write a row to the database, read a row from the database, update only the first column, etc.
What is the recommended approach for this?
Thanks.
First, if the database is obvious and easy, and there are no performance problems, just do that. You're talking about running a query once/second, and maybe marking a task done or adding a new one every so often; even sqlite on a slow SMB share should be able to handle that just fine.
If you do need to optimize it, there are two approaches: either keep the database as primary storage and cache it in memory, or use memory as your primary storage and come up with a persistence mechanism that uses the database. But until you need to optimize it, don't.
Next, how should you do it? Your question makes it sound like you're thinking in terms of a whole three-tier system, with a "mini-server" sitting between the database server and your task list. There's really no need for that. What you want is a bespoke ORM, but that makes it sound more complicated than it is. All you're doing is writing a class that wraps a database connection and provides a handful of methods—get_due, mark_done, add, get_next_id—each of which maps SQL parameters to Task members. For example (with no error handling):
void mark_done(const Task& task) {
    // db is the shared database connection held by this wrapper class.
    db.execute("UPDATE Task SET done=true WHERE id=%s", task.id);
}
Three more methods like that, plus a constructor to connect to the database (including creating the Task table if it didn't already exist), and your class is done.
The reason you don't want to write the database stuff directly into Task is that you don't really have anywhere to store shared information like the database connection object; either you need globals (or class attributes, which are effectively globals), or you need copies in every single Task instance (or, really, weak references—which you're going to fake with either a reference or a raw pointer, either way leading to shutdown problems somewhere down the line).
Finally, your whole reason for doing this is error recovery, and databases do a great job of journaling so nothing ever gets inconsistent, but you do have to make sure to structure your app to take advantage of that. For example, you may want to mark all the now-due tasks "in process", then process them, then mark them all "done"; that way, at recovery time, you know exactly which tasks may or may not have been done, and can act appropriately. The more steps you can commit to the database, the less data loss you have to deal with—but of course the more code you have to write, and the slower it gets. So, do as much as necessary, but no more.
Saving information in a database just to recover from crashes may be a bit of an overkill.
Ideally you want to serialize the list and save it, as binary, XML, or CSV values. This can be done on a timer or on certain events in your application.
A database may also be used if you can come up with a structure that maps naturally onto tables, so that there is a one-to-one mapping between your objects and the rows and you can write SQL queries easily. But keep that in a separate layer, for abstraction.
I have a Django application that deals with large text files, up to roughly 50,000,000 characters. For a variety of reasons it's desirable to store them as a model field.
We are using sqlite for dev and postgres for production.
Users do not need to enter the data via any UI.
The field does not need to be visible in the admin or elsewhere to the user.
Several questions:
Is it practicable to store this much text in a textarea field?
What, if any, performance issues will this likely create?
Would using a binary field improve performance?
Any guidance would be greatly appreciated.
Another consideration is that when you are querying that model, make sure you use defer() on your querysets, so you aren't transferring 50 MB of data down the pipe every time you want to retrieve an object from the db.
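A minimal sketch of what that looks like, with a hypothetical model:

from django.db import models

class Document(models.Model):
    title = models.CharField(max_length=200)
    body = models.TextField()  # the ~50,000,000-character payload

# The huge column is only loaded if it is actually accessed on an instance.
docs = Document.objects.defer("body")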
I highly recommend storing those files on disk or S3 or equivalent in a FileField though. You won't really be able to query on the contents of those files efficiently.
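Reworking the same hypothetical model so the text stays out of the database, it might look like this (the upload path is just an example):

from django.db import models

class Document(models.Model):
    title = models.CharField(max_length=200)
    # The text lives on disk or in S3 via a storage backend; the database
    # row only stores the file's path.
    content = models.FileField(upload_to="documents/")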
This is more related to the database you use. You use SQLite so look at the limits of SQLite:
The maximum number of bytes in a string or BLOB in SQLite is defined by the preprocessor macro SQLITE_MAX_LENGTH. The default value of this macro is 1 billion (1 thousand million or 1,000,000,000).
http://www.sqlite.org/limits.html
Besides that, it's probably better to use a TextField in Django.
A binary field wouldn't improve performance. Binary fields are meant for binary data, and you are storing text.
After some experimentation we decided to use a Django FileField and not store the file contents in PostgreSQL. Performance was the primary decision driver. With a FileField we can query very quickly to get the underlying file, which in turn can be accessed directly at the OS level with much higher performance than if the data were stored in a PostgreSQL table.
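For the record, the direct access we rely on looks roughly like this (model and field names are illustrative, and the default filesystem storage is assumed):

doc = Document.objects.get(pk=some_id)  # some_id is a placeholder primary key
with open(doc.content.path) as f:       # read the file straight from disk
    text = f.read()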
Thanks for the input. It was a big help.
Sometimes documents, with their free-form structure, are attractive for storing data (in contrast to a relational database). But one problem is persistence in combination with making small changes to the data, since the entire document has to be rewritten to disk.
So my question is: are "document databases" specifically made to solve this?
UPDATE
I think I understand the concept of "document-oriented databases" better now. They obviously don't store documents of just any kind; each implementation uses its own format, such as JSON. And then the answer to my question also becomes obvious. If the entire JSON structure had to be rewritten to disk after each change to keep it persisted, it wouldn't be a very good database.
If the entire JSON structure had to be rewritten to disk after each change to keep it persisted, it wouldn't be a very good database.
I would say this is not true of any document database I know of. For example, Mongo doesn't store documents as JSON, it stores them as BSON (http://en.wikipedia.org/wiki/BSON).
Also databases like Mongo will store documents in RAM and persist them to disk later.
In fact many document databases will follow that pattern of storing documents in main memory and then writing them to disk.
But the fact that a given document database will write data to disk - and the fact that some documents might get changed a lot - does not mean the database is non-performant. I wouldn't disregard document databases based on speculation.
I have a huge text file (~5 GB) which is the database for my program. While the program runs, this database is read in full many times with string functions like string::find(), string::at(), string::substr()...
The problem is that this text file cannot be loaded into one string, because string::max_size is definitely too small.
How would you implement this? My idea was to load one part into a string, read it, discard it, then load the next part into the same string, and so on.
Is there a better/more efficient way?
How would you implement this?
With a real database, for instance SQLite. The performance improvement from having indexes will more than make up for the time spent learning another API.
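To make the idea concrete, here is a small sketch, in Python for brevity, of what the indexed lookup replaces; the schema is made up, and the SQLite C/C++ API follows the same create-table / create-index / query steps:

import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (key TEXT, payload TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_records_key ON records(key)")
conn.commit()

# An indexed lookup replaces a full string::find() scan over the 5 GB file.
rows = conn.execute("SELECT payload FROM records WHERE key = ?", ("some_key",)).fetchall()
conn.close()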
Since this is a database, I'm assuming it'd have many records. To me that implies the best idea would be to implement a data class for the records and populate a list/vector/etc., depending on how you plan to use it. I'd also look into a persistent cache, since the file is big.
Within your container class for all the records, you could implement search and other functions as you see fit. But as suggested, for data of this size you're probably best off using a real database.
I will be storing terabytes of information, before indexes and after compression.
Should I code up a binary-tree database by hand using sorted files etc., or use something like MongoDB, or even something like MySQL?
I am worried about the (space) cost per record with things like MySQL and the other DBs that are around. I also know that some databases even allow compression, but they convert to read-only tables. These tables/records need to be accessed and overwritten with new data fairly often. I think that if I were to code something in C++ I'd be able to keep the cost of space per record to a minimum.
What should I do?
There are newer non-relational databases becoming popular these days that specialize in managing large-scale data.
Check out Hadoop or Cassandra; both of these are Apache projects.