Are small changes in large documents something document databases are good for? - document-database

Sometimes documents, with their free-form structure, are attractive for storing data (in contrast to a relational database). But one problem is persistence in combination with making small changes to the data, since the entire document has to be rewritten to disk.
So my question is, are "document databases" especially made to solve this?
UPDATE
I think I understand the concept of "document oriented databases" better now. It's obviously not documents of any kind; each implementation uses its own format, such as for instance JSON. And then the answer to my question also becomes obvious: if the entire JSON-structure had to be rewritten to disk after each change to keep it persisted, it wouldn't be a very good database.

If the entire JSON-structure had to be rewritten to disk after each change to keep it persisted, it wouldn't be a very good database.
I would say this is not true of any document database I know of. For example, Mongo doesn't store documents as JSON; it stores them as BSON (http://en.wikipedia.org/wiki/BSON).
Also databases like Mongo will store documents in RAM and persist them to disk later.
In fact many document databases will follow that pattern of storing documents in main memory and then writing them to disk.
But the fact that a given document database will write data to disk - and the fact that some documents might get changed a lot - does not mean the database is non-performant. I wouldn't disregard document databases based on speculation.
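To illustrate (a minimal sketch using pymongo against a hypothetical local server; the database, collection, and field names are made up), a small change is expressed as a targeted update, so the client never re-serializes or re-sends the rest of the document:

    # Minimal pymongo sketch; assumes a local mongod and made-up names.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    orders = client["shop"]["orders"]

    doc_id = orders.insert_one(
        {"customer": "alice", "status": "new", "items": []}
    ).inserted_id

    # Only the changed field travels over the wire; how the server lays the
    # document out on disk is up to its storage engine.
    orders.update_one({"_id": doc_id}, {"$set": {"status": "shipped"}})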

Related

What's the efficiency of the QSQLITE database in Qt?

I am very new to databases and I am trying to implement an offline map viewer. What would be the efficiency of QSqlDatabase?
To make it extreme, for example: is it possible to download all satellite images at all detail levels of the US from Google's map servers, store them in a local SQLite database, and still perform real-time queries based on my current GPS location?
The Qt database driver for SQLite uses SQLite internally (surprise!). So the question is more like: is SQLite the right database to use? My answer: I would not use it to store geographical data; consider looking for a database that is optimized for this task.
If this is not an option: SQLite is really efficient. First check whether your data is within its limits. Do not forget to create indexes and analyze the database; then it should be able to handle your task. Here I assume you just want to get an image by its geographical position (other solutions can be a lot faster because your data is sortable; if I remember correctly, SQLite is not optimized for that).
As you will store large blobs, you may want to have a look at the Internal Versus External BLOBs in SQLite document. Maybe this gives you the answer already.
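To make the index advice concrete, here is a minimal sketch of what such a tile store could look like (Python's built-in sqlite3 module is used purely for illustration, since the Qt driver wraps the same engine; the table layout and the tile coordinates are made up):

    # Illustrative tile store: one row per tile, keyed by zoom level and
    # tile coordinates. The composite primary key doubles as the lookup index.
    import sqlite3

    con = sqlite3.connect("tiles.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS tiles (
            zoom  INTEGER NOT NULL,
            x     INTEGER NOT NULL,
            y     INTEGER NOT NULL,
            image BLOB NOT NULL,
            PRIMARY KEY (zoom, x, y)
        )
    """)
    con.execute("ANALYZE")  # let the query planner learn the data distribution

    # Translating a GPS position into (zoom, x, y) tile coordinates is
    # application logic; the numbers here are simply made up.
    row = con.execute(
        "SELECT image FROM tiles WHERE zoom = ? AND x = ? AND y = ?",
        (12, 654, 1583),
    ).fetchone()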

Django textarea for 50,000,000 character data

I have a Django application that deals with large text files of up to roughly 50,000,000 characters. For a variety of reasons it's desirable to store them as a model field.
We are using SQLite for dev and Postgres for production.
Users do not need to enter the data via any UI.
The field does not need to be visible in the admin or elsewhere to the user.
Several questions:
Is it practicable to store this much text in a textarea field?
What, if any, performance issues will this likely create?
Would using a binary field improve performance?
Any guidance would be greatly appreciated.
Another consideration: when you are querying that model, make sure you use defer() on your querysets, so you aren't transferring 50 MB of data down the pipe every time you want to retrieve an object from the db.
I highly recommend storing those files on disk, S3, or equivalent, referenced by a FileField, though. You won't really be able to query on the contents of those files efficiently.
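For illustration, a minimal sketch of both options (the model and field names are made up):

    # Hypothetical model: either defer the huge text column, or keep the
    # content out of the database entirely via a FileField.
    from django.db import models

    class Document(models.Model):
        title = models.CharField(max_length=200)
        body = models.TextField()                         # ~50M characters in the DB
        attachment = models.FileField(upload_to="docs/")  # alternative: content on disk/S3

    # Avoid pulling the big column on every query:
    docs = Document.objects.defer("body").filter(title__startswith="Report")

    # Load the body only for the one object that actually needs it:
    doc = Document.objects.only("body").get(pk=1)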
This is more related to the database you use. You use SQLite, so look at the limits of SQLite:
The maximum number of bytes in a string or BLOB in SQLite is defined by the preprocessor macro SQLITE_MAX_LENGTH. The default value of this macro is 1 billion (1 thousand million or 1,000,000,000).
http://www.sqlite.org/limits.html
Besides that, it's probably better to use a TextField in Django.
A binary field wouldn't improve performance. Binary fields are meant for binary data, and you are storing text.
After some experimentation we decided to use a Django FileField and not store the file contents in PostgreSQL. Performance was the primary decision driver. With a FileField we are able to query very quickly to get the underlying file, which in turn can be accessed directly at the OS level with much higher performance than if the data were stored in a PostgreSQL table.
Thanks for the input. It was a big help.

Replace strings in large file

I have a server-client application where clients are able to edit data in a file stored on the server side. The problem is that the file is too large to load into memory (8 GB+). There could be around 50 string replacements per second invoked by the connected clients. So copying the whole file and replacing the specified string with the new one is out of the question.
I was thinking about saving all changes in a cache on the server side and performing all the replacements once a certain amount of data has accumulated. At that point I would perform the update by copying the file in small chunks and replacing the specified parts.
This is the only idea I came up with but I was wondering if there might be another way or what problems I could encounter with this method.
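A rough sketch of the chunked rewrite I have in mind (Python, purely illustrative; it assumes plain byte-string substitution, a search string far shorter than the chunk size, and that replacements don't themselves create new matches):

    import os

    CHUNK = 16 * 1024 * 1024  # 16 MB per read, never the whole 8 GB file

    def stream_replace(path, old, new):
        keep = len(old) - 1  # a match can straddle a chunk boundary by this much
        tmp = path + ".tmp"
        with open(path, "rb") as src, open(tmp, "wb") as dst:
            buf = b""
            while True:
                data = src.read(CHUNK)
                if not data:
                    break
                buf = (buf + data).replace(old, new)
                if keep and len(buf) > keep:
                    dst.write(buf[:-keep])  # safe: can no longer be part of a match
                    buf = buf[-keep:]
                elif not keep:
                    dst.write(buf)
                    buf = b""
            dst.write(buf)  # leftover tail, too short to contain a match
        os.replace(tmp, path)  # swap the rewritten file into place

In practice each pass would apply the whole batch of cached replacements, not just one, but the chunking and boundary handling stay the same.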
When you have more than 8 GB of data that is edited by many users simultaneously, you are far beyond what can be handled with a flat file.
You seriously need to move this data to a database. Regarding your comment that "the file content is no fit for a database": sorry, but I don't believe you. Especially regarding your remark that "many people can edit it" - that's one more reason to use a database. On a filesystem, only one user at a time can have write access to a file. But a database allows concurrent write access for multiple users.
We could help you come up with a database schema if you open a new question telling us exactly how your data is structured and what your use cases are.
You could use some form of indexing on your data (in a separate file) to allow quick access to the relevant parts of this gigantic file; we've been doing this successfully with large files (~200-400 GB). But as Phillipp mentioned, you should move that data to a database, especially for the read/write access. Some frameworks (like OSG) already come with a database back-end for 3D terrain data, so you can have a peek at how they do it.
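To make the separate-index idea concrete, here is a minimal sketch (Python, illustrative only; it assumes one record per line, and the paths and helper names are made up):

    # Build a side file of fixed-width byte offsets, then use it to jump
    # straight to a record without scanning the huge data file.
    import struct

    def build_offset_index(data_path, index_path):
        with open(data_path, "rb") as data, open(index_path, "wb") as idx:
            offset = 0
            for line in data:
                idx.write(struct.pack("<Q", offset))  # 8-byte little-endian offset
                offset += len(line)

    def read_record(data_path, index_path, record_no):
        with open(index_path, "rb") as idx:
            idx.seek(record_no * 8)
            (offset,) = struct.unpack("<Q", idx.read(8))
        with open(data_path, "rb") as data:
            data.seek(offset)
            return data.readline()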

Raw Binary Tree Database or MongoDb/MySQL/Etc?

I will be storing terabytes of information, before indexes, and after compression methods.
Should I code up a Binary Tree Database by hand using sort files etc, or use something like MongoDB or even something like MySQL?
I am worried about the (space) cost per record with things like MySQL and other DBs that are around. I also know that some databases allow for compression, but that converts the tables to read-only. These tables/records need to be accessed and overwritten with new data fairly often. I think if I were to code something in C++ I'd be able to keep the cost of space per record to a minimum.
What should I do?
There are newer non-relational databases becoming popular these days that specialize in managing large-scale data.
Check out Hadoop or Cassandra, both of these are at the Apache Project.

Best way to store data in C++

I'm just learning C++, have just started to mess around with Qt, and I am sitting here wondering how most applications save data. Is there an industry standard? Do they store it in an XML file, a text file, SQLite? What about sensitive data that, say, accounting software would need to save? I'm just interested in learning what the best practices for this are.
Thanks
This question is way too broad. The only answer is it depends on the nature of the particular application and the data, and whether or not it is written in C++ has very little to do with it.
For example, user-configurable application settings are often stored in text files, but on Windows they are typically stored in the Registry. Accounting applications typically keep their data in a database of some sort.
There are many good ways to store application data (call it serialization).
Personally, I think that for larger datasets, using an open format is much, much easier for debugging. If you go with XML, for example, you store your data in an open form, so if you have file corruption issues (i.e. a client can't open your file for some reason), the problem is easier to find. If you have sensitive data in there, you can always encrypt it before writing it to file using key encryption. Microsoft, for instance, has gone from a proprietary format to Open XML in their Office docs. They use the .*x extensions (.docx, .xlsx, etc.); each is really just a compressed folder of XML files.
Using binary serialization is, of course, the industry standard at the moment for most standalone applications. Most likely that is because of the application framework they are using (such as MFC, which is old). If you take a look at most of the serialization techniques in modern application frameworks, XML serialization is very well supported.
First you need to clarify what kind of data you would like to save.
If you just want to save some application settings, use QSettings to save your settings to an INI file or registry.
If it is much more than just some application settings, go for XML files or SQL.
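QSettings itself is a C++/Qt API; as a language-neutral illustration of the same INI-file idea, here is a sketch using Python's standard configparser (the file name and keys are made up):

    # Write a handful of settings to an INI file and read one back.
    import configparser

    settings = configparser.ConfigParser()
    settings["window"] = {"width": "1024", "height": "768"}
    settings["paths"] = {"last_open_dir": "/home/user/docs"}

    with open("myapp.ini", "w") as f:
        settings.write(f)

    # Later, at startup:
    settings = configparser.ConfigParser()
    settings.read("myapp.ini")
    width = settings.getint("window", "width", fallback=800)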
There is no standard practice; however, if you want to store complex structured data, consider using an embedded database engine such as SQLite, Metakit, or Berkeley DB files. XML files would also do the job and be human readable/writable. Preferences can use INI files or the Windows registry, and so on. In short, it really depends on your usage pattern.
This is a general question. Like many things, the right answer depends on your application and its needs.
Most desktop applications save end-user data to a file (think Word and Excel). The format is up to you, XML, binary, etc. And if you can serialize/deserialize objects to file it will probably make your life easier.
Internal application data such as configuration files or temporary data might be saved to an XML file or a lightweight, local database such as SQLite.
Often, "enterprise" applications used internally by a business will save their data to a back-end database such as SQL Server or Oracle. This is so all of the enterprise's data is saved to a single central location. And then it is available for reporting, etc.
For accounting software, you would need to consider the business domain and end users. For example, if the software is to be sold to large businesses you would probably use some form of a database to save data. Otherwise a binary file would be fine, perhaps with some form of encryption if you are really paranoid.
When you say "the best way", then you have to define what you mean by "good".
The problem is that various requirements conflict with each other, so you can't satisfy all of them simultaneously.
For example, if one requirement is "concurrent multi-user access to the data" then this suggests using a database engine, but that conflicts with "as small as possible" and "minimize dependencies on 3rd-party software".
If a requirement is "portable data format" then this suggests XML, but that conflicts with "compact" and "indexed".
Do they store it in an XML file, text file, SQLite?
Yes.
Also, binary files and relational databases.
Anything else?