Is there an advantage for using UUIDField with Django and Postgresql (native datatype) over a self-made generated unique key?
Currently I use a random-generated alphanumeric ID field on my models and I am wondering if the Postgres native datatype and the UUIDField are better for this purpose and whether there's a reason to switch over.
I generate the id using random letters and digits. It's 25 chars long. I put a db_index on it for faster retrieval. I don't shard my DB. The reason being that some models cannot have consecutive ids for business purposes
Switching to UUID will have an advantage particularly if you have a large number of records. Lookups and inserts ought to be a tiny bit faster. You will be saving 9 bytes of storage per row since UUID fields are only 128 bits.
However that doesn't mean your home made primary key is a bad idea. Far from it. It's a good one and a similar approach is used by Instagram who also happen to be using Postgresq and DJango. Their solution though uses only 64 bits and also manages to squeeze in information about object creation time into the primary key.
Their primary purpose is sharding but works very well even for non shared dbs. Just use some random number for the 13 bits that represent their sharding information. They have a sql sample at the link above.
For a model in my database I need to store around 300 values for a specific field. What would be the drawbacks, in terms of performance and simplicity in query, if I use Postgres-specific ArrayField instead of a separate table with One-to-Many relationship?
If you use an array field
The size of each row in your DB is going to be a bit large thus Postgres is going to be using a lot more toast tables (http://www.postgresql.org/docs/9.5/static/storage-toast.html)
Every time you get the row, unless you specifically use defer (https://docs.djangoproject.com/en/1.9/ref/models/querysets/#defer) the field or otherwise exclude it from the query via only, or values or something, you paying the cost of loading all those values every time you iterate across that row. If that's what you need then so be it.
Filtering based on values in that array, while possible isn't going to be as nice and the Django ORM doesn't make it as obvious as it does for M2M tables.
If you use M2M
You can filter more easily on those related values
Those fields are postponed by default, you can use prefetch_related if you need them and then get fancy if you want only a subset of those values loaded
Total storage in the DB is going to be slightly higher with M2M because of keys, and extra id fields
The cost of the joins in this case is completely negligible because of keys.
Personally I'd say go with the M2M tables, but I don't know your specific application. If you're going to be working with a massive amount of data it's likely worth grabbing a representative dataset and testing both methods with it.
It's my first time using a NoSQL database so I'm really confused. I'd really appreciate any help I can get.
I want to store data comprising announcements in my table. Essentially, each announcement has an ID, a date, and a text.
So for example, an announcement might have ID of 1, date of 2014/02/26, and text of "This is a sample announcement". Newer announcements always have a greater ID value than older announcements, since they are added to the table later.
There are two types of queries I want to run on this table:
I want to retrieve the text of the announcements sorted in order of date.
I want to retrieve the text and dates of the x most recent announcements (say, the 3 most recent announcements).
So I've set up the table with the following attributes:
ID (number) as primary key, and
date (string) as range
Is this appropriate for what my use cases? And if so, what kind of query/reads/requests/scans/whatever (I'm really confused about the terminology here too) should I be running to accomplish the two types of queries I want to make?
Any help will be very much appreciated. Thanks!
You are on the right track.
As far as sorting, DynamoDB will sort by the range key, so date will work but I'd recommend storing it as a number, perhaps milliseconds since the Unix epoch, rather than a String. This will make it trivial to get the announcements in ascending or descending order based on their created date.
See this answer for an overview of local vs global secondary indexes and what capabilities they provide: Optional secondary indexes in DynamoDB
As far as retrieving all items, you would need to perform a scan. Scans are not as efficient as queries, but since all of Dynamo is on SSD's they're still relatively quick. You don't get the single digit millisecond performance with a scan that you get with a query, so if there's a way to associate announcements with a user ID, you might get better performance than with a scan.
Note that you cannot modify the table schema (hash key, range key, and indexes) after you create the table. There are ways to manually migrate a table or import/export it, but the point is that you should think hard about current and future query requirements up front and design the table to support them. It's very easy to add or stop storing non-key or non-item attributes though, which provides nice flexibility.
Finally, try to avoid thinking of Dynamo as relational. With Dynamo, in a lot of cases you may well be better off de normalizing or duplicating some of the data in exchange for fast query performance.
I'm reading a book about coding style in Django and one thing they discuss is db_index=True. Ever since I started using Django, I've never used this function because I'm not really sure what it does.
So my question is, when to consider adding indexes?
This is not really django specific; more to do with databases. You add indexes on columns when you want to speed up searches on that column.
Typically, only the primary key is indexed by the database. This means look ups using the primary key are optimized.
If you do a lot of lookups on a secondary column, consider adding an index to that column to speed things up.
Keep in mind, like most problems of scale, these only apply if you have a statistically large number of rows (10,000 is not large).
Additionally, every time you do an insert, indexes need to be updated. So be careful on which column you add indexes.
As always, you can only optimize what you can measure - so use the EXPLAIN statement and your database logs (especially any slow query logs) to find out where indexes can be useful.
The above answer is correct but in some cases where the search is being done on columns that have only varchar datatype like email. There you need to add an index.
Following is the way of doing that:
Index(name='covering_index', fields=['headline'], include=['pub_date'])
reference from https://docs.djangoproject.com/en/3.2/ref/models/indexes/
I have a software project that I am working on at work that has been driving me crazy. Here's our problem: we have a series data contacts that need to be logged every second. It needs to include time, bearing (array of 360-1080 bytes), range, and a few other fields. Our system also needs the capability to store this data for up to 30 days. In practice, there can be up to 100 different contacts, so at a maximum, there can be anywhere from around 150,000,000 points to about 1,000,000,000 different points in 30 days.
I'm trying to think of the best method for storing all of this data and retrieving later on. My first thought was to use some RDBMS like MySQL. Being a embedded C/C++ programmer, I have very little experience working with MySQL with such large data sets. I've dabbled with it on small datasets, but nothing nearly as large. I generated the below schema for two tables that will store some of the data:
CREATE TABLE IF NOT EXISTS `HEADER_TABLE` (
`header_id` tinyint(3) unsigned NOT NULL auto_increment,
`sensor` varchar(10) NOT NULL,
`bytes` smallint(5) unsigned NOT NULL,
PRIMARY KEY (`header_id`),
UNIQUE KEY `header_id_UNIQUE` (`header_id`),
UNIQUE KEY `sensor_UNIQUE` (`sensor`)
) ENGINE=MyISAM AUTO_INCREMENT=0 DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `RAW_DATA_TABLE` (
`internal_id` bigint(20) NOT NULL auto_increment,
`time_sec` bigint(20) unsigned NOT NULL,
`time_nsec` bigint(20) unsigned NOT NULL,
`transverse` bit(1) NOT NULL default b'0',
`data` varbinary(1080) NOT NULL,
PRIMARY KEY (`internal_id`,`time_sec`,`time_nsec`),
UNIQUE KEY `internal_id_UNIQUE` (`internal_id`),
KEY `time` (`time_sec`)
KEY `internal_id` (`internal_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `rel_RASTER_TABLE` (
`internal_id` bigint(20) NOT NULL auto_increment,
`raster_id` int(10) unsigned NOT NULL,
`time_sec` bigint(20) unsigned NOT NULL,
`time_nsec` bigint(20) unsigned NOT NULL,
`header_id` tinyint(3) unsigned NOT NULL,
`data_id` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`internal_id`, `raster_id`,`time_sec`,`time_nsec`),
KEY `raster_id` (`raster_id`),
KEY `time` (`time_sec`),
KEY `data` (`data_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
The header table only contains 10 rows and is static. It just tells what sensor the raw data came from, and the number of bytes output by that type of sensor. The RAW_DATA_TABLE essentially stores the raw bearing data (an array of 360-1080 bytes, it represents up to three samples per degree). The rel_RASTER_TABLE holds meta data for the RAW_DATA_TABLE, there can be multiple contacts that refer to the same raw data row. The data_id found in rel_RASTER_TABLE points to the internal_id of some row in the RAW_DATA_TABLE, I did this to decrease the amount of writes needed.
Obviously, as you can probably tell, I'm having performance issues when reading and deleting from this database. An operator to our software can see real time data as it comes across and also go into reconstruction mode and overlay a data range from the past, the past week for example. Our backend logging server grabs the history rows and sends them to a display via a CORBA interface. While all of this is happening, I have a worker thread that deletes 1000 rows at a time for data greater than 30 days. This is there in case a session runs longer than 30 days, which can happen.
The system we currently have implemented works well for smaller sets of data, but not for large sets. Our select and delete statements can take upwards of 2 minutes to return results. This completely kills the performance of our real time consumer thread. I suspect we're not designing our schemas correctly, picking the wrong keys, not optimizing our SQL queries correctly, or some subset of each. Our writes don't see to be affected unless the other operations take too long to run.
Here is an example SQL Query we use to get history data:
SELECT
rel_RASTER_TABLE.time_sec,
rel_RASTER_TABLE.time_nsec,
RAW_DATA_TABLE.transverse,
HEADER_TABLE.bytes,
RAW_DATA_TABLE.data
FROM
RASTER_DB.HEADER_TABLE,
RASTER_DB.RAW_DATA_TABLE,
RASTER_DB.rel_RASTER_TABLE
WHERE
rel_RASTER_TABLE.raster_id = 2952704 AND
rel_RASTER_TABLE.time_sec >= 1315849228 AND
rel_RASTER_TABLE.time_sec <= 1315935628 AND
rel_RASTER_TABLE.data_id = RAW_DATA_TABLE.internal_id AND
rel_RASTER_TABLE.header_id = HEADER_TABLE.header_id;
I apologize in advance for this being such a long question, but I've tapped out other resources and this is my last resort. I figure I'd try to be as descriptive as possible Do you guys see of any way I can improve upon our design at first glance? Or, anyway we can optimize our select and delete statements for such large data sets? We're currently running RHEL as the OS and unfortunately can't change our hardware configuration on the server (4 GB RAM, Quad Core). We're using C/C++ and the MySQL API. ANY speed improvements would be EXTREMELY beneficial. If you need me to clarify anything, please let me know. Thanks!
EDIT: BTW, if you can't provide specific help, maybe you can link me to some excellent tutorials you've come across for optimizing SQL queries, schema design, or MySQL tuning?
First thing you could try is de-normalizing the data. On a data set of that size, doing a join, even if you have indexes is going to require very intense computation. Turn those three tables into 1 table. Sure there will be duplicate data, but without joins it will be much easier to work with. Second thing, see if you can get a machine with enough memory to fit the whole table in memory. It doesn't cost much ($1000 or less) for a machine with 24GB of RAM. I'm not sure if that will hold your entire data set, but it will help tremendously Get an SSD as well. For anything that isn't stored in memory, an SSD should help you access it with high speed. And thirdly, look into other data storage technologies such as BigTable that are designed to deal with very large data sets.
I would say partitioning is an absolute must in a case like this:
large amount of data
new data coming in continuously
implicit: old data getting deleted continuously.
Check out this for mySQL.
Looking at your select stmt (which filters on time), I'll say partition on the time column.
Of course you might wanna add a few indexes based on the frequent queries you want to use.
--edit--
I see that many have suggested indexes. My experiences have been that having an index on a table with really large num of rows either kills the performance (eventually) or requires lot of resources (CPU, memory,...) to keep the indexes up to date.
So although I also suggest addition of indexes, please note that it's absolutely useless unless you partition the table first.
Finally, follow symcbean's advise (optimize your indexes in number and keys) when you add indexes.
--edit end--
A quickie on partitioning if you're new to it.
Usually a single table translates to a single data file. A partitioned table translates to one file per partition.
Advantages
insertions are faster as physically it's inserted into a smaller file (partition).
deletion of large number of rows would usually translate to dropping a partition (much much much much cheaper than 'delete from xxx where time > 100 and time < 200');
queries with a where clause on the key by which the table is partitioned is much much faster.
Index building is faster.
I don't have much experience with MySQL, but here are some a priori thoughts that jump to mind.
Is your select in a stored procedure?
The select's predicate is usually searched in the order its asked in. If the data on the disk is reordered to match the primary key, then doing raster id first is fine. You would be paying the cost of reordering on every insert though. If the data is stored in time order on disk, you would probably want to search on time_sec before raster_id.
WHERE
rel_RASTER_TABLE.raster_id = 2952704 AND
rel_RASTER_TABLE.time_sec >= 1315849228 AND
rel_RASTER_TABLE.time_sec <= 1315935628 AND
rel_RASTER_TABLE.data_id = RAW_DATA_TABLE.internal_id AND
rel_RASTER_TABLE.header_id = HEADER_TABLE.header_id;
Your indexes don't follow the search predicates.
It will create indexes based on the keys, generally.
PRIMARY KEY (`internal_id`, `raster_id`,`time_sec`,`time_nsec`),
KEY `raster_id` (`raster_id`),
KEY `time` (`time_sec`),
KEY `data` (`data_id`)
It may not be using the primary index because you aren't using internal_id. You may want to set internal_id as the primary key and create a separate index based on your search parameters. At least on raster_id and time_sec.
Are the joins too loose?
This may be my inexperience with MySQL, but I expect to see conditions on the joins. Does using FROM here do a natural join? I don't see any foreign keys specified, so I don't know how it would join these tables rationally.
FROM
RASTER_DB.HEADER_TABLE,
RASTER_DB.RAW_DATA_TABLE,
RASTER_DB.rel_RASTER_TABLE
Usually when developing something like this I would work with a smaller set and remove predicates to makes sure that each step meets what I expect. If you accidentally cast a wide net up front, then narrow down later you may mask some inefficiencies.
Most query optimizers have a way to output how the optimized, make sure it meets your expectations. One of the comments mention Explain plans, I assume that is what it is called.
Without knowing what all the queries are its difficult to give specific advice, however looking at the single query you have provided, there are no indexes which are idealy suited to resolving this.
In fact the structure is a bit messy - if internal_id is an auto-increment value then it is unique - why add other stuff in the primary key? It looks as if a more sensible structure for rel_RASTER_TABLE would be:
PRIMARY KEY (`internal_id`),
KEY (`raster_id`,`time_sec`,`time_nsec`),
And as for RAW_DATA_TABLE, it should be blindingly obvious that its indexes are far from optimal. And should probably be:
PRIMARY KEY (`internal_id`,`time_sec`,`time_nsec`),
KEY `time` (`time_sec`, `time_nsec`)
Note that removing redundant indexes will speed up inserts/updates.
Capturing slow queries should help - and learn how to use 'explain' to see what indexes are redundant / needed.
You may also get a performance boost by tuning the mysql instance - particularly increasing the sort and join buffers - try running mysqltuner
First, I would try to create a view with only the necessary info that needs to be selected between the different tables.
By the way, MySQL is not necessarily the most optimized database system for what you are trying to accomplish... Look into other solutions such Oracle, Microsoft SQL, PostgreSQL etc. Also, the performance will vary depending on the server being used.