django UUIDField with Postgresql - django

Is there an advantage for using UUIDField with Django and Postgresql (native datatype) over a self-made generated unique key?
Currently I use a random-generated alphanumeric ID field on my models and I am wondering if the Postgres native datatype and the UUIDField are better for this purpose and whether there's a reason to switch over.
I generate the id using random letters and digits. It's 25 chars long. I put a db_index on it for faster retrieval. I don't shard my DB. The reason being that some models cannot have consecutive ids for business purposes

Switching to UUID will have an advantage particularly if you have a large number of records. Lookups and inserts ought to be a tiny bit faster. You will be saving 9 bytes of storage per row since UUID fields are only 128 bits.
However that doesn't mean your home made primary key is a bad idea. Far from it. It's a good one and a similar approach is used by Instagram who also happen to be using Postgresq and DJango. Their solution though uses only 64 bits and also manages to squeeze in information about object creation time into the primary key.
Their primary purpose is sharding but works very well even for non shared dbs. Just use some random number for the 13 bits that represent their sharding information. They have a sql sample at the link above.

Related

Benefits of using SmallAutoField as primary key in Django?

My table will have around 2000 rows.
Are there any performance benefits to using SmallAutoField as the primary key rather than the standard AutoField?
Also, if in the future I run out of the 32,000 incrementing integers in my SmallAutoField, how difficult would it be to change it to a regular AutoField?
PS. I am using SQLite but may switch to Postgres.

Using SQL SEQUENCEs with Django

I need to generate numbers that are unique and in sequence (for easy traceability). Also, they should be as gapless as possible (for decent non-repudiation).
SQL SEQUENCEs are the obvious answer. What's the cleanest way to use them with Django 1.8 and MySQL? They doesn't seem to be a good way to use SEQUENCEs with Django.
Since MySQL doesn't have separate sequences the only clear way is to create a new table that has autoincrementing IDs.
Thinking about the design in a super relational way, I am also able to store other related data in the new table.
The other data is otherwise being redundantly stored in other tables (redundantly since some of the other tables are for logging). However, storing the data in a more organised way will result in better traceability and smaller database size.

MySQL Performance issues with large amounts of data

I have a software project that I am working on at work that has been driving me crazy. Here's our problem: we have a series data contacts that need to be logged every second. It needs to include time, bearing (array of 360-1080 bytes), range, and a few other fields. Our system also needs the capability to store this data for up to 30 days. In practice, there can be up to 100 different contacts, so at a maximum, there can be anywhere from around 150,000,000 points to about 1,000,000,000 different points in 30 days.
I'm trying to think of the best method for storing all of this data and retrieving later on. My first thought was to use some RDBMS like MySQL. Being a embedded C/C++ programmer, I have very little experience working with MySQL with such large data sets. I've dabbled with it on small datasets, but nothing nearly as large. I generated the below schema for two tables that will store some of the data:
CREATE TABLE IF NOT EXISTS `HEADER_TABLE` (
`header_id` tinyint(3) unsigned NOT NULL auto_increment,
`sensor` varchar(10) NOT NULL,
`bytes` smallint(5) unsigned NOT NULL,
PRIMARY KEY (`header_id`),
UNIQUE KEY `header_id_UNIQUE` (`header_id`),
UNIQUE KEY `sensor_UNIQUE` (`sensor`)
) ENGINE=MyISAM AUTO_INCREMENT=0 DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `RAW_DATA_TABLE` (
`internal_id` bigint(20) NOT NULL auto_increment,
`time_sec` bigint(20) unsigned NOT NULL,
`time_nsec` bigint(20) unsigned NOT NULL,
`transverse` bit(1) NOT NULL default b'0',
`data` varbinary(1080) NOT NULL,
PRIMARY KEY (`internal_id`,`time_sec`,`time_nsec`),
UNIQUE KEY `internal_id_UNIQUE` (`internal_id`),
KEY `time` (`time_sec`)
KEY `internal_id` (`internal_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `rel_RASTER_TABLE` (
`internal_id` bigint(20) NOT NULL auto_increment,
`raster_id` int(10) unsigned NOT NULL,
`time_sec` bigint(20) unsigned NOT NULL,
`time_nsec` bigint(20) unsigned NOT NULL,
`header_id` tinyint(3) unsigned NOT NULL,
`data_id` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`internal_id`, `raster_id`,`time_sec`,`time_nsec`),
KEY `raster_id` (`raster_id`),
KEY `time` (`time_sec`),
KEY `data` (`data_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
The header table only contains 10 rows and is static. It just tells what sensor the raw data came from, and the number of bytes output by that type of sensor. The RAW_DATA_TABLE essentially stores the raw bearing data (an array of 360-1080 bytes, it represents up to three samples per degree). The rel_RASTER_TABLE holds meta data for the RAW_DATA_TABLE, there can be multiple contacts that refer to the same raw data row. The data_id found in rel_RASTER_TABLE points to the internal_id of some row in the RAW_DATA_TABLE, I did this to decrease the amount of writes needed.
Obviously, as you can probably tell, I'm having performance issues when reading and deleting from this database. An operator to our software can see real time data as it comes across and also go into reconstruction mode and overlay a data range from the past, the past week for example. Our backend logging server grabs the history rows and sends them to a display via a CORBA interface. While all of this is happening, I have a worker thread that deletes 1000 rows at a time for data greater than 30 days. This is there in case a session runs longer than 30 days, which can happen.
The system we currently have implemented works well for smaller sets of data, but not for large sets. Our select and delete statements can take upwards of 2 minutes to return results. This completely kills the performance of our real time consumer thread. I suspect we're not designing our schemas correctly, picking the wrong keys, not optimizing our SQL queries correctly, or some subset of each. Our writes don't see to be affected unless the other operations take too long to run.
Here is an example SQL Query we use to get history data:
SELECT
rel_RASTER_TABLE.time_sec,
rel_RASTER_TABLE.time_nsec,
RAW_DATA_TABLE.transverse,
HEADER_TABLE.bytes,
RAW_DATA_TABLE.data
FROM
RASTER_DB.HEADER_TABLE,
RASTER_DB.RAW_DATA_TABLE,
RASTER_DB.rel_RASTER_TABLE
WHERE
rel_RASTER_TABLE.raster_id = 2952704 AND
rel_RASTER_TABLE.time_sec >= 1315849228 AND
rel_RASTER_TABLE.time_sec <= 1315935628 AND
rel_RASTER_TABLE.data_id = RAW_DATA_TABLE.internal_id AND
rel_RASTER_TABLE.header_id = HEADER_TABLE.header_id;
I apologize in advance for this being such a long question, but I've tapped out other resources and this is my last resort. I figure I'd try to be as descriptive as possible Do you guys see of any way I can improve upon our design at first glance? Or, anyway we can optimize our select and delete statements for such large data sets? We're currently running RHEL as the OS and unfortunately can't change our hardware configuration on the server (4 GB RAM, Quad Core). We're using C/C++ and the MySQL API. ANY speed improvements would be EXTREMELY beneficial. If you need me to clarify anything, please let me know. Thanks!
EDIT: BTW, if you can't provide specific help, maybe you can link me to some excellent tutorials you've come across for optimizing SQL queries, schema design, or MySQL tuning?
First thing you could try is de-normalizing the data. On a data set of that size, doing a join, even if you have indexes is going to require very intense computation. Turn those three tables into 1 table. Sure there will be duplicate data, but without joins it will be much easier to work with. Second thing, see if you can get a machine with enough memory to fit the whole table in memory. It doesn't cost much ($1000 or less) for a machine with 24GB of RAM. I'm not sure if that will hold your entire data set, but it will help tremendously Get an SSD as well. For anything that isn't stored in memory, an SSD should help you access it with high speed. And thirdly, look into other data storage technologies such as BigTable that are designed to deal with very large data sets.
I would say partitioning is an absolute must in a case like this:
large amount of data
new data coming in continuously
implicit: old data getting deleted continuously.
Check out this for mySQL.
Looking at your select stmt (which filters on time), I'll say partition on the time column.
Of course you might wanna add a few indexes based on the frequent queries you want to use.
--edit--
I see that many have suggested indexes. My experiences have been that having an index on a table with really large num of rows either kills the performance (eventually) or requires lot of resources (CPU, memory,...) to keep the indexes up to date.
So although I also suggest addition of indexes, please note that it's absolutely useless unless you partition the table first.
Finally, follow symcbean's advise (optimize your indexes in number and keys) when you add indexes.
--edit end--
A quickie on partitioning if you're new to it.
Usually a single table translates to a single data file. A partitioned table translates to one file per partition.
Advantages
insertions are faster as physically it's inserted into a smaller file (partition).
deletion of large number of rows would usually translate to dropping a partition (much much much much cheaper than 'delete from xxx where time > 100 and time < 200');
queries with a where clause on the key by which the table is partitioned is much much faster.
Index building is faster.
I don't have much experience with MySQL, but here are some a priori thoughts that jump to mind.
Is your select in a stored procedure?
The select's predicate is usually searched in the order its asked in. If the data on the disk is reordered to match the primary key, then doing raster id first is fine. You would be paying the cost of reordering on every insert though. If the data is stored in time order on disk, you would probably want to search on time_sec before raster_id.
WHERE
rel_RASTER_TABLE.raster_id = 2952704 AND
rel_RASTER_TABLE.time_sec >= 1315849228 AND
rel_RASTER_TABLE.time_sec <= 1315935628 AND
rel_RASTER_TABLE.data_id = RAW_DATA_TABLE.internal_id AND
rel_RASTER_TABLE.header_id = HEADER_TABLE.header_id;
Your indexes don't follow the search predicates.
It will create indexes based on the keys, generally.
PRIMARY KEY (`internal_id`, `raster_id`,`time_sec`,`time_nsec`),
KEY `raster_id` (`raster_id`),
KEY `time` (`time_sec`),
KEY `data` (`data_id`)
It may not be using the primary index because you aren't using internal_id. You may want to set internal_id as the primary key and create a separate index based on your search parameters. At least on raster_id and time_sec.
Are the joins too loose?
This may be my inexperience with MySQL, but I expect to see conditions on the joins. Does using FROM here do a natural join? I don't see any foreign keys specified, so I don't know how it would join these tables rationally.
FROM
RASTER_DB.HEADER_TABLE,
RASTER_DB.RAW_DATA_TABLE,
RASTER_DB.rel_RASTER_TABLE
Usually when developing something like this I would work with a smaller set and remove predicates to makes sure that each step meets what I expect. If you accidentally cast a wide net up front, then narrow down later you may mask some inefficiencies.
Most query optimizers have a way to output how the optimized, make sure it meets your expectations. One of the comments mention Explain plans, I assume that is what it is called.
Without knowing what all the queries are its difficult to give specific advice, however looking at the single query you have provided, there are no indexes which are idealy suited to resolving this.
In fact the structure is a bit messy - if internal_id is an auto-increment value then it is unique - why add other stuff in the primary key? It looks as if a more sensible structure for rel_RASTER_TABLE would be:
PRIMARY KEY (`internal_id`),
KEY (`raster_id`,`time_sec`,`time_nsec`),
And as for RAW_DATA_TABLE, it should be blindingly obvious that its indexes are far from optimal. And should probably be:
PRIMARY KEY (`internal_id`,`time_sec`,`time_nsec`),
KEY `time` (`time_sec`, `time_nsec`)
Note that removing redundant indexes will speed up inserts/updates.
Capturing slow queries should help - and learn how to use 'explain' to see what indexes are redundant / needed.
You may also get a performance boost by tuning the mysql instance - particularly increasing the sort and join buffers - try running mysqltuner
First, I would try to create a view with only the necessary info that needs to be selected between the different tables.
By the way, MySQL is not necessarily the most optimized database system for what you are trying to accomplish... Look into other solutions such Oracle, Microsoft SQL, PostgreSQL etc. Also, the performance will vary depending on the server being used.

Google Plus user id representation technique

What kind of technique does use Google Plus to generate users' unique ids?
Example
https://plus.google.com/102766325060234825733/posts
You can only assume that they are randomly generated ID's that are large enough to be generated non-sequentially with sufficient entropy.
The ID's are too big to be stored in a bigint field which is interesting, again probably due to the required entropy and non-sequential requirement (so that nothing can be inferred by comparing userid's).
A simple encryption of a serially generated number, with a secret key can be used to generate the IDs. It can be a 1 way hash, or a decryptable encryption.
The reason for not using serial numbers directly is obvious: You can easily guess userids of other users on the network, which can result in Bots scraping the content of the network.

Django: how to rearrange object ids

I've started a model with the default, automatically handled IDs Django provides.
Now I've started interfacing with an external system which has its own IDs for the same objects, and it would be very convenient to align my own IDs with the external system's.
However, the numerical ranges overlap, so a naive solution wouldn't work.
Is there some elegant way to alter the IDs in a safe manner? (the object have multiple foreign keys/m2ms etc)
Thanks
I don't know if this is the best way but I would set all foreign/m2m keys to cascade updates and add a defined number to all IDs in the django db to remove the overlapping of the ranges (10k or 100K or more depeding on your data).
with that done, you can copy IDs from the other system. For safety, I would duplicate the ID columns while doing that in order to not lose the original ones... until sure all works ok