How to deal with a virtual index in a database table in Django + PostgreSQL

Here is my current scenario:
Need to add a new field to an existing table that will be used for ordering a QuerySet.
This field will be an integer starting at 1 and never very high; I expect fewer than 1000. The whole reasoning behind this field is to use it for visual ordering on the front-end: index 1 would be the first element returned, index 2 the second, and so on.
This is how the field is defined in model:
priority = models.PositiveSmallIntegerField(verbose_name=_(u'Priority'),
                                            default=0,
                                            null=True)
I will need to re-arrange (reorder) the whole set of elements in this table whenever a new or existing element gets this field updated. So for instance, imagine I have 3 objects in this table:
Element A - priority 1
Element B - priority 2
Element C - priority 3
If I change Element C's priority to 1, I should have:
Element C - priority 1
Element A - priority 2
Element B - priority 3
Since this is not a real db index (and can have empty values), I'm going to have to query for all elements in the database each time an element is created or updated, and change the priority value of each record in the table. I'm not really worried about performance since the table will always be small, BUT I'm worried this way of proceeding is not the way to go, or that it simply generates too much overhead.
Maybe there is a simpler way to do this with plain SQL? If I use a unique index, though, I will get an error every time an existing priority is reused, which I don't want either.
Any pointers?

To insert at the 10th position, all you need is a single SQL query:
MyModel.objects.filter(priority__gte=10).update(priority=models.F('priority')+1)
Then you would need similar ones for deleting an element and for swapping two elements (or whatever your use case requires). It should all be doable in the same manner with bulk update queries; there is no need to update entry by entry manually.
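For the question's actual use case (moving one element to a new priority and shifting the rows in between), a rough sketch along the same lines; MyModel and the myapp import are placeholder names for the model that holds the priority field above:
from django.db import models, transaction
from myapp.models import MyModel  # hypothetical app/model names

def move_to_priority(element, new_priority):
    # Shift only the rows between the old and new positions, then save the element.
    with transaction.atomic():
        old_priority = element.priority
        if new_priority < old_priority:
            (MyModel.objects
                .filter(priority__gte=new_priority, priority__lt=old_priority)
                .update(priority=models.F('priority') + 1))
        elif new_priority > old_priority:
            (MyModel.objects
                .filter(priority__gt=old_priority, priority__lte=new_priority)
                .update(priority=models.F('priority') - 1))
        element.priority = new_priority
        element.save(update_fields=['priority'])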

First, you can very well index this column, just don't force it to contain unique values. Such standard indexes can contain NULLs and duplicates; they are just used to locate the row(s) matching a criterion.
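In Django terms that is simply db_index=True on the field; a minimal sketch reusing the field from the question (the model name is illustrative):
from django.db import models
from django.utils.translation import gettext_lazy as _

class Element(models.Model):
    # A plain, non-unique index: duplicates and NULLs are allowed.
    priority = models.PositiveSmallIntegerField(verbose_name=_('Priority'),
                                                default=0,
                                                null=True,
                                                db_index=True)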
Second, whether to update each populated* row every time you insert/update a record should be assessed based on the expected update frequency. If every user inserts several records each time they use the system and you have thousands of concurrent users, it might not be a good idea; whereas if you have a single user updating any number of rows once in a while, it is not much of an issue. In the same vein, you need to consider whether other updates are hitting the same rows or not: you don't want to lock all rows too often if they are also being updated (and therefore locked) for other fields.
*: to be accurate, you wouldn't update all populated rows, only the ones whose priority is greater than or equal to the inserted one (inserting at priority 999 would only shift the items currently at 999 and 1000).

Related

Short incremental unique id for neo4j

I use Django with neo4j as the database. I need to use short URLs based on node ids in my REST API. In neo4j there is an internal id used by the database that is not recommended for use in the application, and there is an approach using uuid, which is too long for my short URLs. So I added my own uid generator:
import time
import uuid
from hashlib import sha256
from neomodel import db  # assuming neomodel's cypher helper is what `db` refers to

def uid_generator():
    # The length of the generated id grows with the current node count.
    last_id = db.cypher_query("MATCH (n) RETURN count(*) AS lastId")[0][0][0]
    if last_id is None:
        last_id = 0
    last_id = str(last_id)
    # Time-based hash prefix plus a slice of a random UUID.
    hash = sha256()
    hash.update(str(time.time()).encode())
    return (hash.hexdigest()[:max(2, len(last_id))]
            + str(uuid.uuid4()).replace('-', '')[:max(2, len(last_id))])
I have two questions. First, I read this question on Stack Overflow and I'm still not sure that MATCH (n) RETURN count(*) AS lastId is O(1); there was no reference given for that! Is there any reference for that answer? Second, is there a better approach, in terms of both id uniqueness and speed?
First, you should put a unique constraint on the id property to make sure there are no collisions created by parallel create statements. This requires using a label, but you NEED this fail-safe if you plan to do anything serious with this data. This way you can also have rolling ids for different labels. (All indexed labels have a count store, and a UNIQUE CONSTRAINT also creates an index.)
Second, you should do the generation and creation in the same Cypher statement, like this:
MATCH (n:Node) WITH count(*) AS lastId
CREATE (:Node{id:lastId})
This will minimize time between generation and commit, reducing chances of collision. (Remember to retry on failed attempts from unique violations)
I'm not sure what you are doing with the hash, only that you are doing it wrong. Either you generate a new time-based UUID (it requires no parameters) and use it as-is, or you use the incrementing id. (By altering a UUID you invalidate the logic that guarantees its uniqueness, which significantly increases the chance of a collision.)
You can also store the current index count in a node, as explained here. It's not guaranteed to be thread-safe, but that shouldn't be a problem as long as you have unique constraints in place and retry on constraint violations. This approach is more tolerant of deleted nodes.
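A rough sketch of that retry loop, using the same db.cypher_query helper as the question (the Node label, the id property, and the retry count are placeholders; the exact exception raised on a constraint violation depends on the driver version, so the sketch catches broadly):
from neomodel import db  # assuming neomodel, as in the question's uid_generator

CREATE_WITH_ROLLING_ID = (
    "MATCH (n:Node) WITH count(*) AS lastId "
    "CREATE (m:Node {id: lastId}) "
    "RETURN m.id"
)

def create_node_with_rolling_id(max_retries=5):
    for attempt in range(max_retries):
        try:
            results, _meta = db.cypher_query(CREATE_WITH_ROLLING_ID)
            return results[0][0]
        except Exception:
            # A concurrent insert took the same id and the unique constraint
            # rejected ours; recompute the count and try again.
            if attempt == max_retries - 1:
                raise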
Your approach is not good because it's based on the number of nodes in the database.
What happens if you create a node (call it A), then delete a random node, and then create a new node (call it B)?
A and B will have the same ID, and I think that's why you added a time-based hash to the code (but I barely understand that line :)).
On the other hand, Neo4j's internal ID is guaranteed to be unique across the database, but not over time. By default, Neo4j recycles unused IDs (an ID is released when a node is deleted).
You can change this behaviour in the configuration (see the doc HERE): dbms.ids.reuse.types.override=RELATIONSHIP
Be careful with such a configuration: the size of your database on disk can only grow, even if you delete nodes.
Why not create your own identifier? You can query for the maximum existing identifier (let's call the property RN, for record number):
match (n) return max(n.RN) as lastID
max() is one of several numeric functions in Cypher.

Fastest way to select several inserted rows

I have a table in a database which stores items. Each item has a unique ID, which the DB generates upon insertion (auto-increment).
A user may perform a specific task that will add X items to the database; however, my program (a C++ server application using the MySQL connector) should return the generated IDs right away. For example, if I add 6 items, the server must return 6 new unique IDs to the client.
What is the fastest/cleanest way to do such a thing? So far I have been doing an INSERT followed by a SELECT for each new item, or an INSERT followed by last_insert_id; however, if there are 50 items to add it takes at least a few seconds, which is not good at all for the user experience.
sql_task.query("INSERT INTO `ItemDB` (`ItemName`, `Type`, `Time`) VALUES ('%s', '%d', '%d')", strName.c_str(), uiType, uiTime);
Getting the ID:
uint64_t item_id { sql_task.last_id() }; //This calls mysql_insert_id
I believe you need to rethink your design slightly. Let's use the analogy of a sales order. With a sales order (or invoice #), the user gets an invoice number (auto_inc) as well as multiple line item numbers (also auto_inc).
The sales order and all of its line items are selected for insert (from the GUI) and the inserts are performed. First the sales order row is inserted, and its id is saved in a variable for the subsequent calls that insert the line items. The line items are then just inserted, without immediately returning their auto_inc id values; the application is merely returned the sales order number at the end. How your app uses that sales order number in subsequent calls is up to you, but it does not need to retrieve all X (or 50) rows immediately, as it has the sales order number safely saved somewhere. Let's call that sales order number XYZ.
When you actually need the information, an example call could look like
select lineItemId
from lineItems
where salesOrderNumber=XYZ
order by lineItemId
You need to remember that in a multi-user system there is no guarantee of receiving a contiguous block of numbers. Nor should it matter to you, as they are all attached to the correct sales order number.
Again, the above is just an analogy, used for illustration purposes.
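A rough sketch of that flow in Python (the question uses C++, but Connector/Python exposes the same primitives; table and column names here are placeholders):
import mysql.connector  # pip install mysql-connector-python

def add_order_with_items(conn, item_names):
    # Insert the parent row, remember its auto_increment id, then insert the
    # children referencing it; only the order id needs to go back to the client.
    cur = conn.cursor()
    cur.execute("INSERT INTO salesOrders (createdAt) VALUES (NOW())")
    order_id = cur.lastrowid  # same idea as mysql_insert_id()
    cur.executemany(
        "INSERT INTO lineItems (salesOrderNumber, itemName) VALUES (%s, %s)",
        [(order_id, name) for name in item_names],
    )
    conn.commit()
    return order_id

# The line item ids can be fetched later with:
#   SELECT lineItemId FROM lineItems WHERE salesOrderNumber = %s ORDER BY lineItemId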
That's a common but hard-to-solve problem. I'm unsure about MySQL, but PostgreSQL uses sequences to generate automatic ids. Insertion frameworks (object-relational mappers) use that when they expect to insert many values: they query the sequence directly for a batch of IDs and then insert the new rows using those already-known IDs. That way, there is no need for an additional query after each insert to get the ID.
The downside is that the relation between ID and insertion time can be non-monotonic when different writers intermix their inserts. That is not a problem for the database, but some (poorly written?) programs could expect it to be.
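A minimal sketch of that pattern with psycopg2, assuming a table items whose id column is backed by the default items_id_seq sequence (both names are assumptions for illustration):
import psycopg2

def insert_with_preallocated_ids(conn, names):
    with conn.cursor() as cur:
        # Reserve one id per row up front, straight from the sequence.
        cur.execute(
            "SELECT nextval('items_id_seq') FROM generate_series(1, %s)",
            (len(names),),
        )
        ids = [row[0] for row in cur.fetchall()]
        # Insert using the ids we already know; no follow-up SELECT needed.
        cur.executemany(
            "INSERT INTO items (id, name) VALUES (%s, %s)",
            list(zip(ids, names)),
        )
    conn.commit()
    return ids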
As your ID is auto-incremented, you can do just two SELECT queries, one before and one after the INSERT queries:
SELECT AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'dbTable' AND table_schema = DATABASE();
--
-- INSERT INTO dbTable... (one or many, does not matter);
--
SELECT LAST_INSERT_ID() AS lastID;
This will give you the range between the first and last inserted IDs. Then you can easily calculate how many there are.

What time would/wouldn't be good for setting db_index=True?

I know that I add indexes on columns when I want to speed up searches on that column.
Here is an example model
class Blog(models.Model):
    title = models.CharField(max_length=100)
    added = models.DateTimeField(auto_now_add=True)
    body = models.TextField()
I need to look up the title and added columns, so I should set db_index=True on those columns:
class Blog(models.Model):
    title = models.CharField(db_index=True, max_length=100)
    added = models.DateTimeField(db_index=True, auto_now_add=True)
    body = models.TextField()
But I have searched internet resources for more examples and I still can't understand or conclude how to use it. When would it, or wouldn't it, be a good idea to set db_index=True?
When to consider adding an index to a column?
In general, you need to consider many points before deciding to add an index to a column.
Oracle, in its docs, has defined multiple guidelines on when to add an index to a column:
http://docs.oracle.com/cd/B19306_01/server.102/b14211/data_acc.htm#i2769
Consider indexing keys that are used frequently in WHERE clauses.
Consider indexing keys that are used frequently to join tables in SQL statements.
Choose index keys that have high selectivity. The selectivity of an index is the percentage of rows in a table having the same value for the indexed key. An index's selectivity is optimal if few rows have the same value. Indexing low selectivity columns can be helpful if the data distribution is skewed so that one or two values occur much less often than other values.
Do not use standard B-tree indexes on keys or expressions with few distinct values. Such keys or expressions usually have poor selectivity and therefore do not optimize performance unless the frequently selected key values appear less frequently than the other key values. You can use bitmap indexes effectively in such cases, unless the index is modified frequently, as in a high concurrency OLTP application.
Do not index columns that are modified frequently. UPDATE statements that modify indexed columns and INSERT and DELETE statements that modify indexed tables take longer than if there were no index. Such SQL statements must modify data in indexes as well as data in tables. They also generate additional undo and redo.
Do not index keys that appear only in WHERE clauses with functions or operators. A WHERE clause that uses a function, other than MIN or MAX, or an operator with an indexed key does not make available the access path that uses the index except with function-based indexes.
Consider indexing foreign keys of referential integrity constraints in cases in which a large number of concurrent INSERT, UPDATE, and DELETE statements access the parent and child tables. Such an index allows UPDATEs and DELETEs on the parent table without share locking the child table.
When choosing to index a key, consider whether the performance gain for queries is worth the performance loss for INSERTs, UPDATEs, and DELETEs and the use of the space required to store the index.
Remember that when you add additional indexes, read operations get faster but write operations become slower because the indexes have to be maintained. So use them as your use case demands.
The penalty for using indexes is slower write performance -- given you're unlikely to be posting a new blog post every 0.0001s you should feel free to add indexes for anything you're searching on.

sqlite, what is the fastest way to get a row ?

If I want to get the first row, I usually use a query like this:
SELECT * FROM tableOfFamousUndeadPeople WHERE ID = 1
I guess sqlite checks the ID of all rows and then I get the result, so if my table has n rows, the time is O(n).
Actually, my ID column is declared INTEGER PRIMARY KEY; I don't know if sqlite does some black-magic trick to speed this up. I don't know if there is another way to get one row, and I don't really understand how to use ROWID or whether my ID column is used as the ROWID.
The time is much better than O(n). The database stores the primary key (as it does other indices) in sorted order, so it can perform a binary search to find the desired row.
The time complexity is O(log n).
http://bigocheatsheet.com/
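If you want to see this for yourself, SQLite will report which access path it picks; a small sketch with Python's built-in sqlite3 module (the exact plan wording varies between SQLite versions):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tableOfFamousUndeadPeople (ID INTEGER PRIMARY KEY, name TEXT)")

# For an INTEGER PRIMARY KEY lookup the plan should show a rowid search,
# not a full table scan.
for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM tableOfFamousUndeadPeople WHERE ID = 1"):
    print(row)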
what is the fastest way to get a row ?
Search using an index (including the primary key) that is highly discriminating.
(A discriminating index is one whose values are mostly unique. Indexing gender is not very useful because it divides a table into just two categories, male and female. Indexing a ZIP code is pretty good for most purposes. Using the primary key is ideal, since each value is guaranteed to be unique.)

MongoDB: what is the most efficient way to query a single random document?

I need to pick a document from a collection at random (alternatively, a small number of successive documents from a randomly-positioned "window").
I've found two solutions: 1 and 2. The first is unacceptable since I anticipate a large collection size and wish to minimize the document size. The second seems inefficient (I'm not sure about the complexity of the skip operation). And here one can find a mention of querying a document at a specified index, but I don't know how to do it (I'm using the C++ driver).
Are there other solutions to the problem? Which is the most efficient?
I had a similar issue once. In my case, I had a date property on my documents. I knew the earliest date possible in the dataset, so in my application code I would generate a random date in the range between EARLIEST_DATE_IN_SET and NOW, then query MongoDB with a $gte query on the date property and simply limit it to 1 result.
There was a small chance that the random date would be greater than the highest date in the data set, so I accounted for that in the application code.
With an index on the date property, this was a super fast query.
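A rough pymongo version of that idea (the question uses the C++ driver, which has the same building blocks; the date field name and the wrap-around fallback are illustrative):
import random
from datetime import datetime, timedelta
from pymongo import MongoClient

def random_document(coll, earliest, latest=None):
    # Pick a random point in the date range, then take the first document at
    # or after it; an index on "date" keeps this fast.
    latest = latest or datetime.utcnow()
    span = (latest - earliest).total_seconds()
    pivot = earliest + timedelta(seconds=random.uniform(0, span))
    doc = coll.find_one({"date": {"$gte": pivot}}, sort=[("date", 1)])
    if doc is None:
        # The random date landed past the newest document; fall back to the oldest.
        doc = coll.find_one(sort=[("date", 1)])
    return doc

# usage sketch:
# coll = MongoClient()["mydb"]["mycollection"]
# print(random_document(coll, earliest=datetime(2015, 1, 1)))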
It seems like you could adapt solution 1 there (assuming your _id key was an auto-inc value): just do a count on your records, use that as the upper bound for a random int in C++, and then grab that row.
Likewise, if you don't have an auto-inc _id key, just create one with your results; having an additional INT field shouldn't add that much to your document size.
If you don't have an auto-inc field, the MongoDB docs talk about how to quickly add one here:
Auto Inc Field.
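One common version of the approach that page describes is a small counters collection incremented atomically with $inc; a hedged pymongo sketch (collection and field names are placeholders):
from pymongo import MongoClient, ReturnDocument

def next_sequence(db, name):
    # Atomically bump and return the counter for `name`; upsert creates it on first use.
    counter = db.counters.find_one_and_update(
        {"_id": name},
        {"$inc": {"seq": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return counter["seq"]

# usage sketch:
# db = MongoClient()["mydb"]
# db.items.insert_one({"_id": next_sequence(db, "items"), "payload": "..."})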