sqlite, what is the fastest way to get a row? - c++

If I want to get the first row, I usually use a query like this:
SELECT * FROM tableOfFamousUndeadPeople WHERE ID = 1
I guess SQLite checks the ID of every row before I get the result, so if my table has n rows, the time is O(n).
My ID column actually has the flags INTEGER PRIMARY KEY. I don't know if SQLite does some black-magic trick to speed this up, or if there is another way to get one row. I also don't really understand how to use ROWID, or whether my ID column is used as the ROWID.

Time is much better than O(n). The database stores the primary key (and any other indices) sorted, so it can perform a binary search to find the desired row; the time complexity is O(log(n)).
In SQLite specifically, a column declared INTEGER PRIMARY KEY in an ordinary (rowid) table is an alias for the ROWID, so a lookup on it goes straight to the table's B-tree rather than through a separate index.
http://bigocheatsheet.com/
what is the fastest way to get a row?
Search using an index (including the primary key) that is highly discriminating.
(A discriminating index is one whose values are mostly unique. Indexing gender is not very good because it divides a table into just two categories, male and female. Indexing a ZIP code is pretty good for most purposes. Using the primary key is ideal, since each value is guaranteed to be unique.)
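
For reference, here is a minimal C++ sketch of such a point lookup using the SQLite C API. The table name comes from the question; the database file name and the printed column are just for illustration. Because ID is declared INTEGER PRIMARY KEY, the WHERE ID = ? condition is resolved with a direct B-tree search, not a scan of all rows:

// Minimal sketch: fetch one row by its INTEGER PRIMARY KEY with a prepared statement.
#include <sqlite3.h>
#include <cstdio>

int main()
{
    sqlite3* db = nullptr;
    if (sqlite3_open("undead.db", &db) != SQLITE_OK)   // hypothetical file name
        return 1;

    sqlite3_stmt* stmt = nullptr;
    const char* sql = "SELECT * FROM tableOfFamousUndeadPeople WHERE ID = ?";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) {
        std::fprintf(stderr, "prepare failed: %s\n", sqlite3_errmsg(db));
        sqlite3_close(db);
        return 1;
    }

    sqlite3_bind_int64(stmt, 1, 1);          // bind ID = 1 to the first placeholder

    if (sqlite3_step(stmt) == SQLITE_ROW) {  // at most one row for a primary-key lookup
        // Print the first column of the matching row (converted to text by SQLite).
        std::printf("first column: %s\n",
                    reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)));
    }

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}

If you want to verify it yourself, running EXPLAIN QUERY PLAN on the same statement should report something like "SEARCH tableOfFamousUndeadPeople USING INTEGER PRIMARY KEY (rowid=?)" rather than a SCAN.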

Related

How does django db index work with minus?

What does this index with minus (-text) mean in Django?
migrations.AddIndex(
    model_name='comment',
    index=models.Index(fields=['-text'], name='xxx'),
)
Indexes are generally created in ascending order based on the named column. In this case, however, the '-' says that the index should be built on the named column in descending order (or reverse alphabetical order, assuming text is a CharField of some sort).
Indexing a column tells the database to maintain a separate, pre-sorted structure for it, which means that searches on an indexed column don't have to scan every row, and so are faster.
Indexing a field should speed up any query that searches that field. Reverse indexing with the minus is useful if, for example, you know most of your searches will be for recent data, so you create a reverse index on a datetime field. For a text column, it depends on whether you think searches starting from ZYX will be more frequent than ones starting from ABC - possible, but very case-specific.
In terms of Django, you wouldn't index a column just before a search in a view (for example). Instead, you might recognise that a column is used frequently in your searches, contains a lot of data or records, or is part of very common expressions (see https://docs.djangoproject.com/en/4.0/ref/models/indexes/ for examples), so you index it at the database level; by indexing the column you make all searches using it faster and more efficient.

Fastest way to select a lot of rows based on their ID in PostgreSQL?

I am using postgres with libpqxx, and I have a table that we will simplify down to
data_table
{
    bytea id PRIMARY KEY,
    BigInt size
}
If I have a set of IDs in C++, e.g. std::unordered_set<ObjectId> Ids, what is the best way to get the ID and Size parameters out of data_table?
I have so far used a prepared statement:
constexpr const char* preparedStatement = "SELECT size FROM data_table WHERE id = $1";
Then, in a transaction, I call that prepared statement for every entry in the set and retrieve the result:
pqxx::work transaction(SomeExistingPqxxConnection);
std::unordered_map<ObjectId, uint32_t> result;
for (const auto& id : Ids)
{
    auto transactionResult = transaction.exec_prepared(preparedStatement, ToPqxxBinaryString(id));
    result.emplace(id, transactionResult[0][0].as<uint32_t>());
}
return result;
Because the set can contain tens of thousands of objects, and the table can contain millions, this can take quite some time to process, and I don't think it is a particularly efficient use of postgres.
I am pretty much brand new to SQL, so I don't really know if what I am doing is the right way to go about this, or if there is a much more efficient way.
Edit: For what it's worth, the ObjectId class is basically a type wrapper over std::array<uint8_t, 32>, i.e. a 256-bit cryptographic hash.
The task as I understand it:
Get id (PK) and size (bigint) for "tens of thousands of objects" from a table with millions of rows and presumably several more columns ("simplified down").
The fastest way of retrieval is index-only scans. The cheapest way to get that in your particular case would be a "covering index" for your query by "including" the size column in the PK index like this (requires Postgres 11 or later):
CREATE TEMP TABLE data_table (
id bytea
, size bigint
, PRIMARY KEY (id) INCLUDE (size) -- !
)
About covering indexes:
Do covering indexes in PostgreSQL help JOIN columns?
Then retrieve all rows in a single query (or few queries) for many IDs at once like:
SELECT id, size
FROM data_table
JOIN (
VALUES ('id1'), ('id2') -- many more
) t(id) USING (id);
Or one of the other methods laid out here:
Query table by indexes from integer array
Or create a temporary table and join to it.
But do not "insert all those IDs one by one into it". Use the much faster COPY (or the meta-command \copy in psql) to fill the temp table. See:
How to update selected rows with values from a CSV file in Postgres?
And you do not need an index on the temporary table, as it will be read in a sequential scan anyway. You only need the covering PK index outlined above.
You may want to ANALYZE the temporary table after filling it, to give Postgres some column statistics to work with. But as long as you get the index-only scans I am aiming for, you can skip that, too. The query plan won't get any better than that.
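
For the retrieval step, a minimal libpqxx sketch of the "many IDs in one query" idea might look like the following. This is an illustration rather than the answer's own code: it assumes the data_table(id bytea PRIMARY KEY, size bigint) layout from the question and a hypothetical ToHex() helper, sends the bytea keys as '\x…' hex literals so no driver-specific binary escaping is required, and keys the result map by hex string to avoid needing a hash function for ObjectId:

#include <pqxx/pqxx>

#include <array>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using ObjectId = std::array<std::uint8_t, 32>;   // stand-in for the question's wrapper type

// Hypothetical helper: hex-encode a 32-byte id for use in a bytea literal.
static std::string ToHex(const ObjectId& id)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(id.size() * 2);
    for (std::uint8_t b : id) {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    return out;
}

// Fetch the sizes of many ids in a single round trip.
std::unordered_map<std::string, std::int64_t>
FetchSizes(pqxx::connection& conn, const std::vector<ObjectId>& ids)
{
    std::unordered_map<std::string, std::int64_t> sizes;   // keyed by hex-encoded id
    if (ids.empty())
        return sizes;

    // Build the VALUES list: ('\xab12...'::bytea), ('\xcd34...'::bytea), ...
    std::string values;
    for (const auto& id : ids) {
        if (!values.empty())
            values += ", ";
        values += "('\\x" + ToHex(id) + "'::bytea)";
    }

    pqxx::work txn(conn);
    pqxx::result rows = txn.exec(
        "SELECT encode(id, 'hex') AS id_hex, size "
        "FROM data_table "
        "JOIN (VALUES " + values + ") t(id) USING (id)");

    for (const auto& row : rows)
        sizes.emplace(row["id_hex"].as<std::string>(),
                      row["size"].as<std::int64_t>());

    txn.commit();
    return sizes;
}

With tens of thousands of IDs you would likely want to split the list into batches of a few thousand values per statement, or switch to the temp-table-plus-COPY variant described above.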
The id is a primary key and so is indexed, so my first concern would be query setup time; a stored procedure is precompiled, for instance. A second tack is to put your set in a temp table, possibly also keyed on the id, so the two tables/indexes can be joined in one SELECT. The indexes for this should be ordered (B-tree, not hash) so they can be merged.

how to deal with virtual index in a database table in Django + PostgreSQL

Here is my current scenario:
Need to add a new field to an existing table that will be used for ordering a QuerySet.
This field will be an integer between 1 and some not very high number; I expect fewer than 1000. The whole reasoning behind this field is to use it for visual ordering on the front-end, so index 1 would be the first element to be returned, index 2 the second, and so on.
This is how the field is defined in model:
priority = models.PositiveSmallIntegerField(verbose_name=_(u'Priority'),
                                            default=0,
                                            null=True)
I will need to re-arrange (reorder) the whole set of elements in this table whenever a new or existing element gets this field updated. So, for instance, imagine I have 3 objects in this table:
Element A: priority 1
Element B: priority 2
Element C: priority 3
If I change Element C's priority to 1, I should have:
Element C: priority 1
Element A: priority 2
Element B: priority 3
Since this is not a real db index (and can have empty values), I'm going to have to query for all elements in the database each time an element is created or updated, and change the priority value for each record in the table. I'm not really worried about performance since the table will always be small, BUT I'm worried that this way of proceeding is not the way to go, or that it simply generates too much overhead.
Maybe there is a simpler way to do this with plain SQL? If I used a unique index, though, I would get an error every time an existing priority is reused, which I don't want either.
Any pointers?
To insert at the 10th position, all you need is a single SQL query:
MyModel.objects.filter(priority__gte=10).update(priority=models.F('priority')+1)
Then you would need a similar one for deleting an element, and for swapping two elements (or whatever your use case requires). It should all be doable in a similar manner with bulk update queries; there is no need to manually update entry by entry.
First, you can very well index this column, just don't require it to contain unique values. Such standard indexes can have nulls and duplicates; they are just used to locate the row(s) matching a criterion.
Second, whether to update each populated* row each time you insert/update a record should be judged against the expected update frequency. If every user inserts several records each time they use the system and you have thousands of concurrent users, it might not be a good idea; whereas if you have a single user updating any number of rows once in a while, it is not much of an issue. In the same vein, you need to consider whether other updates are occurring to the same rows or not: you don't want to lock all rows too often if they are also being updated/locked for other fields.
*: to be accurate, you wouldn't update all populated rows, but only the ones at or below the inserted position (inserting a priority of 999 would only shift the items currently at 999 and 1000).

Sorting in DynamoDB

Is there any way to get a sorted result out of DynamoDB when using the Scan/Query APIs? I know that in the Query API you can sort by the range key with ScanIndexForward, which sorts the result ascending if the value is true and descending if false.
But as far as I understand, you can have only one range key, so what if I want to sort based on different fields?
Also, if I'm using Scan, it seems there is no option to sort the result at all!
Any help is appreciated!
For the first question about having only one range key, you can use a Local Secondary Index (LSI). You assign a normal attribute as the range key of the LSI, and DynamoDB will sort your rows (with the same hash key) by comparing that attribute.
So essentially an LSI gives you an additional range key. You can create up to 5 LSIs.
See here and here for examples of querying an LSI. You can treat an index just like a regular table: you can Query and Scan an index (but not Put).
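Here is a rough sketch of querying such an index with the AWS SDK for C++. The table name Orders, index name byDate, and attribute names customerId/orderDate are made up for illustration; ScanIndexForward(false) returns the items for one hash key sorted by the LSI's range key in descending order:

#include <aws/core/Aws.h>
#include <aws/dynamodb/DynamoDBClient.h>
#include <aws/dynamodb/model/AttributeValue.h>
#include <aws/dynamodb/model/QueryRequest.h>

#include <iostream>

int main()
{
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        Aws::DynamoDB::DynamoDBClient client;

        // Hypothetical table "Orders" with hash key "customerId" and an LSI
        // "byDate" whose range key is "orderDate".
        Aws::DynamoDB::Model::QueryRequest request;
        request.SetTableName("Orders");
        request.SetIndexName("byDate");
        request.SetKeyConditionExpression("customerId = :cid");

        Aws::Map<Aws::String, Aws::DynamoDB::Model::AttributeValue> values;
        values[":cid"] = Aws::DynamoDB::Model::AttributeValue().SetS("customer-42");
        request.SetExpressionAttributeValues(values);

        request.SetScanIndexForward(false);   // newest first (descending range key)

        auto outcome = client.Query(request);
        if (outcome.IsSuccess()) {
            for (const auto& item : outcome.GetResult().GetItems())
                std::cout << item.at("orderDate").GetS() << "\n";
        } else {
            std::cerr << outcome.GetError().GetMessage() << "\n";
        }
    }
    Aws::ShutdownAPI(options);
    return 0;
}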
For your second question about sorting the rows globally instead of sorting items with the same hash key, I don't think DynamoDB supports this feature out of the box. You will have to
a) scan and sort the items on your own,
b) or create a global secondary index with just one hash key value and dump all your items into that key (not recommended, because it creates a hot partition in the GSI),
c) or design your schema to avoid having to sort items globally.

MongoDB: what is the most efficient way to query a single random document?

I need to pick a document from a collection at random (alternatively - a small number of successive documents from a randomly-positioned "window").
I've found two solutions: 1 and 2. The first is unacceptable since I anticipate a large collection size and wish to minimize the document size. The second seems inefficient (I'm not sure about the complexity of the skip operation). And here one can find a mention of querying a document at a specified index, but I don't know how to do it (I'm using the C++ driver).
Are there other solutions to the problem? Which is the most efficient?
I had a similar issue once. In my case, I had a date property on my documents. I knew the earliest date possible in the dataset, so in my application code I would generate a random date within the range of EARLIEST_DATE_IN_SET and NOW, then query MongoDB using a $gte query on the date property and simply limit it to 1 result.
There was a small chance that the random date would be greater than the highest date in the data set, so I accounted for that in the application code.
With an index on the date property, this was a super fast query.
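
Since the question mentions the C++ driver, here is a hedged sketch of that random-date trick with mongocxx. The database/collection/field names (mydb, mycoll, created_at) and the earliest date are assumptions for illustration, not something taken from the question:

#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <bsoncxx/json.hpp>
#include <bsoncxx/types.hpp>

#include <chrono>
#include <iostream>
#include <random>

int main()
{
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_document;
    using std::chrono::system_clock;

    mongocxx::instance inst{};
    mongocxx::client client{mongocxx::uri{}};           // localhost by default
    auto coll = client["mydb"]["mycoll"];                // illustrative names

    // Pick a random instant between the earliest known date and now (earliest is assumed).
    const auto earliest = system_clock::now() - std::chrono::hours(24 * 365);
    const auto now = system_clock::now();
    std::mt19937_64 rng{std::random_device{}()};
    std::uniform_int_distribution<long long> dist(
        0, std::chrono::duration_cast<std::chrono::milliseconds>(now - earliest).count());
    const auto random_point = earliest + std::chrono::milliseconds(dist(rng));

    // created_at >= random_point, first match only; an index on created_at
    // makes this a cheap range seek.
    auto doc = coll.find_one(make_document(
        kvp("created_at", make_document(kvp("$gte", bsoncxx::types::b_date{random_point})))));

    if (doc)
        std::cout << bsoncxx::to_json(doc->view()) << "\n";
    else
        std::cout << "random date fell past the newest document; retry or wrap around\n";

    return 0;
}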
It seems like you could adapt solution 1 there (assuming your _id key is an auto-incrementing value): just do a count on your records, use that as the upper limit for a random int in C++, then grab that row.
Likewise, if you don't have an auto-incrementing _id key, just create one when you insert your documents; an additional INT field shouldn't add that much to your document size.
If you don't have an auto-inc field, Mongo talks about how to quickly add one here:
Auto Inc Field.
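
And a short mongocxx sketch of this counter-based idea, assuming a dense auto-increment field named seq running from 0 to count-1 (that field and its density are assumptions about your data, not something the driver provides):

#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <bsoncxx/json.hpp>

#include <cstdint>
#include <iostream>
#include <random>

int main()
{
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_document;

    mongocxx::instance inst{};
    mongocxx::client client{mongocxx::uri{}};
    auto coll = client["mydb"]["mycoll"];          // illustrative names

    // Count the documents, pick a random counter value, and fetch that one document.
    const std::int64_t count = coll.count_documents(make_document());
    if (count == 0)
        return 0;

    std::mt19937_64 rng{std::random_device{}()};
    std::uniform_int_distribution<std::int64_t> dist(0, count - 1);

    auto doc = coll.find_one(make_document(kvp("seq", dist(rng))));
    if (doc)
        std::cout << bsoncxx::to_json(doc->view()) << "\n";
    return 0;
}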