This is my query on workbench
select t1.COL 1 from ex1 t1 left outer join ch1 t2 on t1.COL 1=t2.COL 1;
why this is taking too long to fetch data?
Outer joins can be slow since all of t1s records are returned. Since you're joining on id columns, it should be easy to index them. Without an index, when you join t2, you are evaluating each of the 142,000 records to search for matching ids. With an index, you are setting aside memory to "remember" the locations of each id in sequence. It's like using a bookmark instead of flipping through each page to find the page you want.
I don't know what database management system you're using, but here's a guide on creating clustered and unclustered indices on SQL Server:
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-index-transact-sql
Related
I am using postgres with libpqxx, and I have a table that we will simplify down to
data_table
{
bytea id PRIMARY KEY,
BigInt size
}
If I have a set of ID's in cpp, eg std::unordered_set<ObjectId> Ids, what is the best way to get the ID and the Size parameters out of data_table?
I have so far used a prepared statement:
constexpr char* preparedStatement = "SELECT size FROM data_table WHERE id = $1";
Then in a transaction I have called that prepared statement for every entry in the set, and retrieved the result for every entry in the set,
pqxx::work transaction(SomeExistingPqxxConnection);
std::unordered_map<ObjectId, uint32_t> result;
for (const auto& id : Ids)
{
auto transactionResult = transaction.exec_prepared(preparedStatement, ToPqxxBinaryString(id));
result.emplace(id, transactionResult[0][0].as<uint32_t>());
}
return result;
Because the set can contain tens of thousands of objects, and the table can contain millions, this can take quite some time to process, and I don't think it is a particularly efficient use of postgres.
I am pretty much brand new to SQL, so I don't really know if what I am doing is the right way to go about this, or if this is a much more efficient way.
E: For what it's worth the ObjectId class is basically a type wrapper over std::array<uint8_t, 32>, aka a 256 bit cryptographic hash.
The task as I understand it:
Get id (PK) and size (bigint) for "tens of thousands of objects" from a table with millions of rows and presumably several more columns ("simplified down").
The fastest way of retrieval is index-only scans. The cheapest way to get that in your particular case would be a "covering index" for your query by "including" the size column in the PK index like this (requires Postgres 11 or later):
CREATE TEMP TABLE data_table (
id bytea
, size bigint
, PRIMARY KEY (id) INCLUDE (size) -- !
)
About covering indexes:
Do covering indexes in PostgreSQL help JOIN columns?
Then retrieve all rows in a single query (or few queries) for many IDs at once like:
SELECT id, size
FROM data_table
JOIN (
VALUES ('id1'), ('id2') -- many more
) t(id) USING (id);
Or one of the other methods laid out here:
Query table by indexes from integer array
Or create a temporary table and join to it.
But do not "insert all those IDs one by one into it". Use the much faster COPY (or the meta-command \copy in psql) to fill the temp table. See:
How to update selected rows with values from a CSV file in Postgres?
And you do not need an index on the temporary table, as that one will be read in a sequential scan anyway. You only need the covering PK index I lined out.
You may want to ANALYZE the temporary table after filling it, to give Postgres some column statistics to work with. But as long as you get the index-only scans I am aiming for, you can skip that, too. The query plan won't get any better than that.
The id is a primary key and so is indexed, so my first concern would be query setup time. A stored procedure is precompiled, for instance. A second tack is to put your set in a temp table, possibly also keyed on the id, so the two tables/indexes can be joined in one select. The indexes for this should be ordered, like tree not hash, so they can be merged.
If we are using the Interleave option with a secondary index, is there still benefit to using the storing clause?
https://cloud.google.com/spanner/docs/secondary-indexes
Yes, there can still be benefit, although it's less likely to be a major benefit:
Let's say you have interleaved tables Singers->Albums->Songs, and you have an index:
CREATE INDEX SongsBySingerSongName ON Songs(SingerId, SongName),
INTERLEAVE IN Singers
Let's also assume that Songs has a FLOAT64 column, LengthInSeconds, for storing the length of a song.
If you wanted to look up all songs for SingerId 123 that started with "T" and were less than 4 minutes long, your query could be executed by:
Using SongsBySingerSongName to lookup all songs for Singer 123
that start with "T"
For these songs, back-join with Songs to
lookup LengthInSeconds to filter by length.
Since both Songs and SongsBySingerSongName are interleaved in the Singers table, we know that our data should all be in the same split, which means it will all reside on the same machine, which means the back-join in step (2) won't be terribly costly. However, the local back-join still incurs a cost to lookup the data, so saving step (2) by having a STORING clause could still reduce your query latency and overall cost. You would want to do benchmarking of your workload to see if the extra storing clause provides a net-benefit.
In general, if you have filters in your query that refer to columns in the index (either key columns or 'storing' columns), the filters can be evaluated before doing the back-join to the base table, and if the filter does not match, the back-join can be avoided. If the filter refers to a column that is not in the index, the back-join has to be done first to get the column value that the filter refers
Situation
I'm using multiple storage databases as attachments to one central "manager" DB.
The storage tables share one pseudo-AUTOINCREMENT index across all storage databases.
I need to iterate over the shared index frequently.
The final number and names of storage tables are not known on storage DB creation.
On some signal, a then-given range of entries will be deleted.
It is vital that no insertion fails and no entry gets deleted before its signal.
Energy outage is possible, data loss in this case is hardly, if ever, tolerable. Any solutions that may cause this (in-memory databases etc) are not viable.
Database access is currently controlled using strands. This takes care of sequential access.
Due to the high frequency of INSERT transactions, I must trigger WAL checkpoints manually. I've seen journals of up to 2GB in size otherwise.
Current solution
I'm inserting datasets using parameter binding to a precreated statement.
INSERT INTO datatable VALUES (:idx, ...);
Doing that, I remember the start and end index. Next, I bind it to an insert statement into the registry table:
INSERT INTO regtable VALUES (:idx, datatable);
My query determines the datasets to return like this:
SELECT MIN(rowid), MAX(rowid), tablename
FROM (SELECT rowid,tablename FROM entryreg LIMIT 30000)
GROUP BY tablename;
After that, I query
SELECT * FROM datatable WHERE rowid >= :minid AND rowid <= :maxid;
where I use predefined statements for each datatable and bind both variables to the first query's results.
This is too slow. As soon as I create the registry table, my insertions slow down so much I can't meet benchmark speed.
Possible Solutions
There are several other ways I can imagine it can be done:
Create a view of all indices as a UNION or OUTER JOIN of all table indices. This can't be done persistently on attached databases.
Create triggers for INSERT/REMOVE on table creation that fill a registry table. This can't be done persistently on attached databases.
Create a trigger for CREATE TABLE on database creation that will create the triggers described above. Requires user functions.
Questions
Now, before I go and add user functions (something I've never done before), I'd like some advice if this has any chances of solving my performance issues.
Assuming I create the databases using a separate connection before attaching them. Can I create views and/or triggers on the database (as main schema) that will work later when I connect to the database via ATTACH?
From what it looks like, a trigger AFTER INSERT will fire after every single line of insert. If it inserts stuff into another table, does that mean I'm increasing my number of transactions from 2 to 1+N? Or is there a mechanism that speeds up triggered interaction? The first case would slow down things horribly.
Is there any chance that a FULL OUTER JOIN (I know that I need to create it from other JOIN commands) is faster than filling a registry with insertion transactions every time? We're talking roughly ten transactions per second with an average of 1000 elements (insert) vs. one query of 30000 every two seconds (query).
Open the sqlite3 databases in multi-threading mode, handle the insert/update/query/delete functions by separate threads. I prefer to transfer query result to a stl container for processing.
I have a table in a database which stores items. Each item has a unique ID, which the DB generates upon insertion (auto-increment).
A user may perform a specific task that will add X items to the database, however my program (C++ server application using MySQL connector) should return the IDs that the database generated right away. For example, if I add 6 items, the server must return 6 new unique IDs to the client.
What is the fastest/cleanest way to do such thing? So far I have been doing INSERT followed by SELECT for each new item OR INSERT followed by last_insert_id, however if there are 50 items to add it will take a few seconds at least which is not good at all for user experience.
sql_task.query("INSERT INTO `ItemDB` (`ItemName`, `Type`, `Time`) VALUES ('%s', '%d', '%d')", strName.c_str(), uiType, uiTime);
Getting the ID:
uint64_t item_id { sql_task.last_id() }; //This calls mysql_insert_id
I believe you need to rethink your design slightly. Let's use the analogy of a sales order. With a sales order (or invoice #) the user gets an invoice number (auto_incr) as well as multiple line item numbers (also auto_inc).
The sales order and all of the line items are selected for insert (from the GUI) and the inserts are performed. First, the sales order row is inserted and its id is saved in a variable for subsequent calls to insert the line items. But the line items are then just inserted without immediate return of their auto_inc id values. The application is merely returned the sales order number in the end. How your app uses that sales order number in subsequent calls is up to you. But it does not need to be immediate to retrieve all the X or 50 rows at once, as it has the sales order number iced and saved somewhere. Let's call that sales order number XYZ.
When you actually need the information, an example call could look like
select lineItemId
from lineItems
where salesOrderNumber=XYZ
order by lineItemId
You need to remember that in a multi-user system that there is no guarantee of receiving a contiguous block of numbers. Nor should it matter to you, as they are all attached appropriately with the correct sales order number.
Again, the above is just an analogy, used for illustration purposes.
That's a common but hard to solve problem. Unsure for mysql, but PostreSQL uses sequences to generate automatic ids. Inserting frameworks (object relationnal mappers) use that when they expect to insert many values: they query directly the sequence for a bunch of IDs and then insert new rows using those already known IDs. That way, no need for an additional query after each insert to get the ID.
The downside is that the relation ID - insertion time can be non monotonic when different writers intermix their inserts. It is not a problem for the database, but some (poorly written?) program could expect it is.
As you ID is autoincremental, you can do only two SELECT queries - before and after INSERT queries:
SELECT AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'dbTable' AND table_schema = DATABASE();
--
-- INSERT INTO dbTable... (one or many, does not matter);
--
SELECT LAST_INSERT_ID() AS lastID;
This will give you the siquence between first and last inserted IDs. Then you can easily calculate how many they are.
Alright I am in a rather difficult situation, or at least I think so anyway. I have been doing some research on how to fix my problem but have really come up empty handed.
I need to be able to reindex the rowid of my table after I delete a row. That way at any given time when I want to update or index a row by the rowid it is accessing the correct one.
Now for those of you asking why. Basically I am interfacing a "homebrewed" db that was programmed in C and is really just a bunch of memory locations all accessed like they were a db table. So what I'm trying to say is they can look up a row by searching for a value in the table, or by simply saying i want row 6. Lastly the table could consist of really anything, and any values which means they dont create a column as an index and ultimately the only thing for me to index their row by row number is the rowid to my knowledge.
So I have found that VACUUM would do what I want or need but it appears that the system that database is in isn't giving sqlite privileges to write so when VACUUM is run it comes back with and error. (ERROR 14 or Unable to open the database file) (I also know that my db is open so that isn't the issue but not having write privileges is the only reason I can come up with) I have also read some stuff about the auto increment or something like that but didn't really understand/think that was going to be able to fix my problem.
Any suggestions or ideas from the sqlite or database geniuses out that would be appreciated.
Not sure if I have understood completely your problem, but if you can use SQL code maybe you can write a query to update the IDs (assuming they are in dense order).
You can use a query like this:
UPDATE t1
SET id = (SELECT rank
FROM (SELECT id,
(
SELECT count()+1
FROM (SELECT DISTINCT id
FROM t1 AS t
WHERE t.id < t1.id
)
) rank
FROM t1
) AS sub
WHERE sub.id = t1.id
);
You can check my demo in SQLFiddler. In this demo you will see the result of the DELETE and UPDATE statements (to simulate your case) if you run all queries together.