Fastest way to select several inserted rows


I have a table in a database which stores items. Each item has a unique ID, which the DB generates upon insertion (auto-increment).
A user may perform a specific task that will add X items to the database, however my program (C++ server application using MySQL connector) should return the IDs that the database generated right away. For example, if I add 6 items, the server must return 6 new unique IDs to the client.
What is the fastest/cleanest way to do such a thing? So far I have been doing an INSERT followed by a SELECT for each new item, or an INSERT followed by last_insert_id. However, if there are 50 items to add, it takes at least a few seconds, which is not good at all for the user experience.
sql_task.query("INSERT INTO `ItemDB` (`ItemName`, `Type`, `Time`) VALUES ('%s', '%d', '%d')", strName.c_str(), uiType, uiTime);
Getting the ID:
uint64_t item_id { sql_task.last_id() }; //This calls mysql_insert_id

I believe you need to rethink your design slightly. Let's use the analogy of a sales order. With a sales order (or invoice #) the user gets an invoice number (auto_incr) as well as multiple line item numbers (also auto_inc).
The sales order and all of the line items are selected for insert (from the GUI) and the inserts are performed. First, the sales order row is inserted and its id is saved in a variable for the subsequent calls that insert the line items. The line items are then just inserted, without immediately returning their auto_inc id values. In the end, the application is only given the sales order number. How your app uses that sales order number in subsequent calls is up to you. It does not need to retrieve all the X or 50 rows immediately, because it has the sales order number saved somewhere. Let's call that sales order number XYZ.
When you actually need the information, an example call could look like
select lineItemId
from lineItems
where salesOrderNumber=XYZ
order by lineItemId
You need to remember that in a multi-user system there is no guarantee of receiving a contiguous block of numbers. Nor should it matter to you, as they are all attached to the correct sales order number.
Again, the above is just an analogy, used for illustration purposes.
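The flow described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the table and column names come from the analogy, and the helper names (make_order_db, insert_order_with_items, line_item_ids) are hypothetical.

```python
import sqlite3


def make_order_db():
    """Create an in-memory database with the two tables from the analogy."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE salesOrders ("
        "salesOrderNumber INTEGER PRIMARY KEY AUTOINCREMENT)"
    )
    conn.execute(
        "CREATE TABLE lineItems ("
        "lineItemId INTEGER PRIMARY KEY AUTOINCREMENT, "
        "salesOrderNumber INTEGER NOT NULL, "
        "itemName TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def insert_order_with_items(conn, item_names):
    """Insert the parent row, then all child rows, in a single transaction.

    Only the parent id is returned; the child ids are not fetched here.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO salesOrders DEFAULT VALUES")
        order_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO lineItems (salesOrderNumber, itemName) VALUES (?, ?)",
            [(order_id, name) for name in item_names],
        )
    return order_id


def line_item_ids(conn, order_id):
    """Fetch the generated child ids later, with one query per order."""
    rows = conn.execute(
        "SELECT lineItemId FROM lineItems "
        "WHERE salesOrderNumber = ? ORDER BY lineItemId",
        (order_id,),
    )
    return [row[0] for row in rows]
```

The point of the design is that the batch costs one round trip for the parent id plus one bulk insert, and a single SELECT retrieves all child ids whenever they are actually needed.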

That's a common but hard-to-solve problem. I am unsure about MySQL, but PostgreSQL uses sequences to generate automatic IDs. Insertion frameworks (object-relational mappers) use that when they expect to insert many values: they query the sequence directly for a batch of IDs and then insert the new rows using those already-known IDs. That way, there is no need for an additional query after each insert to get the ID.
The downside is that the relation between ID and insertion time can be non-monotonic when different writers intermix their inserts. This is not a problem for the database, but some (poorly written?) programs could expect it to be monotonic.
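A minimal sketch of the "reserve a block of IDs, then insert with known IDs" idea is shown below. It is an assumption-laden illustration, not PostgreSQL code: a one-row counter table (id_sequence, a hypothetical name) stands in for a real sequence, sqlite3 stands in for the database, and the helper names are made up for the example.

```python
import sqlite3


def make_sequence_db():
    """In-memory database with a one-row counter acting as the 'sequence'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE id_sequence (next_id INTEGER NOT NULL)")
    conn.execute("INSERT INTO id_sequence VALUES (1)")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.commit()
    return conn


def reserve_ids(conn, count):
    """Advance the counter by `count` and return the reserved ids.

    The UPDATE runs first so the write lock is held before the value is read.
    """
    with conn:
        conn.execute("UPDATE id_sequence SET next_id = next_id + ?", (count,))
        (next_id,) = conn.execute("SELECT next_id FROM id_sequence").fetchone()
    return list(range(next_id - count, next_id))


def insert_items(conn, names):
    """Insert rows with ids that are already known to the caller."""
    ids = reserve_ids(conn, len(names))
    with conn:
        conn.executemany(
            "INSERT INTO items (id, name) VALUES (?, ?)",
            list(zip(ids, names)),
        )
    return ids
```

Because the ids are known before the rows are written, the caller can return them to the client without any follow-up query.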

As your ID is auto-incremented, you need only two SELECT queries, one before and one after the INSERT queries:
SELECT AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'dbTable' AND table_schema = DATABASE();
--
-- INSERT INTO dbTable... (one or many, does not matter);
--
SELECT LAST_INSERT_ID() AS lastID;
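This will give you the sequence between the first and last inserted IDs. Then you can easily calculate how many there are. Note that this is only reliable if no other session inserts into the table in between, and that after a single multi-row INSERT, MySQL's LAST_INSERT_ID() returns the ID generated for the first row of that statement, not the last.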

Related

DynamoDB record size increasing with time

I have a customer table in DynamoDB with basic attributes like name, dob, zipcode, email, etc. I want to add another attribute to it which will keep increasing with time. For example, each time the user clicks on a product (item), I want to add that to the record so that I have the full snapshot of the customer's profile in a single value indexed by the customerId. So, my new attribute would be called viewedItems and would be a list of itemIds viewed (along with the timestamp).
However, given the 4KB size limit for DynamoDB value, it is going to be surpassed with time as I keep adding the clicked products to the customer profile.
How can I best define my objects so as to perform the following?
Access the full profile of the customer by customerId, including the views.
Access time filtered profile of the customer (like all interactions since last N days), in which case the viewed items should be filtered by the given time range.
Scan the entire table with a time filter on viewedItems.
The query needs to be performant as the profile could be pulled at request time.
Ability to update individual customer record (via a batch job, for example, that updates each customer's record if need be).
One way to do this would be to create a different table (say customer_viewed_items) with hash key customerId and a range key timestamp with value being the itemId that the customer viewed. But this looks like an increasingly complicated schema - not to mention twice the cost involved in accessing the item. If I have to create another attribute based on (say) "bought" items, then I'll need to create another table. So, the solution I have in mind does not seem good to me.
Would really appreciate if you could help suggest a better schema/approach.

Since you cannot know how many items a user will view (edge case: the user opens all items sequentially, multiple times), you cannot store this information in a single DynamoDB record.
The only solution is to normalize your database and create a separate table, as you've described.
The next question is how to minimize the retrieval cost in such a scheme. Usually you don't need to fetch all viewed items. You probably want to display only some of them, in which case you need to fetch only the last X.
You can cache those items in the main customer table, i.e. create a field "lastXviewedItems" and keep it updated so that it contains only a limited number of items and never breaks the size limit. Of course, for BI analysis you will have to store them in the second table too.
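The bounded "lastXviewedItems" field can be sketched in plain Python as below. This is only an illustration of the trimming logic, with no DynamoDB calls; the function name record_view and the max_items parameter are assumptions made for the example.

```python
from collections import deque


def record_view(recent_views, item_id, timestamp, max_items=20):
    """Return a new list with the view appended, keeping the newest max_items.

    recent_views is the current content of the cached field, oldest first,
    as a list of (timestamp, item_id) tuples.
    """
    window = deque(recent_views, maxlen=max_items)  # drops the oldest entries
    window.append((timestamp, item_id))
    return list(window)
```

The full history would still be written to the separate table; only this small, fixed-size list is stored on the customer record.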

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?

SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables different options can be used:
if a table explicitly contains a reference to the last change time, this can be used
if a table has guaranteed update characteristics (e.g. no in-place update and monotone ID values), this could be used, e.g. read all records where ID is larger than the last processed ID
if the table does not provide intrinsic information about change time, then one could maintain a copy of the table that contains only the records processed so far. This copy can then be used to compare the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
In my experience, efforts to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming and expensive than using the corresponding features of ETL tools.
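The third option (comparing the current table against a copy of the records processed so far) boils down to a keyed diff. The sketch below shows that comparison in plain Python under simplifying assumptions: both snapshots fit in memory as dictionaries mapping a primary key to a row tuple, and compute_delta is a hypothetical helper, not an SDI or HANA API.

```python
def compute_delta(previous, current):
    """Compare two snapshots keyed by primary key.

    Returns (inserted, updated, deleted): inserted and updated map keys to
    their current rows, deleted lists the keys that disappeared.
    """
    inserted = {key: row for key, row in current.items() if key not in previous}
    updated = {
        key: row
        for key, row in current.items()
        if key in previous and previous[key] != row
    }
    deleted = [key for key in previous if key not in current]
    return inserted, updated, deleted
```

After each nightly run, the current snapshot would replace the stored copy so that the next comparison starts from the records already sent.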

It is possible to create a Log table and organize columns according to your needs so that by creating a trigger on your database tables you can create a log record with timestamp values. Then you can query your log table to determine which records are inserted, updated or deleted from your source tables.
For example, the following is from one of my test triggers:
CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW, NEW ROW MYNEWROW
FOR EACH ROW
BEGIN
    INSERT INTO SalaryLog (Employee, Salary, Operation, DateTime)
    VALUES (:mynewrow.Employee, :mynewrow.Salary, 'U', CURRENT_DATE);
END;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE one.
You can organize your Log table so that you can track more than one table if you wish, just by keeping the table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use separate Log tables for each table.
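To show the three triggers working together, here is a small sketch that runs the same idea in SQLite through Python's sqlite3 module. It is a stand-in only: SQLite trigger syntax differs from HANA's (NEW/OLD instead of REFERENCING), the log column defaults to CURRENT_TIMESTAMP, and the trigger and helper names are assumptions.

```python
import sqlite3


def make_logged_salary_db():
    """Create Salary, SalaryLog and one logging trigger per operation."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Salary (Employee TEXT PRIMARY KEY, Salary INTEGER);
        CREATE TABLE SalaryLog (
            Employee TEXT,
            Salary INTEGER,
            Operation TEXT,
            DateTime TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TRIGGER Salary_A_INS AFTER INSERT ON Salary FOR EACH ROW
        BEGIN
            INSERT INTO SalaryLog (Employee, Salary, Operation)
            VALUES (NEW.Employee, NEW.Salary, 'I');
        END;
        CREATE TRIGGER Salary_A_UPD AFTER UPDATE ON Salary FOR EACH ROW
        BEGIN
            INSERT INTO SalaryLog (Employee, Salary, Operation)
            VALUES (NEW.Employee, NEW.Salary, 'U');
        END;
        CREATE TRIGGER Salary_A_DEL AFTER DELETE ON Salary FOR EACH ROW
        BEGIN
            INSERT INTO SalaryLog (Employee, Salary, Operation)
            VALUES (OLD.Employee, OLD.Salary, 'D');
        END;
        """
    )
    return conn


def changes_since(conn, since):
    """Return (Employee, Operation) pairs logged at or after `since`."""
    return conn.execute(
        "SELECT Employee, Operation FROM SalaryLog "
        "WHERE DateTime >= ? ORDER BY rowid",
        (since,),
    ).fetchall()
```

A nightly job would call something like changes_since with the timestamp of the previous run and export only the affected keys.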

how to deal with virtual index in a database table in Django + PostgreSQL

Here is my current scenario:
Need to add a new field to an existing table that will be used for ordering QuerySet.
This field will be an integer between 1 and not a very high number, I expect less than 1000. The whole reasoning behind this field is to use it for visual ordering on the front-end, thus, index 1 would be the first element to be returned, index 2 second, etc...
This is how the field is defined in model:
priority = models.PositiveSmallIntegerField(
    verbose_name=_(u'Priority'),
    default=0,
    null=True)
I will need to re-arrange (reorder) the whole set of elements in this table if a new or existing element gets this field updated. So for instance, imagine I have 3 objects in this table:
Element A: priority 1
Element B: priority 2
Element C: priority 3
If I change Element C's priority to 1, I should have:
Element C: priority 1
Element A: priority 2
Element B: priority 3
Since this is not a real DB index (and it can have empty values), I'm going to have to query for all elements in the database each time an element is created or updated, and change the priority value for each record in the table. I'm not really worried about performance since the table will always be small, BUT I'm worried that this is not the right way to go, or simply that it generates too much overhead.
Maybe there is a simpler way to do this with plain SQL? If I use a unique index though, I will get an error every time an existing priority is used, which is something I don't want either.
Any pointers?

To insert at the 10th position, all you need is a single SQL query:
MyModel.objects.filter(priority__gte=10).update(priority=models.F('priority')+1)
Then you would need a similar one for deleting an element, and swapping two elements (or whatever your use case requires). It all should be doable in a similar manner with bulk update queries, no need to manually update entry by entry.
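The "move an existing element" case mentioned above can be sketched with the same bulk-update idea. To keep the example runnable without a Django project, it uses sqlite3 and raw SQL rather than the ORM; the elements table, its columns and both helper names are assumptions made for illustration.

```python
import sqlite3


def make_elements_db(names):
    """Create an in-memory table with priorities 1..n in the given order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE elements (id TEXT PRIMARY KEY, priority INTEGER)")
    conn.executemany(
        "INSERT INTO elements VALUES (?, ?)",
        [(name, position) for position, name in enumerate(names, start=1)],
    )
    conn.commit()
    return conn


def move_to_priority(conn, element_id, new_priority):
    """Move one element and shift only the rows between old and new position."""
    with conn:
        (old,) = conn.execute(
            "SELECT priority FROM elements WHERE id = ?", (element_id,)
        ).fetchone()
        if new_priority < old:
            # Moving up: rows in [new, old) slide down by one.
            conn.execute(
                "UPDATE elements SET priority = priority + 1 "
                "WHERE priority >= ? AND priority < ?",
                (new_priority, old),
            )
        elif new_priority > old:
            # Moving down: rows in (old, new] slide up by one.
            conn.execute(
                "UPDATE elements SET priority = priority - 1 "
                "WHERE priority > ? AND priority <= ?",
                (old, new_priority),
            )
        conn.execute(
            "UPDATE elements SET priority = ? WHERE id = ?",
            (new_priority, element_id),
        )
```

With the three elements from the question, moving C to priority 1 takes one shifting UPDATE plus one UPDATE for C itself, regardless of how many rows sit in between.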

First, you can very well index this column, just don't require it to contain unique values. Such standard indexes can have nulls and duplicates; they are just used to locate the row(s) matching a criterion.
Second, whether to update each populated* row every time you insert/update a record should be decided based on the expected update frequency. If each user is inserting several records each time they use the system and you have thousands of concurrent users, it might not be a good idea, whereas if you have a single user updating any number of rows once in a while, it is not much of an issue. In the same vein, you need to consider whether other updates are occurring on the same rows. You don't want to lock all rows too often if they also need to be updated/locked for changes to other fields.
*: to be accurate, you wouldn't update all populated rows, but only the ones positioned at or after the inserted one (inserting at priority 999 would only change the priority of the items at 999 and 1000).

Redshift: Should the sortkey contain the distkey?

We have customer data that is sharded by a company ID. That is, no company's data would ever mix with another company's data, so this was chosen as the distkey.
Should the company ID be the first column in the sortkey, given that a node may contain several thousand companies? Or does the distkey already limit the data to a given company before it starts scanning?

Dist key does not affect the order in which rows are stored in each node/slice/block. Sort key (or natural order in the absence of such) defines the order.
If you expect frequent queries with company_id and you want to achieve maximum performance, make company_id the main sort key (COMPOUND or default, not just INTERLEAVED).
I'd also advise familiarising yourself with the SVL_QUERY_REPORT view. It can tell you whether full-scan was used (or range-restricted when using optimal sort keys), against which slices, and how many rows were actually scanned. Try different table layouts for the same data, and not only look at query times, but also confirm from this report that Redshift does what you expect it to do.

Partitioning a table in sybase-select query

My main concern:
I have an existing table with a huge amount of data. It has a clustered index.
My C++ process has a list of many keys. For each key it checks whether the key exists in the table,
and if it does, it checks whether the row in the table and the new row are the same. If there is a change, the new row is written to the table.
In general there will be few changes, but the table holds a huge amount of data.
So there will be a lot of select queries but not many update queries.
What I would like to achieve:
I just read about partitioning a table in sybase here.
I wanted to know whether this will be helpful for me, as the article only mentions insert queries. How can I improve my select query performance?
Could anyone please suggest what should I look for in this case?

Yes, it will improve your query (read) performance so long as your query is based on the partition keys defined. Indexes can also be partitioned, and it stands to reason that a smaller index will mean faster read performance.
For example, if you had a query like select * from contacts where lastName = 'Smith' and you have partitioned your table index based on the first letter of lastName, then the server only has to search one partition, "S", to retrieve its results.
Be warned that partitioning your data can be difficult if you have a lot of different query profiles. Queries that do not include the index partition key (e.g. lastName), such as select * from staff where created > [some_date], will then have to hit every index partition in order to retrieve their result set.
No one can tell you what you should/shouldn't do, as it is very application specific and you will have to perform your own analysis. Before meddling with partitions, my advice is to ensure that you have the correct indexes in place, that they are being hit by your queries (i.e. no table scans), that your server is appropriately resourced (i.e. enough fast disk and RAM), and that you have tuned your server caches to suit your queries.
