Kettle PDI: lookup then insert/update, or insert/update then lookup?

In Kettle, aka Pentaho Data Integration, I read an XLS file with products linked to categories and insert them into a database.
The category-product relationship is 1:n (one category has many products; each product belongs to one category).
I insert the category first, then the product.
CASE 1:
1. Insert/update the category (in fact I only insert);
2. Look up the category by code and return the id used in the later steps.
CASE 2:
1. Look up the category by code;
2. Filter rows: if id > 0, go on to the other steps; else go to step 3;
3. Insert the category and return its id.
Which is better (faster, lower memory use): case 1 or case 2?
The same choice applies to sub-category, supplier and other related entities.
Currently I use case 1; PDI processes 4 records per second and I have files with 100k records.

I suggest using the second method: read the products and, for each product, use the Stream Lookup step to find the product's category.
One reason to go that way is that it matches how a human thinks about the problem. Another is that the Stream Lookup step (not the Database Lookup) is well optimized, in some cases even quicker than a database left join.
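For intuition, the Stream Lookup does in memory roughly what this SQL left join would do (the table and column names here are illustrative, not taken from your transformation):

select p.product_code, p.category_code, c.id as category_id
from product_staging p
left join category c on c.code = p.category_code

Products whose category code finds no match come out with a null category_id, which is where the filter-and-insert branch of case 2 takes over.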


GSI vs redundancy in DynamoDB

I have this scenario:
I have to save a lot of shops in a DynamoDB table. Every shop has an ID string, which is its PK.
Every shop has a field "category", a string that indicates its category (food, tattoo, ...).
So far everything is OK.
I have this use case: "given a category, get all the shops of that category".
To accomplish this, two options came to my mind:
1. Create a GSI that has the category ID as its PK and the shop ID as an attribute. This way, with the ID of a category I get all the IDs of the shops in that category, and then for each shop ID I query the main table to get the full info for that shop (name, address, etc.).
2. Create items in the main table with a PK of the form "category_$id" (where $id is the category ID) and the shop ID as an attribute. As with the GSI, given a category ID I get the set of shop IDs and then, for each ID, I query the same table to get the full info for that shop.
I wanted to know what the difference between these two options is in terms of cost/benefit, and which is best.
They seem substantially the same to me (the only difference is that the first uses another table, i.e. the index, while the second uses the same table), but I await the opinion of someone more experienced than me.
One benefit of the GSI is that it results in less management. Let's say you delete a record from, or add a record to, the main table: this will automatically be reflected in your GSI.
In contrast, if you maintain the two copies independently, you have to manage the synchronization between them yourself.
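A rough sketch of the two layouts (attribute names are illustrative): note that if you also project into the GSI the shop attributes you actually need (name, address, ...), the follow-up queries against the main table disappear entirely.

Main table:            PK = shop_id, attributes: category, name, address, ...
Option 1 (GSI):        PK = category, SK = shop_id, projected: name, address, ...
Option 2 (same table): PK = "category_$id", attribute: shop_id (duplicate items your own code must keep in sync)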

Index versioned (historical) data

I have a DynamoDB table storing all the old versions of each item. Is there a way to index the version data using a search engine or something else so that I can handle the following scenarios gracefully?
Use case 1:
Search for current items that match a specific query, with sorting and pagination, and get their historical data for a date range (between January and June, what was the status of each of those items in each month?). Ideally this would happen in one query.
Use case 2:
Find differences in a field based on the version history,
e.g. find all items whose field changed from 'A' to 'B'.
Current state:
Use case 1:
1. Get all the items matching the query.
2. For each item, issue a separate query to get all the versions of that item.
3. Do a lot of in-memory filtering in the back end to get the data within the date range.
Use case 2:
Not supported at the moment.

Fastest way to select several inserted rows

I have a table in a database which stores items. Each item has a unique ID, which the DB generates upon insertion (auto-increment).
A user may perform a specific task that will add X items to the database; however, my program (a C++ server application using the MySQL connector) should return the IDs that the database generated right away. For example, if I add 6 items, the server must return 6 new unique IDs to the client.
What is the fastest/cleanest way to do such a thing? So far I have been doing an INSERT followed by a SELECT for each new item, or an INSERT followed by last_insert_id; however, if there are 50 items to add, it takes at least a few seconds, which is not good at all for the user experience.
sql_task.query("INSERT INTO `ItemDB` (`ItemName`, `Type`, `Time`) VALUES ('%s', '%d', '%d')", strName.c_str(), uiType, uiTime);
Getting the ID:
uint64_t item_id { sql_task.last_id() }; //This calls mysql_insert_id
I believe you need to rethink your design slightly. Let's use the analogy of a sales order. With a sales order (or invoice #) the user gets an invoice number (auto_inc) as well as multiple line-item numbers (also auto_inc).
The sales order and all of its line items are submitted for insert (from the GUI) and the inserts are performed. First, the sales order row is inserted and its id is saved in a variable for the subsequent calls that insert the line items. The line items are then just inserted, without immediately returning their auto_inc id values. The application is merely returned the sales order number in the end. How your app uses that sales order number in subsequent calls is up to you, but it does not need to retrieve all the X or 50 rows at once, as it has the sales order number safely saved somewhere. Let's call that sales order number XYZ.
When you actually need the information, an example call could look like
select lineItemId
from lineItems
where salesOrderNumber=XYZ
order by lineItemId
You need to remember that in a multi-user system there is no guarantee of receiving a contiguous block of numbers. Nor should it matter to you, as the rows are all attached appropriately to the correct sales order number.
Again, the above is just an analogy, used for illustration purposes.
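A minimal MySQL sketch of that pattern, using hypothetical salesOrders and lineItems tables to stand in for the real schema:

INSERT INTO salesOrders (customerName) VALUES ('Smith');
SET @orderId = LAST_INSERT_ID(); -- capture the parent id once
-- child rows carry the parent id; their own auto_inc ids are not fetched now
INSERT INTO lineItems (salesOrderNumber, itemName) VALUES
(@orderId, 'first item'),
(@orderId, 'second item'),
(@orderId, 'third item');

The select shown above then recovers the line-item ids whenever they are actually needed.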
That's a common but hard-to-solve problem. I'm unsure about MySQL, but PostgreSQL uses sequences to generate automatic ids. Inserting frameworks (object-relational mappers) use that when they expect to insert many values: they query the sequence directly for a batch of IDs and then insert the new rows using those already-known IDs. That way, there is no need for an additional query after each insert to get the ID.
The downside is that the relation between ID and insertion time can be non-monotonic when different writers intermix their inserts. That is not a problem for the database, but some (poorly written?) programs could expect it to be.
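A minimal PostgreSQL sketch of that batch reservation, assuming the items table's id column is backed by a sequence named items_id_seq (both names are assumptions):

-- reserve 6 ids in one round trip
SELECT nextval('items_id_seq') FROM generate_series(1, 6);
-- insert using the reserved ids; no per-row id fetch is needed afterwards
INSERT INTO items (id, name) VALUES (101, 'a'), (102, 'b'), (103, 'c');

The literal ids 101 to 103 stand in for whatever values the first query returned.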
As your ID is auto-incremental, you can do just two extra SELECT queries, before and after the INSERT queries:
SELECT AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'dbTable' AND table_schema = DATABASE();
--
-- INSERT INTO dbTable... (one or many, does not matter);
--
SELECT LAST_INSERT_ID() AS lastID;
This will give you the sequence between the first and last inserted IDs. Then you can easily calculate how many there are.
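A variant of the same idea: assuming InnoDB with innodb_autoinc_lock_mode set to "consecutive" (the default before MySQL 8.0), a single multi-row INSERT is allocated one contiguous block of ids, so one statement per batch is enough:

INSERT INTO `ItemDB` (`ItemName`, `Type`, `Time`) VALUES ('a', 1, 0), ('b', 1, 0), ('c', 1, 0);
-- first id of the block plus the batch size; the ids are firstId .. firstId + inserted - 1
SELECT LAST_INSERT_ID() AS firstId, ROW_COUNT() AS inserted;

With the "interleaved" mode the block is no longer guaranteed to be contiguous, so check the setting before relying on this.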

Partitioning a table in Sybase: SELECT query performance

My main concern:
I have an existing table with a huge amount of data. It has a clustered index.
My C++ process has a list of many keys. For each key it checks whether the key exists in the table,
and if it does, it checks whether the row in the table and the new row are similar; if there is a change, the row in the table is updated.
In general there will be few changes, but there is a huge amount of data in the table.
So it means there will be a lot of select queries but not many update queries.
What I would like to achieve:
I just read about partitioning a table in Sybase here.
I wanted to know whether this would be helpful for me, as the article only mentions insert queries. But how can I improve my select query performance?
Could anyone please suggest what I should look at in this case?
Yes, it will improve your query (read) performance, as long as your queries are based on the partition keys defined. Indexes can also be partitioned, and it stands to reason that a smaller index will mean faster read performance.
For example, if you had a query like select * from contacts where lastName = 'Smith' and you have partitioned your table index based on the first letter of lastName, then the server only has to search one partition, "S", to retrieve its results.
Be warned that partitioning your data can be difficult if you have a lot of different query profiles. Queries that do not include the index partition key (e.g. lastName), such as select * from staff where created > [some_date], will then have to hit every index partition in order to retrieve their result set.
No one can tell you what you should or shouldn't do, as it is very application-specific and you will have to perform your own analysis. Before meddling with partitions, my advice is to ensure that you have the correct indexes in place, that they are being hit by your queries (i.e. no table scans), that your server is appropriately resourced (i.e. it has enough fast disk and RAM), and that you have tuned your server caches to suit your queries.
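For reference, a rough Sybase ASE-style sketch of the lastName partitioning used in the example above (table, column and partition names are illustrative, and the exact syntax should be checked against your ASE version):

create table contacts (
    id int not null,
    firstName varchar(40) null,
    lastName varchar(40) not null
)
partition by range (lastName)
(p_a_to_f values <= ('FZZZ'),
 p_g_to_m values <= ('MZZZ'),
 p_n_to_z values <= (MAX))

A query such as select * from contacts where lastName = 'Smith' can then be served from the p_n_to_z partition alone, while a query that does not filter on lastName still touches every partition.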

DB Structure for a shopping cart

I'd like to develop a shopping cart website with multiple kinds of products
(e.g. mobile phones, furniture, etc.).
Here a mobile phone's specification will cover:
size of display
memory
operating system
camera, etc.
But for furniture the specification is entirely different from the electronic product above:
type of wood
color
weight
shape
glass or matte finish, etc.
My question is: how do I handle a common database table for product specifications?
Each category of product and its spec will differ, so how can I have a common
ProductSpecificationTable?
I have searched many sites, including Google, but wasn't able to find the right solution.
Please help me move to the next step.
Ask yourself the question: how can I model this kind of database? First of all you need products. Every product has to be in some kind of category, and every category has its own properties. So you have to create a product table with a unique id, and every product needs a category id. Then you link your property table to your category table (by id), and to set the values you need a property_value table.
table            -->  id
product          -->  category_id
property         -->  category_id
property_value   -->  property_id
I hope you will understand my explanation; otherwise just ask :)
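A minimal SQL sketch of that layout (table and column names are illustrative):

create table category (
    id int primary key,
    name varchar(50) not null
);
create table product (
    id int primary key,
    category_id int not null references category(id),
    name varchar(100) not null
);
create table property (
    id int primary key,
    category_id int not null references category(id),
    name varchar(50) not null -- e.g. 'size of display' for phones, 'type of wood' for furniture
);
create table property_value (
    product_id int not null references product(id),
    property_id int not null references property(id),
    value varchar(100) not null, -- e.g. '6.1 inch', 'oak'
    primary key (product_id, property_id)
);

Reading a product's full specification is then a join from product through property to property_value.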
You can add one more table to accomplish that: a table that contains cat_id, product_id and the property. That is a many-to-many relationship. I believe you can accomplish it this way.
You can achieve this with a single table in the database, but that will complicate the CRUD operations on the table. So I recommend creating one database, e.g. 'Inventory', which can have multiple tables (one table for each product type).
The first table could be the list of product types you have (mobile phones, accessories, furniture):
You can use this table to populate your list of available items. Here the column _table_name will contain the actual names of the per-type tables.
Then for each product type you can have a different table with a different set of columns:
Table for Product Type Mobile Phones:
Table for Product Type Furniture:
I hope this will help.
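A sketch of what those tables might look like in SQL (column names and types are illustrative):

create table product_types (
    id int primary key,
    type_name varchar(50) not null, -- 'mobile phones', 'furniture', ...
    _table_name varchar(50) not null -- the physical table holding that type's products
);
create table mobile_phones (
    id int primary key,
    display_size varchar(20),
    memory varchar(20),
    operating_system varchar(30),
    camera varchar(30)
);
create table furniture (
    id int primary key,
    type_of_wood varchar(30),
    color varchar(20),
    weight varchar(20),
    shape varchar(20),
    finish varchar(20) -- glass or matte
);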