Join data from different sources in Kettle

My database has the following tables:
USER
userid INT
username STRING
USER_SESSION_HISTORY
userid INT (foreign key to USER)
login_date DATETIME
Then I have a CSV with this header:
username;login_date
I need to insert the CSV data into the USER_SESSION_HISTORY table. As you can see, I need to join the two data sources (the USER table and the CSV file) to be able to get the user id.
I'm new to Kettle and have only learned very simple transformations so far.

You need to read the CSV and the USER table. After those two steps, add a Sort step for each stream (check the case-sensitive/insensitive option in the Sort step). Then merge both streams with a Merge join step set up as a LEFT OUTER JOIN: the CSV data (coming from its Sort step) on the left, the USER data (coming from its Sort step) on the right. That way, for each username in the CSV you can check whether it already exists in the USER table.
Following the Merge join, put a Filter step that checks whether userid IS NULL. If it is NULL, the username in the CSV does not exist yet, so you will need to insert it into the USER table first.
If you have the userid (the filter condition is false in the previous step), you can simply insert the data into USER_SESSION_HISTORY with the userid retrieved from the Merge join.
For the true condition in the filter (no matching row in the USER table), it depends on how you generate the userid. If the userid column is backed by a sequence and is filled automatically with its next value, you can insert just the username into the USER table and the database will take care of the userid. If you cannot simply insert the username, you will have to add intermediate steps to generate the userid, depending on how you handle it in the database.
I don't know whether the generated userid will be visible after inserting into the USER table; you can test it. If it is available in the same transformation, add a Block step after the USER insert step that waits for the USER_SESSION_HISTORY insert step (the one handling the false branch of the filter, where the username was already present) to finish. The Block step is needed because Pentaho runs all steps in parallel unless you use this block step, so otherwise the USER_SESSION_HISTORY table would be hit by two transactions executing at the same time. After the Block step, add a second insert step for USER_SESSION_HISTORY.
If the userid is not available after inserting, I think the easier approach is to use two transformations: the first inserts the new usernames into the USER table, and the second inserts the data into USER_SESSION_HISTORY. In the second transformation we are sure that all the usernames are already available in the USER table.
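As a rough illustration of what the Merge join and Filter steps do with these two streams, here is a minimal Python/pandas sketch of the same logic outside Kettle; the sample rows are made up and the actual database inserts are left out:
import pandas as pd

# CSV stream: username;login_date (header from the question), made-up sample rows
csv_rows = pd.DataFrame({"username": ["alice", "bob"],
                         "login_date": ["2017-05-01 10:00:00", "2017-05-02 11:30:00"]})
# USER table stream: userid, username (made-up sample row)
users = pd.DataFrame({"userid": [1], "username": ["alice"]})

# Merge join step: LEFT OUTER JOIN, CSV stream on the left, USER stream on the right
joined = csv_rows.merge(users, how="left", on="username")

# Filter step: userid IS NULL means the username is not in USER yet
missing_users = joined[joined["userid"].isna()]   # true branch: insert into USER first
ready_rows = joined[joined["userid"].notna()]     # false branch: straight into USER_SESSION_HISTORY

print(missing_users)
print(ready_rows)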

Related

DynamoDB record size increasing with time

I have a customer table in DynamoDB with basic attributes like name, dob, zipcode, email, etc. I want to add another attribute to it which will keep increasing with time. For example, each time the user clicks on a product (item), I want to add that to the record so that I have the full snapshot of the customer's profile in a single value indexed by the customerId. So, my new attribute would be called viewedItems and would be a list of itemIds viewed (along with the timestamp).
However, given the 4KB size limit for DynamoDB value, it is going to be surpassed with time as I keep adding the clicked products to the customer profile.
How can I best define my objects so as to perform the following?
Access the full profile of the customer by customerId, including the views.
Access a time-filtered profile of the customer (like all interactions in the last N days), in which case the viewed items should be filtered by the given time range.
Scan the entire table with a time filter on viewedItems.
The query needs to be performant as the profile could be pulled at request time.
Ability to update individual customer record (via a batch job, for example, that updates each customer's record if need be).
One way to do this would be to create a different table (say customer_viewed_items) with hash key customerId and a range key timestamp with value being the itemId that the customer viewed. But this looks like an increasingly complicated schema - not to mention twice the cost involved in accessing the item. If I have to create another attribute based on (say) "bought" items, then I'll need to create another table. So, the solution I have in mind does not seem good to me.
Would really appreciate if you could help suggest a better schema/approach.
Since you really don't know how many items a user will view (edge case: a user opens all items sequentially, multiple times), you cannot store this information in a single DynamoDB record.
The only solution is to normalize your database and create a separate table like the one you've described.
Now, the next question: how do you minimize retrieval cost with such a schema? Usually you don't need to fetch all viewed items; you probably want to display only some of them, so you only need to fetch the last X.
You can cache those items in the main customer table, i.e. create a field "lastXviewedItems" and keep it updated so that it contains only a limited number of items without breaking the size limit. Of course, for BI analysis you will still have to store them in the second table as well.
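Here is a minimal boto3 sketch of that layout, assuming the table names customer and customer_viewed_items from the question and the lastXviewedItems field from this answer; the trim is a plain read-check-remove and is not concurrency safe:
import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")
customer = dynamodb.Table("customer")             # main profile table, hash key customerId
viewed = dynamodb.Table("customer_viewed_items")  # full history, hash customerId / range timestamp

def record_view(customer_id, item_id, keep_last=20):
    ts = datetime.now(timezone.utc).isoformat()
    # Full history goes to the normalized table.
    viewed.put_item(Item={"customerId": customer_id, "timestamp": ts, "itemId": item_id})
    # The main table only caches the last few views, so the item stays small.
    customer.update_item(
        Key={"customerId": customer_id},
        UpdateExpression="SET lastXviewedItems = "
                         "list_append(if_not_exists(lastXviewedItems, :empty), :new)",
        ExpressionAttributeValues={":empty": [], ":new": [{"itemId": item_id, "ts": ts}]})
    # Trim the cached list if it grew past keep_last.
    profile = customer.get_item(Key={"customerId": customer_id})["Item"]
    extra = len(profile.get("lastXviewedItems", [])) - keep_last
    if extra > 0:
        customer.update_item(
            Key={"customerId": customer_id},
            UpdateExpression="REMOVE " + ", ".join(
                "lastXviewedItems[%d]" % i for i in range(extra)))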

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables, different options can be used:
if a table explicitly contains a reference to the last change time, this can be used
if a table has guaranteed update characteristics (e.g. no in-place updates and monotone ID values), this could be used, e.g. read all records where the ID is larger than the last processed ID (see the sketch after this list)
if the table does not provide intrinsic information about change time, one could maintain a copy of the table that contains only the records processed so far. This copy can then be used to compare against the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
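Here is a minimal sketch of that last-processed-ID approach, assuming a monotonically increasing ID column and the SAP HANA Python driver (hdbcli); the table and column names (ORDERS, ORDER_ID) and the connection details are made up for illustration:
from hdbcli import dbapi

def fetch_delta(conn, last_processed_id):
    # Read only the records added since the previous nightly run.
    cur = conn.cursor()
    cur.execute(
        "SELECT * FROM ORDERS WHERE ORDER_ID > ? ORDER BY ORDER_ID",
        (last_processed_id,))
    rows = cur.fetchall()
    cur.close()
    return rows

conn = dbapi.connect(address="hanahost", port=30015, user="DELTA_USER", password="...")
delta = fetch_delta(conn, last_processed_id=42)
# Persist the highest ORDER_ID of this batch so the next run knows where to resume.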
In my experience, efforts to try to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming and expensive than using the corresponding features of ETL tools.
It is possible to create a log table with columns organized according to your needs, and to create triggers on your database tables that write a log record with timestamp values. Then you can query your log table to determine which records were inserted, updated or deleted in your source tables.
For example, the following is from one of my test triggers:
CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW, NEW ROW MYNEWROW
FOR EACH ROW
BEGIN
  -- log the new values of the updated row, flagged with operation 'U'
  INSERT INTO SalaryLog (Employee, Salary, Operation, DateTime)
  VALUES (:mynewrow.Employee, :mynewrow.Salary, 'U', CURRENT_DATE);
END;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE one.
You can organize your log table so that it can track more than one table if you wish, just by keeping the table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use separate log tables for each table.

Kettle PDI how to pass multiple parameters not used in Table Input

I'm converting data from one database to another with a slightly different structure. In my flow at some point I need to read data from the first database filtering on the id coming from previous steps.
This is the image of my flow:
In the step "ZtlBus note" the query is:
SELECT e.*,UNIX_TIMESTAMP(v.dataInserimento)*1000 as timestamp
FROM verbale_evento ve JOIN evento e ON ve.eventi_id=e.id
WHERE ve.Verbale_id=? AND e.titolo='Note verbale'
Because I have just one parameter, in the previous step I use a Select values step. Unfortunately, after the Table input I need other fields coming from previous steps (the Audit step), as marked in the picture.
I'm wondering how I can pass these fields after the Table input. Any advice is appreciated.
If you use the "Database join" step instead of the Table input step, you will be able to keep the previous values of your transformation.
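For illustration, this is roughly what the Database join step does for each incoming row, sketched here in Python with mysql-connector; the connection details and the sample input row are made up, and the query is the one from the question (reading its stray v. alias as ve.):
import mysql.connector  # assuming the source database is MySQL, as UNIX_TIMESTAMP suggests

conn = mysql.connector.connect(host="localhost", user="etl", password="...", database="source_db")
cur = conn.cursor(dictionary=True)

# One incoming row from the previous steps, Audit fields included (made up for illustration)
incoming = {"Verbale_id": 10, "audit_user": "mrossi"}

cur.execute(
    "SELECT e.*, UNIX_TIMESTAMP(ve.dataInserimento)*1000 AS timestamp "
    "FROM verbale_evento ve JOIN evento e ON ve.eventi_id = e.id "
    "WHERE ve.Verbale_id = %s AND e.titolo = 'Note verbale'",
    (incoming["Verbale_id"],))
for looked_up in cur:
    # The step appends the returned columns to the existing row, so the earlier
    # fields (e.g. the Audit ones) stay available downstream.
    print({**incoming, **looked_up})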

Fastest way to select several inserted rows

I have a table in a database which stores items. Each item has a unique ID, which the DB generates upon insertion (auto-increment).
A user may perform a specific task that will add X items to the database, however my program (C++ server application using MySQL connector) should return the IDs that the database generated right away. For example, if I add 6 items, the server must return 6 new unique IDs to the client.
What is the fastest/cleanest way to do such a thing? So far I have been doing INSERT followed by SELECT for each new item, or INSERT followed by last_insert_id; however, if there are 50 items to add it takes at least a few seconds, which is not good at all for the user experience.
sql_task.query("INSERT INTO `ItemDB` (`ItemName`, `Type`, `Time`) VALUES ('%s', '%d', '%d')", strName.c_str(), uiType, uiTime);
Getting the ID:
uint64_t item_id { sql_task.last_id() }; //This calls mysql_insert_id
I believe you need to rethink your design slightly. Let's use the analogy of a sales order. With a sales order (or invoice #) the user gets an invoice number (auto_incr) as well as multiple line item numbers (also auto_inc).
The sales order and all of the line items are selected for insert (from the GUI) and the inserts are performed. First, the sales order row is inserted and its id is saved in a variable for subsequent calls to insert the line items. But the line items are then just inserted without immediately returning their auto_inc id values. The application is merely returned the sales order number in the end. How your app uses that sales order number in subsequent calls is up to you. But it does not need to retrieve all the X or 50 row ids at once, as it has the sales order number saved somewhere. Let's call that sales order number XYZ.
When you actually need the information, an example call could look like
select lineItemId
from lineItems
where salesOrderNumber=XYZ
order by lineItemId
You need to remember that in a multi-user system there is no guarantee of receiving a contiguous block of numbers. Nor should it matter to you, as they are all attached appropriately to the correct sales order number.
Again, the above is just an analogy, used for illustration purposes.
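A rough sketch of that pattern, assuming made-up salesOrders and lineItems tables and written in Python with mysql-connector rather than the C++ connector used in the question; only the parent id is fetched immediately, and the line item ids are looked up later with the query above:
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app", password="...", database="shop")
cur = conn.cursor()

# Insert the parent row and keep its auto-increment id (one round trip).
cur.execute("INSERT INTO salesOrders (customer) VALUES (%s)", ("ACME",))
sales_order_id = cur.lastrowid

# Insert all line items in one statement; no per-row id retrieval needed.
items = [(sales_order_id, "ItemA"), (sales_order_id, "ItemB"), (sales_order_id, "ItemC")]
cur.executemany("INSERT INTO lineItems (salesOrderNumber, itemName) VALUES (%s, %s)", items)
conn.commit()

# When the ids are actually needed, fetch them by the parent id.
cur.execute(
    "SELECT lineItemId FROM lineItems WHERE salesOrderNumber = %s ORDER BY lineItemId",
    (sales_order_id,))
line_item_ids = [row[0] for row in cur.fetchall()]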
That's a common but hard-to-solve problem. I'm unsure about MySQL, but PostgreSQL uses sequences to generate automatic ids. Inserting frameworks (object-relational mappers) use that when they expect to insert many values: they query the sequence directly for a batch of IDs and then insert the new rows using those already-known IDs. That way, there is no need for an additional query after each insert to get the ID.
The downside is that the relation between ID and insertion time can be non-monotonic when different writers intermix their inserts. That is not a problem for the database, but some (poorly written?) programs might expect it to be.
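A minimal sketch of that pre-allocation idea with PostgreSQL and psycopg2; the table name items and its sequence name items_id_seq are assumptions for illustration:
import psycopg2

conn = psycopg2.connect("dbname=shop user=app")
cur = conn.cursor()

# Reserve 6 ids from the sequence up front, in a single round trip.
cur.execute("SELECT nextval('items_id_seq') FROM generate_series(1, 6)")
ids = [row[0] for row in cur.fetchall()]

# Insert the rows with the ids we already know; no SELECT is needed afterwards.
rows = [(item_id, "item-%d" % item_id) for item_id in ids]
cur.executemany("INSERT INTO items (id, name) VALUES (%s, %s)", rows)
conn.commit()

print(ids)  # the 6 new unique ids, ready to return to the client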
As your ID is auto-incremental, you can do only two SELECT queries, before and after the INSERT queries:
SELECT AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'dbTable' AND table_schema = DATABASE();
--
-- INSERT INTO dbTable... (one or many, does not matter);
--
SELECT LAST_INSERT_ID() AS lastID;
This will give you the range between the first and last inserted IDs. Then you can easily calculate how many there are.

What's cheaper on DynamoDB (GSI vs multiple tables)

I have an issue of making a username AND an email unique. It is quite easy with a relational database: just do 2 queries and get the count back on each.
select count(email) from users;
select count(username) from users;
But in DynamoDB (NoSQL) is it better (i.e. cheaper) to have 2 tables like so:
username table (where username is the hash) and check that table with a PUT and attribute_does_not_exist
AND
email table (where email is the hash) and check that table after the first one with a PUT and attribute_does_not_exist
OR do I
email table (hash) and username (GSI in that table). Then query the GSI first and if it doesn't exist then do a PUT with email and username
Which is better (cheaper)?
Two questions so I'll address them separately.
Which is cheaper?
You can run a single table with one GSI, or two tables, for the exact same cost if you want to, because throughput for GSIs is provisioned the same way the primary table's throughput is.
Cost should not be a deciding factor.
Which is better?
The fact that DynamoDB makes it difficult to have a secondary attribute retain its uniqueness is a common problem. Because of the asynchronous nature of GSIs, the HASH or HASH/RANGE combination for a GSI is not unique. This can be taken advantage of in some circumstances.
If you use two tables, you take on the responsibility of keeping both tables in sync (something that is not easy to do in many situations). This comes with some important questions (what happens if your app dies after writing to the first table but before it writes to the second?), but this additional responsibility is what allows you to maintain the uniqueness you want.
To explain how you would actually accomplish the dual uniqueness while maintaining accuracy, you would want to take advantage of conditional writes. The following outline describes a series of steps that would ensure that you maintain uniqueness.
Write record to username table with condition that username is not in the table, but include a conditional flag set to false (if write fails, we bail)
Write record to email table with condition that email is not in the table (if write fails, we delete the previous username record)
Update the username record to set the conditional flag to true
The reason you want the conditional flag on the username record, essentially marking it as not yet in a valid state, is to ensure that you actually maintain the uniqueness.
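A minimal boto3 sketch of those three steps, assuming tables named usernames and emails and calling the conditional flag confirmed (both names are assumptions); a real implementation should inspect the error code for ConditionalCheckFailedException rather than catching any ClientError:
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
usernames = dynamodb.Table("usernames")  # hash key: username
emails = dynamodb.Table("emails")        # hash key: email

def register(username, email):
    # 1. Claim the username with a conditional write, flag still false.
    try:
        usernames.put_item(
            Item={"username": username, "email": email, "confirmed": False},
            ConditionExpression="attribute_not_exists(username)")
    except ClientError:
        return False  # username already taken, bail
    # 2. Claim the email; on failure, roll back the username record.
    try:
        emails.put_item(
            Item={"email": email, "username": username},
            ConditionExpression="attribute_not_exists(email)")
    except ClientError:
        usernames.delete_item(Key={"username": username})
        return False  # email already taken
    # 3. Mark the username record as valid.
    usernames.update_item(
        Key={"username": username},
        UpdateExpression="SET confirmed = :t",
        ExpressionAttributeValues={":t": True})
    return True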