Kettle PDI how to pass multiple parameters not used in Table input

I'm converting data from one database to another with a slightly different structure. In my flow at some point I need to read data from the first database filtering on the id coming from previous steps.
This is the image of my flow:
In the step "ZtlBus note" the query is:
SELECT e.*, UNIX_TIMESTAMP(ve.dataInserimento)*1000 as timestamp
FROM verbale_evento ve JOIN evento e ON ve.eventi_id=e.id
WHERE ve.Verbale_id=? AND e.titolo='Note verbale'
Because I have just one parameter, in the previous step I use a Select values step. Unfortunately, after the Table input I need other fields coming from previous steps (the Audit step), as marked in the picture.
I'm wondering how I can keep these fields after the Table input. Any advice is appreciated.

If you use the "Database Join" step instead of the Table input step, you will be able to keep the previous values of your transformation.
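As a minimal sketch: keep essentially the same query in the Database Join step, list the incoming id field (for example Verbale_id, or whatever your previous step actually outputs) as the parameter for the single ?, and set the number of rows to return as needed. Unlike the Table input, this step appends the returned columns to each incoming row, so the Audit fields keep flowing through:

SELECT e.*, UNIX_TIMESTAMP(ve.dataInserimento)*1000 as timestamp
FROM verbale_evento ve JOIN evento e ON ve.eventi_id=e.id
WHERE ve.Verbale_id=? AND e.titolo='Note verbale'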

Power BI Web.Contents variables

Please excuse my lack of knowledge in explaining my problem, as I have only just started learning Power BI.
I am attempting to return data by using a dynamic variable within my source URL.
I have successfully returned the data I needed from multiple queries.
However, I am trying to run a final query in which a job ID needs to be specified.
Source = Json.Document(Web.Contents("https://api.****.com/jobs/{ID}/invoices", [Headers=[Authorization="Bearer "&GetToken()]]))
Here {ID} is the variable I need to populate.
I have successfully returned values by hard-coding the variable.
However, I would like to make it dynamic so that it returns the values for all the job IDs within the "jobs" table.
I don't know if what I'm asking is possible, or if my explanation is good enough, but any help would be greatly appreciated!
What you are looking for is a custom function.
Make a function out of your query above by adding (ID) => as the first line and splicing "ID" into your URL string.
(ID) =>
let
Source = Json.Document(Web.Contents("https://api.****.com/jobs/{" & ID & "}/invoices", [Headers=[Authorization="Bearer "&GetToken()]]))
in
Source
Of course you can add all your other transformation steps too.
Now take your JobIDs table and add a column by invoking a custom function, select the above function and take the ID parameter from your ID column.
For every row you'll get a separate table and all that's left is simply expanding these tables into your query.
This will solve your problem.

GroupBy existing attribute present in JSON string line in Apache Beam Java

I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps. I have to pick the latest one for each customer. I am planning to achieve this as below:
Read files
Group by customer id
Apply a DoFn to compare the timestamps of the records in each group and keep only the latest one
Flatten it, convert to table rows and insert into BQ.
But I am unable to proceed with step 1. I see GroupByKey.create() but I am unable to make it use the customer ID as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you apply GroupByKey you need to have your dataset in key-value pairs. It would have been good if you had shown some of your code, but without knowing much more, you'd do something like the following:
// Read the files and parse each line into a JsonObject (details elided)
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...);

// Once we have the data in JsonObjects, we key by customer ID and group:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(JsonObject.class)))
            .via(elm -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check the timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc - if you get stuck with that we can iterate.
As a hint / tip, you can consider this example of a Json Coder.

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables, different options can be used:
if a table explicitly contains a reference to the last change time, this can be used
if a table has guaranteed update characteristics (e.g. no in-place updates and monotonically increasing ID values), this could be used, e.g. read all records where the ID is larger than the last processed ID (see the sketch after this list)
if the table does not provide intrinsic information about change time, then one could maintain a copy of the table that contains only the records processed so far; this copy can then be used to compare against the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
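As a rough sketch of the ID-based option (my_table, id and :last_processed_id are made-up placeholder names, not anything HANA provides out of the box):

-- :last_processed_id = highest ID already extracted, e.g. kept in a small control table
SELECT *
FROM my_table
WHERE id > :last_processed_id
ORDER BY id;

The nightly job would then update the stored last-processed ID after a successful extract.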
In my experience, efforts to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming and expensive than using the corresponding features of ETL tools.
You can also create a log table, organize its columns according to your needs, and create triggers on your database tables that write a log record with a timestamp for every change. Then you can query the log table to determine which records were inserted, updated or deleted in your source tables.
For example, the following is one of my test triggers:
CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW, NEW ROW MYNEWROW
FOR EACH ROW
BEGIN
    INSERT INTO SalaryLog (
        Employee,
        Salary,
        Operation,
        DateTime )
    VALUES (
        :mynewrow.Employee,
        :mynewrow.Salary,
        'U',
        CURRENT_DATE );
END;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE one.
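For instance, a matching delete trigger might look like the sketch below (the trigger name is made up and the same SALARY and SalaryLog tables are assumed; deleted values are read from the OLD ROW reference):

-- hypothetical companion trigger, mirroring the update trigger above
CREATE TRIGGER "A00077387"."SALARY_A_DEL"
AFTER DELETE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW
FOR EACH ROW
BEGIN
    INSERT INTO SalaryLog (
        Employee,
        Salary,
        Operation,
        DateTime )
    VALUES (
        :myoldrow.Employee,
        :myoldrow.Salary,
        'D',
        CURRENT_DATE );
END;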
You can organize your log table so that it can track more than one source table if you wish, by keeping the table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use a separate log table for each source table.

Pentaho PDI get SQL SUM() with conditions

I'm using Pentaho PDI 7.1. I'm trying to convert data from MySQL to MySQL, changing the structure of the data.
I'm reading the source table (customers) and for each row I have to run another query to calculate the balance.
I was trying to use the Database value lookup step to accomplish this, but maybe it is not the best way.
I have to run a query like this to get the balance:
SELECT SUM(CASE WHEN direzione='ENTRATA' THEN -importo ELSE +importo END)
FROM Movimento
WHERE contoFidelizzato_id = ?
I need to set the parameter from the value coming from the previous step. Any advice?
The Database lookup may be a good idea, especially if you are used to database reasoning, but it may result in one query per incoming row, which may not be the most efficient approach.
A more PDI-ish style would be to make the query like:
SELECT contoFidelizzato_id
, SUM(CASE WHEN direzione='ENTRATA' THEN -importo ELSE +importo END)
FROM Movimento
GROUP BY contoFidelizzato_id
and use it as the lookup ("info") source of a Stream lookup step.
An even more PDI-ish style would be to divert the source table (customer) into two flows: one in which you keep the source rows, and one which you group by contoFidelizzato_id. Of course, you need a Formula step, some JavaScript, or a formula in the SQL of the Table input to change the sign when needed, as sketched below.
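For instance, if the movements come from a Table input, the sign change can be done directly in its SQL (a sketch; importo_signed is just a made-up alias), and a Group by step on contoFidelizzato_id summing importo_signed then gives the balance:

SELECT contoFidelizzato_id,
       CASE WHEN direzione='ENTRATA' THEN -importo ELSE importo END AS importo_signed
FROM Movimento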
Test both to find out which strategy is better in your case. You'll soon discover that PDI is very good at handling large data sets.

Kettle PDI how to define parameters before Table input

I'm converting data from one database to another with a slightly different structure.
In my flow at some point I need to read data from the first database filtering on the id coming from previous steps.
This is the image of my flow
The last step is where I need to filter data. The query is:
SELECT e.*, UNIX_TIMESTAMP(ve.dataInserimento)*1000 as timestamp
FROM verbale_evento ve JOIN evento e ON ve.eventi_id=e.id
WHERE ve.Verbale_id=? AND e.titolo='Note verbale'
Unfortunately ve.Verbale_id is a column of the first table (read in the first step). How can I filter by that field?
Right now I get this error:
2017/12/22 15:01:00 - Error setting value #2 [Boolean] on prepared statement
2017/12/22 15:01:00 - Parameter index out of range (2 > number of parameters, which is 1).
I need to do this query at the end of the entire transformation.
You can pass previous rows of data as parameters.
However, the number of parameter placeholders in the Table input query must match the number of fields of the incoming data stream. Also, order matters.
Try trimming the data stream to only the field you want to pass, using a Select values step, and then choose that step in the "get data from" box near the bottom of the Table input. Also, check the "execute for each input row" option.