Kettle ETL transformation hop between steps doesn't work

I am using PDI 6 and am new to PDI. I created these two tables:
create table test11 (
a int
)
create table test12 (
b int
)
I created a simple transformation in PDI with just two steps.
In the first step:
insert into test11 (a)
select 1 as c;
In the second step:
insert into test12 (b)
select 9 where 1 in (select a from test11);
I was hoping the second step would execute AFTER the first step, so the value 9 would be inserted. But when I run it, nothing gets inserted into table test12. It looks to me like the two steps are executed in parallel. To prove this, I eliminated the second step and put the SQL in step 1 like this:
insert into test11 (a)
select 1 as c;
insert into test12 (b)
select 9 where 1 in (select a from test11);
and it worked. So why? I thought one step is one step, so the next step would wait until it finishes, but that is not the case?

In PDI transformations, step initialization and execution happen in parallel. So if you have multiple steps in a single transformation, those steps are executed in parallel and data movement happens in round-robin fashion (by default). This is the main reason your two Execute SQL steps do not work: both steps are executed in parallel. The same is not the case with PDI jobs; job entries run sequentially unless the job is configured to run them in parallel.
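To make the race concrete, here is a rough analogy in plain Python (not PDI itself; the dict below stands in for the two tables): both "steps" are started at the same time, so the second one checks test11 before the first one has written anything to it.
import threading, time
# Rough analogy only: two "steps" launched together, like two hops in a transformation.
state = {}
def step1():
    time.sleep(0.1)            # pretend the INSERT into test11 takes a moment
    state["test11"] = [1]
def step2():
    # runs immediately, before step1 has populated test11
    if 1 in state.get("test11", []):
        state["test12"] = [9]
t1 = threading.Thread(target=step1)
t2 = threading.Thread(target=step2)
t1.start(); t2.start()
t1.join(); t2.join()
print(state)                   # usually {'test11': [1]} -- test12 never gets its row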
Now, for your question, you can try either of the approaches below:
Create two separate transformations with the SQL steps and place them inside a job. The job will execute them in sequence.
Use the "Block this step until steps finish" step in your transformation, which waits for a particular step to finish executing before letting the downstream step run. This is one way to avoid parallelism in transformations. The design of your transformation would be similar to this: a Data Grid feeding the first Execute SQL step, a "Block this step until steps finish" step waiting on it, and then the second Execute SQL step.
The Data Grids are just dummy input steps; there is no need to assign any data to them.
Hope this helps :)

Related

R Studio: Mutating a column variable based on two selection conditions

The dataframe above represents a repeated-measures design; each participant took part in both task A and task B. The condition determines the order in which the tasks occurred: in condition 1, task A came first followed by task B, and vice versa for condition 2.
I would like to mutate a new column in my dataframe called 'First Task'. This column must contain the score from the task that occurred first. For example, participant 1001 was in condition 1, so their score from task A should go into this first-task column. Participant 1002 was in condition 2, so their score from task B should go into the first-task column, and so on.
After scouring possible threads (which have always solved every need I have!), I considered using the mutate function combined with case_when(group == 1), but beyond that I am not sure how to properly pipe something along the lines of 'select the score from task A'. Alternatively, I considered using if or ifelse, which is probably the more likely piece of code to execute something like this?
It is an elegant piece of code like this that I am after, as opposed to re-creating a new dataframe. I would greatly appreciate any thoughts or ideas on this. Let me know if this is clear (note that I have simplified this example to make the question clearer).
Many Thanks community
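For what it's worth, here is the selection logic the question describes, sketched in pandas rather than dplyr purely for illustration (the column names participant, condition, score_a and score_b are assumptions standing in for the real ones); in dplyr the same idea would be mutate(first_task = if_else(condition == 1, score_a, score_b)).
import pandas as pd
# Illustration only: assumed column names, two participants as in the question.
df = pd.DataFrame({
    "participant": [1001, 1002],
    "condition": [1, 2],
    "score_a": [10, 12],
    "score_b": [20, 22],
})
# condition 1 -> task A came first, condition 2 -> task B came first
df["first_task"] = df["score_a"].where(df["condition"] == 1, df["score_b"])
print(df)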

Synchronous + Asynchronous steps in step function

I am very new to Step Functions and still exploring them.
I have a workflow something like this:
--Steps A to C are synchronous.
Step A
if (response is X)
Step B
else
Step C
--Need to return the response to the user here, and run the two steps below asynchronously to unblock the caller of the step function.
Step D
Step E
Is it possible to achieve this? I believe I will append .sync to steps A, B and C, and not append anything to D and E, and it should work. Am I missing anything here?
Note that all steps will be executed by activity workers only.
We can take two approaches.
Break the step function into two. The first three steps go in an Express step function and the last two steps go in a standard step function.
OR
We can have just one step function; wherever we call this step function, we wait for the first three steps to be completed before moving forward. This can be done by calling get-execution-history in a loop to grab the output of the intermediate step. Here is an answer with this approach.
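A rough sketch of that polling approach with boto3 (the state machine ARN, the input payload and the state name "StepC" are placeholders, and a real caller should also put a timeout around the loop):
import json, time
import boto3
sfn = boto3.client("stepfunctions")
# Start the single state machine; D and E keep running after we return.
execution_arn = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:example",
    input=json.dumps({"request": "payload"}),
)["executionArn"]
step_c_output = None
while step_c_output is None:
    # Most recent events first; look for the exit of the intermediate state.
    history = sfn.get_execution_history(executionArn=execution_arn, reverseOrder=True)
    for event in history["events"]:
        details = event.get("stateExitedEventDetails")
        if details and details["name"] == "StepC":
            step_c_output = json.loads(details["output"])
            break
    else:
        time.sleep(1)  # intermediate state not finished yet, poll again
print(step_c_output)  # return this to the user; the remaining steps run on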

Pentaho Kettle Data Integration - How to do a Loop

I hope this message finds you all well!
I'm stuck in the following Spoon situation: I have a variable named Directory that holds the path of a directory from which the transformation reads an XLS file. After that, I run three jobs to complete my flow.
Now, instead of reading just one file, I want to loop over the files. In other words, after reading the first XLS file, the process should get the next one in the directory.
For example:
-> yada.xls -> job 1 -> job 2 -> job 3
-> yada2.xls -> job 1 -> job 2 -> job 3
Have you fellas already faced the same situation?
Any help is welcome!
Loops are not intuitive or very configurable in Spoon/PDI. Normally, you first want to get all the iterations into a list and copy that to "result rows". The next job/transformation entry then has to be configured to "Execute every input row" (checkbox). You can then pass each row individually to that job/transformation in a loop. Specify each "Stream Column Name" from the result rows under the Parameters tab (see the Python sketch after the link below for the general shape of the loop).
Step 1 (generate result rows) --> Step 2 ("Execute every input row")
Step 2 can be a job with multiple steps handling each individual row as parameters.
A related article you may find helpful: https://anotherreeshu.wordpress.com/2014/12/23/using-copy-rows-to-result-in-pentaho-data-integration/
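Outside of Spoon, the shape of that loop is just "collect the file list first, then run the job once per file". A sketch in plain Python using PDI's kitchen.sh command-line runner (the paths, the job file and the Filename parameter are assumptions; check the exact kitchen option syntax for your PDI version):
import glob, subprocess
directory = "/data/input"  # plays the same role as the Directory variable
for xls_file in sorted(glob.glob(directory + "/*.xls")):
    # Run the orchestrating job once per file, passing the file as a parameter.
    subprocess.run(
        ["/opt/pentaho/data-integration/kitchen.sh",
         "-file=/opt/etl/jobs/process_file.kjb",
         "-param:Filename=" + xls_file],   # hypothetical parameter name
        check=True,
    )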

PDI - Block this step until steps finished not working

Why does my "Block this step until steps finish" not work? I need all my insert steps to finish before running the rest. Any suggestions?
All Table Input steps will run in parallel when you execute the transformation.
If you want to hold back a Table Input step, then I suggest adding a constant (e.g. 1) before the block-until step, and in the Table Input step adding a condition like where 1 = ? and enabling the "Execute for each row" option.
You are possibly confusing blocking the data flow and finishing the connection. See there.
As far as I can understand from your questions over the past three months, you should really have a look here and there.
And try to move to writing Jobs (kjb) to orchestrate your transformations (ktr).

How can we count the number of rows in Talend jobs

I have a scenario in which I process my job only when the number of rows is greater than two.
I used the MySqlInput, tMap and tLog components in my job.
You'll want a Run if connection between two components somewhere (both have to be subjob-startable - they should have a green square background when you drop them onto the canvas) and to use the NB_LINE variable from the previous subjob's component, with something like this as your Run if condition (click the link and then click the Component tab):
((Integer)globalMap.get("tMysqlInput_1_NB_LINE")) > 2
Be aware that the NB_LINE functionality is only usable at the end of a subjob and can have "interesting" effects when used mid-job, but the Run if will end that first subjob and conditionally start the second one. If you are unable to find a way to break your job into two subjobs, then you can always use a tHashOutput or tBufferOutput followed by the matching input component and put the Run if link between the two.