Pentaho Kettle Data Integration - How to do a Loop

I hope this message finds you all well!
I'm stuck in the following Spoon situation: I have a variable named Directory. In this variable, I have the path of a directory from which the transformation reads an XLS file. After that, I run three jobs to complete my flow.
Now, instead of reading just one file, I want to loop over the directory. In other words, after reading the first XLS file, the process should pick up the next one in the directory.
For example:
-> yada.xls -> job 1 -> job 2 -> job 3
-> yada2.xls -> job 1 -> job 2 -> job 3
Have you fellas already faced the same situation?
Any help is welcome!

Loops are not intuitive or very configurable in Spoon/PDI. Normally, you first want to get all the iterations into a list and copy that to "result rows". The next step then has to be configured with the "Execute every input row" checkbox. You can then pass each row individually to that job/transformation in a loop. Specify each "Stream Column Name" from the result rows under the Parameters tab.
Step 1 (generate result rows) --> Step 2 ("Execute every input row")
Step 2 can be a job with multiple steps handling each individual row as parameters.
A related article you may find helpful: https://anotherreeshu.wordpress.com/2014/12/23/using-copy-rows-to-result-in-pentaho-data-integration/

Related

How to load multiple files into one file using Informatica

I am new to Informatica. I have created a mapping that uses Expression and Sorter transformations to load multiple files into one single file which has 2 columns:
1. data
2. seq number
All 10 files have random sequence numbers, like
example:
file1
erfef 3
abcdn 1
file 2
wewewr 4
wderfv 5
and so on, up to 10 files.
The Expression transformation code is:
INTEGER(LTRIM(RTRIM(seq_num)),TRUE)
What I want is to load the files into one big file and sort it according to the seq number.
I got data in the output file, but with incorrect sequence numbers.
How do I get the data into the final table in the correct sequence order?
I am doing exactly what is mentioned in the solution below but am still getting the wrong output. I am getting output like:
erfef 3
abcdn 1
wewewr 4
wderfv 5
whereas it should be like this:
abcdn 1
erfef 3
wewewr 4
wderfv 5
Thanks in Advance !!!
Use an indirect file load with a list of files to load all the files together. Then use a Sorter on col2 to order the data. Finally, use a target file to store the data.
The whole mapping should look like this (a rough SQL sketch of the same logic follows the notes below) -
SQ --> EXP--> SRT(key = col2) --> Target
A few things to note -
In the session, use an indirect source file type and give the file list name - mention filelist1.txt.
Use ls -1 file* > filelist1.txt in a pre-session command task to create a file list with all the required files.
Expression transformation - convert col2 to INTEGER if it is coming up as a string from the SQ.
Sorter transformation - use col2 as the key column.
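To make the intent concrete, here is that SQL sketch - an analogy only, not something you run in Informatica, and the table and column names are placeholders I've made up:
-- Conceptual equivalent of SQ --> EXP --> SRT --> Target
SELECT col1,
       CAST(LTRIM(RTRIM(col2)) AS INTEGER) AS seq_num   -- what the Expression transformation does
FROM   all_input_files                                  -- rows coming from every file in filelist1.txt
ORDER BY seq_num;                                       -- what the Sorter (key = col2) does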
Using indirect file source is one way.
Another way is to use command as source and specify a command that will spit out data from all the files, like cat file*.csv.
Just change the Input Type to Command and provide the command - all this can be set by editing session -> mapping tab -> Source -> properties.

PDI - Block this step until steps finished not working

Why does my Block this step until steps finish not work? I need all my insert steps to finish before running the rest of them. Any suggestions?
All Table Input steps will run in parallel when you execute the transformation.
If you want to hold back the table execution, then I suggest adding one constant (i.e. 1) before the block-until step, and in the Table Input step you can add a condition like where 1 = ?, with the Execute for each row option enabled.
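As a sketch of that setup (assuming the Table Input step's "Insert data from step" points at the block step; the table name below is a placeholder):
-- Query of the downstream Table Input step, with "Execute for each row" enabled.
-- The ? is bound to the single constant row (value 1) that the
-- Block this step until steps finish step releases once the inserts are done,
-- so this query only fires after the upstream steps have finished.
SELECT *
FROM   my_table       -- placeholder table name
WHERE  1 = ?          -- receives the constant 1 from the previous step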
You are possibly confusing blocking the data flow with finishing the connection. See there.
As far as I can understand from your questions over the last 3 months, you should really have a look here and there.
And try to move to writing Jobs (kjb) to orchestrate your transformations (ktr).

PDI - Check data types of field

I'm trying to create a transformation that reads CSV files and checks the data type of each field in the CSV.
Like this: the standard says field A should be a string(1) character and field B an integer/number.
And what I want is to check/validate: if A is not string(1) then set Status = Not Valid, and likewise if B is not an integer/number. Then every file with status Not Valid will be moved to an error folder.
I know I can use the Data Validator to do it, but how do I move the file with that status? I can't find any step to do it.
You can read the files in a loop and add steps as below.
After data validation, you can filter the rows with a negative result (not matched) -> add an Add constants step with error = 1 -> add a Set variables step for the error field, with a default value of 0.
After the transformation finishes, you can add a Simple evaluation step in the parent job to check the value of the ERROR variable.
If it has the value 1 then move the files, else ....
I hope this can help.
You can do the same as in this question. Once read, use the Group by to have one flag per file. However, this time you cannot do it in one transformation; you should use a job.
Your use case is in the samples that were shipped with your PDI distribution. The sample is in the folder your-PDI/samples/jobs/run_all. Open the Run all sample transformations.kjb and replace the Filter 2 of the Get Files - Get all transformations.ktr with your logic, which includes a Group by to get one status per file and not one status per row.
In case you wonder why you need such complex logic for such a task, remember that PDI starts all the steps of a transformation at the same time. That's its great power, but it means you cannot know whether you have to move the file before every row has been processed.
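As an analogy only (this is what the Group by step computes inside the transformation, not SQL you would actually run; the names are placeholders), collapsing the per-row validation results into one status per file looks like:
-- One row per file: the file is Not Valid if any of its rows failed validation.
SELECT filename,
       MAX(error_flag) AS file_status    -- error_flag is 1 for a row that failed, 0 otherwise
FROM   validated_rows                    -- placeholder name for the validated stream
GROUP BY filename;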
Alternatively, you have the quick and dirty solution from your similar question: change the Filter rows to a type check, and the final Synchronize after merge to a Process files/Move.
And a final piece of advice: instead of checking the type with a Data Validator, which is a good solution in itself, you may use a JavaScript step like there. It is more flexible if you need maintenance in the long run.

Kettle ETL transformation hop between steps doesn't work

I am using PDI 6 and am new to PDI. I created these two tables:
create table test11 (
a int
);
create table test12 (
b int
);
I created a transformation in PDI, simple, just two steps.
In the first step:
insert into test11 (a)
select 1 as c;
In the second step:
insert into test12 (b)
select 9 where 1 in (select a from test11);
I was hoping the second step would execute AFTER the first step, so the value 9 would be inserted. But when I run it, nothing gets inserted into table test12. It looks to me like the two steps are executed in parallel. To prove this, I eliminated the second step and put the SQL in step 1 like this:
insert into test11 (a)
select 1 as c;
insert into test12 (b)
select 9 where 1 in (select a from test11);
and it worked. So why? I was thinking one step is one step, so the next step would wait until the previous one finishes, but that is not the case?
In PDI transformations, step initialization and execution happen in parallel. So if you have multiple steps in a single transformation, these steps will be executed in parallel and the data movement happens in round-robin fashion (by default). This is primarily the reason why your two Execute SQL steps do not work: both steps are executed in parallel. The same is not the case with PDI jobs. Jobs work in a sequential fashion unless they are configured to run in parallel.
Now for your question, you can try any one of the options below:
Create two separate transformations with the SQL steps and place them inside a job; the job will execute them in sequence.
You can try using the Block this step until steps finish step in the transformation, which will wait for a particular step to finish executing. This is one way to avoid parallelism in transformations. The design of your transformation will be similar to the one below:
The Data grids are dummy input steps. No need to assign any data to them.
Hope this helps :)

How can we count the number of rows in Talend jobs

I have a scenario in which I process my job only when the number of rows is greater than two.
I used tMysqlInput, tMap and tLog components in my job.
You'll want a Run if connection between 2 components somewhere (they both have to be subjob-startable - they should have a green square background when you drop them onto the canvas) and to use the NB_LINE variable from the previous subjob's component, with something like this as your Run if condition (click the link and then click the Component tab):
((Integer)globalMap.get("tMysqlInput_1_NB_LINE")) > 2
Be aware that the NB_LINE functionality is only usable at the end of a subjob and can have "interesting" effects when used mid-job, but the Run if will end that first subjob and conditionally start the second one. If you are unable to find a way to break your job into 2 subjobs then you can always use a tHash or tBuffer output followed by the corresponding input and put the Run if link between the two.