PDI - Block this step until steps finished not working - kettle

Why does my Block this step until steps finish step not work? I need all my insert steps to finish before the rest of the transformation runs. Any suggestions?

All Table Input steps run in parallel when you execute the transformation.
If you want to hold back a Table Input step, I suggest adding one constant (e.g. 1) before the Block this step until steps finish step, and in the downstream Table Input step adding a condition like where 1 = ?, with the blocking step set as 'Insert data from step' and the 'Execute for each row' option enabled.

You are possibly confusing blocking the data flow with finishing the connection. See there.
As far as I can tell from your questions over the last three months, you should really have a look here and there.
And try to move to writing Jobs (kjb) to orchestrate your Transformations (ktr).

Related

Dividing tasks into AWS Step Functions and then joining them back when all are completed

We have an AWS Step Function that processes CSV files. These CSV files can contain anywhere from 1 to 4000 records.
Now I want to create another, inner AWS Step Function that will process these CSV records. The problem is that for each record I need to hit another API, and I want all of the records to be processed asynchronously.
For example: a CSV is received with 2500 records.
The outer step function calls the inner step function 2500 times (the inner step function takes one CSV record as input), processes it, and then stores the result in DynamoDB or any other place.
I have learnt about the callback pattern in AWS Step Functions, but in my case I would be passing 2500 tokens, and I want the outer step function to continue only when all 2500 records are done processing.
So my question is: is this possible using AWS Step Functions?
If you know of any article or guide I could reference, that would be great.
Thanks in advance
It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).
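To make that concrete, here is a minimal sketch of a state machine definition with a Map state, registered through the AWS SDK for Java v2; the state names, Lambda ARN and role ARN are placeholders you would replace with your own:

import software.amazon.awssdk.services.sfn.SfnClient;
import software.amazon.awssdk.services.sfn.model.CreateStateMachineRequest;

public class CreateCsvFanOutMachine {
  public static void main(String[] args) {
    // Amazon States Language definition: the Map state iterates over the
    // "records" array in the input and runs one Iterator execution per record.
    String definition = "{"
        + "\"StartAt\": \"ProcessRecords\","
        + "\"States\": {"
        + "  \"ProcessRecords\": {"
        + "    \"Type\": \"Map\","
        + "    \"ItemsPath\": \"$.records\","
        + "    \"MaxConcurrency\": 40,"
        + "    \"Iterator\": {"
        + "      \"StartAt\": \"CallApi\","
        + "      \"States\": {"
        + "        \"CallApi\": {"
        + "          \"Type\": \"Task\","
        + "          \"Resource\": \"arn:aws:lambda:us-east-1:123456789012:function:process-record\","
        + "          \"End\": true"
        + "        }"
        + "      }"
        + "    },"
        + "    \"End\": true"
        + "  }"
        + "}"
        + "}";

    try (SfnClient sfn = SfnClient.create()) {
      sfn.createStateMachine(CreateStateMachineRequest.builder()
          .name("csv-fan-out")
          .roleArn("arn:aws:iam::123456789012:role/StepFunctionsExecutionRole") // placeholder role
          .definition(definition)
          .build());
    }
  }
}

When all iterations finish, the Map state's output is an array with one entry per record, which a following state can write to DynamoDB in one go instead of juggling 2500 callback tokens.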

Synchronous + Asynchronous steps in step function

I am very new to Step Functions and still exploring them.
I have a workflow something like this:
--Steps A to C are synchronous.
Step A
if (response is X)
Step B
else
Step C
--Need to return a response to the user here, and run the two steps below asynchronously to unblock the caller of the step function.
Step D
Step E
Is it possible to achieve this? I believe I will append .sync for steps A, B and C, will not append anything to D and E, and it should work. Am I missing anything here?
Note that all steps will be executed by activity workers only.
We can take two approaches.
Break the step function into two. The first three steps will be in an Express step function and the last two steps will be in a Standard step function.
OR
We can have just one step function; wherever we call this step function, we need to wait for the first three steps to be completed before moving forward. This can be done by calling get-execution-history in a loop to grab the output of the intermediate step, as in the sketch below. Here is an answer with this approach.
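For the second approach, a rough sketch of the polling loop with the AWS SDK for Java v2; the execution ARN, the state name "Step C" and the helper method are made up for illustration:

import software.amazon.awssdk.services.sfn.SfnClient;
import software.amazon.awssdk.services.sfn.model.GetExecutionHistoryRequest;
import software.amazon.awssdk.services.sfn.model.HistoryEvent;

public class WaitForStepC {

  // Polls the execution history until the state named stateName has exited,
  // then returns that state's output JSON.
  static String waitForStateOutput(SfnClient sfn, String executionArn, String stateName)
      throws InterruptedException {
    while (true) {
      GetExecutionHistoryRequest request = GetExecutionHistoryRequest.builder()
          .executionArn(executionArn)
          .reverseOrder(true)                     // newest events first
          .build();
      for (HistoryEvent event : sfn.getExecutionHistory(request).events()) {
        // stateExitedEventDetails is only present on *StateExited events
        if (event.stateExitedEventDetails() != null
            && stateName.equals(event.stateExitedEventDetails().name())) {
          return event.stateExitedEventDetails().output();
        }
      }
      Thread.sleep(2000);                         // back off before polling again
    }
  }

  public static void main(String[] args) throws InterruptedException {
    try (SfnClient sfn = SfnClient.create()) {
      String executionArn =
          "arn:aws:states:us-east-1:123456789012:execution:MyMachine:my-run"; // placeholder
      String output = waitForStateOutput(sfn, executionArn, "Step C");
      // Return this to the caller; steps D and E keep running inside the execution.
      System.out.println("Step C output: " + output);
    }
  }
}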

BigQueryIO - only first day table can be created, despite having CreateDisposition.CREATE_IF_NEEDED

I have a dataflow job processing data from pub/sub defined like this:
read from pub/sub -> process (my function) -> group into day windows -> write to BQ
I'm using Write.Method.FILE_LOADS because of bounded input.
My job works fine, processing lots of GBs of data, but it fails and retries forever when it gets to creating another table. The job is meant to run continuously and create day tables on its own; it does fine on the first few but then gives me, indefinitely:
Processing stuck in step write-bq/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 05h30m00s without outputting or completing in state finish
Before this happens it also throws:
Load job <job_id> failed, will retry: {"errorResult":{"message":"Not found: Table <name_of_table> was not found in location US","reason":"notFound"}
It is indeed the right error, because this table doesn't exist. The problem is that the job should create it on its own because of the CreateDisposition.CREATE_IF_NEEDED option I defined.
The number of day tables that it creates correctly without a problem depends on the number of workers. It seems that when some worker creates one table, its CreateDisposition changes to CREATE_NEVER, causing the problem, but that's only my guess.
A similar problem was reported here but without any definite answer:
https://issues.apache.org/jira/browse/BEAM-3772?focusedCommentId=16387609&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16387609
The ProcessElement definition here seems to give some clues, but I cannot really say how it works with multiple workers: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L138
I use Apache Beam SDK 2.15.0.
I encountered the same issue, which is still not fixed in Beam 2.27.0 of January 2021. Therefore I had to develop a workaround: a custom PTransform which checks if the target table exists before the BigQueryIO stage. It uses the BigQuery Java client for this and a Guava cache, as well as a windowing strategy (fixed, check every 15s) to sustain heavy traffic of about 5000 elements per second. Here is the code: https://gist.github.com/matthieucham/85459eff5fdea8d115be520e2dd5ccc1
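In outline, the check could look something like the DoFn below; this is a simplified sketch with placeholder project, dataset, table naming and schema, not the actual code from the gist:

import java.util.concurrent.TimeUnit;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.beam.sdk.transforms.DoFn;

// Passes rows through, making sure the day table for each row exists before
// the rows reach BigQueryIO, so the load job never hits "Not found: Table".
class EnsureDayTableExistsFn extends DoFn<TableRow, TableRow> {

  // Tables already verified on this worker; expiry keeps lookups cheap under load.
  private static final Cache<String, Boolean> VERIFIED =
      CacheBuilder.newBuilder().expireAfterWrite(15, TimeUnit.MINUTES).build();

  private transient BigQuery bigquery;

  @Setup
  public void setup() {
    bigquery = BigQueryOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    TableRow row = c.element();
    // Hypothetical naming scheme: one table per day, derived from a row field.
    String tableName = "events_" + row.get("event_day");

    VERIFIED.get(tableName, () -> {
      TableId tableId = TableId.of("my-project", "my_dataset", tableName);
      if (bigquery.getTable(tableId) == null) {   // null means the table is missing
        Schema schema = Schema.of(Field.of("event_day", StandardSQLTypeName.STRING));
        bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(schema)));
      }
      return Boolean.TRUE;
    });

    c.output(row);
  }
}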
There was a bug in the past that caused this error, but that particular one was fixed in commit https://github.com/apache/beam/commit/d6b4dcec5f297f5c1bd08f345f0e1e5c756775c2#diff-3f40fd931c8b8b972772724369cea310 Can you check to see if the version of Beam you are running includes this commit?

kettle etl transformation hop between steps doesn't work

I am using PDI 6 and am new to PDI. I created these two tables:
create table test11 (
a int
)
create table test12 (
b int
)
I created a transformation in PDI, a simple one with just two steps.
In the first step:
insert into test11 (a)
select 1 as c;
In the second step:
insert into test12 (b)
select 9 where 1 in (select a from test11);
I was hoping the second step would execute AFTER the first step, so the value 9 would be inserted. But when I run it, nothing gets inserted into table test12. It looks to me like the two steps are executed in parallel. To prove this, I eliminated the second step and put the SQL in step 1 like this:
insert into test11 (a)
select 1 as c;
insert into test12 (b)
select 9 where 1 in (select a from test11);
and it worked. So why? I was thinking one step is one step, so the next step would wait until it finishes, but that is not the case?
In PDI Transformations, step initialization and execution happen in parallel. So if you have multiple steps in a single transformation, these steps will be executed in parallel, and the data movement happens in a round-robin fashion (by default). This is primarily the reason why your two Execute SQL steps do not work: both steps are executed in parallel. The same is not the case with PDI Jobs. Jobs work in a sequential fashion unless configured to run in parallel.
Now for your question, you can try any one of the options below:
Create two separate transformations with the SQL steps and place them inside a job. Execute the job entries in sequence.
You can try using the Block this step until steps finish step in the transformation, which will wait for a particular step to finish executing. This is one way to avoid parallelism in transformations. The design of your transformation will be similar to the one below:
The data grids are dummy input steps. No need to assign any data to the data grids.
Hope this helps :)

Use a long running database migration script

I'm trialing FluentMigrator as a way of keeping my database schema up to date with minimum effort.
For the release I'm currently building, I need to run a database script to make a simple change to a large number of rows of existing data (around 2% of 21,000,000 rows need to be updated).
There's too much data to be updated in a single transaction (the transaction log gets full and the script aborts), so I use a WHILE loop to iterate through the table, updating 10,000 rows at a time, each batch in a separate transaction. This works, and takes around 15 minutes to run to completion.
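For illustration, the batching pattern is roughly the following; the table and column names are made up, and this sketch uses plain JDBC rather than the actual migration script, but the idea is the same: with auto-commit on, every UPDATE TOP (10000) runs in its own transaction, so the transaction log only ever holds one batch.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchedBackfill {
  public static void main(String[] args) throws Exception {
    // Placeholder connection string.
    String url = "jdbc:sqlserver://localhost;databaseName=MyDb;integratedSecurity=true";
    try (Connection conn = DriverManager.getConnection(url)) {
      // Auto-commit is on by default, so each executeUpdate is its own transaction.
      String sql = "UPDATE TOP (10000) dbo.Orders "    // hypothetical table
                 + "SET Status = 'Archived' "          // hypothetical change
                 + "WHERE Status = 'Closed'";          // rows still needing the update
      try (PreparedStatement update = conn.prepareStatement(sql)) {
        int affected;
        do {
          affected = update.executeUpdate();           // one 10,000-row batch per call
          System.out.println("Updated " + affected + " rows");
        } while (affected > 0);                        // stop when nothing is left to update
      }
    }
  }
}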
Now that the script is complete, I'm trying to integrate it into FluentMigrator.
FluentMigrator seems to run all the migrations for a single batch in one transaction.
How do I get FM to run each migration in a separate transaction?
Can I tell FM to not use a transaction for a specific migration?
This is not possible as of now.
There are ongoing discussions and some work already in progress.
Check it out here : https://github.com/schambers/fluentmigrator/pull/178
But your use case will surely help push things in the right direction.
You are welcome to take part in the discussion!
Maybe someone will find a temporary workaround?