I am creating a Step Function whose input is an array, like: {"ids": [1, 2, 3]}. Next, I have 2 Glue jobs that I want to execute for these ids. E.g. Glue job 1 executes with id 1, Glue job 2 executes with id 2, and then Glue job 1 executes with id 3 once it has finished processing id 1. I have tried using a Parallel state in the Step Function, but that does not work on a chunk of the input; it passes the complete ids list to each branch. I have thought of using a Map state, but a Map state runs only one task per item in parallel, whereas in my case I have 2 Glue jobs.
What could be the resolution for this? Please suggest a solution using Step Functions.
What if you split your ids into two arrays first (as your first step)? So convert
{"ids": [1, 2, 3, 4, 5]}
to
{
  "ids1": [1, 3, 5],
  "ids2": [2, 4]
}
Then add a Parallel state with two branches, each containing a Map state: one iterates over ids1 and sends each id to Glue Job 1, and the other iterates over ids2 and sends each id to Glue Job 2.
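The splitting step could be a small Lambda task; here is a minimal sketch (the alternating odd/even-position assignment is just one way to balance the two lists, and the handler is only a placeholder):

# Sketch of a Lambda handler for the splitting step.
# Input:  {"ids": [1, 2, 3, 4, 5]}
# Output: {"ids1": [1, 3, 5], "ids2": [2, 4]}
def lambda_handler(event, context):
    ids = event["ids"]
    return {
        "ids1": ids[0::2],  # items at even positions go to Glue Job 1
        "ids2": ids[1::2],  # items at odd positions go to Glue Job 2
    }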
Update 1:
If you don't want one Glue job to finish sooner and sit idle, then instead of two arrays you can keep a single list but add a status to each item:
{
  "id": 1,
  "status": null | "job1" | "job2"
}
And instead of a Map state for each job, create a while-style loop: first pick an item from the list, then call the Glue job.
So your Select_an_id state will choose one id from that list and change the status on that record. You need to create a Lambda task state to do this.
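A minimal sketch of what that Lambda could look like (field names follow the shape above; how the updated list is persisted between loop iterations, e.g. in the state machine's data or in a table, is left open):

# Sketch of a Select_an_id Lambda: picks the first unprocessed id
# for the requesting job, marks it, and returns the updated list.
def lambda_handler(event, context):
    items = event["items"]  # e.g. [{"id": 1, "status": None}, ...]
    job = event["job"]      # "job1" or "job2"

    for item in items:
        if item["status"] is None:
            item["status"] = job
            return {"items": items, "selected_id": item["id"], "done": False}

    # Nothing left to process; the calling branch can exit its loop.
    return {"items": items, "selected_id": None, "done": True}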
Subject ID   Condition   Task A   Task B   First Task
1001         1
1002         2
1003         1
This is a within-subjects design. Each participant took part in tasks A and B; however, the order in which the tasks were presented (first or second) depends upon condition (e.g., those in condition 1 perform task A first followed by task B, and vice versa). Note that the task columns do have their own scores, but I cannot add them here.
Is it possible to produce an elegant piece of code that mutates a new column/variable called 'first task'? For subjects in condition 1, their score from task A should go into this new 'first task' column; for subjects in condition 2, their score from task B should go into it (because those in condition 2 received task B first).
I hope this makes sense. I am trying to combine mutate with case_when, group_by and if/if_else functions to achieve something like this, but have not succeeded.
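To illustrate the logic outside R, here is roughly what I'm after, sketched in Python with pandas (the task_a and task_b values below are made up, since I can't share the real scores):

import pandas as pd

# Made-up example scores; task_a / task_b stand in for the real score columns.
df = pd.DataFrame({
    "subject_id": [1001, 1002, 1003],
    "condition":  [1, 2, 1],
    "task_a":     [10, 12, 9],
    "task_b":     [7, 11, 8],
})

# Condition 1 -> task A was first; otherwise take the task B score.
df["first_task"] = df["task_a"].where(df["condition"] == 1, df["task_b"])
print(df)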
We have an AWS Step Function that processes CSV files. These CSV files can contain anywhere from 1 to 4000 records.
Now, I want to create another, inner AWS Step Function that will process these CSV records. The problem is that for each record I need to hit another API, and I want all of the records to be processed asynchronously.
For example: a CSV is received with 2500 records.
The outer step function calls the inner step function 2500 times (the inner step function takes a single CSV record as input), processes each record, and then stores the result in DynamoDB or some other place.
I have learnt about the callback pattern in AWS Step Functions, but in my case I would be passing 2500 tokens, and I want the outer step function to continue only when all 2500 records are done processing.
So my question is: is this possible using AWS Step Functions?
If you know of any article or guide I could reference, that would be great.
Thanks in advance
It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
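A rough sketch of what the outer state machine's definition could look like, written here as a Python dict for readability (the ARNs and names are placeholders, the input is assumed to carry a "records" array parsed from the CSV, and the nested call uses the standard states:startExecution.sync service integration so each iteration waits for its inner execution to finish):

import json
import boto3  # assumes AWS credentials and region are configured

# Map state that fans out one nested execution per CSV record.
definition = {
    "StartAt": "ProcessRecords",
    "States": {
        "ProcessRecords": {
            "Type": "Map",
            "ItemsPath": "$.records",   # array of CSV records in the input
            "MaxConcurrency": 40,       # example cap on concurrent iterations
            "Iterator": {
                "StartAt": "ProcessOneRecord",
                "States": {
                    "ProcessOneRecord": {
                        "Type": "Task",
                        # The iteration completes only when the inner execution
                        # completes, so the Map state ends once every record
                        # has been processed.
                        "Resource": "arn:aws:states:::states:startExecution.sync",
                        "Parameters": {
                            "StateMachineArn": "arn:aws:states:REGION:ACCOUNT:stateMachine:InnerCsvProcessor",
                            "Input": {"record.$": "$"},
                        },
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="OuterCsvProcessor",  # placeholder name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::ACCOUNT:role/StepFunctionsExecutionRole",  # placeholder
)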
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).
I hope this message finds you all well!
I'm stuck in the following situation in Spoon: I have a variable named Directory. This variable holds the path of a directory from which the transformation reads an XLS file. After that, I run three jobs to complete my flow.
Now, instead of reading just one file, I want to loop over them. In other words, after reading the first XLS file, the process should pick up the next one in the directory.
For example:
-> yada.xls -> job 1 -> job 2 -> job 3
-> yada2.xls -> job 1 -> job 2 -> job 3
Have you fellas already faced the same situation?
Any help is welcome!
Loops are not intuitive or very configurable in Spoon/PDI. Normally, you want to first get all the iterations into a list and copy that to "result rows". The next job/transformation entry then has to be configured with the "Execute every input row" checkbox. You can then pass each row individually to that job/transformation in a loop. Specify each "Stream Column Name" from the result rows under the Parameters tab.
Step 1 (generate result rows) --> Step 2 ("Execute every input row")
Step 2 can be a job with multiple steps handling each individual row as parameters.
A related article you may find helpful: https://anotherreeshu.wordpress.com/2014/12/23/using-copy-rows-to-result-in-pentaho-data-integration/
I have a table in DynamoDB with 1 million rows.
I need to run a process on the 1 million rows.
The table would look like so:
Date,  Type, Quantity, Value
Jan23, M,    10,       0.4
Jan24, F,    5,        0.6
Jan26, M,    6,        0.8
The process would go as follows:
Take all records of F and M and sort them individually into two lists by date.
List 1:
Jan23, M, 10, 0.4
Jan26, M, 6, 0.8
List 2:
Jan24, F, 5, 0.6
Now, for each row in List 2, I need to find the first available row in List 1 and process it.
So (10*0.4 - 5*0.6) = 1 <- log this value.
Now, since I took away 5 from the Jan23 row, it only has 5 left as its remaining quantity.
It's a simple process; however, can this be done in Lambda with 1 million records? I would somehow need the Lambda to hold all 1 million records, since the lists cannot be split without knowing the remaining quantity of each row.
The data is stored in DynamoDB and not S3 because some rows need to be edited with ease from a web app. I can and will implement a way to store it on S3 if that is needed for this solution.
I've been looking for a parallel implementation, but for that I would need to know where to split each list beforehand.
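To make the matching concrete, here is roughly the process in Python (a sketch only; it assumes the logged value uses the M row's full remaining quantity, as in the 10*0.4 - 5*0.6 = 1 example above, that the sample dates sort correctly as strings, and it leaves open what happens when an F quantity exceeds the remaining M quantity):

rows = [
    {"date": "Jan23", "type": "M", "quantity": 10, "value": 0.4},
    {"date": "Jan24", "type": "F", "quantity": 5,  "value": 0.6},
    {"date": "Jan26", "type": "M", "quantity": 6,  "value": 0.8},
]

m_rows = sorted((r for r in rows if r["type"] == "M"), key=lambda r: r["date"])
f_rows = sorted((r for r in rows if r["type"] == "F"), key=lambda r: r["date"])

logged = []
i = 0  # index of the first M row that still has quantity remaining
for f in f_rows:
    while i < len(m_rows) and m_rows[i]["quantity"] <= 0:
        i += 1                      # skip fully consumed M rows
    if i == len(m_rows):
        break                       # no M rows left to match against
    m = m_rows[i]
    logged.append(m["quantity"] * m["value"] - f["quantity"] * f["value"])
    m["quantity"] -= f["quantity"]  # e.g. Jan23 drops from 10 to 5

print(logged)  # [1.0] for the sample rows above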
You are solving the problem with the wrong database. DynamoDB is not to be used for analytical or statistical problem-solving.
DynamoDB is not meant for huge data fetches, at least not as of now.
Solutions:
DynamoDB -- Streams -- Lambda -- RDS
Do all the complex queries in RDS.
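A rough sketch of the Lambda in that pipeline, assuming a Postgres RDS instance, psycopg2 packaged with the function (e.g. as a layer), and a target table whose columns mirror the question's Date/Type/Quantity/Value fields (names are otherwise placeholders; only INSERT events are handled here, MODIFY/REMOVE would need an upsert/delete):

import os
import psycopg2  # assumed to be bundled with the Lambda, e.g. via a layer

# Create the connection outside the handler so it is reused across invocations.
conn = psycopg2.connect(
    host=os.environ["RDS_HOST"],
    dbname=os.environ["RDS_DB"],
    user=os.environ["RDS_USER"],
    password=os.environ["RDS_PASSWORD"],
)

def lambda_handler(event, context):
    """Mirror DynamoDB stream inserts into an RDS table for analytical queries."""
    with conn, conn.cursor() as cur:
        for record in event["Records"]:
            if record["eventName"] != "INSERT":
                continue  # MODIFY/REMOVE handling omitted for brevity
            img = record["dynamodb"]["NewImage"]  # DynamoDB-typed attributes
            cur.execute(
                "INSERT INTO records (date, type, quantity, value) VALUES (%s, %s, %s, %s)",
                (
                    img["Date"]["S"],
                    img["Type"]["S"],
                    int(img["Quantity"]["N"]),
                    float(img["Value"]["N"]),  # attribute names assumed to match the table
                ),
            )
    return {"processed": len(event["Records"])}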
If the data is going to grow huge, you can introduce Redshift as well.
DynamoDB -- Streams -- Lambda -- Firehose -- Redshift
Use Redshift tools and write the results back to DynamoDB for transactional consumption.
Hope it helps.
I am using PDI 6 and am new to PDI. I created these two tables:
create table test11 (
  a int
);
create table test12 (
  b int
);
I created a simple transformation in PDI, with just two steps.
In the first step:
insert into test11 (a)
select 1 as c;
In the second step:
insert into test12 (b)
select 9 where 1 in (select a from test11);
I was hoping the second step would execute AFTER the first step, so that the value 9 would be inserted. But when I run it, nothing gets inserted into table test12. It looks to me like the two steps are executed in parallel. To prove this, I eliminated the second step and put all the SQL in step 1, like this:
insert into test11 (a)
select 1 as c;
insert into test12 (b)
select 9 where 1 in (select a from test11);
and it worked. So why? I was thinking one step is one step, so the next step would wait until it finishes, but that is not the case?
In PDI transformations, step initialization and execution happen in parallel. So if you have multiple steps in a single transformation, these steps are executed in parallel, and the data movement between them happens in round-robin fashion (by default). This is primarily the reason why your two Execute SQL steps do not work: both steps are executed in parallel. The same is not the case with PDI jobs; jobs work in a sequential fashion unless configured to run in parallel.
Now, for your question, you can try any one of the options below:
Create two separate transformations with the SQL steps and place them inside a job, executed in sequence.
Use the "Block this step until steps finish" step in the transformation, which makes a step wait until a particular other step has finished executing. This is one way to avoid parallelism in transformations. In that design, the Data Grids are just dummy input steps; there is no need to assign any data to them.
Hope this helps :)