Parallel processes with dependency - concurrency

There are two parallel processes. Each process has two steps. The second step of the first process is always executed after the first step. The second step of the second process is performed only under a certain condition.
Activity diagram:
How to reflect an additional condition: to complete the second step of the second process, the first step of the first process must be completed.
I managed:
Flaws:
No match between fork and join
If the condition of the second process is not met, the token “hangs” before join

Having looked at your solution once more made me think that you saw issues, where there are none. You are worried about the hanging token, but that is no issue in this case. If P22 is bypassed, the token from P11 will go down directly to the join node. P11 and P12 will pass their token down also with no issue, thereby creating that ghost token which gets stuck in the middle right join. Since the lower join now has two tokens it will continue to the end where the activity is terminated. At that point any free running tokens (and even active actions) are terminated as well. All good.
I leave my former answer for further inspiration. But basically they will all be implemented in similar ways since they represent a gateway.
Original answer
I guess that using an event would be the best way:
This way D can only start (and finish) when the event has been received which is sent after As completion.
Another way would be to use an object that stores the finalization of action A and which is read by D.
Note that the diagonal connectors through a ready are ObjectFlows which UML does not per default distinguish optically (unlike SysML).
P. 374 of UML 2.5 states
Object tokens pass over ObjectFlows, carrying data through an Activity via their values, or carrying no data ( null tokens). A null token can still be passed along an ObjectFlow and used like any other token. For example, an Action can output a null token to explicitly indicate that it did not produce an optional value, and a downstream DecisionNode (see sub clause 15.3) can test for this and branch accordingly.
So you can see that as a buffer holding a token and no real data is needed to be stored. Basically that's the same as an event. Implementation wise you would use a semaphore or a stream to realize that, but of course at this level you would not care too much about such details.

Related

Dividing tasks into aws step functions and then join them back when all completed

We have a AWS step function that processes csv files. These CSV files records can be anything from 1 to 4000.
Now, I want to create another inner AWS step function that will process these csv records. The problem is for each record I need to hit another API and for that I want all of the record to be executed asynchronously.
For example - CSV recieved having records of 2500
The step function called another step function 2500 times (The other step function will take a CSV record as input) process it and then store the result in Dynamo or in any other place.
I have learnt about the callback pattern in aws step function but in my case I will be passing 2500 tokens and I want the outer step function to process them when all the 2500 records are done processing.
So my question is this possible using the AWS step function.
If you know any article or guide for me to reference then that would be great.
Thanks in advance
It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).

Passing Input to Step Function from a manual approval step

I have a usecase where one of the task in Step Function is Manual Approval Step.
As a part of completion of this step, we want to pass some inputs which will be used by subsequent tasks.
This there a way to do it ?
I have seen passing JSON in output while completing the Manual Approval Step. Is there a way that we can read this output as input in next step ?
client.sendTaskSuccess(new SendTaskSuccessRequest()
.withOutput("{\"key\": \"this is value\"}")
.withTaskToken(getActivityTaskResult.getTaskToken()));
It is possible, but your question doesn't provide enough information for a specific answer. Some general tips about input/output processing:
By default, the output of a state becomes the input to the next state. You can use ResultPath to write the output of the the Task to a new field without replacing the entire JSON payload that becomes the input to the next state.
If subsequent states are using InputPath or Parameters, you might be filtering the input and removing the output of the approval step. Similarly with OutputPath.
The doc on Input and Output Processing may be helpful: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-input-output-filtering.html

Dataflow pipeline waits for elements from all streams before performing GroupBy

We are running a Dataflow job that handles multiple input streams. Some of them are high traffic and some of them rarely get messages through. We are joining all streams with a "shared" stream that contains information relevant to all elements. This is a simplified example of the pipeline:
I noticed that the job will not produce any output, until both streams contain some traffic.
For example, let's suppose that Stream 1 gets a steady flow of traffic, whereas Stream 2 does not produce any messages for a period of time. For this time, the job's DAG will show elements being accumulated in the GroupByKey step but nothing will be propagated beyond it. I can also see the Flatten PCollections step showing input elements for the left side of the graph but not the right one. This creates a problem when dealing with high traffic and low traffic streams in the same job, since it will cause output to be delayed for as much as it takes for Stream 2 to pick up messages.
I am not sure if the observation is correct, but I wanted to ask if this is how Flatten/GroupByKey works in general and if so, if the issue we're seeing can be avoided through an alternative way of constructing the pipeline.
(Example JobID: 2017-02-10_06_48_01-14191266875301315728)
As described in the documentation of group-by-key the default behavior is to wait for all data within the window to have arrived -- this is necessary to ensure correctness of down-stream results.
Depending on what you are trying to do, you may be able to use triggers to cause the aggregates to be output earlier.
You may also be able to use the slow-stream as a side-input to the processing of the fast-stream.
If you're still stuck, it would help if you could describe in more detail the contents of the streams and how you're trying to use them, since more detailed answers depend on the goal.

Mailgun: algorithm for event polling

We are implementing support for tracking of Mailgun events in our application. We reviewed the proposed event polling algorithm but find ourselves not quite comfortable with it. First, we would prefer not to discard the data that we have already fetched and then retry from scratch after a pause. It is not very efficient and leaves a door open for a long loop of retries, as it is not clear when the loop is supposed to end. Second, the "threshold age" seems to be the key to determine "trustworthiness", but its value is not defined, only a very large "half an hour" is suggested.
It is our understanding that the events become "trustworthy" after some threshold delay, let us call it D_max, when the events are guaranteed to reside in the event storage. If so, we can implement this algorithm in a different way, so that we do not fetch the data that we know are not "trustworthy" and make use of all data which have been fetched.
We would be fetching data periodically, and on each iteration we would:
Make a request to the events API specifying an ascending time range from T_1 to T_2 = now() - D_max. For the first iteration, T_1 can be set to some time in the past, "e.g., half an hour ago". For the subsequent iterations, T_1 is set to the value of T_2 from the previous iteration.
Fetch all pages one by one while the next page URL is returned.
Use all fetched events, as they are all "trustworthy".
My questions are:
Q1: Are there any problems with this approach?
Q2: What is the minimum realistic value of D_max? Obviously, we can use "half an hour" for it, but we would like to be more agile in tracking events, so it would be great to know what is the minimum value we can set it to and still reliably fetch all events.
Thanks!
1: I see no problems with this solution (in fact I'm doing something very similar). I'm also storing ID's of the events to validate I'm not inserting duplicate entries.
2: I've been working through this similar process. Right now I am testing with D_max at 10 minutes.
Additionally, While going through a testing process I'm running an additional task nightly that goes back over the entire day to validate a few things:
Am I missing existing metrics?
Diagnose if there is a problem with the assumptions I've made about D_max.

Underlying mechanism in firing SQL Queries in Oracle

When we fire a SQL query like
SELECT * FROM SOME_TABLE_NAME under ORACLE
What exactly happens internally? Is there any parser at work? Is it in C/C++ ?
Can any body please explain ?
Thanks in advance to all.
Short answer is yes, of course there is a parser module inside Oracle that interprets the statement text. My understanding is that the bulk of Oracle's source code is in C.
For general reference:
Any SQL statement potentially goes through three steps when Oracle is asked to execute it. Often, control is returned to the client between each of these steps, although the details can depend on the specific client being used and the manner in which calls are made.
(1) Parse -- I believe the first action is actually to check whether Oracle has a cached copy of the exact statement text. If so, it can save the work of parsing your statement again. If not, it must of course parse the text, then determine an execution plan that Oracle thinks is optimal for the statement. So conceptually at least there are two entities at work in this phase -- the parser and the optimizer.
(2) Execute -- For a SELECT statement this step would generally run just enough of the execution plan to be ready to return some rows to the client. Depending on the details of the plan, that might mean running the whole thing, or might mean doing just a very small fraction of the work. For any other kind of statement, the execute phase is when all of the work is actually done.
(3) Fetch -- This is when rows are actually returned to the client. Generally the client has a predetermined fetch array size which sets the maximum number of rows that will be returned by a single fetch call. So there may be many fetches made for a single statement. Of course if the statement is one that cannot return rows, then there is no fetch step necessary.
Manasi,
I think internally Oracle would have its own parser, which does parsing and tries compiling the query. Think its not related to C or C++.
But need to confirm.
-Justin Samuel.