GCP Workflow : Can we setup a step wait for other steps to complete? - google-cloud-platform

I am using GCP Workflow Beta to check if I can build some of my workflows. The documentation mentions how we can conditionally execute steps with switch case and next for jumps. However can we have a flows where
A step waits for two or more previous steps to complete
Multiple steps triggered the same time.
As you can see, what I am implying is conditional parallel execution of steps. Is there a way to do this ?
Also, I see we have some basic functions like len, string etc in the examples. Can you please guide me where I can find a list of all such functions that are available ? I was looking for something to manipulate JSON.

You can't wait a step because you can't run multiple step in parallel for now.
In my company, we also expect, and wait, a lot about parallelization and that's why I had a meeting with the PM yesterday to share these expectation with him. It's one of the top priority in the roadmap but I don't know when it will be released (something like Q1 or Q2 2021 I guess).
The functions sounds like python code but it's not really clear in the documentation. I will share this with the PM.

Related

how to read/write a value from/in the context object cdk step functions

I am writing in AWS CDK a Step Function which run two tasks in parallel. I would like to access from one of the tasks , a value of the second tasks , which runs in parallel (for example, I would like to know in task 1, which is the time started task 2, or maybe id from task 2).
Here an screenshot of the state machine definition in Step Function.
In the example of the screenshot, I would like to use the Id of the GlueStartRunJob (1) in GlueStartRunJob.
I was thinking about using the Context Object for that purpose. Nevertheless, I am not sure if this is the right approach...
The Context Object is read-only and allows a give state to access contextual information about it's self, not about other states from elsewhere in the workflow.
I'm not 100% clear what you are aiming to accomplish here, but I can see a couple of possible approaches.
First, you might just want to order these Glue Jobs to run sequentially so the output from the first can be used in the second.
Second, if you need the workflow to take action after the Glue Jobs have started but before they have completed, you'd need to take an approach that does not use the .sync integration pattern. With this integration pattern, Step Functions puts a synchronous facade over an asynchronous interaction, taking care of the steps to track completion and return you the results. You could instead use the default RequestResponse pattern to start the jobs in your parallel state, then do whatever you needed to after. You'd need to then include your own polling logic if you wanted the workflow to wait for completion of the jobs and return data on them or take action on completion. You can see an example of such polling for Glue Crawlers in this blog post (for which you can find sample code here).

How to update a Dataflow incompatible pipeline without loosing in data?

I'm practicing for the Data Engineer GCP certification exam and got the following question:
You have a Google Cloud Dataflow streaming pipeline running with a
Google Cloud Pub/Sub subscription as the source. You need to make an
update to the code that will make the new Cloud Dataflow pipeline
incompatible with the current version. You do not want to lose any
data when making this update.
What should you do?
Possible answers:
Update the current pipeline and use the drain flag.
Update the current pipeline and provide the transform mapping JSON object.
The correct answer according to the website 1 my answer was 2. I'm not convinced my answer is incorrect and these are my reasons:
Drain is a way to stop the pipeline and does not solve the incompatibility issues.
Mapping solves the incompatibility issue.
The only way that I see 1 as the correct answer is if you don't care about compatibility.
So which one is right?
I'm studying for the same exam, and the two cores of this question are:
1- Don't lose data ← Drain, is perfect for this because you process all buffer data and stop reviving messages; normally this message is alive for 7 days of retry, so when you start a new job you will receive all without lose any data.
2- Incompatible new code ← mapping solve some incompatibilities like change name of a ParDO but no a version issue. So launch a new job with the new code, it's the only option.
So, option is A.
I think the main point is that you cannot solve all the incompatibilities with the transform mapping. Mapping can be done for simple pipeline changes (for example, names), but it doesn't generalize well.
The recommended solution is constantly draining the pipeline running a legacy version, as it will stop taking any data from reading components, finish all the work pending on workers, and shutdown.
When you start a new pipeline, you don't have to worry about state compatibility, as workers are starting fresh.
However, the question is indeed ambiguous, and it should be more precise about the type of incompatibility or state something like in general. Arguably you can always try to update the job with mapping, and if Dataflow finds the new job to be incompatible, it will not affect the running pipeline -- then your only choice would be the drain option.

GNU Parallel host sticky jobs

I am writing a parallel build farm to build C++ cross-platform applications against various platforms / environments. Every time new code is pushed to a git repo, I build and test the latest code against all the platforms.
I've setup parallel to correctly distribute the jobs among several hosts using the --sshlogin option.
I transfer files, collect output and results. It's all working more than fine and I love the tool.
The build time being sometimes quite long for some platforms, I would like the build to be as incremental as possible.
My only issue is that the build is only incremental if the scheduler sends the jobs to the same machine and reuse the artefacts of the previous build on this specific host.
Say I have 3 hosts, I have 1 chance in 3 for the build to be incremental. If a hosts hasn't built this platform in a while, it might take a long time.
Is it possible to gain control over the host a specific input source will run on and only fallback to the other hosts if the host is busy?
Ideally, I would love to see a tag system where I tag input source with a name and tag several hosts with a name, creating pools of jobs and pools of machines specialized into that type of build.
But a very simple implementation where the input sources are distributed in the same order as the order the sshlogins are defined could be a simple & quick fix in my situation.
I tried to find the source code to implement it myself but I only see doc generation when I browse the code on Savannah.
Any ideas?
Thanks,
M
There is currently no support for prioritizing a given argument to a given sshlogin. The source code is at https://savannah.gnu.org/git/?group=parallel
Feel free to join the mailing list and discuss the idea: https://lists.gnu.org/mailman/listinfo/parallel
The only priority in the code is when a job has failed on an sshlogin, then GNU Parallel prefers to retry that job on another sshlogin. Maybe that could be extended?
If a job is marked as having failed -1 time for a given sshlogin, then GNU Parallel ought to prefer to run the job on that sshlogin.
I've been trying to discuss this idea on the mailing list as you suggested but never had any respone in more than 10 days... I guess you must be busy with other things at the moment. So I went along and forked the source code to make the necessary changes and make my solution work.
I pushed it there a week ago:
http://michakfromparis.github.io/gnu-parallel-sticky/
the source code is available on github here:
https://github.com/michaKFromParis/gnu-parallel-sticky
Wasn't exactly easy without any guidance as the source code has a lot of history so I tried to keep the changes surgical to ease merge of your future releases.
I've been using it in production for more than a week now and it works perfectly in my configuration.
It is also compatible with older formats, should be a drop-in replacement for usual parallel uses with extra features on the side.
Would love to get feedback from other users though as it might not be completely dry.
Thanks for sharing the original source code.
Best Regards,
M

High level PHP library for Amazon SWF deciders to check state of activity tasks

I'm writing PHP for fairly simple workflow for Amazon SWF. I've found myself starting to write a library to check if certain actions have been started or completed. Essentially looping over the event list to check how things have progressed, and then starting an appropriate activity if its needed. This can be a bit faffy at times as the activity type and input information isn't in every event, it seems to be in the ActivityTaskScheduled event. This sort of thing I've discovered along the way, and I'm concerned that I could be missing subtle things about event lists.
It makes me suspect that someone must have already written some sort of generic library for finding the current state of various activities. Maybe even some sort of more declarative way of coding up the flowcharts that are associated with SWF. Does anything like this exist for PHP?
(Googling hasn't come up with anything)
I'm not aware of anything out there that does what you want, but you are doing it right. What you're talking about is coding up the decider, which necessarily has to look at the entire execution state (basically loop through the event list) and decide what to do next.
Here's an example written in python
( Using Amazon SWF To communicate between servers )
that looks for events of type 'ActivityTaskCompleted' to then decide what to do next, and then, yes, looks at the previous 'ActivityTaskScheduled' entry to figure out what the attributes for the previous task were.
If you write a php framework that specifies the workflow in a declarative way then a generic decider that implements it, please consider sharing it :)
I've since found https://github.com/cbalan/aws-swf-fluent-php which looks promising, but not really used it, so can't speak to the whether it works or not.
I've forked it and started a bit of very light refactoring to allow some testing, available at https://github.com/michalc/aws-swf-fluent-php

Hudson - different build targets for different triggers

I would like to have different build targets for periodic builds and for those that are triggered by polling SCM.
More specific: The idea is that nightly builds should call 'mvn verify' which includes integration tests, while a normal build calls 'mvn test' that just executes Unit tests.
Any ideas how this can be achieved using Hudson?
Cheers
Chris
You could create two jobs - one scheduled and the other polled.
In the scheduled you can specify a different maven goal from the polled.
The answer by Raghuram is straight forward and correct. But you can also have three jobs. The first two do the triggering and pass the maven goal as a parameter into the third job. Sounds like a lot of clutter, and to a certain point it is. But it will help, if you have a lot of configuration to do (especially if the configuration needs to be changed regularly). It will help to have the configuration correct for both jobs. Configuration does not only include the build steps but also the harvesting of all reports, post build cleanup, notifications, triggering of downstream jobs, ... Another advantage is, that you don't need to synchronize the two jobs, so that they not run in parallel (if that causes problems).
Don't understand me wrong, my first impulse would go for two jobs, which has it's own advantages. The history for the nightly build will be contain the whole day (actually since the last nightly build) and not only for the time since the last build (which could be a triggered one. Integration tests usually need a more extensive setup or access to scarce resources. With two jobs you don't block these resources when you run the test goal. In addition I expect that more test results need to be harvested to be displayed and tracked over time by Hudson. You also might want to run more metrics against your code whose results should be displayed by Hudson. The disadvantage is that you of course need to keep the build steps basically the same all the time.
But in the end it is a case-by base decision if you go with 2 or 3 jobs.