How to update an incompatible Dataflow pipeline without losing data? - google-cloud-platform

I'm practicing for the Data Engineer GCP certification exam and got the following question:
You have a Google Cloud Dataflow streaming pipeline running with a
Google Cloud Pub/Sub subscription as the source. You need to make an
update to the code that will make the new Cloud Dataflow pipeline
incompatible with the current version. You do not want to lose any
data when making this update.
What should you do?
Possible answers:
1. Update the current pipeline and use the drain flag.
2. Update the current pipeline and provide the transform mapping JSON object.
The correct answer according to the website is 1; my answer was 2. I'm not convinced my answer is incorrect, and these are my reasons:
Drain is a way to stop the pipeline and does not solve the incompatibility issues.
Mapping solves the incompatibility issue.
The only way that I see 1 as the correct answer is if you don't care about compatibility.
So which one is right?

I'm studying for the same exam, and the two cores of this question are:
1- Don't lose data ← Drain is perfect for this, because the pipeline processes all buffered data and stops receiving messages; unacknowledged messages normally stay in the Pub/Sub subscription for 7 days, so when you start a new job you will receive them all without losing any data.
2- Incompatible new code ← Mapping solves some incompatibilities, such as a renamed ParDo, but not a version issue. So launching a new job with the new code is the only option.
So the answer is the first option (drain).

I think the main point is that you cannot solve all the incompatibilities with the transform mapping. Mapping can be done for simple pipeline changes (for example, names), but it doesn't generalize well.
The generally recommended solution is to drain the pipeline running the legacy version: it will stop reading data from its sources, finish all the work pending on the workers, and shut down.
When you start a new pipeline, you don't have to worry about state compatibility, as workers are starting fresh.
However, the question is indeed ambiguous; it should be more precise about the type of incompatibility, or add a qualifier such as "in general". Arguably you can always try to update the job with a mapping first: if Dataflow finds the new job to be incompatible, the update is rejected and the running pipeline is not affected. Your only remaining choice would then be the drain option.
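To make the two options concrete, here is a minimal sketch that only builds the command-line arguments involved and does not run anything. It assumes the `gcloud dataflow jobs drain` command and the Beam Python SDK's `--update` and `--transform_name_mapping` pipeline options; the job ID, job name, region and transform names are placeholders, not values from the question.

```python
import json
from typing import Dict, List


def drain_command(job_id: str, region: str) -> List[str]:
    """Option 1: drain the old job, then launch the new code as a fresh job."""
    return ["gcloud", "dataflow", "jobs", "drain", job_id, f"--region={region}"]


def update_options(job_name: str, name_mapping: Dict[str, str]) -> List[str]:
    """Option 2: in-place update; only helps when transforms were merely renamed."""
    return [
        "--update",
        f"--job_name={job_name}",  # must match the name of the running job
        f"--transform_name_mapping={json.dumps(name_mapping)}",
    ]


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    print(" ".join(drain_command("my-job-id", "us-central1")))
    print(" ".join(update_options("my-job", {"OldParDo": "NewParDo"})))
```

With the first option, messages that arrive while no job is running stay unacknowledged in the subscription, and the replacement job picks them up when it starts.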

Related

Background jobs occur very frequently and eat memory

I'd like to optimize my notification system, so here is how it works now:
Every time a change occurs in the application, we call a background job (Sidekiq) to compute some values and then notify users via email.
This approach worked very well for a while, but suddenly we got a memory leak: actions became very frequent and we had about 30-50 workers per second, so I need to refactor this.
What I would like to do is, instead of running the worker immediately, store the work in an array and perform it a bit later.
But I'm afraid that will also cause a problem, just a "delayed" one.
I'm looking forward to hearing other approaches and solutions as well.
Thanks in advance
So I found one very interesting solution:
I'm storing values in Redis directly as key-value pairs, where the value is a dataset with the data I'll need later for the computation. Then I'm using a simple cron job, which invokes a service that is responsible for reading the data from Redis and processing it. I optimized the Sidekiq workers to work only when the cron job is executed; everything works perfectly fine and even much faster than before.
I'm still eager to hear if there is any other approach/solution.
Thanks
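The buffering approach described above can be sketched roughly as follows. This is a minimal illustration, not the poster's code: a plain dictionary stands in for Redis, and the class name, `record`, `flush` and the `notify` callback are all hypothetical names chosen for the example.

```python
class PendingChanges:
    """In-memory stand-in for the Redis key-value store."""

    def __init__(self):
        self._store = {}

    def record(self, key, data):
        # Called on every application change; cheap, no worker is started.
        self._store.setdefault(key, []).append(data)

    def flush(self):
        # Called by the periodic (cron-like) tick: take everything, reset the store.
        batch, self._store = self._store, {}
        return batch


def process_batch(batch, notify):
    """Run one computation/notification per key instead of one per change."""
    for key, changes in batch.items():
        notify(key, changes)
    return len(batch)


if __name__ == "__main__":
    pending = PendingChanges()
    for change in ("a", "b", "c"):
        pending.record("user:1", change)
    pending.record("user:2", "d")
    process_batch(pending.flush(), lambda key, changes: print(key, changes))
```

The design choice here is that the number of jobs per tick is bounded by the number of distinct keys, not by the number of changes, which is what keeps the worker count from spiking.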

GNU Parallel host sticky jobs

I am writing a parallel build farm to build C++ cross-platform applications against various platforms / environments. Every time new code is pushed to a git repo, I build and test the latest code against all the platforms.
I've set up parallel to correctly distribute the jobs among several hosts using the --sshlogin option.
I transfer files, collect output and results. It's all working more than fine and I love the tool.
The build time being sometimes quite long for some platforms, I would like the build to be as incremental as possible.
My only issue is that the build is only incremental if the scheduler sends the job to the same machine and reuses the artefacts of the previous build on that specific host.
Say I have 3 hosts: I have 1 chance in 3 for the build to be incremental. If a host hasn't built this platform in a while, it might take a long time.
Is it possible to gain control over the host a specific input source will run on, and only fall back to the other hosts if that host is busy?
Ideally, I would love to see a tag system where I tag an input source with a name and tag several hosts with a name, creating pools of jobs and pools of machines specialized in that type of build.
But a very simple implementation where the input sources are distributed in the same order as the order the sshlogins are defined could be a simple & quick fix in my situation.
I tried to find the source code to implement it myself but I only see doc generation when I browse the code on Savannah.
Any ideas?
Thanks,
M
There is currently no support for prioritizing a given argument to a given sshlogin. The source code is at https://savannah.gnu.org/git/?group=parallel
Feel free to join the mailing list and discuss the idea: https://lists.gnu.org/mailman/listinfo/parallel
The only priority in the code is when a job has failed on an sshlogin, then GNU Parallel prefers to retry that job on another sshlogin. Maybe that could be extended?
If a job is marked as having failed -1 time for a given sshlogin, then GNU Parallel ought to prefer to run the job on that sshlogin.
I've been trying to discuss this idea on the mailing list as you suggested, but haven't had any response in more than 10 days... I guess you must be busy with other things at the moment. So I went ahead and forked the source code to make the necessary changes and make my solution work.
I pushed it there a week ago:
http://michakfromparis.github.io/gnu-parallel-sticky/
the source code is available on github here:
https://github.com/michaKFromParis/gnu-parallel-sticky
Wasn't exactly easy without any guidance as the source code has a lot of history so I tried to keep the changes surgical to ease merge of your future releases.
I've been using it in production for more than a week now and it works perfectly in my configuration.
It is also compatible with older formats, should be a drop-in replacement for usual parallel uses with extra features on the side.
Would love to get feedback from other users though as it might not be completely dry.
Thanks for sharing the original source code.
Best Regards,
M
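The "sticky host with fallback" policy discussed in this thread can be sketched as below. This is only an illustration of the scheduling rule, under the assumption that the scheduler knows which hosts are busy; it is not how GNU Parallel or the fork linked above is implemented, and `pick_host` and its arguments are hypothetical names.

```python
def pick_host(job_key, hosts, busy, last_host):
    """Prefer the host that last ran job_key; otherwise take the first idle host.

    hosts     -- ordered list of host names (like the order of the sshlogins)
    busy      -- set of hosts currently running a job
    last_host -- dict remembering where each job_key ran last (updated in place)
    Returns the chosen host, or None if every host is busy.
    """
    preferred = last_host.get(job_key)
    if preferred in hosts and preferred not in busy:
        chosen = preferred
    else:
        idle = [h for h in hosts if h not in busy]
        if not idle:
            return None
        chosen = idle[0]
    # The chosen host now holds the freshest build artefacts for this key.
    last_host[job_key] = chosen
    return chosen


if __name__ == "__main__":
    memory = {}
    hosts = ["host1", "host2", "host3"]
    print(pick_host("linux-arm", hosts, set(), memory))      # first run
    print(pick_host("linux-arm", hosts, set(), memory))      # sticky
    print(pick_host("linux-arm", hosts, {"host1"}, memory))  # fallback
```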

How to see the full build queue in Jenkins

Our Jenkins instance has a job for our main application. It builds all git branches in the one job, and so can sometimes get pretty far behind. However, the Build Queue on the lefthand side only ever shows the next job, not all the others. Is there a way to see all the queued executions of a single job? Ideally it'd even show the branch as well.
I'm aware of solutions like creating a new job for each branch, but this really clutters up the already horrible interface, and I'd rather avoid that.
For a single job with the same parameters, Jenkins doesn't place a build in the queue if an identical one is already queued. You can use a simple trick: add an unused parameter and set it to a random value every time you run the job. Now you can have multiple builds of the same job in the queue.
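As a rough sketch of that trick, the helper below builds a trigger URL for Jenkins' `buildWithParameters` endpoint and appends a random value. The parameter name `QUEUE_NONCE`, the server URL and the job name are assumptions made for the example; the job would need to declare such a parameter, and authentication is left out.

```python
import uuid
from urllib.parse import quote, urlencode


def build_trigger_url(jenkins_url, job_name, params):
    """Return a buildWithParameters URL that is unique on every call."""
    query = dict(params)
    # Hypothetical unused parameter; its random value makes each request distinct.
    query["QUEUE_NONCE"] = uuid.uuid4().hex
    base = jenkins_url.rstrip("/")
    return f"{base}/job/{quote(job_name)}/buildWithParameters?{urlencode(query)}"


if __name__ == "__main__":
    print(build_trigger_url("https://jenkins.example.com", "main-app",
                            {"BRANCH": "feature/x"}))
```

Because the branch is passed as a parameter too, each queued entry carries the branch it was triggered for.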

Job inheritance in Jenkins jobs

How do you handle mapping Jenkins jobs to your build process, and have you been able to build in cascading configurations on inheritance?
For any given build I'll have at least three jobs (standard continuous integration/nightly, security scan, coverage) and then some downstream integration testing jobs. The configuration slicer plugin handles some aspects across jobs, but each job is still very much its own individual entity with no relationship to the other jobs in its group.
I recently saw QuickBuild, and it has job inheritance, where a parent job can define a standard group of steps and its children can override and specialize them. With Jenkins, I have copies of jobs, which is fine until I need to change something. With QuickBuild, the relationship between jobs allows me to spread my changes with little effort.
I've been trying to figure out how to handle this in Jenkins. I could use the parameterized build trigger plugin to allow jobs to call others and override aspects. I'd then harvest the data from the called jobs to its caller. I suspect I'll run into a series of problems where there are aspects which I can't override which will force me to implement Jenkins functionality in my own script thus making Jenkins less useful.
How do you handle complexity in your build jobs in Jenkins? Have you heard of any serious problems with QuickBuild?
I would like to point out to you the release of a plugin that my team has developed and only recently published under open source.
It implements full "Inheritance between jobs".
Here are some further links that might help you:
Presentation: https://www.youtube.com/watch?v=wYi3JgyN7Xg
Wiki: https://wiki.jenkins-ci.org/display/JENKINS/inheritance-plugin
Releases: http://repo.jenkins-ci.org/releases/hudson/plugins/project-inheritance/
I had pretty much the same problem. We have a set of jobs that needs to run for our trunk as well as at least two branches. The branches represent our versions, and a new branch is created every few months. Creating new jobs by hand for this is no solution, so I checked out some possibilities.
One possibility is to use the template plugin. This lets you create a hierarchy of jobs of a kind. It provides inheritance for builders, publishers and SCM settings. Might work for some, for me it was not enough.
The second thing I checked out was the Ant Script for job cloning, and its sibling, the Bash Script. These are truly great. The idea is to have the script create a new job, copy all settings from a template job, and make changes as you need them. As this is a script, it is very flexible and you can do a lot with it. The only drawback is that this will not result in a real hierarchy, so changes in the template job will not be reflected in jobs already cloned, only in jobs created going forward.
Looking at the drawbacks and virtues of those two solutions, a combination of both might work best. You create a template project with some basic settings that will be true for all jobs, and then use a bash or ant script to create jobs depending on that template.
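A minimal sketch of the "clone from a template" step might look like this. It is not the Ant or Bash script mentioned above: the `@BRANCH@` placeholder convention and the function name are assumptions for illustration, and the sketch only renders the XML; sending it to Jenkins (for instance via its `createItem` endpoint) is left out.

```python
from xml.sax.saxutils import escape


def render_job_config(template_xml, substitutions):
    """Fill placeholders in a template job's config.xml for one concrete job."""
    config = template_xml
    for placeholder, value in substitutions.items():
        # Escape the value so characters like & or < keep the XML well-formed.
        config = config.replace(placeholder, escape(value))
    return config


if __name__ == "__main__":
    # Hypothetical, heavily shortened template; a real config.xml is much longer.
    template = "<project><description>Build of @BRANCH@</description></project>"
    for branch in ("trunk", "release/1.2"):
        print(render_job_config(template, {"@BRANCH@": branch}))
```

As noted above, jobs generated this way are snapshots: re-running the script is the only way to propagate later template changes to them.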
Hope that helps.
I was asked what our eventual solution to the problem was... After many months of fighting with our purchasing system, we spent around $4000 US on Quickbuild. In about 2-3 months we had a templated build system in place and were very happy with it. Before I left the company, we had several product groups in the system and were automating the release process as well.
Quickbuild was a great product. It should be in the $40k class but it's priced at much less. While I'm sure Jenkins could do this, it would be a bit of a kludge whereas Quickbuild had this functionality baked in. I've implemented complex behaviors on top of products before (e.g. merge tracking in SVN 1.0) and regretted it. Quickbuild was reasonably priced and provided a solid base for our build and test systems.
At present, I'm at a firm using Bamboo and hope its new feature branch feature will provide much of what Quickbuild can do
EZ Templates plugin allows you to use any job as a template for other jobs. It is really awesome. All you need is to set the base job as a template:
* Usually you would also disable the base job (like "abstract class").
Then create a new job, set it to use the base job template, and save:
Now edit the new job - it will include everything! (and you can override existing configurations).
Note: There's another plugin, Template Project, for configuration templates, but it has not been updated recently (last commit in 2016).
We use quickbuild and it seems to work great for most things. I have even been able to use their APIs to write custom plugins. One area where quickbuild is lacking is sonar integration. The sonar team has a Jenkins plugin and not one for quickbuild.
Given that the goal is DRY (don't repeat yourself) I presently favor this approach:
Use jenkins shared library with jenkins pipeline unit to support TDD
Use docker images using groovy/python or whatever language you like to execute complex actions requiring apis etc
Keep the actual job pipeline very spartan (basically just for pulling build params and passing them to functions in the shared library, which may use docker images to do the work).
This works really well and eliminates the DRY issues around complex build jobs.
Shared Pipeline Docker Code Example - vars/releasePipeline.groovy
    /**
     * Run image
     * @param closure to run within image
     * @return result from execution
     */
    def runRelengPipelineEphemeralDocker(closure) {
        def result
        artifactory.withArtifactoryEnvAuth {
            docker.withRegistry("https://${getDockerRegistry()}", 'docker-creds-id') {
                docker.image(getReleasePipelineImage()).inside {
                    result = closure()
                }
            }
        }
        return result
    }
Usage example
    library 'my-shared-jenkins-library'
    releasePipeline.runRelengPipelineEphemeralDocker {
        println "Running ${pythonScript}"
        def command = "${pythonInterpreter} -u ${pythonScript} --cluster=${options.clusterName}"
        sh command
    }

How do you model a business workflow in ColdFusion?

Since there's no complete BPM framework/solution in ColdFusion as of yet, how would you model a workflow into a ColdFusion app that can be easily extensible and maintainable?
A business workflow is more than a flowchart that maps nicely into a programming language. For example:
How do you model a task X that is followed by multiple tasks Y0, Y1, Y2 that happen in parallel, where Y0 is a human process (need to wait for inputs), Y1 is a web service that might go wrong and might need auto retry, and Y2 is an automated process; followed by a task Z that should only be carried out when all Y's are completed?
My thoughts...
Seems like I need to do a whole lot of storing / managing / keeping
track of states, and frequent checking with cfschedule.
cfthread ain't going to help much since some tasks can take days
(e.g. wait for user's confirmation).
I can already imagine the flow is going to be spread around in multiple UDFs,
DB, and CFCs
Is there any open-source workflow engine in another language that we could port over to CF?
Thank you for your brain power. :)
Study the Java Process Definition Language specification, for which JBoss has an execution engine. Using this Java-based engine may be your easiest solution, and it solves many of the problems you've outlined.
If you intend to write your own, you will probably end up modelling states and transitions as vertices and edges in a directed graph. These, as Ciaran Archer wrote, are the components of a state machine. The best persistence approach, IMO, is to capture versions of whatever data is being sent through the workflow via serialization, along with the current state and a history of transitions between states and changes to that data. The mechanism probably also needs a way to keep track of who or what is responsible for taking the next action on that workflow.
Based on your question, one thing to consider is whether or not you really need to represent parallel tasks in your solution. Instead, it might be possible to enqueue a set of messages and then specify a wait state that lasts until all of them complete. Representing actual parallelism implies you are moving data simultaneously through several different processes. In that case, when they join again you need an algorithm to resolve deltas, which is very much a non-trivial task.
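The "wait state until all complete" idea can be sketched as a small persisted state machine. The example is in Python rather than CFML purely for illustration; the state names, class and method names are assumptions, and JSON is used for serialization only because it is in the standard library (the same record could be stored in any format in a database row).

```python
import json


class Workflow:
    """X -> WAITING_FOR_Y -> Z, where Z is entered only once every Y task is done."""

    def __init__(self, parallel_tasks):
        self.state = "X"
        self.pending = set(parallel_tasks)
        self.history = []  # list of [from_state, to_state, reason]

    def _transition(self, new_state, reason):
        self.history.append([self.state, new_state, reason])
        self.state = new_state

    def complete_x(self):
        if self.state != "X":
            raise ValueError("X can only complete from state X")
        self._transition("WAITING_FOR_Y", "X completed")

    def complete_task(self, name):
        # Called whenever a human, web service, or automated step reports success.
        if self.state != "WAITING_FOR_Y":
            raise ValueError("no Y task is expected in state " + self.state)
        self.pending.discard(name)
        if not self.pending:
            self._transition("Z", "all Y tasks completed")

    def to_record(self):
        """Serialize so the workflow can sit in a database for days."""
        return json.dumps({"state": self.state,
                           "pending": sorted(self.pending),
                           "history": self.history})

    @classmethod
    def from_record(cls, record):
        data = json.loads(record)
        wf = cls(data["pending"])
        wf.state = data["state"]
        wf.history = data["history"]
        return wf


if __name__ == "__main__":
    wf = Workflow(["Y0", "Y1", "Y2"])
    wf.complete_x()
    wf.complete_task("Y1")
    wf = Workflow.from_record(wf.to_record())  # simulate a reload days later
    wf.complete_task("Y0")
    wf.complete_task("Y2")
    print(wf.state, wf.history)
```

A scheduled task would only need to load each stored record, check for newly reported completions (or retry a failed Y1), and save the record back.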
In the context of ColdFusion and what you're trying to accomplish, a scheduled task may be necessary if the system you're writing needs to poll other systems. Consider WDDX as a serialization format. JSON, while seductively simple, has (as I recall) some edge cases around numbers and dates that can cause you grief.
Finally see my answer to this question for some additional thoughts.
Off the top of my head, I'm thinking about the State design pattern with the state persisted to a database. Check out the Gumball Machine example in Head First Design Patterns.
Generally this will work if you have something (like a client / order / etc.) going through a number of changes of state.
Different things will happen to your object depending on what state you are in, and that might mean sitting in a database table waiting for a flag to be updated by a user manually.
In terms of other languages I know Grails has a workflow module available. I don't know if you would be better off porting to CF or jumping ship to Grails (right tool for the job and all that).
It's just a thought, hope it helps.