Can I configure Google DataFlow to keep nodes up when I drain a pipeline - google-cloud-platform

I am deploying a pipeline to Google Cloud DataFlow using Apache Beam. When I want to deploy a change to the pipeline, I drain the running pipeline and redeploy it. I would like to make this faster. It appears from the logs that on each deploy DataFlow builds up new worker nodes from scratch: I see Linux boot messages going by.
Is it possible to drain the pipeline without tearing down the worker nodes so the next deployment can reuse them?

Rewriting Inigo's answer here:
Answering the original question: no, there's no way to do that. Updating should be the way to go. I was not aware it was marked as experimental (probably we should change that), but the update approach has not changed in the 3 years I have been using Dataflow. As for the special cases where update does not work: even if your feature existed, the workers would still need the new code, so there is not really much to save, and update should work in most other cases.
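For reference, here is a minimal sketch of that update flow with the Beam Python SDK (the project, bucket, topic, and job names are placeholders): passing update=True with the same job_name asks Dataflow to replace the running job's code while keeping its in-flight state.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket/topic names; replace with your own.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    streaming=True,
    job_name="my-streaming-job",  # must match the name of the job being replaced
    update=True,                  # update the running job instead of starting a new one
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "Transform" >> beam.Map(lambda msg: msg.upper())  # the changed pipeline code
     | "Write" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/processed"))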

Related

Is it possible to re-run a job in Google Cloud Dataflow after it has succeeded

Maybe the question sounds stupid, but I was wondering: once a job has finished successfully and has an ID, is it possible to start the same job again?
Or is it necessary to create another one?
Otherwise I would end up with jobs of the same name throughout the list.
I just want to know if there is a way to restart it without recreating it.
It's not possible to run the exact same job again, but you can create a new job with the same name that runs the same code. It will just have a different job ID and show up as a separate entry in the job list.
If you want to make running repeated jobs easier, you can create a template. This will let you create jobs from that template via a gcloud command instead of having to run your pipeline code.
Cloud Dataflow does have a restart capability in the SDK. One suggested pattern (to help with deployment) is to create a template for the graph you want to run repeatedly and then execute jobs from that template.
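If you go the template route, here is a hedged sketch of launching a job from a staged template through the Dataflow REST API using google-api-python-client (the project, region, and GCS paths are placeholders); gcloud dataflow jobs run does the same thing from the command line:

from googleapiclient.discovery import build

# Placeholder project, region, and GCS paths; uses Application Default Credentials.
dataflow = build("dataflow", "v1b3")
response = dataflow.projects().locations().templates().launch(
    projectId="my-project",
    location="us-central1",
    gcsPath="gs://my-bucket/templates/my-template",  # where the template was staged
    body={
        "jobName": "weekly-run",   # name for this particular run
        "parameters": {},          # runtime parameters, if the template defines any
    },
).execute()
print(response["job"]["id"])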

Setting build priority in yaml or UI

Is there a way to set up a build's priority in a YAML-based pipeline? There seem to be references to build priority in the Azure DevOps API, but nothing on how to do this via YAML. I thought there might be some docs in the Triggers section, but no.
We need this because we have some fast-building NuGet packages, but these get starved by slow-build pipelines, making turnaround time for packages painful.
The closest thing I could come up with as a workaround is using agent demands in the YAML
pool:
  name: Default   # name of your self-hosted agent pool
  demands:
  - Agent.ComputerName -equals XYZ
to separate build pipelines, but this is a bit of a hack and doesn't use agents efficiently.
A way to set this in UI would be acceptable, but I couldn't seem to find anything.
Recently Azure DevOps introduced the ability to manually specify that a build or release runs next.
This manifests as a Run next button.
So while you can't say "this pipeline always takes priority" yet, you can manually force a specific run to the front of the queue.
If you need a specific pipeline to always take priority, then you likely want to set up a separate agent pool just for those pipelines, or use demands as Leo Liu mentioned.
I'm afraid this feature is not yet supported in Azure DevOps.
There is a popular user voice suggestion about it; you can upvote it and check the feedback on that ticket.
Currently, as a workaround, do what you already did: set demands in the build definitions to force building with specific agents.
Hope this helps.

Running a background task involving a third-party service in Django

The use case is this: we need to pull in data from a third-party service and update the database with fresh records every week. The approaches I have been able to explore so far are
either creating a custom django-admin command
or running a background task using Celery (and probably ELK for logging)
I just want to know which way is more feasible and simpler, and whether there's another approach I could explore. What I want is to monitor the task for the first few runs and then just rely on the logs.
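For what it's worth, the django-admin command option can stay quite small. Here is a minimal sketch (the app, Record model, and fetch_records helper are hypothetical) that cron or any scheduler could trigger weekly; with Celery, roughly the same body would live in a task scheduled by celery beat instead:

# myapp/management/commands/sync_thirdparty.py  (hypothetical app and command names)
from django.core.management.base import BaseCommand

from myapp.models import Record            # hypothetical model with external_id and payload fields
from myapp.services import fetch_records   # hypothetical client for the third-party service


class Command(BaseCommand):
    help = "Pull fresh records from the third-party service and upsert them"

    def handle(self, *args, **options):
        count = 0
        for item in fetch_records():
            # Upsert on the remote identifier so repeated weekly runs stay idempotent
            Record.objects.update_or_create(
                external_id=item["id"],
                defaults={"payload": item},
            )
            count += 1
        self.stdout.write(self.style.SUCCESS(f"Synced {count} records"))

Run it with python manage.py sync_thirdparty and watch the output or logs for the first few runs.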

Keep running Dataproc Master node

Is it possible to keep the master machine running in Dataproc? Every time I run the job, after a while (~1 hour) I see that the master node has stopped. It is not a real issue, since I can easily start it again, but I would like to know if there is a way to keep it awake.
A possible approach that occurs to me is to set up a scheduled job on the master machine, but I want to know if there is a more official way to achieve this.
Dataproc does not stop any cluster nodes (including master) when they are idle.
You need to check whether you have some kind of automation or user on your end that could be stopping it.

How to automate the Updating/Editing of Amazon Data Pipeline

I want to use the AWS Data Pipeline service and have created some pipelines using the manual JSON-based mechanism, which uses the AWS CLI to create, put, and activate the pipeline.
My question is: how can I automate the editing or updating of the pipeline if something in the pipeline definition changes? Things I can imagine changing include the schedule time, the addition or removal of Activities or Preconditions, references to DataNodes, resource definitions, etc.
Once the pipeline is created, we cannot edit quite a few things as mentioned here in the official doc: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-manage-pipeline-modify-console.html#dp-edit-pipeline-limits
This makes me believe that if I want to automate updating the pipeline, I would have to delete it and re-create/activate a new one. If so, the next question is: how can I create an automated process that identifies the previous version's ID, deletes it, and creates a new one? Essentially I am trying to build a release-management flow where the configuration JSON file is released and deployed automatically.
Most commands like activate, delete, list-runs, put-pipeline-definition, etc. take the pipeline-id, which is not known until the new pipeline is created. I am unable to find anything that remains constant across updates or recreation (the unique-id and name parameters of the create-pipeline command are consistent, but I can't use them for the above-mentioned tasks; I need the pipeline-id for those).
Of course I can try writing shell scripts that grep and search the output, but is there a better way? Is there some other information I am missing?
Thanks a lot.
You cannot edit schedules completely or change references, so creating/deleting pipelines seems to be the best way for your scenario.
You'll need the pipeline-id to delete a pipeline. Is it not possible to keep a record of that somewhere? You could have a file with the last used id stored locally or in S3, for instance.
Some other ways I can think of are:
- If you have only one pipeline in the account, you can list-pipelines and use the only result.
- If you have the pipeline name, you can list-pipelines and find the id.
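As a sketch of that last option, assuming boto3 and a stable pipeline name reused across releases (the name below is a placeholder), the lookup-and-recreate step could look roughly like this; the released JSON definition can then be pushed and activated with the existing aws datapipeline CLI calls using the new id:

import time
import boto3

client = boto3.client("datapipeline")
PIPELINE_NAME = "my-etl-pipeline"  # placeholder: the stable name used across releases

# Find the previous version by name and delete it
# (paginate with the marker field if the account has many pipelines)
for summary in client.list_pipelines()["pipelineIdList"]:
    if summary["name"] == PIPELINE_NAME:
        client.delete_pipeline(pipelineId=summary["id"])

# Re-create the pipeline; a fresh uniqueId avoids idempotency clashes with the deleted one
new_id = client.create_pipeline(
    name=PIPELINE_NAME,
    uniqueId=f"{PIPELINE_NAME}-{int(time.time())}",
)["pipelineId"]
print(new_id)  # feed this id to put-pipeline-definition and activate-pipeline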