How do I call Cloud Workflows sequentially?
I don't want to start a workflow while another execution of the same workflow is still running.
You have a couple of options:
create a primary, top-level workflow that calls all the other workflows as steps, using the googleapis.workflowexecutions.v1.projects.locations.workflows.executions.create action (see the sketch after these options)
literally, this means you have one main workflow with many steps, each of which triggers one child workflow after the other using the call statement above. Steps are executed sequentially.
Leverage the Firestore API to write a flag to a collection that records whether a workflow is in progress; when another execution starts, it checks the flag and stops if one is already running.
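A minimal sketch of the first option, assuming two child workflows named child-workflow-a and child-workflow-b in us-central1 (the project ID and workflow names are placeholders). As far as I know, the connector call waits for each child execution to finish before the next step starts:

main:
  steps:
    - runFirstChild:
        call: googleapis.workflowexecutions.v1.projects.locations.workflows.executions.create
        args:
          parent: projects/PROJECT_ID/locations/us-central1/workflows/child-workflow-a
        result: firstExecution
    - runSecondChild:
        call: googleapis.workflowexecutions.v1.projects.locations.workflows.executions.create
        args:
          parent: projects/PROJECT_ID/locations/us-central1/workflows/child-workflow-b
        result: secondExecution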
In my current architecture, multiple Dataflow jobs are triggered at various stages. As part of the ABC framework, I need to capture the job IDs of those jobs as audit metrics inside the Dataflow pipeline and update them in BigQuery.
How do I get the run ID of a Dataflow job from within the pipeline using Java?
Is there an existing method I can use for that, or do I need to use Google Cloud's client library inside the pipeline?
If you are submitting to Dataflow, I believe this might work:
// pipeline.run() returns a DataflowPipelineJob when the DataflowRunner is used
DataflowPipelineJob result = (DataflowPipelineJob) pipeline.run();
String jobId = result.getJobId();
But you cannot access that within the pipeline itself, as far as I know (DoFns etc.).
The best way to ensure you know your job ID/name is to set it yourself. You can do this by passing --jobName; it is then accessible via options.getJobName(), and Dataflow will use that name for the job. Note that it must be unique.
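A minimal sketch of that approach, where the job name passed on the command line is read back inside a DoFn through the pipeline options (the class and transform names here are illustrative):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;

public class JobNameAuditExample {

  // Tags each element with the job name taken from the pipeline options at runtime.
  static class TagWithJobName extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      String jobName = c.getPipelineOptions().getJobName();
      c.output(c.element() + "," + jobName);
    }
  }

  public static void main(String[] args) {
    // Run with e.g. --runner=DataflowRunner --jobName=abc-audit-20240601-001 (must be unique).
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here, applying ParDo.of(new TagWithJobName()) where the audit row is built ...
    pipeline.run();
  }
}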
I have the following infrastructure in place: Dataflow is used to send messages from AWS SQS to Google Cloud's Pub/Sub.
Messages are read with Java and Apache Beam (SqsIO).
Is there a way with Dataflow to delete the messages in AWS SQS once they arrive in / are read into Pub/Sub, and what would that look like? Can this be done in Java with Apache Beam?
Thank you for any answers in advance!
There's no built-in support for message deletion, but you can add code that deletes messages read from AWS SQS using a Beam ParDo. However, you must perform such a deletion with care.
A Beam runner performs reading using one or more workers. A given work item could fail at any time, and a runner usually re-runs a failed work item. Additionally, most runners fuse multiple steps. For example, if you have a Read transform followed by a delete ParDo, a runner may fuse these transforms and execute them together. Now, if a work item fails after partially deleting data, a re-run of that work item may fail or may produce incorrect data.
The usual solution is to add a fusion break between the two steps. You can achieve this with Beam's Reshuffle.viaRandomKey() transform (or simply by adding any transform that uses GroupByKey). For example, the flow of your program can be as follows.
pipeline
    .apply(SqsIO.read())
    .apply(Reshuffle.viaRandomKey())       // fusion break: the read is checkpointed before deletion happens
    .apply(ParDo.of(new DeleteSQSDoFn()))
    .apply(BigQuery.Write(...));
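The DeleteSQSDoFn above is something you write yourself. A rough sketch of what it could look like, assuming the AWS SDK v1 client and a hard-coded queue URL (both are placeholders; wire credentials and region the same way as for SqsIO):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import org.apache.beam.sdk.transforms.DoFn;

public class DeleteSQSDoFn extends DoFn<Message, Message> {

  // Placeholder queue URL; in practice pass it in via the constructor or pipeline options.
  private static final String QUEUE_URL =
      "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue";

  private transient AmazonSQS sqs;

  @Setup
  public void setup() {
    sqs = AmazonSQSClientBuilder.defaultClient();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    Message message = c.element();
    // Delete the message using the receipt handle it was read with.
    // A retried bundle may try to delete a message that is already gone;
    // you may want to catch and ignore the resulting exception here.
    sqs.deleteMessage(QUEUE_URL, message.getReceiptHandle());
    c.output(message);
  }
}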
I have a use case where I schedule a task 24h into the future after an event occurs. This task represents some sort of "deadline" for other things to happen.
The scheduled task triggers the creation of a report. If not all of the above-mentioned "other things" have completed by this time, the triggered report-creation process creates the report anyway with the information it has at that point.
If, on the other hand, all the other things do complete before these 24h, then ideally I'd like to re-use the same Google Cloud Task to trigger the same process (it's identical to the previous case, but will contain all of the information possible).
I would imagine the easiest way to achieve the above is to:
schedule a task 24h into the future
if all information arrives: run the task early, before its scheduled time
Reading through the Google Cloud Tasks documentation, however, I don't see an option to run a task early. That feature does exist in the Cloud Tasks console, so I was wondering whether it is also available in the API and client libraries.
Thanks!
This is probably what you're looking for
https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/run
NOTE: It does say, however, that "This command is meant to be used for manual debugging".
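The same call is exposed in the client libraries as runTask; here is a minimal sketch with the Java client (the project, location, queue and task IDs are placeholders):

import com.google.cloud.tasks.v2.CloudTasksClient;
import com.google.cloud.tasks.v2.Task;
import com.google.cloud.tasks.v2.TaskName;

public class RunTaskEarly {
  public static void main(String[] args) throws Exception {
    // Placeholder identifiers; use the same values the task was created with.
    TaskName name = TaskName.of("my-project", "us-central1", "report-queue", "deadline-task-id");
    try (CloudTasksClient client = CloudTasksClient.create()) {
      // Dispatches the task immediately instead of waiting for its schedule time.
      Task task = client.runTask(name);
      System.out.println("Dispatched: " + task.getName());
    }
  }
}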
Currently, the following happens to my data:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load this data into BigQuery.
I need a low-cost solution to know when this BigQuery job is finished, and to trigger a Dataflow pipeline only after the job is completed.
Note:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea: from what I saw, this trigger uses the job ID, which apparently cannot be fixed, so it seems I would have to redeploy the function for every job that runs. And of course, it's an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log entry that indicates that the job finished.
My files are large, so it isn't a good idea to use a Google Cloud Function that just waits until the job finishes.
Despite your remark about Stackdriver Logging, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can also add a dataset filter if needed.
Then create a sink on this advanced filter to a Pub/Sub topic, trigger your Cloud Function from that topic, and run your Dataflow job from the function.
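A rough sketch of such a function in Java, assuming a background function triggered by the Pub/Sub topic the sink writes to (the class name is illustrative, and the Dataflow launch step is left as a placeholder):

import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class OnBigQueryJobDone implements BackgroundFunction<OnBigQueryJobDone.PubSubMessage> {

  // Minimal shape of the Pub/Sub event payload; data holds the base64-encoded LogEntry JSON.
  public static class PubSubMessage {
    public String data;
  }

  @Override
  public void accept(PubSubMessage message, Context context) {
    String logEntry = new String(Base64.getDecoder().decode(message.data), StandardCharsets.UTF_8);
    // TODO: read the job id / destination table from the LogEntry, then launch
    // the Dataflow pipeline (for example via a Dataflow template launch request).
    System.out.println("BigQuery job completed: " + logEntry);
  }
}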
If this doesn't match your expectation, can you detail why?
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in a sequential fashion. You define a DAG, and Composer executes each node of the DAG while checking dependencies, so that things run either in parallel or sequentially based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
Any ideas on how to reliably trigger a URL (web service) at a specific time, with precision in seconds? For example, the script would be set up so that it can trigger a web service at 2015-05-27 12:34:55. In my scenario, the user will be able to select, down to the second, the time at which a trade should execute. The web service must then be triggered at that specific time.
AWS Lambda is not able to run at specific times.
Cron jobs won't work, as cron does not run every second.
SQS might work, but coding it up to be reliable could be hard.
Thanks!
"at" command does what you need: https://calomel.org/cron_at.html
An additional tool one can use is called "at", and it is used to execute a job only once. "at" is very useful, for example, if you want to run a backup job starting at 8pm and you expect to be leaving at 5:30pm.