Want to trigger a Dataflow pipeline with Cloud Functions - google-cloud-platform

I have a scenario where we want to trigger a Dataflow pipeline from a Cloud Function. The pipeline has to transform some data and insert it into BigQuery.
I have created a custom Dataflow pipeline that transforms the data and inserts it into BigQuery (following the standard way of installing Apache Beam and using the deployment command from Cloud Shell). The pipeline ran successfully, and the job shows up in monitoring with its DAG.
Now I want to trigger the pipeline from a Cloud Function. From my research, the steps would be:
(i) create a custom flex template of the pipeline
(ii) stage it in a Google Cloud Storage bucket
(iii) call it through the REST API from the Cloud Function
Is the approach in these steps the recommended way of doing it, or should I try another approach? I haven't found any other way apart from classic templates.
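Step (iii) can be sketched roughly as follows, assuming the flex template spec file has already been built and staged in a bucket. The request shape follows the Dataflow REST method projects.locations.flexTemplates.launch; the project, region, bucket path, job name and parameter names are placeholders, and the entry point name trigger_pipeline is hypothetical. The actual call assumes the google-auth package is available in the function's environment.

```python
import json

DATAFLOW_API = "https://dataflow.googleapis.com/v1b3"


def build_flex_launch_request(project, region, job_name, template_path, parameters):
    """Return (url, body) for the flexTemplates.launch REST method."""
    url = f"{DATAFLOW_API}/projects/{project}/locations/{region}/flexTemplates:launch"
    body = {
        "launchParameter": {
            "jobName": job_name,
            # GCS path of the template spec file staged in step (ii).
            "containerSpecGcsPath": template_path,
            # Pipeline parameters are sent as strings.
            "parameters": {key: str(value) for key, value in parameters.items()},
        }
    }
    return url, body


def launch_flex_template(url, body):
    """Send the request using the function's default credentials."""
    # Imported lazily so the builder above works without google-auth installed.
    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    response = AuthorizedSession(credentials).post(url, json=body)
    response.raise_for_status()
    return response.json()


def trigger_pipeline(request):
    """Hypothetical HTTP Cloud Function entry point; all values are placeholders."""
    url, body = build_flex_launch_request(
        project="my-project",
        region="us-central1",
        job_name="transform-to-bigquery",
        template_path="gs://my-bucket/templates/pipeline.json",
        parameters={"output_table": "my-project:my_dataset.my_table"},
    )
    job = launch_flex_template(url, body)
    return json.dumps(job.get("job", {}))
```

Keeping the request builder separate from the network call makes it possible to check the payload locally before deploying the function.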

Related

Google Cloud Run service deployment, is it the best direction in my situation?

I have some experience with Google Cloud Functions (CF). I tried to deploy a CF function recently with a Python app, but it uses an NLP model so the 8GB memory limit is exceeded when the model is triggered. The function is triggered when a JSON file is uploaded to a bucket.
So, I plan to try Google Cloud Run but I have no experience with it. Also, I am not completely sure if it is the best course of action.
If it is, what is the best way of implementing it, given that the Run service will be triggered by a file uploaded to a bucket? In CF, you can select the triggering event; in Run I didn't see anything like that. I could use some starting points, as I couldn't find my case in the GCP documentation.
Any help will be appreciated.
You can use at least these two things:
The legacy one: create a GCS notification to Pub/Sub. Then create a push subscription and set the Cloud Run URL as the HTTP push destination.
A more recent way is to use Eventarc to invoke a Cloud Run endpoint directly from an event (it creates roughly the same thing, a Pub/Sub topic and a push subscription, but it is fully configured for you).
EDIT 1
When you use push notifications, you will receive a standard Pub/Sub message. The format is described in the documentation, both for the attributes and for the body content; keep in mind that the raw content is base64-encoded and you have to decode it to get the final format.
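As a rough illustration of the decoding step, here is a minimal sketch that unpacks a Pub/Sub push envelope. It assumes the notification payload is JSON (the default for GCS notifications); the function name decode_push_envelope and the sample values are made up for the example.

```python
import base64
import json


def decode_push_envelope(envelope):
    """Split a Pub/Sub push request body into (attributes, decoded payload)."""
    message = envelope["message"]
    attributes = message.get("attributes", {})
    # The "data" field carries the raw content, base64-encoded.
    raw = base64.b64decode(message.get("data", ""))
    payload = json.loads(raw) if raw else {}
    return attributes, payload


# Example with a hand-built envelope shaped like a GCS notification.
sample = {
    "message": {
        "attributes": {"eventType": "OBJECT_FINALIZE", "bucketId": "my-bucket"},
        "data": base64.b64encode(
            json.dumps({"bucket": "my-bucket", "name": "input.json"}).encode()
        ).decode(),
        "messageId": "1",
    },
    "subscription": "projects/my-project/subscriptions/my-sub",
}
attributes, payload = decode_push_envelope(sample)
```

In a Cloud Run service the envelope would be the parsed JSON body of the incoming POST request.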
I personally have a Cloud Run service that logs the contents of every request, so that the logs give me all the data I need for development. When I have a new message format, I configure the push to that Cloud Run endpoint and I automatically get the format.
For Eventarc, the format will be added to the UI soon (I have seen that feature in preview, but it is not yet available). The best solution is to log the content so you know what you receive and what to do with it!
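A request-logging service of that kind could look roughly like this. It is only a sketch using the Python standard library, not the service described above; the names describe_request, LogRequestHandler and main are made up. It relies on Cloud Run passing the listening port in the PORT environment variable and collecting stdout into the logs.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def describe_request(method, path, headers, body):
    """Build a JSON-serialisable record of one incoming request."""
    try:
        parsed = json.loads(body) if body else None
    except ValueError:
        parsed = None  # not JSON; the raw text is still kept in "body"
    return {
        "method": method,
        "path": path,
        "headers": dict(headers),
        "body": body,
        "json": parsed,
    }


class LogRequestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length") or 0)
        body = self.rfile.read(length).decode("utf-8", errors="replace")
        record = describe_request("POST", self.path, self.headers.items(), body)
        # One JSON object per line on stdout, which ends up in the service logs.
        print(json.dumps(record), flush=True)
        # A 204 response acknowledges the Pub/Sub push delivery.
        self.send_response(204)
        self.end_headers()


def main():
    """Serve forever on the port given by the PORT environment variable."""
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("", port), LogRequestHandler).serve_forever()
```

Calling main() as the container entry point starts the server; pointing a push subscription or an Eventarc trigger at it then shows the exact headers and body each event arrives with.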

Amazon SageMaker Model Registry / Pipelines - how to manually set a Stage for a given Model Version?

This might be a very specific question, but I will try anyway.
I want to explicitly set the Stage column in Model registry for a given Model Version:
This picture comes from the documentation and it gets set only when you run the example SageMaker Projects MLOps Templates they provide. When I create the Model Package (i.e. Model Version) manually, the column remains empty. How do I set it? What API do I call?
Additionally, the documentation on browsing the model version history has the following sentence:
How do we send that exact event ("Deployed to stage XYZ") manually?
I already thoroughly went over all the files SageMaker MLOps Project generates (CodeBuild Builds, CodePipeline, CloudFormation, various .py files, SageMaker Pipeline) but could not find any direct and explicit call for that event.
I think it may be somehow connected to the Tag sagemaker:deployment-stage but I've already set it on Endpoint, EndpointConfiguration and Model, with no success. I also tried to blindly call the UpdateModelPackage API and set Stage in CustomerMetadataProperties. Again - no luck.
The only thing I get in that Activity tab is that the given Model Version is deployed to an Inference endpoint:
You can set the status with the ModelApprovalStatus parameter of the create_model_package API or the update_model_package API.
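With boto3, the update call could be sketched as below. The helper names are made up and the ARN in the usage line is a placeholder. Note that this changes the approval status of the model version, which is what this answer describes; nothing here confirms that it fills the Stage column.

```python
VALID_STATUSES = {"Approved", "Rejected", "PendingManualApproval"}


def build_update_request(model_package_arn, status, description=None):
    """Build the keyword arguments for update_model_package."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unsupported approval status: {status!r}")
    request = {"ModelPackageArn": model_package_arn, "ModelApprovalStatus": status}
    if description:
        request["ApprovalDescription"] = description
    return request


def set_approval_status(model_package_arn, status, description=None):
    """Apply the status change (needs boto3 and AWS credentials)."""
    # Imported lazily so the builder above can be used without boto3 installed.
    import boto3

    client = boto3.client("sagemaker")
    return client.update_model_package(
        **build_update_request(model_package_arn, status, description)
    )
```

For example, set_approval_status("arn:aws:sagemaker:...:model-package/my-group/1", "Approved") would approve version 1 of a hypothetical model package group.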
A model package state change should create an event in EventBridge, like many other SageMaker events (https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html#eventbridge-model-package), which enables you to run the automation of your choice.
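To make this concrete, here is a sketch of an EventBridge event pattern for those events, plus a small helper that pulls the interesting fields out of an event. The detail-type string and the detail field names are taken from my reading of the linked documentation page and should be checked against it; the function names and the rule name are made up.

```python
import json


def model_package_pattern(group_name=None, status=None):
    """Event pattern matching SageMaker model package state changes."""
    pattern = {
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Model Package State Change"],
    }
    detail = {}
    if group_name:
        detail["ModelPackageGroupName"] = [group_name]
    if status:
        detail["ModelApprovalStatus"] = [status]
    if detail:
        pattern["detail"] = detail
    return pattern


def summarize_event(event):
    """Extract the ARN and approval status from a delivered event."""
    detail = event.get("detail", {})
    return {
        "arn": detail.get("ModelPackageArn"),
        "status": detail.get("ModelApprovalStatus"),
    }


def create_rule(rule_name, pattern):
    """Register the pattern as an EventBridge rule (needs boto3 and credentials)."""
    import boto3

    events = boto3.client("events")
    return events.put_rule(
        Name=rule_name, EventPattern=json.dumps(pattern), State="ENABLED"
    )
```

A rule created this way still needs a target (a Lambda function or a CodePipeline pipeline, for instance) before it triggers anything.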
In the default SageMaker Pipelines Project template, you can see the proposed EventBridge-driven logic in the CodePipeline pipeline created for deployment: at the top it shows "Trigger - CloudWatchEvent".
You don't see the event source as code in the Git repository, because in that demo template the status change is expected to be made in the Studio model registry UI.
The EventBridge events emitted by the Model Registry also appear in a few blog posts:
Taming Machine Learning on AWS with MLOps: A Reference Architecture
Patterns for multi-account, hub-and-spoke Amazon SageMaker model registry
Build MLOps workflows with Amazon SageMaker projects, GitLab, and GitLab pipelines
I was having the exact same issue: I wanted to change the model stage but could not find where it was being done in the sample code AWS provides.
After some research and looking into the sample code, I realized that it was being done in the CloudFormation execution. First they add the tag
'sagemaker:deployment-stage': stage_config['Parameters']['StageName']
and then the CloudFormation execution (the cfnUpdate call) updates the stage and deploys.
I couldn't find another way to change the stage with a call to update_model_package or other methods.

How to get dataflow job id from inside that dataflow job - JAVA

In my current architecture, multiple Dataflow jobs are triggered at various stages. As part of the ABC framework, I need to capture the job ID of each of those jobs as an audit metric inside the Dataflow pipeline and write it to BigQuery.
How do I get the run ID of a Dataflow job from within the pipeline using Java?
Is there an existing method that I can use for that, or do I need to use Google Cloud's client library inside the pipeline?
If you are submitting to Dataflow, I believe this might work:
// The cast only succeeds when the pipeline is run with the DataflowRunner.
DataflowPipelineJob result = (DataflowPipelineJob) pipeline.run();
String jobId = result.getJobId();
But as far as I know, you cannot access that from within the pipeline itself (DoFns etc.).
The best way to ensure you know your job ID/name is to set it yourself. You can do this by setting --jobName, which is then accessible via options.getJobName(); Dataflow will use this name. Note that it must be unique.
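One way to get a unique, predictable name is to derive it from a prefix, a timestamp and a short random suffix. The sketch below is written in Python only to keep it short; the same string logic carries over to Java, and the result would be passed as --jobName. The naming scheme itself is an assumption, as is the restriction to lowercase letters, digits and hyphens starting with a letter, which is how I understand Dataflow's job name rules.

```python
import re
import uuid
from datetime import datetime, timezone


def unique_job_name(prefix, now=None, suffix=None):
    """Build a job name from a prefix, a UTC timestamp and a random suffix."""
    now = now or datetime.now(timezone.utc)
    suffix = suffix or uuid.uuid4().hex[:8]
    # Keep only lowercase letters, digits and hyphens.
    cleaned = re.sub(r"[^a-z0-9-]+", "-", prefix.lower()).strip("-")
    # Make sure the name starts with a letter.
    if not cleaned or not cleaned[0].isalpha():
        cleaned = f"job-{cleaned}".rstrip("-")
    return f"{cleaned}-{now:%Y%m%d-%H%M%S}-{suffix}"


name = unique_job_name(
    "ABC Audit_Load",
    now=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    suffix="deadbeef",
)
# name == "abc-audit-load-20240102-030405-deadbeef"
```

Because the launcher chooses the name, it can record it as an audit value before the job starts, and the pipeline can read the same value back from its options.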

Way to trigger dataflow only after Big Query Job finished

Currently, my data goes through the following steps:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load this data into BigQuery.
I need a low-cost solution to know when this BigQuery job has finished and to trigger a Dataflow pipeline only after the job is completed.
Notes:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea. From what I saw, this trigger uses the job ID, which apparently cannot be fixed, so I would have to deploy the function again whenever I run a job. And of course it is an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log that indicates that the job has finished.
My files are large, so it isn't a good idea to use a Google Cloud Function to wait until the job finishes.
Regarding the Stackdriver Logging approach you mention, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can add a dataset filter as well if needed.
Then create a sink on this advanced filter that feeds your Cloud Function (a sink exports to a Pub/Sub topic, which can trigger the function), and launch your Dataflow job from there.
If this doesn't match your expectation, can you detail why?
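On the function side, the log entry exported by the sink arrives as a base64-encoded Pub/Sub message. A minimal sketch of extracting the finished job from it is shown below; the field path mirrors the filter above, the function names are made up, and the Dataflow launch itself is left as a comment. One detail worth handling: a job in state DONE can still have failed, in which case the status carries an error.

```python
import base64
import json


def completed_job_id(event):
    """Return the BigQuery job ID if the logged job finished without error."""
    entry = json.loads(base64.b64decode(event["data"]))
    job = (
        entry.get("protoPayload", {})
        .get("serviceData", {})
        .get("jobCompletedEvent", {})
        .get("job", {})
    )
    status = job.get("jobStatus", {})
    if status.get("state") != "DONE":
        return None
    if status.get("error"):
        return None  # DONE with an error means the load job failed
    return job.get("jobName", {}).get("jobId")


def on_bigquery_job_done(event, context):
    """Hypothetical Pub/Sub-triggered Cloud Function entry point."""
    job_id = completed_job_id(event)
    if job_id is None:
        return
    print(f"BigQuery job {job_id} finished; launching Dataflow")
    # Launch the Dataflow job here, for example from a template.
```

Since the function only reacts to the completion log, it never has to wait for the load job, which keeps its run time short regardless of file size.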
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs sequentially. You define a DAG, and Composer executes each node of the DAG and checks the dependencies, so that tasks run either in parallel or sequentially based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples

Google App Engine Parse Logs in DataStore Save to Table

I am new to GAE and I am trying to quickly find a way to retrieve logs stored in Datastore, clean them to my specs, and then save them to a table to be queried later for a reports view in my app. I was thinking of using Google Dataflow and creating batch jobs (the app is Python/Django), but the documentation does not seem to fit my use case, so maybe Dataflow is not the answer. I could write a Python script that uses BigQuery and schedule it through cron, but then I would have to deal with errors myself, and it seems there should be a faster way to solve this problem.
Any help/thoughts/suggestions is always greatly appreciated.
You can use the Dataflow/Beam Python SDK to develop a pipeline that reads entities from Datastore [1], transforms the data, and writes a table to BigQuery [2]. To schedule this job to run regularly you'll have to use a third-party mechanism such as a cron job. Note that Dataflow performs automatic scaling and performs retries to handle errors, so you are not expected to address these complexities manually.
[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/datastore/v1/datastoreio.py
[2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
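A pipeline along those lines might be sketched as follows. The entity kind, the property names (timestamp, level, message), the cleaning rules and the table schema are all hypothetical. The sketch uses the v1new Datastore connector from the Beam Python SDK rather than the v1 module linked in [1], which is an assumption about the installed Beam version.

```python
def entity_to_row(properties):
    """Clean one log entity's properties into a BigQuery row."""
    return {
        "timestamp": str(properties.get("timestamp", "")),
        "level": str(properties.get("level", "INFO")).upper(),
        "message": str(properties.get("message", "")).strip(),
    }


def run(project, kind, table, pipeline_args=None):
    """Read entities of one kind from Datastore and append them to a BigQuery table."""
    # Imported lazily so entity_to_row can be tested without Beam installed.
    import apache_beam as beam
    from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
    from apache_beam.io.gcp.datastore.v1new.types import Query
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadLogs" >> ReadFromDatastore(Query(kind=kind, project=project))
            | "Clean" >> beam.Map(lambda entity: entity_to_row(entity.properties))
            | "WriteRows" >> beam.io.WriteToBigQuery(
                table,
                schema="timestamp:STRING,level:STRING,message:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )
```

Running it on Dataflow would mean passing the usual runner options (runner, project, region, temp location) in pipeline_args; the cleaning function is kept separate so it can be tested without any cloud access.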