Cloud Data Fusion triggered pipeline - reuse the already provisioned Dataproc clusters - google-cloud-platform

Is there a way to avoid the provisioning step for subsequently triggered outbound pipelines? It looks like when a pipeline triggers an outbound pipeline, it does the provisioning all over again. Can we simply execute the triggered pipeline on the provisioned cluster of the first?
Thanks.

Cloud Data Fusion has support for running a pipeline against an existing Dataproc cluster, see more details in this doc.

Related

Turn on/off AWS EMR clusters

How can I turn on/off EMR clusters? There is only one possibility to terminate permanently. What if I do not need the cluster at nights and I do not want to create a new cluster every morning?
You can't do this. Stopping an EMR cluster is not supported. You simply terminate it when you don't need it.
To protect your data, you should be using EMRFS which allows EMR cluster to read data from S3. This way, there is no need to copy any data from S3 to HDFS.
You can enable scale up\scale down policies available in EMR UI and resize your cluster based on multiple metrics, i.e. ram\cpu utilization. You can also create external job that will send to EMR scale up\scale down command via awscli and you can schedule such jobs to run in the morning and in the evening.
From my experience resizing works well on task nodes while resizing core nodes demands HDFS sync that works only if you don't run any tasks on your EMR.

How to run PySpark on AWS EMR with AWS Lambda

How may I make my PySpark code to run with AWS EMR from AWS Lambda? Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?
You need transient cluster for this case which will auto terminate once your job is completed or the timeout is reached whichever occurs first.
You can access this link on how to initialise the same.
What are the processes available to create a EMR cluster:
Using boto3
/ AWS
CLI
/ Java
SDK
Using cloudformation
Using Data Pipeline
Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?
No. It isn’t mandatory to use lambda to create an auto-terminating cluster.
You just need to specify a flag --auto-terminate while creating a cluster using boto3 / CLi / Java-SDK. But this case you need to submit the job along with cluster config. Ref
Note:
Its not possible to create an auto-terminating cluster using cloudformation. By design, CloudFormation assumes that the
resources that are being created will be permanent to some extent.
If you REALLY had to do it this way, you could make an AWS api call to
delete the CF stack upon finishing your EMR tasks.
How may I make my PySpark code to run with AWS EMR from AWS Lambda?
You can design your lambda to submit spark
job.
You can find an example
here
In my use case I have one parameterised lambda which invoke CF to create cluster, submit job and terminate cluster.

Scheduling cron jobs on Google Cloud DataProc

I currently have a PySpark job that is deployed on a DataProc cluster (1 master & 4 worker nodes with sufficient cores and memory). This job runs on millions of records and performs an expensive computation (Point in Polygon). I am able to successfully run this job by itself. However, I want to schedule the job to be run on the 7th of every month.
What I am looking for is the most efficient way to set up cron jobs on a DataProc Cluster. I tried to read up on Cloud Scheduler, but it doesn't exactly explain how it can be used in conjunction with a DataProc cluster. It would be really helpful to see either an example of cron job on DataProc or some documentation on DataProc exclusively working together with Scheduler.
Thanks in advance!
For scheduled Dataproc interactions (create cluster, submit job, wait for job, delete cluster while also handling errors) Dataproc's Workflow Templates API is a better choice than trying to orchestrate these yourself. A key advantage is Workflows are fire-and-forget and any clusters created will also be deleted on completion.
If your Workflow Template is relatively simple such that it's parameters do not change between invocations a simpler way to schedule would be to use Cloud Scheduler. Cloud Functions are a good choice if you need to run a workflow in response to files in GCS or events in PubSub. Finally, Cloud Composer is great if your workflow parameters are dynamic or there's other GCP products in the mix.
Assuming your use cases is the simple run workflow every so often with the same parameters, I'll demonstrate using Cloud Scheduler:
I created a workflow in my project called terasort-example.
I then created a new Service Account in my project, called workflow-starter#example.iam.gserviceaccount.com and gave it Dataproc Editor role; however something more restricted with just dataproc.workflows.instantiate is also sufficient.
After enabling the the Cloud Scheduler API, I headed over to Cloud Scheduler in Developers Console. I created a job as follows:
Target: HTTP
URL: https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json
HTTP Method: POST
Body: {}
Auth Header: OAuth Token
Service Account: workflow-starter#example.iam.gserviceaccount.com
Scope: (left blank)
You can test it by clicking Run Now.
Note you can also copy the entire workflow content in the Body as JSON payload. The last part of the URL would become workflowTemplates:instantiateInline?alt=json
Check out this official doc that discusses other scheduling options.
Please see the other answer for more comprehensive solution
What you will have to do is publish an event to pubsub topic from Cloud Scheduler and then have a Cloud Function react to that event.
Here's a complete example of using Cloud Function to trigger Dataproc:
How can I run create Dataproc cluster, run job, delete cluster from Cloud Function

How to submit job on Dataproc cluster with specific service account?

I'm trying to execute jobs in the Dataproc cluster which access several resources of GCP like Google Cloud Storage.
My concern is whatever file or object is being created through my job is owned/created by Dataproc default user.
Example - 123456789-compute#developer.gserviceaccount.com.
Is there any way I can configure this user/service-account so that the object gets created by a given user/service-account instead of default one?
You can configure service account to be used by a Dataproc cluster using flag --service-account at cluster creation time.
Gcloud command would look like:
gcloud dataproc clusters create cluster-name \
--service-account=your-service-account#project-id.iam.gserviceaccount.com
More details: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts
https://cloud.google.com/dataproc/docs/concepts/iam/iam
Note: it is better to have one dataproc cluster per job so that each job get isolated environment and doesnt affect each other and you can manage them better (in terms of security as well).
you can also look at GCP Composer using which you can schedule jobs and automate them.
Hope this helps.

Automate turning on and off of redis cluster

I'm trying to automate the turning on and off process of Redis Cluster in aws. I saw the following link for reference (https://forums.aws.amazon.com/thread.jspa?threadID=149772). Is there a way to do it via cloudwatch ?
I am very new to aws platform.
Check the documentation regarding scale in/out
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/redis-cluster-resharding-online.html It also has commands to reshard a cluster manually.
Check CloudWatch metrics from the Redis cluster. https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheMetrics.HostLevel.html and https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheMetrics.Redis.html Choose the metrics that will trigger autoscaling
You can trigger an AWS Lambda on some event for a metric https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/RunLambdaSchedule.html
From the Lambda you cal call aws cli to reshard the cluster as described in 1. Example: https://alestic.com/2016/11/aws-lambda-awscli/
If you need to turn off the cluster completely, instead of the resharding commands just use https://docs.aws.amazon.com/cli/latest/reference/elasticache/delete-cache-cluster.html