Is there any way I can use preemptible instances for Dataflow jobs? - google-cloud-platform

Preemptible instances are cheaper than non-preemptible instances. Around 400-500 Dataflow jobs run daily in my organisation's project; some of them are time-sensitive and others are not. So is there any way I could use preemptible instances for the non-time-critical jobs, which would lower the overall cost of pipeline execution? Currently I'm running Dataflow jobs with the configuration specified below.
options.setTempLocation("gs://temp/");
options.setRunner(DataflowRunner.class);
options.setTemplateLocation("gs://temp-location/");
options.setWorkerMachineType("n1-standard-4");
options.setMaxNumWorkers(20);
options.setWorkerCacheMb(2000);
I'm not able to find any pipeline option for a preemptible instance setting.

Yes, it is possible to do so with Flexible Resource Scheduling in Cloud Dataflow (docs). Note that there are some things to consider:
Delayed execution: jobs are scheduled and not executed right away (you will see a new QUEUED status for your Dataflow jobs). They are run opportunistically when resources are available within a six-hour window, which makes FlexRS suitable for reducing the cost of non-time-critical workloads. Also, be sure to validate your code before submitting the job.
Batch jobs: as of now it only accepts batch jobs and requires autoscaling to be enabled:
You cannot set autoscalingAlgorithm=NONE
Dataflow Shuffle: it needs to be enabled. With Shuffle enabled, no data is stored on persistent disks attached to the VMs, so when a preemption happens and resources are claimed back there is no need to redistribute the data.
Regions: as a consequence of the previous item, only regions where Dataflow Shuffle is supported can be selected. The list is here; turn-up for new regions will be announced in the release notes. As of now, the zone is automatically chosen within the region.
Machine types: FlexRS currently supports n1-standard-2 (default) and n1-highmem-16.
SDK: requires 2.12.0 or newer for Java or Python.
Quota: quota is reserved upfront (i.e. queued jobs also consume quota).
To run a FlexRS job, use --flexRSGoal=COST_OPTIMIZED and make sure the rest of the parameters conform to the FlexRS requirements listed above.
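For reference, a minimal sketch of how that goal could be set programmatically with the Beam Java SDK, loosely adapted from the configuration in the question (assuming Beam 2.12.0+ and the FlexRS setter on DataflowPipelineOptions; the bucket name and region are placeholders):

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FlexRsJobOptions {
  public static DataflowPipelineOptions build(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setTempLocation("gs://temp/");  // placeholder bucket, as in the question
    options.setRegion("us-central1");       // must be a region where Dataflow Shuffle is supported
    options.setMaxNumWorkers(20);
    // FlexRS goal: the job is queued and run opportunistically at the discounted rate.
    options.setFlexRSGoal(DataflowPipelineOptions.FlexResourceSchedulingGoal.COST_OPTIMIZED);
    // Note: FlexRS restricts machine types (n1-standard-2 by default or n1-highmem-16),
    // so the n1-standard-4 worker type from the question would need to change.
    return options;
  }
}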
A uniform discount rate is applied to FlexRS jobs; you can compare pricing details in the following link.
Note that you might see a Beta disclaimer in the non-English documentation but, as clarified in the release notes, it's Generally Available.

Related

GCP cost optimization of compute resources

We have multiple teams using the Google Dataproc service. We would like to analyze the usage of Dataproc resources and determine if there is any scope for optimization. A few examples are listed below.
A Dataproc cluster is created without a TTL; in this case, even after a job has run, the cluster stays active and incurs cost. Here, if we are able to determine that the cluster is idle, we could recommend that the team stop the cluster when it is not in use.
A Dataproc cluster is provisioned with a higher configuration (CPU and RAM), but utilization is very low (e.g. less than 10%). Here, scaling down the resources might bring down the cost and still meet the requirements of the job.
We would like to understand if GCP already has such features; if not, is there a place where we can find the relevant logs and build our own solution for this use case?

Cloud Composer 2 takes a long time to synchronize the latest DAGs to the GKE workers

The problem is the same as the title. We sometimes wait about 1 hour, which makes our development experience very poor.
The Composer version is composer-2.0.4-airflow-2.2.3.
We have 17 DAGs.
The scheduler parses the DAGs quickly, so we suspect that the Composer workers are not syncing the DAGs via GCS FUSE.
Are there other reasons? What should we do to solve this problem?
Our GKE workload configuration is shown in the attached picture.
Based on that configuration, I would suggest increasing the resources. In Cloud Composer 2, GKE workloads such as the scheduler and workers are limited to the resources defined for them, and a lack of CPU or memory can also lead to delays in synchronization. You can monitor your DAGs and increase or decrease the resources according to the requirements, as described in this documentation.
There are many possible causes of delayed synchronization. You can follow this documentation for handling larger numbers of DAGs. For more information on tuning Cloud Composer performance, you can check this link.

Monitoring workers and identifying bottlenecks in a data pipeline

I am using Google Cloud Dataflow. Some of my data pipelines need to be optimized, and I need to understand how the workers in the Dataflow cluster are performing along these lines:
1. How much memory is being used?
Currently I am logging memory usage from Java code.
2. Is there a bottleneck in disk operations? This would help determine whether an SSD is required.
3. Is there a bottleneck in vCPUs, so that we should increase the vCPUs on the worker nodes?
I know Stackdriver can be used to monitor CPU and disk usage for the cluster. However, it does not provide information on individual workers, or on whether we are hitting a bottleneck in any of these.
You are correct that within the Dataflow Stackdriver UI you cannot view individual workers' metrics. However, you can certainly set up a Stackdriver dashboard which gives you the individual worker metrics for everything you have mentioned. Below is a sample dashboard which shows metrics for CPU, memory, network, read IOPS, and write IOPS.
Since the Dataflow job name will be part of the GCE instance name, here I filter down the GCE instances being monitored by the job name I'm interested in. In this case, my Dataflow job was named "pubsub-to-bigquery", so I filtered down to instance_name ~= pubsub-to-bigquery.*. I did a regex filter to be sure I captured any job names which may be suffixed with additional data in future runs. Setting up a dashboard such as this can inform you when you'd actually benefit from SSDs, more network bandwidth, etc.
Also be sure to check the Dataflow job graph in the Cloud Console when looking to optimize your pipeline. The wall time shown below each step name can give a good indication of which custom transforms or DoFns should be targeted for optimization.
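To complement the dashboard approach, here is a minimal sketch of the kind of in-code memory logging mentioned in the question, using Beam's Metrics API so the values surface as job metrics rather than only as worker log lines (the namespace, metric name, and DoFn are illustrative placeholders):

import org.apache.beam.sdk.metrics.Gauge;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative DoFn that reports JVM heap usage as a custom gauge while processing elements.
public class MemoryReportingFn extends DoFn<String, String> {
  private static final Gauge USED_HEAP_MB = Metrics.gauge("pipeline-health", "used_heap_mb");

  @ProcessElement
  public void processElement(ProcessContext c) {
    Runtime rt = Runtime.getRuntime();
    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    USED_HEAP_MB.set(usedMb);  // latest value reported from each worker
    c.output(c.element());
  }
}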

Running EMR steps in parallel

I am running a Spark job on an EMR cluster. The issue I am facing is that all the EMR jobs triggered execute as steps (in a queue). Is there any way to make them run in parallel? If not, is there an alternative to that?
Elastic MapReduce comes by default with a very "step"-oriented YARN setup: a single CapacityScheduler queue with 100% of the cluster resources assigned. Because of this configuration, any time you submit a job to an EMR cluster, YARN maximizes cluster usage for that single job, granting it all available resources until it finishes.
Running multiple concurrent jobs in an EMR cluster (or any other YARN-based Hadoop cluster, in fact) requires a proper YARN setup with multiple queues so that resources are properly granted to each job. YARN's documentation is quite good about all of the CapacityScheduler features, and it is simpler than it sounds.
YARN's FairScheduler is quite popular but it uses a different approach and may be a bit more difficult to configure depending on your needs. Given the simplest scenario where you have a single Fair queue, YARN will try to grant containers to waiting jobs as soon as they are freed by running jobs, ensuring that all the jobs submitted to a cluster get at least a fraction of compute resources as soon as they are available.
If you are concerned about YARN jobs (submitted by Spark) waiting in a queue:
There are multiple ways to run jobs in parallel.
By default, EMR uses the YARN CapacityScheduler with the DefaultResourceCalculator and a single default queue where all YARN jobs are submitted. Since there is only one queue, the number of YARN jobs that you can RUN (not just submit) in parallel really depends on the number of parallel AMs, mappers, and reducers that your EMR cluster supports.
For example: you have a cluster that can run at most 10 mappers in parallel (see AWS EMR Parallel Mappers?).
Suppose you submit 2 map-only jobs, each requiring 10 mappers, one after another. The first job takes up all the mapper container capacity and runs, while the second waits in the queue for containers to free up. The behavior is similar for AMs and reducers as well.
Now, to make them run in parallel in spite of that limitation on the number of containers supported by the cluster:
Keeping the CapacityScheduler, you can create multiple queues, configuring a capacity percentage and a maximum capacity for each queue, so that a job in the first queue cannot use up all the containers even if it needs them. You can then submit your second job to a second queue which has its own pre-determined capacity (see the configuration sketch after this answer).
Alternatively, you can use the FairScheduler by configuring yarn-site.xml. The FairScheduler allows you to configure queues and share resources across those queues fairly. You might also use the PREEMPTION option of the FairScheduler.
Note that the choice of which option to go with really depends on your use case and business needs. It is important to learn about all the options and their possible impact.
https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781491901687/ch04.html
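As an illustration of the multi-queue idea above, here is a hedged sketch of how the extra queue could be supplied as an EMR configuration with the AWS SDK for Java; the queue names and percentages are placeholders, and the properties follow the standard YARN CapacityScheduler settings:

import java.util.Map;
import software.amazon.awssdk.services.emr.model.Configuration;

public class CapacitySchedulerQueues {
  // Builds a capacity-scheduler classification that splits the cluster into two queues,
  // so a job in one queue cannot starve a job submitted to the other.
  static Configuration twoQueueCapacityScheduler() {
    return Configuration.builder()
        .classification("capacity-scheduler")
        .properties(Map.of(
            "yarn.scheduler.capacity.root.queues", "default,parallel",
            "yarn.scheduler.capacity.root.default.capacity", "50",
            "yarn.scheduler.capacity.root.default.maximum-capacity", "80",
            "yarn.scheduler.capacity.root.parallel.capacity", "50",
            "yarn.scheduler.capacity.root.parallel.maximum-capacity", "80"))
        .build();
  }
}

The resulting Configuration would be passed with the cluster's configuration list when the cluster is created, and each Spark job can then target a queue, for example with spark-submit --queue parallel.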
Amazon EMR now supports the ability to run multiple steps in parallel. The number of steps allowed to run at once is configurable and can be set when a cluster is launched and at any time after the cluster has started.
Please see this announcement for more details: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/.
Just adding updated information. EMR supports parallel steps:
https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/
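For completeness, a hedged sketch of how the step concurrency could be changed on a running cluster with the AWS SDK for Java (the cluster ID and concurrency value are placeholders; the same StepConcurrencyLevel setting can also be provided when the cluster is launched):

import software.amazon.awssdk.services.emr.EmrClient;
import software.amazon.awssdk.services.emr.model.ModifyClusterRequest;

public class EnableParallelSteps {
  public static void main(String[] args) {
    try (EmrClient emr = EmrClient.create()) {
      // Allow up to 5 steps to run concurrently on an existing cluster.
      emr.modifyCluster(ModifyClusterRequest.builder()
          .clusterId("j-XXXXXXXXXXXXX")   // placeholder cluster ID
          .stepConcurrencyLevel(5)
          .build());
    }
  }
}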

How to auto scale EMR up/down?

I am new to AWS EMR and I need to scale my task nodes up/down automatically based on usage. What I am thinking is to add an SNS event on a CloudWatch alarm for AppsPending (scale up) and IsIdle (scale down).
Am I thinking about this correctly?
Is there any good documentation on this?
Please advise.
Thanks.
There is no in-built capability within Amazon EMR to automatically scale the cluster size based upon some metric.
One method is to add/remove Task Nodes as a Job Step. This does not automatically scale based upon demand, but can scale when you know that a large job step is required.
For example, if the cluster is performing a batch of several job steps and one of the steps requires more servers:
Create a job step that adds Task Nodes
Create a job step to perform the work
Create a job step to remove excess Task Nodes
To be truly automatic, you would need to monitor some combination of metrics that would indicate heavy load, and then add/remove nodes accordingly. The choice of metrics, however, would depend upon your particular workloads.
Another option is to fire up a cluster for specific jobs, then terminate the cluster when the job is finished.
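As a rough sketch of what such add/remove steps could call under the hood, here is one way to resize a task instance group with the AWS SDK for Java (the instance group ID and target count are placeholders; the same call can be wrapped in a script that runs as a job step):

import software.amazon.awssdk.services.emr.EmrClient;
import software.amazon.awssdk.services.emr.model.InstanceGroupModifyConfig;
import software.amazon.awssdk.services.emr.model.ModifyInstanceGroupsRequest;

public class ResizeTaskNodes {
  // Sets the task instance group to the requested number of nodes.
  static void resize(EmrClient emr, String taskGroupId, int targetCount) {
    emr.modifyInstanceGroups(ModifyInstanceGroupsRequest.builder()
        .instanceGroups(InstanceGroupModifyConfig.builder()
            .instanceGroupId(taskGroupId)   // placeholder instance group ID
            .instanceCount(targetCount)
            .build())
        .build());
  }

  public static void main(String[] args) {
    try (EmrClient emr = EmrClient.create()) {
      resize(emr, "ig-XXXXXXXXXXXX", 10);  // scale the task group up to 10 nodes
    }
  }
}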
You could take a look at Themis, an EMR autoscaling framework developed at Atlassian.
Current features include reactive autoscaling (based on current usage) as well as proactive autoscaling (based on predefined schedules).
The tool also comes with a simple Web UI and is very easy to configure.