We have multiple teams using the Google Dataproc service. We would like to analyze the usage of Dataproc resources and determine whether there is any scope for optimization. A few examples are listed below.
A Dataproc cluster is created without a TTL; in this case, even after a job has run, the cluster stays active and keeps incurring cost. Here, if we could determine that the cluster is idle, we could recommend that the team stop the cluster when it is not in use (or create it with an idle-delete TTL in the first place; see the sketch at the end of this question).
A Dataproc cluster is provisioned with a higher configuration (CPU and RAM), but its utilization is very low (e.g. less than 10%). Here, scaling down the resources might bring down the cost while still meeting the requirement to run the job.
We would like to understand whether GCP already has such features; if not, is there a place where we can find the relevant logs and build our own solution for this use case?
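For reference, what we have in mind for the first case is creating the cluster with an idle-delete TTL so it deletes itself when unused. A rough sketch using the Dataproc v1 Java client (the project, region, cluster name and TTL are placeholders; master/worker configuration is omitted):

import com.google.cloud.dataproc.v1.Cluster;
import com.google.cloud.dataproc.v1.ClusterConfig;
import com.google.cloud.dataproc.v1.ClusterControllerClient;
import com.google.cloud.dataproc.v1.ClusterControllerSettings;
import com.google.cloud.dataproc.v1.LifecycleConfig;
import com.google.protobuf.Duration;

public class CreateClusterWithIdleTtl {
  public static void main(String[] args) throws Exception {
    String projectId = "my-project";   // placeholder
    String region = "us-central1";     // placeholder

    // The client must point at the regional Dataproc endpoint.
    ClusterControllerSettings settings = ClusterControllerSettings.newBuilder()
        .setEndpoint(region + "-dataproc.googleapis.com:443")
        .build();

    try (ClusterControllerClient client = ClusterControllerClient.create(settings)) {
      ClusterConfig config = ClusterConfig.newBuilder()
          // Scheduled deletion: delete the cluster after 1 hour of inactivity.
          .setLifecycleConfig(LifecycleConfig.newBuilder()
              .setIdleDeleteTtl(Duration.newBuilder().setSeconds(3600).build()))
          // Master/worker machine configuration omitted for brevity.
          .build();

      Cluster cluster = Cluster.newBuilder()
          .setClusterName("ephemeral-cluster")  // placeholder
          .setConfig(config)
          .build();

      client.createClusterAsync(projectId, region, cluster).get();
    }
  }
}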
Related
We are migrating our production environment from DigitalOcean to GCP.
However, because the platforms are different, we don't know where to find some of this information about our VMs.
Is it possible to get a report that tells me the number of CPUs, the machine type, the amount of RAM, the amount of SSD, and the amount of SSD used, per VM?
Compute Engine lets you export detailed daily and monthly usage reports to a Cloud Storage bucket using the usage export feature. Usage reports provide information about the lifetime of your resources.
VM instance insights help you understand the CPU, memory, and network usage of your Compute Engine VMs.
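If you want to build such a report yourself, per-VM metrics such as CPU utilization can also be pulled programmatically from the Cloud Monitoring API. A minimal sketch using the Monitoring v3 Java client (the project ID and 24-hour window are placeholders; memory metrics additionally require the monitoring agent mentioned below):

import com.google.cloud.monitoring.v3.MetricServiceClient;
import com.google.monitoring.v3.ListTimeSeriesRequest;
import com.google.monitoring.v3.ProjectName;
import com.google.monitoring.v3.TimeInterval;
import com.google.monitoring.v3.TimeSeries;
import com.google.protobuf.util.Timestamps;

public class VmCpuReport {
  public static void main(String[] args) throws Exception {
    String projectId = "my-project";  // placeholder
    long now = System.currentTimeMillis();
    TimeInterval interval = TimeInterval.newBuilder()
        .setStartTime(Timestamps.fromMillis(now - 24L * 60 * 60 * 1000))  // last 24 hours
        .setEndTime(Timestamps.fromMillis(now))
        .build();

    try (MetricServiceClient client = MetricServiceClient.create()) {
      ListTimeSeriesRequest request = ListTimeSeriesRequest.newBuilder()
          .setName(ProjectName.of(projectId).toString())
          .setFilter("metric.type=\"compute.googleapis.com/instance/cpu/utilization\"")
          .setInterval(interval)
          .setView(ListTimeSeriesRequest.TimeSeriesView.FULL)
          .build();

      // One time series per VM instance; points are returned newest first.
      for (TimeSeries ts : client.listTimeSeries(request).iterateAll()) {
        if (ts.getPointsCount() == 0) {
          continue;
        }
        String instanceId = ts.getResource().getLabelsOrDefault("instance_id", "unknown");
        double latest = ts.getPoints(0).getValue().getDoubleValue();
        System.out.printf("instance %s: latest CPU utilization %.1f%%%n", instanceId, latest * 100);
      }
    }
  }
}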
As @Dharmaraj mentioned in the comments, GCP introduced a new Observability tab designed to give insights into common scenarios and issues associated with CPU, disk, memory, networking, and live processes. With access to all of this data in one location, you can easily correlate between signals over a given time frame.
Finally, the Stackdriver agent can be installed on GCE VMs, enabling additional metrics such as memory monitoring. You can also use Stackdriver's notification and alerting features. However, agent metrics are only available to premium-tier accounts.
The problem is exactly as described in the title: we sometimes wait about an hour for DAGs to sync. This makes our development experience very poor.
Our Composer version is composer-2.0.4-airflow-2.2.3.
We have 17 DAGs.
The scheduler parses the DAGs quickly, so we suspect that the Composer workers are not syncing the DAGs from GCS via FUSE in time.
Are there other possible reasons? What should we do to solve this problem?
Our GKE workload configuration is shown in the picture below.
Based on that configuration, I would suggest increasing the resources. In Cloud Composer 2, GKE workloads such as the scheduler and the workers are limited to the resources defined for them, and a lack of CPU and memory can also lead to delays in synchronization. You can monitor your DAGs and increase or decrease the resources according to the requirements, as mentioned in this documentation.
There are many possible causes of delayed synchronization. You can follow this documentation for handling larger numbers of DAGs. For more information on tuning Cloud Composer performance, you can check this link.
We have implemented kube-state-metrics (by following the steps mentioned in this article, section 4.4.1 "Install monitoring components") on one of our Kubernetes clusters on GCP. This created three new deployments on our cluster: node-exporter, prometheus-k8s, and kube-state-metrics. After that, we were able to see all the metrics in Metrics Explorer with the prefix "external/prometheus/".
To check external metrics pricing we referred to this link and estimated the cost accordingly, but the bill we received was shockingly high. GCP charged a large amount even though we haven't added a single metric to a dashboard or set up monitoring for anything. Judging from the ingested volume (around 1.38 GB/day), it looks like these monitoring tools do some background work (reading metrics on a schedule, for example) that consumed this volume and led to the bill.
We would like to understand how these kube-state-metrics monitoring components work. Do they automatically collect metrics data and increase the ingested volume (and the bill) in this way, or is there a misconfiguration in our setup?
Any guidance on this would be really appreciated!
Thank you.
By default, kube-state-metrics exposes a large number of metrics about the objects (pods, deployments, nodes, and so on) across your cluster.
If you have many frequently-updating resources on your cluster, you may find that a lot of data is ingested into these metrics, which incurs high costs.
You need to configure which metrics you'd like to expose, and consult the documentation for your Kubernetes environment, in order to avoid unexpectedly high costs.
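One way to start auditing this is to list which Prometheus metric descriptors are actually being written to Cloud Monitoring; the metrics you saw under "external/prometheus/" in Metrics Explorer have the full type prefix external.googleapis.com/prometheus/. A rough sketch using the Monitoring v3 Java client (the project ID is a placeholder):

import com.google.api.MetricDescriptor;
import com.google.cloud.monitoring.v3.MetricServiceClient;
import com.google.monitoring.v3.ListMetricDescriptorsRequest;
import com.google.monitoring.v3.ProjectName;

public class ListPrometheusMetricDescriptors {
  public static void main(String[] args) throws Exception {
    String projectId = "my-project";  // placeholder
    try (MetricServiceClient client = MetricServiceClient.create()) {
      ListMetricDescriptorsRequest request = ListMetricDescriptorsRequest.newBuilder()
          .setName(ProjectName.of(projectId).toString())
          // External metrics written by the Prometheus integration live under this prefix.
          .setFilter("metric.type = starts_with(\"external.googleapis.com/prometheus/\")")
          .build();
      for (MetricDescriptor descriptor : client.listMetricDescriptors(request).iterateAll()) {
        System.out.println(descriptor.getType());
      }
    }
  }
}

Dropping the noisiest of these metrics on the kube-state-metrics/Prometheus side is what reduces the ingested volume, and with it the bill.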
It's evident that preemptible instances are cheaper than non-preemptible ones. Around 400-500 Dataflow jobs run daily in my organisation's project; some of them are time-sensitive and others are not. Is there a way to use preemptible instances for the non-time-critical jobs, so that the overall pipeline execution costs less? Currently I'm running Dataflow jobs with the configuration specified below.
options.setTempLocation("gs://temp/");
options.setRunner(DataflowRunner.class);
options.setTemplateLocation("gs://temp-location/");
options.setWorkerMachineType("n1-standard-4");
options.setMaxNumWorkers(20);
options.setWorkerCacheMb(2000);
I'm not able to find any pipeline option for using preemptible instances.
Yes, it is possible to do so with Flexible Resource Scheduling in Cloud Dataflow (docs). Note that there are some things to consider:
Delayed execution: jobs are scheduled and not executed right away (you can see a new QUEUED status for your Dataflow jobs). They are run opportunistically when resources are available within a six-hour window. This makes FlexRS suitable to reduce cost for non-time-critical workloads. Also, be sure to validate your code before sending the job.
Batch jobs: as of now it only accepts batch jobs and requires autoscaling to be enabled:
You cannot set autoscalingAlgorithm=NONE
Dataflow Shuffle: it needs to be enabled. When so, no data is stored on persistent disks attached to the VMs. This way, when a preemption happens and resources are claimed back there is no need to redistribute the data.
Regions: following from the previous item, only regions where Dataflow Shuffle is supported can be selected. The list is here; turn-up of new regions will be announced in the release notes. As of now, the zone is chosen automatically within the region.
Machine types: FlexRS currently supports n1-standard-2 (default) and n1-highmem-16.
SDK: requires 2.12.0 or newer for Java or Python.
Quota: quota is reserved upfront (i.e. queued jobs also consume quota).
In order to run it, use --flexRSGoal=COST_OPTIMIZED and make sure that the rest of the parameters conform to the FlexRS requirements.
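If you set your options in code, as in the snippet in the question, the same goal can be set programmatically; a rough sketch assuming Beam's DataflowPipelineOptions from SDK 2.12.0 or newer (the bucket is the same placeholder used above):

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FlexRsOptionsExample {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setTempLocation("gs://temp/");   // placeholder bucket
    options.setMaxNumWorkers(20);
    // FlexRS: delayed, cost-optimized execution (requires Dataflow Shuffle and autoscaling).
    options.setFlexRSGoal(DataflowPipelineOptions.FlexResourceSchedulingGoal.COST_OPTIMIZED);
    // ...build and run the pipeline with these options as usual.
  }
}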
A uniform discount rate is applied to FlexRS jobs; you can compare pricing details in the following link.
Note that you might see a Beta disclaimer in the non-English documentation but, as clarified in the release notes, it's Generally Available.
I am using Google Cloud Dataflow. Some of my data pipelines need to be optimized. I need to understand how the workers are performing in the Dataflow cluster along these lines:
1. How much memory is being used? Currently I am logging memory usage using Java code (a rough sketch of one approach is shown after this list).
2. Is there a bottleneck on disk operations, to understand whether an SSD is required?
3. Is there a bottleneck in vCPUs, so that we should increase the vCPUs in the worker nodes?
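For illustration, a rough sketch of one way such memory logging could be done from inside a pipeline: a DoFn that logs JVM heap usage, so the figures show up per worker in the Dataflow worker logs (the class name and message are made up):

import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative DoFn that logs JVM heap usage from whichever worker processes the element.
// In practice you would throttle this (e.g. log every N elements) to keep the logs readable.
class MemoryLoggingFn extends DoFn<String, String> {
  private static final Logger LOG = LoggerFactory.getLogger(MemoryLoggingFn.class);

  @ProcessElement
  public void processElement(ProcessContext c) {
    Runtime rt = Runtime.getRuntime();
    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    long maxMb = rt.maxMemory() / (1024 * 1024);
    LOG.info("JVM heap used: {} MB of {} MB", usedMb, maxMb);
    c.output(c.element());
  }
}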
I know Stackdriver can be used to monitor CPU and disk usage for the cluster. However, it does not provide information on individual workers, or on whether we are hitting a bottleneck in any of these.
You are correct that within the Dataflow Stackdriver UI you cannot view individual workers' metrics. However, you can certainly set up a Stackdriver dashboard which gives you the individual worker metrics for everything you have mentioned. Below is a sample dashboard which shows metrics for CPU, memory, network, read IOPS, and write IOPS.
Since the Dataflow job name will be part of the GCE instance name, here I filter the GCE instances being monitored down to the job name I'm interested in. In this case, my Dataflow job was named "pubsub-to-bigquery", so I filtered down to instance_name ~= pubsub-to-bigquery.*. I used a regex filter to be sure I captured any job names which may be suffixed with additional data in future runs. Setting up a dashboard such as this can tell you when you'd actually benefit from SSDs, more network bandwidth, etc.
Also be sure to check the Dataflow job graph in the Cloud Console when looking to optimize your pipeline. The wall time shown below each step name gives a good indication of which custom transforms or DoFns should be targeted for optimization.