I am using Google Cloud Dataflow. Some of my data pipelines need to be optimized. I need to understand how workers are performing in the Dataflow cluster along these lines:
1. How much memory is being used?
Currently I am logging memory usage from Java code (a sketch is shown below).
2. Is there a bottleneck in disk operations? This would tell me whether an SSD is required.
3. Is there a bottleneck in vCPUs? This would tell me whether to increase the vCPUs on the worker nodes.
I know Stackdriver can be used to monitor CPU and disk usage for the cluster. However, it does not provide information on individual workers, nor on whether we are hitting a bottleneck in any of these resources.
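For reference, that memory logging is along the lines of the following minimal sketch (a DoFn that logs JVM heap usage on whichever worker runs the bundle; the class name is illustrative):
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logs JVM heap usage on the worker that processes each bundle.
public class MemoryLoggingDoFn<T> extends DoFn<T, T> {
    private static final Logger LOG = LoggerFactory.getLogger(MemoryLoggingDoFn.class);

    @StartBundle
    public void startBundle() {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);
        LOG.info("Worker JVM heap: {} MB used of {} MB max", usedMb, maxMb);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(c.element());
    }
}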
Within the Dataflow Stackdriver UI, you are correct: you cannot view individual workers' metrics. However, you can certainly set up a Stackdriver dashboard that gives you the individual worker metrics for everything you have mentioned. Below is a sample dashboard which shows metrics for CPU, memory, network, read IOPS, and write IOPS.
Since the Dataflow job name will be part of the GCE instance name, here I filter down the GCE instances being monitored by the job name I'm interested in. In this case, my Dataflow job was named "pubsub-to-bigquery", so I filtered down to instance_name ~= pubsub-to-bigquery.*. I did a regex filter to be sure I captured any job names which may be suffixed with additional data in future runs. Setting up a dashboard such as this can inform you when you'd actually benefit from SSDs, more network bandwidth, etc.
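A chart filter along those lines might look like the following Cloud Monitoring filter expression (the metric shown and the job name are just examples; the dashboard UI also accepts a simple regex match on instance_name):
resource.type = "gce_instance" AND
metric.type = "compute.googleapis.com/instance/cpu/utilization" AND
metric.labels.instance_name = monitoring.regex.full_match("pubsub-to-bigquery.*")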
Also be sure to check the Dataflow job graph in the Cloud Console when looking to optimize your pipeline. The wall time shown below each step name gives a good indication of which custom transforms or DoFns should be targeted for optimization.
We are migrating our production environment from DigitalOcean to GCP.
However, because the platform is different, we don't know where to find some information about our VMs.
Is it possible to get a report that tells me the number of CPUs, the machine type, the amount of RAM, and the amount of SSD provisioned and used, per VM?
Compute Engine lets you export detailed reports of your Compute Engine usage (daily & monthly) to a Cloud Storage bucket using the usage export feature. Usage reports provide information about the lifetime of your resources.
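For reference, usage export can be enabled with a single gcloud command (the bucket name and report prefix below are placeholders):
gcloud compute project-info set-usage-bucket \
    --bucket=gs://my-usage-reports-bucket \
    --prefix=usage-report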
VM instance insights help you understand the CPU, memory, and network usage of your Compute Engine VMs.
As @Dharmaraj mentioned in the comments, GCP introduced a new Observability tab designed to give insights into common scenarios and issues associated with CPU, disk, memory, networking, and live processes. With access to all of this data in one location, you can easily correlate between signals over a given time frame.
Finally, the Stackdriver agent can be installed on GCE VMs to collect additional metrics, such as memory usage. You can also use Stackdriver's notification and alerting features. Note, however, that agent metrics are only available to premium-tier accounts.
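For reference, installing the legacy monitoring agent on a Debian/Ubuntu VM looks roughly like this (newer projects may prefer the unified Ops Agent instead):
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo bash add-monitoring-agent-repo.sh --also-install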
We have multiple teams using the Google Dataproc service. We would like to analyze the usage of Dataproc resources and determine whether there is any scope for optimization. A few examples are listed below.
1. A Dataproc cluster is created without a TTL; in this case, even after a job has run, the cluster stays active and incurs cost. If we could determine that the cluster is idle, we could recommend that the team stop the cluster when it is not in use.
2. A Dataproc cluster is provisioned with a higher configuration (CPU and RAM), but utilization is very low (e.g., less than 10%). Scaling down the resources might bring down the cost and still serve the requirement to run the job.
We would like to understand whether GCP already has such features; if not, is there a place where we can find the relevant logs and build our own solution for this use case?
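For the first scenario, the kind of control we have in mind would look something like Dataproc's scheduled-deletion flags, where --max-idle deletes the cluster after a period of inactivity; a minimal sketch, with placeholder cluster and region names:
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --max-idle=30m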
I want to use GPU capacity for deep learning models. SageMaker is great in its flexibility of starting on-demand clusters for training. However, my department wants guarantees that we won't overspend the AWS budget. Is there a way to 'cap' the costs without resorting to a dedicated machine?
Here are a couple of ideas:
You can use Service Catalog on top of SageMaker to restrict how end users consume the product (for example, to limit instance types and permissions).
On SageMaker notebooks: you can use a lifecycle configuration to automatically shut down notebooks that are idle.
You can also create AWS Lambda functions that automate controls (e.g., shut down notebook instances at night, send a notification if big machines are used, etc.).
On SageMaker Training: you can use Spot capacity to benefit from significant savings on training costs (up to 90% is possible). You can also apply a maximum training duration with train_max_run.
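For reference, the same caps can also be set through the low-level CreateTrainingJob API; a minimal sketch with the AWS SDK for Java v2 (the job name, role ARN, image URI, and S3 path are placeholders):
import software.amazon.awssdk.services.sagemaker.SageMakerClient;
import software.amazon.awssdk.services.sagemaker.model.AlgorithmSpecification;
import software.amazon.awssdk.services.sagemaker.model.CreateTrainingJobRequest;
import software.amazon.awssdk.services.sagemaker.model.OutputDataConfig;
import software.amazon.awssdk.services.sagemaker.model.ResourceConfig;
import software.amazon.awssdk.services.sagemaker.model.StoppingCondition;
import software.amazon.awssdk.services.sagemaker.model.TrainingInputMode;

public class SpotTrainingSketch {
  public static void main(String[] args) {
    try (SageMakerClient sm = SageMakerClient.create()) {
      CreateTrainingJobRequest request = CreateTrainingJobRequest.builder()
          .trainingJobName("my-spot-training-job")                    // placeholder
          .roleArn("arn:aws:iam::123456789012:role/MySageMakerRole")  // placeholder
          .algorithmSpecification(AlgorithmSpecification.builder()
              .trainingImage("123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training:latest")  // placeholder
              .trainingInputMode(TrainingInputMode.FILE)
              .build())
          .resourceConfig(ResourceConfig.builder()
              .instanceType("ml.p3.2xlarge")
              .instanceCount(1)
              .volumeSizeInGB(50)
              .build())
          .outputDataConfig(OutputDataConfig.builder()
              .s3OutputPath("s3://my-bucket/training-output/")        // placeholder
              .build())
          .enableManagedSpotTraining(true)                            // use Spot capacity
          .stoppingCondition(StoppingCondition.builder()
              .maxRuntimeInSeconds(3600)   // cap actual training time (like train_max_run)
              .maxWaitTimeInSeconds(7200)  // cap total time, including waiting for Spot
              .build())
          .build();
      sm.createTrainingJob(request);
    }
  }
}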
It's a good idea to make sure your code makes good use of the GPU. For example, on P3 instances (V100 cards), you should try to use mixed-precision training so that training is faster and cheaper. Also, tune data loading, batch size, and algorithm complexity so that the GPU has enough work to do and doesn't just spend its time on data reads and model updates. Finally, scale training vertically first (a bigger and bigger machine) rather than horizontally: multi-machine training is generally more costly, as it is harder to write and debug and has more communication overhead. This is more on your side than on the SageMaker side, though.
Train via the SageMaker Training API, not in a SageMaker notebook. When you write code in a notebook, very little time is spent training and much more is spent writing, debugging, and reading documentation; all that time, the GPU sits idle. SageMaker Training, on the other hand, provides ephemeral, fresh instances billed per second only for the duration of training, and it comes with additional benefits such as Spot capacity, managed metrics, logs, data I/O, and metadata management.
You can create cost allocation tags to enable cost reporting at a custom granularity.
We have implemented kube-state-metrics (by following the steps in section 4.4.1, "Install monitoring components", of this article) on one of our Kubernetes clusters on GCP. It basically created three new deployments on our cluster: node-exporter, prometheus-k8s, and kube-state-metrics. After that, we were able to see all the metrics inside Metrics Explorer with the prefix "external/prometheus/".
To check external metrics pricing, we referred to this link and calculated the expected price accordingly, but the bill we received was shockingly high. GCP charged a large amount even though we haven't added a single metric to a dashboard or set up monitoring for anything. From the ingested volume (around 1.38 GB/day), it looks like these monitoring tools do some background job (reading metrics at specific intervals or so) that consumed this volume and produced this bill.
We would like to understand how these kube-state-metrics monitoring components work. Do they automatically collect metrics data and increase the ingested volume (and the bill) in this way, or is there a misconfiguration in the setup?
Any guidance on this would be really appreciated!
Thank you.
By default, kube-state-metrics exposes metrics for a large number of objects and events across your cluster.
If you have a number of frequently updating resources on your cluster, you may find that a lot of data is ingested into these metrics, which incurs high costs.
You need to configure which metrics you'd like to expose, and consult the documentation for your Kubernetes environment, in order to avoid unexpectedly high costs.
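For example, the kube-state-metrics container arguments can be narrowed to a small set of resources and metric families (the flags below are from kube-state-metrics v2.x; check the version you deployed, and treat the values as illustrative):
# Fragment of the kube-state-metrics Deployment spec
containers:
  - name: kube-state-metrics
    image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
    args:
      - --resources=pods,deployments,nodes     # only expose metrics for these resources
      - --metric-denylist=kube_pod_status_.*   # optionally drop noisy metric families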
It's evident that preemptible instances are cheaper than non-preemptible instances. On a daily basis, 400-500 Dataflow jobs run in my organization's project; some of them are time-sensitive and others are not. Is there any way I could use preemptible instances for the jobs without time constraints, which would lower the overall cost of pipeline execution? Currently I'm running Dataflow jobs with the configuration specified below.
options.setTempLocation("gs://temp/");
options.setRunner(DataflowRunner.class);
options.setTemplateLocation("gs://temp-location/");
options.setWorkerMachineType("n1-standard-4");
options.setMaxNumWorkers(20);
options.setWorkerCacheMb(2000);
I'm not able to find any pipeline option for a preemptible instance setting.
Yes, it is possible to do so with Flexible Resource Scheduling in Cloud Dataflow (docs). Note that there are some things to consider:
Delayed execution: jobs are scheduled and not executed right away (you can see a new QUEUED status for your Dataflow jobs). They are run opportunistically when resources are available within a six-hour window. This makes FlexRS suitable to reduce cost for non-time-critical workloads. Also, be sure to validate your code before sending the job.
Batch jobs: as of now it only accepts batch jobs, and it requires autoscaling to be enabled: you cannot set autoscalingAlgorithm=NONE.
Dataflow Shuffle: it needs to be enabled. With Shuffle, no data is stored on the persistent disks attached to the VMs, so when a preemption happens and resources are reclaimed there is no need to redistribute the data.
Regions: following from the previous item, only regions where Dataflow Shuffle is supported can be selected. The list is here; turn-up of new regions will be announced in the release notes. As of now, the zone is chosen automatically within the region.
Machine types: FlexRS currently supports n1-standard-2 (default) and n1-highmem-16.
SDK: requires 2.12.0 or newer for Java or Python.
Quota: quota is reserved upfront (i.e. queued jobs also consume quota).
To run a FlexRS job, use --flexRSGoal=COST_OPTIMIZED and make sure the rest of the parameters conform to the FlexRS requirements.
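In the Beam Java SDK (2.12.0 or newer), the same goal can be set programmatically on the options object from the question; a minimal sketch:
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setRegion("us-central1");  // must be a region where Dataflow Shuffle is supported
options.setFlexRSGoal(DataflowPipelineOptions.FlexResourceSchedulingGoal.COST_OPTIMIZED);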
A uniform discount rate is applied to FlexRS jobs; you can compare pricing details in the following link.
Note that you might see a Beta disclaimer in the non-English documentation but, as clarified in the release notes, it's Generally Available.