Cloud ML: Varying training time taken for the same data - google-cloud-ml

I am using Google Cloud ML for training jobs. I observe a peculiar behavior: the time taken for the training job to complete varies for the same data. I analyzed the CPU and memory utilization in the Cloud ML console and see very similar utilization in both cases (7 min and 14 min).
Can anyone let me know why the service would take an inconsistent amount of time to complete the job?
I have the same parameters and data in both the cases and also verified that the time spent in the PREPARING phase is pretty much the same in both cases.
Also, would it matter that I schedule multiple independent training jobs simultaneously on the same project? If so, I would like to know the rationale behind it.
Any help would be greatly appreciated.

The easiest way is to add more logging to inspect where the time was spent. You can also inspect training progress using TensorBoard. There's no VM sharing between multiple jobs, so the difference is unlikely to be caused by simultaneous jobs.
Also, the running time should be measured from the point when the job enters the RUNNING state. Job startup latency varies depending on whether it is a cold or a warm start (i.e., we keep the VMs from the previous job running for a while).
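As a rough illustration of the "add more logging" suggestion, here is a minimal Python sketch that times named phases of a job and logs each one. The phase names and the sleep calls are placeholders I made up; they stand in for whatever the real trainer does.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("train_timing")

# Accumulated seconds per phase, so two runs can be compared phase by phase.
timings = {}

@contextmanager
def timed(phase):
    """Log and record how long the enclosed block took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[phase] = timings.get(phase, 0.0) + elapsed
        log.info("phase=%s seconds=%.3f", phase, elapsed)

def run_job():
    # Hypothetical phases; replace the sleeps with the real work.
    with timed("load_data"):
        time.sleep(0.01)
    with timed("train"):
        time.sleep(0.02)
    with timed("export"):
        time.sleep(0.01)
    return timings
```

Comparing the per-phase log lines of a 7 minute run and a 14 minute run should show which phase accounts for the difference.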

Related

Are there any problems with running same cron job that takes 2 hours to complete every 10 minutes?

I have a script that takes two hours to run and I want to run it every 15 minutes as a cronjob on a cloud vm.
I noticed that my CPU is often at 100% usage. Should I increase the memory and/or the number of cores?
Each time you execute your cron job, a new process will be created.
So if your job takes 120 minutes (2 hours) to complete and you start a new job every 15 minutes, you will have 8 jobs running at the same time (120/15 = 8).
Thus, if the jobs are resource intensive, you will observe issues, such as 100% cpu usage.
So the question whether to up-scale or not is really dependent on the nature of these jobs. What do they do, how much cpu and memory do they take? Based on your description you are already running at 100% CPU often, thus an upgrade would be warranted in my view.
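The overlap arithmetic above can be written as a tiny Python helper. This is only a sketch: it assumes every run takes the same time and that cron never skips a start.

```python
import math

def concurrent_runs(runtime_min, interval_min):
    """Steady-state number of overlapping runs for a fixed runtime and schedule."""
    return math.ceil(runtime_min / interval_min)

print(concurrent_runs(120, 15))  # 8
```

If each run needs roughly one core, the result is also a first estimate of how many cores the VM would need to keep up.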
It would depend on your cron, but outside of resourcing for your server/application the following issues should be considered:
Is there overlap in data? i.e. do you retrieve a pool of data that will be processed multiple times?
Will duplicate critical actions happen? i.e. will a customer receive an email multiple times, or will a payment be processed multiple times?
Is there a chance of a race condition that causes the script to exit early?
Will there be any collisions in the processing? i.e. duplicate bookings made etc.
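If overlapping runs are a problem for any of the reasons listed above, one common guard (not something the answers here prescribe) is to take a non-blocking file lock at startup and skip the run when a previous one still holds it. A minimal Python sketch for Unix-like systems follows; the lock file name is hypothetical.

```python
import fcntl
import os
import tempfile

# Hypothetical lock file location; any path writable by the cron user works.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "my_cron_job.lock")

def try_acquire_lock(path=LOCK_PATH):
    """Return an open, locked file handle, or None if another run holds the lock."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle

def release_lock(handle):
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()

def main():
    lock = try_acquire_lock()
    if lock is None:
        print("previous run still active; skipping this run")
        return False
    try:
        # The long-running work would go here.
        return True
    finally:
        release_lock(lock)
```

With this guard the job effectively runs at most once at a time, so it only makes sense when skipping a scheduled start is acceptable.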
You will need to increase the CPU and memory specification of your VM instance (in GCP) because of its high CPU load. Document [1] describes how to change the machine type of a VM instance; to do this you need to stop the VM instance and then change its machine type.
To learn about the different machine types in GCP, see link [2].
On the other hand, you can autoscale based on the average CPU utilization if you use a managed instance group (MIG) [3]. This policy tells the autoscaler to collect the CPU utilization of the instances in the group and determine whether it needs to scale. You set the target CPU utilization, and the autoscaler works to maintain that level.
[1] https://cloud.google.com/compute/docs/instances/changing-machine-type-of-stopped-instance
[2] https://cloud.google.com/compute/docs/machine-types
[3] https://cloud.google.com/compute/docs/autoscaler/scaling-cpu-load-balancing#scaling_based_on_cpu_utilization

Optimizing apache beam / cloud dataflow startup

I have done a few tests with apache-beam using both auto-scale workers and 1 worker, and each time I see a startup time of around 2 minutes. Is it possible to reduce that time, and if so, what are the suggested best practices for reducing the startup time?
IMHO: Two minutes is very fast for a product like Cloud Dataflow. Remember, Google is launching a powerful Big Data service for you that autoscales.
Compare that time to the other cloud vendors. I have seen some clusters (Hadoop) take 15 minutes to come live. In any event, you do not control the initialization process for Dataflow so there is nothing for you to improve.

Optimization of the google dataproc cluster

I am using a Dataproc cluster for Spark processing. I am new to the whole Google Cloud stuff. In our application we have hundreds of jobs which use Dataproc. With every job we spawn a new cluster and terminate it once the job is over. I am using PySpark for processing.
Is it safe to use a hybrid of stable nodes and pre-emptible nodes for cost reduction?
What is the best software configuration for improving the performance of the Dataproc cluster? I am aware of the in-house infrastructure optimisation of a Hadoop/Spark cluster. Is it applicable as it is to a Dataproc cluster, or is something else needed?
Which instance type is best suited for a Dataproc cluster when we are processing Avro-formatted data of around 150 GB in size?
I have tried spark's dataframe caching / persist for time optimization. But it was not that useful. Is there any way to instruct spark that entire resources (memory, processing power) belong to this job so that it can process it faster?
Does reading and writing back to GCS bucket have a performance hit? If yes, is there any way to optimize it?
Any help in time and price optimisation is appreciated. Thanks in advance.
Thanks
Manish
Is it safe to use a hybrid of stable nodes and pre-emptible nodes for cost reduction?
That's absolutely fine. We've used that on 300+ node clusters, only issues were with long-running clusters when nodes were getting preempted, and jobs were not optimised to account for node reclamation (no RDD replication, huge long-running DAGs). Also Tez does not like preemptible nodes getting reclaimed.
Is it applicable as it is to a Dataproc cluster, or is something else needed?
Correct. However, the Google Cloud Storage driver has different characteristics when it comes to operation latency (for example, FileOutputCommitter can take huge amounts of time when trying to do a recursive move or remove with overpartitioned output) and memory usage (writer buffers are 64 MB vs 4 KB on HDFS).
Which instance type is best suited for a Dataproc cluster when we are processing Avro-formatted data of around 150 GB in size?
Only performance tests can help with that.
I have tried spark's dataframe caching / persist for time optimization. But it was not that useful. Is there any way to instruct spark that entire resources (memory, processing power) belong to this job so that it can process it faster?
Make sure to use dynamic allocation and that your cluster is sized to your workload. The Scheduling tab in the YARN UI should show utilisation close to 100% (if not, your cluster is oversized for the job, or you do not have enough partitions). In the Spark UI, the number of running tasks should be close to the number of cores (if not, again there might not be enough partitions, or the cluster is oversized).
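To make the dynamic allocation advice concrete, here is a small Python sketch that builds the relevant Spark settings as `--conf` arguments for spark-submit and checks the "at least one partition per core" rule of thumb. The property names are standard Spark configuration keys; the executor bounds and the helper names are my own assumptions, not values from the answer.

```python
def dynamic_allocation_conf(min_executors, max_executors):
    """Spark properties that turn on dynamic allocation with explicit bounds."""
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Dynamic allocation on YARN normally relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }

def to_submit_args(conf):
    """Flatten a property dict into spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

def enough_partitions(num_partitions, total_cores):
    """True when every core can be kept busy with at least one task."""
    return num_partitions >= total_cores

# Example with hypothetical bounds.
print(to_submit_args(dynamic_allocation_conf(2, 50)))
```

If `enough_partitions` is false for a stage, repartitioning the input or shrinking the cluster are the two options the answer points to.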
Does reading and writing back to GCS bucket have a performance hit? If yes, is there any way to optimize it?
From a throughput perspective, GCS is not bad, but it is much worse with many small files, both for reading (when computing splits) and for writing (when FileOutputCommitter commits the output). Also, many parallel writes can result in OOMs due to the bigger write buffer size.
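One way to avoid producing many small files is to choose the number of output partitions from the data size before writing, for example by passing the result to PySpark's `DataFrame.repartition` or `DataFrame.coalesce`. The Python sketch below only does the arithmetic; the 128 MiB target file size is an assumption for illustration, not a recommendation from the answer.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2

def target_partitions(total_bytes, target_file_bytes=128 * MIB):
    """Number of output partitions so that each file is roughly the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# About 150 GiB of input, as in the question.
print(target_partitions(150 * GIB))  # 1200
```

Fewer, larger files also mean fewer concurrent writers, which reduces the memory pressure from the large write buffers mentioned above.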

What are the consequences of not reaching target workers in a dataflow job?

My Apache Beam (Scio) Dataflow job is asking for more workers than my current quota allows. The job completes successfully, but is limited to 575 workers. What are the consequences of not giving it the resources it is asking for? More disk IO of intermediate steps? Slower sink IO? Does it depend on what's going on with the job? In particular, my job is pretty simple and really has 2 steps:
- aggregateByKey
- do IO per key
I can run my own experiments, but I'm also interested in the cost of the job, since it isn't extremely time sensitive operation (aka I'm okay letting it run longer if it is cheaper)...
In this case, your job will have a higher runtime than if your quota was higher, but the aggregate amount of time spent performing work by all workers should be about the same.
Dataflow bills you on the amount of time each CPU, memory and storage unit is allocated. If the total CPU-hours, RAM GB-hours and storage GB-hours are about the same, your job should cost about the same.
Note: Dataflow also charges by the amount of bytes shuffled if you use the shuffle service. This should also not be affected by the number of workers.
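The billing argument above can be checked with a toy model in Python. It assumes the work parallelises perfectly, so the total worker-hours stay constant, and it uses made-up hourly rates and worker shapes; none of these numbers are real Dataflow prices.

```python
import math

def job_cost(workers, total_worker_hours,
             vcpus_per_worker=4, ram_gb_per_worker=15,
             vcpu_hour_rate=0.05, ram_gb_hour_rate=0.003):
    """Return (wall-clock hours, cost) for a fixed amount of total work.

    All rates and worker shapes are hypothetical placeholders.
    """
    hours = total_worker_hours / workers
    per_worker_hour = (vcpus_per_worker * vcpu_hour_rate
                       + ram_gb_per_worker * ram_gb_hour_rate)
    return hours, workers * hours * per_worker_hour

capped_hours, capped_cost = job_cost(575, total_worker_hours=1000)
wanted_hours, wanted_cost = job_cost(1000, total_worker_hours=1000)
print(capped_hours > wanted_hours)               # True: the capped job runs longer
print(math.isclose(capped_cost, wanted_cost))    # True: the cost is the same
```

In practice scaling is never perfect, so the real costs will differ somewhat, but the model shows why a worker cap mainly trades time rather than money.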

How to reduce the initialisation and termination time in google dataflow job?

I'm currently working on a POC and primarily focusing on Dataflow for ETL processing. I have created the pipeline using the Dataflow 2.1 Java Beam API, and it takes about 3-4 minutes just to initialise, and another 1-2 minutes to terminate on each run. However, the actual transformation (ParDo) takes less than a minute. I have tried running the jobs in several different ways:
Running the job on local machine
Running the job remotely on GCP
Running the job via Dataflow template
But it looks like all of the above methods take more or less the same time for initialisation and termination. This is a bottleneck for the POC, as we intend to run hundreds of jobs every day.
I'm looking for a way to share the initialisation/termination time across all jobs so that it can be a one-time activity or any other approaches to reduce the time.
Thanks in advance!
From what I know, there is no way to reduce the startup or teardown time. You shouldn't consider that to be a bottleneck: each run of a job is independent of the last one, so you can run them in parallel. If it is an option, you could also consider converting this to a streaming pipeline, which would eliminate those times entirely.
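To illustrate why a fixed per-job overhead matters less when jobs run in parallel, here is a small Python sketch. `run_job` is a hypothetical stand-in that just sleeps for the overhead; in a real setup it would submit a Dataflow job and wait for it to finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_job(job_id, overhead_s=0.05):
    """Hypothetical job: the sleep stands in for startup plus teardown time."""
    time.sleep(overhead_s)
    return job_id

def run_all(job_ids, max_parallel=8):
    """Run independent jobs concurrently and return their results in order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_job, job_ids))

start = time.perf_counter()
results = run_all(range(8))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Eight jobs launched together finish in roughly the time of one, so the overhead is paid in wall-clock time once per batch rather than once per job.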