Optimizing Apache Beam / Cloud Dataflow startup - google-cloud-platform

I have done a few tests with Apache Beam using both auto-scaled workers and a single worker, and each time I see a startup time of around 2 minutes. Is it possible to reduce that time, and if so, what are the suggested best practices for doing so?

IMHO: Two minutes is very fast for a product like Cloud Dataflow. Remember, Google is launching a powerful, autoscaling Big Data service for you.
Compare that time to the other cloud vendors. I have seen some (Hadoop) clusters take 15 minutes to come up. In any event, you do not control Dataflow's initialization process, so there is nothing for you to improve.

Related

Serverless python requests with long timeouts?

I have several Python scripts that follow a similar format: you pass in a date, and the script either:
- checks my S3 bucket for the file with that date in the filename and parses it, or
- runs some analysis on the file for that date (which takes over an hour).
I am looking for a serverless solution that would let me call these functions on a range of dates and run them all in parallel. Because of the long duration of my Python script, services like AWS Lambda and Google Cloud Functions don't work because of their timeouts (15 minutes and 9 minutes, respectively). I have looked at Google Cloud Dataflow, but I am not sure whether it is overkill for my relatively simple use case.
Something with the lowest possible outages is important, so I am leaning towards something from AWS, Google Cloud, etc.
I would also like to see a dashboard showing the progress of each job, with logs, so I can see which dates have completed and which dates hit a bug (plus what the bug is).
As you said, with Google Cloud Functions you can configure a timeout of up to 9 minutes at deployment time.
Solutions other than Dataflow that allow higher timeouts:
App Engine Flex
Another GCP product that allows higher timeouts (up to 60 minutes) is the App Engine flexible environment.
Cloud Tasks
Cloud Tasks is also similar, but asynchronous, with timeouts of up to 30 minutes. It is a task queue: you put a task in the queue and it returns quickly; the queue's worker (or workers) then processes the tasks one by one.
The usual output of a Cloud Tasks worker is to send emails or to save the results into Cloud Storage.
With this solution, you can add a task for each file/filename to process, and each of these tasks gets its own 30-minute timeout.
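As an illustration only (the worker URL and dates below are made up, and the payload shape follows the google-cloud-tasks v2 Task message as I understand it), enqueuing one 30-minute task per date could look like:

```python
import json

def build_http_task(url, date_str, timeout_seconds=1800):
    """Build a Cloud Tasks HTTP task payload with a 30-minute deadline.

    The structure mirrors the google-cloud-tasks v2 Task message; the
    URL and date values used here are placeholders.
    """
    return {
        "http_request": {
            "http_method": "POST",
            "url": url,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"date": date_str}).encode(),
        },
        # Per-task processing deadline: up to 30 minutes for HTTP tasks.
        "dispatch_deadline": {"seconds": timeout_seconds},
    }

def enqueue_dates(dates, worker_url):
    """One task per file/date; each task gets its own 30-minute deadline."""
    return [build_http_task(worker_url, d) for d in dates]

# With the google-cloud-tasks client, submission would look roughly like:
#   from google.cloud import tasks_v2
#   client = tasks_v2.CloudTasksClient()
#   parent = client.queue_path("my-project", "us-central1", "my-queue")
#   for task in enqueue_dates(["2020-01-01"], "https://worker.example.com/run"):
#       client.create_task(parent=parent, task=task)
```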
Longer running durations are planned on the Cloud Run roadmap, but there is no date for now.
Today, the recommended way is to use App Engine in combination with Task Queue. With a push queue, you can run a process for up to 24 hours when you deploy in manual scaling mode. But be careful: manual scaling doesn't scale to 0!
If you prefer containers, I know two "strange" workarounds on GCP:
Use Cloud Build. Cloud Build allows you to run a custom builder in a container. Do whatever you want in this container, even if it's not building anything. Remember to set the correct timeout for your processing step. You get 120 minutes per day free with Cloud Build (shared across the entire organisation; it's not a per-project free tier!). You can run up to 10 build jobs in parallel.
Use AI Platform training. Similarly to Cloud Build, AI Platform training allows you to run a custom container to perform processing; it was originally intended for training, but since it's a container, you can run whatever you want in it. There is no free tier here. You are limited to 20 CPUs in parallel, but you can request an increase of that limit up to 450 concurrent vCPUs.
Sadly, neither is as easy to use as a Cloud Function or Cloud Run: you don't get an HTTP endpoint that you simply call with the date you want. But you can wrap either one in a function that performs the API calls to Cloud Build or AI Platform training.
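For the Cloud Build trick, a minimal cloudbuild.yaml sketch might look like this (the container image, script, and date are placeholders; note that the default build timeout is 10 minutes, so raise it explicitly):

```yaml
# Not actually building anything: the "build" step just runs the job.
steps:
  - name: 'gcr.io/my-project/my-processor'   # custom builder image (placeholder)
    args: ['python', 'process.py', '--date', '2020-01-01']
# Raise the build timeout to cover the processing step (default: 10 minutes).
timeout: '3600s'
```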

Cloud computing service to run thousands of containers in parallel

Is there any provider that offers such an option out of the box? I need to run at least 1,000 concurrent sessions (Docker containers) of headless web browsers (Firefox) for complex UI tests. I have a Docker image that I just want to deploy and scale to 1,000 1 CPU/1 GB instances in seconds, without spending time on maintaining a cluster of servers (I need to shut them all down after the job is done), and just focus on the code. The closest thing I have found so far is Amazon ECS/Fargate, but its limits make no sense to me ("Run containerized applications in production" -> max limit: 50 tasks -> production -> ok?). Am I missing something?
I think that AWS Batch might be a better solution for your use case. You define a "compute environment" that provides a certain level of capacity, then submit tasks that are run on that compute environment.
I don't think you'll find anything that can start up an environment and deploy a large number of tasks in "one second": in my experience it takes about a minute or two of ramp-up time for Batch, although once the machines are up and running they can sequence jobs quickly. You should also consider whether it makes sense to run all 1,000 jobs concurrently; that depends on what you're trying to get out of your tests.
You'll also need to be aware of any places where you might be throttled (for example, retrieving configuration from the AWS Parameter Store). This talk from last year's NY Summit covers some of the issues the speaker ran into when deploying multiple thousands of concurrent tasks.
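To sketch the Batch approach in Python (the job, queue, and job-definition names are placeholders): an array job fans one submit_job call out into many child tasks, so you avoid issuing 1,000 separate submissions:

```python
def batch_array_job(name, queue, jobdef, size):
    """Build kwargs for boto3's batch.submit_job using an array job.

    Each of the `size` children runs the same job definition and can read
    its index from the AWS_BATCH_JOB_ARRAY_INDEX environment variable.
    All names passed in are placeholders for illustration.
    """
    if not 2 <= size <= 10000:
        raise ValueError("Batch array jobs support 2 to 10,000 children")
    return {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": jobdef,
        "arrayProperties": {"size": size},
    }

# Submission (requires AWS credentials), sketched:
#   import boto3
#   boto3.client("batch").submit_job(**batch_array_job(
#       "ui-tests", "my-queue", "firefox-headless:1", 1000))
```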
You could use Lambda layers to run headless browsers (I know there are several implementations for Chromium/Selenium on GitHub; I'm not sure about Firefox).
Alternatively, you could try contacting the AWS team to see how much the limit for concurrent tasks on Fargate can be increased. As the documentation shows, the 50-task limit is a soft limit and can be raised.
Be aware that if you start tasks via Fargate, there is an API rate limit on requests per second. You need to make sure you throttle your API calls, or use ECS CreateService instead.
In any case, starting 1,000 tasks one at a time at that rate would take on the order of 1,000 seconds, which is probably not what you expect.
Those limits are not there if you use ECS directly, but then you need to manage the cluster yourself, so it might be a good idea to explore the Lambda option.
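A minimal sketch of that throttling, assuming roughly one run_task call per second is safe (the actual ECS call is injected as a callable, so the pacing logic stands on its own):

```python
import time

def run_throttled(launch, count, per_second=1):
    """Call launch(i) `count` times, at most `per_second` calls per second.

    In practice `launch` would wrap ecs.run_task; injecting it keeps this
    sketch independent of boto3 and testable.
    """
    interval = 1.0 / per_second
    results = []
    for i in range(count):
        t0 = time.monotonic()
        results.append(launch(i))
        # Sleep off the remainder of this call's time slot.
        elapsed = time.monotonic() - t0
        if elapsed < interval and i < count - 1:
            time.sleep(interval - elapsed)
    return results

# e.g. run_throttled(lambda i: ecs.run_task(...), 1000) takes ~1,000 seconds.
```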

Cloud ML: Varying training time taken for the same data

I am using Google Cloud ML for training jobs. I observe a peculiar behavior: the training job takes a varying amount of time to complete for the same data. I analyzed the CPU and memory utilization in the Cloud ML console and see very similar utilization in both cases (7 min and 14 min).
Can anyone tell me why the service takes an inconsistent amount of time to complete the same job?
I have the same parameters and data in both cases, and I have verified that the time spent in the PREPARING phase is pretty much the same in both cases.
Also, does it matter that I schedule multiple independent training jobs simultaneously on the same project? If so, I would like to know the rationale behind it.
Any help would be greatly appreciated.
The easiest way is to add more logging to inspect where the time was spent. You can also inspect training progress using TensorBoard. There's no VM sharing between jobs, so it's unlikely to be caused by simultaneous jobs.
Also, the running time should be measured from the point when the job enters the RUNNING state. Job startup latency varies depending on whether it's a cold or warm start (i.e., we keep the VMs from a previous job running for a while).
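A simple way to add that logging is to time each phase of the trainer explicitly; this is a generic sketch (plain Python logging, not a Cloud ML API), and the phase names are placeholders:

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)

@contextmanager
def timed(phase, timings=None):
    """Log how long a named phase takes; Cloud ML surfaces these logs."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        logging.info("phase %s took %.2f s", phase, elapsed)
        if timings is not None:
            timings[phase] = elapsed

# Usage inside the trainer (stand-ins for the real work):
timings = {}
with timed("load_data", timings):
    time.sleep(0.01)  # stand-in for reading the training data
with timed("train", timings):
    time.sleep(0.01)  # stand-in for the training loop
```

Comparing these per-phase numbers between the 7-minute and 14-minute runs shows exactly where the extra time goes.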

How to reduce the initialisation and termination time in google dataflow job?

I'm currently working on a POC, primarily focusing on Dataflow for ETL processing. I created the pipeline using the Dataflow 2.1 Java Beam API, and it takes about 3-4 minutes just to initialise, plus about 1-2 minutes for termination, on each run. However, the actual transformation (ParDo) takes less than a minute. Moreover, I tried running the jobs using different approaches:
Running the job on local machine
Running the job remotely on GCP
Running the job via Dataflow template
But it looks like all of the above methods consume more or less the same time for initialization and termination. So this is becoming a bottleneck for the POC, as we intend to run hundreds of jobs every day.
I'm looking for a way to share the initialisation/termination time across all jobs, so that it becomes a one-time activity, or for any other approach to reduce the time.
Thanks in advance!
As far as I know, there is no way to reduce startup or teardown time. You shouldn't consider that a bottleneck: each run of a job is independent of the last one, so you can run them in parallel. You could also consider converting this to a streaming pipeline, if that's an option, to eliminate those times entirely.
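A sketch of the parallel approach (the function that launches a single job, e.g. via a Dataflow template, is injected as a callable here, so the fan-out logic stands alone): each job still pays its own startup cost, but the 3-4 minute initializations overlap instead of adding up.

```python
from concurrent.futures import ThreadPoolExecutor

def launch_jobs(launch_one, params, max_parallel=8):
    """Launch many independent jobs concurrently and collect the results.

    `launch_one` would call the Dataflow templates launch API or shell out
    to the gcloud CLI in practice; it is a plain callable here.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map preserves input order, so results line up with params.
        return list(pool.map(launch_one, params))

# Hypothetical usage: launch_jobs(run_template, [{"date": d} for d in dates])
```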

How to use Google Cloud Platform to perform a scheduled task once a day

I have a task that gathers some information from several websites and saves it to disk. I want this task to run on a daily basis and automatically.
I took a little tour of Google Cloud Platform, but couldn't figure out how to fit its services to my needs.
I would really like it if someone could suggest some key-points/main guidelines on how it should be done.
Thanks!
The easiest way to run any time-based or scheduled job is via a Linux cron job (https://help.ubuntu.com/community/CronHowto).
You can set up your scripts to run at a specific time or interval, and it should work. A checklist for you:
Bash scripts for the tasks you want to perform
Cron jobs scheduled to run these scripts at the specified time intervals
That should do it.
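For example, a crontab entry (edited with `crontab -e`) that runs a gathering script every day at 02:00 could look like this; the paths are placeholders:

```
# m h dom mon dow  command
0 2 * * * /usr/bin/python3 /home/me/gather.py >> /home/me/gather.log 2>&1
```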