When I start a data flow job, it sometimes waits for more than 30 minutes without being allocated an instance.
What is happening?
Your Dataflow job can take a while to start because the time needed to start the VMs on Google Compute Engine grows with the number of VMs you request, and in general VM startup and shutdown performance can have high variance.
You can look at Cloud Logging for your job ID and see whether there is any logging going on, and you can also check the Dataflow monitoring interface for your job.[1]
You can also enable autoscaling[2] instead of specifying a large number of instances manually; the service should then gradually scale to the appropriate number of VMs at the appropriate moment in the job's lifetime.
Without autoscaling, you have to choose a fixed number of workers to execute your pipeline. As the input workload varies over time, this number can become either too high or too low. Provisioning too many workers results in unnecessary extra cost, and provisioning too few workers results in higher latency for processed data. By enabling autoscaling, resources are used only as they are needed.
The objective of autoscaling is to minimize backlog while maximizing worker utilization and throughput, and to react quickly to spikes in load.
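If it helps, here is a minimal sketch of what enabling autoscaling looks like from the Beam Python SDK; the project, region, bucket, and worker cap are placeholders, and the Create source just stands in for your real input:

```python
# Minimal sketch: Dataflow job with autoscaling instead of a fixed worker count.
# project/region/temp_location/max_num_workers are placeholders to adjust.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder
    region="us-central1",                  # placeholder
    temp_location="gs://my-bucket/temp",   # placeholder
    autoscaling_algorithm="THROUGHPUT_BASED",  # let the service pick the worker count
    max_num_workers=10,                    # upper bound, not a fixed size
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.Create([1, 2, 3])    # stand-in for the real source
     | "Square" >> beam.Map(lambda x: x * x))
```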
[1] https://cloud.google.com/dataflow/docs/guides/using-monitoring-intf
[2] https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#streaming-autoscaling
I was wondering if it is possible to run a large number of "jobs" (or "pipelines", or whatever the right term is) to execute some modelling tasks in parallel.
My plan is to run an ETL process and EDA first, and then, when the data is ready, fire 2000 modelling jobs. We have 2000 products, and each job can start from its own slice of the data (SELECT * FROM DATA WHERE PROD_ID='xxxxxxxxx'); my idea is to run these training jobs in parallel (there is no dependency between them, so it makes sense to me).
First of all - 1) Can it be done in AWS SageMaker? 2) What would be the right approach? 3) Any special considerations I need to be aware of?
Thanks a lot in advance!
It's possible to run this on SageMaker with SageMaker Pipelines, which can orchestrate a SageMaker Processing job followed by a Training job. You can define PROD_ID as a string parameter of the SageMaker Pipeline and then run multiple pipeline executions concurrently (the default soft limit is 200 concurrent executions).
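As a rough sketch of that idea (not a complete solution): the role ARN, image URI, script name, and pipeline/parameter names below are placeholders, and the Processing and Training steps are collapsed into a single processing step for brevity.

```python
# Sketch: one parameterized pipeline, started once per product ID.
from sagemaker.processing import ScriptProcessor
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

# The product filter is a pipeline parameter, so every execution can pass its own value.
prod_id = ParameterString(name="ProdId", default_value="0000000")

processor = ScriptProcessor(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/train:latest",  # placeholder
    command=["python3"],
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
)

step = ProcessingStep(
    name="TrainOneProduct",
    processor=processor,
    code="train.py",                       # placeholder training/ETL script
    job_arguments=["--prod-id", prod_id],  # the script does the SELECT ... WHERE PROD_ID=...
)

pipeline = Pipeline(name="per-product-training", parameters=[prod_id], steps=[step])
pipeline.upsert(role_arn=role)

# Fire one execution per product (subject to the concurrent-execution quota):
for pid in ["A123", "B456"]:               # in practice, your 2,000 product IDs
    pipeline.start(parameters={"ProdId": pid})
```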
As you have a very high number of jobs (2,000) that you want to run in parallel, and perhaps want to optimize compute usage, you might also want to look at AWS Batch, which lets you queue up tasks for a fleet of instances that starts containers to perform these jobs. AWS Batch also supports Spot instances, which could reduce your instance cost by 70%-90%. Another advantage of AWS Batch is that jobs reuse the same running instance (only the container stops and starts), while in SageMaker there is roughly a 2-minute overhead to start an instance per job. Additionally, AWS Batch takes care of retries and allows you to chain all 2,000 jobs together and run a "finisher" job when all of them have completed.
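A hedged sketch of what that could look like with boto3; the queue and job-definition names are placeholders, and an array job is one way (not the only one) to fan out the 2,000 tasks:

```python
# Sketch: fan out 2,000 per-product tasks as an array job, then run a finisher
# job that depends on the whole array.
import boto3

batch = boto3.client("batch")

array_job = batch.submit_job(
    jobName="train-all-products",
    jobQueue="training-queue",            # placeholder
    jobDefinition="train-product:1",      # placeholder
    arrayProperties={"size": 2000},       # one child job per product
    # Each child can read the AWS_BATCH_JOB_ARRAY_INDEX env var and map it to a PROD_ID.
)

finisher = batch.submit_job(
    jobName="train-finisher",
    jobQueue="training-queue",
    jobDefinition="finisher:1",           # placeholder
    dependsOn=[{"jobId": array_job["jobId"]}],  # waits for every child of the array job
)
print(finisher["jobId"])
```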
Limits increase - For any of these services, you'll need to increase your service quota limits. For most services this can be done from the Service Quotas console or by contacting AWS Support. Some services have hard limits.
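For example (a sketch; the exact quota codes differ per service, so they are looked up here rather than hard-coded), the Service Quotas API can also be scripted:

```python
# Sketch: list quotas for a service, then request an increase for a chosen code.
import boto3

sq = boto3.client("service-quotas")

# Find the quota you care about (e.g. under SageMaker or Batch):
for quota in sq.list_service_quotas(ServiceCode="sagemaker")["Quotas"]:
    print(quota["QuotaCode"], quota["QuotaName"], quota["Value"])

# Then request an increase for the chosen QuotaCode:
# sq.request_service_quota_increase(
#     ServiceCode="sagemaker",
#     QuotaCode="L-xxxxxxxx",   # one of the codes printed above
#     DesiredValue=2000,
# )
```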
I am observing that if a new Batch job is submitted shortly after the last instance in a compute environment has shut down, it takes over 10 minutes for Batch to add a new instance to the compute environment, even if the CE has available resources.
Can anyone please let me know whether this is expected behavior on the AWS side, or whether there is a way to fix it?
Thanks
Chaitanya
This is expected behaviour. There is a thread here which includes a response from the AWS team.
Further, it is important to note that the AWS Batch resource scaling decisions occur on a different frequency. Upon receiving your first job submission, AWS Batch will launch an initial set of compute resources. After this point Batch re-evaluates resource needs approximately every 10 minutes. By making scaling decisions less frequently, we avoid scenarios where AWS Batch would scale up too quickly and complete all RUNNABLE jobs, leaving a large number of unused instances with partially consumed billing hours.
(p.s. I notice that it looks like yours was the last comment in the linked thread, but I post this anyway for the benefit of others)
I submitted a GCP Dataflow pipeline to receive my data from GCP Pub/Sub, parse it, and store it to GCP Datastore. It seems to work perfectly.
Over 21 days, I found that the cost was $144.54 and the worker time was 2,094.72 hours. This means that after I submitted it, it is charged every second, even when it does not receive (process) any data from Pub/Sub.
Is this behavior normal? Or did I set a wrong parameter?
I thought CPU time would only be counted when data is received.
Is there any way to reduce the cost with the same working model (receive from Pub/Sub and store to Datastore)?
Cloud Dataflow service usage is billed in per-second increments, on a per-job basis. I guess your job used 4 n1-standard-1 workers, which used 4 vCPUs, giving an estimated 2,000 vCPU-hours of resource usage. So yes, this behavior is normal. To reduce the cost, you can use autoscaling to specify a maximum number of workers, or use the pipeline options to override the resource settings allocated to each worker. Depending on your needs, you could also consider using Cloud Functions, which costs less, but keep its limits in mind.
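For instance, a hedged sketch of the pipeline-options route in the Beam Python SDK, capping the number of workers and picking a small machine type; the project, region, and bucket are placeholders, and flag names may vary slightly between SDK versions:

```python
# Sketch: keep a low-volume streaming job on as little hardware as possible.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",                 # placeholder
    temp_location="gs://my-bucket/temp",  # placeholder
    streaming=True,
    num_workers=1,                        # start with a single worker
    max_num_workers=2,                    # cap how far autoscaling can grow
    machine_type="n1-standard-1",         # small worker machine type
)
```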
Hope it helps.
We are hosting a sale every month. Once we are ready with all the deals data, we send a notification to all of our users. As a result we get huge traffic within seconds, and it lasts for about an hour. Currently we change the instance class type to F4_1G before the sale time and back to F1 after one hour. Is there a better way to handle this?
Apart from changing the instance class of App Engine Standard based on the expected demand, you can (and should) also consider a good scaling approach for your application. App Engine Standard offers three different scaling types, which are documented in detail, but let me summarize their main features here:
Automatic scaling: based on request rate, latency in the responses and other application metrics. This is probably the best option for the use case you present, as more instances will be spun up in response to demand.
Manual scaling: continuously running, instances preserve state and you can configure the exact number of instances you want running. This can be useful if you already know how to handle your demand from previous occurrences of the spikes in usage.
Basic scaling: the number of instances scales with the volume of demand, and you can set the maximum number of instances that can be serving.
Based on the use case you presented in your question, I think automatic scaling is the scaling type that best matches your requirements. So let me get a little more in-depth on the parameters that you can tune when using it (a sample configuration sketch follows the list):
Concurrent requests to be handled by each instance: sets the maximum number of concurrent requests an instance will accept before a new instance is spun up.
Idle instances available: how many idle (not serving traffic) instances should be ready to handle traffic. You can tune this parameter to be higher when you have the traffic spike, so that requests are handled in a short time without having to wait for an instance to be spun up. After the peak you can set it to a lower value to reduce the costs of the deployment.
Latency in the responses: the time a request is allowed to wait in the queue (when no instance can handle it) before a new instance is started.
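If it helps, this is roughly how those parameters look in an app.yaml for App Engine Standard; a minimal sketch where the runtime and all numbers are placeholders to tune for your own traffic pattern, not a recommendation:

```yaml
# Illustrative app.yaml snippet (App Engine Standard, automatic scaling).
runtime: python39
automatic_scaling:
  max_concurrent_requests: 50   # requests per instance before a new one spins up
  min_idle_instances: 5         # raise before the sale, lower afterwards
  max_idle_instances: 10        # you are only billed up to this many idle instances
  min_pending_latency: 30ms     # how long a request may wait in the queue
  max_pending_latency: 100ms    # before a new instance is started
```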
If you play around with these configuration parameters, you can control quite deterministically the number of instances you have, accommodating the big spikes and later returning to lower values to decrease usage and cost.
An additional note to take into account when using automatic scaling: after a traffic spike you may see more idle instances than you specified (they are not torn down immediately, to avoid having to start new instances again), but you will only be billed for up to the max_idle_instances that you specified.
My application relies heavily on AWS services, and I am looking for an optimal solution based on them. The web application triggers scheduled jobs (assume they repeat indefinitely) that require a certain amount of resources to perform. A single run of a task normally takes at most 1 minute.
The current idea is to pass jobs via SQS and spawn workers on EC2 instances depending on the queue size (this part is more or less clear).
But I struggle to find a proper solution for actually triggering the jobs at certain intervals. Assume we are dealing with 10,000 jobs. Running 10k cron jobs (each job itself is quite simple, just passing a job description via SQS) on a single scheduler at the same time seems like a crazy idea. So the actual question is: how do I autoscale the scheduler itself (covering scenarios where the scheduler is restarted, a new instance is created, etc.)?
Or is a scheduler app redundant, and is it wiser to rely on AWS Lambda functions (or other services providing scheduling)? The problem with using Lambda functions is their limitations, and the 128 MB of memory provided by a single function is actually more than needed (20 MB seems like more than enough).
Alternatively, the worker itself could wait for a certain amount of time and notify the scheduler that it should trigger the job one more time. Say the frequency is 1 hour:
1. Scheduler sends job to worker 1
2. Worker 1 performs the job and after one hour sends it back to Scheduler
3. Scheduler sends the job again
The issue here, however, is the possibility that the worker gets scaled in.
Bottom line: I am trying to achieve a lightweight scheduler that does not require autoscaling and serves as a hub whose sole purpose is transmitting job descriptions, and that certainly does not get throttled on a service restart.
Lambda is perfect for this. You have a lot of short-running processes (~1 minute), and Lambda is built for short processes (up to five minutes nowadays). It is very important to know that CPU speed is coupled linearly to RAM. A 1 GB Lambda function is roughly equivalent to a t2.micro instance, if I recall correctly, and 1.5 GB RAM means 1.5x more CPU speed. The cost of these functions is so low that you can just run this. A 128 MB function has 1/8 the CPU speed of a micro instance, so I do not recommend using those.
As a queueing mechanism you can use S3 (yes, you read that right). Create a bucket and let the Lambda worker trigger when an object is created. When you want to schedule a job, put a file into the bucket; Lambda starts and processes it immediately.
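To make the idea concrete, here is a sketch (the bucket, key, and the actual work are placeholders) of a Lambda handler wired to an S3 ObjectCreated trigger, plus how "scheduling" a job would look:

```python
# Sketch: S3 bucket as the job queue, Lambda as the worker.
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # The function is configured with an S3 "ObjectCreated" trigger on the jobs bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        job = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
        run_job(job)                                  # your ~1 minute task

        s3.delete_object(Bucket=bucket, Key=key)      # cleanup noted at the end of this answer

def run_job(job):
    # placeholder for the actual work
    print("processing", job)

# Scheduling a job = putting a file in the bucket:
# boto3.client("s3").put_object(
#     Bucket="my-job-bucket",                         # placeholder
#     Key="jobs/job-0001.json",
#     Body=json.dumps({"job_id": 1, "payload": "..."}),
# )
```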
Now you have to respect some limits. This way you can only have 100 workers running at the same time (the limit on concurrently active Lambda instances), but you can ask AWS to increase this.
The costs are as follows (a rough back-of-the-envelope calculation follows the list):
$0.005 per 1,000 PUT requests, so $5 per million job requests (this is more expensive than SQS).
The Lambda runtime. Assuming 1 GB RAM (roughly t2.micro CPU speed) at $0.00001667 per GB-second, a 60-second job costs about $0.001, and the monthly free tier of 400,000 GB-seconds covers roughly the first 6,600 such jobs.
The Lambda requests: $0.20 per million triggers (the first million is free).
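To sanity-check those figures, a rough back-of-the-envelope calculation (assuming 1 GB RAM and ~60 seconds per job, ignoring the free tiers):

```python
# Rough cost estimate per million jobs, using the on-demand prices listed above.
jobs = 1_000_000

s3_put = jobs / 1_000 * 0.005                  # $0.005 per 1,000 S3 PUT requests
lambda_duration = jobs * 60 * 1 * 0.00001667   # 60 s * 1 GB * $/GB-second
lambda_requests = jobs / 1_000_000 * 0.20      # $0.20 per million invocations

total = s3_put + lambda_duration + lambda_requests
print(f"S3 PUTs:          ${s3_put:,.2f}")         # ~$5
print(f"Lambda duration:  ${lambda_duration:,.2f}")  # ~$1,000
print(f"Lambda requests:  ${lambda_requests:,.2f}")  # ~$0.20
print(f"Total per million jobs (before free tier): ${total:,.2f}")
```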
This setup does not require any servers on your part, and it cannot go down (unless AWS itself does).
(don't forget to delete the job out of S3 when you're done)