AWS Batch: Do array jobs create separate resources?

The AWS Batch documentation states that array jobs share common parameters like the job definition, vCPUs, and memory. Does that mean that if I have an array job with 1,000 child jobs and vCPUs set to 4, each child job will get 4 vCPUs, or will the total vCPUs across all child jobs be 4?

The former. The job definition defines the amount of resources for a given job. The array job then points to the job definition to say "this is what is required for each of the individual jobs in the array."
In your example, each child job gets 4 vCPUs. Your compute environment will probably have some maximum number of vCPUs. If that maximum is 8, then only 2 child jobs run at the same time and the other 998 wait in the queue until resources are free.
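A minimal sketch of how this fits together with boto3; the job definition name, queue name, image, and sizes below are placeholders, not recommendations:

```python
import boto3

batch = boto3.client("batch")

# Register a job definition: these resource requirements apply to EACH
# child job of the array, not to the array as a whole.
batch.register_job_definition(
    jobDefinitionName="example-array-worker",       # placeholder name
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:latest",  # placeholder image
        "vcpus": 4,       # each child job requests 4 vCPUs
        "memory": 8192,   # and 8 GiB of memory
        "command": ["python", "work.py"],
    },
)

# Submit an array job of size 1000; Batch creates 1000 child jobs,
# each requesting the 4 vCPUs / 8 GiB defined above.
batch.submit_job(
    jobName="example-array-job",
    jobQueue="example-queue",                        # placeholder queue
    jobDefinition="example-array-worker",
    arrayProperties={"size": 1000},
)
```

How many of those children run at once is then governed by the compute environment's vCPU ceiling, as described above.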

Related

Cost of DPUs for parallel running jobs

I am running 10 concurrent runs of the same Glue job. The job is taking a lot of DPUs. Do concurrent runs of the same Glue job take more DPUs than running multiple different Glue jobs in parallel?
Generally it should not matter whether you run your jobs in parallel or sequentially. Every job consumes some DPUs, and the cost is based directly on the time it takes. So 1 job running for 10 minutes and 10 jobs running for 1 minute each should result in the same cost.
You can refer to the pricing examples on the documentation page:
https://aws.amazon.com/glue/pricing/
Or share more data (a screenshot, maybe) of how you're calculating the pricing.
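As a rough sanity check, here is a sketch of the arithmetic; the $0.44 per DPU-hour figure is an assumed list price, so confirm the rate for your region on the pricing page, and note that a per-run minimum billing duration may also apply:

```python
# Back-of-the-envelope Glue cost: total cost depends only on the DPU-hours
# consumed, not on whether the runs happen in parallel or sequentially.
PRICE_PER_DPU_HOUR = 0.44  # assumed list price; check the pricing page

def glue_cost(num_runs, dpus_per_run, minutes_per_run):
    dpu_hours = num_runs * dpus_per_run * (minutes_per_run / 60)
    return dpu_hours * PRICE_PER_DPU_HOUR

print(glue_cost(1, 10, 10))   # one 10-minute run at 10 DPUs
print(glue_cost(10, 10, 1))   # ten 1-minute runs at 10 DPUs -> same total
```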

GCP Dataflow vCPU usage and pricing question

I submitted a GCP Dataflow pipeline that receives data from GCP Pub/Sub, parses it, and stores it in GCP Datastore. It seems to work perfectly.
Over 21 days, I found the cost was $144.54 and the worked time was 2,094.72 hours. That means it was charged every second after I submitted it, even when it did not receive (process) any data from Pub/Sub.
Is this behavior normal? Or did I set a wrong parameter?
I thought CPU time would only be counted when data is received.
Is there any way to reduce the cost with the same working model (receive from Pub/Sub and store to Datastore)?
Cloud Dataflow service usage is billed in per-second increments, on a per-job basis. I guess your job used 4 n1-standard-1 workers, which used 4 vCPUs, giving an estimated 2,000 vCPU-hours of resource usage (4 vCPUs running around the clock for 21 days is roughly 4 × 24 × 21 ≈ 2,016 vCPU-hours). Therefore, this behavior is normal. To reduce the cost, you can use either autoscaling, to specify the maximum number of workers, or the pipeline options, to override the resource settings that are allocated to each worker. Depending on your needs, you could also consider Cloud Functions, which may cost less, keeping its limits in mind.
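For example, with the Apache Beam Python SDK you can cap the worker pool and pick a smaller machine type when launching the pipeline. This is only a sketch under assumed settings; the project, bucket, and values below are placeholders, not recommendations:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values: cap autoscaling at 1 worker and use a small machine
# type so the always-on streaming job accrues fewer vCPU-hours.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder project id
    region="us-central1",
    temp_location="gs://my-bucket/tmp",   # placeholder bucket
    streaming=True,
    max_num_workers=1,
    machine_type="n1-standard-1",
)

# The options object is then passed to beam.Pipeline(options=options)
# when the streaming pipeline is constructed.
```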
Hope it helps.

AWS batch - how to limit number of concurrent jobs

I am looking for a way to limit the number of Batch jobs that are running by holding the remaining jobs in the queue. Is that possible with AWS Batch?
Limiting the maximum number of vCPUs of the managed compute environment the queue is tied to will effectively limit the number of Batch jobs running concurrently on that queue.
However, this comes with the caveat that, if other queues share this compute environment, they will be limited accordingly as well. Moreover, if multiple compute environments are associated with the queue you are attempting to limit, Batch will eventually begin scheduling jobs on the secondary compute environments if enough jobs are waiting in the RUNNABLE state.
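A sketch of the idea with boto3; all identifiers (names, subnet, security group, roles) are placeholders. If each job definition requests 4 vCPUs, capping the environment at 16 vCPUs keeps at most 4 jobs from this queue running at once:

```python
import boto3

batch = boto3.client("batch")

# Cap the managed compute environment at 16 vCPUs. With jobs requesting
# 4 vCPUs each, at most 4 run concurrently; the rest stay RUNNABLE.
batch.create_compute_environment(
    computeEnvironmentName="capped-ce",              # placeholder name
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 16,                              # the effective concurrency cap
        "desiredvCpus": 0,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-0123456789abcdef0"],     # placeholder subnet
        "securityGroupIds": ["sg-0123456789abcdef0"],# placeholder security group
        "instanceRole": "ecsInstanceRole",           # placeholder instance profile
    },
    serviceRole="AWSBatchServiceRole",               # placeholder service role
)
```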

AWS Glue: what is the optimal data size for ETL?

I am planning to use AWS Glue for my ETL process and have custom Python code written to run as an AWS Glue job.
I found in the AWS Glue documentation that, by default, AWS Glue allocates 10 DPUs per job. Is there a maximum limit of DPUs per job? (I do not see anything in the Limits section, i.e., no maximum-DPUs-per-job limit.)
Alternatively, is there an optimal data size in MB/GB that is recommended to avoid out-of-memory errors? Please clarify.
Thanks.
According to the Glue API docs, the max you can allocate per Job execution is 100 DPUs.
AllocatedCapacity – Number (integer).
The number of AWS Glue data processing units (DPUs) allocated to runs of this job. From 2 to 100 DPUs can be allocated; the default is 10. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.
The limits aren't the same for Python shell Glue jobs (which the OP plans to implement), where you can have a maximum of 1 DPU. Below is the official documentation (as of Aug 2019):
The maximum number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.
You can set the value to 0.0625 or 1. The default is 0.0625.
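A sketch of how capacity is set per job type through the Glue API via boto3; the job names, role, and script locations are placeholders. A Spark ETL job takes a whole number of DPUs (2 to 100 per the API docs quoted above), while a pythonshell job only accepts 0.0625 or 1:

```python
import boto3

glue = boto3.client("glue")

# Spark ETL job: MaxCapacity is a whole number of DPUs (2-100).
glue.create_job(
    Name="example-spark-etl",                                   # placeholder name
    Role="GlueServiceRole",                                     # placeholder role
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-bucket/etl.py"},        # placeholder script
    MaxCapacity=10.0,
)

# Python shell job: only 0.0625 or 1 DPU is allowed.
glue.create_job(
    Name="example-python-shell",                                # placeholder name
    Role="GlueServiceRole",
    Command={"Name": "pythonshell",
             "ScriptLocation": "s3://my-bucket/job.py",         # placeholder script
             "PythonVersion": "3"},
    MaxCapacity=1.0,
)
```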

Scheduling long-running tasks using AWS services

My application relies heavily on AWS services, and I am looking for an optimal solution based on them. The web application triggers a scheduled job (assume it repeats indefinitely) which requires a certain amount of resources to be performed. A single run of the task normally takes at most 1 minute.
The current idea is to pass jobs via SQS and spawn workers on EC2 instances depending on the queue size (this part is more or less clear).
But I struggle to find a proper solution for actually triggering the jobs at certain intervals. Assume we are dealing with 10,000 jobs. Having a scheduler run 10k cron jobs (the job itself is quite simple, just passing the job description via SQS) at the same time seems like a crazy idea. So the actual question would be: how do I autoscale the scheduler itself (given scenarios where the scheduler is restarted, a new instance is created, etc.)?
Or is the scheduler redundant as an app, and is it wiser to rely on AWS Lambda functions (or other services providing scheduling)? The problem with using Lambda functions is their limitations, and the 128 MB of memory provided by a single function is actually too much (20 MB seems like more than enough).
Alternatively, the worker itself can wait for a certain amount of time and notify the scheduler that it should trigger the job one more time. Let's say the frequency is 1 hour:
1. The scheduler sends the job to worker 1.
2. Worker 1 performs the job and, after one hour, sends it back to the scheduler.
3. The scheduler sends the job again.
The issue here, however, is the possibility that the worker will get scaled in.
Bottom line: I am trying to achieve a lightweight scheduler that does not require autoscaling and serves as a hub whose sole purpose is transmitting job descriptions. And it certainly should not get throttled on a service restart.
Lambda is perfect for this. You have a lot of short-running processes (~1 minute), and Lambda is for short processes (up to five minutes nowadays). It is very important to know that CPU speed is coupled linearly to RAM. A 1 GB Lambda function is equivalent to a t2.micro instance if I recall correctly, and 1.5 GB of RAM means 1.5x more CPU speed. The cost of these functions is so low that you can just run the jobs this way. 128 MB of RAM gives 1/8 of the CPU speed of a micro instance, so I do not recommend using that size.
As a queueing mechanism you can use S3 (yes, you read that right). Create a bucket and let the Lambda worker trigger when an object is created. When you want to schedule a job, put a file into the bucket. Lambda starts and processes it immediately.
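A minimal sketch of such a worker, assuming the job description is uploaded as a JSON object and that `do_work` stands in for the actual ~1 minute task:

```python
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by an s3:ObjectCreated:* event; each uploaded object is
    # treated as one job description (assumed to be JSON here).
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        job = json.loads(body)

        do_work(job)  # placeholder for the actual ~1 minute task

        # Remove the job from the "queue" once it has been processed.
        s3.delete_object(Bucket=bucket, Key=key)

def do_work(job):
    # Placeholder worker logic.
    print("processing", job)
```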
Now you have to respect some limits. This way you can only have 100 workers at the same time (the total number of concurrently active Lambda instances), but you can ask AWS to increase this.
The costs are as follows:
S3 PUT requests: $0.005 per 1,000, so $5 per million job requests (this is more expensive than SQS).
Lambda runtime: assuming normal t2.micro CPU speed (1 GB RAM), this costs $0.0001 per job (60 seconds; the first 300,000 seconds are free = 5,000 jobs).
Lambda requests: $0.20 per million triggers (the first million is free).
This setup does not require any servers on your part. It cannot go down (unless AWS itself does).
(Don't forget to delete the job object from S3 when you're done.)