I'm trying to run a 100-node AWS Batch job. When I set my compute environment to use only m4.xlarge and m5.xlarge instances, everything works fine: my job is picked up and runs.
However, when I include other instance types in my compute environment, such as m5.2xlarge, the job is stuck in the RUNNABLE state indefinitely. The only variable I am changing in these updates is the list of instance types in the compute environment.
I'm not sure what is causing this job to not be picked up when I include other instance types in the computing environment. In the documentation for Compute Environment Parameters the only note is:
When you create a compute environment, the instance types that you select for the compute environment must share the same architecture. For example, you can't mix x86 and ARM instances in the same compute environment.
The JobDefinition is multi-node:
Node 0
vCPUs: 1
Memory: 15360 MiB
Node 1:
vCPUs: 2
Memory: 15360 MiB
My compute environment's max vCPUs is set to 10,000, and the environment is always in a VALID state and always ENABLED. My EC2 vCPU limit is 6,000. CloudWatch provides no logs because the job has not started, so I'm not sure what else to try here. I am also not using the optimal setting for instance types because I ran into issues with not getting enough instances.
I just resolved this issue. The problem is with the BEST_FIT strategy in Batch: the resource requirements of the jobs I'm submitting are not close enough to the size of the added instance types, so the jobs never get picked up.
I figured this out by modifying the job definition to use 8 vCPUs and 30 GB of memory, after which the job started on the m5.2xlarge instances.
I'm going to see if using the BEST_FIT_PROGRESSIVE strategy resolves this issue and report back, although I doubt it will.
--
Update: I have spoken with AWS Support and got some more insight. The BEST_FIT_PROGRESSIVE allocation strategy has built-in protections against over-scaling so that customers do not accidentally launch thousands of instances. This has the side effect I am experiencing, where jobs fail to start.
The support engineer's recommendation was to use a single instance type in the compute environment together with the BEST_FIT allocation strategy. Since my jobs have different instance requirements, I created three separate compute environments targeting different instance types (c5.large, c5.xlarge, m4.xlarge), submitted jobs, and had them run in the appropriate compute environment.
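The one-environment-per-instance-type setup could be sketched roughly as follows. This is a minimal illustration, not the exact configuration used above: the helper name `build_compute_environment`, the environment naming scheme, and the subnet, security group, and instance role values are made-up placeholders. The dictionary keys follow the request shape of boto3's `create_compute_environment` call as I understand it, so check them against the current Batch API reference before relying on them.

```python
def build_compute_environment(instance_type, max_vcpus=10000):
    """Build a request dict for one single-instance-type environment.

    All names and IDs below are hypothetical placeholders.
    """
    return {
        # e.g. "batch-c5-large" (the naming scheme is an assumption)
        "computeEnvironmentName": "batch-" + instance_type.replace(".", "-"),
        "type": "MANAGED",
        "state": "ENABLED",
        "computeResources": {
            "type": "EC2",
            "allocationStrategy": "BEST_FIT",
            "minvCpus": 0,
            "maxvCpus": max_vcpus,
            # Exactly one instance type per environment.
            "instanceTypes": [instance_type],
            "subnets": ["subnet-PLACEHOLDER"],
            "securityGroupIds": ["sg-PLACEHOLDER"],
            "instanceRole": "ecsInstanceRole",
        },
    }


# One environment per instance type mentioned above.
ENVIRONMENTS = [
    build_compute_environment(t) for t in ("c5.large", "c5.xlarge", "m4.xlarge")
]

if __name__ == "__main__":
    for env in ENVIRONMENTS:
        print(env["computeEnvironmentName"], env["computeResources"]["instanceTypes"])
    # To actually create them (requires boto3 and AWS credentials):
    #   import boto3
    #   batch = boto3.client("batch")
    #   for env in ENVIRONMENTS:
    #       batch.create_compute_environment(**env)
```

Each environment would then be attached to its own job queue, so that a job is submitted to the queue whose instance type matches its requirements.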
I’m coding a chess engine for a project at school and I need more computing power than my PC can offer.
So I turned to AWS, and to EC2 in particular. I want to test different algorithms.
I know how to start the instance and how to begin the computation on it, but as soon as the computation is finished, I would like to automatically send the data files to S3 (I know the command but not how to execute it automatically) and shut down the instance to avoid paying for nothing.
Thank you for your help,
You mention your script is in Python. In that case, you could probably run it with AWS Lambda and configure CloudWatch Events rules to define when the script runs. Keep in mind that the maximum execution time Lambda allows is 15 minutes; if that is not suitable for your case, you can look at ECS Fargate scheduled tasks, which run the process at the defined time and then remove the container.
As pointed out by samtoddler on Jan 11, 2021:
a call to os.shutdown() after the computation gave me what I wanted!
Thank you
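For anyone landing here later, the "upload, then shut down" step at the end of the script could look roughly like this. It is a sketch under assumptions: as far as I know Python's standard `os` module has no `shutdown()` function, so this calls the system `shutdown` command through `subprocess` instead. The results directory and bucket URI are hypothetical, and it assumes the AWS CLI is installed on the instance and the user may run `sudo shutdown`.

```python
import subprocess


def finish_commands(results_dir, bucket_uri):
    """Return the commands to run once the computation is done."""
    return [
        # Copy the result files to S3 with the AWS CLI.
        ["aws", "s3", "cp", results_dir, bucket_uri, "--recursive"],
        # Halt the machine from inside the OS.
        ["sudo", "shutdown", "-h", "now"],
    ]


def upload_and_shutdown(results_dir, bucket_uri, dry_run=True):
    """Run the upload, then the shutdown.

    With check=True a failed upload raises CalledProcessError, so the
    shutdown is never reached and the results stay on the instance.
    """
    commands = finish_commands(results_dir, bucket_uri)
    for command in commands:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical paths; dry_run=True only prints the commands.
    upload_and_shutdown("results/", "s3://example-bucket/run1/", dry_run=True)
```

Calling `upload_and_shutdown(..., dry_run=False)` as the last line of the computing script does both steps. Whether an OS-level shutdown stops or terminates the instance depends on the instance's shutdown behavior setting, and a stopped instance still incurs storage charges for its volumes.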
At the moment I have a load balancer which runs a Compute Engine Instance Group which has a minimum of 1 server and a maximum of 5 servers.
The group uses autoscaling and a pre-built Ubuntu template with all the base software needed.
When an instance boots up, it registers a runner with the GitLab project and then triggers the job that updates the instance to the latest copy of the code.
This is fine and works well.
The issue comes when I make a change to the git branch and push it: the change only seems to be picked up by one, apparently random, instance out of the 5 that have loaded.
I was under the impression that GitLab would push out to all the registered runners, but this doesn't seem to be the case.
I have seen answers on here that show multiple runners, but on a single server, I haven't come across my particular situation.
Has anyone come across this before? I would assume that this is a pretty normal situation, and weird that it doesn't just work.
For each job that runs in GitLab, only 1 runner receives the job. The mechanism is PULL based -- the runners constantly ask GitLab if there are any jobs available to run. GitLab never initiates communication with the runners.
Therefore, your load balancer rules do nothing to affect which runner receives a job, and there is no "fairness" in distributing jobs across servers. Runners will keep asking for jobs every few seconds as long as they are able to take them (according to the concurrency settings in config.toml), and GitLab will hand them out on a first-come, first-served basis.
If you set the concurrency to 1 and start multiple jobs, you should see multiple servers pick up the jobs.
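To make the pull model concrete, here is a toy simulation, not GitLab's actual scheduler. The function name `simulate_pull`, the runner names, and the idea of a fixed "poll order" are all invented for illustration. Each poll hands out at most one queued job, and a runner that is already at its concurrency limit gets nothing.

```python
from collections import deque


def simulate_pull(jobs, poll_order, concurrent):
    """Toy model of pull-based dispatch.

    jobs:       job names waiting in the server's queue
    poll_order: runner names in the order their polls arrive
    concurrent: max jobs a single runner may hold (jobs never finish here)
    Returns {runner: [jobs it picked up]}.
    """
    queue = deque(jobs)
    assigned = {}
    for runner in poll_order:
        taken = assigned.setdefault(runner, [])
        # First come, first served: whoever polls with a free slot gets the job.
        if queue and len(taken) < concurrent:
            taken.append(queue.popleft())
    return assigned


if __name__ == "__main__":
    polls = ["r1", "r1", "r1", "r2", "r3"]
    # A single job goes to exactly one runner, never to all of them.
    print(simulate_pull(["deploy"], polls, 1))
    # High concurrency: the runner that polls first can take everything.
    print(simulate_pull(["a", "b", "c"], polls, 3))
    # Concurrency 1: the same jobs spread over three runners.
    print(simulate_pull(["a", "b", "c"], polls, 1))
```

In this model a single pushed change is one job and lands on one runner only, which matches the behavior described in the question. Updating every instance would need one job per instance or some other mechanism outside the runner queue.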
Total Instances: I have created an EMR with 11 nodes total (1 master instance, 10 core instances).
job submission: spark-submit myApplication.py
graph of containers: Next, I've got these graphs, which refer to "containers". I'm not entirely sure what containers are in the context of EMR, so it isn't obvious what this is telling me:
actual running executors: and then I've got this in my Spark history UI, which shows that only 4 executors ever got created.
Dynamic Allocation: Then I've got spark.dynamicAllocation.enabled=True and I can see that in my environment details.
Executor memory: Also, the default executor memory is at 5120M.
Executors: Next, I've got my executors tab, showing that I've got what looks like 3 active and 1 dead executor:
So, at face value, it appears to me that I'm not using all my nodes or available memory.
How do I know if I'm using all the resources I have available?
If I'm not using all available resources to their full potential, how do I change what I'm doing so that they are?
Another way to see how many resources are being used by each node of the cluster is the Ganglia web tool.
It is published on the master node and shows a graph of each node's resource usage. The catch is that Ganglia must have been enabled at cluster creation time as one of the tools available on the EMR cluster.
Once it is enabled, you can go to the web page and see how much each node is being utilized.
When you pick a more performant node, say a r3.xlarge vs m3.xlarge, will Spark automatically utilize the additional resources? Or is this something you need to manually configure and tune?
As far as configurations go, which are the most important configuration values to tune to get the most out of your cluster?
It will try.
AWS has a setting you can enable in your EMR cluster configuration that will attempt to do this. It is called spark.dynamicAllocation.enabled. In the past there were issues with this setting where it would give too many resources to Spark. In newer releases they have lowered the amount they give to Spark. However, if you are using PySpark, it will not take Python's resource requirements into account.
I typically disable dynamicAllocation and set the appropriate memory and cores settings dynamically from my own code based upon what instance type is selected.
This page discusses what defaults they will select for you:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html
If you do it manually, at a minimum you will want to set:
spark.executor.memory
spark.executor.cores
Also, you may need to adjust the yarn container size limits with:
yarn.scheduler.maximum-allocation-mb
yarn.scheduler.minimum-allocation-mb
yarn.nodemanager.resource.memory-mb
Make sure you leave a core and some RAM for the OS, and RAM for Python if you are using PySpark.
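As a rough illustration of deriving those values from the instance size, here is a small sketch. The function names and every reservation figure (1 core and 2048 MB for the OS, 20% of each container held back for Python and other non-heap memory) are assumptions chosen for the example, not EMR or Spark defaults. The 20% figure is a crude stand-in for proper memory overhead tuning.

```python
def size_executors(node_vcpus, node_memory_mb, executors_per_node=1,
                   os_cores=1, os_memory_mb=2048, python_percent=20):
    """Derive executor settings for one worker node (all defaults are guesses)."""
    usable_cores = node_vcpus - os_cores
    usable_mb = node_memory_mb - os_memory_mb
    if usable_cores < executors_per_node or usable_mb <= 0:
        raise ValueError("node too small for the requested reservations")
    cores = usable_cores // executors_per_node
    container_mb = usable_mb // executors_per_node
    # Keep part of each container outside the JVM heap for Python workers.
    heap_mb = container_mb * (100 - python_percent) // 100
    return {
        "spark.executor.cores": cores,
        "spark.executor.memory": f"{heap_mb}m",
        "container_mb": container_mb,  # informational, not a Spark setting
    }


def to_spark_submit_args(settings):
    """Turn the spark.* entries into spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        if key.startswith("spark."):
            args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    settings = size_executors(node_vcpus=4, node_memory_mb=16384)
    print(" ".join(["spark-submit"] + to_spark_submit_args(settings) + ["myApplication.py"]))
```

The YARN limits listed above (for example yarn.nodemanager.resource.memory-mb) still have to be large enough for the resulting container size, otherwise YARN will refuse to allocate the executors.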
We use Jenkins for our CI build system. We also use 'concurrent builds' so that Jenkins will build each change independently. This means we often have 5 or 6 builds of the same job running simultaneously. To accommodate this, we have 4 slaves each with 12 executors.
The problem is that Jenkins doesn't really 'load balance' among its slaves. It tries to build a job on the same slave that it previously built on (presumably to reduce the time syncing from source control). This is a problem because Jenkins will build all 6 instances of our build on the same slave (or more likely between 2 slaves). One build machine gets bogged down and runs very slowly while the rest of them sit idle.
How do I configure the load balancing behavior of Jenkins, and how it controls its slaves?
We were facing a similar issue. So I've put together a plugin that changes the Load Balancer in Jenkins to select a node that currently has the least load - https://plugins.jenkins.io/leastload/
Any feedback is appreciated.
If you do not find a plugin that does it automatically, here's an idea of what you can do:
Install Node Label Parameter plugin
Add SLAVE parameter to your jobs
Restrict jobs to run on ${SLAVE}
Add a trigger job that will do the following:
Analyze load distribution via a System Groovy Script and decide on which node to start the next build.
Dispatch the build on that node with the Parameterized Trigger plugin by assigning the appropriate value to the SLAVE parameter.
In order to analyze load distribution you need to install Groovy plugin and familiarize yourself with Jenkins Main Module API. Here are some useful initial pointers.
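The "decide on which node" step boils down to picking the node with the lowest share of busy executors. The sketch below shows only that selection logic, written in Python for readability. In Jenkins it would live in the System Groovy script and read the executor counts from the Jenkins API. The function name and the input dictionary are hypothetical.

```python
def pick_least_loaded(nodes):
    """Pick the node with the lowest fraction of busy executors.

    nodes: {name: (busy_executors, total_executors)}
    Returns a node name, or None if every node is full.
    """
    candidates = {
        name: busy / total
        for name, (busy, total) in nodes.items()
        if total > 0 and busy < total  # skip offline or fully busy nodes
    }
    if not candidates:
        return None
    # Lowest busy fraction wins; ties are broken by name for determinism.
    return min(candidates, key=lambda name: (candidates[name], name))


if __name__ == "__main__":
    load = {
        "slave-1": (6, 12),
        "slave-2": (0, 12),
        "slave-3": (2, 12),
        "slave-4": (12, 12),
    }
    # The chosen name would be passed as the SLAVE parameter of the build.
    print(pick_least_loaded(load))
```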
If your build machines cannot comfortably handle more than 1 build, why configure them with 12 executors? If that is indeed the case, you should reduce the number of executors to 1. My Jenkins has 30 slaves, each with 1 executor.
You may also use the Throttle Concurrent Builds plugin to restrict how many instances of a job can run in parallel on the same node.
I have two labels -- one for small tasks and one for big tasks. I have one executor for the big task and 4 for the small tasks. This does balance things a little.