Why are Spark environment parameters inconsistent with the executors tab?

I'm running my Spark application on EMR. In the Spark history UI, under the environment tab, spark.executor.instances equals 1. However, under the executors tab, it shows that there were 9 executors in total, including the driver.
Why does this happen?

On EMR, spark.executor.instances defaults to the initial number of core nodes plus the number of task nodes in the cluster.
The executors tab, on the other hand, shows the executors that were actually launched. EMR enables spark.dynamicAllocation.enabled by default, so Spark can scale the number of executors well beyond spark.executor.instances, which is why the two tabs disagree.
This link will help explain the various meanings and options:
Submitting User Applications with spark-submit
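If you need the two tabs to agree, one option (a sketch, assuming your application can live with a fixed executor count; the number 8 is illustrative) is to disable dynamic allocation and set the instance count explicitly at submit time:

```shell
# Sketch: pin the executor count so the environment and executors tabs match.
# With dynamic allocation off, Spark launches exactly spark.executor.instances
# executors; here 8 executors plus the driver would give 9 entries in the UI.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=8 \
  myApplication.py
```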

Related

AWS Batch Job Stuck in Runnable State

I'm trying to run a 100-node AWS Batch job. When I set my compute environment to use only m4.xlarge and m5.xlarge instances, everything works fine: my job is picked up and runs.
However, when I begin to include other instance types in my compute environment such as m5.2xlarge, the job is stuck in the runnable state indefinitely. The only variable I am changing in these updates is the instance types in the compute environment.
I'm not sure what is causing the job not to be picked up when I include other instance types in the compute environment. In the documentation for Compute Environment Parameters, the only relevant note is:
When you create a compute environment, the instance types that you select for the compute environment must share the same architecture. For example, you can't mix x86 and ARM instances in the same compute environment.
The JobDefinition is multi-node:
Node 0:
vCPUs: 1
Memory: 15360 MiB
Node 1:
vCPUs: 2
Memory: 15360 MiB
My compute environment's max vCPUs is set to 10,000; it is always in a VALID state and always ENABLED. My EC2 vCPU limit is 6,000. CloudWatch provides no logs because the job has not started, and I'm not sure what else to try. I am also not using the "optimal" setting for instance types, because with it I ran into issues getting enough instances.
I just resolved this issue. The problem is with the BEST_FIT allocation strategy in Batch: the resource requirements of the jobs I'm submitting are not a close enough fit for the larger instance types, so the jobs never get picked up.
I figured this out by modifying the job definition to use 8 vCPU and 30GB of memory and the job began with the m5.2xlarge instances.
I'm going to see if using the BEST_FIT_PROGRESSIVE strategy will resolve this issue and report back, although I doubt it will.
--
Update: I have spoken with AWS Support and got some more insight. The BEST_FIT_PROGRESSIVE allocation strategy has built-in protections against over-scaling, so that customers do not accidentally launch thousands of instances. This has the side effect I am experiencing, where jobs fail to start at all.
The support engineer's recommendation was to use a single instance type per compute environment together with the BEST_FIT allocation strategy. Since my jobs have different instance requirements, I was able to create three separate compute environments targeting different instance types (c5.large, c5.xlarge, m4.xlarge), submit jobs, and have them run in the appropriate compute environment.
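The one-environment-per-instance-type workaround can be sketched with the AWS CLI. This is only an outline: the environment name, subnet, security group, and instance role below are placeholders, not values from the original post.

```shell
# Sketch: one compute environment per instance type, each using BEST_FIT.
# All names, subnets, security groups, and roles below are placeholders.
aws batch create-compute-environment \
  --compute-environment-name ce-c5-large \
  --type MANAGED \
  --compute-resources '{
    "type": "EC2",
    "allocationStrategy": "BEST_FIT",
    "minvCpus": 0,
    "maxvCpus": 256,
    "instanceTypes": ["c5.large"],
    "subnets": ["subnet-0123456789abcdef0"],
    "securityGroupIds": ["sg-0123456789abcdef0"],
    "instanceRole": "ecsInstanceRole"
  }'
```

Repeating this for each required instance type (c5.xlarge, m4.xlarge, and so on) and pointing each job queue at the matching environment reproduces the setup described above.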

EMR Hadoop does not utilize all cluster nodes

We are experimenting with Hadoop and processing of the Common Crawl.
Our problem is that if we create a cluster with 1 Master Node and 1 Core and 2 Task nodes, only one of the nodes per group will get high CPU/Network usage.
We tried also with 2 Core and no Task nodes, but in this case also only one Core node was used.
Following some screenshots of the Node/Cluster monitoring. The job was running all the time (in the first two parallel map phases), and should have used most of the available CPU power, as you can see in the screenshot of the working Task node.
But why is the idle Task node not utilized?
Our Hadoop job, running as a JAR step, has no limit on the number of map tasks. It consists of multiple chained map/reduce steps; only the last reduce job is limited to one reducer.
Screenshots:
https://drive.google.com/drive/folders/1xwABYJMJAC_B0OuVpTQ9LNSj12TtbxI1?usp=sharing
ClusterId: j-3KAPYQ6UG9LU6
StepId: s-2LY748QDLFLM9
We found the following in the system logs of the idle node during another run; maybe it is an EMR problem?
ERROR main: Failed to fetch extraInstanceData from https://aws157-instance-data-1-prod-us-east-1.s3.amazonaws.com/j-2S62KOVL68GVK/ig-3QUKQSH7YJIAU.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X
Greetings
Lukas
Late to the party, but have you tried setting these properties as part of the spark-submit command?
--conf 'spark.dynamicAllocation.enabled=true'
--conf 'spark.dynamicAllocation.minExecutors=<MIN_NO_OF_CORE_OR_TASK_NODES_YOU_WANT>'
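Putting those flags into a full command might look like the following. This is a sketch: the executor counts and application name are placeholders you would size to your core and task node count.

```shell
# Sketch: enable dynamic allocation with a floor and ceiling on executors.
# The numbers and the application JAR name are placeholders.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=8 \
  your-application.jar
```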

Am I fully utilizing my EMR cluster?

Total Instances: I have created an EMR cluster with 11 nodes total (1 master instance, 10 core instances).
job submission: spark-submit myApplication.py
graph of containers: Next, I've got these graphs, which refer to "containers", and I'm not entirely sure what containers are in the context of EMR, so it isn't obvious what they're telling me:
actual running executors: Then I've got this in my Spark history UI, which shows that only 4 executors were ever created.
Dynamic Allocation: Then I've got spark.dynamicAllocation.enabled=True and I can see that in my environment details.
Executor memory: Also, the default executor memory is at 5120M.
Executors: Next, I've got my executors tab, showing that I've got what looks like 3 active and 1 dead executor:
So, at face value, it appears to me that I'm not using all my nodes or available memory.
How do I know if I'm using all the resources I have available?
If I'm not using all available resources to their full potential, how do I change what I'm doing so that they are used to their full potential?
Another way to see how many resources each node of the cluster is using is the Ganglia web tool.
Ganglia is published on the master node and shows a graph of each node's resource usage. The catch is that Ganglia must have been enabled as one of the tools on the EMR cluster at creation time.
Once enabled, however, you can go to its web page and see how heavily each node is being utilized.
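If Ganglia was not enabled at cluster creation, the YARN CLI on the master node offers a rough alternative. A sketch (the exact output format varies by Hadoop version, and `<node-id>` is a placeholder taken from the list output):

```shell
# Sketch: inspect per-node container and resource usage via the YARN CLI.
# Run these on the EMR master node; no extra setup is required.
yarn node -list -all          # list every node manager with its state
yarn node -status <node-id>   # memory/vCores used vs. available on one node
```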

How to submit multiple queries in hive concurrently

I am trying to submit multiple Hive queries using CLI and I want the queries to run concurrently. However, these queries are running sequentially.
Can somebody tell me how to invoke a number of Hive queries so that they do in fact run concurrently?
This is not because of Hive, it has to do with your Hadoop configuration. By default, Hadoop uses a simple FIFO queue for job submission and execution. You can, however, configure a different policy so that multiple jobs can run at once.
Here's a nice blog post from Cloudera back in 2008 on the matter: Job Scheduling in Hadoop
Pretty much any scheduler other than the default will support concurrent jobs, so take your pick!
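At the CLI level you can also get concurrency simply by backgrounding each invocation. A sketch (the queries and table names are placeholders, and whether the jobs actually overlap still depends on the scheduler, as noted above):

```shell
# Sketch: launch each query in its own hive process, then wait for all of them.
# Each hive -e invocation submits its own job to the cluster.
hive -e "SELECT COUNT(*) FROM table_a" > a.out 2>&1 &
hive -e "SELECT COUNT(*) FROM table_b" > b.out 2>&1 &
wait   # block until both background queries have finished
```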

How can I modify the Load Balancing behavior Jenkins uses to control slaves?

We use Jenkins for our CI build system. We also use 'concurrent builds' so that Jenkins will build each change independently. This means we often have 5 or 6 builds of the same job running simultaneously. To accommodate this, we have 4 slaves each with 12 executors.
The problem is that Jenkins doesn't really 'load balance' among its slaves. It tries to build a job on the same slave that it previously built on (presumably to reduce the time syncing from source control). This is a problem because Jenkins will build all 6 instances of our build on the same slave (or more likely between 2 slaves). One build machine gets bogged down and runs very slowly while the rest of them sit idle.
How do I configure the load balancing behavior of Jenkins, and how it controls its slaves?
We were facing a similar issue. So I've put together a plugin that changes the Load Balancer in Jenkins to select a node that currently has the least load - https://plugins.jenkins.io/leastload/
Any feedback is appreciated.
If you do not find a plugin that does it automatically, here's an idea of what you can do:
Install Node Label Parameter plugin
Add SLAVE parameter to your jobs
Restrict jobs to run on ${SLAVE}
Add a trigger job that will do the following:
Analyze load distribution via a System Groovy Script and decide which node should run the next build.
Dispatch the build on that node with the Parameterized Trigger plugin by assigning the appropriate value to the SLAVE parameter.
In order to analyze load distribution you need to install Groovy plugin and familiarize yourself with Jenkins Main Module API. Here are some useful initial pointers.
If your build machines cannot comfortably handle more than 1 build, why configure them with 12 executors? If that is indeed the case, you should reduce the number of executors to 1. My Jenkins has 30 slaves, each with 1 executor.
You may also use the Throttle Concurrent Builds plugin to restrict how many instances of a job can run in parallel on the same node.
I have two labels -- one for small tasks and one for big tasks. I have one executor for the big task and 4 for the small tasks. This does balance things a little.