We are experimenting with Hadoop and processing of the Common Crawl.
Our problem is that if we create a cluster with 1 master node, 1 core node, and 2 task nodes, only one node per instance group gets high CPU/network usage.
We also tried 2 core nodes and no task nodes, but in that case, too, only one core node was used.
Below are some screenshots of the node/cluster monitoring. The job was running the whole time (in the first two parallel map phases) and should have used most of the available CPU power, as you can see in the screenshot of the working task node.
But why is the idle Task node not utilized?
Our Hadoop job, running as a JAR step, has no limits on the map tasks. It consists of multiple chained map/reduce steps; the last reduce step is limited to one reducer.
Screenshots:
https://drive.google.com/drive/folders/1xwABYJMJAC_B0OuVpTQ9LNSj12TtbxI1?usp=sharing
ClusterId: j-3KAPYQ6UG9LU6
StepId: s-2LY748QDLFLM9
We found the following in the system logs of the idle node during another run; maybe it is an EMR problem?
ERROR main: Failed to fetch extraInstanceData from https://aws157-instance-data-1-prod-us-east-1.s3.amazonaws.com/j-2S62KOVL68GVK/ig-3QUKQSH7YJIAU.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X
Greetings
Lukas
Late to the party, but have you tried setting these properties as part of the spark-submit command?
--conf 'spark.dynamicAllocation.enabled=true'
--conf 'spark.dynamicAllocation.minExecutors=<MIN_NO_OF_CORE_OR_TASK_NODES_YOU_WANT>'
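For example, a full submit command might look roughly like this (a sketch only; the deploy mode, executor counts, and application name are placeholders, and spark.dynamicAllocation.maxExecutors is an optional upper bound):
spark-submit \
  --deploy-mode cluster \
  --conf 'spark.dynamicAllocation.enabled=true' \
  --conf 'spark.dynamicAllocation.minExecutors=2' \
  --conf 'spark.dynamicAllocation.maxExecutors=10' \
  my_application.py
Dynamic allocation also relies on the external shuffle service (spark.shuffle.service.enabled=true), which EMR enables by default.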
Is it possible to keep the master machine running in Dataproc? Every time I run a job, after a while (~1 hour) I see that the master node has stopped. It is not a real issue, since I can easily start it again, but I would like to know if there is a way to keep it awake.
A possible way that occurs to me is to schedule a job on the master machine, but I want to know if there is a more official way to achieve this.
Dataproc does not stop any cluster nodes (including master) when they are idle.
You need to check if you have some kind of automation or user that can do this on your end.
I'm running my Spark application on EMR. In the Spark history UI, under the Environment tab, spark.executor.instances equals 1. However, under the Executors tab, it shows that there were 9 executors in total, including 1 driver.
Why does this happen?
spark.executor.instances is set to the initial number of core nodes plus the number of task nodes in the cluster.
The Executors tab, on the other hand, lists every executor that was actually launched while the application ran; with dynamic allocation (which EMR enables by default) that count can differ from the initial setting.
This link will help explain the various meanings and options:
Submitting User Applications with spark-submit
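If you would rather pin the executor count yourself instead of relying on the defaults, one option (a sketch; the numbers are placeholders you would size to your cluster) is to disable dynamic allocation and pass explicit values:
spark-submit \
  --conf 'spark.dynamicAllocation.enabled=false' \
  --num-executors 9 \
  --executor-cores 4 \
  --executor-memory 4g \
  my_application.py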
I've set up Hadoop on my laptop, and when I submit a job (through either MapReduce or Tez), the status is always ACCEPTED, but the progress stays stuck at 0% and the description says something like "waiting for AM container to be allocated".
When I check the nodes through the YARN UI (localhost:8088), it shows that the number of active nodes is 0.
But the HDFS UI (localhost:50070) shows that there is one live node.
Is that the main reason the job is stuck, since there are no available nodes? If that's the case, what should I do?
In your YARN UI, it shows you have zero vcores and zero memory so there is no way for any job to ever run since you lack computing resources. The datanode is only for storage (HDFS in this case) and does not matter as far as why your application is stuck.
To fix your problem, you need to update your yarn-site.xml and provide settings for the memory and vcore properties described in the following:
http://blog.cloudera.com/blog/2015/10/untangling-apache-hadoop-yarn-part-2/
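As a rough starting point (a sketch, not tuned values; adjust them to the RAM and cores your laptop can actually spare), the relevant properties go inside the <configuration> element of yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>  <!-- RAM the NodeManager may hand out to containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>2</value>  <!-- vcores the NodeManager may hand out -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>  <!-- smallest container the scheduler will grant -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>  <!-- largest container the scheduler will grant -->
</property>
After editing the file, restart YARN (e.g. stop-yarn.sh and start-yarn.sh) so the ResourceManager and NodeManager pick up the new values.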
You might consider using a Cloudera QuickStart VM or Hortonworks Sandbox (at least as a reference for configuration values for the yarn-site.xml).
https://www.cloudera.com/downloads/quickstart_vms/5-10.html
https://hortonworks.com/products/sandbox/
Total Instances: I have created an EMR cluster with 11 nodes in total (1 master instance, 10 core instances).
job submission: spark-submit myApplication.py
graph of containers: Next, I've got these graphs, which refer to "containers", and I'm not entirely sure what containers are in the context of EMR, so it isn't obvious what they're telling me:
actual running executors: and then I've got this in my Spark history UI, which shows that only 4 executors were ever created.
Dynamic Allocation: Then I've got spark.dynamicAllocation.enabled=True and I can see that in my environment details.
Executor memory: Also, the default executor memory is at 5120M.
Executors: Next, I've got my executors tab, showing that I've got what looks like 3 active and 1 dead executor:
So, at face value, it appears to me that I'm not using all my nodes or available memory.
How do I know if I'm using all the resources I have available?
If I'm not, how do I change what I'm doing so that the available resources are used to their full potential?
Another way to see how many resources are being used by each node of the cluster is the Ganglia web tool.
It is published on the master node and shows a graph of each node's resource usage. The catch is if you did not enable Ganglia as one of the tools on the EMR cluster at the time of cluster creation.
Once enabled, however, you can go to its web page and see how heavily each node is being utilized.
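If Ganglia was not enabled at creation time, a rougher alternative (a YARN CLI check rather than Ganglia, run on the master node) is:
yarn node -list -all
The Number-of-Running-Containers column shows whether work is actually landing on every node; the ResourceManager web UI on port 8088 presents the same information graphically.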
We use Jenkins for our CI build system. We also use 'concurrent builds' so that Jenkins will build each change independently. This means we often have 5 or 6 builds of the same job running simultaneously. To accommodate this, we have 4 slaves each with 12 executors.
The problem is that Jenkins doesn't really 'load balance' among its slaves. It tries to build a job on the same slave that it previously built on (presumably to reduce the time syncing from source control). This is a problem because Jenkins will build all 6 instances of our build on the same slave (or more likely between 2 slaves). One build machine gets bogged down and runs very slowly while the rest of them sit idle.
How do I configure the load balancing behavior of Jenkins, and how it controls its slaves?
We were facing a similar issue. So I've put together a plugin that changes the Load Balancer in Jenkins to select a node that currently has the least load - https://plugins.jenkins.io/leastload/
Any feedback is appreciated.
If you do not find a plugin that does it automatically, here's an idea of what you can do:
Install Node Label Parameter plugin
Add SLAVE parameter to your jobs
Restrict jobs to run on ${SLAVE}
Add a trigger job that will do the following:
Analyze load distribution via a System Groovy Script and decide on which node to start the next build.
Dispatch the build on that node with the Parameterized Trigger plugin by assigning the appropriate value to the SLAVE parameter.
In order to analyze load distribution you need to install Groovy plugin and familiarize yourself with Jenkins Main Module API. Here are some useful initial pointers.
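As a rough sketch of such a System Groovy Script (one possible approach, not a drop-in solution; you would still feed the chosen name into the SLAVE parameter via the Parameterized Trigger step):
import jenkins.model.Jenkins

// Pick the online node with the most idle executors, i.e. the least loaded one.
def candidates = Jenkins.instance.computers.findAll { it.isOnline() && it.countExecutors() > 0 }
def leastLoaded = candidates.max { it.countIdle() }
println "Dispatch the next build to: ${leastLoaded?.name}"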
If your build machines cannot comfortably handle more than 1 build, why configure them with 12 executors? If that is indeed the case, you should reduce the number of executors to 1. My Jenkins has 30 slaves, each with 1 executor.
You may also use the Throttle Concurrent Builds plugin to restrict how many instances of a job can run in parallel on the same node.
I have two labels -- one for small tasks and one for big tasks. I have one executor for the big task and 4 for the small tasks. This does balance things a little.