Configuring Spark on EMR - amazon-web-services

When you pick a more performant node, say a r3.xlarge vs m3.xlarge, will Spark automatically utilize the additional resources? Or is this something you need to manually configure and tune?
As far as configurations go, which are the most configuration values to tune to get the most out of your cluster?

It will try..
AWS has a setting you can enable in your EMR cluster configuration that will attempt to do this. It is called spark.dynamicAllocation.enabled. In the past there were issues with this setting where it would give too many resources to Spark. In newer releases they have lowered the amount they are giving to spark. However, if you are using Pyspark they will not take python's resource requirements into account.
I typically disable dynamicAllocation and set the appropriate memory and cores settings dynamically from my own code based upon what instance type is selected.
This page discusses what defaults they will select for you:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html
If you do it manually, at a minimum you will want to set:
spark.executor.memory
spark.executor.cores
Also, you may need to adjust the yarn container size limits with:
yarn.scheduler.maximum-allocation-mb
yarn.scheduler.minimum-allocation-mb
yarn.nodemanager.resource.memory-mb
Make sure you leave a core and some RAM for the OS, and RAM for python if you are using Pyspark.

Related

AWS Batch Job Stuck in Runnable State

I'm trying to run a 100 node AWS Batch job, when I set my computing environment to use only m4.xlarge and m5.xlarge instances everything works fine and my job is picked up and runs.
However, when I begin to include other instance types in my compute environment such as m5.2xlarge, the job is stuck in the runnable state indefinitely. The only variable I am changing in these updates is the instance types in the compute environment.
I'm not sure what is causing this job to not be picked up when I include other instance types in the computing environment. In the documentation for Compute Environment Parameters the only note is:
When you create a compute environment, the instance types that you select for the compute environment must share the same architecture. For example, you can't mix x86 and ARM instances in the same compute environment.
The JobDefinition is multi-node:
Node 0
vCPUs: 1
Memory: 15360 MiB
Node 1:
vCPUs: 2
Memory: 15360 MiB
My computing environment max vCPUs is set to 10,000, is always in a VALID state and always ENABLED. Also my EC2 vCPU limit is 6,000. CloudWatch provides no logs because the job has not started, I'm not sure what else to try here. I am also not using the optimal setting for instance types because I ran into issues with not getting enough instances.
I just resolved this issue, the problem is with the BEST_FIT strategy in Batch. The jobs that I'm submitting are not close enough to the instance type so they never get picked up.
I figured this out by modifying the job definition to use 8 vCPU and 30GB of memory and the job began with the m5.2xlarge instances.
I'm going to see if using the BEST_FIT_PROGRESSIVE strategy will resolve this issue and report back, although I doubt it will.
--
Update: I have spoken with AWS Support and got some more insight. The BEST_FIT_PROGRESSIVE allocation strategy has built-in protections for over-scaling so that customers do not accidentally launch thousands of instances. Although this has the side effect of what I am experiencing which leads to jobs failing to start.
The support engineers recommendation was to use a single instance type in the Compute Environment and the BEST_FIT allocation strategy. Since my jobs have different instance requirements I was able to successfully create three separate ComputeEnvironments targeting difference instances types (c5.large, c5.xlarge, m4.xlarge), submit jobs and have them run in the appropriate Compute Environment.

Trouble configuring Presto's memory allocation on AWS EMR

I am really hoping to use Presto in an ETL pipeline on AWS EMR, but I am having trouble configuring it to fully utilize the cluster's resources. This cluster would exist solely for this one query, and nothing more, then die. Thus, I would like to claim the maximum available memory for each node and the one query by increasing query.max-memory-per-node and query.max-memory. I can do this when I'm configuring the cluster by adding these settings in the "Edit software settings" box of the cluster creation view in the AWS console. But the Presto server doesn't start, reporting in the server.log file an IllegalArgumentException, saying that max-memory-per-node exceeds the useable heap space (which, by default, is far too small for my instance type and use case).
I have tried to use the session setting set session resource_overcommit=true, but that only seems to override query.max-memory, not query.max-memory-per-node, because in the Presto UI, I see that very little of the available memory on each node is being used for the query.
Through Google, I've been led to believe that I need to also increase the JVM heap size by changing the -Xmx and -Xms properties in /etc/presto/conf/jvm.config, but it says here (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html) that it is not possible to alter the JVM settings in the cluster creation phase.
To change these properties after the EMR cluster is active and the Presto server has been started, do I really have to manually ssh into each node and alter jvm.config and config.properties, and restart the Presto server? While I realize it'd be possible to manually install Presto with a custom configuration on an EMR cluster through a bootstrap script or something, this would really be a deal-breaker.
Is there something I'm missing here? Is there not an easier way to make Presto allocate all of a cluster to one query?
As advertised, increasing query.max-memory-per-node, and also by necessity the -Xmx property, indeed cannot be achieved on EMR until after Presto has already started with the default options. To increase these, the jvm.config and config.properties found in /etc/presto/conf/ have to be changed, and the Presto server restarted on each node (core and coordinator).
One can do this with a bootstrap script using commands like
sudo sed -i "s/query.max-memory-per-node=.*GB/query.max-memory-per-node=20GB/g" /etc/presto/conf/config.properties
sudo restart presto-server
and similarly for /etc/presto/jvm.conf. The only caveats are that one needs to include the logic in the bootstrap action to execute only after Presto has been installed, and that the server on the coordinating node needs to be restarted last (and possibly with different settings if the master node's instance type is different than the core nodes).
You might also need to change resources.reserved-system-memory from the default by specifying a value for it in config.properties. By default, this value is .4*(Xmx value), which is how much memory is claimed by Presto for the system pool. In my case, I was able to safely decrease this value and give more memory to each node for executing the query.
As a matter of fact, there are configuration classifications available for Presto in EMR. However, please note that these may vary depending on the EMR release version. For a complete list of the available configuration classifications per release version, please visit 1 (make sure to switch between the different tabs according to your desired release version). Specifically regarding to jvm.config properties, you will see in 2 that these are not currently configurable via configuration classifications. That being said, you can always edit the jvm.config file manually per your needs.
Amazon EMR 5.x Release Versions
1
Considerations with Presto on Amazon EMR - Some Presto Deployment Properties not Configurable:
2

Am I fully utilizing my EMR cluster?

Total Instances: I have created an EMR with 11 nodes total (1 master instance, 10 core instances).
job submission: spark-submit myApplication.py
graph of containers: Next, I've got these graphs, which refer to "containers" and I'm not entirely what containers are in the context of EMR, so this isn't obvious what its telling me:
actual running executors: and then I've got this in my spark history UI, which shows that I only have 4 executors ever got created.
Dynamic Allocation: Then I've got spark.dynamicAllocation.enabled=True and I can see that in my environment details.
Executor memory: Also, the default executor memory is at 5120M.
Executors: Next, I've got my executors tab, showing that I've got what looks like 3 active and 1 dead executor:
So, at face value, it appears to me that I'm not using all my nodes or available memory.
how do I know if i'm using all the resources I have available?
if I'm not using all available resources to their full potential, how do I change what I'm doing so that the available resources are being used to their full potential?
Another way to go to see how many resources are being used by each of the nodes of the cluster is to use the web tool of Ganglia.
This is published on the master node and will show a graph of each node's resource usage. The issue will be if you have not enable Ganglia at the time of cluster creation as one of the tools available on the EMR cluster.
Once enable however you can go to the web page and see how much each node is being utilized.

AWS - Keeping AMIs updated

Let's say I've created an AMI from one of my EC2 instances. Now, I can add this manually to then LB or let the AutoScaling group to do it for me (based on the conditions I've provided). Up to this point everything is fine.
Now, let's say my developers have added a new functionality and I pull the new code on the existing instances. Note that the AMI is not updated at this point and still has the old code. My question is about how I should handle this situation so that when the autoscaling group creates a new instance from my AMI it'll be with the latest code.
Two ways come into my mind, please let me know if you have any other solutions:
a) keep AMIs updated all the time; meaning that whenever there's a pull-request, the old AMI should be removed (deleted) and replaced with the new one.
b) have a start-up script (cloud-init) on AMIs that will pull the latest code from repository on initial launch. (by storing the repository credentials on the instance and pulling the code directly from git)
Which one of these methods are better? and if both are not good, then what's the best practice to achieve this goal?
Given that anything (almost) can be automated using the AWS using the API; it would again fall down to the specific use case at hand.
At the outset, people would recommend having a base AMI with necessary packages installed and configured and have init script which would download the the source code is always the latest. The very important factor which needs to be counted here is the time taken to checkout or pull the code and configure the instance and make it ready to put to work. If that time period is very big - then it would be a bad idea to use that strategy for auto-scaling. As the warm up time combined with auto-scaling & cloud watch's statistics would result in a different reality [may be / may be not - but the probability is not zero]. This is when you might consider baking a new AMI frequently. This would enable you to minimize the time taken for the instance to prepare themselves for the war against the traffic.
I would recommend measuring and seeing which every is convenient and cost effective. It costs real money to pull down the down the instance and relaunch using the AMI; however thats the tradeoff you need to make.
While, I have answered little open ended; coz. the question is also little.
People have started using Chef, Ansible, Puppet which performs configuration management. These tools add a different level of automation altogether; you want to explore that option as well. A similar approach is using the Docker or other containers.
a) keep AMIs updated all the time; meaning that whenever there's a
pull-request, the old AMI should be removed (deleted) and replaced
with the new one.
You shouldn't store your source code in the AMI. That introduces a maintenance nightmare and issues with autoscaling as you have identified.
b) have a start-up script (cloud-init) on AMIs that will pull the
latest code from repository on initial launch. (by storing the
repository credentials on the instance and pulling the code directly
from git)
Which one of these methods are better? and if both are not good, then
what's the best practice to achieve this goal?
Your second item, downloading the source on server startup, is the correct way to go about this.
Other options would be the use of Amazon CodeDeploy or some other deployment service to deploy updates. A deployment service could also be used to deploy updates to existing instances while allowing new instances to download the latest code automatically at startup.

How to replicate code changes across multiple AWS instances?

We have a load balanced setup in AWS with two instances. We do pretty frequent code updates, utilizing SVN. I need to know how easy it is to update the code changes across all the instances in our cluster. Can we simply do 'snapshots' and create new volumes each time for the instances?...or?...
I would not do updates via EBS snapshots. Think of EBS volumes as a hard disk - you would not change your harddisk if you have an update for your software.
As you have your code in a version control system, code updates should be quite simple like logging in to your (multiple) servers and doing a git pull or svn update. This should fetch the latest code files from your servers. Depending on the type of application you would have to do some other tasks afterwards, running build scripts, emptying cache etc.
The problem is that this kind of setup does not scale well. If you have n servers, you will have to login and do this command n times. Therefore it makes sense to look into some remote management tools that you can use in one step. With a lot of these tools, you also get a complete configuration management stack: you define a set of recipes or tasks (like installed packages, configuration files, fetch the latest code, necessary build steps) for each of your servers, and when you boot up a new server it fetches the lastest version of its configuration and installs itself.
Popular configuration management tools include Puppet or Salt. Both tools have remote execution included which should make your task to publish your code base easier, you would only have to fire one command on your master server and it automatically executes this task on all its minions / slave servers.