Trouble configuring Presto's memory allocation on AWS EMR - amazon-web-services

I am really hoping to use Presto in an ETL pipeline on AWS EMR, but I am having trouble configuring it to fully utilize the cluster's resources. This cluster would exist solely for this one query and nothing more, then die. Thus, I would like to claim the maximum available memory for each node and for the one query by increasing query.max-memory-per-node and query.max-memory. I can do this when I'm configuring the cluster by adding these settings in the "Edit software settings" box of the cluster creation view in the AWS console. But the Presto server doesn't start, reporting in the server.log file an IllegalArgumentException saying that max-memory-per-node exceeds the usable heap space (which, by default, is far too small for my instance type and use case).
I have tried to use the session setting set session resource_overcommit=true, but that only seems to override query.max-memory, not query.max-memory-per-node, because in the Presto UI, I see that very little of the available memory on each node is being used for the query.
Through Google, I've been led to believe that I need to also increase the JVM heap size by changing the -Xmx and -Xms properties in /etc/presto/conf/jvm.config, but it says here (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html) that it is not possible to alter the JVM settings in the cluster creation phase.
To change these properties after the EMR cluster is active and the Presto server has been started, do I really have to manually ssh into each node and alter jvm.config and config.properties, and restart the Presto server? While I realize it'd be possible to manually install Presto with a custom configuration on an EMR cluster through a bootstrap script or something, this would really be a deal-breaker.
Is there something I'm missing here? Is there not an easier way to make Presto allocate all of a cluster to one query?

As advertised, increasing query.max-memory-per-node (and, by necessity, the -Xmx property) cannot be achieved on EMR until after Presto has already started with the default options. To increase these, the jvm.config and config.properties found in /etc/presto/conf/ have to be changed and the Presto server restarted on each node (core and coordinator).
One can do this with a bootstrap script using commands like
sudo sed -i "s/query.max-memory-per-node=.*GB/query.max-memory-per-node=20GB/g" /etc/presto/conf/config.properties
sudo restart presto-server
and similarly for /etc/presto/conf/jvm.config. The only caveats are that the logic in the bootstrap action needs to execute only after Presto has been installed, and that the server on the coordinating node needs to be restarted last (and possibly with different settings if the master node's instance type is different from that of the core nodes).
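A minimal sketch of such a bootstrap action (the 24G heap and 20GB per-node limit are hypothetical values you would size to your instance type); the work runs in a backgrounded loop because bootstrap actions execute before Presto is installed:
#!/bin/bash
# Hypothetical bootstrap action: the 24G heap and the 20GB per-node limit are examples only.
# The work is backgrounded because bootstrap actions run before Presto is installed.
(
  # Wait for the Presto installation step to write its configuration files.
  while [ ! -f /etc/presto/conf/config.properties ]; do
    sleep 10
  done
  # Raise the JVM heap and the per-node query memory limit, then restart the server.
  sudo sed -i "s/^-Xmx.*/-Xmx24G/" /etc/presto/conf/jvm.config
  sudo sed -i "s/query.max-memory-per-node=.*/query.max-memory-per-node=20GB/" /etc/presto/conf/config.properties
  sudo restart presto-server
) &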
You might also need to change resources.reserved-system-memory from its default by specifying a value for it in config.properties. By default, this value is 0.4 times the -Xmx value, which is how much memory Presto claims for the system pool. In my case, I was able to safely decrease this value and give more memory to each node for executing the query.
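If you do override it, an explicit value can be appended in the same bootstrap action (the 10GB here is purely illustrative):
# Illustrative value; size it so the system pool plus query.max-memory-per-node fits within -Xmx.
echo "resources.reserved-system-memory=10GB" | sudo tee -a /etc/presto/conf/config.properties
sudo restart presto-server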

As a matter of fact, there are configuration classifications available for Presto in EMR. However, please note that these may vary depending on the EMR release version; for a complete list of the available configuration classifications per release version, see the "Amazon EMR 5.x Release Versions" pages of the EMR Release Guide (make sure to switch between the tabs according to your desired release version). Specifically regarding jvm.config properties, the "Considerations with Presto on Amazon EMR" section (under "Some Presto Deployment Properties not Configurable") notes that these are not currently configurable via configuration classifications. That being said, you can always edit the jvm.config file manually per your needs.
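For the properties that are configurable (such as those in config.properties), a hedged cluster-creation sketch using the presto-config classification might look like this; the sizes, instance types, and release label are placeholders, and you should confirm the classification name for your EMR release:
# Placeholder values; the same JSON can be pasted into the "Edit software settings" box in the console.
aws emr create-cluster \
  --release-label emr-5.5.0 \
  --applications Name=Presto \
  --instance-type r3.xlarge --instance-count 4 \
  --use-default-roles \
  --configurations '[{"Classification":"presto-config","Properties":{"query.max-memory":"60GB","query.max-memory-per-node":"15GB"}}]'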

Related

Is it possible to have under-replicated blocks fixed automatically when needed from Ambari?

We have an Ambari cluster, HDP version 2.6.5.
From time to time we see under-replicated blocks in the Ambari dashboard (meaning the under-replication needs to be fixed).
When I saw this, I ran the HDFS command that fixes it.
But I want to know if we can set some parameters that will do it automatically, by changing some values in the Ambari configuration.
#Anirban166 Yes, you can certainly do this, but I would warn against automating this task and silently forgetting about it. It was left this way on purpose so that administrators can evaluate why there is data loss and take corrective action, if required, before running the HDFS command and/or rebalancing.
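For reference, the manual fix typically amounts to something like the following (the replication factor of 3 is an example; match your cluster's dfs.replication):
# List the files whose blocks are currently under-replicated.
hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' > /tmp/under_replicated_files
# Re-set the replication factor on each affected file and wait for it to complete.
while read -r f; do
  hdfs dfs -setrep -w 3 "$f"
done < /tmp/under_replicated_files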

How to apply rolling updates in VM instances instead of using Managed Instance group in GCP?

Problem: I want to apply patch updates to a VM instance which is not part of a Managed Instance Group. The patch update could be:
1. A change in the version of the current OS of a VM instance, that is, a change from Ubuntu-16-v1 to Ubuntu-16-v2.
2. An upgrade of the boot OS, that is, a change from Ubuntu-16 to Ubuntu-18.
3. Installation of a new package on the existing machine.
Exploration:
For Problems 1 and 2 stated above:
I have explored and tried the rolling update feature of Managed Instance Groups in Google Cloud Platform, and it seems to be a good approach for the problem stated, but what is the best approach, with best practices, if someone is not using a Managed Instance Group? You may find the details here.
For Problem 3 stated above:
I have tried the OS Patch Management service of GCP, but is there any other method that I could use?
Create an "image" from the boot disks of your existing Compute Engine instances.
For updating with newer configurations and software, group images in "image family" which always points to the latest image.
See https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images#setting_families
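A hedged sketch of that flow with gcloud (disk, image, family, and zone names are placeholders):
# Create an image from an existing instance's boot disk (the boot disk name defaults to the
# instance name) and attach it to an image family.
gcloud compute images create ubuntu-16-v2 \
  --source-disk=my-instance \
  --source-disk-zone=us-central1-a \
  --family=my-ubuntu-16
# New VMs created from the family automatically pick up the latest image in it.
gcloud compute instances create my-new-instance \
  --image-family=my-ubuntu-16 \
  --zone=us-central1-a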
For your use case, I think you should use an IaC script like Terraform to recreate similar VMs with the same name, disk, internal address, etc., and either call the script from the repo directly on a scheduled date automatically or provide self-patching instructions.
Here is the likely process:
Send an email notification to all the VM owners that auto-patching is scheduled on XYZ. The email content should include the list of instances to be patched/updated, the list of actions, and the patch team's contact details. It should also include a link for skipping this auto-update and a "Self Patching instructions" document to follow instead.
The self-patching document should have a command to call an auto-patch wrapper script, such as:
curl -u "encrypted-auth:x-oauth-basic" -k -H 'Accept: application/vnd.github.VERSION.raw' 'https://github.com/api/v3/repos/xyz/images/contents/gcp/patch_OS_update.sh?ref=master' | bash -s -- -q
The above script can also have other options, such as querying the patch sets available for a particular VM or scanning the VM for pending updates.

YARN UI shows no active nodes while one appears in the HDFS UI

I've set up Hadoop on my laptop,
and when I submit a job to Hadoop (through MapReduce and Tez),
the status is always ACCEPTED, the progress is stuck at 0%, and the description says something like "waiting for AM container to be allocated".
When I check the nodes through the YARN UI (localhost:8088),
it shows that there are 0 active nodes.
But the HDFS UI (localhost:50070) shows that there is one live node.
Is that the main reason the job is stuck, since there are no available nodes? If that's the case, what should I do?
Your YARN UI shows you have zero vcores and zero memory, so there is no way for any job to ever run since you lack computing resources. The datanode is only for storage (HDFS in this case) and does not matter as far as why your application is stuck.
To fix your problem, you need to update your yarn-site.xml and provide settings for the memory and vcore properties described in the following:
http://blog.cloudera.com/blog/2015/10/untangling-apache-hadoop-yarn-part-2/
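As a rough illustration only (the numbers assume a hypothetical laptop with 8 GB of RAM and 4 cores; the post above explains how to size them properly), the relevant properties and a restart might look like:
# Hypothetical sizing for an 8 GB / 4-core laptop -- set these inside the <configuration>
# element of yarn-site.xml:
#   yarn.nodemanager.resource.memory-mb   = 6144
#   yarn.nodemanager.resource.cpu-vcores  = 4
#   yarn.scheduler.minimum-allocation-mb  = 512
#   yarn.scheduler.maximum-allocation-mb  = 6144
# Then restart YARN so the NodeManager registers with its new capacity:
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh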
You might consider using a Cloudera QuickStart VM or Hortonworks Sandbox (at least as a reference for configuration values for the yarn-site.xml).
https://www.cloudera.com/downloads/quickstart_vms/5-10.html
https://hortonworks.com/products/sandbox/

Configuring Spark on EMR

When you pick a more performant node, say an r3.xlarge vs. an m3.xlarge, will Spark automatically utilize the additional resources? Or is this something you need to manually configure and tune?
As far as configurations go, which are the most important configuration values to tune to get the most out of your cluster?
It will try..
AWS has a setting you can enable in your EMR cluster configuration that will attempt to do this. It is called spark.dynamicAllocation.enabled. In the past there were issues with this setting where it would give too many resources to Spark. In newer releases they have lowered the amount they are giving to Spark. However, if you are using PySpark, it will not take Python's resource requirements into account.
I typically disable dynamicAllocation and set the appropriate memory and cores settings dynamically from my own code based upon what instance type is selected.
This page discusses what defaults they will select for you:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html
If you do it manually, at a minimum you will want to set:
spark.executor.memory
spark.executor.cores
Also, you may need to adjust the yarn container size limits with:
yarn.scheduler.maximum-allocation-mb
yarn.scheduler.minimum-allocation-mb
yarn.nodemanager.resource.memory-mb
Make sure you leave a core and some RAM for the OS, and RAM for Python if you are using PySpark.
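For example, a hand-tuned submission for a hypothetical cluster of four r3.xlarge core nodes (4 vCPUs and 30.5 GB each); every number below is illustrative and the script name is a placeholder:
# Illustrative sizing only: one ~20 GB, 3-core executor per r3.xlarge node, dynamic allocation off.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 3 \
  --executor-memory 20g \
  --conf spark.dynamicAllocation.enabled=false \
  my_job.py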

How to replicate code changes across multiple AWS instances?

We have a load balanced setup in AWS with two instances. We do pretty frequent code updates, utilizing SVN. I need to know how easy it is to update the code changes across all the instances in our cluster. Can we simply do 'snapshots' and create new volumes each time for the instances?...or?...
I would not do updates via EBS snapshots. Think of EBS volumes as a hard disk: you would not swap out your hard disk just because you have an update for your software.
As you have your code in a version control system, code updates should be quite simple, like logging in to your (multiple) servers and doing a git pull or svn update. This fetches the latest code files from your repository onto each server. Depending on the type of application, you would then have to do some other tasks afterwards: running build scripts, emptying caches, etc.
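In its simplest form that is just a loop over your servers (the hostnames and path are placeholders):
# Naive approach: log in to each server in turn and pull the latest code.
for host in app1.example.com app2.example.com; do
  ssh "$host" 'cd /var/www/myapp && svn update'
done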
The problem is that this kind of setup does not scale well. If you have n servers, you will have to log in and run these commands n times. Therefore it makes sense to look into remote management tools that let you do it in one step. With a lot of these tools, you also get a complete configuration management stack: you define a set of recipes or tasks (installed packages, configuration files, fetching the latest code, necessary build steps) for each of your servers, and when you boot up a new server it fetches the latest version of its configuration and installs itself.
Popular configuration management tools include Puppet or Salt. Both tools have remote execution included which should make your task to publish your code base easier, you would only have to fire one command on your master server and it automatically executes this task on all its minions / slave servers.
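With Salt, for example, publishing the code base can collapse to a single remote-execution command run from the master (the target pattern and path are placeholders):
# From the Salt master: run the update on every minion whose id matches web*, in parallel.
salt 'web*' cmd.run 'cd /var/www/myapp && svn update'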