How to set heap size for EMR Master - elastic-map-reduce

I have a job which I trigger on EMR. The master triggers the mapper; once it is done, it loads a heavyweight operation into memory and then eventually dumps out. Right now, the job which runs on the cluster fails after a few minutes because it runs out of heap space. By default it gets about 1000m of heap on the master.
I tried the exact action below, but that did not work. The program is still set to 1000m:
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args -s,mapred.child.java.opts=Xmx4000m

There is a specific way provided by EMR to set the heap size of the namenode: use the following bootstrap action while launching the cluster:
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=4096
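For context, a minimal sketch of how this bootstrap action might look in a full launch command, assuming the old elastic-mapreduce CLI that these snippets are written for (the cluster name and the --alive flag are just illustrative):
elastic-mapreduce --create --alive --name "my-cluster" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
--args --namenode-heap-size=4096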
You may also try using a config file instead. Create an XML config file and upload it to S3:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx4096m</value>
</property>
</configuration>
Now launch the cluster with the following bootstrap action:
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--mapred-config-file,s3:///custom-heap-size.xml"
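To confirm the override actually took effect, one quick check (a sketch, assuming the classic /home/hadoop/conf layout referenced elsewhere on this page) is to SSH into the master node and inspect the generated config:
grep -A 1 mapred.child.java.opts /home/hadoop/conf/mapred-site.xml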

Related

amazon EMR spark-submit doesn't allow docker Image pattern sha256 digest

I am using Amazon EMR.
The error log is the following:
Image name '<account_id>.dkr.ecr.ap-northeast-2.amazonaws.com/pyspark-etl#sha256:3d3a07135.......' doesn't match docker image name pattern
Release label: emr-6.2.0
Hadoop distribution: Amazon 3.2.1
Applications: Spark 3.0.1, Hive 3.1.2, JupyterHub 1.1.0, Ganglia 3.7.2, Zeppelin 0.9.0, Livy 0.7.0, Hue 4.8.0, PrestoSQL 343
The reason I had to use the sha256 digest is that I previously used the :latest tag for the pyspark image, hardcoded in an Airflow job that is itself containerized as an ECR image.
So, when my Airflow container runs an EMROperator (an SSHOperator, precisely) as a CLI spark-submit, it pulls the :latest Spark container, which for some reason doesn't get updated.
It is strange because when I ssh into a core instance, I am able to pull the sha256 name pattern from ECR, and the :latest tag also updates if something has changed (i.e. the digest changed).
I think this is something in the Spark configuration or in the Spark source from AWS that prohibits the digest name pattern, but I can't debug this because I do not have the (Amazon) Spark source myself. I would appreciate your answer.
Many thanks,
I am editing my own question because I got an answer from someone.
The problem was the YARN configuration installed on my EMR master node. YARN's default setting for updating Docker images is false.
https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java#L344
So, in order to fix the default setting, go to /etc/hadoop/conf/, find yarn-site.xml, and fix (or add) this property, setting its value to true so that YARN re-pulls the latest image:
<property>
<name>yarn.nodemanager.runtime.linux.docker.image-update</name>
<value>true</value>
<description>
Optional. Default option to decide whether to pull the latest image
or not.
</description>
</property>
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/DockerContainers.html
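After changing yarn-site.xml, the NodeManager on the node(s) running the containers has to be restarted for the new value to be picked up. A minimal sketch, assuming the systemd unit name used on recent EMR releases (verify the unit name on your release):
sudo systemctl restart hadoop-yarn-nodemanager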

Why would Oozie fail a job with Error Code LimitExceededException when yarn reports that oozie launcher & mapreduce job have completed successfully?

There are a few questions similar to this on SO; however, nothing has worked for me, so I am posting this question.
I am using CDH 6.2.1.
I have a workflow that has a map-reduce action. The map-reduce job creates a lot of counters (I think it produces ~300 counters).
I have set the mapreduce.job.counters.max property to 8192 in the CDH YARN configuration.
I have also set it in the following:
YARN Service Advanced Configuration Snippet (Safety Valve) for yarn-site.xml
YARN Service MapReduce Advanced Configuration Snippet (Safety Valve)
MapReduce Client Advanced Configuration Snippet (Safety Valve) for mapred-site.xml
If I run the map-reduce job as a stand-alone yarn job (using yarn jar command on the command-line), the job completes successfully.
When I run the job as part of the workflow:
On the YARN All Applications page I see that the Oozie launcher job completes successfully.
On the YARN All Applications page I see that the map/reduce job completes successfully.
However, Oozie fails the job, reporting: LimitExceededException: Too many counters: 121 max=120
The configuration for the MapReduce job and the Oozie launcher, as reported by YARN, has the setting:
<property>
<name>mapreduce.job.counters.max</name>
<value>8192</value>
<final>true</final>
<source>yarn-site.xml</source>
</property>
The Oozie web interface (System-Info/OS-Env) reports the following HADOOP_CONF_DIR: /var/run/cloudera-scm-agent/process/459-oozie-OOZIE_SERVER/yarn-conf/
In that folder I can see that the mapred-site.xml also has:
<!--'mapreduce.job.counters.max', originally set to '8192' (final), is overridden below by a safety valve-->
<property>
<name>mapreduce.job.counters.max</name>
<value>8192</value>
<final>true</final>
</property>
However I cannot find that property in the yarn-site.xml.
I am not sure what else I can do at this point...
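As a sanity check (just a sketch; the first path is the Oozie configuration directory reported above, the second is the typical gateway client config location), I can grep for the property in both places and compare the values:
grep -r -A 1 mapreduce.job.counters.max /var/run/cloudera-scm-agent/process/459-oozie-OOZIE_SERVER/yarn-conf/
grep -r -A 1 mapreduce.job.counters.max /etc/hadoop/conf/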
This turned out to be an issue in Oozie itself (the counter limit is hit when Oozie reads the job's counters, not by the MapReduce job), and it has since been resolved upstream. However, the fix is not available in the current version of Cloudera.
I am posting this here, in case anyone else has the same issue.

How to move a file from local to HDFS using Oozie?

I am trying to move data from a local file system to the Hadoop Distributed File System, but I am not able to do it through Oozie.
Can we move or copy data from a local filesystem to HDFS using Oozie?
I found a workaround for this problem. The ssh action will always execute from the Oozie server, so if your files are located on the local file system of the Oozie server, you will be able to copy them to HDFS.
The ssh action will always be executed by the 'oozie' user. So your ssh action should look like this: myUser@oozie-server-ip, where myUser is a user with read rights on the files on the Oozie server.
Next, you need to set up passwordless ssh between the oozie user and myUser, on the Oozie server. Generate a public key for the 'oozie' user and copy the generated key in the authorized_keys file of 'myUser'. This is the command for generating the rsa key:
ssh-keygen -t rsa
When generating the key, you need to be logged in with the oozie user. Usually on a Hadoop cluster this user will have its home in /var/lib/oozie and the public key will be generated in id_rsa.pub in /var/lib/oozie/.ssh
Next copy this key in the authorized_keys file of 'myUser'. You will find it in the user's home, in the .ssh folder.
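A sketch of that copy step, assuming the default paths mentioned above and that 'myUser' can still log in with a password at this point:
# run as the 'oozie' user on the Oozie server
cat /var/lib/oozie/.ssh/id_rsa.pub | ssh myUser@oozie-server-ip 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'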
Now that you have set up passwordless ssh, it's time to set up the Oozie ssh action. This action will execute the 'hadoop fs' command with the arguments '-copyFromLocal', '${local_file_path}' and '${hdfs_file_path}'.
No, Oozie isn't aware of the local filesystem, because it runs on the Map-Reduce cluster nodes. You should use Apache Flume to move data from a local filesystem to HDFS.
Oozie does not support copying from local to HDFS or vice versa, but you can call a Java program to do the same. A shell action will also work, but if you have more than one node in the cluster, then all the nodes must have the said local mount point available, or mounted with read/write access.
You can do this using an Oozie shell action by putting the copy command in a shell script.
https://oozie.apache.org/docs/3.3.0/DG_ShellActionExtension.html#Shell_Action
Example:
<workflow-app name="reputation" xmlns="uri:oozie:workflow:0.4">
<start to="shell"/>
<action name="shell">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>run.sh</exec>
<file>run.sh#run.sh</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
In your run.sh you can use the hadoop fs -copyFromLocal command.
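A minimal run.sh along those lines (the paths are placeholders, and the local file must exist on whichever node the shell action happens to run on):
#!/bin/bash
# copy a file from the local filesystem of this node into HDFS
hadoop fs -copyFromLocal /tmp/input/data.csv /user/myUser/data/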

Elastic Beanstalk, Docker and Continuous integration

I have a Beanstalk environment which uses Docker.
Each time I push something, Jenkins builds and uploads my new snapshot to S3 (I use S3 to store my versions). Each version is a zip which contains my app and my Dockerfile.
Then I update my Beanstalk environment with the version I just uploaded (Beanstalk creates a new version from the one uploaded to S3; if the version already exists it replaces it, which is useful for snapshots).
Everything works fine the first time I deploy the version.
But when I do it a second time, it continues to work, but it seems that my last version is not used: Docker does not rebuild my freshly updated app.
Why is this? Did I miss something? This is my Dockerfile
Basically, it seems the update-environment call refuses to update to the same version number - that's why we always rely on ${maven.build.timestamp} and friends. Here's your retouched pom :]
Notice I'm using properties - that's the suggested style for the latest version (oops, someone forgot to update the docs).
I've decided to try it with the latest 1.4.0-SNAPSHOT. Here's what you should add to your profile:
<profiles>
<profile>
<id>awseb</id>
<properties>
<maven.deploy.skip>true</maven.deploy.skip>
<beanstalker.region>eu-west-1</beanstalker.region>
<beanstalk.applicationName>wisdom-demo</beanstalk.applicationName>
<beanstalk.cnamePrefix>wisdom-demo</beanstalk.cnamePrefix>
<beanstalk.environmentName>${beanstalk.cnamePrefix}</beanstalk.environmentName>
<beanstalk.artifactFile>${project.basedir}/target/${project.build.finalName}.zip</beanstalk.artifactFile>
<beanstalk.environmentRef>${beanstalk.cnamePrefix}.elasticbeanstalk.com</beanstalk.environmentRef>
<maven.build.timestamp.format>yyyyMMddHHmmss</maven.build.timestamp.format>
<beanstalk.s3Key>apps/${project.artifactId}/${project.version}/${project.artifactId}-${project.version}-${maven.build.timestamp}.zip</beanstalk.s3Key>
<beanstalk.useLatestVersion>true</beanstalk.useLatestVersion>
<beanstalk.versionLabel>${project.artifactId}-${project.version}-${maven.build.timestamp}</beanstalk.versionLabel>
<beanstalk.applicationHealthCheckURL>/ping</beanstalk.applicationHealthCheckURL>
<beanstalk.instanceType>m1.small</beanstalk.instanceType>
<beanstalk.keyName>aldrin@leal.eng.br</beanstalk.keyName>
<beanstalk.iamInstanceProfile>aws-elasticbeanstalk-ec2-role</beanstalk.iamInstanceProfile>
<beanstalk.solutionStack>64bit Amazon Linux 2014.* running Docker 1.*</beanstalk.solutionStack>
<beanstalk.environmentType>SingleInstance</beanstalk.environmentType>
</properties>
<build>
<plugins>
<plugin>
<groupId>br.com.ingenieux</groupId>
<artifactId>beanstalk-maven-plugin</artifactId>
<version>1.4.0-SNAPSHOT</version>
<executions>
<execution>
<id>default-deploy</id>
<phase>deploy</phase>
<goals>
<goal>upload-source-bundle</goal>
<goal>create-application-version</goal>
<goal>put-environment</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
</profiles>
From the example above, just tweak your cnamePrefix and the last three properties. Here's a rundown:
So if you want to deploy, simply:
$ mvn -Pawseb deploy
Or, if you want to boot it from scratch with the latest version (thus using useLatestVersion) once deployed, simply do:
$ mvn -Pawseb -Dbeanstalk.versionLabel= beanstalk:create-environment
By setting versionLabel to blank, you effectively activate the useLatestVersion behaviour: when no version is given, use the latest one.
Oh, a deployment failed?
Easy peasy:
$ mvn -Pawseb beanstalk:rollback-version
Thank you for your explanation and the link to the blog post.
I followed these step-by-step instructions and successfully deployed my first Wisdom application in a Docker container on AWS Elastic Beanstalk.
I then updated the Java source code, compiled it with mvn package, tested locally and deployed the new ZIP file again using the AWS Console.
My AWS Elastic Beanstalk environment was correctly updated.
So, it looks like the deployment problem you are observing lies in the Maven AWS Elastic Beanstalk plugin that deploys the code.
Manual deploys work correctly. Since this Maven plugin is a third-party, open-source project, I am not the right person to investigate this. I would suggest contacting the project maintainer and/or opening an issue in their issue tracker.
As a workaround, you can deploy manually (or script this procedure from your CI/CD environment) :
Copy your artefact to your AWS Elastic Beanstalk bucket
aws s3 --region <REGION_NAME> cp ./target/YOUR_ARTIFACTID-1.0-SNAPSHOT.zip s3://<YOUR_BUCKET_NAME>/20141128-210900-YOUR_ARTIFACTID-1.0-SNAPSHOT.zip
Create an application version with your zip file
aws elasticbeanstalk create-application-version --region <REGION_NAME> --application-name <YOUR_APPLICATION_NAME> --version-label 20141128-212100 --source-bundle S3Bucket=<YOUR_BUCKET_NAME>,S3Key=20141128-210900-YOUR_ARTIFACTID-1.0-SNAPSHOT.zip
Deploy that version
aws elasticbeanstalk update-environment --region <YOUR_REGION_NAME> --environment-name <YOUR_ENVIRONMENT_NAME> --version-label 20141128-212100
These three steps can be automated from Maven or Jenkins; I will leave that to you as an exercise :-)
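For reference, a sketch of what such a script could look like, simply chaining the three commands above (region, bucket, application and environment names are placeholders to fill in):
#!/bin/bash
set -e
VERSION_LABEL=$(date +%Y%m%d-%H%M%S)
ZIP=target/YOUR_ARTIFACTID-1.0-SNAPSHOT.zip
KEY="${VERSION_LABEL}-$(basename "$ZIP")"
# 1. copy the artifact to the Elastic Beanstalk bucket
aws s3 --region <REGION_NAME> cp "$ZIP" "s3://<YOUR_BUCKET_NAME>/$KEY"
# 2. create an application version from that zip
aws elasticbeanstalk create-application-version --region <REGION_NAME> --application-name <YOUR_APPLICATION_NAME> --version-label "$VERSION_LABEL" --source-bundle S3Bucket=<YOUR_BUCKET_NAME>,S3Key="$KEY"
# 3. deploy that version
aws elasticbeanstalk update-environment --region <REGION_NAME> --environment-name <YOUR_ENVIRONMENT_NAME> --version-label "$VERSION_LABEL"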

Launching a map reduce job in Amazon Elastic MapReduce

I am trying to launch a map-reduce job on an Amazon Elastic MapReduce cluster. My map-reduce job does some pre-processing before generating map/reduce tasks. This pre-processing requires third-party libs such as javacv and opencv. Following Amazon's documentation, I have included those libraries in HADOOP_CLASSPATH, such that I have a line HADOOP_CLASSPATH= in hadoop-user-env.sh in the location /home/hadoop/conf/ of the master node. According to the documentation, the entry in this script should be included in hadoop-env.sh. Hence, I assumed that HADOOP_CLASSPATH now has my libs on the classpath. I did this in bootstrap actions. However, when I launch the job, it still complains with a class-not-found exception pointing to a class in the jar which is supposed to be on the classpath.
Can someone tell me where I am going wrong? Btw, I am using Hadoop 2.2.0. In my local infrastructure, I have a small bash script that exports HADOOP_CLASSPATH with all the libs included in it and calls hadoop jar -libjars .
I solved this with an AWS EMR bootstrap task to add a jar to the hadoop classpath:
Uploaded my jar to S3
Created a bootstrap script to copy the jar from S3 to the EMR instance and add the jar to the classpath:
#!/bin/bash
hadoop fs -copyToLocal s3://my-bucket/libthrift-0.9.2.jar /home/hadoop/lib/
echo 'export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/home/hadoop/lib/libthrift-0.9.2.jar"' >> /home/hadoop/conf/hadoop-user-env.sh
Saved that script as "add-jar-to-hadoop-classpath.sh" and uploaded it to S3.
My "aws emr create-cluster" command adds the bootstrap script with this argument: --bootstrap-actions Path=s3://my-bucket/add-jar-to-hadoop-classpath.sh
When EMR spins up, the instance will have the file /home/hadoop/conf/hadoop-user-env.sh created, and my MR job was able to instantiate the Thrift classes in the jar.
UPDATE: I was able to instantiate Thrift classes from the MASTER node, but not from the CORE nodes. I sshed into a CORE node and the lib was properly copied to /home/hadoop/lib and my HADOOP_CLASSPATH setting was there, but I was still getting class-not-found at runtime when the mapper tried to use Thrift.
The solution ended up being to use the maven-shade-plugin and embed the thrift jar:
<plugin>
<!-- Use the maven shade plugin to embed the thrift classes in our jar.
Couldn't get the HADOOP_CLASSPATH on AWS EMR to load these classes even
with the jar copied to /home/hadoop/lib and the proper env var in
/home/hadoop/conf/hadoop-user-env.sh -->
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<artifactSet>
<includes>
<include>org.apache.thrift:libthrift</include>
</includes>
</artifactSet>
</configuration>
</execution>
</executions>
</plugin>
When your job is executed, the "controller" logfile contains the command line that was actually executed. It could look something like this:
2014-06-02T15:37:47.863Z INFO Fetching jar file.
2014-06-02T15:37:54.943Z INFO Working dir /mnt/var/lib/hadoop/steps/13
2014-06-02T15:37:54.944Z INFO Executing /usr/java/latest/bin/java -cp /home/hadoop/conf:/usr/java/latest/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-tools.jar:/home/hadoop/hadoop-tools-1.0.3.jar:/home/hadoop/hadoop-core-1.0.3.jar:/home/hadoop/hadoop-core.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/13 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/13/tmp -Djava.library.path=/home/hadoop/native/Linux-amd64-64 org.apache.hadoop.util.RunJar <YOUR_JAR> <YOUR_ARGS>
The log is located on the master node in /mnt/var/lib/hadoop/steps/ - it's easily accessible when you SSH into the master node (this requires specifying a key pair when creating the cluster).
I've never really worked with what's in HADOOP_CLASSPATH, but if you define a bootstrap action to just copy your libraries into /home/hadoop/lib, that should solve the issue.
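A sketch of such a bootstrap action, assuming the jars have been uploaded to S3 first (the bucket path and jar names are placeholders; javacv and opencv are the libraries mentioned in the question, and /home/hadoop/lib/* is already on the classpath in the command line above):
#!/bin/bash
# copy third-party jars from S3 into a directory that is already on Hadoop's classpath
hadoop fs -copyToLocal s3://<your-bucket>/libs/javacv.jar /home/hadoop/lib/
hadoop fs -copyToLocal s3://<your-bucket>/libs/opencv.jar /home/hadoop/lib/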