Launching a MapReduce job in Amazon Elastic MapReduce

I am trying to launch a MapReduce job on an Amazon Elastic MapReduce cluster. My MapReduce job does some pre-processing before generating the map/reduce tasks, and this pre-processing requires third-party libraries such as javacv and opencv. Following Amazon's documentation, I have included those libraries in HADOOP_CLASSPATH, so that I have a HADOOP_CLASSPATH= line in hadoop-user-env.sh in /home/hadoop/conf/ on the master node. According to the documentation, the entries in this script should be picked up by hadoop-env.sh, so I assumed that HADOOP_CLASSPATH now has my libs on the classpath. I did this in a bootstrap action. However, when I launch the job, it still throws a ClassNotFoundException pointing to a class in the jar that is supposed to be on the classpath.
Can someone tell me where I am going wrong? Btw, I am using Hadoop 2.2.0. In my local infrastructure, I have a small bash script that exports HADOOP_CLASSPATH with all the libs included in it and calls hadoop jar -libjars .

I solved this with an AWS EMR bootstrap task to add a jar to the hadoop classpath:
Uploaded my jar to S3
Created a bootstrap script to copy the jar from S3 to the EMR instance and add the jar to the classpath:
#!/bin/bash
hadoop fs -copyToLocal s3://my-bucket/libthrift-0.9.2.jar /home/hadoop/lib/
echo 'export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/home/hadoop/lib/libthrift-0.9.2.jar"' >> /home/hadoop/conf/hadoop-user-env.sh
Saved that script as "add-jar-to-hadoop-classpath.sh" and uploaded it to S3.
My "aws emr create-cluster" command adds the bootstrap script with this argument: --bootstrap-actions Path=s3://my-bucket/add-jar-to-hadoop-classpath.sh
When EMR spins up, the instance will have the file /home/hadoop/conf/hadoop-user-env.sh created, and my MR job was able to instantiate the thrift classes in the jar.
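For reference, a minimal sketch of a create-cluster call that wires in this bootstrap action; the cluster name, AMI version and instance settings below are illustrative placeholders, not values from my setup:
# Placeholders: cluster name, AMI version, instance type/count
aws emr create-cluster \
  --name "my-mr-cluster" \
  --ami-version 3.1.0 \
  --instance-type m1.large \
  --instance-count 3 \
  --bootstrap-actions Path=s3://my-bucket/add-jar-to-hadoop-classpath.sh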
UPDATE: I was able to instantiate thrift classes from the MASTER node, but not from the CORE node. I SSHed into the CORE node and the lib was properly copied to /home/hadoop/lib and my HADOOP_CLASSPATH setting was there, but I was still getting ClassNotFoundException at runtime when the mapper tried to use thrift.
The solution ended up being to use the maven-shade-plugin and embed the thrift jar:
<plugin>
  <!-- Use the maven shade plugin to embed the thrift classes in our jar.
       Couldn't get the HADOOP_CLASSPATH on AWS EMR to load these classes even
       with the jar copied to /home/hadoop/lib and the proper env var in
       /home/hadoop/conf/hadoop-user-env.sh -->
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <artifactSet>
          <includes>
            <include>org.apache.thrift:libthrift</include>
          </includes>
        </artifactSet>
      </configuration>
    </execution>
  </executions>
</plugin>
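As a quick sanity check (not part of the original answer), listing the shaded jar should show the embedded thrift classes; the artifact name here is a placeholder:
# Placeholder jar name; the org/apache/thrift entries confirm the shade worked
jar tf target/my-mr-job.jar | grep org/apache/thrift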

When your job is executed, the "controller" log file contains the actual command line that was executed. It could look something like this:
2014-06-02T15:37:47.863Z INFO Fetching jar file.
2014-06-02T15:37:54.943Z INFO Working dir /mnt/var/lib/hadoop/steps/13
2014-06-02T15:37:54.944Z INFO Executing /usr/java/latest/bin/java -cp /home/hadoop/conf:/usr/java/latest/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-tools.jar:/home/hadoop/hadoop-tools-1.0.3.jar:/home/hadoop/hadoop-core-1.0.3.jar:/home/hadoop/hadoop-core.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/13 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/13/tmp -Djava.library.path=/home/hadoop/native/Linux-amd64-64 org.apache.hadoop.util.RunJar <YOUR_JAR> <YOUR_ARGS>
The log is located on the master node in /mnt/var/lib/hadoop/steps/ - it's easily accessible when you SSH into the master node (which requires specifying a key pair when creating the cluster).
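For example, something along these lines pulls it up (the key pair file, master public DNS and step number are placeholders, and the exact file layout can vary by EMR version):
# Placeholders: key pair file, master public DNS, step id; run the second command on the master
ssh -i my-key.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
less /mnt/var/lib/hadoop/steps/13/controller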
I've never really worked with what's in HADOOP_CLASSPATH, but if you define a bootstrap action to just copy your libraries into /home/hadoop/lib, that should solve the issue.

Related

How to use AWS_MSK_IAM sasl mechanism with a kafka producer jar?

I have a fat jar called producer which produces messages. I want to use it to produce messages to an MSK serverless cluster. The jar takes the following arguments:
-topic --num-records --record-size --throughput --producer.config /configLocation/
As my MSK serverless cluster uses IAM based authentication, I have provided the following settings in my producer.config:
bootstrap.servers=boot-url
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
The way this jar usually works is by providing a username and password with the sasl.jaas.config property.
However, with MSK serverless we have to use the IAM role attached to our EC2 instance.
When executing the current jar using
java -jar producer.jar -topic --num-records --record-size --throughput --producer.config /configLocation/
I get the exception
Exception in thread "main" org.apache.kafka.common.config.ConfigException: Invalid value software.amazon.msk.auth.iam.IAMClientCallbackHandler for configuration sasl.client.callback.handler.class: Class software.amazon.msk.auth.iam.IAMClientCallbackHandler could not be found.
I don't understand how to make the producer jar find the class present in the external jar aws-msk-iam-auth-1.1.1-all.jar.
Any help would be much appreciated, Thanks.
I found out that it isn't possible to override the classpath specified in MANIFEST.MF of a jar by using command-line options like -cp. What worked in my case was to have the jar's pom include the missing dependency.
When you use the -jar command-line option to run your program as an
executable JAR, the Java CLASSPATH environment variable will be
ignored, and the -cp and -classpath switches will be ignored as well.
In this case, you can set your java classpath in the
META-INF/MANIFEST.MF file by using the Class-Path attribute.
https://javarevisited.blogspot.com/2011/01/how-classpath-work-in-java.html
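An alternative sketch, not from the answer above: skip -jar (which ignores -cp) and put both jars on the classpath explicitly, naming the producer's main class yourself; com.example.ProducerMain below is a placeholder for whatever that main class actually is:
# Hypothetical main class name; both jars go on the classpath explicitly
java -cp producer.jar:aws-msk-iam-auth-1.1.1-all.jar com.example.ProducerMain -topic --num-records --record-size --throughput --producer.config /configLocation/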

Hadoop s3 configuration file missing

I'm using the Hadoop library to upload files to S3. Because a metrics configuration file is missing, I'm getting this exception:
MetricsConfig - Could not locate file hadoop-metrics2-s3a-file-system.properties org.apache.commons.configuration2.ex.ConfigurationException:
Could not locate: org.apache.commons.configuration2.io.FileLocator#77f46cee[fileName=hadoop-metrics2-s3a-file-system.properties,basePath=<null>,sourceURL=,encoding=<null>,fileSystem=<null>,locationStrategy=<null>]
My current configurations are
configuration.set("fs.s3a.access.key", "accessKey")
configuration.set("fs.s3a.secret.key", "secretKey")
Where do I add this configuration file? What should I add to that configuration file?
Don't worry about it; it's just an irritating warning. It's only relevant when you have the s3a or abfs connectors running in a long-lived app where the metrics are being collected and fed to some management tooling.
Set the log level to WARN in the log4j.properties file in your Spark conf dir:
log4j.logger.org.apache.hadoop.metrics2=WARN
I just placed an empty file in the classpath and it stopped complaining. Like:
touch /opt/spark/conf/hadoop-metrics2-s3a-file-system.properties

SpringBoot fails to deploy after adding .ebextensions for nginx SSL - [An error occurred during execution of command [app-deploy]]

I have a SpringBoot app that deploys just fine to AWS Beanstalk, and the default nginx proxy works, allowing me to connect via port 80.
Following the instructions here: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/https-singleinstance.html, and verifying with another of my projects that works with this exact config, Beanstalk fails to deploy the app with error:
2020/05/29 01:27:56.418780 [ERROR] An error occurred during execution of command [app-deploy] - [CheckProcfileForJavaApplication]. Stop running the command. Error: there is no Procfile and no .jar file at root level of your source bundle
The contents of my war file are as such:
app.war
-.ebextensions
-nginx/conf.d/https.conf
-https-instance-single.config
-https-instance.config
-web-inf/
My config files pass as valid yaml files. (These files are identical to those in the AWS doc, and to those that work in another project of mine.)
I am using a single instance, with port 443 set open.
These are the errors reported throughout the various log files:
----------------------------------------
/var/log/eb-engine.log
----------------------------------------
2020/05/29 01:37:53.054366 [ERROR] /usr/bin/id: healthd: no such user
...
2020/05/29 01:37:53.254965 [ERROR] Created symlink from /etc/systemd/system/multi-user.target.wants/healthd.service to /etc/systemd/system/healthd.service.
...
2020/05/29 01:37:53.732794 [ERROR] Created symlink from /etc/systemd/system/multi-user.target.wants/cfn-hup.service to /etc/systemd/system/cfn-hup.service.
----------------------------------------
/var/log/cfn-hup.log
----------------------------------------
ReadTimeout: HTTPSConnectionPool(host='sqs.us-east-1.amazonaws.com', port=443): Read timed out. (read timeout=23)
Taking into account Dean Wookey's answer for Java 11, I have successfully deployed a Spring Boot application jar along with the .ebextensions folder. I just added the maven-antrun-plugin to my Maven build configuration, and the output is a .zip file which contains the .ebextensions folder and the Spring Boot .jar file at the same level. I then just deploy this final zip file through the AWS console.
The following is the maven-antrun-plugin configuration:
....
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-antrun-plugin</artifactId>
  <version>1.8</version>
  <executions>
    <execution>
      <id>prepare</id>
      <phase>package</phase>
      <configuration>
        <tasks>
          <copy todir="${project.build.directory}/${project.build.finalName}/" overwrite="false">
            <fileset dir="./" includes=".ebextensions/**"/>
            <fileset dir="${project.build.directory}" includes="*.jar"/>
          </copy>
          <zip destfile="${project.build.directory}/${project.build.finalName}.zip" basedir="${project.build.directory}/${project.build.finalName}"/>
        </tasks>
      </configuration>
      <goals>
        <goal>run</goal>
      </goals>
    </execution>
  </executions>
</plugin>
....
Issue with Java and Linux version
If you are using Java 8 with Linux platform version 2.10.9, the code will work and override the nginx configuration, but if you choose Corretto 11 with Linux 2.2.3 you get the following error:
Error: there is no Procfile and no .jar file at root level of your source bundle
Creating a new environment with Java 8 and deploying the app again will resolve the issue.
Instead of changing to java 8 as described in vaquar khan's answer, an alternative is to package your source jar inside a zip that also contains the .ebextensions folder.
In other words:
source.zip
-.ebextensions
-nginx/conf.d/https.conf
-https-instance-single.config
-https-instance.config
-web-inf/
-app.war
If you look at the latest documentation https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/platforms-linux-extend.html, you'll see that the nginx config now goes in the .platform folder instead, so your structure would be:
source.zip
-.ebextensions
-https-instance-single.config
-https-instance.config
-.platform
-nginx/conf.d/https.conf
-web-inf/
-app.war
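A rough sketch of assembling the bundle shown above by hand, assuming the .ebextensions and .platform folders sit in the project root and the war is built into target/:
# Assumes .ebextensions/ and .platform/ are in the project root and the war is in target/
cp target/app.war .
zip -r source.zip .ebextensions .platform app.war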
After following vaquar's answer above, also change the 'buildspec.yml' file to have the correct Java version, e.g.:
runtime-versions:
  java: corretto8 # previously this was openjdk8
Should work.
It is still possible to use .ebextensions within your war file.
Add the following to your pom.xml in the <build><plugins> section:
<build>
  <plugins>
    ...
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>build-helper-maven-plugin</artifactId>
      <executions>
        <execution>
          <id>add-resource-ebextensions</id>
          <phase>generate-resources</phase>
          <goals>
            <goal>add-resource</goal>
          </goals>
          <configuration>
            <resources>
              <resource>
                <directory>${basedir}/.ebextensions</directory>
                <targetPath>.ebextensions</targetPath>
              </resource>
            </resources>
          </configuration>
        </execution>
      </executions>
    </plugin>
    ...
  </plugins>
</build>
This will copy the .ebextensions folder into the WEB-INF/classes folder. AWS picks it up from there at startup and applies the scripts.

Error when launching Spring Cloud Task batch-job on Pivotal Cloud Foundry

I registered the batch-job task from https://repo.spring.io/libs-snapshot/io/spring/cloud/batch-job/1.0.0.RELEASE/ in Pivotal Cloud Foundry.
When launching the task I see the error
CF-UnprocessableEntity(10008): Task must have a droplet. Specify droplet or assign current droplet to app.
These are the commands I executed to register this task
app register --name batch-job --type task --uri maven://io.spring.cloud:batch-job:jar:1.0.0.RELEASE
task create myjob --definition batch-job
task list
task launch myjob
task execution list
I'd appreciate it if someone can point out what I am missing.
It means your app is not deployed correctly. Look at the cf push log for more details.
I had a similar error where it was not determining the buildpack.
I added the below to my pom.xml so PCF automatically detects the buildpack:
<plugin>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-maven-plugin</artifactId>
  <version>${spring.boot.version}</version>
  <executions>
    <execution>
      <goals>
        <goal>repackage</goal>
      </goals>
    </execution>
  </executions>
</plugin>
This error is usually observed when the default API timeout (30s) is not enough to successfully deploy and launch the Task application. You can override the default behavior by setting a larger value via SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_TASK_API_TIMEOUT. Please review the configuration section in the reference guide for more details.
FYI: We recently changed the default timeout experience to 360s via spring-cloud/spring-cloud-deployer-cloudfoundry#192. This is included in the current 1.2.0.BUILD-SNAPSHOT build.
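As an illustration, a hedged sketch of applying that override before relaunching the task; the app name dataflow-server is a placeholder for your Data Flow server application:
# Placeholder app name; raise the task API timeout to 360 seconds, then restage
cf set-env dataflow-server SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_TASK_API_TIMEOUT 360
cf restage dataflow-server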

Elastic Beanstalk, Docker and Continuous integration

I have a Beanstalk environment which uses Docker.
Each time I push something, Jenkins builds and uploads my new snapshot to S3 (I use S3 to store my versions). Each version is a zip which contains my app and my Dockerfile.
Then I update my Beanstalk environment with the version I just uploaded (Beanstalk creates a new application version from the file uploaded to S3; if the version already exists, it replaces it, which is useful for snapshots).
Everything works fine the first time I deploy the version.
But when I do it a second time, it continues to work, but it seems that my latest version is not used; Docker does not rebuild my freshly updated app.
Why is this? Did I miss something? This is my Dockerfile
Basically it seems the update-environment call refuses to update the same version number - that's why we always rely on ${maven.build.timestamp} and friends. Here's your retouched pom :]
Notice I'm using properties - that's the suggested style for the latest version (oops, someone forgot to update the docs).
I've decided to try it with the latest 1.4.0-SNAPSHOT. Here's what you should add to your profile:
<profiles>
  <profile>
    <id>awseb</id>
    <properties>
      <maven.deploy.skip>true</maven.deploy.skip>
      <beanstalker.region>eu-west-1</beanstalker.region>
      <beanstalk.applicationName>wisdom-demo</beanstalk.applicationName>
      <beanstalk.cnamePrefix>wisdom-demo</beanstalk.cnamePrefix>
      <beanstalk.environmentName>${beanstalk.cnamePrefix}</beanstalk.environmentName>
      <beanstalk.artifactFile>${project.basedir}/target/${project.build.finalName}.zip</beanstalk.artifactFile>
      <beanstalk.environmentRef>${beanstalk.cnamePrefix}.elasticbeanstalk.com</beanstalk.environmentRef>
      <maven.build.timestamp.format>yyyyMMddHHmmss</maven.build.timestamp.format>
      <beanstalk.s3Key>apps/${project.artifactId}/${project.version}/${project.artifactId}-${project.version}-${maven.build.timestamp}.zip</beanstalk.s3Key>
      <beanstalk.useLatestVersion>true</beanstalk.useLatestVersion>
      <beanstalk.versionLabel>${project.artifactId}-${project.version}-${maven.build.timestamp}</beanstalk.versionLabel>
      <beanstalk.applicationHealthCheckURL>/ping</beanstalk.applicationHealthCheckURL>
      <beanstalk.instanceType>m1.small</beanstalk.instanceType>
      <beanstalk.keyName>aldrin#leal.eng.br</beanstalk.keyName>
      <beanstalk.iamInstanceProfile>aws-elasticbeanstalk-ec2-role</beanstalk.iamInstanceProfile>
      <beanstalk.solutionStack>64bit Amazon Linux 2014.* running Docker 1.*</beanstalk.solutionStack>
      <beanstalk.environmentType>SingleInstance</beanstalk.environmentType>
    </properties>
    <build>
      <plugins>
        <plugin>
          <groupId>br.com.ingenieux</groupId>
          <artifactId>beanstalk-maven-plugin</artifactId>
          <version>1.4.0-SNAPSHOT</version>
          <executions>
            <execution>
              <id>default-deploy</id>
              <phase>deploy</phase>
              <goals>
                <goal>upload-source-bundle</goal>
                <goal>create-application-version</goal>
                <goal>put-environment</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>
  </profile>
</profiles>
From the example above, just tweak your cnamePrefix and the last three properties.
So if you want to deploy, simply:
$ mvn -Pawseb deploy
Or, if you want to boot the latest version from scratch (thus using useLatestVersion) once deployed, simply do:
$ mvn -Pawseb -Dbeanstalk.versionLabel= beanstalk:create-environment
Setting versionLabel to blank effectively activates the useLatestVersion behaviour: when there isn't a version given, it uses the latest one.
Oh, a deployment failed?
Easy peasy:
$ mvn -Pawseb beanstalk:rollback-version
Thank you for your explanation and the link to the blog post.
I followed these step-by-step instructions and successfully deployed my first Wisdom application in a Docker container on AWS Elastic Beanstalk.
I then updated the Java source code, compiled with mvn package, tested locally and deployed the new ZIP file again using the AWS Console.
My AWS Elastic Beanstalk environment was correctly updated.
So it looks like the deployment problem you are observing lies in the Maven AWS Elastic Beanstalk plugin that deploys the code.
Manual deploys work correctly. Since this Maven plugin is a third-party, open-source project, I am not the right person to investigate this. I would suggest you contact the project maintainer and/or open an issue in their issue tracking system.
As a workaround, you can deploy manually (or script this procedure from your CI/CD environment):
Copy your artefact to your AWS Elastic Beanstalk bucket
aws s3 --region <REGION_NAME> cp ./target/YOUR_ARTIFACTID-1.0-SNAPSHOT.zip s3://<YOUR_BUCKET_NAME>/20141128-210900-YOUR_ARTIFACTID-1.0-SNAPSHOT.zip
Create an application version with your zip file
aws elasticbeanstalk create-application-version --region <REGION_NAME> --application-name <YOUR_APPLICATION_NAME> --version-label 20141128-212100 --source-bundle S3Bucket=<YOUR_BUCKET_NAME>,S3Key=20141128-210900-YOUR_ARTIFACTID-1.0-SNAPSHOT.zip
Deploy that version
aws elasticbeanstalk update-environment --region <YOUR_REGION_NAME> --environment-name <YOUR_ENVIRONMENT_NAME> --version-label 20141128-212100
These three steps could be automated from Maven or Jenkins; I will leave that to you as an exercise :-)
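For instance, a rough script of those three steps; the region, bucket, application and environment names remain placeholders to fill in:
#!/bin/bash
# Sketch of automating the manual deploy above; the <...> placeholders must be replaced
set -e
VERSION_LABEL=$(date +%Y%m%d-%H%M%S)
KEY="${VERSION_LABEL}-YOUR_ARTIFACTID-1.0-SNAPSHOT.zip"
aws s3 --region <REGION_NAME> cp ./target/YOUR_ARTIFACTID-1.0-SNAPSHOT.zip "s3://<YOUR_BUCKET_NAME>/${KEY}"
aws elasticbeanstalk create-application-version --region <REGION_NAME> --application-name <YOUR_APPLICATION_NAME> --version-label "${VERSION_LABEL}" --source-bundle S3Bucket=<YOUR_BUCKET_NAME>,S3Key="${KEY}"
aws elasticbeanstalk update-environment --region <REGION_NAME> --environment-name <YOUR_ENVIRONMENT_NAME> --version-label "${VERSION_LABEL}"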