Deploying customized JAR in AWS failing: Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS - amazon-web-services

I am trying to deploy a custom JAR in AWS (Elastic MapReduce). I have to read files from S3, and the S3 paths are given as command-line arguments. While the cluster is running, I see the following in the 'Steps' section of the cluster:
Status:FAILED
Reason:Illegal Argument.
Log File:s3://aws-logs-502743756123-us-east-1/elasticmapreduce/j-3U1NGY5JNUBK2/steps/s-O3W3I4RU4NXS/stderr.gz
Details:Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: s3n://****/input, expected: hdfs://ip-172-31-45-130.ec2.internal:8020
JAR location: s3://****/ChainMapperDriver.jar
Main class: None
Arguments: ChainMapperDriver s3://****/input s3://****/output/
Action on failure: Terminate cluster
ChainMapperDriver is the name of the Main Class.
Do I have to do anything in the Java code I have written to handle the case where the files are in S3? Your help is greatly appreciated.
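For reference: this error usually means an S3 path is being resolved against the cluster's default (HDFS) FileSystem, typically via FileSystem.get(conf) in the driver. Below is a minimal, illustrative sketch of the usual cause and fix; the class name and argument handling are assumptions, not the actual ChainMapperDriver code.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WrongFsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path(args[args.length - 1]); // e.g. the s3://.../output/ argument

        // Cause: FileSystem.get(conf) returns the default FS (HDFS on EMR),
        // so handing it an s3:// or s3n:// Path throws "Wrong FS".
        // FileSystem fs = FileSystem.get(conf);

        // Fix: obtain the FileSystem from the path's own URI (equivalently,
        // output.getFileSystem(conf)), so S3 paths resolve against the S3 filesystem.
        FileSystem fs = FileSystem.get(output.toUri(), conf);
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
    }
}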

Related

How to use AWS_MSK_IAM sasl mechanism with a kafka producer jar?

I have a fat JAR called producer which produces messages. I want to use it to produce messages to an MSK Serverless cluster. The JAR takes the following arguments:
-topic --num-records --record-size --throughput --producer.config /configLocation/
As my MSK Serverless cluster uses IAM-based authentication, I have provided the following settings in my producer.config:
bootstrap.servers=boot-url
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
The way this jar usually works is by providing a username and password with the sasl.jaas.config property.
However, with MSK serverless we have to use the IAM role attached to our EC2 instance.
When executing the current JAR using
java -jar producer.jar -topic --num-records --record-size --throughput --producer.config /configLocation/
I get the exception
Exception in thread "main" org.apache.kafka.common.config.ConfigException: Invalid value software.amazon.msk.auth.iam.IAMClientCallbackHandler for configuration sasl.client.callback.handler.class: Class software.amazon.msk.auth.iam.IAMClientCallbackHandler could not be found.
I don't understand how to make the producer jar find the class present in the external jar aws-msk-iam-auth-1.1.1-all.jar.
Any help would be much appreciated. Thanks.
I found out that it isn't possible to override the classpath specified in an executable JAR's MANIFEST.MF by using command-line options like -cp. What worked in my case was to have the JAR's pom include the missing dependency.
When you use the -jar command-line option to run your program as an executable JAR, the Java CLASSPATH environment variable is ignored, and the -cp and -classpath switches are ignored as well. In this case, you can set your Java classpath in the META-INF/MANIFEST.MF file by using the Class-Path attribute.
https://javarevisited.blogspot.com/2011/01/how-classpath-work-in-java.html
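For example, if the fat JAR is built with something like the Maven Shade or Assembly plugin, the missing classes can be bundled by declaring the dependency in the pom. A sketch of the fragment (the version here simply matches the 1.1.1 JAR mentioned above):
<!-- pom.xml fragment (sketch): pull the MSK IAM auth classes into the fat jar -->
<dependency>
    <groupId>software.amazon.msk</groupId>
    <artifactId>aws-msk-iam-auth</artifactId>
    <version>1.1.1</version>
</dependency>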

Unable to spark-submit a pyspark file on s3 bucket

I have a PySpark script, stored both on the master node of an AWS EMR cluster and in an S3 bucket, that fetches over 140M rows from a MySQL database and writes the sum of a column back to the log files on S3.
When I spark-submit the PySpark script from the master node, the job completes successfully and the output is stored in the log files on the S3 bucket.
However, when I spark-submit the PySpark script from the S3 bucket (using the commands below in the terminal after SSH-ing to the master node):
spark-submit --master yarn --deploy-mode cluster --py-files s3://bucket_name/my_script.py
This returns an Error: Missing application resource. error.
spark-submit s3://bucket_name/my_script.py
This shows:
20/07/02 11:26:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1911)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:766)
at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:137)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
... 20 more
I read about having to add a Spark step on the AWS EMR cluster to submit PySpark code stored on S3.
Am I correct in saying that I would need to create a step in order to submit my PySpark job stored on S3?
In the 'Add Step' window that pops up on the AWS Console, the 'Application location' field says I have to type in the location of a JAR file. What JAR file are they referring to? Does my PySpark script have to be packaged into a JAR file (and if so, how), or do I just give the path to my PySpark script?
In the 'Add Step' window that pops up on the AWS Console, in the spark-submit options, how do I know what to write for the --class parameter? Can I leave this field empty? If not, why not?
I have gone through the AWS EMR documentation. I have so many questions because I dove head-first into the problem and only researched when an error popped up.
Your spark-submit should be:
spark-submit --master yarn --deploy-mode cluster s3://bucket_name/my_script.py
--py-files is used when you want to pass Python dependency modules, not the application code itself.
When you add a step in EMR to run a Spark job, the JAR location is your Python file path, i.e. s3://bucket_name/my_script.py.
No, it's not mandatory to use a STEP to submit a Spark job; you can also use spark-submit directly.
To submit a PySpark script using a STEP, please refer to the AWS documentation and Stack Overflow, for example:
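A minimal example of adding such a step with the AWS CLI (the cluster ID is a placeholder; the script path and deploy settings match the spark-submit shown above):
# Sketch: add the PySpark script on S3 as a Spark step on an existing cluster
aws emr add-steps --cluster-id <cluster_id> \
  --steps Type=Spark,Name=MyPySparkJob,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,s3://bucket_name/my_script.py]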
For problem 1:
By default Spark will use Python 2.
You need to add two configuration entries.
Go to $SPARK_HOME/conf/spark-env.sh and add
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
Note: if you have any custom bundle, add it using --py-files.
For problem 2:
A hadoop-assembly JAR exists under /usr/share/aws/emr/emrfs/lib/; it contains com.amazon.ws.emr.hadoop.fs.EmrFileSystem.
You need to add this JAR to your classpath.
A better option, in my view, is to create a symbolic link to the hadoop-assembly JAR under HADOOP_HOME (/usr/lib/hadoop) in your bootstrap action, for example:
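A sketch of such a bootstrap action; the exact assembly JAR name varies by EMR release, so treat the path as an assumption to verify on your cluster:
#!/bin/bash
# Bootstrap action sketch: link the EMRFS assembly (which contains
# com.amazon.ws.emr.hadoop.fs.EmrFileSystem) into the Hadoop lib directory
# so it ends up on the Hadoop classpath.
sudo ln -sf /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-*.jar /usr/lib/hadoop/lib/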

Add pyspark script as AWS step

I have a PySpark script that reads an XML file (present in S3). I need to add this as a step in AWS EMR. I have used the following command:
aws emr add-steps --cluster-id <cluster_id> --steps Type=spark,Name=POI,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,<s3 location of pyspark script>],ActionOnFailure=CONTINUE
I have downloaded the spark-xml JAR to the master node during bootstrap, and it is present under the /home/hadoop location. Also, in the Python script I have included:
conf = SparkConf().setAppName('Project').set("spark.jars", "/home/hadoop/spark-xml_2.11-0.4.1.jar").set("spark.driver.extraClassPath", "/home/hadoop/spark-xml_2.11-0.4.1.jar")
But it still shows:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
You have set master to yarn and deploy-mode to cluster. That means your Spark driver will run on one of the CORE nodes.
In any case, EMR by default is configured to create the application master on one of the CORE nodes, and the application master will host the driver.
Please refer to this article for more info.
So you have to put your JAR on all CORE nodes (not on the MASTER) and refer to the file as file:///home/hadoop/spark-xml_2.11-0.4.1.jar.
Or, better, put it in HDFS (let's say under hdfs:///user/hadoop) and refer to it as hdfs:///user/hadoop/spark-xml_2.11-0.4.1.jar, for example:
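For example, after copying the JAR to HDFS (hdfs dfs -put /home/hadoop/spark-xml_2.11-0.4.1.jar /user/hadoop/), the configuration shown above could reference the HDFS copy. A sketch:
from pyspark import SparkConf

# Point spark.jars at the HDFS copy so the driver and executors on the CORE nodes can fetch it
conf = SparkConf().setAppName('Project').set("spark.jars", "hdfs:///user/hadoop/spark-xml_2.11-0.4.1.jar")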

How to pass --properties-file to spark-submit in Qubole?

I am using Spark in Qubole, with the clusters created in AWS. In the Qubole Workbench, when I execute the command line below, it works fine and the command succeeds:
/usr/lib/spark/bin/spark-submit s3://bucket-name/SparkScripts/test.py
But when I execute the same command with the --properties-file option:
/usr/lib/spark/bin/spark-submit --properties-file s3://bucket-name/SparkScripts/properties.file s3://bucket-name/SparkScripts/test.py
it gives the error message below:
Qubole > Shell Command failed with exit code: 1
App > Error occurred when getting effective config required to initialize Qubole security provider
App > Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Properties file s3:/bucket-name/SparkScripts/properties.file does not exist
Can someone help me fix this? I need some application properties to be stored in a separate file on Amazon S3 and passed via --properties-file to my Spark program.
@saravanan - Qubole does not currently have the ability to specify --properties-file from an S3 path. It will be available in release 59.
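Until then, one possible interim workaround (an assumption on my part, not a documented Qubole feature) is to copy the properties file to the cluster's local filesystem first and pass the local path:
# Workaround sketch: stage the S3 properties file locally before calling spark-submit
hadoop fs -copyToLocal s3://bucket-name/SparkScripts/properties.file /tmp/properties.file
/usr/lib/spark/bin/spark-submit --properties-file /tmp/properties.file s3://bucket-name/SparkScripts/test.py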

Spring XD HDFS sink - Error creating bean

I am getting the following exception when deploying a stream with HDFS as the sink in Spring XD:
Error creating bean with name 'hadoopConfiguration': Invocation of init method failed; nested exception is java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobConf
I have my Spring XD app running on YARN successfully. I appreciate your help.
The problem is with the siteMapreduceAppClassPath configuration in servers.yml. The classpath should include the path to the hadoop-core JAR; because that JAR is not included in the app, it throws NoClassDefFoundError. For example:
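A hypothetical servers.yml fragment; the key placement and the JAR path below are assumptions, so verify the exact nesting against your Spring XD release:
# servers.yml sketch: add a classpath entry for the JAR that provides org.apache.hadoop.mapred.JobConf
spring:
  yarn:
    siteMapreduceAppClasspath: "/path/to/hadoop-core.jar"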