S3-Dist-Cp Failing on EMR5 - amazon-web-services

I am having trouble with the s3-dist-cp command on emr-5.0.0. My application needs to push some files from HDFS to S3, and it uses s3-dist-cp to do so. This worked fine on emr-4.2.0, but it does not work on emr-5.0.0. If I run the command manually it works, but it fails when run from my application. I did not change anything in the application to run it on emr-5.
Do I need to change anything in order to use emr-5? Has the way s3-dist-cp is used changed in emr-5?
I am using the following command:
s3-dist-cp --src /user/hive/warehouse/abc.text --dest s3n://bucket/abc.text

s3-dist-cp (s3-dist-cp.jar) is only available on the master node.
It is installed in the following location:
/usr/share/aws/emr/s3-dist-cp/
The s3-dist-cp.jar is not available on the slave nodes.
You can log in to a slave machine and verify this.
So the likely reason for the failure is that, on the new EMR release, you are using a workflow management tool that deploys the application to the slave nodes and starts it from there. Since s3-dist-cp is not available on those nodes, the command fails.
Workarounds
First option
Bundle the jar with your application and use the following command:
hadoop jar s3-dist-cp.jar --src location --dest location
Second option
Install the s3-dist-cp jars on all nodes of the cluster with a bootstrap action.
You can also run it as a Java program.
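The first workaround could look roughly like this from inside the application. This is only a sketch, assuming the application launches the copy as a subprocess; the helper name build_copy_command and the default jar path are hypothetical and not part of EMR.

```python
import shutil


def build_copy_command(src, dest, bundled_jar="s3-dist-cp.jar"):
    """Return the argument list for copying src to dest.

    Uses the s3-dist-cp launcher when it is on PATH (as on the master node).
    Otherwise falls back to `hadoop jar` with a jar shipped alongside the
    application. `bundled_jar` is a hypothetical path; adjust it as needed.
    """
    if shutil.which("s3-dist-cp"):
        return ["s3-dist-cp", "--src", src, "--dest", dest]
    return ["hadoop", "jar", bundled_jar, "--src", src, "--dest", dest]
```

The resulting list can be passed to subprocess.run(cmd, check=True), so a failed copy raises instead of passing silently.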

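First, s3n:// is now deprecated; start using s3:// for S3 paths.
Second, if you are merely copying a file from the local filesystem of your cluster into S3, you can use aws s3 cp. Note that aws s3 cp reads from the local filesystem, not from HDFS, so the source path below must exist locally: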
aws s3 cp /user/hive/warehouse/abc.text s3://bucket/abc.text
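If the destination URIs are built in application code, the scheme change can be applied in one place. A minimal sketch, assuming the application holds its S3 paths as plain strings; the function name to_s3_scheme is made up for illustration.

```python
def to_s3_scheme(uri):
    """Rewrite the deprecated s3n:// prefix to s3://; leave other URIs alone."""
    prefix = "s3n://"
    if uri.startswith(prefix):
        return "s3://" + uri[len(prefix):]
    return uri
```

For example, to_s3_scheme("s3n://bucket/abc.text") gives "s3://bucket/abc.text", while HDFS and local paths pass through unchanged.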

The syntax that you have used for s3-dist-cp is incorrect. Please try again with the command below.
s3-dist-cp --src hdfs:///user/hive/warehouse/abc.text --dest s3n://bucket/abc.text
Let me know if this solves your problem.

Related

What is the correct way of installing a JDBC driver on EMR for Sqoop to use?

I am running Sqoop 1.4.7 on AWS EMR 5.21.1 and am trying to import data from a database. I have successfully been able to do this manually where I create an EMR instance with Sqoop installed via the EMR Console.
Here are the preliminary steps I performed in order to run Sqoop on EMR:
Download the JDBC driver
Move the JDBC driver to the /usr/lib/sqoop/lib directory
I was able to run a Sqoop import successfully after running these commands while SSH'd into an EMR cluster:
wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar
sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/
When I try to run these commands from an EMR bootstrap script, however, I get the error:
usr/lib/sqoop/lib/ No such file or directory
After some investigation I realized this is because "Bootstrap actions execute before core services, such as Hadoop or Spark, are installed", as found here.
So the /usr/lib/sqoop/lib directory doesn't exist when my bootstrap script runs.
Here are some solutions that work, but they feel like workarounds:
Create the /usr/lib/sqoop/lib directory in my bootstrap script and then place the jar in it
Add the jar to this directory as an EMR step. (It turns out that this is the correct approach; see the accepted answer below.)
What is the correct way of installing this JDBC driver on EMR?
The second option is the correct way to do it. The documentation explains how to run bash scripts as an EMR step.
You can also use command-runner.jar as the JAR, with the following arguments:
bash -c "wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar;sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/"
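If the step is added programmatically rather than through the console, the same arguments can be wrapped in a step definition. This is a sketch only: the helper name, the step name and ActionOnFailure=CONTINUE are my assumptions, and the dictionary follows the shape that boto3's EMR client expects for add_job_flow_steps.

```python
JDBC_URL = (
    "https://repo1.maven.org/maven2/com/microsoft/sqlserver/"
    "mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar"
)


def jdbc_install_step(url=JDBC_URL, target_dir="/usr/lib/sqoop/lib/"):
    """Build an EMR step that downloads the driver and moves it into Sqoop's lib dir."""
    shell = f"wget -O mssql-jdbc.jar {url};sudo mv mssql-jdbc.jar {target_dir}"
    return {
        "Name": "Install JDBC driver for Sqoop",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",  # assumption; pick what suits your cluster
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["bash", "-c", shell],
        },
    }
```

With boto3 installed, something like boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXX", Steps=[jdbc_install_step()]) should submit it. Because steps run after the applications are installed, the target directory exists by then.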

Unable to execute a step on a running EMR

I have an EMR 5.28.1 cluster running in AWS, but I forgot to install some Python libraries as part of the bootstrap action. Now that the cluster is running, I attempted to add a step via the EMR console. Here are my settings:
JAR: s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar
Main class: None
Arguments: s3://xxxx/install_python_libraries.sh
Unfortunately, I get the following error.
Cannot run program "s3://xxxxx/install_python_libraries.sh" (in directory "."): error=2, No such file or directory
I am not sure what I am doing wrong. The shell script looks like this.
#!/bin/bash -xe
# Non-standard and non-Amazon Machine Image Python modules:
sudo pip-3.6 install boto3
sudo pip-3.6 install xmltodict
I also tried this using 'command-runner.jar' instead, but I get the same error. Can you please help me figure out the problem so that I can do this via the console? I would like to install the libraries on all nodes, master and core.
Thanks
The issue is the line-ending (EOL) type of the xxx.sh file.
In other words, if the file uses Windows line endings ("\r\n"), it will not run and you get the file-not-found error.
Convert it to Unix line endings ("\n") using something like Notepad++ and it will run fine.
(In Notepad++: Edit > EOL Conversion > Unix (LF), then save and try again.)
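If you would rather not depend on an editor, the same conversion can be scripted before uploading the file to S3. A minimal sketch in Python; the function names are my own.

```python
def to_unix_line_endings(data):
    """Convert Windows (CRLF) line endings in a bytes object to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path):
    """Rewrite the file at `path` in place with LF line endings."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(to_unix_line_endings(data))
```

Calling convert_file("install_python_libraries.sh") on the local copy and uploading it again should have the same effect as the Notepad++ conversion. The file is opened in binary mode so that Python does not translate the line endings itself.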

spark cluster on aws emr cant find spark-env.sh

I am playing with Apache Spark on AWS EMR and trying to set the cluster to use Python 3.
I use the following as the last command in a bootstrap script:
sudo sed -i -e '$a\export PYSPARK_PYTHON=/usr/bin/python3' /etc/spark/conf/spark-env.sh
When I use it the cluster crashes during the bootstrap with the following error.
sed: can't read /etc/spark/conf/spark-env.sh: No such file or directory
How should I set it to use python3 properly?
This is not a duplicate of the other question: my issue is that the cluster cannot find the spark-env.sh file while bootstrapping, whereas that question addresses the system not finding python3.
In the end I did not use that script. Instead I used the EMR configuration file that can be supplied at cluster creation, which gave me the proper configuration for jobs submitted via spark-submit (in the AWS GUI). If you need the setting to be available to PySpark scripts in a more programmatic way, you can use os.environ to set the PySpark Python version in the Python script itself.
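For reference, here is roughly what both options look like. This is a sketch under assumptions: the configuration uses the spark-env classification with a nested export block, as described in the EMR documentation, and the interpreter path /usr/bin/python3 is taken from the question.

```python
import json
import os

# Assumption: python3 lives at /usr/bin/python3, as in the question.
PYTHON3 = "/usr/bin/python3"

# Option 1: configuration JSON supplied when the cluster is created.
spark_env_config = [
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {"PYSPARK_PYTHON": PYTHON3},
            }
        ],
    }
]
print(json.dumps(spark_env_config, indent=2))

# Option 2: set the variable inside the script itself. This has to happen
# before the SparkContext (or SparkSession) is created.
os.environ["PYSPARK_PYTHON"] = PYTHON3
```

The printed JSON can be pasted into the software settings box when creating the cluster. Unlike the sed approach, neither option needs /etc/spark/conf/spark-env.sh to exist at bootstrap time.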

amazon emr spark submission from S3 not working

I have a cluster up and running. I am trying to add a step to run my code. The code itself works fine on a single instance. Only thing is I can't get it to work off S3.
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name=SomeSparkApp,Args=[--deploy-mode,cluster,--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
This is exactly what examples show I should do. What am I doing wrong?
Error I get:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown/unsupported param List(--executor-memory, 0.5g, --executor-cores, 2, --primary-py-file, s3://<mybucketname>/mypythonfile.py, --class, org.apache.spark.deploy.PythonRunner)
Usage: org.apache.spark.deploy.yarn.Client [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required in yarn-cluster
mode)
.
.
.
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
When I specify as this instead:
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name= SomeSparkApp,Args=[--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
I get this error instead:
Error: Only local python files are supported: Parsed arguments:
master yarn-client
deployMode client
executorMemory 0.5g
executorCores 2
EDIT: It gets further along when I manually create the Python file after SSH'ing into the cluster and specify it as follows:
aws emr add-steps --cluster-id 'j-XXXXX' --steps Type=spark,Name= SomeSparkApp,Args=[--executor-memory,1g,/home/hadoop/mypythonfile.py]
But it still does not do the job.
Any help appreciated. This is really frustrating, as a well-documented method on AWS's own blog (https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit) does not work.
Just in case: did you use the correct bucket names and cluster ID?
In any case, I had similar problems; for example, I could not use --deploy-mode,cluster when reading from S3.
When I used --deploy-mode,client,--master,local[4] in the arguments, I think it worked. But I think I still needed something different; I can't remember exactly, but I resorted to a solution like this:
First, I use a bootstrap action in which a shell script runs the command:
aws s3 cp s3://<mybucket>/wordcount.py wordcount.py
and then I add a step to the cluster through the SDK in my Go application. From memory, the equivalent CLI command is something like this:
aws emr add-steps --cluster-id j-XXXXX --steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=["spark-submit",--master,local[4],/home/hadoop/wordcount.py,s3://<mybucket>/<inputfile.txt>,s3://<mybucket>/<outputFolder>/]
I searched for days and finally discovered this thread which states
PySpark currently only supports local files. This does not mean it only runs in local mode, however; you can still run PySpark on any cluster manager (though only in client mode). All this means is that your python files must be on your local file system. Until this is supported, the straightforward workaround then is to just copy the files to your local machine.
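The copy-then-submit workaround can also be expressed as two EMR steps instead of a bootstrap action plus a step. This is a sketch, not what the answer above actually ran: the helper names are hypothetical, ActionOnFailure=CONTINUE is an assumption, and the dictionaries follow the step shape used by boto3's add_job_flow_steps.

```python
def command_step(name, args):
    """Wrap a command line in a command-runner.jar step definition."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # assumption; adjust to taste
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": list(args)},
    }


def local_pyspark_steps(s3_script, local_script, app_args=()):
    """Return two steps: copy the script from S3, then spark-submit the local copy."""
    copy = command_step("Copy script", ["aws", "s3", "cp", s3_script, local_script])
    submit = command_step("Spark Program", ["spark-submit", local_script, *app_args])
    return [copy, submit]
```

For example, local_pyspark_steps("s3://<mybucket>/wordcount.py", "/home/hadoop/wordcount.py") yields a copy step followed by a spark-submit step that only ever sees a local path, which is what the quoted limitation requires.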

How to set up and use EC2 CLI on Mac?

I am stuck at using Amazon EC2 CLI.
I have downloaded the Command Line Tools from
http://aws.amazon.com/developertools/351.
I placed the bin and lib folder into my Amazon project folder: /Users/Invictus/EC2
I downloaded the cert-xxxx.pem and pk-xxx.pem into the same folder.
Created a .bash_profile in the same folder.
I tried to execute ec2-describe-images -o amazon after I moved to cd /Users/Invictus/EC2.
The system does not recognise the command: command not found.
If I try to execute the same command inside the bin folder, the result is the same.
My .bash_profile:
export EC2_HOME=~/.EC2
export PATH=$PATH:$EC2_HOME/bin
export EC2_PRIVATE_KEY=`ls $EC2_HOME/pk-*.pem`
export EC2_CERT=`ls $EC2_HOME/cert-*.pem`
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home/
Where did I make a mistake?
My aim is to connect to the launched instance and be able to execute commands there from my local machine.
I have Java installed.
The newer AWS Unified CLI tools are much, much easier to set up. All you need is Python, which comes built in to every Mac.
Here are a few things I can think of:
Your .bash_profile should be in /Users/Invictus/ , not /Users/Invictus/EC2. Move it to your home directory and log off and log back in (or restart your machine) and see if it picks up the right path.
Instead of ec2-describe-images, can you run it as "./ec2-describe-images" - does that work? If not, can you check the permissions on that script?
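A quick way to check the environment is a small diagnostic along these lines. It is only a sketch with a made-up function name. It may also be worth noting that the .bash_profile in the question sets EC2_HOME to ~/.EC2 while the tools were placed in /Users/Invictus/EC2, which is the kind of mismatch this check would flag.

```python
import os


def diagnose_ec2_cli(env):
    """Return a list of problems found in an environment mapping such as os.environ."""
    problems = []
    home = os.path.expanduser(env.get("EC2_HOME", ""))
    if not home:
        problems.append("EC2_HOME is not set")
        return problems
    bin_dir = os.path.join(home, "bin")
    if not os.path.isdir(bin_dir):
        problems.append(f"{bin_dir} does not exist")
    # Expand ~ in PATH entries so they compare equal to the expanded bin_dir.
    path_entries = [
        os.path.expanduser(p) for p in env.get("PATH", "").split(os.pathsep)
    ]
    if bin_dir not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")
    return problems
```

Running diagnose_ec2_cli(os.environ) in a fresh terminal shows whether the .bash_profile was picked up at all and whether EC2_HOME points at the folder that actually contains bin.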