I am playing with Apache Spark on AWS EMR and am trying to set the cluster to use Python 3. I use this as the last command in a bootstrap script:
sudo sed -i -e '$a\export PYSPARK_PYTHON=/usr/bin/python3' /etc/spark/conf/spark-env.sh
When I use it, the cluster crashes during bootstrapping with the following error:
sed: can't read /etc/spark/conf/spark-env.sh: No such file or directory
How should I set it to use python3 properly?
This is not a duplicate of the linked question: my issue is that the cluster is not finding the spark-env.sh file while bootstrapping, while the other question addresses the issue of the system not finding python3.
In the end I did not use that script, but used the EMR configuration file that is available at the creation stage. It gave me the proper configuration via spark-submit (in the AWS GUI). If you need it to be available for PySpark scripts in a more programmatic way, you can use os.environ to set the PySpark Python version inside the Python script.
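For reference, a minimal sketch of that configuration, using the standard spark-env/export classifications, pasted into the software-settings box (or supplied as a JSON file) when creating the cluster:

[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]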
AWS EB (Elastic Beanstalk) CLI not running in Git Bash (Windows 10). I have successfully installed the AWS EB CLI per the AWS documentation at https://github.com/aws/aws-elastic-beanstalk-cli-setup/blob/master/README.md . At the end I set the environment variables as mentioned in the doc, so the "eb" command works from Windows PowerShell. But when I try to run "eb" from Git Bash or the IntelliJ bash prompt, it does not work.
Working fine with Windows PowerShell:
PS C:\> eb --version
EB CLI 3.19.2 (Python 3.7.3)
The environment variable is set as below under "User variables" -> "Path":
(screenshot of the Windows environment-variable settings)
While trying to access "eb" from Git Bash, the error is as below:
$ eb
bash: eb: command not found
$ echo $PATH
.....
......
/c/Users/xxxxxx/.ebcli-virtual-env/executables:
I have restarted the system and the command-line interfaces multiple times.
Can someone please let me know if there is some issue with how the environment variable is set, or do I need to configure something additional in the bash environment?
After much trial and error with the different solutions available on the internet, along with the AWS doc suggestions, I can finally use "eb" from Git Bash on Windows 10. The problem was fixed after I added the location below to my Path environment variable:
C:\Users\XXXX\AppData\Roaming\Python\Python37\Scripts
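If you prefer to set it from Git Bash itself, appending the same directory to PATH in ~/.bashrc should also work; a sketch, keeping the XXXX username placeholder and using Git Bash's /c/... path form:

# make the EB CLI scripts directory visible to Git Bash sessions
export PATH="$PATH:/c/Users/XXXX/AppData/Roaming/Python/Python37/Scripts"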
The issue for me was a username with a space. The path would then look like this: C:\Users\fname lastname\.ebcli-virtual-env\executables. The problem came about because the .bat files created by the AWS script did not wrap the path in double quotes, so Windows interprets it as multiple parameters.
I had to go edit eb.bat and path_exporter.bat and wrap the directives like this (in eb.bat):
CALL "C:\Users\fname lastname\.ebcli-virtual-env\Scripts\activate.bat"
#start CALL "C:\Users\fname lastname\.ebcli-virtual-env\Scripts\eb.exe" %args%
The EB CLI seems to work properly now.
I am running Sqoop 1.4.7 on AWS EMR 5.21.1 and am trying to import data from a database. I have successfully been able to do this manually, by creating an EMR instance with Sqoop installed via the EMR console.
Here are the preliminary steps that I performed in order to run sqoop on EMR
Download the JDBC Driver
Move the JDBC driver to the /usr/lib/sqoop/lib directory
I was able to successfully run a Sqoop import when SSH'd into an EMR cluster, using these commands:
wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar
sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/
When I try to run these commands from an EMR bootstrap script, however, I get the error:
usr/lib/sqoop/lib/ No such file or directory
After doing some investigation I realized this is because "Bootstrap actions execute before core services, such as Hadoop or Spark, are installed", as found here.
So the /usr/lib/sqoop/lib directory doesn't exist yet when my bootstrap script runs.
Here are some solutions which work, but they feel like workarounds:
Create the /usr/lib/sqoop/lib directory in my bootstrap script and then place the jar in it
Add the jar to this directory as an EMR step (it turns out this is the correct approach; see the accepted answer below)
What is the correct way of installing this JDBC driver on EMR?
The second option is the correct way to do it. The documentation explains running bash scripts as an EMR step. You can also use the jar command-runner.jar with the following arguments:
bash -c "wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar;sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/"
I have an EMR cluster (5.28.1) running in AWS, but I forgot to install some Python libraries as part of the bootstrap action. Now that the cluster is running, I was simply attempting to add a step via the EMR console. Here are my settings:
JAR: s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar
Main class: None
Arguments: s3://xxxx/install_python_libraries.sh
Unfortunately, I get the following error.
Cannot run program "s3://xxxxx/install_python_libraries.sh" (in directory "."): error=2, No such file or directory
I am not sure what I am doing wrong. The shell script looks like this.
#!/bin/bash -xe
# Non-standard and non-Amazon Machine Image Python modules:
sudo pip-3.6 install boto3
sudo pip-3.6 install xmltodict
I also tried this by simply using 'command-runner.jar', but I get the same error. Can you please help me figure out the problem so I can do this via the console? I would like to install the libraries on all nodes, master and core.
Thanks
The issue is the xxx.sh file's EOL/carriage-return type.
In other words, if it is the Windows type ("\r\n"), it will not work and returns the file-not-found error.
Convert it to the Unix type ("\n") using something like Notepad++ and it will run fine.
(In Notepad++: Edit > EOL Conversion > Unix (LF), hit Save, and try again.)
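A quick way to do the conversion from a shell instead, as a sketch (the dos2unix variant assumes that tool is installed):

# strip the Windows carriage returns in place before uploading to S3
sed -i 's/\r$//' install_python_libraries.sh
# or, equivalently
dos2unix install_python_libraries.sh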
I am facing issues with the s3-dist-cp command in emr-5.0.0. In my application, I need to push some files from HDFS to S3, and I am using the s3-dist-cp command to achieve this. It was working fine in emr-4.2.0, but it is not working in emr-5.0.0. If I run the command manually it works fine, but it fails in my application. I didn't make any changes to my application to run it on emr-5.
Do I need to make any changes to use emr-5? Has there been any change in the way the s3-dist-cp command is used in emr-5?
I am using following command:
s3-dist-cp --src /user/hive/warehouse/abc.text --dest s3n://bucket/abc.text
s3-dist-cp is only available on the master node (s3-dist-cp.jar).
The following is the location of the application:
/usr/share/aws/emr/s3-dist-cp/
The s3-dist-cp.jar is not available on the slave nodes; you can log in to a slave machine and verify this.
So the reason for your application's failure might be that in the new EMR release you are using some workflow-management tool which deploys the application to the slaves and starts it from there. Since s3-dist-cp is not available on those nodes, it fails.
Workarounds
First option: bundle the jar and use the following command:
hadoop jar s3-dist-cp.jar --src location --dest location
Second option: bootstrap the s3-dist-cp jar onto the cluster nodes, as sketched below.
You can even run it as a Java program.
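A sketch of that second option, assuming you have staged the jar in a bucket of your own (s3://your-bucket/jars/ is a placeholder):

# bootstrap action: pull the staged jar onto every node
aws s3 cp s3://your-bucket/jars/s3-dist-cp.jar /home/hadoop/s3-dist-cp.jar
# the application can then invoke it the same way as on the master
hadoop jar /home/hadoop/s3-dist-cp.jar --src /user/hive/warehouse/abc.text --dest s3://bucket/abc.text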
First, s3n:// is now deprecated; start using s3:// for S3 paths.
Secondly, if you're merely copying a file into S3 from a local file on your cluster (note that aws s3 cp reads the local filesystem, not HDFS, so this applies only if the file exists locally), you can use aws s3 cp:
aws s3 cp /user/hive/warehouse/abc.text s3://bucket/abc.text
The syntax that you have used for s3-dist-cp is incorrect. Please try again with the command below.
s3-dist-cp --src hdfs:///user/hive/warehouse/abc.text --dest s3n://bucket/abc.text
Let me know if this solves your problem.
I have a crontab that fires a PHP script that runs the AWS CLI command "aws ec2 create-snapshot".
When I run the script via the command line, the PHP script completes successfully, with the aws command returning a JSON string to PHP. But when I set up a crontab to run the PHP script, the aws command doesn't return anything.
The crontab runs as the same user that I run the PHP script as on the command line myself, so I am a bit stumped.
I had the same problem running a Ruby script (ruby script.rb).
I replaced ruby with its full path (/sources/ruby-2.0.0-p195/ruby) and it worked.
In your case, replace "aws" with its full path. To find it:
find / -name "aws"
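Then invoke the binary by that absolute path in whatever the cron job runs; a sketch, assuming find reported /usr/local/bin/aws (yours may differ) and using a placeholder volume id:

# call the CLI by absolute path so cron's minimal PATH doesn't matter
/usr/local/bin/aws ec2 create-snapshot --volume-id vol-0123456789abcdef0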
The reason it's necessary to specify the full path to the aws command is that by default cron runs with a very limited environment. I ran into this problem as well, and debugged it by adding this to the cron script:
set | sort > /tmp/environment.txt
I then ran the script via cron and via the command line (renaming the environment file between runs) and compared the two files. This showed me that I needed to set both the PATH and the AWS_DEFAULT_REGION environment variables. After doing this the script worked just fine.
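A sketch of what that can look like at the top of the crontab itself; the PATH value, region, schedule, and script path are placeholders for whatever your own comparison turns up:

# give the cron environment what the interactive shell had
PATH=/usr/local/bin:/usr/bin:/bin
AWS_DEFAULT_REGION=us-east-1
# fire the PHP script as before
0 2 * * * /usr/bin/php /home/ec2-user/create_snapshot.php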