EMR stuck at bootstrap script

I am trying to run a bootstrap file on EMR to install Facebook Prophet, which seems to require installing dev tools first. The bootstrap.sh simply runs:
bootstrap.sh
#!/bin/bash -xe
sudo yum install python3-devel python3-libs python3-tools
#sudo yum groupinstall "Development Tools"
aws s3 cp s3://bucket/requirements.txt .
sudo python3 -m pip install --upgrade pip
sudo python3 -m pip install --upgrade -r ./requirements.txt
The output and error logs show the script starting, but then the cluster is stuck for an hour before failing.

I had to add the -y flag to yum in order to bypass any user prompting.
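For reference, here is the script with that fix applied (the bucket name is a placeholder carried over from the original):
#!/bin/bash -xe
# -y answers yes automatically; without it yum blocks on a confirmation
# prompt that can never be answered, so the bootstrap step hangs until EMR times out
sudo yum install -y python3-devel python3-libs python3-tools
aws s3 cp s3://bucket/requirements.txt .
sudo python3 -m pip install --upgrade pip
sudo python3 -m pip install --upgrade -r ./requirements.txt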

Related

Import boto3 module error in AWS Batch job

I was trying to run a batch job on an image in AWS and am getting the error below:
ModuleNotFoundError: No module named 'boto3'
But boto3 is installed in the Dockerfile.
Dockerfile
FROM ubuntu:20.04
ENV SPARK_VERSION 2.4.8
ENV HADOOP_VERSION 3.0.0
RUN apt update
RUN apt install openjdk-8-jdk -y
RUN apt install scala -y
RUN apt install wget tar -y
#RUN wget https://apache.mirror.digitalpacific.com.au/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz
RUN wget http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
RUN wget https://downloads.apache.org/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-without-hadoop.tgz
RUN tar xfz hadoop-$HADOOP_VERSION.tar.gz
RUN mv hadoop-$HADOOP_VERSION /opt/hadoop
RUN tar xvf spark-$SPARK_VERSION-bin-without-hadoop.tgz
RUN mv spark-$SPARK_VERSION-bin-without-hadoop /opt/spark
RUN apt install software-properties-common -y
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt update && \
apt install python3.7 -y
ENV SPARK_HOME /opt/spark
ENV HADOOP_HOME /opt/hadoop
ENV HADOOP_CONF_DIR $HADOOP_HOME/etc/hadoop
ENV PATH $PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:${HADOOP_HOME}/bin
ENV PYSPARK_PYTHON /usr/bin/python3.7
RUN export SPARK_HOME=/opt/spark
RUN export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:${HADOOP_HOME}/bin
RUN export PYSPARK_PYTHON=/usr/bin/python3.7
RUN export SPARK_DIST_CLASSPATH=$(hadoop classpath)
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.7 1
RUN update-alternatives --set python /usr/bin/python3.7
RUN apt-get install python3-distutils -y
RUN apt-get install python3-apt -y
RUN apt install python3-pip -y
RUN pip3 install --upgrade pip
COPY ./pipeline_union/requirements.txt requirements.txt
#RUN python -m pip install -r requirements.txt
RUN pip3 install -r requirements.txt
#RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.10.6/aws-java-sdk-1.10.6.jar -P $SPARK_HOME/jars/
RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.874/aws-java-sdk-bundle-1.11.874.jar -P $SPARK_HOME/jars/
RUN wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.0.0/hadoop-aws-3.0.0.jar -P $SPARK_HOME/jars/
RUN wget https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar -P $SPARK_HOME/jars/
#RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar -P ${HADOOP_HOME}/share/hadoop/tools/lib/
#RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar -P ${SPARK_HOME}/jars/
# COPY datalake/spark-on-spot/src/jars $SPARK_HOME/jars
# COPY datalake/spark-on-spot/src/pipeline_union ./
# COPY datalake/spark-on-spot/src/pipeline_union/spark.conf spark.conf
COPY ./jars $SPARK_HOME/jars
COPY ./pipeline_union ./
COPY ./pipeline_union/spark.conf spark.conf
#RUN ls /usr/lib/jvm
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
ENV PATH $PATH:$HOME/bin:$JAVA_HOME/bin
RUN export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN hadoop classpath
ENV SPARK_DIST_CLASSPATH=/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/contrib/capacity-scheduler/*.jar:/opt/hadoop/share/hadoop/tools/lib/*
ENTRYPOINT ["spark-submit", "--properties-file", "spark.conf"]
#ENTRYPOINT ["spark-submit", "--packages", "org.apache.hadoop:hadoop-aws:2.8.5"]
#ENTRYPOINT ["spark-submit", "--properties-file", "spark.conf", "--packages", "org.apache.hadoop:hadoop-aws:2.8.5"]
requirements.txt
boto3==1.13.9
botocore
colorama==0.3.9
progressbar2==3.39.3
pyarrow==1.0.1
requests
psycopg2-binary
pytz
I ran another image successfully, with two differences. The pip line in the Dockerfile was:
RUN pip install -r requirements.txt
and its requirements.txt was:
requests
boto3
psycopg2-binary
pytz
pandas
pynt
Are there any known issues with:
using pip3 in the Dockerfile instead of pip?
specifying the boto3 version?
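One thing worth checking (an assumption, since the runtime environment isn't shown): on Ubuntu 20.04 the apt python3-pip package is built against the distro's default Python 3.8, so pip3 install can place boto3 under 3.8's site-packages while PYSPARK_PYTHON points at /usr/bin/python3.7. A quick diagnostic inside the image:
# shows which interpreter pip3 is bound to, e.g. "pip ... (python 3.8)"
pip3 --version
# fails if boto3 was installed for a different interpreter
/usr/bin/python3.7 -c "import boto3"
If the two disagree, installing the requirements with the same interpreter Spark uses (python3.7 -m pip install -r requirements.txt, after bootstrapping pip for that interpreter) should line them up.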

CodeBuild with custom Docker image gets error: Failed to get D-Bus connection: Operation not permitted

I created a Docker image:
Dockerfile:
FROM amazonlinux:2 AS core
WORKDIR /root
RUN yum update -y
RUN curl -sL https://rpm.nodesource.com/setup_14.x | bash -
RUN yum install -y nodejs
RUN npm install -g less coffeescript
# shadow-utils provides the useradd command.
RUN yum install shadow-utils -y
RUN useradd ec2-user
RUN yum install -y awslogs
RUN yum -y install net-tools
RUN yum install which -y
RUN yum install sudo -y
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum install nginx -y
RUN yum clean metadata
RUN yum install -y python3-devel.aarch64
RUN yum install -y python37
RUN /usr/bin/pip3.7 install virtualenv
RUN virtualenv -p /usr/bin/python3.7 /var/app/.venv
RUN yum install nginx -y
RUN yum -y groupinstall "Development Tools"
RUN yum -y install zlib-devel
RUN yum -y install bzip2-devel openssl-devel ncurses-devel
RUN yum -y install sqlite sqlite-devel xz-devel
RUN yum -y install readline-devel tk-devel gdbm-devel db4-devel
RUN yum -y install libpcap-devel xz-devel
RUN yum -y install libjpeg-devel
RUN yum -y install wget
When I run it with --privileged locally, it succeeds in running the NGINX service with no failure.
CodeBuild (privileged mode = true) pulls this image from ECR, but when it tries to run the NGINX service I get the following error:
Failed to get D-Bus connection: Operation not permitted
Any ideas?
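A common cause (an assumption here, since the build commands aren't shown) is starting NGINX through the init system, e.g. service nginx start, which needs the systemd/D-Bus machinery that a CodeBuild container doesn't provide even in privileged mode. Running the binary directly in the foreground sidesteps the init system entirely:
# run nginx as a foreground process instead of via 'service' or 'systemctl'
nginx -g 'daemon off;'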

Error of "Command 'pip' not found" when trying to install requirements.txt

I'm trying to run pip install -r requirements.txt on an AWS server, after recently pulling a git commit.
I get this error:
Command 'pip' not found, but can be installed with:
sudo apt install python-pip
So I tried entering:
sudo apt install python-pip install -r requirements.txt
and then
sudo apt install python-pip -r requirements.txt
But both attempts gave me this error:
E: Command line option 'r' [from -r] is not understood in combination with the other options.
What is the correct command to install this? Thank you.
You are mixing multiple commands.
apt: Debian's package manager. It has nothing to do with Python packages, except that you can install pip through apt (there are other ways as well).
pip: Python's package manager. You install your project's dependencies by listing them in requirements.txt.
The correct way would be:
sudo apt install python-pip
# install from requirements.txt:
sudo pip install -r requirements.txt
# or install as your user:
pip install -U -r requirements.txt
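Note that on newer Debian/Ubuntu releases the python-pip package has been replaced by the Python 3 variant, so the equivalent would be:
sudo apt install python3-pip
pip3 install -r requirements.txt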

AWS EMR stuck bootstrapping

For a course, we are doing a basic tweet analysis using AWS EMR. I followed the steps in this document:
http://docs.aws.amazon.com/en_us/gettingstarted/latest/emr/awsgsg-emr.pdf
The only modifications are that I uploaded a pre-made set of tweets, and we were told to use our own bootstrap script for NLTK. The instructor gave us the following:
#!/bin/bash
sudo yum -y install git gcc python-dev python-devel
sudo ln -sf /usr/bin/python2.7 /usr/bin/python
sudo easy_install pip
sudo pip install -U numpy
sudo pip install numpy
sudo easy_install -U distribute
sudo pip install -U setuptools
sudo pip install pyyaml nltk
sudo pip install -e git://github.com/mdp-toolkit/mdp-toolkit#egg=MDP
sudo python -m nltk.downloader -d /usr/share/nltk_data all
I create my cluster and, when it executes, it gets to 'bootstrapping' and has been stuck there for 45 minutes. I am using AMI version 3.11.0, with no Hive, Pig, or Hue.
Please let me know if more information is needed to diagnose this. What could cause it?
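A hedged first step, assuming S3 logging was enabled when the cluster was created: each bootstrap action's stdout and stderr are written under the cluster's log URI, which usually shows exactly which command is hanging. Note that the final nltk.downloader all step downloads every NLTK corpus and model and can legitimately run for a very long time.
# bucket, cluster id, and instance id below are placeholders
aws s3 ls s3://your-log-bucket/logs/j-XXXXXXXXXXXXX/node/i-XXXXXXXXXXXXX/bootstrap-actions/1/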

When bootstrapping an Amazon Elastic MapReduce job, can my script use sudo?

I need to:
sudo apt-get install rubygems
sudo gem install <lots of gems>
Does the bootstrap action have sudo access?
The answer is yes. You can test your bootstrap script like this:
elastic-mapreduce --create --alive --ssh
This will create a node and give you an SSH connection to it, from which you can test your bootstrap script.
UPDATE: for reference, here is what I'm running:
#!/bin/bash
sudo apt-get -y -V install irb1.8 libreadline-ruby1.8 libruby libruby1.8 rdoc1.8 ruby ruby1.8 ruby1.8-dev
wget http://production.cf.rubygems.org/rubygems/rubygems-1.8.11.zip
unzip rubygems-1.8.11.zip
cd rubygems-1.8.11
sudo ruby setup.rb
sudo gem1.8 install bson bson_ext json tzinfo i18n activesupport --no-rdoc --no-ri
UPDATE 2: to install aws-sdk:
#!/bin/bash
# ruby developer packages
sudo apt-get -y -V install ruby1.8-dev ruby1.8 ri1.8 rdoc1.8 irb1.8
sudo apt-get -y -V install libreadline-ruby1.8 libruby1.8 libopenssl-ruby
# nokogiri requirements
sudo apt-get -y -V install libxslt-dev libxml2-dev
wget http://production.cf.rubygems.org/rubygems/rubygems-1.8.11.zip
unzip rubygems-1.8.11.zip
cd rubygems-1.8.11
sudo ruby setup.rb
sudo gem1.8 install aws-sdk --no-rdoc --no-ri
The -y flag on apt-get makes it not prompt you.
I wget rubygems because the version you get with apt-get is way out of date, and some gems won't build with an old version.
Yup, and here is a list of commands that I ran to set up my instance (for all the people who google this question later):
sudo apt-get update
sudo apt-get -y install emacs
sudo apt-get -y install rubygems
sudo gem install fastercsv --source http://rubygems.org
sudo gem install crack --source http://rubygems.org
sudo gem install json_pure --source http://rubygems.org
exit