Import boto3 module error in aws batch job - amazon-web-services

I was trying to run a batch job on an image in AWS and am getting the error below:
ModuleNotFoundError: No module named 'boto3'
But boto3 is installed by the Dockerfile, via requirements.txt.
Dockerfile
FROM ubuntu:20.04
ENV SPARK_VERSION 2.4.8
ENV HADOOP_VERSION 3.0.0
RUN apt update
RUN apt install openjdk-8-jdk -y
RUN apt install scala -y
RUN apt install wget tar -y
#RUN wget https://apache.mirror.digitalpacific.com.au/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz
RUN wget http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
RUN wget https://downloads.apache.org/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-without-hadoop.tgz
RUN tar xfz hadoop-$HADOOP_VERSION.tar.gz
RUN mv hadoop-$HADOOP_VERSION /opt/hadoop
RUN tar xvf spark-$SPARK_VERSION-bin-without-hadoop.tgz
RUN mv spark-$SPARK_VERSION-bin-without-hadoop /opt/spark
RUN apt install software-properties-common -y
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt update && \
apt install python3.7 -y
ENV SPARK_HOME /opt/spark
ENV HADOOP_HOME /opt/hadoop
ENV HADOOP_CONF_DIR $HADOOP_HOME/etc/hadoop
ENV PATH $PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:${HADOOP_HOME}/bin
ENV PYSPARK_PYTHON /usr/bin/python3.7
RUN export SPARK_HOME=/opt/spark
RUN export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:${HADOOP_HOME}/bin
RUN export PYSPARK_PYTHON=/usr/bin/python3.7
RUN export SPARK_DIST_CLASSPATH=$(hadoop classpath)
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.7 1
RUN update-alternatives --set python /usr/bin/python3.7
RUN apt-get install python3-distutils -y
RUN apt-get install python3-apt -y
RUN apt install python3-pip -y
RUN pip3 install --upgrade pip
COPY ./pipeline_union/requirements.txt requirements.txt
#RUN python -m pip install -r requirements.txt
RUN pip3 install -r requirements.txt
#RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.10.6/aws-java-sdk-1.10.6.jar -P $SPARK_HOME/jars/
RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.874/aws-java-sdk-bundle-1.11.874.jar -P $SPARK_HOME/jars/
RUN wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.0.0/hadoop-aws-3.0.0.jar -P $SPARK_HOME/jars/
RUN wget https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar -P $SPARK_HOME/jars/
#RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar -P ${HADOOP_HOME}/share/hadoop/tools/lib/
#RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar -P ${SPARK_HOME}/jars/
# COPY datalake/spark-on-spot/src/jars $SPARK_HOME/jars
# COPY datalake/spark-on-spot/src/pipeline_union ./
# COPY datalake/spark-on-spot/src/pipeline_union/spark.conf spark.conf
COPY ./jars $SPARK_HOME/jars
COPY ./pipeline_union ./
COPY ./pipeline_union/spark.conf spark.conf
#RUN ls /usr/lib/jvm
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
ENV PATH $PATH:$HOME/bin:$JAVA_HOME/bin
RUN export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN hadoop classpath
ENV SPARK_DIST_CLASSPATH=/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/contrib/capacity-scheduler/*.jar:/opt/hadoop/share/hadoop/tools/lib/*
ENTRYPOINT ["spark-submit", "--properties-file", "spark.conf"]
#ENTRYPOINT ["spark-submit", "--packages", "org.apache.hadoop:hadoop-aws:2.8.5"]
#ENTRYPOINT ["spark-submit", "--properties-file", "spark.conf", "--packages", "org.apache.hadoop:hadoop-aws:2.8.5"]
requirements.txt
boto3==1.13.9
botocore
colorama==0.3.9
progressbar2==3.39.3
pyarrow==1.0.1
requests
psycopg2-binary
pytz
I ran another image successfully, with two differences:
The pip install line in the Dockerfile:
RUN pip install -r requirements.txt
requirements.txt:
requests
boto3
psycopg2-binary
pytz
pandas
pynt
Are there any known issues with:
Using pip3 in the Dockerfile instead of pip
Specifying the boto3 version
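For what it's worth, one mismatch worth ruling out (this is an assumption based on the base image, not something confirmed in the question): on Ubuntu 20.04 the apt package python3-pip installs packages for the distribution's default python3 (3.8), while PYSPARK_PYTHON points at the python3.7 installed from the deadsnakes PPA, so anything installed with pip3 may never be visible to the interpreter Spark actually runs. A minimal sketch of installing the requirements for that specific interpreter (the get-pip.py URL is the version-specific script published by PyPA):
# Diagnostic: show which interpreter pip3 installs for vs. the one PYSPARK_PYTHON uses
RUN pip3 --version && (python3.7 -m pip --version || true)
# Give python3.7 its own pip and install the requirements into it
RUN wget -q https://bootstrap.pypa.io/pip/3.7/get-pip.py && python3.7 get-pip.py
RUN python3.7 -m pip install -r requirements.txt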

Related

Unable to install Tesseract 5.0 version on AWS Lambda

I want to run Tesseract 4.0 or Tesseract 5.0 on my AWS Lambda function. My Dockerfile looks like this:
FROM public.ecr.aws/lambda/python:3.8
RUN mkdir app
# Copy function code
COPY / ${LAMBDA_TASK_ROOT}/app
# Install the function's dependencies using file requirements.txt
# from your project folder.
COPY requirements.txt .
RUN pip3 install -r requirements.txt --target ${LAMBDA_TASK_ROOT}
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum -y update
RUN yum -y install tesseract
RUN yum install -y poppler-utils
# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "app.com.emlAndMsgParser.mail_parser_test.getEmail_from_msg" ]
But when I run "docker build -t qa-lambda ." in my terminal, it says Tesseract 3.0 is being installed. When I deploy the built Docker image to AWS Lambda, it also says Tesseract 3.0 is installed.
But I want Tesseract 4.0, or preferably Tesseract 5.0.
I tried changing the "RUN yum -y install tesseract" in my Dockerfile to "RUN yum -y install tesseract 5.0.0-alpha-320-g8dc3", "RUN yum -y install tesseract -y" and "RUN yum -y install tesseract*".
But all of them install Tesseract 3.0.
Can anyone tell me where I am going wrong?
I am a bit new to Tesseract, so any help is appreciated. Thanks!
Having the same problem, I finally created a Dockerfile myself:
FROM public.ecr.aws/lambda/java:11
# Prepare dev tools
RUN yum -y update
RUN yum -y install wget libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel
RUN yum group install -y "Development Tools"
# Build leptonica
WORKDIR /opt
RUN wget http://www.leptonica.org/source/leptonica-1.82.0.tar.gz
RUN ls -la
RUN tar -zxvf leptonica-1.82.0.tar.gz
WORKDIR ./leptonica-1.82.0
RUN ./configure
RUN make -j
RUN make install
RUN cd .. && rm leptonica-1.82.0.tar.gz
# Build tesseract
RUN wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/5.2.0.tar.gz
RUN tar -zxvf 5.2.0.tar.gz
WORKDIR ./tesseract-5.2.0
RUN ./autogen.sh
RUN PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
RUN LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make -j
RUN make install
RUN /sbin/ldconfig
RUN cd .. && rm 5.2.0.tar.gz
# Optional: install language packs
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/deu.traineddata
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
RUN mv *.traineddata /usr/local/share/tessdata
WORKDIR /root
ENTRYPOINT [ "tesseract", "--version" ]
Hope this helps!

Pyarrow fs.HadoopFileSystem reports it is unable to load libhdfs.so

I'm trying to use the pyarrow Filesystem interface with HDFS. I receive a libhdfs.so not found error when calling the fs.HadoopFileSystem constructor even though libhdfs.so is apparently at the indicated location.
from pyarrow import fs
hfs = fs.HadoopFileSystem(host="10.10.0.167", port=9870)
OSError: Unable to load libhdfs: /hadoop-3.3.1/lib/native/libhdfs.so: cannot open shared object file: No such file or directory
I've tried different Python and pyarrow versions and setting ARROW_LIBHDFS_DIR. For testing, I'm using the following Dockerfile on Linux Mint.
FROM openjdk:11
RUN apt-get update &&\
apt-get install wget -y
RUN wget -nv https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1-aarch64.tar.gz &&\
tar -xf hadoop-3.3.1-aarch64.tar.gz
ENV PATH=/miniconda/bin:${PATH}
RUN wget -nv https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh &&\
bash miniconda.sh -b -p /miniconda &&\
conda init
RUN conda install -c conda-forge python=3.9.6
RUN conda install -c conda-forge pyarrow=4.0.1
ENV JAVA_HOME=/usr/local/openjdk-11
ENV HADOOP_HOME=/hadoop-3.3.1
RUN printf 'from pyarrow import fs\nhfs = fs.HadoopFileSystem(host="10.10.0.167", port=9870)\n' > test_arrow.py
# 'python test_arrow.py' fails with ...
# OSError: Unable to load libhdfs: /hadoop-3.3.1/lib/native/libhdfs.so: cannot open shared object file: No such file or directory
RUN python test_arrow.py || true
CMD ["/bin/bash"]
I have created a Dockerfile for a pyarrow fs.HadoopFileSystem client. Hadoop itself needs to be installed in the image so that the libhdfs.so file is available.
RUN mkdir -p /data/hadoop
RUN apt-get -q update
RUN apt-get install software-properties-common -y
RUN add-apt-repository "deb http://deb.debian.org/debian/ sid main"
RUN apt-get -q update
RUN apt-get install openjdk-8-jdk -y
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/*
RUN wget "https://dlcdn.apache.org/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz" -O hadoop-3.3.2.tar.gz
RUN tar xzf hadoop-3.3.2.tar.gz
ENV HADOOP_HOME=/app/hadoop-3.3.2
ENV HADOOP_INSTALL=$HADOOP_HOME
ENV HADOOP_MAPRED_HOME=$HADOOP_HOME
ENV HADOOP_COMMON_HOME=$HADOOP_HOME
ENV HADOOP_HDFS_HOME=$HADOOP_HOME
ENV YARN_HOME=$HADOOP_HOME
ENV HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
ENV PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
ENV HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ENV CLASSPATH="$HADOOP_HOME/bin/hadoop classpath --glob"
ENV ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native
ADD pyarrow-app.py /app/
CMD [ "python3" "-u" "/app/pyarrow-app.py.py"]

CodeBuild with custom docker image get error: Failed to get D-Bus connection: Operation not permitted

I created a docker image:
Dockerfile:
FROM amazonlinux:2 AS core
WORKDIR /root
RUN yum update -y
RUN curl -sL https://rpm.nodesource.com/setup_14.x | bash -
RUN yum install -y nodejs
RUN npm install -g less coffeescript
# shadow-utils for the useradd command.
RUN yum install shadow-utils -y
RUN useradd ec2-user
RUN yum install -y awslogs
RUN yum -y install net-tools
RUN yum install which -y
RUN yum install sudo -y
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum install nginx -y
RUN yum clean metadata
RUN yum install -y python3-devel.aarch64
RUN yum install python37
RUN /usr/bin/pip3.7 install virtualenv
RUN virtualenv -p /usr/bin/python3.7 /var/app/.venv
RUN yum install nginx -y
RUN yum -y groupinstall Development tools
RUN yum -y install zlib-devel
RUN yum -y install bzip2-devel openssl-devel ncurses-devel
RUN yum -y install sqlite sqlite-devel xz-devel
RUN yum -y install readline-devel tk-devel gdbm-devel db4-devel
RUN yum -y install libpcap-devel xz-devel
RUN yum -y install libjpeg-devel
RUN yum -y install wget
When I run it with privileged mode on my local machine, the NGINX service starts with no failure.
CodeBuild (privileged mode = true) pulls this image from ECR, but when it tries to start the NGINX service I get the following error:
Failed to get D-Bus connection: Operation not permitted
Any ideas?
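Not a confirmed diagnosis, but that D-Bus error typically appears when NGINX is started through the init system (service/systemctl), which is not available inside most containers, CodeBuild's included. The usual container pattern is to run NGINX in the foreground as the container's main process, along the lines of:
# Run nginx as the container's foreground process instead of via systemd/service
CMD ["nginx", "-g", "daemon off;"]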

Is there a way I can get my docker-compose build command to auto-generate my Django requirements.txt?

I'm using Django 2 and Python 3.7. I have the following directory structure.
web
- Dockerfile
- manage.py
+ maps
- requirements.txt
+ static
+ tests
+ venv
"requirements.txt" is just a file I generated by running "pip3 freeze > requirements.txt". I have the below Dockerfile for my Django container ...
FROM python:3.7-slim
RUN apt-get update && apt-get install
RUN apt-get install -y libmariadb-dev-compat libmariadb-dev
RUN apt-get update \
&& apt-get install -y --no-install-recommends gcc \
&& rm -rf /var/lib/apt/lists/*
RUN python -m pip install --upgrade pip
RUN mkdir -p /app/
WORKDIR /app/
pip3 freeze > requirements.txt
COPY requirements.txt requirements.txt
RUN python -m pip install -r requirements.txt
COPY . /app/
I was wondering if there is a way to build my container such that it auto-generates and copies the correct requirements.txt file. As you might guess, the line
pip3 freeze > requirements.txt
that I attempted to include above causes the whole thing to die when running "docker-compose build", with the error
ERROR: Dockerfile parse error line 15: unknown instruction: PIP3
This makes no sense, as the environment in your Docker container will be empty, so it would just overwrite your requirements.txt with an empty file.
You are also missing RUN:
RUN pip3 freeze > requirements.txt
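For reference, a minimal sketch of the usual workflow (paths assumed from the directory layout in the question): freeze on the host, where the project's packages actually exist, and let the Dockerfile only copy and install that file:
# On the host, from the web/ directory, before running docker-compose build
pip3 freeze > requirements.txt
# The Dockerfile then keeps only the existing lines:
#   COPY requirements.txt requirements.txt
#   RUN python -m pip install -r requirements.txt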

Django: Dockerfile error with collectstatic

I am trying to deploy a Django application with Docker and Jenkins.
I get the error:
"msg": "Error building registry.docker.si/... - code: 1 message: The command '/bin/sh -c if [ ! -d ./static ]; then mkdir static; fi && ./manage.py collectstatic --no-input' returned a non-zero code: 1"
}
The Dockerfile is:
FROM python:3.6
RUN apt-get update && apt-get install -y python-dev && apt-get install -y libldap2-dev && apt-get install -y libsasl2-dev && apt-get install -y libssl-dev && apt-get install -y sasl2-bin
ENV PYTHONUNBUFFERED 1
WORKDIR /usr/src/app
COPY requirements.txt ./
RUN pip install --no-cache-dir --upgrade pip
RUN pip install --no-cache-dir --upgrade setuptools
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
RUN chmod u+rwx manage.py
RUN if [ ! -d ./static ]; then mkdir static; fi && ./manage.py collectstatic --no-input
RUN chown -R 10000:10000 ./
EXPOSE 8080
CMD ["sh", "./run-django.sh"]
My problem is that, with the same Dockerfile, other Django projects deploy without any problem...
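One way to see the underlying Django error (a debugging sketch with a hypothetical image tag myapp-debug): temporarily comment out the collectstatic line so the image builds, then run the command interactively to get the full traceback:
docker build -t myapp-debug .
docker run --rm -it myapp-debug python manage.py collectstatic --no-input --verbosity 3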