PyArrow fs.HadoopFileSystem reports unable to load libhdfs.so

I'm trying to use the pyarrow Filesystem interface with HDFS. I receive a libhdfs.so not found error when calling the fs.HadoopFileSystem constructor even though libhdfs.so is apparently at the indicated location.
from pyarrow import fs
hfs = fs.HadoopFileSystem(host="10.10.0.167", port=9870)
OSError: Unable to load libhdfs: /hadoop-3.3.1/lib/native/libhdfs.so: cannot open shared object file: No such file or directory
I've tried different Python and pyarrow versions and setting ARROW_LIBHDFS_DIR. For testing, I'm using the following Dockerfile on Linux Mint.
FROM openjdk:11
RUN apt-get update &&\
apt-get install wget -y
RUN wget -nv https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1-aarch64.tar.gz &&\
tar -xf hadoop-3.3.1-aarch64.tar.gz
ENV PATH=/miniconda/bin:${PATH}
RUN wget -nv https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh &&\
bash miniconda.sh -b -p /miniconda &&\
conda init
RUN conda install -c conda-forge python=3.9.6
RUN conda install -c conda-forge pyarrow=4.0.1
ENV JAVA_HOME=/usr/local/openjdk-11
ENV HADOOP_HOME=/hadoop-3.3.1
RUN printf 'from pyarrow import fs\nhfs = fs.HadoopFileSystem(host="10.10.0.167", port=9870)\n' > test_arrow.py
# 'python test_arrow.py' fails with ...
# OSError: Unable to load libhdfs: /hadoop-3.3.1/lib/native/libhdfs.so: cannot open shared object file: No such file or directory
RUN python test_arrow.py || true
CMD ["/bin/bash"]

I created a Dockerfile for a client that uses pyarrow's fs.HadoopFileSystem.
A Hadoop distribution has to be installed in the image so that libhdfs.so is available.
RUN mkdir -p /data/hadoop
RUN apt-get -q update
RUN apt-get install software-properties-common -y
RUN add-apt-repository "deb http://deb.debian.org/debian/ sid main"
RUN apt-get -q update
RUN apt-get install openjdk-8-jdk -y
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/*
RUN wget "https://dlcdn.apache.org/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz" -O hadoop-3.3.2.tar.gz
RUN tar xzf hadoop-3.3.2.tar.gz
ENV HADOOP_HOME=/app/hadoop-3.3.2
ENV HADOOP_INSTALL=$HADOOP_HOME
ENV HADOOP_MAPRED_HOME=$HADOOP_HOME
ENV HADOOP_COMMON_HOME=$HADOOP_HOME
ENV HADOOP_HDFS_HOME=$HADOOP_HOME
ENV YARN_HOME=$HADOOP_HOME
ENV HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
ENV PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
ENV HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
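# Note: ENV cannot run commands, so this stores the literal text, not the expanded classpath.
# To get the real value, run `export CLASSPATH=$(hadoop classpath --glob)` when the container starts.
ENV CLASSPATH="$HADOOP_HOME/bin/hadoop classpath --glob"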
ENV ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native
ADD pyarrow-app.py /app/
CMD ["python3", "-u", "/app/pyarrow-app.py"]
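
The Dockerfile above copies in pyarrow-app.py but does not show it. Below is a minimal sketch of what such a script might contain. The function names are hypothetical, and the host and port are simply the values from the question. It resolves CLASSPATH at runtime by calling `hadoop classpath --glob`, because libhdfs starts a JVM that needs the Hadoop jars on its classpath.

```python
import os
import subprocess


def hadoop_classpath(hadoop_home, run=subprocess.check_output):
    """Ask the hadoop launcher for its fully expanded classpath."""
    hadoop_bin = os.path.join(hadoop_home, "bin", "hadoop")
    return run([hadoop_bin, "classpath", "--glob"]).decode().strip()


def configure_libhdfs_env(hadoop_home, environ=None, run=subprocess.check_output):
    """Set the variables libhdfs reads; call this before connecting."""
    environ = os.environ if environ is None else environ
    environ["HADOOP_HOME"] = hadoop_home
    # Keep an ARROW_LIBHDFS_DIR that was already set (e.g. by the Dockerfile).
    environ.setdefault("ARROW_LIBHDFS_DIR", os.path.join(hadoop_home, "lib", "native"))
    environ["CLASSPATH"] = hadoop_classpath(hadoop_home, run)
    return environ


def main():
    # Imported lazily so the helpers above work without pyarrow installed.
    from pyarrow import fs

    configure_libhdfs_env(os.environ.get("HADOOP_HOME", "/app/hadoop-3.3.2"))
    # Host and port are copied from the question; use your NameNode's address.
    hdfs = fs.HadoopFileSystem(host="10.10.0.167", port=9870)
    print(hdfs.get_file_info("/"))
```

Calling main() at the end of the script would run the connection attempt. The `run` and `environ` parameters exist only so the helpers can be exercised without a Hadoop install.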

Unable to install Tesseract 5.0 version on AWS Lambda

I want to run Tesseract 4.0 or Tesseract 5.0 in my AWS Lambda function. My Dockerfile looks like this:
FROM public.ecr.aws/lambda/python:3.8
RUN mkdir app
# Copy function code
COPY / ${LAMBDA_TASK_ROOT}/app
# Install the function's dependencies using file requirements.txt
# from your project folder.
COPY requirements.txt .
RUN pip3 install -r requirements.txt --target ${LAMBDA_TASK_ROOT}
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum -y update
RUN yum -y install tesseract
RUN yum install -y poppler-utils
# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "app.com.emlAndMsgParser.mail_parser_test.getEmail_from_msg" ]
But when I run "docker build -t qa-lambda ." in my terminal, the output shows that Tesseract 3.0 is being installed. When I deploy the built image to AWS Lambda, it also reports that Tesseract 3.0 is installed.
I want Tesseract 4.0 or, preferably, Tesseract 5.0.
I tried changing "RUN yum -y install tesseract" in my Dockerfile to "RUN yum -y install tesseract 5.0.0-alpha-320-g8dc3", "RUN yum -y install tesseract -y", and "RUN yum -y install tesseract*".
All of them install Tesseract 3.0.
Can anyone tell me where I am going wrong?
I am fairly new to Tesseract, so any help is appreciated. Thanks!
Having the same problem, I finally created a Dockerfile myself:
FROM public.ecr.aws/lambda/java:11
# Prepare dev tools
RUN yum -y update
RUN yum -y install wget libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel
RUN yum group install -y "Development Tools"
# Build leptonica
WORKDIR /opt
RUN wget http://www.leptonica.org/source/leptonica-1.82.0.tar.gz
RUN ls -la
RUN tar -zxvf leptonica-1.82.0.tar.gz
WORKDIR ./leptonica-1.82.0
RUN ./configure
RUN make -j
RUN make install
RUN cd .. && rm leptonica-1.82.0.tar.gz
# Build tesseract
RUN wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/5.2.0.tar.gz
RUN tar -zxvf 5.2.0.tar.gz
WORKDIR ./tesseract-5.2.0
RUN ./autogen.sh
RUN PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
RUN LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make -j
RUN make install
RUN /sbin/ldconfig
RUN cd .. && rm 5.2.0.tar.gz
# Optional: install language packs
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/deu.traineddata
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
RUN mv *.traineddata /usr/local/share/tessdata
WORKDIR /root
ENTRYPOINT [ "tesseract", "--version" ]
Hope this helps!
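
To confirm which Tesseract version actually ended up in an image, the version banner can be parsed from inside the container. This is a small sketch, assuming the first line of `tesseract --version` has the form "tesseract X.Y.Z"; the function names are hypothetical.

```python
import re
import subprocess


def parse_tesseract_version(output):
    """Extract (major, minor, patch) from the text printed by `tesseract --version`."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError("no tesseract version found in output")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def installed_tesseract_version(binary="tesseract"):
    """Run the binary and parse its version, reading both stdout and stderr."""
    result = subprocess.run([binary, "--version"], capture_output=True, text=True)
    return parse_tesseract_version(result.stdout + result.stderr)
```

A handler could, for example, check `installed_tesseract_version()[0] >= 4` at startup and fail early if the older package was picked up.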

Import boto3 module error in aws batch job

I am trying to run an AWS Batch job from a Docker image and I get the error below:
ModuleNotFoundError: No module named 'boto3'
But boto3 is installed in the Dockerfile (through requirements.txt).
Dockerfile
FROM ubuntu:20.04
ENV SPARK_VERSION 2.4.8
ENV HADOOP_VERSION 3.0.0
RUN apt update
RUN apt install openjdk-8-jdk -y
RUN apt install scala -y
RUN apt install wget tar -y
#RUN wget https://apache.mirror.digitalpacific.com.au/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz
RUN wget http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
RUN wget https://downloads.apache.org/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-without-hadoop.tgz
RUN tar xfz hadoop-$HADOOP_VERSION.tar.gz
RUN mv hadoop-$HADOOP_VERSION /opt/hadoop
RUN tar xvf spark-$SPARK_VERSION-bin-without-hadoop.tgz
RUN mv spark-$SPARK_VERSION-bin-without-hadoop /opt/spark
RUN apt install software-properties-common -y
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt update && \
apt install python3.7 -y
ENV SPARK_HOME /opt/spark
ENV HADOOP_HOME /opt/hadoop
ENV HADOOP_CONF_DIR $HADOOP_HOME/etc/hadoop
ENV PATH $PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:${HADOOP_HOME}/bin
ENV PYSPARK_PYTHON /usr/bin/python3.7
RUN export SPARK_HOME=/opt/spark
RUN export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:${HADOOP_HOME}/bin
RUN export PYSPARK_PYTHON=/usr/bin/python3.7
RUN export SPARK_DIST_CLASSPATH=$(hadoop classpath)
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.7 1
RUN update-alternatives --set python /usr/bin/python3.7
RUN apt-get install python3-distutils -y
RUN apt-get install python3-apt -y
RUN apt install python3-pip -y
RUN pip3 install --upgrade pip
COPY ./pipeline_union/requirements.txt requirements.txt
#RUN python -m pip install -r requirements.txt
RUN pip3 install -r requirements.txt
#RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.10.6/aws-java-sdk-1.10.6.jar -P $SPARK_HOME/jars/
RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.874/aws-java-sdk-bundle-1.11.874.jar -P $SPARK_HOME/jars/
RUN wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.0.0/hadoop-aws-3.0.0.jar -P $SPARK_HOME/jars/
RUN wget https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar -P $SPARK_HOME/jars/
#RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar -P ${HADOOP_HOME}/share/hadoop/tools/lib/
#RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar -P ${SPARK_HOME}/jars/
# COPY datalake/spark-on-spot/src/jars $SPARK_HOME/jars
# COPY datalake/spark-on-spot/src/pipeline_union ./
# COPY datalake/spark-on-spot/src/pipeline_union/spark.conf spark.conf
COPY ./jars $SPARK_HOME/jars
COPY ./pipeline_union ./
COPY ./pipeline_union/spark.conf spark.conf
#RUN ls /usr/lib/jvm
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
ENV PATH $PATH:$HOME/bin:$JAVA_HOME/bin
RUN export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN hadoop classpath
ENV SPARK_DIST_CLASSPATH=/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/contrib/capacity-scheduler/*.jar:/opt/hadoop/share/hadoop/tools/lib/*
ENTRYPOINT ["spark-submit", "--properties-file", "spark.conf"]
#ENTRYPOINT ["spark-submit", "--packages", "org.apache.hadoop:hadoop-aws:2.8.5"]
#ENTRYPOINT ["spark-submit", "--properties-file", "spark.conf", "--packages", "org.apache.hadoop:hadoop-aws:2.8.5"]
requirements.txt
boto3==1.13.9
botocore
colorama==0.3.9
progressbar2==3.39.3
pyarrow==1.0.1
requests
psycopg2-binary
pytz
I ran another image successfully, with two differences.
The pip line in the Dockerfile:
RUN pip install -r requirements.txt
requirements.txt
requests
boto3
psycopg2-binary
pytz
pandas
pynt
Are there any known issues with:
Using pip3 instead of pip in a Dockerfile
Pinning the boto3 version
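
One thing worth checking here, offered as a sketch rather than a confirmed diagnosis, is whether the interpreter that runs the job is the same one pip3 installed into. The Dockerfile points PYSPARK_PYTHON at /usr/bin/python3.7, while pip3 comes from the python3-pip package, which on Ubuntu 20.04 belongs to the system Python 3.8. The helper names below are hypothetical.

```python
import subprocess
import sys


def can_import(module, python=sys.executable):
    """Return True if `python -c "import <module>"` succeeds for that interpreter."""
    result = subprocess.run(
        [python, "-c", f"import {module}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def report(module, interpreters):
    """Map each interpreter path to whether it can import the module."""
    return {python: can_import(module, python) for python in interpreters}
```

Running `report("boto3", ["/usr/bin/python3", "/usr/bin/python3.7"])` inside the image would show which interpreter, if any, can see the package. If only one of them can, installing with `python3.7 -m pip install -r requirements.txt` is one way to target the interpreter the job uses.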

Run conda inside singularity

I would like to run a conda command with singularity.
The command is:
singularity exec ~/dockerimage.sif conda
It yields an error:
/.singularity.d/actions/exec: 9: exec: conda: Permission denied
Here is my dockerfile:
FROM ubuntu:20.04
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y apt-utils wget=1.20.3-1ubuntu1 python3.8=3.8.2-1ubuntu1.2 python3-pip=20.0.2-5ubuntu1 python3-yaml=5.3.1-1 git=1:2.25.1-1ubuntu3
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.8.3-Linux-x86_64.sh && chmod +x Miniconda3-py38_4.8.3-Linux-x86_64.sh && ./Miniconda3-py38_4.8.3-Linux-x86_64.sh -b && cp /root/miniconda3/bin/conda /usr/bin/conda
RUN wget https://data.qiime2.org/distro/core/qiime2-2020.8-py36-linux-conda.yml && conda env create -n qiime2-2020.8 --file qiime2-2020.8-py36-linux-conda.yml && conda install -y -n qiime2-2020.8 -c conda-forge -c bioconda -c qiime2 -c defaults q2cli q2template q2-types q2-feature-table q2-metadata vsearch snakemake
What should I add to the Dockerfile? How would it work?
You're installing conda with its default settings, which puts it in the home directory of the current user. During the build that user is root, so everything lands under /root/miniconda3. Singularity runs as your own user, so unless you're running as root the conda files will not be accessible. To fix this:
Modify the Miniconda installer command to set the install prefix: -p /opt/conda (or some other location outside /root).
Make sure that any user can access the files installed with conda: chmod -R o+rX /opt/conda
Update PATH to include conda: export PATH="$PATH:/opt/conda/bin" (in a Dockerfile, use ENV PATH="$PATH:/opt/conda/bin", because an export inside a RUN step does not persist).
When running your image, make sure your host environment variables are not overriding those in the container: singularity exec --cleanenv ~/dockerimage.sif conda
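
To see why a command that exists on PATH can still fail with "Permission denied", it helps to look at the interpreter named on its #! line. The sketch below assumes the copied /usr/bin/conda is a script whose #! line still points into /root/miniconda3. That is how conda's entry point is typically generated, but it is an assumption here, and the helper names are hypothetical.

```python
import os
import shutil


def read_shebang(path):
    """Return the interpreter named on a script's #! line, or None."""
    with open(path, "rb") as handle:
        first_line = handle.readline().decode(errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    return parts[0] if parts else None


def diagnose(command, search_path=None):
    """Report whether the current user can actually run `command`."""
    path = shutil.which(command, path=search_path)
    if path is None:
        return {"found": False}
    interpreter = read_shebang(path)
    return {
        "found": True,
        "path": path,
        "interpreter": interpreter,
        # A script is only runnable if its interpreter is reachable too.
        "interpreter_ok": interpreter is None or os.access(interpreter, os.X_OK),
    }
```

Running `diagnose("conda")` as a non-root user inside the container would show whether the interpreter behind the conda script is reachable. After reinstalling under /opt/conda with the permissions above, `interpreter_ok` should become true.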

macOS, Dockerfile mounting a folder cannot change locale

I'm trying to mount a folder into my container instead of copying it at build time. We use git for development and I don't want to rebuild the image every time I make a change for testing.
My Dockerfile now looks like this:
#set base image
FROM centos:centos7.2.1511
MAINTAINER Alex <alex#app.com>
#install yum dependencies
RUN yum -y update \
&& yum -y install yum-plugin-ovl \
&& yum -y install epel-release \
&& yum -y install net-tools \
&& yum -y install gcc \
&& yum -y install python-devel \
&& yum -y install git \
&& yum -y install python-pip \
&& yum -y install openldap-devel \
&& yum -y install gcc gcc-c++ kernel-devel \
&& yum -y install libxslt-devel libffi-devel openssl-devel \
&& yum -y install libevent-devel \
&& yum -y install openldap-devel \
&& yum -y install net-snmp-devel \
&& yum -y install mysql-devel \
&& yum -y install python-dateutil \
&& yum -y install python-pip \
&& pip install --upgrade pip
# Create the DIR
#RUN mkdir -p /var/www/itapp
# Set the working directory
#WORKDIR /var/www/itapp
# Copy the app directory contents into the container
#ADD . /var/www/itapp
# Install any needed packages specified in requirements.txt
#RUN pip install -r requirements.txt
# Make port available to the world outside this container
EXPOSE 8000
# Define environment variable
ENV NAME itapp
# Run server when the container launches
CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]
I've commented out the creation and copy of the itapp Django files because I want to mount them instead. (Do I need to rebuild the image first?)
My command for mounting is:
docker run -it -v /Users/alex/itapp:/var/www/itapp itapp bash
I now get an error:
bash: warning: setlocale: LC_CTYPE: cannot change locale (en_US.UTF-8): No such file or directory
bash: warning: setlocale: LC_COLLATE: cannot change locale (en_US.UTF-8): No such file or directory
bash: warning: setlocale: LC_MESSAGES: cannot change locale (en_US.UTF-8): No such file or directory
bash: warning: setlocale: LC_NUMERIC: cannot change locale (en_US.UTF-8): No such file or directory
bash: warning: setlocale: LC_TIME: cannot change locale (en_US.UTF-8): No such file or directory
and the dev instance does not run.
How would I also set the working directory to the volume that I'm mounting at runtime?
Try this command. The -w option of docker run sets the working directory inside the container.
docker run -d -v /Users/alex/itapp:/var/www/itapp -w /var/www/itapp itapp
Also, you'll need to map the container port to a host port to be able to reach your app, for example from a browser.
To do this, use the following command.
docker run -d -p 8000:8000 -v /Users/alex/itapp:/var/www/itapp -w /var/www/itapp itapp
After this, your app should be reachable at localhost:8000.
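
If the dev container is launched from a script, the same flags can be assembled programmatically. This is a minimal sketch with a hypothetical helper name. It only builds the argument list and does not call Docker itself.

```python
import shlex


def docker_run_args(image, host_dir, container_dir, port=None, detach=True):
    """Build the argv for `docker run` with a bind mount as the working directory."""
    args = ["docker", "run"]
    if detach:
        args.append("-d")
    if port is not None:
        args += ["-p", f"{port}:{port}"]  # host port : container port
    # Mount the host directory and start the container inside it.
    args += ["-v", f"{host_dir}:{container_dir}", "-w", container_dir, image]
    return args
```

Passing the result to `subprocess.run(...)` starts the container, and `shlex.join(...)` prints it as a copy-pasteable command line.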

Why does CMD never work in my Dockerfiles?

I have a few Dockerfiles where CMD doesn't seem to run. Here is an example (all the way at the bottom).
##########################################################
# Set the base image to Ansible
FROM ubuntu:16.10
# Install Ansible, Python and Related Deps #
RUN apt-get -y update && \
apt-get install -y python-yaml python-jinja2 python-httplib2 python-keyczar python-paramiko python-setuptools python-pkg-resources git python-pip
RUN mkdir /etc/ansible/
RUN echo '[local]\nlocalhost\n' > /etc/ansible/hosts
RUN mkdir /opt/ansible/
RUN git clone http://github.com/ansible/ansible.git /opt/ansible/ansible
WORKDIR /opt/ansible/ansible
RUN git submodule update --init
ENV PATH /opt/ansible/ansible/bin:/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin
ENV PYTHONPATH /opt/ansible/ansible/lib
ENV ANSIBLE_LIBRARY /opt/ansible/ansible/library
RUN apt-get update -y
RUN apt-get install python -y
RUN apt-get install python-dev -y
RUN apt-get install python-setuptools -y
RUN apt-get install python-pip
RUN mkdir /ansible/
WORKDIR /ansible
COPY ./ansible ./
WORKDIR /
RUN ansible-playbook -c local ansible/playbooks/installdjango.yml
ENV PROJECTNAME testwebsite
################## SETUP DIRECTORY STRUCTURE ######################
WORKDIR /home
CMD ["django-admin" "startproject" "$PROJECTNAME"]
EXPOSE 8000
If I build and run the container, I can manually run
django-admin startproject $PROJECTNAME and it creates a new project as expected. The CMD in my Dockerfile does not seem to do anything, though, and the same happens with all my other Dockerfiles, so there must be something I'm not getting.
ENTRYPOINT and CMD define the default command that Docker runs when it starts your container, not when the image is built. When ENTRYPOINT isn't defined, Docker simply runs the value of CMD. Otherwise, CMD becomes the arguments to the ENTRYPOINT. When you run your image, you can override the value of CMD by passing arguments after the image name.
So, in your example above, CMD may be defined as anything, but when you run your container with docker run -it <imagename> /bin/bash, you override any value of CMD and replace it with /bin/bash. To run the defined value of CMD, you would need to run the container with docker run <imagename>.
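
The rule described above can be written down as a tiny model. This is a simplified sketch with a hypothetical function name. It covers only exec-form (JSON array) instructions and ignores shell form and the --entrypoint flag.

```python
def effective_command(entrypoint=None, cmd=None, run_args=None):
    """Model which argv a container starts with (exec-form instructions only)."""
    # Arguments after the image name on `docker run` replace CMD entirely.
    if run_args:
        cmd = run_args
    # ENTRYPOINT, if any, comes first; CMD supplies the remaining arguments.
    return list(entrypoint or []) + list(cmd or [])
```

With `cmd=["django-admin", "startproject", "testwebsite"]` and `run_args=["/bin/bash"]`, the model returns `["/bin/bash"]`: the CMD from the Dockerfile never runs. With no run arguments, the CMD is what starts.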