Unable to install Tesseract 5.0 on AWS Lambda - amazon-web-services

I want to run Tesseract 4.0 or Tesseract 5.0 in my AWS Lambda function, so I have a Dockerfile like this:
FROM public.ecr.aws/lambda/python:3.8
RUN mkdir app
# Copy function code
COPY / ${LAMBDA_TASK_ROOT}/app
# Install the function's dependencies using file requirements.txt
# from your project folder.
COPY requirements.txt .
RUN pip3 install -r requirements.txt --target ${LAMBDA_TASK_ROOT}
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum -y update
RUN yum -y install tesseract
RUN yum install -y poppler-utils
# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "app.com.emlAndMsgParser.mail_parser_test.getEmail_from_msg" ]
But when I run docker build -t qa-lambda . in my terminal, it says Tesseract 3.0 is being installed. When I deploy the built Docker image to AWS Lambda, it also says Tesseract 3.0 is installed.
But I want Tesseract 4.0, or preferably Tesseract 5.0.
I tried changing the "RUN yum -y install tesseract" line in my Dockerfile to "RUN yum -y install tesseract 5.0.0-alpha-320-g8dc3", "RUN yum -y install tesseract -y", and "RUN yum -y install tesseract*".
But all of them install Tesseract 3.0.
Can anyone please tell me where I am going wrong?
I am a bit new to Tesseract, so any help is appreciated. Thanks!
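For reference, a quick way to confirm which Tesseract version the repos enabled in the image actually offer is to query yum from the built image (a diagnostic sketch; qa-lambda is the tag from the build command above):
docker run --rm --entrypoint yum qa-lambda list available 'tesseract*'
docker run --rm --entrypoint yum qa-lambda info tesseract
If the enabled repositories only carry a 3.x build, pinning a 5.x version string in yum cannot help; Tesseract 4/5 has to come from somewhere else, such as building from source as in the answer below.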

Having the same problem, I finally created a Dockerfile myself:
FROM public.ecr.aws/lambda/java:11
# Prepare dev tools
RUN yum -y update
RUN yum -y install wget libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel
RUN yum group install -y "Development Tools"
# Build leptonica
WORKDIR /opt
RUN wget http://www.leptonica.org/source/leptonica-1.82.0.tar.gz
RUN ls -la
RUN tar -zxvf leptonica-1.82.0.tar.gz
WORKDIR ./leptonica-1.82.0
RUN ./configure
RUN make -j
RUN make install
RUN cd .. && rm leptonica-1.82.0.tar.gz
# Build tesseract
RUN wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/5.2.0.tar.gz
RUN tar -zxvf 5.2.0.tar.gz
WORKDIR ./tesseract-5.2.0
RUN ./autogen.sh
RUN PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
RUN LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make -j
RUN make install
RUN /sbin/ldconfig
RUN cd .. && rm 5.2.0.tar.gz
# Optional: install language packs
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/deu.traineddata
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
RUN mv *.traineddata /usr/local/share/tessdata
WORKDIR /root
ENTRYPOINT [ "tesseract", "--version" ]
Hope this helps!

Related

Import boto3 module error in AWS Batch job

I was trying to run a batch job on an image in AWS and am getting the error below:
ModuleNotFoundError: No module named 'boto3'
But boto3 is installed in the Dockerfile (via requirements.txt).
Dockerfile
FROM ubuntu:20.04
ENV SPARK_VERSION 2.4.8
ENV HADOOP_VERSION 3.0.0
RUN apt update
RUN apt install openjdk-8-jdk -y
RUN apt install scala -y
RUN apt install wget tar -y
#RUN wget https://apache.mirror.digitalpacific.com.au/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz
RUN wget http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
RUN wget https://downloads.apache.org/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-without-hadoop.tgz
RUN tar xfz hadoop-$HADOOP_VERSION.tar.gz
RUN mv hadoop-$HADOOP_VERSION /opt/hadoop
RUN tar xvf spark-$SPARK_VERSION-bin-without-hadoop.tgz
RUN mv spark-$SPARK_VERSION-bin-without-hadoop /opt/spark
RUN apt install software-properties-common -y
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt update && \
    apt install python3.7 -y
ENV SPARK_HOME /opt/spark
ENV HADOOP_HOME /opt/hadoop
ENV HADOOP_CONF_DIR $HADOOP_HOME/etc/hadoop
ENV PATH $PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:${HADOOP_HOME}/bin
ENV PYSPARK_PYTHON /usr/bin/python3.7
RUN export SPARK_HOME=/opt/spark
RUN export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:${HADOOP_HOME}/bin
RUN export PYSPARK_PYTHON=/usr/bin/python3.7
RUN export SPARK_DIST_CLASSPATH=$(hadoop classpath)
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.7 1
RUN update-alternatives --set python /usr/bin/python3.7
RUN apt-get install python3-distutils -y
RUN apt-get install python3-apt -y
RUN apt install python3-pip -y
RUN pip3 install --upgrade pip
COPY ./pipeline_union/requirements.txt requirements.txt
#RUN python -m pip install -r requirements.txt
RUN pip3 install -r requirements.txt
#RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.10.6/aws-java-sdk-1.10.6.jar -P $SPARK_HOME/jars/
RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.874/aws-java-sdk-bundle-1.11.874.jar -P $SPARK_HOME/jars/
RUN wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.0.0/hadoop-aws-3.0.0.jar -P $SPARK_HOME/jars/
RUN wget https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar -P $SPARK_HOME/jars/
#RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar -P ${HADOOP_HOME}/share/hadoop/tools/lib/
#RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar -P ${SPARK_HOME}/jars/
# COPY datalake/spark-on-spot/src/jars $SPARK_HOME/jars
# COPY datalake/spark-on-spot/src/pipeline_union ./
# COPY datalake/spark-on-spot/src/pipeline_union/spark.conf spark.conf
COPY ./jars $SPARK_HOME/jars
COPY ./pipeline_union ./
COPY ./pipeline_union/spark.conf spark.conf
#RUN ls /usr/lib/jvm
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
ENV PATH $PATH:$HOME/bin:$JAVA_HOME/bin
RUN export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN hadoop classpath
ENV SPARK_DIST_CLASSPATH=/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/contrib/capacity-scheduler/*.jar:/opt/hadoop/share/hadoop/tools/lib/*
ENTRYPOINT ["spark-submit", "--properties-file", "spark.conf"]
#ENTRYPOINT ["spark-submit", "--packages", "org.apache.hadoop:hadoop-aws:2.8.5"]
#ENTRYPOINT ["spark-submit", "--properties-file", "spark.conf", "--packages", "org.apache.hadoop:hadoop-aws:2.8.5"]
requirements.txt
boto3==1.13.9
botocore
colorama==0.3.9
progressbar2==3.39.3
pyarrow==1.0.1
requests
psycopg2-binary
pytz
I ran another image successfully, with two differences:
Code line in the Dockerfile:
RUN pip install -r requirements.txt
requirements.txt
requests
boto3
psycopg2-binary
pytz
pandas
pynt
Are there any known issues with:
Using pip3 in the Dockerfile instead of pip
Specifying the boto3 version
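One thing worth checking (a hedged diagnostic sketch, not a confirmed fix; your-batch-image is a placeholder for the image tag) is whether the pip3 that apt installs targets the same interpreter that PYSPARK_PYTHON points to. On Ubuntu 20.04, apt's python3-pip is wired to the distribution's default python3 (3.8), while the Dockerfile sets PYSPARK_PYTHON to /usr/bin/python3.7, so boto3 may land in a different site-packages than the one Spark actually uses:
docker run --rm --entrypoint bash your-batch-image -c 'pip3 --version; /usr/bin/python3.7 -m pip show boto3; /usr/bin/python3.7 -c "import boto3"'
If the import fails only under python3.7, installing the requirements with /usr/bin/python3.7 -m pip install -r requirements.txt would be the next thing to try.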

FreeTDS inside Docker throwing undefined symbol exception

I'm trying to run a CGI website in Docker.
The software is written in C++ and uses the freetds-dev package to connect to an MSSQL database.
So far it's working as it should; the only problem is that if I try to compile or run it inside the Docker container, I receive the following exception:
undefined symbol: dbprcollen
Now, I know where the exact line of code is, and I also know that this particular function should be provided by the freetds-dev package.
So I've included this package in the Dockerfile, but it still won't work. Does anyone know what I am missing?
Here is my Dockerfile:
FROM php:apache
COPY ./html/ /var/www/html/
COPY ./work.cgi /var/www/html/work.cgi
RUN chmod +x /var/www/html/work.cgi
RUN a2enmod rewrite
RUN echo "<Directory /var/www/html/>\n\
AllowOverride all\n\
Options +ExecCGI\n\
AddHandler cgi-script .cgi\n\
</Directory>" >> /etc/apache2/apache2.conf
RUN ln -s /etc/apache2/mods-available/cgi.load /etc/apache2/mods-enabled/cgi.load
RUN apt update -y
RUN apt upgrade -y
RUN apt install build-essential -y
RUN apt install binutils -y
RUN apt install libcgicc-dev -y
RUN apt install freetds-bin -y
RUN apt install freetds-dev -y
RUN apt update --fix-missing -y
RUN apt upgrade -y
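A hedged diagnostic you can run inside the built container (the library path shown is the usual Debian location and may differ): check which FreeTDS db-lib library the CGI binary resolves at runtime and whether that library actually exports dbprcollen. If the symbol is missing from the installed libsybdb, the binary was built against a newer FreeTDS than the one the image provides:
docker run --rm -it <imagename> bash -c 'ldd /var/www/html/work.cgi | grep -i sybdb; nm -D /usr/lib/x86_64-linux-gnu/libsybdb.so.5 | grep -i dbprcollen'
nm comes from the binutils package already installed above.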

Deploying a Geodjango Application on AWS Elastic Beanstalk

I'm trying to deploy a geodjango application on AWS Elastic Beanstalk. The configuration is 64bit Amazon Linux 2017.09 v2.6.6 running Python 3.6. I am getting this error when trying to deploy:
Error: Package: gdal-java-1.9.2-8.rhel6.x86_64 (pgdg93)
       Requires: libpoppler.so.5()(64bit)
How do I install the required package? I read through Setting up Django with GeoDjango Support in AWS Beanstalk or EC2 Instance, but I am still having problems. My .ebextensions config currently looks like:
commands:
  01_yum_update:
    command: sudo yum -y update
  02_epel_repo:
    command: sudo yum-config-manager -y --enable epel
  03_install_gdal_packages:
    command: sudo yum -y install gdal gdal-devel
packages:
  yum:
    git: []
    postgresql95-devel: []
    gettext: []
    libjpeg-turbo-devel: []
    libffi-devel: []
I'm going to answer my own question for the sake of my future projects and anyone else trying to get started with GeoDjango. Updated as of July 2020.
Create an ebextensions file to install GDAL on the EC2 instance at deployment:
01_gdal.config
commands:
  01_install_gdal:
    test: "[ ! -d /usr/local/gdal ]"
    command: "/tmp/gdal_install.sh"
files:
  "/tmp/gdal_install.sh":
    mode: "000755"
    owner: root
    group: root
    content: |
      #!/usr/bin/env bash
      sudo yum-config-manager --enable epel
      sudo yum -y install make automake gcc gcc-c++ libcurl-devel proj-devel geos-devel
      # Geos
      cd /
      sudo mkdir -p /usr/local/geos
      cd usr/local/geos
      sudo wget -O geos-3.7.2.tar.bz2 http://download.osgeo.org/geos/geos-3.7.2.tar.bz2
      sudo tar -xvf geos-3.7.2.tar.bz2
      cd geos-3.7.2
      sudo ./configure
      sudo make
      sudo make install
      sudo ldconfig
      # Proj4
      cd /
      sudo mkdir -p /usr/local/proj
      cd usr/local/proj
      sudo wget -O proj-5.2.0.tar.gz http://download.osgeo.org/proj/proj-5.2.0.tar.gz
      sudo wget -O proj-datumgrid-1.8.tar.gz http://download.osgeo.org/proj/proj-datumgrid-1.8.tar.gz
      sudo tar xvf proj-5.2.0.tar.gz
      sudo tar xvf proj-datumgrid-1.8.tar.gz
      cd proj-5.2.0
      sudo ./configure
      sudo make
      sudo make install
      sudo ldconfig
      # GDAL
      cd /
      sudo mkdir -p /usr/local/gdal
      cd usr/local/gdal
      sudo wget -O gdal-2.4.4.tar.gz http://download.osgeo.org/gdal/2.4.4/gdal-2.4.4.tar.gz
      sudo tar xvf gdal-2.4.4.tar.gz
      cd gdal-2.4.4
      sudo ./configure
      sudo make
      sudo make install
      sudo ldconfig
As shown, the script checks whether GDAL already exists using the test option. It then downloads the GEOS, Proj, and GDAL sources and installs them under /usr/local. At the time of writing, GeoDjango (Django 3.0) supports up to GEOS 3.7, Proj 5.2 (which also requires proj-datumgrid; current releases do not require it), and GDAL 2.4. Warning: this installation process can take a long time. Also, I am not a Linux professional, so some of those commands may be redundant, but it works.
Lastly I add the following two environment variables to my Elastic Beanstalk configuration:
LD_LIBRARY_PATH: /usr/local/lib:$LD_LIBRARY_PATH
PROJ_LIB: /usr/local/proj
If you still have trouble, I recommend checking the logs and ssh-ing into the EC2 instance to check that the installation took place. Original credit to this post.
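As a quick check from that SSH session (a sketch of what to look for, assuming the install script above completed), the freshly built tools should be on /usr/local and report the expected versions:
gdal-config --version    # expect 2.4.4
geos-config --version    # expect 3.7.2
ldconfig -p | grep -E 'libgdal|libgeos|libproj'
If gdal-config is missing, the script most likely never ran; the cfn-init and eb-activity logs under /var/log will show why.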

Why does CMD never work in my Dockerfiles?

I have a few Dockerfiles where CMD doesn't seem to run. Here is an example (all the way at the bottom).
##########################################################
# Set the base image to Ansible
FROM ubuntu:16.10
# Install Ansible, Python and Related Deps #
RUN apt-get -y update && \
apt-get install -y python-yaml python-jinja2 python-httplib2 python-keyczar python-paramiko python-setuptools python-pkg-resources git python-pip
RUN mkdir /etc/ansible/
RUN echo '[local]\nlocalhost\n' > /etc/ansible/hosts
RUN mkdir /opt/ansible/
RUN git clone http://github.com/ansible/ansible.git /opt/ansible/ansible
WORKDIR /opt/ansible/ansible
RUN git submodule update --init
ENV PATH /opt/ansible/ansible/bin:/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin
ENV PYTHONPATH /opt/ansible/ansible/lib
ENV ANSIBLE_LIBRARY /opt/ansible/ansible/library
RUN apt-get update -y
RUN apt-get install python -y
RUN apt-get install python-dev -y
RUN apt-get install python-setuptools -y
RUN apt-get install python-pip
RUN mkdir /ansible/
WORKDIR /ansible
COPY ./ansible ./
WORKDIR /
RUN ansible-playbook -c local ansible/playbooks/installdjango.yml
ENV PROJECTNAME testwebsite
################## SETUP DIRECTORY STRUCTURE ######################
WORKDIR /home
CMD ["django-admin" "startproject" "$PROJECTNAME"]
EXPOSE 8000
If I build and run the container, I can manually run
django-admin startproject $PROJECTNAME and it will create a new project as expected, but the CMD in my Dockerfile does not seem to do anything. This is happening with all my other Dockerfiles, so there's something I must not be getting.
ENTRYPOINT and CMD define the default command that Docker runs when it starts your container, not when the image is built. When ENTRYPOINT isn't defined, Docker simply runs the value of CMD. Otherwise, CMD becomes the arguments to the ENTRYPOINT. When you run your image, you can override the value of CMD by passing arguments after the image name.
So, in your example above, CMD may be defined as anything, but when you run your container with docker run -it <imagename> /bin/bash, you override any value of CMD and replace it with /bin/bash. To run the defined value of CMD, you would need to run the container with docker run <imagename>.
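If it is ever unclear what will actually run at container start, you can inspect the image's configured ENTRYPOINT and CMD directly (a small diagnostic sketch; <imagename> is your image tag):
docker inspect --format 'Entrypoint: {{.Config.Entrypoint}} Cmd: {{.Config.Cmd}}' <imagename>
docker run <imagename>
docker run -it <imagename> /bin/bash
The first docker run executes the defined CMD/ENTRYPOINT; the second replaces CMD with /bin/bash, which is exactly why the startproject command never appears to run in that case.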

Fastest way to install Tesseract on Elastic Beanstalk

I am currently using Tika to extract text from files uploaded to my Rails app running on AWS Elastic Beanstalk (64bit Amazon Linux 2016.03 v2.1.2 running Ruby 2.2). I'd like to index scanned images as well, so I need to install Tesseract.
I was able to get it to work by installing it from source like so, but it added 10 minutes to my deploys to a fresh instance. Is there a faster way to do this?
.ebextensions/02-tesseract.config
packages:
  yum:
    autoconf: []
    automake: []
    libtool: []
    libpng-devel: []
    libtiff-devel: []
    zlib-devel: []
container_commands:
  01-command:
    command: mkdir -p install
    cwd: /home/ec2-user
  02-command:
    command: cp .ebextensions/scripts/install_tesseract.sh /home/ec2-user/install/
  03-command:
    command: bash install/install_tesseract.sh
    cwd: /home/ec2-user
.ebextensions/scripts/install_tesseract.sh
#!/usr/bin/env bash
cd_to_install () {
  cd /home/ec2-user/install
}
cd_to () {
  cd /home/ec2-user/install/$1
}
if ! [ -x "$(command -v tesseract)" ]; then
  # Add `usr/local/bin` to PATH
  echo 'pathmunge /usr/local/bin' > /etc/profile.d/usr_local.sh
  chmod +x /etc/profile.d/usr_local.sh
  # Install leptonica
  cd_to_install
  wget http://www.leptonica.org/source/leptonica-1.73.tar.gz
  tar -zxvf leptonica-1.73.tar.gz
  cd_to leptonica-1.73
  ./configure
  make
  make install
  rm -rf /home/ec2-user/install/leptonica-1.73.tar.gz
  rm -rf /home/ec2-user/install/leptonica-1.73
  # Install tesseract ~ the jewel of Odin's treasure room
  cd_to_install
  wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
  tar -zxvf 3.04.01.tar.gz
  cd_to tesseract-3.04.01
  ./autogen.sh
  ./configure
  make
  make install
  ldconfig
  rm -rf /home/ec2-user/install/3.04.01.tar.gz
  rm -rf /home/ec2-user/install/tesseract-3.04.01
  # Install tessdata
  cd_to_install
  wget https://github.com/tesseract-ocr/tessdata/archive/3.04.00.tar.gz
  tar -zxvf 3.04.00.tar.gz
  cp /home/ec2-user/install/tessdata-3.04.00/eng.* /usr/local/share/tessdata/
  rm -rf /home/ec2-user/install/3.04.00.tar.gz
  rm -rf /home/ec2-user/install/tessdata-3.04.00
fi
Short answer
.ebextensions/02-tesseract.config
commands:
  01-libwebp:
    command: "yum --enablerepo=epel --disablerepo=amzn-main -y install libwebp"
  02-tesseract:
    command: "yum --enablerepo=epel -y install tesseract"
Long answer
I'm not familiar with non-Ubuntu package managers or ebextensions, so after some digging, I found that there are precompiled binaries that can be installed on Amazon Linux in the stable EPEL repo.
The first obstacle was figuring out how to use the EPEL repo. The easiest way is to use the enablerepo option on the yum command.
That gets us here:
yum --enablerepo=epel install tesseract
Next, I had to resolve this dependency error:
[root@ip-10-0-1-193 ec2-user]# yum install --enablerepo=epel tesseract
Loaded plugins: priorities, update-motd, upgrade-helper
951 packages excluded due to repository priority protections
Resolving Dependencies
--> Running transaction check
---> Package tesseract.x86_64 0:3.04.00-3.el6 will be installed
--> Processing Dependency: liblept.so.4()(64bit) for package: tesseract-3.04.00-3.el6.x86_64
--> Running transaction check
---> Package leptonica.x86_64 0:1.72-2.el6 will be installed
--> Processing Dependency: libwebp.so.5()(64bit) for package: leptonica-1.72-2.el6.x86_64
--> Finished Dependency Resolution
Error: Package: leptonica-1.72-2.el6.x86_64 (epel)
Requires: libwebp.so.5()(64bit)
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
I found the solution here
Just adding the epel repo doesn't solve it, as the packages in the amzn-main repository seem to overrule those in the epel repository. If the libwebp package in the amzn-main repo is excluded, it should work.
The Tesseract install has some dependencies found in the amzn-main repo. This is why I first install libwebp with --disablerepo=amzn-main.
yum --enablerepo=epel --disablerepo=amzn-main install libwebp
yum --enablerepo=epel install tesseract
Finally, here's how you can install yum packages on Elastic Beanstalk with options:
.ebextensions/02-tesseract.config
commands:
  01-libwebp:
    command: "yum --enablerepo=epel --disablerepo=amzn-main -y install libwebp"
  02-tesseract:
    command: "yum --enablerepo=epel -y install tesseract"
Fortunately, this is also the easiest way to install Tesseract on Elastic Beanstalk!
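After deploying with this config, a quick way to confirm the install worked (a sketch; assumes the EB CLI is set up for the environment) is to SSH into an instance and ask Tesseract for its version and available languages:
eb ssh
tesseract --version
tesseract --list-langs
--list-langs shows which trained language data came with the package; additional languages are typically packaged separately in EPEL (e.g. tesseract-langpack-*) and can be added to the same commands block.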