Fastest way to install Tesseract on Elastic Beanstalk - amazon-web-services

I am currently using Tika to extract text from files uploaded to my Rails app running on AWS Elastic Beanstalk (64bit Amazon Linux 2016.03 v2.1.2 running Ruby 2.2). I'd like to index scanned images as well, so I need to install Tesseract.
I was able to get it to work by installing it from source like so, but it added 10 minutes to my deploys to a fresh instance. Is there a faster way to do this?
.ebextensions/02-tesseract.config
packages:
yum:
autoconf: []
automake: []
libtool: []
libpng-devel: []
libtiff-devel: []
zlib-devel: []
container_commands:
01-command:
command: mkdir -p install
cwd: /home/ec2-user
02-command:
command: cp .ebextensions/scripts/install_tesseract.sh /home/ec2-user/install/
03-command:
command: bash install/install_tesseract.sh
cwd: /home/ec2-user
.ebextensions/scripts/install_tesseract.sh
#!/usr/bin/env bash
cd_to_install () {
cd /home/ec2-user/install
}
cd_to () {
cd /home/ec2-user/install/$1
}
if ! [ -x "$(command -v tesseract)" ]; then
# Add `usr/local/bin` to PATH
echo 'pathmunge /usr/local/bin' > /etc/profile.d/usr_local.sh
chmod +x /etc/profile.d/usr_local.sh
# Install leptonica
cd_to_install
wget http://www.leptonica.org/source/leptonica-1.73.tar.gz
tar -zxvf leptonica-1.73.tar.gz
cd_to leptonica-1.73
./configure
make
make install
rm -rf /home/ec2-user/install/leptonica-1.73.tar.gz
rm -rf /home/ec2-user/install/leptonica-1.73
# Install tesseract ~ the jewel of Odin's treasure room
cd_to_install
wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
tar -zxvf 3.04.01.tar.gz
cd_to tesseract-3.04.01
./autogen.sh
./configure
make
make install
ldconfig
rm -rf /home/ec2-user/install/3.04.01.tar.gz
rm -rf /home/ec2-user/install/tesseract-3.04.01
# Install tessdata
cd_to_install
wget https://github.com/tesseract-ocr/tessdata/archive/3.04.00.tar.gz
tar -zxvf 3.04.00.tar.gz
cp /home/ec2-user/install/tessdata-3.04.00/eng.* /usr/local/share/tessdata/
rm -rf /home/ec2-user/install/3.04.00.tar.gz
rm -rf /home/ec2-user/install/tessdata-3.04.00
fi

Short answer
.ebextensions/02-tesseract.config
commands:
01-libwebp:
command: "yum --enablerepo=epel --disablerepo=amzn-main -y install libwebp"
02-tesseract:
command: "yum --enablerepo=epel -y install tesseract"
Long answer
I'm not familiar with non-Ubuntu package managers or ebextensions, so after some digging, I found that there are precompiled binaries that can be installed on Amazon Linux in the stable EPEL repo.
The first obstacle was figuring out how to use the EPEL repo. The easiest way is to use the enablerepo option on the yum command.
That gets us here:
yum --enablerepo=epel install tesseract
Next, I had to resolve this dependency error:
[root#ip-10-0-1-193 ec2-user]# yum install --enablerepo=epel tesseract
Loaded plugins: priorities, update-motd, upgrade-helper
951 packages excluded due to repository priority protections
Resolving Dependencies
--> Running transaction check
---> Package tesseract.x86_64 0:3.04.00-3.el6 will be installed
--> Processing Dependency: liblept.so.4()(64bit) for package: tesseract-3.04.00-3.el6.x86_64
--> Running transaction check
---> Package leptonica.x86_64 0:1.72-2.el6 will be installed
--> Processing Dependency: libwebp.so.5()(64bit) for package: leptonica-1.72-2.el6.x86_64
--> Finished Dependency Resolution
Error: Package: leptonica-1.72-2.el6.x86_64 (epel)
Requires: libwebp.so.5()(64bit)
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
I found the solution here
Just adding the epel repo doesn't solve it, as the packages in the
amzn-main repository seem to overrule those in the epel repository. If
the libwebp package in the amzn-main repo are excluded it should work
The Tesseract install has some dependencies found in the amzn-main repo. This is why I first install libwebp with --disablerepo=amzn-main.
yum --enablerepo=epel --disablerepo=amzn-main install libwebp
yum --enablerepo=epel install tesseract
Finally, here's how you can install yum packages on Elastic Beanstalk with options:
.ebextensions/02-tesseract.config
commands:
01-libwebp:
command: "yum --enablerepo=epel --disablerepo=amzn-main -y install libwebp"
02-tesseract:
command: "yum --enablerepo=epel -y install tesseract"
Fortunately, this is also the easiest way to install Tesseract on Elastic Beanstalk!

Related

Unable to install Tesseract 5.0 version on AWS Lambda

I want to run Tesseract 4.0 or Tesseract 5.0 on my AWS Lambda function. So I have my docker file like so-
FROM public.ecr.aws/lambda/python:3.8
RUN mkdir app
# Copy function code
COPY / ${LAMBDA_TASK_ROOT}/app
# Install the function's dependencies using file requirements.txt
# from your project folder.
COPY requirements.txt .
RUN pip3 install -r requirements.txt --target ${LAMBDA_TASK_ROOT}
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum -y update
RUN yum -y install tesseract
RUN yum install -y poppler-utils
# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "app.com.emlAndMsgParser.mail_parser_test.getEmail_from_msg" ]
but when i do DockerBuild-"docker build -t qa-lambda ." on my terminal, it says Tesseract 3.0 version is getting installed. When i deploy this built Docker image to AWS Lambda,it also says Tesseract 3.0 is installed.
But I want Tesseract 4.0 or preferably Tesserct 5.0.
I tried changing the "RUN yum -y install tesseract" in my Dockerfile to "RUN yum -y install tesseract 5.0.0-alpha-320-g8dc3" and "RUN yum -y install tesseract -y" or "RUN yum -y install tesseract*".
But all of them are installing Tesseract 3.0.
Please can anyone tell me where I am going wrong?
I am a bit new to Tesseract, so any help is appreciated..thanks!
Having the same problem, I finally created a Dockerfile myself:
FROM public.ecr.aws/lambda/java:11 q
# Prepare dev tools
RUN yum -y update
RUN yum -y install wget libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel
RUN yum group install -y "Development Tools"
# Build leptonica
WORKDIR /opt
RUN wget http://www.leptonica.org/source/leptonica-1.82.0.tar.gz
RUN ls -la
RUN tar -zxvf leptonica-1.82.0.tar.gz
WORKDIR ./leptonica-1.82.0
RUN ./configure
RUN make -j
RUN make install
RUN cd .. && rm leptonica-1.82.0.tar.gz
# Build tesseract
RUN wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/5.2.0.tar.gz
RUN tar -zxvf 5.2.0.tar.gz
WORKDIR ./tesseract-5.2.0
RUN ./autogen.sh
RUN PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
RUN LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make -j
RUN make install
RUN /sbin/ldconfig
RUN cd .. && rm 5.2.0.tar.gz
# Optional: install language packs
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/deu.traineddata
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
RUN mv *.traineddata /usr/local/share/tessdata
WORKDIR /root
ENTRYPOINT [ "tesseract", "--version" ]
Hope this helps!

How to fail codebuild if testing coverage is below a threshold?

I am new to AWS codeBuild and code pipeline, what I am wondering is if there is a way I can fail a pipeline or codeBuild project if the coverage of my test cases is below a certain threshold, i.e 80%?
I have an angular project configured with Karma & Jasmine.
Upon research I have found that I can use this command aws codebuild stop-build --id ${CODEBUILD_BUILD_ID}, not sure if this will help me in any way?
Buildspec.yml
version: 0.2
phases:
install:
commands:
# Install the Angular CLI
- npm install -g #angular/cli
# Install puppeteer as a dev dependency
- npm i -D puppeteer
- npm i -D #angular-devkit/build-angular
# Print out missing libs
- echo "Missing Libs" || ldd ./node_modules/puppeteer/.local-chromium/linux-549031/chrome-linux/chrome | grep not
# Upgrade apt
- apt-get upgrade
# Update libs
- apt-get update
# Install apt-transport-https
- apt-get install -y apt-transport-https
# Use apt to install the Chrome dependencies
- apt-get install -y libxcursor1
- apt-get install -y libgtk-3-dev
- apt-get install -y libxss1
- apt-get install -y libasound2
- apt-get install -y libnspr4
- apt-get install -y libnss3
- apt-get install -y libx11-xcb1
# Print out missing libs
- echo "Missing Libs" || ldd ./node_modules/puppeteer/.local-chromium/linux-549031/chrome-linux/chrome | grep not
# Install project dependencies
- npm install
# - usermod -g 0 -o codebuild-user
build:
# run-as: codebuild-user
commands:
# Run headless Chrome tests
- ng test --browsers=ChromeHeadlessNoSandbox --watch=false --code-coverage
- printenv
I wish to condition the ng tests output here to fail the pipeline.
- aws codebuild stop-build --id ${CODEBUILD_BUILD_ID}
artifacts:
files: imagedefinitions.json

EMR stuck at bootstrap script

Am trying to run a bootstrap file at EMR to installed facebook prophet which seems to have an issue requiring to install dev-tools, the bootstrap.sh simply runs
bootstrap.sh
#!/bin/bash -xe
sudo yum install python3-devel python3-libs python3-tools
#sudo yum groupinstall "Development Tools"
aws s3 cp s3://bucket/requirements.txt .
sudo python3 -m pip install --upgrade pip
sudo python3 -m pip install --upgrade -r ./requirements.txt
,output logs shows the below
and errors logs shows
but then the cluster is stuck for 1 hour before failing
I had to add -y flag to yum inorder to pypass any user prompting

Deploying a Geodjango Application on AWS Elastic Beanstalk

I'm trying to deploy a geodjango application on AWS Elastic Beanstalk. The configuration is 64bit Amazon Linux 2017.09 v2.6.6 running Python 3.6. I am getting this error when trying to deploy:
Requires: libpoppler.so.5()(64bit) Error: Package: gdal-java-1.9.2-8.rhel6.x86_64 (pgdg93) Requires: libpoppler.so.5()(64bit)
How do I install the required package? I read through Setting up Django with GeoDjango Support in AWS Beanstalk or EC2 Instance but I am still getting problems. My ebextensions currently looks like:
commands:
01_yum_update:
command: sudo yum -y update
02_epel_repo:
command: sudo yum-config-manager -y --enable epel
03_install_gdal_packages:
command: sudo yum -y install gdal gdal-devel
packages:
yum:
git: []
postgresql95-devel: []
gettext: []
libjpeg-turbo-devel: []
libffi-devel: []
I'm going to answer my own question for the sake my future projects and anyone else trying to get started with geodjango. Updating this answer as of July 2020
Create an ebextensions file to install GDAL on the EC2 instance at deployment:
01_gdal.config
commands:
01_install_gdal:
test: "[ ! -d /usr/local/gdal ]"
command: "/tmp/gdal_install.sh"
files:
"/tmp/gdal_install.sh":
mode: "000755"
owner: root
group: root
content: |
#!/usr/bin/env bash
sudo yum-config-manager --enable epel
sudo yum -y install make automake gcc gcc-c++ libcurl-devel proj-devel geos-devel
# Geos
cd /
sudo mkdir -p /usr/local/geos
cd usr/local/geos/geos-3.7.2
sudo wget geos-3.7.2.tar.bz2 http://download.osgeo.org/geos/geos-3.7.2.tar.bz2
sudo tar -xvf geos-3.7.2.tar.bz2
cd geos-3.7.2
sudo ./configure
sudo make
sudo make install
sudo ldconfig
# Proj4
cd /
sudo mkdir -p /usr/local/proj
cd usr/local/proj
sudo wget -O proj-5.2.0.tar.gz http://download.osgeo.org/proj/proj-5.2.0.tar.gz
sudo wget -O proj-datumgrid-1.8.tar.gz http://download.osgeo.org/proj/proj-datumgrid-1.8.tar.gz
sudo tar xvf proj-5.2.0.tar.gz
sudo tar xvf proj-datumgrid-1.8.tar.gz
cd proj-5.2.0
sudo ./configure
sudo make
sudo make install
sudo ldconfig
# GDAL
cd /
sudo mkdir -p /usr/local/gdal
cd usr/local/gdal
sudo wget -O gdal-2.4.4.tar.gz http://download.osgeo.org/gdal/2.4.4/gdal-2.4.4.tar.gz
sudo tar xvf gdal-2.4.4.tar.gz
cd gdal-2.4.4
sudo ./configure
sudo make
sudo make install
sudo ldconfig
As shown, the script checks whether gdal already exists using the test function. It then downloads the Geos, Proj, and GDAL libraries and installs them in the usr/local directory. At the time of writing this, geodjango (Django 3.0) supports up to Geos 3.7, Proj 5.2 (which also requires projdatum. Current releases do not require it), and GDAL 2.4 Warning: this installation process can take a long time. Also I am not a Linux professional so some of those commands may be redundant, but it works.
Lastly I add the following two environment variables to my Elastic Beanstalk configuration:
LD_LIBRARY_PATH: /usr/local/lib:$LD_LIBRARY_PATH
PROJ_LIB: usr/local/proj
If you still have troubles I recommend checking the logs and ssh-ing in the EC2 instance to check that installation took place. Original credit to this post

AWS EMR stuck bootstrapping

For a course, we are doing a basic tweet analysis using AWS EMR. I followed the steps in this document:
http://docs.aws.amazon.com/en_us/gettingstarted/latest/emr/awsgsg-emr.pdf
The only modifications are that I uploaded a pre-done set of tweets and we are told to use our own config file for the NLTK. The instructor gave us the following for the custom NLTK config:
#!/bin/bash
sudo yum -y install git gcc python-dev python-devel
sudo ln -sf /usr/bin/python2.7 /usr/bin/python
sudo easy_install pip
sudo pip install -U numpy
sudo pip install numpy
sudo easy_install -U distribute
sudo pip install -U setuptools
sudo pip install pyyaml nltk
sudo pip install -e git://github.com/mdp-toolkit/mdp-toolkit#egg=MDP
sudo python -m nltk.downloader -d /usr/share/nltk_data all
I create my cluster and, when it executes, it gets to 'bootstrapping' and has been stuck there for 45 minutes. Using AMI Version 3.11.0, no Hive, Pig, or HUE.
Please let me know if more information is needed to try to diagnose this. What could cause this?