Tesseract OCR on AWS Lambda via virtualenv
Scroll to Adaptations for tesseract 4.
I used this link to build the tesseract executable and its dependency libraries, then zipped everything and uploaded it to S3.
My Lambda function downloads this zip and extracts the dependencies into the /tmp folder. I now want to use these dependencies in my Lambda function (Python 3 runtime).
I am getting this error
Response:
{
"errorMessage": "tesseract is not installed or it's not in your path",
"errorType": "TesseractNotFoundError",
This happens because the environment variable is not set.
I have tried to set it but cannot get past this error.
import os
import sys
# Setting the modules path
sys.path.insert(0, '/tmp/')
import boto3
import cv2
import numpy as np
import subprocess
os.environ['PATH'] = "{}:/tmp/pytesseract:/tmp/".format(os.environ['PATH'])
os.environ['TESSDATA_PREFIX'] = "/tmp/tessdata/"
import pytesseract
I have set the environment variables like this in the Lambda function, but I still get the same error. I have also tried setting the variables as shown in the image below, with no luck.
I am sure this Lambda package works because I created a new EC2 instance, downloaded the same zip file, and extracted the libraries into the /tmp/ folder. I wrote a basic test function for tesseract, and it works.
import cv2
import pytesseract
import os
# os.environ['PATH'] = "{}:/tmp/pytesseract:/tmp/".format(os.environ['PATH'])
os.environ['LD_LIBRARY_PATH'] = '/tmp/lib:/tmp'
config = ('-l eng --oem 1 --psm 3')
im = cv2.imread('pytesseract/test-european.jpg', cv2.IMREAD_COLOR)
text = pytesseract.image_to_string(im, config=config)
print(text)
Can somebody tell me what I did wrong with Lambda?
I don't want to zip everything because my zip file is greater than 50 MB. Also I want to try downloading the packages/modules/binaries from S3 to lambda and make it work.
Apparently lambda doesn't allow you to make changes to the PATH variable.
Try adding this to your script
pytesseract.pytesseract.tesseract_cmd = r'/var/task/tesseract'
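Since the question extracts the archive into /tmp rather than /var/task, the same idea can be sketched for that layout. This is only a minimal sketch under assumptions: the helper name `configure_tesseract` and the paths `/tmp/tesseract`, `/tmp/tessdata/` and `/tmp/lib` are hypothetical and must match wherever the binary, language data and shared libraries actually end up. It also restores the execute bit, which can be lost when files are unpacked from a zip.

```python
import os
import stat


def configure_tesseract(binary_path, tessdata_dir, lib_dir):
    """Prepare an extracted tesseract binary so a child process can run it.

    All three arguments are assumptions about where the archive was unpacked.
    """
    # Make sure the binary is executable; zip extraction may drop this bit.
    mode = os.stat(binary_path).st_mode
    os.chmod(binary_path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

    # Child processes inherit os.environ, so tesseract will see these values.
    os.environ["TESSDATA_PREFIX"] = tessdata_dir
    existing = os.environ.get("LD_LIBRARY_PATH", "")
    os.environ["LD_LIBRARY_PATH"] = lib_dir + (":" + existing if existing else "")

    return binary_path


# Usage inside the handler module (hypothetical paths):
# import pytesseract
# pytesseract.pytesseract.tesseract_cmd = configure_tesseract(
#     "/tmp/tesseract", "/tmp/tessdata/", "/tmp/lib")
```

Pointing `tesseract_cmd` at an absolute path means pytesseract does not need to find the binary through PATH at all.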
Related
I was able to run simple Python code in a Notebook instance to read and write CSV files from/to an S3 bucket. Now I want to create a SageMaker processing job that runs the same code without any input/output data configuration. I have downloaded the code and pushed the image to an ECR repository. How can I run this code in a processing job so that it can also install the 's3fs' module? I just want to run Python code in a processing job without giving any input/output algorithms or configuration. I use boto3 to read from and write to the S3 bucket. With the current code, the job is stuck in "In Progress".
downloaded code in vs code
!pip install s3fs
import boto3
import pandas as pd
from io import StringIO
client = boto3.client('s3')
path = 's3://weatheranalysis/weatherset.csv'
df = pd.read_csv(path)
df.head()
filename = 'newdata.csv'
bucketName = 'weatheranalysis'
csv_buffer = StringIO()
df.to_csv(csv_buffer)
client = boto3.client('s3')
response = client.put_object(
    ACL='private',
    Body=csv_buffer.getvalue(),
    Bucket=bucketName,
    Key=filename
)
You can do so by defining a 'sagemaker-processing-container' and setting up the ScriptProcessor from the SageMaker Python SDK to run your existing Python script, saved as preprocessing.py.
A simple example can be found here
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
script_processor = ScriptProcessor(command=['python3'],
                                   image_uri='image_uri',
                                   role='role_arn',
                                   instance_count=1,
                                   instance_type='ml.m5.xlarge')
script_processor.run(code='preprocessing.py',
                     inputs=[ProcessingInput(
                         source='s3://path/to/my/input-data.csv',
                         destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
                              ProcessingOutput(source='/opt/ml/processing/output/validation'),
                              ProcessingOutput(source='/opt/ml/processing/output/test')])
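For illustration, the script passed as `code` could look roughly like the sketch below. It is an assumption-laden example rather than the asker's actual script: the function name `copy_csv` is hypothetical, it assumes the single input file is placed in the `destination` directory under its original name `input-data.csv`, and it only copies rows where real processing would happen. It uses the standard library `csv` module to stay dependency-free.

```python
import csv
from pathlib import Path

# These paths mirror the `destination` and first output `source` used above.
INPUT_DIR = Path("/opt/ml/processing/input")
OUTPUT_DIR = Path("/opt/ml/processing/output/train")


def copy_csv(input_dir, output_dir, filename="input-data.csv"):
    """Read a CSV from the input directory and write it to the output directory.

    Returns the number of rows written (header included).
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(input_dir / filename, newline="") as src:
        rows = list(csv.reader(src))  # real transformations would go here
    with open(output_dir / filename, "w", newline="") as dst:
        csv.writer(dst).writerows(rows)
    return len(rows)


# Only run against the container paths when they exist, i.e. inside the job.
if __name__ == "__main__" and INPUT_DIR.exists():
    print(copy_csv(INPUT_DIR, OUTPUT_DIR))
```

Files written under an output `source` directory are uploaded to S3 when the job finishes, so the script itself does not need an explicit upload call for them.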
I am trying to move a Python script into Lambda so I can automate the process, but I am having problems.
The script is from SecurityHub. For the script enablesecurityhub.py, I entered the handler name as enablesecurityhub.assume_role, but I am getting the following error message:
Unable to import module 'enablesecurityhub': No module named enablesecurityhub.
I have created a zip that includes all the imports {pip install -r requirements.txt -t ./} and uploaded it to S3. The file is approximately 10 MB in size.
cat requirements.txt
boto3
argparse
utils
OrderedDict
input
But I am not sure what I am doing wrong for the handler. Please can you advise?
The same code works on my local machine, but I get the error below when I test it in AWS Lambda:
Unable to import module 'lambda_function': Missing required dependencies ['numpy']
You need to download the packages from pypi.org and include them in the zip file, so that it contains both the code (in a .py file) and the packages. A more detailed description is available at https://www.protos-technologie.de/en/2020/07/02/dependency-management-for-aws-lambda/.
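As a rough sketch of that packaging step, the helper below zips a build directory so that the handler file and the installed packages sit at the root of the archive, which is where Lambda looks for the handler module. The function name `build_deployment_zip` and the directory layout are assumptions for illustration. Note that packages with compiled parts, such as numpy, also need to be builds made for Linux, which this sketch does not handle.

```python
import os
import zipfile


def build_deployment_zip(build_dir, zip_path):
    """Zip everything under build_dir with paths relative to the archive root.

    zip_path should live outside build_dir so the archive does not include itself.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _subdirs, files in os.walk(build_dir):
            for name in files:
                full_path = os.path.join(folder, name)
                # Store "lambda_function.py", not "build/lambda_function.py".
                archive.write(full_path, os.path.relpath(full_path, build_dir))
    return zip_path


# Usage (hypothetical paths), after `pip install -r requirements.txt -t build/`:
# build_deployment_zip("build", "function.zip")
```

Zipping the parent folder instead of its contents is a common cause of "Unable to import module" errors, because the handler file then ends up one directory level too deep.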
I have master/worker EC2 instances that I'm using for Grinder tests. I need to try out a load test that directly gets files from an S3 bucket, but I'm not sure how that would look in Jython for the Grinder test script.
Any ideas or tips? I've looked into it a little and saw that Python has the boto package for working with AWS - would that work in Jython as well?
(Edit - adding code and import errors for clarification.)
Python approach:
Did "pip install boto3"
Test script:
from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
import boto3
# boto3 for Python
test1 = Test(1, "S3 request")
resource = boto3.resource('s3')
def accessS3():
    obj = resource.Object(<bucket>,<key>)
test1.record(accessS3)
class TestRunner:
    def __call__(self):
        accessS3()
The error for this is:
net.grinder.scriptengine.jython.JythonScriptExecutionException: : No module named boto3
Java approach:
Added aws-java-sdk-1.11.221 jar from .m2\repository\com\amazonaws\aws-java-sdk\1.11.221\ to CLASSPATH
from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
import com.amazonaws.services.s3 as s3
# aws s3 for Java
test1 = Test(1, "S3 request")
s3Client = s3.AmazonS3ClientBuilder.defaultClient()
test1.record(s3Client)
class TestRunner:
    def __call__(self):
        result = s3Client.getObject(s3.model.GetObjectRequest(<bucket>,<key>))
The error for this is:
net.grinder.scriptengine.jython.JythonScriptExecutionException: : No module named amazonaws
I'm also running things on a Windows computer, but I'm using Git Bash.
Given that you are using Jython, I'm not sure whether you want to execute the S3 request in Java or Python syntax.
However, I would suggest following the Python guide at the link below.
http://docs.ceph.com/docs/jewel/radosgw/s3/python/
I am using the Python module xmlsec in my Lambda function. The import looks like import dm.xmlsec.binding as xmlsec. The proper directory structure exists: at the root of the archive there is dm/xmlsec/binding/__init__.py, and the rest of the module is there. However, when executing the function on Lambda, I get the error "No module named dm.xmlsec.binding".
I have built many Python 2.7 Lambda functions in the same way as this one with no issues. I install all of the needed Python modules to my build directory, with the Lambda function at the root. I then zip the package recursively and update the existing function with the resulting archive using the AWS CLI. I've also tried manually uploading the archive in the console, with the same result.
I was honestly expecting some trouble with this module, but I did expect lambda to at least see it. What is going on?