How to run an OCR(Tesseract) on AWS Lambda using Java

How to run an OCR(Tesseract) on AWS Lambda using Java - amazon-web-services

I want to run an OCR(Tesseract) on AWS Lambda using Java.I wanted to have a basic-"Hello World" set up so I created the Handler function for AWS Lambda like so-
public class Hello implements RequestHandler<Object, String> {
#Override
public String handleRequest(Object input, Context context) {
context.getLogger().log("Inside handleRequest of Hello class PPPPp ");
context.getLogger().log("Input: " + input);
String output = "Hello, " + input + "!";
return output;
}
}
And this is working as I have tested it on the AWS Lambda console. And also Teseract is running on my local machine and tested it by passing an image and is printing the words present.
But now I am not getting how to combine these 2 (AWS Lambda and Teseract) into a single java program. I found this link which runs Teseract on Python - https://typless.com/tesseract-on-aws-lambda-ocr-as-a-service/ and tried to do the equivalent in Java, but was not able to do so as they were manipulating .py files and I am dealing with .java files.
I also came across this github link- https://github.com/tesseract-ocr/tesseract present in the AWS Lambda documentation,and this one-
https://github.com/tesseract-ocr/tesseract
but didnt understand how to use the code for my requirment.
I am stuck here and new to AWS Lambda. Any help very much appreciated. Thanks in advance.

Please see the example https://github.com/jlcorradi/tesseract-example/blob/e85b93f3481109c56bb0e7255e0116691463693f/src/main/java/com/playground/tesseract/TesseractPlaygroundLambdaHandle.java
Github: Not verified
#Log4j
public class TesseractPlaygroundLambdaHandle implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {
#Override
public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent input, Context context) {
Base64.Decoder decoder = Base64.getDecoder();
String inputBody = input.getBody();
Tesseract tesseract = new Tesseract();
//tesseract.setDatapath("/home/jorgecorradi/Downloads/");
try {
String decodedText = tesseract.doOCR(decodeToImage(inputBody));
return createResponse(decodedText);
} catch (TesseractException e) {
e.printStackTrace();
return createResponse(e.getMessage());
}
}
private APIGatewayProxyResponseEvent createResponse(String responseContent) {
APIGatewayProxyResponseEvent responseEvent = new APIGatewayProxyResponseEvent();
responseEvent.setBody(responseContent);
responseEvent.setStatusCode(200);
return responseEvent;
}
public static BufferedImage decodeToImage(String imageString) {
BufferedImage image = null;
byte[] imageByte;
try {
BASE64Decoder decoder = new BASE64Decoder();
imageByte = decoder.decodeBuffer(imageString);
ByteArrayInputStream bis = new ByteArrayInputStream(imageByte);
image = ImageIO.read(bis);
bis.close();
} catch (Exception e) {
log.error("Error decoding base64 to image", e);
}
return image;
}
}
You will need to have a project/lib management tool such as Maven for example, to allow you to specify dependencies (For Terrasact), and to create a standalone jar, that you can then upload towards your Lambda.
This is in its simplest form.
https://docs.aws.amazon.com/lambda/latest/dg/java-package.html for how to deploy a standalone zip/jar towards Lambda.

Related

How to enter data from Spring Boot Application into Amazon Kinesis?

I want to add data into kinesis using Sprint Boot Application and React. I am a complete beginner when it comes to Kinesis, AWS, etc. so a beginner friendly guide would be appriciated.

To add data records into an Amazon Kinesis data stream from a Spring BOOT app, you can use the AWS SDK for Java V2 and specifically the Amazon Kinesis Java API. You can use the software.amazon.awssdk.services.kinesis.KinesisClient.
Because you are a beginner, I recommend that you read the AWS SDK Java V2 Developer Guide to become familiar with how to work with this Java API. See Developer guide - AWS SDK for Java 2.x.
Here is a code example that shows you how to add data records using this Service Client. See Github that has the other required classes here.
package com.example.kinesis;
//snippet-start:[kinesis.java2.putrecord.import]
import software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;
import software.amazon.awssdk.services.kinesis.model.KinesisException;
import software.amazon.awssdk.services.kinesis.model.DescribeStreamRequest;
import software.amazon.awssdk.services.kinesis.model.DescribeStreamResponse;
//snippet-end:[kinesis.java2.putrecord.import]
/**
* Before running this Java V2 code example, set up your development environment, including your credentials.
*
* For more information, see the following documentation topic:
*
* https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/get-started.html
*/
public class StockTradesWriter {
public static void main(String[] args) {
final String usage = "\n" +
"Usage:\n" +
" <streamName>\n\n" +
"Where:\n" +
" streamName - The Amazon Kinesis data stream to which records are written (for example, StockTradeStream)\n\n";
if (args.length != 1) {
System.out.println(usage);
System.exit(1);
}
String streamName = args[0];
Region region = Region.US_EAST_1;
KinesisClient kinesisClient = KinesisClient.builder()
.region(region)
.credentialsProvider(ProfileCredentialsProvider.create())
.build();
// Ensure that the Kinesis Stream is valid.
validateStream(kinesisClient, streamName);
setStockData( kinesisClient, streamName);
kinesisClient.close();
}
// snippet-start:[kinesis.java2.putrecord.main]
public static void setStockData( KinesisClient kinesisClient, String streamName) {
try {
// Repeatedly send stock trades with a 100 milliseconds wait in between
StockTradeGenerator stockTradeGenerator = new StockTradeGenerator();
// Put in 50 Records for this example
int index = 50;
for (int x=0; x<index; x++){
StockTrade trade = stockTradeGenerator.getRandomTrade();
sendStockTrade(trade, kinesisClient, streamName);
Thread.sleep(100);
}
} catch (KinesisException | InterruptedException e) {
System.err.println(e.getMessage());
System.exit(1);
}
System.out.println("Done");
}
private static void sendStockTrade(StockTrade trade, KinesisClient kinesisClient,
String streamName) {
byte[] bytes = trade.toJsonAsBytes();
// The bytes could be null if there is an issue with the JSON serialization by the Jackson JSON library.
if (bytes == null) {
System.out.println("Could not get JSON bytes for stock trade");
return;
}
System.out.println("Putting trade: " + trade);
PutRecordRequest request = PutRecordRequest.builder()
.partitionKey(trade.getTickerSymbol()) // We use the ticker symbol as the partition key, explained in the Supplemental Information section below.
.streamName(streamName)
.data(SdkBytes.fromByteArray(bytes))
.build();
try {
kinesisClient.putRecord(request);
} catch (KinesisException e) {
e.getMessage();
}
}
private static void validateStream(KinesisClient kinesisClient, String streamName) {
try {
DescribeStreamRequest describeStreamRequest = DescribeStreamRequest.builder()
.streamName(streamName)
.build();
DescribeStreamResponse describeStreamResponse = kinesisClient.describeStream(describeStreamRequest);
if(!describeStreamResponse.streamDescription().streamStatus().toString().equals("ACTIVE")) {
System.err.println("Stream " + streamName + " is not active. Please wait a few moments and try again.");
System.exit(1);
}
}catch (KinesisException e) {
System.err.println("Error found while describing the stream " + streamName);
System.err.println(e);
System.exit(1);
}
}
// snippet-end:[kinesis.java2.putrecord.main]
}

fetch batch translate job details using SDK

I was trying to find the LanguagePair and Operation associated with a given AWS translate job using the JAVA SDK.
Using the AWS web console, i created a couple of batch jobs to translate a few english sentences to french. In CloudWatch, i could see the metric dimensions as
LanguagePair: en-fr
Operation: TranslateText
Can i retrieve the same information (LanguagePair and Operation) for a given job,
using the TranslateAsyncClient.describeTextTranslationJob(...) method ?

You can use the TranslateClient to describes a translation job given the job number as input. This uses the TranslateClient; however you can use the Async version as well.
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.translate.TranslateClient;
import software.amazon.awssdk.services.translate.model.DescribeTextTranslationJobRequest;
import software.amazon.awssdk.services.translate.model.DescribeTextTranslationJobResponse;
import software.amazon.awssdk.services.translate.model.TranslateException;
// snippet-end:[translate.java2._describe_jobs.import]
/**
* To run this Java V2 code example, ensure that you have setup your development environment, including your credentials.
*
* For information, see this documentation topic:
*
* https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/get-started.html
*/
public class DescribeTextTranslationJob {
public static void main(String[] args) {
final String USAGE = "\n" +
"Usage:\n" +
" DescribeTextTranslationJob <id> \n\n" +
"Where:\n" +
" id - a translation job ID value. You can obtain this value from the BatchTranslation example.\n";
if (args.length != 1) {
System.out.println(USAGE);
System.exit(1);
}
String id = args[0];
Region region = Region.US_WEST_2;
TranslateClient translateClient = TranslateClient.builder()
.region(region)
.build();
describeTextTranslationJob(translateClient, id);
translateClient.close();
}
// snippet-start:[translate.java2._describe_jobs.main]
public static void describeTextTranslationJob(TranslateClient translateClient, String id) {
try {
DescribeTextTranslationJobRequest textTranslationJobRequest = DescribeTextTranslationJobRequest.builder()
.jobId(id)
.build();
DescribeTextTranslationJobResponse jobResponse = translateClient.describeTextTranslationJob(textTranslationJobRequest);
System.out.println("The job status is "+jobResponse.textTranslationJobProperties().jobStatus());
System.out.println("The source language is "+jobResponse.textTranslationJobProperties().sourceLanguageCode());
System.out.println("The target language is "+jobResponse.textTranslationJobProperties().targetLanguageCodes());
} catch (TranslateException e) {
System.err.println(e.getMessage());
System.exit(1);
}
// snippet-end:[translate.java2._describe_jobs.main]
}
}
To see all the data you can get back using this code, see this JavaDoc - https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/translate/model/TextTranslationJobProperties.html

Pentaho ETL Transformation Using Lamda

Is it possible to run Pentaho ETL Jobs/transformation using AWS Lamda functions?
I have Pentaho ETL jobs running on schedule on the Windows server, we are planning to migrate to AWS. I am considering the Lambda function. just to understand if it is possible to schedule the Pentaho ETL Jobs using AWS Lamdba

Here is the snippet of code that I was able to successfully run in AWS Lambda Function.
handleRequest Function is called from AWS Lambda Function
public Integer handleRequest(String input, Context context) {
parseInput(input);
return executeKtr(transName);
}
parseInput: This function is used to parse out a string parameter passed by Lambda Function to extract KTR name and its parameters with value. Format of the input is "ktrfilename param1=value1 param2=value2"
public static void parseInput(String input) {
String[] tokens = input.split(" ");
transName = tokens[0].replace(".ktr", "") + ".ktr";
for (int i=1; i<tokens.length; i++) {
params.add(tokens[i]);
}
}
Executing KTR: I am using git repo to store all my KTR files and based on the name passed as a parameter KTR is executed
public static Integer executeKtr(String ktrName) {
try {
System.out.println("Present Project Directory : " + System.getProperty("user.dir"));
String transName = ktrName.replace(".ktr", "") + ".ktr";
String gitURI = awsSSM.getParaValue("kattle-trans-git-url");
String repoLocalPath = clonePDIrepo.cloneRepo(gitURI);
String path = new File(repoLocalPath + "/" + transName).getAbsolutePath();
File ktrFile = new File(path);
System.out.println("KTR Path: " + path);
try {
/**
* IMPORTANT NOTE FOR LAMBDA FUNCTION MUST CREATE .KEETLE DIRECOTRY OTHERWISE
* CODE WILL FAIL IN LAMBDA FUNCTION WITH ERROR CANT CREATE
* .kettle/kettle.properties file.
*
* ALSO SET ENVIRNOMENT VARIABLE ON LAMBDA FUNCTION TO POINT
* KETTLE_HOME=/tmp/.kettle
*/
Files.createDirectories(Paths.get("/tmp/.kettle"));
} catch (IOException e) {
e.printStackTrace();
throw new RuntimeException("Error Creating /tmp/.kettle directory");
}
if (ktrFile.exists()) {
KettleEnvironment.init();
TransMeta metaData = new TransMeta(path);
Trans trans = new Trans(metaData);
// SETTING PARAMETERS
trans = parameterSetting(trans);
trans.execute( null );
trans.waitUntilFinished();
if (trans.getErrors() > 0) {
System.out.print("Error Executing transformation");
throw new RuntimeException("There are errors in running transformations");
} else {
System.out.print("Successfully Executed Transformation");
return 1;
}
} else {
System.out.print("KTR File:" + path + " not found in repo");
throw new RuntimeException("KTR File:" + path + " not found in repo");
}
} catch (KettleException e) {
e.printStackTrace();
throw new RuntimeException(e.getMessage());
}
}
parameterSetting: If KTR is accepting parameter and it is passed while calling AWS Lambda function, it is set using parameterSetting function.
public static Trans parameterSetting(Trans trans) {
String[] transParams = trans.listParameters();
for (String param : transParams) {
for (String p: params) {
String name = p.split("=")[0];
String val = p.split("=")[1];
if (name.trim().equals(param.trim())) {
try {
System.out.println("Setting Parameter:"+ name + "=" + val);
trans.setParameterValue(name, val);
} catch (UnknownParamException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
trans.activateParameters();
return trans;
}
CloneGitRepo:
public class clonePDIrepo {
/**
* Clones the given repo to local folder
*
* #param pathWithPwd Gir repo URL with access token included in the url. e.g.
* https://token_name:token_value#github.com/ktr-git-repo.git
* #return returns Local Repository String Path
*/
public static String cloneRepo(String pathWithPwd) {
try {
/**
* CREATING TEMP DIR TO AVOID FOLDER EXISTS ERROR, THIS TEMP DIRECTORY LATER CAN
* BE USED TO GET ABSOLETE PATH FOR FILES IN DIRECTORY
*/
File pdiLocalPath = Files.createTempDirectory("repodir").toFile();
Git git = Git.cloneRepository().setURI(pathWithPwd).setDirectory(pdiLocalPath).call();
System.out.println("Git repository cloned successfully");
System.out.println("Local Repository Path:" + pdiLocalPath.getAbsolutePath());
// }
return pdiLocalPath.getAbsolutePath();
} catch (Exception e) {
e.printStackTrace();
return null;
}
}
}
AWSSSMgetParaValue: Gets string value of the parameter passed.
public static String getParaValue(String paraName) {
try {
Region region = Region.US_EAST_1;
SsmClient ssmClient = SsmClient.builder()
.region(region)
.build();
GetParameterRequest parameterRequest = GetParameterRequest.builder()
.name(paraName)
.withDecryption(true)
.build();
GetParameterResponse parameterResponse = ssmClient.getParameter(parameterRequest);
System.out.println(paraName+ " value retreived from AWS SSM");
ssmClient.close();
return parameterResponse.parameter().value();
} catch (SsmException e) {
System.err.println(e.getMessage());
return null;
}
}
Assumptions:
Git repo is created with KTR files in the root of the repo
git repo url exists on the aws SSM with valid tokens to clone the repo
Input string contains name of the KTR file
Environment Variable is configured on Lambda Function for KETTLE_HOME=/tmp/.kettle
Lambda Function has necessary permissions for SSM and S3 VPC Network
Proper Security Group rules are setup to allow required network access for the KTR File
I am planning to upload complete code to git. I will update this post with the URL of the repository.

Updating User.UserAttributes on AWS Pinpoint

I cannot seem to update pre-existing UserAttributes on a user "73". I am not sure if this behaviour is to be expected.
Map<String, List<String>> userAttributes = new HashMap<>();
userAttributes.put("Inference", Arrays.asList("NEGATIVE"));
userAttributes.put("Gender", Arrays.asList("M"));
userAttributes.put("ChannelPreference", Arrays.asList("EMAIL"));
userAttributes.put("TwitterHandle", Arrays.asList("Nutter"));
userAttributes.put("Age", Arrays.asList("435"));
EndpointUser endpointUser = new EndpointUser().withUserId("73");
endpointUser.setUserAttributes(userAttributes);
EndpointRequest endpointRequest = new EndpointRequest().withUser(endpointUser);
UpdateEndpointResult updateEndpointResult = pinpoint.updateEndpoint(new UpdateEndpointRequest()
.withEndpointRequest(endpointRequest).withApplicationId("380c3902d4ds47bfb6f9c6749c6dc8bf").withEndpointId("a1fiy2gy+eghmsadj1vqew6+aa"));
System.out.println(updateEndpointResult.getMessageBody());

#David.Webster,
You can update user-attributes of Amazon Pinpoint endpoint using the below Java code snippet which I have tested to be working :
public static void main(String[] args) throws IOException {
// Try to update the endpoint.
try {
System.out.println("===============================================");
System.out.println("Getting Started with Amazon Pinpoint"
+"using the AWS SDK for Java...");
System.out.println("===============================================\n");
// Initializes the Amazon Pinpoint client.
AmazonPinpoint pinpointClient = AmazonPinpointClientBuilder.standard()
.withRegion(Regions.US_EAST_1).build();
// Creates a new user definition.
EndpointUser jackchan = new EndpointUser().withUserId("73");
// Assigns custom user attributes.
jackchan.addUserAttributesEntry("name", Arrays.asList("Jack", "Chan"));
jackchan.addUserAttributesEntry("Inference", Arrays.asList("NEGATIVE"));
jackchan.addUserAttributesEntry("ChannelPreference", Arrays.asList("EMAIL"));
jackchan.addUserAttributesEntry("TwitterHandle", Arrays.asList("Nutter"));
jackchan.addUserAttributesEntry("gender", Collections.singletonList("M"));
jackchan.addUserAttributesEntry("age", Collections.singletonList("435"));
// Adds the user definition to the EndpointRequest that is passed to the Amazon Pinpoint client.
EndpointRequest jackchanIphone = new EndpointRequest()
.withUser(jackchan);
// Updates the specified endpoint with Amazon Pinpoint.
UpdateEndpointResult result = pinpointClient.updateEndpoint(new UpdateEndpointRequest()
.withEndpointRequest(jackchanIphone)
.withApplicationId("4fd13a407f274f10b4ec06cbc71738bd")
.withEndpointId("095A8688-7D79-43CE-BDCE-7DF713332BC3"));
System.out.format("Update endpoint result: %s\n", result.getMessageBody().getMessage());
} catch (Exception ex) {
System.out.println("EndpointUpdate Failed");
System.err.println("Error message: " + ex.getMessage());
ex.printStackTrace();
}
}
}
Hope this helps

GATE Embedded runtime

I want to use "GATE" through web. Then I decide to create a SOAP web service in java with help of GATE Embedded.
But for the same document and saved Pipeline, I have a different run-time duration, when GATE Embedded runs as a java web service.
The same code has a constant run-time when it runs as a Java Application project.
In the web service, the run-time will be increasing after each execution until I get a Timeout error.
Does any one have this kind of experience?
This is my Code:
#WebService(serviceName = "GateWS")
public class GateWS {
#WebMethod(operationName = "gateengineapi")
public String gateengineapi(#WebParam(name = "PipelineNumber") String PipelineNumber, #WebParam(name = "Documents") String Docs) throws Exception {
try {
System.setProperty("gate.home", "C:\\GATE\\");
System.setProperty("shell.path", "C:\\cygwin2\\bin\\sh.exe");
Gate.init();
File GateHome = Gate.getGateHome();
File FrenchGapp = new File(GateHome, PipelineNumber);
CorpusController FrenchController;
FrenchController = (CorpusController) PersistenceManager.loadObjectFromFile(FrenchGapp);
Corpus corpus = Factory.newCorpus("BatchProcessApp Corpus");
FrenchController.setCorpus(corpus);
File docFile = new File(GateHome, Docs);
Document doc = Factory.newDocument(docFile.toURL(), "utf-8");
corpus.add(doc);
FrenchController.execute();
String docXMLString = null;
docXMLString = doc.toXml();
String outputFileName = doc.getName() + ".out.xml";
File outputFile = new File(docFile.getParentFile(), outputFileName);
FileOutputStream fos = new FileOutputStream(outputFile);
BufferedOutputStream bos = new BufferedOutputStream(fos);
OutputStreamWriter out;
out = new OutputStreamWriter(bos, "utf-8");
out.write(docXMLString);
out.close();
gate.Factory.deleteResource(doc);
return outputFileName;
} catch (Exception ex) {
return "ERROR: -> " + ex.getMessage();
}
}
}
I really appreciate any help you can provide.

The problem is that you're loading a new instance of the pipeline for every request, but then not freeing it again at the end of the request. GATE maintains a list internally of every PR/LR/controller that is loaded, so anything you load with Factory.createResource or PersistenceManager.loadObjectFrom... must be freed using Factory.deleteResource once it is no longer needed, typically using a try-finally:
FrenchController = (CorpusController) PersistenceManager.loadObjectFromFile(FrenchGapp);
try {
// ...
} finally {
Factory.deleteResource(FrenchController);
}
But...
Rather than loading a new instance of the pipeline every time, I would strongly recommend you explore a more efficient approach to load a smaller number of instances of the pipeline but keep them in memory to serve multiple requests. There is a fully worked-through example of this technique in the training materials on the GATE wiki, in particular module number 8 (track 2 Thursday).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to run an OCR(Tesseract) on AWS Lambda using Java - amazon-web-services

Related

How to enter data from Spring Boot Application into Amazon Kinesis?

fetch batch translate job details using SDK

Pentaho ETL Transformation Using Lamda

Updating User.UserAttributes on AWS Pinpoint

GATE Embedded runtime

Categories

Resources