Process large volume of pdfs with OCR - amazon-web-services

I need to process a large number of multi-page PDFs (around 23,000 documents, averaging 30 pages each) into text. Since the documents are typewritten and scanned, I want to use OCR to minimise character-recognition mistakes. The problem is that the estimated running time in R (using the tesseract package) is impractically long. Is there an online service provider that could be used for this task?
N.B. I had a look at both Amazon Web Services and Google Cloud, but it is extremely difficult for me to understand how to use them, especially how to automate the whole process.
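For reference, a minimal sketch of what automating this with Amazon Textract's asynchronous API could look like, assuming the scanned PDFs have already been uploaded to an S3 bucket (the bucket and file names below are placeholders):

```python
# Sketch: asynchronous OCR of one scanned PDF stored in S3 with Amazon Textract.
# Bucket/key names are placeholders; requires boto3 and configured AWS credentials.
import time
import boto3

textract = boto3.client("textract", region_name="us-east-1")

def ocr_pdf(bucket: str, key: str) -> str:
    """Start an async text-detection job for one PDF and return its text."""
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job finishes (a production setup would use SNS notifications).
    while True:
        result = textract.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)
    if result["JobStatus"] == "FAILED":
        raise RuntimeError(f"Textract job failed for {key}")

    # Collect LINE blocks across all pages of results.
    lines = []
    while True:
        lines += [b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE"]
        next_token = result.get("NextToken")
        if not next_token:
            break
        result = textract.get_document_text_detection(JobId=job_id, NextToken=next_token)
    return "\n".join(lines)

print(ocr_pdf("my-scanned-docs", "document-0001.pdf"))
```

Looping this over the 23,000 keys in the bucket would automate the whole batch; only the names and region would need to change.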

Related

Deploying NLP model to AWS for beginners

I have the task of optimizing search on a website. The search should cover both pictures and text, driven by a text query. I have already developed, trained, tested and selected a machine learning model that transforms images and text into a feature vector (Python, based on OpenAI CLIP). This feature vector will be passed to Elasticsearch, which will be configured by another specialist.
The model will first be used to compute feature vectors for all existing images and texts, and then be used whenever new content is added or existing content is changed.
There is a lot of existing content (several tens of millions of pictures and texts combined). About 100-500 pieces of content are added or changed per day.
I haven't worked much with AWS, but in this case the model needs to be deployed to AWS somehow. Of course, I have the model and the entire project locally, and I can write an API app and build a Docker container.
The question is: what is the best way to deploy this application on AWS? Best in terms of speed and ease of implementation (for me as an AWS beginner), as well as cost, taking into account the number of requests the application will receive.
I've seen different possibilities, from simply deploying the application on EC2 (probably the easiest option) to using SageMaker, and also Kubernetes and ECS...
I'd recommend using a SageMaker Hosting endpoint if you need to be able to run vectorization in near-real time at any time of day, or a SageMaker Training job if you can run vectorization in batches, for example once every few hours.
For both options you can use the pre-defined framework containers and SDK, to which you pass your Python code and optionally a requirements.txt, or you can create your own image.
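As a rough sketch of the endpoint route using a pre-built framework container (the role ARN, S3 path, versions and inference script are placeholders, not details from your project):

```python
# Sketch: deploying a packaged CLIP-style model to a SageMaker real-time endpoint
# using a pre-built framework container. All names and paths are placeholders.
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # execution role (placeholder)

model = PyTorchModel(
    model_data="s3://my-bucket/clip-model/model.tar.gz",  # tarball with weights + code
    role=role,
    entry_point="inference.py",   # defines model_fn/predict_fn for your model
    source_dir="src",             # may also contain a requirements.txt
    framework_version="1.13",
    py_version="py39",
)

# Create the hosted endpoint (this provisions an instance behind the scenes).
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",   # GPU instance; choose based on cost/latency needs
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Invoke it like any other API to get a feature vector back.
embedding = predictor.predict({"text": "a photo of a cat"})
```

For the batched alternative, the same container image and code can instead be run on a schedule as a Training or Batch Transform job over the backlog of existing content.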

Fine tuning on either Google Cloud Vision, Microsoft Azure Computer Vision API or Amazon Text Extract

I need to transcribe a large number of handwritten documents. I tried cloud services from Google, Amazon, and Microsoft, namely:
https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/
https://cloud.google.com/vision/docs/handwriting
https://aws.amazon.com/textract/
Unfortunately, none of them achieved good enough results. I suspect it is because my documents have an unusual handwriting style, and as a result the networks struggle a lot.
I searched for a way to fine-tune these services (with manually transcribed data), but I have not found anything online, so as a last resort I am asking here.
If it is possible to fine-tune one of these models, could you please point me to some resources?
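For context, a minimal sketch of the kind of call these services involve (shown for Google Cloud Vision; the file path is a placeholder):

```python
# Sketch: handwriting transcription with the Google Cloud Vision API.
# Requires google-cloud-vision and application credentials; the path is a placeholder.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("scanned_page.jpg", "rb") as f:
    image = vision.Image(content=f.read())

# document_text_detection is the dense-text / handwriting variant of OCR.
response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)
```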
You are correct: with Azure Cognitive Services Computer Vision you cannot upload your own data to train the API to recognise the handwriting in your documents, I'm afraid. I can't comment on the offerings from AWS and Google, but it is certainly not possible with Azure.

Amazon Textract vs Amazon Rekognition DetectText

How do I decide when to use Amazon Textract vs Amazon Rekognition's DetectText method?
My use case is: take a picture on a mobile device, convert the image data into text, and store it in AWS RDS.
https://aws.amazon.com/blogs/aws/amazon-rekognition-image-detection-and-recognition-powered-by-deep-learning/
https://aws.amazon.com/textract/
With respect to end-to-end problem solving, Textract will perform better because it is more fully featured for OCR. If you're simply trying to pull a line or two of text from a picture shot in the wild, like street signs or billboards (i.e. not a document or form), I'd recommend Amazon Rekognition.
Amazon Textract is a newer AWS service that was created as a purpose-built solution to the problem of OCR (optical character recognition) in images of documents and PDFs. While Rekognition is a more generalizable computer vision service, Textract has many more OCR-oriented tuning parameters to optimize the process of accurately and effectively extracting text.
Out of the box, if all you are trying to do is detect text and the relevant metadata (coordinates, angle, confidence value), the Rekognition DetectText method will likely perform similarly to the equivalent analyze_document method in Textract. However, Textract offers further semantic structuring that helps with text curation/formatting and abstracts away post-processing that the developer would traditionally need to write themselves.
Lastly, when comparing the costs of the two text-detection methods, Textract costs a bit more ($1.50 per 1,000 images) than Rekognition ($1.00 per 1,000 images).
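To make the comparison concrete, here is a minimal sketch of both calls via boto3 (the file name is a placeholder):

```python
# Sketch: the same photo run through Rekognition DetectText and Textract
# DetectDocumentText via boto3. The file name is a placeholder.
import boto3

with open("photo.jpg", "rb") as f:
    image_bytes = f.read()

# Rekognition: general-purpose text-in-the-wild detection.
rekognition = boto3.client("rekognition")
rek_response = rekognition.detect_text(Image={"Bytes": image_bytes})
rek_lines = [d["DetectedText"] for d in rek_response["TextDetections"]
             if d["Type"] == "LINE"]

# Textract: document-oriented OCR (synchronous call for single images).
textract = boto3.client("textract")
tex_response = textract.detect_document_text(Document={"Bytes": image_bytes})
tex_lines = [b["Text"] for b in tex_response["Blocks"]
             if b["BlockType"] == "LINE"]

print("Rekognition:", rek_lines)
print("Textract:   ", tex_lines)
```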
If there is simply random text in the picture, then use Amazon Rekognition. It will find text in any location.
Amazon Textract is designed for converting paper documents into organized data. It will probably not work well with a random picture (although I haven't tried it, so I can't be certain!).

How to make my datalab machine learning run faster

I have some data: 3.2 million entries in a CSV file. I'm trying to use a CNN estimator in TensorFlow to train the model, but it's very slow. Every time I run the script it gets stuck, and the web page (localhost) just refuses to respond anymore. Any recommendations? (I've tried with 22 CPUs and I can't increase that any further.)
Can I just run it in the background, for example with python xxx.py & on the command line, to keep the process going, and then come back and check on it after some time?
Google offers serverless machine learning with TensorFlow for precisely this reason. It is called Cloud ML Engine. Your workflow would basically look like this:
Develop the program to train your neural network on a small dataset that can fit in memory (iron out the bugs, make sure it works the way you want)
Upload your full data set to the cloud (Google Cloud Storage, BigQuery, &c.) (documentation reference: training steps)
Submit a package containing your training program to Cloud ML Engine (this will point to the location of your full data set in the cloud) (documentation reference: packaging the trainer)
Start a training job in the cloud; this is serverless, so it will take care of scaling to as many machines as necessary, without you having to deal with setting up a cluster, &c. (documentation reference: submitting training jobs).
You can use this workflow to train neural networks on massive data sets - particularly useful for image recognition.
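If it helps, the packaging step above usually amounts to a minimal setup.py along these lines (the package name and dependency list here are placeholders, not something from your project):

```python
# Sketch: minimal setup.py used to package the trainer code for a Cloud ML Engine job.
# Package name and dependencies are placeholders for this example.
from setuptools import find_packages, setup

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),          # picks up the trainer/ package with its task.py
    install_requires=["tensorflow"],   # plus anything else the model needs
    description="Training application submitted as a Cloud ML Engine job",
)
```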
If this is a little too much information, or if this is part of a workflow that you'll be doing a lot and you want to get a stronger handle on it, Coursera offers a course on Serverless Machine Learning with Tensorflow. (I have taken it, and was really impressed with the quality of the Google Cloud offerings on Coursera.)
I'm sorry for answering even though I'm completely ignorant of what Datalab is, but have you tried batching?
I'm not aware whether it is possible in this scenario, but perhaps insert only 10,000 entries in one go and do this in enough batches that eventually all entries have been processed?
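If batching is possible, a sketch along these lines (column names, batch size and the commented-out estimator are placeholders) would stream the CSV in batches instead of loading all 3.2 million rows at once:

```python
# Sketch: stream the CSV in batches with tf.data rather than loading it whole.
# Column names, batch size, and the estimator itself are placeholders.
import tensorflow as tf

def input_fn():
    dataset = tf.data.experimental.make_csv_dataset(
        "data.csv",
        batch_size=1024,          # only one batch lives in memory at a time
        label_name="label",       # placeholder label column
        num_epochs=1,
        shuffle=True,
        shuffle_buffer_size=10_000,
    )
    return dataset.prefetch(tf.data.experimental.AUTOTUNE)

# estimator = tf.estimator.Estimator(model_fn=my_cnn_model_fn)  # your existing CNN
# estimator.train(input_fn=input_fn)
```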

How to do large-scale, batch reverse geo-coding?

I have a very large list of lat/lon coordinate pairs (>50 million). I want to attach address information to each one. Most geo/revgeo services have strict call limits. Assuming computing power isn't the issue, how can I accomplish this? Also note that time/speed are not the primary concern.
One place to start might be one of the dedicated AWS geocoders for unlimited-volume processing: https://aws.amazon.com/marketplace/search/results?x=0&y=0&searchTerms=geocoder
Intro
I have experience working with SmartyStreets' batch processing tool. They don't have call limits (on the paid version), but they also don't have a Reverse Geocode API (yet!). Their batch processing exists strictly for flexibility and ease of use on top of normal calls. However, I am aware of a couple of services that do reverse geocoding, and they mention batch processing on their websites.
How they work
Batch processing services generally allow you to upload your data, even arbitrarily large files. You probably want to put your data in a CSV file (a type of spreadsheet) as latitude and longitude pairs. Their servers will then process the data and alert you when you can download the results. It's common practice to charge money for this download, but maybe TAMU's is free?
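As a small sketch of the kind of upload file these services typically expect (the coordinate list and column names are placeholders; check each provider's documented format):

```python
# Sketch: write lat/lon pairs to a CSV for upload to a batch reverse-geocoding service.
# The input list and column names are placeholders for this example.
import csv

coordinates = [(40.7128, -74.0060), (34.0522, -118.2437)]  # placeholder pairs

with open("coordinates.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["latitude", "longitude"])
    writer.writerows(coordinates)
```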
Suggestions on who to use
Texas A&M Geoservices
MapLarge
Both of these services have demos and developer portals to guide you along if there is something you want to research before using them.
(Full disclosure: I have worked for SmartyStreets.)