I want to extract information from PDFs using Amazon Textract (as in "How to use Amazon Textract with PDF files"). All the answers and the AWS documentation require the input to be Amazon S3 objects.
Can I use Textract without uploading the PDFs to Amazon S3, passing them directly in the REST call instead? (I have to store the PDFs locally.)
I will answer this question with the Java API in mind. The short answer is Yes.
If you look at this TextractAsyncClient Javadoc for a given operation:
https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/textract/TextractAsyncClient.html#analyzeDocument-software.amazon.awssdk.services.textract.model.AnalyzeDocumentRequest-
It states:
" Documents for asynchronous operations can also be in PDF format"
This means - you can reference a PDF document and create an AnalyzeDocumentRequest object like this (without pulling from an Amazon S3 bucket). :
public static void analyzeDoc(TextractClient textractClient, String sourceDoc) {
    try {
        // Read the local document into an SdkBytes buffer
        InputStream sourceStream = new FileInputStream(new File(sourceDoc));
        SdkBytes sourceBytes = SdkBytes.fromInputStream(sourceStream);

        // Wrap the document bytes in a Document object
        Document myDoc = Document.builder()
                .bytes(sourceBytes)
                .build();

        List<FeatureType> featureTypes = new ArrayList<FeatureType>();
        featureTypes.add(FeatureType.FORMS);
        featureTypes.add(FeatureType.TABLES);

        AnalyzeDocumentRequest analyzeDocumentRequest = AnalyzeDocumentRequest.builder()
                .featureTypes(featureTypes)
                .document(myDoc)
                .build();

        // Use the Textract client to perform an operation such as analyzeDocument
        ...

    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
}
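For completeness, here is a minimal sketch of how that method might be invoked (the region and the file path are placeholders; adjust them to your setup):

// Region comes from software.amazon.awssdk.regions.Region
TextractClient textractClient = TextractClient.builder()
        .region(Region.US_EAST_2)
        .build();

analyzeDoc(textractClient, "path/to/local/document.pdf");
textractClient.close();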
I'm trying to implement multipart upload to Google Cloud Storage, but to my surprise it does not seem to be straightforward (I could not find a Java example).
The only mention I found was in the XML API: https://cloud.google.com/storage/docs/multipart-uploads
I also found some discussion around a compose API (StorageExample.java#L446), mentioned in google-cloud-java issue 1440.
Any recommendations on how to do a multipart upload?
I got the multipart upload working with Kolban's suggestion (for details, check the blog post).
This is how I create the S3 client and point it at Google Cloud Storage:
def createClient(accessKey: String, secretKey: String, region: String = "us"): AmazonS3 = {
  val endpointConfig = new EndpointConfiguration("https://storage.googleapis.com", region)
  val credentials = new BasicAWSCredentials(accessKey, secretKey)
  val credentialsProvider = new AWSStaticCredentialsProvider(credentials)

  val clientConfig = new ClientConfiguration()
  clientConfig.setUseGzip(true)
  clientConfig.setMaxConnections(200)
  clientConfig.setMaxErrorRetry(1)

  val clientBuilder = AmazonS3ClientBuilder.standard()
  clientBuilder.setEndpointConfiguration(endpointConfig)
  clientBuilder.withCredentials(credentialsProvider)
  clientBuilder.withClientConfiguration(clientConfig)
  clientBuilder.build()
}
Because I'm doing the upload from the frontend (after I generate signed URLs for each part using the AmazonS3 client; a sketch of that follows below), I needed to enable CORS.
For testing, I enabled everything for now:
$ gsutil cors get gs://bucket
$ echo '[{"origin": ["*"],"responseHeader": ["Content-Type", "ETag"],"method": ["GET", "HEAD", "PUT", "DELETE", "PATCH"],"maxAgeSeconds": 3600}]' > cors-config.json
$ gsutil cors set cors-config.json gs://bucket
See https://cloud.google.com/storage/docs/configuring-cors#gsutil_1
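For reference, here is a rough Java sketch (aws-java-sdk-s3 v1, the same SDK the Scala client above wraps) of generating the per-part signed URLs mentioned earlier. The bucket name, object key, part count and expiry are placeholders, and s3Client is the AmazonS3 client returned by createClient above:

public static void presignParts(AmazonS3 s3Client) {
    // Start the multipart upload to obtain an uploadId
    InitiateMultipartUploadResult init = s3Client.initiateMultipartUpload(
            new InitiateMultipartUploadRequest("my-bucket", "big-file.bin"));
    String uploadId = init.getUploadId();

    Date expiration = new Date(System.currentTimeMillis() + 60 * 60 * 1000); // 1 hour

    // One presigned PUT URL per part; the frontend uploads each part to its URL
    for (int partNumber = 1; partNumber <= 5; partNumber++) {
        GeneratePresignedUrlRequest presign =
                new GeneratePresignedUrlRequest("my-bucket", "big-file.bin")
                        .withMethod(HttpMethod.PUT)
                        .withExpiration(expiration);
        presign.addRequestParameter("uploadId", uploadId);
        presign.addRequestParameter("partNumber", String.valueOf(partNumber));
        URL url = s3Client.generatePresignedUrl(presign);
        System.out.println("Part " + partNumber + ": " + url);
    }
}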
Currently, the Java client library for Cloud Storage does not support multipart upload. You can raise a feature request for it at this link. As mentioned by John Hanley, the next best things you can do are a parallel composite upload with gsutil (CLI), the JSON and XML APIs, or a resumable upload with the Java libraries.
With parallel composite uploads, the parallel writes can be done using the JSON or XML API for Google Cloud Storage. Specifically, you write a number of smaller objects in parallel and then, once all of those objects have been written, call the compose request to combine them into one larger object.
If you're using the JSON API, the compose documentation is at: https://cloud.google.com/storage/docs/json_api/v1/objects/compose
If you're using the XML API, the compose documentation is at: https://cloud.google.com/storage/docs/reference-methods#putobject (see the compose query parameter).
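If you go the compose route from Java, a minimal sketch with the google-cloud-storage client library could look like the following (the bucket and object names are placeholders; note that a single compose call accepts at most 32 source objects):

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class ComposeExample {
    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // Combine the smaller objects written in parallel into one target object
        Storage.ComposeRequest request = Storage.ComposeRequest.newBuilder()
                .setTarget(BlobInfo.newBuilder("my-bucket", "large-object").build())
                .addSource("large-object-part-1")
                .addSource("large-object-part-2")
                .addSource("large-object-part-3")
                .build();

        Blob composed = storage.compose(request);
        System.out.println("Composed object size: " + composed.getSize());
    }
}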
There is also an interesting document link provided by Kolban which you can try and work out. I would also like to mention that you can do multipart uploads in Java if you use the Google Drive API (v3); there is a code example that uses the files.create method with uploadType=multipart.
I can already use Textract, but only with JPEG files. I would like to use it with PDF files.
I have the code below:
import boto3

# Document
documentName = "Path to document in JPEG"

# Read the document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract')

documentText = ""

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})
# print(response)

# Collect the detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        documentText = documentText + item["Text"]
        # print('\033[94m' + item["Text"] + '\033[0m')
        # print(item["Text"])

# Remove the quotation marks from the string; otherwise they would cause problems for the A.I.
documentText = documentText.replace(chr(34), '')
documentText = documentText.replace(chr(39), '')

print(documentText)
As I said, it works fine, but I would like to use it by passing a PDF file, as in the web application used for tests.
I know it is possible to convert the PDF to JPEG in Python, but it would be nice to do it with the PDF directly. I read the documentation and did not find the answer.
How can I do that?
EDIT 1: I forgot to mention that I do not intend to use an S3 bucket. I want to pass the PDF directly in the script, without having to upload it to an S3 bucket.
As @syumaK mentioned, you need to upload the PDF to S3 first. However, doing this may be cheaper and easier than you think:
Create a new S3 bucket in the console and write down the bucket name, then:
import random

import boto3

bucket = 'YOUR_BUCKETNAME'
path = 'THE_PATH_FROM_WHERE_YOU_UPLOAD_INTO_S3'
filename = 'YOUR_FILENAME'

s3 = boto3.resource('s3')
print(f'uploading {filename} to s3')
s3.Bucket(bucket).upload_file(path + filename, filename)

client = boto3.client('textract')

# ClientRequestToken must be a string (it is also what makes the call idempotent)
response = client.start_document_text_detection(
    DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': filename}},
    ClientRequestToken=str(random.randint(1, 10**10)))
jobid = response['JobId']

response = client.get_document_text_detection(JobId=jobid)
It may take 5-50 seconds until the call to get_document_text_detection(...) returns a result; before that, it will report that the job is still processing.
According to my understanding, exactly one paid API call is performed per token; if the token has been used in the past, the past result is retrieved instead.
Edit:
I forgot to mention that there is one intricacy if the document is large, in which case the result may need to be stitched together from multiple 'pages'. The kind of code you will need to add is:
...
pages = [response]
while nextToken := response.get('NextToken'):
    response = client.get_document_text_detection(JobId=jobid, NextToken=nextToken)
    pages.append(response)
As mentioned on the AWS Textract FAQ page (https://aws.amazon.com/textract/faqs/), PDF files are supported, and they are supported in the SDK as well: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html
Sample usage: https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/12-pdf-text.py
Since you want to work with PDF files, you'll be using the Amazon Textract asynchronous APIs (StartDocumentAnalysis, StartDocumentTextDetection), and it's currently not possible to pass PDF files in directly.
This is because the Amazon Textract asynchronous APIs only accept document locations that are S3 objects.
From AWS Textract doc:
Amazon Textract currently supports PNG, JPEG, and PDF formats. For synchronous APIs, you can submit images either as an S3 object or as a byte array. For asynchronous APIs, you can submit S3 objects.
Upload the PDF to an S3 bucket. After that, you can easily use the available function startDocumentAnalysis to fetch the PDF directly from S3 and run Textract on it.
It works; note that ClientRequestToken has to be a string, not an integer.
I've written code that first downloads the file from AWS and then uploads it to Azure. I also need to monitor the full progress of the migration. But this consumes a lot of bandwidth and time, and there is no monitoring of the data either. What would be the best way to make a reliable transfer from S3 to Blob Storage, along with monitoring of the migration?
//Downloading from AWS
BasicAWSCredentials awsCreds = new BasicAWSCredentials(t.getDaccID(),t.getDaccKey());
AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
.withRegion(Regions.fromName("us-east-2"))
.withCredentials(new AWSStaticCredentialsProvider(awsCreds))
.build();
S3Object s3object = s3Client.getObject(new GetObjectRequest(t.getDbucket(), t.getDfileName()));
byte[] bytes = IOUtils.toByteArray(s3object.getObjectContent());
//Uploading to Azure
String Connstr = "DefaultEndpointsProtocol=https;AccountName="+t.getUaccID()+";AccountKey="+t.getUaccKey()+";EndpointSuffix=core.windows.net";
CloudStorageAccount cloudStorageAccount =CloudStorageAccount.parse(Connstr);
CloudBlobClient blobClient = cloudStorageAccount.createCloudBlobClient();
CloudBlobContainer container=blobClient.getContainerReference(t.getUbucket());
CloudBlockBlob blob = container.getBlockBlobReference(t.getDfileName());
blob.uploadFromByteArray(bytes ,0, bytes.length);
writer.append("File Uploaded to Azure Successful \n");
You don't really need to download the file from S3 and upload it back to Azure Blob Storage. Azure Blob Storage supports creating a new blob by copying an object from a publicly accessible URL. This is an asynchronous operation and is done server-side by Azure Storage itself.
Here's what you would need to do (instead of the code above):
Create a signed URL for the object in AWS S3, or make the object publicly available.
Use the Azure Storage Java SDK to create a blob using the Copy Blob functionality. In the copy operation, the source URL will be the signed URL.
Once the copy starts, periodically fetch the properties of the blob. The properties include the copy state, which tells you about the progress (both as a percentage and as bytes copied). You can use that to monitor the progress of the copy; a rough sketch follows below.
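A rough sketch of those steps using the same SDKs as the question (aws-java-sdk-s3 and the legacy azure-storage library); the expiry, sleep interval and variable names are illustrative, and exception handling is omitted:

// 1. Signed URL for the S3 object
Date expiration = new Date(System.currentTimeMillis() + 60 * 60 * 1000); // 1 hour
URL signedUrl = s3Client.generatePresignedUrl(
        new GeneratePresignedUrlRequest(t.getDbucket(), t.getDfileName())
                .withMethod(HttpMethod.GET)
                .withExpiration(expiration));

// 2. Ask Azure to copy from that URL (server-side, asynchronous)
CloudBlockBlob blob = container.getBlockBlobReference(t.getDfileName());
blob.startCopy(signedUrl.toURI());

// 3. Poll the copy state to monitor progress
blob.downloadAttributes();
CopyState state = blob.getCopyState();
while (state.getStatus() == CopyStatus.PENDING) {
    Thread.sleep(5000);
    blob.downloadAttributes();
    state = blob.getCopyState();
    System.out.println("Copied " + state.getBytesCopied() + " of " + state.getTotalBytes() + " bytes");
}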
I wrote a blog post a long time back (when the async Copy Blob operation was first introduced) that talks about copying objects from Amazon S3 to Azure Blob Storage. You can read that blog post here: https://gauravmantri.com/2012/06/14/how-to-copy-an-object-from-amazon-s3-to-windows-azure-blob-storage-using-copy-blob/.
I have an AWS Transcribe job that gives me a URI when it completes. This URI should be where the transcription text is stored. I want to access that text with the Java SDK, but GetObject does not seem to support this option. How do I access the text from the Transcribe job?
// I am given this
String URI = job.getTranscript().getTranscriptFileUri();
// I want to do this
S3Object transcript = s3.getObject(URI);
You need to parse the bucket and the object key from the given URI, or you can use the AmazonS3URI class provided by the AWS SDK. Then do as follows:
String URI = job.getTranscript().getTranscriptFileUri();
AmazonS3URI s3ObjectURI = new AmazonS3URI(URI);
S3Object transcript = s3.getObject(s3ObjectURI.getBucket(), s3ObjectURI.getKey());
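The transcript object is just a JSON file, so as a small follow-up (using com.amazonaws.util.IOUtils; exception handling omitted) you could read it into a string like this:

String transcriptJson = IOUtils.toString(transcript.getObjectContent());
System.out.println(transcriptJson);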
I need a way to allow a 3rd-party app to upload a txt file (350 KB and slowly growing) to an S3 bucket in AWS. I'm hoping for a solution involving an endpoint they can PUT to, with some authorization key or the like in the header. The bucket can't be public to everyone.
I've read this: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
and this: http://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html
but can't quite seem to find the solution I'm seeking.
I'd suggest using a combination of AWS API Gateway, a Lambda function and finally S3.
Your clients will call the API Gateway endpoint.
The endpoint will invoke an AWS Lambda function that will then write the file out to S3 (a sketch of such a function follows below).
Only the Lambda function will need rights to the bucket, so the bucket will remain non-public and protected.
If you already have an EC2 instance running, you could replace the Lambda piece with custom code running on your EC2 instance, but using Lambda gives you a 'serverless' solution that scales automatically and has no minimum monthly cost.
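As an illustration of that setup, here is a hedged sketch of a Java Lambda handler behind an API Gateway proxy integration that writes the PUT body to S3; the bucket name and key scheme are made up for the example:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class UploadHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent request, Context context) {
        // Store the request body as an object; only this function's role needs s3:PutObject on the bucket
        String key = "uploads/" + System.currentTimeMillis() + ".txt";
        s3.putObject("my-protected-bucket", key, request.getBody());

        return new APIGatewayProxyResponseEvent()
                .withStatusCode(200)
                .withBody("Stored as " + key);
    }
}

API Gateway can then require an API key or an authorizer so that only the 3rd party can call the endpoint.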
I ended up using the AWS SDK. It's available for Java, .NET, PHP, and Ruby, so there's a very high probability the 3rd-party app is using one of those. See here: http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadObjSingleOpNET.html
In that case, it's just a matter of them using the SDK to upload the file. I wrote a sample version in .NET running on my local machine. First, install the AWSSDK NuGet package. Then, here is the code (taken from an AWS sample):
C#:
var bucketName = "my-bucket";
var keyName = "what-you-want-the-name-of-S3-object-to-be";
var filePath = "C:\\Users\\scott\\Desktop\\test_upload.txt";

var client = new AmazonS3Client(Amazon.RegionEndpoint.USWest2);

try
{
    PutObjectRequest putRequest2 = new PutObjectRequest
    {
        BucketName = bucketName,
        Key = keyName,
        FilePath = filePath,
        ContentType = "text/plain"
    };

    putRequest2.Metadata.Add("x-amz-meta-title", "someTitle");
    PutObjectResponse response2 = client.PutObject(putRequest2);
}
catch (AmazonS3Exception amazonS3Exception)
{
    if (amazonS3Exception.ErrorCode != null &&
        (amazonS3Exception.ErrorCode.Equals("InvalidAccessKeyId")
         ||
         amazonS3Exception.ErrorCode.Equals("InvalidSecurity")))
    {
        Console.WriteLine("Check the provided AWS Credentials.");
        Console.WriteLine("For service sign up go to http://aws.amazon.com/s3");
    }
    else
    {
        Console.WriteLine(
            "Error occurred. Message:'{0}' when writing an object",
            amazonS3Exception.Message);
    }
}
Web.config:
<add key="AWSAccessKey" value="your-access-key"/>
<add key="AWSSecretKey" value="your-secret-key"/>
You get the access key and secret key by creating a new user in your AWS account. When you do so, AWS generates them for you and provides them for download. You can then attach the AmazonS3FullAccess policy to that user, and the document will be uploaded to S3.
NOTE: this was a POC. In the actual 3rd party app using this, they won't want to hardcode the credentials in the web config for security purposes. See here: http://docs.aws.amazon.com/AWSSdkDocsNET/latest/V2/DeveloperGuide/net-dg-config-creds.html