Access results of AWS Transcribe job with Java SDK - amazon-web-services

I have a AWS Transcribe job that gives me a URI when completed. This URI should be where the transcription text is stored. I want to access that text with the Java SDK, but GetObject does not seem to support this option. How do I access the text from the Transcribe job?
// I am given this
String URI = job.getTranscript().getTranscriptFileUri();
// I want to do this
S3Object transcript = s3.getObject(URI);

You need to parse the bucket and the object key from the given URI, or you can use the provided class from AWS SDK, the AmazonS3URI. Then do as follows:
String URI = job.getTranscript().getTranscriptFileUri();
AmazonS3URI s3ObjectURI = new AmazonS3URI(URI);
S3Object transcript = s3.getObject(s3ObjectURI.getBucket(), s3ObjectURI.getKey());

Related

multi-part upload Google Storage

I'm trying to implement multi-part upload to Google Storage but to my surprise it does not seem to be straightforward (I could not find java example).
Only mention I found was in the XML API https://cloud.google.com/storage/docs/multipart-uploads
Also found some discussion around a compose API StorageExample.java#L446 mentioned here google-cloud-java issues 1440
Any recommendations how to do multipart upload?
I got the multi-part upload working with #Koblan suggestion. (for details check blog post)
This is how I create the S3 Client and point it to Google Storage
def createClient(accessKey: String, secretKey: String, region: String = "us"): AmazonS3 = {
val endpointConfig = new EndpointConfiguration("https://storage.googleapis.com", region)
val credentials = new BasicAWSCredentials(accessKey, secretKey)
val credentialsProvider = new AWSStaticCredentialsProvider(credentials)
val clientConfig = new ClientConfiguration()
clientConfig.setUseGzip(true)
clientConfig.setMaxConnections(200)
clientConfig.setMaxErrorRetry(1)
val clientBuilder = AmazonS3ClientBuilder.standard()
clientBuilder.setEndpointConfiguration(endpointConfig)
clientBuilder.withCredentials(credentialsProvider)
clientBuilder.withClientConfiguration(clientConfig)
clientBuilder.build()
}
Because I'm doing the upload from the frontend (after I generate signled URLs for each part using the AmazonS3 client) I needed to enable CORS.
For testing, I enabled everything for now
$ gsutil cors get gs://bucket
$ echo '[{"origin": ["*"],"responseHeader": ["Content-Type", "ETag"],"method": ["GET", "HEAD", "PUT", "DELETE", "PATCH"],"maxAgeSeconds": 3600}]' > cors-config.json
$ gsutil cors set cors-config.json gs://bucket
See https://cloud.google.com/storage/docs/configuring-cors#gsutil_1
Currently Java Client library for multi part upload in Cloud Storage is not available. You can raise a feature request for the same in this link. As mentioned by John Hanley, the next best thing you can do is, do a parallel composite upload with gsutil (CLI), JSON and XML support/ resumable upload with Java libraries.
In parallel compose, the parallel writes can be done by using the JSON or XML API for Google Cloud Storage. Specifically, you would write a number of smaller objects in parallel and then (once all of those objects have been written) call the Compose request to compose them into one larger object.
If you're using the JSON API the compose documentation is at : https://cloud.google.com/storage/docs/json_api/v1/objects/compose
If you're using the XML API the compose documentation is at : https://cloud.google.com/storage/docs/reference-methods#putobject (see the compose query parameter).
Also there is an interesting document link provided by Kolban which you can try and work out. Also I would like to mention that you can have multi part uploads in Java, if you use the Google Drive API(v3). Here is the code example where we use the files.create method with uploadType=multipart.

Amazon Textract without using Amazon S3

I want to extract information from PDFs using Amazon Textract (as in How to use the Amazon Textract with PDF files). All the answers and the AWS documentation requires the input to be Amazon S3 objects.
Can I use Textract without uploading the PDFs to Amazon S3, but just giving them in the REST call? (I have to store the PDFs locally).
I will answer this question with the Java API in mind. The short answer is Yes.
If you look at this TextractAsyncClient Javadoc for a given operation:
https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/textract/TextractAsyncClient.html#analyzeDocument-software.amazon.awssdk.services.textract.model.AnalyzeDocumentRequest-
It states:
" Documents for asynchronous operations can also be in PDF format"
This means - you can reference a PDF document and create an AnalyzeDocumentRequest object like this (without pulling from an Amazon S3 bucket). :
public static void analyzeDoc(TextractClient textractClient, String sourceDoc) {
try {
InputStream sourceStream = new FileInputStream(new File(sourceDoc));
SdkBytes sourceBytes = SdkBytes.fromInputStream(sourceStream);
// Get the input Document object as bytes
Document myDoc = Document.builder()
.bytes(sourceBytes)
.build();
List<FeatureType> featureTypes = new ArrayList<FeatureType>();
featureTypes.add(FeatureType.FORMS);
featureTypes.add(FeatureType.TABLES);
AnalyzeDocumentRequest analyzeDocumentRequest = AnalyzeDocumentRequest.builder()
.featureTypes(featureTypes)
.document(myDoc)
.build();
// Use the TextractAsyncClient to perform an operation like analyzeDocument
...
}

When uploading a file into aws s3 with boto3 is it possible to get the s3 object url as a return value?

I am uploading an image-file into AWS S3 using boto3 library. I noticed that the S3 object url ending does not match with the given Key. Is it possible to get the S3 object url as a return value from boto3 upload_file function?
example:
import boto3
s3 = boto3.client('s3')
file_location = ...
bucket = ...
folder = ...
filename = ...
url = s3.upload_file(
Filename=file_location,
Bucket=bucket,
Key=f'{folder}/{filename}',
)
I read from docs that it might be possible with a callback function, but I could not get it working with boto3.
If not what is the simplest way to get the uploaded object url?
Using the AWS SDK, you can get a URL for an object in an Amazon S3 bucket. I am not sure there is a Python example for this use case however, you can get an idea how to perform this task by looking at the Java example.
https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/javav2/example_code/s3/src/main/java/com/example/s3/GetObjectUrl.java
Okey, my problem was that the the filenames I was adding to create the file Key contained a hashtag symbol which has a specific meaning in url. AWS was automatically changing the hashtags into %23, which created a mismatch between Key and URL. Now I changed the naming convention of the file into containing no hashtag, so no more problem occurs.

How to use the Amazon Textract with PDF files

I already can use the textract but with JPEG files. I would like to use it with PDF files.
I have the code bellow:
import boto3
# Document
documentName = "Path to document in JPEG"
# Read document content
with open(documentName, 'rb') as document:
imageBytes = bytearray(document.read())
# Amazon Textract client
textract = boto3.client('textract')
documentText = ""
# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})
#print(response)
# Print detected text
for item in response["Blocks"]:
if item["BlockType"] == "LINE":
documentText = documentText + item["Text"]
# print('\033[94m' + item["Text"] + '\033[0m')
# # print(item["Text"])
# removing the quotation marks from the string, otherwise would cause problems to A.I
documentText = documentText.replace(chr(34), '')
documentText = documentText.replace(chr(39), '')
print(documentText)
As I said, it works fine. But I would like to use it passing a PDF file as in the web application for tests.
I know it possible to convert the PDF to JPEG in python but it would be nice to do it with PDF. I read the documentation and do not find the answer.
How can I do that?
EDIT 1: I forgot to mention that I do not intend to use de s3 bucket. I want to pass the PDF right in the script, without having to upload it into s3 bucket.
As #syumaK mentioned, you need to upload the pdf to S3 first. However, doing this may be cheaper and easier than you think:
Create new S3 bucket in console and write down bucket name,
then
import random
import boto3
bucket = 'YOUR_BUCKETNAME'
path = 'THE_PATH_FROM_WHERE_YOU_UPLOAD_INTO_S3'
filename = 'YOUR_FILENAME'
s3 = boto3.resource('s3')
print(f'uploading {filename} to s3')
s3.Bucket(bucket).upload_file(path+filename, filename)
client = boto3.client('textract')
response = client.start_document_text_detection(
DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': filename} },
ClientRequestToken=random.randint(1,1e10))
jobid = response['JobId']
response = client.get_document_text_detection(JobId=jobid)
It may take 5-50 seconds, until the call to get_document_text_detection(...) returns a result. Before, it will say that it is still processing.
According to my understanding, for each token, exactly one paid API call will be performed - and a past one will be retrieved, if the token has appeared in the past.
Edit:
I forgot to mention, that there is one intricacy if the document is large, in which case the result may need to be stitched together from multiple 'pages'. The kind of code you will need to add is
...
pages = [response]
while nextToken := response.get('NextToken'):
response = client.get_document_text_detection(JobId=jobid, NextToken=nextToken)
pages.append(response)
As mentioned in the AWS Textract FAQ page https://aws.amazon.com/textract/faqs/. pdf files are supported and in Sdk as well https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html
Sample usage https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/12-pdf-text.py
Since you want to work with PDF files meaning that you'll utilize Amazon Textract Asynchronous API (StartDocumentAnalysis, StartDocumentTextDetection) then currently it's not possible to directly parse in PDF files.
This is because Amazon Textract Asynchronous APIs only support document location as S3 objects.
From AWS Textract doc:
Amazon Textract currently supports PNG, JPEG, and PDF formats. For synchronous APIs, you can submit images either as an S3 object or as a byte array. For asynchronous APIs, you can submit S3 objects.
Upload the pdf to S3 bucket. After that, you can use easily use available functions startDocumentAnalysis to fetch pdf directly from s3 and do textract.
It works (almost), I had to make ClientRequestToken a string instead of an integer.

Allow 3rd party app to upload file to AWS s3

I need a way to allow a 3rd party app to upload a txt file (350KB and slowly growing) to an s3 bucket in AWS. I'm hoping for a solution involving an endpoint they can PUT to with some authorization key or the like in the header. The bucket can't be public to all.
I've read this: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
and this: http://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html
but can't quite seem to find the solution I'm seeking.
I'd suggests using a combination of the AWS API gateway, a lambda function and finally S3.
You clients will call the API Gateway endpoint.
The endpoint will execute an AWS lambda function that will then write out the file to S3.
Only the lambda function will need rights to the bucket, so the bucket will remain non-public and protected.
If you already have an EC2 instance running, you could replace the lambda piece with custom code running on your EC2 instance, but using lambda will allow you to have a 'serverless' solution that scales automatically and has no min. monthly cost.
I ended up using the AWS SDK. It's available for Java, .NET, PHP, and Ruby, so there's very high probability the 3rd party app is using one of those. See here: http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadObjSingleOpNET.html
In that case, it's just a matter of them using the SDK to upload the file. I wrote a sample version in .NET running on my local machine. First, install the AWSSDK Nuget package. Then, here is the code (taken from AWS sample):
C#:
var bucketName = "my-bucket";
var keyName = "what-you-want-the-name-of-S3-object-to-be";
var filePath = "C:\\Users\\scott\\Desktop\\test_upload.txt";
var client = new AmazonS3Client(Amazon.RegionEndpoint.USWest2);
try
{
PutObjectRequest putRequest2 = new PutObjectRequest
{
BucketName = bucketName,
Key = keyName,
FilePath = filePath,
ContentType = "text/plain"
};
putRequest2.Metadata.Add("x-amz-meta-title", "someTitle");
PutObjectResponse response2 = client.PutObject(putRequest2);
}
catch (AmazonS3Exception amazonS3Exception)
{
if (amazonS3Exception.ErrorCode != null &&
(amazonS3Exception.ErrorCode.Equals("InvalidAccessKeyId")
||
amazonS3Exception.ErrorCode.Equals("InvalidSecurity")))
{
Console.WriteLine("Check the provided AWS Credentials.");
Console.WriteLine(
"For service sign up go to http://aws.amazon.com/s3");
}
else
{
Console.WriteLine(
"Error occurred. Message:'{0}' when writing an object"
, amazonS3Exception.Message);
}
}
Web.config:
<add key="AWSAccessKey" value="your-access-key"/>
<add key="AWSSecretKey" value="your-secret-key"/>
You get the accesskey and secret key by creating a new user in your AWS account. When you do so, they'll generate those for you and provide them for download. You can then attach the AmazonS3FullAccess policy to that user and the document will be uploaded to S3.
NOTE: this was a POC. In the actual 3rd party app using this, they won't want to hardcode the credentials in the web config for security purposes. See here: http://docs.aws.amazon.com/AWSSdkDocsNET/latest/V2/DeveloperGuide/net-dg-config-creds.html