Send S3 document to Textract using Go - amazon-web-services

I'm trying to use Go to send objects in a S3 bucket to Textract and collect the response.
I'm using the aws go sdk package and able to connect to my S3 bucket and list all the objects contained within. So far so good. I now need to be able to send one of those objects (a .pdf file) to Textract and collect the response(s).
The AWS Go SDK content for interacting with Textract seem to be quite extensive but I cannot find a good example for how to do this.
I would be very grateful for a sample or advice on how to do this.

To start a job, you invoke StartDocumentTextDetection, using a DocumentLocation to specify the file, and you specify a SNS topic where Textract will publish a notification when it has finished to process your job.
You have now two possibilities:
Subscribe to the SNS topic, and when you receive a message retrieve the result
Create a lambda function triggered by the SNS topic, which retrieves the result.
The second option is IMO better 'cause it use less computation time (doesn't run until the job hasn't finished).
To retrieve the job, you use GetDocumentTextDetection

If anyone else reaches this site searching for an answer:
I understood the documentation as if I could just call the StartDocumentAnalysis function through the textract SDK but in fact what was missing is the fact that you need to create a new Session first and do the calls based on the session:
https://docs.aws.amazon.com/sdk-for-go/api/service/textract/#New

Related

How do I see which user that invoked a Lambda function

Need some help with Lambda invocation and authentication. I have an AWS Lambda function that is invoked from AWS IoT MQTT feed based on a specific topic. The invocation happens when an authenticated IoT Thing publishes to MQTT on that topic. My question is how do I see who has invoked it? I need this information so I know under what user to store the published information to database. I'm guessing there should be some environment variables that carry this information but I haven't found it. Maybe I been looking in all the wrong places:/
Many thanks,
Marcus
You should be able to modify the Lambda trigger in your IoT configuration to include the client ID by using something like the following SQL statement:
select clientId() as clientId, *
How are you?
You could send the user on the topic message. Is it not easier? Not sure how to get it from env var.

Architectural advice for AWS firehose or similar when collecting a lot of events in real-time

I would like to ask you about getting some advice about handling many application events on AWS. My application sends a lot of different events about everything what a user did in real-time. For collecting those events, I’m using AWS firehose (kinesis) - I have few data streams where I push some different events. Some events, before storing on S3/Redshift contains data which I want to extract and store to other databases (DynamoDB) or to other S3 files — for that case I’m using lambda which is assigned to a specific stream.
My problem is that business adds more and more new events which they need to collect or do something with data and for every new event or „group” events I need create separate data stream + s3/rs/es + lambda for extracting data. Also, events on S3 are stored in one format and there is not possible to group that events e.g. by userId from an application or even name of the event in the stream filename. Ideal s3 with that events would look like events/{user_id}/{date}/{event-name}{timestamp}.json.
Maybe I’m wrong using firehose or I have wrong thinking about firehose in my case, maybe there are other, better services on AWS for my case which can give me more control. Maybe simple SQS + lambdas as a listener on S3 is better solution in this case?
Thanks for any advice.
EDIT 12th Nov 2020
This was supposed to be a comment for #Lina, but it was too long to put a comment, so I updated my question with the solution which I pick.
I resolved my issue as I "felt", so it may not be a good way to repeat, but: I've written a nodejs routing application which I connected on firehose and I wrote a few microservices where data is sent from firehose by my routing app. So now, I have a firehose tube and I'm taking 10 different event types. When some event came, my routing application decides what microservice should be run with what data based on the event type (the raw firehose event is still stored on s3 automatically). This gives me needed flexibility as I can extract specific data from the event, do with that data what I need, by running every other microservices from the whole system and still have a raw event in the s3 in case of needed revert history of events.
Some of the events are not passing to any service, it is just stored as a raw s3 file e.g. application logs - I can do many things with that files on S3 PUT/CREATE event.
I hope that it will help someone with a similar problem.

Moving data from sqs to s3

I have a queue in SQS that many notifications with data are pushed into it (more than 9M notifications a day)
I'd like to know if there is a way to create s3 objects from messages in sqs to s3 (path is being mentioned in attribute of the meesage)
I prefer to have an out-of-the-box solution without coding.
If such a solution doesn't exist, would you recommend to have a lambda function to do that instead of a process (code runs on ec2)
Thanks,
I found an official document describing something similar to what you want to achieve -- it's more recommended in case you have big/large messages. As of today, feature is only available in the Java SDK. Example:
/*
* Set the Amazon SQS extended client configuration with large payload
* support enabled.
*/
final ExtendedClientConfiguration extendedClientConfig =
new ExtendedClientConfiguration()
.withLargePayloadSupportEnabled(s3, S3_BUCKET_NAME);
final AmazonSQS sqsExtended =
new AmazonSQSExtendedClient(AmazonSQSClientBuilder
.defaultClient(), extendedClientConfig);
Example reference: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/working-java-example-using-s3-for-large-sqs-messages.html

Query AWS SNS Endpoints by User Data

Simple question, but I suspect it doesn't have a simple or easy answer. Still, worth asking.
We're creating an implementation for push notifications using AWS with our Web Server running on EC2, sending messages to a queue on SQS, which is dealt with using Lambda, which is sent finally to SNS to be delivered to the iOS/Android apps.
The question I have is this: is there a way to query SNS endpoints based on the custom user data that you can provide on creation? The only way I see to do this so far is to list all the endpoints in a given platform application, and then search through that list for the user data I'm looking for... however, a more direct approach would be far better.
Why I want to do this is simple: if I could attach a User Identifier to these Device Endpoints, and query based on that, I could avoid completely having to save the ARN to our DynamoDB database. It would save a lot of implementation time and complexity.
Let me know what you guys think, even if what you think is that this idea is impractical and stupid, or if searching through all of them is the best way to go about this!
Cheers!
There isn't the ability to have a "where" clause in ListTopics. I see two possibilities:
Create a new SNS topic per user that has some identifiable id in it. So, for example, the ARN would be something like "arn:aws:sns:us-east-1:123456789:know-prefix-user-id". The obvious downside is that you have the potential for a boat load of SNS topics.
Use a service designed for this type of usage like PubNub. Disclaimer - I don't work for PubNub or own stock but have successfully used it in multiple projects. You'll be able to target one or many users this way.
According the the [AWS documentation][1] if you try and create a new Platform Endpoint with the same User Data you should get a response with an exception including the ARN associated with the existing PlatformEndpoint.
It's definitely not ideal, but it would be a round about way of querying the User Data Endpoint attributes via exception.
//Query CustomUserData by exception
CreatePlatformEndpointRequest cpeReq = new CreatePlatformEndpointRequest().withPlatformApplicationArn(applicationArn).withToken("dummyToken").withCustomUserData("username");
CreatePlatformEndpointResult cpeRes = client.createPlatformEndpoint(cpeReq);
You should get an exception with the ARN if an endpoint with the same withCustomUserData exists.
Then you just use that ARN and away you go.

AWS - want to upload multiple files to S3 and only when all are uploaded trigger a lambda function

I am seeking advice on what's the best way to design this -
Use Case
I want to put multiple files into S3. Once all files are successfully saved, I want to trigger a lambda function to do some other work.
Naive Approach
The way I am approaching this is by saving a record in Dynamo that contains a unique identifier and the total number of records I will be uploading along with the keys that should exist in S3.
A basic implementation would be to take my existing lambda function which is invoked anytime my S3 bucket is written into, and have it check manually whether all the other files been saved.
The Lambda function would know (look in Dynamo to determine what we're looking for) and query S3 to see if the other files are in. If so, use SNS to trigger my other lambda that will do the other work.
Edit: Another approach is have my client program that puts the files in S3 be responsible for directly invoking the other lambda function, since technically it knows when all the files have been uploaded. The issue with this approach is that I do not want this to be the responsibility of the client program... I want the client program to not care. As soon as it has uploaded the files, it should be able to just exit out.
Thoughts
I don't think this is a good idea. Mainly because Lambda functions should be lightweight, and polling the database from within the Lambda function to get the S3 keys of all the uploaded files and then checking in S3 if they are there - doing this each time seems ghetto and very repetitive.
What's the better approach? I was thinking something like using SWF but am not sure if that's overkill for my solution or if it will even let me do what I want. The documentation doesn't show real "examples" either. It's just a discussion without much of a step by step guide (perhaps I'm looking in the wrong spot).
Edit In response to mbaird's suggestions below-
Option 1 (SNS) This is what I will go with. It's simple and doesn't really violate the Single Responsibility Principal. That is, the client uploads the files and sends a notification (via SNS) that its work is done.
Option 2 (Dynamo streams) So this is essentially another "implementation" of Option 1. The client makes a service call, which in this case, results in a table update vs. a SNS notification (Option 1). This update would trigger the Lambda function, as opposed to notification. Not a bad solution, but I prefer using SNS for communication rather than relying on a database's capability (in this case Dynamo streams) to call a Lambda function.
In any case, I'm using AWS technologies and have coupling with their offering (Lambda functions, SNS, etc.) but I feel relying on something like Dynamo streams is making it an even tighter coupling. Not really a huge concern for my use case but still feels dirty ;D
Option 3 with S3 triggers My concern here is the possibility of race conditions. For example, if multiple files are being uploaded by the client simultaneously (think of several async uploads fired off at once with varying file sizes), what if two files happen to finish uploading at around the same time, and two or more Lambda functions (or whatever implementations we use) query Dynamo and gets back N as the completed uploads (instead of N and N+1)? Now even though the final result should be N+2, each one would add 1 to N. Nooooooooooo!
So Option 1 wins.
If you don't want the client program responsible for invoking the Lambda function directly, then would it be OK if it did something a bit more generic?
Option 1: (SNS) What if it simply notified an SNS topic that it had completed a batch of S3 uploads? You could subscribe your Lambda function to that SNS topic.
Option 2: (DynamoDB Streams) What if it simply updated the DynamoDB record with something like an attribute record.allFilesUploaded = true. You could have your Lambda function trigger off the DynamoDB stream. Since you are already creating a DynamoDB record via the client, this seems like a very simple way to mark the batch of uploads as complete without having to code in knowledge about what needs to happen next. The Lambda function could then check the "allFilesUploaded" attribute instead of having to go to S3 for a file listing every time it is called.
Alternatively, don't insert the DynamoDB record until all files have finished uploading, then your Lambda function could just trigger off new records being created.
Option 3: (continuing to use S3 triggers) If the client program can't be changed from how it works today, then instead of listing all the S3 files and comparing them to the list in DynamoDB each time a new file appears, simply update the DynamoDB record via an atomic counter. Then compare the result value against the size of the file list. Once the values are the same you know all the files have been uploaded. The down side to this is that you need to provision enough capacity on your DynamoDB table to handle all the updates, which is going to increase your costs.
Also, I agree with you that SWF is overkill for this task.