Querying AWS SQS queue URL using Java SDK - amazon-web-services

Most operations in the AWS Java SDK for SQS require a queue URL.
Given a queue name, the queue URL can be queried using the GetQueueUrl operation.
Does the AmazonSQS client automatically cache the result of this operation, or is it up to the application to cache the queue URL to avoid repeated queries?

If we look at the AWS Java SDK code on GitHub, we see that getQueueUrl() triggers the usual client preparation hooks (which don't appear to include caching) and then immediately jumps to executeGetQueueUrl(), which makes the request, also without caching. Interestingly, there is a URI cachedEndpoint = null; field that doesn't appear to be used anywhere (maybe I'm missing something?).
Taking a step back, this makes sense. Auto-caching the response in the SDK could be dangerous for applications using it, so the decision to cache or not to cache is left to the application logic where it belongs. So, if you need to cache the responses, it's up to you to decide how long you want to cache them and where/how to store them.
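For illustration, here's a minimal sketch of application-side caching with the v1 Java SDK, keyed by queue name. The class name and the "cache forever" policy are assumptions you'd adapt to your own needs:

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class QueueUrlCache {
    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private final ConcurrentMap<String, String> urlsByQueueName = new ConcurrentHashMap<>();

    public String getQueueUrl(String queueName) {
        // computeIfAbsent ensures GetQueueUrl is only called once per queue name
        return urlsByQueueName.computeIfAbsent(
                queueName,
                name -> sqs.getQueueUrl(name).getQueueUrl());
    }
}
```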

Related

AWS Lambda best practices for Real Time Tracking

We currently run an AWS Lambda function that simply redirects the user to a different URL. The function is invoked via API Gateway.
For tracking purposes, we would like to create a widget on our dashboard that provides real-time insights into how many redirects are performed each second. The creation of the widget itself is not the problem.
My main question is which AWS service is best suited for telling our other services that an invocation took place. We plan to register the invocation in our database.
Some additional requirements:
low latency (< 5 seconds), so the data is real-time
nearly no added wait time for the user; we aim to redirect the user as fast as possible
Many thanks in advance!
Best Regards
Martin
I understand that your goal is to simply persist the information that an invocation happened somewhere with minimal impact on the response time of the Lambda.
For that purpose I'd probably use an SQS standard queue and just send a message to the queue that the invocation happened.
You can then have an asynchronous process (Lambda, Docker, EC2) process the messages from the queue and update your Dashboard.
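As a rough sketch (the queue URL, class name, and message format below are placeholders, not anything from the question), the Lambda could fire a single SendMessage call just before returning the redirect:

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class RedirectTracker {
    private static final AmazonSQS SQS = AmazonSQSClientBuilder.defaultClient();
    private static final String QUEUE_URL =
            "https://sqs.eu-central-1.amazonaws.com/123456789012/redirect-events"; // placeholder

    public void recordRedirect(String targetUrl) {
        // A single SendMessage call typically adds only a few milliseconds to the request
        SQS.sendMessage(new SendMessageRequest()
                .withQueueUrl(QUEUE_URL)
                .withMessageBody("{\"redirectedTo\":\"" + targetUrl + "\",\"ts\":"
                        + System.currentTimeMillis() + "}"));
    }
}
```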
Depending on the scalability requirements, looking into Kinesis Data Analytics might also be worth it.
It's a fully managed streaming-data solution, and the analytics part allows you to do sliding-window analyses using SQL on data in the stream.
In that case you'd write the info that something happened to the stream, which also has a low latency.

Amazon S3: how parallel PUTs to the same key are resolved in versioned buckets

The Amazon S3 data consistency model contains the following note:
Amazon S3 does not currently support object locking for concurrent updates. If two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins. If this is an issue, you will need to build an object-locking mechanism into your application.
For my application I am considering resolving conflicts caused by concurrent writes by analyzing all object versions on the client side in order to assemble a single composite object, hopefully reconciling the two changes.
However, I am not able to find a definitive answer to how the part in bold ("the request with the latest timestamp wins") plays out in versioned buckets.
Specifically, after reading how Object Versioning works, it appears that S3 will create a new object version with a unique version ID on each PUT right away, after which I assume that data replication will kick in and S3 will have to determine which of the two concurrent writes to retain.
So the questions that I have on this part are:
Will S3 keep track of a separate version for each of the two concurrent writes for the object?
Will I see both versions when querying list of versions for the object using API, with one of them being arbitrarily marked as current?
I found this question while debugging our own issue caused by concurrent PUTs to the same key. Returning to fulfill my obligations per [1].
Rapid concurrent PUTs to the same S3 key do sometimes collide, even with versioning enabled. In such cases S3 will return a 503 for one of the requests (the one with the oldest timestamp, per the doc snippet you pasted above).
Here's a blurb from S3's engineering team, passed on to us by our business support contact:
While S3 supports request rates of up to 3500 REST.PUT.OBJECT requests per second to a single partition, in some rare scenarios, rapid concurrent REST.PUT.OBJECT requests to the same key may result in a 503 response. In such cases, further partitioning also does not help because the requests for the same key will land on the same partition. In your case, we looked at the reason your request received a 503 response and determined that it was because there was a concurrent REST.PUT.OBJECT request for the same key that our system was processing. In such cases, retrying the failed request will most likely result in success.
Your definition of rare may vary. We were using 4 threads and seeing a 503 for 1.5 out of every 10 requests.
The snippet you quoted is the only reference to this that I can find in the docs, and it doesn't explicitly mention 503s (which usually indicate rate limiting at more than 3,500 requests per second).
When the requests don't collide and no 503 is returned, it works how you would expect: a new version per request, with the most recent request timestamp becoming the current version.
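In case it's useful, here's a hedged sketch of retrying the PUT on a 503 with the v1 Java SDK. Note that the SDK already retries 5xx responses internally, so this only matters once those built-in retries are exhausted; the bucket, key, attempt count, and backoff values are placeholders:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.AmazonS3Exception;

public class PutWithRetry {
    private static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();

    public static void putWithRetry(String bucket, String key, String body) throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                S3.putObject(bucket, key, body);
                return;
            } catch (AmazonS3Exception e) {
                if (e.getStatusCode() == 503 && attempt < 5) {
                    Thread.sleep(100L * attempt); // simple linear backoff before retrying
                } else {
                    throw e; // not a 503, or retries exhausted
                }
            }
        }
    }
}
```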
Hopefully this post will help someone with the same issue in the future.
[1] https://xkcd.com/979/

AWS Lambda - Store state of a queue

I'm currently tasked with building a serverless architecture for communication between government agencies and citizens, and a main component is some form of queue that contains an object/pointer to each citizen's request, sorted by priority. The government workers can then process an element when available. As Lambda is stateless, I need to save the queue externally in some manner.
For saving state I've gathered that you can use DynamoDB or S3 Buckets and use event triggers to invoke related Lambda methods. Some also suggest using Parameter Store to save some state variables. Storing things globally has also come up, though as you can't guarantee that the Lambda doesn't terminate, it doesn't seem like a good idea.
Finally, I've also read a bit about SQS, though I have no idea if it is at all applicable to this case.
What is the best-practice / suggested approach when working with Lambda in this way? I'm leaning towards S3 Buckets, due to event triggering, and not using DynamoDB as our DB.
Storing things globally has also come up, though as you can't guarantee that the Lambda doesn't terminate, it doesn't seem like a good idea.
Correct -- this is not viable at all. Note that what you are actually referring to when you say "the Lambda" is the process inside the container... and any time your Lambda function is handling more than one invocation concurrently, you are guaranteed that they will not be running in the same container -- so "global" variables are only useful for optimization, not state. Any two concurrent invocations of the same function have two entirely different global environments.
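To illustrate the "optimization, not state" point, here's a minimal Java handler sketch; the class and client names are just for illustration. The static client is reused by invocations that happen to land on the same container, but nothing about it can be relied on as shared state:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class Handler implements RequestHandler<String, String> {
    // Reused across invocations that happen to land on the same container;
    // concurrent invocations get separate containers and separate instances.
    private static final AmazonDynamoDB DYNAMO = AmazonDynamoDBClientBuilder.defaultClient();

    @Override
    public String handleRequest(String input, Context context) {
        // use DYNAMO here; never accumulate business state in statics
        return "ok";
    }
}
```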
Forgetting all about Lambda for a moment -- I am not saying don't use Lambda; I'm saying that whether or not you use Lambda isn't relevant to the rest of what is written, below -- I would suggest that parallel/concurrent actions in general are perhaps one of the most important factors that many developers tend to overlook when trying to design something like you are describing.
How you will assign work from this work "queue" is extremely important to consider. You can't just "find the next item" and display it to a worker.
You must have a way to do all of these things:
finding the next item that appears to be available
verify that it is indeed available
assign it to a specific worker
mark it as unavailable for assignment
Not only that, but you have to be able to do all of these things atomically -- as a single logical action -- and without collisions.
A naïve implementation runs the risk of assigning the same work item to two or more people, with the first assignment being blindly and silently overwritten by subsequent assignments that happen at almost the same time.
DynamoDB allows conditional updates -- update a record if and only if a certain condition is true. This is a critical piece of functionality that your solution needs to accommodate -- for example, assign work item x to user y if and only if item x is currently unassigned. A conditional update will fail, and change nothing, if the condition is not true at the instant the update happens, and therein lies the power of the feature.
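As a sketch of that pattern with the v1 Java SDK (the table, key, and attribute names are made up for illustration):

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class WorkItemAssigner {
    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    /** Returns true if this call won the assignment, false if someone else got there first. */
    public boolean tryAssign(String itemId, String workerId) {
        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":worker", new AttributeValue(workerId));

        UpdateItemRequest request = new UpdateItemRequest()
                .withTableName("WorkItems")                                    // hypothetical table
                .withKey(Collections.singletonMap("itemId", new AttributeValue(itemId)))
                .withUpdateExpression("SET assignedTo = :worker")
                .withConditionExpression("attribute_not_exists(assignedTo)")   // only if unassigned
                .withExpressionAttributeValues(values);
        try {
            dynamo.updateItem(request);
            return true;   // condition held; the item is now ours
        } catch (ConditionalCheckFailedException e) {
            return false;  // someone else claimed it first; nothing was changed
        }
    }
}
```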
S3 does not support conditional updates, because unlike DynamoDB, S3 operates only on an eventual-consistency model in most cases. After an object in S3 is updated or deleted, there is no guarantee that the next request to S3 will return the most recent version or that S3 will not return an item that has recently been deleted. This is not a defect in S3 -- it's an optimization -- but it makes S3 unsuited to the "work queue" aspect.
Skip this consideration and you will have a system that appears to work, and works correctly much of the time... but at other times, it "mysteriously" behaves wrongly.
Of course, if your work items have accompanying documents (scanned images, PDF, etc.), it's quite correct to store them in S3... but S3 is the wrong tool for storing "state." SSM Parameter Store is the wrong tool, for the same reason -- there is no way for two actions to work cooperatively when they both need to modify the "state" at the same time.
"Event triggers" are useful, of course, but from your description, the most notable "event" is not from the data, or the creation of the work item, but rather it is when the worker says "I'm ready for my next work item." It is at that point -- triggered by the web site/application code -- when the steps above are executed to select an item and assign it to a worker. (In practice, this could be browser → API Gateway → Lambda). From your description, there may be no need for the creation of a new work item to trigger an "event," or if there is, it is not the most significant among the events.
You will need a proper database for this. DynamoDB is a candidate, as is RDS.
The queues provided by SQS are designed to decouple two parts of your application -- when two processes run at different speeds, SQS is used as a buffer, allowing X to safely store the work needing to be done and then continue with something else until Y is able to do the work. SQS queues are opaque -- you can't introspect what's in the queue, you just take the next message and are responsible for handling it. On its face, that seems to partially describe what you need, but it is not a clean match for this use case. Queues are limited in how long messages can be retained, and once a message is successfully processed, it is completely gone.
Note also that SQS is only a match to your use case with the FIFO queue feature enabled, which guarantees perfect in-order delivery and exactly-once delivery -- standard SQS queues, for performance optimization reasons, do not guarantee perfect in-order delivery and may under certain conditions deliver the same message more than once, to the same consumer or a different consumer. But the SQS FIFO queue feature does not coexist with event triggers, which require standard queues.
So SQS may have a role, but you need an authoritative database to store the work and the results of the business process.
If you need to store the message, then SQS is not the best tool here, because your Lambda function would then need to process the message and finally store it somewhere, making SQS nothing but a broker.
The S3 approach gives what you need out of the box, considering you can store the files (messages) in an S3 bucket and then have one Lambda consume its event. Your Lambda would then process this event and the file would remain safe and sound on S3.
If you eventually need multiple consumers for this message, then you can send the S3 event to SNS instead and finally you could subscribe N Lambda Functions to a given SNS topic.
You appear to be worrying too much about the infrastructure at this stage and not enough on the application design. The fact that it will be serverless does not change the basic functionality of the application — it will still present a UI to users, they will still choose options that must trigger some business logic and information will still be stored in a database.
The queue you describe is merely a datastore of messages that are in a particular state. The application will have some form of business logic for determining the next message to handle, which could be based on creation timestamp, priority, location, category, user (eg VIP users who get faster response), specialization of the staff member asking for the next message, etc. This is not a "queue" but rather a calculation to be performed against all 'unresolved' messages to determine the next message to assign.
If you wish to go serverless, then the back-end will certainly be using Lambda and a database (eg DynamoDB or Amazon RDS). The application should store everything in the database so that data is available for the application's business logic. There is no need to use SQS since there really isn't a "queue", and Parameter Store is merely a way of sharing parameters amongst application components — it is not meant for core data storage.
Determine the application functionality first, then determine the appropriate architecture to make it happen.

Log delay in Amazon S3

I have recently started hosting on Amazon S3, and I need the log files to calculate statistics for the "get", "put", and "list" operations on the objects.
And I've observed that the log files are organized weirdly. I don't know when a log will appear (not immediately; at least 20 minutes after the operation) or how many lines of logs will be contained in one log file.
After that, I need to download these log files and analyse them. But I can't figure out how often I should do this.
Can somebody help? Thanks.
What you describe (log files being made available with delays and in unpredictable order) is exactly the behaviour AWS declares you should expect. This is due to the nature of the distributed system AWS uses to provide the S3 service; the same request may be served each time by a different server - I have seen 5 different IP addresses being used for publishing.
So the only solution is: accept the delay, measure the delay you actually experience, add some extra time, and learn to live with this total delay (I would expect something like 30 to 60 minutes, but statistics could tell you more).
If you need log records ordered, you have to either sort them yourself or look for a log-processing solution - I have seen applications offered exactly for this purpose.
In case you really need to get your log files with a very short delay, you have to produce the logs yourself, and this means you have to write and run some frontend which gives access to your files on S3 and at the same time does the logging you need.
I run such a solution: users get a user name, a password, and the URL of my frontend. When they send a request, I check whether they provided proper credentials and are allowed to see the given resource, and if so, I create a temporary URL for that resource that is valid for a few minutes and redirect the request to it.
But such a frontend costs money (you have to run it somewhere) and is less robust than accessing AWS S3 directly.
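For what it's worth, the "temporary URL" part of such a frontend can be done with a presigned URL. Here's a hedged Java sketch (bucket and key are placeholders, and the credential check and the logging itself are left to you):

```java
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

import java.net.URL;
import java.util.Date;

public class PresignedRedirect {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    public URL temporaryUrl(String bucket, String key) {
        Date expiry = new Date(System.currentTimeMillis() + 5 * 60 * 1000); // valid for a few minutes
        GeneratePresignedUrlRequest request = new GeneratePresignedUrlRequest(bucket, key)
                .withMethod(HttpMethod.GET)
                .withExpiration(expiry);
        return s3.generatePresignedUrl(request); // log the access yourself, then redirect here
    }
}
```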
Good luck, Lulu.
A lot has changed since the time the question was originally posted. The delay is still there, but one of the OP's concerns was when to download the logs to analyze them.
One option right now would be to leverage Event Notifications: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/setup-event-notification-destination.html
This way, whenever an object is created in the access-logs bucket, you can trigger a notification to SNS, SQS, or Lambda, and based on that download and analyze the log files.
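As a hedged illustration (the handler class, bucket wiring, and analysis step are assumptions, not anything from the docs page), a Java Lambda subscribed to the access-log bucket's object-created notifications could fetch each log file as it appears:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.event.S3EventNotification.S3EventNotificationRecord;

public class LogFileHandler implements RequestHandler<S3Event, Void> {
    private static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();

    @Override
    public Void handleRequest(S3Event event, Context context) {
        for (S3EventNotificationRecord record : event.getRecords()) {
            String bucket = record.getS3().getBucket().getName();
            String key = record.getS3().getObject().getKey();
            String logContents = S3.getObjectAsString(bucket, key);
            // ... parse logContents and update your GET/PUT/LIST statistics ...
        }
        return null;
    }
}
```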

Can a SQL trigger call a web service?

I'm building out a RESTful API for an iPhone app.
When a user "checks-in" [Inserts new row into a table] I want to then take data from that insert and call a web service, which would send push notifications based upon that insert.
The only way I can think of doing this is either through a trigger, or by having the actual insert method call the web service upon a successful insert. That seems like a bad idea to me.
Was wondering if you had any thoughts on this or if there was a better approach that I haven't thought of.
Even if it technically could, it's really not a good idea! A trigger should be very lean, and it should definitely not involve a lengthy operation (which a webservice call definitely is)! Rethink your architecture - there should be a better way to do this!
My recommendation would be to separate the task of "noticing" that you need to call the webservice, in your trigger, from the actual execution of that web service call.
Something like:
in your trigger code, insert a "call the webservice later" record into a table (just the INSERT, to keep it lean and fast - that's all)
have an asynchronous service (a SQL job, or preferably a Windows NT Service) that makes those calls separately from the actual trigger execution and stores any data retrieved from that web service into the appropriate tables in your database.
A trigger is a very finicky thing - it should always be very quick, very lean - do an INSERT or two at most - and by all means avoid cursors in triggers, or other lengthy operations (like a web service call).
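As a hedged sketch of that second piece - written here in Java with JDBC and the JDK HTTP client rather than as a Windows service, and with made-up table, column, and endpoint names - the asynchronous worker could look roughly like this:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;

public class NotificationWorker {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:sqlserver://db-host;databaseName=app")) {
            while (true) {
                // 1. Pull a batch of unprocessed rows written by the trigger
                Map<Long, String> batch = new LinkedHashMap<>();
                try (Statement st = db.createStatement();
                     ResultSet rs = st.executeQuery(
                             "SELECT TOP 10 id, payload FROM webservice_queue WHERE processed = 0")) {
                    while (rs.next()) {
                        batch.put(rs.getLong("id"), rs.getString("payload"));
                    }
                }
                // 2. Call the web service for each row, then mark the row as processed
                for (Map.Entry<Long, String> row : batch.entrySet()) {
                    HttpRequest push = HttpRequest.newBuilder()
                            .uri(URI.create("https://example.com/push"))   // placeholder endpoint
                            .POST(HttpRequest.BodyPublishers.ofString(row.getValue()))
                            .build();
                    HTTP.send(push, HttpResponse.BodyHandlers.ofString());
                    try (PreparedStatement done = db.prepareStatement(
                            "UPDATE webservice_queue SET processed = 1 WHERE id = ?")) {
                        done.setLong(1, row.getKey());
                        done.executeUpdate();
                    }
                }
                Thread.sleep(1000); // poll roughly once per second
            }
        }
    }
}
```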
Brent Ozar has a great webcast (presented at SQL PASS) on The Top 10 Developer Mistakes That Don't Scale, and triggers are the first thing he focuses on! Highly recommended.
It depends on the business needs. Usually I would stay away from using triggers for that, as this is business logic and should be handled by the business layer.
But the answer to your question is yes - you can do that; just make sure to call the web service asynchronously, so it does not delay the insert while the web service call finishes.
You may also consider using a one-way web service - i.e. fire and forget.
But, as others pointed out, you are always better off not using a trigger.
If properly architected, there should be only one piece of code that can communicate with the database, i.e. some abstraction of the DAL in a single service. Hook in there to do whatever is needed after an insert.
I would go with a trigger if there are many different applications that can write to the database directly, rather than through a DAL service - which, again, is a disaster waiting to happen.
Another situation in which I might go with a trigger is if I have to deal with an internally hosted third-party application, i.e. if I have access to the database server itself but not to the code that writes to the database.
What about a stored procedure? Instead of setting it up on a trigger, call a stored procedure, which will both insert the data and possibly do something else.
As far as I know, triggers are pretty limited in their scope of what they can do. A stored procedure may have more scope (or maybe not).
In the worst case, you can always build your own "API" page; instead of directly inserting the data, request the API page, which can both insert the data and do the push notification.
Trigger -> Queue -> SP -> XP_CMDShell -> BAT -> cURL -> 3rd-party web service
I used a trigger to insert a record into a queue table,
then a stored procedure using a cursor to pull queued entries off.
I had no WSDL or access to the 3rd party API developers and an urgent need to complete a prototype, so the Stored Procedure calls XP_CMDShell calling a .bat file with parameters.
The bat file calls cURL which manages the REST/JSON call and response.
It was free, quick and works reliably. Not architecturally pure but got the prototype off the ground.
A good practice is to have that web page make an entry into another table (which I will call message_queue) when the user hits the page.
Then have a Windows service / *nix daemon on a server scan the message_queue table and perform the pushes to the mobile app via a web service. You can leverage the power of transaction processing in SQL to manage the queue processing.
The nice thing about this approach is that you can start with everything on one standalone server, and then separate the website, database, and service/daemon onto different physical servers or server clusters as you scale up.