Log delay in Amazon S3

I have recently started hosting in Amazon S3, and I need the log files to calculate statistics for the "get", "put" and "list" operations on my objects.
I've observed that the log files are organized strangely. I don't know when a log will appear (not immediately, at least 20 minutes after the operation) or how many lines of logs will be contained in one log file.
I then need to download these log files and analyse them, but I can't figure out how often I should do this.
Can somebody help? Thanks.

What you describe (log files being made available with delays and in unpredictable order) is exactly the behaviour AWS declares you should expect. This is in the nature of the distributed system AWS uses to provide the S3 service: the same request may be served by a different server each time - I have seen 5 different IP addresses serving my content.
So the only solution is to accept the delay: measure the delay you actually experience, add some extra margin, and learn to live with that total delay (I would expect something like 30 to 60 minutes, but your own statistics will tell you more).
If you need log records ordered, you either have to sort them yourself or look for a log processing solution - I have seen applications offered exactly for this purpose.
If you really need your logs with a very short delay, you have to produce the logs yourself, which means writing and running a frontend that gives access to your files on S3 and at the same time logs whatever you need.
I run such a solution: users get a user name, a password and the URL of my frontend. When a request comes in, I check whether the credentials are valid and whether the user is allowed to see the given resource; if so, I create a temporary URL for that resource that is valid for a few minutes and redirect the request to it.
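As a rough illustration (not my production code), such a frontend could look like the following Python/boto3 sketch; the bucket name, the Flask app and the check_access() helper are all made-up placeholders:

```python
# Sketch of a frontend that checks credentials and redirects to a short-lived
# pre-signed S3 URL. The bucket name and check_access() are placeholders.
import boto3
from flask import Flask, request, redirect, abort

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-private-bucket"  # hypothetical bucket name


@app.route("/files/<path:key>")
def serve_file(key):
    auth = request.authorization  # HTTP basic auth credentials, if any
    if auth is None or not check_access(auth.username, auth.password, key):
        abort(403)
    # Pre-signed URL that is only valid for a few minutes
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=300,  # seconds
    )
    return redirect(url)


def check_access(username, password, key):
    # Placeholder: validate credentials and per-resource permissions here
    return False
```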
But such a frontend costs money (you have to run it somewhere) and is less robust than accessing AWS S3 directly.
Good luck, Lulu.

A lot has changed since the question was originally posted. The delay is still there, but one of the OP's concerns was when to download the logs to analyze them.
One option right now would be to leverage Event Notifications: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/setup-event-notification-destination.html
This way, whenever an object is created in the access logs bucket, you can trigger a notification to SNS, SQS or Lambda, and based on that download and analyze the log files.
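For example, a Lambda handler along these lines (a minimal Python/boto3 sketch, with the log-line analysis left as a placeholder) could fetch each new log file as soon as S3 delivers it:

```python
# Sketch of a Lambda handler triggered by s3:ObjectCreated:* notifications on
# the access-log bucket; the actual analysis is left as a placeholder.
import urllib.parse
import boto3

s3 = boto3.client("s3")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the notification payload
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for line in body.decode("utf-8").splitlines():
            process_log_line(line)  # placeholder: count GET/PUT/LIST, etc.


def process_log_line(line):
    pass  # your own statistics go here
```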

Related

How to add custom authentication to AWS S3 download of large files

I'm trying to figure out how to implement these requirements for S3 downloads:
Signed URL (links should become invalid after some amount of time).
Download only 1 time - any other requests to the same URL should fail.
Need to restrict downloads to the user/browser who made the request to generate the signed URL - no other user should be able to download.
Be able to deal with large files (ideally, streaming, just like when someone downloads directly from a standard S3 access point).
Things that I've tried:
S3 Object Lambda + Access Point
Generating a pre-signed URL to the Lambda access point works well.
Making use of S3 object metadata to store download state / restrict downloads to just 1 time also works well.
No way to access user-agent or requestor's IP.
Large files are a problem. The timeout has been configured to 15 minutes (the max), but the request still times out much earlier. This was done with NodeJS.
Lambda + Lambda URL
A pre-signed URL is generated and passed to the Lambda URL as an encoded param - the Lambda makes the request if auth/validation passes. This approach seems to work fine.
Can use the same approach of leveraging S3 object metadata to limit downloads to just 1 time (a sketch of that check appears after this list).
User-agent and requestor IP are available, which is great.
Large files are a problem. I've tried NodeJS and it behaves the same as the S3 Object Lambda (eventually times out, even earlier than the configured time). I also implemented the Java streaming handler, but it dies with an "out of memory" error, even when I bump the memory up to 3GB (the file is only 1GB, and I thought streaming would get around the memory problem anyway). I've tried several ways to stream (Java 11), but it really seems like the streaming handler is not actually streaming, but buffering somewhere outside of the Lambda.
I'm now unsure if AWS Lambda will be able to handle all of these requirements, but I would really like to know if others have ideas, or if I'm missing something.
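For reference, the metadata-based one-time-download check mentioned above looks roughly like this in Python with boto3 (bucket, key and metadata names are made up, and the read-then-copy sequence is not atomic, so truly concurrent requests could still race):

```python
# Sketch: mark an object as "downloaded" via user metadata so a second
# request can be rejected. Metadata key and bucket/key names are made up.
# Note: this read-then-copy sequence is NOT atomic.
import boto3

s3 = boto3.client("s3")


def allow_one_time_download(bucket: str, key: str) -> bool:
    head = s3.head_object(Bucket=bucket, Key=key)
    meta = head.get("Metadata", {})
    if meta.get("downloaded") == "true":
        return False  # already fetched once
    meta["downloaded"] = "true"
    # Replacing metadata requires copying the object onto itself
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata=meta,
        MetadataDirective="REPLACE",
    )
    return True
```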

Amazon S3: how parallel PUTs to the same key are resolved in versioned buckets

The Amazon S3 data consistency model contains the following note:
Amazon S3 does not currently support object locking for concurrent updates. If two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins. If this is an issue, you will need to build an object-locking mechanism into your application.
For my application I am considering resolving conflicts caused by concurrent writes by analyzing all object versions on the client side in order to assemble a single composite object, hopefully reconciling the two changes.
However, I am not able to find a definitive answer to how the highlighted part ("the request with the latest timestamp wins") plays out in versioned buckets.
Specifically, after reading how Object Versioning works, it appears that S3 will create a new object version with a unique version ID on each PUT right away, after which I assume that data replication will kick in and S3 will have to determine which of the two concurrent writes to retain.
So the questions that I have on this part are:
Will S3 keep track of a separate version for each of the two concurrent writes for the object?
Will I see both versions when querying list of versions for the object using API, with one of them being arbitrarily marked as current?
I found this question while debugging our own issue caused by concurrent PUTs to the same key. Returning to fulfill my obligations per [1].
Rapid concurrent PUTs to the same S3 key do sometimes collide, even with versioning enabled. In such cases S3 will return a 503 for one of the requests (the one with the oldest timestamp, per the doc snippet you pasted above).
Here's a note from S3's engineering team, passed on to us by our business support contact:
While S3 supports request rates of up to 3500 REST.PUT.OBJECT requests per second to a single partition, in some rare scenarios, rapid concurrent REST.PUT.OBJECT requests to the same key may result in a 503 response. In such cases, further partitioning also does not help because the requests for the same key will land on the same partition. In your case, we looked at the reason your request received a 503 response and determined that it was because there was a concurrent REST.PUT.OBJECT request for the same key that our system was processing. In such cases, retrying the failed request will most likely result in success.
Your definition of rare may vary. We were using 4 threads and seeing a 503 for 1.5 out of every 10 requests.
The snippet you quoted is the only reference I can find in the docs to this, and it doesn't explicitly mention 503s (which usually indicate rate limiting due to more than 3500 requests per second).
When the requests don't collide (no 503 is returned), it works how you would expect: a new version per request, and the request with the most recent timestamp becomes the current version.
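To make the retry advice from the support note concrete, here is a minimal Python/boto3 sketch (the bucket and key names are placeholders, and boto3's built-in retries may already cover some of this): retry the PUT on a 503 and then list the versions S3 recorded.

```python
# Sketch: retry a PUT that gets a 503 because of a concurrent PUT to the
# same key, then list the versions S3 recorded. Names are placeholders.
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def put_with_retry(bucket, key, body, attempts=5):
    for attempt in range(attempts):
        try:
            return s3.put_object(Bucket=bucket, Key=key, Body=body)
        except ClientError as err:
            status = err.response["ResponseMetadata"]["HTTPStatusCode"]
            if status == 503 and attempt < attempts - 1:
                time.sleep(0.1 * 2 ** attempt)  # simple exponential backoff
                continue
            raise


def list_versions(bucket, key):
    resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
    # One entry per successful PUT; "IsLatest" marks the current version
    return [(v["VersionId"], v["IsLatest"]) for v in resp.get("Versions", [])]
```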
Hopefully this post will help someone with the same issue in the future.
[1] https://xkcd.com/979/

What happens if you’re in the middle of a process when AWSAssumeRole times out?

I’m currently working with a role that I need to assume to access certain buckets on S3.
I was wondering, if the duration given to an STSAssumeRoleSessionCredentialsProvider is 1 hour and you're doing something like downloading a file that takes 1.5 hours, does the process finish or does it stop in the middle because the duration ended?
The validity of the credentials is verified when the request is initiated. Once the request has been initiated successfully, the response will be sent completely. In your download example, if the credentials were valid when the download request was initiated, that is sufficient for the file to be downloaded completely.
STS credential expiry is a problem when repeated connections are made to AWS as part of a long-running program that reads the credentials at the beginning and stores them. It is generally good practice to decouple the STS-credential-acquisition process from the users of those credentials, and the users should ensure the credentials are re-read whenever the underlying source of credentials (typically a file) is modified.
These aspects are handled automatically by the AWS Java SDK's ProfileCredentialsProvider class. I'm not sure whether a similar module exists in other language bindings.
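For illustration, explicitly requesting a session long enough for the whole download looks like this in Python with boto3 (the role ARN, bucket and key are placeholders; the role's maximum session duration must allow the requested value):

```python
# Sketch: ask STS for a session long enough to cover the whole download.
# The role ARN, duration, bucket and key are placeholders.
import boto3

sts = boto3.client("sts")
resp = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/example-role",  # placeholder
    RoleSessionName="long-download",
    DurationSeconds=7200,  # 2 hours instead of the default 1 hour
)
creds = resp["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3.download_file("example-bucket", "big-file.bin", "/tmp/big-file.bin")
```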
Credentials are validated when presented on an API call. If you make your API call(s) before the credentials expire then you are fine.
If, however, you need to make multiple API calls, and one of them exceeds the expiration time, then that call will fail.
This is particularly relevant to S3 multi-part uploads, each part of which is a distinct API call and presents credentials each time. The solution to this is generally one of:
get credentials that are valid for long enough to complete the operation, or
refresh credentials when you are close to expiration and use the new credentials for subsequent part uploads (see the sketch below).
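A rough sketch of the second option in Python with boto3 (the role ARN, bucket, key and expiry margin are assumptions, not anything from the answer above):

```python
# Sketch: multipart upload that re-assumes the role whenever the current
# credentials are close to expiring. Role ARN, bucket and key are placeholders.
import datetime
import boto3

ROLE_ARN = "arn:aws:iam::123456789012:role/example-role"  # placeholder
MARGIN = datetime.timedelta(minutes=5)

sts = boto3.client("sts")


def fresh_s3_client(state={}):
    """Return an S3 client, re-assuming the role if credentials expire soon."""
    now = datetime.datetime.now(datetime.timezone.utc)
    if not state or state["expires"] - now < MARGIN:
        creds = sts.assume_role(
            RoleArn=ROLE_ARN, RoleSessionName="multipart-upload"
        )["Credentials"]
        state["expires"] = creds["Expiration"]
        state["client"] = boto3.client(
            "s3",
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )
    return state["client"]


def upload_parts(bucket, key, parts):
    s3 = fresh_s3_client()
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    etags = []
    for number, data in enumerate(parts, start=1):
        s3 = fresh_s3_client()  # refresh if needed before each part
        resp = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=upload["UploadId"],
            PartNumber=number, Body=data,
        )
        etags.append({"PartNumber": number, "ETag": resp["ETag"]})
    s3 = fresh_s3_client()
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload["UploadId"],
        MultipartUpload={"Parts": etags},
    )
```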

Why does AWS S3 getObject execute slowly even with small files?

I am relatively new to Amazon Web Services. A problem came up while I was coding my new web app. I am currently storing profile pictures in an S3 bucket.
I don't want these profile pictures to be seen by the public, only by authorized members. So I have a PHP file that executes getObject and sends out a header to show the picture, but only if the user is allowed to see it. I query the database and also check the session to make sure that the currently logged-in user has access to the picture. All is working fine, but it takes around 500 milliseconds for the get request to execute, even on small files (40 kB). On bigger files it gets even longer, and the same happens if I embed the PHP file in an img tag multiple times with different query string values.
I should mention that I'm testing this in a localhost environment with an Apache web server.
Could the problem be that getObject is optimized to be run from an EC2 instance, and that if I tested this on EC2 the response time would be much better?
My S3 bucket is in the London region, and I'm testing from Hungary with a good internet connection, so I'm not sure if this response time is what I should expect.
I read that other people had similar issues, but from my understanding the time it takes to transfer files from S3 to an EC2 instance should be minimal, as they are all in the cloud and the latency between these and all the other AWS services should be minimal (at least if they are in the same region).
Please don’t tell me in comments that I should just make my bucket public and embed the direct link to the file as it is not a viable option for obvious reasons. I also don’t want to generate pre-signed urls for various reasons.
I also tested this without querying the database and essentially the only logic in my code is to get the object and show it to the user. Even with this I get 400+ milliseconds response time.
I also tried using doesObjectExist() and I still need to wait around 300-400 milliseconds for that to give me a response.
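To separate the S3 call itself from the rest of the request handling, a quick timing check like the following (a Python/boto3 sketch with placeholder bucket and key names), run from both localhost and an EC2 instance in the bucket's region, can show where the time goes:

```python
# Sketch: time just the S3 get_object call, reusing one client so connection
# setup is only paid once. Bucket and key names are placeholders.
import time
import boto3

s3 = boto3.client("s3", region_name="eu-west-2")  # London, matching the bucket
BUCKET, KEY = "example-profile-pictures", "user-42.jpg"  # placeholders

for i in range(5):
    start = time.perf_counter()
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    elapsed = (time.perf_counter() - start) * 1000
    print(f"request {i}: {len(body)} bytes in {elapsed:.0f} ms")
# The first request includes connection setup; later ones show the
# steady-state latency from your network location to the bucket's region.
```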
UPDATE
I tested it on my EC2 instance and got a much better response time. I tested it with multiple files and all is fine. It seems that if you call getObject from localhost, the time it takes to connect to S3 and fetch the data is much higher.
Thank you for the answers!

AWS CloudWatch Logs limit

I am trying to find a centralized solution to move my application logging out of the database (RDS).
I was thinking of using CloudWatch Logs, but noticed that there is a limit on PutLogEvents requests:
The maximum rate of a PutLogEvents request is 5 requests per second
per log stream.
Even if I break my logs into many streams (based on EC2 instance and log type - error, info, warning, debug), the limit of 5 requests per second is still very restrictive for an active application.
The other solution is to somehow accumulate logs and send PutLogEvents with a batch of log records, but that means I am forced to use a database to accumulate those records.
So my questions are:
Maybe I'm wrong and the limit of 5 requests per second is not so restrictive?
Is there any other solution that I should consider, for example DynamoDB?
PutLogEvents is designed to put several events at once, by definition (as per its name: PutLogEvent"S") :) The CloudWatch Logs agent does this on its own, so you don't have to worry about it.
However, please note: I don't recommend generating too many logs (e.g. don't run debug mode in production), as CloudWatch Logs can become pretty expensive as your log volume grows.
My advice would be to use a Logstash solution on an AWS instance.
Alternatively, you can run Logstash on another existing instance or container.
https://www.elastic.co/products/logstash
It is designed for exactly this purpose and it does it wonderfully.
CloudWatch is not primarily designed for your needs.
I hope this helps somehow.
If you are calling this API directly from your application: the short answer is that you need to batch your log events (the 5 requests per second limit applies per PutLogEvents call, and each call can carry a batch of events).
If you are writing the logs to disk and pushing them afterwards, there is already an agent that knows how to push the logs (http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/QuickStartEC2Instance.html).
Meta: I would suggest that you prototype this and ensure that it works for the log volume that you have. Also keep in mind that, because of how the CloudWatch API works, only one application/user can push to a log stream at a time (see the sequence token you have to pass in), so you probably need to use multiple streams - one per user, or maybe per log type - to ensure that your applications are not competing for the stream.
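If you do call the API directly, batching could look roughly like this in Python with boto3 (the group and stream names are placeholders, the group/stream are assumed to already exist, and the sequence-token handling mentioned above is omitted for brevity):

```python
# Sketch: accumulate log records in memory and push them in one
# PutLogEvents call instead of one call per record. Names are placeholders.
# Assumes the log group and stream already exist; sequence-token handling
# is omitted.
import time
import boto3

logs = boto3.client("logs")
GROUP, STREAM = "my-app", "web-1-error"  # placeholder group/stream names

buffer = []


def log(message):
    buffer.append({"timestamp": int(time.time() * 1000), "message": message})
    if len(buffer) >= 100:  # flush in batches, not per record
        flush()


def flush():
    global buffer
    if buffer:
        logs.put_log_events(
            logGroupName=GROUP,
            logStreamName=STREAM,
            logEvents=buffer,  # events must be in chronological order
        )
        buffer = []
```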
Meta meta: think about how your application behaves if the logging subsystem fails and whether you can live with the possibility of losing logs (i.e. is it critical for you to always have a guarantee that you will get the logs?). This will probably drive what you do and which solution you ultimately pick.