faster search of file in s3 bucket in aws console - amazon-web-services

I am searching for a specific file in an S3 bucket that contains a lot of files. In my application I get a 403 Access Denied error, and with s3cmd I get a 403 (Forbidden) error if I try to get a file from the bucket. My problem is that I am not sure whether the permissions are the problem (because I can get other files) or the file simply isn't present in the bucket. I have started searching in the Amazon console interface, but I have been scrolling for hours and still have not reached "4...." (I am still at "39...."), and the file I am looking for is in a folder "C03215".
So, is there a faster way to verify that the file exists in the bucket? Or is there a way to auto-scroll while doing something else (because nothing new loads unless I scroll)?
P.S.: I have no permission to list the bucket with s3cmd.

Regarding accelerating the scrolling in the console
Like you, I have many thousands of objects that take an eternity to scroll through in the console.
I recently discovered though how to jump straight to a specific path/folder in the console that is going to save my mouse finger and my sanity!
This only works for folders, though, not the actual leaf objects themselves.
In the URL bar of your browser when viewing a bucket you will see something like:
console.aws.amazon.com/s3/home?region=eu-west-1#&bucket=your-bucket-name&prefix=
If you append your object's path after the prefix and hit Enter, you would expect it to jump to that location, but it does nothing (in Chrome at least).
However, if you append your object's path after the prefix, hit Enter, and then hit refresh (F5), the console will reload at your specified location.
e.g.
console.aws.amazon.com/s3/home?region=eu-west-1#&bucket=your-bucket-name&prefix=development/2015-04/TestEvent/93edfcbg-5e27-42d3-a2f9-3d86a63d27f9/
There was much joy in our office when this was figured out!
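If you do have list permission on the bucket (the asker notes above that they don't), the AWS CLI can jump straight to a prefix with no scrolling at all; a minimal sketch, using a placeholder bucket name and the folder name from the question:
# list the keys under the "C03215" folder (requires s3:ListBucket)
aws s3 ls s3://your-bucket-name/C03215/
# s3cmd equivalent
s3cmd ls s3://your-bucket-name/C03215/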

The only "faster way" is to have the s3:ListBucket permission on the bucket, because, as you have noticed, S3's response to a GET request is intentionally ambiguous if you don't.
If the object you request does not exist, the error Amazon S3 returns depends on whether you also have the s3:ListBucket permission.
If you have the s3:ListBucket permission on the bucket, Amazon S3 will return an HTTP status code 404 ("no such key") error.
If you don’t have the s3:ListBucket permission, Amazon S3 will return an HTTP status code 403 ("access denied") error.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html
Also, there's no way to accelerate scrolling in the console itself.
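A quick way to see that ambiguity from the command line, assuming your AWS CLI credentials are configured (bucket and key below are placeholders):
# with s3:ListBucket on the bucket, a missing key comes back as 404 (Not Found);
# without it, the same request comes back as 403 (Forbidden)
aws s3api head-object --bucket your-bucket-name --key C03215/your-file.txt
# s3cmd equivalent
s3cmd info s3://your-bucket-name/C03215/your-file.txt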

Related

AWS s3 bucket with CloudFront - Failed to load resource: the server responded with a status of 403 [duplicate]

Amazon S3 Error Code 403 Forbidden from EMR cluster

I know that this question may have been asked multiple times, but I tried those solutions and they didn't work out. Therefore, I am asking it in a new thread for a definite solution.
I have created an IAM user with S3 read-only permission (Get and List on all S3 resources), but when I try to access S3 from an EMR cluster using HDFS commands, it throws an "Error Code 403 Forbidden" exception for certain folders. People in other posts have answered that it is a permission issue, which I did not find to be the right explanation, since the error is "Forbidden" rather than "Access Denied". The error appears only for certain folders (containing objects) inside a bucket and for certain empty folders. I observed that if I use the native calls then it works normally, as follows:
Exception "Forbidden" when using s3a calls:
hdfs dfs -ls s3a://<bucketname>/<folder>
No error when using s3 native calls s3n and s3:
hdfs dfs -ls s3://<bucketname>/<folder>
hdfs dfs -ls s3n://<bucketname>/<folder>
Similar behavior has also been observed for empty folders, and I understand that on S3 only objects are physical files, whereas the rest ("buckets and folders") are just placeholders. However, if I create a new empty folder, the s3a call doesn't throw this exception.
P.S. - The root IAM access keys bypass this exception.
I'd recommend you file a JIRA on issues.apache.org, HADOOP project, component fs/s3 with the exact hadoop version you are using. Add the stack trace as the first comment, as that's the only way we could begin to work out what is happening.
FWIW, we haven't tested restricted permissions other than simple read-only and R/W; mixing permissions down the path is inevitably going to break things, as the client code expects to be able to HEAD, GET & LIST anything in the bucket.
BTW, the Hadoop S3 clients all mock empty directories by creating 0-byte objects with a "/" suffix, e.g. "folder/", then use a HEAD on that marker to probe for an empty directory. When data is added under an empty dir, the mock parent dir is DELETE-d.
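If you want to see those markers for yourself, a rough sketch with the AWS CLI (bucket and folder names are placeholders):
# an empty Hadoop "directory" shows up as a zero-byte object whose key ends in "/"
aws s3api list-objects-v2 --bucket your-bucket-name --prefix "folder/" --max-keys 5
# the marker can also be probed directly; with restricted permissions this HEAD may be the call that fails
aws s3api head-object --bucket your-bucket-name --key "folder/"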

Amazon S3 static site serves old contents

My S3 bucket hosts a static website. I do not have cloudfront set up.
I recently updated the files in my S3 bucket, and I confirmed manually in the bucket that the files were updated. Yet the site still serves an older version of the files. Is there some sort of caching or versioning that happens on static websites hosted on S3?
I haven't been able to find any solution on SO so far. Note: CloudFront is NOT enabled.
Is there some sort of caching or versioning that happens on Static websites hosted on S3?
Amazon S3 buckets provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES
What does this mean?
If you create a new object in S3, you will be able to access it immediately. However, if you update an existing object, you will only 'eventually' get the newest version from S3, so S3 might still deliver the previous version of the object.
I believe that, starting some time ago, read-after-write consistency is also available for updates in the US Standard region.
How long do you need to wait? Well, it depends; Amazon does not provide much information about this.
What can you do? Not much. If you want to make sure your S3 bucket itself has no problem delivering files, upload a new file to the bucket; you will be able to access it immediately.
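To check what the bucket is actually serving, independent of any browser cache, you can make a HEAD request against the website endpoint with curl; the hostname below assumes us-east-1 and a placeholder bucket name, so adjust for your region:
# compare Last-Modified and ETag with the object you just uploaded
curl -I http://your-bucket-name.s3-website-us-east-1.amazonaws.com/index.html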
The solution is below, but you need to use CloudFront. As Frederic Henri said, you cannot do much in the S3 bucket itself, but with CloudFront you can invalidate the cached object.
CloudFront will have cached that file at an edge location for 24 hours, which is the default TTL (time to live), and will continue to return that file for those 24 hours. Then, after the 24 hours are over and a request is made for that file, CloudFront will check the origin and see whether the file has been updated in the S3 bucket. If it has been updated, CloudFront will serve the new, updated version of the object. If it has not been updated, CloudFront will continue to serve the original version of the object.
However, when you update the file at the origin and want it to be served immediately via your website, you need to run a CloudFront invalidation. An invalidation wipes the file(s) from the CloudFront cache, so when a request is made to CloudFront, it will see that the file is not in its cache, check the origin, and serve the newly updated file from the origin. Running an invalidation is recommended each time files are updated at the origin.
To run an invalidation (a CLI equivalent is sketched after these steps):
click on the following link for CloudFront console
-- https://console.aws.amazon.com/cloudfront/home?region=eu-west-1#
open the distribution in question
click on the 'Invalidations' tab to the right of all the tabs
click on 'Create Invalidation'
on the popup, it will ask for the path. You can enter /* to invalidate every object from the cache, or enter the exact path to the file, such as /images/picture.jpg
finally click on 'Invalidate'
this typically completes within 2-3 minutes
then once the invalidation is complete, when you request the object again through CloudFront, CloudFront will check the origin and return the updated file.
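For reference, the same invalidation can be created from the AWS CLI; the distribution ID below is a placeholder:
# invalidate everything, or pass a specific path such as /images/picture.jpg
aws cloudfront create-invalidation --distribution-id E1234567890ABC --paths "/*"
# check progress; the status moves from InProgress to Completed
aws cloudfront list-invalidations --distribution-id E1234567890ABC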
It sounds like Akshay tried uploading with a new filename and it worked.
I just tried the same (I was having the same problem), and it resolved the file not being available for me.
Do a push of index.html
index.html not updated
mv index.html index-new.html
Do a push of index-new.html
After this, index-new.html was immediately available (a CLI version of the same workaround is sketched below).
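A minimal CLI sketch of that rename workaround (bucket name is a placeholder):
# upload the content under a new key instead of overwriting the old one
aws s3 cp index.html s3://your-bucket-name/index-new.html
# optionally remove the stale copy once the new one is confirmed working
aws s3 rm s3://your-bucket-name/index.html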
That's kind of shite - I can't share one link to my website if I want to be sure that the recipient will see the latest version? I need to keep changing the filename and re-sharing the new link.

Amazon CloudFront Latency

I am experimenting with AWS S3 and CloudFront for a web application that I am developing.
In the app I'm letting users upload files to the S3 bucket (using the AWS SDK) and making them available via the CloudFront CDN. The issue is that even when the files are uploaded and ready in the S3 bucket, it takes about a minute or two for them to become available at the CloudFront CDN URL. Is this normal?
CloudFront attempts to fetch uncached content from the origin server in real time. There is no "replication delay" or similar issue because CloudFront is a pull-through CDN. Each CloudFront edge location knows only about your site's existence and configuration; it doesn't know about your content until it receives requests for it. When that happens, the CloudFront edge fetches the requested content from the origin server, and caches it as appropriate, for serving subsequent requests.
The issue that's occurring here is related to a concept sometimes called "negative caching" -- caching the fact that a request won't work -- which is typically done to avoid hammering the origin of whatever's being cached with requests that are likely to fail anyway.
By default, when your origin returns an HTTP 4xx or 5xx status code, CloudFront caches these error responses for five minutes and then submits the next request for the object to your origin to see whether the problem that caused the error has been resolved and the requested object is now available.
— http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/custom-error-pages.html
If the browser, or anything else, tries to download the file from that particular CloudFront edge before the upload into S3 is complete, S3 will return an error, and CloudFront -- at that edge location -- will cache that error and remember, for the next 5 minutes, not to bother trying again.
Not to worry, though -- this timer is configurable, so if the browser is doing this under the hood and outside your control, you should still be able to fix it.
You can specify the error-caching duration—the Error Caching Minimum TTL—for each 4xx and 5xx status code that CloudFront caches. For a procedure, see Configuring Error Response Behavior.
— http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/custom-error-pages.html
To configure this in the console (a rough CLI alternative is sketched after the steps):
When viewing the distribution configuration, click the Error Pages tab.
For each error where you want to customize the timing, begin by clicking Create Custom Error Response.
Choose the error code you want to modify from the drop-down list, such as 403 (Forbidden) or 404 (Not Found) -- your bucket configuration determines which code S3 returns for missing objects, so if you aren't sure, change 403 then repeat the process and change 404.
Set Error Caching Minimum TTL (seconds) to 0
Leave Customize Error Response set to No (If set to Yes, this option enables custom response content on errors, which is not what you want. Activating this option is outside the scope of this question.)
Click Create. This takes you back to the previous view, where you'll see Error Caching Minimum TTL for the code you just defined.
Repeat these steps for each HTTP response code you want to change from the default behavior (which is the 300 second hold time, discussed above).
When you've made all the changes you want, return to the main CloudFront console screen where the distributions are listed. Wait for the distribution state to change from In Progress to Deployed (formerly, this took quite some time but now requires typically about 5 minutes for the changes to be pushed out to all the edges) and test.
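If you'd rather script it, the same setting lives under CustomErrorResponses in the distribution config. This is only a rough sketch: it assumes jq is installed, uses a placeholder distribution ID, and leaves the actual JSON edit to you:
# pull the current config and its ETag
aws cloudfront get-distribution-config --id E1234567890ABC --output json > dist.json
ETAG=$(jq -r '.ETag' dist.json)
jq '.DistributionConfig' dist.json > config.json
# edit config.json: in CustomErrorResponses, set ErrorCachingMinTTL to 0 for the 403/404 entries
aws cloudfront update-distribution --id E1234567890ABC --if-match "$ETAG" --distribution-config file://config.json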
Are these new files being written to S3 for the first time, or are they updates to existing files? S3 provides read-after-write consistency for new objects, and given CloudFront's pull model you should not be having this issue with new files written to S3. If you are, then I would open a ticket with AWS.
If these are updates to existing files, then you have both S3 eventual consistency and CloudFront cache expiration to deal with. Both of which could cause this sort of behavior.
As observed in your comment, it seems that Google Chrome is interfering with your upload/preview strategy:
Chrome requests the URL before the content exists.
The request is cached by CloudFront with the error response.
You upload the file to S3.
When you preview the uploaded file, CloudFront answers with the cached response (from step 2).
After the CloudFront cache expires, CloudFront hits the origin and the problem can no longer be reproduced.
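One quick way to confirm which case you are hitting is to look at the response headers CloudFront returns (the hostname below is a placeholder): an X-Cache of "Error from cloudfront" together with a growing Age value indicates the edge is still serving the negatively cached error.
# HEAD the object through CloudFront and inspect the X-Cache and Age headers
curl -I https://d1234example.cloudfront.net/path/to/uploaded-file.jpg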

Amazon S3 and Cloudfront with TTL=0 Testing procedure

I would like to test and verify that my TTL=0 setting actually worked.
What I have:
An S3 bucket that is mounted to a directory on my Red Hat machine, so when I edit a simple txt file from the shell, I can open it in the AWS console bucket manager and view the file. I have also created a CloudFront distribution, so I can open the txt file via the CloudFront link.
Test:
I edit the txt file over telnet, then open it from the AWS console in the S3 bucket section and see that the file has changed; but when I open the file via the CloudFront link, it hasn't changed. This means the TTL=0 did not work.
How can I verify that TTL=0 works and that it is set correctly? After creating the distribution I cannot find where to edit the TTL again.
Thanks
Quoting AWS:
Note that our default behavior isn’t changing; if no cache control header is set, each edge location will continue to use an expiration period of 24 hours before checking the origin for changes to that file. You can also continue to use Amazon CloudFront's Invalidation feature to expire a file sooner than the TTL set on that file.
You're likely not setting the cache-control headers correctly. One way to confirm that is to enable S3 bucket logging: new log files will appear whenever there are new HTTP GETs against your S3 bucket, even if they come from CloudFront.
You could also test S3 directly with curl (or s3curl), so you can inspect the response headers yourself.
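For example (bucket and key are placeholders; this assumes the object is publicly readable or you use a pre-signed URL):
# look for Cache-Control / Expires in the response; if they are absent, CloudFront falls back to its default 24-hour TTL
curl -I https://your-bucket-name.s3.amazonaws.com/test.txt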
My recommendation is that, whenever you upload new content, you force a CloudFront invalidation. If you're using tools like s3fs, then inotify/incron might help you.
(Disclaimer: I totally hate the whole idea of mapping filesystems off to S3. They're quite different tools and you're likely to get 'leaky abstractions')
It is most likely that you are not sending any caching (TTL) headers from S3. CloudFront looks for a Cache-Control/Expires header on the source object, and if it doesn't find one, it defaults to 24 hours.
You could set the headers on your objects yourself, or use a tool like S3 Browser to apply them automatically. http://s3browser.com/automatically-apply-http-headers.php
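If you upload with the AWS CLI rather than through the filesystem mapping, the header can be attached at upload time; a minimal sketch with a placeholder bucket name:
# attach a Cache-Control header to the object as it is uploaded
aws s3 cp test.txt s3://your-bucket-name/test.txt --cache-control "max-age=0, no-cache"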
If you just want to test then I would follow the steps below.
Create a new text file in your bucket
Through the AWS console, locate the file and check and/or add the caching headers
Retrieve the file from CloudFront
Change the file in the bucket
Check the headers of the new file in AWS console (your S3 mapping utility may erase the previous file headers)
Retrieve the new changed file from CloudFront
Sending an invalidation request to CloudFront for every edit may become chargeable if you have a large number of edits per month. Plus, invalidations take several minutes (sometimes 20 minutes or more) to propagate, meaning you can never instantly change your content.