I am trying to download data from one of Amazon's public buckets.
Here is a description of the bucket in question
The bucket has web-accessible folders, for example.
I want to download, say, all of the listed files in such a folder.
There will be a long list of suitable tiles identified, and the goal is to get all the files in a folder in one go rather than downloading each individually from the HTTP site.
From other Stack Overflow questions I realize I need to use the REST endpoint and a tool like the AWS CLI or Cyberduck, but so far I cannot get these to work.
I think the issue may be authentication. I don't have an AWS account, and I was hoping to stick with guest / anonymous access.
Does anyone have a good solution / tool to traverse a public bucket and grab the contents as a guest? Could a different approach using curl or wget work for this type of task?
Thanks.
For the AWS CLI, you need to provide the --no-sign-request flag to skip signing. Example:
> aws s3 ls landsat-pds
Unable to locate credentials. You can configure credentials by running "aws configure".
> aws s3 ls landsat-pds --no-sign-request
PRE L8/
PRE landsat-pds_stats/
PRE runs/
PRE tarq/
PRE tarq_corrupt/
PRE test/
2015-01-28 10:13:53 23764 index.html
2015-04-14 10:43:22 25 robots.txt
2016-07-13 12:53:31 38 run_info.json
2016-07-13 12:53:30 23971821 scene_list.gz
To download that entire bucket into a directory, you would do something like this:
> mkdir landsat-pds
> aws s3 sync s3://landsat-pds landsat-pds --no-sign-request
I kept getting
SSL validation failed for https://s3bucket.eu-central-1.amazonaws.com/?list-type=2&prefix=&delimiter=%2F&encoding-type=url [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)
To work around it I used --no-verify-ssl, so
aws s3 ls s3bucket --no-sign-request --no-verify-ssl
... does the trick
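As for the curl/wget idea from the question: a plain anonymous GET of the bucket's REST endpoint returns a ListBucketResult XML document, and the object keys in it can be extracted and fetched one by one. A minimal sketch; the landsat-pds endpoint and L8/ prefix are just examples, and a canned sample payload stands in for the network call so the key extraction is visible:

```shell
# The real (anonymous) listing call on a public bucket would be:
#   curl -s "https://landsat-pds.s3.amazonaws.com/?list-type=2&prefix=L8/"
# It returns ListBucketResult XML; a sample payload stands in for it here:
xml='<ListBucketResult><Contents><Key>L8/scene1.TIF</Key></Contents><Contents><Key>L8/scene2.TIF</Key></Contents></ListBucketResult>'

# split on '<' and keep the text that follows each <Key> tag
keys=$(printf '%s\n' "$xml" | tr '<' '\n' | sed -n 's/^Key>//p')
printf '%s\n' "$keys"

# each key could then be downloaded with:
#   wget "https://landsat-pds.s3.amazonaws.com/$key"
```

Note that real listings are paginated at 1,000 keys per response, which is one more reason aws s3 sync is the easier route.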
Related
Using the AWS CLI for ACM's import-certificate to re-import a renewed cert, chain, and private key for a Let's Encrypt certificate that gets dropped off in an S3 bucket. It seems the usual file parameter syntax doesn't accept S3 paths.
I am using [aws-cli/1.18.69 Python/3.8.10 Linux/5.14.0-1056-oem botocore/1.16.19]
Here is what is not working:
aws acm import-certificate --certificate fileb://s3://foo-bucket-001/bar.com/cert.pem --certificate-chain fileb://s3://foo-bucket-001/bar.com/chain.pem --private-key fileb://s3://foo-bucket-001/bar.com/privkey.pem --certificate-arn arn:aws:acm:us-east-1:000000000000:certificate/d3bbe6f3-c479-4bbe-ad16-cc97745501a5
Error Message:
Error parsing parameter '--certificate': Unable to load paramfile fileb://s3://foo-bucket-001/townsquareignite.com/cert.pem: [Errno 2] No such file or directory: 's3://foo-bucket-001/townsquareignite.com/cert.pem'
I've tried s3://, file://, and fileb://, as well as using the ARN for the S3 objects.
Having no joy.
Using fileb://path/to-local/cert.pem does work, so evidently it's just my syntax for pointing the binary-file parameters at objects in the S3 bucket that is not correct. But I cannot find any documentation or previous answer covering this.
Any AWS CLI ACM via S3 guidance here?
I believe you cannot use this CLI command with S3. If you type aws acm import-certificate help, there is nothing about S3 at all. A private certificate is usually very sensitive information, and I think AWS doesn't encourage uploading it to S3; or maybe that was a deliberate choice when they designed this CLI sub-command.
Unfortunately, you'll have to import it once from your local machine and then refer to its ARN in your automation/infrastructure code.
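In other words, the workaround is a two-step dance: copy the objects out of S3 to local disk, then import from local paths. A sketch, wrapped in a hypothetical helper function; the bucket and key names are the ones from the question, and actually running it requires credentials with s3:GetObject and acm:ImportCertificate:

```shell
# Hypothetical helper: fetch the cert material from S3, then import it into
# ACM from local disk (the S3 paths below are taken from the question).
import_cert_from_s3() {
  arn="$1"   # existing certificate ARN to re-import over
  aws s3 cp s3://foo-bucket-001/bar.com/cert.pem .    || return 1
  aws s3 cp s3://foo-bucket-001/bar.com/chain.pem .   || return 1
  aws s3 cp s3://foo-bucket-001/bar.com/privkey.pem . || return 1
  aws acm import-certificate \
    --certificate fileb://cert.pem \
    --certificate-chain fileb://chain.pem \
    --private-key fileb://privkey.pem \
    --certificate-arn "$arn"
}
# usage:
#   import_cert_from_s3 arn:aws:acm:us-east-1:000000000000:certificate/d3bbe6f3-c479-4bbe-ad16-cc97745501a5
```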
I'm trying to get at the Common Crawl news S3 bucket, but I keep getting a "fatal error: Unable to locate credentials" message. Any suggestions for how to get around this? As far as I was aware Common Crawl doesn't even require credentials?
From News Dataset Available – Common Crawl:
You can access the data even without an AWS account by adding the command-line option --no-sign-request.
I tested this by launching a new Amazon EC2 instance (without an IAM role) and issuing the command:
aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/
It gave me the error: Unable to locate credentials
I then ran it with the additional parameter:
aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/ --no-sign-request
It successfully listed the directories.
I have a bucket with an IP-whitelist policy. I would like to be able to aws s3 cp, or recursively wget, everything in a "sub-folder" of that bucket. Is there a way to do this? wget works fine on a single file.
What I've tried:
aws s3 cp with no profile set, relying on the IP white list; this fails with a 403.
A recursive wget, this fails with a 403.
A wget with a wildcard, this is not actually a thing in HTTP.
IP white listing is very useful; it would be nice to get whole "folders" instead of just individual objects.
You could use the AWS Command-Line Interface (CLI) to copy the files, either:
aws s3 cp --recursive s3://bucket/path/ localdir
or
aws s3 sync s3://bucket/path/ localdir
This would require a set of AWS credentials because API calls are authenticated while your wget method appears to be unauthenticated. If the bucket policy is granting public access (List and Get) on the bucket, then the credentials do not actually need additional permissions.
Apparently in my case the issue was that we were white listing on GET but not LIST, if anyone else runs into this problem make sure you have both.
In order to do a recursive GET, the AWS CLI first "quietly" lists the objects in the bucket. That's why we were able to wget a single item but not GET multiple objects through S3.
The second thing I ran into was that you need valid credentials (so not an empty string, etc.), as @john-rotenstein points out in the comment on his answer.
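For anyone setting this up, note that the two permissions attach to different resources: s3:GetObject applies to the objects (the bucket ARN plus /*), while s3:ListBucket applies to the bucket itself. A sketch of a bucket policy granting both to a whitelisted range, with a placeholder bucket name and CIDR:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowGetFromWhitelistedIPs",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::bucket/*",
      "Condition": { "IpAddress": { "aws:SourceIp": "203.0.113.0/24" } }
    },
    {
      "Sid": "AllowListFromWhitelistedIPs",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::bucket",
      "Condition": { "IpAddress": { "aws:SourceIp": "203.0.113.0/24" } }
    }
  ]
}
```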
I need to send someone a link to download a folder stored in an amazon S3 bucket. Is this possible?
You can do that using the AWS CLI
aws s3 sync s3://<bucket>/path/to/folder/ .
There are many options if you need to filter specific files etc ... check the doc page
You can also use Minio Client aka mc for this. It is open source and S3 compatible. mc policy command should do this for you.
Set bucket to "download" on Amazon S3 cloud storage.
$ mc policy download s3/your_bucket
This will add a download policy on all the objects inside the bucket your_bucket, and an object named yourobject can then be accessed at the URL below.
https://your_bucket.s3.amazonaws.com/yourobject
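The URL above follows S3's virtual-hosted-style addressing, so it can be built generically from a bucket name and object key (using the placeholder names from above; keys containing special characters would need URL-encoding first):

```shell
bucket="your_bucket"
object="yourobject"
# virtual-hosted-style URL: the bucket name becomes the subdomain
url="https://${bucket}.s3.amazonaws.com/${object}"
echo "$url"
```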
Hope it helps.
Disclaimer: I work for Minio
I have been struggling for about a week to download arXiv articles as mentioned here: http://arxiv.org/help/bulk_data_s3#src.
I have tried lots of things: s3Browser, s3cmd. I am able to login to my buckets but I am unable to download data from arXiv bucket.
I tried:
s3cmd get s3://arxiv/pdf/arXiv_pdf_1001_001.tar
See:
$ s3cmd get s3://arxiv/pdf/arXiv_pdf_1001_001.tar
s3://arxiv/pdf/arXiv_pdf_1001_001.tar -> ./arXiv_pdf_1001_001.tar [1 of 1]
s3://arxiv/pdf/arXiv_pdf_1001_001.tar -> ./arXiv_pdf_1001_001.tar [1 of 1]
ERROR: S3 error: Unknown error
s3cmd get with x-amz-request-payer:requester
It gave me the same error again:
$ s3cmd get --add-header="x-amz-request-payer:requester" s3://arxiv/pdf/arXiv_pdf_manifest.xml
s3://arxiv/pdf/arXiv_pdf_manifest.xml -> ./arXiv_pdf_manifest.xml [1 of 1]
s3://arxiv/pdf/arXiv_pdf_manifest.xml -> ./arXiv_pdf_manifest.xml [1 of 1]
ERROR: S3 error: Unknown error
Copying
I have tried copying files from that folder too.
$ aws s3 cp s3://arxiv/pdf/arXiv_pdf_1001_001.tar .
A client error (403) occurred when calling the HeadObject operation: Forbidden
Completed 1 part(s) with ... file(s) remaining
This probably means that I made a mistake. The problem is that I don't know what to add, and where, to convey my willingness to pay for the download.
I am unable to figure out what I should do to download data from S3. I have been reading a lot on the AWS sites, but nowhere can I find a pinpointed solution to my problem.
How can I bulk download the arXiv data?
Try downloading s3cmd version 1.6.0: http://sourceforge.net/projects/s3tools/files/s3cmd/
$ s3cmd --configure
Enter your credentials found in the account management tab of the Amazon AWS website interface.
$ s3cmd get --recursive --skip-existing s3://arxiv/src/ --requester-pays
Requester Pays is a feature on Amazon S3 buckets that requires the user of the bucket to pay Data Transfer costs associated with accessing data.
Normally, the owner of an S3 bucket pays Data Transfer costs, but this can be expensive for free / Open Source projects. Thus, the bucket owner can activate Requester Pays to reduce the portion of the costs they are charged.
Therefore, when accessing a Requester Pays bucket, you will need to authenticate yourself so that S3 knows whom to charge.
I recommend using the official AWS Command-Line Interface (CLI) to access AWS services. You can provide your credentials via:
aws configure
and then view the bucket via:
aws s3 ls s3://arxiv/pdf/
and download via:
aws s3 cp s3://arxiv/pdf/arXiv_pdf_1001_001.tar .
UPDATE: I just tried the above myself, and received Access Denied error messages (both on the bucket listing and the download command). When using s3cmd, it says ERROR: S3 error: Access Denied. It would appear that the permissions on the bucket no longer permit access. You should contact the owners of the bucket to request access.
At the bottom of this page, arXiv explains that s3cmd gets denied because it does not support access to Requester Pays buckets as a non-owner, and that you have to apply a patch to the s3cmd source code. However, the version of s3cmd they used is outdated and the patch does not apply to the latest version of s3cmd.
Basically, you need to allow s3cmd to add the x-amz-request-payer header to its HTTP requests to buckets. Here is how to fix it:
Download the source code of s3cmd.
Open S3/S3.py with a text editor.
Add these two lines of code at the bottom of the __init__ function:
if self.s3.config.extra_headers:
    self.headers.update(self.s3.config.extra_headers)
Install s3cmd as instructed.
For me the problem was that my IAM user didn't have enough permissions.
Setting AmazonS3FullAccess was the solution for me.
Hope it saves someone some time.
Don't want to steal the thunder, but OttoV's comment actually gave the right command that works for me.
aws s3 ls --request-payer requester s3://arxiv/src/
My EC2 is in Region us-east-2, but the arXiv S3 buckets are in Region us-east-1. Because the bucket is Requester Pays, the --request-payer requester flag is needed so that the request is authenticated and the charges, including the cross-Region data transfer in my case, can be billed to me.
From https://aws.amazon.com/s3/pricing/?nc=sn&loc=4 :
You pay for all bandwidth into and out of Amazon S3, except for the following:
• Data transferred in from the internet.
• Data transferred out to an Amazon Elastic Compute Cloud (Amazon EC2) instance, when the instance is in the same AWS Region as the S3 bucket (including to a different account in the same AWS region).
• Data transferred out to Amazon CloudFront (CloudFront).