I have been struggling for about a week to download arXiv articles as mentioned here: http://arxiv.org/help/bulk_data_s3#src.
I have tried lots of things: s3Browser, s3cmd. I am able to log in to my own buckets, but I am unable to download data from the arXiv bucket.
I tried:
s3cmd get s3://arxiv/pdf/arXiv_pdf_1001_001.tar
See:
$ s3cmd get s3://arxiv/pdf/arXiv_pdf_1001_001.tar
s3://arxiv/pdf/arXiv_pdf_1001_001.tar -> ./arXiv_pdf_1001_001.tar [1 of 1]
s3://arxiv/pdf/arXiv_pdf_1001_001.tar -> ./arXiv_pdf_1001_001.tar [1 of 1]
ERROR: S3 error: Unknown error
s3cmd get with x-amz-request-payer:requester
It gave me the same error again:
$ s3cmd get --add-header="x-amz-request-payer:requester" s3://arxiv/pdf/arXiv_pdf_manifest.xml
s3://arxiv/pdf/arXiv_pdf_manifest.xml -> ./arXiv_pdf_manifest.xml [1 of 1]
s3://arxiv/pdf/arXiv_pdf_manifest.xml -> ./arXiv_pdf_manifest.xml [1 of 1]
ERROR: S3 error: Unknown error
Copying
I have tried copying files from that folder too.
$ aws s3 cp s3://arxiv/pdf/arXiv_pdf_1001_001.tar .
A client error (403) occurred when calling the HeadObject operation: Forbidden
Completed 1 part(s) with ... file(s) remaining
This probably means I am making a mistake somewhere. The problem is that I don't know what to add, or where, to signal that I agree to pay for the download.
I cannot figure out what I should do to download the data from S3. I have been reading a lot on the AWS sites, but nowhere can I find a pinpoint solution to my problem.
How can I bulk download the arXiv data?
Try downloading s3cmd version 1.6.0: http://sourceforge.net/projects/s3tools/files/s3cmd/
$ s3cmd --configure
Enter your credentials found in the account management tab of the Amazon AWS website interface.
$ s3cmd get --recursive --skip-existing s3://arxiv/src/ --requester-pays
Requester Pays is a feature on Amazon S3 buckets that requires the user of the bucket to pay Data Transfer costs associated with accessing data.
Normally, the owner of an S3 bucket pays Data Transfer costs, but this can be expensive for free / Open Source projects. Thus, the bucket owner can activate Requester Pays to reduce the portion of costs they will be charged.
Therefore, when accessing a Requester Pays bucket, you will need to authenticate yourself so that S3 knows whom to charge.
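For illustration, here is a minimal boto3 sketch (assuming boto3 is installed and your AWS credentials are configured) that downloads one of the tar files from the question while declaring that the requester pays:

# A rough sketch, assuming boto3 is installed and AWS credentials are configured.
import boto3

s3 = boto3.client("s3")

# Every request to a Requester Pays bucket must declare that the requester
# accepts the data-transfer charges, otherwise S3 answers 403 Forbidden.
s3.download_file(
    Bucket="arxiv",
    Key="pdf/arXiv_pdf_1001_001.tar",
    Filename="arXiv_pdf_1001_001.tar",
    ExtraArgs={"RequestPayer": "requester"},
)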
I recommend using the official AWS Command-Line Interface (CLI) to access AWS services. You can provide your credentials via:
aws configure
and then view the bucket via:
aws s3 ls s3://arxiv/pdf/
and download via:
aws s3 cp s3://arxiv/pdf/arXiv_pdf_1001_001.tar .
UPDATE: I just tried the above myself, and received Access Denied error messages (both on the bucket listing and the download command). When using s3cmd, it says ERROR: S3 error: Access Denied. It would appear that the permissions on the bucket no longer permit access. You should contact the owners of the bucket to request access.
At the bottom of this page, arXiv explains that s3cmd gets denied because it does not support accessing Requester Pays buckets as a non-owner, and that you have to apply a patch to the s3cmd source code. However, the version of s3cmd they used is outdated, and the patch does not apply to the latest version of s3cmd.
Basically, you need to allow s3cmd to add the "x-amz-request-payer" header to the HTTP requests it sends to buckets. Here is how to fix it:
Download the source code of s3cmd.
Open S3/S3.py with a text editor.
Add these two lines of code at the bottom of the __init__ function:
if self.s3.config.extra_headers:
    self.headers.update(self.s3.config.extra_headers)
Install s3cmd as instructed.
For me the problem was that my IAM user didn't have enough permissions.
Setting AmazonS3FullAccess was the solution for me.
Hope it saves someone some time.
I don't want to steal the thunder, but OttoV's comment actually gave the right command that works for me.
aws s3 ls --request-payer requester s3://arxiv/src/
My EC2 instance is in Region us-east-2, but the arXiv S3 buckets are in Region us-east-1, so I think that's why --request-payer requester is needed.
From https://aws.amazon.com/s3/pricing/?nc=sn&loc=4 :
You pay for all bandwidth into and out of Amazon S3, except for the following:
• Data transferred in from the internet.
• Data transferred out to an Amazon Elastic Compute Cloud (Amazon EC2) instance, when the instance is in the same AWS Region as the S3 bucket (including to a different account in the same AWS region).
• Data transferred out to Amazon CloudFront (CloudFront).
Related
I am trying to register a repository on AWS S3 to store ElasticSearch snapshots.
I am following the guide and ran the very first command listed in the doc.
But I am getting an Access Denied error while executing that command.
The role that is being used to perform operations on S3 is the AmazonEKSNodeRole.
I have assigned the appropriate permissions to the role to perform operations on the S3 bucket.
Also, here is another doc which suggests using Kibana for ElasticSearch versions > 7.2, but I am doing the same via cURL requests.
Below is the trust policy of the role through which I am making the request to register the repository in the S3 bucket.
Also, below are the screenshots of the permissions of the trusting and trusted accounts, respectively.
I'm trying to get at the Common Crawl news S3 bucket, but I keep getting a "fatal error: Unable to locate credentials" message. Any suggestions for how to get around this? As far as I was aware Common Crawl doesn't even require credentials?
From News Dataset Available – Common Crawl:
You can access the data even without an AWS account by adding the command-line option --no-sign-request.
I tested this by launching a new Amazon EC2 instance (without an IAM role) and issuing the command:
aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/
It gave me the error: Unable to locate credentials
I then ran it with the additional parameter:
aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/ --no-sign-request
It successfully listed the directories.
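For completeness, a rough boto3 sketch of the same anonymous access (boto3 is assumed to be installed; no AWS account or credentials are required):

# A sketch of anonymous (unsigned) access to a public bucket.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# signature_version=UNSIGNED is the SDK equivalent of --no-sign-request.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List the "directories" under the CC-NEWS prefix, like the CLI command above.
response = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-NEWS/",
    Delimiter="/",
)
for common_prefix in response.get("CommonPrefixes", []):
    print(common_prefix["Prefix"])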
How do I set permissions in such a way that anyone can upload files to my bucket?
Here is an example that has these 3 features:
I can upload any file and download my file from anywhere.
But I am not able to download files uploaded by others.
However, I can delete files uploaded by others.
I would like to know how this bucket (abc) was set up and who owns it.
1) I can upload:
[root@localhost ~]# aws s3 cp test.txt s3://abc/
upload: ./test.txt to s3://abc/test.txt
2) I can list contents:
[root@localhost ~]# aws s3 ls s3://abc | head
PRE doubleverify-iqm/
PRE folder400/
PRE ngcsc/
PRE out/
PRE pd/
PRE pit/
PRE soap1/
PRE some-subdir/
PRE swoo/
2018-06-15 12:06:27 2351 0Sw5xyknAcVaqShdROBSfCfa7sdA27WbFMm4QNdUHWqf2vymo5.json
3) I can download my file from anywhere:
[root@localhost ~]# aws s3 cp s3://abc/test.txt .
download: s3://abc/test.txt to ./test.txt
4) But I am not able to download others' files:
[root@localhost ~]# aws s3 cp s3://abc/zQhAqmwIUfIeDnEEHpiaGhXuERgO3bR84jkjhbei1aLiV1758t.json .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden
5) However, I can delete a file not uploaded by me:
[root@localhost ~]# aws s3 rm s3://abc/zQhAqmwIUfIeDnEEHpiaGhXuERgO3bR84jkjhbei1aLiV1758t.json
delete: s3://abc/zQhAqmwIUfIeDnEEHpiaGhXuERgO3bR84jkjhbei1aLiV1758t.json
I am not sure how to set up such a bucket.
It is not advisable to set up a bucket in this manner.
The fact that anyone can upload to the bucket means that somebody could store, potentially, TBs of data and you would be liable for the cost. For example, somebody could host large video files, using your bucket for free storage and bandwidth.
Similarly, it is not good security practice to grant permissions for anyone to list the contents of your bucket. They might find sensitive data that was not intended to be released.
It would also be unwise to allow anyone to delete objects from your bucket, because somebody could delete everything!
There are two primary ways to grant access to objects:
Bucket Policy
A Bucket Policy can grant permissions on the whole bucket, or specific paths within a bucket. For example, granting GetObject to the whole bucket means that anyone can download any object.
See: Bucket Policy Examples - Amazon Simple Storage Service
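As a hedged illustration (the bucket name is a placeholder, and making objects world-readable should be a deliberate decision), attaching a public-read policy from Python could look like this:

# A sketch only; "my-bucket" is a placeholder.
import json
import boto3

public_read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PublicReadGetObject",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::my-bucket/*",
    }],
}

boto3.client("s3").put_bucket_policy(
    Bucket="my-bucket",
    Policy=json.dumps(public_read_policy),
)

Note that if S3 Block Public Access is enabled on the account or bucket, such a public policy will be rejected.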
Object-level permissions
Basic permissions can also be granted on a per-object basis. For example, when an object is copied to a bucket, the Access Control List (ACL) can specify who can access the object.
For example, this would grant ownership of the object to the bucket owner:
aws s3 cp foo.txt s3://my-bucket/foo.txt --acl bucket-owner-full-control
If --acl is omitted, then the object 'belongs' to the identity that uploaded the file, which is why you were able to download your own file. This is not recommended, because it can lead to a situation where the bucket owner cannot access (and potentially cannot even delete!) the object.
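The same upload with that ACL from Python might look like this sketch (foo.txt and my-bucket are just the placeholder names used above):

# A sketch: upload an object and grant the bucket owner full control,
# the boto3 equivalent of the CLI's --acl bucket-owner-full-control.
import boto3

boto3.client("s3").upload_file(
    Filename="foo.txt",
    Bucket="my-bucket",
    Key="foo.txt",
    ExtraArgs={"ACL": "bucket-owner-full-control"},
)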
Bottom line: Think about your security before implementing rules that grant other people, or anyone, permissions on your buckets.
I am trying to download data from one of Amazon's public buckets.
Here is a description of the bucket in question
The bucket has web-accessible folders, for example.
I would want to download, say, all the listed files in that folder.
There will be a long list of suitable tiles identified, and the goal is to get all the files in a folder in one go rather than downloading each individually from the HTTP site.
From other StackOverflow questions I realize I need to use the REST endpoint and use a tool like the AWS CLI or Cyberduck, but I cannot get these to work as yet.
I think the issue may be authentication. I don't have an AWS account, and I was hoping to stick with guest / anonymous access.
Does anyone have a good solution / tool to traverse a public bucket and grab the contents as a guest? Could a different approach using curl or wget work for this type of task?
Thanks.
For the AWS CLI, you need to provide the --no-sign-request flag to skip signing. Example:
> aws s3 ls landsat-pds
Unable to locate credentials. You can configure credentials by running "aws configure".
> aws s3 ls landsat-pds --no-sign-request
PRE L8/
PRE landsat-pds_stats/
PRE runs/
PRE tarq/
PRE tarq_corrupt/
PRE test/
2015-01-28 10:13:53 23764 index.html
2015-04-14 10:43:22 25 robots.txt
2016-07-13 12:53:31 38 run_info.json
2016-07-13 12:53:30 23971821 scene_list.gz
To download that entire bucket into a directory, you would do something like this:
> mkdir landsat-pds
> aws s3 sync s3://landsat-pds landsat-pds --no-sign-request
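If you would rather script it, here is a rough boto3 sketch of the same idea for a single prefix ("runs/" appears in the listing above; the full bucket is very large), again without credentials:

# A sketch: anonymously download every object under one prefix of a public bucket.
import os
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
bucket = "landsat-pds"
prefix = "runs/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):  # skip folder placeholder objects
            continue
        local_path = os.path.join(bucket, key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)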
I kept getting:
SSL validation failed for https://s3bucket.eu-central-1.amazonaws.com/?list-type=2&prefix=&delimiter=%2F&encoding-type=url [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)
To get around it, I use --no-verify-ssl, so then:
aws s3 ls s3bucket --no-sign-request --no-verify-ssl
... does the trick.
While using the "wget" command to download files from Amazon S3 to an Amazon EC2 instance, it gives the following message and the file does not get downloaded.
How do I solve this issue?
Command:
"wget https://s3.amazonaws.com/docsbucket/intro.doc"
Error message:
"Resolving s3.amazonaws.com... 207.171.163.225
Connecting to s3.amazonaws.com|207.171.163.225|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2013-03-20 13:06:00 ERROR 403: Forbidden."
You should launch your EC2 instance with the permission to read from your S3 buckets.
The easiest way to do this is with roles. In AWS's IAM (Identity and Access Management) service, you simply create a role that can read from S3, then launch your instance with this role. AWS takes care of getting the right credentials onto the instance, and you can fetch your S3 objects using the S3 CLI tools.
You can use the same "trick" to access other resources and other actions on these resources.
You can read more about it in the AWS documentation: http://docs.aws.amazon.com/IAM/latest/UserGuide/role-usecase-ec2app.html
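Once the instance is running with such a role, the SDKs and CLI pick up the temporary credentials automatically. For example, a small boto3 sketch using the bucket and file from the question needs no keys configured on the instance:

# A sketch: with an S3-read instance role, boto3 obtains temporary credentials
# from the instance metadata service; no access keys appear in code or config.
import boto3

s3 = boto3.client("s3")
s3.download_file("docsbucket", "intro.doc", "intro.doc")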
Unless the file is public, you will need to authenticate with keys to download the file. This is probably easiest done with a tool like s3cmd.
This worked after I gave Read permission on the file to Everyone:
Go to the Permissions tab -> Public Access -> click Everyone -> then grant the Read permission.
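If you prefer not to use the console, here is a sketch of the equivalent call using the object ACL API (it only works if the bucket still allows public ACLs):

# A sketch: make a single object publicly readable so an unsigned request
# (e.g. wget) succeeds. The bucket and key are the ones from the question.
import boto3

boto3.client("s3").put_object_acl(
    Bucket="docsbucket",
    Key="intro.doc",
    ACL="public-read",
)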