AWS credentials required for Common Crawl S3 buckets - amazon-web-services

I'm trying to get at the Common Crawl news S3 bucket, but I keep getting a "fatal error: Unable to locate credentials" message. Any suggestions for how to get around this? As far as I was aware Common Crawl doesn't even require credentials?

From News Dataset Available – Common Crawl:
You can access the data even without a AWS account by adding the command-line option --no-sign-request.
I tested this by launching a new Amazon EC2 instance (without an IAM role) and issuing the command:
aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/
It gave me the error: Unable to locate credentials
I then ran it with the additional parameter:
aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/ --no-sign-request
It successfully listed the directories.
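For scripted use, the unsigned listing can be wrapped in a small shell function. This is only a minimal sketch, assuming the AWS CLI is installed and on the PATH; the function name cc_ls is made up for illustration.

```shell
# Hypothetical helper (name is an assumption): list a public S3 prefix
# without signing the request, so no AWS credentials are needed.
cc_ls() {
    # $1: S3 URI to list, e.g. s3://commoncrawl/crawl-data/CC-NEWS/
    aws s3 ls "$1" --no-sign-request
}

# Usage (needs network access):
#   cc_ls s3://commoncrawl/crawl-data/CC-NEWS/
```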

Related

Unable to connect to S3 while creating Elasticsearch snapshot repository

I am trying to register a repository on AWS S3 to store Elasticsearch snapshots.
I am following the guide and ran the very first command listed in the doc,
but I get an Access Denied error when executing that command.
The role used to perform operations on S3 is the AmazonEKSNodeRole.
I have assigned the appropriate permissions to the role to perform operations on the S3 bucket.
There is also another doc which suggests using Kibana for Elasticsearch versions > 7.2, but I am doing the same via cURL requests.
Below is the trust policy of the role through which I am making the request to register the repository in the S3 bucket.
Also, below are the screenshots of the permissions of the trusting and trusted accounts respectively -

S3 download works from console, but not from commandline

Can anyone explain this behaviour:
When I try to download a file from S3, I get the following error:
An error occurred (403) when calling the HeadObject operation: Forbidden.
Commandline used:
aws s3 cp s3://bucket/raw_logs/my_file.log .
However, when I use the S3 console website, I'm able to download the file without issues.
The access key used by the commandline is correct. I verified this, and other AWS operations via commandline work fine. The access key is tied to the same user account I use in the AWS console.
I assume you have checked your user's IAM policy and that the file exists in your bucket.
If you have set a default region in your configuration but the bucket was not created in that region (yes, S3 buckets are created in a region), the CLI will not find it. Add the region flag to the command:
aws s3 cp s3://bucket/raw_logs/my_file.log . --region <region of the bucket>
Other notes:
Make sure to upgrade to the latest version of the CLI.
An unsynchronized system clock can also cause this. I don't know the internals, but signed requests carry a timestamp that S3 compares with its own clock, so a large skew can make requests fail.
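If you don't know which region the bucket lives in, something like the following may help. It is a rough sketch, not a definitive recipe: the function names are made up, it assumes your credentials are allowed to call get-bucket-location on the bucket, and it does not handle the legacy "EU" location value.

```shell
# Hypothetical helper: print the region of a bucket.
bucket_region() {
    # $1: bucket name. Buckets in us-east-1 report an empty
    # LocationConstraint, which the CLI prints as "None" with --output text.
    region=$(aws s3api get-bucket-location --bucket "$1" \
        --query LocationConstraint --output text) || return 1
    case "$region" in
        None|null|"") echo "us-east-1" ;;
        *) echo "$region" ;;
    esac
}

# Hypothetical helper: copy an object using the bucket's own region.
cp_from_bucket_region() {
    # $1: bucket, $2: key, $3: local destination
    region=$(bucket_region "$1") || return 1
    aws s3 cp "s3://$1/$2" "$3" --region "$region"
}

# Usage:
#   cp_from_bucket_region bucket raw_logs/my_file.log .
```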
I had a similar issue due to having two-factor authentication enabled on my account. Check out how to configure 2FA for the aws cli here: https://aws.amazon.com/premiumsupport/knowledge-center/authenticate-mfa-cli/
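For the MFA case, temporary credentials can be requested with aws sts get-session-token and exported for later commands. The sketch below is an assumption-laden illustration: the function name mfa_login is invented, and the device ARN and code in the usage line are placeholders you would replace with your own.

```shell
# Hypothetical helper: obtain temporary credentials with an MFA code and
# export them so later aws commands in this shell use them.
mfa_login() {
    # $1: ARN of the MFA device, $2: current code from the device
    creds=$(aws sts get-session-token --serial-number "$1" --token-code "$2" \
        --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
        --output text) || return 1
    # Text output is one whitespace-separated line: key id, secret, token.
    set -- $creds
    export AWS_ACCESS_KEY_ID="$1" AWS_SECRET_ACCESS_KEY="$2" AWS_SESSION_TOKEN="$3"
}

# Usage (placeholder values):
#   mfa_login arn:aws:iam::123456789012:mfa/my-user 123456
#   aws s3 cp s3://bucket/raw_logs/my_file.log .
```

Call it directly in your shell rather than in a subshell, otherwise the exported variables are lost.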

S3 cp AccessDenied from AWS cli with root keys

I have the AWS CLI installed on an EC2 instance, and I configured it by running aws configure and giving it my AWSAccessKeyId and AWSSecretKey keys. If I run the command aws s3 ls, it returns the name of my S3 bucket (call it "mybucket").
But if I then try aws s3 cp localfolder/ s3://mybucket/ --recursive, I get an error that looks like
A client error (AccessDenied) occurred when calling the CreateMultipartUpload operation: Anonymous users cannot initiate multipart uploads. Please authenticate.
I thought that by running aws configure and giving it my root key that I was effectively giving the aws cli everything it needs to authenticate? Is there something I am missing regarding copying to an S3 bucket as opposed to listing them?
I thought I would add a very similar issue that I had, where I could list buckets but could not write to a given bucket, which returned the error
An error occurred (AccessDenied) when calling the CreateMultipartUpload operation: Access Denied
If the bucket uses server-side encryption you'll need to add the --sse flag to be able to write to this bucket.
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
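As a concrete illustration of the --sse suggestion, here is a small sketch. The function name is made up, and AES256 is just one of the values the flag accepts (aws:kms is the other); which one your bucket expects is an assumption you need to check against its policy.

```shell
# Hypothetical helper: recursive upload with server-side encryption requested.
upload_with_sse() {
    # $1: local directory, $2: S3 destination, e.g. s3://mybucket/
    aws s3 cp "$1" "$2" --recursive --sse AES256
}

# Usage:
#   upload_with_sse localfolder/ s3://mybucket/
```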
Root access keys have full control and full privileges over the AWS account. Try running aws configure again to recheck the settings, then retry.
PS: Using root access keys is strongly discouraged. Consider creating an IAM user (which can be given admin privileges, like root) and using its keys instead.
If you have environment variables AWS_SECRET_ACCESS_KEY, AWS_ACCESS_KEY_ID and AWS_REGION set, AWS CLI gives higher precedence to them, and not to credentials you specify with aws configure.
So, in my case, bash command unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY solved the problem.
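To check quickly whether environment variables are shadowing your configured credentials, something like this can be run first. It is a minimal sketch: the function name is invented, and the list of variable names is my own selection (the ones mentioned above plus AWS_SESSION_TOKEN and AWS_DEFAULT_REGION). The CLI's own aws configure list command also reports where each value comes from.

```shell
# Hypothetical helper: print the names of credential-related environment
# variables that are currently set to a non-empty value.
aws_env_credentials() {
    for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN \
                AWS_REGION AWS_DEFAULT_REGION; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name-}"
        [ -n "$value" ] && echo "$name"
    done
    return 0
}

# Usage:
#   aws_env_credentials    # prints nothing if none of them are set
#   unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
```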

How to download data from Amazon's requester pay buckets?

I have been struggling for about a week to download arXiv articles as mentioned here: http://arxiv.org/help/bulk_data_s3#src.
I have tried lots of things: s3Browser, s3cmd. I am able to login to my buckets but I am unable to download data from arXiv bucket.
I tried:
s3cmd get s3://arxiv/pdf/arXiv_pdf_1001_001.tar
See:
$ s3cmd get s3://arxiv/pdf/arXiv_pdf_1001_001.tar
s3://arxiv/pdf/arXiv_pdf_1001_001.tar -> ./arXiv_pdf_1001_001.tar [1 of 1]
s3://arxiv/pdf/arXiv_pdf_1001_001.tar -> ./arXiv_pdf_1001_001.tar [1 of 1]
ERROR: S3 error: Unknown error
s3cmd get with x-amz-request-payer:requester
It gave me same error again:
$ s3cmd get --add-header="x-amz-request-payer:requester" s3://arxiv/pdf/arXiv_pdf_manifest.xml
s3://arxiv/pdf/arXiv_pdf_manifest.xml -> ./arXiv_pdf_manifest.xml [1 of 1]
s3://arxiv/pdf/arXiv_pdf_manifest.xml -> ./arXiv_pdf_manifest.xml [1 of 1]
ERROR: S3 error: Unknown error
Copying
I have tried copying files from that folder too.
$ aws s3 cp s3://arxiv/pdf/arXiv_pdf_1001_001.tar .
A client error (403) occurred when calling the HeadObject operation: Forbidden
Completed 1 part(s) with ... file(s) remaining
This probably means that I made a mistake. The problem is I don't know how and what to add that will convey my permission to pay for download.
I am unable to figure out what I should do to download data from S3. I have been reading a lot on the AWS sites, but nowhere can I find a pinpoint solution to my problem.
How can I bulk download the arXiv data?
Try downloading s3cmd version 1.6.0: http://sourceforge.net/projects/s3tools/files/s3cmd/
$ s3cmd --configure
Enter your credentials found in the account management tab of the Amazon AWS website interface.
$ s3cmd get --recursive --skip-existing s3://arxiv/src/ --requester-pays
Requester Pays is a feature on Amazon S3 buckets that requires the user of the bucket to pay Data Transfer costs associated with accessing data.
Normally, the owner of an S3 bucket pays Data Transfer costs, but this can be expensive for free / Open Source projects. Thus, the bucket owner can activate Requester Pays to reduce the portion of costs they will be charged.
Therefore, when accessing a Requester Pays bucket, you will need to authenticate yourself so that S3 knows whom to charge.
I recommend using the official AWS Command-Line Interface (CLI) to access AWS services. You can provide your credentials via:
aws configure
and then view the bucket via:
aws s3 ls s3://arxiv/pdf/
and download via:
aws s3 cp s3://arxiv/pdf/arXiv_pdf_1001_001.tar .
UPDATE: I just tried the above myself, and received Access Denied error messages (both on the bucket listing and the download command). When using s3cmd, it says ERROR: S3 error: Access Denied. It would appear that the permissions on the bucket no longer permit access. You should contact the owners of the bucket to request access.
At the bottom of this page arXiv explains that s3cmd gets denied because it does not support access to requester pays bucket as a non-owner and you have to apply a patch to the source code of s3cmd. However, the version of s3cmd they used is outdated and the patch does not apply to the latest version of s3cmd.
Basically you need to allow s3cmd to add "x-amz-request-payer" header to its HTTP request to buckets. Here is how to fix it:
Download the source code of s3cmd.
Open S3/S3.py with a text editor.
Add these two lines of code at the bottom of the __init__ function:
    if self.s3.config.extra_headers:
        self.headers.update(self.s3.config.extra_headers)
Install s3cmd as instructed.
For me the problem was that my IAM user didn't have enough permissions.
Setting AmazonS3FullAccess was the solution for me.
Hope this saves someone some time.
Don't want to steal the thunder, but OttoV's comment actually gave the right command that works for me.
aws s3 ls --request-payer requester s3://arxiv/src/
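My EC2 instance is in Region us-east-2, while the arXiv S3 bucket is in Region us-east-1. Note that --request-payer requester is needed for a Requester Pays bucket regardless of region; the region only affects whether the data transfer is charged.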
From https://aws.amazon.com/s3/pricing/?nc=sn&loc=4 :
You pay for all bandwidth into and out of Amazon S3, except for the following:
• Data transferred in from the internet.
• Data transferred out to an Amazon Elastic Compute Cloud (Amazon EC2) instance, when the instance is in the same AWS Region as the S3 bucket (including to a different account in the same AWS region).
• Data transferred out to Amazon CloudFront (CloudFront).
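The same flag works for downloads as well as listings. A minimal sketch, assuming your configured credentials are valid (you will be billed for the transfer); the function name arxiv_get is invented for illustration:

```shell
# Hypothetical helper: download one object from the arXiv Requester Pays
# bucket, acknowledging that the requester is charged.
arxiv_get() {
    # $1: key inside the bucket, $2: local destination
    aws s3 cp "s3://arxiv/$1" "$2" --request-payer requester
}

# Usage:
#   arxiv_get pdf/arXiv_pdf_manifest.xml .
```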

Error During downloading files from S3 to EC2

When I use the "wget" command to download files from Amazon S3 to an Amazon EC2 instance,
I get the following message and the file is not downloaded.
How can I solve this issue?
Command :->
"wget https://s3.amazonaws.com/docsbucket/intro.doc"
Error Message :->
"Resolving s3.amazonaws.com... 207.171.163.225
Connecting to s3.amazonaws.com|207.171.163.225|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2013-03-20 13:06:00 ERROR 403: Forbidden."
You should launch your EC2 instance with the permission to read from your S3 buckets.
The easiest way to do this is with roles. In the AWS IAM (Identity and Access Management) service, create a role that can read from S3, then launch your instance with this role. AWS will take care of getting the right credentials onto the instance, and you can get your S3 objects using the S3 CLI tools.
You can use the same "trick" to access other resources and other actions on these resources.
You can read more about it in the AWS documentation: http://docs.aws.amazon.com/IAM/latest/UserGuide/role-usecase-ec2app.html
Unless the file is public, you will need to authenticate with keys to download the file. This is probably easiest done with a tool like s3cmd.
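With a role attached to the instance (or keys configured), the authenticated equivalent of the wget call could look roughly like this. It is a sketch using the AWS CLI rather than s3cmd; the function names are made up, and the URL conversion only handles the path-style https://s3.amazonaws.com/BUCKET/KEY form seen in the question.

```shell
# Hypothetical helper: turn a path-style S3 URL into an s3:// URI.
s3_url_to_uri() {
    case "$1" in
        https://s3.amazonaws.com/*)
            echo "s3://${1#https://s3.amazonaws.com/}" ;;
        *)
            echo "unsupported URL: $1" >&2
            return 1 ;;
    esac
}

# Hypothetical helper: authenticated download of an object given its URL.
fetch_s3_url() {
    # $1: path-style S3 URL, $2: local destination (defaults to ".")
    uri=$(s3_url_to_uri "$1") || return 1
    aws s3 cp "$uri" "${2:-.}"
}

# Usage:
#   fetch_s3_url https://s3.amazonaws.com/docsbucket/intro.doc
```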
This worked after I gave everyone read permission on the file. Note that this makes the file publicly readable.
Go to the Permissions tab -> Public Access -> click Everyone -> then grant the Read permission.