I subscribed to the free synthetic dataset.
Now I have "Revision ID", "Revision ARN", "Data set ID" and 28 CSV files which I can not download in a pack. I must manually download them one after another or I can export them all to the AWS e3 (I do not want to do that).
Is there a way to download it all in a single archive or somehow automate the process via AWS S3 CLI?
I've tried
./venv/bin/awscliv2 s3 cp s3://arn:aws:dataexchange:us-east-1::data-sets/b0b14e86c092855166507c15e045b844/revisions/6011536d595840f7bd4412fca59e0f6b/assets/7cd4a5cbedb0c5c83e37c20f668b3708 ./
fatal error: Parameter validation failed:
Invalid bucket name "arn:aws:dataexchange:us-east-1::data-sets": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:s3:[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
UPD: I've found a Python snippet that uses boto3 and creates a temporary bucket in the process.
UPD 2: From https://docs.aws.amazon.com/data-exchange/latest/userguide/jobs.html#exporting-assets:
There are two ways you can export assets from a published revision of a product:
To an Amazon S3 bucket that you have permissions to access.
By using a signed URL.
Therefore, I can't do this without a bucket through the AWS CLI, but I can use the EXPORT_ASSET_TO_SIGNED_URL job type.
UPD 3: I've created a gist for downloading a dataset from AWS Data Exchange via signed URLs.
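For reference, a rough boto3 sketch of that signed-URL approach (this is not the gist itself; the data set and revision IDs are the ones from the failed command above, and pagination and error handling are omitted):

import time
import urllib.request

import boto3

DATA_SET_ID = "b0b14e86c092855166507c15e045b844"   # "Data set ID" from the question
REVISION_ID = "6011536d595840f7bd4412fca59e0f6b"   # "Revision ID" from the question

dx = boto3.client("dataexchange", region_name="us-east-1")

# Export each asset (CSV file) of the revision via an EXPORT_ASSET_TO_SIGNED_URL job
for asset in dx.list_revision_assets(DataSetId=DATA_SET_ID,
                                     RevisionId=REVISION_ID)["Assets"]:
    job = dx.create_job(
        Type="EXPORT_ASSET_TO_SIGNED_URL",
        Details={"ExportAssetToSignedUrl": {
            "AssetId": asset["Id"],
            "DataSetId": DATA_SET_ID,
            "RevisionId": REVISION_ID,
        }},
    )
    dx.start_job(JobId=job["Id"])
    while True:
        job = dx.get_job(JobId=job["Id"])
        if job["State"] in ("COMPLETED", "ERROR"):
            break
        time.sleep(2)
    url = job["Details"]["ExportAssetToSignedUrl"]["SignedUrl"]
    urllib.request.urlretrieve(url, asset["Name"])  # the asset name is the file name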
I don't know why you're using such a complicated aws s3 cp. A two-line script can be something like:
# sync all files in the folder to the local directory ~/csvs/
aws s3 sync s3://bucketname/<folder>/ ~/csvs/
# then zip (or tar) the whole folder, whatever you want
zip -r csvs.zip ~/csvs
I was able to follow this example [1] and let my EC2 instance read from S3.
In order to write to the same bucket, I thought changing line 57 [2] from grant_read() to grant_read_write() should work.
...
# Userdata executes script from S3
instance.user_data.add_execute_file_command(
file_path=local_path
)
# asset.grant_read(instance.role)
asset.grant_read_write(instance.role)
...
Yet the documented [3] function cannot be accessed, according to the error message.
>> 57: Pyright: Cannot access member "grant_read_write" for type "Asset"
What am I missing?
[1] https://github.com/aws-samples/aws-cdk-examples/tree/master/python/ec2/instance
[2] https://github.com/aws-samples/aws-cdk-examples/blob/master/python/ec2/instance/app.py#L57
[3] https://docs.aws.amazon.com/cdk/latest/guide/permissions.html#permissions_grants
This is the documentation for Asset:
An asset represents a local file or directory, which is automatically
uploaded to S3 and then can be referenced within a CDK application.
The method grant_read_write isn't provided, as it is pointless. The documentation you've linked doesn't apply here.
An asset is just a zip file that will be uploaded to the bootstrapped CDK S3 bucket, then referenced by CloudFormation when deploying.
If you have a script you want to put into an S3 bucket, you don't want to use any form of asset, because that is a zip file. You would be better off using a boto3 call to upload it once the bucket already exists, or making it part of a CodePipeline: create the bucket with CDK, then have the next step in the pipeline upload the script.
In this case, grant_read_write is for aws_cdk.aws_s3.Bucket constructs.
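For contrast, a minimal CDK v1-style sketch (the stack and construct names here are illustrative, not taken from the linked example) where the grant goes on a real s3.Bucket:

from aws_cdk import aws_ec2 as ec2, aws_s3 as s3, core


class WriteableBucketStack(core.Stack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)

        vpc = ec2.Vpc(self, "VPC", nat_gateways=0)
        instance = ec2.Instance(
            self, "Instance",
            vpc=vpc,
            instance_type=ec2.InstanceType("t3.nano"),
            machine_image=ec2.AmazonLinuxImage(),
        )

        # grant_read_write exists on Bucket constructs, not on Assets
        bucket = s3.Bucket(self, "ScriptBucket")
        bucket.grant_read_write(instance.role)


app = core.App()
WriteableBucketStack(app, "writeable-bucket-stack")
app.synth()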
I am trying to deploy a Lambda function to AWS from S3.
My organization currently does not provide the ability for me to upload files to the root of an S3 bucket, but only to a folder (e.g. s3://application-code-bucket/Application1/).
Is there any way to deploy the Lambda function code through S3, from a directory other than the bucket root? I checked the documentation for Lambda's CreateFunction AWS command and could not find anything obvious.
You need to zip your Lambda package and upload it to S3, in any folder.
You can then provide the S3 URL (or bucket and key) of that object when creating or updating the Lambda function.
The S3 bucket needs to be in the same region as the Lambda function.
Make sure you zip from inside the folder, i.e. when the package is unzipped, the files should be extracted into the current directory rather than into a new subdirectory.
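For example, a small boto3 sketch of that idea, assuming the bucket and folder from the question and a hypothetical function name (the zip can live under any key prefix, not just the bucket root):

import boto3

lambda_client = boto3.client("lambda")

# Point Lambda at a zip stored under a key prefix ("folder"), not the bucket root
lambda_client.update_function_code(
    FunctionName="my-function",               # hypothetical function name
    S3Bucket="application-code-bucket",       # bucket from the question
    S3Key="Application1/lambda-package.zip",  # hypothetical key under the folder
)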
I have this old script of mine that I used to automate Lambda deployments.
It needs to be refactored a bit, but it is still usable.
It takes as input the Lambda name and the path of the zip file on your local machine.
It uploads the zip to S3 and publishes it to AWS Lambda.
You need to set AWS credentials with an IAM role that allows:
S3 upload permission
AWS Lambda update permission
You need to modify the bucket name and the path you want your zip to be uploaded to (lines 36-37).
That's it.
I need to send someone a link to download a folder stored in an Amazon S3 bucket. Is this possible?
You can do that using the AWS CLI
aws s3 sync s3://<bucket>/path/to/folder/ .
There are many options if you need to filter specific files, etc.; check the documentation page.
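If you prefer scripting it, a boto3 sketch of the same folder download (the bucket name and prefix below are placeholders):

import os

import boto3

s3 = boto3.client("s3")
bucket, prefix = "bucketname", "path/to/folder/"   # placeholders

# List everything under the prefix and download each object into the current directory
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):
            continue  # skip "folder" placeholder objects
        s3.download_file(bucket, obj["Key"], os.path.basename(obj["Key"]))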
You can also use Minio Client, aka mc, for this. It is open source and S3-compatible. The mc policy command should do this for you.
Set the bucket policy to "download" on Amazon S3 cloud storage:
$ mc policy download s3/your_bucket
This will add a download policy to all the objects inside the bucket your_bucket, and an object named yourobject can then be accessed with the URL below.
https://your_bucket.s3.amazonaws.com/yourobject
Hope it helps.
Disclaimer: I work for Minio
I am trying to download data from one of Amazon's public buckets.
Here is a description of the bucket in question
The bucket has web-accessible folders, for example.
I would want to download, say, all the listed files in that folder.
There will be a long list of suitable tiles identified, and the goal would be to get all the files in a folder in one go rather than downloading each individually from the HTTP site.
From other StackOverflow questions I realize I need to use the REST endpoint and use a tool like the AWS CLI or Cyberduck, but I cannot get these to work as yet.
I think the issue may be authentication. I don't have an AWS account, and I was hoping to stick with guest / anonymous access.
Does anyone have a good solution / tool to traverse a public bucket and grab the contents as a guest? Could a different approach using curl or wget work for this type of task?
Thanks.
For the AWS CLI, you need to provide the --no-sign-request flag to skip signing. Example:
> aws s3 ls landsat-pds
Unable to locate credentials. You can configure credentials by running "aws configure".
> aws s3 ls landsat-pds --no-sign-request
PRE L8/
PRE landsat-pds_stats/
PRE runs/
PRE tarq/
PRE tarq_corrupt/
PRE test/
2015-01-28 10:13:53 23764 index.html
2015-04-14 10:43:22 25 robots.txt
2016-07-13 12:53:31 38 run_info.json
2016-07-13 12:53:30 23971821 scene_list.gz
To download that entire bucket into a directory, you would do something like this:
> mkdir landsat-pds
> aws s3 sync s3://landsat-pds landsat-pds --no-sign-request
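If you want to script the anonymous access instead, botocore's UNSIGNED config is the boto3 counterpart of --no-sign-request; a small sketch:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned (anonymous) client, equivalent to passing --no-sign-request
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="landsat-pds", Prefix="L8/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])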
I kept getting
SSL validation failed for https://s3bucket.eu-central-1.amazonaws.com/?list-type=2&prefix=&delimiter=%2F&encoding-type=url [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)
To get around it, I use --no-verify-ssl, so then
aws s3 ls s3bucket --no-sign-request --no-verify-ssl
... does the trick
I have been struggling for about a week to download arXiv articles as mentioned here: http://arxiv.org/help/bulk_data_s3#src.
I have tried lots of things: S3 Browser, s3cmd. I am able to log in to my own buckets, but I am unable to download data from the arXiv bucket.
I tried:
s3cmd get s3://arxiv/pdf/arXiv_pdf_1001_001.tar
See:
$ s3cmd get s3://arxiv/pdf/arXiv_pdf_1001_001.tar
s3://arxiv/pdf/arXiv_pdf_1001_001.tar -> ./arXiv_pdf_1001_001.tar [1 of 1]
s3://arxiv/pdf/arXiv_pdf_1001_001.tar -> ./arXiv_pdf_1001_001.tar [1 of 1]
ERROR: S3 error: Unknown error
s3cmd get with x-amz-request-payer:requester
It gave me the same error again:
$ s3cmd get --add-header="x-amz-request-payer:requester" s3://arxiv/pdf/arXiv_pdf_manifest.xml
s3://arxiv/pdf/arXiv_pdf_manifest.xml -> ./arXiv_pdf_manifest.xml [1 of 1]
s3://arxiv/pdf/arXiv_pdf_manifest.xml -> ./arXiv_pdf_manifest.xml [1 of 1]
ERROR: S3 error: Unknown error
Copying
I have tried copying files from that folder too.
$ aws s3 cp s3://arxiv/pdf/arXiv_pdf_1001_001.tar .
A client error (403) occurred when calling the HeadObject operation: Forbidden
Completed 1 part(s) with ... file(s) remaining
This probably means that I made a mistake. The problem is I don't know what to add that will convey my willingness to pay for the download.
I am unable to figure out what I should do to download data from S3. I have been reading a lot on the AWS sites, but nowhere can I find a pinpoint solution to my problem.
How can I bulk download the arXiv data?
Try downloading s3cmd version 1.6.0: http://sourceforge.net/projects/s3tools/files/s3cmd/
$ s3cmd --configure
Enter your credentials found in the account management tab of the Amazon AWS website interface.
$ s3cmd get --recursive --skip-existing s3://arxiv/src/ --requester-pays
Requester Pays is a feature on Amazon S3 buckets that requires the user of the bucket to pay Data Transfer costs associated with accessing data.
Normally, the owner of an S3 bucket pays Data Transfer costs, but this can be expensive for free / Open Source projects. Thus, the bucket owner can activate Requester Pays to reduce the portion of costs they will be charged.
Therefore, when accessing a Requester Pays bucket, you will need to authenticate yourself so that S3 knows whom to charge.
I recommend using the official AWS Command-Line Interface (CLI) to access AWS services. You can provide your credentials via:
aws configure
and then view the bucket via:
aws s3 ls s3://arxiv/pdf/
and download via:
aws s3 cp s3://arxiv/pdf/arXiv_pdf_1001_001.tar .
UPDATE: I just tried the above myself, and received Access Denied error messages (both on the bucket listing and the download command). When using s3cmd, it says ERROR: S3 error: Access Denied. It would appear that the permissions on the bucket no longer permit access. You should contact the owners of the bucket to request access.
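If you would rather script the requester-pays download, a boto3 sketch of the same request (bucket, key, and region taken from this thread; whether it succeeds still depends on the bucket's current permissions):

import boto3

s3 = boto3.client("s3", region_name="us-east-1")   # the arxiv bucket is in us-east-1

# RequestPayer tells S3 that you agree to pay the data transfer costs
s3.download_file(
    "arxiv",
    "pdf/arXiv_pdf_1001_001.tar",
    "arXiv_pdf_1001_001.tar",
    ExtraArgs={"RequestPayer": "requester"},
)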
At the bottom of this page, arXiv explains that s3cmd gets denied because it does not support access to requester-pays buckets as a non-owner, and that you have to apply a patch to the source code of s3cmd. However, the version of s3cmd they used is outdated and the patch does not apply to the latest version of s3cmd.
Basically, you need to allow s3cmd to add the "x-amz-request-payer" header to its HTTP requests to buckets. Here is how to fix it:
Download the source code of s3cmd.
Open S3/S3.py with a text editor.
Add these two lines of code at the bottom of the __init__ function:
if self.s3.config.extra_headers:
    self.headers.update(self.s3.config.extra_headers)
Install s3cmd as instructed.
For me the problem was that my IAM user didn't have enough permissions.
Setting AmazonS3FullAccess was the solution for me.
Hope it'll save someone some time.
I don't want to steal the thunder, but OttoV's comment actually gave the right command that works for me.
aws s3 ls --request-payer requester s3://arxiv/src/
My EC2 instance is in region us-east-2, but the arXiv S3 buckets are in region us-east-1, so I think that's why --request-payer requester is needed.
From https://aws.amazon.com/s3/pricing/?nc=sn&loc=4 :
You pay for all bandwidth into and out of Amazon S3, except for the following:
• Data transferred in from the internet.
• Data transferred out to an Amazon Elastic Compute Cloud (Amazon EC2) instance, when the instance is in the same AWS Region as the S3 bucket (including to a different account in the same AWS region).
• Data transferred out to Amazon CloudFront (CloudFront).
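For completeness, a boto3 equivalent of the listing command above (same bucket and prefix; RequestPayer plays the role of --request-payer requester):

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
resp = s3.list_objects_v2(Bucket="arxiv", Prefix="src/",
                          RequestPayer="requester", MaxKeys=20)
for obj in resp.get("Contents", []):
    print(obj["Key"])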