How to download data from AWS in python


I am new to AWS and boto. The data I want to download is on AWS, and I have the access key and the secret key. My problem is I do not understand the approaches I found. For instance, this code:
import boto
import boto.s3.connection

def download_data_connect_s3(access_key, secret_key, region, bucket_name, key, local_path):
    # Connect to the regional S3 endpoint with the given credentials.
    conn = boto.connect_s3(
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        host='s3-{}.amazonaws.com'.format(region),
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )
    bucket = conn.get_bucket(bucket_name)
    key = bucket.get_key(key)
    # Save the object's contents to a file on the local machine.
    key.get_contents_to_filename(local_path)
    print('Downloaded File {} to {}'.format(key, local_path))

region = 'us-west-1'
access_key = '<access key>'  # the key here
secret_key = '<secret key>'  # the secret key here
bucket_name = 'temp_name'
key = '<folder…/filename>'  # unique identifier
local_path = '<local path>'  # local path
download_data_connect_s3(access_key, secret_key, region, bucket_name, key, local_path)
What I don't understand is the 'key' 'bucket_name' and 'local path'. What is 'key' in comparison to access key and secret key? I was not given a 'key'. Also, is the 'bucket_name' the name of the bucket on AWS (I was not provided with the bucket name); and local path the directory where I want to save the data?

You are right.
bucket_name = the name of your S3 bucket
key = the object key, i.e. the full path of the file inside the bucket (for example, if you have a file named a.txt in folder x, then key = x/a.txt)
local_path = where you want to save the data on your local machine
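To make the bucket/key distinction concrete, here is a minimal sketch that splits an s3:// URI into the two parts. The function name split_s3_uri is hypothetical (it is not part of boto), and the example URI reuses the placeholder names from the question.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/path/to/file' into (bucket_name, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected an s3://bucket/key URI, got {!r}".format(uri))
    # The bucket is the host part; the key is everything after the first slash.
    return parsed.netloc, parsed.path.lstrip("/")

bucket_name, key = split_s3_uri("s3://temp_name/x/a.txt")
print(bucket_name)  # temp_name
print(key)          # x/a.txt
```

The access key and secret key are credentials; the object key is only the file's path inside the bucket.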

It sounds like the data is stored in Amazon S3.
You can use the AWS Command-Line Interface (CLI) to access Amazon S3.
To view the list of buckets in that account:
aws s3 ls
To view the contents of a bucket:
aws s3 ls bucket-name
To copy a file from a bucket to the current directory:
aws s3 cp s3://bucket-name/filename.txt .
Or sync a whole folder:
aws s3 sync s3://bucket-name/folder/ local-folder/
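If you would rather stay in Python, the following is a rough sketch of downloading everything under a prefix with a boto3 S3 client. The helper name download_prefix is an assumption for illustration. Unlike aws s3 sync, it downloads every object each time and does not skip files that are already up to date.

```python
import os

def download_prefix(s3_client, bucket, prefix, dest_dir):
    """Download every object under `prefix` into `dest_dir`, keeping sub-paths."""
    downloaded = []
    # list_objects_v2 returns at most 1000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # zero-byte "folder" placeholder, nothing to download
            # Path relative to the prefix; fall back to the file name alone.
            relative = key[len(prefix):].lstrip("/") or key.rsplit("/", 1)[-1]
            local_path = os.path.join(dest_dir, *relative.split("/"))
            os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
            s3_client.download_file(bucket, key, local_path)
            downloaded.append(local_path)
    return downloaded

# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   download_prefix(boto3.client("s3"), "bucket-name", "folder/", "local-folder")
```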

Related

ClientError: Failed to download data. Please check your s3 objects and ensure that there is no object that is both a folder as well as a file

How are you?
I'm trying to execute a SageMaker job, but I get this error:
ClientError: Failed to download data. Cannot download s3://pocaaml/sagemaker/xsell_sc1_test/model/model_lgb.tar.gz, a previously downloaded file/folder clashes with it. Please check your s3 objects and ensure that there is no object that is both a folder as well as a file.
I do have model_lgb.tar.gz at that S3 path.
This is my code:
from time import gmtime, strftime

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

project_name = 'xsell_sc1_test'
s3_bucket = "pocaaml"
prefix = "sagemaker/"+project_name
account_id = "029294541817"
s3_bucket_base_uri = "{}{}".format("s3://", s3_bucket)
dev = "dev-{}".format(strftime("%y-%m-%d-%H-%M", gmtime()))
region = sagemaker.Session().boto_region_name
print("Using AWS Region: {}".format(region))

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()
boto3.setup_default_session(region_name=region)
boto_session = boto3.Session(region_name=region)
s3_client = boto3.client("s3", region_name=region)
sagemaker_boto_client = boto_session.client("sagemaker")  # is this one needed?
sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session, sagemaker_client=sagemaker_boto_client
)

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1", role=role, instance_type='ml.m5.4xlarge', instance_count=1
)

# Upload the preprocessing script to s3://<bucket>/<prefix>/code.
PREPROCESSING_SCRIPT_LOCATION = 'funciones_altas.py'
preprocessing_input_code = sagemaker_session.upload_data(
    PREPROCESSING_SCRIPT_LOCATION,
    bucket=s3_bucket,
    key_prefix="{}/{}".format(prefix, "code")
)
preprocessing_input_data = "{}/{}/{}".format(s3_bucket_base_uri, prefix, "data")
preprocessing_input_model = "{}/{}/{}".format(s3_bucket_base_uri, prefix, "model")
preprocessing_output = "{}/{}/{}/{}/{}".format(s3_bucket_base_uri, prefix, dev, "preprocessing", "output")
# `params` is not defined in this snippet (presumably set earlier in the notebook).
processing_job_name = params["project_name"].replace("_", "-")+"-preprocess-{}".format(strftime("%d-%H-%M-%S", gmtime()))

sklearn_processor.run(
    code=preprocessing_input_code,
    job_name=processing_job_name,
    inputs=[
        ProcessingInput(input_name="data",
                        source=preprocessing_input_data,
                        destination="/opt/ml/processing/input/data"),
        ProcessingInput(input_name="model",
                        source=preprocessing_input_model,
                        destination="/opt/ml/processing/input/model"),
    ],
    outputs=[
        ProcessingOutput(output_name="output",
                         destination=preprocessing_output,
                         source="/opt/ml/processing/output"),
    ],
    wait=False,
)
preprocessing_job_description = sklearn_processor.jobs[-1].describe()
In funciones_altas.py I'm using ohe_altas.tar.gz, not model_lgb.tar.gz, which makes this error very strange.
Can you help me?

It looks like you are using the SageMaker-generated execution role, and the error is related to S3 permissions.
Here are a couple of things you can do:
Check that the policies attached to the role grant access to your bucket.
Check whether the objects in your bucket are encrypted; if so, make sure the role you are linking to the job also has a KMS policy. https://aws.amazon.com/premiumsupport/knowledge-center/s3-403-forbidden-error/
You can also create your own role and pass its ARN to the code that runs the processing job.
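Separately from permissions, the error text itself asks you to check for an object that is both a file and a folder. As a rough sketch, assuming you already have the list of keys under the input prefix (for example from aws s3 ls --recursive), a check could look like this. The function name find_file_folder_clashes and the example keys are hypothetical.

```python
def find_file_folder_clashes(keys):
    """Return keys that are also used as a 'folder' prefix by another key."""
    key_set = set(keys)
    clashes = set()
    for key in key_set:
        parts = key.rstrip("/").split("/")
        # Walk every proper ancestor path: "a", "a/b", "a/b/c", ...
        for i in range(1, len(parts)):
            ancestor = "/".join(parts[:i])
            if ancestor in key_set:
                clashes.add(ancestor)
    return sorted(clashes)

example_keys = [
    "sagemaker/model/model_lgb.tar.gz",
    "sagemaker/model/model_lgb.tar.gz/extra.bin",  # hypothetical clashing object
    "sagemaker/data/train.csv",
]
print(find_file_folder_clashes(example_keys))
# ['sagemaker/model/model_lgb.tar.gz']
```

An empty result means no key in the list doubles as a folder for another key.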

Django Storage and Boto3 not retrieving Media from AWS S3

I am using a development server to test uploading and retrieving static files from AWS S3 using Django storages and Boto3. The file upload worked but I cannot retrieve the files.
When I open the file URL in another tab, I get this error:
<Error>
<Code>IllegalLocationConstraintException</Code>
<Message>The me-south-1 location constraint is incompatible for the region specific endpoint this request was sent to.</Message>
<RequestId></RequestId>
<HostId></HostId>
</Error>
Also I configured the settings.py with my own credentials and IAM user
AWS_ACCESS_KEY_ID = <key>
AWS_SECRET_ACCESS_KEY = <secret-key>
AWS_STORAGE_BUCKET_NAME = <bucket-name>
AWS_DEFAULT_ACL = None
AWS_S3_FILE_OVERWRITE = False
AWS_S3_REGION_NAME = 'me-south-1'
AWS_S3_USE_SSL = True
AWS_S3_VERIFY = False
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'

Please check in your AWS Identity & Access Management (IAM) console whether your access keys have the proper S3 permissions assigned to them.
Also, make sure you have installed the AWS CLI and set up your credentials on your machine.
You can run the command below to verify:
$ aws s3 ls
2018-12-11 17:08:50 my-bucket
2018-12-14 14:55:44 my-bucket2
Reference : https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html
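The error message concerns the region-specific endpoint rather than permissions, so the following settings.py fragment may also be worth trying. This is a sketch under the assumption that the request is being signed or addressed for a different region than me-south-1; it uses django-storages settings that exist (AWS_S3_SIGNATURE_VERSION, AWS_S3_ADDRESSING_STYLE), but whether it resolves this particular case is not confirmed by the question.

```python
# settings.py fragment (assumption: the bucket lives in me-south-1)
AWS_S3_REGION_NAME = "me-south-1"

# Sign requests with Signature Version 4, which newer regions require.
AWS_S3_SIGNATURE_VERSION = "s3v4"

# Use virtual-hosted-style URLs (bucket name in the host part).
AWS_S3_ADDRESSING_STYLE = "virtual"
```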

NoCredentialError when trying to access head_object

I have the following code which runs as expected:
import boto3

session = boto3.Session(profile_name='default')
s3 = session.resource('s3')
bucketName = 'myBucketName'
bucket = s3.Bucket(bucketName)
for object_summary in bucket.objects.filter(Prefix="MainFolder/"):
    s3_cli = boto3.client('s3')
    if(object_summary.key[-1]!='/'):
        print('FileName: '+object_summary.key)
        # print(s3_cli.head_object(Bucket=bucketName,Key=str(object_summary.key)))
    else:
        s3obj='FolderName: '+object_summary.key
        print(s3obj)
It lists the files and folders under MainFolder in my S3 bucket. However, when I uncomment the head_object line, I get this error:
NoCredentialsError: Unable to locate credentials
Any idea what I am doing wrong?

Instead of:
s3_cli = boto3.client('s3')
you should be using your session which loads the specific profile:
s3_cli = session.client('s3')
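Putting that together, a minimal sketch might create the client once from the session, outside the loop, and reuse it for each head_object call. The helper name head_sizes is hypothetical; the session argument is expected to be a boto3.Session such as the one in the question.

```python
def head_sizes(session, bucket_name, keys):
    """Return {key: size_in_bytes} for every non-folder key in `keys`."""
    # Created once, from the session, so it uses the session's profile.
    s3_cli = session.client("s3")
    sizes = {}
    for key in keys:
        if key.endswith("/"):
            continue  # folder placeholder, skip it
        response = s3_cli.head_object(Bucket=bucket_name, Key=key)
        sizes[key] = response["ContentLength"]
    return sizes

# Usage (assumes boto3 is installed and the 'default' profile exists):
#   import boto3
#   session = boto3.Session(profile_name='default')
#   keys = [o.key for o in session.resource('s3').Bucket('myBucketName')
#           .objects.filter(Prefix="MainFolder/")]
#   print(head_sizes(session, 'myBucketName', keys))
```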

How to change permission recursively to folder with AWS s3 or AWS s3api

I am trying to grant permissions to an existing account in s3.
The bucket is owned by the account, but the data was copied from another account's bucket.
When I try to grant permissions with the command:
aws s3api put-object-acl --bucket <bucket_name> --key <folder_name> --profile <original_account_profile> --grant-full-control emailaddress=<destination_account_email>
I receive the error:
An error occurred (NoSuchKey) when calling the PutObjectAcl operation: The specified key does not exist.
while if I do it on a single file the command is successful.
How can I make it work for a full folder?

This can only be achieved using pipes. Try:
aws s3 ls s3://bucket/path/ --recursive | awk '{cmd="aws s3api put-object-acl --acl bucket-owner-full-control --bucket bucket --key "$4; system(cmd)}'

The other answers are ok, but the FASTEST way to do this is to use the aws s3 cp command with the option --metadata-directive REPLACE, like this:
aws s3 cp --recursive --acl bucket-owner-full-control s3://bucket/folder s3://bucket/folder --metadata-directive REPLACE
This gives speeds of between 50 MiB/s and 80 MiB/s.
The answer in the comments from John R suggested using a 'dummy' option such as --storage-class STANDARD. Whilst this works, it only gave me copy speeds between 5 MiB/s and 11 MiB/s.
The inspiration for trying this came from AWS's support article on the subject: https://aws.amazon.com/premiumsupport/knowledge-center/s3-object-change-anonymous-ownership/
NOTE: If you encounter 'access denied' for some of your objects, this is likely because you are using AWS credentials for the account that owns the bucket, whereas you need to use credentials for the account the files were copied from.

You will need to run the command individually for every object.
You might be able to short-cut the process by using:
aws s3 cp --acl bucket-owner-full-control --metadata Key=Value --profile <original_account_profile> s3://bucket/path s3://bucket/path
That is, you copy the files to themselves, but with the added ACL that grants permissions to the bucket owner.
If you have sub-directories, then add --recursive.

Use Python to set the permissions recursively:
#!/usr/bin/env python
import boto3
import sys

client = boto3.client('s3')
BUCKET = 'enter-bucket-name'

def process_s3_objects(prefix):
    """Grant full control on every object under the given prefix."""
    kwargs = {'Bucket': BUCKET, 'Prefix': prefix}
    failures = []
    while_true = True
    while while_true:
        resp = client.list_objects_v2(**kwargs)
        for obj in resp['Contents']:
            try:
                print(obj['Key'])
                set_acl(obj['Key'])
                # Raises KeyError on the last page, which ends the outer loop.
                kwargs['ContinuationToken'] = resp['NextContinuationToken']
            except KeyError:
                while_true = False
            except Exception:
                failures.append(obj["Key"])
                continue
    print("failures :", failures)

def set_acl(key):
    client.put_object_acl(
        # The helper must be called; the original passed the function object.
        GrantFullControl="id=%s" % get_account_canonical_id(),
        Bucket=BUCKET,
        Key=key
    )

def get_account_canonical_id():
    return client.list_buckets()["Owner"]["ID"]

process_s3_objects(sys.argv[1])

One thing you can do to get around the need for setting the ACL for every single object is disabling ACLs for the bucket. All objects in the bucket will then be owned by the bucket owner, and you can use policies for access control instead of ACLs.
You do this by setting the "object ownership" setting to "bucket owner enforced". As per the AWS documentation, this is in fact the recommended setting:
For the majority of modern use cases in S3, we recommend that you disable ACLs by choosing the bucket owner enforced setting and use your bucket policy to share data with users outside of your account as needed. This approach simplifies permissions management and auditing.
You can set this in the web console by going to the "Permissions" tab for the bucket, and clicking the "Edit" button in the "Object Ownership" section. You can then select the "ACLs disabled" radio button.
You can also use the AWS CLI. An example from the documentation:
aws s3api put-bucket-ownership-controls --bucket DOC-EXAMPLE-BUCKET --ownership-controls="Rules=[{ObjectOwnership=BucketOwnerEnforced}]"
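The same setting can also be applied from Python. Below is a minimal sketch using the boto3 S3 client's put_bucket_ownership_controls call; the helper names ownership_controls and disable_acls are hypothetical, and the bucket name is the placeholder from the CLI example.

```python
def ownership_controls(setting="BucketOwnerEnforced"):
    """Build the OwnershipControls structure expected by the S3 API."""
    return {"Rules": [{"ObjectOwnership": setting}]}

def disable_acls(s3_client, bucket):
    """Switch the bucket to 'bucket owner enforced', which disables ACLs."""
    s3_client.put_bucket_ownership_controls(
        Bucket=bucket,
        OwnershipControls=ownership_controls(),
    )

# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   disable_acls(boto3.client("s3"), "DOC-EXAMPLE-BUCKET")
```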

This was my PowerShell-only solution.
aws s3 ls s3://BUCKET/ --recursive | %{ "aws s3api put-object-acl --bucket BUCKET --key "+$_.ToString().substring(30)+" --acl bucket-owner-full-control" }

I had a similar issue with taking ownership of log objects in a quite large bucket.
Total number of objects: 3,290,956. Total size: 1.4 TB.
The solutions I was able to find were far too slow for that number of objects. I ended up writing some code that was able to do the job several times faster than aws s3 cp.
You will need to install the requirements:
pip install pathos boto3 click
#!/usr/bin/env python3
import logging
import os
import sys
import boto3
import botocore
import click
from time import time
from botocore.config import Config
from pathos.pools import ThreadPool as Pool

logger = logging.getLogger(__name__)
streamformater = logging.Formatter("[*] %(levelname)s: %(asctime)s: %(message)s")
logstreamhandler = logging.StreamHandler()
logstreamhandler.setFormatter(streamformater)


def _set_log_level(ctx, param, value):
    if value:
        ctx.ensure_object(dict)
        ctx.obj["log_level"] = value
        logger.setLevel(value)
        if value <= 20:
            logger.info(f"Logger set to {logging.getLevelName(logger.getEffectiveLevel())}")
    return value


@click.group(chain=False)
@click.version_option(version='0.1.0')
@click.pass_context
def cli(ctx):
    """
    Take object ownership of S3 bucket objects.
    """
    ctx.ensure_object(dict)
    ctx.obj["aws_config"] = Config(
        retries={
            'max_attempts': 10,
            'mode': 'standard'
        }
    )


@cli.command("own")
@click.argument("bucket", type=click.STRING)
@click.argument("prefix", type=click.STRING, default="/")
@click.option("--profile", type=click.STRING, default="default", envvar="AWS_DEFAULT_PROFILE", help="Configuration profile from ~/.aws/{credentials,config}")
@click.option("--region", type=click.STRING, default="us-east-1", envvar="AWS_DEFAULT_REGION", help="AWS region")
@click.option("--threads", "-t", type=click.INT, default=40, help="Threads to use")
@click.option("--loglevel", "log_level", hidden=True, flag_value=logging.INFO, callback=_set_log_level, expose_value=False, is_eager=True, default=True)
@click.option("--verbose", "-v", "log_level", flag_value=logging.DEBUG, callback=_set_log_level, expose_value=False, is_eager=True, help="Increase log_level")
@click.pass_context
def command_own(ctx, *args, **kwargs):
    ctx.obj.update(kwargs)
    profile_name = ctx.obj.get("profile")
    region = ctx.obj.get("region")
    bucket = ctx.obj.get("bucket")
    prefix = ctx.obj.get("prefix").lstrip("/")
    threads = ctx.obj.get("threads")
    pool = Pool(nodes=threads)
    logger.addHandler(logstreamhandler)
    logger.info(f"Getting ownership of all objects in s3://{bucket}/{prefix}")
    start = time()
    try:
        SESSION: boto3.Session = boto3.session.Session(profile_name=profile_name)
    except botocore.exceptions.ProfileNotFound as e:
        logger.warning(f"Profile {profile_name} was not found.")
        logger.warning(f"Falling back to environment variables for AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN")
        AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "")
        AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")
        AWS_SESSION_TOKEN = os.environ.get("AWS_SESSION_TOKEN", "")
        if AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY:
            if AWS_SESSION_TOKEN:
                SESSION: boto3.Session = boto3.session.Session(aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                                                               aws_session_token=AWS_SESSION_TOKEN)
            else:
                SESSION: boto3.Session = boto3.session.Session(aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
        else:
            logger.error("Unable to find AWS credentials.")
            sys.exit(1)
    s3c = SESSION.client('s3', config=ctx.obj["aws_config"])

    def bucket_keys(Bucket, Prefix='', StartAfter='', Delimiter='/'):
        Prefix = Prefix[1:] if Prefix.startswith(Delimiter) else Prefix
        if not StartAfter:
            del StartAfter
            if Prefix.endswith(Delimiter):
                StartAfter = Prefix
        del Delimiter
        for page in s3c.get_paginator('list_objects_v2').paginate(Bucket=Bucket, Prefix=Prefix):
            for content in page.get('Contents', ()):
                yield content['Key']

    def worker(key):
        # Copy the object onto itself so the bucket owner gets full control.
        logger.info(f"Processing: {key}")
        s3c.copy_object(Bucket=bucket, Key=key,
                        CopySource={'Bucket': bucket, 'Key': key},
                        ACL='bucket-owner-full-control',
                        StorageClass="STANDARD"
                        )

    object_keys = bucket_keys(bucket, prefix)
    pool.map(worker, object_keys)
    end = time()
    logger.info(f"Completed for {end - start:.2f} seconds.")


if __name__ == '__main__':
    cli()
Usage:
get_object_ownership.py own -v my-big-aws-logs-bucket /prefix
Processing the bucket mentioned above took ~7 hours using 40 threads.
[*] INFO: 2021-08-05 19:53:55,542: Completed for 25320.45 seconds.
A further speed comparison between the AWS CLI and this tool on the same subset of data:
aws s3 cp --recursive --acl bucket-owner-full-control --metadata-directive
53.59s user 7.24s system 20% cpu 5:02.42 total
vs
[*] INFO: 2021-08-06 09:07:43,506: Completed for 49.09 seconds.

I used this Linux Bash shell oneliner to change ACLs recursively:
aws s3 ls s3://bucket --recursive | cut -c 32- | xargs -n 1 -d '\n' -- aws s3api put-object-acl --acl public-read --bucket bucket --key
It works even if file names contain () characters.

The Python code is more efficient when written this way; otherwise it takes a lot longer.
import boto3
import sys

client = boto3.client('s3')
BUCKET = 'mybucket'

def process_s3_objects(prefix):
    """Set the bucket-owner-full-control ACL on every object under the prefix."""
    kwargs = {'Bucket': BUCKET, 'Prefix': prefix}
    failures = []
    while True:
        resp = client.list_objects_v2(**kwargs)
        for obj in resp.get('Contents', []):
            try:
                set_acl(obj['Key'])
            except Exception:
                failures.append(obj["Key"])
                continue
        # Advance once per page; the token is absent on the last page.
        if 'NextContinuationToken' not in resp:
            break
        kwargs['ContinuationToken'] = resp['NextContinuationToken']
    print("failures :", failures)

def set_acl(key):
    print(key)
    client.put_object_acl(
        ACL='bucket-owner-full-control',
        Bucket=BUCKET,
        Key=key
    )

def get_account_canonical_id():
    return client.list_buckets()["Owner"]["ID"]

process_s3_objects(sys.argv[1])

Errno 11004 getaddrinfo failed error in connecting to Amazon S3 bucket

I am trying to use the boto (ver 2.43.0) library in Python to connect to S3, but I keep getting socket.gaierror: [Errno 11004] when I try to do this:
from boto.s3.connection import S3Connection
access_key = 'accesskey_here'
secret_key = 'secretkey_here'
conn = S3Connection(access_key, secret_key)
mybucket = conn.get_bucket('s3://diap.prod.us-east-1.mybucket/')
print("success!")
I can connect to and access folders in mybucket using AWS CLI by using a command like this in Windows:
> aws s3 ls s3://diap.prod.us-east-1.mybucket/
<list of folders in mybucket will be here>
or using software like CloudBerry or S3Browser.
Is there something that I am doing wrong here to access S3 bucket and folders properly?

get_bucket() expects a bucket name.
get_bucket(bucket_name, validate=True, headers=None)
Try:
mybucket = conn.get_bucket('mybucket')
If it doesn't work, show the full stack trace.
[Update]: There is a bug in the boto library with bucket names that contain dots. Update your boto config:
[s3]
calling_format = boto.s3.connection.OrdinaryCallingFormat
Or
from boto.s3.connection import S3Connection, OrdinaryCallingFormat
conn = S3Connection(access_key, secret_key, calling_format=OrdinaryCallingFormat())
