Amazon S3 - Unable to create a datasource

I tried creating a datasource using boto for machine learning but ended up with an error.
Here's my code:
import boto

bucketname = 'mybucket'
filename = 'myfile.csv'
schema = 'myfile.csv.schema'
conn = boto.connect_s3()
datasource = 'my_datasource'
ml = boto.connect_machinelearning()

# create a data source
ds = ml.create_data_source_from_s3(
    data_source_id=datasource,
    data_spec={
        'DataLocationS3': 's3://' + bucketname + '/' + filename,
        'DataSchemaLocationS3': 's3://' + bucketname + '/' + schema},
    data_source_name=None,
    compute_statistics=True)

print ml.get_data_source(datasource, verbose=None)
I get this error as a result of the get_data_source call:
Could not access 's3://mybucket/myfile.csv'. Either there is no file at that location, or the file is empty, or you have not granted us read permission.
I have checked and I have FULL_CONTROL as my permissions. The bucket, file and schema all are present and are non-empty.
How do I solve this?

You may have FULL_CONTROL over that S3 resource, but for this to work you have to grant the Machine Learning service the appropriate access to that S3 resource.
I know link-only answers are frowned upon, but in this case I think it's best to link to the definitive documentation from the Machine Learning service, since the actual steps are involved and could change in the future.
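For context, the gist of those steps is to attach a bucket policy that lets the Amazon Machine Learning service read the data and schema files. A rough sketch using boto3 is below; the exact principal and actions are my assumption here and should be taken from the official documentation:

import json
import boto3

s3 = boto3.client('s3')

# Assumed policy shape: allow the Machine Learning service principal to list
# the bucket and read its objects. Verify against the official documentation.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "machinelearning.amazonaws.com"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::mybucket",
            "arn:aws:s3:::mybucket/*"
        ]
    }]
}

s3.put_bucket_policy(Bucket='mybucket', Policy=json.dumps(policy))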

Related

Exception in Botocore while trying to read a file from AWS S3

I've been hitting an exception that had never come up before. I am trying to read a file stored in S3 with boto3, something like this:
import boto3

session = boto3.Session(
    aws_access_key_id=my_aws_access_key_id,
    aws_secret_access_key=my_aws_secret_access_key,
    region_name="us-east-1",
)
s3 = session.resource("s3")
bucket = s3.Bucket("my_bucket_name")
mystring = bucket.Object(my_object_key).get()["Body"].read()
... some other code ...
Right where it should assign the string to the mystring variable, I get the following:
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://<my_bucket_name>.s3.amazonaws.com/<my_object_key>"
I have tried to retrieve the file using the AWS CLI with the same credentials that I'm giving boto3, and it works fine.
I am setting a specific region name and credentials precisely to avoid AWS config issues.
The endpoint URL given in the exception is the correct URL of the object.
I have tried with other objects in S3 and it doesn't work either.
I checked for changes in environment variables as explained in this related question, to no avail.
I also checked all the suggestions in this other related question, without results.
This error came out of the blue: yesterday it was working fine and today it isn't, with no changes in the source code.
To reproduce your situation, I did the following:
Created a bucket in us-east-1
Uploaded a file to us-east-1
Ran this code:
import boto3
s3_resource = boto3.resource("s3", region_name='us-east-1')
bucket = s3_resource.Bucket("my-bucket-name")
mystring = bucket.Object('my-object-name').get()["Body"].read()
print(mystring)
It successfully printed the contents of the object, suggesting that your code is not to blame. Perhaps there is another configuration in your system that is affecting it?
I then tried this code from Any method to get s3 endpoint url for a given region? · Issue #1166 · boto/boto3 · GitHub:
import botocore.loaders
import botocore.regions
loader = botocore.loaders.create_loader()
data = loader.load_data("endpoints")
resolver = botocore.regions.EndpointResolver(data)
endpoint_data = resolver.construct_endpoint("s3", "us-east-1")
print(endpoint_data)
It returned:
OrderedDict([('hostname', 's3.us-east-1.amazonaws.com'), ('signatureVersions', ['s3', 's3v4']), ('variants', [OrderedDict([('hostname', 's3-fips.dualstack.us-east-1.amazonaws.com'), ('tags', ['dualstack', 'fips'])]), OrderedDict([('hostname', 's3-fips.us-east-1.amazonaws.com'), ('tags', ['fips'])]), OrderedDict([('hostname', 's3.dualstack.us-east-1.amazonaws.com'), ('tags', ['dualstack'])])]), ('dnsSuffix', 'amazonaws.com'), ('partition', 'aws'), ('endpointName', 'us-east-1'), ('protocols', ['http', 'https'])])
Not sure if that helps, but it's something for you to compare against.
My boto3.__version__ is showing 1.24.38.
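One more thing worth checking, since the CLI works but boto3 does not: an EndpointConnectionError can come from the Python process's environment rather than from credentials, for example proxy variables or a different profile/region being picked up. This is only a guess about your setup, but a quick way to inspect it is:

import os

# Print proxy- and AWS-related environment variables that could redirect
# or block boto3's HTTPS connection to the S3 endpoint.
suspects = [
    "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY",
    "http_proxy", "https_proxy", "no_proxy",
    "AWS_PROFILE", "AWS_DEFAULT_REGION", "AWS_CA_BUNDLE",
]
for name in suspects:
    print(name, "=", os.environ.get(name))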

AWS Data Wrangler - wr.athena.read_sql_query doesn't work

I started using the AWS Data Wrangler lib
( https://aws-data-wrangler.readthedocs.io/en/stable/what.html )
to execute queries on AWS Athena and use their results in my AWS Glue Python shell job.
I see that wr.athena.read_sql_query exists to obtain what I need.
This is my code:
import sys
import os
import awswrangler as wr
os.environ['AWS_DEFAULT_REGION'] = 'eu-west-1'
databases = wr.catalog.databases()
print(databases)
query='select count(*) from staging_dim_channel'
print(query)
df_res = wr.athena.read_sql_query(sql=query, database="lsk2-target")
print(df_res)
print(f'DataScannedInBytes: {df_res.query_metadata["Statistics"]["DataScannedInBytes"]}')
print(f'TotalExecutionTimeInMillis: {df_res.query_metadata["Statistics"]["TotalExecutionTimeInMillis"]}')
print(f'QueryQueueTimeInMillis: {df_res.query_metadata["Statistics"]["QueryQueueTimeInMillis"]}')
print(f'QueryPlanningTimeInMillis: {df_res.query_metadata["Statistics"]["QueryPlanningTimeInMillis"]}')
print(f'ServiceProcessingTimeInMillis: {df_res.query_metadata["Statistics"]["ServiceProcessingTimeInMillis"]}')
I retrieve the list of databases without problems (including lsk2-target), but read_sql_query errors out and I receive:
WaiterError: Waiter BucketExists failed: Max attempts exceeded
Please, can you help me understand where I am going wrong?
Thanks!
I fixed a similar issue; the resolution is to ensure that the IAM role used has the necessary Athena permissions to create tables, because this API defaults to running with ctas_approach=True.
Ref. documentation
Also, once that is resolved, ensure that the IAM role also has access to delete the files created in S3.
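If granting those permissions is not an option, a quick alternative (my suggestion, not part of the original answer) is to disable the CTAS approach so no temporary table needs to be created; a sketch against the awswrangler API:

import awswrangler as wr

# Read the query results directly instead of via a CTAS temporary table;
# this needs fewer permissions but is slower for large result sets.
df_res = wr.athena.read_sql_query(
    sql='select count(*) from staging_dim_channel',
    database='lsk2-target',
    ctas_approach=False,
)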
Do you have the right IAM permissions to read and execute a query? I bet it is an IAM issue.
Also, I assume you have set up your credentials:
[default]
aws_access_key_id = your_access_key_id
aws_secret_access_key = your_secret_access_key
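As a quick sanity check (my suggestion), you can print the identity the job is actually running as, and then review the IAM policies attached to that role or user:

import boto3

# Shows the account, user ID and ARN of the credentials the job is using.
print(boto3.client('sts').get_caller_identity())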

How to sign a GCS blob from the Dataflow worker

My Beam Dataflow job succeeds locally (with DirectRunner) and fails in the cloud (with DataflowRunner).
The issue is localized to this code snippet:
import datetime
import apache_beam as beam
from google.cloud import storage
from google.cloud.storage.blob import Blob

class SomeDoFn(beam.DoFn):
    ...
    def process(self, gcs_blob_path):
        gcs_client = storage.Client()
        bucket = gcs_client.get_bucket(BUCKET_NAME)
        blob = Blob(gcs_blob_path, bucket)
        # NEXT LINE IS CAUSING ISSUES! (when run remotely)
        url = blob.generate_signed_url(datetime.timedelta(seconds=300), method='GET')
and Dataflow points to the error: "AttributeError: you need a private key to sign credentials.the credentials you are currently using just contains a token."
My Dataflow job uses a service account (and the appropriate service_account_email is provided in the PipelineOptions), but I don't see how I could pass that service account's .json credentials file to the Dataflow job. I suspect my job runs successfully locally because I set the environment variable GOOGLE_APPLICATION_CREDENTIALS=<path to local file with service account credentials>, but how do I set it similarly for the remote Dataflow workers? Or maybe there is another solution; if anyone could help, I'd appreciate it.
You can see an example here on how to add custom options to your Beam pipeline. With this we can create a --key_file argument that will point to the credentials stored in GCS:
parser.add_argument('--key_file',
                    dest='key_file',
                    required=True,
                    help='Path to service account credentials JSON.')
This will allow you to add the --key_file gs://PATH/TO/CREDENTIALS.json flag when running the job.
Then, you can read it from within the job and pass it as a side input to the DoFn that needs to sign the blob. Starting from the example here we create a credentials PCollection to hold the JSON file:
credentials = (p
               | 'Read Credentials from GCS' >> ReadFromText(known_args.key_file))
and we broadcast it to all workers processing the SignFileFn function:
(p
 | 'Read File from GCS' >> beam.Create([known_args.input])
 | 'Sign File' >> beam.ParDo(SignFileFn(), pvalue.AsList(credentials)))
Inside the ParDo, we build the JSON object to initialize the client (using the approach here) and sign the file:
class SignFileFn(beam.DoFn):
    """Signs GCS file with GCS-stored credentials"""
    def process(self, gcs_blob_path, creds):
        import datetime
        import json
        import logging

        from google.cloud import storage
        from google.oauth2 import service_account

        credentials_json = json.loads('\n'.join(creds))
        credentials = service_account.Credentials.from_service_account_info(credentials_json)
        gcs_client = storage.Client(credentials=credentials)

        bucket = gcs_client.get_bucket(gcs_blob_path.split('/')[2])
        blob = bucket.blob('/'.join(gcs_blob_path.split('/')[3:]))
        url = blob.generate_signed_url(datetime.timedelta(seconds=300), method='GET')
        logging.info(url)
        yield url
See full code here
You will need to provide the service account JSON key, similarly to what you are doing locally with the env variable GOOGLE_APPLICATION_CREDENTIALS.
To do so, you can follow a few of the approaches mentioned in the answers to this question, such as passing it using PipelineOptions.
However, keep in mind that the safest way is to store the JSON key, say in a GCS bucket, and get the file from there.
The easy but not-so-safe workaround is to get the key, open it, and build a JSON object from it in your code to pass along later.
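A minimal sketch of that workaround, assuming a local key file path (the path, bucket and object names below are placeholders):

import datetime
import json

from google.cloud import storage
from google.oauth2 import service_account

# Placeholder path: read and parse the key at pipeline-construction time,
# then build explicit credentials for the client that signs the URL.
with open('/path/to/service-account-key.json') as f:
    key_info = json.load(f)

credentials = service_account.Credentials.from_service_account_info(key_info)
gcs_client = storage.Client(project=key_info['project_id'], credentials=credentials)
blob = gcs_client.bucket('my-bucket').blob('path/to/object')
url = blob.generate_signed_url(datetime.timedelta(seconds=300), method='GET')
print(url)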

IAM role and keys setup for accessing S3 buckets in two different AWS accounts using boto3

I have two different accounts:
1) Account one, which is the vendor account; they gave us an Access ID and secret key for access.
2) Our account, where we have full access.
We need to copy files from the vendor's S3 bucket to our S3 bucket using boto3 Python 3.7 scripts.
What is the best function in boto3 to use to get the best performance?
I tried using get_object and put_object. The problem with this scenario is that I am actually reading the file body and writing it. How do we just copy from one account to another with a faster copy mode?
Is there any setup I can do on my end to copy directly? We are okay with using Lambda as well, as long as I get good performance. I cannot request any changes from the vendor except that they give us access keys.
Thanks
Tom
One of the fastest ways to copy data between 2 buckets is to use S3DistCp; it is only worth using if you have a lot of files to copy, since it copies them in a distributed way with an EMR cluster.
A Lambda function with boto3 is an option only if the copy takes less than 5 minutes; if it takes longer, you can consider using ECS tasks (basically Docker containers).
Regarding how to copy with boto3, you can check here.
It looks like you can do something like:
import boto3

s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')

source_bucket_name = 'src_bucket_name'
destination_bucket_name = 'dst_bucket_name'

paginator = s3_client.get_paginator('list_objects')
response_iterator = paginator.paginate(
    Bucket=source_bucket_name,
    Prefix='your_prefix',
    PaginationConfig={
        'PageSize': 1000,
    }
)

objs = response_iterator.build_full_result()['Contents']
keys_to_copy = [o['Key'] for o in objs]  # or use a generator (o['Key'] for o in objs)

for key in keys_to_copy:
    print(key)
    copy_source = {
        'Bucket': source_bucket_name,
        'Key': key
    }
    s3_resource.meta.client.copy(copy_source, destination_bucket_name, key)
The proposed solution first gets the names of the objects to copy, then calls the copy command for each object.
To make it faster than a plain for loop, you can run the copies concurrently, for example with async or a thread pool.
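A rough sketch of the thread-pool variant (my own illustration, not part of the original answer; source_bucket_name, destination_bucket_name and keys_to_copy come from the snippet above):

from concurrent.futures import ThreadPoolExecutor

import boto3

# Low-level clients are thread-safe, so one client can be shared here.
s3_client = boto3.client('s3')

def copy_one(key):
    # Server-side copy of a single object from source to destination bucket.
    s3_client.copy({'Bucket': source_bucket_name, 'Key': key},
                   destination_bucket_name, key)
    return key

# Run several copies in parallel; tune max_workers to your workload.
with ThreadPoolExecutor(max_workers=16) as pool:
    for done_key in pool.map(copy_one, keys_to_copy):
        print('copied', done_key)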
If you run the code in a Lambda or an ECS task, remember to create an IAM role with access to both the source bucket and the destination bucket.
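The policy document for such a role could look roughly like the sketch below; the bucket names are placeholders, you may need extra actions, and for true cross-account access the vendor's bucket must also allow your role via its own bucket policy:

# Sketch of an IAM policy document for the Lambda/ECS role; attach it via the
# console or iam.put_role_policy, adjusting bucket names and actions as needed.
copy_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::src_bucket_name",
                "arn:aws:s3:::src_bucket_name/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::dst_bucket_name/*"]
        }
    ]
}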

botocore.exceptions.NoCredentialsError: Unable to locate credentials, even after passing credentials manually

Hi, I am a newbie at creating Flask applications. I have created a small GUI to upload files to an S3 bucket.
Here is the code snippet which handles that:
s3 = boto3.client('s3', region_name="eu-west-1",
                  endpoint_url=S3_LOCATION, aws_access_key_id=S3_KEY,
                  aws_secret_access_key=S3_SECRET)
myclient = boto3.resource('s3')

file = request.files['file[]']
filename = file.filename
data_files = request.files.getlist('file[]')

for data_file in data_files:
    file_contents = data_file.read()
    ts = time.gmtime()
    k = time.strftime("%Y-%m-%dT%H:%M:%S", ts)
    name = filename[0:-4]
    newfilename = (name + k + '.txt')
    myclient.Bucket(S3_BUCKET).put_object(Key=newfilename, Body=file_contents)
    message = 'File Uploaded Successfully'
    print('upload Successful')
This part works fine when I test it from my local system, but after deploying it to the EC2 instance, the line
myclient.Bucket(S3_BUCKET).put_object(Key=newfilename, Body=file_contents)
is where it throws the error:
botocore.exceptions.NoCredentialsError: Unable to locate credentials
I have created a file config.py where I store all the credentials and pass them in at runtime.
Not sure what is causing the error on the EC2 instance; please help me with it.
You may be confusing the boto3 service resource with the client.
# This instantiates a boto3 S3 client named s3, using the explicit credentials passed in.
s3 = boto3.client('s3', region_name="eu-west-1",
                  endpoint_url=S3_LOCATION, aws_access_key_id=S3_KEY,
                  aws_secret_access_key=S3_SECRET)

# This instantiates a boto3 S3 resource named myclient that falls back to the
# credentials in the ~/.aws folder, since no explicit credentials are given.
myclient = boto3.resource('s3')
It seems you are trying to pass explicit credentials to boto3.resource without relying on the ~/.aws credential and config files. If so, this is not the right way to do it. To explicitly pass credentials, it is recommended to use boto3.Session (which works for boto3.client too). This also allows you to connect to different AWS services from the same initialised session, rather than passing API keys for each service throughout your program.
import boto3

session = boto3.Session(
    region_name='us-west-2',
    aws_access_key_id=S3_KEY,
    aws_secret_access_key=S3_SECRET)

# now instantiate the services
myclient = session.resource('s3')
# .... the rest of the code
Nevertheless, the better way is to make use of the ~/.aws credential files, because it is bad practice to hard-code any access key/password inside the code. You can also use a named profile if you need different API keys in different regions, e.g.:
~/.aws/credentials
[default]
aws_access_key_id = XYZABC12345
aws_secret_access_key = SECRET12345
[appsinfinity]
aws_access_key_id = XYZABC12346
aws_secret_access_key = SECRET12346
~/.aws/config
[default]
region = us-west-1
[profile appsinfinity]
region = us-west-2
And the code:
import boto3

app_infinity_session = boto3.Session(profile_name='appsinfinity')
....