AWS Data Wrangler - wr.athena.read_sql_query doesn't work

I started using the AWS Data Wrangler library
( https://aws-data-wrangler.readthedocs.io/en/stable/what.html )
to execute queries on AWS Athena and use their results in my AWS Glue Python shell job.
I see that wr.athena.read_sql_query exists to obtain what I need.
This is my code:
import sys
import os
import awswrangler as wr
os.environ['AWS_DEFAULT_REGION'] = 'eu-west-1'
databases = wr.catalog.databases()
print(databases)
query='select count(*) from staging_dim_channel'
print(query)
df_res = wr.athena.read_sql_query(sql=query, database="lsk2-target")
print(df_res)
print(f'DataScannedInBytes: {df_res.query_metadata["Statistics"]["DataScannedInBytes"]}')
print(f'TotalExecutionTimeInMillis: {df_res.query_metadata["Statistics"]["TotalExecutionTimeInMillis"]}')
print(f'QueryQueueTimeInMillis: {df_res.query_metadata["Statistics"]["QueryQueueTimeInMillis"]}')
print(f'QueryPlanningTimeInMillis: {df_res.query_metadata["Statistics"]["QueryPlanningTimeInMillis"]}')
print(f'ServiceProcessingTimeInMillis: {df_res.query_metadata["Statistics"]["ServiceProcessingTimeInMillis"]}')
I retrieve the list of databases without any problem (including lsk2-target), but read_sql_query fails and I receive:
WaiterError: Waiter BucketExists failed: Max attempts exceeded
Please, can you help me understand where I am going wrong?
Thanks!

I fixed a similar issue; the resolution is to ensure that the IAM role used has the necessary Athena permissions to create tables, as this API defaults to running with ctas_approach=True.
Ref. documentation
Also, once that is resolved, ensure that the IAM role also has access to delete the files created in S3.
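If granting table-creation permissions is not possible in your environment, another option the library offers is to disable the CTAS approach entirely. A minimal sketch, reusing the query and database from the question:
import awswrangler as wr

# ctas_approach=False runs a plain Athena query instead of a CREATE TABLE AS SELECT,
# so no table-creation permissions are needed; results are read from the regular CSV
# query output, which is generally slower for large result sets.
df_res = wr.athena.read_sql_query(
    sql="select count(*) from staging_dim_channel",
    database="lsk2-target",
    ctas_approach=False,
)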

Do you have the right IAM permissions to execute a query and read the results? I bet it is an IAM issue.
Also, I assume you have set up your credentials (e.g. in ~/.aws/credentials):
[default]
aws_access_key_id = your_access_key_id
aws_secret_access_key = your_secret_access_key
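If you would rather not depend on the [default] profile inside the Glue job, a minimal sketch is to build an explicit boto3 session and hand it to awswrangler (the region below is the one from the question; in a real Glue job the job's IAM role normally supplies the credentials):
import boto3
import awswrangler as wr

# Explicit session; credentials are resolved by boto3's normal chain
# (environment, profile, or the Glue job's IAM role).
session = boto3.Session(region_name="eu-west-1")

df_res = wr.athena.read_sql_query(
    sql="select count(*) from staging_dim_channel",
    database="lsk2-target",
    boto3_session=session,
)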

Related

How to get the value of aws iam list-account-aliases as variable?

I want to write a Python program to check which account I am in by using the account alias (I have multiple tenants set up on AWS). I think aws iam list-account-aliases returns exactly what I am looking for, but it is a command-line result and I am not sure what the best way is to capture it as a variable in a Python program.
Also, I was reading about aws iam list-account-aliases, and its output section mentions AccountAliases -> (list). (https://docs.aws.amazon.com/cli/latest/reference/iam/list-account-aliases.html)
I wonder what this AccountAliases is? An option? A command? A variable? I was a little bit confused here.
Thank you!
Use Boto3 to get the account alias in Python; your link points to the aws-cli.
Here is the link to the equivalent Boto3 call for Python:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/iam.html#IAM.Client.list_account_aliases
Sample code:
import boto3
client = boto3.client('iam')
response = client.list_account_aliases()
Response:
{
    'AccountAliases': [
        'myawesomeaccount',
    ],
    'IsTruncated': True,
}
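To capture the alias as a variable (the original ask), a minimal sketch building on the sample code above:
import boto3

client = boto3.client('iam')
response = client.list_account_aliases()

# AccountAliases is simply a key in the returned dict holding a list of strings;
# most accounts have at most one alias.
aliases = response['AccountAliases']
account_alias = aliases[0] if aliases else None
print(account_alias)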
An account alias on AWS is a readable alias created against an AWS account id. You can find more info here:
https://docs.aws.amazon.com/IAM/latest/UserGuide/console_account-alias.html
AWS has provided a very well documented library for Python called Boto3, which can be used to connect to your AWS account as a client, resource or session (more information on these in the SO answer here: Difference in boto3 between resource, client, and session?)
For your use case you can connect to AWS as a client using the 'iam' service name:
import boto3

client = boto3.client(
    'iam',
    region_name="region_name",
    aws_access_key_id="your_aws_access_key_id",
    aws_secret_access_key="your_aws_secret_access_key")
account_aliases = client.list_account_aliases()
The response is a Python dictionary which can be traversed to get the desired information on the aliases.
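Since IsTruncated can be set in the response, a minimal sketch that uses the boto3 paginator to collect every alias:
import boto3

client = boto3.client('iam')

# The paginator transparently follows IsTruncated/Marker for you.
all_aliases = []
for page in client.get_paginator('list_account_aliases').paginate():
    all_aliases.extend(page['AccountAliases'])
print(all_aliases)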

gcloud sql import csv fails with permissions error when import file contains wildcard

According to the Google Cloud Platform SQL docs, I should be able to both export to and import from sharded files in a GCS bucket by putting a * in the filename.
If I import a single file, it works fine:
gcloud sql import csv sql-instance-name gs://gcsbucketname/data/led/led_finance_view_000000000000.csv --project=project-name --database=finance --table=Import_Test -q
Importing data into Cloud SQL instance...done.
Imported data from [gs://gcsbucketname/data/led/led_finance_view_000000000000.csv] into [https://www.googleapis.com/sql/v1beta4/projects/project-name/instances/sql-instance-name].
But if I import a sharded file, it throws a permissions error:
gcloud sql import csv sql-instance-name gs://gcsbucketname/data/led/led_finance_view_*.csv --project=project-name --database=finance --table=Import_Test -q
ERROR: (gcloud.sql.import.csv) HTTPError 403: The service account does not have the required permissions for the bucket.
I can confirm that these commands are running under the same user, using the same bucket & SQL instance.
I think the '403 service account permissions' error is probably a bug in GCP and therefore a red herring - but why won't it let me import the sharded file?
Currently, Cloud SQL doesn't support importing several CSV files at once using wildcards. I have opened a public feature request for this; you can star it to make it gain visibility.
Meanwhile, as a workaround, you can either have a script that runs the import command once for each file, or join the CSV files before importing.
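For the scripted workaround, a minimal sketch in Python (the bucket, instance, and table names from the question are used purely as illustration) that shells out to gcloud once per shard:
import subprocess

# Hypothetical shard names; list them with: gsutil ls gs://gcsbucketname/data/led/led_finance_view_*.csv
shards = [
    "gs://gcsbucketname/data/led/led_finance_view_000000000000.csv",
    "gs://gcsbucketname/data/led/led_finance_view_000000000001.csv",
]

for uri in shards:
    # Same single-file import command that already works, run once per shard.
    subprocess.run(
        ["gcloud", "sql", "import", "csv", "sql-instance-name", uri,
         "--project=project-name", "--database=finance",
         "--table=Import_Test", "-q"],
        check=True,
    )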
You have to get the service account of your Cloud SQL instance. To get it, type something like this in your terminal:
Command: gcloud sql instances describe [INSTANCE_NAME]
After that, go to the GCS project -> open your bucket -> add member permissions, and add the service account you got from the first command, with the Storage Admin role.
And that's all.
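If you prefer the command line over the console for that grant, a rough sketch in Python (the instance and bucket names are the placeholders from the question):
import subprocess

# Read the instance's service account email from `gcloud sql instances describe`.
sa_email = subprocess.run(
    ["gcloud", "sql", "instances", "describe", "sql-instance-name",
     "--format=value(serviceAccountEmailAddress)"],
    check=True, capture_output=True, text=True,
).stdout.strip()

# Grant it the Storage Admin role on the bucket, as described above.
subprocess.run(
    ["gsutil", "iam", "ch",
     f"serviceAccount:{sa_email}:roles/storage.admin",
     "gs://gcsbucketname"],
    check=True,
)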

Get the BigQuery Table creator and Google Storage Bucket Creator Details

I am trying to identify the users who created tables in BigQuery.
Is there any command line or API that would provide this information? I know that audit logs do provide this information, but I was looking for a command line that could do the job so that I could wrap it in a shell script and run it against all the tables at one time. The same applies to Google Storage buckets as well. I did try
gsutil iam get gs://my-bkt and looked for the "role": "roles/storage.admin" entry, but I do not find the admin role on all buckets. Any help?
This is a use case for audit logs. BigQuery tables don't report metadata about the original resource creator, so scanning via tables.list or inspecting the ACLs doesn't really expose who created the resource, only who currently has access.
What's the use case? You could certainly export the audit logs back into BigQuery and query for table creation events going forward, but that's not exactly the same.
You can find it out using Audit Logs. You can access them both via Console/Log Explorer or using gcloud tool from the CLI.
The log filter that you're interested in is this one:
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
If you want to run it from the command line, you'd do something like this:
gcloud logging read \
'
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
'\
--limit 10
You can then post-process the output to find out who created the table; look for the principalEmail field.
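A minimal post-processing sketch in Python, assuming you add --format=json to the gcloud command above (filter placeholders kept as in the answer):
import json
import subprocess

log_filter = '''
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
'''

entries = json.loads(subprocess.run(
    ["gcloud", "logging", "read", log_filter, "--limit", "10", "--format=json"],
    check=True, capture_output=True, text=True,
).stdout)

# The creator is recorded in the audit log's authentication info.
for entry in entries:
    print(entry["protoPayload"]["authenticationInfo"]["principalEmail"])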

Can I use an Athena View as a source for an AWS Glue Job?

I'm trying to use an Athena View as a data source to my AWS Glue Job. The error message I'm getting while trying to run the Glue job is about the classification of the view. What can I define it as?
Thank you
[Screenshot: error message]
You can, by using the Athena JDBC driver. This approach circumvents the catalog, as only Athena (and not Glue, as of 25-Jan-2019) can directly access views.
Download the driver and store the jar to an S3 bucket.
Specify the S3 path to the driver as a dependent jar in your job definition.
Load the data into a dynamic frame using the code below (using an IAM user
with permission to run Athena queries).
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

athena_view_dataframe = (
    glueContext.read.format("jdbc")
    .option("user", "[IAM user access key]")
    .option("password", "[IAM user secret access key]")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option("url", "jdbc:awsathena://athena.us-east-1.amazonaws.com:443")
    .option("dbtable", "my_database.my_athena_view")
    .option("S3OutputLocation", "s3://bucket/temp/folder")  # CSVs/metadata dumped here on load
    .load()
)

athena_view_datasource = DynamicFrame.fromDF(athena_view_dataframe, glueContext, "athena_view_source")
The driver docs (pdf) provide alternatives to IAM user auth (e.g. SAML, custom provider).
The main side effect to this approach is that loading causes the query results to be dumped in CSV format to the bucket specified with the S3OutputLocation key.
I don't believe that you can create a Glue Connection to Athena via JDBC because you can't specify an S3 path to the driver location.
Attribution: AWS support totally helped me get this working.

Amazon S3 - Unable to create a datasource

I tried creating a datasource using boto for machine learning but ended up with an error.
Here's my code :
import boto

bucketname = 'mybucket'
filename = 'myfile.csv'
schema = 'myfile.csv.schema'
conn = boto.connect_s3()
datasource = 'my_datasource'
ml = boto.connect_machinelearning()

# create a data source
ds = ml.create_data_source_from_s3(
    data_source_id=datasource,
    data_spec={
        'DataLocationS3': 's3://' + bucketname + '/' + filename,
        'DataSchemaLocationS3': 's3://' + bucketname + '/' + schema},
    data_source_name=None,
    compute_statistics=True)

print(ml.get_data_source(datasource, verbose=None))
I get this error as a result of the get_data_source call:
Could not access 's3://mybucket/myfile.csv'. Either there is no file at that location, or the file is empty, or you have not granted us read permission.
I have checked, and I have FULL_CONTROL as my permissions. The bucket, file, and schema are all present and non-empty.
How do I solve this?
You may have FULL_CONTROL over that S3 resource, but in order for this to work you have to grant the Machine Learning service the appropriate access to that S3 resource.
I know links to answers are frowned upon, but in this case I think it's best to link to the definitive documentation from the Machine Learning service, since the actual steps are complicated and could change in the future.
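For reference, a rough sketch of applying such a bucket policy with boto3; the service principal and actions below are my assumption of what the Machine Learning documentation requires, so verify them against the current docs:
import json
import boto3

bucket = "mybucket"  # bucket from the question

# Assumed policy shape: allow the Amazon ML service principal to list the
# bucket and read its objects. Verify against the Machine Learning documentation.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "machinelearning.amazonaws.com"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::" + bucket,
                "arn:aws:s3:::" + bucket + "/*",
            ],
        }
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))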