Can I use an Athena View as a source for an AWS Glue Job? - amazon-web-services

I'm trying to use an Athena View as a data source for my AWS Glue Job. The error message I'm getting while trying to run the Glue job is about the classification of the view. What can I define it as?
Thank you
[Screenshot of the error message]

You can, by using the Athena JDBC driver. This approach circumvents the catalog, since only Athena (and not Glue, as of 25-Jan-2019) can directly access views.
Download the driver and store the jar in an S3 bucket.
Specify the S3 path to the driver as a dependent jar in your job definition.
Load the data into a dynamic frame using the code below (using an IAM user with permission to run Athena queries).
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

# Standard Glue job setup
glueContext = GlueContext(SparkContext.getOrCreate())

athena_view_dataframe = (
    glueContext.read.format("jdbc")
    .option("user", "[IAM user access key]")
    .option("password", "[IAM user secret access key]")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option("url", "jdbc:awsathena://athena.us-east-1.amazonaws.com:443")
    .option("dbtable", "my_database.my_athena_view")
    .option("S3OutputLocation", "s3://bucket/temp/folder")  # CSVs/metadata are dumped here on load
    .load()
)

athena_view_datasource = DynamicFrame.fromDF(athena_view_dataframe, glueContext, "athena_view_source")
The driver docs (pdf) provide alternatives to IAM user auth (e.g. SAML, custom provider).
The main side effect of this approach is that loading causes the query results to be dumped, in CSV format, to the bucket specified with the S3OutputLocation key.
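If the leftover CSV files are a concern, they can be removed once the dynamic frame has been materialized. A minimal cleanup sketch, assuming the bucket/prefix used above and that the job role is allowed to delete objects under that prefix:
import boto3

# Remove the temporary Athena query results written during the JDBC load
s3 = boto3.resource("s3")
s3.Bucket("bucket").objects.filter(Prefix="temp/folder/").delete()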
I don't believe that you can create a Glue Connection to Athena via JDBC because you can't specify an S3 path to the driver location.
Attribution: AWS support totally helped me get this working.

Related

AWS Data Wrangler - wr.athena.read_sql_query doesn't work

I started using the AWS Data Wrangler library
( https://aws-data-wrangler.readthedocs.io/en/stable/what.html )
to execute queries on AWS Athena and use their results in my AWS Glue Python shell job.
I see that wr.athena.read_sql_query exists to obtain what I need.
This is my code:
import sys
import os
import awswrangler as wr
os.environ['AWS_DEFAULT_REGION'] = 'eu-west-1'
databases = wr.catalog.databases()
print(databases)
query='select count(*) from staging_dim_channel'
print(query)
df_res = wr.athena.read_sql_query(sql=query, database="lsk2-target")
print(df_res)
print(f'DataScannedInBytes: {df_res.query_metadata["Statistics"]["DataScannedInBytes"]}')
print(f'TotalExecutionTimeInMillis: {df_res.query_metadata["Statistics"]["TotalExecutionTimeInMillis"]}')
print(f'QueryQueueTimeInMillis: {df_res.query_metadata["Statistics"]["QueryQueueTimeInMillis"]}')
print(f'QueryPlanningTimeInMillis: {df_res.query_metadata["Statistics"]["QueryPlanningTimeInMillis"]}')
print(f'ServiceProcessingTimeInMillis: {df_res.query_metadata["Statistics"]["ServiceProcessingTimeInMillis"]}')
I retrieve the list of databases (including lsk2-target) without any problem, but read_sql_query fails and I receive:
WaiterError: Waiter BucketExists failed: Max attempts exceeded
Please, can you help me understand where I am going wrong?
Thanks!
I fixed a similar issue; the resolution is to ensure that the IAM role used has the necessary Athena permissions to create tables, since this API defaults to running with ctas_approach=True.
Ref. documentation
Also, once that is resolved, ensure that the IAM role also has access to delete the files created in S3.
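If granting table-creation rights isn't possible in your environment, one workaround (a sketch only; it trades read speed for fewer required permissions) is to turn the CTAS approach off:
import awswrangler as wr

# ctas_approach=False avoids creating a temporary table in Glue/Athena,
# so the role only needs permission to run the query and read the results,
# at the cost of slower reads for large result sets.
df_res = wr.athena.read_sql_query(
    sql="select count(*) from staging_dim_channel",
    database="lsk2-target",
    ctas_approach=False,
)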
Do you have the right IAM permissions to execute a query and read the results? I bet it is an IAM issue.
Also, I guess you have set up your credentials:
[default]
aws_access_key_id = your_access_key_id
aws_secret_access_key = your_secret_access_key
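If relying on a credentials file isn't convenient (for example inside a Glue Python shell job), awswrangler also accepts an explicit boto3 session. A small sketch; the region and query come from the question, everything else uses defaults:
import boto3
import awswrangler as wr

# Build an explicit session instead of relying on ~/.aws/credentials
session = boto3.Session(region_name="eu-west-1")

df = wr.athena.read_sql_query(
    sql="select count(*) from staging_dim_channel",
    database="lsk2-target",
    boto3_session=session,
)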

How to configure Spark / Glue to avoid creation of empty $_folder_$ objects after successful Glue job execution

I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty "$folder$" objects that Spark generates remain in S3. They do not look nice in the hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide/remove these folders after successful completion of the job?
[Screenshot of the S3 console showing the leftover $folder$ objects]
OK, finally after a few days of testing I found the solution. Before pasting the code, let me summarize what I have found.
Those $folder$ objects are created via Hadoop. Apache Hadoop creates these files when it creates a folder in an S3 bucket. Source 1
They are actually directory markers, written as path + /. Source 2
To change the behavior, you need to change the Hadoop S3 write configuration in the Spark context. Read this and this and this.
Read about S3, S3a and S3n here and here.
Thanks to @stevel's comment here.
Now the solution is to set the following configuration on the Spark context's Hadoop configuration.
sc = SparkContext()
hadoop_conf = sc._jsc.hadoopConfiguration()
# S3A writes plain directory markers instead of the legacy <name>_$folder$ keys
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
To avoid the creation of _SUCCESS marker files, you need to set the following configuration as well:
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Make sure you use the s3:// URI when writing to the S3 bucket, e.g.:
myDF.write.mode("overwrite").parquet('s3://XXX/YY', partitionBy=['DDD'])
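For reference, a minimal end-to-end sketch of a Glue job with these settings; the database, table, bucket, and partition column names are placeholders, and the deduplication step just mirrors the job described in the question:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Route s3:// URIs through S3A and suppress _SUCCESS marker files
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

# Read the crawler table, drop duplicates, and write the result back to S3
df = (
    glueContext.create_dynamic_frame.from_catalog(
        database="my_database", table_name="my_table"
    )
    .toDF()
    .dropDuplicates()
)
df.write.mode("overwrite").parquet("s3://my-bucket/output/", partitionBy=["DDD"])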

AWS Data Pipeline Dynamo to Redshift

I have an issue:
I need to migrate data from DynamoDB to Redshift. The problem is that I receive the following exception:
ERROR: Unsupported Data Type: Current Version only supports Strings and Numbers
Detail:
-----------------------------------------------
error:    Unsupported Data Type: Current Version only supports Strings and Numbers
code:     9005
context:  Table Name = user_session
query:    446027
location: copy_dynamodb_scanner.cpp:199
process:  query0_124_446027 [pid=25424]
-----------------------------------------------
In my DynamoDB items I have a boolean field. How can I convert the field from Boolean to INT (for example)?
I tried to map it as a VARCHAR(5), but that didn't help (there is also a ticket on GitHub about this with no response).
I would appreciate any suggestions.
As a solution, I migrated the data from DynamoDB to S3 first and then to Redshift.
I used the built-in Export to S3 feature in DynamoDB. It saves all the data as *.json files into S3 really fast (but not sorted).
After that I used an ETL script (a Glue job with a custom PySpark script) to process the data and save it into Redshift; a rough sketch is shown below.
The schema can also be defined with a Glue crawler, but you still need to validate its result, as sometimes it is not correct.
Using crawlers to scan DynamoDB directly can overload your tables if you are not using on-demand read/write capacity, so the better way is to work with the data from S3.
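A rough sketch of the S3-to-Redshift step; the export path, the user_id/is_active attribute names, the redshift-conn connection, and the target table/database are all placeholder assumptions:
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import col

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# A DynamoDB "Export to S3" in DynamoDB JSON format produces gzipped JSON lines
# shaped like {"Item": {"user_id": {"S": "..."}, "is_active": {"BOOL": true}}}
raw = spark.read.json("s3://my-bucket/dynamodb-export/data/")

# Flatten the attributes that are needed and cast the boolean to an int
flat = raw.select(
    col("Item.user_id.S").alias("user_id"),
    col("Item.is_active.BOOL").cast("int").alias("is_active"),
)

# Write to Redshift through a Glue connection; COPY staging files go to the temp dir
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(flat, glueContext, "user_session"),
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "user_session", "database": "my_redshift_db"},
    redshift_tmp_dir="s3://my-bucket/redshift-temp/",
)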

Get the BigQuery Table creator and Google Storage Bucket Creator Details

I am trying to identify the users who created tables in BigQuery.
Is there any command line or API that would provide this information? I know that audit logs do provide it, but I was looking for a command line I could wrap in a shell script and run against all the tables at one time. The same goes for Google Storage buckets. I did try
gsutil iam get gs://my-bkt and looked for the "role": "roles/storage.admin" entry, but I do not find the admin role on all buckets. Any help?
This is a use case for audit logs. BigQuery tables don't report metadata about the original resource creator, so scanning via tables.list or inspecting the ACLs doesn't really expose who created the resource, only who currently has access.
What's the use case? You could certainly export the audit logs back into BigQuery and query for table creation events going forward, but that's not exactly the same.
You can find it out using Audit Logs. You can access them either via the Console/Logs Explorer or using the gcloud tool from the CLI.
The log filter that you're interested in is this one:
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
If you want to run it from the command line, you'd do something like this:
gcloud logging read \
'
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
'\
--limit 10
You can then post-process the output to find out who created the table; look for the principalEmail field, as in the sketch below.
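For example, a small post-processing sketch (assuming gcloud is installed and authenticated; the filter is the same one shown above):
import json
import subprocess

LOG_FILTER = """
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
"""

# Ask gcloud for JSON output so the audit entries are easy to parse
result = subprocess.run(
    ["gcloud", "logging", "read", LOG_FILTER, "--limit", "10", "--format", "json"],
    capture_output=True, text=True, check=True,
)

for entry in json.loads(result.stdout):
    # The table creator is recorded as the authenticated principal of the entry
    print(entry["protoPayload"]["authenticationInfo"]["principalEmail"])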

Dataflow needs bigquery.datasets.get permission for the underlying table in authorized view

In a dataflow pipeline, I'm reading from a BigQuery Authorized View:
beam.io.Read(beam.io.BigQuerySource(query = "SELECT col1 FROM proj2.dataset2.auth_view1", use_standard_sql=True))
This is the error which I'm getting:
Error:
Message: Access Denied: Dataset proj1:dataset1: The user xxxxxx-compute#developer.gserviceaccount.com does not have bigquery.datasets.get permission for dataset proj1:dataset1.
proj1:dataset1 has the base table for the view auth_view1.
According to this issue in DataflowJavaSDK, dataflow seems to be directly executing some metadata query against the underlying table.
Is there a fix available for this issue in Apache Beam SDK?
Explicitly setting the query location is also a solution in the Apache Beam Java SDK, using the withQueryLocation option of BigQueryIO.
It looks like setting the query location is not yet possible in the Python SDK.