I am trying to drop a few tables in Athena, and I cannot run multiple DROP queries at the same time. Is there a way to do it?
Thanks!
You are correct. It is not possible to run multiple queries in one request.
An alternative is to create the tables in a dedicated database. Dropping that database will then cause all of its tables to be deleted.
For example:
CREATE DATABASE foo;
CREATE EXTERNAL TABLE bar1 ...;
CREATE EXTERNAL TABLE bar2 ...;
DROP DATABASE foo CASCADE;
The DROP DATABASE command with CASCADE will delete the bar1 and bar2 tables along with the database.
You can use the AWS CLI's batch-delete-table to delete multiple tables at once:
aws glue batch-delete-table \
--database-name <database-name> \
--tables-to-delete "<table1-name>" "<table2-name>" "<table3-name>" ...
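If you are already working in Python, a rough boto3 equivalent of that CLI call might look like the sketch below (database and table names are placeholders):
import boto3

glue = boto3.client('glue')

# Delete several tables from one database in a single API call.
# Any tables that could not be deleted are reported in the 'Errors' list.
response = glue.batch_delete_table(
    DatabaseName='<database-name>',
    TablesToDelete=['<table1-name>', '<table2-name>', '<table3-name>']
)
print(response.get('Errors', []))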
You can now use the AWS Glue console to do this. The prerequisite is that you must upgrade to the AWS Glue Data Catalog.
If you upgrade to the AWS Glue Data Catalog from Athena, the metadata for tables created in Athena is visible in Glue, and you can use the AWS Glue UI to select multiple tables and delete them at once.
FAQ on Upgrading data catalog: https://docs.aws.amazon.com/athena/latest/ug/glue-faq.html
You could write a shell script to do this for you:
for table in products customers stores; do
  aws athena start-query-execution --query-string "drop table $table" --result-configuration OutputLocation=s3://my-output-result-bucket
done
Use AWS Glue's Python shell and invoke this function:
import boto3

def run_query(query, database, s3_output):
    client = boto3.client('athena')
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': database
        },
        ResultConfiguration={
            'OutputLocation': s3_output,
        }
    )
    print('Execution ID: ' + response['QueryExecutionId'])
    return response
Athena configuration:
s3_input = 's3://athena-how-to/data'
s3_output = 's3://athena-how-to/results/'
database = 'your_database'
table = 'tableToDelete'

query_1 = "drop table %s.%s;" % (database, table)
queries = [query_1]
# queries = [create_database, create_table, query_1, query_2]

for q in queries:
    print("Executing query: %s" % (q))
    res = run_query(q, database, s3_output)
I would second what #Prateek said. Please provide an example of your code. Also, please tag your post with the language/shell that you're using to interact with AWS.
Currently, you cannot run multiple queries in one request. However, you can make multiple requests simultaneously; as of 2018-06-15, the limit is 20 concurrent requests. You could do this through an API call or the console. In addition, you could use the CLI or the SDK (if available for your language of choice).
For example, in Python you could use the multiprocessing or threading modules to manage concurrent requests. Just remember to consider thread/process safety when creating resources/clients.
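As a minimal sketch (assuming a database named my_database and an output bucket of your own), several DROP TABLE statements could be started concurrently like this:
import boto3
from concurrent.futures import ThreadPoolExecutor

# A single low-level client can be shared across threads;
# just avoid creating a new client inside each worker.
athena = boto3.client('athena')

def drop_table(table):
    # Each call starts an independent query execution in Athena.
    return athena.start_query_execution(
        QueryString='DROP TABLE IF EXISTS ' + table,
        QueryExecutionContext={'Database': 'my_database'},
        ResultConfiguration={'OutputLocation': 's3://my-output-result-bucket/'}
    )

with ThreadPoolExecutor(max_workers=5) as pool:
    for response in pool.map(drop_table, ['products', 'customers', 'stores']):
        print(response['QueryExecutionId'])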
Service Limits:
Athena Service Limits
AWS Service Limits for which you can request a rate increase
I could not get Carl's method of executing DROP TABLE statements to work, even though they did work in the console.
So I thought it was worth posting the approach that worked for me, which uses a combination of the AWS SDK for pandas (awswrangler) and the CLI:
import awswrangler as wr
import boto3
import os

session = boto3.Session(
    aws_access_key_id='XXXXXX',
    aws_secret_access_key='XXXXXX',
    aws_session_token='XXXXXX'
)

database_name = 'athena_db'
athena_s3_output = 's3://athena_s3_bucket/athena_queries/'

# list every table in the database via information_schema
df = wr.athena.read_sql_query(
    sql="SELECT DISTINCT table_name FROM information_schema.tables WHERE table_schema = '" + database_name + "'",
    database=database_name,
    s3_output=athena_s3_output,
    boto3_session=session
)
print(df)

# ensure that your aws profile is valid for CLI commands
# i.e. your credentials are set in C:\Users\xxxxxxxx\.aws\credentials
for table in df['table_name']:
    cli_string = 'aws glue delete-table --database-name ' + database_name + ' --name ' + table
    print(cli_string)
    os.system(cli_string)
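As a side note, awswrangler can also drop the tables directly instead of shelling out to the CLI; assuming the same df, database_name and session from above, a sketch like this should work:
# drop each table through the Glue catalog via awswrangler
for table in df['table_name']:
    # returns True if the table existed and was deleted
    deleted = wr.catalog.delete_table_if_exists(
        database=database_name,
        table=table,
        boto3_session=session
    )
    print(table, 'deleted:', deleted)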
Related
I am currently trying to use an Athena view in AWS Glue. Below is the code:
DataSource0 = (
    glueContext.read.format("jdbc")
    .option("AwsCredentialsProviderClass", "com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option("url", "jdbc:awsathena://athena.us-east-1.amazonaws.com:443")
    .option("dbtable", 'AwsDataCatalog."silver".vw_xyz')
    .option("S3OutputLocation", "s3://datadrop-test/xyz/")
    .load()
)
DataSource0.printSchema()
Driver version: AthenaJDBC42_2.0.25.1002
The generated output files are tiny (size < 1 KB).
The files only contain the columns. It seems like Glue is unable to get records from the view; the same view returns multiple rows in Athena. Any help will be highly appreciated.
As described in the AWS Athena documentation:
https://docs.aws.amazon.com/athena/latest/ug/create-database.html
We can specify DBPROPERTIES, an S3 location, and a comment while creating an Athena database:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT 'database_comment']
[LOCATION 'S3_loc']
[WITH DBPROPERTIES ('property_name' = 'property_value') [, ...]]
For example:
CREATE DATABASE IF NOT EXISTS clickstreams
COMMENT 'Site Foo clickstream data aggregates'
LOCATION 's3://myS3location/clickstreams/'
WITH DBPROPERTIES ('creator'='Jane D.', 'Dept.'='Marketing analytics');
But once the properties are set, how can I fetch them back using a query?
Let's say I want to fetch the creator name from the above example.
You can get these using the Glue Data Catalog GetDatabase API call.
Databases and tables in Athena are stored in the Glue Data Catalog. When you run DDL statements in Athena, it translates them into Glue API calls. Not all operations you can do in Glue are available in Athena, for historical reasons.
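For example, a short boto3 sketch along these lines (using the clickstreams database from the example above) would return the DBPROPERTIES as the database's Parameters:
import boto3

glue = boto3.client('glue')

# DBPROPERTIES set in Athena appear as 'Parameters' on the Glue database;
# COMMENT maps to 'Description' and LOCATION to 'LocationUri'.
db = glue.get_database(Name='clickstreams')['Database']
print(db.get('Description'))
print(db.get('LocationUri'))
print(db.get('Parameters', {}).get('creator'))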
I was able to fetch the AWS Athena database properties in JSON format using the following Glue Data Catalog code.
package com.amazonaws.samples;

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClient;
import com.amazonaws.services.glue.model.GetDatabaseRequest;
import com.amazonaws.services.glue.model.GetDatabaseResult;

public class Glue {
    public static void main(String[] args) {
        BasicAWSCredentials awsCreds = new BasicAWSCredentials("*api*", "*key*");
        AWSGlue glue = AWSGlueClient.builder().withRegion("*bucket_region*")
                .withCredentials(new AWSStaticCredentialsProvider(awsCreds)).build();

        GetDatabaseRequest req = new GetDatabaseRequest();
        req.setName("*database_name*");

        GetDatabaseResult result = glue.getDatabase(req);
        System.out.println(result);
    }
}
Also, the following permissions are required for the user:
AWSGlueServiceRole
AmazonS3FullAccess
I'm trying to export every item in a DynamoDB table to S3. I found this tutorial https://aws.amazon.com/blogs/big-data/how-to-export-an-amazon-dynamodb-table-to-amazon-s3-using-aws-step-functions-and-aws-glue/ and followed the example. Basically,
table = glueContext.create_dynamic_frame.from_options(
    "dynamodb",
    connection_options={
        "dynamodb.input.tableName": table_name,
        "dynamodb.throughput.read.percent": read_percentage,
        "dynamodb.splits": splits
    }
)

glueContext.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    connection_options={
        "path": output_path
    },
    format=output_format,
    transformation_ctx="datasink"
)
I tested it on a tiny table in a nonprod environment and it works fine. But my DynamoDB table in production is over 400 GB, with 200 million items. I suppose it'll take a while, but I have no idea how long to expect: hours, or even days? Is there any way to show progress? For example, showing a count of how many items have been processed. I don't want to blindly start this job and wait.
One way would be to enable continuous logging for your AWS Glue Job to monitor its progress.
Another way would be to trigger a Lambda function whenever a file has been stored in S3, using Amazon S3 event notifications.
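For instance, continuous logging can be switched on when starting the job through the Glue special job parameter, roughly like this (the job name here is a placeholder):
import boto3

glue = boto3.client('glue')

# '--enable-continuous-cloudwatch-log' streams driver and executor logs
# to CloudWatch while the job is still running.
run = glue.start_job_run(
    JobName='dynamodb-export-job',
    Arguments={'--enable-continuous-cloudwatch-log': 'true'}
)
print(run['JobRunId'])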
Did you try the custom waiter class from the AWS docs?
For instance, a custom waiter for a Glue job should look something like this:
class JobCompleteWaiter(CustomWaiter):
    def __init__(self, client):
        super().__init__(
            "JobComplete",
            "get_job_run",
            "JobRun.JobRunState",
            {"SUCCEEDED": WaitState.SUCCEEDED, "FAILED": WaitState.FAILED},
            client,
            max_tries=100,
        )

    def wait(self, JobName, RunId):
        self._wait(JobName=JobName, RunId=RunId)
According to the boto3 docs, you should expect one of the following possible states from a job run: 'STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT'.
So I chose to check whether it was SUCCEEDED or FAILED.
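Assuming the CustomWaiter and WaitState base classes from the AWS docs code examples are in scope, using the waiter would look roughly like this (job name is a placeholder):
import boto3

glue = boto3.client('glue')

# Start the export job and block until it reaches SUCCEEDED or FAILED.
run = glue.start_job_run(JobName='dynamodb-export-job')
waiter = JobCompleteWaiter(glue)
waiter.wait(JobName='dynamodb-export-job', RunId=run['JobRunId'])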
My Requirement
I want to create a CloudWatch metric from Athena query results.
Example
I want to create a metric like user_count for each day.
In Athena, I will write an SQL query like this
select date,count(distinct user) as count from users_table group by 1
In the Athena editor I can see the result, but I want to see these results as a metric in CloudWatch:
CloudWatch-Metric-Name ==> user_count
Dimensions ==> Date,count
If I have this CloudWatch metric and dimensions, I can easily create a monitoring dashboard and send alerts.
Can anyone suggest a way to do this?
You can use CloudWatch custom widgets, see "Run Amazon Athena queries" in Samples.
It's somewhat involved, but you can use a Lambda for this. In a nutshell:
Setup your query in Athena and make sure it works using the Athena console.
Create a Lambda that:
Runs your Athena query
Pulls the query results from S3
Parses the query results
Sends the query results to CloudWatch as a metric
Use EventBridge to run your Lambda on a recurring basis
Here's an example Lambda function in Python that does step #2. Note that the Lambda function will need IAM permissions to run queries in Athena, read the results from S3, and then put a metric into CloudWatch.
import time
import boto3
query = 'select count(*) from mytable'
DATABASE = 'default'
bucket='BUCKET_NAME'
path='yourpath'
def lambda_handler(event, context):
    # Run query in Athena
    client = boto3.client('athena')
    output = "s3://{}/{}".format(bucket, path)

    # Execution
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': DATABASE
        },
        ResultConfiguration={
            'OutputLocation': output,
        }
    )

    # S3 file name uses the QueryExecutionId, so
    # grab it here so we can pull the S3 file.
    qeid = response["QueryExecutionId"]

    # Occasionally Athena hasn't written the file
    # before the lambda tries to pull it out of S3, so pause a few seconds.
    # Note: You are charged for the time the lambda is running.
    # A more elegant but more complicated solution would try to get the
    # file first and then sleep.
    time.sleep(3)

    # Get query result from S3.
    s3 = boto3.client('s3')
    objectkey = path + "/" + qeid + ".csv"

    # load object as file
    file_content = s3.get_object(
        Bucket=bucket,
        Key=objectkey)["Body"].read()

    # split file on line breaks
    lines = file_content.decode().splitlines()

    # get the second line in the file
    count = lines[1]

    # remove double quotes
    count = count.replace("\"", "")

    # convert string to int since CloudWatch wants a numeric value
    count = int(count)

    # post query results as a CloudWatch metric
    cloudwatch = boto3.client('cloudwatch')
    response = cloudwatch.put_metric_data(
        MetricData=[
            {
                'MetricName': 'MyMetric',
                'Dimensions': [
                    {
                        'Name': 'DIM1',
                        'Value': 'dim1'
                    },
                ],
                'Unit': 'None',
                'Value': count
            },
        ],
        Namespace='MyMetricNS'
    )
    return response
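And for step #3, a sketch of scheduling the Lambda with EventBridge (the rule name, schedule, and function ARN are placeholders; the Lambda also needs a resource policy allowing events.amazonaws.com to invoke it):
import boto3

events = boto3.client('events')

# Run the metric Lambda once an hour.
events.put_rule(
    Name='athena-metric-schedule',
    ScheduleExpression='rate(1 hour)'
)
events.put_targets(
    Rule='athena-metric-schedule',
    Targets=[{
        'Id': 'athena-metric-lambda',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:athena-metric'
    }]
)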
I have developed different Athena Workgroups for different teams so that I can separate their queries and their query results. The users would like to query the tables available to them from their notebook instances (JupyterLab). I am having difficulty finding code which successfully covers the requirement of querying a table from the user's specific workgroup. I have only found code that will query the table from the primary workgroup.
The code I have currently used is added below.
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
region_name='<YOUR REGION, for example, us-west-2>')
df = pd.read_sql("SELECT * FROM <DATABASE-NAME>.<YOUR TABLE NAME> limit 8;", conn)
df
This code does not work, as the users only have access to perform queries from their specific workgroups and hence get errors when this code is run. It also does not cover the requirement of separating the users' queries into user-specific workgroups.
Any suggestions on how I can alter the code so that I can run the queries within a specific workgroup from the notebook instance?
The pyathena documentation is not very extensive, but after looking into the source code we can see that connect simply creates an instance of the Connection class.
def connect(*args, **kwargs):
    from pyathena.connection import Connection
    return Connection(*args, **kwargs)
Now, after looking at the signature of Connection.__init__ on GitHub, we can see the parameter work_group=None, which is named the same way as one of the parameters of start_query_execution from the official AWS Python API boto3. Here is what their documentation says about it:
WorkGroup (string) -- The name of the workgroup in which the query is being started.
After following the usages and imports in Connection, we end up with the BaseCursor class, which under the hood makes a call to start_query_execution while unpacking a dictionary of parameters assembled by the BaseCursor._build_start_query_execution_request method. That is exactly where we can see the familiar syntax for submitting queries to AWS Athena, in particular the following part:
if self._work_group or work_group:
    request.update({
        'WorkGroup': work_group if work_group else self._work_group
    })
So this should do the trick for your case:
import pandas as pd
from pyathena import connect
conn = connect(
    s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
    region_name='<YOUR REGION, for example, us-west-2>',
    work_group='<USER SPECIFIC WORKGROUP>'
)
df = pd.read_sql("SELECT * FROM <DATABASE-NAME>.<YOUR TABLE NAME> limit 8;", conn)
I implemented this and it worked for me.
!pip install pyathena
from pyathena import connect
from pyathena.pandas.util import as_pandas
import boto3

query = """
Select * from "s3-prod-db"."CustomerTransaction" ct where date(partitiondate) >= date('2022-09-30') limit 10
"""

cursor = connect(s3_staging_dir='s3://s3-temp-analytics-prod2/',
                 region_name=boto3.session.Session().region_name,
                 work_group='data-scientist').cursor()

cursor.execute(query)
print(cursor.state)
print(cursor.state_change_reason)
print(cursor.completion_date_time)
print(cursor.submission_date_time)
print(cursor.data_scanned_in_bytes)
print(cursor.output_location)

df = as_pandas(cursor)
print(df)
If we don't pass the work_group parameter, pyathena will use "primary" as the default workgroup.
If we pass an s3_staging_dir such as 's3://s3-temp-analytics-prod2/' that points to an S3 bucket which does not exist, it will create that bucket.
But if the role that you are running the script under does not have the privilege to create buckets, it will throw an exception.