I have developed different Athena Workgroups for different teams so that I can separate their queries and their query results. The users would like to query the tables available to them from their notebook instances (JupyterLab). I am having difficulty finding code which successfully covers the requirement of querying a table from the user's specific workgroup. I have only found code that will query the table from the primary workgroup.
The code I have currently used is added below.
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
               region_name='<YOUR REGION, for example, us-west-2>')
df = pd.read_sql("SELECT * FROM <DATABASE-NAME>.<YOUR TABLE NAME> limit 8;", conn)
df
This code does not work because the users only have permission to run queries in their own workgroups, so they get errors when it runs. It also does not meet the requirement of keeping each user's queries in a user-specific workgroup.
Any suggestions on how I can alter the code so that the queries run within a specific workgroup from the notebook instance?
The pyathena documentation is not very extensive, but looking at the source code we can see that connect simply creates an instance of the Connection class.
def connect(*args, **kwargs):
    from pyathena.connection import Connection
    return Connection(*args, **kwargs)
Looking at the signature of Connection.__init__ on GitHub, we can see the parameter work_group=None, which is named the same way as one of the parameters of start_query_execution in boto3, the official AWS SDK for Python. Here is what its documentation says about it:
WorkGroup (string) -- The name of the workgroup in which the query is being started.
Following the usages and imports in Connection, we end up at the BaseCursor class, which under the hood calls start_query_execution, unpacking a dictionary of parameters assembled by the BaseCursor._build_start_query_execution_request method. That is exactly where we can see the familiar syntax for submitting queries to AWS Athena, in particular the following part:
if self._work_group or work_group:
    request.update({
        'WorkGroup': work_group if work_group else self._work_group
    })
So this should do the trick for your case:
import pandas as pd
from pyathena import connect
conn = connect(
    s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
    region_name='<YOUR REGION, for example, us-west-2>',
    work_group='<USER SPECIFIC WORKGROUP>'
)
df = pd.read_sql("SELECT * FROM <DATABASE-NAME>.<YOUR TABLE NAME> limit 8;", conn)
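The precedence rule quoted from BaseCursor above can be reproduced in isolation to see which workgroup ends up in the request. This is only a minimal sketch, not pyathena's actual code: the function name build_request and its arguments are made up for illustration, and only the QueryString, ResultConfiguration and WorkGroup keys come from boto3's start_query_execution.

```python
from typing import Optional


def build_request(query: str, s3_staging_dir: str,
                  connection_work_group: Optional[str] = None,
                  work_group: Optional[str] = None) -> dict:
    """Assemble keyword arguments for Athena's start_query_execution.

    Hypothetical helper mirroring the precedence shown above: a per-query
    work_group wins over the one given when the connection was created.
    """
    request = {
        "QueryString": query,
        "ResultConfiguration": {"OutputLocation": s3_staging_dir},
    }
    if connection_work_group or work_group:
        request["WorkGroup"] = work_group if work_group else connection_work_group
    return request
```

When neither value is set, the WorkGroup key is left out of the request entirely, which is why the original code ended up in the primary workgroup.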
I implemented this and it worked for me.
!pip install pyathena
from pyathena import connect
from pyathena.pandas.util import as_pandas
import boto3
query = """
Select * from "s3-prod-db"."CustomerTransaction" ct where date(partitiondate) >= date('2022-09-30') limit 10
"""
query
cursor = connect(s3_staging_dir='s3://s3-temp-analytics-prod2/',
                 region_name=boto3.session.Session().region_name,
                 work_group='data-scientist').cursor()
# execute() returns the cursor itself, not a DataFrame; as_pandas() below does the conversion
cursor.execute(query)
print(cursor.state)
print(cursor.state_change_reason)
print(cursor.completion_date_time)
print(cursor.submission_date_time)
print(cursor.data_scanned_in_bytes)
print(cursor.output_location)
df = as_pandas(cursor)
print(df)
If we don't pass the work_group parameter, "primary" is used as the default workgroup.
If we pass an S3 bucket that does not exist as s3_staging_dir (for example 's3://s3-temp-analytics-prod2/'), it will create this bucket.
But if the role that runs the script does not have permission to create buckets, it will throw an exception.
We've set up AWS Secrets Manager as a secrets backend for Airflow (AWS MWAA) as described in their documentation. Unfortunately, it is not explained anywhere where the secrets are to be found and how they are then to be used. When I supply conn_id to a task in a DAG, we can see two errors in the task logs, ValueError: Invalid IPv6 URL and airflow.exceptions.AirflowNotFoundException: The conn_id redshift_conn isn't defined. What's even more surprising is that retrieving variables stored the same way with Variable.get('my_variable_id') works just fine.
The question is: Am I wrongly expecting that the conn_id can be directly passed to operators as SomeOperator(conn_id='conn-id-in-secretsmanager')? Must I retrieve the connection manually each time I want to use it? I don't want to run something like read_from_aws_sm_fn in the code below every time beforehand...
Btw, neither the connection nor the variable show up in the Airflow UI.
Having stored a secret named airflow/connections/redshift_conn (and on the side one airflow/variables/my_variable_id), I expect the connection to be found and used when constructing RedshiftSQLOperator(task_id='mytask', redshift_conn_id='redshift_conn', sql='SELECT 1'). But this results in the above error.
I am able to retrieve the redshift connection manually in a DAG with a separate task, but I think that is not how SecretsManager is supposed to be used in this case.
The example DAG is below:
from airflow import DAG, settings, secrets
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
from airflow.models.baseoperator import chain
from airflow.models import Connection, Variable
from airflow.providers.amazon.aws.operators.redshift import RedshiftSQLOperator
from datetime import timedelta

sm_secret_id_name = f'airflow/connections/redshift_conn'

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(1),
    'retries': 1,
}

def read_from_aws_sm_fn(**kwargs):  # from AWS example code
    ### set up Secrets Manager
    hook = AwsBaseHook(client_type='secretsmanager')
    client = hook.get_client_type('secretsmanager')
    response = client.get_secret_value(SecretId=sm_secret_id_name)
    myConnSecretString = response["SecretString"]
    print(myConnSecretString[:15])
    return myConnSecretString

def get_variable(**kwargs):
    my_var_value = Variable.get('my_test_variable')
    print('variable:')
    print(my_var_value)
    return my_var_value

with DAG(
    dag_id=f'redshift_test_dag',
    default_args=default_args,
    dagrun_timeout=timedelta(minutes=10),
    start_date=days_ago(1),
    schedule_interval=None,
    tags=['example']
) as dag:
    read_from_aws_sm_task = PythonOperator(
        task_id="read_from_aws_sm",
        python_callable=read_from_aws_sm_fn,
        provide_context=True
    )  # works fine
    query_redshift = RedshiftSQLOperator(
        task_id='query_redshift',
        redshift_conn_id='redshift_conn',
        sql='SELECT 1;'
    )  # results in above errors :-(
    try_to_get_variable_value = PythonOperator(
        task_id='get_variable',
        python_callable=get_variable,
        provide_context=True
    )  # works fine!
The question is: Am I wrongly expecting that the conn_id can be directly passed to operators as SomeOperator(conn_id='conn-id-in-secretsmanager')? Must I retrieve the connection manually each time I want to use it? I don't want to run something like read_from_aws_sm_fn in the code below every time beforehand...
With Secrets Manager as a backend, you don't need to change the way you use connections or variables. They work the same way: when looking up a connection or variable, Airflow follows a search path (the configured secrets backend first, then environment variables, then the metadata database).
Btw, neither the connection nor the variable show up in the Airflow UI.
The connection/variable will not show up in the UI.
ValueError: Invalid IPv6 URL and airflow.exceptions.AirflowNotFoundException: The conn_id redshift_conn isn't defined
The first error is related to the secret itself, and the second error is due to the connection not existing in the Airflow UI.
There are two formats for storing connections in Secrets Manager (depending on the version of the AWS provider installed); the Invalid IPv6 URL error could mean that the connection is not being parsed correctly. Here is a link to the provider docs.
The first step is defining the prefixes for connections and variables; if they are not defined, your secrets backend will not check for the secret:
secrets.backend_kwargs : {"connections_prefix" : "airflow/connections", "variables_prefix" : "airflow/variables"}
Then for the secrets/connections, you should store them in those prefixes, respecting the required fields for the connection.
For example, for the connection my_postgress_conn:
{
    "conn_type": "postgresql",
    "login": "user",
    "password": "pass",
    "host": "host",
    "extra": "{\"key\": \"val\"}"
}
You should store it under the path airflow/connections/my_postgress_conn, with the JSON dict as a string.
And for the variables, you just need to store them in airflow/variables/<var_name>.
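The naming convention described above (prefix, separator, then the connection or variable id) can be sketched as follows. The helper names secret_name and connection_secret are hypothetical, the "/" separator is assumed to be the backend's default, and storing extra as a JSON-encoded string follows the example above rather than any one provider version.

```python
import json


def secret_name(prefix: str, key: str, sep: str = "/") -> str:
    """Name under which the backend is expected to look up a connection or variable."""
    return f"{prefix}{sep}{key}"


def connection_secret(conn_type, login, password, host, extra=None) -> str:
    """Serialize connection fields to the JSON string stored as the secret value."""
    fields = {
        "conn_type": conn_type,
        "login": login,
        "password": password,
        "host": host,
    }
    if extra is not None:
        # Assumption: extra is itself stored as a JSON-encoded string.
        fields["extra"] = json.dumps(extra)
    return json.dumps(fields)
```

For the question's setup, secret_name("airflow/connections", "redshift_conn") gives the secret name that redshift_conn_id='redshift_conn' should resolve to.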
I have a serverless Aurora DB on AWS RDS (with data api enabled) that I would like to query using the database resource and secret ARNs. The code to do this is shown below.
import boto3

rds_data = boto3.client('rds-data', region_name='us-west-1')
response = rds_data.execute_statement(
    resourceArn='DATABASE_ARN',
    secretArn='DATABASE_SECRET',
    database='db',
    sql=query)
rds_data.close()
The response contains the returned records, but the column names are not shown and the format is interesting (see below).
'records': [[{'stringValue': 'name'}, {'stringValue': '5'}, {'stringValue': '3'}, {'stringValue': '1'}, {'stringValue': '2'}, {'stringValue': '4'}]]
I want to query the aurora database and then create a pandas dataframe. Is there a better way to do this? I looked at just using psycopg2 and creating a database connection, but I believe you need a user name and password. I would like to not go that route and instead authenticate using IAM.
The answer was actually pretty simple. There is a parameter called "formatRecordsAs" in the "execute_statement" function. If you set this to 'JSON', you get the records back with column names, in a more convenient form.
response = rds_data.execute_statement(
    resourceArn='DATABASE_ARN',
    secretArn='DATABASE_SECRET',
    database='db',
    formatRecordsAs='JSON',
    sql=query)
This still gives you back a string (JSON), so you then need to parse that string into a list of dictionaries. json.loads handles JSON literals such as null and true, which ast.literal_eval would reject.
import json
list_of_dicts = json.loads(response['formattedRecords'])
That can then be converted to a pandas DataFrame:
df = pd.DataFrame(list_of_dicts)
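If formatRecordsAs is not an option, the default records layout from the question can also be flattened by hand. The sketch below assumes the statement was executed with includeResultMetadata=True, so that the response carries a columnMetadata list next to records; the helper names are made up for illustration.

```python
def unwrap_field(field: dict):
    """Turn one typed Data API field, e.g. {'stringValue': 'x'}, into a plain value."""
    if field.get("isNull"):
        return None
    # Each non-null field dict holds exactly one typed key (stringValue, longValue, ...).
    return next(iter(field.values()))


def records_to_rows(response: dict) -> list:
    """Combine columnMetadata names with records into a list of dictionaries."""
    names = [column["name"] for column in response["columnMetadata"]]
    return [
        dict(zip(names, (unwrap_field(field) for field in record)))
        for record in response["records"]
    ]
```

The resulting list of dictionaries can be passed to pd.DataFrame in the same way as the parsed formattedRecords.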
I want to create a script, and perhaps run it in a cron job every 24 hours, which will list all access keys older than 60 days.
I also want to put the keys older than 60 days into an array so I can iterate over it and perform other operations.
I'm looking at Managing access keys for IAM users - AWS Identity and Access Management, and it has an aws iam get-access-key-last-used command, but that's not what I want. It's the closest thing I can find, though.
What I want is to get the keys where current date - creation date > 60 days.
I'm imagining my script would look something like this:
# some of this is pseudocode just to
# communicate what I'm envisioning.
# I don't actually know what to put
# here yet; need assistance.
myCommand="aws cli get key where age > 60"
staleKeys=( $( $myCommand ) )
for key in "${staleKeys[@]}"
do
    # log "${key}"
    # run another aws cli command with ${key} as a value
done
Is this possible from the AWS CLI?
I recommend Getting credential reports for your AWS account - AWS Identity and Access Management. This is an automated process that can generate a CSV file listing lots of information about credentials, including:
The date and time when the user's access key was created or last changed
The date and time when the user's access key was most recently used to sign an AWS API request
The report can be obtained by calling generate-credential-report, waiting a bit, then calling get-credential-report. The response needs to be base64 decoded. The result looks like this:
user,arn,user_creation_time,password_enabled,password_last_used,password_last_changed,password_next_rotation,mfa_active,access_key_1_active,access_key_1_last_rotated,access_key_1_last_used_date,access_key_1_last_used_region,access_key_1_last_used_service,access_key_2_active,access_key_2_last_rotated,access_key_2_last_used_date,access_key_2_last_used_region,access_key_2_last_used_service,cert_1_active,cert_1_last_rotated,cert_2_active,cert_2_last_rotated
user1,arn:aws:iam::111111111111:user/user1,2019-04-08T05:57:22+00:00,true,2020-05-20T10:55:03+00:00,2019-04-18T00:43:43+00:00,N/A,false,true,2019-04-08T05:57:24+00:00,2019-12-05T21:23:00+00:00,us-west-2,iot,true,2019-11-18T09:38:54+00:00,N/A,N/A,N/A,false,N/A,false,N/A
If you decide to generate the information yourself, please note that list_access_keys() only returns information about a single user. Therefore, you would need to iterate through all users, and call list_access_keys() for each user to obtain the CreationDate of the keys.
For an example of usage, see: How to scan your AWS account for old access keys using python - DEV Community
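Once the report has been decoded to CSV text, picking out the old keys is a matter of comparing the access_key_N_last_rotated columns against a cutoff. The following is a rough sketch under that assumption; the function name stale_keys_from_report is hypothetical, and it only looks at keys marked active.

```python
import csv
import io
from datetime import datetime, timedelta, timezone


def stale_keys_from_report(report_csv: str, max_age_days: int = 60, now=None):
    """Return (user, key_slot, last_rotated) for active keys older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        for slot in ("access_key_1", "access_key_2"):
            if row.get(f"{slot}_active") != "true":
                continue
            rotated = row.get(f"{slot}_last_rotated")
            if rotated in (None, "", "N/A"):
                continue
            # Timestamps in the report look like 2019-04-08T05:57:24+00:00.
            rotated_at = datetime.fromisoformat(rotated)
            if rotated_at < cutoff:
                stale.append((row["user"], slot, rotated_at))
    return stale
```

Note that the report identifies keys only by slot (access_key_1 or access_key_2), not by access key ID, so any follow-up action on a specific key still needs list_access_keys for that user.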
I use the following Python boto3 script, not AWS CLI.
Hope this helps those who want to use boto3:
import boto3
from datetime import datetime, timezone

def utc_to_local(utc_dt):
    return utc_dt.replace(tzinfo=timezone.utc).astimezone(tz=None)

def diff_dates(date1, date2):
    return abs(date2 - date1).days

resource = boto3.resource('iam')
client = boto3.client("iam")
KEY = 'LastUsedDate'

for user in resource.users.all():
    Metadata = client.list_access_keys(UserName=user.user_name)
    if Metadata['AccessKeyMetadata']:
        for key in user.access_keys.all():
            AccessId = key.access_key_id
            Status = key.status
            CreatedDate = key.create_date
            numOfDays = diff_dates(utc_to_local(datetime.utcnow()), utc_to_local(CreatedDate))
            LastUsed = client.get_access_key_last_used(AccessKeyId=AccessId)
            if (Status == "Active"):
                if KEY in LastUsed['AccessKeyLastUsed']:
                    print("User:", user.user_name, "Key:", AccessId, "Last Used:", LastUsed['AccessKeyLastUsed'][KEY], "Age of Key:", numOfDays, "Days")
                else:
                    print("User:", user.user_name , "Key:", AccessId, "Key is Active but NEVER USED")
            else:
                print("User:", user.user_name , "Key:", AccessId, "Keys is InActive")
    else:
        print("User:", user.user_name , "No KEYS for this USER")
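The script above prints every key; to get the array of keys older than 60 days that the question asks for, the same calls can be wrapped in a function that returns a list. This is a sketch rather than a drop-in replacement: find_stale_access_keys is a made-up name, and the IAM client is passed in (normally boto3.client("iam")) so the logic can be tried against a stub.

```python
from datetime import datetime, timedelta, timezone


def find_stale_access_keys(iam_client, max_age_days=60, now=None):
    """Collect metadata for every access key created more than max_age_days ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    # Paginators are used so accounts with many users or keys are fully covered.
    for page in iam_client.get_paginator("list_users").paginate():
        for user in page["Users"]:
            key_pages = iam_client.get_paginator("list_access_keys").paginate(
                UserName=user["UserName"]
            )
            for key_page in key_pages:
                for key in key_page["AccessKeyMetadata"]:
                    # CreateDate is returned as a timezone-aware datetime.
                    if key["CreateDate"] < cutoff:
                        stale.append({
                            "UserName": user["UserName"],
                            "AccessKeyId": key["AccessKeyId"],
                            "Status": key["Status"],
                            "AgeDays": (now - key["CreateDate"]).days,
                        })
    return stale
```

The returned list can then be iterated to log each key or to call further IAM operations on it.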
As described in the AWS Athena documentation:
https://docs.aws.amazon.com/athena/latest/ug/create-database.html
we can specify DBPROPERTIES, an S3 location, and a comment when creating an Athena database:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT 'database_comment']
[LOCATION 'S3_loc']
[WITH DBPROPERTIES ('property_name' = 'property_value') [, ...]]
For example:
CREATE DATABASE IF NOT EXISTS clickstreams
COMMENT 'Site Foo clickstream data aggregates'
LOCATION 's3://myS3location/clickstreams/'
WITH DBPROPERTIES ('creator'='Jane D.', 'Dept.'='Marketing analytics');
But once the properties are set, how can I fetch them back using a query?
Let's say I want to fetch the creator name from the above example.
You can get these using the Glue Data Catalog GetDatabase API call.
Databases and tables in Athena are stored in the Glue Data Catalog. When you run DDL statements in Athena, it translates them into Glue API calls. For historical reasons, not all operations you can do in Glue are available in Athena.
I was able to fetch the AWS Athena database properties in JSON format using the following code against the Glue Data Catalog.
package com.amazonaws.samples;

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClient;
import com.amazonaws.services.glue.model.GetDatabaseRequest;
import com.amazonaws.services.glue.model.GetDatabaseResult;

public class Glue {
    public static void main(String[] args) {
        BasicAWSCredentials awsCreds = new BasicAWSCredentials("*api*", "*key*");
        AWSGlue glue = AWSGlueClient.builder().withRegion("*bucket_region*")
                .withCredentials(new AWSStaticCredentialsProvider(awsCreds)).build();
        GetDatabaseRequest req = new GetDatabaseRequest();
        req.setName("*database_name*");
        GetDatabaseResult result = glue.getDatabase(req);
        System.out.println(result);
    }
}
Also, the following permissions are required for the user:
AWSGlueServiceRole
AmazonS3FullAccess
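The same GetDatabase call is available from Python through boto3. The sketch below assumes that the DBPROPERTIES set in Athena appear under the Parameters map of the returned database, which is worth verifying against your own catalog; get_database_property is a hypothetical helper, and the Glue client is passed in (normally boto3.client("glue")).

```python
def get_database_property(glue_client, database_name, property_name, default=None):
    """Look up a single database property via the Glue GetDatabase API."""
    response = glue_client.get_database(Name=database_name)
    # Assumption: Athena's DBPROPERTIES are stored in the database's Parameters map.
    parameters = response["Database"].get("Parameters", {})
    return parameters.get(property_name, default)
```

For the clickstreams example, get_database_property(glue, "clickstreams", "creator") would be expected to return the creator name.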
I am trying to drop a few tables from Athena, and I cannot run multiple DROP queries at the same time. Is there a way to do it?
Thanks!
You are correct. It is not possible to run multiple queries in one request.
An alternative is to create the tables in a specific database. Dropping the database will then cause all the tables to be deleted.
For example:
CREATE DATABASE foo;
CREATE EXTERNAL TABLE foo.bar1 ...;
CREATE EXTERNAL TABLE foo.bar2 ...;
DROP DATABASE foo CASCADE;
The DROP DATABASE command will delete the bar1 and bar2 tables.
You can use the AWS CLI's batch-delete-table command to delete multiple tables at once.
aws glue batch-delete-table \
    --database-name <database-name> \
    --tables-to-delete "<table1-name>" "<table2-name>" "<table3-name>" ...
You can now use the AWS Glue interface to do this. The prerequisite is that you must upgrade to the AWS Glue Data Catalog.
If you upgrade to the AWS Glue Data Catalog from Athena, the metadata for tables created in Athena is visible in Glue, and you can use the AWS Glue UI to select multiple tables and delete them at once.
FAQ on Upgrading data catalog: https://docs.aws.amazon.com/athena/latest/ug/glue-faq.html
You could write a shell script to do this for you:
for table in products customers stores; do
    aws athena start-query-execution --query-string "drop table $table" --result-configuration OutputLocation=s3://my-ouput-result-bucket
done
Use AWS Glue's Python shell and invoke this function:
import boto3

def run_query(query, database, s3_output):
    client = boto3.client('athena')
    # start_query_execution only submits the query; it does not wait for it to finish
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': database
        },
        ResultConfiguration={
            'OutputLocation': s3_output,
        }
    )
    print('Execution ID: ' + response['QueryExecutionId'])
    return response
Athena configuration:
s3_input = 's3://athena-how-to/data'
s3_output = 's3://athena-how-to/results/'
database = 'your_database'
table = 'tableToDelete'
query_1 = "drop table %s.%s;" % (database, table)
queries = [query_1]
#queries = [ create_database, create_table, query_1, query_2 ]
for q in queries:
    print("Executing query: %s" % (q))
    res = run_query(q, database, s3_output)
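Because start_query_execution only submits a query, a DROP TABLE that fails will not raise an error in the loop above. A possible way to find out what happened is to poll get_query_execution until the query reaches a terminal state. The sketch below is an illustration only: wait_for_query, max_polls and the injectable sleep argument are assumptions, and the Athena client is passed in (normally boto3.client('athena')).

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(athena_client, execution_id, poll_seconds=1.0, max_polls=60,
                   sleep=time.sleep):
    """Poll a query until it finishes; return (state, reason or None)."""
    for _ in range(max_polls):
        response = athena_client.get_query_execution(QueryExecutionId=execution_id)
        status = response["QueryExecution"]["Status"]
        if status["State"] in TERMINAL_STATES:
            return status["State"], status.get("StateChangeReason")
        sleep(poll_seconds)
    raise TimeoutError(f"query {execution_id} did not finish after {max_polls} polls")
```

It could be called with res['QueryExecutionId'] from the loop above; a FAILED state comes with a reason string that usually explains why the statement was rejected.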
@Vidy
I would second what @Prateek said. Please provide an example of your code. Also, please tag your post with the language/shell that you're using to interact with AWS.
Currently, you cannot run multiple queries in one request. However, you can make multiple requests simultaneously. As of 2018-06-15, you can run 20 requests simultaneously. You could do this through an API call or the console; you could also use the CLI or an SDK (if available for your language of choice).
For example, in Python you could use the multiprocessing or threading modules to manage concurrent requests. Just remember to consider thread/process safety when creating resources/clients.
Service Limits:
Athena Service Limits
AWS Service Limits for which you can request a rate increase
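As a rough illustration of the concurrent approach, the standard library's concurrent.futures thread pool (a wrapper built on the threading module mentioned above) can fan out one request per statement. Here run_concurrently is a made-up helper, run_one stands for whatever function submits a single query, and the default of 20 workers simply mirrors the limit quoted above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(queries, run_one, max_workers=20):
    """Call run_one(query) for each query in a thread pool.

    Results are returned in the same order as the input queries.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, queries))
```

If run_one creates or shares a boto3 client, check the SDK's guidance on thread safety before reusing one client across threads.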
I could not get Carl's method to work by executing DROP TABLE statements even though they did work in the console.
So I thought it was worth posting the approach that worked for me, which uses a combination of the AWS SDK for pandas (awswrangler) and the CLI:
import awswrangler as wr
import boto3
import os
session = boto3.Session(
    aws_access_key_id='XXXXXX',
    aws_secret_access_key='XXXXXX',
    aws_session_token='XXXXXX'
)
database_name = 'athena_db'
athena_s3_output = 's3://athena_s3_bucket/athena_queries/'
df = wr.athena.read_sql_query(
    sql="SELECT DISTINCT table_name FROM information_schema.tables WHERE "
        "table_schema = '" + database_name + "'",
    database=database_name,
    s3_output=athena_s3_output,
    boto3_session=session
)
print(df)
# ensure that your aws profile is valid for CLI commands
# i.e. your credentials are set in C:\Users\xxxxxxxx\.aws\credentials
for table in df['table_name']:
    cli_string = 'aws glue delete-table --database-name ' + database_name + ' --name ' + table
    print(cli_string)
    os.system(cli_string)
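Instead of shelling out once per table, the Glue batch_delete_table API (the call behind the batch-delete-table CLI command) can be used from boto3 directly. This is a sketch under a few assumptions: delete_tables is a hypothetical helper, the batch size of 100 is assumed to be the per-call maximum and should be checked against the Glue documentation, and the client is passed in (normally session.client('glue')).

```python
def delete_tables(glue_client, database_name, table_names, batch_size=100):
    """Delete tables in batches; return the error entries reported by Glue."""
    names = list(table_names)
    errors = []
    for start in range(0, len(names), batch_size):
        response = glue_client.batch_delete_table(
            DatabaseName=database_name,
            TablesToDelete=names[start:start + batch_size],
        )
        # Tables that could not be deleted are listed under "Errors".
        errors.extend(response.get("Errors", []))
    return errors
```

With the DataFrame from above this would be called as delete_tables(glue, database_name, df['table_name']).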