RDS-data query to pandas dataframe - amazon-web-services

I have a serverless Aurora DB on AWS RDS (with data api enabled) that I would like to query using the database resource and secret ARNs. The code to do this is shown below.
import boto3

rds_data = boto3.client('rds-data', region_name='us-west-1')
response = rds_data.execute_statement(
    resourceArn='DATABASE_ARN',
    secretArn='DATABASE_SECRET',
    database='db',
    sql=query)
rds_data.close()
The response contains the returned records, but the column names are not shown and the format is interesting (see below).
'records': [[{'stringValue': 'name'}, {'stringValue': '5'}, {'stringValue': '3'}, {'stringValue': '1'}, {'stringValue': '2'}, {'stringValue': '4'}]]
I want to query the aurora database and then create a pandas dataframe. Is there a better way to do this? I looked at just using psycopg2 and creating a database connection, but I believe you need a user name and password. I would like to not go that route and instead authenticate using IAM.

The answer was actually pretty simple. There is a parameter called "formatRecordsAs" in the "execute_statement" function. If you set it to 'JSON', the records come back in a much friendlier format.
response = rds_data.execute_statement(
    resourceArn='DATABASE_ARN',
    secretArn='DATABASE_SECRET',
    database='db',
    formatRecordsAs='JSON',
    sql=query)
This still gives you back a string (the rows live in response['formattedRecords'] as JSON), so you just need to parse it into a list of dictionaries; json.loads is the natural fit here, since the payload is JSON rather than a Python literal.
import json

list_of_dicts = json.loads(response['formattedRecords'])
That can then be changed to a pandas dataframe
df = pd.DataFrame(list_of_dicts)
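Putting the pieces together, here is a minimal sketch of the whole flow; the ARNs, database name, and table name are placeholders, so substitute your own values:
import json
import boto3
import pandas as pd

rds_data = boto3.client('rds-data', region_name='us-west-1')

def query_to_df(sql):
    # DATABASE_ARN and DATABASE_SECRET are placeholders for your cluster and secret ARNs
    response = rds_data.execute_statement(
        resourceArn='DATABASE_ARN',
        secretArn='DATABASE_SECRET',
        database='db',
        formatRecordsAs='JSON',
        sql=sql)
    # formattedRecords is a JSON string encoding a list of row dictionaries
    return pd.DataFrame(json.loads(response['formattedRecords']))

df = query_to_df('SELECT * FROM my_table')  # my_table is a hypothetical table name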

Related

Query a table/database in Athena from a Notebook instance

I have developed different Athena Workgroups for different teams so that I can separate their queries and their query results. The users would like to query the tables available to them from their notebook instances (JupyterLab). I am having difficulty finding code which successfully covers the requirement of querying a table from the user's specific workgroup. I have only found code that will query the table from the primary workgroup.
The code I have currently used is added below.
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
               region_name='<YOUR REGION, for example, us-west-2>')
df = pd.read_sql("SELECT * FROM <DATABASE-NAME>.<YOUR TABLE NAME> limit 8;", conn)
df
This code does not work because the users only have access to run queries from their specific workgroups, so they get errors when it is run. It also does not cover the requirement of separating the users' queries into user-specific workgroups.
Any suggestions on how I can alter the code so that I can run the queries within a specific workgroup from the notebook instance?
The documentation of pyathena is not very extensive, but after looking into the source code we can see that connect simply creates an instance of the Connection class.
def connect(*args, **kwargs):
    from pyathena.connection import Connection
    return Connection(*args, **kwargs)
Now, after looking at the signature of Connection.__init__ on GitHub, we can see a parameter work_group=None, which is named the same way as one of the parameters of start_query_execution from the official AWS Python API, boto3. Here is what their documentation says about it:
WorkGroup (string) -- The name of the workgroup in which the query is being started.
After following the usages and imports in Connection we end up at the BaseCursor class, which under the hood makes a call to start_query_execution while unpacking a dictionary of parameters assembled by the BaseCursor._build_start_query_execution_request method. That is exactly where we can see the familiar syntax for submitting queries to AWS Athena, in particular the following part:
if self._work_group or work_group:
    request.update({
        'WorkGroup': work_group if work_group else self._work_group
    })
So this should do the trick in your case:
import pandas as pd
from pyathena import connect
conn = connect(
    s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
    region_name='<YOUR REGION, for example, us-west-2>',
    work_group='<USER SPECIFIC WORKGROUP>'
)
df = pd.read_sql("SELECT * FROM <DATABASE-NAME>.<YOUR TABLE NAME> limit 8;", conn)
I implemented this and it worked for me.
!pip install pyathena
from pyathena import connect
from pyathena.pandas.util import as_pandas
import boto3
query = """
Select * from "s3-prod-db"."CustomerTransaction" ct where date(partitiondate) >= date('2022-09-30') limit 10
"""
cursor = connect(s3_staging_dir='s3://s3-temp-analytics-prod2/',
                 region_name=boto3.session.Session().region_name,
                 work_group='data-scientist').cursor()
cursor.execute(query)
print(cursor.state)
print(cursor.state_change_reason)
print(cursor.completion_date_time)
print(cursor.submission_date_time)
print(cursor.data_scanned_in_bytes)
print(cursor.output_location)
df = as_pandas(cursor)
print(df)
If we don't pass the work_group parameter, pyathena will use "primary" as the default workgroup.
If we pass an s3_staging_dir such as 's3://s3-temp-analytics-prod2/' that points to a bucket which does not exist, it will create that bucket.
But if the role you are running the script under does not have the privilege to create buckets, it will throw an exception.
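If you would rather not go through pyathena at all, a rough alternative is to call Athena directly with boto3 and pass the workgroup to start_query_execution; the query, database, workgroup, and output bucket below are placeholders:
import time
import boto3

athena = boto3.client('athena', region_name='us-west-2')

execution = athena.start_query_execution(
    QueryString='SELECT * FROM my_table LIMIT 8',                       # placeholder query
    QueryExecutionContext={'Database': 'my_database'},                  # placeholder database
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'},  # placeholder bucket
    WorkGroup='data-scientist')                                         # the user-specific workgroup
query_id = execution['QueryExecutionId']

# Poll until the query finishes, then fetch the first page of results
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
pyathena does this polling for you under the hood, which is why the cursor-based approach above is usually more convenient.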

How use Filters with boto3 vpc endpoint services?

I need to get a VPC endpoint service ID from a Python script, but I don't understand how to use boto3 Filters with a vpc-id or a subnet.
How do I use Filters?
This is the relevant part of the boto3 documentation:
> (dict) --
A filter name and value pair that is used to return a more specific list of results from a describe operation. Filters can be used to match a set of resources by specific criteria, such as tags, attributes, or IDs. The filters supported by a describe operation are documented with the describe operation. For example:
DescribeAvailabilityZones
DescribeImages
DescribeInstances
DescribeKeyPairs
DescribeSecurityGroups
DescribeSnapshots
DescribeSubnets
DescribeTags
DescribeVolumes
DescribeVpcs
Name (string) --
The name of the filter. Filter names are case-sensitive.
Values (list) --
The filter values. Filter values are case-sensitive.
(string) --
The easiest method would be to call it with no filters, and observe what comes back:
import boto3
ec2_client = boto3.client('ec2', region_name='ap-southeast-2')
response = ec2_client.describe_vpc_endpoint_services()
for service in response['ServiceDetails']:
    print(service['ServiceId'])
You can then either filter the results within your Python code, or use the Filters capability of the Describe command.
Feel free to print(response) to see the data that comes back.
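For instance, filtering within Python can be as simple as a list comprehension over the returned service details; the '.s3' suffix below is only an illustrative criterion:
import boto3

ec2_client = boto3.client('ec2', region_name='ap-southeast-2')
response = ec2_client.describe_vpc_endpoint_services()

# Keep only the services whose name ends in '.s3' (an example criterion, not a requirement)
s3_services = [s for s in response['ServiceDetails'] if s['ServiceName'].endswith('.s3')]
for service in s3_services:
    print(service['ServiceId'], service['ServiceName'])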
It depends on what you want to filter the results with. In my case, I use the code below to filter for a specific vpc-endpoint-id.
import boto3
vpc_client = boto3.client('ec2')
vpcEndpointId = "vpce-###"
vpcEndpointDetails = vpc_client.describe_vpc_endpoints(
    VpcEndpointIds=[vpcEndpointId],
    Filters=[
        {
            'Name': 'vpc-endpoint-id',
            'Values': [vpcEndpointId]
        },
    ])
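If you want to start from a VPC instead, describe_vpc_endpoints also accepts a vpc-id filter, and each returned endpoint includes the service name you can match on; the VPC ID below is a placeholder:
import boto3

vpc_client = boto3.client('ec2')
endpoints = vpc_client.describe_vpc_endpoints(
    Filters=[
        {'Name': 'vpc-id', 'Values': ['vpc-0123456789abcdef0']}  # placeholder VPC ID
    ])
for endpoint in endpoints['VpcEndpoints']:
    print(endpoint['VpcEndpointId'], endpoint['ServiceName'])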

How to delete / drop multiple tables in AWS athena?

I am trying to drop a few tables in Athena, and I cannot run multiple DROP queries at the same time. Is there a way to do it?
Thanks!
You are correct. It is not possible to run multiple queries in a single request.
An alternative is to create the tables in a specific database. Dropping the database will then cause all the tables to be deleted.
For example:
CREATE DATABASE foo;
CREATE EXTERNAL TABLE bar1 ...;
CREATE EXTERNAL TABLE bar2 ...;
DROP DATABASE foo CASCADE;
The DROP DATABASE command will delete the bar1 and bar2 tables.
You can use the aws-cli batch-delete-table command to delete multiple tables at once.
aws glue batch-delete-table \
    --database-name <database-name> \
    --tables-to-delete "<table1-name>" "<table2-name>" "<table3-name>" ...
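The same operation is available from Python through the Glue client, if you would rather stay in boto3; the database and table names below are placeholders:
import boto3

glue = boto3.client('glue')
glue.batch_delete_table(
    DatabaseName='my_database',                     # placeholder database name
    TablesToDelete=['table1', 'table2', 'table3'])  # placeholder table names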
You can use AWS Glue interface to do this now. The prerequisite being you must upgrade to AWS Glue Data Catalog.
If you upgrade to the AWS Glue Data Catalog from Athena, the metadata for tables created in Athena is visible in Glue, and you can use the AWS Glue UI to select multiple tables and delete them at once.
FAQ on Upgrading data catalog: https://docs.aws.amazon.com/athena/latest/ug/glue-faq.html
You could write a shell script to do this for you:
for table in products customers stores; do
    aws athena start-query-execution --query-string "drop table $table" --result-configuration OutputLocation=s3://my-output-result-bucket
done
Use AWS Glue's Python shell and invoke this function:
import boto3

def run_query(query, database, s3_output):
    client = boto3.client('athena')
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': database
        },
        ResultConfiguration={
            'OutputLocation': s3_output,
        }
    )
    print('Execution ID: ' + response['QueryExecutionId'])
    return response
Athena configuration:
s3_input = 's3://athena-how-to/data'
s3_output = 's3://athena-how-to/results/'
database = 'your_database'
table = 'tableToDelete'
query_1 = "drop table %s.%s;" % (database, table)
queries = [query_1]
# queries = [create_database, create_table, query_1, query_2]
for q in queries:
    print("Executing query: %s" % (q))
    res = run_query(q, database, s3_output)
@Vidy
I would second what @Prateek said. Please provide an example of your code. Also, please tag your post with the language/shell that you're using to interact with AWS.
Currently, you cannot run multiple queries in one request. However, you can make multiple requests simultaneously; as of 2018-06-15, the limit is 20 concurrent queries. You could do this through an API call or the console. In addition, you could use the CLI or the SDK (if available for your language of choice).
For example, in Python you could use the multiprocess or threading modules to manage concurrent requests. Just remember to consider thread/multiprocess safety when creating resources/clients.
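As a rough sketch of that idea, a thread pool can submit several DROP TABLE statements at once; the database, output bucket, and table names are placeholders:
import boto3
from concurrent.futures import ThreadPoolExecutor

athena = boto3.client('athena')  # a single client instance is safe to share across threads

def drop_table(table):
    # Each call starts one Athena query execution; the output location is a placeholder bucket
    return athena.start_query_execution(
        QueryString='DROP TABLE IF EXISTS ' + table,
        QueryExecutionContext={'Database': 'my_database'},
        ResultConfiguration={'OutputLocation': 's3://my-output-result-bucket/'})

tables = ['products', 'customers', 'stores']  # placeholder table names
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(drop_table, tables))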
Service Limits:
Athena Service Limits
AWS Service Limits for which you can request a rate increase
I could not get Carl's method to work by executing DROP TABLE statements even though they did work in the console.
So I just thought it was worth posting the approach that worked for me, which uses a combination of the AWS SDK for pandas (awswrangler) and the CLI.
import awswrangler as wr
import boto3
import os
session = boto3.Session(
    aws_access_key_id='XXXXXX',
    aws_secret_access_key='XXXXXX',
    aws_session_token='XXXXXX'
)
database_name = 'athena_db'
athena_s3_output = 's3://athena_s3_bucket/athena_queries/'
df = wr.athena.read_sql_query(
    sql="SELECT DISTINCT table_name FROM information_schema.tables WHERE table_schema = '" + database_name + "'",
    database=database_name,
    s3_output=athena_s3_output,
    boto3_session=session
)
print(df)
# ensure that your aws profile is valid for CLI commands
# i.e. your credentials are set in C:\Users\xxxxxxxx\.aws\credentials
for table in df['table_name']:
    cli_string = 'aws glue delete-table --database-name ' + database_name + ' --name ' + table
    print(cli_string)
    os.system(cli_string)
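If you would rather avoid shelling out to the CLI, awswrangler can also drop the tables directly through the Glue catalog; assuming the same session and dataframe as above, something like this should work:
for table in df['table_name']:
    # Returns True if the table existed and was deleted
    wr.catalog.delete_table_if_exists(database=database_name, table=table, boto3_session=session)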

Delimiter not found error - AWS Redshift Load from s3 using Kinesis Firehose

I am using Kinesis firehose to transfer data to Redshift via S3.
I have a very simple CSV file that looks like the one below. Firehose puts it into S3, but Redshift errors out with a "Delimiter not found" error.
I have looked at just about every post related to this error, and I made sure that the delimiter is included.
File
GOOG,2017-03-16T16:00:01Z,2017-03-17 06:23:56.986397,848.78
GOOG,2017-03-16T16:00:01Z,2017-03-17 06:24:02.061263,848.78
GOOG,2017-03-16T16:00:01Z,2017-03-17 06:24:07.143044,848.78
GOOG,2017-03-16T16:00:01Z,2017-03-17 06:24:12.217930,848.78
OR
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:48:59.993260","852.12"
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:49:07.034945","852.12"
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:49:12.306484","852.12"
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:49:18.020833","852.12"
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:49:24.203464","852.12"
Redshift Table
CREATE TABLE stockvalue
( symbol VARCHAR(4),
streamdate VARCHAR(20),
writedate VARCHAR(26),
stockprice VARCHAR(6)
);
Error
[screenshot of the "Delimiter not found" COPY error]
Just in case, here's what my kinesis stream looks like
[screenshot of the Firehose configuration]
Can someone point out what may be wrong with the file.
I added a comma between the fields.
All columns in destination table are varchar so there should be no reason for datatype error.
Also, the column lengths match exactly between the file and redshift table.
I have tried embedding columns in double quotes and without.
Can you post the full COPY command? It's cut off in the screenshot.
My guess is that you are missing DELIMITER ',' in your COPY command. Try adding that to the COPY command.
I was stuck on this for hours, and thanks to Shahid's answer it helped me solve it.
Text Case for Column Names is Important
Redshift will always treat your table's columns as lower-case, so when mapping JSON keys to columns, make sure the JSON keys are lower-case, e.g.
Your JSON file will look like:
{"id": "val1", "name": "val2"}{"id": "val1", "name": "val2"}{"id": "val1", "name": "val2"}{"id": "val1", "name": "val2"}
And the COPY statement will look like
COPY latency(id,name) FROM 's3://<bucket-name>/<manifest>' CREDENTIALS 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' MANIFEST json 'auto';
Settings within Firehose must have the column names specified (again, in lower-case). Also, add the following to Firehose COPY options:
json 'auto' TRUNCATECOLUMNS blanksasnull emptyasnull
How to call put_records from Python:
Below is a snippet showing how to use the put_records function with Kinesis in Python.
'objects' passed into the 'put_to_stream' function is a list of dictionaries:
import json
import boto3

kinesis_client = boto3.client('kinesis')
kinesis_stream_name = '<YOUR STREAM NAME>'

def put_to_stream(objects):
    records = []
    for obj in objects:
        record = {
            'Data': json.dumps(obj),
            'PartitionKey': 'swat_report'
        }
        records.append(record)
    print(records)
    put_response = kinesis_client.put_records(StreamName=kinesis_stream_name, Records=records)
    return put_response
1- You need to add FORMAT AS JSON 's3://yourbucketname/aJsonPathFile.txt'. AWS does not make this obvious in its documentation. Please note that this only works when your data is in JSON form, like
{"attr1": "val1", "attr2": "val2"} {"attr1": "val1", "attr2": "val2"} {"attr1": "val1", "attr2": "val2"} {"attr1": "val1", "attr2": "val2"}
2- You also need to verify that the column order in Kinesis Firehose matches the one in the CSV file, and try adding
TRUNCATECOLUMNS blanksasnull emptyasnull
3- An example
COPY testrbl3 ( eventId,serverTime,pageName,action,ip,userAgent,location,plateform,language,campaign,content,source,medium,productID,colorCode,scrolltoppercentage) FROM 's3://bucketname/' CREDENTIALS 'aws_iam_role=arn:aws:iam:::role/' MANIFEST json 'auto' TRUNCATECOLUMNS blanksasnull emptyasnull;

DynamoDB: getting table description null

I need to have a query on DynamoDB.
Currently I made so far this code:
AWSCredentials creds = new DefaultAWSCredentialsProviderChain().getCredentials();
AmazonDynamoDBClient client = new AmazonDynamoDBClient(creds);
client.withRegion(Regions.US_WEST_2);
DynamoDB dynamoDB = new DynamoDB(new AmazonDynamoDBClient(creds));
Table table = dynamoDB.getTable("dev");
QuerySpec spec = new QuerySpec().withKeyConditionExpression("tableKey = :none.json");
ItemCollection<QueryOutcome> items = table.query(spec);
System.out.println(table);
The returned value of table is {dev: null}, which means that the description is null.
It's important to note that when I run the AWS CLI command aws dynamodb list-tables I get all of the tables, but the same operation in my code, dynamoDB.listTables(), returns an empty list.
Is there something that I'm doing wrong?
Do I need to define some more credentials before using the DynamoDB API?
I was getting the same problem and landed here looking for a solution. As mentioned in the javadoc of getDescription:
Returns the table description; or null if the table description has
not yet been described via {@link #describe()}. No network call.
Initially the description is set to null. After the first call to describe(), which makes a network call, the description gets set, and getDescription can be used after that.