(AWS) Athena: Query Results seem too short

My Athena query results appear to be far too short, and I'm trying to figure out why.
Setup:
Glue catalogs (118.6 GB in size).
Data: stored in S3 in both CSV and JSON format.
Athena query: when I query a whole table, I only get about 40K results per query, while there should be roughly 121 million records for that query (one month's data on average).
Does Athena cap query result data? Is this a service limit? (The documentation does not suggest it is.)

So, getting 1000 results at a time obviously doesn't scale. Thankfully, there's a simple workaround. (Or maybe this is how it was supposed to be done all along.)
When you run an Athena query, you should get a QueryExecutionId. This Id corresponds to the output file you'll find in S3.
Here's a snippet I wrote:
import time
from typing import Dict

import boto3
import botocore.exceptions
import pandas as pd

def athena_query_to_df(query: str) -> pd.DataFrame:
    s3 = boto3.resource("s3")
    athena = boto3.client("athena")
    response: Dict = athena.start_query_execution(QueryString=query, WorkGroup="<your_work_group>")
    execution_id: str = response["QueryExecutionId"]
    print(execution_id)
    # Wait until the query is finished
    while True:
        try:
            athena.get_query_results(QueryExecutionId=execution_id)
            break
        except botocore.exceptions.ClientError:
            time.sleep(5)
    # The result file in the output bucket is named after the execution id
    local_filename: str = "temp/athena_query_result_temp.csv"
    s3.Bucket("athena-query-output").download_file(execution_id + ".csv", local_filename)
    return pd.read_csv(local_filename)
Make sure the corresponding WorkGroup has "Query result location" set, e.g. "s3://athena-query-output/"
Also see this thread with similar answers: How to Create Dataframe from AWS Athena using Boto3 get_query_results method

It seems that there is a limit of 1000.
You should use NextToken to iterate over the results.
From the GetQueryResults documentation:
MaxResults: The maximum number of results (rows) to return in this request.
Type: Integer
Valid Range: Minimum value of 0. Maximum value of 1000.
Required: No
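A minimal sketch of that manual NextToken pagination (assuming boto3, and that execution_id comes from start_query_execution as in the snippet above, with the query already finished):

import boto3

athena = boto3.client("athena")

rows = []
kwargs = {"QueryExecutionId": execution_id, "MaxResults": 1000}
while True:
    page = athena.get_query_results(**kwargs)
    # Note: the first row of the first page is the column header row
    rows.extend(page["ResultSet"]["Rows"])
    next_token = page.get("NextToken")
    if not next_token:
        break
    kwargs["NextToken"] = next_token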

Another option is a paginate-and-count approach (I don't know whether something like select count(*) from table would be a better way to do it).
Here is a complete, ready-to-use example using the Python boto3 Athena API. It uses a paginator, converts the result to a list of dicts, and also returns the count along with the result.
There are two methods below:
The first one paginates.
The second one converts the paginated result to a list of dicts and calculates the count.
Note: converting to a list of dicts is not strictly necessary here; if you don't need it, you can modify the code to return only the count.
import time


def get_athena_results_paginator(params, athena_client):
    """
    :param params: dict with 'query', 'database' and 'workgroup' keys
    :param athena_client: a boto3 Athena client
    :return: (results as a list of dicts, row count)
    """
    query_id = athena_client.start_query_execution(
        QueryString=params['query'],
        QueryExecutionContext={
            'Database': params['database']
        }
        # ,
        # ResultConfiguration={
        #     'OutputLocation': 's3://' + params['bucket'] + '/' + params['path']
        # }
        , WorkGroup=params['workgroup']
    )['QueryExecutionId']

    # Poll until the query leaves the QUEUED/RUNNING states
    query_status = None
    while query_status == 'QUEUED' or query_status == 'RUNNING' or query_status is None:
        query_status = athena_client.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
        if query_status == 'FAILED' or query_status == 'CANCELLED':
            raise Exception('Athena query with the string "{}" failed or was cancelled'.format(params.get('query')))
        time.sleep(10)

    # Paginate over the results, 1000 rows (the maximum) per page
    results_paginator = athena_client.get_paginator('get_query_results')
    results_iter = results_paginator.paginate(
        QueryExecutionId=query_id,
        PaginationConfig={
            'PageSize': 1000
        }
    )
    count, results = result_to_list_of_dict(results_iter)
    return results, count


def result_to_list_of_dict(results_iter):
    """
    :param results_iter: page iterator returned by the get_query_results paginator
    :return: (row count, results as a list of dicts keyed by column name)
    """
    results = []
    column_names = None
    count = 0
    for results_page in results_iter:
        for row in results_page['ResultSet']['Rows']:
            column_values = [col.get('VarCharValue', None) for col in row['Data']]
            if not column_names:
                # The first row of the first page holds the column headers
                column_names = column_values
            else:
                count = count + 1
                results.append(dict(zip(column_names, column_values)))
    return count, results
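For completeness, a call might look like this (the keys match what the function reads; the query, database, and workgroup values are placeholders):

import boto3

athena_client = boto3.client('athena')
params = {
    'query': 'SELECT * FROM my_table',  # placeholder query
    'database': 'my_database',          # placeholder Glue database
    'workgroup': 'primary'              # must have a query result location configured
}
results, count = get_athena_results_paginator(params, athena_client)
print(count, 'rows')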

Related

Not able to get all the columns while using group by in Pandas df

controller.py
def consolidated_universities_data_by_country(countries, universities):
    cursor = connection.cursor()
    query = None
    if countries == str(1):
        query = f"""
        #sql_query#
        """
    result_data = cursor.execute(query)
    result = dict_fetchall_rows(result_data)
    consolidated_df_USA = pd.DataFrame(result).fillna('NULL').replace({True: 1, False: 0}).groupby('CourseId')['ApplicationDeadline'].apply(', '.join).reset_index()
    return consolidated_df_USA
With the above code I am able to get the desired output, i.e. merging the deadlines from n rows into one row for a given CourseId, but I am not able to get the rest of the columns. So I tried:
consolidated_df_USA=pd.DataFrame(result).fillna('NULL').replace( {True : 1, False : 0}).groupby('CourseId')['ApplicationDeadline','CourseName'].agg(', '.join).reset_index()
return consolidated_df_USA
With this I am able to get some columns, but others are dropped, and I also get the warning below:
FutureWarning: Dropping invalid columns in SeriesGroupBy.agg is deprecated. In a future version, a TypeError will be raised. Before calling .agg, select only columns which should be valid for the aggregating function.
How can I get all the columns returned by the SQL query?
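The warning itself points at the fix: aggregate only columns that are valid for the aggregation. As a rough sketch (not from the thread; column names follow the question), joining the deadlines while keeping the first value of every other column might look like:

df = pd.DataFrame(result).fillna('NULL').replace({True: 1, False: 0})

# Join the deadlines per CourseId; keep the first value of every other column.
agg_spec = {col: 'first' for col in df.columns if col not in ('CourseId', 'ApplicationDeadline')}
agg_spec['ApplicationDeadline'] = ', '.join

consolidated_df_USA = df.groupby('CourseId').agg(agg_spec).reset_index()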

Scan large 10gb of Amazon DynamoDB data

The following code works for me, but it takes 19 minutes for one API request to return a result. An optimized approach would be appreciated. I would rather not use parallel scan segments, because then I would have to manage the threads myself.
import json
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb', region_name='us-west-2', endpoint_url="http://localhost:8000")
table = dynamodb.Table('Movies')

fe = Key('year').between(1950, 1959)
pe = "#yr, title, info.rating"
# Expression Attribute Names for Projection Expression only.
ean = {"#yr": "year"}
esk = None

response = table.scan(
    FilterExpression=fe,
    ProjectionExpression=pe,
    ExpressionAttributeNames=ean
)
for i in response['Items']:
    print(json.dumps(i, cls=DecimalEncoder))  # DecimalEncoder: custom JSONEncoder for Decimal values

# As long as LastEvaluatedKey is in the response there are still items to fetch
while 'LastEvaluatedKey' in response:
    response = table.scan(
        ProjectionExpression=pe,
        FilterExpression=fe,
        ExpressionAttributeNames=ean,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    for i in response['Items']:
        print(json.dumps(i, cls=DecimalEncoder))
Because it is searching across all partitions, the scan operation can be very slow. You won't be able to "tune" this query like you might if you were working with a relational database.
In order to best help you, I would need to know more about your access pattern (get movies by year?) and what your table currently looks like (what are your partition keys/sort keys, other attributes, etc.).
Unfortunately, scan is slow by nature. There is no way to optimize at the code level other than redesigning the table to optimize for this access pattern.
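For illustration only (the table name here is made up): if the data were keyed, or indexed via a GSI, on year as the partition key, the "movies by year" access pattern becomes a Query, which reads only the matching partition instead of scanning the whole table:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb', region_name='us-west-2')
table = dynamodb.Table('MoviesByYear')  # hypothetical table with 'year' as partition key

items = []
for year in range(1950, 1960):
    response = table.query(KeyConditionExpression=Key('year').eq(year))
    items.extend(response['Items'])
    # Page through the partition if the 1 MB response limit is hit
    while 'LastEvaluatedKey' in response:
        response = table.query(
            KeyConditionExpression=Key('year').eq(year),
            ExclusiveStartKey=response['LastEvaluatedKey']
        )
        items.extend(response['Items'])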

Django Postgres migration: Fastest way to backfill a column in a table with 100 Million rows

I have a Postgres table, Thing, with 100 million rows.
It has a column, populated over time, that stores keys; the keys were prefixed before storing. Let's call it prefixed_keys.
My task is to use the values of this column to populate another column with the same values, but with the prefixes trimmed off. Let's call it simple_keys.
I tried the following migration:
from django.db import migrations
import time


def backfill_simple_keys(apps, schema_editor):
    Thing = apps.get_model('thing', 'Thing')
    batch_size = 100000
    number_of_batches_completed = 0
    while Thing.objects.filter(simple_key__isnull=True).exists():
        things = Thing.objects.filter(simple_key__isnull=True)[:batch_size]
        for tng in things:
            prefixed_key = tng.prefixed_key
            if prefixed_key.startswith("prefix_A"):
                simple_key = prefixed_key[len("prefix_A"):]
            elif prefixed_key.startswith("prefix_BBB"):
                simple_key = prefixed_key[len("prefix_BBB"):]
            tng.simple_key = simple_key
        Thing.objects.bulk_update(
            things,
            ['simple_key'],
            batch_size=batch_size
        )
        number_of_batches_completed += 1
        print("Number of batches updated: ", number_of_batches_completed)
        sleep_seconds = 3
        time.sleep(sleep_seconds)


class Migration(migrations.Migration):
    dependencies = [
        ('thing', '0030_add_index_to_simple_key'),
    ]

    operations = [
        migrations.RunPython(
            backfill_simple_keys,
        ),
    ]
Each batch took about 7 minutes to complete, which means it would take days to finish!
It also increased the latency of the DB, which is being used in production.
Since you're going to go through every record in that table anyway, it makes sense to traverse it in one go using a server-side cursor.
Calling
Thing.objects.filter(simple_key__isnull=True)[:batch_size]
is going to be expensive, especially as the index starts to grow.
Also, the call above retrieves ALL the fields from the table even though you only need two or three of them.
import time

import psycopg2
from psycopg2.extras import RealDictCursor, execute_values

update_query = """UPDATE table SET simple_key = data.key
FROM (VALUES %s) AS data (id, key) WHERE table.id = data.id"""

conn = psycopg2.connect(DSN, cursor_factory=RealDictCursor)

cursor = conn.cursor(name="key_server_side_crs")  # having a name makes it a server-side cursor
update_cursor = conn.cursor()  # regular cursor

cursor.itersize = 5000  # how many records to retrieve at a time
cursor.execute("SELECT id, prefixed_key, simple_key FROM table")

count = 0
batch = []
for row in cursor:
    if not row["simple_key"]:
        simple_key = calculate_simple_key(row["prefixed_key"])
        batch.append((row["id"], simple_key))
        if len(batch) >= 1000:  # how many records to update at once
            execute_values(update_cursor, update_query, batch, page_size=1000)
            batch = []
            time.sleep(0.1)  # allow the DB to "breathe"
    count += 1
    if count % 100000 == 0:  # print progress every 100K rows
        print("processed %d rows" % count)

# flush whatever is left in the final batch and commit the updates
if batch:
    execute_values(update_cursor, update_query, batch, page_size=1000)
conn.commit()
The above is NOT tested so it's advisable to create a copy of a few million rows of the table and test it against it first.
You can also test various batch size settings (both for retrieve and update).
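The snippet above calls a calculate_simple_key helper it does not define; based on the prefix-stripping logic in the question's migration, a minimal version could be:

def calculate_simple_key(prefixed_key):
    # Strip the known prefixes; fall back to the original value otherwise (an assumption).
    for prefix in ("prefix_A", "prefix_BBB"):
        if prefixed_key.startswith(prefix):
            return prefixed_key[len(prefix):]
    return prefixed_key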

DynamoDB BatchWriteItem: Provided list of item keys contains duplicates

I am trying to use DynamoDB operation BatchWriteItem, wherein I want to insert multiple records into one table.
This table has one partition key and one sort key.
I am using AWS lambda and Go language.
I get the elements to be inserted into a slice.
I am following this procedure:
1. Create a PutRequest structure and add the AttributeValues for the first record from the list.
2. Create a WriteRequest from this PutRequest.
3. Add this WriteRequest to an array of WriteRequests.
4. Create a BatchWriteItemInput, which consists of RequestItems, basically a map of the table name to the array of WriteRequests.
After that I call BatchWriteItem, which results in an error:
Provided list of item keys contains duplicates.
Any pointers on why this could be happening?
You've provided two or more items with identical primary keys (which in your case means identical partition and sort keys).
Per the BatchWriteItem docs, you cannot perform multiple operations on the same item in the same BatchWriteItem request.
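The question is in Go, but sticking with Python like the other answers here, one illustrative way to avoid the error is to de-duplicate on the key pair before building the batch (keeping the last occurrence); the key names below are placeholders:

def dedupe_items(items, partition_key, sort_key):
    # Keep only the last item seen for each (partition key, sort key) pair.
    unique = {}
    for item in items:
        unique[(item[partition_key], item[sort_key])] = item
    return list(unique.values())

# e.g. dedupe_items(records, 'user_id', 'game_id')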
Consideration: this answer works for Python.
As @Benoit has remarked, the boto3 documentation states that if you want to bypass the no-duplicates limitation of a single batch write request (botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the BatchWriteItem operation: Provided list of item keys contains duplicates.), you can specify overwrite_by_pkeys=['partition_key', 'sort_key'] on the batch writer to "de-duplicate request items in buffer if match new request item on specified primary keys", according to the documentation and the source code. That is, if the partition key / sort key combination already exists in the buffer, it will drop that request and replace it with the new one.
Example
Suppose there is a pandas DataFrame that you want to write to a DynamoDB table; the following function could be helpful.
import json
import datetime as dt
import boto3
import pandas as pd
from typing import Optional


def write_dynamoDB(df: 'pandas.core.frame.DataFrame', tbl: str, partition_key: Optional[str] = None, sort_key: Optional[str] = None):
    '''
    Function to write a pandas DataFrame to a DynamoDB Table through
    batchWrite operation. In case there are any float values it handles
    them by converting the data to a json format.
    Arguments:
    * df: pandas DataFrame to write to DynamoDB table.
    * tbl: DynamoDB table name.
    * partition_key (Optional): DynamoDB table partition key.
    * sort_key (Optional): DynamoDB table sort key.
    '''

    # Initialize AWS Resource
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table(tbl)

    # Check if overwrite keys were provided
    overwrite_keys = [partition_key, sort_key] if partition_key else None

    # Check if they are floats (convert to decimals instead)
    if any([True for v in df.dtypes.values if v == 'float64']):
        from decimal import Decimal

        # Save decimals with JSON
        df_json = json.loads(
            json.dumps(df.to_dict(orient='records'),
                       default=date_converter,
                       allow_nan=True),
            parse_float=Decimal
        )

        # Batch write
        with table.batch_writer(overwrite_by_pkeys=overwrite_keys) as batch:
            for element in df_json:
                batch.put_item(
                    Item=element
                )

    else:  # If there are no floats in the data
        # Batch writing
        with table.batch_writer(overwrite_by_pkeys=overwrite_keys) as batch:
            columns = df.columns
            for row in df.itertuples():
                batch.put_item(
                    Item={
                        col: row[idx + 1] for idx, col in enumerate(columns)
                    }
                )


def date_converter(obj):
    if isinstance(obj, dt.datetime):
        return obj.__str__()
    elif isinstance(obj, dt.date):
        return obj.isoformat()
by calling write_dynamoDB(dataframe, 'my_table', 'the_partition_key', 'the_sort_key').
Use batch_writer instead of batch_write_item:
import boto3

dynamodb = boto3.resource("dynamodb", region_name='eu-west-1')
my_table = dynamodb.Table('mirrorfm_yt_tracks')

with my_table.batch_writer(overwrite_by_pkeys=["user_id", "game_id"]) as batch:
    for item in items:
        batch.put_item(
            Item={
                'user_id': item['user_id'],
                'game_id': item['game_id'],
                'score': item['score']
            }
        )
If you don't have a sort key, overwrite_by_pkeys can be None
This is essentially the same answer as @MiguelTrejo's (thanks! +1), but simplified.

How to Compose Query in BigQuery with Destination Table?

I am trying to query a BigQuery table and load the queried data into a destination table using legacy SQL.
Code:
bigquery_client = bigquery.Client.from_service_account_json(config.ZF_FILE)
job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True
# Allow for query results larger than the maximum response size
job_config.allow_large_results = True
# When large results are allowed, a destination table must be set.
dest_dataset_ref = bigquery_client.dataset('datasetId')
dest_table_ref = dest_dataset_ref.table('datasetId:mydestTable')
job_config.destination = dest_table_ref
query =""" SELECT abc FROM [{0}] LIMIT 10 """.format(mySourcetable_name)
# run the Query here now
query_job = bigquery_client.query(query, job_config=job_config)
Error:
google.api_core.exceptions.BadRequest: 400 POST : Invalid dataset ID "datasetId:mydestTable". Dataset IDs must be alphanumeric (plus underscores, dashes, and colons) and must be at most 1024 characters long.
The job_config.destination gives :
print job_config.destination
TableReference(u'projectName', 'projectName:dataset', 'projectName:dataset.mydest_table')
The datasetId is correct on my side, so why the error?
How do I get the proper destination table?
This may be helpful to someone in the future.
It worked when I passed just the names, rather than the full IDs, of the dataset and table, as below:
dest_dataset_ref = bigquery_client.dataset('dataset_name')
dest_table_ref = dest_dataset_ref.table('mydestTable_name')