The following code works for me but it takes 19 minutes for 1 API request to return a result. An optimized result would be appreciated. I would not like to go for segments because then I will have to do thread management.
import json
import boto3
from boto3.dynamodb.conditions import Key

# DecimalEncoder is assumed to be the json.JSONEncoder subclass from the AWS
# getting-started sample that converts Decimal values so json.dumps can handle them.

dynamodb = boto3.resource('dynamodb', region_name='us-west-2', endpoint_url="http://localhost:8000")
table = dynamodb.Table('Movies')

fe = Key('year').between(1950, 1959)
pe = "#yr, title, info.rating"
# Expression Attribute Names for Projection Expression only.
ean = {"#yr": "year"}

response = table.scan(
    FilterExpression=fe,
    ProjectionExpression=pe,
    ExpressionAttributeNames=ean
)
for i in response['Items']:
    print(json.dumps(i, cls=DecimalEncoder))

# As long as LastEvaluatedKey is in the response, there are still items left to scan.
while 'LastEvaluatedKey' in response:
    response = table.scan(
        ProjectionExpression=pe,
        FilterExpression=fe,
        ExpressionAttributeNames=ean,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    for i in response['Items']:
        print(json.dumps(i, cls=DecimalEncoder))
Because it is searching across all partitions, the scan operation can be very slow. You won't be able to "tune" this query the way you might with a relational database.
In order to best help you, I will need to know more about your access pattern (get movies by year?) and what your table currently looks like (what are your partition keys/sort keys, other attributes, etc).
Unfortunately, scan is slow by nature. There is no way to speed it up at the code level; the real fix is to redesign the table so this access pattern can be served by a Query instead of a Scan.
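In case it helps, here is a minimal sketch of what that redesign could look like, assuming a hypothetical GSI named "year-index" whose partition key is "year"; the 1950-1959 range then becomes one Query per year, and query pagination is omitted for brevity:

from boto3.dynamodb.conditions import Key

items = []
for yr in range(1950, 1960):
    resp = table.query(
        IndexName="year-index",                      # hypothetical index name
        KeyConditionExpression=Key("year").eq(yr),
        ProjectionExpression="#yr, title, info.rating",
        ExpressionAttributeNames={"#yr": "year"},
    )
    items.extend(resp["Items"])

Each Query touches only the items for that year instead of reading the whole table.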
Related
I am new to the AWS world and I need to find the item count in a DynamoDB table.
My table structure is like this.
It has 2 attributes (columns, in MySQL terms), say A and B.
A - the primary partition key; stores the user IDs.
B - stores the user profiles; I need the number of profiles associated with a UserID.
Suppose A contains a user ID 3435 and it has 3 profiles ({"21btet3","3sd4","adf11"})
My requirement is to get the count 3 in the output as JSON, in the format:
How do I set the parameters for this scan?
Can anyone please help?
DynamoDB is NoSQL, so there are some limitations on how you can query the data. In your case you have to scan the entire table, like below:
import boto3

def ScanDynamoData(lastEvaluatedKey):
    table = boto3.resource("dynamodb", "eu-west-1").Table('TableName')  # Add your region and table name
    if lastEvaluatedKey:
        return table.scan(
            ExclusiveStartKey=lastEvaluatedKey
        )
    else:
        return table.scan()
And call this method in a loop until LastEvaluatedKey is no longer returned (to scan all the records), like:
response = ScanDynamoData(None)
totalUserIds = response["Count"]
# Each response holds one page of items; count user IDs and profiles here
while "LastEvaluatedKey" in response:
    response = ScanDynamoData(response["LastEvaluatedKey"])
    totalUserIds += response["Count"]
    # Add the per-page counts here as well
You should not do a full table scan on a regular basis.
If your requirement is to get this count frequently, you should subscribe a Lambda function to DynamoDB Streams and update the count as and when new records are inserted into DynamoDB. This will make sure that:
- you are paying less, and
- you will not have to do a table scan to calculate this number.
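For illustration, a minimal sketch of that Streams-based counter might look like the following; the "Counters" table, its "name" key, and the "count" attribute are assumptions for the example, not part of the original answer.

import boto3

counters = boto3.resource("dynamodb").Table("Counters")  # hypothetical counter table

def handler(event, context):
    # Triggered by a DynamoDB Streams event source mapping on the main table
    delta = 0
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            delta += 1
        elif record["eventName"] == "REMOVE":
            delta -= 1
    if delta:
        counters.update_item(
            Key={"name": "user_count"},
            UpdateExpression="ADD #c :d",
            ExpressionAttributeNames={"#c": "count"},
            ExpressionAttributeValues={":d": delta},
        )

Reading the count later is then a single GetItem instead of a scan.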
I am new to the AWS world. As of now we are writing our logs to DynamoDB on a daily basis for each job execution,
and then each day we generate a summary report from all the data that was posted to DynamoDB on the previous day.
I am facing an issue while fetching the data from DynamoDB to generate the summary report. For fetching the data, I am using the Java client inside my Scala class. The issue is that I am not able to retrieve all the data from DynamoDB for any filter condition, but when I check the DynamoDB console I can see many more records.
I am using the code below:
val client: AmazonDynamoDB = AmazonDynamoDBClientBuilder.standard.build

// Function that returns the filter expression and expression attribute values
val (filterExpression, expressionAttributeValues) = getDynamoDBQuery(inputArgs)

val scanRequest: ScanRequest = new ScanRequest()
  .withTableName("table_name")
  .withFilterExpression(filterExpression)
  .withExpressionAttributeValues(expressionAttributeValues)

client.scan(scanRequest)
After a lot of analysis, it looks like DynamoDB takes a while to fetch all the data for any filter condition (when we scan the dataset), and the Java client is not waiting until all the records are retrieved from DynamoDB. Is there any workaround for this? Please help.
Thanks
DynamoDB returns results in a paginated manner. For a given ScanRequest, the ScanResult exposes getLastEvaluatedKey, whose value should be passed to setExclusiveStartKey of the next ScanRequest to get the next page. You should loop through this until getLastEvaluatedKey in a ScanResult returns null.
BTW, I agree with the previous answer that DynamoDB may not be an ideal choice to store this kind of data from a cost perspective, but you are a better judge of the choice made!
DynamoDB is not meant for the purpose you are using it for. Not only is storage costlier, but querying the data will also be costlier.
DynamoDB is meant to be a transactional key-value store.
You can instead stream the logs through Firehose into S3 and query them with Athena. That is cheaper, scalable, and well suited to analytical use:
Log --> Firehose --> S3 --> Athena
With regard to your question, DynamoDB will not return all the records in a single request. It will return one page of records along with the LastEvaluatedKey.
More documentation on DynamoDB Scan.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html
Hope it helps.
Thanks @Vikdor for your help. I did it the way you suggested and it worked perfectly fine. Below is the code:
var output = new StringBuilder

val client: AmazonDynamoDB = AmazonDynamoDBClientBuilder.standard.build
val (filterExpression, expressionAttributeValues) = getDynamoDBQuery(inputArgs)

var scanRequest: ScanRequest = new ScanRequest()
  .withTableName("watchman-jobs")
  .withFilterExpression(filterExpression)
  .withExpressionAttributeValues(expressionAttributeValues)

var flag: Boolean = false
var scanResult = client.scan(scanRequest)
var items: util.List[util.Map[String, AttributeValue]] = scanResult.getItems
var lastEvaluatedKey: util.Map[String, AttributeValue] = null

do {
  scanRequest = scanRequest.withExclusiveStartKey(lastEvaluatedKey)
  scanResult = client.scan(scanRequest)
  // Skip the first iteration so the initial page of items is not added twice
  if (flag) items.addAll(scanResult.getItems)
  lastEvaluatedKey = scanResult.getLastEvaluatedKey
  flag = true
} while (lastEvaluatedKey != null)

return items
I have an S3 bucket which is constantly being filled with new data. I am using Athena and Glue to query that data, and the problem is that if Glue doesn't know a new partition has been created, it doesn't know it needs to search there. Making an API call to run the Glue crawler every time I need a new partition is too expensive, so the best solution is to tell Glue that a new partition has been added, i.e. to create the new partition in its catalog table. I looked through the AWS documentation but had no luck. I am using Java with AWS. Any help?
You may want to use the batch_create_partition() Glue API to register new partitions. It doesn't require any expensive operation like MSCK REPAIR TABLE or re-crawling.
I had a similar use case, for which I wrote a Python script that does the following.
Step 1 - Fetch the table information and parse from it the details required to register the partitions.
# Fetching table information from glue catalog
logger.info("Fetching table info for {}.{}".format(l_database, l_table))
try:
    response = l_client.get_table(
        CatalogId=l_catalog_id,
        DatabaseName=l_database,
        Name=l_table
    )
except Exception as error:
    logger.error("Exception while fetching table info for {}.{} - {}"
                 .format(l_database, l_table, error))
    sys.exit(-1)

# Parsing table info required to create partitions from table
input_format = response['Table']['StorageDescriptor']['InputFormat']
output_format = response['Table']['StorageDescriptor']['OutputFormat']
table_location = response['Table']['StorageDescriptor']['Location']
serde_info = response['Table']['StorageDescriptor']['SerdeInfo']
partition_keys = response['Table']['PartitionKeys']
Step 2 - Generate a list of dictionaries, where each dictionary contains the information needed to create a single partition. All entries have the same structure, but their partition-specific values (year, month, day, hour) differ.
def generate_partition_input_list(start_date, num_of_days, table_location,
                                  input_format, output_format, serde_info):
    input_list = []  # Initializing empty list
    today = datetime.utcnow().date()
    if start_date > today:  # To handle scenarios if any future partitions are created manually
        start_date = today
    end_date = today + timedelta(days=num_of_days)  # End date till which partitions need to be created
    logger.info("Partitions to be created from {} to {}".format(start_date, end_date))

    for input_date in date_range(start_date, end_date):
        # Formatting partition values by padding required zeroes and converting into string
        year = str(input_date)[0:4].zfill(4)
        month = str(input_date)[5:7].zfill(2)
        day = str(input_date)[8:10].zfill(2)
        for hour in range(24):  # Looping over 24 hours to generate partition input for each hour of the day
            hour = str('{:02d}'.format(hour))  # Padding zero to make sure that hour is in two digits
            part_location = "{}{}/{}/{}/{}/".format(table_location, year, month, day, hour)
            input_dict = {
                'Values': [
                    year, month, day, hour
                ],
                'StorageDescriptor': {
                    'Location': part_location,
                    'InputFormat': input_format,
                    'OutputFormat': output_format,
                    'SerdeInfo': serde_info
                }
            }
            input_list.append(input_dict.copy())
    return input_list
Step 3 - Call the batch_create_partition() API
for each_input in break_list_into_chunks(partition_input_list, 100):
    create_partition_response = client.batch_create_partition(
        CatalogId=catalog_id,
        DatabaseName=l_database,
        TableName=l_table,
        PartitionInputList=each_input
    )
There is a limit of 100 partitions per API call, so if you are creating more than 100 partitions you will need to break your list into chunks and iterate over it.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition
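The snippets above call break_list_into_chunks() and date_range() without showing them; minimal versions might look like this (my own sketch of what they could be, not the original script's code):

from datetime import timedelta

def break_list_into_chunks(full_list, chunk_size):
    # Yield successive chunk_size-sized slices of full_list
    for i in range(0, len(full_list), chunk_size):
        yield full_list[i:i + chunk_size]

def date_range(start_date, end_date):
    # Yield each date from start_date up to (but not including) end_date
    for offset in range((end_date - start_date).days):
        yield start_date + timedelta(days=offset)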
You can configure your Glue crawler to be triggered every 5 minutes.
You can create a Lambda function which will either run on a schedule or be triggered by an event from your bucket (e.g. a putObject event), and that function can call Athena to discover partitions:
import boto3

athena = boto3.client('athena')

def lambda_handler(event, context):
    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE mytable",
        ResultConfiguration={
            'OutputLocation': "s3://some-bucket/_athena_results"
        }
    )
Use Athena to add partitions manually. You can also run SQL queries via the API, as in my Lambda example above.
Example from Athena manual:
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
This question is old but I wanted to put it out there that someone could have s3:ObjectCreated:Put notifications trigger a Lambda function which registers new partitions when data arrives on S3. I would even expand this function to handle deprecations based on object deletes and so on. Here's a blog post by AWS which details S3 event notifications: https://aws.amazon.com/blogs/aws/s3-event-notification/
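Here is a minimal sketch of that idea, assuming an hourly year/month/day/hour key layout and placeholder database/table names; a real function would also handle AlreadyExistsException for partitions that are already registered:

import boto3

glue = boto3.client("glue")

def handler(event, context):
    # Triggered by s3:ObjectCreated:Put notifications
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # e.g. "logs/2016/05/14/00/file.gz" (assumed layout)
        _, year, month, day, hour, _ = key.split("/", 5)
        table = glue.get_table(DatabaseName="mydb", Name="mytable")["Table"]
        sd = dict(table["StorageDescriptor"])
        sd["Location"] = "s3://{}/logs/{}/{}/{}/{}/".format(bucket, year, month, day, hour)
        glue.create_partition(
            DatabaseName="mydb",
            TableName="mytable",
            PartitionInput={"Values": [year, month, day, hour], "StorageDescriptor": sd},
        )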
AWS Glue recently added a RecrawlPolicy that only crawls the new folders/partitions that you add to your S3 bucket.
https://docs.aws.amazon.com/glue/latest/dg/incremental-crawls.html
This should help you avoid crawling all the data again and again. From what I read, you can enable incremental crawls while setting up your crawler or when editing an existing one. One thing to note, however, is that incremental crawls require the schema of new data to be more or less the same as the existing schema.
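As a rough sketch, enabling this on an existing crawler via boto3 might look like the following (the crawler name is a placeholder):

import boto3

glue = boto3.client("glue")
glue.update_crawler(
    Name="my-crawler",  # placeholder crawler name
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
)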
I am trying to query a BigQuery table and load the queried data into a destination table using legacy SQL.
Code:
from google.cloud import bigquery

bigquery_client = bigquery.Client.from_service_account_json(config.ZF_FILE)

job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True
# Allow for query results larger than the maximum response size
job_config.allow_large_results = True

# When large results are allowed, a destination table must be set.
dest_dataset_ref = bigquery_client.dataset('datasetId')
dest_table_ref = dest_dataset_ref.table('datasetId:mydestTable')
job_config.destination = dest_table_ref

query = """ SELECT abc FROM [{0}] LIMIT 10 """.format(mySourcetable_name)

# run the Query here now
query_job = bigquery_client.query(query, job_config=job_config)
Error:
google.api_core.exceptions.BadRequest: 400 POST : Invalid dataset ID "datasetId:mydestTable". Dataset IDs must be alphanumeric (plus underscores, dashes, and colons) and must be at most 1024 characters long.
The job_config.destination gives :
print job_config.destination
TableReference(u'projectName', 'projectName:dataset', 'projectName:dataset.mydest_table')
The datasetId is correct on my side, so why the error?
How do I get the proper destination table reference?
This may be helpful to someone in the future.
It worked by passing only the names, instead of the full IDs, of the dataset and table, as below:
dest_dataset_ref = bigquery_client.dataset('dataset_name')
dest_table_ref = dest_dataset_ref.table('mydestTable_name')
My Athena queries appear to return far fewer results than expected. I'm trying to figure out why.
Setup:
Glue catalog (118.6 GB of data).
Data: Stored in S3 in both CSV and JSON format.
Athena query: When I query a whole table, I only get 40K results per query; there should be about 121 million records for that query, on average, for one month's data.
Does Athena cap query result data? Is this a service limit? (The documentation does not suggest this to be the case.)
So, getting 1000 results at a time obviously doesn't scale. Thankfully, there's a simple workaround. (Or maybe this is how it was supposed to be done all along.)
When you run an Athena query, you should get a QueryExecutionId. This Id corresponds to the output file you'll find in S3.
Here's a snippet I wrote:
import time
from typing import Dict

import boto3
import botocore
import pandas as pd

s3 = boto3.resource("s3")
athena = boto3.client("athena")

response: Dict = athena.start_query_execution(QueryString=query, WorkGroup="<your_work_group>")
execution_id: str = response["QueryExecutionId"]
print(execution_id)

# Wait until the query is finished (get_query_results raises while the query is still running)
while True:
    try:
        athena.get_query_results(QueryExecutionId=execution_id)
        break
    except botocore.exceptions.ClientError as e:
        time.sleep(5)

# Download the CSV that Athena wrote to the query result location
local_filename: str = "temp/athena_query_result_temp.csv"
s3.Bucket("athena-query-output").download_file(execution_id + ".csv", local_filename)
return pd.read_csv(local_filename)  # this snippet lives inside a function
Make sure the corresponding WorkGroup has "Query result location" set, e.g. "s3://athena-query-output/"
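If you prefer to set that result location from code rather than the console, a sketch using boto3 might look like this (the workgroup name and bucket are placeholders):

import boto3

athena = boto3.client("athena")
athena.update_work_group(
    WorkGroup="<your_work_group>",
    ConfigurationUpdates={
        "ResultConfigurationUpdates": {
            "OutputLocation": "s3://athena-query-output/"
        }
    },
)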
Also see this thread with similar answers: How to Create Dataframe from AWS Athena using Boto3 get_query_results method
It seems that there is a limit of 1000 rows per call.
You should use NextToken to iterate over the results.
Quote from the GetQueryResults documentation:
MaxResults - The maximum number of results (rows) to return in this request.
Type: Integer
Valid Range: Minimum value of 0. Maximum value of 1000.
Required: No
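A minimal sketch of the NextToken loop (the execution id is a placeholder obtained from a previous start_query_execution call):

import boto3

athena = boto3.client("athena")
execution_id = "..."  # QueryExecutionId from start_query_execution

rows = []
kwargs = {"QueryExecutionId": execution_id, "MaxResults": 1000}
while True:
    page = athena.get_query_results(**kwargs)
    rows.extend(page["ResultSet"]["Rows"])
    token = page.get("NextToken")
    if not token:
        break
    kwargs["NextToken"] = token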
Another option is the paginate-and-count approach.
I don't know whether there is a better way to do it, such as a select count(*) from table...
Here is complete example code, ready to use, built on the Python boto3 Athena API.
I used a paginator, converted the result into a list of dicts, and also return the count along with the result.
Below are two methods:
The first one paginates;
the second one converts the paginated result into a list of dicts and calculates the count.
Note: converting into a list of dicts is not necessary in this case; if you don't want it, you can modify the code to return only the count.
import time

def get_athena_results_paginator(params, athena_client):
    """
    :param params: dict with 'query', 'database' and 'workgroup' keys
    :param athena_client: boto3 Athena client
    :return: (results, count)
    """
    query_id = athena_client.start_query_execution(
        QueryString=params['query'],
        QueryExecutionContext={
            'Database': params['database']
        },
        # ResultConfiguration={
        #     'OutputLocation': 's3://' + params['bucket'] + '/' + params['path']
        # },
        WorkGroup=params['workgroup']
    )['QueryExecutionId']

    query_status = None
    while query_status == 'QUEUED' or query_status == 'RUNNING' or query_status is None:
        query_status = athena_client.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
        if query_status == 'FAILED' or query_status == 'CANCELLED':
            raise Exception('Athena query with the string "{}" failed or was cancelled'.format(params.get('query')))
        time.sleep(10)

    results_paginator = athena_client.get_paginator('get_query_results')
    results_iter = results_paginator.paginate(
        QueryExecutionId=query_id,
        PaginationConfig={
            'PageSize': 1000
        }
    )
    count, results = result_to_list_of_dict(results_iter)
    return results, count
def result_to_list_of_dict(results_iter):
    """
    :param results_iter: page iterator from the get_query_results paginator
    :return: (count, results) where results is a list of dicts keyed by column name
    """
    results = []
    column_names = None
    count = 0
    for results_page in results_iter:
        for row in results_page['ResultSet']['Rows']:
            column_values = [col.get('VarCharValue', None) for col in row['Data']]
            if not column_names:
                # The first row of the first page holds the column headers
                column_names = column_values
            else:
                count = count + 1
                results.append(dict(zip(column_names, column_values)))
    return count, results
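A hypothetical usage of the two helpers above (all params values are placeholders):

import boto3

athena_client = boto3.client("athena")
params = {
    "query": "SELECT userid FROM mytable LIMIT 5000",
    "database": "mydatabase",
    "workgroup": "primary",
}
results, count = get_athena_results_paginator(params, athena_client)
print(count, results[:3])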