I am new to the AWS ecosystem. At the moment we are using DynamoDB to store our logs on a daily basis, one entry per job execution,
and then each day we generate a summary report from all the data that was posted to DynamoDB on the previous day.
I am facing an issue while fetching the data from DynamoDB when generating the summary report. To fetch the data, I am using the Java client inside my Scala class. The issue is that I am not able to retrieve all the data from DynamoDB for any filter condition, but when I check the DynamoDB UI I can see many more records.
I am using the code below:
val client: AmazonDynamoDB = AmazonDynamoDBClientBuilder.standard.build

// Function that returns the filter expression and expression attribute values
val (filterExpression, expressionAttributeValues) = getDynamoDBQuery(inputArgs)

val scanRequest: ScanRequest = new ScanRequest()
  .withTableName("table_name")
  .withFilterExpression(filterExpression)
  .withExpressionAttributeValues(expressionAttributeValues)

client.scan(scanRequest)
After a lot of analysis, it looks like DynamoDB takes a while to fetch all the data for any filter condition (when we scan the dataset), and the Java client does not wait until all the records have been retrieved from DynamoDB. Is there any workaround for this? Please help.
Thanks
DynamoDB returns results in a paginated manner. For a given ScanRequest, the ScanResult exposes getLastEvaluatedKey, whose value should be passed via setExclusiveStartKey on the next ScanRequest to get the next page. You should loop through this until getLastEvaluatedKey on a ScanResult returns null.
BTW, I agree with the previous answer that DynamoDB may not be an ideal choice to store this kind of data from a cost perspective, but you are a better judge of the choice made!
DynamoDB is not meant for the purpose you are using it for. Not only is storage costlier, querying the data will also be costlier.
DynamoDB is meant to be a transactional key-value store.
You can instead push the logs through Firehose into S3 and query them with Athena. That is cheaper, scalable, and well suited for analytical use.
Log --> Firehose --> S3 --> Athena
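For illustration, pushing a log record into Firehose from Python could look like the sketch below (the delivery stream name and record shape are made up); Firehose then batches the records into S3, where Athena can query them.

import boto3
import json

firehose = boto3.client('firehose')

# Hypothetical delivery stream configured to deliver into an S3 bucket
firehose.put_record(
    DeliveryStreamName='job-logs-stream',
    Record={'Data': (json.dumps({'job_id': 'job-123', 'status': 'SUCCEEDED'}) + '\n').encode('utf-8')}
)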
Regarding your question, DynamoDB will not return all the records in a single request. It returns one page of records along with a LastEvaluatedKey.
More documentation on DynamoDB Scan.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html
Hope it helps.
Thanks @Vikdor for your help. I did it the way you suggested and it worked perfectly fine. Below is the code:
var output = new StringBuilder

val client: AmazonDynamoDB = AmazonDynamoDBClientBuilder.standard.build
val (filterExpression, expressionAttributeValues) = getDynamoDBQuery(inputArgs)

val scanRequest: ScanRequest = new ScanRequest()
  .withTableName("watchman-jobs")
  .withFilterExpression(filterExpression)
  .withExpressionAttributeValues(expressionAttributeValues)

val items = new util.ArrayList[util.Map[String, AttributeValue]]()
var lastEvaluatedKey: util.Map[String, AttributeValue] = null
do {
  // Resume the scan from where the previous page stopped (null on the first request)
  val scanResult = client.scan(scanRequest.withExclusiveStartKey(lastEvaluatedKey))
  items.addAll(scanResult.getItems)
  lastEvaluatedKey = scanResult.getLastEvaluatedKey
} while (lastEvaluatedKey != null)

return items
I use a lambda to detect if there is any isActive record in my table and put_item to update the id if there is.
For example, I have a placeholder record with ID 999999999, if my table query detected there's an active record (isActive = True), it will put_item with the real session_id and other data.
Table record: (screenshot not included here)
My Lambda has the following section (from my CloudWatch logs, the if...else statement works as intended, so the logic is verified). Please ignore any indentation hiccups from copy and paste; the code runs with no issue.
# Keep "isActive = True" when there is already an active status started from another
# source; just update the session_id from 999999999 to the real session_id
else:
    # From an earlier part of the code: the current count_1 value from the table
    count_1 = query["Items"][0]["count_1"]
    print(count_1)  # prints the correct '13' value from the current table id = '999999999'
    table.put_item(
        Item={
            'session_id': session_id,
            'isActive': True,
            'count_1': count_1,
            'count_2': count_2
        },
        ConditionExpression='session_id = :session_id AND isActive = :isActive',
        ExpressionAttributeValues={
            ':session_id': 999999999,
            ':isActive': True
        }
    )
However, my table is not getting a new item, nor is the primary key session_id being updated. The table still looks the same as the image above.
I understand from the documentation that
You cannot use UpdateItem to update any primary key attributes.
Instead, you will need to delete the item, and then use PutItem to
create a new item with new attributes.
but even if put_item is not able to update the primary key, I would at least expect a new item to be created by my code, given that no error is thrown.
Does anybody know what is happening? Thanks.
I resolved it with a different ConditionExpression. After trying multiple troubleshooting approaches I pinpointed the issue to the ConditionExpression.
What I did instead:
added the import from boto3.dynamodb.conditions import Key, Attr,
used ConditionExpression=Attr("session_id").ne(999999999),
and deleted the old placeholder item:
table.delete_item(
    Key={
        'session_id': 999999999
    }
)
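Putting those pieces together, a minimal sketch of the resolution (attribute names from the question, values as placeholders; the condition is evaluated against whatever item, if any, already exists under the new session_id):

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource('dynamodb').Table('sessions')  # hypothetical table name

session_id = 1234567890   # real session id (placeholder value)
count_1, count_2 = 13, 0  # values carried over from the earlier query (placeholders)

# Write the item under the real session_id, guarded by the condition described above
table.put_item(
    Item={
        'session_id': session_id,
        'isActive': True,
        'count_1': count_1,
        'count_2': count_2
    },
    ConditionExpression=Attr('session_id').ne(999999999)
)

# put_item cannot change a primary key, so remove the old placeholder item separately
table.delete_item(Key={'session_id': 999999999})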
Other conditions are available here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/dynamodb.html#ref-dynamodb-conditions
If anyone has a better or easier way, I would like to learn it.
The following code works for me, but it takes 19 minutes for one API request to return a result. An optimized approach would be appreciated. I would prefer not to go for segments, because then I would have to do thread management.
dynamodb = boto3.resource('dynamodb', region_name='us-west-2', endpoint_url="http://localhost:8000")
table = dynamodb.Table('Movies')

fe = Key('year').between(1950, 1959)
pe = "#yr, title, info.rating"
# Expression Attribute Names for Projection Expression only.
ean = {"#yr": "year"}
esk = None

response = table.scan(
    FilterExpression=fe,
    ProjectionExpression=pe,
    ExpressionAttributeNames=ean
)
for i in response['Items']:
    print(json.dumps(i, cls=DecimalEncoder))

# As long as LastEvaluatedKey is in the response, there are still items left to fetch
while 'LastEvaluatedKey' in response:
    response = table.scan(
        ProjectionExpression=pe,
        FilterExpression=fe,
        ExpressionAttributeNames=ean,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    for i in response['Items']:
        print(json.dumps(i, cls=DecimalEncoder))
Because it is searching across all partitions, the scan operation can be very slow. You won't be able to "tune" this query like you might if you were working with a relational database.
In order to best help you, I will need to know more about your access pattern (get movies by year?) and what your table currently looks like (what are your partition keys/sort keys, other attributes, etc).
Unfortunately, scan is slow by nature. There is no way to optimize at the code level except for redesigning the table to optimize for this access pattern.
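If the thread management the question wants to avoid is acceptable after all, a parallel (segmented) scan is the usual code-level mitigation. Below is a rough sketch against the same Movies table; ThreadPoolExecutor keeps the thread handling minimal, and the segment count is an assumption to tune.

import boto3
from concurrent.futures import ThreadPoolExecutor
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb', region_name='us-west-2', endpoint_url="http://localhost:8000")
table = dynamodb.Table('Movies')

TOTAL_SEGMENTS = 4  # tune to table size and provisioned throughput

def scan_segment(segment):
    # Scan one segment of the table, following LastEvaluatedKey until the segment is exhausted
    items = []
    kwargs = {
        'FilterExpression': Key('year').between(1950, 1959),
        'ProjectionExpression': '#yr, title, info.rating',
        'ExpressionAttributeNames': {'#yr': 'year'},
        'Segment': segment,
        'TotalSegments': TOTAL_SEGMENTS,
    }
    while True:
        response = table.scan(**kwargs)
        items.extend(response['Items'])
        if 'LastEvaluatedKey' not in response:
            return items
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    all_items = [item for segment_items in pool.map(scan_segment, range(TOTAL_SEGMENTS))
                 for item in segment_items]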
I am new to the AWS world and I need to find the count of data in a DynamoDB table.
My table structure is like this.
It has 2 attributes (columns in MySQL terms), say A and B:
A - stores the user ids (the primary partition key).
B - stores the user profiles, i.e. the profiles associated with a user ID.
Suppose A contains a user ID 3435 and it has 3 profiles ({"21btet3","3sd4","adf11"}).
My requirement is to get the count 3 in the output as JSON, in the format:
How do I set the parameters to scan for this?
Can anyone please help?
DynamoDB is NoSQL, so there are some limitations in terms of querying the data. In your case you have to scan the entire table, as below:
def ScanDynamoData(lastEvaluatedKey):
    table = boto3.resource("dynamodb", "eu-west-1").Table('TableName')  # Add your region and table name
    if lastEvaluatedKey:
        return table.scan(
            ExclusiveStartKey=lastEvaluatedKey
        )
    else:
        return table.scan()
And call this method in a loop until LastEvaluatedKey is no longer returned (to scan all the records), like this:
response = ScanDynamoData(None)
totalUserIds = response["Count"]
# The response also contains the items for this page, so you can count user ids and profiles here
while "LastEvaluatedKey" in response:
    response = ScanDynamoData(response["LastEvaluatedKey"])
    totalUserIds += response["Count"]
    # Add the per-page counts here as well
You should not do a full table scan on a regular basis.
If your requirement is to get this count frequently, you should subscribe a Lambda function to DynamoDB Streams and update the count as and when new records are inserted into DynamoDB (see the sketch below). This will make sure that
you are paying less, and
you will not have to do a table scan to calculate this number.
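A rough sketch of such a Lambda, assuming the source table's stream triggers it and a separate, hypothetical table holds the running total:

import boto3

dynamodb = boto3.resource('dynamodb')
counts_table = dynamodb.Table('user-profile-counts')  # hypothetical aggregate table

def lambda_handler(event, context):
    # Triggered by the source table's DynamoDB stream; count only newly inserted records
    inserted = sum(1 for record in event['Records'] if record['eventName'] == 'INSERT')
    if inserted:
        counts_table.update_item(
            Key={'stat': 'total_user_ids'},
            UpdateExpression='ADD #c :n',
            ExpressionAttributeNames={'#c': 'count'},
            ExpressionAttributeValues={':n': inserted}
        )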
I have an S3 bucket which is constantly being filled with new data. I am using Athena and Glue to query that data, but if Glue doesn't know that a new partition has been created, it doesn't know that it needs to search there. Making an API call to run the Glue crawler each time I need a new partition is too expensive, so the best solution is to tell Glue that a new partition has been added, i.e. to create the new partition in its catalog table. I looked through the AWS documentation but had no luck. I am using Java with AWS. Any help?
You may want to use the batch_create_partition() Glue API to register new partitions. It doesn't require any expensive operation like MSCK REPAIR TABLE or re-crawling.
I had a similar use case, for which I wrote a Python script that does the following.
Step 1 - Fetch the table information and parse from it the pieces needed to register the partitions.
# Fetching table information from glue catalog
logger.info("Fetching table info for {}.{}".format(l_database, l_table))
try:
    response = l_client.get_table(
        CatalogId=l_catalog_id,
        DatabaseName=l_database,
        Name=l_table
    )
except Exception as error:
    logger.error("Exception while fetching table info for {}.{} - {}"
                 .format(l_database, l_table, error))
    sys.exit(-1)

# Parsing table info required to create partitions from table
input_format = response['Table']['StorageDescriptor']['InputFormat']
output_format = response['Table']['StorageDescriptor']['OutputFormat']
table_location = response['Table']['StorageDescriptor']['Location']
serde_info = response['Table']['StorageDescriptor']['SerdeInfo']
partition_keys = response['Table']['PartitionKeys']
Step 2 - Generate a list of dictionaries, where each dictionary contains the information needed to create a single partition. All dictionaries have the same structure, but their partition-specific values (year, month, day, hour) change.
def generate_partition_input_list(start_date, num_of_days, table_location,
                                  input_format, output_format, serde_info):
    input_list = []  # Initializing empty list
    today = datetime.utcnow().date()
    if start_date > today:  # To handle scenarios if any future partitions are created manually
        start_date = today
    end_date = today + timedelta(days=num_of_days)  # Getting end date till which partitions need to be created
    logger.info("Partitions to be created from {} to {}".format(start_date, end_date))

    # date_range is a helper (not shown here) that yields each date from start_date to end_date
    for input_date in date_range(start_date, end_date):
        # Formatting partition values by padding required zeroes and converting into string
        year = str(input_date)[0:4].zfill(4)
        month = str(input_date)[5:7].zfill(2)
        day = str(input_date)[8:10].zfill(2)
        for hour in range(24):  # Looping over 24 hours to generate partition input for each hour of the day
            hour = str('{:02d}'.format(hour))  # Padding zero to make sure that hour is in two digits
            part_location = "{}{}/{}/{}/{}/".format(table_location, year, month, day, hour)
            input_dict = {
                'Values': [
                    year, month, day, hour
                ],
                'StorageDescriptor': {
                    'Location': part_location,
                    'InputFormat': input_format,
                    'OutputFormat': output_format,
                    'SerdeInfo': serde_info
                }
            }
            input_list.append(input_dict.copy())
    return input_list
Step 3 - Call the batch_create_partition() API
for each_input in break_list_into_chunks(partition_input_list, 100):
    create_partition_response = client.batch_create_partition(
        CatalogId=catalog_id,
        DatabaseName=l_database,
        TableName=l_table,
        PartitionInputList=each_input
    )
There is a limit of 100 partitions in a single API call, so if you are creating more than 100 partitions you will need to break your list into chunks and iterate over them.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition
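break_list_into_chunks isn't shown in the answer; a minimal version consistent with how it is called above could be:

def break_list_into_chunks(input_list, chunk_size):
    # Yield successive chunk_size-sized slices of input_list
    for i in range(0, len(input_list), chunk_size):
        yield input_list[i:i + chunk_size]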
You can configure your Glue crawler to be triggered every 5 minutes.
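For example, with boto3 (the crawler name is hypothetical; the schedule is a Glue cron expression):

import boto3

glue = boto3.client('glue')

# Run the crawler every 5 minutes
glue.update_crawler_schedule(
    CrawlerName='my-crawler',
    Schedule='cron(0/5 * * * ? *)'
)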
You can create a Lambda function which will either run on a schedule or be triggered by an event from your bucket (e.g. a putObject event), and that function could call Athena to discover partitions:
import boto3

athena = boto3.client('athena')

def lambda_handler(event, context):
    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE mytable",
        ResultConfiguration={
            'OutputLocation': "s3://some-bucket/_athena_results"
        }
    )
Use Athena to add partitions manually. You can also run SQL queries via the API, as in my Lambda example.
Example from Athena manual:
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
This question is old but I wanted to put it out there that someone could have s3:ObjectCreated:Put notifications trigger a Lambda function which registers new partitions when data arrives on S3. I would even expand this function to handle deprecations based on object deletes and so on. Here's a blog post by AWS which details S3 event notifications: https://aws.amazon.com/blogs/aws/s3-event-notification/
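As a rough sketch (the database, table, and key layout are all assumptions), such a Lambda could parse the partition values out of the object key and register the partition with Glue, reusing the table's storage descriptor much like the batch_create_partition answer above:

import boto3
from urllib.parse import unquote_plus

glue = boto3.client('glue')

DATABASE = 'my_database'  # hypothetical catalog database
TABLE = 'my_table'        # hypothetical catalog table

def lambda_handler(event, context):
    # Reuse the table's storage descriptor so new partitions inherit formats and serde
    sd = glue.get_table(DatabaseName=DATABASE, Name=TABLE)['Table']['StorageDescriptor']
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])
        # Assumes keys look like: logs/<year>/<month>/<day>/<hour>/<file>
        prefix, year, month, day, hour = key.split('/')[:5]
        location = 's3://{}/{}/{}/{}/{}/{}/'.format(bucket, prefix, year, month, day, hour)
        try:
            glue.create_partition(
                DatabaseName=DATABASE,
                TableName=TABLE,
                PartitionInput={
                    'Values': [year, month, day, hour],
                    'StorageDescriptor': dict(sd, Location=location)
                }
            )
        except glue.exceptions.AlreadyExistsException:
            pass  # Partition was already registered by an earlier object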
AWS Glue recently added a RecrawlPolicy that only crawls the new folders/partitions that you add to your S3 bucket.
https://docs.aws.amazon.com/glue/latest/dg/incremental-crawls.html
This should help you avoid crawling all the data again and again. From what I read, you can enable incremental crawls while setting up your crawler, or by editing an existing one (see the sketch below). One thing to note, however, is that incremental crawls require the schema of new data to be more or less the same as the existing schema.
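For example, with boto3 (the crawler name is hypothetical):

import boto3

glue = boto3.client('glue')

# Only crawl folders added since the last crawl
glue.update_crawler(
    Name='my-crawler',
    RecrawlPolicy={'RecrawlBehavior': 'CRAWL_NEW_FOLDERS_ONLY'}
)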
I am trying to query my DynamoDB table using paginator.paginate.
Here is my code:
for page_1 in paginator.paginate(TableName=chroma_organization_data_table,
                                 FilterExpression='#s = Approved',
                                 ProjectionExpression="#s, organizationId",
                                 ExpressionAttributeNames={'#s': 'status'}
                                 ):
    print page_1
However, I get nothing returned back. I know there are several entries that are in the 'Approved' state.
This is how my DynamoDB returns data when there are no conditions on it (no FilterExpression), for example:
[{u'organizationId': {u'S': u'323454354525'}, u'status': {u'S': u'Approved'}}]
So clearly there is an entry whose status is 'Approved'; it's just that when I use the paginator with the filter, it doesn't work.
What can I do about this?
You cannot embed string literals in filter/condition expressions. You need to pass an ExpressionAttributeValues map (with the low-level client this looks like { ":approved": { "S": "Approved" } }) and then update your filter expression to #s = :approved.
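Putting that together with the question's code, the paginate call would look roughly like this (the table name is a placeholder; note that the low-level client expects typed attribute values):

import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

TABLE_NAME = 'chroma-organization-data'  # placeholder for the question's table variable

for page in paginator.paginate(TableName=TABLE_NAME,
                               FilterExpression='#s = :approved',
                               ProjectionExpression='#s, organizationId',
                               ExpressionAttributeNames={'#s': 'status'},
                               ExpressionAttributeValues={':approved': {'S': 'Approved'}}):
    print(page['Items'])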