When we scan a DynamoDB table, we can/should use LastEvaluatedKey to track the progress so that we can resume in case of failures. The documentation says that
LastEvaluateKey is The primary key of the item where the operation stopped, inclusive of the previous result set. Use this value to start a new operation, excluding this value in the new request.
My question is if I start a scan, pause, insert a few rows and resume the scan from the previous LastEvaluatedKey, will I get those new rows after resuming the scan?
My guess is I might miss some of all of the new rows because the new keys will be hashed and the values could be smaller than LastEvaluatedKey.
Is my guess right? Any explanation or documentation links are appreciated.
It is going sequentially through your data, and it does not know about all items that were added in the process:
Scan operations proceed sequentially; however, for faster performance
on a large table or secondary index, applications can request a
parallel Scan operation by providing the Segment and TotalSegments
parameters.
Not only it can miss some of the items that were added after you've started scanning it can also miss some of the items that were added before the scan started if you are using eventually consistent read:
Scan uses eventually consistent reads when accessing the data in a
table; therefore, the result set might not include the changes to data
in the table immediately before the operation began.
If you need to keep track of items that were added after you've started a scan you can use DynamoDB streams for that.
Related
We have a setup where various worker nodes perform computations and update their relative states in a DynamoDB table. The table acts as a kind of history of activity of the worker nodes. A watchdog node needs to periodically scan through the table, and build an object representing the current state of the worker nodes and their jobs. As such, it's important for our application to be able to scan the table and retrieve data in chronological order (i.e. sorted by timestamp). The table will eventually be too large to scan into local memory for later ordering, so we cannot sort it after scanning.
Reading from the AWS documentation about the primary key:
DynamoDB uses the partition key value as input to an internal hash
function. The output from the hash function determines the partition
(physical storage internal to DynamoDB) in which the item will be
stored. All items with the same partition key are stored together, in
sorted order by sort key value.
Documentation on the scan function doesn't seem to mention anything about the order of the returned results. But can that last part in the quote above (the part I emphasized in bold) be interpreted to mean that the results of scans are ordered by the sort key? If I set all partition keys to be the same value, say "0", then use my timestamp as the sort key, can I be guaranteed that the scan operation will return data in chronological order?
Some note:
All code is written in Python, and thus I'm using the boto3 module to perform scan operations.
Our system architect is steadfast against the idea of updating any entries in the table to reflect their current state, or deleting items when the job is complete. We can only ever add to the table, and thus we need to scan through the whole thing each time to determine the worker states.
I am using strong read consistency for scan operations.
Technically SCAN never guarantees order (although as an observation the lack of order guarantee seems to mean that the partition is randomly ordered, but the sort remains, well, sorted.)
What you've proposed will work however, but instead of scanning, you'll be doing a query on partition-key == 0, which will then return all the items with the partition key of 0, (up to limit and optional sorted forward/backwards) sorted by the sort key.
That said, this is really not the way that dynamo wants you to use it. For example, it guarantees your partition will run hot (because you've explicitly put everything on the same partition), and this operation will cost you the capacity of reading every item on the table.
I would recommend investigating patterns such as using a dynamodb stream processed by a lambda to build and maintain a materialised view of this "current state", rather than "polling" the table with this expensive scan and resulting poor key design.
You’re better off using yyyy-mm-dd as the partition key, rather than all 0. There’s a limit of 10 GB of data per partition, which also means you can’t have more than 10 GB of data per partition key value.
If you want to be able to retrieve data sorted by date, take the ISO 8601 time stamp format (roughly yyyy-mm-ddThh-mm-ss.sss), split it somewhere reasonable for your data, and use the first part as the partition key and the second part as the sort key. (Another advantage of this approach is that you can use eventually consistent reads for most of the queries since it’s pretty safe to assume that after a day (or an hour o something) that the data is completely replicated.)
If you can manage it, it would be even better to use Worker ID or Job ID as a partition key, and then you could use the full time stamp as the sort key.
As #thomasmichaelwallace mentioned, it would be best to use DynamoDB streams with Lambda to create a materialized view.
Now, that being said, if you’re dealing with jobs being run on workers, then you should also consider whether you can achieve your goal by using a workflow service rather than a database. Workflows will maintain a job history and/or current state for you. AWS offers Step Functions and Simple Workflow.
I have next table structure:
ID string `dynamodbav:"id,omitempty"`
Type string `dynamodbav:"type,omitempty"`
Value string `dynamodbav:"value,omitempty"`
Token string `dynamodbav:"token,omitempty"`
Status int `dynamodbav:"status,omitempty"`
ActionID string `dynamodbav:"action_id,omitempty"`
CreatedAt time.Time `dynamodbav:"created_at,omitempty"`
UpdatedAt time.Time `dynamodbav:"updated_at,omitempty"`
ValidationToken string `dynamodbav:"validation_token,omitempty"`
and I have 2 Global Secondary Indexes for Value(ValueIndex) filed and Token(TokenIndex) field. Later somewhere in the internal logic I perform the Update of this entity and immediate read of this entity by one of this indexes(ValueIndex or TokenIndex) and I see the expected problem that data is not ready(I mean not yet updated). I can't use ConsistentRead for this cases, because this is Global Secondary Index and it doesn't support this options. As a result I can't run my load tests over this logic, because data is not ready when tests go in 10-20-30 threads. So my question - is it possible to solve this problem somewhere? or should I reorganize my table and split it to 2-3 different tables and move filed like Value, Token to HASH key or SORT key?
GSIs are updated asynchronously from the table they are indexing. The updates to a GSI typically occur in well under a second. So, if you're after immediate read of a GSI after insert / update / delete, then there is the potential to get stale data. This is how GSIs work - nothing you can do about that. However, you need to be really mindful of three things:
Make sure you keep your GSI lean - that is, only project the absolute minimum attributes that you need. Less data to write will make it quicker.
Ensure that your GSIs have the correct provisioned throughput. If it doesn't, it may not be able to keep up with activity in the table and therefore you'll get long delays in the GSI being kept in sync.
If an update causes the keys in the GSI to be updated, you'll need 2 units of throughput provisioned per update. In essence, DynamoDB will delete the item then insert a new item with the keys updated. So, even though your table has 100 provisioned writes, if every single write causes an update to your GSI key, you'll need to provision 200 write units.
Once you've tuned your DynamoDB setup and you still absolutely cannot handle the brief delay in GSIs, you'll probably need to use different technology. For example, even if you decided to split your table into multiple tables, it'll have the same (if not worse) impact. You'll update one table, then try to read the data from another table and you haven't yet inserted the values into a different table.
I suspect that once you tune DynamoDB for your situation, you'll get pretty damn close you what you want.
Similarly to my other previously opened threads, I'm trying to achieve a fast and efficient folder search. I've tried waiting for fvnSearchCompleted event which took forever to arrive and for fvnTableModified with TABLE_ROW_ADDED which has never arrived due to IMsgStore's decision tree (more than 10K mails)
Is it possible to query the IMAPITable associated with the search_folder during the search operation(SetSearchCriteria) and until the fvnSearchCompleted event.
In case it's plausible, a simple IMAPTable->QueryRow call with an infinitive loop is enough for that? the table's order wont change during the search operation, and the cursor will correctly move to the next record?
Edit I've found out that SetSearchCriteria, moves table's cursor position, each time it inserts new records to the search table, is there a way to overcome this behavior for on-the-fly table query?
Let's assume I have billions of rows in my HBase table. The rows in this table change slowly, meaning there will be new rowkeys and some rowkeys get deleted.
I receive lots of events per row. However, there will be very few rows that will not have any events associated with them.
At the end of the day I would like to report on the rows that have not received any events.
My naive solution would be to introduce a cf:c that holds a flag, set the flag to 1 every-time I see an event for it. Then do a full-scan of the table looking for rowkeys that are missing the column-value. That seems like a waste, because I would be looking through 10 billion rows to discover a handful of rowkeys (we are talking about 100s or low 1000s).
Is there a clever way to design the hbase schema such that the rowkeys that are missing events could be found quickly (without going through every row)?
If I understood correctly, you have a rowkey xxxxyyyyzzzz1 ... xxxxyyyyzzzzn.
You have events for some rows and no events for other rows.
c is your flag to know whether events are there or not and you have huge data.
Rule of thumb in HBase: RowFilters are always faster and more efficient than column value filters (for searching that flag, a full table scan is required).
Your approach to scan the entire table for missing events (column value filter) will lead to a full table scan and is not efficient.
Conclusion: You have to use a row key filter to scan such a big table.
So I'd suggest you write the flag in the row key. For example :
0 -- is for no events
1 -- is there are events
xxxxyyyyzzzz1_0 // row with no events
xxxxyyyyzzzz1_1 // row with events
Now you can use a fuzzy row filter to capture missing event rows and take a report.
Option 2 of your another question which was answered by me
Is there a clever HBase Schema to Aid with Discovering Missing Value?
From, my experience with hbase, there is no such thing.
I want to know whether I have to use a dynamodb "Scan" operation for getting a list of all hash key values in a dynamodb table or is there an another "less-expensive" approach to do that. I have tried with a "Query" operation, but it was unsuccessful in my case, since I have to define the table hash key to use this operation. I just want to get a list of all hash key values in the table.
Yes, you need to use the scan method to access every item in the table. You can reduce the size of the data returned to you by setting the attributes_to_get attribute to only what you need(*) -- e.g. just the hash key value. Also, note that scan operations are eventually consistent, so if this database is actively growing, your result set may not include the most recent items added to the table.
(*) This will reduce the amount of bandwidth consumed and make the result less resource-intensive to process on the application side, but it will not reduce the amount of throughput that you are charged. Scan operation charges based on size of the entire item, not just attributes that get returned.
Unfortunately to get a list of hash key values you have to perform a Scan operation. What is your use case? Typically, the application should keep track of hash key values since there needs to be an evenly distributed workload. As a result, a Scan operation for this purpose should not happen frequently.
Edit: note that if you filter out the result using attributes_to_get or projection expression, it will help make the results cleaner but it will not reduce the amount of throughput that you are charged. Scan operation charges based on size of the entire item, not just attributes that get returned.