DynamoDB Stream Activity - amazon-web-services

Has anyone experienced an issue when using DynamoDB Streams where new items stop appearing in the stream after a period of inactivity, in this case roughly 24 hours?
The problem I am experiencing (while prototyping a new solution) is:
I create a new stream on a DynamoDB table (a DynamoDB stream, not a Kinesis stream)
I populate items into the table via the API
I pull the shards from the stream and the new items are processed correctly and appear in the shard records (using new and old images for reference)
If I then stop populating new items for a period of 24 hours and afterwards add new items, they do not appear in the new or existing shards
The old items can still be seen when iterating the existing shards (so it is slightly less than 24 hours since they were added); only the new items do not appear at all
I have reproduced the issue on a few occasions and have tried to find an explanation in the documentation, to no avail. I'm not sure if this is happening because I have a free account or if I'm missing a setting somewhere.

It is definitely not due to the free account.
DynamoDB writes the stream data into shards (based on the partition key). Each shard is open for writes for roughly 4 hours and open for reads for 24 hours.
That's the reason you can still read the existing records but new writes no longer land in the shards you already hold; new items go into newly created shards. This information is somewhat hard to find in the docs. I found this reference and this in the official code samples.
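If you are polling the stream yourself with boto3, a minimal sketch of a loop that copes with that behaviour could look like the following; the stream ARN is a placeholder, and a real reader would also handle DescribeStream pagination:

```python
import boto3

streams = boto3.client("dynamodbstreams")
STREAM_ARN = "arn:aws:dynamodb:region:account:table/MyTable/stream/label"  # placeholder

def read_stream_once(stream_arn):
    # Re-describe the stream on every pass: shards close after a few hours and
    # new writes land in newly created shards, so a shard list cached yesterday
    # will never show today's items.
    description = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]
    for shard in description["Shards"]:
        iterator = streams.get_shard_iterator(
            StreamArn=stream_arn,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",  # oldest untrimmed record in this shard
        )["ShardIterator"]
        while iterator:
            response = streams.get_records(ShardIterator=iterator, Limit=100)
            for record in response["Records"]:
                print(record["eventName"], record["dynamodb"].get("NewImage"))
            if not response["Records"]:
                break  # open shards never return a null iterator, so stop when drained
            iterator = response.get("NextShardIterator")
```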

Related

Amazon DynamoDB read latency while writing

I have an Amazon DynamoDB table which is used for both read and write operations. Write operations are performed only when a batch job runs at certain intervals, whereas read operations happen consistently throughout the day.
I am facing a problem of increased read latency whenever a significant amount of write operations is happening due to the batch jobs. I looked into having a separate read replica for DynamoDB but did not find anything of much use. Global tables are not an option because that's not what they are for.
Any ideas how to solve this?
Going by the Dynamo paper, the concept of a read replica for a record or a table does not exist in Dynamo. Within the same region you will have multiple copies of a record, determined by the replication factor N, and reads and writes use quorums that satisfy R + W > N. However, when the client reads, one of those copies is returned depending on the cluster health.
Depending on how the coordinator node is chosen, either at the client library or at the cluster, the client can only ask for a record (get) or send a record (put) either to the cluster coordinator (one extra hop) or to the node assigned to the record (a single hop). There is simply no way for the client to say 'give me a read replica from another node'. The replicas are there for fault tolerance: if the node holding the master copy of the record dies, the replicas will be used.
I am researching the same problem in the context of hot keys. Every record gets assigned to a node in Dynamo, so a million reads on the same record will lead to hot keys, loss of reads/writes, etc. How do you deal with this? A read replica would work great because I could then manage the hot keys at the application level and move the extra reads to the read replica(s). This again is fraught with issues.
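As a small illustration of the quorum condition mentioned above (just the arithmetic from the Dynamo paper, not any DynamoDB API):

```python
# Quorum arithmetic from the Dynamo paper (illustration only).
N, W, R = 3, 2, 2       # total copies, write quorum, read quorum
assert R + W > N        # read and write quorums must overlap in at least one node
# With N = 3, any 2 copies acknowledge a write and any 2 copies answer a read,
# so at least one copy consulted on read already holds the latest write.
```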

Cannot get data (100k+ rows) for a dashboard

Pretty new to DynamoDB and the whole AWS ecosystem; it's very exciting but I feel the learning curve is a bit steep. Anyway, here is my situation and my problem.
We have a React Native mobile app which stores one row into a DynamoDB table each time a user performs a search (the table is a search history with a UUID and then the search criteria). On average we get a few thousand new searches in the table every day. The table has just a primary key, which is the search id.
The app is quite new but we are already reaching a few hundred thousand rows in the table and can expect a million in the coming months. The data is plain simple data with a unique id and strings and numbers in the other attributes. No connections, no relationships, etc. That's when I started to feel DynamoDB may not have been the best choice, but I read everywhere that it can be suitable for anything if properly managed.
Next to this there is a web dashboard which, thanks to a REST API using Node.js Lambdas, queries DynamoDB to compute statistics about the searches: how many searches per day, the list of the last searches... The problem is that DynamoDB is not really suited to querying hundreds of thousands of items (the 1 MB limit, query limitations, credits...).
When I do a scan I get only about 3,000 searches. I tried looping on the scan using the last evaluated key, but after a few tests I got no data back and I exhausted my provisioned throughput. It seems clear that I don't have the right approach to bring all these searches to my web app. So what would be the right approach? My ideas are the following, but I am open to more experienced ones:
Switching to a SQL database (using the AWS migration tooling?). Will it really be easier then?
Creating Lambdas to execute scheduled jobs every night to compute the statistics for each day, so that I don't have to query the full database all the time but just the most recent searches plus the statistics rows (see the sketch after this list)? Is it doable? Any Node.js / Lambda tutorial you may know regarding this?
Better management of indexes? I am still very lost regarding those.
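For idea 2, here is roughly what I have in mind as a minimal Python (boto3) sketch rather than working code; the table names SearchHistory and DailySearchStats and the searchDate attribute are placeholders I made up, and without an index on the date this still scans the whole table:

```python
import boto3
from datetime import date, timedelta
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
history = dynamodb.Table("SearchHistory")    # placeholder table name
stats = dynamodb.Table("DailySearchStats")   # placeholder table name

def handler(event, context):
    """Nightly scheduled Lambda: count yesterday's searches and store one stats row."""
    day = (date.today() - timedelta(days=1)).isoformat()

    count = 0
    kwargs = {
        "FilterExpression": Attr("searchDate").eq(day),  # assumes a searchDate attribute
        "Select": "COUNT",
    }
    while True:
        response = history.scan(**kwargs)
        count += response["Count"]
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key  # keep paging past the 1 MB scan limit

    # One small item per day; the dashboard reads these instead of scanning the raw table.
    stats.put_item(Item={"statDate": day, "searchCount": count})
```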
Looking forward to your opinions.
Add another layer to take care of full-text search.
For example Elasticsearch, Algolia or other similar services.
Notes:
Elasticsearch may cost you a lot compared to DynamoDB.
Reference:
https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-dynamodb-elasticsearch-integration/

Handling DynamoDB read and write units

I am using DynamoDB as the back-end database in my project. I am storing items in the table, each of size 80 KB or more (they contain nested JSON), and my partition key is a unique-valued column (unique for each item). Now I want to perform pagination on this table, i.e. my UI will provide a start integer, a limit integer and two string constants for the type, and my API should retrieve the items from DynamoDB based on the provided query parameters from the UI. I am using the Scan method from boto3 (the Python SDK), but the scan reads all the items from my table before applying my filters and causes a provisioned throughput error, and I cannot afford to either increase my table's throughput or enable auto scaling. Is there any way my problem can be solved? Please give your suggestions.
Do you have a limit set on your scan call? If not, DynamoDB will return up to 1 MB of data by default. You could try using a limit and some kind of sleep or delay in your code, so that you process your table at a slower rate and stay within your provisioned read capacity. If you do this, you'll have to use the LastEvaluatedKey returned to you to page through your table.
Keep in mind that just reading a single one of your 80 KB items uses 10 read capacity units (for an eventually consistent read), and perhaps more if your items are larger.
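A minimal boto3 sketch of that pattern could look like this; the table name, the page size of 20 items and the one-second pause are assumptions to tune against your provisioned capacity:

```python
import time
import boto3

table = boto3.resource("dynamodb").Table("MyTable")  # placeholder table name

def scan_slowly(page_size=20, pause_seconds=1.0):
    """Page through the whole table, keeping each request small and spacing them out."""
    kwargs = {"Limit": page_size}  # caps items read per request, well under the 1 MB limit
    while True:
        response = table.scan(**kwargs)
        for item in response["Items"]:
            yield item
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            break                       # no more pages
        kwargs["ExclusiveStartKey"] = last_key
        time.sleep(pause_seconds)       # throttle ourselves to stay within read capacity

for item in scan_slowly():
    pass  # apply your type filter / pagination logic here
```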

DynamoDB Triggers (Streams + Lambda) : details on TRIM_HORIZON?

I want to process recent updates on a DynamoDB table and save them in another one. Let's say I get updates from an IoT device put irregularly into Table1, and I need to use the N last updates to compute an update in Table2 for the same device, in sync with the original updates (a kind of sliding window).
DynamoDB Triggers (Streams + Lambda) seem quite appropriate for my needs, but I did not find a clear definition of TRIM_HORIZON. In some docs I understand it is the oldest data in Table1 (which can get huge), but in other docs it seems to be 24 hours. Or maybe the oldest data in the stream, which is 24 hours?
So does anyone know the truth about TRIM_HORIZON? Would it even be possible to configure it?
The alternative I see is not to use TRIM_HORIZON but rather to use LATEST and perform a query on Table1. But that sort of defeats the purpose of streams.
Here are the relevant aspects for you, from DynamoDB's documentation (1 and 2):
All data in DynamoDB Streams is subject to a 24 hour lifetime. You can retrieve and analyze the last 24 hours of activity for any given table.
TRIM_HORIZON - Start reading at the last (untrimmed) stream record, which is the oldest record in the shard. In DynamoDB Streams, there is a 24 hour limit on data retention. Stream records whose age exceeds this limit are subject to removal (trimming) from the stream.
So, if you have a Lambda that is continuously processing stream updates, I'd suggest going with LATEST.
Also, since you "need to use the N last updates to compute an update in Table2", you will have to query Table1 for every update anyway, so that you can 'merge' the current update with the previous ones for that device. I don't think you can get around that by using TRIM_HORIZON either.
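A rough sketch of that Lambda, assuming a stream trigger with LATEST and a Table1 keyed on deviceId with a timestamp sort key (those attribute names, the window size N and Table2's shape are all assumptions):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table1 = dynamodb.Table("Table1")
table2 = dynamodb.Table("Table2")

N = 5  # sliding window size, made up for the example

def handler(event, context):
    """Triggered by the DynamoDB stream on Table1 with a batch of recent records."""
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        device_id = record["dynamodb"]["Keys"]["deviceId"]["S"]  # assumed key attribute

        # The stream record only carries this one change, so query Table1 for the
        # N most recent updates of the device to rebuild the sliding window.
        response = table1.query(
            KeyConditionExpression=Key("deviceId").eq(device_id),
            ScanIndexForward=False,  # newest first, assumes a timestamp sort key
            Limit=N,
        )
        window = response["Items"]

        # Whatever aggregate the application needs; averaging a 'value' attribute
        # is just a stand-in here.
        values = [float(item["value"]) for item in window if "value" in item]
        if values:
            table2.put_item(Item={
                "deviceId": device_id,
                "windowAverage": str(sum(values) / len(values)),
            })
```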

Amazon DynamoDB and Provisioned Throughput

I am new to DynamoDB and I'm having trouble getting my head around the Provisioned Throughput.
From what I've read it seems you can use this to set the limit of reads and writes at one time. Have I got that wrong?
Basically what I want to do is store emails that are sent through my software. I currently store them in a MySQL database but the amount of data is very large which is why I am looking at DynamoDB. This data I do not need to access very often but when it's needed, I need to be able to access it.
Last month 142,925 emails were sent and each "row" (or email) in the MySQL table I store them in is around 2.5KB.
Sometimes 1 email is sent, other times there might be 3,000 at one time. There's no way of knowing when or how many will be sent at any given time.
Do you have any suggestions on what my Throughputs should be?
And if I did go over, am I correct in understanding that Amazon throttles it and adds them over time? Or does it just throw an error and that's the end of it?
Thanks so much for your help.
I'm using DynamoDB with the Java SDK. When you have an access burst, Amazon first tries to keep up, even allowing a bit more than the provisioned throughput; after that it starts throttling and also throws exceptions. In our code we use this error to break the requests into smaller batches and sometimes force a sleep to cool things down a bit.
When dealing with your situation it really depends on the type of crunching you need to do "from time to time". How much time do you need to get all the data from the table? Do you really need all of it? And ~100k a month doesn't sound like too much for MySQL in my mind; it all depends on the querying power you need.
Also note that in DynamoDB writes are more expensive than reads, so maybe that alone signals that it is not the best fit for your write-intensive problem.
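In Python terms (the answer above used the Java SDK), the back-off-on-throttle pattern could look roughly like this; the table name, batch size and sleep intervals are placeholders:

```python
import time
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Emails")  # placeholder table name

def write_emails(emails, batch_size=25, base_sleep=1.0):
    """Write items in small batches, sleeping and retrying when DynamoDB throttles us."""
    for start in range(0, len(emails), batch_size):
        batch = emails[start:start + batch_size]
        attempt = 0
        while True:
            try:
                with table.batch_writer() as writer:
                    for email in batch:
                        writer.put_item(Item=email)
                break  # batch accepted, move on
            except ClientError as err:
                if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                    raise
                attempt += 1
                time.sleep(base_sleep * (2 ** attempt))  # exponential back-off to cool down
```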
DynamoDB can get very expensive; I would suggest not storing emails in DynamoDB, as each read and write costs a fair amount. Basically, 1 read unit means up to 4 KB of data read per second and 1 write unit means up to 1 KB of data written per second. As you mentioned, each of your emails is about 2.5 KB, so when searching the data (if you don't have a proper key for looking up the email) the table will be completely scanned, which will cost quite a lot because you will need several read units per email read.
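To make that arithmetic concrete for the 2.5 KB emails mentioned in the question (capacity units round up per item):

```python
import math

ITEM_KB = 2.5

# Write capacity: 1 WCU covers up to 1 KB per second, rounded up per item.
wcu_per_email = math.ceil(ITEM_KB / 1)        # -> 3 WCU to write one email

# Read capacity: 1 RCU covers up to 4 KB per second for a strongly consistent read,
# or twice that for an eventually consistent read.
rcu_strong = math.ceil(ITEM_KB / 4)           # -> 1 RCU per email
rcu_eventual = rcu_strong / 2                 # -> 0.5 RCU per email

print(wcu_per_email, rcu_strong, rcu_eventual)
```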