We are working on deploying our product (currently on-prem) on AWS and are looking at DynamoDB as an alternative to Cassandra, mainly to avoid the DevOps costs associated with a large number of Cassandra clusters.
The DynamoDB docs say that the per-account limit on the number of tables is 256 per region but that it can be increased by contacting AWS support. What is the maximum this limit can be raised to per account?
Our product is separated into distinct logical units, where each such unit will have several tables (say 100). Each customer can have several such units. Each logical unit can be backed up (i.e. a snapshot taken) and that snapshot can be restored at any time in the future (to overwrite the current content of all tables). The backup/restore performance - the time taken to take a snapshot / import the old data for all the tables - needs to be good - it cannot take several minutes/hours.
We were thinking of using a distinct set of tables for each such logical unit, so that backup/restore is quick using EMR on S3. But if we follow this approach, we will exceed the 256-table limit even with one customer. It looks like there are two options:
1. Create a new account for each such logical unit for each customer. Is this possible? I suppose we will have a main corporate account (I am still learning about this), but can it have a set of sub-accounts for our customers using IAM, each of which is treated as an independent AWS account?
2. Use each table in a true multi-tenant manner, where the primary key contains the customer ID + logical unit ID. But in this scenario, when using EMR to back up an entire table, we would need to selectively back up a specific set of rows/items, which may number in the millions, while other read/write operations continue on a different set of items. Is this feasible at large scale?
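To illustrate what I have in mind for option 2, here is a rough sketch in Python/boto3 (the table name, attribute names, and the "#" separator are only placeholders for the example):

    import boto3
    from boto3.dynamodb.conditions import Key

    # Hypothetical single shared table where the partition key combines the
    # customer ID and the logical unit ID (all names here are made up).
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("shared-entity-table")

    def put_entity(customer_id, unit_id, entity_id, attributes):
        """Write one item keyed by tenant (customer + logical unit)."""
        table.put_item(Item={
            "TenantUnit": f"{customer_id}#{unit_id}",  # partition key: customer id + unit id
            "EntityId": entity_id,                     # sort key within that unit
            **attributes,
        })

    def query_unit(customer_id, unit_id):
        """Fetch every item belonging to one logical unit of one customer."""
        resp = table.query(
            KeyConditionExpression=Key("TenantUnit").eq(f"{customer_id}#{unit_id}")
        )
        return resp["Items"]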
Any other thoughts on how to approach this?
Thanks for any info.
I would suggest changing the approach - rather than thinking about how to get more tables by creating more accounts,
I would think about how to use fewer tables.
Having said that, you could contact support and increase the number of tables for your account.
I think that you will run into a cost problem, due to the current pricing model of provisioning throughput per table.
Many people split tables based on time frame,
e.g. this week's table, last week's table, then roll the data into last month's table and so on.
This helps when analyzing the data with EMR/Redshift - so you won't have to pull the whole table every time.
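As a rough illustration of that pattern (the naming scheme and the weekly granularity below are just assumptions, not anything the docs prescribe):

    import datetime
    import boto3

    dynamodb = boto3.resource("dynamodb")

    def current_events_table():
        """Pick the table for the current ISO week, e.g. 'events_2023_w09' (assumed naming)."""
        year, week, _ = datetime.date.today().isocalendar()
        return dynamodb.Table(f"events_{year}_w{week:02d}")

    # Writers always target this week's table; older tables can be exported to S3
    # with EMR (or simply dropped) without ever scanning one giant table.
    current_events_table().put_item(Item={"id": "example-event", "payload": "..."})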
Related
I am developing an application using DynamoDB. This application is not yet open to the public so only certain employees can access the application.
Generally, the application is very fast and there are no performance issues. Sometimes, however, the application is extremely slow.
At first I suspected that the problem came from the React JS application or from the API, but the problem is from DynamoDB.
How can I confirm this?
I tested by stopping Node JS (so the API was offline).
I tested directly in the AWS console, in the "Explore table items" and "PartiQL editor" screens.
DynamoDB was still very, very slow, and I got this error:
The level of configured provisioned throughput for one or more global secondary indexes of the table was exceeded.
Consider increasing your provisioning level for the under-provisioned global secondary indexes with the UpdateTable API
I cannot understand this, because no application is running.
So why does DynamoDB become slow?
---> Maybe there is a bug in the API. Engineers are working on that.
But why does DynamoDB remain slow when the API is offline?
How can I "restart" and/or "stop" the DynamoDB service?
Best regards
Update: 2022-09-05 17h42 (Japan Time)
I created two videos to illustrate what I mean (sorry for the delay; to create the videos I had to wait for the database bug to occur):
Normal Case: DynamoDB is very very fast
https://youtu.be/ayeccV0zk0E
Issue Case: DynamoDB is very very slow
https://youtu.be/1u201N2HV8o
---> In my example, I have only 52 users, so this is a bug, not normal behavior.
Regards
The error message is giving you a potential cause for your perceived slowness.
I suspect that what you perceive as slowness is because the throughput of the Global Secondary Index your app is reading from is exhausted, and the app (or the AWS SDK) is performing exponential backoff to retry the API call.
The one dimension you scale DynamoDB with aside from the Key schema is Throughput. You decide how many requests per second (it's a bit more complicated than that) DynamoDB can handle, and AWS ensures that load can be served. If you go beyond that, AWS throttles API calls, and you receive the errors.
GSIs have their own throughput that you can manage. I suggest you take a look at the provided metrics to identify where your throughput bottleneck is and adjust the throughput accordingly. If you don't want to deal with throughput at all, switch the table to On-Demand Capacity (Pay per request) and AWS handles that for you at a small premium.
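For example, something along these lines with boto3 would raise the GSI's provisioned throughput, or switch the whole table to on-demand (the table and index names are placeholders):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Option 1: raise the provisioned throughput of the under-provisioned GSI.
    dynamodb.update_table(
        TableName="my-table",  # placeholder
        GlobalSecondaryIndexUpdates=[
            {
                "Update": {
                    "IndexName": "my-gsi",  # placeholder
                    "ProvisionedThroughput": {
                        "ReadCapacityUnits": 50,
                        "WriteCapacityUnits": 10,
                    },
                }
            }
        ],
    )

    # Option 2: switch the table (and its GSIs) to on-demand capacity instead.
    # dynamodb.update_table(TableName="my-table", BillingMode="PAY_PER_REQUEST")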
The error message mentions provisioned throughput of a GSI, so it is quite likely that this is your problem:
The DynamoDB GSI documentation https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.ThroughputConsiderations explains that
When you create a global secondary index on a provisioned mode table, you must specify read and write capacity units for the expected workload on that index. The provisioned throughput settings of a global secondary index are separate from those of its base table. A Query operation on a global secondary index consumes read capacity units from the index, not the base table. When you put, update or delete items in a table, the global secondary indexes on that table are also updated. These index updates consume write capacity units from the index, not from the base table.
For example, if you accidentally set a GSI's read provisioning to 1, then you can only do on average one read per second from this GSI. If you do a scan that needs to return 10 items, it may take around 10 seconds to complete. Even if no other application is using the table.
Please read the aforementioned link for the full story on how to provision secondary indexes in DynamoDB.
If this is not your problem, please update your question with details on the provisioned throughput settings of your base table and its GSI.
We are building a web application to give customers insight into their activity, based on events currently streaming into ElasticSearch. A customer is an organisation sending messages to people.
A concern has been raised that the requirement to host this data for three years implies a very large amount of storage and a high implementation cost with Elasticsearch.
An alternative is to process each day's data into a report CSV stored in S3 and use something like Amazon Athena to perform the queries. Is Athena something that our application can send ad-hoc queries to in response to a web browser request? It is unlikely to generate a large volume of requests all the time, but I'm uncertain what the latency could be like.
Yes, Athena would be a possible solution to this use case – and done right it could also be fairly cheap.
Athena is not a low latency query engine, but for reporting purposes it's usually good enough. There's no way to say for sure without knowing more, but done right we're talking low single digit seconds.
You can approach this in different ways. One is to do as you say: generate a CSV every day, store these for as long as you need, and run queries against them as needed. From your description it sounds like these CSVs would already be aggregates, and I assume they would be significantly less than a megabyte per customer per day. If you partition by customer and month you should be able to run queries for arbitrary time periods in seconds.
Another approach would be to store all your data on S3 and run queries on the full data set. As you stream data into ElasticSearch, stream it to S3 too. Depending on how you do that you probably need some ETL in the form of Lambda functions that partitions the data per customer and time (day or month depending on the volume). You can then run Athena queries on the full historical data set. The downside would be slower queries (double digit seconds for most queries, but I don't know your data volumes), but the upside would be full flexibility on what you can query.
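To address the ad-hoc part of your question: yes, your backend can submit queries to Athena on demand, roughly like the sketch below (the database, table and bucket names are made up, and a real web handler would poll asynchronously instead of sleeping):

    import time
    import boto3

    athena = boto3.client("athena")

    def run_query(sql):
        """Start an Athena query, wait for it to finish, and return the result rows."""
        qid = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "reports"},                       # assumed database
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # assumed bucket
        )["QueryExecutionId"]

        while True:
            state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(0.5)

        if state != "SUCCEEDED":
            raise RuntimeError(f"Athena query ended in state {state}")
        return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]

    # Example: messages sent by one customer in one month (data partitioned by customer and month).
    rows = run_query("SELECT * FROM daily_report WHERE customer = 'acme' AND month = '2019-01'")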
With more details about the particulars of the use case I could be more specific.
Athena is serverless. You can quickly query your data without having to set up and manage any servers or data warehouses. Just point to your data in Amazon S3, define the schema, and start querying using the built-in query editor.
Amazon Athena automatically executes queries in parallel, so most results come back within seconds to minutes.
What are the best practices for keeping QA/UAT data representative of all scenarios in the PROD environment?
The intention is to have the lower environment as close as possible to the PROD environment, so that we can identify and test all scenarios in the lower environment before deploying changes to production.
One idea is to sync the past X months of data into UAT and strip off / randomize / de-identify the personally identifiable information for privacy protection and data security.
Looking for suggestions, links to articles, or videos.
Let's say you have one table called prod-data. You can create another table named uat-table, and use the DynamoDB stream of the first table together with a Lambda function to insert data into uat-table.
In the Lambda function:
a. you can remove the PII information,
b. set a TTL while inserting into uat-table,
c. set a lower concurrency on the Lambda function to limit the number of WCUs consumed,
d. set a higher batch size so that fewer WCUs are consumed.
For more information, read this documentation.
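A minimal sketch of such a Lambda handler, assuming the stream is configured with NEW_IMAGE (or NEW_AND_OLD_IMAGES) and using made-up PII attribute names and TTL settings:

    import time
    import boto3

    dynamodb = boto3.client("dynamodb")

    PII_ATTRIBUTES = {"name", "email", "phone"}   # assumed PII attribute names
    UAT_TTL_SECONDS = 30 * 24 * 3600              # keep UAT copies for ~30 days (assumption)

    def handler(event, context):
        """Triggered by the prod-data table's stream; copies sanitized items into uat-table."""
        for record in event["Records"]:
            if record["eventName"] not in ("INSERT", "MODIFY"):
                continue
            image = record["dynamodb"].get("NewImage")
            if not image:
                continue

            # a. remove the PII information
            item = {k: v for k, v in image.items() if k not in PII_ATTRIBUTES}

            # b. set a TTL so UAT data expires on its own (the 'ttl' attribute name is assumed
            #    and must match the TTL attribute configured on uat-table)
            item["ttl"] = {"N": str(int(time.time()) + UAT_TTL_SECONDS)}

            dynamodb.put_item(TableName="uat-table", Item=item)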
OR
You can use the production table only, giving access to only non-PII data. Read here.
PS: this solution has a lower cost but has multiple limitations.
We are using DynamoDB global tables and planning to use DAX on top of DynamoDB to enable caching. But I don't see any mention of how DAX invalidation will take place in a multi-region setup.
For example, let's say there are 2 clusters, one in us-west-2 and one in us-east-2. If we update something in us-east-2 using the DAX client, its cache will be updated; but when the data is replicated to us-west-2, will the global table update the cache in us-west-2 as well? I don't see any mention of this in the DynamoDB documentation.
The DAX cache will not be updated. Global tables will replicate the data to other regions; however, they won't update the cache. Moreover, the query cache and the item cache are independent.
DAX does not refresh result sets in the query cache with the most current data from DynamoDB. Each result set in the query cache is current as of the time that the Query or Scan operation was performed. Thus, Charlie's Query results do not reflect his PutItem operation. This will be the case until DAX evicts the result set from the query cache.
Write-through policy:
The DAX item cache implements a write-through policy (see How DAX Processes Writes). When you write an item, DAX ensures that the cached item is synchronized with the item as it exists in DynamoDB. This is helpful for applications that need to re-read an item immediately after writing it. However, if other applications write directly to a DynamoDB table, the item in the DAX item cache will no longer be in sync with DynamoDB.
DAX Consistency
In the above statement, you can read "other applications" as the global table replication. DAX is not aware of the replication done for the global table.
At this time, the DAX cache in region two will have no knowledge of the GT replicated write. Your best alternative at the moment is to keep a lower TTL on DAX in both regions, so it fetches the newest version more often.
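If it helps, the DAX item and query cache TTLs are controlled through the cluster's parameter group. A sketch with boto3 (the parameter group name is a placeholder; double-check the record-ttl-millis / query-ttl-millis parameter names against the DAX docs):

    import boto3

    dax = boto3.client("dax")

    # Lower both cache TTLs to 60 seconds so each region re-reads replicated items sooner.
    dax.update_parameter_group(
        ParameterGroupName="my-dax-params",  # placeholder: the parameter group attached to the cluster
        ParameterNameValues=[
            {"ParameterName": "record-ttl-millis", "ParameterValue": "60000"},
            {"ParameterName": "query-ttl-millis", "ParameterValue": "60000"},
        ],
    )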
This has been a consistent problem with AWS service teams. They seem to design things in isolation without considering the related context. I have seen this kind of design inconsistency in several places. In fact, even within DAX and DynamoDB, the two TTL concepts don't take each other into account even though they are related. I don't know when AWS service teams will design things with full context the way Microsoft does for their solutions.
We have a DynamoDB Database that is storing machine sensor information in the "structure" of :
HashKey: MachineNumber (Number)
SortKey: EntryDate (String)
Columns: SensorType (String), SensorValue (Number)
The sensors generate information almost every 3 seconds, and we're looking to measure a (near) real-time KPI to count how many machines in a region were down for more than 10 minutes in the past hour. A region can have close to 10,000 machines, so iterating through DynamoDB takes 10+ minutes to produce a response. What is the best way to do this?
Describing the answer as discussed in comments on the question.
Performing a table scan on a very large table is expensive and should be avoided. DynamoDB Streams provides the ability to process records using your own custom code after they are inserted. This allows for aggregations or other computations to be performed asynchronously in near real time. The result can then be written or updated in a separate DynamoDB table.
You can run the code that processes the DynamoDB Stream messages on your own server (example: EC2), but it is likely easier to just utilize Lambda. Lambda lets you write Java or NodeJS code that will be run on AWS infrastructure that is fully managed so all you need to worry about is the code.
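A rough sketch of such a Lambda (the summary table name, attribute names, and the "last heartbeat" simplification are my assumptions; the real "down for more than 10 minutes within the last hour" rule would need additional windowing logic on top of this):

    import boto3

    dynamodb = boto3.resource("dynamodb")
    status_table = dynamodb.Table("machine-status")  # assumed pre-aggregated summary table

    def handler(event, context):
        """Triggered by the sensor table's stream; keeps a per-machine last-seen summary
        so the KPI query reads one small table instead of scanning millions of readings."""
        for record in event["Records"]:
            if record["eventName"] != "INSERT":
                continue
            keys = record["dynamodb"]["Keys"]
            machine = int(keys["MachineNumber"]["N"])
            entry_date = keys["EntryDate"]["S"]

            # Upsert the latest heartbeat per machine; a scheduled job (or a query on this
            # small table) can then count machines whose last heartbeat is stale.
            status_table.update_item(
                Key={"MachineNumber": machine},
                UpdateExpression="SET LastSeen = :ts",
                ExpressionAttributeValues={":ts": entry_date},
            )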