Synchronize data between two Environments

What are the best practices for keeping QA/UAT data representative of all the scenarios found in the PROD environment?
The intention is to keep the lower environment as close to PROD as possible, so that we can identify and test all scenarios there before deploying changes to production.
One idea is to sync the past X months of data into UAT and strip, randomize, or de-identify personally identifiable information for privacy protection and data security.
I'm looking for suggestions and links to articles or videos.

Let's say you have one table called prod-data. You can create another table named uat-table, then use the DynamoDB stream of the first table and a Lambda function to insert data into uat-table.
In the Lambda function you can:
a. remove the PII
b. set a TTL when inserting into uat-table
c. set a lower concurrency for the Lambda function to limit the number of WCUs consumed
d. set a higher batch size so that fewer WCUs are consumed.
For more information, read this documentation.
OR
You can use the production table directly and grant access only to the non-PII data. Read here.
PS: this solution has a lower cost but has multiple limitations.
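The stream-plus-Lambda option described above could look roughly like the following. This is only a minimal sketch: the PII attribute names, the TTL attribute name, and the 90-day retention are assumptions made for illustration, and it presumes the stream on prod-data is configured to include new images.

```python
import time

# Hypothetical attribute names; adjust these to the real schema.
PII_ATTRIBUTES = {"email", "phone", "full_name"}
TTL_ATTRIBUTE = "expires_at"          # assumed TTL attribute on uat-table
TTL_SECONDS = 90 * 24 * 60 * 60       # assumed retention: about 90 days


def to_uat_item(new_image, now=None):
    """Drop PII attributes from a stream NewImage and add a TTL attribute.

    new_image is in DynamoDB attribute-value format, e.g. {"id": {"S": "1"}}.
    """
    now = int(time.time()) if now is None else int(now)
    item = {k: v for k, v in new_image.items() if k not in PII_ATTRIBUTES}
    item[TTL_ATTRIBUTE] = {"N": str(now + TTL_SECONDS)}
    return item


def handler(event, context):
    """Lambda entry point for a DynamoDB stream event source."""
    import boto3  # imported lazily so to_uat_item can be tested offline

    client = boto3.client("dynamodb")
    written = 0
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # REMOVE events carry no NewImage
        new_image = record["dynamodb"].get("NewImage")
        if new_image is None:
            continue
        client.put_item(TableName="uat-table", Item=to_uat_item(new_image))
        written += 1
    return {"written": written}
```

Concurrency and batch size (points c and d) are set on the Lambda function and its event source mapping rather than in the code.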

Related

Why is DynamoDB sometimes extremely slow?

I am developing an application using DynamoDB. The application is not yet open to the public, so only certain employees can access it.
Generally, the application is very fast and there are no performance issues. Sometimes, however, it is extremely slow.
At first I suspected that the problem came from the React JS application or from the API, but it comes from DynamoDB.
How can I be sure of this?
I tested by stopping Node JS (so the API was offline).
I tested directly in the AWS console, in the "Explore table items" and "PartiQL editor" screens.
DynamoDB was very, very slow, and I got this error:
The level of configured provisioned throughput for one or more global secondary indexes of the table was exceeded.
Consider increasing your provisioning level for the under-provisioned global secondary indexes with the UpdateTable API
I cannot understand this, because no application is running.
So why does DynamoDB become slow?
---> Maybe there is a bug in the API. Engineers are working on that.
But why does DynamoDB keep running slowly when the API is offline?
How can I "restart" and/or "stop" the DynamoDB service?
Best regards
Update: 2022-09-05 17h42 (Japan Time)
I created two videos to illustrate what I am describing (sorry for the delay; to create the videos I had to wait for the database issue to occur):
Normal case: DynamoDB is very, very fast
https://youtu.be/ayeccV0zk0E
Issue case: DynamoDB is very, very slow
https://youtu.be/1u201N2HV8o
---> In my example I have only 52 users, so this is a bug, not normal behavior.
Regards

The error message is giving you a potential cause for your perceived slowness.
I suspect that what you perceive as slowness is because the throughput of the Global Secondary Index your app is reading from is exhausted, and the app (or the AWS SDK) is performing exponential backoff to retry the API call.
The one dimension you scale DynamoDB with aside from the Key schema is Throughput. You decide how many requests per second (it's a bit more complicated than that) DynamoDB can handle, and AWS ensures that load can be served. If you go beyond that, AWS throttles API calls, and you receive the errors.
GSIs have their own throughput that you can manage. I suggest you take a look at the provided metrics to identify where your throughput bottleneck is and adjust the throughput accordingly. If you don't want to deal with throughput at all, switch the table to On-Demand Capacity (Pay per request) and AWS handles that for you at a small premium.
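As a rough illustration of the two adjustments mentioned above, the following sketch builds the parameters for the UpdateTable API via boto3. The table name, index name, and capacity numbers are placeholders, not values taken from the question.

```python
def gsi_throughput_update(table_name, index_name, read_units, write_units):
    """Parameters for update_table that raise one GSI's provisioned throughput."""
    return {
        "TableName": table_name,
        "GlobalSecondaryIndexUpdates": [
            {
                "Update": {
                    "IndexName": index_name,
                    "ProvisionedThroughput": {
                        "ReadCapacityUnits": read_units,
                        "WriteCapacityUnits": write_units,
                    },
                }
            }
        ],
    }


def on_demand_update(table_name):
    """Parameters for update_table that switch the table to on-demand billing."""
    return {"TableName": table_name, "BillingMode": "PAY_PER_REQUEST"}


def apply_update(params):
    """Send the update to DynamoDB (needs AWS credentials and boto3)."""
    import boto3  # imported lazily so the builders above work without AWS access

    return boto3.client("dynamodb").update_table(**params)


# Hypothetical names, for illustration only:
# apply_update(gsi_throughput_update("my-table", "my-index", 10, 10))
# apply_update(on_demand_update("my-table"))
```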

The error message mentions provisioned throughput of a GSI, so it is quite likely that this is your problem:
The DynamoDB GSI documentation https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.ThroughputConsiderations explains that
When you create a global secondary index on a provisioned mode table, you must specify read and write capacity units for the expected workload on that index. The provisioned throughput settings of a global secondary index are separate from those of its base table. A Query operation on a global secondary index consumes read capacity units from the index, not the base table. When you put, update or delete items in a table, the global secondary indexes on that table are also updated. These index updates consume write capacity units from the index, not from the base table.
For example, if you accidentally set a GSI's read provisioning to 1, then on average the index can serve only about one read (of up to 4 KB) per second. A scan that needs to return 10 such items may take several seconds to complete, even if no other application is using the table.
Please read the aforementioned link for the full story on how to provision secondary indexes in DynamoDB.
If this is not your problem, please update your question with details on the provisioned throughput settings of your base table and its GSI.
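To make the arithmetic behind that example concrete, here is a simplified estimate of the read capacity a Query or Scan consumes. It assumes the documented rule that the sizes of the items read are summed and rounded up to the next 4 KB, with eventually consistent reads costing half as much, and it ignores burst capacity, so treat the numbers as a rough guide only.

```python
import math

READ_UNIT_BYTES = 4096  # one read capacity unit covers up to 4 KB


def query_read_units(item_sizes_bytes, eventually_consistent=True):
    """Approximate RCUs consumed by one Query/Scan page over the given items."""
    total = sum(item_sizes_bytes)
    blocks = math.ceil(total / READ_UNIT_BYTES)
    return blocks * (0.5 if eventually_consistent else 1.0)


def seconds_at_provisioned(units, provisioned_rcu):
    """Rough time needed to consume `units` at a steady provisioned rate."""
    return units / provisioned_rcu


# Ten items of 4 KB each, read from an index provisioned with 1 RCU:
units = query_read_units([4096] * 10)
print(units, seconds_at_provisioned(units, 1))
```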

Amazon DynamoDB read latency while writing

I have an Amazon DynamoDB table which is used for both read and write operations. Write operations are performed only when a batch job runs at certain intervals, whereas read operations happen consistently throughout the day.
I am facing increased read latency when a significant number of write operations are happening because of the batch jobs. I looked into having a separate read replica for DynamoDB but found nothing of much use. Global tables are not an option because that's not what they are for.
Any ideas how to solve this?

Going by the Dynamo paper, the concept of a read replica for a record or a table does not exist in Dynamo. Within the same region, you will have multiple copies of a record depending on the replication factor N (with read and write quorums chosen so that R + W > N). However, when the client reads, one of those copies is returned depending on the cluster health.
Depending on how the coordinator node is chosen, either in the client library or in the cluster, the client can only ask for a record (get) or send a record (put) to either the cluster coordinator (one extra hop) or the node assigned to the record (a single hop to the record). There is just no way for the client to say 'give me a read replica from another node'. The replicas are there for fault tolerance: if one of the nodes containing the master copy of the record dies, the replicas will be used.
I am researching the same problem in the context of hot keys. Every record gets assigned to a node in Dynamo, so a million reads on the same record will lead to hot keys, loss of reads/writes, etc. How to deal with this? A read replica would work great, because I could then manage the hot keys in the application and move all extra reads to the read replica(s). But this is again fraught with issues.
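For reference, the quorum condition from the Dynamo paper mentioned above can be checked with a few lines. This is just an illustration of the R + W > N rule for the paper's design; it says nothing about how the managed DynamoDB service is configured internally.

```python
def quorums_overlap(n, r, w):
    """Return True if every read quorum intersects every write quorum.

    n: number of replicas, r: nodes that must answer a read,
    w: nodes that must acknowledge a write.
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


print(quorums_overlap(3, 2, 2))  # a read always sees at least one node holding the latest write
print(quorums_overlap(3, 1, 1))  # a read may hit a replica that missed the latest write
```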

Impact of On-Demand mode on Audit table data for Amazon DynamoDB

I am working on an Amazon DynamoDB audit table.
The read/write capacity mode was set to "Provisioned". Now the mode has been changed to "On-Demand". I have an "Audit Table" (which captures audit information such as the date and time of an operation, user details, etc.) associated with DynamoDB.
My questions on this are:
1) How does this impact the data that gets created in the "Audit Table"?
2) Will the data be deleted automatically on a periodic basis?
3) If not, what is the maximum amount of data that a table (the audit table in this case) can hold?
Please let me know if you need any more information from my side.
Waiting for your answers on my questions.
Thanks and regards,
Mahesh Bongale

1) Provisioned mode just means that the table serves whatever read/write capacity you set, while On-Demand mode works like an auto-scaling mode that aims to deliver the throughput your application needs. Switching between them changes how capacity is managed and billed, not the data in the table. More info: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html
2) No, absolutely not, unless you specifically add code that deletes old data or set a TTL on your data. More info: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TTL.html
3) There is no specific limit on the number of rows in a given table. It can hold as much as you want. There are limits on a few other things, though; some can be lifted if you ask AWS, some cannot: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
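If automatic deletion is wanted for the audit table, the TTL route could be sketched as follows. The attribute name expires_at, the table name, and the retention period are assumptions for illustration; the TTL attribute has to hold an epoch timestamp in seconds.

```python
import time


def ttl_specification(table_name, attribute_name="expires_at"):
    """Parameters for update_time_to_live that enable TTL on one attribute."""
    return {
        "TableName": table_name,
        "TimeToLiveSpecification": {
            "Enabled": True,
            "AttributeName": attribute_name,
        },
    }


def expiry_epoch(retention_days, now=None):
    """Epoch seconds to store in the TTL attribute of a newly written item."""
    now = time.time() if now is None else now
    return int(now) + retention_days * 86400


def enable_ttl(params):
    """Send the TTL setting to DynamoDB (needs AWS credentials and boto3)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    return boto3.client("dynamodb").update_time_to_live(**params)


# Hypothetical usage:
# enable_ttl(ttl_specification("audit-table"))
# item["expires_at"] = {"N": str(expiry_epoch(365))}
```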

AWS hosted data storage for storing simple entities

I need to choose data storage for a simple system. The main purpose of the system is storing events: simple entities with a timestamp, user id, and type. No joins, just a single table.
Stored data will be fetched rarely (compared with writes). I expect the following read operations:
get the latest events for a list of users
get the latest events of a given type for a list of users
I expect about 0.5-1 million writes a day. Data older than 2 years can be removed.
I'm looking for the best-fitting service provided by AWS. I wonder if using Redshift is like taking a sledgehammer to crack a nut?

For your requirement you can use Amazon DynamoDB and define TTL values to remove older items automatically. You get the following advantages:
Fully managed data storage
Able to scale with the need for write throughput (though it can be costly)
Use a sort key containing the timestamp to query the latest items.
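The sort-key idea above might translate into a query like this one. It is a sketch that assumes a table keyed by a partition key named user_id (a string) with the event timestamp as the sort key; both names and the table name are hypothetical. For a list of users, the query would be issued once per user.

```python
def latest_events_query(table_name, user_id, limit=10):
    """Parameters for a DynamoDB Query returning one user's newest events."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "user_id = :u",
        "ExpressionAttributeValues": {":u": {"S": user_id}},
        "ScanIndexForward": False,  # descending sort-key order: newest first
        "Limit": limit,
    }


def fetch_latest(params):
    """Run the query (needs AWS credentials and boto3)."""
    import boto3  # imported lazily so the builder above works without AWS access

    return boto3.client("dynamodb").query(**params)["Items"]


# Hypothetical usage for several users:
# for uid in ["u1", "u2"]:
#     events = fetch_latest(latest_events_query("events", uid, limit=5))
```

Filtering by event type as well would need either a filter expression or a key design that includes the type, which is a separate design choice.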

I would also suggest checking Amazon SimpleDB, as at first glance it looks like a better fit for your requirements.
Please refer to this article, which describes some practical user experience.
http://www.masonzhang.com/2013/06/2-reasons-why-we-select-simpledb.html

DynamoDB - limit on number of tables per account

We are working on deploying our product (currently on-prem) on AWS and are looking at DynamoDB as an alternative to Cassandra, mainly to avoid the DevOps costs associated with a large number of Cassandra clusters.
The DynamoDB documentation says that the per-account limit on the number of tables is 256 per region, but that it can be increased by contacting AWS support. What is the maximum limit per account?
Our product is separated into distinct logical units, where each such unit will have several tables (say 100). Each customer can have several such units. Each logical unit can be backed up (i.e. a snapshot taken), and that snapshot can be restored at any time in the future (to overwrite the current content of all tables). The backup/restore performance, meaning the time taken to take a snapshot or import old data for all the tables, needs to be good: it cannot take several minutes or hours.
We were thinking of using a distinct set of tables for each such logical unit, so that backup/restore is quick using EMR on S3. But if we follow this approach, we will run past the 256-table limit even with one customer. It looks like there are 2 options:
1) Create a new account for each such logical unit for each customer. Is this possible? We will have a main corporate account, I suppose (I am still learning about this), but can it have a set of sub-accounts for our customers using IAM, each of which is considered an independent AWS account?
2) Use each table in a true multi-tenant manner, where the primary key contains the customer id + logical unit id. But in this scenario, when using EMR to back up an entire table, we will need to selectively back up a specific set of rows/items, which may number in the millions, and this will go on while other read/write operations are happening on a different set of items. Is this feasible at large scale?
Any other thoughts on how to approach this?
Thanks for any info.

I would suggest changing the approach: rather than thinking about how to get more tables by creating more accounts, I would think about how to use fewer tables.
Having said that, you could contact support and have the number of tables increased for your account.
I think that you will run into a money problem, due to the current pricing model of provisioning throughput per table.
Many people split tables based on time frame.
E.g. this week's table, last week's table, then move it to last month's table, and so on.
This helps when analyzing the data with EMR/Redshift, so you won't have to pull the whole table every time.
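One way to implement the time-frame split is to derive the table name from the date of each item. The naming scheme below (prefix, ISO year, ISO week) is only an assumed convention for illustration.

```python
from datetime import date


def weekly_table_name(prefix, day):
    """Name of the table holding the given day's data, one table per ISO week."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{prefix}-{iso_year}-w{iso_week:02d}"


print(weekly_table_name("events", date(2022, 9, 5)))  # events-2022-w36
print(weekly_table_name("events", date(2021, 1, 1)))  # events-2020-w53
```

Using the ISO year rather than the calendar year keeps days around New Year in the same table as the rest of their week.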
