I realized that you can add a scope for the whole database or just a number of tables.
The question would be, if you're going to synchronize a full database, which would be preferrable, one scopename for the whole database, or would it be better to scope and synchronize per schema?
that would depend on some factors: number of tables, frequency of update, sync direction, etc...
have a look at this: SYNC FRAMEWORK SCOPE AND SQL AZURE DATA SYNC DATASET CONSIDERATIONS
Related
I preface all of this to say I’m still actively learning DynamoDB, and I think an answer to my question will help me understand a few things.
I have an analytics microservice that I’m pushing custom (internal) analytics events into a DynamoDB table. Columns in our Dynamo rows/items include data like:
User ID
IP Address
Event Action
Timestamp
Split Test ID
Split Test Value
One of the main questions we want to pull from this db is:
"How many users saw split test x with values y?"
I’m struggling to understand how I should index my database to account for this kind of requests? I set up a “Keys Only” index targeting Split Test ID, and the query to gather these are fairly efficient, but it only pulls UserID and Split Test ID. Ideally I want an efficient query that returns multiple other associated values as well…
How do I achieve this? Do I need to be doing something much differently? Additionally, if any of my understanding of Dynamo, based on my explanations, sounds completely lacking in some regard, please point me in the right direction!
You're thinking of DynamoDB as a schema-less database, which it obviously is. However, that does not mean that a schema is not important. Schemas in NoSQL databases are usually more important than they are in SQL databases and they are usually less straightforward.
The most important thing to determine how you will store your data is how you will access it. You will have to take into account all the ways that you will want to access your data and ensure it is possible by creating the necessary data columns and necessary indexes. In this case, if you want to know how many times two values are combined in a certain way, you could easily add a column that has these combined values (e.g., splitId#splitValue ) and use that in your indexes.
If you want to know more about advanced patterns and such, I advise you to watch this pretty famous re:invent talk by Rick Houlihan or to read the DynamoDB book.
As a last note, I want to add that switching to a SQL server usually is not the solution. Picking NoSQL over SQL is usually based on non-functional requirements. There is a reason NoSQL databases are used in applications that require very low-latency retrieval of data in huge datasets, but as with everything, trade-offs are the name of the game.
Looking for an advice on AWS architecture. Did some research on my own, but I'm far from an expert and I would really love to hear other opinions. This seems to be a pretty common problem for miscroservice architecture, but AWS looks like a different universe to me with its own rules (and tools), there should be best practices that I'm not aware of yet.
What we have:
SOA: Lambda per entity (usually node.js + DynamoDB)
Some Lambda functions use RDS (MySQL) as a DB (this data was supposed to be used by Quicksight)
GraphQL (AppSync)
First problem occurred when we understood that we have to display in Quicksight the data that is stored in DynamoDB. This was solved by Data Pipeline job that transfers the data from DynamoDB to S3 and then is fetched by Quicksight using Athena. In this case it's acceptable that the data for analysis is not updated in real time.
But now we need to create a table in the main application and combine the data that is stored in different data sources - DynamoDB and MySQL. For example, we have an entity payment with attributes like amount and currency, this data is stored in MySQL. And then there is a contract entity which is stored in DynamoDB. Payment can have a link to a contract (one to many relation). We need to create a table with a list of contracts, so the user can filter contracts by payments attributes like seeing the contracts that have payments in EUR or with total amount > 500 USD. This table must contain real time data and have common data grid features: filtering, sorting, pagination.
Options that I see at the moment:
use SQS to transfer payment attributes from payment service to DynamodDB and store it as a String Set in DynamoDB (e.g. column currencies: ['EUR', 'USD']).
use streams (DynamoDB streams, Kinesis?) to transfer data from DynamoDB to S3, and then query the data with Athena. Not sure it will work for us, I got really bad performance issues with Athena, queries stuck in queue for a couple of minutes, did I do something wrong?
remodel the architecture, merge entities into one DB. Probably this one will take far too long to be allowed by project managers.
Data duplication (and consistency issues as a result) was always a pain for me, but it seems to be unavoidable here.
Any thoughts or links to the articles that might help are highly appreciated.
P.S. The architecture was designed by a previous development team.
I'm not sure how to achieve consistent read across multiple SELECT queries.
I need to run several SELECT queries and to make sure that between them, no UPDATE, DELETE or CREATE has altered the overall consistency. The best case for me would be something non blocking of course.
I'm using MySQL 5.6 with InnoDB and default REPEATABLE READ isolation level.
The problem is when I'm using RDS DataService beginTransaction with several executeStatement (with the provided transactionId). I'm NOT getting the full result at the end when calling commitTransaction.
The commitTransaction only provides me with a { transactionStatus: 'Transaction Committed' }..
I don't understand, isn't the commit transaction fonction supposed to give me the whole (of my many SELECT) dataset result?
Instead, even with a transactionId, each executeStatement is returning me individual result... This behaviour is obviously NOT consistent..
With SELECTs in one transaction with REPEATABLE READ you should see same data and don't see any changes made by other transactions. Yes, data can be modified by other transactions, but while in a transaction you operate on a view and can't see the changes. So it is consistent.
To make sure that no data is actually changed between selects the only way is to lock tables / rows, i.e. with SELECT FOR UPDATE - but it should not be the case.
Transactions should be short / fast and locking tables / preventing updates while some long-running chain of selects runs is obviously not an option.
Issued queries against the database run at the time they are issued. The result of queries will stay uncommitted until commit. Query may be blocked if it targets resource another transaction has acquired lock for. Query may fail if another transaction modified resource resulting in conflict.
Transaction isolation affects how effects of this and other transactions happening at the same moment should be handled. Wikipedia
With isolation level REPEATABLE READ (which btw Aurora Replicas for Aurora MySQL always use for operations on InnoDB tables) you operate on read view of database and see only data committed before BEGIN of transaction.
This means that SELECTs in one transaction will see the same data, even if changes were made by other transactions.
By comparison, with transaction isolation level READ COMMITTED subsequent selects in one transaction may see different data - that was committed in between them by other transactions.
We are using DynamoDB global tables and planning to use DAX on the top of DynamoDB to enable caching. But I don't see any mention of how DAX invalidation will take place in multi-region setup.
For example, let's say there are 2 clusters, one in us-west-2 and one in us-east-2. If we update something in us-east-2 using the DAX client it's cache will be updated but while replicating the data to us-west-2, will global table update cache in us-west-2 as well? I don't see any mention of this in the DynamoDB documentation.
The DAX cache will not be updated. Global tables will replicate the data in other regions. However, it wouldn't update the cache. Even, the query cache and item cache are independent.
DAX does not refresh result sets in the query cache with the most
current data from DynamoDB. Each result set in the query cache is
current as of the time that the Query or Scan operation was performed.
Thus, Charlie's Query results do not reflect his PutItem operation.
This will be the case until DAX evicts the result set from the query
cache.
Write through policy:-
The DAX item cache implements a write-through policy (see How DAX
Processes Writes). When you write an item, DAX ensures that the cached
item is synchronized with the item as it exists in DynamoDB. This is
helpful for applications that need to re-read an item immediately
after writing it. However, if other applications write directly to a
DynamoDB table, the item in the DAX item cache will no longer be in
sync with DynamoDB.
DAX Consistency
In the above statement, you can consider the other application word as global table replication. The DAX wouldn't aware about the replication done for the global table.
At this time, the DAX cache in region two will have no knowledge of the GT replicated write. Your best alternative at the moment is to keep a lower TTL on DAX in both regions, so it fetches the newest version more often.
This has been the consistent problem with AWS service teams. They seems to design things in isolation without worrying about the different related context. I have seen this kind of inconsistency in design at several places. In fact even with in DAX and DynamoDB the 2 TTL concepts doesn’t consider its functions even though they are related. Don’t know when AWS service teams will design things with full context like Microsoft does for their solutions.
I have a terabyte size SQL Server DB table which has only two columns:
Id,
HTML Content
There are few applications that call this Table to retrieve the HTML content by providing the Id of the row.
The DB is residing On-premises, and the maintenance cost and size of it is getting higher and higher. I am thinking to move this DB into AWS Dynamo DB. Reason I have choose Dynamo DB is the cost and the performance I have read about it.
Are the any concerns I should know about before choosing Dynamo DB?
Are the any other services in AWS that I could possibly use over
Dynamo DB?
I understand that SQL Server is a Relational DB, while DynamoDB is no sql. And it seems a No Sql DB could be a potential solution for this scenario. I have no kind of joins nor transactions against that Table. All I am doing with the table is to Insert, and Select.
Are the any concerns I should know about before choosing Dynamo DB?
As with any NoSql bigdata DB, Dynamo is "eventually consistent", so, if your application writes and then immediately reads the same record - you should expect failures (inconsistencies).
I'm not familiar with "Prem" and assuming you mean that you're working with your private servers I feel obligated to provide the following warning: working in the cloud is very different from working with your own servers: requests fail more often, latency pattern is different and you should architect your software to handle these sort of issues. If you're planning on moving to the cloud I'd start with migrating your application and leave the DB to be last.
If you really need real time updates of your data, You should reconsider moving on Dynamo. Also dynamo is useful when you do need a dynamic number of columns for each row. So except the cost, i don't see any benefits here.
If you don't need realtime updates, you can look into AWS Redshift or Google BigQuery, and these will be cheaper solutions compare to Dynamo.
Like you have mentioned, you just have two columns, take a look into "redis" also. A plain key value structure will help in performance. But since Redis stores everything in the Physical memory, costing will be high and you'll still need permanent storage/ DB like SQL, MySQL. So in terms of performance, yes you ll be able to see huge difference. but you'll be more thn the current cost.
How about AWS Aurora? At least AWS claims of 1/10th of cost compare to other SQL/MySQL instances. It have backward compatibility also.