How to achieve consistent read across multiple SELECT using AWS RDS DataService (Aurora Serverless) - amazon-web-services

I'm not sure how to achieve consistent read across multiple SELECT queries.
I need to run several SELECT queries and make sure that, between them, no UPDATE, DELETE or CREATE has altered the overall consistency. Ideally the solution would be non-blocking, of course.
I'm using MySQL 5.6 with InnoDB and default REPEATABLE READ isolation level.
The problem is that when I use RDS DataService beginTransaction with several executeStatement calls (passing the provided transactionId), I am NOT getting the full result at the end when calling commitTransaction.
commitTransaction only returns { transactionStatus: 'Transaction Committed' }.
I don't understand; isn't the commit transaction function supposed to give me the whole dataset (of my many SELECTs) as a result?
Instead, even with a transactionId, each executeStatement returns its own individual result... This behaviour does not look consistent to me.

With SELECTs in one transaction under REPEATABLE READ you see the same data and do not see any changes made by other transactions. Yes, data can still be modified by other transactions, but while inside your transaction you operate on a consistent read view and cannot see those changes. So it is consistent.
The only way to make sure that no data is actually changed between the selects is to lock the tables / rows, e.g. with SELECT ... FOR UPDATE - but that should not be necessary here.
Transactions should be short / fast, and locking tables / preventing updates while some long-running chain of selects runs is obviously not an option.
Queries issued against the database run at the time they are issued. Their results stay uncommitted until commit. A query may be blocked if it targets a resource another transaction holds a lock on, and it may fail if another transaction modified a resource in a conflicting way.
The transaction isolation level determines how the effects of this transaction and of other transactions running at the same time are handled (see the Wikipedia article on isolation levels).
With isolation level REPEATABLE READ (which, by the way, Aurora Replicas for Aurora MySQL always use for operations on InnoDB tables) you operate on a consistent read view of the database and see only data committed before the snapshot was taken (at the first read of the transaction, or at BEGIN when using START TRANSACTION WITH CONSISTENT SNAPSHOT).
This means that SELECTs in one transaction will all see the same data, even if changes are committed by other transactions in the meantime.
By comparison, with isolation level READ COMMITTED, subsequent selects in one transaction may see different data that other transactions committed in between them.
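To make the mechanics concrete, here is a minimal sketch using the Data API from Python with boto3 (the ARNs, database and table names are placeholders): each executeStatement returns its own records, which you collect as you go, while commitTransaction only acknowledges the commit:

import boto3

# Placeholders - replace with your own cluster ARN, secret ARN and database.
CLUSTER_ARN = "arn:aws:rds:eu-west-1:123456789012:cluster:my-aurora-cluster"
SECRET_ARN = "arn:aws:secretsmanager:eu-west-1:123456789012:secret:my-secret"
DATABASE = "mydb"

client = boto3.client("rds-data")

# Start a transaction; all statements that pass this transactionId run on
# the same connection and therefore on the same REPEATABLE READ snapshot.
tx = client.begin_transaction(
    resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, database=DATABASE
)
tx_id = tx["transactionId"]

results = {}
try:
    # Each executeStatement returns its own rows immediately.
    for name, sql in [
        ("orders", "SELECT id, status FROM orders"),       # placeholder tables
        ("customers", "SELECT id, name FROM customers"),
    ]:
        resp = client.execute_statement(
            resourceArn=CLUSTER_ARN,
            secretArn=SECRET_ARN,
            database=DATABASE,
            sql=sql,
            transactionId=tx_id,
        )
        results[name] = resp["records"]

    # commitTransaction only acknowledges the commit; it does not return
    # any of the SELECT data.
    status = client.commit_transaction(
        resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, transactionId=tx_id
    )
    print(status["transactionStatus"])  # 'Transaction Committed'
except Exception:
    client.rollback_transaction(
        resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, transactionId=tx_id
    )
    raise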

Related

AWS Redshift Auto MV feature causing performance issues

We are seeing a performance issue whenever the Auto MV feature creates/refreshes an automatic materialized view on a table that holds a huge volume of data (> 400 GB). This badly impacts the existing workload.
I am still investigating whether we can disable Auto MV for the specific table or whether we have to disable it for the entire cluster.
We have tried every other option but the issue is still there.
This table contains only 3 months of data.
Will there be any impact if we disable Auto MV? Is there any solution?
Since the comment discussion has grown, let me add an answer to the question with more detail.
When a query runs it consumes CPU cycles AND main memory AND disk bandwidth AND network bandwidth. Every query needs all of these resources in varying amounts. Redshift WLM has the ability to contain (somewhat successfully) the amount of CPU and main memory a query uses, but it doesn't have a good way to put fences around disk and network traffic. This means that a "greedy" query can easily impact other queries by tying up the network and disk subsystems. Given the size of the table you mention, I expect this is what you are seeing.
A materialized view is a way to pre-compute the results of a query and store the result for later use at reduced cost. These views need to be refreshed when the source data changes or the result will be out of date. AutoMV is based on Redshift detecting when an expensive query is run multiple times without the source data changing, and using that query to create an MV without user intervention. It then detects when a query could use the MV and rewrites the query on the fly, all without the user knowing what is happening. It will also refresh the MV automatically when database usage is "low". All of this sounds good in theory.
The concern is that if you have a really expensive, brown-out-the-cluster query that runs often, then Redshift will move it to an AutoMV. Now this painful query will run less often, but at unexpected times. This should be an improvement in overall cluster workload, but you have lost control of when this query will impact your database.
The root issue is that the underlying query is too expensive, possibly due to a poorly written query (in Redshift terms) or to data model inefficiencies. Either way, the "brown-out" events (likely) won't stop until the query is addressed. [There are other reasons a query can be too expensive, but query and data model design are the most likely.]
Turning off AutoMV will just make Redshift rerun the query every time and not use the MV. The impact of this will likely be an increased load on the cluster, but you will regain control of when these queries run. Now, it is possible that Redshift is "generalizing" the query you issue so that the MV can be used in more cases, and that it is this "generalized" query that is the issue. I doubt it, but if you can show this, submit a bug to Amazon. So I expect things will get worse with AutoMV off.
Generally, Redshift doesn't like queries that "generate" lots of new data during execution. These "fat in the middle" queries, when run against large tables like the one you describe, can easily lead to brown-out events. Inequality JOINs or poorly written JOINs are likely causes, so look for these. Often DISTINCT or GROUP BY are used to pare the results back down by removing the many duplicates the query generates. Look for queries in your catalog tables that have large spill (svl_query_summary) or high network traffic (stl_dist).
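For example, something along these lines can surface candidates (a sketch in Python with psycopg2; the connection details are placeholders, and the exact svl_query_summary / stl_dist columns used here are assumptions worth checking against your Redshift version):

import psycopg2

# Connection details are placeholders; adjust for your cluster.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="admin",
    password="...",
)

# Heuristic: steps that spilled to disk point at "fat in the middle" queries.
spill_sql = """
    SELECT query, SUM(rows) AS total_rows, SUM(bytes) AS total_bytes
    FROM svl_query_summary
    WHERE is_diskbased = 't'
    GROUP BY query
    ORDER BY total_bytes DESC
    LIMIT 20;
"""

# Heuristic: queries that moved a lot of data across the network.
dist_sql = """
    SELECT query, SUM(bytes) AS bytes_distributed
    FROM stl_dist
    GROUP BY query
    ORDER BY bytes_distributed DESC
    LIMIT 20;
"""

with conn, conn.cursor() as cur:
    cur.execute(spill_sql)
    print("Top spilling queries:", cur.fetchall())
    cur.execute(dist_sql)
    print("Top distributing queries:", cur.fetchall())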
If you can identify the query (or class of queries) that is causing your issues, then this forum can likely help in addressing the performance issues you are experiencing.

DynamoDB Transaction Write

DynamoDB's transaction write states:
Multiple transactions updating the same items simultaneously can cause conflicts that cancel the transactions. We recommend following DynamoDB best practices for data modeling to minimize such conflicts.
If there are multiple simultaneous TransactWriteItems requests on the same item, will all of the transactional write requests fail with TransactionCanceledException, or will at least one request succeed?
This depends entirely on how the conflicts happen. If all the writes in a transaction succeed, the transaction succeeds. If some other write/transaction modifies one of the items while the transaction is in flight, it fails with TransactionCanceledException.
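For illustration, here is a minimal sketch in Python with boto3 (the Accounts / Ledger tables, keys and amounts are made up) showing how a cancelled transaction surfaces; where the response includes CancellationReasons, it tells you which item caused the cancellation, and the caller would typically retry with backoff:

import boto3
from botocore.exceptions import ClientError

client = boto3.client("dynamodb")

try:
    client.transact_write_items(
        TransactItems=[
            {
                "Update": {
                    "TableName": "Accounts",          # hypothetical table
                    "Key": {"pk": {"S": "user#1"}},
                    "UpdateExpression": "SET balance = balance - :amt",
                    "ConditionExpression": "balance >= :amt",
                    "ExpressionAttributeValues": {":amt": {"N": "10"}},
                }
            },
            {
                "Put": {
                    "TableName": "Ledger",            # hypothetical table
                    "Item": {"pk": {"S": "txn#123"}, "amount": {"N": "10"}},
                    "ConditionExpression": "attribute_not_exists(pk)",
                }
            },
        ]
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "TransactionCanceledException":
        # One entry per TransactItems element; conflicting items (or failed
        # conditions) carry a non-None Code/Message.
        print(err.response.get("CancellationReasons"))
        # Retry with backoff here if appropriate.
    else:
        raise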

Locking Behavior In Spanner vs MySQL

I'm exploring moving an application built on top of MySQL into Spanner and am not sure if I can replicate certain functionality from our MySQL db.
Basically, a simplified version of our MySQL schema would look like this:
users: id, name, balance
user_transactions: id, user_id, external_id, amount
user_locks: user_id, date
When the application receives a transaction for a user, it starts a MySQL transaction, updates the user_locks row for that user, checks whether the user has sufficient balance for the transaction, creates a new transaction record, and then updates the balance. The application may receive transactions for the same user at the same time, and the lock forces them to be processed sequentially.
Is it possible to replicate this in Spanner? How would I do so? Basically, if the application receives two transactions at the same time, I want to ensure that they are given an order and that the data changed by the first transaction is visible to the second.
Cloud Spanner does this by default, since it provides serializability, which means that all transactions appear to have occurred in serial order. You can read more about the transaction semantics here:
https://cloud.google.com/spanner/docs/transactions#rw_transaction_semantics
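For illustration, a read-write transaction in the Python client could look roughly like this (a sketch; the instance/database IDs, literal values and exact column types are assumptions based on the simplified schema above). Because Spanner serializes conflicting read-write transactions on the rows they touch, the explicit user_locks table is generally not needed:

from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")       # placeholder instance ID
database = instance.database("my-database")     # placeholder database ID


def apply_user_transaction(transaction, txn_id, user_id, external_id, amount):
    # Read the current balance inside the read-write transaction; Spanner
    # locks the row, so concurrent transactions on it are serialized.
    rows = list(transaction.execute_sql(
        "SELECT balance FROM users WHERE id = @user_id",
        params={"user_id": user_id},
        param_types={"user_id": spanner.param_types.INT64},
    ))
    if not rows or rows[0][0] < amount:
        raise ValueError("unknown user or insufficient balance")

    transaction.execute_update(
        "INSERT INTO user_transactions (id, user_id, external_id, amount) "
        "VALUES (@id, @user_id, @external_id, @amount)",
        params={"id": txn_id, "user_id": user_id,
                "external_id": external_id, "amount": amount},
        param_types={"id": spanner.param_types.INT64,
                     "user_id": spanner.param_types.INT64,
                     "external_id": spanner.param_types.STRING,
                     "amount": spanner.param_types.INT64},
    )
    transaction.execute_update(
        "UPDATE users SET balance = balance - @amount WHERE id = @user_id",
        params={"amount": amount, "user_id": user_id},
        param_types={"amount": spanner.param_types.INT64,
                     "user_id": spanner.param_types.INT64},
    )


# The whole function runs as one transaction and is retried automatically
# if it aborts because of a conflicting transaction.
database.run_in_transaction(apply_user_transaction, 123, 1, "ext-42", 10)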

Reflecting changes on big tables in hdfs

I have an order table in the OLTP system.
Each order record has an OrderStatus field.
When an end user creates an order, the OrderStatus field is set to "Open".
When somebody cancels the order, the OrderStatus field is set to "Canceled".
When the order process is finished (the order is transformed into an invoice), the OrderStatus field is set to "Close".
There are more than one hundred million records in this table in the OLTP system.
I want to design and populate a data warehouse and data marts on the HDFS layer.
In order to design the data marts, I need to import the whole order table into HDFS and then reflect changes on the table continuously.
First, I can import the whole table into HDFS in an initial load process using Sqoop. It may take a long time, but I only need to do it once.
When an order record is updated or a new order record is entered, I need to reflect the change in HDFS. How can I achieve this in HDFS for such a big transaction table?
Thanks
One of the easier ways is to work with database triggers in your OLTP source DB: every time an insert or update happens, the trigger pushes an update event to your Hadoop environment.
On the other hand (this depends on the requirements of your data users), it might be enough to reload the whole data dump every night.
Also, if there is some kind of last-changed timestamp, it might be possible to load only the newest data and do some kind of delta check.
This all depends on your data structure, your requirements and the resources at hand.
There are several other ways to do this, but usually those involve messaging, development and new servers, and I suppose in your case this infrastructure or those resources are not available.
EDIT
Since you have a last changed date, you might be able to pull the data with a statement like
SELECT columns FROM table WHERE lastchangedate >= (now - 24 hours)
or whatever your interval for loading might be.
Then process the data with Sqoop or ETL tools or the like. If a record is already available in your Hadoop environment, you want to UPDATE it; if it is not, INSERT it with your appropriate mechanism. This is sometimes called UPSERT.
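As a rough sketch of that delta pull (in Python; pymysql, the Orders table and the OrderId / LastChangeDate column names are assumptions, since only OrderStatus is given in the question), you would keep a watermark of the last successful load, fetch only rows changed since then, and feed them into your Sqoop/ETL merge step:

from datetime import datetime, timedelta

import pymysql  # assuming the OLTP source is MySQL; any DB-API driver works

# Connection details are placeholders.
conn = pymysql.connect(host="oltp-host", user="etl", password="...", database="sales")

# Watermark: in practice, persist the timestamp of the last successful load
# instead of recomputing "now - 24 hours" each run.
watermark = datetime.utcnow() - timedelta(hours=24)

with conn.cursor() as cur:
    cur.execute(
        "SELECT OrderId, OrderStatus, LastChangeDate "
        "FROM Orders WHERE LastChangeDate >= %s",
        (watermark,),
    )
    delta_rows = cur.fetchall()

# Hand delta_rows to the merge step: rows whose OrderId already exists in
# HDFS are updates, the rest are inserts (the UPSERT mentioned above).
print(f"{len(delta_rows)} changed orders since {watermark}")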

How to monitor database updates from application?

I work with a SQL Server database using ODBC and C++. I want to detect modifications in some tables of the database: another application inserts or updates rows, and I have to detect all of these modifications. It does not have to be an immediate trigger; it is acceptable to poll the database tables periodically for modifications.
Below is the way I think this can be done; I need your opinion on whether this is the standard/right way of doing it, or whether better approaches exist.
What I've thought of is this: I add triggers in SQL Server which, on any modification, insert the identifiers of the modified/added rows into a special table, which I then check periodically from my application. Suppose there are 3 tables: Customers, Products, Services. I will create three additional tables: Change_Customers, Change_Products, Change_Services, and the triggers will insert the identifiers of the modified rows of the respective tables. Then I will read these Change_* tables from my application periodically and delete the processed records.
Now, if you agree that the above solution is right, I have another question: is it better to have separate Change_* tables for each of the tables I wish to monitor, or to have one fat Changes table which contains the changes from all tables?
Query Notifications is the technology designed to do exactly what you're describing. You can leverage Query Notifications from managed clients via the well-known SqlDependency class, but there are native OLE DB and ODBC ways too. See Working with Query Notifications, the paragraphs about SSPROP_QP_NOTIFICATION_MSGTEXT (OLE DB) and SQL_SOPT_SS_QUERYNOTIFICATION_MSGTEXT (ODBC). See The Mysterious Notification for an explanation of how Query Notifications work.
This is the only polling-free solution that works with any kind of update. Triggers and polling for changes have severe scalability and performance issues. Change Data Capture and Change Tracking really cover a different topic (synchronizing datasets for occasionally connected devices, e.g. the Sync Framework).
Change Data Capture (CDC) -- http://msdn.microsoft.com/en-us/library/cc645937.aspx
First you will need to enable CDC on the database:
USE db_name
GO
EXEC sys.sp_cdc_enable_db
GO
Then enable CDC on the table (for example the Customers table from the question; the dbo schema is assumed):
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Customers',
    @role_name     = NULL
GO
Then you can query the changes.
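The application can then poll the CDC change function instead of the base table, for example like this (a sketch in Python with pyodbc; the connection string is a placeholder, and dbo_Customers is the capture instance created above):

import pyodbc

# Connection string is a placeholder.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=db_name;Trusted_Connection=yes;"
)
cur = conn.cursor()

# Read everything captured so far for the dbo.Customers capture instance.
cur.execute("""
    SET NOCOUNT ON;
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Customers');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
    SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Customers(@from_lsn, @to_lsn, N'all');
""")
for row in cur.fetchall():
    # __$operation: 1 = delete, 2 = insert, 3/4 = update (before/after image)
    print(row)

cur.close()
conn.close()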
If your version of SQL Server is 2005, you may use Notification Services.
If your SQL Server is 2008+, the most practical way is to use triggers that log changes to log tables, and to periodically poll these tables from the application to see the changes, for example as sketched below.
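For illustration, the trigger-plus-polling approach could look roughly like this from the application side (a sketch in Python with pyodbc rather than C++/ODBC; Change_Customers comes from the question, while the ChangeId and CustomerId columns are hypothetical names):

import time

import pyodbc

# Connection string is a placeholder.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)

while True:
    cur = conn.cursor()

    # Read the identifiers logged by the triggers.
    cur.execute("SELECT ChangeId, CustomerId FROM Change_Customers")
    changes = cur.fetchall()

    for change_id, customer_id in changes:
        # ... re-read the Customers row for customer_id and process it ...

        # Delete only rows that were actually processed, so changes logged
        # while we were working are picked up on the next pass.
        cur.execute("DELETE FROM Change_Customers WHERE ChangeId = ?", (change_id,))

    conn.commit()
    cur.close()
    time.sleep(5)  # polling interval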