How does Snowflake perform updates internally? - sql-update

As far as I know, the underlying files (columnar format) are immutable. My question is: if the files are immutable, how are updates performed? Does Snowflake maintain different versions of the same row and return the latest version based on a key? Or does it insert the data into new files behind the scenes and delete the old files? How is performance affected in these scenarios (querying current data) if Time Travel is set to 90 days, since Snowflake needs to maintain different versions of the same row? And since Snowflake doesn't enforce keys, how are the different versions even detected? Any insights (documents/videos) on the detailed internals are appreciated.

It's a complex question, but the basic ideas are as follows (quite a bit simplified):
records are stored in immutable micro-partitions on S3
a table is a list of micro-partitions
when a record is modified:
its old micro-partition is marked as inactive (from that moment),
a new micro-partition is created, containing the modified record along with the other, unchanged records from the old micro-partition,
the new micro-partition is added to the table's list (marked as active from that moment)
inactive micro-partitions are not deleted for some time, which allows time travel
So Snowflake doesn't need a record key, as each record is stored in only one file that is active at any given time.
The impact of updates on querying is marginal; the only visible effect might be that the new files need to be fetched from S3 and cached on the warehouses.
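The steps above can be illustrated with a small toy model. This is only a sketch of the copy-on-write idea under simplified assumptions, not Snowflake's actual implementation; the names `PartitionEntry`, `Table`, `update` and `scan`, and the integer logical clock, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Row = Tuple[int, str]  # hypothetical row shape: (id, value)


@dataclass
class PartitionEntry:
    rows: Tuple[Row, ...]               # file contents, never changed after creation
    active_from: int                    # logical time the file joined the table
    active_until: Optional[int] = None  # None means the file is still active


class Table:
    def __init__(self, partitions):
        self.clock = 0
        self.entries = [PartitionEntry(tuple(p), 0) for p in partitions]

    def update(self, predicate, change):
        self.clock += 1
        for entry in list(self.entries):  # iterate over a copy; we append below
            if entry.active_until is not None:
                continue  # already inactive
            if not any(predicate(r) for r in entry.rows):
                continue  # partitions without matching rows are left alone
            # Rewrite the whole partition: changed rows plus unchanged neighbours.
            new_rows = tuple(change(r) if predicate(r) else r for r in entry.rows)
            entry.active_until = self.clock  # only metadata changes, not the file
            self.entries.append(PartitionEntry(new_rows, self.clock))

    def scan(self, as_of=None):
        # Read every partition that was active at logical time `as_of`.
        t = self.clock if as_of is None else as_of
        rows = []
        for e in self.entries:
            if e.active_from <= t and (e.active_until is None or t < e.active_until):
                rows.extend(e.rows)
        return sorted(rows)


table = Table([[(1, "a"), (2, "b")], [(3, "c")]])
table.update(lambda r: r[0] == 2, lambda r: (r[0], "B"))
current_rows = table.scan()          # reads only active partitions
past_rows = table.scan(as_of=0)      # "time travel" to before the update
```

Note that no key lookup is involved: a query at a given time simply reads the partitions that were active at that time, which is why old versions kept for time travel do not slow down queries on current data in this model.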
For more info, I'd suggest going to Snowflake forums and asking there.

Related

Which one is more performant in Redshift - Truncate followed by Insert Into, or Drop and Create Table As?

I have been working on AWS Redshift and am curious about which of the data loading (full reload) methods is more performant.
Approach 1 (Using Truncate):
Truncate the existing table
Load the data using Insert Into Select statement
Approach 2 (Using Drop and Create):
Drop the existing table
Load the data using Create Table As Select statement
We have been using both in our ETL, but I am interested in understanding what's happening behind the scenes on the AWS side.
In my opinion, the Drop and Create Table As approach should be more performant, as it avoids the overhead of scanning/handling the table's associated data blocks that the Insert Into statement incurs.
Moreover, TRUNCATE in AWS Redshift does not reseed identity columns (see "Redshift Truncate table and reset Identity?").
Please share your thoughts.
Redshift operates on 1MB blocks as the base unit of storage and coherency. When changes are made to a table, it is these blocks that are "published" for all to see when the changes are committed. A table is just a list (data structure) of the block IDs that compose it, and there can be many versions of a table in flight at any time (if it is being changed while others are viewing it).
For the sake of this question, let's assume that the table in question is large (contains a lot of data), which I expect is true. These two statements end up doing a common action: unlinking and freeing all the blocks in the table. The blocks are where all the data lives, so you'd think that the speed of the two would be the same, and on idle systems they are close. Both automatically commit their results, so the command doesn't complete until the work is done. In this idle-system comparison I've seen DROP run faster, but then you need to CREATE the table again, so time is needed to recreate the table's data structure; this can be in a transaction block, though, so do we need to include the COMMIT? The bottom line is that on an idle system these two approaches are quite close in runtime, and when I last measured them for a client the DROP approach was a bit faster. I would advise you to read on before making your decision.
However, in the real world Redshift clusters are rarely idle, and in loaded cases these two statements can behave quite differently. DROP requires exclusive control over the table since it does not run inside of a transaction block. All other uses of the table must be closed (committed or rolled back) before DROP can execute. So if you are performing this DROP/recreate procedure on a table others are using, the DROP statement will be blocked until all these uses complete. This can take an indeterminate amount of time. For ETL processing on "hidden" or "unpublished" tables the DROP/recreate method can work, but you need to be really careful about what other sessions are accessing the table in question.
TRUNCATE does run inside of a transaction but performs a commit upon completion. This means that it won't be blocked by others working with the table. It's just that one version of the table is full (for those who were looking at it before TRUNCATE ran) and one version is completely empty. The data structure of the table has versions for each session that has it open, and each session sees the blocks (or lack of blocks) that correspond to its version. I suspect that it is managing these data structures and propagating the changes through the commit queue (bookkeeping, in other words) that slows TRUNCATE down slightly. The upside of this bookkeeping is that TRUNCATE will not be blocked by other sessions reading the table.
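To make the "table as a versioned list of block IDs" idea concrete, here is a toy model. It is only an illustration of the description above, not how Redshift is actually implemented; the class `VersionedTable` and its methods are hypothetical names.

```python
class VersionedTable:
    """Toy model: each committed version is an immutable tuple of block IDs."""

    def __init__(self, block_ids):
        self.versions = [tuple(block_ids)]  # committed versions, newest last

    def begin_read(self):
        # A reader pins whichever version is current when it starts.
        return len(self.versions) - 1

    def read(self, pinned_version):
        return self.versions[pinned_version]

    def truncate(self):
        # Publish a new, empty version; pinned readers keep their old block list.
        self.versions.append(())

    def append_blocks(self, block_ids):
        # Publish a new version that extends the latest one with new blocks.
        self.versions.append(self.versions[-1] + tuple(block_ids))


table = VersionedTable([101, 102, 103])
reader = table.begin_read()      # a session opens the table before the reload
table.truncate()                 # full reload, step 1
table.append_blocks([201, 202])  # full reload, step 2
old_view = table.read(reader)              # still sees the original blocks
new_view = table.read(table.begin_read())  # a new session sees the reloaded data
```

In this model the earlier reader is never blocked and never sees a half-loaded table, which is the coherency property the answer attributes to TRUNCATE.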
The deciding factor in choosing between these approaches is often not performance; it is which one has the locking and coherency features that will work in your solution.

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables, different options can be used:
if a table explicitly contains a reference to the last change time, this can be used
if a table has guaranteed update characteristics (e.g. no in-place updates and monotonically increasing ID values), this could be used, e.g. by reading all records where the ID is larger than the last processed ID
if the table does not provide intrinsic information about change time, then one could maintain a copy of the table that contains only the records processed so far. This copy can then be compared with the current table to compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
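The third option (comparing against a copy of what was processed so far) can be sketched as follows. This is a minimal in-memory illustration that assumes each row has a primary key; it is not how SDI flowgraphs are implemented, and `compute_delta` and the sample data are hypothetical.

```python
def compute_delta(previous, current):
    """Compare two snapshots given as dicts mapping primary key -> row tuple."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return inserted, updated, deleted


# Snapshot saved after last night's run vs. the table as it is tonight.
last_processed = {1: ("Alice", 5000), 2: ("Bob", 4200), 3: ("Carol", 6100)}
tonight = {1: ("Alice", 5000), 2: ("Bob", 4500), 4: ("Dave", 3900)}

inserted, updated, deleted = compute_delta(last_processed, tonight)

# After the delta file is written, tonight's state becomes the new baseline.
last_processed = dict(tonight)
```

The cost of this approach is storing and scanning a full copy of each table every night, which is the trade-off for not having change timestamps in the source.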
In my experience, efforts to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming and expensive than using the corresponding features of ETL tools.
It is possible to create a log table, with columns organized according to your needs, and then create triggers on your database tables that write a log record with a timestamp value for each change. You can then query the log table to determine which records were inserted, updated or deleted in your source tables.
For example, the following is from one of my test triggers:
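CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW myoldrow, NEW ROW mynewrow
FOR EACH ROW
BEGIN
    -- Log the new values of the updated row, tagged as an update ('U').
    -- CURRENT_DATE gives day granularity; CURRENT_TIMESTAMP could be used
    -- instead if the DateTime column type allows finer resolution.
    INSERT INTO SalaryLog (Employee, Salary, Operation, DateTime)
    VALUES (:mynewrow.Employee, :mynewrow.Salary, 'U', CURRENT_DATE);
END;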
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE trigger.
You can organize your log table so that you can track more than one table if you wish, just by keeping the table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use separate log tables for each table.

DynamoDB Concurrency Issue

I'm building a system in which many DynamoDB (NoSQL) tables all contain data and data in one table accesses data in another table.
Multiple processes are accessing the same item in a table at the same time. I want to ensure that all of the processes have updated data and aren't trying to access that item at the exact same time because they are all updating the item with different data.
I would love some suggestions on this as I am stuck right now and don't know what to do. Thanks in advance!
Optimistic locking is a strategy to ensure that the client-side item that you are updating (or deleting) is the same as the item in Amazon DynamoDB. If you use this strategy, your database writes are protected from being overwritten by the writes of others, and vice versa.
With optimistic locking, each item has an attribute that acts as a version number. If you retrieve an item from a table, the application records the version number of that item. You can update the item, but only if the version number on the server side has not changed. If there is a version mismatch, it means that someone else has modified the item before you did. The update attempt fails, because you have a stale version of the item. If this happens, you simply try again by retrieving the item and then trying to update it. Optimistic locking prevents you from accidentally overwriting changes that were made by others. It also prevents others from accidentally overwriting your changes.
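The read, check-version, write, retry cycle described above can be sketched with an in-memory stand-in for the table. This is an illustration of the strategy only, not DynamoDB SDK code; `Store`, `put_if_version`, `VersionMismatch` and `update_with_retry` are hypothetical names. In DynamoDB itself the version check would be expressed as a condition on the write request.

```python
class VersionMismatch(Exception):
    """Raised when the stored version differs from the one the caller read."""


class Store:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        return dict(self._items[key])  # return a copy, like a client-side read

    def put_if_version(self, key, new_item, expected_version):
        current = self._items.get(key)
        current_version = current["version"] if current else None
        if current_version != expected_version:
            raise VersionMismatch(key)
        # The write succeeds and the version number is incremented.
        self._items[key] = dict(new_item, version=(expected_version or 0) + 1)


def update_with_retry(store, key, mutate, max_attempts=5):
    for _ in range(max_attempts):
        item = store.get(key)          # 1. read the item and note its version
        expected = item["version"]
        mutate(item)                   # 2. apply the change client-side
        try:
            store.put_if_version(key, item, expected)  # 3. conditional write
            return store.get(key)
        except VersionMismatch:
            continue                   # 4. someone else won; re-read and retry
    raise RuntimeError("gave up after repeated version conflicts")


store = Store()
store.put_if_version("order-1", {"count": 0}, None)  # first save, version 1

stale = store.get("order-1")  # a second client reads version 1
result = update_with_retry(store, "order-1", lambda i: i.update(count=i["count"] + 1))

try:
    store.put_if_version("order-1", stale, stale["version"])  # stale write
    stale_write_rejected = False
except VersionMismatch:
    stale_write_rejected = True
```

The second client's write is rejected rather than silently overwriting the first client's change, so it has to re-read the item and try again.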
To support optimistic locking, the AWS SDK for Java provides the @DynamoDBVersionAttribute annotation. In the mapping class for your table, you designate one property to store the version number, and mark it using this annotation. When you save an object, the corresponding item in the DynamoDB table will have an attribute that stores the version number. The DynamoDBMapper assigns a version number when you first save the object, and it automatically increments the version number each time you update the item. Your update or delete requests succeed only if the client-side object version matches the corresponding version number of the item in the DynamoDB table.
ConditionalCheckFailedException is thrown if:
You use optimistic locking with @DynamoDBVersionAttribute and the version value on the server is different from the value on the client side.
You specify your own conditional constraints while saving data by using DynamoDBMapper with DynamoDBSaveExpression and these constraints failed.
Note
DynamoDB global tables use a “last writer wins” reconciliation between concurrent updates. If you use global tables, last writer policy wins. So in this case, the locking strategy does not work as expected.

Sitecore Lucene Index queue lagging behind in Prod server

In our Sitecore (6.6) implementation we use Lucene indexing. On our PROD server, the index building process is very slow. At the moment it has 5000+ entries waiting in the index queue.
Queries I used (in master database),
select * from Properties (check the index last run time)
select * from History where created > 'last index updated time'
As a result of this delay, newly created data and changes are not reflected on the website. The queue also keeps growing. When the site is taken offline, index building catches up after a while.
It's a heavily read-intensive website.
We encountered high CPU usage issues, but they have now been sorted out. We thought index building was lagging because of the high CPU usage, but now the CPU is running at around 30-40% and the Lucene indexing queue is still growing quickly.
How can I solve this issue? Please help.
You need to set up a database maintenance task, so that you regularly flush your History table. If you have sites that are index heavy, this table can grow excessively large. I think the default job cleans this table out with everything that is older than 30 days - you could set this much lower. Like 1 day, or a couple of days.
This article on SDN covers most of the standard maintenance tasks: http://sdn.sitecore.net/Articles/Administration/Database%20Maintenance.aspx
More general information about searching, indexing and performance here: http://sdn.sitecore.net/upload/sitecore6/65/sitecore_search_and_indexing_sc60-65-a4.pdf#search=%22clean%22
I think you need to take a step back and ask the question as to why there is such a large number of entries being added to the history table to begin with, before looking at what configuration changes to Sitecore can be made.
You should trace through your code in your development environment based on each of the use cases for your implementation, to find all calls to the Sitecore API where an item is:
Added into the Sitecore Tree
Edited - a change to any of the item's fields, including security, presentation, workflow, publishing restrictions, etc.
Duplicated
Deleted from the Sitecore Tree
Moved to a new location.
Has a new version added
Has a version removed
As you are going through, make sure that all edit actions on an item are performed within a single Sitecore.Data.Items.Item.Editing.BeginEdit() and Sitecore.Data.Items.Item.Editing.EndEdit() call whenever possible, so that the changes are performed as a single edit action instead of several. Every time Sitecore.Data.Items.Item.Editing.EndEdit() is called, a new record is inserted into the history table, so unnecessary edits will only cause the history table to grow.
If you are duplicating an item using the Sitecore.Data.Items.Item.CopyTo() method, remember that all versions of the item will be duplicated, as well as the item's descendants. This means that the history table will have a record in it for every version of the item that was copied. If you only require the latest version and are therefore removing older versions from the new item after it is created, be aware that removing a version from an item also results in a record being inserted into the history table for each version deleted.
If you have minimized all of the above actions to the bare minimum that is required to make the system functional, you should find that the Lucene Indexing will keep up-to-date pretty well without having to change Sitecore's default index configuration.

How do you handle "Sync Framework does not automatically handle the deletion of rows that no longer satisfy a filter condition"

http://msdn.microsoft.com/en-us/library/dd918848.aspx
"It is important to understand that a scope is the combination of tables and filters. For example, you could define a filtered scope named sales-WA that contains only the sales data for the state of Washington from the customer_sales table. If you define another filter on the same table, such as sales-OR, this is a different scope. If you define filters, be aware that Sync Framework does not automatically handle the deletion of rows that no longer satisfy a filter condition. For example, if a user or application updates a value in a column that is used for filtering, a row moves from one scope to another. The row is sent to the new scope that the row now belongs to, but the row is not deleted from the old scope. Your application must handle this situation."
I am just wondering if someone can shed some light on how to handle "Sync Framework does not automatically handle the deletion of rows that no longer satisfy a filter condition"?
Many thanks.
The sync providers will (as part of the provisioning step) automatically create tombstone tables and triggers to track row deletions. When rows are not deleted but updated in such a way that they fall out of the scope, the automatically generated schema won't log these as deletions. It will log them as updates. So, to extend the Microsoft example, assume your application is syncing only Washington data to Washington sales reps. Some sales that were originally entered as Washington sales are corrected and moved to Oregon. The sync framework won't know that it should remove these now-Oregon records from the Washington reps' local databases.
You have a couple of options to solve this:
Modify the provisioning tools to generate triggers that handle the situation, instead of the default triggers that don't. Look into extending SqlSyncScopeProvisioning to accomplish this. If done correctly, this is probably the most scalable/extensible solution.
Modify your application to detect an attempt to move a row out of a scope and have the application delete the row and re-insert it instead of just updating it (probably in a stored procedure). If you already use stored procedures to handle updates, this might be a good option.
Add a background service or process that looks for records that don't match the scope and deletes them. This may end up being the easiest solution, especially if your application is already deployed.
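The core of the third option is just a filter check over the local data. The sketch below shows that logic on an in-memory list; it does not use the Sync Framework API, and `purge_out_of_scope` and the sample rows are hypothetical. How such local deletes interact with change tracking in a real deployment is not covered here.

```python
def purge_out_of_scope(local_rows, in_scope):
    """Remove rows that no longer satisfy the scope filter; return what was removed."""
    removed = [row for row in local_rows if not in_scope(row)]
    local_rows[:] = [row for row in local_rows if in_scope(row)]
    return removed


# Local database of a Washington sales rep (scope: state == "WA").
washington_db = [
    {"id": 1, "state": "WA"},
    {"id": 2, "state": "OR"},  # arrived as WA, later corrected to OR by an update
    {"id": 3, "state": "WA"},
]

removed = purge_out_of_scope(washington_db, lambda row: row["state"] == "WA")
```

A real background job would run the same predicate as the scope's filter clause against the client database on a schedule.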