What is the difference between Insert Else Update and Update Else Insert in a Lookup transformation? Can anyone please explain with an example?
As per my understanding, there is no functional difference. It's more about performance: whichever option you choose, the latter operation is attempted only after the first one fails. So it's best to use the option you expect to succeed on the first attempt more often.
Say we expect 80% updates, 20% inserts, and 10,000 rows:
With Insert Else Update we end up with 18,000 operations (10k insert attempts, of which 8k fail, followed by 8k updates).
With Update Else Insert there are 12,000 DB operations (10k update attempts, of which 2k fail, followed by 2k inserts).
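A quick back-of-the-envelope version of that arithmetic (just a sketch; the 80/20 split and row count are the assumed figures from the example above):

rows = 10_000
updates = 8_000   # the 80% of incoming rows that already exist in the target
inserts = 2_000   # the 20% that are new

# Insert Else Update: every row is tried as an insert first; the 8,000 that
# already exist fail and are then retried as updates.
insert_else_update_ops = rows + updates   # 18,000 operations

# Update Else Insert: every row is tried as an update first; the 2,000 that
# don't exist yet fail and are then retried as inserts.
update_else_insert_ops = rows + inserts   # 12,000 operations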
Update Else Insert affects rows marked for update: when this option is checked, such rows are inserted into the cache if they don't already exist in the cache.
Insert Else Update affects rows marked for insert: when this option is checked, such rows are updated in the cache if they already exist in the cache.
I'm working on an exceptionally large table for which, due to a data issue, I have to re-insert data for a couple of historical dates. After the insertion, I wanted to perform a manually triggered VACUUM FULL operation. Unfortunately, the VACUUM FULL operation on that table takes more than several days to complete. Since in Redshift only one VACUUM operation can run at a time, this also means that other, smaller tables cannot get their daily VACUUM until the large table's VACUUM is done.
My question is: is there a way to pause a VACUUM operation on that large table to give some room for VACUUM-ing the smaller tables? Does terminating a VACUUM operation reset it, or will re-running the VACUUM command resume from the last successful state?
Sorry, I'm trying to learn more about how the VACUUM process works in Redshift, but I haven't been able to find much information on it. Some explanation/docs in your answer would be really appreciated.
Note: I did try to perform a deep copy as mentioned in the official docs. However, the table is too large to copy in one go, so it's not an option.
Thanks!
As far as I know there is no way to pause a (manual) vacuum. However, vacuum runs in "passes" where some vacuum work is done and partial results are committed. This allows ongoing work to progress while the vacuum is running. If you terminate the vacuum midway, the previously committed blocks preserve the partial work. Some work will be lost for the current pass, and the restarted vacuum will need to scan the entire table to figure out where to start. Last I knew this works, but you lose some progress with each terminate.
If you can manage the update/consistency issues, a deep copy can be a faster way to go. You don't have to do it in a single pass; you can do it in parts. You need the space to store the second version of the table, but you won't need the space to sort the whole table in one go. For example, if you have a table with, say, 10 years of data, you can insert the first year into the new table (sorted, of course), then the second, and so on. There may be some partially empty blocks at the boundaries, but these are easy to fix up with a delete-only vacuum (or just wait for auto-vacuum to do it).
If you can do it, the deep copy method will be faster, as it doesn't need to maintain consistency or play nice with other workloads.
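A minimal sketch of the chunked deep copy, assuming a hypothetical table big_table with a sale_date column and a pre-created big_table_new with the same schema and sort key (adjust table/column names, date ranges, and credentials to your setup):

import psycopg2

# Placeholder connection details.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="myuser", password="...",
)
conn.autocommit = True
cur = conn.cursor()

# Copy one year at a time so no single pass has to sort the whole table.
for year in range(2014, 2024):
    cur.execute(
        """
        INSERT INTO big_table_new
        SELECT * FROM big_table
        WHERE sale_date >= %s AND sale_date < %s
        ORDER BY sale_date  -- keep each chunk in sort-key order
        """,
        (f"{year}-01-01", f"{year + 1}-01-01"),
    )

# Once everything is copied and validated, swap the tables:
# ALTER TABLE big_table RENAME TO big_table_old;
# ALTER TABLE big_table_new RENAME TO big_table;

With autocommit on, each yearly INSERT commits on its own, so other workloads only contend with one chunk at a time.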
I have a relatively large data set in a table with about 60 columns, of which about 20 have gone stale. I've found a few posts on dropping multiple columns and the performance of DROP COLUMN, but nothing on whether or not dropping a bunch of columns would result in a noticeable performance increase.
Any insight as to whether or not something like this could have a perceptible impact?
Dropping one or more columns can be done in a single statement and is very fast. All it needs is a short ACCESS EXCLUSIVE lock on the table, so long-running queries would block it.
The table is not rewritten during this operation, and it will not shrink. Subsequent rewrites (with VACUUM (FULL) or similar) will get rid of the column data.
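For illustration, a minimal sketch of dropping several columns in one ALTER TABLE (one statement, one short lock, no rewrite), with the optional space-reclaiming rewrite left commented out; table and column names are placeholders:

import psycopg2

conn = psycopg2.connect(dbname="mydb", user="myuser", host="localhost")
conn.autocommit = True
cur = conn.cursor()

# One statement, one short ACCESS EXCLUSIVE lock, no table rewrite.
cur.execute("""
    ALTER TABLE wide_table
        DROP COLUMN stale_col_1,
        DROP COLUMN stale_col_2,
        DROP COLUMN stale_col_3;
""")

# Optional: rewrite the table later to actually reclaim the space on disk.
# VACUUM FULL takes a long exclusive lock, so schedule it carefully.
# cur.execute("VACUUM FULL wide_table;")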
There are invoices that I am caching, but with 2 cache entries. The first cache entry holds whether the invoices are cached or not. Why am I doing this? Because there is business logic (the get_cache_timeout method) that tells me when to update the 2nd cache entry, which holds the actual invoice details.
So, the first one is a flag for me to know whether the 2nd cache entry is there or not. If not, I call the backend system and update the 1st and 2nd cache entries.
The reason behind having the 2nd cache key with a 60-day timeout is that, in the worst case where the 1st entry doesn't exist and the call to the backend system then fails, I still want to return the 2nd cache entry as a response instead of showing an error.
from django.core.cache import cache

cache.set(f'{invoices}_cache_exists', True, get_cache_timeout())  # flag: refresh marker
cache.set(f'{invoices}_cache', some_cache, 60 * 60 * 24 * 60)     # data: kept for 60 days
Sorry for the confusing explanation, but I hope you get the idea behind this solution.
So, in the end, my question is: for this problem, how can I get rid of the 1st cache entry and have only the 2nd cache entry with 2 timeouts? The 1st timeout would tell me when to update, and the 2nd would remove the cache.
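For context, a simplified sketch of the read path described above (fetch_invoices_from_backend is a placeholder for the real backend call; get_cache_timeout is the business-logic method mentioned earlier):

from django.core.cache import cache

def get_invoices(invoices):
    exists_key = f'{invoices}_cache_exists'
    data_key = f'{invoices}_cache'

    if cache.get(exists_key):                # 1st entry: data is considered fresh
        return cache.get(data_key)           # 2nd entry: the actual invoice details

    try:
        data = fetch_invoices_from_backend(invoices)   # placeholder backend call
    except Exception:
        return cache.get(data_key)           # backend failed: fall back to stale data

    cache.set(exists_key, True, get_cache_timeout())
    cache.set(data_key, data, 60 * 60 * 24 * 60)       # keep the fallback for 60 days
    return data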
What about this?
# cache.set(f'{invoices}_cache_exists', True, get_cache_timeout())  # drop the flag entry entirely
cache.set(f'{invoices}_cache', some_cache, get_cache_timeout())  # single entry, single TTL
You can make your cache entry expire in get_cache_timeout() time.
In the end, if the cache entry expires it needs to be updated, so knowing when to update is solved.
On the other hand, as for when to remove: well, it will be removed every get_cache_timeout() seconds/minutes.
It just doesn't make sense to have a cache entry with a TTL of M minutes that has to be updated every m minutes, where M > m.
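Concretely, the single-entry version could look something like this (again, fetch_invoices_from_backend is a placeholder):

from django.core.cache import cache

def get_invoices(invoices):
    data_key = f'{invoices}_cache'
    data = cache.get(data_key)
    if data is None:                                   # expired => time to refresh
        data = fetch_invoices_from_backend(invoices)   # placeholder backend call
        cache.set(data_key, data, get_cache_timeout())
    return data

Note that this drops the 60-day stale fallback from the question, so it only fits if serving stale data on a backend failure is no longer a requirement.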
Our content management server hosts the Lucene sitecore_analytics_index.
By default, the sitecore_analytics_index uses a TimedIndexRefreshStrategy with an interval of 1 minute. This means that every minute, Sitecore adds new analytics data to the index, and then optimizes the index.
We've found that the optimization part takes ~20 minutes for our index. In practice, this means that the index is constantly being optimized, resulting in non-stop high disk I/O.
I see two possible ways to improve the situation:
Don't run the optimize step after index updates, and implement an agent to optimize the index just once per day (as per this post). Is there a big downside to only optimizing the index, say, once per day? AFAIK it's not necessary to optimize the index after every update.
Keep running the optimize step after every index update, but increase the interval from 1 minute to something much higher. What ill-effects might we see from this?
Option 2 is easier as it is just a config change, but I suspect that updating the index less frequently might be bad (hopefully I'm wrong?). Looking in the Sitecore search log, I see that the analytics index is almost constantly being searched, but I don't know by what, so I'm not sure what might happen if I reduce the index update frequency.
Does anyone have any suggestions? Thanks.
EDIT: alternatively, what would be the impact of disabling the Sitecore analytics index entirely (and how could I do that)?
I have a stuck 'vacuum reindex' operation and am wondering what might be causing it to take such a long time.
I recently changed the schema of one of my Redshift tables, by creating a new table with the revised schema and deep copying the data using 'select into' (see Performing a Deep Copy). My basic understanding was that after deep copying the table, the data should be sorted according to the table's sort-keys. The table has an interleaved 4-column sort-key. Just to make sure, after deep copying I ran the 'interleaved skew' query (see Deciding When to Reindex), and the results were 1.0 for all columns, meaning no skew.
I then ran 'vacuum reindex' on the table, which should be really quick since the data is already sorted. However, the vacuum is still running after 30 hours. During the vacuum I examined svv_vacuum_progress periodically to check the vacuum operation's status. The 'sort' phase finished after ~6 hours, but the 'merge' phase has now been stuck at 'increment 23' for more than 12 hours.
What could be the cause for the long vacuum operation, given that the data is supposed to be already sorted by the deep copy operation? Am I to expect these times for future vacuum operations too? The table contains ~3.5 billion rows and its total size is ~200 GB.
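For reference, this is roughly how the progress can be polled (a sketch using psycopg2 and the SVV_VACUUM_PROGRESS system view; connection details are placeholders):

import time
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="myuser", password="...",
)
conn.autocommit = True
cur = conn.cursor()

while True:
    cur.execute(
        "SELECT table_name, status, time_remaining_estimate "
        "FROM svv_vacuum_progress;"
    )
    print(cur.fetchall())   # one row describing the currently running vacuum, if any
    time.sleep(600)         # poll every 10 minutes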