Problem:
I imported a small package of about 15 items from one of our DBs to another, and somehow in the process the children of one of the items got overwritten by the operation.
I may have incorrectly selected "overwrite" instead of "merge", but I'm not sure about that.
The worst thing is, I also published to the web DB after the import, so the items are not in the web DB either.
Things I checked:
Checked the Recycle Bin: not there.
Also checked the Archive: not there either.
Even wrote a piece of code to find the items by ID in the DB: FAILED.
My question:
Are the items overwritten by the Launch Wizard gone forever, or could there still be a trace of them remaining in the DB?
There is no "rollback or uninstall a package" feature out of the box in Sitecore. This seems to be the only available info regarding the matter.
I've heard of some shared source modules which could be useful, but never tried them personally.
I think your best choice is to restore the items from a database backup, or to revert content if you have a serialized copy on the file system.
After reading this page:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
"Operational Differences and Considerations" -> "Direct writes to Amazon S3 eliminated" section.
I wonder: does this mean that writing to S3 from Hive will be faster on EMR 4.x versions than on 5.x versions?
If so, isn't that a kind of regression? Why would AWS want to eliminate this optimization?
Writing to a Hive table located in S3 is a very common scenario.
Can someone clear up this issue?
This optimization was originally developed by Qubole and pushed to Apache Hive.
See here.
This feature is rather dangerous because it bypasses Hive's fault-tolerance mechanism, and it also forces developers to use normally unnecessary intermediate tables, which in turn leads to performance degradation and increased cost.
A very common use case is merging incremental data into a partitioned target table, as described here. The query is an INSERT OVERWRITE of the table from itself; without an intermediate table (in a single query) it is rather efficient, and the query can be much more complex, with many tables joined (a sketch follows after the list below). This is what happens with direct writes enabled in this use case:
1. The partition folder is deleted before the query finishes, causing a FileNotFoundException: a mapper reading the same table that is being written fails because the partition folder was deleted before the mapper executed.
2. If the target table is initially empty, the first run succeeds because Hive knows there are no partitions and does not read the folder. The second run fails because of (1): the folder is deleted before the mappers finish.
3. The known workaround has a performance impact. Loading data incrementally is quite a common use case, and the direct-write feature forces developers to use a temporary table here to avoid the FileNotFoundException and table corruption. As a result, the task runs much slower and costs much more than writing the target table from itself with the feature disabled.
4. After the first failure, a successful restart is impossible: the table is neither selectable nor writable, because the partition exists in the Hive metadata but the folder does not, and this causes a FileNotFoundException in other queries against the table, even ones that do not overwrite it.
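For illustration, here is a minimal sketch of such a single-query merge, with hypothetical table and column names (target, increment, id, val, dt); it reads the target table while overwriting it, which is exactly what trips direct writes:

-- enable dynamic partitioning for the INSERT OVERWRITE below
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- merge the increment into the target in a single query,
-- reading target at the same time as it is being overwritten
INSERT OVERWRITE TABLE target PARTITION (dt)
SELECT COALESCE(i.id,  t.id)  AS id,
       COALESCE(i.val, t.val) AS val,
       COALESCE(i.dt,  t.dt)  AS dt
  FROM target t
       FULL OUTER JOIN increment i
       ON t.id = i.id AND t.dt = i.dt;

With direct writes enabled, the partition folders of target are deleted while the mappers are still reading them, which is failure (1) above.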
The same is described in less detail on the Amazon page you are referring to: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
Another possible issue is described on the Qubole page; an existing fix using prefixes is mentioned, though it does not help in the use case above, because writing new files into a folder that is being read will cause a problem anyway.
Also, mappers and reducers may fail and restart, and the whole session may fail and restart; writing files directly, even with deferred deletion of the old ones, does not seem like a good idea because it increases the chance of unrecoverable failure or data corruption.
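For completeness, the temporary-table workaround mentioned in (3) looks roughly like this (same hypothetical names as the sketch above); it avoids reading and writing target in the same statement, at the price of an extra full write:

-- stage the merge result first...
INSERT OVERWRITE TABLE target_stage PARTITION (dt)
SELECT COALESCE(i.id,  t.id)  AS id,
       COALESCE(i.val, t.val) AS val,
       COALESCE(i.dt,  t.dt)  AS dt
  FROM target t
       FULL OUTER JOIN increment i
       ON t.id = i.id AND t.dt = i.dt;
-- ...then overwrite the target from the stage only
INSERT OVERWRITE TABLE target PARTITION (dt)
SELECT id, val, dt FROM target_stage;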
To disable direct writes, set this configuration property:
set hive.allow.move.on.s3=true; -- this disables direct writes
You can use this feature for small tasks and when you are not reading the same table that is being written, though for small tasks it will not gain you much. This optimization is most efficient when you are rewriting many partitions in a very big table and the move task at the end is extremely slow; then you may want to enable it at the risk of data corruption.
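Presumably (this is my inference from the property above, not something I have verified), inverting the value turns direct writes back on:

set hive.allow.move.on.s3=false; -- assumed inverse: re-enables direct writes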
I have created a UDF through the CREATE FUNCTION command, and now when I try to drop it the server crashes. According to the docs, this is a known issue:
To upgrade the shared library associated with a UDF, issue a DROP FUNCTION statement, upgrade the shared library, and then issue a CREATE FUNCTION statement. If you upgrade the shared library first and then use DROP FUNCTION, the server may crash.
It does, indeed, crash, and afterwards any attempt to remove the function crashes, even if I completely remove the DLL from the plugin directory. During development I'm continually replacing the library that defines the UDF functions. I've already re-installed MySQL from scratch once today and would rather not do it again. Aside from being more careful, is there anything I can do to e.g. clean up the mysql.* tables manually so as to remove the function?
Edit: after some tinkering, the database seems to have settled into a pattern of crashing until I remove the offending DLL, and after that issuing Error Code: 1305: FUNCTION [schema].[functionName] does not exist. If I attempt to drop the function as root, I get the same message, but without the schema prefix.
SELECT * FROM mysql.func shows the function. If I remove the record by hand, I still get the same 1305 error.
Much of the data in the system tables in the mysql schema is cached in memory on first touch. After that, modifying the tables by hand may not have the expected effect unless the server is restarted.
For the grant tables, a mechanism for flushing any cached data is provided -- FLUSH PRIVILEGES -- but for other tables, like func and the time zone tables, the only certain way to ensure that manual changes are fully taken into account is to restart the server process.
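As a minimal sketch, assuming the stuck function is named my_udf (a hypothetical name, substitute your own):

-- remove the orphaned UDF registration by hand
DELETE FROM mysql.func WHERE name = 'my_udf';
-- mysql.func is cached in memory on first touch, so this
-- change is only picked up after the server process restarts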
I'm new to ElasticSearch, so this is probably something quite trivial, but I haven't figured out anything better than fetching everything, processing it with a script, and updating the records one by one.
I want to make something like a simple SQL update:
UPDATE RECORD SET SOMEFIELD = SOMEXPRESSION
My intent is to replace the actual bogus data with data that makes more sense (so the expression basically picks randomly from a pool of valid values).
There are a couple of open issues about making it possible to update documents by query.
The technical challenge is that Lucene (the text search engine library that Elasticsearch uses under the hood) segments are read-only: you can never modify an existing document. What you need to do is delete the old version of the document (which, by the way, will only be marked as deleted until a segment merge happens) and index the new one. That's what the existing update API does. An update by query might therefore take a long time and lead to issues; that's why it hasn't been released yet. A mechanism that allows interrupting running queries would also be nice to have for this case.
But there's the update by query plugin that exposes exactly that feature. Just beware of the potential risks before using it.
Can someone suggest the best way to get around this? I am using Kettle 4.1.0 community edition. When I preview the data in Spoon for a transformation's Table Output step, the data is written to the database directly, even though I have not run the transformation. How can I avoid this?
That's just how it works; perhaps "preview" is a misleading name.
There are a couple of ways around it. Preview the step before the Table Output, and disable the hop so no data goes to the Table Output. If the Table Output step is collecting several inputs, then put in a "dummy" step, make that do the collecting, and preview that.
Or change your connection to a local DB (via properties, JNDI, or even a different connection on the step); then you won't care that the data gets written.
I'm looking for a good, efficient method for scanning a directory structure for changed files on Windows XP+. Something like what git does is exactly what I'm after: running git status displays all modified files, all new (untracked) files, and all deleted files very quickly, which is exactly what I would like to do.
I have a basic model up and running which performs an initial scan and stores all filenames, size, dates and attributes.
On a subsequent scan it checks if the size, attributes or date have changed and marks as a changed file.
My issue now is detecting moved and deleted files. Is there a tried and tested method for this sort of thing? I'm struggling to come up with a good approach.
I should mention that it will eventually use ReadDirectoryChangesW to monitor files and alert the user when something changes, so a full scan is really a last resort after the initial scan.
EDIT: I think I may have described the problem badly. The issue I'm facing is not so much detecting the changes (I have ReadDirectoryChangesW() using IOCP on multiple threads to detect when a change happens) as what to do with the information. For example, a moved file is reported as a delete followed by a create, and a rename comes in two parts: the old name followed by the new name. So what I'm asking is how to differentiate between a delete that is part of a move and an actual delete. I'm guessing buffering the changes and processing them in batches would be an option, but that feels messy.
In native code, FileSystemWatcher is replaced by ReadDirectoryChangesW. Using this properly is not simple; there is a good baseline to build on here.
I have used this code in a previous job and it worked pretty well. The Win32 API itself (and FileSystemWatcher) are prone to problems that are described in the docs and also discussed in various places online, but the impact of those will depend on your use cases.
EDIT: the exact change is indicated in the FILE_NOTIFY_INFORMATION structure that you get back: adds, removals, and rename data including the old and new names.
I voted Liviu M. up. However, another option, if you don't want to use the .NET framework for some reason, would be to use the basic Win32 API call FindFirstChangeNotification.
You can use USN journaling if you are up to it; that is pretty low-level (NTFS-level) stuff.
Here you can find detailed information, with source code included. It is written in C#, but most of it is P/Invoking C/C++ functions.