I'm writing a library to help people use Amazon CloudSearch.
With CloudSearch, when you update a document you need to specify the document's id (of course) and the version you want to update to.
If the specified version number is smaller than the current version, the update doesn't happen.
So how can I make sure my record is actually updated every time I send an update?
The Ruby project aws_cloud_search uses a timestamp to keep the version number always increasing, but (as the sketch below illustrates):
As the maximum version number for AWS CloudSearch is 4294967295, it will not work any more after 07 Feb 2106.
If you run two updates within the same second, the last update (the more important one) will be ignored.
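A minimal sketch of both failure modes, assuming the version is a plain Unix timestamp in seconds (the constant is the documented CloudSearch maximum; everything else is illustrative):

```
# Sketch of why a seconds-based timestamp is a fragile version number.
from datetime import datetime, timezone

MAX_VERSION = 4294967295  # 2**32 - 1, the CloudSearch version ceiling

# Failure mode 1: overflow. The last second a 32-bit version can represent:
print(datetime.fromtimestamp(MAX_VERSION, tz=timezone.utc))
# -> 2106-02-07 06:28:15+00:00

# Failure mode 2: collision. Two updates in the same second produce the same
# version, so the second (newer) update is silently rejected as "not higher".
v1 = int(datetime.now(tz=timezone.utc).timestamp())
v2 = int(datetime.now(tz=timezone.utc).timestamp())
print(v1 == v2)  # almost always True
```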
Aside from the timestamp approach, which appears to be the standard answer from everybody, including the docs, the only approach that I've found that works is to keep track of the version number elsewhere, and increment it as changes happen.
Of course, this approach only works if the object you're trying to represent in the CloudSearch document can be accessed from somewhere else where you presumably have some sort of atomicity.
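A minimal sketch of that idea, assuming a small SQLite table can live next to your source-of-truth data (the table and file names are made up; the upsert syntax needs SQLite 3.24+):

```
# Keep one monotonically increasing counter per document id and bump it
# atomically on every change; send the returned number as the version.
import sqlite3

conn = sqlite3.connect("versions.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS doc_version ("
    " doc_id TEXT PRIMARY KEY, version INTEGER NOT NULL)"
)

def next_version(doc_id: str) -> int:
    with conn:  # one transaction: the increment and the read are atomic
        conn.execute(
            "INSERT INTO doc_version (doc_id, version) VALUES (?, 1) "
            "ON CONFLICT(doc_id) DO UPDATE SET version = version + 1",
            (doc_id,),
        )
        row = conn.execute(
            "SELECT version FROM doc_version WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        return row[0]

print(next_version("doc-123"))  # 1, then 2, 3, ... on later calls
```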
I'm storing application logs in Elasticsearch and want to delete logs older than N months. The app writes the logs to the index name my-log-index. What would be the most efficient way? Here are the approaches I found, but I'm not sure which is best:
Use the Delete By Query API and run it periodically (sketched below).
Use an alias such as my-log-alias instead of the index name, roll over to a new index using the Rollover API every N months, and also delete old indices periodically.
The first approach uses an expensive delete, and it may only be a soft delete. The second one looks more efficient. Which one is better, or is there a better way?
Elasticsearch version: 6.2.3 (I know it is EOL but can't upgrade right now)
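For reference, approach 1 might look like this (the host and timestamp field name are assumptions; _delete_by_query is available on 6.2):

```
# Delete everything older than ~3 months by timestamp range.
import requests

resp = requests.post(
    "http://localhost:9200/my-log-index/_delete_by_query",
    json={"query": {"range": {"@timestamp": {"lt": "now-90d"}}}},
)
resp.raise_for_status()
print(resp.json()["deleted"], "documents deleted")
```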
Rollover with ILM is the way to go: index lifecycle management can roll your alias over to fresh indices and delete old ones automatically. Note that ILM itself only arrived in Elasticsearch 6.6, so on 6.2.3 you would trigger the rollover and the deletions yourself, as sketched below.
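A minimal sketch of the rollover call, assuming a local cluster and made-up conditions (the Rollover API itself exists on 6.2):

```
# Roll my-log-alias over to a fresh index once the current one is a month
# old or too big; run this from cron or similar, since 6.2 has no ILM.
import requests

resp = requests.post(
    "http://localhost:9200/my-log-alias/_rollover",
    json={"conditions": {"max_age": "30d", "max_size": "20gb"}},
)
resp.raise_for_status()
print(resp.json())  # reports whether a rollover happened and the new index

# Dropping an entire old index is then a cheap metadata operation:
# requests.delete("http://localhost:9200/my-log-index-000001")
```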
Manage a document for the first time using dls:document-insert-and-manage.
Update the same document using xdmp:document-insert.
The document gets lost from the DLS latest-version collection: cts:search(/scopedIntervention/id, dls:documents-query()) no longer returns it.
Manage the document for the first time:
```
xquery version "1.0-ml";
import module namespace dls = "http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";

(: the URI is an assumption; the call wrapper was truncated in the original :)
dls:document-insert-and-manage(
  "/scopedIntervention/someId12345.xml",
  fn:false(),
  <scopedIntervention>
    <id>someId12345</id>
    <scopedInterventionName>First Name</scopedInterventionName>
    <forTestOnly>true</forTestOnly>
    <inactive>true</inactive>
  </scopedIntervention>)
```
**Document inserted with versioning**
Verify the document is present in the latest-documents collection:
cts:search(/scopedIntervention/id, dls:documents-query())
The document is present in the managed latest collection.
Update the same document:
```
xquery version "1.0-ml";

(: the URI is an assumption; the call wrapper was truncated in the original :)
xdmp:document-insert(
  "/scopedIntervention/someId12345.xml",
  <scopedIntervention>
    <id>someId12345</id>
    <scopedInterventionName>Updated Name</scopedInterventionName>
    <forTestOnly>true</forTestOnly>
    <inactive>true</inactive>
  </scopedIntervention>)
```
**Update document to same URI using xdmp:document-insert**
Verify again whether the document is present in the latest-documents collection:
cts:search(/scopedIntervention/id, dls:documents-query())
The document is NOT present in the managed latest collection (it has been lost from the collection).
After running the following DLS upgrade step, the same document shows up in the list again:
```
xquery version "1.0-ml";
import module namespace dls = "http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";

dls:set-upgrade-status(fn:false()),
dls:start-upgrade(),
fn:doc("http://marklogic.com/dls/upgrade-task-status.xml"),
dls:latest-validation-results(),
dls:set-upgrade-status(fn:true())
```
Update the same document using xdmp:document-insert
You are most likely removing the DLS Latest collection at this step. Further, version history is not preserved when you do this.
Instead of using xdmp:document-insert you should use dls:document-checkout-update-checkin (see the sketch below).
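A minimal sketch of that call, sent from Python through MarkLogic's REST eval endpoint (POST /v1/eval); the host, port, credentials, and document URI are all assumptions, and /v1/eval requires the eval privilege:

```
# Perform the DLS-safe update: checkout, replace content, checkin -- in one
# call to dls:document-checkout-update-checkin, executed via POST /v1/eval.
import requests
from requests.auth import HTTPDigestAuth

XQUERY = """
xquery version "1.0-ml";
import module namespace dls = "http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";
dls:document-checkout-update-checkin(
  "/scopedIntervention/someId12345.xml",   (: assumed URI :)
  xdmp:unquote('<scopedIntervention>
      <id>someId12345</id>
      <scopedInterventionName>Updated Name</scopedInterventionName>
    </scopedIntervention>'),
  "update name",                           (: annotation :)
  fn:true())                               (: retain history :)
"""

resp = requests.post(
    "http://localhost:8000/v1/eval",
    data={"xquery": XQUERY},
    auth=HTTPDigestAuth("admin", "admin"),
)
resp.raise_for_status()
```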
Please read to the end -- if you did NOT do a DLS upgrade on an upgraded MarkLogic version, STOP NOW and follow the upgrade instructions. Not doing so will leave DLS in an unstable state, and anything else you do will make things much harder to repair.
+1 Rob. #IAM, regardless of whether it 'worked' or appeared to 'work' in V7, DLS was not designed to handle the case you describe. The DLS architecture depends on encapsulating all changes to documents within the checkin/checkout semantics. If you bypass that, you might as well bypass DLS entirely, because it won't work. The fact that it was 'working' in V7 is misleading: it may simply not have misbehaved in ways your application cared about, or your code may have coincidentally done sufficiently similar work to the internals. You might get lucky and find a way to do so again, but I encourage you to consider how to work within the defined behaviour of the library, or to refactor the parts of your code that are not 'DLS friendly' to operate between checkout/checkin windows -- not every update has to be a single checkout-update-checkin; you can check out, do whatever you need, then check in.
As a migration workaround you MIGHT be able to make use of the upgrade functions added to dls on an ongoing basis.
See https://docs.marklogic.com/dls:start-upgrade
In V9 (I believe), significant non-backwards-compatible changes were made to DLS internals that require running this code one time.
The assumption was an in-place upgrade from the prior DLS to the current one. However, the code may also happen to work on an ongoing basis, depending on the details of exactly what your application code is doing that the DLS code doesn't know about.
The 'new' DLS code adds an internal collection to optimize the common case of searching for 'latest' documents -- if that is dropped then those documents will not show up on DLS searches (for 'latest').
You mention your code is 'migration scripts': if these are migrating from V7 to V10, then you could run your code before the V10 update, then run the V10 update, then run the DLS upgrade. After that the documents should be in good shape -- as long as you don't do anything else that is not defined behaviour for managed documents.
Are there any tried-and-true methods of managing your own sequential integer field without using SQL Server's built-in Identity Specification? I'm thinking this has to have been done many times over, and my Google skills are just failing me tonight.
My first thought is to use a separate table to manage the IDs and use a trigger on the target table to manage setting the ID. Concurrency issues are obviously important, but insert performance is not critical in this case.
And here are some gotchas I know I need to look out for:
1. Need to make sure the same ID isn't doled out more than once when multiple processes run simultaneously.
2. Need to make sure any solution to 1) doesn't cause deadlocks.
3. Need to make sure the trigger works properly when multiple records are inserted in a single statement, not only one record at a time.
4. Need to make sure the trigger only sets the ID when it is not already specified.
The reason for the last bullet point (and the whole reason I want to do this without an Identity Specification field in the first place) is because I want to seed multiple environments at different starting points and I want to be able to copy data between each of them so that the ID for a given record remains the same between environments (and I have to use integers; I cannot use GUIDs).
(Also, yes, I could SET IDENTITY_INSERT ON/OFF to copy data and still use a regular Identity Specification field, but then it reseeds after every insert. I could then use DBCC CHECKIDENT to reseed it back to where it was, but I feel the risk with this solution is too great. It only takes one time for someone to make a mistake, and by the time we realized it, it would be a real pain to repair the data... probably enough pain that it would have made more sense just to do what I'm doing now in the first place.)
SQL Server 2012 introduced the concept of a SEQUENCE database object - something like an "identity" column, but separate from a table.
You can create and use a sequence from your code, use its values in various places, and more; a sketch follows the links below.
See these links for more information:
Sequence numbers (MS Docs)
CREATE SEQUENCE statement (MS Docs)
SQL Server SEQUENCE basics (Red Gate - Joe Celko)
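A minimal sketch under assumed names (the table, sequence, and connection string are all made up), using pyodbc from Python. Note how a DEFAULT constraint on NEXT VALUE FOR covers the "only set the ID when it is not already specified" gotcha without any trigger, and START WITH lets each environment be seeded at its own starting point:

```
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Sandbox;Trusted_Connection=yes"
)
cur = conn.cursor()

# The sequence lives apart from the table, so each environment can pick
# its own START WITH value when it is created.
cur.execute("CREATE SEQUENCE dbo.WidgetId AS INT START WITH 10000 INCREMENT BY 1;")
cur.execute("""
    CREATE TABLE dbo.Widget (
        Id   INT NOT NULL
             CONSTRAINT DF_Widget_Id DEFAULT (NEXT VALUE FOR dbo.WidgetId)
             CONSTRAINT PK_Widget PRIMARY KEY,
        Name NVARCHAR(100) NOT NULL
    );
""")

# Omitting Id draws the next value from the sequence...
cur.execute("INSERT INTO dbo.Widget (Name) VALUES (?);", "new row")
# ...while an explicit Id (e.g. a row copied from another environment)
# simply bypasses the DEFAULT and is stored as-is.
cur.execute("INSERT INTO dbo.Widget (Id, Name) VALUES (?, ?);", 42, "copied row")
conn.commit()
```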
Is there an indicator or flag in gspread that indicates whether or not a change has been made to a sheet or worksheet? This appears to have been present as an attribute called updated before version 2.0, or maybe that served a different purpose?
You're looking for the Detect Changes guide in the Drive API.
For Google Drive apps that need to keep track of changes to files, the Changes collection provides an efficient way to detect changes to all files, including those that have been shared with a user. The collection works by providing the current state of each file, if and only if the file has changed since a given point in time.
There's a GitHub code demo for testing purposes.
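A minimal sketch of polling the Changes collection with google-api-python-client; the service-account file, scope, and field selection are assumptions:

```
from google.oauth2.service_account import Credentials
from googleapiclient.discovery import build

creds = Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
drive = build("drive", "v3", credentials=creds)

# Take a baseline token once and store it somewhere durable.
token = drive.changes().getStartPageToken().execute()["startPageToken"]

# ... later: anything returned here changed since the baseline was taken.
page = drive.changes().list(
    pageToken=token,
    fields="newStartPageToken,changes(fileId,time)",
).execute()
for change in page.get("changes", []):
    print("changed:", change["fileId"], "at", change.get("time"))

# Persist page["newStartPageToken"] as the baseline for the next poll.
```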
I'm new to Elasticsearch, so this is probably something quite trivial, but I haven't figured out anything better than fetching everything, processing it with a script, and updating the records one by one.
I want to make something like a simple SQL update:
UPDATE RECORD SET SOMEFIELD = SOMEEXPRESSION
My intent is to replace the actual bogus data with some data that makes more sense (so the expression is basically randomly choosing from a pool of valid values).
There are a couple of open issues about making it possible to update documents by query.
The technical challenge is that Lucene (the text search engine library that Elasticsearch uses under the hood) segments are read-only. You can never modify an existing document. What you need to do is delete the old version of the document (which, by the way, will only be marked as deleted until a segment merge happens) and index the new one. That's what the existing update API does. An update by query might therefore take a long time and lead to issues; that's why it's not released yet. A mechanism that allows interrupting running queries would be a nice-to-have for this case, too.
But there's the update by query plugin that exposes exactly that feature. Just beware of the potential risks before using it.
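If you'd rather not install the plugin, the manual pattern the question describes stays short with the official Python client; the index, field name, and pool of valid values here are all made up:

```
# Scan every document, then bulk-send partial updates that overwrite the
# bogus field with a randomly chosen valid value.
import random
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
VALID_VALUES = ["red", "green", "blue"]  # assumed pool of valid values

def update_actions():
    for hit in helpers.scan(es, index="records",
                            query={"query": {"match_all": {}}}):
        yield {
            "_op_type": "update",
            "_index": hit["_index"],
            "_type": hit["_type"],   # types still exist on older clusters
            "_id": hit["_id"],
            "doc": {"somefield": random.choice(VALID_VALUES)},
        }

helpers.bulk(es, update_actions())
```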