I'm storing application logs in Elasticsearch and want to delete logs older than N months. The app writes logs to the index my-log-index. What would be the most efficient way to do this? Here are the approaches I've found, but I'm not sure which is best:
Use the Delete By Query API and run it periodically.
Use an alias such as my-log-alias instead of the index name, roll over to a new index with the Rollover API every N months, and delete old indices periodically.
The first approach uses expensive deletes, and it may only soft-delete the documents. The second one looks more efficient. Which one is better, or is there a better way?
Elasticsearch version: 6.2.3 (I know it is EOL but can't upgrade right now)
Rollover with ILM is the way to go: index lifecycle management handles rollover for you and deletes old indices automatically. Note that ILM only shipped with Elasticsearch 6.6, so on 6.2.3 you would drive the Rollover API yourself (or with Curator) and delete aged-out indices on a schedule, as sketched below.
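For a pre-ILM cluster, the mechanics are small enough to script yourself. Below is a rough Go sketch of the two calls involved; the alias name my-log-alias, the numbered backing indices (my-log-index-000001, ...), the 30-day rollover condition, and the hard-coded index to delete are all assumptions for illustration.

package main

import (
    "bytes"
    "fmt"
    "net/http"
)

func main() {
    es := "http://localhost:9200"

    // Roll the write alias over to a fresh backing index once the current one
    // is older than 30 days (Rollover API, available in 6.x).
    rollover := []byte(`{"conditions": {"max_age": "30d"}}`)
    resp, err := http.Post(es+"/my-log-alias/_rollover", "application/json", bytes.NewReader(rollover))
    if err != nil {
        panic(err)
    }
    fmt.Println("rollover:", resp.Status)
    resp.Body.Close()

    // Deleting a whole index that has aged out of retention is a cheap metadata
    // operation, unlike delete-by-query. In practice you would list my-log-index-*
    // (e.g. via GET /_cat/indices) and pick out the old ones.
    req, _ := http.NewRequest(http.MethodDelete, es+"/my-log-index-000001", nil)
    resp, err = http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    fmt.Println("delete:", resp.Status)
    resp.Body.Close()
}

Scheduled from cron (or handled by Curator), that covers retention without ever running an expensive delete-by-query.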
I need to get all resources based on a label. I used the following code, which works, but it takes too much time (~20 sec) to get the response, even when I restrict it to only one namespace (vrf). Any idea what I'm doing wrong here?
// flags is the RESTClientGetter (e.g. genericclioptions.ConfigFlags) used to build the client,
// res is the list of resource types to query, and selector is the caller-supplied label selector.
obj, err := resource.NewBuilder(flags).
    Unstructured().
    ResourceTypes(res...).
    NamespaceParam("vrf").AllNamespaces(false).
    LabelSelectorParam("a=b").SelectAllParam(selector == "").
    Flatten().
    Latest().Do().Object()
https://pkg.go.dev/k8s.io/cli-runtime#v0.26.1/pkg/resource#Builder
As I'm already filtering by label and namespace, I'm not sure what else I should do in this case.
I've checked the cluster connection and everything seems OK; regular kubectl commands get a very fast response, it's just this query that takes a long time.
The search may be heavy due to the sheer size of the resources the query has to search through. Have you looked into this possibility and tried reducing the result set further with one more label or filter on top of the current one (see the sketch below)?
Also check the performance of your Kubernetes API server while the operation is being performed, and optimize it if needed.
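If the result set is the bottleneck, tightening the selector is the cheapest experiment, and cli-runtime can also ask the API server to page the list. A rough sketch based on the snippet above; the extra component=frontend label and the chunk size of 500 are assumptions:

// Narrow the match set with one more label and let the API server return the
// list in chunks instead of a single huge response.
obj, err := resource.NewBuilder(flags).
    Unstructured().
    ResourceTypes(res...).
    NamespaceParam("vrf").AllNamespaces(false).
    LabelSelectorParam("a=b,component=frontend"). // extra label shrinks what the server has to return
    RequestChunksOf(500).                         // server-side pagination (limit=500 per request)
    Flatten().
    Latest().Do().Object()
if err != nil {
    panic(err)
}
_ = obj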
Are there any tried-and-true methods of managing your own sequential integer field without using SQL Server's built-in Identity Specification? I'm thinking this has to have been done many times over, and my Google skills are just failing me tonight.
My first thought is to use a separate table to manage the IDs and use a trigger on the target table to manage setting the ID. Concurrency issues are obviously important, but insert performance is not critical in this case.
And here are some gotchas I know I need to look out for:
1. Need to make sure the same ID isn't doled out more than once when multiple processes run simultaneously.
2. Need to make sure any solution to 1) doesn't cause deadlocks.
3. Need to make sure the trigger works properly when multiple records are inserted in a single statement, not only one record at a time.
4. Need to make sure the trigger only sets the ID when it is not already specified.
The reason for the last item (and the whole reason I want to do this without an identity column in the first place) is that I want to seed multiple environments at different starting points and be able to copy data between them so that the ID of a given record stays the same across environments (and I have to use integers; I cannot use GUIDs).
(Also yes, I could set IDENTITY_INSERT ON/OFF to copy data and still use a regular identity column, but then it reseeds after every insert. I could use DBCC CHECKIDENT to reseed it back to where it was, but I feel the risk with that solution is too great. It only takes one person making one mistake, and by the time we noticed, repairing the data would probably be painful enough that it would have made more sense just to do what I'm doing now in the first place.)
SQL Server 2012 introduced the concept of a SEQUENCE database object - something like an "identity" column, but separate from a table.
You can create and use a sequence from your code, use its values in various places, and more.
See these links for more information:
Sequence numbers (MS Docs)
CREATE SEQUENCE statement (MS Docs)
SQL Server SEQUENCE basics (Red Gate - Joe Celko)
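For illustration, here's a minimal sketch of using a sequence from application code, in Go with database/sql and the community go-mssqldb driver; the sequence name RecordIdSeq, the dbo.Record table, the seed value, and the connection string are all assumptions:

package main

import (
    "database/sql"
    "fmt"

    _ "github.com/denisenkom/go-mssqldb" // SQL Server driver
)

func main() {
    db, err := sql.Open("sqlserver", "sqlserver://user:pass@localhost?database=mydb")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // One-time setup: a sequence seeded differently per environment.
    _, err = db.Exec(`IF NOT EXISTS (SELECT 1 FROM sys.sequences WHERE name = 'RecordIdSeq')
        CREATE SEQUENCE dbo.RecordIdSeq AS INT START WITH 10000 INCREMENT BY 1;`)
    if err != nil {
        panic(err)
    }

    // Pull the next value and use it as the explicit ID on insert.
    // Sequences hand out distinct values to concurrent callers without deadlocking.
    var id int64
    if err := db.QueryRow("SELECT NEXT VALUE FOR dbo.RecordIdSeq").Scan(&id); err != nil {
        panic(err)
    }
    _, err = db.Exec("INSERT INTO dbo.Record (Id, Name) VALUES (@p1, @p2)", id, "example")
    if err != nil {
        panic(err)
    }
    fmt.Println("inserted with id", id)
}

If you also want inserts that don't supply an ID to get one automatically (your last requirement), a DEFAULT constraint of NEXT VALUE FOR dbo.RecordIdSeq on the Id column does that without a trigger: an explicitly supplied ID simply bypasses the default.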
I'm new to graph databases and to Titan. I'm embedding Titan in a Clojure app. When the app starts up, it creates a BerkeleyDB backed Titan store.
I want to know/do 3 things:
Is this database new? If so, create a version node with version 0. Run the migration procedures to bring the "schema" to the newest version.
If not, does it have a version node? If not, throw an exception.
If the database was preexisting and has a version node, run migration procedures to bring the "schema" up to date.
How do I do this in Titan? Is there a best practice for this?
EDIT:
OK, on further review, I think using a hard-coded vertexid makes the most sense. There's a TitanTransaction.containsVertex(long vertexid). Are there any drawbacks to this approach? I guess I don't know how vertexids are allocated and what their reserved ranges are, so this smells dangerous. I'm new to graph DBs, but I think in Neo4j creating a reference node from the root node is recommended. But Titan discourages root node usage because it becomes a supernode. IDK...
1- I don't know if there is a way to see if the database is new through Titan. You could check whether the directory where BerkeleyDB will be stored exists before you start Titan.
2/3- Probably your best bet would be a hard-coded vertex with an indexed property "version". Do a lookup in the (nearly empty) index on "version" at startup and base your logic on those results.
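For what it's worth, the directory check in point 1 is just a filesystem test before you open the store. A tiny sketch (written in Go only for illustration; in the Clojure app the equivalent is java.io.File with .exists), where the storage path is an assumption matching whatever you set as storage.directory:

package main

import (
    "fmt"
    "os"
)

// isFreshStore reports whether the BerkeleyDB storage directory does not exist
// yet, i.e. Titan has never written anything there. The path is an assumption;
// use whatever you configure as storage.directory.
func isFreshStore(storageDir string) (bool, error) {
    _, err := os.Stat(storageDir)
    if os.IsNotExist(err) {
        return true, nil // nothing on disk yet: treat as a brand-new database
    }
    return false, err // directory exists (err == nil) or a real I/O error
}

func main() {
    fresh, err := isFreshStore("/var/data/titan-berkeleyje")
    if err != nil {
        panic(err)
    }
    fmt.Println("new database:", fresh)
}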
An aside, you might be interested in Titanium[0]. We're gearing up for a big release in the next week or so that should make it much more useful[1].
[0] http://titanium.clojurewerkz.org/
[1] http://blog.clojurewerkz.org/blog/2013/04/17/whats-going-on-with-titanium/
I'm new to Elasticsearch, so this is probably something quite trivial, but I haven't figured out anything better than fetching everything, processing it with a script, and updating the records one by one.
I want to make something like a simple SQL update:
UPDATE RECORD SET SOMEFIELD = SOMEXPRESSION
My intent is to replace the actual bogus data with some data that makes more sense (so the expression is basically randomly choosing from a pool of valid values).
There are a couple of open issues about making it possible to update documents by query.
The technical challenge is that Lucene (the text search library that Elasticsearch uses under the hood) segments are read-only: you can never modify an existing document. What you have to do is delete the old version of the document (which, by the way, will only be marked as deleted until a segment merge happens) and index the new one. That's what the existing Update API does. As a result, an update by query might take a long time and lead to issues, which is why it hasn't been released yet. A mechanism for interrupting running queries would also be nice to have for this case.
But there is an update by query plugin that exposes exactly that feature. Just be aware of the potential risks before using it.
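Until update by query is an option for you, the workaround amounts to searching for the matching documents and issuing a partial update for each hit, which is essentially what you described. A rough Go sketch against the standard Search and Update APIs; the record index, the somefield field, the "bogus" term, and the pool of valid values are assumptions:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "math/rand"
    "net/http"
)

const es = "http://localhost:9200"

func main() {
    validValues := []string{"red", "green", "blue"} // pool of valid replacement values (assumption)

    // 1. Find the documents whose field still holds bogus data.
    query := []byte(`{"query": {"term": {"somefield": "bogus"}}, "size": 1000}`)
    resp, err := http.Post(es+"/record/_search", "application/json", bytes.NewReader(query))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var result struct {
        Hits struct {
            Hits []struct {
                Index string `json:"_index"`
                Type  string `json:"_type"`
                ID    string `json:"_id"`
            } `json:"hits"`
        } `json:"hits"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        panic(err)
    }

    // 2. Partially update each hit via the Update API; internally Elasticsearch
    //    deletes the old document version and indexes the new one.
    for _, hit := range result.Hits.Hits {
        doc := map[string]any{"doc": map[string]any{"somefield": validValues[rand.Intn(len(validValues))]}}
        body, _ := json.Marshal(doc)
        url := fmt.Sprintf("%s/%s/%s/%s/_update", es, hit.Index, hit.Type, hit.ID)
        r, err := http.Post(url, "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        r.Body.Close()
    }
}

For anything beyond a small result set you would page through the matches with the scan/scroll API instead of a single size-limited search.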
I'm writing a library to help everyone using Amazon CloudSearch.
With CloudSearch, when you update a document you need to specify the id of the document (of course) and the version of the document you want to upgrade to.
If the specified version number is smaller than the current version, the update is not applied.
So how do I make sure my record is updated every time I perform an update?
The Ruby project aws_cloud_search uses a timestamp to keep the version number always increasing, but:
- The maximum version number for AWS CloudSearch is 4294967295, so this stops working after 7 Feb 2106.
- If you run two updates within the same second, the later update (the more important one) will be ignored.
Aside from the timestamp approach, which appears to be the standard answer from everybody, including the docs, the only approach I've found that works is to keep track of the version number elsewhere and increment it as changes happen.
Of course, this approach only works if the object you're trying to represent in the CloudSearch document can be accessed from somewhere else where you presumably have some sort of atomicity.
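A minimal sketch of that idea, assuming the authoritative copy of the record lives in a store you control and you push to CloudSearch after every change; the Record type, the in-memory Store, and the uploadToCloudSearch stub are all hypothetical stand-ins:

package main

import (
    "fmt"
    "sync"
)

// Record is a hypothetical domain object whose authoritative copy lives in
// your primary datastore; Version is incremented there, not derived from time.
type Record struct {
    ID      string
    Body    string
    Version uint32 // CloudSearch document versions are 32-bit unsigned integers
}

type Store struct {
    mu      sync.Mutex
    records map[string]*Record
}

// BumpAndUpload mutates the record and atomically assigns the next version
// before pushing it to CloudSearch, so two updates in the same second can
// never collide the way timestamp-based versions do.
func (s *Store) BumpAndUpload(id, newBody string) error {
    s.mu.Lock()
    rec, ok := s.records[id]
    if !ok {
        rec = &Record{ID: id}
        s.records[id] = rec
    }
    rec.Body = newBody
    rec.Version++ // monotonically increasing, one step per change
    snapshot := *rec
    s.mu.Unlock()

    return uploadToCloudSearch(snapshot)
}

// uploadToCloudSearch is a stand-in for the real document batch upload call.
func uploadToCloudSearch(r Record) error {
    fmt.Printf("upload id=%s version=%d\n", r.ID, r.Version)
    return nil
}

func main() {
    s := &Store{records: map[string]*Record{}}
    _ = s.BumpAndUpload("doc-1", "hello")
    _ = s.BumpAndUpload("doc-1", "hello again") // version 2 supersedes version 1
}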