Updating a field in all records in Elasticsearch

I'm new to Elasticsearch, so this is probably something quite trivial, but I haven't figured out anything better than fetching everything, processing it with a script, and updating the records one by one.
I want to do something like a simple SQL update:
UPDATE RECORD SET SOMEFIELD = SOMEEXPRESSION
My intent is to replace the current bogus data with data that makes more sense (so the expression basically picks randomly from a pool of valid values).

There are a couple of open issues about making it possible to update documents by query.
The technical challenge is that Lucene (the text search engine library that Elasticsearch uses under the hood) segments are read-only: you can never modify an existing document. What you need to do is delete the old version of the document (which, by the way, will only be marked as deleted until a segment merge happens) and index the new one. That's what the existing update API does. An update by query might therefore take a long time and lead to issues, which is why it hasn't been released yet. A mechanism for interrupting running queries would also be nice to have for this case.
But there's the update by query plugin that exposes exactly that feature. Just beware of the potential risks before using it.
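For example, one way to script the one-by-one fix the question describes from client code is to collect the document ids (say, with a scan/scroll search) and POST a partial document to the update API for each. Here is a minimal sketch using libcurl; the localhost endpoint and the records/record/somefield names are just assumptions for illustration:

#include <curl/curl.h>
#include <string>

// Send a partial-document update for a single document id.
// The caller picks newValue from its pool of valid values.
bool update_field(const std::string& id, const std::string& newValue) {
    CURL* curl = curl_easy_init();
    if (!curl) return false;

    std::string url = "http://localhost:9200/records/record/" + id + "/_update";
    std::string body = "{\"doc\":{\"somefield\":\"" + newValue + "\"}}";

    struct curl_slist* headers = nullptr;
    headers = curl_slist_append(headers, "Content-Type: application/json");

    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());  // implies POST
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);

    CURLcode rc = curl_easy_perform(curl);

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK;
}

Each call still goes through the delete-and-reindex cycle described above, so expect it to be slow on a big index.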

Related

Automate sequential integer IDs without using Identity Specification?

Are there any tried-and-true methods of managing your own sequential integer field w/o using SQL Server's built-in Identity Specification? I'm thinking this has to have been done many times over and my Google skills are just failing me tonight.
My first thought is to use a separate table to manage the IDs and use a trigger on the target table to manage setting the ID. Concurrency issues are obviously important, but insert performance is not critical in this case.
And here are some gotchas I know I need to look out for:
1. Need to make sure the same ID isn't doled out more than once when multiple processes run simultaneously.
2. Need to make sure any solution to 1) doesn't cause deadlocks.
3. Need to make sure the trigger works properly when multiple records are inserted in a single statement, not only for one record at a time.
4. Need to make sure the trigger only sets the ID when it is not already specified.
The reason for the last bullet point (and the whole reason I want to do this without an Identity Specification field in the first place) is because I want to seed multiple environments at different starting points and I want to be able to copy data between each of them so that the ID for a given record remains the same between environments (and I have to use integers; I cannot use GUIDs).
(Also, yes, I could toggle IDENTITY_INSERT on/off to copy data and still use a regular Identity Specification field, but then it reseeds the identity after every insert. I could then use DBCC CHECKIDENT to reseed it back to where it was, but I feel the risk with this solution is too great. It only takes one person making a mistake one time, and by the time we realize it, it would be a real pain to repair the data... probably enough pain that it would have made more sense just to do what I'm doing now in the first place.)
SQL Server 2012 introduced the concept of a SEQUENCE database object - something like an "identity" column, but separate from a table.
You can create and use a sequence from your code, use its values in various places, and more.
See these links for more information:
Sequence numbers (MS Docs)
CREATE SEQUENCE statement (MS Docs)
SQL Server SEQUENCE basics (Red Gate - Joe Celko)
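A sequence plus a default constraint also covers the "only set the ID when it is not already specified" gotcha directly, without a trigger. A minimal sketch, assuming a thin db.execute helper that sends one T-SQL batch to the server (all names here are illustrative):

// Seed each environment's sequence at a different START WITH value.
db.execute("CREATE SEQUENCE dbo.RecordSeq AS INT START WITH 10000 INCREMENT BY 1;");
db.execute(
    "CREATE TABLE dbo.Record ("
    "  Id INT PRIMARY KEY DEFAULT (NEXT VALUE FOR dbo.RecordSeq),"
    "  Payload NVARCHAR(100) NOT NULL);");
// An explicitly supplied Id is kept as-is...
db.execute("INSERT INTO dbo.Record (Id, Payload) VALUES (42, N'copied row');");
// ...while an omitted Id falls back to the sequence.
db.execute("INSERT INTO dbo.Record (Payload) VALUES (N'new row');");

Because the default only fires when the column is omitted, you can copy rows between environments with their original IDs, with no IDENTITY_INSERT juggling and no DBCC CHECKIDENT reseeding.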

Should I store a list in memory or in a database and should I build a class to connect to DB?

I am writing a C++ program. I have a class that provides services for the rest of the classes in the program.
I am now writing the classes and the UML.
1) The class I refer to has a task list that changes over time, and conditions are checked against this list. I am thinking of keeping it in a table in a database, where every line in the table would represent a task; this way, if the program crashes or stops working, I can restore the last state. The other option is to keep the task list in memory and keep a copy in the database.
The task list has to be searched every second.
Which approach is more recommended?
2) In order to read from and write to the database, I can either call the database directly from the class or build a database communication class. If I write a data communication class, I need to provide specific operations and build a mini server for this,
e.g. write a line to the database, read a line from the database, update only the first column, etc.
What is the recommended approach for this?
Thanks.
First, if the database is obvious and easy, and there are no performance problems, just do that. You're talking about running a query once a second, and maybe marking a task done or adding a new one every so often; even SQLite on a slow SMB share should be able to handle that just fine.
If you do need to optimize it, there are two approaches: either keep the database as primary storage and cache it in memory, or use memory as your primary storage and come up with a persistence mechanism that uses the database. But until you need to optimize it, don't.
Next, how should you do it? Your question makes it sound like you're thinking in terms of a whole three-tier system, with a "mini-server" sitting between the database server and your task list. There's really no need for that. What you want is a bespoke ORM, but that makes it sound more complicated than it is. All you're doing is writing a class that wraps a database connection and provides a handful of methods—get_due, mark_done, add, get_next_id—each of which maps SQL parameters to Task members. For example (with no error handling):
// Mark one task finished; db is the wrapped database connection.
void mark_done(const Task& task) {
    db.execute("UPDATE Task SET done = 1 WHERE id = ?", task.id);
}
Three more methods like that, plus a constructor to connect to the database (including creating the Task table if it didn't already exist), and your class is done.
The reason you don't want to write the database stuff directly into Task is that you don't really have anywhere to store shared information like the database connection object; either you need globals (or class attributes, which are effectively globals), or you need copies in every single Task instance (or, really, weak references—which you're going to fake with either a reference or a raw pointer, either way leading to shutdown problems somewhere down the line).
Finally, your whole reason for doing this is error recovery, and databases do a great job of journaling so nothing ever gets inconsistent, but you do have to make sure to structure your app to take advantage of that. For example, you may want to mark all the now-due tasks "in process", then process them, then mark them all "done"; that way, at recovery time, you know exactly which tasks may or may not have been done, and can act appropriately. The more steps you can commit to the database, the less data loss you have to deal with—but of course the more code you have to write, and the slower it gets. So, do as much as necessary, but no more.
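As a sketch of that pattern, reusing the hypothetical db.execute wrapper from above plus an assumed db.query helper that returns matching ids (and a state column instead of the simple done flag, purely for illustration):

// Commit each state change so that, after a crash, recovery knows
// exactly which tasks were in flight.
void process_due_tasks() {
    auto due = db.query("SELECT id FROM Task WHERE state = 'due'");
    for (auto id : due)
        db.execute("UPDATE Task SET state = 'in_process' WHERE id = ?", id);
    for (auto id : due) {
        run_task(id);  // the actual work, whatever that is for you
        db.execute("UPDATE Task SET state = 'done' WHERE id = ?", id);
    }
}

At startup, anything still marked 'in_process' may or may not have run, and only those tasks need to be re-checked or redone.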
Saving information in a database just to recover from crashes may be a bit of an overkill.
Ideally, you want to serialize the list and save it as binary, XML, or CSV values. This can be done on a timer or on certain events in your application.
A database may also be used if you can come up with a structure that maps one-to-one between the objects and tables, so that you can write SQL queries easily. But keep that on a separate layer for abstraction.
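If you go the serialization route, the whole mechanism can stay very small. A minimal sketch, assuming a Task with an id, a done flag, and a name (purely illustrative fields):

#include <fstream>
#include <string>
#include <vector>

struct Task { int id; bool done; std::string name; };

// Overwrite the snapshot with one CSV line per task; call this from a
// timer or after significant changes to the list.
// Naive CSV: assumes name contains no commas or newlines.
void save_tasks(const std::vector<Task>& tasks, const std::string& path) {
    std::ofstream out(path, std::ios::trunc);
    for (const auto& t : tasks)
        out << t.id << ',' << t.done << ',' << t.name << '\n';
}

On restart, parse the same file line by line to rebuild the list.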

Storing, tracking and updating an SQLite database version in C++ application

I have an application written in C++ which uses an SQLite database to store information. I need a way of assigning a version number to the database. By this I mean that I need to be able to assign a version number to the state of the database, and if a new 'state' (version) is available, I need to update the current database state to match the updated version.
I am wondering whether it would be good practice to store the information required for this to happen in a table. I would need the version number, and then some way of storing the tables and their columns related to each version number. This would allow me to make comparisons etc.
I realise that this question, Set a version to a SQLite database file, is related; however, it doesn't quite answer my question, as I am unsure if my approach is correct, and if so, how I go about achieving this.
All help is much appreciated.
Use PRAGMA user_version to read and store an integer value in the database file.
When the version in your code and database file are the same, do nothing. When they are different, upgrade/downgrade accordingly and update the version number.
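A minimal sketch of reading and writing that pragma through the SQLite C API (error handling omitted):

#include <sqlite3.h>
#include <string>

// Read the schema version stored in the database file header.
int get_user_version(sqlite3* db) {
    sqlite3_stmt* stmt = nullptr;
    int version = 0;
    sqlite3_prepare_v2(db, "PRAGMA user_version;", -1, &stmt, nullptr);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        version = sqlite3_column_int(stmt, 0);
    sqlite3_finalize(stmt);
    return version;
}

// PRAGMA statements don't accept bound parameters, so build the text.
void set_user_version(sqlite3* db, int version) {
    std::string sql = "PRAGMA user_version = " + std::to_string(version) + ";";
    sqlite3_exec(db, sql.c_str(), nullptr, nullptr, nullptr);
}

On startup, compare get_user_version() against the version compiled into your application and run one migration step per increment until they match.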

Managing %SYS.PTools.SQLStats data

I need to profile an application that uses a Caché database, and I'm trying to use CacheMonitor for that.
I have enabled query statistics (I suppose CacheMonitor executes DO SetSQLStats^%apiSQL(3) internally), and two days later my server ran out of disk space. I'm afraid there is too much data in %SYS.PTools.SQLQuery and %SYS.PTools.SQLStats, and I would like to free some space.
Is there any administration tool to manage this data? How can I delete the SQL statistics data?
NOTE: My knowledge of Caché is almost nonexistent.
It sounds like this is a pretty general problem of how to delete a global and then reclaim the disk space.
To delete the data, you should be able to use a SQL delete statement to clear out %SYS.PTools.SQLStats (which should be larger), and/or %SYS.PTools.SQLQuery.
Since this is Caché, you might also kill the globals from the command line. I haven't used these classes, but looking at the class definition in ^oddDEF, it appears to store the data in ^%SYS.PTools.SQLQueryD, ^%SYS.PTools.SQLQueryI, and ^%SYS.PTools.SQLQueryS (which is the standard default storage, so this would be likely anyway).
If you only want to delete some of it, you will need to craft your own SQL for it.
Once they are deleted, you need to actually shrink the database (like most databases, it can grow dynamically but doesn't automatically give up any space). See this reference for an example of one way to do that. The basic idea is on page 3: you can make a new database, copy all the data into it, then delete the old one once you are sure you don't need it. Don't forget to do a backup first.
To make this easier in the future, you can use the global mapping feature to save the %SYS.PTools globals into their own, new, database. Then when you want to shrink that database you can just replace it with a new one without copying all the data around (as is suggested in the class documentation for %SYS.PTools.SQLStats).

Scan for changed files

I'm looking for a good, efficient method for scanning a directory structure for changed files on Windows XP+. Something like what git does is exactly what I'm looking for: running git status displays all modified files, all new (untracked) files, and deleted files very quickly, which is exactly what I would like to do.
I have a basic model up and running which performs an initial scan and stores all filenames, sizes, dates, and attributes.
On a subsequent scan it checks whether the size, attributes, or date have changed, and marks the file as changed.
My issue now comes in detecting moved and deleted files. Is there a tried and tested method for this sort of thing? I'm struggling to come up with a good method.
I should mention that it will eventually use ReadDirectoryChangesW to monitor files and alert the user when something changes so a full scan is really a last resort after the initial scan.
Thanks,
J
EDIT: I think I may have described the problem badly. The issue I'm facing is not so much detecting the changes - I have ReadDirectoryChangesW() using IOCP on multiple threads to detect when a change happens - the issue is more what to do with the information. For example, a moved file is reported as a delete followed by a create, and a rename comes in two parts: old name, followed by new name. So what I'm asking is how to differentiate between a delete that is part of a move and an actual delete. I'm guessing buffering the changes and processing them in batches would be an option, but it feels messy.
In native code, FileSystemWatcher is replaced by ReadDirectoryChangesW. Using this properly is not simple; there is a good baseline to build off here.
I have used this code in a previous job and it worked pretty well. The Win32 API itself (and FileSystemWatcher) are prone to problems that are described in the docs and also discussed in various places online, but the impact of those will depend on your use case.
EDIT: the exact change is indicated in the FILE_NOTIFY_INFORMATION structure that you get back - adds, removals, and rename data including the old and new name.
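For reference, a sketch of walking that buffer; pairing the two rename records is the part that answers the two-part rename question, while matching a REMOVED with a later ADDED (the move heuristic) remains application logic:

#include <windows.h>
#include <string>

void walk_notifications(const BYTE* buffer) {
    auto info = reinterpret_cast<const FILE_NOTIFY_INFORMATION*>(buffer);
    std::wstring oldName;  // remembered between the two rename records
    for (;;) {
        // FileName is not null-terminated; FileNameLength is in bytes.
        std::wstring name(info->FileName,
                          info->FileNameLength / sizeof(WCHAR));
        switch (info->Action) {
        case FILE_ACTION_ADDED:            /* new, or moved in       */ break;
        case FILE_ACTION_REMOVED:          /* deleted, or moved out  */ break;
        case FILE_ACTION_MODIFIED:         /* contents or attributes */ break;
        case FILE_ACTION_RENAMED_OLD_NAME: oldName = name;             break;
        case FILE_ACTION_RENAMED_NEW_NAME: /* oldName -> name        */ break;
        }
        if (info->NextEntryOffset == 0) break;
        info = reinterpret_cast<const FILE_NOTIFY_INFORMATION*>(
            reinterpret_cast<const BYTE*>(info) + info->NextEntryOffset);
    }
}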
I voted Liviu M. up. However, another option, if you don't want to use the .NET Framework for some reason, would be to use the basic Win32 API call FindFirstChangeNotification.
You can use USN journaling if you are up to it; that is pretty low-level (NTFS-level) stuff.
Here you can find detailed information, with source code included. It is written in C#, but most of it is P/Invoking C/C++ functions.
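If you do go the USN route from native code, the starting point is a volume handle and FSCTL_QUERY_USN_JOURNAL. A rough sketch (the C: volume is an assumption, administrative rights are typically required, and error handling is omitted):

#include <windows.h>
#include <winioctl.h>

// Query the journal's current state; data.NextUsn is where a subsequent
// FSCTL_READ_USN_JOURNAL call would start reading change records.
USN_JOURNAL_DATA query_journal() {
    HANDLE vol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                             FILE_SHARE_READ | FILE_SHARE_WRITE,
                             nullptr, OPEN_EXISTING, 0, nullptr);
    USN_JOURNAL_DATA data = {};
    DWORD bytes = 0;
    DeviceIoControl(vol, FSCTL_QUERY_USN_JOURNAL, nullptr, 0,
                    &data, sizeof(data), &bytes, nullptr);
    CloseHandle(vol);
    return data;
}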