How do I migrate my database schema in a backwards-incompatible way with DataJoint? - database-migration

I work with a neuroscience lab that uses DataJoint and a shared lab database to manage our data and data pipelines during preprocessing and analysis. We have tables for different data types, both raw and processed, and as we adopt new hardware and analysis methods over the course of several years, we need to revise these schemas to support new ways of storing and working with data. Sometimes we need to modify table definitions in a backwards-incompatible way, e.g., change the primary key, change relationships between tables, remove tables, or rename tables. However, we want to ensure that users who started their data processing and analysis on a particular database schema and codebase can continue to do so. How do we do this in DataJoint?
Here is our current thinking:
One common strategy for backwards-incompatible schema migration is the Expand and Contract pattern (see also this ref).
Adapting that strategy, we would:
1. Rename the existing schema so that it is versioned, e.g., version 1 uses "common_ephys_v1", and create a new schema for version 2, "common_ephys_v2".
2. Create a branch in the code repo ("v1_v2") in which changes to the database are made in both the v1 and v2 schemas (a dual-write, sketched after this list). This adds some overhead in both code and transactions. Test the code thoroughly.
3. Select a time period during which all changes to the database are paused. Create and run a script using custom SQL to copy data from the v1 schema into the v2 schema and apply the backwards-incompatible changes. Merge the "v1_v2" branch from step 2 into "main".
4. Let users run this version of the codebase for some time and make sure data is being written correctly.
5. Create another branch in the code repo ("v2") and, in that branch, update the DataJoint classes to be compatible with only the new v2 schema.
6. Tell users who do not need the old v1 schema to adapt their custom code to the new v2 code and schema.
7. While there are users of both the v1 and v2 code and schemas, apply critical bug fixes to both the v1 and v2 classes in their respective branches ("main" and "v2").
8. Once all users of v1 have finished their analysis on the v1 schema, delete the v1 schema and merge the "v2" branch from step 5 into "main".
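For concreteness, here is a rough sketch of what the versioned schemas, the dual-write of step 2, and the copy of step 3 might look like in DataJoint. The table definitions, the "subject_id" key change, and the default value are hypothetical, and the real copy might well be custom SQL as described above:

import datajoint as dj

schema_v1 = dj.schema("common_ephys_v1")
schema_v2 = dj.schema("common_ephys_v2")


@schema_v1
class Session(dj.Manual):
    definition = """
    session_id : int
    ---
    session_note : varchar(255)
    """


@schema_v2
class SessionV2(dj.Manual):
    # v2 makes the backwards-incompatible change: a wider primary key
    definition = """
    subject_id : int
    session_id : int
    ---
    session_note : varchar(255)
    """


def insert_session(row_v1, row_v2):
    """Step 2: during the transition, every new write goes to both schemas."""
    Session().insert1(row_v1)
    SessionV2().insert1(row_v2)


def copy_v1_to_v2(default_subject_id=0):
    """Step 3: one-off copy of existing v1 rows into v2, filling in the new key column."""
    rows = Session().fetch(as_dict=True)
    SessionV2().insert(
        (dict(row, subject_id=default_subject_id) for row in rows),
        skip_duplicates=True,
    )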
This seems like a workable, albeit tedious, solution. Is there a better or standard way to do this in DataJoint?

Related

DynamoDB version control using sort keys

Has anyone implemented versioning using sort keys as described in https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html?
I'm trying to implement this in TypeScript to build a database with versioned items. Is there any way to do this using updateItem, or is a get + put operation needed?
Any sample to get me started or help is much appreciated!
The concept of versioning using a sort key involves creating a completely new item that uses the same Partition Key and a different Sort Key.
DynamoDB offers some operations that let you update values within an item atomically; this is perfect for when you have something like a counter or a quantity and you want to increase/decrease it without having to read its value first. - Docs here.
In the case you're trying to achieve, as mentioned, you are essentially creating a new item. DynamoDB by itself has no concept of versioning; what this pattern does is cleverly leverage the relationship between Partition Key and Sort Key - and the fact that one PK can have multiple SKs associated with it - to correlate multiple rows of the same table.
To answer your question, if your only source of truth (or data store) is DynamoDB, then yes, your client will have to first query the table to know which was the last version of the item being updated and then insert the new version.
If you are recording this information elsewhere and are using DynamoDB only to store the versions, then no, one put operation will be enough; but again, this assumes you can retrieve that info somewhere else.
In terms of samples, the official documentation of the AWS SDK is always a good start; in your case I assume you'll want the JavaScript one, which you can find here.
At a very high level, you'll have to do the following:
Create an AWS.DynamoDB() client.
Execute a query using the dynamodb.query() method and specifying the PK of the item you want to update.
Go through the items (rows) returned by the previous query and find the one with the highest version number as its SK.
Put a new item using the dynamodb.putItem() method passing an item with the incremented version number as SK and same PK.
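The steps above use the JavaScript SDK; purely as an illustration of the same query-then-put flow, here is a hedged sketch in Python with boto3 (the table name "Items" and the key attribute names "PK"/"SK" are invented for the example, and pagination is omitted):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Items")


def put_new_version(pk, new_attributes):
    # Steps 1-2: query every row that shares this partition key.
    resp = table.query(KeyConditionExpression=Key("PK").eq(pk))

    # Step 3: find the highest version number among the returned sort keys ("v1", "v2", ...).
    versions = [int(item["SK"].lstrip("v")) for item in resp["Items"]]
    next_version = "v{}".format(max(versions, default=0) + 1)

    # Step 4: put a new item with the same PK and the incremented version as its SK.
    table.put_item(Item={"PK": pk, "SK": next_version, **new_attributes})
    return next_version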
You can do the technique described by Amazon with a read and then a write - or more accurately, a read and then two writes (since they want to update both v0 and a new v4!). Often you need the extra read because you want to build the new version v4 based on data you read from v3 (or, equivalently, v0) - but if you don't need that, the read is not necessary and two writes are enough:
You first do an UpdateItem to v0 which increments the "Latest" attribute, sets whatever attributes you want to set in the new version, and uses the ReturnValues parameter to ask the update operation to return the new "Latest" attribute.
Then you write with PutItem the new row for v4 (where 4 is the "Latest" you just read).
This approach is safe in the sense that if two clients try to create two new versions at the same time, each one will pick a different "Latest", and both will appear in the version history. However, it is not safe in the sense that if the client dies between step 1 and step 2, you'll have a "hole" in the version history. That said, I don't think there's any implementation of this technique that doesn't suffer from this problem.
After saying this, I want to reiterate what I said in the first paragraph: in most realistic use cases, the new version would be based on the old version, so your code needs to read the old version first anyway, then decide how to change it - and then write it (twice). You can't avoid the read in these cases. By the way, in this case the first write (to v0) would be a conditional update that only writes the new version if the old version is still current ("Latest" is the same one you read during the read) - otherwise you'd be basing your modification on a non-current version. This is an example of optimistic locking.
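A hedged boto3 sketch of this two-write approach, where "Items", "PK", "SK", and "Latest" are assumed names and the v0 row is the designated "latest" copy from the AWS pattern. For brevity, only the counter bump is shown in write 1; the optimistic-locking variant would add a ConditionExpression on the previously read "Latest":

import boto3

table = boto3.resource("dynamodb").Table("Items")


def create_new_version(pk, new_attributes):
    # Write 1: atomically bump "Latest" on the v0 row and have DynamoDB
    # return the incremented value.
    resp = table.update_item(
        Key={"PK": pk, "SK": "v0"},
        UpdateExpression="SET Latest = Latest + :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    latest = int(resp["Attributes"]["Latest"])

    # Write 2: put the new row using the version number just claimed as its SK.
    table.put_item(Item={"PK": pk, "SK": "v{}".format(latest), **new_attributes})
    return latest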

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (as of the current version, HANA 2 SPS 02).
That means that, to detect "changed records since a given point in time", some other approach has to be taken.
Depending on the information in the tables, different options can be used:
- If a table explicitly contains a reference to the last change time, this can be used.
- If a table has guaranteed update characteristics (e.g., no in-place updates and monotone ID values), this can be used: read all records where the ID is larger than the last processed ID (a sketch follows this list).
- If the table does not provide intrinsic information about change time, one could maintain a copy of the table that contains only the records processed so far. This copy can then be used to compare against the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
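For instance, a minimal Python sketch of the second option (an ID watermark) using the hdbcli driver; the connection parameters, schema, table, and column names are placeholders:

from hdbcli import dbapi  # SAP HANA Python client

conn = dbapi.connect(address="hana-host", port=30015, user="BATCH_USER", password="***")


def fetch_delta(last_processed_id):
    """Return all rows added since the last nightly run, assuming monotone IDs."""
    cursor = conn.cursor()
    try:
        cursor.execute(
            "SELECT * FROM MY_SCHEMA.MY_TABLE WHERE ID > ? ORDER BY ID",
            (last_processed_id,),
        )
        rows = cursor.fetchall()
    finally:
        cursor.close()
    # The caller persists max(ID) of these rows as the watermark for the next night.
    return rows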
In my experience, efforts to try to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming and expensive than using the corresponding features of ETL tools.
It is possible to create a Log table and organize its columns according to your needs, so that by creating triggers on your database tables you can write a log record with timestamp values. Then you can query your Log table to determine which records were inserted into, updated in, or deleted from your source tables.
For example, the following is from one of my test triggers:
-- Log every UPDATE on SALARY as an 'U' row in SalaryLog
CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW, NEW ROW MYNEWROW
FOR EACH ROW
BEGIN
    INSERT INTO SalaryLog (Employee, Salary, Operation, DateTime)
    VALUES (:mynewrow.Employee, :mynewrow.Salary, 'U', CURRENT_DATE);
END;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE one.
You can organize your Log table so that it can track more than one source table if you wish, just by keeping the table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use separate Log tables for each source table.

What is the idiomatic way to perform a migration on a dynamo table

Suppose I have a dynamo table named Person, which has 2 fields, name (string), age (int). Let's assume it has a TB worth of data and experiences a small amount of read throughput, but a ton of write throughput. Now I want to add a new field called Phone (string). What is the best way to go about moving the data from one table to another?
Note: Dynamo doesn't let you rename tables, and fields cannot be null.
Here are the options I think I have:
Dump the table to .csv and run a script (overnight, probably, since it's a TB worth of data) to add a default phone number to this file. (Not ideal: it will also lose all new data submitted to the old table, unless I bring the service offline to perform the migration, which is not an option in this case.)
Use the SCAN API call. (SCAN will read all values, and inserting all the old data into the new table will then consume significant write throughput.)
How can I perform a Dynamo migration on a large table without significant data loss?
You don't need to do anything. This is NoSQL, not SQL (i.e., there is no idiomatic way to do this, as you normally don't need migrations for NoSQL).
Just start writing entries with the additional attribute.
Records that were written before will come back without this attribute; what you normally do is use a default value when it is missing.
If you want to backfill, just go through, read each item, and put it back with the additional field. You can do this in one run via a scan (see the sketch below), or again do it lazily when accessing the data.
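A hedged boto3 sketch of the one-run backfill via a scan, assuming (since the question doesn't say) that "name" is the table's partition key and that a placeholder default value is acceptable:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Person")


def backfill_phone(default_phone="unknown"):
    scan_kwargs = {}
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            try:
                # Only set Phone where it is still missing, so items already
                # written with the new attribute stay untouched.
                table.update_item(
                    Key={"name": item["name"]},
                    UpdateExpression="SET Phone = :p",
                    ConditionExpression="attribute_not_exists(Phone)",
                    ExpressionAttributeValues={":p": default_phone},
                )
            except ClientError as err:
                if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
                    raise
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]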

Doctrine 2 How to set an entity table name at run time (Zend 2)

I'm building a product with Zend 2 and Doctrine 2, and it requires that I have a separate table for each user to contain data unique to them. I've made an entity that defines what that table looks like, but how do I change the name of the table to persist the data to, or indeed retrieve the data from, at run time?
Alternatively am I going to be better off giving each user their own database, and just changing which DB I am connecting to?
I'd question the design choice first. What happens if you create a new user at runtime? The table would have to be created on the fly. Furthermore, what kind of data are you storing? To me this sounds like a pretty common multi-client (multi-tenant) setup, like:
tbl_clients
- id
- name
tbl_clientdata
- client_id
- data_1_value
- data_2_value
- data_n_value
If you really want to silo users' data, you'd have to go the separate-databases route. But that only works if each "user" is really independent of the others. Think very hard about that.
If you're building some kind of software-as-a-service, and user A and user B are just two different customers of yours with no relationship to each other, then an N+1 database setup might be appropriate (one db for each of your N users, plus one "meta" database which just holds user accounts and maybe billing-related stuff).
I've implemented something like this in ZF2/Doctrine2, and it's not terribly bad. You just create a factory for EntityManager that looks up the database information for whatever user is active, and configures the EM to connect to it. The only place it gets a bit tricky is when you find yourself writing some kind of shared job queue, where long-running workers need to switch database connections with some regularity -- but that's doable too.
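The original is ZF2/Doctrine2 (PHP), but the factory idea is stack-agnostic; purely as a hedged illustration of the shape, here is the same idea in Python with SQLAlchemy (the meta-database DSN, table, and column names are all invented):

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

# The "meta" database holding user accounts and per-user connection info.
meta_engine = create_engine("postgresql://app@meta-host/meta")


def session_factory_for(user_id):
    """Look up the active user's database and return a session factory bound to it."""
    with meta_engine.connect() as conn:
        dsn = conn.execute(
            text("SELECT dsn FROM tenant_databases WHERE user_id = :uid"),
            {"uid": user_id},
        ).scalar_one()
    return sessionmaker(bind=create_engine(dsn))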

Django and Oracle nested table support

Can Django support Oracle nested tables, varrays, or collections in some manner? I'm asking just for completeness: our project is reworking the data model, attempting to move away from an EAV organization, but I don't like creating a bucketload of dependent supporting tables for each main entity.
e.g.
(not the proper Oracle syntax, but gets the idea across)
Events
eventid
report_id
result_tuple (result_type_id, result_value)
anomaly_tuple(anomaly_type_id, anomaly_value)
contributing_factors_tuple(cf_type_id, cf_value)
etc.,
where there can be multiple rows of these tuples for one eventid.
Each of these tuples can, of course, exist as a separate table, but the nested form seems more concise. If it's something Django can't do, or something I can't easily get by modifying the model classes, then perhaps just having Django create the extra tables is the way to go (sketched below).
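For reference, a rough sketch of what "just having Django create the extra tables" might look like with plain Django models (the field types and related_name values are guesses):

from django.db import models


class Event(models.Model):
    event_id = models.AutoField(primary_key=True)
    report_id = models.IntegerField()


class ResultTuple(models.Model):
    # One Event has many result tuples: a ForeignKey instead of a nested table.
    event = models.ForeignKey(Event, on_delete=models.CASCADE, related_name="results")
    result_type_id = models.IntegerField()
    result_value = models.CharField(max_length=255)


class AnomalyTuple(models.Model):
    event = models.ForeignKey(Event, on_delete=models.CASCADE, related_name="anomalies")
    anomaly_type_id = models.IntegerField()
    anomaly_value = models.CharField(max_length=255)


class ContributingFactorTuple(models.Model):
    event = models.ForeignKey(Event, on_delete=models.CASCADE, related_name="contributing_factors")
    cf_type_id = models.IntegerField()
    cf_value = models.CharField(max_length=255)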
--edit--
I note that django-hstore is doing something very similar to what I want to do, but using postgresql's hstore capability. Maybe I can branch off of that for an Oracle nested table implementation. I dunno...I'm pretty new to python and django, so my reach may exceed my grasp in this case.
Querying a nested table gives you a cursor to traverse the tuples, one member of which is yet another cursor, so you can get the rows from the nested table.