Managing Liquibase database migrations over time - database-migration

We've been using Liquibase for our database migrations, which runs off the Migrations.xml file. This references all of our migrations, so it looks something like this:
<include file="ABC.xml" relativeToChangelogFile="true" />
<include file="DEF.xml" relativeToChangelogFile="true" />
<include file="HIJ.xml" relativeToChangelogFile="true" />
...
However, over time, this has grown immensely long. Looking to refactor this: has anyone solved this issue? What options are available to condense all of these going forward?
Thanks

There is no single answer; what works best depends on your environment, your process, and why you want to trim the changelog down.
Generally, I recommend just letting your changelog grow, because the existing changeSets are the ones you have tested, deployed, and know are correct. Changing them introduces risk, which is best avoided. Keeping your changelog unchanged also allows you to continue to bootstrap new dev/QA databases the same way older instances were set up.
While there is some performance penalty to having a large changelog, Liquibase tries to keep it to a minimum: it is primarily just parsing the changelog file and then reading from the DATABASECHANGELOG table, which should scale well. Since liquibase update is usually run infrequently, even if it takes a couple of seconds it's often not a big deal.
All that being said, there are times when it makes sense to clear out your changelog. The best way to do that depends on the problem you are trying to solve.
The easiest approach is usually to just remove the include references to your oldest changelog files. If you know that all your databases have the changeSets in ABC.xml and DEF.xml already applied, you can just remove the references to them from your master changelog and everything will be fine. Liquibase doesn't care if there are "unknown" changeSets in the DATABASECHANGELOG table. If you want to continue bootstrapping dev and QA environments with ABC.xml and DEF.xml, you can create a second "legacy" master changelog.xml which includes them, and either run both changelogs when needed or make sure the legacy changelog includes both the old and new changelogs.
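As a rough sketch of that approach, reusing the file names from the question (the legacy changelog's name is made up):
<!-- Migrations.xml: keep only the changeSets some databases may still need -->
<include file="HIJ.xml" relativeToChangelogFile="true" />
...
<!-- Migrations-legacy.xml (hypothetical): bootstraps fresh dev/QA databases with the full history -->
<include file="ABC.xml" relativeToChangelogFile="true" />
<include file="DEF.xml" relativeToChangelogFile="true" />
<include file="Migrations.xml" relativeToChangelogFile="true" />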
If you do not want to completely remove changelog references, you can manually modify the existing changeSets. Often there are just a few changes that can make a big difference in update performance. For example, if an index was created and then dropped, you can skip the cost of the create by removing both the create and drop changeSets. Again, Liquibase doesn't mind on databases that have already run those changeSets, and new databases will simply never see them. If some databases may still have the index and you want it removed, you can put an indexExists precondition on the drop changeSet after you delete the create changeSet.
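What such a guarded drop could look like, with made-up table, index, id, and author values; onFail="MARK_RAN" tells Liquibase to mark the changeSet as executed rather than fail when the index isn't there:
<changeSet id="drop-obsolete-index" author="example">
    <preConditions onFail="MARK_RAN">
        <indexExists indexName="idx_customer_name" tableName="customer"/>
    </preConditions>
    <dropIndex indexName="idx_customer_name" tableName="customer"/>
</changeSet>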
Precondition checks can be expensive too, especially depending on the database vendor, so removing now-unneeded precondition checks can also improve performance.

So far this has not been an issue for us. Yes, you do end up with a lot of includes, but so far that simply hasn't mattered to me. I would strongly suggest not deleting any old changeSets or making changes to them. Try to keep everything reproducible as-is.
Maybe it helps to group changeSets by the version of your application. By that I mean that your migrations.xml file includes one file per application version, and each of those files includes your "real" changeSets. That way you can also easily figure out which changeSet belongs to which version of your application.
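A sketch of that layout, with made-up release file names; each release file then just includes (or contains) the changeSets added in that version:
<!-- Migrations.xml -->
<include file="release-1.0.xml" relativeToChangelogFile="true" />
<include file="release-1.1.xml" relativeToChangelogFile="true" />
<include file="release-2.0.xml" relativeToChangelogFile="true" />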

Related

Maintaining South migrations on Django forks

I'm working on a pretty complex Django project (50+ models) with some complicated logic (lots of different workflows, views, signals, APIs, background tasks etc.). Let's call this project-base. Currently using Django 1.6 + South migrations and quite a few other 3rd party apps.
Now, one of the requirements is to create a fork of this project that will add some fields/models here and there and some extra logic on top of that. Let's call this project-fork. Most of the extra work will be on top of the existing models, but there will also be a few new ones.
As project-base continues to be developed, we want these features to also get into project-fork (much like a rebase/merge in git-land). The extra changes in project-fork will not be merged back into project-base.
What could be the best possible way to accomplish this? Here are some of my ideas:
Use South merges in project-fork to keep it up-to-date with the latest changes from project-base, as explained here. Use signals and any other means necessary to keep the new logic in project-fork as loosely coupled as possible to avoid potential conflicts.
Do not modify ANY of the original project-base models and instead create new models in different apps that reference the old models (i.e. using OneToOneField). Extra logic could end up in the old and/or new apps.
your idea here please :)
I would go with option 1 as it seems less complicated as a whole, but might expose a greater risk. Here's how I would see it happening:
Migrations on project-base:
0001_project_base_one
0002_project_base_two
0003_project_base_three
Migrations on project-fork:
0001_project_base_one
0002_project_fork_one
After merge, the migrations would look like this:
0001_project_base_one
0002_project_base_two
0002_project_fork_one
0003_project_base_three
0004_project_fork_merge_noop (added to merge in changes from both projects)
Are there any pitfalls using this approach? Is there a better way?
Thank you for your time.
Official South workflow:
The official South recommendation is to try with the --merge flag: http://south.readthedocs.org/en/latest/tutorial/part5.html#team-workflow
This obviously won't work in all cases, but in my experience it works in most.
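For reference, the merge is applied per app via the migrate command's --merge flag (the app name here is made up):
./manage.py migrate myapp --merge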
Pitfalls:
Multiple changes to the same model can still break
Duplicate changes can break things
The "better" way is usually to avoid simultaneous changes to the same models, the easiest way to do this is by reducing the error-window as much as possible.
My personal workflows in these cases:
With small forks where the model changes are obvious from the beginning:
Discuss which model changes will need to be done for the fork
Apply those changes to both/all branches as fast as possible to avoid conflicts
Work on the fork ....
Merge the fork which gives no new migrations
With large forks where the changes are not always obvious and/or will change again:
Do the normal fork and development stuff while trying to stay up to date with the latest master/develop branch as much as possible
Before merging back, throw away all schemamigrations in the fork (see the sketch after this list)
Merge all changes from the master/develop
Recreate all needed schema changes
Merge to develop/master
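A rough sketch of the throw-away-and-recreate step in that second workflow, assuming a hypothetical app called myapp whose fork-only migrations are easy to identify by name:
# delete only the fork's own schemamigrations, keeping the ones shared with master/develop
rm myapp/migrations/0002_project_fork_one.py
# after merging master/develop, regenerate a single migration with the fork's remaining schema changes
./manage.py schemamigration myapp --auto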

Database version deployment: Entity Framework Migrations vs SSDT DacPacs

I have a data-centered application with SQL Server. The environments in which it'll be deployed are not under our control and there's no DBA there (they are all small businesses), so we need the process of distributing each application/database update to be as automatic as possible.
Besides the normal changes between versions of an application (somewhat unpredictable at times), we already know that we'll need to distribute some new seed data with each version. Sometimes this seed data will be related to other data in our system. For instance: maybe we'll need to insert 2 new rows of some master data during the v2-v3 update process, and another 5 rows during the v5-v6 update process.
EF
We have looked at Entity Framework DB Migrations (available for existing databases without Code-First since the 4.3.1 release), which represent the traditional sequential-scripts approach in a more automatic and controlled way (like Fluent Migrations).
SSDT
On the other hand, with a different philosophy, we have checked SSDT and its dacpacs, snapshots and pre- and post-deployment scripts.
The questions are:
Which of these technologies / philosophies is more appropriate for the case described?
Any other technology / philosophy that could be used?
Any other advice?
Thanks in advance.
That's an interesting question. Here at Red Gate we're hoping to tackle this issue later this year, as we have many customers asking about how we might provide a simple deployment package. We do have SQL Packager, which essentially wraps a SQL script into an exe.
I would say that dacpacs are designed to cover the use case you describe. However, as far as I understand, they work by generating a deployment script dynamically when applied to the target. The drawback is that you won't have the warm fuzzy feeling that you might get when deploying a pre-tested SQL script.
I've not tried updating data with dacpacs before, so I'd be interested to know how well this works. As far as I recall, it truncates the target tables and repopulates them.
I have no experience with EF migrations so I'd be curious to read any answers on this topic.
We'll probably adopt a hybrid solution. We'd like not to give up on the idea of deployment packages, but on the other hand, due to our application's nature (small businesses as end users, no DBA, no obligation to upgrade, so multiple "live" database versions coexisting), we can't give up full control of the migration process either, including schema and data. In our case, pre- and post-deployment scripts may not be enough (or at least not comfortable enough) for a full migration the way EF Migrations are. Changes like adding/removing seed data, changing a "one to many" to a "many to many" relationship, or even radical database schema changes (and, consequently, data migrations to the new schema from any previously released schema) may be part of our daily work once our first version is released.
So we'll probably use EF Migrations, with their "Up" and "Down" system for each version release. In principle, each "Up" will invoke a dacpac with the latest database snapshot (and each "Down", the previous one), each with its own deployment parameters for that specific migration. EF Migrations will handle the version line, and maybe also some complex parts of data migration.
We feel more secure with this hybrid approach. We missed automation and schema-change detection in Entity Framework Migrations as much as we missed version-line control in the dacpac approach.

When should I make a repository for my project?

I've been looking into using Mercurial for version management and wanted to try it out. I'm planning on doing a project on my own to help me learn C++ coming from Java and thought it would be a good idea to try version management with it. It would probably be a game or some sort of simple application, not sure with a GUI or not.
The problem is I don't know much about version management. When in a project is a good time to make the repository? Should I just do it right away? How often should I make commits?
I'll be using Netbeans for my development since it seems to be pretty good, also has Mercurial controls built in, but what should I set in the .hgignore for my project? I assume the repository would be the whole folder Netbeans creates but what should I make it ignore?
Bonus questions: I'll be using Bitbucket for hosting/backup, should I make the project public? Why/why not? Also would that make it open source? Should I bother with attaching a license? Because I have no idea how the licenses work and such. Then again, I sincerely doubt I'll be making anything anyone will want to copy too badly.
On repos:
You should use a VCS repository as soon as possible. Not only does it allow for off-site backups, it also lets you track changes and find out when a bug was introduced.
Having a VCS, particularly a distributed VCS like Mercurial, will change the way you code. You won't have to be careful not to break things, because you always have the old version that you can revert to. If, after a few weeks or months, you decide that a particular course of development was a bad idea, you can rewind all the way back to some prior point.
You should generally commit either every day, or every time you finish a task. Like, you might have a 4-hour task of "write this class and get it working". You commit after that. When you're done for the day, you commit. I wouldn't suggest committing in a non-compiling state, though, so you should try to stop for the day only when everything still builds.
As for ignoring, you should ignore things you don't intend to commit. Things that you wouldn't want in the repo. Generated files, temporary files, the stuff that you wouldn't need.
What should be in the repo is 100% of the information necessary to build the project from scratch (minus external dependencies).
On licenses:
I would say that it is very rude to put up a public repository without some kind of license attached. Without a license, anyone even pulling from the repo (which is what you're inviting by making it public) is violating your copyright unless they have direct, explicit permission from you.
So either look at how copyright works and pick a license to release your stuff under, or keep your repos private.
The instant you conceive of it, usually. There is no penalty for putting everything you have into a VCS right away, and who knows, you may want that super-duper prototype version later on down the road. Unless you're going to be checking in large media files (don't), my answer is: right now.
As for how often to commit, that's a personal preference. I use git, and as far as I'm concerned there is no problem with checking in every time you make some unit of significant change. The more snapshots along the way you have, the easier it becomes to track down bugs later on.
Your .hgignore is up to you. You need to decide what to version and what not to. As a rule of thumb, any files generated during compilation probably shouldn't be checked in.
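Purely as an illustration, a minimal .hgignore for a NetBeans C++ project might look something like this (adjust to whatever your build actually generates):
syntax: glob
*.o
*.obj
*.exe
build/
dist/
nbproject/private/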
Also, if you release it to the public, please attach a license. It doesn't matter what it is, but it makes all the bosses and lawyers in the world happy when they know what they can and can't do with your code. Finally, please don't release something to the public unless you think someone else will benefit from it. The internet already suffers from information overload.
Commit early, and commit often.
With a distributed VCS like hg or git there's really no reason not to start a repo before you even write any code. If you decide to scrap your project, you can always delete the repo before you post it to whatever hosting site you're using. The more often you commit, the easier it will be to home in on where you messed up later on (you will).
Stuff to put in your ignore file would be build files and anything that your IDE generates periodically, like user preferences and tag caches. You don't want to foist your preferences on someone else and you don't want to have to create a commit every time you change the position of a window in the IDE. What's in the repo should be the minimum that someone would need to be able to work on your project.
If you're just doing a small project to learn, then I wouldn't worry about licenses or having a private repo. You can always add a license later. Alternatively, just pick a permissive license and go with it.
Based on my experience with git, I would recommend the following:
Create repository along with project.
Commit as often as you can with detailed comments
When you want to test a new idea, create a branch. If the idea is successful, merge your code into the master branch. If the idea is not successful, you can just discard the branch. This is probably the most important point (see the sketch after this list).
Put the names of object files, executables and editor backup files in the ignore list.
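Since the question is about Mercurial, the branch-per-idea point could look roughly like this (the branch name is made up):
hg branch try-faster-renderer      # start a named branch for the experiment
# ... hack and commit as usual ...
hg update default                  # switch back to the main line
hg merge try-faster-renderer       # the idea worked: merge it in
hg commit -m "Merge renderer experiment"
# if the idea didn't pan out, close the branch instead of merging:
# hg update try-faster-renderer && hg commit --close-branch -m "Abandon experiment"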

Optimisation tips when migrating data into Sitecore CMS

I am currently faced with the task of importing around 200K items from a custom CMS implementation into Sitecore. I have created a simple import page which connects to an external SQL database using Entity Framework and I have created all the required data templates.
During a test import of about 5K items I realized that I needed to find a way to make the import run a lot faster, so I set about finding some information on optimizing Sitecore for this purpose. I have concluded that there is not much specific information out there, so I'd like to share what I've found and open the floor for others to contribute further optimizations. My aim is to create some kind of maintenance mode for Sitecore that can be used when importing large volumes of data.
The most useful information I found was on Mark Cassidy's blogpost http://intothecore.cassidy.dk/2009/04/migrating-data-into-sitecore.html. At the bottom of this post he provides a few tips for when you are running an import.
If migrating large quantities of data, try and disable as many Sitecore event handlers and whatever else you can get away with.
Use BulkUpdateContext()
Don't forget your target language
If you can, make the fields shared and unversioned. This should help migration execution speed.
The first thing I noticed in this list was the BulkUpdateContext class, as I had never heard of it. I quickly understood why, as a search on the SDN forum and in the PDF documentation returned no hits. So imagine my surprise when I actually tested it out and found that it improves item creation/deletion speed at least tenfold!
The next thing I looked at was the first point, where he basically suggests creating a version of web.config that has only the bare essentials needed to perform the import. So far I have removed all events related to creating, saving and deleting items and versions. I have also removed the history engine and system index declarations from the master database element in web.config, as well as any custom events, schedules and search configurations. I expect there are a lot of other things I could look at removing/disabling to increase performance. Pipelines? Schedules?
What optimization tips do you have?
Incidentally, BulkUpdateContext() is a very misleading name - as it really improves item creation speed, not item updating speed. But as you also point out, it improves your import speed massively :-)
Since I wrote that post, I've added a few new things to my normal routines when doing imports.
Regularly shrink your databases. They tend to grow large and bulky. To do this, first go to Sitecore Control Panel -> Database and select "Clean Up Database". After this, do a regular ShrinkDB on your SQL Server.
Disable indexes, especially if importing into the "master" database. For reference, see http://intothecore.cassidy.dk/2010/09/disabling-lucene-indexes.html
Try not to import into "master", though; you will usually find that imports into "web" are a lot faster, mostly because that database isn't (by default) connected to the HistoryManager or other gadgets
And if you're really adventurous, there's a thing you could try that I'd been considering trying out myself but never got around to. It might work, but I can't guarantee that it will :-)
Try removing all your field types from App_Config/FieldTypes.config. The theory here is that this should essentially disable all of Sitecore's special handling of the content of these fields (like updating the LinkDatabase and so on). You would need to manually trigger a rebuild of the LinkDatabase when done with the import, but that's a relatively small price to pay
Hope this helps a bit :-)
I'm guessing you've already hit this, but putting the code inside a SecurityDisabler() block may speed things up also.
I'd be a lot more worried about how Sitecore performs with this much data... assuming you only do the import once, who cares how long that process takes. Is this going to be a regular occurrence?

How do the big sites handle immediate schema changes whilst using Django?

I've been using south but I really hate having to manually migrate data all over again even if I make one small itty bitty update to a class. If I'm not using django I can easily just alter the table schema and make an adjustment in a class and I'm good.
I know that most people would probably tell me to properly think out the schema way in advance, but realistically speaking there are times where you need to immediately make changes, and I don't think using south is ideal for this.
Is there some sort of advanced method people use, perhaps even modifying the core of Django itself? Or is there something about south that I'm not just grokking?
I really hate having to manually migrate data all over again even if I make one small itty bitty update to a class.
Can you specify what kind of updates? If you mean adding new fields or editing existing ones then obviously yes. If you mean modifying methods that operate on fields then there is no need to migrate.
I know that most people would probably tell me to properly think out the schema way in advance
It would certainly help to think it over a couple of times. Experience helps too. But obviously you cannot foresee everything.
but realistically speaking there are times where you need to immediately make changes, and I don't think using south is ideal for this.
Honestly I am not convinced by this argument. If changes can be deployed "immediately" using SQL then I'd argue that they can be deployed using South as well. Especially if you have automated your deployment using Fabric or such.
Also I find it hard to believe that the time taken to execute a migration using a generated script can be significantly greater than the time taken to first write the appropriate SQL and then execute it. At least this has not been the case in my experience.
The one exception could be a situation where the ORM doesn't readily have an equivalent for the SQL. In that case you can still execute the raw SQL through your (South) migration script.
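As a sketch of that, a hand-written South migration can wrap the raw SQL directly (this assumes Postgres and made-up table/index names; a partial index is used as an example of something the Django 1.6 ORM can't express):
from south.db import db
from south.v2 import SchemaMigration


class Migration(SchemaMigration):

    def forwards(self, orm):
        # Raw SQL the ORM has no equivalent for
        db.execute("CREATE INDEX orders_active_idx ON orders (customer_id) WHERE status = 'active'")

    def backwards(self, orm):
        db.execute("DROP INDEX orders_active_idx")

    # No frozen models needed since the orm argument isn't used
    models = {}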
Or is there something about south that I'm not just grokking?
I suspect that you are not grokking the idea of having orderly, version-controlled, reversible migrations. SQL-only migrations are not always designed to be reversible (I know there are exceptions). And they are not orderly unless the developers take particular care about keeping them so. I've even seen people fire potentially troublesome updates on production without even pausing to start a transaction first and then discard the SQL without even making a record of it.
I'm not questioning your skills or attention to detail here; I'm just pointing out what I think is your disconnect with South.
Hope this helps.