Is Seeding Data using database migration tools like Flyway a good practice?

We are in the process of setting up Flyway for our project, and we are having second thoughts on whether we want to seed the data using Flyway migrations or manually, via the SQL console or some bootstrap script.
Our concern is that if we add the seeded data through Flyway, further amendments to that data would need to go through Flyway as well. And we'd probably need to use "where field = x" or some other condition that might already be invalid at that point in time, since the data can be altered by the application. That would be problematic.
In their documentation, I can't see anything that advises against seeding data.
I'm just wondering if seeding data with migration tools like Flyway is a good idea.

The best fit is usually reference data, where the answer is a clear yes. For user-modifiable data, it depends: if this data is needed initially in all environments, then the answer is probably also yes. At the end of the day, if a WHERE condition no longer holds true in a certain environment later on, it probably also means that you don't want to overwrite that data anyway. Alternatively, you can assign fixed, synthetic, immutable IDs to all rows, which you can then always refer to later, even in the face of data changes (see the sketch below).
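A minimal sketch of what that could look like as versioned Flyway SQL migrations, assuming a hypothetical country reference table; later amendments target rows by their fixed synthetic IDs rather than by mutable columns (all table, column, and file names here are illustrative):
-- V2__seed_reference_countries.sql: seed reference data with fixed synthetic IDs
INSERT INTO country (id, iso_code, name) VALUES
    (1, 'US', 'United States'),
    (2, 'DE', 'Germany');

-- V9__amend_country_names.sql: a later amendment can safely refer to the fixed ID,
-- no matter how the application has changed other columns in the meantime.
UPDATE country SET name = 'United States of America' WHERE id = 1;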

Related

Save RocksDB stored values between runs

My C++ app uses RocksDB to store in-memory key-value sets.
At some point, I want my app to be able to keep the DB values until its next run. Meaning, the program will shut down, start again, and read the same values from the DB as it had before it shut down.
What would be the quickest and simplest way to achieve this?
I found the following article describing a backup & restore routine, https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F, but maybe it's overkill?
Adding to what yinqiwen said, RocksDB was not meant to be just an in-memory data store. It works really well with a variety of storage types, and it is especially good in terms of performance on flash storage. You can use the various RocksDB options to experiment with which configuration is best for your workload, but in most cases, even with the default settings for the persistent storage types, RocksDB should work just fine.
RocksDB already provides some ways to persist an in-memory RocksDB database. You can see this link to configure your RocksDB: http://rocksdb.org/blog/245/how-to-persist-in-memory-rocksdb-database/
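A minimal sketch of the default persistent setup, assuming you simply point the database at a directory on disk so the key-value pairs survive restarts (the path and keys are illustrative):
#include <cassert>
#include <string>
#include "rocksdb/db.h"

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;  // create the on-disk DB on the first run

    rocksdb::DB* db = nullptr;
    // Re-opening the same path on the next run restores the previously written values.
    rocksdb::Status s = rocksdb::DB::Open(options, "/var/lib/myapp/rocksdb", &db);
    assert(s.ok());

    s = db->Put(rocksdb::WriteOptions(), "key", "value");
    assert(s.ok());

    std::string value;
    s = db->Get(rocksdb::ReadOptions(), "key", &value);
    assert(s.ok() && value == "value");

    delete db;  // closes the DB; unflushed writes are recovered from the WAL on the next open
    return 0;
}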

Work on a Django database without modifying it

I'm developing optimization algorithms which operate on data stored in a Postgres Django database. My algorithms have to repeatedly modify the objects in the database and sometimes revert the changes made (they are metaheuristic algorithms, for those who know).
The problem is that I don't want to save the modifications to the Postgres database during the process. I would like to save the modifications at the end of the process, when I'm satisfied with the results of the optimization. I think the solution is to load all the concerned objects into memory, work on them, and save the in-memory objects to the database at the end.
However, it seems to be more difficult than I thought...
Indeed, when I make a Django query (i.e. Model1.objects.get or Model1.objects.filter), I fear that Django will fetch the objects sometimes from the database and sometimes from its cache, and I'm pretty sure that in some cases they will not be the same as the instances I manually loaded into memory (which are the ones I want to work on, because they may have changed since being loaded from the database)...
Is there a way to bypass such problems?
I implemented a kind of custom mini-database which works, but it's becoming too difficult to maintain and, above all, I don't think it's the simplest or most elegant way to proceed. I thought about dumping the concerned models from the Postgres database into an in-memory one (for performance), working on this in-memory DB, and, when my algorithm finishes, updating the data in the original database from the data in the in-memory one (this would imply that Django keeps a link, perhaps through the pk, between the original objects and those in the in-memory database so it can identify which are the same, and I don't know if that's possible).
Does someone have an insight?
Thank you in advance.
What you are looking for is transactions, one of the most powerful features of an RDBMS. Simply use START TRANSACTION before you start playing around with the data. At the end, if you are happy with the result, use COMMIT; if you want to discard the changes so your Django app never sees them, use ROLLBACK.
Due to the default transaction isolation level of PostgreSQL, your Django app will not see whatever changes you are making elsewhere until they are committed. At the same time, whatever changes you make in your SQL console or with other code will be visible to that same code even though they are not committed (a Django-side sketch follows the quoted documentation below).
Read Committed is the default isolation level in PostgreSQL. When a transaction uses this isolation level, a SELECT query (without a FOR UPDATE/SHARE clause) sees only data committed before the query began; it never sees either uncommitted data or changes committed during query execution by concurrent transactions. In effect, a SELECT query sees a snapshot of the database as of the instant the query begins to run. However, SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed.
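A minimal sketch of how this could look from the Django side, assuming the whole optimization run is driven inside a single transaction with transaction.atomic (the model, the improve() step, and the keep_result flag are hypothetical):
from django.db import transaction

from myapp.models import Model1  # hypothetical app and model


class DiscardChanges(Exception):
    """Raised to abort the transaction and roll back every modification."""


def run_optimization(keep_result):
    try:
        with transaction.atomic():
            # All reads and writes inside this block happen in one transaction;
            # other connections (the rest of the app) do not see them until commit.
            for obj in Model1.objects.select_for_update():
                obj.value = improve(obj.value)  # hypothetical metaheuristic step
                obj.save()
            if not keep_result:
                raise DiscardChanges()  # leaving the block via an exception rolls everything back
    except DiscardChanges:
        pass  # nothing was persisted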

Liquibase Database Migrations, Managing over time

We've been using Liquibase for our database migrations, which run off the Migrations.xml file. This file references all of our migrations, so it looks something like:
<include file="ABC.xml" relativeToChangelogFile="true" />
<include file="DEF.xml" relativeToChangelogFile="true" />
<include file="HIJ.xml" relativeToChangelogFile="true" />
...
However, over time, this has grown immensely long. We're looking to refactor this; has anyone solved this issue? What options are available to condense all of these going forward?
Thanks
There is no single answer; what works best depends on your environment, your process, and why you want to trim them down.
Generally, I recommend just letting your changelog grow, because the existing changeSets are the ones you have tested, deployed, and know are correct. Changing them introduces risk, which is always good to avoid. Keeping your changelog unchanged also allows you to continue to bootstrap new dev/QA databases the same way older instances were bootstrapped.
While there is some performance penalty to having a large changelog, Liquibase tries to keep it to a minimum: it primarily just parses the changelog file and then reads from the DATABASECHANGELOG table, which should scale well. Since liquibase update is usually run infrequently, even if it takes a couple of seconds it's often not a big deal.
All that being said, there are times when it makes sense to clear out your changelog. The best way to do that depends on the problem you are trying to resolve.
The easiest approach is usually to just remove the include references to your oldest changelog files. If you know that all your databases already have the changeSets in ABC.xml and DEF.xml applied, you can simply remove the references to them from your master changelog and everything will be fine; Liquibase doesn't care if there are "unknown" changeSets in the DATABASECHANGELOG table. If you want to continue bootstrapping dev and QA environments with ABC.xml and DEF.xml, you can create a second "legacy" master changelog which includes them, and either run both changelogs when needed or make sure the legacy changelog includes both the old and the new changelogs (see the sketch below).
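For illustration, a trimmed master changelog alongside a hypothetical "legacy" changelog that keeps the retired files available for bootstrapping fresh dev/QA databases (the file names are made up):
<!-- Migrations.xml: only the changelogs not yet applied everywhere -->
<include file="HIJ.xml" relativeToChangelogFile="true" />

<!-- Migrations-legacy.xml: kept only for bootstrapping new dev/QA databases -->
<include file="ABC.xml" relativeToChangelogFile="true" />
<include file="DEF.xml" relativeToChangelogFile="true" />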
If you do not want to completely remove changelog references, you can manually modify the existing changeSets. Often there are just a few changes that can make a big difference in update performance. For example, if an index was created and then dropped later, you can skip the cost of the create by removing both the create and the drop changeSets. Again, Liquibase doesn't care about databases that have already run those changeSets, and new databases will simply never see them. If some databases may still have the index and you want it removed, you can use an indexExists precondition on the drop changeSet after you delete the create changeSet, as in the sketch below.
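A hedged sketch of that last pattern, with an illustrative table and index name; the precondition marks the changeSet as run on databases that never had the index:
<changeSet id="drop-old-status-index" author="dba">
    <preConditions onFail="MARK_RAN">
        <indexExists indexName="idx_order_status" tableName="orders" />
    </preConditions>
    <dropIndex indexName="idx_order_status" tableName="orders" />
</changeSet>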
Precondition checks can be expensive as well, especially depending on the database vendor, so sometimes removing now-unneeded precondition checks also improves performance.
So far this has not been an issue for us. Yes, you do end up with a lot of includes, but so far I simply haven't cared. I would strongly suggest not deleting any old changeSets or making changes to them. Try to keep everything reproducible as-is.
Maybe it helps to group changeSets by the version of your application. By that I mean that your migrations.xml file includes one file per version of your application, and those files in turn include your "real" changeSets. That way you can also easily figure out which changeSet belongs to which version of your application.

SQL Query minimizing/caching in a C++ application

I'm writing a project in C++/Qt and it is able to connect to any type of SQL database supported by QtSQL (http://doc.qt.nokia.com/latest/qtsql.html). This includes local servers and external ones.
However, when the database in question is external, the speed of the queries starts to become a problem (slow UI, ...). The reason: Every object that is stored in the database is lazy-loaded and as such will issue a query every time an attribute is needed. On average about 20 of these objects are to be displayed on screen, each of them showing about 5 attributes. This means that for every screen that I show about 100 queries get executed. The queries execute quite fast on the database server itself, but the overhead of the actual query running over the network is considerable (measured in seconds for an entire screen).
I've been thinking about a few ways to solve the issue; the most important approaches seem to be (in my opinion):
Make fewer queries
Make queries faster
Tackling (1)
I could find some sort of way to delay the actual fetching of the attributes (start a transaction), and then when the programmer writes endTransaction() the database tries to fetch everything in one go (with a SQL UNION or a loop...). This would probably require quite a bit of modification to the way the lazy objects work, but if people think it is a decent solution, I believe it could be worked out elegantly. If this solution speeds everything up enough, then an elaborate caching scheme might not even be necessary, saving a lot of headaches.
I could try pre-loading attribute data by fetching it all in one query for all the objects that are requested, effectively making them non-lazy. Of course, in that case I would have to worry about stale data. How would I detect stale data without sending at least one query to the external DB? (Note: sending a query to check for stale data on every attribute check would give a best-case 0x performance increase and a worst-case 2x performance decrease when the data actually turns out to be stale.)
Tackling (2)
Queries could, for example, be made faster by keeping a local synchronized copy of the database running. However, I don't really have the possibility, on the client machines, of running exactly the same database type as the one on the server. So the local copy would, for example, be an SQLite database. This would also mean that I couldn't use a db-vendor-specific solution. What are my options here? What has worked well for people in these kinds of situations?
Worries
My primary worries are:
Stale data: there are plenty of queries imaginable that change the db in such a way that it prohibits an action that would seem possible to a user with stale data.
Maintainability: How loosely can I couple in this new layer? It would obviously be preferable if it didn't have to know everything about my internal lazy-object system and about every object and possible query.
Final question
What would be a good way to minimize the cost of making a query? "Good" meaning some combination of: maintainable, easy to implement, and not too application-specific. If it comes down to picking any two, then so be it. I'd like to hear people talk about their experiences and what they did to solve it.
As you can see, I've thought of some problems and ways of handling them, but I'm at a loss as to what would constitute a sensible approach. Since it will probably involve quite a lot of work and intensive changes to many layers in the program (hopefully as few as possible), I thought I'd ask all the experts here before making a final decision on the matter. It is also possible I'm just overlooking a very simple solution, in which case a pointer to it would be much appreciated!
Assuming all relevant server-side tuning has been done (for example: MySQL cache, best possible indexes, ...)
Note: I've checked questions from users with similar problems that didn't entirely answer mine ("Suggestion on a replication scheme for my use-case?" and "Best practice for a local database cache?", for example).
If any additional information is necessary to provide an answer, please let me know and I will duly update my question. Apologies for any spelling/grammar errors; English is not my native language.
Note about "lazy"
A small example of what my code looks like (simplified of course):
QList<MyObject*> myObjects = database->getObjects(20, 40); // fetch and construct objects 20 to 40 from the db
// ...some time later
// screen filling time!
foreach (MyObject* o, myObjects) {
    o->getInt("status", 0); // == db request
    o->getString("comment", "no comment!"); // == db request
    // about 3 more of these
}
At first glance it looks like you have two conflicting goals: query speed, but always up-to-date data. Thus you should probably fall back on your actual needs to help decide here.
1) Your database is nearly static compared to the usage of the application. In this case, use your option 1b and preload all the data. If there's a slim chance that the data may change underneath you, just give the user an option to refresh the cache (fully or for a particular subset of the data). This way the slow access is in the hands of the user.
2) The database is changing fairly frequently. In this case "perhaps" an SQL database isn't right for your needs. You may need a higher-performance dynamic database that pushes updates rather than requiring a pull. That way your application would get notified when the underlying data changes and could respond quickly. If that doesn't work, however, you want to construct your queries so as to minimize the number of DB library and I/O calls. For example, if you execute a sequence of SELECT statements, your results should contain all the appropriate data in the order you requested it; you just have to keep track of which SELECT statement each result set corresponds to. Alternatively, if you can use looser query criteria so that a single query returns more than one row, that ought to help performance as well (a sketch of this batching idea follows).
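As an illustration of the fewer-but-wider-queries idea, a rough Qt sketch that fetches all displayed attributes for the visible objects in a single round trip, instead of one query per attribute (the table, columns, and ID scheme are made up):
#include <QHash>
#include <QList>
#include <QSqlQuery>
#include <QString>
#include <QStringList>
#include <QVariant>

// One round trip for the whole screen: fetch every displayed attribute for the
// visible object IDs, then serve getInt()/getString() from this in-memory map.
QHash<int, QHash<QString, QVariant> > preloadAttributes(const QList<int>& ids)
{
    QStringList idStrings;
    foreach (int id, ids)
        idStrings << QString::number(id);

    QSqlQuery query;
    query.exec(QString("SELECT id, status, comment FROM my_objects WHERE id IN (%1)")
                   .arg(idStrings.join(",")));

    QHash<int, QHash<QString, QVariant> > cache;
    while (query.next()) {
        const int id = query.value(0).toInt();
        cache[id]["status"] = query.value(1);
        cache[id]["comment"] = query.value(2);
    }
    return cache; // stale-data handling (e.g. a manual refresh) is still up to the caller
}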

How do the big sites handle immediate schema changes whilst using Django?

I've been using South, but I really hate having to manually migrate data all over again even if I make one small itty bitty update to a class. If I weren't using Django, I could easily just alter the table schema, make the matching adjustment in the class, and be good to go.
I know that most people would probably tell me to properly think out the schema way in advance, but realistically speaking there are times when you need to immediately make changes, and I don't think using South is ideal for this.
Is there some sort of advanced method people use, perhaps even modifying the core of Django itself? Or is there something about South that I'm just not grokking?
I really hate having to manually migrate data all over again even if I make one small itty bitty update to a class.
Can you specify what kind of updates? If you mean adding new fields or editing existing ones, then obviously yes, you need a migration. If you mean modifying methods that operate on those fields, then there is no need to migrate.
I know that most people would probably tell me to properly think out the schema way in advance
It would certainly help to think it over a couple of times. Experience helps too. But obviously you cannot foresee everything.
but realistically speaking there are times when you need to immediately make changes, and I don't think using South is ideal for this.
Honestly, I am not convinced by this argument. If changes can be deployed "immediately" using SQL, then I'd argue that they can be deployed using South as well, especially if you have automated your deployment using Fabric or the like.
Also I find it hard to believe that the time taken to execute a migration using a generated script can be significantly greater than the time taken to first write the appropriate SQL and then execute it. At least this has not been the case in my experience.
The one exception could be a situation where the ORM doesn't readily have an equivalent for the SQL. In that case you can still execute the raw SQL through your (South) migration script.
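For illustration, a minimal sketch of that escape hatch in a South migration, using a hypothetical partial index that South's schema API doesn't express directly (the app, table, and index names are made up):
from south.db import db
from south.v2 import SchemaMigration


class Migration(SchemaMigration):

    def forwards(self, orm):
        # Raw SQL escape hatch; South still records this migration like any other.
        db.execute(
            "CREATE INDEX myapp_article_published_idx "
            "ON myapp_article (published_at) WHERE published_at IS NOT NULL;"
        )

    def backwards(self, orm):
        db.execute("DROP INDEX IF EXISTS myapp_article_published_idx;")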
Or is there something about South that I'm just not grokking?
I suspect that you are not grokking the idea of having orderly, version-controlled, reversible migrations. SQL-only migrations are not always designed to be reversible (I know there are exceptions), and they are not orderly unless the developers take particular care to keep them so. I've even seen people fire potentially troublesome updates at production without even pausing to start a transaction first, and then discard the SQL without making any record of it.
I'm not questioning your skills or attention to detail here; I'm just pointing out what I think is your disconnect with South.
Hope this helps.