Today I took my first steps into PostgreSQL, since it's recommended by the Django team.
I came across several issues that I solved patiently, one by one.
1) Creating tables under PostgreSQL requires logging in as a different OS user, whose password you don't even know. Fine, I found the solution and created the database.
2) After running syncdb, you can't simply execute a plain INSERT statement like this:
INSERT INTO App_contacttype (contact_type, company_id) VALUES ('Buyer', 1),('Seller', 1);
Since Django creates the table with quoted identifiers, the name becomes case sensitive, so it has to be written like this:
INSERT INTO "App_contacttype" (contact_type, company_id) VALUES ('Buyer', 1),('Seller', 1);
But the problems never seem to end. Now the insert script suddenly fails with
ERROR: value too long for type character varying(40)
SQL state: 22001
In MySQL this was no problem. I don't know; right now I'm getting cold feet, and maybe I should just stick to MySQL.
The only reason I was considering PostgreSQL was that some research suggested it has much better support for changing schemas along the way than MySQL.
However, considering that http://south.aeracode.org/ would take away all the pain of syncing schemas, would I even need to worry about schema changes at all, no matter what the underlying database is?
Related
I am currently investigating Flyway as an alternative to Liquibase, but was unable to find an answer to the following question in the documentation:
Assume a migration X is found to contain a bug after deployment in production. In retrospect, X should never have been executed as is, but it's already too late. However, we'd like to replace the migration X with a fixed version X', such that databases that are populated from scratch do not suffer from the same bug.
In Liquibase, you would fix the original changeset and use the <validChecksum> tag to notify Liquibase that the change was made on purpose. Is there an equivalent of <validChecksum> in Flyway, or an alternative mechanism that achieves the same?
Although it is a violation of Flyway's API, the following approach has worked fine for us:
Write a beforeValidate.sql that fixes the checksum to match the expected value, so that when Flyway actually validates the checksum, everything seems fine.
An example:
-- The script xyz/V03_201808230839__Faulty_migration.sql was modified to fix a critical bug.
-- However, at this point there were already production systems with the old migration file.
-- On these systems, no additional statements need to be executed to reflect the change,
-- BUT we need to repair the Flyway checksum to match the expected value during the 'validate' command.
UPDATE schema_version
SET checksum = -842223670
WHERE (version, checksum) = ('03.201808230839', -861395806);
This has the advantage of only targeting one specific migration, unlike Flyway's repair command.
Depending on how big the mess is, you could also:
simply have a follow-up migration to correct it (typo in a new column name, ...)
if that is not an option, manually fix both the migration and the DB and issue Flyway.repair() to realign the checksum: http://flywaydb.org/documentation/command/repair.html
What is the point of hiding this bad change if it has already reached production? Is it expensive to replay every time on empty databases (I assume CI runs)? Make a new DB baseline with that migration already included.
I have an application written in C++ which uses an SQLite database to store information. I need a way of assigning a version number to the database. By this I mean, I need to be able to assign a version number to the state of the database, and if a new 'state' (version) is available, then I need to update the current database state to match the state of the updated version.
I am wondering whether it would be good practice to store the information required for this to happen in a table. I would need the version number, and then some way of storing the tables and their columns related to each version number. This would allow me to make comparisons etc.
I realise that the question Set a version to a SQLite database file is related; however, it doesn't quite answer mine, as I am unsure whether my approach is correct and, if so, how to go about achieving it.
All help is much appreciated.
Use PRAGMA user_version to read and store an integer value in the database file.
When the version in your code and database file are the same, do nothing. When they are different, upgrade/downgrade accordingly and update the version number.
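A minimal sketch of that pattern, shown in Python's built-in sqlite3 module just for brevity (the table name and upgrade steps are made-up placeholders; the same two PRAGMA statements work through the C/C++ API):
import sqlite3

LATEST_VERSION = 2  # the schema version this build of the application expects

def migrate(conn):
    # PRAGMA user_version reads/writes a 32-bit integer stored in the database header.
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    if current == LATEST_VERSION:
        return  # database already matches this build, nothing to do
    # Placeholder upgrade steps; replace with your real schema changes.
    if current < 1:
        conn.execute("CREATE TABLE IF NOT EXISTS contacts (id INTEGER PRIMARY KEY, name TEXT)")
    if current < 2:
        conn.execute("ALTER TABLE contacts ADD COLUMN email TEXT")
    # PRAGMA does not accept bound parameters, so the integer is formatted in directly.
    conn.execute("PRAGMA user_version = %d" % LATEST_VERSION)
    conn.commit()

conn = sqlite3.connect("app.db")
migrate(conn)
conn.close()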
So I have two 200 MB JSON files. The first one takes 1.5 hours to load, and the second (which creates a bunch of many-to-many relationships with models from the first) takes 24+ hours (since there are no updates in the console, I had no clue whether it was still going or had frozen, so I stopped it).
Since loaddata wasn't working that well, I wrote my own script that loaded the data while also printing out what had recently been saved to the DB, but I noticed the script (along with my computer) slowed down the longer it ran. So I had to stop the script -> restart my computer -> resume at the section of data where I left off, which was faster than letting the script run straight through. This was a tedious process, since it took roughly 18 hours, with me restarting the computer every 4 hours, to get all the data fully loaded.
I'm wondering if there is a better solution for loading in large amounts of data?
EDIT: I realized there's an option to load in raw SQL, so I may try that, although I need to brush up on my SQL.
When you're loading large amounts of data, writing your own custom script is generally the fastest. Once you've got it loaded in once, you can use your database's import/export options, which will generally be very fast (e.g., pg_dump).
When you are writing your own script, though, two things will drastically speed it up:
Loading data inside a transaction. By default the database is likely in autocommit mode, which causes an expensive commit after each insert. Instead, make sure that you begin a transaction before you insert anything, then commit it afterwards (importantly, though, don't forget to commit; nothing sucks like spending three hours importing data, only to realize you forgot to commit it).
Bypass the Django ORM and use raw INSERT statements. There is some computational overhead in the ORM, and bypassing it will make things faster (a rough sketch combining both points follows below).
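A rough sketch of both points combined, assuming a reasonably recent Django (transaction.atomic) and a PostgreSQL backend; the JSON keys, the app_contact table and its columns are hypothetical, so adapt them to your own schema:
import json
from django.db import connection, transaction

# "contacts.json" and its keys are placeholders for whatever your dump looks like.
with open("contacts.json") as fh:
    records = json.load(fh)

rows = [(r["name"], r["company_id"]) for r in records]

with transaction.atomic():       # one commit at the end instead of one per row
    cursor = connection.cursor()
    cursor.executemany(
        'INSERT INTO "app_contact" (name, company_id) VALUES (%s, %s)',
        rows,
    )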
I'm new to ElasticSearch, so this is probably something quite trivial, but I haven't figured out anything better than fetching everything, processing it with a script, and updating the records one by one.
I want to make something like a simple SQL update:
UPDATE RECORD SET SOMEFIELD = SOMEXPRESSION
My intent is to replace the current bogus data with data that makes more sense (so the expression basically chooses randomly from a pool of valid values).
There are a couple of open issues about making it possible to update documents by query.
The technical challenge is that Lucene (the text search engine library that Elasticsearch uses under the hood) segments are read-only. You can never modify an existing document. What you need to do is delete the old version of the document (which, by the way, will only be marked as deleted until a segment merge happens) and index the new one. That's what the existing update API does. Therefore, an update by query might take a long time and lead to issues, which is why it hasn't been released yet. A mechanism that allows interrupting running queries would be a nice-to-have for this case, too.
But there's the update by query plugin that exposes exactly that feature. Just beware of the potential risks before using it.
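For reference, the fetch-and-reindex workaround from the question can at least be batched with the scan and bulk helpers of the official Python client. A sketch under that assumption (the "records" index, the "somefield" field and the pool of valid values are all placeholders):
import random
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch()
VALID_VALUES = ["alpha", "beta", "gamma"]   # placeholder pool of values that make more sense

def actions():
    # scan() streams every matching document instead of loading them all at once.
    for hit in scan(es, index="records", query={"query": {"match_all": {}}}):
        yield {
            "_op_type": "update",
            "_index": hit["_index"],
            "_id": hit["_id"],
            "doc": {"somefield": random.choice(VALID_VALUES)},
        }

bulk(es, actions())   # each update is still a delete plus re-index under the hood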
I switched to using sqlite3 instead of MySQL because I had to run many jobs on a PBS system which doesn't have MySQL. Of course, on my machine I don't have NFS, while the PBS system does. After spending lots of time switching to sqlite3, I went to run many jobs and corrupted my database.
Of course, the sqlite3 FAQ does mention NFS, but I didn't even think about this when I started.
I could copy the database at the beginning of each job, but that would turn into a merging nightmare!
I would never recommend sqlite to any of my colleagues for this simple reason: "sqlite doesn't work (on the machines that matter)"
I have read rants about NFS not being up to par and it being their fault.
I have tried a few workarounds, but as this post suggests, it is not possible.
Isn't there a workaround which sacrifices performance?
So what do I do? Try some other db software? Which one?
You are using the wrong tool. Saying "I would never recommend sqlite ..." based on this experience is a bit like saying "I would never recommend glass bottles" after they keep breaking when you use them to hammer in a nail.
You need to specify your problem more precisely. My attempt to read between the lines of your question gives me something like this:
You have many nodes that get work through some unspecified path, and produce output. The jobs do not interact because you say you can copy the database. The output from all the jobs can be merged after they are finished. How do you effectively produce the merged output?
Given that as the question, this is my advice:
Have each job produce its output in a structured file, unique to each job. After the jobs are finished, write a program to parse each file and insert it into an sqlite3 database. This uses NFS in a way it can handle (a single process writing sequentially to a file) and uses sqlite3 in a way that is also sensible (a single process writing to a database on a local filesystem). This avoids NFS locking issues while the jobs are running and should improve throughput, because you don't have contention on the sqlite3 database.
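A small sketch of that merge step, with a made-up per-job CSV layout and table schema just to illustrate the idea:
import csv
import glob
import sqlite3

# "results.db", the "output/job_*.csv" naming and the columns are placeholders.
conn = sqlite3.connect("results.db")   # local filesystem, not NFS
conn.execute("CREATE TABLE IF NOT EXISTS results (job_id TEXT, metric REAL)")

with conn:                             # one transaction for the whole merge
    for path in glob.glob("output/job_*.csv"):
        with open(path, newline="") as fh:
            rows = [(row["job_id"], float(row["metric"])) for row in csv.DictReader(fh)]
        conn.executemany("INSERT INTO results (job_id, metric) VALUES (?, ?)", rows)

conn.close()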