Use a long running database migration script - database-migration

I'm trialing FluentMigrator as a way of keeping my database schema up to date with minimum effort.
For the release I'm currently building, I need to run a database script to make a simple change to a large number of rows of existing data (around 2% of 21,000,000 rows need to be updated).
There's too much data to update in a single transaction (the transaction log fills up and the script aborts), so I use a WHILE loop to iterate through the table, updating 10,000 rows at a time, with each batch in a separate transaction. This works, and takes around 15 minutes to run to completion.
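The batching pattern described above could look roughly like the following. This is a minimal sketch rather than the original script: it uses Python's built-in sqlite3 module as a stand-in for the real database, and the table `items`, the column `status`, and the function `update_in_batches` are hypothetical names chosen for illustration.

```python
import sqlite3


def update_in_batches(conn, batch_size=10_000):
    """Update matching rows in batches, committing after each batch.

    Returns the total number of rows updated.
    """
    total = 0
    while True:
        # Each batch is its own transaction, so no single transaction
        # ever holds more than batch_size row changes.
        cur = conn.execute(
            """
            UPDATE items SET status = 'new'
            WHERE id IN (
                SELECT id FROM items WHERE status = 'old' LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            break  # nothing left to update
        total += cur.rowcount
    return total
```

Because every batch is committed on its own, an interrupted run can simply be restarted and will pick up the rows that still match the WHERE condition.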
Now I have the script complete, I'm trying to integrate it into FluentMigrator.
FluentMigrator seems to run all the migrations for a single batch in one transaction.
How do I get FM to run each migration in a separate transaction?
Can I tell FM to not use a transaction for a specific migration?

This is not possible as of now.
There are ongoing discussions and some work already in progress.
Check it out here: https://github.com/schambers/fluentmigrator/pull/178
But your use case will surely help push things in the right direction.
You are welcome to take part in the discussion!
Maybe someone will find a temporary workaround?

Related

Writing SSM ParameterStore update once when many are trying from Lambda?

We've inherited an application that runs on Lambda. On initialization, this App reads a configuration stored on ParameterStore and if a certain value is out of date, then that value is updated and the app can continue.
The problem is that many users (about 7 concurrent at any one time) can run this app, and they could hit the same update nearly at the same time and therefore cause ParameterStore to throw a PutParameter throttle-back error.
To avoid that and to update the ParameterStore value only once per day rather than on every run, we've devised a simple solution, and we're wanting to get some advice on whether it makes sense or not...
Here are the steps we're thinking:
Random sleep between 1 and 10 seconds ("queue" up calls to minimize clashes -- perhaps this isn't necessary??)
Does an S3 object exist for today: obj_MMDDYYYY (signals that the PStore update has been done today)
If YES then skip updating ParameterStore value
If NO then create S3 obj_MMDDYYYY (today's date), run the Update ParameterStore value
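The check-then-update steps above can be sketched as follows. This is only an illustration under stated assumptions: a plain Python set stands in for the S3 bucket, `update_parameter` is a hypothetical callable standing in for the ParameterStore write, and the function names are made up for this example. No AWS calls are made.

```python
from datetime import date


def marker_key(day):
    """Build the marker name in the obj_MMDDYYYY form used in the steps."""
    return "obj_" + day.strftime("%m%d%Y")


def update_once_per_day(markers, day, update_parameter):
    """Run update_parameter() only if no marker exists for `day`.

    `markers` is any set-like store of marker keys (a stand-in for S3).
    Returns True if the update ran, False if it was skipped.
    """
    key = marker_key(day)
    if key in markers:
        return False      # marker found: today's update is already done
    markers.add(key)      # create the marker first, as in the steps
    update_parameter()
    return True
```

Note that the existence check and the marker creation are two separate operations, so two invocations that start at nearly the same moment can both pass the check. The sketch narrows the window for duplicate updates but does not close it.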
Any advice appreciated.
Thanks!

What will happen if the power gets shut down while we are inserting into a database?

I was recently asked a question in an interview and would appreciate help figuring it out.
Suppose we have 100 files, and a process reads each file, parses it, and writes the data into a database.
Now let's say the process was at file number 60 when the power went off. How would you design the system so that, when the power comes back, the process resumes writing data into the database from where it left off before the shutdown?
This would be one way:
Loop over:
Pick up a file
Check it hasn't been processed with a query to the database.
Process the file
Update the database
Update the database with a log of the file processed
Commit
Move the file out of the non-processed queue
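The loop above can be sketched like this. It is a minimal illustration, assuming Python's built-in sqlite3 module as the database and two hypothetical tables, `records` for the parsed data and `processed_files` for the log. The final step (moving the file out of the queue) is left out.

```python
import sqlite3


def process_files(conn, files, parse):
    """Process each (name, text) pair at most once, resuming after a crash."""
    for name, text in files:
        # Skip files that a previous run already committed.
        done = conn.execute(
            "SELECT 1 FROM processed_files WHERE name = ?", (name,)
        ).fetchone()
        if done:
            continue
        rows = parse(text)
        # One transaction per file: the data rows and the log entry are
        # committed together, or rolled back together on failure.
        with conn:
            conn.executemany(
                "INSERT INTO records (file, value) VALUES (?, ?)",
                [(name, row) for row in rows],
            )
            conn.execute(
                "INSERT INTO processed_files (name) VALUES (?)", (name,)
            )
```

If the power fails in the middle of a file, that file's transaction never commits, so its log entry is missing and the next run processes it again from the start.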
You can also log the file entry to some other persistent resource.
Q. What if there are many files. Doesn't writing to logs slow down the process?
A: Probably not much, it's just one entry into the database per file. It's the cost of resilience.
Q: What if the files are so small it's almost only updating one row per file?
A: Make your update query idempotent. Don't log, but ensure that files are removed from the queue once the transaction is complete.
Q: What if there are many lines in a file. Do you really want to restart with the first line of a file?
A: Depends on the cost/benefit. You could split the file into smaller ones prior to processing each sub-file. If power outages happen all the time, then that's a good compromise. If they happen very rarely, the extra work by the system may not be worth it.
Q: What if there is a mix of small and large files?
A: Put the files into separate queues that handle them accordingly.
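As an illustration of the idempotent update mentioned in the answers above, here is a small sketch. It assumes Python's sqlite3 module and a hypothetical `results` table keyed by file name. Writing the same file twice leaves the table in the same state as writing it once, so reprocessing after a crash is harmless.

```python
import sqlite3


def store_result(conn, file_name, value):
    """Write one row per file; repeating the call does not add rows."""
    with conn:
        # The primary key on `file` makes the second write replace the
        # first instead of inserting a duplicate.
        conn.execute(
            "INSERT OR REPLACE INTO results (file, value) VALUES (?, ?)",
            (file_name, value),
        )
```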
The UPS idea by @TimBiegeleisen is very good, though:
Well actually it is about that, because unplugging a database in the middle of a lengthy transaction might result in corrupted data. – Tim Biegeleisen Feb 22 '20 at 10:24
I've experienced failure of one such, so you'll need two.
I think you must:
Store a reference to each processed file somewhere (an ID or the index of the processed file, depending on the case).
You have to define the boundaries of a single transaction. Let it be the full processing of one file: read the file, parse it, store the data in the database, and update the reference to the file you processed. If all of that succeeds, you can commit the transaction to the database.
Your main task, which processes all the files, should look into the reference table and, based on its state, fetch the next file.
In this case you create transaction around single file processing. If anything goes wrong there, you can always rerun the processing job and it will start where it left off.
Please be aware that this is a very simple example; in most scenarios you want to keep transactions as thin as possible.

BigQueryIO - only first day table can be created, despite having CreateDisposition.CREATE_IF_NEEDED

I have a dataflow job processing data from pub/sub defined like this:
read from pub/sub -> process (my function) -> group into day windows -> write to BQ
I'm using Write.Method.FILE_LOADS because of bounded input.
My job works fine, processing many GBs of data, but it fails and retries forever when it has to create another table. The job is meant to run continuously and create day tables on its own. It does fine on the first few, but then reports this indefinitely:
Processing stuck in step write-bq/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 05h30m00s without outputting or completing in state finish
Before this happens it also throws:
Load job <job_id> failed, will retry: {"errorResult":{"message":"Not found: Table <name_of_table> was not found in location US","reason":"notFound"}
The error itself is correct, because this table doesn't exist. The problem is that the job should create it on its own because of the defined option CreateDisposition.CREATE_IF_NEEDED.
The number of day tables that it creates correctly depends on the number of workers. It seems that when a worker creates one table, its CreateDisposition changes to CREATE_NEVER, causing the problem, but that's only my guess.
A similar problem was reported here, but without any definite answer:
https://issues.apache.org/jira/browse/BEAM-3772?focusedCommentId=16387609&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16387609
ProcessElement definition here seems to give some clues but I cannot really say how it works with multiple workers: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L138
I use Apache Beam SDK 2.15.0.
I encountered the same issue, which is still not fixed in Beam 2.27.0 (January 2021). Therefore I had to develop a workaround: a custom PTransform which checks whether the target table exists before the BigQueryIO stage. It uses the BigQuery Java client for this and a Guava cache, as well as a windowing strategy (fixed, check every 15s) to sustain heavy traffic of about 5000 elements per second. Here is the code: https://gist.github.com/matthieucham/85459eff5fdea8d115be520e2dd5ccc1
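The general idea of checking for the table before writing, while caching the result so the check does not run for every element, might be sketched as below. This is not the code from the linked gist and it does not use Beam or the BigQuery client: `table_exists` and `create_table` are hypothetical callables supplied by the caller, and the 15-second re-check interval is only an assumption loosely modelled on the description above.

```python
import time


class TableExistenceCache:
    """Remember which tables were checked, re-checking after `ttl` seconds."""

    def __init__(self, table_exists, create_table, ttl=15.0,
                 clock=time.monotonic):
        self._table_exists = table_exists   # hypothetical: name -> bool
        self._create_table = create_table   # hypothetical: name -> None
        self._ttl = ttl
        self._clock = clock
        self._checked_at = {}               # table name -> time of last check

    def ensure(self, table_name):
        """Create the table if it is missing, at most one check per ttl."""
        now = self._clock()
        last = self._checked_at.get(table_name)
        if last is not None and now - last < self._ttl:
            return  # checked recently: skip the remote call
        if not self._table_exists(table_name):
            self._create_table(table_name)
        self._checked_at[table_name] = now
```

Calling `ensure` for every element is then cheap, because the existence check itself only runs once per table per interval.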
There was a bug in the past that caused this error, but that particular one was fixed in commit https://github.com/apache/beam/commit/d6b4dcec5f297f5c1bd08f345f0e1e5c756775c2#diff-3f40fd931c8b8b972772724369cea310 Can you check to see if the version of Beam you are running includes this commit?

Informatica PowerExchange CDC Data results in target DB way too slow

First of all, I'm very new to Informatica PowerCenter and PowerExchange.
We are using Informatica PowerCenter and PowerExchange to receive CDC data from our source DB2 to a PostgreSQL DB. Therefore we have one workflow where 7 tables are mapped and we get the result in our PostgreSQL. It works fine so far, but it's lacking performance. Not that the size of data is the problem, it's more the delay I see results in the target DB.
When I insert or delete some data on the DB2 side (around 10 rows in one table, for example), I see the results in our PostgreSQL mostly after about 10-30 seconds (very rarely in less than 5 seconds).
My goal is to reduce this delay. Is this possible? What would I need for that?
I experimented a bit with the commit interval and the DTM buffer size, but nothing helped much.
I also have the feeling that when I configure the workflow to run continuously, it's even slower than when I execute the workflow after making the inserts/deletes.
Thanks in advance

How to optimize writing this data to a postgres database

I'm parsing poker hand histories, and storing the data in a postgres database. Here's a quick view of that:
I'm getting relatively poor performance, and parsing files takes several hours. I can see that the database part takes 97% of the total program time, so even a little optimization there would make this a lot quicker.
The way I have it set-up now is as follows:
1. Read next file into a string.
2. Parse one game and store it into object GameData.
3. For every player, check if we have his name in the std::map. If so; store the playerids in an array and go to 5.
4. Insert the player, add it to the std::map, store the playerids in an array.
5. Using the playerids array, insert the moves for this betting round, store the moveids in an array.
6. Using the moveids array, insert a movesequence, store the movesequenceids in an array.
7. If this isn't the last round played, go to 5.
8. Using the movesequenceids array, insert a game.
9. If this was not the final game, go to 2.
10. If this was not the last file, go to 1.
Since I'm sending queries for every move, for every movesequence, for every game, I'm obviously doing too many queries. How should I bundle them for best performance? I don't mind rewriting a bit of code, so don't hold back. :)
Thanks in advance.
CX
It's very hard to answer this without any queries, schema, or a Pg version.
In general, though, the answer to these problems is to batch the work into bigger, coarser batches to avoid repeating lots of work and, most importantly, to do it all in one transaction.
You haven't said anything about transactions, so I'm wondering if you're doing all this in autocommit mode. Bad plan. Try wrapping the whole process in a BEGIN and COMMIT. If it's a seriously long-running process, then COMMIT every few minutes / tens of games / whatever, write a checkpoint file or DB entry your program can use to resume the import from that point, and open a new transaction to carry on.
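The commit-every-N-games idea with a resumable checkpoint might look like the sketch below. It is written in Python with the built-in sqlite3 module as a stand-in for PostgreSQL and libpq, and the tables `games` and `checkpoint` (a single row holding `next_index`) are hypothetical.

```python
import sqlite3


def import_games(conn, games, commit_every=100):
    """Insert games from the stored checkpoint onward.

    Commits every `commit_every` games. The checkpoint row is updated in
    the same transaction as the games it covers, so after a crash the
    stored index always matches the committed data.
    Returns the number of games inserted by this call.
    """
    start = conn.execute("SELECT next_index FROM checkpoint").fetchone()[0]
    for i in range(start, len(games)):
        conn.execute(
            "INSERT INTO games (id, data) VALUES (?, ?)", (i, games[i])
        )
        if (i + 1) % commit_every == 0 or i + 1 == len(games):
            conn.execute("UPDATE checkpoint SET next_index = ?", (i + 1,))
            conn.commit()  # the next insert opens a new transaction
    return len(games) - start
```

Running the function again after an interruption continues from the last committed checkpoint instead of re-importing everything.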
It'll help to use multi-valued inserts where you're inserting multiple rows to the same table. Eg:
INSERT INTO some_table(col1, col2, col3) VALUES
('a','b','c'),
('1','2','3'),
('bork','spam','eggs');
You can improve commit rates with synchronous_commit=off and a commit_delay, but that's not very useful if you're batching work into bigger transactions.
One very good option will be to insert your new data into UNLOGGED tables (PostgreSQL 9.1 or newer) or TEMPORARY tables (all versions, but lost when session disconnects), then at the end of the process copy all the new rows into the main tables and drop the import tables with commands like:
INSERT INTO the_table
SELECT * FROM the_table_import;
When doing this, CREATE TABLE ... LIKE is useful.
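The shape of the import-table approach can be sketched as follows. This is an illustration only: it uses Python's sqlite3 module with a TEMP table as a stand-in, so it shows the sequence of steps rather than the performance benefit of UNLOGGED tables in PostgreSQL. SQLite has no CREATE TABLE ... LIKE, so the sketch copies the column layout with CREATE TABLE ... AS SELECT instead, and the two-column layout of `the_table` is an assumption.

```python
import sqlite3


def load_via_staging(conn, rows):
    """Load rows into a temporary import table, then copy them over."""
    # Copy the column layout of the main table (no rows, no constraints).
    conn.execute(
        "CREATE TEMP TABLE the_table_import AS "
        "SELECT * FROM the_table WHERE 0"
    )
    # Bulk-insert into the import table first.
    conn.executemany("INSERT INTO the_table_import VALUES (?, ?)", rows)
    # Move everything into the main table in one statement, then clean up.
    conn.execute("INSERT INTO the_table SELECT * FROM the_table_import")
    conn.execute("DROP TABLE the_table_import")
    conn.commit()
```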
Another option - really a more extreme version of the above - is to write your results to CSV flat files as you read and convert them, then COPY them into the database. Since you're working in C++ I'm assuming you're using libpq - in which case you're hopefully also using libpqtypes. libpq offers access to the COPY api for bulk-loading, so your app wouldn't need to call out to psql to load the CSV data once it'd produced it.