In Liquibase, what is a reasonable limit on the number of migrations / changesets?

In my understanding, every liquibase update run goes through the complete changelog file, calculates the MD5 checksum of every changeset it encounters, and checks each changeset against the DATABASECHANGELOG table to see whether it has already been deployed or has been modified.
If you imagine a large number of changesets already deployed, say tens of thousands or more, the process of adding just a single changeset might (at least in theory) become time-consuming.
What is the correlation between the number of changesets and "migration" time? Are there any limitations? If so, what are the possible alternative solutions for database migrations with a large number of objects and changes?
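To make the premise concrete, here is a rough Python sketch of the per-changeset check described above. This is not Liquibase's actual code: as far as I know, Liquibase reads the deployed rows of DATABASECHANGELOG once per run and uses its own normalized checksum, so the per-changeset cost is dominated by parsing and hashing rather than by a database round-trip per changeset.

```python
import hashlib

def checksum(changeset_body: str) -> str:
    # Stand-in for Liquibase's normalized checksum; plain MD5 of the raw text here.
    return hashlib.md5(changeset_body.encode("utf-8")).hexdigest()

def plan_update(changelog, deployed_checksums):
    """changelog: list of dicts with id/author/path/body.
    deployed_checksums: {(id, author, path): stored_checksum} read once from DATABASECHANGELOG."""
    to_run, modified = [], []
    for cs in changelog:                       # one pass over the full changelog: O(n) in changesets
        key = (cs["id"], cs["author"], cs["path"])
        stored = deployed_checksums.get(key)
        if stored is None:
            to_run.append(cs)                  # never deployed -> schedule for execution
        elif stored != checksum(cs["body"]):
            modified.append(cs)                # checksum mismatch -> validation failure
    return to_run, modified
```

Under that model the work grows roughly linearly with the number of changesets in the changelog file, so tens of thousands of changesets add parsing and hashing time, not tens of thousands of queries.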

Related

Best way to store one variable in AWS?

I have an interesting problem where I have a job processing architecture that has a limit on how many jobs can be processed at once. When another job is about to start, it needs to check how many jobs are being processed, and if it is at the threshold, add the job to a queue.
What has stumped me is the best way to implement a "counter" that tracks the number of jobs running at once. This counter needs to be accessed, incremented and decremented from different lambda functions.
My first thought was a CloudWatch custom high-resolution metric, but 1-second granularity is not quick enough, as the system breaks if too many jobs are submitted. Additionally, I'm not sure whether a metric can be incremented and decremented through code alone. The only other thing I can think of is an entirely separate DB or EC2 instance, but that seems like complete overkill for just ONE number. We are not using a DB for data storage (that lives on another cloud platform); on AWS we only have S3.
Any suggestions on what to do next? Thank you so much :)
You could use a DynamoDB table to hold your counter as a single item. However, keep in mind that some operations in DynamoDB could lead to race conditions, so you might want to "lock" your table.
Depending on your load, this could potentially be free, given the Free Tier.
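As a minimal boto3 sketch of that idea (the JobCounter table, the pk/running attribute names and the MAX_JOBS threshold are all made up for illustration), a conditional atomic update avoids an explicit lock, since two Lambdas cannot both grab the last slot:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("JobCounter")          # hypothetical table holding a single counter item

MAX_JOBS = 10                                 # hypothetical concurrency threshold

def try_acquire_slot() -> bool:
    """Atomically increment the running-jobs counter unless it is already at the limit."""
    try:
        table.update_item(
            Key={"pk": "counter"},
            UpdateExpression="ADD running :one",
            ConditionExpression="attribute_not_exists(running) OR running < :max",
            ExpressionAttributeValues={":one": 1, ":max": MAX_JOBS},
        )
        return True                           # slot acquired, start the job
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False                      # at the threshold -> put the job on the queue instead
        raise

def release_slot() -> None:
    """Decrement the counter when a job finishes."""
    table.update_item(
        Key={"pk": "counter"},
        UpdateExpression="ADD running :minus_one",
        ExpressionAttributeValues={":minus_one": -1},
    )
```

The increment and the threshold check happen in one DynamoDB request, so the race the answer warns about never materializes.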

Update 1bn rows in Amazon RDS Postgres Database

What would be the best way to update all records in a large Amazon RDS Postgres database with 1bn rows? The update logic needs to run in code, e.g. NodeJS. We also want to keep track of progress and errors, and avoid too much manual work, since we expect this to take quite some time to execute.
I have been thinking of a task queue in SQS as a trigger for a Lambda running the code, where each task covers a small range of the rows.
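The question asks for NodeJS, but as a sketch of the same fan-out idea in Python/boto3 (the queue URL, total row count and chunk size below are placeholders), a producer could enqueue one message per row range and let Lambda workers consume them, with progress and errors tracked per range:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/row-ranges"  # hypothetical queue
TOTAL_ROWS = 1_000_000_000
CHUNK_SIZE = 10_000          # rows per task; small enough for one Lambda invocation to finish

def enqueue_ranges():
    batch = []
    for start in range(0, TOTAL_ROWS, CHUNK_SIZE):
        batch.append({
            "Id": str(start),
            "MessageBody": json.dumps({"from_id": start, "to_id": start + CHUNK_SIZE}),
        })
        if len(batch) == 10:                  # SQS allows at most 10 messages per batch call
            sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)
            batch = []
    if batch:
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)
```

Failed ranges end up back on the queue (or on a dead-letter queue) and can be retried, which gives you the progress/error tracking with very little manual work.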

Create a copy of Redshift production with limited # records in each table

I have a production Redshift cluster with a significant amount of data on it. I would like to create a 'dummy' copy of the cluster that I can use for ad-hoc development and testing of various data pipelines. The copy would have all the schemas/tables of production, but only a small subset of the records in each table (say, limited to 10,000 rows per table).
What would be a good way to create such a copy, and refresh it on a regular basis (in case production schemas change)? Is there a way to create a snapshot of a cluster with limits on each table?
So far my thinking is to create a new cluster and use some of the admin views as defined here to automatically get the DDL of schemas/tables etc. and write scripts that generate UNLOAD statements (with limits on number of records) for each table. I can then use these to populate my dev cluster. However I feel there must be a cleaner solution.
I presume your basic goal is cost-saving. This needs to be balanced against administrative effort (how expensive is your time?).
It might be cheaper to produce a full copy (restore from backup) of the cluster but turn it off at night and on weekends to save money. If you automate the restoration process you could even schedule it to start before you come into work.
That way, you'll have a complete replica of the production system with effectively zero administration overhead (once you write a couple of scripts to create/delete the cluster), and you can save roughly 75% of the cost by running it only ~40 of the 168 hours in a week. Plus, each time you create a new cluster it contains the latest data from the snapshot, so there is no need to keep them "in sync".
The simplest solutions are often the best.
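For reference, a rough boto3 sketch of the two scripts mentioned above (the cluster identifiers are placeholders); scheduling create in the morning and delete in the evening gives the on-demand copy:

```python
import boto3

redshift = boto3.client("redshift")

def create_dev_cluster(prod_cluster_id: str = "prod-cluster",
                       dev_cluster_id: str = "dev-cluster") -> None:
    """Restore the most recent production snapshot into a throwaway dev cluster."""
    snapshots = redshift.describe_cluster_snapshots(
        ClusterIdentifier=prod_cluster_id
    )["Snapshots"]
    latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])
    redshift.restore_from_cluster_snapshot(
        ClusterIdentifier=dev_cluster_id,
        SnapshotIdentifier=latest["SnapshotIdentifier"],
    )

def delete_dev_cluster(dev_cluster_id: str = "dev-cluster") -> None:
    """Tear the dev cluster down at night / over the weekend; no final snapshot needed."""
    redshift.delete_cluster(
        ClusterIdentifier=dev_cluster_id,
        SkipFinalClusterSnapshot=True,
    )
```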

Online update spanner schema is extremely slow

Online updates of the Spanner schema take minutes even for very small tables (tens of rows),
e.g. adding/dropping/altering columns, adding tables, etc.
This can be quite frustrating for development processes and new version deployments.
Any plans for improvement?
A few more questions:
Does anyone know of a 3rd-party schema comparison tool for Spanner? I couldn't find any.
What about data backups, in order to save historical snapshots?
Thanks in advance
Schema Updates:
Since Cloud Spanner is a distributed database, it has to propagate a schema update to all moving parts of the system, which accounts for the latency you describe.
As a suggestion, you could batch your schema updates. Batching keeps latency low (a batch costs roughly the same as a single schema update) and can be done through the API or the gcloud command-line tool.
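For example, with the Python client (the instance/database names and DDL statements below are placeholders), a batch of statements is submitted as a single long-running operation:

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# One UpdateDatabaseDdl call carrying several statements instead of one call per statement.
operation = database.update_ddl([
    "ALTER TABLE Users ADD COLUMN LastLogin TIMESTAMP",
    "CREATE INDEX UsersByLastLogin ON Users(LastLogin)",
])
operation.result()   # block until the whole batch has been applied
```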
Schema Comparison Tool:
You could use the getDatabaseDdl API to maintain a history of your schema changes and use your tool of choice to diff them.
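A sketch of that with the Python client (names are again placeholders): dump the current DDL after each deployment, commit the file to version control, and let an ordinary diff tool act as the schema comparison:

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

database.reload()                      # refreshes metadata, including the current DDL
with open("schema_snapshot.sql", "w") as f:
    # ddl_statements come back without trailing semicolons, so add them when writing.
    f.write(";\n\n".join(database.ddl_statements) + ";\n")
```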

Amazon MapReduce with cronjob + APIs

I have a website set up on an EC2 instance which lets users view info from 4 of their social networks.
Once a user joins, the site should update their info every night, to show up-to-date and relevant information the next day.
Initially we had a cron job which went through each user, made the necessary calls to the APIs and then stored the data in the DB (an Amazon RDS instance).
This operation takes between 2 and 30 seconds per person, which means doing it one by one would take days to update everyone.
I was looking at MapReduce and would like to know if it would be a suitable option for what I'm trying to do, but at the moment I can't tell for sure.
Would I be able to give an .sql file to MapReduce, with all the records I want to update + a script that tells MapReduce what to do with each record and have it process them all simultaneously?
If not, what would be the best way to go about it?
Thanks for your help in advance.
I am assuming each user's data is independent of the other users' data, which seems logical to me. If that's not the case, please ignore this answer.
Since you have mutually independent data (that is, each user's data is independent from other users') there is no need to use MapReduce. MR is just a paradigm in programming that simplifies data manipulation when the data is not independent (map prepares the data, then there is sorting phase, then reduce pulls the results from the sorted records).
In your case, if you want to use more computers, just split the load between them - each computer should process ~10000 users per hour (very rough estimate). Then users can be distributed among the computers beforehand, or they can be requested in chunks of 1000 or so, so the machines that finish sooner can process more users.
BUT there is an added bonus in using MR framework (such as Hadoop), even if you only use one phase (map only). It does the error handling for you (nodes failing, jobs failing,...) and it takes care of distributing the input among the nodes.
I'm not sure if MR is worth all the trouble to set it up, depends on your previous experience - YMMV.
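A minimal sketch of the "just split the load" approach in Python (fetching user IDs and the refresh_user body are placeholders standing in for your API calls and RDS writes): since the per-user work is mostly waiting on social-network APIs, a thread pool on each machine already buys a lot of parallelism before you reach for Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1000            # hand out work in chunks so faster machines pick up more

def refresh_user(user_id):
    """Placeholder: call the 4 social-network APIs and write the result to RDS."""
    ...

def process_chunk(user_ids, workers: int = 32):
    # The work is I/O-bound (HTTP calls), so threads overlap the waiting time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(refresh_user, user_ids))

def run(all_user_ids):
    for i in range(0, len(all_user_ids), CHUNK_SIZE):
        process_chunk(all_user_ids[i:i + CHUNK_SIZE])
```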
If my understanding is correct, and this application were to be implemented as MapReduce, all the processing would be done in the Map phase and Reduce would simply output the Map phase's result.
So if I were to implement this, I would just divide the job across multiple EC2 instances, with each instance processing a given range of records in your SQL data. This assumes you have a good idea of how to divide the data among the instances.
The advantage is that you don't pay the price of Elastic MapReduce and you avoid any possible MapReduce overhead.