GCP Cloud SQL PITR, how long can it take? - google-cloud-platform

We had a rather disastrous event with a database at work (managed on Cloud SQL, it's a MySQL db), and thankfully we have point-in-time recovery enabled.
We cloned the production database to a point in time just before the disaster in order to recover the data, but the clone has now been running for more than 5 hours for a 69 GB database (that's the size shown on the GCP Cloud SQL panel, so the real size of the DB is probably smaller).
Does anyone have experience with this?
The status of the operation when querying it with the gcloud CLI says "RUNNING". We checked the instance's logs to see if anything was off, but there are no logs at all.
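For reference, this is roughly how we are checking on it (instance name and operation ID are placeholders):

    # List recent operations on the clone target to find the operation ID
    gcloud sql operations list --instance=prod-clone --limit=5

    # Inspect the clone operation itself; ours just reports RUNNING with no error detail
    gcloud sql operations describe <OPERATION_ID>

    # Optionally block until the operation finishes instead of polling
    gcloud sql operations wait <OPERATION_ID> --timeout=unlimited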

Related

Restore metrics after Google Cloud SQL (Postgres) crash

A couple of days ago our Google Cloud SQL instance "crashed", or at least stopped responding. It recovered and works again, and Query Insights and so on are available again.
However, most metrics, like CPU utilization, storage usage and memory usage, are currently not available. I thought those would recover automatically as well, but after 2 days I wonder if something needs to be done manually.
Is there something I can do other than restarting the database (which would only be my last resort)?
Okay, after waiting around 3 days the metrics are working again.

Is this normal for GCP Cloud SQL disk usage?

I created a Cloud SQL db for learning purposes a while ago and have basically never used it for anything. Yet the storage / disk space keeps climbing:
Updated the image to show the timescale: this climb seems to happen within just a few hours!
Is this normal? If not, how do I troubleshoot / prevent this steady climb? The only operations against the db seem to be backup operations. I'm not doing any ops (as far as I know).
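One way to narrow down where the space is going, assuming you can connect with the mysql client (host and credentials below are placeholders), is to compare the data size MySQL reports with the binary logs the instance keeps when automated backups / point-in-time recovery are enabled, since those can also count toward disk usage:

    # Total size of the actual data and indexes, per schema
    mysql -h <instance-ip> -u root -p -e "
      SELECT table_schema,
             ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS size_mb
      FROM information_schema.tables
      GROUP BY table_schema;"

    # Binary logs retained for point-in-time recovery also occupy instance disk
    mysql -h <instance-ip> -u root -p -e "SHOW BINARY LOGS;"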

Can long-running queries perform better on AWS?

We are a data warehouse team, so we deal with millions of records coming in and out on a daily basis. We have jobs running every day that load data from an Oracle DB onto SQL Server Flex clones through ETL loads. Because we are dealing with huge amounts of data and complex queries, queries run pretty long, sometimes for hours. So we are looking at using AWS. We wanted to set up our own licensed Microsoft SQL Server on EC2. But I was wondering how this will improve the performance of long-running queries. What would be the main reason that the same query takes longer on our own servers but executes faster on AWS? Or did I misunderstand the concept? (Just letting you know I am in a learning phase.)
PS: We are still in an R&D phase. Any thoughts or opinions regarding AWS for long-running queries would be greatly appreciated.
You need to provide more details on your question.
What is your query?
How big are the tables?
What is the bottleneck? CPU? IO? RAM?
AWS is just infrastructure.
It does make your life easier because you can scale your machine up or down with a click of a button.
Well, I guess you can crank up your machine to be as big as you want, but even so, nothing will solve a bad query and bad architecture.
Keep in mind, EC2 comes with two types of disk: EBS and ephemeral.
EBS is SAN; ephemeral storage is attached to the EC2 instance itself.
Ephemeral will of course be much faster by far, but the downside is that when you shut down your EC2 instance and start it up again, all of the data on that drive is wiped clean.
As for licensing (Windows and SQL Server), it is baked into the pre-built EC2 AMI (Amazon Machine Image).
I've never used my own license on EC2.
With the same DB and the same hardware configuration, a query will perform similarly on AWS or on-prem. You need to check whether you have configured the DB, indexes, etc. optimally. Also, think about replicating the data to another database that is optimized for querying huge amounts of data.
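To make the bottleneck question from the earlier answer concrete, a rough way to see what the long-running queries are actually doing and waiting on is SQL Server's DMVs (server name below is a placeholder), for example:

    # Show the currently executing requests, longest-running first, with their
    # wait type and SQL text (run against the existing on-prem server)
    sqlcmd -S <server> -Q "
      SELECT r.session_id, r.status, r.wait_type, r.total_elapsed_time, t.text
      FROM sys.dm_exec_requests AS r
      CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
      WHERE r.session_id <> @@SPID
      ORDER BY r.total_elapsed_time DESC;"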

Cloud SQL Migrate from 1st to 2nd generation

We currently have a Cloud SQL instance with about 600 databases (10 GB total) and we have had several problems with the instance crashing, so we are thinking about moving to a 2nd generation instance. However, I have found no tool in the console that does this.
Is there some way to do this other than exporting everything as SQL and then executing all queries in the new instance?
And as a side note, is there some limit to the amount of databases per instance? I have found no information on how many databases are recommended to avoid performance and reliability issues.
Thank you
Export and import is the way to do it currently.
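A sketch of what that looks like with the gcloud CLI (instance and bucket names are placeholders, and the instances' service accounts need access to the bucket):

    # Dump everything from the old 1st generation instance to Cloud Storage
    gcloud sql export sql old-first-gen-instance gs://my-bucket/full-dump.sql.gz

    # Load the dump into the new 2nd generation instance
    gcloud sql import sql new-second-gen-instance gs://my-bucket/full-dump.sql.gz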
Google Cloud SQL uses practically unmodified MySQL binaries, so you can find the limits in the MySQL doc. This one is for 5.6: https://dev.mysql.com/doc/refman/5.6/en/database-count-limit.html
The underlying OS, however, is a custom variant of Linux, and its limits are not documented at this point, but you are probably doing something wrong if you exceed the limits of the OS.

How do I coordinate database hardware (EC2) upgrades on a web site behind ELB?

Suppose I have a setup with one load balancer that routes traffic between two web servers, both of which connect to a database that is a RAM cloud. For whatever reason I want to upgrade my database, and this will require it to be down temporarily. During this downtime I want to put an "upgrading" notice on the front page of the site. I have a specific web app that displays that message.
Should I:
(a) - spin up a new ec2 instance with the web app "upgrading" on it and point the LB at it
(b) - ssh into each web server and pull down the main web app, put up the "upgrading" app
(c) - I'm doing something wrong since I have to put up an "upgrading" sign in the first place
If you go the route of the "upgrading" (dummy/replacement) web app, I would be inclined to run that on a different machine so you can test and verify its behavior in isolation, point the ELB to it, and point the ELB back without touching the real application.
I would further suggest that you not "upgrade" your existing instances, but, instead, bring new instances online, copy as much as you can from the live site, and then take down the live site, finish synching whatever needs to be synched, and then cut the traffic over.
If I were doing this with a single MySQL-server-backed site (which I mention only because that is my area of expertise), I would bring the new database server online with a snapshot backup of the existing database, then connect it to the live replication stream generated by the existing database server, beginning at the point in time where the snapshot backup was taken, and let it catch up to the present by executing the transactions that occurred since the snapshot. At that point, with the new server caught up by playing back the replication events, I would have my live data set, in essentially real time, on new database hardware. I could then stop the application, reconfigure the application server settings to use the new database server, verify that all of the replication events had propagated, disconnect from the replication stream, and restart the app server against the new database, for a total downtime so short that it would be unlikely to be noticed if done during off-peak time.
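To make that concrete for MySQL, the catch-up step on the new server looks roughly like this (host, credentials, and the binlog coordinates recorded with the snapshot are placeholders):

    # On the new server, after restoring the snapshot: attach it to the old
    # server's replication stream at the snapshot's binlog position
    mysql -u root -p -e "
      CHANGE MASTER TO
        MASTER_HOST='old-db.internal',
        MASTER_USER='repl',
        MASTER_PASSWORD='<password>',
        MASTER_LOG_FILE='mysql-bin.000123',
        MASTER_LOG_POS=4;
      START SLAVE;"

    # Before cutting traffic over, confirm Seconds_Behind_Master has reached 0
    mysql -u root -p -e "SHOW SLAVE STATUS\G"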
Of course, with a Galera cluster, these gyrations would be unnecessary since you can just do a rolling upgrade, one node at a time, without ever losing synchronization of the other two nodes with each other (assuming you had the required minimum of 3 running nodes to start with) and each upgraded node would resync its data from one of the other two when it came back online.
To whatever extent the platform you are using doesn't have comparable functionality to what I've described (specifically, the ability to do database snapshots and playback a stream of a transaction log against a database restored from a snapshot... or quorum-based cluster survivability), I suspect that's the nature of the limitation that makes it feel like you're doing it wrong.
A possible workaround to help you minimize the actual downtime, if your architecture doesn't support these kinds of actions, would be to enhance your application with the ability to operate in a "read only" mode, where the web site can be browsed but the data can't be modified (you can see the catalog, but not place orders; you can read the blogs, but not edit or post comments; you don't bother saving "last login date" for a few minutes; certain privilege levels aren't available; etc.) -- like Stack Overflow has the capability of doing. This would allow you to stop the site just long enough to snapshot it, then restart it again on the existing hardware in read-only mode while you bring up the snapshots on new hardware. Then, when you have the site back to available status on the new hardware, cut the traffic over at the load balancer and you'd be back to normal.
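The above describes a read-only mode built into the application; a cruder approximation at the database level, for a MySQL-backed site like the one described earlier, would be to flip the server itself to read-only while the snapshots are taken (credentials are placeholders):

    # Block writes on the existing MySQL server while the snapshot is taken;
    # replication and users with SUPER privilege are not affected
    mysql -u root -p -e "SET GLOBAL read_only = ON;"

    # Re-enable writes afterwards (or simply cut traffic over to the new hardware)
    mysql -u root -p -e "SET GLOBAL read_only = OFF;"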