We currently have a Cloud SQL instance with about 600 databases (10 GB total) and we have had several problems with crashes of the instance, so we are thinking about moving to a 2nd generation instance. However, I have found no tool in the console that does this.
Is there some way to do this other than exporting everything as SQL and then executing all queries in the new instance?
And as a side note, is there a limit to the number of databases per instance? I have found no information on how many databases are recommended to avoid performance and reliability issues.
Thank you
Export and import is the way to do it currently.
Google Cloud SQL uses practically unmodified MySQL binaries, so you can find the limits in the MySQL doc. This one is for 5.6: https://dev.mysql.com/doc/refman/5.6/en/database-count-limit.html
The underlying OS, however, is a custom variant of Linux, and its limits are not documented at this point, but you are probably doing something wrong if you manage to exceed the limits of the OS.
I'm looking for some advice on the best / most cost-effective solution to use for my use case on Google Cloud (described below).
Currently, I'm using Cloud Composer, and it's way too expensive. It seems like this is the result of Composer always running, so I'm looking for something that either isn't constantly running or is much cheaper to run / can accomplish the same thing.
Use Case / Process: I have a process set up that follows the steps below:
There is a site built with Firebase that has a file drop / upload (CSV) functionality to import data into Google Cloud Storage
That file drop triggers a cloud function that starts the Cloud Composer DAG
The DAG moves the CSV from Cloud Storage to BigQuery while also performing a bunch of modifications to the dataset using Python / SQL queries.
Any advice on what would potentially be a better solution?
It seems like Dataflow might be an option, but I'm pretty new to it and wanted a second opinion.
Appreciate the help!
If your file is not too big, you can process it with Python and a pandas DataFrame; in my experience this works very well with files of around 1,000,000 rows.
Then, with the BigQuery API, you can upload the transformed DataFrame directly into BigQuery, all within your Cloud Function. Remember that a Cloud Function can run for up to 9 minutes and, best of all, this approach costs next to nothing.
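For example, here is a minimal sketch of such a Cloud Function, assuming pandas, gcsfs, and the google-cloud-bigquery client library are available; the project, dataset, and table names are placeholders:

```python
# Minimal sketch of a background Cloud Function triggered by a file upload
# to Cloud Storage. Assumes pandas, gcsfs and google-cloud-bigquery are in
# requirements.txt; the table ID below is a placeholder.
import pandas as pd
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Triggered by a 'finalize' event on the upload bucket."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    # Read the uploaded CSV straight from GCS into a DataFrame.
    df = pd.read_csv(uri)

    # ... apply your transformations here (cleaning, renaming, joins, etc.) ...
    df = df.dropna()

    # Load the transformed DataFrame into BigQuery.
    client = bigquery.Client()
    table_id = "my-project.my_dataset.my_table"  # placeholder
    job = client.load_table_from_dataframe(df, table_id)
    job.result()  # wait for the load job to finish (counts toward the 9 min limit)
```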
Was looking into it recently myself. I'm pretty sure Dataflow can be used for this case, but I doubt it will be cheaper (also considering the time you will spend learning and migrating to Dataflow if you are not an expert already).
Depending on the complexity of the transformations you do on the file, you can look into data integration solutions such as https://fivetran.com/, https://www.stitchdata.com/, https://hevodata.com/ etc. They are mainly built to just transfer your data from one place to another, but most of them are also able to perform some transformations on the data. If I'm not mistaken, the transformations in Fivetran are SQL-based and in Hevo they are Python-based.
There's also this article that goes into scaling Composer nodes up and down: https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60. Maybe it will help you save some cost. I didn't notice any significant cost reduction to be honest, but maybe it works for you.
I'm creating a web app in React with a Node.js backend. I'm hosting all of this on the Google Cloud Platform. I'm using a PostgreSQL database and a Redis database, and because my knowledge of these databases is limited, I'm using the managed options (Cloud SQL and Cloud Memorystore).
These are not the cheapest solutions, but for now, they'll do what I want them to do.
My question now is: I'm using the managed options. Imagine my web app is successful and grows bigger; I'll probably want my own self-managed solution (like a Redis cluster or a PostgreSQL cluster on Compute Engine). Will I be able to migrate my managed databases to the Compute Engine solution without downtime/loss of data?
If things get bigger, I'll probably hire someone with more knowledge of PostgreSQL/Redis, so that's not the problem. The only thing I want to know is: is it possible to move from a GCP managed solution to a self-managed solution on Compute Engine without loss of data and downtime? I do not want any data loss at all; a little downtime would not be a problem.
Using the managed solution is, in fact, a better approach for handling databases. GCP takes over updates, management and maintenance of the database and provides reliable tools for backup and scaling.
But to answer your question: yes, it is possible to migrate with minimal downtime. You would need to configure primary/replica replication (formerly called master/slave), ideally synchronous, with the replica running on Compute Engine. Once the replica is fully in sync, you promote it to be your new primary database. This gives you essentially the minimum possible downtime.
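As an illustration (not an official procedure), here is a minimal sketch, assuming psycopg2, placeholder connection details, and that you can query pg_stat_replication on the primary, of verifying that the Compute Engine replica has caught up before you cut over:

```python
# Sketch only: check replication lag on the current primary before promoting
# the Compute Engine replica. Connection details are placeholders; assumes
# psycopg2 is installed and pg_stat_replication is accessible to this user.
import psycopg2

conn = psycopg2.connect(
    host="PRIMARY_HOST",        # placeholder: current primary
    dbname="postgres",
    user="ADMIN_USER",          # placeholder
    password="ADMIN_PASSWORD",  # placeholder
)

with conn, conn.cursor() as cur:
    # One row per connected replica; lag_bytes should be (close to) zero
    # and state should be 'streaming' before you switch writes over.
    cur.execute("""
        SELECT application_name, state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
        FROM pg_stat_replication;
    """)
    for name, state, lag_bytes in cur.fetchall():
        print(f"{name}: state={state}, lag={lag_bytes} bytes")
```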
I am a novice in the GCP stack, so I am confused by the number of GCP technologies available for storing data:
https://cloud.google.com/products/storage
Although Google Cloud Spanner is not mentioned in the page above, I know that it exists and is used for data storage: https://cloud.google.com/spanner
From my current point of view I don't see any significant difference between Cloud SQL (with Postgres under the hood) and Cloud Spanner. I found that Spanner has a slightly different syntax, but that doesn't tell me when I should prefer this technology to Cloud SQL.
Could you please explain it?
P.S.
I think of Cloud SQL as a traditional database with automatic replication and horizontal scalability, managed by Google.
There is not a big difference between them in terms of what they do (storing data in tables). The difference is in how they handle the data at small and at large scale.
Cloud Spanner is used when you need to handle massive amounts of data with strong consistency and a very high request volume (100,000+ reads/writes per second). Spanner gives you much better scalability and better SLOs.
On the other hand, Spanner is also much more expensive than Cloud SQL.
If you just want to store some customer data cheaply but still don't want to deal with server configuration, Cloud SQL is the right choice.
If you are planning to create a big product, or if you want to be ready for a huge increase in users for your application (viral games/applications), Spanner is the right product.
You can find detailed information about Cloud Spanner in this official paper
The main difference between Cloud Spanner and Cloud SQL is horizontal scalability plus global availability of data beyond 10 TB.
Spanner isn't for generic SQL needs; it is best used for massive-scale workloads: thousands of writes per second and tens to hundreds of thousands of reads per second, globally.
That volume is extremely difficult to achieve with ordinary SQL / MySQL without complex sharding of the database. Spanner deals with all of this and still allows ACID updates (which is basically impossible with sharded databases). It accomplishes this with extremely accurate clocks to manage conflicts.
In short, Spanner is not for CRM databases; it is more for supermassive global data within an organisation. And since Spanner is a bit expensive (compared to Cloud SQL), the project should be large enough to justify the additional cost of Spanner.
You can also follow this discussion on Reddit (a good one!): https://www.reddit.com/r/googlecloud/comments/93bxf6/cloud_spanner_vs_cloud_sql/e3cof2r/
Previous answers are correct, the main advantages of Spanner are scalability and availability. While you can scale with Cloud SQL, there is an upper bound to write throughput unless you shard -- which, depending on your use case, can be a major challenge. Dealing with sharded SQL was the big problem that Spanner solved within Google.
I would add to the previous answers that Cloud SQL provides managed instances of MySQL, PostgreSQL, or SQL Server, with the corresponding support for SQL. If you're migrating from a MySQL database in a different location, not having to change your queries can be a huge plus.
Spanner has its own SQL dialect, although recently support for a subset of the PostgreSQL dialect was added.
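To make the dialect point concrete, here is a minimal sketch of running a query with the google-cloud-spanner Python client; the instance, database, and table names are placeholders:

```python
# Minimal sketch using the google-cloud-spanner client library.
# Instance, database, and table names are placeholders; assumes default
# application credentials are available.
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")    # placeholder
database = instance.database("my-database")  # placeholder

# Reads go through a snapshot; simple queries look like standard SQL,
# even though Spanner has its own dialect (with optional PostgreSQL-dialect
# databases as noted above).
with database.snapshot() as snapshot:
    results = snapshot.execute_sql("SELECT CustomerId, Name FROM Customers LIMIT 10")
    for row in results:
        print(row)
```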
I'm trying to import a SQL dump file from Google Cloud Storage into Cloud SQL (Postgres database) as a daily job.
I saw in the Google documentation for the Cloud SQL Admin API that there is a way to programmatically import a SQL dump file (URL: https://cloud.google.com/sql/docs/postgres/admin-api/v1beta4/instances/import#examples), but quite honestly, I'm a bit lost here. I haven't programmed against APIs before, and I think that's a major factor here.
In the documentation, I see that there's an area for an HTTP POST request, as well as code, but I'm not sure where this would go. Ideally, I'd like to use other Cloud products to make this daily job happen. Any help would be much appreciated.
(Side note:
I was looking into creating a cron job in Compute Engine for this, but I'm worried about ease of maintenance, especially since I have other jobs I want to build that are dependent on this one.
I'd read that Dataflow could help with this, but I haven't seen anything (e.g. tutorials) suggesting that it can yet. I'm also fairly new to Dataflow, so that could be a factor as well.)
I would suggest using google-cloud-composer, which is essentially managed Airflow, for this. There are a lot of operators for moving files between various locations. You can find more information here.
I must warn, though, that it is still in beta, and unlike what you'd expect from a Google beta, this one is rather flaky (at least in my experience).
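For completeness: whichever scheduler you pick, the programmatic import in the documentation linked above boils down to a single Admin API call. A rough Python sketch using the google-api-python-client library (project, instance, bucket, and database names are placeholders) might look like this:

```python
# Rough sketch of triggering a Cloud SQL import via the Admin API
# (sqladmin, v1beta4). All names are placeholders; assumes
# google-api-python-client is installed and application default
# credentials are available.
import googleapiclient.discovery

service = googleapiclient.discovery.build("sqladmin", "v1beta4")

body = {
    "importContext": {
        "fileType": "SQL",
        "uri": "gs://my-bucket/daily-dump.sql",  # placeholder
        "database": "my-database",               # placeholder
    }
}

# Note the trailing underscore: 'import' is a reserved word in Python,
# so the generated client exposes the method as import_().
operation = service.instances().import_(
    project="my-project",    # placeholder
    instance="my-instance",  # placeholder
    body=body,
).execute()

print(operation)  # returns an Operation resource you can poll for completion
```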
We are running into quota limits for a small data set (less than 1 GB) in BigQuery. Google Cloud gives us no indication of what queries are running on the backend, which makes it hard to tune the setup. We have a BigQuery dataset and a dashboard built in Data Studio that queries that dataset.
I've used relational databases like Oracle in the past, and they have excellent tooling to diagnose issues. But with BigQuery, I feel like I am staring into the dark.
I'd appreciate any help/pointers you can give.
The concurrent queries limit makes reference to the number of statements that are executed simultaneously in BigQuery. The quota limit for on-demand, interactive queries is 100 concurrent queries (updated).
Based on this, it seems that your Data Studio dashboard is hitting this quota when running your reports, in which case it is suggested to re-design how the dashboard is built in order to avoid exceeding those limits.
Additionally, you can use the bq ls -j -a PROJECTNAME command to list the jobs that have been run in your project in order to identify the queries you need to work with, as mentioned by Elliott Brossard.
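If you prefer doing the same thing from code, here is a minimal sketch using the google-cloud-bigquery Python client to list recent jobs and their query text; the project ID is a placeholder:

```python
# Minimal sketch: list recent BigQuery jobs to see what is actually running.
# Assumes the google-cloud-bigquery library and default credentials; the
# project ID is a placeholder.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder

# all_users=True also includes jobs started by other identities (for example
# the credentials Data Studio uses), which helps when chasing concurrency quotas.
for job in client.list_jobs(max_results=20, all_users=True):
    print(job.job_id, job.job_type, job.state, job.created)
    if job.job_type == "query":
        print("   ", job.query[:120])  # first 120 characters of the SQL
```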