How to import a SQL Dump File from Google Cloud Storage into Cloud SQL as a Daily Job? - google-cloud-platform

I'm trying to import a SQL dump file from Google Cloud Storage into Cloud SQL (Postgres database) as a daily job.
I saw in the Google documentation for the Cloud SQL Admin API that there is a way to programmatically import a SQL dump file (URL: https://cloud.google.com/sql/docs/postgres/admin-api/v1beta4/instances/import#examples), but quite honestly, I'm a bit lost here. I haven't programmed against APIs before, and I think that's a major factor.
In the documentation, I see that there's an example HTTP POST request, as well as some code, but I'm not sure where this would go. Ideally, I'd like to use other Cloud products to make this daily job happen. Any help would be much appreciated.
(Side note:
I was looking into creating a cron job in Compute Engine for this, but I'm worried about ease of maintenance, especially since I have other jobs I want to build that are dependent on this one.
I'd read that Dataflow could help with this, but I haven't seen anything (tutorials) that suggests it can yet. I'm also fairly new to Dataflow, so that could be a factor as well. )

I would suggest using google-cloud-composer, which is essentially managed Airflow, for this. There are a lot of operators for moving files between various locations. You can find more information here.
I must warn you, though, that it is still in beta, and unlike what you would usually expect from a Google beta, this one is rather flaky (at least in my experience).
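For concreteness, here is a rough sketch of what a daily Composer/Airflow DAG for this could look like, using the Cloud SQL import operator (which wraps the same instances/import Admin API call from the documentation linked in the question). The operator's name and import path vary between Airflow versions, and the instance, bucket, and database names below are placeholders, so treat this as a starting point rather than a drop-in solution.

```python
# Sketch of a daily Airflow DAG that imports a SQL dump from GCS into Cloud SQL.
# Instance, bucket, and database names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.cloud_sql import (
    CloudSQLImportInstanceOperator,  # older Airflow releases call this CloudSqlInstanceImportOperator
)

# Request body in the same format as the Cloud SQL Admin API instances/import call.
IMPORT_BODY = {
    "importContext": {
        "fileType": "SQL",
        "uri": "gs://my-bucket/dumps/daily_dump.sql",  # hypothetical dump location
        "database": "my_database",                     # hypothetical target database
    }
}

with DAG(
    dag_id="daily_cloudsql_import",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # run once a day
    catchup=False,
) as dag:
    import_dump = CloudSQLImportInstanceOperator(
        task_id="import_sql_dump",
        instance="my-postgres-instance",  # hypothetical Cloud SQL instance name
        body=IMPORT_BODY,
    )
```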

Related

Pros and Cons of Google Dataflow VS Cloud Run while pulling data from HTTP endpoint

This is a design-approach question where we are trying to pick the best option between Apache Beam / Google Dataflow and Cloud Run to pull data from HTTP endpoints (source) and push it downstream to Google BigQuery (sink).
Traditionally we have implemented similar functionality using Google Dataflow, where the sources are files in a Google Storage bucket or messages in Google Pub/Sub, etc. In those cases the data arrives in a 'push' fashion, so it makes much more sense to use a streaming Dataflow job.
However, in the new requirement, since the data is fetched periodically from an HTTP endpoint, it sounds reasonable to use a Cloud Run service spun up on a schedule.
So I want to gather the pros and cons of these two approaches so that we can make a sensible design decision.
I am not sure this question is appropriate for SO, as it opens a big discussion with differing opinions and without clear context, scope, functional and non-functional requirements, time and finance restrictions including CAPEX/OPEX, who is going to support the solution in BAU after commissioning and how, etc.
In my personal experience, I have developed a few dozen similar pipelines using various combinations of Cloud Functions, Pub/Sub topics, Cloud Storage, Firestore (for pipeline process state management) and so on, sometimes with Dataflow as well (embedded into the pipelines), but I have never used Cloud Run. So my knowledge and experience may not be relevant in your case.
The only thing I might suggest is to prioritize your requirements (in the context of the whole solution lifecycle) and then design the solution based on those priorities. I know it is a trivial idea; sorry to disappoint you.
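For concreteness, a minimal sketch of the "Cloud Run spun up on a schedule" option described in the question, assuming a hypothetical JSON endpoint and BigQuery table: Cloud Scheduler calls the service over HTTP, which pulls from the source endpoint and streams rows into BigQuery. This is just one way to wire it up, not a recommendation over Dataflow.

```python
# Minimal Cloud Run service sketch: Cloud Scheduler POSTs to /pull, the service
# fetches a JSON endpoint and streams the rows into BigQuery.
# SOURCE_URL and TABLE_ID below are placeholders.
import os

import requests
from flask import Flask
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client()

SOURCE_URL = os.environ.get("SOURCE_URL", "https://example.com/api/records")  # hypothetical endpoint
TABLE_ID = os.environ.get("TABLE_ID", "my-project.my_dataset.my_table")       # hypothetical table


@app.route("/pull", methods=["POST"])
def pull():
    # Assumes the endpoint returns a JSON array of flat objects matching the table schema.
    rows = requests.get(SOURCE_URL, timeout=60).json()
    errors = bq.insert_rows_json(TABLE_ID, rows)  # streaming insert
    if errors:
        return (f"BigQuery insert errors: {errors}", 500)
    return (f"Inserted {len(rows)} rows", 200)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```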

Can we use Google cloud function to convert xls file to csv

I am new to Google Cloud Functions. My requirement is to trigger a Cloud Function on receiving a Gmail message and convert the xls attachment from the email to csv.
Can we do this using GCP?
Thanks in advance!
Very shortly: that is possible, as far as I know.
But.
You might find that in order to automate this task in a reliable, robust and self-healing way, it may be necessary to use half a dozen Cloud Functions, Pub/Sub topics, maybe Cloud Storage, maybe a Firestore collection, Secret Manager, a custom service account with the relevant IAM permissions, and so on. Maybe a dozen or two dozen different GCP resources in total. And, obviously, the code for those Cloud Functions has to be developed. All together, that may not be very easy or quick to implement.
At the same time, I have personally seen (and contributed to the development of) a functional component, based on Cloud Functions, that did exactly what you would like to achieve. And it was in production.
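As an illustration of just the conversion step (not the Gmail-triggered plumbing the answer above alludes to), here is a hedged sketch of a background Cloud Function that converts an xls/xlsx object landing in a Cloud Storage bucket into a csv stored next to it. The bucket wiring and output naming are assumptions.

```python
# Sketch: GCS-triggered Cloud Function that converts an uploaded .xls/.xlsx file to .csv.
# The Gmail part (push notifications via Pub/Sub, fetching the attachment with the
# Gmail API) is a separate piece not shown here.
import pathlib
import tempfile

import pandas as pd
from google.cloud import storage


def xls_to_csv(event, context):
    """Background Cloud Function entry point for a Cloud Storage 'finalize' event."""
    bucket_name = event["bucket"]
    object_name = event["name"]
    if not object_name.lower().endswith((".xls", ".xlsx")):
        return  # ignore anything that is not an Excel file

    client = storage.Client()
    bucket = client.bucket(bucket_name)

    with tempfile.TemporaryDirectory() as tmp:
        local_xls = pathlib.Path(tmp) / pathlib.Path(object_name).name
        bucket.blob(object_name).download_to_filename(str(local_xls))

        # pandas needs xlrd (for .xls) or openpyxl (for .xlsx) installed alongside it.
        df = pd.read_excel(local_xls)

        local_csv = local_xls.with_suffix(".csv")
        df.to_csv(local_csv, index=False)

        # Upload the csv next to the original object.
        csv_object_name = str(pathlib.Path(object_name).with_suffix(".csv"))
        bucket.blob(csv_object_name).upload_from_filename(str(local_csv))
```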

Cloud Composer $$$ (Better / Cheaper Options to Take Firebase File drop > Cloud Storage > BigQuery > Few Python / SQL Queries)

I'm looking for some advice on the best / most cost effective solutions to use for my use case on Google Cloud (described below).
Currently, I'm using Cloud Composer, and it's way too expensive. It seems like this is the result of Composer always running, so I'm looking for something that either isn't constantly running or is much cheaper to run / can accomplish the same thing.
Use Case / Process >> I have a process set up that follows the steps below:
There is a site built with Firebase that has a file drop / upload (CSV) functionality to import data into Google Storage
That file drop triggers a cloud function that starts the Cloud Composer DAG
The DAG moves the CSV from Cloud Storage to BigQuery while also performing a bunch of modifications to the dataset using Python / SQL queries.
Any advice on what would potentially be a better solution?
It seems like Dataflow might be an option, but I'm pretty new to it and wanted a second opinion.
Appreciate the help!
If your file is not too big, you can process it with Python and a pandas DataFrame; in my experience this works very well with files of around 1,000,000 rows.
Then, with the BigQuery API, you can load the transformed DataFrame directly into BigQuery, all inside your Cloud Function. Remember that a Cloud Function can run for up to 9 minutes. Best of all, this approach is practically free.
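A rough sketch of that Cloud Function approach, assuming a GCS-triggered function and a hypothetical destination table; the transformation step is just a placeholder for your own Python/SQL logic.

```python
# Sketch: GCS-triggered Cloud Function that reads the uploaded CSV with pandas,
# transforms it, and loads the result into BigQuery. Table name is a placeholder.
import io

import pandas as pd
from google.cloud import bigquery, storage

TABLE_ID = "my-project.my_dataset.uploads"  # hypothetical destination table


def load_csv(event, context):
    """Entry point for a Cloud Storage 'finalize' event."""
    bucket_name, object_name = event["bucket"], event["name"]

    # Download the uploaded CSV into memory.
    storage_client = storage.Client()
    csv_bytes = storage_client.bucket(bucket_name).blob(object_name).download_as_bytes()
    df = pd.read_csv(io.BytesIO(csv_bytes))

    # Placeholder for the dataset modifications you currently do in the DAG.
    df = df.dropna()

    # Load the transformed DataFrame into BigQuery (requires pyarrow installed).
    bq = bigquery.Client()
    job = bq.load_table_from_dataframe(df, TABLE_ID)
    job.result()  # wait for the load job to finish
```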
I was looking into this recently myself. I'm pretty sure Dataflow can be used for this case, but I doubt it will be cheaper (also considering the time you will spend learning and migrating to Dataflow if you are not already an expert).
Depending on the complexity of the transformations you do on the file, you can look into data-integration solutions such as https://fivetran.com/, https://www.stitchdata.com/, https://hevodata.com/, etc. They are mainly built to just transfer your data from one place to another, but most of them can also perform some transformations on the data. If I'm not mistaken, in Fivetran the transformations are SQL-based and in Hevo they are Python-based.
There's also this article on scaling Composer nodes up and down: https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60 . Maybe it will help you save some cost. I didn't notice any significant cost reduction, to be honest, but maybe it works for you.

Schedule loading data from GCS to BigQuery periodically

I've researched it and have currently come up with a strategy using Apache Airflow, but I'm still not sure how to do this. Most of the blogs and answers I'm finding are just code rather than material that helps me understand it better. Please also suggest if there is a better way to do it.
I also got an answer suggesting a Background Cloud Function with a Cloud Storage trigger.
You can use BigQuery's Cloud Storage transfers, but note that the feature is still in beta.
It gives you the option to schedule transfers from Cloud Storage to BigQuery, with certain limitations.
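If you go that route, a hedged sketch of creating such a scheduled transfer with the BigQuery Data Transfer Service Python client might look like the following; the project, dataset, bucket, and schedule values are placeholders, and the exact parameter names should be checked against the current documentation.

```python
# Sketch: create a scheduled Cloud Storage -> BigQuery transfer config.
# Project, dataset, bucket, and table names are placeholders.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",        # hypothetical dataset
    display_name="Daily GCS load",
    data_source_id="google_cloud_storage",
    schedule="every 24 hours",
    params={
        "data_path_template": "gs://my-bucket/exports/*.csv",  # hypothetical bucket/path
        "destination_table_name_template": "daily_table",
        "file_format": "CSV",
        "skip_leading_rows": "1",
    },
)

created = client.create_transfer_config(
    parent="projects/my-project",               # hypothetical project
    transfer_config=transfer_config,
)
print(f"Created transfer config: {created.name}")
```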
"Most of the blogs and answers I'm finding are just code"
Apache Airflow comes with a rich UI for many tasks, but that doesn't mean you are not supposed to write code to get your task done.
For your case, you need to use the BigQuery command-line operator for Apache Airflow.
A good explanation of how to do this can be found at this link.
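As one illustration (not necessarily the exact operator the answer has in mind), a daily Airflow DAG using the GCS-to-BigQuery transfer operator could look roughly like this; the operator's import path differs between Airflow versions, and the bucket and table names are placeholders.

```python
# Sketch: daily Airflow DAG that loads CSV files from GCS into a BigQuery table.
# Bucket, path, and table names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="gcs_to_bigquery_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_csv = GCSToBigQueryOperator(
        task_id="load_csv",
        bucket="my-bucket",                                     # hypothetical bucket
        source_objects=["exports/*.csv"],                       # hypothetical path pattern
        destination_project_dataset_table="my_dataset.daily_table",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_TRUNCATE",  # replace the table on each run
        autodetect=True,                     # let BigQuery infer the schema
    )
```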

Cloud SQL Migrate from 1st to 2nd generation

We currently have a Cloud SQL instance with about 600 databases (10Gb total) and we have had several problems with crashes of the instance, so we are thinking about moving to a 2nd generation instance. However, I have found no tool in the console that does this.
Is there some way to do this other than exporting everything as SQL and then executing all queries in the new instance?
And as a side note, is there some limit to the number of databases per instance? I have found no information on how many databases are recommended to avoid performance and reliability issues.
Thank you
Export and import is the way to do it currently.
Google Cloud SQL uses practically unmodified MySQL binaries, so you can find the limits in the MySQL doc. This one is for 5.6: https://dev.mysql.com/doc/refman/5.6/en/database-count-limit.html
The underlying OS, however, is a custom variant of Linux, and its limits are not documented at this point, but you are probably doing something wrong if you exceed the limits of the OS.
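For reference, a hedged sketch of that export-then-import flow using the Cloud SQL Admin API (v1beta4) through the Python discovery client; instance and bucket names are placeholders, and a real migration would need more robust operation polling and error handling.

```python
# Sketch: export from the old instance to GCS, then import into the new instance,
# via the Cloud SQL Admin API (google-api-python-client). Names are placeholders.
# Note: the generated client exposes the import method as `import_` because
# `import` is a reserved word in Python.
import time

import google.auth
from googleapiclient import discovery

credentials, project = google.auth.default()
sqladmin = discovery.build("sqladmin", "v1beta4", credentials=credentials)


def wait_for_operation(op_name):
    """Poll until a Cloud SQL operation finishes."""
    while True:
        op = sqladmin.operations().get(project=project, operation=op_name).execute()
        if op.get("status") == "DONE":
            return op
        time.sleep(10)


# 1. Export everything from the old (1st generation) instance to Cloud Storage.
export_body = {"exportContext": {"fileType": "SQL", "uri": "gs://my-bucket/dump.sql"}}
op = sqladmin.instances().export(project=project, instance="old-instance", body=export_body).execute()
wait_for_operation(op["name"])

# 2. Import the dump into the new (2nd generation) instance.
import_body = {"importContext": {"fileType": "SQL", "uri": "gs://my-bucket/dump.sql"}}
op = sqladmin.instances().import_(project=project, instance="new-instance", body=import_body).execute()
wait_for_operation(op["name"])
```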