Cloud Composer $$$ (Better / Cheaper Options to Take Firebase File drop > Cloud Storage > BigQuery > Few Python / SQL Queries) - google-cloud-platform

I'm looking for some advice on the best / most cost effective solutions to use for my use case on Google Cloud (described below).
Currently, I'm using Cloud Composer, and it's way too expensive. This seems to be because Composer is always running, so I'm looking for something that either isn't constantly running or is much cheaper to run and can accomplish the same thing.
Use Case / Process >> I have a process set up that follows the steps below:
There is a site built with Firebase that has file drop / upload (CSV) functionality to import data into Cloud Storage
That file drop triggers a cloud function that starts the Cloud Composer DAG
The DAG moves the CSV from Cloud Storage to BigQuery while also performing a bunch of modifications to the dataset using Python / SQL queries.
Any advice on what would potentially be a better solution?
It seems like Dataflow might be an option, but I'm pretty new to it and wanted a second opinion.
Appreciate the help!

If your file is not too big, you can process it with Python and a pandas DataFrame; in my experience this works very well with files of around 1,000,000 rows.
Then, with the BigQuery API, you can load the transformed DataFrame directly into BigQuery, all inside your Cloud Function. Remember that Cloud Functions can process data for up to 9 minutes. Best of all, this approach costs next to nothing.
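A rough sketch of that approach is below; it assumes a background (1st-gen) Cloud Function triggered on a Cloud Storage finalize event, and the destination table and transformation are placeholders:

```python
# Minimal sketch: background Cloud Function triggered by a CSV upload to
# Cloud Storage that cleans the file with pandas and loads it into BigQuery.
# The destination table and the transformation are placeholders.
import pandas as pd
from google.cloud import bigquery

TABLE_ID = "your-project.your_dataset.your_table"  # placeholder destination

def load_csv_to_bq(event, context):
    """Triggered by a google.storage.object.finalize event."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    # pandas can read straight from GCS when gcsfs is installed
    df = pd.read_csv(uri)

    # ...your transformations go here, e.g. dropping completely empty rows
    df = df.dropna(how="all")

    client = bigquery.Client()
    job = client.load_table_from_dataframe(df, TABLE_ID)
    job.result()  # wait for the load job to finish
```

You'd list pandas, gcsfs, pyarrow, and google-cloud-bigquery in the function's requirements.txt.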

Was looking into it recently myself. I'm pretty sure Dataflow can be used for this case, but I doubt it will be cheaper (also considering the time you will spend learning and migrating to Dataflow if you are not already an expert).
Depending on the complexity of the transformations you do on the file, you can look into data integration solutions such as https://fivetran.com/, https://www.stitchdata.com/, https://hevodata.com/ etc. They are mainly built to just transfer your data from one place to another, but most of them are also able to perform some transformations on the data. If I'm not mistaken, in Fivetran it's SQL-based and in Hevo it's Python.
There's also this article that goes into scaling Composer nodes up and down: https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60 . Maybe it will help you save some cost. I didn't notice any significant cost reduction to be honest, but maybe it works for you.

Related

Cloud SQL to BigQuery ETL tool

I have a Cloud SQL instance with hundreds of databases, one for each customer. Each database has the same tables in it, but data only for the specific customer.
What I want to do is transform it in various ways to get an overview table covering all of the customers. Unfortunately, I cannot seem to find a tool that can iterate over all the databases a Cloud SQL instance has, execute queries, and then write that data to BigQuery.
I was really hoping that Dataflow would be the solution but as far as I have tried and looked online, I cannot find a way to make it work. Since I spent a lot of time already on investigating Dataflow, I thought it might be best to ask here.
Currently I am looking at Data Fusion, Datastream, Apache Airflow.
Any suggestions?
Why doesn't Dataflow fit your needs? You could run a query to find out the tables, and then iteratively build the pipeline/JdbcIO sources/PCollections based on those results.
Beam has a Flatten transform that can join PCollections.
What you are trying to do is one of the use cases Dataflow Flex Templates were created for (to allow dynamic DAG creation within Dataflow itself), but it can be pulled off without Flex Templates as well.
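As a rough sketch of that pattern in the Beam Python SDK (the database list, table, connection details, and destination are all placeholders, and it assumes the cross-language JdbcIO connector):

```python
# Sketch: one JdbcIO source per customer database, flattened into a single
# PCollection and appended to an existing BigQuery table. Every name and
# connection detail below is a placeholder.
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc

databases = ["customer_001", "customer_002"]  # in practice, discovered via a query

with beam.Pipeline() as p:
    per_db = []
    for db in databases:
        rows = p | f"read_{db}" >> ReadFromJdbc(
            table_name="orders",                               # placeholder table
            driver_class_name="org.postgresql.Driver",
            jdbc_url=f"jdbc:postgresql://10.0.0.1:5432/{db}",  # placeholder host
            username="user",
            password="secret",
        )
        per_db.append(rows)

    (
        per_db
        | beam.Flatten()
        | beam.Map(lambda row: row._asdict())  # JdbcIO rows are NamedTuple-like
        | beam.io.WriteToBigQuery(
            "your-project:your_dataset.all_customers",  # placeholder, table must exist
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```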
Airflow can be used for this sort of thing (essentially, you're doing the same task over and over, so with an appropriate operator and a for-loop you can certainly generate a DAG with hundreds of near-identical tasks that export each of your databases).
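For illustration, a minimal sketch of that loop pattern, here using the Cloud SQL export operator from the Google provider package as one possible operator; the project, instance, bucket, and query values are placeholders:

```python
# Sketch: generate one near-identical export task per customer database in a
# loop. Operator choice, project, instance, bucket, and query are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.cloud_sql import (
    CloudSQLExportInstanceOperator,
)

DATABASES = [f"customer_{i:03d}" for i in range(1, 301)]  # placeholder list

with DAG(
    dag_id="export_all_customer_dbs",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    previous = None
    for db in DATABASES:
        export = CloudSQLExportInstanceOperator(
            task_id=f"export_{db}",
            project_id="your-project",          # placeholder
            instance="your-cloudsql-instance",  # placeholder
            body={
                "exportContext": {
                    "fileType": "CSV",
                    "uri": f"gs://your-bucket/exports/{db}.csv",
                    "databases": [db],
                    "csvExportOptions": {"selectQuery": "SELECT * FROM orders"},
                }
            },
        )
        # Cloud SQL typically runs only one export operation per instance at a
        # time, so chain the tasks instead of letting them run in parallel.
        if previous:
            previous >> export
        previous = export
```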
However, I'd be remiss not to ask: should you?
There may be a really excellent reason why you've created hundreds of databases in one instance rather than one database with a customer field on each table. But if security is the concern, a row-level security policy could add an additional element of safety without putting you in this difficult situation. Adding an index on the customer field would let you retrieve the appropriate sub-table swiftly (in return for a small speed cost when inserting new rows), so performance doesn't seem like a reason to do this either.
Given that it would then be pretty straightforward to get your data into BigQuery I would be moving heaven and earth to switch over to this setup, if I were you!

Best way to ingest data to bigquery

I have heterogeneous sources: flat files residing on-prem, JSON on SharePoint, APIs which serve data, and so on. Which is the best ETL tool to bring the data into the BigQuery environment?
I'm a kindergarten student in GCP :)
Thanks in advance
There are many solutions to achieve this. It depends on several factors, some of which are:
frequency of data ingestion
whether or not the data needs to be manipulated before being written into BigQuery (your files may not be formatted correctly)
is this going to be done manually or is this going to be automated
size of the data being written
If you are just looking for an ETL tool you can find many. If you plan to scale this to many pipelines you might want to look at a more advanced tool like Airflow, but if you just have a few one-off processes you could set up a Cloud Function within GCP to accomplish this. You can schedule it (via cron), invoke it through an HTTP endpoint, or trigger it from Pub/Sub. You can see an example of how this is done here.
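As an illustration of the Cloud Function route, here's a minimal sketch (bucket, dataset, and table names are placeholders) that could be invoked over HTTP, from Pub/Sub, or on a Cloud Scheduler cron:

```python
# Sketch: HTTP-triggered Cloud Function that loads a CSV from Cloud Storage
# into BigQuery using a load job. Bucket and table names are placeholders.
from google.cloud import bigquery

def ingest(request):
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # let BigQuery infer the schema
    )
    job = client.load_table_from_uri(
        "gs://your-bucket/incoming/data.csv",    # placeholder source
        "your-project.your_dataset.your_table",  # placeholder destination
        job_config=job_config,
    )
    job.result()  # wait for completion
    return f"Loaded {job.output_rows} rows"
```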
After several tries and data lake / data warehouse designs and architectures, I can recommend only one thing: ingest your data into BigQuery as soon as possible, no matter the format or transformation.
Then, in BigQuery, run queries to format, clean, aggregate, and enrich your data. It's not ETL, it's ELT: you start by loading your data and then you transform it.
It's quicker, cheaper, simpler, and based only on SQL.
It works only if you use ONLY BigQuery as the destination.
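To illustrate the "T" in that ELT approach, a minimal sketch using the BigQuery client; the table names and query are placeholders:

```python
# Sketch of the transform step in ELT: once the raw file is loaded, all the
# cleaning and aggregation happens in SQL inside BigQuery. Table names and
# the query below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    """
    CREATE OR REPLACE TABLE `your-project.reporting.clean_data` AS
    SELECT customer_id, DATE(created_at) AS day, SUM(amount) AS total_amount
    FROM `your-project.staging.raw_data`
    WHERE amount IS NOT NULL
    GROUP BY customer_id, day
    """
).result()
```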
If you are starting from scratch and have no legacy tools to carry with you, the following GCP managed products target your use case:
Cloud Data Fusion, "a fully managed, code-free data integration service that helps users efficiently build and manage ETL/ELT data pipelines"
Cloud Composer, "a fully managed data workflow orchestration service that empowers you to author, schedule, and monitor pipelines"
Dataflow, "a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing"
(Without considering a myriad of data integration tools and fully customized solutions using Cloud Run, Scheduler, Workflows, VMs, etc.)
Choosing one depends on your technical skills, real-time processing needs, and budget. As mentioned by Guillaume Blaquiere, if BigQuery is your only destination, you should try to leverage BigQuery's processing power on your data transformation.

Google Cloud Dataflow - is it possible to define a pipeline that reads data from BigQuery and writes to an on-premise database?

My organization plans to store a set of data in BigQuery and would like to periodically extract some of that data and bring it back to an on-premise database. In reviewing what I've found online about Dataflow, the most common examples involve moving data in the other direction - from an on-premise database into the cloud. Is it possible to use Dataflow to bring data back out of the cloud to our systems? If not, are there other tools that are better suited to this task?
Abstractly, yes. If you've got a set of sources and sinks and you want to move data between them with some set of transformations, then Beam/Dataflow should be perfectly suitable for the task. It sounds like you're describing a batch-based periodic workflow rather than a continuous streaming workflow.
In terms of implementation effort, there are more questions to consider. Does an appropriate Beam connector exist for your intended on-premise database? You can see the built-in connectors here: https://beam.apache.org/documentation/io/built-in/ (note the per-language SDK toggle at the top of the page).
Do you need custom transformations? Are you combining data from systems other than just BigQuery? Either implies to me that you're on the right track with Beam.
On the other hand, if your extract process is relatively straightforward (e.g. just run a query once a week and extract it), you may find there are simpler solutions, particularly if you're not moving much data and your database can ingest data in one of the BigQuery export formats.
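If Beam does fit, a rough sketch of such a batch pipeline might look like the following (the query, schema, and connection details are placeholders; it assumes the cross-language JdbcIO connector and network connectivity from the Dataflow workers to the on-premise database):

```python
# Sketch: batch pipeline that reads a query result from BigQuery and writes
# it to an on-premise Postgres table through JdbcIO. Query, schema, and all
# connection details are placeholders.
import typing

import apache_beam as beam
from apache_beam import coders
from apache_beam.io.jdbc import WriteToJdbc

# Rows going through the cross-language JdbcIO need a registered schema
ReportRow = typing.NamedTuple(
    "ReportRow", [("id", int), ("name", str), ("total", float)]
)
coders.registry.register_coder(ReportRow, coders.RowCoder)

with beam.Pipeline() as p:
    (
        p
        | "read_bq" >> beam.io.ReadFromBigQuery(
            query="SELECT id, name, total FROM `your-project.your_dataset.report`",
            use_standard_sql=True,
        )
        | "to_rows" >> beam.Map(
            lambda r: ReportRow(r["id"], r["name"], r["total"])
        ).with_output_types(ReportRow)
        | "write_jdbc" >> WriteToJdbc(
            table_name="report",                                    # on-prem table
            driver_class_name="org.postgresql.Driver",
            jdbc_url="jdbc:postgresql://onprem-host:5432/reports",  # placeholder
            username="user",
            password="secret",
        )
    )
```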

How to import a SQL Dump File from Google Cloud Storage into Cloud SQL as a Daily Job?

I'm trying to import a SQL dump file from Google Cloud Storage into Cloud SQL (Postgres database) as a daily job.
I saw in the Google documentation for the Cloud API that there is a way to programmatically import a SQL dump file (URL: https://cloud.google.com/sql/docs/postgres/admin-api/v1beta4/instances/import#examples), but quite honestly, I'm a bit lost here. I haven't programmed against APIs before, and I think that's a major factor here.
In the documentation, I see that there's an area for an HTTP POST request, as well as code, but I'm not sure where this would go. Ideally, I'd like to use other Cloud products to make this daily job happen. Any help would be much appreciated.
(Side note:
I was looking into creating a cron job in Compute Engine for this, but I'm worried about ease of maintenance, especially since I have other jobs I want to build that are dependent on this one.
I'd read that Dataflow could help with this, but I haven't seen anything (tutorials) that suggests it can yet. I'm also fairly new to Dataflow, so that could be a factor as well. )
I would suggest using google-cloud-composer, which is essentially Airflow, for this. There are a lot of operators to move files between various locations. You can find more information here.
I must warn you, though, that it is still in beta, and unlike what you'd usually expect from a Google beta, this one is rather flaky (at least in my experience).
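If you do go that route, a minimal DAG sketch might look like the following (it assumes a recent Airflow with the Google provider package; the project, instance, bucket, and database names are placeholders):

```python
# Sketch: daily Airflow DAG that triggers a Cloud SQL import of a dump file
# from Cloud Storage. Project, instance, bucket, and database are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.cloud_sql import (
    CloudSQLImportInstanceOperator,
)

with DAG(
    dag_id="daily_cloudsql_import",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    import_dump = CloudSQLImportInstanceOperator(
        task_id="import_sql_dump",
        project_id="your-project",          # placeholder
        instance="your-postgres-instance",  # placeholder
        body={
            "importContext": {
                "fileType": "SQL",
                "uri": "gs://your-bucket/dump.sql",  # placeholder dump location
                "database": "your_database",         # placeholder target database
            }
        },
    )
```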

Preprocessing on data stored in BigQuery

I've just started to use GCP and I have some questions regarding the right use of some of its tools. In particular, I'm trying to ingest data from Google Analytics into BigQuery. Would it be possible to use Dataprep on data stored in BigQuery? Almost every example I've seen uses Dataprep to visualize data stored in Cloud Storage, but nothing refers to BigQuery.
Any help would be really appreciated.
You can totally use Dataprep to process data stored in BigQuery. It gives you a great way to visualize how your dataset looks, and interactively define transformations.
Now, do you really want to use Dataprep for this? The transformations will be more expensive and slow, as they will run on Dataflow - which is usually more expensive and slow than doing everything within BigQuery (as the question refers to data that's already in BigQuery).
On the other hand, the interactive environment can help you quickly define what you want and run the created recipe periodically.
See more about this in Lak's "How to schedule a BigQuery ETL job with Dataprep": https://medium.com/google-cloud/how-to-schedule-a-bigquery-etl-job-with-dataprep-b1c314883ab9
According to the documentation on Dataprep, you can import BigQuery datasets.
But it might be easier to just open Dataprep and check the import options there.