I've researched this and have currently come up with a strategy using Apache Airflow, but I'm still not sure how to do it. Most of the blogs and answers I'm finding are just code rather than material that helps me understand it better. Please also suggest if there is a good way to do it.
I also got an answer suggesting a background Cloud Function with a Cloud Storage trigger.
You can use BigQuery's Cloud Storage transfers, but note that this feature is still in beta.
It gives you the option to schedule recurring transfers from Cloud Storage into BigQuery, with certain limitations.
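For illustration, a rough Python sketch of creating such a scheduled transfer with the BigQuery Data Transfer Service client might look like this. All project, bucket, dataset and table names are placeholders, and the parameter names should be double-checked against the current Cloud Storage transfer documentation:

```python
# Sketch: schedule a recurring Cloud Storage -> BigQuery transfer
# (all names below are placeholders; verify params against the GCS transfer docs)
from google.cloud import bigquery_datatransfer_v1

client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = "projects/my-project"

transfer_config = bigquery_datatransfer_v1.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="daily-gcs-to-bq-load",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://my-bucket/exports/*.csv",
        "destination_table_name_template": "my_table",
        "file_format": "CSV",
        "skip_leading_rows": "1",
    },
    schedule="every 24 hours",
)

config = client.create_transfer_config(parent=parent, transfer_config=transfer_config)
print("Created transfer config:", config.name)
```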
Most of the blogs and answers I'm finding are just code
Apache Airflow comes with a rich UI for many tasks, but that doesn't mean you won't have to write code to get your task done.
For your case, you need to use the BigQuery operators for Apache Airflow (in particular, the operator that loads files from Cloud Storage into BigQuery).
A good explanation of how to do this can be found in this link.
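As a hedged illustration (not taken from the linked article), a load task in an Airflow DAG could look roughly like the sketch below, using the Cloud-Storage-to-BigQuery transfer operator from the Google provider package. Bucket, dataset and table names are placeholders:

```python
# Minimal DAG sketch: load a CSV from Cloud Storage into BigQuery once a day.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="gcs_to_bigquery_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_csv = GCSToBigQueryOperator(
        task_id="load_csv",
        bucket="my-bucket",                                   # placeholder bucket
        source_objects=["exports/data.csv"],                  # placeholder object path
        destination_project_dataset_table="my_project.my_dataset.my_table",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )
```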
I am new to Google Cloud Functions. My requirement is to trigger a Cloud Function on receiving a Gmail message and convert the XLS attachment from the email to CSV.
Can we do this using GCP?
Thanks in advance!
To answer very briefly: that is possible, as far as I know.
But.
You might find that in order to automate this task in a reliable, robust and self-healing way, it may be necessary to use half a dozen Cloud Functions, Pub/Sub topics, maybe a Cloud Storage bucket, maybe a Firestore collection, Secret Manager, a custom service account with the relevant IAM permissions, and so on. Possibly a dozen or two dozen different GCP resources in total. And, obviously, the code for those Cloud Functions has to be developed. All together, that may not be very easy or quick to implement.
At the same time, I have personally seen (and contributed to the development of) a functional component based on Cloud Functions that did exactly what you would like to achieve. And it ran in production.
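To make that slightly more concrete, one small piece of such a chain might be a function that converts an attachment (assumed to have already been saved to a Cloud Storage bucket by an earlier step, e.g. via Gmail push notifications and Pub/Sub) into a CSV. This is only a sketch under those assumptions; bucket and file names are placeholders:

```python
# Sketch: Cloud Storage-triggered function that converts an .xlsx object to .csv.
# Assumes an earlier step already extracted the attachment into the intake bucket.
import pandas as pd
from google.cloud import storage


def convert_xlsx_to_csv(event, context):
    """Background function triggered by an object finalize event."""
    client = storage.Client()
    bucket = client.bucket(event["bucket"])

    # Download the attachment to the function's temp space
    local_xlsx = "/tmp/attachment.xlsx"
    bucket.blob(event["name"]).download_to_filename(local_xlsx)

    # Convert with pandas (requires openpyxl in requirements.txt)
    df = pd.read_excel(local_xlsx)
    local_csv = "/tmp/attachment.csv"
    df.to_csv(local_csv, index=False)

    # Upload the CSV next to the original object
    csv_name = event["name"].rsplit(".", 1)[0] + ".csv"
    bucket.blob(csv_name).upload_from_filename(local_csv)
```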
I'm looking for some advice on the best / most cost effective solutions to use for my use case on Google Cloud (described below).
Currently, I'm using Cloud Composer, and it's way too expensive. This seems to be the result of Composer always running, so I'm looking for something that either isn't constantly running or is much cheaper to run / can accomplish the same thing.
Use Case / Process >> I have a process set up that follows the steps below:
There is a site built with Firebase that has a file drop / upload (CSV) functionality to import data into Google Storage
That file drop triggers a cloud function that starts the Cloud Composer DAG
The DAG moves the CSV from Cloud Storage to BigQuery while also performing a bunch of modifications to the dataset using Python / SQL queries.
Any advice on what would potentially be a better solution?
It seems like Dataflow might be an option, but I'm pretty new to it and wanted a second opinion.
Appreciate the help!
If your file is not too big, you can process it with Python and a pandas DataFrame; in my experience this works very well with files of around 1,000,000 rows.
Then, with the BigQuery API, you can upload the transformed DataFrame directly into BigQuery, all within your Cloud Function. Remember that Cloud Functions can run for up to 9 minutes. Best of all, this approach costs almost nothing.
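A minimal sketch of that idea, assuming a Cloud Function with a Cloud Storage trigger and placeholder dataset/table names (it needs pandas, gcsfs and pyarrow in requirements.txt):

```python
# Sketch: GCS-triggered Cloud Function that transforms a CSV with pandas
# and loads the resulting DataFrame straight into BigQuery.
import pandas as pd
from google.cloud import bigquery


def load_csv_to_bq(event, context):
    """Triggered when a CSV lands in the upload bucket."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    # pandas can read directly from gs:// paths when gcsfs is installed
    df = pd.read_csv(uri)

    # ... your transformations go here ...
    df = df.dropna()

    client = bigquery.Client()
    job = client.load_table_from_dataframe(df, "my_dataset.my_table")  # placeholder table
    job.result()  # wait for the load job to finish
```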
I was looking into this recently myself. I'm pretty sure Dataflow can be used for this case, but I doubt it will be cheaper (also considering the time you will spend learning and migrating to Dataflow if you are not already an expert).
Depending on the complexity of the transformations you apply to the file, you can look into data integration solutions such as https://fivetran.com/, https://www.stitchdata.com/, https://hevodata.com/, etc. They are mainly built to just transfer your data from one place to another, but most of them are also able to perform some transformations on the data. If I'm not mistaken, in Fivetran the transformations are SQL-based, and in Hevo they are Python-based.
There's also this article about scaling Composer nodes up and down: https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60. Maybe it will help you save some cost. To be honest, I didn't notice any significant cost reduction, but maybe it will work for you.
I am in the early stages of learning Airflow. I am learning Airflow to build a simple ETL (ELT?) data pipeline, and am in the process of figuring out the architecture for the pipeline (which operators I should use). The basics of my data pipeline are going to be:
Make HTTP GET request from API for raw data.
Save raw JSON results into a GCP bucket.
Transform the data and save into a BigQuery database.
...and the pipeline will be scheduled to run once daily.
As the title suggests, I am trying to determine whether the SimpleHttpOperator or the PythonOperator is more appropriate for making the HTTP GET requests for data. From this somewhat related Stack Overflow post, the author simply concluded:
Though I think I'm going to simply use the PythonOperator from now on
It seems simple enough to write a 10-20 line Python script that makes the HTTP request, identifies the GCP storage bucket, and writes to that bucket, roughly like the sketch below. However, I'm not sure if this is the best approach for this type of task (call API --> get data --> write to GCP storage bucket).
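For context, this is roughly the kind of script I have in mind (the API URL and bucket name are just placeholders):

```python
# Rough sketch of the fetch-and-store step I would wrap in a PythonOperator.
import json

import requests
from google.cloud import storage


def fetch_and_store():
    resp = requests.get("https://api.example.com/v1/data")  # placeholder API
    resp.raise_for_status()

    client = storage.Client()
    bucket = client.bucket("my-raw-data-bucket")             # placeholder bucket
    bucket.blob("raw/results.json").upload_from_string(
        json.dumps(resp.json()), content_type="application/json"
    )
```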
Any help or thoughts on this, any example links on building similar pipelines, etc. would be greatly appreciated. Thanks in advance!
I recommend that you see Airflow as glue between processing steps. The processing performed inside Airflow itself should be limited to conditionally triggering (or not) a step, looping over steps, and handling errors.
Why? Because if tomorrow you choose to change your workflow tool, you won't have to rewrite your processing code; you will only have to rewrite the workflow logic (because you changed your workflow tool). A simple separation of concerns.
Therefore, I recommend that you deploy your 10-20 lines of Python code as a Cloud Function and set up a SimpleHttpOperator to call it.
In addition, a function is far easier to work with than a workflow (both to run and to look at the code). Deployments and updates will also be easier.
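For example, the Airflow side could then be reduced to something like this sketch (the connection and function names are placeholders, and authentication to the Cloud Function is left out):

```python
# Sketch: a daily DAG that just calls the Cloud Function doing the real work.
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(
    dag_id="daily_api_ingest",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    call_ingest_function = SimpleHttpOperator(
        task_id="call_ingest_function",
        http_conn_id="ingest_function",   # Airflow connection holding the function's base URL
        endpoint="ingest",                # placeholder Cloud Function name
        method="POST",
        response_check=lambda response: response.status_code == 200,
    )
```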
I'm trying to import a SQL dump file from Google Cloud Storage into Cloud SQL (Postgres database) as a daily job.
I saw in the Google documentation for the Cloud SQL Admin API that there is a way to programmatically import a SQL dump file (URL: https://cloud.google.com/sql/docs/postgres/admin-api/v1beta4/instances/import#examples), but quite honestly, I'm a bit lost here. I haven't programmed against APIs before, and I think that's a major factor.
In the documentation, I see that there's an area for an HTTP POST request, as well as example code, but I'm not sure where this would go. Ideally, I'd like to use other Cloud products to make this daily job happen. Any help would be much appreciated.
(Side note:
I was looking into creating a cron job in Compute Engine for this, but I'm worried about ease of maintenance, especially since I have other jobs I want to build that are dependent on this one.
I'd read that Dataflow could help with this, but I haven't seen anything (tutorials) suggesting that it can yet. I'm also fairly new to Dataflow, so that could be a factor as well.)
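For reference, here is a rough sketch of what that API call looks like from Python, using the google-api-python-client library. Project, instance, bucket and database names are placeholders:

```python
# Sketch: trigger a Cloud SQL import of a SQL dump stored in Cloud Storage.
import googleapiclient.discovery

sqladmin = googleapiclient.discovery.build("sqladmin", "v1beta4")

body = {
    "importContext": {
        "fileType": "SQL",
        "uri": "gs://my-bucket/dump.sql",   # placeholder dump file
        "database": "my_database",          # placeholder target database
    }
}

# "import" is a reserved word in Python, so the generated client exposes it as import_()
request = sqladmin.instances().import_(
    project="my-project", instance="my-instance", body=body
)
operation = request.execute()
print(operation["name"])  # poll the operations resource with this name to track progress
```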
I would suggest using google-cloud-composer, which is essentially managed Airflow, for this. There are a lot of operators for moving files and data between various locations. You can find more information here.
I must warn you, though, that it is still in beta and, unlike what you would usually expect from a Google beta, this one is rather flaky (at least in my experience).
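If you go that route, a hedged sketch of the import task in Composer/Airflow might look like this (operator naming varies between provider versions; all resource names are placeholders):

```python
# Sketch: Cloud SQL import task inside a Composer/Airflow DAG.
from airflow.providers.google.cloud.operators.cloud_sql import CloudSQLImportInstanceOperator

import_body = {
    "importContext": {
        "fileType": "SQL",
        "uri": "gs://my-bucket/dump.sql",   # placeholder dump file
        "database": "my_database",          # placeholder target database
    }
}

import_sql_dump = CloudSQLImportInstanceOperator(
    task_id="import_sql_dump",
    project_id="my-project",
    instance="my-instance",
    body=import_body,
)
```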
I wanted to dive into the world of distributed systems, cloud computing, IoT, etc., and I have to be honest, I imagined everything being a little more intuitive than it turned out to be.
I had a tiny test architecture in mind that I'd like to set up with Google Cloud and its services, but I am kind of stuck since I can't get my head around some of the concepts.
What I basically wanted to do (as a first step) is write a simple Java application that would run locally on my computer. This application should just generate random numbers and send those numbers somehow to the Google Cloud. On the cloud I wanted to have another Java application that would manipulate those random numbers in some way (it doesn't actually matter how). Afterwards, the output should somehow get back to me, and at the moment I don't even care exactly how. It could come back to my local app (with some kind of listener; would that be possible?). But it could also simply store the results somewhere in the Google Cloud, or maybe upload them to my Google Drive.
I guess you've already noticed that, at some points, I don't even know exactly what I want, since I'm not sure what is possible and what is not.
Could you provide me some help to get this set up?
The most important questions for me right now are:
Do I need to use a Pub/Sub system to which my generated numbers are sent, and which then forwards them to the cloud app that transforms my data?
How do I get my data from the local app to the cloud services?
Would my data-transforming app run on Google Dataflow?
Above I wrote "as a first step" because later I would also like to send config files (for example in JSON or XML format) to the cloud, and the cloud application should transform those config files. If I get the first scenario running, I guess this would also be no problem, right?
Those are just a few of the questions that are on my mind currently. The most important ones I guess.
It would be a big help. Sorry if the questions are not very precise, but I really need some pointers in the right direction.
Thank you in advance!
I think it would be good to read up on some of the technologies you mention here:
Google Cloud Pubsub: Pub/Sub enables you to publish messages to a topic, and consume them in another place in the (Google) Cloud. You can see some different examples of publishers and consumers in the link. In your case you could for example write a Java application that writes random numbers to the Pub/Sub queue, where they will sit for 7 days to be consumed by another component (for example, Google Cloud Dataflow). To get started developing, you can find the SDKs here (there is a Java SDK).
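As a rough illustration (shown here in Python; the Java client follows the same topic/publish model), publishing the random numbers could look like this, with project and topic names as placeholders:

```python
# Sketch: local app publishing random numbers to a Pub/Sub topic.
import random

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "random-numbers")  # placeholder names

for _ in range(10):
    number = random.randint(0, 100)
    future = publisher.publish(topic_path, data=str(number).encode("utf-8"))
    print("Published message id:", future.result())
```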
Google Cloud Dataflow is a managed service for running Apache Beam pipelines to process your data at scale. You can learn about the different concepts here and get started designing your pipeline here. I suggest taking a look at some examples first, though, which will make it easier to grasp what is actually going on. Dataflow has a Pub/Sub connector, so in your application you will be able to read from the topic you created before. In Dataflow you can, for example, multiply all your random numbers and write them to a certain sink (for example Google Cloud Storage, or even BigQuery or Pub/Sub again).
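A very small Beam sketch of that idea (Python SDK, with placeholder topic names and a trivial transform; add the Dataflow runner options to actually run it on Dataflow):

```python
# Rough Beam/Dataflow sketch: read numbers from a Pub/Sub topic, double them,
# and publish the results to another topic (all names are placeholders).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # add runner/project/region options to run on Dataflow

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadNumbers" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/random-numbers")
        | "Decode" >> beam.Map(lambda msg: int(msg.decode("utf-8")))
        | "Double" >> beam.Map(lambda n: n * 2)
        | "Encode" >> beam.Map(lambda n: str(n).encode("utf-8"))
        | "WriteResults" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/doubled-numbers")
    )
```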
Google Cloud Storage: cloud object storage where you can put files, for example the output of your Dataflow pipeline. You will be able to manually download the files using the Cloud Console UI, or you can use one of the SDKs to download the output programmatically.
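Programmatic download is a short sketch like this (bucket and object names are placeholders):

```python
# Sketch: download a pipeline output file from Cloud Storage.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-results-bucket")  # placeholder bucket
bucket.blob("output/results-00000-of-00001.txt").download_to_filename("results.txt")
```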
Hope this gives you an overview and some pointers to start. Whenever you are ready and have a more concrete use case in mind, you can start looking at some more components.