I am trying to write an ingestion application using GCP services. There could be around 1 TB of data each day, which can arrive in a streaming fashion (i.e., roughly 100 GB each hour) or all at once at a specific time.
In designing this ingestion application, I first thought it would be a good idea to write a simple Python script in a cron job to read the files sequentially (or even with two or three threads) and then publish them as messages to Pub/Sub. I would then need a Dataflow job always running that reads data from Pub/Sub and saves it to BigQuery.
But I really want to know if I need Pub/Sub at all here. I know Dataflow can be very flexible, and I wanted to know whether I can ingest 1 TB of data directly from GCS to BigQuery as a batch job, or whether it is better done as a streaming job (via Pub/Sub) as described above. What are the pros and cons of each approach in terms of cost?
It seems like you don't need Pub/Sub at all.
There is already a Dataflow template for direct transfer of text files from Cloud Storage to BigQuery (in beta, just like the Pub/Sub to BigQuery template), and in general, batch jobs are cheaper than streaming jobs (see Pricing Details).
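If it helps, here is a minimal sketch of launching that template from Python with google-api-python-client; you could kick it off from your cron job or from Cloud Scheduler instead of keeping a streaming pipeline running. The bucket, dataset, and UDF paths are placeholders, and the parameter names should be verified against the template documentation.

```python
# Minimal sketch: launch the provided "GCS Text to BigQuery" Dataflow template
# from Python. Bucket, dataset and UDF paths are placeholders; double-check
# the parameter names against the template docs.
from googleapiclient.discovery import build

PROJECT = "my-project"  # placeholder

dataflow = build("dataflow", "v1b3")
request = dataflow.projects().templates().launch(
    projectId=PROJECT,
    gcsPath="gs://dataflow-templates/latest/GCS_Text_to_BigQuery",
    body={
        "jobName": "gcs-to-bq-batch",
        "parameters": {
            "inputFilePattern": "gs://my-bucket/incoming/*.json",
            "JSONPath": "gs://my-bucket/schemas/schema.json",
            "javascriptTextTransformGcsPath": "gs://my-bucket/udf/transform.js",
            "javascriptTextTransformFunctionName": "transform",
            "outputTable": "my-project:my_dataset.my_table",
            "bigQueryLoadingTemporaryDirectory": "gs://my-bucket/tmp",
        },
    },
)
response = request.execute()
print(response)
```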
Related
If I have a CSV that is used to update entries in a SQL database, the file size is at most 50 KB, and the update frequency is twice a week. I also have a requirement to do some automated sanity testing.
What should I use: Dataflow or Cloud Functions?
For this use case (two small files per week), a Cloud Function will be the best option.
Other options would be to create a Dataflow batch pipeline triggered by a Cloud Function, OR to create a Dataflow streaming pipeline, but both options will be more expensive.
Here is some documentation about how to connect to Cloud SQL from a Cloud Function.
And here is some documentation about triggering a Cloud Function from Cloud Storage.
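As a rough illustration, such a function could look like the sketch below, assuming a MySQL Cloud SQL instance reached over the /cloudsql unix socket and a made-up `entries` table; bucket, column, and environment variable names are placeholders.

```python
# Sketch of a Cloud Function triggered by a new CSV in a bucket that updates
# a Cloud SQL (MySQL) table over the /cloudsql unix socket.
# Table, column and env var names are made up for illustration.
import csv
import io
import os

import pymysql
from google.cloud import storage


def update_from_csv(event, context):
    """Background Cloud Function triggered by google.storage.object.finalize."""
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    content = blob.download_as_text()  # file is at most ~50 KB, so in-memory is fine

    # Connect through the Cloud SQL unix socket exposed to Cloud Functions.
    conn = pymysql.connect(
        unix_socket=f"/cloudsql/{os.environ['INSTANCE_CONNECTION_NAME']}",
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASS"],
        db=os.environ["DB_NAME"],
    )
    try:
        with conn.cursor() as cur:
            for row in csv.DictReader(io.StringIO(content)):
                # Hypothetical schema: upsert by primary key `id`.
                cur.execute(
                    "INSERT INTO entries (id, value) VALUES (%s, %s) "
                    "ON DUPLICATE KEY UPDATE value = VALUES(value)",
                    (row["id"], row["value"]),
                )
        conn.commit()
    finally:
        conn.close()
```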
See ya.
If there are no aggregations, and each input "element" does not interact with the others (that is, the pipeline works 1:1), then you can use either Cloud Functions or Dataflow.
But if you need to do aggregations, filtering, or any complex calculation that involves more than one element in isolation, you will not be able to implement that with Cloud Functions. You need to use Dataflow in that situation.
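For example, a per-key sum like the sketch below needs to see many elements together, which is natural in a Beam/Dataflow pipeline but not in a per-event Cloud Function. Paths and field positions are illustrative.

```python
# Illustrative Beam snippet: summing values per key requires grouping many
# elements, which is where Dataflow (Apache Beam) fits and a per-event
# Cloud Function does not.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")  # placeholder path
        | "Parse" >> beam.Map(lambda line: (line.split(",")[0], float(line.split(",")[1])))
        | "SumPerKey" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda key, total: f"{key},{total}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/sums")
    )
```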
We have a requirement to push data from BigQuery to Pub/Sub as events using Dataflow. Is there a template available for this (as I understand it, a Pub/Sub to BigQuery Dataflow template exists)? If we use Dataflow with streaming set to true, do we need a scheduler to invoke Dataflow to fetch and push the data to Pub/Sub? Please guide me on this.
There isn't a template to push BigQuery rows to Pub/Sub. In addition, streaming mode works only when Pub/Sub is the source, not the sink; when the source is a database or a file, the pipeline always runs in batch mode.
For your use case, I recommend using a simple Cloud Function or Cloud Run service and triggering it with Cloud Scheduler. The data volume is low, and a serverless product fits your use case perfectly. There is no need for a big, scalable product like Dataflow.
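A minimal sketch of such a function, assuming an HTTP trigger invoked by Cloud Scheduler; the project, query, and topic names are placeholders.

```python
# Sketch of an HTTP-triggered Cloud Function (invoked by Cloud Scheduler)
# that reads rows from BigQuery and publishes them to Pub/Sub.
# Project, query and topic names are placeholders.
import json

from google.cloud import bigquery, pubsub_v1

PROJECT = "my-project"
TOPIC = "my-topic"
QUERY = "SELECT * FROM `my-project.my_dataset.my_table` WHERE exported = FALSE"


def bq_to_pubsub(request):
    bq_client = bigquery.Client(project=PROJECT)
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT, TOPIC)

    futures = []
    for row in bq_client.query(QUERY).result():
        payload = json.dumps(dict(row), default=str).encode("utf-8")
        futures.append(publisher.publish(topic_path, data=payload))

    # Block until every message has been accepted by Pub/Sub.
    for future in futures:
        future.result()
    return f"published {len(futures)} rows"
```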
There is an option to export a BigQuery table to JSON files in Cloud Storage.
(Or perhaps, if your writes to BQ are not heavy, when you write to BigQuery also write JSON files into Cloud Storage in dated folders.)
Then use the Text Files on Cloud Storage to Pub/Sub Dataflow template:
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#text-files-on-cloud-storage-to-pubsub-stream
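A minimal sketch of the export step with the BigQuery Python client; the table, bucket, and folder names are placeholders, and the resulting files can then be picked up by the template linked above.

```python
# Sketch: export a BigQuery table to newline-delimited JSON files in Cloud
# Storage, which the "Text Files on Cloud Storage to Pub/Sub" template can
# then read. Table and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
extract_job = client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://my-bucket/exports/2020-01-01/part-*.json",  # dated folder, sharded output
    job_config=job_config,
)
extract_job.result()  # wait for the export to finish
```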
I'm new to BigQuery and need to do some tests with it. Looking through the BigQuery documentation, I can't find anything about creating jobs and scheduling them.
I found on another page on the internet that the only available method is to create a bucket in Google Cloud Storage and a Cloud Function written in JavaScript, with the SQL query inside its body.
Can someone help me here? Is it true?
Your question is a bit confusing, as you mix scheduling jobs with defining a query in a Cloud Function.
There is a difference between scheduling jobs and scheduling queries.
BigQuery offers Scheduled queries. See docs here.
BigQuery Data Transfer Service (schedule recurring data loads from GCS). See docs here.
If you want to schedule jobs (load, delete, copy jobs, etc.), you are better off doing this with a trigger on the observed resource, such as a new Cloud Storage file, a Pub/Sub message, or an HTTP trigger, all of this wired into a Cloud Function.
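As an illustration, a Cloud Function triggered by a new Cloud Storage file could start a BigQuery load job along these lines; the dataset and table names are made up.

```python
# Sketch: a Cloud Function triggered by a new file in a bucket that starts
# a BigQuery load job for it. Dataset/table names are placeholders.
from google.cloud import bigquery


def load_on_upload(event, context):
    """Triggered by google.storage.object.finalize."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(
        uri, "my-project.my_dataset.my_table", job_config=job_config
    )
    load_job.result()  # wait for the load job to complete
```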
Some other related blog posts:
How to schedule a BigQuery ETL job with Dataprep
Scheduling BigQuery Jobs: This time using Cloud Storage & Cloud Functions
I need to ETL data into my Cloud SQL instance. This data comes from API calls. Currently, I'm running custom Java ETL code in Kubernetes with CronJobs that makes requests to collect this data and load it into Cloud SQL. The problem comes with managing the ETL code and monitoring the ETL jobs. The current solution may not scale well when more ETL processes are incorporated. In this context, I need to use an ETL tool.
My Cloud SQL instance contains two types of tables: common transactional tables and tables that contain data coming from the API. The second type is mostly read-only from an "operational database" perspective, and a large part of those tables is bulk updated every hour (in batch) to discard old data and refresh the values.
Considering this context, I noticed that Cloud Dataflow is the ETL tool provided by GCP. However, it seems that this tool is more suitable for big data applications that need to do complex transformations and ingest data in multiple formats. Also, in Dataflow, the data is processed in parallel and worker nodes are scaled up as needed. Since Dataflow is a distributed system, the ETL process might carry an overhead when allocating resources just to do a simple bulk load. In addition, I noticed that Dataflow doesn't have a particular sink for Cloud SQL. This probably means that Dataflow isn't the correct tool for simple bulk load operations into a Cloud SQL database.
For my current needs, I only need to do simple transformations and bulk load the data. However, in the future, we might want to handle other sources of data (PNG, JSON, CSV files) and sinks (Cloud Storage and maybe BigQuery). Also, in the future, we might want to ingest streaming data and store it in Cloud SQL. In this sense, the underlying Apache Beam model is really interesting, since it offers a unified model for batch and streaming.
Given all this context, I can see two approaches:
1) Use an ETL tool like Talend in the Cloud to help with monitoring and maintaining ETL jobs.
2) Use Cloud Dataflow, since we may need streaming capabilities and integration with all kinds of sources and sinks.
The problem with the first approach is that I may end up using Cloud Dataflow anyway when future requirements arrive, and that would be bad for my project in terms of infrastructure costs, since I would be paying for two tools.
The problem with the second approach is that Dataflow doesn't seem to be suitable for simple bulk load operations into a Cloud SQL database.
Is there something I am getting wrong here? Can someone enlighten me?
You can use Cloud Dataflow just for loading operations. Here is a tutorial on how to perform ETL operations with Dataflow. It uses BigQuery but you can adapt it to connect to your Cloud SQL or other JDBC sources.
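Since the Python SDK has no dedicated Cloud SQL sink, one common adaptation is a custom DoFn that writes rows over a database connection. Below is a rough sketch along those lines, with made-up connection details, file paths, and table schema.

```python
# Rough sketch of a Dataflow (Apache Beam, Python) batch pipeline that reads
# CSV files from GCS and bulk-loads rows into Cloud SQL through a custom DoFn.
# Connection details, paths and the table schema are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class WriteToCloudSQL(beam.DoFn):
    def setup(self):
        # One connection per worker; created here because connections
        # cannot be pickled and shipped with the DoFn.
        import pymysql
        self.conn = pymysql.connect(
            host="10.0.0.3",  # private IP of the Cloud SQL instance (placeholder)
            user="etl", password="secret", db="analytics",
        )

    def process(self, row):
        with self.conn.cursor() as cur:
            # Hypothetical table: replace old values with the refreshed ones.
            cur.execute(
                "REPLACE INTO api_data (id, value) VALUES (%s, %s)",
                (row[0], row[1]),
            )
        self.conn.commit()

    def teardown(self):
        self.conn.close()


def run():
    options = PipelineOptions()  # pass --runner=DataflowRunner etc. on the command line
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/api-dumps/*.csv")
            | "Parse" >> beam.Map(lambda line: line.split(","))
            | "Write" >> beam.ParDo(WriteToCloudSQL())
        )


if __name__ == "__main__":
    run()
```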
More examples can be found on the official Google Cloud Platform GitHub page for Dataflow analysis of user-generated content.
You can also have a look at this GCP ETL architecture example that automates the tasks of extracting data from operational databases.
For simpler ETL operations, Dataprep is an easy tool to use and provides flow scheduling as well.
I'm trying to figure out if there is a service on GCP that would allow consuming a stream from Pub/Sub and dumping/batching the accumulated data to files in Cloud Storage (e.g. every X minutes). I know this can be implemented with Dataflow, but I'm looking for a more "out of the box" solution, if one exists.
As an example, this is something one can do with AWS Kinesis Firehose purely at the configuration level: one can tell AWS to dump whatever has accumulated in the stream to files on S3, periodically or when the accumulated data reaches some size.
The reason for this is that, when no stream processing is required and I only need to accumulate data, I would like to minimize the additional costs of:
building a custom piece of software, even a simple one, if it can be avoided completely
consuming additional compute resources to execute it
To avoid confusion: I'm not looking for a free-of-charge solution, but for the optimal one.
Google maintains a set of templates for Dataflow to perform common tasks between their services.
You can use the "Pubsub to Cloud Storage" template by simply plugging in a few config values - https://cloud.google.com/dataflow/docs/templates/provided-templates#cloudpubsubtogcstext