Create jobs and schedule them on BigQuery - google-cloud-platform

I'm new to BigQuery and need to run some tests on it. Looking through the BigQuery documentation, I can't find anything about creating jobs and scheduling them.
I found on another page on the internet that the only available method is to create a bucket in Google Cloud Storage, create a function in Cloud Functions using JavaScript, and write the SQL query inside its body.
Can someone help me here? Is that true?

Your question is a bit confusing, as you mix scheduling jobs with defining a query in a Cloud Function.
There is a difference between scheduling jobs and scheduling queries.
BigQuery offers scheduled queries. See docs here.
BigQuery Data Transfer Service schedules recurring data loads from GCS. See docs here.
If you want to schedule jobs (load, delete, copy jobs, etc.), you are better off triggering them from the observed resource: a new Cloud Storage file, a Pub/Sub message, or an HTTP trigger, all wired into a Cloud Function.
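For the Cloud Function route, here is a minimal sketch (not from the linked docs; the dataset and table names are hypothetical) of a background function that starts a BigQuery load job whenever a new file lands in a bucket:

    # Deployed with --trigger-resource <bucket> --trigger-event google.storage.object.finalize
    from google.cloud import bigquery

    def load_new_file(event, context):
        """Starts a BigQuery load job for the file that triggered the function."""
        uri = f"gs://{event['bucket']}/{event['name']}"
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,  # assuming CSV input
            skip_leading_rows=1,
            autodetect=True,
        )
        load_job = client.load_table_from_uri(
            uri, "my_dataset.my_table", job_config=job_config  # hypothetical table
        )
        load_job.result()  # wait for the job to finish
        print(f"Loaded {uri} into my_dataset.my_table")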
Some other related blog posts:
How to schedule a BigQuery ETL job with Dataprep
Scheduling BigQuery Jobs: This time using Cloud Storage & Cloud Functions

Related

Developing a simple cloud data processing system

What would be the simplest way to take a public data API, for example, schedule a daily job to calculate a set of statistics, and land the computed statistics in a cloud database?
What about using a CloudWatch Events rule with a schedule expression and a Lambda function as the target?
The event rule would trigger your Lambda function, e.g., once a day. The function would then call the API, process the data from the API, and write the results into a DynamoDB or RDS database, depending on whether you require a relational or a NoSQL database.
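A minimal sketch of such a Lambda, assuming a hypothetical public JSON API and a DynamoDB table named daily_stats with date as its partition key:

    import datetime
    import json
    import urllib.request

    import boto3

    def handler(event, context):
        """Invoked once a day by a CloudWatch Events schedule rule."""
        with urllib.request.urlopen("https://api.example.com/data") as resp:  # placeholder API
            records = json.loads(resp.read())

        # A trivial statistic as a stand-in for the real calculation.
        stats = {"date": datetime.date.today().isoformat(),
                 "record_count": len(records)}

        boto3.resource("dynamodb").Table("daily_stats").put_item(Item=stats)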
Same on GCP. You can use Cloud Scheduler for the periodic trigger and have it call a Cloud Function that performs your statistics.
You can use Firestore to store your data in document format.
The free tiers of each product are generous, and if your processing is simple and not running full time, you should pay nothing.
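For illustration, a minimal sketch of that GCP setup: an HTTP Cloud Function that Cloud Scheduler calls once a day, writing the computed statistics to Firestore (the API URL and collection name are placeholders):

    import datetime
    import json
    import urllib.request

    from google.cloud import firestore

    def compute_daily_stats(request):
        """HTTP-triggered function, hit daily by a Cloud Scheduler job."""
        with urllib.request.urlopen("https://api.example.com/data") as resp:  # placeholder API
            records = json.loads(resp.read())

        stats = {"record_count": len(records),
                 "computed_at": datetime.datetime.utcnow().isoformat()}

        # One document per day in a hypothetical "daily_stats" collection.
        firestore.Client().collection("daily_stats") \
            .document(datetime.date.today().isoformat()).set(stats)
        return "ok", 200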

Should I use pub/sub

I am trying to write an ingestion application using GCP services. There could be around 1 TB of data each day, which can arrive in a streaming way (i.e., 100 GB each hour, or even all at once at a specific time).
I am trying to design this ingestion application. I first thought it would be a good idea to write a simple Python script in a cron job to read files sequentially (or even with two or three threads) and then publish them as messages to Pub/Sub. I would then need a Dataflow job running at all times to read data from Pub/Sub and save it to BigQuery.
But I really want to know if I need Pub/Sub at all here. I know Dataflow can be very flexible, and I want to know whether I can ingest 1 TB of data directly from GCS to BigQuery as a batch job, or whether it is better done with a streaming job (via Pub/Sub) as described above. What are the pros and cons of each approach in terms of cost?
It seems like you don't need Pub/Sub at all.
There is already a Dataflow template for direct transfer of text files from Cloud Storage to BigQuery (in beta, just like the Pub/Sub to BigQuery template), and in general, batch jobs are cheaper than streaming jobs (see Pricing Details).
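If the files need no transformation at all, the "directly from GCS to BigQuery" option from the question can even be a plain BigQuery load job instead of a Dataflow pipeline; a minimal sketch with placeholder bucket and table names:

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,  # assuming JSON files
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        "gs://my-ingest-bucket/2019-01-01/*.json",  # hypothetical hourly drop location
        "my_dataset.events",
        job_config=job_config,
    )
    load_job.result()  # runs as a batch job; see BigQuery pricing for load jobs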

Create/update in datastore triggers Cloud function

I have a database in Google Datastore. I don't know how to use Cloud Functions, but I want to trigger an event after a creation or an update.
Unfortunately the documentation is light on the subject: https://cloud.google.com/appengine/docs/standard/java/datastore/callbacks
I don't know how I could use @PostPut to trigger an event as soon as an entity is created or updated.
Does anyone have a tutorial with a basic example?
Thank you.
Dan MacGrath provided an answer to a similar request (see his answer below): such a solution doesn't exist yet. As a workaround, taking into account the currently available triggers:
HTTP—invoke functions directly via HTTP requests.
Cloud Storage
Cloud Pub/Sub
Firebase (DB, Storage, Analytics, Auth)
Stackdriver Logging—forward log entries to a Pub/Sub topic by creating a sink. You can then trigger the function.
I would suggest a couple of workarounds:
Save something in a specific Cloud Storage bucket every time an entity is created or updated, to trigger a linked Cloud Function. You can delete the bucket contents afterwards.
Create logs with the same name and then forward them to Pub/Sub by creating a sink.
EDIT 1
Cloud Storage triggers for Cloud Functions: Official Google doc and tutorial with sample code in Node.js 6 on GitHub.
Cloud Pub/Sub triggers for Cloud Functions: Official Google doc and tutorial with sample code in Node.js 6 on GitHub (the same as before).
Cloud Datastore does not support real-time triggers on CRUD (Create, Read, Update, Delete) events.
However, you can migrate to Cloud Firestore, which does support real-time triggers for those actions (by way of Cloud Functions triggers on document events). Cloud Firestore is the successor to Cloud Datastore and may eventually supplant it at some point in the future.
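For reference, a minimal sketch of such a Firestore-triggered function (a 1st-gen background function; the project, collection, and deployment flags below are assumptions):

    # Deployed with, e.g.:
    #   --trigger-event providers/cloud.firestore/eventTypes/document.write
    #   --trigger-resource "projects/MY_PROJECT/databases/(default)/documents/items/{itemId}"
    def on_item_write(data, context):
        """Runs whenever a document in the watched collection is created or updated."""
        old_value = data.get("oldValue") or {}
        new_value = data.get("value") or {}
        if not old_value.get("fields"):
            print(f"Created: {context.resource}")
        else:
            print(f"Updated: {context.resource}")
        print(f"New fields: {new_value.get('fields')}")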

ETL approaches to bulk load data in Cloud SQL

I need to ETL data into my Cloud SQL instance. This data comes from API calls. Currently, I'm running custom Java ETL code in Kubernetes with CronJobs that makes requests to collect this data and loads it into Cloud SQL. The problem is managing the ETL code and monitoring the ETL jobs. The current solution may not scale well when more ETL processes are incorporated. In this context, I need to use an ETL tool.
My Cloud SQL instance contains two types of tables: common transactional tables and tables that contain data coming from the API. The second type is mostly read-only from an "operational database perspective", and a large part of those tables is bulk updated every hour (in batch) to discard old data and refresh the values.
Considering this context, I noticed that Cloud Dataflow is the ETL tool provided by GCP. However, it seems that this tool is more suitable for big data applications that need to do complex transformations and ingest data in multiple formats. Also, in Dataflow, the data is processed in parallel and worker nodes are scaled as needed. Since Dataflow is a distributed system, maybe the ETL process would have an overhead when allocating resources just to do a simple bulk load. In addition, I noticed that Dataflow doesn't have a particular sink for Cloud SQL. This probably means that Dataflow isn't the right tool for simple bulk load operations into a Cloud SQL database.
For my current needs, I only need to do simple transformations and bulk load the data. However, in the future, we might want to handle other sources of data (PNG, JSON, CSV files) and sinks (Cloud Storage and maybe BigQuery). Also, in the future, we might want to ingest streaming data and store it in Cloud SQL. In this sense, the underlying Apache Beam model is really interesting, since it offers a unified model for batch and streaming.
Given all this context, I can see two approaches:
1) Use an ETL tool like Talend in the Cloud to help monitoring ETL jobs and maintenance.
2) Use Cloud Dataflow, since we may need streaming capabilities and integration with all kinds of sources and sinks.
The problem with the first approach is that I may end up using Cloud Dataflow anyway when future requirements arrive, and that would be bad for my project in terms of infrastructure costs, since I would be paying for two tools.
The problem with the second approach is that Dataflow doesn't seem to be suitable for simple bulk loading operations into a Cloud SQL database.
Is there something I am getting wrong here? Can someone enlighten me?
You can use Cloud Dataflow just for loading operations. Here is a tutorial on how to perform ETL operations with Dataflow. It uses BigQuery, but you can adapt it to connect to your Cloud SQL instance or other JDBC sources.
More examples can be found on the official Google Cloud Platform GitHub page for Dataflow analysis of user-generated content.
You can also have a look at this GCP ETL architecture example that automates the tasks of extracting data from operational databases.
For simpler ETL operations, Dataprep is an easy tool to use and provides flow scheduling as well.
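To illustrate the "adapt it to Cloud SQL" part, here is a minimal, hypothetical sketch of a Beam batch pipeline that reads CSV files from Cloud Storage and writes rows to a MySQL-flavoured Cloud SQL instance through a plain database driver (connection details, table name and parsing are placeholders):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class WriteToCloudSql(beam.DoFn):
        """Writes rows to Cloud SQL (MySQL), one connection per worker."""

        def setup(self):
            import pymysql  # imported here so each worker opens its own connection
            self.conn = pymysql.connect(
                host="10.0.0.3",  # hypothetical private IP of the instance
                user="etl_user", password="secret", database="analytics",
            )

        def process(self, row):
            with self.conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO measurements (id, value) VALUES (%s, %s)",
                    (row["id"], row["value"]),
                )
            self.conn.commit()

        def teardown(self):
            self.conn.close()

    def run():
        with beam.Pipeline(options=PipelineOptions()) as p:
            (
                p
                | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
                | "Parse" >> beam.Map(lambda line: dict(zip(["id", "value"], line.split(","))))
                | "Write" >> beam.ParDo(WriteToCloudSql())
            )

    if __name__ == "__main__":
        run()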

Rate limited API requests in Cloud Composer

I'm planning a project whereby I'd be hitting the (rate-limited) Reddit API and storing the data in GCS and BigQuery. Initially, Cloud Functions would be the choice, but I'd have to create a Datastore implementation to manage the "pseudo" queue of requests, plus GAE for cron jobs.
Doing everything in Dataflow wouldn't make sense, because it's not advised to make external requests (i.e., hitting the Reddit API) or to run a single job perpetually.
Could I use Cloud Composer to read fields from a Google Sheet, then create a queue of requests based on the Google Sheet, then have a task queue execute those requests, store the results in GCS, and load them into BigQuery?
That sounds like a legitimate use case for Composer. Additionally, you could leverage the pool concept in Airflow to manage concurrent calls to the same endpoint (e.g., the Reddit API).
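A minimal sketch of what that could look like on the Composer side, assuming an Airflow pool named reddit_api has been created in the UI to cap concurrent API calls; reading the Google Sheet and the actual Reddit/GCS/BigQuery steps are stubbed out:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def fetch_subreddit(subreddit, **kwargs):
        # Placeholder: call the Reddit API for `subreddit`, write the raw
        # response to GCS, then load it into BigQuery.
        print(f"Fetching {subreddit}")

    with DAG(
        dag_id="reddit_ingest",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        # In practice these names would come from the Google Sheet.
        for subreddit in ["python", "dataengineering", "bigquery"]:
            PythonOperator(
                task_id=f"fetch_{subreddit}",
                python_callable=fetch_subreddit,
                op_kwargs={"subreddit": subreddit},
                pool="reddit_api",  # the shared pool limits concurrent API hits
            )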