Running dbt vs scheduled queries on BigQuery - google-cloud-platform

I am trying to see if there are any benefits to running dbt on BigQuery as opposed to scheduled queries. Hence, what are the benefits?

If dbt only runs a simple query, there is no real advantage. But in that case, I wouldn't use scheduled queries either; I would use Cloud Workflows or Cloud Scheduler to invoke a BigQuery API call (the jobs API, with a query job).
However, if you build a complex query, dbt can help you create and maintain it, and even create temporary working tables on the way to the final table.
But at the end of the day, dbt is simply a SQL query generator, and you can schedule its output with whatever service you prefer.
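For instance, since dbt compile writes the rendered SQL for each model under target/, one hedged sketch (made-up paths, project, and table names, using the google-cloud-bigquery client) is to submit that compiled SQL yourself as a query job from whatever scheduler you like, e.g. a Cloud Scheduler-triggered function:

from pathlib import Path

from google.cloud import bigquery

# All names below (project, dataset, table, compiled-SQL path) are placeholders;
# "dbt compile" writes the rendered SQL for each model under target/compiled/.
client = bigquery.Client(project="my-project")

sql = Path("target/compiled/my_dbt_project/models/daily_summary.sql").read_text()

job_config = bigquery.QueryJobConfig(
    # Recent google-cloud-bigquery versions accept a string table ID here.
    destination="my-project.analytics.daily_summary",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

client.query(sql, job_config=job_config).result()  # wait for the query job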

Related

Speed up BigQuery query job to import from Cloud SQL

I am performing a query to generate a new BigQuery table of size ~1 TB (a few billion rows) as part of migrating a Cloud SQL table to BigQuery, using a federated query. I use the BigQuery Python client to submit the query job; in the query I select everything from the Cloud SQL database table using EXTERNAL_QUERY.
I find that the query can take 6+ hours (and fails with "Operation timed out after 6.0 hour")! Even if it didn't fail, I would like to speed it up as I may need to perform this migration again.
I see that the PostgreSQL egress is 20 Mb/sec, consistent with a job that would take half a day. Would it help to use something more distributed, like Dataflow? Or, more simply, should I extend my Python code to use the BigQuery client to generate multiple queries that BigQuery can run asynchronously?
Or is it possible to still use that single query but increase the egress traffic (database configuration)?
I think it is more suitable to use a dump export.
Running a query over such a large table is inefficient.
I recommend exporting the Cloud SQL data to a CSV file.
BigQuery can import CSV files, so you can use this file to create your new BigQuery table.
I'm not sure how long this job will take, but at least it won't fail.
Refer here for more detail about exporting Cloud SQL data to a CSV dump.
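A rough sketch of that dump-based path in Python, assuming the google-api-python-client and google-cloud-bigquery libraries and made-up project, instance, bucket, and table names; the export is an asynchronous Cloud SQL Admin operation, so a real script would poll it before starting the load:

from google.cloud import bigquery
from googleapiclient import discovery

project = "my-project"  # placeholder

# 1. Ask the Cloud SQL Admin API to export the table to CSV in Cloud Storage.
#    instances().export() is asynchronous; poll the returned operation before
#    starting the BigQuery load in a real script.
sqladmin = discovery.build("sqladmin", "v1beta4")
sqladmin.instances().export(
    project=project,
    instance="my-postgres-instance",  # placeholder instance name
    body={
        "exportContext": {
            "fileType": "CSV",
            "uri": "gs://my-bucket/dump/my_table.csv",
            "csvExportOptions": {"selectQuery": "SELECT * FROM my_table"},
        }
    },
).execute()

# 2. Load the CSV from Cloud Storage into a new BigQuery table.
bq = bigquery.Client(project=project)
load_job = bq.load_table_from_uri(
    "gs://my-bucket/dump/my_table.csv",
    f"{project}.my_dataset.my_table",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,  # or pass an explicit schema
    ),
)
load_job.result()  # wait for the load to finish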

Export Data from BigQuery to Google Cloud SQL using Create Job From SQL tab in Dataflow

I am working on a project that crunches data and does a lot of processing, so I chose BigQuery because it has good support for analytical queries. However, the final computed result is stored in a table that has to power my webpage (used as transactional/OLTP storage). My understanding is that BigQuery is not suitable for transactional queries. I was looking into other alternatives and realized I can use Dataflow to do the analytical processing and move the data to Cloud SQL (a relational DB fits my purpose).
However, it's not as straightforward as it seems: first I have to create a pipeline to move the data to a Cloud Storage bucket, and then move it to Cloud SQL.
Is there a better way to manage it? Can I use "Create Job from SQL" in Dataflow to do it? I haven't found any examples that use "Create Job From SQL" to process and move data to Cloud SQL.
Consider a simple example like Robinhood:
Compute a user's returns by looking at their portfolio and show a graph with the returns for every month.
There are other options besides using a pipeline, but in all cases you cannot export table data to a local file, to Sheets, or to Drive. The only supported export location is Cloud Storage, as stated on the Exporting table data documentation page.
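As a hedged illustration of that Cloud Storage route (all project, bucket, instance, and table names below are placeholders), you could export the computed BigQuery table to Cloud Storage and then import the CSV into Cloud SQL with the Admin API:

from google.cloud import bigquery
from googleapiclient import discovery

project = "my-project"  # all names below are placeholders

# 1. Export the computed BigQuery table to CSV in Cloud Storage.
bq = bigquery.Client(project=project)
bq.extract_table(
    f"{project}.analytics.monthly_returns",
    "gs://my-bucket/exports/monthly_returns.csv",
).result()  # wait for the export job

# 2. Import the CSV into an existing Cloud SQL table via the Admin API.
sqladmin = discovery.build("sqladmin", "v1beta4")
sqladmin.instances().import_(
    project=project,
    instance="my-sql-instance",
    body={
        "importContext": {
            "fileType": "CSV",
            "uri": "gs://my-bucket/exports/monthly_returns.csv",
            "database": "webapp",
            "csvImportOptions": {"table": "monthly_returns"},
        }
    },
).execute()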

Process data from BigQuery using Dataflow

I want to retrieve data from BigQuery that arrives every hour, do some processing, and put the newly calculated variables into a new BigQuery table. The thing is that I've never worked with GCP before and I have to for my job now.
I already have my code in Python to process the data, but it works only with a "static" dataset.
As your source and sink are both BigQuery, I would recommend doing your transformations inside BigQuery.
If you need a scheduled job that runs at a predetermined time, you can use Scheduled Queries.
With Scheduled Queries you can save a query, execute it periodically, and save the results to another table.
To create a scheduled query, follow these steps:
In the BigQuery console, write your query
After writing the query, click on Schedule query and then on Create new scheduled query
Pay attention to these two fields:
Schedule options: there are some pre-configured schedules such as daily, monthly, etc. If you need to execute it every two hours, for example, you can set the Repeat option to Custom and set your Custom schedule to 'every 2 hours'. In the Start date and run time field, select the date and time when your query should start being executed.
Destination for query results: here you can set the dataset and table where your query's results will be saved. Please keep in mind that this option is not available if you use scripting. In other words, you should use only SQL and not scripting in your transformations.
Click on Schedule
After that, your query will be executed according to your schedule and destination table configuration.
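The same scheduled query can also be created programmatically. A minimal sketch with the BigQuery Data Transfer Service Python client, assuming made-up project, dataset, table, and query values:

from google.cloud import bigquery_datatransfer

# Project, dataset, table, and query below are placeholders.
client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="hourly transform",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT col_a, col_b * 2 AS col_b_doubled FROM `my-project.raw.events`",
        "destination_table_name_template": "transformed_events",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 2 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print("Created scheduled query:", transfer_config.name)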
According to Google's recommendation, when your data are in BigQuery and you want to transform them and store the result in BigQuery, it's always quicker and cheaper to do this in BigQuery if you can express your processing in SQL.
That's why I don't recommend Dataflow for your use case. If you don't want to, or can't, express everything directly in SQL, you can create user-defined functions (UDFs) in BigQuery in JavaScript.
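For instance, here is a small sketch of a temporary JavaScript UDF, wrapped in the Python client since that's what the question already uses; the table and column names are made up:

from google.cloud import bigquery

client = bigquery.Client()

# A temporary JavaScript UDF defined inline in the query; the table and column
# names are placeholders.
sql = """
CREATE TEMP FUNCTION clean_label(s STRING)
RETURNS STRING
LANGUAGE js AS r'''
  return s ? s.trim().toLowerCase() : null;
''';

SELECT clean_label(label) AS label, COUNT(*) AS n
FROM `my-project.my_dataset.events`
GROUP BY label
"""

for row in client.query(sql).result():
    print(row.label, row.n)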
EDIT
If you have no information about when the data are loaded into BigQuery, Dataflow won't help you here. Dataflow can process data in real time only if the data arrive through Pub/Sub. If not, it's not magic!
Because you don't know when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries is the right solution if you use BigQuery for your processing.

What's the difference between Dataflow SQL and Beam SQL (ZetaSQL or Calcite SQL)?

While browsing I just came across Dataflow SQL. Is it any different from Beam SQL?
Apache Beam SQL is a feature of Apache Beam that allows you to execute SQL queries directly from your pipeline.
As you can see here, Beam SQL supports two SQL dialects: Beam Calcite SQL and ZetaSQL. The advantage of using ZetaSQL is that it's very similar to BigQuery's syntax, hence it's useful in pipelines that read from or write to BigQuery.
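As a minimal illustration, here is a Beam SQL sketch using the Python SDK's SqlTransform wrapper (it goes through a cross-language expansion service, so a local Java runtime is required); the input rows and query are made up:

import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

# SqlTransform defaults to the Calcite dialect; pass dialect="zetasql" to use
# ZetaSQL instead.
with beam.Pipeline() as p:
    (
        p
        | beam.Create([
            beam.Row(product="pen", amount=2.0),
            beam.Row(product="pen", amount=3.5),
            beam.Row(product="book", amount=10.0),
        ])
        | SqlTransform(
            "SELECT product, SUM(amount) AS total FROM PCOLLECTION GROUP BY product"
        )
        | beam.Map(print)
    )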
Dataflow SQL is a feature of Dataflow that allows you to create pipelines directly from a BigQuery query. The documentation says that it supports the ZetaSQL syntax (BigQuery syntax).
To create a new Dataflow job through the BigQuery console, follow these steps:
Go to the BigQuery console
Just under the Query editor, click on More and then on Query settings
Select Cloud Dataflow engine as the query engine in the first option
After that, you can click on Create Cloud Dataflow job and your query will become a Dataflow job.
I hope it helps

Automatically materialize view from query

I want to automatically materialize a view based on a BigQuery query (all source tables are in BigQuery as well). Is there a lightweight solution for this in Google Cloud?
BigQuery doesn't support materialized views; here is a feature request and
here is another one (that you can star to increase visibility).
You can build something from scratch by running a cron at a regular interval that executes a query job with the destination table set to the one you want to produce.
For example, with the bq command-line tool:
bq query --destination_table project.dataset.materialized_view --use_legacy_sql=false --replace 'SELECT * FROM `project.dataset.view_name`'
Or with the API as well.
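For example, a rough Python equivalent of the bq command above using the google-cloud-bigquery client (same placeholder project/dataset/table names); a query job with a destination table and WRITE_TRUNCATE is what --replace does:

from google.cloud import bigquery

# Same placeholder project/dataset/table names as the bq command above.
client = bigquery.Client(project="project")

job_config = bigquery.QueryJobConfig(
    # Recent client versions accept a string table ID for the destination.
    destination="project.dataset.materialized_view",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # like --replace
)

client.query(
    "SELECT * FROM `project.dataset.view_name`",
    job_config=job_config,
).result()  # schedule this script with cron, Cloud Scheduler, etc.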