I am trying to create an ETL process, I have the desired data stored in Big Query. Every time I want to run my process in Dataprep this error pops up:
The schema of the BigQuery table does not match the recipe (...)
To solve it I have to manually re-import the table from Big Query, my question is:
Is there a way to automate the manual re-import of the table on every scheduled run in order to solve this error?
Note: I found this question with a similar issue but the solution given was manual and I want an automated solution.
Related
I have a GCP DataFlow pipeline configured with a select SQL query that selects specific rows from a Postgres table and then inserts these rows automatically into the BigQuery dataset. This pipeline is configured to run daily at 12am UTC.
When the pipeline initiates a job, it runs successfully and copies the desired rows. However, when the next job runs, it copies the same set of rows again into the BigQuery table, hence resulting in data duplication.
I wanted to know if there is a way to truncate the BigQuery dataset table before the pipeline runs. It seems like a common problem so looking if there's an easy solution without going into a custom DataFlow template.
BigQueryIO has an option called WriteDisposition, where you can use WRITE_TRUNCATE.
From the link above, WRITE_TRUNCATE means:
Specifies that write should replace a table.
The replacement may occur in multiple steps - for instance by first removing the existing table, then creating a replacement, then filling it in. This is not an atomic operation, and external programs may see the table in any of these intermediate steps.
If your use case can not afford the table being unavailable during the operation, a common pattern is moving the data to a secondary / staging table, and then using atomic operations on BigQuery to replace the original table (e.g., using CREATE OR REPLACE TABLE).
Because of companies policies, we have a lot of information that we need as input inserted into a BigQuery table that we need to SELECT from.
My problem is that doing a select directly into this table and trying to run a process (a virtual machine, etc) is prone to errors and reworking. If my process stops, I need to run the query again and reprocess everything.
Is there a way to export data from Big Query to a Kinesis-like stream (I'm more familiar with AWS)?
DataFlow + PubSub seems to be the way to go for this kind of issue.
Thank you jamiet!
I have a task to create backup\copies of certain select tables from various datasets from one project into another (or) within the same project. A model query is listed below. There are about 300 odd tables in total.
CREATE OR REPLACE TABLE $target_project.$target_dataset.$table_name_$suffix AS
SELECT * FROM $source_project.$source_dataset.$table_name
In order to accomplish this, I have created a config table with the dataset name and table name. I have two approaches in mind -
Option 1 -
Create SQL script which loops through all the records from the config table, generates dynamic sql and executes them. Am unable to find a proper way to loop through table records, getting values in variables. The struct command only takes one query at a time.
Option 2 -
Create a sql file containing all the CREATE TABLE SCRIPT statements with placeholders for the project name and suffix names.
Use a DAG, pass the variables into the sql file and execute the script using Cloud Composer.
Option 2 is working and feasible. It is just not scalable owing the fact that any changes to be made would require one to modify the script again, re-upload etc.
Can someone help me with Option 1 (or) advise if there is any better way to accomplish this task?
We work with the gcp suite of products and am happy to engage with various tools to accomplish this.
I want to retrieve data from BigQuery that arrived every hour and do some processing and pull the new calculate variables in a new BigQuery table. The things is that I've never worked with gcp before and I have to for my job now.
I already have my code in python to process the data but it's work only with a "static" dataset
As your source and sink of that are both in BigQuery, I would recommend you to do your transformations inside BigQuery.
If you need a scheduled job that runs in a pre determined time, you can use Scheduled Queries.
With Scheduled Queries you are able to save some query, execute it periodically and save the results to another table.
To create a scheduled query follow the steps:
In BigQuery Console, write your query
After writing the correct query, click in Schedule query and then in Create new scheduled query as you can see in the image below
Pay attention in this two fields:
Schedule options: there are some pre-configured schedules such as daily, monthly, etc.. If you need to execute it every two hours, for example, you can set the Repeat option as Custom and set your Custom schedule as 'every 2 hours'. In the Start date and run time field, select the time and data when your query should start being executed.
Destination for query results: here you can set the dataset and table where your query's results will be saved. Please keep in mind that this option is not available if you use scripting. In other words, you should use only SQL and not scripting in your transformations.
Click on Schedule
After that your query will start being executed according to your schedule and destination table configurations.
According with Google recommendation, when your data are in BigQuery and when you want to transform them to store them in BigQuery, it's always quicker and cheaper to do this in BigQuery if you can express your processing in SQL.
That's why, I don't recommend you dataflow for your use case. If you don't want, or you can't use directly the SQL, you can create User Defined Function (UDF) in BigQuery in Javascript.
EDIT
If you have no information when the data are updated into BigQuery, Dataflow won't help you on this. Dataflow can process realtime data only if these data are present into PubSub. If not, it's not magic!!
Because you haven't the information of when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries is the right solution is you use BigQuery for your processing.
I have a BigQuery table and an external data import process that should add entries every day. I need to verify that the table contains current data (with a timestamp of today). Writing the SQL-query is not a problem.
My question is how to best install such a monitoring in GCP? Can Stackdriver execute custom BigQuery SQL? Or would a CloudFunction be more suitable? An AppEngine application with a cronjob? What's the best practise?
Not sure what's the best practice here, but one simple solution is to use BigQuery scheduled query. Schedule query, make it fail is something is wrong using ERROR() function, configure scheduled query to notify (it sends email) if it fails.