How to call a BigQuery stored procedure in NiFi - google-cloud-platform

I have a BigQuery stored procedure which runs on some GCS objects and does magic with them. The procedure works perfectly when run manually, but I want to call it from NiFi. I have worked with HANA and know that I need a JDBC driver to connect and run queries.
I could use either the ExecuteProcess processor or the ExecuteSQL processor; to be honest, I don't know which.
I am not sure how to achieve this in NiFi with BigQuery stored procedures. Could anyone help me with this?
Thanks in advance!!
Update: added a new error, in case someone can help

Option 1: ExecuteProcess
The closest thing to "running it manually" is installing the Google Cloud SDK and executing this within ExecuteProcess:
bq query 'CALL STORED_PROCEDURE(ARGS)'
or
bq query 'SELECT STORED_PROCEDURE(ARGS)'
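If you go the ExecuteProcess route, the processor configuration might look roughly like this. Note that CALL is a Standard SQL construct, so legacy SQL has to be disabled; the project, dataset, and procedure names below are placeholders, not values from the question:

```
Command:           bq
Command Arguments: query --use_legacy_sql=false 'CALL `my-project.my_dataset.my_procedure`("gs://my-bucket/my-object")'
```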
Option 2: ExecuteSQL
If you want to use ExecuteSQL in NiFi to call the stored procedure, you'll need the BigQuery JDBC driver.
Both the SELECT and CALL forms will work with BigQuery.
Which option is better?
I believe ExecuteSQL is easier than ExecuteProcess.
Why? Because you would need to install the Google Cloud SDK on every system that might run ExecuteProcess, and you would have to pass the Google Cloud credentials to each of them.
That means sharing the job is not easy.
Plus, this might require administrator rights on all the machines.
In the ExecuteSQL case you'll need to:
1 - Copy the JDBC driver to the lib directory inside your NiFi installation
2 - Connect to BigQuery using pre-generated access/refresh tokens - see the JDBC Driver for Google BigQuery Install and Configuration Guide - that's OAuth type 2.
The good part is that when you export the flow, the credentials are embedded in it: no need to mess with credentials.json files, etc. (this could also be bad from a security standpoint).
Distributing JDBC JARs is easier than installing the Google Cloud SDK: just drop a file into the lib folder. If you need it on more than one node, you can scp/sftp it, or distribute it with Ambari.
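As a sketch, the DBCPConnectionPool controller service for the Simba BigQuery driver would be configured along these lines. The exact class name, JAR name, and URL parameters vary by driver version, and the project name and token values here are placeholders - check the driver's install guide for your version:

```
Database Connection URL:     jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=my-project;OAuthType=2;OAuthClientId=<client-id>;OAuthClientSecret=<client-secret>;OAuthRefreshToken=<refresh-token>
Database Driver Class Name:  com.simba.googlebigquery.jdbc42.Driver
Database Driver Location(s): /opt/nifi/lib/GoogleBigQueryJDBC42.jar
```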

Related

Data streaming from Raspberry Pi CSV file to BigQuery table

I have some CSV files generated by a Raspberry Pi that need to be pushed into BigQuery tables.
Currently, we have a Python script that uses bigquery.LoadJobConfig for batch uploads, and I run it manually. The goal is to have streaming data (or loads every 15 minutes) in a simple way.
I explored different solutions:
Using Airflow to run the Python script (high complexity and maintenance)
Dataflow (I am not familiar with it, but if it does the job I will use it)
Scheduling a pipeline to run the script through GitLab CI (cron syntax: */15 * * * *)
Could you please help me and suggest the best way to push CSV files into BigQuery tables in real time or every 15 minutes?
Good news, you have many options! Perhaps the easiest would be to automate the Python script that you currently have, since it does what you need. Assuming you are running it manually on a local machine, you could upload it to a lightweight VM on Google Cloud, then use cron on the VM to automate running it. I used this approach in the past and it worked well.
Another option would be to deploy your Python code to a Google Cloud Function, a way to let GCP run the code without you having to worry about maintaining the backend resource.
Find out more about Cloud Functions here: https://cloud.google.com/functions
A third option, depending on where your .csv files are being generated, perhaps you could use the BigQuery Data Transfer service to handle the imports into BigQuery.
More on that here: https://cloud.google.com/bigquery/docs/dts-introduction
Good luck!
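For the VM-plus-cron approach above, the scheduling side is a one-liner in `crontab -e`. The script path and log file here are hypothetical examples:

```
# run the existing BigQuery load script every 15 minutes, appending output to a log
*/15 * * * * /usr/bin/python3 /home/user/load_to_bigquery.py >> /var/log/bq_load.log 2>&1
```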
Adding to @Ben's answer, you can also use Cloud Composer to orchestrate this workflow. It is built on Apache Airflow, so you can use Airflow-native tools such as the powerful Airflow web interface, command-line tools, and the Airflow scheduler, without having to worry about infrastructure and maintenance.
You can implement a DAG to:
1 - upload the CSV from local storage to GCS, then
2 - load it from GCS to BigQuery using GCSToBigQueryOperator
More on Cloud Composer
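A minimal DAG along those lines might look like the sketch below. It is not runnable outside an Airflow environment with the Google provider installed, and the bucket, table, and file names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="csv_to_bigquery",
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/15 * * * *",  # every 15 minutes
    catchup=False,
) as dag:
    # step 1: push the local CSV to a GCS bucket (placeholder paths)
    upload = LocalFilesystemToGCSOperator(
        task_id="upload_csv",
        src="/data/readings.csv",
        dst="incoming/readings.csv",
        bucket="my-bucket",
    )
    # step 2: load the GCS object into a BigQuery table
    load = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-bucket",
        source_objects=["incoming/readings.csv"],
        destination_project_dataset_table="my_project.my_dataset.readings",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",
    )
    upload >> load
```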

Can I tell Google Cloud SQL to restore my backup to a completely different database?

Since there is a nightly backup of SQL, we are wondering about a good way to restore this backup to a different database in the same MySQL server instance. We have prod_xxxx for all our production databases AND we have staging_xxxx for all our staging databases (yes, not great in that they are all on the same MySQL instance right now).
Anyway, we would love to restore all tables/constraints/etc. and data from prod_incomingdb to staging_incomingdb. Is this possible in Cloud SQL?
Since this is on a production instance, I recommend that you perform a backup before starting, in order to avoid any data loss.
To clone a database within the same instance, there is no direct way to perform the task (this is a missing feature in MySQL).
I followed this path to successfully clone a database within the same MySQL Cloud SQL instance.
1.- Create a dump of the desired database using the Google Cloud Console (web UI) by following these steps
*It is very important to dump only the desired database, in SQL format; please do not select multiple databases for the dump.
After the process finishes, the dump will be available in a Google Cloud Storage bucket.
2.- Download the dump file to a Compute Engine VM or to any local machine with Linux.
3.- Replace the database name (the old one) in the USE clauses.
I used this sed command on my downloaded dump to change the database name:
sed -i 's/USE `employees`;/USE `emp2`;/g' employees.sql
*This can take a few seconds depending on the size of your file.
4.- Upload the updated file to the Cloud storage bucket.
5.- Create a new empty database on your Cloud SQL instance; in this case my target database is called emp2.
6.- Import the modified dump by following these steps
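The steps above can also be sketched with the gcloud and gsutil CLIs instead of the console. The instance, bucket, and database names below are placeholders matching the question's prod/staging naming, and the service account on the instance needs write access to the bucket:

```
# 1. export only the source database to a GCS bucket
gcloud sql export sql prod-instance gs://my-bucket/prod_incomingdb.sql \
    --database=prod_incomingdb

# 2./3. download the dump and rename the database in the USE clauses
gsutil cp gs://my-bucket/prod_incomingdb.sql .
sed -i 's/USE `prod_incomingdb`;/USE `staging_incomingdb`;/g' prod_incomingdb.sql

# 4. upload the edited dump
gsutil cp prod_incomingdb.sql gs://my-bucket/staging_incomingdb.sql

# 5./6. create the empty target database and import into it
gcloud sql databases create staging_incomingdb --instance=prod-instance
gcloud sql import sql prod-instance gs://my-bucket/staging_incomingdb.sql \
    --database=staging_incomingdb
```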
I could not figure out the nightly backups, as a restore seems to apply to an entire instance, so I think the answer to the above is no. I did find out that I can export and then import (not exactly what I wanted, since I didn't want to be exporting our DB during the day, but for now we may go with that and automate a nightly export later).

How to connect pgBadger to Google Cloud SQL

I have a database on a Google Cloud SQL instance. I want to connect the database to pgBadger, which is used to analyse queries. I have tried various methods, but they all ask for the log file location.
I believe there are two major limitations preventing an easy setup that would allow you to use pgBadger with logs generated by a Cloud SQL instance.
The first is the fact that Cloud SQL logs are processed by Stackdriver and can only be accessed through it. It is actually possible to export logs from Stackdriver; however, the resulting format and destination will still not meet the requirements for using pgBadger, which leads to the second major limitation.
Cloud SQL does not allow changes to all the required configuration directives. The major one is log_line_prefix, which currently does not follow the required format, and it is not possible to change it. You can see which flags are supported in Cloud SQL in the Supported flags documentation.
In order to use pgBadger you would need to reformat the log entries, while exporting them to a location where pgBadger could do its job. Stackdriver can stream the logs through Pub/Sub, so you could develop an app to process and store them in the format you need.
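As a minimal sketch of such a processing app, the function below reformats one exported log entry into a line pgBadger can parse with a prefix like '%t [%p]: [%l-1] user=%u,db=%d'. The input field names (timestamp, pid, user, database, severity, message) are assumptions about the exported payload, not the real Stackdriver schema - adjust them to whatever your Pub/Sub subscription actually delivers:

```python
from datetime import datetime, timezone

def to_pgbadger_line(entry: dict) -> str:
    """Reformat a hypothetical exported Cloud SQL log entry into a
    postgres-style stderr line matching the pgBadger prefix
    '%t [%p]: [%l-1] user=%u,db=%d'."""
    ts = datetime.fromisoformat(entry["timestamp"]).astimezone(timezone.utc)
    return (
        f"{ts:%Y-%m-%d %H:%M:%S} UTC "
        f"[{entry.get('pid', 0)}]: [1-1] "
        f"user={entry.get('user', 'unknown')},db={entry.get('database', 'unknown')} "
        f"{entry.get('severity', 'LOG')}:  {entry['message']}"
    )

# example entry with made-up values
sample = {
    "timestamp": "2021-06-01T09:30:00+00:00",
    "pid": 4242,
    "user": "app",
    "database": "orders",
    "severity": "LOG",
    "message": "duration: 12.345 ms  statement: SELECT 1",
}
print(to_pgbadger_line(sample))
# -> 2021-06-01 09:30:00 UTC [4242]: [1-1] user=app,db=orders LOG:  duration: 12.345 ms  statement: SELECT 1
```

A consumer like this could run on the Pub/Sub subscription and append its output to a file that pgBadger then reads on a schedule.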
I hope this helps.

How to schedule a query (Export Data) from Google BigQuery to external storage (e.g. Box)

I read many articles and solutions about scheduling queries to external storage from Google BigQuery, but they didn't seem very clear.
Note: My company has a subscription only to Google BigQuery and not to the complete set of cloud services (Google Cloud Platform).
I know how to do it manually, but I am looking to automate the process since I need the same data every week.
Any suggestions will be appreciated. Thank you.
Option 1
You can use Apache Airflow, which provides the option to create scheduled tasks on top of BigQuery using the BigQuery operator.
You can find the basic steps required to start setting this up in this link.
Option 2
You can use the BigQuery command-line tool to export your data as you do from the web UI, for example:
bq --location=[LOCATION] extract --destination_format [FORMAT] --compression [COMPRESSION_TYPE] --field_delimiter [DELIMITER] --print_header [BOOLEAN] [PROJECT_ID]:[DATASET].[TABLE] gs://[BUCKET]/[FILENAME]
Once you get this working, you can use any scheduling process of your liking to schedule the run of this job.
BTW: Airflow has a connector which enables you to run the command-line tool.
Once the file is in GCS, you can use the Box G Suite integration to see and manage your files.

How to copy huge files (200-500 GB) every day from a Teradata server to HDFS

I have Teradata files on Server A and I need to copy them to Server B into HDFS. What options do I have?
distcp is ruled out because Teradata is not on HDFS.
scp is not feasible for huge files.
Flume and Kafka are meant for streaming, not for file movement. Even if I used Flume with a spooling directory, it would be overkill.
The only option I can think of is NiFi. Does anyone have any suggestions on how I can utilize NiFi?
Or, if someone has already gone through this kind of scenario, what approach was followed?
I haven't specifically worked with Teradata dataflows in NiFi, but having worked with other SQL sources in NiFi, I believe it is possible and pretty straightforward to develop a dataflow that ingests data from Teradata into HDFS.
For starters, you can do a quick check with the ExecuteSQL processor available in NiFi. The SQL-related processors take a DBCPConnectionPool property, which is a NiFi controller service that should be configured with the JDBC URL of your Teradata server, the driver path, and the driver class name. Once you validate that the connection is fine, you can take a look at GenerateTableFetch / QueryDatabaseTable.
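For reference, a DBCPConnectionPool for Teradata is typically configured along these lines; the hostname, database name, and JAR locations are placeholders to adapt to your environment:

```
Database Connection URL:     jdbc:teradata://teradata-host/DATABASE=mydb,CHARSET=UTF8
Database Driver Class Name:  com.teradata.jdbc.TeraDriver
Database Driver Location(s): /opt/nifi/lib/terajdbc4.jar,/opt/nifi/lib/tdgssconfig.jar
```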
Hortonworks has an article which talks about configuring DBCPConnectionPool with a Teradata server : https://community.hortonworks.com/articles/45427/using-teradata-jdbc-connector-in-nifi.html