Is it now possible to define Python UDFs in BigQuery? If not, is it on the roadmap for the near future?
The last ticket saying that only JavaScript can be used is from 2018 [1].
[1] BigQuery UDF in Python or only in JavaScript
No, it's not possible at the moment; only SQL and JavaScript are natively supported for UDFs in BigQuery.
However, a recently added feature, Remote Functions, lets you back a UDF with a Cloud Function or Cloud Run service.
This gives you more flexibility and lets you write the UDF in your preferred language.
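On the Cloud Function side, the contract is simple: BigQuery POSTs a JSON body containing a "calls" array (one entry per row, each entry being that row's argument list) and expects back a JSON object with a "replies" array of the same length. A minimal sketch in Python (the function name and the string-reversing logic are illustrative only):

import functions_framework


@functions_framework.http
def reverse_string(request):
    # BigQuery sends {"calls": [[arg1, ...], ...]}; return one reply per call.
    payload = request.get_json(silent=True) or {}
    calls = payload.get("calls", [])
    replies = [
        args[0][::-1] if args and args[0] is not None else None
        for args in calls
    ]
    return {"replies": replies}

On the BigQuery side you then declare the UDF with CREATE FUNCTION ... REMOTE WITH CONNECTION ... OPTIONS (endpoint = '<function URL>') and call it from SQL like any other function.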
I have some CSV files generated by a Raspberry Pi that need to be pushed into BigQuery tables.
Currently, we have a Python script using bigquery.LoadJobConfig for batch upload, and I run it manually. The goal is to have streaming data (or a load every 15 minutes) in a simple way.
I explored different solutions:
Using Airflow to run the Python script (high complexity and maintenance)
Dataflow (I am not familiar with it but if it does the job I will use it)
A scheduled pipeline to run the script through GitLab CI (cron syntax: */15 * * * *)
Could you please suggest the best way to push CSV files into BigQuery tables in real time or every 15 minutes?
Good news, you have many options! Perhaps the easiest would be to automate the Python script that you currently have, since it does what you need. Assuming you are running it manually on a local machine, you could upload it to a lightweight VM on Google Cloud, then use cron on the VM to automate running it. I have used this approach in the past and it worked well.
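For reference, here is a minimal sketch (Python) of what such a cron-driven batch load could look like; the table ID and file path are assumptions, and your existing bigquery.LoadJobConfig script may already look much like this:

import glob

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.sensor_readings"  # hypothetical table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # infer the schema from the file
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

for path in glob.glob("/home/pi/data/*.csv"):  # hypothetical local path
    with open(path, "rb") as f:
        load_job = client.load_table_from_file(f, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to finish
    print(f"Loaded {path} into {table_id}")

A crontab entry such as */15 * * * * python3 /opt/loader/load_csv.py (the path is hypothetical) would then run it every 15 minutes.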
Another option would be to deploy your Python code to a Google Cloud Function, a way to let GCP run the code without you having to worry about maintaining the backend resource.
Find out more about Cloud Functions here: https://cloud.google.com/functions
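If your Raspberry Pi can upload each CSV to a Cloud Storage bucket first (for example with gsutil), the function can be triggered on every new object and load just that file, making the pipeline event-driven rather than scheduled. A hedged sketch, with the bucket event fields and table name as placeholders:

import functions_framework
from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.sensor_readings"  # hypothetical table


@functions_framework.cloud_event
def load_csv_to_bq(cloud_event):
    # Triggered by an "object finalized" event on the staging bucket.
    data = cloud_event.data
    uri = f"gs://{data['bucket']}/{data['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_uri(uri, TABLE_ID, job_config=job_config).result()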
A third option, depending on where your .csv files are being generated, would be to use the BigQuery Data Transfer Service to handle the imports into BigQuery.
More on that here: https://cloud.google.com/bigquery/docs/dts-introduction
Good luck!
Adding to @Ben's answer, you can also use Cloud Composer to orchestrate this workflow. It is built on Apache Airflow, so you can use Airflow-native tools, such as the powerful Airflow web interface, command-line tools, and the Airflow scheduler, without worrying about your infrastructure and maintenance.
You can implement a DAG to:
upload the CSV from local storage to GCS, then
load from GCS into BigQuery using GCSToBigQueryOperator (a rough sketch follows below)
More on Cloud Composer
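A rough sketch of such a DAG (the bucket, dataset, file names, and schedule are placeholders, and the local path must be reachable from the Airflow workers):

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="csv_to_bigquery",
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/15 * * * *",  # every 15 minutes
    catchup=False,
) as dag:
    upload_csv = LocalFilesystemToGCSOperator(
        task_id="upload_csv_to_gcs",
        src="/data/readings.csv",        # hypothetical local path
        dst="incoming/readings.csv",
        bucket="my-staging-bucket",      # hypothetical bucket
    )

    load_to_bq = GCSToBigQueryOperator(
        task_id="gcs_to_bigquery",
        bucket="my-staging-bucket",
        source_objects=["incoming/readings.csv"],
        destination_project_dataset_table="my_dataset.sensor_readings",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )

    upload_csv >> load_to_bq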
I have a question related to Google Cloud Storage for a Julia application.
Currently, I am hosting a Julia application (Docker container) on GCP and would like to allow the app to use Cloud Storage buckets to write and read data.
I have explored a few packages which promise to do this.
GoogleCloud.jl
This package's docs show a clear and concise implementation. However, adding this package results in incremental compilation warnings, with many of the packages failing to compile. I have opened an issue on their GitHub page: https://github.com/JuliaCloud/GoogleCloud.jl/issues/41
GCP.jl
The scope is limited; currently the only support is for BigQuery.
Python package google
This is quite informative and operational, but it will take a toll on the code's performance. Please do advise if this is the only viable option.
I would like to know: are there other methods which can be used to configure a Julia app to work with Google Cloud Storage?
Thanks, I look forward to the suggestions!
GCP.jl is promising, plus you may be able to do this with gRPC if Julia supports gRPC (see below).
Discovery
Google has two types of SDK (aka Client Library). API Client Libraries are available for all of Google's APIs|services.
Cloud Client Libraries are newer and more language-idiomatic, but are only available for Cloud. Google Cloud Storage (GCS) is part of Cloud but, in this case, I think an API Client Library is worth pursuing...
Google's API (!) Client Libraries are auto-generated from a so-called Discovery document. Interestingly, GCP.jl specifically describes using Discovery to generate the BigQuery SDK and mentions that you can use the same mechanism for any other API Client Library (i.e. GCS).
NOTE: Explanation of Google Discovery
I'm unfamiliar with Julia, but if you can understand enough of that repo to confirm that it's using the Discovery document to generate APIs, and if you can work out how to reconfigure it for GCS, this approach would provide you with a 100%-fidelity SDK for Cloud Storage (and any other Google API|service).
Someone else tried to use the code to generate an SDK for Sheets and had an issue, so it may not be perfect.
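I can't offer a Julia example, but to show what a Discovery-generated SDK gives you, here is the Python API Client Library (google-api-python-client) talking to GCS; it is built at runtime from the same Discovery document GCP.jl would consume, and the bucket name is a placeholder:

from googleapiclient.discovery import build

# build() fetches the Discovery document for the Storage JSON API (v1)
# and generates the client methods from it at runtime.
storage = build("storage", "v1")

# List the objects in a bucket, using the pagination helpers the
# generated client provides (bucket name is a placeholder).
request = storage.objects().list(bucket="my-example-bucket")
while request is not None:
    response = request.execute()
    for obj in response.get("items", []):
        print(obj["name"], obj["size"])
    request = storage.objects().list_next(request, response)

A Discovery-based Julia client would expose the same resources (buckets, objects) and methods (list, get, insert), since they all come from the same document.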
gRPC
Google publishes Protobufs for the subset of its services that support gRPC. If you'd prefer to use gRPC, it ought to be possible to use the Protobufs in Google's repo to define a gRPC client for Cloud Storage.
I have a BigQuery stored procedure which runs on some GCS objects and does its magic. The procedure works perfectly when run manually, but I want to call it from NiFi. I have worked with HANA and know that I need a JDBC driver to connect and perform queries.
I could use either the ExecuteProcess processor or the ExecuteSQL processor; to be honest, I don't know which.
I am not sure how to achieve this in NiFi with BigQuery stored procedures. Could anyone help me with this?
Thanks in advance!!
Updated with a new error, in case someone can help.
Option 1: ExecuteProcess
The closest thing to "executing manually" is installing the Google Cloud SDK and running this within ExecuteProcess:
bq query --use_legacy_sql=false 'CALL STORED_PROCEDURE(ARGS)'
or
bq query --use_legacy_sql=false 'SELECT STORED_PROCEDURE(ARGS)'
Option 2: ExecuteSQL
If you want to use ExecuteSQL with NiFi to call the stored procedure, you'll need the BigQuery JDBC driver.
Both 'select' and 'call' methods will work with BigQuery.
Which option is better?
I believe ExecuteSQL is easier than ExecuteProcess.
Why? Because you would need to install the Google Cloud SDK on all systems that might run ExecuteProcess, and you must pass the Google Cloud credentials to them.
That means sharing the job is not easy.
Plus, this might involve administrator rights on all the machines.
In the ExecuteSQL case you'll need to:
1 - Copy the JDBC driver to the lib directory inside your NiFi installation
2 - Connect to BigQuery using pre-generated access/refresh tokens - see the JDBC Driver for Google BigQuery Install and Configuration guide - that's OAuth type 2.
The good part is that when you export the flow, the credentials are embedded in it: no need to mess with credentials.json files etc. (this could also be bad from a security standpoint).
Distributing JDBC jars is easier than installing the Google Cloud SDK: just drop a file into the lib folder. If you need it on more than one node, you can scp/sftp it, or distribute it with Ambari.
I am a new learner of Informatica Cloud Data Integration. Currently, I am trying to convert an SSIS ETL to Informatica.
During the conversion, at one point I need to call a SQL Server stored procedure inside Informatica Data Integration, which mainly updates some data in tables. I have tried many things but without success.
Does anyone have any idea how we can call a SQL Server stored procedure using Informatica Cloud Data Integration?
Please use the SQL transformation.
Refer to the following link: network.informatica.com/videos/1213
Accept this as the right answer if it helped you; this will help people in the future.
You need to use a pre- or post-processing script and call it from the native system as part of an integration. If this is for Application Integration, then call it as a command (and enable commands).
HTH
Scott S Nelson
I am used to Google Cloud SQL, where you can connect to a database from outside GAE. Is something like this possible for the GAE Datastore, ideally using the Python NDB interface?
Basically, my use-case is I want to run acceptance tests that pre-populate and clean a datastore.
It looks like the current options are a JSON API or protocol buffers -- in beta. If so, it's kind of a pain that I can't use my NDB models to populate the data, but have to reimplement them for the tests, and worry that they haven't been saved to the datastore in exactly the same way as they would be through the application.
Just checking I'm not missing something....
PS. Yes, I know about remote_api_shell; I don't want a shell, though. I guess piping commands into it is one way, but ugghh...
Cloud Datastore can be accessed via client libraries outside of App Engine. They run on the "v1 API" which just went GA (August 16, 2016) after a few years in Beta.
The Client Libraries are available for Python, Java, Go, Node.js, Ruby, and there is even .NET.
As a note, the GQL language variant supported in DB/NDB is a bit different from what the Cloud Datastore service itself supports via the v1 API. The NDB client library does some of its own custom parsing and can split certain queries into multiple ones to send to the service, combining the results client-side.
Take a read of our GQL reference docs.
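For illustration, a minimal Python sketch of pre-populating and cleaning test data with the google-cloud-datastore client library from outside App Engine; the project, kind, and property names are placeholders, and note that you work with plain entities rather than your NDB models:

from google.cloud import datastore

client = datastore.Client(project="my-project")  # hypothetical project ID

# Pre-populate a test entity.
key = client.key("Account", "acceptance-test-1")
entity = datastore.Entity(key=key)
entity.update({"email": "test@example.com", "active": True})
client.put(entity)

# ... run the acceptance tests ...

# Clean up everything written under that kind.
query = client.query(kind="Account")
query.keys_only()
client.delete_multi([e.key for e in query.fetch()])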
Short answer: they're working on it. Details in google-cloud-datastore#2 and gcloud-python#40.