Run GQL query from Google Cloud Workflows - google-cloud-platform

I have a simple GQL query, SELECT * FROM PacerFeed WHERE pollInterval >= 0, that I want to run in a GCP Workflow with the Firestore connector.
What is the database ID in the parent field? Is there a way I can just provide the whole query rather than the YAML'd fields? If not, what are the correct YAML args for this query?
- getFeeds:
    call: googleapis.firestore.v1.projects.databases.documents.runQuery
    # These args are not correct, just demonstrative.
    args:
      parent: projects/{projectId}/databases/{database_id}/documents
      body:
        structuredQuery:
          from: [PacerFeed]
          select: '*'
          where: pollInterval >= 0
    result: got
PS can someone with more pts add a 'google-cloud-workflows' tag?

As noted in the comments, Jim pointed out what over a week of back and forth with GCP support could not: the database must be in Firestore Native mode.
I suggest not using GCP Workflows at all. The documentation sucks, there is no schema, and the only thing GCP Support can do is tell you to hire an MSP to fill the doc gaps and point you to an example ... that does not include the parent field.
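For reference, here is a rough Python sketch that runs the same query against the Firestore REST endpoint the connector wraps. It assumes a Native-mode database with the default database ID (default) and a collection named PacerFeed; the body shape mirrors the Firestore v1 StructuredQuery schema, so treat it as a starting point rather than the exact connector args.

# Rough sketch: call Firestore's runQuery REST endpoint directly to see the
# parent path and body shape the Workflows connector expects. Assumes a
# Native-mode database with the default ID "(default)" and a collection
# named PacerFeed; adjust the project, scopes and names to your setup.
import json

import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/datastore"])
session = AuthorizedSession(credentials)

parent = "projects/{}/databases/(default)/documents".format(project_id)
body = {
    "structuredQuery": {
        "from": [{"collectionId": "PacerFeed"}],
        # omitting "select" returns all fields, i.e. the GQL "*"
        "where": {
            "fieldFilter": {
                "field": {"fieldPath": "pollInterval"},
                "op": "GREATER_THAN_OR_EQUAL",
                "value": {"integerValue": "0"},
            }
        },
    }
}

resp = session.post(
    "https://firestore.googleapis.com/v1/{}:runQuery".format(parent), json=body)
print(json.dumps(resp.json(), indent=2))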

Related

Google Cloud Logging filter to remove duplicates

Due to the distributed nature of my system, I have duplicates in my Google Cloud Logging logs.
03:34pm : id: 2094380, message 1
03:34pm : id: 2094380, message 1
03:35pm : id: 5922284, message 2
03:35pm : id: 5922284, message 2
My end goal is to create a graph based on a count of my events (using a log-based metric).
Is there a way to filter my logs in Google Cloud Logging so that only the first occurrence of each line is kept?
As koblan and guillaume blaquiere suggested, you can store the generated logs in a BigQuery table (follow this doc to export your logs to BigQuery) and use DISTINCT to get de-duplicated results. Then, as guillaume blaquiere said, query them in Log Analytics or use Looker Studio for visualizing your log data.
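For example, once a sink exports the logs to BigQuery, a de-duplicating query can be run with the Python client. This is a minimal sketch: the table name my_logs.stdout and the jsonPayload.message field are placeholders, adjust them to your sink's schema.

# Minimal sketch, assuming logs are exported to a BigQuery table via a sink;
# the table name and the jsonPayload.message field are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT DISTINCT jsonPayload.message
    FROM `my_logs.stdout`
"""
for row in client.query(query).result():
    print(row.message)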

What will be the query to check completion of a workflow?

I have to check the status of a workflow, whether that workflow completed within the scheduled time or not, using a SQL query. I also need to send an email with the workflow status, like 'completed within time' or 'not completed within time'. So, please help me out.
You can do it using either option 1 or option 2.
Option 1: You need access to the repository metadata database.
Create a post-session shell script. You can pass the workflow name and a benchmark value to the shell script.
Get the workflow run time from the repository metadata database.
SQL you can use -
SELECT WORKFLOW_NAME,(END_TIME-START_TIME)*24*60*60 diff_seconds
FROM
REP_WFLOW_RUN
WHERE WORKFLOW_NAME='myWorkflow'
You can then compare the above value with the benchmark value. The shell script can send a mail depending on the outcome.
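A minimal Python sketch of that last step, assuming the diff_seconds value from the SQL above has already been fetched by your script; the SMTP host and addresses are placeholders:

# Compare the run time against the benchmark and send a status mail.
# diff_seconds comes from the REP_WFLOW_RUN query above; the SMTP server
# and addresses are placeholders.
import smtplib
from email.message import EmailMessage

def notify(workflow_name, diff_seconds, benchmark_seconds):
    status = ("completed within time"
              if diff_seconds <= benchmark_seconds
              else "not completed within time")
    msg = EmailMessage()
    msg["Subject"] = "{}: {}".format(workflow_name, status)
    msg["From"] = "etl-monitor@example.com"
    msg["To"] = "team@example.com"
    msg.set_content("{} ran for {}s (benchmark {}s): {}".format(
        workflow_name, diff_seconds, benchmark_seconds, status))
    with smtplib.SMTP("smtp.example.com") as server:
        server.send_message(msg)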
Option 2: You need to create another workflow to check this workflow.
If you do not have access to the metadata database, follow the steps above except the metadata SQL.
Use pmcmd GetWorkflowDetails to check the status, start time, and end time of a workflow.
pmcmd GetWorkflowDetails -sv service -d domain -f folder myWorkflow
You can then grep the start and end time from there and compare them with the benchmark values. The problem is the output format, so you need a little bit of scripting here.
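A rough sketch of that scripting in Python, with the caveat that the exact labels in the GetWorkflowDetails output vary by version, so the parsing needs to be adapted to what your installation prints:

# Run pmcmd, pull out the start/end times and hand them to the comparison
# logic from the earlier sketch. The "Start time"/"End time" labels are an
# assumption; check your pmcmd output and adjust the regexes.
import re
import subprocess

out = subprocess.run(
    ["pmcmd", "GetWorkflowDetails", "-sv", "service", "-d", "domain",
     "-f", "folder", "myWorkflow"],
    capture_output=True, text=True, check=True).stdout

start = re.search(r"Start time:\s*(.+)", out, re.IGNORECASE)
end = re.search(r"End time:\s*(.+)", out, re.IGNORECASE)
print("start:", start.group(1).strip() if start else "not found")
print("end:", end.group(1).strip() if end else "not found")
# parse the two timestamps (the format depends on your locale settings),
# compute end - start in seconds and feed it to notify() above.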

GCP Dataflow UDF input

I'm trying to bulk-delete Datastore entities with Dataflow and to use a JS UDF to filter entities, per the doc. But this code:
function func(inJson) {
  var row = JSON.parse(inJson);
  var currentDate = new Date();
  var date = row.modifiedAt.split(' ')[0];
  return some code
}
causes
TypeError: Cannot read property "split" from undefined
The input should be a JSON string of the entity, and the entities should have a modifiedAt property.
What exactly does Dataflow pass to the UDF, and how can I log it in the Dataflow console?
Assuming that modifiedAt is a property you added, I would expect the JSON in Dataflow to match the Datastore REST API (https://cloud.google.com/datastore/docs/reference/data/rest/v1/Entity), which would mean you probably want row.properties.modifiedAt. You also probably want to pull out the timestampValue of the property (https://cloud.google.com/datastore/docs/reference/data/rest/v1/projects/runQuery#Value).
Take a look at this newly published Google Cloud blog on how to get started with Dataflow UDFs.
With the Datastore Bulk Delete Dataflow template, the UDF input is the JSON string of the Datastore Entity. As noted by Jim, the properties are nested under properties field per this schema.
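To make the shape concrete, here is an illustrative Python sketch of the entity JSON the UDF receives and the nested access the answers describe; the kind, key and timestamp values are made up.

# Illustrative only: a made-up entity in the Datastore REST shape, and the
# equivalent of row.properties.modifiedAt.timestampValue from the JS UDF.
import json

entity_json = json.dumps({
    "key": {"partitionId": {"projectId": "my-project"},
            "path": [{"kind": "MyKind", "id": "5629499534213120"}]},
    "properties": {
        "modifiedAt": {"timestampValue": "2021-01-15T10:30:00Z"}
    },
})

row = json.loads(entity_json)
modified_at = row["properties"]["modifiedAt"]["timestampValue"]
print(modified_at.split("T")[0])  # just the date part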
With respect to your other question about logging, you can add in your UDF the following print statement whose output will show up in Dataflow console under Worker Logs:
print(JSON.stringify(row, null, 2))
Alternatively, you can view the logs in Cloud Logging using the following log query:
log_id("dataflow.googleapis.com/worker")
resource.type="dataflow_step"

How to schedule an export from a BigQuery table to Cloud Storage?

I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfers into BigQuery or Cloud Storage, but I haven't found anything regarding scheduling an export from a BigQuery table to Cloud Storage yet.
Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?
There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.
The Cloud Function would contain the necessary code to export to Cloud Storage from the BigQuery table. There are multiple programming languages to choose from for that, such as Python, Node.js, and Go.
Cloud Scheduler would periodically send an HTTP call, on a cron schedule, to the Cloud Function, which would in turn get triggered and run the export programmatically.
As an example and more specifically, you can follow these steps:
Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:
# Imports the BigQuery client library
from google.cloud import bigquery

def hello_world(request):
    # Replace these values according to your project
    project_name = "YOUR_PROJECT_ID"
    bucket_name = "YOUR_BUCKET"
    dataset_name = "YOUR_DATASET"
    table_name = "YOUR_TABLE"

    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )

    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)
Specify the client library dependency in the requirements.txt file by adding this line:
google-cloud-bigquery
Create a Cloud Scheduler job. Set the Frequency you wish for the job to be executed with. For instance, setting it to 0 1 * * 0 would run the job once a week at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.
Choose HTTP as the Target, set the URL as the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and as HTTP method choose GET.
Once created, and by pressing the RUN NOW button, you can test how the export behaves. However, before doing so, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role, or otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.
If you wish to execute exports on different tables, datasets and buckets for each execution, but essentially employing the same Cloud Function, you can use the HTTP POST method instead and configure a Body containing said parameters as data, which would be passed on to the Cloud Function, although that would imply making some small changes in its code.
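As a rough sketch of those changes, assuming the Scheduler job's Body is a JSON object such as {"project": ..., "bucket": ..., "dataset": ..., "table": ...} (the key names are up to you):

# Variant of the function above that reads its targets from the POST body
# instead of hard-coded values. The expected JSON keys are an assumption.
from google.cloud import bigquery

def export_table(request):
    params = request.get_json(silent=True) or {}
    bq_client = bigquery.Client(project=params["project"])
    table_to_export = bq_client.dataset(params["dataset"]).table(params["table"])
    destination_uri = "gs://{}/{}_export.csv.gz".format(
        params["bucket"], params["table"])
    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP
    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        location="US",  # must match the source table's location
        job_config=job_config,
    )
    return "Job {} exporting to {}".format(extract_job.job_id, destination_uri)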
Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
Not sure if this was in GA when this question was asked, but at least now there is an option to run an export to Cloud Storage via a regular SQL query. See the SQL tab in Exporting table data.
Example:
EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/folder/*.csv',
    format = 'CSV',
    overwrite = true,
    header = true,
    field_delimiter = ';')
AS (
  SELECT field1, field2
  FROM mydataset.table1
  ORDER BY field1
);
This can also be trivially set up via a Scheduled Query if you need a periodic export. And, of course, you need to make sure the user or service account running this has permissions to read the source datasets and tables and to write to the destination bucket.
Hopefully this is useful for other peeps visiting this question if not for OP :)
You have an alternative to the second part of Maxim's answer. The code for extracting the table and storing it in Cloud Storage should work.
But when you schedule a query, you can also define a Pub/Sub topic where the BigQuery scheduler will post a message when the job is over. Thereby, the Cloud Scheduler setup described by Maxim is optional, and you can simply plug the function into the Pub/Sub notification.
Before performing the extraction, don't forget to check the error status of the Pub/Sub notification. You also get a lot of information about the scheduled query, which is useful if you want to perform more checks or if you want to generalize the function.
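A minimal sketch of such a Pub/Sub-triggered function, assuming the message data is the transfer run JSON with state and errorStatus fields (check the actual payload in your project) and that run_export() is a hypothetical wrapper around the extract code from Maxim's answer:

# Background Cloud Function triggered by the scheduled query's Pub/Sub topic.
# The state/errorStatus fields are assumptions about the notification payload;
# run_export() is a hypothetical wrapper around the BigQuery -> GCS extract.
import base64
import json

def on_scheduled_query_done(event, context):
    run = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if run.get("errorStatus") or run.get("state") != "SUCCEEDED":
        print("Scheduled query did not succeed: {}".format(run))
        return
    run_export()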
So, another point about the SFTP transfer. I open-sourced a project for querying BigQuery, building a CSV file, and transferring this file to an FTP server (sFTP and FTPs aren't supported, because my previous company only used the FTP protocol!). If your file is smaller than 1.5 GB, I can update my project to add SFTP support if you want to use this. Let me know.

Get all my scheduled SQL queries in BigQuery Google Cloud

I'm trying to get the SQL code of my scheduled queries in BigQuery via the command line (CLI). I'm also interested in whether there is a way to do that in the Google Cloud Platform user interface.
I have taken a quick look at this related post, but that's not the answer that I am looking for.
List Scheduled Queries in BigQuery
Thank you in advance for all your answers.
I found out how to query the scheduled queries with the bq CLI. You have to rely on the BigQuery Transfer API. Why? I don't know, but it's the right keyword here.
To list all your scheduled queries, run this (change the location if you want!):
bq ls --transfer_config --transfer_location=eu
# Result
name displayName dataSourceId state
--------------------------------------------------------------------------------------------- ------------- ----------------- -------
projects/763366003587/locations/europe/transferConfigs/5de1fc66-0000-20f2-bee7-089e082935bc test scheduled_query
To view the details, copy the name and use bq show:
bq show --transfer_config \
projects/763366003587/locations/europe/transferConfigs/5de1fc66-0000-20f2-bee7-089e082935bc
# Result
updateTime: 2019-11-18T20:20:22.279237Z
destinationDatasetId: bi_data
displayName: test
schedule: every day 20:19
datasetRegion: europe
userId: -7444165337568771239
scheduleOptions: {u'endTime': u'2019-11-18T21:19:36.528Z', u'startTime': u'2019-11-18T20:19:36.497Z'}
dataSourceId: scheduled_query
params: {u'query': u'SELECT * FROM `gbl-imt-homerider-basguillaueb.bi_data.device_states`', u'write_disposition': u'WRITE_TRUNCATE', u'destination_table_name_template': u'test_schedule'}
You can use the JSON format and jq to get only the query, like this:
bq show --format="json" --transfer_config \
projects/763366003587/locations/europe/transferConfigs/5de1fc66-0000-20f2-bee7-089e082935bc \
| jq '.params.query'
# Result
"SELECT * FROM `gbl-imt-homerider-basguillaueb.bi_data.device_states`"
I can explain how I found this unexpected solution if you want, but it's not the topic here. I don't think it's documented.
In the GUI, it's easier:
Go to BigQuery (new UI, in blue)
Click on Scheduled queries in the left menu
Click on your scheduled query's name
Click on Configuration at the top of the screen
To get your scheduled queries (= datatransfers), you can also use the python API:
from google.cloud import bigquery_datatransfer

bq_datatransfer_client = bigquery_datatransfer.DataTransferServiceClient()

request_datatransfers = bigquery_datatransfer.ListTransferConfigsRequest(
    # if US, you can just do parent='projects/YOUR_PROJECT_ID'
    parent='projects/YOUR_PROJECT_ID/locations/EU',
)

# this method will also deal with pagination
response_datatransfers = bq_datatransfer_client.list_transfer_configs(
    request=request_datatransfers)

# to convert the response to a list of scheduled queries
datatransfers = list(response_datatransfers)
To get the actual query text from the scheduled query:
for datatransfer in datatransfers:
    print(datatransfer.display_name)
    print(datatransfer.params.get('query'))
    print('\n')
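Note that list_transfer_configs returns every transfer config in the project (Cloud Storage or SaaS transfers included), so if you only want the scheduled queries you can filter on the data source id, which is 'scheduled_query' as seen in the bq ls output above:

# keep only transfer configs that are scheduled queries
scheduled_queries = [
    dt for dt in datatransfers if dt.data_source_id == "scheduled_query"
]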
See also these SO questions:
How do I list my scheduled queries via the Python google client API?
List Scheduled Queries in BigQuery
Docs on this specific part of the python API:
https://cloud.google.com/python/docs/reference/bigquerydatatransfer/latest/google.cloud.bigquery_datatransfer_v1.services.data_transfer_service.DataTransferServiceClient#google_cloud_bigquery_datatransfer_v1_services_data_transfer_service_DataTransferServiceClient_list_transfer_configs