GCP Dataflow UDF input - google-cloud-platform

I'm trying to bulk-delete Datastore entities with Dataflow and to use a JS UDF to filter the entities, as described in the docs. But this code:
function func(inJson) {
  var row = JSON.parse(inJson);
  var currentDate = new Date();
  var date = row.modifiedAt.split(' ')[0];
  return some code
}
causes
TypeError: Cannot read property "split" from undefined
The input should be a JSON string of the entity, and the entities should have a modifiedAt property.
What exactly does Dataflow pass to the UDF, and how can I log it in the Dataflow console?

Assuming that modifiedAt is a property you added, I would expect the JSON in Dataflow to match the Datastore REST API (https://cloud.google.com/datastore/docs/reference/data/rest/v1/Entity). That would mean you probably want row.properties.modifiedAt. You also probably want to pull out the timestampValue of the property (https://cloud.google.com/datastore/docs/reference/data/rest/v1/projects/runQuery#Value).

Take a look at this newly published Google Cloud blog on how to get started with Dataflow UDFs.
With the Datastore Bulk Delete Dataflow template, the UDF input is the JSON string of the Datastore Entity. As noted by Jim, the properties are nested under the properties field per this schema.
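For example, here is a minimal sketch of such a UDF, assuming modifiedAt was written as a Datastore timestamp (so it surfaces as timestampValue; use stringValue instead if it was stored as a string). The 30-day cutoff and the return convention (returning the JSON string marks the entity for deletion, returning undefined skips it) are illustrative assumptions to check against the template documentation:
function filterByModifiedAt(inJson) {
  var row = JSON.parse(inJson);
  // Entity properties are nested under "properties", and each value is
  // wrapped in a typed field such as timestampValue or stringValue.
  var modifiedAt = row.properties.modifiedAt.timestampValue;
  var cutoff = new Date();
  cutoff.setDate(cutoff.getDate() - 30); // hypothetical retention window
  if (new Date(modifiedAt) < cutoff) {
    return inJson;   // assumed: returned entities are deleted
  }
  return undefined;  // assumed: skipped entities are kept
}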
With respect to your other question about logging, you can add the following print statement in your UDF; its output will show up in the Dataflow console under Worker Logs:
print(JSON.stringify(row, null, 2))
Alternatively, you can view the logs in Cloud Logging using the following log query:
log_id("dataflow.googleapis.com/worker")
resource.type="dataflow_step"

Related

Run GQL query from Google Cloud Workflows

I have a simple GQL query, SELECT * FROM PacerFeed WHERE pollInterval >= 0, that I want to run in a GCP Workflow with the Firestore connector.
What is the database ID in the parent field? Is there a way I can just provide the whole query rather than the YAML'd fields? If not, what are the correct YAML args for this query?
- getFeeds:
    call: googleapis.firestore.v1.projects.databases.documents.runQuery
    # These args are not correct, just demonstrative.
    args:
      parent: projects/{projectId}/databases/{database_id}/documents
      body:
        structuredQuery:
          from: [PacerFeed]
          select: '*'
          where: pollInterval >= 0
    result: got
PS can someone with more pts add a 'google-cloud-workflows' tag?
As in the comments, Jim pointed out what over a week of back and forth with GCP support could not: the database must be in Firestore Native mode.
I suggest not using GCP Workflows at all. The documentation sucks, there is no schema, and the only thing GCP Support can do is tell you to hire an MSP to fill the doc gaps and point you to an example ... that does not include the parent field.

How to export firestore data collection by collection field value

I am trying to export Google Firestore data to a Cloud Storage location using the below command.
gcloud firestore export gs://<bucketName>/<folderName> --collection-ids=<collectionName>
Furthermore, I would like to filter the collection by a field value or date range; is it possible to do so? A Cloud Function solution is also welcome.
This is my JS solution:
console.log('data export bucket', bucket);
const databaseName = client.databasePath(
  process.env.GCLOUD_PROJECT,
  'project_name'
);
const response = await client.exportDocuments({
  name: databaseName,
  outputUriPrefix: bucket,
  // Leave collectionIds empty to export all collections
  // or define a list of collection IDs:
  // collectionIds: ['users', 'posts']
  collectionIds: tables,
});
console.log('successfully exportToGCS', response);
At the time of writing, there is no such possibility within the Firestore export feature. These options come to my mind:
1) Since exports and imports are charged at the same rate as read/write operations, I would suggest running a query in Firestore instead of using the export functionality. You can run such a query with a regular Cloud Function, as you are already proposing. Refer to this guide for getting multiple documents from a collection.
2) In case you really need the Firestore/Datastore export format, there's an option to copy all filtered data to a new collection and export from there, but you'll probably pay far more than you'd want to.
Depending on what you want to do with the data next, you can create a Cloud Function that will listen for changes and copy your filtered data to a new collection, from which you can then export (a sketch of this approach follows below).
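As a rough sketch of the second option, under stated assumptions (the collection name orders, the field createdAt, and the staging collection orders_filtered are placeholders), a small Node.js script using the Firestore client library could copy the filtered documents into a staging collection that you then export with the gcloud command above:
// Copy documents matching a date filter into a staging collection, which can
// then be exported with: gcloud firestore export --collection-ids=orders_filtered
const {Firestore} = require('@google-cloud/firestore');

const db = new Firestore();

async function copyFiltered(cutoffDate) {
  const snapshot = await db.collection('orders')
    .where('createdAt', '>=', cutoffDate)
    .get();
  const writes = snapshot.docs.map(doc =>
    db.collection('orders_filtered').doc(doc.id).set(doc.data())
  );
  await Promise.all(writes);
  console.log(`Copied ${writes.length} documents`);
}

copyFiltered(new Date('2020-01-01'));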

How to get a metric sample from the Monitoring API

I took a very careful look at the Monitoring API. As far as I have read, it is possible to use gcloud to create monitoring policies and edit them (using the Alerting API).
Nevertheless, on one hand it seems gcloud is only able to create and edit policies, not to read the results from such policies. On this page I read these options:
Creating new policies
Deleting existing policies
Retrieving specific policies
Retrieving all policies
Modifying existing policies
On the other hand, I read about the result of a failed request:
Summary of the result of a failed request to write data to a time series.
So it rings a bell that I can indeed get a list of results, like all failed write requests during some period. But how?
My straight question is: can I somehow either listen to alert events or get a list of alert results through Monitoring API v3?
I see tag_firestore_instance is somehow related to Firestore, but how do I use it and what information can I search for? I can't find anywhere how to use it. Maybe as a plain GET (e.g. Postman/curl) or from the gcloud shell.
PS: This question was originally posted in a Google Group, but I was encouraged to ask here.
*** Edited after Alex's suggestion
I have an Angular page listening to a document from my Firestore database:
export class AppComponent {
  public transfers: Observable<any[]>;
  transferCollectionRef: AngularFirestoreCollection<any>;

  constructor(public auth: AngularFireAuth, public db: AngularFirestore) {
    this.listenSingleTransferWithToken();
  }

  async listenSingleTransferWithToken() {
    await this.auth.signInWithCustomToken("eyJ ... CVg");
    this.transferCollectionRef = this.db.collection<any>('transfer', ref => ref.where("id", "==", "1"));
    this.transfers = this.transferCollectionRef.snapshotChanges().map(actions => {
      return actions.map(action => {
        const data = action.payload.doc.data();
        const id = action.payload.doc.id;
        return { id, ...data };
      });
    });
  }
}
So, I understand there should be at least one read count returned from:
name: projects/firetestjimis
filter: metric.type = "firestore.googleapis.com/document/read_count"
interval.endTime: 2020-05-07T15:09:17Z
It was a little difficult to follow what you were saying, but here's what I've figured out.
This is a list of available Firestore metrics: https://cloud.google.com/monitoring/api/metrics_gcp#gcp-firestore
You can then pass these metric types to this API
https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.timeSeries/list
On that page, I used the "Try This API" tool on the right side and filled in the following
name = projects/MY-PROJECT-ID
filter = metric.type = "firestore.googleapis.com/api/request_count"
interval.endTime = 2020-05-05T15:01:23.045123456Z
In Chrome's inspector, I can see that this is the GET request that the tool made:
https://content-monitoring.googleapis.com/v3/projects/MY-PROJECT-ID/timeSeries?filter=metric.type%20%3D%20%22firestore.googleapis.com%2Fapi%2Frequest_count%22&interval.endTime=2020-05-05T15%3A01%3A23.045123456Z&key=API-KEY-GOES-HERE
EDIT:
The above returned 200, but with an empty JSON payload.
We also needed to add the following entry to get data to populate:
interval.startTime = 2020-05-04T15:01:23.045123456Z
Also try going to console.cloud.google.com/monitoring/metrics-explorer, type firestore in the "Find resource type and metric" box, and see if Google's own dashboards have data populating. (This is to confirm that there is actually data there for you to fetch.)
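Putting the above together in code, here is a minimal sketch using the Node.js Monitoring client library (the project ID and the 24-hour window are placeholder assumptions; any other client library, or a plain authenticated GET like the one above, works the same way):
// List the Firestore document read_count time series for the last 24 hours.
const monitoring = require('@google-cloud/monitoring');

const client = new monitoring.MetricServiceClient();

async function listReadCounts(projectId) {
  const nowSeconds = Math.floor(Date.now() / 1000);
  const [timeSeries] = await client.listTimeSeries({
    name: client.projectPath(projectId),
    filter: 'metric.type = "firestore.googleapis.com/document/read_count"',
    interval: {
      startTime: {seconds: nowSeconds - 24 * 60 * 60},
      endTime: {seconds: nowSeconds},
    },
  });
  timeSeries.forEach(ts => console.log(JSON.stringify(ts, null, 2)));
}

listReadCounts('MY-PROJECT-ID');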

Google Cloud DLP tag in Data Catalog shows Job State as pending?

I first created a custom template in DLP (with custom detectors), then created a DLP job using the new DLP template against a BQ table and ran the job with the publish to Data Catalog setting.
The DLP job completed, but the DLP tag in Data Catalog has the job state as pending... This only happens when I use custom templates for the DLP job.
If I use out-of-the-box DLP detectors, the tag shows up correctly in Data Catalog. Any idea why the custom DLP template results are not showing in Data Catalog?
Here is the output of listing the tag for the BQ table entry in Data Catalog:
"name":"projects/XXXX/locations/US/entryGroups/#bigquery/entries/XXXX",
"template":"projects/XXXX/locations/us-central1/tagTemplates/data_loss_prevention",
"fields":{
"job_name":{
"displayName":"DLP job name",
"stringValue":"projects/XXXX/dlpJobs/i-Copy-of-test_dlp_job4"
},
"job_state":{
"displayName":"DLP job state",
"stringValue":"PENDING"
}
},
"templateDisplayName":"Data Loss Prevention Annotations"
}
Have you followed the documentation?
https://cloud.google.com/dlp/docs/creating-templates-inspect
https://cloud.google.com/dlp/docs/quickstart-create-template-inspect#create-template
I recommend you perform this task again, go to Stackdriver Logging, and look for the logs; maybe, as this is a custom template, you need to set an additional permission, but I can't see your project and logs. Another thing you could try is to change the values: have you tried changing the Likelihood to unspecified, or creating a simple template like the default ones?

How to schedule an export from a BigQuery table to Cloud Storage?

I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfers into BigQuery or Cloud Storage, but I haven't found anything yet about scheduling an export from a BigQuery table to Cloud Storage.
Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?
There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.
The Cloud Function would contain the necessary code to export to Cloud Storage from the BigQuery table. There are multiple programming languages to choose from for that, such as Python, Node.JS, and Go.
Cloud Scheduler would periodically send an HTTP call, on a cron schedule, to the Cloud Function, which would in turn get triggered and run the export programmatically.
As an example and more specifically, you can follow these steps:
Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:
# Imports the BigQuery client library
from google.cloud import bigquery

def hello_world(request):
    # Replace these values according to your project
    project_name = "YOUR_PROJECT_ID"
    bucket_name = "YOUR_BUCKET"
    dataset_name = "YOUR_DATASET"
    table_name = "YOUR_TABLE"

    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")
    bq_client = bigquery.Client(project=project_name)

    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )
    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri
    )
Specify the client library dependency in the requirements.txt file by adding this line:
google-cloud-bigquery
Create a Cloud Scheduler job. Set the Frequency you wish for the job to be executed with. For instance, setting it to 0 1 * * 0 would run the job once a week at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.
Choose HTTP as the Target, set the URL as the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and as HTTP method choose GET.
Once created, and by pressing the RUN NOW button, you can test how the export behaves. However, before doing so, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role, or otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.
If you wish to execute exports on different tables, datasets and buckets for each execution, but essentially employing the same Cloud Function, you can use the HTTP POST method instead, and configure a Body containing said parameters as data, which would be passed on to the Cloud Function, although that would imply doing some small changes in its code.
Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
Not sure if this was in GA when this question was asked, but at least now there is an option to run an export to Cloud Storage via a regular SQL query. See the SQL tab in Exporting table data.
Example:
EXPORT DATA
OPTIONS (
uri = 'gs://bucket/folder/*.csv',
format = 'CSV',
overwrite = true,
header = true,
field_delimiter = ';')
AS (
SELECT field1, field2
FROM mydataset.table1
ORDER BY field1
);
This could as well be trivially set up via a scheduled query if you need a periodic export. And, of course, you need to make sure the user or service account running this has permissions to read the source datasets and tables and to write to the destination bucket.
Hopefully this is useful for other peeps visiting this question if not for OP :)
You have an alternative to the second part of Maxim's answer. The code for extracting the table and storing it in Cloud Storage should work.
But, when you schedule a query, you can also define a Pub/Sub topic where the BigQuery scheduler will post a message when the job is over. Thereby, the Cloud Scheduler setup described by Maxim is optional, and you can simply plug the function into the Pub/Sub notification.
Before performing the extraction, don't forget to check the error status of the Pub/Sub notification. You also have a lot of information about the scheduled query, which is useful if you want to perform more checks or generalize the function. A sketch of such a function follows below.
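For illustration, here is a minimal Node.js sketch of such a Pub/Sub-triggered Cloud Function. The dataset, table and bucket names are placeholders, and the message is assumed to carry the transfer run JSON that the BigQuery Data Transfer Service publishes when the scheduled query finishes; verify the payload format for your setup:
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Triggered by the Pub/Sub topic attached to the scheduled query.
exports.exportOnScheduledQueryDone = async (message) => {
  const run = JSON.parse(Buffer.from(message.data, 'base64').toString());
  // Assumed: a populated errorStatus means the scheduled query failed.
  if (run.errorStatus && run.errorStatus.code) {
    console.error('Scheduled query failed, skipping export', run.errorStatus);
    return;
  }
  const bigquery = new BigQuery();
  const storage = new Storage();
  const destination = storage.bucket('YOUR_BUCKET').file('bq_export.csv.gz');
  const [job] = await bigquery
    .dataset('YOUR_DATASET')
    .table('YOUR_TABLE')
    .createExtractJob(destination, {format: 'CSV', gzip: true});
  console.log(`Export job ${job.id} started`);
};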
So, another point about the SFTP transfer. I open-sourced a project for querying BigQuery, building a CSV file, and transferring this file to an FTP server (SFTP and FTPS aren't supported, because my previous company only used the FTP protocol!). If your file is smaller than 1.5 GB, I can update my project to add SFTP support if you want to use this. Let me know.