How to get the number of rows inserted using BigQuery Streaming - google-cloud-platform

I am reading data from a CSV file, inserting the data to a Big Query table using insertAll() method from Streaming Insert as shown below:
InsertAllResponse response = dfsf.insertAll(InsertAllRequest.newBuilder(tableId).setRows(rows).build());
rows here is an Iterable declared like this:
Iterable<InsertAllRequest.RowToInsert> rows
Now, I am actually batching the rows to insert into a size of 500 as suggested here - link to suggestion
After all the data has been inserted, how do I count the total number of rows that were inserted?
I want to find that out and log it to log4j.

This can be done one of two ways
The BigQuery Jobs API via the getQueryResults
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/getQueryResults
Cloud Logging, the output you want in the tableDataChange field.
Here is a sample output:
{
"protoPayload": {
"#type": "type.googleapis.com/google.cloud.audit.AuditLog",
"status": {},
"authenticationInfo": {
"principalEmail": "service_account"
},
"requestMetadata": {
"callerIp": "2600:1900:2000:1b:400::27",
"callerSuppliedUserAgent": "gl-python/3.7.1 grpc/1.22.0 gax/1.14.2 gapic/1.12.1 gccl/1.12.1,gzip(gfe)"
},
"serviceName": "bigquery.googleapis.com",
"methodName": "google.cloud.bigquery.v2.JobService.InsertJob",
"authorizationInfo": [
{
"resource": "projects/project_id/datasets/dataset/tables/table",
"permission": "bigquery.tables.updateData",
"granted": true
}
],
"resourceName": "projects/project_id/datasets/dataset/tables/table",
"metadata": {
"tableDataChange": {
"deletedRowsCount": "2",
"insertedRowsCount": "2",
"reason": "QUERY",
"jobName": "projects/PRJOECT_ID/jobs/85f19bdd-aff5-4abe-9283-9f0bc9ed3ce8"
},
"#type": "type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata"
}
},
"insertId": "7x7ye390qm",
"resource": {
"type": "bigquery_dataset",
"labels": {
"project_id": "PRJOECT_ID",
"dataset_id": "dataset-id"
}
},
"timestamp": "2020-10-26T07:00:22.960735Z",
"severity": "INFO",
"logName": "projects/PRJOECT_ID/logs/cloudaudit.googleapis.com%2Fdata_access",
"receiveTimestamp": "2020-10-26T07:00:23.763159336Z"
}

Related

How to build a multi-dimentional json native query for Druid?

I have data with multiple dimensions, stored in the Druid cluster. for example, Data of movies and the revenue they earned from each country where they were screened.
I'm trying to build a query that the answer to be returned will be a table of all the movies, the total revenue of each of them, and the revenue for each country.
I succeeded to do it in Turnilo - it generated for me the following Druid query -
[
[
{
"queryType": "timeseries",
"dataSource": "movies_source",
"intervals": "2021-11-18T00:01Z/2021-11-21T00:01Z",
"granularity": "all",
"aggregations": [
{
"name": "__VALUE__",
"type": "doubleSum",
"fieldName": "revenue"
}
]
},
{
"queryType": "topN",
"dataSource": "movies_source",
"intervals": "2021-11-18T00:01Z/2021-11-21T00:01Z",
"granularity": "all",
"dimension": {
"type": "default",
"dimension": "movie_id",
"outputName": "movie_id"
},
"aggregations": [
{
"name": "revenue",
"type": "doubleSum",
"fieldName": "revenue"
}
],
"metric": "revenue",
"threshold": 50
}
],
[
{
"queryType": "topN",
"dataSource": "movies_source",
"intervals": "2021-11-18T00:01Z/2021-11-21T00:01Z",
"granularity": "all",
"filter": {
"type": "selector",
"dimension": "movie_id",
"value": "some_movie_id"
},
"dimension": {
"type": "default",
"dimension": "country",
"outputName": "country"
},
"aggregations": [
{
"name": "revenue",
"type": "doubleSum",
"fieldName": "revenue"
}
],
"metric": "revenue",
"threshold": 5
}
]
]
But it doesn't work when I'm trying to use it as a body for a Postman query - I got
{
"error": "Unknown exception",
"errorMessage": "Unexpected token (START_ARRAY), expected VALUE_STRING: need JSON String that contains type id (for subtype of org.apache.druid.query.Query)\n at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 2, column: 3]",
"errorClass": "com.fasterxml.jackson.databind.exc.MismatchedInputException",
"host": null
}
How should I build the corresponding query so that it works with Postman?
I am not familiar with Turnilo but have you tried using the Druid Console to write SQL and convert to Native request with the "Explain SQL query" option under the "Run/..." menu?
Your native queries seem to be doing a Top N instead of listing all movies, so I think the SQL might be something like:
SELECT movie_id, country_id, SUM(revenue) total_revenue
FROM movies_source
WHERE __time BETWEEN '2021-11-18 00:01:00' AND '2021-11-21 00:01:00'
GROUP BY movie_id, country_id
ORDER BY total_revenue DESC
LIMIT 50
I don't have the data source to test, but tested with sample wikipedia data with similar query structure:
SELECT namespace, cityName, sum(sum_added) total
FROM "wikipedia" r
WHERE cityName IS NOT NULL
AND __time BETWEEN '2015-09-12 00:00:00' AND '2015-09-15 00:00:00'
GROUP BY namespace, cityName
ORDER BY total DESC
limit 50
which results in the following Native query:
{
"queryType": "groupBy",
"dataSource": {
"type": "table",
"name": "wikipedia"
},
"intervals": {
"type": "intervals",
"intervals": [
"2015-09-12T00:00:00.000Z/2015-09-15T00:00:00.001Z"
]
},
"virtualColumns": [],
"filter": {
"type": "not",
"field": {
"type": "selector",
"dimension": "cityName",
"value": null,
"extractionFn": null
}
},
"granularity": {
"type": "all"
},
"dimensions": [
{
"type": "default",
"dimension": "namespace",
"outputName": "d0",
"outputType": "STRING"
},
{
"type": "default",
"dimension": "cityName",
"outputName": "d1",
"outputType": "STRING"
}
],
"aggregations": [
{
"type": "longSum",
"name": "a0",
"fieldName": "sum_added",
"expression": null
}
],
"postAggregations": [],
"having": null,
"limitSpec": {
"type": "default",
"columns": [
{
"dimension": "a0",
"direction": "descending",
"dimensionOrder": {
"type": "numeric"
}
}
],
"limit": 50
},
"context": {
"populateCache": false,
"sqlOuterLimit": 101,
"sqlQueryId": "cd5aabed-5e08-49b7-af63-fe82c125d3ee",
"useApproximateCountDistinct": false,
"useApproximateTopN": false,
"useCache": false
},
"descending": false
}

AWS Stepfunction, ValidationException

i got the error "The provided key element does not match the schema", while getting data from AWS dynamoDB using stepfunction.
stepfunction Defination
{
"Comment": "This is your state machine",
"StartAt": "Choice",
"States": {
"Choice": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.data.Type",
"StringEquals": "GET",
"Next": "DynamoDB GetItem"
},
{
"Variable": "$.data.Type",
"StringEquals": "PUT",
"Next": "DynamoDB PutItem"
}
]
},
"DynamoDB GetItem": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:getItem",
"Parameters": {
"TableName": "KeshavDev",
"Key": {
"Email": {
"S": "$.Email"
}
}
},
"End": true
},
"DynamoDB PutItem": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:putItem",
"Parameters": {
"TableName": "KeshavDev",
"Item": {
"City": {
"S.$": "$.City"
},
"Email": {
"S.$": "$.Email"
},
"Address": {
"S.$": "$.Address"
}
}
},
"InputPath": "$.data",
"End": true
}
}
}
Input
{
"data": {
"Type": "GET",
"Email": "demo#gmail.com"
}
}
Error
{ "resourceType": "dynamodb", "resource": "getItem", "error":
"DynamoDB.AmazonDynamoDBException", "cause": "The provided key
element does not match the schema (Service: AmazonDynamoDBv2; Status
Code: 400; Error Code: ValidationException; Request ID:
a78c3d7a-ca3f-4483-b986-1735201d4ef2; Proxy: null)" }
I see some potential issues with the getItem task when compared to AWS documentation.
I think the Key field needs to be S.$ similar to what you have in your putItem task.
There is no ResultPath attribute to tell the state machine where to put the results.
Your path may not be correct, try $.data.Email
"DynamoDB GetItem": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:getItem",
"Parameters": {
"TableName": "KeshavDev",
"Key": {
"Email": {
"S.$": "$.data.Email"
}
}
},
"ResultPath": "$.DynamoDB",
"End": true
},
To be honest, I'm not sure if one of all of these are contributing to the validation error those are some things to experiment with.
On another note, there are some open source validators for Amazon State Language but for this case, they were not very helpful and said that your code was valid.
its working, above JD D mentoned steps and also by adding both key in step function definition.
DynamoDb have two key.
primary partition key
primary sort key

Monitoring api in Google gives "By" as response

I am reading monitoring data through Google Timeseries api. The api is working correctly and if give alignment period=3600s it gives me the values for that time series between start and end time for any metric type.
I am calling it through Python like this:
service.projects().timeSeries().list(
name=api_args["project_name"],
filter=api_args["metric_filter"],
aggregation_alignmentPeriod=api_args["aggregation_alignment_period"],
# aggregation_crossSeriesReducer=api_args["crossSeriesReducer"],
aggregation_perSeriesAligner=api_args["perSeriesAligner"],
aggregation_groupByFields=api_args["group_by"],
interval_endTime=api_args["end_time_str"],
interval_startTime=api_args["start_time_str"],
pageSize=config.PAGE_SIZE,
pageToken=api_args["nextPageToken"]
).execute()
and in Postman:
https://monitoring.googleapis.com/v3/projects/my-project/timeSeries?pageSize=500&interval.startTime=2020-07-04T16%3A39%3A37.230000Z&aggregation.alignmentPeriod=3600s&aggregation.perSeriesAligner=ALIGN_SUM&filter=metric.type%3D%22compute.googleapis.com%2Finstance%2Fnetwork%2Freceived_bytes_count%22+&pageToken=&interval.endTime=2020-07-04T17%3A30%3A01.497Z&alt=json&aggregation.groupByFields=metric.labels.key
I face an issue here:
{
"metric": {
"labels": {
"instance_name": "insta-demo1",
"loadbalanced": "false"
},
"type": "compute.googleapis.com/instance/network/received_bytes_count"
},
"resource": {
"type": "gce_instance",
"labels": {
"instance_id": "1234343552",
"zone": "us-central1-f",
"project_id": "my-project"
}
},
"metricKind": "DELTA",
"valueType": "INT64",
"points": [
{
"interval": {
"startTime": "2020-07-04T16:30:01.497Z",
"endTime": "2020-07-04T17:30:01.497Z"
},
"value": {
"int64Value": "6720271"
}
}
]
},
{
"metric": {
"labels": {
"loadbalanced": "true",
"instance_name": "insta-demo2"
},
"type": "compute.googleapis.com/instance/network/received_bytes_count"
},
"resource": {
"type": "gce_instance",
"labels": {
"instance_id": "1234566343",
"project_id": "my-project",
"zone": "us-central1-f"
}
},
"metricKind": "DELTA",
"valueType": "INT64",
"points": [
{
"interval": {
"startTime": "2020-07-04T16:30:01.497Z",
"endTime": "2020-07-04T17:30:01.497Z"
},
"value": {
"int64Value": "579187"
}
}
]
}
],
"unit": "By". //This "By" is the value which is causing problem,
I am getting this value like "unit": "By" or "unit":"ms" or something like that at the end, Also if I don't find any data for a range I'm getting this value, as I am evaluating this response in Python I am getting key error as there is not key called "unit"
logMessage: "Key Error: ' '"
severity: "ERROR"
As the response is empty I am getting the single key called "unit". Also at the end of any response I am getting this "unit":"ms" or "unit":"by" - is there any way to prevent that unit value coming in the response?
I am new to Google Cloud APIs and Python. What can I try next?
The "unit" field expresses the kind of resource the metric is counting. For bytes, it is "By". Read this. I understand it is always returned, so there is no way of not receiving it; I recommend you to adapt your code to correctly deal with its appearance in the responses.

Cost of GetMetricData Api calls when using metric math?

I'm working on ingesting metrics from Lambda into our centralized logging system. Our first idea is too costly so I'm trying to figure out if there is a way to lower the cost (instead of ingesting 3 metrics from 200 lambdas every 60s).
I've been messing around with MetricMath and have pretty much figured out what I want to do. I'd run this as a k8s cron job like thing and variabilize the start and end time.
How would this be charged? Is it the number of metrics used to perform the math or the number of values that I output?
i.e. m1 and m2 are pulling Errors and Invocations from 200 lambdas. To pull each of these individually would be 400 metrics.
In this method, would it only be 1, 3, or 401?
{
"MetricDataQueries": [
{
"Id": "m1",
"MetricStat": {
"Metric": {
"Namespace": "AWS/Lambda",
"MetricName": "Errors"
},
"Period": 300,
"Stat": "Sum",
"Unit": "Count"
},
"ReturnData": false
},
{
"Id": "m2",
"MetricStat": {
"Metric": {
"Namespace": "AWS/Lambda",
"MetricName": "Invocations"
},
"Period": 300,
"Stat": "Sum",
"Unit": "Count"
},
"ReturnData": false
},
{
"Id": "e1",
"Expression": "m1 / m2",
"Label": "ErrorRate"
}
],
"StartTime": "2020-02-25T02:00:0000",
"EndTime": "2020-02-26T02:05:0000"
}
Output:
{
"Messages": [],
"MetricDataResults": [
{
"Label": "ErrorRate",
"StatusCode": "Complete",
"Values": [
0.0045127626568890146
],
"Id": "e1",
"Timestamps": [
"2020-02-26T19:00:00Z"
]
}
]
}
Example 2:
Same principle. This is pulling the invocations by of each function by FunctionName. It then sorts them and outputs the most invoked. Any idea how many metrics this would be?
{
"MetricDataQueries": [
{
"Id": "e2",
"Expression": "SEARCH(' {AWS/Lambda,FunctionName} MetricName=`Invocations` ', 'Sum', 60)",
"ReturnData" : false
},
{
"Id": "e3",
"Expression": "SORT(e2, SUM, DESC, 1)"
}
],
"StartTime": "2020-02-26T12:00:0000",
"EndTime": "2020-02-26T12:01:0000"
}
Same question. 1 or 201 metrics?
Output:
{
"MetricDataResults": [
{
"Id": "e3",
"Timestamps": [
"2020-02-26T12:00:00Z"
],
"Label": "1 - FunctionName",
"Values": [
91.0
],
"StatusCode": "Complete"
}
],
"Messages": []
}
Billing is on metrics requested: https://aws.amazon.com/cloudwatch/pricing/
In the first example, you're requesting only 2 metrics. These metrics are aggregates of per lambda function metrics, but as far you're concerned, that's only 2 metrics and you will be billed for 2. You're not billed for the metric math, only for metrics you request.
In the second example, the number of metrics the search returns is the amount you will be billed for, 200 in your case.

ElasticSearch (AWS): How to use another index as a query/match parameter?

Basically I am trying to implement this strategy.
Sample Data:
PUT /newsfeed_consumer_network/consumer_network/urn:viadeo:member:me
{
"producerIds": [
"urn:viadeo:member:ned",
"urn:viadeo:member:john",
"urn:viadeo:member:mary"
]
}
PUT /newsfeed/news/urn:viadeo:news:33
{
"producerId": "urn:viadeo:member:john",
"published": "2014-12-17T12:45:00.000Z",
"actor": {
"id": "urn:viadeo:member:john",
"objectType": "member",
"displayName": "John"
},
"verb": "add",
"object": {
"id": "urn:viadeo:position:10",
"objectType": "position",
"displayName": "Software Engineer # Viadeo"
},
"target": {
"id": "urn:viadeo:profile:john",
"objectType": "profile",
"displayName": "John's profile"
}
}
Sample Query:
POST /newsfeed/news/_search
{
"query": {
"bool": {
"must": [{
"match": {
"actor.id": {
"producerId": {
"index": "newsfeed_consumer_network",
"type": "consumer_network",
"id": "urn:viadeo:network:me",
"path": "producerIds"
}
}
}
}]
}
}
}
I am getting the following error:
"type": "query_parsing_exception",
"reason": "[match] query does not support [index]"
How can I use an index to support a matching query? Is there any way to implement this?
Basically I just want to use another document as the source of the matching parameter for my query. Is this even possible with ElasticSearch?