How to deduplicate GCP logs from Logs Explorer?

I am using GCP Logs explorer to store logging messages from my pipeline.
I need to debug an issue by looking at logs for a specific kind of event. The error message is identical in every case except for an event ID at the end.
So for example, the error message is
Event ID does not exist: foo
I know that I can use the following syntax to construct a query that will return the logs with this particular message structure
resource.type="some_resource"
resource.labels.project_id="some_project"
resource.labels.job_id="some_id"
severity=WARNING
jsonPayload.message:"Event ID does not exist:"
The last line in that query matches every log entry whose message contains that string, so I end up with results like this:
Event ID does not exist: 1A
Event ID does not exist: 2A
Event ID does not exist: 2A
Event ID does not exist: 3A
I want to deduplicate those results so that I end up with only:
Event ID does not exist: 1A
Event ID does not exist: 2A
Event ID does not exist: 3A
But I don't see support for this kind of deduplication in the query language docs.
Due to the number of rows, I also cannot simply download a delimited log file.
Is it possible to deduplicate the results?

To deduplicate records with BigQuery, follow these steps:
1) Identify whether your dataset contains duplicates.
2) Create a SELECT query that aggregates the desired column using a GROUP BY clause.
3) Materialize the result to a new table using CREATE OR REPLACE TABLE [tablename] AS [SELECT STATEMENT] (a sketch follows below).
You can review the full tutorial in this link.
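For example, once the logs have been routed to BigQuery, a query along these lines collapses the duplicates. This is only a sketch: it assumes the logs land in a table my_project.my_logs.my_table with the message in a jsonPayload.message column, so adjust the names to your own export.
#standardSQL
-- Sketch only: dataset, table, and column names are placeholders for your own log export.
CREATE OR REPLACE TABLE `my_project.my_logs.deduplicated_messages` AS
SELECT
  jsonPayload.message AS message,
  COUNT(*) AS occurrences
FROM
  `my_project.my_logs.my_table`
WHERE
  jsonPayload.message LIKE 'Event ID does not exist:%'
GROUP BY
  message;
If you only need the unique strings, a plain SELECT DISTINCT jsonPayload.message over the same table works as well.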
To analyze a large volume of logs, you could route them to BigQuery and analyze them there, using Fluentd to collect and load the logs.
Fluentd has an output plugin that can use BigQuery as a destination for storing the collected logs. Using the plugin, you can directly load logs into BigQuery in near real time from many servers.
In this link, you can find a complete tutorial on how to Analyze logs using Fluentd and BigQuery.
To route your logs to BigQuery, you first need to create a sink with BigQuery as its destination.
Sinks control how Cloud Logging routes logs. Using sinks, you can
route some or all of your logs to supported destinations.
Sinks belong to a given Google Cloud resource: Cloud projects, billing
accounts, folders, and organizations. When the resource receives a log
entry, it routes the log entry according to the sinks contained by
that resource. The log entry is sent to the destination associated
with each matching sink.
You can route log entries from Cloud Logging to BigQuery using sinks.
When you create a sink, you define a BigQuery dataset as the
destination. Logging sends log entries that match the sink's rules to
partitioned tables that are created for you in that BigQuery dataset.
1) In the Cloud console, go to the Logs Router page.
2) Select an existing Cloud project.
3) Select Create sink.
4) In the Sink details panel, enter the following details:
Sink name: Provide an identifier for the sink; note that after you create the sink, you can't rename the sink but you can delete it and
create a new sink.
Sink description (optional): Describe the purpose or use case for the sink.
5) In the Sink destination panel, select the sink service and destination:
Select sink service: Select the service where you want your logs routed. Based on the service that you select, you can select from the
following destinations:
BigQuery dataset: Select or create the particular dataset to receive the routed logs. You also have the option to use partitioned tables.
For example, if your sink destination is a BigQuery dataset, the sink
destination would be the following:
bigquery.googleapis.com/projects/PROJECT_ID/datasets/DATASET_ID
Note that if you are routing logs between Cloud projects, you still
need the appropriate destination permissions.
6) In the Choose logs to include in sink panel, do the following:
In the Build inclusion filter field, enter a filter expression that
matches the log entries you want to include. If you don't set a
filter, all logs from your selected resource are routed to the
destination.
To verify you entered the correct filter, select Preview logs. This
opens the Logs Explorer in a new tab with the filter prepopulated.
7) (Optional) In the Choose logs to filter out of sink panel, do the following:
In the Exclusion filter name field, enter a name.
In the Build an exclusion filter field, enter a filter expression that
matches the log entries you want to exclude. You can also use the
sample function to select a portion of the log entries to exclude. You
can create up to 50 exclusion filters per sink. Note that the length
of a filter can't exceed 20,000 characters.
8) Select Create sink.
More information about configuring and managing sinks is available here.
To review the details, formatting, and rules that apply when routing log entries from Cloud Logging to BigQuery, please follow this link.
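If you prefer the command line, an equivalent sink can be created with gcloud. This is only a sketch: the sink, project, and dataset names are placeholders, the destination dataset must already exist, and the filter reuses the one from the question.
# Create the sink, routing matching entries to an existing BigQuery dataset.
gcloud logging sinks create my-warning-sink \
  bigquery.googleapis.com/projects/some_project/datasets/my_logs \
  --log-filter='resource.type="some_resource" AND severity=WARNING AND jsonPayload.message:"Event ID does not exist:"'
# Look up the sink's writer identity so you can grant it write access
# (BigQuery Data Editor) on the destination dataset.
gcloud logging sinks describe my-warning-sink --format='value(writerIdentity)'
As noted above, the sink's writer identity still needs the appropriate permissions on the destination dataset before any rows appear.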

Related

GCP log explorer filter for list item count more than 1

I am trying to write a filter in GCP Logs Explorer that can filter on the number of values of an attribute.
Example:
I am trying to find logs like the one below, which has two items in the "referencedTables" attribute.
GCP Log Explorer Screenshot
I have tried the options below, which don't work:
protoPayload.metadata.jobChange.job.jobStats.queryStats.referencedTables.*.count>1
protoPayload.metadata.jobChange.job.jobStats.queryStats.referencedTables.count>1
I also tried a regex looking for the "tables" keyword occurring twice:
protoPayload.metadata.jobChange.job.jobStats.queryStats.referencedTable=~"(\tables+::\tables+))"
I also tried a regex querying for a second item, which would imply there is more than one item:
protoPayload.metadata.jobChange.job.jobStats.queryStats.referencedTables1=~"^[A-Za-z0-9_.]+$"
Note that these are BigQuery audit logs, which are written to the GCP logging service when you run "insert into ... select" type queries in BigQuery.
I think you can't use logging filters to filter across log entries, only within a single log entry.
One solution to your problem is log-based metrics, where you'd create a metric by extracting values from logs, but you'd then have to use MQL to query (e.g. count) the metric.
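For the log-based-metric route, a counter metric could be created roughly like this. It is a sketch only: the metric name and filter are placeholders, and a plain counter only counts matching entries; it does not extract the number of referenced tables.
# Hypothetical counter metric for BigQuery jobs whose audit entry lists referenced tables.
gcloud logging metrics create bq_jobs_with_referenced_tables \
  --description="BigQuery query jobs whose audit log lists referenced tables" \
  --log-filter='protoPayload.metadata.jobChange.job.jobStats.queryStats.referencedTables:*'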
A simpler (albeit ad hoc) solution is to use gcloud logging read to --filter the logs (possibly --format the results as JSON for easier processing) and then pipe the results into a tool like jq, where you can count them.
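A sketch of that second approach, assuming jq is installed and the audit logs live in the current project (the field path mirrors the one in the question and may need adjusting for your log format):
# Pull recent entries that list referenced tables, as JSON, then keep only
# those entries that reference more than one table.
gcloud logging read \
  'protoPayload.metadata.jobChange.job.jobStats.queryStats.referencedTables:*' \
  --freshness=1d --format=json \
| jq '[ .[] | select((.protoPayload.metadata.jobChange.job.jobStats.queryStats.referencedTables | length) > 1) ]'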

Creating a BigQuery dataset from a log sink in GCP

When running
gcloud logging sinks list
it seems I have several sinks for my project
▶ gcloud logging sinks list
NAME DESTINATION FILTER
myapp1 bigquery.googleapis.com/projects/myproject/datasets/myapp1 resource.type="k8s_container" resource.labels.cluster_name="mygkecluster" resource.labels.container_name="myapp1"
myapp2 bigquery.googleapis.com/projects/myproject/datasets/myapp2 resource.type="k8s_container" resource.labels.cluster_name="mygkecluster" resource.labels.container_name="myapp2"
myapp3 bigquery.googleapis.com/projects/myproject/datasets/myapp3 resource.type="k8s_container" resource.labels.cluster_name="mygkecluster" resource.labels.container_name="myapp3"
However, when I navigate in my BigQuery console, I don't see the corresponding datasets.
Is there a way to import these sinks as datasets so that I can run queries against them?
This guide on creating BigQuery datasets does not list how to do so from a log sink (unless I am missing something)
Also any idea why the above datasets are not displayed when using the bq ls command?
Firstly, make sure you are in the correct project. If not, you can pin a dataset from an external project by clicking the PIN button (you need sufficient permissions for this).
Secondly, a Cloud Logging sink to BigQuery doesn't create the dataset, only the tables. So if you created the sinks without the dataset existing, your sinks aren't running (or are failing). More details here:
BigQuery: Select or create the particular dataset to receive the exported logs. You also have the option to use partitioned tables.
In general, what you expect this feature to do is right: using BigQuery as a log sink is meant to let you query the logs with BQ. The problem you're facing, I believe, comes down to using the web console vs. gcloud.
When using BigQuery as log sink, there are 2 ways to specify a dataset:
point to an existing dataset
create a new dataset
When creating a new sink via the web console, there's an option to have Cloud Logging create a new dataset for you as well. However, gcloud logging sinks create does not automatically create a dataset for you; it only creates the log sink. It also does not seem to validate whether the specified dataset exists.
To resolve this, you could either use the web console for the task or create the datasets on your own. There's nothing special about creating a BQ dataset to be a log sink destination compared to creating a BQ dataset for any other purpose. Create a BQ dataset, then create a log sink that points to the dataset and you're good to go; a sketch follows below.
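A minimal sketch of that flow, reusing the first sink from the question (project, dataset, and filter values are taken from the listing above and may need adjusting):
# 1. Create the dataset the sink expects (gcloud will not create it for you).
bq mk --dataset myproject:myapp1
# 2. Create (or recreate) the sink pointing at that dataset.
gcloud logging sinks create myapp1 \
  bigquery.googleapis.com/projects/myproject/datasets/myapp1 \
  --log-filter='resource.type="k8s_container" AND resource.labels.cluster_name="mygkecluster" AND resource.labels.container_name="myapp1"'
# 3. Look up the sink's writer identity and grant it BigQuery Data Editor on the dataset.
gcloud logging sinks describe myapp1 --format='value(writerIdentity)'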
Conceptually, the different GCP products (BigQuery, Cloud Logging) run independently. A log sink in Cloud Logging is simply an object that pairs a filter with a destination; it does not own or manage the destination resource (e.g. the BQ dataset). The web console just provides some extra integration to make things easier.

BigQuery summary

Where/how can I easily see how many BigQuery analysis queries have been run per month? And how about overall storage usage and how it changes over time (monthly)?
I've had a quick look at "Monitoring > Dashboards > BigQuery". Is that the best place to explore? It only seems to go back to early October - was that when it was released, or does it only display the last X weeks of data? Trying Metrics Explorer for Queries Count (metric: bigquery.googleapis.com/job/num_in_flight) gave me a weird unlabelled y-axis, e.g. a scale of 0 to 0.08, which seems odd as I expect to see a few hundred queries run per week.
Context: It would be good to have a high-level summary of BigQuery usage as the months progress, to give the wider organisation and management an idea of the scale of usage.
You can track your bytes billed by exporting BigQuery usage logs.
Set up the logs export (this uses the Legacy Logs Viewer):
Open Logging -> Logs Viewer
Click Create Sink
Enter "Sink Name"
For "Sink service" choose "BigQuery dataset"
Select your BigQuery dataset to monitor
Create sink
Once the sink is enabled, every executed query will write data usage logs to tables named "cloudaudit_googleapis_com_data_access_YYYYMMDD" under the BigQuery dataset you selected in your sink.
Here is a sample query to get bytes billed per user:
#standardSQL
WITH data AS (
  SELECT
    protopayload_auditlog.authenticationInfo.principalEmail AS principalEmail,
    protopayload_auditlog.metadataJson AS metadataJson,
    CAST(JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson,
      "$.jobChange.job.jobStats.queryStats.totalBilledBytes") AS INT64) AS totalBilledBytes
  FROM
    `myproject_id.training_big_query.cloudaudit_googleapis_com_data_access_*`
)
SELECT
  principalEmail,
  SUM(totalBilledBytes) AS billed_bytes
FROM
  data
WHERE
  JSON_EXTRACT_SCALAR(metadataJson, "$.jobChange.job.jobConfig.type") = "QUERY"
GROUP BY principalEmail
ORDER BY billed_bytes DESC
NOTES:
You can only track the usage starting at the date when you set up the logs export
Table "cloudaudit_googleapis_com_data_access_YYYYMMDD" is created daily to track all logs
I think Cloud Monitoring is the only place to create and view metrics. If you are not happy with what it provides for BigQuery by default, the only other alternative is to create your own customized charts and dashboards that satisfy your needs. You can achieve that using Monitoring Query Language (MQL). Using MQL you can achieve the things you described in your question. Here are the links for more detailed information.
Introduction to BigQuery monitoring
Introduction to Monitoring Query Language

Amazon Connect - How to generate report based on CTRs

We have created a contact centre with two contact flows and one customer queue flow. Under Metrics there are multiple report types which focus on the queue or agents, but what I need is a report based on the customer's choices in the Get customer input blocks, i.e. the path taken by the customer at each branch. Is there a way to achieve this?
Example:
The customer selected Option A at Level 1 and Option 3 at Level 2, plus the contact flow name, etc. I believe this info resides in the CTRs (contact attributes), but how do I get a cumulative report across all records? As far as I have checked, we can only retrieve one contact at a time using Contact search.
You can stream the CTR data to a Redshift database using the data streaming feature in Amazon Connect. The general steps are to create a Kinesis stream, turn on data streaming in Amazon Connect, and attach the Kinesis stream.
There is a quickstart to help set this up here: https://aws.amazon.com/quickstart/connect/data-streaming/
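As a rough sketch of the first step (the stream name and shard count are placeholders), the Kinesis stream can be created with the AWS CLI; data streaming is then enabled in the Amazon Connect console, where you select that stream as the CTR destination.
# Create a Kinesis data stream to receive contact trace records (CTRs).
aws kinesis create-stream --stream-name connect-ctr-stream --shard-count 1
# Confirm the stream is ACTIVE before attaching it in Amazon Connect.
aws kinesis describe-stream-summary --stream-name connect-ctr-stream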

Is there any way to track a job across services in Stackdriver?

We use lots of components in Google Cloud, for example a job may start on App Engine, then do some work in Apache Airflow, then do some Dataflow work which will run a BigQuery insert.
Is there any way we can track the status of a job across all components using Stackdriver? For example, could we somehow tell Stackdriver a custom job ID and query for it?
You can use advanced logs filters [1] to include log entries from various products. In the Logging page, search for your BigQuery job ID, click the job ID, and select "Show matching entries". This opens the advanced filter text box with the proper syntax. Then you can add more queries with an OR in between.
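For example, a filter along these lines pulls matching entries from several products at once. It is only a sketch: the job ID and resource types are illustrative, and the bare quoted string is a global search that matches the ID wherever it appears in an entry.
"my-custom-job-id" AND
(resource.type="gae_app" OR resource.type="dataflow_step" OR resource.type="bigquery_resource")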