I am new to Google Cloud Stackdriver Logging, and as per this documentation, Stackdriver stores the Data Access audit logs for 30 days. It is also mentioned on the same page that the size of a log entry is limited to 100KB.
I am aware that the logs can be exported to Google Cloud Storage using the Cloud SDK as well as using the Logging libraries in many languages (we prefer Python).
I have two questions related to exporting the logs:
Is there any way in Stackdriver to schedule something similar to a task or cronjob that keeps exporting the logs to Google Cloud Storage automatically after a fixed interval of time?
What happens to log entries that are larger than 100KB? I assume they get truncated. Is my assumption correct? If yes, is there any way in which we can export/view the full (non-truncated) log entry?
Is there any way in Stackdriver to schedule something similar to a task or cronjob that keeps exporting the logs to Google Cloud Storage automatically after a fixed interval of time?
Stackdriver supports exporting log data via sinks. There is no schedule that you can set, as everything is automatic. Basically, the data is exported as soon as possible, and you have no control over the amount exported at each sink or the delay between exports. I have never found this to be an issue. Logging, by design, is not meant to be used as a real-time system. The closest is to sink to Pub/Sub, which has a delay of a couple of seconds (based on my experience).
There are two methods to export data from Stackdriver:
Create an export sink. Supported destinations are BigQuery, Cloud Storage and Pub/Sub. The log entries will be written to the destination automatically. You can then use tools to process the exported entries. This is the recommended method (see the sketch after this list).
Write your own code in Python, Java, etc. to read the log entries and do what you want with them. Scheduling is up to you. This method is manual and requires you to manage the schedule and destination yourself.
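For reference, the first method can also be set up programmatically. Here is a minimal sketch with the google-cloud-logging Python client; the sink name, filter and bucket are placeholders, and the destination bucket must already exist and grant write access to the Cloud Logging service account:

from google.cloud import logging

client = logging.Client()

# Placeholder destination bucket and filter; adjust to your project.
destination = "storage.googleapis.com/my-log-export-bucket"
data_access_filter = 'logName:"cloudaudit.googleapis.com%2Fdata_access"'

sink = client.sink("my-gcs-sink", filter_=data_access_filter, destination=destination)
if not sink.exists():
    sink.create()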
What happens to log entries that are larger than 100KB? I assume they get truncated. Is my assumption correct? If yes, is there any way in which we can export/view the full (non-truncated) log entry?
Entries that exceed the maximum size cannot be written to Stackdriver. The API call that attempts to create the entry will fail with an error message similar to the following (Python error message):
400 Log entry with size 113.7K exceeds maximum size of 110.0K
This means that entries that are too large will be discarded unless the writer has logic to handle this case.
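One way to handle this on the writer side is to truncate the payload before writing and to catch the API error. A rough Python sketch, where the log name and the exact byte budget are assumptions (the real limit includes some per-entry overhead):

from google.api_core.exceptions import InvalidArgument
from google.cloud import logging

client = logging.Client()
logger = client.logger("my-app-log")  # placeholder log name

MAX_PAYLOAD_BYTES = 100 * 1024  # rough budget; the real limit includes entry overhead

def write_entry(text):
    # Truncate oversized payloads so the entry is not rejected outright.
    payload = text.encode("utf-8")
    if len(payload) > MAX_PAYLOAD_BYTES:
        text = payload[:MAX_PAYLOAD_BYTES].decode("utf-8", errors="ignore") + " ...[truncated]"
    try:
        logger.log_text(text)
    except InvalidArgument as exc:
        # The API still rejected the entry (400); decide whether to drop or split it.
        print(f"Log entry rejected: {exc}")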
As per the Stackdriver Logging documentation, the whole process is automatic. An export sink to Google Cloud Storage is slower than BigQuery and Cloud Pub/Sub. link for the documentation
I recently used the export sink to BigQuery, which is better than Cloud Pub/Sub if you don't want to use another third-party application for log analysis. The BigQuery sink needs a dataset where you want to store the log entries. I noticed that the sink creates BigQuery tables on a timestamp basis in the BigQuery dataset.
One more thing: if you want to query timestamp-partitioned tables, check this link:
Legacy SQL Functions and Operators
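As a rough illustration (the dataset name, table prefix and selected fields below are placeholders), a legacy SQL query over the timestamp-based tables created by the sink could be run from the Python BigQuery client like this:

from google.cloud import bigquery

client = bigquery.Client()

# TABLE_DATE_RANGE is a legacy SQL function, so enable legacy SQL on the job.
query = """
SELECT timestamp, severity
FROM TABLE_DATE_RANGE(
  [my_project:my_log_dataset.my_log_table_],
  TIMESTAMP('2019-01-01'),
  TIMESTAMP('2019-01-07'))
LIMIT 100
"""

job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
for row in client.query(query, job_config=job_config).result():
    print(row)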
Related
The Google-provided Dataflow streaming template for data masking/tokenization from Cloud Storage to BigQuery using Cloud DLP is giving inconsistent output for each source file.
We have 100-odd files with 1M records each in the GCS bucket, and we are calling the Dataflow streaming template to tokenize the data using DLP and load it into BigQuery.
While loading the files sequentially, we saw that the results are inconsistent.
For a few files the full 1M records got loaded, but for most of them the row counts varied between 0.98M and 0.99M. Is there any reason for such behaviour?
I am not sure, but it may be due to the BigQuery best-effort deduplication mechanism used when streaming data to BigQuery.
From the Beam documentation:
Note: Streaming inserts by default enable the BigQuery best-effort deduplication mechanism. You can disable it by setting ignoreInsertIds. The quota limitations are different when deduplication is enabled vs. disabled.
Streaming inserts applies a default sharding for each table destination. You can use withAutoSharding (starting 2.28.0 release) to enable dynamic sharding, and the number of shards may be determined and changed at runtime. The sharding behavior depends on the runners.
From the Google Cloud documentation:
Best effort de-duplication: When you supply insertId for an inserted row, BigQuery uses this ID to support best effort de-duplication for up to one minute. That is, if you stream the same row with the same insertId more than once within that time period into the same table, BigQuery might de-duplicate the multiple occurrences of that row, retaining only one of those occurrences.
The system expects that rows provided with identical insertIds are also identical. If two rows have identical insertIds, it is nondeterministic which row BigQuery preserves.
De-duplication is generally meant for retry scenarios in a distributed system where there's no way to determine the state of a streaming insert under certain error conditions, such as network errors between your system and BigQuery or internal errors within BigQuery. If you retry an insert, use the same insertId for the same set of rows so that BigQuery can attempt to de-duplicate your data. For more information, see troubleshooting streaming inserts.
De-duplication offered by BigQuery is best effort, and it should not be relied upon as a mechanism to guarantee the absence of duplicates in your data. Additionally, BigQuery might degrade the quality of best effort de-duplication at any time in order to guarantee higher reliability and availability for your data.
If you have strict de-duplication requirements for your data, Google Cloud Datastore is an alternative service that supports transactions.
This mechanism can be disabled with ignoreInsertIds.
You can test by disabling this mechanism and check whether all the rows are inserted.
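That option is not exposed by the pre-built template itself, but in a custom Beam Python pipeline the equivalent is the ignore_insert_ids flag (check that your Beam version supports it). A minimal sketch, with placeholder table, schema and input:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateRows" >> beam.Create([{"id": 1, "value": "a"}])  # placeholder input
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my_project:my_dataset.my_table",          # placeholder table
            schema="id:INTEGER,value:STRING",
            method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
            ignore_insert_ids=True,                    # disable best-effort deduplication
        )
    )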
By adjusting the value of the batch size in the template, all files of 1M records each got loaded successfully.
I have a table in Google BigQuery in which I calculate a column (think of it as an anomaly detection column).
Is there a way, within GCP, to send a rule-based alert (e.g. once the value in the column is 1)?
If not, how would you recommend dealing with this issue?
Thanks
A solution could be to use Eventarc triggers: when data is inserted into your table, the job is written to Cloud Audit Logs, and you can use that to trigger a Cloud Run service like this.
With this Cloud Run service it's possible to inspect the column you mention and send notifications accordingly.
Here is a good reference on how to proceed.
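As a rough sketch only (the table, column name and notification channel are placeholders, not part of the original answer), the Cloud Run service could look like this:

from flask import Flask
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client()

def send_alert(rows):
    # Placeholder: plug in your own channel here (email, Slack, Pub/Sub, ...).
    print(f"ALERT: {len(rows)} anomalous row(s) detected")

@app.route("/", methods=["POST"])
def handle_insert_event():
    # Eventarc delivers the audit log event in the request body; here it is
    # only used as a trigger before re-checking the table.
    query = "SELECT * FROM `my_project.my_dataset.my_table` WHERE anomaly_column = 1"
    rows = list(bq.query(query).result())
    if rows:
        send_alert(rows)
    return ("", 204)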
Because you run a query to compute the rows and the alert to send, the solution is to get the result of this query and use it.
You can export the data to a file (CSV, for example) and then trigger a Cloud Function on the file's creation in Cloud Storage. The Cloud Function will read the file and, for each line, trigger an alert; or send only one alert with a summary of the file; or send the file as an attachment.
You can also fetch all the rows of the query result and publish a Pub/Sub message for each row. That way, you can process all the messages in parallel with Cloud Functions or Cloud Run (this time it's not possible to have only one alert with a summary of all the rows; the messages are unitary).
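A minimal sketch of that second approach, where the project, dataset, table, column and topic names are placeholders:

import json
from google.cloud import bigquery, pubsub_v1

bq = bigquery.Client()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my_project", "anomaly-alerts")  # placeholder topic

# Placeholder query: select only the rows that should raise an alert.
rows = bq.query(
    "SELECT * FROM `my_project.my_dataset.my_table` WHERE anomaly_column = 1"
).result()

# One Pub/Sub message per row; a Cloud Function or Cloud Run subscriber sends the alerts.
futures = [
    publisher.publish(topic_path, json.dumps(dict(row.items()), default=str).encode("utf-8"))
    for row in rows
]
for future in futures:
    future.result()  # wait for the publishes to complete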
I am trying to automate the entire data loading: whenever I upload a file to Google Cloud Storage, it should automatically trigger the data to be loaded into the BigQuery dataset. I know that there is a daily scheduled update available, but I want something that triggers only whenever the CSV file is re-uploaded.
You have 2 possibilities:
Either you react to the event. I mean, you can plug a function into Google Cloud Storage events. In the event message you have the file stored in GCS, and you can do what you want with it, for example run a load job from Google Cloud Storage.
Or, do nothing! Leave the file in GCS and create a BigQuery federated (external) table that reads from GCS.
With these 2 solutions, your data is accessible from BigQuery. Your Data Studio graph can query BigQuery; the data is there. However:
The load job is more efficient: you can partition and cluster your data to optimize speed and cost. However, you duplicate your data (from GCS), and you have to code and run your function. Anyway, the cost is very low and the function very simple. For big data, it's my recommended solution.
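A minimal sketch of this first option, assuming a background Cloud Function wired to the google.storage.object.finalize event (dataset and table names are placeholders):

from google.cloud import bigquery

def load_csv_to_bq(event, context):
    # Triggered for each file uploaded to the bucket; load it into BigQuery.
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                     # or provide an explicit schema
        write_disposition="WRITE_TRUNCATE",  # replace the table when the CSV is re-uploaded
    )

    load_job = client.load_table_from_uri(
        uri, "my_project.my_dataset.my_table", job_config=job_config
    )
    load_job.result()  # wait for the load to finish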
Federated tables are very useful when the quantity of data is low, and for occasional access or prototyping. You can't cluster and partition your data, and the speed is lower than with data loaded into BigQuery (because the CSV parsing is performed on the fly).
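And a sketch of the second option, defining an external (federated) table over the CSV in GCS with the Python client (names and URI are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/my-file.csv"]
external_config.autodetect = True
external_config.options.skip_leading_rows = 1

table = bigquery.Table("my_project.my_dataset.my_external_table")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)  # queries now read the CSV in place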
So, big data is a wide area: do you need to transform the data before the load? Can you transform it after the load? How do you chain the queries one after another? ...
Don't hesitate if you have other questions on this!
I'm wondering if there is a good way to export spans from Google Stackdriver to BigQuery for better analysis of traces?
The only potential solutions I'm seeing currently are writing to the trace and BigQuery APIs individually or querying the trace API on an ad hoc basis.
The first isn't great because it would require a pretty big change to the application code (I currently just use OpenCensus with Stackdriver exporter to transparently write traces to Stackdriver). The second isn't great because it's a lot of lift to query the API for spans and write them to BigQuery and it has to be done on an ad hoc basis.
A sink similar to log exporting would be great.
Unfortunately, at this moment there is no way to export Stackdriver traces to BigQuery.
I noticed that exactly the same feature was already asked to be implemented on the GCP side. The GCP product team is already aware of this feature request and is considering implementing it.
Please note that feature requests are usually not resolved immediately, as it depends on how many users are demanding the same feature. All communication regarding this feature request will be done inside the public issue tracker 1; you can also 'star' it to acknowledge that you are interested in it, but keep in mind that there is no exact ETA.
Yes. It's a best practice that I recommend.
The analysis is better: your logs are partitioned and the queries are efficient.
The log format doesn't change. The values that you log can, but not your query structure.
The logs have a limited retention period in Stackdriver. With BigQuery, you keep them for as long as you want.
It's free! At least, the sink process is. You have to pay for storage and BigQuery processing.
I have 3 pieces of advice:
Think about purging your logs to reduce storage cost. However, data older than 90 days is cheaper.
Before setting up a sink, select only the relevant log entries that you want to save to BigQuery.
Don't forget the time partitioning: logs can rapidly become huge, and an uncontrolled query expensive (see the sketch after this list).
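To illustrate the first and third points together, a partition expiration can be set on a sink table so that old log partitions are purged automatically. A sketch with the Python BigQuery client, assuming the sink writes to a partitioned table (names and retention are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.my_log_dataset.my_log_table")

if table.time_partitioning:
    table.time_partitioning = bigquery.TimePartitioning(
        type_=table.time_partitioning.type_,
        expiration_ms=180 * 24 * 60 * 60 * 1000,  # keep roughly 180 days of logs
    )
    client.update_table(table, ["time_partitioning"])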
Bonus: if you have to comply with the GDPR (RGPD) and you have personal data in the logs, be sure to list the process in your GDPR record of processing.
To export logs to BigQuery, you have to create a dataset in BigQuery; the log data is then added to tables in that dataset.
By default in GCP, all logs go via Stackdriver.
To export logs from Stackdriver to BigQuery, you have to create a logger sink using code or the GCP Logging UI.
Then create a sink and add a filter: https://cloud.google.com/logging/docs/export/configure_export_v2
Then add logs to Stackdriver using code:
import com.google.cloud.MonitoredResource;
import com.google.cloud.logging.LogEntry;
import com.google.cloud.logging.Logging;
import com.google.cloud.logging.LoggingOptions;
import com.google.cloud.logging.Payload;
import com.google.cloud.logging.Severity;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Shared Logging client and monitored resource used by writeLog().
private static final Logging logging = LoggingOptions.getDefaultInstance().getService();

private static final MonitoredResource monitoredResource =
    MonitoredResource.newBuilder("global")
        .addLabel("project_id", logging.getOptions().getProjectId())
        .build();

public static void writeLog(Severity severity, String logName, Map<String, String> jsonMap) {
  // limitMap is assumed to split the payload into chunks that respect the
  // log entry size limit (the helper is not shown in the original snippet).
  List<Map<String, String>> maps = limitMap(jsonMap);
  for (Map<String, String> map : maps) {
    LogEntry logEntry = LogEntry.newBuilder(Payload.JsonPayload.of(map))
        .setSeverity(severity)
        .setLogName(logName)
        .setResource(monitoredResource)
        .build();
    logging.write(Collections.singleton(logEntry));
  }
}
I am working on Amazon Matillion for Redshift, and we have multiple jobs running daily, triggered by SQS messages. Now I am checking the possibility of creating a UI dashboard for stakeholders which will monitor the live progress of jobs and show reports of previous jobs, like job name, tables impacted, job status/reason for failure, etc. Does Matillion maintain this kind of information implicitly? Or will I have to maintain this information for each job?
Matillion has an API which you can use to obtain details of all task history. Information on the tasks API is here:
https://redshiftsupport.matillion.com/customer/en/portal/articles/2720083-loading-task-information?b_id=8915
You can use this to pull data on either currently running jobs or completed jobs down to component level including name of job, name of component, how long it took to run, whether it ran successfully or not and any applicable error message.
This information can be pulled into a Redshift table using the Matillion API profile which comes built into the product and the API Query component. You could then build your dashboard on top of this table. For further information I suggest you reach out to Matillion via their Support Center.
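As a rough sketch only (the endpoint path, field names and credentials below are assumptions; check the tasks API documentation linked above for the exact URL on your instance), pulling task history with Python could look like:

import requests

BASE_URL = "https://my-matillion-host/rest/v1"  # placeholder host
AUTH = ("api-user", "api-password")             # placeholder credentials

# Hypothetical task-history endpoint; the real path is described in the tasks API docs.
resp = requests.get(
    f"{BASE_URL}/group/name/MyGroup/project/name/MyProject/task/history",
    params={"date": "2019-01-01"},  # the API filters task history by date
    auth=AUTH,
)
resp.raise_for_status()

for task in resp.json():
    # Field names are illustrative; inspect the actual response for your version.
    print(task.get("jobName"), task.get("state"), task.get("message"))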
The API is helpful, but you can only pass a date as a parameter (this is for Matillion for Snowflake; I assume it's the same for Redshift). I've requested the ability to pass a datetime so we can run the jobs throughout the day and not pull back the same set of records every time our API call runs.