Exporting Stackdriver traces to BigQuery - google-cloud-platform

I'm wondering if there is a good way to export spans from Google Stackdriver to BigQuery for better analysis of traces?
The only potential solutions I'm seeing currently are writing to the trace and BigQuery APIs individually or querying the trace API on an ad hoc basis.
The first isn't great because it would require a pretty big change to the application code (I currently just use OpenCensus with Stackdriver exporter to transparently write traces to Stackdriver). The second isn't great because it's a lot of lift to query the API for spans and write them to BigQuery and it has to be done on an ad hoc basis.
A sink similar to log exporting would be great.

Unfortunately, at this moment there is no way to Exporting Stackdriver traces to BigQuery.
I noticed that exactly the same feature was already asked to be implemented on GCP side. GCP product team are already aware about this feature request and considering to implement this feature.
Please note that the Feature Requests are usually not resolved immediately as it depends on how many users are demanding the same feature. All communication regarding this feature request will be done inside public issue tracker 1, also you can 'star' it to acknowledge that you are interested in it, but keep in mind that there is no exact ETA.

Yes. It's a best practice that recommend you.
the analysis is better, your logs are partitioned and the queries efficient.
the log format don't change. The values that you log can , but not your query structure
The logs have a limited retention period in stackdriver. With bigquery, you keep all the time that you want
It's free! At least, the sink process. You have to pay storage and bigquery processing
I have 3 advices:
think to purge your logs for reducing storage cost. However, data older than 90 days are cheaper.
before setting up a sink, select only the relevant logs entry that you want to save to bigquery.
don't forget the partition time, logs can be rapidly huge, and uncontrolled query expensive.
Bonus: if you have to comply to RGPD and you have personal data in logs, be sure to list the process in your RGPD log book.

You exporting logs to big-query , you have to create table in big-query and add data in bigquery.
By default in GCP, all logs go via stack driver.
To export logs in the big query from the stackdriver , you have to create Logger Sink using code or GCP logging UI
Then create a Sink, add a filter. https://cloud.google.com/logging/docs/export/configure_export_v2
hen add logs to stack driver using code
public static void writeLog(Severity severity, String logName, Map<String,
String> jsonMap) {
List<Map<String, String>> maps = limitMap(jsonMap);
for (Map<String, String> map : maps) {
LogEntry logEntry = LogEntry.newBuilder(Payload.JsonPayload.of(map))
.setSeverity(severity)
.setLogName(logName)
.setResource(monitoredResource)
.build();
logging.write(Collections.singleton(logEntry));
}
}
private static MonitoredResource monitoredResource =
MonitoredResource.newBuilder("global")
.addLabel("project_id", logging.getOptions().getProjectId())
.build();

Related

feeding real time data into aws personalize

I want to feed real time data into aws personalize to build a recommendation engine. I've read online resources and in those guides, I could see that the training user-interaction data, user data and item data is provided in the beginning while creating the recommendation engine.
However, I have an app and I will gather data in the app and want to feed those realtime data into aws personalize. I want to know if building the recommendation engine is possible without providing any data at first and then stream real time data from my app later with the putevents, putItem and putUser api from aws-sdk? I'm quite new to this so I'm quite confused with this initial step
I want to know if building the recommendation engine is possible without providing any data at first and then stream real time data from my app later with the putevents, putItem and putUser api from aws-sdk?
Yes, it is possible. You just need to adjust the sequence of creating resources.
Interaction data is required for all Personalize recipes before a recommender can be created that provides recommendations. However, if you don't have interaction data (or enough data; see quotas and limits) to start with, you can create a dataset group and an interactions dataset, feed interactions to the dataset using the PutEvents API (see recording events page), and then create a domain recommender or custom solution when enough data has been ingested.
The minimum amount of interaction data (and potentially item metadata) required before you can train a model/recommender depends on the recipe that you select. Generally speaking, you will need 1000 interactions across 25 distinct users where each of those users has 2+ interactions. The domain recommenders also require specific event types. Check the docs linked above. The quality and relevance of recommendations will improve as you collect more data and retrain.

Best way to get pub sub json data into bigquery

I am currently trying to ingest numerous types of pub sub-data (JSON format) into GCS and BigQuery using a cloud function. I am just wondering what is the best way to approach this?
At the moment I am just dumping the events to GCS (each even type is on its own directory path) and was trying to create an external table but there are issues since the JSON isn't newline delimited.
Would it be better just to write the data as JSON strings in BQ and do the parsing in BQ?
With BigQuery, you have a brand new type name JSON. It helps you to query more easily JSON data type. It could be the solution if you store your event in BigQuery.
About your questions about the use of Cloud Functions, it depends. If you have a few events, Cloud Functions are great and not so much expensive.
If you have an higher rate of event, Cloud Run can be a good alternative to leverage concurrency and to keep the cost low.
If you have million of event per hour or per minute, consider Dataflow with the pubsub to bigquery template.

How to retrive Bigquery billing details from GCP console or UI?

My team is using bigquery for our product development. Other bill of Rs 5159 got generated for one days transaction.
I checked the transaction details and
BigQuery Analysis: 15.912 Tebibytes [Currency conversion: USD to INR using rate 69.155]
Is is possible to somehow find out more details about the transactions like table name, queries that were executed and exact time of execution?
BigQuery automatically sends audit logs to Stackdriver Logging and provide the ability to do aggregated analysis on logs data. You can see BigQuery schema for exported logs for details
As quick example: Query cost breakdown by identity
This query shows estimated query costs by user identity. It estimates costs based on the list price for on-demand queries in the US. This pricing may not be accurate for other locations or for customers leveraging flat-rate billing.
#standardSQL
WITH data as
(
SELECT
protopayload_auditlog.authenticationInfo.principalEmail as principalEmail,
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent AS jobCompletedEvent
FROM
`MYPROJECTID.MYDATASETID.cloudaudit_googleapis_com_data_access_YYYYMMDD`
)
SELECT
principalEmail,
FORMAT('%9.2f',5.0 * (SUM(jobCompletedEvent.job.jobStatistics.totalBilledBytes)/POWER(2, 40))) AS Estimated_USD_Cost
FROM
data
WHERE
jobCompletedEvent.eventName = 'query_job_completed'
GROUP BY principalEmail
ORDER BY Estimated_USD_Cost DESC
As of last year BigQuery provides INFORMATION_SCHEMA tables that also give access to job information via JOBS_BY_* views. The INFORMATION_SCHEMA.JOBS_BY_USER and INFORMATION_SCHEMA.JOBS_BY_PROJECT views even include the exact query alongside the processed bytes. It might not be 100% accuracte (because bytes processed != bytes billed) but it should allow you to gain a good overview over your costs, which queries triggered them an who the initiator was.
Example
SELECT
creation_time,
job_id,
project_id,
user_email,
total_bytes_processed,
query
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
The most "efficient" way to keep an eye on the cost is using the INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION view as it automatically includes all projects of the organization. You need to be Organization Owner or Organization Administrator to use that view, though.
From there you can figure out which jobs were the most expensive (= get their job id) and form there drill down via JOBS_BY_PROJECT to get the exact query.
See https://www.pascallandau.com/bigquery-snippets/monitor-query-costs/ for a more comprehensive explanation.
You need to Export Billing Data to BigQuery
Tools for monitoring, analyzing and optimizing cost have become an important part of managing development. Billing export to BigQuery enables you to export your daily usage and cost estimates automatically throughout the day to
export data to a CSV,JSON file
However, if you use regular file export, you should be aware that regular file export captures a smaller dataset than export to BigQuery. For more information about regular file export and the data it captures, see Export Billing Data to a File.
to a BigQuery dataset you specify.
After you enable BigQuery export, it might take a few hours to start seeing your data. Billing data automatically exports your data to BigQuery in regular intervals, but the frequency of updates in BigQuery varies depending on the services you're using. Note that BigQuery loads are ACID compliant, so if you query the BigQuery billing export dataset while data is being loaded into it, you will not encounter partially loaded data.
Follow the step by step guide: How to enable billing export to BigQuery
https://cloud.google.com/billing/docs/how-to/export-data-bigquery

Can we schedule StackDriver Logging to export log?

I am new to Google Cloud StackDriver Logging and as per this documentation StackDriver stores the Data Access audit logs for 30 days. Also mentioned on the same page, that Size of a log entry is limited to 100KB.
I am aware of the fact that the logs can be exported Google Cloud Storage using Cloud SDK as well as using Logging Libraries in many languages (we prefer Python).
I have two questions related to the exporting the logs, which are:
Is there any way in StackDriver to schedule something similar to a task or cronjob that keeps exporting the Logs in the Google Cloud storage automatically after a fixed interval of time?
What happens to the log entries which are larger than 100KB. I assume they get truncated. Is my assumption correct? If yes, is there any way in which we can export/view the full(which is not at all truncated) Log entry?
Is there any way in StackDriver to schedule something similar to a
task or cronjob that keeps exporting the Logs in the Google Cloud
storage automatically after a fixed interval of time?
Stackdriver supports exporting log data via sinks. There is no schedule that you can set as everything is automatic. Basically, the data is exported as soon as possible and you have no control over the amount exported at each sink or the delay between exports. I have never found this to be an issue. Logging, by design, is not to be used as a real-time system. The closest is to sink to PubSub which has a couple of second delay (based upon my experience).
There are two methods to export data from Stackdriver:
Create an export sink. Supported destinations are BigQuery, Cloud Storage and PubSub. The log entries will be written to the destination automatically. You can then use tools to process the exported entries. This is the recommended method.
Write your own code in Python, Java, etc. to read the log entries and do what you want with them. Scheduling is up to you. This method is manual and requires your management of schedule and destination.
What happens to the log entries which are larger than 100KB. I assume
they get truncated. Is my assumption correct? If yes, is there any way
in which we can export/view the full(which is not at all truncated)
Log entry?
Entries that exceed the max size of an entry cannot be written to Stackdriver. The API call that attempts to create the entry will fail with an error message similar to (Python error message):
400 Log entry with size 113.7K exceeds maximum size of 110.0K
This means that entries that are too large will be discarded unless the writer has logic to handle this case.
As per the documentation of stack driver logging the whole process is automatic. Export sink to google cloud storage is slower than the Bigquery and Cloud sub/pub. link for the documentation
I recently used the export sink to the big query, which is better than cloud pub/sub in case if you don't want to use other third-party application for log analysis. For Bigquery sink needs dataset where do you want to store the log entries. I noticed that sink create bigquery table on a timestamp basis in the bigquery dataset.
One more thing if you want to query timestamp partitioned tables check this link
Legacy SQL Functions and Operators

Loading data (incrementally) into Amazon Redshift, S3 vs DynamoDB vs Insert

I have a web app that needs to send reports on its usage, I want to use Amazon RedShift as a data warehouse for that purpose,
How should i collect the data ?
Every time, the user interact with my app, i want to report that.. so when should i write the files to S3 ? and how many ?
What i mean is:
- If do not send the info immediately, then I might lose it as a result of a connection lost, or from some bug in my system while its been collected and get ready to be sent to S3...
- If i do write files to S3 on each user interaction, i will end up with hundreds of files (on each file has minimal data), that need to be managed, sorted, deleted after been copied to RedShift.. that dose not seems like a good solution .
What am i missing? Should i use DynamoDB instead, Should i use simple insert into Redshift instead !?
If i do need to write the data to DynamoDB, should i delete the hold table after been copied .. what are the best practices ?
On any case what are the best practices to avoid data duplication in RedShift ?
Appreciate the help!
It is preferred to aggregate event logs before ingesting them into Amazon Redshift.
The benefits are:
You will use the parallel nature of Redshift better; COPY on a set of larger files in S3 (or from a large DynamoDB table) will be much faster than individual INSERT or COPY of a small file.
You can pre-sort your data (especially if the sorting is based on event time) before loading it into Redshift. This is also improve your load performance and reduce the need for VACUUM of your tables.
You can accumulate your events in several places before aggregating and loading them into Redshift:
Local file to S3 - the most common way is to aggregate your logs on the client/server and every x MB or y minutes upload them to S3. There are many log appenders that are supporting this functionality, and you don't need to make any modifications in the code (for example, FluentD or Log4J). This can be done with container configuration only. The down side is that you risk losing some logs and these local log files can be deleted before the upload.
DynamoDB - as #Swami described, DynamoDB is a very good way to accumulate the events.
Amazon Kinesis - the recently released service is also a good way to stream your events from the various clients and servers to a central location in a fast and reliable way. The events are in order of insertion, which makes it easy to load it later pre-sorted to Redshift. The events are stored in Kinesis for 24 hours, and you can schedule the reading from kinesis and loading to Redshift every hour, for example, for better performance.
Please note that all these services (S3, SQS, DynamoDB and Kinesis) allow you to push the events directly from the end users/devices, without the need to go through a middle web server. This can significantly improve the high availability of your service (how to handle increased load or server failure) and the cost of the system (you only pay for what you use and you don't need to have underutilized servers just for logs).
See for example how you can get temporary security tokens for mobile devices here: http://aws.amazon.com/articles/4611615499399490
Another important set of tools to allow direct interaction with these services are the various SDKs. For example for Java, .NET, JavaScript, iOS and Android.
Regarding the de-duplication requirement; in most of the options above you can do that in the aggregation phase, for example, when you are reading from a Kinesis stream, you can check that you don't have duplications in your events, but analysing a large buffer of events before putting into the data store.
However, you can do this check in Redshift as well. A good practice is to COPY the data into a staging tables and then SELECT INTO a well organized and sorted table.
Another best practice you can implement is to have a daily (or weekly) table partition. Even if you would like to have one big long events table, but the majority of your queries are running on a single day (the last day, for example), you can create a set of tables with similar structure (events_01012014, events_01022014, events_01032014...). Then you can SELECT INTO ... WHERE date = ... to each of this tables. When you want to query the data from multiple days, you can use UNION_ALL.
One option to consider is to create time series tables in DynamoDB where you create a table every day or week in DynamoDB to write every user interaction. At the end of the time period (day, hour or week), you can copy the logs on to Redshift.
For more details, on DynamoDB time series table see this pattern: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns
and this blog:
http://aws.typepad.com/aws/2012/09/optimizing-provisioned-throughput-in-amazon-dynamodb.html
For Redshift DynamoDB copy: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/RedshiftforDynamoDB.html
Hope this helps.
Though there is already an accepted answer here, AWS launched a new service called Kinesis Firehose which handles the aggregation according to user defined intervals, a temporary upload to s3 and the upload (SAVE) to redshift, retries and error handling, throughput management,etc...
This is probably the easiest and most reliable way to do so.
You can write data to CSV file on local disk and then run Python/boto/psycopg2 script to load data to Amazon Redshift.
In my CSV_Loader_For_Redshift I do just that:
Compress and load data to S3 using boto Python module and multipart upload.
conn = boto.connect_s3(AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(bucket_name)
k = Key(bucket)
k.key = s3_key_name
k.set_contents_from_file(file_handle, cb=progress, num_cb=20,
reduced_redundancy=use_rr )
Use psycopg2 COPY command to append data to Redshift table.
sql="""
copy %s from '%s'
CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
DELIMITER '%s'
FORMAT CSV %s
%s
%s
%s;""" % (opt.to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,opt.delim,quote,gzip, timeformat, ignoreheader)
Just being a little selfish here and describing exactly what Snowplow ,an event analytics platform does. They use this awesome unique way of collecting event logs from the client and aggregating it on S3.
They use Cloudfront for this. What you can do is, host a pixel in one of the S3 buckets and put that bucket behind a CloudFront distribution as an origin. Enable logs to an S3 bucket for the same CloudFront.
You can send logs as url parameters whenever you call that pixel on your client (similar to google analytics). These logs can then be enriched and added to Redshift database using Copy.
This solves the purpose of aggregation of logs. This setup will handle all of that for you.
You can also look into Piwik which is an open source analytics service and see if you can modify it specific to your needs.