Can I query Logs Explorer directly from a cloud function? - google-cloud-platform

For a monitoring project I'm using the Logs Router to send log data to a BigQuery table so I can then query that table from Cloud Functions. Would it be possible to query Logs Explorer directly from Cloud Functions (i.e. without having to replicate my logs to BigQuery)?
Thanks

Yes, of course you can. There are even client libraries for that. However, keep in mind that, by default, your logs are kept for only 30 days. That may or may not be enough, depending on your use case.
You can create a custom log bucket with a different retention period, or sink the logs into BigQuery.
The main advantage of BigQuery is the ability to join the log data with other data in BigQuery and perform powerful analytics computations. But it still depends on your use case.
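For reference, a minimal sketch of querying logs from a Cloud Function with the Cloud Logging client library (google-cloud-logging) might look like the following. The project ID, timestamp, and filter are placeholders; the filter syntax is the same one you use in the Logs Explorer UI:

```python
# A minimal sketch, assuming the google-cloud-logging client library.
# The project ID, timestamp, and filter below are placeholder values.
from google.cloud import logging

client = logging.Client(project="my-project")  # hypothetical project ID

# Same filter syntax as in the Logs Explorer UI: Cloud Function error logs
# newer than a given RFC 3339 timestamp.
log_filter = (
    'resource.type="cloud_function" '
    'AND severity>=ERROR '
    'AND timestamp>="2023-01-01T00:00:00Z"'
)

# Iterate over matching entries, newest first.
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.severity, entry.payload)
```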

Related

Transfer/Replicate Data periodically from AWS Documentdb to Google Cloud Big Query

We are building a customer-facing app. For this app, data is being captured by IoT devices owned by a 3rd party and is transferred to us from their server via API calls. We store this data in our AWS DocumentDB cluster. The user app is connected to this cluster with real-time data feed requirements. Note: the data is time series data.
The thing is, for long-term data storage and for creating analytics dashboards to share with stakeholders, our data governance folks are asking us to replicate/copy the data daily from the AWS DocumentDB cluster to their Google Cloud Platform -> BigQuery. We could then run queries directly on BigQuery to perform analysis and send the data to maybe explorer or Tableau to create dashboards.
I couldn't find any straightforward solutions for this. Any ideas, comments or suggestions are welcome. How do I achieve or plan the above replication? And how do I make sure the data is copied efficiently in terms of memory and pricing? Also, I don't want to disturb the performance of AWS DocumentDB since it supports our user-facing app.
This solution would need some custom implementation. You can use Change Streams and process the data changes at intervals to send them to BigQuery, giving you a data replication mechanism for running analytics. One of the documented use cases for Change Streams is analytics with Redshift, so BigQuery should serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
This document also contains a sample Python code for consuming change streams events.
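As a very rough sketch of that pattern (not AWS's sample code), a worker could consume the change stream with pymongo and stream batches into BigQuery with the google-cloud-bigquery client. The connection string, database/collection, table ID, and field names below are placeholders, and checkpointing with resume tokens and error handling are omitted:

```python
# A rough sketch only: consume the DocumentDB change stream with pymongo and
# stream batches into BigQuery. URI, db/collection, table ID, and field names
# are placeholders; checkpointing with resume tokens and error handling omitted.
from pymongo import MongoClient
from google.cloud import bigquery

mongo = MongoClient("mongodb://user:pass@my-docdb-cluster:27017/?tls=true")  # placeholder URI
collection = mongo["iot"]["readings"]                                        # hypothetical db/collection

bq = bigquery.Client()
table_id = "my-project.analytics.iot_readings"                               # hypothetical table

# Change streams must be enabled on the DocumentDB collection beforehand.
with collection.watch(full_document="updateLookup") as stream:
    batch = []
    for change in stream:
        doc = change.get("fullDocument") or {}
        batch.append({
            "device_id": doc.get("deviceId"),
            "reading": doc.get("value"),
            "ts": str(doc.get("timestamp")),
        })
        if len(batch) >= 500:  # flush in batches rather than row by row
            errors = bq.insert_rows_json(table_id, batch)
            if errors:
                print("BigQuery insert errors:", errors)
            batch = []
```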

How to query AWS load balancer log if there are terabytes of logs?

I want to query the AWS load balancer logs to automatically send a report for me on a schedule.
I am using Amazon Athena and AWS Lambda to trigger Athena. I created a data table based on the guide here: https://docs.aws.amazon.com/athena/latest/ug/application-load-balancer-logs.html
However, I encounter the following issues:
The logs bucket grows in size day by day, and I notice that if an Athena query needs more than 5 minutes to return a result, it sometimes produces an "unknown error".
The maximum timeout for an AWS Lambda function is only 15 minutes, so I cannot keep increasing the Lambda function timeout to wait for Athena (in case Athena needs more than 15 minutes to return a result, for example).
Can you suggest a better solution to my problem? I am thinking of using the ELK stack, but I have no experience working with ELK. Can you show me the advantages and disadvantages of ELK compared to the AWS Lambda + AWS Athena combo? Thank you!
First off, you don't need to keep your Lambda running while the Athena query executes. StartQueryExecution returns a query identifier that you can then poll with GetQueryExecution to determine when the query finishes.
Of course, that doesn't work so well if you're invoking the query as part of a web request, but I recommend not doing that. And, unfortunately, I don't see that Athena is tied into CloudWatch Events, so you'll have to poll for query completion.
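A minimal sketch of that start-then-poll pattern with boto3 might look like this; the query, database, and output location are placeholders:

```python
# A minimal sketch of the start-then-poll pattern with boto3; the query,
# database, and output location are placeholders.
import time
import boto3

athena = boto3.client("athena")

start = athena.start_query_execution(
    QueryString="SELECT elb, count(*) AS requests FROM alb_logs GROUP BY elb",  # example query
    QueryExecutionContext={"Database": "logs_db"},                              # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},          # placeholder bucket
)
query_id = start["QueryExecutionId"]

# Poll for completion instead of blocking a Lambda for the whole query duration.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```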
With that out of the way, the problem with reading access logs from Athena is that they aren't easy to partition. The example that AWS provides defines the table inside Athena, and the default partitioning scheme uses S3 paths with /column=value/ segments. However, ALB access logs use a simpler yyyy/mm/dd partitioning scheme.
If you use AWS Glue, you can define a table format that uses this simpler scheme. I haven't done that, so I can't give you information other than what's in the docs.
Another alternative is to limit the amount of data in your bucket. This can save on storage costs as well as reduce query times. I would do something like the following:
Bucket_A is the destination for access logs, and the source for your Athena queries. It has a life-cycle policy that deletes logs after 30 (or 45, or whatever) days.
Bucket_B is set up to replicate logs from Bucket_A (so that you retain everything, forever). It immediately transitions all replicated files to "infrequent access" storage, which cuts the cost in half.
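The Bucket_B replication and storage-class transition are probably easiest to configure in the console, but as a rough sketch, the Bucket_A expiration rule could be set with boto3 along these lines (the bucket name and retention window are placeholders):

```python
# A rough sketch of the Bucket_A expiration rule with boto3; the bucket name
# and 30-day window are placeholders to adapt.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="bucket-a-access-logs",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-access-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},    # apply to the whole bucket
                "Expiration": {"Days": 30},  # delete access logs after 30 days
            }
        ]
    },
)
```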
Elasticsearch is certainly a popular option. You'll need to convert the files in order to upload them. I haven't looked, but I'm sure there's a Logstash plugin that will do so. Depending on what you're looking to do for reporting, Elasticsearch may be better or worse than Athena.

AWS Log Aggregator on the Cheap

Our CIO had a heart attack upon seeing our AWS bill.
I need to aggregate Apache and Tomcat logs from multiple EC2 instances (in a scaling group) -- what would be the best way to set this up without breaking the bank? The goal of the logs is to view events by IP address and account name and to follow transaction flows (diagnostic/audit logging -- not so much performance metrics).
ELK is out of the equation (political). Cloudwatch is allowed + anything else.
It depends on volume and access patterns, but pushing the logs to S3 and using Athena to query them is a good shout.
It's cheap because S3 is a really cheap datastore, and Athena is serverless, meaning you only pay for the queries you run.
Make sure you convert the logs to a compressed data format (like Apache Parquet) to save even more dosh; a rough sketch of that conversion follows the links below.
https://aws.amazon.com/athena
https://docs.aws.amazon.com/athena/latest/ug/querying-apache-logs.html
https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
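As a rough sketch of that conversion (assuming pandas and pyarrow are installed, and an Apache combined-format log), you could parse the raw lines into columns and write snappy-compressed Parquet before uploading to S3. The regex, paths, and bucket below are placeholders to adapt:

```python
# A rough sketch: parse raw Apache "combined"-style log lines into columns and
# write snappy-compressed Parquet. Assumes pandas and pyarrow are installed;
# the regex, paths, and bucket below are placeholders to adapt.
import pandas as pd

log_pattern = (
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

# Read the raw lines (example path) and split them into named columns.
with open("access.log") as f:
    raw = pd.Series(f.readlines())
parsed = raw.str.extract(log_pattern)

# Columnar, compressed Parquet means Athena scans far less data per query.
parsed.to_parquet("access.snappy.parquet", compression="snappy")

# Then upload with boto3 (bucket and key are placeholders):
# import boto3
# boto3.client("s3").upload_file("access.snappy.parquet", "my-log-bucket", "parquet/access.snappy.parquet")
```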
My arguments against S3/Athena would be that S3 may be the cheapest storage mechanism, but how will you get the logs off your box and into S3? I'm not aware of any AWS agents that do this, but there may be some commercial or open source projects that can. Also, there is some setup required to get Athena to work for searching, such as defining schemas and/or setting up AWS Glue Crawlers to discover the data. You'll often find that Glue Crawlers won't be that great at identifying log data if it isn't in something like JSON format.
I would highly recommend CloudWatch. AWS has created a CloudWatch agent, available for multiple OSs, that will pull and forward your logs from your EC2 instances. CloudWatch also has some free search tools and now the more powerful CloudWatch Logs Insights to help you search your data in a way similar to what other first-class log aggregators allow.
CloudWatch pricing is also pretty cheap. It's only $0.50/GB ingested and $0.02/GB for long-term storage (in us-east-1 at least). And there is no charge to use the CloudWatch agent, which is the biggest advantage, as you don't have to invent and test a new way to pull logs off of your boxes.
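For what it's worth, a minimal sketch of running a Logs Insights query programmatically with boto3 might look like this; the log group name and query string are placeholders for whatever the agent ships:

```python
# A minimal sketch of a Logs Insights query via boto3; the log group name and
# query string are placeholders for whatever the agent ships to CloudWatch.
import time
import boto3

logs = boto3.client("logs")

query = logs.start_query(
    logGroupName="/ec2/apache/access",  # hypothetical log group
    startTime=int(time.time()) - 3600,  # last hour, epoch seconds
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, @message "
        "| filter @message like /203.0.113.7/ "  # example: events for one IP
        "| sort @timestamp desc "
        "| limit 50"
    ),
)

# Insights queries run asynchronously, so poll until the query completes.
while True:
    result = logs.get_query_results(queryId=query["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```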

Developing a simple cloud data processing system

What would be the most simple way to take a public data API, for example, and schedule a daily job to calculate a set of statistics and land the computed statistics in a cloud database?
What about using a CloudWatch Events rule with a Schedule Expression and a Lambda function as the target?
The event rule would trigger your Lambda function, e.g., once a day. The function would then call the API, process the data it returns, and write the results into a DynamoDB or RDS database, depending on whether you require a relational or a NoSQL database. A minimal sketch of such a function is below.
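```python
# A minimal sketch of such a handler; the API URL, response shape, statistics,
# and DynamoDB table name are all hypothetical.
import json
import statistics
import urllib.request
from datetime import date

import boto3

table = boto3.resource("dynamodb").Table("daily_stats")  # hypothetical table

def handler(event, context):
    # Call the public data API (placeholder URL and JSON shape).
    with urllib.request.urlopen("https://api.example.com/prices") as resp:
        values = json.loads(resp.read())["prices"]

    # Compute the day's statistics and store them; numbers are stored as
    # strings here because boto3 does not accept Python floats directly.
    item = {
        "day": date.today().isoformat(),
        "mean": str(statistics.mean(values)),
        "max": str(max(values)),
        "min": str(min(values)),
    }
    table.put_item(Item=item)
    return item
```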
Same on GCP: you can use Cloud Scheduler for the periodic trigger and have it call a Cloud Function that computes your statistics.
You can use Firestore for storing your data in document format.
The free tiers of these products are generous, and if your processing is simple and not running full time, you should pay nothing. A sketch of the GCP variant follows.
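```python
# The same idea sketched for GCP: an HTTP-triggered Cloud Function that
# Cloud Scheduler calls once a day and that writes into Firestore. The API URL
# and collection name are placeholders.
import json
import statistics
import urllib.request
from datetime import date

from google.cloud import firestore

db = firestore.Client()

def compute_stats(request):
    # Call the public data API (placeholder URL and JSON shape).
    with urllib.request.urlopen("https://api.example.com/prices") as resp:
        values = json.loads(resp.read())["prices"]

    # Store one document per day with the computed statistics.
    doc = {
        "mean": statistics.mean(values),
        "max": max(values),
        "min": min(values),
    }
    db.collection("daily_stats").document(date.today().isoformat()).set(doc)
    return "ok"
```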

Google Stackdriver custom metrics - data retention period

I'm using GCP Stackdriver custom metrics and created a few dashboard graphs to show the traffic on the system. The problem is that the graph system keeps the data for only a few weeks - not forever.
From the Stackdriver documentation:
See Quotas and limits for limits on the number of custom metrics and the number of active time series, and for the data retention period. If you wish to keep your metric data beyond the retention period, you must manually copy the data to another location, such as Cloud Storage or BigQuery.
Let's decide to work with Cloud Storage as a container to store data for the long term.
Questions:
How does this "manual data copy" is working? Just write the same data into two places (Google storage and Stackdrive)?
How the stackdrive is connecting the storage and generating graph of it?
You can use Stackdriver's Logs Export feature to export your logs into any of three sinks: Google Cloud Storage, BigQuery, or a Pub/Sub topic. Here are the instructions on how to export Stackdriver logs. You are not writing logs to two places in real time; you are exporting logs based on the filters you set.
One thing to keep in mind is that you will not be able to use Stackdriver graphs or alerting tools with the exported logs.
In addition, if you export logs into BigQuery, you can plug in a Data Studio graph to see your metrics.
You can also do this with a Cloud Storage export, but it's less immediate and less handy.
I'd suggest this guide on creating a pipeline to export metrics to BigQuery for long-term storage and analytics; a rough sketch of the idea follows the link below.
https://cloud.google.com/solutions/stackdriver-monitoring-metric-export
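As a very rough illustration of that pipeline (not the guide's actual code), a scheduled job could read recent time series with the google-cloud-monitoring client and stream them into BigQuery. The project ID, metric type, and table ID below are placeholders, and a DOUBLE-valued custom metric is assumed:

```python
# A very rough illustration of that pipeline (not the guide's actual code):
# read recent time series with the Cloud Monitoring client and stream them
# into BigQuery. Project, metric type, and table ID are placeholders, and a
# DOUBLE-valued custom metric is assumed.
import time

from google.cloud import bigquery, monitoring_v3

project = "projects/my-project"                          # hypothetical project
metric_type = "custom.googleapis.com/traffic/requests"   # hypothetical custom metric

monitoring = monitoring_v3.MetricServiceClient()
bq = bigquery.Client()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 24 * 3600},  # last 24 hours
    }
)

rows = []
for series in monitoring.list_time_series(
    request={
        "name": project,
        "filter": f'metric.type = "{metric_type}"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
):
    for point in series.points:
        rows.append(
            {
                "metric": metric_type,
                "end_time": point.interval.end_time.isoformat(),
                "value": point.value.double_value,
            }
        )

if rows:
    errors = bq.insert_rows_json("my-project.metrics.custom_metrics", rows)  # hypothetical table
    if errors:
        print("BigQuery insert errors:", errors)
```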