Which AWS service should be used for storing log4j logs? - amazon-web-services

(Please feel free to mark this question as duplicate and share pointer to duplicates.)
Hi,
We are developing spring boot based application and will be using docker in production.
Currently it is using MongoDB (Atlas) for storing its log. Looks like MongoDB Cloud will be an expensive option to store logs/audit trails.
Since we are going to use AWS, which AWS service should we use to store Log4J Logs and audit messages?

Usually people do store logs in S3, where you can archive logs with a combination of infrequent access and Glacier for reasonable money, and you can also apply some life-cycle policy so the logs are automatically removed after a defined amount of time.
If you are looking for some kind of streaming/logging over a network, you may start with some AWS Lambda functions or SQS or you may want to go with some kind of service like https://aws.amazon.com/kinesis/data-firehose/ if you believe that you are really big.
The other advantage of S3 (besides the lowest price) is that most of the other services support reading data from S3. So if you decide later that you want to analyze data with ElasticSearch or an Elastic Map-Reduce cluster, you will probably have some way to do it.
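The tiering and expiry the answer describes, as an added example (not code from the answer or from AWS): it builds the lifecycle rule dict that boto3's `put_bucket_lifecycle_configuration` accepts and lets you check which tier a log object will sit in at a given age. The prefix and day counts are hypothetical.

```python
"""Build and sanity-check an S3 lifecycle configuration for log archival.

Added example, not code from the answer above. Prefix and day counts are
hypothetical; the returned dict has the shape boto3's
put_bucket_lifecycle_configuration(LifecycleConfiguration=...) accepts.
"""

MIN_DAYS_BEFORE_IA = 30  # S3 rejects STANDARD_IA transitions earlier than this


def build_log_lifecycle(prefix, ia_days, glacier_days, expire_days):
    """Tier logs Standard -> Standard-IA -> Glacier, then delete them."""
    if ia_days < MIN_DAYS_BEFORE_IA:
        raise ValueError(f"STANDARD_IA transition must be >= {MIN_DAYS_BEFORE_IA} days")
    if not ia_days < glacier_days < expire_days:
        raise ValueError("need ia_days < glacier_days < expire_days")
    return {
        "Rules": [{
            "ID": "archive-" + prefix.strip("/").replace("/", "-"),
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Transitions": [
                {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                {"Days": glacier_days, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": expire_days},
            # Abandoned multipart uploads are billed until they are cleaned up.
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]
    }


def storage_class_at(config, key, age_days):
    """Where a log object with this key sits after age_days under config.

    Returns the storage class name, or "EXPIRED" once the expiration rule
    has fired. Keys outside every rule's prefix stay in STANDARD forever.
    """
    for rule in config["Rules"]:
        if rule["Status"] != "Enabled" or not key.startswith(rule["Filter"]["Prefix"]):
            continue
        if age_days >= rule["Expiration"]["Days"]:
            return "EXPIRED"
        current = "STANDARD"
        for t in sorted(rule["Transitions"], key=lambda t: t["Days"]):
            if age_days >= t["Days"]:
                current = t["StorageClass"]
        return current
    return "STANDARD"
```

Apply it with `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket=bucket, LifecycleConfiguration=build_log_lifecycle("logs/app/", 30, 90, 365))`; the expiry day count is your retention policy, so pick it with whoever owns audit requirements.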

Related

Transfer/Replicate Data periodically from AWS Documentdb to Google Cloud Big Query

We are building a customer facing App. For this app, data is being captured by IoT devices owned by a 3rd party, and is transferred to us from their server via API calls. We store this data in our AWS Documentdb cluster. We have the user App connected to this cluster with real time data feed requirements. Note: The data is time series data.
The thing is, for long term data storage and for creating analytic dashboards to be shared with stakeholders, our data governance folks are requesting us to replicate/copy the data daily from the AWS Documentdb cluster to their Google cloud platform -> Big Query. And then we can directly run queries on BigQuery to perform analysis and send data to maybe explorer or tableau to create dashboards.
I couldn't find any straightforward solutions for this. Any ideas, comments or suggestions are welcome. How do I achieve or plan the above replication? And how do I make sure the data is copied efficiently - memory and pricing? Also, don't want to disturb the performance of AWS Documentdb since it supports our user facing App.
This solution would need some custom implementation. You can utilize Change Streams and process the data changes in intervals to send to Big Query, so there is a data replication mechanism in place for you to run analytics. One of the use cases of using Change Streams is for analytics with Redshift, so Big Query should serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
This document also contains a sample Python code for consuming change streams events.
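The replication mechanism the answer sketches, as an added example that is not the sample code from the linked AWS document: change events are flattened into warehouse rows and the resume token is only advanced after the batch has been handed off. The events are constructed dicts in the shape pymongo's `watch()` yields; `sink` stands in for the BigQuery insert and the returned token for whatever store you checkpoint it in.

```python
"""Consume DocumentDB change stream events and feed them to an analytics sink.
Added example, not the sample code from the AWS document linked above.
Events are constructed dicts in the shape pymongo yields from watch();
`sink` stands in for a BigQuery insert and the returned token for wherever
the resume token is persisted (a small DynamoDB item or S3 object, say)."""


def event_to_row(event):
    """Flatten one change event into a row that records the operation and time.
    Deletes become tombstones keyed by documentKey; updates need fullDocument,
    which DocumentDB only fills when watch() was opened with updateLookup."""
    op, key, ts = event["operationType"], event["documentKey"]["_id"], event.get("clusterTime")
    if op == "delete":
        return {"_id": key, "_op": op, "_ts": ts}
    doc = event.get("fullDocument")
    if doc is None:
        raise ValueError(f"{op} for {key!r} lacks fullDocument; reopen with updateLookup")
    return {**doc, "_op": op, "_ts": ts}


def drain(events, sink, checkpoint=None):
    """Convert a batch, hand it to sink, then return the resume token to persist.
    The token advances only after sink() returns, so a crash replays the batch
    instead of losing it; make the warehouse load idempotent (merge on _id/_ts)."""
    rows = [event_to_row(ev) for ev in events]
    if rows:
        sink(rows)
        checkpoint = events[-1]["_id"]
    return checkpoint


# Constructed events: one insert, one update, one delete on the same document.
SAMPLE_EVENTS = [
    {"_id": {"_data": "tok1"}, "operationType": "insert", "clusterTime": 1,
     "ns": {"db": "iot", "coll": "readings"}, "documentKey": {"_id": "r1"},
     "fullDocument": {"_id": "r1", "device": "d-42", "temp_c": 21.5}},
    {"_id": {"_data": "tok2"}, "operationType": "update", "clusterTime": 2,
     "ns": {"db": "iot", "coll": "readings"}, "documentKey": {"_id": "r1"},
     "updateDescription": {"updatedFields": {"temp_c": 22.0}, "removedFields": []},
     "fullDocument": {"_id": "r1", "device": "d-42", "temp_c": 22.0}},
    {"_id": {"_data": "tok3"}, "operationType": "delete", "clusterTime": 3,
     "ns": {"db": "iot", "coll": "readings"}, "documentKey": {"_id": "r1"}},
]
```

In the real consumer, pass the persisted token back as `resume_after` when reopening the stream so a restart continues where the last successful batch ended.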

AWS Log Aggregator on the Cheap

Our CIO had a heart attack upon seeing our AWS bill.
I need to aggregate Apache and Tomcat logs from multiple EC2 (in scaling group) -- what could be the best way to initiate this without breaking the bank? The goal of the logs is to view events by IP address, account names, view the transaction flows (diagnostic/audit logging -- not so much as performance metrics).
ELK is out of the equation (political). Cloudwatch is allowed + anything else.
Depends on volume and access patterns, but pushing the logs to S3 and using Athena to query them is a good shout.
It's cheap because S3 is a really cheap datastore, and Athena is serverless, meaning you only pay for the queries you run.
Make sure you convert the logs to a compressed data format (like Apache Parquet) to save even more dosh.
https://aws.amazon.com/athena
https://docs.aws.amazon.com/athena/latest/ug/querying-apache-logs.html
https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
My arguments against S3/Athena would be that S3 may be the cheapest storage mechanism, but how will you get the logs off your box and into S3? I'm not aware of any AWS agents that do this, but there may be some commercial or open source projects to do it. Also, there is some setup required to get Athena to work for searching, such as defining schemas and/or setting up AWS Glue Crawlers to discover data. You'll often find that Glue Crawlers won't be that great at identifying log data if it's not in something like JSON format.
I would highly recommend CloudWatch. AWS has created a CloudWatch agent that is available for multiple OSs that will pull and forward your logs from your EC2 instances. CloudWatch also has some free searching tools and now the more powerful CloudWatch Insights tool to help you search your data in a way similar to what other first-class log aggregators allow.
CloudWatch pricing is also pretty cheap. It's only $0.50/GB ingested and $0.02/GB long term storage (in us-east-1 at least). And there is no charge to use the CloudWatch agent which is the biggest advantage as you don't have to invent and test a new way to pull logs off of your boxes.
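The conversion step both answers touch on (compress the logs, and give Athena or Glue something structured rather than raw text), as an added example that is not code from either answer: it parses Apache combined-format lines into JSON Lines with the fields the question wants to search on (IP, user, request, status) and gzips the result for upload. The log lines are constructed samples; the regex follows Apache's default `combined` LogFormat.

```python
"""Apache combined-format access log lines -> gzip'd JSON Lines.
Added example, not code from either answer; sample lines are constructed.
Regex follows Apache's default "combined" LogFormat; request lines Apache
logs as "-" (e.g. 408 timeouts) do not match and are counted as skipped."""
import gzip
import json
import re
from datetime import datetime

COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

SAMPLE_LINES = [  # constructed: an authenticated page view, a failed login, junk
    '203.0.113.7 - alice [10/Oct/2000:13:55:36 -0700] "GET /account/settings HTTP/1.1" '
    '200 2326 "https://example.test/" "Mozilla/5.0"',
    '198.51.100.23 - - [10/Oct/2000:13:56:01 -0700] "POST /login HTTP/1.1" 401 - "-" "curl/7.68.0"',
    "garbage line that does not parse",
]


def parse_line(line):
    """Return a flat dict for one access log line, or None if it is malformed."""
    m = COMBINED_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    rec = m.groupdict()
    rec["user"] = None if rec["user"] == "-" else rec["user"]  # %u is "-" when unauthenticated
    rec["status"] = int(rec["status"])
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    # ISO 8601 with offset so Athena can cast the column to timestamp directly.
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z").isoformat()
    return rec


def to_jsonl_gz(lines):
    """Serialise parsable lines as gzip'd JSON Lines; return (bytes, skipped_count)."""
    records, skipped = [], 0
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            skipped += 1  # report this; silently dropping lines hides gaps in the audit trail
            continue
        records.append(json.dumps(rec))
    return gzip.compress(("\n".join(records) + "\n").encode()), skipped


def from_jsonl_gz(data):
    """Inverse of to_jsonl_gz, handy for spot-checking an object before upload."""
    return [json.loads(l) for l in gzip.decompress(data).decode().splitlines() if l]
```

Write the bytes to an S3 key partitioned by date (for example `logs/apache/dt=2000-10-10/host-1.json.gz`) so an Athena table with a `dt` partition only scans the days a query asks for.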

How does Amazon Rekognition handle data privacy?

My question is a little bit general: we want to build a solution based on Amazon Rekognition, but we want to make sure that Amazon doesn't keep our data after the process is completed, for example. When I use the detect_text function in boto3 like this:
response = client.detect_text(Image={'Bytes': images_bytes})
After I get the response, what happens to the images_bytes that have been uploaded to Amazon for processing? Are they automatically destroyed, or does Amazon keep them locally?
It is unlikely that AWS would be able to provide you with specific details of how it implements a service.
AWS does have various security certifications that might address your general question of how customer data is handled.
See: Cloud Compliance - Amazon Web Services (AWS)

Using DynamoDB to replace logfiles

We are hosting our services in AWS Beanstalk managed instances. That is forcing us to move away from file-based logging to use database-based logging.
Is DynamoDB a good choice for replacing file-based logging? If so, what should be the primary key? I thought of using the timestamp, but multiple messages may be logged by the same service within the same timestamp, so that might not be reliable.
Any advice would be appreciated.
Don't use DynamoDB to store logs. You'll be paying for throughput and space needlessly.
Amazon CloudWatch has built-in logging capabilities.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatchLogs.html
Another alternative is a dedicated logging service such as Loggly which is cloud-based and can receive logs in many common formats, plus they have an API to send custom logs. In the web-based console, you can search and filter through the logs.
As an alternative, why don't you use cloudwatch? I ended up writing a whole app to consolidate logs across ec2 instances in a beanstalk app, then last year AWS opened up cloudwatch as a service, so I junked my stuff. You tell cloudwatch where your logs are on the instance, give it a log group and stream name, and all your logs are consolidated in one spot, in cloudwatch. You can also run alarms off them using the standard AWS setup. It's pretty slick, and easy - don't have to write a front end to do lookups, it's already there.
Don't know what you're using for logging - we are a node.js shop, used winston for logging, and there is a nice NPM module that works with Winston to log automatically, called winston-cloudwatch.

Can I monitor the usage of individual directories with AWS CloudWatch?

I'm developing a platform where users will in effect have their own site within a directory of my own. Each user site will consist of a package of PHP scripts and the template/image files for their site's custom layout. Each user site will be connected to their own Amazon RDS. I need to be able to track the resource usage of each directory so that I can bill each user for the resources they have used. Would it be possible to set up custom metrics with CloudWatch so that I can calculate costs?
You should be able to use CloudWatch to do this; however, it might not be the most efficient place to put this information if you are going to bill or report on it. I think you are better off computing the data and then storing it in a database of your own. This way you have easy access to the data and you can do things with the data that may not work well in the context of CloudWatch.