Scanning S3 buckets for auditing purposes - amazon-web-services

What is the best way to scan data in S3 (for auditing purposes, possibly)? I was asked to do some research on this, and using AWS Athena was the first idea I could think of. But if you can provide more knowledge/ideas, I'd appreciate it.
Thanks!

You want to use Amazon Macie:
Amazon Macie is a security service that uses machine learning to automatically discover, classify, and protect sensitive data in AWS. Amazon Macie recognizes sensitive data such as personally identifiable information (PII) or intellectual property, and provides you with dashboards and alerts that give visibility into how this data is being accessed or moved.
Video: AWS Summit Series 2017 - New York: Introducing Amazon Macie
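If you want to kick off a scan programmatically rather than from the console, a minimal boto3 sketch against the Macie2 API might look like the following; the bucket name, account ID, and job name are placeholders, not a definitive setup:

# Sketch: enable Macie and start a one-time sensitive-data discovery job (placeholder names).
import boto3

macie = boto3.client('macie2', region_name='us-east-1')

# One-time account setup; raises an error if Macie is already enabled, so this may be skipped.
macie.enable_macie()

response = macie.create_classification_job(
    jobType='ONE_TIME',
    name='audit-scan-example',                      # placeholder job name
    s3JobDefinition={
        'bucketDefinitions': [
            {'accountId': '111122223333', 'buckets': ['my-audit-bucket']}   # placeholders
        ]
    }
)
print(response['jobId'], response['jobArn'])

The findings then show up in the Macie dashboards and can also be routed to EventBridge or Security Hub for alerting.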

Related

AWS QLDB multi region support

I have been working with QLDB for the last 3 months in a single region, using it as a ledger database.
Now, the business wants to move the applications to multi-region support.
I found that many AWS services support multi-region, like DynamoDB and Secrets Manager,
but there are limitations on QLDB for multi-region use.
I saw in some AWS articles that QLDB does not have multi-region support, as it is not a distributed technology.
Now, to cater to the business requirement with minimal changes in code, I have two approaches/workarounds for QLDB to support multi-region:
Do I need to create a region-based ledger, with the same functionality? I understand there are major challenges with maintaining geo-based traffic.
I could keep the QLDB ledger in a single region and give cross-region access permissions to the Lambda functions that access it. It's the simplest one but adds latency.
Which approach helps in the long term and with scalability? Or please suggest if anyone has a different approach to achieve this.
Do I need to create a region-based ledger, with the same functionality? I understand there are major challenges with maintaining geo-based traffic.
Yes. At this moment, like you said, there is no multi-region (or "global", in AWS jargon) support, so you need to create region-based ledgers on your own.
to cater to the business requirement with minimal changes in code
You can achieve cross-region replication as mentioned in the docs:
Amazon QLDB does not support cross-region replication as of now. QLDB's export to S3 feature enables customers to export the contents of the QLDB journal to an S3 bucket. The S3 buckets can be configured for cross-region replication.
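For illustration, a hedged boto3 sketch of that export-then-replicate pattern; the ledger name, bucket, prefix, time window, and role ARN are all placeholders, and the bucket itself is assumed to already have cross-region replication configured:

# Sketch: export the QLDB journal to an S3 bucket that has CRR enabled (placeholder values).
from datetime import datetime, timezone
import boto3

qldb = boto3.client('qldb', region_name='us-east-1')

response = qldb.export_journal_to_s3(
    Name='my-ledger',                                    # placeholder ledger name
    InclusiveStartTime=datetime(2021, 1, 1, tzinfo=timezone.utc),
    ExclusiveEndTime=datetime.now(timezone.utc),
    S3ExportConfiguration={
        'Bucket': 'my-ledger-exports',                   # bucket replicated to the other region
        'Prefix': 'journal-exports/',
        'EncryptionConfiguration': {'ObjectEncryptionType': 'SSE_S3'}
    },
    RoleArn='arn:aws:iam::111122223333:role/qldb-export-role'
)
print(response['ExportId'])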
Side note:
I could keep the QLDB ledger in a single region and give cross-region access permissions to the Lambda functions that access it. It's the simplest one but adds latency.
If your business wants multi-region support this option would not satisfy their conditions.

AWS DataSync vs. AWS Transfer Family

I read the documentation from the official website, but it does not give me a clear picture.
Why would we need to use AWS Transfer Family, since AWS DataSync can also achieve the same result?
I notice the protocol differences, but am not quite sure about the data migration use case.
Why would we pick one over the other?
Why would we need to use AWS Transfer Family, since AWS DataSync can also achieve the same result?
It depends on what you mean by achieving the same result.
If it is transferring data to & from AWS then - yes both achieve the same result.
However, the main difference is that AWS Transfer Family is practically an always-on server endpoint enabled for SFTP, FTPS, and/or FTP.
If you need to maintain compatibility for current users and applications that use SFTP, FTPS, and/or FTP, then using AWS Transfer Family is a must, as that ensures the contract is not broken and you can continue to use them without any modifications. Existing transfer workflows for your end users are preserved, and existing client-side configurations are maintained.
On the other hand, AWS DataSync is ideal for transferring data between on-premises & AWS or between AWS storage services. A few use-cases that AWS suggests are migrating active data to AWS, archiving data to free up on-premises storage capacity, replicating data to AWS for business continuity, or transferring data to the cloud for analysis and processing.
At the core, both can be used to transfer data to & from AWS but serve different business purposes.
Your exact question is answered in the AWS DataSync FAQs:
Q: When do I use AWS DataSync and when do I use AWS Transfer Family?
A: If you currently use SFTP to exchange data with third parties, AWS Transfer Family provides a fully managed SFTP, FTPS, and FTP transfer directly into and out of Amazon S3, while reducing your operational burden.
If you want an accelerated and automated data transfer between NFS servers, SMB file shares, self-managed object storage, AWS Snowcone, Amazon S3, Amazon EFS, and Amazon FSx for Windows File Server, you can use AWS DataSync. DataSync is ideal for customers who need online migrations for active data sets, timely transfers for continuously generated data, or replication for business continuity.
Also see: AWS Transfer Family FAQs - Q: Why should I use the AWS Transfer Family?
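To illustrate the "always-on endpoint" point, here is a hedged boto3 sketch of standing up a Transfer Family SFTP server with service-managed users; the values are placeholders, and users with S3-backed home directories still need to be created separately:

# Sketch: create a managed SFTP endpoint with AWS Transfer Family (placeholder values).
import boto3

transfer = boto3.client('transfer', region_name='us-east-1')

server = transfer.create_server(
    Protocols=['SFTP'],
    IdentityProviderType='SERVICE_MANAGED',   # users and SSH keys managed by Transfer Family
    EndpointType='PUBLIC'
)
print(server['ServerId'])

# Existing SFTP clients keep working unchanged; uploaded files land in S3 once users
# are added (see create_user in the Transfer Family docs).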

AWS Log Aggregator on the Cheap

Our CIO had a heart attack upon seeing our AWS bill.
I need to aggregate Apache and Tomcat logs from multiple EC2 instances (in a scaling group) -- what could be the best way to do this without breaking the bank? The goal of the logs is to view events by IP address and account name, and to view transaction flows (diagnostic/audit logging -- not so much performance metrics).
ELK is out of the equation (political). CloudWatch is allowed, plus anything else.
It depends on volume and access patterns, but pushing the logs to S3 and using Athena to query them is a good shout.
It's cheap because S3 is a really cheap datastore, and Athena is serverless, meaning you only pay for the queries you run.
Make sure you convert the logs to a compressed data format (like Apache Parquet) to save even more dosh.
https://aws.amazon.com/athena
https://docs.aws.amazon.com/athena/latest/ug/querying-apache-logs.html
https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
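To give a feel for the S3 + Athena route, here is a hedged boto3 sketch. The database, table, and column names and the results bucket are made up, and it assumes a table over the Apache logs has already been defined (e.g. following the Apache-logs guide linked above):

# Sketch: run an ad-hoc Athena query over logs already catalogued in S3 (placeholder names).
import time
import boto3

athena = boto3.client('athena', region_name='us-east-1')

query = """
SELECT client_ip, COUNT(*) AS hits
FROM logs_db.apache_access_logs
WHERE request_date = DATE '2021-06-01'
GROUP BY client_ip
ORDER BY hits DESC
LIMIT 20
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'logs_db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/'}
)
query_id = execution['QueryExecutionId']

# Athena queries run asynchronously: poll until finished, then fetch the rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    for row in athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])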
My argument against S3/Athena would be that S3 may be the cheapest storage mechanism, but how will you get the logs off your box and into S3? I'm not aware of any AWS agents that do this, but there may be some commercial or open-source projects that do. Also, there is some setup required to get Athena to work for searching, such as defining schemas and/or setting up AWS Glue Crawlers to discover the data. You'll often find that Glue Crawlers won't be that great at identifying log data if it's not in something like JSON format.
I would highly recommend CloudWatch. AWS has created a CloudWatch agent, available for multiple OSs, that will pull and forward logs from your EC2 instances. CloudWatch also has some free search tools, and now the more powerful CloudWatch Logs Insights tool, to help you search your data in a way similar to what other first-class log aggregators allow.
CloudWatch pricing is also pretty cheap. It's only $0.50/GB ingested and $0.02/GB for long-term storage (in us-east-1 at least). And there is no charge to use the CloudWatch agent, which is the biggest advantage, as you don't have to invent and test a new way to pull logs off of your boxes.
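For example, once the CloudWatch agent is shipping the Apache/Tomcat logs, a Logs Insights query can be run straight from boto3. This is only a hedged sketch; the log group name, IP address, and query string are placeholders:

# Sketch: search a log group with CloudWatch Logs Insights (placeholder log group and query).
import time
import boto3

logs = boto3.client('logs', region_name='us-east-1')

query = logs.start_query(
    logGroupName='/ec2/apache/access',
    startTime=int(time.time()) - 3600,     # last hour, in epoch seconds
    endTime=int(time.time()),
    queryString='fields @timestamp, @message | filter @message like /192.0.2.10/ | sort @timestamp desc | limit 50'
)

# Insights queries are asynchronous: poll until the query completes.
while True:
    results = logs.get_query_results(queryId=query['queryId'])
    if results['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

for row in results['results']:
    print({field['field']: field['value'] for field in row})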

Which AWS service should be used for storing log4j logs?

(Please feel free to mark this question as a duplicate and share pointers to the duplicates.)
Hi,
We are developing a Spring Boot based application and will be using Docker in production.
Currently it is using MongoDB (Atlas) for storing its logs. It looks like MongoDB Cloud will be an expensive option for storing logs/audit trails.
Since we are going to use AWS, which AWS service should we use to store Log4j logs and audit messages?
Usually people store logs in S3, where you can archive them with a combination of Infrequent Access and Glacier for reasonable money, and you can also apply a lifecycle policy so the logs are automatically removed after a defined amount of time.
If you are looking for some kind of streaming/logging over a network, you may start with some AWS Lambda functions or SQS, or you may want to go with a service like https://aws.amazon.com/kinesis/data-firehose/ if you believe that you are really big.
The other advantage of S3 (besides the lowest price) is that most of the other services support reading data from S3. So if you decide later that you want to analyze the data with Elasticsearch or an Elastic MapReduce cluster, you will probably have a way to do it.
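As a concrete sketch of that lifecycle idea, here is a boto3 example; the bucket name, prefix, and timings are placeholders rather than a recommendation:

# Sketch: move a log prefix to Infrequent Access after 30 days, Glacier after 90, delete after 365.
import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='my-app-logs-bucket',             # placeholder bucket name
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'archive-and-expire-logs',
                'Filter': {'Prefix': 'log4j/'},
                'Status': 'Enabled',
                'Transitions': [
                    {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                    {'Days': 90, 'StorageClass': 'GLACIER'}
                ],
                'Expiration': {'Days': 365}
            }
        ]
    }
)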

How does Amazon Rekognition handle data privacy?

My question is a little bit general: we want to build a solution based on Amazon Rekognition, but we want to make sure that Amazon doesn't keep our data after the process is completed, for example. Say I use the detect_text function in boto3 like this:
response = client.detect_text(Image={'Bytes': images_bytes})
After I get the response, what happens to the images_bytes that were uploaded to Amazon for processing? Are they automatically destroyed, or does Amazon keep them?
It is unlikely that AWS would be able to provide you with specific details of how it implements a service.
AWS does have various security certifications that might address your general question of how customer data is handled.
See: Cloud Compliance - Amazon Web Services (AWS)