Amazon Macie to read database data - amazon-web-services

I am doing some POC in Amazon Macie. I got from the documentation that it identifies PII data like credit card. Even I ran an example where I put some valid credit card numbers in CSV and put into S3 bucket and was identified by Macie.
I want to know if the same PII data is under some database backup/dump file, which is in S3 bucket. Will Macie be able to identify? I didn't find anything in the documentation.

So a couple of things are important here
Macie can only handle certain types of files and certain compression formats
If you specify S3 buckets that include files of a format that isn't supported in Macie, Macie doesn't classify them.
Compression formats
https://docs.aws.amazon.com/macie/latest/userguide/macie-compression-archive-formats.html
Encrypted Objects
Macie can only handle certain types of encrypted Amazon S3 objects
See the following link for more details:
https://docs.aws.amazon.com/macie/latest/userguide/macie-integration.html#macie-encrypted-objects
Macie Limits
Macie has a default limit on the amount of data that it can classify in an account. After this data limit is reached, Macie stops classifying the data. The default data classification limit is 3 TB. This can be increased if requested.
Macie's content classification engine processes up to the first 20 MB of an S3 object.
So specifically if you dump is compressed but in a suitable format inside the compression then yes Macie can classify, but on an important note it will only classify the first 20 MB of the file which is a problem if the file is large.
Typically I use lambda to split a large file into files just under 20 MB. You still need to think if you have X number of files how do you take a record from a file that has been classified as PII and map it back into something that is useable.

Related

Splunk migration to S3 DataLake

We're looking at moving away from Splunk as our datastore and looking at AWS Data Lake backed by S3.
What would be the process of migrating data from Splunk to S3? I've read lots of documents talking about archiving data from Splunk to S3 but not sure if this archives the data as a usable format OR if its in some archive format that needs to be restored to splunk itself?
Check out Splunk's SmartStore feature. It moves your non-hot buckets to S3 so you save storage costs. Running SmartStore on AWS only makes sense, however, if you run Splunk on AWS. Otherwise, the data export charges will bankrupt you. Data export applies when Splunk needs to search a bucket that's stored in S3 and so copies that bucket to an indexer. See https://docs.splunk.com/Documentation/Splunk/8.0.0/Indexer/AboutSmartStore for more information.
From what I've read there are a couple of ways to do it:
Export using the Web UI
Export using REST API Endpoint
Export using CLI
Copy certain files in the filesystem
So far I've tried using the CLI to export and I've managed to export around 500,000 events at a time using
splunk search "index=main earliest=11/11/2019:00:00:01 latest=11/15/2019:23:59:59" -output rawdata -maxout 500000 > output2.dmp
However - I'm not sure how I can accurately repeat this step to make sure I include all 100 million+ events. IE search from DATE A to DATE B for 500,000 records, then search from DATE B to DATE C for the next 500,000 - without missing any events inbetween.

Are parquet files splittable when stored in AWS S3?

I know that parquet files are splittable if they are stored in block storage. E.g stored on HDFS
Are they also splittable when stored in object storage such as AWS s3?
This confuses me because, object storage is supposed to be atomic. You either access the entire file or none of the file. You can't even change meta data on an S3 file without rewriting the entire file. On the other hand, AWS reccomends using splittable file formats in S3 to improve the performance of Athena and other frameworks in the hadoop ecosystem.
Yes, Parquet files are splittable.
S3 supports positioned reads (range requests), which can be used to read only selected portions of the input file (object).
I'm not 100% sure what you mean here, but generally (I think), you have parquet partition on partition keys and save columns into blocks of rows. When I have used in it AWS S3 it has saved like:
|-Folder
|--Partition Keys
|---Columns
|----Rows_1-100.snappy.parquet
|----Rows_101-200.snappy.parquet
This handles the splitting efficiencies you mention.

Storing many small files (on S3)?

I have 2 million zipped HTML files (100-150KB) being added each day that I need to store for a long time.
Hot data (70-150 million) is accessed semi regularly, anything older than that is barely ever accessed.
This means each day I'm storing an additional 200-300GB worth of files.
Now, Standard storage costs $0.023 per GB and $0.004 for Glacier.
While Glacier is cheap, the problem with it is that it has additional costs, so it would be a bad idea to dump 2 million files into Glacier:
PUT requests to Glacier $0.05 per 1,000 requests
Lifecycle Transition Requests into Glacier $0.05 per 1,000 requests
Is there a way of gluing the files together, but keeping them accessible individually?
An important point, that if you need to provide quick access to these files, then Glacier can give you access to the file in up to 12 hours. So the best you can do is to use S3 Standard – Infrequent Access (0,0125 USD per GB with millisecond access) instead of S3 Standard. And maybe for some really not using data Glacier. But it still depends on how fast do you need that data.
Having that I'd suggest following:
as html (text) files have a good level of compression, you can compress historical data in big zip files (daily, weekly or monthly) as together they can have even better compression;
make some index file or database to know where each html-file is stored;
read only desired html-files from archives without unpacking whole zip-file. See example in python how to implement that.
Glacier would be extremely cost sensitive when it comes to the number of files. The best method would be to create a Lambda function that handles zip, unzip operations for you.
Consider this approach:
Lambda creates archive_date_hour.zip of the 2 Million files from that day by hour, this solves the "per object" cost problem by creating 24 giant archival files.
Set a policy on the s3 bucket to move expired objects to glacier over 1 day old.
Use an unzipping Lambda function to fetch and extract potential hot items from the glacier bucket from within the zip files.
Keep the main s3 bucket for hot files with high frequent access, as a working directory for the zip/unzip operations, and for collecting new files daily
Your files are just too small. You will need to combine them probably in an ETL pipeline such as glue. You can also use the Range header i.e. -range bytes=1000-2000 to download part of an object on S3.
If you do that you'll need to figure out the best way to track the bytes ranges, such as after combining the files recording the range for each one, and changing the clients to use the range as well.
The right approach though depends on how this data is accessed and figuring out the patterns. If somebody who looks at TinyFileA also looks at TinyFileB you could combine them together and just send them both along with other files they are likely to use. I would be figuring out logical groupings of files which make sense to consumers and will reduce the number of requests they need, without sending too much irrelevant data.

Doubts using Amazon S3 monthly calculator

I'm using Amazon S3 to store videos and some audios (average size of 25 mb each) and users of my web and android app (so far) can access them with no problem but I want to know how much I'll pay later exceeding the free stage of S3 so I checked the S3 monthly calculator.
I saw that there is 5 fields:
Storage: I put 3 gb cause right now there are 130 files (videos and audios)
PUT/COPY/POST/LIST Requests: I put 15 cause I'll upload manually around 10-15 files each month
GET/SELECT and Other Requests: I put 10000 cause a projection tells me that the users will watch/listen those files around 10000 times monthly
Data Returned by S3 Select: I put 250 Gb (10000 x 25 mb)
Data Scanned by S3 Select: I don't know what to put cause I don't need that amazon scans or analyze those files.
Am I using that calculator in a proper way?
What do I need to put in "Data Scanned by S3 Select"?
Can I put only zero?
For audio and video, you can definitely specify 0 for S3 Select -- both data scanned and data returned.
S3 Select is an optional feature that only works with certain types of text files -- like CSV and JSON -- where you make specific requests for S3 to scan through the files and return matching values, rather than you downloading the entire file and filtering it yourself.
This would not be used with audio or video files.
Also, don't overlook "Data transfer out." In addition to the "get" requests, you're billed for bandwidth when files are downloaded, so this needs to show the total size of all the downloads. This line item is data downloaded from S3 via the Internet.

s3 vs dynamoDB for gps data

I have the following situation that I try to find the best solution for.
A device writes its GPS coordinates every second to a csv file and uploads the file every x minutes to s3 before starting a new csv.
Later I want to be able to get the GPS data for a specific time period e.g 2016-11-11 8am until 2016-11-11 2pm
Here are two solutions that I am currently considering:
Use a lambda function to automatically save the csv data to a dynamoDB record
Only save the metadata (csv gps timestamp-start, timestamp-end, s3Filename) in dynamoDB and then request the files directly from s3.
However both solutions seem to have a major drawback:
The gps data uses about 40 bytes per record (second). So if I use 10min chunks this will result in a 24 kB file. dynamoDB charges write capacities by item size (1 write capacity unit = 1 kB). So this would require 24 units for a single write. Reads (4kB/unit) are even worse since a user may request timeframes greater than 10 min. So for a request covering e.g. 6 hours (=864kB) it would require a read capacity of 216. This will just be too expensive considering multiple users.
When I read directly from S3 I face the browser limiting the number of concurrent requests. The 6 hour timespan for instance would cover 36 files. This might still be acceptable, considering a connection limit of 6. But a request for 24 hours (=144 files) would just take too long.
Any idea how to solve the problem?
best regards, Chris
You can avoid using DynamoDB altogether if the S3 keys contain the date in a reasonable format (e.g. ISO: deviceid_2016-11-27T160732). This allows you to find the correct files by listing the object keys: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html.
(If you can not control the naming, you could use a Lambda function to rename the files.)
Number of requests is an issue, but you could try to put a CloudFront distribution in front of it and utilize HTTP/2, which allows the browser to request multiple files over the same connection.
Have you considered using AWS Firehose? Your data will be periodically shovelled into Redshift which is like Postgres. You just pump a JSON formatted or a | delimited record into an AWS Firehose end-point and the rest is magic by the little AWS elves.