Amazon Glacier - is there any compression?

I'm using the Glacier app on my QNAP NAS and was backing up a folder, and I'm wondering whether Glacier uses any file compression. The folder's original size is 41.5GB, but in the Glacier Management Console I see this container/vault holds 36.5GB. Is that correct and due to compression, or did my NAS just not back up everything? How can I verify the files' integrity?
The question is about Amazon Glacier, not the NAS drive. However, I don't know how the app works, and if compression is not implemented in the app itself, I just want to know whether Glacier itself compresses data on the fly or not.
I couldn't google this info anywhere.
Many thanks, Peter.

Glacier doesn't compress data.
There are two different definitions of "gigabyte": a binary one (1024 x 1024 x 1024 bytes, properly called "gibibytes" and abbreviated GiB, though sometimes casually called "gigabytes" and abbreviated GB) and one based on metric prefixes (1000 x 1000 x 1000 bytes, properly called "gigabytes" and abbreviated GB).
I don't think that explains the discrepancy, here, since 41.5 gigabytes = 38.65 gibibytes. Closer, but probably not close enough.
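If you want to double-check that arithmetic, here's the conversion as a tiny Python snippet (no AWS calls involved):

```python
# Convert the folder size reported in decimal gigabytes to binary gibibytes.
folder_gb = 41.5                     # size reported by the NAS, in GB (10**9 bytes)
folder_bytes = folder_gb * 10**9
folder_gib = folder_bytes / 2**30    # 1 GiB = 1024 * 1024 * 1024 bytes

print(f"{folder_gb} GB = {folder_gib:.2f} GiB")  # prints: 41.5 GB = 38.65 GiB
```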
Googling turned up some QNAP documentation suggesting that QNAP itself supports compression and sparse-file detection, either of which might have reduced the size of your backup, and that QNAP saves each file as its own archive within a vault, which should give you something to go on as far as validating the backup.

Related

Amazon AWS S3 Glacier: is there a file hierarchy

Does Amazon AWS S3 Glacier support some semblance of file hierarchy inside a Vault for Archives?
For example, in AWS S3, objects are given hierarchy via /. For example: all_logs/some_sub_category/log.txt
I am storing multiple .tar.gz files, and would like:
All files in the same Vault
Within the Vault, files are grouped into several categories (as opposed to flat structure)
I could not find how to do this documented anywhere. If file hierarchy inside S3 Glacier is possible, can you provide brief instructions for how to do so?
Does Amazon AWS S3 Glacier support some semblance of file hierarchy inside a Vault for Archives?
No, there's no hierarchy other than "archives exist inside a vault".
For example, in AWS S3, objects are given hierarchy via /. For example: all_logs/some_sub_category/log.txt
This is actually incorrect.
S3 doesn't have any inherent hierarchy. The character / is absolutely no different than any other character valid for the key of an S3 Object.
The S3 Console — and most S3 client tools, including AWS's CLI — treat the / character in a special way. But notice that it is a client-side thing. The client will make sure that listing happens in such a way that a / behaves as most people would expect, that is, as a "hierarchy separator".
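To make the client-side behaviour concrete, here's a minimal boto3 sketch (bucket and prefix names are made up) showing that it's the Delimiter parameter that produces the "folder" illusion:

```python
import boto3

s3 = boto3.client("s3")

# The bucket and prefix names here are made up for illustration.
# Passing Delimiter="/" is what makes "/" behave like a folder separator;
# the service itself just sees flat keys.
resp = s3.list_objects_v2(
    Bucket="my-log-bucket",
    Prefix="all_logs/",
    Delimiter="/",
)

# "Subfolders" come back as CommonPrefixes...
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])        # e.g. all_logs/some_sub_category/

# ...and objects directly under the prefix come back as Contents.
for obj in resp.get("Contents", []):
    print(obj["Key"])
```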
If file hierarchy inside S3 Glacier is possible, can you provide brief instructions for how to do so?
You need to keep track of your hierarchy separately. For example, when you store an archive in Glacier, you could write metadata about that archive in a database (RDS, DynamoDB, etc).
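As a rough illustration (not a prescription), here's what such a metadata record might look like with boto3 and DynamoDB; the table name, key schema and attribute names are all invented for this sketch:

```python
import boto3

# Everything here (table name, key schema, attribute names) is invented for
# this sketch; the point is just that the "hierarchy" lives in your metadata
# store, not in Glacier.
table = boto3.resource("dynamodb").Table("glacier-archive-index")

# Record the logical path you want alongside the vault/archive identifiers
# returned when you uploaded the archive.
table.put_item(
    Item={
        "logical_path": "all_logs/some_sub_category/logs-2020-01.tar.gz",
        "vault_name": "my-vault",
        "archive_id": "ARCHIVE-ID-RETURNED-BY-GLACIER",
        "size_bytes": 1048576,
    }
)

# Later, resolve a logical path back to the archive you need to retrieve.
resp = table.get_item(Key={"logical_path": "all_logs/some_sub_category/logs-2020-01.tar.gz"})
print(resp["Item"]["archive_id"])
```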
As a side note, be careful about .tar.gz in Glacier, especially if you're talking about (1) a very large archive (2) that is composed of a large number of small individual files (3) which you may want to access individually.
If those conditions are met (and in my experience they often are in real-world scenarios), then using .tar.gz will often lead to excessive costs when retrieving data.
The reason is that you pay per request as well as per size of request. So while having one huge .tar.gz file may reduce your costs in terms of number of requests, the fact that gzip uses DEFLATE, which is a non-splittable compression algorithm, means that you'll have to retrieve the entire .tar.gz archive, decompress it, and finally get the one file that you actually want.
An alternative approach that solves the problem I described above — and that, at the same time, relates back to your question and my answer — is to actually first gzip the individual files, and then tar them together. The reason this solves the problem is that when you tar the files together, the individual files actually have clear bounds inside the tarball. And then, when you request a retrieval from glacier, you can request only a range of the archive. E.g., you could say, "Glacier, give me bytes between 105MB and 115MB of archive X". That way you can (1) reduce the total number of requests (since you have a single tar file), and (2) reduce the total size of the requests and storage (since you have compressed data).
Now, to know which range you need to retrieve, you'll need to store metadata somewhere — usually the same place where you will keep your hierarchy! (like I mentioned above, RDS, DynamoDB, Elasticsearch, etc).
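Here's a rough Python sketch of the "gzip first, then tar" idea, including how you might capture the byte range of each member to store as metadata. The file names are placeholders, and offset_data is a long-standing but undocumented attribute of tarfile's TarInfo objects:

```python
import gzip
import shutil
import tarfile

# File names are placeholders. Gzip each file individually, then tar the
# .gz files together, so every member keeps clear byte boundaries inside
# the tarball.
files = ["log-001.txt", "log-002.txt"]
gz_names = []
for name in files:
    gz_name = name + ".gz"
    with open(name, "rb") as src, gzip.open(gz_name, "wb") as dst:
        shutil.copyfileobj(src, dst)
    gz_names.append(gz_name)

with tarfile.open("archive.tar", "w") as tar:   # plain tar, no outer compression
    for gz_name in gz_names:
        tar.add(gz_name)

# Record where each member lives inside the tar; this is the metadata you
# would keep in RDS/DynamoDB so you can later ask Glacier for just that
# byte range of the archive.
with tarfile.open("archive.tar", "r") as tar:
    for member in tar.getmembers():
        start = member.offset_data
        end = member.offset_data + member.size - 1
        print(member.name, f"bytes={start}-{end}")
```

(If I remember correctly, Glacier's ranged retrievals also have to be megabyte-aligned, so in practice you'd round each stored range outward to megabyte boundaries when requesting it.)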
Anyways, just an optimization that could save a tremendous amount of money in the future (and I've worked with a ton of customers who wasted a lot of money because they didn't know about this).

Amazon Macie to read database data

I am doing a POC with Amazon Macie. I understand from the documentation that it identifies PII data such as credit card numbers. I even ran an example where I put some valid credit card numbers in a CSV, uploaded it to an S3 bucket, and Macie identified them.
I want to know: if the same PII data is inside a database backup/dump file that sits in an S3 bucket, will Macie be able to identify it? I didn't find anything about this in the documentation.
So a couple of things are important here:
Macie can only handle certain types of files and certain compression formats
If you specify S3 buckets that include files of a format that isn't supported in Macie, Macie doesn't classify them.
Compression formats
https://docs.aws.amazon.com/macie/latest/userguide/macie-compression-archive-formats.html
Encrypted Objects
Macie can only handle certain types of encrypted Amazon S3 objects
See the following link for more details:
https://docs.aws.amazon.com/macie/latest/userguide/macie-integration.html#macie-encrypted-objects
Macie Limits
Macie has a default limit on the amount of data that it can classify in an account. After this data limit is reached, Macie stops classifying the data. The default data classification limit is 3 TB. This can be increased if requested.
Macie's content classification engine processes up to the first 20 MB of an S3 object.
So, specifically: if your dump is compressed but the format inside the compression is one Macie supports, then yes, Macie can classify it. Importantly, though, it will only classify the first 20 MB of the file, which is a problem if the file is large.
Typically I use Lambda to split a large file into files just under 20 MB (a rough sketch follows). You still need to think about how, if you have X number of files, you take a record from a file that has been classified as PII and map it back to something that is usable.
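As a rough illustration of that split step (not the actual Lambda, and with placeholder paths), something like this keeps each piece just under 20 MB while splitting on record boundaries:

```python
import os

# Paths and the chunk limit are placeholders for this sketch. Splitting on
# line boundaries keeps individual records intact in each piece.
CHUNK_LIMIT = 19 * 1024 * 1024   # stay safely under Macie's 20 MB window

def split_file(path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    part, size, out = 0, 0, None
    with open(path, "rb") as src:
        for line in src:
            # Start a new part when the next line would push us over the limit.
            if out is None or size + len(line) > CHUNK_LIMIT:
                if out:
                    out.close()
                part += 1
                size = 0
                out = open(os.path.join(out_dir, f"part_{part:04d}.csv"), "wb")
            out.write(line)
            size += len(line)
    if out:
        out.close()

split_file("dump.csv", "dump_parts")
```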

Storing many small files (on S3)?

I have 2 million zipped HTML files (100-150KB each) being added each day that I need to store for a long time.
Hot data (70-150 million files) is accessed semi-regularly; anything older than that is barely ever accessed.
This means each day I'm storing an additional 200-300GB worth of files.
Now, Standard storage costs $0.023 per GB and $0.004 for Glacier.
While Glacier is cheap, the problem with it is that it has additional costs, so it would be a bad idea to dump 2 million files into Glacier:
PUT requests to Glacier $0.05 per 1,000 requests
Lifecycle Transition Requests into Glacier $0.05 per 1,000 requests
Is there a way of gluing the files together, but keeping them accessible individually?
An important point: if you need quick access to these files, Glacier can take up to 12 hours to give you access to a file. So the best you can do is use S3 Standard – Infrequent Access ($0.0125 per GB, with millisecond access) instead of S3 Standard, and maybe Glacier for data that really isn't used. But it still depends on how fast you need that data.
Given that, I'd suggest the following:
as HTML (text) files compress well, you can pack historical data into big zip files (daily, weekly or monthly); together they compress even better;
keep an index file or database recording which archive each HTML file is stored in;
read only the desired HTML files from the archives without unpacking the whole zip file; see the Python sketch below for one way to implement that.
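A minimal sketch of that last point, with placeholder archive and member names:

```python
import zipfile

# Archive and member names are placeholders; the index/database from the
# previous step is what tells you which zip holds which file.
with zipfile.ZipFile("2019-06.zip") as archive:
    with archive.open("pages/article-12345.html") as member:
        html = member.read()

print(len(html), "bytes read without extracting the rest of the archive")
```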
Glacier is extremely cost-sensitive when it comes to the number of files. The best method would be to create a Lambda function that handles the zip/unzip operations for you.
Consider this approach:
Lambda creates an archive_date_hour.zip of that day's 2 million files, batched by hour; this solves the "per object" cost problem by creating 24 giant archive files per day.
Set a lifecycle policy on the S3 bucket to transition objects older than 1 day to Glacier (a sketch of such a rule follows this list).
Use an unzipping Lambda function to fetch and extract potentially hot items from within the zip files stored in Glacier.
Keep the main S3 bucket for hot, frequently accessed files, as a working directory for the zip/unzip operations, and for collecting new files daily.
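For the lifecycle step, a rule along these lines would do it; this is a boto3 sketch with an invented bucket name and prefix:

```python
import boto3

s3 = boto3.client("s3")

# Bucket name and prefix are invented; the rule transitions the daily zip
# archives to Glacier once they are older than one day.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "daily-zips-to-glacier",
                "Filter": {"Prefix": "daily-zips/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```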
Your files are just too small. You will probably need to combine them, for example in an ETL pipeline such as Glue. You can also use the Range header (i.e. bytes=1000-2000) to download part of an object on S3.
If you do that, you'll need to figure out the best way to track the byte ranges, such as recording the range for each file after combining them, and changing the clients to use the range as well.
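For illustration, a ranged read with boto3 looks like this (bucket, key and byte range are placeholders; the real range would come from your index):

```python
import boto3

s3 = boto3.client("s3")

# Bucket, key, and byte range are placeholders; the real range would come
# from whatever index you built when combining the files.
resp = s3.get_object(
    Bucket="my-archive-bucket",
    Key="combined/2023-01-15.bin",
    Range="bytes=1000-2000",
)
chunk = resp["Body"].read()
print(len(chunk))   # 1001 bytes: the range is inclusive on both ends
```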
The right approach though depends on how this data is accessed and figuring out the patterns. If somebody who looks at TinyFileA also looks at TinyFileB you could combine them together and just send them both along with other files they are likely to use. I would be figuring out logical groupings of files which make sense to consumers and will reduce the number of requests they need, without sending too much irrelevant data.

Preferred internet connection to transport of 4 TB CSV files to Amazon (AWS)

I need to transfer a bunch of CSV files, around 4 TB in total, to AWS.
What is the preferred internet connection from my ISP to handle this transfer, or does the link not play any role? My link is 70 Mbps upload/download, dedicated. Is this enough, or do I need to increase my link speed?
Thnx.
4 TB = 4,194,304 MB
70 Mbit/sec ~= 8.75 MB/sec (approximate, because there will be network overheads)
Dividing gives 479,349 seconds, or about 5.55 days.
Increasing your link speed will certainly improve this, but you'll probably find that you get more improvement using compression (CSV implies text with a numeric bias, which compresses extremely well).
You don't say what you'll be uploading to, nor how you'll be using the results. If you're uploading to S3, I'd suggest using GZip (or another compression format) to compress the files before uploading, and then let the consumers decompress as needed. If you're uploading to EFS, I'd create an EC2 instance to receive the files and use rsync with the -z option (which will compress over the wire but leave the files uncompressed on the destination). Of course, you may still prefer pre-compressing the files, to save on long-term storage costs.
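As a sketch of the "compress, then upload to S3" route, with placeholder file and bucket names:

```python
import gzip
import shutil
import boto3

# File and bucket names are placeholders. Compress locally, upload the
# compressed copy, and let the consumers gunzip on their side.
with open("data.csv", "rb") as src, gzip.open("data.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

boto3.client("s3").upload_file("data.csv.gz", "my-upload-bucket", "incoming/data.csv.gz")
```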

s3zipper is limited to downloading only 1000 files from an S3 bucket

I am using s3zipper along with PHP to stream zipped S3 files. However, there is one issue: we have more than 1,000 files to download (approximately 2K to 10K, varying). So when we send a request to s3zipper for, let's say, 1,500 files, we get only 1,000 files in the zip.
As per the AWS docs, there is a 1,000-key limitation:
S3 API version 2 implementation of the GET operation returns some or all (up to 1,000) of the objects in a bucket.
So if we want to get more than that, we have to use the marker parameter. But in s3zipper.go the call aws_bucket.GetReader(file.S3Path) is reading a file and adding it to the zip; I am not sure how I can use the marker in this case.
I am curious how we can overcome this limitation. I am a newbie to the Go language; any help in this regard will be highly appreciated.
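Not an s3zipper-specific fix, but for reference, this is roughly what following the continuation marker looks like (a boto3/Python sketch with placeholder bucket and prefix names; the Go SDK's list call exposes the same continuation idea):

```python
import boto3

s3 = boto3.client("s3")

# Bucket and prefix are placeholders. Keep following the continuation token
# until IsTruncated is false, so you see every key rather than only the
# first 1,000.
keys = []
kwargs = {"Bucket": "my-bucket", "Prefix": "exports/"}
while True:
    resp = s3.list_objects_v2(**kwargs)
    keys.extend(obj["Key"] for obj in resp.get("Contents", []))
    if not resp.get("IsTruncated"):
        break
    kwargs["ContinuationToken"] = resp["NextContinuationToken"]

print(len(keys), "keys listed")
```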