aws managed service handling large number of small files


Around 1,000,000 JSON files are expected to be generated per day by an on-premises system (not connected to the internet), and they need to be aggregated for analytics. Each file is smaller than 4 KB.
My current thought is to use AWS DataSync to upload the files to S3 and keep them there for, say, 3 years. I am not sure which service to use for the analytics.
But AWS good practice says that Athena and Glue work best with a small number of large files, and that a large number of small files should be avoided.
So is there an existing AWS service that is good at aggregating this kind of data?
Thanks!
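One possible way to sidestep the small-file problem is to bundle the files before they are uploaded. The sketch below is only an illustration under assumptions: the function name bundle_json_files, the directory layout, and the choice of gzipped newline-delimited JSON are hypothetical and not something the question specifies. It merges every *.json file in a directory into a single compressed file with one record per line.

```python
import gzip
import json
from pathlib import Path


def bundle_json_files(source_dir: Path, bundle_path: Path) -> int:
    """Write every *.json file in source_dir as one line of a gzipped NDJSON file.

    Hypothetical helper: the name and the one-bundle-per-directory layout are
    assumptions made for illustration.
    """
    count = 0
    with gzip.open(bundle_path, "wt", encoding="utf-8") as bundle:
        for path in sorted(source_dir.glob("*.json")):
            record = json.loads(path.read_text(encoding="utf-8"))
            # Re-serialise compactly so each record occupies exactly one line.
            bundle.write(json.dumps(record, separators=(",", ":")) + "\n")
            count += 1
    return count
```

Run once per hour or per day on the on-premises side, this would turn roughly a million tiny uploads into a handful of larger objects, which is the shape the good-practice advice above asks for.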

Related

Costs Related to Individual Bucket Items in S3

My AWS S3 costs have been going up pretty quickly for usage type "DataTransfer-Out-Bytes". I have thousands of images in this one bucket and I can't seem to find a way to drill down into the bucket to see which individual bucket items might be causing the increase. Is there a way to see which individual files are contributing to the higher data transfer cost?

Use CloudFront if you can. It's cheaper than serving directly from S3 (if you set your cache headers properly), and CloudFront includes a popular objects report, which would answer your question.
If you're using S3 alone, you need to enable logging on the bucket (more storage cost) and then crunch the data in the logs (more data transfer cost) to get your answer. You can use AWS Athena to process the S3 access logs, or use Unix command-line tools like grep/wc/uniq/cut to operate on the log files locally or from a server to find the culprits.
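As a rough illustration of the "crunch the logs" step, here is a minimal Python sketch that totals bytes sent per object key. It assumes the space-delimited S3 server access log layout in which the operation, key, and bytes-sent values are the 7th, 8th, and 12th fields; the regular expression and the function name bytes_sent_per_key are assumptions for illustration, so check them against your own log lines before relying on the numbers.

```python
import re
from collections import Counter

# Assumed field order: owner, bucket, [time], remote IP, requester, request ID,
# operation, key, "request-URI", HTTP status, error code, bytes sent, ...
LOG_LINE = re.compile(
    r'^\S+ \S+ \[[^\]]+\] \S+ \S+ \S+ (?P<operation>\S+) (?P<key>\S+) '
    r'(?:"[^"]*"|-) \S+ \S+ (?P<bytes_sent>\S+) '
)


def bytes_sent_per_key(lines):
    """Sum the bytes-sent field per object key for object GET requests."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or match["bytes_sent"] == "-":
            continue  # unparseable line, or nothing was sent
        if match["operation"] != "REST.GET.OBJECT":
            continue  # only object downloads matter for this question
        totals[match["key"]] += int(match["bytes_sent"])
    return totals
```

Calling bytes_sent_per_key(open("access.log")).most_common(10) on a downloaded log file (the file name is a placeholder) would list the ten keys that account for the most outbound bytes.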

Techniques for AWS CloudTrail and VPC Flow log S3 archival

Following AWS-recommended best practices, we have organization-wide CloudTrail and VPC flow logging configured to log to a centralized logs archive account. Since CloudTrail and VPC flow are organization-wide in multiple regions, we're getting a high number of new log files saved to S3 daily. Most of these files are quite small (several KB).
The high number of small log files is fine while they're in the STANDARD storage class, since you just pay for total data size without any minimum file size overhead. However, we've found it challenging to deep archive these files after 6 or 12 months, since every storage class other than STANDARD has a per-object cost that penalizes small files (STANDARD-IA has a minimum billable object size of 128 KB, GLACIER doesn't have a minimum size but adds 40 KB of metadata per object, etc.).
What are the best practices for archiving a large number of small S3 objects? I could use a Lambda to download multiple files, re-bundle them into a larger file, and re-store it, but that would be pretty expensive in terms of compute time and GET/PUT requests. As far as I can tell, S3 Batch Operations has no support for this. Any suggestions?
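To make the overhead concrete, the following Python sketch estimates billable bytes per object using only the two rules mentioned above: a 128 KB minimum for STANDARD-IA and 40 KB of added metadata for GLACIER. The function name and the storage-class labels are illustrative, and actual billing has more details than this models.

```python
KB = 1024


def billable_bytes(object_size: int, storage_class: str) -> int:
    """Approximate billable size of one object under the rules quoted above."""
    if storage_class == "STANDARD":
        return object_size  # billed on actual size
    if storage_class == "STANDARD_IA":
        return max(object_size, 128 * KB)  # 128 KB minimum billable size
    if storage_class == "GLACIER":
        return object_size + 40 * KB  # 40 KB of metadata added per object
    raise ValueError(f"unknown storage class: {storage_class}")


def total_billable(object_count: int, object_size: int, storage_class: str) -> int:
    """Billable bytes for object_count objects of identical size."""
    return object_count * billable_bytes(object_size, storage_class)
```

For a 4 KB log file this gives 4 KB in STANDARD, 128 KB in STANDARD-IA, and 44 KB in GLACIER, which is why bundling small objects before archiving pays off.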

Consider using a tool like S3-utils concat. This is not an AWS-supported tool but an open source tool that performs the kind of action you need.
You'll probably want the pattern matching syntax which will allow you to create a single file for each day's logs.
$ s3-utils concat my.bucket.name 'date-hierachy/(\d{4})/(\d{2})/(\d{2})/*.gz' 'flat-hierarchy/$1-$2-$3.gz'
This could be run as a daily job so each day is condensed into one file. It is definitely recommended to run this on a resource inside the Amazon network (e.g. in your VPC with an S3 gateway endpoint attached) to improve file transfer performance and avoid data transfer out fees.
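One reason concatenating .gz log files can work without recompressing them is that the gzip format permits several compressed members back to back in one stream. Whether s3-utils relies on this property is an assumption here, not something stated above. The short Python sketch below only demonstrates the property itself, with a hypothetical helper name.

```python
import gzip


def concat_gzip_members(members):
    """Join already-compressed gzip blobs byte for byte, without recompressing."""
    return b"".join(members)


# Two independently compressed "log files" (made-up content).
first = gzip.compress(b"log line 1\n")
second = gzip.compress(b"log line 2\n")

combined = concat_gzip_members([first, second])

# Python's gzip.decompress reads multi-member streams as one continuous payload.
restored = gzip.decompress(combined)
```

After this runs, restored holds both log lines in order, so a reader that understands multi-member gzip sees the bundle as one file.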

AWS API Gateway + Firehose package multiple records

We have an application which accepts 10K requests/second, puts them into S3, and then processes them.
Currently we're using Kafka, but we would like to replace it with Firehose for several reasons (maintenance, cost, etc.). I configured API Gateway with Firehose and, without any coding, I was able to store my requests in S3 as Parquet files.
Now comes the cost estimation. According to Amazon's example, 500 records/second will cost $216/month. The record size is rounded up to the nearest 5 KB. In our case, 10K requests/second will cost 20 times more.
Our record size is 1.5 KB, so it makes sense to package multiple records into one. I did not find an example of how to do this easily. I don't want to implement this application myself because there are many edge cases to manage, and it seems to me to be a pretty common case which should already be implemented.
Is there a standard way (AWS service, github project, etc) which can be used to package records?
Or is there a better solution to my problem?
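To show what packaging would involve, here is a minimal Python sketch of greedy batching under the 5 KB rounding described above. The function name pack_records and the newline delimiter are assumptions for illustration. How the downstream Parquet conversion treats several JSON documents inside one record is not covered here and would need to be checked separately.

```python
def pack_records(records, max_bytes=5 * 1024):
    """Greedily group records into newline-delimited blobs of at most max_bytes.

    Each input record is bytes; a trailing newline is added when missing so the
    consumer can split a blob back into individual records.
    """
    blobs, current, size = [], [], 0
    for record in records:
        line = record if record.endswith(b"\n") else record + b"\n"
        if len(line) > max_bytes:
            raise ValueError("single record exceeds max_bytes")
        if current and size + len(line) > max_bytes:
            # The next record would overflow the blob, so close it first.
            blobs.append(b"".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        blobs.append(b"".join(current))
    return blobs
```

With 1.5 KB records, three fit into each 5 KB billing unit, so the billed volume would drop to roughly a third. A production version would also need a time-based flush so that a partially filled blob does not wait indefinitely, which is one of the edge cases the question alludes to.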

Best way to transfer data from on-prem to AWS

I have a requirement to transfer data (one time) from on-prem to AWS S3. The data size is around 1 TB. I was looking at AWS DataSync, Snowball, etc., but these managed services seem better suited to migrations where the data is in petabytes. Can someone suggest the best way to transfer the data securely and cost-effectively?

You can use the AWS Command-Line Interface (CLI). This command will copy data to Amazon S3:
aws s3 sync c:/MyDir s3://my-bucket/
If there is a network failure or timeout, simply run the command again. It only copies files that are not already present in the destination.
The time taken will depend upon the speed of your Internet connection.
You could also consider using AWS Snowball, which is a piece of hardware that is sent to your location. It can hold 50TB of data and costs $200.
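To judge whether the network route is practical, a back-of-the-envelope estimate of upload time helps. The Python sketch below is only that: the efficiency factor (the fraction of the nominal link speed actually achieved) is an assumed parameter, and 1 TB is taken as 10^12 bytes.

```python
def transfer_hours(size_bytes: int, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate hours needed to upload size_bytes over a link of link_mbps.

    efficiency is an assumed fudge factor for protocol overhead and contention.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    bits = size_bytes * 8
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return bits / usable_bits_per_second / 3600


ONE_TB = 10**12
hours_at_100_mbps = transfer_hours(ONE_TB, 100)
```

At a fully used 100 Mbps uplink, 1 TB takes about 22 hours; with the assumed 80% efficiency it is closer to 28 hours, so a slow link is the main reason to consider a shipped device.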

If you have no specific requirements (apart from encryption and the 1 TB size), I would suggest sticking to something plain and simple. S3 supports objects of up to 5 TB, so you won't run into trouble. I don't know whether your data is made up of many smaller files or one big file (or zip), but in essence it's all the same. Since the endpoints are all encrypted, you should be fine; if you're worried, you can encrypt your files before uploading and they will stay encrypted while stored (for example, if it's a backup of something). To get to the point: you can use API tools for the transfer, or file-explorer-style tools that can connect to S3 (e.g. https://www.cloudberrylab.com/explorer/amazon-s3.aspx). One other point: the cost-effectiveness of storage and transfer depends on how frequently you need the data. If it's just a backup or kept just in case, archiving to Glacier is much cheaper.

1 TB is large but it's not so large that it'll take you weeks to get your data onto S3. However if you don't have a good upload speed, use Snowball.
https://aws.amazon.com/snowball/
Snowball is a device shipped to you which can hold up to 100TB. You load your data onto it and ship it back to AWS and they'll upload it to the S3 bucket you specify when loading the data.

This can be done in multiple ways.
Using the AWS CLI, we can copy files from local storage to S3
AWS Transfer using FTP or SFTP (AWS SFTP)
Please refer
There are tools like the CloudBerry clients, which have a graphical interface
You can use the AWS DataSync tool

Using AWS Kinesis for large file uploads

My client has a service which stores a lot of files, such as video or sound files. The service works well, but long-term file storage is proving to be quite a challenge, and we would like to use AWS for storing these files.
The problem is the following: the client wants to use AWS Kinesis to transfer every file from our servers to AWS. Is this possible? Can we transfer files using that service? There are a lot of video files, and we get more and more every day. Every file is relatively big.
We would also like to save some details of the files, possibly in DynamoDB; we could use Lambda functions for that.
Most importantly, we need a reliable data transfer option.

Kinesis would not be the right tool to upload files, unless they were all very small, and most videos would almost certainly be over the 1 MB record size limit:
The maximum size of a data blob (the data payload before
Base64-encoding) within one record is 1 megabyte (MB).
https://aws.amazon.com/kinesis/streams/faqs/

Use S3 with multipart upload via one of the SDKs. Objects you won't be accessing for 90+ days can be moved to Glacier.
Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.
Amazon Web Services. Amazon Simple Storage Service (S3) Developer Guide (Kindle Locations 4302-4306). Amazon Web Services, Inc.. Kindle Edition.
To further optimize file upload speed, use transfer acceleration:
Amazon S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. Transfer Acceleration takes advantage of Amazon CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to Amazon S3 over an optimized network path.
Amazon Web Services. Amazon Simple Storage Service (S3) Developer Guide (Kindle Locations 2060-2062). Amazon Web Services, Inc.. Kindle Edition.
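The way an object is split into parts can be sketched in a few lines of Python. The function name plan_parts and the default part size are illustrative assumptions. The 5 MiB minimum part size (for every part except the last) and the 10,000-part cap are S3 limits. In practice the high-level transfer helpers in the AWS SDKs usually do this planning for you.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts per upload


def plan_parts(object_size: int, part_size: int = 100 * MIB):
    """Return (part_number, offset, length) tuples covering the whole object."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum part size")
    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts
```

Each tuple describes one contiguous byte range that can be uploaded independently and retried on its own if its transfer fails, which is the property the quoted passage describes.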

Kinesis launched a new service, "Kinesis Video Streams" (https://aws.amazon.com/kinesis/video-streams/), which may be helpful for moving large amounts of data.
