I'm using a Lambda function to load data from S3 into Redshift with the COPY command. Files now arrive in the S3 bucket every hour; one file's load took more than an hour, another file landed in the bucket in the meantime, and now there is a deadlock. What options do I have to resolve the deadlock and make the process more efficient?
Redshift's COPY command can load data from multiple files in parallel, which can significantly reduce the time it takes to load data from S3 to Redshift, especially if you have a large number of files or the files are large.
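One common way to get that parallelism while keeping only a single COPY in flight per table is a manifest: have the Lambda write a small JSON manifest listing the hourly files and issue one COPY against it. A rough sketch using the Redshift Data API, where the cluster, database, user, table, bucket, manifest path, and IAM role are all placeholders:

# One COPY reads every file listed in a pre-built manifest; Redshift loads them in parallel.
aws redshift-data execute-statement \
    --cluster-identifier my-cluster \
    --database dev \
    --db-user awsuser \
    --sql "COPY my_table
           FROM 's3://my-bucket/manifests/latest.manifest'
           IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
           FORMAT AS CSV
           MANIFEST;"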
Use a larger number of compute nodes: with more nodes, COPY is parallelized across more slices, so each load finishes faster and is less likely to overlap with (and deadlock against) the next one.
Try whichever of these you haven't already.
I have some S3 buckets in one AWS account which hold a large amount of data (50+ TB).
I want to move it completely to new S3 buckets in another account and use the first AWS account for another purpose.
The method I know is the AWS CLI with s3 cp / s3 sync / s3 mv, but that would take days when run from my laptop.
I also want it to be cost-effective, taking data transfer charges into account.
The buckets mainly contain zip and rar files ranging from 1 GB to 150+ GB, plus other files.
Can someone suggest methods that are both cost-effective and less time-consuming?
You can use Skyplane which is much faster and cheaper than aws s3 cp (up to 110x for large files). Skyplane will automatically compress data to reduce egress costs, and will also give you cost estimates before running the transfer.
You can transfer data between buckets in region A and region B with:
skyplane cp -r s3://<region-A-bucket>/ s3://<region-B-bucket>/
If the destination bucket is in the same region as the source bucket (even if it's in a different account), there's no data transfer cost for running s3 cp/sync/mv according to the docs (check the Data transfer tab).
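For example, a plain sync between the two buckets (bucket names are placeholders; this assumes your credentials can read the source bucket and write to the destination, e.g. via a cross-account bucket policy):

# Same-region, cross-account copy: only request charges apply, no data transfer charge.
aws s3 sync s3://source-account-bucket s3://destination-account-bucket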
For a fast solution, consider using S3 Transfer Acceleration, but note that this does incur transfer costs.
Following AWS-recommended best practices, we have organization-wide CloudTrail and VPC Flow Logs configured to log to a centralized log-archive account. Because CloudTrail and VPC Flow Logs are enabled organization-wide across multiple regions, a high number of new log files is saved to S3 daily, and most of them are quite small (several KB).
The high number of small log files is fine while they're in the STANDARD storage class, since you pay only for total data size with no minimum object size overhead. However, we've found it challenging to deep archive these files after 6 or 12 months, since storage classes other than STANDARD impose per-object overhead (STANDARD-IA has a 128 KB minimum billable size; GLACIER has no minimum size but adds 40 KB of metadata per object, etc.).
What are the best practices for archiving a large number of small S3 objects? I could use a Lambda to download multiple files, re-bundle them into a larger file, and re-store it, but that would be pretty expensive in terms of compute time and GET/PUT requests. As far as I can tell, S3 Batch Operations has no support for this. Any suggestions?
Consider using a tool like S3-utils concat. This is not an AWS-supported tool, but an open-source tool that performs the type of action you require.
You'll probably want the pattern-matching syntax, which lets you create a single file for each day's logs.
$ s3-utils concat my.bucket.name 'date-hierarchy/(\d{4})/(\d{2})/(\d{2})/*.gz' 'flat-hierarchy/$1-$2-$3.gz'
This could be run as a daily job so each day is condensed into one file. It's definitely recommended to run it from a resource on the Amazon network (i.e. inside your VPC, with an S3 gateway endpoint attached) to improve transfer performance and avoid data transfer out fees.
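If you do schedule it as a daily job, a minimal cron entry could look like the following (assuming s3-utils is on the instance's PATH; the run time is arbitrary):

# Run once a day at 02:00 and condense each day's objects into a single file.
0 2 * * * s3-utils concat my.bucket.name 'date-hierarchy/(\d{4})/(\d{2})/(\d{2})/*.gz' 'flat-hierarchy/$1-$2-$3.gz'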
I have about 80,000,000 50KB files on S3 (4TB), which I want to transfer to Glacier DA.
I have come to realize there's a cost inefficiency in transferring a lot of small files to Glacier.
Assuming I don't mind archiving my files into a single (or multiple) tar/zips - what would be the best practice to transition those files to Glacier DA?
It is important to note that I only have these files on S3, and not on any local machine.
The most efficient way would be:
Launch an Amazon EC2 instance in the same region as the bucket. Choose an instance type with high-bandwidth networking (eg t3 family). Launch it with spot pricing because you can withstand the small chance that it is stopped. Assign plenty of EBS disk space. (Alternatively, you could choose a Storage Optimized instance since the disk space is included free, but the instance is more expensive. Your choice!)
Download a subset of the files to the instance using the AWS Command-Line Interface (CLI) by specifying a path (subdirectory) to copy. Don't try and do it all at once!
Zip/compress the files on the EC2 instance
Upload the compressed files to S3 using --storage-class DEEP_ARCHIVE
Check that everything seems good, and repeat for another subset! (A sketch of these steps appears below.)
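A minimal sketch of those steps, where the bucket name, prefix, and archive name are placeholders and the commands run on the EC2 instance:

#!/bin/bash
set -euo pipefail

BUCKET=my-bucket               # placeholder bucket name
PREFIX=2019/01/                # placeholder subdirectory (one subset of the files)
ARCHIVE=archive-2019-01.tar.gz

# 1. Download one subset of the small objects onto the instance's EBS volume
aws s3 cp "s3://$BUCKET/$PREFIX" ./work/ --recursive

# 2. Bundle and compress them into a single archive
tar -czf "$ARCHIVE" -C ./work .

# 3. Upload the archive directly into the Deep Archive storage class
aws s3 cp "$ARCHIVE" "s3://$BUCKET/archives/$ARCHIVE" --storage-class DEEP_ARCHIVE

# 4. After verifying the archive, the originals under $PREFIX can be removed
# aws s3 rm "s3://$BUCKET/$PREFIX" --recursive

rm -rf ./work "$ARCHIVE"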
The above would incur very little charge since you can terminate the EC2 when it is no longer needed, and EBS is only charged while the volumes exist.
If it takes too long to list a subset of the files, you might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then use this list to specifically copy files, or identify a path/subdirectory to copy.
As an extra piece of advice... if your system is continuing to collect even more files, you might consider collecting the data in a different way (eg streaming to Kinesis Firehose to batch data together), or combining the data on a regular basis rather than letting it creep up to so many files again. Where possible, fewer, larger files are much easier to work with.
We have lots of files in S3 (>1B) and I'd like to compress them to reduce storage costs.
What would be a simple and efficient way to do this?
Thank you
Alex
Amazon S3 cannot compress your data.
You would need to write a program to run on an Amazon EC2 instance that would (sketched after this list):
Download the objects
Compress them
Upload the files back to S3
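As a rough sketch of such a program (bucket and prefix are placeholders, and at this scale you would parallelize it rather than loop object by object):

#!/bin/bash
set -euo pipefail

BUCKET=my-bucket     # placeholder
PREFIX=logs/         # placeholder prefix whose objects should be compressed

# List the keys under the prefix, then download, gzip, and re-upload each one.
aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$PREFIX" \
    --query 'Contents[].Key' --output text | tr '\t' '\n' |
while read -r KEY; do
    [[ "$KEY" == *.gz ]] && continue           # skip anything already compressed
    aws s3 cp "s3://$BUCKET/$KEY" ./tmp-object
    gzip -9 ./tmp-object                       # produces ./tmp-object.gz
    aws s3 cp ./tmp-object.gz "s3://$BUCKET/$KEY.gz"
    # aws s3 rm "s3://$BUCKET/$KEY"            # delete the original once verified
    rm -f ./tmp-object.gz
done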
An alternative is to use Storage Classes (example after the list):
If the data is infrequently accessed, use S3 Standard - Infrequent Access -- this is available immediately and is cheaper as long as data is accessed less than once per month
Glacier is substantially cheaper but takes some time to restore (speed of restore is related to cost)
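For example, an existing object's storage class can be changed by copying it onto itself with a new class (bucket and key are placeholders); lifecycle rules can do the same for whole prefixes:

# Re-store an existing object in the Infrequent Access storage class by copying it onto itself.
aws s3 cp s3://my-bucket/path/to/object s3://my-bucket/path/to/object --storage-class STANDARD_IA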
I have an S3 bucket that is 9TB and I want to copy it over to another AWS account.
What would be the fastest and most cost efficient way to copy it?
I know I can rsync them and also use S3 replication.
I think rsync will take too long and be a bit pricey.
I have not played with S3 replication so I am not sure of its speed and cost.
Are there any other methods that I might not be aware of?
FYI - The source and destination buckets will be in the same region (but different accounts).
There is no quicker way to do it than using sync, and I do not believe it is that pricey. You do not mention the number of files you are copying, though.
You will pay $0.004 / 10,000 requests on the GET operations on the files you are copying and then $0.005 / 1,000 requests on the PUT operations on the files you are writing. Also, I believe you won't pay data transfer costs if this is in the same region.
If you want to speed this up you could use multiple sync jobs if the bucket can be logically divided, e.g. s3://examplebucket/job1 and s3://examplebucket/job2.
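For instance, two prefix-scoped syncs can run in parallel (bucket names and prefixes are placeholders):

# One sync per top-level prefix, run concurrently in the background.
aws s3 sync s3://examplebucket/job1 s3://destinationbucket/job1 &
aws s3 sync s3://examplebucket/job2 s3://destinationbucket/job2 &
wait    # block until both background syncs finish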
You can use S3 Batch Operations to copy large quantities of objects between buckets in the same region.
It can accept a CSV file containing a list of objects, or you can use the output of Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
While copying, it can also update tags, metadata and ACLs.
See: Cross-account bulk transfer of files using Amazon S3 Batch Operations | AWS Storage Blog
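A rough sketch of submitting such a copy job from the CLI, where the account ID, bucket names, role ARN, and the manifest's ETag are placeholders to substitute:

# Copy every object listed in a CSV manifest (Bucket,Key per line) into the destination bucket.
aws s3control create-job \
    --account-id 111122223333 \
    --operation '{"S3PutObjectCopy":{"TargetResource":"arn:aws:s3:::destination-bucket"}}' \
    --manifest '{"Spec":{"Format":"S3BatchOperations_CSV_20180820","Fields":["Bucket","Key"]},"Location":{"ObjectArn":"arn:aws:s3:::manifest-bucket/manifest.csv","ETag":"MANIFEST_ETAG"}}' \
    --report '{"Bucket":"arn:aws:s3:::report-bucket","Format":"Report_CSV_20180820","Enabled":true,"Prefix":"batch-reports","ReportScope":"AllTasks"}' \
    --priority 10 \
    --role-arn arn:aws:iam::111122223333:role/BatchOperationsCopyRole \
    --no-confirmation-required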
I wound up finding the page below and used replication with the copy to itself method.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/