Extract .gz files in S3 automatically


I'm trying to find a way to extract ALB log files in .gz format automatically when the ALB uploads them to S3.
My bucket structure is like this:
/log-bucket
..alb-1/AWSLogs/account-number/elasticloadbalancing/ap-northeast-1/2018/log.gz
..alb-2/AWSLogs/account-number/elasticloadbalancing/ap-northeast-1/2018/log.gz
..alb-3/AWSLogs/account-number/elasticloadbalancing/ap-northeast-1/2018/log.gz
Basically, every 5 minutes, each ALB automatically pushes logs to the corresponding S3 bucket. I'd like to extract new .gz files right at that time, in the same bucket.
Is there any way to handle this?
I noticed that we can use a Lambda function, but I'm not sure where to start. Sample code would be greatly appreciated!

Your best choice would probably be to have an AWS Lambda function subscribed to S3 events. Whenever a new object gets created, this Lambda function would be triggered. The Lambda function could then read the file from S3, extract it, write the extracted data back to S3 and delete the original one.
How that works is described in Using AWS Lambda with Amazon S3.
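As a starting point, here is a minimal sketch of such a function. It assumes each log file is small enough to decompress in memory (reasonable for 5-minute ALB logs) and that the extracted object should get the same key minus the .gz suffix; the helper names extracted_key and handle_s3_event are made up for this example. Because the function writes back to the bucket that triggers it, it skips any key that does not end in .gz so that it does not re-trigger itself endlessly.

```python
import gzip
import urllib.parse


def extracted_key(key):
    """Return the key for the decompressed copy, or None if key is not a .gz file."""
    return key[:-3] if key.endswith(".gz") else None


def handle_s3_event(event, s3):
    """Decompress every newly created .gz object named in an S3 event.

    `s3` is any object exposing get_object / put_object / delete_object with the
    boto3 S3 client's keyword arguments; in Lambda it would be boto3.client("s3").
    """
    written = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        target = extracted_key(key)
        if target is None:
            continue  # Skip non-.gz objects, including the files this function writes.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        s3.put_object(Bucket=bucket, Key=target, Body=gzip.decompress(body))
        s3.delete_object(Bucket=bucket, Key=key)
        written.append(target)
    return written


def lambda_handler(event, context):
    import boto3  # Imported here so the helpers above can be tested without AWS.

    return handle_s3_event(event, boto3.client("s3"))
```

Limiting the S3 event notification to the .gz suffix as well would stop the function from being invoked for the objects it writes.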
That said, you might also want to reconsider if you really need to store uncompressed logs in S3. Compressed files are not only cheaper, as they don't take as much storage space as uncompressed ones, but they are usually also faster to process, as the bottleneck in most cases is network bandwidth for transferring the data and not available CPU-resources for decompression. Most tools also support working directly with compressed files. Take Amazon Athena (Compression Formats) or Amazon EMR (How to Process Compressed Files) for example.
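To illustrate working directly with the compressed file, here is a small sketch that reads log lines straight out of a gzip stream using only the Python standard library. The function name iter_log_lines is made up for this example, and the input is assumed to be any binary file-like object holding UTF-8 text, such as an io.BytesIO wrapped around bytes downloaded from S3.

```python
import gzip
import io


def iter_log_lines(compressed_stream):
    """Yield decoded text lines from a gzip-compressed binary stream."""
    with gzip.open(compressed_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            yield line.rstrip("\n")


# Example with in-memory data standing in for a downloaded log.gz:
sample = io.BytesIO(gzip.compress(b"first entry\nsecond entry\n"))
for entry in iter_log_lines(sample):
    print(entry)
```

Nothing uncompressed is ever written to storage; the data is decompressed on the fly as it is read.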

Related

Aws extract tar.gz on S3

As I'm new to AWS and a little confused by all the similar services, I would like some leads and to know whether I'm heading in the right direction.
I have tar.gz archives stored in AWS S3 Glacier Deep Archive. When I request a restore, I would like the archive to be extracted automatically and the folders and files it contains to be put in S3 (with an expiration date).
These archives are too big to be extracted via Lambda (300 GB or more).
My idea would be to trigger a Lambda function when the restore is complete and use that Lambda function to start another AWS service that does the extraction. I was thinking of either AWS Batch or Fargate. Which service do you think is the most suitable? For this kind of simple task, is it preferable to use an ARM architecture?
If someone has already done this and has code to share, I'm interested (if not, I'll try to post my final solution here for others).

How can I decompress ZIP files from S3, recompress them & then move them to an S3 bucket?

I have an S3 bucket with a bunch of zip files. I want to decompress the zip files and, for each decompressed item, create a $file.gz and save it to another S3 bucket. I was thinking of creating a Glue job for it, but I don't know where to begin. Any leads?
Eventually, I would like to terraform my solution, and it should be triggered whenever there are new files in the S3 bucket.
Would a Lambda function or any other service be more suited for this?

From an architectural point of view, it depends on the file size of your ZIP files - if the process takes less than 15 minutes, then you can use Lambda functions.
If more, you will hit the current 15 minute Lambda timeout so you'll need to go ahead with a different solution.
However, for your use case of triggering on new files, S3 triggers will allow you to trigger a Lambda function when there are files created/deleted from the bucket.
I would recommend segregating the ZIP files into their own bucket; otherwise, the Lambda will be triggered for every upload to the bucket and you'll also be paying for it to check whether each file is in your specific "folder" (the cost will be negligible, but it's still worth pointing out). S3 event notifications can also be filtered by key prefix or suffix, which narrows the trigger without a separate bucket. If segregated, you'll know that any file uploaded is a ZIP file.
Your Lambda can then download the file from S3 using download_file (example provided by Boto3 documentation), unzip it using zipfile & eventually GZIP compress the file using gzip.
You can then upload the output file to the new bucket using upload_file (example provided by Boto3 documentation) & then delete the original file from the original bucket using delete_object.
Terraforming the above should also be relatively simple as you'll mostly be using the aws_lambda_function & aws_s3_bucket resources.
Make sure your Lambda has the correct execution role with the appropriate IAM policies to access both S3 buckets & you should be good to go.
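The steps above can be sketched roughly as follows. To stay short, this version keeps everything in memory with get_object and put_object instead of download_file and upload_file, so it assumes the archives fit in the Lambda's memory; the function names zip_to_gzip_members and convert_object are made up for this example, and `s3` stands for a boto3 S3 client.

```python
import gzip
import io
import zipfile


def zip_to_gzip_members(zip_bytes):
    """Return {"<member name>.gz": gzip bytes} for every file inside a ZIP archive."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # Directory entries carry no data.
            result[info.filename + ".gz"] = gzip.compress(archive.read(info))
    return result


def convert_object(s3, source_bucket, key, target_bucket):
    """Re-compress one ZIP object into .gz objects, then delete the original."""
    body = s3.get_object(Bucket=source_bucket, Key=key)["Body"].read()
    uploaded = []
    for name, data in zip_to_gzip_members(body).items():
        s3.put_object(Bucket=target_bucket, Key=name, Body=data)
        uploaded.append(name)
    # Delete only after every member has been written to the target bucket.
    s3.delete_object(Bucket=source_bucket, Key=key)
    return uploaded
```

In a Lambda handler, source_bucket and key would come from the S3 event record, and target_bucket could be read from an environment variable.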

Best way to transfer data from on-prem to AWS

I have a requirement to transfer data (one time) from on-prem to AWS S3. The data size is around 1 TB. I was going through AWS DataSync, Snowball, etc., but these managed services seem better suited to migrations where the data is in petabytes. Can someone suggest the best way to transfer the data securely and cost-effectively?

You can use the AWS Command-Line Interface (CLI). This command will copy data to Amazon S3:
aws s3 sync c:/MyDir s3://my-bucket/
If there is a network failure or timeout, simply run the command again. It only copies files that are not already present in the destination.
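To make the "only copies what is missing" idea concrete, here is a simplified sketch of that kind of comparison in Python. It is not how the CLI is implemented: this version compares only by key and size, and the function name files_to_upload and the remote_sizes mapping (object key to size in bytes, for example gathered from a bucket listing) are assumptions for this example.

```python
import os


def files_to_upload(local_root, remote_sizes):
    """Return sorted (key, path) pairs for local files that are missing from the
    destination or whose size differs from the copy already there."""
    pending = []
    for folder, _subdirs, names in os.walk(local_root):
        for name in names:
            path = os.path.join(folder, name)
            # Object keys always use forward slashes, whatever the local OS.
            key = os.path.relpath(path, local_root).replace(os.sep, "/")
            if remote_sizes.get(key) != os.path.getsize(path):
                pending.append((key, path))
    return sorted(pending)
```

Running it again after an interrupted transfer returns only the files that still need to be sent, which is why re-running the sync is cheap.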
The time taken will depend upon the speed of your Internet connection.
You could also consider using AWS Snowball, which is a piece of hardware that is sent to your location. It can hold 50TB of data and costs $200.

If you have no specific requirements (apart from the transfer being encrypted and the data being about 1 TB), I would suggest sticking to something plain and simple. S3 supports objects of up to 5 TB, so you wouldn't run into trouble. I don't know whether your data is made up of many smaller files or one big file (or zip), but in essence it's all the same. Since the endpoints are all encrypted, you should be fine; if you're worried, you can encrypt your files beforehand, and they will stay encrypted while stored (for example, if it's a backup of something). To get to the point: you can use API tools for the transfer, or file-explorer-style tools that can connect to S3 (e.g. https://www.cloudberrylab.com/explorer/amazon-s3.aspx). One other point: the cost-effectiveness of storage and transfer depends on how frequently you need the data. If it's just a backup kept just in case, archiving to Glacier is much cheaper.

1 TB is large but it's not so large that it'll take you weeks to get your data onto S3. However if you don't have a good upload speed, use Snowball.
https://aws.amazon.com/snowball/
Snowball is a device shipped to you which can hold up to 100TB. You load your data onto it and ship it back to AWS and they'll upload it to the S3 bucket you specify when loading the data.

This can be done in multiple ways:
Using the AWS CLI, you can copy files from local storage to S3.
AWS Transfer using FTP or SFTP (AWS SFTP).
There are tools like CloudBerry clients, which have a graphical interface.
You can use the AWS DataSync tool.

Parse data from aws s3 bucket and save parsed data to another bucket

I'm new to AWS S3, and I was reading this tutorial from AWS on how to move data from one bucket to another:
How can I copy objects between Amazon S3 buckets?
However, I didn't notice any mention of a hook or intermediate step that can be applied before the data is saved.
Ideally, we want to take the data from a log bucket (it's very dirty and we want to clean it up a bit) and save another copy of it, the parsed data, in another S3 bucket. We also want to do this periodically, so automation will be necessary in the future.
What I want to know is: can I do this with just S3, or do I need to use another service to do the parsing and saving to another bucket?
Any insight is appreciated, thanks!

S3 by itself is simply for storage. You should be looking at using AWS Lambda with Amazon S3.
Every time a file is pushed to your Log bucket, S3 can trigger a Lambda function (that you write) that can read the file, do the clean up, and then push the cleaned data to the new S3 bucket.
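A rough sketch of what that function's core could look like is below. The clean-up rule here (trim whitespace, drop blank lines) is only a placeholder, since the real parsing depends on your log format; the names clean_log_text and copy_cleaned are made up for this example, and `s3` stands for a boto3 S3 client.

```python
def clean_log_text(raw_text):
    """Placeholder clean-up: trim whitespace on each line and drop blank lines."""
    kept = [line.strip() for line in raw_text.splitlines() if line.strip()]
    return "\n".join(kept) + "\n" if kept else ""


def copy_cleaned(s3, source_bucket, key, target_bucket):
    """Read one log object, clean it, and write the result under the same key
    in the target bucket. Returns the number of lines kept."""
    raw = s3.get_object(Bucket=source_bucket, Key=key)["Body"].read()
    cleaned = clean_log_text(raw.decode("utf-8", errors="replace"))
    s3.put_object(Bucket=target_bucket, Key=key, Body=cleaned.encode("utf-8"))
    return len(cleaned.splitlines())
```

The Lambda handler would call copy_cleaned once per record in the S3 event, taking the bucket name and key from the record.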
Hope this helps.

Is there any service on AWS that can help me convert mp4 files to mp3?

I'm new to Amazon Web Services and I'm wondering if the platform offers any solution to convert media files to different formats (mp4 to mp3), or whether I have to use a Lambda function with a third-party library to achieve this.
Thank you !

You can get up and running quickly with Elastic Transcoder. You will need to:
create two s3 buckets, your 'inbox' and 'outbox'
add a transcoder pipeline specifying which buckets are your in/out buckets, and what file types you want to transcode from and to.
you can set up a trigger so that every time something hits the in bucket, the process runs, or you can place something in the in bucket and use the SDK or CLI to trigger a job.
Two things to note:
When you fire a job, you have to pass in the name of the file that will be created. If the file already exists in the out bucket, an error will be thrown.
As with all of AWS's complete services, you get a little free up front, then it gets expensive. Once you get the hang of it, you can save some money by rolling your own in Lambda.
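Triggering a job from the SDK could look something like the sketch below. It assumes `transcoder` is a boto3 Elastic Transcoder client (boto3.client("elastictranscoder")), and that pipeline_id and preset_id are values you look up for your own pipeline and for an MP3 audio preset; the function names mp3_key_for and start_mp3_job are made up for this example. The output name is derived from the input name, since the job needs the name of the file it will create.

```python
import posixpath


def mp3_key_for(input_key):
    """Swap the extension of an S3 key for .mp3 (videos/talk.mp4 -> videos/talk.mp3)."""
    stem, _extension = posixpath.splitext(input_key)
    return stem + ".mp3"


def start_mp3_job(transcoder, pipeline_id, preset_id, input_key):
    """Submit one transcoding job and return the response from create_job."""
    return transcoder.create_job(
        PipelineId=pipeline_id,
        Input={"Key": input_key},
        Outputs=[{"Key": mp3_key_for(input_key), "PresetId": preset_id}],
    )
```

Because a job fails if the output file already exists in the out bucket, you may want to add something unique (such as a timestamp) to the output key if the same input can be processed more than once.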
