Realtime sync between AWS S3 bucket and Google Cloud Storage bucket

I have an AWS S3 bucket that receives multiple Parquet files every minute, produced by an operation in AWS Firehose. I need to sync these files in real time to a GCP Cloud Storage bucket, because we have a multi-cloud environment and further processing happens in GCP.
My problem is how to do a real-time sync between the two buckets, so that as soon as a file arrives in AWS S3 it also appears in the GCP bucket. Any input is appreciated.

If you literally mean updates happen at S3 and GCS atomically, that's not possible. The best you could do is have a job that gets notifications when updates complete at one, and initiate a copy to the other. You'd need to put some work into making the job robust regarding transient failures.
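As a rough sketch of such a job, the snippet below pulls object keys out of an S3 event notification and retries a copy with exponential backoff. It assumes the standard S3 notification layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`); `copy_fn` is a hypothetical callable that you would implement to download the object from S3 and upload it to GCS. The retry counts and delays are illustrative, not recommendations.

```python
import time
from urllib.parse import unquote_plus


def keys_from_s3_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def copy_with_retry(copy_fn, bucket, key, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call copy_fn(bucket, key), retrying with exponential backoff on failure.

    copy_fn is a placeholder for the real S3-to-GCS copy. A real job would
    catch only the transient error types of the client libraries it uses.
    """
    for attempt in range(attempts):
        try:
            return copy_fn(bucket, key)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the failure to the caller
            sleep(base_delay * 2 ** attempt)
```

A handler would call `keys_from_s3_event(event)` and then `copy_with_retry(copy_fn, bucket, key)` for each pair. Copies that still fail after the last attempt need somewhere to go, such as a queue that is reprocessed later.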

Related

How to automatically sync an S3 bucket to a local folder using Windows Server

I'm trying to keep a replica of my S3 bucket in a local folder. It should be updated when a change occurs in the bucket.
You can use the AWS CLI's aws s3 sync command to copy ('synchronize') files from an Amazon S3 bucket to a local drive.
To have it update frequently, you could schedule it as a Windows Scheduled Task. Please note that it will be making frequent calls to AWS, which will incur API charges ($0.005 per 1000 requests).
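A minimal sketch of a script that a scheduled task could run is shown here. It only wraps the aws s3 sync command mentioned above; the function names are made up for illustration, and it assumes the AWS CLI is installed and configured on the machine.

```python
import subprocess


def build_sync_command(bucket, local_dir, prefix="", delete=False):
    """Build the argument list for `aws s3 sync` from a bucket to a local folder."""
    source = f"s3://{bucket}/{prefix}" if prefix else f"s3://{bucket}"
    command = ["aws", "s3", "sync", source, local_dir]
    if delete:
        # --delete removes local files that no longer exist in the bucket.
        command.append("--delete")
    return command


def run_sync(bucket, local_dir, **options):
    """Run the sync once; the scheduler is expected to call this repeatedly."""
    return subprocess.run(build_sync_command(bucket, local_dir, **options), check=True)
```

For example, `run_sync("my-example-bucket", r"C:\s3-replica")` would perform one sync pass; both the bucket name and the folder are placeholders.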
Alternatively, you could use utilities that 'mount' an Amazon S3 bucket as a drive (Tntdrive, Cloudberry, Mountain Duck, etc). I'm not sure how they detect changes -- they possibly create a 'virtual drive' where the data is not actually downloaded until it is accessed.
You can use rclone and WinFsp to mount S3 as a drive.
This might not be a 'mount' in the traditional sense, though.
You will need to set up a scheduled task for a continuous sync.
Example : https://blog.spikeseed.cloud/mount-s3-as-a-disk/

How to move data from GCS to S3 with scheduled jobs

I know that GCP's Transfer Service for cloud data allows me to schedule jobs that move data from S3 to GCS.
I would like to do the same but in the other direction.
For example, move all data from a bucket in GCS to a bucket in S3 every day at 12 am.
How do I do this?
(I believe gsutil allows me to do it but doesn't allow scheduling and also doesn't leverage spreading the load across multiple nodes)
AWS DataSync recently started offering this: https://aws.amazon.com/blogs/storage/migrating-google-cloud-storage-to-amazon-s3-using-aws-datasync/

Loading files from an AWS S3 bucket into Snowflake tables

I want to copy files from an S3 bucket to Snowflake, and I'm using a Lambda function to do it. The bucket contains folders, and every folder holds many CSV files. These files can be small or huge. My Lambda function loads them into Snowflake, but a Lambda function can only run for 15 minutes, which is not enough to load all the files. Can you help me with this problem? One solution I have in mind is to execute the Lambda with a single file rather than with all files.
As you said, the maximum execution time for a Lambda function is 15 minutes. It is also not a good idea to load a whole file into memory, because you will pay for both long execution time and high memory usage.
If you really want to use Lambda and you are dealing with files over 1 GB, consider AWS Athena, or optimize the Lambda function to read the file as a stream instead of loading it into memory all at once.
Another option is to create an SQS message when the file lands in S3 and have an EC2 instance poll the queue and process files as necessary. For more information, see: Running Cost-effective queue workers with Amazon SQS and Amazon EC2 Spot Instances.
The best option is to automate Snowpipe with AWS Lambda. For this, see the Snowpipe docs: Automating Snowpipe with AWS Lambda.
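To illustrate the streaming suggestion, here is a small sketch that parses CSV rows in fixed-size batches from an iterable of byte lines, so only one batch is held in memory at a time. The function name and batch size are illustrative. It assumes the lines could come from something like `response["Body"].iter_lines()` on a boto3 `get_object` response, and that no quoted field contains a line break, because line-by-line iteration would split such a field.

```python
import csv
from itertools import islice


def batched_rows(lines, batch_size=1000, encoding="utf-8"):
    """Yield lists of parsed CSV rows from an iterable of byte lines."""
    # Decode lazily so the whole file is never held in memory.
    decoded = (line.decode(encoding) for line in lines)
    reader = csv.reader(decoded)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch
```

Each batch could then be handed to whatever load step is in use, which keeps memory usage bounded by `batch_size` rather than by file size.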

Download files from an AWS S3 bucket in parallel

I want to download millions of files from an S3 bucket. Downloading them one by one would take more than a week. Is there a way, or a command, to download the files in parallel from a shell script?
Thanks,
AWS CLI
You can certainly issue GetObject requests in parallel. In fact, the AWS Command-Line Interface (CLI) does exactly that when transferring files, so that it can take advantage of available bandwidth. The aws s3 sync command will transfer the content in parallel.
See: AWS CLI S3 Configuration
If your bucket has a large number of objects, it can take a long time to list the contents of the bucket. Therefore, you might want to sync the bucket by prefix (folder) rather than trying it all at once.
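If you would rather issue the parallel requests yourself, the idea can be sketched with a thread pool. Here `download_fn` is a hypothetical callable that fetches a single key (for example, a thin wrapper around an S3 client's download call), and the worker count is only a starting point to tune.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(keys, download_fn, max_workers=16):
    """Run download_fn(key) for every key in parallel; return (done, failed)."""
    done, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_fn, key): key for key in keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                future.result()
                done.append(key)
            except Exception as error:
                # Record the failure and keep going; these keys can be retried.
                failed[key] = error
    return done, failed
```

This sketch submits every key up front, so with millions of objects it makes sense to call it once per prefix rather than for the whole bucket at once.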
AWS DataSync
You might instead want to use AWS DataSync:
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect... Move active datasets rapidly over the network into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.
DataSync uses a protocol that takes full advantage of available bandwidth and will manage the parallel downloading of content. A fee of $0.0125 per GB applies.
AWS Snowball
Another option is to use AWS Snowcone (8TB) or AWS Snowball (50TB or 80TB), which are physical devices that you can pre-load with content from S3 and have it shipped to your location. You then connect it to your network and download the data. (It works in reverse too, for uploading bulk data to Amazon S3).

Identifying and deleting S3 Objects that are not being accessed?

I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds depending on your use case.
You have a few options:
Tag each S3 Object (e.g. 2018-10-24). First turn on Object Level Logging for your S3 bucket. Set up CloudWatch Events for CloudTrail. The Tag could then be updated by a Lambda Function which runs on a CloudWatch Event, which is fired on a Get event. Then create a function that runs on a Scheduled CloudWatch Event to delete all objects with a date tag prior to today.
Query CloudTrail logs: write a custom function to query the last access times from Object Level CloudTrail Logs. This could be done with Athena, or a direct query to S3.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
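For the last option, the lifecycle rule can be expressed as a small configuration document. The sketch below builds one as a Python dict in the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument; the function name, rule ID, prefix and day count are placeholders.

```python
def expiration_rule(prefix, days, rule_id="expire-old-objects"):
    """Build a lifecycle configuration that expires objects under prefix after days."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Days are counted from object creation, not from last access.
                "Expiration": {"Days": days},
            }
        ]
    }
```

For instance, `expiration_rule("logs/", 30)` describes a rule that deletes objects under `logs/` 30 days after they were created.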
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using them, you can probably delete them. But you would be the only person who would know whether they are necessary.
There is a recent AWS blog post that describes an interesting and cost-optimized approach to this problem.
Here is the description from AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operation job to tag objects in the source bucket. These must be expired using the following logic:
Capture the number of days (x) configuration from the S3 Lifecycle configuration.
Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 batch operation of "delete=True".
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs
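The delta-list and manifest steps above can be sketched in plain Python. This is a stand-in for the Athena query in the blog post, not a reproduction of it: it assumes the inventory has already been loaded as a mapping from key to creation time and the access logs as a set of recently accessed keys. The function names are invented for the example, and the exact manifest rules (key encoding, for instance) should be checked against the S3 Batch Operations documentation.

```python
import csv
import io
from datetime import datetime, timedelta, timezone


def stale_keys(inventory, accessed_keys, days, now=None):
    """Return keys created more than `days` ago that are not in accessed_keys.

    inventory: dict mapping object key -> creation time (timezone-aware datetime)
    accessed_keys: set of keys seen in the access logs during the last `days` days
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return sorted(
        key for key, created in inventory.items()
        if created < cutoff and key not in accessed_keys
    )


def manifest_csv(bucket, keys):
    """Render a CSV manifest with one `bucket,key` row per object."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for key in keys:
        writer.writerow([bucket, key])
    return buffer.getvalue()
```

The resulting text would be written to the manifest bucket, and the batch job would then apply the "delete=True" tag to each listed object.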