I know that GCP's Transfer Service for cloud data allows me to schedule jobs that move data from S3 to GCS.
I would like to do the same but in the other direction.
For example, move all data from a bucket in GCS to a bucket in S3 every day at 12 am.
How do I do this?
(I believe gsutil allows me to do it, but it doesn't allow scheduling and doesn't spread the load across multiple nodes.)
AWS DataSync recently started offering this capability: https://aws.amazon.com/blogs/storage/migrating-google-cloud-storage-to-amazon-s3-using-aws-datasync/
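If you want to stay with the gsutil approach mentioned in the question, one workaround is to schedule it yourself with cron. This is only a sketch: it runs on a single machine (so it does not address the multi-node concern), and the bucket names and log path are placeholders.

```shell
# Hypothetical crontab entry: run at 12 am every day.
# Requires gsutil with AWS credentials configured in ~/.boto
# ([Credentials] section: aws_access_key_id / aws_secret_access_key).
# -m parallelises the copy across threads on this one machine; -r recurses.
0 0 * * * gsutil -m rsync -r gs://my-gcs-bucket s3://my-s3-bucket >> /var/log/gcs-to-s3.log 2>&1
```

DataSync remains the better fit if you need the transfer fanned out across managed infrastructure rather than one host.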
I want to download millions of files from an S3 bucket, which would take more than a week to download one by one. Is there any way or command to download those files in parallel using a shell script?
Thanks,
AWS CLI
You can certainly issue GetObject requests in parallel. In fact, the AWS Command-Line Interface (CLI) does exactly that when transferring files, so that it can take advantage of available bandwidth. The aws s3 sync command will transfer the content in parallel.
See: AWS CLI S3 Configuration
If your bucket has a large number of objects, it can take a long time to list the contents of the bucket. Therefore, you might want to sync the bucket by prefix (folder) rather than syncing everything at once.
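A sketch of both suggestions, assuming hypothetical bucket and prefix names: raise the CLI's request concurrency, then sync prefix by prefix, optionally running several prefixes as background jobs.

```shell
# Raise the CLI's parallelism (the default is 10 concurrent requests).
aws configure set default.s3.max_concurrent_requests 50

# Sync one prefix at a time instead of the whole bucket at once;
# running prefixes as background jobs multiplies the parallelism.
aws s3 sync s3://my-bucket/2023/ ./data/2023/ &
aws s3 sync s3://my-bucket/2024/ ./data/2024/ &
wait
```

Pick prefixes that split the keyspace roughly evenly so each job does comparable work.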
AWS DataSync
You might instead want to use AWS DataSync:
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect... Move active datasets rapidly over the network into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.
DataSync uses a protocol that takes full advantage of available bandwidth and will manage the parallel downloading of content. A fee of $0.0125 per GB applies.
AWS Snowball
Another option is to use AWS Snowcone (8TB) or AWS Snowball (50TB or 80TB), which are physical devices that you can pre-load with content from S3 and have it shipped to your location. You then connect it to your network and download the data. (It works in reverse too, for uploading bulk data to Amazon S3).
I have an AWS S3 bucket that receives multiple Parquet files every minute after some processing in Amazon Kinesis Data Firehose. Now I have to set up real-time sync of these files with a GCP Cloud Storage bucket, since we have a multi-cloud environment and further processing happens in GCP.
But my problem is: how can I do real-time sync between the two cloud buckets, so that as soon as any file arrives in AWS S3, it appears in the GCP bucket as well? Any inputs please.
If you literally mean updates happen at S3 and GCS atomically, that's not possible. The best you could do is have a job that gets notifications when updates complete at one, and initiate a copy to the other. You'd need to put some work into making the job robust regarding transient failures.
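A cruder alternative to a notification-driven job is a simple polling loop. It is not truly real-time (latency is bounded by the sleep interval), but it is naturally robust against transient failures because every pass re-syncs anything that was missed. A sketch, with placeholder bucket names and interval, assuming gsutil has both AWS and GCP credentials configured in ~/.boto:

```shell
#!/bin/sh
# Poll-and-sync loop (polling, not event-driven replication).
# gsutil rsync can read s3:// sources directly, so no local staging is needed.
while true; do
  gsutil -m rsync -r s3://my-s3-bucket gs://my-gcs-bucket \
    || echo "sync failed; will retry on the next pass" >&2
  sleep 60
done
```

For lower latency you would move to the notification-driven design described above, with S3 Event Notifications triggering the copy of each new object.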
I need to move data from on-premises storage to AWS Redshift (region1). What is the fastest way?
1) Use AWS Snowball to move the on-premises data to S3 (region1), then use Redshift's SQL COPY command to copy the data from S3 to Redshift.
2) Use AWS Data Pipeline (note there is no AWS Data Pipeline in region1 yet, so I would set up a pipeline in region2, which is closest to region1) to move the on-premises data to S3 (region1), and another AWS Data Pipeline (region2) to copy the data from S3 (region1) to Redshift (region1) using the AWS-provided template (this template uses RedshiftCopyActivity to copy data from S3 to Redshift).
Which of the above solutions is faster? Or is there another solution? Also, will RedshiftCopyActivity be faster than running Redshift's COPY command directly?
Note that this is a one-time movement, so I do not need AWS Data Pipeline's scheduling function.
Here is the AWS Data Pipeline link:
AWS Data Pipeline. It says: AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources....
It comes down to network bandwidth versus the quantity of data.
The data needs to move from the current on-premises location to Amazon S3.
This can either be done via:
Network copy
AWS Snowball
You can use an online network calculator to calculate how long it would take to copy via your network connection.
Then, compare that to using AWS Snowball to copy the data.
Pick whichever one is cheaper/easier/faster.
Once the data is in Amazon S3, use the Amazon Redshift COPY command to load it.
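A sketch of that COPY step, run through psql (Redshift speaks the PostgreSQL protocol); the cluster endpoint, table, bucket, and IAM role are all placeholders:

```shell
# Password can come from ~/.pgpass or the PGPASSWORD environment variable.
psql "host=my-cluster.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=mydb user=admin" \
  -c "COPY my_table FROM 's3://my-bucket/data/' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' CSV;"
```

COPY loads from the S3 prefix in parallel across the cluster's slices, which is why it is the recommended bulk-load path.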
If data is being continually added, you'll need to find a way to send continuous updates to Redshift. This might be easier via network copy.
There is no benefit in using Data Pipeline.
I have my data in a table in Redshift cluster. I want to periodically run a query against the Redshift table and store the results in a S3 bucket.
I will be running some data transformations on this data in the S3 bucket to feed into another system. As per AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but I haven't found any relevant information around this.
You can use a scheduling tool like Airflow to accomplish this task. Airflow connects seamlessly to Redshift and S3. You can have a DAG task that polls Redshift periodically and unloads the data from Redshift onto S3.
I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
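A sketch of the cron-on-EC2 option; the cluster endpoint, query, bucket, and IAM role are all placeholders:

```shell
#!/bin/sh
# unload_to_s3.sh -- runs UNLOAD against the cluster via psql.
# Password can come from ~/.pgpass or the PGPASSWORD environment variable.
psql "host=my-cluster.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=mydb user=admin" \
  -c "UNLOAD ('SELECT * FROM my_table') TO 's3://my-bucket/exports/my_table_' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' ALLOWOVERWRITE;"

# Crontab entry to run it hourly:
# 0 * * * * /opt/scripts/unload_to_s3.sh >> /var/log/unload.log 2>&1
```

ALLOWOVERWRITE lets each run replace the previous export under the same prefix; drop it if you want each run to fail rather than overwrite.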
I believe you are looking for the AWS Data Pipeline service.
You can copy data from Redshift to S3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL for future reference:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.
You should try AWS Data Pipeline. You can schedule pipelines to run periodically or on demand. I am confident it would solve your use case.
I am trying to set up cross region replication so that my original file will be replicated to two different regions. Right now, I can only get it to replicate to one other region.
For example, my files are on US Standard. When a file is uploaded it is replicated from US Standard to US West 2. I would also like for that file to be replicated to US West 1.
Is there a way to do this?
It appears that Cross-Region Replication in Amazon S3 cannot be chained. Therefore, it cannot be used to replicate from Bucket A to Bucket B to Bucket C.
An alternative would be to use the AWS Command-Line Interface (CLI) to synchronise between buckets, eg:
aws s3 sync s3://bucket1 s3://bucket2
aws s3 sync s3://bucket1 s3://bucket3
The sync command only copies new and changed files. Data is transferred directly between the Amazon S3 buckets, even if they are in different regions -- no data is downloaded/uploaded to your own computer.
So, put these commands in a cron job or a Scheduled Task to run once an hour and the buckets will nicely replicate!
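For example, hypothetical crontab entries that run the two syncs hourly (bucket names and log path are placeholders):

```shell
# Stagger the two jobs by a few minutes so they don't both list the source at once.
0 * * * * aws s3 sync s3://bucket1 s3://bucket2 >> /var/log/s3-replicate.log 2>&1
5 * * * * aws s3 sync s3://bucket1 s3://bucket3 >> /var/log/s3-replicate.log 2>&1
```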
See: AWS CLI S3 sync command documentation