I read the documentation on the official website, but it does not give me a clear picture.
Why would we need to use AWS Transfer Family when AWS DataSync can achieve the same result?
I notice the protocol differences, but I am not quite sure about the data migration use case.
Why would we pick one over the other?
Why would we need to use AWS Transfer Family when AWS DataSync can achieve the same result?
It depends on what you mean by achieving the same result.
If it simply means transferring data to & from AWS, then yes, both achieve the same result.
However, the main difference is that AWS Transfer Family is practically an always-on server endpoint enabled for SFTP, FTPS, and/or FTP.
If you need to maintain compatibility for current users and applications that use SFTP, FTPS, and/or FTP, then AWS Transfer Family is a must: the contract is not broken, existing transfer workflows for your end users are preserved, and existing client-side configurations are maintained, so everything keeps working without modification.
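For example, an existing SFTP upload script keeps working unchanged; it just points at the Transfer Family server endpoint instead of the old host. A minimal sketch using Python's paramiko, where the endpoint hostname, user, key path, and bucket path are all placeholders:

```python
import os
import paramiko

# Placeholders: your Transfer Family server endpoint, SFTP user, key, and bucket path.
HOST = "s-1234567890abcdef0.server.transfer.us-east-1.amazonaws.com"
USER = "sftp-user"
KEY_PATH = os.path.expanduser("~/.ssh/transfer_family_key")

# Connect exactly as you would to any other SFTP server.
transport = paramiko.Transport((HOST, 22))
transport.connect(username=USER, pkey=paramiko.RSAKey.from_private_key_file(KEY_PATH))
sftp = paramiko.SFTPClient.from_transport(transport)

# Files uploaded here land in the S3 bucket that backs the server.
sftp.put("report.csv", "/my-bucket/incoming/report.csv")

sftp.close()
transport.close()
```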
On the other hand, AWS DataSync is ideal for transferring data between on-premises & AWS or between AWS storage services. A few use-cases that AWS suggests are migrating active data to AWS, archiving data to free up on-premises storage capacity, replicating data to AWS for business continuity, or transferring data to the cloud for analysis and processing.
At the core, both can be used to transfer data to & from AWS but serve different business purposes.
Your exact question is answered in the AWS DataSync FAQs:
Q: When do I use AWS DataSync and when do I use AWS Transfer Family?
A: If you currently use SFTP to exchange data with third parties, AWS Transfer Family provides a fully managed SFTP, FTPS, and FTP transfer directly into and out of Amazon S3, while reducing your operational burden.
If you want an accelerated and automated data transfer between NFS servers, SMB file shares, self-managed object storage, AWS Snowcone, Amazon S3, Amazon EFS, and Amazon FSx for Windows File Server, you can use AWS DataSync. DataSync is ideal for customers who need online migrations for active data sets, timely transfers for continuously generated data, or replication for business continuity.
Also see: AWS Transfer Family FAQs - Q: Why should I use the AWS Transfer Family?
As-Is:
We are currently uploading files to Amazon S3.
These files are processed by a lambda function which then writes a file back to Amazon S3.
Problem:
We are processing critical data. So the data must not be stored in the cloud according to the compliance team.
It shall be stored on-premise on our own file servers.
Question:
How can we replace S3 easily so that our lambda function is accessing the file on the on-premise file server?
(The files must not be stored on S3 - even for a millisecond.)
(Alternatively, the file might be provided by a user, e.g. via a GUI.)
If the data can't be transmitted to the cloud, then you can't use a Lambda function in the cloud to process it - if the code is not running on your servers, then it has to receive a copy of the data somehow, which means the data is leaving your network.
If you really want to have the same experience as running in AWS but with on-premise hardware, you can get an AWS Outpost, which is like your own little bit of AWS.
Alternatively, just run the code that would have been in Lambda on your own servers, perhaps using an open-source package that gives you Lambda-like execution using local containers.
So the data must not be stored in the cloud according to the compliance team.
If your only concern is that you don't want to store data on S3, you can put your Lambda in a VPC and have a Site-to-Site VPN from your on-premises network to the AWS VPC.
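As a rough sketch of that setup (not a definitive implementation): the Lambda runs inside the VPC and reaches the on-premises file server over the VPN, so nothing is written to S3. The hostname, credentials, paths, and the process() step are placeholders, paramiko would need to be packaged with the function, and note that the data still passes through Lambda's memory, which is the caveat raised in the next answer.

```python
import io
import paramiko

# Placeholders: an internal file server reachable through the Site-to-Site VPN.
ONPREM_HOST = "fileserver.corp.internal"
ONPREM_USER = "svc-lambda"
ONPREM_KEY = "/opt/keys/onprem_key"   # e.g. shipped with the deployment package

def process(data):
    # Placeholder for the processing the Lambda does today.
    return data

def handler(event, context):
    # The function runs inside the VPC, so this connection goes over the VPN
    # and the file never touches S3 or any other cloud storage service.
    transport = paramiko.Transport((ONPREM_HOST, 22))
    transport.connect(username=ONPREM_USER,
                      pkey=paramiko.RSAKey.from_private_key_file(ONPREM_KEY))
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        buf = io.BytesIO()
        sftp.getfo(event["input_path"], buf)                   # read input from on-premises
        result = process(buf.getvalue())
        sftp.putfo(io.BytesIO(result), event["output_path"])   # write the result back
    finally:
        sftp.close()
        transport.close()
    return {"status": "done"}
```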
Usually compliance is not just limited to long-term storage like S3. You should check whether your data is allowed to leave your local network at all. For Lambda to process your data, the data has to be held temporarily in the cloud, and it has to leave your local network. If there are compliance restrictions on either of those, Lambda is probably not the best option.
I want to download millions of files from an S3 bucket, which would take more than a week to download one by one - is there any way / any command to download those files in parallel using a shell script?
Thanks,
AWS CLI
You can certainly issue GetObject requests in parallel. In fact, the AWS Command-Line Interface (CLI) does exactly that when transferring files, so that it can take advantage of available bandwidth. The aws s3 sync command will transfer the content in parallel.
See: AWS CLI S3 Configuration
If your bucket has a large number of objects, it can take a long time to list the contents of the bucket. Therefore, you might want to sync the bucket by prefix (folder) rather than trying it all at once.
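If you would rather drive the parallelism yourself instead of relying on the CLI, here is a minimal sketch using boto3 with a thread pool; the bucket name, prefix, destination directory, and worker count are placeholders to tune for your bandwidth:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "my-bucket"        # placeholder bucket name
PREFIX = "some/prefix/"     # sync one prefix at a time for very large buckets
DEST = "./download"
WORKERS = 20                # tune to your available bandwidth

s3 = boto3.client("s3")

def download(key):
    local_path = os.path.join(DEST, key)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3.download_file(BUCKET, key, local_path)

# List the objects under the prefix and download them in parallel.
paginator = s3.get_paginator("list_objects_v2")
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith("/"):   # skip "folder" placeholder objects
                pool.submit(download, obj["Key"])
```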
AWS DataSync
You might instead want to use AWS DataSync:
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect... Move active datasets rapidly over the network into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.
DataSync uses a protocol that takes full advantage of available bandwidth and will manage the parallel downloading of content. A fee of $0.0125 per GB applies.
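Assuming the DataSync agent, locations, and task have already been created (in the console or via the API), starting a transfer is a single call. A hedged sketch with boto3, where the task ARN is a placeholder:

```python
import boto3

datasync = boto3.client("datasync")

# Placeholder ARN of a task whose source/destination locations were set up beforehand.
TASK_ARN = "arn:aws:datasync:us-east-1:123456789012:task/task-0123456789abcdef0"

execution = datasync.start_task_execution(TaskArn=TASK_ARN)
print("Started:", execution["TaskExecutionArn"])

# Check on the transfer while it runs.
status = datasync.describe_task_execution(
    TaskExecutionArn=execution["TaskExecutionArn"]
)["Status"]
print("Current status:", status)
```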
AWS Snowball
Another option is to use AWS Snowcone (8TB) or AWS Snowball (50TB or 80TB), which are physical devices that can be pre-loaded with content from S3 and shipped to your location. You then connect the device to your network and download the data. (It works in reverse too, for uploading bulk data to Amazon S3.)
I need to move data from on-premises to AWS Redshift (region1). What is the fastest way?
1) Use AWS Snowball to move the on-premises data to S3 (region1) and then use Redshift's SQL COPY command to copy the data from S3 to Redshift.
2) Use AWS Data Pipeline (note: there is no AWS Data Pipeline in region1 yet, so I will set up a pipeline in region2, which is closest to region1) to move the on-premises data to S3 (region1), and another AWS Data Pipeline (region2) to copy the data from S3 (region1) to Redshift (region1) using the AWS-provided template (this template uses RedshiftCopyActivity to copy data from S3 to Redshift)?
Which of the above solutions is faster? Or is there another solution? Also, is RedshiftCopyActivity faster than running Redshift's COPY command directly?
Note that this is a one-time move, so I do not need AWS Data Pipeline's scheduling function.
Here is the AWS Data Pipeline link: AWS Data Pipeline. It says: AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources....
It comes down to network bandwidth versus the quantity of data.
The data needs to move from the current on-premises location to Amazon S3.
This can either be done via:
Network copy
AWS Snowball
You can use an online network calculator to calculate how long it would take to copy via your network connection.
Then, compare that to using AWS Snowball to copy the data.
Pick whichever one is cheaper/easier/faster.
Once the data is in Amazon S3, use the Amazon Redshift COPY command to load it.
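For reference, the load itself is just a COPY statement pointing at the S3 location. A sketch that issues it through the Redshift Data API via boto3; the cluster, database, user, table, S3 path, and IAM role are all placeholders:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Placeholders: your own table, S3 path, IAM role, cluster, database, and user.
copy_sql = """
    COPY my_schema.my_table
    FROM 's3://my-bucket/exported-data/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    GZIP;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="mydb",
    DbUser="admin",
    Sql=copy_sql,
)
print("Submitted statement:", response["Id"])
```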
If data is being continually added, you'll need to find a way to send continuous updates to Redshift. This might be easier via network copy.
There is no benefit in using Data Pipeline.
We would like to stream data directly from an EC2 web server to Redshift. Do I need to use Kinesis? What is the best practice? I do not plan to do any special analysis on this data before storing it. I would like a cost-effective solution (it might be costly to use DynamoDB as temporary storage before loading).
If cost is your primary concern, then the exact number of records per second combined with the record sizes can be important.
If you are talking about a very low volume of messages, a custom app running on a t2.micro instance to aggregate the data is about as cheap as you can go, but it won't scale. The bigger downside is that you are responsible for monitoring, maintaining, and managing that EC2 instance.
The modern approach would be to use a combination of Kinesis + Lambda + S3 + Redshift to have the data stream in with no EC2 instances to manage!
The approach is described in this blog post: A Zero-Administration Amazon Redshift Database Loader
What that blog post doesn't mention is that, with API Gateway, if you do need any kind of custom authentication or data transformation, you can do that without an EC2 instance by using Lambda to broker the data into Kinesis.
This would look like:
API Gateway -> Lambda -> Kinesis -> Lambda -> S3 -> Redshift
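The first Lambda in that chain is only a thin broker that takes the API Gateway payload and puts it onto the Kinesis stream. A minimal sketch, where the stream name and partition key choice are placeholders:

```python
import json

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "incoming-events"   # placeholder stream name

def handler(event, context):
    # With a Lambda proxy integration, API Gateway delivers the request body as a string.
    record = json.loads(event.get("body") or "{}")

    # Any custom authentication or transformation would happen here before forwarding.
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record.get("user_id", "default")),
    )
    return {"statusCode": 200, "body": json.dumps({"accepted": True})}
```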
Redshift is best suited for batch loading using the COPY command. A typical pattern is to load data to either DynamoDB, S3, or Kinesis, then aggregate the events before using COPY to Redshift.
See also this useful SO Q&A.
I implemented such a system last year inside my company using Kinesis and the Kinesis Connector. The Kinesis Connector is just a standalone app released by AWS that we run on a group of Elastic Beanstalk servers as Kinesis consumers. The connector aggregates messages to S3 every so often, or after a certain number of messages, and then triggers the COPY command to load the data into Redshift periodically. Since it runs on Elastic Beanstalk, you can tune the auto-scaling conditions to make sure the cluster grows and shrinks with the volume of data coming from the Kinesis stream.
BTW, AWS just announced Kinesis Firehose yesterday. I haven't played with it yet, but it definitely looks like a managed version of the Kinesis Connector.
I am a bit confused between AWS S3 and AWS Storage Gateway, as both perform the same function of storing data. Can anyone explain, with an example, the exact difference between the two services offered by Amazon?
AWS S3 is the data repository.
AWS Storage Gateway connects on-premises storage to the S3 repository.
You would use Storage Gateway for a number of reasons:
You want to stop purchasing storage devices, and use S3 to back your enterprise storage. In this case, your company would save to a location defined on the storage gateway device, which would then handle local caching, and offload the less frequently accessed data to S3.
You want to use it as a backup system, whereby Storage Gateway would snapshot the data into S3
To take advantage of the newly released virtual tape library, which would allow you to transition from tape storage to S3/Glacier storage without losing your existing tape software and cataloging investment
1. AWS S3 is the storage itself (technically an object store rather than a file system). It acts like a network disk; for people with no cloud experience, you can treat it as Dropbox.
2. AWS Storage Gateway is a virtual interface (in practice, a virtual machine running on your own server) which allows you to read/write data from/to AWS S3 or other AWS storage services transparently.
You can think of S3 as Dropbox itself, accessible through the web or an API, and AWS Storage Gateway as the Dropbox client on your PC, which presents Dropbox as a local drive (actually a network drive in this case).
I think the above answers are explanatory enough, but here's a quick recap.
Why would I store data on AWS S3?
Easy to use
Cost-effective
High durability and availability
No limit on the total amount of data you can store. The only restriction is that a single object cannot be larger than 5 TB.
Why would I use AWS Storage Gateway?
I have a large amount of data, or important data, stored in my data centre, and I want to store it in the cloud (AWS) for "obvious" reasons
I need a mechanism to transfer my important data from the data centre to AWS S3
I need to store my old, "not-so-useful" but "may-be-needed-in-future" data, so I will store it on AWS Glacier
Now, I need a mechanism to implement this successfully. AWS Storage Gateway is provided to fulfil this requirement.
AWS Storage Gateway provides a VM that is installed in your data centre and transfers that data for you.
That's it. (y)