If I want to utilize Amazon Web Services to provide the hardware (cores and memory) to process a large amount of data, do I need to upload that data to AWS? Or can I keep the data on my own systems and just rent the hardware?
Yes, in order for an AWS-managed system to process a large amount of data, you will need to upload the data to an AWS region for processing at some point. AWS does not rent out servers to other physical locations, as far as I'm aware (EDIT: actually, AWS does have an offering for on-premises data processing as of Nov 30 2016, see Snowball Edge).
AWS offers a variety of services for getting large amounts of data into its data centers for processing (ranging from basic HTTP uploads to physically mailing disk drives for direct data import), and the best service to use will depend entirely on your specific use case, needs and budget. See the Cloud Data Migration page for an overview of the various services and help on selecting the most appropriate one.
Related
We are building a customer-facing app. For this app, data is being captured by IoT devices owned by a third party, and is transferred to us from their server via API calls. We store this data in our AWS DocumentDB cluster. We have the user app connected to this cluster with real-time data feed requirements. Note: the data is time series data.
The thing is, for long-term data storage and for creating analytics dashboards to be shared with stakeholders, our data governance folks are requesting us to replicate/copy the data daily from the AWS DocumentDB cluster to their Google Cloud Platform -> BigQuery. We could then run queries directly on BigQuery to perform analysis and send the data to something like Explorer or Tableau to create dashboards.
I couldn't find any straightforward solutions for this. Any ideas, comments or suggestions are welcome. How do I achieve or plan the above replication? And how do I make sure the data is copied efficiently, in terms of memory and pricing? Also, I don't want to disturb the performance of the DocumentDB cluster since it supports our user-facing app.
This solution would need some custom implementation. You can utilize change streams and process the data changes in intervals, sending them to BigQuery, so there is a data replication mechanism in place for you to run analytics. One of the documented use cases for change streams is analytics with Redshift, so BigQuery should serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
This document also contains sample Python code for consuming change stream events.
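As a rough sketch of the approach (not a production pipeline): tail the DocumentDB change stream with pymongo and forward batches to BigQuery. The cluster endpoint, database/collection names, and BigQuery table ID below are placeholders, and change streams must first be enabled on the collection.

```python
from pymongo import MongoClient
from google.cloud import bigquery

# Placeholder DocumentDB connection string (TLS and replica set as per the docs)
mongo = MongoClient(
    "mongodb://user:password@my-docdb-cluster:27017/?tls=true&replicaSet=rs0"
)
collection = mongo["iot"]["readings"]

bq = bigquery.Client()
table_id = "my-project.analytics.readings"  # placeholder BigQuery table

batch = []
# full_document="updateLookup" returns the whole document for update events
with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        doc = change.get("fullDocument")
        if doc is not None:
            doc["_id"] = str(doc["_id"])  # make the ObjectId BigQuery-friendly
            batch.append(doc)
        if len(batch) >= 500:  # flush in intervals/batches to limit load and cost
            errors = bq.insert_rows_json(table_id, batch)
            if errors:
                raise RuntimeError(errors)
            batch.clear()
```

Running this consumer on a separate instance (or as a scheduled job that resumes from a stored resume token) keeps the analytics load off the cluster that serves your user-facing app.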
In the AWS developer docs for SageMaker, they recommend using PIPE mode to stream large datasets directly from S3 to the model training containers (since it's faster, uses less disk storage, reduces training time, etc.).
However, they don't include information on whether this data streaming transfer is charged for (they only list data transfer pricing for the model building & deployment stages, not training).
So I wanted to ask whether anyone knows if this data transfer in PIPE mode is charged for. If it is, I don't see how it could be recommended for large datasets, since streaming a few epochs for each model iteration can get prohibitively expensive (my dataset, for example, is 6.3 TB on S3).
Thank you!
You are charged for the S3 GET calls that you make, just as you would be if you used the FILE option for training. However, these charges are usually marginal compared to the alternatives.
When you are using FILE mode, you need to pay for the local EBS on the instances, and for the extra time that your instances are up and only copying the data from S3. If you are running multiple epochs, you will not benefit much from PIPE mode; however, when you have so much data (6.3 TB), you don't really need to run multiple epochs.
PIPE mode works best when you can make a single pass over the data. In the era of big data, this is a better model of operation, as you can't retrain your models from scratch very often. In SageMaker, you can point to your "old" model in the "model" channel and your "new" data in the "train" channel, and benefit from PIPE mode to the maximum.
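A minimal sketch of that setup with the SageMaker Python SDK is below. It assumes a training image that supports Pipe input and an algorithm that accepts a "model" channel for incremental training; the image URI, role ARN, and S3 paths are placeholders.

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<algorithm-image-uri>",                      # placeholder container
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # placeholder role ARN
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    input_mode="Pipe",  # stream from S3 instead of copying to local EBS first
)

estimator.fit({
    # "train" channel: the new data, streamed in Pipe mode
    "train": TrainingInput("s3://my-bucket/new-data/"),
    # "model" channel: the previously trained model artifacts
    # (only algorithms that support incremental training accept this channel)
    "model": TrainingInput(
        "s3://my-bucket/old-model/model.tar.gz",
        content_type="application/x-sagemaker-model",
        input_mode="File",  # model artifacts are typically read in File mode
    ),
})
```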
I just realized that on S3's official pricing page, it says the following under the Data transfer section:
Transfers between S3 buckets or from Amazon S3 to any service(s) within the same AWS Region are free.
And since my S3 bucket and my SageMaker instances will be in the same AWS Region, there should be no data transfer charge.
When we transfer a huge amount of data from the on-premises data center to Snowball and dispatch the Snowball device to AWS data centers, a good amount of real-time data will continue to get recorded in the customer data center during the transfer period. Let's say Snowball takes 2 days to ship the data to the designated AWS Region's data center. How does AWS ensure that the data recorded in the on-prem data center during those 2 days is also migrated into the cloud, so there is no data left behind in the customer data center and 100% of the data is migrated? What options does AWS have to address this?
It really depends on the kind of data you are talking about, but the general strategy would be to use Snowball to transfer all the bulk data up to a particular moment in time, and from that moment on, ship the data that arrives in your data centre to AWS directly over the network.
The problem Snowball solves is transferring large amounts of data that wouldn't be efficient over the network, but for all the new data, you can send a copy to AWS in real time or at regular intervals. That data should be small enough that network transfer works fine.
Regarding data migration, AWS has many different services, and it really depends on your specific requirements.
A very common setup is to use, at the very least, Direct Connect, so you have a dedicated connection to Amazon's data centres. If the data you have is small enough, you can just use simple tools to send it to S3 or to a Kinesis Firehose.
For more complex scenarios, you might want to use a Storage Gateway, which sits in your data centre and allows seamless integration with many data and file stores on AWS.
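As a small sketch of the "ship new data over the network" piece: push newly recorded events to a Kinesis Data Firehose delivery stream with boto3 so they land in S3 while the Snowball bulk data is in transit. The stream name and record format below are placeholders.

```python
import json
import boto3

firehose = boto3.client("firehose")

def ship_event(event: dict) -> None:
    """Send one newly recorded event to the delivery stream (placeholder name)."""
    firehose.put_record(
        DeliveryStreamName="onprem-incremental-stream",
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

# Example: an event recorded on-prem while the Snowball device is in transit
ship_event({"sensor": "rack-12", "ts": "2024-01-01T00:00:00Z", "value": 42})
```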
So the details depend on each use case, but the answer would be a combination of the technologies mentioned on the Cloud Data Migration page.
My client has a service which stores a lot of files, like video or sound files. The service works well; however, it looks like long-term file storage is quite a challenge, and we would like to use AWS to store these files.
The problem is the following: the client wants to use AWS Kinesis to transfer every file from our servers to AWS. Is this possible? Can we transfer files using that service? There are a lot of video files, and we get more every day. And every file is relatively big.
We would also like to save some details about the files, possibly in DynamoDB; we could use Lambda functions for that.
The most important thing is that we need a reliable data transfer option.
Kinesis would not be the right tool to upload files unless they were all very small - and most videos would almost certainly be over the 1 MB record size limit:
The maximum size of a data blob (the data payload before Base64-encoding) within one record is 1 megabyte (MB).
https://aws.amazon.com/kinesis/streams/faqs/
Use S3 with multipart upload via one of the SDKs. Objects you won't be accessing for 90+ days can be moved to Glacier.
Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.
Amazon Web Services. Amazon Simple Storage Service (S3) Developer Guide (Kindle Locations 4302-4306). Amazon Web Services, Inc.. Kindle Edition.
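For example, a minimal sketch with the boto3 transfer manager, which splits large files into parts, uploads them in parallel, and retries failed parts; the bucket, key, and file names are placeholders:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # switch to multipart above ~100 MB
    multipart_chunksize=64 * 1024 * 1024,   # 64 MB parts
    max_concurrency=8,                      # upload parts in parallel
)

s3.upload_file(
    Filename="big-video.mp4",
    Bucket="my-video-bucket",
    Key="uploads/big-video.mp4",
    Config=config,
)
```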
To further optimize file upload speed, use transfer acceleration:
Amazon S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. Transfer Acceleration takes advantage of Amazon CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to Amazon S3 over an optimized network path.
Amazon Web Services. Amazon Simple Storage Service (S3) Developer Guide (Kindle Locations 2060-2062). Amazon Web Services, Inc.. Kindle Edition.
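A small sketch of the client side with boto3, assuming Transfer Acceleration has been (or is being) enabled on the bucket; bucket, key and file names are placeholders:

```python
import boto3
from botocore.config import Config

s3 = boto3.client("s3")
# One-time, bucket-level switch (requires the appropriate permissions)
s3.put_bucket_accelerate_configuration(
    Bucket="my-video-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Client that routes requests through the accelerate (edge) endpoint
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("big-video.mp4", "my-video-bucket", "uploads/big-video.mp4")
```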
Kinesis launched a new service, "Kinesis Video Streams" (https://aws.amazon.com/kinesis/video-streams/), which may be helpful for moving large amounts of data.
We need to populate a database which sits on AWS { EC2 (Cluster Compute Eight Extra Large) + 1 TB EBS }. Given that we have close to 700 GB of data locally, how can I find out the (theoretical) time it would take to upload the entire dataset? I could not find any information on data upload/download speeds for EC2.
Since this will depend strongly on the networking between your site and Amazon's data centre...
Test it with a few GB and extrapolate.
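For a back-of-the-envelope estimate, assuming a sustained 100 Mbit/s link (swap in your own measured throughput from the test above):

```python
# Rough transfer-time estimate: 700 GB over an assumed sustained link speed.
data_gb = 700
link_mbps = 100                        # replace with your measured throughput

data_megabits = data_gb * 1000 * 8     # GB -> megabits (decimal units)
seconds = data_megabits / link_mbps
print(f"~{seconds / 3600:.1f} hours")  # ~15.6 hours at 100 Mbit/s
```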
Be aware of AWS Import/Export and consider the option of simply couriering Amazon a portable hard drive. (Old saying: "Never underestimate the bandwidth of a station wagon full of tape".) In fact, I note the page includes a "When to use..." section which gives some indication of transfer times vs. connection bandwidth.