When we transfer a huge amount of data from the on-premises data center to Snowball and dispatch the Snowball device to an AWS data center, a good amount of real-time data will continue to be recorded in the customer data center during the transfer period. Let's say Snowball takes 2 days to ship the data to the designated AWS region. How does AWS ensure that the data recorded on-premises during those 2 days is also migrated to the cloud, so that no data is left behind in the customer data center and 100% of the data is migrated? What options does AWS have to address this?
It really depends on the kind of data you are talking about, but the general strategy would be to use Snowball to transfer all the bulk data up to a particular moment in time, and from that moment on to ship the data that arrives in your data centre to AWS directly over the network.
The problem Snowball solves is transferring large amounts of data that wouldn't be efficient to move over the network; for all the new data, you can send a copy to AWS in real time or at regular intervals. That incremental data should be small enough that a network transfer works fine.
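For example, here is a rough sketch of the incremental side, assuming your records land as files under /data/onprem and you noted a cut-off timestamp when the Snowball was loaded (the bucket name, directory and cut-off below are placeholders):

```python
import datetime
from pathlib import Path

import boto3

CUTOFF = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)  # assumed Snowball cut-off
BUCKET = "my-migration-bucket"                                         # hypothetical bucket
DATA_DIR = Path("/data/onprem")                                        # hypothetical data directory

s3 = boto3.client("s3")

for path in DATA_DIR.rglob("*"):
    if not path.is_file():
        continue
    mtime = datetime.datetime.fromtimestamp(path.stat().st_mtime, tz=datetime.timezone.utc)
    if mtime > CUTOFF:
        # Only files recorded after the Snowball snapshot go over the wire.
        s3.upload_file(str(path), BUCKET, str(path.relative_to(DATA_DIR)))
```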
Regarding data migration, AWS has many different services, and the right choice really depends on your specific requirements.
A very common setup is to use, at the very least, Direct Connect, so you have a dedicated connection to Amazon's data centres. If the data is small enough, you can use simple tools to send it to S3 or to a Kinesis Data Firehose delivery stream.
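As an illustration, a minimal boto3 sketch for the Firehose route; the delivery stream name and region are assumptions, and the stream would be configured to deliver into S3:

```python
import json

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")  # assumed region

def ship_record(record: dict) -> None:
    # Firehose buffers the records and delivers them to the configured S3 bucket.
    firehose.put_record(
        DeliveryStreamName="onprem-ingest",                        # hypothetical stream name
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )

ship_record({"sensor_id": 42, "value": 3.14})
```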
For more complex scenarios, you might want to use a Storage Gateway, which sits in your data centre and allows seamless integration with many data and file stores on AWS.
So the details depend on each use case, but the answer will be a combination of the technologies mentioned on the Cloud Data Migration page.
Related
I have 8 TB of on-premises data at present. I need to transfer it to AWS S3. Going forward, 800 GB of data will need to be uploaded every month. What will be the cost of the different approaches?
Run a Python script on an EC2 instance.
Use AWS Lambda for the transfer.
Use AWS DMS to transfer the data.
I'm sorry, I won't do the calculations for you, but I hope you can do them yourself with this tool :)
https://calculator.aws/#/
According to
https://aws.amazon.com/s3/pricing/
Data Transfer IN To Amazon S3 From Internet
All data transfer in: $0.00 per GB
Hope you find your answer!
While your data is inside SQL, you need to move it out first. If your SQL database is AWS's managed RDS, that's an easy task: just back it up to S3. If it's something you manage yourself, you will need to figure out how to move the data to S3. By the way, you are not limited to S3; you can use disk services too.
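For the managed-RDS branch, the "back it up to S3" route can be as simple as exporting a snapshot; here is a hedged boto3 sketch where every identifier, ARN and bucket name is a placeholder. For a self-managed database you would instead dump it yourself and upload the dump to S3 (e.g. with s3.upload_file).

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # assumed region

# Export an existing RDS snapshot to S3 (all values below are placeholders).
rds.start_export_task(
    ExportTaskIdentifier="initial-8tb-export",
    SourceArn="arn:aws:rds:us-east-1:123456789012:snapshot:mydb-snap",
    S3BucketName="my-migration-bucket",
    IamRoleArn="arn:aws:iam::123456789012:role/rds-s3-export",
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/abcd-1234",
)
```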
You do not need an EC2 instance to do the data transfer unless you also need to run some compute on that data.
Then, to move 8 TB, there are a couple of options. Cost is a tricky thing: downtime from a slower transfer may mean losses, security risk is another cost to think about, and so is developer time, so it really depends on your situation.
Option A would be to use AWS File Gateway: mount a network drive locally with enough space and just sync from local storage to that drive. https://aws.amazon.com/storagegateway/file/ This is probably the easiest way, since File Gateway takes care of failed connections, retries, etc. You mount a local network drive on your OS, and the gateway sends the data to an S3 bucket.
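To make Option A concrete, a minimal sketch of the local side, assuming the File Gateway share is mounted at /mnt/filegateway and your data lives under /data/onprem (both paths are made up); the gateway itself handles the upload to S3:

```python
import shutil
from pathlib import Path

SOURCE = Path("/data/onprem")    # hypothetical local data directory
DEST = Path("/mnt/filegateway")  # hypothetical File Gateway NFS mount

def sync(source: Path, dest: Path) -> None:
    for src_file in source.rglob("*"):
        if not src_file.is_file():
            continue
        target = dest / src_file.relative_to(source)
        # Copy only files that are missing or older on the gateway share.
        if not target.exists() or target.stat().st_mtime < src_file.stat().st_mtime:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src_file, target)

if __name__ == "__main__":
    sync(SOURCE, DEST)
```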
Option B would be to just send the data over the public network, which may not be possible if the connection is too slow or too insecure for your requirements.
Option C, which is usually not used for a one-time transfer, would be a private link to AWS. This would provide more security and probably more speed.
Option D would be to use the Snow family of products. The smallest, AWS Snowcone, has exactly 8 TB of capacity, so if you really are under 8 TB it may be the most cost-effective way to transfer. If you actually have a bit more than 8 TB, you need AWS Snowball, which can handle much more than 8 TB (up to 80 TB), which is enough in your case. Fun note: for transfers of up to 100 PB there is Snowmobile.
We are building a customer-facing app. For this app, data is captured by IoT devices owned by a third party and is transferred to us from their server via API calls. We store this data in our AWS DocumentDB cluster, and the user app is connected to this cluster with real-time data feed requirements. Note: the data is time-series data.
The thing is, for long-term data storage and for creating analytics dashboards to be shared with stakeholders, our data governance folks are asking us to replicate/copy the data daily from the AWS DocumentDB cluster to their Google Cloud Platform BigQuery. We can then run queries directly on BigQuery to perform analysis and send the data to maybe Explorer or Tableau to create dashboards.
I couldn't find any straightforward solutions for this. Any ideas, comments, or suggestions are welcome. How do I achieve or plan the above replication? And how do I make sure the data is copied efficiently in terms of memory and pricing? Also, I don't want to disturb the performance of AWS DocumentDB, since it supports our user-facing app.
This solution would need some custom implementation. You can utilize change streams and process the data changes at intervals to send them to BigQuery, so there is a data replication mechanism in place for you to run analytics. One of the documented use cases of change streams is analytics with Redshift, and BigQuery would serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
This document also contains sample Python code for consuming change stream events.
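A rough sketch of that idea, not production code: tail the change stream with pymongo and forward inserted documents to BigQuery in small batches. The cluster URI, database/collection and BigQuery table names are placeholders, and in practice you would also persist resume tokens so the consumer can pick up where it left off:

```python
from google.cloud import bigquery
from pymongo import MongoClient

DOCDB_URI = "mongodb://user:pass@docdb-cluster:27017/?tls=true"  # placeholder URI
BQ_TABLE = "my-project.analytics.readings"                       # placeholder table

mongo = MongoClient(DOCDB_URI)
collection = mongo["mydb"]["readings"]   # placeholder database/collection
bq = bigquery.Client()

BATCH_SIZE = 500
batch = []

# Tail the DocumentDB change stream and forward inserts to BigQuery in batches.
with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        if change["operationType"] != "insert":
            continue
        doc = change["fullDocument"]
        doc["_id"] = str(doc["_id"])  # make the id BigQuery-friendly
        batch.append(doc)
        if len(batch) >= BATCH_SIZE:
            errors = bq.insert_rows_json(BQ_TABLE, batch)  # streaming insert
            if errors:
                raise RuntimeError(f"BigQuery insert failed: {errors}")
            batch.clear()
```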
On the AWS developer docs for SageMaker, they recommend using PIPE mode to stream large datasets directly from S3 to the model-training containers (since it's faster, uses less disk storage, reduces training time, etc.).
However, they don't say whether this streaming data transfer is charged for (they only list data transfer pricing for the model building and deployment stages, not training).
So I wanted to ask if anyone knows whether data transfer in PIPE mode is charged for. If it is, I don't see how it could be recommended for large datasets, since streaming a few epochs for each model iteration could get prohibitively expensive (my dataset, for example, is 6.3 TB on S3).
Thank you!
You are charged for the S3 GET calls that you make, similarly to what you would be charged if you used FILE mode for training. However, these charges are usually marginal compared to the alternatives.
When you use FILE mode, you pay for the local EBS volumes on the instances and for the extra time that your instances are up while only copying the data from S3. If you are running multiple epochs, you will not benefit much from PIPE mode; however, when you have this much data (6.3 TB), you don't really need to run multiple epochs.
The best use of PIPE mode is when you can do a single pass over the data. In the era of big data, this is a better model of operation, as you can't retrain your models often. In SageMaker, you can point to your "old" model in the "model" channel and your "new" data in the "train" channel, and benefit from PIPE mode to the maximum.
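As a hedged sketch with the SageMaker Python SDK, this is roughly what that setup looks like; the image URI, role, S3 paths and channel names are placeholders and depend on your own algorithm or container:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<your-training-image-uri>",                       # placeholder image
    role="arn:aws:iam::123456789012:role/SageMakerRole",         # placeholder role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    input_mode="Pipe",                                           # stream from S3 instead of copying
)

estimator.fit({
    "train": TrainingInput("s3://my-bucket/new-data/", input_mode="Pipe"),
    "model": TrainingInput("s3://my-bucket/old-model/model.tar.gz", input_mode="File"),
})
```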
I just realized that S3's official pricing page says the following under the Data transfer section:
Transfers between S3 buckets or from Amazon S3 to any service(s) within the same AWS Region are free.
And since my S3 bucket and my SageMaker instances will be in the same AWS region, there should be no data transfer charge.
If I want to use Amazon Web Services to provide the hardware (cores and memory) to process a large amount of data, do I need to upload that data to AWS? Or can I keep the data on my own system and just rent the hardware?
Yes, in order for an AWS-managed system to process a large amount of data, you will need to upload the data to an AWS region for processing at some point. AWS does not rent out servers to other physical locations, as far as I'm aware (EDIT: actually, AWS does have an offering for on-premises data processing as of Nov 30 2016, see Snowball Edge).
AWS offers a variety of services for getting large amounts of data into its data centers for processing, ranging from basic HTTP uploads to physically mailing disk drives for direct data import. The best service to use will depend entirely on your specific use case, needs, and budget. See the Cloud Data Migration page for an overview of the various services and help selecting the most appropriate one.
Sorry for the silly question; I am new to cloud development. I am trying to develop a real-time processing app in the cloud that can process data from a sensor in real time. The data stream has a very low data rate, <50 Kbps per sensor, and probably <10 sensors will be running at once.
I am confused about what use Amazon Kinesis would be for this application. I can use EC2 directly to receive my stream and process it. Why do I need Kinesis?
Why do I need Kinesis?
Short answer, you don't.
Yes, you can use EC2 - and probably dozens of other technologies.
Here are the first two sentences of the Kinesis product page:
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. You can configure hundreds of thousands of data producers to continuously put data into an Amazon Kinesis stream.
So, if you want to manage the stack yourself, and/or you don't need massive scale, and/or you don't need the ability to scale this processing to hundreds of thousands of simultaneous producers, then Kinesis may be overkill.
On the other hand, if the ingestion of this data is mission critical, and you don't have the time, skills or ability to manage the underlying infrastructure - or there is a chance the scale of your application will grow exponentially, then maybe Kinesis is the right choice - only you can decide based on your requirements.
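And if you do end up choosing Kinesis, the producer side for a workload this small is tiny. A minimal boto3 sketch, assuming a stream named "sensor-stream" (hypothetical); at <50 Kbps per sensor a single shard is plenty:

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

def publish_reading(sensor_id: str, payload: dict) -> None:
    kinesis.put_record(
        StreamName="sensor-stream",               # hypothetical stream name
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=sensor_id,                   # keeps each sensor's records ordered per shard
    )

publish_reading("sensor-01", {"temperature": 21.7})
```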
Along with what E.J. Brennan just said, there are many other ways to solve your problem, as the rate of data is very low.
As far as I know, Amazon Kinesis runs on EC2 under the hood, so maybe your question is really why to use Kinesis as a streaming solution at all.
For scalability reasons, you might need a streaming solution in the future, as your volume of data grows, the cost of maintaining on-premises resources increases, and the focus shifts from application development to administration.
So Kinesis, for that matter, would provide a pay-per-use model instead of you having to worry about scaling your resource stack up and down.