I plan to run a MapReduce job on data stored in S3. The data size is around 1 PB. Will EMR copy the entire 1 PB of data to the spawned VMs with a replication factor of 3 (if my rf = 3)? If yes, does Amazon charge for copying data from S3 to the MapReduce cluster?
Also, is it possible to use EMR for data not residing in S3?
Amazon Elastic MapReduce accesses data directly from Amazon S3. It does not copy the data into HDFS. (It might use some local temporary storage, I'm not 100% sure.)
However, it certainly won't trigger your HDFS replication factor, since the data is not stored in HDFS. For example, Task Nodes, which have no HDFS, can still access data in S3.
There is no Data Transfer charge for data movements between Amazon S3 and Amazon EMR within the same Region, but it will count towards the S3 Request count.
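To make that concrete, here is a minimal PySpark sketch (the bucket names and paths are made up, not from your setup) of an EMR step that reads its input straight from S3, so nothing is copied into HDFS and the replication factor never comes into play:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-direct-read").getOrCreate()

    # The EMRFS/S3A connector streams the objects from S3; the data is not
    # copied into HDFS, so the HDFS replication factor does not apply to it.
    df = spark.read.parquet("s3://example-input-bucket/events/")

    counts = df.groupBy("event_type").count()

    # Results can be written straight back to S3 as well.
    counts.write.mode("overwrite").parquet("s3://example-output-bucket/event-counts/")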
Amazon Elastic MapReduce can certainly be used on data not residing in Amazon S3 -- it's just a matter of loading the data from your data source, such as using scp to copy the data into HDFS. Please note that the contents of HDFS will disappear when your cluster terminates. That's why S3 is a good place to store data for EMR -- it is persistent and there is no limit on the amount of data that can be stored.
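If your source data is elsewhere, the "load it into HDFS" step might look roughly like the following sketch (paths are illustrative, and it assumes the file has already been scp'd onto the master node); again, remember that HDFS disappears with the cluster:

    import subprocess

    local_file = "/home/hadoop/input/data.csv"   # file copied to the master node via scp
    hdfs_dir = "/user/hadoop/input/"             # destination directory inside HDFS

    # Push the local file into the cluster's HDFS so jobs can read it.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)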
I am trying to consume some data in Redshift using SageMaker to train a model. After some research, I found the best way to do so is to first unload the data from Redshift to an S3 bucket. I assume SageMaker has an API to interact with Redshift directly, so why do we need to unload the data to an S3 bucket first?
UNLOADing is a best practice and generally the method that the docs will promote, due to efficiency and performance. Redshift is a cluster with a single leader and multiple compute nodes. S3 is also a cluster - a distributed object store. Having multiple compute nodes connect to S3 when moving data is far faster and places less of a burden on the database.
Also, the tools you may be using with SageMaker (like EMR) are themselves clusters and will likewise benefit from multiple parallel connections to S3.
The larger the amount of data being moved, the greater this benefit will be.
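As a sketch of what the UNLOAD looks like in practice, you could issue it through the Redshift Data API (the cluster, database, IAM role ARN and bucket below are all placeholders):

    import boto3

    client = boto3.client("redshift-data")

    # UNLOAD writes the result set to S3 in parallel from the compute nodes.
    unload_sql = """
        UNLOAD ('SELECT * FROM training_data')
        TO 's3://example-ml-bucket/exports/training_data_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        FORMAT AS PARQUET
        PARALLEL ON;
    """

    response = client.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=unload_sql,
    )
    print("Statement id:", response["Id"])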
So I have some historical data on S3 in .csv/.parquet format. Every day I have a batch job running which gives me two files: the list of records that need to be deleted from the historical snapshot, and the new records that need to be inserted into the historical snapshot. I cannot run insert/delete queries on Athena. What cost-effective, AWS-managed options do I have to solve this problem?
Objects in Amazon S3 are immutable. This means that they can be replaced, but they cannot be edited.
Amazon Athena, Amazon Redshift Spectrum and Hive/Hadoop can query data stored in Amazon S3. They typically look in a supplied path and load all files under that path, including sub-directories.
To add data to such data stores, simply upload an additional object in the given path.
To delete all data in one object, delete the object.
However, if you wish to delete data within an object, then you will need to replace the object with a new object that has those rows removed. This must be done outside of S3. Amazon S3 cannot edit the contents of an object.
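As an illustration of that replace-the-object pattern (the bucket, key and column names here are hypothetical), you can download the object, filter out the rows to delete, and write a new object over the same key:

    import io
    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    bucket, key = "example-datalake", "snapshot/part-0001.parquet"
    ids_to_delete = {"a1", "b7"}   # would come from the daily "delete" file

    # Read the existing object into memory.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.read_parquet(io.BytesIO(body))

    # Remove the unwanted rows, then overwrite the object with the new version.
    df = df[~df["record_id"].isin(ids_to_delete)]
    buf = io.BytesIO()
    df.to_parquet(buf, index=False)
    s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue())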
See: AWS Glue adds new transforms (Purge, Transition and Merge) for Apache Spark applications to work with datasets in Amazon S3
Databricks has a product called Delta Lake that adds an additional layer between query tools and Amazon S3:
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake supports deleting data from a table because it sits "in front of" Amazon S3.
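For example, a row-level delete with Delta Lake on S3 might look roughly like this (it assumes the delta-spark package is configured on the cluster, and the table path is a placeholder):

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (
        SparkSession.builder.appName("delta-delete")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    table = DeltaTable.forPath(spark, "s3://example-datalake/snapshot_delta/")

    # Delta rewrites the affected Parquet files under the hood; the S3 objects
    # themselves are never edited in place.
    table.delete("record_id IN ('a1', 'b7')")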
Please help me understand the use of distcp. We are using S3, and in some scripts I can see data being written directly to S3, while in many other cases data is written to HDFS and then copied to S3 with distcp.
So when should we use distcp, and when can we write directly to the cloud?
First of all, you need to be very clear about why you would use distcp.
Distcp is mainly used to transfer data across Hadoop clusters. Let's say you have two remote Hadoop clusters, one in California and the other in Arizona, where cluster1 is your primary cluster and cluster2 is your secondary: you do all the processing on cluster1 and dump the resulting data to cluster2 once the processing on cluster1 is complete.
In this scenario you would distcp (copy) your data from cluster1 to cluster2, because the two clusters are separate and distcp can copy the data very fast by copying in parallel using mappers. So you can think of distcp as being similar to FTP, which is used for copying local data across different servers.
In your case, I think the HDFS you mentioned belongs to another Hadoop cluster, from which you are copying your data to AWS S3 or vice versa.
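For illustration, invoking distcp from a script to copy a whole HDFS directory to S3 in parallel might look like this (the paths and bucket name are placeholders):

    import subprocess

    src = "hdfs:///user/hadoop/output/daily/"
    dst = "s3a://example-backup-bucket/output/daily/"

    # distcp launches a MapReduce job, so many mappers copy files concurrently,
    # which is the main reason to prefer it over a single-threaded copy.
    subprocess.run(["hadoop", "distcp", src, dst], check=True)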
Hope this clears up your doubt.
I need to move data from on-premises to AWS Redshift (region1). What is the fastest way?
1) Use AWS Snowball to move the on-premises data to S3 (region1) and then use Redshift's SQL COPY command to copy the data from S3 to Redshift.
2) Use AWS Data Pipeline (note there is no AWS Data Pipeline in region1 yet, so I will set up a pipeline in region2, which is closest to region1) to move the on-premises data to S3 (region1), and another AWS Data Pipeline (region2) to copy the data from S3 (region1) to Redshift (region1) using the AWS-provided template (this template uses RedshiftCopyActivity to copy data from S3 to Redshift)?
Which of the above solutions is faster? Or is there another solution? Also, will RedshiftCopyActivity be faster than running Redshift's COPY command directly?
Note that this is a one-time movement, so I do not need AWS Data Pipeline's scheduling function.
Here is the AWS Data Pipeline link: AWS Data Pipeline. It says: AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources....
It comes down to network bandwidth versus the quantity of data.
The data needs to move from the current on-premises location to Amazon S3.
This can either be done via:
Network copy
AWS Snowball
You can use an online network calculator to calculate how long it would take to copy via your network connection.
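The calculation itself is simple; here is a back-of-the-envelope example (the data size and bandwidth figures are made up):

    # Rough transfer-time estimate for a network copy.
    data_tb = 50                      # amount of data to move, in terabytes
    link_mbps = 1000                  # usable uplink bandwidth, in megabits per second

    data_bits = data_tb * 8 * 10**12  # terabytes -> bits (decimal units)
    seconds = data_bits / (link_mbps * 10**6)
    print(f"~{seconds / 86400:.1f} days at {link_mbps} Mbit/s")  # ~4.6 days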
Then, compare that to using AWS Snowball to copy the data.
Pick whichever one is cheaper/easier/faster.
Once the data is in Amazon S3, use the Amazon Redshift COPY command to load it.
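A sketch of that load step, run through the Redshift Data API (the table, cluster, bucket and role names are placeholders):

    import boto3

    client = boto3.client("redshift-data")

    # COPY pulls the staged files from S3 into the target table in parallel.
    copy_sql = """
        COPY sales
        FROM 's3://example-staging-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """

    client.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=copy_sql,
    )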
If data is being continually added, you'll need to find a way to send continuous updates to Redshift. This might be easier via network copy.
There is no benefit in using Data Pipeline.
I would like to backup a snapshot of my Amazon Redshift cluster into Amazon Glacier.
I don't see a way to do that using the API of either Redshift or Glacier. I also don't see a way to export a Redshift snapshot to a custom S3 bucket so that I can write a script to move the files into Glacier.
Any suggestion on how I should accomplish this?
There is no function in Amazon Redshift to export data directly to Amazon Glacier.
Amazon Redshift snapshots, while stored in Amazon S3, are only accessible via the Amazon Redshift console for restoring data back to Redshift. The snapshots are not accessible for any other purpose (e.g. moving to Amazon Glacier).
The closest option for moving data from Redshift to Glacier would be to use the Redshift UNLOAD command to export data to files in Amazon S3, and then to lifecycle the data from S3 into Glacier.
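As a sketch, after UNLOADing to S3 you could attach a lifecycle rule that transitions the exported objects to Glacier (the bucket name, prefix and 30-day delay are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Transition everything under the unload/ prefix to Glacier after 30 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-redshift-exports",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-unloads-to-glacier",
                    "Filter": {"Prefix": "unload/"},
                    "Status": "Enabled",
                    "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                }
            ]
        },
    )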
Alternatively, simply keep the data in Redshift snapshots. Backup storage beyond the provisioned storage size of your cluster and backups stored after your cluster is terminated are billed at standard Amazon S3 rates. This has the benefit of being easily loadable back into a Redshift cluster. While you'd be paying slightly more for storage (compared to Glacier), the real cost saving is in the convenience of quickly restoring the data in future.
Is there any use case for taking a backup, given that Redshift automatically keeps snapshots? Here is a reference link