AWS EMR migration from us-east to us-west - amazon-web-services

I am planning to move an EMR cluster from us-east to us-west. I have data residing in HDFS as well as in S3, but due to a lack of proper documentation I am unable to get started.
Does anyone have any experience doing this?

You can use the s3-dist-cp tool on EMR to copy data from HDFS to S3, and later use the same tool to copy from S3 to HDFS on the cluster in the other region. Also note that it is always recommended to use EMR with S3 buckets in the same region.
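A rough sketch of that flow, with placeholder bucket names and paths, could look like this (the middle step assumes you want the data staged in a us-west bucket rather than reading the us-east bucket cross-region):
# On the source (us-east) cluster: HDFS -> S3
s3-dist-cp --src hdfs:///user/hadoop/data --dest s3://my-useast-bucket/data
# Copy the objects to a bucket in us-west
aws s3 sync s3://my-useast-bucket/data s3://my-uswest-bucket/data --source-region us-east-1 --region us-west-2
# On the destination (us-west) cluster: S3 -> HDFS
s3-dist-cp --src s3://my-uswest-bucket/data --dest hdfs:///user/hadoop/data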

Related

AWS EMR: How to migrate data from one EMR to another EMR

I currently have an AWS EMR cluster running HBase, and I am saving the data to S3. I want to migrate the data to a new EMR cluster on the same account. What is the proper way to migrate data from one EMR cluster to another?
Thank you
There are different ways to copy the table from one cluster to another:
Use the CopyTable utility. The disadvantage is that it can degrade region server performance, or the tables need to be disabled prior to the copy.
HBase snapshots (recommended). They have minimal impact on region server performance.
You can follow the AWS documentation to perform the snapshot/restore operations.
Basically you will do the following (a sketch of the commands is shown after this list):
Create Snapshot
Export to S3
Import from S3
Restore to HBase
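As a rough sketch of those steps (the table, snapshot, and bucket names are placeholders, and the import path should match your destination cluster's hbase.rootdir):
# On the source cluster, in the HBase shell: create the snapshot
snapshot 'my_table', 'my_table_snapshot'
# From the command line: export the snapshot to S3
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot my_table_snapshot -copy-to s3://my-bucket/hbase-snapshots
# On the destination cluster: import the snapshot from S3 into the local HBase root directory
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot my_table_snapshot -copy-from s3://my-bucket/hbase-snapshots -copy-to hdfs:///user/hbase
# In the HBase shell on the destination cluster: restore it
restore_snapshot 'my_table_snapshot'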

Migrating from one aws elasticsearch cluster to another aws elasticsearch cluster in different regions

I am trying to move data from one AWS Elasticsearch cluster in Oregon to another AWS Elasticsearch cluster in N. Virginia. I have registered the repository in the source ES cluster and taken a manual snapshot to S3 (in Oregon). Now I am trying to register a repository in the destination ES cluster pointing at the same S3 location, but it is not letting me do it.
It throws an error that the S3 bucket should be in the same region.
I am now stuck. Can anybody suggest a method for this?
Based on the comments, the solution is to make a backup of the cluster into a bucket in Oregon, copy it from that bucket to a bucket in N. Virginia, and then restore the cluster in the N. Virginia region using the second bucket.
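The bucket-to-bucket copy can be done with the AWS CLI; a minimal sketch, with placeholder bucket names, would be:
# Copy the snapshot objects from the Oregon bucket to a bucket in N. Virginia
aws s3 sync s3://es-snapshots-oregon s3://es-snapshots-virginia --source-region us-west-2 --region us-east-1
You would then register the N. Virginia bucket as a snapshot repository on the destination domain (on the AWS Elasticsearch service that registration request has to be signed with IAM credentials and reference a role the service can assume) and run the restore from it.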

AWS Document DB MultiRegion replication

I know that AWS DocumentDB does not support multi-region replication, and even the snapshot cannot be shared across regions.
Please suggest how we can do the replication manually.
Sam,
AWS released cross-region snapshot copy today (7/10/20), so that should get you what you need. Good luck.
https://aws.amazon.com/about-aws/whats-new/2020/07/amazon-documentdb-support-cross-region-snapshot-copy/
Thank you for the feedback, Sam. A couple of options are to use change streams + Lambda/worker to replicate the data, or to take backups to S3 with mongodump and utilize S3's cross-region replication capabilities.
Amazon DocumentDB now supports global clusters. The primary cluster supports writes and read-scaling, up to 5 regions can be added as read-only, see https://aws.amazon.com/documentdb/global-clusters/
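For the snapshot-copy route, a rough sketch with the AWS CLI might look like the following; the identifiers and regions are placeholders, and the exact flags (for example --kms-key-id for encrypted snapshots) should be checked against the current aws docdb copy-db-cluster-snapshot reference:
# Run in the destination region, referencing the source snapshot by its ARN (all identifiers are placeholders)
aws docdb copy-db-cluster-snapshot --region us-west-2 --source-region us-east-1 --source-db-cluster-snapshot-identifier arn:aws:rds:us-east-1:123456789012:cluster-snapshot:my-docdb-snapshot --target-db-cluster-snapshot-identifier my-docdb-snapshot-copy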

Uploading File to S3, then process in EMR and last transfer to Redshift

I am new to this forum and this technology, and I am looking for your advice. I am working on a POC and below are my requirements. Could you please guide me on the way to achieve this?
Copy data from NAS to S3.
Use S3 as a source in EMR Job with target to S3/Redshift.
Any link or PDF will also be helpful.
Thanks,
Pardeep
There's a lot here that you're asking, and there's not a lot of info on your use case to go by, so I'm going to be very general in my answer and hopefully it at least points you in the right direction.
You can use Lambda to copy data from your NAS to S3. Assuming your NAS is on-premise and assuming you have a VPN into your VPC or even Direct Connect configured, then you can use a VPC enabled Lambda function to read from the NAS on-premise and write to S3.
If your NAS is running on EC2 the above will remain the same except there's no need for VPN or Direct Connect.
Are you looking to kick off the EMR job from Lambda? You can use S3 as a source for EMR to then output to S3 either from within Lambda or via other means as well.
If you can provide more info on your use case we could probably give you a better quality answer.
Copy data from NAS to S3.
It really depends on the amount of data and the frequency at which you run the copy job. If the data is in GBs, then you can install the AWS CLI on a machine where the NFS share is attached. AWS CLI commands like cp can be multithreaded and can easily copy your datasets to S3. You might also enable S3 Transfer Acceleration to speed things up. Having AWS Direct Connect to your company network can also speed up any transfers from on-premises to AWS.
http://docs.aws.amazon.com/cli/latest/topic/s3-config.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
https://aws.amazon.com/directconnect/
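As a minimal sketch of the GB-scale case (the mount point and bucket name are placeholders):
# Raise the CLI's parallelism, then copy the NFS-mounted data to S3 recursively
aws configure set default.s3.max_concurrent_requests 20
aws s3 cp /mnt/nas/data s3://my-bucket/data/ --recursive
# Or use sync so that subsequent runs only upload new or changed files
aws s3 sync /mnt/nas/data s3://my-bucket/data/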
If the data is in TBs (which is probably distributed across multiple volumes), then you might have to consider using physical transfer services like AWS Snowball, AWS Import/Export, or AWS Snowmobile, based on the use case.
https://aws.amazon.com/cloud-data-migration/
Use S3 as a source in EMR Job with target to S3/Redshift.
Again, as there are a lot of applications on EMR, there are a lot of choices. Redshift supports COPY/UNLOAD commands to and from S3, which any application can make use of. If you want to use Spark on EMR, then installing the Databricks spark-redshift driver is a viable option for you.
https://github.com/databricks/spark-redshift
https://databricks.com/blog/2015/10/19/introducing-redshift-data-source-for-spark.html
https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/
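With spark-redshift on the classpath (via --packages or --jars), a minimal sketch of the S3 -> EMR -> Redshift leg could look like the following, assuming the spark-shell's SparkSession (spark); the input path, cluster endpoint, credentials, table, and tempdir bucket are all placeholders:
// Read the raw data from S3, transform as needed, then write the result to Redshift.
// spark-redshift stages the rows in the S3 tempdir and issues a COPY behind the scenes.
val input = spark.read.json("s3://my-bucket/input/")   // or csv/parquet, depending on the source format
input.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb?user=myuser&password=mypassword")
  .option("dbtable", "public.my_table")
  .option("tempdir", "s3a://my-bucket/redshift-temp/")
  .mode("append")
  .save()
Note that spark-redshift also needs S3 credentials for the tempdir (an aws_iam_role, forwarded keys, or temporary credentials); the threads below go into that in more detail.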

Does the default EMR Spark come preconfigured to directly access redshift tables?

Reading/writing between EMR Spark clusters and Redshift can definitely be done via an intermediate data dump to S3.
There are Spark libraries, however, which can treat Redshift directly as a data source: https://github.com/databricks/spark-redshift
Do the EMR 5.0 Spark clusters come preconfigured with the library and access credentials for redshift access?
Do the EMR 5.0 Spark clusters come preconfigured with the library and access credentials for redshift access?
No, EMR doesn't provide this library from Databricks.
To access Redshift:
Connectivity to Redshift doesn't require any IAM-based authentication. It simply requires the EMR cluster (master/slave IPs or the EMR master/slave security groups) to be whitelisted in Redshift's security group on its default port 5439.
Now, since the Spark executors run the COPY/UNLOAD commands issued on behalf of your Spark job, and these commands require access to S3, you will need to configure the IAM credentials mentioned here: https://github.com/databricks/spark-redshift#aws-credentials
To access S3 from EMR:
EMR nodes by default assume an IAM instance profile role called EMR_EC2_DefaultRole, and the permissions on this role define what EMR nodes and their processes (using InstanceProfileCredentialsProvider) have access to. So you may use the 4th way mentioned in the documentation. The access key, secret key, and session token can be retrieved as follows and passed as the options temporary_aws_access_key_id, temporary_aws_secret_access_key, and temporary_aws_session_token:
https://github.com/databricks/spark-redshift#parameters
// Get credentials from the EC2 instance profile (EMR_EC2_DefaultRole)
import com.amazonaws.auth.{AWSSessionCredentials, InstanceProfileCredentialsProvider}

val provider = new InstanceProfileCredentialsProvider()
val credentials: AWSSessionCredentials = provider.getCredentials.asInstanceOf[AWSSessionCredentials]
val token = credentials.getSessionToken
val awsAccessKey = credentials.getAWSAccessKeyId
val awsSecretKey = credentials.getAWSSecretKey
The EMR_EC2_DefaultRole should have read/write permissions on the S3 location being used as tempdir.
Finally, EMR does include Redshift's JDBC drivers under /usr/share/aws/redshift/jdbc, which can be added to Spark's driver and executor classpaths (--jars).
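Putting that together, a minimal sketch of passing those temporary credentials to spark-redshift, assuming the spark-shell's SparkSession (spark) and placeholder endpoint, table, and bucket names:
// Hand the instance-profile credentials retrieved above to spark-redshift
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb?user=myuser&password=mypassword")
  .option("dbtable", "public.my_table")
  .option("tempdir", "s3a://my-bucket/redshift-temp/")
  .option("temporary_aws_access_key_id", awsAccessKey)
  .option("temporary_aws_secret_access_key", awsSecretKey)
  .option("temporary_aws_session_token", token)
  .load()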
In order to allow access between EMR and any other AWS resource, you'll need to edit the roles (Identity and Access Management, aka "IAM") that are applied to the master/core nodes and add permission to consume the services you need, i.e. S3 (already enabled by default), Redshift, etc.
As a side note, in some cases you can get away with using the AWS SDK in your applications to interface with those other services' APIs.
There are some specific things you must do to get Spark to successfully talk to Redshift:
Get the Redshift JDBC driver and include the jar in your Spark classpath with the --jars flag.
Create a special role in IAM for Redshift. That means starting by creating the role and choosing the Redshift class/option at the beginning, so the primary resource is actually Redshift, and then adding your additional permissions from there.
Go into Redshift and attach that new role to your Redshift cluster.
Provide the role's ARN in your Spark application.
Make sure S3 is given permissions in that new role, because when Spark and Redshift talk to each other over JDBC, all the data is staged as an intermediate fileset in S3, like a temporary swap file.
Note: if you get permission errors about S3, try changing the protocol in the file path from s3:// to s3a:// -- for some reason that bypasses that security somehow.
After you do all of those things, Redshift and Spark can talk to each other. It's a lot of stuff; a rough sketch of what it looks like in code follows.
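A minimal sketch of that IAM-role flow with spark-redshift, assuming the spark-shell's SparkSession (spark); the role ARN, cluster endpoint, table, and bucket names are placeholders:
// Read a Redshift table via the IAM role attached to the Redshift cluster; rows are staged in the s3a:// tempdir
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb?user=myuser&password=mypassword")
  .option("dbtable", "public.my_table")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/my-redshift-role")
  .option("tempdir", "s3a://my-bucket/redshift-temp/")   // s3a://, per the note above
  .load()
df.show()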