AWS EMR: How to migrate data from one EMR to another EMR

I currently have an AWS EMR cluster running HBase, and I am saving the data to S3. I want to migrate the data to a new EMR cluster in the same account. What is the proper way to migrate data from one EMR cluster to another?
Thank you

There are different ways to copy a table from one cluster to another:
Use the CopyTable utility. The disadvantage is that it can degrade region server performance, or the tables need to be disabled prior to the copy.
HBase snapshots (recommended). They have minimal impact on region server performance.
You can follow the AWS documentation to perform the snapshot/restore operations.
Basically you will do the following (a command-line sketch follows the list):
Create Snapshot
Export to S3
Import from S3
Restore to HBase
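A minimal sketch of those four steps, assuming a table named mytable, a placeholder S3 bucket, and the default EMR HBase root directory (all names are illustrative, not from the original question):

# 1. On the source cluster: create the snapshot via the HBase shell
echo "snapshot 'mytable', 'mytable-snap'" | hbase shell

# 2. Export the snapshot to S3 (bucket/prefix are placeholders)
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot mytable-snap -copy-to s3://my-bucket/hbase-snapshots/

# 3. On the destination cluster: import the snapshot from S3 into the local
#    HBase root dir (/user/hbase is the EMR default for HBase on HDFS)
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot mytable-snap -copy-from s3://my-bucket/hbase-snapshots/ -copy-to hdfs:///user/hbase

# 4. Restore the snapshot into a new table via the HBase shell
echo "clone_snapshot 'mytable-snap', 'mytable'" | hbase shell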

Related

Migrate AWS Glue Job to EC2

I'm currently using some Glue jobs for minimal transformations, sending info from S3/Athena tables to Redshift. We don't process a lot of data, so Glue is expensive, slow, and difficult to tune for this volume of data.
I couldn't find how to get started on EC2 for the code migration, credentials, and dependencies.
Maybe I can call a Lambda to process it on my EC2 instance? Can I run Spark on one node and then scale to a cluster in the future? Should I migrate the Glue job to Python (not PySpark)?
I found that EMR would also be expensive for this volume; the ideal is to start with the minimum.
I don't need the full solution, just a pointer in the right direction so I can start trying this.
Thank you!
Here are a few suggestions for your requirement:
Serverless options like Glue and Lambda are more suitable than a persistent EMR cluster or EC2.
AWS Lambda: consider using Lambda with plain Python modules if your volume of data is small and the transformations are minimal (see the sketch after this list).
AWS Glue with Python (not Spark), i.e. a Glue Python shell job: also a cost-effective solution.
AWS EC2: going the EC2 route is a legacy approach and more costly to run and maintain.
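As an illustration of the Lambda option, here is a minimal sketch that reads a CSV from S3, applies a trivial row-level transformation, and writes the result back to S3 so it can later be COPYed into Redshift. The bucket names, keys, and the transformation are placeholders, not details from the question:

import csv
import io

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Placeholder object names; in practice these could come from an S3 event trigger
    src_bucket, src_key = "my-raw-bucket", "input/data.csv"
    dst_bucket, dst_key = "my-staging-bucket", "for-redshift/data.csv"

    body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"].read().decode("utf-8")

    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(body)):
        # Example transformation: trim whitespace from every field
        writer.writerow([field.strip() for field in row])

    s3.put_object(Bucket=dst_bucket, Key=dst_key, Body=out.getvalue().encode("utf-8"))
    return {"written": f"s3://{dst_bucket}/{dst_key}"}

An hourly EventBridge schedule (or an S3 event trigger) can invoke the function, so the cost stays proportional to the small data volume.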

Is it possible to specify the number of mappers-reducers while using s3-dist-cp?

I'm trying to copy data from an EMR cluster to S3 using s3-dist-cp. Can I set the number of reducers to a value greater than the default so as to speed up the process?
To set the number of reducers, you can use the property mapreduce.job.reduces, similar to the following:
s3-dist-cp -Dmapreduce.job.reduces=10 --src hdfs://path/to/data/ --dest s3://path/to/s3/
Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon EMR cluster.
You can call S3DistCp by adding it as a step in your existing EMR cluster. Steps can be added to a cluster at launch or to a running cluster using the console, AWS CLI, or API.
So you control the number of workers during EMR cluster creation, or you can resize an existing cluster. You can check the exact steps in the EMR docs.
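As a sketch, adding S3DistCp as a step on a running cluster from the AWS CLI could look like the following (the cluster ID and paths are placeholders):

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name=S3DistCpStep,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=["s3-dist-cp","-Dmapreduce.job.reduces=10","--src","hdfs://path/to/data/","--dest","s3://path/to/s3/"]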

Python script to load data from AWS S3 to Redshift

Has anybody worked on creating a Python script to load data from S3 into Redshift tables for multiple files? How can we achieve this with the AWS CLI? Your learnings and inputs are appreciated.
The COPY command is the best way to load data from Amazon S3 to Amazon Redshift. It can load multiple files in parallel into a single table.
Use any Python library (e.g. PostgreSQL + Python | Psycopg) to connect to Amazon Redshift, then issue the COPY command.
The AWS Command-Line Interface (CLI) does not have the ability to run the COPY command on Redshift because it needs to be issued to the database, while the AWS CLI issues commands to AWS. (The AWS CLI can be used to launch/terminate a Redshift cluster, but not to connect to the cluster itself.)
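A minimal sketch with psycopg2, assuming a my_table target, a staging prefix on S3, and an IAM role attached to the cluster; every identifier below is a placeholder:

import psycopg2

# Placeholder connection details
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="myuser",
    password="mypassword",
)

copy_sql = """
    COPY my_table
    FROM 's3://my-bucket/staging/'  -- every file under this prefix is loaded in parallel
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # COPY runs inside the database, not through the AWS CLI
conn.close()

Because the FROM path is a prefix, a single COPY covers the "multiple files" part of the question.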

AWS EMR migration from us-east to us-west

I am planning to move an EMR cluster from us-east to us-west. I have data residing in HDFS as well as S3, but due to the lack of proper documentation I am unable to get started.
Does anyone have any experience in doing so?
You can use the s3-dist-cp tool on EMR to copy data from HDFS to S3, and later you can use the same tool to copy from S3 to HDFS on the cluster in the other region. Also note that it is always recommended to use EMR with S3 buckets in the same region.
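A sketch of the two copies, assuming a transfer bucket created in us-west (the bucket and paths are placeholders):

# On the us-east cluster: push the HDFS data to a bucket in the target region
s3-dist-cp --src hdfs:///path/to/data/ --dest s3://my-us-west-transfer-bucket/data/

# On the new us-west cluster: pull the data back into HDFS
s3-dist-cp --src s3://my-us-west-transfer-bucket/data/ --dest hdfs:///path/to/data/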

Simplest way to get data from AWS mysql RDS to AWS Elasticsearch?

I have data in an AWS RDS, and I would like to pipe it over to an AWS ES instance, preferably updating once an hour, or similar.
On my local machine, with a local mysql database and Elasticsearch database, it was easy to set this up using Logstash.
Is there a "native" AWS way to do the same thing? Or do I need to set up an EC2 server and install Logstash on it myself?
You can achieve the same thing with your local Logstash setup: simply point your jdbc input to your RDS database and the elasticsearch output to your AWS ES instance. If you need to run this regularly, then yes, you'd need to set up a small instance to run Logstash on.
A more "native" AWS solution to achieve the same thing would include the use of Amazon Kinesis and AWS Lambda.
Here's a good article explaining how to connect it all together, namely:
how to stream RDS data into a Kinesis stream
how to configure a Lambda function to handle the stream
how to push the data to your AWS ES instance
Take a look at Amazon DMS. It's usually used for DB migrations; however, it also supports continuous data replication. This might simplify the process and be cost-effective.
You can use AWS Database Migration Service to perform continuous data replication. Continuous data replication has a multitude of use cases, including Disaster Recovery instance synchronization, geographic database distribution, and Dev/Test environment synchronization. You can use DMS for both homogeneous and heterogeneous data replications for all supported database engines. The source or destination databases can be located on your own premises outside of AWS, running on an Amazon EC2 instance, or be an Amazon RDS database. You can replicate data from a single database to one or more target databases, or consolidate data from multiple source databases and replicate it to one or more target databases.
https://aws.amazon.com/dms/