Reading/writing between EMR Spark clusters and Redshift can definitely be done via an intermediary data dump to S3.
There are, however, Spark libraries that can treat Redshift directly as a data source: https://github.com/databricks/spark-redshift
Do the EMR 5.0 Spark clusters come preconfigured with the library and access credentials for redshift access?
No, EMR doesn't provide the Databricks spark-redshift library out of the box.
To access Redshift:
Connectivity to Redshift doesn't require any IAM-based authentication. It simply requires the EMR cluster (the master/slave node IPs, or the EMR master/slave security groups) to be whitelisted in Redshift's security group on its default port, 5439.
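As an illustration, a hedged sketch of that whitelisting step with the AWS CLI, assuming hypothetical security group IDs for the Redshift cluster and the EMR master/slave groups:
# Allow the EMR security groups to reach Redshift on its default port 5439
aws ec2 authorize-security-group-ingress --group-id sg-redshift1 --protocol tcp --port 5439 --source-group sg-emrmaster1
aws ec2 authorize-security-group-ingress --group-id sg-redshift1 --protocol tcp --port 5439 --source-group sg-emrslave1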
Now, since the Spark executors drive Redshift's COPY / UNLOAD commands and those commands require access to S3, you would need to configure the IAM credentials mentioned here: https://github.com/databricks/spark-redshift#aws-credentials
To access S3 from EMR:
EMR nodes by default assume an IAM instance profile role called EMR_EC2_DefaultRole, and the permissions on this role define what EMR nodes and the processes running on them (via InstanceProfileCredentialsProvider) can access. So you may use the fourth way mentioned in the documentation. The access key, secret key, and session token can be retrieved as shown below and passed as the temporary_aws_access_key_id, temporary_aws_secret_access_key, and temporary_aws_session_token parameters:
https://github.com/databricks/spark-redshift#parameters
//Get credentials from the IAM instance profile (EMR_EC2_DefaultRole)
import com.amazonaws.auth.{AWSSessionCredentials, InstanceProfileCredentialsProvider}

val provider = new InstanceProfileCredentialsProvider()
//Instance-profile credentials are session-scoped, so cast to AWSSessionCredentials
val credentials: AWSSessionCredentials = provider.getCredentials.asInstanceOf[AWSSessionCredentials]
val token = credentials.getSessionToken
val awsAccessKey = credentials.getAWSAccessKeyId
val awsSecretKey = credentials.getAWSSecretKey
The EMR_EC2_DefaultRole should have read/write permissions on the S3 location being used as tempdir.
Finally, EMR does include Redshift's JDBC drivers at /usr/share/aws/redshift/jdbc, which can be added to Spark's driver and executor classpaths (--jars).
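Putting the pieces together, here is a minimal sketch of a read through spark-redshift using those temporary credentials; the JDBC URL, table name, and tempdir bucket below are hypothetical placeholders:
//Read a Redshift table via spark-redshift, staging data in the S3 tempdir
//using the instance-profile credentials retrieved above
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/mydb?user=myuser&password=mypassword") //hypothetical endpoint
  .option("dbtable", "my_table") //hypothetical table
  .option("tempdir", "s3n://my-temp-bucket/redshift-temp/") //hypothetical bucket
  .option("temporary_aws_access_key_id", awsAccessKey)
  .option("temporary_aws_secret_access_key", awsSecretKey)
  .option("temporary_aws_session_token", token)
  .load()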
In order to allow access between EMR and any other AWS resource, you'll need to edit the roles (Identity and Access Management, aka "IAM") that are applied to the master/core nodes, and add permission to consume the services you need, i.e. S3 (already enabled by default), Redshift, etc.
As a side note, in some cases you can get away with using the AWS SDK in your application to interface with those other services' APIs.
There are some specific things you must do to get Spark talking to Redshift successfully:
Get the Redshift JDBC driver, put it on your Spark classpath, and pass the jar with the --jars flag.
Create a special role in IAM for Redshift: start by creating the role, choose the Redshift service at the beginning so the primary resource is actually Redshift, and then add your additional permissions from there.
Go into Redshift and attach that new role to your Redshift cluster.
Provide the role's ARN in your Spark application.
Make sure S3 is given permissions in that new role, because when Spark and Redshift talk to each other over JDBC, all the data is staged as an intermediate fileset in S3, like a temp swap file.
Note: if you get permission errors about S3, try changing the protocol in the file path from s3:// to s3a://; the s3a:// scheme uses a different S3 filesystem implementation with its own credential handling.
After you do all of those things, Redshift and Spark can talk to each other. It's a lot of steps; a sketch of the role-based setup follows below.
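As an illustration of the role-based setup above, here is a minimal sketch of a write with spark-redshift; the JDBC URL, table, tempdir, and role ARN are hypothetical placeholders, and aws_iam_role is the spark-redshift parameter that forwards the ARN to Redshift's COPY/UNLOAD:
//Write a DataFrame to Redshift, letting Redshift assume the attached IAM role
//for the intermediate S3 staging (note the s3a:// scheme per the tip above)
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://my-cluster.example.us-west-2.redshift.amazonaws.com:5439/mydb?user=myuser&password=mypassword") //hypothetical endpoint
  .option("dbtable", "my_output_table") //hypothetical table
  .option("tempdir", "s3a://my-temp-bucket/redshift-temp/") //hypothetical bucket
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/my-redshift-role") //hypothetical ARN
  .mode("error")
  .save()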
Related
I am trying to move data from one AWS Elasticsearch cluster in Oregon to another AWS Elasticsearch cluster in N. Virginia. I have registered the repository in the source ES cluster and taken a manual snapshot to S3 (in Oregon). Now I am trying to register a repository on the destination ES cluster pointing at the same S3 location, but it is not letting me do it.
It throws an error that the S3 bucket should be in the same region.
I am now stuck. Can anybody suggest a method for this?
Based on the comments, the solution is to take the snapshot of the cluster into a bucket in Oregon, copy it from that bucket to a bucket in N. Virginia, and then restore the cluster in the N. Virginia region from the second bucket.
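A rough sketch of the copy step with the AWS CLI, using hypothetical bucket names (repository registration and restore still go through each domain's _snapshot API as usual):
# Create a destination bucket in us-east-1 (N. Virginia) and copy the snapshot data into it
aws s3 mb s3://my-es-snapshots-virginia --region us-east-1
aws s3 sync s3://my-es-snapshots-oregon s3://my-es-snapshots-virginia --source-region us-west-2 --region us-east-1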
I am trying to use AWS DMS to move data from a source database (AWS RDS MySQL) in the Paris region (eu-west-3) to a target database (AWS Redshift) in the Ireland region (eu-west-1). The goal is to continuously replicate ongoing changes.
I am running into this kind of error:
An error occurred (InvalidResourceStateFault) when calling the CreateEndpoint operation: The redshift cluster datawarehouse needs to be in the same region as the current region. The cluster's region is eu-west-1 and the current region is eu-west-3.
The documentation says :
The only requirement to use AWS DMS is that one of your endpoints must be on an AWS service.
So what I am trying to do should be possible. In practice, it seems it's not allowed.
How can I use AWS DMS from one region to another?
In what region should my endpoints be?
In what region should my replication task be?
My replication instance has to be in the same region as the RDS MySQL instance, because they need to share a subnet.
AWS provides a whitepaper called "Migrating AWS Resources to a New AWS Region", updated last year. You may want to contact their support, but an idea would be to move your RDS instance to another RDS instance in the proper region before migrating to Redshift. In the whitepaper, they describe an alternative way to migrate RDS (without DMS, if you don't want to use it for some reason); a sketch of the MySQL path follows the quoted steps below:
Stop all transactions or take a snapshot (however, changes after this point in time are lost and might need to be reapplied to the target Amazon RDS DB instance).
Using a temporary EC2 instance, dump all data from Amazon RDS to a file:
For MySQL, make use of the mysqldump tool. You might want to compress this dump (see bzip or gzip).
For MS SQL, use the bcp utility to export data from the Amazon RDS SQL DB instance into files. You can use the SQL Server Generate and Publish Scripts Wizard to create scripts for an entire database or for just selected objects. Note: Amazon RDS does not support Microsoft SQL Server backup file restores.
For Oracle, use the Oracle Export/Import utility or the Data Pump feature (see http://aws.amazon.com/articles/AmazonRDS/4173109646282306).
For PostgreSQL, you can use the pg_dump command to export data.
Copy this data to an instance in the target region using standard tools such as CP, FTP, or Rsync.
Start a new Amazon RDS DB instance in the target region, using the new Amazon RDS security group.
Import the saved data.
Verify that the database is active and your data is present.
Delete the old Amazon RDS DB instance in the source region.
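For the MySQL case mentioned above, the dump/copy/import steps might look roughly like this, with hypothetical endpoints and a temporary EC2 instance in each region:
# Dump and compress the source database (run from an EC2 instance near the source RDS)
mysqldump -h source-db.example.eu-west-3.rds.amazonaws.com -u admin -p mydb | gzip > mydb.sql.gz
# Copy the dump to an EC2 instance in the target region
rsync -avz mydb.sql.gz ec2-user@target-ec2-host:/tmp/
# Import into the new RDS instance in the target region
gunzip -c /tmp/mydb.sql.gz | mysql -h target-db.example.eu-west-1.rds.amazonaws.com -u admin -p mydb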
I found a workaround that I am currently testing.
I declare "postgres" as the engine type for the Redshift target endpoint. This tricks AWS DMS into thinking it's an external database, so it no longer checks the regions.
I think it will result in degraded performance, because DMS will probably feed data to Redshift using INSERTs instead of the COPY command.
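For reference, a hedged sketch of what that endpoint definition could look like with the AWS CLI; the identifiers and connection details are hypothetical. This works at all because Redshift speaks the PostgreSQL wire protocol on port 5439:
# Define the Redshift cluster as a generic "postgres" target so DMS skips its region check
aws dms create-endpoint --endpoint-identifier redshift-as-postgres \
    --endpoint-type target --engine-name postgres \
    --server-name datawarehouse.example.eu-west-1.redshift.amazonaws.com \
    --port 5439 --database-name mydb --username awsuser --password 'REPLACE_ME'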
Currently Redshift has to be in the same region as the replication instance.
The Amazon Redshift cluster must be in the same AWS account and the same AWS Region as the replication instance.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Redshift.html
So one should create the replication instance in the Redshift region, inside a VPC.
Then use VPC peering to enable the replication instance to connect to the VPC of the MySQL instance in the other region.
https://docs.aws.amazon.com/vpc/latest/peering/what-is-vpc-peering.html
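A rough sketch of that setup with the AWS CLI, assuming hypothetical VPC IDs and a pre-created DMS subnet group (you would also still need matching route table and security group entries on both sides):
# Request inter-region peering from the Redshift-side VPC (eu-west-1) to the MySQL-side VPC (eu-west-3)
aws ec2 create-vpc-peering-connection --region eu-west-1 --vpc-id vpc-11111111 --peer-vpc-id vpc-22222222 --peer-region eu-west-3
# Accept the request from the MySQL side, using the pcx-... ID returned above
aws ec2 accept-vpc-peering-connection --region eu-west-3 --vpc-peering-connection-id pcx-33333333
# Create the replication instance in the Redshift region, inside the peered VPC
aws dms create-replication-instance --region eu-west-1 --replication-instance-identifier dms-to-redshift --replication-instance-class dms.t2.medium --replication-subnet-group-identifier my-dms-subnet-group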
I am planning to move an EMR cluster from us-east to us-west. I have data residing in HDFS as well as S3, but due to a lack of proper documentation I am unable to get started.
Does anyone have any experience doing this?
You can use the s3-dist-cp tool on EMR to copy data from HDFS to S3, and later you can use the same tool to copy from S3 to HDFS on the cluster in the other region. Also note that it's always recommended to use EMR with S3 buckets in the same region.
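For example, with hypothetical HDFS paths and bucket names (s3-dist-cp can be run as an EMR step or from the master node of each cluster):
# On the old (us-east) cluster: push HDFS data to an S3 bucket in the target region
s3-dist-cp --src hdfs:///user/hadoop/mydata --dest s3://my-us-west-bucket/mydata
# On the new (us-west) cluster: pull the data back into HDFS
s3-dist-cp --src s3://my-us-west-bucket/mydata --dest hdfs:///user/hadoop/mydata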
I am planning to use Amazon EMR for a Spark streaming application. Amazon provides a nice interface to show the stderr & controller logs, but for a streaming application I am not sure how to manage the logs.
Amazon writes the logs to /var/log/hadoop/steps/<step-id> and to similar places for Spark. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage-view-web-log-files.html
I was wondering how to rotate logs and still have them accessible via the AWS EMR web interface. We can easily change the log rotation policy by configuring hadoop-log4j, but then I cannot access the logs via the web interface. Also, EMR should still manage the log upload to S3.
AWS EMR also stores the logs in S3.
Navigate to your cluster console for the running cluster, and in the left middle column you'll see the path to the S3 bucket.
Be careful not to reuse the same S3 bucket path for future clusters, otherwise you could overwrite your log data.
I'm trying to run a Hive query using Amazon EMR, and am trying to get Apache Tez to work with it too, which from what I understand requires setting the hive.execution.engine property to tez in the Hive site configuration.
I get that Hive properties can usually be set with set hive.{...}, or in hive-site.xml, but I don't know how either of those interacts with, or is possible to do in, Amazon EMR.
So: is there a way to set Hive Configuration Properties in Amazon EMR, and if so, how?
Thanks!
You can do this in two ways:
1) DIRECTLY WITHIN A SINGLE HIVE SCRIPT (.hql file)
Just put your properties at the beginning of your Hive hql script, like:
set hive.execution.engine=tez;
CREATE TABLE...
2) VIA APPLICATION CONFIGURATIONS
When you create an EMR cluster, you can specify Hive configurations that apply for the entire cluster's lifetime. This can be done either via the AWS Management Console or via the AWS CLI.
a) AWS Management Console
Open AWS EMR service and click on Create cluster button
Click on Go to advanced options at the top
Be sure to select Hive among the applications, then enter a JSON configuration like the one below, where you can set any property you would usually have in the hive-site.xml configuration; the Tez property is shown as an example. You can optionally load the JSON from an S3 path.
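A minimal example of such a JSON configuration, setting the Tez engine via the hive-site classification:
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.execution.engine": "tez"
    }
  }
]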
b) AWS CLI
As stated in detail here, you can specify the Hive configuration on cluster creation, using the flag --configurations, like below:
aws emr create-cluster --configurations file://configurations.json --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate
The JSON file has the same content shown above in the Management Console example.
Again, you can optionally specify an S3 path instead:
--configurations https://s3.amazonaws.com/myBucket/configurations.json
Amazon Elastic MapReduce (EMR) is an automated means of deploying a normal Hadoop distribution. Commands you can normally run against Hadoop and Hive will also work under EMR.
You can execute Hive commands either interactively (by logging into the master node) or via scripts (submitted as job 'steps').
You would be responsible for installing TEZ on Amazon EMR. I found this forum post: TEZ on EMR