How to configure AWS EMR to use S3 as HDFS storage - hdfs

I am trying to create an EMR cluster with the configuration below, but it fails at the bootstrap stage. The EMR release I am using is EMR 5.13.0.
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.defaultFS": "s3://my-s3-bucket",
      "fs.s3a.imp": "org.apache.hadoop.fs.s3.S3FileSystem"
    }
  }
]
If I remove this configuration, the cluster gets provisioned successfully.
Any idea how an S3-backed HDFS configuration can be done?

In short, what you are trying to achieve is not possible.
Reason: HDFS is an implementation of the Hadoop FileSystem API, which is modeled on POSIX filesystem behavior.
EMR File System (EMRFS), on the other hand, is at its core an object store that mimics HDFS; it is what Amazon EMR clusters use to read and write regular files directly to Amazon S3. It still violates some of the requirements of the Hadoop FileSystem API that it would need to satisfy to be considered a replacement for HDFS. See the "Object Stores vs. Filesystems" section of the Hadoop FileSystem API documentation.
With that said, you can still use Amazon S3 as a storage option on EMR without configuring anything, simply by using the s3:// URI scheme.
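For example, a job on the cluster can read and write S3 paths directly; a minimal sketch, where the bucket name and paths are placeholders:

# List an S3 prefix through the Hadoop filesystem layer (EMRFS on EMR):
hadoop fs -ls s3://my-s3-bucket/input/

# Copy data between the cluster's HDFS and S3 without any extra configuration:
hadoop fs -cp hdfs:///user/hadoop/output/ s3://my-s3-bucket/output/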
Hope this answers your question.

Related

Is it possible to run HBase on AWS but have it store/point to HDFS?

I just wanted to know if this question is even relevant.
I tried to understand many blogs but could not reach a conclusion.
Yes, you can run HBase on Amazon EMR, and you can choose either S3 (via EMRFS) or native HDFS (on the cluster):
It utilizes Amazon S3 (with EMRFS) or the Hadoop Distributed Filesystem (HDFS) as a fault-tolerant datastore.
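If you want HBase to store its data in S3 rather than HDFS, a minimal sketch of the cluster launch might look like the following (the cluster name, key, instance sizes, release label and bucket are placeholders; this assumes an EMR release that supports HBase on S3):

# Launch an EMR cluster with HBase using S3 (EMRFS) as its root directory.
aws emr create-cluster \
  --name "hbase-on-s3" \
  --release-label emr-5.13.0 \
  --applications Name=HBase \
  --instance-type m4.large --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key \
  --configurations '[
    {"Classification": "hbase", "Properties": {"hbase.emr.storageMode": "s3"}},
    {"Classification": "hbase-site", "Properties": {"hbase.rootdir": "s3://my-hbase-bucket/hbase"}}
  ]'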

Where to put application property file for spark application running on AWS EMR

I am submitting a Spark application job jar to EMR, and it uses a property file. I can put the file in S3 and, while creating the EMR cluster, download it and copy it to some location on the EMR box. Is this the best way, and how can I do this at bootstrap time when creating the cluster itself?
In the console's "Edit software settings" step you can add your own configuration inline or point to a JSON file stored in an S3 location, and use this to pass configuration parameters to the EMR cluster at creation time. For more details, please check the following links:
Amazon EMR Cluster Configurations
Configuring Applications
AWS CLI
Hope this helps.
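Alternatively, a bootstrap action can copy the property file from S3 onto every node at provisioning time. A minimal sketch, where the script name, bucket and paths are hypothetical placeholders:

#!/bin/bash
# copy-app-config.sh -- hypothetical bootstrap script stored at s3://my-bucket/bootstrap/copy-app-config.sh
# Downloads the Spark application's property file from S3 onto each node.
aws s3 cp s3://my-bucket/config/app.properties /home/hadoop/app.properties

The script is then referenced when creating the cluster, for example:

aws emr create-cluster \
  --name "spark-app-cluster" \
  --release-label emr-5.13.0 \
  --applications Name=Spark \
  --instance-type m4.large --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/bootstrap/copy-app-config.sh,Name=CopyAppConfig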

AWS EMR migration from us-east to us-west

I am planning to move an EMR cluster from us-east to us-west. I have data residing in HDFS as well as S3, but due to a lack of proper documentation I am unable to get started.
Does anyone have any experience doing this?
You can use the s3-dist-cp tool on EMR to copy data from HDFS to S3, and later you can use the same tool to copy from S3 to HDFS on the cluster in the other region. Also note that it is always recommended to use EMR with S3 buckets in the same region.
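A minimal sketch of the two copies (the bucket and paths are placeholders; the S3 bucket should live in the destination region):

# On the old cluster in us-east: push the HDFS data to an S3 bucket in the target region.
s3-dist-cp --src hdfs:///user/hadoop/data --dest s3://my-uswest-bucket/data

# On the new cluster in us-west: pull the data back into HDFS if needed.
s3-dist-cp --src s3://my-uswest-bucket/data --dest hdfs:///user/hadoop/data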

AWS EMR Presto not finding correct Hive schemas using AWS Glue

So I am having an issue with being able to execute Presto queries via AWS EMR.
I have launched an EMR cluster running Hive/Presto and using AWS Glue as the metastore.
When I SSH into the master node and run hive I can run "show schemas;" and it shows me the 3 different databases that we have on AWS Glue.
If I then enter the Presto CLI and run "show schemas from hive;" I only see two: "default" and "information_schema".
For the life of me I cannot figure out why Presto is not able to see the same Hive schemas.
It is a basic default cluster launch on EMR, using mostly default settings.
Can someone point me in the direction of what I should be looking for? I have checked the hive.properties file and that looks good; I am just at a loss as to why Presto is not able to see the same info as Hive.
I do have the following configuration set
[{"classification":"hive-site", "properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}, "configurations":[]}]
The AWS docs http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html make it seem that this should be plug and play, but I am obviously missing something.
Starting with Amazon EMR release version 5.10.0, you can. Simply set the hive.metastore.glue.datacatalog.enabled property to true, as follows:
[
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.metastore.glue.datacatalog.enabled": "true"
    }
  }
]
Optionally, you can manually set hive.metastore.glue.datacatalog.enabled=true in the /etc/presto/conf/catalog/hive.properties file on the master node. If you use this method, make sure that hive.table-statistics-enabled=false is set in the properties file, because the Data Catalog does not support Hive table and partition statistics. If you change the value on a long-running cluster to switch metastores, you must restart the Presto server on the master node (sudo restart presto-server).
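A minimal sketch of that manual edit on the master node, assuming the path quoted above (appending to the file is just one way to set the properties):

# Append the Glue settings to Presto's Hive catalog on the master node.
sudo tee -a /etc/presto/conf/catalog/hive.properties <<'EOF'
hive.metastore.glue.datacatalog.enabled=true
hive.table-statistics-enabled=false
EOF

# Restart Presto so the catalog change takes effect.
sudo restart presto-server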
Sources:
AWS Docs
Looks like this has been solved in emr-5.10. You want to add the following config:
{"Classification":"presto-connector-hive","Properties":{"hive.metastore.glue.datacatalog.enabled": "true"}}
Source: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-glue.html
The recent 0.198 release of Presto now supports AWS Glue as a metadata source.
Add support for using AWS Glue as the metastore. Enable it by setting the hive.metastore config property to glue.
https://prestodb.io/docs/current/release/release-0.198.html
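On a self-managed Presto 0.198+ installation, that is a catalog property rather than an EMR classification; a minimal sketch based on the release note above (the catalog file path, relative to the Presto installation directory, is a placeholder):

# etc/catalog/hive.properties for open-source Presto 0.198+:
cat > etc/catalog/hive.properties <<'EOF'
connector.name=hive-hadoop2
hive.metastore=glue
EOF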

Configuring external data source for Elastic MapReduce

We want to use Amazon Elastic MapReduce on top of our current DB (we are using Cassandra on EC2). Looking at the Amazon EMR FAQ, it should be possible:
Amazon EMR FAQ: Q: Can I load my data from the internet or somewhere other than Amazon S3?
However, when creating a new job flow, we can only configure an S3 bucket as the input data origin.
Any ideas/samples on how to do this?
Thanks!
P.S.: I've seen this question How to use external data with Elastic MapReduce but the answers do not really explain how to do it/configure it, simply that it is possible.
How are you processing the data? EMR is just managed Hadoop. You still need to write a process of some sort.
If you are writing a Hadoop MapReduce job, then you are writing Java and you can use the Cassandra APIs to access your data.
If you want to use something like Hive, you will need to write a Hive storage handler to use data backed by Cassandra.
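Whichever approach you take, the job runs on EMR like any other custom Hadoop job. For example, a hand-written MapReduce job that reads from Cassandra could be submitted as a custom JAR step; a minimal sketch, where the cluster id, jar location and arguments are placeholders:

# Submit a custom MapReduce job (e.g. one built against the Cassandra Hadoop/client APIs) as an EMR step.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name="cassandra-mr-job",ActionOnFailure=CONTINUE,Jar=s3://my-bucket/jars/cassandra-mr-job.jar,Args=[arg1,arg2]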
Try using scp to copy files to your EMR instance:
my-desktop-box$ scp mylocaldatafile my-emr-node:/path/to/local/file
(or use ftp, or wget, or curl, or anything else you want)
then log into your EMR instance with ssh and load it into hadoop:
my-desktop-box$ ssh my-emr-node
my-emr-node$ hadoop fs -put /path/to/local/file /path/in/hdfs/file