I'm trying to run a Hive query using Amazon EMR, and I'm trying to get Apache Tez to work with it too, which from what I understand requires setting the hive.execution.engine property to tez in the hive-site configuration.
I get that Hive properties can usually be set with set hive.{...}, or in hive-site.xml, but I don't know how either of those interacts with, or is possible to do in, Amazon EMR.
So: is there a way to set Hive Configuration Properties in Amazon EMR, and if so, how?
Thanks!
You can do this in two ways:
1) DIRECTLY WITHIN A SINGLE HIVE SCRIPT (.hql file)
Just put your properties at the beginning of your Hive hql script, like:
set hive.execution.engine=tez;
CREATE TABLE...
2) VIA APPLICATION CONFIGURATIONS
When you create an EMR cluster, you can specify Hive configurations that apply for the entire cluster's lifetime. This can be done either via the AWS Management Console or via the AWS CLI.
a) AWS Management Console
Open the AWS EMR service and click the Create cluster button
Click on Go to advanced options at the top
Be sure to select Hive among the applications, then enter a JSON configuration like the example below, where you can put any property you would usually have in the hive-site.xml configuration; the Tez property is shown as an example. You can optionally load the JSON from an S3 path.
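As a sketch of that configuration, using the standard EMR classification format (only the Tez property is filled in here; any other hive-site.xml property can be added to the Properties map):
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.execution.engine": "tez"
    }
  }
]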
b) AWS CLI
As stated in detail here, you can specify the Hive configuration on cluster creation, using the flag --configurations, like below:
aws emr create-cluster --configurations file://configurations.json --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate
The JSON file has the same content shown above in the Management Console example.
Again, you can optionally specify a S3 path instead:
--configurations https://s3.amazonaws.com/myBucket/configurations.json
Amazon Elastic MapReduce (EMR) is an automated means of deploying a normal Hadoop distribution. Commands you can normally run against Hadoop and Hive will also work under EMR.
You can execute hive commands either interactively (by logging into the Master node) or via scripts (submitted as job 'steps').
You would be responsible for installing TEZ on Amazon EMR. I found this forum post: TEZ on EMR
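For example, to submit an .hql script as a step with the AWS CLI, something along these lines should work (the cluster id and S3 path are placeholders):
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=HIVE,Name="Hive script",ActionOnFailure=CONTINUE,Args=[-f,s3://myBucket/myscript.hql]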
In my Glue job, I have enabled Spark UI and specified all the necessary details (s3 related etc.) needed for Spark UI to work.
How can I view the DAG/Spark UI of my Glue job?
You need to set up an EC2 instance that can host the history server.
The below documentation has links to CloudFormation templates that you can use.
https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
You can access the history server via the EC2 instance (on port 18080 by default). You need to configure the networking and ports suitably.
EDIT - There is also an option to set up the Spark UI locally. This requires getting the Docker image from the aws-glue-samples repo and setting the AWS credentials and S3 location there. This server consumes the files that the Glue job generates; the files are about 4 MB in size.
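A rough sketch of running that local history server, assuming the image has been built from the aws-glue-samples Dockerfile and tagged glue/sparkui (the image tag and the spark-class path are assumptions, and the S3 log path is a placeholder):
docker run -itd -p 18080:18080 \
  -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS \
     -Dspark.history.fs.logDirectory=s3a://myBucket/glue-spark-logs/ \
     -Dspark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
     -Dspark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY" \
  glue/sparkui:latest \
  "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
Then open http://localhost:18080 in a browser.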
I am submitting a Spark application job jar to EMR, and it uses a property file. I can put the file in S3 and, while creating the EMR cluster, download it and copy it to some location on the EMR box. Is doing this at bootstrap time, while creating the EMR cluster itself, the best way?
When creating the cluster, in Edit software settings you can add your own configuration as JSON, or point to a JSON file stored in an S3 location, and in this way you can pass configuration parameters to the EMR cluster at creation time. A rough CLI sketch follows the links below. For more details please check the following links:
Amazon EMR Cluster Configurations
Configuring Applications
AWS CLI
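As a sketch (bucket names, the properties file, and instance settings are placeholders): the --configurations flag covers the software-settings approach, and a bootstrap action can copy your property file onto the box, which is what the question describes.
# contents of s3://myBucket/copy-props.sh (the bootstrap action script):
#   aws s3 cp s3://myBucket/app.properties /home/hadoop/app.properties
aws emr create-cluster \
  --release-label emr-5.9.0 \
  --applications Name=Spark \
  --instance-type m4.large --instance-count 3 \
  --configurations https://s3.amazonaws.com/myBucket/configurations.json \
  --bootstrap-actions Path=s3://myBucket/copy-props.sh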
Hope this will help you.
Has anybody worked on creating a Python script to load data from S3 into Redshift tables for multiple files? How can we achieve it with the AWS CLI? Your learnings and inputs on the same are appreciated.
The COPY command is the best way to load data from Amazon S3 into Amazon Redshift. It can load multiple files in parallel into a single table.
Use any Python library (e.g. PostgreSQL + Python | Psycopg) to connect to Amazon Redshift, then issue the COPY command, for example:
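A rough sketch with psycopg2 (the cluster endpoint, credentials, table, bucket, and IAM role are all placeholders, and the CSV options depend on your files):
import psycopg2

# Placeholder connection details for the Redshift cluster
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="my-password",
)

# COPY loads every file under the S3 prefix, in parallel, into one table
copy_sql = """
    COPY my_schema.my_table
    FROM 's3://myBucket/input/data/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # the with-block commits on success

conn.close()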
The AWS Command-Line Interface (CLI) does not have the ability to run the COPY command on Redshift because it needs to be issued to the database, while the AWS CLI issues commands to AWS. (The AWS CLI can be used to launch/terminate a Redshift cluster, but not to connect to the cluster itself.)
I have a Glue script (test.py) written, say, in an editor. I connected to a Glue dev endpoint and copied the script to the endpoint (or I can store it in an S3 bucket). Basically the Glue endpoint is an EMR cluster, so how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it?
I know we can run it from the Glue console, but I am more interested to know if I can run it from the Glue endpoint terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1.amazonaws.com
radix@localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue@ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly without getting an interactive shell with ssh (of course, after uploading the script with scp or whatever):
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
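For example, with placeholder values for the job name and temp dir:
[glue@ip-w-x-y-z ~]$ gluepython myscript.py --JOB_NAME my_test_job --TempDir s3://myBucket/glue-temp/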
For development/testing purposes, you can set up a Zeppelin notebook locally and establish an SSH connection using the AWS Glue endpoint URL, so you have access to the Data Catalog, crawlers, etc., and also the S3 bucket where your data resides.
After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a job pointing to the ETL script in the S3 bucket, so that the job can be run and scheduled as well.
Please refer here and to setting up Zeppelin on Windows for help setting up a local environment. You can use the dev instance provided by Glue, but you may incur additional costs for it (EC2 instance charges).
Once you set up the Zeppelin notebook, you can copy the script (test.py) into the notebook and run it from Zeppelin.
According to AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a
scale-out execution environment for your data transformation jobs. AWS
Glue infers, evolves, and monitors your ETL jobs to greatly simplify
the process of creating and maintaining jobs. Amazon EMR provides you
with direct access to your Hadoop environment, affording you
lower-level access and greater flexibility in using tools beyond
Spark.
Do you have any specific requirement to run the Glue script on an EMR instance? In my opinion, EMR gives more flexibility: you can use any third-party Python libraries and run directly on an EMR Spark cluster.
Regards
So I am having an issue with being able to execute Presto queries via AWS EMR.
I have launched an EMR cluster running Hive/Presto and using AWS Glue as the metastore.
When I SSH into the master node and run hive, I can run "show schemas;" and it shows me the 3 different databases that we have on AWS Glue.
If I then enter the Presto CLI and run "show schemas on hive", I only see two: "default" and "information_schema".
For the life of me I cannot figure out why Presto is not able to see the same Hive schemas.
It is a basic default cluster launch on EMR using default settings mainly.
Can someone point me in the direction of what I should be looking for? I have checked the hive.properties file and that looks good; I am just at a loss as to why Presto is not able to see the same info as Hive.
I do have the following configuration set
[{"classification":"hive-site", "properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}, "configurations":[]}]
The AWS docs http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html make it seem that this should be plug and play, but I am obviously missing something.
Starting with Amazon EMR release version 5.10.0, you can. Simply set the hive.metastore.glue.datacatalog.enabled property to true, as follows:
[
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.metastore.glue.datacatalog.enabled": "true"
    }
  }
]
Optionally, you can manually set
hive.metastore.glue.datacatalog.enabled=true in the
/etc/presto/conf/catalog/hive.properties file on the master node. If
you use this method, make sure that
hive.table-statistics-enabled=false in the properties file is set
because the Data Catalog does not support Hive table and partition
statistics. If you change the value on a long-running cluster to
switch metastores, you must restart the Presto server on the master
node (sudo restart presto-server).
Sources:
AWS Docs
Looks like this has been solved in emr-5.10. You want to add the following config:
{"Classification":"presto-connector-hive","Properties":{"hive.metastore.glue.datacatalog.enabled": "true"}}
Source: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-glue.html
The recent 0.198 release of Presto now supports AWS Glue as a metadata source.
Add support for using AWS Glue as the metastore. Enable it by setting
the hive.metastore config property to glue.
https://prestodb.io/docs/current/release/release-0.198.html
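For open-source Presto, that setting goes in the Hive connector's catalog properties file, roughly like this (the file path and connector name reflect a typical 0.198 setup and are assumptions here):
# etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore=glue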