AWS EMR Presto job - amazon-web-services

Is it possible to submit Presto jobs/steps to an EMR cluster, just as you can submit Hive jobs/steps via a script in S3?
I would rather not SSH into the instance to execute the commands, but do it automatically.

You can use any JDBC/ODBC client to connect to the Presto cluster. Drivers are available if you want to connect programmatically:
https://prestodb.github.io/docs/current/installation/jdbc.html
https://teradata.github.io/presto/docs/0.167-t/installation/odbc.html
If you have a BI tool such as Tableau, QlikView, or Superset, it can connect just as easily.
e.g. https://onlinehelp.tableau.com/current/pro/desktop/en-us/examples_presto.htm
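Besides JDBC/ODBC, a script can also talk to the coordinator over Presto's HTTP client protocol, which avoids SSH entirely. The following is only a minimal sketch using the Python standard library. It assumes the coordinator is reachable from where the script runs (on EMR it usually listens on port 8889 on the master node), and the function names, the default user "hadoop", and the example host are made up for illustration.

```python
import json
import urllib.request


def _http_json(method, url, headers, body):
    """Send one HTTP request and decode the JSON response."""
    request = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_presto_query(sql, coordinator_url, user="hadoop",
                     catalog="hive", schema="default", http=None):
    """Submit sql to /v1/statement and collect every row of the result.

    `http` is an injectable transport (hypothetical hook) so the paging
    logic can be exercised without a live cluster.
    """
    http = http or _http_json
    headers = {
        "X-Presto-User": user,
        "X-Presto-Catalog": catalog,
        "X-Presto-Schema": schema,
    }
    # The first request POSTs the SQL text; later requests GET each nextUri.
    page = http("POST", coordinator_url.rstrip("/") + "/v1/statement",
                headers, sql.encode("utf-8"))
    rows = []
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "Presto query failed"))
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if next_uri is None:
            # No nextUri means the query has finished.
            return rows
        page = http("GET", next_uri, headers, None)
```

Something like run_presto_query("SHOW SCHEMAS", "http://master-node.example.internal:8889") could then be called from a scheduler or a Lambda function, provided the security groups allow the connection.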

Related

Orchestration of Redshift Stored Procedures using AWS Managed Airflow

I have created 15-20 Redshift stored procedures. Some can run asynchronously, while many have to run synchronously.
I tried scheduling them asynchronously and synchronously with AWS EventBridge but ran into many limitations (failure handling and orchestration).
So I decided to use AWS Managed Airflow.
How can we set up the Redshift cluster connection in Airflow, so that we can call our stored procedures from Airflow DAGs and have them run on the Redshift cluster?
Is there a RedshiftOperator for the connection, or can we create a direct connection to the Redshift cluster using the connection option in the Airflow menu?
If possible, can we achieve all of this using the AWS console only, without the AWS CLI?
How can we set up the Redshift cluster connection in Airflow, so that we can call our stored procedures from Airflow DAGs and have them run on the Redshift cluster?
You can use Airflow Connections to connect to Redshift. This is the native approach for managing connections to external services such as databases.
Managing Connections (Airflow)
Amazon Redshift Connection (Airflow)
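To make the connection details concrete: Airflow can represent a connection as a single URI (for example in a secrets backend or in an environment variable named AIRFLOW_CONN_<CONN_ID>), and Redshift can be addressed as a Postgres-type connection. The helper below is a hypothetical sketch, not part of Airflow; the host, database, and credentials are placeholders, and 5439 is Redshift's default port.

```python
from urllib.parse import quote


def redshift_conn_uri(host, database, user, password, port=5439):
    """Build an Airflow-style URI for a Postgres-type connection to Redshift."""
    # Percent-encode the variable parts so characters such as '@' or '/'
    # in a password do not break URI parsing.
    return "postgres://{}:{}@{}:{}/{}".format(
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        port,
        quote(database, safe=""),
    )


if __name__ == "__main__":
    # Placeholder values; substitute your own cluster endpoint and credentials.
    print(redshift_conn_uri(
        "example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        "dev", "awsuser", "p@ss/word"))
```

The same host, port, database, login, and password can instead be typed into the Connections form in the Airflow UI; the URI is just a compact way to store them.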
Is there a RedshiftOperator for the connection, or can we create a direct connection to the Redshift cluster using the connection option in the Airflow menu?
You can use the PostgresOperator to execute SQL commands on the Redshift cluster. When initializing the PostgresOperator, set the postgres_conn_id parameter to the Redshift connection ID (e.g. redshift_default). Example:
from airflow.providers.postgres.operators.postgres import PostgresOperator

# sql/stored_proc.sql holds the SQL to run, e.g. a CALL statement for the procedure.
call_stored_proc = PostgresOperator(
    task_id="call_stored_proc",
    postgres_conn_id="redshift_default",
    sql="sql/stored_proc.sql",
)
PostgresOperator (Airflow)
How-to Guide for PostgresOperator (Airflow)
If possible, can we achieve all of this using the AWS console only, without the AWS CLI?
No, it's not possible to achieve this only using the AWS console.

Create Database on Amazon Redshift with Query Editor

I've created a Redshift cluster using the AWS Management Console. The cool thing is that AWS set up a query editor that lets you write queries directly against your cluster without having to install a SQL client on your computer.
However, I was trying to create a new database on the instance, but that doesn't seem to be possible using the AWS query editor. Am I right, or did I miss something?
I did indeed miss something: you simply need to go into the query editor and run
CREATE DATABASE db_name OWNER=db_owner;

AWS EMR Presto not finding correct Hive schemas using AWS Glue

So I am having an issue executing Presto queries on AWS EMR.
I have launched an EMR cluster running Hive and Presto, using AWS Glue as the metastore.
When I SSH into the master node and run hive, I can run "show schemas;" and it shows me the 3 different databases that we have in AWS Glue.
If I then enter the Presto CLI and run "show schemas on hive", I only see two: "default" and "information_schema".
For the life of me I cannot figure out why Presto is not able to see the same Hive schemas.
It is a basic cluster launch on EMR using mostly default settings.
Can someone point me in the direction of what I should be looking for? I have checked the hive.properties file and that looks good. I am just at a loss as to why Presto is not able to see the same info as Hive.
I do have the following configuration set:
[{"classification":"hive-site", "properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}, "configurations":[]}]
The AWS docs (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html) make it seem that this should be plug and play, but I am obviously missing something.
Starting with Amazon EMR release version 5.10.0, Presto can use the Glue Data Catalog as its metastore. Set the hive.metastore.glue.datacatalog.enabled property to true in the presto-connector-hive classification, as follows:
[
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.metastore.glue.datacatalog.enabled": "true"
    }
  }
]
Optionally, you can manually set
hive.metastore.glue.datacatalog.enabled=true in the
/etc/presto/conf/catalog/hive.properties file on the master node. If
you use this method, make sure that
hive.table-statistics-enabled=false in the properties file is set
because the Data Catalog does not support Hive table and partition
statistics. If you change the value on a long-running cluster to
switch metastores, you must restart the Presto server on the master
node (sudo restart presto-server).
Sources:
AWS Docs
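If the cluster is created from a script, the configuration above can be generated rather than pasted by hand. This is a small sketch only: the function name is hypothetical, and combining the Presto classification with the hive-site setting from the question is an assumption about what a cluster running both engines would want.

```python
import json

# Factory class taken from the hive-site configuration quoted in the question.
GLUE_FACTORY = "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"


def glue_configurations(include_hive=True):
    """Return EMR configuration objects that point Presto (and Hive) at Glue."""
    configs = [
        {
            "Classification": "presto-connector-hive",
            "Properties": {"hive.metastore.glue.datacatalog.enabled": "true"},
        }
    ]
    if include_hive:
        configs.append(
            {
                "Classification": "hive-site",
                "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY},
            }
        )
    return configs


if __name__ == "__main__":
    # Print JSON that can be saved to a file and passed to
    # `aws emr create-cluster --configurations file://...`.
    print(json.dumps(glue_configurations(), indent=2))
```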
Looks like this has been solved in emr-5.10. You want to add the following config:
{"Classification":"presto-connector-hive","Properties":{"hive.metastore.glue.datacatalog.enabled": "true"}}
Source: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-glue.html
The recent 0.198 release of Presto now supports AWS Glue as a metadata source.
Add support for using AWS Glue as the metastore. Enable it by setting
the hive.metastore config property to glue.
https://prestodb.io/docs/current/release/release-0.198.html

Presto Sandbox cluster on AWS EMR - add connector (catalog/.properties)

I just deployed a Presto Sandbox cluster on AWS using EMR. Is there any way to add connectors to my Presto cluster apart from manually (ssh) creating the properties and then restarting the cluster?
If you're looking for a UI to add a connector, Presto itself doesn't offer that and as far as I know Amazon EMR doesn't either. I'm afraid you'll have to add connectors manually by SSH-ing to the master node, creating the appropriate file, distributing it to all the nodes and then restarting everything.
Adding connectors to Presto on EMR does require a manual restart, as you mention. You might be able to use a CloudFormation template to automate some of this, or you can try something like Ahana Cloud https://ahana.io/ahana-cloud/ which is a managed service for Presto on AWS.
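The "appropriate file" is a plain key=value properties file, one per catalog, placed in the Presto catalog directory on every node (/etc/presto/conf/catalog/ on EMR). A script can render it so the manual part shrinks to copying the file and restarting. The sketch below is hypothetical: the function name and the MySQL host and credentials are placeholders, and it does no escaping of special characters in values.

```python
def render_catalog_properties(properties):
    """Render a Presto catalog .properties file from a dict of settings."""
    if "connector.name" not in properties:
        raise ValueError("a catalog file needs a connector.name entry")
    # Put connector.name first, then the remaining keys in sorted order.
    lines = ["connector.name=" + properties["connector.name"]]
    lines += [
        "{}={}".format(key, value)
        for key, value in sorted(properties.items())
        if key != "connector.name"
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder settings for a MySQL catalog; would be saved as mysql.properties.
    print(render_catalog_properties({
        "connector.name": "mysql",
        "connection-url": "jdbc:mysql://mysql.example.internal:3306",
        "connection-user": "presto",
        "connection-password": "change-me",
    }), end="")
```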

Setting hive properties in Amazon EMR?

I'm trying to run a Hive query using Amazon EMR, and am trying to get Apache Tez to work with it too. From what I understand, that requires setting the hive.execution.engine property to tez, according to the Hive site.
I get that Hive properties can usually be set with set hive.{...}, or in hive-site.xml, but I don't know how either of those works in Amazon EMR.
So: is there a way to set Hive Configuration Properties in Amazon EMR, and if so, how?
Thanks!
You can do this in two ways:
1) DIRECTLY WITHIN SINGLE HIVE SCRIPT (.hql file)
Just put your properties at the beginning of your Hive hql script, like:
set hive.execution.engine=tez;
CREATE TABLE...
2) VIA APPLICATION CONFIGURATIONS
When you create an EMR cluster, you can specify Hive configurations that apply for the entire cluster's lifetime. This can be done either via the AWS Management Console or via the AWS CLI.
a) AWS Management Console
Open AWS EMR service and click on Create cluster button
Click on Go to advanced options at the top
Be sure to select Hive among the applications, then enter a JSON configuration. It accepts the properties you would normally put in hive-site.xml. For example, a minimal configuration that sets only the Tez property:
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.execution.engine": "tez"
    }
  }
]
You can optionally load the JSON from an S3 path.
b) AWS CLI
As stated in detail here, you can specify the Hive configuration on cluster creation, using the flag --configurations, like below:
aws emr create-cluster \
  --configurations file://configurations.json \
  --release-label emr-5.9.0 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
  --auto-terminate
The JSON file has the same content shown above in the Management Console example.
Again, you can optionally specify a S3 path instead:
--configurations https://s3.amazonaws.com/myBucket/configurations.json
Amazon Elastic MapReduce (EMR) is an automated means of deploying a normal Hadoop distribution. Commands you can normally run against Hadoop and Hive will also work under EMR.
You can execute hive commands either interactively (by logging into the Master node) or via scripts (submitted as job 'steps').
You would be responsible for installing TEZ on Amazon EMR. I found this forum post: TEZ on EMR