Presto on Amazon S3 - amazon-web-services

I'm trying to use Presto on an Amazon S3 bucket, but I haven't found much related information on the Internet.
I've installed Presto on a micro instance, but I can't figure out how to connect it to S3. There is a bucket and there are files in it. I have a running Hive metastore server and I have configured it in Presto's hive.properties. But when I try to run the LOCATION command in Hive, it doesn't work.
It throws an error saying it cannot find the file scheme type s3.
I also don't understand why we need to run Hadoop, but Hive doesn't run without it. Is there an explanation for this?
This and this are the documentation pages I followed during setup.

Presto uses the Hive metastore to map database tables to their underlying files. These files can exist on S3, and can be stored in a number of formats - CSV, ORC, Parquet, Seq etc.
The Hive metastore is usually populated through HQL (Hive Query Language) by issuing DDL statements like CREATE EXTERNAL TABLE ... with a LOCATION ... clause referencing the underlying files that hold the data.
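As a rough illustration of the kind of DDL described above, here is a minimal Python sketch that builds a CREATE EXTERNAL TABLE statement as a string. The table name, columns, and bucket path are hypothetical, and the URI scheme that actually works (s3, s3a or s3n) is an assumption that depends on how Hadoop or EMR is set up.

```python
def external_table_ddl(table, columns, location, stored_as="TEXTFILE", delimiter=","):
    """Build a Hive CREATE EXTERNAL TABLE statement as a string.

    columns is a list of (name, hive_type) pairs; location is the directory
    (for example an S3 prefix) that holds the data files.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    lines = [f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (", f"  {cols}", ")"]
    # A field delimiter only applies to plain-text formats such as CSV.
    if stored_as.upper() == "TEXTFILE":
        lines.append(f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'")
    lines.append(f"STORED AS {stored_as.upper()}")
    lines.append(f"LOCATION '{location}'")
    return "\n".join(lines) + ";"


# Hypothetical table and bucket, used only for illustration.
print(external_table_ddl(
    "logs",
    [("ts", "STRING"), ("bytes", "BIGINT")],
    "s3a://example-bucket/logs/",
))
```

The resulting statement would be run from Hive (for example in the Hive CLI or beeline); once the table is in the metastore, Presto can query it.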
In order to get Presto to connect to a Hive metastore you will need to edit the hive.properties file (EMR puts this in /etc/presto/conf.dist/catalog/) and set the hive.metastore.uri parameter to the thrift service of an appropriate Hive metastore service.
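For a standalone install, the catalog file might look something like what the following sketch produces. The host name is a placeholder, 9083 is the port the Hive metastore commonly listens on, and hive-hadoop2 is the connector name I'd assume for Presto builds of that era; check both against your own setup.

```python
def render_hive_catalog(metastore_host, metastore_port=9083,
                        connector="hive-hadoop2", extra=None):
    """Render the contents of a Presto hive.properties catalog file."""
    props = {
        "connector.name": connector,
        # Thrift endpoint of the Hive metastore service.
        "hive.metastore.uri": f"thrift://{metastore_host}:{metastore_port}",
    }
    # Any additional connector properties are passed through unchanged.
    props.update(extra or {})
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Placeholder host name, used only for illustration.
print(render_hive_catalog("metastore.example.internal"))
```

Write the output to hive.properties in Presto's catalog directory and restart Presto so it picks up the catalog.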
The Amazon EMR cluster instances will automatically configure this for you if you select Hive and Presto, so it's a good place to start.
If you want to test this on a standalone EC2 instance, then I'd suggest that you first focus on getting a functional Hive service working with the Hadoop infrastructure. You should be able to define tables that reside locally on the HDFS file system. Presto complements Hive but does require a functioning Hive setup. Presto's native DDL statements are not as feature-complete as Hive's, so you'll do most table creation from Hive directly.
Alternatively, you can define Presto connectors for a MySQL or PostgreSQL database, but it's just a JDBC pass-through, so I don't think you'll gain much.

Related

How can we interact with Dataproc Metastore to fetch list of databases and tables?

I am using Dataproc Metastore as the metastore service on GCP. How can I interact with it to fetch the list of databases and tables? Is it possible to do this without running a Dataproc cluster?
Edit - I have to fetch the metadata without running Dataproc cluster.
Since I am using Dataproc Metastore service to store metadata, I need to fetch metadata directly from it.
The Dataproc Metastore API is used to manage the Dataproc Metastore service instance (get/create/update etc). As mentioned in one of the comments, you can use the thrift URI (you will find the URI under the configuration tab of the metastore service if you are using the console).
Once you have a Thrift client that connects to the thrift URI, you can fetch databases or tables. Although you can use the Thrift API to create databases and tables as well, the typical use case is to configure a big data processing engine or framework such as Spark or Hive to use the metastore, rather than interacting with the metastore directly.
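Most Thrift clients want the host and port separately rather than a URI. A minimal sketch of splitting the thrift URI with Python's standard library follows; the address is a made-up example, and the fallback port 9083 is an assumption based on the usual Hive metastore default.

```python
from urllib.parse import urlparse


def split_thrift_uri(uri, default_port=9083):
    """Return (host, port) from a URI of the form thrift://host:port."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift" or not parsed.hostname:
        raise ValueError(f"not a thrift URI: {uri!r}")
    # Fall back to the assumed default when the URI carries no port.
    return parsed.hostname, parsed.port or default_port


# Made-up endpoint; copy the real one from the metastore's configuration tab.
host, port = split_thrift_uri("thrift://10.0.0.5:9083")
print(host, port)
```

The pair can then be handed to whichever Hive metastore Thrift client you choose.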

aws emr with glue: how to specify database name?

I'm trying to run a Hive job using Glue metadata. From the AWS docs:
Under AWS Glue Data Catalog settings select Use for Hive table metadata.
I created a cluster that apparently connects to the default database from Glue (I can tell by running show tables; from Hive, which lists a table from the default database).
Now, does anyone know how to provide an option to connect to another database from Glue? The only thing I could find in the docs is the option of setting hive.metastore.glue.catalogid, which lets you use a catalog from another account, but I cannot find anything in the docs about selecting the right database.
Or perhaps all the databases are loaded. If so, do you know how to access them within Hive?
OK, it turns out all the databases are loaded in Hive. You can simply access them by using select * from my_database_name.my_table_name, or by setting the database name once with use my_database_name.
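To make the two options concrete, here is a small Python sketch that builds both forms of the query as strings. The database and table names are the placeholders from the answer, and the helper names are made up for illustration.

```python
def qualified_select(database, table):
    """Query a table by its fully qualified name."""
    return f"SELECT * FROM {database}.{table};"


def use_then_select(database, table):
    """Switch the session's current database once, then use the bare table name."""
    return [f"USE {database};", f"SELECT * FROM {table};"]


print(qualified_select("my_database_name", "my_table_name"))
for statement in use_then_select("my_database_name", "my_table_name"):
    print(statement)
```

The first form is handy for one-off queries across databases; the second is more convenient when a whole script works against a single database.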

How can i load data from a hive table on an external network to Amazon Redshift?

I need some suggestions on how to load data from an external Hive table on an external network into Amazon Redshift. The Hive installation is a completely separate entity on an external network, and I need to make its data available in Amazon Redshift.
The basic requirement is internet connectivity if your Hive is on-premises. One possible approach is as follows:
1. Mount your S3 bucket in your Hadoop environment.
2. Create Hive DDL that is compatible with Redshift; only then will you be able to write to Redshift.

Running Hive or Spark using Amazon EMR on Talend?

I am trying to run Hive queries on AWS using Talend. So far I can create clusters on AWS using the tAmazonEMRManage object. The next steps would be:
1) To load the tables with data
2) To run queries against the tables
My data sits in S3. So far the Talend documentation does not seem to indicate that the Hive objects tHiveLoad and tHiveRow support S3, which makes me wonder whether running Hive queries on EMR via Talend is even possible.
The documentation on how to do this is scarce. Has anyone done this successfully, or can anyone point me in the right direction, please?

How do I run Hive queries on a cloud platform like AWS?

I've run Pig scripts on an AWS EMR cluster, but have no experience with Hive. I'm trying to understand how to run Hive queries and scripts on the data in HDFS, because there needs to be a database and table set up first.
Am I supposed to set up the database and table before running the Hive query or script? A Hive query is obviously supposed to have a FROM clause, but how do I know which table and database to designate in that FROM clause?
You'll want to create an External Table, which defines the table to Hive in SQL-like syntax, but points to the data that is already present on disk.
Refer to any Hive documentation, such as https://cwiki.apache.org/confluence/display/Hive/Tutorial