Running Hive or Spark on Amazon EMR using Talend?

I am trying to run Hive queries on AWS using Talend. So far I can create clusters on AWS using the tAmazonEMRManage component; the next steps would be
1) Load the tables with data
2) Run queries against the tables.
My data sits in S3. So far the Talend documentation does not seem to indicate that the Hive components tHiveLoad and tHiveRow support S3, which makes me wonder whether running Hive queries on EMR via Talend is even possible.
The documentation on how to do this is scarce. Has anyone done this successfully, or can anyone point me in the right direction?

Related

What is the best option to UNLOAD/LOAD schemas from a Redshift cluster to another cluster in another region on a schedule?

I have two Redshift clusters in two different regions; one is production and the other is development. I need to export multiple schemas from the production cluster to the dev cluster on a weekly basis, so I am using the Prod UNLOAD --> S3 --> LOAD Dev approach.
Currently I am using the query below, which returns a table of UNLOAD commands I need to run.
select 'unload (''select * from '||n.nspname||'.'||c.relname||''') to ''s3://my-redshift-data/'||c.relname||'_''
iam_role ''arn:aws:iam::xxxxxxxxxxxxxx:role/my-redshift-role'' ;' as sql
from pg_class c
left join pg_namespace n on c.relnamespace=n.oid
where n.nspname in ('schema1','schema2') and relkind = 'r'
The results returned look something like below:
unload ('select * from schema1.table1') to 's3://my-redshift-data/table1_' iam_role 'arn:aws:iam::xxxxxxxxxxxxxx:role/my-redshift-role';
unload ('select * from schema2.table2') to 's3://my-redshift-data/table2_' iam_role 'arn:aws:iam::xxxxxxxxxxxxxx:role/my-redshift-role';
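The LOAD half on the dev cluster is then a matching COPY from the same S3 prefix; a minimal sketch, reusing the bucket and role names from the commands above and assuming the target tables already exist on dev (add format/delimiter options if the UNLOAD was not run with the defaults):
copy schema1.table1
from 's3://my-redshift-data/table1_'
iam_role 'arn:aws:iam::xxxxxxxxxxxxxx:role/my-redshift-role';
-- repeat per table, or generate these statements with a query like the one above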
Just some additional information: I have a Kubernetes cluster in the same VPC as my Redshift cluster, running apps that connect to the Redshift cluster. I also have a GitLab server with GitLab runners that are connected to the Kubernetes cluster using the GitLab agent. There are a few ways I can think of to do this:
1) Use a GitLab scheduled pipeline to run the UNLOAD/LOAD script via the Redshift Data API
2) Use a Kubernetes batch job to run the UNLOAD/LOAD script via the Redshift Data API
3) Use AWS Lambda with a Python script (something new to me) and schedule it with EventBridge
I would appreciate any suggestions, because I can't decide on the best way to approach this.
Whenever building anything new it's helpful to think about the time / cost / quality tradeoff. How much time do you have? How much time do you want to spend supporting it in the future? Will you have time to improve it later if necessary? Will someone else be able to pick it up to improve it? etc.
If you're looking to get this set up with as little effort as possible, and worry about the future when it happens, it's likely that a GitLab pipeline and the Redshift Data API will do the job. It is quite literally just a few calls to that API.
You need an IAM role for your GitLab runner to authenticate to the API, which you likely already have; obviously check that the role has the correct access.
You can store the database credentials as GitLab CI/CD variables as a fast, simple option, or use a more secure secret store if necessary.
Remember that the IAM role needs access to both the "source" (prod) and "target" (dev) clusters; if they are in different accounts this requires more setup.

AWS EMR with Glue: how to specify the database name?

I'm trying to run a Hive job using Glue metadata. From the AWS docs:
Under AWS Glue Data Catalog settings select Use for Hive table
metadata.
I created a cluster that apparently connects to the default database from Glue (I can tell by running show tables; from Hive, which lists a table from the default database).
Now, does anyone know how to provide an option to connect to another Glue database? The only thing I could find in the docs is the option of providing hive.metastore.glue.catalogid, where you can provide a catalog from another account, but I cannot find anything in the docs about selecting the right database.
Or perhaps all the databases are loaded. If so, do you know how to access them within Hive?
OK, it turns out all the databases are loaded in Hive. You can simply access them by using select * from my_database_name.my_table_name, or by setting the database name once with use my_database_name.
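For anyone else hitting this, a quick illustration from the Hive CLI on EMR (the database and table names here are just the placeholders from above):
show databases;
-- either fully qualify the table name...
select * from my_database_name.my_table_name;
-- ...or switch the current database once
use my_database_name;
show tables;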

What query to run to determine Amazon Athena version?

I'd like to determine what version of Amazon Athena I'm connected to by running a query. Is this possible? If so, what is the query?
Searching Google, SO, and the AWS docs has not turned up an answer.
Amazon Redshift launches as a cluster, with virtual machines being used for that specific cluster. The cluster must be specifically updated between versions because it is continuously running and is accessible by only one AWS account. Think of it as software running on your own virtual machines.
From Amazon Redshift Clusters:
Amazon Redshift provides a setting, Allow Version Upgrade, to specify whether to automatically upgrade the Amazon Redshift engine in your cluster if a new version of the engine becomes available.
Amazon Athena, however, is a fully-managed service. There is no cluster to be created -- you simply provide your query and it uses the metastore to know where to find data. Think of it just like Amazon S3 -- many servers provide access to multiple AWS customers simultaneously.
From Amazon Athena – Interactive SQL Queries for Data in Amazon S3:
Behind the scenes, Athena parallelizes your query, spreads it out across hundreds or thousands of cores, and delivers results in seconds.
As a fully-managed service, there is only ever one version of Amazon Athena, which is the version that is currently available.

Apache Spark on Redshift vs Apache Spark on Hive (EMR)

I'm doing some research on Redshift and Hive on AWS.
I have a Spark application running on a local cluster that works with Apache Hive, and we are going to migrate it to AWS.
We found that AWS offers a data warehouse solution, Redshift.
Redshift is a columnar database that is really fast at querying terabytes of data with no big issues, and working with Redshift will not take much maintenance effort. But I have a question: how is the performance of Redshift compared to Hive?
If I run Hive on EMR, with the storage on EMR and the metastore handled by Hive, Spark will use that set-up to process the data.
What is the performance of Redshift compared to Hive on EMR? Is Redshift the best solution for Apache Spark in terms of performance?
Or will Hive give enough performance with Spark to compensate for the maintenance time?
-------EDIT-------
Well, I read more about it, and I found out how Redshift works with Spark on EMR.
According to what I saw, when you read data from Redshift it first unloads the information to an S3 bucket.
I found this in a Databricks blog post.
Given this, is Hive faster than Redshift on EMR?

Presto on Amazon S3

I'm trying to use Presto on an Amazon S3 bucket, but haven't found much related information on the Internet.
I've installed Presto on a micro instance, but I'm not able to figure out how to connect it to S3. There is a bucket and there are files in it. I have a running Hive metastore server and I have configured it in Presto's hive.properties. But when I try to use the LOCATION clause in Hive, it's not working.
It throws an error saying it cannot find the file scheme type s3.
Also, I do not know why we need to run Hadoop, but without Hadoop, Hive doesn't run. Is there any explanation for this?
This and this are the documentation pages I followed while setting up.
Presto uses the Hive metastore to map database tables to their underlying files. These files can exist on S3 and can be stored in a number of formats: CSV, ORC, Parquet, SequenceFile, etc.
The Hive metastore is usually populated through HQL (Hive Query Language) by issuing DDL statements like CREATE EXTERNAL TABLE ... with a LOCATION ... clause referencing the underlying files that hold the data.
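A minimal sketch of such a statement, run from Hive, with a hypothetical bucket and a simple delimited format (adjust the columns, format and path to your data; on EMR the s3:// scheme is handled by EMRFS, while a self-managed Hadoop typically needs s3a:// plus the hadoop-aws libraries on the classpath):
CREATE EXTERNAL TABLE my_events (
  event_time STRING,
  user_id    STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/events/';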
In order to get Presto to connect to a Hive metastore you will need to edit the hive.properties file (EMR puts this in /etc/presto/conf.dist/catalog/) and set the hive.metastore.uri parameter to the thrift service of an appropriate Hive metastore service.
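For reference, a minimal sketch of that catalog file with a placeholder metastore host (the connector name can differ between Presto versions, and on EMR the real values are filled in for you):
connector.name=hive-hadoop2
hive.metastore.uri=thrift://<metastore-host>:9083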
The Amazon EMR cluster instances will automatically configure this for you if you select Hive and Presto, so it's a good place to start.
If you want to test this on a standalone EC2 instance, then I'd suggest that you first focus on getting a functional Hive service working with the Hadoop infrastructure. You should be able to define tables that reside locally on the HDFS file system. Presto complements Hive but requires a functioning Hive set-up; Presto's native DDL statements are not as feature-complete as Hive's, so you'll do most table creation from Hive directly.
Alternatively, you can define Presto connectors for a MySQL or PostgreSQL database, but it's just a JDBC pass-through, so I don't think you'll gain much.