I am running Spark jobs on EKS, and these jobs are submitted from Jupyter notebooks.
We have all our tables in an S3 bucket, and their metadata sits in the Glue Data Catalog.
I want to use the Glue Data Catalog as the Hive metastore for these Spark jobs.
I see that it's possible to do when Spark is run in EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
but is it possible from Spark running on EKS?
I have seen this code released by AWS:
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore
but I can't tell whether patching the Hive jars is necessary for what I'm trying to do.
I also need a hive-site.xml file to connect Spark to the metastore. How can I get this file from the Glue Data Catalog?
I found a solution for that.
I created a new Spark image following these instructions: https://github.com/viaduct-ai/docker-spark-k8s-aws
and finally, in my job YAML file, I added some configuration:
sparkConf:
  ...
  spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
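If you also want Spark to use the Glue Data Catalog as its metastore without providing a hive-site.xml, and assuming the Glue client jars from the awslabs repository above are baked into the image's classpath, a sketch of the extra sparkConf entries would be:

sparkConf:
  ...
  spark.sql.catalogImplementation: "hive"
  spark.hadoop.hive.metastore.client.factory.class: "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"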
While using Google Cloud Dataproc to run a PySpark job, my code tried to run a query against BigQuery using PySpark:
query = 'select max(col_a) from table'
df = spark.read.format('bigquery').load(query)
Look at this notebook; it has example code for running BigQuery queries with Spark on Dataproc.
You see this error because Dataproc does not include the Spark BigQuery connector jar by default, which is why you need to add it to your Spark application if you want to use Spark to process data in BigQuery.
Here is documentation with examples on how to do this for Dataproc Serverless and Dataproc clusters:
https://cloud.google.com/dataproc-serverless/docs/guides/bigquery-connector-spark-example
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
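As a concrete illustration of adding the connector and running a query, here is a minimal PySpark sketch; the connector version and the project, dataset, and table names are assumptions, not values from the question:

from pyspark.sql import SparkSession

# Pull the BigQuery connector at startup; the version here is an assumption.
spark = (
    SparkSession.builder
    .appName('bigquery-example')
    .config('spark.jars.packages',
            'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1')
    .getOrCreate()
)

# Running a SQL query (rather than reading a whole table) requires a
# dataset where the connector can materialize the query result.
spark.conf.set('viewsEnabled', 'true')
spark.conf.set('materializationDataset', 'my_dataset')  # hypothetical dataset

df = spark.read.format('bigquery') \
    .load('select max(col_a) from `my_project.my_dataset.my_table`')  # hypothetical table
df.show()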
Has anybody worked on creating a Python script to load data from S3 into Redshift tables for multiple files? How can we achieve this with the AWS CLI? Your learnings and inputs on this are appreciated.
The COPY command is the best way to load data from Amazon S3 to Amazon Redshift. It can load multiple files in parallel into a single table.
Use any Python library (e.g. PostgreSQL + Python | Psycopg) to connect to Amazon Redshift, then issue the COPY command.
The AWS Command-Line Interface (CLI) does not have the ability to run the COPY command on Redshift because it needs to be issued to the database, while the AWS CLI issues commands to AWS. (The AWS CLI can be used to launch/terminate a Redshift cluster, but not to connect to the cluster itself.)
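A minimal sketch of that approach with psycopg2; the endpoint, credentials, bucket, and IAM role are placeholders:

import psycopg2

# Connect to the Redshift cluster endpoint (placeholder values).
conn = psycopg2.connect(
    host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
    port=5439,
    dbname='dev',
    user='awsuser',
    password='...',
)

# One COPY loads every file under the S3 prefix, in parallel, into one table.
copy_sql = """
    COPY my_schema.my_table
    FROM 's3://my-bucket/my-prefix/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # the load itself runs inside Redshift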
I would like to migrate my Hive tables to an existing Dataproc cluster. Is there any way to deploy the tables using Google Deployment Manager? I have checked the list of supported resource types in GDM and could not locate a Hive resource type. However, there is a dataproc.v1.cluster instance available, and I was able to successfully deploy a Dataproc cluster. Now, is there any way to deploy my Hive DDLs within the cluster?
For customizing your cluster, we recommend initialization actions.
If your Hive tables are all set up as EXTERNAL with data in GCS or BigQuery, you could also benefit from setting up a metastore on Cloud SQL.
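If the goal is just to get the DDLs into an existing cluster, they can also be submitted as a Hive job through the Dataproc API; a minimal sketch using the Python client, where the project, region, cluster, and DDL text are placeholders:

from google.cloud import dataproc_v1

project_id = 'my-project'    # hypothetical
region = 'us-central1'       # hypothetical
cluster_name = 'my-cluster'  # hypothetical

# Point the client at the regional Dataproc endpoint.
client = dataproc_v1.JobControllerClient(
    client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'}
)

# Wrap the DDL statements in a Hive job that runs on the existing cluster.
job = {
    'placement': {'cluster_name': cluster_name},
    'hive_job': {
        'query_list': {
            'queries': [
                "CREATE EXTERNAL TABLE my_table (id INT, name STRING) "
                "LOCATION 'gs://my-bucket/my-table/'",
            ]
        }
    },
}

client.submit_job(request={'project_id': project_id, 'region': region, 'job': job})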
I'm doing some studies on Redshift and Hive running on AWS.
I have a Spark application running on a local cluster with Apache Hive, and we are going to migrate it to AWS.
We found that AWS offers a data warehouse solution called Redshift.
Redshift is a columnar database that is really fast at querying terabytes of data without big issues, and working with Redshift will not take much maintenance time. But I have a question: how does Redshift's performance compare to Hive's?
If I run Hive on EMR, keeping the storage in EMR and handling the metastore with Hive, Spark can use that to process the data.
How does Redshift's performance compare to Hive's on EMR? Is Redshift the best solution for Apache Spark in terms of performance?
Or will Hive give me enough performance with Spark to compensate for the maintenance time?
-------EDIT-------
Well, I read more about it, and I found out how Redshift works with Spark on EMR.
According to what I saw, when you read data from Redshift, it first unloads the information to an S3 bucket.
I found this information on the Databricks blog.
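For reference, this is roughly what that read path looks like with the Databricks spark-redshift connector; a sketch assuming the connector package is on the classpath, with placeholder connection values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The connector UNLOADs the query result to the S3 tempdir first,
# and Spark then reads the unloaded files from there.
df = (
    spark.read.format('com.databricks.spark.redshift')
    .option('url', 'jdbc:redshift://my-cluster:5439/dev?user=me&password=...')  # placeholder
    .option('dbtable', 'my_table')                                              # placeholder
    .option('tempdir', 's3a://my-bucket/spark-redshift-tmp/')
    .load()
)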
According to this, is Hive faster than Redshift on EMR?
I'm trying to run a Hive query using Amazon EMR, and am trying to get Apache Tez to work with it too, which from what I understand requires setting the hive.execution.engine property to tez in the Hive site configuration.
I get that Hive properties can usually be set with set hive.{...} or in hive-site.xml, but I don't know how either of those interacts with, or is possible in, Amazon EMR.
So: is there a way to set Hive Configuration Properties in Amazon EMR, and if so, how?
Thanks!
You can do this in two ways:
1) DIRECTLY WITHIN A SINGLE HIVE SCRIPT (.hql file)
Just put your properties at the beginning of your Hive hql script, like:
set hive.execution.engine=tez;
CREATE TABLE...
2) VIA APPLICATION CONFIGURATIONS
When you create an EMR cluster, you can specify Hive configurations that apply for the entire cluster's lifetime. This can be done either via the AWS Management Console or via the AWS CLI.
a) AWS Management Console
Open the AWS EMR service and click on the Create cluster button
Click on Go to advanced options at the top
Be sure to select Hive among the applications, then enter a JSON configuration like the one below, where you can set any property you would usually have in your hive-site.xml configuration; the Tez property is used as the example. You can optionally load the JSON from an S3 path.
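A minimal example of such a JSON, using the hive-site classification to set the Tez property:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.execution.engine": "tez"
    }
  }
]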
b) AWS CLI
As stated in detail here, you can specify the Hive configuration on cluster creation, using the flag --configurations, like below:
aws emr create-cluster --configurations file://configurations.json --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate
The JSON file has the same content shown above in the Management Console example.
Again, you can optionally specify an S3 path instead:
--configurations https://s3.amazonaws.com/myBucket/configurations.json
Amazon Elastic MapReduce (EMR) is an automated means of deploying a normal Hadoop distribution. Commands you can normally run against Hadoop and Hive will also work under EMR.
You can execute hive commands either interactively (by logging into the Master node) or via scripts (submitted as job 'steps').
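For example, a Hive script stored on S3 can be submitted as a step from the CLI (a sketch with a placeholder cluster ID and script path):

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps Type=HIVE,Name="Run Hive script",ActionOnFailure=CONTINUE,Args=[-f,s3://my-bucket/my-script.hql]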
You would be responsible for installing TEZ on Amazon EMR. I found this forum post: TEZ on EMR