I'm studying how Redshift and Hive perform on AWS.
I have a Spark application running on a local cluster that works with Apache Hive. We are going to migrate it to AWS.
We found that AWS offers a data warehouse solution, Redshift.
Redshift is a columnar database that queries terabytes of data quickly and needs little maintenance. But I have a question: how does the performance of Redshift compare to Hive?
If I run Hive on EMR, keeping the storage on EMR and handling the metastore with Hive, Spark will use that setup to process the data.
How does Redshift perform compared to Hive on EMR? Is Redshift the best solution for Apache Spark in terms of performance?
Or would Hive give Spark enough of a performance gain to compensate for the maintenance time?
-------EDIT-------
Well, I read more about it and found out how Redshift works with Spark on EMR.
From what I saw, when you read data from Redshift, it is first unloaded to an S3 bucket, like this:
I found this information on the Databricks blog.
Given this, is Hive faster than Redshift on EMR?
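To make the S3 staging step concrete, here is a minimal sketch of how a read through the Databricks spark-redshift data source is usually configured. The JDBC URL, table name, bucket and role ARN are placeholders I made up, and `read_redshift_table` assumes you already have a SparkSession with the connector on the classpath. The `tempdir` option is the S3 location the data is unloaded to before Spark reads it.

```python
def redshift_read_options(jdbc_url, table, temp_dir, iam_role_arn):
    """Build the options for the spark-redshift data source.

    All argument values are supplied by the caller; nothing here is
    specific to any real cluster.
    """
    return {
        "url": jdbc_url,               # JDBC endpoint of the Redshift cluster
        "dbtable": table,              # table to read
        "tempdir": temp_dir,           # S3 staging area used for the unload
        "aws_iam_role": iam_role_arn,  # role Redshift assumes to write to S3
    }


def read_redshift_table(spark, options):
    """Load a Redshift table as a DataFrame.

    `spark` is assumed to be an existing SparkSession that has the
    spark-redshift connector available.
    """
    reader = spark.read.format("com.databricks.spark.redshift")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load()


# Placeholder values, for illustration only.
example_options = redshift_read_options(
    "jdbc:redshift://example-host:5439/dev",
    "public.sales",
    "s3a://example-bucket/spark-tmp/",
    "arn:aws:iam::123456789012:role/example-role",
)
for name, value in example_options.items():
    print(f"{name} = {value}")
```

So every read pays for an unload to S3 plus a read from S3, which is worth keeping in mind when comparing it with Hive tables that Spark reads directly.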
I am running Spark jobs on EKS, and these jobs are submitted from Jupyter notebooks.
We have all our tables in an S3 bucket and their metadata sits in Glue Data Catalog.
I want to use the Glue Data Catalog as the Hive metastore for these Spark jobs.
I see that this is possible when Spark runs on EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
but is it possible with Spark running on EKS?
I have seen this code released by AWS:
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore
but I can't tell whether patching the Hive jar is necessary for what I'm trying to do.
I also need the hive-site.xml file to connect Spark to the metastore. How can I get this file from the Glue Data Catalog?
I found a solution for that.
I created a new Spark image following these instructions: https://github.com/viaduct-ai/docker-spark-k8s-aws
and finally, in my job YAML file, I added some configuration:
sparkConf:
  ...
  spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
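If you submit jobs with spark-submit instead of a job YAML file, the same kind of settings can be passed as `--conf` flags. The sketch below is only an illustration: it does not reproduce the entries elided above, and the metastore factory class is the one the EMR documentation uses for Glue. My assumption is that it only works on EKS if the image actually ships the Glue catalog client jars. Setting it through `spark.hadoop.*` is also an assumption on my part, as a way to avoid a separate hive-site.xml.

```python
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"
# Factory class named in the EMR documentation for the Glue Data Catalog.
# Assumption: the Spark image contains the Glue catalog client jars.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def glue_spark_conf():
    """Spark properties for S3 access plus a Glue-backed Hive metastore."""
    return {
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.hive.metastore.client.factory.class": GLUE_FACTORY,
        "spark.hadoop.fs.s3a.impl": S3A_IMPL,
        "spark.hadoop.fs.s3.impl": S3A_IMPL,
    }


def to_submit_args(conf):
    """Turn a property dict into a flat list of spark-submit arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(" ".join(to_submit_args(glue_spark_conf())))
```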
I currently have an AWS EMR cluster running HBase, and I am saving the data to S3. I want to migrate the data to a new EMR cluster in the same account. What is the proper way to migrate data from one EMR cluster to another?
Thank you
There are different ways to copy a table from one cluster to another:
Use the CopyTable utility. The disadvantage is that it can degrade region server performance, or else the tables need to be disabled before the copy.
HBase snapshots (recommended). They have little impact on region server performance.
You can follow the AWS documentation to perform the snapshot/restore operations.
Basically you will do the following:
1) Create a snapshot
2) Export it to S3
3) Import it from S3
4) Restore it to HBase
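The four steps above can be sketched as the commands you would run on each cluster. This is only an outline built around HBase's `ExportSnapshot` tool and the `snapshot` / `restore_snapshot` shell commands. The table name, snapshot name, bucket and target root directory are placeholders, so check the AWS documentation for the exact paths on your EMR release.

```python
EXPORT_TOOL = "org.apache.hadoop.hbase.snapshot.ExportSnapshot"


def snapshot_migration_commands(table, snapshot, s3_uri,
                                target_rootdir="hdfs:///user/hbase",
                                mappers=2):
    """Return (where to run it, command) pairs for the four steps.

    `target_rootdir` is an assumed HBase root directory on the new
    cluster; adjust it to whatever hbase.rootdir is set to there.
    """
    return [
        # 1) Create the snapshot on the old cluster.
        ("old cluster, hbase shell",
         f"snapshot '{table}', '{snapshot}'"),
        # 2) Export the snapshot to S3.
        ("old cluster, shell",
         f"hbase {EXPORT_TOOL} -snapshot {snapshot} "
         f"-copy-to {s3_uri} -mappers {mappers}"),
        # 3) Import the snapshot from S3 on the new cluster.
        ("new cluster, shell",
         f"hbase {EXPORT_TOOL} -snapshot {snapshot} "
         f"-copy-from {s3_uri} -copy-to {target_rootdir} "
         f"-mappers {mappers}"),
        # 4) Restore it (an existing table must be disabled first).
        ("new cluster, hbase shell",
         f"restore_snapshot '{snapshot}'"),
    ]


for where, command in snapshot_migration_commands(
        "orders", "orders_snap", "s3://example-bucket/hbase-snapshots"):
    print(f"[{where}] {command}")
```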
There are tons of examples of migrating data from Aurora DB to Redshift, but I couldn't find any example or documentation for migrating data from Redshift to Aurora DB. Any suggestion, example or doc for migrating data from Redshift into Aurora DB in an efficient way?
You can do it by unloading [1] the data into S3 directly from Redshift and then loading it into Aurora [2][3].
[1] Redshift UNLOAD official docs.
[2] Aurora MySQL LOAD FROM S3 official docs.
[3] Aurora PostgreSQL LOAD FROM S3 official docs.
Note that the LOAD on Aurora does not support .parquet or other columnar files, so your best shot is to unload the data as .csv. Depending on the size of your tables, you might consider doing it in batches or during periods of reduced workload.
At the time of writing this answer, you can't use Redshift as a source for DMS; otherwise that would be the preferred way.
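As a rough sketch of the unload/load pair for the Aurora MySQL case, the helper below just renders the two SQL statements. The table, bucket and role ARN are placeholders, and I'm assuming both the Redshift cluster and the Aurora cluster already have IAM roles that can reach the bucket. Column order and types have to match on both sides.

```python
def redshift_unload_sql(select_sql, s3_prefix, iam_role_arn):
    """Render a Redshift UNLOAD that writes CSV files under s3_prefix."""
    # Single quotes inside the query must be doubled within UNLOAD ('...').
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV;"
    )


def aurora_mysql_load_sql(s3_prefix, table):
    """Render an Aurora MySQL LOAD DATA FROM S3 for the unloaded files."""
    return (
        f"LOAD DATA FROM S3 PREFIX '{s3_prefix}'\n"
        f"INTO TABLE {table}\n"
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'\n"
        "LINES TERMINATED BY '\\n';"
    )


# Placeholder names, for illustration only.
prefix = "s3://example-bucket/export/venue_"
print(redshift_unload_sql(
    "SELECT * FROM venue WHERE venuestate = 'NV'",
    prefix,
    "arn:aws:iam::123456789012:role/example-unload-role",
))
print()
print(aurora_mysql_load_sql(prefix, "venue"))
```

For batching, you can run the same pair per date range or per key range by changing the SELECT and the prefix.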
I am trying to run Hive queries on Amazon AWS using Talend. So far I can create clusters on AWS using the tAmazonEMRManage object. The next steps would be:
1) To load the tables with data
2) To run queries against the tables
My data sits in S3. So far, the Talend documentation does not seem to indicate that the Hive objects tHiveLoad and tHiveRow support S3, which makes me wonder whether running Hive queries on EMR via Talend is even possible.
The documentation on how to do this is scarce. Has anyone done this successfully, or can anyone point me in the right direction, please?
I'm trying to use Presto on an Amazon S3 bucket, but I haven't found much related information on the Internet.
I've installed Presto on a micro instance, but I'm not able to figure out how to connect to S3. There is a bucket and there are files in it. I have a running Hive metastore server and I have configured it in the Presto hive.properties. But when I try to run the LOCATION command in Hive, it's not working.
It throws an error saying it cannot find the file scheme type s3.
Also, I don't understand why we need to run Hadoop, but Hive doesn't run without it. Is there an explanation for this?
This and this are the documentation pages I followed during setup.
Presto uses the Hive metastore to map database tables to their underlying files. These files can exist on S3, and can be stored in a number of formats - CSV, ORC, Parquet, Seq etc.
The Hive metastore is usually populated through HQL (Hive Query Language) by issuing DDL statements like CREATE EXTERNAL TABLE ... with a LOCATION ... clause referencing the underlying files that hold the data.
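For illustration, this is roughly what such a DDL statement looks like, rendered by a small helper. The table name, columns and bucket are made up, and which URI scheme works (s3, s3n or s3a) depends on how the Hadoop file systems are configured on the machine running Hive.

```python
def create_external_table_ddl(table, columns, location, delimiter=","):
    """Render a Hive CREATE EXTERNAL TABLE statement for delimited text files.

    `columns` is a list of (name, hive_type) pairs; `location` is the
    directory that holds the data files.
    """
    cols = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )


# Hypothetical table over CSV files in a placeholder bucket.
print(create_external_table_ddl(
    "events",
    [("user_id", "BIGINT"), ("event", "STRING")],
    "s3a://example-bucket/events/",
))
```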
In order to get Presto to connect to a Hive metastore you will need to edit the hive.properties file (EMR puts this in /etc/presto/conf.dist/catalog/) and set the hive.metastore.uri parameter to the thrift service of an appropriate Hive metastore service.
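A minimal hive.properties could be generated like this. The host name is a placeholder, 9083 is the usual default port of the metastore thrift service, and `hive-hadoop2` is the connector name I'd expect for Presto's Hive connector, so verify it against the Presto version you installed.

```python
def presto_hive_catalog(metastore_host, metastore_port=9083):
    """Render the contents of a Presto hive.properties catalog file."""
    lines = [
        # Connector name is an assumption; check your Presto version's docs.
        "connector.name=hive-hadoop2",
        # Thrift endpoint of the Hive metastore service.
        f"hive.metastore.uri=thrift://{metastore_host}:{metastore_port}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder host, for illustration only.
print(presto_hive_catalog("metastore.example.internal"), end="")
```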
The Amazon EMR cluster instances will automatically configure this for you if you select Hive and Presto, so it's a good place to start.
If you want to test this on a standalone EC2 instance, then I'd suggest that you first focus on getting a functional Hive service working with the Hadoop infrastructure. You should be able to define tables that reside locally on the HDFS file system. Presto complements Hive, but it does require a functioning Hive setup. Presto's native DDL statements are not as feature-complete as Hive's, so you'll do most table creation from Hive directly.
Alternatively, you can define Presto connectors for a MySQL or PostgreSQL database, but it's just a JDBC pass-through, so I don't think you'll gain much.