I am going to run a Spark job for sentiment analysis on Google Cloud Platform, and I have decided to use Dataproc. Is it worth doing with Dataproc, or are there any other suggestions? I need to perform sentiment analysis on a huge dataset from Twitter, which is why I chose Google Cloud Platform as my big data and distributed environment.
GCP Dataproc is definitely a great choice for your use case. Dataproc natively supports Spark and also recently added support for Spark 3.
Please check which Dataproc image is right for your use case.
The following resources could be helpful when configuring and running a Spark job on a cluster (a minimal example job follows these links):
Creating and configuring cluster
Submit a job
Tutorial to run Spark scala job
Some more resources from the community: Spark job, PySpark job.
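To make this concrete, here is a minimal PySpark sketch of the kind of sentiment job you could submit to a Dataproc cluster. The GCS paths and the `text` column name are placeholders, and the word-list scorer is only a stand-in for a real sentiment model:

```python
# Minimal PySpark sketch for scoring tweet sentiment on Dataproc.
# Assumes tweets live as newline-delimited JSON in a GCS bucket
# (gs://your-bucket/... is a placeholder path) and uses a toy word-list
# scorer as a stand-in for a real sentiment model.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "awful", "hate", "sad"}

def score(text):
    # Crude lexicon score in [-1, 1]; replace with a real model.
    words = (text or "").lower().split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return float(pos - neg) / len(words)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("tweet-sentiment").getOrCreate()
    score_udf = F.udf(score, DoubleType())

    tweets = spark.read.json("gs://your-bucket/tweets/*.json")   # placeholder input
    scored = tweets.withColumn("sentiment", score_udf(F.col("text")))
    scored.write.mode("overwrite").parquet("gs://your-bucket/tweets-scored/")  # placeholder output

    spark.stop()
```

You would submit a script like this with `gcloud dataproc jobs submit pyspark` against your cluster.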
I would like to avoid the AWS dev endpoint. Is there a way I can test and debug my PySpark code in a local notebook/IDE without using an AWS dev endpoint?
As others have said, it depends on which parts of Glue you are going to use. If your code is based on pure Spark, without Dynamic Frames etc., then a local installation of Spark may suffice. If, however, you intend to use the Glue extensions, there is not really an option of avoiding the dev endpoint at this stage.
I hope that this helps.
If you are going to deploy your PySpark code on the AWS Glue service, you may have to use GlueContext and other AWS Glue APIs. So if you would like to test against the AWS Glue service using those AWS Glue APIs, then you do need an AWS dev endpoint.
However, having an AWS Glue notebook is optional: you can set up Zeppelin (or similar) locally and establish an SSH tunnel to the AWS Glue dev endpoint for development and testing from your local environment. Make sure you delete the dev endpoint once your development/testing is done for the day.
Alternatively, if you are not keen on using AWS Glue APIs other than GlueContext, then yes, you can set up Zeppelin in your local environment, test the code locally, then upload your code to S3 and create a Glue job to test it in the AWS Glue service.
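To illustrate the split described above, here is a minimal sketch of a Glue job script in which only a thin wrapper depends on GlueContext, while the actual transformation lives in a plain PySpark module. The `transforms` module, the `transform_events` function, and the `input_path`/`output_path` job parameters are hypothetical names, not part of any Glue API:

```python
# glue_job.py -- standard AWS Glue PySpark job boilerplate around a pure-PySpark
# transform. Only this wrapper needs Glue (and therefore a dev endpoint or the
# Glue service) to run; transforms.py is plain PySpark and can run locally.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

from transforms import transform_events  # hypothetical pure-PySpark module

# JOB_NAME is supplied by Glue; input_path/output_path are example job parameters.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_path", "output_path"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

df = spark.read.json(args["input_path"])       # plain Spark read
result = transform_events(df)                  # pure PySpark logic, unit-testable
result.write.mode("overwrite").parquet(args["output_path"])

job.commit()
```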
We have a setup here where PySpark is installed locally and we use VSCode to develop our PySpark code, unit test it, and debug it. We run the code against the local PySpark installation during development, then deploy it to EMR to run against the real dataset.
I'm not sure how much of this applies to what you're trying to do with Glue, as Glue is a level higher in abstraction.
We use pytest to test PySpark code. We keep the PySpark code in a separate file and call those functions from the Glue code file. With this separation, we can unit test the PySpark code using pytest.
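As a sketch of that separation (reusing the hypothetical `transforms.transform_events` from the example above), a pure-PySpark function can be unit tested with pytest and a local SparkSession, with no dev endpoint involved:

```python
# test_transforms.py -- pytest unit test for a pure-PySpark function kept
# outside the Glue script (transforms.transform_events is a placeholder name).
import pytest
from pyspark.sql import SparkSession

from transforms import transform_events  # the same function the Glue script calls


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession; no Glue service or dev endpoint involved.
    session = (SparkSession.builder
               .master("local[2]")
               .appName("glue-unit-tests")
               .getOrCreate())
    yield session
    session.stop()


def test_transform_events_keeps_valid_rows(spark):
    df = spark.createDataFrame(
        [("a", 1), ("b", None)],
        ["id", "value"],
    )
    result = transform_events(df)
    # Assumes transform_events drops rows with null values; adjust to your logic.
    assert result.count() == 1
```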
I was able to test without dev endpoints.
Please follow the instructions here:
https://support.wharton.upenn.edu/help/glue-debugging
I have an HBase Spark job running on an AWS EMR cluster. Recently we moved to GCP, and I transferred all the HBase data to Bigtable. Now I am running the same Spark Java/Scala job in Dataproc, and the Spark job is failing because it is looking for the spark.hbase.zookeeper.quorum setting.
Please let me know how I can make my Spark job run successfully with Bigtable without changing the code.
Regards,
Neeraj Verma
While Bigtable shares the same principles as HBase and offers the same Java API, it does not share its wire protocol, so the standard HBase client won't work (the ZooKeeper error suggests you are trying to connect to Bigtable via the HBase client). Instead, you need to modify your program to use the Bigtable-specific client. It implements the same Java interfaces as HBase, but requires custom Google jars on the classpath and a few property overrides to enable it.
I am trying to run Hive queries on Amazon AWS using Talend. So far I can create clusters on AWS using the tAmazonEMRManage object; the next steps would be:
1) Load the tables with data
2) Run queries against the tables.
My data sits in S3. So far the Talend documentation does not seem to indicate that the Hive objects tHiveLoad and tHiveRow support S3, which makes me wonder whether running Hive queries on EMR via Talend is even possible.
The documentation on how to do this is scarce. Has anyone tried doing this successfully, or can anyone point me in the right direction?
I'm doing some research on Redshift and Hive on AWS.
I have an application running in Spark on a local cluster, working with Apache Hive, and we are going to migrate it to AWS.
We found that AWS offers a data warehouse solution, Redshift.
Redshift is a columnar database that is really fast for queries over terabytes of data with no big issues, and working with it does not require much maintenance. But I have a question: how does the performance of Redshift compare to Hive?
If I run Hive on EMR, with the storage on EMR and the metastore handled by Hive, Spark will use that setup to process the data.
What is the performance of Redshift compared to Hive on EMR? Is Redshift the best solution for Apache Spark in terms of performance?
Or will using Hive with Spark give enough performance to compensate for the maintenance time?
-------EDIT-------
Well, I read more about it and found out how Redshift works with Spark on EMR.
From what I saw, when you query data from Redshift with Spark, it first unloads the information to an S3 bucket, and Spark then reads those files.
I found this information on the Databricks blog.
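For reference, a read through that connector looks roughly like the sketch below: Spark expresses it as a data source with an S3 tempdir that Redshift unloads into before Spark picks the files up. The connection details are placeholders, and the connector and Redshift JDBC driver jars need to be on the classpath:

```python
# Sketch of reading a Redshift table into Spark via the spark-redshift
# connector described in the Databricks post. All connection values below
# are placeholders; the connector + Redshift JDBC driver jars must be on
# the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-read").getOrCreate()

df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<pass>")
      .option("dbtable", "public.events")                  # or .option("query", "...")
      .option("tempdir", "s3a://<bucket>/redshift-temp/")  # Redshift unloads here first
      .option("forward_spark_s3_credentials", "true")
      .load())

df.show()
```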
According to this, is Hive faster than Redshift on EMR?