I have an HBase Spark job running on an AWS EMR cluster. Recently we moved to GCP, and I transferred all the HBase data to BigTable. Now I am running the same Spark (Java/Scala) job on Dataproc, and it is failing because it is looking for the spark.hbase.zookeeper.quorum setting.
Please let me know how I can make my Spark job run successfully against BigTable without code changes.
Regards,
Neeraj Verma
While BigTable shares the same principles and exposes the same Java API as HBase, it does not share HBase's wire protocol, so the standard HBase client won't work (the ZooKeeper error suggests you are trying to connect to BigTable via the HBase client). Instead, you need to modify your program to use the BigTable-specific client. It implements the same Java interfaces as HBase, but requires the Google client jars on the classpath and a few property overrides to enable it.
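For reference, the overrides with the bigtable-hbase client typically look something like the sketch below. This is only a sketch: the connection class name depends on which bigtable-hbase client version you put on the classpath (there are hbase1_x and hbase2_x flavors), and the project/instance IDs are placeholders.

```xml
<!-- hbase-site.xml (sketch): route the HBase API to BigTable instead of ZooKeeper -->
<configuration>
  <property>
    <!-- tell the HBase client library to use the BigTable connection implementation -->
    <name>hbase.client.connection.impl</name>
    <value>com.google.cloud.bigtable.hbase2_x.BigtableConnection</value>
  </property>
  <property>
    <name>google.bigtable.project.id</name>
    <value>your-gcp-project-id</value>
  </property>
  <property>
    <name>google.bigtable.instance.id</name>
    <value>your-bigtable-instance-id</value>
  </property>
</configuration>
```

With properties like these and the matching bigtable-hbase jar on the Spark classpath, the job's HBase API calls should go to BigTable and the ZooKeeper quorum setting should no longer be required.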
Hello fellow developers,
I have recently started learning about GCP and I am working on a POC that requires me to create a pipeline that is able to schedule Dataproc jobs written in PySpark.
Currently, I have created a Jupyter notebook on my Dataproc cluster that reads data from GCS and writes it to BigQuery. It works fine in Jupyter, but I want to use that notebook inside a pipeline.
Just like we can schedule pipeline runs on Azure using Azure Data Factory, please help me figure out which GCP tool would be helpful to achieve similar results.
My goal is to schedule the run of multiple Dataproc jobs.
Yes, you can do that by creating a Dataproc workflow template and scheduling it with Cloud Composer; see this doc for more details.
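As a rough sketch of what the Composer side could look like (the DAG id, template name, project, and region below are placeholders, and the exact import path depends on your Airflow/provider version):

```python
# Sketch: a Cloud Composer (Airflow) DAG that triggers a Dataproc workflow template daily.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocInstantiateWorkflowTemplateOperator,
)

with DAG(
    dag_id="scheduled_dataproc_workflow",       # placeholder DAG name
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    run_workflow = DataprocInstantiateWorkflowTemplateOperator(
        task_id="run_dataproc_workflow",
        template_id="my-workflow-template",     # workflow template containing the PySpark job(s)
        project_id="my-project",                # placeholder project id
        region="us-central1",                   # placeholder region
    )
```

The workflow template itself is where the PySpark jobs (and, optionally, an ephemeral cluster) are defined; Composer only triggers it on the schedule.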
You won't be able to schedule Dataproc jobs written in PySpark by using Data Fusion; Data Fusion is a code-free deployment of ETL/ELT data pipelines. For your requirement, though, you can directly create and schedule a pipeline in Data Fusion to pull data from GCS and load it into BigQuery.
I am doing a POC on Google Cloud Dataproc with HBase as one of the components.
I created a cluster and was able to get it running along with the HBase service. I can list and create tables via the shell.
I want to use Apache Phoenix as the client to query HBase. I installed it on the cluster by referring to this link.
The installation went fine, but when I execute sqlline.py localhost, which should create the meta table in HBase, it fails with a 'Region in Transition' error.
Does anyone know how to resolve this, or is there a limitation that prevents Apache Phoenix from being used with Dataproc?
There is no limitation on Dataproc that prevents using Apache Phoenix. You might want to dig deeper into the error message; it is likely a configuration issue.
I am going to run a Spark job for sentiment analysis on Google Cloud Platform, and I decided to use Dataproc. Is it worth doing with Dataproc, or are there other suggestions? I need to perform sentiment analysis on a huge dataset from Twitter, which is why I decided to use Google Cloud Platform as my big data and distributed environment.
GCP Dataproc is definitely a great choice for your use case. Dataproc natively supports Spark and also recently added support for Spark 3.
Please check which Dataproc image is right for your use case.
The following resources could be helpful for configuring and running a Spark job on a cluster:
Creating and configuring cluster
Submit a job
Tutorial to run Spark scala job
Some more resources from the community: Spark job, PySpark job (a minimal submission sketch follows below).
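If a programmatic example helps, below is a minimal sketch of submitting a PySpark job to an existing Dataproc cluster with the google-cloud-dataproc Python client; the project, region, cluster name, and GCS path are all placeholders.

```python
# Sketch: submit a PySpark job to an existing Dataproc cluster and wait for it to finish.
from google.cloud import dataproc_v1

project_id = "my-project"          # placeholder
region = "us-central1"             # placeholder
cluster_name = "my-cluster"        # placeholder

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/sentiment_job.py"},
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()       # blocks until the job finishes
print(f"Job {response.reference.job_id} finished with state {response.status.state.name}")
```

The same submission can also be done with gcloud dataproc jobs submit pyspark or through a workflow template; the client above is just one option.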
I have separate Spark and Airflow servers, and I don't have the Spark binaries on the Airflow servers. I am able to use SSHOperator and run the Spark jobs in cluster mode perfectly well. I would like to know which would be better in the long run for submitting PySpark jobs: SSHOperator or SparkSubmitOperator. Any help would be appreciated.
Below are the pros and cons of using SSHOperator vs SparkSubmitOperator in Airflow, followed by my recommendation.
SSHOperator: this operator SSHes into the remote Spark server and executes spark-submit on the remote cluster (a minimal sketch follows the pros and cons below).
Pros:
No additional configuration required in the airflow workers
Cons:
Hard to maintain the Spark configuration parameters
Need to open SSH port 22 from the Airflow servers to the Spark servers, which raises security concerns (even on a private network, SSH-based remote execution is not a best practice).
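A minimal sketch of the SSH approach, assuming a hypothetical Airflow SSH connection, spark-submit options, and job path:

```python
# Sketch: run spark-submit on a remote Spark edge node over SSH.
# The connection id, spark-submit options, and job path are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="spark_submit_over_ssh",
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    submit_via_ssh = SSHOperator(
        task_id="run_spark_submit",
        ssh_conn_id="ssh_spark_edge_node",   # Airflow connection pointing at the Spark server
        command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "--conf spark.executor.memory=4g "
            "/opt/jobs/my_pyspark_job.py"
        ),
    )
```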
SparkSubmitOperator: this operator performs the spark-submit operation in a clean way, but you still need additional infrastructure configuration (see the sketch after the pros and cons below).
Pros:
As mentioned above, it comes with handy Spark configuration handling and no additional effort to invoke spark-submit
Cons:
Need to install spark on all airflow servers.
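For comparison, a minimal sketch of the SparkSubmitOperator approach, assuming the Spark binaries and the Apache Spark provider are installed on the Airflow workers and a hypothetical spark_default connection points at the cluster:

```python
# Sketch: submit the same job directly from the Airflow worker via SparkSubmitOperator.
# The connection id, job path, and resource settings are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_submit_direct",
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    submit_directly = SparkSubmitOperator(
        task_id="run_spark_submit",
        application="/opt/jobs/my_pyspark_job.py",  # path visible from the Airflow worker
        conn_id="spark_default",                    # Spark connection with master/deploy-mode details
        executor_memory="4g",
        name="my_pyspark_job",
    )
```

The Spark master and deploy mode are typically configured on the Airflow Spark connection rather than in the operator itself.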
Apart from these two options, I have listed two additional options below.
Install a Livy server on the Spark cluster and use the Python pylivy library to interact with the Spark servers from Airflow (a sketch follows the references below). Refer: https://pylivy.readthedocs.io/en/stable/
If your Spark clusters are on AWS EMR, I would encourage using EmrAddStepsOperator.
Refer here for additional discussion: "To run Spark Submit programs from a different cluster (1**.1*.0.21) in airflow (1**.1*.0.35). How to connect remotely other cluster in airflow"
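A rough sketch of the Livy option using pylivy (a recent pylivy version is assumed; the Livy URL and the code strings are placeholders):

```python
# Sketch: run PySpark code on a remote cluster through a Livy server from the Airflow side.
from livy import LivySession

LIVY_URL = "http://spark-master:8998"   # placeholder Livy endpoint

with LivySession.create(LIVY_URL) as session:
    # Each call runs the given code string inside the remote Spark session.
    session.run("df = spark.read.json('/data/tweets')")
    session.run("print(df.count())")
```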
SparkSubmitOperator is a specialized operator. That is, it should make writing tasks for submitting Spark jobs easier and the code itself more readable and maintainable. Therefore, I would use it if possible.
In your case, you should consider if the effort of modifying the infrastructure, such that you can use the SparkSubmitOperator, is worth the benefits, which I mentioned above.
I'm studying how to use GCP, with a particular focus on the big data and analytics products, and I'm not quite sure about their functionality. I did some mapping to understand these components. Could you help check my understanding?
Cloud Pub/Sub: Apache Kafka
Cloud Dataproc: Apache Hadoop, Spark
GCS: HDFS compatible
Cloud Dataflow: Apache Beam, Flink
Datastore: MongoDB
BigQuery: Teradata
BigTable: HBase
Memorystore: Redis
Cloud SQL: MySQL, PostgreSQL
Cloud Composer: Informatica
Cloud Data Studio: Tableau
Cloud Datalab: Jupyter notebook
I'm not totally sure what you want to know; your understanding of the GCP products is not far off. If you are studying GCP and want to understand them better, you can take a look at the Google Cloud developer's cheat sheet, which has a brief explanation of all the products in GCP.
Link to the GitHub of the cheat sheet