I am using Google Cloud Dataproc to run a PySpark job. My code tries to run a query on BigQuery from PySpark:
query = 'select max(col a) from table'
df = spark.read.format('bigquery').load(query)
Take a look at this notebook. It has example code for running BigQuery queries with Spark on Dataproc.
You see this error because Dataproc does not include the Spark BigQuery connector jar by default. To use Spark to process data in BigQuery, you need to add the connector to your Spark application.
Here is documentation with examples on how to do this for Dataproc Serverless and Dataproc clusters:
https://cloud.google.com/dataproc-serverless/docs/guides/bigquery-connector-spark-example
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
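For what it's worth, here is a minimal sketch of how the read from the question might look once the connector jar is attached to the job. It relies on the connector's documented `query`, `viewsEnabled` and `materializationDataset` options. The function name `read_bigquery_query` and the dataset and table names in the usage comment are hypothetical placeholders, not values from the question, and the `spark` session is passed in rather than created here.

```python
def read_bigquery_query(spark, sql, materialization_dataset):
    """Run a SQL query through the Spark BigQuery connector.

    Assumptions (not from the original question):
    - the connector jar was added to the job, e.g. with --jars;
    - `materialization_dataset` is an existing BigQuery dataset the job
      may write temporary query results to.
    """
    return (
        spark.read.format("bigquery")
        # Query reads require views to be enabled on the connector.
        .option("viewsEnabled", "true")
        # Dataset where the connector materializes the query result.
        .option("materializationDataset", materialization_dataset)
        .option("query", sql)
        .load()
    )


# Hypothetical usage inside a PySpark job:
# df = read_bigquery_query(
#     spark, "SELECT MAX(col_a) FROM my_dataset.my_table", "my_temp_dataset"
# )
```

Passing the session in keeps the helper independent of how the job or notebook builds its SparkSession.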
I am running Spark jobs on EKS, and these jobs are submitted from Jupyter notebooks.
We have all our tables in an S3 bucket and their metadata sits in Glue Data Catalog.
I want to use the Glue Data Catalog as the Hive metastore for these Spark jobs.
I see that this is possible when Spark runs on EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
but is it possible from Spark running on EKS?
I have seen this code released by AWS:
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore
but I can't tell whether patching the Hive jar is necessary for what I'm trying to do.
Also, I need the hive-site.xml file to connect Spark to the metastore. How can I get this file from Glue Data Catalog?
I found a solution.
I built a new Spark image following these instructions: https://github.com/viaduct-ai/docker-spark-k8s-aws
and then added some configuration to my job YAML file:
sparkConf:
  ...
  spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
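Since the jobs are submitted from Jupyter, the same settings could also be applied when the session is built in the notebook. The sketch below is only an illustration: `S3_CONF` repeats the two keys from the YAML above, while `GLUE_CONF` is an assumption on my part that the custom image bundles the AWS Glue Data Catalog client jars (the answer does not list those settings). The helper name `apply_conf` is hypothetical.

```python
# Settings quoted in the job YAML: route s3:// and s3a:// paths through S3A.
S3_CONF = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}

# Assumption: the image contains the Glue Data Catalog client for the Hive
# metastore, so Spark can be pointed at Glue through this factory class.
GLUE_CONF = {
    "spark.sql.catalogImplementation": "hive",
    "spark.hadoop.hive.metastore.client.factory.class": (
        "com.amazonaws.glue.catalog.metastore."
        "AWSGlueDataCatalogHiveClientFactory"
    ),
}


def apply_conf(builder, conf):
    """Apply each key/value pair to a SparkSession builder and return it."""
    for key, value in sorted(conf.items()):
        builder = builder.config(key, value)
    return builder


# Hypothetical usage in a notebook cell:
# from pyspark.sql import SparkSession
# builder = apply_conf(SparkSession.builder, {**S3_CONF, **GLUE_CONF})
# spark = builder.enableHiveSupport().getOrCreate()
```

With the factory class set this way, the catalog location comes from configuration rather than from a hive-site.xml exported out of Glue.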
When I go to https://cloud.google.com/dataproc, I see this ...
"Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks."
But gcloud dataproc jobs submit doesn't list all of them. It lists only 8 (hadoop, hive, pig, presto, pyspark, spark, spark-r, spark-sql). Any idea why?
~ gcloud dataproc jobs submit
ERROR: (gcloud.dataproc.jobs.submit) Command name argument expected.
Available commands for gcloud dataproc jobs submit:
    hadoop       Submit a Hadoop job to a cluster.
    hive         Submit a Hive job to a cluster.
    pig          Submit a Pig job to a cluster.
    presto       Submit a Presto job to a cluster.
    pyspark      Submit a PySpark job to a cluster.
    spark        Submit a Spark job to a cluster.
    spark-r      Submit a SparkR job to a cluster.
    spark-sql    Submit a Spark SQL job to a cluster.
For detailed information on this command and its flags, run:
gcloud dataproc jobs submit --help
Some OSS components are offered as Dataproc Optional Components. Not all of them have a job submit API: some (e.g., Anaconda, Jupyter) don't need one, and some (e.g., Flink, Druid) might get one in the future.
Some other OSS components are offered as libraries, e.g., GCS connector, BigQuery connector, Apache Parquet.
Hello fellow developers,
I have recently started learning about GCP and I am working on a POC that requires me to create a pipeline that is able to schedule Dataproc jobs written in PySpark.
Currently, I have created a Jupyter notebook on my Dataproc cluster that reads data from GCS and writes it to BigQuery. It works fine in Jupyter, but I want to use that notebook inside a pipeline.
On Azure we can schedule pipeline runs using Azure Data Factory. Which GCP tool would help me achieve similar results?
My goal is to schedule the run of multiple Dataproc jobs.
Yes, you can do that by creating a Dataproc workflow and scheduling it with Cloud Composer. See this doc for more details.
With Data Fusion, you won't be able to schedule Dataproc jobs written in PySpark. Data Fusion is a code-free tool for deploying ETL/ELT data pipelines. For your requirement, though, you can use Data Fusion to create and schedule a pipeline that pulls data from GCS and loads it into BigQuery directly.
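For the workflow route mentioned above, here is a rough sketch of a workflow template body that runs several PySpark scripts one after another. It assumes the notebook logic has been exported to .py files in GCS. The function name, template id, cluster name and URIs are hypothetical. The field names follow the snake_case form used by the google-cloud-dataproc Python client, and the empty cluster `config` is a placeholder you would fill in.

```python
def build_workflow_template(template_id, cluster_name, pyspark_uris):
    """Build a workflow template dict that runs the scripts sequentially.

    Each job after the first lists the previous step as a prerequisite,
    so the jobs run in the order given.
    """
    jobs = []
    previous_step = None
    for index, uri in enumerate(pyspark_uris):
        step_id = f"step-{index}"
        job = {
            "step_id": step_id,
            "pyspark_job": {"main_python_file_uri": uri},
        }
        if previous_step is not None:
            job["prerequisite_step_ids"] = [previous_step]
        jobs.append(job)
        previous_step = step_id

    return {
        "id": template_id,
        "placement": {
            # Ephemeral cluster created for the workflow; config left empty
            # here as a placeholder (machine types, image version, etc.).
            "managed_cluster": {"cluster_name": cluster_name, "config": {}}
        },
        "jobs": jobs,
    }


# Hypothetical usage:
# template = build_workflow_template(
#     "nightly-etl", "etl-cluster",
#     ["gs://my-bucket/jobs/load.py", "gs://my-bucket/jobs/transform.py"],
# )
```

A dict like this could then be handed to the Dataproc client's inline workflow instantiation call, or to the corresponding Dataproc workflow operator in a Cloud Composer DAG, which is where the schedule itself would live.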
I am going to run a Spark job for sentiment analysis on Google Cloud Platform, and I have decided to use Dataproc. Is Dataproc worth using for this, or are there other suggestions? I need to perform sentiment analysis on a huge dataset from Twitter, which is why I chose Google Cloud Platform as my big data and distributed environment.
GCP Dataproc is definitely a great choice for your use case. Dataproc natively supports Spark and recently added support for Spark 3.
Please check which Dataproc image is right for your use case.
The following resources could be helpful when configuring and running a Spark job on a cluster:
Creating and configuring a cluster
Submitting a job
Tutorial for running a Spark Scala job
Some more resources from the community: Spark job, PySpark job
I would like to migrate my Hive tables to an existing Dataproc cluster. Is there any way to deploy the tables using Google Deployment Manager? I have checked the list of supported resource types in GDM and could not find a Hive resource type. However, there is a dataproc.v1.cluster type available, and I was able to deploy a Dataproc cluster with it successfully. Now, is there any way to deploy my Hive DDLs within the cluster?
For customizing your cluster, we recommend initialization actions.
If your Hive tables are all set up as EXTERNAL with data in GCS or BigQuery, you could benefit from setting up a metastore on Cloud SQL.
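As a complementary option for an existing cluster, the DDL files could also be run as Hive jobs through gcloud. The sketch below only builds the command lines. It assumes the DDL scripts have been uploaded to GCS. The function name, cluster name, region and URIs are hypothetical.

```python
def hive_ddl_commands(cluster, region, ddl_uris):
    """Return one `gcloud dataproc jobs submit hive` argv list per DDL file."""
    return [
        [
            "gcloud", "dataproc", "jobs", "submit", "hive",
            f"--cluster={cluster}",
            f"--region={region}",
            # --file points at a script containing the Hive statements.
            f"--file={uri}",
        ]
        for uri in ddl_uris
    ]


# Hypothetical usage (would actually submit the jobs):
# import subprocess
# for cmd in hive_ddl_commands("my-cluster", "us-central1",
#                              ["gs://my-bucket/ddl/tables.hql"]):
#     subprocess.run(cmd, check=True)
```

Keeping the commands as lists rather than one shell string avoids quoting problems if a URI contains unusual characters.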