I have Spark running on EMR and have been trying to connect to Spark SQL from SQLWorkbench using the JDBC Hive drivers, without success. I have started the Thrift server on EMR, and I'm able to connect to Hive on port 10000 (the default) from Tableau/SQL Workbench. When I run a query, it fires a Tez/Hive job. However, I want to run the query using Spark. From within the EMR box, I'm able to connect to Spark SQL using Beeline and run a query as a Spark job. The Resource Manager shows the Beeline query running as a Spark job, while the query run through SQLWorkbench runs as a Hive/Tez job.
When I checked the logs, I found that the Thrift server for connecting to Spark was running on port 10001 (the default).
When I fire up Beeline, entries show up for the connection and the SQL I'm running. However, when the same connection parameters are used to connect from SQLWorkbench/Tableau, there is an exception without much detail; it just says the connection ended.
I tried running on a custom port by passing the parameters; Beeline works, but the JDBC connection does not.
Any help resolving this issue?
I was able to resolve the issue. The reason I was not able to connect to Spark SQL from Tableau was that we were bringing up the Thrift service as root. I'm not sure why it matters, but I had to change the permissions on the log folder to the current user (not root) and bring up the Thrift service again, which let me connect without any issues.
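For anyone hitting the same thing, a quick way to sanity-check the Spark Thrift Server connection outside of Tableau/SQLWorkbench is a small PyHive client. This is just a sketch; the host name, port 10001 (as in my setup), and the username are assumptions about your environment:

    # pip install 'pyhive[hive]'
    from pyhive import hive

    # The Spark Thrift Server speaks the HiveServer2 protocol, so PyHive can talk to it.
    conn = hive.Connection(host='<emr-master-dns>',  # placeholder for the EMR master node
                           port=10001,               # Spark Thrift Server port in my setup
                           username='hadoop')        # assumed user
    cur = conn.cursor()
    cur.execute('SHOW TABLES')  # should show up as a Spark job in the Resource Manager
    print(cur.fetchall())
    cur.close()
    conn.close()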
JanusGraph is deployed and running in a GCP container, and I can access it using Cloud Shell.
I want to perform some CRUD operations using a Python runtime.
What connection URL and port do I have to specify to get a proper result?
Docs used to create the GCP environment - https://cloud.google.com/architecture/running-janusgraph-with-bigtable#overview
Docs used to connect gremlin to Python - https://tinkerpop.apache.org/docs/current/reference/#connecting-via-drivers
But I'm unable to hit the server. Has anyone tried to establish this type of connection?
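A minimal sketch of what I'm attempting with gremlin-python, assuming the JanusGraph/Gremlin Server is reachable from the client and listening on the default Gremlin Server port 8182 (the host below is a placeholder):

    # pip install gremlinpython
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal

    # Gremlin Server (embedded in JanusGraph) listens on port 8182 by default.
    conn = DriverRemoteConnection('ws://<janusgraph-host>:8182/gremlin', 'g')
    g = traversal().withRemote(conn)

    # Simple CRUD round trip: add a vertex, read it back, then drop it.
    g.addV('person').property('name', 'alice').iterate()
    print(g.V().has('person', 'name', 'alice').valueMap(True).toList())
    g.V().has('person', 'name', 'alice').drop().iterate()

    conn.close()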
We are trying to create a Dataproc cluster with a newer image. We have already upgraded our Cloud SQL-backed Hive metastore to 3.1.0. But while creating the Dataproc cluster with a Dataproc initialization action (where we install the Cloud SQL proxy), it fails with the error "Failed to bring up Cloud SQL Metastore".
Below is a snippet of the log trace:
Synchronizing state of mysql.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable mysql
Removed /etc/systemd/system/multi-user.target.wants/mysql.service.
Created symlink /etc/systemd/system/multi-user.target.wants/cloud-sql-proxy.service → /lib/systemd/system/cloud-sql-proxy.service.
About to run 'nc -zv localhost 3306' with retries...
nc: connect to localhost port 3306 (tcp) failed: Connection refused
Connection to localhost 3306 port [tcp/mysql] succeeded!
Cloud SQL Proxy installation succeeded
About to run 'run_validation' with retries...
Metastore connection URL: jdbc:mysql://localhost:3306/hive_metastore
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: hive
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
Hive distribution version: 3.1.0
Metastore schema version: 3.1.0
schemaTool completed
[2022-09-29T12:46:49+0000]: Failed to bring up Cloud SQL Metastore
As you can see, the schema versions match, but I am still getting the error.
As part of the backup, I cloned the previous Cloud SQL instance, upgraded the clone to 3.1.0, and am using that cloned instance as the metastore in the new Dataproc cluster.
Can anybody help me understand this problem and how to solve it?
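For context, here is roughly the shape of the cluster creation, as a minimal sketch with the google-cloud-dataproc Python client (not our exact setup; the project, region, image version, init-action path, and Cloud SQL instance name are placeholders):

    # pip install google-cloud-dataproc
    from google.cloud import dataproc_v1

    project_id = 'my-project'   # placeholder
    region = 'us-central1'      # placeholder

    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'}
    )

    cluster = {
        'project_id': project_id,
        'cluster_name': 'hive-cluster',
        'config': {
            'software_config': {'image_version': '2.0-debian10'},  # the newer image
            'gce_cluster_config': {
                'metadata': {
                    # Cloned/upgraded Cloud SQL instance used as the metastore
                    'hive-metastore-instance': f'{project_id}:{region}:hive-metastore-clone',
                }
            },
            'initialization_actions': [{
                'executable_file': (
                    'gs://goog-dataproc-initialization-actions-us-central1/'
                    'cloud-sql-proxy/cloud-sql-proxy.sh'
                )
            }],
        },
    }

    operation = cluster_client.create_cluster(
        request={'project_id': project_id, 'region': region, 'cluster': cluster}
    )
    print(operation.result())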
I am doing a POC on Google Cloud Dataproc with HBase as one of the components.
I created a cluster and was able to get it running along with the HBase service. I can list and create tables via the shell.
I want to use Apache Phoenix as the client to query HBase. I installed it on the cluster by referring to this link.
The installation went fine, but when I execute sqlline.py localhost, which should create the meta table in HBase, it fails with a "Region in Transition" error.
Does anyone know how to resolve this, or is there a limitation that Apache Phoenix cannot be used with Dataproc?
There is no limitation on Dataproc that prevents using Apache Phoenix. You might want to dig deeper into the error message; it might be a configuration issue.
Is it possible to submit Presto jobs/steps to an EMR cluster in any way, just like you can submit Hive jobs/steps via a script in S3?
I would rather not SSH to the instance to execute the commands, but do it automatically.
You can use any JDBC/ODBC client to connect to the Presto cluster. If you want to connect programmatically, you can do so with the available drivers:
https://prestodb.github.io/docs/current/installation/jdbc.html
https://teradata.github.io/presto/docs/0.167-t/installation/odbc.html
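If a small programmatic check is enough, a Python sketch along these lines should work against the cluster (the host name is a placeholder, and port 8889 is an assumption about where EMR's Presto coordinator listens):

    # pip install 'pyhive[presto]'
    from pyhive import presto

    conn = presto.connect(host='<emr-master-dns>',  # placeholder for the EMR master node
                          port=8889,                # assumed Presto coordinator port on EMR
                          catalog='hive',
                          schema='default')
    cur = conn.cursor()
    cur.execute('SELECT * FROM some_table LIMIT 10')  # some_table is a placeholder
    print(cur.fetchall())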
If you have a BI tool like Tableau, QlikView, or Superset, it can easily be done that way as well,
e.g. https://onlinehelp.tableau.com/current/pro/desktop/en-us/examples_presto.htm
When running Spark SQL jobs in local mode, my Spark UI has a SQL tab.
When running the same job on AWS EMR, the Spark UI's SQL tab is no longer there.
I've SSH tunneled and set up the FoxyProxy settings, and I can view the various EMR UIs in the browser, including the Spark UI.
Is there a code reason why this would not be there in Amazon's EMR version of the Spark UI?
There are two different UIs for Spark on EMR: (1) the Spark UI and (2) the Spark History Server.
They look the same, but they are served from different URLs, and one of them (the History Server) continues to be served even after an application completes, as long as the EMR cluster is still up and running.
The History Server, as far as I know, does not have a SQL tab.