AWS EMR Zeppelin is missing MySQL interpreter - amazon-web-services

I launched a fresh AWS EMR Spark cluster with Zeppelin to query a MySQL database. When I tried to add a MySQL interpreter in Zeppelin, the option did not exist. I googled for a way to get the interpreter to show up, but I didn't find a solution. How can I get a MySQL interpreter in Zeppelin so I can query the MySQL database?

Spark SQL supports many features of SQL:2003 and SQL:2011 [1][2], so you may consider querying MySQL through Spark on Zeppelin by adding a dependency:
Get a MySQL connector (Connector/J) JAR with the proper version.
Add it as a dependency to the Spark interpreter in Zeppelin. (I put the JAR on the master node.)
You should then be able to access MySQL tables. The following is an example using the Scala API:
/* Database Configuration*/
val jdbcURL = s"jdbc:mysql://${HOST}/${DATABASE}"
val jdbcUsername = s"${USERNAME}"
val jdbcPassword = s"${PASSWORD}"
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", jdbcUsername)
connectionProperties.put("password", jdbcPassword)
connectionProperties.put("driver", "com.mysql.cj.jdbc.Driver")
/* Read Data from MySQL */
val desiredData = spark.read.jdbc(jdbcURL, s"${TABLE_NAME}", connectionProperties)
desiredData.printSchema
/* Data Manipulation */
desiredData.createOrReplaceTempView("desiredData")
val query = s"""
SELECT COUNT(*) AS `Record Number`
FROM desiredData
"""
spark.sql(query).show
val query2 = s"""
SELECT ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY column1, column2) AS column3
FROM desiredData
"""
spark.sql(query2).show
Testing Notes:
EMR: emr-5.10.0 with Pig 0.17.0, Zeppelin 0.7.3, and Spark 2.2.0
MySQL: MariaDB 5.2.10
References
Apache Hive (n.d.). Home. [online] Cwiki.apache.org. Available at: https://cwiki.apache.org/confluence/display/Hive/Home [Accessed 1 Dec. 2017].
Apache Spark (n.d.). Compatibility with Apache Hive. [online] spark.apache.org. Available at: https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive [Accessed 1 Dec. 2017].

Related

Apache Iceberg tables not working with AWS Glue in AWS EMR

I'm trying to load a table, stored in S3 in Apache Iceberg format, from the Glue Catalog in a Spark EMR cluster. The table is correctly created, because I can query it from AWS Athena. On cluster creation I set this configuration:
[{"classification":"iceberg-defaults","properties":{"iceberg.enabled":"true"}}]
I have tried running SQL queries from Spark against tables in other formats (CSV) and it works, but when I try to read Iceberg tables I get this error:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table table_name. StorageDescriptor#InputFormat cannot be null for table: table_name(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
This is the code in the notebook:
%%configure -f
{
"conf":{
"spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"spark.sql.catalog.dev":"org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.dev.type":"hadoop",
"spark.sql.catalog.dev.warehouse":"s3://pyramid-streetfiles-sbx/iceberg_test/"
}
}
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
spark = SparkSession.builder.getOrCreate()
# This query works and shows the Iceberg table I want to read
spark.sql("show tables from iceberg_test").show(truncate=False)
# This line raises the error
spark.sql("select * from iceberg_test.table_name limit 10").show(truncate=False)
How can I read Apache Iceberg tables in an EMR cluster with Spark and the Glue Catalog?
You need to qualify the table with the Glue catalog name.
Example: glue_catalog.<your_database_name>.<your_table_name>
https://docs.aws.amazon.com/pt_br/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
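For illustration only, a minimal PySpark sketch of that pattern is below. The catalog properties follow the AWS Glue/Iceberg documentation linked above; the catalog name glue_catalog is arbitrary, and the warehouse bucket, database, and table names are placeholders standing in for the ones from the question:
from pyspark.sql import SparkSession

# Sketch: register an Iceberg catalog named "glue_catalog" backed by the AWS Glue Data Catalog.
# The warehouse bucket, database, and table names below are placeholders.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your-bucket/iceberg_test/")
    .getOrCreate()
)

# Query the table through the catalog-qualified name.
spark.sql("SELECT * FROM glue_catalog.iceberg_test.table_name LIMIT 10").show(truncate=False)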

Cannot connect to Cloud SQL using Apache Beam JDBC

I am trying to connect to Cloud SQL using the Python SDK's io.jdbc module, more specifically the ReadFromJdbc class, which is documented here: https://beam.apache.org/releases/pydoc/current/apache_beam.io.jdbc.html
Based on that and the info on connecting to Cloud SQL for MySQL using JDBC here: https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory/blob/main/docs/jdbc-mysql.md I wrote the following code:
import os

import apache_beam as beam
import apache_beam.io.jdbc as jdbc
import typing
import apache_beam.coders as coders
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = {
    'project': 'project-name',
    'runner': 'DataflowRunner',
    'region': 'europe-central2',
    'staging_location': "gs://temp",
    'temp_location': "gs://temp",
    'template_location': "gs://templates/temp_name"
}
pipeline_options = PipelineOptions.from_dictionary(pipeline_options)

serviceAccount = r'path\to\serviceaccount.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = serviceAccount

ExampleRow = typing.NamedTuple('ExampleRow', [('id', int), ('migration', str)])
coders.registry.register_coder(ExampleRow, coders.RowCoder)

with beam.Pipeline(options=pipeline_options) as p:
    res = (
        p
        | "Read database list" >> jdbc.ReadFromJdbc(
            table_name='table',
            driver_class_name='com.mysql.jdbc.Driver',
            jdbc_url='jdbc:mysql:///<DATABASE_NAME>?cloudSqlInstance=<INSTANCE_CONNECTION_NAME>&socketFactory=com.google.cloud.sql.mysql.SocketFactory&user=<MYSQL_USER_NAME>&password=<MYSQL_USER_PASSWORD>',
            username='user',
            password='pass',
            query="select id, migration from db.table;",
            fetch_size=1,
            classpath=["com.google.cloud.sql:mysql-socket-factory-connector-j-8:1.7.2"],
            expansion_service='host:6666'
        )
        | "Print results" >> beam.io.WriteToText(r'gs://output/out.csv')
    )
For the expansion service I have set up a WSL2 Python environment as documented here: https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#advanced-start-an-expansion-service
Unfortunately, I get this error:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNAVAILABLE: ipv4:127.0.0.1:6666: WSA Error"
debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNAVAILABLE: ipv4:127.0.0.1:6666: WSA Error {grpc_status:14, created_time:"2022-12-08T15:43:05.445755053+00:00"}"
I tried switching expansion_service to the specific IP that I got from wsl hostname -I, but it produced the same result, even though the address is reachable (tested with ping and by hosting a web server on it).
Am I doing something completely wrong? I find it hard to believe that it's so hard to connect to Cloud SQL, so I must be...
Transforms under the apache_beam.io.jdbc module are cross-language transforms implemented in the Beam Java SDK. Hence, during pipeline construction, the Python SDK connects to a Java expansion service to expand these transforms, whereas you followed the instructions for starting a Python expansion service.
I think the easiest thing to do will be to use the default expansion service.
First, install a Java runtime on the computer from which the pipeline is constructed and make sure the java command is available.
Then use the following transform to read from Cloud SQL (no expansion_service is specified, so the default one is used):
p | "Read database list" >> jdbc.ReadFromJdbc(
table_name='table',
driver_class_name='com.mysql.jdbc.Driver',
jdbc_url='jdbc:mysql:///<DATABASE_NAME>?cloudSqlInstance=<INSTANCE_CONNECTION_NAME>&socketFactory=com.google.cloud.sql.mysql.SocketFactory&user=<MYSQL_USER_NAME>&password=<MYSQL_USER_PASSWORD>',
username='user',
password='pass',
query = "select id, migration from db.table;",
fetch_size=1,
classpath=["com.google.cloud.sql:mysql-socket-factory-connector-j-8:1.7.2"]
)

InvalidQueryException: Consistency level LOCAL_ONE is not supported for this operation. Supported consistency levels are: LOCAL_QUORUM

import org.apache.spark._
import org.apache.spark.SparkContext._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("XXXX")
.set("spark.cassandra.connection.host" ,"cassandra.us-east-2.amazonaws.com")
.set("spark.cassandra.connection.port", "9142")
.set("spark.cassandra.auth.username", "XXXXX")
.set("spark.cassandra.auth.password", "XXXXX")
.set("spark.cassandra.connection.ssl.enabled", "true")
.set("spark.cassandra.connection.ssl.trustStore.path", "/home/nihad/.cassandra/cassandra_truststore.jks")
.set("spark.cassandra.connection.ssl.trustStore.password", "XXXXX")
.set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
val connector = CassandraConnector(conf)
val session = connector.openSession()
sesssion.execute("""INSERT INTO "covid19".delta_by_states (state_code, state_value, date ) VALUES ('kl', 5, '2020-03-03');""")
session.close()
I am trying to write data to an AWS Cassandra keyspace (Amazon Keyspaces) using a Spark app set up on my local system.
The problem is that when I execute the code above, I get an exception like the following:
"com.datastax.oss.driver.api.core.servererrors.InvalidQueryException:
Consistency level LOCAL_ONE is not supported for this operation.
Supported consistency levels are: LOCAL_QUORUM"
As you can see from the code above, I have already set spark.cassandra.output.consistency.level to LOCAL_QUORUM in the Spark conf, and I am using the DataStax Cassandra driver.
Reading data from AWS Cassandra works fine. I also tried the same INSERT statement in the AWS Keyspaces cqlsh, and it works fine there too, so the query is valid.
Can someone help me set the consistency level via the DataStax CassandraConnector?
Cracked it.
Instead of setting the Cassandra consistency level via the Spark config, I created an application.conf file in the src/main/resources directory:
datastax-java-driver {
basic.contact-points = [ "cassandra.us-east-2.amazonaws.com:9142"]
advanced.auth-provider{
class = PlainTextAuthProvider
username = "serviceUserName"
password = "servicePassword"
}
basic.load-balancing-policy {
local-datacenter = "us-east-2"
}
advanced.ssl-engine-factory {
class = DefaultSslEngineFactory
truststore-path = "yourPath/.cassandra/cassandra_truststore.jks"
truststore-password = "trustorePassword"
}
basic.request.consistency = LOCAL_QUORUM
basic.request.timeout = 5 seconds
}
and created the Cassandra session like below:
import com.datastax.oss.driver.api.core.config.DriverConfigLoader
import com.datastax.oss.driver.api.core.CqlSession
val loader = DriverConfigLoader.fromClassPath("application.conf")
val session = CqlSession.builder().withConfigLoader(loader).build()
sesssion.execute("""INSERT INTO "covid19".delta_by_states (state_code, state_value, date ) VALUES ('kl', 5, '2020-03-03');""")
It finally worked. No need to mess with the Spark config.
Driver config loader doc: https://docs.datastax.com/en/drivers/java/4.0/com/datastax/oss/driver/api/core/config/DriverConfigLoader.html#fromClasspath-java.lang.String-
DataStax configuration reference: https://docs.datastax.com/en/developer/java-driver/4.6/manual/core/configuration/reference/

Connect Python to H2

I'm trying to make a connection from Python 2.7 to H2 (h2-1.4.193.jar, the latest version).
H2 is running and available: java -Dh2.bindAddress=127.0.0.1 -cp "E:\Dir\h2-1.4.193.jar;%H2DRIVERS%;%CLASSPATH%" org.h2.tools.Server -tcpPort 15081 -baseDir E:\Dir\db
For python I'm using jaydebeapi:
import jaydebeapi
conn = jaydebeapi.connect('org.h2.Driver', ['jdbc:h2:tcp://localhost:15081/db/test', 'sa', ''], 'E:\Path\to\h2-1.4.193.jar')
curs = conn.cursor()
curs.execute('create table PERSON ("PERSON_ID" INTEGER not null, "NAME" VARCHAR not null, primary key ("PERSON_ID"))')
curs.execute("insert into PERSON values (1, 'John')")
curs.execute("select * from PERSON")
data = curs.fetchall()
print(data)
As a result, every time I get an error: Process finished with exit code -1073741819 (0xC0000005)
Do you have any ideas about this case? Or maybe there is something else I can use instead of jaydebeapi?
Answering my own question:
First of all, I could not get anything working through jaydebeapi.
I read that H2 supports the PostgreSQL network protocol, so my next step was to switch both H2 and Python over to the PG protocol:
H2 pg:
java -Dh2.bindAddress=127.0.0.1 -cp h2.jar;postgresql-9.4.1212.jre6.jar org.h2.tools.Server -baseDir E:\Dir\h2\db
TCP server running at tcp://localhost:9092 (only local connections)
PG server running at pg://localhost:5435 (only local connections)
Web Console server running at http://localhost:8082 (only local connections)
The postgresql JAR was included to try connecting from the Web Console as well.
Python: psycopg2 instead of jaydebeapi:
import psycopg2
conn = psycopg2.connect("dbname=h2pg user=sa password='sa' host=localhost port=5435")
cur = conn.cursor()
cur.execute('create table PERSON ("PERSON_ID" INTEGER not null, "NAME" VARCHAR not null, primary key ("PERSON_ID"))')
As a result, it's working now: the connection was established and the table was created.
Web Console settings:
Generic PostgreSQL
org.postgresql.Driver
jdbc:postgresql://localhost:5435/h2pg
name: sa, pass: sa
The Web Console did connect but did not show the table list and showed many errors instead ("CURRENT_SCHEMAS" is not found, etc.). pgAdmin 4 was also not able to connect. SQuirreL to the rescue: it connected to this DB and everything works fine there.
Perhaps a bit late for an update after 1.5 years, but the current version connects fine with H2, without having to use a Postgres driver.
conn = jaydebeapi.connect("org.h2.Driver", "jdbc:h2:~/test", ["sa", ""], "/Users/angelo/websites/GEPR/h2/bin/h2-1.4.197.jar",)
source: https://pypi.org/project/JayDeBeApi/#usage

How to connect to Amazon Redshift or other DBs in Apache Spark?

I'm trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our Redshift cluster. I found some very spartan documentation here on connecting to other databases via JDBC:
https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases
The load command seems fairly straightforward (although I don't know how I would enter AWS credentials here, maybe in the options?).
df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")
And I'm not entirely sure how to deal with the SPARK_CLASSPATH variable. I'm running Spark locally for now through an IPython notebook (as part of the Spark distribution). Where do I define that variable so that Spark loads it?
Anyway, for now, when I try running these commands, I get a bunch of undecipherable errors, so I'm kind of stuck for now. Any help or pointers to detailed tutorials are appreciated.
Although this seems to be a very old post, for anyone still looking for an answer, the steps below worked for me!
Start the shell including the jar.
bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar
Create a DataFrame by giving the appropriate details:
myDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:redshift://host:port/db_name") \
.option("dbtable", "table_name") \
.option("user", "user_name") \
.option("password", "password") \
.load()
Spark Version: 2.2
It turns out you only need a username/password to access Redshift in Spark, and it is done as follows (using the Python API):
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.load(source="jdbc",
url="jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret",
dbtable="schema.table"
)
Hope this helps someone!
If you're using Spark 1.4.0 or newer, check out spark-redshift, a library which supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you're querying large volumes of data, this approach should perform better than JDBC because it will be able to unload and query the data in parallel.
If you still want to use JDBC, check out the new built-in JDBC data source in Spark 1.4+.
Disclosure: I'm one of the authors of spark-redshift.
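For reference, a minimal PySpark read through spark-redshift might look like the sketch below; the format name and options follow the library's README, while the Redshift endpoint, credentials, table, and S3 tempdir are placeholders (the library stages data through S3, so a tempdir and AWS credentials for that bucket are required):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch only: the endpoint, credentials, table, and tempdir below are placeholders.
df = (
    spark.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/dbname?user=USERNAME&password=PASSWORD")
    .option("dbtable", "schema.table")
    .option("tempdir", "s3n://your-bucket/tmp/")  # spark-redshift unloads/stages data here
    .load()
)

df.show()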
You first need to download the Postgres JDBC driver. You can find it here: https://jdbc.postgresql.org/
You can either define the SPARK_CLASSPATH environment variable in .bashrc, conf/spark-env.sh, or a similar file, or specify it in the script before you run your IPython notebook.
You can also define it in your conf/spark-defaults.conf in the following way:
spark.driver.extraClassPath /path/to/file/postgresql-9.4-1201.jdbc41.jar
Make sure it is reflected in the Environment tab of your Spark WebUI.
You will also need to set appropriate AWS credentials in the following way:
sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "***")
The simplest way to make a JDBC connection to Redshift using Python is as follows:
# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession
jdbc_url = "jdbc:redshift://xxx.xxx.redshift.amazonaws.com:5439/xxx"
jdbc_user = "xxx"
jdbc_password = "xxx"
jdbc_driver = "com.databricks.spark.redshift"
spark = SparkSession.builder.master("yarn") \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.enableHiveSupport().getOrCreate()
# Read data from a query
df = spark.read \
.format(jdbc_driver) \
.option("url", jdbc_url + "?user="+ jdbc_user +"&password="+ jdbc_password) \
.option("query", "your query") \
.load()
This worked for me in Scala in AWS Glue with Spark 2.4:
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
Job.init(args("JOB_NAME"), glueContext, args.asJava)
val sqlContext = new org.apache.spark.sql.SQLContext(spark)
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:postgresql://HOST:PORT/DBNAME?user=USERNAME&password=PASSWORD",
"dbtable" -> "(SELECT a.row_name FROM schema_name.table_name a) as from_redshift")).load()
// back to DynamicFrame
val datasource0 = DynamicFrame(jdbcDF, glueContext)
Works with any SQL query.