I want to run a local DynamoDB Spark job without using an EMR cluster,
one that reads data from a table and writes it to a Parquet / CSV file.
I didn't find any Spark-DynamoDB connector that supports this; maybe you have some ideas?
My code sample:
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.sql.SparkSession
object copyDynamoTable extends App {

  val spark = SparkSession
    .builder()
    .appName("test")
    .master("local")
    .getOrCreate()

  val jobConf = new JobConf(spark.sparkContext.hadoopConfiguration)
  jobConf.set("dynamodb.servicename", "dynamodb")
  jobConf.set("dynamodb.input.tableName", "hen.poc.client") // Pointing to DynamoDB table
  jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
  jobConf.set("dynamodb.regionid", "us-east-1")
  jobConf.set("dynamodb.throughput.read", "1")
  jobConf.set("dynamodb.throughput.read.percent", "1")
  jobConf.set("dynamodb.version", "2011-12-05")
  jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
  jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")

  val orders = spark.sparkContext.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])

  println(orders.count)
}
I received the following exception:
18/09/05 17:06:41 INFO util.TaskCalculator: Cluster has 1 active nodes.
18/09/05 17:06:41 WARN util.ClusterTopologyNodeCapacityProvider: Exception when trying to determine instance types
java.nio.file.NoSuchFileException: /mnt/var/lib/info/job-flow.json
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.Files.readAllBytes(Files.java:3152)
at org.apache.hadoop.dynamodb.util.ClusterTopologyNodeCapacityProvider.readJobFlowJsonString(ClusterTopologyNodeCapacityProvider.java:103)
at org.apache.hadoop.dynamodb.util.ClusterTopologyNodeCapacityProvider.getCoreNodeMemoryMB(ClusterTopologyNodeCapacityProvider.java:42)
at org.apache.hadoop.dynamodb.util.TaskCalculator.getMaxMapTasks(TaskCalculator.java:54)
at org.apache.hadoop.dynamodb.DynamoDBUtil.calcMaxMapTasks(DynamoDBUtil.java:265)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBInputFormat.getSplits(AbstractDynamoDBInputFormat.java:47)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD.count(RDD.scala:1162)
at com.data.spark.dynamodb.copyDynamoTable$.delayedEndpoint$com$riskified$data$spark$dynamodb$copyDynamoTable$1(copyDynamoTable.scala:30)
at com.data.spark.dynamodb.copyDynamoTable$delayedInit$body.apply(copyDynamoTable.scala:9)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.data.spark.dynamodb.copyDynamoTable$.main(copyDynamoTable.scala:9)
at com.data.spark.dynamodb.copyDynamoTable.main(copyDynamoTable.scala)
Exception in thread "main" java.lang.ArithmeticException: / by zero
This file is present on an EMR cluster; the connector reads it to determine the instance type it is running on, which drives some job settings such as memory. Obviously, running locally you won't have this file, so the failure is expected.
Please follow the thread below:
https://github.com/awslabs/emr-dynamodb-connector/issues/50
I'm trying to load a table on a Spark EMR cluster from the Glue catalog, in Apache Iceberg format, stored in S3. The table is correctly created because I can query it from AWS Athena. On cluster creation I set this configuration:
[{"classification":"iceberg-defaults","properties":{"iceberg.enabled":"true"}}]
I have tried running SQL queries from Spark against tables in other formats (CSV) and it works, but when I try to read Iceberg tables I get this error:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table table_name. StorageDescriptor#InputFormat cannot be null for table: table_name(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
This is the code in the notebook:
%%configure -f
{
    "conf": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.dev": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.dev.type": "hadoop",
        "spark.sql.catalog.dev.warehouse": "s3://pyramid-streetfiles-sbx/iceberg_test/"
    }
}
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
spark = SparkSession.builder.getOrCreate()
# This query works and shows the iceberg table I want to read
spark.sql("show tables from iceberg_test").show(truncate=False)
# This one shows the error
spark.sql("select * from iceberg_test.table_name limit 10").show(truncate=False)
How can I read Apache Iceberg tables on an EMR cluster with Spark and the Glue catalog?
You need to qualify the table with the Glue catalog name.
Example: glue_catalog.<your_database_name>.<your_table_name>
https://docs.aws.amazon.com/pt_br/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
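For example, a minimal sketch based on the AWS documentation linked above (the catalog name glue_catalog is arbitrary as long as it matches the prefix used in the query, and the warehouse path simply reuses the one from the question):

%%configure -f
{
    "conf": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.catalog.glue_catalog.warehouse": "s3://pyramid-streetfiles-sbx/iceberg_test/"
    }
}

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Qualify the table with the catalog name so Spark resolves it through the Glue catalog
spark.sql("select * from glue_catalog.iceberg_test.table_name limit 10").show(truncate=False)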
import org.apache.spark._
import org.apache.spark.SparkContext._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("XXXX")
  .set("spark.cassandra.connection.host", "cassandra.us-east-2.amazonaws.com")
  .set("spark.cassandra.connection.port", "9142")
  .set("spark.cassandra.auth.username", "XXXXX")
  .set("spark.cassandra.auth.password", "XXXXX")
  .set("spark.cassandra.connection.ssl.enabled", "true")
  .set("spark.cassandra.connection.ssl.trustStore.path", "/home/nihad/.cassandra/cassandra_truststore.jks")
  .set("spark.cassandra.connection.ssl.trustStore.password", "XXXXX")
  .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")

val connector = CassandraConnector(conf)
val session = connector.openSession()
session.execute("""INSERT INTO "covid19".delta_by_states (state_code, state_value, date ) VALUES ('kl', 5, '2020-03-03');""")
session.close()
I am trying to write data to an AWS Cassandra Keyspace using a Spark app set up on my local system.
The problem is that when I execute the above code, I get an exception like the one below:
"com.datastax.oss.driver.api.core.servererrors.InvalidQueryException:
Consistency level LOCAL_ONE is not supported for this operation.
Supported consistency levels are: LOCAL_QUORUM"
As you can see from the code above, I have already set spark.cassandra.output.consistency.level to LOCAL_QUORUM in the Spark conf, and I am using the DataStax Cassandra driver.
Reading data from AWS Cassandra works fine. I also tried the same INSERT command in the AWS Keyspaces cqlsh and it works there too, so the query is valid.
Can someone help me set the consistency level via the DataStax CassandraConnector?
Cracked it.
Instead of setting the Cassandra consistency level via the Spark config, I created an application.conf file in the src/main/resources directory:
datastax-java-driver {
    basic.contact-points = ["cassandra.us-east-2.amazonaws.com:9142"]

    advanced.auth-provider {
        class = PlainTextAuthProvider
        username = "serviceUserName"
        password = "servicePassword"
    }

    basic.load-balancing-policy {
        local-datacenter = "us-east-2"
    }

    advanced.ssl-engine-factory {
        class = DefaultSslEngineFactory
        truststore-path = "yourPath/.cassandra/cassandra_truststore.jks"
        truststore-password = "trustorePassword"
    }

    basic.request.consistency = LOCAL_QUORUM
    basic.request.timeout = 5 seconds
}
and created the Cassandra session like below:
import com.datastax.oss.driver.api.core.config.DriverConfigLoader
import com.datastax.oss.driver.api.core.CqlSession
val loader = DriverConfigLoader.fromClasspath("application.conf")
val session = CqlSession.builder().withConfigLoader(loader).build()
session.execute("""INSERT INTO "covid19".delta_by_states (state_code, state_value, date ) VALUES ('kl', 5, '2020-03-03');""")
It finally worked. No need to mess with the Spark config.
Driver config loader doc: https://docs.datastax.com/en/drivers/java/4.0/com/datastax/oss/driver/api/core/config/DriverConfigLoader.html#fromClasspath-java.lang.String-
DataStax configuration reference: https://docs.datastax.com/en/developer/java-driver/4.6/manual/core/configuration/reference/
I am new to **AWS Glue**, and my aim is to extract, transform and load files uploaded to an S3 bucket into an RDS instance. I also need to move the files into separate S3 buckets based on the Glue job status (success/failure). More than one file will be uploaded to the initial S3 bucket. How can I get the names of the uploaded files so that I can transfer those files to the appropriate buckets?
Step 1: Upload files to S3 bucket1.
Step 2: Trigger a Lambda function to call Job 1.
Step 3: On success of Job 1, transfer the file to S3 bucket2.
Step 4: On failure, transfer it to another S3 bucket.
Have a Lambda event trigger listening to the S3 folder you are uploading the files to. In the Lambda, use the AWS Glue API to run the Glue job (essentially a Python script in AWS Glue).
In the Glue Python script, use an appropriate library, such as pymysql, packaged with your script as an external library, and perform the data load operations from S3 into your RDS tables. If you are using Aurora MySQL, AWS provides a nice "load from S3" feature, so you can load the file directly into the tables (you may need to do some configuration in the parameter group / IAM roles).
Lambda script to call the Glue job (see the sketch after it for picking up the uploaded file names from the S3 event):
import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

def lambda_handler(event, context):
    gluejobname = "<YOUR GLUE JOB NAME>"
    try:
        runId = glue.start_job_run(JobName=gluejobname)
        status = glue.get_job_run(JobName=gluejobname, RunId=runId['JobRunId'])
        print("Job Status : ", status['JobRun']['JobRunState'])
    except Exception as e:
        print(e)
        raise e
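The question also asks how to get the names of the uploaded files. If the Lambda is triggered by the S3 upload event itself, the bucket and object key are available in the event payload and can be forwarded to the Glue job as job arguments. A minimal sketch (the --s3_bucket / --s3_key argument names are just examples, not anything Glue requires):

import urllib.parse
import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    # S3 put events carry the bucket name and object key of the uploaded file
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(record['s3']['object']['key'])

    # Forward them to the Glue job; read them inside the job via getResolvedOptions
    glue.start_job_run(
        JobName="<YOUR GLUE JOB NAME>",
        Arguments={'--s3_bucket': bucket, '--s3_key': key}
    )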
Glue Script:
import mysql.connector
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import *
from pyspark.sql.types import StringType
from pyspark.sql.types import DateType
from pyspark.sql.functions import unix_timestamp, from_unixtime
from pyspark.sql import SQLContext

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

url = "<RDS URL>"
uname = "<USER NAME>"
pwd = "<PASSWORD>"
dbase = "DBNAME"

def connect():
    conn = mysql.connector.connect(host=url, user=uname, password=pwd, database=dbase)
    cur = conn.cursor()
    return cur, conn

def create_stg_table():
    cur, conn = connect()
    createStgTable1 = <CREATE STAGING TABLE IF REQUIRED>
    loadQry = "LOAD DATA FROM S3 PREFIX 'S3://PATH FOR YOUR CSV' REPLACE INTO TABLE <DB.TABLENAME> FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' IGNORE 1 LINES (@var1, @var2, @var3, @var4, @var5, @var6, @var7, @var8) SET ......;"
    cur.execute(createStgTable1)
    cur.execute(loadQry)
    conn.commit()
    conn.close()
You can then create a CloudWatch alert that checks the Glue job status and, depending on the status, performs the file copy operations between the S3 buckets. We have a similar setup in production; a rough sketch is below.
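A minimal sketch of that status-based copy, assuming an EventBridge/CloudWatch rule on the "Glue Job State Change" event invokes a Lambda, and that the bucket names below are placeholders for your own:

import boto3

s3 = boto3.client('s3')

SOURCE_BUCKET = 'bucket1'          # where the files were originally uploaded
SUCCESS_BUCKET = 'bucket2'         # destination on job success
FAILURE_BUCKET = 'bucket-failed'   # destination on job failure

def lambda_handler(event, context):
    # "Glue Job State Change" events carry the job name and final state in the detail
    state = event['detail']['state']
    target = SUCCESS_BUCKET if state == 'SUCCEEDED' else FAILURE_BUCKET

    # Copy every object from the source bucket to the chosen destination
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=SOURCE_BUCKET):
        for obj in page.get('Contents', []):
            s3.copy_object(
                Bucket=target,
                Key=obj['Key'],
                CopySource={'Bucket': SOURCE_BUCKET, 'Key': obj['Key']}
            )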
Regards
Yuva
We started using Dataflow to read from Pub/Sub and stream to BigQuery.
Dataflow should run 24/7, because Pub/Sub is constantly updated with analytics data from multiple websites around the world.
Code looks like this:
from __future__ import absolute_import
import argparse
import json
import logging
import apache_beam as beam
from apache_beam.io import ReadFromPubSub, WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
logger = logging.getLogger()
TABLE_IDS = {
    'table_1': 0,
    'table_2': 1,
    'table_3': 2,
    'table_4': 3,
    'table_5': 4,
    'table_6': 5,
    'table_7': 6,
    'table_8': 7,
    'table_9': 8,
    'table_10': 9,
    'table_11': 10,
    'table_12': 11,
    'table_13': 12
}

def separate_by_table(element, num):
    return TABLE_IDS[element.get('meta_type')]

class ExtractingDoFn(beam.DoFn):
    def process(self, element):
        yield json.loads(element)

def run(argv=None):
    """Main entry point; defines and runs the pipeline."""
    logger.info('STARTED!')
    parser = argparse.ArgumentParser()
    parser.add_argument('--topic',
                        dest='topic',
                        default='projects/PROJECT_NAME/topics/TOPICNAME',
                        help='Cloud Pub/Sub topic in the form "projects/<project>/topics/<topic>"')
    parser.add_argument('--table',
                        dest='table',
                        default='PROJECTNAME:DATASET_NAME.event_%s',
                        help='BigQuery table template in the form "PROJECT:DATASET.TABLE"')
    known_args, pipeline_args = parser.parse_known_args(argv)

    # We use the save_main_session option because one or more DoFn's in this
    # workflow rely on global context (e.g., a module imported at module level).
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    p = beam.Pipeline(options=pipeline_options)

    lines = p | ReadFromPubSub(known_args.topic)
    datas = lines | beam.ParDo(ExtractingDoFn())
    by_table = datas | beam.Partition(separate_by_table, 13)

    # Create a stream for each table
    for table, id in TABLE_IDS.items():
        by_table[id] | 'write to %s' % table >> WriteToBigQuery(known_args.table % table)

    result = p.run()
    result.wait_until_finish()

if __name__ == '__main__':
    logger.setLevel(logging.INFO)
    run()
It works fine but after some time (2-3 days) it stops streaming for some reason.
When I check the job status, it contains no errors in the logs section (you know, the ones marked with a red "!" in Dataflow's job details). If I cancel the job and run it again, it starts working again as usual.
If I check Stackdriver for additional logs, here are all the errors that happened:
Here are some warnings that occur periodically while the job executes:
Details of one of them:
{
insertId: "397122810208336921:865794:0:479132535"
jsonPayload: {
exception: "java.lang.IllegalStateException: Cannot be called on unstarted operation.
at com.google.cloud.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.getElementsSent(RemoteGrpcPortWriteOperation.java:111)
at com.google.cloud.dataflow.worker.fn.control.BeamFnMapTaskExecutor$SingularProcessBundleProgressTracker.updateProgress(BeamFnMapTaskExecutor.java:293)
at com.google.cloud.dataflow.worker.fn.control.BeamFnMapTaskExecutor$SingularProcessBundleProgressTracker.periodicProgressUpdate(BeamFnMapTaskExecutor.java:280)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"
job: "2018-11-30_10_35_19-13557985235326353911"
logger: "com.google.cloud.dataflow.worker.fn.control.BeamFnMapTaskExecutor"
message: "Progress updating failed 4 times. Following exception safely handled."
stage: "S0"
thread: "62"
work: "c-8756541438010208464"
worker: "beamapp-vitar-1130183512--11301035-mdna-harness-lft7"
}
labels: {
compute.googleapis.com/resource_id: "397122810208336921"
compute.googleapis.com/resource_name: "beamapp-vitar-1130183512--11301035-mdna-harness-lft7"
compute.googleapis.com/resource_type: "instance"
dataflow.googleapis.com/job_id: "2018-11-30_10_35_19-13557985235326353911"
dataflow.googleapis.com/job_name: "beamapp-vitar-1130183512-742054"
dataflow.googleapis.com/region: "europe-west1"
}
logName: "projects/PROJECTNAME/logs/dataflow.googleapis.com%2Fharness"
receiveTimestamp: "2018-12-03T20:33:00.444208704Z"
resource: {
labels: {
job_id: "2018-11-30_10_35_19-13557985235326353911"
job_name: "beamapp-vitar-1130183512-742054"
project_id: PROJECTNAME
region: "europe-west1"
step_id: ""
}
type: "dataflow_step"
}
severity: "WARNING"
timestamp: "2018-12-03T20:32:59.442Z"
}
Here's the moment when it seems to start having problems:
Additional info messages that may help:
According to these messages, we don't run out of memory/processing power etc. The job is run with these parameters:
python -m start --streaming True --runner DataflowRunner --project PROJECTNAME --temp_location gs://BUCKETNAME/tmp/ --region europe-west1 --disk_size_gb 30 --machine_type n1-standard-1 --use_public_ips false --num_workers 1 --max_num_workers 1 --autoscaling_algorithm NONE
What could be the problem here?
This isn't really an answer, more a help in identifying the cause: so far, all streaming Dataflow jobs I've launched using the Python SDK have stopped that way after some days, whether they use BigQuery as a sink or not. So the cause seems to be the general fact that streaming jobs with the Python SDK are still in beta.
My personal solution: use the Dataflow templates to stream from Pub/Sub to BigQuery (thus avoiding the Python SDK), then schedule queries in BigQuery to periodically process the data. Unfortunately that might not be appropriate for your use case.
In my company we are experiencing the exact same problem described by the OP, with a similar use case.
Unfortunately the problem is real and concrete, and it apparently occurs at random.
As a workaround, we are considering rewriting our pipeline using the java SDK.
I had a similar issue and found that the warning logs contained a Python stack trace hidden inside the Java logs that pointed to the actual errors.
These errors were continually retried by the workers, causing them to crash and completely freeze the pipeline. I initially thought the number of workers was too low, so I scaled it up, but the pipeline just took longer to freeze.
I ran the pipeline locally, exported the Pub/Sub messages as text, and found that they contained dirty data (messages that did not match the BQ table schema); since I had no exception handling, that seemed to be what froze the pipeline.
Adding a function that only accepts records whose first key matches the expected column of your BQ schema fixed my issue, and the Dataflow job has been running with no issues since.
def bad_records(row):
    if 'key1' in row:
        yield row
    else:
        print('bad row', row)

| 'exclude bad records' >> beam.ParDo(bad_records)
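For context, wired into the pipeline from the question it might look like this (a sketch reusing the names from the original code; the filter sits between parsing and partitioning):

lines = p | ReadFromPubSub(known_args.topic)
datas = (lines
         | beam.ParDo(ExtractingDoFn())
         | 'exclude bad records' >> beam.ParDo(bad_records))
by_table = datas | beam.Partition(separate_by_table, 13)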
I'm trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our RS cluster. I found some very spartan documentation here for the capability of connecting to JDBC:
https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases
The load command seems fairly straightforward (although I don't know how I would enter AWS credentials here, maybe in the options?).
df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")
And I'm not entirely sure how to deal with the SPARK_CLASSPATH variable. I'm currently running Spark locally through an IPython notebook (as part of the Spark distribution). Where do I define that so that Spark loads it?
Anyway, when I try running these commands I get a bunch of undecipherable errors, so I'm kind of stuck for now. Any help or pointers to detailed tutorials are appreciated.
Although this is a very old post, for anyone still looking for an answer, the steps below worked for me.
Start the shell, including the JAR:
bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar
Create a DataFrame by giving the appropriate details:
myDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:redshift://host:port/db_name") \
.option("dbtable", "table_name") \
.option("user", "user_name") \
.option("password", "password") \
.load()
Spark Version: 2.2
It turns out you only need a username/pwd to access Redshift in Spark, and it is done as follows (using the Python API):
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.load(source="jdbc",
                          url="jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret",
                          dbtable="schema.table")
Hope this helps someone!
If you're using Spark 1.4.0 or newer, check out spark-redshift, a library which supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you're querying large volumes of data, this approach should perform better than JDBC because it will be able to unload and query the data in parallel.
If you still want to use JDBC, check out the new built-in JDBC data source in Spark 1.4+.
Disclosure: I'm one of the authors of spark-redshift.
You first need to download the Postgres JDBC driver. You can find it here: https://jdbc.postgresql.org/
You can either define the SPARK_CLASSPATH environment variable in .bashrc, conf/spark-env.sh or a similar file, or specify it in the script before you run your IPython notebook.
You can also define it in your conf/spark-defaults.conf in the following way:
spark.driver.extraClassPath /path/to/file/postgresql-9.4-1201.jdbc41.jar
Make sure it is reflected in the Environment tab of your Spark WebUI.
You will also need to set appropriate AWS credentials in the following way:
sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "***")
The simplest way to make a JDBC connection to Redshift using Python is as follows:
# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession
jdbc_url = "jdbc:redshift://xxx.xxx.redshift.amazonaws.com:5439/xxx"
jdbc_user = "xxx"
jdbc_password = "xxx"
jdbc_driver = "com.databricks.spark.redshift"
spark = SparkSession.builder.master("yarn") \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport().getOrCreate()

# Read data from a query
df = spark.read \
    .format(jdbc_driver) \
    .option("url", jdbc_url + "?user=" + jdbc_user + "&password=" + jdbc_password) \
    .option("query", "your query") \
    .load()
This worked for me in Scala in AWS Glue with Spark 2.4:

import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import com.amazonaws.services.glue.util.Job
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
// args comes from GlueArgParser.getResolvedOptions in a standard Glue Scala job
Job.init(args("JOB_NAME"), glueContext, args.asJava)

val sqlContext = new org.apache.spark.sql.SQLContext(spark)
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql://HOST:PORT/DBNAME?user=USERNAME&password=PASSWORD",
      "dbtable" -> "(SELECT a.row_name FROM schema_name.table_name a) as from_redshift")).load()

// back to DynamicFrame
val datasource0 = DynamicFrame(jdbcDF, glueContext)
Works with any SQL query.