Simultaneously read Snowflake and s3 using Spark - amazon-web-services

I am trying to read data from S3 and Snowflake simultaneously with Spark, join the two datasets, and write the result back to Snowflake.
During testing I found that each approach produces the same result but with very different performance.
(Each version was run a second time to collect logs.)
v1 : Read from S3 and Snowflake respectively, join them, and save the result to Snowflake
1st try: 2 hours 46 minutes.
2nd try: 1 hour 12 minutes.
v2 : Write data B (read from Snowflake) to S3, read it back, then join with A and save to Snowflake
1st try: 5 minutes.
2nd try: 5 minutes.
v3 : Read from S3 and Snowflake respectively, write the join result to S3, read it back, and save it to Snowflake
1st try: 1 hour 11 minutes.
2nd try: 50 minutes.
Question: why does each version perform so differently?
I expected v1 to be the fastest, since it uses the Snowflake Spark connector directly.
Here are the test details and the elapsed time for each step.
Common config
Data A - stored in S3: 20.2 GB
Data B - stored in Snowflake (internal table): 7 KB (195 rows)
Spark config: r5.4xlarge(4), on-demand
Snowflake warehouse: 3X-Large
AWS EMR version: emr-5.31.0
Spark version: 2.4.6
Snowflake connector: spark-snowflake_2.11
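For reference, here is a build.sbt sketch that matches the versions listed above. Only Spark 2.4.6 and the spark-snowflake_2.11 artifact are stated in the post; the connector and JDBC patch versions below are assumptions of mine.
// Hypothetical build.sbt fragment; the patch versions are guesses, not from the post
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.4.6" % "provided",
  "net.snowflake" % "snowflake-jdbc" % "3.12.17", // assumed patch version
  "net.snowflake" %% "spark-snowflake" % "2.8.1-spark_2.4" // assumed patch version
)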
Tests (on AWS EMR)
v1 : Read from S3 and Snowflake respectively, join them, and save the result to Snowflake
Result: an OOM occurred after 2 hours (exit status 137).
After changing the Spark config per the AWS guide (adding .set("spark.sql.shuffle.partitions", "500") and resDF.repartition(500).cache()), it took 2 hours 46 minutes.
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
object snowflake_v1 {
// Source name used by the Snowflake Spark connector
val SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
val spark_config = new SparkConf()
.set("fs.s3.maxConnections", "5000")
.set("spark.sql.broadcastTimeout", "1200")
.set("spark.sql.shuffle.partitions", "500")
.set("spark.network.timeout", "10000000")
.set("spark.executor.heartbeatInterval", "1000000")
val spark = SparkSession
.builder()
.master("yarn")
.config(spark_config)
.appName("snowflake")
.getOrCreate()
def main(args: Array[String]): Unit = {
//parquet at snowflake_v1.scala:27
val Adf =
spark.read
.parquet(
"s3://"
)
.cache()
var sfOptions = Map.apply(
"sfURL" -> "XXX",
"sfUser" -> "XXX",
"sfPassword" -> "XXX",
"sfDatabase" -> "XXX",
"sfSchema" -> "XXX",
"sfWarehouse" -> "XXX"
)
val Bdf: DataFrame = spark.sqlContext.read
.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable", "t_B")
.load()
val resDF =
Adf.join(Bdf, Seq("cnty"), "leftouter").cache()
val newDF = resDF.repartition(500).cache() // without this repartition, an OOM occurs
newDF.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable", "t_result_spark_v1")
.option("parallelism", "8")
.mode(SaveMode.Overwrite)
.save()
}
}
v2 : Write data B (read from Snowflake) to S3, read it back, then join with A and save to Snowflake
Result: it took 5 minutes.
object snowflake_v2 {
val spark_config = "same as v1"
val spark = "same as v1"
def main(args: Array[String]): Unit = {
// parquet at snowflake_v2.scala:25
val Adf = "same as v1"
var sfOptions = "same as v1"
val Bdf: DataFrame = "same as v1"
//parquet at snowflake_v2.scala:53
Bdf.write
.mode(SaveMode.Overwrite)
.parquet("s3://..b")
//parquet at snowflake_v2.scala:56
val Bdf2=
spark.read.parquet("s3://..b")
val resDF =
Adf.join(Bdf2, Seq("cnty"), "leftouter").cache()
resDF.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable", "t_result_spark_v2")
.option("parallelism", "8")
.mode(SaveMode.Overwrite)
.save()
}
}
v3 : Read from S3 and Snowflake respectively, write the join result to S3, read it back, and save it to Snowflake
Result: it took 30 minutes, but duplicated rows appeared in both S3 and Snowflake.
If I print the count before writing the result to S3, so that the join is actually materialized, it takes 1 hour 11 minutes and no duplicates appear.
object snowflake_v3 {
val spark_config = "same as v1"
val spark = "same as v1"
def main(args: Array[String]): Unit = {
//parquet at snowflake_v3.scala:25
val Adf = "same as v1"
var sfOptions = "same as v1"
val Bdf: DataFrame = "same as v1"
val resDF =
Adf.join(Bdf, Seq("cnty"), "leftouter")
println("resDF count")
//count at snowflake_v3.scala:54
println(resDF.count) //If not, duplicated rows occur
//parquet at snowflake_v3.scala:58
resDF.write
.mode(SaveMode.Overwrite)
.parquet("s3://../temp_result")
//parquet at snowflake_v3.scala:65
val resDF2 =
spark.read.parquet("s3://../temp_result")
resDF2.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable", "t_result_spark_v3")
.option("parallelism", "8")
.mode(SaveMode.Overwrite)
.save()
}
}
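Since the question is why v1 and v2 diverge so much, here is a small diagnostic sketch of my own (not part of the original post): compare the physical plans Spark produces for the connector-backed join (v1) and the parquet-backed join (v2), and optionally force a broadcast of the tiny 195-row table as an experiment. Adf, Bdf and Bdf2 refer to the DataFrames from the snippets above.
import org.apache.spark.sql.functions.broadcast
// How is the join planned when B comes straight from the Snowflake connector (v1)
// versus from its parquet copy in S3 (v2)?
Adf.join(Bdf, Seq("cnty"), "leftouter").explain()
Adf.join(Bdf2, Seq("cnty"), "leftouter").explain()
// Hypothetical experiment: explicitly broadcast the 7 KB / 195-row table.
val broadcastResDF = Adf.join(broadcast(Bdf), Seq("cnty"), "leftouter")
broadcastResDF.explain()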

Related

job aborted while writing parquet to s3 via glue jobs

My code looks like the following and consists of these transformations:
# Imports assumed by this snippet; fullLoad, Config, and aggregatedPath are defined elsewhere
from pyspark import StorageLevel
from pyspark.sql import functions as F
from pyspark.sql.functions import trim, regexp_replace, lit, col
from pyspark.sql.window import Window

dictionaryDf = spark.read.option("header", "true").csv(
"s3://...../.csv")
web_notif_data = fullLoad.cache()
web_notif_data.persist(StorageLevel.MEMORY_AND_DISK)
print("::::::data has been loaded::::::::::::")
distinct_campaign_name = web_notif_data.select(
trim(web_notif_data.campaign_name).alias("campaign_name")).distinct()
web_notif_data.createOrReplaceTempView("temp")
variablesList = Config.get('web', 'variablesListWeb')
web_notif_data = spark.sql(variablesList)
web_notif_data.persist(StorageLevel.MEMORY_AND_DISK)
web_notif_data = web_notif_data.withColumn("camp", regexp_replace("campaign_name", "_", ""))
web_notif_data = web_notif_data.drop("campaign_name")
web_notif_data = web_notif_data.withColumnRenamed("camp", "campaign_name")
web_notif_data = web_notif_data.withColumn("channel", lit("web_notification"))
web_notif_data.createOrReplaceTempView("data")
campaignTeamWeb = Config.get('web', 'campaignTeamWeb')
web_notif_data = spark.sql(campaignTeamWeb)
web_notif_data.persist(StorageLevel.MEMORY_AND_DISK)
distinct_campaign_name = distinct_campaign_name.withColumn("camp", F.regexp_replace(
F.lower(F.trim(col("campaign_name"))),
"[^a-zA-Z0-9]", ""))
output_df3 = (
distinct_campaign_name.withColumn("cname_split",
F.explode(F.split(F.lower(F.trim(col("campaign_name"))), "_")))
.join(
dictionaryDf,
(
(
(F.col("function") == "contains") &
F.col("camp").contains(F.col("terms"))
) |
(
(F.col("function") == "match") &
F.col("campaign_name").contains("_") &
(F.col("cname_split") == F.col("terms"))
)
),
"left"
)
.withColumn(
"empty_is_other",
F.when(
(
F.col("product").isNull() &
F.col("product_category").isNull()
),
"other"
)
)
.withColumn(
"rn",
F.row_number().over(
Window.partitionBy("campaign_name")
.orderBy(
F.when(
F.col("function").isNull(), 3
).when(
F.col("function") == "match", 2
).otherwise(1),
F.length(F.col("terms")).desc(),
F.col("product").isNull()
)
)
)
.filter("rn=1")
.select(
"campaign_name",
F.coalesce("product", "empty_is_other").alias("prod"),
F.coalesce("product_category", "empty_is_other").alias("prod_cat"),
)
.na.fill("")
)
print(":::::::::::transformations have been done finally::::::::::::")
web_notif_data1 = web_notif_data # Just taking the backup of DF in case something goes wrong
web_notif_data = web_notif_data.drop("campaign_name")
web_notif_data = web_notif_data.withColumnRenamed("temp_campaign_name", "campaign_name")
veryFinalDF = web_notif_data.join(output_df3, "campaign_name", "left_outer")
# veryFinalDF.show(truncate=False)
veryFinalDF.write.mode("overwrite").parquet(aggregatedPath)
print("::::final data have been written successfully::::::")
Here fullLoad is the DataFrame read from the Redshift table. This code works fine on 0.2 million records. However, in production, 15 days of data could be at least around 25 million records. I don't know the exact size, since the data sits in the Redshift table and we read from it and then process it. I am running this code via Glue jobs, and it gets stuck on the last line, i.e. while writing the data as Parquet. It gives me the below error:
I tried running it with 30 executors. It takes around 20 minutes to load the data from Redshift into the fullLoad DataFrame. What else can be done to avoid this error? I am new to AWS and Glue jobs.

Query condition missed key schema element: source_transaction_id in dynamoDB Scala

I am trying to execute a query using a secondary index, as follows:
// Imports assumed by this snippet (AWS SDK for Java v1 document API);
// `table` is a com.amazonaws.services.dynamodbv2.document.Table obtained elsewhere.
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap
import scala.collection.mutable.ListBuffer
val valMap = new ValueMap()
.withString(":v_source_transaction_id", "843f45ad-cb1d-4f41-9ede-366c9304e447")
//.withString(":source_transaction_trace_id","843f45ad-cb1d-4f41-9ede-366c9304e443")
println(valMap)
//search is defined then extract dates from search.. else continue with simple logic.
val keyConditionExpression = """source_transaction_id =
| :v_source_transaction_id""".stripMargin
val spec = new QuerySpec()
.withProjectionExpression("source_transaction_id, transaction_date")
.withKeyConditionExpression(keyConditionExpression)
.withValueMap(valMap)
.withMaxResultSize(2)
case class DataItems(transaction_date: String)
val itemList = new ListBuffer[DataItems]
val items = table.getIndex("gsi-settlement").query(spec)
println(table.getIndex("gsi-settlement"))
val iterator = items.iterator()
while (iterator.hasNext) {
val next = iterator.next()
itemList += DataItems(next.getString("transaction_date"))
}
itemList.foreach(println)
Here gsi-settlement is the secondary index and source_transaction_id is the primary key, and I am getting the following error:
[AmazonDynamoDBException: Query condition missed key schema element: source_transaction_id
Try this :
val keyConditionExpression = """source_transaction_id = :v_source_transaction_id""".stripMargin
Currently your keyConditionExpression comes out as:
source_transaction_id =
 :v_source_transaction_id
The error AmazonDynamoDBException: Query condition missed key schema element: source_transaction_id suggests that the GSI key is missing from your query.
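A small illustration, added for clarity (not part of the original answer), of what the two-line literal produces after stripMargin compared with the single-line form:
// Two-line literal: stripMargin removes the leading whitespace and the '|',
// but the embedded newline (and the space after '|') survive.
val multiLine = """source_transaction_id =
                  | :v_source_transaction_id""".stripMargin
// multiLine == "source_transaction_id =\n :v_source_transaction_id"
// Single-line form suggested in the answer above.
val singleLine = "source_transaction_id = :v_source_transaction_id"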

How to save string as json in scala spark

I have raw strings in a log file, and I apply many filters and other operations to them. I have now reached the following problem: I need to convert a string into JSON format so that I can save it as a single object.
Suppose I have the following data:
val CDataTime = "20191012"
val LocationId = "12345"
val SetInstruc = "Comm=Qwe123,Elem=12345,Elem123=Test"
I am trying to create a DataFrame with the layout datetime|location|jsonofinstruction.
The JSON column is built from the third value. I tried splitting the string first by comma, then by the equals sign, and looping through the pieces in pairs to build a map of keys and values, but the JSON is not created. Please help.
You can use scala.util.parsing.json.JSONObject to convert a map to JSON and then to a string.
import scala.util.parsing.json.JSONObject
import spark.implicits._

val df = spark.createDataset(Seq("Comm=Qwe123,Elem=12345,Elem123=Test")).toDF("col3")
val dfWithJson = df.map{ row =>
val insMap = row.getAs[String]("col3").split(",").map{ kv =>
val kvArray = kv.split("=")
(kvArray(0), kvArray(1))
}.toMap
val insJson = JSONObject(insMap).toString()
(row.getAs[String]("col3"), insJson)
}.toDF("col3", "col4")
dfWithJson.show()
Result -
+--------------------+--------------------+
| col3| col4|
+--------------------+--------------------+
|Comm=Qwe123,Elem=...|{"Comm" : "Qwe123...|
+--------------------+--------------------+
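To get to the datetime|location|jsonofinstruction layout the question asks for, here is a minimal follow-on sketch for this single-row example; the column names are illustrative and the literal values are the ones from the question.
import org.apache.spark.sql.functions.lit
// Attach the other two values from the question as literal columns next to the JSON column.
val finalDF = dfWithJson
  .withColumn("datetime", lit("20191012"))
  .withColumn("location", lit("12345"))
  .select("datetime", "location", "col4")
finalDF.show(false)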

Akka Streams: Throttling a File Source

I've got a file containing several thousand lines like this:
Mr|David|Smith|david.smith#gmail.com
Mrs|Teri|Smith|teri.smith#gmail.com
...
I want to read the file, emitting each line downstream, but in a throttled manner, i.e. one line per second.
I cannot quite figure out how to get the throttling working in the flow.
flow1 (below) outputs the first line after 1 sec and then terminates.
flow2 (below) waits 1 sec then outputs the whole file.
val source: Source[ByteString, Future[IOResult]] = FileIO.fromPath(file)
val flow1 = Flow[ByteString].
via(Framing.delimiter(ByteString(System.lineSeparator),10000)).
throttle(1, 1.second, 1, ThrottleMode.shaping).
map(bs => bs.utf8String)
val flow2 = Flow[ByteString].
throttle(1, 1.second, 1, ThrottleMode.shaping).
via(Framing.delimiter(ByteString(System.lineSeparator), 10000)).
map(bs => bs.utf8String)
val sink = Sink.foreach(println)
val res = source.via(flow2).to(sink).run().onComplete(_ => system.terminate())
I couldn't glean any solution from studying the docs.
Would greatly appreciate any pointers.
Use runWith, instead of to, with flow1:
val source: Source[ByteString, Future[IOResult]] = FileIO.fromPath(file)
val flow1 =
Flow[ByteString]
.via(Framing.delimiter(ByteString(System.lineSeparator), 10000))
.throttle(1, 1.second, 1, ThrottleMode.shaping)
.map(bs => bs.utf8String)
val sink = Sink.foreach(println)
source.via(flow1).runWith(sink).onComplete(_ => system.terminate())
to returns the materialized value of the Source (i.e., the source.via(flow1)), so you're terminating the actor system when the "left-hand side" of the stream is completed. What you want to do is to shut down the system when the materialized value of the Sink is completed. Using runWith returns the materialized value of the Sink parameter and is equivalent to:
source
.via(flow1)
.toMat(sink)(Keep.right)
.run()
.onComplete(_ => system.terminate())
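As a side note (my addition, not part of the original answer): if you also want the IOResult from FileIO in addition to the sink's completion, Keep.both exposes both materialized values.
// Keep both materialized values: the Future[IOResult] from FileIO and the Future[Done] from the sink.
val (ioResult, done) =
  source
    .via(flow1)
    .toMat(sink)(Keep.both)
    .run()
// Shut down once printing has finished, as before.
done.onComplete(_ => system.terminate())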

Regex / subString to extract all matching patterns / groups

I get this as a response to an API hit.
1735 Queries
Taking 1.001303 to 31.856310 seconds to complete
SET timestamp=XXX;
SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
38 Queries
Taking 1.007646 to 5.284330 seconds to complete
SET timestamp=XXX;
show slave status;
6 Queries
Taking 1.021271 to 1.959838 seconds to complete
SET timestamp=XXX;
SHOW SLAVE STATUS;
2 Queries
Taking 4.825584, 18.947725 seconds to complete
use marketing;
SET timestamp=XXX;
SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I have extracted this from the response HTML and now have it as a string. I need to retrieve the values as concisely as possible, so that I end up with a map of the form Map(query -> "T1 to T2 seconds"). Basically, this is the status of all the slow queries running on a MySQL slave server, and I am building an alert system on top of it. So from this entire paragraph, held as a String, I need to separate out the queries and store the corresponding time range with each one.
For example, 1.001303 to 31.856310 is a time range, and the query corresponding to that range is:
SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I was hoping to save this information in a Scala Map of the form (query: String -> timeRange: String).
Another example:
("use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified xyz ;"->"4.825584 to 18.947725 seconds")
"""###(.)###(.)\n\n(.*)###""".r.findAllIn(reqSlowQueryData).matchData foreach {m => println("group0"+m.group(1)+"next group"+m.group(2)+m.group(3)}
I am using the above statement to extract the the repeating cells to do my manipulations on it later. But it doesnt seem to be working;
THANKS IN ADvance! I know there are several ways to do this but all the ones striking me are inefficient and tedious. I need Scala to do the same! Maybe I can extract recursively using the subString method ?
If you want to use Scala, try this:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
val txt = """
|1735 Queries
|
|Taking 1.001303 to 31.856310 seconds to complete
|
|SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
|
|38 Queries
|
|Taking 1.007646 to 5.284330 seconds to complete
|
|SET timestamp=XXX; show slave status;
|
|6 Queries
|
|Taking 1.021271 to 1.959838 seconds to complete
|
|SET timestamp=XXX; SHOW SLAVE STATUS;
|
|2 Queries
|
|Taking 4.825584, 18.947725 seconds to complete
|
|use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
""".stripMargin
def logToMap(txt:String) = {
val (_,map) = txt.lines.foldLeft[(Option[String],Map[String,String])]((None,Map.empty)){
(acc,el) =>
val (taking,map) = acc // taking contains range
taking match {
case Some(range) if el.trim.nonEmpty => //Some contains range
(None,map + ( el -> range)) // add to map
case None =>
regex.findFirstIn(el) match { //extract range
case Some(range) => (Some(range),map)
case _ => (None,map)
}
case _ => (taking,map) // probably empty line
}
}
map
}
I modified ajozwik's answer to work for SQL commands spanning multiple lines:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
def logToMap(txt:String) = {
val (_,map) = txt.lines.foldLeft[(Option[String],Map[String,String])]((None,Map.empty)){
(accumulator,element) =>
val (taking,map) = accumulator
taking match {
case Some(range) if element.trim.nonEmpty=> {
if (element.contains("Queries"))
(None, map)
else
(Some(range),map+(range->(map.getOrElse(range,"")+element)))
}
case None =>
regex.findFirstIn(element) match {
case Some(range) => (Some(range),map)
case _ => (None,map)
}
case _ => (taking,map)
}
}
println(map)
map
}
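For completeness, a small usage sketch against the sample txt defined above; note that this modified version keys the map by time range and accumulates the (possibly multi-line) query text as the value.
// Print each time range together with the query text collected for it.
logToMap(txt).foreach { case (range, query) =>
  println(s"$range -> $query")
}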