I've got a file containing several thousand lines like this:
Mr|David|Smith|david.smith#gmail.com
Mrs|Teri|Smith|teri.smith#gmail.com
...
I want to read the file and emit each line downstream in a throttled manner, i.e. one line per second.
I cannot quite figure out how to get the throttling working in the flow.
flow1 (below) outputs the first line after 1 sec and then terminates.
flow2 (below) waits 1 sec then outputs the whole file.
val source: Source[ByteString, Future[IOResult]] = FileIO.fromPath(file)
val flow1 = Flow[ByteString].
via(Framing.delimiter(ByteString(System.lineSeparator),10000)).
throttle(1, 1.second, 1, ThrottleMode.shaping).
map(bs => bs.utf8String)
val flow2 = Flow[ByteString].
throttle(1, 1.second, 1, ThrottleMode.shaping).
via(Framing.delimiter(ByteString(System.lineSeparator), 10000)).
map(bs => bs.utf8String)
val sink = Sink.foreach(println)
val res = source.via(flow2).to(sink).run().onComplete(_ => system.terminate())
I couldn't glean any solution from studying the docs.
Would greatly appreciate any pointers.
Use runWith, instead of to, with flow1:
val source: Source[ByteString, Future[IOResult]] = FileIO.fromPath(file)
val flow1 =
Flow[ByteString]
.via(Framing.delimiter(ByteString(System.lineSeparator), 10000))
.throttle(1, 1.second, 1, ThrottleMode.shaping)
.map(bs => bs.utf8String)
val sink = Sink.foreach(println)
source.via(flow1).runWith(sink).onComplete(_ => system.terminate())
to returns the materialized value of the Source (i.e., of source.via(flow1)), so you're terminating the actor system when the "left-hand side" of the stream has completed. What you want is to shut down the system when the materialized value of the Sink has completed. runWith returns the materialized value of its Sink parameter and is equivalent to:
source
.via(flow1)
.toMat(sink)(Keep.right)
.run()
.onComplete(_ => system.terminate())
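If you also want the Future[IOResult] that FileIO.fromPath materializes (for example, to check how many bytes were read), a minimal sketch, assuming the same source, flow1, and sink as above, keeps both materialized values:
val (ioResult, done) =
  source
    .via(flow1)
    .toMat(sink)(Keep.both)
    .run()

// ioResult is the Source's Future[IOResult]; done is the Sink's Future[Done]
done.onComplete(_ => system.terminate())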
Related
I tried to read data from S3 and Snowflake simultaneously using Spark, and write it to Snowflake after processing (a join operation).
During the tests, I found that each version gives the same result but with very different performance.
(The second attempt of each version was made to collect logs.)
v1 : Read from S3 and Snowflake respectively, perform the join operation, and save to Snowflake
1st try : It took 2 hours, 46 minutes.
2nd try : It took 1 hour, 12 minutes.
v2 : Write data B (read from Snowflake) to S3, read it back, and save to Snowflake after the join operation
1st try : It took 5 minutes.
2nd try : It took 5 minutes.
v3 : Read from S3 and Snowflake respectively, write the join result to S3, read it back, and save to Snowflake
1st try : It took 1 hour, 11 minutes.
2nd try : It took 50 minutes.
Question: why does each test perform so differently?
I expected v1 to be the fastest, given that the Snowflake Spark connector is supported.
Here are the test details, with elapsed-time information for each step.
Common config
Data A - stored in s3 : 20.2GB
Data B - stored in snowflake(internal table) : 7KB(195 rows)
spark config : r5.4xlarge(4) on-demand
snowflake warehouse : 3x-large
aws emr version : emr-5.31.0
spark version : 2.4.6
snowflake connector : spark-snowflake_2.11
Tests (using AWS EMR)
v1 : Read from S3 and Snowflake respectively, perform the join operation, and save to Snowflake
Result : OOM occurred after 2 hours (exit status 137).
After changing the Spark config per the AWS guide (adding .set("spark.sql.shuffle.partitions", "500")) and calling resDF.repartition(500).cache(), it took 2h 46mins.
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import net.snowflake.spark.snowflake.Utils.SNOWFLAKE_SOURCE_NAME

object snowflake_v1 {
val spark_config = new SparkConf()
.set("fs.s3.maxConnections", "5000")
.set("spark.sql.broadcastTimeout", "1200")
.set("spark.sql.shuffle.partitions", "500")
.set("spark.network.timeout", "10000000")
.set("spark.executor.heartbeatInterval", "1000000")
val spark = SparkSession
.builder()
.master("yarn")
.config(spark_config)
.appName("snowflake")
.getOrCreate()
def main(args: Array[String]): Unit = {
//parquet at snowflake_v1.scala:27
val Adf =
spark.read
.parquet(
"s3://"
)
.cache()
var sfOptions = Map.apply(
"sfURL" -> "XXX",
"sfUser" -> "XXX",
"sfPassword" -> "XXX",
"sfDatabase" -> "XXX",
"sfSchema" -> "XXX",
"sfWarehouse" -> "XXX"
)
val Bdf: DataFrame = spark.sqlContext.read
.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable", "t_B")
.load()
val resDF =
Adf.join(Bdf, Seq("cnty"), "leftouter").cache()
val newDF = resDF.repartition(500).cache() // without the repartition, OOM occurs
newDF.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable", "t_result_spark_v1")
.option("parallelism", "8")
.mode(SaveMode.Overwrite)
.save()
}
}
v2 : Write data B (read from Snowflake) to S3, read it back, and save to Snowflake after the join operation
Result : It took 5mins.
object snowflake_v2 {
val spark_config = "same as v1"
val spark = "same as v1"
def main(args: Array[String]): Unit = {
// parquet at snowflake_v2.scala:25
val Adf = "same as v1"
var sfOptions = "same as v1"
val Bdf: DataFrame = "same as v1"
//parquet at snowflake_v2.scala:53
Bdf.write
.mode(SaveMode.Overwrite)
.parquet("s3://..b")
//parquet at snowflake_v2.scala:56
val Bdf2=
spark.read.parquet("s3://..b")
val resDF =
Adf.join(Bdf2, Seq("cnty"), "leftouter").cache()
resDF.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable", "t_result_spark_v2")
.option("parallelism", "8")
.mode(SaveMode.Overwrite)
.save()
}
}
v3 : Read from S3 and Snowflake respectively, write the join result to S3, read it back, and save to Snowflake
Result : It took 30 minutes, but duplicated rows appeared in both S3 and Snowflake.
If I print the count before writing the result to S3, so that the computation is actually forced, it takes 1h 11mins and no duplication occurs.
object snowflake_v3 {
val spark_config = "same as v1"
val spark = "same as v1"
def main(args: Array[String]): Unit = {
//parquet at snowflake_v3.scala:25
val Adf = "same as v1"
var sfOptions = "same as v1"
val Bdf: DataFrame = "same as v1"
val resDF =
Adf.join(Bdf, Seq("cnty"), "leftouter")
println("resDF count")
//count at snowflake_v3.scala:54
println(resDF.count) //If not, duplicated rows occur
//parquet at snowflake_v3.scala:58
resDF.write
.mode(SaveMode.Overwrite)
.parquet("s3://../temp_result")
//parquet at snowflake_v3.scala:65
val resDF2 =
spark.read.parquet("s3://../temp_result")
resDF2.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable", "t_result_spark_v3")
.option("parallelism", "8")
.mode(SaveMode.Overwrite)
.save()
}
}
I need to modify the data to feed into a CEP system. My current data (each element of the RDD is a JSON string) looks like this:
{"var":"system-ready","value":0.0,"objectID":"2018","partnumber":2,"t":"2017-08-25 11:27:39.000"}
I need output like:
t = "2017-08-25 11:27:39.000"
Check = { var = "system-ready",value = 0.0, objectID = "2018", partnumber = 2 }
I wrote the RDD map operations below to achieve this; if anybody can suggest a better option, that's welcome. colCount is the number of columns.
rdd.map(x => x.split("\":").mkString("\" ="))
.map((f => (f.dropRight(1).split(",").last.toString, f.drop(1).split(",").toSeq.take(colCount-1).toString)))
.map(f => (f._1, f._2.replace("WrappedArray(", "Check = {")))
.map(f => (f._1.drop(0).replace("\"t\"", "t"), f._2.dropRight(1).replace("(", "{")))
.map(f => f.toString().split(",C").mkString("\nC").replace(")", "}").drop(0).replace("(", "")) // replacing , with \n, dropping (
.map(f => f.replace("\" =\"", "=\"").replace("\", \"", "\",").replace("\" =", "=").replace(", \"", ",").replace("{\"", "{"))
Scala's JSON parser seems to be a good choice for this problem:
import scala.util.parsing.json.JSON
rdd.map( x => {
JSON.parseFull(x).get.asInstanceOf[Map[String, Any]]
})
This will result in an RDD[Map[String, Any]]. You can then access the t field from the parsed JSON, for example, using:
.map(dict => "t = "+dict("t"))
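To get closer to the layout in the question, here is a minimal sketch (assuming every record contains a "t" field; the exact rendering of numeric values is illustrative, since parseFull returns them as Double):
import scala.util.parsing.json.JSON

val formatted = rdd.map { line =>
  val dict = JSON.parseFull(line).get.asInstanceOf[Map[String, Any]]
  // Render every field except the timestamp inside the Check block
  val rest = (dict - "t").map { case (k, v) => s"$k = $v" }.mkString(", ")
  "t = \"" + dict("t") + "\"\n" + "Check = { " + rest + " }"
}
formatted.take(2).foreach(println)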
My version of the regex is being greedy and not working as it's supposed to. I need to extract each message with its timestamp and the user who created it. Also, if a user has two or more consecutive messages, they should go inside one match / block / group. How do I solve this?
https://regex101.com/r/zD5bR6/1
val pattern = "((a\.b|c\.d)\n(.+\n)+)+?".r
for(m <- pattern.findAllIn(str).matchData; e <- m.subgroups) println(e)
UPDATE
ndn's solution throws a StackOverflowError when executed:
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4708)
.......
Code:
val pattern = "(?:.+(?:\\Z|\\n))+?(?=\\Z|\\w\\.\\w)".r
val array = (pattern findAllIn str).toArray.reverse foreach{println _}
for(m <- pattern.findAllIn(str).matchData; e <- m.subgroups) println(e)
I don't think a regular expression is the right tool for this job. My solution below uses a (tail) recursive function to loop over the lines, keep track of the current username, and create a Message for every timestamp / message pair.
import java.time.LocalTime
case class Message(user: String, timestamp: LocalTime, message: String)
val Timestamp = """\[(\d{2})\:(\d{2})\:(\d{2})\]""".r
def parseMessages(lines: List[String], usernames: Set[String]) = {
@scala.annotation.tailrec
def go(
lines: List[String], currentUser: Option[String], messages: List[Message]
): List[Message] = lines match {
// no more lines -> return parsed messages
case Nil => messages.reverse
// found a user -> keep as currentUser
case user :: tail if usernames.contains(user) =>
go(tail, Some(user), messages)
// timestamp and message on next line -> create a Message
case Timestamp(h, m, s) :: msg :: tail if currentUser.isDefined =>
val time = LocalTime.of(h.toInt, m.toInt, s.toInt)
val newMsg = Message(currentUser.get, time, msg)
go(tail, currentUser, newMsg :: messages)
// invalid line -> ignore
case _ =>
go(lines.tail, currentUser, messages)
}
go(lines, None, Nil)
}
Which we can use as:
val input = """
a.b
[10:12:03]
you can also get commands
[10:11:26]
from the console
[10:11:21]
can you check if has been resolved
[10:10:47]
ah, okay
c.d
[10:10:39]
anyways startsLevel is still 4
a.b
[10:09:25]
might be a dead end
[10:08:56]
that need to be started early as well
"""
val lines = input.split('\n').toList
val users = Set("a.b", "c.d")
parseMessages(lines, users).foreach(println)
// Message(a.b,10:12:03,you can also get commands)
// Message(a.b,10:11:26,from the console)
// Message(a.b,10:11:21,can you check if has been resolved)
// Message(a.b,10:10:47,ah, okay)
// Message(c.d,10:10:39,anyways startsLevel is still 4)
// Message(a.b,10:09:25,might be a dead end)
// Message(a.b,10:08:56,that need to be started early as well)
The idea is to take as few characters as possible, such that they are followed by a username or the end of the string:
(?:.+(?:\Z|\n))+?(?=\Z|\w\.\w)
See it in action
I am counting values in each window, finding the top values, and I want to save only the 10 most frequent values of each window to HDFS, rather than all of the values.
eegStreams(a) = KafkaUtils.createStream(ssc, zkQuorum, group, Map(args(a) -> 1),StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
val counts = eegStreams(a).map(x => (math.round(x.toDouble), 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(4), Seconds(4))
val sortedCounts = counts.map(_.swap).transform(rdd => rdd.sortByKey(false)).map(_.swap)
//sortedCounts.foreachRDD(rdd =>println("\nTop 10 amplitudes:\n" + rdd.take(10).mkString("\n")))
sortedCounts.map(tuple => "%s,%s".format(tuple._1, tuple._2)).saveAsTextFiles("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a+1))
I can print top 10 as above (commented).
I have also tried
sortedCounts.foreachRDD{ rdd => ssc.sparkContext.parallelize(rdd.take(10)).saveAsTextFile("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a+1))}
but I get the following error, saying that the StreamingContext is not serializable:
15/01/05 17:12:23 ERROR actor.OneForOneStrategy: org.apache.spark.streaming.StreamingContext
java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext
Can you try this?
sortedCounts.foreachRDD(rdd => rdd.filterWith(ind => ind)((v, ind) => ind <= 10).saveAsTextFile(...))
Note: I didn't test the snippet...
Your first version should work. Just declare @transient val ssc = ... where the StreamingContext is first created.
The second version won't work because the StreamingContext cannot be serialized in a closure.
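If marking the StreamingContext @transient isn't an option, a common workaround (just a sketch based on the question's code, not the suggestion above) is to avoid referencing ssc inside the closure entirely and use the RDD's own SparkContext, since the foreachRDD body runs on the driver:
sortedCounts.foreachRDD { rdd =>
  // rdd.sparkContext is the driver-side SparkContext, so nothing from the
  // enclosing scope (such as ssc) has to be serialized with this closure
  rdd.sparkContext
    .parallelize(rdd.take(10))
    .saveAsTextFile("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a + 1))
}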
I am working through a tutorial:
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
At some point we use the mapReduceTriplets operation, which returns the expected result:
// Find the oldest follower for each user
val oldestFollower: VertexRDD[(String, Int)] = userGraph.mapReduceTriplets[(String, Int)](
// For each edge send a message to the destination vertex with the attribute of the source vertex
edge => Iterator((edge.dstId, (edge.srcAttr.name, edge.srcAttr.age))),
// To combine messages take the message for the older follower
(a, b) => if (a._2 > b._2) a else b
)
But IntelliJ tells me that mapReduceTriplets is deprecated (as of 1.2.0) and should be replaced by aggregateMessages:
// Find the oldest follower for each user
val oldestFollower: VertexRDD[(String, Int)] = userGraph.aggregateMessages()[(String, Int)](
// For each edge send a message to the destination vertex with the attribute of the source vertex
edge => Iterator((edge.dstId, (edge.srcAttr.name, edge.srcAttr.age))),
// To combine messages take the message for the older follower
(a, b) => if (a._2 > b._2) a else b
)
So I ran the exact same code, but then I don't get any output. Is that the expected result, or should I change something due to the switch to aggregateMessages?
Probably you need something like this:
val oldestFollower: VertexRDD[(String, Int)] = userGraph.aggregateMessages[(String, Int)](
// For each edge send a message to the destination vertex with the attribute of the source vertex
sendMsg = { triplet => triplet.sendToDst((triplet.srcAttr.name, triplet.srcAttr.age)) },
// To combine messages take the message for the older follower
mergeMsg = { (a, b) => if (a._2 > b._2) a else b }
)
You can find the aggregateMessages function signature and useful examples on the GraphX programming guide page. Hope this helps.
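As a quick sanity check, a small sketch (assuming userGraph from the tutorial, whose vertex attributes expose name and age) that prints a few of the aggregated results:
// Print a handful of (vertexId, (followerName, followerAge)) pairs
oldestFollower.take(5).foreach { case (id, (name, age)) =>
  println(s"Vertex $id: oldest follower is $name, aged $age")
}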