I am counting values in each window to find the top values, and I want to save only the 10 most frequent values of each window to HDFS rather than all of them.
eegStreams(a) = KafkaUtils.createStream(ssc, zkQuorum, group, Map(args(a) -> 1), StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
val counts = eegStreams(a).map(x => (math.round(x.toDouble), 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(4), Seconds(4))
val sortedCounts = counts.map(_.swap).transform(rdd => rdd.sortByKey(false)).map(_.swap)
//sortedCounts.foreachRDD(rdd =>println("\nTop 10 amplitudes:\n" + rdd.take(10).mkString("\n")))
sortedCounts.map(tuple => "%s,%s".format(tuple._1, tuple._2)).saveAsTextFiles("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a+1))
I can print the top 10 as shown above (commented out).
I have also tried
sortedCounts.foreachRDD{ rdd => ssc.sparkContext.parallelize(rdd.take(10)).saveAsTextFile("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a+1))}
but I get the following error, because the StreamingContext is not serializable:
15/01/05 17:12:23 ERROR actor.OneForOneStrategy: org.apache.spark.streaming.StreamingContext
java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext
Can you try this?
sortedCounts.foreachRDD(rdd => rdd.filterWith(ind => ind)((v, ind) => ind <= 10).saveAsTextFile(...))
Note: I didn't test the snippet...
Your first version should work. Just declare @transient ssc = ... where the StreamingContext is first created.
The second version won't work because the StreamingContext cannot be serialized in a closure.
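A minimal, untested sketch of that suggestion (the class wrapper and the DStream element type are assumptions from the question's code; the point is only where @transient goes):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

class StreamJob(conf: SparkConf) extends Serializable {
  // @transient keeps the StreamingContext out of any serialized closure state
  @transient val ssc = new StreamingContext(conf, Seconds(4))

  def saveTopTen(sortedCounts: DStream[(Long, Int)], a: Int): Unit =
    sortedCounts.foreachRDD { rdd =>
      // take(10) executes on the driver, where using the SparkContext is safe
      ssc.sparkContext.parallelize(rdd.take(10))
        .saveAsTextFile("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a + 1))
    }
}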
I need to modify my data to feed it into a CEP system; my current data looks like the record below (each element of the RDD is a JSON string):
{"var":"system-ready","value":0.0,"objectID":"2018","partnumber":2,"t":"2017-08-25 11:27:39.000"}
I need output like:
t = "2017-08-25 11:27:39.000"
Check = { var = "system-ready", value = 0.0, objectID = "2018", partnumber = 2 }
I have written RDD map operations to achieve this; if anybody can suggest a better option, that is welcome. colCount is the number of columns.
rdd.map(x => x.split("\":").mkString("\" ="))
.map((f => (f.dropRight(1).split(",").last.toString, f.drop(1).split(",").toSeq.take(colCount-1).toString)))
.map(f => (f._1, f._2.replace("WrappedArray(", "Check = {")))
.map(f => (f._1.drop(0).replace("\"t\"", "t"), f._2.dropRight(1).replace("(", "{")))
.map(f => f.toString().split(",C").mkString("\nC").replace(")", "}").drop(0).replace("(", "")) // replacing "," with "\n", dropping "("
.map(f => f.replace("\" =\"", "=\"").replace("\", \"", "\",").replace("\" =", "=").replace(", \"", ",").replace("{\"", "{"))
Scala's JSON parser seems to be a good choice for this problem:
import scala.util.parsing.json.JSON

rdd.map { x =>
  JSON.parseFull(x).get.asInstanceOf[Map[String, String]]
}
This will result in an RDD[Map[String, String]]. You can then access the t field from the JSON, for example, using:
.map(dict => "t = " + dict("t"))
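Going one step further toward the output format in the question, a hedged sketch (parseFull returns JSON numbers as Double, so Map[String, Any] is the safer cast; the exact number rendering of the original, e.g. 0.0 vs 2, is only approximated here):
import scala.util.parsing.json.JSON

// Re-quote strings and strip the ".0" from whole numbers when rendering values
def fmt(v: Any): String = v match {
  case s: String              => "\"" + s + "\""
  case d: Double if d.isWhole => d.toLong.toString
  case other                  => other.toString
}

rdd.map { x =>
  val dict  = JSON.parseFull(x).get.asInstanceOf[Map[String, Any]]
  val check = (dict - "t").map { case (k, v) => s"$k = ${fmt(v)}" }.mkString(", ")
  "t = \"" + dict("t") + "\"\n" + "Check = { " + check + " }"
}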
I've got a file containing several thousand lines like this:
Mr|David|Smith|david.smith@gmail.com
Mrs|Teri|Smith|teri.smith@gmail.com
...
I want to read the file, emitting each line downstream, but in a throttled manner, i.e. one line per second.
I cannot quite figure out how to get the throttling working in the flow.
flow1 (below) outputs the first line after 1 second and then terminates.
flow2 (below) waits 1 second and then outputs the whole file.
val source: Source[ByteString, Future[IOResult]] = FileIO.fromPath(file)
val flow1 = Flow[ByteString]
  .via(Framing.delimiter(ByteString(System.lineSeparator), 10000))
  .throttle(1, 1.second, 1, ThrottleMode.shaping)
  .map(bs => bs.utf8String)

val flow2 = Flow[ByteString]
  .throttle(1, 1.second, 1, ThrottleMode.shaping)
  .via(Framing.delimiter(ByteString(System.lineSeparator), 10000))
  .map(bs => bs.utf8String)
val sink = Sink.foreach(println)
val res = source.via(flow2).to(sink).run().onComplete(_ => system.terminate())
I couldn't glean any solution from studying the docs.
Would greatly appreciate any pointers.
Use runWith, instead of to, with flow1:
val source: Source[ByteString, Future[IOResult]] = FileIO.fromPath(file)
val flow1 =
Flow[ByteString]
.via(Framing.delimiter(ByteString(System.lineSeparator), 10000))
.throttle(1, 1.second, 1, ThrottleMode.shaping)
.map(bs => bs.utf8String)
val sink = Sink.foreach(println)
source.via(flow1).runWith(sink).onComplete(_ => system.terminate())
to returns the materialized value of the Source (i.e., the source.via(flow1)), so you're terminating the actor system when the "left-hand side" of the stream is completed. What you want to do is to shut down the system when the materialized value of the Sink is completed. Using runWith returns the materialized value of the Sink parameter and is equivalent to:
source
.via(flow1)
.toMat(sink)(Keep.right)
.run()
.onComplete(_ => system.terminate())
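For contrast, to(sink) keeps the left (source-side) materialized value, so the original line behaves like this sketch, completing as soon as the file has been read rather than when printing is done:
// What source.via(flow2).to(sink).run() amounts to:
source
  .via(flow1)
  .toMat(sink)(Keep.left)   // keeps the Future[IOResult] from FileIO.fromPath
  .run()
  .onComplete(_ => system.terminate())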
My regex is being greedy and not working as it's supposed to. I need to extract each message with its timestamp and the user who created it. Also, if a user has two or more consecutive messages, they should go inside one match / block / group. How do I solve this?
https://regex101.com/r/zD5bR6/1
val pattern = "((a\.b|c\.d)\n(.+\n)+)+?".r
for(m <- pattern.findAllIn(str).matchData; e <- m.subgroups) println(e)
UPDATE
ndn's solution throws a StackOverflowError when executed:
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4708)
.......
Code:
val pattern = "(?:.+(?:\\Z|\\n))+?(?=\\Z|\\w\\.\\w)".r
(pattern findAllIn str).toArray.reverse foreach println
for(m <- pattern.findAllIn(str).matchData; e <- m.subgroups) println(e)
I don't think a regular expression is the right tool for this job. My solution below uses a (tail) recursive function to loop over the lines, keep the current username and create a Message for every timestamp / message pair.
import java.time.LocalTime
case class Message(user: String, timestamp: LocalTime, message: String)
val Timestamp = """\[(\d{2})\:(\d{2})\:(\d{2})\]""".r
def parseMessages(lines: List[String], usernames: Set[String]) = {
@scala.annotation.tailrec
def go(
lines: List[String], currentUser: Option[String], messages: List[Message]
): List[Message] = lines match {
// no more lines -> return parsed messages
case Nil => messages.reverse
// found a user -> keep as currentUser
case user :: tail if usernames.contains(user) =>
go(tail, Some(user), messages)
// timestamp and message on next line -> create a Message
case Timestamp(h, m, s) :: msg :: tail if currentUser.isDefined =>
val time = LocalTime.of(h.toInt, m.toInt, s.toInt)
val newMsg = Message(currentUser.get, time, msg)
go(tail, currentUser, newMsg :: messages)
// invalid line -> ignore
case _ =>
go(lines.tail, currentUser, messages)
}
go(lines, None, Nil)
}
We can use it as follows:
val input = """
a.b
[10:12:03]
you can also get commands
[10:11:26]
from the console
[10:11:21]
can you check if has been resolved
[10:10:47]
ah, okay
c.d
[10:10:39]
anyways startsLevel is still 4
a.b
[10:09:25]
might be a dead end
[10:08:56]
that need to be started early as well
"""
val lines = input.split('\n').toList
val users = Set("a.b", "c.d")
parseMessages(lines, users).foreach(println)
// Message(a.b,10:12:03,you can also get commands)
// Message(a.b,10:11:26,from the console)
// Message(a.b,10:11:21,can you check if has been resolved)
// Message(a.b,10:10:47,ah, okay)
// Message(c.d,10:10:39,anyways startsLevel is still 4)
// Message(a.b,10:09:25,might be a dead end)
// Message(a.b,10:08:56,that need to be started early as well)
The idea is to take as few characters as possible that are followed by a username or the end of the string:
(?:.+(?:\Z|\n))+?(?=\Z|\w\.\w)
See it in action
I get this as a response to an API hit.
1735 Queries
Taking 1.001303 to 31.856310 seconds to complete
SET timestamp=XXX;
SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
38 Queries
Taking 1.007646 to 5.284330 seconds to complete
SET timestamp=XXX;
show slave status;
6 Queries
Taking 1.021271 to 1.959838 seconds to complete
SET timestamp=XXX;
SHOW SLAVE STATUS;
2 Queries
Taking 4.825584, 18.947725 seconds to complete
use marketing;
SET timestamp=XXX;
SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I have extracted this out of the response HTML and have it as a string now. I need to retrieve the values as concisely as possible, so that I get a map of this format: Map(query -> "T1 to T2 seconds"). Basically, this is the status of all the slow queries running on a MySQL slave server, and I am building an alerting system over it. So, from this entire paragraph in the form of a String, I need to separate out the queries and save the corresponding time range with them.
1.001303 to 31.856310 is a time range, and the query corresponding to that time range is:
SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I was hoping to save this information in a Map in Scala, a Map of the form (query: String -> timeRange: String).
Another example:
("use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified xyz ;"->"4.825584 to 18.947725 seconds")
"""###(.)###(.)\n\n(.*)###""".r.findAllIn(reqSlowQueryData).matchData foreach {m => println("group0"+m.group(1)+"next group"+m.group(2)+m.group(3)}
I am using the above statement to extract the the repeating cells to do my manipulations on it later. But it doesnt seem to be working;
THANKS IN ADvance! I know there are several ways to do this but all the ones striking me are inefficient and tedious. I need Scala to do the same! Maybe I can extract recursively using the subString method ?
If you want to use Scala, try this:
val regex = """(\d+)\.(\d+).*(\d+)\.(\d+) seconds""".r // extract the time range
val txt = """
|1735 Queries
|
|Taking 1.001303 to 31.856310 seconds to complete
|
|SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
|
|38 Queries
|
|Taking 1.007646 to 5.284330 seconds to complete
|
|SET timestamp=XXX; show slave status;
|
|6 Queries
|
|Taking 1.021271 to 1.959838 seconds to complete
|
|SET timestamp=XXX; SHOW SLAVE STATUS;
|
|2 Queries
|
|Taking 4.825584, 18.947725 seconds to complete
|
|use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
""".stripMargin
def logToMap(txt: String) = {
  val (_, map) = txt.lines.foldLeft[(Option[String], Map[String, String])]((None, Map.empty)) {
    (acc, el) =>
      val (taking, map) = acc // taking contains the range, if one was just seen
      taking match {
        case Some(range) if el.trim.nonEmpty => // Some contains the range
          (None, map + (el -> range)) // add the query line to the map
        case None =>
          regex.findFirstIn(el) match { // extract the range
            case Some(range) => (Some(range), map)
            case _           => (None, map)
          }
        case _ => (taking, map) // probably an empty line
      }
  }
  map
}
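A hypothetical usage with the sample txt above; the keys are the query lines and the values are the extracted ranges:
logToMap(txt) foreach { case (query, range) =>
  println(s"$range -> $query")
}
// e.g. 1.001303 to 31.856310 seconds -> SET timestamp=XXX; SELECT * FROM ABC_EM WHERE ...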
Modified ajozwik's answer to work for SQL commands spanning multiple lines:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
def logToMap(txt:String) = {
val (_,map) = txt.lines.foldLeft[(Option[String],Map[String,String])]((None,Map.empty)){
(accumulator,element) =>
val (taking,map) = accumulator
taking match {
case Some(range) if element.trim.nonEmpty=> {
if (element.contains("Queries"))
(None, map)
else
(Some(range),map+(range->(map.getOrElse(range,"")+element)))
}
case None =>
regex.findFirstIn(element) match {
case Some(range) => (Some(range),map)
case _ => (None,map)
}
case _ => (taking,map)
}
}
println(map)
map
}
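Hypothetical usage again; note that this variant is keyed the other way around (range -> accumulated query lines), which is what lets it concatenate statements spanning several lines:
logToMap(txt) foreach { case (range, query) =>
  println(s"$range -> $query")
}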
Let's say I am creating a reusable component in Ember, and I want a helper function that calls another helper function defined within. For example,
App.SomeCoolComponent = Ember.Component.extend
offset: 50
position: (x) -> x * 100
offsetPosition: # would like (x) -> position(x) + offset
So conceptually, it should return a function that evaluates the position at x, adds the offset, and returns the value. Obviously this is a silly example and I could just write offsetPosition without calling position, but in a more complex scenario that would be repeating code. The problem is I can't figure out how to get this to work. I tried
offsetPosition: (x) -> @get('position')(x) + @get('offset')
which fails because @get isn't defined within the function; it has the wrong scope. I've tried to insert things like Ember.computed in various places, also with no luck; e.g. the following also doesn't work:
offsetPosition: Ember.computed(->
  (x) -> @get('position')(x) + @get('offset')).property('position', 'offset')
What is the correct way of doing this?
Ember version 1.3.0-beta.1+canary.48513b24. Thanks in advance!
Edit: it seems my problem stems from passing the function into a d3 call. For example:
App.SomeCoolComponent = Ember.Component.extend
offset: 50
position: (d, i) -> i * 100
offsetPosition: (d, i) ->
@position(d, i) + @get('offset')
# Some other code not shown
didInsertElement: ->
data = [1, 2, 3]
i = 1
d = data[i]
console.log(@position(d, i)) # works
console.log(@offsetPosition(d, i)) # works
d3.select('svg').selectAll('circle').data(data).enter().append('circle')
.attr('cx', @position) # works
.attr('cy', @offsetPosition) # fails
.attr('r', 30)
The error message is Uncaught TypeError: Object #<SVGCircleElement> has no method 'position'
Any thoughts on how to resolve this issue?
The problem is that you are passing a function, offsetPosition (which references this and expects it to point to App.SomeCoolComponent), to a D3 callback, where this is replaced by the DOM element.
You can solve the problem in two ways:
Using the fat arrow syntax:
d3.select('svg').selectAll('circle').data(data).enter().append('circle')
.attr('cx', @position) # works
.attr('cy', (d, i) => @offsetPosition(d, i))
.attr('r', 30)
Using bind explicitly:
d3.select('svg').selectAll('circle').data(data).enter().append('circle')
.attr('cx', @position) # works
.attr('cy', @offsetPosition.bind(this))
.attr('r', 30)
Methods (i.e., not computed properties) are in the current context and should just be called like methods, not with getters/setters.
offsetPosition: (x) ->
@position(x) + @get("offset")
position: (x) ->
x * 100
Here's an example: http://emberjs.jsbin.com/eWIYICu/3/edit
App.AnAppleComponent = Ember.Component.extend({
offset: 50,
position: function(x) {
return x * 100;
},
offsetPosition: function(x) {
return this.position(x) + this.get('offset');
},
displayOffset: function(){
return this.offsetPosition(Math.floor(Math.random() * 10) + 1);
}.property('offset')
});
Personally I'd create a mixin and add my methods in there, then add the mixin wherever that logic is needed. Mixins are in the scope of whatever they are added to.
BTW, you can use Ember.get(object, 'propertyOnObject') anywhere in the app.
In response to your edit: you are passing the methods themselves into those attribute values instead of the values of those methods (which is why it works above but not below). Since you are sending in the methods, d3 applies them later, way out of scope.
didInsertElement: ->
data = [1, 2, 3]
i = 1
d = data[i]
position = @position(d, i)
offsetPosition = @offsetPosition(d, i)
console.log position
console.log offsetPosition
d3.select("svg").selectAll("circle").data(data).enter().append("circle").attr("cx", position).attr("cy", offsetPosition).attr "r", 30
I have a feeling you want this to update dynamically or something along those lines; if that's the case, you really want to be using computed properties, the bacon of Ember. Here's an updated version of the apple component:
http://emberjs.jsbin.com/eWIYICu/5/edit
<div {{bind-attr style='dynamicStyle'}}>
dynamicStyle: function(){
var rgb = this.get('rgb'),
rgb1 = rgb * 21 % 255,
rgb2 = rgb * 32 % 255,
rgb3 = rgb * 41 % 255;
return 'background-color:rgb(' + rgb1 + ',' + rgb2 + ',' + rgb3 + ');';
}.property('rgb'),