Stream records from a database using Akka Streams - akka

I have a system built on Akka that currently handles incoming streaming data from message queues. When a record arrives it is processed, the message is acked on the queue, and the record is passed on for further handling within the system.
Now I would like to add support for using databases as an input.
What would be a good way for the input source to handle a database? It should stream in more than 100M records at the pace the receiver can handle, so I presume reactive streams / akka-streams.

Slick Library
Slick streaming is how this is usually done.
Extending the slick documentation a bit to include akka streams:
//SELECT Name from Coffees
val q = for (c <- coffees) yield c.name
val action = q.result
type Name = String
val databasePublisher : DatabasePublisher[Name] = db stream action
import akka.stream.scaladsl.Source
val akkaSourceFromSlick : Source[Name, _] = Source fromPublisher databasePublisher
Now akkaSourceFromSlick is like any other akka stream Source.
"Old School" ResultSet
It is also possible to use a plain ResultSet, without slick, as the "engine" for an akka stream. We will utilize the fact that a stream Source can be instantiated from an Iterator.
First create the ResultSet using standard jdbc techniques:
import java.sql._
import scala.util.Try
// The generator must be a function, hence the leading "() =>"
val resultSetGenerator : () => Try[ResultSet] = () => Try {
  val statement : Statement = ???
  statement executeQuery "SELECT Name from Coffees"
}
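A freshly created ResultSet already has its cursor before the first row. If the ResultSet might have been read from already, move the cursor back explicitly (note that beforeFirst() needs a scrollable ResultSet and throws on a forward-only one):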
val adjustResultSetBeforeFirst : (ResultSet) => Try[ResultSet] =
(resultSet) => Try(resultSet.beforeFirst()) map (_ => resultSet)
Once we start iterating through rows we'll have to pull the value from the correct column:
val getNameFromResultSet : ResultSet => Name = _ getString "Name"
And now we can implement the Iterator interface to create an Iterator[Name] from a ResultSet:
val convertResultSetToNameIterator : ResultSet => Iterator[Name] =
(resultSet) => new Iterator[Try[Name]] {
override def hasNext : Boolean = resultSet.next
override def next() : Try[Name] = Try(getNameFromResultSet(resultSet))
} flatMap (_.toOption)
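One caveat with the iterator above: hasNext calls resultSet.next, so calling hasNext twice without an intervening next() would skip a row. A minimal sketch of a lookahead variant follows. It uses a hypothetical Cursor trait and an in-memory ListCursor (neither is part of JDBC or the original answer) in place of ResultSet so that it runs without a database:

```scala
// Hypothetical stand-in for the two ResultSet methods used in the answer.
trait Cursor {
  def next(): Boolean                    // advance one row; false when exhausted
  def getString(column: String): String  // read a column of the current row
}

// In-memory cursor, for demonstration only (the column name is ignored).
final class ListCursor(rows: List[String]) extends Cursor {
  private var remaining = rows
  private var current: Option[String] = None
  def next(): Boolean = remaining match {
    case head :: tail => current = Some(head); remaining = tail; true
    case Nil          => current = None; false
  }
  def getString(column: String): String =
    current.getOrElse(throw new IllegalStateException("no current row"))
}

// Iterator whose hasNext is idempotent: the cursor is advanced at most once
// between two calls to next().
final class CursorIterator(cursor: Cursor, column: String) extends Iterator[String] {
  private var advanced = false
  private var hasRow = false
  override def hasNext: Boolean = {
    if (!advanced) { hasRow = cursor.next(); advanced = true }
    hasRow
  }
  override def next(): String = {
    if (!hasNext) throw new NoSuchElementException("cursor exhausted")
    advanced = false
    cursor.getString(column)
  }
}

object CursorIteratorDemo {
  def main(args: Array[String]): Unit = {
    val it = new CursorIterator(new ListCursor(List("Colombian", "Espresso")), "Name")
    assert(it.hasNext && it.hasNext) // repeated hasNext does not skip rows
    println(it.toList)               // List(Colombian, Espresso)
  }
}
```

The same pattern should carry over to a real ResultSet by replacing Cursor with ResultSet, under the assumption that only next() and getString are needed.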
And finally, glue all the pieces together to create the function we'll need to pass to Source.fromIterator:
val resultSetGenToNameIterator : (() => Try[ResultSet]) => () => Iterator[Name] =
(_ : () => Try[ResultSet])
.andThen(_ flatMap adjustResultSetBeforeFirst)
.andThen(_ map convertResultSetToNameIterator)
.andThen(_ getOrElse Iterator.empty)
This Iterator can now feed a Source:
val akkaSourceFromResultSet : Source[Name, _] =
Source fromIterator resultSetGenToNameIterator(resultSetGenerator)
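This implementation can be reactive all the way down to the database: when the JDBC driver fetches a limited number of rows at a time, data only comes off the hard drive through the database as the stream Sink signals demand. Whether a driver does this by default varies; some buffer the whole result unless a fetch size is configured on the Statement.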

I find the Alpakka documentation to be excellent, and Alpakka is a much easier way to work with reactive streams than the Java Publisher interface.
The Alpakka project is an open source initiative to implement stream-aware, reactive, integration pipelines for Java and Scala. It is built on top of Akka Streams, and has been designed from the ground up to understand streaming natively and provide a DSL for reactive and stream-oriented programming, with built-in support for backpressure.
Documentation for Alpakka with Slick: https://doc.akka.io/docs/alpakka/current/slick.html
Alpakka Github: https://github.com/akka/alpakka

Related

Binary File read in Google Dataflow

I need to read a binary file in Google Dataflow.
I just need to read the file, parse every 64 bytes as one record, and apply some logic to each byte of every 64-byte record.
I tried the same thing in Spark; the code is as below:
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("RecordSplit")
.master("local[*]")
.getOrCreate()
val df = spark.sparkContext.binaryRecords("< binary-file-path>", 64)
val Table = df.map(rec => {
val c1= (convertHexToString(rec(0)))
val c2= convertBinaryToInt16(rec, 48)
val c3= rec(59)
val c4= convertHexToString(rec(50)) match {
case str =>
if (str.startsWith("c"))
2020 + str.substring(1).toInt
else if (str.startsWith("b"))
2010 + str.substring(1).toInt
else if (str.startsWith("b"))
2000 + str.substring(1).toInt
case _ => 1920
}
I would recommend the following:
If you are not limited to python/scala, OffsetBasedSource (FileBasedSource is a subclass) can address your needs because it uses offsets to define the starting and ending positions.
TikaIO is aimed at processing metadata, but according to its documentation it can also read binary data.
The example dataflow-opinion-analysis shows how to read from an arbitrary byte position.
There are additional docs to create a custom Read implementation. You may want to consider looking at these Beam examples for guidance on how to implement a custom source, like this python example.
A different approach would be to build the 64-byte arrays outside the pipeline (in memory) and then create a PCollection from memory; just keep in mind that the documentation recommends this for unit tests.
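The in-memory approach can be sketched in plain Scala, without any Beam or Dataflow API. The object and method names below are hypothetical, and reading byte 59 as an unsigned value is an assumption that only mirrors the rec(59) access in the question:

```scala
object FixedWidthRecords {
  val RecordSize = 64

  // Splits a byte array into complete fixed-size records.
  // A trailing partial record (fewer than `size` bytes) is dropped.
  def splitRecords(bytes: Array[Byte], size: Int = RecordSize): List[Array[Byte]] =
    bytes.grouped(size).filter(_.length == size).toList

  // Hypothetical per-record logic: byte 59 of the record as an unsigned Int.
  def byte59(record: Array[Byte]): Int = record(59) & 0xff

  def main(args: Array[String]): Unit = {
    // 150 bytes: two full records plus 22 leftover bytes
    val data = Array.tabulate[Byte](150)(i => i.toByte)
    val records = splitRecords(data)
    println(records.length)      // 2
    println(records.map(byte59)) // List(59, 123)
  }
}
```

The resulting list of records is what would then be handed to the pipeline as an in-memory collection.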

Return a List from multiple WS calls in a Scala foreach

I am making a WS call to a service that returns a list of a user's connections. After I have received the response I do a foreach on the list, and in the foreach I make a WS call to another service to get more details for each connection.
Currently I am trying to use a ListBuffer but due to the async nature of the calls it is being returned empty before the details have been gathered.
My code is as below which returns an empty List to my controller:
def getAllConnections(username: String) = {
connectionsConnector.getAllConnections(username).map {
connections =>
val connectionsList: ListBuffer[ConnectionsResponse] = ListBuffer()
connections.map {
connection =>
usersService.getUser(connection.connectionUsername).foreach {
case Some(user) =>
val blah = ConnectionsResponse(user, connection)
connectionsList.+=(blah)
}
}
connectionsList.toList
}
}
Any suggestions on how I can return a Future[List] to my controller would be great, thanks.
for {
  connections <- connectionsConnector.getAllConnections(username)
  // look up every user, keeping each connection paired with its lookup result
  usersWithConnection <- Future.traverse(connections) { c => usersService.getUser(c.connectionUsername).map(u => (u, c)) }
} yield usersWithConnection.collect { case (Some(user), conn) => ConnectionsResponse(user, conn) }
This should give you some ideas at least. We can use a for comprehension in the context of a Future. Future.traverse applies a Future-returning function to each element of a list and turns the results into a single Future of a list. Needing to return the connection along with the user adds an extra complication, but we can just map over each individual Future to pair the connection with the user.
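To make the Future.traverse approach concrete, here is a self-contained sketch. The case classes and the two stub methods are hypothetical stand-ins for the question's connectionsConnector and usersService; only the shape of the for comprehension is taken from the answer:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-ins for the question's types.
final case class Connection(connectionUsername: String)
final case class User(name: String)
final case class ConnectionsResponse(user: User, connection: Connection)

object TraverseSketch {
  // Stub for connectionsConnector.getAllConnections
  def getAllConnectionsStub(username: String): Future[List[Connection]] =
    Future.successful(List(Connection("alice"), Connection("ghost"), Connection("bob")))

  // Stub for usersService.getUser; "ghost" simulates a missing user
  def getUserStub(name: String): Future[Option[User]] =
    Future.successful(if (name == "ghost") None else Some(User(name)))

  def getAllConnections(username: String): Future[List[ConnectionsResponse]] =
    for {
      connections <- getAllConnectionsStub(username)
      pairs <- Future.traverse(connections) { c =>
        getUserStub(c.connectionUsername).map(u => (u, c))
      }
    } yield pairs.collect { case (Some(user), conn) => ConnectionsResponse(user, conn) }

  def main(args: Array[String]): Unit = {
    // Await is used only so the demo can print; a controller would return the Future.
    val result = Await.result(getAllConnections("me"), 5.seconds)
    println(result.map(_.user.name)) // List(alice, bob)
  }
}
```

Connections whose user lookup yields None are dropped by collect, and the order of the remaining entries follows the original list.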
Use the monadic for loop:
def getAllConnections(username: String) = connectionsConnector.getAllConnections(username) map { connections =>
  for {
    connection <- connections
    user <- usersService.getUser(connection.connectionUsername)
  }
  yield ConnectionsResponse(user, connection)
}
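I had to guess the exact types you're using so this may need to be adapted, but something very similar to the above should solve your problem. In particular, the inner for comprehension only compiles as written if getUser returns an Option or a collection; if it returns a Future, its generator cannot be mixed with the list generator like this.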
The outer map maps the original future, and since the first generator of the for comprehension is a list, the result will also be a list.

Query parameters for GET requests using Akka HTTP (formerly known as Spray)

One of the features of Akka HTTP (formerly known as Spray) is its ability to automagically marshal and unmarshal data back and forth from json into case classes, etc. I've had success at getting this to work well.
At the moment, I am trying to make an HTTP client that performs a GET request with query parameters. The code currently looks like this:
val httpResponse: Future[HttpResponse] =
Http().singleRequest(HttpRequest(
uri = s"""http://${config.getString("http.serverHost")}:${config.getInt("http.port")}/""" +
s"query?seq=${seq}" +
s"&max-mismatches=${maxMismatches}" +
s"&pam-policy=${pamPolicy}"))
Well, that's not so pretty. It would be nice if I could just pass in a case class containing the query parameters, and have Akka HTTP automagically generate the query parameters, kind of like it does for json. (Also, the server side of Akka HTTP has a somewhat elegant way of parsing GET query parameters, so one would think that it would also have a somewhat elegant way to generate them.)
I'd like to do something like the following:
val httpResponse: Future[HttpResponse] =
Http().singleRequest(HttpRequest(
uri = s"""http://${config.getString("http.serverHost")}:${config.getInt("http.port")}/query""",
entity = QueryParams(seq = seq, maxMismatches = maxMismatches, pamPolicy = pamPolicy)))
Only, the above doesn't actually work.
Is what I want doable somehow with Akka HTTP? Or do I just need to do things the old-fashioned way? I.e, generate the query parameters explicitly, as I do in the first code block above.
(I know that if I were to change this from a GET to a POST, I could probably get it to work more like I would like it to work, since then I could get the contents of the POST request automagically converted from a case class to json, but I don't really wish to do that here.)
You can leverage the Uri class to do what you want. Its withQuery method takes a Uri.Query, which can be built from a Map or from key/value pairs. For example, you could do something like this:
import akka.http.scaladsl.model.{HttpRequest, Uri}
import akka.http.scaladsl.model.Uri.Query
val params = Map("foo" -> "bar", "hello" -> "world")
HttpRequest(uri = Uri(hostAndPath).withQuery(Query(params)))
Or
HttpRequest(uri = Uri(hostAndPath).withQuery(Query("foo" -> "bar", "hello" -> "world")))
Obviously this could be done by extending the capabilities of Akka HTTP, but for what you need (just a tidier way to build the query string), you could do it with some Scala fun:
type QueryParams = Map[String, String]
object QueryParams {
def apply(tuples: (String, String)*): QueryParams = Map(tuples:_*)
}
implicit class QueryParamExtensions(q: QueryParams) {
def toQueryString = "?"+q.map{
case (key,value) => s"$key=$value" //Need to do URL escaping here?
}.mkString("&")
}
implicit class StringQueryExtensions(url: String) {
def withParams(q: QueryParams) =
url + q.toQueryString
}
val params = QueryParams(
"abc" -> "def",
"xyz" -> "qrs"
)
params.toQueryString // gives ?abc=def&xyz=qrs
"http://www.google.com".withParams(params) // gives http://www.google.com?abc=def&xyz=qrs

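The URL-escaping question left open in the QueryParams helper above could be handled with java.net.URLEncoder. This is only a sketch: QueryStringSketch and toQueryString are hypothetical names, and URLEncoder produces form-style encoding (a space becomes +), which is an assumption about what the server accepts:

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object QueryStringSketch {
  // Form-style encoding: spaces become '+', reserved characters become %XX.
  private def enc(s: String): String =
    URLEncoder.encode(s, StandardCharsets.UTF_8.name())

  // Builds "?k1=v1&k2=v2" with keys and values escaped; empty input gives "".
  def toQueryString(params: Seq[(String, String)]): String =
    if (params.isEmpty) ""
    else params.map { case (k, v) => s"${enc(k)}=${enc(v)}" }.mkString("?", "&", "")

  def main(args: Array[String]): Unit = {
    val params = Seq("seq" -> "ACGT", "max-mismatches" -> "2", "pam-policy" -> "a b&c")
    println("http://localhost:8080/query" + toQueryString(params))
    // http://localhost:8080/query?seq=ACGT&max-mismatches=2&pam-policy=a+b%26c
  }
}
```

A Seq of pairs is used instead of a Map so that the parameter order in the output is predictable.
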
Using Thread.sleep() inside a foreach in scala

I have a list of URLs in a List.
I want to get the data by calling WS.url(currurl).get(). However, I want to add a delay between each request. Can I add Thread.sleep(), or is there another way of doing this?
one.foreach {
currurl => {
import play.api.libs.ws.WS
println("using " + currurl)
val p = WS.url(currurl).get()
p.onComplete {
case Success(s) => {
//do something
}
case Failure(f) => {
println("failed")
}
}
}
}
Sure, you can call Thread.sleep inside your foreach function, and it will do what you expect.
That will tie up a thread, though. If this is just some utility that you need to run sometimes, then who cares, but if it's part of some server you are trying to write and you might tie up many threads, then you probably want to do better. One way you could do better is to use Akka (it looks like you are using Play, so you are already using Akka) to implement the delay -- write an actor that uses scheduler.schedule to arrange to receive a message periodically, and then handle one request each time the message is read. Note that Akka's scheduler itself ties up a thread, but it can then send periodic messages to an arbitrary number of actors.
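As an illustration of spacing requests without sleeping on the calling thread, here is a sketch that uses the JDK's ScheduledExecutorService rather than the Akka scheduler mentioned above (a deliberate swap to keep the example dependency-free). fetchStub is a hypothetical stand-in for WS.url(url).get(), and fetchSpaced is an illustrative name:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SpacedRequests {
  // Hypothetical stand-in for WS.url(url).get(); completes immediately.
  def fetchStub(url: String): Future[String] = Future.successful(s"fetched $url")

  // Starts the i-th request i * gap after the call. Only the single scheduler
  // thread is involved in the waiting; no request thread sleeps.
  def fetchSpaced(urls: List[String], gap: FiniteDuration): Future[List[String]] = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val results = urls.zipWithIndex.map { case (url, i) =>
      val promise = Promise[String]()
      val task = new Runnable {
        def run(): Unit = promise.completeWith(fetchStub(url))
      }
      scheduler.schedule(task, gap.toMillis * i, TimeUnit.MILLISECONDS)
      promise.future
    }
    val all = Future.sequence(results)
    all.onComplete(_ => scheduler.shutdown()) // release the scheduler thread
    all
  }

  def main(args: Array[String]): Unit = {
    val urls = List("http://a.example", "http://b.example", "http://c.example")
    // Await is used only so the demo can print the results.
    println(Await.result(fetchSpaced(urls, 100.millis), 5.seconds))
  }
}
```

As with the Akka scheduler, one thread is still dedicated to timing, but it can serve any number of delayed requests.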
You can do it with scalaz-stream
import org.joda.time.format.DateTimeFormat
import scala.concurrent.duration._
import scalaz.stream._
import scalaz.stream.io._
import scalaz.concurrent.Task
type URL = String
type Fetched = String
val format = DateTimeFormat.mediumTime()
val urls: Seq[URL] =
"http://google.com" :: "http://amazon.com" :: "http://yahoo.com" :: Nil
val fetchUrl = channel[URL, Fetched] {
url => Task.delay(s"Fetched " +
s"url:$url " +
s"at: ${format.print(System.currentTimeMillis())}")
}
val P = Process
val process =
(P.awakeEvery(1.second) zipWith P.emitAll(urls))((b, url) => url).
through(fetchUrl)
val fetched = process.runLog.run
fetched.foreach(println)
Output:
Fetched url:http://google.com at: 1:04:25 PM
Fetched url:http://amazon.com at: 1:04:26 PM
Fetched url:http://yahoo.com at: 1:04:27 PM

How to communicate between Agents?

Using the MailboxProcessor in F#, what is the preferred way to communicate between them? - Wrapping the agents into objects like:
type ProcessAgent(saveAgent:SaveAgent) = ...
type SaveAgent() = ...
let saveAgent = new SaveAgent()
let processAgent = new ProcessAgent(saveAgent)
or what about:
type ProcessAgent(cont:string -> unit) = ...
type SaveAgent() = ...
let saveAgent = new SaveAgent()
let processAgent = new ProcessAgent(fun z -> saveAgent.Add z)
or maybe even something like:
type ProcessAgent() = ...
type SaveAgent() = ...
let saveAgent = new SaveAgent()
let processAgent = new ProcessAgent()
processAgent.Process item (fun z -> saveAgent.Add z)
Also, is there ever any reason to wrap a normal function, one that is not maintaining some kind of state, into an agent?
The key thing about encapsulating the agents in classes is that it lets you break the direct dependencies between them. So, you can create the individual agents and then connect them into a bigger "agent network" just by registering event handlers, calling methods, etc.
An agent can essentially expose three kinds of members:
Actions are members of type 'T -> unit. They send some message to the agent without waiting for any reply from the agent. This is essentially wrapping a call to agent.Post.
Blocking actions are members of type 'T -> Async<'R>. This is useful when you're sending some message to the agent, but then want to wait for a response (or confirmation that the action was processed). These do not block a physical thread (they are asynchronous), but they do suspend the calling workflow until the reply arrives. This is essentially wrapping a call to agent.PostAndAsyncReply.
Notifications are members of type IEvent<'T> or IObservable<'T> representing some sort of notification reported from the agent - e.g. when the agent finishes doing some work and wants to notify the caller (or other agents).
In your example, the processing agent is doing some work (asynchronously) and then returns the result, so I think it makes sense to use a "Blocking action". The operation of the saving agent is just an "Action", because it does not return anything. To demonstrate the last case, I'll add a "flushed" notification, which gets triggered when the saving agent saves all queued items to the actual storage:
// Two-way communication processing a string
type ProcessMessage =
PM of string * AsyncReplyChannel<string>
type ProcessAgent() =
let agent = MailboxProcessor.Start(fun inbox -> async {
while true do
let! (PM(s, repl)) = inbox.Receive()
repl.Reply("Echo: " + s) })
// Takes input, processes it and asynchronously returns the result
member x.Process(input) =
agent.PostAndAsyncReply(fun ch -> PM(input, ch))
type SaveAgent() =
let flushed = Event<_>()
let agent = (* ... *)
// Action to be called to save a new processed value
member x.Add(res) =
agent.Post(res)
// Notification triggered when the cache is flushed
member x.Flushed = flushed.Publish
Then you can create both agents and connect them in various ways using the members:
let proc = ProcessAgent()
let save = SaveAgent()
// Process an item and then save the result
async {
  let! r = proc.Process("Hi")
  save.Add(r) }
// Listen to flushed notifications
save.Flushed |> Event.add (fun () ->
printfn "flushed..." )
You don't need to create a class for your agents. Why not just write a function that returns your MailboxProcessor?
let MakeSaveAgent() =
MailboxProcessor<SaveMessageType>.Start(fun inbox ->
(* etc... *))
let MakeProcessAgent (saveAgent: MailboxProcessor<SaveMessageType>) =
MailboxProcessor<ProcessMessageType>.Start(fun inbox ->
(* etc... you can send messages to saveAgent here *))
For your final question: no, not really, that would be adding unnecessary complication when a simple function returning Async<_> would suffice.