I've been playing around with Akka Streams and get the idea of creating Flows and wiring them together using FlowGraphs.
I know this part of Akka is still under development, so some things may not be finished and other bits may change, but is it possible to create a FlowGraph that isn't "complete" (i.e. isn't attached to a Sink) and pass it around to different parts of my code, to be extended by adding Flows to it and finally completed by adding a Sink?
Basically, I'd like to be able to compose FlowGraphs but don't understand how... Especially if a FlowGraph has split a stream by using a Broadcast.
Thanks
Next week (December) will be documentation-writing time for us, so I hope that will help you get into akka streams more easily! That said, here's a quick answer:
Basically you need a PartialFlowGraph instead of a FlowGraph. In those we allow the usage of UndefinedSink and UndefinedSource, which you can then "attach" afterwards. In your case, we also provide a simple helper builder to create graphs which have exactly one "missing" sink – those can be treated exactly as if they were a Source, see below:
// for akka-streams 1.0-M1
val source = Source() { implicit b ⇒
  // prepare an undefined sink, which can be replaced by a proper sink afterwards
  val sink = UndefinedSink[Int]
  // build your processing graph
  Source(1 to 10) ~> sink
  // return the undefined sink which you mean to "fill in" afterwards
  sink
}
// use the partial graph (source) multiple times, each time with a different sink
source.runWith(Sink.ignore)
source.runWith(Sink.foreach(x ⇒ println(x)))
Hope this helps!
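The idea of a graph with one "missing" sink can also be pictured outside Akka. What follows is only a rough, language-neutral sketch in Python, not Akka Streams code: `partial_source` and `run_with` are hypothetical names, and the doubling stage is added purely for illustration. The pipeline is described once, its open end is left unattached, and a different sink is plugged in on each run.

```python
from typing import Callable, Iterable, Iterator, List


def partial_source() -> Iterator[int]:
    """Build the processing pipeline, stopping short of any sink.

    Hypothetical stand-in for the Source() { ... } builder: every call
    creates a fresh stream, so the same blueprint can be run repeatedly.
    """
    numbers = range(1, 11)             # plays the role of Source(1 to 10)
    return (n * 2 for n in numbers)    # an extra stage, assumed for illustration


def run_with(blueprint: Callable[[], Iterable[int]],
             sink: Callable[[int], None]) -> None:
    """Attach a sink to the open end and run the stream to completion."""
    for element in blueprint():
        sink(element)


collected: List[int] = []
run_with(partial_source, lambda _: None)      # comparable to Sink.ignore
run_with(partial_source, collected.append)    # a sink that collects elements
print(collected)
```

Passing the blueprint (a function) rather than a running stream is what allows it to be handed to other parts of the code and completed there.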
Let's assume that I have an input DataStream and want to implement some functionality that requires "memory", so I need a ProcessFunction that gives me access to state. Is it possible to do that directly on the DataStream, or is the only way to keyBy the initial stream and work in a keyed context?
I'm thinking that one solution would be to keyBy the stream with a hardcoded unique key so the whole input stream ends up in the same group. Then technically I have a KeyedStream and I can normally use keyed state, like I'm showing below with keyBy(x->1). But is this a good solution?
DataStream<Integer> inputStream = env.fromSource(...);
DataStream<Integer> outputStream = inputStream
    .keyBy(x -> 1)
    .process(...); // I've got access to state here
As I understand it, that's not a common use case, because the main purpose of Flink is to partition the stream, process the partitions separately and then merge the results. In my scenario that's exactly what I'm doing, but the problem is that the merge step requires state to produce the final "global" result. What I actually want to do is something like this:
DataStream<Integer> inputStream = env.fromElements(1,2,3,4,5,6,7,8,9);
// two groups: group1=[1,2,3,4] & group2=[5,6,7,8,9]
DataStream<Integer> partialResult = inputStream
    .keyBy(val -> val/5)
    .process(<..stateful processing..>);
// Can't do stateful processing here because partialResult is not a KeyedStream
DataStream<Integer> outputStream = partialResult
    .process(<..stateful processing..>);
outputStream.print();
But Flink doesn't seem to allow me to do the final "merge partial results" operation, because I can't get access to state in the process function as partialResult is not a KeyedStream.
I'm a beginner with Flink, so I hope what I'm writing makes sense.
In general I can say that I haven't found a good way to do the "merging" step, especially when it comes to complex logic.
Hope someone can give me some info, tips or correct me if I'm missing something
Thank you for your time
Is "keyBy the stream with a hardcoded unique key" a good idea? Well, normally no, since it forces all data to flow through a single sub-task, so you get no benefit from the full parallelism in your Flink cluster.
If you want to get a global result (e.g. the "best" 3 results, from any results generated in the preceding step) then yes, you'll have to run all records through a single sub-task. So you could have a fixed key value, and use a global window. But note (as the docs state) you need to come up with some kind of "trigger condition", otherwise with a streaming workflow you never know when you really have the best N results, and thus you'd never emit any final result.
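To make the shape of that pipeline concrete, here is a small simulation in plain Python rather than Flink code. The names `keyed_partials`, `global_merge` and `trigger_every` are made up for this sketch, and the "state" is just local dictionaries. Stage one keeps keyed state per group; stage two funnels every partial result through one place (the constant-key sub-task) and only emits when a simple count-based trigger fires.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def keyed_partials(values: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """Stage 1: keyed state, one running sum per key (key = value // 5)."""
    state: Dict[int, int] = defaultdict(int)
    for v in values:
        key = v // 5
        state[key] += v
        yield key, state[key]          # emit the updated partial result


def global_merge(partials: Iterable[Tuple[int, int]],
                 trigger_every: int) -> Iterator[int]:
    """Stage 2: all partials arrive under one constant key (one sub-task).

    The global total is emitted each time `trigger_every` partial updates
    have arrived; this counter plays the role of the trigger condition.
    """
    latest: Dict[int, int] = {}        # state held for the single global key
    seen = 0
    for key, partial in partials:
        latest[key] = partial          # keep only the newest partial per group
        seen += 1
        if seen % trigger_every == 0:
            yield sum(latest.values())


results: List[int] = list(
    global_merge(keyed_partials(range(1, 10)), trigger_every=3)
)
print(results)
```

Without the trigger, the second stage would hold its state forever and never produce output on an unbounded stream, which is the point made above.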
I'm having trouble getting my head around the purpose of supply {…} blocks/the on-demand supplies that they create.
Live supplies (that is, the types that come from a Supplier and get new values whenever that Supplier emits a value) make sense to me – they're a version of asynchronous streams that I can use to broadcast a message from one or more senders to one or more receivers. It's easy to see use cases for responding to a live stream of messages: I might want to take an action every time I get a UI event from a GUI interface, or every time a chat application broadcasts that it has received a new message.
But on-demand supplies don't make a similar amount of sense. The docs say that
An on-demand broadcast is like Netflix: everyone who starts streaming a movie (taps a supply), always starts it from the beginning (gets all the values), regardless of how many people are watching it right now.
Ok, fair enough. But why/when would I want those semantics?
The examples also leave me scratching my head a bit. The Concurrency page currently provides three examples of a supply block, but two of them just emit the values from a for loop. The third is a bit more detailed:
my $bread-supplier = Supplier.new;
my $vegetable-supplier = Supplier.new;
my $supply = supply {
    whenever $bread-supplier.Supply {
        emit("We've got bread: " ~ $_);
    };
    whenever $vegetable-supplier.Supply {
        emit("We've got a vegetable: " ~ $_);
    };
}
$supply.tap( -> $v { say "$v" });
$vegetable-supplier.emit("Radish"); # OUTPUT: «We've got a vegetable: Radish»
$bread-supplier.emit("Thick sliced"); # OUTPUT: «We've got bread: Thick sliced»
$vegetable-supplier.emit("Lettuce"); # OUTPUT: «We've got a vegetable: Lettuce»
There, the supply block is doing something. Specifically, it's reacting to the input of two different (live) Suppliers and then merging them into a single Supply. That does seem fairly useful.
… except that if I want to transform the output of two Suppliers and merge their output into a single combined stream, I can just use
my $supply = Supply.merge:
    $bread-supplier.Supply.map({ "We've got bread: $_" }),
    $vegetable-supplier.Supply.map({ "We've got a vegetable: $_" });
And, indeed, if I replace the supply block in that example with the map/merge above, I get exactly the same output. Further, neither the supply block version nor the map/merge version produce any output if the tap is moved below the calls to .emit, which shows that the "on-demand" aspect of supply blocks doesn't really come into play here.
At a more general level, I don't believe the Raku (or Cro) docs provide any examples of a supply block that isn't either in some way transforming the output of a live Supply or emitting values based on a for loop or Supply.interval. None of those seem like especially compelling use cases, other than as a different way to transform Supplys.
Given all of the above, I'm tempted to mostly write off the supply block as a construct that isn't all that useful, other than as a possible alternate syntax for certain Supply combinators. However, I have it on fairly good authority that
while Supplier is often reached for, many times one would be better off writing a supply block that emits the values.
Given that, I'm willing to hazard a pretty confident guess that I'm missing something about supply blocks. I'd appreciate any insight into what that might be.
Given you mentioned Supply.merge, let's start with that. Imagine it wasn't in the Raku standard library, and we had to implement it. What would we have to take care of in order to reach a correct implementation? At least:
Produce a Supply result that, when tapped, will...
Tap (that is, subscribe to) all of the input supplies.
When one of the input supplies emits a value, emit it to our tapper...
...but make sure we follow the serial supply rule, which is that we only emit one message at a time; it's possible that two of our input supplies will emit values at the same time from different threads, so this isn't an automatic property.
When all of our supplies have sent their done event, send the done event also.
If any of the input supplies we tapped sends a quit event, relay it, and also close the taps of all of the other input supplies.
Make very sure we don't have any odd races that will lead to breaking the supply grammar emit* [done|quit].
When a tap on the resulting Supply we produce is closed, be sure to close the tap on all (still active) input supplies we tapped.
Good luck!
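To give a feel for how much bookkeeping the checklist above implies, here is a rough hand-rolled merge written in Python. Everything in it (`Observer`, `Source`, `merge`) is a hypothetical stand-in invented for this sketch, not a Raku or library API, and it cuts corners (for instance, it trusts each source to send done at most once).

```python
import threading


class Observer:
    """Bundle of the three callbacks a consumer can receive."""

    def __init__(self, emit, done=lambda: None, quit=lambda error: None):
        self.emit, self.done, self.quit = emit, done, quit


class Source:
    """Tiny live event source, a hypothetical stand-in for a Supplier."""

    def __init__(self):
        self.observers = []

    def tap(self, observer):
        self.observers.append(observer)

        def close():
            if observer in self.observers:
                self.observers.remove(observer)
        return close

    def emit(self, value):
        for o in list(self.observers):
            o.emit(value)

    def done(self):
        for o in list(self.observers):
            o.done()

    def quit(self, error):
        for o in list(self.observers):
            o.quit(error)


def merge(sources, downstream):
    lock = threading.Lock()      # serial rule: one message at a time
    closers = []                 # every tap we opened, so we can close it
    remaining = len(sources)     # inputs that have not sent done yet
    finished = False             # guards the grammar emit* [done|quit]

    def close_all():
        for close_tap in closers:
            close_tap()

    def on_emit(value):
        with lock:
            if not finished:
                downstream.emit(value)

    def on_done():
        nonlocal remaining, finished
        with lock:
            if finished:
                return
            remaining -= 1
            if remaining == 0:   # all inputs done: relay done, release taps
                finished = True
                close_all()
                downstream.done()

    def on_quit(error):
        nonlocal finished
        with lock:
            if finished:
                return
            finished = True      # relay the error and drop the other inputs
            close_all()
            downstream.quit(error)

    for source in sources:
        closers.append(source.tap(Observer(on_emit, on_done, on_quit)))

    def close():                 # consumer lost interest: release all taps
        nonlocal finished
        with lock:
            finished = True
            close_all()
    return close


log = []
a, b = Source(), Source()
merge([a, b], Observer(log.append, lambda: log.append("done")))
a.emit(1); b.emit(2); a.done(); b.emit(3); b.done(); b.emit(4)
print(log)
```

Even this simplified version needs a lock, a completion counter, a terminal flag and a list of taps to close, and each of them is a place to get things subtly wrong.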
So how does the standard library do it? Like this:
method merge(*@s) {
    @s.unshift(self) if self.DEFINITE; # add if instance method
    # [I elided optimizations for when there are 0 or 1 things to merge]
    supply {
        for @s {
            whenever $_ -> \value { emit(value) }
        }
    }
}
The point of supply blocks is to greatly ease correctly implementing reusable operations over one or more Supplys. The key risks it aims to remove are:
Not correctly handling concurrently arriving messages in the case that we have tapped more than one Supply, potentially leading us to corrupt state (since many supply combinators we might wish to write will have state too; merge is so simple as not to). A supply block promises us that we'll only be processing one message at a time, removing that danger.
Losing track of subscriptions, and thus leaking resources, which will become a problem in any longer-running program.
The second is easy to overlook, especially when working in a garbage-collected language like Raku. Indeed, if I start iterating some Seq and then stop doing so before reaching the end of it, the iterator becomes unreachable and the GC eats it in a while. If I'm iterating over lines of a file and there's an implicit file handle there, I risk the file not being closed in a very timely way and might run out of handles if I'm unlucky, but at least there's some path to it getting closed and the resources released.
Not so with reactive programming: the references point from producer to consumer, so if a consumer "stops caring" but hasn't closed the tap, then the producer will retain its reference to the consumer (thus causing a memory leak) and keep sending it messages (thus doing throwaway work). This can eventually bring down an application. The Cro chat example that was linked is an example:
my $chat = Supplier.new;
get -> 'chat' {
    web-socket -> $incoming {
        supply {
            whenever $incoming -> $message {
                $chat.emit(await $message.body-text);
            }
            whenever $chat -> $text {
                emit $text;
            }
        }
    }
}
What happens when a WebSocket client disconnects? The tap on the Supply we returned using the supply block is closed, causing an implicit close of the taps of the incoming WebSocket messages and also of $chat. Without this, the subscriber list of the $chat Supplier would grow without bound, and in turn keep alive an object graph of some size for each previous connection too.
Thus, even in this case where a live Supply is very directly involved, we'll often have subscriptions to it that come and go over time. On-demand supplies are primarily about resource acquisition and release; sometimes, that resource will be a subscription to a live Supply.
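The acquire/release view can be sketched in a few lines of Python. This is an illustration only: `Supplier` and `connect` here are hypothetical miniatures written for the sketch, not the Raku or Cro API. The live supplier keeps a list of subscribers, and each "connection" acquires a subscription when it is tapped and releases it when it is closed.

```python
class Supplier:
    """Minimal live broadcaster that remembers who is subscribed."""

    def __init__(self):
        self.taps = []

    def tap(self, callback):
        self.taps.append(callback)
        # The returned function releases the subscription again.
        return lambda: self.taps.remove(callback)

    def emit(self, value):
        for callback in list(self.taps):
            callback(value)


def connect(chat, outgoing):
    """Acquire a subscription for one client; return the matching release."""
    close_chat_tap = chat.tap(outgoing)

    def close():
        close_chat_tap()
    return close


chat = Supplier()
received = []

# Three clients connect, receive one message each, then disconnect.
for client_id in range(3):
    close = connect(chat, lambda text, cid=client_id: received.append((cid, text)))
    chat.emit(f"hello from {client_id}")
    close()

# A client that never closes its tap stays in the subscriber list.
leaky_chat = Supplier()
for client_id in range(3):
    connect(leaky_chat, lambda text: None)

print(len(chat.taps), len(leaky_chat.taps))
```

In the first loop the subscriber list is empty again once every client has gone; in the second it grows with every connection, which is the leak described above.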
A fair question is if we could have written this example without a supply block. And yes, we can; this probably works:
my $chat = Supplier.new;
get -> 'chat' {
    web-socket -> $incoming {
        my $emit-and-discard = $incoming.map(-> $message {
            $chat.emit(await $message.body-text);
            Supply.from-list()
        }).flat;
        Supply.merge($chat, $emit-and-discard)
    }
}
Note that it takes some effort in Supply-space to map into nothing. I personally find that less readable - and this didn't even avoid a supply block, it's just hidden inside the implementation of merge. Trickier still are cases where the number of supplies that are tapped changes over time, such as in recursive file watching where new directories to watch may appear. I don't really know how I'd express that in terms of combinators that appear in the standard library.
I spent some time teaching reactive programming (not with Raku, but with .Net). Things were easy with one asynchronous stream, but got more difficult when we started getting to cases with multiple of them. Some things fit naturally into combinators like "merge" or "zip" or "combine latest". Others can be bashed into those kinds of shapes with enough creativity - but it often felt contorted to me rather than expressive. And what happens when the problem can't be expressed in the combinators? In Raku terms, one creates output Suppliers, taps input supplies, writes logic that emits things from the inputs into the outputs, and so forth. Subscription management, error propagation, completion propagation, and concurrency control have to be taken care of each time - and it's oh so easy to mess it up.
Of course, the existence of supply blocks doesn't stop people taking the fragile path in Raku too. This is what I meant when I said:
while Supplier is often reached for, many times one would be better off writing a supply block that emits the values
I wasn't thinking here about the publish/subscribe case, where we really do want to broadcast values and are at the entrypoint to a reactive chain. I was thinking about the cases where we tap one or more Supply, take the values, do something, and then emit things into another Supplier. Here is an example where I migrated such code towards a supply block; here is another example that came a little later on in the same codebase. Hopefully these examples clear up what I had in mind.
I am using zeromq to create a generic dynamic graph setup. I already have a XPUB/XSUB setup but am wondering if there is a zmq way of adding a sequence number/timestamp to each message going through generated by the proxy in order to have a uniquely sequenced “tape” of events?
Q : "... but am wondering if there is a zmq way of adding ... to each message ...?"
No, there is not. The ZeroMQ way would be to have this done with Zero-Copy and (almost) Zero-Latency.
No such way exists for your use-case.
Solution ? Doable :
Create a transforming node, where each message gets transformed accordingly ( SEQ-number added and TimeSTAMP datum { pre | ap }-pended ). This requires you to implement such a node yourself and to handle all of these steps, together with any exceptions, per incident.
The ready-made, API-documented zmq_proxy() simply does not, cannot and should not cover these specific requirements, as it was designed for other purposes ( it uses Zero-Copy for the most efficient pass-through, plus possibly an efficient MITM-logger mode of service ).
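The core of such a transforming node might look like the following Python sketch. It deliberately contains no socket code: the `Sequencer` class, its `stamp` method and the 16-byte header layout are assumptions made for this example, and the surrounding loop (receive a multipart message on the XSUB side, stamp it, send it on the XPUB side, and forward subscription messages in the opposite direction) would still have to be written by hand in place of zmq_proxy().

```python
import struct
import time
from typing import Callable, List, Tuple


class Sequencer:
    """Stamps each multipart message with a sequence number and a timestamp."""

    def __init__(self, clock: Callable[[], int] = time.time_ns) -> None:
        self._next_seq = 0
        self._clock = clock            # injectable so tests can fix the time

    def stamp(self, frames: List[bytes]) -> List[bytes]:
        # Keep the first frame (the topic) in front, so that prefix-based
        # subscription matching downstream still sees the topic first.
        topic, payload = frames[0], frames[1:]
        # Assumed header layout: two unsigned 64-bit ints, network byte order.
        header = struct.pack("!QQ", self._next_seq, self._clock())
        self._next_seq += 1
        return [topic, header, *payload]


def read_header(frames: List[bytes]) -> Tuple[int, int]:
    """Recover (sequence number, timestamp in ns) from a stamped message."""
    seq, timestamp_ns = struct.unpack("!QQ", frames[1])
    return seq, timestamp_ns


sequencer = Sequencer(clock=lambda: 1_700_000_000_000_000_000)
first = sequencer.stamp([b"prices", b"42"])
second = sequencer.stamp([b"prices", b"43"])
print(read_header(first), read_header(second))
```

Because a single node assigns the numbers, the resulting "tape" is totally ordered, at the price of every message passing through (and being copied by) that one node.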
I am trying to test a MailboxProcessor in F#. I want to test that the function f I am giving is actually executed when posting a message.
The original code is using Xunit, but I made an fsx of it that I can execute using fsharpi.
So far I am doing this :
open System
open FSharp
open System.Threading
open System.Threading.Tasks
module MyModule =
    type Agent<'a> = MailboxProcessor<'a>
    let waitingFor timeOut (v:'a) =
        let cts = new CancellationTokenSource(timeOut |> int)
        let tcs = new TaskCompletionSource<'a>()
        cts.Token.Register(fun (_) -> tcs.SetCanceled()) |> ignore
        tcs, Async.AwaitTask tcs.Task
    type MyProcessor<'a>(f:'a->unit) =
        let agent = Agent<'a>.Start(fun inbox ->
            let rec loop() = async {
                let! msg = inbox.Receive()
                // some more complex logic would be used here
                f msg
                return! loop()
            }
            loop()
        )
        member this.Post(msg:'a) =
            agent.Post msg
open MyModule
let myTest =
    async {
        let (tcs, waitingFor) = waitingFor 5000 0
        let doThatWhenMessagepostedWithinAgent msg =
            tcs.SetResult(msg)
        let p = new MyProcessor<int>(doThatWhenMessagepostedWithinAgent)
        p.Post 3
        let! result = waitingFor
        return result
    }
myTest
|> Async.RunSynchronously
|> System.Console.WriteLine
// displays 3 as expected
This code works, but it does not look fine to me.
1) Is the usage of TaskCompletionSource normal in F#, or is there some dedicated mechanism that lets me wait for a completion?
2) I am using a second argument in the waitingFor function in order to constrain its type. I know I could use a type MyType<'a>() to do it; is there another option? I would rather not use a new MyType, which I find cumbersome.
3) Is there any other option to test my agent than doing this? The only post I found so far about the subject is this blog post from 2009: http://www.markhneedham.com/blog/2009/05/30/f-testing-asynchronous-calls-to-mailboxprocessor/
This is a tough one, I've been trying to tackle this for some time as well. This is what I found so far, it's too long for a comment but I'd hesitate to call it a full answer either...
The options below go from simplest to most complex; which one fits really depends on how thoroughly you want to test and how complex the agent logic is.
Your solution may be fine
What you have is fine for small agents whose only role is to serialize access to an async resource, with little or no internal state handling. If you provide the f as you do in your example, you can be pretty sure it will be called within a relatively short timeout of a few hundred milliseconds. Sure, it seems clunky and it doubles the size of the code with all the wrappers and helpers, but those can be reused if you test more agents and/or more scenarios, so the cost gets amortized fairly quickly.
The problem I see with this is that it's not very useful if you also want to verify more than that the function was called - for example the internal agent state after calling it.
One note that's applicable to other parts of the response as well: I usually start agents with a cancellation token, it makes both production and testing life cycle easier.
Use Agent reply channels
Add AsyncReplyChannel<'reply> to the message type and post messages using PostAndAsyncReply instead of Post method on the Agent. It will change your agent to something like this:
type MyMessage<'a, 'b> = 'a * AsyncReplyChannel<'b>
type MyProcessor<'a, 'b>(f:'a->'b) =
    // Using the MyMessage type here to simplify the signature
    let agent = Agent<MyMessage<'a, 'b>>.Start(fun inbox ->
        let rec loop() = async {
            let! msg, replyChannel = inbox.Receive()
            // f is a plain function here; use let! instead if f returns Async<'b>
            let result = f msg
            // Sending the result back to the original poster
            replyChannel.Reply result
            return! loop()
        }
        loop()
    )
    // Notice the type change, may be handled differently, depends on you
    member this.Post(msg:'a): Async<'b> =
        agent.PostAndAsyncReply(fun channel -> msg, channel)
This may seem like an artificial requirement for the agent "interface", but it's handy to simulate a method call and it's trivial to test - await the PostAndAsyncReply (with a timeout) and you can get rid of most of the test helper code.
Since you have a separate call to the provided function and replyChannel.Reply, the response can also reflect the agent state, not just the function result.
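The same request/reply shape can be tried out in Python with asyncio, as a rough analogue rather than a translation of MailboxProcessor. The `Agent` class, `post_and_reply` and the processed-message counter below are all invented for this sketch. A future plays the role of the reply channel, and the reply carries both the function result and a piece of agent state.

```python
import asyncio


class Agent:
    """Single-consumer mailbox that answers each message via a future."""

    def __init__(self, f):
        self._f = f
        self._inbox = asyncio.Queue()
        self._processed = 0            # internal state, visible via replies
        self._task = None

    def start(self):
        self._task = asyncio.get_running_loop().create_task(self._loop())

    def stop(self):
        self._task.cancel()

    async def _loop(self):
        while True:
            msg, reply = await self._inbox.get()
            result = self._f(msg)
            self._processed += 1
            # The reply reflects the agent state, not only the function result.
            reply.set_result((result, self._processed))

    async def post_and_reply(self, msg):
        reply = asyncio.get_running_loop().create_future()
        await self._inbox.put((msg, reply))
        return await reply


async def demo():
    agent = Agent(lambda x: x * 2)
    agent.start()
    try:
        # The timeout keeps a broken agent from hanging the test forever.
        first = await asyncio.wait_for(agent.post_and_reply(3), timeout=1.0)
        second = await asyncio.wait_for(agent.post_and_reply(5), timeout=1.0)
        return first, second
    finally:
        agent.stop()


replies = asyncio.run(demo())
print(replies)
```

The test itself shrinks to "post, await with a timeout, compare", with no completion source or callback plumbing of its own.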
Black-box model-based testing
This is what I'll talk about in most detail as I think it's most general.
In case the agent encapsulates more complex behavior, I found it handy to skip testing individual messages and use model-based tests to verify whole sequences of operations against a model of expected external behavior. I'm using the FsCheck.Experimental API for this.
In your case this would be doable, but wouldn't make much sense since there is no internal state to model. To give you an example what it looks like in my particular case, consider an agent which maintains client WebSocket connections for pushing messages to the clients. I can't share the whole code, but the interface looks like this
/// For simplicity, this adapts to the socket.Send method and makes it easy to mock
type MessageConsumer = ArraySegment<byte> -> Async<bool>
type Message =
    /// Send payload to client and expect a result of the operation
    | Send of ClientInfo * ArraySegment<byte> * AsyncReplyChannel<Result>
    /// Client connects, remember it for future Send operations
    | Subscribe of ClientInfo * MessageConsumer
    /// Client disconnects
    | Unsubscribe of ClientInfo
Internally the agent maintains a Map<ClientInfo, MessageConsumer>.
Now for testing this, I can model the external behavior in terms of informal specification like: "sending to a subscribed client may succeed or fail depending on the result of calling the MessageConsumer function" and "sending to an unsubscribed client shouldn't invoke any MessageConsumer". So I can define types for example like these to model the agent.
type ConsumerType =
    | SucceedingConsumer
    | FailingConsumer
    | ExceptionThrowingConsumer
type SubscriptionState =
    | Subscribed of ConsumerType
    | Unsubscribed
type AgentModel = Map<ClientInfo, SubscriptionState>
And then use FsCheck.Experimental to define the operations of adding and removing clients with differently successful consumers and trying to send data to them. FsCheck then generates random sequences of operations and verifies the agent implementation against the model after each step.
This does require some additional "test only" code and has a significant mental overhead at the beginning, but lets you test relatively complex stateful logic. What I particularly like about this is that it helps me test the whole contract, not just individual functions/methods/messages, the same way that property-based/generative testing helps test with more than just a single value.
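Stripped of the FsCheck machinery, the core loop of a model-based test can be sketched in plain Python. This is a simplification under several assumptions: the system under test is a synchronous `ConnectionRegistry` class invented for the sketch instead of a real agent, the result strings are made up, and the standard `random` module stands in for a proper generator with shrinking.

```python
import random


def raising_consumer(payload):
    raise IOError("socket closed")


# The three kinds of consumers from the model, as callables.
CONSUMERS = {
    "succeeding": lambda payload: True,
    "failing": lambda payload: False,
    "throwing": raising_consumer,
}


class ConnectionRegistry:
    """Hypothetical system under test: remembers one consumer per client."""

    def __init__(self):
        self._consumers = {}

    def subscribe(self, client, consumer):
        self._consumers[client] = consumer

    def unsubscribe(self, client):
        self._consumers.pop(client, None)

    def send(self, client, payload):
        consumer = self._consumers.get(client)
        if consumer is None:
            return "not-subscribed"
        try:
            return "ok" if consumer(payload) else "failed"
        except Exception:
            return "failed"


def expected_send(model, client):
    """What the informal specification says a send should return."""
    kind = model.get(client)
    if kind is None:
        return "not-subscribed"
    return "ok" if kind == "succeeding" else "failed"


def run_random_sequence(seed, steps=200):
    rng = random.Random(seed)
    sut, model = ConnectionRegistry(), {}   # model: client -> consumer kind
    checked = 0
    for _ in range(steps):
        client = rng.choice(["a", "b", "c"])
        operation = rng.choice(["subscribe", "unsubscribe", "send"])
        if operation == "subscribe":
            kind = rng.choice(sorted(CONSUMERS))
            sut.subscribe(client, CONSUMERS[kind])
            model[client] = kind
        elif operation == "unsubscribe":
            sut.unsubscribe(client)
            model.pop(client, None)
        else:
            # Compare the real behaviour with the model after each send.
            assert sut.send(client, b"x") == expected_send(model, client)
            checked += 1
    return checked


total_checked = sum(run_random_sequence(seed) for seed in range(20))
print(total_checked)
```

A library such as FsCheck adds what this sketch lacks, in particular shrinking a failing sequence down to a minimal reproduction.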
Use Actors
I haven't gone that far yet, but what I've also heard as an alternative is using for example Akka.NET for full-fledged actor model support, and use its testing facilities which let you run agents in special test contexts, verify expected messages and so on. As I said, I don't have first-hand experience, but seems like a viable option for more complex stateful logic (even on a single machine, not in a distributed multi-node actor system).
It seems like this must be happening in many different contexts, such as adding subtitles. What I want to do is grab a frame, change some feature within it and then "put it back" so that the end user sees this change. I think I know how to grab and modify the frame, but I do not see how to re-insert it into the stream. I would appreciate a link or code.
On a live stream, there are a few things to consider depending on what the end goal might be. If it's true packet / frame level manipulation you would likely need to make the modification and set the output to a new stream (source remains unscathed but new stream has the modification). Modifying the stream inline will be very problematic.
Packet level modification using IMediaStreamLivePacketNotify
You can implement the IMediaStreamLivePacketNotify interface to handle new packets and modify them as necessary. Example implementation:
private class PacketListener implements IMediaStreamLivePacketNotify
{
    @Override
    public void onLivePacket(IMediaStream stream, AMFPacket packet)
    {
        // handle packet modifications
    }
}
After modifying the packet, you could publish it to a secondary stream through the Publisher object:
Publisher publisher = Publisher.createInstance(vhost, appName, appInstName);
The publisher contains functionality to add A/V data to your new stream:
switch (packet.getType())
{
    case IVHost.CONTENTTYPE_AUDIO:
        publisher.addAudioData(packet.getData(), packet.getAbsTimecode());
        break;
    case IVHost.CONTENTTYPE_VIDEO:
        publisher.addVideoData(packet.getData(), packet.getAbsTimecode());
        break;
    case IVHost.CONTENTTYPE_DATA:
    case IVHost.CONTENTTYPE_DATA3:
        publisher.addDataData(packet.getData(), packet.getAbsTimecode());
        break;
}
There is similar functionality within the Duplicate Streams module for a broader look at this implementation.
Packet level modification using getPlayPackets()
You could also look at the IMediaStream object and leverage the IMediaStream.getPlayPackets() functionality. Then you can obtain the packets and modify as needed in a corresponding thread that continually processes the inbound stream. Thereafter, you could use the Publisher object to publish the new stream (similar to the above).
Metadata injection
However, if you are just looking to inject some metadata the process becomes much more basic. You can modify the AMFDataList within the source stream to include the new meta information.
Adding onto the stream
If you are looking to add data onto the inline stream (vs. modifying it) you could simply add it via the IMediaStream object:
IMediaStream.addAudioData(..)