I am reading documentation about stateful functions in Apache Beam and I don't understand some parts of it.
I read here that if I do not assign a window to a PCollection, the global window is used by default:
Note that even if you don’t set a windowing function, there is still a window – all elements in your PCollection are assigned to a single global window.
After that I read this article about stateful processing in Apache Beam. It says that a stateful function is parallelized per key and window by default:
A state cell in Beam is scoped to a key+window pair.
Am I right that, for an unbounded collection with, say, ten unique keys, I will effectively have ten separate instances of the stateful PTransform, one per key, each with state that lives forever?
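One way to picture the quoted scoping rule is a toy model in plain Python. This is not the Beam API: `CountingStatefulFn`, `GLOBAL_WINDOW` and the key names are made up for illustration. Under that rule, ten unique keys in the global window would correspond to ten state cells inside one transform, rather than ten copies of the transform:

```python
from collections import defaultdict

GLOBAL_WINDOW = "GlobalWindow"  # hypothetical stand-in for Beam's single global window


class CountingStatefulFn:
    """Toy stand-in for a stateful function that counts elements per state cell."""

    def __init__(self):
        # One state cell per (key, window) pair, mirroring the quoted scoping rule.
        self.cells = defaultdict(int)

    def process(self, key, value, window=GLOBAL_WINDOW):
        self.cells[(key, window)] += 1
        return key, self.cells[(key, window)]


fn = CountingStatefulFn()
# 1000 elements spread over ten unique keys, all in the global window.
for i in range(1000):
    fn.process(f"key-{i % 10}", i)

print(len(fn.cells))  # 10
```

Since the window never closes in this model, nothing ever discards a cell, so each of the ten cells lives as long as the object does.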
I have read a lot of articles recently, including the official documentation, trying to understand how the Global Window works in Apache Beam. I have read similar questions here on Stack Overflow, but I still don't understand it.
According to the official docs:
You can use the single global window if you are working with an unbounded data set (e.g. from a streaming data source) but use caution when applying aggregating transforms such as GroupByKey and Combine. The single global window with a default trigger generally requires the entire data set to be available before processing, which is not possible with continuously updating data.
So the Global Window doesn't have an end, which makes sense since it's global. The docs recommend using a non-default trigger when doing aggregations, because the default trigger fires panes when the window closes:
Set a non-default trigger. This allows the global window to emit results under other conditions, since the default windowing behavior (waiting for all data to arrive) will never occur.
I'm confused by this. The logic here would be that the Global Window wouldn't be able to fire events to the next step of the pipeline, because it never ends and thus the default trigger never fires. However, this isn't what happens in a real scenario. If I read from an unbounded PCollection with a global window, the events are still pushed downstream.
Could someone clarify this for me? How does the default Global Window with the default trigger work in Apache Beam for unbounded PCollections? I'm assuming that it does not aggregate results at all and just handles the events as they arrive, one by one. I would like to be sure that's the case.
The default trigger fires when the watermark reaches the end of the window, based on event time. This never occurs for a GlobalWindow, so if you use a GlobalWindow the default trigger will never fire.
But if you set a non-default-trigger, for example to fire after a certain number of elements are processed (using the AfterCount trigger), your elements can be emitted even for a GlobalWindow. See here for more information regarding Beam triggers.
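The difference between the two triggers can be sketched as a small simulation in plain Python. It does not use the Beam SDK; `default_trigger_fires` and `run_with_count_trigger` are hypothetical helpers, and the event times are made up:

```python
import math

GLOBAL_WINDOW_END = math.inf  # the global window never ends


def default_trigger_fires(watermark, window_end):
    """Default trigger: fire once the watermark reaches the end of the window."""
    return watermark >= window_end


def run_with_count_trigger(elements, count):
    """Emit a pane every `count` elements, independent of the watermark."""
    panes, buffer = [], []
    for element in elements:
        buffer.append(element)
        if len(buffer) == count:
            panes.append(buffer)
            buffer = []
    return panes, buffer  # leftover elements are still waiting for the trigger


events = list(range(1, 8))  # seven events with event times 1..7

# No finite watermark ever reaches the end of the global window.
fired = [default_trigger_fires(t, GLOBAL_WINDOW_END) for t in events]
print(any(fired))  # False

# A count-based trigger emits panes regardless of where the window ends.
panes, pending = run_with_count_trigger(events, count=3)
print(panes)    # [[1, 2, 3], [4, 5, 6]]
print(pending)  # [7]
```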
Triggers let us decide when the results of a window are computed.
When we say default trigger, we mean a repeated AfterWatermark trigger,
where AfterWatermark fires the pane once the watermark passes the end of the window.
Coming back to your question:
How the default Global Window with default trigger works in Apache Beam for unbounded pcollections?
If you use the global window with the default trigger, an aggregation never emits a result, because the data keeps arriving.
The global window never ends, so the trigger never fires.
And yes, your assumption is correct: nothing is aggregated, and the events are simply handled as they arrive, one by one.
Reference:
https://www.waitingforcode.com/apache-beam/triggers-apache-beam/read#:~:text=Apache%20Beam%20comes%20with%204,on%20element's%20event%20time%20property.&text=processing%20time%20%2D%20this%20trigger%20is,value%20of%20processing%20time%20watermark.
My Dataflow pipeline collates event data into typed per session and per user PCollections output. I have employed GroupByKey for events keyed by session id. Sessions are grouped into parent types keyed by user id and device id using the same pattern at this next level of hierarchy. So a single session might generate many events, but in turn a single user might generate many sessions.
I would now like to summarize this data across each level of the hierarchy. I have used a StateSpec declaration to persist state at the event level. So for example, an event count property can be incremented in my event processing ParDo. (Use Case : generating an error event per session across all users for example.)
But as each ParDo is static - I cannot access the ValueState outside of the ParDo context even though my understanding is this state is maintained at the Window scope. (Maybe this is by design.) Is there a way to access this Window level state using the Beam State persistence lib in another ParDo than where it was originally declared? Like as if I could declare it at the pipeline level?
I understand that this may introduce some performance overhead as the framework must manage concurrency, but the actual processing seems negligible. (Just incrementing values.) So I would prefer to write this to a window level state field rather than percolate values up via my hierarchy.
Sharing state across ParDos is not supported, and it shouldn't be encouraged either: it introduces dependencies among ParDos that break the simple contract that each ParDo works on its PCollection independently, which is what enables massive parallelism.
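Given that, the remaining route is the one the question calls percolating values up the hierarchy: aggregate at one level, re-key, and aggregate again. A minimal sketch in plain Python, standing in for the grouping steps rather than using the Beam API; the event tuples and names are hypothetical:

```python
from collections import Counter

# Hypothetical events: (user_id, session_id, event_type)
events = [
    ("u1", "s1", "click"), ("u1", "s1", "error"),
    ("u1", "s2", "error"), ("u2", "s3", "click"),
]

# Level 1: count error events per (user, session), like a per-key combine.
errors_per_session = Counter(
    (user, session) for user, session, kind in events if kind == "error"
)

# Level 2: re-key by user and sum the session-level counts.
errors_per_user = Counter()
for (user, _session), count in errors_per_session.items():
    errors_per_user[user] += count

# Level 3: a single total across all users.
total_errors = sum(errors_per_user.values())

print(dict(errors_per_user))  # {'u1': 2}
print(total_errors)           # 2
```

Each level depends only on the output of the previous one, so no step needs to read another step's state.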
My application is futures-based with async/await, and has the following structure within one of its components:
a "manager", which is responsible for starting/stopping/restarting "workers", based both on external input and on the current state of "workers";
a dynamic set of "workers", which perform some continuous work, but may fail or be stopped externally.
A worker is just a spawned task which does some I/O work. Internally it is a loop which is intended to be infinite, but it may exit early due to errors or other reasons, and in this case the worker must be restarted from scratch by the manager.
The manager is implemented as a loop which awaits on several channels, including one returned by async_std::stream::interval, which essentially makes the manager into a poller - and indeed, I need this because I do need to poll some Mutex-protected external state. Based on this state, the manager, among everything else, creates or destroys its workers.
Additionally, the manager stores a set of async_std::task::JoinHandles representing live workers, and it uses these handles to check whether any worker has exited, restarting it if so. (BTW, I currently do this using select(handle, future::ready()), which is totally suboptimal because it relies on an implementation detail of select, specifically that it polls the left future first. I couldn't find a better way of doing it; something like race() would make more sense, but race() consumes both futures, which won't work for me because I don't want to lose the JoinHandle if it is not ready. This is a matter for another question, though.)
You can see that in this design workers can only be restarted when the next poll "tick" in the manager occurs. However, I don't want to use too small an interval for polling, because in most cases polling just wastes CPU cycles. Large intervals, however, can delay restarting a failed/canceled worker by too much, leading to undesired latencies. Therefore, I thought I'd set up another channel of ()s back from each worker to the manager, which I'd add to the main manager loop, so that when a worker stops due to an error or otherwise, it first sends a message to its channel, waking the manager up earlier than the next poll so that it can restart the worker right away.
Unfortunately, with any kinds of channels this might result in more polls than needed, in case two or more workers stop at approximately the same time (which due to the nature of my application, is somewhat likely to happen). In such case it would make sense to only run the manager loop once, handling all of the stopped workers, but with channels it will necessarily result in the number of polls equal to the number of stopped workers, even if additional polls don't do anything.
Therefore, my question is: how do I notify the manager from its workers that they are finished, without resulting in extra polls in the manager? I've tried the following things:
As explained above, regular unbounded channels just won't work.
I thought that maybe bounded channels could work - if I used a channel with capacity 0, and there was a way to try and send a message into it but just drop the message if the channel is full (like the offer() method on Java's BlockingQueue), this seemingly would solve the problem. Unfortunately, the channels API, while providing such a method (try_send() seems to be like it), also has this property of having capacity larger than or equal to the number of senders, which means it can't really be used for such notifications.
Some kind of atomic or mutex-protected boolean flag also looks as if it could work, but no atomic or mutex API provides a future to wait on, so this would also require polling.
Restructure the manager implementation to include JoinHandles into the main select somehow. It might do the trick, but it would result in large refactoring which I'm unwilling to make at this point. If there is a way to do what I want without this refactoring, I'd like to use that first.
I guess some kind of combination of atomics and channels might work, something like setting an atomic flag and sending a message, and then skipping any extra notifications in the manager based on the flag (which is flipped back to off after processing one notification), but this also seems like a complex approach, and I wonder if anything simpler is possible.
I recommend using the FuturesUnordered type from the futures crate. This collection allows you to push many futures of the same type into a collection and wait for any one of them to complete at once.
It implements Stream, so if you import StreamExt, you can use unordered.next() to obtain a future that completes once any future in the collection completes.
If you also need to wait for a timeout or mutex etc., you can use select to create a future that completes once either the timeout or one of the join handles completes. The future returned by next() implements Unpin, so it is usable with select without problems.
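The shape of this suggestion (wait on all join handles plus a timeout in one place, and keep the handles that are not finished) can be sketched in Python's asyncio, used here only as a stand-in for the Rust futures crate. `worker`, `manager_tick` and the events are hypothetical, and asyncio.wait is an analogue of FuturesUnordered with select, not the same API:

```python
import asyncio


async def worker(name, stop):
    """Hypothetical worker: runs until `stop` is set, then exits."""
    await stop.wait()
    return name


async def manager_tick(handles, poll_interval):
    """Wait until any worker exits or the poll interval elapses, whichever is first."""
    done, pending = await asyncio.wait(
        handles, timeout=poll_interval, return_when=asyncio.FIRST_COMPLETED
    )
    # Handles that are not finished are returned untouched, so none are lost.
    return sorted(task.result() for task in done), pending


async def main():
    stop_ab, stop_c = asyncio.Event(), asyncio.Event()
    handles = {
        asyncio.create_task(worker("a", stop_ab)),
        asyncio.create_task(worker("b", stop_ab)),
        asyncio.create_task(worker("c", stop_c)),
    }
    stop_ab.set()  # workers "a" and "b" stop at the same moment
    exited, alive = await manager_tick(handles, poll_interval=5.0)
    for task in alive:  # clean up the worker that is still running
        task.cancel()
    await asyncio.gather(*alive, return_exceptions=True)
    return exited, len(alive)


result = asyncio.run(main())
print(result)  # (['a', 'b'], 1)
```

In this sketch both workers that stopped together are reported by a single wake-up of the manager, and the wait returns well before the 5-second poll interval.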
I'm trying to implement this pattern in a "smart building" system design (using the STL). Various "sensors" placed in rooms, floors, etc. dispatch signals that are handled by "controllers" (also placed in different rooms, floors, etc.). The problem I'm facing is that a controller's subscription to an event isn't just event based, it is also location based.
For example, controller A can subscribe to a fire signal from room #1 on floor #4 and to a motion signal on floor #5. A floor-based subscription means that controller A will get a motion event from every room on the floor it is subscribed to (assuming the appropriate sensor is placed there). There's also a building-wide subscription.
The topology of the system is read from a configuration file at start up, so I don't want to map the whole building, just the relevant places that contain sensors and controllers.
What I've managed to think of :
Option 1: A MonitoredArea class that contains the name of the area (Building1, Floor 2, Room 3) and a vector indexed by an enumerated event type, where each element of the vector holds a list of the controllers subscribed to that event. The class will also contain a pointer to a parent MonitoredArea, for the case where it is a room on a floor or a floor in a building.
A Sensor class will dispatch an Event to a central hub along with the sensor's name. The hub will look the name up in its sensor-name-to-location map, obtain the matching MonitoredArea and alert all the controllers in the vector.
Cons:
Coupling of the location to the controller
Events are enumerated and are hard coded in the MonitoredArea class, adding future events is difficult.
Option 2:
Keeping all the subscriptions in the Controller class.
Cons:
Very inefficient. Every event will make the control center iterate through all the controllers to find out which ones are subscribed to this particular event.
Option 3:
Event based functionality. Event class (ie. FireEvent) will contain all the locations it can happen in (according to the sensor's setup) and for every location, a list of the controllers that are subscribed to it.
Cons:
A map of maps
Strong data duplication
No way to alert floor-based subscriptions about events in the various rooms.
As you can see, I'm not happy with any of the solutions above. I'm sure I've reached the over-thinking stage and would be happy to get feedback or alternative suggestions on how to approach this. Thanks.
There is a design pattern (so to speak) used a lot in game development called the "Message Bus". It is sometimes used to replace event-based operations.
"A message bus is a connection between one or more senders and/or receivers. Think of it like a connection between computers in a bus topology: Every node can send a message by passing it to the bus, and all connected nodes will receive that message. If the node is processed and if a reply is sent is completely up to each receiver itself.
Having modules connected to a message bus gives us some advantages:
Every module is isolated, it does not need to know of any others.
Every module can react to any message that’s being sent to the bus; that means you get extra flexibility for free, without increasing dependencies at all.
It’s much easier to follow the YAGNI workflow: For example you’re going to add weapons. At first you implement the physics, then you add visuals in the renderer, and then playing sounds. All of those features can be implemented independently at any time, without interrupting each other.
You save yourself from thinking a lot about how to connect certain modules to each other. Sometimes it takes a huge amount of time, including drawing diagrams/dependency graphs."
Sources:
http://gameprogrammingpatterns.com/event-queue.html
http://www.optank.org/2013/04/02/game-development-design-3-message-bus/
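A minimal sketch of how a bus could also respect the location-based subscriptions from the question. It is written in Python for brevity rather than C++ with the STL, and it is a filtered variation on the pattern quoted above (subscribers only receive matching messages). `MessageBus`, the location tuples and the controller name "A" are assumptions for illustration:

```python
from collections import defaultdict


class MessageBus:
    """Toy bus: subscribers register for an (event type, location) pair."""

    def __init__(self):
        # (event_type, location) -> list of callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, location, callback):
        # `location` is a tuple such as ("B1",), ("B1", "F5") or ("B1", "F4", "R1").
        self._subscribers[(event_type, tuple(location))].append(callback)

    def publish(self, event_type, location):
        location = tuple(location)
        # Walk from the exact room up through its floor and building.
        for depth in range(len(location), 0, -1):
            for callback in self._subscribers.get((event_type, location[:depth]), []):
                callback(event_type, location)


received = []
bus = MessageBus()
bus.subscribe("fire", ("B1", "F4", "R1"), lambda e, loc: received.append(("A", e, loc)))
bus.subscribe("motion", ("B1", "F5"), lambda e, loc: received.append(("A", e, loc)))

bus.publish("motion", ("B1", "F5", "R7"))  # delivered via the floor subscription
bus.publish("fire", ("B1", "F4", "R2"))    # no subscriber for this room
bus.publish("fire", ("B1", "F4", "R1"))    # delivered via the room subscription
print(received)
```

Sensors only need to know their own location path and controllers only need to know the bus, and only locations that actually have subscriptions appear in the map.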
Let's say I have 2 states, an Active state and an Idle state. If I receive some events in Active state I would like to defer them and execute them when I go back to Idle state.
But when I go back to the Idle state, is there a way to choose which previously deferred event to process? Or is there a way to prioritize them, or even ignore a few of them?
Thanks,
I see that the basic capability of deferred events is covered in the documentation provided on the project, which I have found helpful in general. In the section titled Orthogonal regions, terminate state, event deferring look for the text "UML defines event deferring as a state property. To accommodate this, MSM lets you specify this in states by providing a deferred_events type..." Note that there are two different methods described there for implementing the deferred events.
Without testing an example, I cannot say whether or not the referenced material on Conflicting transitions and guards will allow you to establish the priority you are seeking on deferred events. You could post your problem or a simplified example.
I am not aware of a solution native to boost MSM. I have heard that the author Christophe Henry is quite responsive to this kind of question on the Mailing list.
If your situation really is that trivial (only two states), nothing is stopping you from implementing your own deferred event queue and passing "deferred events" to it in Active. You can implement an internal transition for each event type with an action that pushes the event into your custom queue. Upon entering Idle you can reorder the events however you want and post them all back to the SM. This solution doesn't scale all that well, though, and it's a bit of a hack.
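The idea of a hand-rolled deferred queue can be sketched as follows. This is plain Python, not Boost MSM; `TwoStateMachine`, the "start"/"stop" events and the priority table are all hypothetical:

```python
class TwoStateMachine:
    """Toy Idle/Active machine with a hand-rolled deferred-event queue."""

    def __init__(self, priority, ignored=()):
        self.state = "Idle"
        self.priority = priority      # event name -> rank, lower runs first
        self.ignored = set(ignored)   # deferred events to drop on replay
        self.deferred = []
        self.handled = []

    def post(self, event):
        if event == "start":
            self.state = "Active"
        elif event == "stop":
            self.state = "Idle"
            self._replay_deferred()
        elif self.state == "Active":
            self.deferred.append(event)  # defer instead of handling now
        else:
            self.handled.append(event)

    def _replay_deferred(self):
        pending, self.deferred = self.deferred, []
        pending = [e for e in pending if e not in self.ignored]
        # sorted() is stable, so equal-priority events keep arrival order.
        for event in sorted(pending, key=lambda e: self.priority.get(e, 99)):
            self.post(event)


sm = TwoStateMachine(priority={"alarm": 0, "ping": 1, "log": 2}, ignored={"ping"})
sm.post("start")
for e in ["log", "ping", "alarm", "log"]:
    sm.post(e)
sm.post("stop")
print(sm.handled)  # ['alarm', 'log', 'log']
```

In MSM terms, the `deferred.append` branch corresponds to the internal transition action in Active, and `_replay_deferred` to the entry action of Idle.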