How do the default window and default trigger work in Apache Beam?

I'm trying to implement the default window with the default trigger to evaluate the behavior, but it's not emitting any results.
According to the Apache Beam documentation:
The default trigger for a PCollection is based on event time, and emits the results of the window when the Beam's watermark passes the end of the window, and then fires each time late data arrives.
If you are using both the default windowing configuration and the default trigger, the default trigger emits exactly once, and late data is discarded. This is because the default windowing configuration has an allowed lateness value of 0.
My code:
Nb_items = lines | beam.CombineGlobally(beam.combiners.CountCombineFn()).without_defaults() \
                 | 'print' >> beam.ParDo(PrintFn())
It only emits the data if I set a trigger:
Nb_items = lines | 'window' >> beam.WindowInto(window.GlobalWindows(),
                                               trigger=trigger.AfterProcessingTime(10),
                                               accumulation_mode=trigger.AccumulationMode.DISCARDING) \
                 | 'CountGlobally' >> beam.CombineGlobally(beam.combiners.CountCombineFn()).without_defaults() \
                 | 'print' >> beam.ParDo(PrintFn())
How can I observe the default behavior without setting a trigger?
Is the problem in the combine transform?
From the Beam documentation on combining with global windowing:
If your input PCollection uses the default global windowing, the default behavior is to return a PCollection containing one item. That item's value comes from the accumulator in the combine function that you specified when applying Combine.

The current issue is that the watermark never reaches the end of the GlobalWindow. To get output with the default trigger, use any other window whose end the watermark can actually reach, e.g.: 'window' >> beam.WindowInto(window.FixedWindows(10))
As Guillaume rightly asks: if you're running in batch mode, triggers are basically ignored.
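For example, a minimal sketch in streaming mode that keeps the default trigger but uses a window the watermark can pass (lines and PrintFn are assumed to be the unbounded input and the DoFn from the question):

import apache_beam as beam
from apache_beam.transforms import window

# Default trigger, but with a fixed window whose end the watermark can reach.
nb_items = (
    lines
    | 'window' >> beam.WindowInto(window.FixedWindows(10))  # no explicit trigger
    | 'CountGlobally' >> beam.CombineGlobally(
          beam.combiners.CountCombineFn()).without_defaults()
    | 'print' >> beam.ParDo(PrintFn())
)

With an unbounded source, this emits one count per 10-second window once the watermark passes the end of each window, without setting any trigger explicitly.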

See the sources below:
https://github.com/apache/beam/blob/828b897a2439437d483b1bd7f2a04871f077bde0/examples/java/src/main/java/org/apache/beam/examples/complete/game/LeaderBoard.java#L274
For more information regarding Google Cloud Dataflow:
https://stackoverflow.com/a/54151029/12149235

Related

Override a field in the input before passing to the next state in AWS Step Functions

Say I have 3 states, A -> B -> C. Let's assume the inputs to A include a field called names, which is of type List, and each element contains two fields, firstName and lastName. State B will process the inputs to A and return a response called newLastName. If I want to override every element in names such that names[i].lastName = newLastName before passing this input to state C, is there a built-in syntax to achieve that? Thanks.
You control the event passed to the next task in a Step Function with three definition attributes: ResultPath and OutputPath on leaving one task, and InputPath on entering the next one.
You first have to understand how the State Machine crafts the event for the next task, and how each of the three parameters above changes it.
You have to at least have ResultPath. This is the key in the event that the output of your Lambda will be placed under, so ResultPath="$.my_path" would result in a JSON object with a top-level key of my_path whose value is whatever the Lambda outputs.
If this is the only attribute, it is tacked onto whatever the input was. So if your input event was a JSON object with the keys original_key_1 and some_other_key, your output with just the above ResultPath would be:
{
    "original_key_1": "some value",
    "some_other_key": "some other value",
    "my_path": "<the output of your Lambda>"
}
Now if you add OutputPath, this cuts off everything OTHER than the path (AFTER adding the result path!) in the next output.
If you added OutputPath="$.my_path" you would end up with a json of:
{ output of your lambda }
(Your output had better be a JSON-compatible object, like a Python dict!)
InputPath does the same thing, but for the input: it cuts off everything other than the path described, and that is the only thing sent into the Lambda. It does not stop the result from being appended to the full original input, though, so InputPath + ResultPath means less is sent into the Lambda, but everything ends up together on the way out.
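As a rough sketch (the state names, Lambda ARN, and paths are hypothetical), a Task state combining the three attributes could look like this, written as a Python dict mirroring the Amazon States Language JSON:

# Hypothetical Task state, shown as a Python dict mirroring the ASL JSON.
# Raw input event: {"names": [...], "some_other_key": "..."}
process_names_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:StateB",  # placeholder ARN
    "InputPath": "$.names",     # the Lambda receives only the names list
    "ResultPath": "$.my_path",  # the Lambda's output is inserted under my_path
    "OutputPath": "$",          # keep the whole merged event ("$" is the default)
    "Next": "StateC",
}
# Event passed on to StateC:
# {"names": [...], "some_other_key": "...", "my_path": <output of the Lambda>}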
There isn't really loop logic like the one you describe, however: Task and State Machine definitions are static directions, not dynamic logic.
You can simply handle it inside the Lambda; this is kind of the preferred method. However, if you do this, you should use a combination of OutputPath and ResultPath to 'cut off' the input, having replaced the various fields of the incoming event with whatever you want before returning it at the end.
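If you go that route, a minimal sketch of the handler might look like the following (assuming State B receives the full input, i.e. no InputPath, and compute_new_last_name stands in for State B's real logic):

# Hypothetical handler for State B: returns the input with every
# names[i].lastName overridden, so the modified event can be passed
# straight on to State C (e.g. with ResultPath="$" so the result
# replaces the state's input).
def handler(event, context):
    new_last_name = compute_new_last_name(event)  # placeholder for the real logic
    for person in event.get("names", []):
        person["lastName"] = new_last_name
    return event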

Temporarily disable console output for boost::log

I added a file sink via boost::log::add_file_log and console output via boost::log::add_console_log. I am calling the logger via BOOST_LOG_SEV and everything works perfectly. But there is a place where I want output only to the file.
How can I disable console output in a certain place?
You could achieve this with attributes and filters. For example, you could set up a filter in your console sink to suppress any log records that have (or don't have, depending on your preference) a particular attribute value attached.
boost::log::add_console_log
(
    ...
    boost::log::keywords::filter = !boost::log::expressions::has_attr("NoConsole")
    ...
);
Then you could set this attribute in the code region that shouldn't output logs in the console. For example, you could use a scoped attribute:
BOOST_LOG_SCOPED_THREAD_ATTR("NoConsole", true);
BOOST_LOG(logger) << "No console output";
You can use whatever method of setting the attribute you like, whether as a thread-local attribute or a logger-specific one; it doesn't matter.
The important difference from temporarily removing the sink is that the solution with attributes will not affect other threads that may be logging while you're suspending console output.
You can easily do it with the remove_sink() function.
auto console_sink = boost::log::add_console_log(std::cout);
boost::log::core::get()->remove_sink(console_sink);
After that, you can call add_console_log() again to re-enable console output.

What is the difference between startFlow and startTrackedFlow in Corda?

So what is the advantage of using startTrackedFlow over startFlow?
The difference is defined in the official documentation:
The process of starting a flow returns a FlowHandle that you can use to observe the result, and which also contains a permanent identifier for the invoked flow in the form of the StateMachineRunId. Should you also wish to track the progress of your flow (see Progress tracking) then you can invoke your flow instead using CordaRPCOps.startTrackedFlowDynamic or any of its corresponding CordaRPCOps.startTrackedFlow extension functions. These will return a FlowProgressHandle, which is just like a FlowHandle except that it also contains an observable progress field.

Update a Dataflow streaming job with Session and Sliding windows embedded in it

In my use case, I'm performing a Session window as well as a Sliding window inside a Dataflow job. Basically, my sliding window size is 10 hours with a sliding period of 4 minutes. Since I'm applying grouping and a max function on top of that, the window fires a pane on every 3-minute interval, and that output goes into the Session window with triggering logic on it. Below is the code for the same.
Window<Map<String, String>> windowMap = Window.<Map<String, String>>into(
        SlidingWindows.of(Duration.standardHours(10)).every(Duration.standardMinutes(4)));

Window<Map<String, String>> windowSession = Window
        .<Map<String, String>>into(Sessions.withGapDuration(Duration.standardHours(10)))
        .discardingFiredPanes()
        .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(5))))
        .withAllowedLateness(Duration.standardSeconds(10));
I would like to add a logger on some steps for debugging, so I'm trying to update the current streaming job using the code below:
options.setRegion("asia-east1");
options.setUpdate(true);
options.setStreaming(true);
Previously I had around 10k elements, and after updating the existing pipeline with the above config I'm not able to see that much data in the steps of the updated Dataflow job. Please help me understand whether the update preserves the previous job's data or not, as I'm not seeing the previous step counts in the updated job.

PBS Professional hook not updating Priority

I am trying to implement a hook to determine a job's priority upon entering the queue.
The hook is enabled, imported, and its event type is "queuejob", so it is in place (like the other hooks we have enabled). This hook, however, does not seem to alter a job's priority as I am expecting.
Here is a simplified example of how I'm trying to alter the Priority for a job:
import pbs

try:
    e = pbs.event()
    j = e.job
    if j.server == 'myserver':
        j.Priority = j.Priority + 50
    e.accept()
except SystemExit:
    pass
Whenever I submit a job after importing this hook and run 'qstat -f' on it, the Priority is always 0, whether I set it to another value in my qsub script or leave it at the default.
Thank you.
A couple of things I discovered:
It appears that PBS does not like using j.Priority in a calculation and assignment, so I had to use another internal variable (which was fine, since I already had one for something else), i.e.:
j.Priority = High_Priority
if pbs.server() == 'myserver':
    j.Priority = High_Priority + 50
Also, as can be seen in the last example, j.server should actually be pbs.server().
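Putting the two discoveries together, a minimal sketch of the corrected queuejob hook might look like this (High_Priority's value of 50 is an assumption purely for illustration):

import pbs

High_Priority = 50  # assumed base priority, for illustration only

try:
    e = pbs.event()
    j = e.job
    j.Priority = High_Priority
    if pbs.server() == 'myserver':  # server check as used in the answer above
        j.Priority = High_Priority + 50
    e.accept()
except SystemExit:
    pass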