Can someone please explain the proper usage of Timers and Triggers in Apache Beam?

I'm looking for some examples of the usage of Triggers and Timers in Apache Beam. I want to use processing-time timers to listen to my data from Pub/Sub every 5 minutes, and processing-time triggers to process all of the data collected over an hour together, in Python.

Please take a look at the following resources: "Stateful processing with Apache Beam" and "Timely (and Stateful) Processing with Apache Beam".
The first blog post is more general about how to handle state, and the second has some examples of buffering and triggering after a certain period of time, which seems similar to what you are trying to do.
A full example was requested. Here is what I was able to come up with:
PCollection<String> records =
    pipeline.apply(
        "ReadPubsub",
        PubsubIO.readStrings()
            .fromSubscription("projects/{project}/subscriptions/{subscription}"));

TupleTag<Iterable<String>> every5MinTag = new TupleTag<>();
TupleTag<Iterable<String>> everyHourTag = new TupleTag<>();

PCollectionTuple timersTuple =
    records
        // A KV<> is required to use state; keying by your data is more appropriate than a hardcoded key.
        .apply("WithKeys", WithKeys.of(1))
        .apply(
            "Batch",
            ParDo.of(
                new DoFn<KV<Integer, String>, Iterable<String>>() {

                    @StateId("buffer5Min")
                    private final StateSpec<BagState<String>> bufferedEvents5Min = StateSpecs.bag();

                    @StateId("count5Min")
                    private final StateSpec<ValueState<Integer>> countState5Min = StateSpecs.value();

                    @TimerId("every5Min")
                    private final TimerSpec every5MinSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

                    @StateId("bufferHour")
                    private final StateSpec<BagState<String>> bufferedEventsHour = StateSpecs.bag();

                    @StateId("countHour")
                    private final StateSpec<ValueState<Integer>> countStateHour = StateSpecs.value();

                    @TimerId("everyHour")
                    private final TimerSpec everyHourSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

                    @ProcessElement
                    public void process(
                            @Element KV<Integer, String> record,
                            @StateId("count5Min") ValueState<Integer> count5MinState,
                            @StateId("countHour") ValueState<Integer> countHourState,
                            @StateId("buffer5Min") BagState<String> buffer5Min,
                            @StateId("bufferHour") BagState<String> bufferHour,
                            @TimerId("every5Min") Timer every5MinTimer,
                            @TimerId("everyHour") Timer everyHourTimer) {
                        if (Objects.firstNonNull(count5MinState.read(), 0) == 0) {
                            every5MinTimer
                                .offset(Duration.standardMinutes(5))
                                .align(Duration.standardMinutes(5))
                                .setRelative();
                        }
                        buffer5Min.add(record.getValue());
                        if (Objects.firstNonNull(countHourState.read(), 0) == 0) {
                            everyHourTimer
                                .offset(Duration.standardMinutes(60))
                                .align(Duration.standardMinutes(60))
                                .setRelative();
                        }
                        bufferHour.add(record.getValue());
                    }

                    @OnTimer("every5Min")
                    public void onTimerEvery5Min(
                            OnTimerContext context,
                            @StateId("buffer5Min") BagState<String> bufferState,
                            @StateId("count5Min") ValueState<Integer> countState) {
                        if (!bufferState.isEmpty().read()) {
                            context.output(every5MinTag, bufferState.read());
                            bufferState.clear();
                            countState.clear();
                        }
                    }

                    @OnTimer("everyHour")
                    public void onTimerEveryHour(
                            OnTimerContext context,
                            @StateId("bufferHour") BagState<String> bufferState,
                            @StateId("countHour") ValueState<Integer> countState) {
                        if (!bufferState.isEmpty().read()) {
                            context.output(everyHourTag, bufferState.read());
                            bufferState.clear();
                            countState.clear();
                        }
                    }
                })
                .withOutputTags(every5MinTag, TupleTagList.of(everyHourTag)));

timersTuple
    .get(every5MinTag)
    .setCoder(IterableCoder.of(StringUtf8Coder.of()))
    .apply(<<do something every 5 min>>);

timersTuple
    .get(everyHourTag)
    .setCoder(IterableCoder.of(StringUtf8Coder.of()))
    .apply(<< do something every hour>>);

pipeline.run().waitUntilFinish();

Related

BulkProcessor .add() not finishing when number of bulks > concurrentRequests

Here is a sample of the code flow:
The process is triggered via an API call specifying bulkSize and totalRecords.
Those parameters are used to acquire the data from the DB.
A processor is created with the bulkSize.
Both the data and the processor are passed into a method which iterates over the result set, assembles a JSON object for each result, calls a method if the final JSON is not empty, and adds that final JSON to the processor using the processor.add() method.
This is where the outcome of the code diverges:
If the concurrentRequests parameter is 0, 1, or any value < (totalRecords/bulkSize), the processor.add() line is where the code stalls and never continues to the next debug line.
However, when we increase the concurrentRequests parameter to a value > (totalRecords/bulkSize), the code is able to finish the .add() call and move on to the next line.
My reasoning leads me to believe we might be having issues with our BulkProcessorListener, which is making the .add() not close or finish as it is supposed to. I would really appreciate some more insight on this topic!
Here is the Listener we are using:
private class BulkProcessorListener implements Listener {
    @Override
    public void beforeBulk(long executionId, BulkRequest request) {
        // Some log statements
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
        // More log statements
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
        // Log statements
    }
}
Here is the createProcessor():
public synchronized BulkProcessor createProcessor(int bulkActions) {
    Builder builder = BulkProcessor.builder((request, bulkListener) -> {
        long timeoutMin = 60L;
        try {
            request.timeout(TimeValue.timeValueMinutes(timeoutMin));
            // Log statements
            client.bulkAsync(request, RequestOptions.DEFAULT, new ResponseActionListener<BulkResponse>());
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
        }
    }, new BulkProcessorListener());

    builder.setBulkActions(bulkActions);
    builder.setBulkSize(new ByteSizeValue(buldSize, ByteSizeUnit.MB));
    builder.setFlushInterval(TimeValue.timeValueSeconds(5));
    builder.setConcurrentRequests(0);
    builder.setBackoffPolicy(BackoffPolicy.noBackoff());
    return builder.build();
}
Here is the method where we call processor.add():
@SuppressWarnings("deprecation")
private void addData(BulkProcessor processor, String indexName, JSONObject finalDataJSON, Map<String, String> previousUniqueObject) {
    // Debug logs
    processor.add(new IndexRequest(indexName, INDEX_TYPE, previousUniqueObject.get(COMBINED_ID))
        .source(finalDataJSON.toString(), XContentType.JSON));
    // Debug logs
}

Reducing code duplication when testing a KtorClient

I am creating a service on top of a Ktor client. My payload is XML, and as such a simplified version of my client looks like this:
class MavenClient(private val client: HttpClient) {

    private suspend fun getRemotePom(url: String) =
        try {
            MavenClientSuccess(client.get<POMProject>(url))
        } catch (e: Exception) {
            MavenClientFailure(e)
        }

    companion object {
        fun getDefaultClient(): HttpClient {
            return HttpClient(Apache) {
                install(JsonFeature) {
                    serializer = JacksonSerializer(jackson = kotlinXmlMapper)
                    accept(ContentType.Text.Xml)
                    accept(ContentType.Application.Xml)
                    accept(ContentType.Text.Plain)
                }
            }
        }
    }
}
Note the use of a custom XMLMapper, attached to a custom data class.
I want to test this class, and follow the documentation.
I end up with the following code for my test client:
private val mockClient = HttpClient(MockEngine) {
    engine {
        addHandler { request ->
            when (request.url.fullUrl) {
                "https://lengrand.me/minimal/1.2/minimal-1.2.pom" -> {
                    respond(minimalResourceStreamPom.readBytes(),
                        headers = headersOf("Content-Type" to listOf(ContentType.Application.Xml.toString())))
                }
                "https://lengrand.me/unknown/1.2/unknown-1.2.pom" -> {
                    respond("", HttpStatusCode.NotFound)
                }
                else -> error("Unhandled ${request.url.fullUrl}")
            }
        }
    }
    // TODO: How do I avoid repeating this again? That's my implementation?!
    install(JsonFeature) {
        serializer = JacksonSerializer(jackson = PomParser.kotlinXmlMapper)
        accept(ContentType.Text.Xml)
        accept(ContentType.Application.Xml)
        accept(ContentType.Text.Plain)
    }
}
private val Url.hostWithPortIfRequired: String get() = if (port == protocol.defaultPort) host else hostWithPort
private val Url.fullUrl: String get() = "${protocol.name}://$hostWithPortIfRequired$fullPath"
private val mavenClient = MavenClient(mockClient)
Now, I am not worried about the Mapper itself, because I test it directly.
However, what bothers me is that I essentially have to duplicate the complete logic of my client in order to test its behaviour.
This seems very brittle: for example, it will cause my tests to fail and need updating if I move to JSON tomorrow, and the same applies if I start using Response Validation.
This is even more true for another client where I am using a defaultRequest, which I have to completely copy over as well:
private val mockClient = HttpClient(MockEngine) {
    install(JsonFeature) {
        serializer = JacksonSerializer(mapper)
        accept(ContentType.Application.Json)
    }
    defaultRequest {
        method = HttpMethod.Get
        host = "api.github.com"
        header("Accept", "application/vnd.github.v3+json")
        if (GithubLogin().hasToken()) header("Authorization", GithubLogin().authToken)
    }
}
Am I doing things wrong? Am I testing too much? I am curious as to how I can improve this.
Thanks a lot for your input!
P.S.: Unrelated, but the page about testing in Ktor mentions adding the dependency to implementation. It sounds like I should use testImplementation instead, to avoid shipping the lib with my application?
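I assume that would be something like this in a Gradle Kotlin DSL build (ktorVersion here is just a placeholder for whatever Ktor version I'm using):
dependencies {
    // Only on the test classpath, so the mock engine is not shipped with the application
    testImplementation("io.ktor:ktor-client-mock:$ktorVersion")
}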
The MockEngine is designed for stubbing a real HTTP client implementation in order to test objects that use it. The duplication problem you encounter lies in the fact that the responsibility for transforming the response body belongs to the client. So I suggest you either use Jackson directly to transform the response body (in that case you don't need JsonFeature), or extract the common configuration into an extension function and call it for both engines.
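For example, the shared configuration could be extracted roughly like this (a sketch reusing the JsonFeature / kotlinXmlMapper setup from the question; installXmlSerialization is an illustrative name, not an existing Ktor API):
// Shared configuration, usable from both the production and the test client builders
fun HttpClientConfig<*>.installXmlSerialization() {
    install(JsonFeature) {
        serializer = JacksonSerializer(jackson = kotlinXmlMapper)
        accept(ContentType.Text.Xml)
        accept(ContentType.Application.Xml)
        accept(ContentType.Text.Plain)
    }
}

// Production client
fun getDefaultClient(): HttpClient = HttpClient(Apache) { installXmlSerialization() }

// Test client: only the MockEngine request stubbing differs
private val mockClient = HttpClient(MockEngine) {
    engine {
        addHandler { request ->
            // ... the same handlers as in the question ...
            error("Unhandled ${request.url.fullUrl}")
        }
    }
    installXmlSerialization()
}
This way, switching to JSON or adding response validation later only touches the extension function, while the tests keep only the request stubbing.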

External offset store with the debezium embedded connector

My team is building a CDC service with the Debezium embedded connector. For the offset storage we're thinking about using S3 or DynamoDB. Just wondering if anyone here has written something similar to externalize the offset store, and if so, what they chose and why.
We have a Postgres DB as the source. Change Data Capture (CDC) is implemented by Postgres itself (via the pglogical extension). This CDC subsystem of Postgres is responsible for offset management. The CDC subsystem maintains a list of CDC clients (aka slots), so if your client creates a CDC connection, the DB will resume from the point where that client disconnected before (on the same slot). A new client will create a new slot and start receiving only the CDC records created from that point in time onwards. So there is no need for us to remember the offsets.
Had this challenge recently. You can write a custom class that extends org.apache.kafka.connect.storage.FileOffsetBackingStore or org.apache.kafka.connect.storage.MemoryOffsetBackingStore.
Subsequently, ensure the "offset.storage" config is set to the fully-qualified name of that class.
Please see a sample below using Redis (maybe not something for production) as a backing store, to give you an idea of how this can work; a sketch of wiring the config into the embedded engine follows the sample.
package com.sample.cdc.offsetbackingstore

import com.sample.cdc.service.RedisManager
import org.apache.kafka.connect.errors.ConnectException
import org.apache.kafka.connect.runtime.WorkerConfig
import org.apache.kafka.connect.storage.MemoryOffsetBackingStore
import java.io.IOException
import java.nio.ByteBuffer
import java.util.concurrent.Callable
import java.util.concurrent.Future

class RedisOffsetBackingStore : MemoryOffsetBackingStore() {

    lateinit var redisManager: RedisManager
    lateinit var redisHost: String
    lateinit var redisPort: String

    override fun configure(config: WorkerConfig?) {
        super.configure(config)
        // config is never null when called by the engine; !! keeps the lateinit fields non-nullable
        redisHost = config!!.getString("custom.config.redis.host")
        redisPort = config.getString("custom.config.redis.port")
    }

    // Called by the Debezium engine at startup
    override fun start() {
        super.start()
        println("Initializing redis manager...")
        redisManager = RedisManager(redisHost, redisPort)
    }

    // Called by the Debezium engine during graceful shutdown
    override fun stop() {
        super.stop()
        println("Disposing redis client resources...")
        if (this::redisManager.isInitialized)
            redisManager.dispose()
    }

    // Called by the DebeziumEngine OffsetReader to read offsets
    override fun get(keys: MutableCollection<ByteBuffer>?): Future<MutableMap<ByteBuffer, ByteBuffer?>> {
        if (data.isNotEmpty())
            return super.get(keys)
        return executor.submit(Callable<MutableMap<ByteBuffer, ByteBuffer?>> {
            val result: MutableMap<ByteBuffer, ByteBuffer?> = HashMap()
            keys?.forEach {
                val offsetKey = String(it.array())
                val offsetValue = redisManager.get(offsetKey)
                if (offsetValue.isNotEmpty()) {
                    val buffer = ByteBuffer.wrap(offsetValue.toByteArray())
                    result[it] = buffer
                    data[it] = buffer
                }
            }
            result
        })
    }

    // Invoked by set() in the MemoryOffsetBackingStore class to persist offsets
    // during commit or graceful shutdown
    override fun save() {
        try {
            for ((key, value) in data) {
                val offsetKey = String(key!!.array())
                val offsetValue = String(value!!.array())
                redisManager.save(offsetKey, offsetValue)
            }
        } catch (e: IOException) {
            throw ConnectException(e)
        }
    }
}

// Ensure the below config settings are set in the Debezium config:
// "offset.storage": "com.sample.cdc.offsetbackingstore.RedisOffsetBackingStore",
// "custom.config.redis.host": "localhost",
// "custom.config.redis.port": "6379"
Note: In the case of multiple standalone embedded Debezium services (for reliability and fault tolerance) with a custom offset backing store, you'll have to provide a way to handle offset race conditions and event deduplication.

Schedule/batch for large number of webservice callouts?

I'm new to Apex and I have to call a web service for every account (some thousands of accounts).
Usually a single web service request takes 500 to 5000 ms.
As far as I know, schedulable and batchable classes are required for this task.
My idea was to group the accounts by country code (Europe only) and start a batch for every group.
The first batch is started by the schedulable class; the next ones start in the batch's finish method:
global class AccValidator implements Database.Batchable<sObject>, Database.AllowsCallouts {

    private List<String> countryCodes;
    private Integer countryIndex;

    global AccValidator(List<String> countryCodes, Integer countryIndex) {
        this.countryCodes = countryCodes;
        this.countryIndex = countryIndex;
        ...
    }

    // Get accounts for the current country code
    global Database.QueryLocator start(Database.BatchableContext bc) {...}

    global void execute(Database.BatchableContext bc, List<Account> myAccounts) {
        for (Integer i = 0; i < this.AccAccounts.size(); i++) {
            // Callout for every account
            HttpRequest request ...
            Http http = new Http();
            HttpResponse response = http.send(request);
            ...
        }
    }

    global void finish(Database.BatchableContext BC) {
        if (this.countryIndex < this.countryCodes.size() - 1) {
            // Start the next batch
            Database.executeBatch(new AccValidator(this.countryCodes, this.countryIndex + 1), 200);
        }
    }

    global static List<String> getCountryCodes() {...}
}
And my schedule class:
global class AccValidatorSchedule implements Schedulable {
    global void execute(SchedulableContext sc) {
        List<String> countryCodes = AccValidator.getCountryCodes();
        Id AccAddressID = Database.executeBatch(new AccValidator(countryCodes, 0), 200);
    }
}
Now I'm stuck with Salesforce's execution governors and limits:
For nearly all callouts I get the exceptions "Read timed out" or "Exceeded maximum time allotted for callout (120000 ms)".
I also tried asynchronous callouts, but they don't work with batches.
So, is there any way to schedule a large number of callouts?
Have you tried limiting your execute method to a scope of 100? Salesforce only allows 100 callouts per transaction. I.e.:
Id AccAddressID = Database.executeBatch(new AccValidator(countryCodes, 0), 100);
Perhaps this might help you:
https://salesforce.stackexchange.com/questions/131448/fatal-errorsystem-limitexception-too-many-callouts-101

How to simulate a CRM plugin sandbox isolation mode in unit tests?

Context
I would like to write some unit tests against classes that will be utilized by CRM 2016 CodeActivity and Plugin classes. The final assembly will be registered in sandbox isolation mode.
I want to be sure that if a test case is green when running the unit tests, it will not run into additional sandbox isolation security restrictions when registered and run in CRM.
Question
Is there any way to simulate the sandbox isolation when running unit tests?
That's a really good question. You can maybe simulate running the plugin assemblies and code activities in a sandbox based on this Sandbox example.
With that example you could run the code activity with a limited set of permissions.
Now, what are the exact limitations of CRM Online? I found this article; it has a Sandbox Limitations section with some of them. If you find another one, please let me know, because I'd be keen on adding this feature to FakeXrmEasy.
Cheers,
I found this today: https://github.com/carltoncolter/DynamicsPlugin/blob/master/DynamicsPlugin.Tests/PluginContainer.cs
Which I used to turn into this:
using System;
using System.Diagnostics;
using System.Globalization;
using System.Net;
using System.Net.NetworkInformation;
using System.Reflection;
using System.Security;
using System.Security.Permissions;
using System.Text.RegularExpressions;

namespace Core.DLaB.Xrm.Tests.Sandbox
{
    public static class SandboxWrapper
    {
        public static T Instantiate<T>(object[] constructorArguments = null)
        {
            return new SandboxWrapper<T>().Instantiate(constructorArguments);
        }

        public static T InstantiatePlugin<T>(string unsecureConfig = null, string secureConfig = null)
        {
            object[] args = null;
            if (secureConfig == null)
            {
                if (unsecureConfig != null)
                {
                    args = new object[] { unsecureConfig };
                }
            }
            else
            {
                args = new object[] { unsecureConfig, secureConfig };
            }
            return new SandboxWrapper<T>().Instantiate(args);
        }
    }
    public class SandboxWrapper<T> : MarshalByRefObject, IDisposable
    {
        private const string DomainSuffix = "Sandbox";

        /// <summary>
        /// The Sandbox AppDomain to execute the plugin
        /// </summary>
        public AppDomain SandboxedAppDomain { get; private set; }

        public T Instantiate(object[] constructorArguments = null)
        {
            /*
             * Sandboxed plug-ins and custom workflow activities can access the network through the HTTP and HTTPS protocols. This capability provides
             * support for accessing popular web resources like social sites, news feeds, web services, and more. The following web access restrictions
             * apply to this sandbox capability.
             * - Only the HTTP and HTTPS protocols are allowed.
             * - Access to localhost (loopback) is not permitted.
             * - IP addresses cannot be used. You must use a named web address that requires DNS name resolution.
             * - Anonymous authentication is supported and recommended. There is no provision for prompting the
             *   logged-on user for credentials or saving those credentials.
             */
            constructorArguments = constructorArguments ?? new object[] { };
            var type = typeof(T);
            var source = type.Assembly.Location;
            var sourceAssembly = Assembly.UnsafeLoadFrom(source);

            var setup = new AppDomainSetup
            {
                ApplicationBase = AppDomain.CurrentDomain.BaseDirectory,
                ApplicationName = $"{sourceAssembly.GetName().Name}{DomainSuffix}",
                DisallowBindingRedirects = true,
                DisallowCodeDownload = true,
                DisallowPublisherPolicy = true
            };

            var ps = new PermissionSet(PermissionState.None);
            ps.AddPermission(new SecurityPermission(SecurityPermissionFlag.SerializationFormatter));
            ps.AddPermission(new SecurityPermission(SecurityPermissionFlag.Execution));
            ps.AddPermission(new FileIOPermission(PermissionState.None));
            ps.AddPermission(new ReflectionPermission(ReflectionPermissionFlag.RestrictedMemberAccess));
            // RegEx pattern taken from: https://msdn.microsoft.com/en-us/library/gg334752.aspx
            ps.AddPermission(new WebPermission(NetworkAccess.Connect,
                new Regex(
                    @"^http[s]?://(?!((localhost[:/])|(\[.*\])|([0-9]+[:/])|(0x[0-9a-f]+[:/])|(((([0-9]+)|(0x[0-9A-F]+))\.){3}(([0-9]+)|(0x[0-9A-F]+))[:/]))).+")));
            // We don't need to add these, but it is important to note that there is no access to the following
            ps.AddPermission(new NetworkInformationPermission(NetworkInformationAccess.None));
            ps.AddPermission(new EnvironmentPermission(PermissionState.None));
            ps.AddPermission(new RegistryPermission(PermissionState.None));
            ps.AddPermission(new EventLogPermission(PermissionState.None));

            SandboxedAppDomain = AppDomain.CreateDomain(DomainSuffix, null, setup, ps, null);

            return Create(constructorArguments);
        }
        private T Create(object[] constructorArguments)
        {
            var type = typeof(T);
            return (T)Activator.CreateInstanceFrom(
                SandboxedAppDomain,
                type.Assembly.ManifestModule.FullyQualifiedName,
                // ReSharper disable once AssignNullToNotNullAttribute
                type.FullName, false, BindingFlags.CreateInstance,
                null, constructorArguments,
                CultureInfo.CurrentCulture, null
            ).Unwrap();
        }

        #region IDisposable Support
        // Implementing IDisposable Pattern: https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/dispose-pattern
        private bool _disposed; // To detect redundant calls

        protected virtual void Dispose(bool disposing)
        {
            if (_disposed) return;
            if (disposing)
            {
                if (SandboxedAppDomain != null)
                {
                    AppDomain.Unload(SandboxedAppDomain);
                    SandboxedAppDomain = null;
                }
            }
            _disposed = true;
        }

        // This code added to correctly implement the disposable pattern.
        void IDisposable.Dispose()
        {
            // Do not change this code. Put cleanup code in Dispose(bool disposing) above.
            Dispose(true);
        }
        #endregion
    }
}
Which can be used as such:
SandboxWrapper.InstantiatePlugin<YourPluginType>(unsecureString, secureString)
Not sure how much of it is valid, but it worked correctly for my testing of XML and JSON serialization.