AWS Kinesis KCL skips records added before startup - amazon-web-services

I started to use both KPL and KCL to exchange data between services. But whenever consumer service is offline, all data sent by KPL are lost forever. So I get only those chunks of data that were sent while consumer service is up and its shardConsumer is ready. I need to start from the last consumed point or somehow else process data left behind.
Here is my ShardProcessor code:
#Override
public void initialize(InitializationInput initializationInput) {
}
#Override
public void processRecords(ProcessRecordsInput processRecordsInput) {
processRecordsInput.records()
.forEach(record -> {
//my logic
});
}
#Override
public void leaseLost(LeaseLostInput leaseLostInput) {
}
#Override
public void shardEnded(ShardEndedInput shardEndedInput) {
try {
shardEndedInput.checkpointer().checkpoint();
} catch (ShutdownException | InvalidStateException e) {
LOG.error("Kinesis error on Shard Ended", e);
}
}
#Override
public void shutdownRequested(ShutdownRequestedInput shutdownRequestedInput) {
try {
shutdownRequestedInput.checkpointer().checkpoint();
} catch (ShutdownException | InvalidStateException e) {
LOG.error("Kinesis error on Shutdown Requested", e);
}
}
And configuration code:
public void configure(String streamName, ShardRecordProcessorFactory factory) {
Region region = Region.of(awsRegion);
KinesisAsyncClient kinesisAsyncClient =
KinesisClientUtil.createKinesisAsyncClient(KinesisAsyncClient.builder().region(region));
DynamoDbAsyncClient dynamoClient = DynamoDbAsyncClient.builder().region(region).build();
CloudWatchAsyncClient cloudWatchClient = CloudWatchAsyncClient.builder().region(region).build();
ConfigsBuilder configsBuilder =
new ConfigsBuilder(streamName, appName, kinesisAsyncClient, dynamoClient, cloudWatchClient,
UUID.randomUUID().toString(), factory);
Scheduler scheduler = new Scheduler(
configsBuilder.checkpointConfig(),
configsBuilder.coordinatorConfig(),
configsBuilder.leaseManagementConfig(),
configsBuilder.lifecycleConfig(),
configsBuilder.metricsConfig(),
configsBuilder.processorConfig(),
configsBuilder.retrievalConfig()
.retrievalSpecificConfig(new PollingConfig(streamName, kinesisAsyncClient))
);
Thread schedulerThread = new Thread(scheduler);
schedulerThread.setDaemon(true);
schedulerThread.start();
}

There are two ways to address this. First, the problem.
By default, the KCL is configured to start reading the stream at LATEST. This setting tells the stream reader to pick up the stream at the "current" timestamp.
In your case, you have data in that stream that was placed in there before "now." In order to read that data, you might want to consider reading the earliest data you have in the stream. If you set up a default stream, the stream will store data for 24 hours.
To read the data from the "beginning" of that stream, or 24 hours before you start the KCL application, you'll want to set the stream reader to TRIM_HORIZON. This setting is called initialPositionInStream. You can read about it here. There are three different settings documented in the API.
To solve your issue, the preferred method, as noted in the first link, is to add an entry to the properties file. If you're not using a properties file, you can simply add this to your Scheduler ctor:
Scheduler scheduler = new Scheduler(
configsBuilder.checkpointConfig(),
configsBuilder.coordinatorConfig(),
configsBuilder.leaseManagementConfig(),
configsBuilder.lifecycleConfig(),
configsBuilder.metricsConfig(),
configsBuilder.processorConfig(),
configsBuilder.retrievalConfig()
.initialPositionInStreamExtended(InitialPositionInStreamExtended.newInitialPosition(TRIM_HORIZON))
.retrievalSpecificConfig(new PollingConfig(streamName, kinesisAsyncClient))
);
One thing to keep in mind with this setting is startup functionality when you have data in the stream and you start at TRIM_HORIZON. In this scenario, the RecordProcessor will iterate through records as fast as it can. This could create performance issues at the Kinesis API, or even downstream systems (wherever you're sending the data once the RecordProcessor has it),

Related

What is different between EventHub Client with EventHub Processor?

I find there are two way to receive EventHub message data:
Using EventHub Processor, it seems will use checkpoint to save. It will make sure when the process running EventProcessor on a specific partition dies/crashes.
public class SimpleEventProcessor : IEventProcessor
{
public Task CloseAsync(PartitionContext context, CloseReason reason)
{
Console.WriteLine($"Processor Shutting Down. Partition '{context.PartitionId}', Reason: '{reason}'.");
return Task.CompletedTask;
}
public Task OpenAsync(PartitionContext context)
{
Console.WriteLine($"SimpleEventProcessor initialized. Partition: '{context.PartitionId}'");
return Task.CompletedTask;
}
public Task ProcessErrorAsync(PartitionContext context, Exception error)
{
Console.WriteLine($"Error on Partition: {context.PartitionId}, Error: {error.Message}");
return Task.CompletedTask;
}
public Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
{
foreach (var eventData in messages)
{
var data = Encoding.UTF8.GetString(eventData.Body.Array, eventData.Body.Offset, eventData.Body.Count);
Console.WriteLine($"Message received. Partition: '{context.PartitionId}', Data: '{data}'");
}
return context.CheckpointAsync();
}
}
Using EventHub client to receive message:
EventHubClient eventHub
var reciever = eventHub.CreateReceiver("consumer1", "0", EventPosition.FromStart());
var recieved = await reciever.ReceiveAsync(10);
What is the difference for them? Could we save the checkpoint for second ways? How to handle the crash case in second ways? Why does it need two different ways?
EventHubClient aka. low level API is used for building connectors. In this case, developer is responsible to manage partition receivers, checkpoints, load distribution, and crash recovery etc. Most won't be using this API to receive, and again this API is for building source to sink connectors.
Processor Host comes with built-in checkpointing, load distribution, and partition receiver manager. This API might look like an overkill when it comes to implement IEventProcessor, and provide storage checkpoint store however in the long run it is more worry-free.

Preventing a WCF client from issuing too many requests

I am writing an application where the Client issues commands to a web service (CQRS)
The client is written in C#
The client uses a WCF Proxy to send the messages
The client uses the async pattern to call the web service
The client can issue multiple requests at once.
My problem is that sometimes the client simply issues too many requests and the service starts returning that it is too busy.
Here is an example. I am registering orders and they can be from a handful up to a few 1000s.
var taskList = Orders.Select(order => _cmdSvc.ExecuteAsync(order))
.ToList();
await Task.WhenAll(taskList);
Basically, I call ExecuteAsync for every order and get a Task back. Then I just await for them all to complete.
I don't really want to fix this server-side because no matter how much I tune it, the client could still kill it by sending for example 10,000 requests.
So my question is. Can I configure the WCF Client in any way so that it simply takes all the requests and sends the maximum of say 20, once one completes it automatically dispatches the next, etc? Or is the Task I get back linked to the actual HTTP request and can therefore not return until the request has actually been dispatched?
If this is the case and WCF Client simply cannot do this form me, I have the idea of decorating the WCF Client with a class that queues commands, returns a Task (using TaskCompletionSource) and then makes sure that there are no more than say 20 requests active at a time. I know this will work but I would like to ask if anyone knows of a library or a class that does something like this?
This is kind of like Throttling but I don't want to do exactly that because I don't want to limit how many requests I can send in a given period of time but rather how many active requests can exist at any given time.
Based on #PanagiotisKanavos suggjestion, here is how I solved this.
RequestLimitCommandService acts as a decorator for the actual service which is passed in to the constructor as innerSvc. Once someone calls ExecuteAsync a completion source is created which along with the command is posted to the ActonBlock, the caller then gets back the a Task from the completion source.
The ActionBlock will then call the processing method. This method sends the command to the web service. Depending on what happens, this method will use the completion source to either notify the original sender that a command was processed successfully or attach the exception that occurred to the source.
public class RequestLimitCommandService : IAsyncCommandService
{
private class ExecutionToken
{
public TaskCompletionSource<bool> Source { get; }
public ICommand Command { get; }
public ExecutionToken(TaskCompletionSource<bool> source, ICommand command)
{
Source = source;
Command = command;
}
}
private IAsyncCommandService _innerSrc;
private ActionBlock<ExecutionToken> _block;
public RequestLimitCommandService(IAsyncCommandService innerSvc, int maxDegreeOfParallelism)
{
_innerSrc = innerSvc;
var options = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };
_block = new ActionBlock<ExecutionToken>(Execute, options);
}
public Task IAsyncCommandService.ExecuteAsync(ICommand command)
{
var source = new TaskCompletionSource<bool>();
var token = new ExecutionToken(source, command);
_block.Post(token);
return source.Task;
}
private async Task Execute(ExecutionToken token)
{
try
{
await _innerSrc.ExecuteAsync(token.Command);
token.Source.SetResult(true);
}
catch (Exception ex)
{
token.Source.SetException(ex);
}
}
}

Google cloud dataflow - batch insert in bigquery

I was able to create a dataflow pipeline which reads data from pub/sub and after processing it writes to big query in streaming mode.
Now instead of stream mode i would like to run my pipeline in batch mode to reduce the costs.
Currently my pipeline is doing streaming inserts in bigquery with dynamic destinations. I would like to know if there is a way to perform a batch insert operation with dynamic destinations.
Below is the
public class StarterPipeline {
public interface StarterPipelineOption extends PipelineOptions {
/**
* Set this required option to specify where to read the input.
*/
#Description("Path of the file to read from")
#Default.String(Constants.pubsub_event_pipeline_url)
String getInputFile();
void setInputFile(String value);
}
#SuppressWarnings("serial")
public static void main(String[] args) throws SocketTimeoutException {
StarterPipelineOption options = PipelineOptionsFactory.fromArgs(args).withValidation()
.as(StarterPipelineOption.class);
Pipeline p = Pipeline.create(options);
PCollection<String> datastream = p.apply("Read Events From Pubsub",
PubsubIO.readStrings().fromSubscription(Constants.pubsub_event_pipeline_url));
PCollection<String> windowed_items = datastream.apply(Window.<String>into(new GlobalWindows())
.triggering(Repeatedly.forever(
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(300))))
.withAllowedLateness(Duration.standardDays(10)).discardingFiredPanes());
// Write into Big Query
windowed_items.apply("Read and make event table row", new
ReadEventJson_bigquery())
.apply("Write_events_to_BQ",
BigQueryIO.writeTableRows().to(new DynamicDestinations<TableRow, String>() {
public String getDestination(ValueInSingleWindow<TableRow> element) {
String destination = EventSchemaBuilder
.fetch_destination_based_on_event(element.getValue().get("event").toString());
return destination;
}
#Override
public TableDestination getTable(String table) {
String destination =
EventSchemaBuilder.fetch_table_name_based_on_event(table);
return new TableDestination(destination, destination);
}
#Override
public TableSchema getSchema(String table) {
TableSchema table_schema =
EventSchemaBuilder.fetch_table_schema_based_on_event(table);
return table_schema;
}
}).withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));
p.run().waitUntilFinish();
log.info("Events Pipeline Job Stopped");
}
}
Batch or Streaming are determined by the PCollection, so you would need to transform your data stream PCollection from Pub/Sub into a batch PCollection to write to BigQuery. The transform that allows to do this is GroupIntoBatches<K,InputT>.
Note that since this Transform uses Key-Value pairs, batches will contain only elements of a single key. For non-KV elements, check this related answer.
Once you have created your PCollection as batch using this transform, then apply the BigQuery write with Dynamic Destinations as you did with the stream PCollection.
You can limit the costs by using file loads for Streaming jobs. The Insertion Method section states that BigQueryIO.Write supports two methods of inserting data into BigQuery specified using BigQueryIO.Write.withMethod (org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method). If no method is supplied, then a default method will be chosen based on the input PCollection. See BigQueryIO.Write.Method for more information about the methods.
The different insertion methods provide different tradeoffs of cost, quota, and data consistency. Please see BigQuery documentation for more information about these tradeoffs.

wso2 msf4j: How to configure the server properties

I'm currently trying to use MSF4J with the StreamingOutput API. However, instead of streaming a File, I want to stream a series of unending short strings/texts. I want the strings to be flushed to the client immediately. However, the client is not getting it after the flush. I believe it is due to the default 8kb buffer, because my strings get flushed after a while in chunks. How do I override this default buffer the same way it is done in glassfish? https://jersey.java.net/apidocs/2.22/jersey/org/glassfish/jersey/server/ServerProperties.html#OUTBOUND_CONTENT_LENGTH_BUFFER
I want something like...
Properties properties = new Properties()
properties.set("jersey.config.server.contentLength.buffer", 0);**
new MicroservicesRunner()
.setProperties(properties)**
.addInterceptor(new HTTPMonitoringInterceptor())
.deploy(new MyService())
.start();
My streamingout class
new StreamingOutput(){
public void write(OutputStream os) throws IOException, WebApplicationException {
while(true){
os.write("some string".getBytes());
os.flush();
}
}
}
thank you.

How to lock a long async call in a WebApi action?

I have this scenario where I have a WebApi and an endpoint that when triggered does a lot of work (around 2-5min). It is a POST endpoint with side effects and I would like to limit the execution so that if 2 requests are sent to this endpoint (should not happen, but better safe than sorry), one of them will have to wait in order to avoid race conditions.
I first tried to use a simple static lock inside the controller like this:
lock (_lockObj)
{
var results = await _service.LongRunningWithSideEffects();
return Ok(results);
}
this is of course not possible because of the await inside the lock statement.
Another solution I considered was to use a SemaphoreSlim implementation like this:
await semaphore.WaitAsync();
try
{
var results = await _service.LongRunningWithSideEffects();
return Ok(results);
}
finally
{
semaphore.Release();
}
However, according to MSDN:
The SemaphoreSlim class represents a lightweight, fast semaphore that can be used for waiting within a single process when wait times are expected to be very short.
Since in this scenario the wait times may even reach 5 minutes, what should I use for concurrency control?
EDIT (in response to plog17):
I do understand that passing this task onto a service might be the optimal way, however, I do not necessarily want to queue something in the background that still runs after the request is done.
The request involves other requests and integrations that take some time, but I would still like the user to wait for this request to finish and get a response regardless.
This request is expected to be only fired once a day at a specific time by a cron job. However, there is also an option to fire it manually by a developer (mostly in case something goes wrong with the job) and I would like to ensure the API doesn't run into concurrency issues if the developer e.g. double-sends the request accidentally etc.
If only one request of that sort can be processed at a given time, why not implement a queue ?
With such design, no more need to lock nor wait while processing the long running request.
Flow could be:
Client POST /RessourcesToProcess, should receive 202-Accepted quickly
HttpController simply queue the task to proceed (and return the 202-accepted)
Other service (windows service?) dequeue next task to proceed
Proceed task
Update resource status
During this process, client should be easily able to get status of requests previously made:
If task not found: 404-NotFound. Ressource not found for id 123
If task processing: 200-OK. 123 is processing.
If task done: 200-OK. Process response.
Your controller could look like:
public class TaskController
{
//constructor and private members
[HttpPost, Route("")]
public void QueueTask(RequestBody body)
{
messageQueue.Add(body);
}
[HttpGet, Route("taskId")]
public void QueueTask(string taskId)
{
YourThing thing = tasksRepository.Get(taskId);
if (thing == null)
{
return NotFound("thing does not exist");
}
if (thing.IsProcessing)
{
return Ok("thing is processing");
}
if (!thing.IsProcessing)
{
return Ok("thing is not processing yet");
}
//here we assume thing had been processed
return Ok(thing.ResponseContent);
}
}
This design suggests that you do not handle long running process inside your WebApi. Indeed, it may not be the best design choice. If you still want to do so, you may want to read:
Long running task in WebAPI
https://blogs.msdn.microsoft.com/webdev/2014/06/04/queuebackgroundworkitem-to-reliably-schedule-and-run-background-processes-in-asp-net/