Sending EventData larger than 256K to Azure Event Hub - azure-eventhub

I need to send some data to Event Hub, but it is not going through because the size is too big. Is there a way to compress the data, or some mechanism that chunks up the data and joins it back together in Event Hub?
I am using the following Java code to send:
EventData sendEvent = new EventData(payloadBytes);
EventHubClient ehClient = EventHubClient.createFromConnectionStringSync(connStr.toString());
ehClient.sendSync(sendEvent);
What are my options if the payloadBytes is too big?

Another approach would be to switch from the Basic to the Standard tier, which allows you to send events up to 1 MB in size. Microsoft docs: Quotas and limits - basic vs standard tiers.

You can try compressing the message you send. According to the docs, you can edit the properties map inside the EventData:
var eventData = new EventData(....);
eventData.getProperties().put("Compression","GZip");
However, their solution didn't work for me and the message size didn't decrease. Therefore I gzipped the data myself before adding it to the batch:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

private EventData compressData(Data data) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    GZIPOutputStream gos = new GZIPOutputStream(baos);
    // Worth creating the serializer outside of this function's scope.
    gos.write(new DefaultJsonSerializer().serializeToBytes(data));
    gos.flush();
    gos.close();
    return new EventData(baos.toByteArray());
}
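On the receiving side you need to reverse this. A minimal sketch of the decompression helper, using only java.util.zip (how you obtain the raw body bytes from the received EventData depends on your client library version):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

private byte[] decompress(byte[] compressedBody) throws IOException {
    // Wrap the gzipped event body and copy the inflated bytes out.
    GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(compressedBody));
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    byte[] buffer = new byte[8192];
    int read;
    while ((read = gis.read(buffer)) != -1) {
        baos.write(buffer, 0, read);
    }
    gis.close();
    return baos.toByteArray();
}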

Could you describe your use case and what sort of data you are going to publish? That would help in finding the right solution. Here are some common options I have used in my projects.
Some concerns to keep in mind when designing a system that needs a message broker or event streaming:
Event Hub is designed for event streaming, as described here. It scales well when you have a huge number of events.
The 256 KB limit per message is far more than what is usually needed to transfer events, which are typically text based.
In my project I had two different use cases where 256 KB was not enough; here is how I solved them:
We needed to publish internal events of our monolithic system so that they could later be consumed by our microservices outside it. Most of the time the messages were small, but sometimes they went slightly over 256 KB. The fix was easy: we compressed them before publishing and unzipped them at the receiver. You can find a good sample here.
In the second scenario the messages were far bigger than 256 KB, so compression was not enough. What I did was first create a blob with the content and then publish an event containing only a reference to that blob; the event receiver can then fetch the content from the blob, no matter how big it is.
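A rough sketch of that blob-reference approach, assuming the classic Azure Storage Java SDK (com.microsoft.azure.storage) alongside the EventHubClient from the question; the container name "large-events" and the method name are just placeholders:

import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.blob.CloudBlobContainer;
import com.microsoft.azure.storage.blob.CloudBlockBlob;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

private void publishViaBlobReference(EventHubClient ehClient, String storageConnStr, byte[] payloadBytes) throws Exception {
    // Upload the oversized payload to a blob first.
    CloudStorageAccount account = CloudStorageAccount.parse(storageConnStr);
    CloudBlobContainer container = account.createCloudBlobClient().getContainerReference("large-events");
    CloudBlockBlob blob = container.getBlockBlobReference(UUID.randomUUID().toString());
    blob.uploadFromByteArray(payloadBytes, 0, payloadBytes.length);

    // The event itself only carries a pointer to the blob; the receiver downloads the content.
    EventData refEvent = new EventData(blob.getUri().toString().getBytes(StandardCharsets.UTF_8));
    ehClient.sendSync(refEvent);
}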
This should cover most use cases, but let me know more about your problem if it is not useful for you.

Related

C++ lib for disk persistent FIFO queue of binary messages

Looking for a C++ library, or an easy and robust combination of libraries, that will provide a durable disk-backed queue for variable-sized binary blocks.
My app produces messages that are sent out to subscribers (the messages are variable-sized binaries). In case of a subscriber failure, restart, or networking issue, I need something like a circular buffer to queue them up until the subscriber returns. The available RAM is not enough to handle the worst-case failure scenario, so I'm looking for an easy way to offload data to disk.
Best case: set a maximum disk space (e.g. 100 GB) and a file name, recover data after an application restart, a .push_back() / .front() / .pop_front()-like API, no performance drawback when the queue is small (the 99.99% case), and no need for strict persistence (fsync() on every message).
Average case: data is not preserved between restarts.
Some combination of Boost libraries would be highly preferable.

Measure the rate at which messages arrive in a Akka Streams Flow or a Sink

I have written an Akka Streams application and it's working fine.
What I want to do is attach my JMX console to the JVM instance running the Akka Streams application and then study the number of messages coming into my Sinks and Flows.
Is this possible? I googled but didn't find a concrete way.
The final stage of my application is a sink to a Cassandra database. I want to know the rate of messages per second coming into the Sink.
I also want to pick a random Flow in my graph and then know the number of messages per second flowing through the flow.
Is there anything out of the box, or should I just code something like Dropwizard into each of my flows to measure the rate?
Presently there is nothing "out of the box" you can leverage to monitor rates inside your Akka Stream.
However, this is a very simple facility that you can extract into a monitoring Flow and place wherever it fits your needs.
The example below is based on Kamon, but you can see it can be ported to Dropwizard very easily:
import akka.NotUsed
import akka.stream.scaladsl.Flow
import kamon.Kamon

def meter[T](name: String): Flow[T, T, NotUsed] = {
  val msgCounter = Kamon.metrics.counter(name)
  Flow[T].map { x =>
    msgCounter.increment()
    x
  }
}

mySource
  .via(meter("source"))
  .via(myFlow)
  .via(meter("sink"))
  .runWith(mySink)
The above is part of a demo you can find at this repo.
An ad-hoc solution like this has the advantage of being perfectly tailored to your application while remaining simple.
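For reference, a rough sketch of the Dropwizard port mentioned above, using the Akka Streams Java DSL; the class name is a placeholder, and it assumes metrics-core 3.x, where JmxReporter lives in the core artifact and makes the rates visible from a JMX console:

import akka.NotUsed;
import akka.stream.javadsl.Flow;
import com.codahale.metrics.JmxReporter;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class StreamMetrics {
    private static final MetricRegistry registry = new MetricRegistry();

    static {
        // Publish all meters over JMX so they show up in a JMX console.
        JmxReporter.forRegistry(registry).build().start();
    }

    // Pass-through stage that marks a Meter for every element.
    // A Meter tracks the total count plus mean / 1- / 5- / 15-minute rates (messages per second).
    public static <T> Flow<T, T, NotUsed> meter(String name) {
        Meter meter = registry.meter(name);
        return Flow.<T>create().map(x -> {
            meter.mark();
            return x;
        });
    }
}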

Cloud Dataflow: Side effect when watermark advances

We are working with a streaming, unbounded PCollection in Google Dataflow that originates from a Cloud Pub/Sub subscription. We are using this as a firehose to simply deliver events to Bigtable continuously. The delivery itself is performing nicely.
Our problem is that we have downstream batch jobs that expect to read a day's worth of data out of Bigtable once it is delivered. I would like to utilize windowing and triggering to implement a side effect that writes a marker row out to Bigtable when the watermark advances beyond the day threshold, indicating that Dataflow has reason to believe that most of the events have been delivered (we don't need strong guarantees on completeness, just reasonable ones) and that downstream processing can begin.
What we've tried is writing out the raw events as one sink in the pipeline, and then windowing into another sink, using the timing information in the pane to determine whether the watermark has advanced. The problem with this approach is that it operates on the raw events themselves again, which is undesirable since it would repeat writing the event rows. We can prevent this write, but the parallel path in the pipeline would still be operating over the windowed streams of events.
Is there an efficient way to attach a callback of sorts to the watermark, such that we can perform a single action when the watermark advances?
The general ability to set a timer in event time and receive a callback is definitely an important feature request, filed as BEAM-27, which is under active development.
But actually your approach of windowing into FixedWindows.of(Duration.standardDays(1)) seems like it will accomplish your goal using just the features of the Dataflow Java SDK 1.x. Instead of forking your pipeline, you can maintain the "firehose" behavior by adding the trigger AfterPane.elementCountAtLeast(1). It does incur the cost of a GroupByKey but does not duplicate anything.
The complete pipeline might look like this:
pipeline
    // Read your data from Cloud Pubsub and parse to MyValue
    .apply(PubsubIO.Read.topic(...).withCoder(MyValueCoder.of()))
    // You'll need some keys
    .apply(WithKeys.<MyKey, MyValue>of(...))
    // Window into daily windows, but still output as fast as possible
    .apply(Window.into(FixedWindows.of(Duration.standardDays(1)))
        .triggering(AfterPane.elementCountAtLeast(1)))
    // GroupByKey adds the necessary EARLY / ON_TIME / LATE labeling
    .apply(GroupByKey.<MyKey, MyValue>create())
    // Convert KV<MyKey, Iterable<MyValue>>
    // to KV<ByteString, Iterable<Mutation>>
    // where the iterable of mutations has the "end of day" marker if
    // it was ON_TIME
    .apply(MapElements.via(new MessageToMutationWithEndOfWindow()))
    // Write it!
    .apply(BigTableIO.Write.to(...));
Please do comment on my answer if I have missed some detail of your use case.

What are the options to process timeseries data from a Kinesis stream

I need to process data from an AWS Kinesis stream, which collects events from devices. The processing function has to be called each second with all events received during the last 10 seconds.
Say, I have two devices A and B that write events into the stream.
My procedure is named MyFunction and takes the following params:
DeviceId
Array of data for a period
If I start processing at 10:00:00 (and already have accumulated events for devices A and B for the last 10 seconds)
then I need to make two calls:
MyFunction(A, {Events for device A from 09:59:50 to 10:00:00})
MyFunction(B, {Events for device B from 09:59:50 to 10:00:00})
In the next second, at 10:00:01
MyFunction(A, {Events for device A from 09:59:51 to 10:00:01})
MyFunction(B, {Events for device B from 09:59:51 to 10:00:01})
and so on.
It seems the simplest way to accumulate all the data received from the devices is to just store it in memory in a temporary buffer (the last 10 seconds only, of course), so I'd like to try this first.
The most convenient way I have found to keep such an in-memory buffer is to create an application based on the Java Kinesis Client Library (KCL).
I have also considered an AWS Lambda based solution, but it looks like it's impossible to keep data in memory between Lambda invocations. Another option for Lambda is to have two functions: the first writes all the data into DynamoDB, and the second is called each second to process data fetched from the database rather than from memory (so this option is much more complicated).
So my question is: what other options are there to implement such processing?
So, what you are doing is called a "window operation" (or "windowed computation"). There are multiple ways to achieve it; as you said, buffering is the best option (a rough sketch of such a buffer follows the options below).
In-memory cache systems: Ehcache, Hazelcast
Accumulate data in a cache system and choose the proper eviction policy (10 seconds in your case). Then do a grouping/summation operation and calculate the output.
In-memory databases: Redis, VoltDB
Just like with a cache system, you can use a database for this. Redis could be helpful here since it is stateful. If you use VoltDB or a similar SQL system, calling a "sum()" or "avg()" operation would be easier.
Spark Streaming: http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
It is possible to use Spark to do that counting. You can try Elastic MapReduce (EMR), so you stay in the AWS ecosystem and integration will be easier.
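To make the buffering option concrete, here is a minimal, library-free Java sketch of a 10-second in-memory window keyed by device id (the class and method names are made up; a real KCL application would also need to handle checkpointing and synchronize concurrent access):

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SlidingWindowBuffer<E> {
    private static final Duration WINDOW = Duration.ofSeconds(10);

    private static class Timestamped<T> {
        final Instant time;
        final T event;
        Timestamped(Instant time, T event) { this.time = time; this.event = event; }
    }

    // One deque of timestamped events per device.
    private final Map<String, Deque<Timestamped<E>>> buffers = new ConcurrentHashMap<>();

    // Called from the record processor for every incoming Kinesis record.
    public void add(String deviceId, Instant time, E event) {
        buffers.computeIfAbsent(deviceId, id -> new ArrayDeque<>())
               .addLast(new Timestamped<>(time, event));
    }

    // Called once a second: evict everything older than 10 seconds and return the rest.
    // Note: ArrayDeque itself is not thread-safe, so guard add/read with a lock in a real setup.
    public List<E> lastTenSeconds(String deviceId, Instant now) {
        Deque<Timestamped<E>> deque = buffers.getOrDefault(deviceId, new ArrayDeque<>());
        Instant cutoff = now.minus(WINDOW);
        while (!deque.isEmpty() && deque.peekFirst().time.isBefore(cutoff)) {
            deque.pollFirst();
        }
        List<E> result = new ArrayList<>();
        for (Timestamped<E> t : deque) {
            result.add(t.event);
        }
        return result;
    }
}

Each second you would then call MyFunction(deviceId, buffer.lastTenSeconds(deviceId, Instant.now())) for every device currently present in the buffer.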

Webservice protection against big messages

I am developing a web service in Java on the JAX-WS stack and GlassFish.
Now I am a bit concerned about a couple of things.
I need to pass in an unknown amount of binary data that will be processed by an MDB. It is written this way to be asynchronous (so the user does not have to wait for the calculation to take place), somewhat fault tolerant, as well as very scalable.
The input message can, however, be split into chunks and sent to the MDB, or split in the client and sent to the WS itself in chunks.
What I am looking for is a way to specify the maximum size of the input so I won't blow the heap even if someone deliberately tries to send a message that is too big. I have noticed that things tend to become a bit unstable once you hit the ceiling, and I must be able to keep running.
Is it possible to be safe against big messages, or should I use another method instead of WS? Which options do I have?
Well, I am rather new to Java EE.
If you're passing binary data, take a look at enabling MTOM for the endpoint. It utilizes streaming and has a 'threshold' parameter.
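A minimal sketch of what that could look like with the standard JAX-WS annotations (the class and method names are placeholders; the threshold is in bytes and controls when binary content goes out as a raw MIME attachment instead of inline base64, while the actual streaming behaviour depends on the runtime, e.g. Metro on GlassFish):

import javax.activation.DataHandler;
import javax.jws.WebMethod;
import javax.jws.WebService;
import javax.xml.ws.soap.MTOM;

@MTOM(enabled = true, threshold = 1024)
@WebService
public class UploadService {

    @WebMethod
    public String upload(DataHandler data) throws java.io.IOException {
        // Read the attachment as a stream instead of materializing it on the heap,
        // forwarding chunks to the MDB / JMS queue as they arrive.
        try (java.io.InputStream in = data.getInputStream()) {
            byte[] buffer = new byte[8192];
            long total = 0;
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
                // enforce your own size limit here and abort once it is exceeded
            }
            return "received " + total + " bytes";
        }
    }
}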