How many lines is best to batch before sending the ILP data to QuestDB?

I read somewhere that Influx accepts at most 1000 metrics (lines of data) per ILP send. What is the maximum for QuestDB?
I am currently batching 1000 lines before calling socket.send(). Will the speed go up if I send more in one go?

When you call send() on the socket it does not add any application-level batching; it just starts sending the byte buffer over the network. QuestDB batches all incoming data on the server side using the parameters commitLag and maxUncommittedRows, described at
https://questdb.io/docs/guides/out-of-order-commit-lag/
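For illustration, a minimal Python sketch of client-side batching over the raw ILP TCP socket (the default port 9009 and the batch size of 1000 are assumptions; a larger batch mainly saves on syscall overhead, since commits are governed server-side by commitLag and maxUncommittedRows):

```python
import socket

HOST, PORT = "localhost", 9009   # default QuestDB ILP (TCP) port
BATCH_LINES = 1000               # lines accumulated before each send; tune as needed

def send_ilp(rows):
    """rows: iterable of ILP line strings, e.g. 'sensors,id=s1 temp=21.5 1614000000000000000'."""
    with socket.create_connection((HOST, PORT)) as sock:
        buf = []
        for row in rows:
            buf.append(row)
            if len(buf) >= BATCH_LINES:
                # sendall() just pushes bytes onto the wire; batching here only
                # reduces per-call overhead, it does not affect server-side commits.
                sock.sendall(("\n".join(buf) + "\n").encode("utf-8"))
                buf.clear()
        if buf:  # flush whatever is left
            sock.sendall(("\n".join(buf) + "\n").encode("utf-8"))
```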

Related

Can you do batch pull messages with Google Pub Sub?

Trying to optimize our application by doing batch pulling. Pub/Sub seems to allow asynchronously pulling one message at a time with different client nodes, but is there no way for a single node to do a batch pull from Pub/Sub?
Both Streaming Pull and the Pull RPC only allow the subscriber to consume one message at a time. Right now, it looks like we would have to pull one message at a time and do application-level batching.
Any insight would be helpful. Pretty new to GCP in general.
The underlying pull and streaming pull operations can receive batches of messages in the same response. The Cloud Pub/Sub client library, which uses streaming pull, breaks these batches apart and hands them to the provided user callback one at a time. Therefore, you need not worry about optimizing the underlying receiving of messages.
If your concern is optimizing the subscriber code at the application level, e.g., you want to batch writes into a database, then you have a couple of options:
1. Use Pull directly, which allows one to process all of the messages in a batch at a time. Note that using pull effectively requires many simultaneously outstanding pull requests and replacing requests that return with new requests immediately.
2. In your user callback, re-batch messages and once the batch reaches a desired size (or you've waited a sufficient amount of time to fill the batch), process all of the messages together and then ack them (see the sketch below).
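A rough sketch of the second option with the Python Cloud Pub/Sub client library; the project and subscription names, batch size, wait time, and the write_to_database sink are all placeholders:

```python
import queue
import threading
from google.cloud import pubsub_v1

BATCH_SIZE = 100        # assumed batch size for the database write
MAX_WAIT_SECONDS = 5.0  # assumed maximum time to wait for a full batch

pending = queue.Queue()

def write_to_database(rows):
    # Placeholder for the real batched database write.
    print(f"writing {len(rows)} rows")

def callback(message):
    # The client library hands messages to this callback one at a time;
    # park them (without acking yet) so a worker can write them in batches.
    pending.put(message)

def drain_batches():
    while True:
        batch = []
        try:
            batch.append(pending.get(timeout=MAX_WAIT_SECONDS))
            while len(batch) < BATCH_SIZE:
                batch.append(pending.get_nowait())
        except queue.Empty:
            pass
        if batch:
            write_to_database([m.data for m in batch])
            for m in batch:
                m.ack()  # only ack once the batch has been persisted

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")
threading.Thread(target=drain_batches, daemon=True).start()
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()  # block the main thread while messages flow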
You probably can implement that by using Dataflow (Apache Beam). You can have a running streaming job, where you group, window, and transform messages according to your requirements. The results of processing can be saved in batches or streamed further. This probably makes sense when the number of messages is really big.
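As a minimal Apache Beam (Python) sketch along those lines; the subscription path, window size, and write_batch sink are placeholders, and grouping everything under one key is only for illustration (it limits parallelism):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

def write_batch(keyed_batch):
    # Placeholder for the real sink; receives (key, iterable_of_messages).
    _, messages = keyed_batch
    print(f"writing {len(list(messages))} messages")

options = PipelineOptions(streaming=True)  # add runner/project flags for Dataflow

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(subscription="projects/<project>/subscriptions/<sub>")
     | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 60-second batches
     | "Key" >> beam.Map(lambda msg: (None, msg))            # single key just to group
     | "Group" >> beam.GroupByKey()
     | "Write" >> beam.Map(write_batch))
```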

How Databricks processes the incoming messages from EventHub?

Being a novice to real-time continuous data processing scenarios, I would like to know how an incoming continuous series of messages gets processed via Databricks: are they processed sequentially, one by one, or in parallel?
Thanks.
One way to do this is to use Spark on Databricks to ingest the data from EventHub. This is done by consuming a message queue. If only one consumer is used to read from the queue, the messages will be processed sequentially. However, if multiple consumers are used, it is possible to process multiple messages in parallel.
Check out these examples for more info as well:
https://lenadroid.github.io/posts/connecting-spark-and-eventhubs.html
https://learn.microsoft.com/en-us/azure/azure-databricks/databricks-stream-from-eventhubs
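As a rough Structured Streaming sketch for a Databricks notebook (where spark and sc are predefined); the connection string handling and option names depend on the azure-event-hubs-spark connector version, so treat them as assumptions:

```python
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;...;EntityPath=<eventhub>"

eh_conf = {
    # Recent versions of the connector expect the connection string to be encrypted.
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string),
}

# Each Event Hubs partition maps to a Spark partition, so the degree of
# parallelism follows the number of partitions on the hub and the cluster size.
stream = (spark.readStream
          .format("eventhubs")
          .options(**eh_conf)
          .load())

query = (stream.selectExpr("cast(body as string) as body")
         .writeStream
         .format("console")      # replace with a real sink (Delta table, etc.)
         .outputMode("append")
         .start())
```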

AWS Kinesis Stream as FIFO queue

We currently have an application that receives a large amount of sensor data. Each sensor has its own unique sensor id (e.g. '5834f7718273f92cc326f620') and emits its status at different intervals. The processing order of the messages that come in is not important; for example, a newer message from one sensor can be processed before an older message from another sensor. What does matter, though, is that each message for a given sensor must be processed sequentially, in the order in which it arrived in the stream.
I have taken a look at the Kinesis Client Library and understand that KCL pushes messages to a single processor per shard. Does this mean that if a stream has only one shard it will have only one processor, and couldn't this create a bottleneck? Or does KCL have more than one processor and somehow, perhaps using the partition key, ensure that messages with the same partition key are never processed concurrently?
Note: We have taken a look at SQS FIFO, but ruled it out as the 300 messages per second limit would soon become an issue.
Yes, each shard can only have one processor at a given moment (per application).
But you can use the sensor id as the partition key for your Kinesis put record request (see here).
This will make sure that all of a given sensor's events get into the same shard and processor.
If you do that, you'll be able to scale your processors and shards and still get each sensor's events processed by a single processor.
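For illustration, a small boto3 sketch of the producer side (the stream name and record shape are assumptions):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def publish_reading(stream_name, reading):
    # Using the sensor id as the partition key routes every record for a given
    # sensor to the same shard, so the single KCL record processor owning that
    # shard sees the sensor's events in arrival order.
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=reading["sensor_id"],  # e.g. '5834f7718273f92cc326f620'
    )

publish_reading("sensor-stream", {"sensor_id": "5834f7718273f92cc326f620", "status": "OK"})
```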

How to stream a queue across multiple subscriber?

What I am trying to accomplish on higher level:
I have a function that does I/O and generates messages. I have multiple subscriber clients that can subscribe or leave at any time. When a new client subscribes, it should get the last x outputs before streaming new messages (much like Unix "tail -f").
My idea was to send-off the messages to an agent, which is a ring buffer. New clients will read the agent and then add-watch to the agent. Problem is, how can I ensure no new message arrives between the read and the add-watch?
Next idea was to create 2 refs, one for a list of clients, one for the ring buffer. I can then add clients or post messages in transactions. Problem is, when I add a client, I have to read the ring buffer and send it to the client (I/O). This is a side effect in a transaction that may be retried.
Last idea is to use locks, but that can't be the only way?
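For what it's worth, a minimal sketch of the lock-based variant (in Python rather than Clojure, purely to show the shape): one lock makes "replay the last x messages + register the subscriber" atomic, so nothing can slip in between the read and the subscription.

```python
import queue
import threading
from collections import deque

class Broadcaster:
    def __init__(self, history=100):
        self._lock = threading.Lock()
        self._history = deque(maxlen=history)  # ring buffer of recent messages
        self._subscribers = []

    def publish(self, msg):
        with self._lock:
            self._history.append(msg)
            for q in self._subscribers:
                q.put(msg)                      # put() on an unbounded Queue never blocks

    def subscribe(self):
        q = queue.Queue()
        with self._lock:
            for msg in self._history:           # replay the backlog, tail -f style
                q.put(msg)
            self._subscribers.append(q)
        return q
```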

How to use Kinesis to broadcast records?

I know that Kinesis' typical use case is event streaming, however we'd like to use it to broadcast some information, to have it in near real time in some apps besides making it available for further stream processing. KCL seems to be the only viable option to use Kinesis, as the stream API is too low level.
As far as I understand, to use KCL we'd have to generate a random applicationId so all apps could receive all the data, but this means creating a new DynamoDB table each time an application starts. Of course we can perform cleanup when the application stops, but when the application doesn't stop gracefully there would be a DynamoDB table left hanging around.
Is there a way/pattern to use Kinesis streams in a broadcast fashion?