Viewing data in a circular buffer in real-time - c++

I have an incoming stream of messages, and want a window that allows the user to scroll through the messages.
This is my current thinking:
Incoming messages go into a single-producer single-consumer queue.
A thread reads them out and places them into a circular buffer, assigning each a sequential id. This way I could have multiple incoming streams safely placed in the circular buffer, and it decouples the input.
A mutex coordinates circular buffer access between the UI and the thread.
Two notifications go from the thread to the UI: one for the first id and one for the last id in the buffer, whenever either changes.
This allows the UI to figure out what it can display, which parts of the circular buffer it needs access to, and which overwritten messages to discard. It only accesses the messages required to fill the window at its current size and scroll position.
I'm not happy about the notifications into the UI: they would be generated at high frequency. They could be queued or otherwise throttled; latency should not matter for the first id, but delays in handling the last id could cause problems in corner cases, such as viewing the very end of a full buffer, unless the UI makes a copy of the messages it displays, which I would like to avoid.
Does this sound like the right approach? Any tweaks that could make it a bit more palatable?
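For concreteness, here is a minimal sketch of the structure I have in mind; the names (Message, MessageRing, the notifier callback) are placeholders for illustration, not code from my application:

#include <algorithm>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <string>

struct Message { std::string text; };

// Holds the most recent `capacity` messages, each tagged with a sequential id.
// The consumer thread pushes; the UI reads under the same mutex.
class MessageRing {
public:
    explicit MessageRing(std::size_t capacity) : capacity_(capacity) {}

    // Called by the consumer thread for every message popped from the SPSC input queue.
    void push(Message msg) {
        std::uint64_t first, last;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (buffer_.size() == capacity_) {
                buffer_.pop_front();          // oldest message is overwritten
                ++first_id_;
            }
            buffer_.push_back(std::move(msg));
            ++last_id_;
            first = first_id_;
            last = last_id_;
        }
        if (on_range_changed_) on_range_changed_(first, last);   // notify the UI outside the lock
    }

    // Called by the UI: visit only the ids needed to fill the visible window.
    template <typename Fn>
    void for_each_in_range(std::uint64_t from, std::uint64_t to, Fn fn) {
        std::lock_guard<std::mutex> lock(mutex_);
        const std::uint64_t lo = std::max(from, first_id_);
        const std::uint64_t hi = std::min(to, last_id_);
        for (std::uint64_t id = lo; id < hi; ++id)
            fn(id, buffer_[id - first_id_]);
    }

    void set_notifier(std::function<void(std::uint64_t, std::uint64_t)> fn) {
        on_range_changed_ = std::move(fn);
    }

private:
    std::mutex mutex_;
    std::deque<Message> buffer_;              // stand-in for a fixed-size ring
    std::size_t capacity_;
    std::uint64_t first_id_ = 0;              // id of the oldest message still held
    std::uint64_t last_id_ = 0;               // one past the newest id
    std::function<void(std::uint64_t, std::uint64_t)> on_range_changed_;
};

The UI would call for_each_in_range() with the id range of the rows currently visible in the window.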

(See the Effo EDIT below; this part is deprecated.) The ring buffer is not necessary if there's a queue between the thread and each UI.
When a message arrives, the thread pops it and pushes it to the appropriate UI's queue.
Furthermore, each UI.Q could be operated atomically too; no mutex is needed. Another benefit is that each message is only copied twice: once into the low-level queue and once to the display, because storing the message anywhere else is not necessary (just assigning a pointer from the low-level queue to the UI.Q should be enough in C/C++).
So far the only concern is that the length of a UI.Q might not be enough at run time when messaging traffic is heavy. For this question, you can either use a dynamic-length queue or let the UI itself store the overflowed messages in a POSIX memory-mapped file. POSIX mapping is highly efficient even though you are using a file and need to do extra message copying; but in any case this is only exception handling. The queue can be set to a proper size so that normally you get excellent performance. The point is that when the UI needs to store overflowed messages in a mapped file, it should do so in a highly concurrent fashion too, so that it does not affect the low-level queue.
I prefer the dynamic-size queue proposal; we have lots of memory on modern PCs.
See the document EffoNetMsg.pdf at http://code.google.com/p/effonetmsg/downloads/list to learn more about lock-free queue facilities and highly-concurrent programming models.
Effo EDIT#2009oct23: showing a Staged Model which supports random message access for scrolling in message viewers.
                     +---------------+
                +--->| Ring Buffer-1 |<-------+
                |    +---------------+        |
                |                             |
                |    +---------------+        |
                +--->| Ring Buffer-2 |<-------+
                |    +---------------+        |
                |                             |
        +-------+-------+         +-----------+----------+
        |  Push Msg &   |         |    GetHeadTail()     |
        |  Send AckReq  |         |   & Send UpdateReq   |
        +---------------+         +----------------------+
        |App.MsgStage() |         |  App.DisplayStage()  |
        +-------+-------+         +-----------+----------+
                | Pop()                       | Pop()
  ^           +-V-+                         +-V-+
  | Events    | Q |  Msg Stage              | Q |  Display Stage
  | Go Up     | 0 |  Logic-Half             | 1 |  Logic-Half
--+-----------|   |-------------------------|   |---------------
  | Requests  |   |  I/O-Half               |   |  I/O-Half
  | Move Down +-^-+                         +-^-+
  V             | Push()                      | Push()
+---------------+------------+                |
| Push OnRecv Event,         |        +-------+-------+
| 1 Event per message        |        |               |
|                            | +------+------+ +------+------+
| Epoll I/O thread for       | |Push OnTimer | |Push OnTimer |
|multi-messaging connections | | Event/UI-1  | | Event/UI-2  |
+------^-------^--------^----+ +------+------+ +------+------+
       |       |        |             |               |
Incoming msg1 msg2    msg3      Msg Viewer-1    Msg Viewer-2
The Points:
1. Understand the different highly-concurrent models; the one shown in the figure above is a Staged Model, which is why it runs fast.
2. There are two kinds of I/O: one is messaging (an epoll thread if C/C++ on GNU Linux 2.6.x); the other is displaying, such as drawing to the screen or printing text. The two kinds of I/O are processed as two stages accordingly. Note that on Win/MSVC you would use a Completion Port instead of epoll.
3. There are still only two message copies, as mentioned before: a) Push-OnRecv generates the message ("CMsg *pMsg = CreateMsg(msg)" if C/C++); b) the UI reads and copies the message from its ring buffer accordingly, and only needs to copy the updated message parts, not the whole buffer. Note that queues and ring buffers only store a message handle ("queue.push(pMsg)" or "RingBuff.push(pMsg)" if C/C++), and any aged-out message will be deleted ("pMsg->Destroy()" if C/C++). In general, MsgStage() would rebuild the message header before pushing it into the ring buffer.
4. After an OnTimer event, the UI will receive an update from the upper layer containing the new Head/Tail indicators of the ring buffer, so the UI can update the display accordingly. Ideally the UI has a local message buffer, so it does not need to copy the whole ring buffer, just the update; see point 3 above. If you need random access on the ring buffer, you could just let the UI generate an OnScroll event; actually, if the UI has a local buffer, OnScroll might not be necessary, but you can do it anyway. Note the UI determines whether or not to discard an aged-out message, say by generating an OnAgedOut event, so that the ring buffers can be operated correctly and safely.
5. To be precise, OnTimer and OnRecv are the event names, and OnTimer(){} or OnRecv(){} would be executed in DisplayStage() or MsgStage(). Again, events go upwards and requests go downwards, which might be different from what you have thought or seen before.
6. Q0 and the 2 ring buffers could be implemented as lock-free facilities to improve performance, since each has a single producer and a single consumer; no lock/mutex is needed (a minimal single-producer single-consumer sketch follows after these points). Q1 is something different, but I believe you can make it single-producer single-consumer too by changing the design figure slightly, e.g. add a Q2 so every UI has its own queue, and DisplayStage() could just poll Q1 and Q2 to process all events correctly. Note Q0 and Q1 are event queues; the request queues are not shown in the figure above.
7. MsgStage() and DisplayStage() run sequentially in a single StagedModel.Stage(), say the Main Thread. Epoll I/O or messaging is another thread, the MsgIO Thread, and every UI has an I/O thread, say a Display Thread. So in the figure above there are 4 threads in total running concurrently. Effo has tested that just one MsgIO Thread should be enough for multiple listeners plus thousands of messaging clients.
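As referenced in point 6, here is a minimal sketch of the kind of lock-free single-producer single-consumer ring of message handles that could back Q0 and the ring buffers. The Msg type, the fixed capacity and the modulo indexing are illustrative assumptions; the Effo documents linked below describe their own facilities.

#include <atomic>
#include <cstddef>

struct Msg;  // opaque message; only the handle (pointer) is stored

// Fixed-capacity SPSC ring: exactly one producer thread calls push(),
// exactly one consumer thread calls pop(). No locks are required because
// head_ is written only by the consumer and tail_ only by the producer.
template <std::size_t Capacity>
class SpscRing {
public:
    bool push(Msg* m) {                       // producer thread only
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire))
            return false;                     // full: caller decides (drop, retry, grow)
        slots_[tail] = m;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(Msg*& out) {                     // consumer thread only
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire))
            return false;                     // empty
        out = slots_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    Msg* slots_[Capacity] = {};
    std::atomic<std::size_t> head_{0};        // next slot to read
    std::atomic<std::size_t> tail_{0};        // next slot to write
};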
Again, see the document EffoNetMsg.pdf at http://code.google.com/p/effonetmsg/downloads/list or EffoAddons.pdf at http://code.google.com/p/effoaddon/downloads/list to learn more about Highly-concurrent Programming Models and network messaging; see EffoDesign_LockFree.pdf at http://code.google.com/p/effocore/downloads/list to learn more about lock-free facilities such as lock-free queue and lock-free ring buffer.

The notification to the GUI shouldn't contain the ID, i.e. the current value. Instead it should just say "the current value has changed", and then let the GUI read the value: because there may be a delay between sending the notification and the GUI reading the value, and you want the GUI to read the current value (and not a potentially stale value). You want it to be an asynchronous notification.
Also you can afford to throttle notifications, e.g. send no more than 5 or 20 per second (delay a notification by up to 50 to 200 msec if necessary).
Also, the GUI will inevitably be making a copy of the message it displays, in the sense that there will be a copy of the message on the screen (in the display driver)! As for whether the GUI makes a copy into a private RAM buffer of its own: although you might not want to copy the whole message, you might find it safer/easier to have a design where you copy just as much of the message as you need to paint/repaint the display. Because you can't paint very much on a screen at one time, the quantity of data you would need to copy to do that is trivial.
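To illustrate the asynchronous "something changed" notification plus the throttling suggested above, here is a minimal sketch; the notifyUi() hook and the 100 ms interval are illustrative assumptions, not part of the original design:

#include <atomic>
#include <chrono>
#include <functional>

// The producer thread sets a "dirty" flag instead of sending ids; the GUI,
// when it wakes up, reads the current first/last ids directly from the
// shared buffer, so it can never act on a stale value.
class ChangeNotifier {
public:
    explicit ChangeNotifier(std::function<void()> notifyUi)
        : notifyUi_(std::move(notifyUi)) {}

    // Producer side: called on every change, but forwards to the UI at
    // most once per interval (throttling).
    void onChanged() {
        dirty_.store(true, std::memory_order_release);
        const auto now = std::chrono::steady_clock::now();
        if (now - lastNotify_ >= interval_) {
            lastNotify_ = now;
            notifyUi_();                    // e.g. post an event to the UI message loop
        }
    }

    // GUI side: returns true if anything changed since the last poll; the
    // GUI then reads the first/last ids from the buffer under its mutex.
    bool consumeDirty() {
        return dirty_.exchange(false, std::memory_order_acq_rel);
    }

private:
    std::function<void()> notifyUi_;
    std::atomic<bool> dirty_{false};
    std::chrono::steady_clock::time_point lastNotify_{};
    std::chrono::milliseconds interval_{100};   // roughly 10 notifications per second
};

Note that a purely throttled notification can swallow the very last change; in practice the GUI would also repaint on a short timer, or the notifier would arm a one-shot trailing notification, so the final update is never missed.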

Related

AWS SQS.FIFO reads the first 20,000 messages to determine message groups, what is the order of reading?

What mechanism does SQS.FIFO use to read the first 20,000 messages?
Here's an example for some context:
On a FIFO queue, we have message groups:
| Group | Number of messages |
| A | 50,000 |
| B | 100 |
| C | 5 |
Timing of messages received on the queue:
Group A added at a rate of 100 per second from 11:00:00 onwards
Group B added at a rate of 10 per second from 11:00:01 onwards
Group C added at 11:05:00
No delivery delays are applied to any of the messages. The queue is configured with a visibility timeout to match a lambda consumer that will be added later. The queue isn't being processed by anything yet.
Later on, a lambda function is configured with the above queue as an event source, with a maximum concurrency of 3, a maximum batch size of 5, and long polling of 2 seconds. The lambda function takes 1 minute to process the events.
What would the first few batches contain?
| batch | messages | consumer |
| 1 | AAAAA | lambda1 |
| 2 | AAAAA | lambda1 |
| 3 | AAAAA | lambda1 |
| 4 | BBBBB | lambda2 ? |
The above model is what I expect to see if SQS.FIFO reads the messages ordered by time across all message groups.
The alternative is that SQS.FIFO keeps reading from message group A until the total number of messages on the queue is down to <20,000
Could someone shed some light on the reading mechanism?
As stated in the docs:
AWS SQS.FIFO queue looks through the first 20k messages to determine
available message groups. This means that if you have a backlog of
messages in a single message group, you can't consume messages from
other message groups that were sent to the queue at a later time until
you successfully consume the messages from the backlog.
SQS.FIFO reads the first 20,000 messages across all message groups, ordered by time of receipt.
I created an experiment with 3 message groups, adding 21k, 1k and 21k messages respectively, sent in the order listed above. The queue was processed by a lambda function with a maximum batch size of 10 messages. I introduced a delay of 1s to the lambda function.
The total queue size of available messages was 42k. For the first 1000 messages the queue only had 10 messages in flight. Then, when the queue dropped to <41k, I could see 20 messages in flight. This remained so until the queue drained. Here is my mental model of what happened in that queue. The three message groups are represented with blue, green and red bars.

Dataflow reading from Kafka without data loss?

We're currently big users of Dataflow batch jobs and wanting to start using Dataflow streaming if it can be done reliably.
Here is a common scenario: we have a very large Kafka topic that we need to do some basic ETL or aggregation on, and a non-idempotent upstream queue. Here is an example of our Kafka data:
ID   | msg | timestamp (mm:ss)
-----+-----+------------------
1    | A   | 01:00
2    | B   | 01:01
3    | D   | 06:00
4    | E   | 06:01
4.3  | F   | 06:01
...  | ... | ...   (millions more)
4.5  | ZZ  | 19:58


Oops, the data changes from integers to decimals at some point, which will eventually cause some elements to fail, requiring us to kill the pipeline, possibly modify the downstream service, and possibly make minor code changes to the Dataflow pipeline.
In Spark Structured Streaming, because of the ability to use external checkpoints, we would be able to restart a streaming job and resume processing the queue where the previous job left off (successfully processing) for exactly once processing. In a vanilla or spring boot Java Application we could loop through with a Kafka consumer, and only after writing results to our 'sink', commit offsets.
My overall question is can we achieve similar functionality in Dataflow? I'll list some of my assumptions and concerns:
It seems here in KafkaIO there is not a relationship between the offset-commit PCollection and the user's one; does that mean they can drift apart?
It seems here in KafkaOffsetCommit that a window of five minutes is taken and the highest offset emitted, but this is not wall time, this is Kafka record time. Going back to our sample data, to me it looks like the entire queue's offsets would be committed (in chunks of five minutes) as fast as possible! This means that if we have only finished processing up to record F in the first five minutes, we may have committed almost the entire queue's offsets?
Now, in our scenario where the pipeline started failing around F, it seems our only choice is to start from the beginning or lose data?

I believe this might be overcome with a lot of custom code (a custom DoFn to ensure the Kafka consumer never commits) and some custom code for our upstream sink that would eventually commit offsets. Is there a better way to do this, and/or are some of my assumptions wrong about how offset management is handled in Dataflow?


Thank you for the detailed question!
In Beam (hence Dataflow), all of the outputs for a "bundle" are committed together, along with all state updates, checkpoints, etc, so there is no drift between different output PCollections. In this specific case, the offsets are extracted directly from the elements to be output so they correspond precisely. The outputs and offsets are both durably committed to Dataflow's internal storage before the offset is committed back to Kafka.
You are correct that the offsets from the elements already processed are grouped into 5 minute event time windows (Kafka record time) and the maximum offset is taken. While 5 minutes is an arbitrary duration, the offsets correspond to elements that have been successfully pulled off the queue.

Long lived state with Google Dataflow

Just trying to get my head around the programming model here. Scenario is I'm using Pub/Sub + Dataflow to instrument analytics for a web forum. I have a stream of data coming from Pub/Sub that looks like:
ID | TS | EventType
1 | 1 | Create
1 | 2 | Comment
2 | 2 | Create
1 | 4 | Comment
And I want to end up with a stream coming from Dataflow that looks like:
ID | TS | num_comments
1 | 1 | 0
1 | 2 | 1
2 | 2 | 0
1 | 4 | 2
I want the job that does this rollup to run as a stream process, with new counts being populated as new events come in. My question is, where is the idiomatic place for the job to store the state for the current topic id and comment counts? Assuming that topics can live for years. Current ideas are:
Write a 'current' entry for the topic id to BigTable, and in a DoFn query what the current comment count is for each topic id coming in. Even as I write this I'm not a fan.
Use side inputs somehow? It seems like maybe this is the answer, but if so I'm not totally understanding.
Set up a streaming job with a global window, with a trigger that goes off every time it gets a record, and rely on Dataflow to keep the entire pane history somewhere. (unbounded storage requirement?)
EDIT: Just to clarify, I wouldn't have any trouble implementing any of these three strategies, or a million different other ways of doing it, I'm more interested in what is the best way of doing it with Dataflow. What will be most resilient to failure, having to re-process history for a backfill, etc etc.
EDIT2: There is currently a bug in the Dataflow service where updates fail if you add inputs to a Flatten transformation, which means you'll need to discard and rebuild any state accrued in the job if you make a change that adds something to a Flatten operation.
You should be able to use triggers and a combine to accomplish this.
PCollection<ID> comments = /* IDs from the source */;
PCollection<KV<ID, Long>> commentCounts = comments
    // Produce speculative results by triggering as data comes in.
    // Note that this won't trigger after *every* element, but it will
    // trigger relatively quickly (as the system divides incoming data
    // into work units). You could also throttle this with something
    // like:
    //   AfterProcessingTime.pastFirstElementInPane()
    //       .plusDelayOf(Duration.standardMinutes(5))
    // which will produce output every 5 minutes.
    .apply(Window.<ID>triggering(
            Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes())
    // Count the occurrences of each ID.
    .apply(Count.perElement());

// Produce an output String -- in your use case you'd want to produce
// a row and write it to the appropriate sink.
commentCounts.apply(ParDo.of(new DoFn<KV<ID, Long>, String>() {
  @Override
  public void processElement(ProcessContext c) {
    KV<ID, Long> element = c.element();
    // The pane info includes details about the pane of the window being
    // processed, including a strictly increasing index of the number of
    // panes that have been produced for this key.
    PaneInfo pane = c.pane();
    c.output(element.getKey() + " | " + pane.getIndex() + " | " + element.getValue());
  }
}));
Depending on your data, you could also read whole comments from the source, extract the ID, and then use Count.perKey() to get the counts for each ID. If you want a more complicated combination, you could look at defining a custom CombineFn and using Combine.perKey.
Since BigQuery does not support overwriting rows, one way to go about this is to write the events to BigQuery, and query the data using COUNT:
SELECT ID, COUNT(num_comments) from Table GROUP BY ID;
You can also do per-window aggregations of num_comments within Dataflow before writing the entries to BigQuery; the query above will continue to work.

How to list programmable wake devices with C++

I'm trying to achieve the results of the following command that lists all programmable wake devices, or those that can be set/reset to wake the system:
powercfg -devicequery wake_programmable
I need to do the same from a C++ service. I'm using code similar to this, but it gives me a smaller list. Here's how I call DevicePowerEnumDevices:
if (DevicePowerEnumDevices(index,
                           DEVICEPOWER_FILTER_DEVICES_PRESENT,
                           PDCAP_WAKE_FROM_D0_SUPPORTED |
                           PDCAP_WAKE_FROM_D1_SUPPORTED |
                           PDCAP_WAKE_FROM_D2_SUPPORTED |
                           PDCAP_WAKE_FROM_D3_SUPPORTED |
                           PDCAP_WAKE_FROM_S0_SUPPORTED |
                           PDCAP_WAKE_FROM_S1_SUPPORTED |
                           PDCAP_WAKE_FROM_S2_SUPPORTED |
                           PDCAP_WAKE_FROM_S3_SUPPORTED,
                           buff, &dwBuffSize))
{
    // Got it
}
What flags am I missing for wake_programmable?
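For context, here is (roughly) the full enumeration loop I'm using, simplified and with error handling omitted. The flag set is the same call as above; I believe the buffer comes back holding the device's friendly name as a wide string (headers are PowrProf.h plus Windows.h, linking against PowrProf.lib):

#include <windows.h>
#include <powrprof.h>
#include <cwchar>

void EnumWakeDevices()
{
    WCHAR deviceName[MAX_PATH] = {};
    ULONG dwBuffSize = sizeof(deviceName);

    for (ULONG index = 0;
         DevicePowerEnumDevices(index,
                                DEVICEPOWER_FILTER_DEVICES_PRESENT,
                                PDCAP_WAKE_FROM_D0_SUPPORTED |
                                PDCAP_WAKE_FROM_D1_SUPPORTED |
                                PDCAP_WAKE_FROM_D2_SUPPORTED |
                                PDCAP_WAKE_FROM_D3_SUPPORTED |
                                PDCAP_WAKE_FROM_S0_SUPPORTED |
                                PDCAP_WAKE_FROM_S1_SUPPORTED |
                                PDCAP_WAKE_FROM_S2_SUPPORTED |
                                PDCAP_WAKE_FROM_S3_SUPPORTED,
                                reinterpret_cast<PBYTE>(deviceName),
                                &dwBuffSize);
         ++index)
    {
        wprintf(L"%s\n", deviceName);      // compare this list against powercfg output
        dwBuffSize = sizeof(deviceName);   // reset the buffer size for the next call
    }
}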

multithread read from disk?

Suppose I need to read many distinct, independent chunks of data from the same file saved on disk.
Is it possible to multi-thread this upload?
Related: Do all threads on the same processor use the same IO device to read from disk? In this case, multi-threading would not speed up the upload at all - the threads would just be waiting in line.
(I am currently multi-threading with OpenMP.)
Yes, it is possible. However:
Do all threads on the same processor use the same IO device to read from disk?
Yes. The read head on the disk. As an example, try copying two files in parallel as opposed to in series. It will take significantly longer in parallel, because the OS uses scheduling algorithms to make sure the IO rate is "fair," or equal between the two threads/processes. Because of this, the read head will jump back and forth between different parts of the disk, slowing the process down A LOT. The time to actually read the data is pretty small compared to the time to seek to it, and when you're reading two different parts of the disk at once, you spend most of the time seeking.
Note that all of this assumes you're using a hard disk. If you're using an SSD, it will not be slower in parallel, but it will not be faster either. Edit: according to comments parallel is actually faster for an SSD. With RAID the situation becomes more complicated, and (obviously) depends on what kind of RAID you're using.
This is what it looks like (I've unwrapped the circular disk into a rectangle because ascii circles are hard, and simplified the data layout to make it easier to read):
Assume the files are separated by some space on the platter like so:
| |
A series read will look like (* indicates reading)
space ----->
| *| t
| *| i
| *| m
| *| e
| *| |
| / | |
| / | |
| / | V
| / |
|* |
|* |
|* |
|* |
While a parallel read will look like
| \ |
| *|
| / |
| / |
| / |
| / |
|* |
| \ |
| \ |
| \ |
| \ |
| *|
| / |
| / |
| / |
| / |
|* |
| \ |
| \ |
| \ |
| \ |
| *|
etc
If you're doing this on Windows you might want to look into the ReadFileScatter function. It will let you read multiple segments from a file in a single asynchronous call. This will allow the OS to better control the file IO bottleneck and hopefully optimize the reads.
The matching write call on Windows would be WriteFileGather.
For UNIX you're looking at readv and writev to do the same thing.
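For reading independent chunks specifically, POSIX pread() is also worth a mention: it reads at an explicit offset and does not move the shared file position, so several threads can read through a single descriptor. A hedged sketch, with the file name and chunk layout made up for illustration:

#include <fcntl.h>      // open
#include <unistd.h>     // pread, close
#include <cstdio>
#include <thread>
#include <utility>
#include <vector>

int main() {
    const int fd = open("data.bin", O_RDONLY);   // one descriptor shared by all threads
    if (fd < 0) { perror("open"); return 1; }

    // Illustrative list of independent chunks: (offset, length).
    const std::pair<off_t, size_t> chunks[] = {{0, 4096}, {1 << 20, 4096}, {8 << 20, 4096}};

    std::vector<std::thread> workers;
    for (const auto& c : chunks) {
        workers.emplace_back([fd, c] {
            std::vector<char> buf(c.second);
            // pread() takes an explicit offset and does not touch the shared
            // file position, so concurrent calls on the same fd are safe.
            const ssize_t n = pread(fd, buf.data(), buf.size(), c.first);
            if (n < 0) perror("pread");
            // ... process the n bytes read ...
        });
    }
    for (auto& t : workers) t.join();
    close(fd);
}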
As mentioned in the other answers, a parallel read may be slower depending on the way the file is physically stored on disk: if the head has to move a significant distance it can cause an actual slowdown. That said, there are storage systems which can support multiple simultaneous reads and writes efficiently. The simplest one I can imagine is an SSD. I have also worked with high-end storage systems from IBM which could perform simultaneous reads and writes with no slowdown.
So let's assume you have such a file system and physical storage which will not slow down on parallel reads.
In that case parallel reads are very logical. In general there are two ways to achieve that:
If you want to use the standard C/C++ library to perform the IO, then the only option you have is to keep one open file handle (descriptor) per thread. This is because the file pointer (which points to where to read or write from in the file) is kept per handle, so if you try to read simultaneously from the same file handle you will have no way of knowing what you are actually reading. (A minimal sketch of this approach follows below.)
Use a platform-specific API to perform asynchronous (OVERLAPPED) IO. On Windows you use the WinAPI functions with what is called OVERLAPPED IO. On Unix/Linux you have POSIX AIO, although I understand its use is discouraged; I didn't see any satisfactory explanation as to why that is the case.
I myself have implemented the fd-per-thread approach on both Linux and Windows, and the OVERLAPPED approach on Windows. Both work great.
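Since the question mentions OpenMP, here is a minimal sketch of the first approach, one stream per thread; the file name and chunk layout are illustrative placeholders:

#include <cstddef>
#include <fstream>
#include <vector>
#include <omp.h>

int main() {
    const char* path = "data.bin";            // illustrative file name
    const std::size_t chunkSize = 1 << 20;    // 1 MiB per chunk
    const int numChunks = 8;

    std::vector<std::vector<char>> chunks(numChunks, std::vector<char>(chunkSize));

    // Each iteration (and hence each thread) opens its own ifstream, so every
    // thread has an independent file position and no locking is required.
    #pragma omp parallel for
    for (int i = 0; i < numChunks; ++i) {
        std::ifstream in(path, std::ios::binary);
        in.seekg(static_cast<std::streamoff>(i) * chunkSize);
        in.read(chunks[i].data(), chunkSize);
        // ... process chunks[i]; in.gcount() gives the bytes actually read ...
    }
}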
You won't be able to speed up the raw disk I/O itself. If you're calculating at the same time as you're reading or writing, parallelizing will help. But the pure I/O will be limited by the bandwidth of the link between processor and hard drive and, more notably, by the hard drive itself (my hard drive does 30 MB/s; I've heard about RAID setups serving 120 MB/s over the network, but don't rely on that).
Multiple reads from a disk should be thread-safe by design of the operating system if you use the standard system functions; there's no need to manually lock anything. Open the files read-only, though, otherwise you'll get file access errors.
By the way, you are not necessarily reading from the disk in practice; the operating system decides where it serves the data from. It typically prefetches reads and serves them from memory.