C++ lib for disk-persistent FIFO queue of binary messages

Looking for a C++ library, or an easy and robust combination of libraries, that will provide a durable disk-backed queue for variable-size binary blocks.
My app produces messages that are sent out to subscribers (messages are variable-sized binaries); in case of subscriber failure, restart, or networking issues I need something like a circular buffer to queue them up until the subscriber returns. Available RAM is not enough to handle the worst-case failure scenario, so I'm looking for an easy way to offload data to disk.
Best case: set a maximum disk space (say 100G) and a file name, recover data after application restart, a .push_back() / .front() / .pop_front() style API, no performance drawback when the queue is small (the 99.99% case), and no need for strict persistence (no fsync() on every message).
Average case: data is not preserved between restarts.
Some combination of Boost libraries would be highly preferable.
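For what it's worth, here is a minimal sketch of the kind of API described above; the class name and the length-prefixed record format are my own invention, not an existing library. It appends records to a single backing file and never calls fsync(); restart recovery, the in-RAM fast path, and the disk-space cap are all left out:

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    // Hypothetical sketch: each record is a 4-byte length prefix plus payload.
    // Writes always append; reads keep their own offset in read_pos_.
    class FileFifo {
    public:
        explicit FileFifo(const std::string& path)  // "a+" mode: creates the file if missing
            : file_(path, std::ios::in | std::ios::out |
                          std::ios::app | std::ios::binary) {}

        void push_back(const std::vector<char>& msg) {
            std::uint32_t len = static_cast<std::uint32_t>(msg.size());
            file_.seekp(0, std::ios::end);
            file_.write(reinterpret_cast<const char*>(&len), sizeof len);
            file_.write(msg.data(), len);
            file_.flush();  // to the OS page cache only, no fsync()
            ++count_;
        }

        bool empty() const { return count_ == 0; }

        std::vector<char> front() {  // caller must check empty() first
            std::uint32_t len = 0;
            file_.seekg(read_pos_);
            file_.read(reinterpret_cast<char*>(&len), sizeof len);
            std::vector<char> msg(len);
            file_.read(msg.data(), len);
            return msg;
        }

        void pop_front() {
            std::uint32_t len = 0;
            file_.seekg(read_pos_);
            file_.read(reinterpret_cast<char*>(&len), sizeof len);
            read_pos_ += static_cast<std::streamoff>(sizeof len) + len;
            --count_;
        }

    private:
        std::fstream file_;
        std::streamoff read_pos_ = 0;  // not persisted: lost on restart
        std::size_t count_ = 0;
    };

A real implementation would persist read_pos_ and count_ in a small header to survive restarts, keep a bounded in-memory deque for the common small-queue case, and rotate or truncate the file to enforce the size cap.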

Related

AKKA: passing local resource to actor

Let's suppose I want to use the AKKA actor model to create a program crunching data coming from files.
Since the model, as far as I understand, wins when the actors really are unaware of where they are running, passing the path of the file in the message seems like an error: as the app scales, some actors will possibly not have access to that path. Conversely, passing the entire file as bytes is not an option due to resource issues (what if the file is big, or keeps getting bigger?).
What is the correct strategy to handle this situation? On the same note: would the assumption of a distributed file system be a good justification for accepting paths as messages?
I don't think there's a single definitive answer, because it depends on the nature of the data and the "crunching". However, in the typical case where you really are doing data processing of the files, you are going to have to read the files into memory at some point. So, yes, the general answer is to read the entire file as bytes.
In answer to the question of "what if the file is bigger", that's why we have streaming libraries like Akka Streams. For example, a common case might be to use Alpakka to watch for files in a local directory (or FTP), parse them into records, filter/map the records to do some initial cleansing, and then stream those records to distributed actors to process. Because you are using streaming, Akka is not trying to load the whole file into memory all at once, and you get the benefit of backpressure so that you don't overload the actors doing the processing.
That's not to say a distributed file system might not have uses, for example to give you high availability. If you upload a file to the local filesystem of an Akka node and that node fails, you obviously lose access to your file. But that's really a separate issue from how you do distributed processing.

Buffering to the hard disk

I am receiving a large quantity of data at a fixed rate. I need to do some processing on this data on a different thread, but this may run slower than the data is coming in, so I need to buffer the data. Due to the quantity of data coming in, the available RAM would be quickly exhausted, so it needs to overflow onto the hard disk. What I could do with is something like a filesystem-backed pipe, so the writer could be blocked by the filesystem, but not by the reader running too slowly.
Here's a rough set of requirements:
Writing should not be blocked by the reader running too slowly.
If data is read slow enough that the available RAM is exhausted it should overflow to the filesystem. It's ok for writes to the disk to block.
Reading should block if no data is available unless the stream has been closed by the writer.
If the reader is able to keep up with the data then it should never hit the hard disk as the RAM buffer would be sufficient (nice but not essential).
Disk space should be recovered as the data is consumed (or soon after).
Does such a mechanism exist in Windows?
This looks like a classic message queue. Did you consider MSMQ or similar? MSMQ has all the properties you are asking for. You may want to use direct addressing to avoid Active Directory (http://msdn.microsoft.com/en-us/library/ms700996(v=vs.85).aspx) and use a local or TCP/IP queue address.
Use an actual file. Write to the file as the data is received, and in another process read the data from the file and process it.
You even get the added benefit of avoiding multithreading.
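A minimal sketch of that idea, with a made-up file name and length-prefixed records; the answer suggests a second process, but for brevity this uses two threads in one process, and the reader polls instead of truly blocking:

    #include <chrono>
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <thread>
    #include <vector>

    int main() {
        const std::string path = "buffer.dat";  // hypothetical spill file
        std::ofstream out(path, std::ios::binary | std::ios::trunc);

        std::thread writer([&] {
            for (std::uint32_t i = 0; i < 1000; ++i) {
                std::string payload = "sample " + std::to_string(i);
                std::uint32_t len = static_cast<std::uint32_t>(payload.size());
                out.write(reinterpret_cast<const char*>(&len), sizeof len);
                out.write(payload.data(), len);
                out.flush();  // hand the data to the OS page cache
            }
        });

        std::thread reader([&] {
            std::streamoff pos = 0;
            std::size_t seen = 0;
            while (seen < 1000) {
                std::ifstream in(path, std::ios::binary);
                in.seekg(pos);
                std::uint32_t len = 0;
                while (in.read(reinterpret_cast<char*>(&len), sizeof len)) {
                    std::vector<char> record(len);
                    if (!in.read(record.data(), len)) break;  // partial record
                    pos = in.tellg();   // only advance past complete records
                    ++seen;             // "process" the record here
                }
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
        });

        writer.join();
        reader.join();
    }

Disk space is not reclaimed in this sketch; a production version would rotate files and delete fully consumed ones, which also covers the "recover space soon after" requirement.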

POSIX Queue Configuration

I want to know how I can configure POSIX queues on Linux.
I know about editing sysctl.conf and, in code, passing attributes to
    mq_open(...);
Is there any other way to configure the number of messages per queue and the number of queues?
You are mixing different layers of the onion.
On the individual-queue layer, the queue attributes (mq_maxmsg and mq_msgsize) are fixed at the time of queue creation and can't be changed. mq_curmsgs makes no sense to change unless you are looking to mangle your queue, and it can only be queried, through mq_getattr. The mq_flags can be changed through mq_setattr, but the only flag that can be changed toggles the blocking/non-blocking state of the queue.
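For example, a minimal sketch of fixing those attributes at creation time (the queue name and sizes are arbitrary; on older glibc, link with -lrt):

    #include <cstdio>
    #include <fcntl.h>
    #include <mqueue.h>
    #include <sys/stat.h>

    int main() {
        mq_attr attr{};
        attr.mq_maxmsg  = 10;    // messages per queue, fixed at creation
        attr.mq_msgsize = 1024;  // max bytes per message, fixed at creation

        mqd_t q = mq_open("/demo_queue", O_CREAT | O_RDWR, 0600, &attr);
        if (q == (mqd_t)-1) { std::perror("mq_open"); return 1; }

        // The attributes can be queried back, but only the non-blocking
        // flag can be toggled later, via mq_setattr.
        mq_attr current{};
        mq_getattr(q, &current);
        std::printf("maxmsg=%ld msgsize=%ld\n",
                    current.mq_maxmsg, current.mq_msgsize);

        mq_close(q);
        mq_unlink("/demo_queue");
        return 0;
    }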
As a practical matter, it is easy to write simple command-line utilities to do most of the above, and many organizations will already have them. They are usually among the first programs using queues that developers write for themselves anyway. Some systems incorporate these little utilities into the startup and shutdown scripts for their applications.
On the process layer, there are limits on message priorities (MQ_PRIO_MAX) and on the number of queues a process can have open (MQ_OPEN_MAX). On Linux neither of these is a real concern. The max priority is about 32k - query it with sysconf(_SC_MQ_PRIO_MAX) - and if you are using that many priorities you have some real design issues. And because mqd_t values on Linux are file descriptors, the real limiting factor on the number of open queues is the total number of file descriptors the process is limited to.
At the system level, there are limit files in /proc/sys/fs/mqueue that can be changed with appropriate permissions. (a) queues_max is the upper limit on the number of queues allowed on the system in toto, but a privileged user can still create queues once this limit has been hit. (b) msgsize_max is the ceiling on the per-message size (mq_msgsize) an unprivileged process can request. (c) msg_max is the ceiling on the number of messages per queue (mq_maxmsg) an unprivileged process can request. (d) Linux also has two files, msg_default and msgsize_default, in /proc/sys/fs/mqueue whose purposes should be self-evident.
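For instance, a small sketch that prints the current ceilings (writing new values back into these files requires appropriate permissions):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        // Read the system-wide mqueue limit files one by one.
        for (const char* name : {"queues_max", "msg_max", "msgsize_max"}) {
            std::ifstream f(std::string("/proc/sys/fs/mqueue/") + name);
            std::string value;
            if (f >> value)
                std::cout << name << " = " << value << "\n";
        }
    }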

usb disk write latency (windows)

I am writing to a USB disk from a lowest-priority thread, using chunked, buffered writes, and still, from time to time, the system as a whole lags on this operation. If I disable only the disk writes, everything works fine. I can't use Windows file-operation API calls, only the C write(). So I thought maybe there is a WinAPI function to turn USB disk write caching on or off, which I could use in conjunction with FlushBuffers or similar alternatives. The number of drives involved is not known in advance.
Ideally I would like the write call never to lag; caching is fine too, if it is performed transparently.
EDIT: would the _O_SEQUENTIAL flag on write-only operations be of any use here?
Try to reduce the I/O priority of the thread.
See this article: http://msdn.microsoft.com/en-us/library/windows/desktop/ms686277(v=vs.85).aspx
In particular, use THREAD_MODE_BACKGROUND_BEGIN for your I/O thread.
Warning: this doesn't work on Windows XP.
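A minimal sketch of that call (the wrapper function is mine; background mode exists only on Vista and later):

    #include <windows.h>

    void doBackgroundWrites() {
        // Background mode lowers both the CPU and the I/O priority of the
        // current thread until THREAD_MODE_BACKGROUND_END is requested.
        if (!SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_BEGIN))
            return;  // fails on Windows XP, which lacks background mode

        // ... perform the chunked write() calls here ...

        SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_END);
    }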
The thread priority won't affect the delay that happens in the process of writing to the media, because that is done in kernel mode by the file system/disk drivers, which don't pay attention to the priority of the calling thread.
You might try the "T" flag (_O_SHORTLIVED) and flush the buffers at the end of the operation; also try decreasing the buffer size.
There are different types of data transfer for USB; for data there are three:
1. Bulk Transfer,
2. Isochronous Transfer, and
3. Interrupt Transfer.
Bulk Transfers Provide:
Used to transfer large bursty data.
Error detection via CRC, with guarantee of delivery.
No guarantee of bandwidth or minimum latency.
Stream Pipe - Unidirectional
Full & high speed modes only.
Bulk transfer is good for data that does not require delivery in a guaranteed amount of time. The USB host controller gives bulk transfers a lower priority than the other types of transfer.
Isochronous Transfers Provide:
Guaranteed access to USB bandwidth.
Bounded latency.
Stream Pipe - Unidirectional
Error detection via CRC, but no retry or guarantee of delivery.
Full & high speed modes only.
No data toggling.
Isochronous transfers occur continuously and periodically. They typically contain time sensitive information, such as an audio or video stream. If there were a delay or retry of data in an audio stream, then you would expect some erratic audio containing glitches. The beat may no longer be in sync. However if a packet or frame was dropped every now and again, it is less likely to be noticed by the listener.
Interrupt Transfers Provide:
Guaranteed Latency
Stream Pipe - Unidirectional
Error detection and next period retry.
Interrupt transfers are typically non-periodic, small device "initiated" communication requiring bounded latency. An Interrupt request is queued by the device until the host polls the USB device asking for data.
From the above, it seems that you want guaranteed latency, so you should use isochronous mode. There are some libraries that you can use, like libusb, or you can read more on MSDN.
To find out what is making your system hang, you first need to drill down into the Windows hang. What was Windows doing while you experienced the hang?
To find this out you can take a kernel dump. How to get and analyze a kernel dump is described here.
Depending on the findings you get there, you then need to decide if there is anything under your control that you can do about it. Since you are using a third-party library to do the writing, there is little you can do except set the I/O priority and thread priority at the thread or process level. If the library you were given links against a specific CRT, you could try to build your own customized version of it to, e.g., flush after every write, to prevent the OS from combining writes and writing data back to disc only in big chunks.
Edit1
Your best bet would be to flush the device after every write. This can force the OS to write the currently pending data to disc without caching writes up to a certain amount.
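A minimal sketch of that, assuming the MSVC CRT's low-level I/O (the question is restricted to C-style writes); _commit() pushes the file's pending data through the OS cache to the device:

    #include <io.h>

    void writeChunk(int fd, const char* data, unsigned int count) {
        _write(fd, data, count);  // the plain C-style write the question allows
        _commit(fd);              // flush this file's buffered data to the disk
    }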
The second-best thing would be to simply wait after each write, to give the OS the chance to write the pending changes, small though they are, back to disc after a certain time interval.
If you are deeper into performance, you should try XPerf, which has a nice GUI and even shows you the call stack where your process hung. The Windows team and many other teams at MS use this tool to troubleshoot hangs. The latest edition, with many more features, comes with the Windows 8 SDK. But beware that XPerf only works on Vista and later.

MySQL++, storing realtime data

Firstly I'm an engineer, not a computer scientist, so please be gentle.
I currently have a C++ program which uses MySQL++. The program also incorporates the NI Visa runtime. One of the interrupt handlers receives data (1 byte) from a USB device about 200 times a second. I would like to store this data with a time stamp on each sample on a remote server. Is this feasible? Can anyone recommend a good approach?
Regards,
Michael
I think that performing 200 transactions/second against a remote server is asking a lot, especially when you consider that these transactions would be occurring in the context of an interrupt handler which has to do its job and get done quickly. I think it would be better to decouple your interrupt handler from your database access - perhaps have the interrupt handler store the incoming data and timestamp into some sort of in-memory data structure (array, circular linked list, or whatever, with appropriate synchronization) and have a separate thread that waits until data is available in the data structure and then pumps it to the database. I'd want to keep that interrupt handler as lean and deterministic as possible, and I'm concerned that database access across the network to a remote server would be too slow - or worse, would be OK most of the time, but sometimes would go to h*ll for no obvious reason.
This, of course, raises the question/problem of data overrun, where data comes in faster than it can be pumped to the database and the in-memory storage structure fills up. This could cause data loss. How bad a thing is it if you drop some samples?
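A minimal sketch of that decoupling, with illustrative names; the handler only timestamps and enqueues, and a worker thread drains the queue in batches (the actual MySQL++ insert is elided):

    #include <chrono>
    #include <condition_variable>
    #include <cstdint>
    #include <deque>
    #include <mutex>
    #include <thread>

    struct Sample {
        std::chrono::system_clock::time_point stamp;
        std::uint8_t value;
    };

    std::deque<Sample> pending;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Called from the interrupt/callback context: cheap and deterministic.
    void onSample(std::uint8_t value) {
        {
            std::lock_guard<std::mutex> lock(m);
            pending.push_back({std::chrono::system_clock::now(), value});
        }
        cv.notify_one();
    }

    // Runs on its own thread; batches samples before talking to the database.
    void pumpToDatabase() {
        std::unique_lock<std::mutex> lock(m);
        for (;;) {
            cv.wait(lock, [] { return !pending.empty() || done; });
            if (pending.empty() && done) return;
            std::deque<Sample> batch;
            batch.swap(pending);  // grab everything in one go
            lock.unlock();
            // ... build one multi-row INSERT from `batch` and send it here ...
            lock.lock();
        }
    }

    int main() {
        std::thread worker(pumpToDatabase);
        for (int i = 0; i < 10; ++i)
            onSample(static_cast<std::uint8_t>(i));
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_one();
        worker.join();
    }

If dropping samples is unacceptable, replace the unbounded deque with a fixed-size ring buffer and decide explicitly what happens on overflow.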
I don't think you'll be able to maintain that speed with one separate INSERT per value, but if you batch them up into large enough batches you can send them all as one query, and it should be fine.
    INSERT INTO records(timestamp, value)
    VALUES(1, 2), (3, 4), (5, 6), [...], (399, 400);
Just push the timestamp and value onto a buffer, and when the buffer hits 200 entries (or some other arbitrary figure), generate the SQL and send the whole lot off. Building the string up with sprintf shouldn't be too slow. Just beware of reading from a data structure that your interrupt routine might be writing to at the same time.
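For illustration, a sketch of generating that statement into a string (the table and columns come from the example above; rows are (timestamp, value) pairs):

    #include <cstddef>
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    std::string buildInsert(const std::vector<std::pair<long long, int>>& rows) {
        std::string sql = "INSERT INTO records(timestamp, value) VALUES";
        char tuple[64];
        for (std::size_t i = 0; i < rows.size(); ++i) {
            // Prefix every tuple after the first with a comma.
            std::snprintf(tuple, sizeof tuple, "%s(%lld, %d)",
                          i ? ", " : " ", rows[i].first, rows[i].second);
            sql += tuple;
        }
        return sql + ";";
    }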
If you find that this SQL generation is too slow for some reason, and there's no quicker method using the API (eg. stored procedures), then you might want to run this concurrently with the data collection. Simplest is probably to stream the data across a socket or pipe to another process that performs the SQL generation. There are also multithreading approaches but they are more complex and error-prone.
In my opinion, you should do two things: 1. buffer the data, and 2. use one timestamp per buffer. The USB protocol is not byte-based but message-based; if you are tracking messages, then timestamp the messages.
Also, databases would rather receive blocks or chunks of data than one byte at a time. There is overhead in the database for each transaction. To measure the efficiency, divide the overhead by the number of bytes in the transaction. You'll see that large blocks are more efficient than lots of little transactions.
Another option is to store the data in a file, then use MySQL's LOAD DATA INFILE statement to load it into the database. You could also store the data in a buffer and use the MySQL C++ connector's stream interface to load it into the database.
Multi-threading doesn't guarantee being any faster than a single-apartment approach, even if you cache correctly on the server side, unless there is some strange CPU priority preference. What about using shaders and letting the pass-by-reference value in windows.h be the timestamp?