Storing variable-sized chunks in a std::queue? - c++

I'm writing a message queue meant to operate over a socket, and for various reasons I'd like to have the queue memory live in user space and have a thread that drains the queues into their respective sockets.
Messages are going to be small blobs of memory (between 4 and 4K bytes probably), so I think avoiding malloc()ing memory constantly is a must to avoid fragmentation.
The mode of operation would be that a user calls something like send(msg) and the message is then copied into the queue memory and is sent over the socket at a convenient time.
My question is, is there a "nice" way to store variable sized chunks of data in something like a std::queue or std::vector or am I going to have to go the route of putting together a memory pool and handling my own allocation out of that?

You can create a large circular buffer, copy data from the chunks into that buffer, and store pairs of {start pointer, length} in your queue. Since the chunks are allocated in the same order that they are consumed, the math to check for overlaps should be relatively straightforward.
Memory allocators have become quite good these days, so I would not be surprised if a solution based on a "plain" allocator exhibited comparable performance.
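For illustration, a minimal single-threaded sketch of that idea might look like the following; the RingQueue name, the byte-wise wrap-around copy, and the std::pair descriptor are assumptions for brevity, and a real sender thread would need synchronization around push()/pop():

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// One large circular buffer plus a queue of {offset, length} descriptors.
class RingQueue {
public:
    explicit RingQueue(std::size_t capacity) : buf_(capacity) {}

    // Copy a message into the ring; returns false if it doesn't fit right now.
    bool push(const void* data, std::size_t len) {
        if (len > free_bytes()) return false;
        std::size_t off = head_;
        for (std::size_t i = 0; i < len; ++i)          // handle wrap-around
            buf_[(off + i) % buf_.size()] = static_cast<const char*>(data)[i];
        head_ = (head_ + len) % buf_.size();
        used_ += len;
        msgs_.push({off, len});
        return true;
    }

    // Copy the oldest message out (e.g. into a send() call) and release its space.
    bool pop(std::vector<char>& out) {
        if (msgs_.empty()) return false;
        auto [off, len] = msgs_.front();
        msgs_.pop();
        out.resize(len);
        for (std::size_t i = 0; i < len; ++i)
            out[i] = buf_[(off + i) % buf_.size()];
        used_ -= len;
        return true;
    }

private:
    std::size_t free_bytes() const { return buf_.size() - used_; }

    std::vector<char> buf_;
    std::size_t head_ = 0;   // next byte to write
    std::size_t used_ = 0;   // bytes currently queued
    std::queue<std::pair<std::size_t, std::size_t>> msgs_;  // {offset, length}
};
```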

You could delegate the memory pool burden to Boost.Pool.
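A minimal sketch of what that could look like, assuming one pool per size class and a 4 KiB block size (both arbitrary choices here):

```cpp
#include <boost/pool/pool.hpp>

int main() {
    // One pool per chunk size; here, 4 KiB blocks for the largest messages.
    // For smaller messages you would typically keep several pools
    // (e.g. 64 B, 256 B, 1 KiB, 4 KiB) and pick the smallest one that fits.
    boost::pool<> msg_pool(4096);

    void* msg = msg_pool.malloc();   // O(1) once the pool has grown
    // ... copy the message into msg, enqueue the pointer, send it later ...
    msg_pool.free(msg);              // returns the block to the pool, not the OS
}
```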

If they are below 4K you might have no fragmentation at all. You did not mention the OS on which you are going to run your application, but if it is Linux or Windows, their allocators can handle blocks of this size. At least check this before writing your own pools. See for example this question: question about small block allocator

Unless you expect to have a lot of queued data packets, I'd probably just create a pool of vector<char>, with (say) 8K reserved in each. When you're done with a packet, recycle the vector instead of throwing it away (i.e., put it back in the pool, ready to use again).
If you're really sure your packets won't exceed 4K, you can obviously reduce that to 4K instead of 8K -- but assuming this is a long-running program, you probably gain more from minimizing reallocation than you do from minimizing the size of an individual vector.
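A rough sketch of such a recycling pool, with illustrative names and the 8K reservation mentioned above:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Trivial free-list of pre-reserved std::vector<char> packets.
class PacketPool {
public:
    std::vector<char> acquire() {
        if (free_.empty()) {
            std::vector<char> v;
            v.reserve(8 * 1024);          // one-time allocation per packet
            return v;
        }
        std::vector<char> v = std::move(free_.back());
        free_.pop_back();
        v.clear();                        // keeps capacity, drops old contents
        return v;
    }

    void release(std::vector<char> v) {   // put the packet back for reuse
        free_.push_back(std::move(v));
    }

private:
    std::vector<std::vector<char>> free_;
};
```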
An obvious alternative would be to handle this at the level of the Allocator, so you're just reusing memory blocks instead of reusing vectors. This would make it a bit easier to tailor memory usage. I'd still pre-allocate blocks, but only a few sizes -- something like 64 bytes, 256 bytes, 1K, 2K, 4K (and possibly 8K).

Related

UART stream packetisation; stream or vector?

I am writing some code to interface an STM32H7 with a BM64 Bluetooth module over UART.
The BM64 expects binary data in bytes; in general:
1. Start word (0xAA)
2-3. Payload length
4. Message ID
5-n. Payload
n+1. Checksum
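For concreteness, building a frame with this layout might look roughly like the sketch below; the exact meaning of the length field and the checksum algorithm are placeholders, and the BM64 datasheet is authoritative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical frame builder for the layout described above.
std::vector<uint8_t> build_frame(uint8_t msg_id, const std::vector<uint8_t>& payload) {
    std::vector<uint8_t> frame;
    const uint16_t len = static_cast<uint16_t>(payload.size() + 1);  // assume ID counts toward the length

    frame.push_back(0xAA);                              // start word
    frame.push_back(static_cast<uint8_t>(len >> 8));    // payload length, high byte
    frame.push_back(static_cast<uint8_t>(len & 0xFF));  // payload length, low byte
    frame.push_back(msg_id);                            // message ID
    frame.insert(frame.end(), payload.begin(), payload.end());

    uint8_t sum = 0;                                    // placeholder checksum over length/ID/payload
    for (std::size_t i = 1; i < frame.size(); ++i) sum += frame[i];
    frame.push_back(static_cast<uint8_t>(~sum + 1));    // two's complement (assumption)
    return frame;
}
```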
My question is around best practice for message queuing, namely:
Custom iostream, message vectors inside an interface class or other?
My understanding so far (please correct me if I'm wrong, or add anything I've missed):
A custom iostream has the huge benefit of concise usage inline with cout etc. Very usable and clean, and most likely portable, at least in principle, to other devices on this project operating on other UART ports. The disadvantage is that it is a relatively large amount of work to create a custom streambuf, and I'm not sure what to use for "endl" (I can't use null or '\n' as these may exist in the message, it being binary).
Vectors seem a bit dirty to me, and particularly for embedded work the dynamic allocations could be stealing a lot of memory unless I ruthlessly spend cycles on resize() and reserve(). However, a vector of messages (defined as either a class or struct) would be very quick and easy to do.
Is there another solution? Note, I'd prefer not to use arrays, i.e. passing around buffer pointers and buffer lengths.
What would you suggest in this application?
On bare-metal systems I prefer fixed-size buffers with the maximum possible payload size. Two of them, statically allocated: one to fill and one to send in parallel, switching over when finished. All kinds of dynamic memory allocation end in memory fragmentation, especially if such buffers jitter in size.
Even if your system has an MMU, it may be a good idea not to do much dynamic heap allocation at all. I have often written my own block-pool memory management to get rid of long-term fragmentation and late allocation failures.
If you are afraid of using more RAM than currently needed, think again: if you have so little RAM that you cannot spare more than is currently needed, your system may fail the moment it really does need a maximum-sized buffer. That is never acceptable on embedded systems. This is a good argument for allocating all memory more or less statically, as long as the worst case can occur under real runtime conditions at "some point in the future" :-)
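A bare-bones sketch of that double-buffer ("ping-pong") scheme; uart_start_tx is a placeholder for whatever starts the UART/DMA transfer, and interrupt safety (disabling IRQs around the swap) is omitted for brevity:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t MAX_TX = 2048;   // assumed worst case buffered between transfers

struct TxBuffer {
    std::uint8_t data[MAX_TX];
    std::size_t  len = 0;
};

static TxBuffer tx_buf[2];
static int      fill_idx = 0;          // buffer currently being filled
static bool     tx_busy  = false;

void uart_start_tx(const std::uint8_t* data, std::size_t len);   // provided elsewhere (assumption)

void queue_bytes(const std::uint8_t* frame, std::size_t len) {
    TxBuffer& b = tx_buf[fill_idx];
    if (b.len + len <= MAX_TX) {                 // drop (or block) on overflow in a real system
        std::memcpy(b.data + b.len, frame, len);
        b.len += len;
    }
    if (!tx_busy && tx_buf[fill_idx].len > 0) {  // line idle: kick off a transfer immediately
        tx_busy = true;
        int send_idx = fill_idx;
        fill_idx = 1 - fill_idx;
        tx_buf[fill_idx].len = 0;
        uart_start_tx(tx_buf[send_idx].data, tx_buf[send_idx].len);
    }
}

void on_tx_complete() {                          // called from the UART/DMA completion interrupt
    int send_idx = fill_idx;                     // hand the filled buffer to the UART...
    if (tx_buf[send_idx].len == 0) { tx_busy = false; return; }
    fill_idx = 1 - fill_idx;                     // ...and reuse the just-sent one for filling
    tx_buf[fill_idx].len = 0;
    uart_start_tx(tx_buf[send_idx].data, tx_buf[send_idx].len);
}
```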

Multiple Producer Multiple Consumer Lockfree Non Blocking Ring Buffer With Variable Length Write

I want to pass variable-length messages from multiple producers to multiple consumers through a low-latency queue on multi-socket Xeon E5 systems. (400 bytes with a latency of 300 ns would be nice, for example.)
I've looked for existing implementations of lock-free multiple-producer, multiple-consumer (MPMC) queues using a non-blocking ring buffer. But most implementations/algorithms online are node based (i.e. each node is fixed length), such as boost::lockfree::queue, midishare, etc.
Of course, one can argue that the node type can be set to uint8_t or the like, but then writes will be clumsy and the performance will be horrible.
I'd also like the algorithm to offer overwrite detection on the readers' side, so that readers can detect when the data they are reading is being overwritten.
How can I implement a queue (or something else) that does this?
Sorry for a bit of a late answer, but have a look at DPDK's Ring library. It is free (BSD license), blazingly fast (I doubt you will find a faster solution for free) and supports all major architectures. There are lots of examples as well.
to pass variable-length messages
The solution is to pass a pointer to a message, not a whole message. DPDK also offers a memory pool library to allocate/deallocate buffers between multiple threads or processes. The memory pool is also fast, lock-free and supports many architectures.
So overall solution would be:
Create mempool(s) to share buffers among threads/processes. Each mempool supports just one fixed buffer size, so you might want to create a few mempools to match your needs.
Create one MPMC ring or a set of SPSC ring pairs between your threads/processes. The SPSC solution might be faster, but it might not fit your design.
Producer allocates a buffer, fills it and passes a pointer to that buffer via the ring.
Consumer receives the pointer, reads the message and deallocates the buffer.
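A heavily simplified sketch of that flow, using DPDK's mempool and ring APIs; the buffer sizes, ring size, EAL setup and error handling here are assumptions:

```cpp
#include <cstddef>
#include <cstring>
#include <rte_eal.h>
#include <rte_mempool.h>
#include <rte_ring.h>

static rte_mempool* pool;   // fixed-size buffers, e.g. 1 KiB each
static rte_ring*    ring;   // MPMC ring carrying pointers to those buffers

void setup(int argc, char** argv) {
    rte_eal_init(argc, argv);
    pool = rte_mempool_create("msg_pool", 8192 - 1, 1024, 256, 0,
                              nullptr, nullptr, nullptr, nullptr,
                              SOCKET_ID_ANY, 0);
    ring = rte_ring_create("msg_ring", 4096, SOCKET_ID_ANY, 0);  // flags 0 -> MPMC
}

void produce(const void* msg, std::size_t len) {   // len <= buffer size
    void* buf;
    if (rte_mempool_get(pool, &buf) != 0) return;  // pool exhausted
    std::memcpy(buf, msg, len);                    // a real design would also store len
    if (rte_ring_enqueue(ring, buf) != 0)          // ring full
        rte_mempool_put(pool, buf);
}

void consume() {
    void* buf;
    if (rte_ring_dequeue(ring, &buf) != 0) return; // ring empty
    // ... read the message out of buf ...
    rte_mempool_put(pool, buf);                    // give the buffer back
}
```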
Sounds like a lot of work, but there are lots of optimizations inside DPDK mempools and rings. Will it fit in 300 ns, though?
Have a look at the official DPDK performance reports. While there is no official report for ring performance, there are vhost/virtio test results. Basically, packets travel like this:
Traffic gen. -- Host -- Virtual Machine -- Host -- Traffic gen.
Host runs as one process, virtual machine as another.
The test result is ~4M packets per second for 512-byte packets. That does not fit your budget on its own, but your path needs to do much, much less work...
You probably want to put pointers in your queue, rather than actually copying data into / out of the shared ring itself. i.e. the ring buffer payload is just a pointer.
Release/acquire semantics takes care of making sure that the data is there when you dereference a pointer you get from the queue. But then you have a deallocation problem: how does a producer know when a consumer is done using a buffer so it can reuse it?
If it's ok to hand over ownership of the buffer, then that's fine. Maybe the consumer can use the buffer for something else, like add it to a local free-list or maybe use it for something it produces.
For the following, see the ring-buffer based lockless MPMC queue analyzed in Lock-free Progress Guarantees. I'm imagining modifications to it that would make it suit your purposes.
It has a read-index and a write-index, and each ring-buffer node has a sequence counter that lets it detect writers catching up with readers (queue full) vs. readers catching up with writers (queue empty), without creating contention between readers and writers. (IIRC, readers read the write-index or vice versa, but there's no shared data that's modified by both readers and writers.)
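A condensed sketch of that kind of bounded MPMC queue (per-cell sequence counters plus separate enqueue/dequeue indices, in the style of Dmitry Vyukov's well-known design) is shown below; cache-line padding and other production details are omitted:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bounded MPMC queue with a per-cell sequence counter. capacity must be a power of two.
template <typename T>
class BoundedMpmcQueue {
public:
    explicit BoundedMpmcQueue(std::size_t capacity)
        : cells_(capacity), mask_(capacity - 1) {
        for (std::size_t i = 0; i < capacity; ++i)
            cells_[i].seq.store(i, std::memory_order_relaxed);
    }

    bool push(const T& value) {
        std::size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& c = cells_[pos & mask_];
            std::size_t seq = c.seq.load(std::memory_order_acquire);
            auto dif = static_cast<std::intptr_t>(seq) - static_cast<std::intptr_t>(pos);
            if (dif == 0) {                       // cell is free for this ticket
                if (tail_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed))
                    break;
            } else if (dif < 0) {
                return false;                     // full: writers caught up with readers
            } else {
                pos = tail_.load(std::memory_order_relaxed);  // lost the race, reload
            }
        }
        Cell& c = cells_[pos & mask_];
        c.value = value;
        c.seq.store(pos + 1, std::memory_order_release);      // publish to readers
        return true;
    }

    bool pop(T& out) {
        std::size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& c = cells_[pos & mask_];
            std::size_t seq = c.seq.load(std::memory_order_acquire);
            auto dif = static_cast<std::intptr_t>(seq) - static_cast<std::intptr_t>(pos + 1);
            if (dif == 0) {                       // cell holds data for this ticket
                if (head_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed))
                    break;
            } else if (dif < 0) {
                return false;                     // empty: readers caught up with writers
            } else {
                pos = head_.load(std::memory_order_relaxed);
            }
        }
        Cell& c = cells_[pos & mask_];
        out = c.value;
        c.seq.store(pos + mask_ + 1, std::memory_order_release);  // recycle the cell for writers
        return true;
    }

private:
    struct Cell {
        std::atomic<std::size_t> seq;
        T value;
    };
    std::vector<Cell> cells_;
    const std::size_t mask_;
    std::atomic<std::size_t> tail_{0};   // enqueue position
    std::atomic<std::size_t> head_{0};   // dequeue position
};
```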
If there's a reasonable upper bound on buffer sizes, you could have shared fixed-size buffers associated with each node in the ring buffer. Like maybe 1kiB or 4kiB. Then you wouldn't need a payload in the ring buffer; the index would be the interesting thing.
If memory allocation footprint isn't a big deal (only cache footprint) even 64k or 1M buffers would be mostly fine even if you normally only use the low 400 bytes of each. Parts of the buffer that don't get used will just stay cold in cache. If you're using 2MiB hugepages, buffers smaller than that are a good idea to reduce TLB pressure: you want multiple buffers to be covered by the same TLB entry.
But you'd need to claim a buffer before writing to it, and finish writing to it before finishing the second step of adding an entry to the queue. You probably don't want to do more than a memcpy, because a partially-complete write blocks readers if it becomes the oldest entry in the queue before it finishes. Maybe you could write-prefetch the buffer (with prefetchw on Broadwell or newer) before trying to claim it, to reduce the time during which you're (potentially) blocking the queue. But if there's low contention among writers, that might not matter. And if there's high contention, so you don't (almost) always succeed at claiming the first buffer you try, a write-prefetch on the wrong buffer will slow down the reader or writer that does own it. Maybe a normal prefetch would be good.
If buffers are tied directly to queue entries, maybe you should just put them in the queue, as long as the MPMC library allows you to use custom reader code that reads a length and copies out that many bytes, instead of always copying a whole giant array.
Then every queue control entry that producers / consumers look at will be in a separate cache line, so there's no contention between two producers claiming adjacent entries.
If you need really big buffers because your upper bound is like 1MiB or something, retries because of contention will lead to touching more TLB entries, so a more compact ring buffer with the large buffers separate might be a better idea.
A reader half-way through claiming a buffer doesn't block other readers. It only blocks the queue if it wraps around and a producer is stuck waiting for it. So you can definitely have your readers use the data in-place in the queue, if it's big enough and readers are quick. But the more you do during a partially-complete read, the higher chance that you sleep and eventually block the queue.
This is a much bigger deal for producers, especially if the queue is normally (nearly) empty: consumers are coming up on newly-written entries almost as soon as they're produced. This is why you might want to make sure to prefetch the data you're going to copy in, and/or the shared buffer itself, before running a producer.
400 bytes is only 12.5 cycles of committing 32 bytes per clock to L1d cache (e.g. Intel Haswell / Skylake), so it's really short compared to inter-core latencies or the time you have to wait for an RFO on a cache write-miss. So the minimum time between a producer making the claim of a node globally visible to when you complete that claim so readers can read it (and later entries) is still very short. Blocking the queue for a long time is hopefully avoidable.
That much data even fits in 13 YMM registers, so a compiler could in theory actually load the data into registers before claiming a buffer entry, and just do stores. You could maybe do this by hand with intrinsics, with a fully-unrolled loop. (You can't index the register file, so it has to be fully unrolled, or always store 416 bytes, or whatever.)
Or 7 ZMM registers with AVX512, but you probably don't want to use 512-bit loads/stores if you aren't using other 512-bit instructions, because of the effects on max-turbo clock speed and shutting down port 1 for vector ALU uops. (I assume that still happens with vector load/store, but if we're lucky some of those effects only happen with 512-bit ALU uops...)

How can one analyse and/or eliminate performance variations due to memory allocation?

I have a real-time application that generally deals with each chunk of incoming data in a matter of 2-5 milliseconds, but sometimes it spikes to several tens of milliseconds. I can generate and repeat the sequence of incoming data as often as I like, and prove that the spikes are not related to particular chunks of data.
My guess is that because the C++/Win32/MFC code also uses variable-length std::vectors and std::lists, it regularly needs to get memory from the OS, and periodically has to wait for the OS to do some garbage collection or something. How could I test this conjecture? Is there any way to tune the memory allocation so that the OS has less of an impact?
Context: think of the application as a network protocol analyser which gathers data in real-time and makes it available for inspection. The data "capture" always runs in the highest priority thread.
The easy way to test is to not put your data into any structure, i.e. eliminate whatever you suspect may be the problem. You might also consider that the delays may be the OS switching your process out of context in order to give time to other processes.
If you are pushing lots of data onto a vector, such that it is constantly growing, then you will experience periodic delays as the vector is resized. In this case, the delays are likely to get longer and less frequent. One way to mitigate this is to use a deque which allocates data in chunks but relaxes the requirement that all data be in contiguous memory.
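For the vector case specifically, the usual mitigations look something like this; the element type and the capacity figure are illustrative assumptions:

```cpp
#include <cstdint>
#include <deque>
#include <vector>

struct Sample { std::uint64_t timestamp; std::uint32_t value; };  // stand-in for the captured data

int main() {
    // Option 1: pay the allocation cost once, up front, so push_back never
    // reallocates. Size the reserve for your worst case.
    std::vector<Sample> samples;
    samples.reserve(1'000'000);

    // Option 2: a deque grows in fixed-size chunks, so growth never copies the
    // whole container, trading contiguity for a bounded insertion cost.
    std::deque<Sample> stream;

    for (int i = 0; i < 1000; ++i) {
        samples.push_back({std::uint64_t(i), std::uint32_t(i)});
        stream.push_back({std::uint64_t(i), std::uint32_t(i)});
    }
}
```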
Another way around it is to create a background thread that handles the allocation, provided you know that it can allocate memory faster than the process consuming it. You can't directly use standard containers for this. However, you can implement something similar to a deque, by allocating constant size vector chunks or simply using traditional dynamic arrays. The idea here is that as soon as you begin using a new chunk, you signal your background process to allocate a new chunk.
All the above is based on the assumption that you need to store all your incoming data. If you don't need to do that, don't. In that case, it would suggest your symptoms are related to the OS switching you out. You could investigate altering the priority of your thread.

Dealing with fragmentation in a memory pool?

Suppose I have a memory pool object with a constructor that takes a pointer to a large chunk of memory ptr and size N. If I do many random allocations and deallocations of various sizes I can get the memory in such a state that I cannot allocate an M byte object contiguously in memory even though there may be a lot free! At the same time, I can't compact the memory because that would cause a dangling pointer on the consumers. How does one resolve fragmentation in this case?
I wanted to add my 2 cents only because no one else pointed out that, from your description, it sounds like you are implementing a standard heap allocator (i.e. what all of us already use every time we call malloc() or operator new).
A heap is exactly such an object: it goes to the virtual memory manager and asks for a large chunk of memory (what you call "a pool"). Then it has all kinds of algorithms for allocating and freeing chunks of various sizes as efficiently as possible. Furthermore, many people have modified and optimized these algorithms over the years. For a long time Windows came with an option called the low-fragmentation heap (LFH) which you had to enable manually. Starting with Vista, the LFH is used for all heaps by default.
Heaps are not perfect and they can definitely bog down performance when not used properly. Since OS vendors can't possibly anticipate every scenario in which you will use a heap, their heap managers have to be optimized for the "average" use. But if you have a requirement which is similar to the requirements for a regular heap (i.e. many objects, different sizes...) you should consider just using a heap and not reinventing it, because chances are your implementation will be inferior to what the OS already provides for you.
With memory allocation, the only time you can gain performance by not simply using the heap is by giving up some other aspect (allocation overhead, allocation lifetime....) which is not important to your specific application.
For example, in our application we had a requirement for many allocations of less than 1KB but these allocations were used only for very short periods of time (milliseconds). To optimize the app, I used Boost Pool library but extended it so that my "allocator" actually contained a collection of boost pool objects, each responsible for allocating one specific size from 16 bytes up to 1024 (in steps of 4). This provided almost free (O(1) complexity) allocation/free of these objects but the catch is that a) memory usage is always large and never goes down even if we don't have a single object allocated, b) Boost Pool never frees the memory it uses (at least in the mode we are using it in) so we only use this for objects which don't stick around very long.
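A rough sketch of that kind of size-class wrapper around Boost.Pool; the 16-byte steps here stand in for the finer-grained sizing we actually used, so treat the class boundaries as assumptions:

```cpp
#include <boost/pool/pool.hpp>
#include <cstddef>
#include <memory>
#include <vector>

// One boost::pool per size class, 16 B .. 1024 B in 16-byte steps.
class SizeClassAllocator {
public:
    SizeClassAllocator() {
        for (std::size_t sz = 16; sz <= 1024; sz += 16)
            pools_.push_back(std::make_unique<boost::pool<>>(sz));
    }

    void* allocate(std::size_t n) {
        return pool_for(n).malloc();          // O(1) once the pool has grown
    }

    void deallocate(void* p, std::size_t n) {
        pool_for(n).free(p);                  // memory stays in the pool, not returned to the OS
    }

private:
    boost::pool<>& pool_for(std::size_t n) {
        // n must be 1..1024; anything else should fall back to the heap instead.
        std::size_t idx = (n + 15) / 16 - 1;  // round up to the next 16-byte class
        return *pools_.at(idx);
    }

    std::vector<std::unique_ptr<boost::pool<>>> pools_;
};
```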
So which aspect(s) of normal memory allocation are you willing to give up in your app?
Depending on the system there are a couple of ways to do it.
Try to avoid fragmentation in the first place; if you allocate blocks in powers of 2 you have less of a chance of causing this kind of fragmentation. There are a couple of other ways around it, but if you ever reach this state then you just OOM at that point, because there are no graceful ways of handling it other than killing the process that asked for memory, blocking until you can allocate memory, or returning NULL as your allocation area.
Another way is to pass pointers to pointers of your data (ex: int **). Then you can rearrange memory beneath the program (thread-safely, I hope) and compact the allocations so that you can allocate new blocks and still keep the data from old blocks (once the system gets to this state, though, that becomes a heavy overhead, but it should seldom need to be done).
There are also ways of "binning" memory so that you have contiguous pages: for instance, dedicate one page only to allocations of 512 bytes and less, another for 1024 bytes and less, etc... This makes it easier to decide which bin to use, and in the worst case you split from the next highest bin or merge from a lower bin, which reduces the chance of fragmenting across multiple pages.
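The bin-selection logic itself can be as simple as rounding the request up to the next power of two; a small sketch with arbitrary bin bounds (16 B minimum here):

```cpp
#include <cstddef>

// Round a request up to the next power-of-two bin size (smallest bin 16 B).
std::size_t next_pow2(std::size_t n) {
    std::size_t p = 16;
    while (p < n) p <<= 1;
    return p;
}

// Map a request size to a bin index: 16 B -> 0, 32 B -> 1, ..., 4 KiB -> 8.
int bin_index(std::size_t n) {
    int idx = 0;
    for (std::size_t p = 16; p < next_pow2(n); p <<= 1) ++idx;
    return idx;
}
```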
Implementing object pools for the objects that you frequently allocate will drive fragmentation down considerably without the need to change your memory allocator.
It would be helpful to know more exactly what you are actually trying to do, because there are many ways to deal with this.
But, the first question is: is this actually happening, or is it a theoretical concern?
One thing to keep in mind is you normally have a lot more virtual memory address space available than physical memory, so even when physical memory is fragmented, there is still plenty of contiguous virtual memory. (Of course, the physical memory is discontiguous underneath but your code doesn't see that.)
I think there is sometimes unwarranted fear of memory fragmentation, and as a result people write a custom memory allocator (or worse, they concoct a scheme with handles and moveable memory and compaction). I think these are rarely needed in practice, and it can sometimes improve performance to throw this out and go back to using malloc.
Write the pool to operate as a list of allocations; it can then be extended and destroyed as needed. This can reduce fragmentation.
And/or implement allocation transfer (or move) support so you can compact active allocations. The object/holder may need to assist you, since the pool may not necessarily know how to transfer types itself. If the pool is used with a collection type, it is far easier to accomplish compaction/transfers.

questions about memory pool

I need some clarification on the concept and implementation of memory pools.
The Wikipedia entry on memory pools says that
also called fixed-size-blocks allocation, ... ,
as those implementations suffer from fragmentation because of variable
block sizes, it can be impossible to use them in a real time system
due to performance.
How "variable block size causes fragmentation" happens? How fixed sized allocation can solve this? This wiki description sounds a bit misleading to me. I think fragmentation is not avoided by fixed sized allocation or caused by variable size. In memory pool context, fragmentation is avoided by specific designed memory allocators for specific application, or reduced by restrictly using an intended block of memory.
Also by several implementation samples, e.g., Code Sample 1 and Code Sample 2, it seems to me, to use memory pool, the developer has to know the data type very well, then cut, split, or organize the data into the linked memory chunks (if data is close to linked list) or hierarchical linked chunks (if data is more hierarchical organized, like files). Besides, it seems the developer has to predict in prior how much memory he needs.
Well, I could imagine this works well for an array of primitive data. What about C++ non-primitive data classes, in which the memory model is not that evident? Even for primitive data, should the developer consider the data type alignment?
Is there good memory pool library for C and C++?
Thanks for any comments!
Variable block sizes indeed cause fragmentation. Look at the picture that I am attaching:
The image (from here) shows a situation in which A, B, and C allocate chunks of memory, of variable sizes.
At some point, B frees all of its chunks of memory, and suddenly you have fragmentation. E.g., if C needed to allocate a large chunk of memory that would still fit into the total available memory, it could not, because the available memory is split into two blocks.
Now, if you think about the case where each chunk of memory would be of the same size, this situation would clearly not arise.
Memory pools, of course, have their own drawbacks, as you yourself point out. So you should not think that a memory pool is a magical wand. It has a cost and it makes sense to pay it under specific circumstances (i.e., embedded system with limited memory, real time constraints and so on).
As to which memory pool is good in C++, I would say that it depends. I have used one under VxWorks that was provided by the OS; in a sense, a good memory pool is effective when it is tightly integrated with the OS. Actually each RTOS offers an implementation of memory pools, I guess.
If you are looking for a generic memory pool implementation, look at this.
EDIT:
From your last comment, it seems to me that you may be thinking of memory pools as "the" solution to the problem of fragmentation. Unfortunately, this is not the case. If you like, fragmentation is the manifestation of entropy at the memory level, i.e., it is inevitable. On the other hand, memory pools are a way of managing memory so as to effectively reduce the impact of fragmentation (as I said, and as Wikipedia mentions, mostly on specific systems like real-time systems). This comes at a cost, since a memory pool can be less efficient than a "normal" memory allocation technique in that you have a minimum block size. In other words, the entropy reappears in disguise.
Furthermore, there are many parameters that affect the efficiency of a memory pool system, such as block size, block allocation policy, or whether you have just one memory pool or several memory pools with different block sizes, different lifetimes or different policies.
Memory management is really a complex matter, and memory pools are just a technique that, like any other, improves some things in comparison to other techniques and exacts a cost of its own.
In a scenario where you always allocate fixed-size blocks, you either have enough space for one more block, or you don't. If you have, the block fits in the available space, because all free or used spaces are of the same size. Fragmentation is not a problem.
In a scenario with variable-size blocks, you can end up with multiple separate free blocks with varying sizes. A request for a block of a size that is less than the total memory that is free may be impossible to be satisfied, because there isn't one contiguous block big enough for it. For example, imagine you end up with two separate free blocks of 2KB, and need to satisfy a request for 3KB. Neither of these blocks will be enough to provide for that, even though there is enough memory available.
Both fixed-size and variable-size memory pools will feature fragmentation, i.e. there will be some free memory chunks between used ones.
For variable size, this might cause problems, since there might not be a free chunk that is big enough for a certain requested size.
For fixed-size pools, on the other hand, this is not a problem, since only portions of the pre-defined size can be requested. If there is free space, it is guaranteed to be large enough for (a multiple of) one portion.
If you are building a hard real-time system, you might need to know in advance that you can allocate memory within the maximum time allowed. That can be "solved" with fixed-size memory pools.
I once worked on a military system, where we had to calculate the maximum possible number of memory blocks of each size that the system could ever possibly use. Then those numbers were added to a grand total, and the system was configured with that amount of memory.
Crazily expensive, but worked for the defence.
When you have several fixed size pools, you can get a secondary fragmentation where your pool is out of blocks even though there is plenty of space in some other pool. How do you share that?
With a memory pool, operations might work like this:
Store a global variable that is a list of available objects (initially empty).
To get a new object, try to return one from the global list of available objects. If there isn't one, then call operator new to allocate a new object on the heap. Allocation is extremely fast, which is important for applications that might currently be spending a lot of CPU time on memory allocations.
To free an object, simply add it to the global list of available objects. You might place a cap on the number of items allowed in the global list; if the cap is reached then the object would be freed instead of returned to the list. The cap prevents the appearance of a massive memory leak.
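A minimal sketch of those three operations (hypothetical names, and no thread safety, which a real global list would need):

```cpp
#include <cstddef>
#include <vector>

// Minimal free-list pool for one object type, following the steps above.
template <typename T, std::size_t MaxFree = 1024>
class ObjectPool {
public:
    T* get() {
        if (!free_.empty()) {               // reuse a previously released object
            T* obj = free_.back();
            free_.pop_back();
            return obj;
        }
        return new T();                     // list empty: fall back to the heap
    }

    void put(T* obj) {
        if (free_.size() < MaxFree)         // cap the list to bound memory usage
            free_.push_back(obj);
        else
            delete obj;                     // over the cap: actually free it
    }

private:
    std::vector<T*> free_;                  // the "global list of available objects"
};
```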
Note that this is always done for a single data type of the same size; it doesn't work for larger ones, and for those you probably need to use the heap as usual.
It's very easy to implement; we use this strategy in our application. It causes a bunch of memory allocations at the beginning of the program, but after that no further freeing/allocating, which would incur significant overhead, takes place.