I have a steady stream of timestamped data, of which I want to always keep the last 5 seconds of data in a buffer.
Furthermore, I would like to provide support for extracting data of a given subinterval of the 5 seconds, so something like
interval = buffer.extractData(startTime, endTime);
What std data structure would be most appropriate for this?
1) The fact that a new sample pushes an old sample out hints that a Queue would be a good data structure
2) The fact that we have to have random access to any elements, in order to obtain the sub interval maybe suggests that vector is appropriate.
Furthermore, what would be a good way to present the subinterval to the user?
My suggestion would be using two iterators?
Unless you are in a fairly performance critical part of the code, a deque would seem reasonable. It can grow and shrink to accommodate changes in your data rate and has reasonable performance for double-ended queue operations and random access.
If the code is performance sensitive (or, even worse, has real-time requirements on top, as is the case with many timestamped buffers), you need to prevent memory allocations as much as possible. You would do this by using a ring buffer with a preallocated array (be it through unique_ptr<T[]> or vector) and either dropping elements when the buffer size is exceeded, or (taking one for the team and) increasing its size.
By never reducing size again, your ring buffer might waste some memory, but remember that in most cases memory is fairly plentiful.
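A minimal sketch of the preallocated ring buffer described above, for the "drop the oldest element" policy. The class name RingBuffer and its interface are made up for illustration, and it assumes T is default-constructible and copy-assignable; the growing variant is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity ring buffer: storage is allocated once in the constructor,
// and pushing into a full buffer overwrites the oldest element.
template <typename T>
class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity) : data_(capacity) {
        assert(capacity > 0);
    }

    void push(const T& value) {
        // When the buffer is full this slot is the oldest element.
        data_[(head_ + size_) % data_.size()] = value;
        if (size_ < data_.size()) {
            ++size_;
        } else {
            head_ = (head_ + 1) % data_.size();  // oldest element was overwritten
        }
    }

    // Index 0 is the oldest element, size() - 1 the newest.
    const T& operator[](std::size_t i) const {
        assert(i < size_);
        return data_[(head_ + i) % data_.size()];
    }

    std::size_t size() const { return size_; }

private:
    std::vector<T> data_;
    std::size_t head_ = 0;  // position of the oldest element
    std::size_t size_ = 0;  // number of stored elements
};
```

After construction, push() never allocates, which is the property that matters for the real-time case.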
Representing an interval by two iterators or by a range object are both common approaches. Although the C++ standard library often prefers iterators, my personal preference is for range objects due to their (in my opinion) slightly better usability.
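As a rough illustration of the deque option combined with the two-iterator interface, here is a sketch. The Sample struct, the TimedBuffer class and the choice of seconds stored in a double are assumptions for the example, not something from the question; it also assumes samples arrive in non-decreasing time order, which is what makes the binary search valid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>

// Hypothetical sample type: a timestamp in seconds plus a payload.
struct Sample {
    double time;
    int value;
};

class TimedBuffer {
public:
    using const_iterator = std::deque<Sample>::const_iterator;

    explicit TimedBuffer(double window) : window_(window) {}

    // Append a sample (timestamps must be non-decreasing) and drop
    // everything older than `window_` seconds relative to it.
    void push(const Sample& s) {
        assert(data_.empty() || data_.back().time <= s.time);
        data_.push_back(s);
        while (data_.front().time < s.time - window_) {
            data_.pop_front();
        }
    }

    // Return the iterator pair [first, last) covering all samples with
    // startTime <= time <= endTime, found by binary search.
    std::pair<const_iterator, const_iterator>
    extractData(double startTime, double endTime) const {
        const_iterator first = std::lower_bound(
            data_.begin(), data_.end(), startTime,
            [](const Sample& s, double t) { return s.time < t; });
        const_iterator last = std::upper_bound(
            first, data_.end(), endTime,
            [](double t, const Sample& s) { return t < s.time; });
        return {first, last};
    }

    std::size_t size() const { return data_.size(); }

private:
    double window_;
    std::deque<Sample> data_;
};
```

Note that the returned iterators are invalidated by the next push(), so the caller has to consume the interval before new data arrives (or copy it out).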
I have a thread running that reads a stream of bytes from a serial port. It does this continuously in the background, and reads from the stream come in at different, separate times. I store the data in a container like so:
using ByteVector = std::vector<std::uint8_t>;
ByteVector receive_queue;
When data comes in from the serial port, I append it to the end of the byte queue:
ByteVector read_bytes = serial_port->ReadBytes(100); // read 100 bytes; returns as a "ByteVector"
receive_queue.insert(receive_queue.end(), read_bytes.begin(), read_bytes.end());
When I am ready to read data in the receive queue, I remove it from the front:
unsigned read_bytes = 100;
// Read 100 bytes from the front of the vector by using indices or iterators, then:
receive_queue.erase(receive_queue.begin(), receive_queue.begin() + read_bytes);
This isn't the full code, but gives a good idea of how I'm utilizing the vector for this data streaming mechanism.
My main concern with this implementation is the removal from the front, which requires shifting each element removed (I'm not sure how optimized erase() is for vector, but in the worst case, each element removal results in a shift of the entire vector). On the flip side, vectors are candidates for CPU cache locality because of the contiguous nature of the data (but CPU cache usage is not guaranteed).
I've thought of maybe using boost::circular_buffer, but I'm not sure if it's the right tool for the job.
I have not yet coded an upper-limit for the growth of the receive queue, however I could easily do a reserve(MAX_RECEIVE_BYTES) somewhere, and make sure that size() is never greater than MAX_RECEIVE_BYTES as I continue to append to the back of it.
Is this approach generally OK? If not, what performance concerns are there? What container would be more appropriate here?
Erasing from the front of a vector one element at a time can be quite slow, especially if the buffer is large (unless you can reorder elements, which you cannot with a FIFO queue).
A circular buffer is an excellent, perhaps ideal data structure for a FIFO queue of fixed size. But there is no implementation in the standard library. You'll have to implement it yourself or use a third party implementation such as the Boost one you've discovered.
The standard library provides a high level structure for a growing FIFO queue: std::queue. For a lower level data structure, the double ended queue is a good choice (std::deque, which is the default underlying container of std::queue).
On the flip side, vectors are candidates for CPU cache locality because of the contiguous nature of the data (but this is not guaranteed).
The contiguous storage of std::vector is guaranteed. A fixed circular buffer also has contiguous storage.
I'm not sure what is guaranteed about cache locality of std::deque, but it is typically quite good in practice, as the typical implementation is a sequence of fixed-size arrays.
Performance will be poor, which may or may not matter. Taking from the head entails a cascade of moves. But STL has queues for exactly this purpose, just use one.
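For illustration, the receive queue from the question could be backed by std::deque roughly like this. The ReceiveQueue wrapper and its method names are hypothetical; only ByteVector comes from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

using ByteVector = std::vector<std::uint8_t>;

// Hypothetical wrapper around the question's receive queue.
class ReceiveQueue {
public:
    // Append a freshly read chunk to the back of the queue.
    void append(const ByteVector& bytes) {
        queue_.insert(queue_.end(), bytes.begin(), bytes.end());
    }

    // Remove up to `count` bytes from the front and return them in order.
    ByteVector take(std::size_t count) {
        count = std::min(count, queue_.size());
        const auto last = queue_.begin() + static_cast<std::ptrdiff_t>(count);
        ByteVector out(queue_.begin(), last);
        // Erasing at the front of a deque does not move the remaining elements.
        queue_.erase(queue_.begin(), last);
        return out;
    }

    std::size_t size() const { return queue_.size(); }

private:
    std::deque<std::uint8_t> queue_;
};
```

If the reader thread and the consumer run concurrently, access still has to be synchronized (for example with a mutex); that part is left out here.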
I'm looking to find the best way to setup the CL memory objects for my project, which does a device side physics simulation. The buffers will be accessed by the host every frame, approx every 16ms, to get the updated data for rendering. Unfortunately, I cannot send the new data straight to the GPU via a VBO.
The data in the buffer consists of structs with 3 cl_float4's and one cl_float. I also want to have the ability for the host to update some of the structs in the buffer, this will not be per-frame.
Currently I'm looking to have all the data be allocated/stored on the GPU and using map/unmap whenever the host requires access. But this brings up two issues that I can see:
Still require a device to host copy for rendering
Buffer must be rebuilt whenever objects are added/removed from the simulation. Or additional validation data must exist per struct to check if this object is "alive"/valid...
Any advice is appreciated. If you need any additional info or code snippets, just let me know.
Thank you.
Algorithm for efficient memory management
You're asking for the best setup for OpenCL memory. I'm assuming you mostly care about high performance and not too much of a size overhead.
This means you should perform as many operations as possible on the GPU. Syncing between CPU/GPU should be minimized.
Memory model
I will now describe in detail what such a memory and processing model should look like.
Preallocate buffers with the maximum size and fill them over time.
Track how many elements currently are in the buffer
Have separate buffers for validity and your data. The validity buffer indicates the validity for each element
Adding elements
Adding elements can be done via the following principle:
Have a buffer with host pointer for input data. The size of the buffer is determined by the maximum number of input elements
When you receive data, copy it onto the host buffer and sync it to the GPU
(Optional) Preprocess input data on the GPU
In a kernel, add input data and corresponding validity behind the last element in the global buffer. Input points that are empty (maybe you just got 100 input points instead of 10000), just mark them as invalid.
This has several effects:
Adding can be completely done in parallel
You only have to sync a small buffer (input data buffer) to the GPU
When adding input data, you always add the maximum amount of input elements into the buffer, but most of them will be empty/invalid. So when you frequently add points
If your rendering step is not able to discard invalid points, you must remove invalid points from the model before rendering.
Otherwise, you can postpone cleaning up until it is actually needed, i.e. when the model becomes too big and threatens to overflow.
Removing elements
Removing elements should be done via the following principle:
Have a kernel that determines if an element becomes invalid. If so, just mark its validity accordingly (if you want, you can zero or NaN out the data too, but that is not necessary).
Have an algorithm that is able to remove invalid elements from the buffer and give you the number of valid, consecutive elements in the buffer (that information is needed when adding elements).
Such an algorithm will require you to perform sorts and a search using parallel reduction.
Sorting elements in parallel
Sorting a buffer, especially one with many elements is highly demanding. You should use available implementations to do so.
Simple Bitonic sort:
If you do not care about the maximum possible performance and simple code, this is your choice.
Implementation available: https://software.intel.com/en-us/articles/bitonic-sorting
Simple to integrate, just a single kernel.
Can only sort 4*2^n elements (as far as I remember).
WARNING: This sort does not work with numbers larger than one billion (1,000,000,000). Not sure why but finding that out cost me quite some time.
Fast radix sort:
If you care about maximum performance and have lots of elements to sort (1 million up to 1 billion or even more), this is your choice.
Implementation available: https://github.com/sschaetz/nvidia-opencl-examples/tree/master/OpenCL/src/oclRadixSort
More difficult to integrate, several kernel calls
Can only sort 2^n elements (as far as I remember)
Faster than Bitonic sort, especially with more than 1 million elements
Finding out the number of valid elements
If the buffer has been sorted and all invalid elements have been removed, you could simply count the number of valid values in parallel, or find the index of the first invalid element (this requires you to have unused buffer space invalidated). Both ways will give you the number of valid elements.
Problem size vs. sorting size restrictions
To overcome the problems that arise with only being able to sort a fixed number of elements, just pad out with values whose sorting behavior you know.
Example:
You want to sort 10,000 integers with values between 0 and 10 million in ascending order.
You can only sort 2^n elements
The closest you will get is 2^14 = 16384.
Have a buffer for sorting with 2^14 elements
Fill the buffer with the 10000 values to sort.
Fill all remaining values of the buffer with a value you know will be sorted behind the 10,000 actually existing values.
Since you know your value range (0 to 10 million), you could pick 11 million as filling value.
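The padding trick from this example can be sketched on the host as follows. This is only an illustration: std::sort stands in for the size-restricted GPU sort, and the function names nextPowerOfTwo and sortWithPadding are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Smallest power of two that is >= n (returns 1 for n == 0).
std::size_t nextPowerOfTwo(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p *= 2;
    return p;
}

// Pad `values` to a power-of-two length with `sentinel`, sort ascending,
// and strip the padding again. `sentinel` must be larger than every real
// value so that all padding ends up at the back.
std::vector<std::uint32_t> sortWithPadding(std::vector<std::uint32_t> values,
                                           std::uint32_t sentinel) {
    const std::size_t realCount = values.size();
    values.resize(nextPowerOfTwo(realCount), sentinel);
    std::sort(values.begin(), values.end());  // stand-in for the GPU sort
    values.resize(realCount);                 // drop the sentinels at the back
    return values;
}
```

With 10,000 input values, nextPowerOfTwo returns 16384, matching the buffer size in the example.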
In-place sorting problem
In-place sorting and removing of elements is difficult (but possible) to implement. An easier solution is to determine the indices of consecutive valid elements and write them to a new buffer in that order and then swap buffers.
But this requires you to swap buffers or copy back which costs both performance and space. Choose the lesser evil in your case.
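The "write valid elements to a new buffer, then swap" variant looks roughly like this when written sequentially on the host. This is a CPU stand-in for illustration only, not an OpenCL kernel; on the GPU the destination indices would typically come from a parallel prefix sum over the validity flags. The function name compact and the use of std::uint8_t as validity flag are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Copy the valid elements of `data` to the front of `scratch` in their
// original order, swap the two buffers, and return the number of valid
// elements. All buffers keep their preallocated size.
template <typename T>
std::size_t compact(std::vector<T>& data, std::vector<std::uint8_t>& valid,
                    std::vector<T>& scratch) {
    assert(data.size() == valid.size() && data.size() == scratch.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (valid[i]) scratch[count++] = data[i];
    }
    std::swap(data, scratch);  // swaps the buffers without copying elements
    // After compaction the first `count` slots are valid, the rest are free.
    for (std::size_t i = 0; i < valid.size(); ++i) {
        valid[i] = (i < count) ? 1 : 0;
    }
    return count;
}
```

The returned count is the "number of valid, consecutive elements" that the adding step needs as its insertion position.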
More advice
Only add wait-events if you are still not content with the performance. However, this will complicate your code and possibly introduce bugs (which won't even be your fault - there is a nasty bug with Nvidia cards and OpenCL where wait-events are not destroyed and memory leaks - this will slowly but surely cause problems).
Be very careful with syncing/mapping buffers to CPU too early, as this sync-call will force all kernels using this buffer to finish
If adding elements rarely occurs, and your rendering step is able to discard invalid elements, you can postpone removing elements from the buffer until it is really needed (too many elements threaten to overflow your buffer).
Last week I read about great concepts such as cache locality and pipelining in a CPU. Although these concepts are easy to understand, I have two questions. Suppose one can choose between a vector of objects or a vector of pointers to objects (as in this question).
Then an argument for using pointers is that shuffling larger objects may be expensive. However, I'm not able to find out when I should call an object large. Is an object of several bytes already large?
An argument against the pointers is the loss of cache locality. Will it help if one uses two vectors where the first one contains the objects and will not be reordered and the second one contains pointers to these objects? Say that we have a vector of 200 objects and create a vector with pointers to these objects and then randomly shuffle the last vector. Is the cache locality then lost if we loop over the vector with pointers?
This last scenario happens a lot in my programs where I have City objects and then have around 200 vectors of pointers to these Cities. To avoid having 200 instances of each City I use a vector of pointers instead of a vector of Cities.
There is no simple answer to this question. You need to understand how your system interacts with regards to memory, what operations you do on the container, and which of those operations are "important". But by understanding the concepts and what affects what, you can get a better understanding of how things work. So here's some "discussion" on the subject.
"Cache locality" is largely about "keeping things in the cache". In other words, if you look at A, then B, and A is located close to B, they are probably getting loaded into the cache together.
If objects are large enough that they fill one or more cache-lines (modern CPUs have cache-lines of 64-128 bytes, mobile ones are sometimes smaller), the "next object in line" will not be in the cache anyways [1], so the cache-locality of the "next element in the vector" is less important. The smaller the object is, the more effect of this you get - assuming you are accessing objects in the order they are stored. If you pick elements at random, then other factors start to become important [2], and the cache locality is much less important.
On the other hand, as objects get larger, moving them within the vector (including growing, removing, inserting, as well as "random shuffle") will be more time consuming, as copying more data gets more expensive.
Of course, one further step is always needed to read from a pointer vs. reading an element directly in a vector, since the pointer itself needs to be "read" before we can get to the actual data in the pointee object. Again, this becomes more important when random-accessing things.
I always start with "whatever is simplest" (which depends on the overall construct of the code, e.g. sometimes it's easier to create a vector of pointers because you have to dynamically create the objects in the first place). Most of the code in a system is not performance critical anyway, so why worry about its performance - just get it working and leave it be if it doesn't turn up in your performance measurements.
Of course, also, if you are doing a lot of movement of objects in a container, maybe vector isn't the best container. That's why there are multiple container variants - vector, list, map, tree, deque - as they have different characteristics with regards to their access and insert/remove as well as characteristics for linearly walking the data.
Oh, and in your example, you talk of 200 city objects - well, they are probably going to all fit in the cache of any modern CPU anyways. So stick them in a vector. Unless a city contains a list of every individual living in the city... But that probably should be a vector (or other container object) in itself.
As an experiment, make a program that does the same operations on a std::vector<int> and std::vector<int*> [such as filling with random numbers, then sorting the elements], then make an object that is large [stick some array of integers in there, or some such], with one integer so that you can do the very same operations on that. Vary the size of the object stored, and see how it behaves. On YOUR system, where is the benefit of having pointers, over having plain objects. Of course, also vary the number of elements, to see what effect that has.
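A possible skeleton for that experiment is sketched below. All names (Big, makeObjects, sortByValue, sortByPointer, millisecondsFor) are made up for illustration, and the timings it produces are only meaningful on your own machine with optimizations enabled.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Object whose size grows with PayloadInts (must be at least 1).
template <std::size_t PayloadInts>
struct Big {
    int key;
    int payload[PayloadInts];
};

// Create `count` objects with pseudo-random keys.
template <std::size_t N>
std::vector<Big<N>> makeObjects(std::size_t count, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<Big<N>> objects(count);  // payload is zero-initialized
    for (Big<N>& o : objects) o.key = static_cast<int>(rng() % 1000000u);
    return objects;
}

// Sort the objects themselves: every swap moves sizeof(Big<N>) bytes.
template <std::size_t N>
void sortByValue(std::vector<Big<N>>& objects) {
    std::sort(objects.begin(), objects.end(),
              [](const Big<N>& a, const Big<N>& b) { return a.key < b.key; });
}

// Leave the objects in place and sort a vector of pointers to them instead.
template <std::size_t N>
std::vector<const Big<N>*> sortByPointer(const std::vector<Big<N>>& objects) {
    std::vector<const Big<N>*> pointers;
    pointers.reserve(objects.size());
    for (const Big<N>& o : objects) pointers.push_back(&o);
    std::sort(pointers.begin(), pointers.end(),
              [](const Big<N>* a, const Big<N>* b) { return a->key < b->key; });
    return pointers;
}

// Wall-clock time of one call to `work`, in milliseconds.
template <typename F>
double millisecondsFor(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

One way to use it: create the same data twice with makeObjects<16>(count, seed), time sortByValue on one copy and sortByPointer on the other with millisecondsFor, then repeat with other payload sizes and element counts and compare.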
[1] Well, modern processors use cache-prefetching, which MAY load "next data" into the cache speculatively, but we certainly can't rely on this.
[2] An extreme case of this is a telephone exchange with a large number of subscribers (millions). When placing a call, the caller and callee are looked up in a table. But the chance of either caller or callee being in the cache is nearly zero, because (assuming we're dealing with a large city, say London) the number of calls placed and received every second is quite large. So caches become useless, and it gets worse, because the processor also caches the page-table entries, and they are also, most likely, out of date. For these sorts of applications, the CPU designers have "huge pages", which means that the memory is split into 1GB pages instead of the usual 4K or 2MB pages that have been around for a while. This reduces the amount of memory reading needed before "we get to the right place". Of course, the same applies to various other "large database, unpredictable pattern" applications - airlines, facebook, stackoverflow all have these sorts of problems.
I have a problem I am working on where I need to use some sort of 2 dimensional array. The array is fixed width (four columns), but I need to create extra rows on the fly.
To do this, I have been using vectors of vectors, and I have been using some nested loops that contain this:
array.push_back(vector<float>(4));
array[n][0] = a;
array[n][1] = b;
array[n][2] = c;
array[n][3] = d;
n++;
to add the rows and their contents. The trouble is that I appear to be running out of memory with the number of elements I was trying to create, so I reduced the number that I was using. But then I started reading about deque, and thought it would allow me to use more memory because it doesn't have to be contiguous. I changed all mentions of "vector" to "deque" in this loop, as well as in all declarations. But then it appeared that I ran out of memory again, this time even with the reduced number of rows.
I looked at how much memory my code is using, and when I am using deque, the memory rises steadily to above 2GB, and the program closes soon after, even when using the smaller number of rows. I'm not sure exactly where in this loop it is when it runs out of memory.
When I use vectors, the memory usage (for the same number of rows) is still under 1GB, even when the loop exits. It then goes on to a similar loop where more rows are added, still only reaching about 1.4GB.
So my question is. Is this normal for deque to use more than twice the memory of vector, or am I making an erroneous assumption in thinking I can just replace the word "vector" with "deque" in the declarations/initializations and the above code?
Thanks in advance.
I'm using:
MS Visual C++ 2010 (32-bit)
Windows 7 (64-bit)
The real answer here has little to do with the core data structure. The answer is that MSVC's implementation of std::deque is especially awful and degenerates to an array of pointers to individual elements, rather than the array of arrays it should be. Frankly, only twice the memory use of vector is surprising. If you had a better implementation of deque you'd get better results.
It all depends on the internal implementation of deque (I won't speak about vector since it is relatively straightforward).
Fact is, deque has completely different guarantees than vector (the most important one being that it supports O(1) insertion at both ends while vector only supports O(1) insertion at the back). This in turn means the internal structures managed by deque have to be more complex than vector.
To allow that, a typical deque implementation will split its memory in several non-contiguous blocks. But each individual memory block has a fixed overhead to allow the memory management to work (eg. whatever the size of the block, the system may need another 16 or 32 bytes or whatever in addition, just for bookkeeping). Since, contrary to a vector, a deque requires many small, independent blocks, the overhead stacks up which can explain the difference you see. Also note that those individual memory blocks need to be managed (maybe in separate structures?), which probably means some (or a lot of) additional overhead too.
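The same per-block overhead argument applies to the inner containers in the question: a vector<float>(4) or deque<float>(4) per row means at least one separate heap allocation per row, plus the container object itself. Since the width is fixed at four columns, one way to avoid that, sketched here as an assumption about what would fit the question rather than something taken from it, is to store each row inline as a std::array. The names Row, addRow and payloadBytes are made up.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// One row of the table: four floats stored inline, no allocation per row.
using Row = std::array<float, 4>;

// Append a row; the only heap memory involved is the outer vector's buffer.
void addRow(std::vector<Row>& table, float a, float b, float c, float d) {
    table.push_back(Row{{a, b, c, d}});
}

// Payload bytes held by the table, ignoring unused vector capacity.
std::size_t payloadBytes(const std::vector<Row>& table) {
    return table.size() * sizeof(Row);
}
```

Element access stays the same as in the question (table[n][0] and so on). The outer vector still needs one contiguous block, so this reduces the footprint but does not remove the 32-bit address space limit.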
As for a way to solve your problem, you could try what @BasileStarynkevitch suggested in the comments. This will indeed reduce your memory usage, but it will get you only so far, because at some point you'll still run out of memory. And what if you try to run your program on a machine that only has 256MB RAM? Any other solution whose goal is to reduce your memory footprint while still trying to keep all your data in memory will suffer from the same problems.
A proper solution when handling large datasets like yours would be to adapt your algorithms and data structures in order to be able to handle small partitions at a time of your whole dataset, and load/save those partitions as needed in order to make room for the other partitions. Unfortunately since it probably means disk access, it also means a big drop in performance but hey, you can't eat the cake and have it too.
Theory
There are two common ways to efficiently implement a deque: either with a modified dynamic array or with a doubly linked list.
The modified dynamic array is basically a dynamic array that can grow from both ends, sometimes called an array deque. These array deques have all the properties of a dynamic array, such as constant-time random access, good locality of reference, and inefficient insertion/removal in the middle, with the addition of amortized constant-time insertion/removal at both ends, instead of just one end.
There are several implementations of modified dynamic array:
1) Allocating deque contents from the center of the underlying array, and resizing the underlying array when either end is reached. This approach may require more frequent resizings and waste more space, particularly when elements are only inserted at one end.
2) Storing deque contents in a circular buffer, and only resizing when the buffer becomes full. This decreases the frequency of resizings.
3) Storing contents in multiple smaller arrays, allocating additional arrays at the beginning or end as needed. Indexing is implemented by keeping a dynamic array containing pointers to each of the smaller arrays.
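To make the "multiple smaller arrays" layout more concrete, here is a heavily simplified sketch that only grows at the back. BlockedArray and its default block size of 4 are made up for illustration; a real std::deque also grows at the front and uses implementation-specific block sizes. It assumes T is default-constructible.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Back-growing sketch of a container made of fixed-size blocks plus a
// dynamic array of pointers to those blocks.
template <typename T, std::size_t BlockSize = 4>
class BlockedArray {
public:
    void push_back(const T& value) {
        if (size_ % BlockSize == 0) {
            // All existing blocks are full: allocate one more block.
            blocks_.push_back(std::unique_ptr<T[]>(new T[BlockSize]));
        }
        blocks_[size_ / BlockSize][size_ % BlockSize] = value;
        ++size_;
    }

    // Constant-time indexing: pick the block, then the slot inside it.
    T& operator[](std::size_t i) {
        assert(i < size_);
        return blocks_[i / BlockSize][i % BlockSize];
    }

    std::size_t size() const { return size_; }
    std::size_t blockCount() const { return blocks_.size(); }

private:
    std::vector<std::unique_ptr<T[]>> blocks_;  // pointers to the small arrays
    std::size_t size_ = 0;
};
```

Existing elements never move when the container grows; only the small array of block pointers is reallocated. The price is one heap allocation (with its bookkeeping overhead) per block, which becomes significant when blocks are small.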
Conclusion
Different libraries may implement deques in different ways, but generally as a modified dynamic array. Most likely your standard library uses approach #1 to implement std::deque, and since you append elements only at one end, you ultimately waste a lot of space. For that reason, it looks as if std::deque takes up more space than a plain std::vector.
Furthermore, if std::deque were implemented as a doubly-linked list, that would result in a waste of space too, since each element would need to accommodate 2 pointers in addition to your custom data.
Implementation with approach #3 (modified dynamic array approach too) would again result in a waste of space to accommodate additional metadata such as pointers to all those small arrays.
In any case, std::deque is less efficient in terms of storage than plain old std::vector. Without knowing what you want to achieve, I cannot confidently suggest which data structure you need. However, it seems like you don't even know what deques are for; therefore, what you really want in your situation is std::vector. Deques, in general, have different applications.
Deque can have additional memory overhead over vector because it's made of several blocks instead of one contiguous block.
From en.cppreference.com/w/cpp/container/deque:
As opposed to std::vector, the elements of a deque are not stored contiguously: typical implementations use a sequence of individually allocated fixed-size arrays.
The primary issue is running out of memory.
So, do you need all the data in memory at once?
You may never be able to accomplish this.
Partial Processing
You may want to consider processing the data into "chunks" or smaller sub-matrices. For example, using the standard rectangular grid:
Read data of first quadrant.
Process data of first quadrant.
Store results (in a file) of first quadrant.
Repeat for remaining quadrants.
Searching
If you are searching for a particle or a set of data, you can do that without reading the entire data set into memory.
Allocate a block (array) of memory.
Read a portion of the data into this block of memory.
Search the block of data.
Repeat steps 2 and 3 until the data is found.
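The steps above can be sketched as follows, under the assumption that the data is a raw binary sequence of int values readable through a std::istream. The function name findInStream and the file format are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Search a stream of raw ints for `target`, holding only `chunkSize` values
// in memory at a time. Returns the zero-based position of the first match,
// or -1 if the value does not occur.
long long findInStream(std::istream& in, int target, std::size_t chunkSize) {
    assert(chunkSize > 0);
    std::vector<int> block(chunkSize);  // step 1: allocate one block
    long long base = 0;                 // number of values consumed so far
    for (;;) {
        // Step 2: read the next portion of the data into the block.
        in.read(reinterpret_cast<char*>(block.data()),
                static_cast<std::streamsize>(block.size() * sizeof(int)));
        const std::size_t got =
            static_cast<std::size_t>(in.gcount()) / sizeof(int);
        if (got == 0) return -1;        // nothing left to read
        // Step 3: search only the part of the block that was filled.
        const auto end = block.begin() + static_cast<std::ptrdiff_t>(got);
        const auto it = std::find(block.begin(), end, target);
        if (it != end) return base + (it - block.begin());
        base += static_cast<long long>(got);
    }
}
```

For a real file you would pass a std::ifstream opened with std::ios::binary; for a quick check, a std::stringstream filled with the same bytes works as well.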
Streaming Data
If your application is receiving the raw data from an input source (other than a file), you will want to store the data for later processing.
This will require more than one buffer and is more efficient using at least two threads of execution.
The Reading Thread will be reading data into a buffer until the buffer is full. When the buffer is full, it will read data into another empty one.
The Writing Thread will initially wait until either the first read buffer is full or the read operation is finished. Next, the Writing Thread takes data out of the read buffer and writes to a file. The Write Thread then starts writing from the next read buffer.
This technique is called Double Buffering or Multiple Buffering.
Sparse Data
If there is a lot of zero or unused data in the matrix, you should try using Sparse Matrices. Essentially, this is a list of structures that hold the data's coordinates and the value. This also works when most of the data is a common value other than zero. This saves a lot of memory space; but costs a little bit more execution time.
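A small sketch of the idea, using a std::map keyed by (row, column) in place of a plain list of structures so that lookups stay logarithmic. The class name SparseMatrix and its interface are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

// Sparse matrix that stores only the entries differing from a common
// default value.
class SparseMatrix {
public:
    using Key = std::pair<std::size_t, std::size_t>;  // (row, column)

    explicit SparseMatrix(float defaultValue = 0.0f) : default_(defaultValue) {}

    void set(std::size_t row, std::size_t col, float value) {
        const Key key(row, col);
        if (value == default_) {
            entries_.erase(key);  // the common value is never stored
        } else {
            entries_[key] = value;
        }
    }

    float get(std::size_t row, std::size_t col) const {
        const auto it = entries_.find(Key(row, col));
        return it == entries_.end() ? default_ : it->second;
    }

    std::size_t storedCount() const { return entries_.size(); }

private:
    float default_;
    std::map<Key, float> entries_;
};
```

Each stored entry costs noticeably more than a single float (key plus tree node), so this only pays off when most of the matrix really is the default value.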
Data Compression
You could also change your algorithms to use data compression. The idea here is to store the data location, value and the number of contiguous equal values (a.k.a. runs). So instead of storing 100 consecutive data points of the same value, you would store the starting position (of the run), the value, and 100 as the quantity. This saves a lot of space, but requires more processing time when accessing the data.
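The run-based storage described above might look like this for a one-dimensional sequence of floats. Run, encodeRuns and valueAt are illustrative names, and exact equality is used to detect runs, which is an assumption that may not suit noisy data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One run: starting position, the repeated value, and how often it repeats.
struct Run {
    std::size_t start;
    float value;
    std::size_t length;
};

// Collapse consecutive equal values into runs.
std::vector<Run> encodeRuns(const std::vector<float>& data) {
    std::vector<Run> runs;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (!runs.empty() && runs.back().value == data[i]) {
            ++runs.back().length;
        } else {
            runs.push_back(Run{i, data[i], 1});
        }
    }
    return runs;
}

// Look up element `index` by binary search over the run start positions.
float valueAt(const std::vector<Run>& runs, std::size_t index) {
    auto it = std::upper_bound(
        runs.begin(), runs.end(), index,
        [](std::size_t i, const Run& r) { return i < r.start; });
    assert(it != runs.begin());  // fails only if `runs` is empty
    --it;                        // last run starting at or before `index`
    assert(index < it->start + it->length);
    return it->value;
}
```

The binary search in valueAt is the extra processing time mentioned above: random access goes from O(1) to O(log number of runs).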
Memory Mapped File
There are libraries that can treat a file as memory. Essentially, they read in a "page" of the file into memory. When the requests go out of the "page", they read in another page. All this is performed "behind the scenes". All you need to do is treat the file like memory.
Summary
Arrays and deques are not your primary issue, quantity of data is. Your primary issue can be resolved by processing small pieces of data at a time, compressing the data storage, or treating the data in the file as memory. If you are trying to process streaming data, don't. Ideally, streaming data should be placed into a file and then processed later.
A historical purpose of a file is to contain data that doesn't fit into memory.
I would like to know what the best practice for efficiently storing (and subsequently accessing) sets of multi-dimensional data arrays with variable length. The focus is on performance, but I also need to be able to handle changing the length of an individual data set during runtime without too much overhead.
Note: I know this is a somewhat lengthy question, but I have looked around quite a lot and could not find a solution or example which describes the problem at hand with sufficient accuracy.
Background
The context is a computational fluid dynamics (CFD) code that is based on the discontinuous Galerkin spectral element method (DGSEM) (cf. Kopriva (2009), Implementing Spectral Methods for Partial Differential Equations). For the sake of simplicity, let us assume a 2D data layout (it is in fact in three dimensions, but the extension from 2D to 3D should be straightforward).
I have a grid that consists of K square elements k (k = 0,...,K-1) that can be of different (physical) sizes. Within each grid element (or "cell") k, I have N_k^2 data points. N_k is the number of data points in each dimension, and can vary between different grid cells.
At each data point n_k,i (where i = 0,...,N_k^2-1) I have to store an array of solution values, which has the same length nVars in the whole domain (i.e. everywhere), and which does not change during runtime.
Dimensions and changes
The number of grid cells K is of O(10^5) to O(10^6) and can change during runtime.
The number of data points N_k in each grid cell is between 2 and 8 and can change during runtime (and may be different for different cells).
The number of variables nVars stored at each grid point is around 5 to 10 and cannot change during runtime (it is also the same for every grid cell).
Requirements
Performance is the key issue here. I need to be able to regularly iterate in an ordered fashion over all grid points of all cells in an efficient manner (i.e. without too many cache misses). Generally, K and N_k do not change very often during the simulation, so for example a large contiguous block of memory for all cells and data points could be an option.
However, I do need to be able to refine or coarsen the grid (i.e. delete cells and create new ones, the new ones may be appended to the end) during runtime. I also need to be able to change the approximation order N_k, so the number of data points I store for each cell can change during runtime as well.
Conclusion
Any input is appreciated. If you have experience yourself, or just know a few good resources that I could look at, please let me know. However, while the solution will be crucial to the performance of the final program, it is just one of many problems, so the solution needs to be of an applied nature and not purely academic.
Should this be the wrong venue to ask this question, please let me know what a more suitable place would be.
Often, these sorts of dynamic mesh structures can be very tricky to deal with efficiently, but in block-structured adaptive mesh refinement codes (common in astrophysics, where complex geometries aren't important) or your spectral element code where you have large block sizes, it is often much less of an issue. You have so much work to do per block/element (with at least 10^5 cells x 2 points/cell in your case) that the cost of switching between blocks is comparatively minor.
Keep in mind, too, that you can't generally do too much of the hard work on each element or block until a substantial amount of that block's data is already in cache. You will already have had to flush most of block N's data out of cache before getting much work done on block N+1's anyway. (There might be some operations in your code which are exceptions to this, but those are probably not the ones where you're spending much time anyway, cache or no cache, because there's not a lot of data reuse - eg, elementwise operations on cell values). So keeping the blocks/elements beside each other isn't necessarily a huge deal; on the other hand, you definitely want the blocks/elements to be themselves contiguous.
Also notice that you can move blocks around to keep them contiguous as things get resized, but not only are all those memory copies also going to wipe your cache, but the memory copies themselves get very expensive. If your problem is filling a significant fraction of memory (and aren't we always?), say 1GB, and you have to move 20% of that around after a refinement to make things contiguous again, that's .2 GB (read + write) / ~20 GB/s ~ 20 ms compared to reloading (say) 16k cache lines at ~100ns each ~ 1.5 ms. And your cache is trashed after the shuffle anyway. This might still be worth doing if you knew that you were going to do the refinement/derefinement very seldom.
But as a practical matter, most adaptive mesh codes in astrophysical fluid dynamics (where I know the codes well enough to say) simply maintain a list of blocks and their metadata and don't worry about their contiguity. YMMV of course. My suggestion would be - before spending too much time crafting the perfect data structure - to first just test the operation on two elements, twice; the first, with the elements in order and computing on them 1-2, and the second, doing the operation in the "wrong" order, 2-1, and timing the two computations several times.
For each cell, store the offset in which to find the cell data in a contiguous array. This offset mapping is very efficient and widely used. You can reorder the cells for cache reuse in traversals. When the order or number of cells changes, create a new array and interpolate, then throw away the old arrays. This storage is much better for external analysis because operations like inner products in Krylov methods and stages in Runge-Kutta methods can be managed without reference to the mesh. It also requires minimal memory per vector (e.g. in Krylov bases and with time integration).
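A minimal 2D sketch of this offset mapping, using the notation from the question (K cells, N_k points per dimension, nVars values per point). The struct and function names are made up, and the layout inside a cell (points stored one after another, variables fastest) is an assumption for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct CellStorage {
    std::size_t nVars = 0;
    std::vector<std::size_t> order;   // N_k for each cell k
    std::vector<std::size_t> offset;  // offset[k]: start of cell k in `data`,
                                      // offset[K]: total number of values
    std::vector<double> data;         // all solution values, contiguous
};

// Build the offset table as a running sum of N_k^2 * nVars and allocate
// one contiguous array for all cells.
CellStorage buildStorage(const std::vector<std::size_t>& order,
                         std::size_t nVars) {
    CellStorage s;
    s.nVars = nVars;
    s.order = order;
    s.offset.assign(order.size() + 1, 0);
    for (std::size_t k = 0; k < order.size(); ++k) {
        s.offset[k + 1] = s.offset[k] + order[k] * order[k] * nVars;
    }
    s.data.assign(s.offset.back(), 0.0);
    return s;
}

// Variable `var` at data point `i` (0 <= i < N_k^2) of cell `k`.
double& valueAt(CellStorage& s, std::size_t k, std::size_t i,
                std::size_t var) {
    assert(k < s.order.size());
    assert(i < s.order[k] * s.order[k] && var < s.nVars);
    return s.data[s.offset[k] + i * s.nVars + var];
}
```

When cells are refined, coarsened or change their order N_k, you would call buildStorage with the new orders, interpolate from the old data array into the new one, and discard the old storage, as described above. Solver operations such as inner products can then run over data directly without touching the mesh.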