Fast memory allocation for real-time data acquisition - C++

I have a range of sensors connected to a PC that measure various physical parameters, like force, rotational speed and temperature. These sensors continuously produce samples at some sample rate. A sample consists of a timestamp and the measured value itself; the sample rates are in the single-digit kilohertz range (i.e., somewhere between 1,000 and 9,000 samples per second).
The PC is supposed to read and store these samples during a given period of time. Afterwards the collected data is further treated and evaluated.
What would be a sensible way to buffer the samples? In a realistic setup, the acquisition could easily gather a couple of megabytes per second. Paging could also become critical if memory is allocated quickly but triggers swapping on the first write.
I can think of a threaded approach where a separate thread allocates and manages a pool of (locked, hence non-swappable) memory chunks. Provided enough of these chunks are always pre-allocated, further allocation would only block this memory pool's thread (in case other processes' pages have to be swapped out first), and the acquisition could proceed without interruption.
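To make that idea concrete, here is a minimal sketch of such a pool, assuming POSIX mlock(); all names, sizes and thresholds are illustrative, and shutdown/error handling is omitted:

    #include <sys/mman.h>            // mlock (POSIX)
    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <mutex>
    #include <thread>

    constexpr std::size_t kChunkSize = 1 << 20;  // 1 MiB per chunk (illustrative)
    constexpr std::size_t kLowWater  = 8;        // refill threshold (illustrative)

    class ChunkPool {
    public:
        explicit ChunkPool(std::size_t chunks) {
            for (std::size_t i = 0; i < chunks; ++i)
                free_.push_back(allocate_locked());
            refiller_ = std::thread([this] { refill_loop(); });
        }

        // Called from the acquisition thread: never allocates, only pops.
        char* acquire() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !free_.empty(); });
            char* c = free_.front();
            free_.pop_front();
            if (free_.size() < kLowWater) low_.notify_one();
            return c;
        }

    private:
        static char* allocate_locked() {
            char* p = new char[kChunkSize];
            mlock(p, kChunkSize);  // pin the pages so they cannot be swapped out
            return p;
        }

        void refill_loop() {
            for (;;) {
                {
                    std::unique_lock<std::mutex> lk(m_);
                    low_.wait(lk, [this] { return free_.size() < kLowWater; });
                }
                // May block on paging, but only this thread is affected,
                // not the acquisition thread.
                char* c = allocate_locked();
                std::lock_guard<std::mutex> lk(m_);
                free_.push_back(c);
                cv_.notify_one();
            }
        }

        std::mutex m_;
        std::condition_variable cv_, low_;
        std::deque<char*> free_;
        std::thread refiller_;  // joining/cleanup omitted for brevity
    };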
This basically is a conceptual question. Yet, to be more specific:
It should only rely on portable features, like POSIX. Features from Qt's universe are fine, too.
The sensors can be interfaced in various ways. IP is one possibility, but usually the sensors are directly connected to the PC via local links (RS232, USB, extension cards and such) that are fast enough.
The timestamps are mostly applied by the acquisition hardware itself, if it is capable of doing so, to avoid jitter over the network etc.
Thinking it over
Should I really worry? Apparently the problem splits into three scenarios:
Only a little data is collected at all. It can easily be buffered in one large pre-allocated buffer.
Data is collected slowly. Allocating the buffers on the fly is perfectly fine.
Data is acquired so fast, at high sample rates, that allocation is not the problem: the buffer will eventually overflow anyway. The problem is rather how to transfer the data from the memory buffer to permanent storage fast enough.

The idea for solving this type of problem can be as follows:
Separate the problem into two or more processes, depending on what you need to do with your data:
Acquirer
Analyzer (if you want to process data in real time)
Writer
Store data in a circular buffer in shared memory (I recommend using boost::interprocess).
The Acquirer will continuously read data from the device and store it in shared memory. In the meantime, once enough data has been read to do any analysis, the Analyzer will start processing it. It can store its results in another shared-memory circular buffer if needed. Also in the meantime, the Writer will read the data from shared memory (raw or already processed) and store it in the output file.
You need to make sure all the processes are synchronized properly, so that they do their jobs simultaneously and you don't lose data (i.e., data is not overwritten before it is processed or saved to the output file).
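A minimal sketch of such a shared-memory circular buffer using boost::interprocess; the Sample layout, the capacity, and names like "acq_shm" are illustrative, and error handling is omitted:

    #include <boost/interprocess/managed_shared_memory.hpp>
    #include <boost/interprocess/sync/interprocess_condition.hpp>
    #include <boost/interprocess/sync/interprocess_mutex.hpp>
    #include <boost/interprocess/sync/scoped_lock.hpp>
    #include <cstddef>
    #include <cstdint>

    namespace bip = boost::interprocess;

    struct Sample { std::uint64_t timestamp; double value; };

    struct SampleRing {
        static constexpr std::size_t kCapacity = 4096;  // illustrative
        bip::interprocess_mutex mutex;
        bip::interprocess_condition not_empty, not_full;
        std::size_t head = 0, tail = 0, count = 0;
        Sample data[kCapacity];

        void push(const Sample& s) {   // called by the Acquirer
            bip::scoped_lock<bip::interprocess_mutex> lock(mutex);
            while (count == kCapacity) not_full.wait(lock);
            data[tail] = s;
            tail = (tail + 1) % kCapacity;
            ++count;
            not_empty.notify_one();
        }

        Sample pop() {                 // called by the Analyzer or Writer
            bip::scoped_lock<bip::interprocess_mutex> lock(mutex);
            while (count == 0) not_empty.wait(lock);
            Sample s = data[head];
            head = (head + 1) % kCapacity;
            --count;
            not_full.notify_one();
            return s;
        }
    };

    int main() {
        // Acquirer side: create the segment; the other processes would
        // attach to the same names with open_only.
        bip::managed_shared_memory shm(bip::open_or_create, "acq_shm", 1 << 20);
        SampleRing* ring = shm.find_or_construct<SampleRing>("ring")();
        ring->push(Sample{0 /*timestamp*/, 1.23 /*value*/});
    }

Note that this simple version blocks the Acquirer when the ring is full; for acquisition you may prefer to drop or overwrite the oldest data instead.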

OpenCL: parallel write from host to device buffer?

I have a cl_mem buffer that's quite large (100 million floats). I'm trying to decrease the amount of time it takes to fill it with data from the host (I have to pass data from host to device many times, and currently I re-initialize the buffer each time).
Instead of initializing with clCreateBuffer/CL_MEM_COPY_HOST_PTR over and over, it seems it would be more efficient to initialize the buffer once, and then update its data with a multi-threaded approach each subsequent time (so multiple CPU threads each update subsets of the data simultaneously).
Is such an approach possible? I've looked into clEnqueueWriteBuffer, and while it allows a subset of a buffer to be updated, it seems like multiple calls to it would still be executed sequentially by the command queue. Do I need multiple command queues? Is this approach even possible?
It's not entirely clear from your question whether your initialisation/update would be the same every time, or whether the whole buffer needs updating between runs. Obviously the easiest way to speed things up is to remove any duplication of effort and not copy the same data multiple times.
Do your measurements suggest that you are not limited by the interface between your CPU and device? Because if you need to copy N MB every time, your device is connected to CPU/system memory by a B MB/s interface, and your copying time is not wildly more than N/B seconds, no amount of multithreading is going to help you.
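To put illustrative numbers on that: 100 million floats is 400 MB; over a PCIe 3.0 x16 link with roughly 12 GB/s of practical bandwidth, the copy takes at least 400 / 12000 ≈ 0.033 s, i.e. about 33 ms. If your measured copy time is already in that neighbourhood, host-side multithreading cannot reduce it.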
If you are limited by the sequential nature of some CPU calculation and the subsequent copy to the buffer, you could use the asynchronous variant of clEnqueueWriteBuffer() to start copying the first chunk of data while calculating the next, and so on. Note that clEnqueueWriteBuffer()/CL_MEM_COPY_HOST_PTR typically makes use of the device's DMA engines, which usually don't require much intervention from the host CPUs, and so can run entirely in parallel with calculations. (Host memory bandwidth is of course shared, as always.)
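A sketch of that overlap, assuming a double-buffered upload; produce_chunk() is a hypothetical stand-in for your CPU-side preparation, and error handling is omitted:

    #include <CL/cl.h>
    #include <algorithm>
    #include <vector>

    void produce_chunk(char* dst, size_t offset, size_t n);  // hypothetical

    void upload_in_chunks(cl_command_queue queue, cl_mem device_buf,
                          size_t total_bytes, size_t chunk_bytes) {
        std::vector<char> staging[2] = {std::vector<char>(chunk_bytes),
                                        std::vector<char>(chunk_bytes)};
        cl_event pending[2] = {nullptr, nullptr};

        size_t i = 0;
        for (size_t off = 0; off < total_bytes; off += chunk_bytes, ++i) {
            int slot = i % 2;
            // Don't overwrite a staging buffer while a previous transfer
            // from it may still be in flight.
            if (pending[slot]) {
                clWaitForEvents(1, &pending[slot]);
                clReleaseEvent(pending[slot]);
            }
            size_t n = std::min(chunk_bytes, total_bytes - off);
            produce_chunk(staging[slot].data(), off, n);

            // CL_FALSE = non-blocking: the call returns immediately and the
            // DMA transfer overlaps with producing the next chunk.
            clEnqueueWriteBuffer(queue, device_buf, CL_FALSE, off, n,
                                 staging[slot].data(), 0, nullptr,
                                 &pending[slot]);
        }
        clFinish(queue);
        for (cl_event e : pending)
            if (e) clReleaseEvent(e);
    }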
If that is too cumbersome for your purposes, it may be useful to use clEnqueueMapBuffer to map the buffer into the host application's address space. This allows any number of threads to access arbitrary areas of it simultaneously. Be aware, however, that this is no silver bullet: unless your OpenCL implementation explicitly specifies how mapping is implemented in practice, you may actually make things worse, because it might end up copying more data than before.
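For illustration, a sketch of the map/fill/unmap pattern; compute_value() is hypothetical, and error handling is omitted:

    #include <CL/cl.h>
    #include <thread>
    #include <vector>

    float compute_value(size_t i);  // hypothetical per-element producer

    void fill_via_map(cl_command_queue queue, cl_mem buf, size_t bytes,
                      unsigned nthreads) {
        cl_int err = 0;
        // Blocking map: when this returns, the host owns the region.
        float* p = static_cast<float*>(clEnqueueMapBuffer(
            queue, buf, CL_TRUE, CL_MAP_WRITE, 0, bytes,
            0, nullptr, nullptr, &err));

        const size_t n = bytes / sizeof(float);
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < nthreads; ++t)
            workers.emplace_back([=] {
                // Each thread writes a disjoint slice; no locking needed.
                for (size_t i = n * t / nthreads; i < n * (t + 1) / nthreads; ++i)
                    p[i] = compute_value(i);
            });
        for (auto& w : workers) w.join();

        // Unmapping hands the region back to the device; depending on the
        // implementation this may itself trigger a copy.
        clEnqueueUnmapMemObject(queue, buf, p, 0, nullptr, nullptr);
        clFinish(queue);
    }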
If your device kernels don't actually end up reading all of the buffer (and you just don't know in advance which parts they will need), or if they read all of it precisely once, in a nice and predictable pattern, while your host code needs to read and write a lot or write to random locations, you could try buffers created with CL_MEM_USE_HOST_PTR. This isn't zero-copy in all implementations, but the idea is to give the device direct access to host memory. You're again limited by the device's uplink interface bandwidth, and latency is typically much worse than to device memory, but if your device doesn't actually need to read all of it, this could be faster because you don't have to push the whole buffer down the pipe.
Finally, if your CPUs are somehow preprocessing/unpacking the data, you could try offloading that to the device instead.

device driver memory buffer processor cache issue

I have a device which sends image data and video frames over two different bulk channels on USB.
My workstation's processor cache is large enough to hold around 100 video frames without any issue, but not the image data.
I'm using the same buffer for image and video data; that buffer has around 50 blocks, and one block is 1 MB in size.
The video frames come in quickly, and then the image frame arrives.
My question is: is there a memory corruption issue in the following scenario? Somebody who has knowledge of processor caches could help me.
Because the video frames are small, the pages of the memory buffer that the video frames are written to stay almost entirely in the cache. Since the video data comes in as a stream, those lines are never flushed out.
But when the image data comes, a large area of the memory buffer is used, so the video memory pages get evicted: scheduled to be flushed, but not yet written to physical memory.
Now the image data is written to memory; I used volatile there.
Will that image data be corrupted when the stale video cache lines are flushed after the image data write?
Can this happen?
So I applied volatile to the video data write too, and the issue seems to have disappeared. But I need to write a report, so: is it possible for the scenario described above to happen?
The comments are the giveaway: two threads, and volatile is misused as a threading mechanism.
Two threads can run on two CPU cores. While the cores usually do share memory, they usually do not share the L1 cache. Intermediate caches vary. As a result, dereferencing the same pointer on two CPU cores may give different results. This is not a problem for variables that are properly shared across threads; the compiler will use the correct instructions. But the keyword is properly shared.
Here we get into the slight problem that you've tagged your question both as C and C++, because the two languages forked before threading was standardized in either language. However, the two threading mechanisms are intentionally similar so that a compiler pair can (as an extension) define how C threading and C++ threading interact. You'll need to consult your documentation for that.
It may be easier to wrap the libusb thread in your own code, so that you receive the data without threading issues, and then dispatch from your code to other threads that are also under your control.
Back to the memory corruption you're seeing: what you probably see is that one thread is writing out its view of memory, which turns out to be stale data in its cache. Had you used something like a mutex, this stale data would have been noted and caches synchronized.
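As a minimal illustration of "properly shared" (names are illustrative; process() is a hypothetical consumer), a mutex plus condition variable both serializes access and provides the memory-ordering guarantees that volatile does not:

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <vector>

    void process(const char* data, std::size_t n);  // hypothetical

    std::mutex m;
    std::condition_variable cv;
    std::vector<char> frame;   // the shared buffer
    bool frame_ready = false;

    // Producer side, e.g. called from the USB receive path:
    void on_data(const char* data, std::size_t n) {
        {
            std::lock_guard<std::mutex> lk(m);
            frame.assign(data, data + n);
            frame_ready = true;
        }
        cv.notify_one();
    }

    // Consumer side, running in another thread:
    void consume_one() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return frame_ready; });
        // Everything the producer wrote before notify_one() is visible here.
        process(frame.data(), frame.size());
        frame_ready = false;
    }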

How to write data into a buffer and write the buffer into a binary file with a second thread?

I am getting data from a sensor (a camera) and writing the data into a binary file. The problem is that it takes a lot of space on the disk.
So I used the compression from boost (zlib), and the space usage dropped a lot! The problem is that the compression process is slow, and lots of data goes missing.
So I want to implement two threads: one gets the data from the camera and writes it into a buffer, and the second takes the data from the front of the buffer and writes it into the binary file. That way, all the data will be present.
How do I implement this buffer? It needs to grow dynamically and support pop_front. Shall I use std::deque, or does something better already exist?
First, you have to consider these four rates (or speeds):
Speed of Production (SP): The average number of bytes your sensor produces per second.
Speed of Compression (SC): The average number of bytes per second you can compress. This is the number of input bytes to the compression algorithm.
Rate of Compression (RC): The average ratio of compressed data to uncompressed data your compress algorithm produces (ratio of size of output to the input of compression.) (This is obviously somewhere between 0 and 1.)
Speed of Writing (SW): The average number of bytes you can write to disk, per second.
If SC is less than SP, you are in trouble. It means you can't compress all the data you gather from your sensor, in real time. Which means you'll eventually run out of buffer memory. You'll have to find a faster compression algorithm, or dedicate more CPU cores to compression.
If SW is less than SP times RC (which is the size of sensor data after compression,) you are again in trouble. It means you can't write out your output data as fast as you are producing and compressing them, and again, you will eventually run out of buffer memory, no matter how much you have. You might be able to gain some speed by adopting a better write strategy or file system, but a real gain in SW comes from a better disk system (RAID, SSD, better hardware, etc.)
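To make that concrete with illustrative numbers: a camera producing SP = 50 MB/s through a compressor achieving RC = 0.3 needs a sustained SW of at least 50 × 0.3 = 15 MB/s; and if a single compression thread only manages SC = 30 MB/s, you are already in the first kind of trouble (SC < SP) no matter how fast the disk is.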
Now, if everything is OK speed-wise, you can probably employ something like the following architecture to read, compress and write the data out:
You'll have three threads (or two, described later) that do one part of the pipeline each. You'll also have two thread-safe queues, one for communication from each stage of the pipeline to the next.
Assuming the two queues are named Q1 and Q2, the high-level operation of the threads will look like this:
Input Thread:
Read K bytes of sensor data
Put the whole K bytes as a unit on Q1.
Go to 1.
Compression Thread:
Wait till there is something on Q1.
Pop one buffer of data (probably K bytes) from Q1.
Compress the buffer into a hopefully smaller buffer and put it on Q2.
Go to 1.
Output Thread:
Wait till there is something on Q2.
Pop one buffer of data from Q2.
Write the buffer to the output file.
Go to 1.
The most CPU-intensive part of the work is in the second thread; the other two probably don't consume much CPU time and can therefore share a CPU core. This means the above strategy may be runnable on two cores. But it can also run on a single core if the workload is light, or require many cores. That all depends on the four rates I described above.
Using asynchronous writes (e.g. IOCP on Windows, or POSIX AIO/io_uring on Linux), you can drop the third thread and the second queue altogether. Then your second thread needs to execute something like this:
Wait till there is something on Q1.
Pop one buffer of data (probably K bytes) from Q1.
Compress the buffer into a hopefully smaller buffer.
Issue an asynchronous write request to the OS to write out the compressed buffer to disk.
Go to 1.
There are four more issues worth mentioning:
K should be selected so that the time required for the various (usually constant-time) activities associated with allocating a buffer, pushing it into and popping it from a thread-safe queue, starting a compression run, and issuing a write request to a file becomes negligible relative to doing the actual work (reading sensor data, compressing bytes and writing to disk). This usually means that K should be as large as possible. But if K is very large (many megabytes or hundreds of megabytes) and your application crashes, you'll lose a lot of data. You need to find a balance between performance and risk of data loss. I suggest (without any knowledge of your specific needs and constraints) a value between 10 KiB and 1 MiB for K.
Implementing a thread-safe queue is easy if you have some knowledge and experience with concurrent/parallel programming, but rather hard and error-prone if you don't. Finding good examples and implementations should not be hard. A plain std::deque or std::list (or std::anything) won't be usable by itself, but can be used as a good basis for writing a thread-safe queue; see the sketch after this list.
Note that you are queuing buffers of data, not individual numbers or bytes. If you pass your data one number at a time through this pipeline, it will be painfully slow and wasteful.
Some compression algorithms are limited in how much data they can consume in each invocation, or require that each call to the compression routine be matched with one call to the decompression routine later on. These constraints might affect the choice of K, and also how you write your output file. You might have to add some metadata so that you are actually able to decompress and read the data later.
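Here is a minimal sketch of such a thread-safe queue built on std::deque, matching the pipeline above; the Buffer type and all names are illustrative, and read_sensor/compress/write_file are hypothetical:

    #include <condition_variable>
    #include <deque>
    #include <mutex>
    #include <vector>

    using Buffer = std::vector<char>;   // one K-byte unit of data

    class SafeQueue {
    public:
        void push(Buffer buf) {
            {
                std::lock_guard<std::mutex> lk(m_);
                q_.push_back(std::move(buf));
            }
            cv_.notify_one();
        }

        Buffer pop() {                  // blocks until an item is available
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });
            Buffer buf = std::move(q_.front());
            q_.pop_front();
            return buf;
        }

    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::deque<Buffer> q_;
    };

With two instances q1 and q2, the input thread runs q1.push(read_sensor(K)), the compression thread runs q2.push(compress(q1.pop())), and the output thread runs write_file(q2.pop()).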

What is IO Stream Buffering?

I am unable to find an explanation of the underlying concept of IO Stream Buffering and what it means.
Any tutorials and links will be helpful.
Buffering is a fundamental part of software that handles input and output. The buffer holds data that is in between the software interface and the hardware interface, since hardware and software run at different speeds.
A component which produces data can put it into a buffer, and later the buffer is "flushed" by sending the collected data to the next component. Likewise the other component may be "waiting on the buffer" until a complete piece of data, or enough data to be efficiently processed, is available for input.
In C++, std::basic_filebuf implements a buffer over a filesystem file. It stores up to a fixed number of bytes so the operating system always works with a minimum transaction size, while the program can access individual characters if desired.
See Wikipedia.
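As a small C++ illustration: the writes below accumulate in the stream's buffer and reach the OS only when the buffer fills, the stream is flushed, or it is closed. The buffer size is illustrative, and whether pubsetbuf() honors it is implementation-defined:

    #include <fstream>

    int main() {
        char buf[64 * 1024];                      // illustrative buffer size
        std::ofstream out;
        out.rdbuf()->pubsetbuf(buf, sizeof buf);  // must precede open();
                                                  // effect is implementation-defined
        out.open("log.txt");

        for (int i = 0; i < 1000; ++i)
            out << "sample " << i << '\n';        // buffered: not one syscall per line

        out.flush();                              // hand the buffered data to the OS
    }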
Buffering is using memory (the user's memory) instead of sending the data straight to the OS (i.e. to disk). It saves on context switches.
Here's the concept. Imagine you have an application that needs to write its data onto the hard drive. Let's say it wants to write something (e.g. update a log file) every half second. Is this good? No, and here is the reason.
Software can be very fast, but the speed at which the HDD can operate is limited, and it's much slower than memory and your application. To write something, the HDD needs to reposition its magnetic heads to a specific sector, write the data, and reposition back to where it was. So your application could end up operating very slowly (that's a theoretical example, of course).
Buffering helps deal with this. Instead of writing to the disk each time, the data is accumulated in a buffer somewhere in memory. Once a sufficient amount of data is gathered, the buffer is flushed: the data in it gets written to the disk. Such an approach helps minimize HDD operations and improves overall speed.