Search block in network stream as quick as possible

I would like to ask how to search for a pattern in a network stream.
My current method is to allocate a big cache and copy data from the socket into it; when the amount of data exceeds a threshold, I search the cache for all sync headers (using the KMP algorithm). It works, but it feels cumbersome.
The header is a very simple flag such as "0xFFEEBBAA1290".
Is there a trick to check for the header as quickly as possible in real time, without accumulating data first? That is, while receiving data, detect that a complete data block has arrived as soon as it does.
The data arrives continuously, with no gap to indicate where one data block ends and the next begins.
I used a circular buffer and looked for the first header and the next header to delimit a whole block, but the numerous modulo operations (for the circular buffer index) slow things down drastically. I just used memcmp to find the header.
FYI, my preferred language is C/C++.
I hope to get your advice. Any reference links are also welcome.
Thank you.
Please allow me to add some details about this problem.
The data comes from a board that is not under my control.
The device starts sending from an arbitrary position in its source and does not follow any rule such as beginning with a packet header once the connection is established. Even worse, the block length is not fixed, so I must delimit each block by finding two headers.
In my approach, I first look for the first header; until it appears, I drop each incoming byte.
This way I can at least guarantee that the first header is at the beginning of the cache (the cache is much smaller than in the KMP approach because I don't want to search for headers with a delay). I then continue to receive data and check for the next header at the same time.
Once a block is found, its data is handed to another process, and the second header moves to the front of the cache.
This means the cache has to be re-aligned to accept the next data, which is why I used a circular buffer (storing the data in an array): I just set the read and write positions instead of actually moving the remaining data in the cache.
I tried list and vector but did not use them because of the byte-chunk operations and performance considerations.
The problem is that I have to check for the next header continuously while data is arriving.
Is there an elegant way to avoid such frequent byte scans?
Alternatively, if the speed is reasonable, I can accept the frequent byte scans, but the modulo operation for calculating the read and write positions in the circular buffer seems to hurt performance.
I used several profiling tools, and all of them indicate that the frequent modulo is the performance bottleneck.

"as quick as possible in realtime" is already a contradiction. Realtime means as fast as the data arrives; no need to be faster than that. In fact, realtime often is slower than batch processing.
Realtime also requires hard figures on time available and time taken, neither of which is available here.
Your header appears to be < 8 bytes, which fits within a single cache line. KMP or similar algorithms are unlikely to be needed. Checking all bytes in a cache line for 0xFF is almost certainly faster than checking a single byte against 0xFF, 0xEE, 0xBB, 0xAA, 0x12 or 0x90.
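A minimal sketch of that idea, assuming the six header bytes from the question's example and a contiguous (not wrapped) receive buffer. The helper name `find_header` is hypothetical. It lets `std::memchr` scan for the first header byte and only then compares the remaining bytes with `std::memcmp`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Sync header bytes taken from the question's example flag.
static const unsigned char kHeader[] = {0xFF, 0xEE, 0xBB, 0xAA, 0x12, 0x90};

// Returns the offset of the first complete header in buf[0..len),
// or len if none is found. (Hypothetical helper, not from the original post.)
std::size_t find_header(const unsigned char* buf, std::size_t len) {
    assert(buf != nullptr || len == 0);
    const std::size_t hlen = sizeof(kHeader);
    std::size_t pos = 0;
    while (pos + hlen <= len) {
        // Only offsets where a full header still fits are candidates.
        const void* hit = std::memchr(buf + pos, kHeader[0], len - hlen + 1 - pos);
        if (hit == nullptr) break;
        pos = static_cast<std::size_t>(static_cast<const unsigned char*>(hit) - buf);
        if (std::memcmp(buf + pos, kHeader, hlen) == 0) return pos;
        ++pos;  // false alarm: resume just past this 0xFF
    }
    return len;
}
```

When the function returns len, the last five bytes of the buffer may still be the start of a header, so a caller would keep them and rescan once more data has arrived.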
Now, "numerous modulo (for circular buffer index) operations slow down the speed drastically" is a realistic problem. But that does have a straightforward solution. Make sure that the buffer size is a compile-time constant and a power of two. For unsigned x, x % (1<<N) is equal to x & ((1<<N)-1).
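As an illustration of the power-of-two trick, here is a small sketch of a byte ring whose capacity is a template parameter. The class name `ByteRing` and its interface are assumptions made for this example, not code from the question:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal byte ring whose capacity is a compile-time power of two, so
// wrapping an index is a single AND instead of a division.
template <std::size_t N>
class ByteRing {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    std::size_t size() const { return head_ - tail_; }  // unsigned wrap is intended
    bool push(std::uint8_t b) {
        if (size() == N) return false;     // full
        data_[head_++ & (N - 1)] = b;      // same slot as head_ % N
        return true;
    }
    bool pop(std::uint8_t& b) {
        if (size() == 0) return false;     // empty
        b = data_[tail_++ & (N - 1)];
        return true;
    }
    // Peek at the i-th unread byte without consuming it.
    std::uint8_t at(std::size_t i) const {
        assert(i < size());
        return data_[(tail_ + i) & (N - 1)];
    }
private:
    std::uint8_t data_[N] = {};
    std::size_t head_ = 0;  // total bytes written (free-running counter)
    std::size_t tail_ = 0;  // total bytes read (free-running counter)
};
```

The counters are never reduced; they are only masked when indexing. That keeps the full/empty test simple and works because the range of std::size_t is itself a multiple of any power-of-two N.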

Related

How to write data into a buffer and write the buffer into a binary file with a second thread?

I am getting data from a sensor (camera) and writing the data into a binary file. The problem is that it takes a lot of space on the disk.
So, I used the compression from boost (zlib) and the space was reduced a lot! The problem is that the compression process is slow and a lot of data goes missing.
So, I want to implement two threads, with one getting the data from the camera and writing the data into a buffer. The second thread will take the front data of the buffer and write it into the binary file. And in this case, all the data will be present.
How do I implement this buffer? It needs to expand dynamically and pop_front. Shall I use std::deque, or does something better already exist?

First, you have to consider these four rates (or speeds):
Speed of Production (SP): The average number of bytes your sensor produces per second.
Speed of Compression (SC): The average number of bytes per second you can compress. This is the number of input bytes to the compression algorithm.
Rate of Compression (RC): The average ratio of compressed data to uncompressed data your compression algorithm produces (ratio of size of output to the input of compression.) (This is obviously somewhere between 0 and 1.)
Speed of Writing (SW): The average number of bytes you can write to disk, per second.
If SC is less than SP, you are in trouble. It means you can't compress all the data you gather from your sensor, in real time. Which means you'll eventually run out of buffer memory. You'll have to find a faster compression algorithm, or dedicate more CPU cores to compression.
If SW is less than SP times RC (which is the size of sensor data after compression,) you are again in trouble. It means you can't write out your output data as fast as you are producing and compressing them, and again, you will eventually run out of buffer memory, no matter how much you have. You might be able to gain some speed by adopting a better write strategy or file system, but a real gain in SW comes from a better disk system (RAID, SSD, better hardware, etc.)
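The two conditions above can be written down as a quick sanity check. This is only a sketch: the struct and function names are made up for illustration, and the rates would have to come from your own measurements.

```cpp
#include <cassert>

// All speeds in bytes per second; rc is compressed size / uncompressed size.
struct Rates {
    double sp;  // Speed of Production
    double sc;  // Speed of Compression (input bytes per second)
    double rc;  // Rate of Compression
    double sw;  // Speed of Writing
};

// True when, on average, neither the compressor nor the disk falls behind.
bool pipeline_keeps_up(const Rates& r) {
    assert(r.rc > 0.0);
    return r.sc >= r.sp && r.sw >= r.sp * r.rc;
}
```

For example, with a hypothetical sensor producing 100 MB/s, a compressor that handles 150 MB/s at a ratio of 0.5, and a disk that writes 60 MB/s, both conditions hold (150 >= 100 and 60 >= 50).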
Now, if everything is OK speed-wise, you can probably employ something like the following architecture to read, compress and write the data out:
You'll have three threads (or two, described later) that do one part of the pipeline each. You'll also have two thread-safe queues, one for communication from each stage of the pipeline to the next.
Assuming the two queues are named Q1 and Q2, the high-level operation of the threads will look like this:
Input Thread:
1. Read K bytes of sensor data.
2. Put the whole K bytes as a unit on Q1.
3. Go to 1.
Compression Thread:
1. Wait till there is something on Q1.
2. Pop one buffer of data (probably K bytes) from Q1.
3. Compress the buffer into a hopefully smaller buffer and put it on Q2.
4. Go to 1.
Output Thread:
1. Wait till there is something on Q2.
2. Pop one buffer of data from Q2.
3. Write the buffer to the output file.
4. Go to 1.
The most CPU-intensive part of the work is in the second thread, and the other two probably don't consume much CPU time and therefore probably can share a CPU core. This means that the above strategy may be runnable on two cores. But it can also run on a single core if the workload is light, or require many many cores. That all depends on the four rates I described up top.
Using asynchronous writes (e.g. IOCP on Windows, or io_uring or POSIX AIO on Linux; epoll does not work on regular files,) you can drop the third thread and the second queue altogether. Then your second thread needs to execute something like this:
1. Wait till there is something on Q1.
2. Pop one buffer of data (probably K bytes) from Q1.
3. Compress the buffer into a hopefully smaller buffer.
4. Issue an asynchronous write request to the OS to write out the compressed buffer to disk.
5. Go to 1.
There are four more issues worth mentioning:
K should be selected so that the time required for various (usually constant time) activities associated with allocating a buffer, pushing it into and popping it from a thread-safe queue, starting a compression run and issuing a write request into a file becomes negligible relative to doing the actual work (reading sensor data, compressing bytes and writing to disk.) This usually means that K needs to be as large as possible. But if K is very large (many megabytes or hundreds of megabytes) then if your application crashes, you'll lose a lot of data. You need to find a balance between performance and risk of data loss. I suggest (without any knowledge of your specific needs and constraints) a value between 10 KiB and 1 MiB for K.
Implementing a thread-safe queue is easy if you have some knowledge and experience with concurrent/parallel programming, but rather hard and error-prone if you do not. Finding good examples and implementations should not be hard. A normal std::deque or std::list or std::anything won't be usable by itself, but can be used as a good basis for writing a thread-safe queue.
Note that you are queuing buffers of data, not individual numbers or bytes. If you pass your data one number at a time through this pipeline, it will be painfully slow and wasteful.
Some compression algorithms are limited in how much data they can consume in each invocation, or require that each call to the compression routine be matched by exactly one call to the decompression routine later on. These might affect the choice of K, and also how you write your output file. You might have to add some metadata so that you are able to actually decompress and read the data later.
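To make the queue part concrete, here is a rough sketch of a blocking queue of whole buffers built on std::deque, as suggested above. The names `Buffer` and `BufferQueue` are assumptions for this example, and a real pipeline would also need a way to signal shutdown, which is left out here:

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

using Buffer = std::vector<unsigned char>;  // one K-byte unit of data

// Blocking queue of whole buffers, built on std::deque plus a mutex.
class BufferQueue {
public:
    void push(Buffer buf) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(buf));
        }
        ready_.notify_one();  // wake one waiting consumer
    }
    // Blocks until a buffer is available, then removes and returns it.
    Buffer pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        assert(!items_.empty());
        Buffer buf = std::move(items_.front());
        items_.pop_front();
        return buf;
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }
private:
    mutable std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<Buffer> items_;
};
```

Q1 and Q2 would each be one such object. Buffers are moved in and out, so only the vector's internal pointer changes hands rather than K bytes being copied at every stage.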

Reading inputs in smaller, more frequent reads or one larger read

I was working on a C++ tutorial exercise that asked to count the number of words in a file. It got me thinking about the most efficient way to read the inputs. How much more efficient is it really to read the entire file at once than it is to read small chunks (line by line or character by character)?

The answer changes depending on how you're doing the I/O.
If you're using the POSIX open/read/close family, reading one byte at a time will be excruciating since each byte will cost one system call.
If you're using the C fopen/fread/fclose family or the C++ iostream library, reading one byte at a time still isn't great, but it's much better. These libraries keep an internal buffer and only call read when it runs dry. However, since you're doing something very trivial for each byte, the per-call overhead will still likely dwarf the per-byte processing you actually have to do. But measure it and see for yourself.
Another option is to simply mmap the entire file and just do your logic on that. You might, or might not, notice a performance difference between mmap with and without the MAP_POPULATE flag. Again, you'll have to measure it and see.
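For the word-count exercise in the question, a chunked read through the C++ stream library might look like the sketch below. The function names and the 64 KiB default chunk size are arbitrary choices for illustration, not measured recommendations:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Counts whitespace-separated words, pulling data in chunk_size-byte reads.
std::size_t count_words(std::istream& in, std::size_t chunk_size = 64 * 1024) {
    assert(chunk_size > 0);
    std::vector<char> chunk(chunk_size);
    std::size_t words = 0;
    bool in_word = false;  // carried across chunk boundaries
    while (in) {
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        const std::streamsize got = in.gcount();  // may be short on the last read
        for (std::streamsize i = 0; i < got; ++i) {
            const bool space = std::isspace(static_cast<unsigned char>(chunk[i])) != 0;
            if (!space && !in_word) ++words;  // a new word starts here
            in_word = !space;
        }
    }
    return words;
}

// Convenience overload for in-memory text.
std::size_t count_words_in(const std::string& text, std::size_t chunk_size = 64 * 1024) {
    std::istringstream in(text);
    return count_words(in, chunk_size);
}
```

Passing a std::ifstream opened on the file works the same way. Timing it with different chunk sizes is an easy way to do the measuring suggested above.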

The most efficient method for I/O is to keep the data flowing.
That said, reading one block of 512 characters is faster than 512 reads of 1 character. Your system may have made optimizations, such as caches, to make reading faster, but you still have the overhead of all those function calls.
There are different methods to keep the I/O flowing:
Memory mapped file I/O
Double buffering
Platform Specific API
Some simple experiments should suffice for demonstration.
Create a vector or array of 1 megabyte.
Start a timer.
Repeat 1000 times:
Read data into container using 1 read instruction.
End the timer.
Repeat, using a for loop, reading 1,000,000 characters, with 1 read instruction each.
Compare your data.
Details
For each request from the hard drive, the following steps are performed (depending on platform optimizations):
Start hard drive spinning.
Read filesystem directory.
Search directory for the filename.
Get logical position of the byte requested.
Seek to the given track & sector.
Read 1 or more sectors of data into hard drive memory.
Return the requested portion of hard drive memory to the platform.
Spin down the hard drive.
This is called overhead (except where it reads the sectors).
The objective is to transfer as much data as possible while the hard drive is spinning. Starting a hard drive takes more time than keeping it spinning.

Packet CRC Computation Approach

I am writing a class that is reading in incoming packets of serial data. The packets are laid out with a header, some data, and are followed by a two byte CRC.
I also have written a class where I can build up packets to send. This class has GenerateCRC() method which allows the caller to compute a CRC for a packet which they have built up via calls to other methods. The GenerateCRC() call is only meant to be called once the packet header and data have been set up properly. As a result, this method iterates over the packet in a for loop and computes the CRC this way.
Now that I'm writing code to read in the packets, I need to verify them by computing a CRC. I'm trying to reuse the previous "builder" class as much as possible given that as I'm reading in the packet, I want to store it in memory and the best way to do so is to use the "builder" class. However, I have hit a snag with computation of the CRC.
There are two main approaches that I'm considering and I am having trouble weighing the pros and cons and deciding on an approach. Here are my two choices:
Compute the CRC as I read in the bytes. The data that I'm reading in is pushed onto a queue, so I pop off the bytes one at a time. I would keep a running "total" CRC and be finished with the computation as soon as the last data byte is read in.
Compute the CRC only once I have read in the full packet. In this case, I don't have to keep a running total, but I would have to iterate over the packet again. I should note that this would allow me to reuse my previously written code.
Currently I am leaning towards option 1 and moving any common functionality between the "builder" and the "reader" to a separate header file. However, I want to make sure that the first option is in fact the better one in terms of performance since it does make my code a bit more jumbled.
Thanks in advance for the help.

I would pick Door #2. That allows simpler validation of the code by using identical code on both ends, and also permits faster CRC algorithms to be used that process four or eight bytes at a time.
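A sketch of what one shared routine for the "builder" and the "reader" could look like. The question does not say which CRC the protocol uses, so CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF) is assumed here purely for illustration, and the simple bit-by-bit loop stands in for the faster multi-byte algorithms mentioned above:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// CRC-16/CCITT-FALSE parameters (an assumption; the post does not name its CRC).
constexpr std::uint16_t kCrcInit = 0xFFFF;

// Feeds len bytes into a running CRC and returns the updated value.
std::uint16_t crc16_update(std::uint16_t crc, const std::uint8_t* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    for (std::size_t i = 0; i < len; ++i) {
        crc = static_cast<std::uint16_t>(crc ^ (data[i] << 8));
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 0x8000)
                      ? static_cast<std::uint16_t>((crc << 1) ^ 0x1021)
                      : static_cast<std::uint16_t>(crc << 1);
        }
    }
    return crc;
}

// Whole-packet form: one call over the complete header + data buffer.
std::uint16_t crc16(const std::uint8_t* data, std::size_t len) {
    return crc16_update(kCrcInit, data, len);
}
```

Both sides would call crc16() on the finished packet. Because the update step takes the running value as a parameter, the same function could also be fed one byte at a time if option 1 were ever needed, and it would give the same result.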

Processing instrument capture data

I have an instrument that produces a stream of data; my code accesses this data through a callback onDataAcquisitionEvent(const InstrumentOutput &data). The data processing algorithm is potentially much slower than the rate of data arrival, so I cannot hope to process every single piece of data (and I don't have to), but would like to process as many as possible. Think of the instrument as an environmental sensor with a rate of data acquisition that I don't control. InstrumentOutput could for example be a class that contains three simultaneous pressure measurements in different locations.
I also need to keep some short history of data. Assume for example that I can reasonably hope to process a sample of data every 200ms or so. Most of the time I would be happy processing just a single last sample, but occasionally I would need to look at a couple of seconds worth of data that arrived prior to that latest sample, depending on whether abnormal readings are present in the last sample.
The other requirement is to get out of the onDataAcquisitionEvent() callback as soon as possible, to avoid data loss in the sensor.
Data acquisition library (third party) collects the instrument data on a separate thread.
I thought of the following design; have single producer/single consumer queue and push the data tokens into the synchronized queue in the onDataAcquisitionEvent() callback.
On the receiving end, there is a loop that pops the data from the queue. The loop will almost never sleep because of the high rate of data arrival. On each iteration, the following happens:
Pop all the available data from the queue,
The popped data is copied into a circular buffer (I used boost circular buffer), this way some history is always available,
Process the last element in the buffer (and potentially look at the prior ones),
Repeat the loop.
Questions:
Is this design sound, and what are the pitfalls? and
What could be a better design?
Edit: One problem I thought of is when the size of the circular buffer is not large enough to hold the needed history; currently I simply reallocate the circular buffer, doubling its size. I hope I would only need to do that once or twice.

I have a bit of experience with data acquisition, and I can tell you a lot of developers have problems with premature feature creep. Because it sounds easy to simply capture data from the instrument into a log, folks tend to add unessential components to the system before verifying that logging is actually robust. This is a big mistake.
"The other requirement is to get out of the onDataAcquisitionEvent() callback as soon as possible, to avoid data loss in the sensor."
That's the only requirement until that part of the product is working 110% under all field conditions.
"Most of the time I would be happy processing just a single last sample, but occasionally I would need to look at a couple of seconds worth of data that arrived prior to that latest sample, depending on whether abnormal readings are present in the last sample."
"Most of the time" doesn't matter. Code for the worst case, because onDataAcquisitionEvent() can't be spending its time thinking about contingencies.
It sounds like you're falling into the pitfall of designing it to work with the best data that might be available, and leaving open what might happen if it's not available or if providing the best data to the monitor is ultimately too expensive.
Decimate the data at the source. Specify how many samples will be needed for the abnormal case processing, and attempt to provide that many, at a constant sample rate, plus a margin of maybe 20%.
There should certainly be no loops that never sleep. A circular buffer is fine, but just populate it with whatever minimum you need, and analyze it only as frequently as necessary.
The quality of the system is determined by its stability and determinism, not trying to go an extra mile and provide as much as possible.

Your producer/consumer design is exactly the right design. In real-time systems we often also give different run-time priorities to the consuming threads, not sure this applies in your case.
Use a data structure that's basically a doubly-linked-list, so that if it grows you don't need to re-allocate everything, and you also have O(1) access to the samples you need.
If your memory isn't large enough to hold your several seconds worth of data (which it should -- one sample every 200ms? 5 samples per second.) then you need to see whether you can stand reading from auxiliary memory, but that's throughput and in your case has nothing to do with your design and requirement for "Getting out of the callback as soon as possible".
Consider an implementation of the queue that does not need locking (remember: single reader and single writer only!), so that your callback doesn't stall.
If your callback is really quick, consider disabling interrupts/giving it a high priority. May not be necessary if it can never block and has the right priority set.
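One possible shape for such a lock-free queue is sketched below, assuming exactly one producer thread (the callback) and one consumer thread. The class name `SpscQueue` and the power-of-two capacity are choices made for this example. When the ring is full, push() reports failure instead of blocking, so the callback can drop the sample and return immediately:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Single-producer/single-consumer ring: push() only from the callback thread,
// pop() only from the processing thread. N must be a power of two.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    // Returns false instead of blocking when the ring is full.
    bool push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t used = head - tail_.load(std::memory_order_acquire);
        assert(used <= N);
        if (used == N) return false;
        slots_[head & (N - 1)] = value;
        head_.store(head + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    // Returns false when there is nothing to read.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == tail) return false;
        out = slots_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);  // free the slot
        return true;
    }
private:
    T slots_[N] = {};
    std::atomic<std::size_t> head_{0};  // written by the producer only
    std::atomic<std::size_t> tail_{0};  // written by the consumer only
};
```

The release store on each counter pairs with the acquire load on the other side, so a slot's contents are visible before the index that announces them. With more than one producer or consumer this scheme is no longer safe.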

Questions, (1) is this design sound, and what are the pitfalls, and (2) what could be a better design. Thanks.
Yes, it is sound. But for performance reasons, you should design the code so that it processes an array of input samples at each processing stage, instead of just a single sample each. This results in much more optimal code for current state of the art CPUs.
The length of such an array (= a chunk of data) is either fixed (simpler code) or variable (flexible, but some processing may become more complicated).
As a second design choice, you probably should ignore the history at this architectural level, and relegate that feature...
"Most of the time I would be happy processing just a single last sample, but occasionally I would need to look at a couple of seconds worth of data [...]"
Maybe tracking a history should be implemented only in that special part of the code that occasionally requires access to it, rather than being part of the "overall architecture". If so, it simplifies the processing overall.

What is the best compression algorithm that allows random reads/writes in a file? [closed]

What is the best compression algorithm that allows random reads/writes in a file?
I know that any adaptive compression algorithms would be out of the question.
And I know Huffman encoding would be out of the question.
Does anyone have a better compression algorithm that would allow random reads/writes?
I think you could use any compression algorithm if you write it in blocks, but ideally I would not like to have to decompress a whole block at a time. But if you have suggestions on an easy way to do this and how to know the block boundaries, please let me know. If this is part of your solution, please also let me know what you do when the data you want to read is across a block boundary?
In the context of your answers, please assume the file in question is 100GB, and sometimes I'll want to read the first 10 bytes, sometimes the last 19 bytes, and sometimes 17 bytes in the middle.

I am stunned at the number of responses that imply that such a thing is impossible.
Have these people never heard of "compressed file systems",
which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology?
I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.
Perhaps the simplest and best thing to do is to turn on file system compression for that file, and let the OS deal with the details.
But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression.
Also check out:
"StackOverflow: Compression formats with good support for random access within archives?"

A dictionary-based compression scheme, with each dictionary entry's code being encoded with the same size, will result in being able to begin reading at any multiple of the code size, and writes and updates are easy if the codes make no use of their context/neighbors.
If the encoding includes a way of distinguishing the start or end of codes then you do not need the codes to be the same length, and you can start reading anywhere in the middle of the file. This technique is more useful if you're reading from an unknown position in a stream.

I think Stephen Denne might be onto something here. Imagine:
zip-like compression of sequences to codes
a dictionary mapping code -> sequence
file will be like a filesystem
each write generates a new "file" (a sequence of bytes, compressed according to dictionary)
"filesystem" keeps track of which "file" belongs to which bytes (start, end)
each "file" is compressed according to dictionary
reads work filewise, uncompressing and retrieving bytes according to "filesystem"
writes make "files" invalid, new "files" are appended to replace the invalidated ones
this system will need:
defragmentation mechanism of filesystem
compacting dictionary from time to time (removing unused codes)
done properly, housekeeping could be done when nobody is looking (idle time) or by creating a new file and "switching" eventually
One positive effect would be that the dictionary would apply to the whole file. If you can spare the CPU cycles, you could periodically check for sequences overlapping "file" boundaries and then regrouping them.
This idea is for truly random reads. If you are only ever going to read fixed sized records, some parts of this idea could get easier.

I don't know of any compression algorithm that allows random reads, never mind random writes. If you need that sort of ability, your best bet would be to compress the file in chunks rather than as a whole.
For example, we'll look at the read-only case first. Let's say you break up your file into 8K chunks. You compress each chunk and store each compressed chunk sequentially. You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O. You will need to figure out which chunk it's in (O / 8K), decompress that chunk and grab those bytes. The data you need may span multiple chunks, so you have to deal with that scenario.
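The offset arithmetic for the read-only case, including reads that cross a chunk boundary, could be sketched like this. The names `ChunkSlice` and `plan_read` are hypothetical, and the actual decompression and the table of chunk positions are left out:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One piece of a read: which chunk to decompress and which bytes to copy from it.
struct ChunkSlice {
    std::uint64_t chunk;   // index into the chunk table
    std::size_t   begin;   // first byte within the decompressed chunk
    std::size_t   length;  // number of bytes to take
};

// Splits a read of n bytes at uncompressed offset o into per-chunk slices.
std::vector<ChunkSlice> plan_read(std::uint64_t o, std::uint64_t n,
                                  std::size_t chunk_size = 8 * 1024) {
    assert(chunk_size > 0);
    std::vector<ChunkSlice> plan;
    while (n > 0) {
        const std::uint64_t chunk = o / chunk_size;
        const std::size_t begin = static_cast<std::size_t>(o % chunk_size);
        const std::size_t take = static_cast<std::size_t>(
            std::min<std::uint64_t>(n, chunk_size - begin));
        plan.push_back({chunk, begin, take});
        o += take;
        n -= take;
    }
    return plan;
}
```

A 17-byte read starting at offset 8190, for instance, yields two slices: the last 2 bytes of chunk 0 and the first 15 bytes of chunk 1.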
Things get complicated when you want to be able to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller. You may need to add some extra padding to each chunk in case it expands (it's still the same size uncompressed, but different data will compress to different sizes). You may even need to move chunks if the compressed data is too big to fit back in the original space it was given.
This is basically how compressed file systems work. You might be better off turning on file system compression for your files and just read/write to them normally.

Compression is all about removing redundancy from the data. Unfortunately, it's unlikely that the redundancy is going to be distributed with monotonous evenness throughout the file, and that's about the only scenario in which you could expect compression and fine-grained random access.
However, you could get close to random access by maintaining an external list, built during the compression, which shows the correspondence between chosen points in the uncompressed datastream and their locations in the compressed datastream. You'd obviously have to choose a method where the translation scheme between the source stream and its compressed version does not vary with the location in the stream (i.e. no LZ77 or LZ78; instead you'd probably want to go for Huffman or byte-pair encoding.) Obviously this would incur a lot of overhead, and you'd have to decide on just how you wanted to trade off between the storage space needed for "bookmark points" and the processor time needed to decompress the stream starting at a bookmark point to get the data you're actually looking for on that read.
As for random-access writing... that's all but impossible. As already noted, compression is about removing redundancy from the data. If you try to replace data that could be and was compressed because it was redundant with data that does not have the same redundancy, it's simply not going to fit.
However, depending on how much random-access writing you're going to do -- you may be able to simulate it by maintaining a sparse matrix representing all data written to the file after the compression. On all reads, you'd check the matrix to see if you were reading an area that you had written to after the compression. If not, then you'd go to the compressed file for the data.
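A very rough sketch of that overlay idea follows. It uses a std::map keyed by byte offset in place of a sparse matrix, which is a simplification chosen for brevity; storing one map entry per byte is wasteful, and a real implementation would more likely track ranges. The class name `WriteOverlay` is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Sparse overlay of bytes written after compression, keyed by uncompressed offset.
class WriteOverlay {
public:
    void write(std::uint64_t offset, const std::uint8_t* data, std::size_t len) {
        assert(data != nullptr || len == 0);
        for (std::size_t i = 0; i < len; ++i) bytes_[offset + i] = data[i];
    }
    // Patches buf, which already holds len bytes decompressed from the
    // original file starting at offset, with any newer overlay bytes.
    void apply(std::uint64_t offset, std::uint8_t* buf, std::size_t len) const {
        for (auto it = bytes_.lower_bound(offset);
             it != bytes_.end() && it->first - offset < len; ++it) {
            buf[it->first - offset] = it->second;
        }
    }
private:
    std::map<std::uint64_t, std::uint8_t> bytes_;
};
```

A read would first decompress the requested range from the compressed file and then call apply() on the result, so later writes always win over the original data.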
