Accumulating backlog on std::cin causes it to lose data - c++

I have an external binary that produces an infinite stream of binary data (~10 MiB/s) to stdout, and I am trying to write a C++ program to consume it. My consumer program has a somewhat unusual pattern: it first reads a small amount of data from the stream (~32KiB), processes that for about one minute, and then starts consuming data again at a much faster rate (> 50 MiB/s).
I am invoking the program as follows: $ my_producer | my_consumer.
Because the consumer is initially much slower than the producer, I expect the pipe to build a backlog of about 60 s * 10 MiB/s = 600 MiB. However, after the initial delay of one minute, my consumer starts consuming data at only ~10 MiB/s, implying that the data produced during that one-minute interval was lost; why?
The relevant code for the consumer is something like this:
std::vector<char> StreamSource::Read(std::size_t size) {
    auto data = std::vector<char>(size);
    stream_.read(data.data(), size);
    data.resize(stream_.gcount());
    assert(stream_.good() || stream_.eof());
    return data;
}
std::istream& stream_; // Initialized to std::cin
Interestingly, writing to file and then piping that file to the consumer works as expected!
$ my_producer > ~/Desktop/data
$ cat ~/Desktop/data | my_consumer
I ran a bunch of tests to make sure my producer is not to blame; the following fails because the producer detects a "short write":
$ my_producer | throttle -M1 > ~/Desktop/data
I'm looking for advice on how to explain the missing data. In case it's relevant, I am running on macOS.
Thank you!

I'm pretty sure no OS lets a pipe build up a 600 MB backlog; the write end will block when the pipe buffer is full. If you want to allow that much data to build up, you need a thread in your application which reads from cin continuously and buffers the data; your application can then read from that buffer rather than from cin.
According to https://unix.stackexchange.com/questions/11946/how-big-is-the-pipe-buffer the pipe buffer is generally a maximum of 64K.
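A minimal sketch of that buffering-thread idea, assuming an unbounded in-memory backlog is acceptable (class and member names here are invented for illustration):

#include <algorithm>
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

class BufferedStdin {
public:
    BufferedStdin() : reader_([this] { ReadLoop(); }) {}
    ~BufferedStdin() { reader_.join(); } // blocks until the producer closes its end

    // Blocks until at least one byte is buffered (or EOF), then returns up to `size` bytes.
    std::vector<char> Read(std::size_t size) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !backlog_.empty() || done_; });
        const std::size_t n = std::min(size, backlog_.size());
        std::vector<char> out(backlog_.begin(), backlog_.begin() + n);
        backlog_.erase(backlog_.begin(), backlog_.begin() + n);
        return out;
    }

private:
    void ReadLoop() {
        std::vector<char> chunk(1 << 16); // drain the ~64K pipe promptly
        while (std::cin.read(chunk.data(), chunk.size()) || std::cin.gcount() > 0) {
            std::lock_guard<std::mutex> lock(mutex_);
            backlog_.insert(backlog_.end(), chunk.begin(), chunk.begin() + std::cin.gcount());
            ready_.notify_one();
        }
        std::lock_guard<std::mutex> lock(mutex_);
        done_ = true;
        ready_.notify_all();
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<char> backlog_;
    bool done_ = false;
    std::thread reader_; // declared last so the other members exist before the loop starts
};

The existing StreamSource::Read(size) could then pull from a BufferedStdin instance instead of reading std::cin directly.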

GNURadio issues with timing

I am having trouble getting a custom block to operate at high frequency.
The block I would like to use is going to take in data from an external radio.
I am using an Ettus USRP block to stream data in from this radio, and I can display this on the QT Scope. I can set this block's sample rate to 15 MHz, and with the scope this seems to work ok.
Problem:
I have tried making a simple block with the gnuradio gr_modtool which takes in 2 floats as input and has 0 outputs. The block has private members "timer", a time_t, and "count", an int. In the "work" function, my code simply does this at the moment:
const float *in_i = (const float *) input_items[0];
const float *in_q = (const float *) input_items[1];
if (count == 0) {
    if (*in_i > 0.5) {
        timer = clock();
        count = 30000;
    }
} else {
    count--;
    if (count == 0) {
        timer = clock() - timer;
        printf("Count took %ld clicks, or %f seconds\n", (long) timer, (float) timer / CLOCKS_PER_SEC);
    }
}
// Tell runtime system how many output items we produced.
return 0;
However, when I run this code, it takes longer than the expected time.
For 30000 cycles, it takes 0.872970 seconds to complete, instead of the desired 0.002 seconds. Since the standard gnuradio block generated with gr_modtool is a sync block, and the input stream to the block is coming from the 15 MHz USRP, I would have expected this block to run at that same frequency. This is not currently the case.
Eventually my goal is to be able to store data streaming in over a period of time, and write it to file with certain formatting(A block already exists to do this, but there is some sort of bug that is preventing that block and the USRP block from working at the same time, so I am attempting to write my own.). However, unless I can keep up with the sample rate of 15 MHz, I will lose data. Since this block is fairly simple, I would have hoped it would be able to run quickly enough to keep up. However, the input stream block is able to pull data from the radio and output at 15 MHz, so I know my computer is capable of it.
How can I make this custom block operate more quickly, and keep up with the 15 MHz frequency?(Or, how can I make this sync block operate at the input stream frequency, since it currently does not)
Your block is not consuming any samples. I presume you're writing a sync_block (work function, not general_work), so your number of produced items is identical to the number of consumed items. But as your source code says:
// Tell runtime system how many output items we produced.
return 0;
In other words, your block tells GNU Radio that it didn't use any of the input GNU Radio offered, and produced no output. That means GNU Radio can't make any progress. You must return the number of items you've produced, and for sync blocks, that's the number of items you consumed – even if you're a sink with zero output streams!
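A minimal sketch of the corrected work() for a GNU Radio sync block in C++ (the class name my_block_impl and the omitted timing logic are placeholders, not the asker's actual code):

int my_block_impl::work(int noutput_items,
                        gr_vector_const_void_star &input_items,
                        gr_vector_void_star &output_items)
{
    const float *in_i = (const float *) input_items[0];
    const float *in_q = (const float *) input_items[1];
    // ... inspect the noutput_items samples available in in_i / in_q here ...
    (void) in_i;
    (void) in_q;
    // A sync block must report how many items it consumed, even with zero outputs.
    return noutput_items;
}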

Python 2.7 pySerial read() returns all characters in Python interpreter but not Python script. Also need robust way to read until buffer is empty

I am communicating with a microcontroller that automatically initializes its flash memory whenever you open its serial port. So on a serial port read, the microcontroller prints up to 10,000 bytes of data showing the addresses and their initial values. You need to leave the port open for the entirety of the print to ensure that initialization completed. I don't ever perform any writes, just reads.
I modified the pySerial buffer from 4k to 32k since I do not want any breaks between reads (subsequent reads will simply restart the init cycle). Below is a snippet of my code where I read from the microcontroller serial port. When I run this snippet from the interpreter, I can tell from print and sizeof that temp contains all 9956 bytes. However when I run the py file, I get only 296 bytes. I inserted the sleep() method after read() but this did not have any effect. I cannot tell from the microcontroller if the initialization completed.
Is there a robust way to read until the serial buffer is empty? The microcontroller image is application-specific, so I cannot always predict the required read() size or timeout.
Any ideas what I could try? I've searched other threads but haven't found anything specific to this problem.
# Create serial port instance
self.ser_port = serial.Serial()
self.ser_port.port = 0
self.ser_port.timeout = 0
.
.
.
self.ser_port.open()
time.sleep(1)
temp = self.ser_port.read(32768)
time.sleep(4)
self.ser_port.close()
A timeout of 0 tells pySerial to return immediately with whatever data is already available. If you want read() to wait to allow more data to come in, use a non-zero timeout.
The first sleep() could cause you to lose some of the data that comes in after open but before read, because the internal buffer might not be as big as the buffer you are passing to read(). But that is highly application-dependent -- if the device you are communicating with takes a while after the open() to respond, then it might not be a problem.
The second sleep() does nothing for you, because the read() is already finished at that point.
Note that the timeout value is the maximum total amount of time to wait. read() will return when it has filled its buffer, or it has been this long after it was called, whichever comes first.
Here is some example code that will return under the following circumstances:
a) 10-12 seconds with no data
b) 2-4 silent seconds after the last data returned, if at least 100 bytes were received
self.ser_port = serial.Serial()
self.ser_port.port = 0
self.ser_port.timeout = 2
.
.
.
self.ser_port.open()

timeout_count = 0
received = []
data_count = 0

while 1:
    buffer = self.ser_port.read(32768)
    if buffer:
        received.append(buffer)
        data_count += len(buffer)
        timeout_count = 0
        continue

    if data_count > 100:
        # Break if we received at least 100 bytes of data,
        # and haven't received any data for at least 2 seconds
        break

    timeout_count += 1
    if timeout_count >= 5:
        # Break if we haven't received any data for at
        # least 10 seconds, even if we never received
        # any data
        break

received = ''.join(received)  # If you need all the data
Depending on your microcontroller, you may be able to reduce these timeout values considerably, and still have a fairly robust solution for retrieving all the startup data from the microcontroller.

Parse very large CSV files with C++

My goal is to parse large CSV files with C++ in a Qt project in an OSX environment.
(When I say CSV I also mean TSV and other variants, 1 GB ~ 5 GB in size.)
It seems like a simple task, but things get complicated when file sizes get bigger. I don't want to write my own parser because of the many edge cases related to parsing CSV files.
I have found various CSV processing libraries to handle this job, but parsing a 1 GB file takes about 90 ~ 120 seconds on my machine, which is not acceptable. I am not doing anything with the data right now; I just process and discard the data for testing purposes.
cccsvparser is one of the libraries I have tried. But the only fast enough library was fast-cpp-csv-parser, which gives acceptable results: 15 secs on my machine, but it works only when the file structure is known.
Example using: fast-cpp-csv-parser
#include "csv.h"
int main(){
io::CSVReader<3> in("ram.csv");
in.read_header(io::ignore_extra_column, "vendor", "size", "speed");
std::string vendor; int size; double speed;
while(in.read_row(vendor, size, speed)){
// do stuff with the data
}
}
As you can see, I cannot load arbitrary files and I must specifically define variables to match my file structure. I'm not aware of any method that allows me to create those variables dynamically at runtime.
The other approach I have tried is to read the CSV file line by line with the fast-cpp-csv-parser LineReader class, which is really fast (about 7 secs to read the whole file), and then parse each line with the cccsvparser lib, which can process strings, but this takes about 40 seconds in total. It is an improvement compared to the first attempts, but still unacceptable.
I have seen various Stack Overflow questions related to CSV file parsing, but none of them takes large file processing into account.
Also, I spent a lot of time googling to find a solution to this problem, and I really miss the freedom that package managers like npm or pip offer when searching for out-of-the-box solutions.
I will appreciate any suggestions about how to handle this problem.
Edit:
When using #fbucek's approach, processing time was reduced to 25 seconds, which is a great improvement.
Can we optimize this even more?
I am assuming you are using only one thread.
Multithreading can speed up your process.
The best result so far is 40 sec. Let's stick to that.
I have assumed that first you read, then you process -> (about 7 secs to read the whole file):
7 sec for reading
33 sec for processing
First of all, you can divide your file into chunks, let's say 50 MB each.
That means you can start processing after reading 50 MB of the file. You do not need to wait until the whole file is read.
That's 0.35 sec for reading (now it is 0.35 + 33 seconds for processing = circa 34 sec).
When you use multithreading, you can process multiple chunks at a time. That can speed up the process theoretically up to the number of your cores. Let's say you have 4 cores.
That's 33/4 = 8.25 sec.
I think you can speed up your processing with 4 cores to about 9 s in total.
Look at QThreadPool and QRunnable or QtConcurrent
I would prefer QThreadPool
Divide the task into parts:
First try to loop over the file and divide it into chunks. Do nothing with them yet.
Then create a "ChunkProcessor" class which can process a chunk.
Make "ChunkProcessor" a subclass of QRunnable and execute your processing in the reimplemented run() function.
When you have chunks, a class which can process them, and that class is QThreadPool compatible, you can pass it into the pool.
It could look like this
loop over file {
    whenever a chunk is ready {
        ChunkProcessor *chunkprocessor = new ChunkProcessor(chunk);
        QThreadPool::globalInstance()->start(chunkprocessor);
        connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
    }
}
You can use std::shared_ptr to pass the processed data around, so that you do not need QMutex or anything else, and you avoid serialization problems with multiple threads accessing some resource.
Note: in order to use a custom type in a signal you have to register it before use
qRegisterMetaType<std::shared_ptr<ProcessedData>>("std::shared_ptr<ProcessedData>");
Edit: (based on discussion, my answer was not clear about that)
It does not matter what disk you use or how fast it is. Reading is a single-threaded operation.
This solution was suggested only because it took 7 sec to read, and again it does not matter what disk it is. 7 sec is what counts. The only purpose is to start processing as soon as possible and not to wait until reading is finished.
You can use:
QByteArray data = file.readAll();
Or you can use the principal idea: (I do not know why it takes 7 sec to read, or what is behind it)
QFile file("in.txt");
if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
return;
QByteArray* data = new QByteArray;
int count = 0;
while (!file.atEnd()) {
++count;
data->append(file.readLine());
if ( count > 10000 ) {
ChunkProcessor *chunkprocessor = new ChunkProcessor(data);
QThreadPool::globalInstance()->start(chunkprocessor);
connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
data = new QByteArray;
count = 0;
}
}
One file, one thread, read almost as fast as reading line by line "without" interruption.
What you do with the data is another problem, but it has nothing to do with I/O. It is already in memory.
So the only concern would be a 5 GB file and the amount of RAM on the machine.
It is a very simple solution: all you need is to subclass QRunnable, reimplement the run() function, emit a signal when it is finished, pass the processed data using a shared pointer, and in the main thread join that data into one structure or whatever. A simple thread-safe solution.
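A rough sketch of what such a ChunkProcessor could look like (ProcessedData stands in for whatever your parser produces, and the class has to live in a header processed by moc because of the signal):

#include <QByteArray>
#include <QObject>
#include <QRunnable>
#include <memory>

struct ProcessedData { /* placeholder for the parsed rows, counters, etc. */ };

// QRunnable is not a QObject, so inherit both in order to emit a cross-thread signal.
class ChunkProcessor : public QObject, public QRunnable
{
    Q_OBJECT
public:
    explicit ChunkProcessor(QByteArray *chunk) : chunk_(chunk) {}

    void run() override
    {
        auto result = std::make_shared<ProcessedData>();
        // ... parse the lines contained in *chunk_ into *result ...
        delete chunk_;            // this processor owns the chunk it was given
        emit finished(result);    // delivered to the main thread as a queued connection
    }

signals:
    void finished(std::shared_ptr<ProcessedData> result);

private:
    QByteArray *chunk_;
};

One design note: in the snippets above the object is connected right after being started; connecting before start() avoids the small race where the signal could fire before the connection exists.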
I would propose a multi-threaded approach with a slight variation: one thread is dedicated to reading the file in chunks of a predefined (configurable) size and keeps feeding data to a set of processing threads (more than one, based on the number of CPU cores). Let us say that the configuration looks like this:
chunk size = 50 MB
Disk Thread = 1
Process Threads = 5
Create a class for reading data from the file. This class holds a data structure which is used to communicate with the process threads. For example, this structure would contain the starting offset and ending offset of the read buffer for each process thread. For reading file data, the reader class holds two buffers, each of chunk size (50 MB in this case).
Create a process class which holds pointers (shared) to the read buffers and the offsets data structure.
Now create a driver (probably the main thread) which creates all the threads, waits for their completion, and handles the signals.
The reader thread is invoked with the reader class, reads 50 MB of data, and based on the number of threads creates an offsets data structure object. In this case t1 handles 0 - 10 MB, t2 handles 10 - 20 MB, and so on. Once ready, it notifies the processor threads. It then immediately reads the next chunk from disk and waits for the completion notifications from the process threads.
The processor threads, on notification, read data from the buffer and process it. Once done, each notifies the reader thread about completion and waits for the next chunk.
This continues until all the data is read and processed. Then the reader thread notifies the main thread about completion, which sends PROCESS_COMPLETION, upon which all threads exit; or the main thread chooses to process the next file in the queue.
Note that offsets are used here for ease of explanation; mapping offsets to line delimiters needs to be handled programmatically.
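A hypothetical shape for that per-thread offsets structure (names are illustrative only, not from the answer above):

#include <cstddef>

struct ChunkSlice {
    const char *buffer;   // one of the reader's two chunk buffers (50 MB each)
    std::size_t begin;    // starting offset assigned to this processing thread
    std::size_t end;      // ending offset, adjusted to the next line delimiter
};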
If the parser you have used is not distributed, obviously the approach is not scalable.
I would vote for a technique like the one below:
chunk up the file into a size that can be handled by a machine / time constraint
distribute the chunks to a cluster of machines (1..*) that can meet your time/space requirements
consider dealing in block sizes for a given chunk
avoid threads on the same resource (i.e. a given block) to save yourself from all thread-related issues
use threads to achieve non-competing (resource-wise) operations - such as reading on one thread and writing on a different thread to a different file
do your parsing (now for this small chunk you can invoke your parser)
do your operations
merge the results back / distribute them as they are, if you can
Now, having said that, why can't you use Hadoop-like frameworks?

Winsock2 tcp/ip - some data packets are ignored probably due to null terminator from the previous packet

I wrote a simple client-server program. Network.h is a header file which uses Winsock2.h (TCP/IP mode) to create a socket, accept/connect in blocking mode, and send/recv in non-blocking mode. I made it so that the function string TNetwork::Recv(int size) returns the string "Nothing" if it gets a WSAEWOULDBLOCK error (no data has been received yet).
Here is my main function:
int main(){
    string Ans;
    TNetwork::StartUp(); //WSA start up, etc
    cin >> Ans;
    if (Ans == "0"){ // 0 --> server
        TNetwork::SetupAsServer(); //accept connection (in blocking mode!)
        while (true){
            TNetwork::Send("\nAss" + '\0'); //without null terminator, the client may read extra bytes, causing undefined behavior (?)
            TNetwork::Send("embly" + '\0');
            cin >> Ans;
        }
    }
    else{ // others --> regard Ans as IP address. e.g. I can type "127.0.0.1"
        TNetwork::SetupAsClient(Ans);
        string Rec;
        while (true){
            Rec = TNetwork::Recv(1000);
            if (Rec != "Nothing"){
                cout << Rec;
            }
        }
    }
    system("PAUSE");
}
Supposedly, the client would print "Assembly" when connected and whenever the server enters anything into its console window. Sometimes, though, the client would only print "\nAss" to the console, without the "embly".
To my understanding, TCP/IP ensures all data is delivered and in the correct order, so I guess what happens is that both packets arrive at the same time, which happens quite often over an unstable internet connection. And due to the null terminator, the client ignores the "embly", since the Recv() function stops reading when it hits a null terminator.
So, how can I ensure that the client will always read all data packets correctly?
Yes, the network stack will send the data in the correct order and doesn't care what termination type you use. This has to do with how you're receiving and processing the data stream (note: not packets, stream). If you receive all 11 bytes and print it to the screen, the print function will stop when it reaches the zero, but the rest of the data is still there.
Note: since it's a stream, what happens if you received only 10 bytes of data from the stream? You need to scan what you receive for the zero to know if you've received a full "zero-terminated string" if that's how you want to communicate your data.
EDIT: Also, I don't think "\nAss" + '\0' is doing what you think it is. Instead of adding a 0 character to the end of the string (which already has one, by the way), it's adding 0 to your string pointer.
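A minimal sketch of the difference, since it is easy to miss:

#include <string>

const char *literal = "\nAss";                      // a string literal already ends with an implicit '\0'
const char *same    = "\nAss" + '\0';               // pointer arithmetic: literal + 0, no byte is appended
std::string framed  = std::string("\nAss") + '\0';  // actually appends a 0 byte (length becomes 5)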
As #mark points out, TCP is all about streams, not packets. TCP takes care of ensuring that data is reliably transmitted from A to B and that the data is delivered to the consumer in the order in which it was transmitted. Yes, the data is packetized on the wire, but the TCP stack on the system takes those packets and builds the stream which it makes available to you through the recv() function. The TCP stack handles out-of-order data, missing data, and duplicated data such that by the time your application sees it, the stream is a mirror copy of what the sender sent.
To properly receive TCP data, you will typically need some kind of loop that reads data from the socket when it becomes available. The way I normally do this is to have a thread that is dedicated to servicing the socket. In the thread function is a loop that reads data from the socket when it becomes available and is idle otherwise. This loop reads data into a buffer of, say, 1 KB. Once the data is received from the socket into this buffer, the buffer is copied to another thread for processing. In the thread function for the processing thread is a loop that receives the 1 KB buffers from the socket thread and adds them to the back end of a master buffer of, say, 1 MB. The processing thread then processes the messages out of this master buffer and makes them available to the application.
For a simple demo application, two threads may be overkill. The two threads I've described could certainly be combined into one, but for my application, it is more efficient to have two threads and take advantage of the multiple cores on my system. The point is, if you're going to have a front-end UI, there's not going to be a way around using at least one thread and still having the UI be responsive.
One other thing. There are two commonly-used mechanisms for protocol design. You're using one, namely, a marker (e.g., a null terminator, etc.) to signal the begin/end of a message. I don't prefer this mechanism mainly because the marker may actually need to be part of the message at some point. The other mechanism is to have a header on each message that tells, at a minimum, how long the message is. I prefer this mechanism and include in my headers a sync word and the message type as well. For example,
struct Header
{
    __int16 _sync;   // a hex pattern, e.g., 0xABCD
    __int16 _type;
    __int32 _length;
};
That's a total of 8 bytes. So when processing from the master buffer, I read the first 8 bytes, verify the sync word, and get the length. I determine if there are 'length' bytes available in the master buffer. If not, I have to wait until the socket thread provides me more data before checking again. If so, I extract 'length' bytes from the master buffer and pass that to an object created according to the specified type, which knows how to interpret that particular message. Then repeat.
As I mentioned, I use a master buffer of 1 MB or so. As messages are processed, it is important to remove them from the master buffer so there is additional space available for new data on the back end. This involves simply copying the unprocessed data, if any, to the beginning of the buffer. In cases where data comes in faster than you can process it, the master buffer may need the ability to resize itself to accommodate the additional data.
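A rough sketch of the framing loop described above, using fixed-width types in place of the MSVC-specific __int16/__int32 and a std::vector<char> as the master buffer (names and the resynchronization strategy are placeholders):

#include <cstdint>
#include <cstring>
#include <vector>

struct Header
{
    std::int16_t _sync;    // e.g., 0xABCD
    std::int16_t _type;
    std::int32_t _length;  // payload length in bytes
};

// Returns true and fills `message` (header + payload) if a complete message was
// available; otherwise leaves `master` alone so the socket thread can append more data.
bool ExtractMessage(std::vector<char> &master, std::vector<char> &message)
{
    if (master.size() < sizeof(Header))
        return false;                              // header not complete yet
    Header hdr;
    std::memcpy(&hdr, master.data(), sizeof(Header));
    if (static_cast<std::uint16_t>(hdr._sync) != 0xABCD)
        return false;                              // lost sync; real code would resynchronize here
    const std::size_t total = sizeof(Header) + static_cast<std::size_t>(hdr._length);
    if (master.size() < total)
        return false;                              // wait for more data from the socket thread
    message.assign(master.begin(), master.begin() + total);
    master.erase(master.begin(), master.begin() + total);  // compact the master buffer
    return true;
}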
I hope that's not overwhelming. Start simple and add as you go.

Reading binary files, Linux Buffer Cache

I am busy writing something to test the read speeds for disk IO on Linux.
At the moment I have something like this to read the files:
Edited to change code to this:
const int segsize = 1048576;
char buffer[segsize];
ifstream file;
file.open(sFile.c_str());
while(file.readsome(buffer,segsize)) {}
For foo.dat, which is 150GB, the first time I read it in, it takes around 2 minutes.
However if I run it within 60 seconds of the first run, it will then take around 3 seconds to run. How is that possible? Surely the only place that could be read from that fast is the buffer cache in RAM, and the file is too big to fit in RAM.
The machine has 50 GB of RAM, and the drive is an NFS mount with all the default settings. Please let me know where I could look to confirm that this file is actually being read at this speed. Is my code wrong? It appears to take a correct amount of time the first time the file is read.
Edited to Add:
Found out that my files were only being read up to a random point. I've managed to fix this by changing segsize down to 1024 from 1048576. I have no idea why changing this allows the ifstream to read the whole file instead of stopping at a random point.
Thanks for the answers.
On Linux, you can do this for a quick throughput test:
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.863904 s, 243 MB/s
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.0748273 s, 2.8 GB/s
$ sync && echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.919688 s, 228 MB/s
echo 3 > /proc/sys/vm/drop_caches will flush the cache properly
in_avail doesn't give the length of the file, but a lower bound on what is available (especially if the buffer has already been used, it returns the size available in the buffer). Its goal is to know what can be read without blocking.
unsigned int is most probably unable to hold a length of more than 4GB, so what is read can very well be in the cache.
C++0x Stream Positioning may be interesting to you if you are using large files
in_avail returns a lower bound on how much is available to read in the stream's read buffer, not the size of the file. To read the whole file via the stream, just keep calling the stream's readsome() method and checking how much was read with the gcount() method - when that returns zero, you have read everything.
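A literal sketch of that loop (sFile and segsize correspond to the variables in the question; note that readsome() only extracts what the stream's internal buffer reports as available, so how much each call returns is implementation-defined):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

void ReadWholeFile(const std::string &sFile, std::size_t segsize)
{
    std::ifstream file(sFile.c_str(), std::ios::binary);
    std::vector<char> buffer(segsize);
    for (;;) {
        file.readsome(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        std::streamsize n = file.gcount();
        if (n == 0)
            break;             // readsome() found nothing more in the stream buffer
        // ... process n bytes from buffer ...
    }
}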
It appears to take a correct amount of time the first time the file is read.
On that first read, you're reading 150GB in about 2 minutes. That works out to about 10 gigabits per second. Is that what you're expecting (based on the network to your NFS mount)?
One possibility is that the file could be at least in part sparse. A sparse file has regions that are truly empty - they don't even have disk space allocated to them. These sparse regions also don't consume much cache space, and so reading the sparse regions will essentially only require time to zero out the userspace pages they're being read into.
You can check with ls -lsh. The first column will be the on-disk size - if it's less than the file size, the file is indeed sparse. To de-sparse the file, just write to every page of it.
If you would like to test for true disk speeds, one option would be to use the O_DIRECT flag to open(2) to bypass the cache. Note that all IO using O_DIRECT must be page-aligned, and some filesystems do not support it (in particular, it won't work over NFS). Also, it's a bad idea for anything other than benchmarking. See some of Linus's rants in this thread.
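A hedged sketch of what an O_DIRECT read loop might look like on Linux (it will not work over NFS, as noted above; 4096-byte alignment is assumed to satisfy the filesystem's requirement, and error handling is minimal):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // needed for O_DIRECT with glibc
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>

void DirectReadBenchmark(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return;
    void *buf = nullptr;
    if (posix_memalign(&buf, 4096, 1 << 20) == 0) {   // page-aligned 1 MiB buffer
        while (read(fd, buf, 1 << 20) > 0) {
            // ... time / accumulate the bytes read here ...
        }
        free(buf);
    }
    close(fd);
}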
Finally, to drop all caches on a linux system for testing, you can do:
echo 3 > /proc/sys/vm/drop_caches
If you do this on both client and server, you will force the file out of memory. Of course, this will have a negative performance impact on anything else running at the time.