Is there a way to prevent libcurl from buffering? - c++

I am using libcurl with CURLOPT_WRITEFUNCTION to download a certain file.
I ask for a certain buffer size using CURLOPT_BUFFERSIZE.
By the time my callback function is called for the first time with roughly that many bytes, much more data has actually been downloaded.
For example, if I ask for 1024 bytes of data, by the time I first receive them the process has already consumed about 100K (based on Process Explorer and similar tools; I can see the continuous stream of data and ACKs in Wireshark), so I assume libcurl is downloading ahead and buffering the data.
What I am trying to achieve here is to be able to cancel the retrieval based on the first few chunks of data, without downloading anything that is unnecessary.
Is there a way to prevent that sort of buffering and only download the next chunk of data once I have finished processing the current one (or at least not to buffer tens and hundreds of kilobytes)?
I would prefer the solution to be server agnostic, so CURLOPT_RANGE won't actually work here.
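A minimal sketch of this kind of setup (the URL and the keep/abort decision are placeholders): returning a short count from the CURLOPT_WRITEFUNCTION callback makes curl_easy_perform() fail with CURLE_WRITE_ERROR and stops libcurl from requesting more data, though whatever is already in flight in the TCP receive window will have been transferred anyway.

#include <curl/curl.h>
#include <string>

// Placeholder predicate: decide from the first chunk whether the rest is wanted.
static bool looks_interesting(const char* data, size_t len)
{
    return std::string(data, len).find("magic") != std::string::npos;
}

static size_t write_cb(char* ptr, size_t size, size_t nmemb, void* userdata)
{
    size_t total = size * nmemb;
    bool* seen_first = static_cast<bool*>(userdata);
    if (!*seen_first) {
        *seen_first = true;
        if (!looks_interesting(ptr, total))
            return 0;              // short count => libcurl aborts with CURLE_WRITE_ERROR
    }
    // ... process/store the chunk here ...
    return total;                  // full count => keep downloading
}

int main()
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    bool seen_first = false;

    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/file.bin");  // placeholder URL
    curl_easy_setopt(curl, CURLOPT_BUFFERSIZE, 1024L);   // a hint to libcurl, not a hard cap
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &seen_first);

    CURLcode rc = curl_easy_perform(curl);   // CURLE_WRITE_ERROR if we bailed out early
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}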

Related

Buffering to the hard disk

I am receiving a large quantity of data at a fixed rate. I need to do some processing on this data on a different thread, but this may run slower than the data is coming in, so I need to buffer the data. Due to the quantity of data coming in the available RAM would be quickly exhausted, so it needs to overflow onto the hard disk. What I could do with is something like a filesystem-backed pipe, so the writer could be blocked by the filesystem, but not by the reader running too slowly.
Here's a rough set of requirements:
Writing should not be blocked by the reader running too slowly.
If data is read slow enough that the available RAM is exhausted it should overflow to the filesystem. It's ok for writes to the disk to block.
Reading should block if no data is available unless the stream has been closed by the writer.
If the reader is able to keep up with the data then it should never hit the hard disk as the RAM buffer would be sufficient (nice but not essential).
Disk space should be recovered as the data is consumed (or soon after).
Does such a mechanism exist in Windows?
This looks like a classic message queue. Did you consider MSMQ or something similar? MSMQ has all the properties you are asking for. You may want to use direct addressing to avoid Active Directory (http://msdn.microsoft.com/en-us/library/ms700996(v=vs.85).aspx) and use a local or TCP/IP queue address.
Use an actual file. Write to the file as the data is received, and in another process read the data from the file and process it.
You even get the added benefit of no multithreading.
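A minimal sketch of that idea, with hypothetical file names and polling interval: the writer appends incoming blocks to a spill file, and the reader (another thread or another process) consumes from its last offset, waiting when it catches up. Reclaiming disk space as data is consumed would need the spill to be split into chunks that the reader deletes once processed.

#include <chrono>
#include <fstream>
#include <thread>
#include <vector>

// Writer side: append each incoming block and flush so the reader sees it promptly.
void write_block(std::ofstream& spill, const char* data, std::size_t len)
{
    spill.write(data, static_cast<std::streamsize>(len));
    spill.flush();
}

// Reader side: consume from the last processed offset, waiting when no new data is available.
void reader_loop(const char* path)   // e.g. "spill.bin" (hypothetical)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(64 * 1024);
    std::streamoff offset = 0;

    for (;;) {
        in.clear();                              // clear EOF from the previous attempt
        in.seekg(offset);
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = in.gcount();
        if (got > 0) {
            // ... process buf[0 .. got) ...
            offset += got;
        } else {
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        }
    }
}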

Need an efficient way to handle ReadDirectoryChangesW in C++

I want to get notified about changes in a directory (new file addition/deletion/update).
I used the API ReadDirectoryChangesW, which correctly notifies me about any change in the directory. But the API accepts a buffer in which it returns the details of the file(s) added/deleted/modified in the directory.
This poses a limitation, as the amount of change in the directory is not known in advance and can sometimes be huge. For example, 1000 files might get added to the directory.
In that case I always need to be ready with a buffer large enough to hold notifications about all 1000 files.
I don't want to always allocate such a large buffer.
Is there any other alternate way which is more efficient?
If I read the documentation correctly, it will return as many changes as fit in your buffer, and then when you next call it, it will give you more changes. If you want to get 1000 files' worth of changes at once, you've got to give it a big buffer, but if you can handle them in smaller chunks, just pass in a smaller buffer and you'll get the rest of the changes on subsequent calls.
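A minimal synchronous sketch of that chunked approach, with a hypothetical directory and notify filters. A modest buffer is reused across calls; if a call comes back with zero bytes, the buffer overflowed and you have to rescan the directory yourself, as the next answer below suggests.

#include <windows.h>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    // Hypothetical directory to watch.
    HANDLE dir = CreateFileW(L"C:\\watched", FILE_LIST_DIRECTORY,
                             FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                             nullptr, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, nullptr);
    if (dir == INVALID_HANDLE_VALUE)
        return 1;

    std::vector<BYTE> buf(16 * 1024);   // modest buffer; leftover changes arrive on the next call
    DWORD bytes = 0;

    while (ReadDirectoryChangesW(dir, buf.data(), static_cast<DWORD>(buf.size()),
                                 TRUE,   // watch subtree
                                 FILE_NOTIFY_CHANGE_FILE_NAME | FILE_NOTIFY_CHANGE_LAST_WRITE,
                                 &bytes, nullptr, nullptr))   // synchronous (blocking) call
    {
        if (bytes == 0) {
            // Zero bytes returned: the buffer overflowed, so rescan the directory yourself.
            continue;
        }
        for (auto* p = reinterpret_cast<FILE_NOTIFY_INFORMATION*>(buf.data());;) {
            std::wstring name(p->FileName, p->FileNameLength / sizeof(WCHAR));
            std::wcout << L"change " << p->Action << L": " << name << L'\n';
            if (p->NextEntryOffset == 0)
                break;
            p = reinterpret_cast<FILE_NOTIFY_INFORMATION*>(
                    reinterpret_cast<BYTE*>(p) + p->NextEntryOffset);
        }
    }
    CloseHandle(dir);
}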
One approach is to use ReadDirectoryChangesW() simply as a way to be notified that there has been some change in the directory, and then to treat that notification as an event that triggers your own review of the directory for changes.
The idea is to discover what has changed yourself rather than depending on ReadDirectoryChangesW() to tell you what has changed.
The documentation for the function indicates that a system buffer is allocated to track changes, and it is possible, with a large number of changes, that the allocated buffer will overflow. This results in an error being returned and requires you to discover what has changed for yourself anyway.
This article on using ReadDirectoryChangesW() may help you.
In my case, I am using the function to monitor a print spooler folder into which a number of text files might be dropped. The number of files is small, so I have just allocated a large buffer. What I then do is use a queue to hand the list of files to print to the actual print functionality, which runs on another thread.

How to handle 100 Mbps input stream when my program can process data only at 1 Mbps rate

I am working on a project where we can have an input data stream at 100 Mbps.
My program may be used overnight to capture this data and will therefore generate a huge data file. The program logic that interprets the data is complex and can process data at only 1 Mbps.
We also dump the bytes to a log file after they are processed. We do not want to lose any incoming data, and at the same time we want the program to work in real time, so we maintain a circular buffer which acts as a cache.
Right now the only way to keep incoming data from getting lost is to increase the size of this buffer.
Please suggest a better way to do this. What alternative ways of caching could I try?
Stream the input to a file. Really, there is no other choice. It comes in faster than you can process it.
You could create one file per second of input data. That way you can start processing older files directly while new files are being streamed to disk.
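A minimal sketch of the one-file-per-second idea (the naming scheme is made up): the capture thread rolls to a new file whenever the second changes, and the slower processing stage can pick up and delete completed files at its own pace, so disk usage stays bounded by the backlog.

#include <chrono>
#include <cstdio>
#include <fstream>

// Roll the capture output to a new file every second so the processing stage
// can consume completed files independently.
class SecondRollingWriter {
public:
    void write(const char* data, std::size_t len) {
        auto now = std::chrono::duration_cast<std::chrono::seconds>(
                       std::chrono::system_clock::now().time_since_epoch()).count();
        if (now != current_second_) {
            current_second_ = now;
            if (out_.is_open())
                out_.close();
            char name[64];
            std::snprintf(name, sizeof(name), "capture_%lld.bin",
                          static_cast<long long>(now));   // hypothetical naming scheme
            out_.open(name, std::ios::binary);
        }
        out_.write(data, static_cast<std::streamsize>(len));
    }
private:
    long long current_second_ = -1;
    std::ofstream out_;
};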

Using libpcap, is there a way to determine the file offset of a captured packet from an offline pcap file?

I'm writing a program to reconstruct TCP streams captured by Snort. Most of the examples I've read regarding session reconstruction either:
load the entire pcap file into memory to start with (not a solution because of hardware constraints and the fact that some of the capture files are 10 GB in size), or
cache each packet in memory as they read through the capture, discarding the irrelevant ones as they go; this presents basically the same problem as reading the entire file into memory
My current solution was to write my own pcap file parser, since the format is simple. I save the offsets of each packet in a vector and can reload each one after I've passed it. Like libpcap, this only streams one packet into memory at a time; I am only using sequence numbers and flags for ordering, NOT the packet data. Unlike libpcap, it is noticeably slower: processing a 570 MB capture with libpcap takes roughly 0.9 seconds, whereas my code takes 3.2 seconds. However, I have the advantage of being able to seek backwards without reloading the entire capture.
If I were to stick with libpcap for speed reasons, I was thinking I could just make a currentOffset variable with an initial value of 24 (the size of the pcap file global header), push it onto a vector every time I load a new packet, and increment it by the size of the packet + 16 (the size of the pcap record header) every time I call pcap_next_ex. Then, whenever I wanted to read an individual packet, I could load it using conventional means and seek to packetOffsets[packetNumber].
Is there a better way to do this using libpcap?
Solved the problem myself.
Before I call pcap_next_ex, I push ftell(pcap_file(myPcap)) onto a vector<unsigned long>. I then manually parse the packets as needed.
EZPZ. It just took 24+ hours of racking my brain...
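A minimal sketch of that approach (capture path taken from the command line, error handling trimmed): record ftell() of the savefile's stdio stream before each pcap_next_ex() call, and later fseek() to packet_offsets[n] and parse the 16-byte record header plus packet data by hand.

#include <pcap/pcap.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    if (argc < 2) return 1;
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t* pc = pcap_open_offline(argv[1], errbuf);
    if (!pc) return 1;

    std::vector<long> packet_offsets;              // file offset of each packet record
    struct pcap_pkthdr* hdr = nullptr;
    const u_char* data = nullptr;

    FILE* f = pcap_file(pc);                       // underlying stdio stream of the savefile
    for (;;) {
        long off = ftell(f);                       // offset *before* reading the next record
        int rc = pcap_next_ex(pc, &hdr, &data);
        if (rc != 1) break;                        // 1 = packet read, -2 = EOF, -1 = error
        packet_offsets.push_back(off);
        // ... inspect hdr/data, keep only sequence numbers and flags, etc. ...
    }
    pcap_close(pc);

    // Later: fopen(argv[1], "rb"), fseek to packet_offsets[n], and parse the
    // 16-byte record header plus packet data manually to revisit packet n.
    return 0;
}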

MySQL++, storing realtime data

Firstly I'm an engineer, not a computer scientist, so please be gentle.
I currently have a C++ program which uses MySQL++. The program also incorporates the NI Visa runtime. One of the interrupt handlers receives data (1 byte) from a USB device about 200 times a second. I would like to store this data with a time stamp on each sample on a remote server. Is this feasible? Can anyone recommend a good approach?
Regards,
Michael
I think that performing 200 transactions/second against a remote server is asking a lot, especially when you consider that these transactions would be occurring in the context of an interrupt handler which has to do its job and get done quickly. I think it would be better to decouple your interrupt handler from your database access - perhaps have the interrupt handler store the incoming data and timestamp into some sort of in-memory data structure (array, circular linked list, or whatever, with appropriate synchronization) and have a separate thread that waits until data is available in the data structure and then pumps it to the database. I'd want to keep that interrupt handler as lean and deterministic as possible, and I'm concerned that database access across the network to a remote server would be too slow - or worse, would be OK most of the time, but sometimes would go to h*ll for no obvious reason.
This, of course, raises the question/problem of data overrun, where data comes in faster than it can be pumped to the database and the in-memory storage structure fills up. This could cause data loss. How bad a thing is it if you drop some samples?
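A minimal sketch of that decoupling (the names and structure are illustrative, and a real interrupt handler would be better served by a pre-allocated lock-free ring buffer than a mutex): the handler just stamps and enqueues each byte, and a separate thread drains the queue in batches for the database layer.

#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

struct Sample { std::int64_t timestamp_us; std::uint8_t value; };

// Shared queue between the (fast) acquisition path and the (slow) database writer.
std::mutex              g_mtx;
std::condition_variable g_cv;
std::deque<Sample>      g_queue;

// Called from the acquisition/interrupt path: just stamp and enqueue.
void on_sample(std::uint8_t value, std::int64_t timestamp_us)
{
    {
        std::lock_guard<std::mutex> lk(g_mtx);
        g_queue.push_back({timestamp_us, value});
    }
    g_cv.notify_one();
}

// Runs on its own thread: drains the queue in batches and hands them to the DB layer.
void db_writer_loop()
{
    for (;;) {
        std::vector<Sample> batch;
        {
            std::unique_lock<std::mutex> lk(g_mtx);
            g_cv.wait(lk, [] { return !g_queue.empty(); });
            batch.assign(g_queue.begin(), g_queue.end());
            g_queue.clear();
        }
        // insert_batch(batch);   // hypothetical: one multi-row INSERT via MySQL++
    }
}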
I don't think you'll be able to maintain that speed with one separate INSERT per value, but if you batch the values up into large enough groups you can send them all as one query, and it should be fine.
INSERT INTO records(timestamp, value)
VALUES(1, 2), (3, 4), (5, 6), [...], (399, 400);
Just push the timestamp and value onto a buffer, and when the buffer hits 200 in size (or some other arbitrary figure), generate the SQL and send the whole lot off. Building this string up with sprintf shouldn't be too slow. Just beware of reading from a data structure that your interrupt routine might be writing to at the same time.
If you find that this SQL generation is too slow for some reason, and there's no quicker method using the API (eg. stored procedures), then you might want to run this concurrently with the data collection. Simplest is probably to stream the data across a socket or pipe to another process that performs the SQL generation. There are also multithreading approaches but they are more complex and error-prone.
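A rough sketch of that batching, using the records(timestamp, value) table from the example above; the MySQL++ connection is assumed to be set up elsewhere, and the header path may vary by installation.

#include <mysql++.h>      // header name/path depends on how MySQL++ is installed
#include <cstdio>
#include <string>
#include <vector>

struct Sample { long long timestamp_us; int value; };

// Build one multi-row INSERT from a batch and send it in a single round trip.
// Values are numeric here, so no string escaping is needed.
void insert_batch(mysqlpp::Connection& conn, const std::vector<Sample>& batch)
{
    if (batch.empty()) return;

    std::string sql = "INSERT INTO records(timestamp, value) VALUES";
    char row[64];
    for (std::size_t i = 0; i < batch.size(); ++i) {
        std::snprintf(row, sizeof(row), "%s(%lld, %d)",
                      i ? ", " : " ", batch[i].timestamp_us, batch[i].value);
        sql += row;
    }
    mysqlpp::Query q = conn.query(sql);
    q.execute();                     // one query for the whole batch
}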
In my opinion, you should do two things: 1. buffer the data, and 2. use one time stamp per buffer. The USB protocol is not byte-based but message-based. If you are tracking messages, then time stamp the messages.
Also, databases would rather receive blocks or chunks of data than one byte at a time. There is overhead in the database with each transaction. To measure the efficiency, divide the overhead by the number of bytes in the transaction. You'll see that large blocks are more efficient than lots of little transactions.
Another option is to store the data in a file and then use MySQL's LOAD DATA INFILE statement to load the data into the database. You could also store the data in a buffer and then use the MySQL C++ connector's stream interface to load the data into the database.
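A rough sketch of the file-then-bulk-load variant (the file name, table layout, and LOCAL option are assumptions; LOAD DATA LOCAL has to be enabled on both client and server):

#include <mysql++.h>      // header name/path depends on how MySQL++ is installed
#include <cstdio>
#include <string>

// Dump buffered samples to a CSV staging file elsewhere, then bulk-load it in one statement.
void flush_to_db(mysqlpp::Connection& conn, const char* csv_path)   // e.g. "samples.csv"
{
    mysqlpp::Query q = conn.query(
        std::string("LOAD DATA LOCAL INFILE '") + csv_path +
        "' INTO TABLE records FIELDS TERMINATED BY ',' (timestamp, value)");
    q.execute();
    std::remove(csv_path);           // reclaim the staging file once loaded
}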
Multi-threading doesn't guarantee being any faster than a single-threaded (apartment) approach, even if you cached things correctly on the server side, unless there is some strange CPU priority preference at play. What about using shaders and letting the pass-by-reference value in windows.h be the time stamp?