I am trying to transfer a large amount of data (long int arrays) from multiple (8) remote computers to a single computer (the main process). All of them are identical machines connected via a 100 Mbps LAN (so endianness is not a concern).
Each remote machine has an 8 GB array of long ints, and I have to transmit it to the single computer for processing. My question is: what is the best way to transfer these arrays quickly to the main process? I tried plain TCP for this job, and transferring the data takes a long time (about 28 minutes). Is there any way to speed this up? Will switching to UDP help? Will using multiple ports/sockets help with buffering? What is the best approach to such problems?
I probably cannot compress the data (most of the values are unique), and I need to send everything (I carry out important operations in the main process).
First, upgrade your hardware. With a 1 Gb NIC (or 10 Gb if you have the budget) and a decent switch you get a 10x boost with no coding: transferring 8 GB of data then takes about one minute. Push it further with NIC bonding and you double it again, to about 30 seconds (roughly 60 times faster than your current 28 minutes).
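The numbers above can be sanity-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes the link speeds are in bits per second, that 1 GB means 10^9 bytes, and it ignores protocol overhead and congestion. The function name `transfer_seconds` is made up for illustration.

```python
def transfer_seconds(num_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time: payload size in bits divided by the link rate."""
    return num_bytes * 8 / link_bits_per_second


GB = 10**9          # assumption: decimal gigabytes
payload = 8 * GB    # one 8 GB array

for label, rate in [("100 Mb/s", 100e6),
                    ("1 Gb/s", 1e9),
                    ("2 x 1 Gb/s bonded", 2e9)]:
    print(f"{label:>18}: {transfer_seconds(payload, rate):7.1f} s")
```

Under these assumptions the three cases come out to 640 s, 64 s and 32 s, which matches the "about one minute" and "30 seconds" estimates.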
Next, adjust your algorithm. Do you need to send the whole 8 GB frequently? Can you pipeline it, process it in a streaming fashion, or send only diffs (keeping a replica on the main machine), so that you get good data-processing throughput?
The last thing you can do is compression, preferably in chunks so that you don't compress the whole 8 GB at once.
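Chunked compression could look roughly like the sketch below. It uses Python's standard `zlib` module purely as an illustration (the question does not say which language or library is in use), and the helper names `compress_chunks` and `decompress_chunks` are hypothetical. Each chunk is fed to one streaming compressor, so the full array never has to sit in a single buffer.

```python
import zlib


def compress_chunks(chunks, level=1):
    """Yield compressed pieces for an iterable of bytes chunks (one zlib stream)."""
    comp = zlib.compressobj(level)   # a low level favours speed over ratio
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:                      # the compressor may buffer and return b""
            yield out
    yield comp.flush()               # emit whatever is still buffered


def decompress_chunks(pieces):
    """Inverse of compress_chunks: yield the original bytes piece by piece."""
    decomp = zlib.decompressobj()
    for piece in pieces:
        out = decomp.decompress(piece)
        if out:
            yield out
    tail = decomp.flush()
    if tail:
        yield tail
```

On the sending side each yielded piece can be written to the socket as soon as it is produced, so compression and transmission overlap.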
You can try to compress your array. There are several algorithms to choose from, and this post may help you. It explains the three best-known lossless algorithms:
1. Huffman coding, a tree-based algorithm with many applications and specializations
2. RLE (run-length encoding), which is well suited to icon compression
3. LZ77, which uses a dictionary and is the basis of many other algorithms
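To give a feel for the simplest of the three, here is a toy run-length encoder in Python. It is a sketch for illustration only (the names `rle_encode` and `rle_decode` are made up), not a production codec.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out
```

Note that RLE only pays off when the data contains long runs. For an array of mostly unique values, as described in the question, every run has length 1 and the encoded form is larger than the input.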
A lossless algorithm is what you need because you don't want to lose any of the data in your array. That's also why I wouldn't recommend UDP, since it does not check whether the data has been received.
I am currently working on a big dataset (approximately a billion data points) and I have decided to use C++ over R in particular for convenience in memory allocation.
However, there does not seem to be an equivalent of RStudio for C++ that would "store" the data set and spare me from reading the data every time I run the program, which is extremely time-consuming...
What kind of techniques do C++ users use for big data in order to read the data "once and for all"?
Thanks for your help!
If I understand what you are trying to achieve, i.e. load some data into memory once and use the same in-memory data across multiple runs of your code, possibly with modifications to that code, then there is no such IDE, as IDEs are not meant to store any data.
What you can do is first load your data into some in-memory database and write your C++ program to read from that database instead of reading directly from the original data source.
how avoid multiple reads of big data set.
What kind of techniques do C++ users use for big data in order to read
the data "once for all" ?
I do not know of any C++ tool with such capabilities, but I doubt that I have ever searched for one ... seems like something you might do. Keywords appear to be 'data frame' and 'statistical analysis' (and C++).
If you know the 'data set' format and wish to process the raw data no more than once, you might consider using POSIX shared memory.
I can imagine that (a) the 'extremely time consuming' effort could (read and) transform the 'raw' data, and write into a 'data set' (a file) suitable for future efforts (i.e. 'once and for all').
Then (b) future efforts can 'simply' "map" the created 'data set' (a file) into the program's memory space, all ready for use with no (or at least much reduced) time consuming effort.
Expanding the memory map of your program is about using POSIX access to shared memory (Ubuntu 17.10 has it; I have 'gently' used it in C++). Relevant calls include shm_open, mmap, munmap, shm_unlink, and a few others.
From 'man mmap':
mmap() creates a new mapping in the virtual address space of the
calling process. The starting address for
the new mapping is specified in ...
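Steps (a) and (b) can be sketched in a few lines. The example below uses Python's standard `mmap` module on an ordinary file rather than shm_open, simply because it is short to show; in C++ the same idea uses the calls listed above. The function names and the choice of a flat file of 64-bit integers are assumptions for illustration.

```python
import mmap
from array import array


def write_dataset(path, values):
    """Step (a): convert the raw data once into a flat file of 64-bit ints."""
    with open(path, "wb") as f:
        array("q", values).tofile(f)


def sum_mapped(path):
    """Step (b): map the prepared file and use it in place, with no parsing."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # View the mapped bytes as native 64-bit signed integers.
            with memoryview(mapped) as raw, raw.cast("q") as numbers:
                return sum(numbers)
```

The expensive conversion runs once; later runs only pay for the mapping, and the operating system pages data in on demand.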
how avoid multiple reads of big data set. What kind of techniques do
C++ users use for big data in order to read the data "once for all" ?
I recently retried my hand at measuring std::thread context switch duration (on my Ubuntu 17.10, 64 bit desktop). My app captured <30 million entries over 10 seconds of measurement time. I also experimented with longer measurement times, and with larger captures.
As part of debugging info capture, I decided to write intermediate results to a text file, for a review of what would be input to the analysis.
The code spent only about 2.3 seconds to save this info to the capture text file. My original software would then proceed with analysis.
But this delay to get on with testing the analysis results (> 12 sec = 10 + 2.3) quickly became tedious.
I found the analysis effort otherwise challenging, and recognized I might save time by capturing intermediate data, and thus avoiding most (but not all) of the data measurement and capture effort. So the debug capture to intermediate file became a convenient split to the overall effort.
Part 2 of the split app reads the <30 million byte intermediate file in somewhat less than 0.5 seconds, very much reducing the analysis development cycle (edit-compile-link-run-evaluate), which was (usually) no longer burdened with the 12+ second measurement and data generation.
While 28 M Bytes is not BIG data, I valued the time savings for my analysis code development effort.
FYI - My intermediate file contained a single letter for each 'thread entry into the critical section event'. With 10 threads, the letters were 'A', 'B', ... 'J'. (reminds me of dna encoding)
For each thread, my analysis supported splitting counts per thread. Where vxWorks would 'balance' the threads blocked at a semaphore, Linux does NOT ... which was new to me.
Each thread ran a different number of times through the single critical section, but each thread got about 10% of the opportunities.
Technique: simple encoded text file with captured information ready to be analyzed.
Note: I was expecting to test piping the output of app part 1 into app part 2. Still could, I guess. WIP.
My current approach:
I have one domain class - Application
Each application in my system is stored in "applications" bucket under APPLICATION_KEY key
Apart from application metadata stored in this bucket, each application has its own bucket called "time_metrics/APPLICATION_KEY" where I store time series in a way:
KEY - timestamp / VALUE - some attributes
My concern is the efficiency of queries over a specific time window for a given application. Currently, to get the time series for a specific time window and eventually apply some reductions, I have to run map/reduce over the whole "time_metrics/APPLICATION_KEY" bucket, which, from what I have found, is not the recommended use case for Riak Map/Reduce.
My question: what would be the best db structure for this kind of system, and how can I query it efficiently?
Adding onto @macintux's answer.
Basho has had a few customers that have used Riak for time series metrics.
Boundary has a nice tech talk about how they use Riak with their network monitoring software. They rollup data into different chunks of time (1m, 5m, 15m) for analysis.
They also have a series of blog posts about lessons learned while implementing this system.
Kivra also has a good slide deck about how they use time series data with Riak.
You could roll up your data into some sort of arbitrary time length, then read the range you need by issuing regular K/V gets, and then reconstruct the larger picture / reduce in your application.
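A rough sketch of that roll-up idea, in Python and without any Riak client calls: a plain dict stands in for the bucket, and the 60-second block width, the "APP/START" key layout and the function names are all assumptions chosen for illustration. The point is that every key in a time window can be computed, so the read side needs only regular K/V gets.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # hypothetical roll-up width (1 minute)


def bucket_key(app_key, ts):
    """Key of the block that contains timestamp ts (seconds since epoch)."""
    start = int(ts) - int(ts) % BUCKET_SECONDS
    return f"{app_key}/{start}"


def roll_up(app_key, points):
    """Group (timestamp, value) points into blocks keyed by block start."""
    blocks = defaultdict(list)
    for ts, value in points:
        blocks[bucket_key(app_key, ts)].append((ts, value))
    return dict(blocks)


def keys_for_window(app_key, start_ts, end_ts):
    """Every block key covering [start_ts, end_ts], computed without listing."""
    first = int(start_ts) - int(start_ts) % BUCKET_SECONDS
    return [f"{app_key}/{t}"
            for t in range(first, int(end_ts) + 1, BUCKET_SECONDS)]
```

The application would fetch each key from `keys_for_window`, drop points outside the exact window, and run its reduction locally.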
If you have spare computing power and you know in advance what keys you need, you certainly can use Riak's MapReduce, but often retrieving the keys and running your processing on the client will be as fast (and won't strain your cluster).
Some general ideas:
- Roll up your data into larger blocks
  - If you're concerned about losing data if your client crashes while buffering it, you can always store the data as it arrives
  - Similar idea: store the data as it arrives, then retrieve it and roll it up at certain intervals
    - You can automatically expire data once you're confident it is being reliably stored in larger blocks, using either the Bitcask or Memory backends
    - Memory backend is quite useful (RAM permitting) for any data that only needs to be stored for a limited period of time
- Related: don't be afraid to store multiple copies of your data to make reading/reporting easier later
  - Multiple chunks of time (5- and 15-minute blocks, for example)
  - Multiple report formats
Having said all that, if you're doing straight key/value requests (it's ideal to always be able to compute the keys you need, rather than doing indexing or searching), Riak can support very heavy traffic loads, so I wouldn't recommend spending too much time creating alternative storage mechanisms unless you know you're going to face latency problems.
I'm working on a Qt GUI for visualizing 'live' data which is received via a TCP/IP connection. The issue is that the data is arriving rather quickly (a few dozen MB per second) - it's coming in faster than I'm able to visualize it even though I don't do any fancy visualization - I just show the data in a QTableView object.
As if that's not enough, the GUI also has a 'Freeze' button which suspends updating the GUI (while data keeps being received in the background). As soon as the Freeze option is disabled, the data which has accumulated in the background should be visualized.
What I'm wondering is: since the data is coming in so quickly, I can't possibly hold all of it in memory. The customer might even keep the GUI running overnight, so gigabytes of data will accumulate. What's a good data storage system for writing this data to disk? It should have the following properties:
- It shouldn't be too much work to use it on a desktop system
- It should be fast at appending new data at the end. I never need to touch previously written data anymore, so writing into anywhere but the end is not needed.
- It should be possible to randomly access records in the data. This is because scrolling around in my GUI will make it necessary to quickly display the N to N+20 (or whatever the height of my table is) entries in the data stream.
- The data which is coming in can be separated into records, but unfortunately the records don't have a fixed size. I'd rather not impose a maximum size on them (at least not if it's possible to get good performance without doing so).
Maybe some SQL database, or something like CouchDB? It would be great if somebody could share his experience with such scenarios.
I think that SQLite might do the trick. It seems to be fast. I don't have a data flow like yours, but it works well as a backend for a log recorder: I have a GUI where you can view logs n to n+k.
You can also try SOCI as a C++ database access API; it seems to work fine with SQLite (I have not used it yet but plan to).
my2c
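For what the SQLite approach might look like, here is a minimal sketch using Python's built-in `sqlite3` module (in a Qt/C++ program the same SQL would go through Qt's SQL classes or SOCI). The table name `records`, the function names and the batch-per-transaction choice are assumptions, not something from the answer above. It relies on the log being append-only, so row ids are consecutive starting at 1.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) the record store; variable-size payloads go in a BLOB."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS records "
               "(id INTEGER PRIMARY KEY, payload BLOB NOT NULL)")
    return db


def append_records(db, payloads):
    """Append a batch in one transaction; per-row commits would be much slower."""
    with db:
        db.executemany("INSERT INTO records (payload) VALUES (?)",
                       ((p,) for p in payloads))


def fetch_window(db, first, count=20):
    """Return `count` records starting at 0-based position `first`."""
    rows = db.execute(
        "SELECT payload FROM records WHERE id > ? ORDER BY id LIMIT ?",
        (first, count))
    return [row[0] for row in rows]
```

Scrolling the table view to row N then maps to `fetch_window(db, N, visible_rows)`, which is a primary-key range scan.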
I would recommend a simple file based solution.
If you can use fixed-size records: if you get the data continuously at a constant sample rate, random access is easy and very fast once you know the time stamp of the first data point and the sample rate. If the sample rate varies, write a time stamp with each data point. Random access then requires a binary search, but it is still fast enough.
If you have variable-size records: write the variable-size data to one file and, to a second file, write fixed-size index entries pointing into the data file. If the sample rate varies, write time stamps too. Now you can do fast random access using the index file.
If you are using Qt to implement this kind of solution, you need two sets of QFile and QDataStream instances, one for writing and one for reading.
And a note about performance: don't flush the file after every data point write. But remember to flush the file before doing any random access to it.
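The variable-size case could be sketched like this. The example is in Python for brevity (with Qt it would be the two QFile/QDataStream pairs mentioned above); the class name `RecordLog` and the 12-byte index entry layout (8-byte offset plus 4-byte length) are assumptions made for illustration.

```python
import os
import struct

INDEX_ENTRY = struct.Struct("<QI")  # data-file offset (8 bytes) + length (4 bytes)


class RecordLog:
    """Append-only data file plus a fixed-size index file for random access."""

    def __init__(self, data_path, index_path):
        # Separate handles for writing and reading, as suggested above.
        self.data_w = open(data_path, "ab")
        self.index_w = open(index_path, "ab")
        self.data_r = open(data_path, "rb")
        self.index_r = open(index_path, "rb")
        self.offset = os.path.getsize(data_path)  # where the next record starts

    def append(self, payload: bytes):
        self.data_w.write(payload)
        self.index_w.write(INDEX_ENTRY.pack(self.offset, len(payload)))
        self.offset += len(payload)

    def read(self, n: int) -> bytes:
        """Return record n (0-based)."""
        # Flush only before random access, not after every write.
        self.data_w.flush()
        self.index_w.flush()
        self.index_r.seek(n * INDEX_ENTRY.size)
        offset, length = INDEX_ENTRY.unpack(self.index_r.read(INDEX_ENTRY.size))
        self.data_r.seek(offset)
        return self.data_r.read(length)

    def close(self):
        for f in (self.data_w, self.index_w, self.data_r, self.index_r):
            f.close()
```

Because index entries have a fixed size, record n is always found with one seek in the index file and one seek in the data file, no matter how large the records are.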
I'm going to write a program that plots data from a sensor connected to the computer. The sensor value is going to be plotted as a function of time (sensor value on the y-axis, time on the x-axis). I want to be able to add new values to the plot in real time. What would be the best way to do this in C++?
Edit: And by the way, the program will be running on a Linux machine
Are you particularly concerned about the C++ aspect? I've handled data at 10 Hz or so without breaking a sweat by putting gnuplot into a read/plot/refresh loop, and with LiveGraph, with no issues.
Write a function that can plot a std::deque in a way you like, then .push_back() values from the sensor onto the deque as they become available, and .pop_front() values from the deque if it becomes too long for nice plotting.
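The same sliding-window idea, sketched in Python with `collections.deque` standing in for std::deque (`append` and `popleft` correspond to push_back and pop_front). The class name `PlotWindow` is made up, and the actual drawing is left out because it depends on the plotting library.

```python
from collections import deque


class PlotWindow:
    """Keeps only the most recent max_points samples for plotting."""

    def __init__(self, max_points):
        self.max_points = max_points
        self.points = deque()

    def add(self, t, value):
        self.points.append((t, value))            # push_back
        while len(self.points) > self.max_points:
            self.points.popleft()                 # pop_front

    def series(self):
        """Return separate x and y lists, ready to hand to a plot call."""
        xs = [t for t, _ in self.points]
        ys = [v for _, v in self.points]
        return xs, ys
```

In Python, `deque(maxlen=max_points)` would drop old entries automatically; the explicit loop is kept here to mirror the C++ description.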
The exact nature of your plotting function depends on your platform, needs, sense of esthetics, etc.
You can use ring buffers. Such a buffer has a read position and a write position, so one thread can write to the buffer while another reads from it and plots the graph. For efficiency you usually end up writing your own framework.
The size of such a buffer can be estimated from, e.g., the data delivery rate of the sensor (40 kHz?), the size of one sample, and the time span you would like to keep for plotting purposes.
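A minimal sketch of such a buffer and of the size estimate, written in Python for readability. The names are hypothetical, the 40 kHz rate is the guess from the answer, and the policy of overwriting the oldest unread sample when the buffer is full is an assumption. The sketch is not thread-safe; a real two-thread version needs a lock or atomic positions.

```python
def ring_capacity(sample_rate_hz, seconds_to_keep):
    """Number of samples needed to hold the requested time span."""
    return int(sample_rate_hz * seconds_to_keep)


class RingBuffer:
    """Fixed-capacity buffer with separate read and write positions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = [None] * capacity
        self.write_pos = 0   # total number of samples written so far
        self.read_pos = 0    # total number of samples consumed so far

    def write(self, sample):
        if self.write_pos - self.read_pos == self.capacity:
            self.read_pos += 1                     # full: drop the oldest sample
        self.buf[self.write_pos % self.capacity] = sample
        self.write_pos += 1

    def read_all(self):
        """Return all unread samples in arrival order and mark them consumed."""
        items = [self.buf[i % self.capacity]
                 for i in range(self.read_pos, self.write_pos)]
        self.read_pos = self.write_pos
        return items
```

For example, keeping 2 seconds of a 40 kHz signal needs `ring_capacity(40_000, 2)` = 80000 slots; multiplied by the size of one sample this gives the memory footprint.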
It also depends on whether you would like to store the data uncompressed or store the rendered plot, e.g. for further offline analysis. In a non-RTOS environment your "real-time" depends on processing speed: how fast you can retrieve, store, process and plot the data. Usually the result is near-real-time.
You might want to check out RRDtool to see whether it meets your requirements.
RRDtool is a high performance data logging and graphing system for time series data.
I did a similar thing for a device that had a permeability sensor attached via RS232.
- package the bytes received from the sensor into packets
- use a collection (mainly a list) to store them
- prevent the collection from growing beyond a fixed size by discarding the oldest values before new ones arrive
- find a suitable graphics library to draw with (maybe SDL if you want to keep it easy and cross-platform), but this choice depends on what kind of graph you need (ncurses may be enough)
- last but not least: since you are using a sensor I suppose your approach will be multi-threaded, so think about it and use a synchronized collection, or a collection that allows adding values while other threads are retrieving them (so forget iterators; maybe an array is enough)
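The last point, a bounded collection shared between a reader thread and a plotting thread, might look like the sketch below. It is in Python with a plain lock; the names `SharedSamples` and `reader_thread` are hypothetical. Instead of handing out iterators, the plotting side takes a snapshot copy under the lock.

```python
import threading
from collections import deque


class SharedSamples:
    """Bounded, lock-protected store shared by a reader and a plotter thread."""

    def __init__(self, max_len):
        self._lock = threading.Lock()
        self._items = deque(maxlen=max_len)   # oldest values fall off the front

    def add(self, sample):
        with self._lock:
            self._items.append(sample)

    def snapshot(self):
        """Copy of the current contents, safe to iterate while adds continue."""
        with self._lock:
            return list(self._items)


def reader_thread(shared, samples):
    """Stand-in for the loop that receives packets from the sensor."""
    for sample in samples:
        shared.add(sample)
```

The plotting loop would call `snapshot()` at its own refresh rate and draw whatever it gets.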
Btw I think there are so many libraries, just search for them:
first
second
...
I assume that you will deploy this application on an RTOS. But what will the data rate be, and what are the real-time requirements? As written above, a simple solution may be more than enough. If you have hard real-time constraints, however, everything changes drastically. A multi-threaded design with data pipes may solve your real-time problems.
I am trying to compress TCP packets, each about 4 KB in size. The packets can contain any byte (from 0 to 255). All of the benchmarks on compression algorithms that I found were based on larger files. I did not find anything that compares the compression ratio of different algorithms on small files, which is what I need. I need it to be open source so that it can be implemented in C++, so no RAR, for example. What algorithm can be recommended for small files of about 4 kilobytes in size? LZMA? HACC? ZIP? gzip? bzip2?
Choose the algorithm that is the quickest, since you probably care about doing this in real time. Generally for smaller blocks of data, the algorithms compress about the same (give or take a few bytes) mostly because the algorithms need to transmit the dictionary or Huffman trees in addition to the payload.
I highly recommend Deflate (used by zlib and Zip) for a number of reasons. The algorithm is quite fast, well tested, permissively licensed (the zlib license), and is the only compression method required to be supported by Zip (as per the ZIP APPNOTE). Beyond the basics, when the encoder determines that the compressed output would be larger than the input, there is a STORE mode which adds only 5 bytes for every block of data (the maximum stored block is 64k bytes). Apart from STORE mode, Deflate supports two different types of Huffman tables (or dictionaries): dynamic and fixed. A dynamic table means the Huffman tree is transmitted as part of the compressed data; it is the most flexible option (for varying types of nonrandom data). The advantage of a fixed table is that it is known to all decoders and thus doesn't need to be contained in the compressed stream. The decompression (or Inflate) code is relatively easy. I've written both Java and JavaScript versions based directly on zlib and they perform rather well.
The other compression algorithms mentioned have their merits. I prefer Deflate because of its runtime performance in both the compression step and particularly the decompression step.
A point of clarification: Zip is not a compression type, it is a container. For doing packet compression, I would bypass Zip and just use the deflate/inflate APIs provided by zlib.
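As an illustration of bypassing the container, here is a small sketch using Python's `zlib` binding (the C API equivalents are deflateInit2/inflateInit2 with a negative windowBits). Passing `wbits=-15` asks for a raw deflate stream with no zlib header or checksum. The function names are made up for this example.

```python
import zlib


def deflate_packet(payload: bytes, level: int = 6) -> bytes:
    """Compress one packet as a raw deflate stream (no header, no checksum)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)  # negative wbits = raw
    return comp.compress(payload) + comp.flush()


def inflate_packet(data: bytes) -> bytes:
    """Inverse of deflate_packet."""
    return zlib.decompress(data, -15)
```

Since raw deflate carries no checksum, integrity then relies on the transport (TCP here) or on a check added by the application.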
This is a follow-up to Rick's excellent answer which I've upvoted. Unfortunately, I couldn't include an image in a comment.
I ran across this question and decided to try deflate on a sample of 500 ASCII messages that ranged in size from 6 to 340 bytes. Each message is a bit of data generated by an environmental monitoring system that gets transported via an expensive (pay-per-byte) satellite link.
The most fun observation is that the crossover point at which messages are smaller after compression is the same as the Ultimate Question of Life, the Universe, and Everything: 42 bytes.
To try this out on your own data, here's a little bit of node.js to help:
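// Compare raw-deflate output size with the original message size.
const zlib = require('zlib')
const sprintf = require('sprintf-js').sprintf  // npm install sprintf-js

// Placeholder message; substitute one of your own packets here.
const data_packet = Buffer.from('example message, replace with your own data')

const inflate_len = data_packet.length
const deflate_len = zlib.deflateRawSync(data_packet).length
// Size change in percent; negative means the deflated message is smaller.
const delta = +((deflate_len - inflate_len) / inflate_len * 100).toFixed(0)
console.log(`inflated,deflated,delta(%)`)
console.log(sprintf(`%03i,%03i,%3i`, inflate_len, deflate_len, delta))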
If you want to "compress TCP packets", you might consider using an RFC standard technique.
RFC1978 PPP Predictor Compression Protocol
RFC2394 IP Payload Compression Using DEFLATE
RFC2395 IP Payload Compression Using LZS
RFC3173 IP Payload Compression Protocol (IPComp)
RFC3051 IP Payload Compression Using ITU-T V.44 Packet Method
RFC5172 Negotiation for IPv6 Datagram Compression Using IPv6 Control Protocol
RFC5112 The Presence-Specific Static Dictionary for Signaling Compression (Sigcomp)
RFC3284 The VCDIFF Generic Differencing and Compression Data Format
RFC2118 Microsoft Point-To-Point Compression (MPPC) Protocol
There are probably other relevant RFCs I've overlooked.
All of those algorithms are reasonable to try. As you say, they aren't optimized for tiny files, but your next step is simply to try them. It will likely take only 10 minutes to test-compress some typical packets and see what sizes result. (Try different compression flags too.) From the resulting files you can likely pick out which tool works best.
The candidates you listed are all good first tries. You might also try bzip2.
Sometimes a simple "try them all" is a good solution when the tests are easy to do. Thinking too much sometimes slows you down.
I don't think the file size matters - if I remember correctly, the LZW in GIF resets its dictionary every 4K.
ZLIB should be fine. It is used in MCCP.
However, if you really need good compression, I would do an analysis of common patterns and include a dictionary of them in the client, which can yield even higher levels of compression.
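zlib supports this directly through a preset dictionary. The sketch below uses Python's `zlib` binding (the C calls are deflateSetDictionary and inflateSetDictionary). The function names and the idea of building the dictionary from common message fragments are assumptions for illustration; both sides must ship exactly the same dictionary bytes.

```python
import zlib


def compress_with_dict(payload: bytes, zdict: bytes) -> bytes:
    """Compress a small message using a dictionary known to both sides."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(payload) + comp.flush()


def decompress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Inverse of compress_with_dict; needs the identical dictionary."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(data) + decomp.flush()
```

The dictionary never travels over the wire, so even a packet of a few dozen bytes can be encoded mostly as back-references into it.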
I've had luck using the zlib compression library directly, without any file container. Containers such as ZIP and RAR have overhead to store things like filenames. I've seen compression this way yield positive results (output smaller than the original) for packets down to 200 bytes.
You may test bicom.
This algorithm is forbidden for commercial use.
If you want it for professional or commercial usage look at "range coding algorithm".
You can try delta compression. Compression will depend on your data. If you have any encapsulation on the payload, then you can compress the headers.
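One simple form of delta compression, sketched in Python: XOR each packet with the previous one and deflate the result. This is only one possible reading of "delta compression", and it assumes equal-length packets and that the receiver keeps the previous packet; the function names are hypothetical.

```python
import zlib


def xor_delta(current: bytes, previous: bytes) -> bytes:
    """Byte-wise XOR; unchanged bytes become zeros. XOR is its own inverse."""
    if len(current) != len(previous):
        raise ValueError("this sketch assumes equal-length packets")
    return bytes(c ^ p for c, p in zip(current, previous))


def pack_delta(current: bytes, previous: bytes) -> bytes:
    """Deflate the XOR difference against the previous packet."""
    return zlib.compress(xor_delta(current, previous))


def unpack_delta(packed: bytes, previous: bytes) -> bytes:
    """Rebuild the current packet from the delta and the previous packet."""
    return xor_delta(zlib.decompress(packed), previous)
```

When consecutive packets differ in only a few bytes, the delta is mostly zeros and compresses well even if each packet on its own looks random. If packets are unrelated, this gains nothing.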
I did what Arno Setagaya suggested in his answer: made some sample tests and compared the results.
The compression tests were done using 5 files, each of them 4096 bytes in size. Each byte inside of these 5 files was generated randomly.
IMPORTANT: In real life, the data would not likely be all random, but would tend to have quite a few repeating bytes. Thus in a real-life application the compression would tend to be a bit better than the following results.
NOTE: Each of the 5 files was compressed by itself (i.e. not together with the other 4 files, which would result in better compression). In the following results I just use the sum of the size of the 5 files together for simplicity.
I included RAR just for comparison reasons, even though it is not open source.
Results: (from best to worst)
LZOP: 20775 / 20480 * 100 = 101.44% of original size
RAR : 20825 / 20480 * 100 = 101.68% of original size
LZMA: 20827 / 20480 * 100 = 101.69% of original size
ZIP : 21020 / 20480 * 100 = 102.64% of original size
BZIP: 22899 / 20480 * 100 = 111.81% of original size
Conclusion: To my surprise, ALL of the tested algorithms produced output larger than the originals! I guess they are only good for compressing larger files, or files that have a lot of repeating bytes (not random data like the above). Thus I will not be using any type of compression on my TCP packets. Maybe this information will be useful to others who consider compressing small pieces of data.
EDIT:
I forgot to mention that I used default options (flags) for each of the algorithms.
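A comparable experiment is easy to repeat with the compressors in Python's standard library. This sketch is not the test described above (the tools and container formats differ, so the byte counts will not match); it only contrasts a 4096-byte block of pseudo-random bytes with an equally sized block of repetitive text. The sample text and the function name `compressed_sizes` are made up.

```python
import bz2
import lzma
import random
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Output size in bytes for each compressor, using default options."""
    return {
        "zlib": len(zlib.compress(data)),
        "bz2": len(bz2.compress(data)),
        "lzma": len(lzma.compress(data)),
    }


rng = random.Random(0)  # fixed seed so the run is repeatable
random_block = bytes(rng.getrandbits(8) for _ in range(4096))
text_block = (b"temp=21.5;humidity=40;status=OK\n" * 128)[:4096]

for name, block in [("random", random_block), ("repetitive", text_block)]:
    print(name, compressed_sizes(block))
```

Random bytes cannot be compressed by any lossless method, so every compressor adds its framing overhead on top; the repetitive block shrinks considerably. Testing on captured real packets is what actually answers the question.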