I am currently working on a Qt application which needs to deal with a (for me) huge data stream. I have 3 sensors, each providing 3 measurement values (plus a timestamp) at about 25 kHz. I'm pretty sure there will be more sensors in the future.
The application should run 24/7.
The task is now to
collect the data from the sensors
convert the measurement values
save them to a file
and visualize the converted values.
Parts 1 and 2 are working. For part 3 I have a simple ofstream doing its job well.
For part 4 I am currently unsure how to deal with the high storage amount the whole application will need. I need to visualize different parts of the measurements, sometimes the whole period (with low resolution), but sometimes also only a short period with nearly full resolution.
I'm currently storing all the values in a QVector and drawing the requested period in my custom QQuickItem, which is then shown via qml.
This concept does have 2 major problems:
the performance
the limited QVector size
Are there any other qt containers/concepts which will perform better?
Or do I need to use something like a time-series database? If so, any recommendations for a free one that works offline under Windows (with Qt)?
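For the "whole period at low resolution" case, one common trick (regardless of which container or database ends up holding the raw data) is min/max decimation before drawing, so the QQuickItem only ever gets a few thousand points. A minimal sketch, where the bucket count and the QPointF sample type are just assumptions:

```cpp
#include <QPointF>
#include <QVector>

// Reduce 'samples' to roughly 2 points per bucket (the min and the max),
// which preserves the visual envelope of the signal when zoomed out.
QVector<QPointF> decimate(const QVector<QPointF>& samples, int buckets)
{
    QVector<QPointF> out;
    const int n = int(samples.size());
    if (n == 0 || buckets <= 0)
        return out;

    const int perBucket = (n / buckets > 0) ? n / buckets : 1;
    for (int i = 0; i < n; i += perBucket) {
        const int end = (i + perBucket < n) ? i + perBucket : n;
        int lo = i, hi = i;
        for (int j = i; j < end; ++j) {
            if (samples[j].y() < samples[lo].y()) lo = j;
            if (samples[j].y() > samples[hi].y()) hi = j;
        }
        // Emit min and max in time order so the drawn envelope stays faithful.
        out.append(samples[lo < hi ? lo : hi]);
        if (lo != hi)
            out.append(samples[lo < hi ? hi : lo]);
    }
    return out;
}
```

Keeping the raw stream on disk and only decimated views in memory would also sidestep the QVector size limit mentioned above.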
Related
Currently, we have a data pipeline (written in python) that takes about 1 week to compute in parallel and occupies about 4 TB. We would like to investigate whether the data representation we chose is optimal and how different architectures perform on different data representations. Thus, we would like to be able to change the data representation and build models on the new representations. However, to produce a new representation we would have to wait 7 days or longer depending on the changes, and only then begin training our model on the data (3 days of training with the current model). Thus, our current cycle requires 10 days to generate the data and then train our neural net. This is very expensive in time and space if we would like to explore 50 different data representations.
Thus, we have begun rewriting the entire data pipeline in C++ to speed up the data generation. While this alleviates the time cost of generating the data, it does not solve our issue of storing that much data. We would much rather save only the metadata used to generate the data.
Is it possible for us to generate the data and pass it straight to tensorflow without ever writing it to disk?
The supercomputers we have access to have 16 CPUs and 4 GPUs per node. So we were wondering if we could train a model on the GPUs and generate the next batch of data on the CPUs while the net is training.
I have not been able to find anything on the internet that addresses this specific question. I am wondering whether this functionality is built into tensorflow or pytorch. We are almost done rewriting the data pipeline in C++ and are a few weeks away from trying to access the data from tensorflow. Again, our whole goal is never to write the data to disk.
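For what it's worth, the CPU/GPU overlap described here is essentially a bounded producer/consumer queue, independent of which framework ends up consuming the batches. A minimal C++ sketch of that pattern (the Batch type, capacity, and usage comments are made up; this is not TensorFlow or PyTorch API):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical "batch" of generated training data.
struct Batch { std::vector<float> features; std::vector<float> labels; };

// Bounded blocking queue: CPU workers push batches, the training loop pops them.
class BatchQueue {
public:
    explicit BatchQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(Batch b) {
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [&] { return q_.size() < capacity_; });
        q_.push(std::move(b));
        notEmpty_.notify_one();
    }

    Batch pop() {
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [&] { return !q_.empty(); });
        Batch b = std::move(q_.front());
        q_.pop();
        notFull_.notify_one();
        return b;
    }

private:
    std::size_t capacity_;
    std::queue<Batch> q_;
    std::mutex m_;
    std::condition_variable notEmpty_, notFull_;
};

// Usage sketch: several CPU threads call queue.push(generateBatch()) while the
// training loop calls queue.pop() and hands the batch to whatever framework
// binding is used; the queue capacity bounds memory, so nothing hits disk.
```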
I have been developing time-domain simulation research software in Fortran 2008. In the beginning, I thought it was a good idea to dump everything in binary for speed and then extract what I wanted with another program. I now want to shift to saving directly to HDF5.
During the simulation, at each time-step I get a vector of values (first number is the component and second number is the variable inside that component):
t_i Var1-1 Var1-2 ... VarN-M
I was thinking of creating several groups according to the classes of components in the simulation and then, within each group, another group for each instance that holds the data. Then, during the simulation, at each time step, I would append the values to each instance.
The data is always accessed as time-series, e.g. (t, Var1-1). I am not interested in snapshots of a single time instance.
My questions: should I keep the time vector in the root, common to all groups? Is there anything to avoid based on past experience (I would like to avoid design errors)? I am mainly concerned about performance. Right now I simply buffer all the generated vectors and dump them at the end in a single write, which gives very high performance.
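For reference, the usual append pattern in HDF5 is a chunked dataset with an unlimited time dimension that is extended on every buffered flush, which maps well onto the per-instance groups described above. A rough sketch using the HDF5 C API (the Fortran wrappers mirror these calls; names, chunk size, and variable count are assumptions):

```cpp
#include <hdf5.h>
#include <vector>

int main()
{
    const hsize_t nvars = 4;             // variables per component instance (assumed)
    hsize_t dims[2]    = {0, nvars};     // start empty in the time dimension
    hsize_t maxdims[2] = {H5S_UNLIMITED, nvars};
    hsize_t chunk[2]   = {4096, nvars};  // ~4096 time steps per chunk (assumed)

    hid_t file  = H5Fcreate("sim.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = H5Gcreate2(file, "/ComponentA", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hid_t space = H5Screate_simple(2, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    hid_t dset  = H5Dcreate2(group, "instance1", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    // Append one buffered block of 'nrows' time steps (row-major: nrows x nvars).
    hsize_t written = 0;
    hsize_t nrows   = 1000;
    std::vector<double> buffer(nrows * nvars, 0.0);  // stand-in for simulation output

    hsize_t newdims[2] = {written + nrows, nvars};
    H5Dset_extent(dset, newdims);

    hid_t fspace = H5Dget_space(dset);
    hsize_t start[2] = {written, 0};
    hsize_t count[2] = {nrows, nvars};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, nullptr, count, nullptr);
    hid_t mspace = H5Screate_simple(2, count, nullptr);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buffer.data());
    written += nrows;

    H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset);
    H5Pclose(dcpl);   H5Sclose(space);  H5Gclose(group); H5Fclose(file);
    return 0;
}
```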
I have to give some background first. I want to implement an optimized storage engine for OSM planet data (50 GB+). The purpose of this engine is to enable map area extractions as fast as possible, while also retaining the ability to apply minutely updates. The design I've chosen, for several reasons (not all of them mentioned here), is to use one data store per grid cell. E.g. think of all cells on a map being distinct files or databases: http://3.bp.blogspot.com/_CntRFtGsdQo/TTU5UMlLkTI/AAAAAAAAARk/_hW8n33t4Ok/s1600/utmworld.gif
(Just to give the idea, though; this is not the actual cell grid I'll be using.)
I have never used leveldb before, but settled on it for its bulk insert and update performance. However, I'd like to know about the "performance characteristics" when opening many very small and very large leveldb databases ("very small" meaning just a few kB, "very large" meaning a few hundred MB).
I expect that I will have to open/close somewhere between 10 and 100 DBs per minute. I'd rule out leveldb if it needs significant initialization time.
An answer to this question could be either concrete figures or insight into what leveldb does during initialization and how that relates to data/index size.
PS. I'll also do my own measurements of course. But as with all tests, I may draw wrong conclusions from my sample data.
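As far as I understand, opening a leveldb database mainly reads the CURRENT/MANIFEST files and replays any not-yet-compacted write-ahead log into memory, so startup cost tends to track the amount of recent unflushed data rather than the total database size; that is worth measuring separately. For the measurements, a minimal open / bulk-write / close sketch (the path and keys are made up, error handling kept minimal):

```cpp
#include <cassert>
#include <string>
#include <leveldb/db.h>
#include <leveldb/write_batch.h>

int main()
{
    leveldb::Options options;
    options.create_if_missing = true;

    leveldb::DB* db = nullptr;
    leveldb::Status s = leveldb::DB::Open(options, "cells/cell_12_34", &db);
    assert(s.ok());

    // Bulk insert: batch many puts into one atomic write.
    leveldb::WriteBatch batch;
    batch.Put("node/123456", "<serialized node>");
    batch.Put("way/654321", "<serialized way>");
    s = db->Write(leveldb::WriteOptions(), &batch);
    assert(s.ok());

    // Point lookup.
    std::string value;
    s = db->Get(leveldb::ReadOptions(), "node/123456", &value);

    delete db;   // closing the database is just deleting the handle
    return 0;
}
```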
I'm working on a Qt GUI for visualizing 'live' data which is received via a TCP/IP connection. The issue is that the data is arriving rather quickly (a few dozen MB per second) - it's coming in faster than I'm able to visualize it even though I don't do any fancy visualization - I just show the data in a QTableView object.
As if that's not enough, the GUI also has a 'Freeze' button which suspends updating the GUI (while data keeps being received in the background). As soon as the Freeze option is disabled, the data which has accumulated in the background should be visualized.
What I'm wondering is: since the data is coming in so quickly, I can't possibly hold all of it in the memory. The customer might even keep the GUI running over night, so gigabytes of data will accumulate. What's a good data storage system for writing this data to disk? It should have the following properties:
It shouldn't be too much work to use it on a desktop system
It should be fast at appending new data at the end. I never need to touch previously written data, so writing anywhere but at the end is not needed.
It should be possible to randomly access records in the data. This is because scrolling around in my GUI makes it necessary to quickly display entries N to N+20 (or whatever the height of my table is) in the data stream.
The data which is coming in can be separated into records, but unfortunately the records don't have a fixed size. I'd rather not impose a maximum size on them (at least not if it's possible to get good performance without doing so).
Maybe some SQL database, or something like CouchDB? It would be great if somebody could share their experience with such scenarios.
I think that sqlite might do the trick; it seems to be fast. Admittedly I don't have a data flow like yours, but it works well as a backend for a log recorder, and I have a GUI where you can view logs n to n+k.
You can also try SOCI as a C++ database access API; it seems to work fine with sqlite (I have not used it yet but plan to).
my2c
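A minimal sketch of that approach with the plain sqlite3 C API (the table name and schema are just an assumption; SOCI would wrap the same idea): append variable-size records, then read back rows n..n+k for the table view.

```cpp
#include <sqlite3.h>
#include <string>

int main()
{
    sqlite3* db = nullptr;
    sqlite3_open("records.db", &db);
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS log(id INTEGER PRIMARY KEY, payload BLOB);",
        nullptr, nullptr, nullptr);

    // Append: wrap many inserts in one transaction, otherwise every insert syncs.
    sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
    sqlite3_stmt* ins = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO log(payload) VALUES(?);", -1, &ins, nullptr);
    std::string record = "one variable-size record";
    sqlite3_bind_blob(ins, 1, record.data(), int(record.size()), SQLITE_TRANSIENT);
    sqlite3_step(ins);
    sqlite3_finalize(ins);
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

    // Random access: fetch the records currently visible in the table view.
    sqlite3_stmt* sel = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT payload FROM log WHERE id BETWEEN ? AND ?;", -1, &sel, nullptr);
    sqlite3_bind_int64(sel, 1, 100);   // n
    sqlite3_bind_int64(sel, 2, 120);   // n + k
    while (sqlite3_step(sel) == SQLITE_ROW) {
        const void* blob = sqlite3_column_blob(sel, 0);
        int size = sqlite3_column_bytes(sel, 0);
        (void)blob; (void)size;        // hand the record off to the view here
    }
    sqlite3_finalize(sel);
    sqlite3_close(db);
    return 0;
}
```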
I would recommend a simple file based solution.
If you can use fixed-size records: if you receive the data continuously at a constant sample rate, random access is easy and very fast when you know the timestamp of the first data point and the sample rate. If the sample rate varies, then write a timestamp with each data point. Random access then requires a binary search, but it is still fast enough.
If you have variable-size records: write the variable-size data to one file and, to a second file, write fixed-size index entries (offsets) into the data file. If the sample rate varies, write timestamps too. Random access is then fast via the index file.
If you are using Qt to implement this kind of solution, you need two sets of QFile and QDataStream instances, one for writing and one for reading.
And a note about performance: don't flush the file after every data point write. But remember to flush the file before doing any random access to it.
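A rough sketch of the variable-size-record variant in Qt, with one QFile for the records and one for the fixed-size offsets (the function names and QByteArray record type are made up; the files are assumed to be open already):

```cpp
#include <QByteArray>
#include <QDataStream>
#include <QFile>

// Writer side: append one record to the data file and its start offset to the
// index file (both files opened for appending).
void appendRecord(QFile& dataFile, QFile& indexFile, const QByteArray& record)
{
    QDataStream indexOut(&indexFile);
    indexOut << quint64(dataFile.pos());   // fixed-size index entry
    QDataStream dataOut(&dataFile);
    dataOut << record;                     // length-prefixed, variable-size payload
}

// Reader side: fetch record n via the index file (files opened for reading).
QByteArray readRecord(QFile& dataFile, QFile& indexFile, qint64 n)
{
    indexFile.seek(n * qint64(sizeof(quint64)));
    QDataStream indexIn(&indexFile);
    quint64 offset = 0;
    indexIn >> offset;

    dataFile.seek(qint64(offset));
    QDataStream dataIn(&dataFile);
    QByteArray record;
    dataIn >> record;
    return record;
}
```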
I'm going to write a program that plots data from a sensor connected to the computer. The sensor value is going to be plotted as a function of the time (sensor value on the y-axis, time on the x-axis). I want to be able to add new values to the plot in real time. What would be best to do this with in C++?
Edit: And by the way, the program will be running on a Linux machine
Are you particularly concerned about the C++ aspect? I've handled data at 10 Hz or so without breaking a sweat by putting gnuplot into a read/plot/refresh loop, or with LiveGraph, with no issues.
Write a function that can plot a std::deque in a way you like, then .push_back() values from the sensor onto the deque as they become available, and .pop_front() values from the deque when it becomes too long for nice plotting.
The exact nature of your plotting function depends on your platform, needs, sense of esthetics, etc.
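For example, a minimal rolling window along those lines (the window size and the hypothetical redraw call are assumptions):

```cpp
#include <deque>
#include <utility>

// Rolling window of (time, value) samples kept small enough to plot nicely.
std::deque<std::pair<double, double>> window;
const std::size_t kMaxSamples = 5000;   // assumed plot width

void addSample(double t, double value)
{
    window.push_back({t, value});
    while (window.size() > kMaxSamples)
        window.pop_front();             // drop the oldest sample
    // plotWindow(window);              // hypothetical redraw call
}
```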
You can use ring buffers. In such a buffer you have a read position and a write position, so one thread can write to the buffer while another reads from it and plots the graph. For efficiency you usually end up writing your own framework.
The size of such a buffer can be estimated from, e.g., the data delivery rate of the sensor (40 kHz?), the size of one sample, and the time span you would like to keep for plotting purposes.
It also depends on whether you would like to store the data uncompressed, or store the rendered plot, for further offline analysis. In a non-RTOS environment your "real-time" depends on processing speed: how fast you can retrieve, store, process, and plot the data. Usually you end up with near-real-time behaviour.
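A minimal single-producer/single-consumer ring buffer along those lines (capacity and element type are assumptions):

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// One acquisition thread pushes, one plotting thread pops; no locks needed.
template <typename T, std::size_t N>
class RingBuffer {
public:
    bool push(const T& item)   // called from the acquisition thread
    {
        const std::size_t w = write_.load(std::memory_order_relaxed);
        const std::size_t next = (w + 1) % N;
        if (next == read_.load(std::memory_order_acquire))
            return false;      // buffer full, sample dropped
        data_[w] = item;
        write_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& item)          // called from the plotting thread
    {
        const std::size_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire))
            return false;      // buffer empty
        item = data_[r];
        read_.store((r + 1) % N, std::memory_order_release);
        return true;
    }

private:
    std::array<T, N> data_{};
    std::atomic<std::size_t> write_{0};
    std::atomic<std::size_t> read_{0};
};
```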
You might want to check out RRDtool to see whether it meets your requirements.
RRDtool is a high performance data logging and graphing system for time series data.
I did a similar thing for a device that had a permeability sensor attached via RS232.
package the bytes received from the sensor into packets
use a collection (mainly a list) to store them
prevent the collection from growing beyond a fixed size by discarding the oldest values before new ones arrive
find a suitable graphics library to draw with (maybe SDL if you wanna keep it easy and cross-platform), but this choice depends on what kind of graph you need (ncurses may be enough)
last but not least: since you are using a sensor, I suppose your approach will be multi-threaded, so think about that and use a synchronized collection, or a collection that allows adding values while other threads are reading them (so forget iterators; maybe an array is enough)
Btw, I think there are plenty of libraries for this; just search for them.
I assume that you will deploy this application on an RTOS. But what will the data rate be, and what are the real-time requirements? As written above, a simple solution may be more than enough, but if you have hard real-time constraints everything changes drastically. A multi-threaded design with data pipes may solve your real-time problems.