I have been developing time-domain simulation research software in Fortran 2008. Initially I thought it was a good idea to dump everything in binary for speed and then extract what I wanted with other software. I now want to switch to saving directly to HDF5.
During the simulation, at each time step I get a vector of values (in the names below, the first index is the component and the second is the variable inside that component):
t_i Var1-1 Var1-2 ... VarN-M
I was thinking of creating several groups according to the classes of components in the simulation, and within each group another group per instance that would hold the data. Then, during the simulation, at each time step, I would append the new values to each instance.
The data is always accessed as time-series, e.g. (t, Var1-1). I am not interested in snapshots of a single time instance.
My questions: Should I keep the time in the root, common to all datasets? Is there anything to avoid based on past experience (I would like to avoid design errors)? I am mainly concerned about performance. Right now I simply buffer all the generated vectors and dump them at the end in a single write, which gives very high performance.
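For illustration, here is a minimal sketch of the buffered-append pattern with a chunked, extendable dataset, written in C++ against the HDF5 C API (the Fortran 2008 bindings mirror these calls); the group path, column count, and block size are assumptions, not from the question:

    #include <hdf5.h>
    #include <vector>

    int main() {
        const hsize_t NCOLS = 4, BLOCK = 1024;   // 1 + N*M columns; rows per flush
        hid_t file = H5Fcreate("sim.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t g1 = H5Gcreate2(file, "/ClassA", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        hid_t g2 = H5Gcreate2(file, "/ClassA/instance1", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        // Chunked layout + unlimited first dimension is what makes appending possible.
        hsize_t dims[2]    = {0, NCOLS};
        hsize_t maxdims[2] = {H5S_UNLIMITED, NCOLS};
        hsize_t chunk[2]   = {BLOCK, NCOLS};
        hid_t space = H5Screate_simple(2, dims, maxdims);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);
        hid_t dset = H5Dcreate2(file, "/ClassA/instance1/data", H5T_NATIVE_DOUBLE,
                                space, H5P_DEFAULT, dcpl, H5P_DEFAULT);

        // Buffer many time steps, then flush them in one write, mirroring the
        // current "buffer everything, dump once" strategy. 'rows_written' would
        // advance by BLOCK on every flush inside the time loop.
        std::vector<double> buffer(BLOCK * NCOLS, 0.0);  // filled by the solver
        hsize_t rows_written = 0;
        hsize_t newdims[2] = {rows_written + BLOCK, NCOLS};
        H5Dset_extent(dset, newdims);
        hid_t fspace = H5Dget_space(dset);
        hsize_t start[2] = {rows_written, 0}, count[2] = {BLOCK, NCOLS};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t mspace = H5Screate_simple(2, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buffer.data());

        H5Sclose(mspace); H5Sclose(fspace); H5Sclose(space);
        H5Pclose(dcpl); H5Dclose(dset); H5Gclose(g2); H5Gclose(g1); H5Fclose(file);
    }

Since the data are always read as time series, a single shared time dataset in the root (or time as column 0 of each instance dataset, as above) are both workable layouts; flushing large buffers keeps the write pattern close to the current single binary dump.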
I am currently working on an application (Qt) which needs to deal with a (for me) huge data stream. I have 3 sensors which each provide 3 measurement values (plus a timestamp) at about 25 kHz. I'm pretty sure that there will be more sensors in the future.
The application should run 24/7.
The task is now to
collect the data from the sensors
convert the measurement values
save them to a file
and visualize the converted values.
Parts 1 and 2 are working. For part 3 I have a simple ofstream doing its task well.
For part 4 I am currently unsure how to deal with the large amount of data the whole application will accumulate. I need to visualize different parts of the measurements: sometimes the whole period (at low resolution), but sometimes only a short period at nearly full resolution.
I'm currently storing all the values in a QVector and drawing the requested period in my custom QQuickItem, which is then shown via QML.
This concept does have 2 major problems:
the performance
the limited QVector size
Are there any other Qt containers/concepts which will perform better?
Or do I need to use something like a time-series database? If so, any recommendations for a free one working (offline) under Windows (with Qt)?
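One widely used technique for this kind of multi-resolution visualization is min/max decimation: reduce the raw samples to a bounded number of (min, max) pairs per screen-width bucket, so a zoomed-out view never touches millions of points. A sketch in plain C++ (no Qt dependency; all names are illustrative):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Reduce 'samples' to at most 'buckets' (min, max) pairs; a fully
    // zoomed-in view can fall back to the raw samples instead.
    struct MinMax { double min, max; };

    std::vector<MinMax> decimate(const std::vector<double>& samples, std::size_t buckets) {
        std::vector<MinMax> out;
        if (samples.empty() || buckets == 0) return out;
        const std::size_t per = std::max<std::size_t>(1, samples.size() / buckets);
        for (std::size_t i = 0; i < samples.size(); i += per) {
            const std::size_t end = std::min(i + per, samples.size());
            const double lo = *std::min_element(samples.begin() + i, samples.begin() + end);
            const double hi = *std::max_element(samples.begin() + i, samples.begin() + end);
            out.push_back({lo, hi});
        }
        return out;
    }

Keeping a few precomputed levels (for example one per factor-of-10 zoom) bounds both drawing time and memory, while the raw stream keeps going to disk through the existing ofstream.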
I am currently working on a big dataset (approximately a billion data points) and I have decided to use C++ over R, in particular for convenience in memory allocation.
However, there does not seem to be an equivalent of RStudio for C++ that could "store" the data set and avoid having to read the data every time I run the program, which is extremely time-consuming...
What kind of techniques do C++ users use for big data in order to read the data "once for all"?
Thanks for your help!
If I understand what you are trying to achieve, i.e. load some data into memory once and use the same data (in memory) across multiple runs of your code, with possible modifications to that code, there is no such IDE, as IDEs are not meant to store any data.
What you can do is first load your data into some in-memory database and write your C++ program to read data from that database instead of reading it directly from the data source.
how avoid multiple reads of big data set. What kind of techniques do C++ users use for big data in order to read the data "once for all"?
I do not know of any C++ tool with such capabilities, though I doubt I have ever searched for one... it seems like something you might build. Keywords appear to be 'data frame' and 'statistical analysis' (and C++).
If you know the 'data set' format and wish to process the raw data no more than one time, you might consider using POSIX shared memory.
I can imagine that (a) the 'extremely time consuming' effort could read and transform the 'raw' data, then write it into a 'data set' (a file) suitable for future efforts (i.e. 'once and for all').
Then (b) future efforts can 'simply' map the created 'data set' (a file) into the program's memory space, all ready for use with no (or at least a much reduced) time-consuming effort.
Expanding the memory map of your program relies on POSIX access to shared memory. (Ubuntu 17.10 has it; I have 'gently' used it in C++.) The terminology includes shm_open, mmap, munmap, shm_unlink, and a few others.
From 'man mmap': mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in ...
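A minimal sketch of step (b) on Linux; the file name dataset.bin and the element type (double) are assumptions, and error handling is kept short:

    #include <cstddef>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        // Map a previously prepared binary 'data set' straight into the
        // address space instead of re-reading and re-parsing it every run.
        int fd = open("dataset.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
        const size_t len = static_cast<size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        // The analysis works directly on values[0..n); pages are faulted in
        // on demand and stay in the OS page cache across runs of the program.
        const double* values = static_cast<const double*>(p);
        const size_t n = len / sizeof(double);
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i) sum += values[i];
        std::printf("n=%zu sum=%f\n", n, sum);

        munmap(p, len);
        close(fd);
    }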
how avoid multiple reads of big data set. What kind of techniques do C++ users use for big data in order to read the data "once for all"?
I recently retried my hand at measuring std::thread context-switch duration (on my Ubuntu 17.10, 64-bit desktop). My app captured fewer than 30 million entries over 10 seconds of measurement time. I also experimented with longer measurement times and with larger captures.
As part of debugging info capture, I decided to write intermediate results to a text file, for a review of what would be input to the analysis.
The code spent only about 2.3 seconds saving this info to the capture text file. My original software would then proceed with the analysis.
But this delay before getting on with testing the analysis results (more than 12 s: 10 s of measurement plus 2.3 s of writing) quickly became tedious.
I found the analysis effort otherwise challenging, and recognized I might save time by capturing intermediate data, thus avoiding most (but not all) of the data measurement and capture effort. So the debug capture to an intermediate file became a convenient split point in the overall effort.
Part 2 of the split app reads the <30-million-byte intermediate file in somewhat less than 0.5 seconds, very much shortening the analysis development cycle (edit-compile-link-run-evaluate), which was (usually) no longer burdened with the 12+ seconds of measurement and data generation.
While 28 MB is not BIG data, I valued the time savings for my analysis code development effort.
FYI - my intermediate file contained a single letter for each 'thread entry into the critical section' event. With 10 threads, the letters were 'A', 'B', ... 'J'. (Reminds me of DNA encoding.)
For each thread, my analysis supported splitting the counts per thread. Where VxWorks would 'balance' the threads blocked at a semaphore, Linux does NOT... which was new to me.
Each thread ran a different number of times through the single critical section, but each thread got about 10% of the opportunities.
Technique: simple encoded text file with captured information ready to be analyzed.
Note: I was expecting to test piping the output of app part 1 into app part 2. I still could, I guess. WIP.
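A minimal sketch of the technique (the file name is an assumption): part 2 slurps the whole intermediate file in one buffered read and tallies entries per thread:

    #include <array>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        std::ifstream in("capture.txt", std::ios::binary);   // hypothetical name
        std::ostringstream ss;
        ss << in.rdbuf();                       // read the whole file at once
        const std::string events = ss.str();    // one letter per event, 'A'..'J'

        std::array<long, 10> counts{};          // one slot per thread
        for (char c : events)
            if (c >= 'A' && c <= 'J') ++counts[c - 'A'];
        for (int t = 0; t < 10; ++t)
            std::cout << char('A' + t) << ": " << counts[t] << '\n';
    }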
My current approach:
I have one domain class - Application
Each application in my system is stored in the "applications" bucket under an APPLICATION_KEY key.
Apart from the application metadata stored in this bucket, each application has its own bucket called "time_metrics/APPLICATION_KEY" where I store the time series as:
KEY - timestamp / VALUE - some attributes
My concern is the efficiency of queries made over a specific time window for a given application. Currently, to get a time series from a specific time window and eventually make some reductions, I have to run a map/reduce over the whole "time_metrics/APPLICATION_KEY" bucket, which, as I have found, is not the recommended use case for Riak MapReduce.
My question: what would be the best DB structure for this kind of system, and how can it be queried efficiently?
Adding onto #macintux's answer.
Basho has had a few customers that have used Riak for time-series metrics.
Boundary has a nice tech talk about how they use Riak with their network monitoring software. They roll data up into different chunks of time (1m, 5m, 15m) for analysis.
They also have a series of blog posts about lessons learned while implementing this system.
Kivra also has a good slide deck about how they use time-series data with Riak.
You could roll your data up into blocks of some arbitrary time length, then read the range you need by issuing regular K/V gets, and reconstruct the larger picture / reduce in your application.
If you have spare computing power and you know in advance what keys you need, you certainly can use Riak's MapReduce, but often retrieving the keys and running your processing on the client will be as fast (and won't strain your cluster).
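For illustration only (the exact key scheme is an assumption, not from the post): with 5-minute roll-up blocks, the keys covering a time window can be computed directly, turning a window query into plain K/V gets:

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    // One object per 5-minute block per application, so a time window
    // becomes a handful of computable keys instead of a bucket-wide scan.
    static const std::int64_t kBlockSecs = 300;

    std::string block_key(const std::string& app_key, std::int64_t unix_secs) {
        return "time_metrics/" + app_key + "/" +
               std::to_string((unix_secs / kBlockSecs) * kBlockSecs);
    }

    std::vector<std::string> keys_for_window(const std::string& app_key,
                                             std::int64_t from, std::int64_t to) {
        std::vector<std::string> keys;
        for (std::int64_t t = (from / kBlockSecs) * kBlockSecs; t <= to; t += kBlockSecs)
            keys.push_back(block_key(app_key, t));   // one regular GET per key
        return keys;
    }

    int main() {
        for (const auto& k : keys_for_window("APPLICATION_KEY", 1000, 1900))
            std::cout << k << '\n';
    }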
Some general ideas:
Roll up your data into larger blocks
If you're concerned about losing data if your client crashes while buffering it, you can always store the data as it arrives
Similar idea: store the data as it arrives, then retrieve it and roll it up at certain intervals
You can automatically expire data once you're confident it is being reliably stored in larger blocks, using either the Bitcask or Memory backends
Memory backend is quite useful (RAM permitting) for any data that only needs to be stored for a limited period of time
Related: don't be afraid to store multiple copies of your data to make reading/reporting easier later
Multiple chunks of time (5- and 15-minute blocks, for example)
Multiple report formats
Having said all that, if you're doing straight key/value requests (ideally you should always be able to compute the keys you need, rather than relying on indexing or searching), Riak can support very heavy traffic loads, so I wouldn't recommend spending too much time creating alternative storage mechanisms unless you know you're going to face latency problems.
Imagine the following situation:
There is a distributed key/value database stored on a computer network: one central "main" computer that fetches requests, and multiple child machines that store portions of the data.
I.e. something like this:
main computer
|
+--child A
+--child B
+--child C
.....
I.e. "star" topology.
Additional description:
Portions of the database overlap, and several different versions of a record with the same "key" can be stored on several machines at the same time.
A key is not guaranteed to exist on all machines or on any specific machine.
"Children" do not synchronize data with each other.
Data is requested/read only via the main computer, which must return the most recent version of the data for the requested key.
Data is written only through the children - they receive new values from several sources.
Data is never deleted.
Now the main problem:
With such structure, how do I determine which version is most recent?
I can think of two ways to deal with the problem:
Add a timestamp to every record when it is written into the database via a child machine, and use the timestamp to determine the version.
Use a "revision number" or "write operation index" (issued by the main computer, incremented by one for every write operation) instead of timestamps.
However, neither approach is perfect:
The 1st approach requires perfect clock synchronization across all machines; otherwise the system will fail to deliver the most recent record value.
The 2nd approach makes every child ask the main machine for a revision number over the network, which will introduce write delays; in addition, the main machine will have to be locked by a mutex, so multithreading performance will suffer.
What is a better way to deal with this situation?
How do real clustered databases deal with it (finding the most recent record version in a cluster)?
Your statement that the first approach requires perfect clock synchronization is not correct.
You do not care about the absolute timestamps issued by a child, only the relative timestamps. So as long as the clocks advance at the same rate, they need not be synchronized; you can correct for the known offsets.
If the clocks on the children advance at different rates, then you must use a method which involves coordination (writing cannot be lock-free in the slow path). This is provable by contradiction: two children independently writing a value with time records that cannot be related to each other will obviously not let an outside observer determine which was written later.
However, you can do the coordination in parallel with the actual write: write to the child and, simultaneously, to an ordered log which allows a determination of which write happened first (you don't need a ticket-type system like you seem to suggest if you've got a write log). So it doesn't necessarily delay the process of writing at all!
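A toy sketch of that idea (the Record layout and the atomic counter standing in for the ordered log are illustrative, not any real system's API):

    #include <atomic>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <utility>

    // The shared ordered log is reduced to one atomic counter here; each
    // write obtains its log position in parallel with the data write to
    // the child, so the write path itself is not delayed.
    std::atomic<std::uint64_t> log_seq{0};

    struct Record {
        std::string value;
        std::uint64_t order;   // position in the ordered log, not wall-clock time
    };

    Record write_record(std::string value) {
        // Reserving the slot is one atomic increment; shipping 'value' to
        // the child can overlap with it.
        return Record{std::move(value), log_seq.fetch_add(1)};
    }

    // The main computer resolves conflicting versions by log order alone.
    const Record& most_recent(const Record& a, const Record& b) {
        return a.order >= b.order ? a : b;
    }

    int main() {
        Record r1 = write_record("v1"), r2 = write_record("v2");
        std::cout << most_recent(r1, r2).value << '\n';   // prints "v2"
    }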
Take a look at logical-timestamp key/value systems like Accumulo, an HBase alternative (currently in Apache project incubation) - this is a real-world clustered database doing exactly what you're asking for.
I'm going to write a program that plots data from a sensor connected to the computer. The sensor value is going to be plotted as a function of time (sensor value on the y-axis, time on the x-axis). I want to be able to add new values to the plot in real time. What would be the best way to do this in C++?
Edit: And by the way, the program will be running on a Linux machine
Are you particularly concerned about the C++ aspect? I've handled data at 10 Hz or so rates without breaking a sweat by putting gnuplot into a read/plot/refresh loop, or with LiveGraph, with no issues.
Write a function that can plot a std::deque in a way you like, then .push_back() values from the sensor onto the deque as they become available, and .pop_front() values from it if it becomes too long for nice plotting.
The exact nature of your plotting function depends on your platform, needs, sense of esthetics, etc.
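A minimal sketch of that loop, with read_sensor() and plot() as stubs standing in for the real acquisition and drawing code:

    #include <cstddef>
    #include <deque>
    #include <iostream>

    double read_sensor() { return 0.0; }   // stub: replace with real sensor I/O
    void plot(const std::deque<double>& w) { std::cout << w.size() << '\n'; }

    int main() {
        const std::size_t kMaxPoints = 1000;   // width of the plotted window
        std::deque<double> window;
        for (int i = 0; i < 5000; ++i) {       // stand-in for the forever loop
            window.push_back(read_sensor());   // newest value enters on the right
            if (window.size() > kMaxPoints)
                window.pop_front();            // oldest value drops off the left
            plot(window);
        }
    }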
You can use a ring buffer. In such a buffer you have a read position and a write position; this way one thread can write to the buffer while another reads and plots a graph. For efficiency you usually end up writing your own framework.
The size of such a buffer can be estimated from, e.g., the data delivery rate of the sensor (40 kHz?), the size of one probe, and the time span you would like to keep for plotting purposes.
It also depends on whether you would like to store the data uncompressed or store the rendered plot - all for further offline analysis. In a non-RTOS environment your "real time" depends on processing speed: how fast you can retrieve, store, process, and plot the data. Usually it is near-real-time efficiency.
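A minimal single-producer/single-consumer ring buffer along those lines (the class name and sizing are illustrative; the capacity follows the estimate above, e.g. 40,000 samples/s of doubles kept for 10 s is about 3.2 MB):

    #include <atomic>
    #include <cstddef>
    #include <vector>

    // One thread push()es samples, another pop()s them for plotting.
    // One slot is kept free to distinguish full from empty.
    class RingBuffer {
    public:
        explicit RingBuffer(std::size_t capacity) : buf_(capacity) {}

        bool push(double v) {                                  // producer thread
            const std::size_t w = write_.load(std::memory_order_relaxed);
            const std::size_t next = (w + 1) % buf_.size();
            if (next == read_.load(std::memory_order_acquire)) return false; // full
            buf_[w] = v;
            write_.store(next, std::memory_order_release);
            return true;
        }

        bool pop(double& v) {                                  // plotting thread
            const std::size_t r = read_.load(std::memory_order_relaxed);
            if (r == write_.load(std::memory_order_acquire)) return false;   // empty
            v = buf_[r];
            read_.store((r + 1) % buf_.size(), std::memory_order_release);
            return true;
        }

    private:
        std::vector<double> buf_;
        std::atomic<std::size_t> write_{0}, read_{0};
    };

    int main() {
        RingBuffer rb(40000 * 10);   // ~40 kHz for 10 s of doubles (~3.2 MB)
        rb.push(1.5);
        double v;
        while (rb.pop(v)) { /* plot v */ }
    }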
You might want to check out RRDtool to see whether it meets your requirements.
RRDtool is a high performance data logging and graphing system for time series data.
I did a similar thing for a device that had a permeability sensor attached via RS232.
package the bytes received from the sensor into packets
use a collection (mainly a list) to store them
prevent the collection from growing over a fixed size by trashing the least recent values before new ones arrive
find a suitable graphics library to draw with (maybe SDL if you want to keep it easy and cross-platform), but this choice depends on what kind of graph you need (ncurses may be enough)
last but not least: since you are using a sensor, I suppose your approach will be multi-threaded, so think about it and use a synchronized collection, or a collection that allows adding values while other threads are retrieving them (so forget iterators; maybe an array is enough)
By the way, I think there are many libraries for this; just search for them.
I assume that you will deploy this application on an RTOS. But what will the data rate be, and what are the real-time requirements? As written above, a simple solution may be more than enough. But if you have hard real-time constraints, everything changes drastically. A multi-threaded design with data pipes may solve your real-time problems.