Continuous Stream Calculations Across Blocks of Data - c++

What is the recommended method(s) for accessing data from a previous or upcoming block of data from a bulkio input stream? Edit: I am using c++.
For example, if I wanted to perform convolution on a stream of incoming data where each calculation is dependent on some number of values forward/backward in the stream how would I reach into the "next" block of data or "previous" block of data to perform the convolution calculations at the block boundaries? Or temporarily store this information somewhere within the component so it can be used from one block to the next?
Or a simpler example, if I send a repeating vector of 8 octet values into my component, I want the component to simply flip from 0 to 1 or vice versa whenever a 1 is received (dependent on last index of the previous block of data to calculate the first index of the next block of data).
Desired:
in: [0,0,0,1,0,0,0] [0,0,0,1,0,0,0] [0,0,0,1,0,0,0] ->
out: [1,1,1,0,0,0,0] [0,0,0,1,1,1,1] [1,1,1,0,0,0,0]
What I have been able to achieve:
in: [0,0,0,1,0,0,0] [0,0,0,1,0,0,0] [0,0,0,1,0,0,0] ->
out: [1,1,1,0,0,0,0] [1,1,1,0,0,0,0] [1,1,1,0,0,0,0]
I've thought about storing the relevant information from the previously processed block in a variable somewhere inside the component's serviceFunction() code, although I haven't found a way to do this without the value being reinitialized (is each block of data a new call to serviceFunction()?).
Alternatively, I thought about making a read-only property to hold the values I care about, but I suspect there may be a better approach I am unaware of.
Thanks,
-Mark

If you need to retain some number of samples between reads, BulkIO input streams support data overlapping. You provide both a number of samples to read, and a number of samples to consume (which is necessarily less than or equal to the read size), and the next read will start at the first sample that was not consumed. Refer to the REDHAWK manual section on the BulkIO Stream API (5.7.3 in the 2.0.8 manual) for further details. You can also maintain history by simply keeping the last data block read as a member variable of your component class.
Your simple example suggests a state machine rather than the contents of the last read. In general, if there is information you want to store between iterations of your serviceFunction(), you are free to add your own member variables to your component class (e.g., "MyComponent_i"). It is not necessary to declare a property to add members. Only the base class (e.g., "MyComponent_base") is intended to be regenerated as you modify the properties or ports of the component, so any changes you make to the component class are preserved.
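As a minimal, self-contained sketch of that state-machine idea (the FlipFlop class here is purely illustrative; in a real component the member would live on MyComponent_i and the loop would run inside serviceFunction()):

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: a member variable carries state across calls, just as a
// member of MyComponent_i would carry it across serviceFunction() invocations.
struct FlipFlop {
    bool state_ = true;  // initial state chosen so the first block matches the desired output

    std::vector<unsigned char> process(const std::vector<unsigned char>& in) {
        std::vector<unsigned char> out(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) {
            if (in[i] == 1) {
                state_ = !state_;  // toggle whenever a 1 is received
            }
            out[i] = state_ ? 1 : 0;
        }
        return out;
    }
};
```

Because the state is a member rather than a local variable, the flip carries over from the last index of one block to the first index of the next.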

Related

RocksDb: Multiple values per key (c++)
What I am trying to do
I am trying to adapt my simple blockchain implementation to save the blockchain to the hard drive periodically, so I looked into different DB solutions. I decided to use RocksDB due to its ease of use and good documentation & examples. I read through the documentation but could not figure out how to adapt it to my use case.
I have a class Block:
```cpp
#include <cstdint>
#include <string>
#include <vector>

class Block {
public:
    std::string PrevHash;
private:
    blockheader header;                 // The header of the block (type defined elsewhere)
    uint32_t index;                     // Height of this block
    std::vector<tx_data> transactions;  // All transactions in the block (tx_data defined elsewhere)
    std::string hash;                   // The hash of the block
    uint64_t timestamp;                 // The timestamp this block was created by the node
    std::string data;                   // Extra data that can be appended to blocks (e.g. text or a
                                        // smart contract) - the larger this field, the higher the fee;
                                        // the max size is defined in config.h
};
```
It contains a few variables and a vector of a struct tx_data. I want to load this data into a RocksDB database.
What I have tried
After Google failed to return any results on storing multiple values with one key pair, I decided I would have to just enclose each block's data with 0xa1 at the beginning and 0x2a at the end:
*0xa1*
header
index
txns
hash
timestamp
data
*0x2a*
but decided there was surely a simpler way. I tried looking at the code used by TurtleCoin, a currency that uses RocksDB for its database, but the code there is practically indecipherable. I have heard about serialization, but there seems to be little info out there on it.
Perhaps I am misunderstanding the use of a DB?
You need to serialize it. Serialization is the process of taking a structured set of data and turning it into one string, number, or vector of bytes that can later be de-serialized back into that struct. One method would be to take the hash of the block and use it as the key in the DB, then create a new struct which does not contain the hash. Then write a function that takes a Block struct and a path, constructs a BlockNoHash struct, and saves it, and another function that reads a block by its hash and produces a Block struct. Very basically, you could split each field with a character which will never occur in the data (e.g. ` or |), though this means that if one piece of the data is corrupted, you can't get any of the other data.
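For example, a minimal sketch of that delimiter-based serialization, assuming '|' never appears in the data; the block header and the transactions vector are omitted here and would need their own encoding:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Join the simple fields of a block with a delimiter; the resulting string is
// stored as the RocksDB value, keyed by the block's hash.
std::string serializeBlock(const std::string& prevHash, uint32_t index,
                           uint64_t timestamp, const std::string& data) {
    std::ostringstream out;
    out << prevHash << '|' << index << '|' << timestamp << '|' << data;
    return out.str();
}
```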
There are two related questions here.
One is: how do you store complex data -- more than just a simple integer or string -- within a key-value store like RocksDB. As Leo says, you need to serialize them.
Rather than writing your own code, the typical easier way is to use a framework like Protobuf or Thrift to generate code to translate between your in-memory structures and a flat bytes representation suitable to store in a database (or send over the network.)
A related question, from the title: how do you store multiple values per key?
There are two main options:
Use a compound key that distinguishes the various values. By walking a key prefix, you can find all the values in a set of related keys. This is better if the values get very large or if you want to find and update them independently.
Or, make the value for a single key actually be a compound object that includes several inner values. This is easiest if you always want to fetch all the sub-values in a single operation.
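For illustration, a rough sketch of the compound-key option against the RocksDB C++ API; the key layout (blockHash + "/header" and so on) is an assumption, not a convention RocksDB imposes:

```cpp
#include <memory>
#include <string>
#include <rocksdb/db.h>

// Compound-key option: one key per field, all sharing the block hash as a prefix.
void storeBlockFields(rocksdb::DB* db, const std::string& blockHash,
                      const std::string& serializedHeader,
                      const std::string& serializedTxns) {
    db->Put(rocksdb::WriteOptions(), blockHash + "/header", serializedHeader);
    db->Put(rocksdb::WriteOptions(), blockHash + "/txns", serializedTxns);
}

// Walk every value stored under one block hash by iterating the key prefix.
void readBlockFields(rocksdb::DB* db, const std::string& blockHash) {
    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
    for (it->Seek(blockHash); it->Valid() && it->key().starts_with(blockHash); it->Next()) {
        // it->key() and it->value() are rocksdb::Slice; use .ToString() as needed
    }
}
```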

FatFS - can I create multiple seek locations?

I have a working integration of FatFS in my C++ application running on a Cortex M4-based platform.
My application logs data to a file format called MDF.
On the implementation side, I log data (to a given file) in batches of buffers; the number of buffers depends on how fast I acquire the data: log a batch of one buffer... do other stuff... log a batch of five buffers... do other stuff... etc.
There is also a header which is 24 bytes and contains the number of bytes of data. On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
This means that I have to call f_lseek before writing the header and then before I write the batch of data.
I am using f_cache_fptr so f_lseek is not painfully slow but I'd like to avoid needing to call f_lseek so frequently.
QUESTION
Is it possible to somehow have 2 seek locations so that I don't need to call f_lseek to ping-pong between the header location and the data location?
I am open to modifying FatFS.
The problem, at the low level, is simpler because the header only shares one 512-byte sector with the data: 24 bytes of header followed by 488 bytes of data.
Is it possible to somehow have 2 seek locations so that I don't need to call f_lseek to ping-pong between the header location and the data location?
Not as far as I can tell, no, and it doesn't really seem to make sense. A FIL has only one current position, indicating where the next data written to it will go. What would it even mean for there to be two? How would the system know where to write? It certainly wouldn't be correct to write to both places.
Note in particular that with some operating systems and file systems, it is possible to open the same file more than once, but FatFS supports duplicate file opens only when all openings involved are for read-only mode.
I guess it would be possible to modify FatFS to give it the ability to store one file position when you seek to another, and then later to return to the first. So that would mean adding at least one member to the FIL structure, and adding at least one new function.
But why muck with the innards of FatFS? That's going to be at least a little risky. As long as you have to add a function anyway, how about just implementing a FRESULT my_f_write_at_beginning(FIL* fp, const void* buff, UINT btw, UINT* bw) on top of the existing functions? It can store the current position, seek to the beginning of the file, perform the write (maybe ensuring that the full number of bytes specified is written), and then seek back to the original position.
But fundamentally, no, there is no escaping ping-ponging back and forth, because doing so is part of the requirement you laid out.
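For illustration, such a wrapper could be built on the existing FatFS calls alone (f_tell, f_lseek, f_write); note that older FatFS revisions use DWORD rather than FSIZE_t for file offsets:

```cpp
#include "ff.h"  // FatFS

// Remember the current (data) position, rewrite the header at offset 0,
// then seek back. Error handling is abbreviated.
FRESULT my_f_write_at_beginning(FIL* fp, const void* buff, UINT btw, UINT* bw) {
    FSIZE_t saved = f_tell(fp);               // where the data stream currently is
    FRESULT res = f_lseek(fp, 0);             // jump to the header at the start of the file
    if (res != FR_OK) return res;
    res = f_write(fp, buff, btw, bw);         // rewrite the 24-byte header
    if (res != FR_OK) return res;
    if (bw && *bw != btw) return FR_DISK_ERR; // ensure the full header was written
    return f_lseek(fp, saved);                // return to the original position
}
```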
On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
More correctly: you need to save the buffer and the header (footer?), update the directory entry to reflect the new file size, and update the file allocation table to account for the sectors allocated; and you need to write to at least 3 completely separate sectors "atomically" so that everything is consistent if there's a power failure at the wrong time.
This isn't entirely possible on most hardware.
However, there is a way to do it "somewhat safely". Specifically:
pre-allocate enough clusters for a completely new copy of the file (including the new data to append to the end) and update the file allocation table accordingly. If there's a power failure while doing this (or immediately after this point) the risk is lost clusters, which is an "ignore-able" problem that will waste some space but can be fixed easily with a typical "check disk" utility.
create a whole new copy of the file's data in the pre-allocated clusters (copy the old data, then append the new data and header). If there's a power failure in the middle of doing this (or immediately after this point), then the risk is the same as before - just some lost clusters (ignore-able).
atomically update the directory entry; changing both the file size and the "starting cluster number" with the same atomic (single sector) write. If there's a power failure after this point the risk is the same lost clusters (where the old version of the file's data was instead of where the new version of the file data is).
free the clusters that the old version of the file used by doing writes to the file allocation table. After this point you've completed successfully, so a power failure is fine.
To make this less awful for performance you can have two "cluster chains" and alternate between them; such that one chain of clusters is for the current version of the file and the other will become the next version of the file. This avoids the need to copy a lot of older data from one place to another (if you know the old data is still in previously used clusters). It could also avoid the need to allocate and free most clusters in the file allocation table, but only with a significant increase in the risk of lost clusters.
Of course for any of this to work you'd need a guarantee that single-sector writes are atomic; and you can't be using FAT12 (where an entry in the file allocation table can be split by a sector boundary).

Efficiently read data from a structured file in C/C++

I have a file as follows:
The file consists of 2 parts: header and data.
The data part is separated into equally sized pages. Each page holds data for a specific metric. Multiple pages (not necessarily consecutive) might be needed to hold the data for a single metric. Each page consists of a page header and a page body. The page header has a field called "Next page", which is the index of the next page holding data for the same metric. The page body holds the real data. All pages have the same fixed size (20 bytes for the header and 800 bytes for the body; if there is less than 800 bytes of data, the rest is zero-filled).
The header part consists of 20,000 elements, each with information about a specific metric (metric 1 -> metric 20000). An element has a field called "first page", which is the index of the first page holding data for that metric.
The file can be up to 10 GB.
Requirement: re-order the data in the file in the shortest time possible, i.e., the pages holding data for a single metric must be consecutive, ordered from metric 1 to metric 20000 in alphabetical order (the header part must be updated accordingly).
An apparent approach: for each metric, read all of its data (page by page) and write it to a new file. But this takes a lot of time, especially when reading data from the file.
Is there a more efficient way?
One possible solution is to create an index from the file, containing the page number and the page metric that you need to sort on. Create this index as an array, so that the first entry (index 0) corresponds to the first page, the second entry (index 1) the second page, etc.
Then you sort the index using the metric specified.
When sorted, you end up with a new array which contains a new first, second etc. entries, and you read the input file writing to the output file in the order of the sorted index.
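For illustration, a sketch of building and sorting such an index. It assumes, as above, that each page's sort key can be recovered; here that key is read as a hypothetical 4-byte field at the start of the 20-byte page header, and pages are 820 bytes (20-byte header + 800-byte body). Adjust the constants and header layout to the real format:

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <vector>

struct PageRef {
    uint32_t page;    // page number in the input file
    uint32_t metric;  // metric this page belongs to (the sort key)
};

std::vector<PageRef> buildIndex(std::ifstream& in, std::streamoff dataStart,
                                uint32_t pageCount) {
    const std::streamoff kPageSize = 820;  // 20-byte header + 800-byte body
    std::vector<PageRef> index(pageCount);
    for (uint32_t p = 0; p < pageCount; ++p) {
        in.seekg(dataStart + p * kPageSize);
        uint32_t metric = 0;  // assumed field in the page header
        in.read(reinterpret_cast<char*>(&metric), sizeof(metric));
        index[p] = {p, metric};
    }
    std::sort(index.begin(), index.end(),
              [](const PageRef& a, const PageRef& b) { return a.metric < b.metric; });
    return index;  // copy pages to the output file in this order
}
```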
An apparent approach: for each metric, read all of its data (page by page) and write it to a new file. But this takes a lot of time, especially when reading data from the file.
Is there a more efficient way?
Yes. After you get a working solution, measure its efficiency, then decide which parts you wish to optimize. What and how you optimize will depend greatly on the results you get here (i.e., what your bottlenecks are).
A few generic things to consider:
if you have one set of steps that read data for a single metric and move it to the output, you should be able to parallelize that (have 20 sets of steps instead of one).
a 10 GB file will take a while to process regardless of what hardware you run your code on (conceivably, you could run it on a supercomputer, but I am ignoring that case). You / your client may accept a slower solution if it displays its progress / shows a progress bar.
do not use string comparisons for sorting.
Edit (addressing comment)
Consider performing the read as follows:
create a list of block offsets for the blocks you want to read
create a list of worker threads, of fixed size (for example, 10 workers)
each idle worker receives the file name and a block offset, creates a std::ifstream instance on the file, reads the block, and returns it to a receiving object (and then requests another block offset, if any are left)
read pages should be passed to a central structure that manages/stores pages
Also consider managing the memory for the blocks separately (for example, allocate chunks of multiple blocks preemptively, when you know the number of blocks to be read).
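A compact sketch of that worker-pool read, with an assumed fixed block size and a shared list of offsets; each worker owns its own std::ifstream so reads do not interfere:

```cpp
#include <cstddef>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Read blockSize bytes at each offset into a preallocated slot, using a fixed
// pool of worker threads that pull the next offset from a shared counter.
void readBlocks(const std::string& path, const std::vector<std::streamoff>& offsets,
                std::size_t blockSize, std::vector<std::vector<char>>& out,
                unsigned workers = 10) {
    out.assign(offsets.size(), std::vector<char>(blockSize));
    std::mutex m;
    std::size_t next = 0;
    auto worker = [&]() {
        std::ifstream in(path, std::ios::binary);
        for (;;) {
            std::size_t i;
            {
                std::lock_guard<std::mutex> lock(m);
                if (next >= offsets.size()) return;
                i = next++;
            }
            in.seekg(offsets[i]);
            in.read(out[i].data(), static_cast<std::streamsize>(blockSize));
        }
    };
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}
```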
I first read the header part, then sort the metrics in alphabetical order. For each metric in the sorted list I read all of its data from the input file and write it to the output file. To remove the bottleneck at the reading step, I used memory mapping. The results showed that with memory mapping the execution time for a 5 GB input file was reduced 5-6 times compared with not using memory mapping. This temporarily solves my problem; however, I will also consider the suggestions of #utnapistim.
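For reference, a minimal sketch of memory-mapping the input file read-only with POSIX mmap (error handling is abbreviated; the caller is responsible for munmap):

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the whole file read-only and return a pointer to its bytes, or nullptr on failure.
const char* mapFile(const char* path, std::size_t& length) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    length = static_cast<std::size_t>(st.st_size);
    void* p = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after closing the descriptor
    return p == MAP_FAILED ? nullptr : static_cast<const char*>(p);
}
```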

A way to retrieve data by address (c++)

Using C++, is it possible to store data to a file and retrieve that data by address for quicker access? I want to get around having to parse or iterate over large files of data, and instead gain direct access to a subset of that data. For your answers, it does not matter how the data is stored; use whatever works best with the answer you have.
Yes. Assuming you're using iostreams, you can use tellg and tellp to retrieve the current get and put (i.e., read and write) locations respectively. You can later feed the same value back to seekg or seekp to get back to the same location (again, for reading or writing respectively).
You can use these to (for one example) create an index into a file. Before writing each record to your primary data file, you'd use tellp to retrieve the current location. Then you'd store the data to the data file, and save the value tellp returned into the index file. Depending on what sort of index you want, that might just contain a series of locations, so you can seek directly to record #N in the data file (even if the records are of different sizes).
Alternatively, you might store the data for some key field in the index file. For example, you might have a main data file with a set of records about people. Then you might build a number of indices into that, one with last names and a location for each, another with birthdays and a location for each, and so on, so you can search by name or birthday (or do an intersection between them to support things like people older than 18 with a last name starting with "M", "N" or "O").
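A small sketch of that index idea using tellp/seekg; the length-prefixed record format and the separate index file of fixed-size offsets are just one way to do it:

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Write each record length-prefixed to the data file, and record its starting
// offset (from tellp) in the index file.
void writeRecords(const std::string& dataPath, const std::string& indexPath,
                  const std::vector<std::string>& records) {
    std::ofstream data(dataPath, std::ios::binary);
    std::ofstream index(indexPath, std::ios::binary);
    for (const std::string& r : records) {
        std::streamoff off = data.tellp();  // where this record starts
        index.write(reinterpret_cast<const char*>(&off), sizeof(off));
        std::uint32_t len = static_cast<std::uint32_t>(r.size());
        data.write(reinterpret_cast<const char*>(&len), sizeof(len));
        data.write(r.data(), len);
    }
}

// Jump straight to record #n: read its offset from the index, seekg to it, read it.
std::string readRecord(const std::string& dataPath, const std::string& indexPath,
                       std::size_t n) {
    std::ifstream index(indexPath, std::ios::binary);
    index.seekg(static_cast<std::streamoff>(n * sizeof(std::streamoff)));
    std::streamoff off = 0;
    index.read(reinterpret_cast<char*>(&off), sizeof(off));
    std::ifstream data(dataPath, std::ios::binary);
    data.seekg(off);
    std::uint32_t len = 0;
    data.read(reinterpret_cast<char*>(&len), sizeof(len));
    std::string rec(len, '\0');
    data.read(&rec[0], len);
    return rec;
}
```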

Binary-tree data storage implementation

I have started using binary trees in C++, and I must say I really like the idea; things are clear to me until I think about storing data on disk in a way that later lets me instantly read a chunk of data.
So far I have stored everything (all nodes) in RAM... but this is just a simple app, not a real-life one. I am not interested in storing the whole binary tree on disk, as that would be useless since you would have to read it all back into memory again! What I am after is a method like, for example, MySQL's.
I haven't found any good articles on this, so I would appreciate it if someone could include some URLs or books.
The main difference between a b-tree and a b+tree:
- The leaf nodes are linked for fast sequential lookups. The links can point ascending, descending, or both (as I saw in one IBM DB).
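As a toy illustration only (the sizes and fixed fan-out are made up), an on-disk b+tree leaf with sibling links might look like:

```cpp
#include <cstdint>

// One fixed-size leaf node as it would be written to disk. Interior nodes hold
// keys and child offsets instead of record pointers.
struct LeafNode {
    uint32_t keyCount;       // number of keys currently stored
    uint64_t keys[32];       // search keys, kept sorted
    uint64_t recordPtr[32];  // file offsets of the records for each key
    uint64_t nextLeaf;       // offset of the next leaf (ascending scans)
    uint64_t prevLeaf;       // offset of the previous leaf (descending scans)
};
```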
You should write it to disk; if the table or file grows, you will have memory problems.
(SEEK operations on files ARE REALLY FAST. You can create a 1 GB file on disk in less than 1 second... C# FileStream, method SetLength.)
If you end up with multiple readers/writers, you need concurrency control over the index and the table (or file)... Are you going to do that in memory? If a power failure occurs, how do you roll back? Yeah, you don't.
E.g.: field f1 is indexed.
WHERE 1=1 (no need to access the b+tree; give me everything, order irrelevant)
WHERE 1=1 ORDER BY f1 ASC/DESC (need to access the b+tree; give me everything in ascending/descending order)
WHERE f1>=100 (need to access the b+tree; look up the leaf node where the value = 100 and return all leaf node items following the right pointers. If this is a multithreaded read, the results will probably come back in a strange order, but that's no problem... there is no ORDER BY in the clause)
WHERE f1>=100 ORDER BY f1 ASC (need to access the b+tree; look up the leaf node where the value = 100 and return all leaf node items following the right pointers. This process shouldn't be multithreaded; following the b+tree, the results come naturally in order.)
Field f2 is indexed with a b+tree and is of type string.
WHERE name LIKE '%ODD' (internally, the compared value must be reversed so the wildcard ends up at the end: it becomes "starts with 'DDO' and ends with anything" on the reversed index. 'DDOT' is in that group, so 'TODD' belongs to the result!!! Tricky, tricky logic ;P)
With this statement,
WHERE name LIKE '%OD%' (has 'OD' in the middle), things get hot :))))
Internally, the result is the UNION of the sub-result for 'OD%' with the sub-result for the reversed 'DO%'. After that, remove the results that only start with 'OD' or only end with 'OD' without 'OD' in the middle; otherwise it's a valid result ('ODODODODOD' is a valid result; 'ODABCD' and 'ABCDOD' are invalid).
Consider what I said and check some more things if you're going to go deep:
- Fast IO on files: C# FileStream with the no-buffering flag and the write-through disk flag on.
- Memory-mapped files / memory views: yes, we can manipulate a huge file in small portions, as we need it.
- Indexes: bitmap index, hash index (hash function; perfect hash function; ambiguity of the hash function), sparse index, dense index, b+tree, r-tree, reversed index.
- Multithreading: locks, mutexes, semaphores.
- Transactional considerations (log file, two-phase commit, three-phase commit).
- Locks (database, table, page, record).
- Deadlocks: 3 ways to kill one (the longer-running conflicting process; the younger conflicting process; the process which locks more objects). Modern RDBMSs use a mix of all 3...
- SQL parsing (AST).
- Caching recurrent queries.
- Triggers, procedures, views, etc.
- Passing parameters to procedures (you can use an object type ;P)
DON'T LOAD EVERYTHING INTO MEMORY; INTELLIGENT SOLUTIONS LOAD PARTS AS THEY NEED THEM AND RELEASE THEM WHEN THEY ARE NO LONGER NEEDED. Why? Your DB engine (and PC) becomes more responsive by using less memory. Using a b+tree, looking up a leaf node takes just 2 disk IOs. Knowing the lookup value, you get the record's long pointer, SEEK the main file to that position, and read the content. This is fast. Memory is faster... yes it is, but can you put 10 GB of a b+tree in memory? If so, how does your DB engine start to behave? Slowly?
Forget binary trees and conventional b-trees: they are academic tutorials. In real life they are replaced by hash tables or b+trees (B PLUS TREE, showing storage and ascending order: http://en.wikipedia.org/wiki/B%2B_tree).
Consider using dataspaces for the DB data across multiple disks; you can parallelize disk IO performance. Don't forget to mirror them... Each dataspace should have a fragment of the table with a fragment of the index, and a partial log file. You should develop the coordinator, which presents the queries wisely to the sub-units.
E.g.: 3 dataspaces...
INSERT INTO etc.... should only happen in 1 tablespace,
but
SELECT * FROM TB_XPTO should be presented to all dataspaces.
SELECT * FROM TB_XPTO ORDER BY an indexed field should also be presented to all dataspaces. Each dataspace executes the instruction, so now we have 3 subsets, each in its own sub-order.
The results arrive at the coordinator, which reorders them.
Confusing, BUT FAST!!!!!!
The coordinator should control the master transaction.
If dataspace A committed,
dataspace B committed,
and dataspace C is in an uncommitted state,
the coordinator will roll back C, B and A.
If dataspace A committed,
dataspace B committed,
and dataspace C committed,
the coordinator will commit the overall transaction.
COORDINATOR LOG:
CREATE MASTER TRANSACTION UID 121212, CHILD TRANSACTIONS(1111,2222,3333)
DATA SPACE A LOG
1111 INSERT len byte array
1111 INSERT len byte array
COMMIT 1111
DATA SPACE B LOG
2222 INSERT len byte array
2222 INSERT len byte array
COMMIT 2222
DATA SPACE C LOG
3333 INSERT len byte array
3333 ---> nothing more... power failure here!!!!!!!
On startup the coordinator checks whether the DB was properly closed; if not, it checks its log file. Well, a master commit line like COMMIT 121212 is missing, so it queries the dataspaces about their log consistency.
A and B reply COMMITTED, but C, after checking its log file, detects a failure and replies UNCOMMITTED.
The master coordinator FORCES TABLESPACES A, B, C TO ROLL BACK 1111, 2222, 3333.
After that, it rolls back its own master transaction and sets the DB state to OK.
The main points here are speed on inserts, selects, updates, and deletes.
Consider keeping the index well balanced. Many deletes on the index will unbalance it, and an unbalanced index loses performance... Add a heap at the head of the index file for controlling this; some math here would help. If deletes exceed 5% of the records, rebalance it and reset the counter. If an update lands on an indexed field, count it too.
Be smart when choosing which fields to index. If the column is Gender, there are only 2 options (I hope, lol... oops, it can be nullable too...), so a bitmap index applies well. If the distinctness of a field is 100% (all values heterogeneous), like a sequence applied to a field as Oracle does, or an identity field as SQL Server does, a b+tree applies well. If a field is a geometric type, like in Oracle, an R-tree is best. For strings, a reversed index applies well, or a b+tree if the values are heterogeneous.
Houston, we have problems....
NULL-valued fields should be considered in the index too. NULL is a value too!!!!
E.g.: WHERE F1 IS NULL
Add some socket functionality: an async TCP/IP server.
- If you delete a record, don't resize the file right away; mark it as deleted. You should keep some metrics here too: if unused space > x and transactions = 0, take a database lock, re-allocate pointers, then resize the database. When free spaces appear in the DB file, you can try page locks instead of a database lock... things keep going and no one gets hurt... Take the last unlocked page of the DB and lock it for yourself. Look for a deleted page that you can fill with your page. Not found? Release the lock. If found, move the page to the new position, fix the pointers, mark the old page as deleted, resize the file, and release the lock. Why so many operations? To keep the log well formed!!!! You could split the page into smaller pages, but you get fragmentation (argh... we lost speed, commander?)... Two algorithms come in here: best-fit and worst-fit... Google them. The best is... using both :P
And if you solve all of this stuff, you can shout out loud, "DAMN, I MADE A DATABASE... I'M GONNA NAME IT ORACLE!!!!" ;P