RocksDB: Multiple values per key (C++)
What I am trying to do
I am trying to adapt my simple blockchain implementation to save the blockchain to the hard drive periodically, so I looked into different DB solutions. I decided to use RocksDB due to its ease of use and good documentation & examples. I read through the documentation but could not figure out how to adapt it to my use case.
I have a class Block:

```cpp
class Block {
public:
    string PrevHash;

private:
    blockheader header;                // The header of the block
    uint32_t index;                    // Height of this block
    std::vector<tx_data> transactions; // All transactions in the block in a vector
    std::string hash;                  // The hash of the block
    uint64_t timestamp;                // The timestamp this block was created by the node
    std::string data;                  // Extra data that can be appended to blocks (for example text or a smart contract).
                                       // The larger this field, the higher the fee; the max size is defined in config.h
};
```

The class contains a few member variables and a vector of a struct tx_data. I want to load this data into a RocksDB database.
What I have tried
After Google failed to return any results on storing multiple values under one key, I decided I would have to enclose each block's data with 0xa1 at the beginning and 0x2a at the end:
```
*0xa1*
header
index
txns
hash
timestamp
data
*0x2a*
```
but decided there was surely a simpler way. I tried looking at the code used by TurtleCoin, a currency that uses RocksDB for its database, but the code there is practically indecipherable. I have heard about serialization, but there seems to be little info out there on it.
Perhaps I am misunderstanding the use of a DB?
You need to serialize it. Serialization is the process of taking a structured set of data and turning it into one string, number, or vector of bytes that can later be de-serialized back into that struct. One method would be to take the hash of the block and use it as the key in the DB, then create a new struct (say, BlockNoHash) which does not contain the hash. Then write a function that takes a Block struct and a path, constructs a BlockNoHash struct, and saves it; and another function that reads a block by its hash and returns a Block struct. Very basically, you could split each field with a character which will never occur in the data (e.g. ` or |), though this means that if one piece of the data is corrupted, you can't recover any of the other fields.
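A minimal sketch of the delimiter idea, assuming all fields convert cleanly to strings and leaving out the transactions vector for brevity; serializeBlock and deserializeBlock are illustrative helpers, not part of RocksDB:

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Join the simple fields with '|', a character assumed never to occur
// inside the data itself.
std::string serializeBlock(uint32_t index, const std::string& prevHash,
                           uint64_t timestamp, const std::string& data) {
    std::ostringstream out;
    out << index << '|' << prevHash << '|' << timestamp << '|' << data;
    return out.str();
}

// Split the stored value back into its fields. If a delimiter is corrupted,
// every field after it is lost -- the weakness mentioned above.
std::vector<std::string> deserializeBlock(const std::string& value) {
    std::vector<std::string> fields;
    std::istringstream in(value);
    std::string field;
    while (std::getline(in, field, '|')) fields.push_back(field);
    return fields;
}
```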
There are two related questions here.
One is: how do you store complex data -- more than just a simple integer or string -- within a key-value store like RocksDB? As Leo says, you need to serialize it.
Rather than writing your own code, the typical easier way is to use a framework like Protobuf or Thrift to generate code to translate between your in-memory structures and a flat bytes representation suitable to store in a database (or send over the network.)
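For example, a rough sketch of what the Protobuf route could look like, assuming a hypothetical block.proto with matching fields has been compiled with protoc (the message layout and file names here are illustrative, not taken from the question):

```cpp
// block.proto (hypothetical):
//   syntax = "proto3";
//   message Block {
//     string prev_hash = 1;
//     uint32 index     = 2;
//     uint64 timestamp = 3;
//     string data      = 4;
//   }
#include "block.pb.h"  // header generated by protoc from the sketch above

#include <string>

// Flatten the message into bytes suitable for a RocksDB value.
std::string toBytes(const Block& block) {
    std::string bytes;
    block.SerializeToString(&bytes);
    return bytes;
}

// Rebuild the in-memory structure from the stored bytes.
Block fromBytes(const std::string& bytes) {
    Block block;
    block.ParseFromString(bytes);
    return block;
}
```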
A related question, from the title: how do you store multiple values per key?
There are two main options:
Use a compound key that distinguishes the various values. By walking a key prefix you can find all the values in a set of related keys (see the sketch after this list). This is better if the values get very large or if you want to find and update them independently.
Or, make the value for a single key actually be a compound object that includes several inner values. This is easiest if you always want to fetch all the sub-values in a single operation.
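A minimal sketch of the compound-key option against the RocksDB C++ API; the key layout "<blockhash>:tx:<n>" and the database path are illustrative choices, not a RocksDB convention:

```cpp
#include <rocksdb/db.h>

#include <iostream>
#include <memory>
#include <string>

int main() {
    rocksdb::DB* raw = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    if (!rocksdb::DB::Open(options, "/tmp/blockdb", &raw).ok()) return 1;
    std::unique_ptr<rocksdb::DB> db(raw);

    // One key per transaction, all sharing the block hash as a prefix.
    const std::string blockHash = "abc123";  // illustrative hash
    db->Put(rocksdb::WriteOptions(), blockHash + ":tx:0", "first tx");
    db->Put(rocksdb::WriteOptions(), blockHash + ":tx:1", "second tx");

    // Walk the prefix to find every value in this set of related keys.
    const std::string prefix = blockHash + ":tx:";
    std::unique_ptr<rocksdb::Iterator> it(
        db->NewIterator(rocksdb::ReadOptions()));
    for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix);
         it->Next()) {
        std::cout << it->key().ToString() << " => "
                  << it->value().ToString() << "\n";
    }
    return 0;
}
```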
Related
I am working with large binary files (approx. 2 GB each) that contain raw data. These files have a well-defined structure, where each file is an array of events, and each event is an array of data banks. Each event and data bank has a structure (header, data type, etc.).
From these files, all I have to do is extract whatever data I might need, and then I just analyze and play with it. I might not need all of the data; sometimes I just extract XType data, other times just YType, etc.
I don't want to shoot myself in the foot, so I am asking for guidance/best practice on how to deal with this. I can think of 2 possibilities:
Option 1
Define a DataBank class; this will contain the actual data (std::vector<T>) and whatever structure it has.
Define an Event class; this has a std::vector<DataBank> plus whatever structure.
Define a MyFile class; this is a std::vector<Event> plus whatever structure.
The constructor of MyFile will take a std::string (the name of the file) and will do all the heavy lifting of reading the binary file into the classes above.
Then, whatever I need from the binary file will just be a method of the MyFile class; I can loop through Events, I can loop through DataBanks, everything I could need is already in this "unpacked" object.
The workflow here would be like:
```cpp
int main() {
    MyFile data_file("data.bin");
    std::vector<XData> my_data = data_file.getXData();
    // Play with my_data, and never again use the data_file object
    // ...
    return 0;
}
```
Option 2
Write functions that take a std::string as an argument and extract whatever I need from the file, e.g. std::vector<XData> getXData(std::string), int getNumEvents(std::string), etc.
The workflow here would be like:
```cpp
int main() {
    std::vector<XData> my_data = getXData("data.bin");
    // Play with my_data, and I didn't create a massive object
    // ...
    return 0;
}
```
Pros and Cons that I see
Option 1 seems like the cleaner option; I would only "unpack" the binary file once, in the MyFile constructor. But I will have created a huge object that contains all the data from a 2 GB file, most of which I will never use. If I need to analyze 20 files (each 2 GB), will I need 40 GB of RAM? I don't understand how these are handled; will this affect performance?
Option 2 seems to be faster; I will just extract whatever data I need, and that's it. I won't "unpack" the entire binary file just to later extract the data I care about. The problem is that I will have to deal with the binary file structure in every function; if this ever changes, that will be a pain. On the other hand, I will only create objects for the data I will actually play with.
As you can see from my question, I don't have much experience with dealing with large structures and files. I appreciate any advice.
I do not know whether the following scenario matches yours.
I had a case of processing huge log files of hardware signal logging in the automotive area: signals like door locked, radio on, temperature, and thousands more, appearing sometimes periodically. The operator selects some signal types and then analyzes diagrams of signal values.
This scenario is based on a huge log file that grows over time.
What I did was create, for every signal type, its own log-file extract in an optimized binary format (one would load a fixed-size byte[] array).
This meant that displaying the diagram for just 10 types was feasible to do fast, in real time: zooming in on a time interval, dynamically selecting signal types, and so on.
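A rough C++ sketch of that idea; the SignalRecord layout and the one-file-per-signal naming are my assumptions, not details from the original setup:

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Invented fixed-size record so each entry can be loaded as a flat byte block.
// (Raw struct I/O like this assumes one platform; a real format would pin
// down endianness and padding.)
struct SignalRecord {
    uint64_t timestamp;
    double value;
};

// Append one record to the extract file dedicated to this signal type,
// e.g. "door_locked.bin" -- one extract file per signal type.
void appendToExtract(const std::string& signalType, const SignalRecord& rec) {
    std::ofstream out(signalType + ".bin", std::ios::binary | std::ios::app);
    out.write(reinterpret_cast<const char*>(&rec), sizeof(rec));
}

// Fixed-size records mean record N sits at byte N * sizeof(SignalRecord),
// which is what makes fast, real-time diagram display feasible.
SignalRecord readRecord(const std::string& signalType, std::size_t n) {
    std::ifstream in(signalType + ".bin", std::ios::binary);
    in.seekg(static_cast<std::streamoff>(n * sizeof(SignalRecord)));
    SignalRecord rec{};
    in.read(reinterpret_cast<char*>(&rec), sizeof(rec));
    return rec;
}
```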
I hope you got some ideas.
What is the recommended method (or methods) for accessing data from a previous or upcoming block of data from a BulkIO input stream? Edit: I am using C++.
For example, if I wanted to perform convolution on a stream of incoming data where each calculation is dependent on some number of values forward/backward in the stream how would I reach into the "next" block of data or "previous" block of data to perform the convolution calculations at the block boundaries? Or temporarily store this information somewhere within the component so it can be used from one block to the next?
Or a simpler example: if I send a repeating vector of 8 octet values into my component, I want the component to simply flip from 0 to 1 or vice versa whenever a 1 is received (dependent on the last index of the previous block of data to calculate the first index of the next block of data).
Desired:

```
in:  [0,0,0,1,0,0,0] [0,0,0,1,0,0,0] [0,0,0,1,0,0,0] ->
out: [1,1,1,0,0,0,0] [0,0,0,1,1,1,1] [1,1,1,0,0,0,0]
```

What I have been able to achieve:

```
in:  [0,0,0,1,0,0,0] [0,0,0,1,0,0,0] [0,0,0,1,0,0,0] ->
out: [1,1,1,0,0,0,0] [1,1,1,0,0,0,0] [1,1,1,0,0,0,0]
```
I've thought to store the relevant information from the previously processed block in a variable somewhere inside the component's serviceFunction() code, although I haven't found a way to do this without the value being reinitialized (is each block of data a new call to serviceFunction()?).
Alternatively, I thought to make a read-only property to hold the values I care about but I suspect there may be a better approach I am unaware of.
Thanks,
-Mark
If you need to retain some number of samples between reads, BulkIO input streams support data overlapping. You provide both a number of samples to read, and a number of samples to consume (which is necessarily less than or equal to the read size), and the next read will start at the first sample that was not consumed. Refer to the REDHAWK manual section on the BulkIO Stream API (5.7.3 in the 2.0.8 manual) for further details. You can also maintain history by simply keeping the last data block read as a member variable of your component class.
Your simple example suggests a state machine rather than the contents of the last read. In general, if there is information you want to store between iterations of your serviceFunction(), you are free to add your own member variables to your component class (e.g., "MyComponent_i"). It is not necessary to declare a property to add members. Only the base class (e.g., "MyComponent_base") is intended to be regenerated as you modify the properties or ports of the component, so any changes you make to the component class are preserved.
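A rough sketch of that member-variable approach, assuming a component generated as MyComponent_i; the stream-reading line is elided because it depends on your port types, and flip_state_/processBlock are illustrative names:

```cpp
#include <vector>

// Sketch of the user-editable component class; MyComponent_base is the
// generated base class and is not modified.
class MyComponent_i : public MyComponent_base {
public:
    MyComponent_i(const char* uuid, const char* label)
        : MyComponent_base(uuid, label),
          flip_state_(true) {}  // initial state matching the example above

    int serviceFunction() {
        // Read one block from the input port here (the exact call depends
        // on the port type), then:
        // std::vector<unsigned char> block = ...;
        // processBlock(block);
        return NORMAL;  // NORMAL/NOOP come from the generated component code
    }

private:
    // Toggle the output whenever a 1 arrives. Because flip_state_ is a
    // member, the state carries across block boundaries -- producing the
    // "Desired" output from the question rather than the repeating one.
    void processBlock(std::vector<unsigned char>& block) {
        for (unsigned char& sample : block) {
            if (sample == 1) flip_state_ = !flip_state_;
            sample = flip_state_ ? 1 : 0;
        }
    }

    bool flip_state_;  // NOT reinitialized between serviceFunction() calls
};
```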
Using C++, is it possible to store data to a file and retrieve that data by address for quicker access? I want to get around having to parse or iterate large files of data, and gain direct access to a subset of that data. In your answers, it does not matter how the data is stored; whatever works best with the answer you have.
Yes. Assuming you're using iostreams, you can use tellg and tellp to retrieve the current get and put (i.e., read and write) locations respectively. You can later feed the same value back to seekg or seekp to get back to the same location (again, for reading or writing respectively).
You can use these to (for one example) create an index into a file. Before writing each record to your primary data file, you'd use tellp to retrieve the current location. Then you'd store the data to the data file, and save the value tellp returned into the index file. Depending on what sort of index you want, that might just contain a series of locations, so you can seek directly to record #N in the data file (even if the records are of different sizes).
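A minimal sketch of such an offset index using standard iostreams; the record framing and file names are illustrative:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

int main() {
    const std::vector<std::string> records = {"alpha", "longer record", "z"};

    // Write variable-length records, remembering where each one starts.
    std::vector<std::streampos> index;
    {
        std::ofstream data("records.dat", std::ios::binary);
        for (const std::string& rec : records) {
            index.push_back(data.tellp());  // offset noted before the write
            uint32_t len = static_cast<uint32_t>(rec.size());
            data.write(reinterpret_cast<const char*>(&len), sizeof(len));
            data.write(rec.data(), len);
        }
        // A real program would also persist `index` to an index file.
    }

    // Later: seek straight to record #1 without scanning past record #0.
    std::ifstream data("records.dat", std::ios::binary);
    data.seekg(index[1]);
    uint32_t len = 0;
    data.read(reinterpret_cast<char*>(&len), sizeof(len));
    std::string rec(len, '\0');
    data.read(&rec[0], len);  // rec now holds "longer record"
    return 0;
}
```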
Alternatively, you might store the data for some key field in the index file. For example, you might have a main data file with a set of records about people. Then you might build a number of indices into that, one with last names and a location for each, another with birthdays and a location for each, and so on, so you can search by name or birthday (or do an intersection between them to support things like people older than 18 with a last name starting with "M", "N" or "O").
I've written a Thrift definition, and used this definition to serialize multiple records into one file (I've added the size of the whole record at the beginning of each record). That is, in short, what I have done:
```cpp
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
myClass->write(protocol.get());
const std::string& data(transport->getBufferAsString());
```
Afterwards I just print the string data in binary mode. Now I want to deserialize this file again. I wouldn't have any problem if there were only one record in the file; unfortunately the file holds multiple records, so I guess I have to work with offsets based on the size I saved in the file along with each record. However, I can't seem to find any example I can use to achieve my goals, and the official documentation is quite lacking. Does anyone have any tips for me? If I'm missing some information, just ask.
Further information:
Of course I want to use Thrift to deserialize. However, one file can contain multiple records. For example: imagine I have defined a struct in a Thrift definition file that contains car information. Now I serialize multiple car structs into one output file. Serializing is no problem, as I just append the data. If I want to deserialize, however, I have to know where one record starts and the next begins. That is my problem. I don't know how to tell Thrift where one record begins and ends. I've searched the internet but can't seem to find an example for C++ (I got one for Python so far, but am not able to translate it to C++). The structure of one file can be described as follows: [lengthofrecord1][record1][lengthofrecord2][record2][...]
Thanks in Advance
Michael
How about having a list<records> that you de/serialize as a whole? Or is it an absolute requirement to read them independently and randomly? If yes, I see 1.5 (one and a half) possible solutions:
Have a second file as an index. This holds a map<recordNumber, offset>, or simply a sorted list of integer pairs, to quickly locate records. Since these data are much smaller than the records, you can probably cache it all in memory.
The half solution: if (and only if) the record size is fixed, any record's position can be calculated easily by multiplying recordSize * (recordNr - 1). This way you don't even need the size prefix. If you have strings in the record or other variable-sized entities, this will not work, unless you force a fixed record size by reserving a buffer for each record with a predefined (maximum) size. It's a little ugly, thus the "half" solution, but you don't need the index file.
Although maybe not the perfect solution, this seems to work for me:
```cpp
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
transport->resetBuffer((uint8_t*) buffer, sizeOfEntry);
```
buffer is a char array containing the desired record (I used seekg for the offset) and sizeOfEntry is the record's size. Afterwards I can go on with the automatically generated read method of my Thrift-generated class. In fact I had this solution earlier; I had just messed up my offset, which is why it didn't work.
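Putting it together, a sketch of the full read loop for the [lengthofrecord][record] layout; CarRecord stands in for the Thrift-generated struct, and the raw uint32_t length prefix is an assumption about how the sizes were written:

```cpp
#include <thrift/protocol/TBinaryProtocol.h>
#include <thrift/transport/TBufferTransports.h>
// #include "gen-cpp/car_types.h"  // hypothetical header for CarRecord

#include <boost/shared_ptr.hpp>
#include <cstdint>
#include <fstream>
#include <vector>

std::vector<CarRecord> readAll(const char* path) {
    std::vector<CarRecord> records;
    std::ifstream in(path, std::ios::binary);

    uint32_t size = 0;
    // Assumed framing: each record is prefixed by its size as a raw uint32_t.
    while (in.read(reinterpret_cast<char*>(&size), sizeof(size))) {
        std::vector<uint8_t> buffer(size);
        in.read(reinterpret_cast<char*>(buffer.data()), size);

        boost::shared_ptr<apache::thrift::transport::TMemoryBuffer>
            transport(new apache::thrift::transport::TMemoryBuffer);
        boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol>
            protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
        transport->resetBuffer(buffer.data(), size);

        CarRecord record;
        record.read(protocol.get());  // generated deserialization method
        records.push_back(record);
    }
    return records;
}
```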
I'm busy programming a class that creates an index for a text file (ASCII/binary).
My problem is that I don't really know how to start. I've already made some attempts, but none really worked well for me.
I do NOT need to find the address of the file via the MFT. I just want to load the file and find things much faster by looking up the key in the index file and jumping to the address it gives in the text file.
The index file should be built up as follows:
```
KEY   ADDRESS
1     0xABCDEF
2     0xFEDCBA
.     .
.     .
```
We have a text file with the following example content:
```
1, 8752 FW,
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++,
******************************************************************************,
------------------------------------------------------------------------------;
```
I hope that this explains my question a bit better.
Thanks!
It seems to me that all your class needs to do is store an array of pointers, or file offsets, to the key locations in the file.
It really depends on what your Key locations represent.
I would suggest that you access the file through your class using some public methods. You can then more easily tie in Key locations with the data written.
For example, your Key locations may be where each new data block written into the file starts. E.g. first block 1000 bytes, key location 0; second block 2500 bytes, key location 1000; third block 550 bytes, key location 3500; the next block will be at 4050, all assuming that 0 is the first byte.
Store the Key values in a variable-length array, and then you can easily retrieve the starting point for a data block.
If your Key point is signified by some key character, then you can use the same class, but with a slight change to store where the Key value is found. The simplest way is to step through the data until the key character is located, counting the number of characters checked as you go. The count then gives you your key location.
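A minimal sketch of such a class, keeping only a vector of byte offsets (the class and method names are made up for illustration):

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Illustrative index class: key N maps to the byte offset where block N starts.
class FileIndex {
public:
    explicit FileIndex(const std::string& path)
        : file_(path, std::ios::binary) {}

    // Record the current end-of-file position as the start of a new block,
    // then append the block's bytes.
    void appendBlock(const std::string& bytes) {
        file_.seekp(0, std::ios::end);
        offsets_.push_back(static_cast<std::size_t>(file_.tellp()));
        file_.write(bytes.data(), bytes.size());
    }

    // Jump straight to the start of block `key` -- no scanning required.
    std::size_t addressOf(std::size_t key) const { return offsets_.at(key); }

private:
    std::ofstream file_;
    std::vector<std::size_t> offsets_;
};
```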
Your code snippet isn't so much an idea as it is the functionality you wish to have in the end.
Recognize that "indexing" merely means "remembering" where things are located. You can accomplish this using any data structure you wish: a B-tree, red/black tree, BST, or more advanced structures like suffix trees/suffix arrays.
I recommend you look into such data structures.
Edit:
With the new information, I would suggest making your own key/value lookup. Build an array of keys, and associate their values somehow. This may mean building a class or struct that contains both the key and the value, or instead contains the key and a pointer to a struct or class holding the value, etc.
Once you have done this, sort the key array. Now you can do a binary search on the keys to find the appropriate value for a given key.
You could build a hash table in a similar manner, or build a BST or a similar structure as I mentioned earlier.
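A sketch of the sorted-array lookup using std::lower_bound; the Entry layout (a key plus the file address it points to) is an illustrative choice:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative pairing of a key with the file address it points to.
struct Entry {
    uint32_t key;
    uint64_t address;
};

// Binary-search an Entry array that has been sorted by key.
// Returns the address, or UINT64_MAX if the key is absent.
uint64_t lookup(const std::vector<Entry>& sorted, uint32_t key) {
    auto it = std::lower_bound(
        sorted.begin(), sorted.end(), key,
        [](const Entry& e, uint32_t k) { return e.key < k; });
    if (it != sorted.end() && it->key == key) return it->address;
    return UINT64_MAX;
}
```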
I still don't really understand the question (work on your question-asking skillz), but as far as I can tell the algorithm will be:
Scan the file linearly. The first value, up to the first comma (','), is probably a key. All other keys occur wherever a ';' occurs, up to the next ',' (you might need to skip line breaks here). If it's a homework assignment, just use scanf() or something to read the key.
Print the key and the byte position you found it at to your index file.
AFAIUI that's the algorithm; I don't really see the problem here.
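A hedged sketch of that scan, assuming keys sit at the start of the file and directly after each ';', with ',' terminating them as described above (file names are illustrative):

```cpp
#include <fstream>
#include <string>

int main() {
    std::ifstream in("data.txt", std::ios::binary);
    std::ofstream index("index.txt");

    bool atKey = true;  // the first value in the file is a key
    std::string key;
    std::streampos keyPos = in.tellg();  // position where the key starts

    char c;
    while (in.get(c)) {
        if (atKey) {
            if (c == ',') {  // key runs up to the next ','
                index << key << " 0x" << std::hex
                      << static_cast<long long>(keyPos) << std::dec << "\n";
                key.clear();
                atKey = false;
            } else if (c != '\n' && c != '\r') {
                key += c;  // skip line breaks inside keys
            }
        } else if (c == ';') {  // a ';' introduces the next key
            atKey = true;
            keyPos = in.tellg();  // position right after the ';'
        }
    }
    return 0;
}
```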