Making an index-creating class


I'm writing a class that creates an index from a text file (ASCII or binary).
My problem is that I don't really know how to start. I have made a few attempts, but none of them worked well.
I do NOT need to find the address of the file via the MFT. I just want to load the file and find things much faster by looking up the key in the index file and then jumping to the address it gives in the text file.
The index-file should be built up as follows:
KEY ADDRESS
1 0xABCDEF
2 0xFEDCBA
. .
. .
We have a text-file with the following example value:
1, 8752 FW,
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++,
******************************************************************************,
------------------------------------------------------------------------------;
I hope that this explains my question a bit better.
Thanks!

It seems to me that all your class needs to do is store an array of pointers or file start offsets to the key locations in the file.
It really depends on what your Key locations represent.
I would suggest that you access the file through your class using some public methods. You can then more easily tie in Key locations with the data written.
For example, your Key locations may be where each new data block written into the file starts. E.g. first block 1000 bytes, key location 0; second block 2500 bytes, key location 1000; third block 550 bytes, key location 3500; the next block will start at 4050, all assuming that 0 is the first byte.
Store the Key values in a variable length array and then you can easily retrieve the starting point for a data block.
If your Key point is signified by some key character, you can use the same class with a slight change so that it records where that character is found. The simplest way is to step through the data until the key character is located, counting the number of characters checked as you go. The count is then your key location.
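The block-offset bookkeeping described above could be sketched roughly as follows. BlockIndex and its member names are made up for illustration; the class only tracks offsets and leaves the actual file I/O to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: remembers the byte offset at which each block starts.
class BlockIndex {
public:
    // Register a block of `size` bytes appended at the current end of the
    // file and return the offset at which it starts.
    std::size_t append(std::size_t size) {
        const std::size_t start = next_;
        starts_.push_back(start);
        next_ += size;
        return start;
    }

    // Start offset of block `i` (0-based).
    std::size_t start_of(std::size_t i) const {
        assert(i < starts_.size());
        return starts_[i];
    }

    // Offset at which the next block would start.
    std::size_t next_offset() const { return next_; }

    std::size_t size() const { return starts_.size(); }

private:
    std::vector<std::size_t> starts_;  // variable-length array of key locations
    std::size_t next_ = 0;             // offset 0 is the first byte of the file
};
```

With blocks of 1000, 2500 and 550 bytes appended in that order, start_of returns 0, 1000 and 3500, and next_offset() returns 4050, matching the numbers in the example.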

Your code snippet isn't so much of an idea as it is the functionality you wish to have in the end.
Recognize that "indexing" merely means "remembering" where things are located. You can accomplish this using any data structure you wish... B-Tree, Red/Black tree, BST, or more advanced structures like suffix trees/suffix arrays.
I recommend you look into such data structures.
edit:
With the new information, I would suggest making your own key/value lookup. Build an array of keys and associate their values with them somehow. This may mean building a class or struct that contains both the key and the value, or one that contains the key and a pointer to a struct or class holding the value, etc.
Once you have done this, sort the key array. Now you can do a binary search on the keys to find the appropriate value for a given key.
You could build a hash table in a similar manner, or a BST or similar structure like I mentioned earlier.
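A minimal sketch of the sorted-array-plus-binary-search idea, assuming integer keys and 64-bit addresses as in the question's KEY/ADDRESS table. IndexEntry and SortedIndex are hypothetical names, not part of any library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One row of the index: a key and the address it points to.
struct IndexEntry {
    int key;
    std::uint64_t address;
};

class SortedIndex {
public:
    void add(int key, std::uint64_t address) {
        entries_.push_back({key, address});
        sorted_ = false;
    }

    // Sort by key once all entries have been added.
    void sort() {
        std::sort(entries_.begin(), entries_.end(),
                  [](const IndexEntry& a, const IndexEntry& b) { return a.key < b.key; });
        sorted_ = true;
    }

    // Binary search; returns true and fills `address` if the key exists.
    bool find(int key, std::uint64_t& address) const {
        assert(sorted_);  // binary search is only valid on sorted data
        auto it = std::lower_bound(entries_.begin(), entries_.end(), key,
                                   [](const IndexEntry& e, int k) { return e.key < k; });
        if (it == entries_.end() || it->key != key) return false;
        address = it->address;
        return true;
    }

private:
    std::vector<IndexEntry> entries_;
    bool sorted_ = true;  // an empty index is trivially sorted
};
```

After add(2, 0xFEDCBA), add(1, 0xABCDEF) and sort(), a call to find(1, addr) succeeds with addr == 0xABCDEF, while a lookup of a missing key returns false.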

I still don't really understand the question (work on your question asking skillz), but as far as I can tell the algorithm will be:
scan the file linearly, the first value up to the first comma (',') is a key, probably. All other keys occur wherever a ';' occurs, up to the next ',' (you might need to skip linebreaks here). If it's a homework assignment, just use scanf() or something to read the key.
print out the key and byte position you found it at to your index file
AFAIUI that's the algorithm, I don't really see the problem here?
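As a rough sketch of that scan, assuming the record layout guessed above (a key runs from the start of a record up to the first ',', and a record ends at ';'), something like the following could work. build_index is a hypothetical name, and the output format is simply "KEY OFFSET" per line with the offset in decimal.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Scans `data` linearly and writes one "KEY OFFSET" line per record to `index`.
// Returns the number of keys found.
std::size_t build_index(std::istream& data, std::ostream& index) {
    enum class State { ExpectKey, InKey, InBody };
    State state = State::ExpectKey;
    std::size_t pos = 0;      // byte offset of the character just read
    std::size_t key_pos = 0;  // byte offset where the current key starts
    std::size_t count = 0;
    std::string key;
    char c;
    while (data.get(c)) {
        // Skip line breaks and spaces between ';' and the next key.
        if (state == State::ExpectKey && !std::isspace(static_cast<unsigned char>(c))) {
            state = State::InKey;
            key.clear();
            key_pos = pos;
        }
        if (state == State::InKey) {
            if (c == ',') {  // the first comma ends the key
                index << key << ' ' << key_pos << '\n';
                ++count;
                state = State::InBody;
            } else {
                key += c;
            }
        } else if (state == State::InBody && c == ';') {
            state = State::ExpectKey;  // record finished, next key follows
        }
        ++pos;
    }
    return count;
}
```

When reading from a real file, opening it in binary mode keeps the counted offsets equal to the positions that seekg expects later.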

Related

RocksDb: Multiple values per key (c++)

What I am trying to do
I am trying to adapt my simple blockchain implementation to save the blockchain to the hard drive periodically, so I looked into different DB solutions. I decided to use RocksDB due to its ease of use and good documentation and examples. I read through the documentation and could not figure out how to adapt it to my use case.
i have a class Block
class Block {
public:
    std::string PrevHash;
private:
    blockheader header;                 // The header of the block
    uint32_t index;                     // Height of this block
    std::vector<tx_data> transactions;  // All transactions in the block, in a vector
    std::string hash;                   // The hash of the block
    uint64_t timestamp;                 // The timestamp this block was created by the node
    std::string data;                   // Extra data that can be appended to blocks (for example text or a smart contract)
                                        // - The larger this field, the higher the fee; the max size is defined in config.h
};
This class contains a few variables and a vector of a struct tx_data. I want to load this data into a RocksDB database.
What I have tried
After Google failed to return any results on storing multiple values under one key, I decided I would have to just enclose each block's data in 0xa1 at the beginning and 0x2a at the end:
*0x2a*
header
index
txns
hash
timestamp
data
*0x2a*
But I decided there was surely a simpler way. I tried looking at the code used by TurtleCoin, a currency that uses RocksDB for its database, but the code there is practically indecipherable. I have heard about serialization, but there seems to be little info out there on it.
Perhaps I am misunderstanding the use of a DB?

You need to serialize it. Serialization is the process of taking a structured set of data and turning it into one string, number or vector of bytes that can later be deserialized back into that struct. One method would be to take the hash of the block and use it as the key in the db, then create a new struct which does not contain the hash. Then write a function that takes a Block struct and a path, constructs a BlockNoHash struct and saves it, and another function that reads a block by its hash and returns a Block struct. At the most basic level, you could separate the fields with a character which will never occur in the data (e.g. ` or |), though this means that if one piece of the data is corrupted you can't recover any of the other data.
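A minimal sketch of the delimiter approach, assuming a cut-down BlockNoHash with only three fields and '|' as the separator. The field selection and the function names serialize and deserialize are illustrative only; a real block would also need its header and transactions encoded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>

// Hypothetical reduced block: the hash is left out because it serves as the db key.
struct BlockNoHash {
    std::uint32_t index;
    std::uint64_t timestamp;
    std::string data;
};

// Flatten the struct into "index|timestamp|data".
std::string serialize(const BlockNoHash& b) {
    // The scheme only works if the separator never occurs inside a field.
    assert(b.data.find('|') == std::string::npos);
    return std::to_string(b.index) + '|' + std::to_string(b.timestamp) + '|' + b.data;
}

// Rebuild the struct; returns false if the string is malformed.
bool deserialize(const std::string& s, BlockNoHash& out) {
    const std::size_t first = s.find('|');
    if (first == std::string::npos) return false;
    const std::size_t second = s.find('|', first + 1);
    if (second == std::string::npos) return false;
    try {
        out.index = static_cast<std::uint32_t>(std::stoul(s.substr(0, first)));
        out.timestamp = std::stoull(s.substr(first + 1, second - first - 1));
    } catch (const std::exception&) {
        return false;  // a numeric field was not a valid number
    }
    out.data = s.substr(second + 1);
    return true;
}
```

The resulting string would be stored as the value under the block's hash, and deserialize would be applied to whatever the database returns for that key.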

There are two related questions here.
One is: how do you store complex data -- more than just a simple integer or string -- within a key-value store like RocksDB. As Leo says, you need to serialize them.
Rather than writing your own code, the typical easier way is to use a framework like Protobuf or Thrift to generate code to translate between your in-memory structures and a flat bytes representation suitable to store in a database (or send over the network.)
A related question, from the title: how do you store multiple values per key?
There are two main options:
Use a compound key, that distinguishes the various values. By walking a key prefix you can find all the values in a set of related keys. This is better if the values get very large or if you want to find and update them independently.
Or, make the value for a single key actually be a compound object that includes several inner values. This is easiest if you always want to fetch all the sub-values in a single operation.
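The compound-key option can be illustrated without any database at all. The sketch below uses std::map purely as a stand-in for an ordered key-value store; it is not the RocksDB API, and the "block/<id>/<field>" key layout and the name values_with_prefix are assumptions made for the example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for an ordered key-value store (keys are kept in sorted order).
using KvStore = std::map<std::string, std::string>;

// Collect every value whose key starts with `prefix` by walking the key range.
std::vector<std::string> values_with_prefix(const KvStore& db, const std::string& prefix) {
    assert(!prefix.empty());
    std::vector<std::string> out;
    // lower_bound jumps to the first key >= prefix; all matching keys follow it.
    for (auto it = db.lower_bound(prefix);
         it != db.end() && it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        out.push_back(it->second);
    }
    return out;
}
```

Storing "block/0001/hash" and "block/0001/timestamp" as separate keys lets each field be updated independently, while values_with_prefix(db, "block/0001/") still fetches the whole set in key order.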

Qt: storing QKeySequence, extracting it from-form

My question refers to a couple of interesting problems I faced while developing an application for physics. The program is being written for some specific physical processes modeling. Scientists prefer to set-up controls personally, not use built-in ones. So, the problems I faced are:
to find a way to read a key sequence from a form (the key sequence is bound by the user by pressing keys)
to find a way to store the key sequence in some file
The solution for the 2nd problem may be the following: store the bytes of the key sequence as a hex string, and just read and write that. The most interesting problem for me now is the 1st one...

If I understand correctly, QKeySequenceEdit (http://doc.qt.io/qt-5/qkeysequenceedit.html#details) and QKeySequence (http://doc.qt.io/qt-5/qkeysequence.html#details) will solve both your problems.
QKeySequenceEdit is a widget; recording of the key sequence starts as soon as the widget gets focus and continues until you release the last key.
You do not need to store the key sequence in a file, as QKeySequenceEdit itself has a function keySequence() that returns a QKeySequence.
From the QKeySequence, you can convert all the keys to a string by using toString.

A way to retrieve data by address (c++)

Using c++, is it possible to store data to a file, and retrieve that data by address for quicker access? I want to get around having to parse or iterate large files of data, with the ability to gain direct access to a subset of that data. In your answers, it does not matter how the data is stored; whatever works best with the answer you have.

Yes. Assuming you're using iostreams, you can use tellg and tellp to retrieve the current get and put (i.e., read and write) locations respectively. You can later feed the same value back to seekg or seekp to get back to the same location (again, for reading or writing respectively).
You can use these to (for one example) create an index into a file. Before writing each record to your primary data file, you'd use tellp to retrieve the current location. Then you'd store the data to the data file, and save the value tellp returned into the index file. Depending on what sort of index you want, that might just contain a series of locations, so you can seek directly to record #N in the data file (even if the records are of different sizes).
Alternatively, you might store the data for some key field in the index file. For example, you might have a main data file with a set of records about people. Then you might build a number of indices into that, one with last names and a location for each, another with birthdays and a location for each, and so on, so you can search by name or birthday (or do an intersection between them to support things like people older than 18 with a last name starting with "M", "N" or "O").
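A small sketch of the tellp/seekg round trip, assuming newline-terminated text records. write_record and read_record are hypothetical helpers; they take generic streams, so the same code works with std::fstream or, for experimenting, std::stringstream.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Append one record and return the position at which it starts.
std::streampos write_record(std::ostream& data, const std::string& record) {
    assert(record.find('\n') == std::string::npos);  // newline terminates a record
    const std::streampos where = data.tellp();       // remember the put position first
    data << record << '\n';
    return where;
}

// Jump straight to a stored position and read the record found there.
std::string read_record(std::istream& data, std::streampos where) {
    data.clear();      // drop any eof/fail state left by an earlier read
    data.seekg(where);
    std::string line;
    std::getline(data, line);
    return line;
}
```

Collecting the return values of write_record in a std::vector<std::streampos> gives the "series of locations" index: element N is the position of record N, even when the records have different sizes.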

How to deserialize a file containing multiple records

I've written a Thrift definition and used it to serialize multiple records into one file (I've added the size of the whole record at the beginning of each record). In short, this is what I have done:
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
myClass->write(protocol.get());
const std::string & data(transport->getBufferAsString());
Afterwards I just write the string data in binary mode. Now I want to deserialize this file again. I wouldn't have any problem if there was only one record in the file; unfortunately I have to write multiple records, so I guess I have to work with offsets based on the size I saved in the file along with the record itself. However, I can't seem to find any example I can use to achieve my goal, and the official documentation is quite lacking. Does anyone have any tips for me? If I'm missing some information, just ask.
Further information:
Of course I want to use Thrift to deserialize. However, one file can contain multiple records. For example: imagine I have defined a struct in a Thrift definition file that contains car information. Now I serialize multiple car structs into one output file. Serializing is no problem, as I just append the data. If I want to deserialize, however, I have to know where one record ends and the next begins. That is my problem: I don't know how to tell Thrift where one record begins and ends. I've searched the internet, but can't seem to find an example for C++ (I've got one for Python so far, but am not able to translate it to C++). The structure of one file can be described as follows: [lengthofrecord1][record1][lengthofrecord2][record2][...]
Thanks in Advance
Michael

How about having a list<records> that you de/serialize as a whole? Or is it an absolute requirement to read them independently and randomly? If yes, I see 1.5 (one and a half) possible solutions:
Have a second file as an index. This holds a map<recordNumber, offset>, or simply a sorted list of integer pairs, to quickly locate records. Since this data is much smaller than the records, you can probably keep it cached in memory all the time.
The half solution: iff the record size is fixed, any record's position can be calculated easily as recordSize * (recordNr-1). This way you don't even need the size prefix. If you have strings or other variable-sized entities in the record, this will not work unless you force a fixed record size by reserving a buffer of a predefined (maximum) size for each record. It's a little ugly, thus the "half" solution, but you don't need the index file.
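To make the offset-index idea concrete for the [length][record] layout from the question, here is a rough sketch that deliberately leaves Thrift out: the payload is treated as opaque bytes. The 4-byte native-endian length prefix and the function names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Append one record as [4-byte length][payload] (native byte order).
void append_record(std::ostream& out, const std::string& payload) {
    assert(payload.size() <= std::numeric_limits<std::uint32_t>::max());
    const std::uint32_t len = static_cast<std::uint32_t>(payload.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
}

// Walk the file once, hopping from length prefix to length prefix,
// and collect the offset at which each record starts.
std::vector<std::streamoff> build_offsets(std::istream& in) {
    std::vector<std::streamoff> offsets;
    in.clear();
    in.seekg(0, std::ios::beg);
    std::streamoff pos = 0;
    std::uint32_t len = 0;
    while (in.read(reinterpret_cast<char*>(&len), sizeof len)) {
        offsets.push_back(pos);
        pos += static_cast<std::streamoff>(sizeof len) + len;
        in.seekg(pos, std::ios::beg);  // skip the payload without reading it
    }
    in.clear();  // reaching the end is expected, so reset the stream state
    return offsets;
}

// Read the payload of the record that starts at `offset`.
std::string read_record_at(std::istream& in, std::streamoff offset) {
    in.clear();
    in.seekg(offset, std::ios::beg);
    std::uint32_t len = 0;
    if (!in.read(reinterpret_cast<char*>(&len), sizeof len)) return std::string();
    std::string payload(len, '\0');
    if (len > 0 && !in.read(&payload[0], len)) return std::string();
    return payload;
}
```

The vector returned by build_offsets plays the role of the in-memory index, and the bytes returned by read_record_at are what would be handed to the deserializer for a single record.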

Although maybe not the perfect solution, this seems to work for me:
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
transport->resetBuffer((uint8_t*) buffer, sizeOfEntry);
Here, buffer is a char array containing the desired record (I used seekg for the offset) and sizeOfEntry is the record's size. Afterwards I can go on with the automatically generated read method of my Thrift-generated class. In fact, I had this solution earlier; I just messed up my offset, so it didn't work.

What is a good way to traverse through a big file in c++

I have really big files which contain data packages. The file itself is simply a really big string, and the packages are separated by the string "PACK1.0".
Assuming "XXX" is data, a package looks like this :
PACK1.0XXXXXXXXXXXXXXXXXPACK1.0XXXXXXXXXXXXXXPACK1.0XXXXXXXXXX
I am creating a hash map which contains the package number and the byte where that package begins.
Example:
PACKAGE NR | BYTE WHERE IT BEGINS IN THE STREAM
0 | 0
1 | 128
2 | 256
. | .
. | .
If I want package number 5340, I look up in the hashmap at which byte the package begins, go to that byte with stream.seekg(POSITION) and parse the package, in theory.
My final problem is: I want to travel through the file with a slider, with play and pause options. My thought was that the slider has a range of min=0 to max=packagecount.
Is this a good way to traverse through a file?
What problems can this cause? What is a better way to do this?
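This is my code for storing the hashmap (it assumes a package is 128 bytes long):
std::map<int, std::streamoff> THEMAP;   // package number -> byte offset where it begins
thefile.seekg(0, std::ios::end);
std::streamoff dataLength = thefile.tellg();
thefile.seekg(0, std::ios::beg);
std::streamoff position = 0;
int packagecount = 0;
while (position < dataLength)
{
    // Store the start of this package before advancing, so package 0 maps to byte 0.
    THEMAP.insert(std::make_pair(packagecount, position));
    position = position + 128;
    packagecount++;
}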

This is usually a case for memory-mapped I/O (MMIO). If you are Windows only, then use MapViewOfFile and the other functions in that family. For cross-platform usage I recommend glib's file map functions. What MMIO does is map a part of a file (or an entire file) into the process's memory space, so you can access it via a simple pointer. You can choose arbitrarily which part of the file is mapped and how large the mapping is.
A possible strategy for you could be that, on startup, you map a fixed-size block of the file into memory (in a loop, block by block) and search for the first package identifier in each block. This is relatively quick and gives you a first set of markers. On the next access you can use this initial set to find the proper part of the file, map it and scan only that part. Of course, you'd store any marker that comes along.
Later, when you scroll through your file you just map the page (can be smaller this time, depending on how much data you need at a certain point in time) and display the needed data. Obviously, the address of the package markers can at the same time be used as start address for the memory mapping.
Nice side effects are that it is completely irrelevant what size the packages are and you can map files of any size, even gigabyte sized files. By using small views on the file the memory requirement of your application can be very small.
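Once a view is mapped, it is just a pointer and a length, so the marker scan itself does not depend on the mapping API. A minimal sketch of that scan over an in-memory buffer, with find_packages as a hypothetical name and "PACK1.0" as the default marker taken from the question:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Returns the offset (relative to `data`) of every occurrence of `marker`
// in the buffer [data, data + size). The buffer could be a mapped view.
std::vector<std::size_t> find_packages(const char* data, std::size_t size,
                                       const char* marker = "PACK1.0") {
    assert(data != nullptr || size == 0);
    std::vector<std::size_t> offsets;
    const std::size_t marker_len = std::strlen(marker);
    if (marker_len == 0) return offsets;

    const char* const end = data + size;
    const char* p = data;
    // std::search returns `end` when no further occurrence exists.
    while ((p = std::search(p, end, marker, marker + marker_len)) != end) {
        offsets.push_back(static_cast<std::size_t>(p - data));
        p += marker_len;  // continue after this marker
    }
    return offsets;
}
```

When a file is scanned view by view, the offset of the view within the file has to be added to each result, and a marker that straddles the boundary between two views would be missed unless consecutive views overlap by at least the marker length minus one byte.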
