Saving/storing/exporting large objects in C++

Background:
I am new to C++ and am trying to build a program that returns a string naming the color of a given hex color code. The overall idea is to read the hex code of the pixel under the mouse pointer and return a string that describes the color (like "Dark Red" for #8B0000). (I am colorblind, so this would be a great help.)
As a first try, I created a .txt file that contains all possible color codes, one per line. Needless to say, the document has 16777216 lines and is 134.2 MB in size. I have searched the internet and found that the only way to read a .txt file in C++ is line by line, start to end. That would result in 16777216 calls to "getline()" for the string "Black". This approach got my "hopeless" stamp on it for now.
Idea:
I would like to create a vector that contains 16777216 instances of (String colour) and use a hex-to-int conversion to use as an index to locate the correct String. This vector would also get quite big and pretty unhandy to build or use.
Problem:
I need to find the best way (if possible) to save/preserve a large object along with my c++ classes, so that I could just import the object and use it right away.
Thanks in advance.

A) Your file contains 16777216 lines, which is more than the number of words in English, and probably Russian, Greek, Chinese and Japanese all combined.
B) You need to put things into ranges and then do a binary search for the correct range. In other words, map each named color, such as navy blue, to an object holding the low and high values of its range.
C) Put all the ranges into a big list and sort the list.
D) Then do a binary search for any particular color and it will land in the correct range.
For example:
// Navy blue might be this range
Low = RGB(0,0,170)
High = RGB(0,0,200)
// Light Red might be this range
Low = RGB(240,0,0)
High = RGB(255,0,0)
I mean, why would you want to name each color if you can name the range instead?
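The range idea in steps B to D could be sketched roughly as follows. This is only an illustration under some assumptions: colors are packed into a single 0xRRGGBB integer, the ranges do not overlap, and the names ColorRange, rgb, sortRanges and findRange are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One named interval of packed 0xRRGGBB values, inclusive on both ends.
// (Hypothetical type for this sketch.)
struct ColorRange {
    std::uint32_t low;
    std::uint32_t high;
    std::string name;
};

// Pack three 8-bit channels into a single 0xRRGGBB integer.
constexpr std::uint32_t rgb(std::uint32_t r, std::uint32_t g, std::uint32_t b) {
    return (r << 16) | (g << 8) | b;
}

// Step C: sort the ranges by their lower bound so they can be binary searched.
void sortRanges(std::vector<ColorRange>& ranges) {
    std::sort(ranges.begin(), ranges.end(),
              [](const ColorRange& a, const ColorRange& b) { return a.low < b.low; });
}

// Step D: returns the name of the range containing `color`, or "" if none does.
// Assumes `ranges` is sorted by `low` and the ranges do not overlap.
std::string findRange(const std::vector<ColorRange>& ranges, std::uint32_t color) {
    assert(color <= 0xFFFFFF);
    // Find the first range whose lower bound is strictly greater than the color.
    auto it = std::upper_bound(
        ranges.begin(), ranges.end(), color,
        [](std::uint32_t value, const ColorRange& r) { return value < r.low; });
    if (it == ranges.begin()) return "";
    --it;  // candidate: the last range that starts at or below the color
    return color <= it->high ? it->name : "";
}
```

One caveat with this packing: an interval such as RGB(240,0,0) to RGB(255,0,0) covers every packed value in between, including colors with non-zero green and blue, so the ranges need to be chosen with that in mind.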

The OP posted:
I have searched the internet and I've found that the only way to read a .txt file in C++ is line by line, start to end.
This is incorrect. I'm not sure what C++ file reading/writing classes you are using, but if the one you are using doesn't support random access, then find a different one.
If you fall back to C-style I/O and use fopen, you can use fseek to go to a specific place in the file. C++ file streams offer the same through seekg.
If you format all the records in your file to be the same length, you can then easily calculate the offset into the file as recordnumber*recordlength (assuming the first record is number 0).
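A minimal sketch of that offset calculation with C++ streams, assuming every record (including its trailing newline) has exactly the same length in bytes. The function name readRecord is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>

// Reads record number `recordNumber` (0-based) from a stream whose records
// all have exactly `recordLength` bytes, including any trailing newline.
// Returns an empty string if the record lies past the end of the stream.
std::string readRecord(std::istream& in, std::size_t recordNumber,
                       std::size_t recordLength) {
    assert(recordLength > 0);
    in.clear();  // reset eof/fail flags left over from an earlier read
    // offset = recordnumber * recordlength, with the first record being number 0
    in.seekg(static_cast<std::streamoff>(recordNumber * recordLength), std::ios::beg);
    std::string record(recordLength, '\0');
    in.read(&record[0], static_cast<std::streamsize>(recordLength));
    if (in.gcount() != static_cast<std::streamsize>(recordLength)) return "";
    return record;
}
```

The same function works on a std::ifstream. Opening the file with std::ios::binary is the safer choice here, so that the computed offsets are plain byte offsets on every platform.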

Thanks for the discussion David and Alex :)
Simple solution for one-way lookups from value to name
Let me suggest first quantizing the color space based on the four MSBs of every color channel.
val = ((hexval & 0xf00000) >> 12) | ((hexval & 0x00f000) >> 8) | ((hexval & 0x0000f0) >> 4)
Also create a std::vector<std::string> with 4096 entries into which you read the color names.
std::vector<std::string> names(4096);
//Read file and do for each line
names[val] = /*name for the value*/
//Lookup
const std::string& name = names[val];
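Put together, the quantized lookup might look like the sketch below. The class name ColorNames and its member functions are invented for illustration; only the bit manipulation and the 4096-entry vector come from the lines above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Keep the top four bits of each 8-bit channel: 0xRRGGBB -> 0xRGB (0..4095).
std::uint32_t quantize(std::uint32_t hexval) {
    assert(hexval <= 0xFFFFFF);
    return ((hexval & 0xF00000) >> 12) |
           ((hexval & 0x00F000) >> 8) |
           ((hexval & 0x0000F0) >> 4);
}

// Table with one name slot per quantized color (hypothetical helper class).
class ColorNames {
public:
    ColorNames() : names_(4096) {}

    // Stores `name` in the slot that `hexval` quantizes to.
    void set(std::uint32_t hexval, const std::string& name) {
        names_[quantize(hexval)] = name;
    }

    // Returns the stored name, or an empty string if the slot was never filled.
    const std::string& lookup(std::uint32_t hexval) const {
        return names_[quantize(hexval)];
    }

private:
    std::vector<std::string> names_;
};
```

With this layout, every color that shares the top four bits per channel with #8B0000 (for example #8A0505) resolves to the same slot, so only 4096 names have to be stored.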
For bidirectional lookup I would still look at boost::bimap. It can be configured to behave like a vector when finding names from color values, and like a hash table when finding the color values that match a given color name.

If you want to "preserve" an object of a class then I suggest pickling it! Here is a library that I think will work for C++.
These routines come from chooser.h in that library and should be useful:
// C++
DumpValToFile (const Val& thing_to_serialize,
const string& output_filename,
Serialization_e how_to_dump_the_data);
LoadValFromFile (const string& input_filename,
Val& result,
Serialization_e how_data_was_dumped);
I believe the Val& parameter is where you pass the object to be pickled.
What these tools do is serialize an object so that it can be stored easily on a hard drive.
I have never used this tool personally, but I have used something similar in Python, so I suggest experimenting with pickling things in Python first. Google "python pickle" for more information on that.

I think you'll only want to actually 'name' a very small number of the possible 2^24 RGB values (8 bits per channel), so std::map is your friend here:
std::map<int, string> colors;
colors[0x000000] = "Black";
...
colors[0xFFFFFF] = "White";
You could start using the HTML color names from here: http://www.w3schools.com/html/html_colornames.asp
You'll also want to write a 'findNearest' function (unless of course you actually have 16 million different names for colors). Your findNearest would compute the distance in RGB space between each named colour and the target colour, and return the name with the smallest distance.
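One possible shape for findNearest is sketched below. It assumes squared Euclidean distance in RGB space as the metric, which is a choice made for this example (perceptual color distances exist but are more involved), and it does a plain linear scan over the map.

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <string>

// Squared Euclidean distance between two packed 0xRRGGBB colors.
long squaredDistance(int a, int b) {
    const long dr = ((a >> 16) & 0xFF) - ((b >> 16) & 0xFF);
    const long dg = ((a >> 8) & 0xFF) - ((b >> 8) & 0xFF);
    const long db = (a & 0xFF) - (b & 0xFF);
    return dr * dr + dg * dg + db * db;
}

// Returns the name of the entry in `colors` that is closest to `target`.
// Linear scan: fine for a few hundred named colors.
std::string findNearest(const std::map<int, std::string>& colors, int target) {
    assert(!colors.empty());
    long best = std::numeric_limits<long>::max();
    const std::string* bestName = nullptr;
    for (const auto& entry : colors) {
        const long d = squaredDistance(entry.first, target);
        if (d < best) {
            best = d;
            bestName = &entry.second;
        }
    }
    return *bestName;
}
```

Comparing squared distances avoids the square root, since only the ordering of the distances matters.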

I would read it all into a std::map at the start of the program, then use that map for fast lookups. If reading the text file takes too long, consider converting it to some binary representation. Parsing the text file for every lookup will be slow.
If you want bidirectional lookups, i.e. from value to name and from name to value, check boost::multi_index
http://www.boost.org/doc/libs/1_53_0/libs/multi_index/doc/index.html
or boost::bimap
http://www.boost.org/doc/libs/1_53_0/libs/bimap/doc/html/index.html
I would also consider using boost::serialization to store and retrieve the data for the map in between runs.
http://www.boost.org/doc/libs/1_53_0/libs/serialization/doc/index.html

Related

Building Speech Dataset for LSTM binary classification

I'm trying to do binary LSTM classification using theano.
I have gone through the example code however I want to build my own.
I have a small set of "Hello" & "Goodbye" recordings that I am using. I preprocess these by extracting the MFCC features for them and saving these features in a text file. I have 20 speech files(10 each) and I am generating a text file for each word, so 20 text files that contains the MFCC features. Each file is a 13x56 matrix.
My problem now is: How do I use this text file to train the LSTM?
I am relatively new to this. I have gone through some literature on it as well but not found really good understanding of the concept.
Any simpler way using LSTM's would also be welcome.
There are many existing implementations, for example a TensorFlow implementation and a Kaldi-focused implementation with all the scripts. It is better to check them first.
Theano is too low-level; you might try Keras instead, as described in the tutorial. You can run the tutorial "as is" to understand how things work.
Then you need to prepare a dataset. You need to turn your data into sequences of data frames, and for every data frame in a sequence you need to assign an output label.
Keras supports two types of RNN layers: layers returning sequences and layers returning single values. You can experiment with both; in code you just use return_sequences=True or return_sequences=False.
To train with sequences you can assign a dummy label to all frames except the last one, where you assign the label of the word you want to recognize. You need to place the inputs and output labels into arrays. So it will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,...,1], [0,0,....,2]]
In X every element is a vector of 13 floats. In Y every element is just a number - 0 for intermediate frames and word ID for final frame.
To train with just labels you also place the inputs and output labels into arrays, but the output array is simpler: one label per word. So the data will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,1], [0,1,0]]
Note that the output is vectorized (np_utils.to_categorical) to turn the labels into vectors instead of plain numbers.
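To make the two label layouts concrete, here is a small language-neutral sketch written in C++ rather than Keras. It only builds the label arrays; the helper names sequenceLabels and oneHot are made up for this example, and class 0 is assumed to be the dummy label for intermediate frames.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-frame labels for sequence training: 0 for every intermediate frame,
// the word ID on the final frame.
std::vector<int> sequenceLabels(std::size_t frameCount, int wordId) {
    assert(frameCount > 0);
    std::vector<int> labels(frameCount, 0);
    labels.back() = wordId;
    return labels;
}

// One label per recording, as a one-hot vector of length `classCount`
// (the same idea as vectorizing a class number).
std::vector<int> oneHot(int wordId, std::size_t classCount) {
    assert(wordId >= 0 && static_cast<std::size_t>(wordId) < classCount);
    std::vector<int> encoded(classCount, 0);
    encoded[static_cast<std::size_t>(wordId)] = 1;
    return encoded;
}
```

For a recording with 56 frames of word ID 1, sequenceLabels(56, 1) gives 55 zeros followed by a 1, while oneHot(1, 3) gives a single vector for the whole recording.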
Then you create the network architecture. You can have 13 floats for input and a vector for output. In the middle you might have one fully connected layer followed by one LSTM layer. Do not use layers that are too big; start with small ones.
Then you feed this dataset into model.fit and it trains the model. You can estimate model quality on a held-out set after training.
You will have a problem with convergence since you have just 20 examples. You need far more examples, preferably thousands, to train an LSTM; with this little data you will only be able to use very small models.

Index-based access on Matrix-like structure in C++

I have an Nx2 mapping between two sets of encodings (not relevant: Unicode and GB18030) in this format:
Warning: huge XML, don't open if having slow connection:
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
Snapshot:
<a u="00B7" b="A1 A4"/>
<a u="00B8" b="81 30 86 30"/>
<a u="00B9" b="81 30 86 31"/>
<a u="00BA" b="81 30 86 32"/>
I would like to save the b-values (right column) in a data structure and to access them directly (no searching) with indexes based on a-values (left column).
Example:
I can store those elements in a data structure like this:
unsigned short *my_page[256] = {my_00,my_01, ....., my_ff}
where the elements are defined like:
static unsigned short my_00[256]
and so on.
So basically an array of arrays => 256x256 = 65536 available elements.
In the case of other encodings with fewer elements and different values (e.g. Chinese Big5, Japanese Shift, Korean KSC etc.), I can access the elements using a bijective function like this:
element = my_page[(unicode[i]>>8)&0x00FF][unicode[i]&0x00FF];, where unicode[i] is filled with the a-like elements from the mapping (as mentioned above). Generating and filling the my_page structure works analogously. For the working encodings, I have around 7000 characters to store (and each is stored in a unique place in my_page).
The problem comes with the GB18030 encoding, where I try to store 30861 elements in my_page (65536 elements). I am trying to use the same bijective function for filling (and then, analogously, accessing) the my_page structure, but it fails since the access does not return unique results.
For example: For the unicode values, there are more than 1 element accessed via
my_page[(unicode[i]>>8)&0x00FF][unicode[i]&0x00FF] since the indexes can be the same for i and for i+1 for example. Do you know another way of accessing/filling the elements in the my_page structure based only on pre-computed indexes like I was trying to do?
I assume I have to use something like a pseudo-hash function that returns me a range of values VRange and based on a set of rules I can extract from the range VRange the integer indexes of my_page[256][256].
If you have any advice, please let me know :)
Thank you !
For GB18030, refer to this document: http://icu-project.org/docs/papers/gb18030.html
As explained in this article:
“The number of valid byte sequences -- of Unicode code points covered and of mappings defined between them -- makes it impractical to directly use a normal, purely mapping-table-based codepage converter. With about 1.1 million mappings, a simple mapping table would be several megabytes in size.”
So it is most probably not a good idea to implement the conversion with a pure mapping table.
For large parts, there is a direct mapping between GB18030 and Unicode. Most of the four-byte characters can be translated algorithmically. The author of the article suggests handling such ranges with special code, and the other characters with a classic mapping table. These characters are the ones given in the XML mapping table: http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
Therefore, index-based access on a matrix-like structure in C++ remains an open problem for anyone who wants to research such bijective functions.
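As a rough illustration of what "translated algorithmically" can mean, a four-byte GB18030 sequence can be turned into a single linear position. The sketch assumes the usual four-byte layout (first and third bytes in 0x81..0xFE, second and fourth bytes in 0x30..0x39); the function name gb18030Linear is made up here.

```cpp
#include <cassert>
#include <cstdint>

// Linear position of a four-byte GB18030 sequence, counting 81 30 81 30 as 0.
// Assumption: bytes 1 and 3 range over 0x81..0xFE (126 values) and
// bytes 2 and 4 range over 0x30..0x39 (10 values).
std::uint32_t gb18030Linear(std::uint8_t b1, std::uint8_t b2,
                            std::uint8_t b3, std::uint8_t b4) {
    assert(b1 >= 0x81 && b1 <= 0xFE);
    assert(b2 >= 0x30 && b2 <= 0x39);
    assert(b3 >= 0x81 && b3 <= 0xFE);
    assert(b4 >= 0x30 && b4 <= 0x39);
    return (((b1 - 0x81u) * 10u + (b2 - 0x30u)) * 126u + (b3 - 0x81u)) * 10u +
           (b4 - 0x30u);
}
```

In the snapshot from the question, U+00B8, U+00B9 and U+00BA map to 81 30 86 30, 81 30 86 31 and 81 30 86 32, whose linear positions are consecutive. Inside such a run, a code point can be computed as the run's first code point plus the difference of linear positions, so only the start of each run needs to be stored in a table.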

How to deserialize a file containing multiple records

I've written a Thrift definition and used this definition to serialize multiple records into one file (I've added the size of the whole record at the beginning of each record). This is, in short, what I have done:
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
myClass->write(protocol.get());
const std::string & data(transport->getBufferAsString());
Afterwards I just write the string data to the file in binary mode. Now I want to deserialize this file again. I wouldn't have any problem if there was only one record in the file; unfortunately I have to write multiple records, so I guess I have to work with offsets based on the size I saved in the file along with the record itself. However, I can't seem to find any example I can use to achieve my goal, and the official documentation is quite lacking. Does anyone have any tips for me? If I'm missing some information, just ask.
Further information:
Of course I want to use Thrift to deserialize. However, one file can contain multiple records. For example: imagine I have defined a struct in a Thrift definition file that contains car information. Now I serialize multiple car structs into one output file. Serializing is no problem as I just append the data. If I want to deserialize, however, I have to know where one record ends and the next begins. That is my problem. I don't know how to tell Thrift where one record begins and ends. I've searched the internet, but can't seem to find an example for C++ (I got one for Python so far, but am not able to translate it to C++). The structure of one file can be described as follows: [lengthofrecord1][record1][lengthofrecord2][record2][...]
Thanks in Advance
Michael
How about having a list<records> that you de/serialize as a whole? Or is it an absolute requirement to read them independently and randomly? If yes, I see 1.5 (one and a half) possible solutions:
Have a second file as an index. This holds a map<recordNumber, offset>, or simply a sorted list of integer pairs, to quickly locate records. Since this data is much smaller than the records, you can probably keep it cached in memory all the time.
The half solution: if the record size is fixed, any record's position can be calculated easily as recordSize * (recordNr-1). This way you don't even need the size prefix. If you have strings or other variable-sized entities in the record, this will not work, unless you force a fixed record size by reserving a buffer of a predefined (maximum) size for each record. It's a little ugly, thus the "half" solution, but you don't need the index file.
Although maybe not the perfect solution, this seems to work for me:
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
transport->resetBuffer((uint8_t*) buffer, sizeOfEntry);
buffer is a char array containing the desired record (I used seekg for the offset) and sizeOfEntry is the record's size. Afterwards I can go on with the automatically generated read method of my Thrift-generated class. In fact I had this solution earlier; I just messed up my offset, so it didn't work.
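The seek-and-read part of this, together with the offset index suggested in the earlier answer, could be sketched without any Thrift calls as shown below. The big assumption is the size prefix: the question does not say how the length was written, so this example treats it as a 4-byte little-endian unsigned integer. All function names are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>
#include <vector>

// Reads a length prefix. Assumption: 4 bytes, little-endian, unsigned.
// Returns false if fewer than 4 bytes are left in the stream.
bool readLength(std::istream& in, std::uint32_t& length) {
    unsigned char bytes[4];
    in.read(reinterpret_cast<char*>(bytes), 4);
    if (in.gcount() != 4) return false;
    length = static_cast<std::uint32_t>(bytes[0]) |
             (static_cast<std::uint32_t>(bytes[1]) << 8) |
             (static_cast<std::uint32_t>(bytes[2]) << 16) |
             (static_cast<std::uint32_t>(bytes[3]) << 24);
    return true;
}

// Scans [length][record][length][record]... once and returns the offset
// at which each record's length prefix starts.
std::vector<std::streamoff> buildIndex(std::istream& in) {
    std::vector<std::streamoff> offsets;
    in.clear();
    in.seekg(0, std::ios::beg);
    std::uint32_t length = 0;
    for (;;) {
        const std::streamoff start = in.tellg();
        if (!readLength(in, length)) break;
        offsets.push_back(start);
        in.seekg(static_cast<std::streamoff>(length), std::ios::cur);  // skip payload
    }
    return offsets;
}

// Returns the raw bytes of the record whose length prefix starts at `offset`,
// or an empty string if the record cannot be read completely.
std::string readRecordAt(std::istream& in, std::streamoff offset) {
    assert(offset >= 0);
    in.clear();
    in.seekg(offset, std::ios::beg);
    std::uint32_t length = 0;
    if (!readLength(in, length)) return "";
    std::string payload(length, '\0');
    in.read(&payload[0], static_cast<std::streamsize>(length));
    if (in.gcount() != static_cast<std::streamsize>(length)) return "";
    return payload;
}
```

The bytes returned by readRecordAt play the role of buffer and sizeOfEntry in the snippet above: they are what would be handed to the memory transport before calling the generated read method.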

How to create an index for a collection of vectors/histograms for content based image retrieval

I'm currently writing a bag-of-visual-words-based image retrieval system, which is similar to the Vector Space Model in text retrieval. Under this framework, each image is represented by a vector (sometimes also called a histogram in the literature). Basically, each number in the vector counts the number of times a "visual word" occurs in that image. If two images have vectors which are "close" together, they have many image features in common and are hence similar.
I'm basically trying to create the inverted file index for a set of such vectors. I want something that can scale from thousands (during the trial stage) to hundreds of thousands or a million+ images, so a home-made data structure hack will not work.
I've looked at Lucene but apparently it only indexes text (correct me if I'm wrong) whereas in my case I want it to index numbers (i.e. the vectors themselves). I've seen cases where people convert the vector to a text document in the following way:
<3, 6, ..., 5> --> "w1 w2... wn". Basically any component that is non-zero is replaced by a textual word "w[n]" where n is the index of that number. This "document" is then passed to Lucene to index.
The problem with this method is that the text representation of the vector does not encode how frequently a particular "word" occurs, so the ranking of the retrieved images would not be good.
Does anyone know of a mature indexing API that can handle vectors or perhaps suggest a different encoding scheme for my vectors so that I can continue to use Lucene? I've also looked at Lucene for Image Retrieval (LIRE) project and have tried the demo that came with it but the number of exceptions that were generated when I ran that demo makes me unsure about using it.
As for language of API, I'm open to C++ or Java.
Thanks in advance for any replies.
You can try GRire which is a Java library that implements the Bag of Visual Words model. It is my project and I am currently working on implementing an inverted index.
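For readers unfamiliar with the term, the inverted file index described in the question can be pictured with the toy sketch below. It is not how GRire or Lucene work internally and it is not meant to scale; it only shows postings that keep the per-image counts, which is the information the text encoding loses. The class name and the dot-product scoring are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Toy in-memory inverted index over visual-word histograms (illustration only).
class InvertedIndex {
public:
    // Adds one image; histogram[w] is how often visual word w occurs in it.
    void add(std::size_t imageId, const std::vector<int>& histogram) {
        for (std::size_t word = 0; word < histogram.size(); ++word) {
            assert(histogram[word] >= 0);
            if (histogram[word] > 0) {
                postings_[word].emplace_back(imageId, histogram[word]);
            }
        }
    }

    // Scores every image that shares at least one word with the query.
    // The score is an unnormalized dot product of the two histograms.
    std::map<std::size_t, long> query(const std::vector<int>& histogram) const {
        std::map<std::size_t, long> scores;
        for (std::size_t word = 0; word < histogram.size(); ++word) {
            if (histogram[word] <= 0) continue;
            const auto it = postings_.find(word);
            if (it == postings_.end()) continue;
            for (const auto& posting : it->second) {
                scores[posting.first] +=
                    static_cast<long>(histogram[word]) * posting.second;
            }
        }
        return scores;
    }

private:
    // visual word id -> list of (image id, count) pairs
    std::map<std::size_t, std::vector<std::pair<std::size_t, int>>> postings_;
};
```

Only the postings of words present in the query are visited, which is what makes an inverted index cheaper than comparing the query against every stored vector.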

Making an index-creating class

I'm busy programming a class that creates an index out of a text file (ASCII/binary).
My problem is that I don't really know how to start. I already made some attempts but none really worked well for me.
I do NOT need to find the address of the file via the MFT. I just want to load the file and find things much faster by searching for the key in the index file and jumping to the address it gives in the text file.
The index-file should be built up as follows:
KEY ADDRESS
1 0xABCDEF
2 0xFEDCBA
. .
. .
We have a text-file with the following example value:
1, 8752 FW,
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++,
******************************************************************************,
------------------------------------------------------------------------------;
I hope that this explains my question a bit better.
Thanks!
It seems to me that all your class needs to do is store an array of pointers or file start offsets to the key locations in the file.
It really depends on what your Key locations represent.
I would suggest that you access the file through your class using some public methods. You can then more easily tie in Key locations with the data written.
For example, your Key locations may be where each new data block written into the file starts. E.g. first block 1000 bytes, key location 0; second block 2500 bytes, key location 1000; third block 550 bytes, key location 3500; the next block will start at 4050, all assuming that 0 is the first byte.
Store the Key values in a variable length array and then you can easily retrieve the starting point for a data block.
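The block example above amounts to a running sum of block sizes. A small sketch, with the function name keyLocations invented for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given the size of each block in write order, returns the starting offset
// of every block plus one final entry: where the next block would begin.
std::vector<std::size_t> keyLocations(const std::vector<std::size_t>& blockSizes) {
    std::vector<std::size_t> offsets;
    offsets.reserve(blockSizes.size() + 1);
    std::size_t next = 0;  // offset 0 is the first byte of the file
    for (std::size_t size : blockSizes) {
        offsets.push_back(next);
        assert(next + size >= next);  // guard against overflow
        next += size;
    }
    offsets.push_back(next);
    return offsets;
}
```

For block sizes 1000, 2500 and 550 this yields 0, 1000, 3500 and 4050, matching the numbers in the example.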
If your Key point is signified by some key character then you can use the same class, but with a slight change to store where the Key value is stored. The simplest way is to step through the data until the key character is located, counting the number of characters checked as you go. The count is then used to produce your key location.
Your code snippet isn't so much of an idea as it is the functionality you wish to have in the end.
Recognize that "indexing" merely means "remembering" where things are located. You can accomplish this using any data structure you wish... B-Tree, Red/Black tree, BST, or more advanced structures like suffix trees/suffix arrays.
I recommend you look into such data structures.
edit:
With the new information, I would suggest making your own key/value lookup. Build an array of keys, and associate their values somehow. This may mean building a class or struct that contains both the key and the value, or one that contains the key and a pointer to a struct or class with the value, etc.
Once you have done this, sort the key array. Now you can do a binary search on the keys to find the appropriate value for a given key.
You could build a hash table in a similar manner, or a BST or similar structure like I mentioned earlier.
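A sketch of the sorted-array variant, assuming integer keys and file addresses as values as in the KEY/ADDRESS table from the question. The struct and function names are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One row of the index: a key and the address of its data in the text file.
struct IndexEntry {
    int key;
    std::uint64_t address;
};

// Orders entries by key; shared by sorting and searching.
bool keyLess(const IndexEntry& a, const IndexEntry& b) { return a.key < b.key; }

// Sort the entries by key so that binary search becomes possible.
void sortIndex(std::vector<IndexEntry>& index) {
    std::sort(index.begin(), index.end(), keyLess);
}

// Binary search for `key`. On success stores the address and returns true.
bool lookup(const std::vector<IndexEntry>& index, int key, std::uint64_t& address) {
    assert(std::is_sorted(index.begin(), index.end(), keyLess));
    const auto it = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry& entry, int wanted) { return entry.key < wanted; });
    if (it == index.end() || it->key != key) return false;
    address = it->address;
    return true;
}
```

Each lookup takes a number of comparisons logarithmic in the number of keys, at the cost of one sort after the index has been built.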
I still don't really understand the question (work on your question asking skillz), but as far as I can tell the algorithm will be:
scan the file linearly, the first value up to the first comma (',') is a key, probably. All other keys occur wherever a ';' occurs, up to the next ',' (you might need to skip linebreaks here). If it's a homework assignment, just use scanf() or something to read the key.
print out the key and byte position you found it at to your index file
AFAIUI that's the algorithm, I don't really see the problem here?
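That algorithm might be sketched as follows, under the stated guess about the format: the first key runs up to the first ',' and every later key starts after a ';' and runs up to the next ','. Whitespace and line breaks inside or before a key are skipped. The function names and the "key offset" output format (decimal, one pair per line) are assumptions for this example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>  // string streams, handy for trying this without a file
#include <string>
#include <utility>
#include <vector>

// Scans the stream once and returns (key, byte offset of the key) pairs.
std::vector<std::pair<std::string, std::size_t>> scanKeys(std::istream& in) {
    std::vector<std::pair<std::string, std::size_t>> index;
    bool expectKey = true;  // true at the start and right after each ';'
    std::string key;
    std::size_t keyStart = 0;
    std::size_t pos = 0;
    char c;
    while (in.get(c)) {
        if (expectKey) {
            if (c == ',') {
                // The key ends at the first comma of the record.
                if (!key.empty()) index.emplace_back(key, keyStart);
                key.clear();
                expectKey = false;
            } else if (!std::isspace(static_cast<unsigned char>(c))) {
                if (key.empty()) keyStart = pos;  // first character of the key
                key += c;
            }
        } else if (c == ';') {
            expectKey = true;  // the record is over, the next key follows
        }
        ++pos;
    }
    return index;
}

// Writes one "key offset" line per entry to the index stream.
void writeIndex(std::ostream& out,
                const std::vector<std::pair<std::string, std::size_t>>& index) {
    for (const auto& entry : index) {
        out << entry.first << ' ' << entry.second << '\n';
    }
}
```

When the input is a real file, opening it in binary mode keeps the counted positions equal to the byte offsets that a later seek will use.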