Index-based access on a Matrix-like structure in C++

I have an Nx2 mapping between two sets of encodings (Unicode and GB18030, but that is not really relevant) in this format:
Warning: huge XML, don't open it on a slow connection:
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
Snapshot:
<a u="00B7" b="A1 A4"/>
<a u="00B8" b="81 30 86 30"/>
<a u="00B9" b="81 30 86 31"/>
<a u="00BA" b="81 30 86 32"/>
I would like to save the b-values (right column) in a data structure and to access them directly (no searching) with indexes based on a-values (left column).
Example:
I can store those elements in a data structure like this:
unsigned short *my_page[256] = {my_00, my_01, ..., my_ff};
where the elements are defined like:
static unsigned short my_00[256];
and so on. So basically a matrix of matrices => 256x256 = 65536 available elements.
For other encodings with fewer elements and different values (e.g. Chinese Big5, Japanese Shift JIS, Korean KSC, etc.), I can access the elements using a bijective function like this:
element = my_page[(unicode[i]>>8)&0x00FF][unicode[i]&0x00FF];
where unicode[i] is filled with the a-values from the mapping (as mentioned above). How I generate and fill the my_page structure is analogous. For the encodings that work, I have around 7000 characters to store (and each is stored in a unique place in my_page).
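For illustration, a minimal sketch of how I fill and read such a table (only one page is wired up here, and the names and fill logic are simplified placeholders):
// one page per high byte that is actually used (my_00, my_01, ..., my_ff)
static unsigned short my_00[256];
// the page table itself; unused slots stay null
static unsigned short *my_page[256] = { my_00 };

// fill: store a b-value at the slot addressed by the a-value
void fill(unsigned short a_value, unsigned short b_value) {
    unsigned char hi = (a_value >> 8) & 0xFF;
    unsigned char lo = a_value & 0xFF;
    if (my_page[hi] != 0)
        my_page[hi][lo] = b_value;
}

// lookup: the same bijective index computation, no searching
unsigned short lookup(unsigned short a_value) {
    unsigned char hi = (a_value >> 8) & 0xFF;
    unsigned char lo = a_value & 0xFF;
    return my_page[hi] != 0 ? my_page[hi][lo] : 0;  // 0 = no mapping
}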
The problem comes with the GB18030 encoding, trying to store 30861 elements in my_page (65536 slots). I am trying to use the same bijective function for filling (and then accessing, analogously) the my_page structure, but it fails, since the access does not return unique results.
For example, for the Unicode values there is more than one element accessed via
my_page[(unicode[i]>>8)&0x00FF][unicode[i]&0x00FF], since the indexes can be the same for i and i+1, for example. Do you know another way of accessing/filling the elements in the my_page structure based only on pre-computed indexes, like I was trying to do?
I assume I have to use something like a pseudo-hash function that returns a range of values VRange and, based on a set of rules, lets me extract from VRange the integer indexes into my_page[256][256].
If you have any advice, please let me know :)
Thank you !

For GB18030, refer to this document: http://icu-project.org/docs/papers/gb18030.html
As explained in this article:
“The number of valid byte sequences -- of Unicode code points covered and of mappings defined between them -- makes it impractical to directly use a normal, purely mapping-table-based codepage converter. With about 1.1 million mappings, a simple mapping table would be several megabytes in size.”
So it is most probably not a good idea to implement the conversion based on a pure mapping table.
For large parts, there is a direct mapping between GB18030 and Unicode. Most of the four-byte characters can be translated algorithmically. The author of the article suggests handling such ranges with special code, and the other ones with a classic mapping table. Those other characters are the ones given in the XML mapping table: http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
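To make the hybrid idea concrete, here is a rough sketch; the range values below are placeholders rather than the real GB18030 ranges, which would come from the ICU range data:
#include <cstdint>
#include <cstddef>

// One linearly-mapped block: a contiguous run of Unicode code points maps to a
// contiguous run of (linearized) GB18030 four-byte sequence numbers.
struct LinearRange {
    uint32_t u_first, u_last;  // Unicode range
    uint32_t gb_first;         // linearized GB18030 value of u_first
};

// Placeholder entries -- the real list comes from the ICU gb18030 range data.
static const LinearRange kRanges[] = {
    { 0x0452, 0x200F, 0x1E3E },  // hypothetical numbers, for shape only
};

// Fallback table for the irregular mappings, filled from the XML file.
static unsigned short *my_page[256];

uint32_t to_gb18030(uint32_t u) {
    // algorithmic part: a handful of range checks instead of ~1.1 million entries
    for (size_t i = 0; i < sizeof(kRanges) / sizeof(kRanges[0]); ++i)
        if (u >= kRanges[i].u_first && u <= kRanges[i].u_last)
            return kRanges[i].gb_first + (u - kRanges[i].u_first);
    // irregular part: classic two-level table lookup
    unsigned short *page = my_page[(u >> 8) & 0xFF];
    return page ? page[u & 0xFF] : 0;  // 0 = unmapped
}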
Therefore, index-based access on a matrix-like structure in C++ remains an open problem for whoever wants to research such bijective functions.

Related

Is there a established data structure for place/transition petri-nets?

I'm trying to come up with an elegant solution for representing place/transition petri nets.
So far I save them as follows:
{:netname {:places {:name tokens, ...}
:transitions #{:t1, :t2, :t3, ...}
:edges_in #{[:from :to tokens], ...}
:edges_out #{[:from :to tokens], ...}}}
tokens is a number; everything else starts with a symbol with the corresponding name.
//edit - Some more clarification:
The :netname and :name values are unique, because it has to be possible to merge two nets, where the places again have to have unique names. The numerical tokens are determined by the user of the Petri nets during creation of a place or edge.
I would be thankful for some pointers or links to a more elaborate / better data structure for my problem.
//edit 2 - I reworked my first take on the data structure because of the uniqueness of place names. :places now references a hashmap. Also, :edges_in and :edges_out are now hashmaps, because every edge is unique given its origin, destination and token number.
//edit 3 - The use of the structure: it is read and written to in roughly equal measure, I would say. The way a Petri net is used, there is a back and forth between modifying the net and reading it, with maybe slightly more reading towards the end.
I also modified my structure above slightly, so :edges_in and :edges_out now save the triplets as vectors instead of lists. This simplifies saving the hashmap to a file and reading it back, because load-string evaluates lists as expressions.
You could look at the ISO 15909 interchange format for HLPNs, called PNML. This would at least provide you with a basis for a standard interface to your data structures.

Saving/storing/exporting large objects in C++

Background:
I am a new C++ programmer and am trying to build a program that returns a string stating the color of a given hex color code. The overall function is to request the hex code of the pixel the mouse pointer is on, and return a string that describes the color (like "Dark Red" for #8B0000). (I am colorblind; it would be a great help.)
As a first try, I created a .txt file that contains all possible color codes on separate lines. Needless to say, the document has 16777216 lines and is 134.2 MB in size. I have searched the internet and I've found that the only way to read a .txt file in C++ is line by line, start to end. That could result in 16777216 calls to getline() just to find the string "Black". This approach got my "hopeless" stamp on it for now.
Idea:
I would like to create a vector that contains 16777216 strings (the colour names) and use a hex-to-int conversion as an index to locate the correct string. This vector would also get quite big and pretty unwieldy to build or use.
Problem:
I need to find the best way (if possible) to save/preserve a large object along with my C++ classes, so that I could just import the object and use it right away.
Thanks in advance.
A) Your file contains more than 16777216 lines, which means it contains more entries than there are words in English, and probably Russian, Greek, Chinese and Japanese combined.
B) You need to put things into ranges and then do a binary search for the correct range. In other words, represent the range for navy blue as an object with a low and a high value.
C) Put all the ranges into a big list and sort the list.
D) Then do a binary search for any particular color and it will fall into the correct range.
For example:
// Navy blue might be this range
Low = RGB(0,0,170)
High = RGB(0,0,200)
// Light Red might be this range
Low = RGB(240,0,0)
High = RGB(255,0,0)
I mean, why would you want to name each color if you can name the range instead?
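A sketch of that idea, assuming non-overlapping ranges keyed on the packed 0xRRGGBB value (the ranges and names below are placeholders):
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// A named range of packed 0xRRGGBB values.
struct ColorRange {
    uint32_t low, high;
    std::string name;
};

// 'ranges' must be sorted by 'low' and non-overlapping.
std::string name_of(uint32_t rgb, const std::vector<ColorRange>& ranges) {
    auto it = std::upper_bound(ranges.begin(), ranges.end(), rgb,
        [](uint32_t v, const ColorRange& r) { return v < r.low; });
    if (it != ranges.begin()) {
        --it;  // last range starting at or below rgb
        if (rgb >= it->low && rgb <= it->high)
            return it->name;
    }
    return "unknown";
}

// usage (ranges already sorted by low):
//   std::vector<ColorRange> ranges = {
//       { 0x0000AA, 0x0000C8, "Navy Blue" },   // RGB(0,0,170)..RGB(0,0,200)
//       { 0xF00000, 0xFF0000, "Light Red" },   // RGB(240,0,0)..RGB(255,0,0)
//   };
//   name_of(0x0000B0, ranges);  // "Navy Blue"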
The OP posted:
I have searched the internet and I've found that the only way to read a .txt file in C++ is line by line, start to end.
This is incorrect. I'm not sure what C++ file reading/writing classes you are using, but if the one you are using doesn't support random access, then find a different one.
If you fall back to fopen, you can use fseek to jump to a specific place in the file.
If you format all the records in your file to be the same length, you can then easily calculate the offset into the file as recordnumber*recordlength (assuming the first record is number 0).
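For example, a sketch under that assumption (the record length and file name here are made up):
#include <cstdio>
#include <string>

// Every record is padded to the same length, so record n starts at byte
// n * RECORD_LENGTH (the first record is number 0).
const long RECORD_LENGTH = 24;  // placeholder: padded color name + newline

std::string read_record(const char* path, long record_number) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return "";
    char buf[RECORD_LENGTH];
    std::fseek(f, record_number * RECORD_LENGTH, SEEK_SET);
    std::size_t n = std::fread(buf, 1, RECORD_LENGTH, f);
    std::fclose(f);
    return std::string(buf, n);  // caller can trim the padding
}

// usage: read_record("colornames.txt", 0x8B0000) might return "Dark Red" plus padding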
Thanks for the discussion David and Alex :)
A simple solution for one-way lookups (value to name):
Let me suggest first quantizing the color space based on the four most significant bits of each color channel:
int val = ((hexval & 0xf00000) >> 12) | ((hexval & 0x00f000) >> 8) | ((hexval & 0x0000f0) >> 4);
Also create a std::vector<std::string> with 4096 entries into which you read the color names.
std::vector<std::string> names(4096);
//Read file and do for each line
names[val] = /*name for the value*/
//Lookup
const std::string& name = names[val];
For bidirectional lookup I would still look at boost::bimap. It can be configured to look like a vector when finding names from color values, and configured as a hash table when finding the color values that match a certain named color.
If you want to "preserve" an object of a class then I suggest pickling it! Here is a library that I think will work for C++.
These routines come from chooser.h in that library and should be useful:
// C++
DumpValToFile (const Val& thing_to_serialize,
               const string& output_filename,
               Serialization_e how_to_dump_the_data);

LoadValFromFile (const string& input_filename,
                 Val& result,
                 Serialization_e how_data_was_dumped);
I believe the parameter Val& is where you pass the object that needs to be pickled.
What these tools do is serialize an object so that it can be stored easily on a hard drive.
I have never used this tool personally, but I have used something similar to it in python and so I suggest experimenting with pickling things in Python first. Google "python pickle" for more information on that.
I think you'll only want to actually 'name' a very small number of the 2^24 possible RGB values, so std::map is your friend here:
std::map<int, string> colors;
colors[0x000000] = "Black";
...
colors[0xFFFFFF] = "White";
You could start using the HTML color names from here: http://www.w3schools.com/html/html_colornames.asp
You'll also want to write a 'findNearest' function, (unless of course you actually have 16 million different names for colors). Your findNearest would compute the distance in RGB-space between each named colour and the target colour.
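A sketch of such a findNearest, assuming the map keys are the packed 0xRRGGBB integers used above:
#include <climits>
#include <map>
#include <string>

// Returns the name whose colour is closest to 'target' in RGB space
// (squared Euclidean distance over the three channels).
std::string findNearest(const std::map<int, std::string>& colors, int target) {
    int tr = (target >> 16) & 0xFF, tg = (target >> 8) & 0xFF, tb = target & 0xFF;
    long best = LONG_MAX;
    std::string bestName = "unknown";
    for (const auto& entry : colors) {
        int r = (entry.first >> 16) & 0xFF;
        int g = (entry.first >> 8) & 0xFF;
        int b = entry.first & 0xFF;
        long d = (long)(r - tr) * (r - tr)
               + (long)(g - tg) * (g - tg)
               + (long)(b - tb) * (b - tb);
        if (d < best) { best = d; bestName = entry.second; }
    }
    return bestName;
}

// usage: findNearest(colors, 0x8B0000) returns the closest named colour in the map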
I would read it all into a std::map at the start of the program. Then use that map for fast lookups. If reading the text file takes too long, consider converting it to some binary representation. Parsing the text file for every lookup will be slow.
If you want bidirectional lookups, i.e. from value to name and name to value, check boost::multi_index:
http://www.boost.org/doc/libs/1_53_0/libs/multi_index/doc/index.html
or boost::bimap
http://www.boost.org/doc/libs/1_53_0/libs/bimap/doc/html/index.html
I would also consider using boost::serialization to store and retrieve the data for the map in between runs.
http://www.boost.org/doc/libs/1_53_0/libs/serialization/doc/index.html

How to create an index for a collection of vectors/histograms for content based image retrieval

I'm currently writing a Bag of visual words-based image retrieval system which is similar to the Vector Space Model in text retrieval. Under this framework, each image is represented by a vector (or sometimes also called histogram in the literature). Basically each number in the vector counts the number of times each "visual word" occur in that image. If 2 images have vectors which are "close" together, this means they have many image features in common and are hence similar.
I'm basically trying to create the inverted file index for a set of such vectors. I want something that can scale from thousands (during the trial stage) to hundreds of thousands or millions of images, so a home-made data structure hack will not work.
I've looked at Lucene but apparently it only indexes text (correct me if I'm wrong) whereas in my case I want it to index numbers (i.e. the vectors themselves). I've seen cases where people convert the vector to a text document in the following way:
<3, 6, ..., 5> --> "w1 w2... wn". Basically any component that is non-zero is replaced by a textual word "w[n]" where n is the index of that number. This "document" is then passed to Lucene to index.
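For illustration, here is a sketch of that conversion (a C++ stand-in; in practice the resulting string would be fed to Lucene from Java):
#include <sstream>
#include <string>
#include <vector>

// Turns a bag-of-visual-words histogram into a pseudo-document: every
// non-zero component at (1-based) index i becomes the token "w<i>".
std::string to_document(const std::vector<int>& histogram) {
    std::ostringstream out;
    for (std::size_t i = 0; i < histogram.size(); ++i)
        if (histogram[i] != 0)
            out << "w" << i + 1 << " ";
    return out.str();  // note: the counts themselves are lost, as discussed below
}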
The problem with using this method is that the text representation for the vector does not encode how frequently each particular "word" occurs, so the ranking of the retrieved images would not be good.
Does anyone know of a mature indexing API that can handle vectors or perhaps suggest a different encoding scheme for my vectors so that I can continue to use Lucene? I've also looked at Lucene for Image Retrieval (LIRE) project and have tried the demo that came with it but the number of exceptions that were generated when I ran that demo makes me unsure about using it.
As for language of API, I'm open to C++ or Java.
Thanks in advance for any replies.
You can try GRire, which is a Java library that implements the Bag of Visual Words model. It is my project, and I am currently working on implementing an inverted index.

How can I obfuscate/de-obfuscate integer properties?

My users will in some cases be able to view a web version of a database table that stores data they've entered. For various reasons I need to include all the stored data, including a number of integer flags for each record that encapsulate adjacencies and so forth within the data (this is for speed and convenience at runtime). But rather than exposing them one-for-one in the webview, I'd like to have an obfuscated field that's just called "reserved" and contains a single unintelligible string representing those flags that I can easily encode and decode.
How can I do this efficiently in C++/Objective C?
Thanks!
Is it necessary that this field is exposed to the user visually, or just that it's losslessly captured in the HTML content of the webview? If possible, can you include the flags as a hidden input element with each row, i.e., <input type="hidden" ...>?
Why not convert each of the fields to hex, and append them as a string and save that value?
As long as you always append the strings in the same order, breaking them back apart and converting them back to numbers should be trivial.
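A sketch of that, assuming two integer flag fields and fixed-width hex so the string splits back apart reliably (the names are made up):
#include <cstdio>
#include <cstdlib>
#include <string>

// Pack two 32-bit flag fields into one "reserved" hex string.
std::string encode_flags(unsigned a, unsigned b) {
    char buf[17];
    std::snprintf(buf, sizeof(buf), "%08X%08X", a, b);
    return buf;
}

// Split the string back apart, relying on the fixed 8-character width.
void decode_flags(const std::string& s, unsigned& a, unsigned& b) {
    a = (unsigned)std::strtoul(s.substr(0, 8).c_str(), 0, 16);
    b = (unsigned)std::strtoul(s.substr(8, 8).c_str(), 0, 16);
}

// usage: encode_flags(0x2A, 0x7) -> "0000002A00000007"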
Use symmetric encryption (example) to encode and decode the values. Of course, only you should know of the key.
Alternatively, asymmetric RSA is more powerful encryption, but it is less efficient and more complex to use.
Note: I am curious about the "various reasons" that require this design...
Multiply your flag integer by 7, add 3, and convert to base 36. To check whether the resulting string has been modified, convert it back to an integer and check whether the result modulo 7 is still 3. If so, divide by 7 to get the flags. Note that this is subject to replay attacks - users can copy in any valid string.
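A sketch of that scheme (obfuscation only, not security; as noted, any valid string can be replayed):
#include <cstdint>
#include <cstdlib>
#include <string>

// Encode: flags * 7 + 3, written in base 36.
std::string obfuscate(uint32_t flags) {
    uint64_t n = (uint64_t)flags * 7 + 3;
    const char* digits = "0123456789abcdefghijklmnopqrstuvwxyz";
    std::string s;
    do { s.insert(s.begin(), digits[n % 36]); n /= 36; } while (n);
    return s;
}

// Decode: parse base 36, reject anything whose value % 7 != 3.
bool deobfuscate(const std::string& s, uint32_t& flags) {
    uint64_t n = std::strtoull(s.c_str(), 0, 36);
    if (n % 7 != 3) return false;   // tampered or corrupted
    flags = (uint32_t)((n - 3) / 7);
    return true;
}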
Just calculate a CRC-32 (or similar) and append it to your value. That will tell you, with a very high probability, if your value has been corrupted.

Making an index-creating class

I'm busy programming a class that creates an index out of a text file (ASCII/binary).
My problem is that I don't really know how to start. I already had some tries but none really worked well for me.
I do NOT need to find the address of the file via the MFT. I just want to load the file and find things much faster by searching for the key in the index file and jumping to the address it points to in the text file.
The index-file should be built up as follows:
KEY ADDRESS
1 0xABCDEF
2 0xFEDCBA
. .
. .
We have a text-file with the following example value:
1, 8752 FW,
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++,
******************************************************************************,
------------------------------------------------------------------------------;
I hope that this explains my question a bit better.
Thanks!
It seems to me that all your class needs to do is store an array of pointers or file start offsets to the key locations in the file.
It really depends on what your Key locations represent.
I would suggest that you access the file through your class using some public methods. You can then more easily tie in Key locations with the data written.
For example, your Key locations may be where each new data block written into the file starts. E.g. first block 1000 bytes, key location 0; second block 2500 bytes, key location 1000; third block 550 bytes, key location 3500; the next block will start at 4050, all assuming that 0 is the first byte.
Store the Key values in a variable length array and then you can easily retrieve the starting point for a data block.
If your Key point is signified by some key character then you can use the same class, but with a slight change to store where the Key value is stored. The simplest way is to step through the data until the key character is located, counting the number of characters checked as you go. The count is then used to produce your key location.
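A minimal sketch of that kind of offset index (the class and method names are made up):
#include <cstddef>
#include <vector>

// Remembers where each data block starts inside the file.
class BlockIndex {
public:
    // Call once per block written, with the block's starting byte offset.
    void add_block(long start_offset) { offsets_.push_back(start_offset); }

    // Key 0 is the first block, key 1 the second, and so on.
    long offset_of(std::size_t key) const { return offsets_[key]; }

    std::size_t block_count() const { return offsets_.size(); }

private:
    std::vector<long> offsets_;
};

// e.g. blocks of 1000, 2500 and 550 bytes written from offset 0:
//   add_block(0); add_block(1000); add_block(3500);  // next block starts at 4050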
Your code snippet isn't so much of an idea as it is the functionality you wish to have in the end.
Recognize that "indexing" merely means "remembering" where things are located. You can accomplish this using any data structure you wish... B-Tree, Red/Black tree, BST, or more advanced structures like suffix trees/suffix arrays.
I recommend you look into such data structures.
edit:
With the new information, I would suggest making your own key/value lookup. Build an array of keys and associate their values somehow. This may mean building a class or struct that contains both the key and the value, or instead contains the key and a pointer to a struct or class with the value, etc.
Once you have done this, sort the key array. Now, you have the ability to do a binary search on the keys to find the appropriate value for a given key.
You could build a hash table in a similar manner, or you could build a BST or similar structure like I mentioned earlier.
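A sketch of the sorted-array approach with a binary search (the types here are placeholders):
#include <algorithm>
#include <string>
#include <vector>

struct Entry {
    std::string key;
    long offset;   // byte position of the record in the text file
};

// Sort once after all entries have been added.
void build(std::vector<Entry>& index) {
    std::sort(index.begin(), index.end(),
              [](const Entry& a, const Entry& b) { return a.key < b.key; });
}

// Binary search for a key; returns true and fills 'offset' when found.
bool find_offset(const std::vector<Entry>& index, const std::string& key, long& offset) {
    auto it = std::lower_bound(index.begin(), index.end(), key,
              [](const Entry& e, const std::string& k) { return e.key < k; });
    if (it == index.end() || it->key != key) return false;
    offset = it->offset;
    return true;
}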
I still don't really understand the question (work on your question asking skillz), but as far as I can tell the algorithm will be:
Scan the file linearly; the first value, up to the first comma (','), is a key, probably. All other keys occur wherever a ';' occurs, up to the next ',' (you might need to skip line breaks here). If it's a homework assignment, just use scanf() or something to read the key.
print out the key and byte position you found it at to your index file
AFAIUI that's the algorithm, I don't really see the problem here?
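For what it's worth, a sketch of that scan (delimiters as described above; the file names are placeholders):
#include <cstdio>

// Scans data.txt: the first key runs from the start of the file up to the
// first ',', and every later key starts right after a ';' and runs up to the
// next ','. Each key and its byte position are written to index.txt.
int main() {
    std::FILE* in = std::fopen("data.txt", "rb");
    std::FILE* out = std::fopen("index.txt", "w");
    if (!in || !out) return 1;

    long key_pos = 0;          // byte position where the current key starts
    bool reading_key = true;   // the file starts inside the first key
    char key[256];
    int len = 0, c;

    while ((c = std::fgetc(in)) != EOF) {
        if (reading_key) {
            if (c == ',') {                       // key finished
                key[len] = '\0';
                std::fprintf(out, "%s 0x%lX\n", key, key_pos);
                reading_key = false;
                len = 0;
            } else if (c != '\n' && c != '\r' && len < 255) {
                key[len++] = (char)c;             // accumulate key characters
            }
        } else if (c == ';') {                    // next key starts after ';'
            key_pos = std::ftell(in);
            reading_key = true;
        }
    }
    std::fclose(in);
    std::fclose(out);
    return 0;
}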