Creating a generalized resource map without using strings? - c++

Let's assume I want to create an object that will hold some arbitrary data.
// Pseudocode
class MyContainer {
    std::map<Key, std::pair<void*, std::size_t>> m_data;
};
The key in this case also identifies the kind of data stored in the void* (e.g an image, a struct of some kind, maybe even a function).
The most general way to deal with this is to have the key be a string. Then you can put whatever on earth you want in it and read it back later. As a silly example, the key can just be:
"I am storing a png image and the source file was at location/assets/image.png and today is sunday".
i.e. you can encode whatever you want. This is, however, slow. A much faster alternative is to use enums, so your keys become IMAGE, HASHMAP, FUNCTION, THE_ANSWER_TO_LIFE...
However, that requires knowing every single case you need to handle beforehand and creating an enumerator for each one manually (which is tedious and not very extensible).
Is there a compromise? i.e. something that uses only one key type but is faster than strings and more extensible than enums?
Edit:
The exact use case I am trying to use this for is a generalized storage for rendering data. This includes images, vertex buffers, volumetric data, lighting information... or any other conceivable thing you may need.
The only way I know to create "absolute polymorphism" (i.e. represent literally any conceivable form of data) is to use void pointers and rely on algorithms to understand the data.
Example:
Say our key is a JSON string where the key of each element is the name of a field in a compact struct and the value is the offset in bytes.
E.g
{
  "m_field1": 0,
  "m_field2": 32,
  "m_field3": 128
}
Then, to access any of the elements behind the void*, all you need to do is parse out the offset and compute ptr + offset.
You can do the same with a set of unique identifiers (enums) and associated functions that fetch the fields based on the identifier (the hard-coded approach).
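For illustration, here is roughly what that lookup could look like; the offset table is the example above, and read_field is a made-up helper, not part of any existing code:

#include <cstddef>
#include <cstring>
#include <string>
#include <unordered_map>

// Hypothetical offset table, e.g. parsed from the JSON string above.
// Offsets are in bytes from the start of the blob.
const std::unordered_map<std::string, std::size_t> kOffsets = {
    {"m_field1", 0}, {"m_field2", 32}, {"m_field3", 128}
};

// Read a field of type T out of the opaque blob given its name.
template <typename T>
T read_field(const void* blob, const std::string& name) {
    const char* base = static_cast<const char*>(blob);
    T value;
    // memcpy avoids alignment/aliasing issues with arbitrary offsets.
    std::memcpy(&value, base + kOffsets.at(name), sizeof(T));
    return value;
}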
Hopefully this makes the question less obscure.

Related

Can I reinterpret a memory mapped file of key-value pairs as a map in order to sort them?

I have a memory mapped file that contains key-value pairs. Both the key and value are uint32_t, and all the keys and values are stored in the file in binary, where a key immediately precedes its value. The file contains only these pairs, no delimiters.
I want to be able to sort all of these key-value pairs by increasing key.
The following just compiled in my code:
struct FileAsMap { map<uint32_t, uint32_t> keyValueMap; };
const FileAsMap* fileAsMap = reinterpret_cast<FileAsMap*>(mmappedData);
but I don't really know what to do from here, since by definition the map container keeps a strict weak ordering of the pairs by key. If I just reinterpret the mapped file as a map, how can I get the pairs to order?
This isn't an answer as such, but the explanation doesn't fit within the comment length limit.
The keys in a map are usually unique (at least in std::map they are), but maps in general differ from one another in how they organize the stored keys. For example, std::map is based on a balanced binary tree, with an average complexity of O(log n) for retrieving a given key, where n is the number of elements in the map. std::unordered_map, on the other hand, is internally a hash map with an average access time of O(1); that is, it looks up a key in constant time regardless of the number of elements inside.
In any case, all of these containers require a dedicated internal in-memory structure, which practically never looks like a simple stream of key-value pairs. That's why I said in the first comment that it's almost impossible to reuse one of the standard maps as a convenient data accessor for mmap-ed data without first reading and unpacking the data stream.
But you can create your own map-like class that iterates over the data in the mmap-ed area and checks, in its lookup operator, whether a stored key matches the requested one. I guess the simplest implementation would take a single screen of code.
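A minimal sketch of such a class, assuming the mapped region really is a packed sequence of uint32_t key/value pairs as described in the question (class and member names are made up):

#include <cstddef>
#include <cstdint>
#include <optional>

// Read-only, map-like view over a packed stream of (key, value) uint32_t pairs.
// Assumes the mapped region is suitably aligned (mmap returns page-aligned memory).
class MmapPairView {
public:
    MmapPairView(const void* data, std::size_t bytes)
        : pairs_(static_cast<const std::uint32_t*>(data)),
          count_(bytes / (2 * sizeof(std::uint32_t))) {}

    // Linear scan: O(n), as noted below.
    std::optional<std::uint32_t> find(std::uint32_t key) const {
        for (std::size_t i = 0; i < count_; ++i)
            if (pairs_[2 * i] == key)
                return pairs_[2 * i + 1];
        return std::nullopt;
    }

    std::size_t size() const { return count_; }

private:
    const std::uint32_t* pairs_;
    std::size_t count_;
};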
But beware: a sequential scan is a relatively expensive operation, so if there are enough elements in the file, it could become unacceptably slow. In that case you'll need some optimized indexing, for example reading all the keys at the beginning of processing and building an index array. But all these questions depend heavily on the task details, so it's better to stop the explanation here.
If you have any further questions, feel free to ask. Of course, a good question assumes that you have already studied the subject and have now encountered a particular problem that you can't solve yourself.
There are a lot of reasons why the answer is no. The two simplest are:
A map is a structure that stores data in a form in which it is already sorted. Your data isn't already sorted, so it's simply not a map.
The map class has its own internal data structure that it uses to store maps. Unless your file replicates this internal structure perfectly (which it almost certainly can't since it likely includes pointers into memory) the map class will misunderstand the data in the file.
How did you serialize the data to the file?
Assuming that you serialized a struct consisting of maps, you'd de-serialize as below:
FileAsMap* fileAsMap = reinterpret_cast<FileAsMap*>(mmappedData);
This gives access to the entire structure (blob).
(*fileAsMap).keyValueMap gives access to the map.

Storing named data, where the 'name' is larger than the 'data'?

I'm writing the logic portion of a game, and want to create, retrieve, and store values (integers) to keep track of progress. For instance, a door would create the pair ("location.room.doorlock", 0) in an std::map, and unlocking this door would set that value to 1. Anytime the player wants to go through this door, it would retrieve the value by that keyname to see if it's passable. (Just an example, but it's important that this information exist outside of the "door" object itself, as characters or other events might retrieve this data and act on it.)
The problem though is that the name (or map key) itself is far larger than the data it's referring to, which seems wasteful, and feels 'wrong' as a result.
Is there a commonly used or best approach for storing this type of data, one where the key isn't so much larger than the data itself?
It is possible to know how much space to allocate at compile time for the progress data itself, if it's important. It need not use std::map either, so long as I don't have to use raw array indices to get or store data.
It seems like you have two options, if you really want to reduce the size of the string (although the string length does not seem that bad at all).
You can either just change your naming conventions or implement hashing. Hashing can be implemented in the form of a hash map (also known as an unordered map) or by hand (you can create a small program that hashes your names to an int, then use that int as the key). Hash maps/unordered maps are probably your best bet, as there is a lot of support code out there for them and you don't run the risk of having to deal with bugs in your own implementation.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
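For illustration, a minimal sketch of the unordered_map option, reusing the door example from the question:

#include <string>
#include <unordered_map>

// Progress flags keyed by name. std::unordered_map hashes the key,
// giving average O(1) lookups regardless of how many flags exist.
std::unordered_map<std::string, int> progress;

void unlock_door() { progress["location.room.doorlock"] = 1; }

bool door_is_open() {
    auto it = progress.find("location.room.doorlock");
    return it != progress.end() && it->second == 1;
}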

Storing classes in a boost::property_tree

I'm working on a program to record information about the variables within a program. I'd like to group this information by file -> function -> variable.
The boost::property_tree seemed like a good fit for this as I could store an Access object at a path in the tree (file.function.variable) and then easily convert the tree to XML, JSON, etc.
Say I'm recording the number of uses of a variable. I can have a class Access that keeps track of the number of writes and reads to a variable. I can then store this object at file.function.variable in the tree. Each time the variable is accessed I can find the variable in the tree and update information about it.
However, I cannot figure out how to store a class in the tree. I assume there is something I need to implement or subclass, but the documentation doesn't address what I'm trying to do.
Is there a solution to my problem? Is there a better alternative to boost::property_tree?
Thank you.
boost::property_tree is designed to hold text data. That's what makes it suitable for exporting to XML, JSON, etc.
Modify your class Access so it includes methods for converting to/from text and store that text in the tree.
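A minimal sketch of that idea, assuming a hypothetical to_string() helper on Access (the path string is just an example):

#include <boost/property_tree/ptree.hpp>
#include <string>

struct Access {
    int reads = 0, writes = 0;
    // Hypothetical textual encoding; pick whatever format suits you.
    std::string to_string() const {
        return std::to_string(reads) + "," + std::to_string(writes);
    }
};

void record(boost::property_tree::ptree& tree, const std::string& path, const Access& a) {
    // e.g. path == "file.function.variable"
    tree.put(path, a.to_string());
}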
You could drop the idea of the tree and just stick with a flat map of key-value pairs.
Example:
std::map<std::string, Access> accesses;
// add one access
accesses["file.function.variable"] += 1;
You just need to write a routine that produces the JSON from its content, which should be straightforward. (Assuming the first part of the key is always the file, the second is always the function, the third is always the variable.)
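A rough sketch of such a routine, assuming every key has exactly three dot-separated parts; counts are plain ints here for brevity, and it does not merge entries that share a file or function prefix:

#include <cstddef>
#include <iterator>
#include <map>
#include <sstream>
#include <string>

// Hypothetical sketch: splits "file.function.variable" keys and emits JSON.
std::string to_json(const std::map<std::string, int>& accesses) {
    std::ostringstream out;
    out << "{\n";
    for (auto it = accesses.begin(); it != accesses.end(); ++it) {
        const std::string& key = it->first;
        const std::size_t a = key.find('.');
        const std::size_t b = key.find('.', a + 1);
        out << "  \"" << key.substr(0, a) << "\": { \""
            << key.substr(a + 1, b - a - 1) << "\": { \""
            << key.substr(b + 1) << "\": " << it->second << " } }";
        if (std::next(it) != accesses.end()) out << ",";
        out << "\n";
    }
    out << "}";
    return out.str();
}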

What are some good methods to replace string names with integer hashes

Usually, entities, components, or other parts of the game code in data-driven design will have names that get checked when you want to find out exactly which object you're dealing with.
void Player::Interact(Entity *myEntity)
{
    if (myEntity->isNearEnough(this) && myEntity->GetFamilyName() == "guard")
    {
        static_cast<Guard*>(myEntity)->Say("No mention of arrows and knees here");
    }
}
If you ignore the possibility that this might be premature optimization, it's pretty clear that looking up entities would be a lot faster if their "name" were a simple 32-bit value instead of an actual string.
Computing hashes of the string names is one possible option. I haven't actually tried it, but with a 32-bit range and a good hash function the risk of collision should be minimal.
The question is this: obviously we need some way to convert string names (in code, or in some kind of external file) to those integers, since the person working on these named objects will still want to refer to the object as "guard" instead of "0x2315f21a".
Assuming we're using C++ and want to replace all strings that appear in the code, can this even be achieved with built-in language features, or do we have to build an external tool that manually looks through all the files and exchanges the values?
Jason Gregory wrote this in his book:
At Naughty Dog, we used a variant of the CRC-32 algorithm to hash our strings, and we didn't encounter a single collision in over two years of development on Uncharted: Drake's Fortune.
So you may want to look into that.
As for the build step you mentioned, he also talked about that. They basically wrap the strings that need to be hashed in something like:
_ID("string literal")
And use an external tool at build time to hash all the occurrences. This way you avoid any runtime costs.
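In modern C++, a constexpr hash can get a similar effect without an external tool, since the compiler can fold the hash of a string literal at compile time. A sketch using FNV-1a rather than the CRC-32 mentioned above; the _ID name only mirrors the convention quoted from the book, and GetFamilyNameHash() is hypothetical:

#include <cstdint>

// FNV-1a, evaluated at compile time when given a string literal.
constexpr std::uint32_t fnv1a(const char* s, std::uint32_t h = 2166136261u) {
    return *s ? fnv1a(s + 1, (h ^ static_cast<std::uint8_t>(*s)) * 16777619u) : h;
}

// Mirrors the _ID convention quoted above; not a real library macro.
#define _ID(str) fnv1a(str)

// Hypothetical usage: GetFamilyNameHash() would return the precomputed hash.
// if (myEntity->GetFamilyNameHash() == _ID("guard")) { ... }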
This is what enums are for. I wouldn't dare to decide which resource is best for the topic, but there are plenty to choose from: https://www.google.com/search?q=c%2B%2B+enum
I'd say go with enums!
But if you already have a lot of code using strings, well, either just keep it that way (simple, and usually fast enough on a PC anyway) or hash them into an integer using some kind of CRC or MD5.
This is basically solved by adding an indirection on top of a hash map.
Say you want to convert strings to integers:
Write a class that wraps both an array and a hash map. I call these classes dictionaries.
The array contains the strings.
The hash map's key is the string (shared pointers, or stable arrays where raw pointers are safe, work as well).
The hash map's value is the index into the array where the string is located, which is also the opaque handle returned to calling code.
When adding a new string to the system, first look it up in the hash map; if it is already present, return the existing handle.
If it is not present, append the string to the array; its index becomes the handle.
Store the string and the handle in the map, and return the handle.
Notes/Caveats:
This strategy makes getting the string back from the handle run in constant time (it is merely an array dereference).
Handle identifiers are assigned first come, first served, but if you serialize the strings instead of the handle values this won't matter.
operator[] overloads for both the key and the value are fairly simple (registering new strings, or getting the string back), but wrapping the handle in a user-defined class (around an integer) adds much-needed type safety, and also avoids ambiguity if you want the keys and the values to be the same type (the overloaded operator[]s won't compile, etc.).
You have to store the strings in RAM, which can be a problem.
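A compact sketch of the dictionary described above (names are made up; error handling and the type-safe handle wrapper mentioned in the notes are omitted):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Interns strings: each unique string gets a stable integer handle.
class StringDictionary {
public:
    std::size_t intern(const std::string& s) {
        auto it = index_.find(s);
        if (it != index_.end())
            return it->second;                 // already registered
        strings_.push_back(s);
        const std::size_t handle = strings_.size() - 1;
        index_.emplace(s, handle);
        return handle;
    }

    // Constant time: just an array dereference.
    const std::string& lookup(std::size_t handle) const { return strings_[handle]; }

private:
    std::vector<std::string> strings_;                    // handle -> string
    std::unordered_map<std::string, std::size_t> index_;  // string -> handle
};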

compressed vector/array class with random data access

I would like to make a "compressed array"/"compressed vector" class (details below) that allows random data access in more or less constant time.
"More or less constant time" means that although element access time isn't constant, it shouldn't keep increasing as I get closer to a certain point in the array. I.e. the container shouldn't do significantly more work to fetch one element than another (like "decompress everything once again to get the last element" versus "do almost nothing to get the first"). This can probably be achieved by splitting the array into chunks of compressed data. I.e. accessing one element should take "averageTime" +- some deviation. I could say that I want the best-case and worst-case access times to be relatively close to the average access time.
What are my options (suitable algorithms/already available containers - if there are any)?
Container details:
The container acts as a linear array of identical elements (like std::vector).
Once the container is initialized, the data is constant and never changes; the container only needs to provide read-only access.
The container should behave like an array/std::vector, i.e. values are accessed via operator[], there is .size(), etc.
It would be nice if I could make it a template class.
Access to the data should be more or less constant time. I don't need the same access time for every element, but I shouldn't have to decompress everything to get the last element.
Usage example:
Binary search on data.
Data details:
1. The data is structs consisting mostly of floats and a few ints. There are more floats than ints. No strings.
2. It is unlikely that there are many identical elements in the array, so simply indexing the data won't be possible.
3. Size of one element is less than 100 bytes.
4. Total data size per container is between few kilobytes and a few megabytes.
5. Data is not sparse - it is continuous block of elements, all of them are assigned, there are no "empty slots".
The goal of compression is to reduce the amount of RAM the block takes compared to an uncompressed array representation, while keeping somewhat reasonable read performance and allowing random access to elements as in an array. I.e. the data should be stored in compressed form internally, and I should be able to access it (read-only) as if it were a std::vector or similar container.
Ideas/Opinions?
I take it that you want an array whose elements are not stored vanilla, but compressed, to minimize memory usage.
Concerning compression, you have no exceptional insight about the structure of your data, so you're fine with some kind of standard entropy encoding. Ideally, you would like to run GZIP on your whole array and be done with it, but that would lose O(1) access, which is crucial to you.
A solution is to use Huffman coding together with an index table.
Huffman coding works by replacing each input symbol (for instance, an ASCII byte) with another symbol of variable bit length, depending on its frequency of occurrence in the whole stream. For instance, the character 'E' appears very often, so it gets a short bit sequence, while 'W' is rare and gets a long one.
E -> 0b10
W -> 0b11110
Now, compress your whole array with this method. Unfortunately, since the output symbols have variable length, you can no longer index your data as before: item number 15 is no longer at stream[15*sizeof(item)].
Fortunately, this problem can be solved by using an additional index table that stores where each item starts in the compressed stream. In other words, the compressed data for item 15 can be found at stream[index[15]]; the index table accumulates the variable output lengths.
So, to get item 15, you simply start decompressing the bytes at stream[index[15]]. This works because Huffman coding doesn't do anything fancy to the output; it just concatenates the new code words, so you can start decoding inside the stream without having to decode all the previous items.
Of course, the index table adds some overhead; you may want to tweak the granularity so that compressed data + index table is still smaller than the original data.
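Roughly, the lookup path could look like the sketch below; decode_item is only a placeholder standing in for the Huffman decoder, and the index here stores one bit offset per item:

#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder: decodes one item starting at the given bit offset in the stream.
template <typename Item>
Item decode_item(const std::uint8_t* stream, std::size_t bit_offset);

template <typename Item>
struct HuffmanArray {
    std::vector<std::uint8_t> stream;  // Huffman-coded items, concatenated
    std::vector<std::size_t>  index;   // bit offset where each item starts

    Item operator[](std::size_t i) const {
        // Start decoding mid-stream: no need to decode the preceding items.
        return decode_item<Item>(stream.data(), index[i]);
    }

    std::size_t size() const { return index.size(); }
};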
Are you coding for an embedded system, and/or do you have hundreds or thousands of these containers? If not, while I think this is an interesting theoretical question (+1), I suspect that the slowdown from doing the decompression will be non-trivial and that it would be better to just use a std::vector.
Next, are you sure that the data you're storing is sufficiently redundant that smaller blocks of it will actually be compressible? Have you tried saving off blocks of different sizes (powers of 2 perhaps) and tried running them through gzip as an exercise? It may be that any extra data needed to help the decompression algorithm (depending on approach) would reduce the space benefits of doing this sort of compressed container.
If you decide that it's still reasonable to do the compression, then there are at least a couple of possibilities, none pre-written though. You could compress each individual element, storing a pointer to the compressed data chunk. Then index access is still constant, just requiring decompression of the actual data. Possibly a proxy object would make the actual data decompression easier and more transparent (and maybe even allow you to use std::vector as the underlying container).
Alternatively, std::deque already stores its data in chunks, so you could use a similar approach here: for example, a std::vector<compressed_data_chunk> as your underlying container, where each chunk holds, say, 10 items compressed together. Then you can still directly index the chunk you need, decompress it, and return the item from the decompressed data. If you want, your containing object (the one that holds the vector) could even cache the most recently decompressed chunk or two for added performance on consecutive access (although this wouldn't help binary search very much).
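A rough sketch of that chunked layout, with the actual decompression left as a placeholder (zlib, LZ4, or similar could fill it in) and a one-chunk cache for consecutive access; all names are made up:

#include <cstddef>
#include <cstdint>
#include <vector>

template <typename T, std::size_t ItemsPerChunk = 10>
class ChunkedCompressedVector {
public:
    // Read-only access: locate the chunk, decompress it, return the element.
    T operator[](std::size_t i) const {
        const std::size_t chunk = i / ItemsPerChunk;
        if (chunk != cached_chunk_) {              // cache the last decompressed chunk
            cache_ = decompress(chunks_[chunk]);
            cached_chunk_ = chunk;
        }
        return cache_[i % ItemsPerChunk];
    }

    std::size_t size() const { return size_; }

private:
    // Placeholder: inflate one chunk back into ItemsPerChunk items.
    static std::vector<T> decompress(const std::vector<std::uint8_t>& blob);

    std::vector<std::vector<std::uint8_t>> chunks_;  // each holds ItemsPerChunk compressed items
    std::size_t size_ = 0;
    mutable std::vector<T> cache_;
    mutable std::size_t cached_chunk_ = static_cast<std::size_t>(-1);
};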
I've been thinking about this for a while now. From a theoretical point of view I identified 2 possibilities:
Flyweight, because repetition can be lessened with this pattern.
Serialization (compression is some form of serialization)
The first is purely object-oriented and fits well in general, I think; it doesn't have the disadvantage of messing up pointers, for example.
The second seems better adapted here, although it does have a slight disadvantage in general: pointer invalidation plus issues with pointer encoding/decoding, virtual tables, etc. Notably, it doesn't work if the items refer to each other using pointers instead of indices.
I have seen a few "Huffman coding" answers; however, this means that for each structure one needs to provide a compression algorithm. It's not easy to generalize.
So I'd rather go the other way and use a compression library like zlib, picking a fast algorithm like LZO, for example.
A B* tree (or a variant) with a large number of items per node (since the data doesn't move), say 1001. Each node contains a compressed representation of its array of items; the indices are not compressed.
Possibly: a cache_view to access the container while keeping the last 5 (or so) decompressed nodes around. Another variant is to implement reference counting and keep the data uncompressed as long as someone holds a handle to one of the items in the node.
Some remarks:
if you use a large number of items/keys per node, you get near-constant access time; for example, with 1001 items per node there are only 2 levels of indirection as long as you store fewer than about a million items, 3 levels for a billion, etc.
you can build a readable/writable container with such a structure. I would make it so that a node is only recompressed once I am done writing to it.
Okay, to the best of my understanding, what you want is some kind of accessor template. Basically, create a template adapter that takes one of your element types as its argument and accesses it internally via whatever you like: a pointer, an index into your blob, etc. Make the adapter pointer-like:
const T *operator->(void) const;
etc., since it's easier to create a pointer adapter than a reference adapter (though see vector if you want to know how to write one of those). Notice that I made this accessor const, as per your guidelines. Then pre-compute your offsets when the blob is loaded/compressed and populate the vector with your templated adapter class. Does this make sense? If you need more details, I will be happy to provide them.
As for the compression algorithm, I suggest you simply do a frequency analysis of the bytes in your blob and then run your uncompressed blob through a hard-coded Huffman encoding (as was more or less suggested earlier), capturing the offset of each element and storing it in your proxy adapters, which in turn are the elements of your array. Indeed, you could make this all part of a compression class that compresses and generates elements that can be copy-back-inserted into your vector from the beginning. Again, reply if you need sample code.
Can some of the answers to "What is the best compression algorithm that allows random reads/writes in a file?" be adapted to your in-memory data?