Tabulation hashing and N3980 - c++

I am having trouble adapting the pending C++1z proposal N3980 by Howard Hinnant to work with tabulation hashing.
Computing a tabulation hash from scratch works the same way as for the hashing algorithms (Spooky, Murmur, etc.) described in N3980. It is not that complicated: just serialize an object of any user-defined type through hash_append() and let the hash function update a pointer into a table of random numbers as you go along.
The trouble starts when trying to implement one of the nice properties of tabulation hashing: incremental updates to the hash are very cheap to compute when an object is mutated. For "hand-made" tabulation hashes, one just recomputes the hash of the object's affected bytes.
My question is: how to communicate incremental updates to a uhash<MyTabulationAlgorithm> function object while keeping true to the central theme of N3980 (Types don't know #)?
To illustrate the design difficulties: say I have a user-defined type X with N data members xi of various types Ti
struct X
{
T1 x1;
...
TN xN;
};
Now create an object and compute its hash
X x { ... }; // initialize
std::size_t h = uhash<MyTabulationAlgorithm>{}(x); // uhash<> is a function object type, so construct it first
Update a single member, and recompute the hash
x.x2 = 42;
h ^= ...; // ?? I want to avoid calling uhash<>() again
I could compute the incremental update as something like
h ^= hash_update(x.x2, start, stop);
where [start, stop) represents the range of the table of random numbers that corresponds to the x2 data member. However, in order to incrementally (i.e. cheaply!) update the hash for arbitrary mutations, every data member needs to somehow know its own subrange in the serialized byte stream of its containing class. This doesn't feel like the spirit of N3980. E.g., adding new data members to the containing class would change the class layout and therefore the offsets in the serialized byte stream.
Application: tabulation hashing is very old, and it has recently been shown to have very nice mathematical properties (see the Wikipedia link). It's also very popular in the board game programming community (e.g. computer chess and go), where it goes under the name of Zobrist hashing. There, a board position plays the role of X, and a move the role of a small update (e.g. moving a piece from its source to its destination). It would be nice if N3980 could not only be adapted to such tabulation hashing, but could also accommodate the cheap incremental updates.
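For reference, the "hand-made" incremental update described above can be sketched as follows. This is only an illustration under simplifying assumptions: the key is a fixed-size array of N bytes, and the names TabulationHash, hash and update are made up for this sketch and are not part of N3980.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <random>

// Simple tabulation (Zobrist-style) hash over a key of N bytes.
// table_[i][b] holds the random word for byte value b at position i.
template <std::size_t N>
class TabulationHash {
public:
    explicit TabulationHash(std::uint64_t seed = 12345) {
        std::mt19937_64 rng(seed);
        for (auto& row : table_)
            for (auto& word : row)
                word = rng();
    }

    // Full hash: XOR of one table entry per byte position.
    std::uint64_t hash(const std::array<unsigned char, N>& key) const {
        std::uint64_t h = 0;
        for (std::size_t i = 0; i < N; ++i)
            h ^= table_[i][key[i]];
        return h;
    }

    // Incremental update: XOR out the old byte's entry, XOR in the new one.
    std::uint64_t update(std::uint64_t h, std::size_t pos,
                         unsigned char old_byte, unsigned char new_byte) const {
        return h ^ table_[pos][old_byte] ^ table_[pos][new_byte];
    }

private:
    std::array<std::array<std::uint64_t, 256>, N> table_;
};
```

The update touches two table entries regardless of N, which is the property the question wants to preserve; the cost is that the caller must know the byte position pos of whatever changed.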

It seems that you should be able to do this by telling MyTabulationAlgorithm to ignore the values of all class members except that which has changed:
x.x2 = 42;
IncrementalHashAdaptor<MyTabulationAlgorithm, T2> inc{x.x2};
hash_append(inc, x);
h ^= inc;
All IncrementalHashAdaptor has to do is check the memory range it is passed to see whether x2 is included in it:
template<class HashAlgorithm, class T>
struct IncrementalHashAdaptor
{
T& t;
HashAlgorithm h = {};
bool found = false;
void operator()(void const* key, std::size_t len) noexcept
{
if (/* t contained within [key, key + len) */) {
assert(!found);
found = true;
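// pointer arithmetic is not allowed on void const*, so work with char const*
char const* k = static_cast<char const*>(key);
char const* p = reinterpret_cast<char const*>(std::addressof(t));
h.ignore(k, p - k);
h(p, sizeof(T));
h.ignore(p + sizeof(T), len - (p - k) - sizeof(T));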
}
else {
h.ignore(key, len);
}
}
operator std::size_t() const { assert(found); return h; }
};
Obviously, this will only work for members whose object location is both possible to determine externally and corresponds to the memory block passed to the hash algorithm; but this should correspond to the vast majority of cases.
You would probably want to wrap IncrementalHashAdaptor and the following hash_append into a uhash_incremental utility; this is left as an exercise for the reader.
There is a question mark over performance; assuming HashAlgorithm::ignore(...) is visible to the compiler and is uncomplicated it should optimize well; if this does not occur you should be able to calculate the byte-stream address of X::x2 at program startup using a similar strategy.

Related

C++ Hash Table - How is collision for unordered_map with custom data type as keys resolved?

I have defined a class called Point which is to be used as a key inside an unordered_map. So, I have provided an operator== function inside the class and I have also provided a template specialization for std::hash. Based on my research, these are the two things I found necessary. The relevant code is as shown:
class Point
{
int x_cord = {0};
int y_cord = {0};
public:
Point()
{
}
Point(int x, int y):x_cord{x}, y_cord{y}
{
}
int x() const
{
return x_cord;
}
int y() const
{
return y_cord;
}
bool operator==(const Point& pt) const
{
return (x_cord == pt.x() && y_cord == pt.y());
}
};
namespace std
{
template<>
class hash<Point>
{
public:
size_t operator()(const Point& pt) const
{
return (std::hash<int>{}(pt.x()) ^ std::hash<int>{}(pt.y()));
}
};
}
// Inside some function
std::unordered_map<Point, bool> visited;
The program compiled and gave the correct results in the cases that I tested. However, I am not convinced if this is enough when using a user-defined class as key. How does the unordered_map know how to resolve collision in this case? Do I need to add anything to resolve collision?
That's a terrible hash function. But it is legal, so your implementation will work.
The rule (and really the only rule) for Hash and Equals is:
if a == b, then std::hash<value_type>{}(a) == std::hash<value_type>{}(b).
(It's also important that both Hash and Equals always produce the same value for the same arguments. I used to think that went without saying, but I've seen several SO questions where unordered_map produced unexpected results precisely because one or both of these functions depended on some external value.)
That would be satisfied by a hash function which always returned 42, in which case the map would get pretty slow as it filled up. But other than the speed issue, the code would work.
std::unordered_map uses a chained hash, not an open-addressed hash. All entries with the same hash value are placed in the same bucket, which is a linked list. So low-quality hashes do not distribute entries very well among the buckets.
It's clear that your hash gives {x, y} and {y, x} the same hash value. More seriously, any collection of points in a small rectangle will share the same small number of different hash values, because the high-order bits of the hash values will all be the same.
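One common way to avoid both problems is to mix the second coordinate into the first instead of XOR-ing them directly. The sketch below is only an illustration: Point2 is a stand-in for the Point class from the question, and hash_combine is modelled on the well-known mixing formula from Boost's hash_combine rather than anything the standard library provides.

```cpp
#include <cstddef>
#include <functional>

// Stand-in for the question's Point class (hypothetical, public members for brevity).
struct Point2 {
    int x;
    int y;
    bool operator==(const Point2& o) const { return x == o.x && y == o.y; }
};

// Order-dependent mixing step, modelled on Boost's hash_combine formula.
inline void hash_combine(std::size_t& seed, std::size_t value) {
    seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

struct Point2Hash {
    std::size_t operator()(const Point2& p) const {
        std::size_t seed = std::hash<int>{}(p.x);
        hash_combine(seed, std::hash<int>{}(p.y));
        return seed;
    }
};
```

It would be used as std::unordered_map<Point2, bool, Point2Hash>. Because the mixing depends on the order of the arguments, {x, y} and {y, x} no longer hash identically in general.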
Knowing that Point is intended to store coordinates within an image, the best hash function here is:
pt.x() + pt.y() * width
where width is the width of the image.
Considering that x is a value in the range [0, width-1], the above hash function produces a unique number for any valid value of pt. No collisions are possible.
Note that this hash value corresponds to the linear index for the point pt if you store the image as a single memory block. That is, given y is also in a limited range ([0, height-1]), all hash values generated are within the range [0, width* height-1], and all integers in that range can be generated. Thus, consider replacing your hash table with a simple array (i.e. an image). An image is the best data structure to map a pixel location to a value.

Map, pair-vector or two vectors...?

I read through some posts and "wikis" but still cannot decide what approach is suitable for my problem.
I create a class called Sample which contains a certain number of compounds (let's say this is another class Nuclide) at a certain relative quantity (double).
Thus, something like (pseudo):
class Sample {
map<Nuclide, double>;
}
If I had the nuclides Ba-133, Co-60 and Cs-137 in the sample, I would have to use exactly those names in code to access those nuclides in the map. However, the only thing I need to do is iterate through the map to perform calculations (which nuclides they are is of no interest), so I will use a for loop. I want to iterate without paying any attention to the key names, so I would need to use an iterator for the map, am I right?
An alternative would be a vector<pair<Nuclide, double> >
class Sample {
vector<pair<Nuclide, double> >;
}
or simply two independent vectors
class Sample {
vector<Nuclide>;
vector<double>;
}
while in the last option the link between a nuclide and its quantity would be "meta-information", given by the position in the respective vector only.
Due to my lack of profound experience, I'd ask kindly for suggestions of what approach to choose. I want to have the iteration through all available compounds to be fast and easy and at the same time keep the logical structure of the corresponding keys and values.
PS.: It's possible that the number of compounds in a sample is very low (1 to 5)!
PPS.: Could the last option be modified by some const statements to prevent changes and thus keep the correct order?
If iteration needs to be fast, you don't want std::map<...>: its iteration is a tree-walk which quickly gets bad. std::map<...> is really only reasonable if you have many mutations to the sequence and you need the sequence ordered by the key. If you have mutations but you don't care about the order std::unordered_map<...> is generally a better alternative. Both kinds of maps assume you are looking things up by key, though. From your description I don't really see that to be the case.
std::vector<...> is fast to iterate. It isn't ideal for look-ups, though. If you keep it ordered you can use std::lower_bound() to do a std::map<...>-like look-up (i.e., the complexity is also O(log n)) but the effort of keeping it sorted may make that option too expensive. However, it is an ideal container for keeping a bunch of objects together which are iterated.
Whether you want one std::vector<std::pair<...>> or rather two std::vector<...>s depends on how the elements are accessed: if both parts of an element are bound to be accessed together, you want a std::vector<std::pair<...>> as that keeps data which is accessed together close in memory. On the other hand, if you normally only access one of the two components, using two separate std::vector<...>s will make the iteration faster as more elements fit into a cache line, especially if they are reasonably small like doubles.
In any case, I'd recommend not exposing the internal structure to the outside world and rather providing an interface which lets you change the underlying representation later. That is, to achieve maximum flexibility you don't want to bake the representation into all your code. For example, if you use accessor function objects (property maps in terms of BGL or projections in terms of Eric Niebler's Range Proposal) to access the elements based on an iterator, rather than accessing the elements directly, you can change the internal layout without having to touch any of the algorithms (you'll need to recompile the code, though):
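As an illustration of the sorted-vector look-up, here is a minimal sketch. It assumes, purely for the example, that a nuclide is identified by its name as a std::string; Entry, insert_sorted and find_quantity are made-up names.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical element type: nuclide name and relative quantity.
using Entry = std::pair<std::string, double>;

// Insert while keeping the vector ordered by name (O(n) because elements shift).
inline void insert_sorted(std::vector<Entry>& v, Entry e) {
    auto pos = std::lower_bound(
        v.begin(), v.end(), e.first,
        [](const Entry& lhs, const std::string& key) { return lhs.first < key; });
    v.insert(pos, std::move(e));
}

// Binary search by name (O(log n)); returns nullptr if the name is absent.
inline const double* find_quantity(const std::vector<Entry>& v,
                                   const std::string& key) {
    auto pos = std::lower_bound(
        v.begin(), v.end(), key,
        [](const Entry& lhs, const std::string& k) { return lhs.first < k; });
    return (pos != v.end() && pos->first == key) ? &pos->second : nullptr;
}
```

With only 1 to 5 elements per sample, as mentioned in the question, a plain linear scan over an unsorted vector would likely be just as fast; the sorted variant pays off mainly for larger collections.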
// version using std::vector<std::pair<Nuclide, double> >
// - it would just use std::vector<std::pair<Nuclide, double> >::iterator as iterator
auto nuclide_projection = [](Sample::key& key) -> Nuclide& {
return key.first;
};
auto value_projection = [](Sample::key& key) -> double {
return key.second;
};
// version using two std::vectors:
// - it would use an iterator interface to an integer, yielding a std::size_t for *it
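struct nuclide_projector {
std::vector<Nuclide>& nuclides;
auto operator()(std::size_t index) const -> Nuclide& { return nuclides[index]; }
};
// a reference member cannot be default-initialized, so the projection has to be
// bound to the sample's storage; sample_nuclides is a placeholder name
nuclide_projector nuclide_projection{sample_nuclides};
struct value_projector {
std::vector<double>& values;
auto operator()(std::size_t index) const -> double& { return values[index]; }
};
value_projector value_projection{sample_values}; // sample_values: placeholder name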
With one pair of these projections in place, an algorithm that simply runs over the elements and prints them could look like this:
template <typename Iterator>
void print(std::ostream& out, Iterator begin, Iterator end) {
for (; begin != end; ++begin) {
out << "nuclide=" << nuclide_projection(*begin) << ' '
<< "value=" << value_projection(*begin) << '\n';
}
}
Both representations are entirely different but the algorithm accessing them is entirely independent. This way it is also easy to try different representations: only the representation and the glue to the algorithms accessing it need to be changed.

Better understanding the LRU algorithm

I need to implement an LRU algorithm in a 3D renderer for texture caching. I write the code in C++ on Linux.
In my case I will use texture caching to store "tiles" of image data (16x16 pixel blocks). Now imagine that I do a lookup in the cache and get a hit (the tile is in the cache). How do I return the content of the cache for that entry to the function caller? Let me explain. I imagine that when I load a tile into the cache memory, I allocate the memory to store 16x16 pixels, for example, then load the image data for that tile. Now there are two ways to pass the content of the cache entry to the function caller:
1) either as pointer to the tile data (fast, memory efficient),
TileData *tileData = cache->lookup(tileId); // not safe?
2) or I need to recopy the tile data from the cache within a memory space allocated by the function caller (copy can be slow).
void Cache::lookup(int tileId, float *&tileData)
{
// find tile in cache, if not in cache load from disk add to cache, ...
...
// now copy tile data, safe but isn't that slow?
memcpy((char*)tileData, tileDataFromCache, sizeof(float) * 3 * 16 * 16);
}
float *tileData = new float[3 * 16 * 16]; // need to allocate the memory for that tile
// get tile data from cache, requires a copy
cache->lookup(tileId, tileData);
I would go with 1), but what happens if the tile gets deleted from the cache just after the lookup and the function then tries to access the data through the returned pointer? The only solution I see is to use a form of reference counting (auto_ptr), where the data is only deleted when it's no longer used?
The application might access more than one texture. I can't seem to find a way of creating a key which is unique to each texture and each tile of a texture. For example, I may have tile 1 from file1 and tile 1 from file2 in the cache, so searching on tileId=1 is not enough, but I can't seem to find a way of creating a key that accounts for both the file name and the tile ID. I can build a string that contains the file name and the tile ID (FILENAME_TILEID), but wouldn't a string used as a key be much slower than an integer?
Finally I have a question regarding time stamps. Many papers suggest using a time stamp for ordering the entries in the cache. What is a good function to get a time stamp? The time() function, clock()? Is there a better way than using time stamps?
Sorry, I realise it's a very long message, but LRU doesn't seem as simple to implement as it sounds.
Answers to your questions:
1) Return a shared_ptr (or something logically equivalent to it). Then all of the "when-is-it-safe-to-delete-this-object" issues pretty much go away.
2) I'd start by using a string as a key, and see if it actually is too slow or not. If the strings aren't too long (e.g. your filenames aren't too long) then you may find it's faster than you expect. If you do find out that string keys aren't efficient enough, you could try something like computing a hashcode for the string and adding the tile ID to it. That would probably work in practice, although there would always be the possibility of a hash collision. But you could have a collision-check routine run at startup that would generate all of the possible filename+tileID combinations and alert you if any two map to the same key value, so that at least you'd know immediately during your testing when there is a problem and could do something about it (e.g. by adjusting your filenames and/or your hashcode algorithm). This assumes that all the filenames and tile IDs are known in advance, of course.
3) I wouldn't recommend using a timestamp, it's unnecessary and fragile. Instead, try something like this (pseudocode):
typedef shared_ptr<TileData> TileDataPtr; // automatic memory management!
linked_list<TileDataPtr> linkedList;
hash_map<data_key_t, TileDataPtr> hashMap;
// This is the method the calling code would call to get its tile data for a given key
TileDataPtr GetData(data_key_t theKey)
{
if (hashMap.contains_key(theKey))
{
// The desired data is already in the cache, great! Just move it to the head
// of the LRU list (to reflect its popularity) and then return it.
TileDataPtr ret = hashMap.get(theKey);
linkedList.remove(ret); // move this item to the head of the linked list -- O(1) only if
linkedList.push_front(ret); // the list can unlink the node directly (a remove-by-value search is O(n))
return ret;
}
else
{
// Oops, the requested object was not in our cache, load it from disk or whatever
TileDataPtr ret = LoadDataFromDisk(theKey);
linkedList.push_front(ret);
hashMap.put(theKey, ret);
// Don't let our cache get too large -- delete
// the least-recently-used item if necessary
if (linkedList.size() > MAX_LRU_CACHE_SIZE)
{
TileDataPtr dropMe = linkedList.tail();
hashMap.remove(dropMe->GetKey());
linkedList.remove(dropMe);
}
return ret;
}
}
In the same order as your questions:
Copying over the texture data does not seem reasonable from a performance standpoint. Reference counting sounds far better, as long as you can actually code it safely. The data memory would be freed as soon as it is neither used by the renderer nor referenced from the cache.
I assume that you are going to use some sort of hash table for the look-up part of what you are describing. The common solution to your problem has two parts:
Using a suitable hashing function that combines multiple values e.g. the texture file name and the tile ID. Essentially you create a composite key that is treated as one entity. The hashing function could be a XOR operation of the hashes of all elementary components, or something more complex.
Selecting a suitable hash function is critical for performance reasons - if the said function is not random enough, you will have a lot of hash collisions.
Using a suitable composite equality check to handle the case of hash collisions.
This way you can look-up the combination of all attributes of interest in a single hash table look-up.
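A composite key along those lines might look like the sketch below. The names TileKey and TileKeyHash are invented for illustration, and the mixing step borrows the well-known formula from Boost's hash_combine; a plain XOR would also satisfy the requirements, just with more collisions.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical composite key: texture file name plus tile ID.
struct TileKey {
    std::string file;
    int tile;

    // Equality check used to resolve hash collisions.
    bool operator==(const TileKey& other) const {
        return tile == other.tile && file == other.file;
    }
};

// Combines the hashes of both components into one value.
struct TileKeyHash {
    std::size_t operator()(const TileKey& k) const {
        std::size_t h = std::hash<std::string>{}(k.file);
        h ^= std::hash<int>{}(k.tile) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

// Example cache index type; the mapped type is a placeholder.
using TileIndex = std::unordered_map<TileKey, int, TileKeyHash>;
```

If hashing the file name on every lookup turns out to be too slow, one option is to map each file name to a small integer ID once, when the texture is opened, and use a pair of integers as the key instead.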
Using timestamps for this is not going to work - period. Most sources regarding caching usually describe the algorithms in question with network resource caching in mind (e.g. HTTP caches). That is not going to work here for three reasons:
Using natural time only makes sense if you intend to implement caching policies that take it into account, e.g. dropping a cache entry after 10 minutes. Unless you are doing something very weird, something like this makes no sense within a 3D renderer.
Timestamps have a relatively low actual resolution, even if you use high precision timers. Most timer sources have a precision of about 1ms, which is a very long time for a processor - in that time your renderer would have worked through several texture entries.
Do you have any idea how expensive timer calls are? Abusing them like this could even make your system perform worse than not having any cache at all...
The usual solution to this problem is to not use a timer at all. The LRU algorithm only needs to know two things:
The maximum number of entries allowed.
The order of the existing entries w.r.t. their last access.
Item (1) comes from the configuration of the system and typically depends on the available storage space. Item (2) generally implies the use of a combined linked list/hash table data structure, where the hash table part provides fast access and the linked list retains the access order. Each time an entry is accessed, it is placed at the end of the list, while old entries are removed from its start.
Using a combined data structure, rather than two separate ones allows entries to be removed from the hash table without having to go through a look-up operation. This improves the overall performance, but it is not absolutely necessary.
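A minimal sketch of such a combined structure with the standard containers follows. It is an illustration rather than a drop-in cache: LruCache, get and put are made-up names, and this sketch keeps the most recently used entry at the front of the list (which end is used is only a convention). Storing list iterators in the hash table lets std::list::splice move an entry in O(1), with no search through the list.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

template <class Key, class Value>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns a pointer to the cached value, or nullptr on a miss.
    Value* get(const Key& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        // Relink the node at the front; iterators stay valid after splice.
        order_.splice(order_.begin(), order_, it->second);
        return &it->second->second;
    }

    // Inserts or overwrites an entry and evicts the least recently used one.
    void put(const Key& key, Value value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        order_.emplace_front(key, std::move(value));
        index_[key] = order_.begin();
        if (order_.size() > capacity_) {
            index_.erase(order_.back().first);  // least recently used is at the back
            order_.pop_back();
        }
    }

    std::size_t size() const { return order_.size(); }

private:
    using List = std::list<std::pair<Key, Value>>;
    std::size_t capacity_;
    List order_;                                             // access order
    std::unordered_map<Key, typename List::iterator> index_; // key -> list node
};
```

For the tile cache, Value could be a std::shared_ptr to the tile data, so that a tile evicted here stays alive until the renderer drops its last reference.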
As promised I am posting my code. Please let me know if I have made mistakes or if I could improve it further. I am now going to look into making it work in a multi-threaded environment. Again thanks to Jeremy and Thkala for their help (sorry the code doesn't fit the comment block).
#include <cstdlib>
#include <cstdio>
#include <memory>
#include <list>
#include <unordered_map>
#include <cstdint>
#include <iostream>
typedef uint32_t data_key_t;
class TileData
{
public:
TileData(const data_key_t &key) : theKey(key) {}
data_key_t theKey;
~TileData() { std::cerr << "delete " << theKey << std::endl; }
};
typedef std::shared_ptr<TileData> TileDataPtr; // automatic memory management!
TileDataPtr loadDataFromDisk(const data_key_t &theKey)
{
return std::shared_ptr<TileData>(new TileData(theKey));
}
class CacheLRU
{
public:
// the linked list keeps track of the order in which the data was accessed
std::list<TileDataPtr> linkedList;
// the hash map (unordered_map is part of c++0x while hash_map isn't?) gives quick access to the data
std::unordered_map<data_key_t, TileDataPtr> hashMap;
CacheLRU() : cacheHit(0), cacheMiss(0) {}
TileDataPtr getData(data_key_t theKey)
{
std::unordered_map<data_key_t, TileDataPtr>::const_iterator iter = hashMap.find(theKey);
if (iter != hashMap.end()) {
TileDataPtr ret = iter->second;
linkedList.remove(ret);
linkedList.push_front(ret);
++cacheHit;
return ret;
}
else {
++cacheMiss;
TileDataPtr ret = loadDataFromDisk(theKey);
linkedList.push_front(ret);
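hashMap.insert(std::make_pair(theKey, ret)); // let make_pair deduce its types; explicit template arguments fail to compile with lvalues in C++11
if (linkedList.size() > MAX_LRU_CACHE_SIZE) {
// evict the least recently used entry, which sits at the back of the list
const TileDataPtr dropMe = linkedList.back();
hashMap.erase(dropMe->theKey);
linkedList.pop_back();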
}
return ret;
}
}
static const uint32_t MAX_LRU_CACHE_SIZE = 8;
uint32_t cacheMiss, cacheHit;
};
int main(int argc, char **argv)
{
CacheLRU cache;
for (uint32_t i = 0; i < 238; ++i) {
int key = random() % 32;
TileDataPtr tileDataPtr = cache.getData(key);
}
std::cerr << "Cache hit: " << cache.cacheHit << ", cache miss: " << cache.cacheMiss << std::endl;
return 0;
}

Sort objects of dynamic size

Problem
Suppose I have a large array of bytes (think up to 4GB) containing some data. These bytes correspond to distinct objects in such a way that every s bytes (think s up to 32) will constitute a single object. One important fact is that this size s is the same for all objects, not stored within the objects themselves, and not known at compile time.
At the moment, these objects are logical entities only, not objects in the programming language. I have a comparison on these objects which consists of a lexicographical comparison of most of the object data, with a bit of different functionality to break ties using the remaining data. Now I want to sort these objects efficiently (this is really going to be a bottleneck of the application).
Ideas so far
I've thought of several possible ways to achieve this, but each of them appears to have some rather unfortunate consequences. You don't necessarily have to read all of these. I tried to print the central question of each approach in bold. If you are going to suggest one of these approaches, then your answer should respond to the related questions as well.
1. C quicksort
Of course the C quicksort algorithm is available in C++ applications as well. Its signature matches my requirements almost perfectly. But the fact that using that function will prohibit inlining of the comparison function will mean that every comparison carries a function invocation overhead. I had hoped for a way to avoid that. Any experience about how C qsort_r compares to STL in terms of performance would be very welcome.
2. Indirection using Objects pointing at data
It would be easy to write a bunch of objects holding pointers to their respective data. Then one could sort those. There are two aspects to consider here. On the one hand, just moving around pointers instead of all the data would mean less memory operations. On the other hand, not moving the objects would probably break memory locality and thus cache performance. Chances that the deeper levels of quicksort recursion could actually access all their data from a few cache pages would vanish almost completely. Instead, each cached memory page would yield only very few usable data items before being replaced. If anyone could provide some experience about the tradeoff between copying and memory locality I'd be very glad.
3. Custom iterator, reference and value objects
I wrote a class which serves as an iterator over the memory range. Dereferencing this iterator yields not a reference but a newly constructed object to hold the pointer to the data and the size s which is given at construction of the iterator. So these objects can be compared, and I even have an implementation of std::swap for these. Unfortunately, it appears that std::swap isn't enough for std::sort. In some parts of the process, my gcc implementation uses insertion sort (as implemented in __insertion_sort in file stl_algo.h) which moves a value out of the sequence, moves a number of items by one step, and then moves the first value back into the sequence at the appropriate position:
typename iterator_traits<_RandomAccessIterator>::value_type
__val = _GLIBCXX_MOVE(*__i);
_GLIBCXX_MOVE_BACKWARD3(__first, __i, __i + 1);
*__first = _GLIBCXX_MOVE(__val);
Do you know of a standard sorting implementation which doesn't require a value type but can operate with swaps alone?
So I'd not only need my class which serves as a reference, but I would also need a class to hold a temporary value. And as the size of my objects is dynamic, I'd have to allocate that on the heap, which means memory allocations at the very leaves of the recursion tree. Perhaps one alternative would be a value type with a static size that should be large enough to hold objects of the sizes I currently intend to support. But that would mean that there would be even more hackery in the relation between the reference_type and the value_type of the iterator class. And it would mean I would have to update that size for my application to one day support larger objects. Ugly.
If you can think of a clean way to get the above code to manipulate my data without having to allocate memory dynamically, that would be a great solution. I'm using C++11 features already, so using move semantics or similar won't be a problem.
4. Custom sorting
I even considered reimplementing all of quicksort. Perhaps I could make use of the fact that my comparison is mostly a lexicographical compare, i.e. I could sort sequences by first byte and only switch to the next byte when the first byte is the same for all elements. I haven't worked out the details on this yet, but if anyone can suggest a reference, an implementation or even a canonical name to be used as a keyword for such a byte-wise lexicographical sorting, I'd be very happy. I'm still not convinced that with reasonable effort on my part I could beat the performance of the STL template implementation.
5. Completely different algorithm
I know there are many, many kinds of sorting algorithms out there. Some of them might be better suited to my problem. Radix sort comes to my mind first, but I haven't really thought this through yet. If you can suggest a sorting algorithm more suited to my problem, please do so. Preferably with implementation, but even without.
Question
So basically my question is this:
“How would you efficiently sort objects of dynamic size in heap memory?”
Any answer to this question which is applicable to my situation is good, no matter whether it is related to my own ideas or not. Answers to the individual questions marked in bold, or any other insight which might help me decide between my alternatives, would be useful as well, particularly if no definite answer to a single approach turns up.
The most practical solution is to use the C style qsort that you mentioned.
template <unsigned S>
struct my_obj {
enum { SIZE = S };
const void *p_;
my_obj (const void *p) : p_(p) {}
//...accessors to get data from pointer
static int c_style_compare (const void *a, const void *b) {
my_obj aa(a);
my_obj bb(b);
return (aa < bb) ? -1 : (bb < aa);
}
};
template <unsigned N, typename OBJ>
void my_sort (char (&large_array)[N], const OBJ &) { // qsort reorders the array in place, so it cannot be const
qsort(large_array, N/OBJ::SIZE, OBJ::SIZE, OBJ::c_style_compare);
}
(Or, you can call qsort_r if you prefer.) Since STL sort inlines the comparison calls, you may not get the fastest possible sorting. If all your system does is sorting, it may be worth it to add the code to get custom iterators to work. But, if most of the time your system is doing something other than sorting, the extra gain you get may just be noise to your overall system.
Since there are only 32 different object variations (1 to 32 bytes), you could easily create an object type for each and select a call to std::sort based on a switch statement. Each call will get inlined and highly optimized.
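The size-dispatch idea can be sketched as follows. This is an illustration under several assumptions: Record, sort_fixed, Dispatch and sort_records are invented names, a recursive template stands in for the 32-case switch, and the comparison is a plain std::memcmp rather than the question's comparison with tie-breaking. Like the overlay approach, it reinterprets the byte buffer as an array of fixed-size structs, which relies on the struct having no padding.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <stdexcept>

// Fixed-size view of one object; S is a compile-time constant here.
template <std::size_t S>
struct Record {
    unsigned char bytes[S];
};

// Sorts count objects of S bytes each; the comparison can be inlined.
template <std::size_t S>
void sort_fixed(unsigned char* data, std::size_t count) {
    static_assert(sizeof(Record<S>) == S, "Record must not be padded");
    Record<S>* first = reinterpret_cast<Record<S>*>(data);
    std::sort(first, first + count,
              [](const Record<S>& a, const Record<S>& b) {
                  return std::memcmp(a.bytes, b.bytes, S) < 0;
              });
}

// Maps the run-time size s onto the matching instantiation (32 down to 1).
template <std::size_t S>
struct Dispatch {
    static void run(unsigned char* data, std::size_t count, std::size_t s) {
        if (s == S) sort_fixed<S>(data, count);
        else Dispatch<S - 1>::run(data, count, s);
    }
};

template <>
struct Dispatch<0> {
    static void run(unsigned char*, std::size_t, std::size_t) {
        throw std::invalid_argument("unsupported object size");
    }
};

// Entry point: s is only known at run time.
inline void sort_records(unsigned char* data, std::size_t count, std::size_t s) {
    Dispatch<32>::run(data, count, s);
}
```

The dispatch runs once per sort, so its cost is negligible next to the sort itself; the price is 32 instantiations of std::sort in the binary.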
Some object sizes might require a custom iterator, as the compiler will insist on padding native objects to align to address boundaries. Pointers can be used as iterators in the other cases since a pointer has all the properties of an iterator.
I'd agree with std::sort using a custom iterator, reference and value type; it's best to use the standard machinery where possible.
You worry about memory allocations, but modern memory allocators are very efficient at handing out small chunks of memory, particularly when being repeatedly reused. You could also consider using your own (stateful) allocator, handing out length s chunks from a small pool.
If you can overlay an object onto your buffer, then you can use std::sort, as long as your overlay type is copyable. (In this example, 4 64bit integers). With 4GB of data, you're going to need a lot of memory though.
As discussed in the comments, you can have a selection of possible sizes based on some number of fixed size templates. You would have to have pick from these types at runtime (using a switch statement, for example). Here's an example of the template type with various sizes and example of sorting the 64bit size.
Here's a simple example:
#include <vector>
#include <algorithm>
#include <iostream>
#include <ctime>
#include <cstdlib> // srand, rand
template <int WIDTH>
struct variable_width
{
unsigned char w_[WIDTH];
};
typedef variable_width<8> vw8;
typedef variable_width<16> vw16;
typedef variable_width<32> vw32;
typedef variable_width<64> vw64;
typedef variable_width<128> vw128;
typedef variable_width<256> vw256;
typedef variable_width<512> vw512;
typedef variable_width<1024> vw1024;
bool operator<(const vw64& l, const vw64& r)
{
const __int64* l64 = reinterpret_cast<const __int64*>(l.w_);
const __int64* r64 = reinterpret_cast<const __int64*>(r.w_);
return *l64 < *r64;
}
std::ostream& operator<<(std::ostream& out, const vw64& w)
{
const __int64* w64 = reinterpret_cast<const __int64*>(w.w_);
out << *w64; // write to the stream that was passed in, not unconditionally to std::cout
return out;
}
int main()
{
srand(time(NULL));
std::vector<unsigned char> buffer(10 * sizeof(vw64));
vw64* w64_arr = reinterpret_cast<vw64*>(&buffer[0]);
for(int x = 0; x < 10; ++x)
{
(*(__int64*)w64_arr[x].w_) = rand();
}
std::sort(
w64_arr,
w64_arr + 10);
for(int x = 0; x < 10; ++x)
{
std::cout << w64_arr[x] << '\n';
}
std::cout << std::endl;
return 0;
}
Given the enormous size (4GB), I would seriously consider dynamic code generation. Compile a custom sort into a shared library, and dynamically load it. The only non-inlined call should be the call into the library.
With precompiled headers, the compilation times may actually be not that bad. The whole <algorithm> header doesn't change, nor does your wrapper logic. You just need to recompile a single predicate each time. And since it's a single function you get, linking is trivial.
#define OBJECT_SIZE 32
struct structObject
{
unsigned char* pObject;
bool operator < (const structObject &n) const
{
for(int i=0; i<OBJECT_SIZE; i++)
{
if(*(pObject + i) != *(n.pObject + i))
return (*(pObject + i) < *(n.pObject + i));
}
return false;
}
};
int _tmain(int argc, _TCHAR* argv[])
{
std::vector<structObject> vObjects;
unsigned char* pObjects = (unsigned char*)malloc(10 * OBJECT_SIZE); // 10 Objects
for(int i=0; i<10; i++)
{
structObject stObject;
stObject.pObject = pObjects + (i*OBJECT_SIZE);
*stObject.pObject = 'A' + 9 - i; // Add a value to the start to check the sort
vObjects.push_back(stObject);
}
std::sort(vObjects.begin(), vObjects.end());
free(pObjects);
return 0;
}
To avoid the #define, pass the object size to a comparator object at run time:
struct structObject
{
unsigned char* pObject;
};
struct structObjectComparerAscending
{
int iSize;
structObjectComparerAscending(int _iSize)
{
iSize = _iSize;
}
bool operator ()(const structObject &stLeft, const structObject &stRight) const
{
for(int i=0; i<iSize; i++)
{
if(*(stLeft.pObject + i) != *(stRight.pObject + i))
return (*(stLeft.pObject + i) < *(stRight.pObject + i));
}
return false;
}
};
int _tmain(int argc, _TCHAR* argv[])
{
int iObjectSize = 32; // Read it from somewhere
std::vector<structObject> vObjects;
unsigned char* pObjects = (unsigned char*)malloc(10 * iObjectSize);
for(int i=0; i<10; i++)
{
structObject stObject;
stObject.pObject = pObjects + (i*iObjectSize);
*stObject.pObject = 'A' + 9 - i; // Add a value to the start to work with something...
vObjects.push_back(stObject);
}
std::sort(vObjects.begin(), vObjects.end(), structObjectComparerAscending(iObjectSize));
free(pObjects);
return 0;
}
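The byte-by-byte loop in the comparator above performs the same lexicographic comparison of unsigned bytes that std::memcmp does, so a shorter variant is possible. The following is a minimal sketch, assuming C++11 and records stored back to back in one buffer; the names sort_records and record_size are hypothetical and not part of the answer above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: returns pointers to the records in `buffer`,
// ordered by the lexicographic byte order of each record_size-byte record.
// The buffer itself is left untouched; only the pointers are sorted.
std::vector<const unsigned char*> sort_records(
    const std::vector<unsigned char>& buffer, std::size_t record_size)
{
    std::vector<const unsigned char*> records;
    if (record_size == 0) return records;

    const std::size_t count = buffer.size() / record_size;
    records.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        records.push_back(buffer.data() + i * record_size);

    // std::memcmp interprets the bytes as unsigned char,
    // which matches the hand-written loop.
    std::sort(records.begin(), records.end(),
              [record_size](const unsigned char* l, const unsigned char* r) {
                  return std::memcmp(l, r, record_size) < 0;
              });
    return records;
}
```

Sorting pointers keeps each swap cheap regardless of the record size, at the cost of one indirection per comparison.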

Simulation design - flow of data, coupling

I am writing a simulation and need some hints on the design. The basic idea is that data for the given stochastic processes is generated and later consumed by various calculations. For example, for one iteration:
Process 1 -> generates data for source 1: x1
Process 2 -> generates data for source 2: x2
and so on
Later I want to apply some transformations, for example to the output of source 2, which results in x2a, x2b, x2c. So in the end I have the following vector: [x1, x2a, x2b, x2c].
The problem is that for N-variate stochastic processes (representing, for example, multiple correlated phenomena) I have to generate an N-dimensional sample at once:
Process 1 -> generates data for source 1...N: x1...xN
I am looking for a simple architecture that would let me structure the simulation code and provide flexibility without hindering performance.
I was thinking of something along these lines (pseudocode):
class random_process
{
// concrete processes would generate and store last data
virtual data_ptr operator()() const = 0;
};
class source_proxy
{
container_type<random_process> processes;
container_type<data_ptr> data; // pointers to the process data storage
data operator[](size_type number) const { return *(data[number]);}
void next() const {/* update the processes */}
};
Somehow I am not convinced by this design. For example, if I wanted to work with vectors of samples instead of a single iteration, the design above would have to change. (I could, for example, have the processes fill submatrices of a proxy matrix passed to them with data, but again I am not sure this is a good idea. If it is, it would also fit the single-iteration case nicely.) Any comments, suggestions, and criticism are welcome.
EDIT:
A short summary of the key points, to clarify the situation:
random_processes contain the logic to generate some data. For example, one can draw samples from a multivariate Gaussian distribution with given means and a given correlation matrix, using for example a Cholesky decomposition; the result is a set of samples [x1 x2 ... xN].
I can have multiple random_processes, with different dimensionality and parameters.
I want to apply some transformations to individual elements generated by the random_processes.
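The Cholesky step mentioned in the summary can be sketched as follows. This is only an illustration, assuming a symmetric positive-definite covariance matrix stored as nested std::vector; the names matrix_t, cholesky, correlate, and draw are hypothetical, and a real simulation would more likely rely on a linear algebra library.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

typedef std::vector<std::vector<double> > matrix_t;

// Lower-triangular L with L * L^T == cov.
// Assumes cov is symmetric positive definite (not checked here).
matrix_t cholesky(const matrix_t& cov)
{
    const std::size_t n = cov.size();
    matrix_t L(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j <= i; ++j) {
            double sum = cov[i][j];
            for (std::size_t k = 0; k < j; ++k)
                sum -= L[i][k] * L[j][k];
            L[i][j] = (i == j) ? std::sqrt(sum) : sum / L[j][j];
        }
    }
    return L;
}

// Turns independent standard normal draws z into one correlated
// sample x = mean + L * z.
std::vector<double> correlate(const matrix_t& L,
                              const std::vector<double>& mean,
                              const std::vector<double>& z)
{
    std::vector<double> x(mean);
    for (std::size_t i = 0; i < L.size(); ++i)
        for (std::size_t j = 0; j <= i; ++j)
            x[i] += L[i][j] * z[j];
    return x;
}

// Draws one N-dimensional sample [x1 ... xN] using the given engine.
template <class Engine>
std::vector<double> draw(const matrix_t& L, const std::vector<double>& mean,
                         Engine& engine)
{
    std::normal_distribution<double> standard_normal(0.0, 1.0);
    std::vector<double> z(mean.size());
    for (std::size_t i = 0; i < z.size(); ++i)
        z[i] = standard_normal(engine);
    return correlate(L, mean, z);
}
```

The factor L only needs to be computed once per process; each iteration then costs one matrix-vector product, which is why all N components come out together.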
Here is the dataflow diagram:
random_processes                          output
     x1 --------------------------> x1
                             ----> x2a
p1   x2 ------------transform|----> x2b
                             ----> x2c
     x3 --------------------------> x3
p2   y1 ------------transform|----> y1a
                             ----> y1b
The output is being used to do some calculations.
When I read this, "the answer" doesn't materialize in my mind; a question does instead:
(This problem is part of a class of problems that various tool vendors in the market have created configurable solutions for.)
Do you "have to" write this or can you invest in tried and proven technology to make your life easier?
In my job at Microsoft I work with high performance computing vendors - several of which have math libraries. Folks at these companies would come much closer to understanding the question than I do. :)
Cheers,
Greg Oliver [MSFT]
I'll take a stab at this. Perhaps I'm missing something, but it sounds like we have a list of processes 1...N that take no arguments and return a data_ptr. So why not store them in a vector (or in an array, if the number is known at compile time) and then structure them in whatever way makes sense? You can get really far with the STL: the built-in containers (std::vector), function objects (std::tr1::function, or std::function in C++11), and algorithms (std::transform). You didn't say much about the higher-level structure, so I'm assuming a really naive one, but clearly you would build the data flow appropriately. It gets even easier if your compiler supports C++0x lambdas, because nesting the transformations becomes simpler.
//compiled in the SO textbox...
#include <algorithm>  // std::transform
#include <functional> // std::function, the C++11 successor of std::tr1::function
#include <vector>
typedef int data_ptr;
const int NUMBER = 10; // number of samples to generate (arbitrary value)
class Generator{
public:
    data_ptr operator()(){
        //randomly generate input
        return 42 * 4;
    }
};
class StochasticTransformation{
public:
    data_ptr operator()(data_ptr in){
        //apply a randomly seeded function
        return in * 4;
    }
};
class ConstantGenerator{ // a second generator; the class name is illustrative
public:
    data_ptr operator()(){
        return 42;
    }
};
int main(){
    //array of processes, wrap this in a class if you like but it sounds
    //like there is a distinction between generators that create data
    //and transformations
    std::vector<std::function<data_ptr(void)> > generators;
    //TODO: fill up the process vector with functors...
    generators.push_back(Generator());
    //transformations look like this (right?)
    std::vector<std::function<data_ptr(data_ptr)> > transformations;
    //so let's add one
    transformations.push_back(StochasticTransformation());
    //and we have an array of results...
    std::vector<data_ptr> results;
    //and we need some inputs
    for (int i = 0; i < NUMBER; ++i)
        results.push_back(generators[0]());
    //and now start transforming them in place using transform...
    //pick a random one or do them all...
    std::transform(results.begin(), results.end(),
                   results.begin(), transformations[0]);
    return 0;
}
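With C++11 lambdas, the same generate-then-transform flow can be written without separate functor classes. The following is a minimal sketch; the function name run_pipeline is hypothetical, and the constant-valued generator is only a placeholder for a real random process.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

typedef int data_ptr;

// Hypothetical helper: generates `count` values and then applies
// every transformation, in order, to all of them in place.
std::vector<data_ptr> run_pipeline(
    const std::function<data_ptr()>& generator,
    const std::vector<std::function<data_ptr(data_ptr)> >& transformations,
    int count)
{
    std::vector<data_ptr> results;
    for (int i = 0; i < count; ++i)
        results.push_back(generator());

    for (const auto& t : transformations)
        std::transform(results.begin(), results.end(), results.begin(), t);
    return results;
}
```

Called with the generator `[] { return 42; }` and the two transformations `[](data_ptr x) { return x * 4; }` and `[](data_ptr x) { return x + 1; }`, every element of the result is 169.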
I think that the second option (the one mentioned in the last paragraph) makes more sense. In the one you presented, you are playing with pointers and indirect access to random process data. The other one would store all the data (either a vector or a matrix) in one place: the source_proxy object. The random process objects are then called with a submatrix to populate as a parameter, and they do not store any data themselves. The proxy manages everything, from providing the source data (for any distinct source) to requesting new data from the generators.
So, changing your snippet a bit, we could end up with something like this:
class random_process
{
// concrete processes fill the submatrix they are given; they store no data themselves
virtual void operator()(submatrix &) = 0;
};
class source_proxy
{
container_type<random_process> processes;
matrix data;
data operator[](size_type source_number) const { /* return a column of data */ }
void next() {/* get new data from the random processes */}
};
But I agree with the other comment (Greg's) that this is a difficult problem, and depending on the final application it may require heavy thinking. It's easy to run into a dead end that forces you to rewrite lots of code...
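The proxy-owned storage described in this answer could look roughly like the following. This is a sketch under assumptions: one iteration is stored as a flat row of doubles, the "submatrix" is reduced to a pointer plus a length, and the names dimension, fill, add, and constant_process are invented for illustration.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// A process writes its samples into storage owned by the proxy.
class random_process
{
public:
    virtual ~random_process() {}
    virtual std::size_t dimension() const = 0;         // number of sources it feeds
    virtual void fill(double* out, std::size_t n) = 0; // write n == dimension() samples
};

// Stand-in for a real stochastic process: always writes the same value.
class constant_process : public random_process
{
public:
    constant_process(std::size_t dim, double value) : dim_(dim), value_(value) {}
    std::size_t dimension() const override { return dim_; }
    void fill(double* out, std::size_t n) override
    {
        for (std::size_t i = 0; i < n; ++i) out[i] = value_;
    }
private:
    std::size_t dim_;
    double value_;
};

class source_proxy
{
public:
    // Registers a process and reserves a slice of the shared row for it.
    void add(std::unique_ptr<random_process> p)
    {
        offsets_.push_back(data_.size());
        data_.resize(data_.size() + p->dimension());
        processes_.push_back(std::move(p));
    }
    // Asks every process to overwrite its own slice of the shared row.
    void next()
    {
        for (std::size_t i = 0; i < processes_.size(); ++i)
            processes_[i]->fill(data_.data() + offsets_[i],
                                processes_[i]->dimension());
    }
    double operator[](std::size_t source_number) const { return data_[source_number]; }
    std::size_t size() const { return data_.size(); }
private:
    std::vector<std::unique_ptr<random_process> > processes_;
    std::vector<std::size_t> offsets_;
    std::vector<double> data_;
};
```

Working with blocks of samples instead of a single iteration would mean handing each process a rectangular view (one row per iteration) rather than a single slice, while the proxy interface stays the same.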