Best Data Structure for Genetic Algorithm in C++?

Best Data Structure for Genetic Algorithm in C++? - c++

i need to implement a genetic algorithm customized for my problem (college project), and the first version had it coded as an matrix of short ( bits per chromosome x size of population).
That was a bad design, since i am declaring a short but only using the "0" and "1" values... but it was just a prototype and it worked as intended, and now it is time for me to develop a new, improved version. Performance is important here, but simplicity is also appreciated.
I researched around and came up with:
for the chromosome :
- String class (like "0100100010")
- Array of bool
- Vector (vectors appears to be optimized for bool)
- Bitset (sounds the most natural one)
and for the population:
- C Array[]
- Vector
- Queue
I am inclined to pick vector for chromossome and array for pop, but i would like the opinion of anyone with experience on the subject.
Thanks in advance!

I'm guessing you want random access to the population and to the genes. You say performance is important, which I interpret as execution speed. So you're probably best off using a vector<> for the chromosomes and a vector<char> for the genes. The reason for vector<char> is that bitset<> and vector<bool> are optimized for memory consumption, and are therefore slow. vector<char> will give you higher speed at the cost of x8 memory (assuming char = byte on your system). So if you want speed, go with vector<char>. If memory consumption is paramount, then use vector<bool> or bitset<>. bitset<> would seem like a natural choice here, however, bear in mind that it is templated on the number of bits, which means that a) the number of genes must be fixed and known at compile time (which I would guess is a big no-no), and b) if you use different sizes, you end up with one copy per bitset size of each of the bitset methods you use (though inlining might negate this), i.e., code bloat. Overall, I would guess vector<bool> is better for you if you don't want vector<char>.
If you're concerned about the aesthetics of vector<char> you could typedef char gene; and then use vector<gene>, which looks more natural.
A string is just like a vector<char> but more cumbersome.

Specifically to answer your question. I am not exactly sure what you are suggestion. You talk about Array and string class. Are you talking about the STL container classes where you can have a queue, bitset, vector, linked list etc. I would suggest a vector for you population (closest thing to a C array there is) and a bitset for you chromosome if you are worried about memory capacity. Else as you are already using a vector of your string representaion of your dna. ("10110110")
For ideas and a good tool to dabble. Recommend you download and initially use this library. It works with the major compilers. Works on unix variants. Has all the source code.
All the framework stuff is done for you and you will learn a lot. Later on you could write your own code from scratch or inherit from these classes. You can also use them in commercial code if you want.
Because they are objects you can change representaion of your DNA easily from integers to reals to structures to trees to bit arrays etc etc.
There is always learning cure involved but it is worth it.
I use it to generate thousands of neural nets then weed them out with a simple fitness function then run them for real.
galib
http://lancet.mit.edu/ga/

Assuming that you want to code this yourself (if you want an external library kingchris seems to have a good one there) it really depends on what kind of manipulation you need to do. To get the most bang for your buck in terms of memory, you could use any integer type and set/manipulate individual bits via bitmasks etc. Now this approach likely not optimal in terms of ease of use... The string example above would work ok, however again its not significantly different than the shorts, here you are now just representing either '0' or '1' with an 8 bit value as opposed to 16 bit value. Also, again depending on the manipulation, the string case will probably prove unwieldly. So if you could give some more info on the algorithm we could maybe give more feedback. Myself I like the individual bits as part of an integer (a bitset), but if you aren't used to masks, shifts, and all that good stuff it may not be right for you.

I suggest writing a class for each member of population, that simplifies things considerably, since you can keep all your member relevant functions in the same place nicely wrapped with the actual data.
If you need a "array of bools" I suggest using an int or several ints (then use mask and bit wise operations to access (modify / flip) each bit) depending on number of your chromosomes.
I usually used some sort of collection class for the population, because just an array of population members doesn't allow you to simply add to your population. I would suggest implementing some sort of dynamic list (if you are familiar with ArrayList then that is a good example).
I had major success with genetic algorithms with the recipe above. If you prepare your member class properly it can really simplify things and allows you to focus on coding better genetic algorithms instead of worrying about your data structures.

Related

unordered_set vs vector -- prefer idiomatic or performant?

I'm working with data that is unique from other data of the same type. Very abstractly, a set fits the definition of the data I'm working with. I feel inclined to use std::unordered_set instead of std::vector for that reason.
Beyond that, both classes can fit my requirements. My question is about performance -- which might perform better? I cannot write out the code one way and benchmark it, then rewrite it the other way. That will take me hundreds of hours. If they'll perform similarly, do you think it would be worth-while to stick with the idiomatic unordered_set?
Here is a simpler use case. A company is selling computers. Each is unique from another in at least one way, guaranteed.
struct computer_t
{
std::string serial;
std::uint32_t gb_of_ram;
};
std::unordered_set<computer_t> all_computers_in_existence;
std::unordered_set<computer_t> computers_for_sale; // subset of above
// alternatively
std::vector<computer_t> all_computers_in_existence;
std::vector<computer_t> computers_for_sale; // subset of above
The company wants to stop selling computers that aren't popular and replace them with other computers that might be.
std::unordered_set<computer_t> computers_not_for_sale;
std::set_difference(all_computers_in_existence.begin(), all_computers_in_existence.end(),
computers_for_sale.begin(), computers_for_sale.end(),
std::inserter(computers_not_for_sale, computers_not_for_sale.end()));
calculate_and_remove_least_sold(computers_for_sale);
calculate_and_add_most_likely_to_sell(computers_for_sale, computers_not_for_sale);
Based on the above sample code, what should I choose? Or is there another, new STL feature (in C++17) I should investigate? This really is as generic as it gets for my use-case without making this post incredibly long with details.

Idiomatic should be your first choice. If you implement it using unordered_set and the performance is not good enough, there are faster non-STL hash tables which are easy to switch to. 99% of the time it won't come to that.
Your example code using std::set_difference will not work, because that requires the inputs be sorted, which unordered_set is not. That's OK though, subtracting is done easily using unordered_set::erase(key).

Fast hamming distance between 2 bitset

I'm writing a software that heavily relies on (1) accessing single bit and (2) Hamming distance computations between 2 bitset A and B (ie. the numbers of bits that differ between A and B). The bitsets are quite big, between 10K and 1M bits and i have a bunch of them. Since it is impossible to know the bitset sizes at compilation time, i'm using vector < bool > , but i plan to migrate to boost::dynamic_bitset soon.
Hereafter are my questions:
(1) Any ideas about which implementations have the fastest single bit access time?
(2) To compute Hamming distance, the naive approach is to loop over the single bits and to count differences between the 2 bitsets. But, my feeling is that it might be much faster to loop over bytes instead of bits, perform R = byteA XOR byteB, and look in a table with 255 entries what "local" distance is associated with R. Another solutions would be store a 255 x 255 matrix and access directly without operation to the distance between byteA and byteB. So my question: Any idea how to implement that from std::vector < bool > or boost::dynamic_bitset? In other words, do you know if there is a way to get access to the bytes array or i have to recode everything from scratch?

(1) Probably vector<char> (or even vector<int>), but that wastes at least 7/8 space on typical hardware. You don't need to unpack the bits if you use a byte or more to store them. Which of vector<bool> or dynamic_bitset is faster, I don't know. That might depend on the C++ implementation.
(2) boost::dynamic_bitset has operator^ and a count member, which together can be used to compute the Hamming distance in a probably fast, though memory-wasting way. You can also get to the underlying buffer with to_block_range; to use that, you need to implement a Hamming distance calculator as an OutputIterator.

If you do code from scratch, you can probably do even better than a byte at a time: take a word at a time from each bitset. The cost of XOR should be very low, then use either an implementation-specific builtin popcount, or else the fastest bit-twiddling popcount you can find (which may or may not involve a 256-entry lookup).
[Edit: looks as if this could apply to boost::dynamic_bitset::to_block_range, with the Block chosen as either int or long. It's a shame that it writes to an OutputIterator rather than giving you an InputIterator -- I can't immediately see how to use it to iterate over two bitsets together, except by using an extra thread or else copying one of the bitsets out to an int array first. Either way you'll take some copy overhead that could have been avoided if it had left the program control to you. The thread is pretty complicated for this task, and of course has its own overheads, and copying out the data probably isn't any better than using operator^ and count().]

I know this will get downvoted for heresy, but here it is: you can get a pointer to the actual data from a vector using &vector[0]; (for vector ymmv). Then, you can iterate over it using c-style functions; meaning, cast your pointer to an int pointer or something big like that, perform your hamming arithmetic as above, and move the pointer one word-length at a time. This would only work because you know that the bits are packed together continuously, and would be vulnerable (for example, if the vector is modified, it could move memory locations).

An integer hashing problem

I have a (C++) std::map<int, MyObject*> that contains a couple of millions of objects of type MyObject*. The maximum number of objects that I can have, is around 100 millions. The key is the object's id. During a certain process, these objects must be somehow marked( with a 0 or 1) as fast as possible. The marking cannot happen on the objects themselves (so I cannot introduce a member variable and use that for the marking process). Since I know the minimum and maximum id (1 to 100_000_000), the first thought that occured to me, was to use a std::bit_set<100000000> and perform my marking there. This solves my problem and also makes it easier when marking processes run in parallel, since these use their own bit_set to mark things, but I was wondering what the solution could be, if I had to use something else instead of a 0-1 marking, e.g what could I use if I had to mark all objects with an integer number ?
Is there some form of a data structure that can deal with this kind of problem in a compact (memory-wise) manner, and also be fast ? The main queries of interest are whether an object is marked, and with what was marked with.
Thank you.
Note: std::map<int, MyObject*> cannot be changed. Whatever data structure I use, must not deal with the map itself.

How about making the value_type of your map a std::pair<bool, MyObject*> instead of MyObject*?

If you're not concerned with memory, then a std::vector<int> (or whatever suits your need in place of an int) should work.
If you don't like that, and you can't modify your map, then why not create a parallel map for the markers?
std::map<id,T> my_object_map;
std::map<id,int> my_marker_map;
If you cannot modify the objects directly, have you considered wrapping the objects before you place them in the map? e.g.:
struct
{
int marker;
T *p_x;
} T_wrapper;
std::map<int,T_wrapper> my_map;
If you're going to need to do lookups anyway, then this will be no slower.
EDIT: As #tenfour suggests in his/her answer, a std::pair may be a cleaner solution here, as it saves the struct definition. Personally, I'm not a big fan of std::pairs, because you have to refer to everything as first and second, rather than by meaningful names. But that's just me...

The most important question to ask yourself is "How many of these 100,000,000 objects might be marked (or remain unmarked)?" If the answer is smaller than roughly 100,000,000/(2*sizeof(int)), then just use another std::set or std::tr1::unordered_set (hash_set previous to tr1) to track which ones are so marked (or remained unmarked).
Where does 2*sizeof(int) come from? It's an estimate of the amount of memory overhead to maintain a heap structure in a deque of the list of items that will be marked.
If it is larger, then use std::bitset as you were about to use. It's overhead is effectively 0% for the scale of quantity you need. You'll need about 13 megabytes of contiguous ram to hold the bitset.
If you need to store a marking as well as presence, then use std::tr1::unordered_map using the key of Object* and value of marker_type. And again, if the percentage of marked nodes is higher than the aforementioned comparison, then you'll want to use some sort of bitset to hold the number of bits needed, with suitable adjustments in size, at 12.5 megabytes per bit.
A purpose-built object holding the bitset might be your best choice, given the clarification of the requirements.
Edit: this assumes that you've done proper time-complexity computations for what are acceptable solutions to you, since changing the base std::map structure is no longer permitted.

If you don't mind using hacks, take a look at the memory optimization used in Boost.MultiIndex. It can store one bit in the LSB of a stored pointer.

How fast is the code

I'm developing game. I store my game-objects in this map:
std::map<std::string, Object*> mObjects;
std::string is a key/name of object to find further in code. It's very easy to point some objects, like: mObjects["Player"] = .... But I'm afraid it's to slow due to allocation of std::string in each searching in that map. So I decided to use int as key in that map.
The first question: is that really would be faster?
And the second, I don't want to remove my current type of objects accesing, so I found the way: store crc string calculating as key. For example:
Object *getObject(std::string &key)
{
int checksum = GetCrc32(key);
return mObjects[checksum];
}
Object *temp = getOject("Player");
Or this is bad idea? For calculating crc I would use boost::crc. Or this is bad idea and calculating of checksum is much slower than searching in map with key type std::string?

Calculating a CRC is sure to be slower than any single comparison of strings, but you can expect to do about log2N comparisons before finding the key (e.g. 10 comparisons for 1000 keys), so it depends on the size of your map. CRC can also result in collisions, so it's error prone (you could detect collisions relatively easily detect, and possibly even handle them to get correct results anyway, but you'd have to be very careful to get it right).
You could try an unordered_map<> (possibly called hash_map) if your C++ environment provides one - it may or may not be faster but won't be sorted if you iterate. Hash maps are yet another compromise:
the time to hash is probably similar to the time for your CRC, but
afterwards they can often seek directly to the value instead of having to do the binary-tree walk in a normal map
they prebundle a bit of logic to handle collisions.
(Silly point, but if you can continue to use ints and they can be contiguous, then do remember that you can replace the lookup with an array which is much faster. If the integers aren't actually contiguous, but aren't particularly sparse, you could use a sparse index e.g. array of 10000 short ints that are indices into 1000 packed records).
Bottom line is if you care enough to ask, you should implement these alternatives and benchmark them to see which really works best with your particular application, and if they really make any tangible difference. Any of them can be best in certain circumstances, and if you don't care enough to seriously compare them then it clearly means any of them will do.

For the actual performance you need to profile the code and see it. But I would be tempted to use hash_map. Although its not part of the C++ standard library most of the popular implentations provide it. It provides very fast lookup.

The first question: is that really would be faster?
yes - you're comparing an int several times, vs comparing a potentially large map of strings of arbitrary length several times.
checksum: Or this is bad idea?
it's definitely not guaranteed to be unique. it's a bug waiting to bite.
what i'd do:
use multiple collections and embrace type safety:
// perhaps this simplifies things enough that t_player_id can be an int?
std::map<t_player_id, t_player> d_players;
std::map<t_ghoul_id, t_ghoul> d_ghouls;
std::map<t_carrot_id, t_carrot> d_carrots;
faster searches, more type safety. smaller collections. smaller allocations/resizes.... and on and on... if your app is very trivial, then this won't matter. use this approach going forward, and adjust after profiling/as needed for existing programs.
good luck

If you really want to know you have to profile your code and see how long does the function getObject take. Personally I use valgrind and KCachegrind to profile and render data on UNIX system.
I think using id would be faster. It's faster to compare int than string so...

Perfect hash function for a set of integers with no updates

In one of the applications I work on, it is necessary to have a function like this:
bool IsInList(int iTest)
{
//Return if iTest appears in a set of numbers.
}
The number list is known at app load up (But are not always the same between two instances of the same application) and will not change (or added to) throughout the whole of the program. The integers themselves maybe large and have a large range so it is not efficient to have a vector<bool>. Performance is a issue as the function sits in a hot spot. I have heard about Perfect hashing but could not find out any good advice. Any pointers would be helpful. Thanks.
p.s. I'd ideally like if the solution isn't a third party library because I can't use them here. Something simple enough to be understood and manually implemented would be great if it were possible.

I would suggest using Bloom Filters in conjunction with a simple std::map.
Unfortunately the bloom filter is not part of the standard library, so you'll have to implement it yourself. However it turns out to be quite a simple structure!
A Bloom Filter is a data structure that is specialized in the question: Is this element part of the set, but does so with an incredibly tight memory requirement, and quite fast too.
The slight catch is that the answer is... special: Is this element part of the set ?
No
Maybe (with a given probability depending on the properties of the Bloom Filter)
This looks strange until you look at the implementation, and it may require some tuning (there are several properties) to lower the probability but...
What is really interesting for you, is that for all the cases it answers No, you have the guarantee that it isn't part of the set.
As such a Bloom Filter is ideal as a doorman for a Binary Tree or a Hash Map. Carefully tuned it will only let very few false positive pass. For example, gcc uses one.

What comes to my mind is gperf. However, it is based in strings and not in numbers. However, part of the calculation can be tweaked to use numbers as input for the hash generator.

integers, strings, doesn't matter
http://videolectures.net/mit6046jf05_leiserson_lec08/
After the intro, at 49:38, you'll learn how to do this. The Dot Product hash function is demonstrated since it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time here, find something that is FAST for your datatype and that offers some adjustable SEED for hashing. A good combo there is better than the alternative of growing the hash table.
#54:30 The Prof. draws picture of a standard way of doing perfect hash. The perfect minimal hash is beyond this lecture. (good luck!)
It really all depends on what you mod by.
Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.
The std::map you get very good performance in 99.9% scenarios. If your hot spot has the same iTest(s) multiple times, combine the map result with a temporary hash cache.
Int is one of the datatypes where it is possible to just do:
bool hash[UINT_MAX]; // stackoverflow ;)
And fill it up. If you don't care about negative numbers, then it's twice as easy.

A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.
The most obvious and easy to implement solution for testing existence would be a sorted list or balanced binary tree. Then you could decide existence in log(N) time. I doubt it'll get much better than that.

For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.
Wikipedia has example implementations that should be simple enough to translate to C++.

It's not necessary or practical to aim for mapping N distinct randomly dispersed integers to N contiguous buckets - i.e. a perfect minimal hash - the important thing is to identify an acceptable ratio. To do this at run-time, you can start by configuring a worst-acceptible ratio (say 1 to 20) and a no-point-being-better-than-this-ratio (say 1 to 4), then randomly vary (e.g. changing prime numbers used) a fast-to-calculate hash algorithm to see how easily you can meet increasingly difficult ratios. For worst-acceptible you don't time out, or you fall back on something slower but reliable (container or displacement lists to resolve collisions). Then, allow a second or ten (configurable) for each X% better until you can't succeed at that ratio or reach the no-pint-being-better ratio....
Just so everyone's clear, this works for inputs only known at run time with no useful patterns known beforehand, which is why different hash functions have to be trialed or actively derived at run time. It is not acceptible to simple say "integer inputs form a hash", because there are collisions when %-ed into any sane array size. But, you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers to a packed array, so there's little memory wasted for large objects.

Original Question
After working with it for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in a unique - perfect hashing.
Let's say the values ranged from L to H in the array. This yields a Range R = H - L + 1.
Generally it was pretty big.
I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.
In you case you are using integers. Technically, they are already hashed, but the range is large.
It may be that you can get what you want, simply by applying the modulus operator.
It may be that you need to put a hash function in front of it first.
It also may be that you can't find a perfect hash for it, in which case your container class should have a fall back position.... binary search, or map or something like that, so that
you can guarantee that the container will work in all cases.

A trie or perhaps a van Emde Boas tree might be a better bet for creating a space efficient set of integers with lookup time bring constant against the number of objects in the data structure, assuming that even std::bitset would be too large.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js