String search algorithm used by string::find() c++ - c++

How is it faster than the cstring functions? Is similar source available for C?

There's no standard implementation of the C++ Standard Library, but you should be able to take a look at the implementation shipped with your compiler and see how it works yourself.
In general, most STL functions are not faster than their C counterparts, though. They're usually safer, more generalized and designed to accommodate a much broader range of circumstances than the narrow-purpose C equivalents.

A standard optimization with any string class is to store the string length along with the string. This makes determining the length O(1) instead of O(n) for any operation that needs it, strlen() being the obvious one.
Copying a string is another example: there's no gain in the actual copy, but figuring out how much memory to allocate before the copy is O(1). The overall algorithm is still O(n). The basic operation is still the same; shoveling bytes takes just as long in any language.
String classes are useful because they are safer (harder to shoot your foot) and easier to use (require less explicit code). They became popular and widely used because they weren't slower.

The string class almost certainly stores far more data about the string than you'd find in a C string. Length is a good example. In tradeoff for the extra memory use, you will gain some spare CPU cycles.
Edit:
However, it's unlikely that one is substantially slower than the other, since they'll perform fundamentally the same actions. MSDN suggests that string::find() doesn't use a functor-based system, so they won't have that optimization.

There are many possibilities for implementing string search. The easiest way is to check, at every position of the destination string, whether the search string starts there. You can code that very quickly, but it's the slowest possibility: O(m*n), where m = length of the search string and n = length of the destination string.
Take a look at the Wikipedia page, http://en.wikipedia.org/wiki/String_searching_algorithm, where different options are presented.
An asymptotically faster way is to build a finite state machine from the search string and then run the destination string through it without ever going backwards. After preprocessing the search string, the scan is then just O(n).
Which algorithm the STL actually uses, I don't know. But you could look up the source code and compare it with the algorithms.
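To make the two extremes above concrete, here is a minimal sketch assuming plain byte strings: a naive O(m*n) scan, and a Knuth-Morris-Pratt search, which is one common way to get the "never go backwards" behaviour. The function names naive_find and kmp_find are made up for illustration, and nothing here claims to be what any particular standard library's string::find() does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Naive search: try every start position in text. Worst case O(m*n).
std::size_t naive_find(const std::string& text, const std::string& pat) {
    if (pat.size() > text.size()) return std::string::npos;
    for (std::size_t i = 0; i + pat.size() <= text.size(); ++i) {
        std::size_t j = 0;
        while (j < pat.size() && text[i + j] == pat[j]) ++j;
        if (j == pat.size()) return i;  // whole pattern matched at i
    }
    return std::string::npos;
}

// Knuth-Morris-Pratt: precompute, for each prefix of pat, the length of the
// longest proper prefix that is also a suffix, then scan text exactly once.
std::size_t kmp_find(const std::string& text, const std::string& pat) {
    if (pat.empty()) return 0;
    std::vector<std::size_t> fail(pat.size(), 0);
    for (std::size_t i = 1, k = 0; i < pat.size(); ++i) {
        while (k > 0 && pat[i] != pat[k]) k = fail[k - 1];
        if (pat[i] == pat[k]) ++k;
        fail[i] = k;
    }
    for (std::size_t i = 0, k = 0; i < text.size(); ++i) {
        // On a mismatch, fall back in the pattern; i never moves backwards.
        while (k > 0 && text[i] != pat[k]) k = fail[k - 1];
        if (text[i] == pat[k]) ++k;
        if (k == pat.size()) return i + 1 - k;  // start of the match
    }
    return std::string::npos;
}
```

Both functions return the index of the first match or std::string::npos, so they can be checked against std::string::find() on the same inputs before timing them.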

Related

Efficient way to get a reversed copy of std::string

When dealing with algorithmic tasks I frequently need to get a reversed copy of a std::string. Also, the source string should not be modified. As far as I know, there are two ways to do it:
Use std::reverse :
// std::string sourceString has been initialized before.
std::string reversedString = sourceString;
std::reverse(reversedString.begin(), reversedString.end());
Use reverse iterators. This one I found on the Internet:
// std::string sourceString has been initialized before.
std::string reversedString{sourceString.rbegin(), sourceString.rend()};
My question is which approach I should prefer according to efficiency and best practices.
C-style solutions are not my concern; I am only interested in STL-way approaches.
My question is which approach I should prefer according to efficiency
The one which should be preferred according to efficiency is the one that has been measured to be more efficient. Both have the same asymptotic complexity.
But, I won't bother to measure the difference unless it happens to be a bottleneck. I prefer 2, but it's subjective.
I could say that constructing a data structure with the right data initially is generally faster, but general statements about performance are generally wrong. You should measure the performance and benchmark if you're concerned about performance.
If you're not concerned enough about performance to write benchmark code, then you should take the style that looks the best for you.
Also, you forgot C++20 style:
auto reversed = sourceString | std::views::reverse;
std::string reversedString{begin(reversed), end(reversed)};
Which in the end is not that different from the iterator range style, since the string still needs an iterator pair.
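If you do decide to measure, a rough timing harness could look like the sketch below. The helper names (reversed_by_copy, reversed_by_iterators, time_us) are invented for this example, and a loop like this is only a crude indicator: compiler optimizations, string length and allocator behaviour can all change the picture.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>

// Approach 1: copy, then reverse in place.
std::string reversed_by_copy(const std::string& s) {
    std::string r = s;
    std::reverse(r.begin(), r.end());
    return r;
}

// Approach 2: construct directly from reverse iterators.
std::string reversed_by_iterators(const std::string& s) {
    return std::string(s.rbegin(), s.rend());
}

// Calls f(s) `rounds` times and returns the elapsed time in microseconds.
template <typename F>
long long time_us(F f, const std::string& s, int rounds) {
    std::size_t sink = 0;  // accumulate something so the calls are not unused
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) sink += f(s).size();
    auto stop = std::chrono::steady_clock::now();
    assert(sink == s.size() * static_cast<std::size_t>(rounds));
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling time_us(reversed_by_copy, input, 100000) and time_us(reversed_by_iterators, input, 100000) on strings of the size you actually handle gives a first impression; a dedicated benchmarking library is the better tool if the difference matters.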
Like others have said, you should first decide what matters more for your code base: style or speed. If style, just use std::reverse, which runs in O(n). If speed is a bottleneck and you reverse strings all the time, I would consider storing the characters in a doubly-linked list. Reversing then comes down to traversing from the tail instead of the head, which is O(1) to set up, although materializing an actual reversed copy is still O(n).

UTF8 String: simple idea to make index() lookup O(1)

Background
I am using the UTF8-CPP class. The vast majority of my strings use the ASCII character set (0-127). The problem with UTF8-based strings is that the index function (i.e. to retrieve the character at a specific position) is slow.
Idea
A simple technique is to use a flag as a property which basically says whether the string is pure ASCII or not (isAscii). This flag would be updated whenever the string is modified.
This solution seems too simple, and there may be things I am overlooking. But, if this solution is viable, does it not provide the best of both worlds (i.e. Unicode when needed and performance for the vast majority of cases), and would it not guarantee O(1) for index lookups?
UPDATE
I'm going to attach a diagram to clarify what I mean. I think a lot of people are misunderstanding what I mean (or I am misunderstanding basic concepts).
All good replies.
I think the point here is that while the vast majority of your strings are ASCII, in general, the designer of a UTF-8 library should expect general UTF-8 strings. And there, checking and setting this flag is an unnecessary overhead.
In your case, it might be worth the effort to wrap or modify the UTF8 class accordingly. But before you do that, ask your favorite profiler if it's worth it.
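A minimal sketch of what such a wrapper could look like, written as a standalone class rather than a change to UTF8-CPP. The class name FlaggedUtf8 and its members are hypothetical, and the hand-written decoder assumes the bytes are already valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical wrapper (not part of UTF8-CPP). Assumes valid UTF-8 input.
class FlaggedUtf8 {
public:
    explicit FlaggedUtf8(std::string s) : bytes_(std::move(s)) { refresh(); }

    void append(const std::string& s) {
        bytes_ += s;
        refresh();  // every modification has to keep the flag up to date
    }

    bool is_ascii() const { return ascii_; }

    // Code point at code-point index i (i must be in range).
    char32_t at(std::size_t i) const {
        if (ascii_) {                       // O(1): one byte per code point
            assert(i < bytes_.size());
            return static_cast<unsigned char>(bytes_[i]);
        }
        std::size_t pos = 0;                // O(n): skip i code points first
        while (i-- > 0) pos += seq_len(static_cast<unsigned char>(bytes_[pos]));
        return decode(pos);
    }

private:
    // Byte length of the sequence that starts with this lead byte.
    static std::size_t seq_len(unsigned char lead) {
        if (lead < 0x80) return 1;          // 0xxxxxxx
        if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
        if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
        return 4;                           // 11110xxx
    }

    char32_t decode(std::size_t pos) const {
        unsigned char lead = static_cast<unsigned char>(bytes_[pos]);
        std::size_t n = seq_len(lead);
        if (n == 1) return lead;
        char32_t cp = lead & (0xFF >> (n + 1));  // payload bits of the lead byte
        for (std::size_t k = 1; k < n; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes_[pos + k]) & 0x3F);
        return cp;
    }

    // Rescans the whole string; O(n) on every modification.
    void refresh() {
        ascii_ = true;
        for (unsigned char c : bytes_)
            if (c >= 0x80) { ascii_ = false; break; }
    }

    std::string bytes_;
    bool ascii_ = true;
};
```

The cost the answer above warns about is visible in refresh(): every modification pays for keeping the flag correct, which is why profiling on real data should come first.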
"It depends" on your needs for thread safety and updates, and the length of your strings, and how many you've got. In other words, only profiling your idea in your real application will tell you if it makes things better or worse.
If you want to speed up the UTF8 case...
First, consider sequential indexing of code points, thus avoiding counting them from the very beginning of the string again and again. Implement and use routines to index the next and the previous code points.
Second, you may build an array of indices into the UTF8 string's code points and use it as the first step while searching, it will give you an approximate location of the sought code point.
You may either have it (the array) of a fixed size, in which case you will still get search time ~ O(n) with O(1) memory cost, or have it contain equally-spaced indices (that is, indices into every m'th code point, where m is some constant), in which case you will get search time ~ O(m+log(n)) with O(n) memory cost.
You could also embed indices inside the code point data encoding them as reserved/unused/etc code points or use invalid encoding (say, first byte being 11111110 binary, then, for example, 6 10xxxxxx bytes containing the index, or whatever you like).
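The equally-spaced variant of the index array can be sketched as follows. SampledIndex and seq_len are names invented for this example, and valid UTF-8 is assumed. Because the samples are evenly spaced, this sketch finds the right sample by division and then walks at most step - 1 code points, so it does not need a search over the array.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Byte length of the UTF-8 sequence starting with `lead` (valid input assumed).
inline std::size_t seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;          // 0xxxxxxx
    if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
    if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
    return 4;                           // 11110xxx
}

// Hypothetical helper: remembers the byte offset of every `step`-th code point.
struct SampledIndex {
    std::size_t step;
    std::vector<std::size_t> offsets;  // offsets[k] = byte offset of code point k * step

    SampledIndex(const std::string& s, std::size_t step_) : step(step_) {
        assert(step > 0);
        std::size_t cp = 0;
        for (std::size_t pos = 0; pos < s.size();
             pos += seq_len(static_cast<unsigned char>(s[pos])), ++cp) {
            if (cp % step == 0) offsets.push_back(pos);
        }
    }

    // Byte offset of code point i: jump to the nearest earlier sample,
    // then walk forward over fewer than `step` code points.
    std::size_t byte_offset(const std::string& s, std::size_t i) const {
        std::size_t pos = offsets.at(i / step);
        for (std::size_t k = i % step; k > 0; --k)
            pos += seq_len(static_cast<unsigned char>(s[pos]));
        return pos;
    }
};
```

The index has to be rebuilt (or patched) whenever the string changes, and its memory cost grows with the string length divided by step.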

How fast is the code

I'm developing game. I store my game-objects in this map:
std::map<std::string, Object*> mObjects;
std::string is the key/name of an object, used to find it later in the code. It's very easy to refer to objects, like: mObjects["Player"] = .... But I'm afraid it's too slow due to the allocation of a std::string for each search in that map. So I decided to use int as the key in that map.
The first question: would that really be faster?
And the second: I don't want to give up my current way of accessing objects, so I found a way: compute the CRC of the string and store it as the key. For example:
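Object *getObject(const std::string &key) // const reference, so a literal like "Player" can be passed
{
    int checksum = GetCrc32(key); // GetCrc32 is my own CRC helper
    return mObjects[checksum];    // mObjects would now be keyed by int
}
Object *temp = getObject("Player");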
Or is this a bad idea? For calculating the CRC I would use boost::crc. Is calculating the checksum much slower than searching in a map with key type std::string?
Calculating a CRC is sure to be slower than any single comparison of strings, but you can expect to do about log2(N) comparisons before finding the key (e.g. 10 comparisons for 1000 keys), so it depends on the size of your map. CRC can also result in collisions, so it's error prone (you could detect collisions relatively easily, and possibly even handle them to get correct results anyway, but you'd have to be very careful to get it right).
You could try an unordered_map<> (possibly called hash_map) if your C++ environment provides one - it may or may not be faster but won't be sorted if you iterate. Hash maps are yet another compromise:
the time to hash is probably similar to the time for your CRC, but
afterwards they can often seek directly to the value instead of having to do the binary-tree walk in a normal map
they prebundle a bit of logic to handle collisions.
(Silly point, but if you can continue to use ints and they can be contiguous, then do remember that you can replace the lookup with an array which is much faster. If the integers aren't actually contiguous, but aren't particularly sparse, you could use a sparse index e.g. array of 10000 short ints that are indices into 1000 packed records).
Bottom line is if you care enough to ask, you should implement these alternatives and benchmark them to see which really works best with your particular application, and if they really make any tangible difference. Any of them can be best in certain circumstances, and if you don't care enough to seriously compare them then it clearly means any of them will do.
For the actual performance you need to profile the code and see. But I would be tempted to use hash_map. Although it's not part of the C++ standard library, most of the popular implementations provide it (C++11 standardized the same idea as std::unordered_map). It provides very fast lookup.
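A hash-map version of the lookup might look like the sketch below, assuming a C++11 compiler with std::unordered_map. Object is a placeholder for the game's real type, and ObjectRegistry is a hypothetical wrapper name. Note that it uses find() rather than operator[], because operator[] inserts a null entry for every name that is not found.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

struct Object { int id; };  // placeholder for the game's real Object type

// Hypothetical registry keeping the string keys, but hashed instead of sorted.
class ObjectRegistry {
public:
    void add(const std::string& name, Object* obj) { objects_[name] = obj; }

    // Returns nullptr when the name is unknown, without modifying the map.
    Object* get(const std::string& name) const {
        auto it = objects_.find(name);
        return it == objects_.end() ? nullptr : it->second;
    }

    std::size_t size() const { return objects_.size(); }

private:
    std::unordered_map<std::string, Object*> objects_;
};
```

This keeps the convenient string names and leaves collision handling to the container, so no hand-rolled CRC key is needed. Whether it beats std::map for a given number of objects is still something to benchmark.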
The first question: would that really be faster?
yes - you're comparing an int several times, vs comparing a potentially large map of strings of arbitrary length several times.
checksum: Or is this a bad idea?
it's definitely not guaranteed to be unique. it's a bug waiting to bite.
what i'd do:
use multiple collections and embrace type safety:
// perhaps this simplifies things enough that t_player_id can be an int?
std::map<t_player_id, t_player> d_players;
std::map<t_ghoul_id, t_ghoul> d_ghouls;
std::map<t_carrot_id, t_carrot> d_carrots;
faster searches, more type safety. smaller collections. smaller allocations/resizes.... and on and on... if your app is very trivial, then this won't matter. use this approach going forward, and adjust after profiling/as needed for existing programs.
good luck
If you really want to know, you have to profile your code and see how long the function getObject takes. Personally I use valgrind and KCachegrind to profile and render data on UNIX systems.
I think using id would be faster. It's faster to compare int than string so...

data structure for storing array of strings in a memory

I'm considering which data structure to use for storing a large array of strings in memory. Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running. The crucial point is that the search procedure should be as fast as possible. Saving memory is not important. I am inclined to use the hash_set structure from the standard library, which allows searching for elements in about constant time. But it's not guaranteed that this time will be short. Can anyone suggest a better standard solution?
Many thanks!
Try a Prefix Tree
A Trie is better than a Binary Search Tree for searching elements. Compared against a hash table, you could see this question
If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.
There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running.
If the strings are immutable (so insertion/deletion is "infrequent", so to speak), another option is to build a Directed Acyclic Word Graph or a Compact Directed Acyclic Word Graph that might* be faster than a hash table and has a better worst case guarantee.
* Standard disclaimer applies: depending on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).
A hash_set with a suitable number of buckets would be ideal; alternatively, a vector with the strings in dictionary order, searched using binary search, would be great too.
The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).
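The atom idea can be sketched with a standard hash set, assuming C++11. AtomTable and Atom are names invented for this example. It relies on the fact that pointers to elements of a std::unordered_set stay valid when the table rehashes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical atom table: each distinct string is stored exactly once.
class AtomTable {
public:
    typedef const std::string* Atom;

    // Returns the unique pointer for s, inserting s on first sight.
    Atom intern(const std::string& s) { return &*table_.insert(s).first; }

    std::size_t size() const { return table_.size(); }

private:
    std::unordered_set<std::string> table_;
};
```

After interning, comparing two atoms is a plain pointer comparison, and the string table is only consulted once per incoming string.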
Your best bet would be as follows:
Building your structure:
Insert all your strings (char*s) into an array.
Sort the array lexicographically.
Lookup
Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (Will search in a space of ~4 billion strings with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated, and slower than they appear (especially if you have long strings).
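The recipe above fits in a few lines with the standard algorithms. This sketch stores std::string instead of char* for simplicity, which is an assumption that trades away some of the cache locality mentioned above. SortedStringSet is a made-up name.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical read-only set: sorted once at startup, then only searched.
class SortedStringSet {
public:
    explicit SortedStringSet(std::vector<std::string> strings)
        : data_(std::move(strings)) {
        std::sort(data_.begin(), data_.end());  // lexicographic order
    }

    // O(log n) string comparisons per lookup.
    bool contains(const std::string& key) const {
        return std::binary_search(data_.begin(), data_.end(), key);
    }

private:
    std::vector<std::string> data_;
};
```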
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!
Well, assuming you truly want an array and not an associative container as you've mentioned, the allocation strategy mentioned in Raymond Chen's blog would be efficient.

Best Data Structure for Genetic Algorithm in C++?

I need to implement a genetic algorithm customized for my problem (college project), and the first version had it coded as a matrix of short (bits per chromosome x size of population).
That was a bad design, since I am declaring a short but only using the "0" and "1" values... but it was just a prototype and it worked as intended, and now it is time for me to develop a new, improved version. Performance is important here, but simplicity is also appreciated.
I researched around and came up with:
for the chromosome :
- String class (like "0100100010")
- Array of bool
- Vector (vector appears to be optimized for bool)
- Bitset (sounds the most natural one)
and for the population:
- C Array[]
- Vector
- Queue
I am inclined to pick vector for the chromosome and array for the population, but I would like the opinion of anyone with experience on the subject.
Thanks in advance!
I'm guessing you want random access to the population and to the genes. You say performance is important, which I interpret as execution speed. So you're probably best off using a vector<> to hold the chromosomes and a vector<char> to hold the genes of each chromosome.
The reason for vector<char> is that bitset<> and vector<bool> are optimized for memory consumption, and accessing individual bits in them is therefore typically slower. vector<char> will give you higher speed at the cost of 8x the memory (assuming an 8-bit char). So if you want speed, go with vector<char>. If memory consumption is paramount, then use vector<bool> or bitset<>.
bitset<> would seem like a natural choice here. However, bear in mind that it is templated on the number of bits, which means that a) the number of genes must be fixed and known at compile time (which I would guess is a big no-no), and b) if you use different sizes, you end up with one copy per bitset size of each of the bitset methods you use (though inlining might negate this), i.e., code bloat. Overall, I would guess vector<bool> is better for you if you don't want vector<char>.
If you're concerned about the aesthetics of vector<char> you could typedef char gene; and then use vector<gene>, which looks more natural.
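Put together, the suggested layout could look like the sketch below. The crossover and mutate helpers are illustrative additions, not something the question asked for; they are only there to show how the containers are used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef char gene;                        // one byte per bit, as suggested above
typedef std::vector<gene> Chromosome;     // genes of one individual
typedef std::vector<Chromosome> Population;

// Single-point crossover: the child takes a[0..cut) followed by b[cut..end).
Chromosome crossover(const Chromosome& a, const Chromosome& b, std::size_t cut) {
    assert(a.size() == b.size() && cut <= a.size());
    Chromosome child(a.begin(), a.begin() + cut);
    child.insert(child.end(), b.begin() + cut, b.end());
    return child;
}

// Flips the gene at position i (0 becomes 1, anything else becomes 0).
void mutate(Chromosome& c, std::size_t i) { c.at(i) = !c.at(i); }
```

Swapping gene to bool later turns Chromosome into vector<bool> without touching the rest of the code, which makes it easy to compare the speed and memory trade-off.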
A string is just like a vector<char> but more cumbersome.
Specifically to answer your question: I am not exactly sure what you are suggesting. You talk about Array and string class. Are you talking about the STL container classes, where you can have a queue, bitset, vector, linked list etc.? I would suggest a vector for your population (closest thing to a C array there is) and a bitset for your chromosome if you are worried about memory capacity. Else, as you are already doing, use a vector of the string representation of your DNA ("10110110").
For ideas and a good tool to dabble. Recommend you download and initially use this library. It works with the major compilers. Works on unix variants. Has all the source code.
All the framework stuff is done for you and you will learn a lot. Later on you could write your own code from scratch or inherit from these classes. You can also use them in commercial code if you want.
Because they are objects you can change the representation of your DNA easily from integers to reals to structures to trees to bit arrays etc.
There is always a learning curve involved but it is worth it.
I use it to generate thousands of neural nets then weed them out with a simple fitness function then run them for real.
galib
http://lancet.mit.edu/ga/
Assuming that you want to code this yourself (if you want an external library, kingchris seems to have a good one there), it really depends on what kind of manipulation you need to do. To get the most bang for your buck in terms of memory, you could use any integer type and set/manipulate individual bits via bitmasks etc. Now this approach is likely not optimal in terms of ease of use... The string example above would work ok, however again it's not significantly different from the shorts; here you are just representing either '0' or '1' with an 8 bit value as opposed to a 16 bit value. Also, again depending on the manipulation, the string case will probably prove unwieldy. So if you could give some more info on the algorithm we could maybe give more feedback. Myself I like the individual bits as part of an integer (a bitset), but if you aren't used to masks, shifts, and all that good stuff it may not be right for you.
I suggest writing a class for each member of population, that simplifies things considerably, since you can keep all your member relevant functions in the same place nicely wrapped with the actual data.
If you need a "array of bools" I suggest using an int or several ints (then use mask and bit wise operations to access (modify / flip) each bit) depending on number of your chromosomes.
I usually used some sort of collection class for the population, because just an array of population members doesn't allow you to simply add to your population. I would suggest implementing some sort of dynamic list (if you are familiar with ArrayList then that is a good example).
I had major success with genetic algorithms with the recipe above. If you prepare your member class properly it can really simplify things and allows you to focus on coding better genetic algorithms instead of worrying about your data structures.
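For reference, the mask and bitwise operations mentioned above look like this when a chromosome of up to 64 genes is packed into a single integer. The 64-gene limit, the PackedChromosome typedef and the helper names are assumptions of this sketch; longer chromosomes would need several such integers.

```cpp
#include <cassert>
#include <cstdint>

// Up to 64 genes packed into one integer (an assumption for this sketch).
typedef std::uint64_t PackedChromosome;

// Reads gene i.
inline bool get_bit(PackedChromosome c, unsigned i) {
    assert(i < 64);
    return ((c >> i) & 1u) != 0;
}

// Returns c with gene i set to 1.
inline PackedChromosome set_bit(PackedChromosome c, unsigned i) {
    assert(i < 64);
    return c | (PackedChromosome(1) << i);
}

// Returns c with gene i set to 0.
inline PackedChromosome clear_bit(PackedChromosome c, unsigned i) {
    assert(i < 64);
    return c & ~(PackedChromosome(1) << i);
}

// Returns c with gene i inverted (a point mutation).
inline PackedChromosome flip_bit(PackedChromosome c, unsigned i) {
    assert(i < 64);
    return c ^ (PackedChromosome(1) << i);
}
```

Wrapping these in the member class suggested above keeps the bit fiddling in one place, so the rest of the genetic algorithm never has to deal with masks directly.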