UTF8 String: simple idea to make index() lookup O(1) - c++

Background
I am using the UTF8-CPP class. The vast majority of my strings use the ASCII character set (0-127). The problem with UTF8-based strings is that the index function (i.e. retrieving the character at a specific position) is slow.
Idea
A simple technique is to use a flag as a property which basically says whether the string is pure ASCII or not (isAscii). This flag would be updated whenever the string is modified.
This solution seems too simple, and there may be things I am overlooking. But if this solution is viable, does it not provide the best of both worlds (i.e. Unicode when needed and performance for the vast majority of cases), and would it not guarantee O(1) for index lookups?
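A minimal sketch of what I have in mind, assuming a hypothetical wrapper around a UTF-8 byte buffer (the class and member names are purely illustrative); when the flag says the string is not pure ASCII, it falls back to the usual linear code point walk:

#include <cstddef>
#include <string>

// Hypothetical wrapper illustrating the isAscii flag idea.
class FlaggedString {
    std::string data_;      // UTF-8 encoded bytes
    bool isAscii_ = true;   // updated on every modification

public:
    void append(const std::string& utf8) {
        for (unsigned char c : utf8)
            if (c > 0x7F) { isAscii_ = false; break; }
        data_ += utf8;
    }

    // Byte offset of the i-th code point:
    // O(1) when the string is pure ASCII, O(n) otherwise.
    std::size_t offsetOfCodePoint(std::size_t i) const {
        if (isAscii_) return i;
        std::size_t offset = 0;
        for (std::size_t count = 0; count < i && offset < data_.size(); ++count) {
            ++offset;   // step past the lead byte
            while (offset < data_.size() &&
                   (static_cast<unsigned char>(data_[offset]) & 0xC0) == 0x80)
                ++offset;   // skip continuation bytes (10xxxxxx)
        }
        return offset;
    }
};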
UPDATE
I'm going to attach a diagram to clarify what I mean. I think a lot of people are misunderstanding what I mean (or I am misunderstanding basic concepts).
All good replies.

I think the point here is that while the vast majority of your strings are ASCII, in general the designer of a UTF-8 library should expect general UTF-8 strings, and there, checking and setting this flag is unnecessary overhead.
In your case, it might be worth the effort to wrap or modify the UTF8 class accordingly. But before you do that, ask your favorite profiler if it's worth it.

"It depends" on your needs for thread safety and updates, and the length of your strings, and how many you've got. In other words, only profiling your idea in your real application will tell you if it makes things better or worse.

If you want to speed up the UTF8 case...
First, consider sequential indexing of code points, thus avoiding counting them from the very beginning of the string again and again. Implement and use routines to index the next and the previous code points.
Second, you may build an array of indices into the UTF8 string's code points and use it as the first step of a search; it will give you an approximate location of the sought code point.
You may either keep this array at a fixed size, in which case you will still get search time ~O(n) with O(1) memory cost, or have it contain equally spaced indices (that is, indices to every m-th code point, where m is some constant), in which case you will get search time ~O(m + log(n)) with O(n) memory cost.
You could also embed the indices inside the code point data, encoding them as reserved/unused/etc. code points, or use an invalid encoding (say, the first byte being 11111110 binary, followed by, for example, six 10xxxxxx bytes containing the index, or whatever you like).
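A rough sketch of the equally spaced index idea (every m-th code point), with illustrative helper names; a lookup jumps to the nearest stored byte offset and then walks at most m-1 code points forward:

#include <cstddef>
#include <string>
#include <vector>

// Record the byte offset of every m-th code point of a UTF-8 string.
std::vector<std::size_t> buildIndex(const std::string& s, std::size_t m) {
    std::vector<std::size_t> index;
    std::size_t cp = 0, i = 0;
    while (i < s.size()) {
        if (cp % m == 0) index.push_back(i);
        ++i;   // lead byte
        while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;   // continuation bytes
        ++cp;
    }
    return index;
}

// Byte offset of the n-th code point (n must be less than the total count):
// jump to the stored offset, then walk n % m code points.
std::size_t offsetOf(const std::string& s, const std::vector<std::size_t>& index,
                     std::size_t n, std::size_t m) {
    std::size_t i = index[n / m];
    for (std::size_t walked = 0; walked < n % m; ++walked) {
        ++i;
        while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;
    }
    return i;
}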

Related

Building a set of unique lines in a file efficiently, without storing actual lines in the set

Recently I was trying to solve the following issue:
I have a very large file, containing long lines, and I need to find and print out all the unique lines in it.
I don't want to use a map or set storing the actual lines, as the file is very big and the lines are long, so this would result in O(N) space complexity with poor constants (where N is the number of lines). Preferably, I would rather generate a set storing pointers to the lines in the file that are unique. Clearly, the size of such a pointer (8 bytes on a 64-bit machine, I believe) is generally much smaller than the size of a line (1 byte per character, I believe) in memory. Although the space complexity is still O(N), the constants are much better now. Using this implementation, the file never needs to be fully loaded in memory.
Now, let's say I go through the file line by line, checking for uniqueness. To see if a line is already in the set, I could compare it to all lines pointed to by the set so far, character by character. This gives O(N^2*L) complexity, with L the average length of a line. When not caring about storing the full lines in the set, O(N*L) complexity can be achieved, thanks to hashing. Now, when using a set of pointers instead (to reduce space requirements), how can I still achieve this? Is there a neat way to do it? The only thing I can come up with is this approach:
Hash the line. Store the hash value in a map (or actually an unordered_multimap: unordered to get the hash-map style, multi because duplicate keys may be inserted in case of 'false matches').
For every new line: check if its hash value is already in the map. If not, add it. If yes, compare the full lines (the new one and the one in the unordered map with the same hash) character by character, to make sure there is no 'false match'. If it is a 'false match', still add it.
Is this the right way? Or is there a nicer way you could do it? All suggestions are welcome!
And can I use some clever 'comparison object' (or something like that, I don't know much about that yet) to make this checking for already-existing lines fully automated on every unordered_map::find() call?
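A minimal sketch of the approach described above, assuming the hash-to-offset idea with a second file handle used to re-read candidate lines on a hash match (the file name "input.txt" is just illustrative):

#include <fstream>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    const char* path = "input.txt";            // illustrative file name
    std::ifstream in(path);
    std::ifstream check(path);                 // second handle for verification reads
    std::unordered_multimap<std::size_t, std::streampos> seen;   // hash -> line offset
    std::hash<std::string> hasher;

    std::string line;
    std::streampos pos = in.tellg();
    while (std::getline(in, line)) {
        const std::size_t h = hasher(line);
        bool unique = true;
        auto range = seen.equal_range(h);
        for (auto it = range.first; it != range.second; ++it) {
            // Re-read the previously seen line at that offset to rule out a 'false match'.
            check.clear();
            check.seekg(it->second);
            std::string other;
            std::getline(check, other);
            if (other == line) { unique = false; break; }
        }
        if (unique) {
            std::cout << line << '\n';
            seen.emplace(h, pos);
        }
        pos = in.tellg();
    }
}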
Your solution looks fine to me since you are storing O(unique lines) hashes not N, so that's a lower bound.
Since you scan the file line by line you might as well sort the file. Now duplicate lines will be contiguous and you need only check against the hash of the previous line. This approach uses O(1) space but you've got to sort the file first.
As @saadtaame's answer says, your space is actually O(unique lines) - depending on your use case, this may be acceptable or not.
While hashing certainly has its merits, it can conceivably have many problems with collisions - and if you can't have false positives, then it is a no-go unless you actually keep the contents of the lines around for checking.
The solution you describe is to maintain a hash-based set. That is obviously the most straightforward thing to do, and yes, it would require maintaining all the unique lines in memory. That may or may not be a problem, though. That solution would also be the easiest to implement -- what you are trying to do is exactly what any implementation of a (hash-based) set would do. You can just use std::unordered_set, and add every line to the set.
Since we are throwing around ideas, you could also use a trie as a substitute for the set. You would maybe save some space, but it still would be O(unique lines).
If there isn't some special structure in the file you can leverage, then definitely go for hashing the lines. This will be faster - by orders of magnitude - than actually comparing each line against every other line in the file.
If your actual implementation is still too slow, you can e.g. limit the hashing to the first portion of each line. This will produce more false positives, but assuming that most lines already deviate in the first few words, it will significantly speed up the file processing (especially if you are I/O-bound).
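For the prefix idea, a small hedged sketch (the cutoff value is arbitrary); the full line comparison still resolves the extra false positives, so correctness is unaffected:

#include <cstddef>
#include <functional>
#include <string>

// Hash only the first kPrefix bytes of a line; lines shorter than the
// cutoff are hashed in full (substr clamps the length).
constexpr std::size_t kPrefix = 64;   // illustrative cutoff

std::size_t prefixHash(const std::string& line) {
    return std::hash<std::string>{}(line.substr(0, kPrefix));
}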

What is the most efficient way to tell if a string of unicode characters has unique characters in C++?

If it were just ASCII characters, I would just use an array of bool of size 256. But Unicode has so many characters.
1. Wikipedia says Unicode has more than 110,000 characters. So bool[110000] might not be a good idea?
2. Let's say the characters are coming in a stream and I just want to stop whenever a duplicate is detected. How do I do this?
3. Since the set is so big, I was thinking of a hash table. But how do I tell when a collision happens? I do not want to continue once a collision is detected. Is there a way to do this in the STL implementation of a hash table?
Efficient in terms of speed and memory utilization.
Your Options
There are a couple of possible solutions:
1. bool[0x110000] (note that this is a hex constant, not a decimal constant as in the question)
2. vector<bool> with a size of 0x110000
3. sorted vector<uint32_t> or list<uint32_t> containing every encountered codepoint
4. map<uint32_t, bool> or unordered_map<uint32_t, bool> containing a mapping of codepoints to whether they have been encountered
5. set<uint32_t> or unordered_set<uint32_t> containing every encountered codepoint
6. A custom container, e.g. a bloom filter which provides high-density probabilistic storage for exactly this kind of problem
Analysis
Now, let's perform a basic analysis of the 6 variants:
1. Requires exactly 0x110000 bytes = 1.0625 MiB, plus whatever overhead a single allocation adds. Both setting and testing are extremely fast.
2. While it would seem that this is pretty much exactly the same solution, it only requires roughly 1/8 of the memory, since it stores the bools in one bit each instead of one byte each. Both setting and testing are extremely fast; performance relative to the first solution may be better or worse, depending on things like CPU cache size, memory performance and of course your test data.
3. While potentially taking the least amount of memory (4 bytes per encountered codepoint, so it will require less memory as long as the input stream contains at most 0x110000 / 8 / 4 = 34816 distinct codepoints), performance for this solution will be abysmal: testing takes O(log(n)) for the vector (binary search) and O(n) for the list (binary search requires random access), while inserting takes O(n) for the vector (all following elements must be moved) and O(1) for the list (assuming that you kept the result of your failed test). This means that a test + insert cycle takes O(n), and therefore your total runtime will be O(n^2)...
4. Without doing a lot of talking about this, it should be obvious that we do not need the bool, but would rather just test for existence, leading to solution 5.
5. Both sets are fairly similar in performance; set is usually implemented with a binary tree and unordered_set with a hash map. This means that both are fairly inefficient memory-wise: both contain additional overhead (non-leaf nodes in trees and the actual tables containing the hashes in hash maps), meaning they will probably take 8-16 bytes per entry. Testing and inserting are O(log(n)) for the set and O(1) amortized for the unordered_set.
To answer the comment, testing whether uint32_t const x is contained in unordered_set<uint32_t> data is done like so: if(data.count(x)) or if(data.find(x) != data.end()).
6. The major drawback here is the significant amount of work that the developer has to invest. Also, the bloom filter that was given as an example is a probabilistic data structure, meaning false positives are possible (false negatives are not in this specific case).
Conclusion
Since your test data is not actual textual data, using a set of either kind is actually very memory inefficient (with high probability). Since this is also the only possibility to achieve better performance than the naive solutions 1 and 2, it will in all probability be significantly slower as well.
Taking into consideration the fact that it is easier to handle and that you seem to be fairly conscious of memory consumption, the final verdict is that vector<bool> seems the most appropriate solution.
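As a sketch of that verdict (the helper name and the stream type are assumptions, not from the question), duplicate detection over a stream of code points with a vector<bool> might look like this:

#include <cstdint>
#include <vector>

// Stop at the first repeated code point; one bit per possible code point.
bool firstDuplicate(const std::vector<uint32_t>& stream, uint32_t& dup) {
    std::vector<bool> seen(0x110000, false);
    for (uint32_t cp : stream) {
        if (cp >= 0x110000) continue;              // ignore invalid code points
        if (seen[cp]) { dup = cp; return true; }   // duplicate found, stop here
        seen[cp] = true;
    }
    return false;
}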
Notes
Unicode is intended to represent text. This means that your test case and any analysis following from this is probably highly flawed.
Also, while it is simple to detect duplicate code points, the idea of a character is far more ambiguous and may require some kind of normalization (e.g. "ä" can be either one codepoint or an "a" and a diacritical mark, or a character could even be based on the Cyrillic "а").

Are std::strings good at random insertion / random erasing?

I want to make a simple text editor using std::strings. If my text is 500,000 characters and I want to insert or remove at the 253,000th character, will this be slow, or will it be just as fast as if my text contained 10 characters? Otherwise I'm not sure what I'll do to fix it (unless I use a linked list, but then reading is slow and it is sort of reinventing the wheel).
Thanks
I've never used it myself, but I believe this is what rope is for.
It will likely be slow since it has to copy the memory. It depends on the internal implementation of your operating system/processor and its memory operations.
In practice, it will probably be "fast enough". However, I'd still write an EditBuffer class encapsulating it, and give this new class an interface tuned to my application. That way, the fact that I'm using std::string, and not something else, becomes an implementation detail of EditBuffer, which can be changed at any time. (You might want to try std::vector as well. And one common optimization is maintaining a hole at the cursor: the text behind the cursor is at the end of the buffer. Advancing the cursor means moving one character, but insertion is normally in constant time.)
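A minimal sketch of the "hole at the cursor" idea mentioned above, here as a two-string variant (class and member names are illustrative): text after the cursor is kept reversed at the end of a second buffer, so insertion and deletion at the cursor are amortized O(1), and moving the cursor shifts one character at a time.

#include <string>

class EditBuffer {
    std::string before_;   // text before the cursor
    std::string after_;    // text after the cursor, stored reversed
public:
    void insert(char c)       { before_.push_back(c); }
    void eraseBeforeCursor()  { if (!before_.empty()) before_.pop_back(); }
    void cursorLeft() {
        if (!before_.empty()) { after_.push_back(before_.back()); before_.pop_back(); }
    }
    void cursorRight() {
        if (!after_.empty()) { before_.push_back(after_.back()); after_.pop_back(); }
    }
    std::string text() const {
        return before_ + std::string(after_.rbegin(), after_.rend());
    }
};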
It's likely to be slow, though whether or not it is slow enough to be an issue is something you'll have to test. One alternate implementation is to use a list of strings, one per line of text.
This is the wrong type.
Insertion (not at the end) on a std::string has a complexity of O(n).
You want a structure that has average complexity for insertion/deletion/modification of O(1).
I.e. the cost of insertion should not be related to the size of the data.
Considering that memory bandwidth is given in GB/s (http://en.wikipedia.org/wiki/DDR3_SDRAM), how long would you estimate copying 256k would take?
I'd seriously consider storing the text not as a single large string, but as individual lines. std::list<std::string> or std::vector<std::string> would seem appropriate. Such an approach would effectively distribute your large string over multiple smaller ones, and reallocations upon modification would only ever affect the individual line, or the array of lines itself. The only tradeoff is choosing between std::vector and std::list, although I'd tend to prefer std::list here.
Another advantage of the line-wise approach is that when reading files, you can easily read line by line with std::getline and won't have to care about read buffers yourself.
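A short sketch of that line-wise reading (the function name is illustrative):

#include <fstream>
#include <string>
#include <vector>

// One std::string per line of the file; edits then only reallocate a single line.
std::vector<std::string> readLines(const std::string& path) {
    std::vector<std::string> lines;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}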

Perfect hash function for a set of integers with no updates

In one of the applications I work on, it is necessary to have a function like this:
bool IsInList(int iTest)
{
    // Return if iTest appears in a set of numbers.
}
The number list is known at app load-up (but is not always the same between two instances of the same application) and will not change (or be added to) throughout the whole run of the program. The integers themselves may be large and have a large range, so it is not efficient to have a vector<bool>. Performance is an issue, as the function sits in a hot spot. I have heard about perfect hashing but could not find any good advice. Any pointers would be helpful. Thanks.
P.S. I'd ideally like the solution not to be a third-party library, because I can't use them here. Something simple enough to be understood and manually implemented would be great, if possible.
I would suggest using Bloom Filters in conjunction with a simple std::map.
Unfortunately the bloom filter is not part of the standard library, so you'll have to implement it yourself. However it turns out to be quite a simple structure!
A Bloom Filter is a data structure specialized in answering one question - is this element part of the set? - and it does so with an incredibly tight memory requirement, and quite fast too.
The slight catch is that the answer is... special. Is this element part of the set?
No
Maybe (with a given probability depending on the properties of the Bloom Filter)
This looks strange until you look at the implementation, and it may require some tuning (there are several properties) to lower the probability but...
What is really interesting for you, is that for all the cases it answers No, you have the guarantee that it isn't part of the set.
As such a Bloom Filter is ideal as a doorman for a Binary Tree or a Hash Map. Carefully tuned it will only let very few false positive pass. For example, gcc uses one.
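Since it isn't in the standard library, here is a minimal, hedged Bloom filter sketch for ints (the hashing scheme and names are illustrative, not a reference implementation); every "maybe" answer must still be confirmed against the exact structure, e.g. a std::map or a sorted vector:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

class BloomFilter {
    std::vector<bool> bits_;
    std::size_t k_;   // number of hash functions
public:
    BloomFilter(std::size_t numBits, std::size_t numHashes)
        : bits_(numBits, false), k_(numHashes) {}

    void insert(int value) {
        for (std::size_t i = 0; i < k_; ++i)
            bits_[hash(value, i) % bits_.size()] = true;
    }

    // false -> definitely not in the set; true -> maybe (confirm against the exact set).
    bool maybeContains(int value) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[hash(value, i) % bits_.size()]) return false;
        return true;
    }
private:
    static std::size_t hash(int value, std::size_t seed) {
        // Mix the seed into the value; std::hash<int> alone is often the identity.
        const std::uint64_t mixed =
            (static_cast<std::uint64_t>(seed) << 32) ^ static_cast<std::uint32_t>(value);
        return std::hash<std::uint64_t>{}(mixed);
    }
};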
What comes to my mind is gperf. However, it is based on strings and not on numbers, though part of the calculation can be tweaked to use numbers as input for the hash generator.
integers, strings, doesn't matter
http://videolectures.net/mit6046jf05_leiserson_lec08/
After the intro, at 49:38, you'll learn how to do this. The dot-product hash function is demonstrated since it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time here; find something that is FAST for your datatype and that offers some adjustable SEED for hashing. A good combo there is better than the alternative of growing the hash table.
At 54:30 the professor draws a picture of a standard way of doing perfect hashing. Perfect minimal hashing is beyond this lecture. (Good luck!)
It really all depends on what you mod by.
Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.
With std::map you get very good performance in 99.9% of scenarios. If your hot spot sees the same iTest value(s) multiple times, combine the map result with a temporary hash cache.
Int is one of the datatypes where it is possible to just do:
bool hash[UINT_MAX]; // stackoverflow ;)
And fill it up. If you don't care about negative numbers, then it's twice as easy.
A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.
The most obvious and easy to implement solution for testing existence would be a sorted list or balanced binary tree. Then you could decide existence in log(N) time. I doubt it'll get much better than that.
For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.
Wikipedia has example implementations that should be simple enough to translate to C++.
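A sketch of that approach for the IsInList function from the question (the global and the init function are illustrative): sort once at load-up, then answer queries with std::binary_search.

#include <algorithm>
#include <utility>
#include <vector>

static std::vector<int> g_numbers;   // filled once at app load-up

void InitList(std::vector<int> numbers) {
    std::sort(numbers.begin(), numbers.end());
    g_numbers = std::move(numbers);
}

bool IsInList(int iTest) {
    // O(log n) membership test on the sorted, immutable list.
    return std::binary_search(g_numbers.begin(), g_numbers.end(), iTest);
}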
It's not necessary or practical to aim for mapping N distinct, randomly dispersed integers to N contiguous buckets - i.e. a perfect minimal hash - the important thing is to identify an acceptable ratio. To do this at run time, you can start by configuring a worst-acceptable ratio (say 1 to 20) and a no-point-being-better-than-this ratio (say 1 to 4), then randomly vary a fast-to-calculate hash algorithm (e.g. by changing the prime numbers used) to see how easily you can meet increasingly difficult ratios. For the worst-acceptable ratio you don't time out, or you fall back on something slower but reliable (a container, or displacement lists to resolve collisions). Then allow a second or ten (configurable) for each X% improvement until you can't succeed at that ratio or you reach the no-point-being-better ratio....
Just so everyone's clear, this works for inputs only known at run time, with no useful patterns known beforehand, which is why different hash functions have to be trialed or actively derived at run time. It is not acceptable to simply say "integer inputs form a hash", because there are collisions when %-ed into any sane array size. But you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers to a packed array, so there's little memory wasted for large objects.
Original Question
After working with it for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in a unique - i.e. perfect - hash.
Let's say the values ranged from L to H in the array. This yields a Range R = H - L + 1.
Generally it was pretty big.
I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.
In your case you are using integers. Technically, they are already hashed, but the range is large.
It may be that you can get what you want, simply by applying the modulus operator.
It may be that you need to put a hash function in front of it first.
It also may be that you can't find a perfect hash for it, in which case your container class should have a fallback position... binary search, or map, or something like that, so that you can guarantee that the container will work in all cases.
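A hedged sketch of the modulus search described above; this variant searches upward from the set size for the smallest table size that maps every value to a distinct slot (names are illustrative, and the caller decides when a returned size is acceptable):

#include <cstddef>
#include <unordered_set>
#include <vector>

std::size_t findCollisionFreeModulus(const std::vector<int>& values) {
    if (values.empty()) return 1;
    for (std::size_t m = values.size(); ; ++m) {
        std::unordered_set<std::size_t> slots;
        bool ok = true;
        for (int v : values) {
            const std::size_t slot =
                static_cast<std::size_t>(static_cast<unsigned>(v)) % m;
            if (!slots.insert(slot).second) { ok = false; break; }   // collision at this m
        }
        if (ok) return m;   // every value maps to a distinct slot of a table of size m
    }
}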
A trie or perhaps a van Emde Boas tree might be a better bet for creating a space-efficient set of integers, with lookup time being constant in the number of objects in the data structure, assuming that even std::bitset would be too large.

String search algorithm used by string::find() c++

How is it faster than the C string functions? Is similar source available for C?
There's no single standard implementation of the C++ Standard Library, but you should be able to take a look at the implementation shipped with your compiler and see how it works yourself.
In general, most STL functions are not faster than their C counterparts, though. They're usually safer, more generalized and designed to accommodate a much broader range of circumstances than the narrow-purpose C equivalents.
A standard optimization with any string class is to store the string length along with the string, which makes any string operation that requires the length to be known O(1) instead of O(n), strlen() being the obvious one.
Or copying a string: there's no gain in the actual copy, but figuring out how much memory to allocate before the copy is O(1). The overall algorithm is still O(n). The basic operation is still the same; shoveling bytes takes just as long in any language.
String classes are useful because they are safer (harder to shoot your foot) and easier to use (require less explicit code). They became popular and widely used because they weren't slower.
The string class almost certainly stores far more data about the string than you'd find in a C string. Length is a good example. In tradeoff for the extra memory use, you will gain some spare CPU cycles.
Edit:
However, it's unlikely that one is substantially slower than the other, since they'll perform fundamentally the same actions. MSDN suggests that string::find() doesn't use a functor-based system, so they won't have that optimization.
There are many possibilities for implementing a string search. The easiest way is to check at every position of the destination string whether the search string starts there. You can code that very quickly, but it's the slowest possibility (O(m*n), with m the length of the search string and n the length of the destination string).
Take a look at the Wikipedia page, http://en.wikipedia.org/wiki/String_searching_algorithm, where different options are presented.
The fastest way is to create a finite state machine, and then you can scan the string without going backwards. That is then just O(n).
Which algorithm the STL actually uses, I don't know. But you could look up the source code and compare it with those algorithms.
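For reference, a sketch of the naive approach described in the first paragraph above (the function name is illustrative); the algorithms on the Wikipedia page improve on this worst case:

#include <cstddef>
#include <string>

std::size_t naiveFind(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) return 0;
    if (needle.size() > haystack.size()) return std::string::npos;
    // Try every position in the destination string (O(m*n) worst case).
    for (std::size_t i = 0; i + needle.size() <= haystack.size(); ++i) {
        std::size_t j = 0;
        while (j < needle.size() && haystack[i + j] == needle[j]) ++j;
        if (j == needle.size()) return i;   // full match at position i
    }
    return std::string::npos;
}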