Are std::strings good at random insertion / random erasing? - c++

I want to make a simple text editor using std::string. If my text is 500,000 characters and I want to insert or remove at the 253,000th character, will this be slow, or will it be just as fast as if my text contained 10 characters? Otherwise I'm not sure what I'd do to fix it (unless I use a linked list, but then reading is slow and it's sort of reinventing the wheel).
Thanks

I've never used it myself, but I believe this is what rope is for.

It will likely be slow, since inserting or erasing in the middle has to shift (copy) everything after that position. How slow depends on the string implementation and on how quickly your processor and memory subsystem can move that much data.

In practice, it will probably be "fast enough". However, I'd still write an EditBuffer class encapsulating it, and give this new class an interface tuned to my application. That way, the fact that I'm using std::string, and not something else, becomes an implementation detail of EditBuffer, which can be changed at any time. (You might want to try std::vector as well. And one common optimization is maintaining a hole at the cursor: the text behind the cursor is kept at the end of the buffer. Advancing the cursor means moving one character, but insertion is normally constant time.)
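A minimal sketch of that hole-at-the-cursor idea, assuming a hypothetical EditBuffer interface (one common way to realise the gap is with two strings: text left of the cursor in one, text right of the cursor kept reversed in the other):

#include <string>

// Gap-buffer sketch: typing and deleting at the cursor are O(1) amortised,
// because they only touch the ends of the two internal strings.
class EditBuffer {
public:
    void insert(char c) { before_.push_back(c); }
    void backspace()    { if (!before_.empty()) before_.pop_back(); }

    void cursorLeft() {                       // move one character across the gap
        if (!before_.empty()) { after_.push_back(before_.back()); before_.pop_back(); }
    }
    void cursorRight() {
        if (!after_.empty()) { before_.push_back(after_.back()); after_.pop_back(); }
    }

    std::string text() const {                // materialise for display or saving
        return before_ + std::string(after_.rbegin(), after_.rend());
    }

private:
    std::string before_;   // characters left of the cursor
    std::string after_;    // characters right of the cursor, stored reversed
};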

It's likely to be slow, though whether or not it is slow enough to be an issue is something you'll have to test. One alternate implementation is to use a list of strings, one per line of text.

This is the wrong type.
Insertion (not at the end) on a std::string has a complexity of O(n).
You want a structure that has average complexity for insertion/deletion/modification of O(1).
i.e. the cost of insertion should not depend on the size of the data.

Considering that memory bandwidth is measured in GB/s (http://en.wikipedia.org/wiki/DDR3_SDRAM), how long would you estimate copying 256 KB would take?

I'd seriously consider storing the text not as a single large string, but as individual lines: std::list<std::string> or std::vector<std::string> would seem appropriate. Such an approach effectively distributes your large string over multiple smaller ones, and reallocations upon modification only ever affect the individual line, or the array of lines itself. The only tradeoff you'd have to choose is between std::vector and std::list, although I'd tend to prefer std::list here.
Another advantage of the line-wise approach is when reading files: you can easily read line by line with std::getline and won't have to manage read buffers yourself.
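As a rough illustration of the line-wise approach (loadLines is just a made-up helper name):

#include <fstream>
#include <string>
#include <vector>

// Read a file into one std::string per line.
std::vector<std::string> loadLines(const std::string& path) {
    std::vector<std::string> lines;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

// Editing around character 253,000 now only touches one short line:
//   lines[row].insert(col, 1, 'x');
//   lines[row].erase(col, 1);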


Map of vector of struct vs Vector of struct

I am making a small project program that involves inputting quotes that would be later saved into a database (in this case a .txt file). There are also commands that the user would input such as list (which shows the quote by author) and random (which displays a random quote).
Here's the structure if I would use a map (with the author string as the key):
struct Information {
    string quoteContent;
    vector<string> tags;
};
and here's the structure if I would use the vector instead:
struct Information {
    string author;
    string quoteContent;
    vector<string> tags;
};
Note: the largest number of quotes I've had in the database is 200 (imported from a file).
I was just wondering which data structure would yield better performance. I'm still pretty new to this c++ thing, so any help would be appreciated!
For your data volumes it obviously doesn't matter from a performance perspective, but std::multimap will likely let you write shorter, more comprehensible and maintainable code. Regarding the general performance of vectors vs. maps (good to know about, but likely only relevant with millions of data elements or low-latency requirements)...
A vector doesn't do any automatic sorting for you, so you'd probably push_back quotes as you read them, then do one std::sort once the data's loaded, after which you can find elements very quickly by author with std::binary_search or std::lower_bound, or identify insertion positions for new quotes using std::lower_bound. But if you want to insert a new quote thereafter, you have to move the existing vector elements from that position onward out of the way to make room, which is relatively slow. As you're just doing a few ad-hoc insertions based on user input, the time to do that with only a few hundred quotes in the vector will be totally insignificant.
For the purposes of learning programming, though, it's good to understand that a multimap is arranged as a kind of branching binary tree, with pointers linking the data elements, which allows relatively quick insertion (and deletion). For some applications, following all those pointers around can be more expensive (i.e. slower) than a vector's contiguous memory, which works better with the CPU cache; but in your case the data elements are all strings and vectors of strings that will likely (unless the Short String Optimisation kicks in) require jumping all over memory anyway.
In general, if author is naturally a key for your data, just use a std::multimap... it'll do all your operations in reasonable time, maybe not the fastest but never particularly slow, unlike a vector for post-data-population mid-container insertions (/deletions).
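A minimal sketch of the multimap approach, assuming the Information struct from the question (addQuote and listByAuthor are illustrative helper names):

#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Information {
    std::string quoteContent;
    std::vector<std::string> tags;
};

// The author is the key; a multimap allows several quotes per author.
std::multimap<std::string, Information> quotes;

void addQuote(const std::string& author, Information info) {
    quotes.emplace(author, std::move(info));
}

void listByAuthor(const std::string& author) {
    auto range = quotes.equal_range(author);   // all quotes by this author
    for (auto it = range.first; it != range.second; ++it)
        std::cout << it->second.quoteContent << '\n';
}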
Depends on the purpose of usage; both data structures have their pros and cons.
Vectors:
- Position indexing via at() or operator[].
- No member find function; you would have to use the std::find algorithm.
Maps:
- Can be searched by key.
- Position indexing is not applicable; elements are stored by key.
- (Use unordered_map for better lookup performance than map.)
Use the data structure according to what you want to achieve.
The golden rule is: "When in doubt, measure."
i.e. Write some tests, do some benchmarking.
Anyway, considering that you have circa 200 items, I don't think there should be an important difference between the two cases on modern PC hardware. Big-O notation matters when N is big (e.g. 10,000s, 100,000s, 1,000,000s, etc.).
vector tends to be simpler than map, and I'd use it as the default container of choice (unless your main goal is to access the items given the author's name as a key, in which case map seems more logically suited).
Another option might be to have a vector with items sorted using author's names, so you can use binary search (which is O(logN)) inside the vector.
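As a sketch of that sorted-vector option (sortQuotes and firstBy are hypothetical helpers; the struct is the vector variant from the question):

#include <algorithm>
#include <string>
#include <vector>

struct Information {
    std::string author;
    std::string quoteContent;
    std::vector<std::string> tags;
};

std::vector<Information> quotes;   // kept sorted by author

// Sort once after loading the file.
void sortQuotes() {
    std::sort(quotes.begin(), quotes.end(),
              [](const Information& a, const Information& b) { return a.author < b.author; });
}

// Binary search: O(log N) to find the first quote by a given author.
std::vector<Information>::iterator firstBy(const std::string& author) {
    return std::lower_bound(quotes.begin(), quotes.end(), author,
                            [](const Information& q, const std::string& a) { return q.author < a; });
}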

Concatenate 2 STL vectors in constant O(1) time

I'll give some context as to why I'm trying to do this, but ultimately the context can be ignored as it is largely a classic Computer Science and C++ problem (which must surely have been asked before, but a couple of cursory searches didn't turn up anything...)
I'm working with (large) real time streaming point clouds, and have a case where I need to take 2/3/4 point clouds from multiple sensors and stick them together to create one big point cloud. I am in a situation where I do actually need all the data in one structure, whereas normally when people are just visualising point clouds they can get away with feeding them into the viewer separately.
I'm using Point Cloud Library 1.6, and on closer inspection its PointCloud class (under <pcl/point_cloud.h> if you're interested) stores all data points in an STL vector.
Now we're back in vanilla CS land...
PointCloud has a += operator for adding the contents of one point cloud to another. So far so good. But this method is pretty inefficient - if I understand it correctly, it 1) resizes the target vector, then 2) runs through all Points in the other vector, and copies them over.
This looks to me like a case of O(n) time complexity, which normally might not be too bad, but is bad news when dealing with at least 300K points per cloud in real time.
The vectors don't need to be sorted or analysed, they just need to be 'stuck together' at the memory level, so the program knows that once it hits the end of the first vector it just has to jump to the start location of the second one. In other words, I'm looking for an O(1) vector merging method. Is there any way to do this in the STL? Or is it more the domain of something like std::list#splice?
Note: This class is a pretty fundamental part of PCL, so 'non-invasive surgery' is preferable. If changes need to be made to the class itself (e.g. changing from vector to list, or reserving memory), they have to be considered in terms of the knock on effects on the rest of PCL, which could be far reaching.
Update: I have filed an issue over at PCL's GitHub repo to get a discussion going with the library authors about the suggestions below. Once there's some kind of resolution on which approach to go with, I'll accept the relevant suggestion(s) as answers.
A vector is not a list; it represents a sequence, but with the additional requirement that elements must be stored in contiguous memory. You cannot just bundle two vectors (whose buffers won't be contiguous) into a single vector without moving objects around.
This problem has been solved many times before such as with String Rope classes.
The basic approach is to make a new container type that stores pointers to point clouds. This is like a std::deque except that yours will have chunks of variable size. Unless your clouds chunk into standard sizes?
With this new container your iterators start in the first chunk, proceed to the end then move into the next chunk. Doing random access in such a container with variable sized chunks requires a binary search. In fact, such a data structure could be written as a distorted form of B+ tree.
There is no vector equivalent of splice - there can't be, specifically because of the memory layout requirements, which are probably the reason it was selected in the first place.
There's also no constant-time way to concatenate vectors.
I can think of one (fragile) way to concatenate raw arrays in constant time, but it depends on them being aligned on page boundaries at both the beginning and the end, and then re-mapping them to be adjacent. This is going to be pretty hard to generalise.
There's another way to make something that looks like a concatenated vector, and that's with a wrapper container which works like a deque, and provides a unified iterator and operator[] over them. I don't know if the point cloud library is flexible enough to work with this, though. (Jamin's suggestion is essentially to use something like this instead of the vector, and Zan's is roughly what I had in mind).
No, you can't concatenate two vectors by a simple link, you actually have to copy them.
However! If you implement move-semantics in your element type, you'd probably get significant speed gains, depending on what your element contains. This won't help if your elements don't contain any non-trivial types.
Further, if you have your vector reserve the required memory well in advance, that would also help speed things up by avoiding a resize (which would cause an undesired huge new allocation, possibly having to defragment at that memory size, and then a huge memcpy).
Barring that, you might want to create some kind of mix between linked lists and vectors, with each 'element' of the list being a vector of 10k elements, so you only need to follow a list link once every 10k elements; this lets you grow dynamically much more easily and makes your concatenation a breeze.
// Illustration only - roll your own custom type. A deque of chunks is shown here
// because std::list has no operator[].
std::deque<std::vector<element>> chunks;
std::size_t index = 52403;
std::size_t chunkIndex = index / 10000;        // which 10k-element chunk
std::size_t offset     = index % 10000;        // position inside that chunk
element& e = chunks[chunkIndex][offset];       // still fairly fast lookups
chunks.push_back(blockOfPoints);               // much faster appending and removing of blocks of points
You will not get this scaling behaviour with a vector, because with a vector you do not get around the copying, and you cannot copy an arbitrary amount of data in fixed time.
I do not know PointCloud, but if you can use other list types, e.g. a linked list, this behaviour is entirely possible. You might find a linked-list implementation which works in your environment and which can simply splice the second list onto the end of the first, as you imagined.
Take a look at Boost range join at http://www.boost.org/doc/libs/1_54_0/libs/range/doc/html/range/reference/utilities/join.html
This will take 2 ranges and join them. Say you have vector1 and vector2.
You should be able to write
auto combined = boost::join(vector1, vector2);
Then you can use combined with algorithms, etc as needed.
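A minimal compilable sketch of that, assuming Boost.Range is available (the joined range is a lazy view; nothing is copied):

#include <boost/range/join.hpp>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> vector1{1, 2, 3};
    std::vector<int> vector2{4, 5, 6};

    // A view over both vectors; iterating walks vector1, then vector2.
    auto combined = boost::join(vector1, vector2);

    for (int x : combined)
        std::cout << x << ' ';    // prints 1 2 3 4 5 6
}

Note that the result is a range, not a vector: it works with iteration and algorithms, but it does not give you one contiguous buffer, which is what PCL's vector-backed PointCloud ultimately stores.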
There is no O(1) copy for a vector, ever, but you should check:
- Is the element type trivially copyable (i.e. memcpy-able)?
- If so, is my vector implementation leveraging this fact, or is it naively looping over all 300k elements, executing a trivial assignment (or worse, a copy-constructor call) for each one?
What I have seen is that, while both memcpy and an assignment-for-loop have O(n) complexity, a solution leveraging memcpy can be much, much faster.
So, the problem might be that the vector implementation is suboptimal for trivial types.
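As a rough illustration (not PCL code; Point is a trivially copyable stand-in): the iterator-range overload of insert lets a good standard-library implementation fall back to memmove for trivially copyable element types, whereas a hand-written per-element loop forces one assignment per point.

#include <vector>

struct Point { float x, y, z; };   // stand-in for a trivially copyable point type

void appendFast(std::vector<Point>& dst, const std::vector<Point>& src) {
    dst.reserve(dst.size() + src.size());            // at most one reallocation
    dst.insert(dst.end(), src.begin(), src.end());   // may compile down to a memmove
}

void appendSlow(std::vector<Point>& dst, const std::vector<Point>& src) {
    for (const Point& p : src)                       // one copy per element
        dst.push_back(p);
}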

UTF8 String: simple idea to make index() lookup O(1)

Background
I am using the UTF8-CPP class. The vast majority of my strings use the ASCII character set (0-127). The problem with UTF8-based strings is that the index function (i.e. retrieving a character at a specific position) is slow.
Idea
A simple technique is to use a flag as a property which basically says whether the string is pure ASCII or not (isAscii). This flag would be updated whenever the string is modified.
This solution seems too simple, and there may be things I am overlooking. But if this solution is viable, does it not provide the best of both worlds (i.e. Unicode when needed and performance for the vast majority of cases), and would it not guarantee O(1) index lookups?
UPDATE
I'm going to attach a diagram to clarify what I mean. I think a lot of people are misunderstanding what I mean (or I am misunderstanding basic concepts).
All good replies.
I think the point here is that while the vast majority of your strings are ASCII, in general the designer of a UTF-8 library has to expect arbitrary UTF-8 strings, and there checking and setting this flag is unnecessary overhead.
In your case, it might be worth the effort to wrap or modify the UTF8 class accordingly. But before you do that, ask your favorite profiler if it's worth it.
"It depends" on your needs for thread safety and updates, and the length of your strings, and how many you've got. In other words, only profiling your idea in your real application will tell you if it makes things better or worse.
If you want to speed up the UTF8 case...
First, consider sequential indexing of code points, thus avoiding counting them from the very beginning of the string again and again. Implement and use routines to index the next and the previous code points.
Second, you may build an array of indices into the UTF8 string's code points and use it as the first step while searching, it will give you an approximate location of the sought code point.
You may either have it (the array) of a fixed size, in which case you will still get search time ~ O(n) with O(1) memory cost, or have it contain equally-spaced indices (that is, indices into every m'th code point, where m is some constant), in which case you will get search time ~ O(m+log(n)) with O(n) memory cost.
You could also embed indices inside the code point data, encoding them as reserved/unused/etc. code points, or use an invalid encoding (say, a first byte of 11111110 binary followed by, for example, six 10xxxxxx bytes containing the index, or whatever you like).
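A minimal sketch of the "index every m-th code point" idea above, working directly on raw UTF-8 bytes so no particular library API is assumed (Utf8Index and its members are made-up names):

#include <cstddef>
#include <string>
#include <vector>

// Records the byte offset of every stride-th code point, so locating the
// n-th code point costs at most stride-1 forward steps instead of O(n).
class Utf8Index {
public:
    explicit Utf8Index(const std::string& s, std::size_t stride = 16)
        : text_(s), stride_(stride)
    {
        std::size_t cp = 0;
        for (std::size_t i = 0; i < text_.size(); ) {
            if (cp % stride_ == 0) marks_.push_back(i);
            advance(i);             // skip one code point
            ++cp;
        }
    }

    // Byte offset of the n-th code point (n must be a valid code point index).
    std::size_t offsetOf(std::size_t n) const {
        std::size_t i = marks_[n / stride_];
        for (std::size_t cp = n - n % stride_; cp < n; ++cp)
            advance(i);
        return i;
    }

private:
    void advance(std::size_t& i) const {
        // A continuation byte looks like 10xxxxxx; everything else starts a code point.
        do { ++i; } while (i < text_.size() &&
                           (static_cast<unsigned char>(text_[i]) & 0xC0) == 0x80);
    }

    const std::string& text_;
    std::size_t stride_;
    std::vector<std::size_t> marks_;
};

(If the isAscii flag from the question is set, you can skip all of this and index the byte directly.)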

Data structure for storing an array of strings in memory

I'm considering a data structure for storing a large array of strings in memory. The strings will be inserted at the beginning of the program and will not be added or deleted while the program is running. The crucial point is that the search procedure should be as fast as possible; saving memory is not important. I'm inclined toward the standard hash_set, which allows searching for elements in roughly constant time, but there is no guarantee that this time will be short. Can anyone suggest a better standard solution?
Many thanks!
Try a Prefix Tree
A trie is better than a binary search tree for searching elements; for a comparison against a hash table, see this question.
If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.
There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
"Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running."
If the strings are immutable - so insertion/deletion is "infrequent", so to speak - another option is to build a Directed Acyclic Word Graph or a Compact Directed Acyclic Word Graph, which might be faster than a hash table and has a better worst-case guarantee.
(Standard disclaimer applies: it depends on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for, e.g. cache and memory latency, or the time complexity of certain machine instructions.)
A hash_set with a suitable number of buckets would be ideal; alternatively, a vector with the strings in dictionary order, searched using binary search, would be great too.
The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).
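A minimal sketch of that atom idea (AtomTable and intern are made-up names, not a standard facility):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Intern each string once and hand out a small integer id; later equality
// tests between atoms are plain integer comparisons.
class AtomTable {
public:
    using Atom = std::size_t;

    Atom intern(const std::string& s) {
        auto it = ids_.find(s);
        if (it != ids_.end()) return it->second;
        strings_.push_back(s);
        return ids_.emplace(s, strings_.size() - 1).first->second;
    }

    const std::string& text(Atom a) const { return strings_[a]; }

private:
    std::unordered_map<std::string, Atom> ids_;
    std::vector<std::string> strings_;   // each distinct string stored once
};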
Your best bet would be as follows:
Building your structure:
- Insert all your strings (char*s) into an array.
- Sort the array lexicographically.
Lookup:
- Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (it will search a space of ~4 billion strings with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated and slower than they appear (especially if you have long strings).
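A minimal sketch of that build-once, binary-search-forever approach, using std::string instead of raw char*s (build and contains are illustrative helper names):

#include <algorithm>
#include <string>
#include <vector>

std::vector<std::string> table;   // filled once at startup

void build(std::vector<std::string> strings) {
    table = std::move(strings);
    std::sort(table.begin(), table.end());   // lexicographic order
}

bool contains(const std::string& s) {
    // O(log n) comparisons over contiguous, cache-friendly storage.
    return std::binary_search(table.begin(), table.end(), s);
}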
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!
Well, assuming you truly want an array and not an associative container as you've mentioned, the allocation strategy described in Raymond Chen's blog would be efficient.

Fast container for setting bits in a sparse domain, and iterating (C++)?

I need a fast container with only two operations: inserting keys from a very sparse domain (all 32-bit integers, with approx. 100 set at a given time), and iterating over the inserted keys. It should deal with a lot of insertions which hit the same entries (say 500k insertions, but only 100 distinct keys).
Currently, I'm using a std::set (only insert and the iterating interface), which is decent, but still not fast enough. std::unordered_set was twice as slow, same for the Google Hash Maps. I wonder what data structure is optimized for this case?
Depending on the distribution of the input, you might be able to get some improvement without changing the structure.
If you tend to get a lot of runs of a single value, then you can probably speed up insertions by keeping a record of the last value you inserted, and don't bother doing the insertion if it matches. It costs an extra comparison per input, but saves a lookup for each element in a run beyond the first. So it could improve things no matter what data structure you're using, depending on the frequency of repeats and the relative cost of comparison vs insertion.
If you don't get runs, but you tend to find that values aren't evenly distributed, then a splay tree makes accessing the most commonly-used elements cheaper. It works by creating a deliberately-unbalanced tree with the frequent elements near the top, like a Huffman code.
I'm not sure I understand "a lot of insertions which hit the same entries". Do you mean that there are only 100 values which are ever members, but 500k mostly-duplicate operations which insert one of those 100 values?
If so, then I'd guess that the fastest container would be to generate a collision-free hash over those 100 values, then maintain an array (or vector) of flags (int or bit, according to what works out fastest on your architecture).
I leave generating the hash as an exercise for the reader, since it's something that I'm aware exists as a technique, but I've never looked into it myself. The point is to get a fast hash over as small a range as possible, such that for each n, m in your 100 values, hash(n) != hash(m).
So insertion looks like array[hash(value)] = 1;, deletion looks like array[hash(value)] = 0; (although you don't need that), and to enumerate you run over the array, and for each set value at index n, inverse_hash(n) is in your collection. For a small range you can easily maintain a lookup table to perform the inverse hash, or instead of scanning the whole array looking for set flags, you can run over the 100 potentially-in values checking each in turn.
Sorry if I've misunderstood the situation and this is useless to you. And to be honest, it's not very much faster than a regular hashtable, since realistically for 100 values you can easily size the table such that there will be few or no collisions, without using so much memory as to blow your caches.
For an in-use set expected to be this small, a non-bucketed hash table might be OK. If you can live with an occasional expansion operation, grow it in powers of 2 if it gets more than 70% full. Cuckoo hashing has been discussed on Stack Overflow before and might also be a good approach for a set this small. If you really need to optimise for speed, you can implement the hashing function and lookup in assembler; on linear data structures this is very simple, so the coding and maintenance effort for an assembler implementation shouldn't be unduly high.
You might want to consider implementing a HashTree using a base 10 hash function at each level instead of a binary hash function. You could either make it non-bucketed, in which case your performance would be deterministic (log10) or adjust your bucket size based on your expected distribution so that you only have a couple of keys/bucket.
A randomized data structure might be perfect for your job. Take a look at the skip list – though I don't know any decent C++ implementation of it. I intended to submit one to Boost but never got around to doing it.
Maybe a set with a B-tree (instead of a binary tree) as the internal data structure. I found an article on CodeProject which implements this.
Note that while inserting into a hash table is fast, iterating over it isn't particularly fast, since you need to iterate over the entire array.
Which operation is slow for you? Do you do more insertions or more iteration?
How much memory do you have? A flag bit for every possible 32-bit value takes "only" 2^32 bits = 512 MB, not much for a high-end server. That would make your insertions O(1), but it could make the iteration slow, although skipping words that are all zeroes would optimize away most of it. If your 100 numbers are in a relatively small range, you can optimize even further by keeping the minimum and maximum around.
I know this is just brute force, but sometimes brute force is good enough.
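A brute-force sketch of that bitmap idea (BitSet32 is a made-up name; __builtin_ctzll assumes GCC/Clang):

#include <cstdint>
#include <vector>

struct BitSet32 {
    // One bit per possible 32-bit key: 2^26 64-bit words = 512 MB, zero-initialised.
    std::vector<std::uint64_t> words = std::vector<std::uint64_t>(1ull << 26);
    std::uint32_t lo = UINT32_MAX, hi = 0;

    void insert(std::uint32_t key) {                    // O(1)
        words[key >> 6] |= 1ull << (key & 63);
        if (key < lo) lo = key;
        if (key > hi) hi = key;
    }

    template <class F>
    void forEach(F f) const {                           // scan only the [lo, hi] range
        if (lo > hi) return;                            // empty set
        for (std::uint64_t w = lo >> 6; w <= (hi >> 6); ++w) {
            std::uint64_t bits = words[w];
            while (bits) {
                unsigned b = __builtin_ctzll(bits);     // index of lowest set bit
                f(static_cast<std::uint32_t>((w << 6) + b));
                bits &= bits - 1;                       // clear that bit
            }
        }
    }
};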
Since no one has explicitly mentioned it, have you thought about memory locality? A really great data structure with an algorithm for insertion that causes a page fault will do you no good. In fact a data structure with an insert that merely causes a cache miss would likely be really bad for perf.
Have you made sure that a naive unordered set of elements packed in a fixed array, with a simple swap to front when an insert collides, is too slow? It's a simple experiment that might show you have memory-locality issues rather than algorithmic issues.
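A minimal sketch of that experiment (SmallSet is a made-up name):

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Flat array, linear scan on insert, swap-to-front on a repeat so that hot
// keys are found after one or two comparisons next time.
struct SmallSet {
    std::vector<std::uint32_t> keys;   // ~100 entries, contiguous in memory

    void insert(std::uint32_t k) {
        for (std::size_t i = 0; i < keys.size(); ++i) {
            if (keys[i] == k) {
                std::swap(keys[0], keys[i]);
                return;
            }
        }
        keys.push_back(k);
    }
    // Iteration is just: for (auto k : s.keys) ...
};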