string cursors/markers in C++ strings - c++

I am working with some big (megabytes) strings, and I need to modify them by inserting and removing characters at different locations.
To make it more efficient, instead of searching insertion/deletion points every time, I would like to have "cursors" or "tags" which are still valid if text is inserted (i.e. they are moved accordingly), and are still valid if the removed text is not "enclosing" the cursor position (i.e. a cursor becomes invalid only if it was in the removed substring, other cursors are moved accordingly).
I do not need to operate on the string concurrently; insertion/deletion operations always happen one at a time.
Do you know how this can be done with standard C++, boost or a portable, lightweight library?

If the number of insertion points will be relatively small, why not just keep a list (or array) of your insertion points, each storing its offset into the data string and ordered by that offset?
Then, any time you insert/remove some text, simply pass through that list and adjust any insertion points that are past the offset of the modification, either up or down by the size of the insertion/removal.
Of course, you'd have to decide what it meant to have a modification "hit" one of your insertion points (e.g. what to do if a deleted range includes one or more insertion points), but that'd depend on what you're trying to maintain those markers for.
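A minimal sketch of that bookkeeping, assuming cursors are plain offsets into the string; the names (Cursor, insertText, eraseText) are made up for illustration:

#include <cstddef>
#include <string>
#include <vector>

struct Cursor {
    std::size_t pos;
    bool valid = true;
};

void insertText(std::string& text, std::size_t at, const std::string& what,
                std::vector<Cursor>& cursors) {
    text.insert(at, what);
    for (Cursor& c : cursors)
        if (c.valid && c.pos >= at)
            c.pos += what.size();              // shift cursors at/after the insertion point
}

void eraseText(std::string& text, std::size_t at, std::size_t len,
               std::vector<Cursor>& cursors) {
    text.erase(at, len);
    for (Cursor& c : cursors) {
        if (!c.valid) continue;
        if (c.pos >= at + len) c.pos -= len;   // past the removed range: shift left
        else if (c.pos >= at) c.valid = false; // inside the removed range: invalidate
    }
}

Each edit costs O(number of cursors) extra work, which is negligible next to moving megabytes of string data around.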

Maybe you can use special keywords in your text, which you match and replace using regular expressions (regex): http://www.cplusplus.com/reference/regex/
Be careful which keywords you use, because they could potentially occur naturally in the string.
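For example, a sketch using std::regex, assuming a marker token such as {{cursor1}} that you have chosen so it cannot occur naturally in the data:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Hypothetical marker token embedded in the text.
    std::string text = "Hello {{cursor1}}world";
    std::regex marker("\\{\\{cursor1\\}\\}");

    // Insert at the marker position by replacing the marker (and re-emitting it for later use).
    text = std::regex_replace(text, marker, "brave new {{cursor1}}");
    std::cout << text << '\n'; // Hello brave new {{cursor1}}world
}

Note that re-scanning megabytes of text with a regex on every edit may well be slower than simply tracking numeric offsets as in the previous answer.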

Related

how does a language *know* when a list is sorted?

Forgive me if this question is dumb, but it occurred to me I don't know how a language knows a list is sorted or not.
Say I have a list:
["Apple","Apricot","Blueberry","Cardamom","Cumin"]
and I want to insert "Cinnamon".
AFAIK The language I'm using doesn't know the list is sorted; it's just a list. And it doesn't have a "wide screen" field of view like we do, so it doesn't know where the A-chunk ends and the C-chunk begins from outside the list. So it goes through and compares the first letter of each array string to the first letter of the insert string. If the insert char is greater, it moves to the next string. If the chars match, it moves to the next letter. If it moves on to the next string and the array's char is greater than the insert's char, then the char is inserted there.
My question is, can a language KNOW when a list is sorted?
If the process for combing through an unsorted and a sorted list is the same, and the list is still iterated through, then how does sorting save time?
EDIT:
I understand that "sorting allows algorithms that rely on sorting to work"; I apologize for not making that clear. I guess I'm asking if there's anything intrinsic about sorting inside computer languages, or if it's a strategy that people built on top of them. I think it's the latter, and you guys have confirmed it. A language doesn't know if it's sorting something or not, but we recognize the performance difference.
Here's the key. The language doesn't / can't / shouldn't know whether your data structure is sorted or unsorted. In fact it doesn't even care what data structure it really is.
Now consider this: What does insertion or deletion really mean? What exact steps need to be taken to insert a new item or delete an existing one? It turns out that the exact meaning of these operations depends upon the data structure that you're using. An array will insert a new element very differently than a linked list will.
So it stands to reason that these operations must be defined in the context of the data structure on which these are being applied. The language in general does not supply any keywords to deal with these data structures. Rather the accompanying libraries provide built-in implementations of these structures that contain methods to perform these operations.
Now to the original question: How does the language "know" if a list is sorted or not and why is it more efficient to deal with sorted lists? The answer is, as evident from what I said above, the language doesn't and shouldn't know about the internals of a list. It is the implementation of the list structure that you're using that knows if it is sorted or not, and how to insert a new node in an ordered manner. For example, a certain data structure may use an index (much like the index of a book) to locate the position of the words starting with a certain letter, thus reducing the amount of time that an unsorted list would require to traverse through the entire list, one element at a time.
Hope this makes it a bit clearer.
Languages don't know such things.
Some programming languages come with a standard library containing data structures, and if they don't, you generally can link to libraries that do.
Some of those data structures may be collection types that maintain order.
So given you have a data structure that represents an ordered collection type, then it is that data structure that maintains the order of the collection, whenever an item is added or removed.
If you're using the most basic collection type, an array, then neither the language nor the runtime nor the data structure itself (the array) care in the slightest at what point you insert an item.
can a language KNOW when a list is sorted
Do you mean a language interpreter? Of course it can check whether a list is sorted, simply by checking that each element is "larger" than the previous one. I doubt that interpreters do this; why should they care if the list is sorted or not?
In general, if you want to insert "Cinnamon" into your list, you need to either specify where to insert it, or just append it at the end. It doesn't matter to the interpreter if the list is sorted beforehand or not. It's how you use the list that determines whether a sorted list will remain sorted, and whether or not it needs to be sorted to begin with. (For example, if you try to find something in the list using a binary search, then the list must be sorted. But you must arrange for this to be the case.)
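For instance, in C++ it is entirely the caller's job to preserve order. A minimal sketch, assuming a std::vector<std::string> that the caller promises is already sorted:

#include <algorithm>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> items{"Apple", "Apricot", "Blueberry", "Cardamom", "Cumin"};

    // The vector has no idea it is sorted; lower_bound simply assumes it is,
    // which is what lets it use binary search instead of a linear scan.
    auto pos = std::lower_bound(items.begin(), items.end(), std::string("Cinnamon"));
    items.insert(pos, "Cinnamon");   // inserting here is what keeps the order intact
}

The vector itself never records or checks the "sorted" property; it is the combination of lower_bound and inserting at the returned position that maintains it.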
AFAIK The language I'm using ...
(which is?)
... doesn't know the list is sorted; it's just a list. And it doesn't have a "wide screen" field of view like we do, so it doesn't know where the A-chunk ends and the C-chunk begins from outside the list. So it goes through and compares the first letter of each array string to the first letter of the insert string. If the insert char is greater, it moves to the next string. If the chars match, it moves to the next letter. If it moves on to the next string and the array's char is greater than the insert's char, then the char is inserted there.
What you're saying, I think, is that it looks for the first element that is "bigger than" the one being inserted, and inserts the new element just before it. That implies that it maintains the "sorted" property of the list, if it is already sorted. This is horribly inefficient for the case of unsorted lists. Also, the technique you describe for finding the insertion point (linear search) would be inefficient, if that is truly what is happening. I would suspect that your understanding of the list/language semantics is not correct.
It would help a lot if you gave a concrete example in a specific language.

Building a set of unique lines in a file efficiently, without storing actual lines in the set

Recently I was trying to solve the following issue:
I have a very large file, containing long lines, and I need to find and print out all the unique lines in it.
I don't want to use a map or set storing the actual lines, as the file is very big and the lines are long, so this would result in O(N) space complexity with poor constants (where N is the number of lines). Preferably, I would rather generate a set storing pointers to the lines in the file that are unique. Clearly, the size of such a pointer (8 bytes on a 64-bit machine, I believe) is generally much smaller than the size of a line (1 byte per character, I believe) in memory. Although space complexity is still O(N), the constants are much better now. Using this implementation, the file never needs to be fully loaded into memory.
Now, let's say I'll go through the file line by line, checking for uniqueness. To see if it is already in the set, I could compare to all lines pointed by the set so far, comparing character by character. This gives O(N^2*L) complexity, with L the average length of a line. When not caring about storing the full lines in the set, O(N*L) complexity can be achieved, thanks to hashing. Now, when using a set of pointers instead (to reduce space requirements), how can I still achieve this? Is there a neat way to do it? The only thing I can come up with is this approach:
Hash the sentence. Store the hash value in a map (or actually an unordered_multimap: unordered to get hash-map behaviour, multi because duplicate keys may be inserted in case of 'false matches').
For every new sentence: check if its hash value is already in the map. If not, add it. If yes, compare the full sentences (new one and the one in the unordered map with same hash) character by character, to make sure there is no 'false match'. If it is a 'false match', still add it.
Is this the right way? Or is there a nicer way you could do it? All suggestions are welcome!
And can I use some clever 'comparison object' (or something like that, I don't know much about that yet) to make this checking for already existing sentences fully automated on every unordered_map::find() call?
Your solution looks fine to me since you are storing O(unique lines) hashes not N, so that's a lower bound.
Since you scan the file line by line you might as well sort the file. Now duplicate lines will be contiguous and you need only check against the hash of the previous line. This approach uses O(1) space but you've got to sort the file first.
As #saadtaame's answer says, your space is actually O(unique lines) - depending on your use case, this may be acceptable or not.
While hashing certainly has its merits, it can conceivably have many problems with collisions - and if you can't have false positives, then it is a no-go unless you actually keep the contents of the lines around for checking.
The solution you describe is to maintain a hash-based set. That is obviously the most straightforward thing to do, and yes, it would require maintaining all the unique lines in memory. That may or may not be a problem, though. That solution would also be the easiest to implement -- what you are trying to do is exactly what any implementation of a (hash-based) set would do. You can just use std::unordered_set, and add every line to the set.
Since we are throwing around ideas, you could also use a trie as a substitute for the set. You would maybe save some space, but it still would be O(unique lines).
If there isn't some special structure in the file you can leverage, then definitively go for hashing the lines. This will - by orders of magnitude - be faster than actually comparing each line against each other line in the file.
If your actual implementation is still too slow, you can e.g. limit the hashing to the first portion of each line. This will produce more false positives, but assuming that most lines already deviate in the first few words, it will significantly speed up the file processing (especially if you are I/O-bound).
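As a rough sketch of the offset-plus-hash idea from the question (all names here, such as LineRef and big_file.txt, are hypothetical): the set stores a file offset and a precomputed hash, and only re-reads lines from disk when two hashes collide.

#include <fstream>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_set>

// Hypothetical key type: where the line starts in the file, plus its precomputed hash.
struct LineRef {
    std::streamoff offset;
    std::size_t hash;
};

struct LineRefHash {
    std::size_t operator()(const LineRef& r) const { return r.hash; }
};

// Equality re-reads both lines from the file, so this cost is only paid on hash collisions.
struct LineRefEq {
    std::ifstream* file;
    bool operator()(const LineRef& a, const LineRef& b) const {
        std::string la, lb;
        file->clear();
        file->seekg(a.offset); std::getline(*file, la);
        file->seekg(b.offset); std::getline(*file, lb);
        return la == lb;
    }
};

int main() {
    std::ifstream in("big_file.txt");        // assumed input file
    std::ifstream lookup("big_file.txt");    // second handle, used only by the comparator
    std::unordered_set<LineRef, LineRefHash, LineRefEq> seen(
        1 << 10, LineRefHash{}, LineRefEq{&lookup});

    std::string line;
    for (std::streamoff pos = in.tellg(); std::getline(in, line); pos = in.tellg()) {
        LineRef ref{pos, std::hash<std::string>{}(line)};
        if (seen.insert(ref).second)
            std::cout << line << '\n';       // first occurrence of this line
    }
}

This keeps only 16 bytes or so per distinct line in memory, at the price of extra seeks whenever hashes collide.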

Suffix tree vs Suffix array for LCS

I'm working on a program to find the longest common substring between multiple strings. I've narrowed my approach down to either using suffix arrays or a suffix tree. I want to see which is the better approach (if there is one) and why. Also, for suffix arrays I've seen a few algorithms for two strings, but not any for more than two strings. Any solid examples would be appreciated; thanks again for the advice!
Note: I didn't see any other questions that specifically addressed this issue, but if they exist please point me in that direction!
If you have a substring that occurs in all sequences, then in a suffix array, the pointers to each occurrence of that substring must sort close together. So you could attempt to find them by moving a window along the suffix array, where the window is just large enough to contain at least one occurrence of each sequence. You could do this in linear time by maintaining a table that tells you, for each sequence, how many times that sequence occurs within the window. Then when you move the rear end of the window forwards, decrement the count for the sequence associated with the pointer you have just skipped over and, if necessary, move the forward end of the window just far enough to pick up a new occurrence of this sequence and update the table.
Now you need to be able to find the length of the common prefix shared by all suffixes starting at the pointers in the window. This should be the minimum LCP value occurring between the pointers in the window. If you use a red-black tree, such as a Java TreeSet, with a key which consists of the LCP value as the most significant component and some tie-breaker such as the pointer itself as a less significant component, then you can maintain the minimum LCP value within the window at a cost of roughly log(window size) per window adjustment.
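A sketch of just the window-maintenance step, assuming you have already built a generalized suffix array and have, for each suffix array position, the id of the sequence it came from (seqId); the LCP/TreeSet part is omitted, and the names are made up for illustration:

#include <cstddef>
#include <utility>
#include <vector>

// For every position of the rear end, report the smallest window [lo, hi) over the
// suffix array that contains at least one suffix from every one of the k sequences.
std::vector<std::pair<std::size_t, std::size_t>>
minimalWindows(const std::vector<int>& seqId, int k) {
    std::vector<std::pair<std::size_t, std::size_t>> windows;
    std::vector<int> count(k, 0);   // how many suffixes of each sequence are in the window
    int covered = 0;                // how many sequences currently have count > 0
    std::size_t hi = 0;

    for (std::size_t lo = 0; lo < seqId.size(); ++lo) {
        // advance the forward end until every sequence occurs at least once
        while (hi < seqId.size() && covered < k) {
            if (count[seqId[hi]]++ == 0) ++covered;
            ++hi;
        }
        if (covered == k) windows.emplace_back(lo, hi);
        // slide the rear end: drop the suffix at position lo from the window
        if (--count[seqId[lo]] == 0) --covered;
    }
    return windows;
}

The longest common substring would then be the maximum, over all reported windows, of the minimum LCP value inside that window.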

Concatenate 2 STL vectors in constant O(1) time

I'll give some context as to why I'm trying to do this, but ultimately the context can be ignored as it is largely a classic Computer Science and C++ problem (which must surely have been asked before, but a couple of cursory searches didn't turn up anything...)
I'm working with (large) real time streaming point clouds, and have a case where I need to take 2/3/4 point clouds from multiple sensors and stick them together to create one big point cloud. I am in a situation where I do actually need all the data in one structure, whereas normally when people are just visualising point clouds they can get away with feeding them into the viewer separately.
I'm using Point Cloud Library 1.6, and on closer inspection its PointCloud class (under <pcl/point_cloud.h> if you're interested) stores all data points in an STL vector.
Now we're back in vanilla CS land...
PointCloud has a += operator for adding the contents of one point cloud to another. So far so good. But this method is pretty inefficient - if I understand it correctly, it 1) resizes the target vector, then 2) runs through all Points in the other vector, and copies them over.
This looks to me like a case of O(n) time complexity, which normally might not be too bad, but is bad news when dealing with at least 300K points per cloud in real time.
The vectors don't need to be sorted or analysed, they just need to be 'stuck together' at the memory level, so the program knows that once it hits the end of the first vector it just has to jump to the start location of the second one. In other words, I'm looking for an O(1) vector merging method. Is there any way to do this in the STL? Or is it more the domain of something like std::list::splice?
Note: This class is a pretty fundamental part of PCL, so 'non-invasive surgery' is preferable. If changes need to be made to the class itself (e.g. changing from vector to list, or reserving memory), they have to be considered in terms of the knock on effects on the rest of PCL, which could be far reaching.
Update: I have filed an issue over at PCL's GitHub repo to get a discussion going with the library authors about the suggestions below. Once there's some kind of resolution on which approach to go with, I'll accept the relevant suggestion(s) as answers.
A vector is not a list; it represents a sequence, but with the additional requirement that elements must be stored in contiguous memory. You cannot just bundle two vectors (whose buffers won't be contiguous) into a single vector without moving objects around.
This problem has been solved many times before, such as with rope classes for strings.
The basic approach is to make a new container type that stores pointers to point clouds. This is like a std::deque except that yours will have chunks of variable size. Unless your clouds chunk into standard sizes?
With this new container your iterators start in the first chunk, proceed to the end then move into the next chunk. Doing random access in such a container with variable sized chunks requires a binary search. In fact, such a data structure could be written as a distorted form of B+ tree.
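A rough sketch of such a chunked container, with made-up names: appending a whole cloud only moves its vector in, and operator[] binary-searches a running total of chunk sizes.

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

template <typename Point>
class ChunkedCloud {
    std::vector<std::vector<Point>> chunks_;
    std::vector<std::size_t> ends_;   // ends_[i] = total number of points in chunks 0..i

public:
    void append(std::vector<Point> cloud) {                  // O(1) moves, no per-point copy
        std::size_t prev = ends_.empty() ? 0 : ends_.back();
        ends_.push_back(prev + cloud.size());
        chunks_.push_back(std::move(cloud));
    }

    std::size_t size() const { return ends_.empty() ? 0 : ends_.back(); }

    Point& operator[](std::size_t i) {                        // O(log #chunks); requires i < size()
        std::size_t c = std::upper_bound(ends_.begin(), ends_.end(), i) - ends_.begin();
        std::size_t before = (c == 0) ? 0 : ends_[c - 1];
        return chunks_[c][i - before];
    }
};

Sequential iteration can simply walk chunk by chunk, so scans stay as cheap as with a single vector apart from cache effects at chunk boundaries.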
There is no vector equivalent of splice - there can't be, specifically because of the memory layout requirements, which are probably the reason it was selected in the first place.
There's also no constant-time way to concatenate vectors.
I can think of one (fragile) way to concatenate raw arrays in constant time, but it depends on them being aligned on page boundaries at both the beginning and the end, and then re-mapping them to be adjacent. This is going to be pretty hard to generalise.
There's another way to make something that looks like a concatenated vector, and that's with a wrapper container which works like a deque, and provides a unified iterator and operator[] over them. I don't know if the point cloud library is flexible enough to work with this, though. (Jamin's suggestion is essentially to use something like this instead of the vector, and Zan's is roughly what I had in mind).
No, you can't concatenate two vectors by a simple link, you actually have to copy them.
However! If you implement move-semantics in your element type, you'd probably get significant speed gains, depending on what your element contains. This won't help if your elements don't contain any non-trivial types.
Further, if you have your vector reserve the required memory well in advance, that would also help speed things up by avoiding a resize (which would cause an undesired huge new allocation, possibly having to defragment at that memory size, and then a huge memcpy).
Barring that, you might want to create some kind of mix between linked lists and vectors, with each 'element' of the list being a vector with 10k elements, so you only need to follow list links once every 10k elements, but it allows you to grow dynamically much more easily, and makes your concatenation a breeze.
std::list<std::vector<element>> forIllustrationOnly; // just for illustration -- roll your own custom type
std::size_t index = 52403;
std::size_t listIndex   = index / 1000;  // which chunk holds the element
std::size_t vectorIndex = index % 1000;  // offset within that chunk
auto chunk = std::next(forIllustrationOnly.begin(), listIndex);
element p = (*chunk)[vectorIndex];               // still fairly fast lookups
forIllustrationOnly.push_back(vectorOfPoints);   // much faster appending and removing of blocks of points
You will not get this scaling behaviour with a vector, because with a vector, you do not get around the copying. And you can not copy an arbitrary amount of data in fixed time.
I do not know PointCloud, but if you can use other list types, e.g. a linked list, this behaviour is entirely possible. You might find a linked list implementation which works in your environment and which can simply stick the second list onto the end of the first list, as you imagined.
Take a look at Boost range join at http://www.boost.org/doc/libs/1_54_0/libs/range/doc/html/range/reference/utilities/join.html
This will take 2 ranges and join them. Say you have vector1 and vector2.
You should be able to write
auto combined = boost::join(vector1, vector2);
Then you can use combined with algorithms, etc as needed.
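For example (a sketch; the element type and vector names are placeholders), using boost::join from <boost/range/join.hpp>:

#include <boost/range/join.hpp>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> vector1{1, 2, 3};
    std::vector<int> vector2{4, 5, 6};

    // 'combined' is a lazy view over both vectors; no elements are copied.
    auto combined = boost::join(vector1, vector2);

    int sum = std::accumulate(combined.begin(), combined.end(), 0);
    std::cout << sum << '\n'; // 21
}

Note that the result is a view, not a std::vector, so this only helps if the consuming code can accept an arbitrary range rather than requiring a contiguous buffer.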
There is no O(1) copy for a vector, ever, but you should check:
Is the element type trivially copyable (i.e. something that can be memcpy'd)?
If so, is my vector implementation leveraging this fact, or is it naively looping over all 300k elements, executing a trivial assignment (or worse, a copy-constructor call) for each element?
What I have seen is that, while both memcpy and an assignment loop have O(n) complexity, a solution leveraging memcpy can be much, much faster.
So, the problem might be that the vector implementation is suboptimal for trivial types.
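For what it's worth, here is a sketch of the append done as a single ranged insert (Point is a made-up trivially copyable type); a decent standard library reduces this to one allocation plus one memmove-like copy:

#include <vector>

struct Point { float x, y, z; };   // assumed trivially copyable point type

void appendCloud(std::vector<Point>& dst, const std::vector<Point>& src) {
    dst.reserve(dst.size() + src.size());            // one allocation instead of repeated growth
    dst.insert(dst.end(), src.begin(), src.end());   // good implementations lower this to memmove
}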

Find substring in many objects containing multiple strings

I am dealing with a collection of objects where the reasonable size of it could be anywhere between 1 and 50K (but there's no set upper limit). Each object contains a handful of strings.
I want to implement a search function that can match any one of these strings partially, exactly, or with a regex, and subsequently return a list of objects.
If each object only contained a single string then I could simply lexicographically sort them, and pull out ranges fairly easily - but I am reluctant to implement a map-like structure for each of the contained strings due to speed/memory concerns.
Is there a data structure well suited to this kind of operation for speed and memory efficiency? I'm sensing a database maybe on the horizon, but I know little about them, so I want to hold off researching until someone more knowledgeable can nudge me in the right direction!
A map-like collection is probably your best bet: the key will be the string, and the value a reference to the containing object. If your strings are held inside the objects as std::string, then you could store a reference to that data in the key part of the map instead (alternatively, use a shared_ptr for the strings and reference them in both the object and the map).
Searching and sorting then just become a matter of implementing a custom comparison functor that uses the dereferenced data. The size of each map entry will be two references plus the map overhead, which isn't going to be that bad if you consider that the alternatives will be as large, if not larger.
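A minimal sketch of that idea (Object and DerefLess are made-up names): the map keys are pointers to strings that already live inside the objects, and a dereferencing comparator keeps them ordered.

#include <map>
#include <string>

struct Object { std::string name; /* ... other strings ... */ };

// Orders by the pointed-to string, so the key is just a pointer into the object
// rather than a second copy of the string.
struct DerefLess {
    bool operator()(const std::string* a, const std::string* b) const { return *a < *b; }
};

std::map<const std::string*, Object*, DerefLess> index;

void add(Object& o) { index.emplace(&o.name, &o); }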
partially, exactly, or RegEx match any of one these strings and subsequently return a list of objects
Well, for exact matches, you could have a std::map<std::string, std::vector<object*> >. The key would be the exact string, and the vector holds pointers to matching objects, many of these pointers may point to a single object instance.
You could have a front-end map from partial strings to full strings: say the string is "dogged", you'd sadly have to put entries in for "dogged", "ogged", "gged", "ged", "ed" and "d" (stop wherever you like if you want a minimum match size)... then use lower_bound to search. That way, say you search on "dog", you could still see that there was a match for "dogged" (it doesn't matter if it lands on say "dogfood" first). This would be a simple std::map<string, string>. While you increment forwards from the lower_bound position and the string still matches (i.e. from "dogfood" to "dogged" to ... until it doesn't start with "dog"), you can look each full string up in the "exact match" map and aggregate results.
For regular expressions, I have no good suggestion... I'd start with a brute force search through all the full strings. If it really isn't good enough, then you do some rough optimisations like checking for a constant substring to filter by before doing the brute force matching, but it's beyond me to imagine how to do this very thoroughly and fast.
(substitute your favourite smart pointers for object*s if useful)
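A rough sketch of those two maps together (Object is a stand-in for the real type, and the function names are made up): index every suffix of every string in a "partial" map pointing back at the full string, then scan forward from lower_bound for prefix queries.

#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Object;   // whatever the collection actually stores

// Exact matches: full string -> objects containing it.
std::map<std::string, std::vector<Object*>> exact;
// Partial matches: every suffix of every string -> the full string it came from.
std::multimap<std::string, std::string> partial;

void indexString(const std::string& s, Object* obj) {
    exact[s].push_back(obj);
    for (std::size_t i = 0; i < s.size(); ++i)
        partial.emplace(s.substr(i), s);           // "dogged", "ogged", "gged", ...
}

std::set<Object*> findPartial(const std::string& query) {
    std::set<Object*> results;
    for (auto it = partial.lower_bound(query);
         it != partial.end() && it->first.compare(0, query.size(), query) == 0; ++it)
        for (Object* o : exact[it->second])        // look the full string up in the exact map
            results.insert(o);
    return results;
}

The suffix entries blow up memory by roughly the average string length, which is the trade-off the answer alludes to with "sadly".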
Thanks for all the replies, but following on from techniques mentioned in this post, I've decided to use an enhanced suffix array from the header-only SeqAn project.