Rope data structure & Lines - c++

I'm using a Rope to store a large amount (GB's) of text. The text can be tens of millions of lines long.
The rope itself is extremely fast at inserting at any position, and also fast at getting the character at a specific position.
However, how would I find where a specific line (delimited by \n in this case) starts? For example, how would I find where line 15 starts? There are a couple of options that I can see.
1. Don't store any extra data. Whenever you want, say, the 15th line, you iterate through all the characters in the rope, count the newlines, and stop when you reach the 15th newline.
2. Store the start and length of each line in a vector. You would have your rope containing all the characters, plus a separate std::vector<line>. The line structure would consist of two fields, start and length: start is where the line starts inside the rope, and length is the length of the line. To get where the 15th line starts, just do lines[14].start.
Problems:
#1 is a horrible way to do it. It's extremely slow because you have to walk over all of the characters.
#2 is also not good. Although getting where a line starts is extremely fast (O(1)), every time you insert a line you have to shift every entry that comes after it, which is O(N). Storing this also costs an extra 16 bytes per line (assuming start and length are 8 bytes each), so 13,000,000 lines would take roughly 200MB of extra memory. You could use a linked list instead, but that makes access slow.
Is there a better, more efficient way of storing the line positions for quick access and insertion? (Preferably O(log n) for both inserting and accessing lines.)
I was thinking of using a BST, more specifically a red-black tree, but I'm not entirely sure how that would work here. I saw that VS Code does something like this, but with a piece table instead.
Any help would be greatly appreciated.
EDIT:
The answer that @interjay provided seems good, but how would I handle CRLF if the CR and LF were split between two leaf nodes?
I also noticed Ropey, a Rust rope library. I was wondering whether there is something similar for C++.

In each rope node (both leaves and internal nodes), in addition to holding the number of characters in that subtree, you can also store the total number of newlines contained in the subtree.
Finding a specific newline then works exactly the same way as finding the node that holds a specific character index: you look at the "number of newlines" field instead of the "number of characters" field.
All rope operations work mostly the same. When creating a new internal node, you just sum its children's newline counts, exactly as you already do for character counts. The complexity of every operation is unchanged.
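To make that concrete, here is a minimal C++ sketch of the idea, assuming a simple binary rope where every node caches the character and newline totals of its subtree (all names here are made up for illustration):

#include <cstddef>
#include <string>

struct RopeNode {
    RopeNode* left = nullptr;      // internal nodes have both children set
    RopeNode* right = nullptr;
    std::string text;              // only leaves hold text
    std::size_t chars = 0;         // total characters in this subtree
    std::size_t newlines = 0;      // total '\n' characters in this subtree
};

// Character index at which line `line` starts (0-based).
// Line 0 starts at index 0; line k starts just after the k-th '\n'.
std::size_t line_start(const RopeNode* node, std::size_t line, std::size_t base = 0) {
    if (line == 0) return base;
    if (!node->left) {                                    // leaf: scan its text
        std::size_t seen = 0;
        for (std::size_t i = 0; i < node->text.size(); ++i)
            if (node->text[i] == '\n' && ++seen == line)
                return base + i + 1;
        return base + node->text.size();                  // line out of range for this leaf
    }
    if (line <= node->left->newlines)                     // steer by the newline counts,
        return line_start(node->left, line, base);        // just like a lookup by index
    return line_start(node->right, line - node->left->newlines,
                      base + node->left->chars);
}

Maintaining newlines is the same bookkeeping you already do for chars: when an internal node is built or rebalanced, set both fields to the sum of the children's, and when a leaf changes, recount the '\n' characters in that leaf only.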

Related

Clojure dictionary of words

I want a dictionary of English words available, to pick random English words from. I have a dictionary text file that I downloaded from the internet which has almost 1 million words. What's the best way to go about using this list in Clojure, given that most of the time I'll only need one randomly selected word?
Edit:
To answer the comments: this is for some tests which I may turn into load tests, which is why I want a decent number of random words, and I guess access speed is the most important thing. I do not want to use a database for this. I originally thought of a dictionary just because that's the first thing that popped into my mind, but I think a random sequence of letters and numbers would be good enough; perhaps I will just use a UUID as a string.
Read all the words into a vector and then call rand-nth, e.g.
(rand-nth all-words)
rand-nth uses the nth function on the underlying data structure, and Clojure vectors have O(log32 N) performance for index-based retrieval.
Edit: This is assuming that it is for a test environment as you described in your question. A more memory-efficient method would be to use a RandomAccessFile, seek to a random location in the file of words, read until you find the first word delimiter (e.g. comma, EOL), and then read the following bytes until the next delimiter, which will give you a random word.
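Since the rest of this page is C++-centric, here is a rough C++ sketch of that seek-based approach, assuming a newline-delimited word file (the path argument is a placeholder):

#include <fstream>
#include <random>
#include <string>

// Pick a pseudo-random word from a newline-delimited word list without
// loading the whole file: seek to a random byte, throw away the partial
// word we landed in, then return the next complete word.
std::string random_word(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(0, std::ios::end);
    const long long size = static_cast<long long>(in.tellg());

    static std::mt19937_64 rng{std::random_device{}()};
    std::uniform_int_distribution<long long> dist(0, size - 1);
    in.seekg(dist(rng));

    std::string word;
    std::getline(in, word);                // discard the (probably partial) word
    if (!std::getline(in, word)) {         // landed near EOF: wrap to the first word
        in.clear();
        in.seekg(0);
        std::getline(in, word);
    }
    return word;
}

The selection is slightly biased towards words that follow long words, which usually doesn't matter for test data.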

Algorithm for writing limited number of lines to text file

I have a program where I need to write text lines to a log file very frequently. I would like to limit the number of lines in the log file to 1000. When I write lines to the file, it should append them normally. Once the file reaches 1000 lines, I'd like to get rid of the first line and then append the new one. Does anyone know if there is a way to do this without rewriting the entire file each time?
Generally it's a little bit better for a case like this to remove more than one line at a time from the beginning.
That is, if your limit is 1000 lines and you hit 1000 lines, delete the first 300 or so and then resume writing. That way, you're not performing the delete operation with every single line written thereafter, only once every 300 writes. If you need to persist 1000 lines, then instead keep up to 1300 and delete 300 when 1300 is reached.
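A sketch of that batching idea in C++ (the thresholds and the class name are placeholders); appends stay cheap, and the rewrite only happens once the hard limit is hit:

#include <fstream>
#include <string>
#include <vector>

class TrimmingLog {
public:
    explicit TrimmingLog(std::string path) : path_(std::move(path)) {
        std::ifstream in(path_);
        std::string line;
        while (std::getline(in, line)) ++count_;     // count existing lines once
    }

    void append(const std::string& line) {
        std::ofstream(path_, std::ios::app) << line << '\n';   // normal cheap append
        if (++count_ >= kHardLimit) trim();                     // pay the trim cost rarely
    }

private:
    void trim() {
        std::vector<std::string> lines;
        std::ifstream in(path_);
        std::string line;
        while (std::getline(in, line)) lines.push_back(line);
        in.close();

        std::ofstream out(path_, std::ios::trunc);              // rewrite, keeping the tail
        std::size_t first = lines.size() > kKeepLines ? lines.size() - kKeepLines : 0;
        for (std::size_t i = first; i < lines.size(); ++i) out << lines[i] << '\n';
        count_ = lines.size() - first;
    }

    static constexpr std::size_t kHardLimit = 1300;   // trim when this many lines exist
    static constexpr std::size_t kKeepLines = 1000;   // lines kept after a trim
    std::string path_;
    std::size_t count_ = 0;
};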
Files are stored in whole filesystem clusters, so, no, there's no way: you can append a line to a file, but you can't delete the first line without rewriting the file.
You can use two files in turn.
Or use some buffer in memory and flush it periodically.
I think you still have to scan the file to find out how many lines it currently contains. In that case, you can put the lines in some sort of buffer that you can easily add to and delete from.
Then you do your logging, and when you are done, you can "re-write" the file from the buffer (or only its last 1000 lines).
Other alternatives are discussed in the other answers.
And yeah, try to avoid deleting line by line; it is generally a costly operation.
I've found some similar topics here and on CodeProject:
Small logger class;
Flexible logger class using standard streams in C++
http://www.codeproject.com/Articles/584794/Simple-logger-for-Cplusplus
Hope you find them useful :)
Any time you want to log, you can open the file, read your write index, jump to that position, and write the fixed-width log entry. When your index hits your upper threshold, simply set it back to 0.
There are a lot of caveats with this, though. First, each log entry (assuming you close the file in between) will require an open, a read, a seek, a write, a seek, a write and a close: find your index, go to it, write the new entry, then update your index. You also have the inherent issues of writing a fixed-size data element. And a human reader will depend on your content to know where the "beginning" of the file is; most people expect "line 1" to be the first line.
I'm a much bigger advocate for simply having a few files and "rolling" them, so that each file on its own is coherent, but if you want just one file with a fixed number of lines, the circular buffer idea can work.
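A rough sketch of that layout, assuming 128-byte records with the write index stored as text in record 0 (the file is assumed to be pre-created with blank records; all names are made up):

#include <fstream>
#include <string>

constexpr std::size_t kRecordSize = 128;    // every entry padded/truncated to this width
constexpr std::size_t kMaxRecords = 1000;

// Record 0 holds the write index; records 1..kMaxRecords hold the log entries.
// Assumes the file was pre-created with kMaxRecords + 1 blank records and the
// text "0" at the start of record 0.
void write_log_entry(const std::string& path, std::string entry) {
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);

    std::string header(kRecordSize, ' ');
    f.read(&header[0], kRecordSize);
    std::size_t index = std::stoul(header);                // slot to overwrite next

    entry.resize(kRecordSize - 1, ' ');                     // pad/truncate to fixed width
    entry += '\n';
    f.seekp(static_cast<std::streamoff>((index + 1) * kRecordSize));
    f.write(entry.data(), kRecordSize);

    std::string new_header = std::to_string((index + 1) % kMaxRecords);
    new_header.resize(kRecordSize - 1, ' ');                // advance the index, wrapping
    new_header += '\n';
    f.seekp(0);
    f.write(new_header.data(), kRecordSize);
}

As the answer says, a reader then needs the index (or timestamps in the entries) to know where the logical beginning of the log is.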
When you only want to use one file and the line length is not constant, there is no way to do this without rewriting the whole file.
Depending on how often you are appending to the file, I don't see any problem with doing so. 1000 lines of approximately 100 characters each are only about 100KB, which is not too much. Additionally, you may add some hysteresis.
However:
If the line length is constant (or you hard-limit it to some constant), you could just overwrite the oldest line. But then you have to keep track of the log file positions of the old/new lines.
I would use two files: The first one where you append lines. When the file gets full, rename it to a second one, and fill the first one from the beginning.
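A minimal sketch of that two-file scheme (file names and the 1000-line limit are placeholders):

#include <cstdio>      // std::rename, std::remove
#include <fstream>
#include <string>

constexpr std::size_t kMaxLines = 1000;

// Append to log.txt; when it fills up, it becomes log.old.txt and a fresh
// log.txt is started, so the newest lines are always spread over two files.
void log_line(const std::string& line) {
    static std::size_t count = [] {
        std::ifstream in("log.txt");
        std::string l;
        std::size_t n = 0;
        while (std::getline(in, l)) ++n;        // count existing lines once at startup
        return n;
    }();

    if (count >= kMaxLines) {
        std::remove("log.old.txt");             // drop the previous old file, if any
        std::rename("log.txt", "log.old.txt");
        count = 0;
    }
    std::ofstream("log.txt", std::ios::app) << line << '\n';
    ++count;
}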

Search Large Text File for Thousands of strings

I have a large text file that is 20 GB in size. The file contains lines of text that are relatively short (40 to 60 characters per line). The file is unsorted.
I have a list of 20,000 unique strings. I want to know the offset for each string each time it appears in the file. Currently, my output looks like this:
netloader.cc found at offset: 46350917
netloader.cc found at offset: 48138591
netloader.cc found at offset: 50012089
netloader.cc found at offset: 51622874
netloader.cc found at offset: 52588949
...
360doc.com found at offset: 26411474
360doc.com found at offset: 26411508
360doc.com found at offset: 26483662
360doc.com found at offset: 26582000
I am loading the 20,000 strings into a std::set (to ensure uniqueness), reading a 128MB chunk from the file, using string::find to search for each of the strings, and then starting over with the next 128MB chunk. This works and completes in about 4 days. I'm not concerned about a read boundary potentially breaking a string I'm searching for; if it does, that's OK.
I'd like to make it faster. Completing the search in 1 day would be ideal, but any significant performance improvement would be nice. I prefer to use standard C++ with Boost (if necessary) while avoiding other libraries.
So I have two questions:
Does the 4 day time seem reasonable considering the tools I'm using and the task?
What's the best approach to make it faster?
Thanks.
Edit: Using the Trie solution, I was able to shorten the run-time to 27 hours. Not within one day, but certainly much faster now. Thanks for the advice.
Algorithmically, I think the best way to approach this problem would be to use a tree that stores the strings you want to search for, one character at a time. For example, if you have the following patterns you would like to look for:
hand, has, have, foot, file
The resulting tree would look something like this:
The generation of the tree is worst case O(n), and has a sub-linear memory footprint generally.
Using this structure, you can begin processing your file by reading it in one character at a time and walking the tree.
If you get to a leaf node (the ones shown in red), you have found a match and can store it.
If there is no child node corresponding to the letter you have read, you can discard the current line and begin checking the next line, starting from the root of the tree.
This technique results in linear time, O(n), to check for matches, and it scans the huge 20GB file only once.
Edit
The algorithm described above is certainly sound (it doesn't give false positives) but not complete (it can miss some results). However, with a few minor adjustments it can be made complete, assuming that we don't have search terms with common roots like go and gone. The following is pseudocode of the complete version of the algorithm
tree = construct_tree(['hand', 'has', 'have', 'foot', 'file'])
# Keeps track of the partial matches currently being extended
nodes = []
for character in huge_file:
    next_nodes = []
    for node in nodes:
        if node.has_child(character):
            child = node.get_child(character)
            if child.is_leaf():
                report_match(child)    # You found a match!!
            next_nodes.append(child)
        # otherwise this partial match simply dies
    # a new match may also start at the current character
    if tree.has_child(character):
        next_nodes.append(tree.get_child(character))
    nodes = next_nodes
Note that the list of nodes that has to be checked at each step is at most as long as the longest search term, so it should not add much complexity.
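Since the question is C++, here is a compact sketch of the same idea with a plain map-based trie; unlike the leaf-only check above, testing a "terminal" flag also catches search terms that are prefixes of other terms (all names are made up):

#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct TrieNode {
    std::unordered_map<char, std::unique_ptr<TrieNode>> children;
    bool terminal = false;          // a complete search term ends at this node
    std::string pattern;            // which term (only meaningful if terminal)
};

void insert(TrieNode& root, const std::string& pattern) {
    TrieNode* node = &root;
    for (char c : pattern) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->terminal = true;
    node->pattern = pattern;
}

// Scan the input once, keeping one active trie position per partial match.
void scan(const TrieNode& root, std::istream& in) {
    std::vector<const TrieNode*> active;
    long long offset = -1;
    char c;
    while (in.get(c)) {
        ++offset;
        active.push_back(&root);                     // a new match may start here
        std::vector<const TrieNode*> next;
        for (const TrieNode* node : active) {
            auto it = node->children.find(c);
            if (it == node->children.end()) continue;    // this partial match dies
            const TrieNode* child = it->second.get();
            if (child->terminal)
                std::cout << child->pattern << " found ending at offset: " << offset << '\n';
            next.push_back(child);
        }
        active = std::move(next);
    }
}

The reported offset is where the match ends; subtract pattern.size() - 1 if you want the start offset, as in your current output.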
The problem you describe looks more like a problem with the selected algorithm, not with the technology of choice. 20000 full scans of 20GB in 4 days doesn't sound too unreasonable, but your target should be a single scan of the 20GB and another single scan of the 20K words.
Have you considered looking at some string matching algorithms? Aho–Corasick comes to mind.
Rather than searching for each of the 20,000 strings separately, you can tokenize the input and look each token up in your std::set of strings to be found, which will be much faster. This assumes your strings are simple identifiers, but something similar can be implemented for strings that are sentences: keep a set of the first word of each sentence, and after a successful match verify that it really is the beginning of the whole sentence with string::find.
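A sketch of that tokenize-and-look-up idea for the simple-identifier case; which characters count as part of a token (here letters, digits, '.', '_' and '-') is an assumption you'd adjust to your data:

#include <cctype>
#include <iostream>
#include <set>
#include <string>

// Split the stream into tokens and report every token that is in the search
// set: one set lookup per token instead of 20,000 substring searches per chunk.
void find_tokens(std::istream& in, const std::set<std::string>& wanted) {
    std::string token;
    long long offset = 0, token_start = 0;
    char c;
    while (in.get(c)) {
        bool part_of_token = std::isalnum(static_cast<unsigned char>(c)) ||
                             c == '.' || c == '_' || c == '-';
        if (part_of_token) {
            if (token.empty()) token_start = offset;
            token += c;
        } else if (!token.empty()) {
            if (wanted.count(token))
                std::cout << token << " found at offset: " << token_start << '\n';
            token.clear();
        }
        ++offset;
    }
    if (!token.empty() && wanted.count(token))                  // last token before EOF
        std::cout << token << " found at offset: " << token_start << '\n';
}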

Splitting an ifstream in C++

I'm new to C++ and probably have a silly question. I have an ifstream which I'd like to split approximately in half.
The file in question is a sorted csv and I wish to search on the first value of each line of the file.
Eventually the file will be very large so I am trying to avoid having to read every line of the file.
e.g.
If the file contains 7 lines I'd like to split the ifstream to give 1 stream containing the first 3 lines and 1 stream containing the last 4 lines.
First, use the answer to this question to determine the size of your file. Then divide that number by two. Read the input line by line, and write it to the first output stream; check file.tellg() after each call. Once you're past the half-way point, switch the output to the second file.
This wouldn't split the strings evenly between the files, but the total number of characters in these strings should be close enough, and it wouldn't split your file in the middle of a string.
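A minimal sketch of that, with the output file names as placeholders; it tracks the byte count itself, which amounts to the same thing as checking tellg() after each line:

#include <fstream>
#include <string>

// Copy roughly the first half of the input (by bytes) into out1 and the rest
// into out2, only ever switching at a line boundary.
void split_file(const std::string& in_path,
                const std::string& out1_path, const std::string& out2_path) {
    std::ifstream in(in_path, std::ios::binary);
    in.seekg(0, std::ios::end);
    const std::streamoff half = in.tellg() / 2;   // size of the file divided by two
    in.seekg(0);

    std::ofstream out1(out1_path), out2(out2_path);
    std::string line;
    std::streamoff written = 0;
    while (std::getline(in, line)) {
        (written < half ? out1 : out2) << line << '\n';
        written += static_cast<std::streamoff>(line.size()) + 1;   // +1 for the newline
    }
}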
Think of it as a relational database with one huge table. In order to find a certain piece of data, you can either do a sequential scan over the entire table, or use an index (which must be usable for the type of query you want to perform).
A typical index for a text file would be a list of offsets inside the file, sorted by the index expression. If the csv file is sorted by a specific column already, then the offsets in the index would be ascending, which is useful to know when building the index.
So basically you have to read the file once anyway, to find out where lines end; this is the index for the sort column. To find a particular element, use a binary search, using the index to find individual elements in the data set.
Depending on the data type, you can extend your index to allow for quick comparison without reading the actual data table. For example, in a word list you could keep the first four letters of the word next to the offset, which allows you to get into the right area quickly and only requires data reads for the last accesses (which you can then optimize to a sequential scan, as filesystems handle that a lot better).
The same technique can be applied to the other columns as well; the offsets stored in the index would no longer be ascending in file order, of course.
Since it is CSV data, a special case also applies: if the only index you need is in the same order as the file data itself and the end of a record can be determined easily (that is, either you have a fixed record length, or there is a clear record separator such as an EOL character), then building the actual index can be omitted and the offsets computed or guessed. For fixed-length records, the offset is simply the record length times the record number; for separated records you can jump into the middle of the data and scan for the next terminator (be aware that there are nasty corner cases with binary search here). This does, however, mean that you will always be reading data pages, which is less efficient than just reading the index.

How to read partial data from large text file in C++

I have a big text file with more than 200,000 lines, and I need to read just a few of them. For instance: lines 10,000 to 20,000.
Important: I don't want to open and search through the full file to extract these lines, because of performance issues.
Is this possible?
If the lines are fixed length, then it would be possible to seek to a specific byte position and load just the lines you want. If lines are variable length, the only way to find the lines you're looking for is to parse the file and count the number of end-of-line markers. If the file changes infrequently, you might be able to get sufficient performance by performing this parsing once and then keeping an index of the byte positions of each line to speed future accesses (perhaps writing that index to disk so it doesn't need to be done every time your program is run).
You will have to search through the file to count the newlines, unless you know that all lines are the same length (in which case you could seek to the offset = line_number * line_size_in_bytes, where line_number counts from zero and line_size_in_bytes includes all characters in the line).
If the lines are variable / unknown length then while reading through it once you could index the beginning offset of each line so that subsequent reads could seek to the start of a given line.
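A sketch of such an index, built in one pass and then reused for random access to any line range (the function names are made up):

#include <fstream>
#include <string>
#include <vector>

// One pass over the file, recording the byte offset at which every line starts.
std::vector<std::streamoff> build_line_index(const std::string& path) {
    std::vector<std::streamoff> index{0};        // line 0 starts at offset 0
    std::ifstream in(path, std::ios::binary);
    std::string line;
    while (std::getline(in, line))
        index.push_back(in.tellg());             // position just after this line
    index.pop_back();                            // the last entry is EOF, not a line start
    return index;
}

// Read lines [first, last) without scanning from the beginning of the file.
std::vector<std::string> read_lines(const std::string& path,
                                    const std::vector<std::streamoff>& index,
                                    std::size_t first, std::size_t last) {
    std::vector<std::string> result;
    std::ifstream in(path, std::ios::binary);
    in.seekg(index[first]);
    std::string line;
    for (std::size_t i = first; i < last && std::getline(in, line); ++i)
        result.push_back(line);
    return result;
}

For the range in the question that would be read_lines(path, index, 10000, 20000), counting lines from zero; the index itself can be saved to disk so it only has to be rebuilt when the file changes.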
If these lines are all the same length you could compute an offset for a given line and read just those bytes.
If the lines are varying length then you really have to read the entire file to count how many lines there are. Line terminating characters are just arbitrary bytes in the file.
If the line are fixed length then you just compute the offset, no problem.
If they're not (i.e. a regular CSV file) then you'll need to go through the file, either to build an index or to just read the lines you need. To make the file reading a little faster a good idea would be to use memory mapped files (see the implementation that's part of the Boost iostreams: http://www.boost.org/doc/libs/1_39_0/libs/iostreams/doc/classes/mapped_file.html).
As others noted, if the lines are not of fixed width, it is impossible to do this without building an index. However, if you are in control of the file's format, you can get roughly O(log(size)) instead of O(size) performance when finding the starting line, if you manage to store the number of the line itself on each line, i.e. have the file contents look something like this:
1: val1, val2, val3
2: val4
3: val5, val6
4: val7, val8, val9, val10
With this file format, you can quickly find the needed line by binary search: start by seeking into the middle of the file. Read to the next newline. Then read the line and parse the number. If the number is bigger than the target, repeat the algorithm on the first half of the file; if it is smaller than the target line number, repeat it on the second half of the file.
You'd need to be careful about the corner cases (e.g. the "beginning" and "end" of your range falling on the same line, etc.), but this approach worked excellently for me in the past for parsing log files that had a date in each line (and I needed to find the lines between certain timestamps).
Of course, this still does not beat the performance of the explicitly built index or the fixed-size records.
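A sketch of that binary search over a file whose lines carry their own number (the "N: ..." format shown above); it seeks to the middle, re-aligns to the next line start, parses the number, and narrows the byte range:

#include <fstream>
#include <string>

// Returns the byte offset of the line whose leading number equals `target`,
// or -1 if no such line is found. Assumes every line starts with its number
// and that the numbers increase through the file.
std::streamoff find_numbered_line(std::ifstream& in, long long target) {
    in.seekg(0, std::ios::end);
    std::streamoff lo = 0, hi = in.tellg();

    while (lo < hi) {
        std::streamoff mid = lo + (hi - lo) / 2;
        in.clear();
        in.seekg(mid);
        std::string line;
        if (mid != 0) std::getline(in, line);        // skip the partial line we landed in
        std::streamoff line_start = in.tellg();
        if (line_start == -1 || !std::getline(in, line)) {
            hi = mid;                                // ran off the end: look earlier
            continue;
        }
        long long number = std::stoll(line);         // parse the leading line number
        if (number == target) return line_start;
        if (number > target)
            hi = mid;                                // target is earlier in the file
        else
            lo = line_start + static_cast<std::streamoff>(line.size()) + 1;  // later
    }
    return -1;
}

This assumes plain '\n' line endings (adjust the +1 for CRLF) and that the line number is the first thing std::stoll sees on each line.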