I'm new to C++ and probably have a silly question. I have an ifstream which I'd like to split approximately in half.
The file in question is a sorted csv and I wish to search on the first value of each line of the file.
Eventually the file will be very large so I am trying to avoid having to read every line of the file.
e.g.
If the file contains 7 lines I'd like to split the ifstream to give 1 stream containing the first 3 lines and 1 stream containing the last 4 lines.
First, use the answer to this question to determine the size of your file. Then divide that number by two. Read the input line by line, and write it to the first output stream; check file.tellg() after each call. Once you're past the half-way point, switch the output to the second file.
This wouldn't split the strings evenly between the files, but the total number of characters in these strings should be close enough, and it wouldn't split your file in the middle of a string.
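A minimal sketch of that approach (the file names are just for illustration):

#include <fstream>
#include <string>

int main() {
    std::ifstream in("input.csv");

    // Find the total size in bytes, then go back to the beginning.
    in.seekg(0, std::ios::end);
    std::streamoff half = static_cast<std::streamoff>(in.tellg()) / 2;
    in.seekg(0, std::ios::beg);

    std::ofstream out1("first_half.csv");
    std::ofstream out2("second_half.csv");
    std::ofstream* out = &out1;

    std::string line;
    while (std::getline(in, line)) {
        *out << line << '\n';
        // Once we are past the half-way byte offset, switch to the second file.
        if (out == &out1 && static_cast<std::streamoff>(in.tellg()) >= half)
            out = &out2;
    }
}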
Think of it as a relational database with one huge table. In order to find a certain piece of data, you can either do a sequential scan over the entire table, or use an index (which must be usable for the type of query you want to perform).
A typical index for a text file would be a list of offsets inside the file, sorted by the index expression. If the csv file is sorted by a specific column already, then the offsets in the index would be ascending, which is useful to know when building the index.
So basically you have to read the file once anyway, to find out where lines end; this is the index for the sort column. To find a particular element, use a binary search, using the index to find individual elements in the data set.
Depending on the data type, you can extend your index to allow for quick comparison without reading the actual data table. For example, in a word list you could keep the first four letters of the word next to the offset, which allows you to get into the right area quickly and only requires data reads for the last accesses (which you can then optimize to a sequential scan, as filesystems handle that a lot better).
The same technique can be applied to the other columns as well; the offsets stored in the index would no longer be ascending in file order, of course.
Since it is CSV data, a special case also applies: if the only index you need is in the same order as the file data itself, and the end of a record is easy to determine (either the records have a fixed length, or there is a clear record separator such as an EOL character), then you can skip building an explicit index and compute or guess the offsets instead. For fixed-length records, the offset of the i-th record is simply i times the record length; for separator-terminated records, you can jump into the middle of a record and scan forward to the next terminator (be aware that there are nasty corner cases with binary search here). This does mean you will always be reading data pages, which is less efficient than reading a compact index.
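As a rough sketch of that idea, here is a one-pass offset index over a CSV sorted by its first column, followed by a binary search that only reads the key column (the file name and lexicographic key ordering are assumptions):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Read only the first (key) column of the line starting at the given offset.
std::string key_at(std::ifstream& file, std::streamoff offset) {
    file.clear();
    file.seekg(offset);
    std::string key;
    std::getline(file, key, ',');
    return key;
}

int main() {
    std::ifstream file("data.csv");

    // Index pass: remember where every line starts.
    std::vector<std::streamoff> index;
    std::string line;
    while (true) {
        std::streamoff pos = file.tellg();
        if (!std::getline(file, line)) break;
        index.push_back(pos);
    }

    // Binary search over the offsets, comparing only the key column.
    std::string target = "178";
    std::size_t lo = 0, hi = index.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (key_at(file, index[mid]) < target) lo = mid + 1;
        else hi = mid;
    }
    bool found = lo < index.size() && key_at(file, index[lo]) == target;
    std::cout << (found ? "found\n" : "not found\n");
}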
I have a txt file with 100,000,000 words, one word per line.
I want to write a function that takes a word as input and searches whether that word is present in the txt file or not.
I have tried this with a map and with a trie, but I'm getting a std::bad_alloc error because of the large number of words. Can anyone suggest how to solve this?
Data structures are quite important when programming. If possible I would recommend something like a binary tree, though that would require sorting the text file first. If you cannot sort the text file, the best you can do is iterate over it until you find the word you want. Also, your question should include more information so we can diagnose your problem more easily.
I assume you want to search this word list over and over, because for a small number of searches a simple linear scan through the file is good enough.
Parsing the word list into a suffix tree takes about 20 times the size of the file, more if not optimized. Since you ran out of memory constructing a trie of the word list, I assume it's really big. So let's not keep it in memory, but preprocess it a bit so you can search faster.
The solution I would propose is to do a dictionary search.
So first turn every whitespace into a newline so you have one word per line instead of multiple lines with multiple words and then sort the file and store it. While you are at it you can remove duplicates. That is our dictionary. Remember the length of the longest word (L) while you do that.
To access the dictionary you need a helper function to read a word at offset X, which can be at the middle of some word. The function should seek to the offset - L and read 2 * L bytes into a buffer. Then from the middle of the buffer search backward and forward to find the word at offset X.
Now to search, you open the dictionary and read the word at offset left = 0 and the word at offset right = size_of_file, i.e. the first and the last word. If your search term is less than the first word or greater than the last word you are done, word not found. If either of them is the search term you are done too.
Next in a binary search you would take the std::midpoint of left and right, read the word at that offset and check if the search term is less or more and recurse into that interval. This would require O(log n) reads to find the word or determine it's not present.
A dictionary search can do better. Instead of using the midpoint you can approximate where the word should be in the dictionary. Say your dictionary goes from "Aal" to "Zoo" and you are searching for "Zebra". Would you open the dictionary in the middle? No, you would open it near the end, because "Zebra" is much closer to "Zoo" than to "Aal". So you need a function that gives you a value M between 0 and 1 for where a search term is located relative to the left and right word. Your "midpoint" for the search is then left + (right - left) * M. Then, like with binary search, determine whether the search term is in the left or right interval and recurse.
A dictionary search takes only log log n reads on average if the word list has reasonably uniform distribution.
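A rough sketch of the two helpers described above, assuming the sorted one-word-per-line dictionary file already exists; the interpolation here only looks at the first character, which is crude but shows the idea:

#include <algorithm>
#include <fstream>
#include <string>

// Read the whole word covering byte offset x, which may be the middle of a word.
// max_len is L, the length of the longest word in the dictionary.
std::string word_at(std::ifstream& dict, std::streamoff x, std::streamoff max_len) {
    std::streamoff start = std::max<std::streamoff>(0, x - max_len);
    dict.clear();
    dict.seekg(start);
    std::string buf(static_cast<std::size_t>(2 * max_len), '\0');
    dict.read(&buf[0], buf.size());
    buf.resize(static_cast<std::size_t>(dict.gcount()));
    std::size_t mid = static_cast<std::size_t>(x - start);
    // From the middle of the buffer, search backward and forward for newlines.
    std::size_t b = buf.rfind('\n', mid);
    std::size_t begin = (b == std::string::npos) ? 0 : b + 1;
    std::size_t end = buf.find('\n', mid);
    if (end == std::string::npos) end = buf.size();
    return buf.substr(begin, end - begin);
}

// Estimate M in [0, 1]: where `target` should sit between `left` and `right`.
double fraction(const std::string& target, const std::string& left, const std::string& right) {
    double a = left.empty() ? 0 : left[0];
    double b = right.empty() ? 0 : right[0];
    double t = target.empty() ? 0 : target[0];
    if (b <= a) return 0.5;
    return std::min(1.0, std::max(0.0, (t - a) / (b - a)));
}

// The search itself probes at offset left + (right - left) * M and then, like
// binary search, recurses into the interval that must contain the search term.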
I'm using a Rope to store a large amount (GB's) of text. The text can be tens of millions of lines long.
The rope itself is extremely fast inserting at any position, and is also fast getting a character at a specific position.
However, how would I get where a specific line (\n for this case) starts? For example, how would I get where line 15 starts? There are a couple options that I can see.
Don't have any extra data. Whenever you want say the 15th line, you iterate through all the characters in the Rope, find the newlines, and when you reach the 15th newline, then you stop.
Store the start and length of each line in a vector. So you would have your Rope data structure containing all the characters, and then a separate std::vector<line>. The line structure would just consist of 2 fields; start and length. Start represents where the line starts inside of the Rope, and length is the length of the line. To get where the 15th line starts, just do lines[14].start
Problems:
#1 is a horrible way to do it. It's extremely slow because you have to go through all of the characters.
#2 is also not good. Although getting where a line starts is extremely fast (O(1)), every time you insert a line you have to shift and update the entries for all the lines after it, which is O(N). Also, storing this means every line takes up an extra 16 bytes of data (assuming start and length are 8 bytes each), so 13,000,000 lines would take up roughly 200MB of extra memory. You could use a linked list instead, but that just makes access slow.
Is there any better & more efficient way of storing the line positions for quick access & insert? (Preferably O(log(n)) for inserting & accessing lines)
I was thinking of using a BST, and more specifically a RB-Tree, but I'm not entirely sure how that would work with this. I saw VSCode do this but with a PieceTable instead.
Any help would be greatly appreciated.
EDIT:
The answer that #interjay provided seems good, but how would I handle CRLF if the CR and LF were split between 2 leaf nodes?
I also noticed ropey, which is a rust library for the Rope. I was wondering if there was something similar but for C++.
In each rope node (both leaves and internal nodes), in addition to holding the number of characters in that subtree, you can also put the total number of newlines contained in the subtree.
Then finding a specific newline will work exactly the same way as finding the node holding a specific character index. You would look at the "number of newlines" field instead of the "number of characters" field.
All rope operations will work mostly the same. When creating a new internal node, you just need to add its children's number of newlines. Complexity of all operations is the same.
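A minimal sketch of such an augmented node and the descent (the node layout here is an assumption for illustration, not any particular library's):

#include <cstddef>
#include <string>

struct RopeNode {
    std::string text;               // leaf text (empty for internal nodes)
    RopeNode* left = nullptr;
    RopeNode* right = nullptr;
    std::size_t char_count = 0;     // characters in this whole subtree
    std::size_t newline_count = 0;  // '\n' characters in this whole subtree
};

// Character offset just after the n-th newline (0-based). Walks newline_count
// the same way a character lookup walks char_count. Assumes internal nodes
// always have two children.
std::size_t offset_after_newline(const RopeNode* node, std::size_t n) {
    std::size_t offset = 0;
    while (node->left) {
        if (n < node->left->newline_count) {
            node = node->left;
        } else {
            n -= node->left->newline_count;
            offset += node->left->char_count;
            node = node->right;
        }
    }
    // Inside the leaf, scan to the (n+1)-th newline; the counts guarantee it exists.
    std::size_t pos = node->text.find('\n');
    while (n-- > 0)
        pos = node->text.find('\n', pos + 1);
    return offset + pos + 1;
}

For example, the start of line 15 would be offset_after_newline(root, 13), i.e. the position just after the 14th newline; line 1 trivially starts at offset 0.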
Let's assume that we have a header-less CSV file (I'm including a header row for clarification but the actual file doesn't contain it):
ID,BookTitle,Author,Price
110,book1,author1,price1
178,book2,author2,price2
917,book3,author3,price3
How can I acquire the ID column without having to read whole rows of data into memory? i.e. read ID: 110 and add to a vector, go to next row (line), read ID: 178 and add to vector, and so on.
You cannot. Files don't have rows and columns. The content is just characters and a \n denotes a line break. Hence, you cannot know where a line starts or ends without reading characters until you find a line break.
Situation is different when lines have a fixed width. Then you can skip ahead and start reading the next line.
For any future readers who might stumble upon this question:
#largest_prime 's answer is the reasonable one. If your row data isn't that large (which is the case 99.99% of the time), it makes perfect sense to read the whole row, extract the ID, and discard the rest of the row data.
However, in the 0.01% of cases where a row might contain big data (such as a whole book's text), #z80crew 's solution might work out nicely.
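A small sketch of the ordinary read-everything-and-keep-only-the-ID approach, assuming integer IDs and an illustrative file name:

#include <fstream>
#include <string>
#include <vector>

int main() {
    std::ifstream file("books.csv");
    std::vector<int> ids;

    std::string id_field, rest_of_row;
    // Read up to the first comma (the ID), then read and throw away the rest of the row.
    while (std::getline(file, id_field, ',') && std::getline(file, rest_of_row))
        ids.push_back(std::stoi(id_field));
}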
I'm very confused with the new programming assignment we got in class a few days ago. It asks us to read info from a file which contains an unknown number of rows and columns and then sort the data. My question is how do I do that?
My reasoning was that if I knew the number of columns, I would just create an array of structures and then create a new structure for each row. But since the number of columns is also unknown, I don't know how to approach this.
Also, we are only allowed to use the <iostream>, <fstream>, <cctype> and <vector> libraries.
You could use a
std::vector<std::vector<WhateverTypeYouWantToStore>>
Use std::vector. You can create a 2D vector like this:
vector<vector<string> > table;
Then read the lines from the file and put each line's data into a one-dimensional vector (vector<string> line). You can then push_back that line vector into the table, like this:
table.push_back(line);
You can see more information about vector on this page: cplusplus.com
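A sketch of that, splitting each line on whitespace (note it also pulls in <string> and <sstream>, which are not on the allowed-header list, so adapt as needed):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::ifstream file("input.txt");
    std::vector<std::vector<std::string>> table;

    std::string row_text;
    while (std::getline(file, row_text)) {
        std::istringstream row_stream(row_text);
        std::vector<std::string> line;
        std::string cell;
        // Each row may have a different number of columns; just take what is there.
        while (row_stream >> cell)
            line.push_back(cell);
        table.push_back(line);
    }
    // table.size() is the number of rows, table[i].size() the columns in row i.
}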
I hope you know the format of the data you are going to read from the text file's rows and columns. To understand it: you read the first row, then the second row, and so on. If you do not know the type of the data, treat all of it as strings of characters. You can then assume that wherever you find the separator character (the null char '\0' in this example) you have reached the end of one column's data in the first row, so keep reading character by character and look for the next '\0'. Wherever you find '\n', that is the end of the first row, and you have just discovered the last column. After '\n' you start reading the 2nd row, and so on. With this you can determine how many rows and columns there are. Keep reading the text file until you reach EOF.
See the attached image ("Text File Format").
Also, you can declare a char pointer and allocate memory for it with realloc(). With realloc() you can grow the buffer as you find more data; please go through the realloc() manual for reference.
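A minimal C-style sketch of that growing buffer (error checks omitted; in modern C++ a std::vector<char> would do the same job more safely):

#include <cstdio>
#include <cstdlib>

int main() {
    std::FILE* f = std::fopen("input.txt", "r");
    if (!f) return 1;

    std::size_t capacity = 1024, length = 0;
    char* data = static_cast<char*>(std::malloc(capacity));

    int c;
    while ((c = std::fgetc(f)) != EOF) {
        if (length == capacity) {
            // Out of room: double the buffer with realloc().
            capacity *= 2;
            data = static_cast<char*>(std::realloc(data, capacity));
        }
        data[length++] = static_cast<char>(c);
    }
    std::fclose(f);
    // ... scan `data` for the separators and '\n' described above ...
    std::free(data);
}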
I have a big text file with more than 200,000 lines, and I need to read just a few of them, for instance lines 10,000 to 20,000.
Important: I don't want to open and search through the full file to extract these lines, because of performance issues.
Is this possible?
If the lines are fixed length, then it would be possible to seek to a specific byte position and load just the lines you want. If lines are variable length, the only way to find the lines you're looking for is to parse the file and count the number of end-of-line markers. If the file changes infrequently, you might be able to get sufficient performance by performing this parsing once and then keeping an index of the byte positions of each line to speed future accesses (perhaps writing that index to disk so it doesn't need to be done every time your program is run).
You will have to search through the file to count the newlines, unless you know that all lines are the same length (in which case you could seek to the offset = line_number * line_size_in_bytes, where line_number counts from zero and line_size_in_bytes includes all characters in the line).
If the lines are variable / unknown length then while reading through it once you could index the beginning offset of each line so that subsequent reads could seek to the start of a given line.
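A sketch of that index-once-then-seek approach, assuming the file has at least 20,000 lines and using the 1-based line numbers from the question:

#include <fstream>
#include <string>
#include <vector>

int main() {
    std::ifstream file("big.txt");

    // One full pass: remember the byte offset where each line starts.
    std::vector<std::streamoff> line_start;
    std::string line;
    while (true) {
        std::streamoff pos = file.tellg();
        if (!std::getline(file, line)) break;
        line_start.push_back(pos);
    }

    // Later: jump straight to line 10,000 and read through line 20,000.
    file.clear();
    file.seekg(line_start[10000 - 1]);
    for (int n = 10000; n <= 20000 && std::getline(file, line); ++n) {
        // process `line`
    }
}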
If these lines are all the same length you could compute an offset for a given line and read just those bytes.
If the lines are varying length then you really have to read the entire file to count how many lines there are. Line terminating characters are just arbitrary bytes in the file.
If the lines are fixed length then you just compute the offset, no problem.
If they're not (i.e. a regular CSV file) then you'll need to go through the file, either to build an index or to just read the lines you need. To make the file reading a little faster a good idea would be to use memory mapped files (see the implementation that's part of the Boost iostreams: http://www.boost.org/doc/libs/1_39_0/libs/iostreams/doc/classes/mapped_file.html).
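For the memory-mapped route, a minimal sketch with Boost.Iostreams (needs linking against boost_iostreams) that builds a line index straight from the mapped bytes:

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>
#include <vector>

int main() {
    boost::iostreams::mapped_file_source file("big.csv");
    const char* data = file.data();
    std::size_t size = file.size();

    // Record where each line starts; the OS pages the file in as it is touched.
    std::vector<std::size_t> line_start(1, 0);
    for (std::size_t i = 0; i + 1 < size; ++i)
        if (data[i] == '\n')
            line_start.push_back(i + 1);
}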
As others noted, if the lines are not of fixed width, it is impossible to do this without building an index. However, if you are in control of the format of the file, you can get roughly O(log(size)) instead of O(size) performance when finding the starting line, if you manage to store the line's own number at the beginning of each line, i.e. make the file contents look something like this:
1: val1, val2, val3
2: val4
3: val5, val6
4: val7, val8, val9, val10
With this format of the file, you can quickly find the needed line by binary search: start with seeking into the middle of the file. Read till the next newline. Then read the line, and parse the number. If the number is bigger than the target, then you need to repeat the algorithm on the first half of the file, if it is smaller than the target line number, then you need to repeat it on the second half of the file.
You'd need to be careful about the corner cases (e.g.: your "beginning" of the range and "end" of the range are on the same line, etc.), but for me this approach worked excellently in the past for parsing the logfiles which had the date in it (and I needed to find the lines that are between the certain timestamps).
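A rough sketch of that search (the very first line of the file is one of those corner cases and needs a direct check at offset 0, which is omitted here):

#include <fstream>
#include <string>

// Find the byte offset where the line whose leading number is `target` starts,
// in a file whose lines look like "4: val7, val8". Returns -1 if not found.
std::streamoff find_line(std::ifstream& file, long target, std::streamoff file_size) {
    std::streamoff lo = 0, hi = file_size;
    while (lo < hi) {
        std::streamoff mid = lo + (hi - lo) / 2;
        file.clear();
        file.seekg(mid);
        std::string skip, line;
        std::getline(file, skip);                     // finish the partial line we landed in
        std::streamoff line_start = file.tellg();
        if (line_start < 0 || !std::getline(file, line)) { hi = mid; continue; }
        long number = std::stol(line);                // parses the leading "N"
        if (number == target) return line_start;
        if (number < target) lo = line_start;
        else hi = mid;
    }
    return -1;
}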
Of course, this still does not beat the performance of the explicitly built index or the fixed-size records.