I have a very long file in which, if it helps, every line can be assumed to have the same format. I want to read a specific line of the file. Is it possible in C++ to move the file pointer to that line via binary search, instead of starting at the top of the file, reading line by line, and counting lines? That is, is it possible to access some line_of_file pointer and move it by binary search? If not in C++, is this task possible in assembly language or some other language?
You cannot usefully use binary search to find a line by line number in a text file, because text files are not indexed by line number. In other words, there is no way to figure out the line number of a given offset in the file other than starting at the beginning, reading every character, and counting the newline characters.
There is only one exception, and in that case binary search is unnecessary. If every line in the file is exactly the same length, then you can find the offset of a specific line by multiplying that length by the line number (using 0 as the number of the first line). Don't forget to include the newline character in the line length. You can use istream::seekg or ostream::seekp to position the next input or output operation, respectively. (You need to use the two-argument version. Some other warnings apply on platforms which translate newline characters to multicharacter sequences; here's looking at you, Windows.)
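A minimal sketch of the fixed-length case, assuming every line occupies exactly `line_length` bytes including its newline. The helper name `read_fixed_line` and its parameters are made up for illustration; the file is opened in binary mode so that newline translation cannot shift the byte offsets.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reads line `line_number` (0-based) from a file in which
// every line occupies exactly `line_length` bytes, newline included.
bool read_fixed_line(const std::string& path,
                     std::streamoff line_number,
                     std::streamoff line_length,
                     std::string& out) {
    // Binary mode: no newline translation, so offsets are plain byte counts.
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    // Two-argument seekg: an offset measured from the beginning of the file.
    in.seekg(line_number * line_length, std::ios::beg);

    // getline stops at '\n' and does not store it in `out`.
    return static_cast<bool>(std::getline(in, out));
}
```

For a file containing "aaa\nbbb\nccc\n", calling `read_fixed_line(path, 1, 4, s)` should leave "bbb" in `s`.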
I would like to read a text file in C++ in the following manner:
Ignore the entire first line as it is simply meant as an introduction.
Only read the following lines from a specific position.
That starting position for reading is a fixed one and remains the same for every line; however, the numbers after that may be of variable length. I need to save all of these numbers from line 2 to line n into an Array.
At the moment I can read a regular 2D Array with getline.
How can I work around these things?
An example for a line I want to read could be:
Person1: 25 988.3 0.0023 7
To set the file to a position, use std::ifstream::seekg().
To set the file to the beginning of a line, you must read and count the line endings. Many text files have variable length text lines.
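For the file described in the question, plain sequential reading may be enough. Here is a rough sketch, assuming the numbers on each data line begin at a known column; `read_numbers` and `start_column` are hypothetical names, not something from the original post.

```cpp
#include <cstddef>
#include <fstream>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: skips the first line, then parses the numbers found on
// every following line starting at 0-based column `start_column`.
std::vector<std::vector<double>> read_numbers(const std::string& path,
                                              std::size_t start_column) {
    std::vector<std::vector<double>> rows;
    std::ifstream in(path);

    // Discard the introductory first line without storing it.
    in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

    std::string line;
    while (std::getline(in, line)) {
        if (line.size() <= start_column) continue;  // nothing after the label

        // Parse whitespace-separated numbers of any width.
        std::istringstream numbers(line.substr(start_column));
        std::vector<double> row;
        double x;
        while (numbers >> x) row.push_back(x);
        rows.push_back(row);
    }
    return rows;
}
```

With the example line `Person1: 25 988.3 0.0023 7`, a `start_column` of 8 skips the `Person1:` label and yields a row of four numbers.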
How can I work around these things?
You can't, unless you can ensure that all of the data lines after the first line are all the same length.
If you can't ensure that, then all you can do is read through all of the preceding lines.
An alternative I have employed in the past is to generate an 'index' of line start positions in a secondary file in binary format (so that I CAN jump directly to the right place in that file), and use that to jump to the right place in the text file. Of course that means that you need to regenerate that index file every time you replace/amend the data file.
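A sketch of such an index file, under the assumption that lines end with a single '\n' and that one 64-bit offset per line is an acceptable index layout. The function names `build_index` and `read_indexed_line` are invented for this example, and the index is written in the host's byte order, so it is not portable between machines.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Scans the text file once and writes the byte offset of every line start
// to a binary index file, one 64-bit value per line (host byte order).
bool build_index(const std::string& text_path, const std::string& index_path) {
    std::ifstream in(text_path, std::ios::binary);
    std::ofstream idx(index_path, std::ios::binary | std::ios::trunc);
    if (!in || !idx) return false;

    std::uint64_t offset = 0;
    std::string line;
    while (std::getline(in, line)) {
        idx.write(reinterpret_cast<const char*>(&offset), sizeof offset);
        offset += line.size();
        if (!in.eof()) offset += 1;  // count the '\n' that getline consumed
    }
    return static_cast<bool>(idx);
}

// Looks up line `line_number` (0-based) through the index, then reads it.
bool read_indexed_line(const std::string& text_path,
                       const std::string& index_path,
                       std::uint64_t line_number,
                       std::string& out) {
    std::ifstream idx(index_path, std::ios::binary);
    std::ifstream in(text_path, std::ios::binary);
    if (!idx || !in) return false;

    // Index entries have a fixed size, so entry N sits at byte N * sizeof(entry).
    std::uint64_t offset = 0;
    idx.seekg(static_cast<std::streamoff>(line_number * sizeof offset),
              std::ios::beg);
    if (!idx.read(reinterpret_cast<char*>(&offset), sizeof offset)) return false;

    in.seekg(static_cast<std::streamoff>(offset), std::ios::beg);
    return static_cast<bool>(std::getline(in, out));
}
```

The index itself is a fixed-length-record file, which is why it can be seeked into directly even though the text file cannot.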
I have to extract information from a text file.
In the text file there is a list of strings.
This is an example of a string: AAA101;2015-01-01 00:00:00;0.784
The value after the last ; is a non-integer value that changes from line to line, so every line has a different length in characters.
I want to map all of these lines into a structured vector so that I can access a specific line whenever I need it without scanning the whole file again.
I did some research and found some threads about a command called, which lets me reach a specific line of a text file, but I read that it only works if every line has the same length in characters as the others.
I was thinking about converting all the lines in the file to a proper format so that I can map the file as I want, but I hope there is a better and quicker way.
You can try TStringList*. It creates a list of AnsiStrings. Then each AnsiString can be accessed via ->operator [](numberOfTheLine).
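A rough standard-library alternative, in case TStringList is not available: read the file once and keep the parsed lines in a std::vector. The struct `Record`, its field names, and `load_records` are assumptions made for this sketch, based only on the sample line in the question.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record layout for lines like "AAA101;2015-01-01 00:00:00;0.784".
struct Record {
    std::string id;         // e.g. "AAA101"
    std::string timestamp;  // e.g. "2015-01-01 00:00:00"
    double value;           // e.g. 0.784
};

// Reads the whole file once; afterwards records[i] is line i (0-based).
// Lines that do not match the expected layout are skipped.
std::vector<Record> load_records(const std::string& path) {
    std::vector<Record> records;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        Record r{};
        std::string value_text;
        if (std::getline(fields, r.id, ';') &&
            std::getline(fields, r.timestamp, ';') &&
            std::getline(fields, value_text)) {
            try {
                r.value = std::stod(value_text);
            } catch (...) {
                continue;  // not a number: skip this line
            }
            records.push_back(r);
        }
    }
    return records;
}
```

Because the lines live in memory after the single pass, their differing lengths no longer matter for random access.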
Is there a fast way to get a line from a text file by the line number? If I wanted only line 20, is there anything that will allow me to do something like get line 20? I know getline(in, line) reads in each line one at a time, but I'd rather not call getline 20 times to get the 20th line.
Thanks!
No, there is no fast and magical method.
Background
Text file records are variable length. Each text line may vary in the number of characters. Fixed records are easy since their length is known.
To find the Nth record, you have to find the beginnings or endings of the text records. This is often performed by searching for a newline character. Still tedious.
Converting to Random Access
If the data is requested many times, a map or dictionary from record line number to file position would be handy. Look up the line number, retrieve the file position, then set the file pointer to that position.
Memory mapped file
If there is enough memory, the file could be read and stored in memory.
However, one still has to search for the newlines and count them to find line X.
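To illustrate that last point, here is a small sketch of finding line X in text that is already in memory. It assumes the whole file content sits in a std::string and that lines end with '\n'; `nth_line` is a name made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies line `line_number` (0-based) of `text` into `out`.
// Even in memory, the newlines before the wanted line must be found and counted.
bool nth_line(const std::string& text, std::size_t line_number, std::string& out) {
    std::size_t start = 0;
    for (std::size_t i = 0; i < line_number; ++i) {
        const std::size_t newline = text.find('\n', start);
        if (newline == std::string::npos) return false;  // fewer lines than asked
        start = newline + 1;
    }
    if (start >= text.size()) return false;  // nothing after the last newline

    const std::size_t end = text.find('\n', start);
    out = text.substr(start, end == std::string::npos ? std::string::npos
                                                      : end - start);
    return true;
}
```

The search is fast because it runs over memory, but it is still linear in the number of bytes before the requested line.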
Summary
There is no fast method to find the start of a text line in a file the first time. In any case, the text must be searched for newlines and the newlines counted.
There are methods to speed up the process, but those involve reading the file one or more times. The mapping of line numbers to file positions is fast but requires an initial scan. Loading the file into memory (memory mapping) requires reading the file into memory (first read) then searching the memory; also, the OS may only load portions of file that are requested and not the entire file.
No, you have to use a loop that will advance to the next line twenty times.
The reason it is not possible to do what you want is the way the file is structured: It's a sequence of bytes, and a new line is just another byte (or a sequence of two bytes, by the Windows convention).
I'm trying to store a large list of prime numbers in a text file, and if I end my program I need to be able to read the last line of the file to see where I left off. I don't know how to read the last line without reading every line of the file first.
I don't know either. Just write the last value into a separate file, and read that value to know where to resume.
You could use seekg() to jump to a point near the end of the file and guess how far back the last line starts. If there's a newline between your point and the end of the file, then you're in the next-to-last line and know what the last line is.
But Pete Becker's solution is a lot nicer; I'd go with that instead.
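You can calculate the byte offset of your numbers. Note that this only works if the numbers are stored as fixed-size binary integers rather than as text.
For example, say you have 5 numbers and you want to read the last one.
One integer is 4 bytes, so you can skip 4*4=16 bytes from the start of the file using fseek. After that you can read the last number.
fseek (file , 16 , SEEK_SET);
SEEK_SET means the beginning of the file.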
Seek to the very end of the file and read backwards until you find a newline character, which marks the start of the last line.
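A sketch of that backward scan, assuming lines end with '\n' and that a single trailing newline at the end of the file should be ignored. The name `read_last_line` is made up here. Seeking before every byte is slow, so a real version would probably read backwards in larger chunks.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: reads the last line of a file by scanning backwards
// from the end until a '\n' is found.
bool read_last_line(const std::string& path, std::string& out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    in.seekg(0, std::ios::end);
    std::streamoff end = in.tellg();  // file size in bytes
    if (end <= 0) return false;       // empty or unreadable file

    // Skip one trailing newline so "...\n7\n" still yields "7".
    char c = 0;
    in.seekg(end - 1, std::ios::beg);
    in.get(c);
    if (c == '\n') --end;

    // Walk backwards until the byte before `start` is a newline.
    std::streamoff start = end;
    while (start > 0) {
        in.seekg(start - 1, std::ios::beg);
        in.get(c);
        if (c == '\n') break;
        --start;
    }

    // Read the bytes in [start, end), which form the last line.
    out.resize(static_cast<std::size_t>(end - start));
    in.seekg(start, std::ios::beg);
    if (!out.empty()) {
        in.read(&out[0], static_cast<std::streamsize>(out.size()));
    }
    return static_cast<bool>(in);
}
```

On a file written with Windows line endings, the result would still carry a trailing '\r' that the caller has to strip.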
If you know the maximum length of a line, this is easy.
Just seek to the position that is the end of the file minus this value.
Start reading lines from there and put them in a buffer. Clear the buffer whenever the previous character was a newline.
When you run out of file, the buffer will contain the last line.
If you do not know the maximum length, you can always read the file backwards.
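The known-maximum-length variant could look roughly like this. The sketch assumes `max_line_length` really is an upper bound on the line length without the newline; if it is too small, the result is only the tail of the last line. `read_last_line_bounded` is a hypothetical name.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reads the last line, assuming no line is longer than
// `max_line_length` bytes (not counting its newline).
bool read_last_line_bounded(const std::string& path,
                            std::streamoff max_line_length,
                            std::string& out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size <= 0) return false;

    // Back up far enough to cover the whole last line plus its newline.
    std::streamoff start = size - (max_line_length + 1);
    if (start < 0) start = 0;
    in.seekg(start, std::ios::beg);

    // Read forward; each successful getline replaces the buffer, so the
    // buffer holds the last line once the file runs out.
    std::string line;
    bool found = false;
    while (std::getline(in, line)) {
        out = line;
        found = true;
    }
    return found;
}
```

Only the final `max_line_length + 1` bytes are read, no matter how large the file is.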
I have a big text file with more than 200.000 lines, and I need to read just a few lines. For instance: line 10.000 to 20.000.
Important: I don't want to open and search the full file to extract these lines because of performance issues.
Is this possible?
If the lines are fixed length, then it would be possible to seek to a specific byte position and load just the lines you want. If lines are variable length, the only way to find the lines you're looking for is to parse the file and count the number of end-of-line markers. If the file changes infrequently, you might be able to get sufficient performance by performing this parsing once and then keeping an index of the byte positions of each line to speed future accesses (perhaps writing that index to disk so it doesn't need to be done every time your program is run).
You will have to search through the file to count the newlines, unless you know that all lines are the same length (in which case you could seek to the offset = line_number * line_size_in_bytes, where line_number counts from zero and line_size_in_bytes includes all characters in the line).
If the lines are variable / unknown length then while reading through it once you could index the beginning offset of each line so that subsequent reads could seek to the start of a given line.
If these lines are all the same length you could compute an offset for a given line and read just those bytes.
If the lines are varying length then you really have to read the entire file to count how many lines there are. Line terminating characters are just arbitrary bytes in the file.
If the lines are fixed length then you just compute the offset, no problem.
If they're not (i.e. a regular CSV file) then you'll need to go through the file, either to build an index or to just read the lines you need. To make the file reading a little faster a good idea would be to use memory mapped files (see the implementation that's part of the Boost iostreams: http://www.boost.org/doc/libs/1_39_0/libs/iostreams/doc/classes/mapped_file.html).
As others noted, if your lines are not of fixed width, it is impossible to do this without building an index. However, if you are in control of the format of the file, you can get ~O(log(size)) instead of O(size) performance in finding the start line, if you manage to store the number of the line itself on each line, i.e. to have the file contents look something like this:
1: val1, val2, val3
2: val4
3: val5, val6
4: val7, val8, val9, val10
With this format of the file, you can quickly find the needed line by binary search: start by seeking into the middle of the file. Read till the next newline. Then read the line, and parse the number. If the number is bigger than the target, you need to repeat the algorithm on the first half of the file; if it is smaller than the target line number, you need to repeat it on the second half of the file.
You'd need to be careful about the corner cases (e.g. your "beginning" of the range and "end" of the range are on the same line), but for me this approach worked excellently in the past for parsing log files that had the date in them (and I needed to find the lines between certain timestamps).
Of course, this still does not beat the performance of the explicitly built index or the fixed-size records.
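Here is one way the binary search over a line-numbered file might be sketched. It assumes every line starts with its own ascending number (as in `3: val5, val6`) and ends with a single '\n'. The name `find_numbered_line` and the way the corner cases are handled are my own choices, not necessarily what the answer's author used.

```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// Hypothetical helper: finds the line whose leading number equals `target`
// by binary search over byte offsets.
bool find_numbered_line(const std::string& path, long target, std::string& out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    in.seekg(0, std::ios::end);
    std::streamoff lo = 0;           // always the start of a line
    std::streamoff hi = in.tellg();  // the target line starts in [lo, hi)

    std::string line;
    while (lo < hi) {
        const std::streamoff mid = lo + (hi - lo) / 2;

        // Find the first line start at or after mid. Seeking to mid - 1
        // covers the case where mid itself is already a line start.
        std::streamoff start = mid;
        if (mid > 0) {
            in.clear();
            in.seekg(mid - 1, std::ios::beg);
            std::getline(in, line);  // skip the rest of the current line
            start = mid - 1 + static_cast<std::streamoff>(line.size()) + 1;
        }
        if (start >= hi) {  // no line begins in [mid, hi)
            hi = mid;
            continue;
        }

        in.clear();
        in.seekg(start, std::ios::beg);
        if (!std::getline(in, line)) return false;
        const long number = std::strtol(line.c_str(), nullptr, 10);

        if (number == target) {
            out = line;
            return true;
        }
        if (number < target) {
            // Continue after this line (+1 for its newline).
            lo = start + static_cast<std::streamoff>(line.size()) + 1;
        } else {
            hi = mid;  // the target starts before mid
        }
    }
    return false;
}
```

Each iteration either returns, shrinks `hi` to `mid`, or moves `lo` past a whole line, so the search range always gets smaller. The same idea works for timestamps if the number comparison is replaced by a date comparison.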