Reading line X until line Y from file in C++

I have a relatively simple question. Say I have a file but I only want to access line X of the file until line Y. What's the easiest way of doing that?
I know I can read in the lines one by one, keeping count until I reach the lines that I actually need, but is there a better, more elegant solution?
Thanks.

In C++, no, not really (well, not in any language I'm familiar with, really).
You have to start at the start of the file so you can figure where line X starts (unless it's a fixed-record-length file but that's unlikely for text).
Similarly, you have to do that until you find the last line you're interested in.
You can read characters instead of lines if you're scared of buffer overflow exploits, or you can read in fixed-size blocks and count the newlines for speed, but it all boils down to reading and checking every character (by your code explicitly or the language libraries implicitly) to count the newlines.

You can use istream::ignore() to avoid buffering the unneeded input.
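#include <istream>
#include <limits>

// Discards the next n lines; returns true if the stream is still in a good state afterwards.
bool skip_lines(std::istream &is, std::streamsize n)
{
    while (is.good() && n-- > 0) {
        is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    }
    return is.good();
}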

Skip past the first X-1 newlines, then start 'reading' (whatever processing that entails) until you have passed another Y-X+1 newlines or reach EOF. This assumes Unix-style newlines.

Since you have to find the end-of-line characters in order to count lines, you still need to iterate over your file. The only optimization I can think of is to not read the file line by line, but to read it into a buffer and then iterate over that, counting the lines.

In C or C++, there are functions you can use to skip a specified number of bytes within a file (fseek in stdio, and seekg and seekp for istreams and ostreams respectively). However, you cannot specify a number of lines, since each line may have a different number of characters, so it is impossible to calculate the correct offset in advance. A file is not some kind of array where each line occupies a "row": you should rather see it as one continuous memory space (not talking about hardware here, though...).

Related

C++ how to align each line for a file?

I want to use OpenMP to read a big file from disk that contains lots of lines. One way to do it seems to be the seekg() function. But the headache is that seekg() only supports moving the file position to a particular byte offset.
This works fine if every line is exactly the same size. But I have no idea how to do it if the lines all have different sizes.
So could you give me some hint?
One possibility:
Divide the file into equal-sized chunks based on bytes, one for each parallel task, without regard to line endings.
Have each task seek to the beginning of its chunk, then read and ignore characters until it finds a line ending, so that it can start processing the file at the beginning of a line. (As a special case, the task that starts at offset 0 should not do this, because it's already at the beginning of a line.)
When a task reaches the end of its chunk (i.e. the byte offset where the next chunk begins), continue reading past that point to the end of the current line. (As a special case, the end of the last chunk is also the end of the file, so there's nothing to read past that point.)
Basically, you initially choose boundaries based on byte offsets, but then move them forward to coincide with line endings. Each task skips some characters at the beginning of its chunk, and those characters are instead handled by another task reading past the end of the preceding chunk.
(I believe this is how Hadoop splits text-based input files by default, BTW.)

How to go to the m-th line and n-th character of a file?

If I want to insert or copy something at the m-th line and n-th character in a file, what should I do? Is there a better way than calling getline m times and then using seekp? Thanks.
Is there a way better than using getline for m times and seekp?
Not really! Lines aren't "special" at the operating system level; they're just parts of a text file separated by the newline character. The only way to get to line m of a text file is to read through all of the file until you've seen m - 1 newlines. Your C++ library's getline() function is likely to have a pretty efficient implementation of that operation already, so you're probably best off just using that.
If your application needs to seek to specific lines of a large file many times during a single run, it may make sense to read the whole file into a data structure at startup (e.g., an array of structures, each one representing a single line of text); once you've done this, seeking to a specific line is as easy as an array lookup. But if you only need to seek to a specific line once, that's not necessary.
A more memory-efficient approach for repeated seeks in larger files may be to record the file offset for each line number as you encounter it, so that you can easily return to a given line without starting over from the beginning. Again, though, this is only necessary if seeks will be repeated many times.

Fast way to get two first and last characters of a string from the input

I need to read a string from the input
a string has its length from 2 letters up to 1000 letters
I only need 2 first letters, 2 last letters, and the size of the entire string
Here is my way of doing it. However, I believe there is a smarter way, which is why I am asking this question. As an inexperienced, new C++ programmer, I would like to know: what are the possible ways of doing this task better?
Thank you.
string word;
getline(cin, word);
// results - I need only those 5 numbers:
int l = word.length();
int c1 = word[0];
int c2 = word[1];
int c3 = word[l-2];
int c4 = word[l-1];
Why do I need this? I want to encode a huge number of really long strings, but I figured out I really need only those 5 values I mentioned, the rest is redundant. How many words will be loaded? Enough to make this part of code worth working on :)
I will take you at your word that this is something that is worth optimizing to an extreme. The method you've shown in the question is already the most straight-forward way to do it.
I'd start by using memory mapping to map chunks of the file into memory at a time. Then, loop through the buffer looking for newline characters. Take the first two characters after the previous newline and the last two characters before the one you just found. Subtract the address of the first newline from the address of the second (and subtract one more for the newline itself) to get the length of the line. Rinse, lather, and repeat.
Obviously some care will need to be taken around boundaries, where one newline is in the previous mapped buffer and one is in the next.
The first two letters are easy to obtain and fast.
The issue is with the last two letters.
In order to read a text line, the input must be scanned until it finds an end-of-line character (usually a newline). Since your text lines are variable, there is no fast solution here.
You can mitigate the issue by reading in blocks of data from the file into memory and searching memory for the line endings. This avoids a call to getline, and it avoids a double search for the end of line (once by getline and the other by your program).
If you change the input to be fixed width, this can be sped up.
If you want to optimize this (although I can't imagine why you would want to do that, but surely you have your reasons), the first thing to do is to get rid of std::string and read the input directly. That will spare you one copy of the whole string.
If your input is stdin, you will be slowed down by the buffering too. As has already been said, the best speed would be achieved by reading big chunks from a file in binary mode and doing the end-of-line detection yourself.
At any rate, you will be limited by the I/O bandwidth (disk access speed) in the end.

How to remove the first N lines of a std::istream (e.g. std::stringstream)? [duplicate]

I'm using std::getline() to read lines from an std::istream-derived class, how can I move forward a few lines?
Do I have to just read and discard them?
No, you don't have to use getline.
The more efficient way is to skip lines with std::istream::ignore:
for (int currLineNumber = 0; currLineNumber < startLineNumber; ++currLineNumber) {
    if (addressesFile.ignore(std::numeric_limits<std::streamsize>::max(), addressesFile.widen('\n'))) {
        // just skipping the line
    } else {
        return HandleReadingLineError(addressesFile, currLineNumber);
    }
}
HandleReadingLineError is not standard but hand-made, of course.
The first parameter is the maximum number of characters to extract. If this is exactly numeric_limits<streamsize>::max(), there is no limit.
Link at cplusplus.com: std::istream::ignore
If you are going to skip a lot of lines, you definitely should use it instead of getline: when I needed to skip 100000 lines in my file, it took about a second, as opposed to 22 seconds with getline.
Edit: You can also use std::istream::ignore, see https://stackoverflow.com/a/25012566/492336
Do I have to use getline the number of lines I want to skip?
No, but it's probably going to be the clearest solution to those reading your code. If the number of lines you're skipping is large, you can improve performance by reading large blocks and counting newlines in each block, stopping and repositioning the file to the last newline's location. But unless you are having performance problems, I'd just put getline in a loop for the number of lines you want to skip.
Yes, use std::getline unless you know the location of the newlines.
If for some strange reason you happen to know where the newlines appear, then you can use ifstream::seekg first.
You can read in other ways such as ifstream::read but std::getline is probably the easiest and most clear solution.
