I want to use OpenMP to read a big file from disk that contains lots of lines. One way to do it seems to be the seekg() function. But the headache is that seekg() only supports moving the file index to a particular byte.
This works fine if every line is exactly the same size, but I have no idea how to do it when the line sizes are all different.
So could you give me a hint?
One possibility:
Divide the file into equal-sized chunks based on bytes, one for each parallel task, without regard to line endings.
Have each task seek to the beginning of its chunk, then read and ignore characters until it finds a line ending, so that it can start processing the file at the beginning of a line. (As a special case, the task that starts at offset 0 should not do this, because it's already at the beginning of a line.)
When a task reaches the end of its chunk (i.e. the byte offset where the next chunk begins), it should continue reading past that point to the end of the current line. (As a special case, the end of the last chunk is also the end of the file, so there's nothing to read past that point.)
Basically, you initially choose boundaries based on byte offsets, but then move them forward to coincide with line endings. Each task skips some characters at the beginning of its chunk, and those characters are instead handled by another task reading past the end of the preceding chunk; a sketch follows below.
(I believe this is how Hadoop splits text-based input files by default, BTW.)
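Here is a minimal sketch of the scheme above, using OpenMP. Note that process_line() is a hypothetical placeholder for whatever per-line work you need, and error handling is omitted:

#include <fstream>
#include <limits>
#include <string>
#include <omp.h>

void process_line(const std::string &line);   // hypothetical per-line work

void read_in_parallel(const char *path)
{
    std::ifstream probe(path, std::ios::binary | std::ios::ate);
    const std::streamoff file_size = probe.tellg();

    #pragma omp parallel
    {
        const int nthreads = omp_get_num_threads();
        const int tid      = omp_get_thread_num();
        const std::streamoff begin = file_size * tid / nthreads;
        const std::streamoff end   = file_size * (tid + 1) / nthreads;

        std::ifstream in(path, std::ios::binary);   // one stream per task
        in.seekg(begin);

        // Skip the (probably partial) first line; the preceding task reads
        // past its own end to cover it. The task at offset 0 must not skip.
        if (begin != 0)
            in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

        // Keep reading whole lines while the next line still starts inside
        // this chunk; the last line read may extend past 'end'.
        std::string line;
        while (in.tellg() < end && std::getline(in, line))
            process_line(line);
    }
}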
Related
I have to read in a large file, character by character, and put each character in a map with a corresponding key. My question is: is there a way to read the file into the map once and keep it there, so the program doesn't have to read the whole file character by character every time (which takes forever)?
The characters are used later in the program to do an encode/decode thing.
Well, yes, it's gonna take forever, but you can use a std::unordered_multimap to speed it up a little, by skipping the sort phase.
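A minimal sketch of that suggestion, assuming the key is the character's position in the file (the question doesn't say what the key actually is); the hashed container skips the ordering work that std::multimap would do on every insert:

#include <fstream>
#include <unordered_map>

int main()
{
    std::ifstream in("input.txt", std::ios::binary);   // hypothetical file name
    std::unordered_multimap<long, char> chars;
    char ch;
    long pos = 0;
    while (in.get(ch))
        chars.emplace(pos++, ch);   // O(1) average insert, no sorting
}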
If I want to insert or copy something at the m-th line and n-th character in a file, what should I do? Is there a better way than using getline m times and seekp? Thanks.
Is there a better way than using getline m times and seekp?
Not really! Lines aren't "special" at the operating system level; they're just parts of a text file separated by the newline character. The only way to get to line m of a text file is to read through all of the file until you've seen m - 1 newlines. Your C++ library's getline() function is likely to have a pretty efficient implementation of that operation already, so you're probably best off just using that.
If your application needs to seek to specific lines of a large file many times during a single run, it may make sense to read the whole file into a data structure at startup (e.g., an array of structures, each one representing a single line of text); once you've done this, seeking to a specific line is as easy as an array lookup. But if you only need to seek to a specific line once, that's not necessary.
A more memory-efficient approach for repeated seeks in larger files may be to record the file offset for each line number as you encounter it, so that you can easily return to a given line without starting over from the beginning. Again, though, this is only necessary if seeks will be repeated many times.
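A minimal sketch of that offset-recording idea, assuming the file is opened in binary mode so that tellg() offsets can be used directly with seekg():

#include <fstream>
#include <string>
#include <vector>

class LineIndex {
public:
    explicit LineIndex(const std::string &path) : in_(path, std::ios::binary)
    {
        // One pass over the file, remembering where every line starts.
        std::string line;
        std::streamoff off = in_.tellg();
        while (std::getline(in_, line)) {
            offsets_.push_back(off);   // start offset of the line just read
            off = in_.tellg();
        }
        in_.clear();                   // clear EOF so later seeks work
    }

    // Return line m (0-based) by seeking straight to its recorded offset.
    std::string line(std::size_t m)
    {
        in_.clear();
        in_.seekg(offsets_.at(m));
        std::string s;
        std::getline(in_, s);
        return s;
    }

private:
    std::ifstream in_;
    std::vector<std::streamoff> offsets_;
};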
I need to read a string from the input.
Each string has a length from 2 letters up to 1000 letters.
I only need the first 2 letters, the last 2 letters, and the size of the entire string.
Here is my way of doing it. HOWEVER, I do believe there is a smarter way, which is why I am asking this question. Could you please tell me, an inexperienced and new C++ programmer, what are the possible ways of doing this task better?
Thank you.
string word;
getline(cin, word);
// results - I need only those 5 numbers:
int l = word.length();
int c1 = word[0];
int c2 = word[1];
int c3 = word[l-2];
int c4 = word[l-1];
Why do I need this? I want to encode a huge number of really long strings, but I figured out I really need only those 5 values I mentioned; the rest is redundant. How many words will be loaded? Enough to make this part of the code worth working on :)
I will take you at your word that this is something that is worth optimizing to an extreme. The method you've shown in the question is already the most straightforward way to do it.
I'd start by using memory mapping to map chunks of the file into memory at a time. Then, loop through the buffer looking for newline characters. Take the first two characters after the previous newline and the last two characters before the one you just found. Subtract the address of the previous newline from that of the one you just found to get the length of the line. Rinse, lather, and repeat.
Obviously some care will need to be taken around boundaries, where one newline is in the previous mapped buffer and one is in the next.
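A minimal sketch of the memory-mapping approach, assuming a POSIX system (mmap). For brevity it maps the whole file at once, so the chunk-boundary handling just mentioned isn't shown, and error handling is omitted:

#include <cstring>      // memchr
#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close

void scan_lines(const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    const char *base = static_cast<const char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    const char *line = base;
    const char *end  = base + st.st_size;
    while (line < end) {
        const char *nl = static_cast<const char *>(
            memchr(line, '\n', end - line));
        if (!nl) nl = end;              // final line without a newline
        long len = nl - line;           // length of this line
        if (len >= 2) {
            char first2[2] = { line[0], line[1] };
            char last2[2]  = { nl[-2], nl[-1] };
            // ... record len, first2, and last2 here ...
            (void)first2; (void)last2;
        }
        line = nl + 1;                  // start of the next line
    }
    munmap(const_cast<char *>(base), st.st_size);
    close(fd);
}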
The first two letters are easy to obtain and fast.
The issue is with the last two letters.
In order to read a text line, the input must be scanned until it finds an end-of-line character (usually a newline). Since your text lines vary in length, there is no fast solution here.
You can mitigate the issue by reading in blocks of data from the file into memory and searching memory for the line endings. This avoids a call to getline, and it avoids a double search for the end of line (once by getline and the other by your program).
If you change the input to be fixed-width, this lookup can be sped up.
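A minimal sketch of that block-reading idea; the buffer size is arbitrary, and a line that straddles two blocks is carried over in a small string:

#include <cstring>
#include <fstream>
#include <string>
#include <vector>

void scan_blocks(const char *path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(1 << 20);   // 1 MiB blocks
    std::string carry;                // partial line from the previous block

    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        const char *p   = buf.data();
        const char *end = p + in.gcount();
        while (const char *nl = static_cast<const char *>(
                   memchr(p, '\n', end - p))) {
            carry.append(p, nl);      // carry + [p, nl) is one complete line
            // ... process 'carry' as a full line here ...
            carry.clear();
            p = nl + 1;
        }
        carry.append(p, end);         // stash the unfinished tail
    }
    // Anything left in 'carry' is a final line without a trailing newline.
}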
If you want to optimize this (although I can't imagine why you would want to do that, but surely you have your reasons), the first thing to do is to get rid of std::string and read the input directly. That will spare you one copy of the whole string.
If your input is stdin, you will be slowed down by the buffering too. As has already been said, the best speed would be achieved by reading big chunks from a file in binary mode and doing the end-of-line detection yourself.
At any rate, you will be limited by the I/O bandwidth (disk access speed) in the end.
In the following block of code, I have created a numbers.txt document which has the number 1 written in it. Shouldn't this program print the word OK back an infinite number of times, because it's going past the EOF marker?
while (!sample.eof())
{
    char ch;
    sample.get(ch);               // read one character
    sample.seekp(-1L, ios::cur);  // back up one byte
    sample >> initialnumber;      // read an integer
    sample.seekp(2L, ios::cur);   // skip two bytes forward
    cout << "OK";
}
There is no such thing as an "EOF marker"[1]. EOF is simply a file condition defined by going at or past the end of the file. Whether you seek 1 byte or seek 100000 bytes past the end makes no difference: if your file position pointer is past the end, you are at or beyond the End Of File.
Your code reads a character and then backs up (essentially negating the character read). It then reads an integer and skips two characters past that. This has the effect of always moving forward in the file (even if the integer read fails). Thus, you will eventually hit EOF: there is no infinite loop here.
[1] In DOS days, files could contain the 0x1A byte ("ASCII EOF"), which would cause certain text readers to stop at that byte. The file contents could physically extend beyond this byte, but text utilities could refuse to read past it. However, the standard C++ libraries treat 0x1A like any other character, and will happily read past it.
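For comparison, a minimal sketch of the idiomatic read loop: test the stream itself rather than eof(), so the loop ends as soon as a read fails:

#include <fstream>
#include <iostream>

int main()
{
    std::fstream sample("numbers.txt");
    int initialnumber;
    while (sample >> initialnumber)   // false at EOF or on a parse error
        std::cout << "OK";
}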
I have a relatively simple question. Say I have a file, but I only want to access line X of the file until line Y. What's the easiest way of doing that?
I know I can read in the lines one by one keeping count, until I reach the lines that I actually need, but is there a better more elegant solution?
Thanks.
In C++, no, not really (well, not in any language I'm familiar with, really).
You have to start at the start of the file so you can figure where line X starts (unless it's a fixed-record-length file but that's unlikely for text).
Similarly, you have to do that until you find the last line you're interested in.
You can read characters instead of lines if you're scared of buffer overflow exploits, or you can read in fixed-size blocks and count the newlines for speed, but it all boils down to reading and checking every character (by your code explicitly, or by the language libraries implicitly) to count the newlines.
You can use istream::ignore() to avoid buffering the unneeded input.
#include <istream>
#include <limits>

// Discard characters up to and including each '\n', n times over.
bool skip_lines(std::istream &is, std::streamsize n)
{
    while (is.good() && n--) {
        is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    }
    return is.good();
}
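A hypothetical usage sketch with the helper above, where the file name and the values of X and Y are made up:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream in("big.txt");              // hypothetical file name
    const std::streamsize X = 100, Y = 200;   // made-up 1-based line range

    if (!skip_lines(in, X - 1)) return 1;     // discard lines 1 .. X-1

    std::string line;
    for (std::streamsize i = X; i <= Y && std::getline(in, line); ++i)
        std::cout << line << '\n';            // process lines X .. Y
}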
Search for \n X times, then start 'reading' (whatever processing that entails) until you reach the (Y-X)-th \n or EOF. This assumes Unix-style newlines.
Since you have to detect the end-of-line characters in order to count lines, you still need to iterate over the file. The only optimization I can think of is not to read the file line by line, but to buffer it and then iterate over the buffer, counting the lines.
Using C or C++, there exist functions that you can use to skip a specified number of bytes within a file (fseek in stdio, and seekg/seekp with istreams and ostreams). However, you cannot skip a given number of lines, since each line might have a variable number of characters and it's therefore impossible to calculate the correct offset. A file is not some kind of array where each line occupies a "row": you should rather see it as a continuous sequence of bytes (not talking about hardware here, though...).
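For contrast, a minimal sketch of byte-based seeking: jumping to a byte offset is a single call, but there is no equivalent call for a line number:

#include <fstream>

int main()
{
    std::ifstream in("data.txt", std::ios::binary);  // hypothetical file name
    in.seekg(1024);   // jump straight to byte 1024 -- O(1)
    // ...but to reach line 1024, you must scan for 1023 '\n' characters first.
}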