I'm using ifstream to parse a file in C++ code. I'm not able to use seekg() and tellg() to jump to a particular line of the file.
In particular, I would like to read a line with getline from a particular position in the file, a position saved in a previous iteration.
You just have to skip the required number of lines.
The best way to do that is to skip whole lines with std::istream::ignore:
for (int currLineNumber = 0; currLineNumber < startLineNumber; ++currLineNumber) {
    if (addressesFile.ignore(std::numeric_limits<std::streamsize>::max(),
                             addressesFile.widen('\n'))) {
        // just skipping the line
    } else {
        // TODO: handle the error
    }
}
The first argument is the maximum number of characters to extract. If this is exactly std::numeric_limits<std::streamsize>::max(), there is no limit.
You should use it instead of std::getline for skipping because of better performance: ignore discards characters without copying them into a string.
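For example, a minimal sketch putting it together (the file name and target line number are made up for illustration):

#include <fstream>
#include <iostream>
#include <limits>
#include <string>

int main() {
    std::ifstream addressesFile("addresses.txt"); // hypothetical file name
    const int startLineNumber = 42;               // hypothetical target line

    // Skip startLineNumber lines by discarding everything up to each '\n'.
    for (int i = 0; i < startLineNumber; ++i) {
        if (!addressesFile.ignore(
                std::numeric_limits<std::streamsize>::max(), '\n')) {
            std::cerr << "file has fewer than " << startLineNumber << " lines\n";
            return 1;
        }
    }

    // tellg() at this point yields a position you can save and later
    // jump back to with seekg(), as the question asks.
    std::string line;
    if (std::getline(addressesFile, line))
        std::cout << line << '\n';
}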
It seems there is no dedicated C++ function, a "seekline" so to speak, for your needs, and I see two ways to solve this task:
First, you can pad every line of the text file with spaces so that all lines have a constant length L. Then, to seek to line N, just call seekg with an offset of L * N.
The second method is more complicated. You can create an auxiliary binary file in which every entry keeps the length of the corresponding line of the source file. This auxiliary file is a kind of database. You then load it into an array in your program during the initialization phase. The offset of line N in the text file is the sum of the first N array elements. Of course, the auxiliary file and the source file must be updated together.
The first approach is more efficient if padding does not inflate the file size too much. The second gives the best performance for long text files with rare edit operations.
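A minimal sketch of the second idea, keeping the per-line lengths in an in-memory vector rather than a one-byte-per-line file (a single byte would overflow on lines longer than 255 characters); it assumes the file uses Unix line endings and is opened in binary mode:

#include <fstream>
#include <numeric>
#include <string>
#include <vector>

// Build the "database": the length of every line, including its '\n'.
std::vector<std::size_t> buildLineLengths(std::ifstream& file) {
    std::vector<std::size_t> lengths;
    std::string line;
    while (std::getline(file, line))
        lengths.push_back(line.size() + 1); // +1 for the '\n' getline removed
    file.clear();                           // reset EOF so the stream is reusable
    return lengths;
}

// The offset of line n is the sum of the lengths of the first n lines.
std::string readLineN(std::ifstream& file,
                      const std::vector<std::size_t>& lengths, std::size_t n) {
    std::streamoff offset = std::accumulate(
        lengths.begin(), lengths.begin() + n, std::streamoff{0});
    file.seekg(offset);
    std::string line;
    std::getline(file, line);
    return line;
}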
Related
I would like to read a text file in C++ in the following manner:
Ignore the entire first line as it is simply meant as an introduction.
Only read the following lines from a specific position.
That starting position for reading is fixed and the same for every line; however, the numbers after it may be of variable length. I need to save all of these numbers from line 2 to line n into an array.
At the moment I can read a regular 2D array with getline.
How can I work around these things?
An example for a line I want to read could be:
Person1: 25 988.3 0.0023 7
To set the file to a position, use std::ifstream::seekg().
To set the file to the beginning of a line, you must read and count the line endings. Many text files have variable length text lines.
How can I work around these things?
You can't, unless you can ensure that all of the data lines after the first line are all the same length.
If you can't ensure that, then all you can do is read through all of the preceding lines.
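For the format shown above, a sketch that skips the introductory first line and then parses the numbers after the label on each following line (the prefix width of 8 characters matches "Person1:" and is an assumption; adjust it to your real data):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("data.txt");          // hypothetical file name
    std::string line;
    std::getline(in, line);                // ignore the entire first line

    std::vector<std::vector<double>> rows;
    const std::size_t prefix = 8;          // assumed fixed label width ("Person1:")
    while (std::getline(in, line)) {
        if (line.size() <= prefix) continue;
        std::istringstream fields(line.substr(prefix));
        std::vector<double> numbers;
        double value;
        while (fields >> value)            // numbers may vary in count and length
            numbers.push_back(value);
        rows.push_back(std::move(numbers));
    }
    std::cout << "read " << rows.size() << " data lines\n";
}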
An alternative I have employed in the past is to generate an 'index' of line start positions in a secondary file, in binary format, so that I CAN jump directly to the right place in that file, and then use it to jump to the right place in the text file. Of course, that means you need to regenerate the index file every time you replace or amend the data file.
I want to access the last 6 lines in a text file using C++. Can anyone provide me with code that reaches them in constant time? Thanks in advance. :)
fstream myfile("test.txt");
myfile.seekg(-6,ios_base::end);
string line;
vector<string> vect;
const size_t VSIZE = 6;
while (getline(myfile, line))
{
    if (vect.size() != VSIZE)
    {
        vect.push_back(line);
    }
    else
    {
        vect.erase(vect.begin());
        vect.push_back(line);
    }
}
It seems not to be working... and VSIZE is 6... please provide me with help and working code.
This line:
myfile.seekg(-6,ios_base::end);
seeks to the 6th byte before the end of the file, not 6 lines before it. You need to count the newlines backwards from the end, or start from the beginning. Your code should work if you simply remove the line above.
This is quite a hard thing to do, and there are several edge cases to consider.
Broadly the strategy is:
Open the file in binary mode so you see every byte.
Seek to (end - N), where N is the size of an arbitrary buffer. About 1K should do it.
Read N bytes into a buffer. Scan backwards looking for LF characters ('\n'). Skip the one at the end, if there is one.
Each line starts just after an LF, so count the lines backwards until you get to 6.
If you don't find 6 then seek backwards another N bytes, read another buffer and continue the scan.
If you reach the beginning of the file, stop.
I leave the code as an exercise.
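For reference, one possible sketch of that exercise (the 1 KB chunk size is arbitrary and the function name is mine; each pass rescans the accumulated tail, which is fine for small N):

#include <algorithm>
#include <fstream>
#include <string>

// Return the last n lines of a file by scanning backwards in chunks.
std::string lastNLines(const char* path, int n) {
    std::ifstream file(path, std::ios::binary);
    file.seekg(0, std::ios::end);
    std::streamoff pos = file.tellg(); // scan position, moving towards the start
    std::string tail;                  // grows towards the front of the file

    while (pos > 0) {
        std::streamoff len = std::min<std::streamoff>(1024, pos);
        pos -= len;
        std::string chunk(static_cast<std::size_t>(len), '\0');
        file.seekg(pos);
        file.read(&chunk[0], len);
        tail.insert(0, chunk);

        // Count LFs backwards, skipping a trailing one at the end of the file.
        int count = 0;
        std::size_t scanEnd = tail.size();
        if (!tail.empty() && tail.back() == '\n') --scanEnd;
        for (std::size_t i = scanEnd; i-- > 0; )
            if (tail[i] == '\n' && ++count == n)
                return tail.substr(i + 1); // a line starts just after an LF
    }
    return tail; // fewer than n lines in the file: return everything
}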
This answer explains why what you do won't work. Below I explain what will work.
Open the file in the binary mode.
Read forward from the start storing positions of '\n' in a circular buffer of length 6. (boost::circular_buffer can help)
Dump the contents of the file starting from the smallest position in the ring buffer.
Step 2 can be improved by seeking to end-X where X is derived by some sort of bisection around the end of file.
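A sketch of steps 1-3, with std::deque standing in for boost::circular_buffer so the example stays self-contained:

#include <deque>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream file("test.txt", std::ios::binary); // file name from the question
    std::deque<std::streamoff> starts;                // poor man's circular buffer
    std::string line;

    while (true) {
        std::streamoff here = file.tellg();           // position before reading
        if (!std::getline(file, line)) break;
        starts.push_back(here);
        if (starts.size() > 6) starts.pop_front();    // keep the last 6 line starts
    }

    if (!starts.empty()) {
        file.clear();                                 // clear the EOF state
        file.seekg(starts.front());                   // smallest kept position
        while (std::getline(file, line))
            std::cout << line << '\n';
    }
}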
Probably the easiest approach is to just mmap() the file. This puts its contents into your virtual address space, so you can easily scan it from the end for the first six line endings.
Since mmapping a file gives you the illusion of the entire file being in memory in a single large buffer without actually loading the parts that you don't need, it both avoids unnecessary I/O and alleviates you from managing a growing buffer as you search backwards for the line endings.
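A POSIX-only sketch of that idea (so not portable to Windows; error handling kept minimal):

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("test.txt", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0 || st.st_size == 0) return 1;

    // Map the whole file read-only; pages are faulted in only as touched.
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) return 1;

    // Scan backwards for the start of the 6th-from-last line.
    off_t start = st.st_size;
    if (data[start - 1] == '\n') --start;             // ignore a trailing newline
    for (int newlines = 0; start > 0; --start)
        if (data[start - 1] == '\n' && ++newlines == 6) break;

    fwrite(data + start, 1, st.st_size - start, stdout);
    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
}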
I am running C++ code where I need to import data from a txt file.
The text file contains 10,000 lines. Each line contains n columns of binary data.
The code has to loop 100,000 times; each time it has to randomly select a line out of the txt file and assign the binary values in the columns to some variables.
What is the most efficient way to write this code? Should I load the file into memory first, or should I seek to a random line number each time?
How can I implement this in C++?
To randomly access a line in a text file, all lines need to have the same byte length. If you don't have that, you need to read through the file until you get to the correct line. Since this will be pretty slow for that many accesses, it is better to load the file into a std::vector of std::strings, one entry per line (easily done with std::getline). Or, since you want to assign values from the different columns, you can use a std::vector of your own struct, like
struct MyValues {
    double d;
    int i;
    // whatever you have / need
};
std::vector<MyValues> vec;
which might be better than re-parsing the line every time.
With the std::vector, you get your random access and only have to loop through the whole file once.
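A sketch of that approach (the column layout of one double and one int per line is just the example struct above; the file name is made up):

#include <fstream>
#include <random>
#include <sstream>
#include <string>
#include <vector>

struct MyValues {
    double d;
    int i;
};

int main() {
    std::ifstream in("data.txt");           // hypothetical file name
    std::vector<MyValues> vec;
    std::string line;
    while (std::getline(in, line)) {        // a single pass over the whole file
        std::istringstream fields(line);
        MyValues v;
        if (fields >> v.d >> v.i)           // assumed columns: one double, one int
            vec.push_back(v);
    }
    if (vec.empty()) return 1;

    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<std::size_t> pick(0, vec.size() - 1);
    for (int k = 0; k < 100000; ++k) {
        const MyValues& v = vec[pick(gen)]; // O(1) random access per sample
        (void)v;                            // ... assign v.d and v.i as needed ...
    }
}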
10K lines is a pretty small file.
If you have, say, 100 chars per line, it will use the HUGE amount of 1MB of your RAM.
Load it to a vector and access it the way you want.
maybe not THE most efficient, but you could try this:
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <limits>
#include <string>
using namespace std;

int main() {
    ifstream in("yourfile.txt");     // use ifstream to read
    in.seekg(0, ios::end);
    streamoff size = in.tellg();     // file size, to pick a random byte in range
    string line;                     // string to store the line
    srand(time(NULL));               // random number generator
    for (int i = 0; i < 100000; i++) {
        in.clear();                  // clear any eof/fail state from last round
        in.seekg(rand() % size);     // jump to a random byte offset
        // we probably landed mid-line, so skip ahead to the next line start
        in.ignore(numeric_limits<streamsize>::max(), '\n');
        if (!getline(in, line)) {    // ran off the end: wrap to the first line
            in.clear();
            in.seekg(0);
            getline(in, line);
        }
        // do what you want with the line here...
    }
}
I'm too lazy right now, but you need to make sure that you check your ifstream for errors like end-of-file, index-out-of-bounds, etc...
Since you're taking 100,000 samples from just 10,000 lines, the majority of lines will be sampled. Read the entire file into an array data structure, and then randomly sample the array. This avoids file seeking entirely.
The more common case is to sample only a small subset of the file's data. To do that, assuming the lines are of different lengths, seek to a random point in the file, skip to the next newline (for example in.ignore(numeric_limits<streamsize>::max(), '\n')), and then parse the subsequent text.
Is there a way that I can seek to a certain line in a file to read or write data?
Let's say I want to write some data starting on the 10th line in a text file. There might be some data already in the first few lines, or the file could even be empty. Is there a way I can seek directly to the line I want without having to worry about what's already in the file?
Only if the lines are all the same length (seek to 9 * bytes_per_line). Otherwise, you'll just have to scan your way to the appropriate spot in the file.
Also be wary of writing into the middle of a file. It may not do what you expect: it will not insert new lines, it will simply overwrite whatever content is already there, without respecting existing line boundaries.
You can seek to a position in a file, but that position must be a character offset from the start, end or current position - see for example fseek(). There is no way of seeking to a particular line, unless all the lines are exactly the same length.
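For the fixed-length case, a minimal sketch (the record length and file name are assumptions; anything written in place must fit within one record):

#include <fstream>
#include <string>

int main() {
    const std::streamoff bytes_per_line = 80; // assumed fixed line length, incl. '\n'
    std::fstream file("records.txt",          // hypothetical file; must already exist
                      std::ios::in | std::ios::out | std::ios::binary);

    file.seekg(9 * bytes_per_line);           // jump straight to the 10th line
    std::string line;
    std::getline(file, line);

    file.seekp(9 * bytes_per_line);           // overwrite that line in place
    file << "replacement, padded to the record length";
}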
No, you have to process the data to find the line delimiters (unless you have fixed length lines). Have a look at getline(), ftell() and fseek(). http://www.pixelbeat.org/programming/readline/cpp.cpp
The easiest way is to read the file into memory, for instance inserting each line into a vector of strings, then modify or add whatever you want, and rewrite each line to a new file (supposing the file fits in memory).
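A sketch of that read-modify-rewrite approach (file name and replacement text are made up):

#include <fstream>
#include <string>
#include <vector>

int main() {
    // Read every line into memory.
    std::ifstream in("data.txt");               // hypothetical file name
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    in.close();

    // Make sure line 10 exists (pad with empty lines if the file was short).
    if (lines.size() < 10)
        lines.resize(10);
    lines[9] = "new content for the 10th line"; // lines are 0-indexed here

    // Rewrite the whole file.
    std::ofstream out("data.txt");
    for (const std::string& l : lines)
        out << l << '\n';
}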
I have a big text file with more than 200,000 lines, and I need to read just a few of them. For instance: lines 10,000 to 20,000.
Important: I don't want to open and scan the full file to extract these lines, because of performance issues.
Is this possible?
If the lines are fixed length, then it would be possible to seek to a specific byte position and load just the lines you want. If lines are variable length, the only way to find the lines you're looking for is to parse the file and count the number of end-of-line markers. If the file changes infrequently, you might be able to get sufficient performance by performing this parsing once and then keeping an index of the byte positions of each line to speed future accesses (perhaps writing that index to disk so it doesn't need to be done every time your program is run).
You will have to search through the file to count the newlines, unless you know that all lines are the same length (in which case you could seek to the offset = line_number * line_size_in_bytes, where line_number counts from zero and line_size_in_bytes includes all characters in the line).
If the lines are variable / unknown length then while reading through it once you could index the beginning offset of each line so that subsequent reads could seek to the start of a given line.
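A sketch of persisting such an index, one 8-byte offset per line in native byte order (file names and function names are mine):

#include <cstdint>
#include <fstream>
#include <string>

// Scan the text file once, writing each line's byte offset to an index file.
void buildIndex(const std::string& textPath, const std::string& indexPath) {
    std::ifstream in(textPath, std::ios::binary);
    std::ofstream out(indexPath, std::ios::binary);
    std::string line;
    while (true) {
        std::uint64_t offset = in.tellg();
        if (!std::getline(in, line)) break;
        out.write(reinterpret_cast<const char*>(&offset), sizeof offset);
    }
}

// Seek straight to line n (0-based) using the index.
std::string readLine(const std::string& textPath, const std::string& indexPath,
                     std::uint64_t n) {
    std::ifstream idx(indexPath, std::ios::binary);
    idx.seekg(n * sizeof(std::uint64_t));
    std::uint64_t offset = 0;
    idx.read(reinterpret_cast<char*>(&offset), sizeof offset);

    std::ifstream in(textPath, std::ios::binary);
    in.seekg(offset);
    std::string line;
    std::getline(in, line);
    return line;
}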
If these lines are all the same length you could compute an offset for a given line and read just those bytes.
If the lines are varying length then you really have to read the entire file to count how many lines there are. Line terminating characters are just arbitrary bytes in the file.
If the line are fixed length then you just compute the offset, no problem.
If they're not (e.g. a regular CSV file) then you'll need to go through the file, either to build an index or to just read the lines you need. To make the file reading a little faster, a good idea would be to use memory mapped files (see the implementation that's part of the Boost Iostreams library: http://www.boost.org/doc/libs/1_39_0/libs/iostreams/doc/classes/mapped_file.html).
As others noted, if you do not have lines of fixed width, it is impossible without building an index. However, if you are in control of the file format, you can get roughly O(log(size)) instead of O(size) performance in finding the start line, if you store the number of each line on the line itself, i.e. make the file contents look something like this:
1: val1, val2, val3
2: val4
3: val5, val6
4: val7, val8, val9, val10
With this format of the file, you can quickly find the needed line by binary search: start by seeking into the middle of the file. Read till the next newline, then read the line and parse the number. If the number is bigger than the target, repeat the algorithm on the first half of the file; if it is smaller than the target line number, repeat it on the second half.
You'd need to be careful about the corner cases (e.g.: your "beginning" of the range and "end" of the range are on the same line, etc.), but for me this approach worked excellently in the past for parsing the logfiles which had the date in it (and I needed to find the lines that are between the certain timestamps).
Of course, this still does not beat the performance of the explicitly built index or the fixed-size records.
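Still, a sketch of that binary search, assuming the "N: ..." format above (function names are mine and error handling is minimal):

#include <fstream>
#include <limits>
#include <string>

// Find the first line that starts at or after `pos` and parse its leading number.
// `lineStart` receives the byte offset where that line begins; -1 means no line.
long lineNumberAt(std::ifstream& in, std::streamoff pos, std::streamoff& lineStart) {
    in.clear();
    in.seekg(pos);
    if (pos != 0) // we may have landed mid-line: skip to the next line start
        in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    lineStart = in.tellg();
    long n;
    if (!(in >> n)) n = -1; // no complete line after pos
    return n;
}

// Binary search for the byte offset of the line whose number equals `target`.
std::streamoff findLine(std::ifstream& in, long target, std::streamoff fileSize) {
    std::streamoff lo = 0, hi = fileSize;
    while (lo < hi) {
        std::streamoff mid = lo + (hi - lo) / 2;
        std::streamoff start;
        long n = lineNumberAt(in, mid, start);
        if (n == -1 || n >= target)
            hi = mid;     // the target line starts at or before mid
        else
            lo = mid + 1; // the target line starts after mid
    }
    std::streamoff start = 0;
    lineNumberAt(in, lo, start);
    return start;
}

To read lines 10,000 to 20,000, seek to findLine(in, 10000, size) and getline until the parsed line number passes 20,000.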