How to getline() from a specific line in a file in C++?

I've looked around a bit and have found no definitive answer on how to read a specific line of text from a file in C++. I have a text file with over 100,000 English words, each on its own line. I can't use arrays because they obviously won't hold that much data, and vectors take too long to store every word. How can I achieve this?
P.S. I found no duplicates of this question regarding C++
while (getline(words_file, word))
{
    my_vect.push_back(word);
}
EDIT:
A commenter below has helped me realize that the only reason loading the file into a vector was taking so long was because I was debugging. Plainly running the .exe loads the file nearly instantaneously. Thanks for everyone's help.

If your words have no whitespace (I assume they don't), you can use a trickier non-getline solution using a deque!
#include <algorithm>
#include <deque>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

using namespace std;

int main() {
    deque<string> dictionary;
    cout << "Loading file..." << endl;
    ifstream myfile ("dict.txt");
    if ( myfile.is_open() ) {
        copy(istream_iterator<string>(myfile),  // was myFile; the name must match
             istream_iterator<string>(),
             back_inserter(dictionary));
        myfile.close();
    } else {
        cout << "Unable to open file." << endl;
    }
    return 0;
}
The above reads the file in one pass, with istream_iterator<string> tokenizing on the stream's defaults (any whitespace; this is a big assumption on my part), which makes it slightly faster. This gets done in about 2-3 seconds with 100,000 words. I'm also using a deque, which is the best data structure (imo) for this particular scenario. When I use vectors, it takes around 20 seconds (not even close to your minute mark; you must be doing something else that increases complexity).
To access the word at line 1:
cout << dictionary[0] << endl;
Hope this has been useful.

You have a few options, but none will automatically let you go to a specific line. File systems don't track line numbers within files.
One way is to have fixed-width lines in the file. Then read the appropriate amount of data based upon the line number you want and the number of bytes per line.
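A minimal sketch of the fixed-width approach (the file name, record width and target line here are my assumptions):
#include <fstream>
#include <iostream>
#include <string>

int main() {
    const std::streamoff LINE_WIDTH = 32;  // assumed: every line padded to 31 chars plus '\n'
    const std::streamoff n = 1000;         // zero-based line to fetch
    std::ifstream in("words.txt", std::ios::binary);
    in.seekg(n * LINE_WIDTH);              // byte offset of line n
    std::string line;
    if (std::getline(in, line))
        std::cout << line << '\n';         // still padded; trim if needed
}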
Another way is to loop, reading lines one at a time until you get to the line that you want.
A third way would be to have a sort of index that you create at the beginning of the file to reference the location of each line. This, of course, would require that you control the file format.
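And a sketch of the index idea, keeping the index in memory rather than at the front of the file (file name and target line are assumptions):
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("words.txt", std::ios::binary);
    std::vector<std::streampos> offsets;   // offsets[k] = byte position of line k
    std::string line;
    offsets.push_back(in.tellg());
    while (std::getline(in, line))
        offsets.push_back(in.tellg());

    const std::size_t n = 1000;            // zero-based line to fetch
    if (n + 1 < offsets.size()) {
        in.clear();                        // clear the eofbit left by the scan
        in.seekg(offsets[n]);
        std::getline(in, line);
        std::cout << line << '\n';
    }
}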

I already mentioned this in a comment, but I wanted to give it a bit more visibility for anyone else who runs into this issue...
I think that the following code will take a long time to read from the file because std::vector probably has to re-allocate its internal memory several times to account for all of the elements you are adding. This is an implementation detail, but if I understand correctly, std::vector usually starts out small and increases its memory as necessary to accommodate new elements. This works fine when you're adding a handful of elements at a time, but is really inefficient when you're adding a hundred thousand of them one by one.
while (getline(words_file, word)) {
    my_vect.push_back(word);
}
So, before running the loop above, call my_vect.reserve(100000). This forces std::vector to allocate enough memory in advance so that it doesn't need to shuffle things around later. (Note that constructing the vector as my_vect(100000) would instead create 100,000 empty strings, and push_back would add new elements after them.)
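Putting it together, a minimal sketch (the count comes from the question; the file name is an assumption):
#include <fstream>
#include <string>
#include <vector>

int main() {
    std::ifstream words_file("words.txt"); // assumed file name
    std::vector<std::string> my_vect;
    my_vect.reserve(100000);               // one allocation up front
    std::string word;
    while (std::getline(words_file, word))
        my_vect.push_back(word);           // no reallocation until capacity is exceeded
}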

The question is exceedingly unclear. How do you determine the specific line? If it is the nth line, the simplest solution is just to call getline n times, throwing out all but the last result; calling ignore n-1 times might be slightly faster, but I suspect that if you're always reading into the same string (rather than constructing a new one each time), the difference in time won't be enormous. If you have some other criterion, and the file is really big (which from your description it isn't) and sorted, you might try a binary search: seek to the middle of the file, read enough ahead to find the start of the next line, then decide the next step according to its value. (I've used this to find relevant entries in log files. But we're talking about files which are several gigabytes in size.)
If you're willing to use system-dependent code, it might be advantageous to memory map the file, then search for the nth occurrence of '\n' (std::find n times).
ADDED: Just some quick benchmarks. On my Linux box, getting the 100000th word from /usr/share/dict/words (479623 words, one per line, on my machine) takes about:
- 272 milliseconds, reading all words into an std::vector, then indexing,
- 256 milliseconds, doing the same but with std::deque,
- 30 milliseconds, using getline but just ignoring the results until the one I'm interested in,
- 20 milliseconds, using istream::ignore, and
- 6 milliseconds, using mmap and looping on std::find.
FWIW, the code in each case is:
For the std:: containers:
template<typename Container>
void Using<Container>::operator()()
{
    std::ifstream input( m_filename.c_str() );
    if ( !input )
        Gabi::ProgramManagement::fatal() << "Could not open " << m_filename;
    Container().swap( m_words );
    std::copy( std::istream_iterator<Line>( input ),
               std::istream_iterator<Line>(),
               std::back_inserter( m_words ) );
    if ( static_cast<int>( m_words.size() ) <= m_target )
        Gabi::ProgramManagement::fatal()
            << "Not enough words, had " << m_words.size()
            << ", wanted more than " << m_target;
    m_result = m_words[ m_target ];
}
For getline without saving:
void UsingReadAndIgnore::operator()()
{
    std::ifstream input( m_filename.c_str() );
    if ( !input )
        Gabi::ProgramManagement::fatal() << "Could not open " << m_filename;
    std::string dummy;
    for ( int count = m_target; count > 0; -- count )
        std::getline( input, dummy );
    std::getline( input, m_result );
}
For ignore:
void UsingIgnore::operator()()
{
    std::ifstream input( m_filename.c_str() );
    if ( !input )
        Gabi::ProgramManagement::fatal() << "Could not open " << m_filename;
    for ( int count = m_target; count > 0; -- count )
        input.ignore( INT_MAX, '\n' );
    std::getline( input, m_result );
}
And for mmap:
void UsingMMap::operator()()
{
    int input = ::open( m_filename.c_str(), O_RDONLY );
    if ( input < 0 )
        Gabi::ProgramManagement::fatal() << "Could not open " << m_filename;
    struct ::stat infos;
    if ( ::fstat( input, &infos ) != 0 )
        Gabi::ProgramManagement::fatal() << "Could not stat " << m_filename;
    char* base = (char*)::mmap( NULL, infos.st_size, PROT_READ, MAP_PRIVATE, input, 0 );
    if ( base == MAP_FAILED )
        Gabi::ProgramManagement::fatal() << "Could not mmap " << m_filename;
    char const* end = base + infos.st_size;
    char const* curr = base;
    char const* next = std::find( curr, end, '\n' );
    for ( int count = m_target; count > 0 && curr != end; -- count ) {
        curr = next + 1;
        next = std::find( curr, end, '\n' );
    }
    m_result = std::string( curr, next );
    ::munmap( base, infos.st_size );
    ::close( input );  // release the file descriptor
}
In each case, the code is run

You could seek to a specific position, but that requires that you know where the line starts. "A little less than a minute" for 100,000 words does sound slow to me.

Read some data, count the newlines, throw away that data and read some more, count the newlines again... and repeat until you've read enough newlines to hit your target.
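A sketch of that chunked scan (the buffer size, file name and target line are arbitrary assumptions):
#include <fstream>
#include <iostream>
#include <string>

int main() {
    const int target = 100000;             // zero-based line we want
    std::ifstream in("words.txt", std::ios::binary);
    char buf[65536];
    int seen = 0;                          // newlines counted so far
    std::streamoff pos = 0;                // bytes consumed so far
    bool found = (target == 0);
    while (!found && in.read(buf, sizeof buf).gcount() > 0) {
        const std::streamsize got = in.gcount();
        for (std::streamsize i = 0; i < got && !found; ++i) {
            ++pos;
            if (buf[i] == '\n' && ++seen == target)
                found = true;              // pos is now the start of the target line
        }
    }
    std::string line;
    in.clear();                            // read() sets eofbit at the end of the file
    in.seekg(pos);
    if (found && std::getline(in, line))
        std::cout << line << '\n';
}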
Also, as others have suggested, this is not a particularly efficient way of accessing data. You'd be well-served by making an index.

Related

Is there a way to request a certain location of a text file to be read in C++?

I'm using a text file which is in two lines. Each line has several thousand numbers on it (signed doubles). So it would look like this:
X.11 X.12 X.13 ...
X.21 X.22 X.23 ...
For each loop of my program I want to read one number from each line. So for the first iteration of the loop it would be X.11 & X.21 and for the second iteration X.12 & X.22 and so on. I don't need to store the values.
the expected output:
X.11 X.21
X.12 X.22
X.13 X.23
How can this be done in C++?
I usually read files using fstream, and reading the file line-by-line with std::getline(file, line). How would I read one number from each line?
I don't need to store the values.
Sure, but if you did, say in two arrays of doubles, then your loop would be trivial, and much faster than repeated disk reads. And two arrays of several thousand doubles is probably less memory usage than you think: 1 MB of RAM can hold 131072 eight-byte doubles.
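For illustration, a sketch of that in-memory approach (the file name is my assumption; each line is parsed separately, so numbers cannot bleed across the newline):
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Reads one line and parses every number on it into a vector.
static std::vector<double> readLine(std::istream& in) {
    std::string line;
    std::getline(in, line);
    std::istringstream tokens(line);
    std::vector<double> values;
    double v;
    while (tokens >> v)
        values.push_back(v);
    return values;
}

int main() {
    std::ifstream in("numbers.txt");       // assumed file name
    std::vector<double> row1 = readLine(in);
    std::vector<double> row2 = readLine(in);
    for (std::size_t i = 0; i < row1.size() && i < row2.size(); ++i)
        std::cout << row1[i] << " " << row2[i] << '\n';
}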
I assume you want to read one number from each line; otherwise comment and I will delete the answer.
Read the file with 2 streams in parallel:
std::ifstream input_file_stream_1( "file" );
std::ifstream input_file_stream_2( "file" );
std::string line_1;
std::string line_2;
std::string ignore;
std::getline( input_file_stream_2, ignore ); // ignore the whole first line
for( ; input_file_stream_1 >> line_1 && input_file_stream_2 >> line_2; ){
    std::cout << line_1 << " and " << line_2 << '\n';
}
input_file_stream_1.close();
input_file_stream_2.close();
the input:
X.11 X.12 X.13 ...
X.21 X.22 X.23 ...
the output:
X.11 and X.21
X.12 and X.22
X.13 and X.23
... and ...
How it works:
Since your file only has 2 lines, I use two input streams over the same file: one for the first line and the other for the second. Before the for loop, input_file_stream_2 reads and discards the whole first line, because that line belongs to input_file_stream_1. After ignoring that first line, input_file_stream_1 is positioned on line 1 and input_file_stream_2 on line 2. Now you have two lines with two streams, and in the for loop (or a while) you can extract each token with the >> operator.
or with the std::regex library:
std::ifstream input_file_stream( "file" );
std::string line_1;
std::string line_2;
std::getline( input_file_stream, line_1 );
std::getline( input_file_stream, line_2 );
std::regex regex( R"(\s+)" );
std::regex_token_iterator< std::string::iterator > begin_1( line_1.begin(), line_1.end(), regex, -1 ), end_1;
std::regex_token_iterator< std::string::iterator > begin_2( line_2.begin(), line_2.end(), regex, -1 ), end_2;
while( begin_1 != end_1 && begin_2 != end_2 ){
    std::cout << *begin_1++ << " and " << *begin_2++ << '\n';
}
input_file_stream.close();
the output (same as above):
X.11 and X.21
X.12 and X.22
X.13 and X.23
... and ...
NOTE:
There is more than one way to do this.
If you have already read the line using std::getline(file, line), you can tokenize it with char* p = strtok(&line[0], " ");. p then points to "X.11" on the first line, and each subsequent call strtok(NULL, " ") gives you the next token.
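A sketch of that strtok approach (the file name is an assumption; note that strtok needs a writable buffer, and &line[0] is only guaranteed to be contiguous and NUL-terminated since C++11):
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream file("numbers.txt");     // assumed file name
    std::string line;
    while (std::getline(file, line)) {
        // strtok writes '\0' over the delimiters inside line's own buffer.
        for (char* p = std::strtok(&line[0], " "); p != NULL;
             p = std::strtok(NULL, " "))
            std::cout << p << '\n';        // first call yields, e.g., "X.11"
    }
}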
For an fstream you can use tellg and seekg to store and restore the position in the stream. However, I haven't verified that they work well together with formatted input.
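A minimal round-trip sketch of that idea (the file name is an assumption):
#include <fstream>
#include <iostream>

int main() {
    std::ifstream in("numbers.txt");       // assumed file name
    std::streampos mark = in.tellg();      // remember the current position
    double x;
    in >> x;
    std::cout << "first read:  " << x << '\n';
    in.clear();                            // harmless here; required if a read failed
    in.seekg(mark);                        // jump back to the mark
    in >> x;
    std::cout << "second read: " << x << '\n';  // same value again
}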
Assuming you don't want to store the result in memory and it is only two lines another solution would be to open the file twice - and treat it as if you were reading the lines from two different files.

Easiest/clearest way to read formatted data in C++

I'm reading in a file of space/newline delimited numbers. After trying stringstreams and ifstreams, it appears C++ hasn't improved much on fopen and fscanf for this simple task in terms of simplicity, readability, or efficiency.
What about robustness? Since I check that fscanf returned the number of items I expect, this doesn't seem like an issue. The only benefit I can think of is stringstream's giving you more options to handle a failure.
Here is a quick example using fscanf:
FILE * pFile;
pFile = fopen ("my_file.txt","r");
if( pFile == NULL ) return -1;
double x,y,z;
int items_read;
while( true )
{
    items_read = fscanf( pFile, "%lf %lf %lf", &x, &y, &z ); // fscanf needs the addresses
    if( items_read < 3 ) break; // Catches EOF (which is -1) and short reads of 1-2 numbers
    std::cout << x << " " << y << " " << z << "\n";
}
fclose( pFile );
NOTE: for extra security, could replace fopen/fscanf with fopen_s/fscanf_s in Visual Studio.
In my experience neither C nor C++ offer you "robust input that tolerates idiot users".
It is adequate for "well-formed input where it's OK to say 'something wrong in your input, please fix it'...", but not for robust situations where you need to check everything carefully (e.g. someone putting two instead of three numbers on a line, so the whole rest of the data is happily accepted, but now all your z values are actually x values, and everything else is "shifted one").
In that case, you do need to write some functions that do the appropriate checking by reading a line, checking that it can fetch three numbers out of that line - or something like that. You may well find that using stringstream or something similar is adequate for checking that there are three valid numbers on the line, but just using f >> x >> y >> z; will obviously lead to the next line being used to satisfy whatever is missing on this line.
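A sketch of such a per-line check: pull in one whole line, then require exactly three numbers and nothing else on it (the file name is taken from the earlier example):
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::ifstream file("my_file.txt");
    std::string line;
    int lineNo = 0;
    while (std::getline(file, line)) {
        ++lineNo;
        std::istringstream parse(line);
        double x, y, z;
        std::string extra;
        if (!(parse >> x >> y >> z) || (parse >> extra)) {
            std::cerr << "bad input on line " << lineNo << '\n';
            continue;                      // or abort, depending on your policy
        }
        std::cout << x << " " << y << " " << z << "\n";
    }
}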

Need to write specific lines of a text file into a new text file

I have text files of numerical data ranging between 1 MB and 150 MB in size. I need to write out the lines corresponding to heights; for example, with heights = 4, the new text must contain lines 1, 5, 9, 13, 17, 21, ... in sequence.
I have been trying to find a way to do this for a while now. I tried using a list instead of a vector, which ended up in compilation errors.
EDIT: I have cleaned up the code as advised. It now writes the lines to the sample2 text; all done here. Thank you all.
I am open to a method change as long as it delivers what I need. Thank you for your time and help.
Following is what I have so far:
#include <iostream>
#include <fstream>
#include <string>
#include <list>
#include <vector>

using namespace std;

int h, n, m;
int c = 1;

int main () {
    cout << "Enter Number Of Heights: ";
    cin >> h;
    ifstream myfile_in ("C:\\sample.txt");
    ofstream myfile_out ("C:\\sample2.txt");
    string line;
    std::string str;
    vector<string> v;
    if (myfile_in.is_open()) {
        myfile_in >> noskipws;
        int i = 0;
        int j = 0;
        while (std::getline(myfile_in, line)) {
            v.push_back(line);
            ++n;
            if (n - 1 == i) {
                myfile_out << v[i] << endl;
                i = i + h;
                ++j;
            }
        }
        cout << "Number of lines in text file: " << n << endl;
    }
    else cout << "Unable to open file(s) ";
    cout << "Reaching here, Writing one line" << endl;
    system("PAUSE");
    return 0;
}
You need to use seekg to set the position back to the beginning of the file once you have read it. (You have read it once to count the lines, which I don't think you actually need, as this count is never used, at least in this piece of code.)
And what is the point of the inner while? On each loop, you have
int i=1;
myfile_out<<v[i]; //Not writing to text
i=i+h;
So on each loop, i gets 1, and you output the element with index 1 all the time, which is not the first element, as indices start from 0. And once you add the seekg or remove the first while, your program will start to crash.
So, make i start from 0, and move it out of the two while loops, right to the beginning of the if-statement.
Ah, the second while is also unnecessary. Leave just the first one.
EDIT:
Add
myfile_in.clear();
before seekg to clear the flags.
Also, your algorithm is wrong. You'll get a segfault if h > 1, because you'll go out of range (of the vector). I'd advise doing it like this: read the file in the while loop that counts the lines, and store each line in the vector. This way you'll be able to remove the second reading, the seekg, the clear, etc. Also, as you already store the content of the file in a vector, you'll NOT lose anything. Then just use a for loop with step h.
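A sketch of that single-pass restructuring (paths taken from the question; h is assumed to be at least 1):
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    int h;
    std::cout << "Enter Number Of Heights: ";
    std::cin >> h;
    if (h < 1) return 1;                   // a step below 1 makes no sense
    std::ifstream myfile_in("C:\\sample.txt");
    std::ofstream myfile_out("C:\\sample2.txt");
    if (!myfile_in.is_open() || !myfile_out.is_open()) {
        std::cout << "Unable to open file(s)\n";
        return 1;
    }
    std::vector<std::string> v;
    std::string line;
    while (std::getline(myfile_in, line))
        v.push_back(line);                 // one pass; v.size() is the line count
    for (std::size_t i = 0; i < v.size(); i += h)
        myfile_out << v[i] << '\n';        // lines 1, 1+h, 1+2h, ... of the input
    std::cout << "Number of lines in text file: " << v.size() << '\n';
    return 0;
}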
Again edit, regarding your edit: no, it has nothing to do with any flags. The if, where you compare i==j is outside the while. Add it inside. Also, increment j outside the if. Or just remove j and use n-1 instead. Like
if ( n-1 == i )
Several things.
First, you read the file completely, just to count the number of lines, then you read it a second time to process it, building up an in-memory image in v. Why not just read it in the first time, and do everything else on the in-memory image? (v.size() will then give you the number of lines, so you don't have to count them.) And you never actually use the count anyway.
Second, once you've reached the end of file the first time, the failbit is set; all further operations are no-ops, until it is reset. If you have to read the file twice (say because you do away with v completely), then you have to do myfile_in.clear() after the first loop, but before seeking to the beginning.
You only test for is_open after having read the file once. This test should be immediately after the open.
You also set noskipws, although you don't do any formatted input which would be affected by it.
The final while is highly suspect. Because you haven't done the clear, you probably never enter the loop, but if you did, you'd very quickly start accessing out of bounds: after reading n lines, the size of v will be n, but you read it with index i, which will be n * h.
Finally, you should explicitly close the output file and check for errors after the close, just in case.
It's not clear to me what you're trying to do. If all you want to do is insert h empty lines between each existing line, something like:
std::string separ( h + 1, '\n' );
std::string line;
while ( std::getline( myfile_in, line ) ) {
    myfile_out << line << separ;
}
should do the trick. No need to store the complete input in memory. (For that matter, you don't even have to write a program for this. Something as simple as sed 's:$:\n\n\n\n:' < infile > outfile would do the trick.)
EDIT:
Reading other responses, I gather that I may have misunderstood the problem, and that he only wants to output every h-th line. If this is the case:
std::string line;
while ( std::getline( myfile_in, line ) ) {
    myfile_out << line << '\n';
    for ( int count = h - 1; count > 0; -- count ) {
        std::getline( myfile_in, line );
        // or myfile_in.ignore( INT_MAX, '\n' );
    }
}
But again, other tools seem more appropriate. (I'd follow thiton's suggestion and use AWK.) Why write a program in a language you don't know well when tools are already available to do the job?
If there is no absolutely compelling reason to do this in C++, you are using the wrong programming language for this. In awk, your whole program is:
{ if ( FNR % 4 == 1 ) print; }
Or, giving the whole command line e.g. in sh to filter lines 1,5,9,13,...:
awk '{ if ( FNR % 4 == 1 ) print; }' a.txt > b.txt

Read file and extract certain part only

ifstream toOpen;
openFile.open("sample.html", ios::in);
if(toOpen.is_open()){
    while(!toOpen.eof()){
        getline(toOpen,line);
        if(line.find("href=") && !line.find(".pdf")){
            start_pos = line.find("href");
            tempString = line.substr(start_pos+1); // i dont want the quote
            stop_pos = tempString.find("\"");
            string testResult = tempString.substr(start_pos, stop_pos);
            cout << testResult << endl;
        }
    }
    toOpen.close();
}
What I am trying to do is to extract the "href" value, but I can't get it to work.
EDIT:
Thanks to Tony's hint, I use this:
if(line.find("href=") != std::string::npos ){
    // Process
}
it works!!
I'd advise against trying to parse HTML like this. Unless you know a lot about the source and are quite certain about how it'll be formatted, chances are that anything you do will have problems. HTML is an ugly language with an (almost) self-contradictory specification that (for example) says particular things are not allowed -- but then goes on to tell you how you're required to interpret them anyway.
Worse, almost any character can (at least potentially) be encoded in any of at least three or four different ways, so unless you scan for (and carry out) the right conversions (in the right order) first, you can end up missing legitimate links and/or including "phantom" links.
You might want to look at the answers to this previous question for suggestions about an HTML parser to use.
As a start, you might want to take some shortcuts in the way you write the loop over lines in order to make it clearer. Here is the conventional "read line at a time" loop using C++ iostreams:
#include <cstdlib>   // for EXIT_FAILURE
#include <fstream>
#include <iostream>
#include <string>

int main ( int, char ** )
{
    std::ifstream file("sample.html");
    if ( !file.is_open() ) {
        std::cerr << "Failed to open file." << std::endl;
        return (EXIT_FAILURE);
    }
    for ( std::string line; (std::getline(file,line)); )
    {
        // process line.
    }
}
As for the inner part that processes the line, there are several problems:
- It doesn't compile. I suppose this is what you meant by "I can't get it to work". When asking a question, this is the kind of information you might want to provide in order to get good help.
- There is confusion between the variable names temp and tempString, etc.
- string::find() returns a large positive integer (std::string::npos) to indicate an invalid position (the size_type is unsigned), so you will always enter the if block unless a match is found starting at character position 0, which is exactly the case where you probably do want to enter it.
Here is a simple test content for sample.html.
<html>
<a href="foo.pdf"/>
</html>
Sticking the following inside the loop:
if ((line.find("href=") != std::string::npos) &&
    (line.find(".pdf" ) != std::string::npos))
{
    const std::size_t start_pos = line.find("href");
    std::string temp = line.substr(start_pos+6);
    const std::size_t stop_pos = temp.find("\"");
    std::string result = temp.substr(0, stop_pos);
    std::cout << "'" << result << "'" << std::endl;
}
I actually get the output
'foo.pdf'
However, as Jerry pointed out, you might not want to use this in a production environment. If this is a simple homework or exercise on how to use the <string>, <iostream> and <fstream> libraries, then go ahead with such a procedure.

ifstream::unget() fails. Is MS' implementation buggy or is my code erroneous?

Yesterday I discovered an odd bug in rather simple code that basically gets text from an ifstream and tokenizes it. The code that actually fails does a number of get()/peek() calls looking for the token "/*". If the token is found in the stream, unget() is called so the next method sees the stream starting with the token.
Sometimes, seemingly depending only on the length of the file, the unget() call fails. Internally it calls pbackfail(), which then returns EOF. However, after clearing the stream state I can happily read more characters, so it's not exactly EOF.
After digging in, here's the full code that easily reproduces the problem:
#include <iostream>
#include <fstream>
#include <string>

// generate simplest string possible that triggers problem
void GenerateTestString( std::string& s, const size_t nSpacesToInsert )
{
    s.clear();
    for( size_t i = 0 ; i < nSpacesToInsert ; ++i )
        s += " ";
    s += "/*";
}

// write string to file, then open same file again in ifs
bool WriteTestFileThenOpenIt( const char* sFile, const std::string& s, std::ifstream& ifs )
{
    {
        std::ofstream ofs( sFile );
        if( ( ofs << s ).fail() )
            return false;
    }
    ifs.open( sFile );
    return ifs.good();
}

// find token, unget if found, report error, show extra data can be read even after error
bool Run( std::istream& ifs )
{
    bool bSuccess = true;
    for( ; ; )
    {
        int x = ifs.get();
        if( ifs.fail() )
            break;
        if( x == '/' )
        {
            x = ifs.peek();
            if( x == '*' )
            {
                ifs.unget();
                if( ifs.fail() )
                {
                    std::cout << "oops.. unget() failed" << std::endl;
                    bSuccess = false;
                }
                else
                {
                    x = ifs.get();
                }
            }
        }
    }
    if( !bSuccess )
    {
        ifs.clear();
        std::string sNext;
        ifs >> sNext;
        if( !sNext.empty() )
            std::cout << "remaining data after unget: '" << sNext << "'" << std::endl;
    }
    return bSuccess;
}

int main()
{
    std::string s;
    const char* testFile = "tmp.txt";
    for( size_t i = 0 ; i < 12290 ; ++i )
    {
        GenerateTestString( s, i );
        std::ifstream ifs;
        if( !WriteTestFileThenOpenIt( testFile, s, ifs ) )
        {
            std::cout << "file I/O error, aborting..";
            break;
        }
        if( !Run( ifs ) )
            std::cout << "** failed for string length = " << s.length() << std::endl;
    }
    return 0;
}
The program fails when the string length gets near the typical buffer-size boundaries 4096, 8192, 12288; here's the output:
oops.. unget() failed
remaining data after unget: '*'
** failed for string length = 4097
oops.. unget() failed
remaining data after unget: '*'
** failed for string length = 8193
oops.. unget() failed
remaining data after unget: '*'
** failed for string length = 12289
This happens when tested on Windows XP and Windows 7, compiled in both debug and release mode, with both dynamic and static runtimes, on both 32-bit and 64-bit systems/compiles, all with VS2008 and default compiler/linker options.
No problem was found when testing with gcc 4.4.5 on a 64-bit Debian system.
Questions:
- Can other people please test this? I would really appreciate some active collaboration from SO.
- Is there anything that is not correct in the code that could cause the problem (not talking about whether it makes sense)? Or any compiler flags that might trigger this behaviour?
- All parser code is rather critical for the application and is tested heavily, but of course this problem was not found in the test code. Should I come up with extreme test cases, and if so, how do I do that? How could I ever predict this could cause a problem?
- If this really is a bug, where do I best report it?
Is there anything that is not correct in the code that could cause the problem (not talking about whether it makes sense)?
Yes. Standard streams are required to have at least 1 unget() position. So you can safely do only one unget() after a call to get(). When you call peek() and the input buffer is empty, underflow() occurs and the implementation clears the buffer and loads a new portion of data. Note that peek() doesn't increase current input location, so it points to the beginning of the buffer. When you try to unget() the implementation tries to decrease current input position, but it's already at the beginning of the buffer so it fails.
Of course this depends on the implementation. If the stream buffer holds more than one character then it may sometimes fail and sometimes not. As far as I know, Microsoft's implementation stores only one character in basic_filebuf (unless you specify a greater buffer explicitly) and relies on <cstdio> internal buffering (btw, that's one reason why MSVC iostreams are slow). A quality implementation may load the buffer again from the file when unget() fails, but it isn't required to do so.
Try to fix your code so you don't need more than one unget() position. If you really need more, wrap the stream in one that guarantees unget() won't fail (look at Boost.Iostreams). Also, the code you posted is nonsense: it tries to unget() and then get() again. Why?
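One way to restructure the tokenizer so it never needs unget() (a sketch, not the original parser): once get() has returned '/' and peek() shows '*', the token is already identified, so consume the '*' and report the token instead of pushing the '/' back.
#include <cstdio>    // for EOF
#include <fstream>
#include <iostream>

// Scans the stream and reports every "/*" token; all other characters
// are consumed and ignored. Uses only get()/peek(), never unget().
void scan(std::istream& in) {
    for (int c = in.get(); c != EOF; c = in.get()) {
        if (c == '/' && in.peek() == '*') {
            in.get();                      // consume the '*'
            std::cout << "token /* found\n";  // hand the token to the caller here
        }
    }
}

int main() {
    std::ifstream ifs("tmp.txt");          // file name from the test program above
    scan(ifs);
}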