parsing an sstream - c++

I am parsing a file which contains both strings and numerical values. I'd like to process the file field by field, each delimited by a space or an end-of-line character.
The ifstream::getline() operation only allows a single delimiting character. What I currently do is thus a getline with the character ' ' as a delimiter, and then manually go back to the previous position in the stream if a '\n' has been encountered:
ifstream ifs ( filename , ifstream::in );
streampos pos;
while (ifs.good())
{
char curField[255];
pos = ifs.tellg();
ifs.getline(curField, 255, ' ');
string s(curField);
if (s.find("\n")!=string::npos)
{
ifs.seekg(pos);
ifs.getline(curField, 255, '\n');
s = string(curField);
}
// process the field contained in the string s...
}
However, the "seekg" seems to position the stream one character too late (I thus miss the first character of each field before each line break).
I know there are other ways to code such a parser, by scanning line by line etc.., but I'd really like to understand why this particular piece of code fails...
Thank you very much!

As Loadmaster said, there may be unaccounted for characters, or this could just be an off-by-one error.
But this just has to be said... you can replace this:
ifstream ifs ( filename , ifstream::in );
streampos pos;
while (ifs.good())
{
char curField[255];
pos = ifs.tellg();
ifs.getline(curField, 255, ' ');
string s(curField);
if (s.find("\n")!=string::npos)
{
ifs.seekg(pos);
ifs.getline(curField, 255, '\n');
s = string(curField);
}
// process the field contained in the string s...
}
With this:
ifstream ifs ( filename , ifstream::in );
string s;
// operator>> skips spaces and newlines alike; testing its result
// stops the loop as soon as no further field can be read
while (ifs >> s)
{
    // process the field contained in the string s...
}
To get the behavior you want.
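As a minimal, self-contained sketch of that idea (the function name readFields is made up for illustration and is not from the question), the extraction loop can be wrapped so it works on any input stream:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects every whitespace-delimited field from a stream.
// operator>> skips leading whitespace, so spaces, tabs and newlines all act
// as separators and no tellg()/seekg() bookkeeping is needed.
std::vector<std::string> readFields(std::istream& in)
{
    std::vector<std::string> fields;
    std::string field;
    while (in >> field) {
        fields.push_back(field);
    }
    return fields;
}
```

Because the parameter is a std::istream&, the same function accepts an std::ifstream or an std::istringstream. Numeric fields arrive as text and still need converting by the caller.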

There may be a look-ahead/push-back character in the input stream. IIRC, the seek/tell functions are not aware of this.

Related

What types of indicators are there for the end of a string when tokenizing a sentence?

I am trying to take a string holding a sentence and break it up by words to add to a linked list class called wordList.
When dealing with strings in C++, what is the indicator that you have reached the end of a string? I searched here and found that C strings are null-terminated, with the end indicated by a '\0', but these solutions give me errors.
I know there are other ways to do this (like scanning through individual characters), but I am fuzzy on how to implement them.
void lineScan( string line) // Adds words to wordList from line of a file
{
istringstream iss(line);
string lineWord;
getline(iss, lineWord, ' ');
wrds.addWords( lineWord );
while( lineWord!= NULL )
{
getline(iss, lineWord, ' ');
wrds.addWords( lineWord );
}
}
You probably want to skip all whitespace, not use a single space as separator (your code will read empty tokens).
But you're not really dealing with strings here, and in particular not with C strings.
Since you're using istringstream, you're looking for the end of a stream, and it works like all input streams: the extraction itself tells you when nothing more can be read.
void lineScan(string line) // Adds words to wordList from line of a file
{
istringstream iss(line);
string word;
while (iss >> word)
{
wrds.addWords(word);
}
}

populating a string vector with tab delimited text

I'm very new to C++.
I'm trying to populate a vector with elements from a tab delimited file. What is the easiest way to do that?
Thanks!
There are many ways to do it; a simple Google search will give you a solution.
Here is an example from one of my projects. It uses getline and reads a comma-separated (CSV) file; I'll let you change it to read a tab-delimited file.
ifstream fin(filename.c_str());
string buffer;
// getline() itself reports end of file, so no separate eof() test is needed
while (getline(fin, buffer))
{
    size_t prev_pos = 0, curr_pos = 0;
    vector<string> tokenlist;
    string token;
    // check string (note: this aborts on a blank line)
    assert(buffer.length() != 0);
    // tokenize string buffer.
    curr_pos = buffer.find(',', prev_pos);
    while (true) {
        if (curr_pos == string::npos)
            curr_pos = buffer.length();
        // could be zero
        size_t token_length = curr_pos - prev_pos;
        // create new token and add it to tokenlist.
        token = buffer.substr(prev_pos, token_length);
        tokenlist.push_back(token);
        // reached end of the line
        if (curr_pos == buffer.length())
            break;
        prev_pos = curr_pos + 1;
        curr_pos = buffer.find(',', prev_pos);
    }
}
UPDATE: Improved while condition.
This is probably the easiest way to do it, but vcp's approach can be more efficient.
std::vector<std::string> tokens;
std::string token;
while (std::getline(infile, token, '\t'))
{
    tokens.push_back(token);
}
Done. You can actually get this down to about three lines of code with an input iterator and a back inserter, but why?
Now if the file is cut up into lines and separated by tabs on those lines, you also have to handle the line delimiters. Now you just do the above twice, one loop for lines and an inner loop to parse the tabs.
std::vector<std::string> tokens;
std::string line;
while (std::getline(infile, line))
{
    std::stringstream instream(line);
    std::string token;
    while (std::getline(instream, token, '\t'))
    {
        tokens.push_back(token);
    }
}
And if you needed to do line, then tabs, then... I dunno... quotes? Three loops. But to be honest by three I'm probably looking at writing a state machine. I doubt your teacher wants anything like that at this stage.
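If the row structure matters, the two-loop approach above can be sketched as a function that keeps one vector per line. The names Row and readTabDelimited are assumptions made for this example, not part of the answer:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> Row;  // hypothetical alias for one line's fields

// Hypothetical helper: outer loop splits on newlines, inner loop on tabs.
std::vector<Row> readTabDelimited(std::istream& in)
{
    std::vector<Row> rows;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        Row row;
        std::string field;
        while (std::getline(fields, field, '\t')) {
            row.push_back(field);  // two adjacent tabs yield an empty field
        }
        rows.push_back(row);       // a blank line yields an empty row
    }
    return rows;
}
```

One caveat of this sketch: a tab at the very end of a line does not produce a trailing empty field, because the inner getline stops once the line is exhausted.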

How to NOT use \n as delimiter in getline()

I'm trying to read in lines from a plain text file, but there are line breaks in the middle of sentences, so getline() reads until a line break as well as until a period. The text file looks like:
then he come tiptoeing down and stood right between us. we could
a touched him nearly. well likely it was minutes and minutes that
there warnt a sound and we all there so close together. there was a
place on my ankle that got to itching but i dasnt scratch it.
My read-in code:
// read in sentences
while (file)
{
string s, record;
if (!getline( file, s )) break;
istringstream ss(s);
while (ss)
{
string s;
if (!getline(ss, s, '.')) break;
record = s;
if(record[0] == ' ')
record.erase(record.begin());
sentences.push_back(record);
}
}
// output sentences
for (vector<string>::size_type i = 0; i < sentences.size(); i++)
cout << sentences[i] << "[][][][]" << endl;
The purpose of the [ ][ ][ ][ ] was to check if linebreaks were used as delimiters and were not just being read into the string. The output would look like:
then he come tiptoeing down and stood right between us.[][][][]
we could[][][][]
a touched him nearly.[][][][]
well likely it was minutes and minutes that[][][][]
there warnt a sound and we all there so close together.[][][][]
there was a[][][][]
place on my ankle that got to itching but i dasnt scratch it.[][][][]
What exactly is your question?
You're using getline() to read from the file stream with a newline delimiter, then parsing that line with a second getline() on the istringstream ss with the delimiter '.'. So of course your strings are broken at both the newline and the '.'.
getdelim() works like getline(), except that a line delimiter other than newline can be specified as the delimiter argument. As with getline(), a delimiter character is not added if one was not present in the input before end of file was reached.
ssize_t getdelim(char **restrict lineptr, size_t *restrict n, int delimiter, FILE *restrict stream);
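getdelim() is a POSIX C function; with C++ streams a comparable effect can be had by passing '.' as the delimiter to std::getline on the file stream itself, so that newlines are read as ordinary characters. A rough sketch, assuming LF line endings and a made-up function name readSentences:

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads '.'-terminated sentences, treating line breaks
// as plain spaces. The terminating '.' is consumed and not stored.
std::vector<std::string> readSentences(std::istream& in)
{
    std::vector<std::string> sentences;
    std::string sentence;
    while (std::getline(in, sentence, '.')) {
        // Line breaks inside a sentence become ordinary spaces.
        std::replace(sentence.begin(), sentence.end(), '\n', ' ');
        // Skip the spaces left over after the previous period.
        std::string::size_type first = sentence.find_first_not_of(' ');
        if (first == std::string::npos) {
            continue;  // only whitespace, e.g. after the final period
        }
        sentences.push_back(sentence.substr(first));
    }
    return sentences;
}
```

Text after the last period, if any, is returned as a final unterminated sentence, and trailing spaces are not trimmed in this sketch.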

Cannot read binary file with bracket characters in C++

I have function which reads a File & checks its contents.
The file contains some binary content along with non alphabet characters like (), =, divided by symbol, etc.
The function which does the reading is:
int FindMyWord(const char *fileName)
{
ifstream myFile (fileName);
if(!myFile)
return false;
string wordSearch = "MyWord";
string line;
int result = 0;
while(getline(myFile, line))
{
if(line.find(wordSearch) != string::npos)
result++;
}
//if(!myFile.eof() || !myFile)
if(!myFile)
printf("Problem Reading the File: %s\n", (const char *)fileName);
myFile.close();
return result;
}
I am having these 2 problems:
If a line contains binary characters, then it does not read the complete line, just the first word (at least that's what I observe when opening the file in VS2010).
When it encounters the character ( for the beginning of a line the while loop is terminated & the printf() is printed.
If string::getline() cannot read such characters then what is the solution?
Thank You.
UPDATE: An image showing some of the binary data in the file was attached here.
A text input stream should not fail on a bracket character.
If you actually need a binary stream, use ifstream(filename, std::ios::binary)
Have a read through the std::getline docs at cppreference.com. You should check the failbit on the stream if you have any odd behaviour.
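A minimal sketch of the binary-mode suggestion follows. It is not the asker's function: it reads the whole file into a string and counts occurrences of the word rather than counting lines that contain it, and the names countOccurrences and countWordInBinaryFile are invented for the example.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Counts non-overlapping occurrences of word in text. std::string::find
// works on raw bytes, so embedded '\0' or bracket characters are harmless.
std::size_t countOccurrences(const std::string& text, const std::string& word)
{
    if (word.empty()) return 0;
    std::size_t count = 0;
    std::string::size_type pos = text.find(word);
    while (pos != std::string::npos) {
        ++count;
        pos = text.find(word, pos + word.size());
    }
    return count;
}

// Returns the number of occurrences, or -1 if the file cannot be opened.
long countWordInBinaryFile(const char* fileName, const std::string& word)
{
    // std::ios::binary disables any text-mode translation of the bytes read.
    std::ifstream file(fileName, std::ios::binary);
    if (!file) return -1;
    std::string contents((std::istreambuf_iterator<char>(file)),
                         std::istreambuf_iterator<char>());
    return static_cast<long>(countOccurrences(contents, word));
}
```

Reading everything at once is only reasonable for files that fit comfortably in memory; for larger files a chunked read would be needed.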

break long string into multiple c++

I have a string that is received from third party. This string is actually the text from a text file and it may contain UNIX LF or Windows CRLF for line termination. How can I break this into multiple strings ignoring blank lines? I was planning to do the following, but am not sure if there is a better way. All I need to do is read line by line. Vector here is just a convenience and I can avoid it.
Unfortunately, I do not have access to the actual file; I only receive the string object.
string textLine;
vector<string> tokens;
size_t pos = 0;
while( true ) {
size_t nextPos = textLine.find( pos, '\n\r' );
if( nextPos == textLine.npos )
break;
tokens.push_back( string( textLine.substr( pos, nextPos - pos ) ) );
pos = nextPos + 1;
}
You could use std::getline as you're reading from the file instead of reading the whole thing into a string. That will break things up line by line by default. You can simply not push_back any string that comes up empty.
string line;
vector<string> tokens;
while (getline(file, line))
{
if (!line.empty()) tokens.push_back(line);
}
UPDATE:
If you don't have access to the file, you can use the same code by initializing a stringstream with the whole text. std::getline works on all stream types, not just files.
I'd use getline to create new strings based on \n, and then manipulate the line endings.
string textLine;
vector<string> tokens;
istringstream sTextLine(textLine);
string line;
while(getline(sTextLine, line)) {
if(line.empty()) continue;
if(line[line.size()-1] == '\r') line.resize(line.size()-1);
if(line.empty()) continue;
tokens.push_back(line);
}
EDIT: Use istringstream instead of stringstream.
I would use the approach given here (std::getline on a std::istringstream)...
Splitting a C++ std::string using tokens, e.g. ";"
... except omit the ';' parameter to std::getline.
A lot depends on what is already present in your toolkit. I work a lot
with files which come from Windows and are read under Unix, and vice
versa, so I have most of the tools for converting CRLF into LF at hand.
If you don't have any, you might want a function along the lines of:
void addLine( std::vector<std::string>& dest, std::string line )
{
if ( !line.empty() && *(line.end() - 1) == '\r' ) {
line.erase( line.end() - 1 );
}
if ( !line.empty() ) {
dest.push_back( line );
}
}
to do your insertions. As for breaking the original text into lines,
you can use std::istringstream and std::getline, as others have
suggested; it's simple and straightforward, even if it is overkill.
(The std::istringstream is a fairly heavy mechanism, since it supports
all sorts of input conversions you don't need.) Alternatively, you
might consider a loop along the lines of:
std::string::const_iterator start = textLine.begin();
std::string::const_iterator end = textLine.end();
std::string::const_iterator next = std::find( start, end, '\n' );
while ( next != end ) {
addLine( tokens, std::string( start, next ) );
start = next + 1;
next = std::find( start, end, '\n' );
}
addLine( tokens, std::string( start, end ) );
Or you could break things down into separate operations:
textLine.erase(
std::remove( textLine.begin(), textLine.end(), '\r'),
textLine.end() );
to get rid of all of the CR's,
std::vector<std::string> tokens( split( textLine, '\n' ) );
, to break it up into lines, where split is a generalized function
along the lines of the above loop (a useful tool to add to your
toolkit), and finally:
tokens.erase(
std::remove_if( tokens.begin(), tokens.end(),
boost::bind( &std::string::empty, _1 ) ),
tokens.end() );
. (Generally speaking: if this is a one-off situation, use the
std::istringstream based solution. If you think you may have to do
something like this from time to time in the future, add the split
function to your toolkit, and use it.)
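The answer does not show split itself. One possible sketch, with an assumed signature that matches the call split( textLine, '\n' ), might look like this:

```cpp
#include <string>
#include <vector>

// Assumed signature, not taken from the answer: splits text at every
// occurrence of delimiter and keeps empty pieces.
std::vector<std::string> split(const std::string& text, char delimiter)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    std::string::size_type next = text.find(delimiter, start);
    while (next != std::string::npos) {
        parts.push_back(text.substr(start, next - start));
        start = next + 1;
        next = text.find(delimiter, start);
    }
    parts.push_back(text.substr(start));  // whatever follows the last delimiter
    return parts;
}
```

This version deliberately keeps empty pieces, which is why a separate step that erases empty strings is still needed to drop blank lines.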
You could use strtok.
Split string into tokens
A sequence of calls to this function
split str into tokens, which are
sequences of contiguous characters
separated by any of the characters
that are part of delimiters.
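A hedged sketch of the strtok route, using a made-up function name: because strtok treats any run of delimiter characters as a single separator, passing "\r\n" handles both LF and CRLF endings and skips blank lines in one go.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: splits text into non-empty lines with std::strtok.
std::vector<std::string> splitLinesWithStrtok(const std::string& text)
{
    // strtok overwrites delimiters with '\0', so work on a mutable copy.
    std::vector<char> buffer(text.begin(), text.end());
    buffer.push_back('\0');

    std::vector<std::string> lines;
    for (char* token = std::strtok(buffer.data(), "\r\n");
         token != nullptr;
         token = std::strtok(nullptr, "\r\n")) {
        lines.push_back(token);
    }
    return lines;
}
```

Two caveats: strtok keeps hidden static state, so it is not safe to interleave or call from several threads, and a '\0' embedded in the text ends the scan early.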
I would put the string in a stringstream and then use the getline method like the previous answer mentioned. Then, you could just act like you were reading the text in from a file when it really comes from another string.