Parse buffered data line by line - c++

I want to write a parser for the Wavefront OBJ file format, which is a plain text format.
An example can be seen here: people.sc.fsu.edu/~jburkardt/data/obj/diamond.obj.
Most people use plain old scanf to parse this format line by line; however, I would prefer to load the whole file at once to reduce the number of IO operations. Is there a way to parse this kind of buffered data line by line?
void ObjModelConditioner::Import(Model& asset)
{
uint8_t* buffer = SyncReadFile(asset.source_file_info());
delete [] buffer;
}
Or would it be preferable to load whole file into a string and try to parse that?

After a while, it seems I have found a sufficient (and simple) solution. Since my goal is to create an asset conditioning pipeline, the code has to be able to handle large amounts of data efficiently. The data can be read into a string in one go, and once loaded, a stringstream can be initialized with this string.
std::string data;
SyncReadFile(asset.source_file_info(), data);
std::stringstream data_stream(data);
std::string line;
Then I simply call getline():
while(std::getline(data_stream, line))
{
std::stringstream line_stream(line);
std::string type_token;
line_stream >> type_token;
if (type_token == "v") {
// Vertex position
Vector3f position;
line_stream >> position.x >> position.y >> position.z;
// ...
}
else if (type_token == "vn") {
// Vertex normal
}
else if (type_token == "vt") {
// Texture coordinates
}
else if (type_token == "f") {
// Face
}
}
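The face branch above is left empty. As a rough sketch of how one face token could be handled, assuming the usual OBJ forms `v`, `v/vt`, `v/vt/vn` and `v//vn`: the `FaceVertex` struct and `parse_face_vertex` function are hypothetical names that do not appear in the original code, and an index of 0 is used here to mean "not present".

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical helper type: the indices of one face corner; 0 means "absent".
struct FaceVertex {
    int position;
    int texcoord;
    int normal;
};

// Splits a single token such as "1/2/3", "1//3", "1/2" or "1" on '/'.
FaceVertex parse_face_vertex(const std::string& token) {
    FaceVertex fv = {0, 0, 0};
    int* fields[3] = { &fv.position, &fv.texcoord, &fv.normal };
    std::istringstream parts(token);
    std::string part;
    for (int i = 0; i < 3 && std::getline(parts, part, '/'); ++i) {
        if (!part.empty()) {
            *fields[i] = std::atoi(part.c_str());
        }
    }
    return fv;
}
```

Inside the "f" branch, each whitespace-separated token read from line_stream could be passed to such a function. Negative (relative) indices and error reporting are not handled in this sketch.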

Here's a function that splits a char array into a vector of strings (assuming the lines are separated by the '\n' character; the separator itself is not kept):
#include <string>
#include <vector>
std::vector< std::string >split(const char * arr)
{
    std::string str = arr;
    std::vector< std::string >result;
    std::string::size_type beg = 0; // beginning of the current line in the array
    for (;;)
    {
        std::string::size_type end = str.find( '\n', beg ); // end of the current line
        if(end == std::string::npos)
        {
            result.push_back(str.substr(beg)); // last line, no trailing '\n'
            break;
        }
        result.push_back(str.substr(beg, end - beg));
        beg = end + 1; // continue after the '\n'
    }
    return result;
}
Here's the usage:
int main()
{
    const char * a = "asdasdasdasdasd \n asdasdasd \n asdasd";
    std::vector< std::string >result = split(a);
}

If you've got the raw data in a char[] (or an unsigned char[]), and
you know its length, it's pretty trivial to write an input-only streambuf
without seek support, which will allow you to create an std::istream
and to use std::getline on it. Just call:
setg( start, start, start + length );
in the constructor. (Nothing else is needed.)
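A minimal sketch of such a streambuf, assuming the data is already in a plain char buffer; the names `MemoryInputBuf` and `read_lines` are made up for this illustration and are not from the answer.

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical name: a read-only streambuf over memory it does not own.
// Seeking is not supported because seekoff/seekpos are not overridden.
class MemoryInputBuf : public std::streambuf {
public:
    MemoryInputBuf(char* start, std::size_t length) {
        // The whole buffer becomes the get area; nothing else is needed.
        setg(start, start, start + length);
    }
};

// Hypothetical helper showing std::getline on top of the buffer.
std::vector<std::string> read_lines(char* data, std::size_t length) {
    MemoryInputBuf buf(data, length);
    std::istream in(&buf);
    std::vector<std::string> lines;
    for (std::string line; std::getline(in, line); ) {
        lines.push_back(line);
    }
    return lines;
}
```

The buffer must outlive the stream, since no copy of the data is made. For an unsigned char buffer, a reinterpret_cast<char*> at the call site would be one way to reuse the same class.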

It really depends on how you're going to parse the text. One way to do this would be simply to read the data into a vector of strings. I'll assume that you've already covered issues such as scalability, memory use, etc.
std::vector<std::string> lines;
std::string line;
ifstream file(filename.c_str(), ios_base::in);
while ( getline( file, line ) )
{
lines.push_back( line );
}
file.close();
This would cache your file in lines. Next you need to go through lines
for ( std::vector<std::string>::const_iterator it = lines.begin();
it != lines.end(); ++it)
{
const std::string& line = *it;
if ( line.empty() )
continue;
switch ( line[0] )
{
case 'g':
// Some stuff
break;
case 'v':
// Some stuff
break;
case 'f':
// Some stuff
break;
default:
// Default stuff including '#' (probably nothing)
break;
}
}
Naturally, this is very simplistic and depends largely on what you want to do with your file.
The size of the file that you've given as an example is hardly likely to cause IO stress (unless you're using some very lightweight equipment) but if you're reading many files at once I suppose it might be an issue.
I think your concern here is to minimise IO, and I'm not sure this solution will really help that much, since you're going to be iterating over the collection twice. If you need to go back and read the same file over and over again, then caching the file in memory will definitely speed things up, but there are equally easy ways to do this, such as memory mapping the file and using normal file access. If you're really concerned, try profiling a solution like this against simply processing the file directly as you read it from IO.

Related

C++: buffer the cin istream

The problem is:
I have a code that operates on a fully functional istream. It uses methods like:
istream is;
is.seekg(...) // <--- going backwards at times
is.tellg() // <--- to save the position before looking forward
etc.
These methods are only available for istreams from, say, a file. However, if I use cin in this fashion, it will not work--cin does not have the option of saving a position, reading forward, then returning to the saved position.
// So, I can't cat the file into the program
cat file | ./program
// I can only read the file from inside the program
./program -f input.txt
// Which is the problem with a very, very large zipped file
// ... that cannot coexist on the same raid-10 drive system
// ... with the resulting output
zcat really_big_file.zip | ./program //<--- Doesn't work due to cin problem
./program -f really_big_file.zip //<--- not possible without unzipping
I can read cin into a deque and process the deque. A 1 MB deque buffer would be more than enough. However, this is problematic in three ways:
I would have to rewrite everything to work with a deque.
It won't be as bulletproof as just using an istream, for which the code has already been debugged.
It seems that if I implement it as a deque with some difficulty, someone is going to come along and ask why I didn't just do it like ___.
What is the proper/most efficient way to create a usable istream object, in the sense that all members are active, with a cin istream?
(Bearing in mind that performance is important)
You could create a filtering stream buffer reading from std::cin when getting new data but buffering all received characters. You'd be able to implement seeking within the buffered range of the input. Seeking beyond the end of the already buffered input would imply reading corresponding amounts of data. Here is an example of a corresponding implementation:
#include <iostream>
#include <streambuf>
#include <string>
#include <vector>
class bufferbuf
: public std::streambuf {
private:
std::streambuf* d_sbuf;
std::vector<char> d_buffer;
int_type underflow() {
char buffer[1024];
std::streamsize size = this->d_sbuf->sgetn(buffer, sizeof(buffer));
if (size == 0) {
return std::char_traits<char>::eof();
}
this->d_buffer.insert(this->d_buffer.end(), buffer, buffer + size);
this->setg(this->d_buffer.data(),
this->d_buffer.data() + this->d_buffer.size() - size,
this->d_buffer.data() + this->d_buffer.size());
return std::char_traits<char>::to_int_type(*this->gptr());
}
pos_type seekoff(off_type off, std::ios_base::seekdir whence, std::ios_base::openmode) {
switch (whence) {
case std::ios_base::beg:
this->setg(this->eback(), this->eback() + off, this->egptr());
break;
case std::ios_base::cur:
this->setg(this->eback(), this->gptr() + off, this->egptr());
break;
case std::ios_base::end:
this->setg(this->eback(), this->egptr() + off, this->egptr());
break;
default: return pos_type(off_type(-1)); break;
}
return pos_type(off_type(this->gptr() - this->eback()));
}
pos_type seekpos(pos_type pos, std::ios_base::openmode) {
this->setg(this->eback(), this->eback() + pos, this->egptr());
return pos_type(off_type(this->gptr() - this->eback()));
}
public:
bufferbuf(std::streambuf* sbuf)
: d_sbuf(sbuf)
, d_buffer() {
this->setg(0, 0, 0); // actually the default setting
}
};
int main() {
bufferbuf sbuf(std::cin.rdbuf());
std::istream in(&sbuf);
std::streampos pos(in.tellg());
std::string line;
while (std::getline(in, line)) {
std::cout << "pass1: '" << line << "'\n";
}
in.clear();
in.seekg(pos);
while (std::getline(in, line)) {
std::cout << "pass2: '" << line << "'\n";
}
}
This implementation buffers input before passing it on to the reading step. You can read individual characters (e.g. change char buffer[1024]; to become char buffer[1]; or replace the use of sgetn() appropriately using sbumpc()) to provide a more direct response: there is a trade-off between immediate response and performance for batch processing.
cin is user input and should be treated as unpredictable. If you want to use the mentioned functionality and you are sure about your input, you can read the whole input into an istringstream and then operate on it.
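Reading everything into an istringstream could look roughly like the sketch below. The function name `slurp` is an assumption made for this example, and returning the stream by value relies on C++11 move support for string streams.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: drains `in` completely and returns an in-memory
// stream on which tellg() and seekg() work.
std::istringstream slurp(std::istream& in) {
    std::ostringstream collected;
    collected << in.rdbuf();   // copy everything that can be read from `in`
    return std::istringstream(collected.str());
}
```

Calling `slurp(std::cin)` would then give a fully seekable stream. The whole input is held in memory, so this only suits inputs that fit there; for the very large piped file described in the question, a bounded buffering streambuf is likely the better fit.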

How to discard from streams? .ignore() doesn't work for this purpose; any other methods?

I have a lack of understanding about streams. The idea is to read a file into an ifstream and then work with it: extract data from the stream into a string, and discard from the stream the part that is now in the string. Is that possible? Or how should such problems be handled?
The following method takes a file that has been properly opened as an ifstream. (It's a text file containing information about "Lost" episodes; it's an episode guide.) It works fine for one element of the Episode class. Every time I instantiate an episode, I want to check the stream of that file, extract the information about one episode (the end of an episode is indicated by "****", after which the next episode starts), discard it from the stream, and process the extracted information as a string. If I create a new Episode object, I want to do the same with the information between that "****" and the next "****", and so on.
void Episode::read(ifstream& in) {
string contents((istreambuf_iterator<char>(in)), istreambuf_iterator<char>());
size_t episodeEndPos = contents.find("****");
if ( episodeEndPos == -1) {
in.ignore(numeric_limits<char>::max());
in.clear(), in.sync();
fullContent = contents;
}
else { // empty stream for next episode
in.ignore(episodeEndPos + 4);
fullContent = contents.substr(0, episodeEndPos);
}
// fill attributes
setNrHelper();
setTitelHelper();
setFlashbackHelper();
setDescriptionHelper();
}
I tried it with inFile >> words (which is a way to get the words out of the stream). Another way I was thinking about is to use .ignore (to ignore a number of characters in the stream), but that doesn't work as intended. Sorry for my bad English; hopefully it's clear what I want to do.
If your goal is for each call of Read() to read the next episode and advance in the file, then the trick is to use tellg() and seekg() to bookmark the position and update it:
void Episode::Read(ifstream& in) {
streampos pos = in.tellg(); // backup current position
string fullContent;
string contents((istreambuf_iterator<char>(in)), istreambuf_iterator<char>());
size_t episodeEndPos = contents.find("****");
if (episodeEndPos == string::npos) {
in.ignore(numeric_limits<streamsize>::max());
in.clear(), in.sync();
fullContent = contents;
}
else { // empty stream for next episode
fullContent = contents.substr(0, episodeEndPos);
in.seekg(pos + streamoff(episodeEndPos + 4)); // position file at next episode
}
}
In this way, you can call your function several times, each call reading the next episode.
However, please note that your approach is not optimised. When you construct your contents string from a stream iterator, you load the whole rest of the file into memory, starting at the current position in the stream. So you keep reading the same large parts of the file again and again.
Edit: streamlined version adapted to your format
You just need to read the line, check if it's not a separator line and concatenate...
void Episode::Read(ifstream& in) {
string line;
string fullContent;
while (getline(in, line) && line !="****") {
fullContent += line + "\n";
}
cout << "DATENSATZ: " << fullContent << endl; // just to verify content
// fill attributes
//...
}
The code you've got reads the entire stream in one go just to use some part of the read text to initialize an object. Imagining a gigantic file, that is almost certainly a bad idea. The easier approach is to just read until the end marker is found. In an ideal world, the end marker is easily found. Based on the comments it seems to be on a line of its own, which would make it quite easy:
void Episode::read(std::istream& in) {
std::string text;
for (std::string line; std::getline(in, line) && line != "****"; ) {
text += line + "\n";
}
fullContent = text;
}
If the separator isn't on a line of its own, you could use code like this instead:
void Episode::read(std::istream& in) {
    std::string text;
    for (std::istreambuf_iterator<char> it(in), end; it != end; ++it) {
        text.push_back(*it);
        if (*it == '*' && 4u <= text.size() && text.substr(text.size() - 4) == "****") {
            ++it; // consume the final '*' so the next read starts after the separator
            break;
        }
    }
    // Strip the separator from the end of the record, if it was found.
    if (4u <= text.size() && text.substr(text.size() - 4u) == "****") {
        text.resize(text.size() - 4u);
    }
    fullContent = text;
}
Both of these approaches would simply read the file from start to end and consume the characters to be extracted in the process, stopping as soon as one record has been read.

Skip reading a line in a INI file if its length greater than n in C++

I want to skip reading a line in the INI file if it has more than 1000 characters. This is the code I'm using:
#define MAX_LINE 1000
char buf[MAX_LINE];
CString strTemp;
str.Empty();
for(;;)
{
is.getline(buf,MAX_LINE);
strTemp=buf;
if(strTemp.IsEmpty()) break;
str+=strTemp;
if(str.Find("^")>-1)
{
str=str.Left( str.Find("^") );
do
{
is.get(buf,2);
} while(is.gcount()>0);
is.getline(buf,2);
}
else if(strTemp.GetLength()!=MAX_LINE-1) break;
}
//is.getline(buf,MAX_LINE);
return is;
...
The problem I'm facing is that if a line exceeds 1000 characters, it seems to fall into an infinite loop (it is unable to read the next line). How can I make getline skip that line and read the next one?
const std::size_t max_line = 1000; // not a macro, macros are disgusting
std::string line;
while (std::getline(is, line))
{
if (line.length() > max_line)
continue;
// else process the line ...
}
How about checking the return value of getline and breaking if that fails?
...or, if is is an istream, you could check for an eof() condition to break out.
#define MAX_LINE 1000
char buf[MAX_LINE];
CString strTemp;
str.Empty();
while(is.eof() == false)
{
is.getline(buf,MAX_LINE);
strTemp=buf;
if(strTemp.IsEmpty()) break;
str+=strTemp;
if(str.Find("^")>-1)
{
str=str.Left( str.Find("^") );
do
{
is.get(buf,2);
} while((is.gcount()>0) && (is.eof() == false));
is.getline(buf,2);
}
else if(strTemp.GetLength()!=MAX_LINE-1)
{
break;
}
}
return is;
For something completely different:
std::string strTemp;
str.Empty();
while(std::getline(is, strTemp)) {
if(strTemp.empty()) break;
str+=strTemp.c_str(); //don't need .c_str() if str is also a std::string
int pos = str.Find("^"); //extracted this for speed
if(pos>-1){
str=str.Left(pos);
//Did not translate this part since it was buggy
}
// The original else branch was not translated: the intent is unclear,
// since it would stop reading if the line was less than 1000 characters.
}
return is;
This uses strings for ease of use, and no maximum limits on lines. It also uses the std::getline for the dynamic/magic everything, but I did not translate the bit in the middle since it seemed very buggy to me, and I couldn't interpret the intent.
The part in the middle simply reads two characters at a time until it reaches the end of the file, and then everything after that would have done bizarre stuff since you weren't checking return values. Since it was completely wrong, I didn't interpret it.

Reading multiple files

I want to alternate between reading multiple files. Below is a simplified version of my code.
ifstream* In_file1 = new ifstream("a.dat", ios::binary);
ifstream* In_file2 = new ifstream("b.dat", ios::binary);
ifstream* In_file;
In_file = In_file1;
int ID = 0;
//LOOPING PORTION
if (In_file -> eof())
{
In_file -> seekg(0, ios_base::beg);
In_file->close();
switch (ID)
{
case 0:
In_file = In_file2; ID = 1; break;
case 1:
In_file = In_file1; ID = 0; break;
}
}
//some codes
:
:
In_file->read (data, sizeof(double));
//LOOPING PORTION
The code works well if I am reading the files one time and I thought that everything was cool. However, if the part termed 'looping portion' is within a loop, then the behaviour becomes weird and I start having a single repeating output. Please, can someone tell me what is wrong and how I can fix it? If you have a better method of tackling the problem, please suggest it. I appreciate it.
//SOLVED
Thank you everybody for your comments, I appreciate it. Here is what I simply did:
Instead of the original
switch (ID)
{
case 0:
In_file = In_file2; ID = 1; break;
case 1:
In_file = In_file1; ID = 0; break;
}
I simply did
switch (ID)
{
case 0:
In_file = new ifstream("a.dat", ios::binary); ID = 1; break;
case 1:
In_file = new ifstream("b.dat", ios::binary); ID = 0; break;
}
Now it works like a charm and I can loop as much as I want :-). I appreciate your comments, great to know big brother still helps.
Let's see: the code you posted works fine, and you want us to tell you
what's wrong with the code you didn't post. That's rather difficult.
Still, the code you posted probably doesn't work correctly either.
std::istream::eof can only be used reliably after an input (or some
other operation) has failed; in the code you've posted, it will almost
certainly be false, regardless.
In addition: there's no need to dynamically allocate ifstream; in
fact, there are almost no cases where dynamic allocation of ifstream
is appropriate. And you don't check that the opens have succeeded.
If you want to read two files, one after the other, the simplest way is
to use two loops, one after the other (calling a common function for
processing the data). If for some reason that's not appropriate, I'd
use a custom streambuf, which takes a list of filenames in the
constructor, and advances to the next whenever it reaches end of file on
one, only returning EOF when it has reached the end of all of the
files. (The only complication in doing this is what to do if one of the
opens fails. I do this often enough that it's part of my tool kit,
and I use a callback to handle failure. For a one time use, however,
you can just hard code in whatever is appropriate.)
As a quick example:
// We define our own streambuf, deriving from std::streambuf
// (All istream and ostream delegate to a streambuf for the
// actual data transfer; we'll use an instance of this to
// initialize the istream we're going to read from.)
class MultiFileInputStreambuf : public std::streambuf
{
// The list of files we will process
std::vector<std::string> m_filenames;
// And our current position in the list (actually
// one past the current position, since we increment
// it when we open the file).
std::vector<std::string>::const_iterator m_current;
// Rather than create a new filebuf for each file, we'll
// reuse this one, closing any previously open file, and
// opening a new file, as needed.
std::filebuf m_streambuf;
protected:
// This is part of the protocol for streambuf. The base
// class will call this function anytime it needs to
// get a character, and there aren't any in the buffer.
// This function can set up a buffer, if it wants, but
// in this case, the buffering is handled by the filebuf,
// so it's likely not worth the bother. (But this depends
// on the cost of virtual functions---without a buffer,
// each character read will require a virtual function call
// to get here.
//
// The protocol is to return the next character, or EOF if
// there isn't one.
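virtual int_type underflow()
{
// Look at (but do not consume) the next character of the
// current streambuf.
int_type result = m_streambuf.sgetc();
// As long as 1) the current streambuf is at end of file,
// and 2) there are more files to read, open the next file
// and try to get a character from it.
while ( result == traits_type::eof() && m_current != m_filenames.end() ) {
m_streambuf.close();
m_streambuf.open( m_current->c_str(), std::ios::in );
if ( !m_streambuf.is_open() ) {
// Error handling here...
}
++ m_current;
result = m_streambuf.sgetc();
}
// We've either gotten a character from the (now) current
// streambuf, or there are no more files, and we'll return
// the EOF from our last attempt at reading.
return result;
}
// Since no get area is set up here, the base class calls uflow
// whenever it has to consume a character; forward that to the
// filebuf as well.
virtual int_type uflow()
{
int_type result = underflow();
if ( result != traits_type::eof() )
m_streambuf.sbumpc();
return result;
}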
public:
// Use a template and two iterators to initialize the list
// of files from any STL sequence whose elements can be
// implicitly converted to std::string.
template<typename ForwardIterator>
MultiFileInputStreambuf(ForwardIterator begin, ForwardIterator end)
: m_filenames(begin, end)
, m_current(m_filenames.begin())
{
}
};
#include <iostream>
#include <fstream>
#include <string>
#define NO_OF_FILES 2
int main ()
{
std::ifstream in;
std::string line;
std::string files[NO_OF_FILES] =
{
"file1.txt",
"file2.txt",
};
// start our engine!
for (int i = 0; i < NO_OF_FILES; i++)
{
in.clear(); // reset any eofbit/failbit left over from the previous file
in.open(files[i].c_str(), std::fstream::in);
if (in.is_open())
{
std::cout << "reading... " << files[i] << std::endl;
while (std::getline(in, line)) // stop as soon as a read fails
{
std::cout << line << std::endl;
}
in.close();
std::cout << "SUCCESS" << std::endl;
}
else
std::cout << "Error: unable to open " + files[i] << std::endl;
}
return 0;
}

What's the idiomatic way of parsing text by using ifstream?

I'm trying to parse a text file to find a pattern then grab a substring. This code fragment works fine, however can I improve this? Can I minimize copying here? I.e. I get a line and store it in the buf then construct a string, can this copying be eliminated?
In short what's the idiomatic way of achieving this?
std::ifstream f("/file/on/disk");
while (!f.eof()) {
char buf[256];
f.getline(buf, sizeof(buf));
std::string str(buf);
if (str.find(pattern) != std::string::npos)
{
// further processing, then break out of the while loop and return.
}
}
Here's one possible rewrite:
std::ifstream f("/file/on/disk");
char buffer[256];
while (f.getline(buffer, sizeof(buffer))) { // Use the read operation as the test in the loop.
if (strstr(buffer, pattern) != NULL) { // Don't construct a std::string; costs time (needs <cstring>, pattern must be a C string)
// further processing, then break out of the while loop and return.
}
}
The main changes are marked inline, but to summarize:
Use the read operation as the test in the while loop. This makes the code a lot shorter and clearer.
Don't convert the C-style string to a std::string; just use strstr to do the scan.
As a further note, you probably don't want to use a C-style string here unless you're sure that's what you want. A C++ string is probably better:
std::ifstream f("/file/on/disk");
std::string buffer;
while (std::getline(f, buffer)) { // Use the read operation as the test in the loop.
if (buffer.find(pattern) != std::string::npos) {
// further processing, then break out of the while loop and return.
}
}
In your code, you first copy characters from the file into a char array. That should be all the copying necessary. If you only needed to look at each character once, then even that copy wouldn't be necessary.
Next, you construct a std::string from the array you filled. Again, unnecessary. If you want a string then copy from the stream directly into a string.
std::ifstream f("/file/on/disk");
for( std::string line; std::getline(f, line); ) {
if (line.find(pattern) != std::string::npos) {
// further processing, then break out of the for loop and return.
}
}
You don't need that char[] at all.
string line;
std::getline(f, line);
if (line.find(pattern) != std::string::npos)
{
....
}