What's the idiomatic way of parsing text by using ifstream? - c++

I'm trying to parse a text file to find a pattern then grab a substring. This code fragment works fine; however, can I improve it? Can I minimize copying here? I.e., I get a line, store it in buf, then construct a string; can this copying be eliminated?
In short what's the idiomatic way of achieving this?
std::ifstream f("/file/on/disk");
while (!f.eof()) {
char buf[256];
f.getline(buf, sizeof(buf));
std::string str(buf);
if (str.find(pattern) != std::string::npos)
{
// further processing, then break out of the while loop and return.
}
}

Here's one possible rewrite:
std::ifstream f("/file/on/disk");
char buffer[256];
while (f.getline(buffer, sizeof(buffer))) { // Use the read operation as the test in the loop.
if (strstr(buffer, pattern) != NULL) { // Don't convert to std::string; strstr scans in place (use pattern.c_str() if pattern is a std::string).
// further processing, then break out of the while loop and return.
}
}
The main changes are marked inline, but to summarize:
Use the read operation as the test in the while loop. This makes the code a lot shorter and clearer.
Don't convert the C-style string to a std::string; just use strstr to do the scan.
As a further note, you probably don't want to use a C-style string here unless you're sure that's what you want. A C++ string is probably better:
std::ifstream f("/file/on/disk");
std::string buffer;
while (std::getline(f, buffer)) { // Use the read operation as the test in the loop.
if (buffer.find(pattern) != std::string::npos) {
// further processing, then break out of the while loop and return.
}
}
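For reference, here is the std::string version wrapped up as a complete function. This is only a sketch; the name find_line_containing is mine, not anything from the original question:
#include <fstream>
#include <iostream>
#include <string>

// Returns the first line of the file that contains 'pattern', or an empty
// string if no line matches (or the file cannot be opened).
std::string find_line_containing(const std::string& path, const std::string& pattern)
{
    std::ifstream f(path);
    std::string line;
    while (std::getline(f, line))
    {
        if (line.find(pattern) != std::string::npos)
            return line;
    }
    return std::string();
}

int main()
{
    std::string hit = find_line_containing("/file/on/disk", "pattern");
    if (!hit.empty())
        std::cout << hit << '\n';
}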

In your code, you first copy characters from the file into a char array. That should be all the copying necessary. If you only needed to read each character once, even that copy wouldn't be necessary.
Next, you construct a std::string from the array you filled. Again, unnecessary. If you want a string, copy from the stream directly into a string.
std::ifstream f("/file/on/disk");
for( std::string line; std::getline(f, line); ) {
if (line.find(pattern) != std::string::npos) {
// further processing, then break out of the while loop and return.
}
}

You don't need that char[] at all.
std::string line;
std::getline(f, line);
if (line.find(pattern) != std::string::npos)
{
....
}

Related

How to read a complex input with istream&, string& and getline in c++?

I am very new to C++, so I apologize if this isn't a good question but I really need help in understanding how to use istream.
There is a project I have to create that takes several inputs, which can be on one line or spread over multiple lines, and then passes them to a vector (this is only part of the project and I would like to try the rest on my own). For example, if I were to input this...
>> aaa bb
>> ccccc
>> ddd fff eeeee
it should make a vector of strings containing "aaa", "bb", "ccccc", "ddd", "fff", "eeeee".
The input can be a char or string and the program stops asking for input when the return key is hit.
I know getline() gets a line of input and I could probably use a while loop to try and get the input such as...(correct me if I'm wrong)
while(!string.empty())
getline(cin, string);
However, I don't truly understand istream and it doesn't help that my class has not gone over pointers so I don't know how to use istream& or string& and pass it into a vector. On the project description, it said to NOT use stringstream but use functionality from getline(istream&, string&). Can anyone give somewhat of a detailed explanation as to how to make a function using getline(istream&, string&) and then how to use it in the main function?
Any little bit helps!
You're on the right track already; the only catch is that you'd have to pre-fill the string with some dummy value to enter the while loop at all. More elegant:
std::string line;
do
{
std::getline(std::cin, line);
}
while(!line.empty());
This should already do the trick of reading line by line (but possibly with multiple words on one line!) and exiting if the user enters an empty line (be aware that whitespace followed by a newline won't be recognised as an empty line!).
However, if anything on the stream goes wrong, you'll be trapped in an endless loop processing previous input again and again. So it's best to check the stream state as well:
if(!std::getline(std::cin, line))
{
// this is some sample error handling - do whatever you consider appropriate...
std::cerr << "error reading from console" << std::endl;
return -1;
}
As there might be multiple words on a single line, you'd yet have to split them. There are several ways to do so; quite an easy one is using a std::istringstream – you'll discover that it resembles what you're likely used to with std::cin:
std::istringstream s(line);
std::string word;
while(s >> word)
{
// append to vector...
}
Be aware that operator>> skips leading whitespace and stops at the first trailing whitespace (or end of stream, if reached), so you don't have to deal with it explicitly.
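Put together (still using std::istringstream), a minimal sketch of the kind of function the assignment seems to ask for might look like this; the name read_words is mine, purely for illustration:
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Reads lines from 'in' until an empty line or stream failure,
// splitting each line into whitespace-separated words.
std::vector<std::string> read_words(std::istream& in)
{
    std::vector<std::string> words;
    std::string line;
    while (std::getline(in, line) && !line.empty())
    {
        std::istringstream s(line);
        std::string word;
        while (s >> word)
            words.push_back(word);
    }
    return words;
}

int main()
{
    std::vector<std::string> words = read_words(std::cin);
    for (const std::string& w : words)
        std::cout << w << '\n';
}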
OK, you're not allowed to use std::stringstream (well, I used std::istringstream, but I suppose this little difference doesn't count, does it?). That changes matters a little and it gets more complex; on the other hand, we can decide for ourselves what counts as a word and what as a separator... We might consider punctuation marks as separators just like whitespace, but allow digits to be part of words, so we'd accept e.g. ab.7c d as "ab", "7c", "d":
auto begin = line.begin();
auto end = begin;
while(end != line.end()) // iterate over each character
{
if(std::isalnum(static_cast<unsigned char>(*end)))
{
// we are inside a word; don't touch begin to remember where
// the word started
++end;
}
else
{
// non-alpha-numeric character!
if(end != begin)
{
// we discovered a word already
// (i. e. we did not move begin together with end)
words.emplace_back(begin, end);
// ('words' being your std::vector<std::string> to place the input into)
}
++end;
begin = end; // skip whatever we had already
}
}
// corner case: a line might end with a word NOT followed by whitespace
// this isn't covered within the loop, so we need to add another check:
if(end != begin)
{
words.emplace_back(begin, end);
}
It shouldn't be too difficult to adjust this to different interpretations of what is a separator and what counts as a word (e.g. std::isalpha(...) || *end == '_' to treat underscores as part of words, but not digits). There are quite a few helper functions you might find useful...
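For instance, the character test in the loop above could be swapped for a small predicate; a sketch where underscores count as word characters but digits do not:
#include <cctype>

// Use is_word_char(*end) in place of std::isalnum(...) in the loop above:
// letters and '_' belong to words, digits and everything else separate them.
inline bool is_word_char(char c)
{
    return std::isalpha(static_cast<unsigned char>(c)) != 0 || c == '_';
}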
You could input the value of the first column, then call functions based on the value:
void Process_Value_1(std::istream& input, std::string& value);
void Process_Value_2(std::istream& input, std::string& value);
int main()
{
// ...
std::string first_value;
while (input_file >> first_value)
{
if (first_value == "aaa")
{
Process_Value_1(input_file, first_value);
}
else if (first_value == "ccc")
{
Process_Value_2(input_file, first_value);
}
//...
}
return 0;
}
A sample function could be:
void Process_Value_1(std::istream& input, std::string& value)
{
std::string b;
input >> b;
std::cout << value << "\t" << b << std::endl;
input.ignore(1000, '\n'); // Ignore until newline.
}
There are other methods to perform the process, such as using tables of function pointers and std::map.
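For example, a std::map-based dispatch table keyed on the first value might look roughly like this; the handlers and the inline test input are placeholders of mine, not part of the question:
#include <functional>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Same handling as the sample function above; the second handler simply
// reuses it to keep the sketch short.
void Process_Value_1(std::istream& input, std::string& value)
{
    std::string b;
    input >> b;
    std::cout << value << "\t" << b << std::endl;
    input.ignore(1000, '\n'); // Ignore until newline.
}

void Process_Value_2(std::istream& input, std::string& value)
{
    Process_Value_1(input, value);
}

int main()
{
    // Dispatch table: first-column value -> handler for the rest of the line.
    std::map<std::string, std::function<void(std::istream&, std::string&)>> handlers{
        { "aaa", Process_Value_1 },
        { "ccc", Process_Value_2 },
    };

    std::istringstream input_file("aaa bb\nccc dd\nxxx ee\n"); // stand-in for the real file
    std::string first_value;
    while (input_file >> first_value)
    {
        const auto it = handlers.find(first_value);
        if (it != handlers.end())
            it->second(input_file, first_value);
        else
            input_file.ignore(1000, '\n'); // unknown key: skip the rest of the line
    }
    return 0;
}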

Reading from FileStream with arbitrary delimiter

I have encountered a problem reading messages from a file using C++. Usually what people do is create a file stream and then use the getline() function to fetch a message. getline() can accept an additional parameter as the delimiter, so that it returns each "line" separated by that delimiter rather than the default '\n'. However, this delimiter has to be a char. In my use case, the delimiter in the message may be something else, like "|--|", so I am looking for a solution that accepts a string as the delimiter instead of a char.
I have searched StackOverFlow a little bit and found some interesting posts.
Parse (split) a string in C++ using string delimiter (standard C++)
This one gives a solution using string::find() and string::substr() to parse with an arbitrary delimiter. However, all the solutions there assume the input is a string rather than a stream. In my case, the file's data is too big (and wasteful) to fit into memory at once, so it should be read in message by message (or a bulk of messages at a time).
Actually, reading through the gcc implementation of std::getline(), it seems much easier to handle the case where the delimiter is a single char: every time you load a chunk of characters, you can simply search for the delimiter and split there. It is different if your delimiter is more than one char; the delimiter itself may straddle two different chunks and cause many other corner cases.
I'm not sure whether anyone else has faced this kind of requirement before, or how you handled it elegantly. It seems it would be nice to have a standard function like istream& getNext(istream&& is, string& str, string delim). This seems like a general use case to me. Why isn't something like this in the standard library, so that people don't have to implement their own versions separately?
Thank you very much
The STL simply does not natively support what you are asking for. You will have to write your own function (or find a 3rd party function) that does what you need.
For instance, you can use std::getline() to read up to the first character of your delimiter, and then use std::istream::get() to read subsequent characters and compare them to the rest of your delimiter. For example:
std::istream& my_getline(std::istream &input, std::string &str, const std::string &delim)
{
if (delim.empty())
throw std::invalid_argument("delim cannot be empty!");
if (delim.size() == 1)
return std::getline(input, str, delim[0]);
str.clear();
std::string temp;
char ch;
bool found = false;
do
{
if (!std::getline(input, temp, delim[0]))
break;
str += temp;
if (input.eof())
break; // getline() stopped at end-of-stream, not at delim[0]; nothing left to match
found = true;
for (std::string::size_type i = 1; i < delim.size(); ++i)
{
if (!input.get(ch))
{
if (input.eof())
input.clear(std::ios_base::eofbit);
str.append(delim.c_str(), i);
return input;
}
if (delim[i] != ch)
{
str.append(delim.c_str(), i);
str += ch;
found = false;
break;
}
}
}
while (!found);
return input;
}
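Usage of the function above might look like this (my own sketch, reading from an in-memory stream; std::string msg receives each message in turn):
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream input("first message|--|second message|--|third");
    std::string msg;
    while (my_getline(input, msg, "|--|")) // my_getline() as defined above
        std::cout << '[' << msg << "]\n";
}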
If you are OK with reading byte by byte, you could build a state-transition-table implementation of a finite state machine to recognize your stop condition:
std::string delimiter = "someString";
//initialize table with a row per target string character, a column per possible char and all zeros
std::vector<std::vector<int> > table(delimiter.size(), std::vector<int>(256, 0));
int endState = delimiter.size();
//set the entry for the state looking for the next letter and finding that character to the next state
for (unsigned int i = 0; i < delimiter.size(); i++) {
table[i][(unsigned char)delimiter[i]] = i + 1;
}
Now you can use it like this:
int currentState=0;
int read=0;
bool done=false;
while (!done && (read = in.get()) >= 0) { // 'in' is your std::istream
if(read>=256){
currentState=0;
}else{
currentState=table[currentState][read];
}
if(currentState==endState){
done=true;
}
//do your streamy stuff
}
Granted, this only works if the delimiter is in extended ASCII, but it will work fine for something like your example.
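Here is a self-contained sketch of the same idea (the surrounding names are mine), scanning an in-memory stream for "|--|" and collecting the text between delimiters. Note that this simple table falls back to state 0 on any mismatch, so an occurrence that begins inside a failed partial match (e.g. the input "||--|") would be missed; handling that correctly needs KMP-style failure transitions.
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    const std::string delimiter = "|--|";

    // One row per matched prefix length, one column per possible byte value.
    std::vector<std::vector<int> > table(delimiter.size(), std::vector<int>(256, 0));
    const int endState = static_cast<int>(delimiter.size());
    for (std::size_t i = 0; i < delimiter.size(); ++i)
        table[i][static_cast<unsigned char>(delimiter[i])] = static_cast<int>(i) + 1;

    std::istringstream in("first|--|second|--|third");
    std::vector<std::string> messages;
    std::string current;

    int state = 0;
    int ch;
    while ((ch = in.get()) != std::char_traits<char>::eof())
    {
        current.push_back(static_cast<char>(ch));
        state = table[state][ch];
        if (state == endState)
        {
            // Drop the delimiter characters that were appended while matching.
            current.resize(current.size() - delimiter.size());
            messages.push_back(current);
            current.clear();
            state = 0;
        }
    }
    if (!current.empty())
        messages.push_back(current); // trailing text with no delimiter after it

    for (const std::string& m : messages)
        std::cout << '[' << m << "]\n";
}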
It seems easiest to create something like getline(): read up to the last character of the separator, then check whether the string is long enough to hold the separator and, if so, whether it ends with the separator. If not, carry on reading:
std::string getline(std::istream& in, std::string& value, std::string const& separator) {
std::istreambuf_iterator<char> it(in), end;
if (separator.empty()) { // empty separator -> return the entire stream
return std::string(it, end);
}
std::string rc;
char last(separator.back());
for (; it != end; ++it) {
rc.push_back(*it);
if (rc.back() == last
&& separator.size() <= rc.size()
&& rc.substr(rc.size() - separator.size()) == separator) {
++it; // the loop's ++it hasn't run for this character yet, so consume it here
rc.resize(rc.size() - separator.size());
return rc;
}
}
return rc; // no separator was found
}
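Usage might look like this (a sketch of mine; note that the value parameter is not actually used by the implementation above, and the loop stops once peek() reports end-of-stream):
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream in("first|--|second|--|third");
    std::string unused; // the 'value' parameter is ignored by the getline() above
    while (in.peek() != std::char_traits<char>::eof())
        std::cout << '[' << getline(in, unused, "|--|") << "]\n"; // calls the overload above
}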

How to read lines from a file using the ifstream?

I have a text file with the following information in it:
2B,410,AER,2965,KZN,2990,,0,CR2
2B,410,ASF,2966,KZN,2990,,0,CR2
2B,410,ASF,2966,MRV,2962,,0,CR2
2B,410,CEK,2968,KZN,2990,,0,CR2
2B,410,CEK,2968,OVB,4078,,0,CR2
2B,410,DME,4029,KZN,2990,,0,CR2
2B,410,DME,4029,NBC,6969,,0,CR2
2B,410,DME,4029,TGK,\N,,0,CR2
(it is airline route info)
I'm trying to loop through the file and extract each line into a char* - simple right?
Well, yes, it's simple but not when you've completely forgotten how to write successful i/o operations! :)
My code goes a little like:
char * FSXController::readLine(int offset, FileLookupFlag flag)
{
// Storage Buffer
char buffer[50];
std::streampos sPos(offset);
try
{
// Init stream
if (!m_ifs.is_open())
m_ifs.open(".\\Assets\\routes.txt", std::fstream::in);
}
catch (int errorCode)
{
showException(errorCode);
return nullptr;
}
// Set stream to read input line
m_ifs.getline(buffer, 50);
// Close stream if no multiple selection required
if (flag == FileLookupFlag::single)
m_ifs.close();
return buffer;
}
Where m_ifs is my ifStream object.
The problem is that when I breakpoint my code after the getline() operation, I notice that 'buffer' has not changed?
I know it is something simple, but please could someone shed some light onto this - I'm tearing my forgetful hair out! :)
P.S: I never finished writing the exception handling so it is pretty useless right now!
Thanks
Here is a fix using some important C++ library facilities you may want to learn, and what I believe is a better solution, since you just need your final result to be strings:
// A program to read a file to a vector of strings
// - Each line is a string element of a vector container
#include <fstream>
#include <string>
#include <vector>
// ..
std::vector<std::string> ReadTheWholeFile()
{
std::vector<std::string> MyVector;
std::string PlaceHolderStr;
std::ifstream InFile;
InFile.open("YourText.txt"); // or the full path of a text file
if (InFile.is_open())
while (std::getline(InFile, PlaceHolderStr))
MyVector.push_back(PlaceHolderStr);
InFile.close(); // we usually finish what we start - but not needed
return MyVector;
}
int main()
{
// result
std::vector<std::string> MyResult = ReadTheWholeFile();
return 0;
}
There are two basic problems with your code:
You are returning a local variable. The statement return buffer; results in a dangling pointer.
You are using a char buffer. C-style strings are discouraged in C++; you should always prefer std::string instead.
A far better approach is this:
string FSXController::readLine(int offset, FileLookupFlag flag) {
string line;
//your code here
getline(m_ifs, line); // or while (getline(m_ifs, line)) { /* code here */ } to read multiple lines
//rest of your code
return line;
}
More information about std::string can be found here
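A fuller sketch of what the member function might look like with std::string. This is hedged guesswork around the question's code: m_ifs, FileLookupFlag::single and the file path come from the question; the original constructs a std::streampos from offset but never uses it, so I've assumed a seekg() was intended, and the multiple enumerator is my own placeholder:
#include <fstream>
#include <string>

enum class FileLookupFlag { single, multiple }; // 'multiple' is a guess at the other mode

class FSXController
{
public:
    std::string readLine(int offset, FileLookupFlag flag);
private:
    std::ifstream m_ifs;
};

std::string FSXController::readLine(int offset, FileLookupFlag flag)
{
    if (!m_ifs.is_open())
        m_ifs.open(".\\Assets\\routes.txt", std::ifstream::in);

    m_ifs.seekg(offset); // assumption: 'offset' was meant to position the stream
    std::string line;
    if (!std::getline(m_ifs, line))
        line.clear(); // read failed: return an empty string

    if (flag == FileLookupFlag::single)
        m_ifs.close();
    return line;
}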

Can I use 2 or more delimiters in C++ function getline? [duplicate]

This question already has answers here:
How can I read and parse CSV files in C++?
I would like to know how I can use 2 or more delimiters in the getline function; that's my problem:
The program reads a text file... each line is going to be like:
New York, Paris, 100
CityA, CityB, 200
I am using getline(file, line), but I get the whole line, when I want to get CityA, then CityB and then the number; and if I use ',' as the delimiter, I won't know where the next line starts, so I'm trying to figure out some solution.
So, how could I use both comma and \n as delimiters?
By the way, I'm manipulating the string type, not char*, so strtok is not possible :/
some scratch:
string line;
ifstream file("text.txt");
if(file.is_open())
while(!file.eof()){
getline(file, line);
// here I need to get each string before comma and \n
}
You can read a line using std::getline, then pass the line to a std::stringstream and read the comma-separated values off it:
string line;
ifstream file("text.txt");
if(file.is_open()){
while(getline(file, line)){ // get a whole line
std::stringstream ss(line);
std::string token;
while(getline(ss, token, ',')){
// You now have the separate entities here
}
}
}
No, std::getline() only accepts a single character, to override the default delimiter. std::getline() does not have an option for multiple alternate delimiters.
The correct way to parse this kind of input is to use the default std::getline() to read the entire line into a std::string, then construct a std::istringstream, and then parse it further, into comma-separate values.
However, if you are truly parsing comma-separated values, you should be using a proper CSV parser.
Often, it is more intuitive and efficient to parse character input in a hierarchical, tree-like manner, where you start by splitting the string into its major blocks, then go on to process each of the blocks, splitting them up into smaller parts, and so on.
An alternative to this is to tokenize like strtok does -- from the beginning of input, handling one token at a time until the end of input is encountered. This may be preferred when parsing simple inputs, because it is straightforward to implement. This style can also be used when parsing inputs with nested structure, but that requires maintaining some kind of context information, which might grow too complex to maintain inside a single function or limited region of code.
Someone relying on the C++ std library usually ends up using a std::stringstream, along with std::getline to tokenize string input. But, this only gives you one delimiter. They would never consider using strtok, because it is a non-reentrant piece of junk from the C runtime library. So, they end up using streams, and with only one delimiter, one is obligated to use a hierarchical parsing style.
But zneak brought up std::string::find_first_of, which takes a set of characters and returns the position nearest to the beginning of the string containing a character from the set. And there are other member functions: find_last_of, find_first_not_of, and more, which seem to exist for the sole purpose of parsing strings. But std::string stops short of providing useful tokenizing functions.
Another option is the <regex> library, which can do anything you want, but it is new and you will need to get used to its syntax.
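For example, splitting the question's input on either commas or newlines with <regex> might look like this (a sketch; surrounding whitespace is folded into the delimiter pattern):
#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main()
{
    const std::string input = "New York, Paris, 100\nCityA, CityB, 200";

    // Split on a comma or newline, optionally surrounded by other whitespace.
    const std::regex delims(R"(\s*[,\n]\s*)");
    std::sregex_token_iterator it(input.begin(), input.end(), delims, -1), end;

    std::vector<std::string> tokens(it, end);
    for (const std::string& t : tokens)
        std::cout << '[' << t << "]\n";
}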
But, with very little effort, you can leverage existing functions in std::string to perform tokenizing tasks, and without resorting to streams. Here is a simple example. get_to() is the tokenizing function and tokenize demonstrates how it is used.
The code in this example will be slower than strtok, because it constantly erases characters from the beginning of the string being parsed, and also copies and returns substrings. This makes the code easy to understand, but it does not mean more efficient tokenizing is impossible. It wouldn't even be that much more complicated than this -- you would just keep track of your current position, use this as the start argument in std::string member functions, and never alter the source string. And even better techniques exist, no doubt.
To understand the example's code, start at the bottom, where main() is and where you can see how the functions are used. The top of this code is dominated by basic utility functions and dumb comments.
#include <iostream>
#include <string>
#include <utility>
namespace string_parsing {
// in-place trim whitespace off ends of a std::string
inline void trim(std::string &str) {
auto space_is_it = [] (char c) {
// A few asks:
// * Suppress criticism WRT localization concerns
// * Avoid jumping to conclusions! And seeing monsters everywhere!
// Things like...ah! Believing "thoughts" that assumptions were made
// regarding character encoding.
// * If an obvious, portable alternative exists within the C++ Standard Library,
// you will see it in 2.0, so no new defect tickets, please.
// * Go ahead and ignore the rumor that using lambdas just to get
// local function definitions is "cheap" or "dumb" or "ignorant."
// That's the latest round of FUD from...*mumble*.
return c > '\0' && c <= ' ';
};
for(auto rit = str.rbegin(); rit != str.rend(); ++rit) {
if(!space_is_it(*rit)) {
if(rit != str.rbegin()) {
str.erase(&*rit - &*str.begin() + 1);
}
for(auto fit=str.begin(); fit != str.end(); ++fit) {
if(!space_is_it(*fit)) {
if(fit != str.begin()) {
str.erase(str.begin(), fit);
}
return;
} } } }
str.clear();
}
// get_to(string, <delimiter set> [, delimiter])
// The input+output argument "string" is searched for the first occurance of one
// from a set of delimiters. All characters to the left of, and the delimiter itself
// are deleted in-place, and the substring which was to the left of the delimiter is
// returned, with whitespace trimmed.
// <delimiter set> is forwarded to std::string::find_first_of, so its type may match
// whatever this function's overloads accept, but this is usually expressed
// as a string literal: ", \n" matches commas, spaces and linefeeds.
// The optional output argument "found_delimiter" receives the delimiter character just found.
template <typename D>
inline std::string get_to(std::string& str, D&& delimiters, char& found_delimiter) {
const auto pos = str.find_first_of(std::forward<D>(delimiters));
if(pos == std::string::npos) {
// When none of the delimiters are present,
// clear the string and return its last value.
// This effectively makes the end of a string an
// implied delimiter.
// This behavior is convenient for parsers which
// consume chunks of a string, looping until
// the string is empty.
// Without this feature, it would be possible to
// continue looping forever, when an iteration
// leaves the string unchanged, usually caused by
// a syntax error in the source string.
// So the implied end-of-string delimiter takes
// away the caller's burden of anticipating and
// handling the range of possible errors.
found_delimiter = '\0';
std::string result;
std::swap(result, str);
trim(result);
return result;
}
found_delimiter = str[pos];
auto left = str.substr(0, pos);
trim(left);
str.erase(0, pos + 1);
return left;
}
template <typename D>
inline std::string get_to(std::string& str, D&& delimiters) {
char discarded_delimiter;
return get_to(str, std::forward<D>(delimiters), discarded_delimiter);
}
inline std::string pad_right(const std::string& str,
std::string::size_type min_length,
char pad_char=' ')
{
if(str.length() >= min_length ) return str;
return str + std::string(min_length - str.length(), pad_char);
}
inline void tokenize(std::string source) {
std::cout << source << "\n\n";
bool quote_opened = false;
while(!source.empty()) {
// If we just encountered an open-quote, only include the quote character
// in the delimiter set, so that a quoted token may contain any of the
// other delimiters.
const char* delimiter_set = quote_opened ? "'" : ",'{}";
char delimiter;
auto token = get_to(source, delimiter_set, delimiter);
quote_opened = delimiter == '\'' && !quote_opened;
std::cout << " " << pad_right('[' + token + ']', 16)
<< " " << delimiter << '\n';
}
std::cout << '\n';
}
}
int main() {
string_parsing::tokenize("{1.5, null, 88, 'hi, {there}!'}");
}
This outputs:
{1.5, null, 88, 'hi, {there}!'}
[] {
[1.5] ,
[null] ,
[88] ,
[] '
[hi, {there}!] '
[] }
I don't think that's how you should attack the problem (even if you could do it); instead:
1. Use what you have to read in each line.
2. Then split up that line by the commas to get the pieces that you want.
If strtok will do the job for #2, you can always convert your string into a char array.
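For instance, a sketch of point 2 with strtok: it needs a writable, NUL-terminated buffer, so the std::string is copied into one first (the sample line is mine):
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string line = "CityA, CityB, 200";

    // strtok modifies its input, so copy the string into a writable buffer.
    std::vector<char> buf(line.begin(), line.end());
    buf.push_back('\0');

    // Treat commas and spaces as delimiters.
    for (char* tok = std::strtok(buf.data(), ", "); tok != nullptr; tok = std::strtok(nullptr, ", "))
        std::cout << '[' << tok << "]\n";
}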

Parse buffered data line by line

I want to write a parser for the Wavefront OBJ file format, a plain text file.
Example can be seen here: people.sc.fsu.edu/~jburkardt/data/obj/diamond.obj.
Most people use old scanf to parse this format line by line; however, I would prefer to load the whole file at once to reduce the number of IO operations. Is there a way to parse this kind of buffered data line by line?
void ObjModelConditioner::Import(Model& asset)
{
uint8_t* buffer = SyncReadFile(asset.source_file_info());
delete [] buffer;
}
Or would it be preferable to load the whole file into a string and try to parse that?
After a while, it seems I found a sufficient (and simple) solution. Since my goal is to create an asset conditioning pipeline, the code has to be able to handle large amounts of data efficiently. The data can be read into a string at once and, once loaded, a stringstream can be initialized with this string.
std::string data;
SyncReadFile(asset.source_file_info(), data);
std::stringstream data_stream(data);
std::string line;
Then I simply call getline():
while(std::getline(data_stream, line))
{
std::stringstream line_stream(line);
std::string type_token;
line_stream >> type_token;
if (type_token == "v") {
// Vertex position
Vector3f position;
line_stream >> position.x >> position.y >> position.z;
// ...
}
else if (type_token == "vn") {
// Vertex normal
}
else if (type_token == "vt") {
// Texture coordinates
}
else if (type_token == "f") {
// Face
}
}
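For the "f" branch, the tokens can be broken down further in the same style; a sketch of mine that assumes fully specified v/vt/vn triplets (OBJ indices are 1-based, and real files may omit fields, e.g. 1//3, which this does not handle):
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    // One face line from the format above, minus the leading "f" token.
    std::istringstream line_stream("1/2/3 4/5/6 7/8/9");

    std::string face_token;
    while (line_stream >> face_token) // one v/vt/vn triplet per token
    {
        std::istringstream corner(face_token);
        int v = 0, vt = 0, vn = 0;
        char slash = 0;
        corner >> v >> slash >> vt >> slash >> vn;
        std::cout << v << ' ' << vt << ' ' << vn << '\n';
    }
}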
Here's a function that splits a char array into a vector of strings, one element per '\n'-separated line:
#include <iostream>
#include <vector>
std::vector<std::string> split(const char* arr)
{
std::string str = arr;
std::vector<std::string> result;
std::string::size_type beg = 0; // beginning of the current line in the array
while (true)
{
const std::string::size_type end = str.find('\n', beg); // end of the current line
if (end == std::string::npos)
{
result.push_back(str.substr(beg)); // last line, with no trailing '\n'
break;
}
result.push_back(str.substr(beg, end - beg));
beg = end + 1; // skip past the '\n' itself
}
return result;
}
Here's the usage:
int main()
{
const char* a = "asdasdasdasdasd \n asdasdasd \n asdasd";
std::vector<std::string> result = split(a);
}
If you've got the raw data in a char[] (or an unsigned char[]) and
you know its length, it's pretty trivial to write an input-only, no-seek
streambuf which will allow you to create a std::istream
and to use std::getline on it. Just call:
setg( start, start, start + length );
in the constructor. (Nothing else is needed.)
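A minimal sketch of such a streambuf (the class name is mine; only the constructor is needed, exactly as described above):
#include <cstddef>
#include <iostream>
#include <streambuf>
#include <string>

// Input-only streambuf over an existing character buffer; no seeking support.
class MemoryBuf : public std::streambuf
{
public:
    MemoryBuf(char* start, std::size_t length)
    {
        setg(start, start, start + length); // the whole get area is the caller's buffer
    }
};

int main()
{
    char data[] = "first line\nsecond line\nthird line";
    MemoryBuf buf(data, sizeof(data) - 1); // -1: exclude the terminating '\0'
    std::istream in(&buf);

    std::string line;
    while (std::getline(in, line))
        std::cout << '[' << line << "]\n";
}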
It really depends on how you're going to parse the text. One way to do this would be simply to read the data into a vector of strings. I'll assume that you've already covered issues such as scalability / use of memory, etc.
std::vector<std::string> lines;
std::string line;
std::ifstream file(filename.c_str(), std::ios_base::in);
while ( getline( file, line ) )
{
lines.push_back( line );
}
file.close();
This would cache your file in lines. Next you need to go through lines:
for ( std::vector<std::string>::const_iterator it = lines.begin();
it != lines.end(); ++it)
{
const std::string& line = *it;
if ( line.empty() )
continue;
switch ( line[0] )
{
case 'g':
// Some stuff
break;
case 'v':
// Some stuff
break;
case 'f':
// Some stuff
break;
default:
// Default stuff including '#' (probably nothing)
}
}
Naturally, this is very simplistic and depends largely on what you want to do with your file.
The size of the file that you've given as an example is hardly likely to cause IO stress (unless you're using some very lightweight equipment) but if you're reading many files at once I suppose it might be an issue.
I think your concern here is to minimise IO, and I'm not sure that this solution will really help that much, since you're going to be iterating over a collection twice. If you need to go back and keep reading the same file over and over again, then caching the file in memory will definitely speed things up, but there are just as easy ways to do this, such as memory-mapping the file and using normal file access. If you're really concerned, try profiling a solution like this against simply processing the file directly as you read from IO.