How to extract the string pattern in C++ efficiently?


I have a pattern that in the following format:
AUTHOR, "TITLE" (PAGES pp.) [CODE STATUS]
For example, I have a string
P.G. Wodehouse, "Heavy Weather" (336 pp.) [PH.409 AVAILABLE FOR LENDING]
I want to extract
AUTHOR = P.G. Wodehouse
TITLE = Heavy Weather
PAGES = 336
CODE = PH.409
STATUS = AVAILABLE FOR LENDING
I only know how to do that in Python. Is there an efficient way to do the same thing in C++?

Exactly the same way as in Python. C++11 has regular expressions (and for earlier C++, there's Boost regex.) As for the read loop:
std::string line;
while ( std::getline( file, line ) ) {
// ...
}
is almost exactly the same as:
for line in file:
# ...
The only differences are:
The C++ version will not put the trailing '\n' in the buffer. (In general, the C++ version may be less flexible with regards to end of line handling.)
In case of a read error, the C++ version will break the loop; the Python version will raise an exception.
Neither should be an issue in your case.
EDIT:
It just occurs to me that while regular expressions in C++ and in Python are very similar, the syntax for using them isn't quite the same. So:
In C++, you'd normally declare an instance of the regular expression before using it; something like Python's re.match( r'...', line ) is theoretically possible, but not very idiomatic (and it would still involve explicitly constructing a regular expression object in the expression). Also, the match function simply returns a boolean; if you want the captures, you need to define a separate object for them. Typical use would probably be something like:
static std::regex const matcher( "the regular expression" );
std::smatch forCaptures;
if ( std::regex_match( line, forCaptures, matcher ) ) {
std::string firstCapture = forCaptures[1];
// ...
}
This corresponds to the Python:
m = re.match( 'the regular expression', line )
if m:
firstCapture = m.group(1)
# ...
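Applied to the sample line from the question, the pattern above might look like the following sketch. The names `BookFields` and `parseBookLine` are invented for this illustration, and the regular expression is only one plausible reading of the format.

```cpp
#include <regex>
#include <string>

// Hypothetical holder for the five fields (name chosen for this sketch).
struct BookFields {
    std::string author, title, code, status;
    int pages = 0;
};

// Returns true and fills `out` if `line` has the form
//   AUTHOR, "TITLE" (PAGES pp.) [CODE STATUS]
// Returns false (leaving `out` untouched) otherwise.
bool parseBookLine(const std::string& line, BookFields& out)
{
    // Compiled once; constructing a std::regex is comparatively expensive.
    static const std::regex matcher(
        R"re(([^,]*),\s*"([^"]*)"\s*\((\d+) pp\.\)\s*\[(\S+)\s+([^\]]*)\])re");
    std::smatch m;
    if (!std::regex_match(line, m, matcher)) {
        return false;
    }
    out.author = m[1].str();
    out.title  = m[2].str();
    out.pages  = std::stoi(m[3].str());
    out.code   = m[4].str();
    out.status = m[5].str();
    return true;
}
```

Calling `parseBookLine` on each line read by `std::getline` gives the same shape as the Python loop; a `false` return marks a line that does not fit the format.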
EDIT:
Another answer has suggested overloading operator>>; I heartily concur. Just out of curiosity, I gave it a go; something like the following works well:
struct Book
{
std::string author;
std::string title;
int pages;
std::string code;
std::string status;
};
std::istream&
operator>>( std::istream& source, Book& dest )
{
std::string line;
std::getline( source, line );
if ( source )
{
static std::regex const matcher(
R"^(([^,]*),\s*"([^"]*)"\s*\((\d+) pp\.\)\s*\[(\S+)\s*([^\]]*)\])^"
);
std::smatch capture;
if ( ! std::regex_match( line, capture, matcher ) ) {
source.setstate( std::ios_base::failbit );
} else {
dest.author = capture[1];
dest.title = capture[2];
dest.pages = std::stoi( capture[3] );
dest.code = capture[4];
dest.status = capture[5];
}
}
return source;
}
Once you've done this, you can write things like:
std::vector<Book> v( (std::istream_iterator<Book>( inputFile )),
(std::istream_iterator<Book>()) );
And load an entire file in the initialization of a vector.
Note the error handling in the operator>>. If a line is malformed, we set failbit; this is the standard convention in C++.
EDIT:
Since there's been so much discussion: the above is fine for small, one-off programs, such as school projects or programs which will read the current file, output it in a new format, and then be thrown away. In production code, I would insist on support for comments and empty lines; on continuing in case of error, in order to report multiple errors (with line numbers); and probably on continuation lines (since titles can get long enough to become unwieldy). It's not practical to do this with operator>>, if for no other reason than the need to output line numbers, so I'd use a parser along the following lines:
int
getContinuationLines( std::istream& source, std::string& line )
{
int results = 0;
while ( source.peek() == '&' ) {
std::string more;
std::getline( source, more ); // Cannot fail, because of peek
more[0] = ' ';
line += more;
++ results;
}
return results;
}
void
trimComment( std::string& line )
{
char quoted = '\0';
std::string::iterator position = line.begin();
while ( position != line.end() && (quoted != '\0' || *position != '#') ) {
if ( *position == '\\' && std::next( position ) != line.end() ) {
++ position;
} else if ( *position == quoted ) {
quoted = '\0';
} else if ( *position == '\"' || *position == '\'' ) {
quoted = *position;
}
++ position;
}
line.erase( position, line.end() );
}
bool
isEmpty( std::string const& line )
{
return std::all_of(
line.begin(),
line.end(),
[]( unsigned char ch ) { return isspace( ch ); } );
}
std::vector<Book>
parseFile( std::istream& source )
{
std::vector<Book> results;
int lineNumber = 0;
std::string line;
bool errorSeen = false;
while ( std::getline( source, line ) ) {
++ lineNumber;
int extraLines = getContinuationLines( source, line );
trimComment( line );
if ( ! isEmpty( line ) ) {
static std::regex const matcher(
R"^(([^,]*),\s*"([^"]*)"\s*\((\d+) pp\.\)\s*\[(\S+)\s*([^\]]*)\])^"
);
std::smatch capture;
if ( ! std::regex_match( line, capture, matcher ) ) {
std::cerr << "Format error, line " << lineNumber << std::endl;
errorSeen = true;
} else {
results.push_back( Book{ // Book is an aggregate, so use brace initialization
capture[1],
capture[2],
std::stoi( capture[3] ),
capture[4],
capture[5] } );
}
}
lineNumber += extraLines;
}
if ( errorSeen ) {
results.clear(); // Or more likely, throw some sort of exception.
}
return results;
}
The real issue here is how you report the error to the caller; I suspect that in most cases, an exception would be appropriate, but depending on the use case, other alternatives may be valid as well. In this example, I just return an empty vector. (The interaction between comments and continuation lines probably needs to be better defined as well, with modifications according to how it has been defined.)

Your input string is well delimited so I'd recommend using an extraction operator over a regex, for speed and for ease of use.
You'd first need to create a struct for your books:
struct book{
string author;
string title;
int pages;
string code;
string status;
};
Then you'd need to write the actual extraction operator:
istream& operator>>(istream& lhs, book& rhs){
lhs >> ws;
getline(lhs, rhs.author, ',');
lhs.ignore(numeric_limits<streamsize>::max(), '"');
getline(lhs, rhs.title, '"');
lhs.ignore(numeric_limits<streamsize>::max(), '(');
lhs >> rhs.pages;
lhs.ignore(numeric_limits<streamsize>::max(), '[');
lhs >> rhs.code >> ws;
getline(lhs, rhs.status, ']');
return lhs;
}
This gives you a tremendous amount of power. For example you can extract all the books from an istream into a vector like this:
istringstream foo("P.G. Wodehouse, \"Heavy Weather\" (336 pp.) [PH.409 AVAILABLE FOR LENDING]\nJohn Bunyan, \"The Pilgrim's Progress\" (336 pp.) [E.1173 CHECKED OUT]");
vector<book> bar{ istream_iterator<book>(foo), istream_iterator<book>() };

Use flex (it generates C or C++ code, to be used as a part or as the full program)
%%
^[^,]+/, {printf("Author: %s\n",yytext );}
\"[^"]+\" {printf("Title: %s\n",yytext );}
\([^ ]+/[ ]pp\. {printf("Pages: %s\n",yytext+1);}
..................
.|\n {}
%%
(untested)

Here's the code:
#include <iostream>
#include <string>
using namespace std;
string extract (string a)
{
string str = "AUTHOR = "; //the result string
int i = 0;
while (a[i] != ',')
str += a[i++];
while (a[i++] != '\"');
str += "\nTITLE = ";
while (a[i] != '\"')
str += a[i++];
while (a[i++] != '(');
str += "\nPAGES = ";
while (a[i] != ' ')
str += a[i++];
while (a[i++] != '[');
str += "\nCODE = ";
while (a[i] != ' ')
str += a[i++];
while (a[i] == ' ') ++i; // skip spaces without consuming the first character of STATUS
str += "\nSTATUS = ";
while (a[i] != ']')
str += a[i++];
return str;
}
int main ()
{
string a;
getline (cin, a);
cout << extract (a) << endl;
return 0;
}
Happy coding :)
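The loops above assume every delimiter is present and will run past the end of the string on malformed input. A more defensive sketch of the same idea, using `std::string::find`, could look like this; `between` and `extractChecked` are names made up for this example, and returning an empty string on failure is just one possible convention.

```cpp
#include <string>

// Copies the text between the next `open` and the following `close`,
// searching from `pos`. On success, moves `pos` just past `close`.
// (Hypothetical helper for this sketch.)
bool between(const std::string& s, std::string::size_type& pos,
             const std::string& open, const std::string& close,
             std::string& out)
{
    std::string::size_type b = s.find(open, pos);
    if (b == std::string::npos) return false;
    b += open.size();
    std::string::size_type e = s.find(close, b);
    if (e == std::string::npos) return false;
    out = s.substr(b, e - b);
    pos = e + close.size();
    return true;
}

// Builds the same "KEY = value" report as extract(), or returns an
// empty string if any delimiter is missing.
std::string extractChecked(const std::string& a)
{
    std::string author, title, pages, code, status;
    std::string::size_type comma = a.find(',');
    if (comma == std::string::npos) return "";
    author = a.substr(0, comma);
    std::string::size_type pos = comma + 1;
    if (!between(a, pos, "\"", "\"", title)) return "";
    if (!between(a, pos, "(", " ", pages)) return "";
    if (!between(a, pos, "[", " ", code)) return "";
    // STATUS runs from the first non-space character up to ']'.
    std::string::size_type start = a.find_first_not_of(' ', pos);
    std::string::size_type end = a.find(']', pos);
    if (start == std::string::npos || end == std::string::npos || start > end)
        return "";
    status = a.substr(start, end - start);
    return "AUTHOR = " + author + "\nTITLE = " + title + "\nPAGES = " + pages
         + "\nCODE = " + code + "\nSTATUS = " + status;
}
```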

Related

reading csv file for specific information

I am wondering how to read a specific value from a csv file in C++, and then read the next four items in the file. For example, this is what the file would look like:
fire,2.11,2,445,7891.22,water,234,332.11,355,5654.44,air,4535,122,334.222,16,earth,453,46,77.3,454
What I want to do is let my user select one of the values, let's say "air" and also read the next four items(4535 122 334.222 16).
I only want to use fstream,iostream,iomanip libraries. I am a newbie, and I am horrible at writing code, so please, be gentle.

You should read about parsers and the full CSV specification.
If your fields are free of commas and double quotes, and you need a quick solution, search for getline/strtok, or try this (not compiled/tested):
typedef std::vector< std::string > svector;
bool get_line( std::istream& is, svector& d, const char sep = ',' )
{
d.clear();
if ( ! is )
return false;
char c;
std::string s;
while ( is.get(c) && c != '\n' )
{
if ( c == sep )
{
d.push_back( s );
s.clear();
}
else
{
s += c;
}
}
if ( ! s.empty() )
d.push_back( s );
return ! d.empty(); // true if at least one field was read, even when the line ends with a separator
}
int main()
{
std::ifstream is( "test.txt" );
if ( ! is )
return -1;
svector line;
while ( get_line( is, line ) )
{
//...
}
return 0;
}

c++ Reading numbers from text files, ignoring comments

So I've seen lots of solutions on this site and tutorials about reading in from a text file in C++, but have yet to figure out a solution to my problem. I'm new at C++ so I think I'm having trouble piecing together some of the documentation to make sense of it all.
What I am trying to do is read numbers from a text file while ignoring comments in the file that are denoted by "#". So an example file would look like:
#here is my comment
20 30 40 50
#this is my last comment
60 70 80 90
My code can read numbers fine when there aren't any comments, but I don't understand parsing the stream well enough to ignore the comments. It's kind of a hack solution right now.
/////////////////////// Read the file ///////////////////////
std::string line;
if (input_file.is_open())
{
//While we can still read the file
while (std::getline(input_file, line))
{
std::istringstream iss(line);
float num; // The number in the line
//while the iss is a number
while ((iss >> num))
{
//look at the number
}
}
}
else
{
std::cout << "Unable to open file";
}
/////////////////////// done reading file /////////////////
Is there a way I can incorporate comment handling with this solution or do I need a different approach? Any advice would be great, thanks.

If # always appears in the first column of your file, then just test whether the line starts with #, like this:
while (std::getline(input_file, line))
{
if (line.empty() || line[0] != '#')
{
std::istringstream iss(line);
float num; // The number in the line
//while the iss is a number
while ((iss >> num))
{
//look at the number
}
}
}
It is wise, though, to trim leading and trailing whitespace from the line, as shown here for example: Remove spaces from std::string in C++

If this is just for one-off use, then for line-oriented input like yours, the
simplest solution is just to strip the comment from the line you just
read:
line.erase( std::find( line.begin(), line.end(), '#' ), line.end() );
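Dropped into the read loop from the question, that one-liner gives something like the sketch below. The function name `readNumbers` and the choice to collect the values in a vector are assumptions made for the example.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads every number from `input`, ignoring anything from '#' to the
// end of the line it appears on.
std::vector<float> readNumbers(std::istream& input)
{
    std::vector<float> numbers;
    std::string line;
    while (std::getline(input, line)) {
        // Cut the comment (if any) off the end of the line.
        line.erase(std::find(line.begin(), line.end(), '#'), line.end());
        std::istringstream iss(line);
        float num;
        while (iss >> num) {
            numbers.push_back(num);
        }
    }
    return numbers;
}
```

Because the comment is removed wherever it starts, this also handles a `#` that follows numbers on the same line.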
A more generic solution would be to use a filtering streambuf, something
like:
class FilterCommentsStreambuf : public std::streambuf
{
std::istream& myOwner;
std::streambuf* mySource;
char myCommentChar;
char myBuffer;
protected:
int underflow()
{
int const eof = traits_type::eof();
int results = mySource->sbumpc();
if ( results == myCommentChar ) {
while ( results != eof && results != '\n') {
results = mySource->sbumpc();
}
}
if ( results != eof ) {
myBuffer = results;
setg( &myBuffer, &myBuffer, &myBuffer + 1 );
}
return results;
}
public:
FilterCommentsStreambuf( std::istream& source,
char comment = '#' )
: myOwner( source )
, mySource( source.rdbuf() )
, myCommentChar( comment )
{
myOwner.rdbuf( this );
}
~FilterCommentsStreambuf()
{
myOwner.rdbuf( mySource );
}
};
In this case, you could even forgo getline:
FilterCommentsStreambuf filter( input_file );
double num;
while ( input_file >> num || !input_file.eof() ) {
if ( ! input_file ) {
// Formatting error, output error message, clear the
// error, and resynchronize the input---probably by
// ignore'ing until end of line.
} else {
// Do something with the number...
}
}
(In such cases, I've found it useful to also track the line number in
the FilterCommentsStreambuf. That way you have it for error
messages.)

An alternative to "read a line and parse it as a string" is to use the stream itself as the incoming buffer:
while(input_file)
{
int n = 0;
char c;
input_file >> c; // will skip spaces and read the first non-blank
if(c == '#')
{
while(c!='\n' && input_file) input_file.get(c);
continue; //may be not soooo beautiful, but does not introduce useless dynamic memory
}
//c is part of something else but comment, so give it back to parse it as number
input_file.unget(); //< this is what all the fuss is about!
if(input_file >> n)
{
// look at the number
continue;
}
// something else, but not an integer is there ....
// if you cannot recover the loop will exit
}

Search for string in file (line-by-line), ignoring the size of whitespace gaps between words

I'm a beginner to C++, so please be understanding...
I want to search for a string (needle) within a file (haystack), by reading each line separately, then searching for the needle in that line. However, ideally, for more robust code, I would like to be able to just read individual words on the line, so that if there are larger (i.e. multiple) white-space gaps between words they are ignored when searching for the needle. (e.g. perhaps using the >> operator??) That is, the needle string should not have to exactly match the size of the space between words in the file.
so for example, if I have a needle:
"The quick brown fox jumps over the lazy dog"
in the file this might be written (on a particular line) as:
... "The quick brown fox jumps over the lazy dog" ...
Is there an efficient way to do this?
Currently I include the necessary number of spaces in my needle string but I would like to improve the code, if possible.
My code currently looks something like the following (within a method in a class):
double var1, var2;
char skip[5];
std::fstream haystack ("filename");
std::string needle = "This is a string, and var1 =";
std::string line;
int pos;
bool found = false;
// Search for needle
while ( !found && getline (haystack,line) ) {
pos = line.find(needle); // find position of needle in current line
if (pos != std::string::npos) { // current line contains needle
std::stringstream lineStream(line);
lineStream.seekg (pos + needle.length());
lineStream >> var1;
lineStream >> skip;
lineStream >> var2;
found = true;
}
}
(Just for clarity, after finding the string (needle) I want to store the next word on that line or in some cases store the next word, then skip a word and store the following word, for example:
With a file:
... ...
... This is a string, and var1 = 111 and 777 ...
... ...
I want to extract var1 = 111; var2 = 777; )
Thanks in advance for any help!

This will work, although I think there's a shorter solution:
std::size_t myfind(std::string ins, std::string str) {
for(std::string::iterator it = ins.begin(), mi = str.begin(); it != ins.end(); ++it) {
if(*it == *mi) {
++mi;
if (mi == str.end())
return std::distance(ins.begin(),it);
}
else {
if(*it == ' ')
continue;
mi = str.begin();
}
}
return std::string::npos;
}
// use:
myfind("foo The quick brown fox jumps over the lazy dog bar", "The quick brown fox");

You can find all sequences of white space characters in the line string, and replace them with a single white space. This way you would be able to replace multiple spaces in the needle as well, and the rest of your search algorithm would continue working unchanged.
Here is a way to remove duplicates using STL:
#include <iostream>
#include <algorithm>
#include <string>
#include <iterator>
using namespace std;
struct DupSpaceDetector {
bool wasSpace;
DupSpaceDetector() : wasSpace(0) {}
bool operator()(int c) {
if (c == ' ') {
if (wasSpace) {
return 1;
} else {
wasSpace = 1;
return 0;
}
} else {
wasSpace = 0;
return 0;
}
}
};
int main() {
string source("The quick brown fox jumps over the lazy dog");
string destination;
DupSpaceDetector detector;
remove_copy_if(
source.begin()
, source.end()
, back_inserter(destination)
, detector
);
cerr << destination << endl;
return 0;
}

To solve your problem you should strip extra spaces from the needle and the haystack line. std::unique is defined to do this. Normally it is used after sorting the range, but in this case all we really want to do is remove duplicate spaces.
struct dup_space
{
bool operator()( char lhs, char rhs )
{
return std::isspace( lhs ) && std::isspace( rhs );
}
};
void despacer( const std::string& in, std::string& out )
{
out.reserve( in.size() );
std::unique_copy( in.begin(), in.end(),
std::back_inserter( out ),
dup_space()
);
}
You should use it like this:
void find( const std::string& needle, std::istream& haystack )
{
std::string real_needle;
despacer( needle, real_needle );
std::string line;
std::string real_line;
while( haystack.good() )
{
line.clear();
std::getline( haystack, line );
real_line.clear();
despacer( line, real_line );
auto ret = real_line.find( real_needle );
if( ret != std::string::npos )
{
// found it
// do something creative
}
}
}
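The question hints at using the >> operator for this. A minimal sketch along those lines rebuilds both strings word by word, so that any run of whitespace becomes one space; `collapseSpaces` and `containsIgnoringSpacing` are hypothetical names, and like a plain `find` the match can begin or end in the middle of a word.

```cpp
#include <sstream>
#include <string>

// Returns `text` with leading/trailing whitespace removed and every
// internal run of whitespace collapsed to a single space.
std::string collapseSpaces(const std::string& text)
{
    std::istringstream words(text);
    std::string word, result;
    while (words >> word) {            // >> skips any amount of whitespace
        if (!result.empty()) {
            result += ' ';
        }
        result += word;
    }
    return result;
}

// True if `needle` occurs in `line` once whitespace differences are ignored.
bool containsIgnoringSpacing(const std::string& line, const std::string& needle)
{
    return collapseSpaces(line).find(collapseSpaces(needle)) != std::string::npos;
}
```

To go on and read the values after the needle, one option is to run `find` on the collapsed line and then open a `std::stringstream` on that collapsed copy, since positions in it no longer correspond to positions in the original line.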

tokenizing a string of data into a vector of structs?

So I have the following string of data, which is being received through a TCP winsock connection, and would like to do an advanced tokenization, into a vector of structs, where each struct represents one record.
std::string buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n";
struct table_t
{
std::string key;
std::string first;
std::string last;
std::string rank;
std::string additional;
};
Each record in the string is delimited by a carriage return. My attempt at splitting up the records, but not yet splitting up the fields:
void tokenize(const std::string& str, std::vector<std::string>& records)
{
// Skip delimiters at beginning.
std::string::size_type lastPos = str.find_first_not_of("\n", 0);
// Find first "non-delimiter".
std::string::size_type pos = str.find_first_of("\n", lastPos);
while (std::string::npos != pos || std::string::npos != lastPos)
{
// Found a token, add it to the vector.
records.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters. Note the "not_of"
lastPos = str.find_first_not_of("\n", pos);
// Find next "non-delimiter"
pos = str.find_first_of("\n", lastPos);
}
}
It seems totally unnecessary to repeat all of that code again to further tokenize each record via the colon (internal field separator) into the struct and push each struct into a vector. I'm sure there is a better way of doing this, or perhaps the design is in itself wrong.
Thank you for any help.

My solution:
struct colon_separated_only: std::ctype<char>
{
colon_separated_only(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table()
{
typedef std::ctype<char> cctype;
static const cctype::mask *const_rc= cctype::classic_table();
static cctype::mask rc[cctype::table_size];
std::memcpy(rc, const_rc, cctype::table_size * sizeof(cctype::mask));
rc[':'] = std::ctype_base::space;
return &rc[0];
}
};
struct table_t
{
std::string key;
std::string first;
std::string last;
std::string rank;
std::string additional;
};
int main() {
std::string buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n";
std::stringstream s(buf);
s.imbue(std::locale(std::locale(), new colon_separated_only()));
table_t t;
std::vector<table_t> data;
while ( s >> t.key >> t.first >> t.last >> t.rank >> t.additional )
{
data.push_back(t);
}
for(size_t i = 0 ; i < data.size() ; ++i )
{
std::cout << data[i].key <<" ";
std::cout << data[i].first <<" "<<data[i].last <<" ";
std::cout << data[i].rank <<" "<< data[i].additional << std::endl;
}
return 0;
}
Output:
44 william adama commander stuff
33 luara roslin president data
Online Demo : http://ideone.com/JwZuk
The technique I used here is described in another answer of mine to a different question:
Elegant ways to count the frequency of words in a file

For breaking the string up into records, I'd use istringstream, if only
because that will simplify the changes later when I want to read from
a file. For tokenizing, the most obvious solution is boost::regex, so:
std::vector<table_t> parse( std::istream& input )
{
std::vector<table_t> retval;
std::string line;
while ( std::getline( input, line ) ) {
static boost::regex const pattern(
"([^:]*):([^:]*):([^:]*):([^:]*):([^:]*)" );
boost::smatch matched;
if ( !regex_match( line, matched, pattern ) ) {
// Error handling...
} else {
retval.push_back(
table_t( matched[1], matched[2], matched[3],
matched[4], matched[5] ) );
}
}
return retval;
}
(I've assumed the logical constructor for table_t. Also: there's a very
long tradition in C that names ending in _t are typedef's, so you're
probably better off finding some other convention.)
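If pulling in a regex library feels heavy for five colon-separated fields, the same split can be sketched with nested `std::getline` calls. The names `Record` and `parseRecords` are chosen for this example (following the advice above to avoid the `_t` suffix), and skipping malformed lines silently is an assumption; real code would likely report them.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type mirroring table_t from the question.
struct Record {
    std::string key, first, last, rank, additional;
};

// Splits `buf` into lines, then each line into five ':'-separated fields.
// Lines with fewer than five fields are skipped.
std::vector<Record> parseRecords(const std::string& buf)
{
    std::vector<Record> records;
    std::istringstream input(buf);
    std::string line;
    while (std::getline(input, line)) {
        std::istringstream fields(line);
        Record r;
        if (std::getline(fields, r.key, ':') &&
            std::getline(fields, r.first, ':') &&
            std::getline(fields, r.last, ':') &&
            std::getline(fields, r.rank, ':') &&
            std::getline(fields, r.additional)) {   // rest of the line
            records.push_back(r);
        }
    }
    return records;
}
```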

Need a regular expression to extract only letters and whitespace from a string

I'm building a small utility method that parses a line (a string) and returns a vector of all the words. The istringstream code I have below works fine except for when there is punctuation so naturally my fix is to want to "sanitize" the line before I run it through the while loop.
I would appreciate some help in using the regex library in C++ for this. My initial solution was to use substr() and go to town, but that seems complicated as I'll have to iterate and test each character to see what it is, then perform some operations.
vector<string> lineParser(Line * ln)
{
vector<string> result;
string word;
string line = ln->getLine();
istringstream iss(line);
while(iss)
{
iss >> word;
result.push_back(word);
}
return result;
}

Don't need to use regular expressions just for punctuation:
// Replace all punctuation with space character.
std::replace_if(line.begin(), line.end(),
std::ptr_fun<int, int>(&std::ispunct),
' '
);
Or if you want everything but letters and numbers turned into space:
std::replace_if(line.begin(), line.end(),
std::not1(std::ptr_fun<int,int>(&std::isalnum)),
' '
);
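Note that `std::ptr_fun` was deprecated in C++11 and removed in C++17, and `std::not1` was removed in C++20, so on a current compiler a lambda is the more portable way to write the same call. A small sketch, with `blankNonAlnum` as a made-up function name:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Replaces every character that is not a letter or digit with a space.
void blankNonAlnum(std::string& line)
{
    std::replace_if(line.begin(), line.end(),
        // Take unsigned char: passing a negative char to std::isalnum
        // is undefined behaviour.
        [](unsigned char ch) { return !std::isalnum(ch); },
        ' ');
}
```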
While we are here:
Your while loop is broken and will push the last value into the vector twice.
It should be:
while(iss)
{
iss >> word;
if (iss) // If the read of a word failed, the iss state is bad and this test is false.
{ result.push_back(word);// Only push_back() if the state is not bad.
}
}
Or the more common version:
while(iss >> word) // Loop is only entered if the read of the word worked.
{
result.push_back(word);
}
Or you can use the stl:
std::copy(std::istream_iterator<std::string>(iss),
std::istream_iterator<std::string>(),
std::back_inserter(result)
);

[^A-Za-z\s] should do what you need if you replace the matching characters with nothing. It should remove all characters that are not letters or whitespace. Or use [^A-Za-z0-9\s] if you want to keep numbers too.
You can use online tools like this one : http://gskinner.com/RegExr/ to test out your patterns (Replace tab). Indeed some modifications can be required based on the regex lib you are using.
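With the standard `<regex>` library (an assumption; the answer does not name one), replacing the matches with nothing could be sketched as follows. `lettersAndSpacesOnly` is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Returns a copy of `line` with everything except ASCII letters and
// whitespace removed.
std::string lettersAndSpacesOnly(const std::string& line)
{
    static const std::regex unwanted(R"([^A-Za-z\s])");
    return std::regex_replace(line, unwanted, "");
}
```

The cleaned string can then be fed to the `istringstream` loop from the question unchanged.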

I'm not positive, but I think this is what you're looking for:
#include<iostream>
#include<regex>
#include<vector>
int
main()
{
std::string line("some words: with some punctuation.");
std::regex words("[\\w]+");
std::sregex_token_iterator i(line.begin(), line.end(), words);
std::vector<std::string> list(i, std::sregex_token_iterator());
for (auto j = list.begin(), e = list.end(); j != e; ++j)
std::cout << *j << '\n';
}
some
words
with
some
punctuation

The simplest solution is probably to create a filtering
streambuf to convert all non alphanumeric characters to space,
then to read using std::copy:
class StripPunct : public std::streambuf
{
std::streambuf* mySource;
char myBuffer;
protected:
virtual int underflow()
{
int result = mySource->sbumpc();
if ( result != EOF ) {
if ( !::isalnum( result ) )
result = ' ';
myBuffer = result;
setg( &myBuffer, &myBuffer, &myBuffer + 1 );
}
return result;
}
public:
explicit StripPunct( std::streambuf* source )
: mySource( source )
{
}
};
std::vector<std::string>
LineParser( std::istream& source )
{
StripPunct sb( source.rdbuf() );
std::istream src( &sb );
return std::vector<std::string>(
(std::istream_iterator<std::string>( src )),
(std::istream_iterator<std::string>()) );
}
