C++ tokenize a string using a regular expression

I'm trying to teach myself some C++ from scratch at the moment.
I'm well-versed in Python, Perl and JavaScript but have only encountered C++ briefly, in a
classroom setting in the past. Please excuse the naivete of my question.
I would like to split a string using a regular expression but have not had much luck finding
a clear, definitive, efficient and complete example of how to do this in C++.
In Perl this action is common, and thus can be accomplished in a trivial manner:
/home/me$ cat test.txt
this is aXstringYwith, some problems
and anotherXY line with similar issues
/home/me$ cat test.txt | perl -e'
> while(<>){
> my @toks = split(/[\sXY,]+/);
> print join(" ",@toks)."\n";
> }'
this is a string with some problems
and another line with similar issues
I'd like to know how best to accomplish the equivalent in C++.
EDIT:
I think I found what I was looking for in the boost library, as mentioned below.
boost regex-token-iterator (why don't underscores work?)
I guess I didn't know what to search for.
#include <iostream>
#include <string>
#include <boost/regex.hpp>
using namespace std;
int main(int argc, char**)
{
string s;
do{
if(argc == 1)
{
cout << "Enter text to split (or \"quit\" to exit): ";
getline(cin, s);
if(s == "quit") break;
}
else
s = "This is a string of tokens";
boost::regex re("\\s+");
boost::sregex_token_iterator i(s.begin(), s.end(), re, -1);
boost::sregex_token_iterator j;
unsigned count = 0;
while(i != j)
{
cout << *i++ << endl;
count++;
}
cout << "There were " << count << " tokens found." << endl;
}while(argc == 1);
return 0;
}

The Boost libraries are usually a good choice; in this case, Boost.Regex. There is even an example for splitting a string into tokens that already does what you want. Basically it comes down to something like this:
boost::regex re("[\\sXY,]+");
std::string s;
while (std::getline(std::cin, s)) {
boost::sregex_token_iterator i(s.begin(), s.end(), re, -1);
boost::sregex_token_iterator j;
while (i != j) {
std::cout << *i++ << " ";
}
std::cout << std::endl;
}

If you want to minimize use of iterators, and pithify your code, the following should work:
#include <string>
#include <iostream>
#include <boost/regex.hpp>
int main()
{
const boost::regex re("[\\sXY,]+");
for (std::string s; std::getline(std::cin, s); )
{
std::cout << regex_replace(s, re, " ") << std::endl;
}
}

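Unlike in Perl, regular expressions are not built into the C++ language itself.
Before C++11 you need an external library, such as PCRE or Boost.Regex; from C++11 onwards the standard library provides std::regex in the <regex> header.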

Regular expressions are part of TR1, which is included in Visual C++ 2008 SP1 (including the Express edition) and G++ 4.3.
The header is <regex> and the namespace is std::tr1. It works well with the STL.
Getting started with C++ TR1 regular expressions
Visual C++ Standard Library : TR1 Regular Expressions

Related

How to find all sentences except those defined using regular expressions?

The bottom line is that I need to find all the comments in some Python code and cut them out, leaving only the code itself.
But I can't do the inverse. That is, I can find the comments themselves, but I cannot find everything except them.
I tried using "?!" and made up a regular expression like "(.*)(?!#.*)", but it does not work as I expected.
As in the code I attached, there is also an "else" branch that I tried to use to write to different variables, but for some reason execution never even gets there.
#include <iostream>
#include <fstream>
#include <string>
#include <regex>
int main()
{
std::string line;
std::string new_line;
std::string result;
std::string result_re;
std::string path;
std::smatch match;
std::regex re("(#.*)");
std::cout << "Enter the path\n";
std::cin >> path;
std::ifstream in(path);
if (in.is_open())
{
while (getline(in, line))
{
if (std::regex_search(line, match, re))
{
for (int i = 0; i < match.size(); i++)
result_re += match[i + 1];
result_re += "\n";
}
else
{
for (int i = 0; i < match.size(); i++)
result += match[i];
//result += "\n";
}
std::cout << line << std::endl;
}
}
in.close();
std::cout << result_re << std::endl;
std::cout << "End of program" << std::endl;
std::cout << result << std::endl;
system("pause");
return 0;
}
As I said above, I want to get everything except comments, and not the other way around.
I also need to search for multi-line comments, which are delimited by """Text""".
But with this implementation I can't even imagine how to do it: the program reads line by line, so I cannot capture a multi-line comment with a regular expression.
I would be grateful for your advice and help.

1. Don't try parsing your input file line by line. Instead, read in the whole text and let regex_replace remove all the comments. This way your entire program would look like this:
#include <iostream>
#include <string>
#include <fstream>
#include <sstream>
#include <iterator> // for std::istream_iterator
#include <regex>
using namespace std; // for brevity
int main() {
cout << "Enter the path: ";
string filename;
getline(cin, filename);
string pprg{ istream_iterator<char>(ifstream{filename, ifstream::in} >> noskipws),
istream_iterator<char>{} };
pprg = regex_replace(pprg, regex{"#.*"}, "");
cout << pprg << endl;
}
2. Handling multi-line Python literals """...""" with C++ regex is much harder (unlike the example above). There are a few mutually exclusive requirements (imho):
the regex should be extended POSIX, but
POSIX regex does not support empty regex matches; however,
crafting an RE that matches a negated sequence of characters requires a negative look-ahead assertion, which is an empty match :(
Thus you would need to write some programming logic of your own to remove multi-line Python text literals.

Bjarne Stroustrup Book - Vector and For loop - won't work

I'm having difficulties with a particular piece of code I'm learning from "Programming Principles And Practice Using C++".
I can't get any output from a loop that fills a vector. I am using std_lib_facilities and stdafx because the book and MVS told me so.
#include "stdafx.h"
#include "../../std_lib_facilities.h"
int main()
{
vector<string>words;
for(string temp; cin>>temp; )
words.push_back(temp);
cout << "Number of words: " << words.size() << '\n';
}
This will produce nothing. I'll get the prompt, type in some words, then enter, then nothing.
Tried some variations I got here and from other websites as well, such as:
//here i tried the libraries the guy used in his code
#include <iostream>
#include <string>
#include <vector>
using namespace std;
int main()
{
cout << "Please enter a series of words followed by End-of-File: ";
vector<string> words;
string word;
string disliked = "Broccoli";
while (cin >> word)
words.push_back(word);
cout << "\nNumber of words: " << words.size() << endl;
// cycle through all strings in vector
for (int i = 0; i < words.size(); ++i)
{
if (words[i] != disliked)
cout << words[i] << endl;
else
cout << "BLEEP!\n";
}
return 0;
}
Still nothing.
After trying some things, by elimination I'm pretty certain the problem is with the loop-to-vector communication, because all of these work fine:
int main()
{
vector<string>words = { "hi" };
words.push_back("hello");
cout << words[1] << "\n"; // this will print hello
for (int i = 0; i < words.size();++i) {
cout << words[i] << "\n"; // this will print out all elements
// inside vector words, ( hi, hello)
}
cout << words.size();// this will print out number 2
for (string temp; cin >> temp;) {
words.push_back(temp);
}
cout << words.size();// this won't do anything after i type in some
// words; shouldn't it increase the size of the
// vector?
}
Neither will this alone:
int main()
{
vector<string>words = { "hi" };
for (string temp; cin >> temp;) {
words.push_back(temp);
}
cout << words.size();
}
What am I missing, please? Thank you in advance.

Input the strings and when done press Ctrl+Z (followed by Enter) if on Windows or Ctrl+D if on Linux. When you do that the cin>>temp; condition inside your for loop will evaluate to false and your program will exit the loop.

Firstly, it is not necessary to use std_lib_facilities.h. That is just something used in the book to avoid having every example regularly include a set of standard headers, or to correct for non-standard behaviours between compilers. It is also a header that does using namespace std - you will find numerous examples (both on SO and the wider internet) explaining why it is VERY poor practice to have using namespace std in a header file.
Second, it is not necessary to use stdafx.h either. That is something generated by the Microsoft IDE, and it provides a means of speeding up compilation in large projects through precompiled headers. If you only expect to use Microsoft compilers, then feel free to fill your boots and use this one. However, it is not standard C++, may (depending on IDE and project settings) include Windows-specific headers that will not work with non-Microsoft compilers, and in forums will probably discourage people who use other compilers from helping you, since they will have good reason to assume your code uses Microsoft-specific extensions.
The first sample of code can be rewritten, in standard C++ (without use of either header above) as
#include <vector> // for std::vector
#include <string> // for std::string
#include <iostream> // std::cout and other I/O facilities
int main()
{
std::vector<std::string> words;
for(std::string temp; std::cin >> temp; )
words.push_back(temp);
std::cout << "Number of words: " << words.size() << '\n';
}
Before you get excited, this will exhibit the same problem (not apparently finishing). The reason is actually the termination condition of the loop - std::cin >> temp will only terminate the loop if end of file or some other error is encountered in the stream.
So, if you type
The cow jumped over the moon
std::cin will continue to wait for input. It is generally necessary for the USER to trigger an end-of-file condition. Under Windows, this requires the user to enter CTRL-Z on an empty line followed by the Enter key.
An alternative would be to have some pre-agreed text that causes the loop to exit, such as
#include <vector> // for std::vector
#include <string> // for std::string
#include <iostream> // std::cout and other I/O facilities
int main()
{
std::vector<std::string> words;
for(std::string temp; std::cin >> temp && temp != "zzz"; )
words.push_back(temp);
std::cout << "Number of words: " << words.size() << '\n';
}
which will cause the program to exit when the input contains the word zzz. For example
The cow jumped over the moon zzz
There are other techniques, such as reading one character at a time, and stopping when the user enters two consecutive newlines. That requires your code to interpret every character, and decide what constitutes a word. I'll leave that as an exercise.
Note there is no means in standard C++ to directly read keystrokes - the problems above are related to how standard streams work, and interact with the host system.
The user can also use the program "as is" by placing the same text into an actual file, and (when the program is run) redirect input for your program to come from that file. For example your_executable < filename.

How to use cin with unknown input types?

I have a C++ program which needs to take user input. The user input will either be two ints (for example: 1 3) or it will be a char (for example: s).
I know I can get the two ints like this:
cin >> x >> y;
But how do I go about getting the value from cin if a char is input instead? I know cin.fail() will then return true, but when I call cin.get(), it does not retrieve the character that was input.
Thanks for the help!

Use std::getline to read the input into a string, then use std::istringstream to parse the values out.

You can do this in C++11. This solution is robust and ignores surrounding whitespace.
This was compiled with clang++ and libc++ on Ubuntu 13.10. Note that GCC before 4.9 doesn't have a full regex implementation, but you could use Boost.Regex as an alternative.
EDIT: Added negative numbers handling.
#include <regex>
#include <iostream>
#include <string>
#include <utility>
using namespace std;
int main() {
regex pattern(R"(\s*(-?\d+)\s+(-?\d+)\s*|\s*([[:alpha:]])\s*)");
string input;
smatch match;
char a_char;
pair<int, int> two_ints;
while (getline(cin, input)) {
if (regex_match(input, match, pattern)) {
if (match[3].matched) {
cout << match[3] << endl;
a_char = match[3].str()[0];
}
else {
cout << match[1] << " " << match[2] << endl;
two_ints = {stoi(match[1]), stoi(match[2])};
}
}
}
}

Trimming internal whitespace in std::string

I'm looking for an elegant way to transform an std::string from something like:
std::string text = " a\t very \t ugly \t\t\t\t string ";
To:
std::string text = "a very ugly string";
I've already trimmed the external whitespace with boost::trim(text);
[edit]
Thus, multiple whitespaces, and tabs, are reduced to just one space
[/edit]
Removing the external whitespace is trivial. But is there an elegant way of removing the internal whitespace that doesn't involve manual iteration and comparison of previous and next characters? Perhaps something in boost I have missed?

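You can use std::unique with the erase idiom and ::isspace to collapse each run of whitespace into its first character, then std::replace_if to turn any remaining tab into a space (both algorithms are in <algorithm>):
text.erase(std::unique(std::begin(text), std::end(text), [](unsigned char c, unsigned char c2) {
return ::isspace(c) && ::isspace(c2);
}), std::end(text));
std::replace_if(std::begin(text), std::end(text), [](unsigned char c) { return ::isspace(c) != 0; }, ' ');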

std::istringstream iss(text);
text = "";
std::string s;
while(iss >> s){
if ( text != "" ) text += " " + s;
else text = s;
}
//use text, extra whitespaces are removed from it

Most of what I'd do is similar to what @Nawaz already posted -- read strings from an istringstream to get the data without whitespace, and then insert a single space between each of those strings. However, I'd use an infix_ostream_iterator from a previous answer to get (IMO) slightly cleaner/clearer code.
std::istringstream buffer(input);
std::copy(std::istream_iterator<std::string>(buffer),
std::istream_iterator<std::string>(),
infix_ostream_iterator<std::string>(result, " "));

#include <boost/algorithm/string/trim_all.hpp>
string s;
boost::algorithm::trim_all(s);

If you check out https://svn.boost.org/trac/boost/ticket/1808, you'll see a request for (almost) this exact functionality, and a suggested implementation:
std::string trim_all ( const std::string &str ) {
return boost::algorithm::find_format_all_copy(
boost::trim_copy(str),
boost::algorithm::token_finder (boost::is_space(),boost::algorithm::token_compress_on),
boost::algorithm::const_formatter(" "));
}

Here is a possible version using regular expressions. My GCC 4.6 doesn't have regex_replace yet, but Boost.Regex can serve as a drop-in replacement:
#include <string>
#include <iostream>
// #include <regex>
#include <boost/regex.hpp>
#include <boost/algorithm/string/trim.hpp>
int main() {
using namespace std;
using namespace boost;
string text = " a\t very \t ugly \t\t\t\t string ";
trim(text);
regex pattern{"[[:space:]]+", regex_constants::egrep};
string result = regex_replace(text, pattern, " ");
cout << result << endl;
}

Using Boost-Regex to parse string into characters and numerals

I'd like to use Boost's Regex library to separate a string containing labels and numbers into tokens. For example 'abc1def002g30' would be separated into {'abc','1','def','002','g','30'}. I modified the example given in Boost documentation to come up with this code:
#include <iostream>
#include <boost/regex.hpp>
using namespace std;
int main(int argc,char **argv){
string s,str;
int count;
do{
count=0;
if(argc == 1)
{
cout << "Enter text to split (or \"quit\" to exit): ";
getline(cin, s);
if(s == "quit") break;
}
else
s = "This is a string of tokens";
boost::regex re("[0-9]+|[a-z]+");
boost::sregex_token_iterator i(s.begin(), s.end(), re, 0);
boost::sregex_token_iterator j;
while(i != j)
{
str=*i;
cout << str << endl;
count++;
i++;
}
cout << "There were " << count << " tokens found." << endl;
}while(argc == 1);
return 0;
}
The number of tokens stored in count is correct. However, *i contains only an empty string, so nothing is printed. Any guesses as to what I am doing wrong?
EDIT: as per the fix suggested below, I modified the code and it now works correctly.

From the docs on the sregex_token_iterator:
Effects: constructs a regex_token_iterator that will enumerate one string for each regular expression match of the expression re found within the sequence [a,b), using match flags m (see match_flag_type). The string enumerated is the sub-expression submatch for each match found; if submatch is -1, then enumerates all the text sequences that did not match the expression re (that is to performs field splitting)
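Since your regex matches the tokens themselves (unlike the sample code, whose regex matches only the separators between the strings), passing -1 enumerates the empty text between the matches, so you get empty results.
Try replacing the -1 with a 0.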
