How to show regex output in pairs in C++?

I don't know if this makes sense, but here it is.
Is there a way to get two words at a time from a regex result?
Suppose I have a text file that contains a string such as the following:
Alex Fenix is an Engineer who works for Ford Automotive Company. His
Personal ID is <123456>;etc....
Basically, if I use \w+ I get a list like this:
Alex
Fenix
is
an
Engineer
and so on.
The words are all separated by whitespace and punctuation marks.
What I am asking is whether there is a way to get a list such as:
Alex Fenix
is an
Engineer who
works for
Ford Automotive
Company His
Personal ID
is 123456
How can I achieve such a format?
Is it even possible, or should I store the first results in an array, then iterate through them and build the second list?
Please note that an item such as Alex Fenix stands for an entry in a map or a similar container.
The reason I am asking is that I would like to read a file, apply a regex to it, and get this second list directly, without any further processing overhead
(by which I mean reading into a map or string, then iterating over the tokens, creating pairs from them, and carrying on with whatever else is needed).

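Try this regex:
\w+ \w+
It matches a word followed by a single space and another word. To also skip punctuation between the two words (as in "Company. His" or "is <123456>"), use \w+\W+\w+ instead.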
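As a minimal sketch of the regex route in C++ (not part of the original answer): the function below walks the text with std::sregex_iterator and collects each match as a pair. The name word_pairs and the pattern (\w+)\W+(\w+), which lets any run of non-word characters separate the two words, are assumptions chosen for illustration.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns consecutive words of `text` two at a time.
// A trailing unpaired word is dropped, because the pattern needs two words.
std::vector<std::pair<std::string, std::string>> word_pairs(const std::string& text) {
    // Two words separated by at least one non-word character
    // (spaces, punctuation, angle brackets, ...).
    static const std::regex pair_re(R"((\w+)\W+(\w+))");

    std::vector<std::pair<std::string, std::string>> pairs;
    for (std::sregex_iterator it(text.begin(), text.end(), pair_re), end; it != end; ++it) {
        pairs.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return pairs;
}
```

Each element of the returned vector can be inserted into whatever container is needed, so no intermediate list of single words has to be built by hand.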
That said, you can produce this format fairly easily without a regex. For instance:
#include <iostream>
#include <sstream>
#include <string>
#include <algorithm>
int main() {
    std::string s("Alex Fenix is an Engineer who works for Ford Automotive Company. His Personal ID is <123456>");
    // Remove any occurrences of '.', '<' or '>' (erase-remove idiom).
    s.erase(std::remove_if(begin(s), end(s), [](const char c) {
        return (c == '.' || c == '<' || c == '>');
    }), end(s));
    // Tokenize: read the words two at a time.
    std::istringstream iss(s);
    std::string t1, t2;
    while (iss >> t1 >> t2) {
        std::cout << t1 << " " << t2 << std::endl;
    }
}
Output:
Alex Fenix
is an
Engineer who
works for
Ford Automotive
Company His
Personal ID
is 123456

Related

Parse string with delimiter whitespace but having strings include whitespace as well?

I have a text file with state names and their respective abbreviations. It looks something like this:
Florida FL
Nevada NV
New York NY
So the number of whitespaces between state name and abbreviation differs. I want to extract the name and abbreviation and I thought about using getline with whitespace as a delimiter but I have problems with the whitespace in names like "New York". What function could I use instead?
You know that the abbreviation is always two characters.
So you can read the whole line, and split it at two characters from the end (probably using substr).
Then trim the first string and you have two nice strings for the name and abbreviation.
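A minimal sketch of this substr approach, assuming every line ends with a two-character abbreviation (optionally followed by whitespace). The function name split_state_line and its bool-plus-out-parameter interface are made up for this example.

```cpp
#include <string>

// Hypothetical helper: splits "New York   NY" into name = "New York" and
// abbr = "NY". Returns false if the line is too short to hold both parts.
bool split_state_line(const std::string& line, std::string& name, std::string& abbr) {
    const std::string ws = " \t\r\n";

    // Index of the last non-whitespace character, i.e. the end of the abbreviation.
    const std::string::size_type last = line.find_last_not_of(ws);
    if (last == std::string::npos || last < 2) return false;

    // The abbreviation is the final two non-whitespace characters.
    const std::string head = line.substr(0, last - 1);

    // Trim the remaining text to get the state name.
    const std::string::size_type first = head.find_first_not_of(ws);
    if (first == std::string::npos) return false;
    const std::string::size_type tail = head.find_last_not_of(ws);

    abbr = line.substr(last - 1, 2);
    name = head.substr(first, tail - first + 1);
    return true;
}
```

Reading the file with std::getline and passing each line to this function handles names such as "New York", because only the position from the end of the line matters.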
The systematic way is to analyze all possible input data and then look for a pattern in the text. In this case, the analysis shows that
at the end of each line there are some consecutive uppercase letters
before that comes the state's name
So, if we search for the state abbreviation pattern and split it off, what remains is the full name of the state, possibly with leading and trailing spaces. We remove those spaces and have the result.
For searching we will use a std::regex. The pattern is: one or more uppercase letters, followed by zero or more whitespace characters, followed by the end of the line. The regular expression for that is: "([A-Z]+)\\s*$"
When the pattern matches, the prefix of the result contains the full state name. We remove leading and trailing spaces and that's it.
Please see:
#include <iostream>
#include <string>
#include <sstream>
#include <regex>
std::istringstream textFile(R"( Florida FL
Nevada NV
New York NY)");
std::regex regexStateAbbreviation("([A-Z]+)\\s*$");
int main()
{
    // Split off the abbreviation, line by line
    std::smatch stateAbbreviationMatch{};
    std::string line{};
    while (std::getline(textFile, line)) {
        if (std::regex_search(line, stateAbbreviationMatch, regexStateAbbreviation))
        {
            // Everything before the match is the state name
            std::string state(stateAbbreviationMatch.prefix());
            // Remove leading and trailing spaces and collapse runs of spaces into one
            state = std::regex_replace(state, std::regex("^ +| +$|( ) +"), "$1");
            // Capture group 1 is the abbreviation without any trailing whitespace
            std::string stateabbreviation(stateAbbreviationMatch[1]);
            // Print the result
            std::cout << stateabbreviation << ' ' << state << '\n';
        }
    }
    return 0;
}

Regex - How to capture all iterations of a repeating pattern? [duplicate]

I'm using the C++ tr1::regex with the ECMA regex grammar. What I'm trying to do is parse a header and return values associated with each item in the header.
Header:
-Testing some text
-Numbers 1 2 5
-MoreStuff some more text
-Numbers 1 10
What I would like to do is find all of the "-Numbers" lines and put each number into its own result with a single regex. As you can see, the "-Numbers" lines can have an arbitrary number of values on the line. Currently, I'm just searching for "-Numbers([\s0-9]+)" and then tokenizing that result. I was just wondering if there was any way to both find and tokenize the results in a single regex.
No, there is not.
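For reference, the two-step approach the question describes (match the line, then tokenize the capture) can be sketched as follows. The function name parse_numbers_lines is made up for this example, and std::regex is used in place of tr1::regex on the assumption that a C++11 compiler is available.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: one vector of ints per "-Numbers" line in `header`.
std::vector<std::vector<int>> parse_numbers_lines(const std::string& header) {
    // Step 1: recognise a "-Numbers" line and capture everything after the tag.
    static const std::regex line_re(R"(-Numbers([\s0-9]+))");

    std::vector<std::vector<int>> result;
    std::istringstream lines(header);
    std::string line;
    std::smatch m;
    while (std::getline(lines, line)) {
        if (!std::regex_match(line, m, line_re)) continue;

        // Step 2: tokenize the captured text with stream extraction.
        std::istringstream values(m[1].str());
        std::vector<int> numbers;
        for (int n; values >> n;) numbers.push_back(n);
        result.push_back(numbers);
    }
    return result;
}
```

The regex only decides which lines qualify; the stream extraction does the splitting, so the number of values per line does not need to be known in advance.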
I was about to ask this exact same question, and I kind of found a solution.
Let's say you have an arbitrary number of words you want to capture.
"there are four lights"
and
"captain picard is the bomb"
You might think that the solution is:
/((\w+)\s?)+/
But this only gives you the whole match plus the last value captured by each group.
What you can do is use the "g" switch.
So, an example in Perl:
use strict;
use warnings;
my $str1 = "there are four lights";
my $str2 = "captain picard is the bomb";
foreach ( $str1, $str2 ) {
    my @a = ( $_ =~ /(\w+)\s?/g );
    print "captured groups are: " . join( "|", @a ) . "\n";
}
Output is:
captured groups are: there|are|four|lights
captured groups are: captain|picard|is|the|bomb
So, there is a solution if your language of choice supports an equivalent of "g" (and I guess most do...).
Hope this helps someone who was in the same position as me!
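In C++ the closest counterpart to the Perl /g loop is std::sregex_iterator, which yields one match per step. A minimal sketch, assuming C++11 std::regex; the function name all_words is invented for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: every value captured by (\w+) in `text`,
// comparable to Perl's list-context /(\w+)\s?/g.
std::vector<std::string> all_words(const std::string& text) {
    static const std::regex word_re(R"((\w+)\s?)");

    std::vector<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), word_re), end; it != end; ++it) {
        words.push_back((*it)[1].str());  // capture group 1 of the current match
    }
    return words;
}
```

Each iteration is a separate match, so the capture group holds a fresh value every time instead of only the last repetition.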
The problem is that the desired solution insists on using capture groups. C++ provides regex_token_iterator, which handles this in a better way (C++11 example):
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main() {
    std::regex e (R"((?:^-Numbers)?\s*(\d+))");
    string input;
    while (getline(cin, input)) {
        std::regex_token_iterator<std::string::iterator> a{
            input.begin(), input.end(),
            e, 1,
            regex_constants::match_continuous
        };
        std::regex_token_iterator<std::string::iterator> end;
        while (a != end) {
            cout << *a << " - ";
            ++a;
        }
        cout << '\n';
    }
    return 0;
}
https://wandbox.org/permlink/TzVEqykXP1eYdo1c

C++11 Regex submatches

I have the following code to extract the left & right part from a string of type
[3->1],[2->2],[5->3]
My code looks like the following
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
    regex expr("([[:d:]]+)->([[:d:]]+)");
    string input = "[3->1],[2->2],[5->3]";
    const std::sregex_token_iterator end;
    int submatches[] = { 1, 2 };
    string left, right;
    for (std::sregex_token_iterator itr(input.begin(), input.end(), expr, submatches); itr != end;)
    {
        left = ((*itr).str()); ++itr;
        right = ((*itr).str()); ++itr;
        cout << left << " " << right << endl;
    }
}
Output will be
3 1
2 2
5 3
Now I am trying to extend it so that first part will be a string instead of digit. For example, the input will be
[(3),(5),(0,1)->2],[(32,2)->6],[(27),(61,11)->1]
And I need to split it as
(3),(5),(0,1) 2
(32,2) 6
(27),(61,11) 1
Basic expressions that I tried ("(\\(.*+)->([[:d:]]+)") just splits the entire string to two as following
(3),(5),(0,1)->2],[(32,2)->6],[(27),(61,11) 1
Can somebody give me some suggestions on how to achieve this? Appreciate all the help.
You need to capture everything after the '[' up to, but not including, the "->". It is similar to writing a regex for a multiline comment /* ... */, where "*/" has to be excluded from the body; otherwise the regex is greedy and eats everything up to the last one, which is what happens in your case with "->". You can't really use the dot for any char here, because .* is greedy.
This works for me:
\\[([^-\\]]+)->([0-9]+)\\]
The '^' at the start of [...] negates the class: it accepts every character except '-' (which keeps the match from running over "->") and ']'.
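What you need is to make it a bit more specific:
\[([^\]]*)->([^\]]*)\]
This avoids capturing too much, because neither group can run past a closing bracket. The ']' inside the character class is escaped because the ECMAScript grammar, which std::regex uses by default, treats an unescaped [^] as a complete class.
You could have used the .*? pattern instead of [^\]]*, but it would have been less efficient.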
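A minimal sketch of how such a pattern could be applied with std::sregex_iterator. The function name split_rules is made up for this example, and the pattern is a variant of the ones above: the bracket inside the class is escaped and the right-hand side is restricted to digits.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: splits "[left->right],[left->right],..." into pairs.
std::vector<std::pair<std::string, std::string>> split_rules(const std::string& input) {
    // The negated class cannot run past a closing bracket, so each match stays
    // inside one [...] group; backtracking then finds the "->" in that group.
    static const std::regex rule_re(R"(\[([^\]]*)->(\d+)\])");

    std::vector<std::pair<std::string, std::string>> rules;
    for (std::sregex_iterator it(input.begin(), input.end(), rule_re), end; it != end; ++it) {
        rules.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return rules;
}
```

Unlike the [^-\\]] variant, this class still allows a '-' on the left-hand side, since only ']' is excluded.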

Need to parse a string using a mask (something like "%yr-%mh-%dy") to get the int values

For example, I have to find a date in the format given in the title (though the order of the % tags can differ) in a string such as "The date is 2009-August-25." How can I make the program interpret the tags, and what construct is best for storing them along with the information about how to handle the corresponding pieces of the date string?
First, look into the boost::date_time library. It has an I/O system which may be what you want, but as far as I can see it lacks searching.
To do custom date searching you need boost::xpressive. It contains everything you will need. Let's look at my hastily written example. First you should parse your custom pattern, which is easy with Xpressive. Start with the headers you need:
#include <string>
#include <iostream>
#include <map>
#include <boost/xpressive/xpressive_static.hpp>
#include <boost/xpressive/regex_actions.hpp>
//make example shorter but less clear
using namespace boost::xpressive;
Second, define a map of your special tags:
std::map<std::string, int > number_map;
number_map["%yr"] = 0;
number_map["%mh"] = 1;
number_map["%dy"] = 2;
number_map["%%"] = 3; // escape a %
The next step is to create a regex which will parse our pattern with tags and save the value from the map into the variable tag_id when it finds a tag, or save -1 otherwise:
int tag_id;
sregex rx=((a1=number_map)|(s1=+~as_xpr('%')))[ref(tag_id)=(a1|-1)];
For more information, see the Boost.Xpressive documentation on semantic actions and symbol tables.
Now let's parse a pattern:
std::string pattern("%yr-%mh-%dy"); // this will be parsed
sregex_token_iterator begin( pattern.begin(), pattern.end(), rx ), end;
if(begin == end) throw std::runtime_error("The pattern is empty!");
The sregex_token_iterator will iterate over our tokens, and each time it will set the tag_id variable. All we have to do is build a regex from these tokens. We construct it from the parts of a static regex, defined in an array, that correspond to each tag:
sregex regex_group[] = {
    range('1','9') >> repeat<3,3>( _d ), // 4 digit year
    as_xpr( "January" ) | "February" | "August", // only a few months listed, for brevity
    repeat<2,2>( range('0','9') )[ // two digit day
        check(as<int>(_) >= 1 && as<int>(_) <= 31) ], // only between 1 and 31
    as_xpr( '%' ) // match escaped %
};
Finally, let's start building our special regex. The first match constructs the first part of it. If a tag is matched and tag_id is non-negative, we choose the regex from the array; otherwise the match is probably a delimiter and we construct a regex which matches it:
sregex custom_regex = (tag_id>=0) ? regex_group[tag_id] : as_xpr(begin->str());
Next we iterate from begin to end and append each following regex:
while(++begin != end)
{
    if(tag_id>=0)
    {
        sregex nextregex = custom_regex >> regex_group[tag_id];
        custom_regex = nextregex;
    }
    else
    {
        sregex nextregex = custom_regex >> as_xpr(begin->str());
        custom_regex = nextregex;
    }
}
Now our regex is ready, let's find some dates :-]
std::string input = "The date is 2009-August-25.";
smatch mydate;
if( regex_search( input, mydate, custom_regex ) )
std::cout << "Found " << mydate.str() << "." << std::endl;
The xpressive library is very powerful and fast. It is also a beautiful use of patterns.
If you like this example, let me know in a comment or with points ;-)
I'd transform the tagged string into a regular expression with a capture group for each of the three fields and search for it. The complexity of the regular expression will depend on what you want to accept for %yr. You can also use a less strict expression and then check for valid values; depending on the context, this can lead to better error messages ("Invalid month: Augsut" instead of "date not found") or to false positives.
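A minimal sketch of this idea using std::regex rather than Boost. Everything named here is an assumption made for illustration: the DateFields struct, the function parse_with_mask, and the deliberately loose per-tag patterns (four digits for %yr, letters for %mh, one or two digits for %dy).

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type: the three fields as matched text.
struct DateFields { std::string year, month, day; };

// Hypothetical helper: turns `mask` into a regex with one capture group per
// tag, searches `text` with it, and stores the captures in `out`.
bool parse_with_mask(const std::string& mask, const std::string& text, DateFields& out) {
    static const std::string meta = "\\^$.|?*+()[]{}";   // regex metacharacters
    std::string pattern;
    std::vector<std::string DateFields::*> order;        // field for each capture group

    for (std::size_t i = 0; i < mask.size();) {
        if (mask.compare(i, 3, "%yr") == 0) {
            pattern += "(\\d{4})";     order.push_back(&DateFields::year);  i += 3;
        } else if (mask.compare(i, 3, "%mh") == 0) {
            pattern += "([A-Za-z]+)";  order.push_back(&DateFields::month); i += 3;
        } else if (mask.compare(i, 3, "%dy") == 0) {
            pattern += "(\\d{1,2})";   order.push_back(&DateFields::day);   i += 3;
        } else if (mask.compare(i, 2, "%%") == 0) {
            pattern += "%";  i += 2;   // escaped percent sign
        } else {
            // Literal delimiter: escape it if it is a regex metacharacter.
            if (meta.find(mask[i]) != std::string::npos) pattern += '\\';
            pattern += mask[i];
            ++i;
        }
    }

    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re)) return false;
    for (std::size_t k = 0; k < order.size(); ++k) out.*order[k] = m[k + 1].str();
    return true;
}
```

Because the captures are kept as text, validation can happen afterwards (for example converting the year and day with std::stoi and looking the month name up in a table), which is where the more specific error messages would come from.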