C++: How to extract words from string with regex

I want to extract words from a string. There are two methods I can think of that would accomplish this:
Extraction by a delimiter.
Extraction by word pattern searching.
Before I get into the specifics of my problem, I want to clarify that while I do ask about the methods of extraction and their implementations, the main focus of my problem is the regexes; not the implementations.
The words that I want to match can contain apostrophes (e.g. "Don't"), can be inside double or single quotes (apostrophes) (e.g. "Hello" and 'world') and a combination of the two (e.g. "Didn't" and 'Won't'). They can also contain numbers (e.g. "2017" and "U2") and underscores and hyphens (e.g. "hello_world" and "time-turner"). In-word apostrophes, underscores, and hyphens must be surrounded by other word characters. A final requirement is that strings containing random non-word characters (e.g. "Good mor~+%g.") should still recognize all word-characters as words.
Example strings to extract words from and what I want the result to look like:
"Hello, world!" should result in "Hello" and "world"
"Aren't you clever?" should result in "Aren't", "you" and "clever"
"'Later', she said." should result in "Later", "she" and "said"
"'Maybe 5 o'clock?'" should result in "Maybe", "5" and "o'clock"
"In the year 2017 ..." should result in "In", "the", "year" and "2017"
"G2g, cya l8r" should result in "G2g", "cya" and "l8r"
"hello_world.h" should result in "hello_world" and "h"
"Hermione's time-turner." should result in "Hermione's" and "time-turner"
"Good mor~+%g." should result in "Good", "mor" and "g"
"Hi' Testing_ Bye-" should result in "Hi", "Testing" and "Bye"
Because – as far as I can tell – the two methods I proposed require quite different solutions I'll divide my question into two parts – one for each method.
1. Extraction by delimiter
This is the method I have spent the most time developing, and I have found a partially working solution; however, I suspect the regex I am using is not very efficient. My solution is this (using Boost.Regex because its Perl syntax supports lookbehinds):
#include <string>
#include <vector>
#include <iostream>
#include <boost/regex.hpp>
std::vector<std::string> phrases({ "Hello, world!", "Aren't you clever?",
"'Later', she said.", "'Maybe 5 o'clock?'",
"In the year 2017 ...", "G2g, cya l8r",
"hello_world.h", "Hermione's time-turner.",
"Good mor~+%g.", "Hi' Testing_ Bye-"});
std::vector<std::string> words;
boost::regex delimiterPattern("^'|[\\W]*(?<=\\W)'+\\W*|(?!\\w+(?<!')'(?!')\\w+)[^\\w']+|'$");
boost::sregex_token_iterator end;
for (std::string phrase : phrases) {
boost::sregex_token_iterator phraseIter(phrase.begin(), phrase.end(), delimiterPattern, -1);
for ( ; phraseIter != end; phraseIter++) {
words.push_back(*phraseIter);
std::cout << words[words.size()-1] << std::endl;
}
}
My largest problem with this solution is my regex, which I think looks too complex and could probably be done much better. It also doesn't correctly match apostrophes at the end of words – like in example 3. Here's a link to regex101.com with the regex and the example strings: Delimiter regex.
2. Extraction by word pattern searching
I haven't dedicated too much time to pursue this path myself and mainly included it as an alternative because my partial solution isn't necessarily the best one. My suggestion as to how to accomplish this would be to do something in the vein of repeatedly searching a string for a pattern, removing each match from the string as you go until there are no more matches. I have a working regex for this method, but would still like input on it: "[A-Za-z0-9]+(['_-]?[A-Za-z0-9]+)?". Here's a link to regex101.com with the regex and the example strings: Word pattern regex.
I want to emphasize again that I first and foremost want input on my regexes, but also appreciate help with implementing the methods.
Edit: Thanks @Galik for pointing out that possessive plurals can end in apostrophes. The apostrophes associated with these may be matched in a delimiter and do not have to be matched in a word pattern (i.e. "The kids' toys" should result in "The", "kids" and "toys").

You may use
[^\W_]+(?:['_-][^\W_]+)*
See the regex demo.
Pattern details:
[^\W_]+ - one or more chars other than non-word chars and _ (matches alphanumeric chars)
(?: - start of a non-capturing group that only groups subpatterns and matches:
['_-] - a ', _ or -
[^\W_]+ - 1+ alphanumeric chars
)* - repeats the group zero or more times.
C++ demo:
#include <iostream>
#include <regex>
#include <string>
int main() {
    std::regex r(R"([^\W_]+(?:['_-][^\W_]+)*)");
    std::string s = "Hello, world! Aren't you clever? 'Later', she said. Maybe 5 o'clock?' In the year 2017 ... G2g, cya l8r hello_world.h Hermione's time-turner. Good mor~+%g. Hi' Testing_ Bye- The kids' toys";
    // Print every match of the word pattern, one per line.
    for (std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
         i != std::sregex_iterator();
         ++i)
    {
        std::smatch m = *i;
        std::cout << m.str() << '\n';
    }
}
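If you need the words in a container rather than printed, the same idea can be wrapped in a small helper. This is only a sketch: the function name extractWords is made up for illustration, and the character classes are spelled out as ASCII ranges (an assumption standing in for [^\W_]).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every word in `text`.
// [A-Za-z0-9] is an ASCII-only stand-in for [^\W_] (an assumption).
std::vector<std::string> extractWords(const std::string& text) {
    static const std::regex wordPattern(R"([A-Za-z0-9]+(?:['_-][A-Za-z0-9]+)*)");
    std::vector<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), wordPattern), end;
         it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

Called on "Hi' Testing_ Bye-", this should yield "Hi", "Testing" and "Bye", because the optional group only matches when the apostrophe, underscore or hyphen is followed by another word character.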

Related

Tricky substring problems

I'm having a problem with substrings. I have a string in the format below, which I'm currently reading using getline.
Richard[12345/678910111213141516] was murdered
What I have been using is find_last_of and find_first_of to get the positions in between the brackets and forward slashes to retrieve each field. I have this working and functional, but I have run into a problem. The name field can be 32 characters in length and can contain / and [], so when I finally ran into a user with a URL for his name, it did not like that. The numbers are also random on a per-user basis. I'm retrieving each field from the string: the name and the two identifying numbers.
Another string can look like this, so I would be grabbing 6 total substrings.
Richard[12345/678910111213141516] was murdered by Ralph[54321/161514131211109876]
Which is just another huge mess. What I was thinking about doing was starting from the back and moving to the front, but if the second name field (Ralph) contains any / or [], it's going to ruin the count for retrieving the first part. Any insight would be helpful. Thank you.
In a nutshell, how do I account for these?
Names can also contain any alpha / numerical and special character.
Richard///[][][12345/678910111213141516] was murdered by Ralph[/[54321/161514131211109876]
The end result would be 6 substrings containing this.
Richard///[][]
12345
678910111213141516
Ralph[/
54321
161514131211109876
Regex has been mentioned to me, but I don't know if it would be better suited for the task or not, I included the tag so someone more experienced with it might answer/comment.
Here is a regex way to obtain all the values:
string str = "Richard///[][][12345/678910111213141516] was murdered by Ralph[/[54321/161514131211109876]";
regex rgx1(R"(([A-Z]\w*\s*\S*)\[(\d+)?(?:\/(\d+))?\])");
smatch smtch;
while (regex_search(str, smtch, rgx1)) {
std::cout << "Name: " << smtch[1] << std::endl;
std::cout << "ID1: " << smtch[2] << std::endl;
std::cout << "ID2: " << smtch[3] << std::endl;
str = smtch.suffix().str();
}
See IDEONE demo
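The regex ([A-Z]\w*\s*\S*)\[(\d+)?(?:\/(\d+))?\] matches:
([A-Z]\w*\s*\S*) - (Group 1) an uppercase letter, 0 or more word characters, optional whitespace, and then 0 or more non-whitespace symbols, as many as possible.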
\[ - an opening square bracket (must be escaped as it is a special character in regex reserved for character classes)
(\d+)? - (Group 2) 1 or more digits (optional group, can be empty)
(?:/(\d+))? - non-capturing optional group matching
/ - literal /
(\d+) - (Group 3) 1 or more digits.
\] - closing square bracket.
A possible regex solution would be to use a pattern as follows:
(\S+)\[(\d+)/(\d+)\](?:\s|$)
which will match and store the names (with their meta attributes). I am currently thinking of ways when it could break.
You can test it on regex101.
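To try that pattern from C++, something along these lines might work. It is a sketch under assumptions: the Entry struct and the parseEntries name are invented for illustration, the IDs are kept as strings, and names containing whitespace are not handled because the pattern relies on \S+.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical record type for one "name[id1/id2]" entry.
struct Entry {
    std::string name;
    std::string id1;
    std::string id2;
};

// Hypothetical helper: find every name[id1/id2] entry in `line`.
std::vector<Entry> parseEntries(const std::string& line) {
    // The greedy \S+ backtracks to the last '[' that is followed by
    // digits, a slash, digits and ']'.
    static const std::regex entryPattern(R"((\S+)\[(\d+)/(\d+)\](?:\s|$))");
    std::vector<Entry> entries;
    for (std::sregex_iterator it(line.begin(), line.end(), entryPattern), end;
         it != end; ++it) {
        entries.push_back({(*it)[1].str(), (*it)[2].str(), (*it)[3].str()});
    }
    return entries;
}
```

On the two-name example from the question, this should give the names "Richard///[][]" and "Ralph[/" together with their two IDs each.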

Splitting strings separated by \r\n into array of strings [C/C++]

I have string containing e.g. "FirstWord\r\nSecondWord\r\nThird Word\n\r" and so on...
I want to split it to string array using vector <string> so I would get:
FileName[0] == "FirstWord";
FileName[1] == "SecondWord";
FileName[2] == "Third Word";
Also, note the space in the third string.
This is what I've got so far:
string text = Files; // Files var contains the huge string of lines separated by \r\n
vector<string> FileName; // (optionally) Here I want to store the result without \r\n
regex rx("[^\\s]+\r\n");
sregex_iterator FormatedFileList(text.begin(), text.end(), rx), rxend;
while(FormatedFileList != rxend)
{
FileName.push_back(FormatedFileList->str().c_str());
++FormatedFileList;
}
It works, but when it comes to the third string which is "Third Word\r\n", it only gives me "Word\r\n".
Can anyone explain to me how do the regular expressions work? I'm a bit confused.
\s matches all spaces, including regular space, tab and a few others. You only want to exclude \r and \n, so your regex should be
regex rx("[^\r\n]+\r\n");
EDIT: This will not fit in a comment, and it will not be exhaustive -- regexes are a fairly complex topic, but I'll do my best to give a cursory explanation. All of this does make more sense if you grok formal languages, so I encourage you to read up on it, and there are countless regex tutorials on the net that go into more detail and that you should also read. Okay.
Your code uses sregex_iterator to walk through all places in the string text where the regular expression rx matches, then turns them into strings and saves them. So, what are regular expressions?
Regular expressions are a way of applying pattern matching to strings. This can range from simple substring searches to...well, to complex substring searches, really. Instead of just looking for an instance of "oba" in the string "foobar", for example, you might search for "oo" followed by any character followed by "a" and find it in "foobar" as well as in "foonarf".
In order to enable this kind of pattern search, you must have a way to specify what pattern you are looking for, and regular expressions are one such way. The details vary across implementations, but in general it works by defining special characters that match special things or modify the behaviour of other parts of the pattern. This sounds confusing, so let's consider a few examples:
The period . matches any single character
Something followed by the Kleene star * matches zero or more instances of that something
Something followed by a + will match one or more instances of that something
Brackets [, ] enclose a set of characters; the whole thing then matches any one of those characters.
The caret ^ inverts the selection of a bracket expression
Still confusing. So let's put it together:
oo.a
is a regular expression using the period. This will match "oo.a", "ooba", "oona", "oo|a" and anything else that is two o's followed by one character followed by an a. It will not match "ooa", "oba" or "nonsense".
a*
will match "", "a", "aa", "aaa", and any other sequence consisting only of a's but nothing else.
[fgh]oobar
will match any of "foobar", "goobar", and "hoobar", nothing else.
[^fgh]oobar
will match "aoobar", "boobar", "coobar" and so forth but not "foobar", "goobar" and "hoobar".
[^fgh]+oobar
will match "aoobar", "aboobar", "abcoobar", but not "oobar", "foobar", "agoobar", and "abhoobar".
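If you want to check these claims yourself, a tiny helper around std::regex_match is enough. This is a sketch; the name matchesWhole is made up, and it tests whether the entire string matches the pattern, which is the sense of "match" used in the examples above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole of `text` matches `pattern`
// (ECMAScript syntax, the default for std::regex).
bool matchesWhole(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

For instance, matchesWhole("ooba", "oo.a") and matchesWhole("", "a*") should both be true, while matchesWhole("ooa", "oo.a") and matchesWhole("agoobar", "[^fgh]+oobar") should be false.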
In your case,
[^\r\n]+\r\n
will match any instance of one or more characters that are neither \r nor \n followed by \r\n. You then iterate through all those matches and save the matched portions of text.
That is about as deep as I believe I can reasonably go here. This rabbit hole is very deep, which means that you can do freaky cool stuff with regexes but that you should not expect to master them in a day or two. Most of it goes along the lines of what I just outlined, but in true programmer's fashion, most regex implementations go beyond the mathematical scope of regular languages and expressions and introduce useful but mindbendy stuff. Dragons be ahead, but the journey is worth it.
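One thing to keep in mind: with [^\r\n]+\r\n each match still includes the trailing \r\n. If the stored names should come out without it, one option (a sketch, with the made-up function name splitLines) is to match only the runs of characters that are not line breaks:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return the non-empty lines of `text`
// without their \r or \n terminators.
std::vector<std::string> splitLines(const std::string& text) {
    // Each match is a maximal run of characters that are neither \r nor \n.
    static const std::regex linePattern("[^\r\n]+");
    std::vector<std::string> lines;
    for (std::sregex_iterator it(text.begin(), text.end(), linePattern), end;
         it != end; ++it) {
        lines.push_back(it->str());
    }
    return lines;
}
```

Note that this variant silently skips empty lines, which may or may not be what you want for a list of file names.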
One simple alternative will be to use split_regex from Boost. Eg. split_regex(out, input, boost::regex("(\r\n)+")) where out is a vector of string and input is the input string. A complete example is pasted below:
#include <vector>
#include <iostream>
#include <boost/algorithm/string/regex.hpp>
#include <boost/regex.hpp>
using std::endl;
using std::cout;
using std::string;
using std::vector;
using boost::algorithm::split_regex;
int main()
{
vector<string> out;
string input = "aabcdabc\r\n\r\ndhhh\r\ndabcpqrshhsshabc";
split_regex(out, input, boost::regex("(\r\n)+"));
for (auto &x : out) {
std::cout << "Split: " << x << std::endl;
}
return 0;
}
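If you would rather not depend on Boost, roughly the same split can be sketched with the standard library, using std::sregex_token_iterator with submatch index -1. The function name splitOnCrLf is an assumption for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `input` on runs of "\r\n".
std::vector<std::string> splitOnCrLf(const std::string& input) {
    static const std::regex delimiter("(\r\n)+");
    std::vector<std::string> parts;
    // Submatch index -1 selects the text between delimiter matches.
    for (std::sregex_token_iterator it(input.begin(), input.end(), delimiter, -1), end;
         it != end; ++it) {
        parts.push_back(it->str());
    }
    return parts;
}
```

With the input string from the example above, this should produce the three pieces "aabcdabc", "dhhh" and "dabcpqrshhsshabc".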
This is also one way to go:
// strtok modifies its input, so tokenize a writable copy rather than Files.c_str()
string buffer = Files;
char * pch = strtok(&buffer[0], "\r\n");
while(pch != NULL)
{
    FileName.push_back(pch);
    pch = strtok(NULL, "\r\n");
}
With regex rx("[^\\s]+\r\n"); it seems you're matching the strings rather than splitting on the separator. The negated character class [^\\s] matches any character that is not whitespace (horizontal spaces or line breaks). The third line contains a horizontal space, so your regex only matches the text after that space. By default, . matches any character except line breaks, so you could use regex rx(".+\r\n"); instead of regex rx("[^\\s]+\r\n");

Regular expression that finds and replaces a long string of words

I am new to Regular Expressions.
What is the expression that would find a long string of words that begin with a 3-digit number and place spaces at the beginning of capitalized words:
REPLACE:
013TheBlueCowJumpedOverTheFence1984.jpg
WITH:
013 The Blue Cow Jumped Over The Fence 1984
Note: removes the .jpg at the end
This will save me ooooodles of time.
I would not use regular expressions for this task. It's going to be ugly and hard to maintain. A better approach would be to loop through the string and rebuild the string as you go based on your input.
string retVal = "";
foreach (char s in myInput) {
    if (char.IsUpper(s)) {
        retVal += " " + s;
    }
    // insert the rest of your conditions
}
Try using this regular expression: \d+|[A-Z][a-z]*
It will collect all matches, and you must join them with spaces.
This will need two operations since the replacement is different for each.
The first:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z])))/
Replace with: ' $1' (note the space)
Will put spaces between the words. The second:
/\s*(.*)\s*\..*$/
Replace with: '$1'
Will remove trailing spaces and the extension.
The first expression can be taken in parts: (?<![\d])\d finds a digit not preceded by another digit, the second: ((?<![A-Z])[A-Z](?![A-Z])) finds an uppercase letter not preceded or followed by an uppercase letter.
You'll likely have more rules that you will want to incorporate into this, such as how are you dealing with the string: 'BackInTheUSSR.jpg'?
Edit: This should handle that example:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z]))|((?<![A-Z])[A-Z]+(?![a-z])))/
match:
'[A-Z][a-z]*'
replace with
' \0'
Note that this doesn't put a space before 1984, and it doesn't remove .jpg.
You can do the former by matching on
'[0-9]+|[A-Z][a-z]*'
instead. And the latter by removing it in a separate instruction, for example with a regexp replacement of '\.jpg$' with ''
Note that \'s need to be written as \\ in many languages.
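As a rough C++ illustration of that two-step idea (strip the extension, then collect the tokens and join them with spaces), here is a sketch using std::regex. The function name spaceOutFileName is made up, and the extension is assumed to be exactly ".jpg".

```cpp
#include <regex>
#include <string>

// Hypothetical helper: "013TheBlueCow.jpg" -> "013 The Blue Cow".
std::string spaceOutFileName(const std::string& fileName) {
    // Step 1: drop a trailing ".jpg" extension, if present (assumed extension).
    static const std::regex extension(R"(\.jpg$)");
    const std::string stem = std::regex_replace(fileName, extension, "");

    // Step 2: collect digit runs and capitalized words, joined by single spaces.
    static const std::regex token("[0-9]+|[A-Z][a-z]*");
    std::string result;
    for (std::sregex_iterator it(stem.begin(), stem.end(), token), end;
         it != end; ++it) {
        if (!result.empty()) result += ' ';
        result += it->str();
    }
    return result;
}
```

Anything that is neither a digit run nor a capitalized word is dropped, and a run of capitals such as USSR is split into single letters, so extra rules would be needed for names like that.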

How do I capture all matches of a repeating group with Boost::regex_search?

I am trying to parse an input string using a regular expression. I am getting a problem when trying to capture a repeating group. I always seem to be matching the last instance of the group. I have tried using reluctant (non-greedy) quantifiers, but I seem to be missing something. Can someone help?
Regular expression tried:
(OS)\\s((\\w{3})(([A-Za-z0-9]{2})|(\\w{3})(\\w{3}))\\/{0,1}){1,5}?\\r
(OS)\\s((\\w{3}?)(([A-Za-z0-9]{2}?)|(\\w{3}?)(\\w{3}?))\\/{0,1}?){1,5}?\\r
Input String:
OS BENKL/LHRBA/MANQFL\r\n
I always seem to get last group which is MANQFL group (MAN QFL), and my aim is to get all three groups (there can be 1-5 groups):
(BEN KL) , (LHR BA) and (MAN QFL).
C++ code snippet:
std::string::const_iterator start = str.begin(), end = str.end();
while(regex_search(start,end,what,expr))
{
cout << what[0];
cout << what[1];
...
start += what.position () + what.length ();
}
This loop only executes once, while I expect it to run 3 times in this example. Any help will be much appreciated.
The best way of getting multiple matches out of boost::regex is to use regex_iterators. This example should do what you want.
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main() {
std::string a = "OS BENKL/LHRBA/MANQFL\r\n";
const boost::regex re("[A-Z]{3}[A-Z]*");
boost::sregex_iterator res(a.begin(),a.end(),re);
boost::sregex_iterator end;
for (; res != end; ++res)
std::cout << (*res)[0] << std::endl;
}
The only regex flavor that I know that can give you all the iterations of a capturing group is the .NET regex flavor. Normally a regex engine only saves the last iteration of each capturing group.
The general solution to this kind of problem is to use one regex to capture all the iterations of the group, and a second regex to split the result of the first regex into the separate items. Alan already explained how you can do this in this particular situation.
That's the expected behavior: when a capturing group is controlled by a quantifier, each repetition overwrites whatever was captured the previous time. The simplest way to get all of the matches would be to put a capturing group around the whole thing, like this:
(OS)\\s(((\\w{3})(([A-Za-z0-9]{2})|(\\w{3})(\\w{3}))\\/?){1,5})\\r
That group will end up containing BENKL/LHRBA/MANQFL, which you can split on the /.
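A sketch of that two-step approach with std::regex might look like the following. Both patterns are simplified assumptions rather than the exact expression from the question (each item is taken to be a 3-character code followed by 2 or 3 more word characters), and the name parseSegments is invented for illustration.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: "OS BENKL/LHRBA/MANQFL\r\n" -> (BEN,KL), (LHR,BA), (MAN,QFL).
std::vector<std::pair<std::string, std::string>> parseSegments(const std::string& line) {
    std::vector<std::pair<std::string, std::string>> segments;

    // Step 1: one capturing group around the whole repeated part.
    // Simplified assumption: 1 to 5 items of 5 or 6 word characters each.
    static const std::regex whole(R"(OS\s((?:\w{5,6}/?){1,5})\r)");
    std::smatch match;
    if (!std::regex_search(line, match, whole)) return segments;

    // Step 2: walk the captured text with a second regex, one item per match.
    const std::string items = match[1].str();
    static const std::regex item(R"((\w{3})(\w{2,3}))");
    for (std::sregex_iterator it(items.begin(), items.end(), item), end;
         it != end; ++it) {
        segments.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return segments;
}
```

The point of the design is that the first regex only has to capture the repeated part once as a whole, so the "last repetition wins" behaviour of capturing groups no longer matters.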
Read the section about repeated captures here: http://www.boost.org/doc/libs/1_47_0/libs/regex/doc/html/boost_regex/captures.html
Basically, what you want is an experimental feature that can be enabled by passing the appropriate #defines and flags to your regex_search call.

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);
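The same pattern should also work with std::regex in C++; raw string literals keep the backslashes readable. A minimal sketch, with the made-up helper name firstQuoted:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: return the first double-quoted literal in `source`,
// quotes included, or an empty string if there is none.
std::string firstQuoted(const std::string& source) {
    // Either a character that is not a quote or backslash, or a backslash
    // followed by any character, repeated between two double quotes.
    static const std::regex quoted(R"re("(?:[^"\\]|\\.)*")re");
    std::smatch match;
    return std::regex_search(source, match, quoted) ? match.str() : std::string();
}
```

Applied to the string from the question, this should return the literal " It\'s big \"problem " including its surrounding quotes.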
This one comes from nanorc.sample available in many linux distros. It is used for syntax highlighting of C style strings
\"(\\.|[^\"])*\"
As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
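For reference, here is how the combined pattern could be used from C++ to pull out the contents of every quoted string. It is a sketch: the name quotedContents is made up, and the inner groups are written as non-capturing so that group 1 holds double-quoted content and group 2 holds single-quoted content.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: contents (without the quotes) of every
// double- or single-quoted string in `source`.
std::vector<std::string> quotedContents(const std::string& source) {
    static const std::regex quoted(
        R"re("([^"\\]*(?:\\.[^"\\]*)*)"|'([^'\\]*(?:\\.[^'\\]*)*)')re");
    std::vector<std::string> contents;
    for (std::sregex_iterator it(source.begin(), source.end(), quoted), end;
         it != end; ++it) {
        // Group 1 is set for double-quoted matches, group 2 for single-quoted ones.
        contents.push_back((*it)[1].matched ? (*it)[1].str() : (*it)[2].str());
    }
    return contents;
}
```

The escaped quotes stay in the returned contents as written in the input; unescaping them would be a separate step.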
Most of the solutions provided here use alternative repetition paths i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string
"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.
This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
Every quoted string starts with Char: " ;
It may contain any number of any characters: .*? {Lazy match}; ending with non escape character [^\\];
Statement (2) is Lazy(!) optional because string can be empty(""). So: (.*?[^\\])??
Finally, every quoted string ends with Char("), but it can be preceded by an even number of escape sign pairs (\\\\)+; and it is Greedy(!) optional: ((\\\\)+)?+ {Greedy matching}, because the string can be empty or without ending pairs!
An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string: String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
This regexp can easily be broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Sorry, I don't have a high enough reputation to include images, so, it's just a link for now.
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js
Here is one that works with both " and ', and you can easily add others at the start.
("|')(?:\\\1|[^\1])*?\1
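It uses the backreference (\1) to match exactly what is in the first group (" or '). Note that in most flavors \1 inside a character class is not a backreference, so [^\1] does not exclude the quote character; it is the lazy quantifier *? that stops the match at the closing quote.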
http://www.regular-expressions.info/backref.html
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
If your IDE is IntelliJ IDEA, you can forget all these headaches: store your regex in a String variable, and as you copy-paste it inside the double quotes it will automatically be escaped into a regex-acceptable format.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.
(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"