Tricky substring problems - C++

I'm having a problem with substrings. I have a string in the format below, which I'm currently reading with getline.
Richard[12345/678910111213141516] was murdered
What I have been using is find_last_of and find_first_of to get the positions between the brackets and forward slashes and retrieve each field. This works, but I have run into a problem: the name field can be up to 32 characters long and can contain / and [], so when I finally ran into a user with a URL for a name, the parsing broke. The numbers are also random on a per-user basis. I'm retrieving each field from the string: the name and the two identifying numbers.
Another string can look like this, so I would be grabbing 6 total substrings.
Richard[12345/678910111213141516] was murdered by Ralph[54321/161514131211109876]
Which is just another huge mess. I was thinking about starting from the back and moving to the front, but if the second name field (Ralph) contains any / or [], it's going to ruin the count for retrieving the first part. Any insight would be helpful. Thank you.
In a nutshell, how do I account for these?
Names can also contain any alphanumeric or special character.
Richard///[][][12345/678910111213141516] was murdered by Ralph[/[54321/161514131211109876]
The end result would be 6 substrings containing this.
Richard///[][]
12345
678910111213141516
Ralph[/
54321
161514131211109876
Regex has been mentioned to me, but I don't know whether it would be better suited for the task. I included the tag so someone more experienced with it might answer or comment.

Here is a regex way to obtain all the values:
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string str = "Richard///[][][12345/678910111213141516] was murdered by Ralph[/[54321/161514131211109876]";
    std::regex rgx1(R"(([A-Z]\w*\s*\S*)\[(\d+)?(?:\/(\d+))?\])");
    std::smatch smtch;
    // Search repeatedly, continuing from the text after the previous match.
    while (std::regex_search(str, smtch, rgx1)) {
        std::cout << "Name: " << smtch[1] << std::endl;
        std::cout << "ID1: " << smtch[2] << std::endl;
        std::cout << "ID2: " << smtch[3] << std::endl;
        str = smtch.suffix().str();
    }
}
The regex ([A-Z]\w*\s*\S*)\[(\d+)?(?:\/(\d+))?\] matches:
([A-Z]\w*\s*\S*) - (Group 1) an uppercase letter, 0 or more word characters, 0 or more whitespace symbols, then 0 or more non-whitespace symbols, as many as possible.
\[ - an opening square bracket (must be escaped as it is a special character in regex reserved for character classes)
(\d+)? - (Group 2) 1 or more digits (optional group, can be empty)
(?:\/(\d+))? - non-capturing optional group matching
\/ - literal /
(\d+) - (Group 3) 1 or more digits.
\] - closing square bracket.

A possible regex solution would be to use a pattern like the following:
(\S+)\[(\d+)/(\d+)\](?:\s|$)
which will match and store the names (with their meta attributes). I am still thinking about ways in which it could break.
You can test it on regex101.
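A minimal C++ sketch of how this pattern could be applied to a whole line, assuming names never contain whitespace and are always followed directly by [digits/digits]. The Field struct and the parse_fields helper are hypothetical names introduced only for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical record type: one name plus its two identifying numbers.
struct Field {
    std::string name;
    std::string id1;
    std::string id2;
};

// Collects every Name[id1/id2] occurrence in a line.
// Assumption: names contain no whitespace. \S+ is greedy, so it backtracks to
// the last '[' in the token that is followed by digits, '/', digits and ']'.
std::vector<Field> parse_fields(const std::string& line) {
    static const std::regex pattern(R"((\S+)\[(\d+)/(\d+)\](?:\s|$))");
    std::vector<Field> fields;
    for (std::sregex_iterator it(line.begin(), line.end(), pattern), end; it != end; ++it) {
        fields.push_back({(*it)[1].str(), (*it)[2].str(), (*it)[3].str()});
    }
    return fields;
}
```

For the example line from the question, parse_fields is expected to return two entries: Richard///[][] with 12345 and 678910111213141516, and Ralph[/ with 54321 and 161514131211109876. A name containing a space would still break this sketch.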

Related

C++: How to extract words from string with regex

I want to extract words from a string. There are two methods I can think of that would accomplish this:
Extraction by a delimiter.
Extraction by word pattern searching.
Before I get into the specifics of my problem, I want to clarify that while I do ask about the methods of extraction and their implementations, the main focus of my problem is the regexes; not the implementations.
The words that I want to match can contain apostrophes (e.g. "Don't"), can be inside double or single quotes (apostrophes) (e.g. "Hello" and 'world') and a combination of the two (e.g. "Didn't" and 'Won't'). They can also contain numbers (e.g. "2017" and "U2") and underscores and hyphens (e.g. "hello_world" and "time-turner"). In-word apostrophes, underscores, and hyphens must be surrounded by other word characters. A final requirement is that strings containing random non-word characters (e.g. "Good mor~+%g.") should still recognize all word-characters as words.
Example strings to extract words from and what I want the result to look like:
"Hello, world!" should result in "Hello" and "world"
"Aren't you clever?" should result in "Aren't", "you" and "clever"
"'Later', she said." should result in "Later", "she" and "said"
"'Maybe 5 o'clock?'" should result in "Maybe", "5" and "o'clock"
"In the year 2017 ..." should result in "In", "the", "year" and "2017"
"G2g, cya l8r" should result in "G2g", "cya" and "l8r"
"hello_world.h" should result in "hello_world" and "h"
"Hermione's time-turner." should result in "Hermione's" and "time-turner"
"Good mor~+%g." should result in "Good", "mor" and "g"
"Hi' Testing_ Bye-" should result in "Hi", "Testing" and "Bye"
Because – as far as I can tell – the two methods I proposed require quite different solutions I'll divide my question into two parts – one for each method.
1. Extraction by delimiter
This is the method I have dedicated most of my time to developing, and I have found a partially working solution. However, I suspect the regex I am using is not very efficient. My solution is this (using Boost.Regex because its Perl syntax supports lookbehinds):
#include <iostream>
#include <string>
#include <vector>
#include <boost/regex.hpp>

int main() {
    std::vector<std::string> phrases({ "Hello, world!", "Aren't you clever?",
                                       "'Later', she said.", "'Maybe 5 o'clock?'",
                                       "In the year 2017 ...", "G2g, cya l8r",
                                       "hello_world.h", "Hermione's time-turner.",
                                       "Good mor~+%g.", "Hi' Testing_ Bye-"});
    std::vector<std::string> words;
    boost::regex delimiterPattern("^'|[\\W]*(?<=\\W)'+\\W*|(?!\\w+(?<!')'(?!')\\w+)[^\\w']+|'$");
    boost::sregex_token_iterator end;
    for (const std::string& phrase : phrases) {
        // Submatch index -1 selects the text between delimiter matches.
        boost::sregex_token_iterator phraseIter(phrase.begin(), phrase.end(), delimiterPattern, -1);
        for ( ; phraseIter != end; ++phraseIter) {
            words.push_back(*phraseIter);
            std::cout << words.back() << std::endl;
        }
    }
}
My largest problem with this solution is my regex, which I think looks too complex and could probably be done much better. It also doesn't correctly match apostrophes at the end of words – like in example 3. Here's a link to regex101.com with the regex and the example strings: Delimiter regex.
2. Extraction by word pattern searching
I haven't dedicated too much time to pursue this path myself and mainly included it as an alternative because my partial solution isn't necessarily the best one. My suggestion as to how to accomplish this would be to do something in the vein of repeatedly searching a string for a pattern, removing each match from the string as you go until there are no more matches. I have a working regex for this method, but would still like input on it: "[A-Za-z0-9]+(['_-]?[A-Za-z0-9]+)?". Here's a link to regex101.com with the regex and the example strings: Word pattern regex.
I want to emphasize again that I first and foremost want input on my regexes, but also appreciate help with implementing the methods.
Edit: Thanks @Galik for pointing out that possessive plurals can end in apostrophes. The apostrophes associated with these may be matched in a delimiter and do not have to be matched in a word pattern (i.e. "The kids' toys" should result in "The", "kids" and "toys").
You may use
[^\W_]+(?:['_-][^\W_]+)*
Pattern details:
[^\W_]+ - one or more chars other than non-word chars and _ (matches alphanumeric chars)
(?: - start of a non-capturing group that only groups subpatterns and matches:
['_-] - a ', _ or -
[^\W_]+ - 1+ alphanumeric chars
)* - repeats the group zero or more times.
C++ demo:
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::regex r(R"([^\W_]+(?:['_-][^\W_]+)*)");
    std::string s = "Hello, world! Aren't you clever? 'Later', she said. Maybe 5 o'clock?' In the year 2017 ... G2g, cya l8r hello_world.h Hermione's time-turner. Good mor~+%g. Hi' Testing_ Bye- The kids' toys";
    // Iterate over all non-overlapping matches and print each word on its own line.
    for (std::sregex_iterator i(s.begin(), s.end(), r), end; i != end; ++i) {
        std::cout << i->str() << '\n';
    }
}

Regular Expression starting and ending with special characters

I need to extract all matches from a huge text that start with [" and end with "]. These special characters separate the records from a database. I need to extract all records.
Inside each record there are letters, numbers and special characters such as -, ., &, (), /, {space} and so on.
I'm writing this in Office VBA.
The pattern I have come up with so far looks like this: .Pattern = "[[][""][a-z|A-Z|w|W]*".
With this pattern, I am able to extract the first word from each record, with the starting characters [". The count of found matches is correct.
Example of one record:
["blabla","blabla","blabla","\u00e1no","nie","\u00e1no","\u00e1no","\u00e1no","\u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-pencil\u0022\u003E\u003C\/i\u003E Upravi\u0165\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;crz-form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva CRZ\u003C\/a\u003E"]
The question is: how can I extract all the records starting with [" and ending with "]?
I don't necessarily need the starting and ending characters; I can clean those up later.
Thanks for the help.
The easiest way is to get rid of the initial and trailing [" and "] with either Replace or Left/Right/Mid functions, and then Split with "," (in VBA, """,""").
E.g.
' Input is a reserved word in VBA, so a different variable name is used here.
s = "YOUR_STRING"
s = Replace(Replace(s, """]", ""), "[""", "")
result = Split(s, """,""")
If you plan to use a regex, you can use the \["[\s\S]*?"] pattern, but it is not that efficient with long inputs and may even freeze the macro if a timeout issue occurs. You can unroll it as
\["[^"]*(?:"(?!])[^"]*)*"]
In VBA, Pattern = "\[""[^""]*(?:""(?!])[^""]*)*""]"
Note that with this unrolled pattern, you do not even need to use the workarounds for dot matching newline issue (negated character class [^"] matches any char but ", including a newline).
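For readers outside VBA, here is a minimal sketch of the same unrolled pattern in C++ with std::regex. The question itself is about VBA, so C++ is used here only for illustration, and extract_records is a hypothetical helper name. The closing ] is escaped to stay on the safe side across regex implementations. For very large inputs std::regex may be slow or hit implementation limits, so treat this as an illustration of the pattern rather than a tuned solution.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns every substring that starts with [" and runs to the first "] after it.
std::vector<std::string> extract_records(const std::string& text) {
    // Unrolled form of \["[\s\S]*?"\]: runs of non-quote characters, with a
    // quote allowed inside the record only when it is not followed by ']'.
    static const std::regex pattern(R"re(\["[^"]*(?:"(?!\])[^"]*)*"\])re");
    std::vector<std::string> records;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
        records.push_back(it->str());
    }
    return records;
}
```

Each returned string still carries its leading [" and trailing "], which can be trimmed afterwards.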
Pattern details:
\[" - [" literally
[^"]* - zero or more characters other than "
(?:"(?!])[^"]*)* - zero or more sequences of
"(?!]) - " not followed with ]
[^"]* - zero or more characters other than "
"] - literal character sequence "]

Replace group with spaces

I need to hide part of a string: everything before some ending part.
It is easy to implement with a regexp like this:
replace("123-134-04", ".(?=.*-)", " ")
This replaces any symbol if a later part of the string contains "-".
So the result is: "       -04"
It is important to keep the spaces.
But I can't use lookahead or lookbehind.
I can capture the group before the ending part, but how do I replace it with the right number of spaces?
Or are there other ways to solve this with a regex?
Thanks in advance!
If the number of characters to be replaced does not vary too much, and you have a means to match the part to be preserved, you could run through a series of search-and-replace calls:
replace("12-14-04", "^.{5}(-[^-]+)$", "     \1")
replace("123-134-04", "^.{7}(-[^-]+)$", "       \1")
replace("adfasd-adf-da7474-04", "^.{17}(-[^-]+)$", "                 \1")
Or you can:
split the string at the position where the part to be preserved begins,
run the replace("ALL OF THIS SHOULD BECOME BLANKS", ".", " ") on the first part, and
join them up again.
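The split, blank and join steps can be sketched in C++ without any regex. This assumes the part to preserve starts at the last "-" in the string, which matches the lookahead version in the question; mask_prefix is a hypothetical helper name.

```cpp
#include <string>

// Blanks out everything before the last '-' while keeping the string length.
// Assumption: the preserved tail begins at the last '-'.
std::string mask_prefix(const std::string& s) {
    const std::string::size_type pos = s.rfind('-');
    if (pos == std::string::npos) {
        return s;  // no '-' found: nothing to hide
    }
    // pos spaces stand in for the hidden part, then the tail from '-' onward.
    return std::string(pos, ' ') + s.substr(pos);
}
```

For "123-134-04" this yields seven spaces followed by -04, so the result has the same length as the input.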

Regular expression for extracting excerpt from long String

I want to extract excerpts from a long string using a regular expression.
Example string: "" Is it possible that Germany, which beat Argentina 1-0 today to win the World Cup, that will end up as a loser in terms of economic growth? ""
String to search: " that "
Expected result from regex
" possible that Germany "
" rd Cup, that will end "
I want to match the search text together with the 9 characters before and the 9 characters after each occurrence. The search string can occur multiple times within the given string.
I am working on an iOS app using iOS 7.
So far I have come up with this expression, with my limited knowledge of regular expressions, but I am not able to get the desired result from it:
" (.){0,9} (that) {0,9} "
Remove the spaces in your regex. If you want to capture the matched parts, enclose the pattern within capturing groups (i.e., ()).
.{9}that.{9}
OR
(?:.{9}|.{0,9})that(?:.{9}|.{0,9})
Make the preceding and following characters optional to match a line that looks like "that will change history".
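A minimal sketch of the optional-context idea, written as .{0,9} on both sides of the search word. It is shown in C++ with std::regex purely for illustration (on iOS the same pattern could presumably be passed to NSRegularExpression). The excerpts helper and its radius parameter are hypothetical, and the sketch assumes the search word contains no regex metacharacters.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns each occurrence of `word` with up to `radius` characters of context
// on either side. Assumption: `word` contains no regex metacharacters.
std::vector<std::string> excerpts(const std::string& text,
                                  const std::string& word,
                                  int radius = 9) {
    // Builds e.g. ".{0,9}that.{0,9}" for word == "that" and radius == 9.
    const std::string ctx = ".{0," + std::to_string(radius) + "}";
    const std::regex pattern(ctx + word + ctx);
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

Matches do not overlap, so two occurrences that sit closer together than the radius will share context rather than each getting a full window.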
Well, in your expression you were just missing the second "." and maybe the "?" for spaces.
.{0,9} ?that ?.{0,9}
Try that.
You can add ( ) for making groups if you want. I added the "?" to make it comply with your other example:
" that will change history"

Matching exactly one occurrence in a string with a regular expression

The other day I sat with a regular expression problem. Eventually I solved it a different way, without regular expressions, but I would still like to know how you do it :)
The problem I was having was running svn update via an automated script, and I wanted to detect conflicts. Doing this with or without regex is trivial, but it got me thinking about a more obscure problem: How do you match exactly ONE occurrence of a character inside a fixed length field of whitespace?
For instance, let's say we wanted to match "C" inside a six-byte wide field:
"C     " MATCH
"  C   " MATCH
" C C  " NO MATCH
"  M   " NO MATCH
"      " NO MATCH
"C      " NO MATCH (7 characters, not 6)
"  C  " NO MATCH (5 characters, not 6)
I know it's not right to answer your own question, but I basically merged your answers ... please don't flame :)
^(?=.{6}$) *C *$
Edit:
Replacing . with [ C], as in Tomalak's response below, increases the speed by about 4-5% or so
^(?=[ C]{6}$) *C *$
^(?=[ C]{6}$) *C(?! *C)
Explanation:
^ # start-of-string
(?=[ C]{6}$) # followed by exactly 6 times " " or "C" and the end-of-string
*C # any number of spaces and a "C"
(?! *C) # not followed by another C anywhere (negative lookahead)
Notes:
The ^(?=…{6}$) construct can be used anywhere you want to measure string length but not actually match anything yet.
Since the end of the string is already checked in the look-ahead, you do not need to put a $ at the end of the regex, but it does not hurt to do it.
^[^C]*C[^C]*$
but this will not verify the length of your string.
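As a quick way to try the lookahead-based pattern ^(?=[ C]{6}$) *C *$ against the cases from the question, here is a minimal C++ sketch. It relies on std::regex's default ECMAScript grammar, which supports lookahead; has_single_c is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// True when `field` is exactly six characters long and consists of spaces
// plus exactly one 'C'.
bool has_single_c(const std::string& field) {
    // The lookahead checks length and alphabet; the rest checks the count of 'C'.
    static const std::regex pattern("^(?=[ C]{6}$) *C *$");
    return std::regex_search(field, pattern);
}
```

Because the pattern is anchored with ^ and $, std::regex_search behaves like a full-string test here.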