I have done a lot of research but cannot find the proper format for a Regexp mask that extracts one string from another.
Suppose I have the following string:
"The quick brown fox ABC3D97 jumps over the lazy wolf"
I need to extract "ABC3D97" based on the mask /[A-Z]{3}\d{1}[A-Z]{1}\d{2}/, but I cannot find the proper syntax: the mask above and variations of it return no match.
My test code is as below:
#include <Regexp.h>
void setup () {
Serial.begin (115200);
// match state object
MatchState ms;
// what we are searching (the target)
char buf [100] = "The quick brown fox ABC3D97 jumps over the lazy wolf";
ms.Target (buf); // set its address
Serial.println (buf);
char result = ms.Match ("d{3}"); // <-- returns no match
if (result > 0) {
Serial.print ("Found match at: ");
int matchStart = ms.MatchStart;
int matchLength = ms.MatchLength;
Serial.println (matchStart); // 16 in this case
Serial.print ("Match length: ");
Serial.println (matchLength); // 3 in this case
String text = String(buf);
Serial.println(text.substring(matchStart,matchStart+matchLength));
}
else
Serial.println ("No match.");
} // end of setup
void loop () {}
Assistance welcome.
The library you're using appears to be Nick Gammon's port of the regular expression functionality from Lua.
Lua's regular expressions use a different syntax from other commonly used regular expressions. The README for the library links to the documentation on Lua's regular expressions.
Lua uses % rather than \ for character classes, so \d needs to be written as %d. This library also doesn't support the {number} syntax for specifying the number of repetitions; you have to repeat the match characters instead.
According to the documentation, the match string should be:
[A-Z][A-Z][A-Z]%d[A-Z]%d%d
and not
[A-Z]{3}\d{1}[A-Z]{1}\d{2}
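The two rewrite rules above (swap \ for %, and spell out each {number} by repetition) can be sketched as a small helper in plain C++. This is only an illustration under stated assumptions: toLuaPattern is a hypothetical name, not part of the Regexp library, and it only understands bracket classes, backslash escapes and exact {n} counts. Ranges such as {1,3} and literal % characters are left untouched.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Regexp.h): rewrites a small subset of
// PCRE-style syntax into a Lua-style pattern.
std::string toLuaPattern(const std::string& mask) {
    std::string out;
    std::size_t i = 0;
    while (i < mask.size()) {
        // 1. Read one item: a [...] class, a \x escape, or a single character.
        std::string item;
        if (mask[i] == '[') {
            std::size_t close = mask.find(']', i + 1);
            if (close == std::string::npos) close = mask.size() - 1;
            item = mask.substr(i, close - i + 1);
            i = close + 1;
        } else if (mask[i] == '\\' && i + 1 < mask.size()) {
            item = "%";            // Lua uses % where PCRE uses a backslash
            item += mask[i + 1];
            i += 2;
        } else {
            item = mask.substr(i, 1);
            ++i;
        }
        // 2. Read an optional exact count {n}; anything else is left alone.
        std::size_t count = 1;
        if (i < mask.size() && mask[i] == '{') {
            std::size_t j = i + 1;
            std::size_t n = 0;
            while (j < mask.size() && std::isdigit(static_cast<unsigned char>(mask[j]))) {
                n = n * 10 + static_cast<std::size_t>(mask[j] - '0');
                ++j;
            }
            if (j > i + 1 && j < mask.size() && mask[j] == '}') {
                count = n;
                i = j + 1;
            }
        }
        // 3. Emit the item count times, since Lua patterns have no {n}.
        for (std::size_t k = 0; k < count; ++k) out += item;
    }
    return out;
}
```

Calling toLuaPattern(R"([A-Z]{3}\d{1}[A-Z]{1}\d{2})") yields the match string given above, which can then be passed to ms.Match().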
I am using the RE/flex lexer for my project, and I want to match the syntax corresponding to ('*)".*?"\1. For example, it should match "foo" and ''"bar"'', but should not match ''"baz"'.
However, the RE/flex matcher doesn't support lookaheads, lookbehinds or backreferences. Is there a correct way to match this using the RE/flex matcher? The nearest I could achieve was the following lexer:
%x STRING
%%
'*\" {
textLen = 0uz;
quoteLen = size();
start(STRING);
}
<STRING> {
\"'* {
if (size() - textLen < quoteLen) goto MORE_TEXT;
matcher().less(textLen + quoteLen);
start(INITIAL);
res = std::string{matcher().begin(), textLen};
return TokenKind::STR;
}
[^"]* {
MORE_TEXT:
textLen = size();
matcher().more();
}
<<EOF>> {
std::cerr << "Lexical error: Unterminated 'STRING' \n";
return TokenKind::ERR;
}
}
%%
The meta-character . in RE/flex matches any character, whether it is a valid or an invalid UTF-8 sequence, whereas an inverted character class [^...] matches only valid UTF-8 sequences that are absent from the class.
So the problem with the lexer above is that it matches only valid UTF-8 sequences inside strings, whereas I want it to match anything inside the string up to the delimiter.
I considered three workarounds, but all three seem to have issues.
1. Use skip(). This skips all characters until it reaches the delimiter, but in the process it consumes the string content, so I don't get to keep it.
2. Use .*?/\" instead of [^"]*. This works for every properly terminated string, but the lexer gets jammed if the string is not terminated.
3. Consume the string content character by character using the . meta-character. Since . is synchronizing, it can even match invalid UTF-8 sequences, but this approach feels far too slow.
So is there any better approach for solving this?
I didn't find a proper way to solve the problem, so I resorted to a dirty hack based on the 2nd workaround mentioned above.
Instead of relying on the scanner loop generated by RE/flex, I added a custom loop inside the rule that begins a string. There, instead of failing with a "scanner jammed" error, I flush the remaining text and display an unterminated-string error message.
%x STRING
%%
'*\" {
auto textLen = 0uz;
const auto quoteLen = size();
matcher().pattern(PATTERN_STRING);
while (true) {
switch (matcher().scan()) {
case 1:
if (size() - textLen < quoteLen) break;
matcher().less(textLen + quoteLen);
res = std::string{matcher().begin(), textLen};
return TokenKind::STR;
case 0:
if (!matcher().at_end()) matcher().set_end(true);
std::cerr << "Lexical error: Unterminated 'STRING' \n";
return TokenKind::ERR;
default:
std::unreachable();
case 2:;
}
textLen = size();
matcher().more();
}
}
<STRING>{
\"'* |
.*?/\" |
<<EOF>> std::unreachable();
}
%%
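Independent of RE/flex, the rule being emulated here, ('*)".*?"\1, can be sketched as a byte-oriented scan in plain C++, which makes the intended behaviour easy to check against the examples in the question. matchQuoted is a hypothetical helper, not a RE/flex API; it assumes the token must start at the beginning of the input and treats the content as raw bytes, so invalid UTF-8 sequences are accepted.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a RE/flex API): matches N quotes, a double quote,
// any bytes, a double quote, and the same N quotes, starting at in[0].
// On success the bytes between the double quotes are stored in content.
bool matchQuoted(const std::string& in, std::string& content) {
    // Count the leading single quotes: this is the \1 capture.
    std::size_t quotes = 0;
    while (quotes < in.size() && in[quotes] == '\'') ++quotes;
    if (quotes >= in.size() || in[quotes] != '"') return false;

    const std::size_t bodyStart = quotes + 1;
    for (std::size_t i = bodyStart; i < in.size(); ++i) {
        if (in[i] != '"') continue;   // any other byte is string content
        // A double quote only closes the string when it is followed by at
        // least as many single quotes as the opener had.
        std::size_t n = 0;
        while (n < quotes && i + 1 + n < in.size() && in[i + 1 + n] == '\'') ++n;
        if (n == quotes) {
            content = in.substr(bodyStart, i - bodyStart);
            return true;
        }
    }
    return false;                     // unterminated string
}
```

Because the first qualifying closing delimiter wins, the scan behaves like the lazy .*? in the pattern, and an unterminated string simply returns false instead of jamming.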
So, I would like to change all words in a string except one, that stays in the middle.
#include <boost/algorithm/string/replace.hpp>
#include <iostream>
#include <string>
using namespace std;
int main()
{
string test = "You want to join player group";
string find = "You want to join group";
string replace = "This is a test about group";
boost::replace_all(test, find, replace);
cout << test << endl;
}
The output was expected to be:
This is a test about player group
But it doesn't work, the output is:
You want to join player group
The problem is in finding the words, since they are searched for as one single string.
Is there a function that matches all the words, regardless of their position, and changes only what I want?
EDIT2:
This is the best example of what I want to happen:
const char* a = "This is MYYYYYYYYY line in the void Translate"; // This is the main line
const char* b = "This is line in the void Translate"; // This is what needs to be found in the main line
const char* c = "Testing - is line twatawtn thdwae voiwd Transwlate"; // This needs to replace ALL the words in b, preserving the MYYYYYYYYY
// The output is expected to be:
Testing - is MYYYYYYYY is line twatawtn thdwae voiwd Transwlate
You need to invert your thinking here. Instead of matching "All words but one", you need to try to match that one word so you can extract it and insert it elsewhere.
We can do this with Regular Expressions, which became standardized in C++11:
std::string test = "You want to join player group";
static const std::regex find{R"(You want to join (\S+) group)"};
std::smatch search_result;
if (!std::regex_search(test, search_result, find))
{
std::cerr << "Could not match the string\n";
exit(1);
}
else
{
std::string found_group_name = search_result[1];
auto replace = boost::format("This is a test about %1% group") % found_group_name;
std::cout << replace;
}
To match the word "player" I used a pretty simple regular expression, (\S+), which means "match one or more non-whitespace characters (greedily) and put that into a group".
"Groups" in regular expressions are enclosed by parentheses. The 0th group is always the entire match, and since we only have one set of parentheses, your word is therefore in group 1, hence the resulting access of the match result at search_result[1].
To create the regular expression, you'll notice I used the perhaps-unfamiliar string literal syntax R"(...)". This is called a raw string literal and was also standardized in C++11. It was basically made for describing regular expressions without needing to escape backslashes. If you've used Python, it's the same as r'...'. If you've used C#, it's the same as @"...".
I threw in some boost::format to print the result because you were using Boost in the question and I thought you'd like to have some fun with it :-)
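If Boost.Format isn't wanted, the captured group can also be substituted back in a single call with std::regex_replace, where $1 in the format string refers to group 1. This is only a sketch of that variant; translate is a hypothetical function name, and the sentences are the ones from the question.

```cpp
#include <regex>
#include <string>

// Hypothetical wrapper: rewrites the sentence while keeping the word that
// was captured by group 1. Input that does not match is returned unchanged.
std::string translate(const std::string& line) {
    static const std::regex find{R"(You want to join (\S+) group)"};
    return std::regex_replace(line, find, "This is a test about $1 group");
}
```

For example, translate("You want to join player group") returns "This is a test about player group".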
In your example, find is not a substring of test, so boost::replace_all(test, find, replace); has no effect.
Removing group from find and replace solves it:
#include <boost/algorithm/string/replace.hpp>
#include <iostream>
int main()
{
std::string test = "You want to join player group";
std::string find = "You want to join";
std::string replace = "This is a test about";
boost::replace_all(test, find, replace);
std::cout << test << std::endl;
}
Output: This is a test about player group.
In this case, there is just one replace of the beginning of the string because the end of the string is already the right one. You could have another call of replace_all to change the end if needed.
Some other options:
one is in the other answer.
split the strings into a vector (or array) of words, then insert the desired word (player) at the right spot of the replace vector, then build your output string from it.
I need to check a short string for matches with a list of substrings. Currently, I do this like shown below (working code on ideone)
bool ContainsMyWords(const std::wstring& input)
{
if (std::wstring::npos != input.find(L"white"))
return true;
if (std::wstring::npos != input.find(L"black"))
return true;
if (std::wstring::npos != input.find(L"green"))
return true;
// ...
return false;
}
int main() {
std::wstring input1 = L"any text goes here";
std::wstring input2 = L"any text goes here black";
std::cout << "input1 " << ContainsMyWords(input1) << std::endl;
std::cout << "input2 " << ContainsMyWords(input2) << std::endl;
return 0;
}
I have 10-20 substrings that I need to match against an input. My goal is to optimize code for CPU utilization and reduce time complexity for an average case. I receive input strings at a rate of 10 Hz, with bursts to 10 kHz (which is what I am worried about).
There is agrep library with source code written in C, I wonder if there is a standard equivalent in C++. From a quick look, it may be a bit difficult (but doable) to integrate it with what I have.
Is there a better way to match an input string against a set of predefined substrings in C++?
The best approach is a regular expression search with the following regular expression:
"(white)|(black)|(green)"
That way, a single pass over the string tells you which substring matched: group 1 is set if "white" matched, group 2 if "black" matched, and group 3 if "green" matched, each with its start and end positions. Group 0 gives you the position of the end of the match, so you can begin a new search from there to look for more matches, still within one pass over the string.
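As a rough sketch of this idea with the standard library, std::wregex can run the alternation and report which group took part in the match. firstKeywordGroup is a hypothetical name, and whether this actually beats repeated find() calls depends on the std::regex implementation, so it is worth measuring before relying on it.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns 1 for "white", 2 for "black", 3 for "green",
// or 0 when none of the substrings occurs in the input.
int firstKeywordGroup(const std::wstring& input) {
    // Compiled once and reused across calls.
    static const std::wregex re(L"(white)|(black)|(green)");
    std::wsmatch m;
    if (!std::regex_search(input, m, re)) return 0;
    // Exactly one of the alternation groups participated in the match.
    for (std::size_t g = 1; g < m.size(); ++g) {
        if (m[g].matched) return static_cast<int>(g);
    }
    return 0;
}
```

The start of the match is available as m.position(0) and its length as m.length(0), should the search need to continue past the first hit.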
You could use one big if instead of several if statements. However, NathanOliver's solution with std::any_of is faster than that, provided the array of substrings is made static (so that it is not recreated on every call), as shown below.
bool ContainsMyWordsNathan(const std::wstring& input)
{
// do not forget to make the array static!
static const std::wstring keywords[] = {L"white", L"black", L"green" /* , ... */};
return std::any_of(std::begin(keywords), std::end(keywords),
[&](const std::wstring& str){return input.find(str) != std::wstring::npos;});
}
PS: As discussed in Algorithm to find multiple string matches:
The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.
I need to filter strings based on two requirements
1) they must start with "city_date"
2) they should not have "metro" anywhere in the string.
This needs to be done in just one check.
To start, I know it should look like this, but I don't know how to eliminate strings containing "metro":
string pattern = "city_date_"
Added: I need to use the regex for a SQL LIKE statement, hence I need it in a string.
Use a negative lookahead assertion (I don't know if this is supported in your regex lib)
string pattern = "^city_date(?!.*metro)"
I also added an anchor ^ at the start, that will match the start of the string.
The negative lookahead assertion (?!.*metro) will fail, if there is the string "metro" somewhere ahead.
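For what it's worth, the default ECMAScript grammar of C++'s std::regex does support negative lookahead, so the pattern can be tried out as follows. This is a sketch only: accept is a hypothetical function name, and it assumes single-line input strings.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when s starts with "city_date" and does not
// contain "metro" anywhere after that prefix.
bool accept(const std::string& s) {
    static const std::regex re(R"(^city_date(?!.*metro))");
    return std::regex_search(s, re);
}
```

With this, accept("city_date_paris") is true, while accept("city_date_metro_paris") and accept("old_city_date") are both false.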
Regular expressions are usually far more expensive than direct comparisons. If direct comparisons can easily express the requirements, use them. This problem doesn't need the overhead of a regular expression. Just write the code:
std::string str = /* whatever */;
const std::string head = "city_date";
const std::string exclude = "metro";
if (str.compare(0, head.size(), head) == 0 && str.find(exclude) == std::string::npos) {
// process valid string
}
Using JavaScript:
function filter(input) { // input contains the string you're matching
var pattern = /^city_date/;
if (pattern.test(input)) // match city_date at the beginning
{
var patt = /metro/;
if (patt.test(input)) return "false";
else return input; // matched string without metro
}
else
return "false"; // unable to match city_date
}
Is it possible to count how many times a substring appears in a string using regex matching with GNU libc regexec()?
No, regexec() only finds one match per call. If you want to find the next match, you have to call it again further along the string.
If you only want to search for plain substrings, you are much better off using the standard C string.h function strstr(); then you won't have to worry about escaping special regex characters.
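A counting loop built on strstr() might look like the sketch below. countOccurrences is a hypothetical helper name; it counts non-overlapping occurrences and treats an empty needle as zero matches, both of which are choices made here for illustration.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: counts non-overlapping occurrences of needle in
// haystack by calling strstr() again just past each match.
std::size_t countOccurrences(const char* haystack, const char* needle) {
    const std::size_t len = std::strlen(needle);
    if (len == 0) return 0;   // an empty needle would otherwise never advance
    std::size_t count = 0;
    for (const char* p = std::strstr(haystack, needle); p != nullptr;
         p = std::strstr(p + len, needle)) {
        ++count;
    }
    return count;
}
```

To count overlapping occurrences instead, the search would resume at p + 1 rather than p + len.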
regexec returns, in its fourth parameter "pmatch", a structure with all the matches.
"pmatch" is a fixed-size structure; if there are more matches, you call the function another time.
I found this code with two nested loops and modified it. The original code can be found at http://www.lemoda.net/c/unix-regex/index.html:
static int match_regex (regex_t * r, const char * to_match)
{
/* "P" is a pointer into the string which points to the end of the
previous match. */
const char * p = to_match;
/* "N_matches" is the maximum number of matches allowed. */
const int n_matches = 10;
/* "M" contains the matches found. */
regmatch_t m[n_matches];
int number_of_matches = 0;
while (1) {
int i = 0;
int nomatch = regexec (r, p, n_matches, m, 0);
if (nomatch) {
printf ("No more matches.\n");
break; /* leave the loop so that the count below is returned */
}
for (i = 0; i < n_matches; i++) {
if (m[i].rm_so == -1) {
break;
}
number_of_matches ++;
}
p += m[0].rm_eo;
}
return number_of_matches ;
}
Sorry for creating another answer; I do not have 50 reputation, so I cannot comment on @Oscar Raig Colon's answer.
pmatch cannot hold all the matching substrings. pmatch is used to save the offsets of subexpressions, so the key is to understand what a subexpression is: it is "\(\)" in BRE and "()" in ERE. If there is no subexpression in the regular expression, regexec() only returns the offsets of the first matching string and puts them in pmatch[0].
You can find an example at http://pubs.opengroup.org/onlinepubs/007908799/xsh/regcomp.html:
The following demonstrates how the REG_NOTBOL flag could be used with regexec() to find all substrings in a line that match a pattern supplied by a user. (For simplicity of the example, very little error checking is done.)
(void) regcomp (&re, pattern, 0);
/* this call to regexec() finds the first match on the line */
error = regexec (&re, &buffer[0], 1, &pm, 0);
while (error == 0) { /* while matches found */
/* substring found between pm.rm_so and pm.rm_eo */
/* This call to regexec() finds the next match */
error = regexec (&re, buffer + pm.rm_eo, 1, &pm, REG_NOTBOL);
}
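Putting the pieces together, a complete counting function could look like the following sketch. It assumes a POSIX system providing <regex.h> and extended (ERE) syntax; countMatches is a hypothetical name. Since the offsets returned by regexec() are relative to the pointer passed in, the sketch keeps a running pointer instead of always adding to the start of the buffer, and it steps over empty matches so the loop cannot stall.

```cpp
#include <regex.h>

// Hypothetical helper: counts non-overlapping matches of an ERE pattern in
// text. Returns -1 when the pattern fails to compile.
long countMatches(const char* pattern, const char* text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return -1;

    long count = 0;
    const char* p = text;   // start of the part not searched yet
    regmatch_t pm;          // pm.rm_so / pm.rm_eo are offsets relative to p
    int flags = 0;
    while (regexec(&re, p, 1, &pm, flags) == 0) {
        ++count;
        regoff_t advance = pm.rm_eo;
        if (pm.rm_eo == pm.rm_so) {           // empty match
            if (p[pm.rm_eo] == '\0') break;   // at end of text: stop
            advance = pm.rm_eo + 1;           // otherwise skip one character
        }
        p += advance;
        flags = REG_NOTBOL;  // p no longer points at the beginning of a line
    }
    regfree(&re);
    return count;
}
```

Only pm (that is, pmatch[0]) is requested here, because the count concerns whole matches rather than subexpressions.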