Use GNU libc regexec() to count substring - regex

Is it possible to count how many times a substring appears in a string using regex matching with GNU libc regexec()?

No, regexec() only finds one match per call. If you want to find the next match, you have to call it again further along the string.
If you only want to search for plain substrings, you are much better off using the standard C string.h function strstr(); then you won't have to worry about escaping special regex characters.
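For the plain-substring case, a minimal sketch of counting with strstr() could look like this. The helper name count_substring is made up for the example. It counts non-overlapping occurrences and treats an empty needle as zero occurrences, which is an assumption about what you want.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of needle in haystack using strstr().
   count_substring is a hypothetical helper name, not a libc function. */
size_t count_substring(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    size_t needle_len = strlen(needle);
    size_t count = 0;

    /* An empty needle matches everywhere; treat it as zero occurrences. */
    if (needle_len == 0) {
        return 0;
    }

    /* Each search resumes just past the previous occurrence. */
    for (const char *p = strstr(haystack, needle); p != NULL;
         p = strstr(p + needle_len, needle)) {
        count++;
    }
    return count;
}
```

To count overlapping occurrences instead, resume the search at p + 1 rather than p + needle_len.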

regexec() reports a match through its fourth parameter, "pmatch".
"pmatch" is a fixed-size array: pmatch[0] holds the offsets of the match and the remaining entries hold the offsets of its subexpressions. To find further matches you call the function again, starting after the previous match.
I found this code and modified it so that it counts the matches. You can find the original code at http://www.lemoda.net/c/unix-regex/index.html:
#include <regex.h>

static int match_regex (regex_t * r, const char * to_match)
{
    /* "p" is a pointer into the string which points to the end of the
       previous match. */
    const char * p = to_match;
    /* Only m[0], the whole match, is needed for counting; further
       entries would hold subexpression offsets, not more matches. */
    regmatch_t m[1];
    int number_of_matches = 0;
    while (1) {
        /* After the first call "p" is no longer at the start of the line. */
        int nomatch = regexec (r, p, 1, m, p == to_match ? 0 : REG_NOTBOL);
        if (nomatch) {
            break;
        }
        number_of_matches++;
        /* Offsets are relative to "p", so advance "p" itself. */
        p += m[0].rm_eo;
        if (m[0].rm_so == m[0].rm_eo) {
            /* Empty match: step over one character so the loop makes
               progress, or stop at the end of the string. */
            if (*p == '\0') {
                break;
            }
            p++;
        }
    }
    return number_of_matches;
}

Sorry for posting another answer; I don't have 50 reputation, so I cannot comment on @Oscar Raig Colon's answer.
pmatch does not hold all the matching substrings. It stores the offsets of the match and of its subexpressions, so the key is to understand what a subexpression is: "\(\)" in BRE, "()" in ERE. pmatch[0] receives the offsets of the first match of the whole expression, and pmatch[1] onward receive the offsets of its subexpressions. If the regular expression contains no subexpression, only pmatch[0] is filled in.
You can find an example at http://pubs.opengroup.org/onlinepubs/007908799/xsh/regcomp.html:
The following demonstrates how the REG_NOTBOL flag could be used with regexec() to find all substrings in a line that match a pattern supplied by a user. (For simplicity of the example, very little error checking is done.)
(void) regcomp (&re, pattern, 0);
/* this call to regexec() finds the first match on the line */
error = regexec (&re, &buffer[0], 1, &pm, 0);
while (error == 0) { /* while matches found */
/* substring found between pm.rm_so and pm.rm_eo */
/* This call to regexec() finds the next match */
error = regexec (&re, buffer + pm.rm_eo, 1, &pm, REG_NOTBOL);
}
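One thing to watch in the loop above: the offsets in pm are relative to the pointer passed to regexec(), not to the start of buffer, so from the third call onward buffer + pm.rm_eo no longer points just past the previous match. Here is a sketch that keeps a running base offset instead. The names find_match_offsets, starts and max are hypothetical, and stepping over empty matches one character at a time is a choice made for this example, not something the specification prescribes.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Store the absolute start offset of each match of a POSIX basic regular
   expression in starts[] (at most max entries) and return how many matches
   were found, or -1 if the pattern does not compile.
   find_match_offsets is a hypothetical name used for this sketch. */
int find_match_offsets(const char *pattern, const char *buffer,
                       size_t *starts, int max)
{
    assert(pattern != NULL && buffer != NULL && starts != NULL);
    regex_t re;
    regmatch_t pm;
    size_t base = 0;   /* offset in buffer where the current search starts */
    int count = 0;
    int eflags = 0;

    if (regcomp(&re, pattern, 0) != 0) {
        return -1;
    }
    while (regexec(&re, buffer + base, 1, &pm, eflags) == 0) {
        /* pm offsets are relative to buffer + base, not to buffer. */
        if (count < max) {
            starts[count] = base + (size_t)pm.rm_so;
        }
        count++;
        base += (size_t)pm.rm_eo;
        if (pm.rm_so == pm.rm_eo) {
            /* Empty match: step over one character or stop at the end. */
            if (buffer[base] == '\0') {
                break;
            }
            base++;
        }
        /* Later searches no longer start at the beginning of the line. */
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return count;
}
```

The return value is the total number of matches even when it exceeds max, so the same function can be used purely for counting.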

Related

ESP32 unable to extract string from string based on Regexp mask

I have done a lot of research but cannot find the proper format for a Regexp mask to extract a string from another string.
Suppose I have the following string:
"The quick brown fox ABC3D97 jumps over the lazy wolf"
I need to extract "ABC3D97" based on the mask /[A-Z]{3}\d{1}[A-Z]{1}\d{2}/, but I cannot find the proper syntax: the mask above and variations of it return no match.
My test code is as below:
#include <Regexp.h>
void setup () {
Serial.begin (115200);
// match state object
MatchState ms;
// what we are searching (the target)
char buf [100] = "The quick brown fox ABC3D97 jumps over the lazy wolf";
ms.Target (buf); // set its address
Serial.println (buf);
char result = ms.Match ("d{3}"); // returns no match
if (result > 0) {
Serial.print ("Found match at: ");
int matchStart = ms.MatchStart;
int matchLength = ms.MatchLength;
Serial.println (matchStart); // 16 in this case
Serial.print ("Match length: ");
Serial.println (matchLength); // 3 in this case
String text = String(buf);
Serial.println(text.substring(matchStart,matchStart+matchLength));
}
else
Serial.println ("No match.");
} // end of setup
void loop () {}
Assistance welcome.
The library you're using appears to be Nick Gammon's port of the regular expression functionality from Lua.
Lua's patterns use a different syntax from other commonly used regular expressions. The README for the library links to the documentation on Lua's patterns.
Lua uses % rather than \ for character classes, so \d needs to be written as %d. This library also doesn't support the {number} syntax for specifying a repetition count, so you have to repeat the pattern item instead.
According to the documentation, the match string should be:
[A-Z][A-Z][A-Z]%d[A-Z]%d%d
and not
[A-Z]{3}\d{1}[A-Z]{1}\d{2}
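Because the mask has a fixed shape, it can also be checked by hand without any pattern library. The following is only a sketch in plain C: find_code and the "LLLDLDD" mask string are invented for the example, and it assumes plain ASCII input.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return the index of the first substring of text that has the shape
   LLLDLDD (L = uppercase letter, D = digit), or -1 if there is none.
   find_code and the mask string are hypothetical, for illustration only. */
int find_code(const char *text)
{
    assert(text != NULL);
    static const char mask[] = "LLLDLDD";
    const size_t mask_len = sizeof mask - 1;
    const size_t text_len = strlen(text);

    for (size_t start = 0; start + mask_len <= text_len; start++) {
        size_t i = 0;
        /* Compare each character against the class the mask asks for. */
        while (i < mask_len) {
            unsigned char c = (unsigned char)text[start + i];
            int ok = (mask[i] == 'L') ? isupper(c) : isdigit(c);
            if (!ok) {
                break;
            }
            i++;
        }
        if (i == mask_len) {
            return (int)start;
        }
    }
    return -1;
}
```

The match is always seven characters long, so the caller can copy text[index] through text[index + 6] to extract it.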

finding a character in a pattern in regular expression

I am trying to find all occurrences of an equals sign within quotes in a string.
If my input string is:
anything='', bob2='age=24, sex=M', dilan=24, noble1='yellow'
I wish to find the characters marked below:
anything='', bob2='age=24, sex=M', dilan=24, noble1='yellow'
                      ^       ^
and then replace them like this:
anything='', bob2='age~24, sex~M', dilan=24, noble1='yellow'
                      ^       ^
I tried the following to find all the occurrences:
'[^',].+?'
But that didn't work.
It's quite difficult to implement your requirement just by regex.
I'd like to iterate the String char by char to implement it.
Please check the code below. I have put the comment inside it. I'm using Java but you can utilize the algorithm inside it.
public class Main {
    public static void main(String args[]){
        String input = "param1='', param2='age=24, sex=M', param3=24, param4='yellow'";
        char[] arr = input.toCharArray();
        boolean close = true;
        /**
         * Iterate the char array
         */
        for(int i = 0;i < arr.length;i++){
            if(arr[i] == '\''){
                /**
                 * Ignore the escaped ' char (\'); any other ' toggles close,
                 * which tracks whether the current char is inside ''
                 */
                if(i == 0 || arr[i - 1] != '\\'){
                    close = !close;
                }
            }else if(arr[i] == '='){
                if(!close){
                    arr[i] = '~';
                }
            }
            System.out.print(arr[i]);
        }
    }
}
Try this:
(?<!param[\d+])=
And replace by this:
~
Breakdown:
It looks for any '=' and checks whether it is directly preceded by param followed by one digit (the class [\d+] matches a single digit or a literal +).
If it is not, the = sign is matched.
That = is then replaced by ~.
Explanation
You can use the groups to do that with regex.
Try this code:
(?<=age)(\=)(\S+\s\w+)(\=)
Then, substitute the 1st and 3rd group with ~, and keep the 2nd group intact: ~$2~
Demo: https://regex101.com/r/qxR9ty/1
Update
You can first use a negative lookbehind as suggested by @Maverick_Mrt, and then exclude whichever keys you want by adding them as alternatives separated by |, e.g. cat1|cat2
(?<!app|policy_name|dvc_host|sender|sal)\=
Demo: https://regex101.com/r/qxR9ty/

Replace substring within a string c++

I want to replace substring within a string,
For eg: the string is aa0_aa1_bb3_c*a0_a,
so I want to replace the substring a0_a with b1_a, but I dont want aa0_a to get replaced.
Basically, no alphabet should be present before and after the substring "a0_a" (to be replaced).
That's what regexes are good at. It exists in standard library since C++11, if you have an older version, you can also use Boost.
With the standard library version, you could do (ref):
std::string result;
std::regex rx("(^|[^A-Za-z])a0_a([^A-Za-z]|$)");
result = std::regex_replace("aa0_aa1_bb3_c*a0_a", rx, "$1b1_a$2");
(beware: untested)
Easy enough to do if you loop through each character. Some pseudocode:
string toReplace = "a0_a";
for (int i = 0; i < myString.length(); i++) {
    //filter out strings starting with another alphabetical char
    if (!isAlphabet(myString.charAt(i))) {
        //start the substring one char after the char we have verified to be not alphabetical
        if (myString.substring(i + 1, i + 1 + toReplace.length()).equals(toReplace)) {
            //make the replacement here
        }
    }
}
Note that you will need to check for indexing out of bounds when looking at the substrings, handle a match at the very start of the string, and apply the same alphabetical check to the character after the match.
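The character-by-character idea, including the bounds checks, could be written out roughly as follows in plain C. This is only a sketch: replace_standalone is a made-up helper, and it assumes that the replacement has the same length as the text it replaces, as it does for a0_a and b1_a.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of src in which every occurrence of from
   that has no letter directly before or after it is replaced by to.
   The caller frees the result; NULL is returned if allocation fails.
   replace_standalone is a hypothetical helper, and from and to are assumed
   to have the same length to keep the sketch short. */
char *replace_standalone(const char *src, const char *from, const char *to)
{
    assert(src != NULL && from != NULL && to != NULL);
    size_t len = strlen(src);
    size_t n = strlen(from);
    assert(n > 0 && strlen(to) == n);

    char *out = malloc(len + 1);
    if (out == NULL) {
        return NULL;
    }
    memcpy(out, src, len + 1);

    for (size_t i = 0; i + n <= len; ) {
        /* The start and the end of the string count as "no letter". */
        int letter_before = i > 0 && isalpha((unsigned char)src[i - 1]);
        int letter_after = i + n < len && isalpha((unsigned char)src[i + n]);
        if (!letter_before && !letter_after && memcmp(src + i, from, n) == 0) {
            memcpy(out + i, to, n);
            i += n;   /* continue after the replaced text */
        } else {
            i++;
        }
    }
    return out;
}
```

Calling replace_standalone("aa0_aa1_bb3_c*a0_a", "a0_a", "b1_a") leaves the a0_a inside aa0_aa1 alone and only rewrites the one after the *.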

Check if string contains other string elements

I am trying to check if string contains elements from different string in specific order.
For example:
large string: thisisstring
small string: hssg
it should return true.
I only figured out how to check if string contains whole other string but not parts.
This is the code that I wrote for checking for now:
if ([largestring rangeOfString:smallstring].location != NSNotFound) {
printf("contains");
}
1. If there are no more characters to search for from the small string, return true.
2. Starting from the position after the most recently found character in the large string, do a linear search for the first character from the small string that has not yet been searched for.
3. If the character was not found, return false.
4. Start back at 1.
There's no easy way to do this, at least, no built in way that I know of. You would have to iterate through each letter of your small string and find the first letter that matches your large string.
Each time you find a matching letter, you loop to the next smallstring letter, but instead only begin searching at the index after you found the previous letter.
EDIT:
Some untested code in C:
#include <stdbool.h>
#include <string.h>

// Returns true if the letters of smallstring appear in largestring in the same order.
bool containsInOrder(const char *smallstring, const char *largestring)
{
    size_t foundChar = 0;
    for (size_t l = 0; l < strlen(smallstring); l++)
    {
        bool found = false;
        for (; foundChar < strlen(largestring); foundChar++)
        {
            if (smallstring[l] == largestring[foundChar])
            {
                // We break here because we found a matching letter.
                // Notice that foundChar is still in scope so we preserve
                // its value for the next check.
                found = true;
                foundChar++; // Increment so the next search starts with the next letter.
                break;
            }
        }
        // If we get down here without a match, we've searched all of the
        // remaining letters, so we can report a failure to find the match.
        if (found == false)
        {
            return false;
        }
    }
    // If we get here, it means every loop resulted in a valid match.
    return true;
}

Why the regular expression "//" and "/*" can't match the single comment and block comment?

I want to count the empty lines, single-line comments and block comments in a C++ program.
I wrote the tool using flex, but it can't match the C++ block comments.
1 flex code:
%{
int block_flag = 0;
int empty_num = 0;
int single_line_num = 0;
int block_line_num = 0;
int line = 0;
%}
%%
^[\t ]*\n {
empty_num++;
printf("empty line\n");
}
"//" {
single_line_num++;
printf("single line comment\n");
}
"/*" {
block_flag = 1;
block_line_num++;
printf("block comment begin.block line:%d\n", block_line_num);
}
"*/" {
block_flag = 0;
printf("block comment end.block line:%d\n", block_line_num);
}
^(.*)\n {
if(block_flag)
block_line_num++;
else
line++;
}
%%
int main(int argc , char *argv[])
{
yyin = fopen(argv[1], "r");
yylex();
printf("lines :%d\n" ,line);
fclose(yyin);
return 0;
}
2 hello.c
bbg@ubuntu:~$ cat hello.c
#include <stdlib.h>
//
//
/*
*/
/* */
3 output
bbg@ubuntu:~$ ./a.out hello.c
empty line
empty line
lines :6
Why can't "//" and "/*" match the single-line comment and the block comment?
Flex doesn't search. It matches patterns sequentially, each match starting where the previous one ended.
Flex always picks the pattern with the longest match. (If two or more patterns match exactly the same amount of text, it picks the first one.)
So, you have
"//" { /* Do something */ }
and
^.*\n { /* Do something else */ }
Suppose it has just matched the second one, so we're at the beginning of a line, and suppose the line starts //. Now, both these patterns match, but the second one matches the whole line, whereas the first one only matches two characters. So the second one wins. That wasn't what you wanted.
Hint 1: You probably want // comments to match to the end of the line
Hint 2: There is a regular expression which will match /* comments, although it's a bit tedious: "/*"[^*]*"*"+([^*/][^*]*"*"+)*"/" Unfortunately, if you use that, it won't count line ends for you, but you should be able to adapt it to do what you want.
Hint 3: You might want to think about comments which start in the middle of a line, possibly having been indented. Your rule ^.*\n will swallow an entire line without even looking to see if there is a comment somewhere inside it.
Hint 4: String literals hide comments.
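To see what hints 3 and 4 mean for the scanning logic, here is the same idea as a small hand-written state machine in plain C rather than flex. It is a sketch only: count_comments and struct comment_counts are invented for the example, it counts comments rather than comment lines, and it does not handle raw string literals or backslash line continuations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
struct comment_counts {
    int line_comments;   /* number of // comments */
    int block_comments;  /* number of block comments */
};

/* Scan C or C++ source text and count comments, skipping string and
   character literals. Raw strings and line continuations are not handled. */
struct comment_counts count_comments(const char *src)
{
    assert(src != NULL);
    enum { CODE, LINE_COMMENT, BLOCK_COMMENT, STRING, CHARACTER } state = CODE;
    struct comment_counts counts = {0, 0};

    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];
        char next = src[i + 1];
        switch (state) {
        case CODE:
            /* A comment may start anywhere on the line, not only in column 0. */
            if (c == '/' && next == '/') {
                state = LINE_COMMENT;
                counts.line_comments++;
                i++;
            } else if (c == '/' && next == '*') {
                state = BLOCK_COMMENT;
                counts.block_comments++;
                i++;
            } else if (c == '"') {
                state = STRING;
            } else if (c == '\'') {
                state = CHARACTER;
            }
            break;
        case LINE_COMMENT:
            if (c == '\n') {
                state = CODE;
            }
            break;
        case BLOCK_COMMENT:
            if (c == '*' && next == '/') {
                state = CODE;
                i++;
            }
            break;
        case STRING:
        case CHARACTER:
            /* Inside a literal, comment openers are ordinary characters. */
            if (c == '\\' && next != '\0') {
                i++;   /* skip the escaped character */
            } else if ((state == STRING && c == '"') ||
                       (state == CHARACTER && c == '\'')) {
                state = CODE;
            }
            break;
        }
    }
    return counts;
}
```

In flex the same states would normally be expressed with start conditions; counting lines instead of comments means incrementing a counter on each '\n' seen while in one of the comment states.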