Finding a character in a pattern in a regular expression

I am trying to find all occurrences of an equals sign within quotes in a string.
If my input string is:
anything='', bob2='age=24, sex=M', dilan=24, noble1='yellow'
I wish to find the characters marked below:
anything='', bob2='age=24, sex=M', dilan=24, nobel1=24
                      ^       ^
and then replace them as follows:
anything='', bob2='age~24, sex~M', dilan=24, nobel1=24
                      ^       ^
I tried the following to find all the occurrences:
'[^',].+?'
But that didn't work.

It's quite difficult to implement your requirement with a regex alone.
I'd iterate over the string character by character instead.
Please check the code below; the comments explain each step. I'm using Java, but the algorithm carries over to other languages.
public class Main {
    public static void main(String[] args) {
        String input = "param1='', param2='age=24, sex=M', param3=24, param4='yellow'";
        char[] arr = input.toCharArray();
        // true while we are outside a quoted section
        boolean close = true;
        // Iterate over the char array
        for (int i = 0; i < arr.length; i++) {
            // A quote preceded by a backslash is escaped and does not toggle the state
            boolean escaped = i > 0 && arr[i - 1] == '\\';
            if (arr[i] == '\'' && !escaped) {
                // Use close to track whether we are inside ''
                close = !close;
            } else if (arr[i] == '=' && !close) {
                // An equals sign inside '' is replaced
                arr[i] = '~';
            }
            System.out.print(arr[i]);
        }
    }
}
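For readers not working in Java, the same quote-tracking scan can be sketched in C++. This is a minimal sketch under the same assumptions as the Java code above (single quotes delimit values, and a backslash escapes a quote); the function name replace_quoted_equals is made up for illustration.

```cpp
#include <string>

// Hypothetical helper: replace every '=' that sits between single quotes with '~'.
// A quote preceded by a backslash is treated as escaped and does not toggle the state.
std::string replace_quoted_equals(std::string text) {
    bool inside_quotes = false;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const bool escaped = i > 0 && text[i - 1] == '\\';
        if (text[i] == '\'' && !escaped) {
            inside_quotes = !inside_quotes;   // entering or leaving a quoted section
        } else if (text[i] == '=' && inside_quotes) {
            text[i] = '~';                    // only quoted equals signs are changed
        }
    }
    return text;
}
```

Calling it on the input from the question should leave the key=value separators alone and change only the two equals signs inside bob2's quoted value.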

Try this:
(?<!param[\d+])=
And replace by this:
~
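Breakdown:
The pattern matches any = that is not immediately preceded by param followed by one character from the class [\d+], that is, a single digit or a literal +.
An = that follows a key such as param2 is therefore skipped; every other = is matched and replaced by ~.
Note that [\d+] is a character class, not \d+, so a key with more than one digit (param10) is not recognized, and many regex engines require a fixed-length lookbehind and would reject (?<!param\d+).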

You can use the groups to do that with regex.
Try this code:
(?<=age)(\=)(\S+\s\w+)(\=)
Then, substitute the 1st and 3rd group with ~, and keep the 2nd group intact: ~$2~
Demo: https://regex101.com/r/qxR9ty/1
Update
You can first use a negative lookbehind, as suggested by @Maverick_Mrt, and then list every key whose = should be left alone, separated by |, e.g. cat1|cat2:
(?<!app|policy_name|dvc_host|sender|sal)\=
Demo: https://regex101.com/r/qxR9ty/

How to match *anything* until a delimiter is encountered in RE-flex lexer?

I am using the RE/flex lexer for my project. I want to match the syntax corresponding to ('*)".*?"\1. For example, it should match "foo" and ''"bar"'', but should not match ''"baz"'.
But the RE/flex matcher doesn't work with lookaheads, lookbehinds and backreferences. So, is there a correct way to match this using the RE/flex matcher? The nearest I could achieve was the following lexer:
%x STRING
%%
'*\" {
textLen = 0uz;
quoteLen = size();
start(STRING);
}
<STRING> {
\"'* {
if (size() - textLen < quoteLen) goto MORE_TEXT;
matcher().less(textLen + quoteLen);
start(INITIAL);
res = std::string{matcher().begin(), textLen};
return TokenKind::STR;
}
[^"]* {
MORE_TEXT:
textLen = size();
matcher().more();
}
<<EOF>> {
std::cerr << "Lexical error: Unterminated 'STRING' \n";
return TokenKind::ERR;
}
}
%%
The meta-character . in RE/flex matches any character, whether it is part of a valid or an invalid UTF-8 sequence, whereas an inverted character class, [^...], matches only valid UTF-8 sequences that are absent from the character class.
So the problem with the above lexer is that it matches only valid UTF-8 sequences inside strings, whereas I want it to match anything inside the string until the delimiter.
I considered three workarounds, but all three seem to have issues:
1) Use skip(). This skips all characters until it reaches the delimiter, but in the process it consumes all the string content, so I don't get to keep it.
2) Use .*?/\" instead of [^"]*. This works for every properly terminated string, but the lexer gets jammed if the string is not terminated.
3) Consume the string content character by character using the . pattern. Since . is synchronizing, it can even match invalid UTF-8 sequences, but this approach feels way too slow.
So is there a better approach for solving this?
I didn't find a proper way to solve the problem, but I did get a dirty hack working with the 2nd workaround mentioned above.
Instead of relying on the RE/flex generated scanner loop, I added a custom loop inside the string begin rule. There, instead of failing with a scanner jammed error, I flush the remaining text and display an unterminated string error message.
%x STRING
%%
'*\" {
auto textLen = 0uz;
const auto quoteLen = size();
matcher().pattern(PATTERN_STRING);
while (true) {
switch (matcher().scan()) {
case 1:
if (size() - textLen < quoteLen) break;
matcher().less(textLen + quoteLen);
res = std::string{matcher().begin(), textLen};
return TokenKind::STR;
case 0:
if (!matcher().at_end()) matcher().set_end(true);
std::cerr << "Lexical error: Unterminated 'STRING' \n";
return TokenKind::ERR;
default:
std::unreachable();
case 2:;
}
textLen = size();
matcher().more();
}
}
<STRING>{
\"'* |
.*?/\" |
<<EOF>> std::unreachable();
}
%%

C++11 regex replace

I have an XML string that I wish to log. This XML contains some sensitive data that I'd like to mask before sending it to the log file. I am currently using std::regex to do this:
std::regex reg("<SensitiveData>(\\d*)</SensitiveData>");
return std::regex_replace(xml, reg, "<SensitiveData>......</SensitiveData>");
Currently the data is replaced by exactly 6 '.' characters, but what I really want is to replace the sensitive data with the matching number of dots, i.e. I'd like to get the length of the capture group and emit exactly that many dots.
Can this be done?
regex_replace in C++11 does not have the capability you are asking for; the replacement format argument must be a string. Some regular expression APIs allow the replacement to be a function that receives a match and could perform exactly the substitution you need.
But regexps are not the only way to solve a problem, and in C++ it's not exactly hard to look for two fixed strings and replace the characters in between:
#include <cstring>
#include <string>

const char* const PREFIX = "<SensitiveData>";
const char* const SUFFIX = "</SensitiveData>";

void replace_sensitive(std::string& xml) {
    size_t start = 0;
    while (true) {
        size_t pref, suff;
        if ((pref = xml.find(PREFIX, start)) == std::string::npos)
            break;
        if ((suff = xml.find(SUFFIX, pref + std::strlen(PREFIX))) == std::string::npos)
            break;
        // replace everything between prefix and suffix with '.'
        for (size_t i = pref + std::strlen(PREFIX); i < suff; i++)
            xml[i] = '.';
        start = suff + std::strlen(SUFFIX);
    }
}
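If staying with std::regex is preferred, one possible route is to walk the matches with std::sregex_iterator and build the output by hand, which stands in for the callback that regex_replace lacks. The following is only a sketch: the function name mask_sensitive is made up for illustration, and the pattern is the one from the question with the two tags wrapped in extra capture groups.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: mask the digits inside every <SensitiveData> element
// with the same number of dots.
std::string mask_sensitive(const std::string& xml) {
    static const std::regex reg("(<SensitiveData>)(\\d*)(</SensitiveData>)");
    std::string out;
    auto last = xml.cbegin();
    for (std::sregex_iterator it(xml.cbegin(), xml.cend(), reg), end; it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);                                 // text before this match
        out += m[1].str();                                            // opening tag
        out.append(static_cast<std::size_t>(m[2].length()), '.');     // one dot per captured digit
        out += m[3].str();                                            // closing tag
        last = m[0].second;
    }
    out.append(last, xml.cend());                                     // text after the last match
    return out;
}
```

Unlike the find-based loop, this version leaves an element untouched when its content is not purely digits, because the pattern then does not match.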

Regex to filter strings

I need to filter strings based on two requirements
1) they must start with "city_date"
2) they should not have "metro" anywhere in the string.
This needs to be done in a single check.
To start, I know it should look like this, but I don't know how to eliminate strings containing "metro":
string pattern = "city_date_"
Added: I need to use the regex in a SQL LIKE statement, hence I need it as a string.
Use a negative lookahead assertion (I don't know if this is supported in your regex lib)
string pattern = "^city_date(?!.*metro)"
I also added an anchor ^ at the start, that will match the start of the string.
The negative lookahead assertion (?!.*metro) will fail, if there is the string "metro" somewhere ahead.
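As a quick way to try the pattern, here is a minimal sketch using C++ std::regex, whose default ECMAScript grammar accepts negative lookahead. The function name is_wanted is made up for illustration, and whether the asker's own regex library or SQL dialect accepts the same syntax is not something this sketch can confirm.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when s starts with "city_date" and
// "metro" does not occur anywhere after that prefix.
bool is_wanted(const std::string& s) {
    static const std::regex pattern("^city_date(?!.*metro)");
    // regex_search is enough here because the pattern anchors itself with ^.
    return std::regex_search(s, pattern);
}
```

With this, is_wanted("city_date_paris") should hold, while a string such as "city_date_metro_paris" is rejected by the lookahead and "old_city_date" is rejected by the anchor.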
Regular expressions are usually far more expensive than direct comparisons. If direct comparisons can easily express the requirements, use them. This problem doesn't need the overhead of a regular expression. Just write the code:
std::string str = /* whatever */;
const std::string head = "city_date";
const std::string exclude = "metro";
// compare(pos, len, other) checks that str begins with head
if (str.compare(0, head.size(), head) == 0 && str.find(exclude) == std::string::npos) {
    // process valid string
}
Using JavaScript (wrapped in a function so that return is valid; filterCity is just an illustrative name):
function filterCity(input) {
    // input contains the string you're matching
    var pattern = /^city_date/;
    if (pattern.test(input)) { // matches city_date at the beginning
        var patt = /metro/;
        if (patt.test(input)) return "false";
        else return input; // matched string without metro
    }
    else
        return "false"; // unable to match city_date
}

My last regular expression won't work but I cannot figure out the reason why

I have two vectors: one holds my regular expressions and the other holds the strings to be checked against them. Most of them work fine except for the one shown below. The string is correct and matches the regular expression, but the code outputs "incorrect" instead of "correct".
Input string:
.C/IATA
Code:
std::string errorMessages [6][6] = {
{
"Correct Corparate Name\n",
},
{
"Incorrect Format for Corporate Name\n",
}
};
std::vector<std::string> el;
split(el,message,boost::is_any_of("\n"));
std::string a = ("");
for(int i = 0; i < el.size(); i++)
{
if(el[i].substr(0,3) == ".C/")
{
DCS_LOG_DEBUG("--------------- Validating .C/ ---------------");
output.push_back("\n--------------- Validating .C/ ---------------\n");
str = el[i].substr(3);
split(st,str,boost::is_any_of("/"));
for (int split_id = 0 ; split_id < splitMask.size() ; split_id++ )
{
boost::regex const string_matcher_id(splitMask[split_id]);
if(boost::regex_match(st[split_id],string_matcher_id))
{
a = errorMessages[0][split_id];
DCS_LOG_DEBUG("" << a )
}
else
{
a = errorMessages[1][split_id];
DCS_LOG_DEBUG("" << a)
}
output.push_back(a);
}
}
else
{
DCS_LOG_DEBUG("Do Nothing");
}
st[split_id] = "IATA"
splitMask[split_id] = "[a-zA-Z]{1,15}" <---
But it still outputs "Incorrect Format for Corporate Name".
I cannot see why it prints incorrect when it should be correct. Can someone help me here, please?
Your regex and the surrounding logic are OK.
You need to extend your logging and print the relevant elements of splitMask and st right before the call to boost::regex_match, to double-check that the values are what you believe they are. Print them surrounded by some punctuation and also print the string length to be sure.
As you probably know, boost::regex_match only finds a match if the whole string is a match; therefore, if there is a non-printable character somewhere, or maybe a trailing space character, that will perfectly explain the result you have seen.
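The effect can be sketched as follows. This is an illustration only: it uses std::regex_match as a stand-in for boost::regex_match to stay self-contained, the helper names describe and matches_mask are made up, and the trailing "\r" (as left behind when CRLF text is split on "\n") is just one hypothetical cause.

```cpp
#include <regex>
#include <string>

// Hypothetical logging helper: brackets mark the exact boundaries of the value
// and the length exposes characters that do not show up when printed.
std::string describe(const std::string& value) {
    return "[" + value + "] length=" + std::to_string(value.size());
}

// Whole-string match, analogous to what boost::regex_match does.
bool matches_mask(const std::string& value, const std::string& mask) {
    return std::regex_match(value, std::regex(mask));
}
```

With the mask from the question, matches_mask("IATA", "[a-zA-Z]{1,15}") succeeds, while the same call on "IATA\r" fails, and describe("IATA\r") reports a length of 5 rather than 4, which is the kind of discrepancy the extra logging is meant to reveal.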

Use GNU libc regexec() to count substring

Is it possible to count how many times a substring appears in a string using regex matching with GNU libc regexec()?
No, regexec() only finds one match per call. If you want to find the next match, you have to call it again further along the string.
If you only want to search for plain substrings, you are much better off using the standard C string.h function strstr(); then you won't have to worry about escaping special regex characters.
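For the plain-substring case, a counting loop around strstr() could look like the sketch below. It is written in C++ (std::strstr from <cstring>), counts non-overlapping occurrences, and the function name count_substring is made up for illustration.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: count non-overlapping occurrences of needle in haystack.
std::size_t count_substring(const char* haystack, const char* needle) {
    const std::size_t needle_len = std::strlen(needle);
    if (needle_len == 0) {
        return 0;   // an empty needle would otherwise match at every position
    }
    std::size_t count = 0;
    // Each call resumes the search just past the previous occurrence.
    for (const char* p = std::strstr(haystack, needle); p != nullptr;
         p = std::strstr(p + needle_len, needle)) {
        ++count;
    }
    return count;
}
```

To count overlapping occurrences instead, the search would resume at p + 1 rather than p + needle_len.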
regexec fills its fourth parameter, pmatch, with the offsets of what it matched.
pmatch is a fixed-size array and each call reports a single match, so to find further matches you call the function again, starting after the previous match.
I found this code with two nested loops and modified it. The original code can be found at http://www.lemoda.net/c/unix-regex/index.html:
static int match_regex (regex_t * r, const char * to_match)
{
    /* "p" is a pointer into the string which points to the end of the
       previous match. */
    const char * p = to_match;
    /* Only m[0], the overall match, is needed for counting. */
    regmatch_t m[1];
    int number_of_matches = 0;
    while (1) {
        /* Each call finds at most one match, searching from p. */
        int nomatch = regexec (r, p, 1, m, 0);
        if (nomatch) {
            printf ("No more matches.\n");
            break;
        }
        number_of_matches ++;
        if (m[0].rm_eo == m[0].rm_so) {
            /* Empty match: step one character so the loop makes progress. */
            if (p[m[0].rm_eo] == '\0') {
                break;
            }
            p += m[0].rm_eo + 1;
        } else {
            p += m[0].rm_eo;
        }
    }
    return number_of_matches;
}
Sorry for creating another answer; I don't have the 50 reputation needed to comment on @Oscar Raig Colon's answer.
pmatch cannot hold all the matching substrings. pmatch is used to save the offsets of subexpressions, so the key is to understand what a subexpression is: it is "\(\)" in BRE and "()" in ERE. If there is no subexpression in the regular expression, regexec() only returns the offsets of the first matching string and puts them in pmatch[0].
You can find an example at http://pubs.opengroup.org/onlinepubs/007908799/xsh/regcomp.html:
The following demonstrates how the REG_NOTBOL flag could be used with regexec() to find all substrings in a line that match a pattern supplied by a user. (For simplicity of the example, very little error checking is done.)
(void) regcomp (&re, pattern, 0);
/* this call to regexec() finds the first match on the line */
error = regexec (&re, &buffer[0], 1, &pm, 0);
while (error == 0) { /* while matches found */
/* substring found between pm.rm_so and pm.rm_eo */
/* This call to regexec() finds the next match */
error = regexec (&re, buffer + pm.rm_eo, 1, &pm, REG_NOTBOL);
}
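A fuller version of that loop might look like the sketch below. It is written in C++ on top of the POSIX <regex.h> functions, and the function name find_all and the choice of REG_EXTENDED are assumptions made for illustration. One detail worth spelling out is that rm_so and rm_eo are relative to the pointer passed to each regexec() call, so the sketch advances a cursor rather than always offsetting from the start of the buffer.

```cpp
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical helper: return every non-overlapping match of an extended
// regular expression in text. An invalid pattern yields an empty result.
std::vector<std::string> find_all(const std::string& pattern, const std::string& text) {
    std::vector<std::string> matches;
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0) {
        return matches;
    }
    const char* cursor = text.c_str();
    regmatch_t pm;
    int flags = 0;
    while (regexec(&re, cursor, 1, &pm, flags) == 0) {
        // Offsets are relative to cursor, not to the start of text.
        matches.emplace_back(cursor + pm.rm_so, cursor + pm.rm_eo);
        if (pm.rm_eo == pm.rm_so) {
            // Empty match: step one character so the loop always advances.
            if (cursor[pm.rm_eo] == '\0') {
                break;
            }
            cursor += pm.rm_eo + 1;
        } else {
            cursor += pm.rm_eo;
        }
        flags = REG_NOTBOL;   // later calls no longer start at the beginning of the line
    }
    regfree(&re);
    return matches;
}
```

The number of matches is then simply the size of the returned vector, for example find_all("[0-9]+", "a1 b22 c333").size().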