Regex (JS Notation): Select spaces not in [ [], {}, "" ] to tokenize string - c++

I need to tokenize a string on every space that is not between quotes. I am using regex in JavaScript notation.
For example:
" Test Test " ab c " Test" "Test " "Test" "T e s t"
becomes
[" Test Test ",ab,c," Test","Test ","Test","T e s t"]
For my use case however, the solution should work in the following test setting:
https://www.regextester.com/
All spaces not within quotes should be highlighted in the above setting. If they are highlighted there, they will be parsed correctly in my program.
For more specificity, I am using Boost::Regex C++ to do the parsing as follows:
...
std::string test_string("\" Test Test \" ab c \" Test\" \"Test \" \"Test\" \"T e s t\"");
// (,|;)?\\s+ : Split on ,\s or ;\s
// (?![^\\[]*\\]) : Ignore spaces inside []
// (?![^\\{]*\\}) : Ignore spaces inside {}
// (?![^\"].*\") : Ignore spaces inside "" !!! MY ATTEMPT DOESN'T WORK !!!
//Note the below regex delimiter declaration does not include the erroneous regex.
boost::regex delimiter("(,|;\\s|\\s)+(?![^\\[]*\\])(?![^\\(]*\\))(?![^\\{]*\\})");
std::vector<std::string> string_vector;
boost::split_regex(string_vector, test_string, delimiter);
For those of you who do not use Boost::regex or C++, the above link should enable testing of viable regex for the above use case.
Thank you all for your assistance. I hope you can help me with the above problem.

I would 100% not use regular expressions for this. First off, because it's way easier to express as a PEG grammar instead. E.g.:
#include <boost/spirit/home/x3.hpp>
#include <string>
#include <string_view>
#include <vector>

std::vector<std::string> tokens(std::string_view input) {
    namespace x3 = boost::spirit::x3;
    std::vector<std::string> r;
    auto atom                             //
        = '[' >> *~x3::char_(']') >> ']'  //
        | '{' >> *~x3::char_('}') >> '}'  //
        | '"' >> *~x3::char_('"') >> '"'  //
        | x3::graph;
    auto token = x3::raw[*atom];
    parse(input.begin(), input.end(), token % +x3::space, r);
    return r;
}
This, off the bat, already performs as you intend:
int main() {
for (std::string const input : {R"(" Test Test " ab c " Test" "Test " "Test" "T e s t")"}) {
std::cout << input << "\n";
for (auto& tok : tokens(input))
std::cout << " - " << quoted(tok, '\'') << "\n";
}
}
Output:
" Test Test " ab c " Test" "Test " "Test" "T e s t"
- '" Test Test "'
- 'ab'
- 'c'
- '" Test"'
- '"Test "'
- '"Test"'
- '"T e s t"'
BONUS
Where this really makes a difference is when you realize that you want to handle nested constructs (e.g. "string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye ]).
Regular expressions are notoriously bad at this. Spirit grammar rules can be recursive though. If you make your grammar description more explicit I could show you examples.
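To illustrate why nesting calls for a parser rather than a single regex, here is a minimal hand-written scanner in Python (not Spirit, and not the recursive grammar mentioned above). It keeps a stack of expected closing brackets and only splits on whitespace when the stack is empty and no quote is open. The name tokenize and the PAIRS table are made up for this sketch, and it assumes quotes contain no backslash escapes.

```python
# Opening bracket -> the closing bracket it expects.
PAIRS = {"[": "]", "{": "}", "(": ")"}


def tokenize(text):
    """Split on whitespace that is outside quotes and outside any bracket pair."""
    tokens = []
    current = []       # characters of the token being built
    stack = []         # expected closing brackets, innermost last
    in_quotes = False
    for ch in text:
        if in_quotes:
            # Inside "...": everything is literal until the closing quote.
            current.append(ch)
            if ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
            current.append(ch)
        elif ch in PAIRS:
            stack.append(PAIRS[ch])
            current.append(ch)
        elif stack and ch == stack[-1]:
            stack.pop()
            current.append(ch)
        elif ch.isspace() and not stack:
            # Top-level whitespace ends the current token (if there is one).
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


for tok in tokenize('"string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye" ] end'):
    print(tok)
```

The input above is a balanced variant of the nested example (with the closing quote after bye added). The whole bracketed group comes out as one token because the brackets inside "more [string]" are skipped while a quote is open.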

You can use multiple regexes if you are ok with that. The idea is to replace spaces inside quotes with a non-printable char (\x01), and restore them after the split:
const input = `" Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
.replace(/"[^"]*"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
.split(/ +/) // split on spaces
.map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
If you have escaped quotes within a string, such as "a \"quoted\" token" you can use this regex instead:
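const input = String.raw`"A \"quoted\" token" " Test Test " ab c " Test" "Test " "Test" "T e s t"`; // String.raw keeps the backslashes; a plain template literal would drop them
let result = input
.replace(/"(?:[^"\\]|\\.)*"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes, treating \" as part of the string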
.split(/ +/) // split on spaces
.map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
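An alternative to the placeholder round trip, sketched here in Python, is to match the tokens themselves instead of the delimiters: a token is either a complete quoted string (with backslash escapes allowed) or a run of non-space characters. This is a different technique from the one above, and the names TOKEN and split_outside_quotes are made up for the illustration. It assumes quotes only open at the start of a token.

```python
import re

# Either a double-quoted string (backslash escapes allowed) or a run of non-space chars.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|\S+')


def split_outside_quotes(text):
    """Return the tokens of text, keeping quoted strings (and their spaces) intact."""
    return TOKEN.findall(text)


print(split_outside_quotes('" Test Test " ab c " Test" "Test " "Test" "T e s t"'))
print(split_outside_quotes(r'"A \"quoted\" token" ab c'))
```

Because the quoted alternative is tried first, a quote at the start of a token always consumes everything up to its matching close. A quote that appears in the middle of a bare word (such as ab"c d") is not treated as an opener by this sketch.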
If you want to parse nested brackets you need a proper language parser. You can also do that with regexes however: Parsing JavaScript objects with functions as JSON
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex

Related

Regex to match strings not enclosed in macro

In a development context, I would like to make sure all strings in source files within certain directories are enclosed in some macro "STR_MACRO". For this I will be using a Python script parsing the source files, and I would like to design a regex for detecting non-commented lines with strings not enclosed in this macro.
For instance, the regex should match the following strings:
std::cout << "Hello World!" << std::endl;
load_file("Hello World!");
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar
Excluding commented lines containing strings seems to work well using the regex ^(?!\s*//).*"([^"]+)". However, when I try to also exclude non-commented strings already enclosed in the macro, using the regex ^(?!\s*//).*(?!STR_MACRO\()"([^"]+)", nothing changes (seemingly because of the opening parenthesis after STR_MACRO).
Any hints on how to achieve this?
With the PyPI regex module (which you can install with pip install regex in the terminal), you can use
import regex
pattern = r'''(?:^//.*|STR_MACRO\("[^"\\]*(?:\\.[^"\\]*)*"\))(*SKIP)(*F)|"[^"\\]*(?:\\.[^"\\]*)*"'''
text = r'''For instance, the regex should match the following strings:
std::cout << "Hello World!" << std::endl;
load_file("Hello World!");
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar'''
print( regex.sub(pattern, r'STR_MACRO(\g<0>)', text, flags=regex.M) )
Details:
(?:^//.*|STR_MACRO\("[^"\\]*(?:\\.[^"\\]*)*"\))(*SKIP)(*F) - // at the line start and the rest of the line, or STR_MACRO( + a double quoted string literal pattern + ), and then the match is skipped, and the next match search starts at the failure location
| - or
"[^"\\]*(?:\\.[^"\\]*)*" - ", zero or more chars other than " and \, then zero or more repetitions of a \ followed by any single char and zero or more chars other than " and \, and then a " char
See the Python demo. Output:
For instance, the regex should match the following strings:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar
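If installing the third-party regex module is not an option, roughly the same effect can be sketched with the standard re module. Instead of (*SKIP)(*F), the unwanted contexts (comment lines, already wrapped strings) are matched first without a capture group, and only a bare string lands in group 1. This is a swapped-in technique, not the answer's pattern. The names PATTERN, find_bare_strings and wrap_bare_strings are made up for the sketch, and it assumes comments are whole-line // comments.

```python
import re

# A double-quoted string literal with backslash escapes (same shape as above).
STRING = r'"[^"\\]*(?:\\.[^"\\]*)*"'

PATTERN = re.compile(
    rf'^[ \t]*//.*|STR_MACRO\({STRING}\)|({STRING})',
    re.MULTILINE,
)
# Alternative 1: a whole-line comment      -> consumed, group 1 stays None
# Alternative 2: an already wrapped string -> consumed, group 1 stays None
# Alternative 3: a bare string literal     -> captured in group 1


def find_bare_strings(source):
    """Return every string literal that is neither commented out nor wrapped."""
    return [m.group(1) for m in PATTERN.finditer(source) if m.group(1) is not None]


def wrap_bare_strings(source):
    """Wrap every bare string literal in STR_MACRO(...), leaving the rest untouched."""
    def repl(m):
        bare = m.group(1)
        return m.group(0) if bare is None else f"STR_MACRO({bare})"
    return PATTERN.sub(repl, source)


code = 'load_file("Hello World!");\nload_file(STR_MACRO("Hello World!"));\n// "foo" bar'
print(find_bare_strings(code))
print(wrap_bare_strings(code))
```

For the detection use case in the question, find_bare_strings is probably the more relevant half: a non-empty result means the file still contains unwrapped strings.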

Find regex matches & remove outer part of the match

I have a string
content = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
and I want to find all instances of func(...) and remove the function call, so that I would get
content = "std::cout << some_val << std::endl; auto i = some_other_val;"
So I've tried this:
import re
content = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
c = re.compile('func\([a-zA-Z0-9_]+\)')
print(c.sub('', content)) # gives "std::cout << << std::endl; auto i = ;"
but this removes the entire match, not just the func( and ).
Basically, how do I keep whatever matched with [a-zA-Z0-9_]+?
You can use re.sub to replace each outer func(...) with only the value inside it, as below. Here I've used [\w]+; you can change it if your values need a different character class.
import re
regex = r"func\(([\w]+)\)"
test_str = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
subst = "\\1"
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
    print(result)
Demo: https://rextester.com/QZJLF65281
Output:
std::cout << some_val << std::endl; auto i = some_other_val;
You should capture the part of the match that you want to keep into a group:
re.compile(r'func\(([a-zA-Z0-9_]+)\)')
Here I captured it into group 1.
And then you can refer to group 1 with \1:
print(c.sub(r'\1', content))
Note that in general, you should not use regex to parse source code of a non-regular language (such as C in this case). It might work in a few very specific cases, where the input is very limited, but you should still use a C parser to parse C code. I have found libraries such as this and this.
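As a small illustration of where the capture-group approach stops working: an argument such as g(a, b) contains its own parentheses, so [a-zA-Z0-9_]+ no longer matches it. The sketch below is a hand-rolled scanner, not a real C parser; it uses a regex only to find the opening func( and then counts parentheses to find the matching close. The name unwrap_calls is made up, and the sketch assumes no parentheses inside string literals or comments.

```python
import re


def unwrap_calls(source, name="func"):
    """Replace every name(<args>) with <args>, honouring nested parentheses."""
    opener = re.compile(r"\b" + re.escape(name) + r"\(")
    out = []
    pos = 0
    while True:
        m = opener.search(source, pos)
        if m is None:
            out.append(source[pos:])
            return "".join(out)
        out.append(source[pos:m.start()])
        # Walk forward until the parenthesis opened by name( is closed again.
        depth = 1
        i = m.end()
        while i < len(source) and depth:
            if source[i] == "(":
                depth += 1
            elif source[i] == ")":
                depth -= 1
            i += 1
        if depth:
            # Unbalanced call: leave the remainder untouched.
            out.append(source[m.start():])
            return "".join(out)
        out.append(source[m.end():i - 1])  # the arguments, without the outer parens
        pos = i


print(unwrap_calls("std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"))
print(unwrap_calls("auto j = func(g(a, b));"))
```

Only the outermost call is unwrapped in each place; a func nested inside another func's arguments is copied through as-is.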

how to include a single bar symbol "|" to my regex

I have found a regex which can test whether the user has entered ".(cpp)", ".(txt)" or ".(5ft6)".
str is the test string and ext is the regex.
std::string str("\"\\.(cpp)\"");
const std::string regex_string = R"!(\"\\\.\([[:alpha:][:digit:]]*\)\")!";
const std::regex ext(regex_string);
if (regex_match(str, ext))
{
cout << "regex" << endl;
}
However, I am not able to figure out how to write a regex for an expression containing a vertical bar, i.e. the OR symbol "|".
I mean the case where the user enters the expression ".(cpp|h|hpp)" in the command-line arguments.
There must be a regex specifying that the user can enter any number or letter, including the single bar "|".
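One possible direction, sketched in Python rather than std::regex: inside a bracket expression the bar is an ordinary character, so the simplest change should be to add it to the allowed set (in the pattern above, something like [[:alpha:][:digit:]|]*). The sketch below is slightly stricter than that: it requires alphanumeric runs separated by single bars, so unlike the original pattern it rejects an empty (). The names EXT_ARG and parse_extensions are made up for the illustration.

```python
import re

# Matches "\.(ext)" or "\.(ext1|ext2|...)" including the surrounding quotes,
# where each ext is a non-empty run of letters and digits.
EXT_ARG = re.compile(r'"\\\.\(([A-Za-z0-9]+(?:\|[A-Za-z0-9]+)*)\)"')


def parse_extensions(arg):
    """Return the list of extensions in arg, or None if arg is malformed."""
    m = EXT_ARG.fullmatch(arg)
    return m.group(1).split("|") if m else None


print(parse_extensions(r'"\.(cpp)"'))
print(parse_extensions(r'"\.(cpp|h|hpp)"'))
print(parse_extensions(r'"\.(cpp||h)"'))
```

Outside a bracket expression the bar has to be escaped (\|), which is what the group (?:\|[A-Za-z0-9]+)* does.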

Using REGULAR EXPRESSION to replace string between special characters in oracle

select a[b]c[d][e]f[g] from dual;
I need an output:
acf
i.e. all [] removed, along with the text between them.
Solution can be in Oracle or C++ function.
I tried the erase function in C++, something like:
int main ()
{
std::string str ("a[b]c[d]e[f]");
std::cout << str << '\n';
while(1)
{
std::size_t foundStart = str.find("[");
//if (foundStart != std::string::npos)
std::cout << "'[' found at: " << foundStart << '\n';
str.begin();
std::size_t foundClose = str.find("]");
//if (foundClose != std::string::npos)
std::cout << "']' found at: " << foundClose << '\n';
str.begin();
str.erase (foundStart,foundClose);
std::cout << str << '\n';
}
return 0;
}
which produces this output:
a[b]c[d]e[f]
'[' found at: 1
']' found at: 3
ac[d]e[f]
'[' found at: 2
']' found at: 4
ac[f]
'[' found at: 2
']' found at: 4
ac
'[' found at: 18446744073709551615
']' found at: 18446744073709551615
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::erase
Thanks in Advance.
I don't know enough C++ or Oracle to implement it, but the Regular Expression would look something like this, I suppose:
(?<=[\s\]])[a-z](?=(\[[a-z]\])+[\sa-z])
This will match a, c and f.
You will need to iterate over the matches and print them accordingly.
The regular expression is decoupled from whatever text is around the target,
hello there a[b]c[d][e]f[g] way to go! will have the same matches,
just be sure to have spaces around the target string a[b]c[d][e]f[g]
I hope I helped you!
Good luck
You can use regexp_replace(<your_string>,'\[.*?\]')
Breaking it down:
\[ --matches single square bracket '['. Should be escaped with a backslash '\', as '[' is regex operator
.*? --non greedy expression to match minimum text possible
\] --matches single square bracket ']'. Should be escaped with a backslash '\', as ']' is regex operator
Example:
SQL> with x(y) as (
select 'a[b]c[d][][e]f[g][he]ty'
from dual
)
select y, regexp_replace(y,'\[.*?\]') regex_str
from x;
Y REGEX_STR
----------------------- -----------
a[b]c[d][][e]f[g][he]ty acfty
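The same pattern can be tried outside Oracle. A minimal Python sketch, assuming brackets are never nested (the name strip_bracketed is made up for the illustration):

```python
import re

# '[' followed by as little as possible up to the next ']' (same pattern as above).
BRACKETED = re.compile(r"\[.*?\]")


def strip_bracketed(text):
    """Remove every [...] group, brackets included."""
    return BRACKETED.sub("", text)


print(strip_bracketed("a[b]c[d][e]f[g]"))
print(strip_bracketed("a[b]c[d][][e]f[g][he]ty"))
```

With nested brackets the non-greedy match stops at the first ], so a[b[c]]d would come out as a]d rather than ad. As for the C++ attempt in the question: std::string::erase takes a position and a count, not a start and an end index, and it throws std::out_of_range when the position is npos, which is what happens once no bracket is left.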

Javacc Regular expression to match particular type of string?

I am trying to write the regular expression for a token so that if the string " 123 " ' " is passed, the resulting string is 123 " '
Since my current regex is
<T_STRING:
"\"" (~["\""]|"^\"")* "\""
| "'"(~["'"]|"^'" )* "'"
>
I am not able to get output like 123 " '. Instead, it only detects the first part, "123"; the rest, ' ", is ignored.
The question is: how can I write the regex for the token so that it gives me the desired output?